-------------------------------------------------------------------------------
$Id: README.txt,v 1.8 2004/01/05 18:53:11 stack-sf Exp $
-------------------------------------------------------------------------------
Heritrix is the Internet Archive's open-source, extensible, web-scale,
archival-quality web crawler. See <http://crawler.archive.org> for project
details.
Webmasters! Heritrix is designed to respect the robots.txt exclusion
directives and META robots tags. If you notice our crawler behaving poorly,
please send us email at archive-crawler-agent *at* lists *dot* sourceforge
*dot* net.
Table of Contents
I. Before You Begin
a. Requirements
II. Getting Started
a. Building
b. Starting Heritrix
III. Configuration Files
a. System Properties
IV. Crawling
a. ARC Files
b. The Graphical User Interface
c. The Command Line Interface
V. License
I. Before You Begin
a. Requirements
i. Java Runtime Environment
The Heritrix crawler is implemented purely in Java. This means that the only
true requirement for running it is that you have a Java Runtime Environment
(JRE) installed. The Heritrix crawler makes use of 1.4 features so your JRE
must be at least of a 1.4.0 pedigree. The Sun Java runtimes for many platforms
are available at <http://java.com>.
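To confirm that the JRE on your path is recent enough before going further, a
small standalone check like the one below can help. It is only a sketch: the
class name JreVersionCheck is made up for illustration and is not part of
Heritrix. It reads the standard java.specification.version system property and
compares it against 1.4.

```java
public class JreVersionCheck {
    // Returns true if a spec version string such as "1.4", "1.8" or "17"
    // denotes Java 1.4 or later. Hypothetical helper, not part of Heritrix.
    public static boolean atLeast14(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major > 1 || (major == 1 && minor >= 4);
    }

    public static void main(String[] args) {
        // Standard system property reporting the Java specification version.
        String version = System.getProperty("java.specification.version");
        if (atLeast14(version)) {
            System.out.println("JRE " + version + " satisfies the 1.4 requirement");
        } else {
            System.out.println("JRE " + version + " is too old; 1.4 or later is required");
        }
    }
}
```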
ii. Linux
The Heritrix crawler has been primarily built and tested on Linux. It may
perform acceptably elsewhere due to Java portability.
iii. Building
You can build Heritrix from source using Ant or Maven. The Maven build is more
comprehensive and will generate everything from either the packaged source or a
CVS checkout. The Ant build is less complete in that it doesn't generate the
distribution documentation, but it does produce everything else needed to run
Heritrix.
If you are building Heritrix w/ Ant, you must have Ant installed. You can get
Ant here: <http://ant.apache.org/>. Our build used Ant 1.5.x. If you want to
run the Heritrix unit tests from Ant, you will have to make sure the Ant
optional.jar file sits beside the junit.jar. See
<http://ant.apache.org/manual/OptionalTasks/junit.html> for how to set up Ant
to run junit tests.
The Heritrix Maven build was developed using Maven 1.0-rc1. You can get Maven
from here: <http://maven.apache.org>.
II. Getting Started
There are three ways to obtain Heritrix:
(1) packaged binary download from:
<http://sourceforge.net/projects/archive-crawler>
(2) packaged source download from:
<http://sourceforge.net/projects/archive-crawler>
(3) checkout from CVS
cvs.sourceforge.net:/cvsroot/archive-crawler
The packaged binary is named heritrix-?.?.?.tar.gz or heritrix-?.?.?.zip and the
packaged source is named heritrix-?.?.?-src.tar.gz or heritrix-?.?.?-src.zip
where '?.?.?' is the current heritrix release version.
For how to get Heritrix from CVS, see
<http://sourceforge.net/cvs/?group_id=73833>. Be aware that anonymous access
does not give you the current HEAD but a snapshot that can at times be up to 24
hours behind current development.
a. Building
i. If you obtained packaged source, here is how you build w/ Ant:
% tar xfz heritrix-?.?.?-src.tar.gz
% cd heritrix-?.?.?
% $ANT_HOME/bin/ant dist
In the 'dist' subdir will be all you need to run the Heritrix crawler. To
learn more about the ant build, type 'ant -projecthelp'.
ii. To build a CVS source checkout w/ Maven:
$ cd CVS_CHECKOUT_DIR
$ $MAVEN_HOME/bin/maven dist
In the 'target/distribution' subdir, you will find packaged source and binary
builds. Run '$MAVEN_HOME/bin/maven -g' for other Maven possibilities.
b. Starting Heritrix
To run Heritrix, first do the following:
% export HERITRIX_HOME=/PATH/TO/BUILT/HERITRIX
...where $HERITRIX_HOME is the location of your built Heritrix (i.e. under the
'dist' dir if you built w/ Ant, or under the untarred binary
target/distribution/heritrix-?.?.?.tar.gz dir if you built w/ Maven, or under
the untarred heritrix-?.?.?.tar.gz if you pulled a packaged binary).
Next run:
% cd $HERITRIX_HOME
% chmod u+x $HERITRIX_HOME/bin/heritrix.sh
% $HERITRIX_HOME/bin/heritrix.sh --help
This should output something like the following:
Usage: java org.archive.crawler.Heritrix --help|-h
Usage: java org.archive.crawler.Heritrix --no-wui ORDER.XML
Usage: java org.archive.crawler.Heritrix [--port=PORT] \
[ORDER.XML [--start|--wait|--set]]
Options:
--help|-h Prints this message.
--no-wui Start crawler without a web User Interface.
--port PORT is port the web UI runs on. Default: 8080.
ORDER.XML The crawl to launch. Optional if '--no-wui' NOT specified.
--start Start crawling using specified ORDER.XML:
--wait Load job specified by ORDER.XML but do not start. Default.
--set Set specified ORDER.XML as the default.
The usage output mentions an ORDER.XML file. The ORDER.XML is the "master
config file". It specifies which modules will be used to process URIs, in which
order URIs will be processed, how and where files will be written to disk, how
"polite" the crawler should be, crawl limits, etc. The configuration system is
currently undergoing revision and the format of ORDER.XML will probably
change. In the meantime, the best thing to do is to copy an existing order.xml
file.
See under 'docs/example-settings/broad-crawl' for an up-to-date sample
configuration that does a broad crawl (If there is no 'docs/example-settings'
in your built distribution, see
<http://crawler.archive.org/docs/example-settings/broad-crawl/order.xml>).
Before you begin crawling you *MUST* at least change the default "User-Agent"
and "From" header fields in the order.xml (or via the administrative interface).
You should set these to something meaningful that allows administrators of
sites you'll be crawling to contact you. The software requires that User-Agent
value be of the form...
[name] (+[http-url])[optional-etc]
...where [name] is the crawler identifier and [http-url] is a URL giving more
information about your crawling efforts. If desired, additional info may be
placed after the close-parenthesis.
Also, the From value must be an email address.
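Before launching a crawl, it can be useful to sanity-check the two values
yourself. The sketch below approximates the form described above with regular
expressions. The class name CrawlIdentityCheck, both patterns, and the sample
values are assumptions made for illustration; they are not the checks Heritrix
itself performs, which may be stricter or looser.

```java
import java.util.regex.Pattern;

public class CrawlIdentityCheck {
    // Approximates "[name] (+[http-url])[optional-etc]".
    // Assumption for illustration only, not Heritrix's own pattern.
    private static final Pattern USER_AGENT =
        Pattern.compile("^[^()]+ \\(\\+https?://[^\\s()]+\\).*$");

    // Deliberately loose email shape: something@something.something
    private static final Pattern FROM =
        Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    public static boolean isValidUserAgent(String value) {
        return value != null && USER_AGENT.matcher(value).matches();
    }

    public static boolean isValidFrom(String value) {
        return value != null && FROM.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute your own contact details.
        String userAgent = "MyCrawler/1.0 (+http://example.org/crawl-info)";
        String from = "crawl-admin@example.org";
        System.out.println("User-Agent ok: " + isValidUserAgent(userAgent));
        System.out.println("From ok: " + isValidFrom(from));
    }
}
```

A value such as "MyCrawler" with no parenthesized URL is rejected by this
sketch, as is a From value without an '@'.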
(Please do not leave the Archive Open Crawler project's contact information in
these fields; we do not have the time or the resources to field complaints
about crawlers which we do not administer.)
Once you have an order.xml file edited to your liking you can run the crawler
either via the UI or without. Here is how you'd run it w/o going via the UI:
$ $HERITRIX_HOME/bin/heritrix.sh --no-wui order.xml
You should see output showing the crawler running. To monitor crawler
progress, tail the logs (see your order.xml for where you told the crawler to
write them).
To start the crawler w/ the UI enabled run the following:
$ $HERITRIX_HOME/bin/heritrix.sh
You should see output like the following:
14:11:10.415 EVENT Starting Jetty/4.2.15rc0
14:11:10.603 EVENT Checking Resource aliases
14:11:10.832 EVENT Started WebApplicationContext[/admin,Admin]
14:11:11.094 EVENT Started SocketListener on 0.0.0.0:8080
14:11:11.095 EVENT Started org.mortbay.jetty.Server@1f6f0bf
Heritrix is running
Web UI on port 8080
Browse to the Web UI to start a crawl and to load and configure crawl jobs.
Eventually this will be the preferred mechanism for configuring and running
crawls, but it is currently under regular revision.
III. Configuration
TODO: The configuration system is being revised at the moment. Meantime
study extant configurations at docs/example-settings.
a. System Properties
Below we document system properties passed on the command-line that can
influence Heritrix behavior.
i. heritrix.webapp.path
Path to webapp directory. Default: webapps. Set to src/webapps if you want to
run the webapp inside Eclipse, etc.
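As an illustration of how a property like this behaves, the sketch below reads
heritrix.webapp.path with the standard System.getProperty call and falls back
to the documented default of 'webapps'. The class name WebappPathProperty is
made up for this example; how Heritrix itself reads the property may differ.

```java
public class WebappPathProperty {
    // Returns the value of heritrix.webapp.path, or the documented
    // default "webapps" when the property was not set on the command line.
    public static String webappPath() {
        return System.getProperty("heritrix.webapp.path", "webapps");
    }

    public static void main(String[] args) {
        System.out.println("webapp path: " + webappPath());
    }
}
```

Running it as 'java -Dheritrix.webapp.path=src/webapps WebappPathProperty'
shows the overridden value; running it without the -D option shows the default.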
ii. heritrix.default.orderfile
Default order.xml file to us