INSTALLATION
- Unzip the distribution; its main directory is referred to below as "RoadRunner"
- Add to your CLASSPATH these libraries:
lib/roadrunner.jar
lib/nekohtml.jar
lib/xercesImpl.jar
lib/xmlParserAPIs.jar
In order to run RoadRunner you can either cd to the main RoadRunner directory
or set the system property "rr.home" to point to the main directory.
(Note: the system property "rr.home" can be set by adding the option
-Drr.home=path/to/directory/RoadRunner
to the JVM invocation)
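The two installation steps above can be sketched in bash as follows. This is only an illustration: RR_HOME and rr_classpath are hypothetical names that do not appear in the distribution, and the ":" separator assumes a Unix-like system.

```shell
#!/usr/bin/env bash
# RR_HOME is a hypothetical variable; point it at the unzipped RoadRunner directory.
RR_HOME="${RR_HOME:-$PWD/RoadRunner}"

# Join the four jars listed above into a single CLASSPATH value.
rr_classpath() {
  local home="$1" cp="" jar
  for jar in roadrunner.jar nekohtml.jar xercesImpl.jar xmlParserAPIs.jar; do
    cp="${cp:+$cp:}$home/lib/$jar"
  done
  printf '%s\n' "$cp"
}

CLASSPATH="$(rr_classpath "$RR_HOME")"
export CLASSPATH

# Print the invocation instead of running it, so the sketch is safe to try
# even where Java is not installed.
echo java "-Drr.home=$RR_HOME" roadrunner.Shell
```

On Windows the classpath separator is ";" rather than ":".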
USAGE
The main class is roadrunner.Shell. For help on its usage
type "java roadrunner.Shell" without args:
Usage: java roadrunner.Shell [-trace[level]] [-O<config.xml>]
[-N<name>]
( -Finputfiles.txt | file0 [file1 ...] )
-O<config.xml> a xml configuration file
-trace[level] enable tracing at log level <level>
-F<files.txt> files.txt lists the input filenames
-N<name> set a <name> for output (default "out"):
wrapper file(s): "<name>Wrapper<N>.xml"
data file(s): "<name>_DataSet<N>.xml"
RR places inferred wrappers and extracted datasets under
subdirectories of the "output" directory. Each subdirectory
is named after the argument following the option -N.
The option -O specifies an XML configuration file whose
syntax is briefly described in docs/PROPERTIES.TXT
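As a sketch of the -F option, the following bash fragment builds a list of sample files that could then be passed to the Shell. It assumes that the list holds one filename per line, which this README does not state; the helper name write_file_list and the file name files.txt are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write every *_xy_.xhtml file in directory $1 to list file $2.
# Assumption (not confirmed by this README): the list holds one filename per line.
write_file_list() {
  local dir="$1" out="$2" f
  : > "$out"                      # start with an empty list
  for f in "$dir"/*_xy_.xhtml; do
    [ -e "$f" ] || continue       # skip the unexpanded pattern when nothing matches
    printf '%s\n' "$f" >> "$out"
  done
}

# Example use from the main RoadRunner directory (not run here):
#   write_file_list examples/www.overstock.com files.txt
#   java roadrunner.Shell -Noverstock \
#        -Oexamples/www.overstock.com/jewelry.xml -Ffiles.txt
```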
EXAMPLES
Three examples (bash syntax):
1) Soccer players
java roadrunner.Shell -Nplayers \
-Oexamples/fifaworldcup.yahoo.com/players.xml \
examples/fifaworldcup.yahoo.com/Cannavaro_xy_.xhtml \
examples/fifaworldcup.yahoo.com/Nesta_xy_.xhtml \
examples/fifaworldcup.yahoo.com/Zidane_xy_.xhtml
2) Jobs
java roadrunner.Shell -Nhotjobs \
-Oexamples/www.hotjobs.com/hotjobs.xml \
examples/www.hotjobs.com/ByLocation01_xy_.xhtml \
examples/www.hotjobs.com/ByLocation02_xy_.xhtml \
examples/www.hotjobs.com/ByLocation03_xy_.xhtml
3) Overstock Jewelry
java roadrunner.Shell -Noverstock \
-Oexamples/www.overstock.com/jewelry.xml \
examples/www.overstock.com/jewelry01_xy_.xhtml \
examples/www.overstock.com/jewelry02_xy_.xhtml \
examples/www.overstock.com/jewelry03_xy_.xhtml
After executing one of these examples, you will find a directory named "output",
whose subdirectories are named after the -N option of the command line:
"players" for the first example, "hotjobs" for the second and "overstock" for the last.
Therefore, the last example about jewelry produces the following files:
output/overstock/overstock00.xml (the wrapper inferred)
output/overstock/overstock0_DataSet.xml (the dataset extracted from input samples)
output/overstock/overstockWrappersIndex.xml (an index of all wrappers produced)
output/overstock/results.html (html file to display all results)
Open the file "results.html" with a CSS- and XSLT-capable browser (Mozilla 1.3 and
recent versions of IE seem to work) to view the data extracted by the
automatically generated wrapper.
USING ROADRUNNER ON NEW SAMPLES
RoadRunner works on well-formed documents. The input HTML pages are
pre-processed using the NekoHTML parser (http://www.apache.org/~andyc/neko/doc/html)
to produce DOM representations of input documents.
Note that in order to use the RR visual labelling feature, the input HTML document
must be annotated with HTML comments reporting the coordinates of the bounding box
of every text string in the visual rendering of the page. As an
example of what the RR Labeller expects, the following fragment of HTML code:
...<TR><TD><!--BB:10,10,60,20--> a text X</TD><TD><!--BB:70,10,120,30--> a text Y</TD></TR>...
states that the coordinates of the bounding box of the string "a text X" in the
visual rendering of the document are (minX=10, minY=10, maxX=60, maxY=20);
similarly for "a text Y". Currently we use the suffix "_xy_.xhtml" to mark
HTML files which have been processed in this way.
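To check that a sample carries these annotations, the bounding-box comments can be pulled out with standard tools. The sketch below is not part of RoadRunner; extract_bboxes is a hypothetical helper that assumes the comments follow exactly the integer "BB:minX,minY,maxX,maxY" form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print one line per
# <!--BB:minX,minY,maxX,maxY--> comment, as "minX minY maxX maxY".
extract_bboxes() {
  grep -oE '<!--BB:[0-9]+,[0-9]+,[0-9]+,[0-9]+-->' |
    sed -E 's/<!--BB:([0-9]+),([0-9]+),([0-9]+),([0-9]+)-->/\1 \2 \3 \4/'
}

# Applied to the fragment above, this prints:
#   10 10 60 20
#   70 10 120 30
printf '%s\n' '<TR><TD><!--BB:10,10,60,20--> a text X</TD><TD><!--BB:70,10,120,30--> a text Y</TD></TR>' |
  extract_bboxes
```

A file that prints nothing here has no annotations in this form and would presumably not work with the labelling feature.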
REFERENCES
The general ideas underlying the RoadRunner Project can be found in the papers:
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards
Automatic Data Extraction from Large Web Sites. Proc. of 27th International
Conference on Very Large Databases (VLDB 2001): 109-118
http://www.vldb.org/conf/2001/P109.pdf
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: Automatic Annotation of
Data Extracted from Large Web Sites. SIGMOD Conference 2003, WebDB Workshop.
http://doi.acm.org/10.1145/643477.643480
Valter Crescenzi, Giansalvatore Mecca: On Automatic Information Extraction from
Large Web Sites. Dip. Informatica ed Automazione Technical Report dia-76-2003.
http://www.dia.uniroma3.it/db/roadRunner/publications/dia-76-2003.ps.gz
THIRD-PARTY SOFTWARE INCLUDED
* Xerces Java Parser (see license file: license/Apache_Software_LICENSE.TXT)
* NekoHTML HTML scanner (see license file: license/NekoHTML_LICENSE.TXT)
This product includes software developed by:
- Andy Clark (http://www.apache.org/~andyc/neko/doc/html)
- The Apache Software Foundation (http://www.apache.org/).
These packages are used to produce DOM representations of the input HTML pages.