JSpider
User Manual
version 0-5-0-dev
http://j-spider.sourceforge.net
JSpider 0-5-0-dev User Manual
http://j-spider.sourceforge.net 2/121
OVERVIEW............................................................................ 9
I. INTRODUCTION .........................................................11
A. What is JSpider?.......................................................11
B. Definition of terms....................................................11
C. License....................................................................11
D. What can I do?.........................................................12
1. Using JSpider........................................................................................12
2. Giving feedback....................................................................................12
3. Posting on mailing lists.........................................................................12
4. Forums..................................................................................................12
5. Reporting bugs......................................................................................12
6. Submitting feature requests...................................................................13
7. Submitting patches................................................................................13
II. CONCEPTS ............................................................14
A. JSpider global design ................................................14
1. Main components..................................................................................14
2. JSpider engine core...............................................................................15
3. SPI components ....................................................................................15
Rules ........................................................................................................15
Plugins......................................................................................................16
Event Filters .............................................................................................16
4. API components....................................................................................16
Object model ............................................................................................17
Event system.............................................................................................17
B. JSpider applications ..................................................18
1. JSpider application................................................................................18
2. JSpider-tool...........................................................................................18
C. Event system...........................................................20
1. Types of events.....................................................................................20
2. Event Dispatching.................................................................................20
3. Event list...............................................................................................22
D. Object model ...........................................................22
1. Sites......................................................................................................23
2. Resources..............................................................................................24
E. Spidering process........................................................26
INSTALLATION ................................................................... 27
III. PREREQUISITES ......................................................29
IV. BINARY INSTALLATION ..............................................30
A. Downloading............................................................30
B. Unpacking ...............................................................30
C. Basic configuration....................................................30
D. Testing....................................................................31
V. BUILDING FROM CVS................................................33
A. Setting the CVSROOT................................................33
B. Checking out............................................................34
C. Basic configuration (optional).....................................34
D. Building from source .................................................34
E. Running the test suite..................................................36
F. Using JSpider..............................................................37
VI. FOLDER OVERVIEW ..................................................38
USAGE ................................................................................ 41
VII. STARTING JSPIDER ................................................43
A. Windows..................................................................43
B. Unix........................................................................43
C. Configurations..........................................................44
VIII. SCENARIO: CHECKING A SITE FOR ERRORS ....................45
A. Goal........................................................................45
B. Configuration ...........................................................45
1. Global configuration .............................................................................46
Proxy configuration ..................................................................................46
Other ........................................................................................................46
2. Per-site configuration............................................................................47
Site Configuration ‘base’ ..........................................................................47
Site Configuration ‘default’ ......................................................................48
3. Plugin Configuration.............................................................................49
Console plugin..........................................................................................50
Filewriter plugin .......................................................................................50
StatusBasedFileWriter Plugin ...................................................................51
C. Example..................................................................51
1. Console output......................................................................................52
2. 404.out..................................................................................................55
3. Error-report.out.....................................................................................55
IX. SCENARIO: DOWNLOADING A SITE TO LOCAL DISK ..............56
A. Goal........................................................................56
B. Configuration ...........................................................56
1. Global configuration .............................................................................56
2. Site-specific configurations...................................................................57
Site configuration ‘base’...........................................................................57
Site configuration ‘skip’............................................................................57
C. Example..................................................................58
D. Sample output..........................................................58
X. SCENARIO: PLAYING AROUND WITH JSPIDER ....................59
A. The default configuration...........................................59
1. Configuration........................................................................................59
2. Starting .................................................................................................59
3. Output...................................................................................................59
B. Forgetting about robots.txt ........................................61
C. Going not too deep ...................................................62
XI. USING JSPIDER-TOOL ...............................................65
A. Usage .....................................................................65
B. Tools.......................................................................66
1. headers..................................................................................................66
2. info.......................................................................................................66
3. fetch......................................................................................................67
4. download..............................................................................................67
5. findlinks................................................................................................68
6. email.....................................................................................................68
CONFIGURATION................................................................ 69
XII. ENVIRONMENT .....................................................71
A. Java 1.3: XML parser Configuration.............................72
B. JSPIDER_HOME env. variable.....................................73
XIII.
C
ONFIGURATION OVERVIEW
......................................74
A. Common configuration ..............................................74
B. General configuration................................................74
C. Per-site configuration................................................74
XIV. COMMON CONFIGURATION........................................76
A. Logging subsystem ...................................................77
1. Logged items ........................................................................................77
2. Configuration........................................................................................77
3. Using Log4j..........................................................................................78
Adapting the log4j configuration...............................................................78
Configuration change example..................................................................78
4. Using JDK 1.4 logging..........................................................................80
XV.
G
ENERAL CONFIGURATION
..........................................81
A. The ‘default’ configuration..........................................81
B. Other configurations .................................................82
C. Configuration Files....................................................84
D. jspider.properties .....................................................85
1. Proxy settings .......................................................................................85
2. Threading..............................................................................................86
3. User Agent............................................................................................86
4. Rules.....................................................................................................87
5. Storage..................................................................................................91
XVI. P
ER
-
SITE CONFIGURATIONS
......................................93
A. sites.properties ........................................................94
B. Site-specific configuration files ...................................95
1. Site handling.........................................................................................95
2. Robots.txt .............................................................................................95
3. Throttling..............................................................................................96
4. Proxy ....................................................................................................98
5. User Agent............................................................................................98
6. Cookies.................................................................................................98
7. Rules.....................................................................................................99
XVII.
P
LUGIN CONFIGURATION
........................................101
A. Plugin.properties ....................................................101
1. Global event filtering ..........................................................................101
2. Plugin definition .................................................................................102
B. Plugin configuration files..........................................103
1. Plugin implementation class................................................................103