perl&lwp.pdf
Uploaded: 2010-06-16 13:32:01
Format: PDF, 3.02 MB, 318 pages
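LWP (Library for WWW in Perl): a Perl toolkit for fetching web pages. Converted from the original CHM to PDF for easier printing, with some redundant material removed. The appendices and index follow at the end.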
Preface
I
by Sean M. Burke
ISBN 0-596-00178-9
First Edition, published June 2002.
(See the catalog page for this book.)
Preface .............................................................................................................................................. 1
0.1. Audience for This Book ...................................................................................................... 1
0.2. Structure of This Book ........................................................................................................ 2
0.3. Order of Chapters ............................................................................................................... 3
0.4. Important Standards Documents ....................................................................................... 3
0.5. Conventions Used in This Book .......................................................................................... 4
0.6. Comments & Questions ..................................................................................................... 4
0.7. Acknowledgments .............................................................................................................. 5
Chapter 1. Introduction to Web Automation .................................................................................... 6
1.1. The Web as Data Source .................................................................................................... 6
1.1.1. Screen Scraping ....................................................................................................... 7
1.1.2. Brittleness ............................................................................................................... 8
1.1.3. Web Services ........................................................................................................... 8
1.2. History of LWP .................................................................................................................... 9
1.3. Installing LWP ................................................................................................................... 10
1.3.1. Installing LWP from the CPAN Shell ....................................................................... 11
1.3.1.1. Configuring ......................................................................................................... 11
1.3.1.2. Obtaining help .................................................................................................... 11
1.3.1.3. Installing LWP ..................................................................................................... 12
1.3.2. Installing LWP Manually ........................................................................................ 12
1.3.2.1. Download distributions ...................................................................................... 13
1.3.2.2. Unpack and configure ........................................................................................ 13
1.3.2.3. Make, test, and install ........................................................................................ 14
1.4. Words of Caution ............................................................................................................. 15
1.4.1. Network and Server Load ...................................................................................... 15
1.4.2. Copyright ............................................................................................................... 16
1.4.3. Acceptable Use ...................................................................................................... 17
1.5. LWP in Action ................................................................................................................... 17
1.5.1. The Object-Oriented Interface .............................................................................. 18
1.5.2. Forms ..................................................................................................................... 18
1.5.3. Parsing HTML ......................................................................................................... 19
1.5.4. Authentication ...................................................................................................... 21
Chapter 2. Web Basics ..................................................................................................................... 21
2.1. URLs .................................................................................................................................. 22
2.2. An HTTP Transaction ........................................................................................................ 24
2.2.1. Request .................................................................................................................. 25
2.2.2. Response ............................................................................................................... 26
2.3. LWP::Simple ..................................................................................................................... 27
2.3.1. Basic Document Fetch ........................................................................................... 27
2.3.2. Fetch and Store ..................................................................................................... 27
2.3.3. Fetch and Print ...................................................................................................... 28
2.3.4. Previewing with HEAD ........................................................................................... 29
2.4. Fetching Documents Without LWP::Simple ..................................................................... 31
2.5. Example: AltaVista ........................................................................................................... 33
2.6. HTTP POST ........................................................................................................................ 35
2.7. Example: Babelfish ........................................................................................................... 37
Chapter 3. The LWP Class Model .................................................................................................... 39
3.1. The Basic Classes .............................................................................................................. 40
3.2. Programming with LWP Classes ....................................................................................... 41
3.3. Inside the do_GET and do_POST Functions ..................................................................... 42
3.4. User Agents ...................................................................................................................... 43
3.4.1. Connection Parameters ......................................................................................... 44
3.4.2. Request Parameters .............................................................................................. 45
3.4.3. Protocols ............................................................................................................... 46
3.4.4. Redirection ............................................................................................................ 47
3.4.5. Authentication ...................................................................................................... 48
3.4.6. Proxies ................................................................................................................... 48
3.4.7. Request Methods .................................................................................................. 49
3.4.7.1. Saving response content to a file ....................................................................... 50
3.4.7.2. Sending response content to a callback ............................................................. 51
3.4.7.3. Mirroring a URL to a file ..................................................................................... 52
3.4.8. Advanced Methods ............................................................................................... 53
3.5. HTTP::Response Objects .................................................................................................. 54
3.5.1. Status Line ............................................................................................................. 54
3.5.2. Content .................................................................................................................. 55
3.5.3. Headers ................................................................................................................. 56
3.5.4. Expiration Times .................................................................................................... 57
3.5.5. Base for Relative URLs ........................................................................................... 58
3.5.6. Debugging ............................................................................................................. 59
3.6. LWP Classes: Behind the Scenes ....................................................................................... 60
Chapter 4. URLs ............................................................................................................................... 60
4.1. Parsing URLs ..................................................................................................................... 60
4.1.1. Constructors .......................................................................................................... 61
4.1.2. Output ................................................................................................................... 63
4.1.3. Comparison ........................................................................................................... 63
4.1.4. Components of a URL ............................................................................................ 64
4.1.5. Queries .................................................................................................................. 66
4.2. Relative URLs .................................................................................................................... 67
4.3. Converting Absolute URLs to Relative .............................................................................. 69
4.4. Converting Relative URLs to Absolute .............................................................................. 70
Chapter 5. Forms ............................................................................................................................. 71
5.1. Elements of an HTML Form .............................................................................................. 72
5.2. LWP and GET Requests ..................................................................................................... 73
5.2.1. GETting Fixed URLs ................................................................................................ 73
5.2.2. GETting a query_form( ) URL ................................................................................ 74
5.3. Automating Form Analysis ............................................................................................... 76
5.4. Idiosyncrasies of HTML Forms .......................................................................................... 78
5.4.1. Hidden Elements ................................................................................................... 79
5.4.2. Text Elements ........................................................................................................ 79
5.4.3. Password Elements ............................................................................................... 79
5.4.4. Checkboxes ............................................................................................................ 79
5.4.5. Radio Buttons ........................................................................................................ 80
5.4.6. Submit Buttons ...................................................................................................... 81
5.4.7. Image Buttons ....................................................................................................... 82
5.4.8. Reset Buttons ........................................................................................................ 83
5.4.9. File Selection Elements ......................................................................................... 83
5.4.10. Textarea Elements ............................................................................................... 84
5.4.11. Select Elements and Option Elements ................................................................ 84
5.5. POST Example: License Plates .......................................................................................... 86
5.5.1. The Form ............................................................................................................... 87
5.5.2. Use formpairs.pl .................................................................................................... 88
5.5.3. Translating This into LWP ...................................................................................... 88
5.6. POST Example: ABEBooks.com ......................................................................................... 90
5.6.1. The Form ............................................................................................................... 92
5.6.2. Translating This into LWP ...................................................................................... 92
5.6.3. Adding Features ..................................................................................................... 93
5.6.4. Generalizing the Program ...................................................................................... 96
5.7. File Uploads ...................................................................................................................... 98
5.8. Limits on Forms .............................................................................................................. 102
Chapter 6. Simple HTML Processing with Regular Expressions ..................................................... 103
6.1. Automating Data Extraction ........................................................................................... 103
6.2. Regular Expression Techniques ...................................................................................... 105
6.2.1. Anchor Your Match .............................................................................................. 106
6.2.2. Whitespace .......................................................................................................... 106
6.2.3. Embedded Newlines ............................................................................................ 106
6.2.4. Minimal and Greedy Matches ............................................................................. 107
6.2.5. Capture ................................................................................................................ 107
6.2.6. Repeated Matches ............................................................................................... 108
6.2.7. Develop from Components ................................................................................. 108
6.2.8. Use Multiple Steps .............................................................................................. 109
6.3. Troubleshooting ............................................................................................................. 110
6.4. When Regular Expressions Aren't Enough ..................................................................... 112
6.5. Example: Extracting Links from a Bookmark File ........................................................... 113
6.6. Example: Extracting Links from Arbitrary HTML ............................................................ 116
6.7. Example: Extracting Temperatures from Weather Underground ................................... 118
Chapter 7. HTML Processing with Tokens ..................................................................................... 120
7.1. HTML as Tokens .............................................................................................................. 121
7.2. Basic HTML::TokeParser Use .......................................................................................... 122
7.2.1. Start-Tag Tokens .................................................................................................. 123
7.2.2. End-Tag Tokens .................................................................................................... 124
7.2.3. Text Tokens .......................................................................................................... 124
7.2.4. Comment Tokens ................................................................................................. 125
7.2.5. Markup Declaration Tokens ................................................................................. 125
7.2.6. Processing Instruction Tokens ............................................................................. 126
7.3. Individual Tokens ............................................................................................................ 126
7.3.1. Checking Image Tags ........................................................................................... 127
7.3.2. HTML Filters ........................................................................................................ 127
7.4. Token Sequences ............................................................................................................ 128
7.4.1. Example: BBC Headlines ...................................................................................... 129
7.4.2. Translating the Problem into Code ...................................................................... 130
7.4.3. Bundling into a Program ...................................................................................... 132
7.5. More HTML::TokeParser Methods ................................................................................. 135
7.5.1. The get_text( ) Method ....................................................................................... 135
7.5.2. The get_text( ) Method with Parameters ............................................................ 136
7.5.3. The get_trimmed_text( ) Method ....................................................................... 137
7.5.4. The get_tag( ) Method ........................................................................................ 138
7.5.4.1. Start-tags .......................................................................................................... 139
7.5.4.2. End-tags ............................................................................................................ 140
7.5.5. The get_tag( ) Method with Parameters ............................................................. 140
7.6. Using Extracted Text ....................................................................................................... 141
Chapter 8. Tokenizing Walkthrough .............................................................................................. 143
8.1. The Problem ................................................................................................................... 143
8.2. Getting the Data ............................................................................................................. 144
8.3. Inspecting the HTML ...................................................................................................... 145
8.4. First Code ....................................................................................................................... 147
8.5. Narrowing In ................................................................................................................... 148
8.6. Rewrite for Features ....................................................................................................... 150
8.6.1. Debuggability ...................................................................................................... 151
8.6.2. Images and Applets ............................................................................................. 154
8.6.3. Link Text ............................................................................................................... 155
8.6.4. Live Data .............................................................................................................. 156
8.7. Alternatives .................................................................................................................... 158
Chapter 9. HTML Processing with Trees ........................................................................................ 158
9.1. Introduction to Trees ...................................................................................................... 159
9.2. HTML::TreeBuilder ......................................................................................................... 160
9.2.1. Constructors ........................................................................................................ 161
9.2.2. Parse Options ...................................................................................................... 162
9.2.3. Parsing ................................................................................................................. 163
9.2.4. Cleanup ............................................................................................................... 164
9.3. Processing ...................................................................................................................... 164
9.3.1. Methods for Searching the Tree .......................................................................... 164
9.3.2. Attributes of a Node ............................................................................................ 165
9.3.3. Traversing ............................................................................................................ 167
9.4. Example: BBC News ........................................................................................................ 170
9.5. Example: Fresh Air .......................................................................................................... 174
Chapter 10. Modifying HTML with Trees ...................................................................................... 177
10.1. Changing Attributes ...................................................................................................... 178
10.1.1. Whitespace ........................................................................................................ 180
10.1.2. Other HTML Options ......................................................................................... 181
10.2. Deleting Images ............................................................................................................ 182
10.3. Detaching and Reattaching .......................................................................................... 184
10.3.1. The detach_content( ) Method ......................................................................... 186
10.3.2. Constraints ........................................................................................................ 187
10.4. Attaching in Another Tree ............................................................................................ 187
10.4.1. Retaining Comments ......................................................................................... 188
10.4.2. Accessing Comments ......................................................................................... 189
10.4.3. Attaching Content ............................................................................................. 190
10.5. Creating New Elements ................................................................................................ 194
10.5.1. Literals ............................................................................................................... 195
10.5.2. New Nodes from Lists ....................................................................................... 196
Chapter 11. Cookies, Authentication, and Advanced Requests .................................................... 198
11.1. Cookies ......................................................................................................................... 198
11.1.1. Enabling Cookies ............................................................................................... 200
11.1.2. Loading Cookies from a File .............................................................................. 200
11.1.3. Saving Cookies to a File ..................................................................................... 201
11.1.4. Cookies and the New York Times Site ............................................................... 201
11.2. Adding Extra Request Header Lines ............................................................................. 202
11.2.1. Pretending to Be Netscape ................................................................................ 204
11.2.2. Referer ............................................................................................................... 206
11.3. Authentication ............................................................................................................. 207
11.3.1. Comparing Cookies with Basic Authentication ................................................. 207
11.3.2. Authenticating via LWP ..................................................................................... 208
11.3.3. Security ............................................................................................................. 209
11.4. An HTTP Authentication Example: The Unicode Mailing Archive ................................ 209
Chapter 12. Spiders ....................................................................................................................... 212
12.1. Types of Web-Querying Programs ................................................................................ 212
12.2. A User Agent for Robots ............................................................................................... 214
12.3. Example: A Link-Checking Spider ................................................................................. 216
12.3.1. The Basic Spider Logic ....................................................................................... 216
12.3.2. Overall Design in the Spider .............................................................................. 218
Uploader: athing