Python Web Crawler
Contents
Python Web Crawler
1 What fetching a web page means and the basic structure of a URL
1.1 Definition of a web crawler
1.2 The process of browsing a web page
1.3 The concept of a URI, with examples
1.3.1 What is a URI?
1.3.2 Components of a URI
1.4 Understanding URLs, with examples
1.4.1 Example URLs for the HTTP protocol
1.4.2 File URLs
2 Using urllib2 to fetch page content from a given URL
2.1 Sending form data (data)
2.2 Setting Headers on the HTTP request
3 Exception handling and the classes of HTTP status codes
3.1 URLError
3.2 HTTPError
3.3 Wrapping
4 Introduction to Openers and Handlers, with worked examples
4.1 geturl()
4.2 info()
4.2.1 Openers
4.2.2 Handlers
5 urllib2 usage details and site-scraping techniques
5.1 Proxy settings
5.2 Adding a specific Header to an HTTP Request
5.3 Redirect
5.4 Cookie
5.5 Using the HTTP PUT and DELETE methods
5.6 Getting the HTTP return code
5.7 Debug Log
5.8 Form handling