Python programming — text and web mining
Finn Årup Nielsen
DTU Compute
Technical University of Denmark
September 17, 2013
Overview
Get the stuff: Crawling, search
Converting: HTML processing/stripping, format conversion
Tokenization, identifying and splitting words and sentences.
Word normalization, finding the stem of the word, e.g., “talked” → “talk”
Text classification (supervised), e.g., spam detection.
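As a rough illustration of the tokenization and word normalization steps listed above, here is a minimal Python 3 sketch. The function names tokenize and naive_stem are hypothetical, and the suffix stripping is a toy stand-in chosen for illustration, not a real stemming algorithm.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens with a simple regular expression."""
    return re.findall(r"[a-z0-9']+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes (toy example, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("She talked while he talks.")
stems = [naive_stem(token) for token in tokens]
```

Here "talked" and "talks" both reduce to "talk"; a production system would more likely rely on an established stemmer or lemmatizer.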
Web crawling issues
Honor robots.txt, the file on the Web server that describes what you
are and are not allowed to crawl.
Tell the Web server who you are.
Handle errors and warnings gracefully, e.g., the 404 (“Not found”).
Don’t overload the Web server you are downloading from, especially if
you do it in parallel.
Consider parallel download for large-scale crawling.
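Two of the points above, graceful error handling and not overloading the server, can be sketched as follows in Python 3. The function polite_fetch, its parameters and the one-second default delay are assumptions made for illustration; the fetch and sleep arguments exist only so the loop can be tried without network access.

```python
import time
import urllib.error
import urllib.request


def polite_fetch(urls, delay=1.0, fetch=None, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between requests and recording HTTP errors."""
    if fetch is None:
        # Default fetcher: a plain blocking download with a timeout.
        def fetch(url):
            return urllib.request.urlopen(url, timeout=10).read()

    pages = {}   # URL -> downloaded bytes
    errors = {}  # URL -> HTTP status code, e.g. 404
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait between requests to limit load on the server
        try:
            pages[url] = fetch(url)
        except urllib.error.HTTPError as err:
            errors[url] = err.code  # record the failure and carry on
    return pages, errors
```

A call such as polite_fetch(["http://neuro.compute.dtu.dk"]) would return the downloaded pages together with any HTTP error codes, instead of stopping at the first failure.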
Crawling restrictions in robots.txt
Example robots.txt on http://neuro.compute.dtu.dk with rule:
Disallow: /wiki/Special:Search
Meaning http://neuro.compute.dtu.dk/wiki/Special:Search should not be
crawled.
Python module robotparser (urllib.robotparser in Python 3) for handling rules:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://neuro.compute.dtu.dk/robots.txt")
>>> rp.read() # Reads the robots.txt
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/wiki/Special:Search")
False
>>> rp.can_fetch("*", "http://neuro.compute.dtu.dk/movies/")
True
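For Python 3, where the module lives at urllib.robotparser, a minimal offline sketch of the same check could look like this. The rule is parsed from a string instead of being downloaded; the "User-agent: *" line and the helper name make_parser are assumptions, since the slide shows only the Disallow rule.

```python
from urllib.robotparser import RobotFileParser

# Disallow rule from the slide; the "User-agent: *" line is assumed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:Search
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt content given as a string (no download)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = make_parser(ROBOTS_TXT)
search_allowed = rp.can_fetch("*", "http://neuro.compute.dtu.dk/wiki/Special:Search")
movies_allowed = rp.can_fetch("*", "http://neuro.compute.dtu.dk/movies/")
print(search_allowed, movies_allowed)  # False True
```

Parsing from a string keeps the example independent of the live site; against a real server, set_url() followed by read() would be used as in the session above.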
Tell the Web server who you are
Use the urllib2 module (urllib.request in Python 3) to set the User-agent header of the HTTP request:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [("User-agent", "fnielsenbot/0.1 (Finn A. Nielsen)")]
response = opener.open("http://neuro.compute.dtu.dk")
This will give the following entry (here split into two lines) in the Apache
Web server log (/var/log/apache2/access.log):
130.225.70.226 - - [31/Aug/2011:15:55:28 +0200]
"GET / HTTP/1.1" 200 6685 "-" "fnielsenbot/0.1 (Finn A. Nielsen)"
This allows a Web server administrator to identify you, and to block you if
you put too much load on the Web server.
See also (Pilgrim, 2004, section 11.5) “Setting the User-Agent”.
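In Python 3 the same opener interface is available from urllib.request. The following is a minimal sketch under that assumption; the helper name build_identified_opener is hypothetical, and the actual request is left commented out so the snippet runs without network access.

```python
import urllib.request

# User-agent string from the slide.
USER_AGENT = "fnielsenbot/0.1 (Finn A. Nielsen)"


def build_identified_opener(user_agent):
    """Return an opener that sends the given User-agent with every request."""
    opener = urllib.request.build_opener()
    # Replaces the default headers, which contain only Python's own User-agent.
    opener.addheaders = [("User-agent", user_agent)]
    return opener


opener = build_identified_opener(USER_AGENT)
# response = opener.open("http://neuro.compute.dtu.dk")  # performs the HTTP request
```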