#!/usr/bin/env python
# -*- coding: utf-8 -*-
from sys import argv
from os import chdir
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO
from time import sleep
import JavaGroupContent
import XMLLoader
import pickle
import re

class Retriever(object):                 # download Web pages
    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url):
        # map the URL to a flat local file name
        path = re.sub(r"\W", "_", url)
        return path + ".html"

    def isForbidden(self):
        # placeholder: no robots.txt / forbidden-URL check is implemented yet
        return 0

    def download(self):
        try:
            if not self.isForbidden():
                retval = urlretrieve(self.url, self.file)
                # hand the saved page over to the external JavaGroupContent processor
                javaGroupContent = JavaGroupContent.JavaGroupContent()
                javaGroupContent.meet_page(self.url, self.file)
            else:
                retval = ('*** INFO: no need to download %s' % self.url,)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        return retval

    def parseAndGetLinks(self):          # parse the saved HTML, collect its links
        self.parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        try:
            self.parser.feed(open(self.file).read())
            self.parser.close()
        except IOError:
            pass
        return self.parser.anchorlist
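
# For example, Retriever('http://example.com/a/b.html').file becomes
# 'http___example_com_a_b_html.html' (the URL here is only illustrative):
# every non-word character in the URL is replaced with '_' and '.html' is
# appended, so each page is saved as a flat file in the working directory.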

class Crawler(object):
    count = 0                            # class-level count of pages downloaded so far

    def __init__(self, para_queue, para_seen, para_config):
        self.q = para_queue              # URLs still to be visited
        self.seen = para_seen            # URLs already downloaded
        self.config = para_config        # download/ignore patterns and wait time

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0].startswith('*'):    # error or skipped page: do not parse
            return
        Crawler.count += 1
        print '\nPage: ', Crawler.count
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)
        links = r.parseAndGetLinks()
        for eachLink in links:
            # turn relative links into absolute ones
            if eachLink[:4] != 'http' and eachLink.find('://') == -1:
                eachLink = urljoin(url, eachLink)
            if eachLink.lower().find('mailto:') != -1:
                continue
            eachLink = eachLink.split('#', 1)[0]   # drop the fragment part
            if eachLink not in self.seen and eachLink not in self.q:
                self.q.append(eachLink)

    def go(self):
        while self.q:
            url = self.q.pop()
            print url
            downloadtag = 0
            ignoretag = 0
            for downloadpattern in self.config.downloads:
                downloadtag += url.count(downloadpattern)
            for ignorepattern in self.config.ignores:
                ignoretag += url.count(ignorepattern)
            # fetch only URLs matching exactly one download pattern and no ignore pattern
            if downloadtag == 1 and ignoretag == 0:
                print "Getting " + url
                self.getPage(url)
                waittime = self.config.downloadWait
                for i in range(int(waittime)):     # be polite: pause between downloads
                    print i + 1,
                    sleep(1)
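
# For example, with config.downloads = ['webdev.csdn.net/page'] and
# config.ignores = ['mailto'] (illustrative values -- the real lists come from
# the per-site XML config), go() fetches a popped URL only when it contains
# exactly one download pattern and no ignore pattern, then waits
# config.downloadWait seconds before handling the next URL.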

def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter a site name: ')
        except (KeyboardInterrupt, EOFError):
            url = ''
    robot = None
    try:
        chdir("../" + url)                      # work inside the per-site directory
        configFileURL = url + ".xml"
        config = XMLLoader.XMLLoader(configFileURL)
        startUrl = config.startURL
        # hard-coded start page; overrides the value read from the config file
        startUrl = 'http://webdev.csdn.net/page/'
        queue = []
        seen = []
        try:
            # resume a previous crawl if its queue was saved
            queue = pickle.load(open("queue.txt"))
        except (IOError, EOFError):
            queue.append(startUrl)
        try:
            seen = pickle.load(open("seen.txt"))
        except (IOError, EOFError):
            seen = []
        robot = Crawler(queue, seen, config)
        robot.go()
    finally:
        # persist the crawl state so the next run can resume where this one stopped
        if robot is not None:
            pickle.dump(robot.q, open("queue.txt", "w"))
            pickle.dump(robot.seen, open("seen.txt", "w"))

if __name__ == '__main__':
    main()
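
# Usage sketch (inferred from the code above): run the script with the site
# name as the first command-line argument, or type it at the prompt.  It
# chdir()s into ../<sitename>/ and loads <sitename>.xml through XMLLoader;
# the resulting config object is expected to expose startURL, downloads
# (URL substrings worth fetching), ignores (URL substrings to skip) and
# downloadWait (seconds to pause after each download).  Crawl progress is
# pickled to queue.txt and seen.txt so an interrupted run can be resumed.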