#!/usr/bin/env python
# -*- coding: utf-8 -*-
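"""Scrape book listings from the six tag groups on https://book.douban.com through
a local Tor proxy, collect title, author, score, number of raters and publisher for
each book, sort every category by score and write the results to an .xls workbook."""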
import os
import random
import re
import time

import requests
import xlwt
from bs4 import BeautifulSoup
url0 = 'https://book.douban.com'  # base URL, prepended later to relative links
library, library_data, popular, popular_data, culture, culture_data = ([], [], [], [], [], [])
life, life_data, economy, economy_data, science, science_data = ([], [], [], [], [], [])
sheet_list = [library, popular, culture, life, economy, science]  # tag-URL lists, one per category
sheet_list_data = [library_data, popular_data, culture_data, life_data, economy_data, science_data]  # scraped rows, one list per category
num_list = 0  # index into sheet_list / sheet_list_data
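# A small pool of User-Agent strings; each request picks one at random so the
# traffic looks less like it comes from a single client.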
user_agent = ['Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
              'Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0',
              'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
              'Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0']
first_sum = time.time()    # time of the last Tor circuit renewal
second_time = time.time()  # used to report how long each page took
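# All traffic goes through a local Tor SOCKS5 proxy on port 9050; the control
# port 9051 is used below to request a fresh circuit (new exit IP).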
proxies = {'http':'socks5://127.0.0.1:9050', 'https':'socks5://127.0.0.1:9050'}
def getSoup(url):  # fetch url through the proxy and return its BeautifulSoup object
    global first_sum
    res = ''
    while res == '':
        try:
            # Every 5 minutes, ask the Tor control port for a fresh circuit (new exit IP)
            if time.time() - first_sum >= 5 * 60:
                os.system("""(echo authenticate '"mypassword"'; echo signal newnym; echo quit) | nc localhost 9051""")
                time.sleep(2)
                first_sum = time.time()
            headers = {'user-agent': random.choice(user_agent), 'Connection': 'close'}
            res = requests.get(url, headers=headers, proxies=proxies)
            res.encoding = 'utf-8'
            soup = BeautifulSoup(res.text, 'html.parser')
        except requests.exceptions.ProxyError as e:
            print type(e)
            # The proxy failed: request a new Tor circuit, wait, then retry
            os.system("""(echo authenticate '"mypassword"'; echo signal newnym; echo quit) | nc localhost 9051""")
            time.sleep(2)
        except Exception:
            raise
    return soup
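# Note: an equivalent way to request a new Tor identity is the optional `stem`
# package (a sketch, not used by this script, which shells out to nc instead):
#   from stem import Signal
#   from stem.control import Controller
#   with Controller.from_port(port=9051) as controller:
#       controller.authenticate(password='mypassword')
#       controller.signal(Signal.NEWNYM)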
def getFinal(list_name):  # crawl every page of every tag URL and store the scraped rows
    global second_time
    for urls in list_name:
        count = 0
        url = url0 + urls  # full URL of this tag's first page
        while url:  # walk through the pages of this tag
            print 'Elapsed:', time.time() - second_time
            second_time = time.time()
            soups = getSoup(url)
            getData(soups)
            #time.sleep(random.uniform(1, 3))  # optionally rest 1-3 s after each page
            if count == 10:  # take a longer break every 10 pages
                time.sleep(random.uniform(3, 4))
                count = 0
            url_next = soups.select('.next')  # "next page" element of the current page
            if url_next == []:
                break
            else:
                url_next_n = url_next[0].select('a')
                if url_next_n == []:
                    break
                else:
                    url = url_next_n[0]['href']
                    url = url0 + url
            count += 1
def getData(soup):  # extract the needed fields from one result page
    global num_errno
    r = r'[0-9]'
    r_press = r'出版社'
    infos = soup.select('.info')
    for info in infos:
        try:
            num_people = ''
            bookname = info.select('a')[0].text.strip().split()[0]
            pub = info.select('.pub')[0].text.strip().split('/')
            author = pub[0]
            press = '未显示'
            for i in pub:
                if re.findall(r_press, i) != []:  # the segment containing "出版社" is the publisher
                    press = i
                    break
            if info.select('.star')[0].select('span')[0].text == '':
                # A rating exists: the second span holds the score, the third the rater count
                score = info.select('.star')[0].select('span')[1].text
                nums = re.findall(r, info.select('.star')[0].select('span')[2].text.strip())
                for num in nums:
                    num_people = num_people + str(num)
            else:
                score = '评价人数不足'
                num_people = score
            print 'Scraped one record'
            sheet_list_data[num_list].append((bookname, author, score, num_people, press))
        except Exception:
            print 'Error!'  # this entry did not parse
            num_errno += 1
            if raw_input('Press Enter (or y) to continue') in ['', 'y']:
                pass
def writeExcel():  # write the scraped data into an .xls workbook
    wb = xlwt.Workbook(encoding='utf-8')
    style = xlwt.XFStyle()  # the font below is set so Chinese text is written correctly
    font = xlwt.Font()
    font.name = 'SimSun'  # use the SimSun typeface
    style.font = font
    sheet_name = ['library', 'popular', 'culture', 'life', 'economy', 'science']
    sheet = ['sheet1', 'sheet2', 'sheet3', 'sheet4', 'sheet5', 'sheet6']
    sheet_row_name = ['书名', '作者', '评分', '评分人数', '出版社']  # title, author, score, raters, publisher
    for i in range(6):  # add the six worksheets
        sheet[i] = wb.add_sheet(sheet_name[i])
    for i in sheet:  # write the header row of every worksheet
        for i_col in range(5):
            i.write(0, i_col, sheet_row_name[i_col], style)
    for i in range(6):  # write each category's rows into the matching worksheet
        row = 1
        for record in sheet_list_data[i]:
            for i_col in range(5):
                sheet[i].write(row, i_col, record[i_col], style)
            row += 1
    wb.save('myfile.xls')
def f(x):  # sort key: rank by score, treating "not enough raters" as 0
    if x[2] == '评价人数不足':
        return 0
    else:
        return float(x[2])  # the score is stored as a string, so convert before comparing
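# Main flow: collect the tag URLs, crawl each category, sort by score, write to Excel.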
num_errno = 0  # counts entries that failed to parse
requests.adapters.DEFAULT_RETRIES = 3
url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
soup = getSoup(url)
for labels in soup.select('.tagCol'):  # six tag groups; each link is one tag's relative URL
    for label in labels.select('a'):
        sheet_list[num_list].append(label['href'])
    num_list += 1
num_list = 0
for i in sheet_list:  # crawl each category and fill the matching data list
    getFinal(i)
    print 'Error count:', num_errno
    time.sleep(5)
    sheet_list_data[num_list] = sorted(sheet_list_data[num_list], key=f, reverse=True)  # highest score first
    num_list += 1
writeExcel()  # write everything to the Excel workbook