Python爬虫项目之爬取知乎数据.zip (Python crawler project: scraping Zhihu data) | Python crawler project resources | CSDN文库

17 files in total

py: 11

xml: 4

iml: 1

Points required: 1 · 191 views · Uploaded 2024-05-30 06:00:06 · 19 KB ZIP

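In Python programming, web crawling is an important technique for automatically collecting large amounts of information from the internet. This project focuses on using a Python crawler to collect data from Zhihu. Zhihu is a well-known Chinese Q&A community with a wealth of knowledge and opinion, which makes it a useful source for studying user behavior, trending topics, and online trends.

Python's concise syntax and rich library ecosystem make it a popular choice for crawler development. In this project, the requests library issues HTTP requests to fetch page content, and a parsing library such as BeautifulSoup or PyQuery then parses the HTML document and extracts the required data.

To crawl Zhihu, we first need to understand the structure of the site. Its pages are built with HTML and JavaScript, and some of the data may be loaded dynamically. In that case a browser automation tool such as Selenium may be needed to simulate user interaction and obtain the dynamically generated content.

Next comes login. Many sites, Zhihu included, require a logged-in user for certain content. A Session object from the requests library keeps the session state: it can simulate the login flow and retain the cookies issued after login, so that later requests carry those cookies and the crawler can reach restricted pages.

When parsing HTML, we need to locate the tags that hold the target data. For example, a question might sit in an `<h2>` tag and an answer in `<div class="CellBody">`. BeautifulSoup's select() and find_all() methods find such elements by CSS selector or tag name.

The crawler also has to handle pagination. Zhihu Q&A content usually spans several pages, so we need to find the element that links to the next page, such as `<a rel="next">`, and read its href attribute to get the next URL. Sometimes the paging information is hidden in JavaScript; in that case the re module can match it with a regular expression, or the json library can parse JSON data embedded in the HTML.

To respect the site's robots.txt rules and avoid having the IP address blocked, add a suitable delay between requests and set a User-Agent header so that requests look like they come from a real browser. Anti-crawling measures such as CAPTCHAs and IP rate limits also need to be considered, and these may call for more advanced approaches such as a pool of proxy IPs.

The crawled data usually has to be stored, whether in text files, CSV files, or a database. The pandas library makes it easy to structure the data and write it out in various formats. For larger crawls, a database such as MySQL or SQLite is usually the better fit.

For efficiency and maintainability, the Scrapy framework is recommended. Scrapy provides a complete project structure, including middlewares, a scheduler, and a downloader, which makes it easier to implement complex crawling logic and data-processing pipelines.

In summary, the key topics involved in crawling Zhihu data with Python are: Python fundamentals, the HTTP protocol, the requests library, HTML parsing (BeautifulSoup/PyQuery), handling dynamically loaded pages (Selenium), simulated login, paginated crawling, data storage (pandas/databases), anti-crawling countermeasures, and the Scrapy framework. With these skills you can build an efficient and stable Zhihu crawler.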
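Two of the steps described above, parsing JSON embedded in HTML with re and json and storing the result in SQLite, can be sketched with the standard library alone. This is only an illustration under stated assumptions: the sample page, the script id `js-initialData`, the `users` table, and the helper names `extract_embedded_json`, `save_users`, and `load_users` are all hypothetical and are not taken from the project or from Zhihu's actual markup.

```python
import json
import re
import sqlite3

# Hypothetical page with a JSON blob embedded in a <script> tag.
# The id "js-initialData" and the field names are illustrative assumptions.
SAMPLE_HTML = '''
<html><body>
<script id="js-initialData" type="text/json">
{"users": [{"name": "alice", "follower_count": 12},
           {"name": "bob", "follower_count": 7}],
 "paging": {"is_end": true}}
</script>
</body></html>
'''

# Non-greedy match of everything between the opening and closing script tags.
SCRIPT_RE = re.compile(
    r'<script id="js-initialData"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_embedded_json(html):
    """Return the decoded JSON payload, or None if the script tag is absent."""
    match = SCRIPT_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def save_users(conn, users):
    """Insert or update one row per user in a simple two-column table."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS users '
        '(name TEXT PRIMARY KEY, follower_count INTEGER)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO users (name, follower_count) VALUES (?, ?)',
        [(u['name'], u['follower_count']) for u in users],
    )
    conn.commit()


def load_users(conn):
    """Read the stored rows back, ordered by name."""
    return conn.execute(
        'SELECT name, follower_count FROM users ORDER BY name'
    ).fetchall()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
payload = extract_embedded_json(SAMPLE_HTML)
save_users(conn, payload['users'])
rows = load_users(conn)
print(rows)
```

Using a primary key with `INSERT OR REPLACE` means a re-run of the crawl updates existing rows instead of duplicating them, which matters when a crawl is interrupted and resumed.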

Python爬虫项目之爬取知乎数据.zip (17 files)

Python爬虫项目之爬取知乎数据

ZhiHu

__init__.py 0B

pipelines.py 116B

MysqlPipelines

__init__.py 0B

pipelines.py 576B

Mysql.py 3KB

spiders

__init__.py 161B

zhihu.py 7KB

items.py 2KB

settings.py 6KB

middlewares.py 2KB

zhihu-entrypoint.py 72B

scrapy.cfg 254B

.idea

markdown-navigator

profiles_settings.xml 104B

workspace.xml 36KB

misc.xml 4KB

modules.xml 262B

ZhiHu.iml 398B

# -*- coding: utf-8 -*-
import json

import requests
import scrapy
from scrapy.http import Request
from scrapy.utils.project import get_project_settings

from ZhiHu.items import ZhihuItem
from ZhiHu.MysqlPipelines.Mysql import NumberCheck
from ZhiHu.settings import Tool

# scrapy.conf is no longer available in current Scrapy releases;
# get_project_settings() loads the same project settings.
settings = get_project_settings()

# Followers API for the user excited-vczh. The offset value is appended to this prefix.
FOLLOWERS_URL = 'https://www.zhihu.com/api/v4/members/excited-vczh/followers?include=data%5B*%5D.locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&limit=20&offset='


class Myspider(scrapy.Spider):
    '''Initialise the crawl counters.'''
    name = 'ZhiHu'
    allowed_domains = ['zhihu.com']
    L = ''
    K = ''
    All_Num = 546049  # target number of records, filled in by hand
    Save_Num = NumberCheck.find_save()  # number of records already saved
    DB_Num = NumberCheck.find_db_real()  # last record in the database from the previous run
    Last_Num = NumberCheck.find_last()  # vczh's follower count recorded by the previous run

    response = requests.get(FOLLOWERS_URL + '0',
                            headers=settings.getdict('DEFAULT_REQUEST_HEADERS'))
    first_page = json.loads(response.text)
    try:
        # Current follower count. A KeyError here usually means the token has expired.
        Now_Num = first_page['paging']['totals']
        # The follower list keeps changing, so compute the real starting index.
        # On the first run DB_Num and Last_Num are None; on later runs they are set.
        if DB_Num is not None and Last_Num is not None:
            Real_Num = DB_Num + (Now_Num - Last_Num) - 1
        else:
            Real_Num = All_Num
            Save_Num = 0
        print('Target:', All_Num)
        print('Already saved:', Save_Num)
        print('')
        print('Current followers:', Now_Num)
        print('Previous followers:', Last_Num)
    except KeyError:
        print('\n')
        print('Authorization has expired. Stop the program, capture a new token, '
              'and update it in settings.')

    def start_requests(self):
        # One request per offset, from Real_Num - 1 down to 0. Each response holds
        # up to 20 entries (limit=20), but parse() only reads the first one.
        for offset in range(self.Real_Num - 1, -1, -1):
            yield Request(FOLLOWERS_URL + str(offset), callback=self.parse)

    def parse(self, response):
        '''Parse the JSON response and fill in one item.'''
        data = json.loads(response.text)['data']
        i = data[0]
        item = ZhihuItem()
        item['Save_Num'] = self.Save_Num + 1
        item['Last_Num'] = self.Now_Num
        item['Real_Num'] = self.Real_Num
        item['name'] = i['name']
        # Extract plain text from the headline and description with the custom Tool class.
        tool = Tool()
        item['headline'] = tool.replace(i['headline'])
        item['description'] = tool.replace(i['description'])
        item['detailURL'] = 'https://www.zhihu.com/people/' + str(i['url_token'])
        item['gender'] = i['gender']
        item['user_type'] = i['user_type']
        item['is_active'] = i['is_active']

        # As in the original logic, only the last listed location is kept.
        item['locations'] = ''
        for n in i['locations']:
            item['locations'] = n['name']

        try:
            item['business'] = i['business']['name']
        except KeyError:
            item['business'] = ''

        # Education history: school/major, school only, or major only.
        content = []
        for n in i['educations']:
            has_school = 'school' in n
            has_major = 'major' in n
            if has_school and has_major:
                self.L = n['school']['name'] + '/' + n['major']['name']
            elif has_school:
                self.L = n['school']['name']
            elif has_major:
                self.L = n['major']['name']
            else:
                continue  # neither field present: skip instead of raising KeyError
            content.append(self.L)
        item['educations'] = ''.join(entry + ' ' for entry in content)

        # Employment history: company/job, company only, or job only.
        content = []
        for n in i['employments']:
            has_company = 'company' in n
            has_job = 'job' in n
            if has_company and has_job:
                self.K = n['company']['name'] + '/' + n['job']['name']
            elif has_company:
                self.K = n['company']['name']
            elif has_job:
                self.K = n['job']['name']
            else:
                continue  # neither field present: skip instead of repeating the previous entry
            content.append(self.K)
        item['employments'] = ''.join(entry + ' ' for entry in content)

        # Counters copied over unchanged from the API response.
        for field in ('following_count', 'follower_count', 'mutual_followees_count',
                      'voteup_count', 'thanked_count', 'favorited_count', 'logs_count',
                      'following_question_count', 'following_topic_count',
                      'following_favlists_count', 'following_columns_count',
                      'articles_count', 'question_count', 'answer_count', 'pins_count',
                      'participated_live_count', 'hosted_live_count'):
            item[field] = i[field]

        print('Index:', self.Real_Num)
        self.Real_Num -= 1
        return item
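The spider above requests `limit=20` entries per call but reads only `data[0]`, so it issues one request per follower. One possible alternative, sketched here under assumptions rather than taken from the project, is to read every entry on a page and advance the offset by `limit`, stopping once the offset passes `paging.totals` (the field the spider already reads). The names `page_url`, `iter_followers`, and `fake_fetch`, and the `example.invalid` URL, are hypothetical; `fake_fetch` stands in for a real HTTP call so the sketch runs offline.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def page_url(base, offset, limit=20):
    """Build a paged URL from a base endpoint, an offset, and a page size."""
    return base + '?' + urlencode({'limit': limit, 'offset': offset})


def iter_followers(fetch, base, limit=20):
    """Yield every entry in 'data', one page at a time.

    `fetch` is any callable that takes a URL and returns the response body
    as a JSON string, so a real HTTP client can be swapped in later.
    """
    offset = 0
    while True:
        page = json.loads(fetch(page_url(base, offset, limit)))
        data = page.get('data', [])
        if not data:
            return  # empty page: nothing more to read
        for entry in data:
            yield entry
        offset += limit
        if offset >= page['paging']['totals']:
            return  # the next offset would be past the reported total


def fake_fetch(url):
    """Offline stand-in for an HTTP request: serves 45 made-up users."""
    query = parse_qs(urlparse(url).query)
    offset = int(query['offset'][0])
    limit = int(query['limit'][0])
    names = ['user%d' % n for n in range(45)]
    return json.dumps({
        'paging': {'totals': len(names)},
        'data': [{'name': n} for n in names[offset:offset + limit]],
    })


followers = list(iter_followers(fake_fetch, 'https://example.invalid/followers'))
print(len(followers))
```

With 45 entries and a page size of 20, this sketch makes three requests instead of 45. Whether that trade-off suits the project depends on how it resumes an interrupted crawl, which the original per-offset counters were designed to handle.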
