## What is weixin_crawler?
weixin_crawler is a crawler for WeChat Official Account articles built with Scrapy, Flask, Echarts, Elasticsearch, and more. It ships with analysis reports ([sample report](readme_img/report_example.png)) and full-text search that can query millions of documents almost instantly. weixin_crawler was designed to crawl the posting history of WeChat Official Accounts as completely and as quickly as possible.
If you want to see whether this project is interesting first, this introduction video (under 3 minutes) is all you need:
https://www.youtube.com/watch?v=CbfLRCV7oeU&t=8s
## Key Features
1. Written in Python 3
2. Built on Scrapy and exercising many of its features; a good open-source project for learning Scrapy in depth
3. A highly usable UI built with Flask, Flask-SocketIO, and Vue; powerful and practical, a solid data assistant for roles such as new-media operations
4. Thanks to Scrapy, MongoDB, and Elasticsearch, crawling, storage, and indexing are all simple and efficient
5. Supports crawling the complete posting history of a WeChat Official Account
6. Supports crawling article metrics such as read counts, likes, rewards, and comment counts
7. Ships with a data analysis report for each individual account
8. Full-text search powered by Elasticsearch, with multiple search modes and sort modes, plus trend-analysis charts for search results
9. Supports grouping accounts; group data can be used to narrow the search scope
10. An original phone-automation approach that lets the crawler run unattended
11. Anti-crawling countermeasures are simple and blunt
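Feature 8 above (full-text search with group filtering from feature 9) can be sketched as an Elasticsearch query body. This is a hedged illustration only: the field names (`title`, `content`, `nickname`, `read_num`) are assumptions for the example, not weixin_crawler's actual index schema.

```python
# Build a hypothetical Elasticsearch query body for article search.
# Field names are illustrative assumptions, not the project's real schema.
def build_search_query(keyword, sort_by="_score", accounts=None):
    query = {
        "query": {
            "bool": {
                # Match the keyword in either the title or the body text.
                "must": [{"multi_match": {
                    "query": keyword,
                    "fields": ["title", "content"],
                }}]
            }
        },
        # Sort by relevance, or e.g. "read_num" for most-read first.
        "sort": [{sort_by: {"order": "desc"}}],
    }
    if accounts:
        # Restrict the search to one group of official accounts.
        query["query"]["bool"]["filter"] = [{"terms": {"nickname": accounts}}]
    return query
```

A body like this could then be passed to the Elasticsearch client's `search(index=..., body=...)` call.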
## Main Tools Used
| Language | | Python 3.6 |
| --- | ------- | --------------------------------------------- |
| Frontend | Web framework | Flask / Flask-SocketIO / gevent |
| | JS/CSS libraries | Vue / jQuery / W3.CSS / Echarts / Font Awesome |
| Backend | Crawler | Scrapy |
| | Storage | MongoDB / Redis |
| | Indexing | Elasticsearch |
## How to Run
weixin_crawler has run successfully on Windows, macOS, and Linux, although trying it on Windows first is recommended.
> #### Install MongoDB / Redis / Elasticsearch and run them in the background
>
> 1. Download MongoDB / Redis / Elasticsearch from their official sites and install them
>
> 2. Run them at the same time with the default configuration, i.e. MongoDB on localhost:27017 and Redis on localhost:6379 (otherwise configure the addresses in weixin_crawler/project/configs/auth.py)
>
> 3. To tokenize Chinese, the *elasticsearch-analysis-ik* plugin has to be installed for Elasticsearch
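> The default connection settings from step 2 might be sketched as follows; the variable names below are illustrative assumptions, the real ones live in weixin_crawler/project/configs/auth.py:
>
> ```python
> # Hypothetical auth.py fragment: the service addresses assumed in step 2.
> MONGODB_HOST = "localhost"
> MONGODB_PORT = 27017   # MongoDB default port
> REDIS_HOST = "localhost"
> REDIS_PORT = 6379      # Redis default port
> ES_HOST = "localhost"
> ES_PORT = 9200         # Elasticsearch default HTTP port
> ```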
>
> #### Install the proxy server and run proxy.js
>
> 1. Install Node.js, then run `npm install anyproxy redis` in weixin_crawler/proxy
>
> 2. cd to weixin_crawler/proxy and run `node proxy.js`
>
> 3. Install the AnyProxy HTTPS CA certificate on both the computer and the phone
>
> 4. If you are not sure how to use AnyProxy, [here](https://github.com/alibaba/anyproxy) is the documentation
>
> #### Install the needed python packages
>
> 1. NOTE: you may not be able to simply run `pip install -r requirements.txt` to install every package; Twisted, required by Scrapy, is a common failure. When a Python package (Twisted, for instance) fails to install, [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/) usually has the solution: download the right prebuilt wheel for your platform and run `pip install <downloaded_file>`
>
> 2. Your Python environment may still raise other package-not-found errors; just install whatever package is missing
>
> #### Some source code has to be modified (perhaps not the cleanest approach)
>
> 1. Replace Scrapy's `Python36\Lib\site-packages\scrapy\http\request\__init__.py` with `weixin_crawler\source_code\request\__init__.py`
>
> 2. Replace Scrapy's `Python36\Lib\site-packages\scrapy\http\response\__init__.py` with `weixin_crawler\source_code\response\__init__.py`
>
> 3. Replace pyecharts' `Python36\Lib\site-packages\pyecharts\base.py` with `weixin_crawler\source_code\base.py`, which adds the function `get_echarts_options` at line 106
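> The three file replacements above can be scripted. This is a minimal sketch under the assumption that you run it from the weixin_crawler checkout and pass in your environment's site-packages directory; back up the originals first:
>
> ```python
> import os
> import shutil
>
> # The three files shipped in weixin_crawler/source_code and the
> # site-packages files they replace (paths relative to each root).
> REPLACEMENTS = [
>     (os.path.join("source_code", "request", "__init__.py"),
>      os.path.join("scrapy", "http", "request", "__init__.py")),
>     (os.path.join("source_code", "response", "__init__.py"),
>      os.path.join("scrapy", "http", "response", "__init__.py")),
>     (os.path.join("source_code", "base.py"),
>      os.path.join("pyecharts", "base.py")),
> ]
>
> def patch_installed_packages(repo_root, site_packages, copy=shutil.copyfile):
>     """Copy each modified file over its installed counterpart.
>
>     `site_packages` is e.g. Python36/Lib/site-packages.
>     """
>     for src, dst in REPLACEMENTS:
>         copy(os.path.join(repo_root, src), os.path.join(site_packages, dst))
> ```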
>
> #### These steps are necessary if you want weixin_crawler to work automatically; otherwise you have to operate the phone by hand to produce the request data that AnyProxy captures
>
> 1. Install adb and add it to your PATH (on Windows, for example)
>
> 2. Install an Android emulator (NOX suggested) or plug in your phone, and make sure you can operate it with adb from the command line
>
> 3. If multiple phones are connected to your computer, you have to find out their adb ports, which are needed when adding a crawler
>
> 4. adb does not support Chinese input, which is bad news for searching Official Accounts by name. To input Chinese, install ADBKeyBoard on the Android phone and set it as the default input method; see [here](https://github.com/senzhk/ADBKeyBoard) for details
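> The adb-driven phone control in the steps above boils down to shelling out to `adb`. A hedged sketch of the commands involved; each returned list could be passed to `subprocess.run`, and the `ADB_INPUT_TEXT` broadcast is the mechanism documented by the ADBKeyBoard project:
>
> ```python
> # Build adb command lists for the gestures the crawler needs.
> # `serial` selects one of several connected phones (step 3 above).
> def adb(serial, *args):
>     return ["adb", "-s", serial, *args]
>
> def tap(serial, x, y):
>     # Tap a predefined screen location, e.g. the "全部消息" entry.
>     return adb(serial, "shell", "input", "tap", str(x), str(y))
>
> def swipe(serial, x1, y1, x2, y2, ms=300):
>     # Scroll the message list down to load more entries.
>     return adb(serial, "shell", "input", "swipe",
>                str(x1), str(y1), str(x2), str(y2), str(ms))
>
> def type_chinese(serial, text):
>     # Plain `input text` cannot send Chinese; ADBKeyBoard accepts it
>     # via this broadcast once set as the default input method.
>     return adb(serial, "shell", "am", "broadcast",
>                "-a", "ADB_INPUT_TEXT", "--es", "msg", text)
> ```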
>
> Why can weixin_crawler work automatically? Here is the reason:
>
> - To crawl a WeChat Official Account, you search for the account on your phone and tap its "全部消息" (all messages) entry; you then get a message list, and scrolling down loads more of it. Any message in the list can be tapped if you want to crawl the account's reading data
> - Given an account's nickname, weixin_crawler operates the WeChat app installed on the phone while AnyProxy listens in the background. weixin_crawler thereby captures all the requests made by the WeChat app, and then it is show time for Scrapy
> - As you would expect, to let weixin_crawler operate the WeChat app we have to tell adb where to tap, swipe, and type; most of this is defined in weixin_crawler/project/phone_operate/config.py. phone_operate drives WeChat the way a human would: its eyes are the Baidu OCR API plus predefined tap areas, and its fingers are adb
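> The "eyes" described above can be sketched as: given OCR word boxes, locate the button text on screen and compute where to tap. The field names follow the general shape of Baidu OCR responses (`words` plus a `location` box), but treat them as assumptions for this example:
>
> ```python
> def find_tap_point(ocr_words, target="全部消息"):
>     """Given OCR results [{'words': str, 'location': {...}}, ...],
>     return the (x, y) center of the first box containing `target`,
>     or None if the text was not found on screen."""
>     for item in ocr_words:
>         if target in item["words"]:
>             loc = item["location"]  # keys: left, top, width, height
>             return (loc["left"] + loc["width"] // 2,
>                     loc["top"] + loc["height"] // 2)
>     return None
> ```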
>
> #### Run main.py
>
> $ cd weixin_crawler/project/
>
> $ python(3) ./main.py
>
> Now open your browser; everything you want is at localhost:5000.
>
> If you get stuck anywhere in this long step list, join our community for help and tell us what you have done and what error you hit.
>
> Let's explore the world at localhost:5000 together
## Feature Showcase
Main UI
![1](readme_img/爬虫主界面.gif)
Adding crawl tasks and the list of crawled accounts
![1](readme_img/公众号.png)
Crawler view
![](readme_img/caiji.png)
Settings view
![ ](readme_img/设置.png)
An account's historical article list
![ ](readme_img/历史文章列表.gif)
Report
![ ](readme_img/报告.gif)
Search
![ ](readme_img/搜索.gif)
## Join the Community
You might be:
- A fresh graduate with little project experience
- A veteran who can get weixin_crawler running with ease
- A crawler expert who can scrape whatever is pointed at
- Someone in new-media operations who cannot code at all but wants to put weixin_crawler's features to full use in your own work
Whichever you are, as long as you have a strong interest in WeChat data analysis, you can get what you need by joining our community through the author's WeChat.
## Support the Author
Development of weixin_crawler began in June 2018 in the author's spare time (somehow it took half a year). Limited as the author's skill is, only now is there a barely presentable version to share with fellow crawler enthusiasts; thanks for your patience. If you like this project, your support is welcome.
You can support the author in any (or several) of the following ways:
- Give the project a small star and share this fun open-source project with other developers, even just one, as long as they too are pressing forward on the road of technical growth
- Buy the author a coffee so the late-night coding sessions gain a bit of efficiency :) [thanks to these folks](readme_img/thanks.md)
- Join the community and contribute code; together we can build an even cooler crawler
- Join the author's 知识星球 (Knowledge Planet) group to hear the author walk through every function in weixin_crawler and the reasoning behind every problem solved; you will meet more determined, like-minded people there