pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage] [![Try]][Demo]
========
A Powerful Spider(Web Crawler) System in Python. **[TRY IT NOW!][Demo]**
- Write scripts in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.com/beanstalkd/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodic crawling, recrawl by age, etc.
- Distributed architecture, crawling of JavaScript pages, Python 2.{6,7} and 3.{3,4,5,6} support, etc.
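
The task-level features listed above (priority and recrawl by age) can be illustrated with a toy in-memory queue. This is a minimal sketch and not pyspider's scheduler: `TinyScheduler`, `submit`, and `pop` are hypothetical names, and treating `age=0` as "always recrawl" is an assumption made only for this example.

```python
import heapq
import time


class TinyScheduler:
    """Toy in-memory scheduler (hypothetical, NOT pyspider's implementation)."""

    def __init__(self):
        self._queue = []          # heap of (-priority, sequence, url)
        self._last_crawled = {}   # url -> timestamp of the last crawl
        self._seq = 0             # tie-breaker keeps insertion order stable

    def submit(self, url, priority=0, age=0, now=None):
        """Queue url unless it was crawled less than `age` seconds ago."""
        now = time.time() if now is None else now
        last = self._last_crawled.get(url)
        if last is not None and now - last < age:
            return False  # still considered fresh, so the request is dropped
        heapq.heappush(self._queue, (-priority, self._seq, url))
        self._seq += 1
        return True

    def pop(self, now=None):
        """Return the highest-priority url and record when it was crawled."""
        now = time.time() if now is None else now
        _, _, url = heapq.heappop(self._queue)
        self._last_crawled[url] = now
        return url
```

Higher `priority` values are popped first, and a URL submitted again inside its `age` window is skipped, which is the behaviour the feature list describes in broad terms.
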

* Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)
* Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)
* Release notes: [https://github.com/binux/pyspider/releases](https://github.com/binux/pyspider/releases)

Sample Code
-----------
```python
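from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Default options applied to every self.crawl() call in this project
    crawl_config = {
    }

    # Run on_start once a day (24 * 60 minutes)
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched page as fresh for 10 days (age is in seconds)
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is stored as the result for this page
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }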
```
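
The two numeric arguments in the sample are plain arithmetic: `@every` takes minutes and `age` takes seconds. As a quick check of what they amount to, the standalone sketch below converts both to `timedelta` values; the variable names are illustrative and not part of pyspider.

```python
from datetime import timedelta

# Value passed to @every(minutes=...) in the sample
every_minutes = 24 * 60
# Value passed to @config(age=...) in the sample, in seconds
age_seconds = 10 * 24 * 60 * 60

every_interval = timedelta(minutes=every_minutes)
age_interval = timedelta(seconds=age_seconds)

print(every_interval)  # 1 day, 0:00:00
print(age_interval)    # 10 days, 0:00:00
```

So `on_start` is scheduled once a day, and an index page fetched within the last 10 days is not fetched again.
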

[![Demo][Demo Img]][Demo]

Installation
------------

* `pip install pyspider`
* Run the `pyspider` command, then visit [http://localhost:5000/](http://localhost:5000/)

**WARNING:** The WebUI is open to the public by default and can be used to execute arbitrary commands, which may harm your system. Run it on an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).

Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

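
One way to act on the WebUI warning is to keep the credentials in a JSON config file. The sketch below writes such a file. The key names (`username`, `password`, and `need-auth` under `webui`) are assumed to mirror the WebUI command-line options and should be checked against the Command-Line documentation; the credentials and the `config.json` filename are placeholders.

```python
import json

# Placeholder credentials: replace both values before using the file.
config = {
    "webui": {
        "username": "admin",
        "password": "change-me",
        "need-auth": True,  # assumed key name, see the Command-Line docs
    }
}

config_text = json.dumps(config, indent=2)

# Write the config next to where pyspider will be started.
with open("config.json", "w") as f:
    f.write(config_text)
```

Assuming the `-c`/`--config` option described in the linked documentation, the file could then be passed at startup with `pyspider -c config.json`.
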
Contribute
----------
* Use It
* Open [Issue], send PR
* [User Group]
* [Chinese Q&A (中文问答)](http://segmentfault.com/t/pyspider)

TODO
----

### v0.4.0

- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)

License
-------

Licensed under the Apache License, Version 2.0

[Build Status]: https://img.shields.io/travis/binux/pyspider/master.svg?style=flat
[Travis CI]: https://travis-ci.org/binux/pyspider
[Coverage Status]: https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]: https://coveralls.io/r/binux/pyspider
[Try]: https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Demo]: http://demo.pyspider.org/
[Demo Img]: https://github.com/binux/pyspider/blob/master/docs/imgs/demo.png
[Issue]: https://github.com/binux/pyspider/issues
[User Group]: https://groups.google.com/group/pyspider-users