pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage] [![Try]][Demo]
========
A Powerful Spider (Web Crawler) System in Python. **[TRY IT NOW!][Demo]**
- Write scripts in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch), and [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backends
- [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.com/beanstalkd/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical crawl, recrawl by age, etc.
- Distributed architecture, crawling of JavaScript pages, Python 2.{6,7} and 3.{3,4,5,6} support, etc.
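The task-level features above (priority, retry, recrawl by age) are passed per call as keyword arguments to `self.crawl`. A minimal sketch of the kind of settings involved, collected in a plain dict so it runs without pyspider installed (parameter names as documented for `self.crawl`; the values are illustrative):

```python
# Per-task options as documented for self.crawl, gathered in a plain
# dict here so this sketch runs standalone, without pyspider installed.
crawl_options = {
    "priority": 2,               # larger number = scheduled earlier
    "retries": 3,                # retry count on fetch failure
    "age": 10 * 24 * 60 * 60,    # seconds before the page may be re-crawled
    "auto_recrawl": True,        # keep re-crawling every `age` seconds
}

# Inside a handler these would be passed as keyword arguments, e.g.
# self.crawl(url, callback=self.index_page, **crawl_options)
```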
Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)
Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)
Release notes: [https://github.com/binux/pyspider/releases](https://github.com/binux/pyspider/releases)
Sample Code
-----------
```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```
[![Demo][Demo Img]][Demo]
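The `@config(age=...)` decorator above means a crawled page is considered fresh for that many seconds and will not be re-fetched until it expires. The decision can be sketched with a small standalone helper (a hypothetical illustration of the idea, not pyspider's actual scheduler code):

```python
import time


def needs_recrawl(last_crawl_time, age, now=None):
    """Return True when a task's last result is older than `age` seconds.

    Hypothetical helper illustrating pyspider's age-based recrawl idea;
    not the project's actual scheduler implementation.
    """
    if now is None:
        now = time.time()
    return (now - last_crawl_time) > age


# A page crawled at t=0 with age=10 days is stale after 11 days:
assert needs_recrawl(0, 10 * 24 * 60 * 60, now=11 * 24 * 60 * 60)
```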
Installation
------------
* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)
**WARNING:** The WebUI is open to the public by default; it can be used to execute arbitrary commands, which may harm your system. Please run it inside an internal network or [enable `need-auth` for the webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).
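For example, authentication can be switched on through a config file passed with `--config` (key names follow the command-line docs linked above; the credentials here are placeholders):

```json
{
  "webui": {
    "need-auth": true,
    "username": "some-username",
    "password": "some-password"
  }
}
```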
Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)
Contribute
----------
* Use It
* Open [Issue], send PR
* [User Group]
* [Q&A in Chinese](http://segmentfault.com/t/pyspider)
TODO
----
### v0.4.0
- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)
License
-------
Licensed under the Apache License, Version 2.0
[Build Status]: https://img.shields.io/travis/binux/pyspider/master.svg?style=flat
[Travis CI]: https://travis-ci.org/binux/pyspider
[Coverage Status]: https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]: https://coveralls.io/r/binux/pyspider
[Try]: https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Demo]: http://demo.pyspider.org/
[Demo Img]: https://github.com/binux/pyspider/blob/master/docs/imgs/demo.png
[Issue]: https://github.com/binux/pyspider/issues
[User Group]: https://groups.google.com/group/pyspider-users