How to try it:
--------------
The recommended way to install dependencies is to use virtualenv and
then do:
pip install -r requirements.txt
Run the server using:
twistd -n slyd
and point your browser to:
http://localhost:9001/static/index.html
Chrome and Firefox are supported, but it works better with Chrome.
Slyd API Notes
--------------
This will be moved to separate docs; for now these are notes for developers.
All resources are either under /dist/ or /projects/.
project listing/creation/deletion/renaming
To list all existing projects, GET http://localhost:9001/projects:
$ curl http://localhost:9001/projects -> ["project1", "project2"]
New projects can be created by posting to /projects, for example:
$ curl -d '{"cmd": "create", "args": ["project_X"]}' http://localhost:9001/projects
To delete a project:
$ curl -d '{"cmd": "rm", "args": ["project_X"]}' http://localhost:9001/projects
To rename a project:
$ curl -d '{"cmd": "mv", "args": ["oldname", "newname"]}' http://localhost:9001/projects
Please note that projects will not be overwritten when renaming or creating new ones: if a project
with the given name already exists, an error from the 400 family will be returned.
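The create, rm and mv commands above share one JSON shape, so a small client can wrap them. The following is a minimal sketch using only the Python standard library. The helper names `build_command` and `send_project_command` are hypothetical and not part of slyd, and the sketch assumes the server is running at http://localhost:9001 as described above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9001"  # assumed: the address used throughout this README


def build_command(cmd, *args):
    """Encode a command as the JSON body used in the curl examples."""
    return json.dumps({"cmd": cmd, "args": list(args)}).encode("utf-8")


def send_project_command(cmd, *args, base_url=BASE_URL):
    """POST a command to /projects and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx responses, for example
    when a project with the given name already exists.
    """
    request = urllib.request.Request(base_url + "/projects",
                                     data=build_command(cmd, *args))
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the request body; nothing is sent to the server.
    print(build_command("mv", "oldname", "newname").decode("utf-8"))
```

Calling `send_project_command("create", "project_X")` would then correspond to the first curl example, provided the server is reachable.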
spec
The project specification is available under /projects/PROJECT_ID/spec. The path format
mirrors the slybot format documented here:
http://slybot.readthedocs.org/en/latest/project.html
Currently, this is read only, but it will soon support PUT/POST.
The entire spec is returned for a GET request to the root:
$ curl http://localhost:9001/projects/78/spec
{"project": {
"version": "1308771278",
"name": "demo"
..
}
A list of available spiders can be retrieved:
$ curl http://localhost:9001/projects/78/spec/spiders
["accommodationforstudents.com", "food.com", "pinterest.com", "pin", "mhvillage"]
and specific resources can be requested:
$ curl http://localhost:9001/projects/78/spec/spiders/accommodationforstudents.com
{
"templates":
...
"respect_nofollow": true
}
The spec can be updated by POSTing:
$ curl --data @newlinkedin.js http://localhost:9001/projects/78/spec/spiders/linkedin
An HTTP 400 will be returned if the uploaded spec does not validate.
Basic commands are available for manipulating spider files. For example:
$ curl -d '{"cmd": "rm", "args": ["spidername"]}' http://localhost:9001/projects/78/spec/spiders
Available commands are:
* mv - move spider from first arg to second. If the second exists it is overwritten.
* rm - delete spider
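Because mv overwrites an existing target, a client may want to check the spider list before renaming. The sketch below shows one way to do that with the Python standard library. The names `spec_url`, `rename_is_safe` and `rename_spider` are hypothetical helpers, not part of slyd, and the base URL is assumed to be the one used in this README.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9001"  # assumed: the address used throughout this README


def spec_url(project_id, *path, base_url=BASE_URL):
    """Build a URL under /projects/PROJECT_ID/spec, e.g. spec/spiders/NAME."""
    parts = [quote(str(p), safe="") for p in (project_id, "spec") + path]
    return base_url + "/projects/" + "/".join(parts)


def rename_is_safe(existing_spiders, old, new):
    """True if `old` exists and `new` would not overwrite another spider."""
    return old in existing_spiders and new not in existing_spiders


def rename_spider(project_id, old, new):
    """Rename a spider, refusing to overwrite an existing one."""
    # Fetch the current spider list first (GET .../spec/spiders).
    with urllib.request.urlopen(spec_url(project_id, "spiders")) as response:
        existing = json.loads(response.read().decode("utf-8"))
    if not rename_is_safe(existing, old, new):
        raise ValueError("refusing to rename %r to %r" % (old, new))
    # Send the mv command in the same shape as the curl example.
    body = json.dumps({"cmd": "mv", "args": [old, new]}).encode("utf-8")
    request = urllib.request.Request(spec_url(project_id, "spiders"), data=body)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The check and the rename are two separate requests, so this does not protect against another client creating the target spider in between.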
bot/fetch
Accepts a JSON object with the following fields:
* request - same as a Scrapy request object; it needs at least a url
* spider - spider name within the project
* page_id - unique ID for this page, must match the id used in templates (not yet implemented)
Returns a JSON object containing (so far):
* page - page content, not yet annotated but will be
* response - object containing the response data: http code and headers
* items - array of items extracted
* fp - request fingerprint
* error - error message, present if there was an error
* links - array of links followed
Coming soon in the response:
* template_id - id of template that matched
* trace - textual trace of the matching process - for debugging
If you want to work on an existing project, put it in data/projects/PROJECTID. Projects can be downloaded from dash or by running:
$ bin/sh2sly data/projects -p 78 -k YOURAPIKEY
Then you can extract data:
$ curl -d '{"request": {"url": "http://www.pinterest.com/pin/339740365610932893/"}, "spider": "pinterest.com"}' http://localhost:9001/projects/78/bot/fetch
{
"fp": "0f2686acdc6a71eeddc49045b7cea0b6f81e6b61",
"items": [
{
"url": "http://www.pinterest.com/pin/339740365610932893/",
"_template": "527387aa4d6c7133c6551481",
"image": [
"http://media-cache-ak0.pinimg.com/736x/6c/c5/35/6cc5352046df0f8d8852cbdfb31542bb.jpg"
],
"_type": "pin",
"name": [
"Career Driven"
]
}
],
"page": "<!DOCTYPE html>\n ...."
}
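The same request can be made from a script, which makes it easier to inspect the response fields listed above. This is a minimal sketch using the Python standard library. `build_fetch_body`, `summarize_fetch` and `fetch` are hypothetical helper names, and the summary keys (`ok`, `item_types`, `link_count`) are illustrative choices rather than anything returned by slyd.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9001"  # assumed: the address used throughout this README


def build_fetch_body(url, spider, page_id=None):
    """Build the JSON body for bot/fetch; `request` needs at least a url."""
    body = {"request": {"url": url}, "spider": spider}
    if page_id is not None:
        body["page_id"] = page_id
    return json.dumps(body).encode("utf-8")


def summarize_fetch(result):
    """Reduce a decoded bot/fetch response to a few fields useful for debugging."""
    if "error" in result:
        return {"ok": False, "error": result["error"]}
    items = result.get("items", [])
    return {
        "ok": True,
        "fp": result.get("fp"),
        "item_types": [item.get("_type") for item in items],
        "link_count": len(result.get("links", [])),
    }


def fetch(project_id, url, spider, base_url=BASE_URL):
    """POST to /projects/PROJECT_ID/bot/fetch and return the decoded JSON."""
    endpoint = "%s/projects/%s/bot/fetch" % (base_url, project_id)
    request = urllib.request.Request(endpoint, data=build_fetch_body(url, spider))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `summarize_fetch(fetch(78, "http://www.pinterest.com/pin/339740365610932893/", "pinterest.com"))` would condense a response like the one shown above.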
Testing
-------
slyd can be tested using Twisted's trial runner:
trial tests