Python crawler framework: PySpider, easy to use yet powerful, with a graphical interface.zip

113 files in total

png: 70

md: 28

jpg: 6


crawler

python

data collection

70 views · uploaded 2024-03-01 12:29:21 · 28.1MB ZIP


Python crawler framework: PySpider, easy to use yet powerful, with a graphical interface.zip (113 files)

.gitignore 62B

favicon.ico 1KB

autohome_excel_screenshot_1.jpg 380KB

autohome_excel_screenshot_2.jpg 368KB

autohome_detail_series_home.jpg 146KB

autohome_detail_series_all.jpg 123KB

autohome_detail_series_years.jpg 99KB

autohome_detail_series_a.jpg 54KB

book.json 4KB

book_current.json 1KB

README_current.json 780B

Makefile 40B

code.md 117KB

car_brand_data.md 24KB

result.md 14KB

common_issue.md 9KB

reference.md 9KB

pitfall.md 8KB

README.md 7KB

find_extract_html.md 7KB

process.md 5KB

README.md 5KB

phantomjs.md 3KB

project_steps.md 3KB

baidu_hot_list.md 3KB

delete_project.md 2KB

README.md 2KB

config_json.md 2KB

self_crawl.md 2KB

SUMMARY.md 2KB

data_folder.md 2KB

README.md 1KB

README.md 425B

README.md 150B

README.md 144B

README.md 140B

README.md 99B

README.md 95B

README.md 44B

node_modules 36B

autohome_page_car_detail.png 2MB

autohome_page_search_pic_tbar.png 1007KB

pyspider_many_599_paused.png 872KB

autohome_page_car_series.png 856KB

left_show_current_level_function.png 634KB

show_parent_level_function_name.png 604KB

click_prev_next_to_switch.png 601KB

scrapy_debug_response_in_command_line.png 578KB

autohome_pyspider_process_error.png 563KB

autohome_car_homepage.png 530KB

pyspider_baidu_hot_debug.png 511KB

pyspider_css_selector_help.png 502KB

pyspider_with_debug_ui.png 485KB

click_left_arrow_return_upper_level.png 474KB

pyspider_debug_show_href_with_host.png 460KB

pyspider_http_400_bad_request.png 442KB

autohome_car_debug_page.png 408KB

pyspider_debug_pass_para.png 393KB

autohome_spec_audi_a2l_1.png 386KB

autohome_pyspider_export_result_csv.png 344KB

html_source_href_no_host.png 340KB

pyspider_right_to_new_2.png 322KB

pyspider_follow_3.png 321KB

pyspider_code_save.png 316KB

autohome_spec_audi_a2l_2.png 305KB

pyspider_sqlite_project_contain_python_code.png 292KB

pyspider_run_more_link.png 290KB

autohome_car_pyspider_debug.png 276KB

pyspider_returned_up_level.png 261KB

run_redo_parent_level.png 237KB

autohome_page_brand_a.png 219KB

pyspider_right_to_new.png 219KB

pyspider_project_db_structure.png 213KB

control_c_terminate_pyspider.png 212KB

pyspider_output_effect.png 211KB

autohome_pyspider_error_doc_empty.png 205KB

pyspider_debug_main.png 203KB

pyspider_run_more_link_2.png 201KB

pyspider_left_previous_level.png 189KB

pyspider_config_data_path.png 187KB

pyspider_click_three_point_expand.png 187KB

autohome_pyspider_error_detail_1.png 182KB

baidu_host_csv_preview.png 179KB

pyspider_normal_run.png 166KB

pyspider_up_level_run.png 165KB

pyspider_show_links.png 144KB

pyspider_steps_allow_network.png 142KB

pyspider_baidu_hot_results.png 138KB

pyspider_baidu_hot_running.png 116KB

pyspider_group_delete.png 113KB

pyspider_baidu_hot_csv.png 107KB

click_pyspider_project_name_return.png 105KB

pyspider_webui_paused.png 97KB

pyspider_project_deleted.png 95KB

pyspider_ctrl_c_stop.png 92KB

pyspider_rerun_project.png 91KB

pyspider_click_run.png 81KB

pyspider_many_retry_paused.png 79KB

autohome_pyspider_debug_webui.png 71KB


# PySpider tips

## For loading more content, besides hunting for the JS or API, try a different approach

**Problem**: you want more of the content on a single page. Most pages load more data as you scroll down. Internally this is usually implemented in JS, which calls an extra API to fetch and render the additional data.

**Approach**: the usual reaction is to inspect and capture the traffic to work out which API is called. It is often worth looking at other related content on the site first, because the data may be reachable another way. In this case a separate page listed all the car models and car series I needed, so there was no need to reverse-engineer the API at all. See: [Solved: How to load more content from an Autohome page in pyspider](http://www.crifan.com/pyspider_how_load_more_content_data_from_current_page)

## `enable css selector helper` in the debug UI

After clicking `web` you can see the rendered HTML page. Then click `enable css selector helper` and click any element on the page; the corresponding CSS selector is displayed directly.

![CSS selector helper in PySpider](../../assets/img/pyspider_css_selector_help.png)

That said, I rarely used it while debugging pages. I usually debug the page directly in Chrome, read the HTML source, and look for a suitable CSS selector there.

## Sending a `POST` request with `form data` parameters in `application/x-www-form-urlencoded` format

Code:

```python
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    # <ul class="list-user list-user-1" id="list-user-1">
    for each in response.doc('ul[id^="list-user"] li a[href^="http"]').items():
        self.crawl(each.attr.href, callback=self.detail_page)

    maxPageNum = 10
    for curPageIdx in range(maxPageNum):
        curPageNum = curPageIdx + 1
        print("curPageNum=%s" % curPageNum)
        getShowsUrl = "http://xxx/index.php?m=home&c=match_new&a=get_shows"
        headerDict = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        dataDict = {
            "counter": curPageNum,
            "order": 1,
            "match_type": 2,
            "match_name": "",
            "act_id": 3
        }
        self.crawl(
            getShowsUrl,
            method="POST",
            headers=headerDict,
            data=dataDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback
        )

def parseGetShowsCallback(self, response):
    print("parseGetShowsCallback: self=%s, response=%s"%(self, response))
    respJson = response.json
    print("respJson=%s" % (respJson))
```

This does the following:

* Sends a `POST`
* Passes a header
  * `"Content-Type": "application/x-www-form-urlencoded"`
* Passes `data`
  * a `dict` containing the corresponding `key` and `value` pairs
* Passes the `cookie` along as well
  * `cookies=response.cookies`
* Reads the returned `JSON`
  * using `response.json` in the `callback`

## When crawling stops, check whether duplicate URLs are the cause

If you notice that later data is no longer being crawled, consider whether duplicate URLs are the cause. For example, here:

```bash
POST /selfReadingBookQuery2
{ "offset": 0, "limit":10}
```

and:

```bash
POST /selfReadingBookQuery2
{ "offset": 10, "limit":10}
```

The JSON parameters changed, but the URL did not -> the request is treated as a duplicate and is not crawled again.

Solution: make the URL different each time.

Implementation: for example, append a `#hash` value to the URL.

### Example

```python
timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
curUrlWithHash = curUrl + "#" + timestampStr
self.crawl(curUrlWithHash,
    ...
```

which produces:

```bash
/selfReadingBookQuery2#20190409_162018_413205
/selfReadingBookQuery2#20190409_162117_711811
```

Every request URL is now different, so crawling continues.

If that still does not work, or if you want to be extra safe, you can also add an `itag`, for example:

```python
# add hash value for url to force re-crawl when POST url not changed
timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
curUrlWithHash = curUrl + "#" + timestampStr

fakeItagForceRecrawl = "%s_%s_%s" % (timestampStr, offset, limit)

self.crawl(curUrlWithHash,
    itag=fakeItagForceRecrawl, # To force re-crawl for next page
    method="POST",
```

## After many consecutive requests hit 599 connection timeouts and all retries fail, the project is paused automatically

I have run into this several times, for example:

![pyspider_many_599_paused](../../assets/img/pyspider_many_599_paused.png)

```bash
[I 180922 11:25:42 scheduler:959] task retry 0/3 ChildQupeiyinApp:c3e0a65a42199256898652ce1e737321 https://childapi.qupeiyin.com/show/detail?show_id=129410670
[E 180922 11:25:42 tornado_fetcher:212] [599] ChildQupeiyinApp:80040ee81a217bc05a877ff41ee74d05 https://childapi.qupeiyin.com/show/detail?show_id=130095443, HTTP 599: Connection timed out after 20000 milliseconds 20.00s
[E 180922 11:25:42 processor:202] process ChildQupeiyinApp:80040ee81a217bc05a877ff41ee74d05 https://childapi.qupeiyin.com/show/detail?show_id=130095443 -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('HTTP 599: Connection timed out after 20000 milliseconds',)
[I 180922 11:25:42 scheduler:959] task retry 0/3 ChildQupeiyinApp:80040ee81a217bc05a877ff41ee74d05 https://childapi.qupeiyin.com/show/detail?show_id=130095443
[E 180922 11:25:42 tornado_fetcher:212] [599] ChildQupeiyinApp:3783982a707a6c82b8c30d619a6933d7 https://childapi.qupeiyin.com/show/detail?show_id=130096232, HTTP 599: Connection timed out after 20000 milliseconds 20.00s
[E 180922 11:25:42 processor:202] process ChildQupeiyinApp:3783982a707a6c82b8c30d619a6933d7 https://childapi.qupeiyin.com/show/detail?show_id=130096232 -> [599] len:0 -> result:None fol:0 msg:0 err:Exception('HTTP 599: Connection timed out after 20000 milliseconds',)
[I 180922 11:25:43 scheduler:959] task retry 0/3 ChildQupeiyinApp:3783982a707a6c82b8c30d619a6933d7 https://childapi.qupeiyin.com/show/detail?show_id=130096232
[I 180922 11:26:08 scheduler:126] project ChildQupeiyinApp updated, status:STOP, paused:True, 1667087 tasks
^C[I 180922 11:26:13 scheduler:663] scheduler exiting...
[I 180922 11:26:13 tornado_fetcher:671] fetcher exiting...
[I 180922 11:26:13 processor:229] processor exiting...
[I 180922 11:26:13 result_worker:66] result_worker exiting...
```

In the log above, `status:STOP, paused:True` means the project has been **paused**. Correspondingly, `status` in the web UI automatically changes to `PAUSED`:

![pyspider_webui_paused](../../assets/img/pyspider_webui_paused.png)

-> Presumably the internal logic detects repeated `599` errors and pauses automatically instead of retrying further, which avoids pointless follow-up requests.

-> That is fairly smart, because in this case the network was actually down, so no request could succeed.
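
As a complement to the duplicate-URL workaround described earlier (a timestamp `#hash` plus an `itag`), the two values can be built in one small helper. This is only a minimal sketch: the function name `make_recrawl_params` and its `now` parameter are hypothetical names introduced here for illustration and are not part of pyspider; the returned values are meant to be passed to `self.crawl(...)` as the URL and `itag`, as in the snippets from the original text.

```python
from datetime import datetime


def make_recrawl_params(url, offset, limit, now=None):
    """Build a (url_with_hash, itag) pair that differs on every call.

    Hypothetical helper, not a pyspider API. `now` is injectable so the
    result can be reproduced in tests; by default the current time is used.
    """
    now = now or datetime.now()
    # Same timestamp format as in the original snippet, down to microseconds
    timestamp_str = now.strftime("%Y%m%d_%H%M%S_%f")
    # The fragment makes the URL string unique without changing the request path
    url_with_hash = url + "#" + timestamp_str
    # The itag additionally encodes the paging parameters
    itag = "%s_%s_%s" % (timestamp_str, offset, limit)
    return url_with_hash, itag


fixed = datetime(2019, 4, 9, 16, 20, 18, 413205)
url_with_hash, itag = make_recrawl_params("/selfReadingBookQuery2", 0, 10, now=fixed)
print(url_with_hash)  # /selfReadingBookQuery2#20190409_162018_413205
print(itag)           # 20190409_162018_413205_0_10
```

Inside a handler, the pair would then be used roughly as `self.crawl(url_with_hash, itag=itag, method="POST", ...)`, with the remaining arguments unchanged.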
