Scrapy Basic Functionality Test Project (CSDN Library resource)

16 files in total: 6 py, 6 pyc, 1 cfg

Tags: python, scrapy, web crawler

Points required: 9 | Views: 36 | Uploaded: 2016-08-11 15:53:19 | 21KB ZIP

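Scrapy is a Python crawling framework built for fetching and processing web data. This "Scrapy Basic Functionality Test Project" shows how to use Scrapy to scrape data from web pages and store it in a local MongoDB database.

Start with the architecture. Scrapy is built on the Twisted asynchronous networking library, which lets it handle many concurrent requests efficiently. Its main components are the Spider, Item, Item Pipeline, Request/Response, Downloader Middleware, and Scheduler. A Spider holds the crawl logic: it parses pages and extracts the data you want. Items define the structure of the scraped data, and the Item Pipeline processes that data, for example by cleaning, validating, and storing it.

To begin, create a new Scrapy project by running `scrapy startproject project_name`. Then create a Spider with `scrapy genspider spider_name domain`. In the Spider class, define `start_urls` (the list of URLs to crawl first) and a `parse` method, which handles the page content returned by the downloader.

The "scrape data from a specified web page" task mentioned in the resource description usually means parsing an HTML or XML document. Scrapy's built-in XPath and CSS selectors do the extraction: `response.css('selector')` or `response.xpath('expression')` selects specific elements on the page. The extracted values are placed in an Item object and passed on to the Item Pipeline.

The Item Pipeline is a sequence of processing steps. Each step (a pipeline component) can clean, validate, or transform the data. For example, a pipeline can drop duplicate items or convert items into a form MongoDB accepts.

Next comes the MongoDB configuration. Scrapy has no built-in MongoDB support, but a pipeline can talk to MongoDB through a client library such as `pymongo`. Once the connection settings are in place (this project defines `MONGODB_SERVER`, `MONGODB_PORT`, `MONGODB_DB`, and `MONGODB_COLLECTION` in `settings.py`), the pipeline can write each Item into a MongoDB collection.

In this archive, `MyTest` is the project directory that holds the spider code, configuration files, and supporting files. Inside it, the `spiders` subdirectory contains the Spider code, `items.py` defines the data structure, `pipelines.py` contains the custom Item Pipeline, and `settings.py` holds the project-wide configuration.

In summary, this "Scrapy Basic Functionality Test Project" covers the following topics:

1. The basic architecture and components of the Scrapy framework.
2. Creating and configuring a Scrapy project.
3. Designing and implementing a Spider, including `start_urls`, the parse method, and data extraction with XPath/CSS selectors.
4. Items and Item Pipelines, which define the data structure and the processing flow.
5. Integrating MongoDB so that scraped data is stored in a local database.
6. The role of each file in the `MyTest` directory.

Working through the project gives a practical understanding of how Scrapy crawls and processes web data.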


ScrapyTest.zip (16 files)

MyTest/
    scrapy.cfg 256B
    item_list.json 2KB
    Resources 23KB
    MyTest/
        items.pyc 451B
        pipelines.pyc 1KB
        pipelines.py 798B
        spiders/
            __init__.pyc 139B
            __init__.py 161B
            My_spider.py 788B
            My_spider.pyc 1KB
        __init__.pyc 131B
        items.py 322B
        __init__.py 0B
        settings.py 3KB
        settings.pyc 502B
    Books 51KB

# -*- coding: utf-8 -*-

# Scrapy settings for MyTest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'MyTest'

SPIDER_MODULES = ['MyTest.spiders']
NEWSPIDER_MODULE = 'MyTest.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'MyTest (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'MyTest.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'MyTest.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'MyTest.pipelines.MytestPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "pipeline_db"
MONGODB_COLLECTION = "test"

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
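The `ITEM_PIPELINES` entry and the `MONGODB_*` values in the settings above suggest a pipeline that writes each item to MongoDB. The project's `pipelines.py` is not shown on this page, so the following is only a sketch of how such a pipeline might read those settings, assuming `pymongo` is installed. The class name `MongoPipelineSketch` and the `client_factory` parameter are hypothetical; the latter exists only so the class can be exercised without a running MongoDB server.

```python
class MongoPipelineSketch:
    """Store each scraped item as one document in a MongoDB collection."""

    def __init__(self, server, port, db_name, collection_name,
                 client_factory=None):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        # Hypothetical hook: lets a test substitute a fake client.
        self.client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes settings.py via crawler.settings.
        s = crawler.settings
        return cls(
            server=s.get("MONGODB_SERVER", "localhost"),
            port=s.getint("MONGODB_PORT", 27017),
            db_name=s.get("MONGODB_DB"),
            collection_name=s.get("MONGODB_COLLECTION"),
        )

    def open_spider(self, spider):
        factory = self.client_factory
        if factory is None:
            # Imported lazily so pymongo is only needed for real runs.
            from pymongo import MongoClient
            factory = MongoClient
        self.client = factory(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

To enable a pipeline like this, its dotted path would be registered in `ITEM_PIPELINES` with a priority number, in the same way `'MyTest.pipelines.MytestPipeline': 300` is registered above. Opening the connection in `open_spider` and closing it in `close_spider` keeps one client alive for the whole crawl instead of reconnecting per item.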
