bingpic_爬虫_PYHTNO_源码 (bingpic crawler source code) - CSDN Library

15 files in total: 7 .py, 6 .pyc, 1 .cfg

3 views · Uploaded 2021-10-02 15:17:59 · 10 KB · RAR

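The resource "bingpic_爬虫_PYHTNO_源码" is mainly about web crawling, specifically scraping web data with Python's Scrapy framework. Scrapy is a Python framework designed for crawling websites and extracting structured data. In this project it is used to automate downloading high-resolution images from the Bing search engine.

The description's instruction to "run scrapy in the main program folder" means changing into the project directory on the command line and starting the crawler with the `scrapy` command. Scrapy's command-line tool handles tasks such as creating a new project, running a spider, and listing the spiders in a project. The phrase "enter the keyword to search for and the number of pages to download" means the crawler works from user-supplied parameters. With Scrapy's `-a` option this could look like `scrapy crawl spider -a keyword='example' -a pages=5`, which would search for 'example' and download 5 pages of images. The spider name and argument names in that command are illustrative; the archive's actual interface is not shown on this page.

A Scrapy project typically contains the following components:

1. **Spider**: the core of the crawler. It defines how pages are parsed, which data is extracted, and which links are followed. Here the Spider would parse Bing's image search result pages, find the image URLs, and start the download process.
2. **Item**: defines the structure of the scraped data, such as an image's URL, size, and description.
3. **Item Pipeline**: processes the Items produced by the Spider, for example cleaning, validating, and storing data.
4. **Downloader Middleware**: processes download requests and responses, and can implement download delays, countermeasures against anti-crawling checks, and similar behavior.
5. **Settings**: configures the crawler's behavior, such as the download delay, the number of concurrent requests, and whether cookies are enabled.

"PYHTNO" in the title is most likely a misspelling of "PYTHON" rather than the name of a library. Scrapy sends requests through its own asynchronous downloader, so a separate HTTP client such as `requests` is not required, and calling a blocking client from inside a spider would stall the crawl.

Points to watch when downloading high-resolution images:

1. **Anti-crawling measures**: Bing and other search engines may check the User-Agent, cookies, IP address, and so on, so request headers and middleware may need to be configured to resemble a browser.
2. **Image quality**: to keep images at full resolution, the crawler may need to parse the HTML or CSS to find the URL of the original large image rather than the thumbnail.
3. **Download rate**: to avoid putting excessive load on the server, set a download delay and avoid sending many requests in a short time.
4. **Error handling**: handle network problems such as timeouts, redirects, and 404 errors.
5. **Data storage**: save downloaded images in an organized way, for example grouped by date or keyword.

Taken together, the project is a practical example of building an automated image crawler with Scrapy and Python, useful as a reference for learners and developers.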
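The keyword-and-pages input described above can be turned into one request URL per results page. The following is a minimal sketch, not the project's actual code: the function name `build_search_urls`, the endpoint, and the `first`/`count` query parameters are assumptions for illustration, and Bing's real parameters may differ or change. It uses only the standard library so it can be tried without Scrapy.

```python
from urllib.parse import urlencode

# Assumed endpoint; Bing's real URL scheme may differ and can change.
BING_IMAGE_SEARCH = "https://www.bing.com/images/search"


def build_search_urls(keyword, pages, per_page=35):
    """Return one search URL per requested results page (hypothetical helper)."""
    # Scrapy passes -a arguments to the spider as strings, so coerce here.
    pages = int(pages)
    if pages < 1:
        raise ValueError("pages must be at least 1")
    urls = []
    for page in range(pages):
        query = urlencode({
            "q": keyword,
            # Assumed: 1-based offset of the first result on this page.
            "first": page * per_page + 1,
            "count": per_page,
        })
        urls.append(f"{BING_IMAGE_SEARCH}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_search_urls("example", "5"):
        print(url)
```

In a spider, a helper like this would typically be called from `start_requests`, yielding one `scrapy.Request` per URL.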

bingpic.rar (15 files)

- bingpic/
  - scrapy.cfg (257 B)
  - bingpic/
    - scrapy (0 B)
    - images/
    - middlewares.py (4 KB)
    - pipelines.py (1 KB)
    - spiders/
      - `__pycache__`/
        - `__init__.cpython-38.pyc` (139 B)
        - bingspider.cpython-38.pyc (2 KB)
      - bingspider.py (3 KB)
      - `__init__.py` (161 B)
    - `__pycache__`/
      - settings.cpython-38.pyc (916 B)
      - `__init__.cpython-38.pyc` (131 B)
      - pipelines.cpython-38.pyc (1 KB)
      - items.cpython-38.pyc (397 B)
    - items.py (359 B)
    - `__init__.py` (0 B)
    - settings.py (4 KB)

# Scrapy settings for bingpic project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import os

BOT_NAME = 'bingpic'

SPIDER_MODULES = ['bingpic.spiders']
NEWSPIDER_MODULE = 'bingpic.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bingpic (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Item field that holds the image URLs to download
IMAGES_URLS_FIELD = "pic_url"

# Store downloaded images in the images folder inside the project directory
# Absolute path of the directory containing this file
project_dir = os.path.abspath(os.path.dirname(__file__))
# Image storage path
IMAGES_STORE = os.path.join(project_dir, 'images')

# Minimum height and width of images to process
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bingpic.middlewares.BingpicSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'bingpic.middlewares.BingpicDownloaderMiddleware': 543,
#}

# Browser-like headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bingpic.pipelines.BingpicPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
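
The settings above point `IMAGES_URLS_FIELD` at a field named `pic_url`, which suggests that `BingpicPipeline` builds on Scrapy's `ImagesPipeline`, although `pipelines.py` and `items.py` are not shown on this page. As a hedged sketch of an item that would fit that setting, the dataclass below is hypothetical: the class name `BingpicItem` and the `keyword` field are assumptions, and the real `items.py` may define a `scrapy.Item` instead. Scrapy 2.2 and later accept dataclass objects as items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BingpicItem:
    """Hypothetical item; the archive's real items.py is not shown."""

    # Name must match IMAGES_URLS_FIELD = "pic_url" in settings.py.
    # Scrapy's ImagesPipeline expects a list of URLs in this field.
    pic_url: List[str] = field(default_factory=list)
    # Assumed extra field: the search keyword that produced the image.
    keyword: str = ""
    # "images" is the default IMAGES_RESULT_FIELD, where ImagesPipeline
    # records the download results for each URL.
    images: List[dict] = field(default_factory=list)


item = BingpicItem(pic_url=["https://example.com/a.jpg"], keyword="example")
print(item.pic_url)
```

Wrapping a single URL in a list matters here: if the field held a bare string, the pipeline would iterate over it and treat each character as a URL.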
