Python Crawler for Scraping Data from a Website

14 files in total

py: 7

xml: 5

iml: 1

Uploaded 2024-01-23 17:26:46, 10KB ZIP

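Python crawling is a technique for automating web data extraction: a crawler traverses pages and extracts the information you need. This example uses the Scrapy framework for that purpose. Scrapy is a full-featured Python crawling framework that handles request scheduling, page parsing, and data storage, which makes crawlers much easier to develop.

First, install Scrapy by running `pip install scrapy` on the command line. Once it is installed, use Scrapy's command-line tool to create a new project. For example, `scrapy startproject my_crawler` creates a project named `my_crawler`.

Next, define the spider. In the project's `my_crawler/my_crawler/spiders` directory, create a new Python file such as `myspider.py` and define the spider class in it. The class must inherit from Scrapy's `Spider` class and set `name` (the spider's unique identifier), `start_urls` (the list of URLs the crawl starts from), and any other attributes it needs.

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Process the response and extract data here
        pass
```

`parse` is the default callback of a Scrapy spider: Scrapy calls it with the response for each URL in `start_urls`. Inside this method you can use Scrapy's selectors (XPath or CSS) to parse the HTML and extract the data you need. For example, to extract the name and price of every product:

```python
def parse(self, response):
    for product in response.css('div.product'):
        name = product.css('h2::text').get()
        price = product.css('span.price::text').get()
        yield {'name': name, 'price': price}
```

Here `response.css` returns a `SelectorList` that can be used to match and narrow down HTML elements. `::text` selects an element's text nodes, and `.get()` returns the first match (or `None` if nothing matched). Because `parse` uses `yield`, it is a generator: each yielded item is handed to Scrapy, which can export it in a chosen output format such as JSON or CSV, or store it in a database through a pipeline.

Scrapy also provides two kinds of components, middlewares and pipelines. Middlewares hook into the processing of requests and responses, while pipelines clean and store the scraped items. Middlewares are the place for logic such as user agents, request retries, and IP proxies; pipelines can drop duplicate items, validate data, and so on. Spider behavior is configured in the project's `settings.py`, for example the download delay (`DOWNLOAD_DELAY`, which helps avoid an IP ban caused by overly frequent requests) and which middlewares and pipelines are enabled.

Python combined with the Scrapy framework gives you an efficient and flexible solution for scraping web data. With some study and practice you can build crawlers that fit your requirements and collect and process large amounts of web data. In real-world use, follow the site's robots.txt, respect its copyright, and use crawling techniques lawfully.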
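As an illustration of the pipeline role described above (removing duplicates and validating data), here is a minimal sketch. The class name `DedupeAndValidatePipeline` is hypothetical, and the `name` and `price` keys are assumed to match the product example; the `process_item(item, spider)` method and the `DropItem` exception are Scrapy's standard pipeline interface. The `ImportError` fallback is there only so the sketch can be tried without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class DedupeAndValidatePipeline:
    """Hypothetical pipeline: drops incomplete and duplicate product items."""

    def __init__(self):
        # Names of the products already seen during this crawl.
        self.seen_names = set()

    def process_item(self, item, spider):
        # Normalize the two fields; a missing field is treated as empty.
        name = (item.get("name") or "").strip()
        price = (item.get("price") or "").strip()

        # Validation: both fields must be present and non-empty.
        if not name or not price:
            raise DropItem("missing name or price")

        # De-duplication: keep only the first item with a given name.
        if name in self.seen_names:
            raise DropItem(f"duplicate item: {name}")
        self.seen_names.add(name)

        item["name"], item["price"] = name, price
        return item
```

To enable such a pipeline, it would be registered under `ITEM_PIPELINES` in `settings.py`, for instance as `{'my_crawler.pipelines.DedupeAndValidatePipeline': 300}` (the module path is an assumption about where the class is saved).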

Python爬虫爬取某网站数据.zip (14 files)

Python爬虫爬取某网站数据/
    scrapy.cfg 264B
    .idea/
        qianduoduo.iml 284B
        workspace.xml 1KB
        misc.xml 749B
        inspectionProfiles/
            Project_Default.xml 155B
            profiles_settings.xml 228B
        modules.xml 272B
    qianduoduo/
        __init__.py 0B
        pipelines.py 2KB
        spiders/
            __init__.py 161B
            investment.py 3KB
        items.py 565B
        settings.py 3KB
        middlewares.py 2KB

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'dahv'

import logging
import re

import scrapy

from qianduoduo.items import InvestmentItem


class InvestmentSpider(scrapy.Spider):
    # A plain Spider is enough here: no CrawlSpider rules are defined, and
    # CrawlSpider reserves parse() for its own rule handling.
    name = 'invest'
    custom_settings = {'ITEM_PIPELINES': {'qianduoduo.pipelines.QianduoduoPipeline': 300}}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # allowed_domains takes bare domain names, not URLs
        self.allowed_domains = ['d.com.cn']
        self.start_urls = ['http://d.com.cn/lend-0-0-0-1.html']
        self.last_page_xpath = "//div[@id='page']//a[last()-1]/@href"
        self.link_xpath = "//td[@class='line4']/table//@href"
        self.box_xpath = "//div[@class='box_r_left']"
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'd.com.cn',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
        }
        # XPaths of the fields to extract from each detail page
        self.project_name_xpath = "//h2"
        self.loan_amount_xpath = "//li[@class='l_amount']/div"
        self.annual_return_xpath = "//li[@class='l_allot']/div"
        self.repayment_method_xpath = "//li[@class='l_paytype']/div"

    def parse(self, response):
        logging.info("=====GET SUCCESS=======")
        # The second-to-last pagination link points at the last page
        last_page = response.xpath(self.last_page_xpath).get(default='')
        # The total page count is the number in front of ".html"
        match = re.search(r"-0-(\d+)\.html", last_page)
        if match is None:
            logging.warning("Could not determine the total page count")
            return
        total_page = int(match.group(1))
        # Request every listing page (range replaces Python 2's xrange)
        for page in range(1, total_page + 1):
            yield scrapy.Request(url='http://d.com.cn/lend-0-0-0-{}.html'.format(page),
                                 headers=self.headers,
                                 dont_filter=True,
                                 callback=self.parse_page)

    def parse_page(self, response):
        logging.info('============REQUEST PAGE SUCCESSFULLY!!===============')
        # Each box on a listing page links to one project detail page
        for box in response.xpath(self.box_xpath):
            secondary_href = box.xpath(".//div[@class='top_title_inner']/a/@href").get()
            if secondary_href:
                # urljoin leaves absolute URLs unchanged and resolves relative ones
                yield scrapy.Request(url=response.urljoin(secondary_href),
                                     dont_filter=True,
                                     callback=self.parse_item)

    @staticmethod
    def _first_text(response, xpath):
        """Return the stripped text of the first match, or '' if nothing matches."""
        return response.xpath(xpath).xpath("string(.)").get(default='').strip()

    def parse_item(self, response):
        logging.info('==========START CRAWLER=================')
        item = InvestmentItem()
        # Values stay as str; Python 3 needs no .encode("utf-8") here
        item['project_name'] = self._first_text(response, self.project_name_xpath)
        item['loan_amount'] = self._first_text(response, self.loan_amount_xpath)
        item['annual_return'] = self._first_text(response, self.annual_return_xpath)
        item['repayment_method'] = self._first_text(response, self.repayment_method_xpath)
        item['secondary_href'] = response.url
        yield item
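The pagination step in `parse` (read the page number out of the last-page link, then build one URL per listing page) can be tried in isolation, without Scrapy. The sketch below uses only the standard library; the names `PAGE_RE`, `URL_TEMPLATE`, and `page_urls` are hypothetical helpers introduced for illustration, while the URL pattern is the one used by the spider.

```python
import re

# Matches the page number just before ".html", e.g. "lend-0-0-0-12.html" -> "12".
PAGE_RE = re.compile(r"-0-(\d+)\.html$")
URL_TEMPLATE = "http://d.com.cn/lend-0-0-0-{}.html"


def page_urls(last_page_href):
    """Build the listing-page URLs from the href of the last-page link.

    Returns an empty list when the href does not contain a page number.
    """
    match = PAGE_RE.search(last_page_href)
    if match is None:
        return []
    total_pages = int(match.group(1))
    return [URL_TEMPLATE.format(page) for page in range(1, total_pages + 1)]


if __name__ == "__main__":
    for url in page_urls("lend-0-0-0-3.html"):
        print(url)
```

Keeping this logic in a plain function makes it easy to unit-test separately from the network-facing spider.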
