#!/usr/bin/env python
# coding:utf-8
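"""Scrapy CrawlSpider for Google Play Store app detail pages.

Seed URLs are gathered from the store landing pages, detail pages are matched by
the '/store/apps/details' pattern, and each page is parsed into a GoogleItem
(name, category, developer, score, rating histogram, installs, version and
Android requirements).
"""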
import json
import re
import time
from lxml import etree
import requests
from requests import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from store.google.items import GoogleItem


def get_start_urls():
    """Collect seed URLs by scraping /store links from the Play Store landing pages."""
    google_urls = {'https://play.google.com/store', 'https://play.google.com/store/apps/top',
                   'https://play.google.com/store/apps/new'}
    start_urls = list()
    for google_url in google_urls:
        try:
            result = requests.get(google_url, timeout=5)
        except requests.RequestException:
            # Skip landing pages that cannot be fetched instead of aborting the whole seed list.
            continue
        if isinstance(result, Response):
            result = result.content
        dom_tree = etree.HTML(result)
        for link in dom_tree.xpath("//@href"):
            # Only relative /store links are turned into absolute seed URLs.
            if link.startswith('/store'):
                link = 'https://play.google.com' + link
                if link not in start_urls:
                    start_urls.append(link)
    return start_urls


class GoogleSpider(CrawlSpider):
    name = "google"
    allowed_domains = ["play.google.com"]
    start_urls = get_start_urls()
    rules = [
        Rule(LinkExtractor(allow=('/store/apps/details',)),
             callback='parse_app', follow=True),
    ]

    def parse_app(self, response):
        self.do_nothing()
        item = GoogleItem()
        self.init_item(item)
        try:
            item['url'] = response.url
            # App name: strip astral-plane characters (emoji etc.) before storing.
            tmp = response.xpath('//h1[@class="AHFaub"]/span[1]').xpath('text()').extract()
            tmp = tmp[0] if tmp else ''
            try:
                # Wide (UCS-4) builds can match astral characters directly.
                high_points = re.compile(u'[\U00010000-\U0010ffff]')
            except re.error:
                # Narrow (UCS-2) builds see them as surrogate pairs instead.
                high_points = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
            item['app_name'] = high_points.sub(u'', tmp)
            list_tmp = response.xpath('//a[@itemprop="genre"]').xpath('text()').extract()
            tmp = ''
            for lt in list_tmp:
                tmp += ' ' + lt
            item['category'] = tmp
            # Score: whole stars come from counting the filled-star divs; the fractional
            # part is read from the width style of the partially filled star.
            tmp = response.xpath('//div[@class="dNLKff"]/c-wiz/div/div//'
                                 'div[@class="L0jl5e bUWb7c cm4lTe"]/div/@style').extract()
            tmp = float(tmp[0].strip('%').strip('width: ')) / 100.0 if tmp else float(0)
            tmp2 = response.xpath('count(//div[@class="dNLKff"]/c-wiz/div/div//'
                                  'div[@class="vQHuPe bUWb7c"])').extract()
            item['score'] = float(tmp2[0]) + tmp if tmp2 else tmp
            tmp = response.xpath('//a[@class="hrTbp R8zArc"]').xpath('text()').extract()
            tmp = tmp[0] if tmp else ''
            item['provider'] = high_points.sub(u'', tmp)
            tmp = response.xpath('//span[@class="AYi5wd TBRnV"]/span[1]').xpath('text()').extract()
            tmp = tmp[0] if tmp else ''
            item['raters'] = tmp
            # "Additional information" block: update time, size, installs, version, requirements.
            list_tmp = response.xpath('//div[@class="W4P4ne "]/div[2]/div/div/span/div/span') \
                .xpath('text()').extract()
            try:
                self.get_some_cols(list_tmp, item)
            except Exception as e:
                self.logger.error(str(e) + str(list_tmp) + str(item['url']))
                self.re_get_some_cols(item)
            tmp = response.xpath('//div[@class="W4P4ne "]/div[2]/div/div/span/div/span/div') \
                .xpath('text()').extract()
            tmp = tmp[0] if tmp else ''
            item['content_level'] = tmp
            # Rating histogram: titles are listed from 5 stars down to 1, so reverse them
            # into a mapping keyed 1 (lowest) to 5 (highest) before serialising to JSON.
            list_tmp = response.xpath('//div[@class="mMF0fd"]/span/@title').extract()
            tmp = {}
            for i in range(len(list_tmp)):
                tmp[i + 1] = str(list_tmp[len(list_tmp) - i - 1])
            item['score_detail'] = json.dumps(tmp)
            yield item
        except Exception as e:
            self.logger.error(str(e) + str(item['url']))

    def format_time(self, uf_time):
        """Convert a Play Store date such as 'March 7, 2019' to '2019-03-07'."""
        month_dict = {'January': '01', 'February': '02', 'March': '03', 'April': '04',
                      'May': '05', 'June': '06', 'July': '07', 'August': '08',
                      'September': '09', 'October': '10', 'November': '11', 'December': '12'}
        if '' != uf_time:
            try:
                split_time = uf_time.replace(',', '').split()
                if len(split_time) == 3:
                    month = month_dict[split_time[0]]
                    day = split_time[1]
                    year = split_time[2]
                    st_created_time = year + '-' + month + '-' + day
                    time_array = time.strptime(st_created_time, '%Y-%m-%d')
                    return time.strftime('%Y-%m-%d', time_array)
            except Exception as e:
                self.logger.error(str(e))
        # Fall back to the raw string if the date cannot be parsed.
        return uf_time

    def init_item(self, item):
        self.do_nothing()
        # Pre-fill every field with an empty string so partially parsed items stay consistent.
        for col in 'url,app_name,category,provider,score,score_detail,raters,app_update_time,' \
                   'app_size,installation_times,current_app_version,android_requirements,' \
                   'content_level'.split(','):
            item[col] = ''

    def re_get_some_cols(self, item):
        """Re-fetch the detail page directly and retry the 'additional information' fields."""
        result = ''
        try:
            result = requests.get(item['url'], timeout=5)
        except requests.RequestException as e:
            self.logger.error(str(e))
        if isinstance(result, Response):
            result = result.content
        else:
            # Nothing was fetched; keep whatever values are already on the item.
            return
        dom_tree = etree.HTML(result)
        list_tmp = dom_tree.xpath('//div[@class="W4P4ne "]/div[2]/div/div/span/div/span/text()')
        try:
            self.get_some_cols(list_tmp, item)
        except Exception as e:
            self.logger.error(str(e) + str(list_tmp) + str(item['url']))

    def get_some_cols(self, list_tmp, item):
        # The 'additional information' rows omit the app size for some apps, so field
        # positions shift depending on whether the second entry is the install count
        # (contains '+') or the size.
        if list_tmp:
            item['app_update_time'] = self.format_time(list_tmp[0])
            if len(list_tmp) > 1:
                if '+' in list_tmp[1]:
                    item['installation_times'] = list_tmp[1]
                    item['current_app_version'] = list_tmp[2].strip()
                    item['android_requirements'] = list_tmp[3]
                else:
                    item['app_size'] = list_tmp[1]
                    item['installation_times'] = list_tmp[2]
                    if 'and up' in list_tmp[3]:
                        item['android_requirements'] = list_tmp[3]
                    else:
                        item['current_app_version'] = list_tmp[3].strip()
                        item['android_requirements'] = list_tmp[4]

    def do_nothing(self):
        # Intentional no-op.
        pass
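

# Usage note (a sketch, assuming a standard Scrapy project layout with this spider
# registered under the store.google package; run from the project root):
#   scrapy crawl google -o google_apps.json
#
# GoogleItem (imported from store.google.items above) is assumed to declare one
# scrapy.Field per column initialised in init_item(), roughly:
#
#   import scrapy
#
#   class GoogleItem(scrapy.Item):
#       url = scrapy.Field()
#       app_name = scrapy.Field()
#       category = scrapy.Field()
#       provider = scrapy.Field()
#       score = scrapy.Field()
#       score_detail = scrapy.Field()
#       raters = scrapy.Field()
#       app_update_time = scrapy.Field()
#       app_size = scrapy.Field()
#       installation_times = scrapy.Field()
#       current_app_version = scrapy.Field()
#       android_requirements = scrapy.Field()
#       content_level = scrapy.Field()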