zhaopin-python爬虫案例-招聘网站信息爬取.rar

11 files in total

py: 9

jpg: 1

cfg: 1

Uploaded 2024-05-31 14:29:46, 530KB, RAR

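The archive "zhaopin-python爬虫案例-招聘网站信息爬取.rar" is a worked example of web scraping with Python. Its main goal is to collect information from a recruitment website. The case teaches how to use Python to gather data from web pages automatically, in particular job-related information such as job titles, responsibilities, salary and benefits, and candidate requirements.

A Python crawler is an important data-acquisition tool: the developer writes code that imitates browser behavior, walks through web pages, parses them, and extracts the required data. This case may draw on the following core topics:

1. **Python basics**: the core syntax, including variables, data types, control structures (loops and conditionals), functions, and the use of modules.
2. **The requests library**: requests sends HTTP requests such as GET and POST and is the usual way for a crawler to fetch page content. You need to know how to set request headers, handle cookies, and deal with redirects.
3. **The BeautifulSoup library**: BeautifulSoup parses HTML and XML documents and extracts data from them. You need to know how to create a parser, locate and traverse HTML elements, and extract the text inside specific tags.
4. **Regular expressions (the re module)**: used to match and search strings, which helps locate and extract the target information more precisely.
5. **Data processing and storage**: scraped data usually needs cleaning and tidying. The pandas library can hold it in a DataFrame for manipulation, and the result can then be saved as a CSV or Excel file.
6. **Exception handling**: a crawler can hit many kinds of failures, such as network connection problems or server error responses, so it needs exception-handling code to stay robust.
7. **IP proxies and anti-scraping measures**: to avoid being blocked by the target site, you may need IP proxies to change the request origin, and you should understand common ways of coping with anti-scraping measures, such as rotating the User-Agent and delaying requests.
8. **Ethics and law**: when scraping, you must comply with the relevant laws and regulations, respect the site's robots.txt file, and avoid placing excessive load on the target site.
9. **Hands-on experience**: writing real code shows how to apply the theory to an actual project and solve actual problems.

Working through this case, you learn step by step how to analyze the page structure of a recruitment site, write the crawler script, parse the HTML, and store and process the scraped data. Through practice you pick up the basic techniques of Python scraping and also improve your problem-solving skills.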
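
The parse, clean, and store steps from topics 3 to 5, plus the robots.txt check from topic 8, can be sketched as follows. This is a minimal illustration, not the archive's code: it uses only the standard library (`html.parser` in place of BeautifulSoup, `csv` in place of pandas), and the sample markup, the class names `job`, `title`, and `salary`, and the `15-25K` salary format are all assumptions made for the example.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical markup: real job sites use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Python Developer</span><span class="salary">15-25K</span></li>
  <li class="job"><span class="title">Data Engineer</span><span class="salary">20-35K</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collect the text of <span class="title"> and <span class="salary"> per job."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "job":
            self.jobs.append({})
        elif tag == "span" and cls in ("title", "salary"):
            self._field = cls

    def handle_data(self, data):
        if self._field and self.jobs:
            job = self.jobs[-1]
            job[self._field] = job.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


SALARY_RE = re.compile(r"(\d+)-(\d+)K", re.IGNORECASE)


def parse_salary(text):
    """Turn '15-25K' into (15000, 25000); return None for any other format."""
    match = SALARY_RE.fullmatch(text.strip())
    if match is None:
        return None
    return int(match.group(1)) * 1000, int(match.group(2)) * 1000


def extract_jobs(html):
    """Parse the HTML and return cleaned rows, skipping unparseable salaries."""
    parser = JobParser()
    parser.feed(html)
    rows = []
    for job in parser.jobs:
        salary = parse_salary(job.get("salary", ""))
        if salary is None:
            continue
        rows.append({
            "title": job.get("title", "").strip(),
            "min_salary": salary[0],
            "max_salary": salary[1],
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "min_salary", "max_salary"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was already downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Calling `to_csv(extract_jobs(SAMPLE_HTML))` yields CSV text ready to write to a file. In a real crawler the HTML would come from an HTTP response, and `allowed_by_robots` would be consulted before each request.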


zhaopin-python爬虫案例-招聘网站信息爬取.rar (11 files)

zhaopin-python爬虫案例-招聘网站信息爬取
    __init__.py 0B
    zhaopin
        __init__.py 0B
        pipelines.py 1KB
        spiders
            __init__.py 161B
            boss.py 2KB
        items.py 502B
        settings.py 3KB
        middlewares.py 5KB
    fuli.jpg 524KB
    scrapy.cfg 257B
    start.py 103B

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http.response.html import HtmlResponse
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
from PIL import Image
import base64
import io
import json
import requests
import time


class ZhaopinDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        options = webdriver.ChromeOptions()
        # Ignore certificate errors
        options.add_argument('--ignore-certificate-errors')
        # Hide window.navigator.webdriver so the site's bot detection
        # does not flag the Selenium-driven browser
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        options.add_argument("--disable-blink-features=AutomationControlled")
        self.browser = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.browser, 10)

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        if spider.name == 'boss':
            self.browser.get(request.url)
            time.sleep(1)
            html = self.browser.page_source
            # Close the login prompt if it appears
            if '立即登录，享受优质服务' in html:
                # find_element_by_css_selector was removed in Selenium 4
                self.browser.find_element(By.CSS_SELECTOR, '.closeIcon').click()
                print("Login prompt detected, closing it...")
                time.sleep(1)
            # Handle the IP verification (captcha) page
            elif '当前IP地址可能存在异常访问行为，完成验证后即可正常使用' in html:
                self.browser.find_element(By.CSS_SELECTOR, ".btn").click()
                captcha_element = self.wait.until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '.geetest_widget')))
                loc = captcha_element.location
                size = captcha_element.size
                left, top = loc['x'], loc['y']
                right, bottom = loc['x'] + size['width'], loc['y'] + size['height']
                # Resize the browser window to the full page size
                width = self.browser.execute_script('return document.documentElement.scrollWidth')
                height = self.browser.execute_script('return document.documentElement.scrollHeight')
                self.browser.set_window_size(width, height)
                # get_screenshot_as_png() returns PNG bytes, so wrap them for PIL
                screenshot = Image.open(io.BytesIO(self.browser.get_screenshot_as_png()))
                captcha = screenshot.crop((left, top, right, bottom))
                # Re-encode the crop as PNG; tobytes() would return raw pixel data
                buffer = io.BytesIO()
                captcha.save(buffer, format='PNG')
                captcha_bytes = buffer.getvalue()
                result_str = ZhaopinDownloaderMiddleware.base64_api('xxx', 'xxx', captcha_bytes)
                print(result_str)
                for cor in result_str.split('|'):
                    x, y = cor.split(',')
                    # Note: Selenium 4.3+ measures this offset from the element's
                    # center; earlier versions measure it from the top-left corner
                    ActionChains(self.browser).move_to_element_with_offset(
                        captcha_element, int(x), int(y)).click().perform()
                    time.sleep(.3)
                # Click confirm
                self.browser.find_element(By.CSS_SELECTOR, '.geetest_commit_tip').click()
                time.sleep(1)
            # Re-read the page so the response reflects the state after any dialog handling
            html = self.browser.page_source
            return HtmlResponse(url=request.url, body=html, status=200, encoding='utf-8')
        return None

    @staticmethod
    def base64_api(uname, pwd, img: bytes, typeid=21):
        # Send the image to the captcha-recognition platform
        b64 = base64.b64encode(img).decode('utf8')
        data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
        result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
        if result['success']:
            return result["data"]["result"]
        else:
            return result["message"]

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
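
In the middleware above, `base64_api` returns either a coordinate string of the form `x1,y1|x2,y2` or, on failure, an error message, and the click loop splits whatever comes back. A minimal sketch of a stricter parsing step is shown below. The helper names `parse_click_points` and `to_center_offsets` are hypothetical and not part of the archive. The second helper assumes the recognition service reports coordinates relative to the image's top-left corner, and that the installed Selenium measures `move_to_element_with_offset` from the element's center (the behavior in Selenium 4.3 and later, as far as I know).

```python
def parse_click_points(result_str):
    """Parse 'x1,y1|x2,y2' into [(x1, y1), (x2, y2)].

    Raises ValueError when the text is not a coordinate list, for example
    when the recognition service returned an error message instead.
    """
    points = []
    for part in result_str.split("|"):
        x_text, sep, y_text = part.partition(",")
        if not sep:
            raise ValueError(f"unexpected recognition result: {result_str!r}")
        # int() also raises ValueError on non-numeric text
        points.append((int(x_text), int(y_text)))
    return points


def to_center_offsets(points, width, height):
    """Convert top-left-based points to offsets from the element's center."""
    return [(x - width // 2, y - height // 2) for x, y in points]
```

With these helpers, the click loop could catch `ValueError` and retry or skip the request instead of crashing on an error message. The integer center used here is only an approximation of the point Selenium computes.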
