Scraping Housing Price Information (Python Web Scraper)

1 file in total (1 .py file)

Uploaded 2021-10-02 08:29:06 · 1KB ZIP

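In Python, web scraping is a common technique for collecting web page data automatically. This project walks through using a Python scraper to collect nationwide housing price data and store it in an Excel workbook for later analysis. It helps to know the basic tooling first: `requests` sends HTTP requests, `BeautifulSoup` or `lxml` parses HTML and XML documents, `pandas` handles data processing and analysis, and `openpyxl` or `xlwt` creates and edits Excel files.

1. **Python requests library**: `requests` is a simple library for sending HTTP requests. To scrape housing prices, we send a GET request to a real-estate site and read back the page content. For example:

```python
import requests

url = "http://example.com/house_prices"
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
html_content = response.text
```

`response.text` returns the page's HTML source as a string.

2. **HTML parsing**: HTML is usually parsed with `BeautifulSoup`, which locates target data by CSS selector, tag name, attribute, and similar criteria. For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
house_prices = soup.select('.price')  # assumes the price data sits in elements with class 'price'
```

3. **Data extraction and cleaning**: Once the prices are extracted, they usually need cleaning: stripping unrelated characters, converting to numeric types, and so on. This can be done with the `re` (regular expression) module or directly on the BeautifulSoup objects.

4. **Processing data with pandas**: `pandas` is a standard tool for data analysis. Storing the prices in a DataFrame makes later processing easier. For example:

```python
import re
import pandas as pd

data = {'city': [], 'price': []}
for price in house_prices:
    # Assumes each '.price' element holds a <span class="city"> child plus the price text
    city = price.find('span', class_='city').text.strip()
    text = price.text.replace(city, '')  # keep the city name out of the price text
    number = float(re.search(r'\d+(?:\.\d+)?', text).group())
    # '万' means ten thousand, so scale the number rather than appending zeros
    value = number * 10000 if '万' in text else number
    data['city'].append(city)
    data['price'].append(value)
df = pd.DataFrame(data)
```

5. **Writing to Excel**: The `to_excel` method of a `pandas` DataFrame saves the data to an Excel file:

```python
df.to_excel('national_housing_prices.xlsx', index=False)
```

For specific Excel formatting, `openpyxl` or `xlsxwriter` can be used directly; both offer more customization options.

6. **Pitfalls and troubleshooting**: Real-world scraping may run into anti-scraping measures and request limits. Typical countermeasures are setting a User-Agent header, routing requests through proxy IPs, simulating a login, and adding delays between requests.

7. **Multithreading and asynchronous requests**: To speed up scraping, `concurrent.futures` provides thread and process pools for issuing requests in parallel, while `asyncio` together with `aiohttp` supports asynchronous requests.

8. **Ethics and legality**: Staying legal and compliant is a core principle of scraping. Follow the site's robots.txt rules, avoid hammering the server with frequent requests, respect the site's copyright, and do not infringe on anyone's privacy.

In short, a Python scraper combined with HTML parsing, data processing, and Excel storage can collect and organize large amounts of housing price data efficiently, giving later analysis and decision-making a solid basis. With further study and practice, the same approach extends to more complex scraping systems that serve more varied data needs.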
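The cleaning step in item 3 is described without an example, so here is a minimal sketch using only the standard `re` module. The helper name `parse_price`, the handling of the `万` unit, and the sample strings are all assumptions made for illustration; real price strings on a given site may need different rules.

```python
import re

# Hypothetical helper: name, unit handling, and sample inputs are assumptions.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_price(raw):
    """Convert a scraped price string to a float in yuan, or None if no number is found."""
    text = raw.replace(",", "").strip()  # drop thousands separators and padding
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group())
    if "万" in text:  # '万' denotes ten thousand
        value *= 10_000
    return value


print(parse_price("12,345元/㎡"))  # 12345.0
print(parse_price("1.5万元"))      # 15000.0
print(parse_price("暂无数据"))      # None
```

Returning `None` for unparseable input, rather than raising, lets a caller decide whether to skip the row or log it.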
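The advice in items 6 and 8 about adding delays and not requesting too often can be sketched as a small wrapper that pauses after every attempt and retries a limited number of times. The function name `fetch_politely`, its parameters, and the default delay range are assumptions for illustration; the `fetch` callable is supplied by the caller, so the sketch does not depend on any particular HTTP library.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, retries=2, sleep=time.sleep):
    """Fetch each URL in turn, pausing after every attempt and retrying failures.

    `fetch` is any callable that takes a URL and returns the page content.
    URLs that still fail after all retries map to None in the result.
    """
    results = {}
    for url in urls:
        results[url] = None
        for _ in range(retries + 1):
            try:
                results[url] = fetch(url)
                succeeded = True
            except Exception:  # broad on purpose in this sketch; narrow it in real code
                succeeded = False
            # A randomized pause after every attempt spreads requests out
            sleep(random.uniform(min_delay, max_delay))
            if succeeded:
                break
    return results
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`, where `headers` carries the User-Agent mentioned in item 6.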


爬取房价信息.zip (1 file)

爬取房价信息.py 3KB

import requests
import bs4
import openpyxl

year = input("Enter the year to query: ")


def get():
    """Download the nationwide housing price page for the chosen year."""
    url = "https://www.58.com/fangjiawang/quanguo-%s/" % year
    user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    # The original rebuilt the Referer from url[-5:-1]; for a four-digit year that equals url
    headers = {"User-Agent": user_agent, "Referer": url}
    # Pass headers by keyword: the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    return response


def targets():
    """Return the three <ul class="box-price"> lists: ranking, increases, decreases."""
    response = get()
    content = bs4.BeautifulSoup(response.text, "html.parser")
    # find_all returns a list-like ResultSet, which has no tag attributes of its own;
    # index into it to get the individual elements
    targetall = content.find_all("ul", class_="box-price")
    return targetall


'''
target = targetall[0].find_all("a")
print("2020 national housing price ranking")
for item in target:
    print(item.b.text, end="\t")
    print(item.span.text, end="\t")
    print(item.em.text, end="\n")
target = targetall[1].find_all("a")
print("2020 national housing price increases")
for item in target:
    print(item.b.text, end="\t")
    print(item.span.text, end="\t")
    print(item.em.text, end="\n")
target = targetall[2].find_all("a")
print("2020 national housing price decreases")
for item in target:
    print(item.b.text, end="\t")
    print(item.span.text, end="\t")
    print(item.em.text, end="\n")
'''


def ToExcel():
    """Write the three ranking lists to separate sheets of one workbook."""
    targetall = targets()
    wb = openpyxl.Workbook()  # create a new workbook
    # Sheet titles are kept short because Excel limits them to 31 characters
    titles = ["%s Price Ranking", "%s Price Increases", "%s Price Decreases"]
    for index, title in enumerate(titles):
        # Insert a new worksheet at position index
        ws = wb.create_sheet(index=index, title=title % year)
        ws["A1"] = "Region"
        ws["B1"] = "Price"
        ws["C1"] = "Growth rate"
        count = 1
        for target in targetall[index].find_all("a"):
            count += 1
            ws["A%s" % count] = target.b.text
            ws["B%s" % count] = target.span.text
            ws["C%s" % count] = target.em.text
    string = "%s_national_housing_prices.xlsx" % year
    wb.save(string)  # save as an Excel file


if __name__ == "__main__":
    ToExcel()
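The script above interpolates whatever the user types into the URL, so a small validation step may be worth adding before any request is sent. The sketch below is one possible approach using only the standard library; the function name `build_url` is hypothetical, the URL pattern is copied from the script, and which years the site actually serves is not something the source states.

```python
import re

# URL pattern copied from the script above
BASE_URL = "https://www.58.com/fangjiawang/quanguo-%s/"


def build_url(year_text):
    """Return the listing URL for a four-digit year, or raise ValueError.

    Hypothetical helper: it only checks the shape of the input, not whether
    the site has data for that year.
    """
    year_text = year_text.strip()
    if not re.fullmatch(r"[0-9]{4}", year_text):
        raise ValueError("year must be four digits, got %r" % year_text)
    return BASE_URL % year_text


print(build_url("2020"))  # https://www.58.com/fangjiawang/quanguo-2020/
```

Inside `get()`, the first line could then become `url = build_url(year)`, so that malformed input fails with a clear message instead of producing a bad request.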
