Python crawler development: batch-crawling all articles from a Sina Blog page with Python, including source code and a sample dataset (.rar resource) - CSDN文库

139 files in total

html: 138

py: 1


python

crawler

dataset

5 stars (above 95% of resources) · 66 views · uploaded 2022-06-18 12:02:06 · 1.62 MB · RAR

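This resource covers building a web crawler in Python that batch-downloads article pages from Sina Blog. Crawlers are a common way to collect large amounts of data from the web, and they work well on regularly structured pages such as a blog's article list. The project consists of the source file `Crawl_sina_blog.py` and the pages it downloaded.

Python crawlers are usually built on libraries such as Requests, BeautifulSoup, or Scrapy: Requests sends the HTTP request, and BeautifulSoup parses the returned HTML to extract fields such as the article title, author, publication date, and body. The script included in this resource is simpler. It relies only on the standard library (`urllib`, `time`, `os`) and finds article links with plain string searches. The steps of a crawler of this kind are:

1. **HTTP requests and HTML parsing**: The crawler sends a GET request to the Sina Blog server and receives the HTML source of a page. With Requests and BeautifulSoup, the HTML would then be parsed into a tree in which article-related tags can be looked up, for example `<h2>` (possibly the title), `<p>` (paragraphs, possibly the body) and `<span>` (possibly author or date). The included script instead calls `urllib.urlopen()` and searches the raw text for `<a title=`, `href=` and `.html`.
2. **Data extraction**: In a parsed document, elements are located with CSS selectors through BeautifulSoup's `.select()`, or by tag and attribute through `.find_all()`; XPath expressions require a different library such as lxml. The text of a matched element is then read through its attributes.
3. **Looping and pagination**: A blog's articles are spread over several list pages, so the crawler has to page through them, either by changing a page number in the URL or by following the "next page" link. The included script increments the page number in the URL from 1 to 7 and reads up to 50 article links per page.
4. **Data storage**: Results are written to files or to a database. The included script saves each article page unchanged as an HTML file. Extracted fields are commonly stored as CSV (the `csv` module), JSON (the `json` module), or in a local SQLite database (the `sqlite3` module, or SQLAlchemy).
5. **Error handling and delays**: Requesting pages too quickly can get an IP address blocked, so crawlers usually catch request errors, retry a limited number of times, and pause between requests. The included script calls `time.sleep(1)` after each download to reduce load on the server; it does not retry failed requests.
6. **Sample dataset**: `hanhan` is the directory the script writes to (`hanhan/<page>/`), and is likely the name of the blogger whose articles were crawled. The downloaded pages can be used to check the crawler's output, or as input for data analysis and text mining.

The project shows the whole process of batch-crawling Sina Blog articles with Python: sending requests, reading responses, extracting links, and storing the results. Working through `Crawl_sina_blog.py` is good practice for developers who want to improve their crawling skills.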
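The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `Crawl_sina_blog.py`: `html.parser` stands in for BeautifulSoup, the names `TitleLinkParser`, `page_urls`, `fetch_with_retry` and `to_csv` are hypothetical, and the retry count and delay are assumed values. Only the URL pattern is taken from the included script.

```python
import csv
import io
import time
import urllib.request
from html.parser import HTMLParser

# URL pattern from the included script; the page range is passed in by the caller.
BASE_URL = "http://blog.sina.com.cn/s/articlelist_1191258123_0_{page}.html"


class TitleLinkParser(HTMLParser):
    """Collect <a> tags that carry both a title and an href ending in .html."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        title = attrs.get("title")
        href = attrs.get("href") or ""
        if title and href.endswith(".html"):
            self.links.append({"title": title, "url": href})


def page_urls(first=1, last=7):
    """Yield one article-list URL per page number (pagination step)."""
    for page in range(first, last + 1):
        yield BASE_URL.format(page=page)


def http_get(url, timeout=10):
    """Default fetcher: a plain GET via the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_get, retries=3, delay=1.0):
    """Call fetch(url); on OSError wait `delay` seconds and try again."""
    attempts = max(1, retries)
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


def to_csv(records, fieldnames=("title", "url")):
    """Serialise a list of dicts to CSV text (storage step)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The fetcher is passed in as a parameter so that the retry logic can be exercised without touching the network. A real run would feed each decoded page from `fetch_with_retry` into a `TitleLinkParser` and write the collected `links` with `to_csv`.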


Archive contents of Python爬虫开发基于Python实现的批量抓取采集新浪博客页面的所有文章含源代码及案例数据集.rar (139 files)

blog_4701280b0102wrup.html 78KB

blog_4701280b0102e0l4.html 78KB

blog_4701280b0102wruo.html 76KB

blog_4701280b010183ny.html 71KB

blog_4701280b0102e0eu.html 64KB

blog_4701280b0100ey1x.html 62KB

blog_4701280b0102e0ib.html 58KB

blog_4701280b0100evps.html 53KB

blog_4701280b0102e061.html 52KB

blog_4701280b0102ek51.html 52KB

blog_4701280b0102ecxd.html 47KB

blog_4701280b0100egc6.html 46KB

blog_4701280b0102dz5s.html 46KB

blog_4701280b0102e074.html 45KB

blog_4701280b0102e0ak.html 45KB

blog_4701280b01017hzx.html 45KB

blog_4701280b0102ec39.html 45KB

blog_4701280b0102e7er.html 45KB

blog_4701280b0102e02q.html 45KB

blog_4701280b01017ijj.html 45KB

blog_4701280b0100mri0.html 44KB

blog_4701280b0100erbx.html 44KB

blog_4701280b0102dzqy.html 44KB

blog_4701280b0102e5np.html 44KB

blog_4701280b0102e0fm.html 44KB

blog_4701280b0100hcf6.html 44KB

blog_4701280b01017i4g.html 44KB

blog_4701280b0102dz84.html 43KB

blog_4701280b0102e4qq.html 43KB

blog_4701280b0100japd.html 43KB

blog_4701280b0100jloa.html 43KB

blog_4701280b01017iv8.html 43KB

blog_4701280b0102dxmp.html 43KB

blog_4701280b0100fpjr.html 42KB

blog_4701280b0102edcd.html 42KB

blog_4701280b0102egl0.html 42KB

blog_4701280b0102e3v6.html 42KB

blog_4701280b0100g8zf.html 42KB

blog_4701280b0100iy7s.html 42KB

blog_4701280b0102e07s.html 42KB

blog_4701280b0100gzwj.html 42KB

blog_4701280b0102e4c3.html 42KB

blog_4701280b0100gqf8.html 42KB

blog_4701280b0102eo83.html 42KB

blog_4701280b0100mrhm.html 42KB

blog_4701280b0102e0p3.html 41KB

blog_4701280b0100glm8.html 41KB

blog_4701280b0102e0th.html 41KB

blog_4701280b0102e54a.html 41KB

blog_4701280b010185jh.html 41KB

blog_4701280b0100g7gq.html 41KB

blog_4701280b0100hrm2.html 41KB

blog_4701280b0101854o.html 41KB

blog_4701280b0100fej4.html 41KB

blog_4701280b0100ht1x.html 40KB

blog_4701280b0102e63p.html 40KB

blog_4701280b01017hr5.html 40KB

blog_4701280b0100insm.html 40KB

blog_4701280b0100fixk.html 40KB

blog_4701280b0102e7wj.html 40KB

blog_4701280b0100lcum.html 40KB

blog_4701280b01017ijd.html 40KB

blog_4701280b0100fc2s.html 40KB

blog_4701280b0102e3nr.html 40KB

blog_4701280b0102eb8d.html 40KB

blog_4701280b0100h9tc.html 40KB

blog_4701280b0102e7pk.html 40KB

blog_4701280b01017hsy.html 40KB

blog_4701280b0100gyzh.html 40KB

blog_4701280b0102e4gf.html 40KB

blog_4701280b0100h7b2.html 40KB

blog_4701280b0102dz9f.html 40KB

blog_4701280b010176yw.html 39KB

blog_4701280b0100l4sf.html 39KB

blog_4701280b0100gxme.html 39KB

blog_4701280b010183ai.html 39KB

blog_4701280b0100ev3s.html 39KB

blog_4701280b0100g03k.html 39KB

blog_4701280b0100kusa.html 39KB

blog_4701280b0100ee0m.html 39KB

blog_4701280b0100gce1.html 39KB

blog_4701280b0102e42a.html 39KB

blog_4701280b0100limx.html 39KB

blog_4701280b0102ef4t.html 39KB

blog_4701280b0100g801.html 39KB

blog_4701280b010176x6.html 39KB

blog_4701280b0102eck1.html 39KB

blog_4701280b0102e85j.html 39KB

blog_4701280b0100h01f.html 39KB

blog_4701280b0100easn.html 38KB

blog_4701280b0100gjd6.html 38KB

blog_4701280b0100fzmm.html 38KB

blog_4701280b0102dx7u.html 38KB

blog_4701280b0100fozw.html 38KB

blog_4701280b0100gcs5.html 38KB

blog_4701280b0100j9lt.html 38KB

blog_4701280b0100hy9k.html 38KB

blog_4701280b0100h3c8.html 38KB

blog_4701280b0100en8n.html 38KB

139 entries in total

#coding:utf-8
# Python 2 script (print statements, urllib.urlopen)
import urllib
import time
import os

'''
Crawl all articles listed on a Sina Blog article-list page
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
'''

page = 1  # first page
while page <= 7:  # crawl 7 list pages in total
    url = [''] * 50  # preallocated: each list page holds 50 articles
    base_url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_'
    temp = base_url + str(page) + '.html'  # full URL of the list page
    content = urllib.urlopen(temp).read()  # read the HTML of the list page
    i = 0
    title = content.find(r'<a title=')  # start of an article link
    href = content.find(r'href=', title)  # its href attribute
    html = content.find(r'.html', href)  # end of the article URL
    while title != -1 and href != -1 and html != -1 and i < 50:
        # skip 'href="' (6 characters) and keep '.html' (5 characters)
        url[i] = content[href + 6 : html + 5]
        print url[i]
        title = content.find(r'<a title=', html)
        href = content.find(r'href=', title)
        html = content.find(r'.html', href)
        i = i + 1
    else:
        print 'end page=', page

    # write each article to a local file
    j = 0
    while j < i:
        content = urllib.urlopen(url[j]).read()
        path = 'hanhan/' + str(page) + "/"
        if not os.path.isdir(path):
            os.makedirs(path)  # create the directory if it does not exist
        # the last 26 characters of the URL are the file name (blog_xxx.html)
        open(path + url[j][-26:], 'w+').write(content)
        j = j + 1
        time.sleep(1)  # delay between requests
    else:
        print 'download'
    page = page + 1
else:
    print 'all find end'
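The script above is Python 2: it uses `print` statements and `urllib.urlopen`, neither of which exists in Python 3. A possible Python 3 port is sketched below. It keeps the same `find()`-based scan, the same URL prefix, the `hanhan/<page>/` layout and the one-second delay. The function names are hypothetical, decoding the list page as UTF-8 with undecodable bytes ignored is an assumption about the site's encoding, and the file name is taken from the last path segment of the URL instead of the original's fixed 26-character slice.

```python
import os
import time
import urllib.request

# Same list-page prefix as the original script.
BASE_URL = "http://blog.sina.com.cn/s/articlelist_1191258123_0_"


def extract_article_urls(content, limit=50):
    """Collect up to `limit` article URLs with the original find()-based scan."""
    urls = []
    title = content.find("<a title=")
    while title != -1 and len(urls) < limit:
        href = content.find("href=", title)
        html = content.find(".html", href) if href != -1 else -1
        if href == -1 or html == -1:
            break
        # Skip 'href="' (6 characters) and keep the trailing '.html' (5 characters).
        urls.append(content[href + 6:html + 5])
        title = content.find("<a title=", html)
    return urls


def local_filename(url):
    """Use the last path segment (e.g. blog_xxx.html) as the file name."""
    return url.rsplit("/", 1)[-1]


def crawl(first_page=1, last_page=7, out_dir="hanhan", delay=1.0):
    """Download every article linked from the list pages into out_dir/<page>/."""
    for page in range(first_page, last_page + 1):
        with urllib.request.urlopen(f"{BASE_URL}{page}.html") as resp:
            # Assumption: UTF-8 is close enough to locate the ASCII markers.
            listing = resp.read().decode("utf-8", errors="ignore")
        page_dir = os.path.join(out_dir, str(page))
        os.makedirs(page_dir, exist_ok=True)
        for url in extract_article_urls(listing):
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
            # Write the raw bytes so the page's own encoding is preserved.
            with open(os.path.join(page_dir, local_filename(url)), "wb") as fh:
                fh.write(data)
            time.sleep(delay)  # pause between requests, as in the original
```

Calling `crawl()` would start the actual download. `extract_article_urls` and `local_filename` work on plain strings, so they can be tried on a saved list page without any network access.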
