Python Crawlers: A Guide to Using the BeautifulSoup Module (CSDN Library)

Uploaded 2020-09-20 07:15:47 · 83KB PDF

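In Python web-crawler development, BeautifulSoup is a widely used library for parsing HTML and XML documents, which makes it easy to extract and manipulate data from web pages. This guide describes how to use BeautifulSoup for that purpose. Installation takes a single pip command:

```bash
$ pip install beautifulsoup4
```

For faster parsing and better handling of complex HTML, BeautifulSoup is usually paired with another parser such as lxml or html5lib. Both must be installed separately:

```bash
$ pip install html5lib
$ pip install lxml
```

Once installation is complete, we can start using BeautifulSoup. The basic usage is shown below with a simple HTML string:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
```

With the `soup` object created, we can explore the document and extract data. For example, to get the `title` tag:

```python
soup.title
# Output: <title>The Dormouse's story</title>
```

We can then read the name and the text content of the `title` tag:

```python
soup.title.name
# Output: "title"
soup.title.string
# Output: "The Dormouse's story"
```

For the `p` tag, we can look up its attributes, such as `class`:

```python
soup.p['class']
# Output: ['title']
```

Elements can also be searched by tag name, for example finding all `a` tags (CSS selectors are available separately through `select()`):

```python
soup.find_all('a')
# Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#          <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#          <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

BeautifulSoup also provides several ways to search and traverse the document tree, such as `find()`, `find_all()`, `descendants`, and `children`, which can be combined as needed. In a real crawler project, the page content is usually fetched with the requests library and then parsed with BeautifulSoup. For example:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')
soup = BeautifulSoup(response.text, 'lxml')
```

The fetched page can then be analyzed and its data extracted.

BeautifulSoup provides HTML and XML parsing and is a standard part of Python crawler development. Knowing how to use it makes it possible to process and extract web page data efficiently for a wide range of crawling tasks. In practice, choose the parser that fits the job, for example lxml for better performance.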
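As a rough illustration of the traversal helpers mentioned above, the sketch below contrasts `find()`, `find_all()`, `.children`, and `.descendants` on a small made-up fragment. The HTML string and the variable names are assumptions for illustration only, and the built-in `html.parser` is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, not taken from the article
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

first_link = soup.find("a")       # first match only (None if nothing matches)
all_links = soup.find_all("a")    # list of every match

# .children yields only the direct children of a tag
child_names = [child.name for child in ul.children]

# .descendants walks the whole subtree, text nodes included,
# so only Tag objects are kept here
descendant_names = [n.name for n in ul.descendants if isinstance(n, Tag)]

print(first_link["href"], len(all_links))
print(child_names)
print(descendant_names)
```

The difference matters when a page nests elements deeply: `.children` stops at one level, while `.descendants` reaches every nested tag and string.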


Python Crawlers: A Guide to the Beautiful Soup Module

This article introduces the Beautiful Soup module for Python crawlers and is shared here as a reference.

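Crawling a web page generally follows these steps:

1. Choose the URL to crawl.

2. Open the URL from Python (urlopen, requests, etc.).

3. Read the page content (for example with read()).

4. Pass the content to BeautifulSoup.

5. Use BeautifulSoup to select tags and other information.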
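The five steps above can be sketched roughly as follows. This is a minimal sketch, not the article's own code: the helper names `fetch_html` and `extract_links` and the sample page body are assumptions, the standard-library `urlopen` stands in for whichever HTTP client you prefer, and `html.parser` is used to avoid extra dependencies.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_html(url):
    """Steps 1-3: open the chosen URL and read the response body as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def extract_links(html):
    """Steps 4-5: parse the HTML and collect the href of every <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]

# Offline demonstration with a hypothetical page body
sample = ('<html><body><a href="http://example.com/elsie">Elsie</a>'
          '<a name="x">no href</a></body></html>')
links = extract_links(sample)
print(links)
```

With network access, `extract_links(fetch_html("http://example.com"))` would run all five steps end to end.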

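As you can see, fetching a page is not hard. The hard part is filtering the data, that is, getting exactly the data you want. This article walks through how to use BeautifulSoup.

The official BeautifulSoup site describes it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

1 Installation

It can be installed directly with pip: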

$ pip install beautifulsoup4

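Besides the HTML parser in Python's standard library (html.parser), BeautifulSoup supports third-party parsers such as lxml (for both HTML and XML) and html5lib, each of which requires installing the corresponding library. If none of them is installed, BeautifulSoup falls back to Python's built-in parser. The lxml parser is faster and more capable, so installing it is recommended.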

$ pip install html5lib

$ pip install lxml
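Because the third-party parsers are optional, a script may want to degrade gracefully when one is missing. The sketch below is one possible way to do that, assuming a hypothetical helper `make_soup`; it relies on BeautifulSoup raising `FeatureNotFound` when the requested parser library is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser's library is missing; try the next one
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<title>demo</title>")
print(used, soup.title.string)
```

Naming the parser explicitly, as done here, also keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.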

2 Basic usage of BeautifulSoup

First, create a string that the rest of the article uses to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Parsing this document with BeautifulSoup produces a BeautifulSoup object, which can print the document with standard indentation:

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(html_doc, "lxml")

>>> print(soup.prettify())

To save space, the output is not shown here.

Here are a few simple ways to navigate the structured data:

>>> soup.title

<title>The Dormouse's story</title>

>>> soup.title.name

'title'

>>> soup.title.string

"The Dormouse's story"

>>> soup.p['class']

['title']

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.find(id='link1')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
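A common next step after locating tags is pulling values out of them. The following sketch, which uses a shortened copy of the sample document and `html.parser` as assumptions, collects each link's text and URL and shows what `find()` returns when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's text to its URL; .get() returns None if href is absent
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}

# find() with an id filter returns a single Tag, or None when nothing matches
link2 = soup.find(id="link2")
missing = soup.find(id="no-such-id")

print(links)
print(link2.get_text(), missing)
```

Checking for `None` before using the result of `find()` avoids an `AttributeError` on pages where the expected element is absent.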

3 Kinds of objects

Beautiful Soup transforms a complex HTML document into a tree structure. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

3.1 Tag

Put simply, a Tag corresponds to a tag in the HTML document, such as the title, p, and a tags above. For example:

<title>The Dormouse's story</title>

These tags can be retrieved simply by appending the tag name to soup.

>>> print(soup.p)

<p class="title"><b>The Dormouse's story</b></p>

>>> print(soup.title)

<title>The Dormouse's story</title>

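Note that this returns only the first matching tag in the document. Querying all matching tags is covered later.

Every Tag has two important attributes, name and attrs. name is the name of the tag, and attrs is a dictionary of all the tag's attributes, such as class.

>>> print(soup.p.name)

p

>>> print(soup.p.attrs)

{'class': ['title']}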
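To make the `name` and `attrs` behaviour concrete, here is a small sketch on a single made-up `a` tag (the markup is an assumption chosen to show a multi-valued `class`). It reads attributes through dictionary-style access, through `.get()`, and through the full `attrs` dictionary.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

name = tag.name            # the tag's name as a string
attrs = tag.attrs          # dictionary of every attribute on the tag
href = tag["href"]         # dictionary-style access; raises KeyError if absent
classes = tag["class"]     # class is multi-valued, so a list is returned
title = tag.get("title")   # .get() returns None for a missing attribute

print(name, href, classes, title)
print(attrs)
```

Using `.get()` is the safer choice when an attribute may be missing, which is common on real pages.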

3.2 NavigableString

NavigableString: the text inside a tag, obtained with, for example, soup.p.string.

>>> print(soup.p.string)

The Dormouse's story
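One detail worth knowing is that `.string` only returns a value when a tag has a single child. The sketch below, built on a made-up fragment, shows the type of the returned object, how to convert it to a plain `str`, and how `get_text()` can be used when a tag mixes text and nested tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

markup = "<p><b>The Dormouse's story</b></p><div>Hello <i>world</i></div>"
soup = BeautifulSoup(markup, "html.parser")

text = soup.p.string                      # follows the single <b> child down to its text
is_nav = isinstance(text, NavigableString)
plain = str(text)                         # plain str, detached from the parse tree

mixed = soup.div.string                   # None, because <div> has two children
full = soup.div.get_text()                # concatenates all text under <div>

print(is_nav, plain)
print(mixed, full)
```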

3.3 BeautifulSoup

BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.
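The statement that the BeautifulSoup object behaves like a special Tag can be checked directly. The following is a small sketch with a made-up document; it shows that the object is an instance of `Tag`, supports the same search methods, and carries a placeholder name with no attributes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = "<html><head><title>demo</title></head><body></body></html>"
soup = BeautifulSoup(markup, "html.parser")

is_tag = isinstance(soup, Tag)      # BeautifulSoup is a subclass of Tag
doc_name = soup.name                # special placeholder name for the document
doc_attrs = soup.attrs              # the document itself has no attributes
title = soup.find("title").string   # Tag search methods work on the whole document

print(is_tag, doc_name, doc_attrs, title)
```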

3.4 Comment

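Comment: a Comment object is a special type of NavigableString. Its output does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing.

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

>>> soup = BeautifulSoup(markup, "lxml")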

>>> comment = soup.b.string

>>> print(comment)

Hey, buddy. Want to buy a used parser?
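Since a comment prints like ordinary text, one way to avoid mixing it into extracted data is to check the type before using `.string`. The sketch below is one possible approach, not the article's own code: the helper `visible_string` and the extra `<i>` element are assumptions added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p><b><!--Hey, buddy. Want to buy a used parser?--></b><i>visible text</i></p>"
soup = BeautifulSoup(markup, "html.parser")

def visible_string(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

b_text = visible_string(soup.b)   # None, because the only child of <b> is a comment
i_text = visible_string(soup.i)   # ordinary text passes through

# Collect every comment in the document by filtering strings on their type
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(b_text, i_text)
print(comments)
```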
