Python BeautifulSoup Library: Introduction and Installation Tutorial (PDF) - CSDN Wenku

Uploaded 2023-06-11 17:12:17 · 146 KB PDF

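Python BeautifulSoup Library: Introduction and Installation Tutorial

BeautifulSoup is a Python library for parsing HTML and XML documents. It provides convenient methods for navigating, searching, and modifying the parse tree. This tutorial walks through installing the library, understanding what it does, and basic usage.

**1. Installing BeautifulSoup**

In a Python environment, install the BeautifulSoup4 package with pip:

```bash
pip install beautifulsoup4
```

Once it is installed, import the library in your project:

```python
from bs4 import BeautifulSoup
```

**2. Understanding BeautifulSoup**

The core job of BeautifulSoup is to parse an HTML or XML document into a traversable "tag tree" so that data can be extracted and manipulated easily. It supports several parsers, such as lxml (installed separately) and html.parser (bundled with Python), so you can pick the one that fits your needs.

**3. The BeautifulSoup class**

The `BeautifulSoup` class is the heart of the library. It takes an HTML or XML string and produces a parse tree object. For example:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
```

In this example, the `soup` object holds the content of the entire HTML document.

**4. Basic HTML format**

HTML is a markup language built from tags such as `<html>`, `<head>`, `<title>`, and `<body>`. Each tag can carry attributes (such as `class` and `id`) and may contain text.

**5. Traversing tags**

1. **Downward traversal**: from a parent tag to its children, one level deeper at a time.
2. **Upward traversal**: from a child tag to its parent, one level back at a time.
3. **Sibling (parallel) traversal**: moving between tags at the same level.

For example, `.children` moves downward and `.parent` moves upward; `.next_sibling` and `.previous_sibling` move between siblings:

```python
tag = soup.body
for child in tag.children:  # downward: direct children of <body>
    print(child)
print(tag.parent)           # upward: the enclosing <html> tag
```

**6. Tag**

In BeautifulSoup, every HTML tag is represented by a `Tag` object. Its `.name` attribute returns the tag name (for `<p>`, `.name` is `'p'`), and its `.attrs` attribute returns a dictionary of all attributes and their values, such as `class` and `href`.

```python
tag = soup.p
print(tag.name)
print(tag.attrs)
```

**7. NavigableString**

Besides `Tag` objects, BeautifulSoup represents text content as `NavigableString` objects. `NavigableString` is a subclass of Python's `str`, so it is immutable and supports indexing and slicing. The text directly inside a tag is available through the tag's `.string` attribute.

**8. The `prettify()` method**

`prettify()` formats the HTML with line breaks and indentation so that it is easier to read:

```python
print(soup.prettify())
```

**9. Encoding**

BeautifulSoup holds the parsed tree as Unicode text. When the input is raw bytes, the `from_encoding` argument tells it which character set to decode with:

```python
# from_encoding only applies to bytes input (r.content); it is ignored for str input such as r.text
soup = BeautifulSoup(r.content, "html.parser", from_encoding="utf-8")
```

BeautifulSoup offers an intuitive API that makes working with HTML and XML documents straightforward. With the points above, you can handle common web scraping and data parsing tasks in Python.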
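The summary mentions sibling traversal and `NavigableString` without showing either. The following is a minimal sketch of both, assuming a short inline HTML string (made up for illustration, not the tutorial's demo page) so that it runs without network access.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative markup only; the tutorial itself fetches demo.html instead.
html = '<p class="course">Learn <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a

# Sibling traversal: neighbours can be text nodes, not only tags.
before = first.previous_sibling   # text node in front of the first <a>
after = first.next_sibling        # text node between the two <a> tags

# Keep only the siblings that are real tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# The text inside a tag is a NavigableString, which behaves like a str.
text = first.string

print(repr(before), repr(after))
print([t["id"] for t in tag_siblings])
print(text, text[:5], isinstance(text, NavigableString))
```

Filtering with `isinstance(s, Tag)` is a common way to skip the whitespace and text nodes that sit between sibling tags.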


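Python BeautifulSoup Library: Introduction and Installation Tutorial
Contents
Installing Beautiful Soup
Understanding Beautiful Soup
Importing Beautiful Soup
The BeautifulSoup class
Reviewing demo.html
Tag
Tag attrs (attributes)
Tag NavigableString
Basic HTML format
Downward traversal of the tag tree
Upward traversal of the tag tree
Sibling (parallel) traversal of tags
The bs4 prettify() method
Encoding in bs4
Installing Beautiful Soup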
pip install beautifulsoup4
Understanding Beautiful Soup
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".
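To make "parsing, traversing, and maintaining" concrete, here is a minimal sketch of all three on a small inline HTML string. The markup and variable names are illustrative assumptions, not taken from the tutorial.

```python
from bs4 import BeautifulSoup

# Illustrative markup only (not the tutorial's demo page).
html = '<html><body><p class="title"><b>Hello</b></p><p class="course">World</p></body></html>'

# Parse: build the tag tree from the markup.
soup = BeautifulSoup(html, "html.parser")

# Traverse: collect the names of the direct children of <body>.
child_names = [child.name for child in soup.body.children]

# Maintain: change an attribute and the text of the first <p>.
soup.p["class"] = "heading"
soup.p.b.string = "Hi"

first_p = str(soup.p)
print(child_names)
print(first_p)
```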
Importing Beautiful Soup
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
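Because one `BeautifulSoup` object stands for a whole document, it can be built from any source of markup. The sketch below builds it once from a string and once from an open file handle. The temporary file is only a stand-in for a locally saved page, and the markup is an assumption for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup only.
markup = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

# From a string already in memory.
soup_from_str = BeautifulSoup(markup, "html.parser")

# From an open file handle (a temporary file stands in for a saved HTML page).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

title_str = soup_from_str.title.string
title_file = soup_from_file.title.string
print(title_str, title_file)
```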
Reviewing demo.html
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
Tag
Basic element    Description
Tag    A tag, the most basic unit for organizing information; its start and end are marked with <> and </> respectively
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)
<title>This is a python demo page</title>
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
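Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one.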
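The "first match only" behaviour is easy to check. The sketch below uses the two links from demo.html as an inline string (so it runs offline) and contrasts attribute-style access with `find_all`, which returns every match.

```python
from bs4 import BeautifulSoup

# The two <a> tags from demo.html, inlined so the example needs no network access.
html = """<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # only the first <a> in the document
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # None: the document has no <table>

print(first_link["id"])
print([a["id"] for a in all_links])
print(missing)
```

Since a missing tag yields `None` rather than an exception, it is worth checking the result before reading its attributes.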
Tag name
Basic element    Description
Name    The name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
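Chaining `.parent` by hand works for a known depth. As a sketch of the same idea for any depth, the `.parents` iterator walks every ancestor up to the document root. The markup below is a trimmed inline copy of demo.html, an assumption made so the example runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of demo.html, used here to avoid a network request.
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="course"><a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a></p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Walk upward from <a>: each ancestor's name, ending at the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The BeautifulSoup object is the root of the tree, so it has no parent.
print(soup.parent)
```

The last name in the list is `[document]`, the name bs4 gives to the `BeautifulSoup` object at the top of the tree.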