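### 1. Background and Development Environment
- Social platforms such as Douban, Maoyan, Weibo, and Bilibili have grown rapidly, and general-purpose web crawlers cannot meet the demand for collecting and analyzing their content. Focused crawlers that target a specific topic emerged to fill this gap. This project was developed on Windows 10 with JetBrains PyCharm 2019.2.5 x64. It implements a crawler for Maoyan Movies that scrapes the Maoyan Top 100 list, then visualizes and presents the collected data.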
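### 2. Functional Overview
- Running the program automatically scrapes the Maoyan Top 100 list, stores the results in a sqlite3 database, and then serves the data as web pages through Flask.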
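### 3. Overall Design
1. Locate the crawl entry point, parse the fetched pages with BeautifulSoup, and save the extracted information to the database.
2. Use third-party Python libraries to split the movie data in the database and extract the useful part of each attribute, then present it on web pages built with Flask, HTML, Echarts, and WordCloud. The site has five navigation entries, "Home", "Movies", "Ratings", "Word Cloud", and "About", and each header entry is linked so that clicking it jumps to the corresponding page.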
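### 4. Detailed Design
- Locating the crawl entry point
Open the Maoyan Top 100 board in Chrome; its URL is https://maoyan.com/board/4?offset= . Inspecting the page shows that the target data is present in the page source, so requesting the URL directly is enough to obtain it. The first page lists only 10 movies, and the URL of the second page is https://maoyan.com/board/4?offset=10, so offset can be inferred to be the index of the first movie on the page.
Crawling therefore only requires changing the value of offset (page number `page` uses offset = 10*(page-1)) to page through the list automatically. The HTML is fetched via urllib.request, with a Cookie and a User-Agent added to the request headers to imitate a browser and avoid triggering Meituan's slider CAPTCHA. Finally, the page is saved as HTML so that it can be analyzed later with regular expressions.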
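The offset rule can be sketched as a small helper. This is a minimal illustration rather than the project's own code: the names `page_url` and `all_page_urls` are hypothetical, and only the base URL and the formula offset = 10*(page-1) come from the text.

```python
from typing import List

BASE_URL = "https://maoyan.com/board/4?offset="  # board URL given in the text
MOVIES_PER_PAGE = 10  # each page lists 10 movies


def page_url(page: int) -> str:
    """Return the URL of a 1-based page, using offset = 10 * (page - 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL + str(MOVIES_PER_PAGE * (page - 1))


def all_page_urls(pages: int = 10) -> List[str]:
    """List the URLs of pages 1..pages (10 pages cover the Top 100)."""
    return [page_url(p) for p in range(1, pages + 1)]


if __name__ == "__main__":
    for url in all_page_urls():
        print(url)
```

Keeping the URL arithmetic in one function makes it easy to check the paging logic without sending any requests.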
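- Parsing the data
The saved HTML is parsed with BeautifulSoup. Following the structure of the page, the program locates the required content and extracts each movie's fields: it first finds the tag that wraps each movie, then the tags that hold the individual fields, and finally pulls the values out of those tags. Regular expressions from the re module are used alongside the parser to extract the exact content.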
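The field-extraction step can be illustrated on a single movie entry. The sketch below is an assumption-laden example, not the project's parser: the `<dd>` fragment is hand-written (only class names such as `board-index`, `star`, `releasetime`, `integer`, and `fraction` are taken from the patterns in the source listing), the helper names are hypothetical, and it uses only `re` so that it runs without BeautifulSoup. In the project, BeautifulSoup first selects each `<dd>` element and the regular expressions are then applied to its text.

```python
import re

# Hand-written fragment for illustration only; the real page markup may differ.
SAMPLE_DD = """<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/0" title="Example Movie" class="image-link">
<img class="board-img" src="https://example.com/poster.jpg"/>
</a>
<p class="star">
    Starring: Actor A, Actor B
</p>
<p class="releasetime">Release date: 2000-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>"""

# One capturing group per field; `.*?` is a non-greedy "any characters".
PATTERNS = {
    "rank": re.compile(r'board-index.*?>(\d+)'),
    "image": re.compile(r'class="board-img".*?src="(.*?)"'),
    "title": re.compile(r'title="(.*?)"'),
    # re.S lets `.` cross the line breaks inside the <p class="star"> tag.
    "actor": re.compile(r'class="star">\s*(.*?)\s*</p>', re.S),
    "time": re.compile(r'class="releasetime">(.*?)</p>'),
    "integer": re.compile(r'class="integer">(.*?)</i>'),
    "fraction": re.compile(r'class="fraction">(.*?)</i>'),
}


def first_match(pattern, text):
    """Return the first captured group, or "" when the pattern is absent."""
    match = pattern.search(text)
    return match.group(1) if match else ""


def parse_movie(fragment):
    """Extract the six stored fields from one movie fragment."""
    fields = {name: first_match(p, fragment) for name, p in PATTERNS.items()}
    # The score is split across two tags; concatenate the two parts.
    fields["score"] = fields.pop("integer") + fields.pop("fraction")
    return fields


if __name__ == "__main__":
    print(parse_movie(SAMPLE_DD))
```

Returning an empty string for a missing field, instead of indexing `[0]` on an empty result, keeps one malformed entry from aborting the whole crawl.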
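Define an askURL(url) method (named askUrl in the listing in Section 5, which uses requests.get for the same step). It passes the URL to Request() to construct a Request object, passes that object to urlopen() to send the request to the remote server, and calls read() on the response to read the page the server returned. When it finishes, it returns the HTML content of the page.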
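The Request, urlopen(), and read() sequence described above might look like the following sketch. It is an illustration under assumptions, not the project's code: the function names `build_request` and `ask_url` are hypothetical, and the User-Agent value is a placeholder; a Cookie copied from a browser session would be added to the same dictionary if the site requires one.

```python
import urllib.error
import urllib.request

# Placeholder header value; a real run would copy it from a browser session.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}


def build_request(url, headers=None):
    """Construct a Request object carrying browser-like headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def ask_url(url, timeout=10):
    """Send the request and return the page HTML, or "" on failure."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError and carries .code and .reason.
        if hasattr(err, "code"):
            print("HTTP status:", err.code)
        if hasattr(err, "reason"):
            print("Reason:", err.reason)
        return ""
```

Separating `build_request` from `ask_url` lets the headers be inspected without touching the network.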
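- Saving the data
The parsed result, datalist, is saved to a sqlite3 database. The database is created first and a connection is then opened. A for loop iterates over datalist; the field values of each record are quoted as SQL string literals, joined into the values list of an insert statement, and written row by row through a cursor.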
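As an alternative to splicing quoted values into the SQL text, the same insert can be written with sqlite3 parameter placeholders, which also handles values that contain quotation marks. This is a sketch under assumptions: the table and column names come from the listing in Section 5, but the column types and the function name `save_rows` are hypothetical, since the table definition is not part of the excerpt.

```python
import sqlite3

# Column names follow the listing; the types are assumed for illustration.
CREATE_SQL = """
create table if not exists movie100 (
    id integer primary key autoincrement,
    movie_rank integer,
    image_link text,
    name text,
    actor text,
    time text,
    score real
)
"""

INSERT_SQL = """
insert into movie100 (movie_rank, image_link, name, actor, time, score)
values (?, ?, ?, ?, ?, ?)
"""


def save_rows(conn, rows):
    """Create the table if needed and insert every row in one transaction."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.execute(CREATE_SQL)
        conn.executemany(INSERT_SQL, rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    sample = [
        ["1", "https://example.com/a.jpg", 'A "quoted" title',
         "Actor A", "2000-01-01", "9.5"],
    ]
    save_rows(conn, sample)
    print(conn.execute("select name, score from movie100").fetchall())
    conn.close()
```

Binding values with `?` leaves quoting to the driver, and `executemany` inserts the whole list without building one SQL string per row.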
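- Visualizing the data
Python offers charting libraries such as matplotlib, and pandas provides plotting helpers as well. These tools are well established, but this project uses Echarts, Apache's open-source web visualization library, and generates the Echarts charts from Python. Echarts can not only draw charts but also be embedded in standalone HTML pages; it performs well and is easy to use.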
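How the project builds its charts is not shown in the excerpt. One possible approach, sketched here with a hypothetical function name `score_bar_option`, is to assemble the Echarts option object as a Python dictionary and serialize it to JSON for the page's `setOption` call. The sample scores are made up for the demo.

```python
import json
from collections import Counter


def score_bar_option(scores):
    """Build an Echarts bar-chart option counting movies per score."""
    counts = Counter(scores)
    labels = sorted(counts, key=float)  # order the categories numerically
    return {
        "title": {"text": "Score distribution"},
        "tooltip": {},
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": [counts[s] for s in labels]}],
    }


if __name__ == "__main__":
    option = score_bar_option(["9.5", "9.1", "9.5", "8.8"])  # made-up scores
    # The JSON text can be handed to chart.setOption(...) in the page script.
    print(json.dumps(option, ensure_ascii=False))
```

Because the option is plain JSON, a Flask view can pass it to a template without any Python-specific charting dependency.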
### 5. Partial Source Code and Screenshots
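```python
import sqlite3
# Fetch web pages
import requests
# Regular expressions
import re
# Parse HTML and extract data
from bs4 import BeautifulSoup
# Save to Excel
import xlwt

# Regex notes: `.*?` matches any characters non-greedily; `(...)` captures the field
# Movie rank
findIndex = re.compile(r'board-index.*?>(\d+)')
# Poster image link
findImage = re.compile(r'class="board-img".*?src="(.*?)"')
# Movie title
findTitle = re.compile(r'title="(.*?)">')
# Cast: two groups, the first character after the tag and the rest of that line
findActor = re.compile(r'class="star">(.|\n)(.*)')
# Release date (no trailing space after </p>, so the match does not depend on it)
findTime = re.compile(r'class="releasetime">(.*?)</p>')
# Maoyan score: integer part and fractional part are in separate tags
findScore1 = re.compile(r'class="integer">(.*?)</i>')
findScore2 = re.compile(r'class="fraction">(.*?)</i>')


# Steps: fetch the pages, parse the data, save the data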
def main():
    baseurl = "https://maoyan.com/board/4?offset="
    # 1. Fetch and parse the pages
    datalist = getData(baseurl)
    # 2. Save the data
    # savepath = "猫眼TOP100.xls"
    dbpath = "movie2.db"
    # saveData(datalist, savepath)
    saveData2DB(datalist, dbpath)


def getData(baseUrl):
    datalist = []
    for i in range(0, 10):
        # Pages are addressed by offset = 0, 10, ..., 90
        url = baseUrl + str(i * 10)
        html = askUrl(url)
        # Parse the page
        soup = BeautifulSoup(html, "html.parser")
        # Each <dd> element holds one movie
        for item in soup.find_all("dd"):
            # print(item)
            data = []
            item = str(item)
            # Rank
            index = re.findall(findIndex, item)[0]
            data.append(index)
            # Image URL
            image = re.findall(findImage, item)[0]
            data.append(image)
            # Title
            title = re.findall(findTitle, item)[0]
            data.append(title)
            # Cast: join the two captured groups and strip surrounding whitespace
            actor = "".join(re.findall(findActor, item)[0]).strip()
            data.append(actor)
            # Release date
            time = re.findall(findTime, item)[0]
            data.append(time)
            # Score: concatenate the integer part and the fractional part
            score1 = re.findall(findScore1, item)[0]
            score2 = re.findall(findScore2, item)[0]
            score = score1 + score2
            data.append(score)
            # print(data)
            datalist.append(data)
    # print(datalist)
    return datalist


# Fetch one page
def askUrl(url):
    # Imitate a browser's request headers when contacting the Maoyan server
    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
        "Cookie": "BIDUPSID=E53D8374080760C7A143C4FD57FD9FC0; PSTM=1578902890; HMACCOUNT=6B314F8B0EFA06B7; BAIDUID=600AD9DA4AECD6C574A62250C5383D41:FG=1; BDRCVFR[Fz5t0nreVcc]=mk3SLVN4HKm; delPer=0; PSINO=6; H_PS_PSSID=; BA_HECTOR=ah0101ak2ka48k2hb31fu57qk0r; HMVT=6bcd52f51e9b3dce32bec4a3997715ac|1608687465|"}
    html = ""
    try:
        response = requests.get(url, headers=headers)
        html = response.content.decode("utf-8")
        # print(html)
    except requests.exceptions.RequestException as e:
        # requests.exceptions is a module; RequestException is the base class of its errors
        print(e)
    return html


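# Save the data to an Excel file
def saveData(datalist, savepath):
    # Create the workbook
    book = xlwt.Workbook(encoding="utf-8")
    # Create the worksheet
    sheet = book.add_sheet("猫眼TOP100", cell_overwrite_ok=True)
    # Column headers: rank, image URL, title, cast, release date, score
    col = ("电影排名", "图片地址", "电影名称", "演出人员", "上映时间", "电影评分")
    for i in range(0, 6):
        # col[i] is the header of column i
        sheet.write(0, i, col[i])
    for i in range(0, 100):
        print("第%d条" % (i + 1))  # progress message: record number i + 1
        try:
            data = datalist[i]
        except IndexError:
            # Fewer than 100 records were scraped
            continue
        for j in range(0, 6):
            # Write one cell; row 0 holds the headers
            sheet.write(i + 1, j, data[j])
    book.save(savepath)

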
# Save the data to the database
def saveData2DB(datalist, dbpath):
    init_db(dbpath)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()
    for data in datalist:
        # Wrap every value in double quotes so it can be spliced into the SQL text.
        # This breaks if a value itself contains a double quote.
        for index in range(len(data)):
            data[index] = '"' + str(data[index]) + '"'
        sql = '''
            insert into movie100(
            movie_rank,image_link,name,actor,time,score
            )
            values(%s)''' % ",".join(data)
        print(sql)
        cur.execute(sql)
        conn.commit()
    cur.close()
    conn.close()
# init_db is called above, but its definition is not part of this excerpt.
```