Scraping web page resources with a Python crawler

7 files in total

xml: 4

py: 1

xlsx: 1

python

crawler

Uploaded 2023-06-28 15:51:03. 21KB RAR archive, 7 downloads, 185 views, 0 points required.

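This resource uses Python's requests and BeautifulSoup libraries to fetch a web page and extract the contents of specific tags, scraping the table on the page. It then uses the openpyxl library to create a Workbook and saves the table rows to a local Excel file. The page being scraped is: https://www.basketball-reference.com/leagues/NBA_2014_games-december.html

The attachment contains the source code and the generated Excel file. BeautifulSoup and openpyxl must be installed with pip, Python's package manager; if you are not familiar with pip, look it up online. This resource is aimed at beginners who are new to Python. Feel free to download it and study it.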
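To make the parsing step concrete without touching the network, here is a minimal sketch that applies the same find / find_all / get_text pattern to a small inline HTML table. The markup in `SAMPLE_HTML` and the function name `rows_from_table` are made up for illustration and are not part of the attached source; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup shaped like a schedule table (not real data).
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>Visitor</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>Day 1</th><td>Team A</td><td>101</td></tr>
    <tr><th>Day 2</th><td>Team B</td><td> 99 </td></tr>
  </tbody>
</table>
"""


def rows_from_table(html):
    """Return one list of cell texts per <tr> inside the first <tbody>."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('tbody').find_all('tr'):
        # Collect the leading <th> cell and the <td> cells in document order.
        cells = tr.find_all(['th', 'td'])
        rows.append([cell.get_text().strip() for cell in cells])
    return rows


if __name__ == '__main__':
    for row in rows_from_table(SAMPLE_HTML):
        print(row)
```

Because the header row of this sample sits in `<thead>`, searching only inside `<tbody>` yields data rows and nothing else.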

pry_tables.rar (7 files)

pry_tables
    .idea
        workspace.xml 9KB
        misc.xml 294B
        modules.xml 279B
        pry_tables.iml 408B
        encodings.xml 196B
    paqu_table.py 1KB
    test.xlsx 17KB

    import requests
    from bs4 import BeautifulSoup
    from openpyxl import Workbook

    # Send a browser-like User-Agent header so the request looks like it comes from a browser.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    # Core scraping code
    url = 'https://www.basketball-reference.com/leagues/NBA_2014_games-december.html'
    params = {"show_ram": 1}
    response = requests.get(url, params=params, headers=headers)  # fetch the page
    response.raise_for_status()  # stop early on an HTTP error

    outwb = Workbook()  # create the workbook
    outws = outwb.worksheets[0]  # use its first worksheet

    soup = BeautifulSoup(response.text, 'html.parser')  # parse the page source
    # find() locates the first <tbody>; find_all() returns every <tr> (table row) inside it.
    tr = soup.find('tbody').find_all('tr')

    # Write the header row by hand.
    outws.append(['Date', 'Start(ET)', 'Visitor/Neutral', 'PTS', 'Home/Neutral', 'PTS', '', 'Attend.', 'Arena', 'Notes'])

    for j in tr:
        th = j.find_all('th')   # <th> cell: the date column
        tds = j.find_all('td')  # <td> cells: the remaining columns
        # Skip header-like rows (no <td> cells) rather than blindly dropping
        # the first row, which may be a real game.
        if not th or not tds:
            continue
        listData = [th[0].get_text().strip()]  # one spreadsheet row
        for k in tds:
            listData.append(k.get_text().strip())
        outws.append(listData)

    outwb.save(r'test.xlsx')
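Because get_text() returns strings, the script above writes every cell as text, including the PTS and Attend. columns. If numeric cells are wanted in the spreadsheet, one option is to convert digit-only strings before calling outws.append. The sketch below is only an illustration: `to_number` and `convert_row` are hypothetical helpers that are not part of the attached source, the sample row is made up, and the comma handling assumes attendance figures use a thousands separator.

```python
def to_number(text):
    """Return an int for strings like '102' or '18,997'; otherwise return the text unchanged."""
    cleaned = text.replace(',', '')  # assumption: commas are thousands separators
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return text


def convert_row(row):
    """Apply to_number to every cell of one scraped row."""
    return [to_number(cell) for cell in row]


if __name__ == '__main__':
    # Made-up sample row, shaped like one row of scraped cell texts.
    sample = ['Day 1', '7:00p', 'Team A', '102', 'Team B', '98', '', '18,997']
    print(convert_row(sample))
```

In the script, this would mean passing `convert_row(listData)` to `outws.append` instead of `listData`. Values such as the start time are left as text because they are not purely digits.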
