[Free] Crawler for fetching comments from Douban web pages

2 files in total

py: 1

jsonl: 1

Tag: crawler

Points required: 0 | 137 views | Uploaded 2024-04-10 15:22:01 | 8 KB RAR

获取豆瓣网页评论信息.rar (2 files)

获取豆瓣网页评论信息

train0404-1.jsonl 13KB

10crawlers_douban.py 2KB

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 21 09:25:15 2024

@author: ubuntu
"""
import time

import requests
from bs4 import BeautifulSoup
import jsonlines

records = []

# Send the GET request with a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Base URL of the Douban movie comment page
base_url = 'https://movie.douban.com/subject/26363254/comments'

# Crawl page by page; range(2) fetches the first 2 pages (raise it for more)
for i in range(2):
    # Build the full URL (20 comments per page)
    page_url = f'{base_url}?start={i*20}&limit=20'

    # Send the request and fetch the HTML content
    response = requests.get(page_url, headers=headers, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        comment_list = soup.find_all('div', class_='comment')

        # Parse each comment block the same way
        for comment in comment_list:
            # The commenter link is selected by its empty class attribute,
            # as in the original script
            commenter_tag = comment.find('a', class_='')
            content_tag = comment.find('p', class_='comment-content')
            if commenter_tag is None or content_tag is None:
                continue  # skip blocks that do not have the expected layout
            rating_tag = comment.find('span', class_='rating')

            value = {}
            value['评论者'] = commenter_tag.text  # commenter
            # Rating title; '无评分' means "no rating"
            value['评分'] = rating_tag['title'] if rating_tag else '无评分'
            value['评论内容'] = content_tag.text.strip()  # comment text
            records.append(value)

    # Pause between pages so the IP is not blocked for crawling too fast
    time.sleep(2)

# Write the collected records to train0404-1.jsonl in JSONL format
with jsonlines.open('train0404-1.jsonl', mode='w') as writer:
    writer.write_all(records)
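The script above writes one JSON object per line. As a minimal sketch of how such a file could be read back and summarized, the snippet below uses only the standard library. The helper name read_jsonl, the temporary file, and the three sample records are hypothetical and are not taken from train0404-1.jsonl; only the three keys match what the crawler writes.

```python
import json
import os
import tempfile
from collections import Counter


def read_jsonl(path):
    """Return a list with one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical sample records that reuse the keys written by the crawler.
sample = [
    {"评论者": "user_a", "评分": "力荐", "评论内容": "example comment 1"},
    {"评论者": "user_b", "评分": "无评分", "评论内容": "example comment 2"},
    {"评论者": "user_c", "评分": "力荐", "评论内容": "example comment 3"},
]

# Write the sample to a temporary file so the sketch runs without network access.
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(tmp_path, "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Load the file back and count how often each rating label occurs.
records = read_jsonl(tmp_path)
rating_counts = Counter(r["评分"] for r in records)
print(len(records), dict(rating_counts))
```

To inspect the real output, pass 'train0404-1.jsonl' to read_jsonl instead of the temporary path.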
