Python 3 Web Crawler Series (3): Fetching a Given URL (Visit Count, Read Count), a Worked Example

Uploaded 2021-01-20 02:40:35 · 149 KB PDF · 181 views


When your talent cannot yet carry your ambition, settle down and study.

Preface

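Once the proxy IP pool is up and running, you can use the proxy IPs it provides to visit a given URL and fetch the page. The full source code and further notes are in the GitHub repository Simulate-clicks-on-given-URL, shared for study purposes.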

Code

The following function returns the IP we need, wrapped as a proxy setting that requests can use.

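import requests

PROXY_POOL_URL = 'http://localhost:5555/random'

def get_proxy():
    """Ask the local proxy pool for one address and return a proxies dict."""
    try:
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            ip = response.text.strip()  # drop any trailing newline
            # Build the proxies mapping in the format requests expects
            proxy_ip = "http://" + ip
            proxy_ips = "https://" + ip
            proxy = {"https": proxy_ips, "http": proxy_ip}
            return proxy
    except requests.exceptions.ConnectionError:
        # requests raises its own ConnectionError, which the built-in
        # ConnectionError does not catch
        return None
    return None  # the pool answered with a non-200 status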
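The proxies mapping can also be built and checked without any network call, which makes a malformed pool response easier to spot. The sketch below is a minimal illustration, not part of the original repository: `build_proxies` is a hypothetical helper, and it assumes the pool answers with plain `host:port` text.

```python
def build_proxies(address):
    """Turn 'host:port' text into a requests-style proxies dict.

    Hypothetical helper; assumes the pool returns plain 'host:port' text.
    Returns None when the text does not look like 'host:port'.
    """
    address = address.strip()
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        return None
    return {
        "http": "http://" + address,
        "https": "https://" + address,
    }


# Usage: a well-formed address yields both schemes, anything else yields None.
good = build_proxies("127.0.0.1:8080\n")
bad = build_proxies("not a proxy")
```

Returning None for malformed input mirrors what get_proxy does on failure, so the caller only has to handle one "no proxy" case.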

Sharing cookies to stay logged in

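def get_cookie(url, urls):
    """Load the saved cookies for url from cookie<N>.txt, N being its index in urls."""
    # cookie0.txt, cookie1.txt, cookie2.txt hold the saved cookie strings
    # for urls[0], urls[1], urls[2]; using the index avoids an undefined
    # file handle when url matches none of the hard-coded cases
    index = urls.index(url)
    with open('cookie{}.txt'.format(index), 'r') as f:  # closed automatically
        raw = f.read()
    cookies = {}  # name -> value
    for line in raw.split(';'):  # cookie pairs are separated by ';'
        if '=' not in line:
            continue  # skip empty fragments, e.g. after a trailing ';'
        # maxsplit=1 splits into exactly two parts, so '=' inside a value is kept
        name, value = line.strip().split('=', 1)
        cookies[name] = value
    return cookies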
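To see what the parsing step produces, it helps to run it on a string instead of a file. The following is a small sketch under that assumption; `parse_cookie_string` is a hypothetical name and the cookie values are made up for illustration.

```python
def parse_cookie_string(raw):
    """Parse 'name=value; name2=value2' into a dict (hypothetical helper)."""
    cookies = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item or "=" not in item:
            continue  # ignore empty fragments such as a trailing ';'
        # Split only on the first '=', so values may themselves contain '='
        name, value = item.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up cookie text, for illustration only
example = parse_cookie_string("uuid=abc123; token=a=b; ")
print(example)  # {'uuid': 'abc123', 'token': 'a=b'}
```

The resulting dict has the shape that the `cookies` argument of a requests call accepts.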

Fetching the pages

import random
import requests
from fake_useragent import UserAgent  # assumed source of UserAgent; not shown in the excerpt

def simulate_click(urls, num):
    success = 0
    fail = 0
    referer_list = [
        'https://www.google.com/search?q=csdn&rlz=1C1EJFC_enSE810SE810&oq=csdn&aqs=chrome..69i57j69i59l2j0l5.2484j0j8&sourceid=chrome&ie=UTF-8',
        'http://blog.csdn.net/',
        'https://blog.csdn.net/weixin_41896265',
        'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
        'https://www.baidu.com/s?wd=csdn&rsv_spt=1&rsv_iqid=0xa615ef5b0000b256&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=78040160_26_pg&ch=8&rsv_enter=1&rsv_dl=tb&rsv_sug2=0&inputT=1113&rsv_sug4=1528'
    ]
    while num > 0:
        # Pick a random User-Agent and Referer for this round
        ua = UserAgent()
        headers = {
            'user-agent': ua.random,  # random User-Agent
            'Referer': random.choice(referer_list),  # so the visit appears to come from another page
        }
        proxies = get_proxy()
        for url in urls:
            cookies = get_cookie(url, urls)
            print("[Visiting] {}".format(url))
            try:
                session = requests.session()
                resp = session.get(url, headers=headers, proxies=proxies, cookies=cookies, timeout=3)
                if resp.status_code == requests.codes.ok:
                    print("---------------------------")

[The preview ends here. The remaining 3 pages of the original, including the rest of simulate_click, are not part of this excerpt.]
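The header selection in simulate_click can be tried in isolation. The sketch below is only an illustration of that one step: `pick_headers` is a hypothetical helper, and it swaps the UserAgent object for a short hard-coded list of User-Agent strings (an assumption made here so the example needs nothing beyond the standard library).

```python
import random

# Illustrative values only; the article draws the User-Agent from a
# UserAgent object and the Referer from referer_list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]
REFERERS = [
    "http://blog.csdn.net/",
    "https://blog.csdn.net/weixin_41896265",
]


def pick_headers(user_agents, referers, rng=random):
    """Return request headers with a randomly chosen User-Agent and Referer."""
    return {
        "user-agent": rng.choice(user_agents),
        "Referer": rng.choice(referers),
    }


# A seeded generator makes the choice repeatable while experimenting
headers = pick_headers(USER_AGENTS, REFERERS, random.Random(0))
```

Passing the random generator as a parameter is a design choice for this sketch: the function behaves the same with the global `random` module, but a seeded `random.Random` keeps runs reproducible.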


Uploader: weixin_38698149
