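Source code and notes: everyday Python tips from work, web-scraping study notes and exercises, directory monitoring, file handling, and Python data-analysis study notes - CSDN文库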

605 files in total

md: 332

png: 196

py: 23

python

web scraping

data analysis

Points required: 5 · 131 views · Uploaded 2024-08-11 22:15:02 · 55.5MB ZIP

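Package contents (605 files): everyday Python tips from work, web-scraping study notes and exercises, directory monitoring, file handling, and Python data-analysis study notes with source code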

wind_data.csv 520KB

appl_1980_2014.csv 416KB

chipotle.csv 379KB

cars1.csv 11KB

cars2.csv 10KB

US_Crime_Rates_1960_2014.csv 5KB

drinks.csv 5KB

iris.csv 4KB

euro12.csv 2KB

225_1.gif 22.14MB

239_1.gif 500KB

226_1.gif 259KB

206_1.gif 219KB

150_1.gif 213KB

144_1.gif 207KB

232_1.gif 194KB

112_1.gif 165KB

94_1.gif 153KB

257_1.gif 136KB

209_1.gif 92KB

19_1.gif 91KB

15_1.gif 48KB

05_1.gif 35KB

1047_1.gif 33KB

.gitignore 30B

openssl-1.1.1g.tar.gz 9.35MB

pcre-8.44.tar.gz 1.99MB

nginx-1.18.0.tar.gz 1015KB

zlib-1.2.11.tar.gz 593KB

1.html 240KB

2.html 222KB

detail.html 476B

index.html 267B

result.html 263B

weibo5.jpg 430KB

namedtuple.jpg 181KB

weibo4.jpg 156KB

diguistack.jpg 114KB

xiecheng.jpg 98KB

stack.jpg 83KB

django2.jpg 82KB

nginxhttps.jpg 72KB

hashtable.jpg 34KB

django1.jpg 32KB

112_1.jpg 21KB

jiaocuotiao.jpg 21KB

vtiao.jpg 15KB

redis_1.jpg 14KB

htiao.jpg 14KB

tu.jpg 14KB

ex_depth.jpg 13KB

balance_1.jpg 11KB

Python爬虫--顺企网企业信息爬取.md 19KB

Python数据分析学习二--pandas库.md 16KB

README_cp.md 15KB

Python数据分析学习四--合并数据集.md 13KB

Python爬虫--武汉15个地区9万条二手房数据.md 13KB

Python爬虫--代理池维护.md 13KB

main.md 13KB

Python常用小知识点(更新ing……).md 12KB

Python爬虫--利用代理池系统爬取微信公众号文章.md 11KB

git文档.md 11KB

Git基础.md 10KB

Python基础总结二.md 10KB

一、网络基础知识.md 10KB

网络基础知识.md 9KB

Python数据分析学习一--numpy.md 9KB

装饰器.md 9KB

编写一个投票web站点.md 9KB

Linux文件系统.md 8KB

《深入浅出统计学》笔记上.md 8KB

Python爬虫之selenium自动化.md 8KB

Python实现TFTP文件传输.md 8KB

python爬虫之网易云音乐.md 8KB

ZeroMQ基础.md 8KB

深入浅出统计学笔记上.md 8KB

Python实现TFTP文件传输.md 8KB

Python数据分析学习五--数据转换、过滤清理.md 8KB

Python数据分析学习七--groupby分组.md 8KB

redis基础理论.md 7KB

python中的协程.md 7KB

Python之多线程.md 7KB

TCP连接的建立与释放.md 7KB

http首部字段.md 7KB

《图解HTTP》—http首部字段总结.md 7KB

《图解HTTP》学习笔记之协议.md 7KB

TCPIP总结.md 7KB

Python爬虫之知乎钓鱼贴图片爬取.md 6KB

查询.md 6KB

python爬虫之一尘论坛发帖数据爬取.md 6KB

Python基础总结一.md 6KB

Python爬虫之构建自己的代理IP池.md 6KB

Python简单处理csv，json，xml，Excel文件.md 6KB

Python爬虫之BeautifulSoup.md 6KB

Linux文件目录管理.md 6KB

sed命令与awk命令.md 6KB

Python爬虫--selenium爬取淘宝商品信息.md 6KB

Python爬虫--scrapy爬虫框架入门.md 6KB

三、IP协议.md 6KB

import traceback

import requests
from bs4 import BeautifulSoup

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36')
HTML_ACCEPT = ('text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,'
               'image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9')
# Added during editing: requests has no default timeout, so a dead site could block forever.
TIMEOUT = 10


class Crawler(object):
    """Crawl free proxies from proxy-listing sites and return them as 'ip:port' strings."""

    def get_crawler_proxy(self):
        proxy_set_taiyang = self.crawl_taiyang()
        proxy_set_89 = self.crawl_89ip()
        proxy_set_66 = self.crawl_66ip()
        # Set union drops proxies that appear on more than one site.
        return proxy_set_taiyang | proxy_set_89 | proxy_set_66

    def crawl_66ip(self):
        print('Crawling 66ip proxies......')
        proxy_set = set()
        headers = {
            'Accept': HTML_ACCEPT,
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Host': 'www.66ip.cn',
            'User-Agent': USER_AGENT,
        }
        for i in range(1, 5):  # pages 1-4
            try:
                url = 'http://www.66ip.cn/{}.html'.format(i)
                res = requests.get(url, headers=headers, timeout=TIMEOUT)
                soup = BeautifulSoup(res.text, 'html.parser')
                tables = soup.select('.container table')
                if not tables:
                    continue
                # Skip the header row, then read IP and port from the first two cells.
                for tr in tables[0].select('tr')[1:]:
                    tds = tr.select('td')
                    ip = tds[0].text.strip()
                    port = tds[1].text.strip()
                    proxy_set.add('{}:{}'.format(ip, port))
            except Exception:
                print('Error while crawling 66ip proxies')
                print(traceback.format_exc())
        print('Crawled {} proxies from 66ip'.format(len(proxy_set)))
        return proxy_set

    def crawl_taiyang(self):
        print('Crawling Taiyang proxies......')
        url = 'http://ty-http-d.upupfile.com/index/index/get_free_ip'
        headers = {
            'Accept': 'text/html, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Host': 'ty-http-d.upupfile.com',
            'User-Agent': USER_AGENT,
        }
        proxy_set = set()
        for i in range(1, 15):  # pages 1-14
            try:
                data = {'page': i}
                res = requests.post(url, data=data, headers=headers, timeout=TIMEOUT)
                # The endpoint returns JSON with an HTML fragment inside ret_data.html.
                html = res.json()['ret_data']['html']
                soup = BeautifulSoup(html, 'html.parser')
                for item in soup.find_all(class_='tr ip_tr'):
                    divs = item.select('div')
                    ip = divs[0].text.replace(' ', '').replace('\n', '')
                    port = divs[1].text.replace(' ', '').replace('\n', '')
                    proxy_set.add('{}:{}'.format(ip, port))
            except Exception:
                print('Error while crawling Taiyang proxies')
                print(traceback.format_exc())
        print('Crawled {} proxies from Taiyang'.format(len(proxy_set)))
        return proxy_set

    def crawl_89ip(self):
        print('Crawling 89ip proxies......')
        proxy_set = set()
        headers = {
            'Accept': HTML_ACCEPT,
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Host': 'www.89ip.cn',
            'User-Agent': USER_AGENT,
        }
        for i in range(1, 15):  # pages 1-14
            try:
                url = 'https://www.89ip.cn/index_{}.html'.format(i)
                res = requests.get(url, headers=headers, timeout=TIMEOUT)
                soup = BeautifulSoup(res.text, 'html.parser')
                table = soup.find(class_='layui-table')
                for tr in table.select('tr'):
                    tds = tr.select('td')
                    if len(tds) > 2:  # header rows have no <td> cells
                        ip = tds[0].text.replace(' ', '').replace('\n', '').strip()
                        port = tds[1].text.replace(' ', '').replace('\n', '').strip()
                        proxy_set.add('{}:{}'.format(ip, port))
            except Exception:
                print('Error while crawling 89ip proxies')
                print(traceback.format_exc())
        print('Crawled {} proxies from 89ip'.format(len(proxy_set)))
        return proxy_set


if __name__ == '__main__':
    p = Crawler().get_crawler_proxy()
    print(p)
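The crawler above returns whatever text it finds in the table cells, so the resulting set may contain malformed entries if a site changes its layout. The following is a minimal sketch, not part of the original package, of how the `ip:port` strings could be checked before use. The helper names `is_valid_proxy`, `filter_valid_proxies`, and `to_requests_proxies` are hypothetical, and the sketch assumes IPv4-only proxies reached over plain HTTP.

```python
import ipaddress


def is_valid_proxy(proxy):
    """Return True if `proxy` looks like 'IPv4:port' with a port in 1-65535."""
    host, sep, port = proxy.strip().rpartition(':')
    # Reject strings without a colon or with a non-numeric port.
    if not sep or not (port.isascii() and port.isdigit()):
        return False
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        return False
    return 1 <= int(port) <= 65535


def filter_valid_proxies(proxies):
    """Keep only well-formed entries, stripped of surrounding whitespace."""
    return {p.strip() for p in proxies if is_valid_proxy(p)}


def to_requests_proxies(proxy):
    """Build the `proxies` mapping that requests accepts (assumes an HTTP proxy)."""
    address = 'http://' + proxy
    return {'http': address, 'https': address}


if __name__ == '__main__':
    crawled = {'1.2.3.4:8080', 'IP:PORT', '999.1.1.1:80', ' 5.6.7.8:3128 '}
    for proxy in sorted(filter_valid_proxies(crawled)):
        print(proxy, to_requests_proxies(proxy))
```

A format check like this only filters out garbage; whether a free proxy is actually reachable still has to be tested with a real request.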
