[Python Crawler] - Python Web Scraping Tutorial
Uploaded 2024-07-10 18:40:19 (PDF, 849 KB)
Python Web Scraping
About the Tutorial
Web scraping, also called web data mining or web harvesting, is the process of
constructing an agent which can extract, parse, download and organize useful information
from the web automatically.
This tutorial teaches the core concepts of web scraping and helps you become comfortable
with scraping various types of websites and their data.
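The definition above names four steps: download, extract, parse, and organize. A minimal
sketch of the last three, using only the Python standard library, might look like the
following. The sample HTML, the `LinkCollector` class, and the `scrape_links` helper are
hypothetical names introduced here for illustration; the download step is replaced by an
inline string so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/book/1">Dune</a>
  <a href="/book/2">Emma</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the label and href of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []             # organized output: one dict per link
        self._current_href = None   # href of the <a> tag being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text seen while an <a href=...> tag is open is the link label.
        if self._current_href is not None and data.strip():
            self.links.append({"text": data.strip(), "href": self._current_href})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def scrape_links(html):
    """Parse the HTML and return the collected links as a list of dicts."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = scrape_links(SAMPLE_HTML)
for link in links:
    print(link["text"], "->", link["href"])
```

The tutorial's own chapters use third-party modules such as Requests and Beautiful Soup
for these steps; the standard-library parser is used here only to keep the sketch
self-contained.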
Audience
This tutorial will be useful for graduates, postgraduates, and research students who either
have an interest in this subject or have it as a part of their curriculum. The tutorial
suits the learning needs of both beginners and advanced learners.
Prerequisites
The reader must have basic knowledge of HTML, CSS, and JavaScript. The reader should
also be familiar with the basic terminology used in web technology, along with Python
programming concepts. If you are not familiar with these concepts, we suggest you
go through tutorials on them first.
Copyright & Disclaimer
Copyright 2018 by Tutorials Point (I) Pvt. Ltd.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing,
or republishing any contents or part of the contents of this e-book in any manner without the
written consent of the publisher.
We strive to update the contents of our website and tutorials as promptly and as precisely as
possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness, or completeness of our
website or its contents, including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at contact@tutorialspoint.com.
Table of Contents
About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents
1. PYTHON WEB SCRAPING – INTRODUCTION
What is Web Scraping?
Origin of Web Scraping
Web Crawling v/s Web Scraping
Uses of Web Scraping
Components of a Web Scraper
Working of a Web Scraper
2. PYTHON WEB SCRAPING – GETTING STARTED WITH PYTHON
Why Python for Web Scraping?
Installation of Python
Setting Up the PATH
Running Python
3. PYTHON WEB SCRAPING – PYTHON MODULES FOR WEB SCRAPING
Python Development Environments using virtualenv
Python Modules for Web Scraping
Requests
Urllib3
Selenium
Scrapy
4. PYTHON WEB SCRAPING – LEGALITY OF WEB SCRAPING
Introduction
Research Required Prior to Scraping
5. PYTHON WEB SCRAPING – DATA EXTRACTION
Web page Analysis
Different Ways to Extract Data from Web Page
Beautiful Soup
Lxml
6. PYTHON WEB SCRAPING – DATA PROCESSING
Introduction
CSV and JSON Data Processing
Data Processing using AWS S3
Data processing using MySQL
Data processing using PostgreSQL
7. PYTHON WEB SCRAPING – PROCESSING IMAGES AND VIDEOS
Introduction
Getting Media Content from Web Page
Extracting Filename from URL
Information about Type of Content from URL
Generating Thumbnail for Images
Screenshot from Website
Thumbnail Generation for Video
Ripping an MP4 video to an MP3
8. PYTHON WEB SCRAPING – DEALING WITH TEXT
Introduction
Getting started with NLTK
Installing Other Necessary packages
Tokenization
Stemming
Lemmatization
Chunking
Bag of Words (BoW) Model: Extracting and Converting the Text into Numeric Form
Building a Bag of Words Model in NLTK
Topic Modeling: Identifying Patterns in Text Data
Topic Modeling Algorithms
9. PYTHON WEB SCRAPING – SCRAPING DYNAMIC WEBSITES
Introduction
Dynamic Website Example
Approaches for Scraping data from Dynamic Websites
Reverse Engineering JavaScript
Rendering JavaScript
10. PYTHON WEB SCRAPING – SCRAPING FORM BASED WEBSITES
Introduction
Interacting with Login forms
Loading Cookies from the Web Server
Automating forms with Python
11. PYTHON WEB SCRAPING – PROCESSING CAPTCHA
What is CAPTCHA?
Loading CAPTCHA with Python
Pillow Python Package
OCR: Extracting Text from Image using Python
12. PYTHON WEB SCRAPING – TESTING WITH SCRAPERS
Introduction