Web Scraping with Python, 2nd Edition.pdf

Required points/C-coins: 50 2019-05-11 11:52:44 6.47MB PDF

Web Scraping with Python: Collecting More Data from the Modern Web. A classic Python book. Native (non-scanned) PDF with clear text and table-of-contents bookmarks. Second edition, published in 2018.
SECOND EDITION
Web Scraping with Python: Collecting More Data from the Modern Web
Ryan Mitchell
Beijing, Boston, Farnham, Sebastopol, Tokyo
O'Reilly

Web Scraping with Python
by Ryan Mitchell
Copyright © 2018 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Allyson MacDonald
Production Editor: Justin Billing
Copyeditor: Sharon Wilkey
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: Second Edition
Revision History for the Second Edition
2018-03-20: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-98557-1

Table of Contents

Preface

Part I. Building Scrapers
1. Your First Web Scraper: Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably and Handling Exceptions
2. Advanced HTML Parsing: You Don't Always Need a Hammer; Another Serving of BeautifulSoup; find() and find_all() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions
3. Writing Web Crawlers: Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet
4. Web Crawling Models: Planning and Defining Objects; Dealing with Different Website Layouts; Structuring Crawlers; Crawling Sites Through Search; Crawling Sites Through Links; Crawling Multiple Page Types; Thinking About Web Crawler Models
5. Scrapy: Installing Scrapy; Initializing a New Spider; Writing a Simple Scraper; Spidering with Rules; Creating Items; Outputting Items; The Item Pipeline; Logging with Scrapy; More Resources
6. Storing Data: Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; Six Degrees in MySQL; Email

Part II. Advanced Scraping
7. Reading Documents: Document Encoding; Text Encoding and the Global Internet; CSV; Reading CSV Files; PDF; Microsoft Word and .docx
8. Cleaning Your Dirty Data: Cleaning in Code; Data Normalization; Cleaning After the Fact; OpenRefine
9. Reading and Writing Natural Languages: Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources
10. Crawling Through Forms and Logins: Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication; Other Form Problems
11. Scraping JavaScript: A Brief Introduction to JavaScript; Common JavaScript Libraries; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Additional Selenium Webdrivers; Handling Redirects; A Final Note on JavaScript
12. Crawling Through APIs: A Brief Introduction to APIs; HTTP Methods and APIs; More About API Responses; Parsing JSON; Undocumented APIs; Finding Undocumented APIs; Documenting Undocumented APIs; Finding and Documenting APIs Automatically; Combining APIs with Other Data Sources; More About APIs
13. Image Processing and Text Recognition: Overview of Libraries; Pillow; Tesseract; NumPy; Processing Well-Formatted Text; Adjusting Images Automatically; Scraping Text from Images on Websites; Reading CAPTCHAs and Training Tesseract; Training Tesseract; Retrieving CAPTCHAs and Submitting Solutions
14. Avoiding Scraping Traps: A Note on Ethics; Looking Like a Human; Adjust Your Headers; Handling Cookies with JavaScript; Timing Is Everything; Common Form Security Features; Hidden Input Field Values; Avoiding Honeypots; The Human Checklist
15. Testing Your Website with Scrapers: An Introduction to Testing; What Are Unit Tests?; Python unittest; Testing Wikipedia; Testing with Selenium; Interacting with the Site; unittest or Selenium?
16. Web Crawling in Parallel: Processes versus Threads; Multithreaded Crawling; Race Conditions and Queues; The threading Module; Multiprocess Crawling; Multiprocess Crawling; Communicating Between Processes; Multiprocess Crawling: Another Approach
17. Scraping Remotely: Why Use Remote Servers?; Avoiding IP Address Blocking; Portability and Extensibility; PySocks; Remote Hosting; Running from a Website-Hosting Account; Running from the Cloud; Additional Resources
18. The Legalities and Ethics of Web Scraping: Trademarks, Copyrights, Patents, Oh My!; Copyright Law; Trespass to Chattels; The Computer Fraud and Abuse Act; robots.txt and Terms of Service; Three Web Scrapers; eBay versus Bidder's Edge and Trespass to Chattels; United States v. Auernheimer and The Computer Fraud and Abuse Act; Field v. Google: Copyright and robots.txt; Moving Forward

Index

Preview (127 pages): Web Scraping with Python, 2nd Edition.pdf