Web Scraping with Python

Points/C coins required: 2 | 2015-08-20 10:56:06 | 6.36 MB | PDF

Web Scraping with Python: Collecting Data from the Modern Web (English edition)
Web Scraping with Python: Collecting Data from the Modern Web
Ryan Mitchell
O'Reilly: Beijing · Boston · Farnham · Sebastopol · Tokyo

Web Scraping with Python
by Ryan Mitchell

Copyright © 2015 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2015: First Edition

Revision History for the First Edition
2015-06-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491910276 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91027-6 [LSI]

Table of Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably
2. Advanced HTML Parsing
    You Don't Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and findAll() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions
    Beyond BeautifulSoup
3. Starting to Crawl
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet
    Crawling with Scrapy
4. Using APIs
    How APIs Work
    Common Conventions
    Methods
    Authentication
    Responses
    API Calls
    Echo Nest
    A Few Examples
    Twitter
    Getting Started
    A Few Examples
    Google APIs
    Getting Started
    A Few Examples
    Parsing JSON
    Bringing It All Back Home
    More About APIs
5. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    "Six Degrees" in MySQL
    Email
6. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx

Part II. Advanced Scraping

7. Cleaning Your Dirty Data
    Cleaning in Code
    Data Normalization
    Cleaning After the Fact
    OpenRefine
8. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources
9. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    HTTP Basic Access Authentication
    Other Form Problems
10. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Handling Redirects
11. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    Processing Well-Formatted Text
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Retrieving CAPTCHAs and Submitting Solutions
12. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist
13. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    Unittest or Selenium?
14. Scraping Remotely
    Why Use Remote Servers
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website Hosting Account
    Running from the Cloud
    Additional Resources
    Moving Forward

A. Python at a Glance
B. The Internet at a Glance
C. The Legalities and Ethics of Web Scraping
Index

Preface

To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, then web scraping is wizardry; that is, the application of magic for particularly impressive and useful, yet surprisingly effortless, feats.

In fact, in my years as a software engineer, I've found that very few programming practices capture the excitement of both programmers and laymen alike quite like web scraping.
The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.

It's unfortunate that when I speak to other programmers about web scraping, there's a lot of misunderstanding and confusion about the practice. Some people aren't sure if it's legal (it is), or how to handle the modern Web, with all its JavaScript, multimedia, and cookies. Some get confused about the distinction between APIs and web scrapers.

This book seeks to put an end to many of these common questions and misconceptions about web scraping, while providing a comprehensive guide to most common web-scraping tasks.

Beginning in Chapter 1, I'll provide code samples periodically to demonstrate concepts. These code samples are in the public domain, and can be used with or without attribution (although acknowledgment is always appreciated). All code samples also will be available on the website for viewing and downloading.

What Is Web Scraping?

The automated gathering of data from the Internet is nearly as old as the Internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I'll use throughout the book, although I will occasionally refer to the web-scraping programs themselves as bots.

In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser).
This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis and information security. This book will cover the basics of web scraping and crawling (Part I), and delve into some of the advanced topics in Part II.

Why Web Scraping?

If the only way you access the Internet is through a browser, you're missing out on a huge range of possibilities. Although browsers are handy for executing JavaScript, displaying images, and arranging objects in a more human-readable format (among other things), web scrapers are excellent at gathering and processing large amounts of data (among other things). Rather than viewing one page at a time through the narrow window of a monitor, you can view databases spanning thousands or even millions of pages at once.

In addition, web scrapers can go places that traditional search engines cannot. A Google search for "cheapest flights to Boston" will result in a slew of advertisements and popular flight search sites. Google only knows what these websites say on their content pages, not the exact results of various queries entered into a flight search application. However, a well-developed web scraper can chart the cost of a flight to Boston over time, across a variety of websites, and tell you the best time to buy your ticket.

You might be asking: "Isn't data gathering what APIs are for?" (If you're unfamiliar with APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your purposes. They can provide a convenient stream of well-formatted data from one server to another. You can find an API for many different types of data you might…
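
The query, request, and parse steps described under "What Is Web Scraping?" can be sketched in a few lines of Python. This is only an illustration and is not taken from the book: it uses the standard library's html.parser rather than the BeautifulSoup library listed in the table of contents, and the names TitleParser, extract_title, scrape_title, and the sample HTML string are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Parse step: pull the page title out of an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title else None


def scrape_title(url):
    """Query and request steps: fetch a page, then parse it.

    Requires network access; the URL is supplied by the caller.
    """
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_title(html)


# Hypothetical sample page, so the parse step can be tried offline.
sample = "<html><head><title> Example Page </title></head><body><h1>Hi</h1></body></html>"
print(extract_title(sample))
```

Keeping the parsing in its own function means it can be exercised on a saved HTML string without touching the network; scrape_title only adds the fetch on top of it.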

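lengyff: It's okay; just barely usable.
tititatitita: The original English edition, with bookmarks and page footers. The content is very good.
mlong611: Not bad, works well.
波西米: A pretty good book, good for beginners and quite interesting. Thanks for sharing.
tanglong8834: Works well, works well. A good resource.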