Python Web Scraping Cookbook-Packt Publishing(2018).pdf

Required points/C-coins: 10 | 2018-03-31 22:56:03 | 15.98 MB | PDF

The internet contains a wealth of data. This data is provided both through structured APIs and through content delivered directly by websites. While the data in APIs is highly structured, information found in web pages is often unstructured and requires collection, extraction, and processing to be of value. Collecting data is just the start of the journey, as that data must also be stored, mined, and then exposed to others in a value-added form. With this book, you will learn many of the core tasks needed in collecting various forms of information from websites. We will cover how to collect it, how to perform several common data operations (including storage in local and remote databases), how to perform common media-based tasks such as converting images and videos to thumbnails, how to clean unstructured data with NLTK, how to examine several data mining and visualization tools, and finally core skills in building a microservices-based scraper and API that can, and will, be run on the cloud. Through a recipe-based approach, we will learn independent techniques to solve specific tasks involved in not only scraping but also data manipulation and management, data mining, visualization, microservices, containers, and cloud operations. These recipes will build skills in a progressive and holistic manner, not only teaching how to perform the fundamentals of scraping but also taking you from the results of scraping to a service offered to others through the cloud. We will be building an actual web-scraper-as-a-service using common tools in the Python, container, and cloud ecosystems.
Python Web Scraping Cookbook

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Shraddha Falebhai

First published: February 2018
Production reference: 1070218

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78728-521-7

www.packtpub.com

Contributors

About the author

Michael Heydt is an independent consultant specializing in social, mobile, analytics, and cloud technologies, with an emphasis on cloud native 12-factor applications. Michael has been a software developer and trainer for over 30 years and is the author of books such as D3.js By Example, Learning Pandas, Mastering Pandas for Finance, and Instant Lucene.NET.
You can find more information about him on LinkedIn at michaelheydt.

I would like to greatly thank my family for putting up with me disappearing for months on end and sacrificing my sparse free time to indulge in creation of content and books like this one. They are my true inspiration and enablers.

About the reviewers

Mei Lu is the founder and CEO of Jobfully, providing career coaching for software developers and engineering leaders. She is also a Career/Executive Coach for Carnegie Mellon University Alumni Association, specializing in the software/high-tech industry. Previously, Mei was a software engineer and an engineering manager at Qpass, M.I.T., and MicroStrategy. She received her MS in Computer Science from the University of Pennsylvania and her MS in Engineering from Carnegie Mellon University.

Lazar Telebak is a freelance web developer specializing in web scraping, crawling, and indexing web pages using Python libraries/frameworks. He has worked mostly on projects of automation, website scraping, crawling, and exporting data in various formats (CSV, JSON, XML, and TXT) and databases such as MongoDB, SQLAlchemy, and Postgres. Lazar also has experience of frontend technologies and languages such as HTML, CSS, JavaScript, and jQuery.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Mapt

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?
  • Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
  • Improve your learning with skill plans built especially for you
  • Get a free eBook or video every month
  • Mapt is fully searchable
  • Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Table of Contents

Preface

Chapter 1: Getting Started with Scraping
  Introduction
  Setting up a Python development environment (Getting ready; How to do it)
  Scraping Python.org with Requests and Beautiful Soup (Getting ready; How to do it; How it works)
  Scraping Python.org in urllib3 and Beautiful Soup (Getting ready; How to do it; How it works; There's more)
  Scraping Python.org with Scrapy (Getting ready; How to do it; How it works)
  Scraping Python.org with Selenium and PhantomJS (Getting ready; How to do it; How it works; There's more)

Chapter 2: Data Acquisition and Extraction
  Introduction
  How to parse websites and navigate the DOM using BeautifulSoup (Getting ready; How to do it; How it works; There's more)
  Searching the DOM with Beautiful Soup's find methods (Getting ready; How to do it)
  Querying the DOM with XPath and lxml (Getting ready; How to do it; How it works; There's more)
  Querying data with XPath and CSS selectors (Getting ready; How to do it; How it works; There's more)
  Using Scrapy selectors (Getting ready; How to do it; How it works; There's more)
  Loading data in Unicode / UTF-8 (Getting ready; How to do it; How it works; There's more)

Chapter 3: Processing Data
  Introduction
  Working with CSV and JSON data (Getting ready; How to do it; How it works; There's more)
  Storing data using AWS S3 (Getting ready; How to do it; How it works; There's more)
  Storing data using MySQL (Getting ready; How to do it; How it works; There's more)
  Storing data using PostgreSQL (Getting ready; How to do it; How it works; There's more)
  Storing data in Elasticsearch (Getting ready; How to do it; How it works; There's more)
  How to build robust ETL pipelines with AWS SQS (Getting ready; How to do it - posting messages to an AWS queue; How it works; How to do it - reading and processing messages; How it works; There's more)

Chapter 4: Working with Images, Audio, and other Assets
  Introduction
  Downloading media content from the web (Getting ready; How to do it; How it works; There's more)
  Parsing a URL with urllib to get the filename (Getting ready; How to do it; How it works; There's more)
  Determining the type of content for a URL (Getting ready; How to do it; How it works; There's more)
  Determining the file extension from a content type (Getting ready; How to do it; How it works; There's more)
  Downloading and saving images to the local file system (How to do it; How it works; There's more)
  Downloading and saving images to S3 (Getting ready; How to do it; How it works; There's more)
  Generating thumbnails for images (Getting ready; How to do it; How it works)
  Taking a screenshot of a website (Getting ready; How to do it; How it works)
  Taking a screenshot of a website with an external service (Getting ready; How to do it; How it works; There's more)
  Performing OCR on an image with pytesseract (Getting ready; How to do it; How it works; There's more)
  Creating a video thumbnail (Getting ready; How to do it; How it works; There's more)
  Ripping an MP4 video to an MP3 (Getting ready; How to do it; There's more)

Chapter 5: Scraping - Code of Conduct
