Practical Data Science Cookbook

Contents

Chapter 5: Visually Exploring Employment Data
- Introduction
- Preparing for analysis
- Importing employment data into R
- Exploring the employment data
- Obtaining and merging additional data
- Adding geographical information
- Extracting state- and county-level wage and employment information
- Visualizing geographical distributions of pay
- Exploring where the jobs are, by industry
- Animating maps for a geospatial time series
- Benchmarking performance for some common tasks

Chapter 6: Driving Visual Analyses with Automobile Data
- Introduction
- Getting started with IPython
- Exploring Jupyter Notebook
- Preparing to analyze automobile fuel efficiencies
- Exploring and describing fuel efficiency data with Python
- Analyzing automobile fuel efficiency over time with Python
- Investigating the makes and models of automobiles with Python

Chapter 7: Working with Social Graphs
- Introduction
- Preparing to work with social networks in Python
- Importing networks
- Exploring subgraphs within a heroic network
- Finding strong ties
- Finding key players
- Exploring the characteristics of entire networks
- Clustering and community detection in social networks
- Visualizing graphs
- Social networks in R

Chapter 8: Recommending Movies at Scale (Python)
- Introduction
- Modeling preference expressions
- Understanding the data
- Ingesting the movie review data
- Finding the highest-scoring movies
- Improving the movie-rating system
- Measuring the distance between users in the preference space
- Computing the correlation between users
- Finding the best critic for a user
- Predicting movie ratings for users
- Collaboratively filtering item by item
- Building a non-negative matrix factorization model
- Loading the entire dataset into memory
- Dumping the SVD-based model to disk
- Training the SVD-based model
- Testing the SVD-based model

Chapter 9: Harvesting and Geolocating Twitter Data (Python)
- Introduction
- Creating a Twitter application
- Understanding the Twitter API v1.1
- Determining your Twitter followers and friends
- Pulling Twitter user profiles
- Making requests without running afoul of Twitter's rate limits
- Storing JSON data to disk
- Setting up MongoDB for storing Twitter data
- Storing user profiles in MongoDB using PyMongo
- Exploring the geographic information available in profiles
- Plotting geospatial data in Python

Chapter 10: Forecasting New Zealand Overseas Visitors
- Introduction
- The ts object
- Visualizing time series data
- Simple linear regression models
- ACF and PACF
- ARIMA models
- Accuracy measurements
- Fitting seasonal ARIMA models

Chapter 11: German Credit Data Analysis
- Introduction
- Simple data transformations
- Visualizing categorical data
- Discriminant analysis
- Dividing the data and the ROC
- Fitting the logistic regression model
- Decision trees and rules
- Decision tree for German data

Chapter 1. Preparing Your Data Science Environment

A traditional cookbook contains culinary recipes of interest to the authors and helps readers expand their repertoire of foods to prepare. Many might believe that the end product of a recipe is the dish itself, and one can read this book in much the same way. Every chapter guides the reader through the application of the stages of the data science pipeline to different datasets with various goals. Also, just as in cooking, the final product can simply be the analysis applied to a particular dataset.

We hope that you will take a broader view, however. Data scientists learn by doing, ensuring that every iteration and hypothesis improves the practitioner's knowledge base.
By taking multiple datasets through the data science pipeline using two different programming languages (R and Python), we hope that you will start to abstract out the analysis patterns, see the bigger picture, and achieve a deeper understanding of this rather ambiguous field of data science.

We also want you to know that, unlike culinary recipes, data science recipes are ambiguous. When chefs begin a particular dish, they have a very clear picture in mind of what the finished product will look like. For data scientists, the situation is often different. One does not always know what the dataset in question will look like, and what might or might not be possible, given the amount of time and resources. Recipes are essentially a way to dig into the data and get started on the path towards asking the right questions to complete the best dish possible.

If you are from a statistical or mathematical background, the modeling techniques on display might not excite you per se. Pay attention to how many of the recipes overcome practical issues in the data science pipeline, such as loading large datasets and working with scalable tools, to adapt known techniques to create data applications, interactive graphics, and web pages rather than reports and papers. We hope that these aspects will enhance your appreciation and understanding of data science and help you apply good data science to your domain.

Practicing data scientists require a great number and diversity of tools to get the job done. Data practitioners scrape, clean, visualize, model, and perform a million different tasks with a wide array of tools.
If you ask most people working with data, you will learn that the foremost component in this toolset is the language used to perform the analysis and modeling of the data. Identifying the best programming language for a particular task is akin to asking which world religion is correct, just with slightly less bloodshed.

In this book, we split our attention between two highly regarded, yet very different, languages used for data analysis, R and Python, and leave it up to you to make your own decision as to which language you prefer. We will help you by dropping hints along the way as to the suitability of each language for various tasks, and we'll compare and contrast similar analyses done on the same dataset with each language.

When you learn new concepts and techniques, there is always the question of depth versus breadth. Given a fixed amount of time and effort, should you work towards achieving moderate proficiency in both R and Python, or should you go all in on a single language? From our professional experiences, we strongly recommend that you aim to master one language and have awareness of the other. Does that mean skipping chapters on a particular language? Absolutely not! However, as you go through this book, pick one language and dig deeper, looking not only to develop conversational ability, but also fluency.
To prepare for this chapter, ensure that you have sufficient bandwidth to download up to several gigabytes of software in a reasonable amount of time.

Understanding the data science pipeline

Before we start installing any software, we need to understand the repeatable set of steps that we will use for data analysis throughout the book.

How to do it...

The following are the five key steps for data analysis:

1. Acquisition: The first step in the pipeline is to acquire the data from a variety of sources, including relational databases, NoSQL and document stores, web scraping, distributed databases such as HDFS on a Hadoop platform, RESTful APIs, flat files, and, hopefully this is not the case, PDFs.
2. Exploration and understanding: The second step is to come to an understanding of the data that you will use and how it was collected; this often requires significant exploration.
3. Munging, wrangling, and manipulation: This step is often the single most time-consuming and important step in the pipeline. Data is almost never in the needed form for the desired analysis.
4. Analysis and modeling: This is the fun part where the data scientist gets to explore the statistical relationships between the variables in the data and pulls out his or her bag of machine learning tricks to cluster, categorize, or classify the data and create predictive models to see into the future.
5. Communicating and operationalizing: At the end of the pipeline, we need to give the data back in a compelling form and structure, sometimes to ourselves to inform the next iteration and sometimes to a completely different audience. The data products produced can be a simple one-off report or a scalable web product that will be used interactively by millions.

How it works...

Although the preceding is a numbered list, don't assume that every project will strictly adhere to this exact linear sequence. In fact, agile data scientists know that this process is highly iterative.
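To make the five steps concrete, here is a deliberately tiny end-to-end sketch in Python. The dataset is an invented in-memory CSV (the county and wage figures are hypothetical placeholders, not data from the book), and the "model" is just a per-group mean, but each numbered stage of the pipeline appears in order:

```python
import csv
import io
import statistics

# 1. Acquisition: in practice this might be a database query, an API call,
#    or a scrape; here an in-memory CSV string stands in (hypothetical data).
raw = """county,industry,avg_weekly_wage
Alameda,Software,2100
Alameda,Retail,
Kings,Software,1750
Kings,Retail,610
"""

# 2. Exploration and understanding: peek at the rows and notice the gap.
rows = list(csv.DictReader(io.StringIO(raw)))
print(f"{len(rows)} rows, columns: {list(rows[0])}")

# 3. Munging, wrangling, and manipulation: drop records with a missing
#    wage and cast the rest to float.
clean = [
    {**r, "avg_weekly_wage": float(r["avg_weekly_wage"])}
    for r in rows
    if r["avg_weekly_wage"].strip()
]

# 4. Analysis and modeling: a toy "model", the mean wage per industry.
by_industry = {}
for r in clean:
    by_industry.setdefault(r["industry"], []).append(r["avg_weekly_wage"])
model = {k: statistics.mean(v) for k, v in by_industry.items()}

# 5. Communicating and operationalizing: a one-off textual report.
for industry, wage in sorted(model.items()):
    print(f"{industry}: ${wage:,.0f}/week on average")
```

Real projects swap each stage for heavier machinery (a database client, a plotting library, a statistical model, a web app), but the shape of the pipeline stays the same.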
Often, data exploration informs how the data must be cleaned, which then enables more exploration and deeper understanding. Which of these steps comes first often depends on your initial familiarity with the data. If you work with the systems producing and capturing the data every day, the initial data exploration and understanding stage might be quite short, unless something is wrong with the production system. Conversely, if you are handed a dataset with no background details, the data exploration and understanding stage might require quite some time (and numerous non-programming steps, such as talking with the system developers).

The following diagram shows the data science pipeline:

[Figure: the data science pipeline: Data Ingestion -> Data Munging and Wrangling -> Computation and Analyses -> Modeling and Application -> Reporting and Visualization]

As you have probably heard or read by now, data munging or wrangling can often consume 80 percent or more of project time and resources. In a perfect world, we would always be given perfect data. Unfortunately, this is never the case, and the number of data problems that you will see is virtually infinite. Sometimes, a data dictionary might change or might be missing, so understanding the field values is simply not possible. Some data fields may contain garbage or values that have been switched with another field. An update to the web app that passed testing might cause a little bug that prevents data from being collected, causing a few hundred thousand rows to go missing. If it can go wrong, it probably did at some point; the data you analyze is the sum total of all of these mistakes.

The last step, communication and operationalization, is absolutely critical, but with intricacies that are not often fully appreciated. Note that the last step in the pipeline is not entitled data visualization and does not revolve around simply creating something pretty and/or compelling, which is a complex topic in itself.
Instead, data visualizations will become a piece of a larger story that we will weave together from and with data. Some go even further and say that the end result is always an argument, as there is no point in undertaking all of this effort unless you are trying to persuade someone or some group of a particular point.
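The data defects described earlier (a changed data dictionary, swapped field values, a collection bug that silently drops rows) are exactly why a cheap, defensive audit early in the munging stage pays for itself. The sketch below is illustrative only: the field names and records are hypothetical, and real audits would check far more than field presence and numeric parseability.

```python
# Minimal data-quality audit over a list of record dicts. Field names and
# sample records are hypothetical, chosen to mirror the defects described
# in the text.
def audit(records, expected_fields, numeric_fields):
    issues = []
    for i, rec in enumerate(records):
        # A changed or missing data dictionary often surfaces as
        # missing (or unexpected) fields.
        missing = expected_fields - rec.keys()
        if missing:
            issues.append((i, f"missing fields: {sorted(missing)}"))
        # Garbage or swapped values often surface as numeric fields
        # that no longer parse as numbers.
        for field in numeric_fields & rec.keys():
            try:
                float(rec[field])
            except (TypeError, ValueError):
                issues.append((i, f"non-numeric {field!r}: {rec[field]!r}"))
    return issues

records = [
    {"user_id": "17", "age": "34"},
    {"user_id": "18", "age": "Portland"},  # value swapped with another field
    {"user_id": "19"},                     # age never collected: app bug
]
problems = audit(records, {"user_id", "age"}, {"age"})
for row, msg in problems:
    print(f"row {row}: {msg}")
```

Running a check like this on every new batch turns "a few hundred thousand rows quietly went missing" into an error you see the day it starts.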
