Advanced Analytics with Spark Patterns for Learning from Data at...
Uploaded 2017-01-13 09:00:10. PDF, 3.98 MB.
Advanced Analytics with Spark: Patterns for Learning from Data at Scale. English-language PDF without watermarks; tested to open in Foxit Reader and PDF-XChange Viewer.
DATA / SPARK
Advanced Analytics with Spark
ISBN: 978-1-491-91276-8
US $49.99 CAN $57.99
Twitter: @oreillymedia
facebook.com/oreilly
In this practical book, four Cloudera data scientists present a set of
self-contained patterns for performing large-scale data analysis with Spark. The
authors bring Spark, statistical methods, and real-world data sets together to
teach you how to approach analytics problems by example.
You’ll start with an introduction to Spark and its ecosystem, and then dive into
patterns that apply common techniques (classification, collaborative filtering,
and anomaly detection, among others) to fields such as genomics, security,
and finance. If you have an entry-level understanding of machine learning and
statistics, and you program in Java, Python, or Scala, you’ll find these patterns
useful for working on your own data applications.
Patterns include:
■ Recommending music and the Audioscrobbler data set
■ Predicting forest cover with decision trees
■ Anomaly detection in network traffic with K-means clustering
■ Understanding Wikipedia with Latent Semantic Analysis
■ Analyzing co-occurrence networks with GraphX
■ Geospatial and temporal data analysis on the New York City Taxi Trips data
■ Estimating financial risk through Monte Carlo simulation
■ Analyzing genomics data and the BDG project
■ Analyzing neuroimaging data with PySpark and Thunder
Sandy Ryza is a Senior Data Scientist at Cloudera and active contributor to the
Apache Spark project.
Uri Laserson is a Senior Data Scientist at Cloudera, where he focuses on Python
in the Hadoop ecosystem.
Sean Owen is Director of Data Science for EMEA at Cloudera, and a committer for
Apache Spark.
Josh Wills is Senior Director of Data Science at Cloudera and founder of the
Apache Crunch project.
Advanced Analytics with Spark
Patterns for Learning from Data at Scale
Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest
April 2015: First Edition
Revision History for the First Edition
2015-03-27: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491912768 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the
cover image of a peregrine falcon, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents
Foreword
Preface
1. Analyzing Big Data
  The Challenges of Data Science
  Introducing Apache Spark
  About This Book
2. Introduction to Data Analysis with Scala and Spark
  Scala for Data Scientists
  The Spark Programming Model
  Record Linkage
  Getting Started: The Spark Shell and SparkContext
  Bringing Data from the Cluster to the Client
  Shipping Code from the Client to the Cluster
  Structuring Data with Tuples and Case Classes
  Aggregations
  Creating Histograms
  Summary Statistics for Continuous Variables
  Creating Reusable Code for Computing Summary Statistics
  Simple Variable Selection and Scoring
  Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
  Data Set
  The Alternating Least Squares Recommender Algorithm
  Preparing the Data