Learning Spark
Uploaded 2014-12-17 23:41:10
Learning Spark
Holden Karau
Andy Konwinski
Patrick Wendell
Matei Zaharia
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Table of Contents
Preface
    Audience
    How This Book is Organized
    Supporting Books
    Code Examples
    Early Release Status and Feedback
Chapter 1. Introduction to Data Analysis with Spark
    What is Apache Spark?
    A Unified Stack
    Who Uses Spark, and For What?
    A Brief History of Spark
    Spark Versions and Releases
    Spark and Hadoop
Chapter 2. Downloading and Getting Started
    Downloading Spark
    Introduction to Spark’s Python and Scala Shells
    Introduction to Core Spark Concepts
    Standalone Applications
    Conclusion
Chapter 3. Programming with RDDs
    RDD Basics
    Creating RDDs
    RDD Operations
    Passing Functions to Spark
    Common Transformations and Actions
    Persistence (Caching)
    Conclusion
Chapter 4. Working with Key-Value Pairs
    Motivation
    Creating Pair RDDs
    Transformations on Pair RDDs
    Actions Available on Pair RDDs
    Data Partitioning
    Conclusion
Chapter 5. Loading and Saving Your Data
    Motivation
    Choosing a Format
    Formats
    File Systems
    Compression
    Databases
    Conclusion
About the Authors
Preface
As parallel data analysis has become increasingly common, practitioners in many fields have
sought easier tools for this task. Apache Spark has quickly emerged as one of the most popular
tools for this purpose, extending and generalizing MapReduce. Spark offers three main benefits.
First, it is easy to use—you can develop applications on your laptop, using a high-level API that
lets you focus on the content of your computation. Second, Spark is fast, enabling interactive use
and complex algorithms. And third, Spark is a general engine, allowing you to combine multiple
types of computations (e.g., SQL queries, text processing and machine learning) that might
previously have required learning different engines. These features make Spark an excellent
starting point to learn about big data in general.
This introductory book is meant to get you up and running with Spark quickly. You’ll learn how
to download and run Spark on your laptop and use it interactively to learn the API.
Once there, we’ll cover the details of available operations and distributed execution. Finally,
you’ll get a tour of the higher-level libraries built into Spark, including libraries for machine
learning, stream processing, graph analytics and SQL. We hope that this book gives you the
tools to quickly tackle data analysis problems, whether you do so on one machine or hundreds.
Audience
This book targets data scientists and engineers. We chose these two groups because they have
the most to gain from using Spark to expand the scope of problems they can solve. Spark’s rich
collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond
problems that fit on a single machine while making use of their statistical background. Engineers,
meanwhile, will learn how to write general-purpose distributed programs in Spark and operate
production applications. Engineers and data scientists will learn different details from this
book, but both will be able to apply Spark to solve large distributed problems in their respective
fields.
Data scientists focus on answering questions or building models from data. They often have a
statistical or math background and some familiarity with tools like Python, R and SQL. We have
made sure to include Python, and wherever possible SQL, examples for all our material, as well
as an overview of the machine learning and advanced analytics libraries in Spark. If you are a
data scientist, we hope that after reading this book you will be able to use the same
mathematical approaches to solving problems, except much faster and on a much larger scale.