Apache Spark
UPDATED BY TIM SPANN, BIG DATA SOLUTIONS ENGINEER, HORTONWORKS
WRITTEN BY ASHWINI KUNTAMUKKALA, SOFTWARE ARCHITECT, SCISPIKE
WHY APACHE SPARK?
Apache Spark has become the engine to enhance many of the
capabilities of the ever-present Apache Hadoop environment. For
Big Data, Apache Spark meets a lot of needs and runs natively on
Apache Hadoop’s YARN. By running Apache Spark in your Apache
Hadoop environment, you gain all the security, governance, and
scalability inherent to that platform. Apache Spark is also extremely
well integrated with Apache Hive and gains access to all your Apache
Hadoop tables utilizing integrated security.
Apache Spark has begun to really shine in the areas of streaming data processing and machine learning. With first-class support for Python as a development language, PySpark allows data scientists, engineers, and developers to develop and scale machine learning with ease. One feature that has expanded this is support for Apache Zeppelin notebooks, which run Apache Spark jobs for exploration, data cleanup, and machine learning. Apache Spark also integrates with other important streaming tools in the Apache Hadoop space, namely Apache NiFi and Apache Kafka. I like to think of Apache Spark + Apache NiFi + Apache Kafka as the three amigos of Apache Big Data ingest and streaming. The latest version of Apache Spark is 2.2.
ABOUT APACHE SPARK
Apache Spark is an open-source, Hadoop-compatible, fast, and expressive cluster-computing data processing engine. It was created at the AMPLab at UC Berkeley as part of the Berkeley Data Analytics Stack (BDAS), and it is a top-level Apache project. The current Apache Spark stack consists of the Spark Core engine plus the Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) libraries.
It has six major benefits:
1. Lightning-fast computation, because data is loaded into distributed memory (RAM) across a cluster of machines. Data can be quickly transformed iteratively and cached on demand for subsequent use.
2. High accessibility through standard APIs in Java, Scala, Python, R, and SQL (for interactive queries), with a rich set of machine learning libraries available out of the box.
3. Compatibility with existing Hadoop 2.x (YARN) ecosystems, so companies can leverage their existing infrastructure.
4. Convenient download and installation processes, plus a convenient shell (REPL: Read-Eval-Print Loop) for interactively learning the APIs.
5. Enhanced productivity due to high-level constructs that keep the focus on the content of the computation.
6. Multiple user notebook environments supported by Apache Zeppelin.
Also, Spark is implemented in Scala, which means the code is very succinct and fast and requires a JVM to run.
HOW TO INSTALL APACHE SPARK
The following table lists a few important links and prerequisites:

Current Release: 2.2.0 @ apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Downloads Page: spark.apache.org/downloads.html
JDK Version (Required): 1.8 or higher
Scala Version (Required): 2.11 or higher
Python (Optional): [2.7, 3.5)
Simple Build Tool (Required): scala-sbt.org
Development Version: github.com/apache/spark
Building Instructions: spark.apache.org/docs/latest/building-spark.html
Maven (Required): 3.3.9 or higher
Hadoop + Spark Installation: docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-installation/content/ch_Getting_Ready.html
Apache Spark can be configured to run standalone or on Hadoop
2 YARN. Apache Spark requires moderate skills in Java, Scala, or
Python. Here we will see how to install and run Apache Spark in the
standalone configuration.
1. Install JDK 1.8+, Scala 2.11+, Python 3.5+, and Apache Maven.
2. Download the Apache Spark 2.2.0 release.
3. Untar spark-2.2.0.tgz into a directory of your choice.
4. Go to that directory and build Apache Spark with Maven:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -DskipTests clean package
5. Launch the Apache Spark standalone REPL. For Scala, use:
./bin/spark-shell
For Python, use:
./bin/pyspark
6. Go to the Spark UI at http://localhost:4040.
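Once the shell is up, a quick sanity check confirms that everything works. This is a minimal sketch, and any small computation will do:

// Inside ./bin/spark-shell: the shell pre-creates sc (SparkContext) and spark (SparkSession)
val nums = sc.parallelize(1 to 100)  // distribute a local range as an RDD
println(nums.sum())                  // action: runs the job and prints 5050.0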
This is a good quick start, but I recommend utilizing a Sandbox or
an available Apache Zeppelin notebook to begin your exploration of
Apache Spark.
HOW APACHE SPARK WORKS
The Apache Spark engine provides a way to process data in distributed memory over a cluster of machines. In a typical Spark job, a driver program builds up the computation from transformations and actions, splits it into tasks, and schedules those tasks on executors running across the cluster's worker nodes.
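In code terms, the driver is simply the program that creates the SparkSession (and, through it, the SparkContext). A minimal self-contained driver might look like the following sketch; the object name and the local[*] master are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object MiniJob {
  def main(args: Array[String]): Unit = {
    // The driver builds the session; the master setting decides where executors run
    val spark = SparkSession.builder()
      .appName("MiniJob")
      .master("local[*]")  // assumption: local mode for illustration; on a Hadoop cluster this would be YARN
      .getOrCreate()
    val sc = spark.sparkContext

    // Eight partitions become eight units of parallel work spread over the executors
    val data = sc.parallelize(1 to 1000, numSlices = 8)
    println(data.map(_ * 2).reduce(_ + _))  // tasks run on executors; the result returns to the driver
    spark.stop()
  }
}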
RESILIENT DISTRIBUTED DATASET
The core concept in Apache Spark is the resilient distributed dataset (RDD). It is an immutable distributed collection of data, which is partitioned across machines in a cluster. It facilitates two types of operations: transformations and actions. A transformation is an operation such as filter(), map(), or union() on an RDD that yields another RDD. An action is an operation such as count(), first(), take(n), or collect() that triggers a computation, returns a value back to the driver program, or writes to a stable storage system like Apache Hadoop HDFS. Transformations are lazily evaluated, in that they don't run until an action warrants it. The Apache Spark driver remembers the transformations applied to an RDD, so if a partition is lost (say, a worker machine goes down), that partition can easily be reconstructed on some other machine in the cluster. That is why it is called "resilient."
The following snippet loads a text file into a DataFrame in Python, using the Spark 2 PySpark interpreter in Apache Zeppelin; the filter() example in the table below operates on the resulting guten DataFrame.
%spark2.pyspark
guten = spark.read.text('/load/55973-0.txt')
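Lazy evaluation is easy to observe in the Scala shell: the transformations below only record lineage, and nothing reads the file until the action runs. This is a minimal sketch reusing the same illustrative path:

// In spark-shell: transformations are recorded, not executed
val lines  = sc.textFile("/load/55973-0.txt")    // RDD[String], one element per line
val shinto = lines.filter(_.contains("Shinto"))  // transformation: lazy, returns a new RDD
println(shinto.count())                          // action: only now is the file read and filtered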
COMMONLY USED TRANSFORMATIONS
Transformation &
Purpose
Example & Result
filter(func)
Purpose: new RDD by
selecting those data
elements on which
func
returns true
shinto = guten.filter( guten.Variable
contains("Shinto") )
map(func)
Purpose: return new
RDD by applying
func
on each data element
val rdd =
sc.parallelize(List(1,2,3,4,5))
val times2 = rdd.map(*2) times2.
collect()
Result:
Array[Int] = Array(2, 4, 6, 8, 10)
flatMap(func)
Purpose: Similar to
map
but
func
returns
a sequence instead of
a value. For example,
mapping a sentence
into a sequence of
words
val rdd=sc.
parallelize(List(“Spark is
awesome”,”It is fun”))
val fm=rdd.flatMap(str=>str.
split(“ “))
fm.collect()
Result:
Array[String] = Array(Spark, is,
awesome, It, is, fun)
reduceByKey(-
func,[numTasks])
Purpose: To aggregate
values of a key using a
function. “numTasks”
is an optional parame-
ter to specify a number
of reduce tasks
val word1=fm.map(word=>(word,1))
val wrdCnt = word1.
reduceByKey(_+_)wrdCnt.collect()
Result:
Array[(String, Int)] =
Array((is,2), (It,1),
(awesome,1), (Spark,1), (fun,1))
groupByKey([num-
Tasks])
Purpose: To convert
(K,V) to (K,Iterable<V>)
val cntWrd = wrdCnt.map{case
(word, count) => (count, word)}
cntWrd.groupByKey().collect()
Result:
Array[(Int, Iterable[String])] =
Array((1,ArrayBuffer(It,
awesome, Spark, fun)),
(2,ArrayBuffer(is)))
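Chained together, the table's running example becomes the classic word count. The following minimal sketch reuses the same sample sentences (output ordering may vary across runs):

val counts = sc.parallelize(List("Spark is awesome", "It is fun"))
  .flatMap(_.split(" "))   // sentences -> words
  .map(word => (word, 1))  // word -> (word, 1)
  .reduceByKey(_ + _)      // sum the counts for each word
counts.collect()
// Result: Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))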