Table of Contents
The content contained here is also published in Gitbook format.
Best Practices
Avoid GroupByKey
Don't copy all elements of a large RDD to the driver
Gracefully Dealing with Bad Input Data
General Troubleshooting
Job aborted due to stage failure: Task not serializable:
Missing Dependencies in Jar Files
Error running start-all.sh - Connection refused
Network connectivity issues between Spark components
Performance & Optimization
How Many Partitions Does An RDD Have?
Data Locality
Spark Streaming
ERROR OneForOneStrategy
This content is covered by the license specified here.
Databricks Spark Knowledge Base
Avoid GroupByKey

Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey:
// Build (word, 1) pairs from a small sample dataset.
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

// reduceByKey combines values per key on each partition before shuffling.
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

// groupByKey shuffles every pair, then sums the grouped values.
val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
While both of these functions will produce the correct answer, the reduceByKey example works much better on a large
dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the
data.
Look at the diagram below to understand what happens with reduceByKey. Notice how pairs on the same machine with the same key are combined (using the lambda function passed into reduceByKey) before the data is shuffled. Then the lambda function is called again to reduce all the values from each partition to produce one final result.
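Because the reduce function runs both per partition and again across partitions, it must be associative and commutative. A minimal sketch illustrating this with a hypothetical (key, score) RDD, assuming the same sc as above:

// Hypothetical (key, score) pairs, for illustration only.
val scores = sc.parallelize(Seq(("a", 3), ("b", 7), ("a", 5), ("b", 2)))

// math.max is associative and commutative, so the partial maxima
// computed on each partition can be safely merged after the shuffle.
val maxPerKey = scores
  .reduceByKey((a, b) => math.max(a, b))
  .collect()
// Array((a,5), (b,7))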
On the other hand, when calling groupByKey, all of the key-value pairs are shuffled around, and a lot of unnecessary data is transferred over the network.
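If you genuinely need the grouped values per key, you can often still get map-side combining by building a partial collection on each partition with aggregateByKey. A hedged sketch (not from the original text), reusing wordPairsRDD from above:

// Collect the distinct values per key without groupByKey.
// aggregateByKey builds a Set on each partition first, so only the
// (typically much smaller) partial sets cross the network.
val distinctValuesPerKey = wordPairsRDD
  .aggregateByKey(Set.empty[Int])(
    (set, v) => set + v,   // fold a value into the partition-local set
    (s1, s2) => s1 ++ s2   // merge sets produced by different partitions
  )
  .collect()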
To determine which machine to shuffle a pair to, Spark calls a partitioning function on the key of the pair. Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory, an out-of-memory exception occurs. This will be handled more gracefully in a later release of Spark so the job can still proceed, but it should still be avoided: when Spark needs to spill to disk, performance is severely impacted.
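By default, that partitioning function is a hash of the key. A minimal sketch of how a key is mapped to a partition, using Spark's HashPartitioner directly:

import org.apache.spark.HashPartitioner

// A pair lands on the partition given by the key's hashCode modulo the
// partition count (wrapped to stay non-negative).
val partitioner = new HashPartitioner(4)
val target = partitioner.getPartition("two")   // an index in [0, 4)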