Apache Spark on Docker
==========
[![DockerPulls](https://img.shields.io/docker/pulls/sequenceiq/spark.svg)](https://registry.hub.docker.com/u/sequenceiq/spark/)
[![DockerStars](https://img.shields.io/docker/stars/sequenceiq/spark.svg)](https://registry.hub.docker.com/u/sequenceiq/spark/)
This repository contains a Dockerfile to build a Docker image with Apache Spark. This Docker image depends on our previous [Hadoop Docker](https://github.com/sequenceiq/hadoop-docker) image, available at the SequenceIQ [GitHub](https://github.com/sequenceiq) page.
The base Hadoop Docker image is also available as an official [Docker image](https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/).
## Pull the image from Docker Repository
```
docker pull sequenceiq/spark:1.6.0
```
## Building the image
```
docker build --rm -t sequenceiq/spark:1.6.0 .
```
## Running the image
* if you are using boot2docker, make sure your VM has more than 2 GB of memory
* in your /etc/hosts file, add $(boot2docker ip) as host 'sandbox' to make it easier to access your sandbox UI
* open the YARN UI ports when running the container
```
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
```
or
```
docker run -d -h sandbox sequenceiq/spark:1.6.0
```
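The published ports above are the YARN ResourceManager UI (8088), the NodeManager UI (8042), and the Spark application UI (4040). As a small sketch, the `-p` flags can be assembled from that list; this only composes and prints the command, it does not invoke Docker:

```shell
# YARN RM UI (8088), NodeManager UI (8042), Spark UI (4040)
PORTS="8088 8042 4040"
FLAGS=""
for p in $PORTS; do FLAGS="$FLAGS -p $p:$p"; done
# Print the resulting docker run command
echo "docker run -it$FLAGS -h sandbox sequenceiq/spark:1.6.0 bash"
```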
## Versions
```
Hadoop 2.6.0 and Apache Spark v1.6.0 on CentOS
```
## Testing
There are two deploy modes that can be used to launch Spark applications on YARN.
### YARN-client mode
In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
```
# run the spark shell
spark-shell \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1
# execute the following command, which should return 1000
scala> sc.parallelize(1 to 1000).count()
```
### YARN-cluster mode
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
Estimating Pi (yarn-cluster mode):
```
# execute the following command, which should write "Pi is roughly 3.1418" into the logs
# note you must specify --files argument in cluster mode to enable metrics
spark-submit \
--class org.apache.spark.examples.SparkPi \
--files $SPARK_HOME/conf/metrics.properties \
--master yarn-cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar
```
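In yarn-cluster mode the driver's output goes to the YARN container logs rather than the client console. One way to retrieve the result after the job finishes is `yarn logs`; a sketch (run inside the container, passing the application id that `spark-submit` prints):

```shell
# Fetch the aggregated YARN logs for a finished application and look for
# the SparkPi result line; the application id comes from spark-submit's output
fetch_pi_result() {
  yarn logs -applicationId "$1" | grep "Pi is roughly"
}
# usage: fetch_pi_result application_1450000000000_0001
```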
Estimating Pi (yarn-client mode):
```
# execute the following command, which should print "Pi is roughly 3.1418" to the screen
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar
```
### Submitting from the outside of the container
To use Spark from outside the container, set the YARN_CONF_DIR environment variable to a directory containing a configuration appropriate for the Docker container. This repository ships such a configuration in the yarn-remote-client directory.
```
export YARN_CONF_DIR="`pwd`/yarn-remote-client"
```
HDFS inside the container can be accessed only by the root user. When submitting Spark applications from outside the cluster as a non-root user, set the HADOOP_USER_NAME variable so that the root user is used.
```
export HADOOP_USER_NAME=root
```
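Putting the two settings together, an external submission shell might be prepared as follows (the paths mirror the ones above; this sketch only sets and prints the environment, it does not run spark-submit):

```shell
# Point YARN clients at the bundled remote-client config, and impersonate
# root because HDFS inside the container is owned by root
export YARN_CONF_DIR="$(pwd)/yarn-remote-client"
export HADOOP_USER_NAME=root
echo "YARN_CONF_DIR=$YARN_CONF_DIR"
echo "HADOOP_USER_NAME=$HADOOP_USER_NAME"
```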