# Horovod Helm Chart
## Introduction
This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod driver as a Job, then discovers the host list automatically.
## Prerequisites
- Kubernetes cluster v1.8+
## Build Docker Image
You can download the [official Horovod Dockerfile](https://github.com/horovod/horovod/blob/master/docker/horovod/Dockerfile) and modify it to suit your requirements, e.g. to select a different CUDA, TensorFlow, or Python version.
```
$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/docker/horovod/Dockerfile
$ docker build -t horovod:latest horovod-docker
```
## Prepare ssh keys
```
# Set up an SSH key pair (no passphrase) in a temporary directory
export SSH_KEY_DIR=$(mktemp -d)
cd "$SSH_KEY_DIR"
yes | ssh-keygen -N "" -f id_rsa
```
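Before embedding the keys in `values.yaml`, it can help to confirm that both files exist and belong together. The following is a minimal sketch, not part of the chart: it generates a pair in a fresh temporary directory and compares the stored public key with the one derived from the private key. Passing `-t rsa` explicitly is an assumption made here; the command above relies on the `ssh-keygen` default key type.

```shell
#!/bin/sh
set -eu

# Generate a throwaway key pair in a fresh directory (no passphrase).
# "-t rsa" is an assumption for this sketch, not something the chart requires.
SSH_KEY_DIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$SSH_KEY_DIR/id_rsa"

# Derive the public key from the private key and compare it with id_rsa.pub.
# Only the first two fields (key type and base64 blob) are compared, because
# the trailing comment may differ between the two outputs.
derived=$(ssh-keygen -y -f "$SSH_KEY_DIR/id_rsa" | cut -d' ' -f1,2)
stored=$(cut -d' ' -f1,2 "$SSH_KEY_DIR/id_rsa.pub")

if [ "$derived" = "$stored" ]; then
  KEY_CHECK="ok"
else
  KEY_CHECK="mismatch"
fi
echo "key pair check: $KEY_CHECK"
```

If the check reports `mismatch`, the two files were probably produced by different `ssh-keygen` runs and should be regenerated together.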
## Create the values.yaml
To run Horovod with GPUs, create a `values.yaml` like the one below:
```
$ cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')

  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3

driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
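The `$(cat ... | sed ...)` substitutions in the heredoc are there because `hostKey: |-` starts a YAML block scalar, and every line of the key has to be indented more deeply than the `hostKey` key itself. The sketch below shows that mechanism in isolation, using a dummy two-line file in place of a real key. The file name `fake_key` and the four-space indent are assumptions for illustration only.

```shell
#!/bin/sh
set -eu

WORK_DIR=$(mktemp -d)

# Stand-in for a private key: any multi-line text behaves the same way.
printf 'LINE-ONE\nLINE-TWO\n' > "$WORK_DIR/fake_key"

# The unquoted EOF delimiter lets the shell expand $(...) inside the heredoc.
# sed prefixes each line with four spaces so it nests under "hostKey: |-".
cat << EOF > "$WORK_DIR/values.yaml"
---
ssh:
  useSecrets: true
  hostKey: |-
$(sed 's/^/    /' "$WORK_DIR/fake_key")
EOF

# Show the rendered file so the indentation can be inspected.
cat "$WORK_DIR/values.yaml"
```

If the key lines end up at the same indentation as `hostKey` or shallower, the YAML parser will not treat them as part of the block scalar, so the rendered file is worth inspecting before installing.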
In most cases the overlay network significantly degrades Horovod performance, so using the host network is recommended. To run Horovod with the host network and GPUs, create a `values.yaml` like the one below:
```
$ cat << EOF > ~/values.yaml
---
useHostNetwork: true

ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')

  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3

driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
> Note: the differences are that `useHostNetwork` is set to `true` and the SSH port is set to a value other than `22`.
## Installing the Chart
To install the chart with the release name `mnist`:
```bash
$ helm install --values ~/values.yaml mnist stable/horovod
```
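Once the release is installed, the StatefulSet (workers) and the Job (driver) described in the introduction should appear in the cluster. The sketch below wraps the relevant `helm` and `kubectl` calls in a hypothetical `run` helper that only prints the commands unless `DRY_RUN=0` is set. No resource names are assumed, so it simply lists everything of those kinds in the current namespace.

```shell
#!/bin/sh
set -eu

RELEASE=mnist
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show the release status, then the workloads the chart is expected to create.
check_release() {
  run helm status "$RELEASE"
  run kubectl get statefulsets,jobs,pods
}

CHECK_OUTPUT=$(check_release)
echo "$CHECK_OUTPUT"
```

Running it with `DRY_RUN=0` executes the commands against the cluster that the current `kubectl` context points to.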
## Uninstalling the Chart
To uninstall/delete the `mnist` deployment:
```bash
$ helm delete mnist
```
The command removes all the Kubernetes components associated with the chart and
deletes the release.
## Upgrading an existing Release to a new major version
A major chart version change (like v1.2.3 -> v2.0.0) indicates an
incompatible, breaking change that requires manual action.
### 1.0.0
This version removes the `chart` label from `spec.selector.matchLabels`,
which has been immutable since `StatefulSet apps/v1beta2`. The label had been added
inadvertently, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.
To upgrade, first delete the Horovod StatefulSet while leaving its pods in place (on kubectl v1.20 and later, `--cascade=orphan` is the equivalent of `--cascade=false`). Assuming your release is named `my-release`:
```bash
$ kubectl delete statefulsets.apps --cascade=false my-release
```
## Configuration
The following table lists the configurable parameters of the Horovod
chart and their default values.
| Parameter | Description | Default |
|-----------|-------------|---------|
| `useHostNetwork` | Whether the pods use the host network | `false` |
| `ssh.port` | SSH port | `22` |
| `ssh.useSecrets` | Whether to use Secrets for the SSH keys | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Horovod worker image | `horovod/horovod` |
| `worker.image.pullPolicy` | Image `pullPolicy` for the worker | `IfNotPresent` |
| `worker.image.tag` | Image `tag` for the worker | `0.24.3` |
| `resources` | Pod resource requests and limits | `{}` |
| `worker.env` | Worker environment variables | `{}` |
| `driver.image.repository` | Horovod driver image | `horovod/horovod` |
| `driver.image.tag` | Image `tag` for the driver | `0.24.3` |
| `driver.image.pullPolicy` | Image `pullPolicy` for the driver | `IfNotPresent` |
| `driver.args` | Driver arguments | `{}` |
| `driver.env` | Driver environment variables | `{}` |
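Individual parameters from the table can also be overridden on the command line with Helm's `--set` flag, which takes precedence over values read from `--values`. The sketch below only assembles and prints such a command so the overrides can be reviewed first. The variable names and the chosen values are illustrative assumptions.

```shell
#!/bin/sh
set -eu

# Illustrative values; adjust to your cluster.
VALUES_FILE=values.yaml
WORKERS=4
SSH_PORT=32222

# Each --set flag overrides one parameter from the configuration table.
HELM_CMD="helm install --values $VALUES_FILE \
--set worker.number=$WORKERS \
--set useHostNetwork=true \
--set ssh.port=$SSH_PORT \
mnist stable/horovod"

# Printed rather than executed so the overrides can be checked before use.
echo "$HELM_CMD"
```

Note that the `-np` value in the examples above is a literal inside `driver.args`, so changing `worker.number` this way likely means adjusting that argument as well.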
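Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and you want to support the communities of open source projects in these areas, consider joining the LF AI & Data Foundation. For details about who is involved and how Horovod plays a role, read the Linux Foundation announcement.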