# Horovod Helm Chart
## Introduction
This chart bootstraps Horovod, a distributed training framework for TensorFlow and other deep learning frameworks, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod driver as a Job, then discovers the host list automatically.
## Prerequisites
- Kubernetes cluster v1.8+
## Build Docker Image
You can download the [official Horovod Dockerfile](https://github.com/horovod/horovod/blob/master/docker/horovod/Dockerfile) and modify it to your requirements, e.g. to select a different CUDA, TensorFlow or Python version.
```bash
$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/docker/horovod/Dockerfile
$ docker build -t horovod:latest horovod-docker
```
## Prepare ssh keys
```bash
# Set up the ssh keys used by the workers and driver
export SSH_KEY_DIR=`mktemp -d`
cd $SSH_KEY_DIR
yes | ssh-keygen -N "" -f id_rsa
```
## Create the values.yaml
To run Horovod with GPU support, you can create a `values.yaml` like the one below:
```bash
$ cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
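A quick way to convince yourself the heredoc produces valid YAML: each line of key material must be indented to sit inside the `|-` block scalars, which is what the `sed` indentation is for. The sketch below uses throwaway placeholder files (not real keys) and a hypothetical `/tmp/values-check.yaml` path:

```bash
# Build a values.yaml fragment from placeholder key files and check that
# every line of key material is indented into the YAML block scalar.
KEY_DIR=$(mktemp -d)
printf 'FAKE-PRIVATE-KEY\n' > $KEY_DIR/id_rsa
printf 'FAKE-PUBLIC-KEY\n'  > $KEY_DIR/id_rsa.pub
cat << EOF > /tmp/values-check.yaml
ssh:
  useSecrets: true
  hostKey: |-
$(cat $KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $KEY_DIR/id_rsa.pub | sed 's/^/    /g')
EOF
# Key lines should now start with four spaces, i.e. inside the scalar
grep -q '^    FAKE-PRIVATE-KEY$' /tmp/values-check.yaml && echo "indentation OK"
```

If the indentation is dropped (e.g. by copy-pasting through a tool that trims whitespace), the keys escape the block scalar and the chart fails to render.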
In most cases the overlay network significantly degrades Horovod performance, so we recommend using the host network instead. To run Horovod with the host network and GPUs, you can create a `values.yaml` like the one below:
```bash
$ cat << EOF > ~/values.yaml
---
useHostNetwork: true
ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3
driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```
> Note: the only differences are that `useHostNetwork` is set to `true` and the ssh port is changed to something other than `22`, which is usually taken on the host.
## Installing the Chart
To install the chart with the release name `mnist`:
```bash
$ helm install --values ~/values.yaml mnist stable/horovod
```
## Uninstalling the Chart
To uninstall/delete the `mnist` deployment:
```bash
$ helm delete mnist
```
The command removes all the Kubernetes components associated with the chart and
deletes the release.
## Upgrading an existing Release to a new major version
A major chart version change (like v1.2.3 -> v2.0.0) indicates an incompatible
breaking change that requires manual action.
### 1.0.0
This version removes the `chart` label from the `spec.selector.matchLabels`
which is immutable since `StatefulSet apps/v1beta2`. It has been inadvertently
added, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.
To upgrade, first delete the Horovod StatefulSet. Assuming your release is named `my-release`:
```bash
$ kubectl delete statefulsets.apps --cascade=false my-release
```
## Configuration
The following table lists the configurable parameters of the Horovod
chart and their default values.
| Parameter | Description | Default |
|-----------|-------------|---------|
| `useHostNetwork` | Whether to use the host network | `false` |
| `ssh.port` | The ssh port | `22` |
| `ssh.useSecrets` | Whether to store the ssh keys in Kubernetes secrets | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Horovod worker image | `horovod/horovod` |
| `worker.image.pullPolicy` | Image `pullPolicy` for the worker | `IfNotPresent` |
| `worker.image.tag` | Image `tag` for the worker | `0.24.3` |
| `resources` | Pod resource requests and limits | `{}` |
| `worker.env` | Worker environment variables | `{}` |
| `driver.image.repository` | Horovod driver image | `horovod/horovod` |
| `driver.image.tag` | Image `tag` for the driver | `0.24.3` |
| `driver.image.pullPolicy` | Image `pullPolicy` for the driver | `IfNotPresent` |
| `driver.args` | Driver args | `{}` |
| `driver.env` | Driver environment variables | `{}` |
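Helm computes the effective configuration by deep-merging your `values.yaml` (and any `--set` flags) over the chart defaults in the table above: nested maps merge key by key, and user-supplied scalars win. The sketch below is a rough illustrative model of that layering in Python, not Helm's actual implementation:

```python
# Illustrative model of Helm's values layering: nested maps merge,
# scalars from the user side override the chart defaults.
def merge_values(defaults, overrides):
    out = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge_values(out[key], value)
        else:
            out[key] = value
    return out

chart_defaults = {"ssh": {"port": 22, "useSecrets": False}, "worker": {"number": 5}}
user_values    = {"ssh": {"useSecrets": True}, "worker": {"number": 2}}

merged = merge_values(chart_defaults, user_values)
print(merged["ssh"])     # port kept from the defaults, useSecrets overridden
print(merged["worker"])  # worker count overridden
```

This is why the `values.yaml` examples above only need to list the keys they change; everything else falls back to the defaults in the table.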