Horovod Distributed Deep Learning Framework Resources - CSDN Library

515 files in total

py: 260 files

h: 55 files

rst: 54 files


Distributed

Deep learning

65 views, uploaded 2024-03-10 16:44:28, 1.51 MB ZIP

Horovod distributed deep learning framework (515 files)

make.bat 787B

operations.cc 71KB

mpi_ops.cc 68KB

mpi_ops_v2.cc 58KB

nccl_operations.cc 52KB

controller.cc 43KB

gpu_operations.cc 35KB

mpi_ops.cc 33KB

xla_mpi_ops.cc 22KB

timeline.cc 21KB

mpi_operations.cc 21KB

ccl_operations.cc 20KB

message.cc 18KB

gloo_operations.cc 18KB

response_cache.cc 17KB

parameter_manager.cc 17KB

collective_operations.cc 17KB

gloo_context.cc 15KB

adasum_gpu_operations.cc 14KB

mpi_gpu_operations.cc 14KB

process_set.cc 14KB

mpi_controller.cc 10KB

gloo_controller.cc 10KB

mpi_context.cc 9KB

cuda_operations.cc 9KB

hip_operations.cc 8KB

tensor_queue.cc 7KB

bayesian_optimization.cc 7KB

gaussian_process.cc 7KB

adapter_v2.cc 7KB

operation_manager.cc 6KB

stall_inspector.cc 6KB

ddl_operations.cc 5KB

common.cc 5KB

env_parser.cc 5KB

adasum_mpi.cc 5KB

adapter.cc 5KB

tensor_util.cc 5KB

half.cc 5KB

http_store.cc 4KB

adasum_mpi_operations.cc 4KB

logging.cc 3KB

gpu_context_impl.cc 3KB

cuda_util.cc 3KB

ready_event.cc 3KB

cuda_util.cc 3KB

group_table.cc 2KB

memory_store.cc 2KB

handle_manager.cc 2KB

fusion_buffer_manager.cc 2KB

thread_pool.cc 2KB

ddl_mpi_context_manager.cc 1KB

nvtx_op_range.cc 938B

setup.cfg 39B

.clang-format 132B

FindCUDAToolkit.cmake 29KB

Utilities.cmake 5KB

FindPytorch.cmake 4KB

FindMxnet.cmake 4KB

FindNCCL.cmake 3KB

FindTensorflow.cmake 2KB

FindROCM.cmake 1KB

FindNVTX.cmake 966B

Dockerfile.test.cpu 14KB

custom.css 798B

hip_kernels.cu 13KB

cuda_kernels.cu 13KB

spark-mpi.dia 6KB

Dockerfile 10KB

Dockerfile 5KB

Dockerfile 3KB

.empty 0B

horovod.exp 114B

message.fbs 4KB

custom_call_config.fbs 1KB

.gitignore 190B

.gitmodules 2KB

Dockerfile.test.gpu 12KB

message_generated.h 24KB

adasum.h 22KB

common.h 14KB

collective_operations.h 12KB

operations.h 11KB

nccl_operations.h 11KB

gpu_operations.h 11KB

controller.h 9KB

custom_call_config_generated.h 8KB

parameter_manager.h 8KB

message.h 8KB

mpi_ops.h 6KB

timeline.h 6KB

gaussian_process.h 5KB

process_set.h 5KB

bayesian_optimization.h 5KB

half.h 5KB

response_cache.h 5KB

gloo_operations.h 4KB

global_state.h 4KB


# Horovod Helm Chart

## Introduction

This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod driver as a Job, then discovers the host list automatically.

## Prerequisites

- Kubernetes cluster v1.8+

## Build Docker Image

You can download the [official Horovod Dockerfile](https://github.com/horovod/horovod/blob/master/docker/horovod/Dockerfile) and modify it to suit your requirements, e.g. to select a different CUDA, TensorFlow, or Python version.

```
# mkdir horovod-docker
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/docker/horovod/Dockerfile
# docker build -t horovod:latest horovod-docker
```

## Prepare ssh keys

```
# Setup ssh key
export SSH_KEY_DIR=`mktemp -d`
cd $SSH_KEY_DIR
yes | ssh-keygen -N "" -f id_rsa
```

## Create the values.yaml

To run Horovod with GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3

driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```

In most cases the overlay network significantly degrades Horovod performance, so we recommend using the host network instead.
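The GPU values file from "Create the values.yaml" depends on the key material being indented deeper than the `hostKey: |-` line, which the heredoc handles with `sed`. As a rough alternative, the same file can be rendered by a short script. This is only a sketch: the function name `render_values` and its parameters are hypothetical and not part of the chart, and the `mpirun` command is copied verbatim from the example above.

```python
from textwrap import indent

# Driver command copied verbatim from the values.yaml example above.
MPIRUN = (
    "mpirun -np 3 --hostfile /horovod/generated/hostfile "
    "--mca orte_keep_fqdn_hostnames t --allow-run-as-root "
    "--display-map --tag-output --timestamp-output "
    "sh -c 'python /examples/tensorflow_mnist.py'"
)


def render_values(private_key: str, public_key: str,
                  workers: int = 2, tag: str = "0.24.3") -> str:
    """Render the GPU values.yaml shown above (hypothetical helper)."""
    lines = [
        "---",
        "ssh:",
        "  useSecrets: true",
        "  hostKey: |-",
        # Block-scalar content must be indented deeper than its key,
        # which sits at two spaces, so four spaces are used here.
        indent(private_key.strip("\n"), "    "),
        "  hostKeyPub: |-",
        indent(public_key.strip("\n"), "    "),
        "resources:",
        "  limits:",
        "    nvidia.com/gpu: 1",
        "  requests:",
        "    nvidia.com/gpu: 1",
        "worker:",
        f"  number: {workers}",
        "  image:",
        "    repository: horovod/horovod",
        f"    tag: {tag}",
        "driver:",
        "  image:",
        "    repository: horovod/horovod",
        f"    tag: {tag}",
        "  args:",
        f'    - "{MPIRUN}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder key text; in practice read id_rsa and id_rsa.pub
    # from the directory created in "Prepare ssh keys".
    print(render_values("PRIVATE-KEY-LINE-1\nPRIVATE-KEY-LINE-2\n",
                        "ssh-rsa AAAA... demo\n"))
```

Writing the returned string to `~/values.yaml` gives a file equivalent to the one the heredoc produces, without relying on shell substitution inside the YAML.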
To run Horovod with the host network and GPUs, create a `values.yaml` like the one below:

```
# cat << EOF > ~/values.yaml
---
useHostNetwork: true

ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')
  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: horovod/horovod
    tag: 0.24.3

driver:
  image:
    repository: horovod/horovod
    tag: 0.24.3
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
```

> Note: the difference is that you set `useHostNetwork` to `true` and choose an ssh port other than `22`.

## Installing the Chart

To install the chart with the release name `mnist`:

```bash
$ helm install --values ~/values.yaml mnist stable/horovod
```

## Uninstalling the Chart

To uninstall/delete the `mnist` deployment:

```bash
$ helm delete mnist
```

The command removes all the Kubernetes components associated with the chart and deletes the release.

## Upgrading an existing Release to a new major version

A major chart version change (like v1.2.3 -> v2.0.0) indicates an incompatible breaking change that requires manual action.

### 1.0.0

This version removes the `chart` label from `spec.selector.matchLabels`, which has been immutable since `StatefulSet apps/v1beta2`. The label had been added inadvertently, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.

To upgrade, delete the Horovod StatefulSet first. Assuming your release is named `my-release`:

```bash
$ kubectl delete statefulsets.apps --cascade=false my-release
```

## Configuration

The following table lists the configurable parameters of the Horovod chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `useHostNetwork` | Whether to use the host network | `false` |
| `ssh.port` | The ssh port | `22` |
| `ssh.useSecrets` | Whether to use secrets for ssh | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Horovod worker image | `horovod/horovod` |
| `worker.image.pullPolicy` | `pullPolicy` for the worker image | `IfNotPresent` |
| `worker.image.tag` | `tag` for the worker image | `0.24.3` |
| `resources` | Pod resource requests and limits | `{}` |
| `worker.env` | Worker's environment variables | `{}` |
| `driver.image.repository` | Horovod driver image | `horovod/horovod` |
| `driver.image.tag` | `tag` for the driver image | `0.24.3` |
| `driver.image.pullPolicy` | `pullPolicy` for the driver image | `IfNotPresent` |
| `driver.args` | Driver's args | `{}` |
| `driver.env` | Driver's environment variables | `{}` |
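
Helm layers the values you supply over the chart defaults, merging nested maps key by key, so a `values.yaml` only needs to list the parameters that differ from the table. The sketch below imitates that layering in plain Python to show which settings the host-network example ends up with. It is an illustration only, not Helm's implementation: `CHART_DEFAULTS` is transcribed from the table above and `merge_values` is a hypothetical helper.

```python
from copy import deepcopy

# Defaults transcribed from the configuration table above.
CHART_DEFAULTS = {
    "useHostNetwork": False,
    "ssh": {"port": 22, "useSecrets": False},
    "resources": {},
    "worker": {
        "number": 5,
        "image": {
            "repository": "horovod/horovod",
            "pullPolicy": "IfNotPresent",
            "tag": "0.24.3",
        },
        "env": {},
    },
    "driver": {
        "image": {
            "repository": "horovod/horovod",
            "pullPolicy": "IfNotPresent",
            "tag": "0.24.3",
        },
        "args": {},
        "env": {},
    },
}


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides layered on top (hypothetical helper).

    Nested dicts are merged key by key; any other value, including a
    list, replaces the default outright. Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    # Overrides corresponding to the host-network example (keys omitted).
    overrides = {
        "useHostNetwork": True,
        "ssh": {"useSecrets": True, "port": 32222},
        "resources": {
            "limits": {"nvidia.com/gpu": 1},
            "requests": {"nvidia.com/gpu": 1},
        },
        "worker": {"number": 2},
    }
    effective = merge_values(CHART_DEFAULTS, overrides)
    print(effective["ssh"])
    print(effective["worker"])
```

Parameters that the overrides leave out, such as `worker.image.pullPolicy`, keep the defaults from the table, while `worker.number` and `ssh.port` take the overriding values.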
