# Using Ray with Cloud TPUs
This **experimental** repository contains a minimal example of how you can use
Ray (ray.io) with Cloud TPUs.
These examples are not meant to be used in production services and are for
illustrative purposes only.
## Helpful pre/post-reads
- [Ray Overview](https://docs.ray.io/en/latest/ray-overview/index.html)
- [Ray Cluster](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#vm-cluster-quick-start)
- [Ray Job](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html)
- [JAX Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)
## What's included in this repo?
For your convenience, we provide:
- generic abstractions that hide away boilerplate for common TPU actions and
- toy examples that you can fork for your own basic workflows.
[`tpu_api.py`](src/tpu_api.py) - Python wrapper for basic TPU operations using the
[Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest)
[`tpu_controller.py`](src/tpu_controller.py) - Class representation of a TPU. This
is essentially a wrapper for `tpu_api.py`.
[`ray_tpu_controller.py`](src/ray_tpu_controller.py) - TPU controller with Ray
functionality. This abstracts away boilerplate for Ray Cluster and Ray Jobs.
[`run_basic_jax.py`](src/run_basic_jax.py) - Basic example that shows how to use
`RayTpuController` for `print(jax.device_count())`.
[`run_hp_search.py`](src/run_hp_search.py) - Basic example that shows how Ray
Tune can be used with JAX/Flax on MNIST.
[`run_t5x_autoresume.py`](src/run_t5x_autoresume.py) - Example that showcases how
you can use `RayTpuController` for fault tolerant training using T5X as an
example workload.
## Tutorial
### Setting up your CPU VM
One basic way to use Ray with a TPU pod is to set up the TPU pod as a Ray cluster. We've found that creating a separate CPU VM as an admin (aka coordinator VM) is the natural way to do this. See below for a visualization
and for how you might do this with `gcloud` commands:
![Ray Cluster on TPU](tpu_ray_cluster.png)
$ gcloud compute instances create my-tpu-admin --machine-type=n1-standard-4 ...
$ gcloud compute ssh my-tpu-admin
$ (vm) #install Python3, Ray, ...
$ (vm) ray start --head --port=6379 --num-cpus=0
# (Ray returns the IP address of the HEAD node, let's call it RAY_HEAD_IP)
$ (vm) gcloud compute tpus tpu-vm create $TPU_NAME ... --metadata startup-script="pip3 install ray && ray start --address=$RAY_HEAD_IP --resources='{\"tpu_host\": 1}'"
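The startup script passed via `--metadata` above can also be assembled programmatically, which avoids hand-escaping the JSON resource spec. The following is a minimal sketch, assuming the head address reported by `ray start --head` is passed in verbatim. `build_tpu_startup_script` is a hypothetical helper (not part of this repo) and `10.0.0.2:6379` is a placeholder address.

```python
import json
import shlex


def build_tpu_startup_script(ray_head_address: str) -> str:
    """Return a shell one-liner that joins a TPU host to a running Ray cluster.

    Hypothetical helper for illustration; it mirrors the startup script
    shown in the gcloud command above.
    """
    # Each TPU VM host advertises one unit of the custom "tpu_host" resource.
    resources = json.dumps({"tpu_host": 1})
    return (
        "pip3 install ray && "
        f"ray start --address={shlex.quote(ray_head_address)} "
        # shlex.quote wraps the JSON in single quotes for the shell.
        f"--resources={shlex.quote(resources)}"
    )


# Placeholder address; substitute the value of RAY_HEAD_IP.
print(build_tpu_startup_script("10.0.0.2:6379"))
```

The resulting string can be supplied as the `startup-script` metadata value when creating the TPU VM.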
For your convenience, we also provide basic scripts (see [`create_cpu.sh`](create_cpu.sh) and [`deploy_to_admin.sh`](deploy_to_admin.sh)) for creating
an admin CPU VM and deploying the contents of this folder to your CPU VM.
- `create_cpu.sh` creates a VM named `$USER-admin` by default and uses
whatever project and zone are set as your `gcloud config` defaults.
Run `gcloud config list` to see what those defaults are.
- `create_cpu.sh` by default allocates a boot disk size of 200GB.
- `deploy_to_admin.sh` assumes your VM name is `$USER-admin`. If you change
that value in `create_cpu.sh`, be sure to change it in `deploy_to_admin.sh` as well.
0. If you do not have a dedicated service account for TPU administration (highly recommended), set one up:
Note: This only needs to be run once!
1. Create a CPU admin:
$ ./create_cpu.sh
Note that this script installs dependencies on the VM via a
[startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
and automatically blocks until the startup script is complete.
2. Deploy local code to CPU:
$ ./deploy_to_admin.sh
3. SSH to the VM
$ gcloud compute ssh $USER-admin -- -L8265:localhost:8265
Note that we enable port forwarding here because Ray automatically starts a dashboard on port 8265. On the machine you SSH from, the dashboard is then reachable on local port 8265.
4. If you skipped step 0, set up your gcloud credentials within the CPU VM:
$ gcloud auth login --update-adc
Note that this command authorizes your VM instance to use your personal Google account, which may be a security risk in a production setting.
5. Run the necessary pip installs:
$ pip3 install -r src/requirements.txt
6. Start the Ray admin:
$ ray start --head --port=6379 --num-cpus=0
Note: `--num-cpus=0` prevents CPU jobs, such as profiling, from being scheduled on the admin node.
### Basic JAX Example
See [`run_basic_jax.py`](src/run_basic_jax.py).
For ML frameworks compatible with Cloud TPUs that use a multi-controller programming model (e.g. JAX and PyTorch/XLA PJRT), you must run at least one process per host (see [Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)). In practice,
this might look as follows:
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py"
If you have more than ~16 hosts (e.g. v4-128) you will run into SSH scalability issues
and your command might have to change to:
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all --batch-size=8
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py &" --batch-size=8
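If you script this copy-and-run cycle, the two commands can be built from one place so that the batch size is applied consistently. The following is a minimal sketch that mirrors the commands above. `scp_command` and `ssh_command` are hypothetical helpers, not part of this repo, and the sketch only builds the argument lists; running them (for example with `subprocess.run`) is left to the caller.

```python
import shlex
from typing import List, Optional


def _with_batch_size(cmd: List[str], batch_size: Optional[int]) -> List[str]:
    # Only add --batch-size when requested, e.g. for pods with many hosts.
    if batch_size is not None:
        cmd.append(f"--batch-size={batch_size}")
    return cmd


def scp_command(src: str, tpu_name: str,
                batch_size: Optional[int] = None) -> List[str]:
    """Build the gcloud argv that copies `src` to every worker of a TPU."""
    cmd = ["gcloud", "compute", "tpus", "tpu-vm", "scp",
           src, f"{tpu_name}:~/", "--worker=all"]
    return _with_batch_size(cmd, batch_size)


def ssh_command(tpu_name: str, remote_command: str,
                batch_size: Optional[int] = None) -> List[str]:
    """Build the gcloud argv that runs `remote_command` on every worker."""
    cmd = ["gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
           "--worker=all", f"--command={remote_command}"]
    return _with_batch_size(cmd, batch_size)


# Print shell-quoted versions of the batched commands shown above.
print(shlex.join(scp_command("my_bug_free_python_code", "my_tpu", batch_size=8)))
print(shlex.join(ssh_command(
    "my_tpu", "python3 ~/my_bug_free_python_code/main.py &", batch_size=8)))
```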
This can become a hindrance to developer velocity if `my_bug_free_python_code`
contains bugs! One way to solve this problem is to use an orchestrator like K8s or Ray. Ray includes the concept of a [Runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) that, when applied, deploys code and dependencies when the Ray application is run.
Combining the Ray runtime environment with Ray Cluster and Ray Jobs allows us to bypass
the SCP/SSH cycle. [`run_basic_jax.py`](src/run_basic_jax.py) is a minimal example that demonstrates
how you can use Ray Jobs and the Ray runtime environment on a Ray cluster with TPU VMs to run a JAX workload.
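To illustrate the idea independently of `RayTpuController`, the following is a minimal sketch of submitting a Ray Job whose runtime environment ships a local directory and which is pinned to a host advertising the `tpu_host` resource. It assumes a Ray 2.x release whose `JobSubmissionClient.submit_job` accepts `entrypoint_resources`; `build_job_request` and `submit` are hypothetical helpers, and the sketch does not show how `RayTpuController` itself fans work out across hosts.

```python
from typing import Any, Dict, List, Optional


def build_job_request(entrypoint: str, working_dir: str,
                      pip_packages: Optional[List[str]] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for JobSubmissionClient.submit_job."""
    # The runtime environment uploads working_dir to the cluster and can
    # optionally install pip packages before the entrypoint starts.
    runtime_env: Dict[str, Any] = {"working_dir": working_dir}
    if pip_packages:
        runtime_env["pip"] = list(pip_packages)
    return {
        "entrypoint": entrypoint,
        "runtime_env": runtime_env,
        # Ask Ray to place the entrypoint on a node with the custom resource.
        "entrypoint_resources": {"tpu_host": 1},
    }


def submit(address: str, request: Dict[str, Any]) -> str:
    """Submit the job and return its submission ID (requires Ray)."""
    # Imported lazily so build_job_request works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(**request)


request = build_job_request("python3 main.py", "./my_bug_free_python_code")
print(request)
# On the admin VM, with the head node running, one might then call:
# submit("http://127.0.0.1:8265", request)
```

Because the code is uploaded at submission time, fixing a bug only requires resubmitting the job rather than repeating the SCP step.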
Assuming you followed the steps above, you should be able to run the example with:
$ python3 src/run_basic_jax.py
Some example output from `run_basic_jax.py`:
2023-03-01 22:12:10,065 INFO worker.py:1364 -- Connecting to existing Ray cluster at address:
2023-03-01 22:12:10,072 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at
W0301 22:12:11.148555 140341931026240 ray_tpu_controller.py:143] TPU is not found, create tpu...
Creating TPU: allencwang-ray-test
Request: {'accelerator_config': {'topology': '2x2x2', 'type': 'V4'}, 'runtimeVersion': 'tpu-vm-v4-base', 'networkConfig': {'enableExternalIps': True}, 'metadata': {'startup-script': '#! /bin/bash\necho "hello world"\nmkdir -p /dev/shm\nsudo mount -t tmpfs -o size=100g tmpfs /dev/shm\n pip3 install ray[default]\nray start --resources=\'{"tpu_host": 1}\' --address='}}
Create TPU operation still running...
Create TPU operation still running...
Create TPU operation complete.
I0301 22:13:17.795493 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:17.795823 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:13:47.840997 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:47.841155 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:14:17.884582 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:14:17.884731 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cl
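The repeated "Detected ... / Waiting for 30s ..." lines in the log above reflect a polling loop that waits until every TPU host has joined the Ray cluster. The following is a minimal sketch of such a loop, not the actual implementation in `ray_tpu_controller.py`. `wait_for_tpu_hosts` and `count_tpu_hosts_with_ray` are hypothetical names, and the host count is injected as a callable so that the loop can be exercised without a cluster.

```python
import time
from typing import Callable


def wait_for_tpu_hosts(count_hosts: Callable[[], int], expected_hosts: int,
                       poll_interval_s: float = 30.0, max_polls: int = 20,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until `expected_hosts` TPU hosts are detected or polls run out."""
    for _ in range(max_polls):
        detected = count_hosts()
        print(f"Detected {detected} TPU hosts in cluster, "
              f"expecting {expected_hosts} hosts in total")
        if detected >= expected_hosts:
            return True
        print(f"Waiting for {poll_interval_s:g}s for TPU hosts to join cluster...")
        sleep(poll_interval_s)
    return False


def count_tpu_hosts_with_ray() -> int:
    """Count joined hosts via the custom "tpu_host" resource (requires Ray).

    Assumes ray.init() has already connected to the cluster.
    """
    import ray  # Imported lazily so the loop above is usable without Ray.

    return int(ray.cluster_resources().get("tpu_host", 0))


# Simulated run: two polls see no hosts, the third sees both.
simulated_counts = iter([0, 0, 2])
ready = wait_for_tpu_hosts(lambda: next(simulated_counts), expected_hosts=2,
                           sleep=lambda seconds: None)
print("ready:", ready)
```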