# Using Ray with Cloud TPUs
This **experimental** repository contains minimal examples of how you can use
Ray (ray.io) with Cloud TPUs.
These examples are not meant to be used in production services and are for
illustrative purposes only.
## Helpful pre/post-reads
- [Ray Overview](https://docs.ray.io/en/latest/ray-overview/index.html)
- [Ray Cluster](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#vm-cluster-quick-start)
- [Ray Job](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html)
- [JAX Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)
## What's included in this repo?
For your convenience, we provide:
- generic abstractions that hide away boilerplate for common TPU actions and
- toy examples that you can fork for your own basic workflows.
Specifically:
- [`tpu_api.py`](src/tpu_api.py) - Python wrapper for basic TPU operations using the
  [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest).
- [`tpu_controller.py`](src/tpu_controller.py) - Class representation of a TPU. This
  is essentially a wrapper for `tpu_api.py`.
- [`ray_tpu_controller.py`](src/ray_tpu_controller.py) - TPU controller with Ray
  functionality. This abstracts away boilerplate for Ray Cluster and Ray Jobs.
- [`run_basic_jax.py`](src/run_basic_jax.py) - Basic example that shows how to use
  `RayTpuController` for `print(jax.device_count())`.
- [`run_hp_search.py`](src/run_hp_search.py) - Basic example that shows how Ray
  Tune can be used with JAX/Flax on MNIST.
- [`run_t5x_autoresume.py`](src/run_t5x_autoresume.py) - Example that showcases how
  you can use `RayTpuController` for fault-tolerant training, using T5X as the
  example workload.
## Tutorial
### Setting up your CPU VM
One of the basic ways you can use Ray with a TPU pod is to set up the TPU pod as a Ray cluster. We've found that creating a separate CPU VM as an admin (a.k.a. coordinator) VM is the natural way to do this. See below for a visualization
and example `gcloud` commands:
![Ray Cluster on TPU](tpu_ray_cluster.png)
```
$ gcloud compute instances create my-tpu-admin --machine-type=n1-standard-4 ...
$ gcloud compute ssh my-tpu-admin
$ (vm) # install Python 3, Ray, ...
$ (vm) ray start --head --port=6379 --num-cpus=0
...
# (Ray returns the IP address of the HEAD node, let's call it RAY_HEAD_IP)
$ (vm) gcloud compute tpus tpu-vm create $TPU_NAME ... --metadata startup-script="pip3 install ray && ray start --address=$RAY_HEAD_IP --resources='{\"tpu_host\": 1}'"
```
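Once the TPU hosts have joined, you can confirm this from the admin VM. The following is a minimal sketch, assuming the hosts registered the `tpu_host` resource as in the startup script above:
```
# Minimal sketch (run on the admin/head VM): confirm that the TPU hosts
# registered in the startup script above have joined the Ray cluster.
import ray

ray.init(address="auto")  # connect to the head node started on this VM

resources = ray.cluster_resources()
print(resources)
# Expect a "tpu_host" entry equal to the number of TPU VM workers,
# e.g. {'CPU': ..., 'tpu_host': 2.0, ...} for a two-host pod slice.
```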
For your convenience, we also provide basic scripts (see [`create_cpu.sh`](create_cpu.sh) and [`deploy_to_admin.sh`](deploy_to_admin.sh)) for creating
an admin CPU VM and deploying the contents of this folder to your CPU VM.
Notes:
- `create_cpu.sh` creates a VM named `$USER-admin` and uses the project and
zone set in your `gcloud config` defaults. Run `gcloud config list` to see
those defaults.
- `create_cpu.sh` allocates a 200 GB boot disk by default.
- `deploy_to_admin.sh` assumes your VM is named `$USER-admin`; if you change
that value in `create_cpu.sh`, be sure to change it in `deploy_to_admin.sh` as well.
Instructions:
0. If you do not have a dedicated service account for TPU administration (highly recommended), set one up:
```
./create_tpu_service_account.sh
```
Note: This only needs to be run once!
1. Create a CPU admin:
```
$ ./create_cpu.sh
```
Note that this script installs dependencies on the VM via a
[startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
and automatically blocks until the startup script is complete.
2. Deploy local code to CPU:
```
$ ./deploy_to_admin.sh
```
3. SSH to the VM
```
$ gcloud compute ssh $USER-admin -- -L8265:localhost:8265
```
Note that we enable port forwarding here because Ray automatically starts a dashboard on port 8265. From the machine you SSH from, you can access this dashboard at http://127.0.0.1:8265/.
4. If you skipped step 0, set up your gcloud credentials within the CPU VM:
```
$ gcloud auth login --update-adc
```
Note that this command authorizes your VM instance to use your personal Google account, which may be a security risk in a production setting.
5. Run the necessary pip installs:
```
$ pip3 install -r src/requirements.txt
```
6. Start the Ray admin:
```
$ ray start --head --port=6379 --num-cpus=0
```
Note: `--num-cpus=0` prevents CPU workloads (e.g. profiling jobs) from being scheduled on the admin node.
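Because the admin node contributes no CPUs, work is steered toward the TPU hosts via the custom `tpu_host` resource they register at startup. A minimal sketch of that pattern (the task body is purely illustrative):
```
# Minimal sketch: run one Ray task per TPU host by requesting the custom
# "tpu_host" resource that each TPU VM registered when joining the cluster.
import socket

import ray

ray.init(address="auto")

@ray.remote(resources={"tpu_host": 1})
def hostname() -> str:
    return socket.gethostname()

num_hosts = int(ray.cluster_resources().get("tpu_host", 0))
print(ray.get([hostname.remote() for _ in range(num_hosts)]))
```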
### Basic JAX Example
See [`run_basic_jax.py`](src/run_basic_jax.py).
For ML frameworks compatible with Cloud TPUs that use a multi-controller programming model (e.g. JAX and PyTorch/XLA PJRT), you must run at least one process per host (see [Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)). The basic way
this looks in practice might be as follows:
```
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py"
```
If you have more than ~16 hosts (e.g. a v4-128), you will run into SSH scalability issues,
and your commands might have to change to:
```
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all --batch-size=8
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py &" --batch-size=8
```
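In either case, the script copied to each host follows JAX's multi-process model. A minimal sketch of what such a `main.py` might contain (assuming `jax[tpu]` is installed on every TPU VM):
```
# Minimal per-host JAX sketch: every TPU host runs this same script.
import jax

# On Cloud TPU VMs the coordinator address and process count are discovered
# automatically, so no arguments are needed here.
jax.distributed.initialize()

print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local / {jax.device_count()} global devices")
```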
This can become a hindrance to developer velocity if `my_bug_free_python_code`
contains bugs! One way to solve this problem is to use an orchestrator such as Kubernetes or Ray. Ray includes the concept of a [Runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) that, when applied, deploys code and dependencies when the Ray application is run.
Combining the Ray Runtime Env with Ray Cluster and Ray Jobs allows us to bypass
the SCP/SSH cycle. [`run_basic_jax.py`](src/run_basic_jax.py) is a minimal example that demonstrates
how you can use the Ray Jobs and Ray runtime environment on a Ray cluster with TPU VMs to run a JAX workload.
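To illustrate the mechanism, submitting a workload together with a runtime environment through the Ray Jobs API might look like the following sketch (the entrypoint, working directory, and dependency list are placeholders, not what `run_basic_jax.py` actually submits):
```
# Sketch: ship code and dependencies to the cluster via a runtime
# environment instead of a manual scp/ssh cycle.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray dashboard address

job_id = client.submit_job(
    entrypoint="python3 main.py",  # placeholder entrypoint
    runtime_env={
        "working_dir": "./my_bug_free_python_code",  # shipped to the workers
        "pip": ["jax[tpu]"],                         # placeholder dependencies
    },
)
print(f"Submitted job: {job_id}")
```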
Assuming you followed the above examples, you should be able to run this with:
```
$ python3 src/run_basic_jax.py
```
Some example output from this:
```
2023-03-01 22:12:10,065 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.130.0.19:6379...
2023-03-01 22:12:10,072 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
W0301 22:12:11.148555 140341931026240 ray_tpu_controller.py:143] TPU is not found, create tpu...
Creating TPU: allencwang-ray-test
Request: {'accelerator_config': {'topology': '2x2x2', 'type': 'V4'}, 'runtimeVersion': 'tpu-vm-v4-base', 'networkConfig': {'enableExternalIps': True}, 'metadata': {'startup-script': '#! /bin/bash\necho "hello world"\nmkdir -p /dev/shm\nsudo mount -t tmpfs -o size=100g tmpfs /dev/shm\n pip3 install ray[default]\nray start --resources=\'{"tpu_host": 1}\' --address=10.130.0.19:6379'}}
Create TPU operation still running...
Create TPU operation still running...
Create TPU operation complete.
I0301 22:13:17.795493 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:17.795823 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:13:47.840997 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:47.841155 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:14:17.884582 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:14:17.884731 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
...
```