# Using Ray with Cloud TPUs
This **experimental** repository contains minimal examples of how you can use
Ray (ray.io) with Cloud TPUs.
These examples are not meant to be used in production services and are for
illustrative purposes only.
## Helpful pre/post-reads
- [Ray Overview](https://docs.ray.io/en/latest/ray-overview/index.html)
- [Ray Cluster](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#vm-cluster-quick-start)
- [Ray Job](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html)
- [JAX Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)
## What's included in this repo?
For your convenience, we provide:
- generic abstractions that hide away boilerplate for common TPU actions and
- toy examples that you can fork for your own basic workflows.
Specifically:
- [`tpu_api.py`](src/tpu_api.py) - Python wrapper for basic TPU operations using the
  [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest).
- [`tpu_controller.py`](src/tpu_controller.py) - Class representation of a TPU. This
  is essentially a wrapper for `tpu_api.py`.
- [`ray_tpu_controller.py`](src/ray_tpu_controller.py) - TPU controller with Ray
  functionality. This abstracts away boilerplate for Ray Cluster and Ray Jobs.
- [`run_basic_jax.py`](src/run_basic_jax.py) - Basic example that shows how to use
  `RayTpuController` for `print(jax.device_count())`.
- [`run_hp_search.py`](src/run_hp_search.py) - Basic example that shows how Ray
  Tune can be used with JAX/Flax on MNIST.
- [`run_t5x_autoresume.py`](src/run_t5x_autoresume.py) - Example that showcases how
  you can use `RayTpuController` for fault-tolerant training, using T5X as the
  example workload.
## Tutorial
### Setting up your CPU VM
One of the basic ways you can use Ray with a TPU pod is to set up the TPU pod as a Ray cluster. We've found that creating a separate CPU VM as an admin (a.k.a. coordinator) VM is the natural way to do this. See below for a visualization
and example `gcloud` commands:
![Ray Cluster on TPU](tpu_ray_cluster.png)
```
$ gcloud compute instances create my-tpu-admin --machine-type=n1-standard-4 ...
$ gcloud compute ssh my-tpu-admin
$ (vm) # install Python 3, Ray, ...
$ (vm) ray start --head --port=6379 --num-cpus=0
...
# (Ray returns the IP address of the HEAD node, let's call it RAY_HEAD_IP)
$ (vm) gcloud compute tpus tpu-vm create $TPU_NAME ... --metadata startup-script="pip3 install ray && ray start --address=$RAY_HEAD_IP --resources='{\"tpu_host\": 1}'"
```
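Once the TPU hosts have joined, you can confirm this from the admin VM. The following is a minimal sketch, assuming the hosts registered the `tpu_host` resource as in the startup script above:
```
# Minimal sketch (run on the admin/head VM): confirm that the TPU hosts
# registered in the startup script above have joined the Ray cluster.
import ray

ray.init(address="auto")  # connect to the head node started on this VM

resources = ray.cluster_resources()
print(resources)
# Expect a "tpu_host" entry equal to the number of TPU VM workers,
# e.g. {'CPU': ..., 'tpu_host': 2.0, ...} for a two-host pod slice.
```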
For your convenience, we also provide basic scripts (see [`create_cpu.sh`](create_cpu.sh) and [`deploy_to_admin.sh`](deploy_to_admin.sh)) for creating
an admin CPU VM and deploying the contents of this folder to your CPU VM.
Notes:
- `create_cpu.sh` creates a VM named `$USER-admin` and uses the project and
zone set in your `gcloud config` defaults. Run `gcloud config list` to see
those defaults.
- `create_cpu.sh` allocates a 200 GB boot disk by default.
- `deploy_to_admin.sh` assumes your VM is named `$USER-admin`; if you change
that value in `create_cpu.sh`, be sure to change it in `deploy_to_admin.sh` as well.
Instructions:
0. If you do not have a dedicated service account for TPU administration (highly recommended), set one up:
```
./create_tpu_service_account.sh
```
Note: This only needs to be run once!
1. Create a CPU admin:
```
$ ./create_cpu.sh
```
Note that this script installs dependencies on the VM via a
[startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
and automatically blocks until the startup script is complete.
2. Deploy local code to CPU:
```
$ ./deploy_to_admin.sh
```
3. SSH to the VM
```
$ gcloud compute ssh $USER-admin -- -L8265:localhost:8265
```
Note that we enable port forwarding here because Ray automatically starts a dashboard on port 8265. From the machine you SSH from, you can access this dashboard at http://127.0.0.1:8265/.
4. If you skipped step 0, set up your gcloud credentials within the CPU VM:
```
$ gcloud auth login --update-adc
```
Note that this command authorizes your VM instance to use your personal Google account, which may be a security risk in a production setting.
5. Run the necessary pip installs:
```
$ pip3 install -r src/requirements.txt
```
6. Start the Ray admin:
```
$ ray start --head --port=6379 --num-cpus=0
```
Note: `--num-cpus=0` prevents CPU workloads (e.g. profiling jobs) from being scheduled on the admin node.
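Because the admin node contributes no CPUs, work is steered toward the TPU hosts via the custom `tpu_host` resource they register at startup. A minimal sketch of that pattern (the task body is purely illustrative):
```
# Minimal sketch: run one Ray task per TPU host by requesting the custom
# "tpu_host" resource that each TPU VM registered when joining the cluster.
import socket

import ray

ray.init(address="auto")

@ray.remote(resources={"tpu_host": 1})
def hostname() -> str:
    return socket.gethostname()

num_hosts = int(ray.cluster_resources().get("tpu_host", 0))
print(ray.get([hostname.remote() for _ in range(num_hosts)]))
```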
### Basic JAX Example
See [`run_basic_jax.py`](src/run_basic_jax.py).
For ML frameworks compatible with Cloud TPUs that use a multi-controller programming model (e.g. JAX and PyTorch/XLA PJRT), you must run at least one process per host (see [Multi-process programming model](https://jax.readthedocs.io/en/latest/multi_process.html#multi-process-programming-model)). The basic way
this looks in practice might be as follows:
```
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py"
```
If you have more than ~16 hosts (e.g. a v4-128), you will run into SSH scalability issues,
and your commands might have to change to:
```
$ gcloud compute tpus tpu-vm scp my_bug_free_python_code my_tpu:~/ --worker=all --batch-size=8
$ gcloud compute tpus tpu-vm ssh my_tpu --worker=all --command="python3 ~/my_bug_free_python_code/main.py &" --batch-size=8
```
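In either case, the script copied to each host follows JAX's multi-process model. A minimal sketch of what such a `main.py` might contain (assuming `jax[tpu]` is installed on every TPU VM):
```
# Minimal per-host JAX sketch: every TPU host runs this same script.
import jax

# On Cloud TPU VMs the coordinator address and process count are discovered
# automatically, so no arguments are needed here.
jax.distributed.initialize()

print(f"process {jax.process_index()}/{jax.process_count()}: "
      f"{jax.local_device_count()} local / {jax.device_count()} global devices")
```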
This can become a hindrance to developer velocity if `my_bug_free_python_code`
contains bugs! One way to solve this problem is to use an orchestrator such as Kubernetes or Ray. Ray includes the concept of a [Runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) that, when applied, deploys code and dependencies when the Ray application is run.
Combining the Ray Runtime Env with Ray Cluster and Ray Jobs allows us to bypass
the SCP/SSH cycle. [`run_basic_jax.py`](src/run_basic_jax.py) is a minimal example that demonstrates
how you can use the Ray Jobs and Ray runtime environment on a Ray cluster with TPU VMs to run a JAX workload.
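To illustrate the mechanism, submitting a workload together with a runtime environment through the Ray Jobs API might look like the following sketch (the entrypoint, working directory, and dependency list are placeholders, not what `run_basic_jax.py` actually submits):
```
# Sketch: ship code and dependencies to the cluster via a runtime
# environment instead of a manual scp/ssh cycle.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray dashboard address

job_id = client.submit_job(
    entrypoint="python3 main.py",  # placeholder entrypoint
    runtime_env={
        "working_dir": "./my_bug_free_python_code",  # shipped to the workers
        "pip": ["jax[tpu]"],                         # placeholder dependencies
    },
)
print(f"Submitted job: {job_id}")
```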
Assuming you followed the above examples, you should be able to run this with:
```
$ python3 src/run_basic_jax.py
```
Some example output from this:
```
2023-03-01 22:12:10,065 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.130.0.19:6379...
2023-03-01 22:12:10,072 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
W0301 22:12:11.148555 140341931026240 ray_tpu_controller.py:143] TPU is not found, create tpu...
Creating TPU: allencwang-ray-test
Request: {'accelerator_config': {'topology': '2x2x2', 'type': 'V4'}, 'runtimeVersion': 'tpu-vm-v4-base', 'networkConfig': {'enableExternalIps': True}, 'metadata': {'startup-script': '#! /bin/bash\necho "hello world"\nmkdir -p /dev/shm\nsudo mount -t tmpfs -o size=100g tmpfs /dev/shm\n pip3 install ray[default]\nray start --resources=\'{"tpu_host": 1}\' --address=10.130.0.19:6379'}}
Create TPU operation still running...
Create TPU operation still running...
Create TPU operation complete.
I0301 22:13:17.795493 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:17.795823 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:13:47.840997 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:13:47.841155 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
I0301 22:14:17.884582 140341931026240 ray_tpu_controller.py:121] Detected 0 TPU hosts in cluster, expecting 2 hosts in total
I0301 22:14:17.884731 140341931026240 ray_tpu_controller.py:160] Waiting for 30s for TPU hosts to join cluster...
...
```