# ChannelPermutations
Standalone code to reproduce results in "[Channel Permutations for N:M Sparsity](https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html)," Jeff Pool and Chong Yu, NeurIPS 2021.
Three search strategies are supported: randomly generating permutations and checking their quality; greedily swapping columns until convergence (i.e., TETRIS adapted for 2:4 sparsity); and the technique presented in the paper above, which optimizes stripe groups. This tool applies these strategies, as configured below, to either a randomly generated matrix or a .npy file (typically dumped from a real network) and reports the efficacy and runtime of each strategy.
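For readers new to 2:4 sparsity, the following sketch (plain NumPy; the function name and example values are illustrative, not from this repo) shows the pruning rule and why column permutations matter: keeping the 2 largest-magnitude values in every group of 4 consecutive columns preserves more total magnitude when large columns are spread across groups.

```python
import numpy as np

def prune_2_4(weights):
    """Zero the 2 smallest-magnitude entries in every group of 4
    consecutive values along each row (2:4 structured sparsity)."""
    out = weights.copy()
    rows, cols = out.shape
    groups = out.reshape(rows, cols // 4, 4)
    # indices of the two smallest |values| in each group of four
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

# A column permutation can change how much magnitude survives pruning:
w = np.array([[9.0, 8.0, 7.0, 6.0, 1.0, 1.0, 1.0, 1.0]])
print(np.abs(prune_2_4(w)).sum())           # 19.0: 7 and 6 are dropped
perm = [0, 4, 1, 5, 2, 6, 3, 7]             # interleave large and small columns
print(np.abs(prune_2_4(w[:, perm])).sum())  # 30.0: all four large values kept
```

The search strategies below differ only in how they look for such a permutation; the pruning rule itself stays fixed.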
## Quick Start
### Installation
#### GPU path
Requirements:
- CUDA
- pybind11
A container such as `nvcr.io/nvidia/pytorch:21.12-py3` satisfies these requirements.
Installation (from this directory):
```
pushd ../permutation_search_kernels/CUDA_kernels
nvcc -O3 -shared -Xcompiler -fPIC -Xcompiler -DTORCH_EXTENSION_NAME=permutation_search_cuda -std=c++11 $(python3 -m pybind11 --includes) permutation_search_kernels.cu -o ../permutation_search_cuda$(python3-config --extension-suffix)
popd
```
#### CPU path
Only NumPy is required for CPU-only execution.
### Important arguments
`python3 permutation_test.py` will tell you all the available arguments and alert you about required arguments:
```
usage: permutation_test.py [-h] [--infile INFILE] [--channels CHANNELS] [--filters FILTERS]
[--verbosity VERBOSITY] [--seed SEED] [--pretty_print PRETTY_PRINT]
[--unstructured UNSTRUCTURED] [--gpu GPU] [--check_permutation CHECK_PERMUTATION]
[--intermediate_steps INTERMEDIATE_STEPS] [--print_permutation PRINT_PERMUTATION]
strategy [strategy ...]
permutation_test.py: error: the following arguments are required: strategy
```
Detailed information about each argument:
- `--infile` (string) accepts .npy files with weights dumped from some model checkpoint. By default, the input file is `'random'`, which will generate a random 2D matrix with `CHANNELS` columns and `FILTERS` rows.
- `--channels` and `--filters` (unsigned integers) specify the size of the randomly-generated matrix if there is no input file specified.
- `--verbosity` (unsigned integer) controls the amount of debug and status information printed. `0` is just the important data, `11` can give periodic status details, and higher integers provide increasingly more detail.
- `--seed` (unsigned integer) allows for changing the random seed, which will affect the random matrix generation, random permutations generated, and columns swapped for bounded regressions.
- `--pretty_print` (bool) prints a pretty graph by default (below), but disabling will generate output friendly for redirecting to a .csv file.
- `--unstructured` (float) will apply unstructured pruning to the matrix before searching for permutations. A negative value will find the minimum unstructured sparsity for which a search strategy can find a perfect permutation and not create any extra zeros.
- `--gpu` (bool) uses CUDA kernels by default (if they are built and there is a GPU available), but you can override this to run on the CPU.
- `--check_permutation` (bool) makes sure the permutation tracked during the search process matches the one that's recovered directly from the permuted matrix.
- `--intermediate_steps` (unsigned integer) will emit permutations with efficacies equally dividing the distance between the default order and the best permutation found.
- `--print_permutation` (bool) prints the permutation found for each strategy.
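As a minimal sketch of preparing input for `--infile` (the filename and shapes here are hypothetical), save a 2D weight matrix with NumPy; convolution weights of shape (K, C, R, S) would typically be reshaped to 2D first:

```python
import numpy as np

# Hypothetical example: dump a 2D weight matrix (rows = filters,
# columns = channels) so it can be passed to the tool via --infile.
weights = np.random.rand(128, 64).astype(np.float32)
np.save('layer_weights.npy', weights)
# then: python3 permutation_test.py --infile layer_weights.npy optimize_stripe_groups,8,100
```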
Finally, after these optional arguments, provide the search strategies desired. There are three strategies offered:
- `random,<num_seeds=10>`
- `channel_swap,<bounded_regressions=100>`
- `optimize_stripe_groups,<stripe_group_size_in_columns=8>,<bounded_regressions=100>`
### Launch a test with interesting search strategies
Now that the kernels are built, you can use them to accelerate the search, which can be quite time-consuming without a GPU. Below, we report results for several interesting strategies on a 64-column, 128-row random matrix using a V100 accelerator.
```
$ python3 permutation_test.py --channels 64 --filters 128 channel_swap,0 channel_swap,100 channel_swap,1000 optimize_stripe_groups,8,0 optimize_stripe_groups,8,100 optimize_stripe_groups,8,1000 optimize_stripe_groups,12,0 random,1000 random,10000 random,100000
Found permutation search CUDA kernels for standalone testing
Found 2 gpus
strategy                     , magnitude, efficacy, duration
unpruned                     ,  4083.169,     -   ,    -
unstructured                 ,  3060.238,     -   ,    -
50% rows                     ,  3042.332,   100.0 ,    -
default 2:4                  ,  2852.376,     0.0 ,    0.000
channel_swap,0               ,  2913.352,    32.1 ,    0.214
channel_swap,100             ,  2914.174,    32.5 ,    2.249
channel_swap,1000            ,  2920.694,    36.0 ,   20.248
optimize_stripe_groups,8,0   ,  2919.757,    35.5 ,    0.013
optimize_stripe_groups,8,100 ,  2919.758,    35.5 ,    0.152
optimize_stripe_groups,8,1000,  2919.935,    35.6 ,    1.387
optimize_stripe_groups,12,0  ,  2921.947,    36.6 ,    0.860
random,1000                  ,  2873.380,    11.1 ,    0.116
random,10000                 ,  2873.603,    11.2 ,    1.149
random,100000                ,  2879.129,    14.1 ,   11.510
```
For this particular input, the `channel_swap` strategy requires 1000 bounded regressions in order to surpass the efficacy of optimizing two stripe groups (8 columns) without any bounded regressions, but allowing 1000 bounded regressions when optimizing two stripe groups is slightly worse than swapping channels with 1000 bounded regressions. Optimizing *three* stripe groups at a time outperforms all the other approaches by a wide margin. Testing many random permutations is inefficient and ineffective.
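The efficacy column appears to normalize the pruned magnitude between the default 2:4 baseline (0.0) and the per-row 50% upper bound (100.0); a small check against the numbers in the table above (the function name is illustrative, and the formula is inferred from the table, not taken from the source):

```python
def efficacy(pruned, default_24=2852.376, rows_50pct=3042.332):
    """Pruned magnitude rescaled so the default 2:4 result maps to 0
    and the per-row 50% upper bound maps to 100 (inferred formula)."""
    return 100.0 * (pruned - default_24) / (rows_50pct - default_24)

print(round(efficacy(2913.352), 1))  # 32.1, matching channel_swap,0
print(round(efficacy(2921.947), 1))  # 36.6, matching optimize_stripe_groups,12,0
```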
Without GPU acceleration, these tests would be much slower (though they find the same final permutations):
```
$ python3 permutation_test.py --gpu 0 --channels 64 --filters 128 channel_swap,0 channel_swap,100 optimize_stripe_groups,8,0 optimize_stripe_groups,8,100 random,1000
strategy                     , magnitude, efficacy, duration
unpruned                     ,  4083.169,     -   ,    -
unstructured                 ,  3060.238,     -   ,    -
50% rows                     ,  3042.332,   100.0 ,    -
default 2:4                  ,  2852.377,     0.0 ,    0.016
channel_swap,0               ,  2913.351,    32.1 ,   55.972
channel_swap,100             ,  2914.174,    32.5 ,  450.025
optimize_stripe_groups,8,0   ,  2919.759,    35.5 ,   60.653
optimize_stripe_groups,8,100 ,  2919.759,    35.5 ,  465.709
random,1000                  ,  2873.381,    11.1 ,   14.889
```
### Perform the ablation study from Table 1
`bash ablation_studies.sh` will generate the results for the ablation study, showing the relative importance of the bounded regressions and stripe group greedy phase.