# ChannelPermutations
Standalone code to reproduce results in "[Channel Permutations for N:M Sparsity](https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html)," Jeff Pool and Chong Yu, NeurIPS 2021.
Three search strategies are supported: randomly generating permutations and checking their quality; greedily swapping columns until convergence (i.e., TETRIS adapted for 2:4 sparsity); and the technique presented in the paper above, which optimizes stripe groups. This tool applies these strategies, as configured below, to either a randomly generated matrix or a .npy file (typically dumped from a real network) and reports the efficacy and runtime of each strategy.
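For readers new to 2:4 sparsity, the following sketch (plain NumPy; the function name and example values are illustrative, not from this repo) shows the pruning rule and why column permutations matter: keeping the 2 largest-magnitude values in every group of 4 consecutive columns preserves more total magnitude when large columns are spread across groups.

```python
import numpy as np

def prune_2_4(weights):
    """Zero the 2 smallest-magnitude entries in every group of 4
    consecutive values along each row (2:4 structured sparsity)."""
    out = weights.copy()
    rows, cols = out.shape
    groups = out.reshape(rows, cols // 4, 4)
    # indices of the two smallest |values| in each group of four
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

# A column permutation can change how much magnitude survives pruning:
w = np.array([[9.0, 8.0, 7.0, 6.0, 1.0, 1.0, 1.0, 1.0]])
print(np.abs(prune_2_4(w)).sum())           # 19.0: 7 and 6 are dropped
perm = [0, 4, 1, 5, 2, 6, 3, 7]             # interleave large and small columns
print(np.abs(prune_2_4(w[:, perm])).sum())  # 30.0: all four large values kept
```

The search strategies below differ only in how they look for such a permutation; the pruning rule itself stays fixed.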
## Quick Start
### Installation
#### GPU path
Requirements:
- CUDA
- pybind11
A container such as `nvcr.io/nvidia/pytorch:21.12-py3` satisfies these requirements.
Installation (from this directory):
```
pushd ../permutation_search_kernels/CUDA_kernels
nvcc -O3 -shared -Xcompiler -fPIC -Xcompiler -DTORCH_EXTENSION_NAME=permutation_search_cuda -std=c++11 $(python3 -m pybind11 --includes) permutation_search_kernels.cu -o ../permutation_search_cuda$(python3-config --extension-suffix)
popd
```
#### CPU path
Only NumPy is required for CPU-only execution.
### Important arguments
`python3 permutation_test.py` will tell you all the available arguments and alert you about required arguments:
```
usage: permutation_test.py [-h] [--infile INFILE] [--channels CHANNELS] [--filters FILTERS]
[--verbosity VERBOSITY] [--seed SEED] [--pretty_print PRETTY_PRINT]
[--unstructured UNSTRUCTURED] [--gpu GPU] [--check_permutation CHECK_PERMUTATION]
[--intermediate_steps INTERMEDIATE_STEPS] [--print_permutation PRINT_PERMUTATION]
strategy [strategy ...]
permutation_test.py: error: the following arguments are required: strategy
```
Detailed information about each argument:
- `--infile` (string) accepts .npy files with weights dumped from some model checkpoint. By default, the input file is `'random'`, which will generate a random 2D matrix with `CHANNELS` columns and `FILTERS` rows.
- `--channels` and `--filters` (unsigned integers) specify the size of the randomly-generated matrix if there is no input file specified.
- `--verbosity` (unsigned integer) controls the amount of debug and status information printed. `0` is just the important data, `11` can give periodic status details, and higher integers provide increasingly more detail.
- `--seed` (unsigned integer) allows for changing the random seed, which will affect the random matrix generation, random permutations generated, and columns swapped for bounded regressions.
- `--pretty_print` (bool) prints a pretty graph by default (below), but disabling will generate output friendly for redirecting to a .csv file.
- `--unstructured` (float) will apply unstructured pruning to the matrix before searching for permutations. A negative value will find the minimum unstructured sparsity for which a search strategy can find a perfect permutation and not create any extra zeros.
- `--gpu` (bool) uses CUDA kernels by default (if they are built and there is a GPU available), but you can override this to run on the CPU.
- `--check_permutation` (bool) makes sure the permutation tracked during the search process matches the one that's recovered directly from the permuted matrix.
- `--intermediate_steps` (unsigned integer) will emit permutations with efficacies equally dividing the distance between the default order and the best permutation found.
- `--print_permutation` (bool) prints the permutation found for each strategy.
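As a minimal sketch of preparing input for `--infile` (the filename and shapes here are hypothetical), save a 2D weight matrix with NumPy; convolution weights of shape (K, C, R, S) would typically be reshaped to 2D first:

```python
import numpy as np

# Hypothetical example: dump a 2D weight matrix (rows = filters,
# columns = channels) so it can be passed to the tool via --infile.
weights = np.random.rand(128, 64).astype(np.float32)
np.save('layer_weights.npy', weights)
# then: python3 permutation_test.py --infile layer_weights.npy optimize_stripe_groups,8,100
```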
Finally, after these optional arguments, provide the search strategies desired. There are three strategies offered:
- `random,<num_seeds=10>`
- `channel_swap,<bounded_regressions=100>`
- `optimize_stripe_groups,<stripe_group_size_in_columns=8>,<bounded_regressions=100>`
### Launch a test with interesting search strategies
Now that the kernels are built, you can use them to accelerate the search, which can be quite time-consuming without a GPU. Below, we report results for several interesting strategies on a 64-column, 128-row random matrix using a V100 accelerator.
```
$ python3 permutation_test.py --channels 64 --filters 128 channel_swap,0 channel_swap,100 channel_swap,1000 optimize_stripe_groups,8,0 optimize_stripe_groups,8,100 optimize_stripe_groups,8,1000 optimize_stripe_groups,12,0 random,1000 random,10000 random,100000
Found permutation search CUDA kernels for standalone testing
Found 2 gpus
strategy                     , magnitude, efficacy, duration
unpruned                     ,  4083.169,     -   ,    -
unstructured                 ,  3060.238,     -   ,    -
50% rows                     ,  3042.332,   100.0 ,    -
default 2:4                  ,  2852.376,     0.0 ,    0.000
channel_swap,0               ,  2913.352,    32.1 ,    0.214
channel_swap,100             ,  2914.174,    32.5 ,    2.249
channel_swap,1000            ,  2920.694,    36.0 ,   20.248
optimize_stripe_groups,8,0   ,  2919.757,    35.5 ,    0.013
optimize_stripe_groups,8,100 ,  2919.758,    35.5 ,    0.152
optimize_stripe_groups,8,1000,  2919.935,    35.6 ,    1.387
optimize_stripe_groups,12,0  ,  2921.947,    36.6 ,    0.860
random,1000                  ,  2873.380,    11.1 ,    0.116
random,10000                 ,  2873.603,    11.2 ,    1.149
random,100000                ,  2879.129,    14.1 ,   11.510
```
For this particular input, the `channel_swap` strategy requires 1000 bounded regressions in order to surpass the efficacy of optimizing two stripe groups (8 columns) without any bounded regressions, but allowing 1000 bounded regressions when optimizing two stripe groups is slightly worse than swapping channels with 1000 bounded regressions. Optimizing *three* stripe groups at a time outperforms all the other approaches by a wide margin. Testing many random permutations is inefficient and ineffective.
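The efficacy column appears to normalize the pruned magnitude between the default 2:4 baseline (0.0) and the per-row 50% upper bound (100.0); a small check against the numbers in the table above (the function name is illustrative, and the formula is inferred from the table, not taken from the source):

```python
def efficacy(pruned, default_24=2852.376, rows_50pct=3042.332):
    """Pruned magnitude rescaled so the default 2:4 result maps to 0
    and the per-row 50% upper bound maps to 100 (inferred formula)."""
    return 100.0 * (pruned - default_24) / (rows_50pct - default_24)

print(round(efficacy(2913.352), 1))  # 32.1, matching channel_swap,0
print(round(efficacy(2921.947), 1))  # 36.6, matching optimize_stripe_groups,12,0
```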
Without GPU acceleration, these tests would be much slower (though they find the same final permutations):
```
$ python3 permutation_test.py --gpu 0 --channels 64 --filters 128 channel_swap,0 channel_swap,100 optimize_stripe_groups,8,0 optimize_stripe_groups,8,100 random,1000
strategy                     , magnitude, efficacy, duration
unpruned                     ,  4083.169,     -   ,    -
unstructured                 ,  3060.238,     -   ,    -
50% rows                     ,  3042.332,   100.0 ,    -
default 2:4                  ,  2852.377,     0.0 ,    0.016
channel_swap,0               ,  2913.351,    32.1 ,   55.972
channel_swap,100             ,  2914.174,    32.5 ,  450.025
optimize_stripe_groups,8,0   ,  2919.759,    35.5 ,   60.653
optimize_stripe_groups,8,100 ,  2919.759,    35.5 ,  465.709
random,1000                  ,  2873.381,    11.1 ,   14.889
```
### Perform the ablation study from Table 1
`bash ablation_studies.sh` will generate the results for the ablation study, showing the relative importance of the bounded regressions and stripe group greedy phase.