## PyProf - PyTorch Profiling tool
### What does this tool do?
Analyzing the performance of deep neural networks is hard. Getting kernels out of [NvProf](https://developer.nvidia.com/nvidia-visual-profiler) or [NSight Compute](https://developer.nvidia.com/nsight-compute) provides the kernel name and its execution time, but not detailed information regarding the following:
- Which layer launched it: e.g. the association of `ComputeOffsetsKernel` with a concrete PyTorch layer or API is not obvious.
- What the tensor dimensions and precision were: without knowing the tensor dimensions and precision, it's impossible to reason about whether the actual (silicon) kernel time is close to the maximum achievable performance of such a kernel on the GPU. Knowing the tensor dimensions and precision, we can figure out the FLOPs and bandwidth required by a layer, and then determine how close to maximum performance the kernel is for that operation.
- Forward-backward correlation: it's currently very hard to determine which forward pass step resulted in a particular weight or data gradient (wgrad, dgrad), which makes it difficult to determine the tensor dimensions required by these backprop steps in order to assess their performance.
- Did the kernel use [Tensor Cores](https://www.youtube.com/watch?v=yyR0ZoCeBO8)?
- Which line in the user's code resulted in launching this particular kernel (program trace)?
PyProf addresses all of the issues above by:
1. Instrumenting PyTorch operations to capture the tensor dimensions and precision using [NVTX](https://devblogs.nvidia.com/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx). This information is recorded at profile capture time, e.g. using [NvProf](https://developer.nvidia.com/nvidia-visual-profiler).
2. Querying the record produced by the profiler to correlate the kernel name and duration with PyTorch API/layer name, tensor dimensions, tensor precision, as well as calculating FLOPs and bandwidth for common operations. In addition, extra information from the profile is added for use by CUDA professionals, such as CUDA launch parameters (block/grid dimensions).
Regarding FLOP and bandwidth implementations, these are usually quite straightforward. For example, for matrices A<sub>MxK</sub> and B<sub>KxN</sub>, the FLOP count for a matrix multiplication is 2 * M * N * K, and the memory traffic is M * K + N * K + M * N elements (reading A and B and writing the result), multiplied by the element size in bytes. Note that these numbers are derived from the algorithm, not from the measured performance of the specific kernel. For more details, see NVIDIA's [Deep Learning Performance Guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html).
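As a concrete illustration (this helper is not part of PyProf; the function name and the fp16 element size are assumptions for the example), the algorithmic FLOP and byte counts for a GEMM could be computed as follows:
```python
def gemm_flops_and_bytes(M, N, K, bytes_per_element=2):
    """Algorithmic cost of C[MxN] = A[MxK] @ B[KxN].

    bytes_per_element=2 assumes fp16 operands; use 4 for fp32.
    """
    flops = 2 * M * N * K                 # one multiply + one add per multiply-accumulate
    elements = M * K + K * N + M * N      # read A, read B, write C
    return flops, elements * bytes_per_element

# Example: a 1024 x 1024 x 1024 fp16 GEMM
flops, bytes_moved = gemm_flops_and_bytes(1024, 1024, 1024)
print(f"{flops / 1e9:.1f} GFLOP, {bytes_moved / 1e6:.1f} MB")
```
Dividing these algorithmic numbers by the kernel time reported by PyProf gives the achieved FLOP rate and bandwidth, which can then be compared against the GPU's peak.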
Armed with such information, the user can diagnose performance issues and tune the network. For instance, according to the [Tensor Core Performance Guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html), the M, N and K dimensions that result in Tensor Core usage need to be divisible by 8. In fact, PyProf comes with a flag that lets the user obtain information regarding whether Tensor Cores were used by the kernel. Other useful information might include knowing that a particular kernel did not exploit much thread parallelism, as determined by the grid/block dimensions. Since many PyTorch kernels are open-source (or even custom written by the user, as in [CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html)), this provides the user with information that helps root cause performance issues and prioritize optimization work.
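As a quick illustration of that rule, here is a minimal sketch (a hypothetical helper, not part of PyProf) that checks whether a GEMM shape reported in the profile satisfies the divisibility-by-8 heuristic:
```python
def tensor_core_friendly(M, N, K, multiple=8):
    """Heuristic from the performance guide: fp16 GEMM dimensions
    should be divisible by 8 for Tensor Core kernels to be eligible."""
    return all(dim % multiple == 0 for dim in (M, N, K))

print(tensor_core_friendly(1024, 1024, 1024))  # True
print(tensor_core_friendly(1000, 1024, 1024))  # False: M = 1000 is not a multiple of 8
```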
### How to get started?
1. Add the following lines to your PyTorch network:
```python
import torch.cuda.profiler as profiler
from apex import pyprof
pyprof.nvtx.init()
```
Run the training/inference loop inside [PyTorch's NVTX context manager](https://pytorch.org/docs/stable/_modules/torch/autograd/profiler.html#emit_nvtx)
`with torch.autograd.profiler.emit_nvtx()`. Optionally, you can
use `profiler.start()` and `profiler.stop()` to pick an iteration
(say after warm-up) for which you would like to capture data.
Here's an example:
```python
iters = 500
iter_to_capture = 100
# Define network, loss function, optimizer etc.
# PyTorch NVTX context manager
with torch.autograd.profiler.emit_nvtx():
    for iter in range(iters):
        if iter == iter_to_capture:
            profiler.start()
        output = net(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        if iter == iter_to_capture:
            profiler.stop()
```
2. Run `nvprof` to generate a SQL (NVVP) file. This file can be opened with NVVP, as usual.
```sh
# If you used profiler.start() and profiler.stop() in net.py
nvprof -f -o net.sql --profile-from-start off -- python net.py
# Profile everything
nvprof -f -o net.sql -- python net.py
```
**Note:** if you're experiencing issues with hardware counters and you get a message such as `ERR_NVGPUCTRPERM: The user running <tool_name/application_name> does not have permission to access NVIDIA GPU Performance Counters on the target device`, please follow the steps described in [Hardware Counters](#hardware-counters).
3. Run the parser on the SQL file. The output is an ASCII file in which each line
is a Python dictionary containing information about the kernel name,
duration, parameters etc. This file can also be used as input to custom
post-processing scripts, as in the sketch after the command below.
```sh
python -m apex.pyprof.parse net.sql > net.dict
```
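For instance, here is a minimal sketch of such a custom script. It only assumes what step 3 states, i.e. that each line of `net.dict` is a literal Python dictionary; the available keys depend on the PyProf version, so the sketch simply reports whichever keys are present:
```python
import ast

# Read the parser output: one Python dictionary per line.
records = []
with open("net.dict") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(ast.literal_eval(line))

print(f"{len(records)} kernel records found")
if records:
    # Inspect which fields are available for further analysis.
    print("available keys:", sorted(records[0].keys()))
```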
4. Run the profiler. The input is the python dictionary created above. The tool can produce a CSV output, a columnated output (similar to `column -t` for terminal readability) and a space-separated output (for post-processing by AWK, for instance). The tool produces 20 columns of information for every GPU kernel, but you can select a subset of columns using the `-c` flag. Note that a few columns might have the value "na", implying either that it is a work in progress or that the tool was unable to extract that information. Here are a few examples of how to use `apex.pyprof.prof`.
```sh
# Print usage and help. Lists all available output columns.
python -m apex.pyprof.prof -h
# Columnated output of width 150 with some default columns.
python -m apex.pyprof.prof -w 150 net.dict
# CSV output.
python -m apex.pyprof.prof --csv net.dict
# Space separated output.
python -m apex.pyprof.prof net.dict
# Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict
# CSV output with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict
# Space separated output with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof -c idx,dir,kernel,params,sil net.dict
# Input redirection.
python -m apex.pyprof.prof < net.dict
```
5. Profile-guided optimization
If kernels that do matrix multiplication/GEMM or convolution use half precision (fp16) data but do not use Tensor Cores (the TC column in the profile analysis output doesn't show a "1"), one can follow some basic steps to increase the likelihood that a Tensor Core-compatible kernel will be chosen. For example, for GEMMs, M, N and K should be divisible by 8, and for convolutions, the number of input and output channels should be divisible by 8.
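As an illustration of such a fix, here is a minimal sketch, assuming a plain `nn.Conv2d` layer and a rounding helper invented for this example, that pads channel counts up to the next multiple of 8 before running in fp16:
```python
import torch
import torch.nn as nn

def round_up(value, multiple=8):
    """Round a dimension up to the next multiple of 8 so that a
    Tensor Core eligible kernel can be selected for fp16 math."""
    return ((value + multiple - 1) // multiple) * multiple

# Channel counts like 3 or 60 block Tensor Core usage in fp16;
# padding them to 8 and 64 makes the convolution TC-friendly.
in_channels, out_channels = round_up(3), round_up(60)   # 8, 64
conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1).half().cuda()

x = torch.randn(32, in_channels, 224, 224, dtype=torch.float16, device="cuda")
y = conv(x)  # re-profile with PyProf and check the TC column
```
The same rounding can be applied to GEMM dimensions, for example padding a hidden size or vocabulary size up to a multiple of 8, after which the kernel can be re-profiled to confirm that the TC column now shows a "1".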