## PyProf - PyTorch Profiling tool
### What does this tool do?
Analyzing the performance of deep neural networks is hard. Getting kernels out of [NvProf](https://developer.nvidia.com/nvidia-visual-profiler) or [NSight Compute](https://developer.nvidia.com/nsight-compute) provides some generic kernel name and its execution time, but not detailed information regarding the following:
- Which layer launched it: e.g. the association of `ComputeOffsetsKernel` with a concrete PyTorch layer or API is not obvious.
- What the tensor dimensions and precision were: without knowing the tensor dimensions and precision, it's impossible to reason about whether the actual (silicon) kernel time is close to maximum performance of such a kernel on the GPU. Knowing the tensor dimensions and precision, we can figure out the FLOPs and bandwidth required by a layer, and then determine how close to maximum performance the kernel is for that operation.
- Forward-backward correlation: currently it's very hard to determine what the forward pass step was that resulted in the particular weight and data gradients (wgrad, dgrad), which makes it difficult to determine the tensor dimensions required by these backprop steps to assess their performance.
- Did the kernel use [Tensor Cores](https://www.youtube.com/watch?v=yyR0ZoCeBO8)?
- Which line in the user's code resulted in launching this particular kernel (program trace)?
PyProf addresses all of the issues above by:
1. Instrumenting PyTorch operations to capture the tensor dimensions and precision using [NVTX](https://devblogs.nvidia.com/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx). This information is recorded at profile capture time, e.g. using [NvProf](https://developer.nvidia.com/nvidia-visual-profiler).
2. Querying the record produced by the profiler to correlate the kernel name and duration with PyTorch API/layer name, tensor dimensions, tensor precision, as well as calculating FLOPs and bandwidth for common operations. In addition, extra information from the profile is added for use by CUDA professionals, such as CUDA launch parameters (block/grid dimensions).
The FLOP and bandwidth calculations are usually quite straightforward. For example, for matrices A<sub>MxK</sub> and B<sub>KxN</sub>, the FLOP count for a matrix multiplication is 2 * M * N * K, and the data moved is M * K + N * K + M * N elements (multiply by the element size to get bytes). Note that these numbers are based on the algorithm, not the actual performance of the specific kernel. For more details, see NVIDIA's [Deep Learning Performance Guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html).
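The matrix multiplication arithmetic above can be sketched in a few lines. This is an illustration only, not PyProf's own implementation: the function names `gemm_cost` and `achieved_tflops` are hypothetical, and the default of 2 bytes per element assumes fp16 operands.

```python
def gemm_cost(m, n, k, bytes_per_element=2):
    """Algorithmic cost of C[MxN] = A[MxK] @ B[KxN] (hypothetical helper)."""
    flops = 2 * m * n * k              # one multiply and one add per (m, n, k) triple
    elements = m * k + n * k + m * n   # read A and B once, write C once
    return flops, elements * bytes_per_element


def achieved_tflops(flops, silicon_time_ns):
    """Algorithmic FLOPs divided by the measured kernel time, in TFLOP/s."""
    return flops / (silicon_time_ns * 1e-9) / 1e12


flops, nbytes = gemm_cost(1024, 1024, 1024)             # fp16 operands assumed
print(flops, nbytes)                                    # 2147483648 6291456
print(achieved_tflops(flops, silicon_time_ns=100_000))  # about 21.5 TFLOP/s
```

Comparing the achieved figure with the GPU's peak throughput for that precision gives a rough idea of how much headroom a kernel has.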
Armed with such information, the user can determine various issues to help them tune the network. For instance, according to the [Tensor Core Performance Guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html), the M, N and K dimensions that result in Tensor Core usage need to be divisible by 8. In fact, PyProf comes with a flag that lets the user obtain information regarding whether Tensor Cores were used by the kernel. Other useful information might include knowing that a particular kernel did not exploit much thread parallelism, as determined by the grid/block dimensions. Since many PyTorch kernels are open-source (or even custom written by the user, as in [CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html)), this provides the user with information that helps root cause performance issues and prioritize optimization work.
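The divisibility rule can be checked on layer dimensions before profiling. Below is a minimal sketch; `tensor_core_friendly` and `pad_to_multiple` are hypothetical helper names, not part of PyProf. Satisfying the rule makes a Tensor Core kernel more likely to be chosen but does not guarantee it.

```python
def tensor_core_friendly(m, n, k, multiple=8):
    """True when M, N and K are all divisible by `multiple`."""
    return all(dim % multiple == 0 for dim in (m, n, k))


def pad_to_multiple(dim, multiple=8):
    """Smallest value >= dim that is divisible by `multiple`."""
    return -(-dim // multiple) * multiple   # ceiling division, then scale back up


print(tensor_core_friendly(1024, 1024, 512))   # True
print(tensor_core_friendly(1024, 1000, 33))    # False, because 33 % 8 != 0
print(pad_to_multiple(33))                     # 40
```

Padding a dimension such as a vocabulary size or hidden size up to the next multiple of 8 is one common way to satisfy the rule.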
### How to get started?
1. Add the following lines to your PyTorch network:
```python
import torch.cuda.profiler as profiler
from apex import pyprof
pyprof.nvtx.init()
```
Run the training/inference loop inside [PyTorch's NVTX context manager](https://pytorch.org/docs/stable/_modules/torch/autograd/profiler.html#emit_nvtx),
`with torch.autograd.profiler.emit_nvtx()`. Optionally, you can
use `profiler.start()` and `profiler.stop()` to pick an iteration
(say, after warm-up) for which you would like to capture data.
Here's an example:
```python
import torch

iters = 500
iter_to_capture = 100

# Define network, loss function, optimizer etc.

# PyTorch NVTX context manager
with torch.autograd.profiler.emit_nvtx():
    for iter in range(iters):
        # Start capturing at the chosen iteration
        if iter == iter_to_capture:
            profiler.start()

        output = net(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        # Stop capturing once the chosen iteration has finished
        if iter == iter_to_capture:
            profiler.stop()
```
2. Run `nvprof` to generate a SQL (NVVP) file. This file can be opened with NVVP, as usual.
```sh
# If you used profiler.start() and profiler.stop() in net.py
nvprof -f -o net.sql --profile-from-start off -- python net.py
# Profile everything
nvprof -f -o net.sql -- python net.py
```
**Note:** if you're experiencing issues with hardware counters and you get a message such as `ERR_NVGPUCTRPERM The user running <tool_name/application_name> does not have permission to access NVIDIA GPU Performance Counters on the target device`, please follow the steps described in [Hardware Counters](#hardware-counters).
3. Run the parser on the SQL file. The output is an ASCII file. Each line
is a Python dictionary that contains information about the kernel name,
duration, parameters etc. This file can be used as input to other custom
scripts as well.
```sh
python -m apex.pyprof.parse net.sql > net.dict
```
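A custom script can load the parser output one line at a time. The sketch below assumes each line contains only Python literals, so that `ast.literal_eval` can read it safely; `parse_kernel_lines` and `field_names` are hypothetical helpers, and no particular dictionary keys are assumed.

```python
import ast


def parse_kernel_lines(lines):
    """Turn an iterable of text lines, one dict literal per line, into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if line:                                  # skip blank lines
            records.append(ast.literal_eval(line))
    return records


def field_names(records):
    """Sorted union of the keys that appear in any record."""
    return sorted({key for record in records for key in record})
```

Passing an open file handle for `net.dict` to `parse_kernel_lines` returns one dictionary per GPU kernel, and `field_names` then lists which fields are available for further processing.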
4. Run the profiler. The input is the file of Python dictionaries created above. The tool can produce a CSV output, a columnated output (similar to `column -t` for terminal readability) and a space separated output (for post processing by AWK for instance). The tool produces 20 columns of information for every GPU kernel but you can select a subset of columns using the `-c` flag. Note that a few columns might have the value "na", implying that either it is a work in progress or the tool was unable to extract that information. Assuming the directory is `prof`, here are a few examples of how to use `prof.py`.
```sh
# Print usage and help. Lists all available output columns.
python -m apex.pyprof.prof -h
# Columnated output of width 150 with some default columns.
python -m apex.pyprof.prof -w 150 net.dict
# CSV output.
python -m apex.pyprof.prof --csv net.dict
# Space separated output.
python -m apex.pyprof.prof net.dict
# Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict
# CSV output with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict
# Space separated output with columns index,direction,kernel name,parameters,silicon time.
python -m apex.pyprof.prof -c idx,dir,kernel,params,sil net.dict
# Input redirection.
python -m apex.pyprof.prof < net.dict
```
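The CSV output lends itself to simple post-processing, for example ranking kernels by total time. The sketch below assumes the CSV starts with a header row; the column names are passed in by the caller because the exact header text is not stated here, and `total_time_by_kernel` is a hypothetical helper. Rows whose time is not numeric (such as "na") are skipped.

```python
import csv
from collections import defaultdict


def total_time_by_kernel(lines, name_column, time_column):
    """Sum time_column per kernel name and return (name, total) pairs, largest first."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        try:
            value = float(row[time_column])
        except ValueError:
            continue                      # e.g. "na": no usable measurement
        totals[row[name_column]] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

`lines` can be an open file handle for a CSV produced with `--csv`, and the two column names should be copied from that file's header row.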
5. Profile-guided optimization
If kernels that do matrix multiplication/GEMM or convolution use half precision (fp16) data but do not use Tensor Cores (the TC column in the profile analysis output doesn't show a "1"), one can follow some basic steps to increase the likelihood that a Tensor Core-compatible kernel will be chosen. For example, for GEMMs, M, N and K should be divisible by 8, and for convoluti