算法部署-使用TensorRT部署MobileViT算法-优质算法部署项目实战.zip (Algorithm deployment: deploying MobileViT with TensorRT, a hands-on deployment project)

Uploaded 2024-03-05 15:26:46, 64.25 MB, ZIP


Archive contents: 算法部署_使用TensorRT部署MobileViT算法_优质算法部署项目实战.zip (51 files)

算法部署_使用TensorRT部署MobileViT算法_优质算法部署项目实战

convert_to_trt.py 4KB

benchmark

__init__.py 0B

test_torch_precision.py 1KB

imagenet_lmdb_datasets.py 2KB

data

batch8.npy 6.03MB

calibration.npy 75MB

batch1.npy 772KB

batch4.npy 3.02MB

gen_test_data.py 2KB

doc

MobileVit_TensorRT优化.pptx 1.03MB

plugin

attentionPlugin.cu 8KB

layerNormPlugin.h 10KB

test_attention_plugin.py 4KB

Makefile 1004B

layerNormPlugin.cu 6KB

attentionPlugin.h 7KB

test_trt.py 8KB

requirements.txt 337B

onnx_add_plugin.py 3KB

convert_to_onnx.py 3KB

MobileViT_Pytorch

utils.py 5KB

__init__.py 0B

main.py 7KB

LICENSE 1KB

figure

loss.png 178KB

model_arch.png 85KB

accuracy.png 301KB

main_ddp.py 8KB

weights-file

model_best.pth.tar 21.64MB

qat_checkpoint.pth.tar 21.64MB

sim_pth.py 192B

models

__init__.py 85B

module.py 6KB

model_opt.py 5KB

model.py 5KB

module_opt.py 7KB

__pycache__

model.cpython-38.pyc 3KB

module.cpython-37.pyc 6KB

model_opt.cpython-38.pyc 3KB

__init__.cpython-37.pyc 210B

model.cpython-37.pyc 4KB

module_opt.cpython-38.pyc 6KB

module.cpython-38.pyc 6KB

__init__.cpython-38.pyc 235B

__pycache__

utils.cpython-38.pyc 6KB

__init__.cpython-37.pyc 181B

__init__.cpython-38.pyc 145B

README.md 3KB

README.md 11KB

test_trt_precision.py 4KB

calibrator.py 2KB

# mobileVit-TensorRT

# Original model

### Model overview

- MobileViT is a lightweight vision transformer. The model in this project is applied to the ImageNet classification task.
- For a detailed description of the network, see https://mp.weixin.qq.com/s/OoXGZ5pHLMSPZjyriWYstA . The PyTorch implementation is https://github.com/wilile26811249/MobileViT and the paper is https://arxiv.org/abs/2110.02178 .
- The overall structure of the model is shown in the figure below. The first layer of MobileViT is a standard 3×3 convolution, followed by MobileNetv2 (MV2) blocks and MobileViT blocks. The activation function is Swish.

![Image_20220526110045](https://user-images.githubusercontent.com/106289938/170406914-d78b4042-a4bb-4732-902c-5b64dd9969f0.png)

### Difficulties in optimizing the model

- Converting the dynamic network to a TRT file fails with the following error:
  ``python: /root/gpgpu/MachineLearning/myelin/src/compiler/optimizer/kqv_gemm_split.cpp:350: void myelin::ir::kqv_split_pattern_t::check_transpose(): Assertion `in_dims.size() == 3' failed.``

  ![image](https://user-images.githubusercontent.com/106289938/170433167-e32e5cbe-af6d-49ae-82cc-d177d9133252.png)
- Converting the dynamic network generates a large number of shape-related nodes.
- Converting the network to an INT8 TRT engine hinders TensorRT's automatic layer fusion.

# Optimization process (see the slides `MobileVit_TensorRT优化.pptx` in `doc` for details)

- Environment
  - Hardware: this competition used NVIDIA's official Docker image, `nvcr.io/nvidia/tensorrt:22.04-py3`. The GPU is an NVIDIA A10 and the CPU is a 4-core Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz.
  - Software: OS: Ubuntu 9.4.0-1ubuntu1~20.04.1; GPU driver: 510.39.01; CUDA: 11.6; cuDNN: 8.4.0; TensorRT: 8.2.4; Python: 3.8.10
- Docker setup
  - (1) Clone the git repo: `mkdir trt2022_src; cd trt2022_src; git clone https://github.com/chenlamei/MobileVit_TensorRT.git; cd ..`
  - (2) Pull the Docker image and run it with the source directory mounted at `/target`:
    - `docker pull nvcr.io/nvidia/tensorrt:22.04-py3`
    - `sudo docker run --gpus all -it --name tensorrt_22_06 -v ~/trt2022_src:/target nvcr.io/nvidia/tensorrt:22.04-py3`
    - `docker start -i tensorrt_22_06`
  - (3) Install the required Python packages inside the container:
    - `cd /target/MobileVit_TensorRT`
    - `mkdir target`
    - `pip config set global.index-url https://pypi.douban.com/simple`
    - `pip install -r requirements.txt`
    - `pip install onnx-graphsurgeon`
- Convert the PyTorch model to ONNX: `python convert_to_onnx.py --model_path [pytorch model path] --save_path [save path of onnx file] --batch [batchsize of the model]`
  - Options: `--opt`: optimize the PyTorch model; `--dynamic`: export a dynamic model. (Note: a static network requires the `--batch` argument; a dynamic network does not.)
  - Examples:
    - Convert a static network with batch size 4: `python convert_to_onnx.py --model_path MobileViT_Pytorch/weights-file/model_best.pth.tar --save_path ./target/MobileViT_b4.onnx --batch 4`
    - Convert the optimized static network with batch size 4: `python convert_to_onnx.py --model_path MobileViT_Pytorch/weights-file/model_best.pth.tar --save_path ./target/MobileViT_opt_b4.onnx --batch 4 --opt`
    - Convert the optimized dynamic network: `python convert_to_onnx.py --model_path MobileViT_Pytorch/weights-file/qat_checkpoint.pth.tar --save_path target/MobileViT_dynamic_opt.onnx --opt --dynamic`
- Add plugins to the ONNX model: `python onnx_add_plugin.py --input_path [path of input onnx file] --save_path [save path of onnx file]`
  - Example: `python onnx_add_plugin.py --input_path target/MobileViT_dynamic_opt.onnx --save_path target/MobileViT_dynamic_opt_plugin.onnx`
  - Notes: (1) this adds the layernorm plugin and the Attention plugin; (2) skip this step if you do not want to add plugins.
- Convert ONNX to TRT
  - (1) Build the plugin `.so` files (edit `TRT_PATH` and `CUDA_PATH` in the Makefile to match your environment): `cd plugin; make -B; cd ..`
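  - (2) Convert the ONNX model to a TensorRT engine: `python convert_to_trt.py --input_path [path of onnx file] --save_path [path of trt file] --batch [batchsize of the model]`
    - Options: `--dynamic`: the ONNX model is dynamic; `--fp16`: use FP16 precision; `--int8`: use INT8 precision
    - Example: `python convert_to_trt.py --input_path target/MobileViT_dynamic_opt_plugin.onnx --save_path target/MobileViT_dynamic_opt_plugin_int8.trt --dynamic --fp16 --int8`
- Optimization record
  - (1) Simplify MultiHeadSelfAttention: the source in MobileViT_Pytorch/models/model.py was modified to remove the two `rearrange` statements in the `forward` of MultiHeadSelfAttention, which gives the optimized network graph. This also resolves the TRT conversion error for the dynamic network.
  - (2) Convert 4D MM to 2D MM: following https://github.com/NVIDIA/trt-samples-for-hackathon-cn/tree/master/cookbook/10-BestPractice/Convert3DMMTo2DMM , experiments showed that converting 4D matrix multiplication to 2D also gives a speedup. The `forward` code of FFN or Transformer in MobileViT_Pytorch/models/model.py is modified to add one reshape node before and one after the FFN, so that the 4D tensor is reshaped to 2D and converted back to 4D once the computation is done.
  - (3) CUDA graph optimization: profiling inference of the original network with Nsight Systems showed that TensorRT merges dozens of nodes into a single ForeignNode and optimizes them, for example turning the multiple layernorm nodes into the CUDA kernel `__myl__bb0_15_NegExpAddDivMulResTra*`. Nsight Systems also showed that GPU kernel submission takes a long time, so CUDA graph is used to reduce CPU usage and kernel submission time.
  - (4) INT8 optimization: adding the `--fp16 --int8` flags when building the TRT engine produces an INT8 engine. Speed tests showed that INT8 performed poorly. Running test_trt.py with the `--ProfilerLayer` flag to print per-layer information showed that layernorm had been split into ReduceMean + Sub ...... and other layers. The poor INT8 speed is therefore mainly caused by INT8 breaking TensorRT's layer fusion (layernorm being split, and so on); plugins were added later to fuse layernorm and other layers manually. On accuracy, test_trt_precision.py showed that PTQ loses some accuracy, with top-1 accuracy dropping by about 2 percentage points. Replacing `IInt8MinMaxCalibrator` with `IInt8EntropyCalibrator2` recovered part of it. A simplified form of QAT (only the weights are quantized during training, the feature maps are not) then improved quantized accuracy further: after quantization, top-1 accuracy drops by only 0.3 percentage points and top-5 by 0.25 percentage points.
  - (5) Add attention + layernorm plugins: the ONNX model is modified to fuse the layernorm-related nodes into a single node (see `addLayerNormPlugin` in onnx_add_plugin.py) and to fuse the MultiHeadSelfAttention-related nodes into a single node (see `addAttentionPlugin` in onnx_add_plugin.py). The corresponding C++ code for the TensorRT plugins is in the plugin folder.

# Accuracy and speedup

- Accuracy: test_trt.py reports the difference between the outputs of the TensorRT model and the PyTorch model at different batch sizes (mean, median, and maximum of the relative and absolute differences). In practice, because the task is classification, most output values are small, which makes the relative error large, so the absolute error is the main criterion when judging accuracy. The PyTorch model is the trained model provided with the reference source code; the model file is MobileViT_Pytorch/weights-file/model_best.pth.tar. For generating test data see benchmark/gen_test_data.py, and for measuring the accuracy of the baseline PyTorch model see benchmark/test_torch_precision.py. test_trt_precision.py measures the accuracy of the TensorRT model on the ImageNet dataset, which is more convincing than test_trt.py, but it requires downloading the test dataset in LMDB format; see https://github.com/xunge/pytorch_lmdb_imagenet for converting the raw data to LMDB. test_trt_precision.py proved more useful for testing INT8 engines, because a classification task does not need the output values to be exact, only their relative order and the resulting class to be correct. In particular, models produced by QAT can only be evaluated with test_trt_precision.py.
- Performance: test_trt.py reports the latency and throughput of the model at different batch sizes. In the performance test the model is warmed up 20 times and then run in a loop 50 times to reduce measurement error.
- Test code
  - (1) Basic test: `python test_trt.py --trt_path [path of trt file] --batch [batchsize of the model]`; `--dynamic`: the TRT model is dynamic; `--ProfilerLayer`: print every laye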
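
To make the layernorm observation in the INT8 notes more concrete, here is a minimal pure-Python sketch of what "split into ReduceMean + Sub ..." means. It is not taken from the project: the function names are hypothetical, and the exact operator chain (ReduceMean, Sub, Pow, ReduceMean, Add, Sqrt, Div, Mul, Add) is an assumption based on how layer normalization is commonly exported to ONNX. The point is only that the fused and the step-by-step forms compute the same values, so fusing them into one plugin node changes the number of kernels, not the result.

```python
import math


def layer_norm_fused(x, gamma, beta, eps=1e-5):
    """Layer normalization over one vector, computed in a single pass of logic."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def layer_norm_decomposed(x, gamma, beta, eps=1e-5):
    """Same computation written as the separate elementwise/reduce steps
    that an unfused graph would execute one after another (assumed chain)."""
    mean = sum(x) / len(x)                               # ReduceMean
    centered = [v - mean for v in x]                     # Sub
    squared = [c ** 2 for c in centered]                 # Pow
    var = sum(squared) / len(squared)                    # ReduceMean
    std = math.sqrt(var + eps)                           # Add, Sqrt
    normed = [c / std for c in centered]                 # Div
    scaled = [n * g for n, g in zip(normed, gamma)]      # Mul
    return [s + b for s, b in zip(scaled, beta)]         # Add


# Illustrative values only.
x = [1.0, 2.0, 3.0, 4.0]
gamma = [0.5, 1.0, 1.5, 2.0]
beta = [0.1, 0.1, 0.1, 0.1]
fused = layer_norm_fused(x, gamma, beta)
decomposed = layer_norm_decomposed(x, gamma, beta)
max_gap = max(abs(a - b) for a, b in zip(fused, decomposed))
print("max difference between fused and decomposed:", max_gap)
```

Each commented step in `layer_norm_decomposed` corresponds to one node that would otherwise run as its own layer, which is the overhead a single fused plugin node avoids.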
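
The accuracy notes say that relative error looks large when most outputs are small, so absolute error is the main criterion. The sketch below illustrates that effect with the standard library only. It is not the code in test_trt.py: `diff_stats`, the `eps` guard, and the sample values are all assumptions made for illustration.

```python
import statistics


def diff_stats(reference, candidate, eps=1e-12):
    """Mean, median and max of the absolute and relative differences.

    `eps` only guards against division by zero; it is an illustrative choice.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    abs_diff = [abs(r - c) for r, c in zip(reference, candidate)]
    rel_diff = [d / (abs(r) + eps) for d, r in zip(abs_diff, reference)]

    def summarize(values):
        return {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "max": max(values),
        }

    return {"abs": summarize(abs_diff), "rel": summarize(rel_diff)}


# Hypothetical per-class scores: one large value and several tiny ones.
reference_scores = [0.90, 0.0999, 0.0001]
engine_scores = [0.9001, 0.0997, 0.0002]
stats = diff_stats(reference_scores, engine_scores)
print("absolute:", stats["abs"])
print("relative:", stats["rel"])
```

In this made-up example every absolute difference is tiny, yet the smallest score doubles, so its relative difference is close to 1 even though the predicted class is unchanged.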
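
The measurement procedure described under Performance (20 warm-up runs, then 50 timed runs) can be sketched as below. This is an assumption-laden illustration rather than the implementation in test_trt.py: `benchmark` and `run_once` are hypothetical names, and `run_once` is assumed to block until the inference result is ready (with an asynchronous GPU call you would need to synchronize before reading the clock).

```python
import time


def benchmark(run_once, batch_size, warmup=20, iters=50):
    """Return (latency in ms per batch, throughput in samples per second)."""
    # Warm-up runs are not timed; they let caches and clocks settle.
    for _ in range(warmup):
        run_once()

    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start

    latency_ms = elapsed / iters * 1000.0
    throughput = batch_size * iters / elapsed
    return latency_ms, throughput


def dummy_inference():
    """Stand-in workload so the sketch runs without a GPU or an engine."""
    return sum(i * i for i in range(10_000))


latency_ms, throughput = benchmark(dummy_inference, batch_size=4)
print(f"latency: {latency_ms:.3f} ms/batch, throughput: {throughput:.1f} samples/s")
```

Averaging over many timed iterations after a warm-up is what smooths out one-off costs such as lazy initialization, which is the motivation the README gives for the 20/50 split.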
