![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# CUTLASS 2.7
_CUTLASS 2.7 - September 2021_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) and related computations at all levels
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes
these "moving parts" into reusable, modular software components abstracted by C++ template
classes. These thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
and tuned via custom tiling sizes, data types, and other algorithmic policies. The
resulting flexibility simplifies their use as building blocks within custom kernels
and applications.
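As a concrete illustration, the device-wide GEMM interface composes these primitives behind a single template. The sketch below is modeled on the pattern in CUTLASS's basic GEMM example; the function name `run_sgemm` is chosen here for illustration, and all tile sizes and epilogue policies are left at their library defaults.
```cpp
#include "cutlass/gemm/device/gemm.h"

// Minimal single-precision GEMM: C = alpha * A * B + beta * C.
// Element types and layouts are template parameters; threadblock, warp,
// and instruction tiling all take the library defaults for SIMT SGEMM.
cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::ColumnMajor,   // A
      float, cutlass::layout::ColumnMajor,   // B
      float, cutlass::layout::ColumnMajor>;  // C

  Gemm gemm_op;
  return gemm_op({{M, N, K},       // GEMM problem size
                  {A, lda},        // tensor ref for A
                  {B, ldb},        // tensor ref for B
                  {C, ldc},        // source C
                  {C, ldc},        // destination D (in-place update)
                  {alpha, beta}}); // linear scaling epilogue
}
```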
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, with specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32), double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
CUTLASS demonstrates warp-synchronous matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, and Ampere architectures.
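As a hedged sketch of what this mixed-precision, Tensor Core support looks like in code (it mirrors the pattern used across the CUTLASS examples rather than quoting any single one), the type alias below instantiates a GEMM that multiplies FP16 operands while accumulating in FP32 on NVIDIA Ampere:
```cpp
#include "cutlass/gemm/device/gemm.h"

// FP16 inputs, FP32 accumulation and output, Tensor Core math on SM80.
// Threadblock, warp, and instruction shapes are left at library defaults.
using GemmFp16TensorOp = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // element/layout of A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // element/layout of B
    float,           cutlass::layout::RowMajor,     // element/layout of C/D
    float,                                          // accumulator type
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
    cutlass::arch::Sm80>;                           // target Ampere
```
Swapping the element types and the operator-class/architecture tags is how the same template family covers the BF16, TF32, FP64, integer, and binary paths listed above.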
CUTLASS implements high-performance convolution via the implicit GEMM algorithm.
Implicit GEMM is the formulation of a convolution operation as a GEMM, thereby taking advantage of
CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly optimized GEMM components at the warp level and below.
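Concretely, a forward-propagation convolution of an NxHxWxC input with K filters of extent RxS, producing an NxPxQxK output, corresponds to a GEMM whose extents are products of the convolution extents. The helper below is hypothetical, written only to illustrate the mapping described in the implicit GEMM documentation linked later in this README:
```cpp
#include "cutlass/gemm_coord.h"

// Hypothetical helper: the implied GEMM shape of an implicit GEMM fprop.
//   GEMM_M = N * P * Q   -- one row per output activation
//   GEMM_N = K           -- one column per filter
//   GEMM_K = C * R * S   -- reduction over channels and filter extent
cutlass::gemm::GemmCoord implied_fprop_gemm_shape(int N, int P, int Q,
                                                  int K, int C, int R, int S) {
  return cutlass::gemm::GemmCoord(N * P * Q, K, C * R * S);
}
```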
See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
See the [functionality listing](/media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.
See the [CHANGELOG](CHANGELOG.md) for descriptions of recent updates.
# What's New in CUTLASS 2.7
CUTLASS 2.7 is a minor update to CUTLASS adding:
- Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
- [Optimizations for strided DGRAD](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
- [Half-precision GELU_taylor activation functions](/include/cutlass/epilogue/thread/activation.h#L196)
- Tuning and bug fixes to [fused GEMM + GEMM example](/examples/13_two_tensor_op_fusion/)
- Support for convolutions with alignment smaller than 128 bits: [see examples](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
- Caching of results to accelerate Convolution [unit tests](test/unit/conv/device/cache_testbed_output.h)
- Numerous updates from the community (thanks!)
# What's New in CUTLASS 2.6
CUTLASS 2.6 is a minor update to CUTLASS adding:
- Fused [broadcast](test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu) and [reductions](/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu) in the epilogues of GEMM and Convolution
- [Quaternion-valued GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_conv/quaternion_conv.cu) in single-precision
- [New strided Dgrad](test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation offers up to 4x performance improvements over previous strided Dgrad
- 64-bit strides for large tensor allocations
- [General affine layouts](/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) in FP64 tensor core and SIMT GEMM
- [Batched GEMV](/test/unit/gemm/device/gemv.cu) preview implementation
- Enhanced functionality, boosted performance, and bug fixes in the epilogue.
- Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
  - Adopts the new L2 prefetch feature introduced in [PTX ISA 7.4](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-4)
- Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels on Pascal and Ampere architectures
- Numerous updates from the community (thanks!)
# What's New in CUTLASS 2.5
CUTLASS 2.5 is a minor update to CUTLASS adding:
- [Tensor reductions](/test/unit/reduction/device/tensor_reduce_contiguous.cu)
- [Optimizations for 3-D convolution](include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h)
- [Fused Convolution+Convolution example](/examples/13_two_tensor_op_fusion/README.md)
# What's New in CUTLASS 2.4
CUTLASS 2.4 is a significant update to CUTLASS adding:
- 1-D, 2-D, and 3-D convolution targeting Tensor and CUDA cores for NVIDIA Ampere, Turing, and Volta GPU architectures
- CUTLASS profiler support for convolution
- [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
# What's New in CUTLASS 2.3
CUTLASS 2.3 is a minor update to CUTLASS adding:
- GEMMs targeting structured [Sparse Tensor Cores](test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) in NVIDIA Ampere Architecture GPUs
- Fast SGEMM kernels targeting GeForce RTX 30-series CUDA Cores
- Intended to be compiled with [CUDA 11.1 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
# What's New in CUTLASS 2.2
CUTLASS 2.2 is a significant update to CUTLASS adding:
- Coverage of [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
- Tensor Core-accelerated GEMMs targeting Tensor Float 32, BFloat16, and double-precision data types
- Deep software pipelines using asynchronous copy
- Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745)
- Intended to be compiled with [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
# What's New in CUTLASS 2.1
CUTLASS 2.1 is a minor update to CUTLASS adding:
- [Planar complex GEMM kernels](/examples/10_planar_complex/planar_complex.cu) targeting Volta and Turing Tensor Cores
- BLAS-style API to launch kernels compiled into the [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
# What's New in CUTLASS 2.0
CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
- Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
- Robust and durable templates that reliably span the design space
- Encapsulated functionality that may be reusable in other contexts
**See the [CHANGELOG](CHANGELOG.md) for more details.**
# Performance
<p align="center"><img src=/media/images/cutlass-performance-plot.png></p>
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows CUTLASS performance relative to cuBLAS
for large matrix dimensions on an NVIDIA GeForce RTX 2080 Ti, an NVIDIA A100, and an NVIDIA Titan V
using the CUDA 11.0 Toolkit. Tensor Core operations are implemented using CUDA's
[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
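CUTLASS reaches those instructions through its own warp-level abstractions; as a simpler, self-contained stand-in (deliberately not the code CUTLASS emits), the kernel below uses CUDA's WMMA C++ API to show one warp computing a 16x16x16 half-precision product on Tensor Cores:
```cpp
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a 16x16x16 tile on Tensor Cores.
// WMMA is an illustrative stand-in for the mma.sync PTX CUTLASS generates.
__global__ void wmma_16x16x16(half const *A, half const *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

  wmma::fill_fragment(acc, 0.0f);            // zero the accumulator tile
  wmma::load_matrix_sync(a_frag, A, 16);     // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(acc, a_frag, b_frag, acc);  // Tensor Core multiply-accumulate
  wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```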
# Compatibility
CUTLASS requires a C++11 host compiler and
performs best when built with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit).
It is also compatible with CUDA 10.2, CUDA 11.0, CUDA 11.1, CUDA 11.2, and CUDA 11.3.
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Windows 10 | Microsoft Visual Studio 2015|
| | Microsoft Visual Studio 2017|
| Ubuntu 16.04 | GCC 5.4.0 |
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.2.0 |
Additionally, CUTLASS may be built with clang.
See [these instructions](media/docs/quickstart.md#clang) for more details.