![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# CUTLASS 2.7
_CUTLASS 2.7 - September 2021_
CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) and related computations at all levels
and scales within CUDA. It incorporates strategies for hierarchical decomposition and
data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes
these "moving parts" into reusable, modular software components abstracted by C++ template
classes. These thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
and tuned via custom tiling sizes, data types, and other algorithmic policies. The
resulting flexibility simplifies their use as building blocks within custom kernels
and applications.
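As a concrete illustration, the device-wide GEMM interface composes these primitives behind a single template. The sketch below is modeled on the pattern in CUTLASS's basic GEMM example; the function name `run_sgemm` is chosen here for illustration, and all tile sizes and epilogue policies are left at their library defaults.
```cpp
#include "cutlass/gemm/device/gemm.h"

// Minimal single-precision GEMM: C = alpha * A * B + beta * C.
// Element types and layouts are template parameters; threadblock, warp,
// and instruction tiling all take the library defaults for SIMT SGEMM.
cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::ColumnMajor,   // A
      float, cutlass::layout::ColumnMajor,   // B
      float, cutlass::layout::ColumnMajor>;  // C

  Gemm gemm_op;
  return gemm_op({{M, N, K},       // GEMM problem size
                  {A, lda},        // tensor ref for A
                  {B, ldb},        // tensor ref for B
                  {C, ldc},        // source C
                  {C, ldc},        // destination D (in-place update)
                  {alpha, beta}}); // linear scaling epilogue
}
```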
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, with specialized data-movement and
multiply-accumulate abstractions for half-precision floating
point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
single-precision floating point (FP32), double-precision floating
point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
CUTLASS demonstrates warp-synchronous matrix multiply operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Volta, Turing, and Ampere architectures.
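As a hedged sketch of what this mixed-precision, Tensor Core support looks like in code (it mirrors the pattern used across the CUTLASS examples rather than quoting any single one), the type alias below instantiates a GEMM that multiplies FP16 operands while accumulating in FP32 on NVIDIA Ampere:
```cpp
#include "cutlass/gemm/device/gemm.h"

// FP16 inputs, FP32 accumulation and output, Tensor Core math on SM80.
// Threadblock, warp, and instruction shapes are left at library defaults.
using GemmFp16TensorOp = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // element/layout of A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // element/layout of B
    float,           cutlass::layout::RowMajor,     // element/layout of C/D
    float,                                          // accumulator type
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
    cutlass::arch::Sm80>;                           // target Ampere
```
Swapping the element types and the operator-class/architecture tags is how the same template family covers the BF16, TF32, FP64, integer, and binary paths listed above.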
CUTLASS implements high-performance convolution via the implicit GEMM algorithm.
Implicit GEMM is the formulation of a convolution operation as a GEMM, thereby taking advantage of
CUTLASS's modular GEMM pipeline.
This allows CUTLASS to build convolutions by reusing highly optimized GEMM components at the warp level and below.
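Concretely, a forward-propagation convolution of an NxHxWxC input with K filters of extent RxS, producing an NxPxQxK output, corresponds to a GEMM whose extents are products of the convolution extents. The helper below is hypothetical, written only to illustrate the mapping described in the implicit GEMM documentation linked later in this README:
```cpp
#include "cutlass/gemm_coord.h"

// Hypothetical helper: the implied GEMM shape of an implicit GEMM fprop.
//   GEMM_M = N * P * Q   -- one row per output activation
//   GEMM_N = K           -- one column per filter
//   GEMM_K = C * R * S   -- reduction over channels and filter extent
cutlass::gemm::GemmCoord implied_fprop_gemm_shape(int N, int P, int Q,
                                                  int K, int C, int R, int S) {
  return cutlass::gemm::GemmCoord(N * P * Q, K, C * R * S);
}
```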
See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
See the [functionality listing](/media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.
See the [CHANGELOG](CHANGELOG.md) for descriptions of recent updates.
# What's New in CUTLASS 2.7
CUTLASS 2.7 is a minor update to CUTLASS adding:
- Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
- [Optimizations for strided DGRAD](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
- [Half-precision GELU_taylor activation functions](/include/cutlass/epilogue/thread/activation.h#L196)
- Tuning and bug fixes to [fused GEMM + GEMM example](/examples/13_two_tensor_op_fusion/)
- Support for convolutions with alignment smaller than 128 bits: [see examples](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
- Caching of results to accelerate Convolution [unit tests](test/unit/conv/device/cache_testbed_output.h)
- Numerous updates from the community (thanks!)
# What's New in CUTLASS 2.6
CUTLASS 2.6 is a minor update to CUTLASS adding:
- Fused [broadcast](test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu) and [reductions](/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu) in the epilogues of GEMM and Convolution
- [Quaternion-valued GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_conv/quaternion_conv.cu) in single-precision
- [New strided Dgrad](test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation offers up to 4x performance improvements over previous strided Dgrad
- 64-bit strides for large tensor allocations
- [General affine layouts](/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) in FP64 tensor core and SIMT GEMM
- [Batched GEMV](/test/unit/gemm/device/gemv.cu) preview implementation
- Enhanced functionality, boosted performance, and bug fixes in the epilogue.
- Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
  - Adopts the new L2 prefetch feature introduced in [PTX ISA 7.4](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-4)
- Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels on Pascal and Ampere architectures
- Numerous updates from the community (thanks!)
# What's New in CUTLASS 2.5
CUTLASS 2.5 is a minor update to CUTLASS adding:
- [Tensor reductions](/test/unit/reduction/device/tensor_reduce_contiguous.cu)
- [Optimizations for 3-D convolution](include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h)
- [Fused Convolution+Convolution example](/examples/13_two_tensor_op_fusion/README.md)
# What's New in CUTLASS 2.4
CUTLASS 2.4 is a significant update to CUTLASS adding:
- 1-D, 2-D, and 3-D convolution targeting Tensor and CUDA cores for NVIDIA Ampere, Turing, and Volta GPU architectures
- CUTLASS profiler support for convolution
- [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
# What's New in CUTLASS 2.3
CUTLASS 2.3 is a minor update to CUTLASS adding:
- GEMMs targeting structured [Sparse Tensor Cores](test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) in NVIDIA Ampere Architecture GPUs
- Fast SGEMM kernels targeting GeForce RTX 30-series CUDA Cores
- Intended to be compiled with [CUDA 11.1 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
# What's New in CUTLASS 2.2
CUTLASS 2.2 is a significant update to CUTLASS adding:
- Coverage of [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
- Tensor Core-accelerated GEMMs targeting Tensor Float 32, BFloat16, and double-precision data types
- Deep software pipelines using asynchronous copy
- Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745)
- Intended to be compiled with [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
# What's New in CUTLASS 2.1
CUTLASS 2.1 is a minor update to CUTLASS adding:
- [Planar complex GEMM kernels](/examples/10_planar_complex/planar_complex.cu) targeting Volta and Turing Tensor Cores
- BLAS-style API to launch kernels compiled into the [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
# What's New in CUTLASS 2.0
CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
- Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
- Robust and durable templates that reliably span the design space
- Encapsulated functionality that may be reusable in other contexts
**See the [CHANGELOG](CHANGELOG.md) for more details.**
# Performance
<p align="center"><img src=/media/images/cutlass-performance-plot.png></p>
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows CUTLASS performance relative to cuBLAS
for large matrix dimensions on an NVIDIA GeForce RTX 2080 Ti, an NVIDIA A100, and an NVIDIA Titan V
using the CUDA 11.0 Toolkit. Tensor Core operations are implemented using CUDA's
[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
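CUTLASS reaches those instructions through its own warp-level abstractions; as a simpler, self-contained stand-in (deliberately not the code CUTLASS emits), the kernel below uses CUDA's WMMA C++ API to show one warp computing a 16x16x16 half-precision product on Tensor Cores:
```cpp
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a 16x16x16 tile on Tensor Cores.
// WMMA is an illustrative stand-in for the mma.sync PTX CUTLASS generates.
__global__ void wmma_16x16x16(half const *A, half const *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

  wmma::fill_fragment(acc, 0.0f);            // zero the accumulator tile
  wmma::load_matrix_sync(a_frag, A, 16);     // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(acc, a_frag, b_frag, acc);  // Tensor Core multiply-accumulate
  wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```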
# Compatibility
CUTLASS requires a C++11 host compiler and
performs best when built with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit).
It is also compatible with CUDA 10.2, CUDA 11.0, CUDA 11.1, CUDA 11.2, and CUDA 11.3.
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Windows 10 | Microsoft Visual Studio 2015|
| | Microsoft Visual Studio 2017|
| Ubuntu 16.04 | GCC 5.4.0 |
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.2.0 |
Additionally, CUTLASS may be built with clang.
See [these instructions](media/docs/quickstart.md#clang) for more details.