V1.0
NVIDIA A100 Tensor Core GPU
Architecture
UNPRECEDENTED ACCELERATION AT EVERY SCALE
NVIDIA A100 Tensor Core GPU Architecture
Table of Contents
Introduction 7
Introducing NVIDIA A100 Tensor Core GPU - our 8th Generation Data Center GPU for the
Age of Elastic Computing 9
NVIDIA A100 Tensor Core GPU Overview 11
Next-generation Data Center and Cloud GPU 11
Industry-leading Performance for AI, HPC, and Data Analytics 12
A100 GPU Key Features Summary 14
A100 GPU Streaming Multiprocessor (SM) 15
40 GB HBM2 and 40 MB L2 Cache 16
Multi-Instance GPU (MIG) 16
Third-Generation NVLink 16
Support for NVIDIA Magnum IO™ and Mellanox Interconnect Solutions 17
PCIe Gen 4 with SR-IOV 17
Improved Error and Fault Detection, Isolation, and Containment 17
Asynchronous Copy 17
Asynchronous Barrier 17
Task Graph Acceleration 18
NVIDIA A100 Tensor Core GPU Architecture In-Depth 19
A100 SM Architecture 20
Third-Generation NVIDIA Tensor Core 23
A100 Tensor Cores Boost Throughput 24
A100 Tensor Cores Support All DL Data Types 26
A100 Tensor Cores Accelerate HPC 28
Mixed Precision Tensor Cores for HPC 28
A100 Introduces Fine-Grained Structured Sparsity 31
Sparse Matrix Definition 31
Sparse Matrix Multiply-Accumulate (MMA) Operations 32
Combined L1 Data Cache and Shared Memory 33
Simultaneous Execution of FP32 and INT32 Operations 34
A100 HBM2 and L2 Cache Memory Architectures 34
A100 HBM2 DRAM Subsystem 34
ECC Memory Resiliency 35
A100 L2 Cache 35
Maximizing Tensor Core Performance and Efficiency for Deep Learning Applications 38
Strong Scaling Deep Learning Performance 38
New NVIDIA Ampere Architecture Features Improved Tensor Core Performance 39
Compute Capability 44
MIG (Multi-Instance GPU) Architecture 45
Background 45
MIG Capability of NVIDIA Ampere GPU Architecture 46
Important Use Cases for MIG 46
MIG Architecture and GPU Instances in Detail 48
Compute Instances 50
Compute Instances Enable Simultaneous Context Execution 52
MIG Migration 53
Third-Generation NVLink 53
PCIe Gen 4 with SR-IOV 54
Error and Fault Detection, Isolation, and Containment 54
Additional A100 Architecture Features 55
NVJPG Decode for DL Training 55
Optical Flow Accelerator 56
Atomics Improvements 57
NVDEC for DL 57
CUDA Advances for NVIDIA Ampere Architecture GPUs 59
CUDA Task Graph Acceleration 59
CUDA Task Graph Basics 59
Task Graph Acceleration on NVIDIA Ampere Architecture GPUs 60
CUDA Asynchronous Copy Operation 62
Asynchronous Barriers 64
L2 Cache Residency Control 65
Cooperative Groups 67
Conclusion 69
Appendix A - NVIDIA DGX A100 70
NVIDIA DGX A100 - The Universal System for AI Infrastructure 70
Game-changing Performance 71
Unmatched Data Center Scalability 72
Fully Optimized DGX Software Stack 72
NVIDIA DGX A100 System Specifications 75
Appendix B - Sparse Neural Network Primer 77
Pruning and Sparsity 78
Fine-Grained and Coarse-Grained Sparsity 78
List of Figures
Figure 1. Modern cloud datacenter workloads require NVIDIA GPU acceleration ................... 8
Figure 2. New Technologies in NVIDIA A100....................................................................... 10
Figure 3. NVIDIA A100 GPU on new SXM4 Module ............................................................ 12
Figure 4. Unified AI Acceleration for BERT-LARGE Training and Inference .......................... 13
Figure 5. A100 GPU HPC application speedups compared to NVIDIA Tesla V100 ............... 14
Figure 6. GA100 Full GPU with 128 SMs (A100 Tensor Core GPU has 108 SMs) ................ 20
Figure 7. GA100 Streaming Multiprocessor (SM) ................................................................. 22
Figure 8. A100 vs V100 Tensor Core Operations................................................................. 25
Figure 9. TensorFloat-32 (TF32) ......................................................................................... 27
Figure 10. Iterations of TCAIRS Solver to Converge to FP64 Accuracy .............................. 30
Figure 11. TCAIRS solver speedup over the baseline FP64 direct solver............................ 30
Figure 12. A100 Fine-Grained Structured Sparsity ............................................................. 32
Figure 13. Example Dense MMA and Sparse MMA operations........................................... 33
Figure 14. A100 Tensor Core Throughput and Efficiency ................................................... 40
Figure 15. A100 SM Data Movement Efficiency ................................................................. 41
Figure 16. A100 L2 cache residency controls ..................................................................... 42
Figure 17. A100 Compute Data Compression .................................................................... 42
Figure 18. A100 strong-scaling innovations........................................................................ 43
Figure 19. Software-based MPS in Pascal vs Hardware-Accelerated MPS in Volta............. 45
Figure 20. CSP Multi-user node Today .............................................................................. 47
Figure 21. Example CSP MIG Configuration ...................................................................... 48
Figure 22. Example MIG compute configuration with three GPU Instances. ........................ 49
Figure 23. MIG Configuration with multiple independent GPU Compute workloads ............. 50
Figure 24. Example MIG partitioning process ..................................................................... 51
Figure 25. Example MIG config with three GPU Instances and four Compute Instances. .... 52
Figure 26. NVIDIA DGX A100 with Eight A100 GPUs......................................................... 54
Figure 27. Illustration of optical flow and stereo disparity .................................................... 56
Figure 28. Execution Breakdown for Sequential 2us Kernels. ............................................. 60
Figure 29. Impact of Task Graph acceleration on CPU launch latency ................................ 61
Figure 30. Grid-to-Grid Latency Speedup using CUDA graphs ........................................... 62
Figure 31. A100 Asynchronous Copy vs No Asynchronous Copy ....................................... 63
Figure 32. Synchronous vs Asynchronous Copy to Shared Memory ................................... 64
Figure 33. A100 Asynchronous Barriers............................................................................. 65
Figure 34. A100 L2 residency control example................................................................... 67
Figure 35. Warp-Wide Reduction ....................................................................................... 68
Figure 36. NVIDIA DGX A100 System ................................................................................ 70
Figure 37. DGX A100 Delivers unprecedented AI performance for training and inference. .. 71
Figure 38. NVIDIA DGX Software Stack ............................................................................ 73
Figure 39. Dense Neural Network ...................................................................................... 77
Figure 40. Fine-Grained Sparsity ....................................................................................... 79
Figure 41. Coarse Grained Sparsity................................................................................... 80
Figure 42. Fine Grained Structured Sparsity ...................................................................... 81