CUDAC++ProgrammingGuide: C++ CUDA installation resources (CSDN Library)

CUDA

Uploaded 2023-12-19 13:38:11, 2.53 MB PDF


CUDA C++ Best Practices Guide

Release 12.3

NVIDIA

Nov 14, 2023

Contents

1 What Is This Document?

2 Who Should Read This Guide?

3 Assess, Parallelize, Optimize, Deploy

3.1 Assess

3.2 Parallelize

3.3 Optimize

3.4 Deploy

4 Recommendations and Best Practices

5 Assessing Your Application

6 Heterogeneous Computing

6.1 Differences between Host and Device

6.2 What Runs on a CUDA-Enabled Device?

7 Application Profiling

7.1 Profile

7.1.1 Creating the Profile

7.1.2 Identifying Hotspots

7.1.3 Understanding Scaling

7.1.3.1 Strong Scaling and Amdahl’s Law

7.1.3.2 Weak Scaling and Gustafson’s Law

7.1.3.3 Applying Strong and Weak Scaling

8 Parallelizing Your Application

9 Getting Started

9.1 Parallel Libraries

9.2 Parallelizing Compilers

9.3 Coding to Expose Parallelism

10 Getting the Right Answer

10.1 Verification

10.1.1 Reference Comparison

10.1.2 Unit Testing

10.2 Debugging

10.3 Numerical Accuracy and Precision

10.3.1 Single vs. Double Precision

10.3.2 Floating Point Math Is not Associative

10.3.3 IEEE 754 Compliance

10.3.4 x86 80-bit Computations

11 Optimizing CUDA Applications

12 Performance Metrics

12.1 Timing

12.1.1 Using CPU Timers

12.1.2 Using CUDA GPU Timers

12.2 Bandwidth

12.2.1 Theoretical Bandwidth Calculation

12.2.2 Effective Bandwidth Calculation

12.2.3 Throughput Reported by Visual Profiler

13 Memory Optimizations

13.1 Data Transfer Between Host and Device

13.1.1 Pinned Memory

13.1.2 Asynchronous and Overlapping Transfers with Computation

13.1.3 Zero Copy

13.1.4 Unified Virtual Addressing

13.2 Device Memory Spaces

13.2.1 Coalesced Access to Global Memory

13.2.1.1 A Simple Access Pattern

13.2.1.2 A Sequential but Misaligned Access Pattern

13.2.1.3 Effects of Misaligned Accesses

13.2.1.4 Strided Accesses

13.2.2 L2 Cache

13.2.2.1 L2 Cache Access Window

13.2.2.2 Tuning the Access Window Hit-Ratio

13.2.3 Shared Memory

13.2.3.1 Shared Memory and Memory Banks

13.2.3.2 Shared Memory in Matrix Multiplication (C=AB)

13.2.3.3 Shared Memory in Matrix Multiplication (C=AAᵀ)

13.2.3.4 Asynchronous Copy from Global Memory to Shared Memory

13.2.4 Local Memory

13.2.5 Texture Memory

13.2.5.1 Additional Texture Capabilities

13.2.6 Constant Memory

13.2.7 Registers

13.2.7.1 Register Pressure

13.3 Allocation

13.4 NUMA Best Practices

14 Execution Configuration Optimizations

14.1 Occupancy

14.1.1 Calculating Occupancy

14.2 Hiding Register Dependencies

14.3 Thread and Block Heuristics

14.4 Effects of Shared Memory

14.5 Concurrent Kernel Execution

14.6 Multiple contexts

15 Instruction Optimization

15.1 Arithmetic Instructions

15.1.1 Division Modulo Operations

15.1.2 Loop Counters Signed vs. Unsigned

15.1.3 Reciprocal Square Root

15.1.4 Other Arithmetic Instructions

15.1.5 Exponentiation With Small Fractional Arguments

15.1.6 Math Libraries

15.1.7 Precision-related Compiler Flags

15.2 Memory Instructions

16 Control Flow

16.1 Branching and Divergence

16.2 Branch Predication

17 Deploying CUDA Applications

18 Understanding the Programming Environment

18.1 CUDA Compute Capability

18.2 Additional Hardware Data

18.3 Which Compute Capability Target

18.4 CUDA Runtime

19 CUDA Compatibility Developer’s Guide

19.1 CUDA Toolkit Versioning

19.2 Source Compatibility

19.3 Binary Compatibility

19.3.1 CUDA Binary (cubin) Compatibility

19.4 CUDA Compatibility Across Minor Releases

19.4.1 Existing CUDA Applications within Minor Versions of CUDA

19.4.1.1 Handling New CUDA Features and Driver APIs

19.4.1.2 Using PTX

19.4.1.3 Dynamic Code Generation

19.4.1.4 Recommendations for building a minor-version compatible library

19.4.1.5 Recommendations for taking advantage of minor version compatibility in your application

20 Preparing for Deployment

20.1 Testing for CUDA Availability

20.2 Error Handling

20.3 Building for Maximum Compatibility

20.4 Distributing the CUDA Runtime and Libraries

20.4.1 CUDA Toolkit Library Redistribution

20.4.1.1 Which Files to Redistribute

20.4.1.2 Where to Install Redistributed CUDA Libraries

21 Deployment Infrastructure Tools

21.1 Nvidia-SMI

21.1.1 Queryable state

21.1.2 Modifiable state

21.2 NVML

21.3 Cluster Management Tools

21.4 Compiler JIT Cache Management Tools

21.5 CUDA_VISIBLE_DEVICES

22 Recommendations and Best Practices

22.1 Overall Performance Optimization Strategies

23 nvcc Compiler Switches

23.1 nvcc


Uploader: Dream_Ross
