66 pages
This Best Practices Guide is a manual to help developers obtain the best performance from the NVIDIA® CUDA™ architecture using version 2.3 of the CUDA Toolkit. It presents established optimization techniques and explains coding metaphors and idioms that can greatly simplify programming for the CUDA architecture.
NVIDIA CUDA C Programming Best Practices Guide
Optimization
CUDA Toolkit 2.3
July 2009
CUDA Best Practices Guide
Table of Contents
Preface
  What Is This Document?
  Who Should Read This Guide?
  Recommendations and Best Practices
  Contents Summary
Chapter 1. Introduction to Parallel Computing with CUDA
  1.1 Heterogeneous Computing with CUDA
    1.1.1 Differences Between Host and Device
    1.1.2 What Runs on a CUDA-Enabled Device?
    1.1.3 Maximum Performance Benefit
  1.2 Understanding the Programming Environment
    1.2.1 CUDA Compute Capability
    1.2.2 Additional Hardware Data
    1.2.3 C Runtime for CUDA and Driver API Version
    1.2.4 Which Version to Target
  1.3 CUDA APIs
    1.3.1 C Runtime for CUDA
    1.3.2 CUDA Driver API
    1.3.3 When to Use Which API
    1.3.4 Comparing Code for Different APIs
Chapter 2. Performance Metrics
  2.1 Timing
    2.1.1 Using CPU Timers
    2.1.2 Using CUDA GPU Timers
  2.2 Bandwidth
    2.2.1 Theoretical Bandwidth Calculation
    2.2.2 Effective Bandwidth Calculation
    2.2.3 Throughput Reported by cudaprof
Chapter 3. Memory Optimizations
  3.1 Data Transfer Between Host and Device
    3.1.1 Pinned Memory
    3.1.2 Asynchronous Transfers and Overlapping Transfers with Computation
    3.1.3 Zero Copy
  3.2 Device Memory Spaces
    3.2.1 Coalesced Access to Global Memory
      3.2.1.1 A Simple Access Pattern
      3.2.1.2 A Sequential but Misaligned Access Pattern
      3.2.1.3 Effects of Misaligned Accesses
      3.2.1.4 Strided Accesses
    3.2.2 Shared Memory
      3.2.2.1 Shared Memory and Memory Banks
      3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)
      3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)
      3.2.2.4 Shared Memory Use by Kernel Arguments
    3.2.3 Local Memory
    3.2.4 Texture Memory
      3.2.4.1 Textured Fetch vs. Global Memory Read
      3.2.4.2 Additional Texture Capabilities
    3.2.5 Constant Memory
    3.2.6 Registers
      3.2.6.1 Register Pressure
Chapter 4. Execution Configuration Optimizations
  4.1 Occupancy
  4.2 Calculating Occupancy
  4.3 Hiding Register Dependencies
  4.4 Thread and Block Heuristics
  4.5 Effects of Shared Memory
Chapter 5. Instruction Optimizations
  5.1 Arithmetic Instructions
    5.1.1 Division and Modulo Operations
    5.1.2 Reciprocal Square Root
    5.1.3 Other Arithmetic Instructions
    5.1.4 Math Libraries
  5.2 Memory Instructions
Chapter 6. Control Flow
  6.1 Branching and Divergence
  6.2 Branch Predication
Chapter 7. Getting the Right Answer
  7.1 Checking Defective Code
  7.2 Debugging
  7.3 Numerical Accuracy and Precision
    7.3.1 Single vs. Double Precision
    7.3.2 Floating-Point Math Is Not Associative
    7.3.3 Promotions to Doubles and Truncations to Floats
    7.3.4 IEEE 754 Compliance
    7.3.5 x86 80-bit Computations
Appendix A. Recommendations and Best Practices
  A.1 Overall Performance Optimization Strategies
  A.2 High-Priority Recommendations
  A.3 Medium-Priority Recommendations
  A.4 Low-Priority Recommendations
Appendix B. Useful NVCC Compiler Switches
  NVCC
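Resource comments
- zhouxr2000, 2016-05-21: NVIDIA also offers this for free.
- huanle19891345, 2013-06-29: Good material. The English edition is clear and authoritative; well worth studying carefully.
- alpha.5, 2013-02-28: Quite good, and the text is legible.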