NVIDIA CUDA C Programming Best Practices Guide Version 2.3 - CSDN Library

Rating: 4 stars · Uploaded 2010-06-06 10:30:33 · 1.03 MB PDF


July 2009

Optimization

NVIDIA CUDA C Programming

Best Practices Guide

CUDA Toolkit 2.3


Table of Contents

Preface
What Is This Document?
Who Should Read This Guide?
Recommendations and Best Practices
Contents Summary

Chapter 1. Introduction to Parallel Computing with CUDA
1.1 Heterogeneous Computing with CUDA
1.1.1 Differences Between Host and Device
1.1.2 What Runs on a CUDA-Enabled Device?
1.1.3 Maximum Performance Benefit
1.2 Understanding the Programming Environment
1.2.1 CUDA Compute Capability
1.2.2 Additional Hardware Data
1.2.3 C Runtime for CUDA and Driver API Version
1.2.4 Which Version to Target
1.3 CUDA APIs
1.3.1 C Runtime for CUDA
1.3.2 CUDA Driver API
1.3.3 When to Use Which API
1.3.4 Comparing Code for Different APIs

Chapter 2. Performance Metrics
2.1 Timing
2.1.1 Using CPU Timers
2.1.2 Using CUDA GPU Timers
2.2 Bandwidth
2.2.1 Theoretical Bandwidth Calculation
2.2.2 Effective Bandwidth Calculation
2.2.3 Throughput Reported by cudaprof

Chapter 3. Memory Optimizations
3.1 Data Transfer Between Host and Device
3.1.1 Pinned Memory
3.1.2 Asynchronous Transfers and Overlapping Transfers with Computation


3.1.3 Zero Copy
3.2 Device Memory Spaces
3.2.1 Coalesced Access to Global Memory
3.2.1.1 A Simple Access Pattern
3.2.1.2 A Sequential but Misaligned Access Pattern
3.2.1.3 Effects of Misaligned Accesses
3.2.1.4 Strided Accesses
3.2.2 Shared Memory
3.2.2.1 Shared Memory and Memory Banks
3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)
3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)
3.2.2.4 Shared Memory Use by Kernel Arguments
3.2.3 Local Memory
3.2.4 Texture Memory
3.2.4.1 Textured Fetch vs. Global Memory Read
3.2.4.2 Additional Texture Capabilities
3.2.5 Constant Memory
3.2.6 Registers
3.2.6.1 Register Pressure

Chapter 4. Execution Configuration Optimizations
4.1 Occupancy
4.2 Calculating Occupancy
4.3 Hiding Register Dependencies
4.4 Thread and Block Heuristics
4.5 Effects of Shared Memory

Chapter 5. Instruction Optimizations
5.1 Arithmetic Instructions
5.1.1 Division and Modulo Operations
5.1.2 Reciprocal Square Root
5.1.3 Other Arithmetic Instructions
5.1.4 Math Libraries
5.2 Memory Instructions

Chapter 6. Control Flow
6.1 Branching and Divergence
6.2 Branch Predication


zhouxr2000

2016-05-21

NVIDIA also offers this for free.
huanle19891345

2013-06-29

Good material. The English version is clear and authoritative; worth studying carefully.
alpha.5

2013-02-28

Quite good, and the text is clear and legible too.

lulyon

