CUDA_C_Programming_Guide v10.0.pdf
Uploaded 2019-03-29 16:05:36 | 3.62 MB PDF | 311 pages
CUDA_C_Programming_Guide V10.0: the latest version of the CUDA C Programming Guide
CUDA C PROGRAMMING GUIDE
PG-02829-001_v10.0 | October 2018
Design Guide
www.nvidia.com
CHANGES FROM VERSION 9.0
‣ Documented restriction that operator-overloads cannot be __global__ functions in Operator Function.
‣ Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions.
‣ Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates.
‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.
‣ Warp matrix functions now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.
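As an illustration of the 8-byte shuffle item in the list above, the following is a minimal sketch, not taken from the guide itself. It assumes CUDA 9.0 or later and a device of compute capability 3.0 or higher; the names warpReduceSum and sumKernel are hypothetical. A double is passed directly to __shfl_down_sync() rather than being split into two 4-byte halves.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sums one double per lane across a full warp.
// The double overload of __shfl_down_sync moves all 8 bytes in one call.
__inline__ __device__ double warpReduceSum(double val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 ends up holding the sum over all 32 lanes
}

// Hypothetical kernel: each lane contributes its own lane index.
__global__ void sumKernel(double *out)
{
    double v = static_cast<double>(threadIdx.x);  // lanes hold 0..31
    v = warpReduceSum(v);
    if (threadIdx.x == 0)
        *out = v;
}

int main()
{
    double *dOut = nullptr;
    double hOut = 0.0;

    if (cudaMalloc(&dOut, sizeof(double)) != cudaSuccess)
        return 1;

    sumKernel<<<1, 32>>>(dOut);  // a single warp of 32 threads

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(&hOut, dOut, sizeof(double),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return 1;

    cudaFree(dOut);
    printf("%.1f\n", hOut);  // 0 + 1 + ... + 31 = 496
    return 0;
}
```

The sketch can be built with something like nvcc -o shfl_sum shfl_sum.cu (the file name is an assumption). The full-warp mask 0xffffffff is valid here only because the block is launched with exactly 32 active threads.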
TABLE OF CONTENTS
Chapter 1. Introduction
1.1. From Graphics Processing to General Purpose Parallel Computing
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C/C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA C Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Shared Memory
3.2.4. Page-Locked Host Memory
3.2.4.1. Portable Memory
3.2.4.2. Write-Combining Memory
3.2.4.3. Mapped Memory
3.2.5. Asynchronous Concurrent Execution
3.2.5.1. Concurrent Execution between Host and Device
3.2.5.2. Concurrent Kernel Execution
3.2.5.3. Overlap of Data Transfer and Kernel Execution
3.2.5.4. Concurrent Data Transfers
3.2.5.5. Streams
3.2.5.6. Graphs
3.2.5.7. Events
3.2.5.8. Synchronous Calls
3.2.6. Multi-Device System
3.2.6.1. Device Enumeration
3.2.6.2. Device Selection
3.2.6.3. Stream and Event Behavior
3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
3.2.9. Error Checking
3.2.10. Call Stack
3.2.11. Texture and Surface Memory
3.2.11.1. Texture Memory
3.2.11.2. Surface Memory
3.2.11.3. CUDA Arrays
3.2.11.4. Read/Write Coherency
3.2.12. Graphics Interoperability
3.2.12.1. OpenGL Interoperability
3.2.12.2. Direct3D Interoperability
3.2.12.3. SLI Interoperability
3.3. Versioning and Compatibility
3.4. Compute Modes
3.5. Mode Switches
3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
B.1. Function Execution Space Specifiers
B.1.1. __device__
B.1.2. __global__
B.1.3. __host__
B.1.4. __noinline__ and __forceinline__
B.2. Variable Memory Space Specifiers
B.2.1. __device__
B.2.2. __constant__
B.2.3. __shared__
B.2.4. __managed__
B.2.5. __restrict__
B.3. Built-in Vector Types
B.3.1. char, short, int, long, longlong, float, double
B.3.2. dim3
B.4. Built-in Variables
B.4.1. gridDim
B.4.2. blockIdx
B.4.3. blockDim
B.4.4. threadIdx
B.4.5. warpSize
B.5. Memory Fence Functions
B.6. Synchronization Functions
B.7. Mathematical Functions
B.8. Texture Functions
B.8.1. Texture Object API
B.8.1.1. tex1Dfetch()
B.8.1.2. tex1D()
B.8.1.3. tex1DLod()
B.8.1.4. tex1DGrad()
B.8.1.5. tex2D()
B.8.1.6. tex2DLod()
B.8.1.7. tex2DGrad()
B.8.1.8. tex3D()
B.8.1.9. tex3DLod()
B.8.1.10. tex3DGrad()
B.8.1.11. tex1DLayered()
B.8.1.12. tex1DLayeredLod()
B.8.1.13. tex1DLayeredGrad()
B.8.1.14. tex2DLayered()
B.8.1.15. tex2DLayeredLod()
B.8.1.16. tex2DLayeredGrad()
B.8.1.17. texCubemap()
B.8.1.18. texCubemapLod()
B.8.1.19. texCubemapLayered()
B.8.1.20. texCubemapLayeredLod()
B.8.1.21. tex2Dgather()