CUDA C PROGRAMMING GUIDE
PG-02829-001_v7.5 | September 2015
Design Guide
www.nvidia.com
CHANGES FROM VERSION 7.0
‣ Updated C/C++ Language Support:
  ‣ Added new section C++11 Language Features,
  ‣ Clarified that values of const-qualified variables with builtin floating-point types cannot be used directly in device code when the Microsoft compiler is used as the host compiler,
  ‣ Documented the extended lambda feature,
  ‣ Documented that typeid, std::type_info, and dynamic_cast are only supported in host code,
  ‣ Documented the restrictions on trigraphs and digraphs,
  ‣ Clarified the conditions under which layout mismatch can occur on Windows.
‣ Updated Table 12 to mention support of half-precision floating-point operations on devices of compute capability 5.3.
‣ Updated Table 2 with throughput for half-precision floating-point instructions.
‣ Added compute capability 5.3 to Table 13.
‣ Added the maximum number of resident grids per device to Table 13.
‣ Clarified the definition of __threadfence() in Memory Fence Functions.
‣ Mentioned in Atomic Functions that atomic functions do not act as memory fences.
TABLE OF CONTENTS
Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Events
      3.2.5.7. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Type Qualifiers
    B.1.1. __device__
    B.1.2. __global__
    B.1.3. __host__
    B.1.4. __noinline__ and __forceinline__
  B.2. Variable Type Qualifiers
    B.2.1. __device__
    B.2.2. __constant__
    B.2.3. __shared__
    B.2.4. __managed__
    B.2.5. __restrict__
  B.3. Built-in Vector Types
    B.3.1. char, short, int, long, longlong, float, double
    B.3.2. dim3
  B.4. Built-in Variables
    B.4.1. gridDim
    B.4.2. blockIdx
    B.4.3. blockDim
    B.4.4. threadIdx
    B.4.5. warpSize
  B.5. Memory Fence Functions
  B.6. Synchronization Functions
  B.7. Mathematical Functions
  B.8. Texture Functions
    B.8.1. Texture Object API
      B.8.1.1. tex1Dfetch()
      B.8.1.2. tex1D()
      B.8.1.3. tex1DLod()
      B.8.1.4. tex1DGrad()
      B.8.1.5. tex2D()
      B.8.1.6. tex2DLod()
      B.8.1.7. tex2DGrad()
      B.8.1.8. tex3D()
      B.8.1.9. tex3DLod()
      B.8.1.10. tex3DGrad()
      B.8.1.11. tex1DLayered()
      B.8.1.12. tex1DLayeredLod()
      B.8.1.13. tex1DLayeredGrad()
      B.8.1.14. tex2DLayered()
      B.8.1.15. tex2DLayeredLod()
      B.8.1.16. tex2DLayeredGrad()
      B.8.1.17. texCubemap()
      B.8.1.18. texCubemapLod()
      B.8.1.19. texCubemapLayered()
      B.8.1.20. texCubemapLayeredLod()
      B.8.1.21. tex2Dgather()
    B.8.2. Texture Reference API