CUDA Programming Guide
Version 1.1
Version 1.1
11/29/2007
NVIDIA CUDA
Compute Unified Device Architecture
Programming Guide
Table of Contents
Chapter 1. Introduction to CUDA
1.1 The Graphics Processor Unit as a Data-Parallel Computing Device
1.2 CUDA: A New Architecture for Computing on the GPU
1.3 Document Structure
Chapter 2. Programming Model
2.1 A Highly Multithreaded Coprocessor
2.2 Thread Batching
2.2.1 Thread Block
2.2.2 Grid of Thread Blocks
2.3 Memory Model
Chapter 3. Hardware Implementation
3.1 A Set of SIMD Multiprocessors with On-Chip Shared Memory
3.2 Execution Model
3.3 Compute Capability
3.4 Multiple Devices
3.5 Display Mode Switches
Chapter 4. Application Programming Interface
4.1 An Extension to the C Programming Language
4.2 Language Extensions
4.2.1 Function Type Qualifiers
4.2.2 Variable Type Qualifiers
4.2.3 Execution Configuration
4.2.4 Built-in Variables
4.2.5 Compilation with NVCC
4.3 Common Runtime Component
4.3.1 Built-in Vector Types
4.3.2 Mathematical Functions
4.3.3 Time Function
4.3.4 Texture Type
4.4 Device Runtime Component
4.4.1 Mathematical Functions
4.4.2 Synchronization Function
4.4.3 Type Conversion Functions
4.4.4 Type Casting Functions
4.4.5 Texture Functions
4.4.6 Atomic Functions
4.5 Host Runtime Component
4.5.1 Common Concepts
4.5.2 Runtime API
4.5.3 Driver API
Chapter 5. Performance Guidelines
5.1 Instruction Performance
5.1.1 Instruction Throughput
5.1.2 Memory Bandwidth
5.2 Number of Threads per Block
5.3 Data Transfer between Host and Device
5.4 Texture Fetch versus Global or Constant Memory Read
5.5 Overall Performance Optimization Strategies
Chapter 6. Example of Matrix Multiplication
6.1 Overview
6.2 Source Code Listing
6.3 Source Code Walkthrough
6.3.1 Mul()
6.3.2 Muld()
Appendix A. Technical Specifications
A.1 General Specifications
A.2 Floating-Point Standard
Appendix B. Mathematical Functions
B.1 Common Runtime Component
B.2 Device Runtime Component
Appendix C. Atomic Functions
C.1 Arithmetic Functions
C.1.1 atomicAdd()
C.1.2 atomicSub()
C.1.3 atomicExch()
C.1.4 atomicMin()
C.1.5 atomicMax()
C.1.6 atomicInc()
C.1.7 atomicDec()
C.1.8 atomicCAS()
C.2 Bitwise Functions
C.2.1 atomicAnd()
C.2.2 atomicOr()
C.2.3 atomicXor()
Appendix D. Runtime API Reference
D.1 Device Management
D.1.1 cudaGetDeviceCount()
D.1.2 cudaSetDevice()
D.1.3 cudaGetDevice()
D.1.4 cudaGetDeviceProperties()
D.1.5 cudaChooseDevice()
D.2 Thread Management
D.2.1 cudaThreadSynchronize()
D.2.2 cudaThreadExit()
D.3 Stream Management
D.3.1 cudaStreamCreate()
D.3.2 cudaStreamQuery()
D.3.3 cudaStreamSynchronize()
D.3.4 cudaStreamDestroy()
D.4 Event Management
D.4.1 cudaEventCreate()
D.4.2 cudaEventRecord()
D.4.3 cudaEventQuery()
D.4.4 cudaEventSynchronize()
D.4.5 cudaEventDestroy()
D.4.6 cudaEventElapsedTime()
D.5 Memory Management
D.5.1 cudaMalloc()
D.5.2 cudaMallocPitch()