CUDA C++ Best Practices Guide DG-05603-001_v11.2|iii
6.3.3. IEEE 754 Compliance .......... 13
6.3.4. x86 80-bit Computations .......... 13
Chapter 7. Optimizing CUDA Applications .......... 14
Chapter 8. Performance Metrics .......... 15
8.1. Timing .......... 15
8.1.1. Using CPU Timers .......... 15
8.1.2. Using CUDA GPU Timers .......... 16
8.2. Bandwidth .......... 16
8.2.1. Theoretical Bandwidth Calculation .......... 17
8.2.2. Effective Bandwidth Calculation .......... 17
8.2.3. Throughput Reported by Visual Profiler .......... 18
Chapter 9. Memory Optimizations .......... 19
9.1. Data Transfer Between Host and Device .......... 19
9.1.1. Pinned Memory .......... 20
9.1.2. Asynchronous and Overlapping Transfers with Computation .......... 20
9.1.3. Zero Copy .......... 23
9.1.4. Unified Virtual Addressing .......... 24
9.2. Device Memory Spaces .......... 24
9.2.1. Coalesced Access to Global Memory .......... 26
9.2.1.1. A Simple Access Pattern .......... 26
9.2.1.2. A Sequential but Misaligned Access Pattern .......... 27
9.2.1.3. Effects of Misaligned Accesses .......... 27
9.2.1.4. Strided Accesses .......... 28
9.2.2. L2 Cache .......... 30
9.2.2.1. L2 Cache Access Window .......... 30
9.2.2.2. Tuning the Access Window Hit-Ratio .......... 31
9.2.3. Shared Memory .......... 34
9.2.3.1. Shared Memory and Memory Banks .......... 34
9.2.3.2. Shared Memory in Matrix Multiplication (C=AB) .......... 35
9.2.3.3. Shared Memory in Matrix Multiplication (C=AAᵀ) .......... 38
9.2.3.4. Asynchronous Copy from Global Memory to Shared Memory .......... 40
9.2.4. Local Memory .......... 43
9.2.5. Texture Memory .......... 43
9.2.5.1. Additional Texture Capabilities .......... 43
9.2.6. Constant Memory .......... 44
9.2.7. Registers .......... 44
9.2.7.1. Register Pressure .......... 44
9.3. Allocation .......... 45