CUDA 2.2 Pinned Memory APIs
March 2009
Table of Contents

1. Overview
   1.1 “Portable pinned memory”: available to all contexts
   1.2 “Mapped pinned memory”: zero-copy
   1.3 Write-combined memory
2. Driver API
   2.1 New device attributes
   2.2 cuCtxCreate
   2.3 cuMemHostAlloc
   2.4 cuMemHostGetDevicePointer
3. CUDA Runtime API
   3.1 New Device Properties
   3.2 cudaSetDeviceFlags
   3.3 cudaHostAlloc
   3.4 cudaHostGetDevicePointer
4. Frequently Asked Questions
   4.1 I am trying to use mapped pinned memory, but I’m not getting a device pointer.
   4.2 Why didn’t NVIDIA implement zero-copy simply by ignoring the copy commands?
   4.3 When should I use mapped pinned memory?
   4.4 I am trying to use mapped pinned memory, and I’m not getting the expected results.
   4.5 Why do pinned allocations seem to be using CUDA address space?
   4.6 Mapped pinned memory is giving me a big performance hit!
   4.7 When should I use write-combined memory?
1. Overview
The term “pinned memory” does not appear anywhere in the CUDA header files, but has
been adopted by the CUDA developer community to refer to memory allocated by the
CUDA driver API’s cuMemAllocHost() or the CUDA runtime’s cudaMallocHost()
functions. Such memory is allocated for the CPU, but also page-locked and mapped for
access by the GPU, for higher transfer speeds and eligibility for asynchronous memcpy.[1]
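For illustration, here is a minimal sketch of this usage pattern with the CUDA runtime
API; the buffer size, stream, and function name are illustrative assumptions, and error
checking is omitted:

    #include <cuda_runtime.h>

    /* A minimal sketch: the cudaMemcpyAsync() below can be truly
     * asynchronous only because the host buffer is page-locked;
     * with pageable host memory the copy would be synchronous. */
    void sketch(void)
    {
        const size_t bytes = 1 << 20;     /* illustrative size */
        float *hostBuf, *devBuf;
        cudaStream_t stream;

        cudaMallocHost((void **)&hostBuf, bytes); /* pinned host memory */
        cudaMalloc((void **)&devBuf, bytes);
        cudaStreamCreate(&stream);

        cudaMemcpyAsync(devBuf, hostBuf, bytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);    /* wait for the copy to finish */

        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        cudaStreamDestroy(stream);
    }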
However, before CUDA 2.2 the benefits of pinned memory were realized only on the CPU
thread (or, if using the driver API, the CUDA context) in which the memory was allocated.
This restriction was especially problematic for pre-CUDA 2.2 applications that operate
multiple GPUs: since each GPU requires its own CUDA context, a given buffer was
guaranteed to be treated as pageable by all but one of the contexts needed to drive the
GPUs.
In addition, before CUDA 2.2, pinned memory could only be copied to and from a GPU’s
device memory; CUDA kernels could not access CPU memory directly, even if it was
pinned.
CUDA 2.2 introduces new APIs that relax these restrictions via a new function called
cuMemHostAlloc()[2] (or, in the CUDA runtime, cudaHostAlloc()). The new features are
as follows:
‐ “Portable” pinned buffers that are available to all GPUs.
‐ “Mapped” pinned buffers that are mapped into the CUDA address space. On integrated
GPUs, mapped pinned memory enables applications to avoid superfluous copies since
integrated GPUs operate on the same pool of physical memory as the CPU. As a result,
mapped pinned buffers may be referred to as “zero-copy” buffers.
‐ “WC” (write-combined) memory that is not cached by the CPU, but kept in a small
intermediary buffer and written as needed at high speed. WC memory has higher PCI
Express copy performance and does not have any effect on the CPU caches (since the
WC buffers are a separate hardware resource), but WC memory has drawbacks. The
CPU cannot read from WC memory without incurring a performance penalty,[3] so WC
memory cannot be used in the general case – it is best for buffers where the CPU is
producing data for consumption by the GPU. Additionally, WC memory may require
fence instructions to ensure coherence.[4]
These features are completely orthogonal: you can allocate a portable, write-combined
buffer, a portable pinned buffer, a write-combined buffer that is neither portable nor
mapped, or any other permutation enabled by the flags.
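For illustration, a minimal sketch of combining the flags with the CUDA runtime;
allocation sizes are arbitrary, error checking is omitted, and the cudaSetDeviceFlags()
call (described in section 3.2) is needed only if mapped pinned memory will be used:

    #include <cuda_runtime.h>

    /* A minimal sketch: the cudaHostAlloc() flags may be combined freely. */
    void sketch(void)
    {
        void *portableWC, *mapped;

        /* Must precede any other CUDA call on this thread if mapped
         * pinned memory will be used (see section 3.2). */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        /* Portable and write-combined, in one allocation. */
        cudaHostAlloc(&portableWC, 4096,
                      cudaHostAllocPortable | cudaHostAllocWriteCombined);

        /* Mapped ("zero-copy") pinned memory. */
        cudaHostAlloc(&mapped, 4096, cudaHostAllocMapped);

        cudaFreeHost(portableWC);
        cudaFreeHost(mapped);
    }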
1.1 “Portable pinned memory”: available to all contexts
Before CUDA 2.2, the benefits of pinned memory could only be realized on the CUDA
context that allocated it. This restriction was especially onerous on multi-GPU applications,
since they often divide problems among GPUs, dispatching different subsets of the input
data to different GPUs and gathering the output data into one buffer.
cuMemHostAlloc() relaxes this restriction through the CU_MEMHOSTALLOC_PORTABLE
flag. When this flag is specified, the pinned memory is made available to all CUDA
contexts, not just the one that performed the allocation.[5]
Portable pinned memory works both for contexts that predate the allocation, and for
contexts that are created after the allocation has been performed.
Portable pinned memory may be freed by any CUDA context by calling
cuMemFreeHost(). Once freed, it is no longer available to any CUDA context.
The CUDA runtime exposes this feature via the new cudaHostAlloc() function with the
cudaHostAllocPortable flag. Memory allocated by cudaHostAlloc() may be freed by
calling cudaFreeHost().
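For illustration, a minimal driver API sketch under assumed conditions (two devices
present at ordinals 0 and 1, illustrative buffer size, no error checking):

    #include <cuda.h>

    /* A minimal sketch: one portable pinned buffer shared by contexts on
     * two devices. The PORTABLE flag pins the allocation for every
     * current and future context, not just the allocating one. */
    void sketch(void)
    {
        CUdevice dev0, dev1;
        CUcontext ctx0, ctx1;
        void *buf;

        cuInit(0);
        cuDeviceGet(&dev0, 0);
        cuDeviceGet(&dev1, 1);

        cuCtxCreate(&ctx0, 0, dev0);
        cuMemHostAlloc(&buf, 1 << 20, CU_MEMHOSTALLOC_PORTABLE);
        cuCtxPopCurrent(NULL);

        cuCtxCreate(&ctx1, 0, dev1); /* created after the allocation,
                                        yet buf is pinned here too */
        /* ... use buf with cuMemcpyHtoDAsync() etc. in ctx1 ... */
        cuMemFreeHost(buf);          /* any context may free it */
        cuCtxDestroy(ctx1);

        cuCtxPushCurrent(ctx0);
        cuCtxDestroy(ctx0);
    }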
1.2 “Mapped pinned memory”: zero-copy
To date, CUDA has presented a memory model where the CPU and GPU have distinct
memory that is accessible to one device or the other, but never both. Data interchange
between the two devices is achieved by allocating two buffers (one each in CPU memory
and GPU memory) and copying data between them. This memory model reflects the target
GPUs for CUDA, which historically have been discrete GPUs with dedicated memory
subsystems.
There are two scenarios where it is desirable for the CPU and GPU to share a buffer without
explicit buffer allocations and copies:
1) On GPUs integrated into the motherboard, the copies are superfluous because the
CPU and GPU memory are physically the same.
Footnotes:
[1] Pageable memory cannot be copied asynchronously since the operating system may
move it or swap it out to disk before the GPU is finished using it.
[2] You may wonder why NVIDIA would name a function “cuMemHostAlloc” when the
existing function to allocate pinned memory is called “cuMemAllocHost.” Both naming
conventions follow “global to local” scoping as you read from left to right: prefix,
function family, action. cuMemAllocHost() belongs to the family “Mem” and performs
the “alloc host” operation; cuMemHostAlloc() belongs to the family “MemHost” and
performs the “alloc” operation.
[3] Note, SSE4.1 introduced the MOVNTDQA instruction that enables CPUs to read from
WC memory with high performance.
[4] The CUDA driver uses WC internally and must issue a store fence instruction
whenever it sends a command to the GPU. So the application may not have to use store
fences at all.
[5] Portable pinned memory is not the default for compatibility reasons. Making pinned
allocations portable without an opt-in from the application could cause failures to be
reported that did not occur in previous versions of CUDA.