CUDA 2.2 Pinned Memory APIs
March 2009
Table of Contents

1. Overview
   1.1 “Portable pinned memory”: available to all contexts
   1.2 “Mapped pinned memory”: zero-copy
   1.3 Write-combined memory
2. Driver API
   2.1 New device attributes
   2.2 cuCtxCreate
   2.3 cuMemHostAlloc
   2.4 cuMemHostGetDevicePointer
3. CUDA Runtime API
   3.1 New Device Properties
   3.2 cudaSetDeviceFlags
   3.3 cudaHostAlloc
   3.4 cudaHostGetDevicePointer
4. Frequently Asked Questions
   4.1 I am trying to use mapped pinned memory, but I’m not getting a device pointer.
   4.2 Why didn’t NVIDIA implement zero-copy simply by ignoring the copy commands?
   4.3 When should I use mapped pinned memory?
   4.4 I am trying to use mapped pinned memory, and I’m not getting the expected results.
   4.5 Why do pinned allocations seem to be using CUDA address space?
   4.6 Mapped pinned memory is giving me a big performance hit!
   4.7 When should I use write-combined memory?
1. Overview
The term “pinned memory” does not appear anywhere in the CUDA header files, but has
been adopted by the CUDA developer community to refer to memory allocated by the
CUDA driver API’s cuMemAllocHost() or the CUDA runtime’s cudaMallocHost()
functions. Such memory is allocated for the CPU, but also page-locked and mapped for
access by the GPU, for higher transfer speeds and eligibility for asynchronous memcpy.[1]
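For illustration, here is a minimal sketch of this usage pattern with the CUDA runtime
API; the buffer size, stream, and function name are illustrative assumptions, and error
checking is omitted:

    #include <cuda_runtime.h>

    /* A minimal sketch: the cudaMemcpyAsync() below can be truly
     * asynchronous only because the host buffer is page-locked;
     * with pageable host memory the copy would be synchronous. */
    void sketch(void)
    {
        const size_t bytes = 1 << 20;     /* illustrative size */
        float *hostBuf, *devBuf;
        cudaStream_t stream;

        cudaMallocHost((void **)&hostBuf, bytes); /* pinned host memory */
        cudaMalloc((void **)&devBuf, bytes);
        cudaStreamCreate(&stream);

        cudaMemcpyAsync(devBuf, hostBuf, bytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);    /* wait for the copy to finish */

        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
        cudaStreamDestroy(stream);
    }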
However, before CUDA 2.2 the benefits of pinned memory were realized only on the CPU
thread (or, if using the driver API, the CUDA context) in which the memory was allocated.
This restriction was especially problematic for pre-CUDA 2.2 applications that operate
multiple GPUs: since each GPU requires its own CUDA context, a given buffer was
guaranteed to be treated as pageable by all but one of the contexts needed to drive the
GPUs.
In addition, before CUDA 2.2, pinned memory could only be copied to and from a GPU’s
device memory; CUDA kernels could not access CPU memory directly, even if it was
pinned.
CUDA 2.2 introduces new APIs that relax these restrictions via a new function called
cuMemHostAlloc()[2] (or, in the CUDA runtime, cudaHostAlloc()). The new features are
as follows:
‐ “Portable” pinned buffers that are available to all GPUs.
‐ “Mapped” pinned buffers that are mapped into the CUDA address space. On integrated
GPUs, mapped pinned memory enables applications to avoid superfluous copies since
integrated GPUs operate on the same pool of physical memory as the CPU. As a result,
mapped pinned buffers may be referred to as “zero-copy” buffers.
‐ “WC” (write-combined) memory that is not cached by the CPU, but kept in a small
intermediary buffer and written as needed at high speed. WC memory has higher PCI
Express copy performance and does not have any effect on the CPU caches (since the
WC buffers are a separate hardware resource), but WC memory has drawbacks. The
CPU cannot read from WC memory without incurring a performance penalty,[3] so WC
memory cannot be used in the general case – it is best for buffers where the CPU is
producing data for consumption by the GPU. Additionally, WC memory may require
fence instructions to ensure coherence.[4]
These features are completely orthogonal: you can allocate a portable, write-combined
buffer, a portable pinned buffer, a write-combined buffer that is neither portable nor
mapped, or any other permutation enabled by the flags.
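For illustration, a minimal sketch of combining the flags with the CUDA runtime;
allocation sizes are arbitrary, error checking is omitted, and the cudaSetDeviceFlags()
call (described in section 3.2) is needed only if mapped pinned memory will be used:

    #include <cuda_runtime.h>

    /* A minimal sketch: the cudaHostAlloc() flags may be combined freely. */
    void sketch(void)
    {
        void *portableWC, *mapped;

        /* Must precede any other CUDA call on this thread if mapped
         * pinned memory will be used (see section 3.2). */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        /* Portable and write-combined, in one allocation. */
        cudaHostAlloc(&portableWC, 4096,
                      cudaHostAllocPortable | cudaHostAllocWriteCombined);

        /* Mapped ("zero-copy") pinned memory. */
        cudaHostAlloc(&mapped, 4096, cudaHostAllocMapped);

        cudaFreeHost(portableWC);
        cudaFreeHost(mapped);
    }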
1.1 “Portable pinned memory”: available to all contexts
Before CUDA 2.2, the benefits of pinned memory could only be realized on the CUDA
context that allocated it. This restriction was especially onerous on multi-GPU applications,
since they often divide problems among GPUs, dispatching different subsets of the input
data to different GPUs and gathering the output data into one buffer.
cuMemHostAlloc() relaxes this restriction through the CU_MEMHOSTALLOC_PORTABLE
flag. When this flag is specified, the pinned memory is made available to all CUDA
contexts, not just the one that performed the allocation.[5]
Portable pinned memory works both for contexts that predate the allocation, and for
contexts that are created after the allocation has been performed.
Portable pinned memory may be freed by any CUDA context by calling
cuMemFreeHost(). Once freed, it is no longer available to any CUDA context.
The CUDA runtime exposes this feature via the new cudaHostAlloc() function with the
cudaHostAllocPortable flag. Memory allocated by cudaHostAlloc() may be freed by
calling cudaFreeHost().
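For illustration, a minimal driver API sketch under assumed conditions (two devices
present at ordinals 0 and 1, illustrative buffer size, no error checking):

    #include <cuda.h>

    /* A minimal sketch: one portable pinned buffer shared by contexts on
     * two devices. The PORTABLE flag pins the allocation for every
     * current and future context, not just the allocating one. */
    void sketch(void)
    {
        CUdevice dev0, dev1;
        CUcontext ctx0, ctx1;
        void *buf;

        cuInit(0);
        cuDeviceGet(&dev0, 0);
        cuDeviceGet(&dev1, 1);

        cuCtxCreate(&ctx0, 0, dev0);
        cuMemHostAlloc(&buf, 1 << 20, CU_MEMHOSTALLOC_PORTABLE);
        cuCtxPopCurrent(NULL);

        cuCtxCreate(&ctx1, 0, dev1); /* created after the allocation,
                                        yet buf is pinned here too */
        /* ... use buf with cuMemcpyHtoDAsync() etc. in ctx1 ... */
        cuMemFreeHost(buf);          /* any context may free it */
        cuCtxDestroy(ctx1);

        cuCtxPushCurrent(ctx0);
        cuCtxDestroy(ctx0);
    }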
1.2 “Mapped pinned memory”: zero-copy
To date, CUDA has presented a memory model where the CPU and GPU have distinct
memory that is accessible to one device or the other, but never both. Data interchange
between the two devices is achieved by allocating two buffers (one each in CPU memory
and GPU memory) and copying data between them. This memory model reflects the target
GPUs for CUDA, which historically have been discrete GPUs with dedicated memory
subsystems.
There are two scenarios where it is desirable for the CPU and GPU to share a buffer without
explicit buffer allocations and copies:
1) On GPUs integrated into the motherboard, the copies are superfluous because the
CPU and GPU memory are physically the same.
Footnotes:
[1] Pageable memory cannot be copied asynchronously since the operating system may
move it or swap it out to disk before the GPU is finished using it.
[2] You may wonder why NVIDIA would name a function “cuMemHostAlloc” when the
existing function to allocate pinned memory is called “cuMemAllocHost.” Both naming
conventions follow “global to local” scoping as you read from left to right: prefix,
function family, action. cuMemAllocHost() belongs to the family “Mem” and performs
the “alloc host” operation; cuMemHostAlloc() belongs to the family “MemHost” and
performs the “alloc” operation.
[3] Note, SSE4.1 introduced the MOVNTDQA instruction that enables CPUs to read from
WC memory with high performance.
[4] The CUDA driver uses WC internally and must issue a store fence instruction
whenever it sends a command to the GPU. So the application may not have to use store
fences at all.
[5] Portable pinned memory is not the default for compatibility reasons. Making pinned
allocations portable without an opt-in from the application could cause failures to be
reported that did not occur in previous versions of CUDA.