Scalable Parallel Programming with CUDA
Driven by the insatiable market demand for real-time,
high-definition 3D graphics, the programmable GPU
(graphics processing unit) has evolved into a highly
parallel, multithreaded, manycore processor. It is designed to
efficiently support the graphics shader programming model,
in which a program for one thread draws one vertex or
shades one pixel fragment. The GPU excels at fine-grained, data-parallel workloads in which thousands of independent threads concurrently execute vertex, geometry, and pixel-shader programs.
The tremendous raw performance of modern GPUs has led researchers to explore mapping more general, non-graphics computations onto them. These GPGPU (general-purpose computation on GPUs) systems have produced some impressive results, but the limitations and difficulties of doing this via graphics APIs are legend. This desire to use the GPU as a more general parallel computing device motivated NVIDIA to develop a new unified graphics and computing GPU architecture and the CUDA programming model.

The CUDA model is also applicable to other shared-memory parallel processing architectures, including multicore CPUs [3].

CUDA provides three key abstractions—a hierarchy of thread groups, shared memories, and barrier synchronization—that provide a clear parallel structure to conventional C code for one thread of the hierarchy. Multiple levels of threads, memory, and synchronization provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. The abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel.
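These abstractions map directly onto CUDA C source. As a minimal sketch (the kernel name and sizes here are illustrative, not taken from the article), a vector addition can treat each thread block as an independent, coarse-grained sub-problem while each thread computes one element:

#include <cuda_runtime.h>

/* Illustrative kernel: each block is an independent coarse-grained
   sub-problem; each thread handles one fine-grained element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    /* ... initialize a and b on the device ... */
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);   /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Because the blocks do not depend on one another, the same source runs unchanged whether the hardware executes them one at a time or dozens at once.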
GPU COMPUTING ARCHITECTURE
Introduced by NVIDIA in November 2006, the Tesla unified graphics and computing architecture [1, 2] significantly extends
the GPU beyond graphics—its massively multithreaded
processor array becomes a highly efficient unified platform
for both graphics and general-purpose parallel computing
applications. By scaling the number of processors and memory partitions, the Tesla architecture spans a wide market
range—from the high-performance enthusiast GeForce 8800
GPU and professional Quadro and Tesla computing products
to a variety of inexpensive, mainstream GeForce GPUs. Its
computing features enable straightforward programming of
the GPU cores in C with CUDA. Wide availability in laptops,
desktops, workstations, and servers, coupled with C programmability and CUDA software, makes the Tesla architecture the first ubiquitous supercomputing platform.
The Tesla architecture is built around a scalable array of
multithreaded SMs (streaming multiprocessors). Current GPU implementations support from 768 to 12,288 concurrently executing threads. Transparent scaling across this wide
range of available parallelism is a key design goal of both the
GPU architecture and the CUDA programming model. Figure
A shows a GPU with 14 SMs—a total of 112 SP (streaming
processor) cores—interconnected with four external DRAM
partitions. When a CUDA program on the host CPU invokes
a kernel grid, the CWD (compute work distribution) unit
enumerates the blocks of the grid and begins distributing
them to SMs with available execution capacity. The threads
of a thread block execute concurrently on one SM. As thread
blocks terminate, the CWD unit launches new blocks on the
vacated multiprocessors.
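As an illustrative sketch (the kernel and sizes are hypothetical), a kernel launch specifies only the grid and block dimensions; nothing in the program refers to the number of SMs, which is what allows the CWD unit to scale one binary across small and large GPUs:

#include <cuda_runtime.h>

/* Hypothetical kernel: one thread per pixel of a width x height image. */
__global__ void scalePixels(float *pixels, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        pixels[y * width + x] *= s;
}

int main(void)
{
    int width = 1024, height = 768;
    float *d_pixels;
    cudaMalloc(&d_pixels, width * height * sizeof(float));

    /* The grid size is derived from the problem size, not the SM count;
       the CWD unit hands blocks to whichever SMs have spare capacity. */
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scalePixels<<<grid, block>>>(d_pixels, width, height, 0.5f);
    cudaDeviceSynchronize();

    cudaFree(d_pixels);
    return 0;
}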
An SM consists of eight scalar SP cores, two SFUs (special
function units) for transcendentals, an MT IU (multithreaded
instruction unit), and on-chip shared memory. The SM creates, manages, and executes up to 768 concurrent threads
in hardware with zero scheduling overhead. It can execute
as many as eight CUDA thread blocks concurrently, limited
by thread and memory resources. The SM implements the
CUDA __syncthreads() barrier synchronization intrinsic with
a single instruction. Fast barrier synchronization together
with lightweight thread creation and zero-overhead thread
scheduling efficiently support very fine-grained parallelism,
allowing a new thread to be created to compute each vertex,
pixel, and data point.
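To make the barrier concrete, here is a minimal sketch (an illustrative kernel, not from the article) in which __syncthreads() separates a block-wide write phase from a read phase on shared memory:

#include <cuda_runtime.h>

/* Illustrative kernel: each 256-thread block stages its slice of the
   array in on-chip shared memory, waits at the barrier, then reads
   elements written by other threads of the same block. */
__global__ void reverseEachBlock(int *data)
{
    __shared__ int tile[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                    /* write phase */
    __syncthreads();                             /* barrier: tile fully written */
    data[base + t] = tile[blockDim.x - 1 - t];   /* safe cross-thread read */
}

int main(void)
{
    const int n = 1024;              /* a multiple of 256 for this sketch */
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    reverseEachBlock<<<n / 256, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

Since the barrier compiles to a single SM instruction, such fine-grained phase changes cost little.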
To manage hundreds of threads running several different programs, the Tesla SM employs a new architecture we call SIMT (single-instruction, multiple-thread).