Scalable Parallel Programming with CUDA
Driven by the insatiable market demand for real-time,
high-definition 3D graphics, the programmable GPU
(graphics processing unit) has evolved into a highly
parallel, multithreaded, manycore processor. It is designed to
efficiently support the graphics shader programming model,
in which a program for one thread draws one vertex or
shades one pixel fragment. The GPU excels at fine-grained, data-parallel workloads in which thousands of independent threads concurrently execute vertex, geometry, and pixel-shader programs.
The tremendous raw performance of modern GPUs has led researchers to explore mapping more general, non-graphics computations onto them. These GPGPU (general-purpose computation on GPUs) systems have produced some impressive results, but the limitations and difficulties of doing this via graphics APIs are legend. This desire to use the GPU as a more general parallel computing device motivated NVIDIA to develop a new unified graphics and computing GPU architecture and the CUDA programming model.

The CUDA model is also applicable to other shared-memory parallel processing architectures, including multicore CPUs [3].

CUDA provides three key abstractions—a hierarchy of thread groups, shared memories, and barrier synchronization—that provide a clear parallel structure to conventional C code for one thread of the hierarchy. Multiple levels of threads, memory, and synchronization provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. The abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel.
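These abstractions map directly onto CUDA C source. As a minimal sketch (the kernel name and sizes here are illustrative, not taken from the article), a vector addition can treat each thread block as an independent, coarse-grained sub-problem while each thread computes one element:

#include <cuda_runtime.h>

/* Illustrative kernel: each block is an independent coarse-grained
   sub-problem; each thread handles one fine-grained element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    /* ... initialize a and b on the device ... */
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);   /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Because the blocks do not depend on one another, the same source runs unchanged whether the hardware executes them one at a time or dozens at once.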
GPU COMPUTING ARCHITECTURE
Introduced by NVIDIA in November 2006, the Tesla unified graphics and computing architecture [1, 2] significantly extends
the GPU beyond graphics—its massively multithreaded
processor array becomes a highly efficient unified platform
for both graphics and general-purpose parallel computing
applications. By scaling the number of processors and memory partitions, the Tesla architecture spans a wide market
range—from the high-performance enthusiast GeForce 8800
GPU and professional Quadro and Tesla computing products
to a variety of inexpensive, mainstream GeForce GPUs. Its
computing features enable straightforward programming of
the GPU cores in C with CUDA. Wide availability in laptops,
desktops, workstations, and servers, coupled with C programmability and CUDA software, makes the Tesla architecture the first ubiquitous supercomputing platform.
The Tesla architecture is built around a scalable array of
multithreaded SMs (streaming multiprocessors). Current GPU implementations support from 768 to 12,288 concurrently executing threads. Transparent scaling across this wide
range of available parallelism is a key design goal of both the
GPU architecture and the CUDA programming model. Figure
A shows a GPU with 14 SMs—a total of 112 SP (streaming
processor) cores—interconnected with four external DRAM
partitions. When a CUDA program on the host CPU invokes
a kernel grid, the CWD (compute work distribution) unit
enumerates the blocks of the grid and begins distributing
them to SMs with available execution capacity. The threads
of a thread block execute concurrently on one SM. As thread
blocks terminate, the CWD unit launches new blocks on the
vacated multiprocessors.
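As an illustrative sketch (the kernel and sizes are hypothetical), a kernel launch specifies only the grid and block dimensions; nothing in the program refers to the number of SMs, which is what allows the CWD unit to scale one binary across small and large GPUs:

#include <cuda_runtime.h>

/* Hypothetical kernel: one thread per pixel of a width x height image. */
__global__ void scalePixels(float *pixels, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        pixels[y * width + x] *= s;
}

int main(void)
{
    int width = 1024, height = 768;
    float *d_pixels;
    cudaMalloc(&d_pixels, width * height * sizeof(float));

    /* The grid size is derived from the problem size, not the SM count;
       the CWD unit hands blocks to whichever SMs have spare capacity. */
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scalePixels<<<grid, block>>>(d_pixels, width, height, 0.5f);
    cudaDeviceSynchronize();

    cudaFree(d_pixels);
    return 0;
}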
An SM consists of eight scalar SP cores, two SFUs (special
function units) for transcendentals, an MT IU (multithreaded
instruction unit), and on-chip shared memory. The SM creates, manages, and executes up to 768 concurrent threads
in hardware with zero scheduling overhead. It can execute
as many as eight CUDA thread blocks concurrently, limited
by thread and memory resources. The SM implements the
CUDA __syncthreads() barrier synchronization intrinsic with
a single instruction. Fast barrier synchronization together
with lightweight thread creation and zero-overhead thread
scheduling efficiently support very fine-grained parallelism,
allowing a new thread to be created to compute each vertex,
pixel, and data point.
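To make the barrier concrete, here is a minimal sketch (an illustrative kernel, not from the article) in which __syncthreads() separates a block-wide write phase from a read phase on shared memory:

#include <cuda_runtime.h>

/* Illustrative kernel: each 256-thread block stages its slice of the
   array in on-chip shared memory, waits at the barrier, then reads
   elements written by other threads of the same block. */
__global__ void reverseEachBlock(int *data)
{
    __shared__ int tile[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                    /* write phase */
    __syncthreads();                             /* barrier: tile fully written */
    data[base + t] = tile[blockDim.x - 1 - t];   /* safe cross-thread read */
}

int main(void)
{
    const int n = 1024;              /* a multiple of 256 for this sketch */
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    reverseEachBlock<<<n / 256, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

Since the barrier compiles to a single SM instruction, such fine-grained phase changes cost little.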
To manage hundreds of threads running several different programs, the Tesla SM employs a new architecture we call SIMT (single-instruction, multiple-thread).