High-Performance and Scalable GPU Graph Traversal
DUANE MERRILL and MICHAEL GARLAND, NVIDIA Corporation
and ANDREW GRIMSHAW, University of Virginia
Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph
analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and
work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of
GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform
poorly on graphs with nontrivial diameter.
We present a BFS parallelization focused on fine-grained task management constructed from efficient
prefix sum computations that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our
implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of
3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respec-
tively. This level of performance is several times faster than state-of-the-art implementations on both CPU
and GPU platforms.
Categories and Subject Descriptors: G.2.2 [Discrete Mathematics]: Graph Theory—Graph Algorithms;
D.1.3 [Programming Techniques]: Concurrent programming; F.2.2 [Analysis of Algorithms and
Problem Complexity]: Nonnumerical Algorithms and Problems—Computations on discrete structures,
Geometrical problems and computations
General Terms: Design, Algorithms, Performance
Additional Key Words and Phrases: Breadth-first search, GPU, graph algorithms, parallel algorithms, prefix
sum, graph traversal, sparse graphs
ACM Reference Format:
Merrill, D., Garland, M., and Grimshaw, A. 2015. High-performance and scalable GPU graph traversal. ACM
Trans. Parallel Comput. 1, 2, Article 14 (January 2015), 30 pages.
DOI:http://dx.doi.org/10.1145/2717511
1. INTRODUCTION
Algorithms for analyzing sparse relationships represented as graphs provide crucial
tools in many computational fields, ranging from genomics to electronic design au-
tomation to social network analysis. In this article, we explore the parallelization of
one fundamental graph algorithm on GPUs: breadth-first search (BFS). BFS is a com-
mon building block for more sophisticated graph algorithms, yet simple enough so
that we can analyze its behavior in depth. It is also used as a core computational ker-
nel in a number of benchmark suites, including Parboil [Stratton et al. 2012], Rodinia
[Che et al. 2009], and the emerging Graph500 supercomputer benchmark [Graph500
List 2011].
Contemporary processor architecture provides increasing parallelism in order to
deliver higher throughput while maintaining energy efficiency. Modern GPUs are at
the leading edge of this trend, provisioning tens of thousands of data-parallel threads.
D. Merrill was supported in part by an NVIDIA Graduate Fellowship.
Authors’ addresses: D. Merrill (corresponding author), M. Garland, NVIDIA Corporation; email:
duane.merrill@gmail.com; A. Grimshaw, University of Virginia, Charlottesville, VA.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for third-party components of this
work must be honored. For all other uses, contact the Owner/Author.
2015 Copyright is held by the author/owner(s). 2329-4949/2015/01-ART14
DOI:http://dx.doi.org/10.1145/2717511
Despite their high computational throughput, GPUs might appear poorly suited for
sparse graph computation. In particular, BFS is representative of a class of algorithms
for which it is hard to obtain significantly better performance from parallelization.
Optimizing memory usage is nontrivial because memory access patterns are deter-
mined by the structure of the input graph, and parallelization further introduces
concerns of contention, load imbalance, and underutilization on multithreaded archi-
tectures [Agarwal et al. 2010; Leiserson and Schardl 2010; Xia and Prasanna 2009].
The wide data parallelism of GPUs can be particularly sensitive to these performance
issues.
Prior work on parallel graph algorithms has relied on two key architectural fea-
tures for performance: the first is multithreading and overlapped computation for
hiding memory latency, and the second is fine-grained synchronization, specifically
atomic read-modify-write operations. Atomic mechanisms are convenient for coor-
dinating the dynamic placement of data into shared data structures and for arbi-
trating contended status updates [Agarwal et al. 2010; Bader and Madduri 2006a;
Bader et al. 2005].
Modern GPU architectures provide both of these features; however, serializations
from atomic collisions are particularly expensive for GPUs in terms of efficiency and
performance. In general, mutual exclusion does not scale to thousands of parallel
processing elements. Furthermore, the cost of fine-grained and dynamic serialization
between threads within the same GPU SIMD unit is greater than that of more tra-
ditional, overlapped SMT threads. For example, all SIMD lanes may be made to wait
while the collisions of only a few are serialized.
For machines with wide data parallelism, we argue that software-based prefix sum
computations [Blelloch 1990; Hillis and Steele 1986] are a more suitable approach for
cooperative data placement. Prefix sum is a bulk-synchronous algorithmic primitive
that can be used to compute scatter offsets for concurrent threads, given their dynamic
allocation requirements. Efficient GPU prefix sums [Merrill and Grimshaw 2009] allow us
to reorganize sparse and uneven workloads into dense and uniform ones in all phases
of graph traversal.
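To make this concrete, the following is a minimal host-side C++ sketch, not the authors' CUDA implementation: given each thread's dynamic allocation requirement, an exclusive prefix sum over those requirements yields the scatter offset at which each thread may write its output without colliding with its peers.

```cpp
// Host-side C++ sketch of prefix-sum-based allocation (illustrative only).
// counts[i] is the number of items thread i needs to write; offsets[i] is the
// index in the shared output at which thread i may begin writing.
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<std::size_t> scatter_offsets(const std::vector<std::size_t>& counts,
                                         std::size_t& total_output) {
    std::vector<std::size_t> offsets(counts.size());
    // Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(),
                        std::size_t{0});
    total_output = counts.empty() ? 0 : offsets.back() + counts.back();
    return offsets;
}
```

Thread i then writes its counts[i] items to positions offsets[i] .. offsets[i] + counts[i] - 1, turning an uneven, data-dependent allocation into a dense, collision-free placement; the same pattern applies within a CTA using a local (shared-memory) prefix sum.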
Our work as described in this article makes contributions in the following areas.
— Parallelization strategy. We present a GPU BFS parallelization that performs an
asymptotically optimal linear amount of work. It is the first to incorporate fine-
grained parallel adjacency list expansion. We also introduce local duplicate detection
techniques for avoiding race conditions that create redundant work. We demonstrate
that our approach delivers high performance on a broad spectrum of structurally
diverse graphs and, to our knowledge, we also describe the first design for multi-
GPU graph traversal.
— Empirical performance characterization. We present detailed analyses that isolate
and analyze the expansion and contraction aspects of BFS throughout the traversal
process. We reveal that serial and warp-centric expansion techniques described by
prior work significantly underutilize the GPU for important graph genres, and also
show that the fusion of neighbor expansion and inspection within the same kernel
often yields worse performance than performing them separately.
— High performance. We demonstrate that our methods deliver excellent performance
on a diverse body of real-world graphs. Our implementation achieves traversal rates
in excess of 3.3 billion and 8.3 billion traversed edges per second (TE/s) for single-
and quad-GPU configurations, respectively. In context, recent state-of-the-art paral-
lel implementations achieve 0.7 billion and 1.3 billion TE/s for similar datasets on
single- and quad-socket multicore processors [Agarwal et al. 2010]. Our implemen-
tations are publicly available via the B40C Project [Merrill 2011].
2. BACKGROUND
Modern NVIDIA GPU processors consist of tens of multiprocessor cores, each manag-
ing on the order of a thousand hardware-scheduled threads. Each multiprocessor core
employs data-parallel SIMD (single instruction, multiple data) techniques in which a
single instruction stream is executed by a fixed-size grouping of threads called a warp.
A cooperative thread array (or CTA) is a group of threads that will be co-located on
the same multiprocessor and share a local scratch memory. Parallel threads are used
to execute a single program, or kernel. A sequence of kernel invocations is bulk syn-
chronous: each kernel is initially presented with a consistent view of the results from
the previous.
The efficiency of GPU architecture stems from the bulk-synchronous and SIMD as-
pects of the machine model. They facilitate excellent processor utilization on uniform
workloads having regularly structured computation. When the computation becomes
dynamic and varied, mismatches with the underlying architecture can result in signif-
icant performance penalties. For example, performance can be degraded by irregular
memory access patterns that cannot be coalesced or that result in arbitrary memory
bank conflicts; by control-flow divergences between SIMD warp threads that result in
thread serialization; and by load imbalances between barrier synchronization points that
result in resource underutilization [Owens et al. 2008]. In this work, we make extensive
use of local prefix sum computation as a foundation for reorganizing sparse and
uneven workloads into dense and uniform ones.
2.1. Breadth-First Search
BFS is a graph traversal algorithm that systematically explores every reachable ver-
tex from a given source, where closer vertices are visited first. Fundamental uses of
BFS include: identifying all of the connected components within a graph, finding the
diameter of a tree, and testing a graph for bipartiteness [Cormen et al. 2001]. More so-
phisticated problems incorporating BFS include: identifying the reachable set of heap
items during garbage collection [Cheney 1970], belief propagation in statistical in-
ference [Gonzalez et al. 2009], finding community structure in networks [Newman
and Girvan 2004], and computing the maximum flow/minimum cut for a given graph
[Hussein et al. 2007].
We consider graphs of the form G = (V, E) with a set V of n vertices and a set E of
m directed edges. Given a source vertex v_s, our goal is to traverse the vertices of G in
breadth-first order starting at v_s. Each newly discovered vertex v_i will be labeled by:
(a) its distance d_i from v_s and/or (b) the predecessor vertex p_i immediately preceding
it on the shortest path to v_s. For simplicity, we identify the vertices v_0 .. v_{n-1} using
integer indices. The pair (v_i, v_j) indicates a directed edge in the graph from v_i → v_j,
and the adjacency list A_i = {v_j | (v_i, v_j) ∈ E} is the set of neighboring vertices incident
on vertex v_i. We treat undirected graphs as symmetric directed graphs containing both
(v_i, v_j) and (v_j, v_i) for each undirected edge. In this article, all graph sizes and traversal
rates are measured in terms of directed edge counts.
We represent the graph using an adjacency matrix A whose rows are the adjacency
lists A_i. The number of edges within sparse graphs is typically only a constant factor
larger than n. We use the well-known compressed sparse row (CSR) format, consisting
of two arrays, to store the graph in memory.
Figure 1 provides a simple example, where the column-indices array C is formed
from the set of the adjacency lists concatenated into a single array of m integers. The
row-offsets array R contains n + 1 integers, and entry R[i] is the index in C of the
adjacency list A_i.
Fig. 1. Example CSR representation: column-indices array C and row-offsets array R comprise the
adjacency matrix A.
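As a concrete illustration of this layout, the following C++ sketch builds the two CSR arrays from an edge list; the edge-list input and the function and type names are illustrative assumptions, not part of the paper's interface.

```cpp
// Illustrative CSR construction: R has n+1 entries, and the neighbors of
// vertex i occupy C[R[i]] .. C[R[i+1]-1].
#include <cstdint>
#include <utility>
#include <vector>

struct CsrGraph {
    std::vector<std::int64_t> R;  // row offsets, length n + 1
    std::vector<std::int64_t> C;  // column indices, length m
};

CsrGraph build_csr(std::int64_t n,
                   const std::vector<std::pair<std::int64_t, std::int64_t>>& edges) {
    CsrGraph g;
    g.R.assign(n + 1, 0);
    for (const auto& e : edges) g.R[e.first + 1]++;              // count out-degrees
    for (std::int64_t i = 0; i < n; ++i) g.R[i + 1] += g.R[i];   // prefix sum -> row offsets
    g.C.resize(edges.size());
    std::vector<std::int64_t> cursor(g.R.begin(), g.R.end() - 1); // per-row write cursors
    for (const auto& e : edges) g.C[cursor[e.first]++] = e.second; // scatter column indices
    return g;
}
```

Edges are placed in the order they appear in the input, matching the policy described below of storing edges in the order they are defined without offline preprocessing.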
ALGORITHM 1 The simple sequential breadth-first search algorithm for marking vertex distances from
the source s. Alternatively, a shortest-paths search tree can be constructed by marking i as j's predecessor
in line 11.
Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s
Output: Array dist[0..n-1] with dist[v] holding the distance from s to v
Functions: Enqueue(val) inserts val at the end of the queue instance. Dequeue() returns the front
element of the queue instance.
1   Q := {}
2   for i in V:
3       dist[i] := ∞
4   dist[s] := 0
5   Q.Enqueue(s)
6   while (Q != {}) :
7       i = Q.Dequeue()
8       for offset in R[i] .. R[i+1]-1 :
9           j := C[offset]
10          if (dist[j] == ∞)
11              dist[j] := dist[i] + 1
12              Q.Enqueue(j)
We store graph edges in the order they are defined. We do not perform any offline pre-
processing in order to improve locality of reference, improve load balance, or eliminate
sparse memory references. Such strategies might include sorting neighbors within
their adjacency lists, sorting vertices into a space-filling curve and remapping their
corresponding vertex identifiers, splitting up vertices having large adjacency lists, en-
coding adjacency row-offset and length information into vertex identifiers, removing
duplicate edges, singleton vertices, and self-loops, etc.
Algorithm 1 presents the standard sequential BFS method. It operates by circulating
the vertices of the graph through a FIFO queue that is initialized with v_s [Cormen
et al. 2001]. As vertices are dequeued, their neighbors are examined, and unvisited
neighbors are labeled with their distance and/or predecessor and enqueued for later
processing. This algorithm performs linear O(m + n) work since each vertex is labeled
exactly once and each edge is traversed exactly once.
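For reference, a direct C++ rendering of Algorithm 1 over the CSR arrays might look as follows; the function name and the use of -1 as a stand-in for the ∞ label are illustrative choices.

```cpp
// C++ rendering of Algorithm 1 (sequential BFS) over CSR arrays R and C.
#include <cstdint>
#include <queue>
#include <vector>

std::vector<std::int64_t> bfs_sequential(const std::vector<std::int64_t>& R,
                                         const std::vector<std::int64_t>& C,
                                         std::int64_t s) {
    const std::int64_t n = static_cast<std::int64_t>(R.size()) - 1;
    std::vector<std::int64_t> dist(n, -1);   // -1 stands in for infinity
    std::queue<std::int64_t> q;
    dist[s] = 0;
    q.push(s);
    while (!q.empty()) {
        std::int64_t i = q.front();
        q.pop();
        for (std::int64_t offset = R[i]; offset < R[i + 1]; ++offset) {
            std::int64_t j = C[offset];
            if (dist[j] == -1) {             // unvisited neighbor
                dist[j] = dist[i] + 1;       // or record i as j's predecessor
                q.push(j);
            }
        }
    }
    return dist;
}
```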
2.2. Parallel Breadth-First Search
The FIFO ordering of the sequential algorithm forces it to label vertices in increasing
order of depth, where each depth level is fully explored before the next. Most parallel
BFS algorithms are level-synchronous, that is, each level may be processed in parallel
so long as the sequential ordering of levels is preserved. An implicit race condition
can exist where multiple tasks may concurrently discover a vertex v_j. This is generally
considered benign since all such contending tasks would apply the same d_j and give a
valid value of p_j.
Structurally different methods may be more suitable for graphs with very large diameters,
such as algorithms based on the method of Ullman and Yannakakis [1990];
such alternatives are beyond the scope of this article.
As illustrated in Figure 2, each iteration of a level-synchronous method identifies
both an edge and vertex frontier. The edge frontier is the set of all edges to be traversed
during this iteration or, equivalently, the set of all A_i where v_i was marked in
the previous iteration.
Fig. 2. Example BFS frontier evolution from source vertex v_0. For each vertex, the distance from the source
is the BFS iteration in which it appeared in the vertex frontier.
ALGORITHM 2 A simple quadratic-work, vertex-oriented BFS parallelization.
Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s
Output: Array dist[0..n-1] with dist[v] holding the distance from s to v
1   parallel for (i in V) :
2       dist[i] := ∞
3   dist[s] := 0
4   iteration := 0
5   do :
6       done := true
7       parallel for (i in V) :
8           if (dist[i] == iteration)
9               done := false
10              for (offset in R[i] .. R[i+1]-1) :
11                  j := C[offset]
12                  if (dist[j] == ∞)
13                      dist[j] := iteration + 1
14      iteration++
15  while (!done)
ALGORITHM 3 A linear-work BFS parallelization constructed using a global vertex-frontier queue.
Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s, queues
Output: Array dist[0..n-1] with dist[v] holding the distance from s to v
Functions: LockedEnqueue(val) safely inserts val at the end of the queue instance
1   parallel for (i in V) :
2       dist[i] := ∞
3   dist[s] := 0
4   iteration := 0
5   inQ := {}
6   inQ.LockedEnqueue(s)
7   while (inQ != {}) :
8       outQ := {}
9       parallel for (i in inQ) :
10          for (offset in R[i] .. R[i+1]-1) :
11              j := C[offset]
12              if (dist[j] == ∞)
13                  dist[j] := iteration + 1
14                  outQ.LockedEnqueue(j)
15      iteration++
16      inQ := outQ
The vertex frontier, by contrast, is the unique subset of such
neighbors that are unmarked and which will be labeled and expanded for the next
iteration. Each BFS iteration comprises two logical phases that realize these frontiers.
(1) Neighbor expansion. The vertices in the vertex frontier are expanded into an edge
frontier of neighboring vertex identifiers.
(2) Status lookup and filtering. The neighbor identifiers in the edge frontier are con-
tracted into a vertex frontier of unvisited vertices.
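The sketch below, plain host-side C++ rather than the paper's GPU kernels, shows one level-synchronous iteration decomposed into these two phases; the -1 sentinel again stands in for the ∞ label, and the names are illustrative.

```cpp
// Illustrative decomposition of one BFS iteration into expansion and contraction.
#include <cstdint>
#include <vector>

void bfs_level(const std::vector<std::int64_t>& R, const std::vector<std::int64_t>& C,
               std::vector<std::int64_t>& dist, std::int64_t iteration,
               const std::vector<std::int64_t>& vertex_frontier,
               std::vector<std::int64_t>& next_frontier) {
    // Phase 1 (neighbor expansion): gather the adjacency lists of the current
    // vertex frontier into an edge frontier of neighbor identifiers.
    std::vector<std::int64_t> edge_frontier;
    for (std::int64_t i : vertex_frontier)
        for (std::int64_t offset = R[i]; offset < R[i + 1]; ++offset)
            edge_frontier.push_back(C[offset]);

    // Phase 2 (status lookup and filtering): keep only unvisited neighbors,
    // label them, and emit them as the next vertex frontier.
    next_frontier.clear();
    for (std::int64_t j : edge_frontier) {
        if (dist[j] == -1) {
            dist[j] = iteration + 1;
            next_frontier.push_back(j);
        }
    }
}
```

A full traversal would initialize the frontier to the source vertex and repeat this step until next_frontier comes back empty.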
Quadratic Work Parallelizations. The simplest parallel BFS algorithms inspect ev-
ery edge or, at a minimum, every vertex during every iteration. These methods per-
form a quadratic amount of work. A vertex v_j is marked when a task discovers an edge
v_i → v_j, where v_i has been marked and v_j has not. As Algorithm 2 illustrates, vertex-
oriented variants must subsequently expand and mark the neighbors of v_j. Their work
complexity is O(n² + m) as there may be n BFS iterations in the worst case.
Quadratic parallelization strategies have been used by almost all prior GPU
implementations. The static assignment of tasks to vertices (or edges) trivially maps
to the data-parallel GPU machine model, and each thread’s computation is completely