Achieving up to Zero Communication Delay in
BSP-based Graph Processing via Vertex
Categorization
Xuhong Zhang, Ruijun Wang, Xunchao Chen, Jun Wang, Tyler Lukasiewicz
Department of Electrical Engineering & Computer Science
University of Central Florida
Orlando, Florida 32826
{xzhang, ruijun, xchen, jwang}@eecs.ucf.edu
Dezhi Han
College of Information Engineering
Shanghai Maritime University
Shanghai, 201306, China
dezhihan88@sina.com.cn
Abstract—The Bulk Synchronous Parallel (BSP) model, which divides a graph algorithm into multiple supersteps, has become extremely popular in distributed graph processing systems. However, the large number of network messages exchanged in each superstep of a graph algorithm creates a long idle period, which we refer to as the communication delay. Furthermore, BSP's global synchronization barrier does not allow computation in the next superstep to be scheduled during this communication delay, which makes up a large percentage of the overall processing time of a superstep. While most recent research has focused on reducing the number of network messages, communication delay remains a decisive factor in overall performance. In this paper, we add a runtime communication and computation scheduler to current graph BSP implementations. The scheduler moves some computation from the next superstep into the communication phase of the current superstep to mitigate the communication delay. Finally, we prototyped our system, Zebra, on Apache Hama, an open-source clone of the classic Google Pregel. By running a set of graph algorithms on an in-house cluster, our evaluation shows that our system completely eliminates the communication delay in the best case and achieves an average 2X speedup over Hama.
I. INTRODUCTION
Graph data structures are widely used to model structural
relationships among objects. For example, Web graphs, social
networks, knowledge bases and protein interactions are all
modeled with graphs. These graphs are growing at an aston-
ishing rate. For example, Facebook’s social graph has scaled to
trillions of edges [4]. Performing analytics on these enormous
graphs is becoming more challenging as they continue to
grow. The traditional MapReduce framework is not efficient at
graph processing due to the special features of graph structures
and algorithms [18], [19]. To better utilize these features,
Google proposed Pregel [12]. Pregel's model is very popular and has led to the emergence of many currently widely
used distributed graph processing frameworks such as Apache
Hama [2], Apache Giraph [1], GPS [19], GraphLab [11], and
Mizan [8].
All of these Pregel-like systems are implemented based
on the Bulk Synchronous Parallel (BSP) model [22], which
divides a graph algorithm into multiple supersteps. Within
each superstep, each vertex executes the same vertex program:
combine messages from neighboring vertices, apply the combined messages to update the vertex value, and send new messages to neighboring vertices. All messages are transmitted
along edges. From a high level perspective, the execution
flow of all vertices in a superstep can be viewed as a
three phase process: computation, communication, and barrier
synchronization. The barrier between two supersteps is used
to coordinate the parallel execution of every vertex program
across a cluster of compute nodes, where each node holds
a portion of the whole graph. Each node will be completely
dedicated to the transmission of messages during the communication and synchronization phases. Some edges in the graph are cut when the graph is partitioned, and messages transmitted along these cut edges become network messages. For
large graphs, millions and even billions of network messages
can be passed during each superstep. In addition, the barrier
between supersteps is so strict that if even one message on
some node is not sent during the communication phase, all
other nodes must remain idle and the next superstep cannot be
initiated. We refer to this period when all nodes are occupied
with the transmission of messages as a communication delay.
This communication delay dominates a superstep [3] and results in severe CPU resource underutilization.
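The three-phase execution flow described above can be sketched as a minimal single-node simulation in Python. This is a hypothetical sketch, not Apache Hama's actual API; handing the returned outbox to the next call plays the role of the communication phase and the barrier.

```python
# Minimal single-node sketch of one BSP superstep (hypothetical API,
# not Apache Hama's): each vertex combines its incoming messages,
# applies the update, and produces new messages along its out-edges.

def superstep(values, edges, inbox):
    """values: {vertex: value}; edges: {vertex: [out-neighbor, ...]};
    inbox: {vertex: [messages received in the previous superstep]}."""
    new_values, outbox = {}, {v: [] for v in values}
    for v in values:                          # computation phase
        combined = sum(inbox.get(v, []))      # 1) combine messages
        new_values[v] = values[v] + combined  # 2) apply the update
        for n in edges.get(v, []):            # 3) produce new messages
            outbox[n].append(new_values[v])
    # returning the outbox as the next superstep's inbox stands in for
    # the communication phase and the synchronization barrier
    return new_values, outbox

# Two supersteps on a 3-vertex cycle, starting from value 1 everywhere.
values, edges = {0: 1, 1: 1, 2: 1}, {0: [1], 1: [2], 2: [0]}
values, inbox = superstep(values, edges, {})
values, inbox = superstep(values, edges, inbox)
```

In a real distributed run, part of the outbox would consist of network messages to other nodes, which is exactly where the communication delay arises.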
Current solutions attempt to reduce the number of network
messages, but there still exists a non-negligible communication
delay [7], [6], [12]. Therefore, we investigate this issue from a
new angle, scheduling computation during the communication
delay with a new refined BSP sync barrier. The barrier will
take advantage of the special features of graph structures
and algorithms while maintaining the two most important
synchronization properties in the BSP graph processing [23].
• Consistency: At the beginning of each superstep, a
vertex’s computing function can be triggered if and only
if all its incoming messages from neighbors have been
received.
• Isolation: Within the same superstep, newly generated messages from any vertex will not be seen by any other vertex.

978-1-4673-7891-8/15/$31.00 ©2015 IEEE
We have discovered two underutilized localities provided by the graph structure, vertex locality and edge locality, which can help build a new refined BSP barrier. We say that
a vertex has the property of vertex locality if all of its
incoming neighbor vertices are located on the same node.
In this paper, we categorize this kind of vertex as a local vertex and the others as remote vertices. Edge locality refers to the percentage of non-cut incoming edges of a remote vertex.
In this paper, we regard messages received through non-cut
edges as local messages and messages through cut edges as
remote messages. The vertex program essentially consists
of two loosely coupled operations: message consuming and
message producing. Vertex locality ensures that the consis-
tency property is maintained without synchronizing at the
barrier, since all incoming messages for the next superstep
are local messages and are instantly available in memory
after the local machine’s computation phase in the current
superstep finishes. By maintaining the isolation property, both the message consuming and message producing operations on these local vertices in the next superstep can be directly initiated before the costly barrier. The barrier then only needs to synchronize remote vertices. Moreover, edge locality still allows some message consuming operations on remote vertices in the next superstep to be scheduled before the barrier. The
degrees of these two localities are very high in real world
graph data. Detailed examination is in Section IV-B.
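The two localities can be made concrete with a small sketch. The names below (`categorize`, `partition`, `edge_locality`) are ours for illustration, not Zebra's actual identifiers.

```python
# Hypothetical sketch of vertex categorization: a vertex is "local" if
# every incoming neighbor lives on the same node, otherwise "remote".

def categorize(in_neighbors, partition):
    """in_neighbors: {vertex: [incoming neighbor, ...]};
    partition: {vertex: id of the node holding that vertex}."""
    local, remote = set(), set()
    for v, nbrs in in_neighbors.items():
        if all(partition[n] == partition[v] for n in nbrs):
            local.add(v)   # all incoming messages stay in local memory
        else:
            remote.add(v)  # must wait for at least one network message
    return local, remote

def edge_locality(v, in_neighbors, partition):
    """Fraction of v's incoming edges that are not cut by the partition."""
    nbrs = in_neighbors[v]
    return sum(partition[n] == partition[v] for n in nbrs) / len(nbrs)

# Vertices 0 and 1 on node "A"; vertex 2 on node "B".
in_neighbors = {0: [1], 1: [0, 2], 2: [0]}
partition = {0: "A", 1: "A", 2: "B"}
local, remote = categorize(in_neighbors, partition)
```

Here vertex 0 is local (its only incoming neighbor is co-located), while vertices 1 and 2 are remote; vertex 1 still has an edge locality of 0.5, so half of its incoming messages are available without waiting for the network.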
In this paper, a run-time computation and communication
scheduler is proposed so that some computation in the next
superstep can be scheduled to be executed during the commu-
nication delay phase in the current superstep. We first develop
a runtime vertex categorization scheme with no preprocessing
overhead to utilize vertex locality. With this categorization, our
scheduler can schedule all of the computation on local vertices
in the next superstep to the communication delay phase in
the current superstep. To further utilize edge locality, we
decouple the vertex computation into message consuming and
message producing operations so that our scheduler can move
all consuming operations on local messages of remote vertices
in the next superstep to the communication delay phase in the
current superstep. Through this overlapping of computation
and communication, our solution dramatically mitigates the communication delay: it completely eliminates the delay in the best case and achieves an average 2X speedup over Hama.
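A simplified sketch of this split, under our own naming (Zebra's real scheduler works inside Hama's runtime, which we do not reproduce here), might look like the following: local vertices finish their full computation during the communication delay, remote vertices only consume their already-available local messages, and the remaining work waits for the barrier.

```python
# Hypothetical sketch of the computation/communication overlap: work
# scheduled during the current superstep's communication delay vs.
# work that must wait for the synchronization barrier.

def overlap_phase(local, remote, local_inbox):
    """Runs during the communication delay. local_inbox holds messages
    delivered along non-cut edges, already available in memory."""
    # every vertex can safely consume its local messages (isolation holds)
    partial = {v: sum(local_inbox.get(v, [])) for v in local | remote}
    # local vertices also satisfy consistency, so they finish early and
    # may already produce their next-superstep messages
    produced = {v: partial[v] for v in local}
    return partial, produced

def after_barrier(remote, partial, remote_inbox):
    """Runs once the barrier releases: merge the network messages for
    remote vertices into their partially combined values."""
    return {v: partial[v] + sum(remote_inbox.get(v, [])) for v in remote}

# Vertex 0 is local, vertex 1 is remote with one pending network message.
partial, produced = overlap_phase({0}, {1}, {0: [1, 2], 1: [3]})
final = after_barrier({1}, partial, {1: [4]})
```

The key point the sketch illustrates is that only `after_barrier` depends on the network, so everything in `overlap_phase` is hidden behind the communication delay.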
In summary, this paper makes the following contributions:
• Extensive examination of new graph locality in dis-
tributed graph processing.
• A run-time vertex categorization with no preprocessing
overhead.
• A decoupling of the vertex program into message consuming and message producing operations.
• A run-time computation and communication scheduler.
• A prototype based on Apache Hama and a thorough evaluation of our proposed system.
The rest of the paper is organized as follows. Section II
presents the background and motivation of our paper. Sec-
tion III describes Zebra’s system design and execution flow.
Extensive graph locality examination is done in Section IV-B.
Section IV-D evaluates our system’s performance against
Hama. We then describe the related work in Section V and
finally conclude our paper in Section VI.
II. BACKGROUND AND MOTIVATION
[Figure 1: supersteps 1-5 of the BSP model, each consisting of a computation phase, a communication phase, and a synchronization barrier.]
Fig. 1. BSP Model
A. BSP Model
BSP is a parallel programming model consisting of a set of processor-memory pairs, a communication network, and a mechanism for efficient barrier synchronization. Figure 1 illustrates an outline of the BSP model, which expresses an algorithm as a sequence of supersteps. In distributed BSP-based graph processing systems, vertices are partitioned across
compute nodes. These vertices send messages along edges to
perform computation. In this model, all vertices are assigned an active status at the beginning. Each active vertex can switch itself into an inactive status in any superstep. An inactive vertex is switched back to an active status if it receives a message during the execution of any subsequent superstep. A graph algorithm terminates when there are no active vertices or when a user-defined maximum number of supersteps is reached. Figure 2 gives an example implementation of PageRank in the BSP model.
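Figure 2 is not reproduced here, but a PageRank vertex program in the BSP style can be sketched as follows. This is a single-node Python simulation under our own naming, not the figure's actual code: in each superstep a vertex sums its incoming messages, applies the damping update, and sends its rank share along each out-edge.

```python
# Hypothetical single-node sketch of PageRank in the BSP model.
DAMPING = 0.85  # standard PageRank damping factor

def pagerank_bsp(edges, supersteps):
    """edges: {vertex: [out-neighbor, ...]}, each vertex with out-degree >= 1."""
    n = len(edges)
    ranks = {v: 1.0 / n for v in edges}
    inbox = {v: [] for v in edges}
    for step in range(supersteps):
        outbox = {v: [] for v in edges}
        for v in edges:
            if step > 0:  # superstep 0 only scatters the initial ranks
                ranks[v] = (1 - DAMPING) / n + DAMPING * sum(inbox[v])
            for u in edges[v]:  # send rank share along each out-edge
                outbox[u].append(ranks[v] / len(edges[v]))
        inbox = outbox  # communication phase + synchronization barrier
    return ranks

# On a 3-vertex cycle the ranks stay uniform at 1/3.
ranks = pagerank_bsp({0: [1], 1: [2], 2: [0]}, supersteps=20)
```

In a distributed setting, each append to a neighbor held on another node becomes one of the network messages whose volume drives the communication delay discussed next.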
B. Issues with Current BSP Graph Processing
Though the BSP model has been shown by many recent
graph processing systems to be a simple yet effective approach
to handling large scale graph applications, current implemen-
tations overlook some special features of graph structure and
algorithms. In this paper, we focus on the following two issues that arise from current BSP implementations in graph processing systems.
1) Communication Delay: Recent graph system papers [14], [3], [19] report that the communication delay occupies more than half of the overall processing time. We also ran PageRank on the Twitter graph (41.7 million vertices, 1.47 billion edges) to verify this. Figure 3 shows the time breakdown of multiple supersteps. As observed, communication delay