没有合适的资源?快使用搜索试试~ 我知道了~
批量同步并行(BSP)模型将绘图算法分为多个超步,在分布式图形处理系统中已变得非常流行。 但是,在图算法的每个超级步骤中交换的大量网络消息将创建很长的时间。 我们将此称为通信延迟。 此外,BSP的全局同步屏障不允许在此通信延迟期间调度下一个超级strep中的计算。 这种通信延迟在超步的总处理时间中占很大比例。 尽管最近的研究集中在减少网络消息的数量上,但是通信延迟仍然是整体性能的决定性因素。 在本文中,我们将运行时通信和计算调度程序添加到当前的图BSP实现中。 该调度程序会将一些计算从下一个超级步骤移至当前超级步骤中的通信阶段,以减轻通信延迟。 最后,我们在Apache Hama上对我们的系统ebra进行了原型设计,Apache Hama是经典Google Pregel的开源克隆。 通过在内部群集上运行一组图算法,我们的评估表明,我们的系统可以最大程度地消除通信延迟的情况下,可以达到Hama的平均2倍加速。
资源推荐
资源详情
资源评论
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![7z](https://img-home.csdnimg.cn/images/20241231044736.png)
![doc](https://img-home.csdnimg.cn/images/20241231044833.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![](https://csdnimg.cn/release/download_crawler_static/15737187/bg1.jpg)
Achieving up to Zero Communication Delay in
BSP-based Graph Processing via Vertex
Categorization
Xuhong Zhang, Ruijun Wang, Xunchao Chen, Jun Wang, Tyler Lukasiewicz
Department of Electrical Engineering & Computer Science
University of Central Florida
Orlando, Florida 32826
{xzhang, ruijun, xchen, jwang}@eecs.ucf.edu
Dezhi Han
College of Information Engineering
Shanghai Maritime University
Shanghai, 201306, China
dezhihan88@sina.com.cn
Abstract—The Bulk Synchronous Parallel (BSP) model, which
divides a graphing algorithm into multiple supersteps, has become
extremely popular in distributed graph processing systems. How-
ever, the high number of network messages exchanged in each
superstep of the graph algorithm will create a long period of
time. We refer to this as a communication delay. Furthermore, the
BSP’s global synchronization barrier does not allow computation
in the next superstrep to be scheduled during this communication
delay. This communication delay makes up a large percentage
of the overall processing time of a superstep. While most recent
research has focused on reducing number of network messages,
but communication delay is still a deterministic factor for overall
performance. In this paper, we add a runtime communication and
computation scheduler into current graph BSP implementations.
This scheduler will move some computation from the next
superstep to the communication phase in the current superstep
to mitigate the communication delay. Finally, we prototyped our
system, Zebra, on Apache Hama, which is an open source clone of
the classic Google Pregel. By running a set of graph algorithms
on an in-house cluster, our evaluation shows that our system
could completely eliminate the communication delay in the best
case and can achieve average 2X speedup over Hama.
I. INTRODUCTION
Graph data structures are widely used to model structural
relationships among objects. For example, Web graphs, social
networks, knowledge bases and protein interactions are all
modeled with graphs. These graphs are growing at an aston-
ishing rate. For example, Facebook’s social graph has scaled to
trillions of edges [4]. Performing analytics on these enormous
graphs is becoming more challenging as they continue to
grow. The traditional MapReduce framework is not efficient at
graph processing due to the special features of graph structures
and algorithms [18], [19]. To better utilize these features,
Google has proposed Pregel [12]. Pregel’s model is very
popular and has lead to the emergence of many current widely
used distributed graph processing frameworks such as Apache
Hama [2], Apache Giraph [1], GPS [19], GraphLab [11], and
Mizan [8].
All of these Pregel-like systems are implemented based
on the Bulk Synchronous Parallel (BSP) model [22], which
divides a graphing algorithm into multiple supersteps. Within
each superstep, each vertex executes the same vertex program:
combine messages from neighboring vertices, apply the com-
bined messages to update the vertex value, and send new
message to neighboring vertices. All messages are transmitted
along edges. From a high level perspective, the execution
flow of all vertices in a superstep can be viewed as a
three phase process: computation, communication, and barrier
synchronization. The barrier between two supersteps is used
to coordinate the parallel execution of every vertex program
across a cluster of compute nodes, where each node holds
a portion of the whole graph. Each node will be completely
dedicated to the transmission of messages during the commu-
nication and synchronization phases. Since some edges in the
graph will be cut when the graph is partitioned and messages
transmitted along cut edges will be network messages. For
large graphs, millions and even billions of network messages
can be passed during each superstep. In addition, the barrier
between supersteps is so strict that if even one message on
some node is not sent during the communication phase, all
other nodes must remain idle and the next superstep cannot be
initiated. We refer to this period when all nodes are occupied
with the transmission of messages as a communication delay.
This communication delay dominates a superstep [3] and
results in server CPU resource underutilization.
Current solutions attempt to reduce the number of network
messages, but there still exists a non-negligible communication
delay [7], [6], [12]. Therefore, we investigate this issue from a
new angle, scheduling computation during the communication
delay with a new refined BSP sync barrier. The barrier will
take advantage of the special features of graph structures
and algorithms while maintaining the two most important
synchronization properties in the BSP graph processing [23].
• Consistency: At the beginning of each superstep, a
vertex’s computing function can be triggered if and only
if all its incoming messages from neighbors have been
received.
• Isolation: Within the same superstep, newly generated
messages from any vertex will not be seen by any other
112978-1-4673-7891-8/15/$31.00 ©2015 IEEE
![](https://csdnimg.cn/release/download_crawler_static/15737187/bg2.jpg)
vertex.
We have discovered two underutilized localities provided
by the graph structure: vertex locality and edge locality,
that can help build a new refined BSP barrier. We say that
a vertex has the property of vertex locality if all of its
incoming neighbor vertices are located on the same node.
In this paper, we categorize this kind of vertex as local
vertex and others as remote vertex. Edge locality refers to
the percentage of non-cut incoming edges of remote vertex.
In this paper, we regard messages received through non-cut
edges as local messages and messages through cut edges as
remote messages. The Vertex program essentially consists
of two loosely coupled operations: message consuming and
message producing. Vertex locality ensures that the consis-
tency property is maintained without synchronizing at the
barrier, since all incoming messages for the next superstep
are local messages and are instantly available in memory
after the local machine’s computation phase in the current
superstep finishes. By maintaining the isolation property, both
the message consuming and message producing operations
in the next superstep on these local vertices can be directly
initiated before the cost barrier. Now the barrier will only
synchronize remote vertices. However, edge locality could still
allow some message consuming operations on remote verticies
in the next superstep to be scheduled before the barrier. The
degrees of these two localities are very high in real world
graph data. Detailed examination is in Section IV-B.
In this paper, a run-time computation and communication
scheduler is proposed so that some computation in the next
superstep can be scheduled to be executed during the commu-
nication delay phase in the current superstep. We first develop
a runtime vertex categorization scheme with no preprocessing
overhead to utilize vertex locality. With this categorization, our
scheduler can schedule all of the computation on local vertices
in the next superstep to the communication delay phase in
the current superstep. To further utilize edge locality, we
decouple the vertex computation into message consuming and
message producing operations so that our scheduler can move
all consuming operation on local messages of remote vertices
in the next superstep to the communication delay phase in the
current superstep. Through this overlapping of computation
and communication, our solution can dramatically mitigate
the communications delay. Our proposed solution could totally
eliminates the communication delay in the best case and can
achieve average 2X speedup over Hama.
In summary, this paper makes the following contributions:
• Extensive examination of new graph locality in dis-
tributed graph processing.
• A run-time vertex categorization with no preprocessing
overhead.
• Decouple the vertex program into message consuming
and message producing operations.
• A run-time computation and communication scheduler.
• A prototype based on Apache Hama and a thorough
evaluation on our proposed system.
The rest of the paper is organized as follows. Section II
presents the background and motivation of our paper. Sec-
tion III describes Zebra’s system design and execution flow.
Extensive graph locality examination in done Section IV-B.
Section IV-D evaluates our system’s performance against
Hama. We then describe the related work in Section V and
finally conclude our paper in Section VI.
II. B
ACKGROUND AND MOTIVATION
^LJŶĐŚƌŽŶŝnjĂƚŝŽŶĂƌƌŝĞƌ
ŽŵƉƵƚĂƚŝŽŶ
ŽŵŵƵŶŝĐĂƚŝŽŶ
ϭ Ϯ ϯ ϰ ϱ
Fig. 1. BSP Model
A. BSP Model
BSP is a parallel programming model consists of a set
of prossesor-memory pairs, a communications network and
a mechanism for efficient barrier synchronization. Figure 1
illustrates an outline of the BSP model, which expresses
algorithm as a sequence of supersteps. In distributed BSP
based graph processing systems, vertices are partitioned across
compute nodes. These vertices send messages along edges to
perform computation. In this model, all vertices are assigned
an active status at the beginning. Each active vertex can switch
itself into inactive status in each superstep. An inactive vertex
can also be switched to an active status if it receives a message
during the execution of any subsequent supersteps. Graph
algorithm will terminate when there are no active vertices or
when a user defined maximum superstep is reached. Figure 2
gives an example implementation of PageRank in BSP model.
B. Issues with Current BSP Graph Processing
Though the BSP model has been shown by many recent
graph processing systems to be a simple yet effective approach
to handling large scale graph applications, current implemen-
tations overlook some special features of graph structure and
algorithm. In this paper, we focus the following two issues that
arise from the current BSP implementation in graph processing
systems
1) Communication Delay: Recent graph system pa-
pers [14], [3], [19] report that the communication delay occu-
pies more than half of the overall processing time. We also run
PageRank on Twitter graph (41.7 million vertices, 1.47 billion
edges) to verify this. Figure 3 shows the time break down
of multiple supersteps. As observed, communication delay
113
剩余9页未读,继续阅读
资源评论
![avatar-default](https://csdnimg.cn/release/downloadcmsfe/public/img/lazyLogo2.1882d7f4.png)
![avatar](https://profile-avatar.csdnimg.cn/default.jpg!1)
weixin_38740596
- 粉丝: 3
- 资源: 986
上传资源 快速赚钱
我的内容管理 展开
我的资源 快来上传第一个资源
我的收益
登录查看自己的收益我的积分 登录查看自己的积分
我的C币 登录后查看C币余额
我的收藏
我的下载
下载帮助
![voice](https://csdnimg.cn/release/downloadcmsfe/public/img/voice.245cc511.png)
![center-task](https://csdnimg.cn/release/downloadcmsfe/public/img/center-task.c2eda91a.png)
最新资源
- 跨平台开发指南-YOLOv11在Android、iOS端实时检测落地实践.pdf
- 跨平台开发指南-YOLOv11模型转ONNX及移动端部署最佳实践.pdf
- MATLAB实现IBES-ELM基于改进的秃鹰搜索优化算法优化极限学习机的数据回归预测 (含模型描述及示例代码)
- 跨域迁移学习-YOLOv11在极地科考中的冰雪目标快速适配方案.pdf
- 零基础入门YOLOv11-从PyTorch训练到ONNX跨平台部署全流程.pdf
- 跨行业应用-YOLOv11在野生动物追踪与生态监测中的创新实践.pdf
- 零售场景深度应用-YOLOv11实现货架商品识别与库存动态管理.pdf
- 零售盗窃预防-YOLOv11实时异常行为检测与报警联动方案.pdf
- 零售场景落地-YOLOv11多目标顾客行为分析与货架陈列优化系统(新零售).pdf
- 零售货架管理-YOLOv11缺货检测与SKU匹配自动化系统设计.pdf
- 零售货架管理-YOLOv11商品缺货预警与陈列合规性检测模型部署.pdf
- 零售货架管理-YOLOv11商品缺货检测与陈列合规性自动审核.pdf
- 零售结算革命-YOLOv11多商品并行识别与自动计价技术实现.pdf
- 零售货架智能管理-YOLOv11商品缺货检测与补货提醒.pdf
- 零售结算革命-YOLOv11+RFID融合的无人便利店商品识别方案.pdf
- 零售收银升级-YOLOv11商品自动识别与价格结算系统开发.pdf
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
![feedback](https://img-home.csdnimg.cn/images/20220527035711.png)
![feedback](https://img-home.csdnimg.cn/images/20220527035711.png)
![feedback-tip](https://img-home.csdnimg.cn/images/20220527035111.png)
安全验证
文档复制为VIP权益,开通VIP直接复制
![dialog-icon](https://csdnimg.cn/release/downloadcmsfe/public/img/green-success.6a4acb44.png)