Achieving up to Zero Communication Delay in
BSP-based Graph Processing via Vertex
Categorization
Xuhong Zhang, Ruijun Wang, Xunchao Chen, Jun Wang, Tyler Lukasiewicz
Department of Electrical Engineering & Computer Science
University of Central Florida
Orlando, Florida 32826
{xzhang, ruijun, xchen, jwang}@eecs.ucf.edu
Dezhi Han
College of Information Engineering
Shanghai Maritime University
Shanghai, 201306, China
dezhihan88@sina.com.cn
Abstract—The Bulk Synchronous Parallel (BSP) model, which divides a graph algorithm into multiple supersteps, has become extremely popular in distributed graph processing systems. However, the large number of network messages exchanged in each superstep of a graph algorithm creates a long idle period, which we refer to as the communication delay. Furthermore, BSP's global synchronization barrier does not allow computation in the next superstep to be scheduled during this communication delay, which makes up a large percentage of the overall processing time of a superstep. While most recent research has focused on reducing the number of network messages, communication delay remains a decisive factor in overall performance. In this paper, we add a runtime communication and computation scheduler to current graph BSP implementations. The scheduler moves some computation from the next superstep into the communication phase of the current superstep to mitigate the communication delay. Finally, we prototyped our system, Zebra, on Apache Hama, an open-source clone of the classic Google Pregel. By running a set of graph algorithms on an in-house cluster, our evaluation shows that our system completely eliminates the communication delay in the best case and achieves an average 2X speedup over Hama.
I. INTRODUCTION
Graph data structures are widely used to model structural
relationships among objects. For example, Web graphs, social
networks, knowledge bases and protein interactions are all
modeled with graphs. These graphs are growing at an aston-
ishing rate. For example, Facebook’s social graph has scaled to
trillions of edges [4]. Performing analytics on these enormous
graphs is becoming more challenging as they continue to
grow. The traditional MapReduce framework is not efficient at
graph processing due to the special features of graph structures
and algorithms [18], [19]. To better utilize these features,
Google proposed Pregel [12]. Pregel's model is very popular and has led to the emergence of many currently widely
used distributed graph processing frameworks such as Apache
Hama [2], Apache Giraph [1], GPS [19], GraphLab [11], and
Mizan [8].
All of these Pregel-like systems are implemented based
on the Bulk Synchronous Parallel (BSP) model [22], which
divides a graph algorithm into multiple supersteps. Within
each superstep, each vertex executes the same vertex program:
combine messages from neighboring vertices, apply the combined messages to update the vertex value, and send new messages to neighboring vertices. All messages are transmitted
along edges. From a high level perspective, the execution
flow of all vertices in a superstep can be viewed as a
three phase process: computation, communication, and barrier
synchronization. The barrier between two supersteps is used
to coordinate the parallel execution of every vertex program
across a cluster of compute nodes, where each node holds
a portion of the whole graph. Each node will be completely
dedicated to the transmission of messages during the communication and synchronization phases. Some edges in the graph are cut when the graph is partitioned, and messages transmitted along these cut edges become network messages. For
large graphs, millions and even billions of network messages
can be passed during each superstep. In addition, the barrier
between supersteps is so strict that if even one message on
some node is not sent during the communication phase, all
other nodes must remain idle and the next superstep cannot be
initiated. We refer to this period when all nodes are occupied
with the transmission of messages as a communication delay.
This communication delay dominates a superstep [3] and results in severe CPU resource underutilization.
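The three-phase execution flow described above can be sketched as a minimal single-node simulation in Python. This is a hypothetical sketch, not Apache Hama's actual API; handing the returned outbox to the next call plays the role of the communication phase and the barrier.

```python
# Minimal single-node sketch of one BSP superstep (hypothetical API,
# not Apache Hama's): each vertex combines its incoming messages,
# applies the update, and produces new messages along its out-edges.

def superstep(values, edges, inbox):
    """values: {vertex: value}; edges: {vertex: [out-neighbor, ...]};
    inbox: {vertex: [messages received in the previous superstep]}."""
    new_values, outbox = {}, {v: [] for v in values}
    for v in values:                          # computation phase
        combined = sum(inbox.get(v, []))      # 1) combine messages
        new_values[v] = values[v] + combined  # 2) apply the update
        for n in edges.get(v, []):            # 3) produce new messages
            outbox[n].append(new_values[v])
    # returning the outbox as the next superstep's inbox stands in for
    # the communication phase and the synchronization barrier
    return new_values, outbox

# Two supersteps on a 3-vertex cycle, starting from value 1 everywhere.
values, edges = {0: 1, 1: 1, 2: 1}, {0: [1], 1: [2], 2: [0]}
values, inbox = superstep(values, edges, {})
values, inbox = superstep(values, edges, inbox)
```

In a real distributed run, part of the outbox would consist of network messages to other nodes, which is exactly where the communication delay arises.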
Current solutions attempt to reduce the number of network
messages, but there still exists a non-negligible communication
delay [7], [6], [12]. Therefore, we investigate this issue from a
new angle, scheduling computation during the communication
delay with a new refined BSP sync barrier. The barrier will
take advantage of the special features of graph structures
and algorithms while maintaining the two most important
synchronization properties in the BSP graph processing [23].
• Consistency: At the beginning of each superstep, a
vertex’s computing function can be triggered if and only
if all its incoming messages from neighbors have been
received.
• Isolation: Within the same superstep, newly generated messages from any vertex will not be seen by any other vertex.

978-1-4673-7891-8/15/$31.00 ©2015 IEEE
We have discovered two underutilized localities provided by the graph structure, vertex locality and edge locality, which can help build a new refined BSP barrier. We say that
a vertex has the property of vertex locality if all of its
incoming neighbor vertices are located on the same node.
In this paper, we categorize this kind of vertex as a local vertex and the others as remote vertices. Edge locality refers to the percentage of non-cut incoming edges of a remote vertex.
In this paper, we regard messages received through non-cut
edges as local messages and messages through cut edges as
remote messages. The vertex program essentially consists
of two loosely coupled operations: message consuming and
message producing. Vertex locality ensures that the consis-
tency property is maintained without synchronizing at the
barrier, since all incoming messages for the next superstep
are local messages and are instantly available in memory
after the local machine’s computation phase in the current
superstep finishes. By maintaining the isolation property, both the message consuming and message producing operations on these local vertices in the next superstep can be directly initiated before the costly barrier. The barrier then only needs to synchronize remote vertices. Moreover, edge locality still allows some message consuming operations on remote vertices in the next superstep to be scheduled before the barrier. The
degrees of these two localities are very high in real world
graph data. Detailed examination is in Section IV-B.
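The two localities can be made concrete with a small sketch. The names below (`categorize`, `partition`, `edge_locality`) are ours for illustration, not Zebra's actual identifiers.

```python
# Hypothetical sketch of vertex categorization: a vertex is "local" if
# every incoming neighbor lives on the same node, otherwise "remote".

def categorize(in_neighbors, partition):
    """in_neighbors: {vertex: [incoming neighbor, ...]};
    partition: {vertex: id of the node holding that vertex}."""
    local, remote = set(), set()
    for v, nbrs in in_neighbors.items():
        if all(partition[n] == partition[v] for n in nbrs):
            local.add(v)   # all incoming messages stay in local memory
        else:
            remote.add(v)  # must wait for at least one network message
    return local, remote

def edge_locality(v, in_neighbors, partition):
    """Fraction of v's incoming edges that are not cut by the partition."""
    nbrs = in_neighbors[v]
    return sum(partition[n] == partition[v] for n in nbrs) / len(nbrs)

# Vertices 0 and 1 on node "A"; vertex 2 on node "B".
in_neighbors = {0: [1], 1: [0, 2], 2: [0]}
partition = {0: "A", 1: "A", 2: "B"}
local, remote = categorize(in_neighbors, partition)
```

Here vertex 0 is local (its only incoming neighbor is co-located), while vertices 1 and 2 are remote; vertex 1 still has an edge locality of 0.5, so half of its incoming messages are available without waiting for the network.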
In this paper, a run-time computation and communication
scheduler is proposed so that some computation in the next
superstep can be scheduled to be executed during the commu-
nication delay phase in the current superstep. We first develop
a runtime vertex categorization scheme with no preprocessing
overhead to utilize vertex locality. With this categorization, our
scheduler can schedule all of the computation on local vertices
in the next superstep to the communication delay phase in
the current superstep. To further utilize edge locality, we
decouple the vertex computation into message consuming and
message producing operations so that our scheduler can move
all consuming operations on local messages of remote vertices
in the next superstep to the communication delay phase in the
current superstep. Through this overlapping of computation
and communication, our solution dramatically mitigates the communication delay: it completely eliminates the delay in the best case and achieves an average 2X speedup over Hama.
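A simplified sketch of this split, under our own naming (Zebra's real scheduler works inside Hama's runtime, which we do not reproduce here), might look like the following: local vertices finish their full computation during the communication delay, remote vertices only consume their already-available local messages, and the remaining work waits for the barrier.

```python
# Hypothetical sketch of the computation/communication overlap: work
# scheduled during the current superstep's communication delay vs.
# work that must wait for the synchronization barrier.

def overlap_phase(local, remote, local_inbox):
    """Runs during the communication delay. local_inbox holds messages
    delivered along non-cut edges, already available in memory."""
    # every vertex can safely consume its local messages (isolation holds)
    partial = {v: sum(local_inbox.get(v, [])) for v in local | remote}
    # local vertices also satisfy consistency, so they finish early and
    # may already produce their next-superstep messages
    produced = {v: partial[v] for v in local}
    return partial, produced

def after_barrier(remote, partial, remote_inbox):
    """Runs once the barrier releases: merge the network messages for
    remote vertices into their partially combined values."""
    return {v: partial[v] + sum(remote_inbox.get(v, [])) for v in remote}

# Vertex 0 is local, vertex 1 is remote with one pending network message.
partial, produced = overlap_phase({0}, {1}, {0: [1, 2], 1: [3]})
final = after_barrier({1}, partial, {1: [4]})
```

The key point the sketch illustrates is that only `after_barrier` depends on the network, so everything in `overlap_phase` is hidden behind the communication delay.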
In summary, this paper makes the following contributions:
• Extensive examination of new graph locality in dis-
tributed graph processing.
• A run-time vertex categorization with no preprocessing
overhead.
• A decoupling of the vertex program into message consuming and message producing operations.
• A run-time computation and communication scheduler.
• A prototype based on Apache Hama and a thorough evaluation of our proposed system.
The rest of the paper is organized as follows. Section II
presents the background and motivation of our paper. Sec-
tion III describes Zebra’s system design and execution flow.
Extensive graph locality examination is done in Section IV-B.
Section IV-D evaluates our system’s performance against
Hama. We then describe the related work in Section V and
finally conclude our paper in Section VI.
II. BACKGROUND AND MOTIVATION
[Figure 1: supersteps 1-5 of the BSP model, each consisting of a computation phase, a communication phase, and a synchronization barrier.]
Fig. 1. BSP Model
A. BSP Model
BSP is a parallel programming model consisting of a set of processor-memory pairs, a communication network, and a mechanism for efficient barrier synchronization. Figure 1 illustrates an outline of the BSP model, which expresses an algorithm as a sequence of supersteps. In distributed BSP-based graph processing systems, vertices are partitioned across
compute nodes. These vertices send messages along edges to
perform computation. In this model, all vertices are assigned an active status at the beginning. Each active vertex can switch itself into an inactive status in any superstep. An inactive vertex is switched back to an active status if it receives a message during the execution of any subsequent superstep. A graph algorithm terminates when there are no active vertices or when a user-defined maximum number of supersteps is reached. Figure 2 gives an example implementation of PageRank in the BSP model.
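Figure 2 is not reproduced here, but a PageRank vertex program in the BSP style can be sketched as follows. This is a single-node Python simulation under our own naming, not the figure's actual code: in each superstep a vertex sums its incoming messages, applies the damping update, and sends its rank share along each out-edge.

```python
# Hypothetical single-node sketch of PageRank in the BSP model.
DAMPING = 0.85  # standard PageRank damping factor

def pagerank_bsp(edges, supersteps):
    """edges: {vertex: [out-neighbor, ...]}, each vertex with out-degree >= 1."""
    n = len(edges)
    ranks = {v: 1.0 / n for v in edges}
    inbox = {v: [] for v in edges}
    for step in range(supersteps):
        outbox = {v: [] for v in edges}
        for v in edges:
            if step > 0:  # superstep 0 only scatters the initial ranks
                ranks[v] = (1 - DAMPING) / n + DAMPING * sum(inbox[v])
            for u in edges[v]:  # send rank share along each out-edge
                outbox[u].append(ranks[v] / len(edges[v]))
        inbox = outbox  # communication phase + synchronization barrier
    return ranks

# On a 3-vertex cycle the ranks stay uniform at 1/3.
ranks = pagerank_bsp({0: [1], 1: [2], 2: [0]}, supersteps=20)
```

In a distributed setting, each append to a neighbor held on another node becomes one of the network messages whose volume drives the communication delay discussed next.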
B. Issues with Current BSP Graph Processing
Though the BSP model has been shown by many recent
graph processing systems to be a simple yet effective approach
to handling large scale graph applications, current implemen-
tations overlook some special features of graph structure and
algorithms. In this paper, we focus on the following two issues that arise from current BSP implementations in graph processing systems.
1) Communication Delay: Recent graph system papers [14], [3], [19] report that the communication delay occupies more than half of the overall processing time. We also ran PageRank on the Twitter graph (41.7 million vertices, 1.47 billion edges) to verify this. Figure 3 shows the time breakdown of multiple supersteps. As observed, communication delay