
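Work stealing is a thread management strategy for multi-threaded program runtimes, and it is highly effective for building the runtimes of parallel programming languages. Its primary goal is load balancing, which keeps such runtimes efficient on parallel architectures. HERMES is an energy-efficient work-stealing language runtime. It relies on two complementary algorithms to coordinate thread tempo: a workpath-sensitive algorithm and a workload-sensitive algorithm. These algorithms adjust the speed of each thread based on thief-victim relationships and on the size of the work-stealing deques. A thief is a thread that has run out of work and takes a task from another thread; a victim is the thread whose task is stolen. The key design insight of HERMES is that these threads have varying impacts on the overall running time, so coordinating their execution tempo can improve energy efficiency with minimal performance loss. HERMES is built on top of the Intel Cilk Plus runtime and adjusts thread tempo through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES show an average of 11-12% energy savings with an average performance loss of only 3-4%, measured on commercial CPUs.

The core of a work-stealing runtime is the distribution of tasks among threads so as to make the most of a multi-core processor. When a thread (the thief) has finished the tasks in its own work queue, it can steal a task from the work queue of another thread (the victim) and execute it. This strategy balances load across processors under dynamic conditions, and it improves program throughput especially when task latencies or computation times are unpredictable.

The workpath-sensitive algorithm in HERMES determines the tempo of each thread based on thief-victim relationships on the execution path. It aims to identify the work that determines the overall running time and to prioritize the threads handling it, without sacrificing performance. The workload-sensitive algorithm selects the tempo based on the size of the work-stealing deque, so that a thread runs at a higher frequency when its deque is long and at a lower frequency otherwise.

The design of HERMES balances energy efficiency against performance. DVFS is an important energy management technique in modern processors: it allows CPU voltage and frequency to be adjusted dynamically according to the current computational load. HERMES uses it to regulate thread tempo, reducing energy consumption with little effect on program performance. This contrasts with running the CPU at maximum frequency regardless of load.

The implementation and evaluation of HERMES demonstrate the energy-efficiency potential of work-stealing runtimes. In meter-based measurements on benchmarks, HERMES achieves significant energy savings. The paper offers a new perspective for research on parallel programming languages and highlights the value of intelligent scheduling for reducing energy consumption.

The design and implementation of the HERMES work-stealing runtime reflect the continuing pursuit of high-performance, low-power computing in the parallel computing field. By carefully coordinating thread execution on multi-core processors, HERMES lowers energy consumption while keeping performance loss small. Such strategies benefit green computing and can help extend the battery life of mobile devices. As parallel programming becomes more common, the prospects for techniques like HERMES will broaden.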


Energy-Efficient Work-Stealing Language Runtimes

Haris Ribic and Yu David Liu

SUNY Binghamton

Binghamton NY 13902, USA

{hribic1,davidL}@binghamton.edu

Abstract

Work stealing is a promising approach to constructing multi-threaded program runtimes of parallel programming languages. This paper presents HERMES, an energy-efficient work-stealing language runtime. The key insight is that threads in a work-stealing environment (thieves and victims) have varying impacts on the overall program running time, and a coordination of their execution “tempo” can lead to energy efficiency with minimal performance loss. The centerpiece of HERMES is two complementary algorithms to coordinate thread tempo: the workpath-sensitive algorithm determines tempo for each thread based on thief-victim relationships on the execution path, whereas the workload-sensitive algorithm selects appropriate tempo based on the size of work-stealing deques. We construct HERMES on top of Intel Cilk Plus’s runtime, and implement tempo adjustment through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES demonstrate an average of 11-12% energy savings with an average of 3-4% performance loss through meter-based measurements over commercial CPUs.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors - Run-Time Environments; D.3.3 [Programming Languages]: Language Constructs and Features

Keywords: work stealing; energy efficiency; language runtimes; thread management; DVFS

1. Introduction

Work stealing is a thread management strategy effective for maintaining multi-threaded language runtimes, with parallel architectures as specific target and with a primary goal of load balancing. In the multi-core era, work stealing received considerable interest in language runtime design. With its root in Cilk [6, 16], work stealing is widely available in industry-strength C/C++/C#-based language frameworks such as Intel TBB [20], Intel Cilk Plus [21], and Microsoft .NET framework [25]. The core idea of work stealing has also made its way into mainstream languages such as Java [24], X10 [10, 23, 30], Haskell [28], and Scala [32]. There is an active interest in research improving its performance-critical properties, such as adaptiveness [2, 18], scalability [13], and fairness [14].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ASPLOS ’14, March 1–4, 2014, Salt Lake City, Utah, USA.

Copyright © 2014 ACM 978-1-4503-2305-5/14/03...$15.00.

http://dx.doi.org/10.1145/2541940.2541971

In comparison, energy efficiency in work-stealing systems has received little attention. At a time when power-hungry data centers and cloud computing servers are the norm of computing infrastructure, energy efficiency is a first-class design goal with direct consequences on operational cost, reliability, usability, maintainability, and environmental sustainability. The lack of energy-efficient solutions for work-stealing systems is particularly unfortunate, because the platforms on which work stealing is most promising to make impact (systems with a large number of parallel units) happen to be large power consumers and require more sophisticated techniques to achieve energy efficiency [8, 12, 17, 22, 27, 33, 38].

HERMES is a first step toward energy efficiency for work-stealing runtimes. Program execution under HERMES is tempo-enabled: different threads may execute at different speeds (tempo), achieved by adjusting the frequencies of host CPU cores through standard DVFS. (The term is inspired by music composition, where each movement of a musical piece is often marked with a different tempo, e.g., allegro (“fast”) and lento (“slow”), to indicate the speed of execution.) The effect of DVFS on energy management is widely known. The real challenge lies in balancing the trade-off between energy and performance, as lower frequencies may also slow down program execution. The primary design goal of HERMES is to tap inherent and unique features of the work-stealing runtime to help make judicious DVFS decisions, ultimately maximizing energy savings while minimizing performance loss. Specifically, HERMES is endowed with two algorithms:

• workpath-sensitive tempo control: thread tempo is set based on control flow, with threads tackling “immediate work” [7] executing at a faster tempo. This design approach corresponds to a key design principle in work-stealing algorithms: the work-first principle.

• workload-sensitive tempo control: thread tempo is set based on the number of work items a thread needs to tackle, as indicated by the size of the deque in work-stealing runtimes. Threads with a longer deque execute at a faster tempo.

HERMES unifies the two tempo control strategies in one. Our experiments show that the two strategies are highly complementary. For instance, on a 32-core machine, the two strategies individually contribute 6% and 7% energy savings respectively, whereas the unified algorithm yields 11% energy savings. In the same setting, the two strategies individually incur 6% and 5% performance loss respectively, whereas the unified algorithm incurs 3% loss.
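To make the idea of combining two tempo signals concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the three tempo levels, the deque-length threshold `long_deque`, the function names, and the rule of simply adding the two signals. It is not the HERMES algorithm, which is defined later in the paper; it only mirrors the two stated tendencies (immediate work runs faster, a longer deque runs faster).

```python
# Illustrative only: the levels, the threshold, and the way the two signals
# are combined are assumptions of this sketch, not the HERMES policy.
TEMPOS = ["slow", "medium", "fast"]  # a real runtime would map these to DVFS frequencies


def workpath_step(on_immediate_work: bool) -> int:
    """Workpath-sensitive signal: immediate work gets the faster tempo."""
    return 1 if on_immediate_work else 0


def workload_step(deque_size: int, long_deque: int = 4) -> int:
    """Workload-sensitive signal: a longer deque gets the faster tempo."""
    return 1 if deque_size >= long_deque else 0


def choose_tempo(on_immediate_work: bool, deque_size: int) -> str:
    """Combine both signals by adding their steps (an assumed rule)."""
    return TEMPOS[workpath_step(on_immediate_work) + workload_step(deque_size)]


print(choose_tempo(True, 10))   # fast
print(choose_tempo(False, 0))   # slow
```

A thread that satisfies only one of the two conditions lands on the middle level, which is one simple way to let each signal moderate the other.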

This paper makes the following contributions:

1. The first framework, to the best of our knowledge, addressing energy efficiency in work-stealing systems. The framework achieves energy efficiency through thread tempo control.

2. Two novel, complementary tempo control strategies: one workpath-sensitive and one workload-sensitive.

3. A prototyped implementation and experimental evaluation demonstrating an average of 11-12% energy savings with 3-4% performance loss over work-stealing benchmarks. The results are stable throughout comprehensive design space exploration.
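One rough way to read these averages is to ask how much the average power draw must have dropped. The sketch below assumes that "performance loss" means a proportional increase in running time and that energy equals average power multiplied by running time; both are assumptions of this example rather than statements from the paper, and `average_power_ratio` is a hypothetical helper name.

```python
def average_power_ratio(energy_saving: float, slowdown: float) -> float:
    """Ratio of average power (tuned run / baseline run).

    energy_saving: fractional reduction in energy, e.g. 0.11 for 11%
    slowdown:      fractional increase in running time, e.g. 0.03 for 3%
    """
    energy_ratio = 1.0 - energy_saving   # E_tuned / E_base
    time_ratio = 1.0 + slowdown          # T_tuned / T_base
    return energy_ratio / time_ratio     # P = E / T


print(round(average_power_ratio(0.11, 0.03), 3))  # 0.864
```

Under these assumptions, an 11% energy saving at a 3% slowdown corresponds to roughly a 14% lower average power.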

2. Background: Work Stealing

Work stealing was originally developed in Cilk [6, 16], a C-like language designed for parallel programming. The main appeal of work stealing is its synergic solution spanning the compute stack, bridging the gap between abstraction layers such as architectures, operating systems, compilers, program runtimes, and programming models.

Work stealing is a load balancing scheduler for multi-threaded programs over parallel architectures. The program runtime consists of multiple threads called workers, each executing on a host CPU core (or hardware parallel unit in general). Each worker maintains a queue-like data structure, called a double-ended queue or deque, each item of which is a task to be processed by the worker. When a worker finishes processing a task, it picks up the next one from its deque and continues the execution of that task. When the deque is empty (we say the worker or its host core is idle), the worker steals a task from the deque of another worker. In this case we call the stealing worker a thief and the worker whose task was stolen a victim. The selection of victims follows the principles observed by load balancing and may vary in different implementations of work stealing.

What sets work stealing apart from standard load balancing techniques is how the runtime structure described above corresponds to program structures and compilation units. Each task on the deque is a block of executable code (or more strictly, a program counter pointing to the executable code) demarcated by the programmer and optimized by the compiler. In that sense, to have a worker “pick up a task” is indeed to have the worker continue its execution over the executable code embodied in the task. To describe the process in more detail, let us use the following Cilk example:

L1   cilk int f() {
L2       int n1 = spawn f1();
L3       ... // other statements
L4   }
L5   cilk int f1() {
L6       int n2 = spawn f2();
L7       ... // other statements
L8   }
L9   cilk int f2() {
L10      ... // other statements
L11  }

Logically, each spawn can be viewed as a thread creation. On the implementation level, however, a work-stealing runtime adopts Lazy Task Creation [31], where for each spawn, the executing worker simply puts a task onto its own deque, either to pick it up later or to be stolen by some other worker. This strategy aligns thread management with the underlying parallel architecture: a program that invokes f above 20 times but runs on a dual-core CPU can operate with only 2 threads (workers) instead of 40.
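The arithmetic behind "2 instead of 40" can be sketched as follows. Each invocation of f leads to two spawns (one in f, one in f1), so 20 invocations produce 40 logical threads, yet each spawn only appends a task record to a worker's deque. The function name `run_lazily`, the round-robin placement of invocations on workers, and the tuple used as a task record are all assumptions of this toy model; no stealing is modelled.

```python
from collections import deque


def run_lazily(num_workers=2, invocations=20, spawns_per_invocation=2):
    """Count logical spawns versus the workers that actually execute them."""
    deques = [deque() for _ in range(num_workers)]   # one deque per worker
    spawned = 0
    for i in range(invocations):
        own = deques[i % num_workers]                # assumed round-robin placement
        for s in range(spawns_per_invocation):
            own.append((i, s))                       # spawn = push a task record
            spawned += 1
    executed = 0
    for own in deques:
        while own:
            own.pop()                                # the worker picks the task up later
            executed += 1
    return spawned, executed, num_workers


print(run_lazily())  # (40, 40, 2)
```

All 40 spawned tasks are executed, but the number of threads stays equal to the number of workers.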

Work-First Principle. The question here is what the item placed on the deque should embody. For instance, when L2 is executed, one tempting design would be to consider f1 as the task placed on the deque. The Cilk-like work-stealing algorithm takes the opposite approach: it places the continuation of the current spawn statement onto the deque. In the example above, it is the program counter pointing to L3. The current worker continues to invoke f1 as if spawn were elided.

This design reflects a fundamental principle well articulated in Cilk: the work-first principle. The principle concerns the relationship between the parallel execution of a program and its corresponding serial execution. (A logically equivalent view for the latter would be to have the parallel program execute on a single-core machine.) Let us revisit the example above. If it is executed on a single-core machine, f1 is the “immediate” work when L2 is reached, and hence carries more urgency. For that reason, f1 should be immediately executed by the current worker, whereas the continuation is not as urgent and is hence placed on the deque.
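The single-core view can be traced with a small Python model of the Cilk example. The strings "L2" through "L10" stand for the labelled lines of that example; the function name `run_work_first` and the use of a Python list as the deque are assumptions of this sketch. At each spawn the worker pushes the continuation and runs the child immediately; with no thief present, it later resumes its own continuations, most recent first.

```python
def run_work_first():
    """Trace a single worker executing the Cilk example under work-first."""
    deque = []   # the worker's deque; the end of the list is the latest item
    trace = []   # order in which labelled lines are executed

    def f2():
        trace.append("L10")      # body of f2

    def f1():
        trace.append("L6")       # spawn f2():
        deque.append("L7")       #   push the continuation (L7) ...
        f2()                     #   ... and run the child right away

    def f():
        trace.append("L2")       # spawn f1():
        deque.append("L3")       #   push the continuation (L3) ...
        f1()                     #   ... and run the child right away

    f()
    # No thief exists on a single core, so the worker resumes its own
    # continuations, most recently pushed first.
    while deque:
        trace.append(deque.pop())
    return trace


print(run_work_first())  # ['L2', 'L6', 'L10', 'L7', 'L3']
```

The trace matches the order in which an ordinary serial program would reach these lines, which is the correspondence the work-first principle is meant to preserve.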

The work-first principle plays a pivotal role in the design of work-stealing systems. In Cilk, it further leads to a compilation strategy known as fast/slow clones, and a distinct solution for locking [16]. Both are out of the scope of this paper.

Figure 1. Work Stealing: An Illustration (panels (a) through (f))

Deque Management. One natural consequence of placing continuations onto the deque is that the order of tasks on the deque reflects the immediacy of processing these items as defined by the work-first principle: the earlier the item is placed, the less immediate it is. For example, if the control flow of a worker reaches L10, two tasks have been placed on the deque: the program counter for L3 (when the spawn in L2 is executed) and the program counter for L7 (when the spawn in L6 is executed). In a serial execution, L3 will only be encountered after L7.

With this observation, the deque is designed as a data structure that can be manipulated on both ends. Let us call the earliest item placed on the deque by the worker the head of the deque, and the latest item the tail. When a worker becomes idle, it always retrieves from the tail of its own deque, i.e., the most immediate task. On the other hand, when a thief attempts to steal from a worker, it always retrieves from the head of that worker’s deque, i.e., the least immediate task. From now on, we call a worker placing a task onto its own deque a push, and a worker removing a task from its own deque a pop. We continue to use the term steal to refer to a worker removing a task from another worker’s deque.

Structures

    structure WORKER
        DQ    // deque (array)
        H     // head index
        T     // tail index
    end structure

    structure TASK
        ...   // program counter, etc
    end structure

Other Definitions

    procedure WORK(w, t)    // worker w runs task t
    procedure SELECT()      // select and return a victim
    procedure YIELD(w)      // yield worker w
    procedure LOCK(w)       // lock w
    procedure UNLOCK(w)     // unlock w

Algorithm 2.1 Worker

    w : WORKER
    procedure SCHEDULE(w)
        loop
            t ← POP(w)
            if t == null then
                v ← SELECT()
                t ← STEAL(v)
                if t == null then
                    YIELD(w)
                else
                    WORK(w, t)
                end if
            else
                WORK(w, t)
            end if
        end loop
    end procedure

Algorithm 2.2 Push

    w : WORKER
    t : TASK
    procedure PUSH(w, t)
        w.T++
        w.DQ[w.T] ← t
    end procedure

Algorithm 2.3 Pop

    w : WORKER
    procedure POP(w)
        w.T--
        if w.H > w.T then
            w.T++
            LOCK(w)
            w.T--
            if w.H > w.T then
                w.T++
                UNLOCK(w)
                return null
            end if
            UNLOCK(w)
        end if
        return w.DQ[w.T]
    end procedure

Algorithm 2.4 Steal

    v : WORKER    // victim
    procedure STEAL(v)
        LOCK(v)
        v.H++
        if v.H > v.T then
            v.H--
            UNLOCK(v)
            return null
        end if
        UNLOCK(v)
        return v.DQ[v.H]
    end procedure

Figure 2. Work Stealing Algorithm
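The push, pop, and steal operations can be exercised with a small Python sketch. It follows the same shape as the algorithms in Figure 2 (the owner adjusts the tail optimistically and takes the lock only on an apparent conflict; the thief always takes the lock), but the details are assumptions of this sketch: live tasks occupy `dq[head:tail]`, so `tail` is one past the latest item, the array has a fixed hypothetical capacity, and `next_task` is a made-up name for one iteration of the scheduling loop. The example is run from a single thread and is not meant as a production-quality concurrent deque.

```python
import threading


class Worker:
    """A worker with its deque; live tasks occupy dq[head:tail]."""

    def __init__(self, capacity=64):
        self.dq = [None] * capacity   # fixed-size array (capacity is an assumption)
        self.head = 0                 # index of the earliest (least immediate) task
        self.tail = 0                 # one past the latest (most immediate) task
        self.lock = threading.Lock()


def push(w, task):
    """Owner places a task at the tail of its own deque."""
    w.dq[w.tail] = task               # raises IndexError if capacity is exceeded
    w.tail += 1


def pop(w):
    """Owner removes the most immediate task from the tail, or returns None."""
    w.tail -= 1
    if w.head > w.tail:               # apparent conflict: retry under the lock
        w.tail += 1
        with w.lock:
            w.tail -= 1
            if w.head > w.tail:       # the deque really is empty
                w.tail += 1
                return None
    return w.dq[w.tail]


def steal(victim):
    """Thief removes the least immediate task from the head, or returns None."""
    with victim.lock:
        victim.head += 1
        if victim.head > victim.tail: # nothing left to steal
            victim.head -= 1
            return None
        return victim.dq[victim.head - 1]


def next_task(w, select_victim):
    """One iteration of the scheduling loop: own deque first, then steal."""
    task = pop(w)
    if task is None:
        task = steal(select_victim())
    return task


victim, thief = Worker(), Worker()
push(victim, "L3")                          # continuation pushed at the spawn in L2
push(victim, "L7")                          # continuation pushed at the spawn in L6
print(next_task(thief, lambda: victim))     # L3
print(next_task(victim, lambda: thief))     # L7
print(next_task(victim, lambda: thief))     # None
```

The thief receives L3, the least immediate continuation, while the owner keeps L7, the most immediate one, which is the ordering described in the text.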
