Job Scheduling for Multi-User MapReduce Clusters
Matei Zaharia†    Dhruba Borthakur‡    Joydeep Sen Sarma‡    Khaled Elmeleegy∗    Scott Shenker†    Ion Stoica†
†University of California, Berkeley    ‡Facebook Inc    ∗Yahoo! Research
matei@berkeley.edu {dhruba,jssarma}@facebook.com khaled@yahoo-inc.com {istoica,shenker}@cs.berkeley.edu
Abstract
Sharing a MapReduce cluster between users is attractive because it enables statistical multiplexing (lowering costs) and allows users to share a common large data set. However, we find that traditional scheduling algorithms can perform very poorly in MapReduce due to two aspects of the MapReduce setting: the need for data locality (running computation where the data is) and the dependence between map and reduce tasks. We illustrate these problems through our experience designing a fair scheduler for MapReduce at Facebook, which runs a 600-node multi-user data warehouse on Hadoop. We developed two simple techniques, delay scheduling and copy-compute splitting, which improve throughput and response times by factors of 2 to 10. Although we focus on multi-user workloads, our techniques can also raise throughput in a single-user, FIFO workload by a factor of 2.
1 Introduction
MapReduce and its open-source implementation Hadoop [2] were originally optimized for large batch jobs such as web index construction. However, another use case has recently emerged: sharing a MapReduce cluster between multiple users, who run a mix of long batch jobs and short interactive queries over a common data set. Sharing enables statistical multiplexing, leading to lower costs than building a private cluster for each group. Sharing a cluster also leads to data consolidation (colocation of disparate data sets). This avoids costly replication of data across private clusters, and lets an organization run unanticipated queries across disjoint data sets efficiently.
Our work was originally motivated by the MapReduce workload at Facebook, a major web destination that runs a data warehouse on Hadoop. Event logs from Facebook’s website are imported into a Hadoop cluster every hour, where they are used for a variety of applications, including analyzing usage patterns to improve site design, detecting spam, data mining, and ad optimization. The warehouse runs on 600 machines and stores 500 TB of compressed data, which is growing at a rate of 2 TB per day. In addition to “production” jobs that must run periodically, there are many experimental jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted through a SQL interface to Hadoop called Hive [3]. The system runs 3200 MapReduce jobs per day and has been used by over 50 Facebook engineers.
As Facebook began building its data warehouse, it found the data consolidation provided by a shared cluster highly beneficial. For example, an engineer working on spam detection could look for patterns in arbitrary data sources, like friend lists and ad clicks, to identify spammers. However, when enough groups began using Hadoop, job response times started to suffer due to Hadoop’s FIFO scheduler. This was unacceptable for production jobs and made interactive queries impossible, greatly reducing the utility of the system. Some groups within Facebook considered building private clusters for their workloads, but this was too expensive to be justified for many applications.
To address this problem, we have designed and implemented a fair scheduler for Hadoop. Our scheduler gives each user the illusion of owning a private Hadoop cluster, letting users start jobs within seconds and run interactive queries, while utilizing an underlying shared cluster efficiently. During the development process, we have uncovered several scheduling challenges in the MapReduce setting that we address in this paper. We found that existing scheduling algorithms can behave very poorly in MapReduce, degrading throughput and response time by factors of 2-10, due to two aspects of the setting: data locality (the need to run computations near the data) and interdependence between map and reduce tasks. We developed two simple, robust algorithms to overcome these problems: delay scheduling and copy-compute splitting. Our techniques provide 2-10x gains in throughput and response time in a multi-user workload, but can also increase throughput in a single-user, FIFO workload by a factor of 2. While we present our results in the MapReduce setting, they generalize to any data-flow-based cluster computing system, like Dryad [20]. The locality and interdependence issues we address are inherent in large-scale data-parallel computing.
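The body of the paper develops both algorithms in detail; to make the delay scheduling idea concrete here, the following minimal sketch (in Python, with a toy Job class and skip threshold D that are illustrative, not Hadoop's implementation) shows the core rule: when a node frees a slot and the job that should run next has no task local to that node, the scheduler skips that job for a bounded number of scheduling opportunities before letting it launch a non-local task.

    class Job:
        """Toy bookkeeping for a job's pending map tasks: task id -> nodes holding its input."""
        def __init__(self, pending_tasks):
            self.pending = dict(pending_tasks)
            self.skipcount = 0

        def pop_local_task(self, node):
            for task, locations in self.pending.items():
                if node in locations:
                    del self.pending[task]
                    return task
            return None

        def pop_any_task(self):
            return self.pending.popitem()[0] if self.pending else None

    D = 3  # illustrative skip threshold: how many opportunities a job may wait for locality

    def assign_task(jobs, node):
        """Called when `node` frees a map slot; `jobs` is ordered by the fairness policy."""
        for job in jobs:
            if not job.pending:
                continue
            task = job.pop_local_task(node)
            if task is not None:        # a data-local task exists: run it
                job.skipcount = 0
                return task
            if job.skipcount >= D:      # job has waited long enough: relax locality
                job.skipcount = 0
                return job.pop_any_task()
            job.skipcount += 1          # otherwise skip this job for now
        return None                     # no job has a runnable task

The key design choice, discussed later, is that waiting a short, bounded amount of time for locality costs little because tasks are short and slots free up frequently.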
There are two aspects that differentiate scheduling in MapReduce from traditional cluster scheduling [12]. The first aspect is the need for data locality, i.e., placing tasks on nodes that contain their input data. Locality is crucial for performance because the network bisection bandwidth in a large cluster is much lower than the aggregate bandwidth of the disks in the machines [16]. Traditional cluster schedulers that give each user a fixed set of machines,