Job Scheduling for Multi-User MapReduce Clusters
Matei Zaharia†    Dhruba Borthakur‡    Joydeep Sen Sarma‡    Khaled Elmeleegy∗    Scott Shenker†    Ion Stoica†
†University of California, Berkeley    ‡Facebook Inc    ∗Yahoo! Research
matei@berkeley.edu {dhruba,jssarma}@facebook.com khaled@yahoo-inc.com {istoica,shenker}@cs.berkeley.edu
Abstract
Sharing a MapReduce cluster between users is attractive because it enables statistical multiplexing (lowering costs) and allows users to share a common large data set. However, we find that traditional scheduling algorithms can perform very poorly in MapReduce due to two aspects of the MapReduce setting: the need for data locality (running computation where the data is) and the dependence between map and reduce tasks. We illustrate these problems through our experience designing a fair scheduler for MapReduce at Facebook, which runs a 600-node multi-user data warehouse on Hadoop. We developed two simple techniques, delay scheduling and copy-compute splitting, which improve throughput and response times by factors of 2 to 10. Although we focus on multi-user workloads, our techniques can also raise throughput in a single-user, FIFO workload by a factor of 2.
1 Introduction
MapReduce and its open-source implementation Hadoop [2] were originally optimized for large batch jobs such as web index construction. However, another use case has recently emerged: sharing a MapReduce cluster between multiple users, who run a mix of long batch jobs and short interactive queries over a common data set. Sharing enables statistical multiplexing, leading to lower costs than building a private cluster for each group. Sharing a cluster also leads to data consolidation (colocation of disparate data sets). This avoids costly replication of data across private clusters, and lets an organization run unanticipated queries across disjoint data sets efficiently.
Our work was originally motivated by the MapReduce workload at Facebook, a major web destination that runs a data warehouse on Hadoop. Event logs from Facebook’s website are imported into a Hadoop cluster every hour, where they are used for a variety of applications, including analyzing usage patterns to improve site design, detecting spam, data mining, and ad optimization. The warehouse runs on 600 machines and stores 500 TB of compressed data, which is growing at a rate of 2 TB per day. In addition to “production” jobs that must run periodically, there are many experimental jobs, ranging from multi-hour machine learning computations to 1-2 minute ad-hoc queries submitted through a SQL interface to Hadoop called Hive [3]. The system runs 3200 MapReduce jobs per day and has been used by over 50 Facebook engineers.
As Facebook began building its data warehouse, it found the data consolidation provided by a shared cluster highly beneficial. For example, an engineer working on spam detection could look for patterns in arbitrary data sources, like friend lists and ad clicks, to identify spammers. However, when enough groups began using Hadoop, job response times started to suffer due to Hadoop’s FIFO scheduler. This was unacceptable for production jobs and made interactive queries impossible, greatly reducing the utility of the system. Some groups within Facebook considered building private clusters for their workloads, but this was too expensive to be justified for many applications.
To address this problem, we have designed and implemented a fair scheduler for Hadoop. Our scheduler gives each user the illusion of owning a private Hadoop cluster, letting users start jobs within seconds and run interactive queries, while utilizing an underlying shared cluster efficiently. During the development process, we have uncovered several scheduling challenges in the MapReduce setting that we address in this paper. We found that existing scheduling algorithms can behave very poorly in MapReduce, degrading throughput and response time by factors of 2-10, due to two aspects of the setting: data locality (the need to run computations near the data) and interdependence between map and reduce tasks. We developed two simple, robust algorithms to overcome these problems: delay scheduling and copy-compute splitting. Our techniques provide 2-10x gains in throughput and response time in a multi-user workload, but can also increase throughput in a single-user, FIFO workload by a factor of 2. While we present our results in the MapReduce setting, they generalize to any data-flow-based cluster computing system, like Dryad [20]. The locality and interdependence issues we address are inherent in large-scale data-parallel computing.
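The body of the paper develops both algorithms in detail; to make the delay scheduling idea concrete here, the following minimal sketch (in Python, with a toy Job class and skip threshold D that are illustrative, not Hadoop's implementation) shows the core rule: when a node frees a slot and the job that should run next has no task local to that node, the scheduler skips that job for a bounded number of scheduling opportunities before letting it launch a non-local task.

    class Job:
        """Toy bookkeeping for a job's pending map tasks: task id -> nodes holding its input."""
        def __init__(self, pending_tasks):
            self.pending = dict(pending_tasks)
            self.skipcount = 0

        def pop_local_task(self, node):
            for task, locations in self.pending.items():
                if node in locations:
                    del self.pending[task]
                    return task
            return None

        def pop_any_task(self):
            return self.pending.popitem()[0] if self.pending else None

    D = 3  # illustrative skip threshold: how many opportunities a job may wait for locality

    def assign_task(jobs, node):
        """Called when `node` frees a map slot; `jobs` is ordered by the fairness policy."""
        for job in jobs:
            if not job.pending:
                continue
            task = job.pop_local_task(node)
            if task is not None:        # a data-local task exists: run it
                job.skipcount = 0
                return task
            if job.skipcount >= D:      # job has waited long enough: relax locality
                job.skipcount = 0
                return job.pop_any_task()
            job.skipcount += 1          # otherwise skip this job for now
        return None                     # no job has a runnable task

The key design choice, discussed later, is that waiting a short, bounded amount of time for locality costs little because tasks are short and slots free up frequently.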
There are two aspects that differentiate scheduling in MapReduce from traditional cluster scheduling [12]. The first aspect is the need for data locality, i.e., placing tasks on nodes that contain their input data. Locality is crucial for performance because the network bisection bandwidth in a large cluster is much lower than the aggregate bandwidth of the disks in the machines [16]. Traditional cluster schedulers that give each user a fixed set of machines,