Dynamo: Facebook’s Data Center-Wide Power Management System
Qiang Wu, Qingyuan Deng, Lakshmi Ganesh, Chang-Hong Hsu∗, Yun Jin, Sanjeev Kumar†, Bin Li, Justin Meza, and Yee Jiun Song
Facebook, Inc.
∗University of Michigan
Abstract—Data center power is a scarce resource that
often goes underutilized due to conservative planning. This
is because the penalty for overloading the data center power
delivery hierarchy and tripping a circuit breaker is very high,
potentially causing long service outages. Recently, dynamic
server power capping, which limits the amount of power
consumed by a server, has been proposed and studied as a way
to reduce this penalty, enabling more aggressive utilization
of provisioned data center power. However, no real at-scale
solution for data center-wide power monitoring and control
has been presented in the literature.
In this paper, we describe Dynamo – a data center-wide
power management system that monitors the entire power
hierarchy and makes coordinated control decisions to safely
and efficiently use provisioned data center power. Dynamo
has been developed and deployed across all of Facebook’s
data centers for the past three years. Our key insight is that
in real-world data centers, different power and performance
constraints at different levels in the power hierarchy necessitate
coordinated data center-wide power management.
We make three main contributions. First, to understand
the design space of Dynamo, we provide a characterization
of power variation in data centers running a diverse set of
modern workloads. This characterization uses fine-grained
power samples collected from tens of thousands of servers over
a period of more than six months. Second, we present the detailed
design of Dynamo. Our design addresses several key issues
not addressed by previous simulation-based studies. Third,
the proposed techniques and design have been deployed
and evaluated in large scale data centers serving billions of
users. We present production results showing that Dynamo
has prevented 18 potential power outages in the past 6
months due to unexpected power surges; that Dynamo enables
optimizations leading to a 13% performance boost for a
production Hadoop cluster and a nearly 40% performance
increase for a search cluster; and that Dynamo has already
enabled an 8% increase in the power capacity utilization
of one of our data centers with more aggressive power
subscription measures underway.
Keywords—data center; power; management.
I. INTRODUCTION
Warehouse-scale data centers consist of many thousands
of machines running a diverse set of workloads and comprise
the foundation of the modern web. The power delivery infrastructure
supplying these data centers is equipped with power
breakers designed to protect the data center from damage
due to electrical surges. While tripping a power breaker
ultimately protects the physical infrastructure of a data center,
its application-level effects can be disastrous, leading to long
service outages at worst and degraded user experience at best.
Given how severe the outcomes of tripping a power breaker
are, data center operators have traditionally taken a conservative
approach by over-provisioning data center power,
provisioning for worst-case power consumption, and further
adding large power buffers [1], [2]. While such an approach
ensures safety and reliability with high confidence, it is
wasteful in terms of power infrastructure utilization – a scarce
data center resource. For example, it may take several years
to construct a new power delivery infrastructure, and every
megawatt of power capacity can cost around 10 to 20 million
USD [2], [3].
∗Work was performed while employed by Facebook, Inc.
†Currently with Uber, Inc.
Under-utilizing data center power is especially inefficient
because power is frequently the bottleneck resource limiting
the number of servers that a data center can house. It is
even more so with the recent trend of increasing server
power density [4], [5]. Figure 1 shows that server peak power
consumption nearly doubled going from the 2011 server (24-
core Westmere-based) to the 2015 server (48-core Haswell-
based) at Facebook. This trend has led to the proliferation of
ghost spaces in data centers: unused, and unusable, space [6].
To help improve data center efficiency, over-subscription
of data center power has been proposed in recent years [1],
[6], [7]. With over-subscription, the planned peak data center
power demand is intentionally allowed to surpass data center
power supply, under the assumption that correlated spikes
in server power consumption are infrequent. However, this
exposes data centers to the risk of tripping power breakers
due to highly unpredictable power spikes (e.g., a natural
disaster or a special event that causes a surge in user activity
for a service). To make matters worse, a power failure in
one data center could cause a redistribution of load to other
data centers, tripping their power breakers and leading to a
cascading power failure event.
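
The over-subscription trade-off described above can be illustrated with a minimal sketch. It is not part of Dynamo. The `PowerDevice` class, the function names, and all wattage figures are hypothetical and chosen only to show the arithmetic: planned peak demand may exceed the breaker limit, which is safe while servers draw typical power but leaves negative headroom if their draws spike together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PowerDevice:
    """A hypothetical power device (e.g., a rack) protected by one breaker."""
    name: str
    breaker_limit_w: float       # assumed breaker capacity, in watts
    server_peak_w: List[float]   # assumed worst-case draw of each server, in watts


def oversubscription_ratio(device: PowerDevice) -> float:
    """Planned peak demand divided by supply; a value above 1.0 means over-subscribed."""
    return sum(device.server_peak_w) / device.breaker_limit_w


def headroom_w(device: PowerDevice, measured_w: List[float]) -> float:
    """Remaining capacity for the measured draws; a negative value risks a breaker trip."""
    return device.breaker_limit_w - sum(measured_w)


if __name__ == "__main__":
    # Illustrative numbers only: 40 servers with a 300 W peak behind a 10 kW breaker.
    rack = PowerDevice("rack-A", 10_000.0, [300.0] * 40)
    print("over-subscription ratio:", oversubscription_ratio(rack))

    # Uncorrelated, typical load: each server draws about 200 W.
    print("headroom at typical load (W):", headroom_w(rack, [200.0] * 40))

    # Correlated spike: every server approaches its peak at the same time.
    print("headroom during spike (W):", headroom_w(rack, [290.0] * 40))
```

With these assumed numbers, the rack has positive headroom under typical load and negative headroom during the correlated spike. That second case is the one that over-subscription leaves unprotected unless power is actively monitored and controlled.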
Therefore, in order to achieve both power safety and
Figure 1. The measured power consumption (in watts) as a
function of server CPU utilization for two generations of web
servers used at Facebook. The data points were measured from
a 24-core Westmere-based web server (24×L5639@2.13GHz,
12GB RAM, 2×1G NIC) from 2011, while the • data points were
measured from a 48-core Haswell-based web server (48×E5-
2678v3@2.50GHz, 32GB RAM, 1×10G NIC) from 2015. Both
servers were running a real web server workload. We varied the
server processor utilization by changing the rate of requests sent
to the server. Note that the 2015 server power was measured
using an on-board power sensor while the 2011 server power
was measured using a Yokogawa power meter.
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture
1063-6897/16 $31.00 © 2016 IEEE
DOI 10.1109/ISCA.2016.48