Megastore: Providing Scalable, Highly Available
Storage for Interactive Ser vices
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,
Jean-Michel L
´
eon, Yawei Li, Alexander Lloyd, Vadim Yushprakh
Google, Inc.
{jasonbaker,chrisbond,jcorbett,jfurman,akhorlin,jimlarson,jm,yaweili,alloyd,vadimy}@google.com
ABSTRACT
Megastore is a storage system developed to meet the re-
quirements of today’s interactive online services. Megas-
tore blends the scalability of a NoSQL datastore with the
convenience of a traditional RDBMS in a novel way, and
provides both strong consistency guarantees and high avail-
ability. We provide fully serializable ACID semantics within
fine-grained partitions of data. This partitioning allows us
to synchronously replicate each write across a wide area net-
work with reasonable latency and support seamless failover
b etween datacenters. This paper describ es Megastore’s se-
mantics and replication algorithm. It also describes our ex-
p erience supporting a wide range of Google production ser-
vices built with Megastore.
Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed databases; H.2.4
[Database Management]: Systems—concurrency, distrib-
uted databases
General Terms
Algorithms, Design, Performance, Reliability
Keywords
Large databases, Distributed transactions, Bigtable, Paxos
1. INTRODUCTION
Interactive online services are forcing the storage commu-
nity to meet new demands as desktop applications migrate
to the cloud. Services like email, collaborative documents,
and social networking have been growing exponentially and
are testing the limits of existing infrastructure. Meeting
these services’ storage demands is challenging due to a num-
b er of conflicting requirements.
First, the Internet brings a huge audience of potential
users, so the applications must be highly scalable. A service
This article is published under a Creative Commons Attribution License
(http://creativecommons.org/licenses/by/3.0/), which permits distribution
and reproduction in any medium as well allowing derivative works, pro-
vided that you attribute the original work to the author(s) and CIDR 2011.
5
th
Biennial Conference on Innovative Data Systems Research (CIDR ’11)
January 9-12, 2011, Asilomar, California, USA.
can be built rapidly using MySQL [10] as its datastore, but
scaling the service to millions of users requires a complete
redesign of its storage infrastructure. Second, services must
comp ete for users. This requires rapid development of fea-
tures and fast time-to-market. Third, the service must be
resp onsive; hence, the storage system must have low latency.
Fourth, the service should provide the user with a consistent
view of the data—the result of an update should be visible
immediately and durably. Seeing edits to a cloud-hosted
spreadsheet vanish, however briefly, is a poor user experi-
ence. Finally, users have come to expect Internet services to
b e up 24/7, so the service must be highly available. The ser-
vice must be resilient to many kinds of faults ranging from
the failure of individual disks, machines, or routers all the
way up to large-scale outages affecting entire datacenters.
These requirements are in conflict. Relational databases
provide a rich set of features for easily building applications,
but they are difficult to scale to hundreds of millions of
users. NoSQL datastores such as Google’s Bigtable [15],
Apache Hadoop’s HBase [1], or Facebook’s Cassandra [6]
are highly scalable, but their limited API and loose consis-
tency models complicate application development. Repli-
cating data across distant datacenters while providing low
latency is challenging, as is guaranteeing a consistent view
of replicated data, esp ecially during faults.
Megastore is a storage system developed to meet the stor-
age requirements of today’s interactive online services. It
is novel in that it blends the scalability of a NoSQL data-
store with the convenience of a traditional RDBMS. It uses
synchronous replication to achieve high availability and a
consistent view of the data. In brief, it provides fully serial-
izable ACID semantics over distant replicas with low enough
latencies to support interactive applications.
We accomplish this by taking a middle ground in the
RDBMS vs. NoSQL design space: we partition the data-
store and replicate each partition separately, providing full
ACID semantics within partitions, but only limited con-
sistency guarantees across them. We provide traditional
database features, such as secondary indexes, but only those
features that can scale within user-tolerable latency limits,
and only with the semantics that our partitioning scheme
can supp ort. We contend that the data for most Internet
services can be suitably partitioned (e.g., by user) to make
this approach viable, and that a small, but not spartan, set
of features can substantially ease the burden of developing
cloud applications.
Contrary to conventional wisdom [24, 28], we were able to
use Paxos [27] to build a highly available system that pro-
223