googlemegastore报告与揭秘资源-CSDN文库

共2个文件

pdf：2个

google

megastore

需积分: 16 166 浏览量 2011-02-18 17:04:01 上传评论收藏 1.44MB ZIP 举报

资源推荐

资源详情

资源评论

收起资源包目录

google megastore.zip （2个子文件）

Megastore[CIDR11-google].pdf 931KB

Google Megastore分布式存储技术全揭秘.pdf 635KB

Megastore: Providing Scalable, Highly Available

Storage for Interactive Ser vices

Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,

Jean-Michel L

eon, Yawei Li, Alexander Lloyd, Vadim Yushprakh

Google, Inc.

{jasonbaker,chrisbond,jcorbett,jfurman,akhorlin,jimlarson,jm,yaweili,alloyd,vadimy}@google.com

ABSTRACT

Megastore is a storage system developed to meet the re-

quirements of today’s interactive online services. Megas-

tore blends the scalability of a NoSQL datastore with the

convenience of a traditional RDBMS in a novel way, and

provides both strong consistency guarantees and high avail-

ability. We provide fully serializable ACID semantics within

ﬁne-grained partitions of data. This partitioning allows us

to synchronously replicate each write across a wide area net-

work with reasonable latency and support seamless failover

b etween datacenters. This paper describ es Megastore’s se-

mantics and replication algorithm. It also describes our ex-

p erience supporting a wide range of Google production ser-

vices built with Megastore.

Categories and Subject Descriptors

C.2.4 [Distributed Systems]: Distributed databases; H.2.4

[Database Management]: Systems—concurrency, distrib-

uted databases

General Terms

Algorithms, Design, Performance, Reliability

Keywords

Large databases, Distributed transactions, Bigtable, Paxos

1. INTRODUCTION

Interactive online services are forcing the storage commu-

nity to meet new demands as desktop applications migrate

to the cloud. Services like email, collaborative documents,

and social networking have been growing exponentially and

are testing the limits of existing infrastructure. Meeting

these services’ storage demands is challenging due to a num-

b er of conﬂicting requirements.

First, the Internet brings a huge audience of potential

users, so the applications must be highly scalable. A service

This article is published under a Creative Commons Attribution License

(http://creativecommons.org/licenses/by/3.0/), which permits distribution

and reproduction in any medium as well allowing derivative works, pro-

vided that you attribute the original work to the author(s) and CIDR 2011.

Biennial Conference on Innovative Data Systems Research (CIDR ’11)

January 9-12, 2011, Asilomar, California, USA.

can be built rapidly using MySQL [10] as its datastore, but

scaling the service to millions of users requires a complete

redesign of its storage infrastructure. Second, services must

comp ete for users. This requires rapid development of fea-

tures and fast time-to-market. Third, the service must be

resp onsive; hence, the storage system must have low latency.

Fourth, the service should provide the user with a consistent

view of the data—the result of an update should be visible

immediately and durably. Seeing edits to a cloud-hosted

spreadsheet vanish, however brieﬂy, is a poor user experi-

ence. Finally, users have come to expect Internet services to

b e up 24/7, so the service must be highly available. The ser-

vice must be resilient to many kinds of faults ranging from

the failure of individual disks, machines, or routers all the

way up to large-scale outages aﬀecting entire datacenters.

These requirements are in conﬂict. Relational databases

provide a rich set of features for easily building applications,

but they are diﬃcult to scale to hundreds of millions of

users. NoSQL datastores such as Google’s Bigtable [15],

Apache Hadoop’s HBase [1], or Facebook’s Cassandra [6]

are highly scalable, but their limited API and loose consis-

tency models complicate application development. Repli-

cating data across distant datacenters while providing low

latency is challenging, as is guaranteeing a consistent view

of replicated data, esp ecially during faults.

Megastore is a storage system developed to meet the stor-

age requirements of today’s interactive online services. It

is novel in that it blends the scalability of a NoSQL data-

store with the convenience of a traditional RDBMS. It uses

synchronous replication to achieve high availability and a

consistent view of the data. In brief, it provides fully serial-

izable ACID semantics over distant replicas with low enough

latencies to support interactive applications.

We accomplish this by taking a middle ground in the

RDBMS vs. NoSQL design space: we partition the data-

store and replicate each partition separately, providing full

ACID semantics within partitions, but only limited con-

sistency guarantees across them. We provide traditional

database features, such as secondary indexes, but only those

features that can scale within user-tolerable latency limits,

and only with the semantics that our partitioning scheme

can supp ort. We contend that the data for most Internet

services can be suitably partitioned (e.g., by user) to make

this approach viable, and that a small, but not spartan, set

of features can substantially ease the burden of developing

cloud applications.

Contrary to conventional wisdom [24, 28], we were able to

use Paxos [27] to build a highly available system that pro-

223

vides reasonable latencies for interactive applications while

synchronously replicating writes across geographically dis-

tributed datacenters. While many systems use Paxos solely

for locking, master election, or replication of metadata and

conﬁgurations, we believe that Megastore is the largest sys-

tem deployed that uses Paxos to replicate primary user data

across datacenters on every write.

Megastore has been widely deployed within Google for

several years [20]. It handles more than three billion write

and 20 billion read transactions daily and stores nearly a

p etabyte of primary data across many global datacenters.

The key contributions of this paper are:

1. the design of a data model and storage system that

allows rapid development of interactive applications

where high availability and scalability are built-in from

the start;

2. an implementation of the Paxos replication and con-

sensus algorithm optimized for low-latency operation

across geographically distributed datacenters to pro-

vide high availability for the system;

3. a report on our experience with a large-scale deploy-

ment of Megastore at Google.

The paper is organized as follows. Section 2 describes how

Megastore provides availability and scalability using parti-

tioning and also justiﬁes the suﬃciency of our design for

many interactive Internet applications. Section 3 provides

an overview of Megastore’s data model and features. Sec-

tion 4 explains the replication algorithms in detail and gives

some measurements on how they perform in practice. Sec-

tion 5 summarizes our experience developing the system.

We review related work in Section 6. Section 7 concludes.

2. TOWARD AVAILABILITY AND SCALE

In contrast to our need for a storage platform that is

global, reliable, and arbitrarily large in scale, our hardware

building blocks are geographically conﬁned, failure-prone,

and suﬀer limited capacity. We must bind these compo-

nents into a uniﬁed ensemble oﬀering greater throughput

and reliability.

To do so, we have taken a two-pronged approach:

• for availability, we implemented a synchronous, fault-

tolerant log replicator optimized for long distance-links;

• for scale, we partitioned data into a vast space of small

databases, each with its own replicated log stored in a

per-replica NoSQL datastore.

2.1 Replication

Replicating data across hosts within a single datacenter

improves availability by overcoming host-speciﬁc failures,

but with diminishing returns. We still must confront the

networks that connect them to the outside world and the

infrastructure that powers, cools, and houses them. Eco-

nomically constructed sites risk some level of facility-wide

outages [25] and are vulnerable to regional disasters. For

cloud storage to meet availability demands, service providers

must replicate data over a wide geographic area.

2.1.1 Strategies

We evaluated common strategies for wide-area replication:

Asynchronous Master/Slave A master node replicates

write-ahead log entries to at least one slave. Log ap-

pends are acknowledged at the master in parallel with

transmission to slaves. The master can support fast

ACID transactions but risks downtime or data loss

during failover to a slave. A consensus protocol is re-

quired to mediate mastership.

Synchronous Master/Slave A master waits for changes

to be mirrored to slaves before acknowledging them,

allowing failover without data loss. Master and slave

failures need timely detection by an external system.

Optimistic Replication Any member of a homogeneous

replica group can accept mutations [23], which are

asynchronously propagated through the group. Avail-

ability and latency are excellent. However, the global

mutation ordering is not known at commit time, so

transactions are impossible.

We avoided strategies which could lose data on failures,

which are common in large-scale systems. We also discarded

strategies that do not permit ACID transactions. Despite

the operational advantages of eventually consistent systems,

it is currently too diﬃcult to give up the read-modify-write

idiom in rapid application development.

We also discarded options with a heavyweight master.

Failover requires a series of high-latency stages often caus-

ing a user-visible outage, and there is still a huge amount

of complexity. Why build a fault-tolerant system to arbi-

trate mastership and failover workﬂows if we could avoid

distinguished masters altogether?

2.1.2 Enter Paxos

We decided to use Paxos, a proven, optimal, fault-tolerant

consensus algorithm with no requirement for a distinguished

master [14, 27]. We replicate a write-ahead log over a group

of symmetric peers. Any node can initiate reads and writes.

Each log append blocks on acknowledgments from a ma-

jority of replicas, and replicas in the minority catch up as

they are able—the algorithm’s inherent fault tolerance elim-

inates the need for a distinguished “failed” state. A novel

extension to Paxos, detailed in Section 4.4.1, allows local

reads at any up-to-date replica. Another extension permits

single-roundtrip writes.

Even with fault tolerance from Paxos, there are limita-

tions to using a single log. With replicas spread over a

wide area, communication latencies limit overall through-

put. Moreover, progress is impeded when no replica is cur-

rent or a majority fail to acknowledge writes. In a traditional

SQL database hosting thousands or millions of users, us-

ing a synchronously replicated log would risk interruptions

of widespread impact [11]. So to improve availability and

throughput we use multiple replicated logs, each governing

its own partition of the data set.

2.2 Partitioning and Locality

To scale our replication scheme and maximize performance

of the underlying datastore, we give applications ﬁne-grained

control over their data’s partitioning and locality.

2.2.1 Entity Groups

To scale throughput and localize outages, we partition our

data into a collection of entity groups [24], each indepen-

dently and synchronously replicated over a wide area. The

underlying data is stored in a scalable NoSQL datastore in

each datacenter (see Figure 1).

Entities within an entity group are mutated with single-

phase ACID transactions (for which the commit record is

224

Figure 1: Scalable Replication

Figure 2: Operations Across Entity Groups

replicated via Paxos). Operations across entity groups could

rely on expensive two-phase commits, but typically leverage

Megastore’s eﬃcient asynchronous messaging. A transac-

tion in a sending entity group places one or more messages

in a queue; transactions in receiving entity groups atomically

consume those messages and apply ensuing mutations.

Note that we use asynchronous messaging between logi-

cally distant entity groups, not physically distant replicas.

All network traﬃc between datacenters is from replicated

op erations, which are synchronous and consistent.

Indexes local to an entity group obey ACID semantics;

those across entity groups have looser consistency. See Fig-

ure 2 for the various operations on and between entity groups.

2.2.2 Selecting Entity Group Boundaries

The entity group deﬁnes the a priori grouping of data

for fast op erations. Boundaries that are too ﬁne-grained

force excessive cross-group operations, but placing too much

unrelated data in a single group serializes unrelated writes,

which degrades throughput.

The following examples show ways applications can work

within these constraints:

Email Each email account forms a natural entity group.

Operations within an account are transactional and

consistent: a user who sends or labels a message is

guaranteed to observe the change despite possible fail-

over to another replica. External mail routers handle

communication between accounts.

Blogs A blogging application would be modeled with mul-

tiple classes of entity groups. Each user has a proﬁle,

which is naturally its own entity group. However, blogs

are collaborative and have no single permanent owner.

We create a second class of entity groups to hold the

p osts and metadata for each blog. A third class keys

oﬀ the unique name claimed by each blog. The appli-

cation relies on asynchronous messaging when a sin-

gle user operation aﬀects both blogs and proﬁles. For

a lower-traﬃc operation like creating a new blog and

claiming its unique name, two-phase commit is more

convenient and performs adequately.

Maps Geographic data has no natural granularity of any

consistent or convenient size. A mapping application

can create entity groups by dividing the glob e into non-

overlapping patches. For mutations that span patches,

the application uses two-phase commit to make them

atomic. Patches must be large enough that two-phase

transactions are uncommon, but small enough that

each patch requires only a small write throughput.

Unlike the previous examples, the number of entity

groups does not grow with increased usage, so enough

patches must be created initially for suﬃcient aggre-

gate throughput at later scale.

Nearly all applications built on Megastore have found nat-

ural ways to draw entity group boundaries.

2.2.3 Physical Layout

We use Google’s Bigtable [15] for scalable fault-tolerant

storage within a single datacenter, allowing us to support

arbitrary read and write throughput by spreading operations

across multiple rows.

We minimize latency and maximize throughput by let-

ting applications control the placement of data: through the

selection of Bigtable instances and speciﬁcation of locality

within an instance.

To minimize latency, applications try to keep data near

users and replicas near each other. They assign each entity

group to the region or continent from which it is accessed

most. Within that region they assign a triplet or quintuplet

of replicas to datacenters with isolated failure domains.

For low latency, cache eﬃciency, and throughput, the data

for an entity group are held in contiguous ranges of Bigtable

rows. Our schema language lets applications control the

placement of hierarchical data, storing data that is accessed

together in nearby rows or denormalized into the same row.

3. A TOUR OF MEGASTORE

Megastore maps this architecture onto a feature set care-

fully chosen to encourage rapid development of scalable ap-

plications. This section motivates the tradeoﬀs and de-

scrib es the developer-facing features that result.

3.1 API Design Philosophy

ACID transactions simplify reasoning about correctness,

but it is equally imp ortant to be able to reason about perfor-

mance. Megastore emphasizes cost-transparent APIs with

runtime costs that match application developers’ intuitions.

Normalized relational schemas rely on joins at query time

to service user operations. This is not the right model for

Megastore applications for several reasons:

• High-volume interactive workloads beneﬁt more from

predictable performance than from an expressive query

language.

225

• Reads dominate writes in our target applications, so it

pays to move work from read time to write time.

• Storing and querying hierarchical data is straightfor-

ward in key-value stores like Bigtable.

With this in mind, we designed a data model and schema

language to oﬀer ﬁne-grained control over physical locality.

Hierarchical layouts and declarative denormalization help

eliminate the need for most joins. Queries specify scans or

lo okups against particular tables and indexes.

Joins, when required, are implemented in application code.

We provide an implementation of the merge phase of the

merge join algorithm, in which the user provides multiple

queries that return primary keys for the same table in the

same order; we then return the intersection of keys for all

the provided queries.

We also have applications that implement outer joins with

parallel queries. This typically involves an index lookup fol-

lowed by parallel index lookups using the results of the ini-

tial lookup. We have found that when the secondary index

lo okups are done in parallel and the number of results from

the ﬁrst lookup is reasonably small, this provides an eﬀective

stand-in for SQL-style joins.

While schema changes require corresponding modiﬁcations

to the query implementation code, this system guarantees

that features are built with a clear understanding of their

p erformance implications. For example, when users (who

may not have a background in databases) ﬁnd themselves

writing something that resembles a nested-loop join algo-

rithm, they quickly realize that it’s better to add an index

and follow the index-based join approach above.

3.2 Data Model

Megastore deﬁnes a data model that lies between the ab-

stract tuples of an RDBMS and the concrete row-column

storage of NoSQL. As in an RDBMS, the data model is de-

clared in a schema and is strongly typ ed. Each schema has

a set of tables, each containing a set of entities, which in

turn contain a set of properties. Properties are named and

typed values. The types can be strings, various ﬂavors of

numbers, or Google’s Protocol Buﬀers [9]. They can be re-

quired, optional, or repeated (allowing a list of values in a

single property). All entities in a table have the same set

of allowable properties. A sequence of properties is used

to form the primary key of the entity, and the primary keys

must be unique within the table. Figure 3 shows an example

schema for a simple photo storage application.

Megastore tables are either entity group root tables or

child tables. Each child table must declare a single distin-

guished foreign key referencing a root table, illustrated by

the ENTITY GROUP KEY annotation in Figure 3. Thus each

child entity references a particular entity in its root table

(called the root entity). An entity group consists of a root

entity along with all entities in child tables that reference it.

A Megastore instance can have several root tables, resulting

in diﬀerent classes of entity groups.

In the example schema of Figure 3, each user’s photo col-

lection is a separate entity group. The root entity is the

User, and the Photos are child entities. Note the Photo.tag

ﬁeld is repeated, allowing multiple tags per Photo without

the need for a sub-table.

3.2.1 Pre-Joining with Keys

While traditional relational modeling recommends that

CREATE SCHEMA PhotoApp;

CREATE TABLE User {

required int64 user_id;

required string name;

} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {

required int64 user_id;

required int32 photo_id;

required int64 time;

required string full_url;

optional string thumbnail_url;

repeated string tag;

} PRIMARY KEY(user_id, photo_id),

IN TABLE User,

ENTITY GROUP KEY(user_id) REFERENCES User;

CREATE LOCAL INDEX PhotosByTime

ON Photo(user_id, time);

CREATE GLOBAL INDEX PhotosByTag

ON Photo(tag) STORING (thumbnail_url);

Figure 3: Sample Schema for Photo Sharing Service

all primary keys take surrogate values, Megastore keys are

chosen to cluster entities that will be read together. Each

entity is mapped into a single Bigtable row; the primary key

values are concatenated to form the Bigtable row key, and

each remaining property occupies its own Bigtable column.

Note how the Photo and User tables in Figure 3 share

a common user id key preﬁx. The IN TABLE User direc-

tive instructs Megastore to colocate these two tables into

the same Bigtable, and the key ordering ensures that Photo

entities are stored adjacent to the corresponding User. This

mechanism can be applied recursively to speed queries along

arbitrary join depths. Thus, users can force hierarchical lay-

out by manipulating the key order.

Schemas declare keys to b e sorted ascending or descend-

ing, or to avert sorting altogether: the SCATTER attribute in-

structs Megastore to prepend a two-byte hash to each key.

Enco ding monotonically increasing keys this way prevents

hotsp ots in large data sets that span Bigtable servers.

3.2.2 Indexes

Secondary indexes can be declared on any list of entity

prop erties, as well as ﬁelds within protocol buﬀers. We dis-

tinguish between two high-level classes of indexes: local and

global (see Figure 2). A local index is treated as separate

indexes for each entity group. It is used to ﬁnd data within

an entity group. In Figure 3, PhotosByTime is an example

of a local index. The index entries are stored in the entity

group and are updated atomically and consistently with the

primary entity data.

A global index spans entity groups. It is used to ﬁnd

entities without knowing in advance the entity groups that

contain them. The PhotosByTag index in Figure 3 is global

and enables discovery of photos marked with a given tag,

regardless of owner. Global index scans can read data owned

by many entity groups but are not guaranteed to reﬂect all

recent updates.

226

评论收藏

内容反馈

hanbo2854

粉丝: 0
资源: 13

google megastore 报告与揭秘

最新资源

google megastore 报告与揭秘

Percolator分布式事务

Google Megastore分布式存储技术全揭秘.doc

Altera IP MegaStore

MEGASTORE V1.0 商城门户多品类外贸独立站商城模板.zip

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

GoogleMegastore分布式存储技术全揭秘

MegaStore:具有数据库交互的 Android 应用程序

Google论文集合

GOOGLE 云计算

google云计算原理与应用

淘宝-分布式海量数据库的探索Wasp

云计算(第二版全)

云计算第二版

这就是搜索引擎

这就是搜索引擎:核心技术详解

wasp:大型商店系统

从GoogleSpanner漫谈分布式存储与数据库技术

云计算 第二版

这就是搜索引擎(mobi).zip

2021-2023 大型企业新兴技术路线图.pdf

基于MATLAB的水果识别的数字图像处理.pdf

SPEI与SPI及ET0指数计算的R语言实现.pdf

软件工程项目管理计划书(完整版).pdf

最新大一《计算机导论》期末考试试题-模拟试题及答案.pdf

chatgpt-技术原理

精美大气的Axure模板

实用Axure模板

高中数学知识点总结超全.pdf

DnsJumper服务器设置更改dns

最新资源

云计算第二版