没有合适的资源?快使用搜索试试~ 我知道了~
Google大数据三篇著名论文1
需积分: 0 4 下载量 128 浏览量
2022-08-04
16:54:15
上传
评论
收藏 2.84MB PDF 举报
温馨提示
试读
124页
The Google File SystemSanjay Ghemawat, Howard Gobioff, and Shun-Tak LeungWe have
资源详情
资源评论
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
∗
ABSTRACT
We have designed and implemented the Google File Sys-
tem, a scalable distributed file system for large distributed
data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers
high aggregate performance to a large number of clients.
While sharing many of the same goals as previous dis-
tributed file systems, our design has been driven by obser-
vations of our application workloads and technological envi-
ronment, both current and anticipated, that reflect a marked
departure from some earlier file system assumptions. This
has led us to reexamine traditional choices and explore rad-
ically different design points.
The file system has successfully met our storage needs.
It is widely deployed within Google as the storage platform
for the generation and processing of data used by our ser-
vice as well as research and development efforts that require
large data sets. The largest cluster to date provides hun-
dreds of terabytes of storage across thousands of disks on
over a thousand machines, and it is concurrently accessed
by hundreds of clients.
In this paper, we present file system interface extensions
designed to support distributed applications, discuss many
aspects of our design, and report measurements from both
micro-benchmarks and real world use.
Categories and Subject Descriptors
D[4]: 3—Distributed file systems
General Terms
Design, reliability, performance, measurement
Keywords
Fault tolerance, scalability, data storage, clustered storage
∗
The authors can be reached at the following addresses:
{sanjay,hgobioff,shuntak}@google.com.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...
$5.00.
1. INTRODUCTION
We have designed and implemented the Google File Sys-
tem (GFS) to meet the rapidly growing demands of Google’s
data processing needs. GFS shares many of the same goals
as previous distributed file systems such as performance,
scalability, reliability, and availability. However, its design
has been driven by key observations of our application work-
loads and technological environment, both current and an-
ticipated, that reflect a marked departure from some earlier
file system design assumptions. We have reexamined tradi-
tional choices and explored radically different points in the
design space.
First, component failures are the norm rather than the
exception. The file system consists of hundreds or even
thousands of storage machines built from inexpensive com-
modity parts and is accessed by a comparable number of
client machines. The quantity and quality of the compo-
nents virtually guarantee that some are not functional at
any given time and some will not recover from their cur-
rent failures. We have seen problems caused by application
bugs, operating system bugs, human errors, and the failures
of disks, memory, connectors, networking, and power sup-
plies. Therefore, constant monitoring, error detection, fault
tolerance, and automatic recovery must be integral to the
system.
Second, files are huge by traditional standards. Multi-GB
files are common. Each file typically contains many applica-
tion objects such as web documents. When we are regularly
working with fast growing data sets of many TBs comprising
billions of objects, it is unwieldy to manage billions of ap-
proximately KB-sized files even when the file system could
support it. As a result, design assumptions and parameters
such as I/O operation and block sizes have to be revisited.
Third, most files are mutated by appending new data
rather than overwriting existing data. Random writes within
a file are practically non-existent. Once written, the files
are only read, and often only sequentially. A variety of
data share these characteristics. Some may constitute large
repositories that data analysis programs scan through. Some
may be data streams continuously generated by running ap-
plications. Some may be archival data. Some may be in-
termediate results produced on one machine and processed
on another, whether simultaneously or later in time. Given
this access pattern on huge files, appending becomes the fo-
cus of performance optimization and atomicity guarantees,
while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system
API benefits the overall system by increasing our flexibility.
For example, we have relaxed GFS’s consistency model to
vastly simplify the file system without imposing an onerous
burden on the applications. We have also introduced an
atomic append operation so that multiple clients can append
concurrently to a file without extra synchronization between
them. These will be discussed in more details later in the
paper.
Multiple GFS clusters are currently deployed for different
purposes. The largest ones have over 1000 storage nodes,
over 300 TB of disk storage, and are heavily accessed by
hundreds of clients on distinct machines on a continuous
basis.
2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been
guided by assumptions that offer both challenges and op-
portunities. We alluded to some key observations earlier
and now lay out our assumptions in more details.
• The system is built from many inexpensive commodity
components that often fail. It must constantly monitor
itself and detect, tolerate, and recover promptly from
component failures on a routine basis.
• The system stores a modest number of large files. We
expect a few million files, each typically 100 MB or
larger in size. Multi-GB files are the common case
and should be managed efficiently. Small files must be
supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads:
large streaming reads and small random reads. In
large streaming reads, individual operations typically
read hundreds of KBs, more commonly 1 MB or more.
Successive operations from the same client often read
through a contiguous region of a file. A small ran-
dom read typically reads a few KBs at some arbitrary
offset. Performance-conscious applications often batch
and sort their small reads to advance steadily through
the file rather than go back and forth.
• The workloads also have many large, sequential writes
that append data to files. Typical operation sizes are
similar to those for reads. Once written, files are sel-
dom modified again. Small writes at arbitrary posi-
tions in a file are supported but do not have to be
efficient.
• The system must efficiently implement well-defined se-
mantics for multiple clients that concurrently append
to the same file. Our files are often used as producer-
consumer queues or for many-way merging. Hundreds
of producers, running one per machine, will concur-
rently append to a file. Atomicity with minimal syn-
chronization overhead is essential. The file may be
read later, or a consumer may be reading through the
file simultaneously.
• High sustained bandwidth is more important than low
latency. Most of our target applications place a pre-
mium on processing data in bulk at a high rate, while
few have stringent response time requirements for an
individual read or write.
2.2 Interface
GFS provides a familiar file system interface, though it
does not implement a standard API such as POSIX. Files are
organized hierarchically in directories and identified by path-
names. We support the usual operations to create, delete,
open, close, read,andwrite files.
Moreover, GFS has snapshot and record append opera-
tions. Snapshot creates a copy of a file or a directory tree
at low cost. Record append allows multiple clients to ap-
pend data to the same file concurrently while guaranteeing
the atomicity of each individual client’s append. It is use-
ful for implementing multi-way merge results and producer-
consumer queues that many clients can simultaneously ap-
pend to without additional locking. We have found these
types of files to be invaluable in building large distributed
applications. Snapshot and record append are discussed fur-
ther in Sections 3.4 and 3.3 respectively.
2.3 Architecture
A GFS cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients,asshown
in Figure 1. Each of these is typically a commodity Linux
machine running a user-level server process. It is easy to run
both a chunkserver and a client on the same machine, as long
as machine resources permit and the lower reliability caused
by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is
identified by an immutable and globally unique 64 bit chunk
handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and
read or write chunk data specified by a chunk handle and
byte range. For reliability, each chunk is replicated on multi-
ple chunkservers. By default, we store three replicas, though
users can designate different replication levels for different
regions of the file namespace.
The master maintains all file system metadata. This in-
cludes the namespace, access control information, the map-
ping from files to chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease
management, garbage collection of orphaned chunks, and
chunk migration between chunkservers. The master peri-
odically communicates with each chunkserver in HeartBeat
messages to give it instructions and collect its state.
GFS client code linked into each application implements
the file system API and communicates with the master and
chunkservers to read or write data on behalf of the applica-
tion. Clients interact with the master for metadata opera-
tions, but all data-bearing communication goes directly to
the chunkservers. We do not provide the POSIX API and
therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data.
Client caches offer little benefit because most applications
stream through huge files or have working sets too large
to be cached. Not having them simplifies the client and
the overall system by eliminating cache coherence issues.
(Clients do cache metadata, however.) Chunkservers need
not cache file data because chunks are stored as local files
and so Linux’s buffer cache already keeps frequently accessed
data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and
enables the master to make sophisticated chunk placement
Legend:
Data messages
Control messages
Application
(file name, chunk index)
(chunk handle,
chunk locations)
GFS master
File namespace
/foo/bar
Instructions to chunkserver
Chunkserver state
GFS chunkserverGFS chunkserver
(chunk handle, byte range)
chunk data
chunk 2ef0
Linux file system Linux file system
GFS client
Figure 1: GFS Architecture
and replication decisions using global knowledge. However,
we must minimize its involvement in reads and writes so
that it does not become a bottleneck. Clients never read
and write file data through the master. Instead, a client asks
the master which chunkservers it should contact. It caches
this information for a limited time and interacts with the
chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with refer-
ence to Figure 1. First, using the fixed chunk size, the client
translates the file name and byte offset specified by the ap-
plication into a chunk index within the file. Then, it sends
the master a request containing the file name and chunk
index. The master replies with the corresponding chunk
handle and locations of the replicas. The client caches this
information using the file name and chunk index as the key.
The client then sends a request to one of the replicas,
most likely the closest one. The request specifies the chunk
handle and a byte range within that chunk. Further reads
of the same chunk require no more client-master interaction
until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the
same request and the master can also include the informa-
tion for chunks immediately following those requested. This
extra information sidesteps several future client-master in-
teractions at practically no extra cost.
2.5 Chunk Size
Chunk size is one of the key design parameters. We have
chosen 64 MB, which is much larger than typical file sys-
tem block sizes. Each chunk replica is stored as a plain
Linux file on a chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal
fragmentation, perhaps the greatest objection against such
a large chunk size.
A large chunk size offers several important advantages.
First, it reduces clients’ need to interact with the master
because reads and writes on the same chunk require only
one initial request to the master for chunk location informa-
tion. The reduction is especially significant for our work-
loads because applications mostly read and write large files
sequentially. Even for small random reads, the client can
comfortably cache all the chunk location information for a
multi-TB working set. Second, since on a large chunk, a
client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persis-
tent TCP connection to the chunkserver over an extended
period of time. Third, it reduces the size of the metadata
stored on the master. This allows us to keep the metadata
in memory, which in turn brings other advantages that we
will discuss in Section 2.6.1.
On the other hand, a large chunk size, even with lazy space
allocation, has its disadvantages. A small file consists of a
small number of chunks, perhaps just one. The chunkservers
storing those chunks may become hot spots if many clients
are accessing the same file. In practice, hot spots have not
been a major issue because our applications mostly read
large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used
by a batch-queue system: an executable was written to GFS
as a single-chunk file and then started on hundreds of ma-
chines at the same time. The few chunkservers storing this
executable were overloaded by hundreds of simultaneous re-
quests. We fixed this problem by storing such executables
with a higher replication factor and by making the batch-
queue system stagger application start times. A potential
long-term solution is to allow clients to read data from other
clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file
and chunk namespaces, the mapping from files to chunks,
and the locations of each chunk’s replicas. All metadata is
kept in the master’s memory. The first two types (names-
paces and file-to-chunk mapping) are also kept persistent by
logging mutations to an operation log stored on the mas-
ter’s local disk and replicated on remote machines. Using
a log allows us to update the master state simply, reliably,
and without risking inconsistencies in the event of a master
crash. The master does not store chunk location informa-
tion persistently. Instead, it asks each chunkserver about its
chunks at master startup and whenever a chunkserver joins
the cluster.
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are
fast. Furthermore, it is easy and efficient for the master to
periodically scan through its entire state in the background.
This periodic scanning is used to implement chunk garbage
collection, re-replication in the presence of chunkserver fail-
ures, and chunk migration to balance load and disk space
usage across chunkservers. Sections 4.3 and 4.4 will discuss
these activities further.
One potential concern for this memory-only approach is
that the number of chunks and hence the capacity of the
whole system is limited by how much memory the master
has. This is not a serious limitation in practice. The mas-
ter maintains less than 64 bytes of metadata for each 64 MB
chunk. Most chunks are full because most files contain many
chunks, only the last of which may be partially filled. Sim-
ilarly, the file namespace data typically requires less then
64 bytes per file because it stores file names compactly us-
ing prefix compression.
If necessary to support even larger file systems, the cost
of adding extra memory to the master is a small price to pay
for the simplicity, reliability, performance, and flexibility we
gain by storing the metadata in memory.
2.6.2 Chunk Locations
The master does not keep a persistent record of which
chunkservers have a replica of a given chunk. It simply polls
chunkservers for that information at startup. The master
can keep itself up-to-date thereafter because it controls all
chunk placement and monitors chunkserver status with reg-
ular HeartBeat messages.
We initially attempted to keep chunk location information
persistently at the master, but we decided that it was much
simpler to request the data from chunkservers at startup,
and periodically thereafter. This eliminated the problem of
keeping the master and chunkservers in sync as chunkservers
join and leave the cluster, change names, fail, restart, and
so on. In a cluster with hundreds of servers, these events
happen all too often.
Another way to understand this design decision is to real-
ize that a chunkserver has the final word over what chunks
it does or does not have on its own disks. There is no point
in trying to maintain a consistent view of this information
on the master because errors on a chunkserver may cause
chunks to vanish spontaneously (e.g., a disk may go bad
and be disabled) or an operator may rename a chunkserver.
2.6.3 Operation Log
The operation log contains a historical record of critical
metadata changes. It is central to GFS. Not only is it the
only persistent record of metadata, but it also serves as a
logical time line that defines the order of concurrent op-
erations. Files and chunks, as well as their versions (see
Section 4.5), are all uniquely and eternally identified by the
logical times at which they were created.
Since the operation log is critical, we must store it reli-
ably and not make changes visible to clients until metadata
changes are made persistent. Otherwise, we effectively lose
the whole file system or recent client operations even if the
chunks themselves survive. Therefore, we replicate it on
multiple remote machines and respond to a client opera-
tion only after flushing the corresponding log record to disk
both locally and remotely. The master batches several log
records together before flushing thereby reducing the impact
of flushing and replication on overall system throughput.
The master recovers its file system state by replaying the
operation log. To minimize startup time, we must keep the
log small. The master checkpoints its state whenever the log
grows beyond a certain size so that it can recover by loading
the latest checkpoint from local disk and replaying only the
Wri te Record Append
Serial defined defined
success interspersed with
Concurrent consistent inconsistent
successes but undefined
Failure inconsistent
Table 1: File Region State After Mutation
limited number of log records after that. The checkpoint is
in a compact B-tree like form that can be directly mapped
into memory and used for namespace lookup without ex-
tra parsing. This further speeds up recovery and improves
availability.
Because building a checkpoint can take a while, the mas-
ter’s internal state is structured in such a way that a new
checkpoint can be created without delaying incoming muta-
tions. The master switches to a new log file and creates the
new checkpoint in a separate thread. The new checkpoint
includes all mutations before the switch. It can be created
in a minute or so for a cluster with a few million files. When
completed, it is written to disk both locally and remotely.
Recovery needs only the latest complete checkpoint and
subsequent log files. Older checkpoints and log files can
be freely deleted, though we keep a few around to guard
against catastrophes. A failure during checkpointing does
not affect correctness because the recovery code detects and
skips incomplete checkpoints.
2.7 Consistency Model
GFS has a relaxed consistency model that supports our
highly distributed applications well but remains relatively
simple and efficient to implement. We now discuss GFS’s
guarantees and what they mean to applications. We also
highlight how GFS maintains these guarantees but leave the
details to other parts of the paper.
2.7.1 Guarantees by GFS
File namespace mutations (e.g., file creation) are atomic.
They are handled exclusively by the master: namespace
locking guarantees atomicity and correctness (Section 4.1);
the master’s operation log defines a global total order of
these operations (Section 2.6.3).
The state of a file region after a data mutation depends
on the type of mutation, whether it succeeds or fails, and
whether there are concurrent mutations. Table 1 summa-
rizes the result. A file region is consistent if all clients will
always see the same data, regardless of which replicas they
read from. A region is defined after a file data mutation if it
is consistent and clients will see what the mutation writes in
its entirety. When a mutation succeeds without interference
from concurrent writers, the affected region is defined (and
by implication consistent): all clients will always see what
the mutation has written. Concurrent successful mutations
leave the region undefined but consistent: all clients see the
same data, but it may not reflect what any one mutation
has written. Typically, it consists of mingled fragments from
multiple mutations. A failed mutation makes the region in-
consistent (hence also undefined): different clients may see
different data at different times. We describe below how our
applications can distinguish defined regions from undefined
regions. The applications do not need to further distinguish
between different kinds of undefined regions.
Data mutations may be writes or record appends.Awrite
causes data to be written at an application-specified file
offset. A record append causes data (the “record”) to be
appended atomically at least once even in the presence of
concurrent mutations, but at an offset of GFS’s choosing
(Section 3.3). (In contrast, a “regular” append is merely a
write at an offset that the client believes to be the current
end of file.) The offset is returned to the client and marks
the beginning of a defined region that contains the record.
In addition, GFS may insert padding or record duplicates in
between. They occupy regions considered to be inconsistent
and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file
region is guaranteed to be defined and contain the data writ-
ten by the last mutation. GFS achieves this by (a) applying
mutations to a chunk in the same order on all its replicas
(Section 3.1), and (b) using chunk version numbers to detect
any replica that has become stale because it has missed mu-
tations while its chunkserver was down (Section 4.5). Stale
replicas will never be involved in a mutation or given to
clients asking the master for chunk locations. They are
garbage collected at the earliest opportunity.
Since clients cache chunk locations, they may read from a
stale replica before that information is refreshed. This win-
dow is limited by the cache entry’s timeout and the next
open of the file, which purges from the cache all chunk in-
formation for that file. Moreover, as most of our files are
append-only, a stale replica usually returns a premature
end of chunk rather than outdated data. When a reader
retries and contacts the master, it will immediately get cur-
rent chunk locations.
Long after a successful mutation, component failures can
of course still corrupt or destroy data. GFS identifies failed
chunkservers by regular handshakes between master and all
chunkservers and detects data corruption by checksumming
(Section 5.2). Once a problem surfaces, the data is restored
from valid replicas as soon as possible (Section 4.3). A chunk
is lost irreversibly only if all its replicas are lost before GFS
can react, typically within minutes. Even in this case, it be-
comes unavailable, not corrupted: applications receive clear
errors rather than corrupt data.
2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consis-
tency model with a few simple techniques already needed for
other purposes: relying on appends rather than overwrites,
checkpointing, and writing self-validating, self-identifying
records.
Practically all our applications mutate files by appending
rather than overwriting. In one typical use, a writer gener-
ates a file from beginning to end. It atomically renames the
file to a permanent name after writing all the data, or pe-
riodically checkpoints how much has been successfully writ-
ten. Checkpoints may also include application-level check-
sums. Readers verify and process only the file region up
to the last checkpoint, which is known to be in the defined
state. Regardless of consistency and concurrency issues, this
approach has served us well. Appending is far more effi-
cient and more resilient to application failures than random
writes. Checkpointing allows writers to restart incremen-
tally and keeps readers from processing successfully written
file data that is still incomplete from the application’s per-
spective.
In the other typical use, many writers concurrently ap-
pend to a file for merged results or as a producer-consumer
queue. Record append’s append-at-least-once semantics pre-
serves each writer’s output. Readers deal with the occa-
sional padding and duplicates as follows. Each record pre-
pared by the writer contains extra information like check-
sums so that its validity can be verified. A reader can
identify and discard extra padding and record fragments
using the checksums. If it cannot tolerate the occasional
duplicates (e.g., if they would trigger non-idempotent op-
erations), it can filter them out using unique identifiers in
the records, which are often needed anyway to name corre-
sponding application entities such as web documents. These
functionalities for record I/O (except duplicate removal) are
in library code shared by our applications and applicable to
other file interface implementations at Google. With that,
the same sequence of records, plus rare duplicates, is always
delivered to the record reader.
3. SYSTEM INTERACTIONS
We designed the system to minimize the master’s involve-
ment in all operations. With that background, we now de-
scribe how the client, master, and chunkservers interact to
implement data mutations, atomic record append, and snap-
shot.
3.1 Leases and Mutation Order
A mutation is an operation that changes the contents or
metadata of a chunk such as a write or an append opera-
tion. Each mutation is performed at all the chunk’s replicas.
We use leases to maintain a consistent mutation order across
replicas. The master grants a chunk lease to one of the repli-
cas, which we call the primary. The primary picks a serial
order for all mutations to the chunk. All replicas follow this
order when applying mutations. Thus, the global mutation
order is defined first by the lease grant order chosen by the
master, and within a lease by the serial numbers assigned
by the primary.
The lease mechanism is designed to minimize manage-
ment overhead at the master. A lease has an initial timeout
of 60 seconds. However, as long as the chunk is being mu-
tated, the primary can request and typically receive exten-
sions from the master indefinitely. These extension requests
and grants are piggybacked on the HeartBeat messages reg-
ularly exchanged between the master and all chunkservers.
The master may sometimes try to revoke a lease before it
expires (e.g., when the master wants to disable mutations
on a file that is being renamed). Even if the master loses
communication with a primary, it can safely grant a new
lease to another replica after the old lease expires.
In Figure 2, we illustrate this process by following the
control flow of a write through these numbered steps.
1. The client asks the master which chunkserver holds
the current lease for the chunk and the locations of
the other replicas. If no one has a lease, the master
grants one to a replica it chooses (not shown).
2. The master replies with the identity of the primary and
the locations of the other (secondary) replicas. The
client caches this data for future mutations. It needs
to contact the master again only when the primary
剩余123页未读,继续阅读
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功
评论0
最新资源