IRON File Systems
Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
Computer Sciences Department, University of Wisconsin, Madison
{vijayan,laksh,nitina,haryadi,dusseau,remzi}@cs.wisc.edu
ABSTRACT
Commodity file systems trust disks to either work or fail com-
pletely, yet modern disks exhibit more complex failure modes. We
suggest a new fail-partial failure model for disks, which incorpo-
rates realistic localized faults such as latent sector errors and block
corruption. We then develop and apply a novel failure-policy fin-
gerprinting framework, to investigate how commodity file systems
react to a range of more realistic disk failures. We classify their
failure policies in a new taxonomy that measures their Internal RO-
bustNess (IRON), which includes both failure detection and recov-
ery techniques. We show that commodity file system failure poli-
cies are often inconsistent, sometimes buggy, and generally inade-
quate in their ability to recover from partial disk failures. Finally,
we design, implement, and evaluate a prototype IRON file system,
Linux ixt3, showing that techniques such as in-disk checksumming,
replication, and parity greatly enhance file system robustness while
incurring minimal time and space overheads.
Categories and Subject Descriptors:
D.4.3 [Operating Systems]: File Systems Management
D.4.5 [Operating Systems]: Reliability
General Terms: Design, Experimentation, Reliability
Keywords: IRON file systems, disks, storage, latent sector errors,
block corruption, fail-partial failure model, fault tolerance, reliabil-
ity, internal redundancy
1. INTRODUCTION
Disks fail – but not in the way most commodity file systems ex-
pect. For many years, file system and storage system designers have
assumed that disks operate in a “fail stop” manner [56]; within this
classic model, the disks either are working perfectly, or fail abso-
lutely and in an easily detectable manner.
The fault model presented by modern disk drives, however, is
much more complex. For example, modern drives can exhibit la-
tent sector faults [16, 34, 57], where a block or set of blocks are
inaccessible. Worse, blocks sometimes become silently corrupted
[9, 26, 73]. Finally, disks sometimes exhibit transient performance
problems [7, 67].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’05, October 23–26, 2005, Brighton, United Kingdom.
Copyright 2005 ACM 1-59593-079-5/05/0010 ...$5.00.
There are many reasons for these complex failures in disks. For
example, a buggy disk controller could issue a “misdirected” write
[73], placing the correct data on disk but in the wrong location. In-
terestingly, while these failures exist today, simply waiting for disk
technology to improve will not remove these errors: indeed, these
errors may worsen over time, due to increasing drive complexity
[5], immense cost pressures in the storage industry, and the esca-
lated use of less reliable ATA disks – not only in desktop PCs but
also in large-scale clusters [23] and storage systems [20, 28].
Developers of high-end systems have realized the nature of these
disk faults and built mechanisms into their systems to handle them.
For example, many redundant storage systems incorporate a back-
ground disk scrubbing process [33, 57], to proactively detect and
subsequently correct latent sector errors by creating a new copy of
inaccessible blocks; some recent storage arrays incorporate extra
levels of redundancy to lessen the potential damage of undiscov-
ered latent errors [16]. Similarly, highly-reliable systems (e.g., Tan-
dem NonStop) utilize end-to-end checksums to detect when block
corruption occurs [9].
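To make the scrubbing idea concrete, the following is a minimal sketch of such a background scrub loop, assuming a hypothetical read_block/write_block interface and a mirrored copy to repair from; it is an illustration of the technique, not the code of any particular storage array.

```python
import time

class LatentSectorError(Exception):
    """Raised by the (assumed) disk interface when a block cannot be read."""

def scrub(primary, replica, num_blocks, pause_s=0.01):
    """Background scrubber: read every block and, on a latent sector error,
    rebuild the block from a redundant copy.

    `primary` and `replica` are assumed to expose read_block(n) and
    write_block(n, data); these names are illustrative, not a real API.
    """
    for blk in range(num_blocks):
        try:
            primary.read_block(blk)          # touching the block surfaces latent errors
        except LatentSectorError:
            good = replica.read_block(blk)   # fetch the redundant copy
            primary.write_block(blk, good)   # rewriting lets the drive remap the bad sector
        time.sleep(pause_s)                  # pace the scan so foreground I/O is not starved
```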
Unfortunately, such technology has not filtered down to the realm
of commodity file systems, including Linux file systems such as
ext3 [71], ReiserFS [49], and IBM’s JFS [11], or Windows file sys-
tems such as NTFS [63]. Such file systems are not only pervasive
in the home environment, storing valuable (and often non-archived)
user data such as photos, home movies, and tax returns, but also in
many internet services such as Google [23].
In this paper, the first question we pose is: how do modern com-
modity file systems react to failures that are common in modern
disks? To answer this query, we aggregate knowledge from the re-
search literature, industry, and field experience to form a new model
for disk failure. We label our model the fail-partial failure model
to emphasize that portions of the disk can fail, either through block
errors or data corruption.
With the model in place, we develop and apply an automated
failure-policy fingerprinting framework, to inject more realistic disk
faults beneath a file system. The goal of fingerprinting is to unearth
the failure policy of each system: how it detects and recovers from
disk failures. Our approach leverages gray-box knowledge [6, 62]
of file system data structures to meticulously exercise file system
access paths to disk.
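As a user-level illustration of the idea (the framework itself runs beneath the file system, in the kernel), type-aware fault injection can be modeled as a wrapper that fails or corrupts chosen blocks; the class and method names below are assumptions made for the sketch.

```python
import errno

class FaultInjectingDisk:
    """Wraps a backing store and injects faults for chosen block numbers.

    Illustrative user-level model of the fingerprinting idea: the real
    framework sits beneath the file system and uses knowledge of on-disk
    structures to decide which block (inode, journal, directory, ...) to fail.
    """

    def __init__(self, backing, fail_read=(), fail_write=(), corrupt=()):
        self.backing = backing              # dict: block number -> bytes
        self.fail_read = set(fail_read)     # blocks whose reads return an error
        self.fail_write = set(fail_write)   # blocks whose writes return an error
        self.corrupt = set(corrupt)         # blocks whose reads return bad data

    def read_block(self, blk):
        if blk in self.fail_read:
            raise OSError(errno.EIO, "injected latent sector error", blk)
        data = self.backing[blk]
        if blk in self.corrupt:
            data = bytes(b ^ 0xFF for b in data)   # silently flip bits
        return data

    def write_block(self, blk, data):
        if blk in self.fail_write:
            raise OSError(errno.EIO, "injected write error", blk)
        self.backing[blk] = data

# Example: fail reads of block 7 (say, a journal block) and observe whether
# the file system under test detects the error and how it recovers.
disk = FaultInjectingDisk({n: b"\x00" * 4096 for n in range(16)}, fail_read=[7])
```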
To better characterize failure policy, we develop an Internal RO-
bustNess (IRON) taxonomy, which catalogs a broad range of detec-
tion and recovery techniques. Hence, the output of our fingerprint-
ing tool is a broad categorization of which IRON techniques a file
system uses across its constituent data structures.
Our study focuses on three important and substantially different
open-source file systems, ext3, ReiserFS, and IBM’s JFS, and one
closed-source file system, Windows NTFS. Across all platforms,
we find a great deal of illogical inconsistency in failure policy, of-
ten due to the diffusion of failure handling code through the ker-
nel; such inconsistency leads to substantially different detection
and recovery strategies under similar fault scenarios, resulting in
unpredictable and often undesirable fault-handling strategies. We
also discover that most systems implement portions of their fail-
ure policy incorrectly; the presence of bugs in the implementa-
tions demonstrates the difficulty and complexity of correctly han-
dling certain classes of disk failure. We observe little tolerance of
transient failures; most file systems assume a single temporarily-
inaccessible block indicates a fatal whole-disk failure. Finally, we
show that none of the file systems can recover from partial disk
failures, due to a lack of in-disk redundancy.
This behavior under realistic disk failures leads us to our second
question: how can we change file systems to better handle modern
disk failures? We advocate a single guiding principle for the design
of file systems: don’t trust the disk. The file system should not view
the disk as an utterly reliable component. For example, if blocks
can become corrupt, the file system should apply measures to both
detect and recover from such corruption, even when running on a
single disk. Our approach is an instance of the end-to-end argument
[53]: at the top of the storage stack, the file system is fundamentally
responsible for reliable management of its data and metadata.
In our initial efforts, we develop a family of prototype IRON file
systems, all of which are robust variants of the Linux ext3 file sys-
tem. Within our IRON ext3 (ixt3), we investigate the costs of using
checksums to detect data corruption, replication to provide redun-
dancy for metadata structures, and parity protection for user data.
We show that these techniques incur modest space and time over-
heads while greatly increasing the robustness of the file system to
latent sector errors and data corruption. By implementing detec-
tion and recovery techniques from the IRON taxonomy, a system
can implement a well-defined failure policy and subsequently pro-
vide vigorous protection against the broader range of disk failures.
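For concreteness, a minimal sketch of the detection-plus-recovery pattern that ixt3 builds on follows, using CRC32 checksums for detection and XOR parity for reconstruction; the block-group layout and function names are illustrative only, not ixt3's on-disk format.

```python
import zlib

def xor_blocks(blocks):
    """Compute a parity block as the bytewise XOR of equal-sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def write_group(blocks):
    """Store per-block checksums and a parity block alongside the data."""
    return {
        "data": list(blocks),
        "sums": [zlib.crc32(b) for b in blocks],
        "parity": xor_blocks(blocks),
    }

def read_block(group, idx):
    """Verify the checksum on read; if it fails, reconstruct from parity."""
    data = group["data"][idx]
    if zlib.crc32(data) == group["sums"][idx]:
        return data
    # Corruption detected: rebuild the block from the survivors plus parity.
    survivors = [b for i, b in enumerate(group["data"]) if i != idx]
    return xor_blocks(survivors + [group["parity"]])
```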
The contributions of this paper are as follows:
• We define a more realistic failure model for modern disks
(the fail-partial model) (§2).
• We formalize the techniques to detect and recover from disk
errors under the IRON taxonomy (§3).
• We develop a fingerprinting framework to determine the fail-
ure policy of a file system (§4).
• We analyze four popular commodity file systems to discover
how they handle disk errors (§5).
• We build a prototype version of an IRON file system (ixt3)
and analyze its robustness to disk failure and its performance
characteristics (§6).
To bring the paper to a close, we discuss related work (§7), and
finally conclude (§8).
2. DISK FAILURE
There are many reasons that the file system may see errors in
the storage system below. In this section, we first discuss common
causes of disk failure. We then present a new, more realistic fail-
partial model for disks and discuss various aspects of this model.
2.1 The Storage Subsystem
Figure 1 presents a typical layered storage subsystem below the
file system. An error can occur in any of these layers and propagate
itself to the file system above.
[Figure 1: The Storage Stack. We present a schematic of the entire storage stack. At the top is the file system; beneath are the many layers of the storage subsystem. Gray shading implies software or firmware, whereas white (unshaded) is hardware. Layers shown, from host to disk: generic file system, specific file system, generic block I/O, device driver, device controller, transport, firmware, cache, electrical, mechanical, and media.]
At the bottom of the “storage stack” is the disk itself; beyond the
magnetic storage media, there are mechanical (e.g., the motor and
arm assembly) and electrical components (e.g., busses). A particu-
larly important component is firmware – the code embedded within
the drive to control most of its higher-level functions, including
caching, disk scheduling, and error handling. This firmware code
is often substantial and complex (e.g., a modern Seagate drive con-
tains roughly 400,000 lines of code [19]).
Connecting the drive to the host is the transport. In low-end sys-
tems, the transport medium is often a bus (e.g., SCSI), whereas
networks are common in higher-end systems (e.g., FibreChannel).
At the top of the stack is the host. Herein there is a hardware
controller that communicates with the device, and above it a soft-
ware device driver that controls the hardware. Block-level software
forms the next layer, providing a generic device interface and im-
plementing various optimizations (e.g., request reordering).
Above all other software is the file system. This layer is often
split into two pieces: a high-level component common to all file
systems, and a specific component that maps generic operations
onto the data structures of the particular file system. A standard
interface (e.g., Vnode/VFS [36]) is positioned between the two.
2.2 Why Do Disks Fail?
To motivate our failure model, we first describe how errors in the
layers of the storage stack can cause failures.
Media: There are two primary errors that occur in the magnetic
media. First, the classic problem of “bit rot” occurs when the mag-
netism of a single bit or a few bits is flipped. This type of problem
can often (but not always) be detected and corrected with low-level
ECC embedded in the drive. Second, physical damage can occur on
the media. The quintessential “head crash” is one culprit, where the
drive head contacts the surface momentarily. A media scratch can
also occur when a particle is trapped between the drive head and the
media [57]. Such dangers are well-known to drive manufacturers,
and hence modern disks park the drive head when the drive is not
in use to reduce the number of head crashes; SCSI disks sometimes
include filters to remove particles [5]. Media errors most often lead
to permanent failure or corruption of individual disk blocks.
Mechanical: “Wear and tear” eventually leads to failure of moving
parts. A drive motor can spin irregularly or fail completely. Erratic
arm movements can cause head crashes and media flaws; inaccu-
rate arm movement can misposition the drive head during writes,
leaving blocks inaccessible or corrupted upon subsequent reads.
Electrical: A power spike or surge can damage in-drive circuits
and hence lead to drive failure [68]. Thus, electrical problems can
lead to entire disk failure.
Drive firmware: Interesting errors arise in the drive controller,
which consists of many thousands of lines of real-time, concurrent
firmware. For example, disks have been known to return correct
data but circularly shifted by a byte [37] or have memory leaks
that lead to intermittent failures [68]. Other firmware problems
can lead to poor drive performance [54]. Some firmware bugs are
well-enough known in the field that they have specific names; for
example, “misdirected” writes are writes that place the correct data
on the disk but in the wrong location, and “phantom” writes are
writes that the drive reports as completed but that never reach the
media [73]. Phantom writes can be caused by a buggy or even mis-
configured cache (i.e., write-back caching is enabled). In summary,
drive firmware errors often lead to sticky or transient block corrup-
tion but can also lead to performance problems.
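One way a file system can detect a misdirected write is to store identity information (the block's intended logical address) alongside a checksum and verify both on every read; the sketch below illustrates the check with a made-up on-block layout. Phantom writes require additional machinery, such as read-after-write verification or version numbers, and are not covered here.

```python
import zlib

def seal(block_num, data):
    """Attach the block's logical address and a checksum to its payload.

    Illustrative format only: 8-byte block number, 4-byte CRC32, then data.
    """
    header = block_num.to_bytes(8, "little")
    crc = zlib.crc32(header + data)
    return header + crc.to_bytes(4, "little") + data

def check(expected_block_num, sealed):
    """Detect misdirected writes (wrong address) and corruption (bad CRC)."""
    stored_num = int.from_bytes(sealed[:8], "little")
    stored_crc = int.from_bytes(sealed[8:12], "little")
    data = sealed[12:]
    if stored_num != expected_block_num:
        return None, "misdirected write: block carries another block's address"
    if zlib.crc32(sealed[:8] + data) != stored_crc:
        return None, "corrupted block: checksum mismatch"
    return data, "ok"
```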
Transport: The transport connecting the drive and host can also be
problematic. For example, a study of a large disk farm [67] reveals
that most of the systems tested had interconnect problems, such
as bus timeouts. Parity errors also occurred with some frequency,
either causing requests to succeed (slowly) or fail altogether. Thus,
the transport often causes transient errors for the entire drive.
Bus controller: The main bus controller can also be problematic.
For example, the EIDE controller on a particular series of moth-
erboards incorrectly indicates completion of a disk request before
the data has reached the main memory of the host, leading to data
corruption [72]. A similar problem causes some other controllers to
return status bits as data if the floppy drive is in use at the same time
as the hard drive [26]. Others have also observed IDE protocol ver-
sion problems that yield corrupt data [23]. In summary, controller
problems can lead to transient block failure and data corruption.
Low-level drivers: Recent research has shown that device driver
code is more likely to contain bugs than the rest of the operating
system [15, 22, 66]. While some of these bugs will likely crash the
operating system, others can issue disk requests with bad parame-
ters, data, or both, resulting in data corruption.
2.3 The Fail-Partial Failure Model
From our discussion of the many root causes for failure, we are
now ready to put forth a more realistic model of disk failure. In our
model, failures manifest themselves in three ways:
• Entire disk failure: The entire disk is no longer accessible. If
permanent, this is the classic “fail-stop” failure.
• Block failure: One or more blocks are not accessible; often re-
ferred to as “latent sector errors” [33, 34].
• Block corruption: The data within individual blocks is altered.
Corruption is particularly insidious because it is silent – the storage
subsystem simply returns “bad” data upon a read.
We term this model the Fail-Partial Failure Model, to empha-
size that pieces of the storage subsystem can fail. We now discuss
some other key elements of the fail-partial model, including the
transience, locality, and frequency of failures, and then discuss how
technology and market trends will impact disk failures over time.
2.3.1 Transience of Failures
In our model, failures can be “sticky” (permanent) or “transient”
(temporary). Which behavior manifests itself depends upon the
root cause of the problem. For example, a low-level media problem
portends the failure of subsequent requests. In contrast, a transport
or higher-level software issue might at first cause block failure or
corruption; however, the operation could succeed if retried.
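The fail-partial model and the transience distinction can be summarized in a few lines of illustrative code; the types and the retry helper below are our own sketch (reusing the assumed read_block interface from earlier), not part of any file system discussed here.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class FailureType(Enum):
    """The three manifestations of failure in the fail-partial model."""
    WHOLE_DISK = auto()      # classic fail-stop: the entire disk is inaccessible
    BLOCK_FAILURE = auto()   # latent sector error: one or more blocks inaccessible
    CORRUPTION = auto()      # a block silently returns bad data

@dataclass
class Fault:
    kind: FailureType
    block: Optional[int]     # None for whole-disk failures
    sticky: bool             # permanent (e.g., media damage) vs. transient (e.g., transport glitch)

def read_with_retry(disk, blk, attempts=3):
    """Retrying distinguishes transient from sticky block failures: a transient
    transport problem may succeed on a later attempt, while a sticky media
    error keeps failing."""
    last_err = None
    for _ in range(attempts):
        try:
            return disk.read_block(blk)
        except OSError as err:
            last_err = err
    raise last_err
```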
2.3.2 Locality of Failures
Because multiple blocks of a disk can fail, one must consider
whether such block failures are dependent. The root causes of
block failure suggest that some forms of block failure do indeed
exhibit spatial locality [34]. For example, a scratched surface can
render a number of contiguous blocks inaccessible. However, all
failures do not exhibit locality; for example, a corruption due to a
misdirected write may impact only a single block.
2.3.3 Frequency of Failures
Block failures and corruptions do occur – as one commercial
storage system developer succinctly stated, “Disks break a lot – all
guarantees are fiction” [29]. However, one must also consider how
frequently such errors occur, particularly when modeling overall re-
liability and deciding which failures are most important to handle.
Unfortunately, as Talagala and Patterson point out [67], disk drive
manufacturers are loath to provide information on disk failures;
indeed, people within the industry refer to an implicit industry-wide
agreement to not publicize such details [4]. Not surprisingly, the
actual frequency of drive errors, especially errors that do not cause
the whole disk to fail, is not well-known in the literature. Previous
work on latent sector errors indicates that such errors occur more
commonly than absolute disk failure [34], and more recent research
estimates that such errors may occur five times more often than ab-
solute disk failures [57].
In terms of relative frequency, block failures are more likely to
occur on reads than writes, due to internal error handling common
in most disk drives. For example, failed writes to a given sector
are often remapped to another (distant) sector, allowing the drive
to transparently handle such problems [31]. However, remapping
does not imply that writes cannot fail. A failure in a component
above the media (e.g., a stuttering transport), can lead to an unsuc-
cessful write attempt; the move to network-attached storage [24]
serves to increase the frequency of this class of failures. Also, for
remapping to succeed, free blocks must be available; a large scratch
could render many blocks unwritable and quickly use up reserved
space. Reads are more problematic: if the media is unreadable, the
drive has no choice but to return an error.
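The interaction between write remapping and the finite reserved area can be illustrated with a toy model; the interface and the spare-pool size below are invented for the example and do not reflect any particular drive's firmware.

```python
class RemappingDrive:
    """Toy model of in-drive write remapping with a finite spare pool.

    On a failed write the firmware transparently redirects the logical
    sector to a reserved spare; once the pool is exhausted, writes fail.
    """

    def __init__(self, bad_sectors, spare_count=64):
        self.bad = set(bad_sectors)         # physically unwritable sectors
        self.remap = {}                     # logical sector -> spare slot
        self.spares = [f"spare-{i}" for i in range(spare_count)]
        self.store = {}                     # where data actually lands

    def write(self, sector, data):
        if sector in self.bad and sector not in self.remap:
            if not self.spares:
                raise OSError("write failed: spare pool exhausted")
            self.remap[sector] = self.spares.pop()   # transparent remapping hides the bad sector
        self.store[self.remap.get(sector, sector)] = data

    def read(self, sector):
        if sector in self.bad and sector not in self.remap:
            # An unreadable sector that was never rewritten can only return an error.
            raise OSError("unrecoverable read error")
        return self.store.get(self.remap.get(sector, sector))
```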
2.3.4 Trends
In many other areas (e.g., processor performance), technology
and market trends combine to improve different aspects of com-
puter systems. In contrast, we believe that technology trends and
market forces may combine to make storage system failures occur
more frequently over time, for the following three reasons.
First, reliability is a greater challenge when drives are made in-
creasingly more dense; as more bits are packed into smaller spaces,
drive logic (and hence complexity) increases [5].
Second, at the low-end of the drive market, cost-per-byte domi-
nates, and hence many corners are cut to save pennies in IDE/ATA
drives [5]. Low-cost “PC class” drives tend to be tested less and
have less internal machinery to prevent failures from occurring [31].
The result, in the field, is that ATA drives are observably less reli-
able [67]; however, cost pressures serve to increase their usage,
even in server environments [23].
Finally, the amount of software is increasing in storage systems
and, as others have noted, software is often the root cause of er-
rors [25]. In the storage system, hundreds of thousands of lines of
software are present in the lower-level drivers and firmware. This
low-level code is generally the type of code that is difficult to write
and debug [22, 66] – hence a likely source of increased errors in
the storage stack.