IRON File Systems
Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
Computer Sciences Department, University of Wisconsin, Madison
{vijayan,laksh,nitina,haryadi,dusseau,remzi}@cs.wisc.edu
ABSTRACT
Commodity file systems trust disks to either work or fail com-
pletely, yet modern disks exhibit more complex failure modes. We
suggest a new fail-partial failure model for disks, which incorpo-
rates realistic localized faults such as latent sector errors and block
corruption. We then develop and apply a novel failure-policy fin-
gerprinting framework, to investigate how commodity file systems
react to a range of more realistic disk failures. We classify their
failure policies in a new taxonomy that measures their Internal RO-
bustNess (IRON), which includes both failure detection and recov-
ery techniques. We show that commodity file system failure poli-
cies are often inconsistent, sometimes buggy, and generally inade-
quate in their ability to recover from partial disk failures. Finally,
we design, implement, and evaluate a prototype IRON file system,
Linux ixt3, showing that techniques such as in-disk checksumming,
replication, and parity greatly enhance file system robustness while
incurring minimal time and space overheads.
Categories and Subject Descriptors:
D.4.3 [Operating Systems]: File Systems Management
D.4.5 [Operating Systems]: Reliability
General Terms: Design, Experimentation, Reliability
Keywords: IRON file systems, disks, storage, latent sector errors,
block corruption, fail-partial failure model, fault tolerance, reliabil-
ity, internal redundancy
1. INTRODUCTION
Disks fail – but not in the way most commodity file systems ex-
pect. For many years, file system and storage system designers have
assumed that disks operate in a “fail stop” manner [56]; within this
classic model, the disks either are working perfectly, or fail abso-
lutely and in an easily detectable manner.
The fault model presented by modern disk drives, however, is
much more complex. For example, modern drives can exhibit la-
tent sector faults [16, 34, 57], where a block or set of blocks are
inaccessible. Worse, blocks sometimes become silently corrupted
[9, 26, 73]. Finally, disks sometimes exhibit transient performance
problems [7, 67].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’05, October 23–26, 2005, Brighton, United Kingdom.
Copyright 2005 ACM 1-59593-079-5/05/0010 ...$5.00.
There are many reasons for these complex failures in disks. For
example, a buggy disk controller could issue a “misdirected” write
[73], placing the correct data on disk but in the wrong location. In-
terestingly, while these failures exist today, simply waiting for disk
technology to improve will not remove these errors: indeed, these
errors may worsen over time, due to increasing drive complexity
[5], immense cost pressures in the storage industry, and the esca-
lated use of less reliable ATA disks – not only in desktop PCs but
also in large-scale clusters [23] and storage systems [20, 28].
Developers of high-end systems have realized the nature of these
disk faults and built mechanisms into their systems to handle them.
For example, many redundant storage systems incorporate a back-
ground disk scrubbing process [33, 57], to proactively detect and
subsequently correct latent sector errors by creating a new copy of
inaccessible blocks; some recent storage arrays incorporate extra
levels of redundancy to lessen the potential damage of undiscov-
ered latent errors [16]. Similarly, highly-reliable systems (e.g., Tan-
dem NonStop) utilize end-to-end checksums to detect when block
corruption occurs [9].
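To make the scrubbing idea concrete, the following is a minimal sketch of such a background scrub loop, assuming a hypothetical read_block/write_block interface and a mirrored copy to repair from; it is an illustration of the technique, not the code of any particular storage array.

```python
import time

class LatentSectorError(Exception):
    """Raised by the (assumed) disk interface when a block cannot be read."""

def scrub(primary, replica, num_blocks, pause_s=0.01):
    """Background scrubber: read every block and, on a latent sector error,
    rebuild the block from a redundant copy.

    `primary` and `replica` are assumed to expose read_block(n) and
    write_block(n, data); these names are illustrative, not a real API.
    """
    for blk in range(num_blocks):
        try:
            primary.read_block(blk)          # touching the block surfaces latent errors
        except LatentSectorError:
            good = replica.read_block(blk)   # fetch the redundant copy
            primary.write_block(blk, good)   # rewriting lets the drive remap the bad sector
        time.sleep(pause_s)                  # pace the scan so foreground I/O is not starved
```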
Unfortunately, such technology has not filtered down to the realm
of commodity file systems, including Linux file systems such as
ext3 [71], ReiserFS [49], and IBM’s JFS [11], or Windows file sys-
tems such as NTFS [63]. Such file systems are not only pervasive
in the home environment, storing valuable (and often non-archived)
user data such as photos, home movies, and tax returns, but also in
many internet services such as Google [23].
In this paper, the first question we pose is: how do modern com-
modity file systems react to failures that are common in modern
disks? To answer this query, we aggregate knowledge from the re-
search literature, industry, and field experience to form a new model
for disk failure. We label our model the fail-partial failure model
to emphasize that portions of the disk can fail, either through block
errors or data corruption.
With the model in place, we develop and apply an automated
failure-policy fingerprinting framework, to inject more realistic disk
faults beneath a file system. The goal of fingerprinting is to unearth
the failure policy of each system: how it detects and recovers from
disk failures. Our approach leverages gray-box knowledge [6, 62]
of file system data structures to meticulously exercise file system
access paths to disk.
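As a user-level illustration of the idea (the framework itself runs beneath the file system, in the kernel), type-aware fault injection can be modeled as a wrapper that fails or corrupts chosen blocks; the class and method names below are assumptions made for the sketch.

```python
import errno

class FaultInjectingDisk:
    """Wraps a backing store and injects faults for chosen block numbers.

    Illustrative user-level model of the fingerprinting idea: the real
    framework sits beneath the file system and uses knowledge of on-disk
    structures to decide which block (inode, journal, directory, ...) to fail.
    """

    def __init__(self, backing, fail_read=(), fail_write=(), corrupt=()):
        self.backing = backing              # dict: block number -> bytes
        self.fail_read = set(fail_read)     # blocks whose reads return an error
        self.fail_write = set(fail_write)   # blocks whose writes return an error
        self.corrupt = set(corrupt)         # blocks whose reads return bad data

    def read_block(self, blk):
        if blk in self.fail_read:
            raise OSError(errno.EIO, "injected latent sector error", blk)
        data = self.backing[blk]
        if blk in self.corrupt:
            data = bytes(b ^ 0xFF for b in data)   # silently flip bits
        return data

    def write_block(self, blk, data):
        if blk in self.fail_write:
            raise OSError(errno.EIO, "injected write error", blk)
        self.backing[blk] = data

# Example: fail reads of block 7 (say, a journal block) and observe whether
# the file system under test detects the error and how it recovers.
disk = FaultInjectingDisk({n: b"\x00" * 4096 for n in range(16)}, fail_read=[7])
```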
To better characterize failure policy, we develop an Internal RO-
bustNess (IRON) taxonomy, which catalogs a broad range of detec-
tion and recovery techniques. Hence, the output of our fingerprint-
ing tool is a broad categorization of which IRON techniques a file
system uses across its constituent data structures.
Our study focuses on three important and substantially different
open-source file systems, ext3, ReiserFS, and IBM’s JFS, and one
closed-source file system, Windows NTFS. Across all platforms,
we find a great deal of illogical inconsistency in failure policy, of-
ten due to the diffusion of failure handling code through the ker-
nel; such inconsistency leads to substantially different detection
and recovery strategies under similar fault scenarios, resulting in
unpredictable and often undesirable fault-handling strategies. We
also discover that most systems implement portions of their fail-
ure policy incorrectly; the presence of bugs in the implementa-
tions demonstrates the difficulty and complexity of correctly han-
dling certain classes of disk failure. We observe little tolerance of
transient failures; most file systems assume a single temporarily-
inaccessible block indicates a fatal whole-disk failure. Finally, we
show that none of the file systems can recover from partial disk
failures, due to a lack of in-disk redundancy.
This behavior under realistic disk failures leads us to our second
question: how can we change file systems to better handle modern
disk failures? We advocate a single guiding principle for the design
of file systems: don’t trust the disk. The file system should not view
the disk as an utterly reliable component. For example, if blocks
can become corrupt, the file system should apply measures to both
detect and recover from such corruption, even when running on a
single disk. Our approach is an instance of the end-to-end argument
[53]: at the top of the storage stack, the file system is fundamentally
responsible for reliable management of its data and metadata.
In our initial efforts, we develop a family of prototype IRON file
systems, all of which are robust variants of the Linux ext3 file sys-
tem. Within our IRON ext3 (ixt3), we investigate the costs of using
checksums to detect data corruption, replication to provide redun-
dancy for metadata structures, and parity protection for user data.
We show that these techniques incur modest space and time over-
heads while greatly increasing the robustness of the file system to
latent sector errors and data corruption. By implementing detec-
tion and recovery techniques from the IRON taxonomy, a system
can implement a well-defined failure policy and subsequently pro-
vide vigorous protection against the broader range of disk failures.
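For concreteness, a minimal sketch of the detection-plus-recovery pattern that ixt3 builds on follows, using CRC32 checksums for detection and XOR parity for reconstruction; the block-group layout and function names are illustrative only, not ixt3's on-disk format.

```python
import zlib

def xor_blocks(blocks):
    """Compute a parity block as the bytewise XOR of equal-sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def write_group(blocks):
    """Store per-block checksums and a parity block alongside the data."""
    return {
        "data": list(blocks),
        "sums": [zlib.crc32(b) for b in blocks],
        "parity": xor_blocks(blocks),
    }

def read_block(group, idx):
    """Verify the checksum on read; if it fails, reconstruct from parity."""
    data = group["data"][idx]
    if zlib.crc32(data) == group["sums"][idx]:
        return data
    # Corruption detected: rebuild the block from the survivors plus parity.
    survivors = [b for i, b in enumerate(group["data"]) if i != idx]
    return xor_blocks(survivors + [group["parity"]])
```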
The contributions of this paper are as follows:
• We define a more realistic failure model for modern disks
(the fail-partial model) (§2).
• We formalize the techniques to detect and recover from disk
errors under the IRON taxonomy (§3).
• We develop a fingerprinting framework to determine the fail-
ure policy of a file system (§4).
• We analyze four popular commodity file systems to discover
how they handle disk errors (§5).
• We build a prototype version of an IRON file system (ixt3)
and analyze its robustness to disk failure and its performance
characteristics (§6).
To bring the paper to a close, we discuss related work (§7), and
finally conclude (§8).
2. DISK FAILURE
There are many reasons that the file system may see errors in
the storage system below. In this section, we first discuss common
causes of disk failure. We then present a new, more realistic fail-
partial model for disks and discuss various aspects of this model.
2.1 The Storage Subsystem
Figure 1 presents a typical layered storage subsystem below the
file system. An error can occur in any of these layers and propagate
itself to the file system above.
[Figure 1: The Storage Stack. We present a schematic of the entire storage stack. At the top is the file system; beneath are the many layers of the storage subsystem. Gray shading implies software or firmware, whereas white (unshaded) is hardware. Layers shown, from host to disk: generic file system, specific file system, generic block I/O, device driver, device controller, transport, firmware, cache, electrical, mechanical, and media.]
At the bottom of the “storage stack” is the disk itself; beyond the
magnetic storage media, there are mechanical (e.g., the motor and
arm assembly) and electrical components (e.g., busses). A particu-
larly important component is firmware – the code embedded within
the drive to control most of its higher-level functions, including
caching, disk scheduling, and error handling. This firmware code
is often substantial and complex (e.g., a modern Seagate drive con-
tains roughly 400,000 lines of code [19]).
Connecting the drive to the host is the transport. In low-end sys-
tems, the transport medium is often a bus (e.g., SCSI), whereas
networks are common in higher-end systems (e.g., FibreChannel).
At the top of the stack is the host. Herein there is a hardware
controller that communicates with the device, and above it a soft-
ware device driver that controls the hardware. Block-level software
forms the next layer, providing a generic device interface and im-
plementing various optimizations (e.g., request reordering).
Above all other software is the file system. This layer is often
split into two pieces: a high-level component common to all file
systems, and a specific component that maps generic operations
onto the data structures of the particular file system. A standard
interface (e.g., Vnode/VFS [36]) is positioned between the two.
2.2 Why Do Disks Fail?
To motivate our failure model, we first describe how errors in the
layers of the storage stack can cause failures.
Media: There are two primary errors that occur in the magnetic
media. First, the classic problem of “bit rot” occurs when the mag-
netism of a single bit or a few bits is flipped. This type of problem
can often (but not always) be detected and corrected with low-level
ECC embedded in the drive. Second, physical damage can occur on
the media. The quintessential “head crash” is one culprit, where the
drive head contacts the surface momentarily. A media scratch can
also occur when a particle is trapped between the drive head and the
media [57]. Such dangers are well-known to drive manufacturers,
and hence modern disks park the drive head when the drive is not
in use to reduce the number of head crashes; SCSI disks sometimes
include filters to remove particles [5]. Media errors most often lead
to permanent failure or corruption of individual disk blocks.
Mechanical: “Wear and tear” eventually leads to failure of moving
parts. A drive motor can spin irregularly or fail completely. Erratic
arm movements can cause head crashes and media flaws; inaccu-
rate arm movement can misposition the drive head during writes,
leaving blocks inaccessible or corrupted upon subsequent reads.
Electrical: A power spike or surge can damage in-drive circuits
and hence lead to drive failure [68]. Thus, electrical problems can
lead to entire disk failure.
Drive firmware: Interesting errors arise in the drive controller,
which consists of many thousands of lines of real-time, concurrent
firmware. For example, disks have been known to return correct
data but circularly shifted by a byte [37] or have memory leaks
that lead to intermittent failures [68]. Other firmware problems
can lead to poor drive performance [54]. Some firmware bugs are
well-enough known in the field that they have specific names; for
example, “misdirected” writes are writes that place the correct data
on the disk but in the wrong location, and “phantom” writes are
writes that the drive reports as completed but that never reach the
media [73]. Phantom writes can be caused by a buggy or even mis-
configured cache (i.e., write-back caching is enabled). In summary,
drive firmware errors often lead to sticky or transient block corrup-
tion but can also lead to performance problems.
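One way a file system can detect a misdirected write is to store identity information (the block's intended logical address) alongside a checksum and verify both on every read; the sketch below illustrates the check with a made-up on-block layout. Phantom writes require additional machinery, such as read-after-write verification or version numbers, and are not covered here.

```python
import zlib

def seal(block_num, data):
    """Attach the block's logical address and a checksum to its payload.

    Illustrative format only: 8-byte block number, 4-byte CRC32, then data.
    """
    header = block_num.to_bytes(8, "little")
    crc = zlib.crc32(header + data)
    return header + crc.to_bytes(4, "little") + data

def check(expected_block_num, sealed):
    """Detect misdirected writes (wrong address) and corruption (bad CRC)."""
    stored_num = int.from_bytes(sealed[:8], "little")
    stored_crc = int.from_bytes(sealed[8:12], "little")
    data = sealed[12:]
    if stored_num != expected_block_num:
        return None, "misdirected write: block carries another block's address"
    if zlib.crc32(sealed[:8] + data) != stored_crc:
        return None, "corrupted block: checksum mismatch"
    return data, "ok"
```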
Transport: The transport connecting the drive and host can also be
problematic. For example, a study of a large disk farm [67] reveals
that most of the systems tested had interconnect problems, such
as bus timeouts. Parity errors also occurred with some frequency,
either causing requests to succeed (slowly) or fail altogether. Thus,
the transport often causes transient errors for the entire drive.
Bus controller: The main bus controller can also be problematic.
For example, the EIDE controller on a particular series of moth-
erboards incorrectly indicates completion of a disk request before
the data has reached the main memory of the host, leading to data
corruption [72]. A similar problem causes some other controllers to
return status bits as data if the floppy drive is in use at the same time
as the hard drive [26]. Others have also observed IDE protocol ver-
sion problems that yield corrupt data [23]. In summary, controller
problems can lead to transient block failure and data corruption.
Low-level drivers: Recent research has shown that device driver
code is more likely to contain bugs than the rest of the operating
system [15, 22, 66]. While some of these bugs will likely crash the
operating system, others can issue disk requests with bad parame-
ters, data, or both, resulting in data corruption.
2.3 The Fail-Partial Failure Model
From our discussion of the many root causes for failure, we are
now ready to put forth a more realistic model of disk failure. In our
model, failures manifest themselves in three ways:
• Entire disk failure: The entire disk is no longer accessible. If
permanent, this is the classic “fail-stop” failure.
• Block failure: One or more blocks are not accessible; often re-
ferred to as “latent sector errors” [33, 34].
• Block corruption: The data within individual blocks is altered.
Corruption is particularly insidious because it is silent – the storage
subsystem simply returns “bad” data upon a read.
We term this model the Fail-Partial Failure Model, to empha-
size that pieces of the storage subsystem can fail. We now discuss
some other key elements of the fail-partial model, including the
transience, locality, and frequency of failures, and then discuss how
technology and market trends will impact disk failures over time.
2.3.1 Transience of Failures
In our model, failures can be “sticky” (permanent) or “transient”
(temporary). Which behavior manifests itself depends upon the
root cause of the problem. For example, a low-level media problem
portends the failure of subsequent requests. In contrast, a transport
or higher-level software issue might at first cause block failure or
corruption; however, the operation could succeed if retried.
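The fail-partial model and the transience distinction can be summarized in a few lines of illustrative code; the types and the retry helper below are our own sketch (reusing the assumed read_block interface from earlier), not part of any file system discussed here.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class FailureType(Enum):
    """The three manifestations of failure in the fail-partial model."""
    WHOLE_DISK = auto()      # classic fail-stop: the entire disk is inaccessible
    BLOCK_FAILURE = auto()   # latent sector error: one or more blocks inaccessible
    CORRUPTION = auto()      # a block silently returns bad data

@dataclass
class Fault:
    kind: FailureType
    block: Optional[int]     # None for whole-disk failures
    sticky: bool             # permanent (e.g., media damage) vs. transient (e.g., transport glitch)

def read_with_retry(disk, blk, attempts=3):
    """Retrying distinguishes transient from sticky block failures: a transient
    transport problem may succeed on a later attempt, while a sticky media
    error keeps failing."""
    last_err = None
    for _ in range(attempts):
        try:
            return disk.read_block(blk)
        except OSError as err:
            last_err = err
    raise last_err
```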
2.3.2 Locality of Failures
Because multiple blocks of a disk can fail, one must consider
whether such block failures are dependent. The root causes of
block failure suggest that some forms of block failure do indeed
exhibit spatial locality [34]. For example, a scratched surface can
render a number of contiguous blocks inaccessible. However, all
failures do not exhibit locality; for example, a corruption due to a
misdirected write may impact only a single block.
2.3.3 Frequency of Failures
Block failures and corruptions do occur – as one commercial
storage system developer succinctly stated, “Disks break a lot – all
guarantees are fiction” [29]. However, one must also consider how
frequently such errors occur, particularly when modeling overall re-
liability and deciding which failures are most important to handle.
Unfortunately, as Talagala and Patterson point out [67], disk drive
manufacturers are loath to provide information on disk failures;
indeed, people within the industry refer to an implicit industry-wide
agreement to not publicize such details [4]. Not surprisingly, the
actual frequency of drive errors, especially errors that do not cause
the whole disk to fail, is not well-known in the literature. Previous
work on latent sector errors indicates that such errors occur more
commonly than absolute disk failure [34], and more recent research
estimates that such errors may occur five times more often than ab-
solute disk failures [57].
In terms of relative frequency, block failures are more likely to
occur on reads than writes, due to internal error handling common
in most disk drives. For example, failed writes to a given sector
are often remapped to another (distant) sector, allowing the drive
to transparently handle such problems [31]. However, remapping
does not imply that writes cannot fail. A failure in a component
above the media (e.g., a stuttering transport), can lead to an unsuc-
cessful write attempt; the move to network-attached storage [24]
serves to increase the frequency of this class of failures. Also, for
remapping to succeed, free blocks must be available; a large scratch
could render many blocks unwritable and quickly use up reserved
space. Reads are more problematic: if the media is unreadable, the
drive has no choice but to return an error.
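The interaction between write remapping and the finite reserved area can be illustrated with a toy model; the interface and the spare-pool size below are invented for the example and do not reflect any particular drive's firmware.

```python
class RemappingDrive:
    """Toy model of in-drive write remapping with a finite spare pool.

    On a failed write the firmware transparently redirects the logical
    sector to a reserved spare; once the pool is exhausted, writes fail.
    """

    def __init__(self, bad_sectors, spare_count=64):
        self.bad = set(bad_sectors)         # physically unwritable sectors
        self.remap = {}                     # logical sector -> spare slot
        self.spares = [f"spare-{i}" for i in range(spare_count)]
        self.store = {}                     # where data actually lands

    def write(self, sector, data):
        if sector in self.bad and sector not in self.remap:
            if not self.spares:
                raise OSError("write failed: spare pool exhausted")
            self.remap[sector] = self.spares.pop()   # transparent remapping hides the bad sector
        self.store[self.remap.get(sector, sector)] = data

    def read(self, sector):
        if sector in self.bad and sector not in self.remap:
            # An unreadable sector that was never rewritten can only return an error.
            raise OSError("unrecoverable read error")
        return self.store.get(self.remap.get(sector, sector))
```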
2.3.4 Trends
In many other areas (e.g., processor performance), technology
and market trends combine to improve different aspects of com-
puter systems. In contrast, we believe that technology trends and
market forces may combine to make storage system failures occur
more frequently over time, for the following three reasons.
First, reliability is a greater challenge when drives are made in-
creasingly more dense; as more bits are packed into smaller spaces,
drive logic (and hence complexity) increases [5].
Second, at the low-end of the drive market, cost-per-byte domi-
nates, and hence many corners are cut to save pennies in IDE/ATA
drives [5]. Low-cost “PC class” drives tend to be tested less and
have less internal machinery to prevent failures from occurring [31].
The result, in the field, is that ATA drives are observably less reli-
able [67]; however, cost pressures serve to increase their usage,
even in server environments [23].
Finally, the amount of software is increasing in storage systems
and, as others have noted, software is often the root cause of er-
rors [25]. In the storage system, hundreds of thousands of lines of
software are present in the lower-level drivers and firmware. This
low-level code is generally the type of code that is difficult to write
and debug [22, 66] – hence a likely source of increased errors in
the storage stack.