Understanding Disk Failure Rates: What
Does an MTTF of 1,000,000 Hours Mean
to You?
BIANCA SCHROEDER and GARTH A. GIBSON
Carnegie Mellon University
Component failure in large-scale IT installations is becoming an ever-larger problem as the number
of components in a single cluster approaches a million.
This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007]
and presents and analyzes field-gathered disk replacement data from a number of large produc-
tion systems, including high-performance computing sites and internet services sites. More than
110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes
drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those
drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a
nominal annual failure rate of at most 0.88%.
We find that in the field, annual disk replacement rates typically exceed 1%, with 2–4% common
and up to 13% observed on some systems. This suggests that field replacement is a fairly different
process than one might predict based on datasheet MTTF.
We also find evidence, based on records of disk replacements in the field, that failure rate is not
constant with age, and that rather than a significant infant mortality effect, we see a significant
early onset of wear-out degradation. In other words, the replacement rates in our data grew
constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA
drives, potentially an indication that disk-independent factors such as operating conditions affect
replacement rates more than component-specific ones. On the other hand, we see only one instance
of a customer rejecting an entire population of disks as a bad batch, in this case because of media
error rates, and this instance involved SATA disks.

Time between replacement, a proxy for time between failure, is not well modeled by an
exponential distribution and exhibits significant levels of correlation, including autocorrelation and
long-range dependence.

Categories and Subject Descriptors: B.8.0 [Performance and Reliability]: General; C.4
[Computer Systems Organization]: Performance of Systems; D.4.5 [Operating Systems]: Reliability

General Terms: Measurement, Reliability

Additional Key Words and Phrases: Hard drive replacements, hard drive failure, storage reliability,
MTTF, annual failure rates, annual replacement rates, time between failure, wear-out, infant
mortality, failure correlation, datasheet MTTF

ACM Reference Format:
Schroeder, B. and Gibson, G. A. 2007. Understanding disk failure rates: What does an MTTF of
1,000,000 hours mean to you? ACM Trans. Storage 3, 3, Article 8 (October 2007), 31 pages.
DOI = 10.1145/1288783.1288785 http://doi.acm.org/10.1145/1288783.1288785

The MPP2 data was collected and made available using the Molecular Science Computing Facility
(MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific
user facility sponsored by the U.S. Department of Energy’s Office of Biological and Environmental
Research. This material is based upon work supported by the Department of Energy under Award
no. DE-FC02-06ER25767 and on research sponsored in part by the Army Research Office under
agreement no. DAAD19-02-1-0389. This report was prepared as an account of work sponsored by an
agency of the United States Government. Neither the United States Government nor any agency
thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal
liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus,
product, or process disclosed, or represents that its use would not infringe privately owned rights.
Reference herein to any specific commercial product, process, or service by trade name, trademark,
manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation,
or favoring by the United States Government or any agency thereof. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States Government
or any agency thereof.

Authors’ addresses: B. Schroeder (corresponding author), G. A. Gibson, Computer Science
Department, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213; email:
{bianca,garth}@cs.cmu.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2007 ACM 1553-3077/2007/10-ART8 $5.00 DOI = 10.1145/1288783.1288785 http://doi.acm.org/10.1145/1288783.1288785
1. MOTIVATION
Despite major efforts both in industry and in academia, high reliability remains
a major challenge in running large-scale IT systems, and disaster prevention
and cost of actual disasters make up a large fraction of the total cost of owner-
ship. With ever-larger server clusters, maintaining high levels of reliability and
availability is a growing problem for many sites, including high-performance
computing systems and Internet service providers. A particularly big concern
is the reliability of storage systems, for several reasons. First, failure of stor-
age can not only cause temporary data unavailability, but in the worst case
can lead to permanent data loss. Second, technology trends and market forces
may combine to make storage system failures occur more frequently in the fu-
ture [Prabhakaran et al. 2005]. Finally, the size of storage systems in modern,
large-scale IT installations has grown to an unprecedented scale with thou-
sands of storage devices, making component failures the norm rather than the
exception [Ghemawat et al. 2003].
Large-scale IT systems therefore need better system design and manage-
ment to cope with more frequent failures. One might expect increasing levels of
redundancy designed for specific failure modes [Corbett et al. 2004; Ghemawat
et al. 2003], for example. Such designs and management systems are based
on very simple models of component failure and repair processes [Patterson
et al. 1988]. Better knowledge about the statistical properties of storage fail-
ure processes, such as the distribution of time between failures, may empower
researchers and designers to develop new, more reliable, and available storage
systems.
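As one concrete instance of such a simple model, the classic single-parity-group reliability estimate in the spirit of Patterson et al. [1988] assumes independent, exponentially distributed failures and a fixed repair time; the following sketch uses our own notation and example numbers, not figures from this article:

```latex
% Mean time to data loss (MTTDL) for one redundancy group of N disks that
% tolerates a single failure, assuming independent exponential failures
% (mean MTTF) and repair time MTTR (notation and numbers are illustrative):
\[
  \mathrm{MTTDL} \;\approx\; \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
\]
% Example: N = 10, MTTF = 1,000,000 h, MTTR = 24 h gives
% MTTDL \approx 10^{12} / (10 \cdot 9 \cdot 24) \approx 4.6 \times 10^{8} hours.
```

The field data analyzed in this article calls exactly these independence and exponentiality assumptions into question.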
Unfortunately, many aspects of disk failures in real systems are not well
understood, probably because the owners of such systems are reluctant to re-
lease failure data or do not gather such data. As a result, practitioners usually
rely on vendor-specified parameters such as mean time-to-failure (MTTF) to
model failure processes, although many are skeptical of the accuracy of those
models [Elerath 2000a, 2000b]. Too much academic and corporate research is
based on anecdotes and back-of-the-envelope calculations, rather than empiri-
cal data [Schwarz et al. 2006].
The work in this article is part of a broader research agenda with the long-
term goal of providing a better understanding of failures in IT systems by
collecting, analyzing, and making publicly available a diverse set of real failure
histories from large-scale production systems. In our pursuit, we have spoken
to a number of large production sites and were able to convince several of them
to provide failure data from some of their systems.
In this work, we provide an extension of our study in Schroeder and Gibson
[2007]. We present an analysis of nine datasets we have collected, with a focus
on storage-related failures. The datasets come from a number of large-scale
production systems, including high-performance computing sites and large In-
ternet services sites, and consist primarily of hardware replacement logs. The
datasets vary in duration from one month to five years and cover in total a pop-
ulation of more than 100,000 drives from at least four different vendors. Disks
covered by this data include drives with SCSI and FC interfaces, commonly rep-
resented as the most reliable types of disk drives, as well as drives with SATA
interfaces, common in desktop and nearline systems. Although 100,000 drives
is a very large sample relative to previously published studies, it is small com-
pared to the estimated 35 million enterprise drives, and 300 million total drives
built in 2006 [Drummer et al. 2006]. Phenomena such as bad batches caused by
fabrication line changes may require much larger datasets to fully characterize.
We analyze three different aspects of the data. We begin in Section 3 by ask-
ing how disk replacement frequencies compare to replacement frequencies of
other hardware components. In Section 4, we provide a quantitative analysis of
disk replacement rates observed in the field and compare our observations with
common predictors and models used by vendors. In Section 5, we analyze the
statistical properties of disk replacement rates. We study correlations between
disk replacements, identify the key properties of the empirical distribution of
time between replacements, and compare our results to common models and
assumptions. Section 7 provides an overview of related work and Section 8
concludes.
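To make the statistical checks previewed above concrete, the following is a minimal, hypothetical sketch (our illustration, not the authors' code) of how one might test whether observed times between replacements are consistent with an exponential distribution and whether successive gaps are correlated:

```python
import numpy as np
from scipy import stats

def check_time_between_replacements(gaps_hours):
    """gaps_hours: observed times between disk replacements, in hours."""
    gaps = np.asarray(gaps_hours, dtype=float)

    # Fit an exponential distribution (location fixed at 0) and run a
    # Kolmogorov-Smirnov goodness-of-fit test against the fitted model.
    # Note: fitting parameters from the same data makes the p-value optimistic.
    loc, scale = stats.expon.fit(gaps, floc=0)
    ks_stat, p_value = stats.kstest(gaps, "expon", args=(loc, scale))

    # Lag-1 autocorrelation: values far from 0 suggest successive
    # replacement gaps are not independent.
    lag1 = np.corrcoef(gaps[:-1], gaps[1:])[0, 1]

    return {"mean_gap_hours": gaps.mean(),
            "ks_statistic": ks_stat,
            "ks_p_value": p_value,
            "lag1_autocorrelation": lag1}

if __name__ == "__main__":
    # Synthetic data for illustration only (not from any of the datasets).
    rng = np.random.default_rng(0)
    synthetic_gaps = rng.gamma(shape=0.7, scale=500.0, size=1000)
    print(check_time_between_replacements(synthetic_gaps))
```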
2. METHODOLOGY
2.1 What is a Disk Failure?
While it is often assumed that disk failures follow a simple fail-stop model
(where disks either work perfectly or fail absolutely and in an easily detectable
manner [Patterson et al. 1988; Prabhakaran et al. 2005]), disk failures are much
more complex in reality. For example, disk drives can experience latent sector
faults or transient performance problems. Often it is hard to correctly attribute
the root cause of a problem to a particular hardware component.
Our work is based on hardware replacement records and logs, that is, we
focus on the disk conditions that lead a drive customer to treat a disk as per-
manently failed and to replace it. We analyze records from a number of large
production systems which contain a record for every disk that was replaced in
the system during the time of the data collection. To interpret the results of
our work correctly, it is crucial to understand the process of how this data was
created. After a disk drive is identified as the likely culprit in a problem, the
operations staff (or the computer system itself) performs a series of tests on
the drive to assess its behavior. If the behavior qualifies as faulty according
to the customer’s definition, the disk is replaced and a corresponding entry is
made in the hardware replacement log.
The important thing to note is that there is not one unique definition for
when a drive is faulty. In particular, customers and vendors might use different
definitions. For example, a common way for a customer to test a drive is to
read all of its sectors to see if any reads experience problems, deciding that it is
faulty if any one operation takes longer than a certain threshold. The outcome
of such a test will depend on how the thresholds are chosen. Many sites follow
a “better safe than sorry” mentality, and use even more rigorous testing. As
a result, it cannot be ruled out that a customer may declare a disk faulty,
while its manufacturer sees it as healthy. This also means that the definition of
“faulty” that a drive customer uses does not necessarily fit the definition that
a drive manufacturer uses to make drive reliability projections. In fact, a disk
vendor has reported that for 43% of all disks returned by customers, it finds
no problem with the disk [Drummer et al. 2006].
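For illustration, here is a minimal sketch of the kind of threshold-based surface scan described above. It is entirely hypothetical: the device path, block size, and latency threshold are our assumptions, and real sites use far more elaborate tooling and policies.

```python
import os
import time

def scan_disk(device_path="/dev/sdb", block_size=1 << 20,
              slow_read_threshold_s=5.0):
    """Read a block device end to end; count read errors and slow reads.

    Whether the drive is then declared faulty depends on the site's own
    policy (e.g., any error at all, or more than k slow reads).
    """
    errors, slow_reads, offset = 0, 0, 0
    fd = os.open(device_path, os.O_RDONLY)
    try:
        while True:
            start = time.monotonic()
            try:
                data = os.pread(fd, block_size, offset)
            except OSError:
                errors += 1            # unreadable region, e.g. a latent sector error
                offset += block_size
                continue
            if not data:               # reached the end of the device
                break
            if time.monotonic() - start > slow_read_threshold_s:
                slow_reads += 1        # read succeeded but took suspiciously long
            offset += len(data)
    finally:
        os.close(fd)
    return {"read_errors": errors, "slow_reads": slow_reads}
```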
It is also important to note that the failure behavior of a drive depends on
the operating conditions, and not only on component-level factors. For example,
failure rates are affected by environmental factors such as temperature and
humidity, data center handling procedures, workloads, and “duty cycles” or
powered-on hours patterns.
We would also like to point out that the failure behavior of disk drives, even
if they are of the same model, can differ, since disks are manufactured using
processes and parts that may change. These changes, such as a change in a
drive’s firmware or a hardware component, or even the assembly line on which
a drive was manufactured, can change the failure behavior of a drive. This
effect is often called the effect of batches or vintage. A bad batch can lead to
unusually high drive failure rates or anomalously high rates of media errors.
For example, in the HPC3 dataset (see Table I) the customer had 11,000 SATA
drives replaced in October 2006 after observing a high frequency of media er-
rors during writes. Although it took a year to resolve, the customer and vendor
agreed that these drives did not meet warranty conditions. The cause was at-
tributed to the breakdown of a lubricant, leading to unacceptably high head
flying heights. In the data, the replacements of these drives are not recorded
as failures.
In our analysis we do not further study the effect of batches. We report on the
field experience, in terms of disk replacement rates, of a set of drive customers.
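As a rough sketch of what a replacement rate means operationally (our own illustrative code, not the authors'; the article's precise definition of the annual replacement rate is given in its methodology), one can count replacement events per drive-year of observation:

```python
from datetime import date

def annual_replacement_rate(replacement_dates, population_periods):
    """Estimate ARR (%) as replacements per drive-year of observation.

    replacement_dates:  dates on which a drive was replaced.
    population_periods: (start_date, end_date, drive_count) tuples, so that
                        populations whose size changes over time are weighted
                        by how long each size was in service.
    """
    drive_years = sum(
        count * (end - start).days / 365.25
        for start, end, count in population_periods
    )
    return 100.0 * len(replacement_dates) / drive_years

# Hypothetical example: 1,000 drives observed for one year, 30 replacements.
arr = annual_replacement_rate(
    replacement_dates=[date(2006, 6, 1)] * 30,
    population_periods=[(date(2005, 1, 1), date(2006, 1, 1), 1000)],
)
print(f"ARR = {arr:.1f}%")   # about 3.0% under these assumptions
```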
Table I. Overview of the Nine Failure Datasets
| Data set | Type of cluster | Duration | # Disk events | # Servers | Disk Count | Disk Parameters | MTTF (Mhours) | Date of first deploym. | ARR (%) |
|---|---|---|---|---|---|---|---|---|---|
| HPC1 | HPC | 08/01 - 05/06 | 474 | 765 | 2,318 | 18GB 10K SCSI | 1.2 | 08/01 | 4.0 |
| | | | 124 | 64 | 1,088 | 36GB 10K SCSI | 1.2 | | 2.2 |
| HPC2 | HPC | 01/04 - 07/06 | 14 | 256 | 520 | 36GB 10K SCSI | 1.2 | 12/01 | 1.1 |
| HPC3 | HPC | 12/05 - 11/06 | 103 | 1,532 | 3,064 | 146GB 15K SCSI | 1.5 | 08/05 | 3.7 |
| | HPC | 12/05 - 11/06 | 4 | N/A | 144 | 73GB 15K SCSI | 1.5 | | 3.0 |
| | HPC | 12/05 - 08/06 | 253 | N/A | 11,000 | 250GB 7.2K SATA | 1.0 | | 3.3 |
| HPC4 | Various HPC clusters | 09/03 - 08/06 | 269 | N/A | 8,430 | 250GB SATA | 1.0 | 09/03 | 2.2 |
| | | 11/05 - 08/06 | 7 | N/A | 2,030 | 500GB SATA | 1.0 | 11/05 | 0.5 |
| | | 09/05 - 08/06 | 9 | N/A | 3,158 | 400GB SATA | 1.0 | 09/05 | 0.8 |
| HPC5 | HPC | 01/01 - 12/06 | 134 | 380 | 4,280 | Mixed drive populations; see description of HPC5 in Section 2.3 for details | | 01/01 | 1.2 |
| | HPC | 11/05 - 12/06 | 14 | 111 | 1,252 | | | 11/05 | 1.1 |
| | HPC | 01/05 - 12/06 | 47 | 356 | 759 | | | 01/05 | 3.1 |
| HPC6 | HPC | 12/03 - 02/07 | 183 | 366 | 732 | 36GB 15K SCSI | 1.2 | 12/02 | 6.7 |
| | HPC | 12/03 - 02/07 | 1,642 | 574 | 5,166 | 36/73GB 15K SCSI | 1.2 | 12/02 | 8.5 |
| COM1 | Int. serv. | May 2006 | 84 | N/A | 26,734 | 10K SCSI | 1.0 | 2001 | 2.8 |
| COM2 | Int. serv. | 09/04 - 04/06 | 506 | 9,232 | 39,039 | 15K SCSI | 1.2 | 2004 | 3.1 |
| COM3 | Int. serv. | 01/05 - 12/05 | 2 | N/A | 56 | 10K FC | 1.2 | N/A | 3.6 |
| | | | 132 | N/A | 2,450 | 10K FC | 1.2 | N/A | 5.4 |
| | | | 108 | N/A | 796 | 10K FC | 1.2 | N/A | 13.6 |
| | | | 104 | N/A | 432 | 10K FC | 1.2 | 1998 | 24.1 |
Note that the disk count given in the table is the number of drives in the system at the end of the data collection period. For some systems the number of
drives changed during the data collection period, and we account for that in our analysis. The disk parameters 10K and 15K refer to the rotation speed
in revolutions per minute; drives not labeled 10K or 15K probably have a rotation speed of 7200 rpm.
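For comparison with the ARR column, the nominal annual failure rate implied by a datasheet MTTF can be estimated with a standard back-of-the-envelope conversion (the notation here is ours):

```latex
% Nominal annual failure rate implied by a datasheet MTTF, assuming a
% constant failure rate over the 8,760 hours of one year:
\[
  \mathrm{AFR}_{\mathrm{nominal}} \;=\; 1 - e^{-8760/\mathrm{MTTF}}
  \;\approx\; \frac{8760}{\mathrm{MTTF}}
\]
% Example: MTTF = 1,000,000 hours gives AFR \approx 0.88\%;
% MTTF = 1,500,000 hours gives AFR \approx 0.58\%.
```

Field replacement rates of 2-4% in the table are thus several times these nominal values, and the highest observed rates exceed them by more than an order of magnitude.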