A System of Patterns for Fault Tolerance ∗
Titos Saridakis
NOKIA Research Center
PO Box 407, FIN-00045 NOKIA Group, Finland
Tel: (+358) 7180 37293 Fax: (+358) 7180 36308
titos.saridakis@nokia.com
ABSTRACT
Many fault tolerance techniques that have been devised, applied and improved
over the past three decades represent general solutions to recurring problems in
the design of fault tolerant computer systems. This document presents some of
the best known such techniques, formatted as patterns and organized by a
classification scheme into a system of patterns for fault tolerance. This pattern
system reveals the relations among the presented patterns for fault tolerance and
delineates a number of ways in which these patterns can be used to refine each
other. In turn, these refinement relations create design frameworks for the devel-
opment of fault tolerant systems with different efficiency and complexity charac-
teristics.
Keywords: Design Framework, Fault Tolerance Pattern, Pattern Classification.
1 INTRODUCTION
The indissoluble bonds of computers and failures have produced a plurality of fault
tolerance techniques that can satisfy, potentially, any requirement regarding the behavior
of a computer system in the presence of faults. Consequently, the development
of fault tolerant systems no longer relies on the (re)invention of ways to deal
with various faults that may occur; rather, it relies on the selection of the most appro-
priate one among the well-understood fault tolerance techniques. Each such tech-
nique provides a solution to a recurring fault tolerance problem under a set of clearly
defined assumptions about the type of the failures it deals with and the constraints
about the system behavior it guarantees. Hence, a well-understood fault tolerance
technique outlines a pattern that applies to concrete problems in the design of fault
tolerant systems in the specific context defined by the aforementioned assumptions and
constraints. In this document a set of fault tolerance techniques is formatted and
presented as patterns following a form similar to the one used in [3].
In general terms, fault tolerance provides techniques to confront faults and their con-
sequences in a system. These techniques describe the detection of errors in a sys-
tem, and the means that ensure the recovery of a system from errors or the masking
of errors in a system. The patterns presented in this document cover all three of these
constituents of fault tolerance, which are error detection, recovery and masking. The
different ways in which the presented patterns can be combined to produce a com-
plete solution for the design of fault tolerant systems are captured in a classification
scheme for the fault tolerance patterns. This classification scheme transforms the
presented set of patterns into a system of patterns that provides guidelines on how to
refine the design of a system to transform it into its fault tolerant counterpart.
∗ Copyright 2002 by NOKIA. All rights reserved. Permission is granted to copy for EuroPLoP 2002.
Although the patterns presented in this document provide solutions to fault tolerance
problems, the content of this document is addressed to software designers and archi-
tects and not to fault tolerance experts. The presented patterns capture widely used
fault tolerance techniques from the field of distributed systems [10] and their compre-
hension does not require profound fault tolerance expertise.
Each pattern in this document presents a solution to a specific problem in detecting,
recovering from, or masking an error. Combining these patterns according to the
guidelines given by the classification scheme provides complete solutions to fault tol-
erance problems in the design of a system. Hence, each such combination of pat-
terns forms the basis for a design framework for fault tolerant systems with specific
properties (e.g. regarding the failure types they can cope with, the number of simultaneous
errors they can tolerate, the time and complexity overhead of the fault tolerant
mechanisms, etc.).
The remainder of this document is organized in four sections. Section 2 contains a
summary of the background information regarding fault tolerance that is necessary
for a non-expert to follow the presentation. Section 3 presents a set of patterns that
capture well-understood fault tolerance techniques for error detection, recovery and
masking. In section 4 these patterns are organized as a pattern system with the help
of a classification scheme that reveals their relations (mainly dependency and refinement
relations). The same section also contains a discussion of the relation of
the presented system of fault tolerance patterns with other well-known pattern sys-
tems. The document concludes in section 5 with a brief summary of the presented
work, a brief evaluation of the importance of the system of fault tolerance patterns in
software design and some reflections on the future of pattern systems for problems
specific to non-functional properties (e.g. security, timeliness, configurability,
etc.).
2 BACKGROUND
The long-term and profound study of failures in computer systems has delivered a
clear understanding of the different types of failures that may occur and a variety of
techniques for dealing with them. The intent of this section is to provide a summary of
the fault tolerance background that would help system designers and architects to
probe deep into the patterns presented in the following section. This background is
taken from [7] and it includes the system model adopted in this document, the defini-
tions of fault tolerance related terms, a description of the failure types considered in
the context of this document, and a brief presentation of essential fault tolerance
concepts.
A system is an entity with a well-defined behavior in terms of the output it produces,
which is a function of the input it receives, the passage of time and its internal logic.
By “well-defined behavior” we mean that the output produced by the system is previ-
ously agreed upon and unambiguously distinguishable from output that does not
qualify as well-defined behavior. The well-defined behavior of a system is called the
system specification. A system interacts with its environment by receiving input from
it and delivering output to it. It may be possible to decompose a system into constituent
(sub)systems. In CBSE terms a system is a component that may consist of the
assembly of a number of smaller components. In OO terms a system is a composi-
tion of objects, each of which may be itself a composition of smaller objects.
A failure is said to occur in a system when the system’s environment observes an
output from the system that does not conform to its specification. An error is the part
of the system, e.g. one of its constituent (sub)systems, which is liable to lead to a
failure. A fault is the adjudged cause of an error and may itself be the result of a failure.
Hence, a fault causes an error that produces a failure, which subsequently may
result in a fault, and so on. Let us consider the following example:
A software bug in an application is a fault. It leads to an error when the
application execution reaches the point affected by the bug, which in turn
makes the application crash; the crash is a failure. By crashing, the
application leaves the socket ports it used blocked, which is a fault. The
computer on which the application crashed now has socket ports that are not
used by any process yet are not accessible to running applications, which
is an error. This error in turn leads to a failure when another application
requests these ports.
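The definitions above can be illustrated with a minimal Python sketch. It is not part of the original presentation, and the names `specification`, `square` and `observe` are hypothetical. The sketch assumes that the specification can be written as a predicate over input and output, that the fault is a bug affecting a single input, and that the environment observes a failure whenever the output does not conform to the specification.

```python
def specification(value: int, output: int) -> bool:
    """Behavior agreed upon in advance: the output is the square of the input."""
    return output == value * value


def square(value: int) -> int:
    """Hypothetical system whose internal logic contains a fault (a bug)."""
    if value == 4:
        # The fault becomes an error only when execution reaches this branch.
        return 15
    return value * value


def observe(values):
    """The environment: returns the inputs for which it observes a failure."""
    return [v for v in values if not specification(v, square(v))]


failures = observe(range(6))
```

In this sketch the fault is present for every input, but a failure is observed only for the input that drives execution through the affected branch.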
Based on the above, a fault in a system may propagate to the system's environment.
A system is called fault tolerant when it can deal with faults and their consequent er-
rors in such a way that it does not violate its specification, i.e. the environment of a
fault tolerant system does not perceive a failure of the system. Hence, a fault tolerant
system does not propagate faults to its environment. Fault tolerance techniques are
practical methods that describe how to detect an error and confine it within a system.
The confinement can be based on the restoration of the subsystem on which the er-
ror was detected before that error infects other parts of the system, or it can be based
on the masking of the error occurrence (e.g. by isolating the subsystem on which the
error was detected and using some form of redundancy to deliver the expected out-
put).
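Confinement by masking can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than a prescribed design: the names `MaskingWrapper`, `flaky_double` and `reliable_double` are hypothetical, and an error in the primary subsystem is assumed to show up as a raised exception, which makes detection trivial.

```python
from typing import Callable


class MaskingWrapper:
    """Hides failures of a primary subsystem behind a redundant duplicate."""

    def __init__(self, primary: Callable[[int], int], backup: Callable[[int], int]):
        self.primary = primary
        self.backup = backup
        self.primary_isolated = False  # set once an error has been detected

    def __call__(self, value: int) -> int:
        if not self.primary_isolated:
            try:
                return self.primary(value)
            except Exception:
                # Error detected: isolate the primary so the error stays confined.
                self.primary_isolated = True
        # Redundancy delivers the expected output; the caller sees no failure.
        return self.backup(value)


def flaky_double(value: int) -> int:
    # Hypothetical subsystem that fails on one particular input.
    if value == 3:
        raise RuntimeError("internal error")
    return 2 * value


def reliable_double(value: int) -> int:
    # Hypothetical redundant duplicate with the same specification.
    return value + value


double = MaskingWrapper(flaky_double, reliable_double)
results = [double(v) for v in (1, 3, 5)]
```

From the caller's point of view all three requests succeed, so the environment does not perceive a failure even though the primary subsystem experienced one.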
Each fault tolerance technique provides different guarantees regarding the properties
associated with system qualities such as the time or the space overhead introduced
to the normal execution of the system, the efficiency of the reaction to a failure, the
design complexity added to the system, etc. In general, fault tolerance techniques are
based on the following principles:
• Constituents of a fault tolerant system monitor other constituents for failure occur-
rences. By observing a failure, the monitoring subsystem can detect an error on
the monitored subsystem. These monitoring activities are often called error detec-
tion.
• In order to enable the restoration of a subsystem after an error has been detected
on it, appropriate information regarding the subsystem may be saved when certain
conditions are met (e.g. at regular time intervals, right after the subsystem delivers
some output according to its specification, when the subsystem decides on its own
to save the appropriate information, etc.). This saving activity is often called checkpointing.
The appropriate information saved in a checkpointing activity may vary
from a complete snapshot of the internal subsystem representation (i.e. the state
of the subsystem) to selected pieces of its internal representation that have
changed since the last checkpoint.
• When a monitoring subsystem observes a failure on a monitored subsystem, it
may activate a mechanism that will use the last checkpoint of the latter subsystem
in order to eliminate the error that led to the observed failure and restore the sub-
system to an error-free state. These restoration activities are often called error re-
covery.
• In some cases, when a monitoring subsystem observes a failure on a monitored
subsystem, it does not let the erroneous behavior of the latter subsystem affect
any other parts of the overall system by using some form of redundancy (e.g. a
duplicate of the failed subsystem) to cover up for the observed failure. These ac-
tivities are often called error masking.
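The interplay of error detection, checkpointing and error recovery can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the classes `Accumulator` and `Monitor` are hypothetical, a failure is assumed to be observable as a raised exception, and a checkpoint is assumed to be taken right after each output delivered according to the specification.

```python
class Accumulator:
    """Hypothetical monitored subsystem: sums the inputs it receives."""

    def __init__(self) -> None:
        self.total = 0

    def process(self, value: int) -> int:
        if value < 0:
            # Simulated transient error: the state is corrupted, then a failure shows.
            self.total = -999
            raise RuntimeError("observed failure")
        self.total += value
        return self.total

    def snapshot(self) -> dict:
        return {"total": self.total}

    def restore(self, state: dict) -> None:
        self.total = state["total"]


class Monitor:
    """Monitoring subsystem: detects failures, takes checkpoints, triggers recovery."""

    def __init__(self, subsystem: Accumulator) -> None:
        self.subsystem = subsystem
        self.checkpoint = subsystem.snapshot()
        self.recoveries = 0

    def submit(self, value: int) -> bool:
        try:
            self.subsystem.process(value)
        except RuntimeError:
            # Error detection, followed by error recovery from the last checkpoint.
            self.subsystem.restore(self.checkpoint)
            self.recoveries += 1
            return False
        # Checkpointing: save the state after a correct output has been delivered.
        self.checkpoint = self.subsystem.snapshot()
        return True


monitor = Monitor(Accumulator())
outcomes = [monitor.submit(v) for v in (5, -1, 7)]
```

Here the checkpoint is a complete snapshot of the subsystem state; saving only the pieces that changed since the last checkpoint would reduce the space overhead at the cost of a more complex restoration step.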
Before proceeding with the design of a fault tolerant system, the designer must resolve
two issues. First, the system designer must determine the type
of failures that will be confronted by the fault tolerance mechanism. Different fault tol-
erance techniques have been developed to deal with different failure types and they
may differ in all three means for error detection, recovery and masking. Hence, the
failure types that will be confronted by a system play a decisive role in the selection
of the fault tolerance techniques that can be applied to render the system fault toler-
ant. Some representative failure types are (see [10] for more information on failure
types):
• fail-stop failures where the failed system ceases execution without producing any
output and the failure is detectable by its environment,
• crash failures where the failed subsystem ceases execution without producing any
output but the failure might not be detectable by its environment,
• omission failures where a subsystem fails to deliver output to (send omission), or
receive input from (receive omission) its environment, and
• byzantine failures where the failed subsystem exhibits arbitrary behavior.
The second issue that must be determined is the unit of failure in the fault tolerant
system. The unit of failure is the minimum part of the system (i.e. the minimum subsystem)
where an error will be confined. Given the recursive decomposition of a system
into subsystems, any subsystem may potentially be decomposed into smaller constituent
systems. By defining the unit of failure, the system designer determines the
subsystems that will be monitored for failures. These subsystems may not be fault
tolerant themselves, and the fault tolerance mechanism that will be put in place will
not provide any guarantees about them experiencing failures. However, their compo-
sition will contain error detection, recovery and/or masking activities that will render
the resulting system fault tolerant with respect to the faults that may appear inside
each unit of failure. For example, in a distributed system consisting of a number of
machines interconnected over a network, the unit of failure can be set to be a ma-
chine or the set of processes that belong to the same application running on a single
machine, or even the individual processes.
Once the failure type and the unit of failure issues are sorted out, the designer has a
clear indication about what fault tolerance mechanisms to choose and where to
apply them in the system in order to make it fault tolerant. Still, a number of other fac-
tors will influence the final decision of the exact fault tolerance mechanism to be em-
ployed and its exact configuration. These factors include the number of simultaneous
errors that may occur, the design, space and time complexity of the fault tolerant
mechanism and how these align with the requirements about the corresponding sys-
tem qualities, etc.
3 FAULT TOLERANCE PATTERNS
This section presents in the form of patterns a selection of fault tolerance techniques
that deal with error detection, recovery and masking.
3.1 Fail-Stop Processor
Dealing with byzantine failures is extremely difficult and costly because of the arbi-
trary nature of the error that leads to the failure. For example, a system that exhibits
failures of byzantine semantics may deliver erroneous output, or no output at all, or it
may duplicate (correct or erroneous) output. Dealing with all different possible errors
is costly (e.g. different detection techniques for different types of errors) and bears a
big overhead regarding the design, space and time complexity of the system. When
developing a system out of constituents which by their very nature may experience
byzantine failures, it is desirable to transform these constituents into constituents with
equivalent functional specification but with more "designer friendly" failure semantics.
The Fail-Stop Processor pattern [14] describes one way of achieving that.
3.1.1 Context
The Fail-Stop Processor pattern applies to a system that has the following
characteristics:
• The system is deterministic, i.e. its output is solely defined by its initial state, the
sequence of inputs it has processed so far and the current time (in terms of clock
time and/or time elapsed since the system initialization).
• The errors the system may experience are transient, as opposed to permanent
errors like those caused by algorithmic faults.
• The errors the system may experience are not due to errors in the input it receives.
• The errors the system may experience cause it to exhibit byzantine failures.
3.1.2 Problem
In the above context, the Fail-Stop Processor pattern solves the problem of
transforming byzantine failures into fail-stop failures by balancing the following
forces:
• The error is confined within the failed system and does not infect its environment.
• The error is detected by the environment.
• The time overhead on error-free system execution is kept very low.
3.1.3 Solution
The solution to the above problem suggested by the Fail-Stop Processor pattern
is based on the replication of the system and the comparison of the replicas'
outputs for unanimity. Each one of the replicas is called a processor in the remainder. All
processors are identical to the system on which the Fail-Stop Processor pattern
is applied, hence they are deterministic. They are all initialized simultaneously and
they receive exactly the same input, hence at any given time and in the absence of
errors they must produce the same output. If the outputs produced by the processors
are not exactly the same, then an error has occurred and the ensemble of the processors
must be shut down in order to prevent the propagation of the error to the environment.
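The replicate-and-compare idea described so far can be sketched in Python. The sketch covers only the part of the solution stated above and rests on several assumptions: the class `FailStopProcessor` and the helper `make_replica` are hypothetical names, replicas are modeled as deterministic functions called in sequence rather than as truly parallel processors, and a transient error is injected artificially into one replica.

```python
from typing import Callable, List, Optional


class FailStopProcessor:
    """Runs identical deterministic replicas and halts on any output disagreement."""

    def __init__(self, replicas: List[Callable[[int], int]]) -> None:
        self.replicas = replicas
        self.stopped = False  # visible to the environment, so the failure is detectable

    def process(self, value: int) -> Optional[int]:
        if self.stopped:
            return None  # a stopped ensemble produces no output at all
        outputs = [replica(value) for replica in self.replicas]
        if any(output != outputs[0] for output in outputs[1:]):
            # Outputs are not unanimous: shut the whole ensemble down
            # so that no erroneous output reaches the environment.
            self.stopped = True
            return None
        return outputs[0]


def make_replica(corrupt_on: Optional[int] = None) -> Callable[[int], int]:
    """Builds a replica; `corrupt_on` injects a hypothetical transient error."""

    def replica(value: int) -> int:
        result = value + 1
        return result + 100 if value == corrupt_on else result

    return replica


fsp = FailStopProcessor([make_replica(), make_replica(corrupt_on=2)])
outputs = [fsp.process(v) for v in (1, 2, 3)]
```

After the disagreement on the second input the ensemble delivers no further output, which is the fail-stop behavior the pattern aims for: the arbitrary output of the faulty replica never reaches the environment.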