A System of Patterns for Fault Tolerance

Titos Saridakis

NOKIA Research Center

PO Box 407, FIN-00045 NOKIA Group, Finland

Tel: (+358) 7180 37293 Fax: (+358) 7180 36308

titos.saridakis@nokia.com

ABSTRACT

Many fault tolerance techniques that have been devised, applied and improved

over the past three decades represent general solutions to recurring problems in

the design of fault tolerant computer systems. This document presents some of

the best known such techniques, formatted as patterns and organized by a

classification scheme into a system of patterns for fault tolerance. This pattern

system reveals the relations among the presented patterns for fault tolerance and

delineates a number of ways in which these patterns can be used to refine each

other. In turn, these refinement relations create design frameworks for the devel-

opment of fault tolerant systems with different efficiency and complexity charac-

teristics.

Keywords: Design Framework, Fault Tolerance Pattern, Pattern classification.

1 INTRODUCTION

The indissoluble bonds of computers and failures have produced a plurality of fault

tolerance techniques that can satisfy, potentially, any requirement regarding the behavior

of a computer system in the presence of faults. Consequently, the development

of fault tolerant systems no longer relies on the (re)invention of ways to deal

with various faults that may occur; rather, it relies on the selection of the most appro-

priate one among the well-understood fault tolerance techniques. Each such tech-

nique provides a solution to a recurring fault tolerance problem under a set of clearly

defined assumptions about the type of the failures it deals with and the constraints

about the system behavior it guarantees. Hence, a well-understood fault tolerance

technique outlines a pattern that applies to concrete problems in the design of fault

tolerant systems in the specific context defined by the aforementioned assumptions and

constraints. In this document a set of fault tolerance techniques is formatted and

presented as patterns following a form similar to the one used in [3].

In general terms, fault tolerance provides techniques to confront faults and their con-

sequences in a system. These techniques describe the detection of errors in a sys-

tem, and the means that ensure the recovery of a system from errors or the masking

of errors in a system. The patterns presented in this document cover all three

constituents of fault tolerance, which are error detection, recovery and masking. The

different ways in which the presented patterns can be combined to produce a com-

plete solution for the design of fault tolerant systems are captured in a classification

scheme for the fault tolerance patterns. This classification scheme transforms the

presented set of patterns into a system of patterns that provides guidelines on how to

refine the design of a system to transform it into its fault tolerant counterpart.

Although the patterns presented in this document provide solutions to fault tolerance

problems, the content of this document is addressed to software designers and archi-

tects and not to fault tolerance experts. The presented patterns capture widely used

fault tolerance techniques from the field of distributed systems [10] and their compre-

hension does not require profound fault tolerance expertise.

Each pattern in this document presents a solution to a specific problem in detecting,

recovering from, or masking an error. Combining these patterns according to the

guidelines given by the classification scheme provides complete solutions to fault tol-

erance problems in the design of a system. Hence, each such combination of pat-

terns forms the basis for a design framework for fault tolerant systems with specific

properties (e.g. regarding the failure types they can cope with, the number of simul-

taneous errors they can tolerate, the time and complexity overhead of the fault toler-

ant mechanisms, etc).

The remainder of this document is organized in four sections. Section 2 contains a

summary of the background information regarding fault tolerance that is necessary

for a non-expert to follow the presentation. Section 3 presents a set of patterns that

capture well-understood fault tolerance techniques for error detection, recovery and

masking. In section 4 these patterns are organized as a pattern system with the help

of a classification scheme that reveals their relations (mainly dependency and refinement

relations). The same section also contains a discussion on the relation of

the presented system of fault tolerance patterns with other well-known pattern sys-

tems. The document concludes in section 5 with a brief summary of the presented

work, a brief evaluation of the importance of the system of fault tolerance patterns in

software design and some reflections on the future of pattern systems for problems

specific to non-functional properties (e.g. security, timeliness, configurability,

etc).

2 BACKGROUND

The long term and profound study of failures in computer systems has delivered a

clear understanding of the different types of failures that may occur and a variety of

techniques for dealing with them. The intent of this section is to provide a summary of

the fault tolerance background that would help system designers and architects to

probe deep into the patterns presented in the following section. This background is

taken from [7] and it includes the system model adopted in this document, the defini-

tions of fault tolerance related terms, a description of the failure types considered in

the context of this document, and a brief presentation of essential fault tolerance

concepts.

A system is an entity with a well-defined behavior in terms of the output it produces,

which is a function of the input it receives, the passage of time and its internal logic.

By “well-defined behavior” we mean that the output produced by the system is previ-

ously agreed upon and unambiguously distinguishable from output that does not

qualify as well-defined behavior. The well-defined behavior of a system is called the

system specification. A system interacts with its environment by receiving input from

it and delivering output to it. It may be possible to decompose a system into constituent

(sub)systems. In component-based software engineering (CBSE) terms a system is a

component that may consist of the

assembly of a number of smaller components. In OO terms a system is a composi-

tion of objects, each of which may be itself a composition of smaller objects.

A failure is said to occur in a system when the system’s environment observes an

output from the system that does not conform to its specification. An error is the part

of the system, e.g. one of its constituent (sub)systems, which is liable to lead to a

failure. A fault is the adjudged cause of an error and may itself be the result of a failure.

Hence, a fault causes an error that produces a failure, which subsequently may

result in a fault, and so on. Let us consider the following example:

A software bug in an application is a fault. It leads to an error when the

application execution reaches the point affected by the bug, which in turn

makes the application crash, which is a failure. By crashing, the application

leaves the socket ports it used blocked, which is a fault. The computer on

which the application crashed now has socket ports that are not used by any

process yet are not accessible to running applications, which is an error.

This in turn leads to a failure when another application requests these ports.

Based on the above, a fault in a system may propagate to the system's environment.

A system is called fault tolerant when it can deal with faults and their consequent er-

rors in such a way that it does not violate its specification, i.e. the environment of a

fault tolerant system does not perceive a failure of the system. Hence, a fault tolerant

system does not propagate faults to its environment. Fault tolerance techniques are

practical methods that describe how to detect an error and confine it within a system.

The confinement can be based on the restoration of the subsystem on which the er-

ror was detected before that error infects other parts of the system, or it can be based

on the masking of the error occurrence (e.g. by isolating the subsystem on which the

error was detected and using some form of redundancy to deliver the expected out-

put).

Each fault tolerance technique provides different guarantees regarding the properties

associated with system qualities such as the time or the space overhead introduced

to the normal execution of the system, the efficiency of the reaction to a failure, the

design complexity added to the system, etc. In general, fault tolerance techniques are

based on the following principles:

• Constituents of a fault tolerant system monitor other constituents for failure occur-

rences. By observing a failure, the monitoring subsystem can detect an error on

the monitored subsystem. These monitoring activities are often called error detec-

tion.

• In order to enable the restoration of a subsystem after an error has been detected

on it, appropriate information regarding the subsystem may be saved when certain

conditions are met (e.g. at regular time intervals, right after the subsystem delivers

some output according to its specification, when the subsystem decides on its own

to save the appropriate information, etc). This saving activity is often called checkpointing.

The appropriate information saved in a checkpointing activity may vary

from a complete snapshot of the internal subsystem representation (i.e. the state

of the subsystem) to selected pieces of its internal representation that have

changed since the last checkpoint.

• When a monitoring subsystem observes a failure on a monitored subsystem, it

may activate a mechanism that will use the last checkpoint of the latter subsystem

in order to eliminate the error that led to the observed failure and restore the sub-

system to an error-free state. These restoration activities are often called error re-

covery.

• In some cases, when a monitoring subsystem observes a failure on a monitored

subsystem, it does not let the erroneous behavior of the latter subsystem affect

any other parts of the overall system by using some form of redundancy (e.g. a

duplicate of the failed subsystem) to cover up for the observed failure. These ac-

tivities are often called error masking.
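
The first three principles above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a technique taken from the paper: the names Accumulator, Monitor and acceptance_test are hypothetical, the acceptance test stands in for the system specification, and the checkpoint is a complete snapshot of the subsystem state taken after every correct output. Error masking is not shown.

```python
import copy


class Accumulator:
    """Hypothetical monitored subsystem: keeps a running total of its inputs."""

    def __init__(self):
        self.total = 0

    def process(self, value):
        self.total += value
        return self.total


class Monitor:
    """Hypothetical monitoring subsystem wrapping one monitored subsystem."""

    def __init__(self, subsystem, acceptance_test):
        self.subsystem = subsystem
        self.acceptance_test = acceptance_test  # stands in for the specification
        self.recoveries = 0
        self._save_checkpoint()

    def _save_checkpoint(self):
        # Checkpointing: here a complete snapshot of the subsystem state.
        self.checkpoint = copy.deepcopy(vars(self.subsystem))

    def _restore_checkpoint(self):
        # Error recovery: return the subsystem to the last saved state.
        vars(self.subsystem).clear()
        vars(self.subsystem).update(copy.deepcopy(self.checkpoint))
        self.recoveries += 1

    def request(self, value):
        output = self.subsystem.process(value)
        if not self.acceptance_test(output):  # error detection
            self._restore_checkpoint()
            output = self.subsystem.process(value)  # retry from the restored state
            if not self.acceptance_test(output):
                raise RuntimeError("error persists after recovery")
        self._save_checkpoint()  # checkpoint right after a correct output
        return output


monitor = Monitor(Accumulator(), acceptance_test=lambda out: out >= 0)
monitor.request(1)
monitor.request(2)
monitor.subsystem.total = -100  # simulated transient corruption of the state
result = monitor.request(5)     # detected, rolled back to total == 3, retried
```

The retry only helps because the simulated error is transient. A permanent fault would reproduce the same erroneous output after the rollback, which is why the sketch gives up after one retry.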

Before proceeding with the design of a fault tolerant system, the designer must de-

termine the following two issues. First, the system designer must determine the type

of failures that will be confronted by the fault tolerance mechanism. Different fault tol-

erance techniques have been developed to deal with different failure types and they

may differ in all three means for error detection, recovery and masking. Hence, the

failure types that will be confronted by a system play a decisive role in the selection

of the fault tolerance techniques that can be applied to render the system fault toler-

ant. Some representative failure types are (see [10] for more information on failure

types):

• fail-stop failures, where the failed system ceases execution without producing any

output and the failure is detectable by its environment,

• crash failures, where the failed subsystem ceases execution without producing any

output but the failure might not be detectable by its environment,

• omission failures, where a subsystem fails to deliver output to (send omission), or

receive input from (receive omission) its environment, and

• byzantine failures, where the failed subsystem exhibits arbitrary behavior.
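
To make the differences between these failure types concrete, the following Python sketch simulates what the environment observes in each case. It is only an illustration under assumptions of our own: FaultySubsystem, fail_at and halt_signalled are hypothetical names, the subsystem simply doubles its input, None stands for "no output delivered", and the byzantine case returns one fixed wrong value although any behavior would be possible. Receive omissions are not modeled.

```python
import enum


class FailureType(enum.Enum):
    FAIL_STOP = "fail-stop"
    CRASH = "crash"
    SEND_OMISSION = "send omission"
    BYZANTINE = "byzantine"


class FaultySubsystem:
    """Hypothetical subsystem that doubles its input and fails at request number fail_at."""

    def __init__(self, failure, fail_at):
        self.failure = failure
        self.fail_at = fail_at
        self.requests_seen = 0
        self.stopped = False
        self.halt_signalled = False  # the only failure indication the environment gets

    def request(self, value):
        self.requests_seen += 1
        if self.stopped:
            return None  # execution has ceased, no further output
        if self.requests_seen == self.fail_at:
            if self.failure is FailureType.FAIL_STOP:
                self.stopped = True
                self.halt_signalled = True  # the environment can detect the failure
                return None
            if self.failure is FailureType.CRASH:
                self.stopped = True  # same silence, but nothing is signalled
                return None
            if self.failure is FailureType.SEND_OMISSION:
                return None  # this output is lost, later ones are delivered
            return -value  # byzantine: an arbitrary (here simply wrong) output
        return 2 * value


def trace(failure):
    """Return the outputs seen for inputs 1, 2, 3 and whether a halt was signalled."""
    subsystem = FaultySubsystem(failure, fail_at=2)
    outputs = [subsystem.request(v) for v in (1, 2, 3)]
    return outputs, subsystem.halt_signalled
```

In this sketch the fail-stop and crash traces contain the same outputs and differ only in the halt signal, which is the property that makes fail-stop failures easier to handle.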

The second issue that must be determined is the unit of failure in the fault tolerant

system. The unit of failure is the minimum part of the system (i.e. the minimum sub-

system) where an error will be confined. Given the recursive decomposition of a sys-

tem into subsystems, any subsystem may potentially be decomposed to smaller con-

stituent systems. By defining the unit of failure, the system designer determines the

subsystems that will be monitored for failures. These subsystems may not be fault

tolerant themselves, and the fault tolerance mechanism that will be put in place will

not provide any guarantees about them experiencing failures. However, their compo-

sition will contain error detection, recovery and/or masking activities that will render

the resulting system fault tolerant with respect to the faults that may appear inside

each unit of failure. For example, in a distributed system consisting of a number of

machines interconnected over a network, the unit of failure can be set to be a ma-

chine or the set of processes that belong to the same application running on a single

machine, or even the individual processes.

Once the failure type and the unit of failure issues are sorted out, the designer has a

clear indication about what fault tolerance mechanisms to choose and where to

apply them in the system in order to make it fault tolerant. Still, a number of other fac-

tors will influence the final decision of the exact fault tolerance mechanism to be em-

ployed and its exact configuration. These factors include the number of simultaneous

errors that may occur, the design, space and time complexity of the fault tolerant

mechanism and how these align with the requirements about the corresponding sys-

tem qualities, etc.

3 FAULT TOLERANCE PATTERNS

This section presents in the form of patterns a selection of fault tolerance techniques

that deal with error detection, recovery and masking.

3.1 Fail-Stop Processor

Dealing with byzantine failures is extremely difficult and costly because of the arbi-

trary nature of the error that leads to the failure. For example, a system that exhibits

failures of byzantine semantics may deliver erroneous output, or no output at all, or it

may duplicate (correct or erroneous) output. Dealing with all different possible errors

is costly (e.g. different detection techniques for different types of errors) and bears a

big overhead regarding the design, space and time complexity of the system. When

developing a system out of constituents which by their very nature may experience

byzantine failures, it is desirable to transform these constituents into constituents with

equivalent functional specification but with more "designer friendly" failure semantics.

The Fail-Stop Processor pattern [14] describes one way for achieving that.

3.1.1 Context

The Fail-Stop Processor pattern applies to a system that has the following

characteristics:

• The system is deterministic, i.e. its output is solely defined by its initial state, the

sequence of inputs it has processed so far and the current time (in terms of clock

time and/or time elapsed since the system initialization).

• The errors the system may experience are transient, as opposed to permanent

errors like those caused by algorithmic faults.

• The errors the system may experience are not due to errors in the input it receives.

• The errors the system may experience cause it to exhibit byzantine failures.

3.1.2 Problem

In the above context, the Fail-Stop Processor pattern solves the problem of

transforming the byzantine failures to fail-stop failures by balancing the following

forces:

• The error is confined within the failed system and does not infect its environment.

• The error is detected by the environment.

• The time overhead on error-free system execution is kept very low.

3.1.3 Solution

The solution to the above problem suggested by the Fail-Stop Processor pattern

is based on the replication of the system and the comparison of the replicas'

outputs for unanimity. Each one of the replicas is called a processor in the remainder. All

processors are identical to the system on which the Fail-Stop Processor pattern

is applied, hence they are deterministic. They are all initialized simultaneously and

they receive exactly the same input, hence at any given time and in the absence of

errors they must produce the same output. If the outputs produced by the processors

are not exactly the same, then an error has occurred and the ensemble of the processors

must be shut down in order to prevent the propagation of the error to the environment.
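
The comparison step described above can be sketched as follows in Python. This is a minimal sketch under assumptions of our own, not the structure given in the paper: the names FailStopProcessor, FailStop and Adder are hypothetical, the processors run one after another inside a single process rather than on independent hardware, and the number of replicas is a free parameter.

```python
class FailStop(Exception):
    """Signals that the ensemble has shut itself down."""


class Adder:
    """Hypothetical deterministic system: output depends only on the inputs so far."""

    def __init__(self):
        self.total = 0

    def process(self, value):
        self.total += value
        return self.total


class FailStopProcessor:
    def __init__(self, make_system, replicas=2):
        # All processors are created identically and start from the same state.
        self.processors = [make_system() for _ in range(replicas)]
        self.halted = False

    def process(self, value):
        if self.halted:
            raise FailStop("ensemble is shut down")
        # Every processor receives exactly the same input.
        outputs = [p.process(value) for p in self.processors]
        # Unanimity check: any disagreement means an error has occurred.
        if any(out != outputs[0] for out in outputs[1:]):
            self.halted = True  # shut down the whole ensemble, deliver nothing
            raise FailStop("processor outputs differ")
        return outputs[0]


ensemble = FailStopProcessor(Adder, replicas=3)
first = ensemble.process(2)        # all three processors agree
ensemble.processors[1].total = 99  # simulated transient error in one processor
try:
    ensemble.process(1)
    disagreement_detected = False
except FailStop:
    disagreement_detected = True
```

Note that the sketch makes no attempt to decide which processor is wrong. The erroneous output is never delivered and the environment sees only a detectable halt, which is the fail-stop behavior the pattern aims for.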
