See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/37436949
The Database State Machine Approach
ArticleinDistributed and Parallel Databases · July 2003
DOI: 10.1023/A:1022887812188·Source: OAI
CITATIONS
155
READS
416
3 authors:
Some of the authors of this publication are also working on these related projects:
Scalable Dependability View project
High Performance In-Memory Databases View project
Fernando Pedone
University of Lugano
192 PUBLICATIONS3,604 CITATIONS
SEE PROFILE
Rachid Guerraoui
École Polytechnique Fédérale de Lausanne
673 PUBLICATIONS17,631 CITATIONS
SEE PROFILE
André Schiper
École Polytechnique Fédérale de Lausanne
311 PUBLICATIONS10,776 CITATIONS
SEE PROFILE
All content following this page was uploaded by André Schiper on 21 May 2014.
The user has requested enhancement of the downloaded file.
Distributed and Parallel Databases, 14, 71–98, 2003
c
2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
The Database State Machine Approach
FERNANDO PEDONE fernando.pedone@epfl.ch
Hewlett-Packard Laboratories (HP Labs), Software Technology Laboratory, Palo Alto, CA 94304, USA;
School of Computer and Communication Systems, EPFL—Swiss Federal Institute of Technology,
CH-1015 Lausanne, Switzerland
RACHID GUERRAOUI
ANDR
´
E SCHIPER
School of Computer and Communication Systems, EPFL—Swiss Federal Institute of Technology,
CH-1015 Lausanne, Switzerland
Recommended by: Abdelsalam Helal
Abstract. Database replication protocols have historically been built on top of distributed database systems, and
have consequently been designed and implemented using distributed transactional mechanisms, such as atomic
commitment. We present the Database State Machine approach, a new way to deal with database replication in a
cluster of servers. This approach relies on a powerful atomic broadcast primitive to propagate transactions between
database servers, and alleviates the need for atomic commitment. Transaction commit is based on a certification
test, and abort rate is reduced by the reordering certification test. The approach is evaluated using a detailed
simulation model that shows the scalability of the system and the benefits of the reordering certification test.
Keywords: database replication, transaction processing, state machine approach, optimistic concurrency control,
synchronous replication, atomic broadcast
1. Introduction
Software replication is considered a cheap way to increase data availability when com-
pared to hardware-based techniques [16]. However, designing a synchronous replication
scheme (i.e., all copies are always consistent) that has good performance is still an active
area of research both in the database and in the distributed systems communities. Commer-
cial databases are typically based on the asynchronous replication model, which tolerates
inconsistencies among replicas [12, 32].
This paper investigates a new approach for synchronous database replication on a clus-
ter of database servers (e.g., a group of workstations connected by a local-area network).
The replication mechanism presented is based on the state machine approach [30], and
differs from traditional replication mechanisms in that it does not handle replication us-
ing distributed transactional mechanisms, such as atomic commitment [5, 13]. The state
machine approach was proposed as a general mechanism for dealing with replication,
however no previous study has addressed its use in the domain of a cluster of database
servers.
Our Database State Machine is based on the deferred update technique. According to
this technique, transactions are processed locally at one database server (i.e., one replica
72
PEDONE, GUERRAOUI AND SCHIPER
manager) and, at commit time, are forwarded for certification to the other servers (i.e.,
the other replica managers). Deferred update replication offers many advantages over its
alternative, immediate update replication, which synchronises every transaction operation
across all servers. Among these advantages, one may cite: (a) better performance, by gath-
ering and propagating multiple updates together, and localising the execution at a single,
possibly nearby, server (thus reducing the number of messages in the network), (b) better
support for fault tolerance, by simplifying server recovery (i.e., missing updates may be
treated by the communication module as lost messages), and (c) lower deadlock rate, by
eliminating distributed deadlocks [12].
The main drawback of the deferred update technique is that the lack of synchronisation
during transaction execution may lead to large transaction abort rates. We show how the
Database State Machine approach can be used to reduce the transaction abort rate by using a
reordering certification test, which looks for possible serialisable executions before deciding
to abort a transaction.
We have developed a simulation model of the Database State Machine and conducted
several experiments with it. The results obtained by our simulation model allowed us to
assess some important points about the system, like its scalability, and the effectiveness of
the reordering technique. Particularly, in the former case, it shows which parts of the system
are more prone to become resource bottlenecks. Evaluations of the reordering technique
have shown that transaction aborts due to serialisation problems can be reduced from 20%
to less than 5% in clusters of 8 database servers.
The rest of the paper is organised as follows. In Section 2, we introduce the replicated
database model on which our results are based, and the two main concepts used in our
approach. In Section 3, we recall the principle of the deferred update replication technique.
In Section 4, we show how to transform deferred update replication into a state machine
replication scheme. An optimisation of this approach that reduces the number of aborted
transactions is described in Section 5. In Section 6, we present the simulation tool we used
to evaluate the protocols discussed in the paper and draw some conclusions about them. In
Section 7 we discuss related work, and in Section 8 we conclude the paper. The proofs of
correctness of the algorithms are in the Appendix.
2. System model and definitions
In this section, we describe the system model and the two main concepts involved in our
approach, that is, those of state machine, and atomic broadcast. The state machine approach
delineates the replication strategy, and the atomic broadcast constitutes a sufficient order
mechanism to implement a state machine.
2.1. Database and failures
We consider a system composed of a set of sites. Sites in communicate through message
passing, and do not have access to a shared memory or to a common clock. To simplify the
presentation, we assume the existence of a discrete global clock, even if sites do not have
access to it. The range of the clock’s ticks is the set of natural numbers. The set is divided
DATABASE STATE MACHINE APPROACH
73
into two disjoint subsets: a subset of client sites, denoted
C
, and a subset of database sites,
denoted
D
. Hereafter, we consider that
C
={c
1
, c
2
,...,c
m
}, and
D
={s
1
, s
2
,...,s
n
}.
Each database site plays the role of a replica manager, and each one has a full copy of the
database.
Sites fail independently and only by crashing (i.e., we exclude Byzantine failures [24]).
We also assume that every database site eventually recovers after a crash. If a site is able
to execute requests at a certain time τ (i.e., the site did not fail or failed but recovered) we
say that the site is up at time τ ; otherwise the site is said to be down at time τ . For each
database site, we consider that there is a time after which the site is forever up.
1
Transactions are sequences of read and write operations followed by a commit or abort
operation. A transaction is called a query (or read-only) if it does not contain any write
operations; otherwise it is called an update transaction. Transactions, denoted t
a
, t
b
, and t
c
,
are submitted by client sites, and executed by database sites. Our correctness criterion for
transaction execution is one-copy serializability (1SR) [5].
2.2. The state machine approach
The state machine approach, also called active replication, is a non-centralised replication
coordination technique. Its key concept is that all replicas receive and process the same
sequence of requests. Replica consistency is guaranteed by assuming that when provided
with the same input (e.g., an external request) each replica will produce the same output
(e.g., state change). This assumption implicitly implies that replicas have a deterministic
behaviour.
The way requests are disseminated among replicas can be decomposed into two require-
ments [30]:
1. Agreement. Every non-faulty replica receives every request.
2. Order. If a replica processes request req
1
before req
2
, then no replica processes request
req
2
before request req
1
.
The order requirement can be weakened if some semantic information about the requests
is known. For example, if two requests commute, that is, independently of the order they
are processed they produce the same final states and sequence of outputs, then it is not
necessary that order be enforced among the replicas for these two requests.
2.3. Atomic broadcast
An atomic broadcast primitive enables to send messages to several sites, with the guarantee
that all sites agree on the set of messages delivered and the order according to which
the messages are delivered [17] (implementation details are discussed in Section 6.2).
Atomic broadcast is defined by the primitives broadcast(m) and deliver(m), and ensures
the following properties.
1. Agreement. If a site delivers a message m then every site delivers m.
2. Order. No two sites deliver any two messages in different orders.
74
PEDONE, GUERRAOUI AND SCHIPER
3. Termination. If a site broadcasts message m and does not fail, then every site eventually
delivers m.
The total order induced on the deliver is represented by the relation ≺. If message m
1
is
delivered before message m
2
, then deliver(m
1
) ≺ deliver(m
2
).
It is important to notice that the properties of atomic broadcast are defined in terms
of message delivery and not in terms of message reception. Typically, a database site
first receives a message, then executes some protocol to guarantee the atomic broadcast
properties, and finally delivers the message. From Section 2.2, it should be clear that atomic
broadcast is sufficient to guarantee the correct dissemination of requests to replicas acting
as state machines.
The notion of delivery captures the concept of durability, that is, a site must not forget
that it has delivered a message after it recovers from a crash. The agreement and order
properties of atomic broadcast also have an impact on the recovery of sites. When a site
s
i
recovers after a crash, s
i
must deliver first all messages it missed during the crashed
period.
3. Deferred update replication
The deferred update replication technique [5] is a way of dealing with requests in a replicated
database environment. It will be the base for the Database State Machine presented in
Section 4. In this section, we first recall the principle of the deferred update replication
technique, and then we provide a detailed characterisation of it.
3.1. Deferred update replication technique
In the deferred update replication technique, transactions are locally executed at one database
site, and during their execution, no interaction between other database sites occurs (see
figure 1). Transactions are locally synchronised at database sites according to some con-
currency control mechanism [5]. However, we assume throughout the paper that databases
synchronise transactions using a strict two-phase locking scheduler. When a client requests
the transaction commit, the transaction’s updates (e.g., the redo log records) and some con-
trol structures are propagated to all database sites, where the transaction will be certified
and, if possible, committed. This procedure, starting with the commit request, is called
termination protocol. The objective of the termination protocol is twofold: (i) propagating
transactions to database sites, and (ii) certifying them.
The certification test aims at ensuring one-copy serialisability. It decides to abort a trans-
action if the transaction’s commit would lead the database to an inconsistent state (i.e.,
non-serialisable). For example, consider two concurrent transactions, t
a
and t
b
, that are
executed at different database sites, and that update a common data item. On requesting
the commit, if t
a
arrives before t
b
at the database site s
i
but after t
b
at the database site
s
j
(i = j), both transactions t
a
and t
b
might have to be aborted, since otherwise, site s
i
would see transaction t
a
before transaction t
b
, and site s
j
would see transaction t
b
before
transaction t
a
, violating one-copy serialisability.
评论1