2 Front. Comput. Sci., 2021, 15(2): 152605
entry from its current term has been committed. This restriction
may block read requests, to avoid the risk of returning stale
data [8], and may lead to the livelock anomaly [17]. To avoid
these problems, once the new leader is elected, it needs to com-
mit an entry from its term. Specifically, each leader needs to
append a blank no-op entry—which we call the special mark
log entry in this paper—into the log at the start of its term.
Unfortunately, a leader change can leave the log of a replica
inconsistent with that of another node in Raft. This is be-
cause a replica can persist a log entry regardless of whether
the corresponding write is committed. We ana-
lyze the log inconsistency anomaly in Section 3.1. Therefore, to
guarantee the correctness and consistency of the system, when
a replica recovers as a follower, it needs to repair its local log
first. In this paper, we are mainly concerned with how a re-
covering follower repairs its log. The conventional follower log
repair methods (the details are presented in Section 3.2) usually
require multiple network round trips or transmit more data,
which increases follower recovery time.
In this work, we present an accurate and efficient log repair
(AELR) algorithm for follower recovery, which requires only
one network round trip to fetch the minimal set of log entries
from the leader when a follower is recovering. This algorithm lever-
ages the special mark log entries to accurately find the extrane-
ous log entries inconsistent with the leader’s, which enables the
recovering follower to repair its log accurately and efficiently.
Since we make use of the properties of Raft replication, our fol-
lower log repair method can apply to a database system adopt-
ing the Raft-like protocol. The following is the list of our main
contributions.
• We introduce the notion of the special mark log entry,
which serves as the delimiter at the start of a term. We
then propose the leader’s takeover execution, which uti-
lizes the leader’s own special entry and enables other
replicas to confirm whether that entry is committed.
• We introduce the AELR algorithm, which utilizes the spe-
cial mark log entries to repair the recovering follower’s
log accurately and efficiently. Then we explain why this
mechanism works and analyze it together with other log
repair approaches.
• We have implemented the AELR algorithm in the open
source database system OceanBase. The performance
analysis demonstrates the effectiveness of our method in
terms of recovery time.
This paper is organized as follows. First, we review the Raft
replication in Section 2. In Section 3, we analyze the log incon-
sistency anomaly and summarize the follower log repair meth-
ods. We introduce the special mark log entry and how the leader
uses it to take over in Section 4. Section 5 describes our accurate
and efficient log repair (AELR) algorithm for follower recov-
ery. In Section 6, we introduce the implementation of AELR
in a real database system. Section 7 presents the performance
evaluation. The related works are described in Section 8. We
conclude the paper in Section 9.
2 Preliminaries
In this section, we introduce the log replication model adopt-
ing the strong leadership and log coherency features, which is
based on Raft but differs somewhat from the original. We then
formalize the properties of this replication model.
• Strong leader: The leader replica is responsible for all the
write requests and it is the only one that can generate log
entries.
• Log coherency: There are no holes in the persisted log
(i.e., the log in the non-volatile storage) of each replica.
The hole in a log means that the LSNs of log entries are
not consecutive in the log.
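The log coherency property can be expressed as a simple invariant check. The following sketch is illustrative only; the function name is ours.

```python
def has_hole(lsns):
    """Log coherency check: return True if the LSNs of the persisted
    log entries are not consecutive, i.e., the log contains a hole."""
    return any(b != a + 1 for a, b in zip(lsns, lsns[1:]))
```

Under log coherency, every replica's persisted log must satisfy `has_hole(lsns) == False`.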
Since the datasets of a database can entirely reside within the
main memory, each replica is a main-memory database system
in this work. Although checkpointing and snapshots are other
important aspects of recovery techniques in the database litera-
ture, this work mainly focuses on log repair in Raft-based
systems. For a traditional main-memory database system,
the recovery can be divided into two phases: checkpoint in-
stalling and log replaying. Since a log may contain invalid log
entries, the recovery of a replica using the Raft protocol can be
divided into three phases: log repairing, checkpoint installing
and log replaying. It should be noted that the last two phases
are equal to the traditional two-phase recovery. When recover-
ing as a follower, the replica first repairs its local log. Then it
installs the latest checkpoint and replays the local log entries
from the checkpoint. Obviously, we do not need to modify
the original recovery phases; we only add the log repair
phase. Therefore, the method proposed in this work can be
extended to recovery settings with checkpoints or snapshots.
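The three-phase recovery described above can be sketched as follows. This is a hypothetical outline, assuming illustrative method names that are not OceanBase's actual API.

```python
class RecoveringReplica:
    """Sketch of three-phase follower recovery: the log repair phase
    (added by this work) runs before the two traditional phases."""

    def __init__(self):
        self.phases = []  # records the order in which phases ran

    def repair_log(self):
        # Phase 1 (new): repair the local log so it is consistent
        # with the leader's log.
        self.phases.append("log repairing")

    def install_checkpoint(self):
        # Phase 2: install the latest checkpoint.
        self.phases.append("checkpoint installing")

    def replay_log(self):
        # Phase 3: replay local log entries from the checkpoint.
        self.phases.append("log replaying")

    def recover_as_follower(self):
        self.repair_log()
        self.install_checkpoint()
        self.replay_log()
```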
2.1 The overview of Raft replication
To provide highly available services, the replicated database
systems usually are deployed on a cluster of collaborative com-
modity machines, where each one is a replica node used as a
state machine and mapped to one of the three roles: Leader,
Follower, or Candidate. Traditionally, systems adopting the
Raft protocol have two main phases: leader election and log repli-
cation, whose executions can be overlapped. For ease of de-
scription, the total number of state machines is N.
In the Raft-based system, the lifecycle of the system is di-
vided into consecutive “terms” of arbitrary length, each of
which is numbered with a monotonically increasing integer term_id.
Specifically, if a replica node is elected as the new leader, the
system enters a new leader term whose term_id is greater than
previous terms’.
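The monotonic increase of term_id on election can be captured in a one-line sketch; the function name is illustrative.

```python
def elect_new_leader(known_term_ids):
    """On a new election, the system enters a term whose term_id is
    strictly greater than every previous term's (monotonic increase)."""
    return max(known_term_ids, default=0) + 1
```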
During normal processing of log replication, only the leader
of a term can accept the write requests from clients. Figure 1
shows the model of log replication in the Raft replicated sys-
tem. For clarity, we divide the replication processing into two
steps:
Step 1: When receiving a write from a client, the leader gen-
erates a log entry e and sends the entry e to all replicas. When
a replica receives the entry e, it persists e into its local storage
and then returns an acknowledgment. If the leader gets ac-
knowledgments from a majority of replicas, it enters the
second step.
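Step 1 can be sketched as follows. This is a simplified model under our own assumptions: the hypothetical `Replica.persist` always succeeds and returns an acknowledgment, and the majority test is taken over the N state machines.

```python
class Replica:
    def __init__(self):
        self.storage = []

    def persist(self, entry):
        self.storage.append(entry)  # durable write to local storage
        return True                 # acknowledgment back to the leader


def replicate_step_one(replicas, entry, n):
    """Step 1 sketch: the leader sends entry e to all replicas, counts
    the acknowledgments, and proceeds to Step 2 only if a majority of
    the N state machines have persisted e."""
    acks = sum(1 for r in replicas if r.persist(entry))
    return 2 * acks > n
```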
Step 2: Since the entry e is persisted on a majority of repli-
cas, the leader can confirm that the entry e is committed. Thus,