In Search of an Understandable Consensus Algorithm
(Extended Version)
Diego Ongaro and John Ousterhout
Stanford University
Abstract
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.
1 Introduction
Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.
The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

(This tech report is an extended version of [32]; additional material is noted with a gray bar in the margin. Published May 20, 2014.)
Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.
The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).
2 Replicated state machines
Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
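The idea can be made concrete with a minimal sketch: a deterministic key/value state machine that applies commands from a log in order. The key/value machine and every name below are illustrative assumptions, not part of the paper; the point is only that two replicas given the same log reach the same state.

```go
package main

import "fmt"

// Command is a deterministic state-machine command (a simple key/value write).
type Command struct {
	Key, Value string
}

// StateMachine applies commands in log order; because it is deterministic,
// every replica that applies the same log computes the same state.
type StateMachine struct {
	state map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply executes one command and returns the output that would go back to the client.
func (sm *StateMachine) Apply(c Command) string {
	sm.state[c.Key] = c.Value
	return "OK"
}

func main() {
	// The consensus algorithm's job is to make every server's log identical;
	// here two replicas are fed the same log and end up in the same state.
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	a, b := NewStateMachine(), NewStateMachine()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.state["x"] == b.state["x"], a.state["y"] == b.state["y"]) // true true
}
```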
Consensus algorithms for practical systems typically have the following properties:

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers (see the short quorum sketch after this list). Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
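As a tiny arithmetic sketch of the majority property (the helper names are hypothetical): with n servers a command needs a quorum of n/2 + 1 responses, so the cluster can tolerate the remaining n minus quorum failures.

```go
package main

import "fmt"

// quorum returns the minimum number of servers that constitutes a majority.
func quorum(n int) int { return n/2 + 1 }

// toleratedFailures returns how many servers may fail while a majority remains.
func toleratedFailures(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d servers: quorum %d, tolerates %d failures\n",
			n, quorum(n), toleratedFailures(n))
	}
	// e.g. 5 servers: quorum 3, tolerates 2 failures
}
```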
3 What’s wrong with Paxos?
Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.
Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.
The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.
Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.
As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].
Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.
4 Designing for understandability
We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.
There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?
We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.
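A rough sketch of that randomization is shown below; the 150–300 ms range and all names are assumptions for illustration, not values given in this section. Each server draws a fresh random election timeout, so it is unlikely that two servers time out and start competing elections at the same instant.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a fresh randomized timeout. A follower draws a new
// value every time it resets its timer, so simultaneous timeouts (and hence
// split votes) are rare and are resolved quickly when they do occur.
func electionTimeout() time.Duration {
	const lo, hi = 150 * time.Millisecond, 300 * time.Millisecond
	return lo + time.Duration(rand.Int63n(int64(hi-lo)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(electionTimeout()) // e.g. 217ms, 162ms, 284ms
	}
}
```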
5 The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.
Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
RequestVote RPC
Invoked by candidates to gather votes (§5.2).
Arguments:
  term          candidate’s term
  candidateId   candidate requesting vote
  lastLogIndex  index of candidate’s last log entry (§5.4)
  lastLogTerm   term of candidate’s last log entry (§5.4)
Results:
  term          currentTerm, for candidate to update itself
  voteGranted   true means candidate received vote
Receiver implementation:
  1. Reply false if term < currentTerm (§5.1)
  2. If votedFor is null or candidateId, and candidate’s log is at least as up-to-date as receiver’s log, grant vote (§5.2, §5.4)
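A minimal Go sketch of these receiver rules follows, under assumed types and field names; the struct layout, the -1 sentinel for a null votedFor, and the omission of term updates and persistence are simplifications, not part of the paper. The up-to-date check compares last log terms and then last log indices, as defined in §5.4.

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{}
}

type RequestVoteArgs struct {
	Term, CandidateId, LastLogIndex, LastLogTerm int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Server struct {
	currentTerm int
	votedFor    int        // -1 stands in for "null"
	log         []LogEntry // index 0 holds a dummy entry so the first real index is 1
}

// lastLogInfo returns the index and term of the receiver's last log entry.
func (s *Server) lastLogInfo() (index, term int) {
	index = len(s.log) - 1
	return index, s.log[index].Term
}

// RequestVote applies the two receiver rules listed above. Updating
// currentTerm when args.Term is newer, and persisting state before
// replying, are omitted here for brevity.
func (s *Server) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	reply.Term = s.currentTerm
	// 1. Reply false if term < currentTerm (§5.1).
	if args.Term < s.currentTerm {
		reply.VoteGranted = false
		return
	}
	// 2. Grant the vote only if we have not already voted for another
	//    candidate this term and the candidate's log is at least as
	//    up-to-date as ours (§5.2, §5.4).
	lastIndex, lastTerm := s.lastLogInfo()
	upToDate := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
	if (s.votedFor == -1 || s.votedFor == args.CandidateId) && upToDate {
		s.votedFor = args.CandidateId
		reply.VoteGranted = true
		return
	}
	reply.VoteGranted = false
}
```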
AppendEntries RPC
Invoked by leader to replicate log entries (§5.3); also used as heartbeat (§5.2).
Arguments:
  term          leader’s term
  leaderId      so follower can redirect clients
  prevLogIndex  index of log entry immediately preceding new ones
  prevLogTerm   term of prevLogIndex entry
  entries[]     log entries to store (empty for heartbeat; may send more than one for efficiency)
  leaderCommit  leader’s commitIndex
Results:
  term          currentTerm, for leader to update itself
  success       true if follower contained entry matching prevLogIndex and prevLogTerm
Receiver implementation:
  1. Reply false if term < currentTerm (§5.1)
  2. Reply false if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)
  3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
  4. Append any new entries not already in the log
  5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
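A Go sketch of these receiver rules, again under assumed types and field names; it omits term updates, election-timer resets, and persistence, so it is an illustration rather than a complete implementation.

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term, LeaderId, PrevLogIndex, PrevLogTerm int
	Entries                                   []LogEntry
	LeaderCommit                              int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Server struct {
	currentTerm int
	commitIndex int
	log         []LogEntry // index 0 holds a dummy entry so the first real index is 1
}

// AppendEntries applies the five receiver rules listed above.
func (s *Server) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = s.currentTerm
	// 1. Reply false if term < currentTerm (§5.1).
	if args.Term < s.currentTerm {
		reply.Success = false
		return
	}
	// 2. Reply false unless our log has an entry at prevLogIndex whose term
	//    matches prevLogTerm (§5.3).
	if args.PrevLogIndex >= len(s.log) || s.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}
	// 3. and 4. Delete conflicting entries (and everything after them), then
	//    append any entries not already in the log (§5.3).
	for i, entry := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(s.log) && s.log[idx].Term != entry.Term {
			s.log = s.log[:idx] // conflict: truncate this entry and all that follow
		}
		if idx >= len(s.log) {
			s.log = append(s.log, entry)
		}
	}
	// 5. Advance commitIndex, but never past the index of the last new entry.
	if args.LeaderCommit > s.commitIndex {
		lastNewIndex := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < lastNewIndex {
			s.commitIndex = args.LeaderCommit
		} else {
			s.commitIndex = lastNewIndex
		}
	}
	reply.Success = true
}
```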
State
Persistent state on all servers (updated on stable storage before responding to RPCs):
  currentTerm   latest term server has seen (initialized to 0 on first boot, increases monotonically)
  votedFor      candidateId that received vote in current term (or null if none)
  log[]         log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)
Volatile state on all servers:
  commitIndex   index of highest log entry known to be committed (initialized to 0, increases monotonically)
  lastApplied   index of highest log entry applied to state machine (initialized to 0, increases monotonically)
Volatile state on leaders (reinitialized after election):
  nextIndex[]   for each server, index of the next log entry to send to that server (initialized to leader last log index + 1)
  matchIndex[]  for each server, index of highest log entry known to be replicated on server (initialized to 0, increases monotonically)
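The same state can be written down as Go structs; the names mirror the box above, but the grouping into three structs and the -1 sentinel for a null votedFor are illustrative choices, not requirements from the paper.

```go
package raft

// LogEntry holds a state-machine command plus the term in which the leader
// received it (the log's first index is 1).
type LogEntry struct {
	Term    int
	Command interface{}
}

// PersistentState must be written to stable storage before a server responds
// to any RPC, so it survives crashes and restarts.
type PersistentState struct {
	CurrentTerm int        // latest term seen; starts at 0, increases monotonically
	VotedFor    int        // candidate voted for in the current term (-1 for "null")
	Log         []LogEntry // log entries
}

// VolatileState is kept in memory on every server.
type VolatileState struct {
	CommitIndex int // highest log index known to be committed (starts at 0)
	LastApplied int // highest log index applied to the state machine (starts at 0)
}

// LeaderState is reinitialized by a server each time it becomes leader.
type LeaderState struct {
	NextIndex  map[int]int // per follower: next log index to send (leader's last index + 1)
	MatchIndex map[int]int // per follower: highest index known to be replicated (starts at 0)
}
```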
Rules for Servers
All Servers:
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine (§5.3)
• If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)
Followers (§5.2):
• Respond to RPCs from candidates and leaders
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate
Candidates (§5.2):
• On conversion to candidate, start election:
  • Increment currentTerm
  • Vote for self
  • Reset election timer
  • Send RequestVote RPCs to all other servers
• If votes received from majority of servers: become leader
• If AppendEntries RPC received from new leader: convert to follower
• If election timeout elapses: start new election
Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts (§5.2)
• If command received from client: append entry to local log, respond after entry applied to state machine (§5.3)
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
  • If successful: update nextIndex and matchIndex for follower (§5.3)
  • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)
• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4).
Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the Rules for Servers box is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [31] describes the algorithm more precisely.