
In Search of an Understandable Consensus Algorithm

Diego Ongaro and John Ousterhout


Stanford University
(Draft of October 7, 2013)
Abstract

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

1 Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [14, 15] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture is unsuitable for building practical systems, requiring complex changes to create an efficient and complete solution. As a result, both system builders and students struggle with Paxos.

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm and describe it in a way that is significantly easier to learn than Paxos, and that facilitates the development of intuitions that are essential for system builders? It was important not just for the algorithm to work, but for it to be obvious why it works. In addition, the algorithm needed to be complete enough to cover all the major issues required for an implementation.

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, students were able to answer questions about Raft 23% better than questions about Paxos.

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [28, 21]), but it has several novel features:

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a novel joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5-8), evaluates Raft (Section 9), and discusses related work (Section 10).

2 Achieving fault-tolerance with replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [34]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault-tolerance problems in distributed
systems. For example, large-scale systems that have a single cluster leader, such as GFS [7], HDFS [35], and RAMCloud [29], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [10].

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. As shown in Figure 1, the consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly-reliable state machine.

Consensus algorithms for practical systems typically have the following properties:

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

• In the common case, a command can complete as soon as any majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
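To make the model concrete, here is a minimal sketch of a deterministic state machine in Go. The key/value command format and the helper names are illustrative assumptions, not part of the model; the point is only that applying the same log in the same order yields the same state and outputs on every server.

package main

import "fmt"

// Command is a hypothetical state machine command; the model does not
// prescribe a format, so a simple key/value assignment is assumed here.
type Command struct {
    Key   string
    Value string
}

// StateMachine is a deterministic key/value store. Given the same sequence
// of commands, every replica computes the same state.
type StateMachine struct {
    data map[string]string
}

func NewStateMachine() *StateMachine {
    return &StateMachine{data: make(map[string]string)}
}

// Apply executes one committed command and returns its output.
func (sm *StateMachine) Apply(cmd Command) string {
    sm.data[cmd.Key] = cmd.Value
    return cmd.Value
}

func main() {
    // The consensus module's job is to make sure every server sees the same
    // log; applying it in order then yields identical states everywhere.
    log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
    sm := NewStateMachine()
    for i, cmd := range log {
        out := sm.Apply(cmd)
        fmt.Printf("applied index %d: %s=%s\n", i+1, cmd.Key, out)
    }
}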
3 What’s wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [14] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [14] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [15, 19, 20]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [25], [36], and [12], but these differ from

each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone. The difficulties of understanding Paxos exacerbate the problem: system builders must modify the Paxos algorithm in major ways, yet Paxos does not provide them with the intuitions needed for this. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system.... the final system will be based on an unproven protocol [4].

Because of these problems, we have concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and appropriate foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications? Given a choice between an alternative that was concise but subtle and one that was longer (either in lines of code or explanation) but more obvious, we chose the more obvious approach. Fortunately, in most cases the more obvious approach was also more concise.

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. For example, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. This approach conflicts with advice given by Lampson: “More nondeterminism is better, because it allows more implementations [19].” In our situation we needed only a single implementation, but it needed to be understandable; we found that reducing nondeterminism usually improved understandability. We suspect that trading off implementation flexibility for understandability makes sense for most system designs.

5 The Raft consensus algorithm

Raft uses a collection of servers communicating with remote procedure calls (RPCs) to implement a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the "Rules for Servers" box is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [33] describes the algorithm more precisely.

State

Persistent state on all servers (updated on stable storage before responding to RPCs):
currentTerm: latest term server has seen (initialized to 0 on first boot, increases monotonically)
votedFor: candidateId that received vote in current term (or null if none)
log[]: log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)

Volatile state on all servers:
commitIndex: index of highest log entry known to be committed (initialized to 0, increases monotonically)
lastApplied: index of highest log entry applied to state machine (initialized to 0, increases monotonically)

Volatile state on leaders (reinitialized after election):
nextIndex[]: for each server, index of the next log entry to send to that server (initialized to leader last log index + 1)
matchIndex[]: for each server, index of highest log entry known to be replicated on server (initialized to 0, increases monotonically)

AppendEntries RPC

Invoked by leader to replicate log entries (§5.3); also used as heartbeat (§5.2).

Arguments:
term: leader's term
leaderId: so follower can redirect clients
prevLogIndex: index of log entry immediately preceding new ones
prevLogTerm: term of prevLogIndex entry
entries[]: log entries to store (empty for heartbeat; may send more than one for efficiency)
leaderCommit: leader's commitIndex

Results:
term: currentTerm, for leader to update itself
success: true if follower contained entry matching prevLogIndex and prevLogTerm

Receiver implementation:
1. Reply false if term < currentTerm (§5.1)
2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, last log index)

RequestVote RPC

Invoked by candidates to gather votes (§5.2).

Arguments:
term: candidate's term
candidateId: candidate requesting vote
lastLogIndex: index of candidate's last log entry (§5.4)
lastLogTerm: term of candidate's last log entry (§5.4)

Results:
term: currentTerm, for candidate to update itself
voteGranted: true means candidate received vote

Receiver implementation:
1. Reply false if term < currentTerm (§5.1)
2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote (§5.2, §5.4)

Rules for Servers

All Servers:
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine (§5.3)
• If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)

Followers (§5.2):
• Respond to RPCs from candidates and leaders
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate

Candidates (§5.2):
• On conversion to candidate, start election:
  • Increment currentTerm
  • Vote for self
  • Reset election timeout
  • Send RequestVote RPCs to all other servers
• If votes received from majority of servers: become leader
• If AppendEntries RPC received from new leader: convert to follower
• If election timeout elapses: start new election

Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts (§5.2)
• If command received from client: append entry to local log, respond after entry applied to state machine (§5.3)
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
  • If successful: update nextIndex and matchIndex for follower (§5.3)
  • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)
• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4).
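For reference, the boxes above translate almost directly into data types. The following Go sketch mirrors the field names in Figure 2; the struct layout, the pointer encoding of a null votedFor, and the map types are illustrative assumptions rather than part of the algorithm.

package raft

// LogEntry holds a state machine command together with the term in which
// the entry was received by the leader (first index is 1).
type LogEntry struct {
    Term    int
    Command []byte
}

// PersistentState must be updated on stable storage before responding to RPCs.
type PersistentState struct {
    CurrentTerm int        // latest term this server has seen (starts at 0)
    VotedFor    *int       // candidateId voted for in current term; nil if none
    Log         []LogEntry // log entries
}

// VolatileState is kept on all servers.
type VolatileState struct {
    CommitIndex int // index of highest log entry known to be committed
    LastApplied int // index of highest log entry applied to the state machine
}

// LeaderState is volatile state reinitialized after each election win.
type LeaderState struct {
    NextIndex  map[int]int // per server: next log index to send
    MatchIndex map[int]int // per server: highest log index known replicated
}

// AppendEntriesArgs and AppendEntriesReply mirror the AppendEntries RPC box.
type AppendEntriesArgs struct {
    Term         int        // leader's term
    LeaderID     int        // so followers can redirect clients
    PrevLogIndex int        // index of entry immediately preceding the new ones
    PrevLogTerm  int        // term of the prevLogIndex entry
    Entries      []LogEntry // empty for heartbeat
    LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
    Term    int  // currentTerm, for the leader to update itself
    Success bool // true if the follower matched prevLogIndex/prevLogTerm
}

// RequestVoteArgs and RequestVoteReply mirror the RequestVote RPC box.
type RequestVoteArgs struct {
    Term         int // candidate's term
    CandidateID  int // candidate requesting vote
    LastLogIndex int // index of candidate's last log entry
    LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
    Term        int  // currentTerm, for the candidate to update itself
    VoteGranted bool // true means the candidate received this server's vote
}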
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves slight extensions to the election and replication mechanisms described in Sections 5.2 and 5.3.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

Election Safety: at most one leader can be elected in a given term. §5.2
Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries. §5.3
Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. §5.3
Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. §5.4
State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. §5.4.3

Figure 3: Raft guarantees that each of these properties is true at all times. The section numbers indicate where each property is discussed.

5.1 Raft basics

A Raft cluster contains several servers (five is a typical number, which allows the system to tolerate two failures). At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no RPCs on their own but simply respond to RPCs from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [13] in Raft, and they allow Raft servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
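These term rules reduce to two small checks that every RPC handler and RPC caller can share. The Go sketch below is illustrative, not a prescribed implementation; the Server struct and its field names are assumptions.

package raft

type Role int

const (
    Follower Role = iota
    Candidate
    Leader
)

// Server holds just the fields needed for the term rules of Section 5.1; the
// struct itself is an assumption made for this sketch.
type Server struct {
    currentTerm int
    votedFor    *int
    role        Role
}

// observeTerm applies the rule for terms carried by incoming RPC requests or
// responses: a larger term advances currentTerm and forces the server back to
// follower state.
func (s *Server) observeTerm(rpcTerm int) {
    if rpcTerm > s.currentTerm {
        s.currentTerm = rpcTerm
        s.votedFor = nil
        s.role = Follower
    }
}

// rejectStale reports whether an incoming request carries a stale term and
// should therefore be rejected.
func (s *Server) rejectStale(rpcTerm int) bool {
    return rpcTerm < s.currentTerm
}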
Raft uses only two types of RPCs between servers for the basic consensus algorithm. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). A third RPC is introduced in Section 7 for transferring snapshots between servers.

5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a
given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to every other server to establish its authority and prevent new elections.

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The exact transitions may be observed at different times on different servers.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is older than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (currently 150-300ms in our implementation). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its (randomized) election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
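A minimal Go sketch of the follower side of this mechanism appears below. The 150-300ms interval is the one quoted above; the channel used to signal heartbeats, and the rest of the scaffolding, are assumptions of the sketch.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// electionTimeout picks a fresh timeout uniformly from the fixed interval
// quoted above (150-300 ms); randomization spreads servers out so that split
// votes are rare.
func electionTimeout() time.Duration {
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runFollower waits for heartbeats; if none arrives within the election
// timeout it returns, signalling that the server should become a candidate
// and start an election.
func runFollower(heartbeat <-chan struct{}) {
    timer := time.NewTimer(electionTimeout())
    defer timer.Stop()
    for {
        select {
        case <-heartbeat:
            // Valid traffic from a leader (or candidate) resets the timer.
            timer.Reset(electionTimeout())
        case <-timer.C:
            fmt.Println("election timeout elapsed: becoming candidate")
            return
        }
    }
}

func main() {
    heartbeat := make(chan struct{})
    go func() {
        // Simulate a leader that stops sending heartbeats after three beats.
        for i := 0; i < 3; i++ {
            time.Sleep(50 * time.Millisecond)
            heartbeat <- struct{}{}
        }
    }()
    runFollower(heartbeat)
}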
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability, particularly when combined with the safety extensions discussed in Section 5.4. We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.

5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. In the simple case of a leader replicating entries from its current term, a log entry is committed
once it is stored on a majority of servers (e.g., entries 1-7 in Figure 6). Section 5.4 will extend this rule to handle other situations. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

• If two entries in different logs have the same index and term, then they store the same command.

• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log.

The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
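The receiver side of this consistency check is small enough to sketch directly. The Go function below follows the AppendEntries receiver rules in Figure 2; the 1-based slice layout, with an unused sentinel entry at index 0, is an assumption of the sketch.

package raft

type LogEntry struct {
    Term    int
    Command []byte
}

// handleAppendEntries applies the AppendEntries receiver rules from Figure 2
// to a follower's log. Indexes are 1-based: log[0] is an unused sentinel, so
// index i lives at log[i]. It returns the (possibly modified) log and whether
// the consistency check passed.
func handleAppendEntries(log []LogEntry, prevLogIndex, prevLogTerm int, entries []LogEntry) ([]LogEntry, bool) {
    // Refuse the new entries if the log has no entry at prevLogIndex whose
    // term matches prevLogTerm; the leader will decrement nextIndex and retry.
    if prevLogIndex >= len(log) || (prevLogIndex > 0 && log[prevLogIndex].Term != prevLogTerm) {
        return log, false
    }
    // Walk the new entries: truncate the log at the first conflict, then
    // append whatever is not already present.
    for i, e := range entries {
        idx := prevLogIndex + 1 + i
        if idx < len(log) {
            if log[idx].Term == e.Term {
                continue // this entry is already in the log
            }
            log = log[:idx] // conflict: delete this entry and all that follow
        }
        log = append(log, e)
    }
    return log, true
}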
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader (a-b), it may have extra entries that are not present on the leader (c-d), or both (e-f). Missing and extraneous entries in a log may span multiple terms.

Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a-f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a-b), may have extra uncommitted entries (c-d), or both (e-f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe.

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed; it will remove any conflicting entries in the follower’s log and append entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include information about the term that contains the conflicting entry (term number and index of the first log entry for this term). With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
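The leader’s side of the repair is a simple retry loop. In the Go sketch below, sendAppendEntries is a hypothetical stand-in for the real RPC, and the loop ignores term changes that would cause a real leader to step down; those details are assumptions left out of the sketch.

package raft

type LogEntry struct {
    Term    int
    Command []byte
}

// appendResult is what a follower returns; Success is false when the
// consistency check failed.
type appendResult struct {
    Term    int
    Success bool
}

// sendAppendEntries is a hypothetical stand-in for the real RPC; it ships the
// entries starting at nextIndex together with the index and term of the
// preceding entry.
type sendAppendEntries func(prevLogIndex, prevLogTerm int, entries []LogEntry) appendResult

// replicateTo brings one follower's log into agreement with the leader's by
// decrementing nextIndex and retrying until AppendEntries succeeds.
// leaderLog is 1-based with an unused sentinel entry at index 0.
func replicateTo(leaderLog []LogEntry, nextIndex int, send sendAppendEntries) (matchIndex int) {
    for {
        prev := nextIndex - 1
        reply := send(prev, leaderLog[prev].Term, leaderLog[nextIndex:])
        if reply.Success {
            // The follower now matches the leader up through the last entry sent.
            return len(leaderLog) - 1
        }
        // Log inconsistency: back up one entry and try again (never below 1,
        // since all logs trivially agree at the sentinel index 0). The
        // term-skipping optimization described above could jump back a whole
        // term here instead.
        if nextIndex > 1 {
            nextIndex--
        }
    }
}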

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

This section completes the Raft algorithm with two extensions: it restricts which servers may be elected leader, and it restricts which entries are considered committed. Together, these restrictions ensure that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). We then show how the Leader Completeness Property leads to correct behavior of the replicated state machine.

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [21], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

5.4.1 Election restriction

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
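The up-to-date comparison and the vote-granting rule from Figure 2 can be sketched in a few lines of Go; the function and parameter names are illustrative assumptions.

package raft

// atLeastAsUpToDate reports whether log A (described by the index and term of
// its last entry) is at least as up-to-date as log B, per Section 5.4.1:
// compare last terms first, then lengths.
func atLeastAsUpToDate(lastIndexA, lastTermA, lastIndexB, lastTermB int) bool {
    if lastTermA != lastTermB {
        return lastTermA > lastTermB
    }
    return lastIndexA >= lastIndexB
}

// grantVote applies the RequestVote receiver rule: vote only if we have not
// already voted for someone else this term and the candidate's log is at
// least as up-to-date as ours.
func grantVote(votedFor *int, candidateID, candLastIndex, candLastTerm, ourLastIndex, ourLastTerm int) bool {
    if votedFor != nil && *votedFor != candidateID {
        return false
    }
    return atLeastAsUpToDate(candLastIndex, candLastTerm, ourLastIndex, ourLastTerm)
}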
5.4.2 Restriction on commitment

We now explore whether the election restriction is sufficient to ensure the Leader Completeness Property. Consider the situations where a leader decides that a log entry is committed. There are two such situations, which are diagrammed in Figure 8. The most common case is where the leader replicates an entry from its current term (Figure 8(a)). In this case the entry is committed as soon as the leader confirms that it is stored on a majority of the full cluster. At this point only servers storing the entry can be elected as leader.

Figure 8: Scenarios for commitment. In each scenario S1 is leader and has just finished replicating a log entry to S3. In (a) the entry is from the leader’s current term (2), so it is now committed. In (b) the leader for term 4 is replicating an entry from term 2; index 2 is not safely committed because S5 could become leader of term 5 (with votes from S2, S3, and S4) and overwrite the entry. Once the leader for term 4 has replicated an entry from term 4 in scenario (c), S5 cannot win an election, so both indexes 2 and 3 are now committed.

The second case for commitment is when a leader is committing an entry from an earlier term. This situation is illustrated in Figure 8(b). The leader for term 2 created an entry at log index 2 but replicated it only on S1 and S2 before crashing. S5 was elected leader for term 3 but was unaware of this entry (it received votes from itself, S3, and S4). Thus it created its own entry in log slot 2; then it crashed before replicating that entry. S1 was elected leader for term 4 (with votes from itself, S2 and S3). It then replicated its log index 2 on S3. In this situation, S1 cannot consider log index 2 committed even though it is stored on a majority of the servers: S5 could still be elected
leader (since its log is more up-to-date than the logs of S2, S3, and S4) and propagate its own value for index 2.

Raft handles this situation with an additional restriction on committing log entries. A new leader may not conclude that any log entries are committed until it has committed an entry from its current term. Once this happens, all of the preceding entries in its log are also committed. Figure 8(c) shows how this preserves the Leader Completeness Property: once the leader has replicated an entry from term 4 on a majority of the cluster, the election rules prevent S5 from being elected leader.
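Combined with the commit rule in Figure 2, this restriction can be sketched as follows in Go. The matchIndex array and the 1-based log layout are assumptions carried over from Figure 2; the restriction appears as the check on the entry’s term.

package raft

type LogEntry struct {
    Term    int
    Command []byte
}

// advanceCommitIndex returns the new commit index for a leader. It finds the
// largest N > commitIndex that is stored on a majority of the cluster, but,
// per Section 5.4.2, only counts N as committed if log[N] is from the
// leader's current term; earlier entries then become committed indirectly.
// log is 1-based with an unused sentinel at index 0; matchIndex holds, for
// each other server, the highest index known to be replicated there.
func advanceCommitIndex(log []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
    clusterSize := len(matchIndex) + 1 // the other servers plus the leader itself
    for n := len(log) - 1; n > commitIndex; n-- {
        if log[n].Term != currentTerm {
            continue // an earlier-term entry is never committed by counting replicas
        }
        count := 1 // the leader stores every entry in its own log
        for _, m := range matchIndex {
            if m >= n {
                count++
            }
        }
        if count*2 > clusterSize {
            return n
        }
    }
    return commitIndex
}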
5.4.3 Safety argument

Given the complete rules for commitment and election, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on our safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

Figure 9: Suppose that S1 (leader for term T) commits a new log entry from its term, but that entry is not stored by the leader for a later term U (S5). Then there must be at least one server (S3) that accepted the log entry and also voted for S5.

1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).

2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.

3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).

4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.

5. The voter granted its vote to leaderU, so leaderU’s log must have been at least as up-to-date as the voter’s. This leads to one of two contradictions.

6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.

7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.

8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(c).

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the leader must have decided the entry is committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

5.5 Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; the server will eventually restart (as a follower) and the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Fortunately, Raft RPCs are idempotent so this causes no harm. For example, if a follower receives an AppendEntries request
that includes log entries already present in its log, it ignores those entries in the new request.

5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) is a different story: it must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
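As a concrete illustration, the requirement can be written down as configuration constants whose ordering is easy to verify; the specific values in the Go sketch below are merely picked from the ranges given above, not recommendations.

package main

import (
    "fmt"
    "time"
)

// Illustrative values drawn from the ranges discussed above: broadcast times
// of 0.5-20 ms, election timeouts of 10-500 ms, and server MTBFs measured in
// months. The particular choices here are assumptions for this sketch.
const (
    broadcastTime   = 10 * time.Millisecond
    electionTimeout = 300 * time.Millisecond
    mtbf            = 30 * 24 * time.Hour
)

func main() {
    // broadcastTime << electionTimeout << MTBF
    fmt.Println("broadcastTime an order of magnitude below electionTimeout:",
        broadcastTime*10 <= electionTimeout)
    fmt.Println("electionTimeout a few orders of magnitude below MTBF:",
        electionTimeout*1000 <= mtbf)
}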
Raft will continue to function correctly even if the timing requirement is occasionally violated. For example, the system can tolerate short-lived networking glitches that make the broadcast time larger than the election timeout. If the timing requirement is violated over a significant period of time, then the cluster may become unavailable. Once the timing requirement is restored, the system will become available again.

6 Cluster membership changes

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

The biggest challenge for configuration changes is to ensure safety: there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so there will be a period of time when some of the servers are using the old configuration while others have switched to the new configuration. As shown in Figure 10, this can result in two independent majorities.

Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (C_old) and another with a majority of the new configuration (C_new).

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [21]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

• Log entries are replicated to all servers in both configurations.

• Any server from either configuration may serve as leader.

• Agreement (for elections and entry commitment) requires majorities from both the old and new configurations.
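During joint consensus, "a majority" therefore means a separate majority of each configuration. The Go sketch below illustrates that agreement check; representing a configuration as a set of server ids, and the voters parameter, are assumptions of the sketch.

package raft

// Config is the set of server ids in one configuration.
type Config map[int]bool

// majorityOf reports whether the given voters include a majority of cfg.
func majorityOf(cfg Config, voters map[int]bool) bool {
    count := 0
    for id := range cfg {
        if voters[id] {
            count++
        }
    }
    return count*2 > len(cfg)
}

// jointAgreement implements the joint-consensus rule: elections and entry
// commitment require separate majorities of both the old and new
// configurations (C_old,new).
func jointAgreement(oldCfg, newCfg Config, voters map[int]bool) bool {
    return majorityOf(oldCfg, voters) && majorityOf(newCfg, voters)
}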

As will be shown below, the joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C_old to C_new, it stores the configuration for joint consensus (C_old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of C_old,new to determine when the log entry for C_old,new is committed. If the leader crashes, a new leader may be chosen under either C_old or C_old,new, depending on whether the winning candidate has received C_old,new. In any case, C_new cannot make unilateral decisions during this period.

Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C_old,new configuration entry in its log and commits it to C_old,new (a majority of C_old and a majority of C_new). Then it creates the C_new entry and commits it to a majority of C_new. There is no point in time in which C_old and C_new can both make decisions independently.

Once C_old,new has been committed, neither C_old nor C_new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C_old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C_new and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C_old and C_new can both make unilateral decisions; this guarantees safety.

There are three more issues to address for reconfiguration. First, if the leader is part of C_old but not part of C_new, it must eventually step down (return to follower state). In Raft the leader steps down immediately after committing a configuration entry that does not include itself. This means that there will be a period of time (while it is committing C_new) where the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader should not step down earlier, because members not in C_new could still be elected, resulting in unnecessary elections.

The second issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader will replicate log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

The third issue is that servers that are removed from the cluster may still disrupt the cluster’s availability. If these servers do not know that they have been removed, they can still start new elections. These elections cannot succeed, but they may cause servers in the new cluster to adopt larger term numbers, causing valid cluster leaders to step down. We are currently working on a solution to this problem.

7 Log compaction

In a practical system, the Raft log cannot grow without bound. As clients issue requests, the log grows longer, occupying more space and taking more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.

There are two basic approaches to compaction: log cleaning and snapshotting. Log cleaning [32] inspects log entries to determine whether they are live—whether they contribute to the current system state. Live entries are rewritten to the head of the log, then large consecutive regions of the log are freed. This process is incremental and efficient, but choosing which regions of the log to clean and determining which entries are live can be complex.

The second approach, snapshotting, operates on the current system state rather than on the log. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Compared to log cleaning, it is not incremental and less efficient (even information that has not changed since the last snapshot is rewritten). However, it is much simpler (for example, state machines need not track which log entries are live). Snapshotting is used in Chubby and ZooKeeper and is assumed for the remainder of this section.

11
This snapshotting approach departs from Raft’s strong
leader principle, since followers can take snapshots with-
out the knowledge of the leader. We considered an al-
ternative leader-based approach in which only the leader
would create a snapshot, then it would send this snapshot
to each of its followers. However, this has two disadvan-
tages. First, sending the snapshot to each follower would
waste network bandwidth and slow the snapshotting pro-
cess. Each follower already has the information needed
to produce its own snapshots, and it is typically much
Figure 12: A server replaces the committed entries in its log
cheaper for a server to produce a snapshot from its local
(indexes 1 through 5) with a new snapshot, which stores just state than it is to send and receive one over the network.
the current state (variables x and y in this example). The Second, the leader’s implementation would be more com-
snapshot’s last included index and term position the snapshot plex. For example, the leader would need to send snap-
in the log preceding entry 6. shots to followers in parallel with replicating new log en-
tries to them, so as not to block new client requests.
Figure 12 shows the basic idea of snapshotting in Raft.
Each server takes snapshots independently, covering just There are two more issues that impact snapshotting per-
the committed entries in its log. Most of the work con- formance. First, servers must decide when to snapshot. If
sists of the state machine writing its current state to the a server snapshots too often, it wastes disk bandwidth and
snapshot. Raft also includes a small amount of metadata energy; if it snapshots too infrequently, it risks exhaust-
in the snapshot: the last included index is the index of the ing its storage capacity, and it increases the time required
last entry in the log that the snapshot replaces (the last en- to replay the log during restarts. One simple strategy is
try the state machine had applied), and the last included to take a snapshot when the log reaches a fixed size in
term is the term of this entry. These are preserved to sup- bytes. If this size is set to be significantly larger than the
port the AppendEntries consistency check for the first log expected size of a snapshot, then the disk bandwidth over-
entry following the snapshot, since that entry needs a pre- head for snapshotting will be small.
vious log index and term. To enable cluster membership The second performance issue is that writing a snapshot
changes (Section 6), the snapshot also includes the latest can take a significant amount of time, and we do not want
configuration in the log as of last included index. Once a this to delay normal operations. The solution is to use
server completes writing a snapshot, it may delete all log copy-on-write techniques so that new updates can be ac-
entries up through the last included index, as well as any cepted without impacting the snapshot being written. For
prior snapshot. example, state machines built with functional data struc-
tures naturally support this. Alternatively, the operating
Although servers normally take snapshots indepen-
system’s copy-on-write support (e.g., fork on Linux) can
dently, the leader must occasionally send snapshots to fol-
be used to create an in-memory snapshot of the entire state
lowers that lag behind. This happens when the leader
machine (our implementation uses this approach).
has already discarded the next log entry that it needs to
send to a follower. Fortunately, this situation is unlikely 8 Client interaction
in normal operation: a follower that has kept up with the This section describes how clients interact with Raft,
leader would already have this entry. However, an excep- including finding the cluster leader and supporting lin-
tionally slow follower or a new server joining the cluster earizable semantics [9]. These issues apply to all
(Section 6) would not. The way to bring such a follower consensus-based systems, and solutions are typically han-
up-to-date is for the leader to send it a snapshot over the dled in similar ways.
network. Clients of Raft send all of their requests to the leader.
Our implementation uses a new RPC called Install- When a client first starts up, it connects to a randomly-
Snapshot for leaders to send snapshots to followers that chosen server. If the client’s first choice is not the leader,
are too far behind. Upon receiving a snapshot with this that server will reject the client’s request and supply in-
RPC, a follower must decide what to do with its existing formation about the most recent leader it has heard from
log entries. It must remove any log entries that conflict (AppendEntries requests include the network address of
with the snapshot (this is similar to the AppendEntries the leader). If the leader crashes, client requests will time
RPC). If the follower has an entry that matches the snap- out; clients then try again with randomly-chosen servers.
shot’s last included index and term, then there is no con- Our goal for Raft is to implement linearizable seman-
flict: it removes only the prefix of its log that the snapshot tics (each operation appears to execute instantaneously,
replaces. Otherwise, the follower removes its entire log; exactly once, at some point between its invocation and
it is all superseded by the snapshot. its response). However, as described so far Raft can exe-

12
cute a command multiple times: for example, if the leader 60
crashes after committing the log entry but before respond-
ing to the client, the client will retry the command with a 50

new leader, causing it to be executed a second time. The


40
solution is for clients to assign unique serial numbers to

Raft grade
every command. Then, the state machine tracks the latest 30
serial number processed for each client, along with the as-
sociated response. If it receives a command whose serial 20
number has already been executed, it responds immedi-
10
ately without re-executing the request. Raft then Paxos
Read-only operations can be made linearizable in sev- Paxos then Raft
0
eral ways. One approach is to serialize them into the log 0 10 20 30 40 50 60
just like other client requests, but this is relatively inef- Paxos grade

ficient and not strictly necessary. Raft handles read-only Figure 13: A scatter plot of 43 participants’ grades compar-
requests without involving the log, but it must take two ex- ing their performance on each exam. Points above the diago-
nal (33) represent participants who scored higher on the Raft
tra precautions to avoid returning stale information. First,
exam.
a leader must have the latest information on which entries
are committed. The Leader Completeness Property guar- ating Systems course at Stanford University and a Dis-
antees that a leader has all committed entries, but at the tributed Computing course at U.C. Berkeley. We recorded
start of its term, it may not know which those are. To a video lecture of Raft and another of Paxos, and created
find out, it needs to commit an entry from its term. Raft corresponding quizzes. The Raft lecture covered the con-
handles this by having each leader commit a blank no-op tent of this paper except for log compaction; the Paxos
entry into the log at the start of its term. Second, a leader lecture covered enough material to create an equivalent
must check whether it has been deposed before process- replicated state machine, including single-decree Paxos,
ing a read-only request (its information may be stale if a multi-decree Paxos, reconfiguration, and a few optimiza-
more recent leader has been elected). Raft handles this tions needed in practice (such as leader election). The
by having the leader exchange heartbeat messages with quizzes tested basic understanding of the algorithms and
a majority of the cluster before responding to read-only also required students to reason about corner cases. Each
requests. Alternatively, the leader could rely on the heart- student watched one video, took the corresponding quiz,
beat mechanism to provide a form of lease [8], but this watched the second video, and took the second quiz.
would rely on timing for safety (it assumes bounded clock About half of the participants did the Paxos portion first
skew). and the other half did the Raft portion first in order to
account for both individual differences in performance
9 Implementation and evaluation and experience gained from the first portion of the study.
We have implemented Raft as part of a replicated We compared participants’ scores on each quiz to deter-
state machine that stores configuration information for mine whether participants showed a better understanding
RAMCloud [29] and assists in failover of the RAMCloud of Raft.
coordinator. The Raft implementation contains roughly We tried to make the comparison between Paxos and
2000 lines of C++ code, not including tests, comments, Raft as fair as possible. The experiment favored Paxos in
or blank lines. The source code is freely available [22]. two cases: 15 of the 43 participants reported having some
There are also about 25 other open source implementa- prior experience with Paxos, and the Paxos video is 14%
tions [30] of Raft in various stages of development, based longer than the Raft video. As summarized in Table 1, we
on drafts of this paper. have taken steps to mitigate potential sources of bias. All
The remainder of this section evaluates Raft using three of our materials are available for review [27].
criteria: understandability, correctness, and performance. On average, participants scored 4.9 points higher on the
9.1 Understandability Raft quiz than on the Paxos quiz (out of a possible 60
To measure Raft’s understandability relative to Paxos, points, the mean Raft score was 25.7 and the mean Paxos
we conducted an experimental study using upper-level un- score was 20.8); Figure 13 shows their individual scores.
dergraduate and graduate students in an Advanced Oper- A paired t-test states that, with 95% confidence, the true

Concern Steps taken to mitigate bias Materials for review [27]


Equal lecture quality Same lecturer for both. Paxos lecture based on and improved from existing videos
materials used in several universities. Paxos lecture is 14% longer.
Equal quiz difficulty Questions grouped in difficulty and paired across exams. quizzes
Fair grading Used rubric. Graded in random order, alternating between quizzes. rubric
Table 1: Concerns of possible bias against Paxos in the study, steps taken to counter each, and additional materials available.
13
100%
20

cumulative percent
number of participants

80%
15 Paxos much easier 60%
Paxos somewhat easier 150-150ms
10 Roughly equal 40% 150-151ms
Raft somewhat easier 150-155ms
Raft much easier 150-175ms
5 20%
150-200ms
150-300ms
0%
0
implement explain 100 1000 10000 100000
100%
Figure 14: Using a 5-point scale, participants were asked

cumulative percent
(left) which algorithm they felt would be easier to implement 80%
in a functioning, correct, and efficient system, and (right) 60%
which would be easier to explain to a CS graduate student. 12-24ms
40% 25-50ms
distribution of Raft scores has a mean at least 2.5 points 50-100ms
20% 100-200ms
larger than the true distribution of Paxos scores. Account- 0%
150-300ms
ing for whether people learn Paxos or Raft first and prior 0 100 200 300 400 500 600
experience with Paxos, a linear regression model predicts time without leader (ms)
scores 11.0 points higher on the Raft exam than on the Figure 15: The time to detect and replace a crashed leader.
Paxos exam (prior Paxos experience helps Paxos signifi- The top graph varies the amount of randomness in election
cantly and helps Raft slightly less). Curiously, the model timeouts, and the bottom graph scales the minimum election
also predicts scores 6.3 points lower on Raft for people timeout. Each line represents 1000 trials (except for 100 tri-
als for “150-150ms”) and corresponds to a particular choice
that have already taken the Paxos quiz; although we don’t
of election timeouts; for example, “150-155ms” means that
know why, this does appear to be statistically significant.
election timeouts were chosen randomly and uniformly be-
We also surveyed participants after their quizzes to see tween 150ms and 155ms. The measurements were taken on
which algorithm they felt would be easier to implement a cluster of 5 servers with a broadcast time of roughly 15ms.
or explain; these results are shown in Figure 14. An over- Results for a cluster of 9 servers are similar.
whelming majority of participants reported Raft would be of messages (a single round-trip from the leader to half the
easier to implement and explain (33 of 41 for each ques- cluster). It is also possible to further improve Raft’s per-
tion). However, these self-reported feelings may be less formance. For example, it easily supports batching and
reliable than participants’ quiz scores, and participants pipelining requests for higher throughput and lower la-
may have been biased by knowledge of our hypothesis tency. Various optimizations have been proposed in the
that Raft is easier to understand. literature for other algorithms; many of these could be ap-
9.2 Correctness plied to Raft, but we leave this to future work.
We have developed a formal specification and a proof We used our Raft implementation to measure the per-
of safety for the consensus mechanism described in Sec- formance of Raft’s leader election algorithm and answer
tion 5. The formal specification [33] makes the informa- two questions. First, does the election process converge
tion summarized in Figure 2 completely precise using the quickly? Second, what is the minimum downtime that
TLA+ specification language [16]. It is about 400 lines can be achieved after leader crashes?
long and serves as the subject of the proof. It is also useful To measure leader election, we repeatedly crashed the
on its own for anyone implementing Raft. We have me- leader of a cluster of 5 servers and timed how long it took
chanically proven the Log Completeness Property using to detect the crash and elect a new leader (see Figure 15).
the TLA proof system [6]. However, this proof relies on To generate a worst-case scenario, the servers in each trial
invariants that have not been mechanically checked (for had different log lengths, so some candidates were not el-
example, we have not proven the type safety of the speci- igible to become leader. Furthermore, to encourage split
fication). votes, our test script triggered a synchronized broadcast of
heartbeat RPCs from the leader before terminating its pro-
Furthermore, we have written an informal proof [33] of
cess (this approximates the behavior of the leader repli-
the State Machine Safety property which is complete (it
cating a new log entry prior to crashing). The leader was
relies on the specification alone) and relatively precise (it
crashed uniformly randomly within its heartbeat interval,
is about 9 pages or 3500 words long).
which was half of the minimum election timeout for all
9.3 Performance tests. Thus, the smallest possible downtime was about
Raft’s performance is similar to other consensus algo- half of the minimum election timeout.
rithms such as Paxos. The most important case for per- The top graph in Figure 15 shows that a small amount
formance is when an established leader is replicating new of randomization in the election timeout is enough to
log entries. Raft achieves this using the minimal number avoid split votes in elections. In the absence of random-

14
ness, leader election consistently took longer than 10 sec- for leader election. In contrast, Raft incorporates leader
onds in our tests due to many split votes. Adding just 5ms election directly into the consensus algorithm and uses it
of randomness helps significantly, resulting in a median as the first of the two phases of consensus. This results in
downtime of 287ms. Using more randomness improves less mechanism than in Paxos.
worst-case behavior: with 50ms of randomness the worst- Raft also has less mechanism than VR or ZooKeeper,
case completion time (over 1000 trials) was 513ms. even though both of those systems are also leader-based.
The bottom graph in Figure 15 shows that downtime The reason for this is that Raft minimizes the functionality
can be reduced by reducing the election timeout. With in non-leaders. For example, in Raft, log entries flow in
an election timeout of 12-24ms, it takes only 35ms on only one direction: outward from the leader in Append-
average to elect a leader (the longest trial took 152ms). Entries RPCs. In VR log entries flow in both directions
However, lowering the timeouts beyond this point violates (leaders can receive log entries during the election pro-
Raft’s timing requirement: leaders have difficulty broad- cess); this results in additional mechanism and complex-
casting heartbeats before other servers start new elections. ity. The published description of ZooKeeper also transfers
This can cause unnecessary leader changes and lower log entries both to and from the leader, but the implemen-
overall system availability. We recommend using a con- tation is apparently more like Raft [31]. Raft has fewer
servative election timeout such as 150-300ms; such time- message types than any other algorithm for consensus-
outs are unlikely to cause unnecessary leader changes and based log replication that we are aware of.
will still provide good availability. Several different approaches for cluster member-
ship changes have been proposed or implemented in
10 Related work other work, including Lamport’s original proposal [14],
There have been numerous publications related to con- VR [21], and SMART [23]. We chose the joint consensus
sensus algorithms, many of which fall into one of the fol- approach for Raft because it leverages the rest of the con-
lowing categories: sensus protocol, so that very little additional mechanism
• Lamport’s original description of Paxos [14], and at- is required for membership changes. Lamport’s α-based
tempts to explain it more clearly [15, 19, 20]. approach was not an option for Raft because it assumes
• Elaborations of Paxos, which fill in missing details consensus can be reached without a leader. In comparison
and modify the algorithm to provide a better founda- to VR and SMART, Raft’s reconfiguration algorithm has
tion for implementation [25, 36, 12]. the advantage that membership changes can occur with-
• Systems that implement consensus algorithms, such out limiting the processing of normal requests; in con-
as Chubby [2, 4], ZooKeeper [10, 11], and Span- trast, VR must stop all normal processing during config-
ner [5]. The algorithms for Chubby and Spanner uration changes, and SMART imposes an α-like limit on
have not been published in detail, though both claim the number of outstanding requests. Raft’s approach also
to be based on Paxos. ZooKeeper’s algorithm has adds less mechanism than either VR or SMART.
been published in more detail, but it is quite differ-
ent from Paxos. 11 Conclusion
• Performance optimizations that can be applied to Algorithms are often designed with correctness, effi-
Paxos [17, 18, 3, 24, 1, 26]. ciency, and/or conciseness as the primary goals. Although
• Oki and Liskov’s Viewstamped Replication (VR), an these are all worthy goals, we believe that understandabil-
alternative approach to consensus developed around ity is just as important. None of the other goals can be
the same time as Paxos. The original description [28] achieved until developers render the algorithm into a prac-
was intertwined with a protocol for distributed trans- tical implementation, which will inevitably deviate from
actions, but the core consensus protocol has been and expand upon the published form. Unless developers
separated in a recent update [21]. VR uses a leader- have a deep understanding of the algorithm and can create
based approach with many similarities to Raft. intuitions about it, it will be difficult for them to retain its
The greatest difference between Raft and other con- desirable properties in their implementation.
sensus algorithms is Raft’s strong leadership: Raft uses In this paper we addressed the issue of distributed con-
leader election as an essential part of the consensus proto- sensus, where a widely accepted but impenetrable algo-
col, and it concentrates as much functionality as possible rithm, Paxos, has challenged students and developers for
in the leader. This approach results in a simpler algorithm many years. We developed a new algorithm, Raft, which
that is easier to understand. For example, in Paxos, leader we have shown to be more understandable than Paxos.
election is orthogonal to the basic consensus protocol: it We also believe that Raft provides a better foundation for
serves only as a performance optimization and is not re- system building. Furthermore, it achieves these benefits
quired for achieving consensus. However, this results in without sacrificing efficiency or correctness. Using un-
additional mechanism: Paxos includes both a two-phase derstandability as the primary design goal changed the
protocol for basic consensus and a separate mechanism way we approached the design of Raft; as the design pro-

15
gressed we found ourselves reusing a few techniques re- [5] C ORBETT, J. C., D EAN , J., E PSTEIN , M., F IKES , A., F ROST,
peatedly, such as decomposing the problem and simplify- C., F URMAN , J. J., G HEMAWAT, S., G UBAREV, A., H EISER ,
C., H OCHSCHILD , P., H SIEH , W., K ANTHAK , S., K OGAN , E.,
ing the state space. These techniques not only improved L I , H., L LOYD , A., M ELNIK , S., M WAURA , D., N AGLE , D.,
the understandability of Raft but also made it easier to Q UINLAN , S., R AO , R., ROLIG , L., S AITO , Y., S ZYMANIAK ,
convince ourselves of its correctness. M., TAYLOR , C., WANG , R., AND W OODFORD , D. Spanner:
Google’s globally-distributed database. In Proceedings of the 10th
12 Acknowledgments USENIX conference on Operating Systems Design and Implemen-
tation (Berkeley, CA, USA, 2012), OSDI’12, USENIX Associa-
The user study would not have been possible with- tion, pp. 251–264.
out the support of Ali Ghodsi, David Mazières, and the
[6] C OUSINEAU , D., D OLIGEZ , D., L AMPORT, L., M ERZ , S.,
students of CS 294-91 at Berkeley and CS 240 at Stan- R ICKETTS , D., AND VANZETTO , H. TLA+ proofs. In FM
ford. Scott Klemmer helped us design the user study, (2012), D. Giannakopoulou and D. Méry, Eds., vol. 7436 of Lec-
and Nelson Ray advised us on statistical analysis. The ture Notes in Computer Science, Springer, pp. 147–154.
Paxos slides for the user study borrowed heavily from [7] G HEMAWAT, S., G OBIOFF , H., AND L EUNG , S.-T. The google
a slide deck originally created by Lorenzo Alvisi. A file system. In Proceedings of the nineteenth ACM symposium on
Operating systems principles (New York, NY, USA, 2003), SOSP
special thanks goes to David Mazières for finding the ’03, ACM, pp. 29–43.
last (we hope!) and most subtle bug in Raft. Many
[8] G RAY, C., AND C HERITON , D. Leases: An efficient fault-tolerant
people provided helpful feedback on the paper and user mechanism for distributed file cache consistency. In Proceedings
study materials, including Ed Bugnion, Michael Chan, of the 12th ACM Ssymposium on Operating Systems Principles
Daniel Giffin, Arjun Gopalan, Jon Howell, Vimalkumar (1989), pp. 202–210.
Jeyakumar, Ankita Kejriwal, Aleksandar Kracun, Amit [9] H ERLIHY, M. P., AND W ING , J. M. Linearizability: a correct-
Levy, Joel Martin, Satoshi Matsushita, Oleg Pesok, David ness condition for concurrent objects. ACM Trans. Program. Lang.
Syst. 12 (July 1990), 463–492.
Ramos, Robbert van Renesse, Mendel Rosenblum, Nico-
las Schiper, Deian Stefan, Andrew Stone, Ryan Stutsman, [10] H UNT, P., K ONAR , M., J UNQUEIRA , F. P., AND R EED , B.
Zookeeper: wait-free coordination for internet-scale systems. In
David Terei, Stephen Yang, Matei Zaharia, and anony- Proceedings of the 2010 USENIX annual technical conference
mous conference reviewers. Werner Vogels tweeted a (Berkeley, CA, USA, 2010), USENIX ATC ’10, USENIX Asso-
link to an earlier draft, which gave Raft significant ex- ciation, pp. 11–11.
posure. This work was supported by the Gigascale Sys- [11] J UNQUEIRA , F. P., R EED , B. C., AND S ERAFINI , M. Zab: High-
tems Research Center and the Multiscale Systems Cen- performance broadcast for primary-backup systems. In Proceed-
ings of the 2011 IEEE/IFIP 41st International Conference on De-
ter, two of six research centers funded under the Fo- pendable Systems&Networks (Washington, DC, USA, 2011), DSN
cus Center Research Program, a Semiconductor Research ’11, IEEE Computer Society, pp. 245–256.
Corporation program, by STARnet, a Semiconductor Re- [12] K IRSCH , J., AND A MIR , Y. Paxos for system builders, 2008.
search Corporation program sponsored by MARCO and [13] L AMPORT, L. Time, clocks, and the ordering of events in a dis-
DARPA, by the National Science Foundation under Grant tributed system. Commun. ACM 21, 7 (July 1978), 558–565.
No. 0963859, and by grants from Facebook, Google, Mel- [14] L AMPORT, L. The part-time parliament. ACM Trans. Comput.
lanox, NEC, NetApp, SAP, and Samsung. Diego Ongaro Syst. 16, 2 (May 1998), 133–169.
is supported by The Junglee Corporation Stanford Gradu- [15] L AMPORT, L. Paxos made simple. ACM SIGACT News 32, 4
ate Fellowship. (Dec. 2001), 18–25.
[16] L AMPORT, L. Specifying Systems, The TLA+ Language and Tools
References for Hardware and Software Engineers. Addison-Wesley, 2002.
[1] B OLOSKY, W. J., B RADSHAW, D., H AAGENS , R. B., K USTERS ,
[17] L AMPORT, L. Generalized consensus and paxos.
N. P., AND L I , P. Paxos replicated state machines as the ba-
http://research.microsoft.com/apps/pubs/
sis of a high-performance data store. In Proceedings of the 8th
default.aspx?id=64631, 2005.
USENIX conference on Networked systems design and implemen-
tation (Berkeley, CA, USA, 2011), NSDI’11, USENIX Associa- [18] L AMPORT, L. Fast paxos. http://research.microsoft.
tion, pp. 11–11. com/apps/pubs/default.aspx?id=64624, 2006.

[2] B URROWS , M. The chubby lock service for loosely-coupled dis- [19] L AMPSON , B. W. How to build a highly available system
tributed systems. In Proceedings of the 7th symposium on Op- using consensus. In Distributed Algorithms, O. Baboaglu and
erating systems design and implementation (Berkeley, CA, USA, K. Marzullo, Eds. Springer-Verlag, 1996, pp. 1–17.
2006), OSDI ’06, USENIX Association, pp. 335–350. [20] L AMPSON , B. W. The abcd’s of paxos. In Proceedings of the 20th
ACM Symposium on Principles of Distributed Computing (New
[3] C AMARGOS , L. J., S CHMIDT, R. M., AND P EDONE , F. Multico-
York, NY, USA, 2001), PODC 2001, ACM, pp. 13–13.
ordinated paxos. In Proceedings of the twenty-sixth annual ACM
symposium on Principles of distributed computing (New York, NY, [21] L ISKOV, B., AND C OWLING , J. Viewstamped replication revis-
USA, 2007), PODC ’07, ACM, pp. 316–317. ited. Tech. Rep. MIT-CSAIL-TR-2012-021, MIT, July 2012.

[4] C HANDRA , T. D., G RIESEMER , R., AND R EDSTONE , J. Paxos [22] LogCabin source code. http://github.com/logcabin/
made live: an engineering perspective. In Proceedings of the logcabin.
twenty-sixth annual ACM symposium on Principles of distributed [23] L ORCH , J. R., A DYA , A., B OLOSKY, W. J., C HAIKEN , R.,
computing (New York, NY, USA, 2007), PODC ’07, ACM, D OUCEUR , J. R., AND H OWELL , J. The smart way to mi-
pp. 398–407. grate replicated stateful services. In Proceedings of the 1st

16
ACM SIGOPS/EuroSys European Conference on Computer Sys-
tems 2006 (New York, NY, USA, 2006), EuroSys ’06, ACM,
pp. 103–115.
[24] M AO , Y., J UNQUEIRA , F. P., AND M ARZULLO , K. Mencius:
building efficient replicated state machines for wans. In Pro-
ceedings of the 8th USENIX conference on Operating systems de-
sign and implementation (Berkeley, CA, USA, 2008), OSDI’08,
USENIX Association, pp. 369–384.
[25] M AZI ÈRES , D. Paxos made practical. Jan. 2007.
[26] M ORARU , I., A NDERSEN , D. G., AND K AMINSKY, M. There
is more consensus in egalitarian parliaments. In Proceedings of
the 24th ACM Symposium on Operating System Principles (New
York, NY, USA, 2013), SOSP 2013, ACM.
[27] Raft user study. http://ramcloud.stanford.edu/
˜ongaro/userstudy/.
[28] O KI , B. M., AND L ISKOV, B. H. Viewstamped replication: A
new primary copy method to support highly-available distributed
systems. In Proceedings of the seventh annual ACM Symposium on
Principles of distributed computing (New York, NY, USA, 1988),
PODC ’88, ACM, pp. 8–17.
[29] O USTERHOUT, J., A GRAWAL , P., E RICKSON , D., K OZYRAKIS ,
C., L EVERICH , J., M AZI ÈRES , D., M ITRA , S., N ARAYANAN ,
A., O NGARO , D., PARULKAR , G., ROSENBLUM , M., RUM -
BLE , S. M., S TRATMANN , E., AND S TUTSMAN , R. The case
for ramcloud. Commun. ACM 54 (July 2011), 121–130.
[30] Raft implementations. https://ramcloud.stanford.
edu/wiki/display/logcabin/LogCabin.
[31] R EED , B. Personal communications, May 17, 2013.
[32] ROSENBLUM , M., AND O USTERHOUT, J. K. The design and im-
plementation of a log-structured file system. ACM Trans. Comput.
Syst. 10 (February 1992), 26–52.
[33] Safety proof and formal specification for Raft.
http://ramcloud.stanford.edu/˜ongaro/
raftproof.pdf.
[34] S CHNEIDER , F. B. Implementing fault-tolerant services using the
state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (Dec.
1990), 299–319.
[35] S HVACHKO , K., K UANG , H., R ADIA , S., AND C HANSLER , R.
The hadoop distributed file system. In Proceedings of the 2010
IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST) (Washington, DC, USA, 2010), MSST ’10, IEEE Com-
puter Society, pp. 1–10.
[36] VAN R ENESSE , R. Paxos made moderately complex. Tech. rep.,
Cornell University, 2012.

17

You might also like