Unit 4 BCT
Introduction
The Raft protocol was developed by Diego Ongaro and John Ousterhout (Stanford University) and earned Diego his Ph.D. in 2014 (the link to the paper is in the References section at the end of the article). Raft was designed for better understandability of how consensus (we will explain what consensus is in a moment) can be achieved, considering that its predecessor, the Paxos algorithm, developed by Leslie Lamport, is very difficult to understand and implement. Hence the title of Diego's paper, 'In Search of an Understandable Consensus Algorithm'. Before Raft, Paxos was considered the holy grail of achieving consensus.
Let's start.
Consensus
So, to understand Raft, we shall first have a look at the problem the Raft protocol tries to solve, and that is achieving consensus. Consensus means multiple servers agreeing on the same information, something imperative for designing fault-tolerant distributed systems. Let's first define the basic interaction between a client and a server.
Process: The client sends a request to the server and the server responds with a reply.
How consensus is achieved depends on the kind of system the client interacts with:
Single server system: The client interacts with a system having only one server, with no backup. There is no problem in achieving consensus in such a system.
Multiple server system: The client interacts with a system having multiple servers. Such systems can be of two types:
Symmetric: Any of the multiple servers can respond to the client, and all the other servers are supposed to sync up with the server that responded to the client's request, and
Asymmetric: Only the elected leader server can respond to the client. All other servers then sync up with the leader server.
Such a system, in which all the servers replicate (or maintain) the same data (shared state) over time, can for now be referred to as a replicated state machine.
We shall now define some terms used to refer to individual servers in a distributed system.
Leader – Only the server elected as leader can interact with the client. All other servers sync themselves up with the leader. At any point of time, there can be at most one leader (possibly 0, which we shall explain later).
Follower – Follower servers sync up their copy of the data with the leader's at regular time intervals. When the leader server goes down (for any reason), one of the followers can contest an election and become the leader.
Candidate – At the time of contesting an election to choose the leader server, servers can ask other servers for votes. Hence, they are called candidates while they have requested votes. When servers start up, they begin in the Follower state. A minimal sketch of these roles follows below.
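The snippet below is a minimal sketch of the three roles and the transitions between them; the names Role and TRANSITIONS are illustrative assumptions, not part of any particular Raft implementation.

from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"    # default state at startup
    CANDIDATE = "candidate"  # requesting votes in an election
    LEADER = "leader"        # at most one per term

# Legal transitions in Raft:
#   Follower  -> Candidate  (election timeout, no heartbeat from the Leader)
#   Candidate -> Leader     (wins a majority of votes)
#   Candidate -> Follower   (discovers a Leader or a higher term)
#   Leader    -> Follower   (discovers a higher term)
TRANSITIONS = {
    Role.FOLLOWER: {Role.CANDIDATE},
    Role.CANDIDATE: {Role.LEADER, Role.FOLLOWER},
    Role.LEADER: {Role.FOLLOWER},
}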
CAP Theorem
The CAP theorem states that a distributed database system can provide only 2 of the following 3 guarantees:
Consistency – The data is the same across all the server nodes (leader or follower), implying the system has near-instantaneous sync capabilities,
Availability – Every request gets a response (success/failure). This requires the system to be operational 100% of the time to serve requests, and
Partition tolerance – The system continues to respond even when messages between some of the server nodes are lost or delayed (a network partition), or some nodes fail.
Under normal conditions, a node stays in one of the three states described above (Leader, Follower, or Candidate). Only a leader can interact with the client; any request to a follower node is redirected to the leader node. A candidate can ask for votes to become the leader. A follower only responds to candidate(s) or the leader.
To maintain these server states, the Raft algorithm divides time into small terms of arbitrary length. Each term is identified by a monotonically increasing number, called the term number.
Term number
This term number is maintained by every node and is exchanged during communication between nodes. Every term starts with an election to determine the new leader. The candidates ask for votes from other server nodes (followers) in order to gather a majority. If a majority is gathered, the candidate becomes the leader for the current term. If no majority is established, the situation is called a split vote and the term ends with no leader. Hence, a term can have at most one leader.
Purpose of maintaining the term number
The following tasks are executed by observing the term number of each node:
Servers update their term number if it is less than the term numbers of other servers in the cluster. This means that when a new term starts, the term numbers are tallied with the leader or the candidate and updated to match the latest one (the leader's).
A Candidate or Leader demotes itself to the Follower state if its term number is out of date (less than another's). Note that discovering a higher term number does not make a server the Leader; the server simply adopts the newer term and steps down to Follower.
As we said earlier, the term numbers of the servers are also communicated, so if a request arrives with a stale term number, the request is rejected. This basically means that a server node will not accept requests from a server with a lower term number. These rules are sketched below.
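Below is a minimal sketch of these term-number rules; the node object and its fields (current_term, role) are illustrative assumptions rather than a specific library's API.

def on_message(node, sender_term):
    # Rule 1: adopt a newer term, and step down if Leader or Candidate.
    if sender_term > node.current_term:
        node.current_term = sender_term
        node.role = "follower"
        return True                 # then process the request normally
    # Rule 3: reject any request carrying a stale (lower) term number.
    if sender_term < node.current_term:
        return False
    return True                     # same term: process normally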
The Raft algorithm uses two types of Remote Procedure Calls (RPCs) to carry out its functions:
RequestVote RPCs are sent by Candidate nodes to gather votes during an election, and
AppendEntries RPCs are used by the Leader node for replicating log entries and also as a heartbeat mechanism to check whether a server is still up: if the heartbeat is responded to, the server is up; otherwise, the server is down. Note that heartbeats do not contain any log entries. Illustrative shapes of these two messages are sketched below.
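The field names below follow the Raft paper's description of the two RPCs; the Python dataclass form itself is just a sketch, not a wire format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RequestVote:
    term: int            # candidate's term
    candidate_id: int    # candidate requesting the vote
    last_log_index: int  # index of the candidate's last log entry
    last_log_term: int   # term of the candidate's last log entry

@dataclass
class AppendEntries:
    term: int            # leader's term
    leader_id: int       # so followers can redirect clients
    prev_log_index: int  # index of the entry immediately preceding new ones
    prev_log_term: int   # term of that entry (consistency check)
    entries: List = field(default_factory=list)  # empty for a heartbeat
    leader_commit: int = 0                       # leader's commit index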
Now, let's have a look at the process of leader election.
Leader election
In order to maintain its authority as Leader of the cluster, the Leader node sends heartbeats to the other Follower nodes. A leader election takes place when a Follower node times out while waiting for a heartbeat from the Leader node. At this point, the timed-out node changes its state to the Candidate state, votes for itself, and issues RequestVote RPCs to establish a majority and attempt to become the Leader. The election can go the following three ways:
The Candidate node becomes the Leader by receiving a majority of votes from the cluster nodes. It then updates its status to Leader and starts sending heartbeats to notify the other servers of the new Leader.
The Candidate node fails to receive a majority of votes in the election, and hence the term ends with no Leader. The Candidate node returns to the Follower state.
While waiting for votes, the Candidate node may receive an AppendEntries RPC from another node claiming to be Leader. If that node's term number is at least as large as the Candidate's own, the Candidate recognizes the new Leader and returns to the Follower state; otherwise, the AppendEntries RPC is rejected and the node retains its Candidate status.
[Figure: Raft leader election]
The following excerpt from the Raft paper (linked in the References below) explains a significant aspect of server timeouts:
Raft uses randomized election timeouts to ensure that split votes are rare and that they are
resolved quickly. To prevent split votes in the first place, election timeouts are chosen
randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most
cases only a single server will time out; it wins the election and sends heartbeats before any
other servers time out. The same mechanism is used to handle split votes. Each candidate
restarts its randomized election timeout at the start of an election, and it waits for that
timeout to elapse before starting the next election; this reduces the likelihood of another split
vote in the new election.
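A minimal sketch of this randomized timeout, using the paper's example interval of 150-300 ms; the constant names are illustrative.

import random

ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def new_election_timeout():
    # Each server picks a fresh random timeout, so in most cases a single
    # server times out first, wins the election, and sends heartbeats
    # before any other server times out.
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)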
Log Replication
For the sake of simplicity while explaining to a beginner-level audience, we will restrict our scope to the client making only write requests. Each request made by the client is stored in the log of the Leader. This log is then replicated to the other nodes (Followers). Typically, a log entry contains the following three pieces of information: the command requested by the client, the index identifying the entry's position in the log, and the term number in which the entry was created.
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own.
This means that conflicting entries in follower logs will be overwritten with entries from the
leader’s log.
The Leader node looks for the last index number at which its log and the Follower's log match, and then overwrites any extra entries beyond that point (index number) with the new entries supplied by the Leader. This matches the Follower's log with the Leader's. The Leader retries the AppendEntries RPC iteratively with reduced index numbers until a match is found; when the match is found, the RPC succeeds. A sketch of this repair loop follows below.
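Here is a minimal sketch of that repair loop, with logs modeled as Python lists of entries; the names (leader_log, follower_log, next_index) are illustrative assumptions.

def replicate(leader_log, follower_log, next_index):
    # Walk next_index backwards until leader and follower agree on the
    # entry just before it (the last matched index).
    while next_index > 0:
        prev = next_index - 1
        if prev < len(follower_log) and follower_log[prev] == leader_log[prev]:
            break              # consistency check passed: match found
        next_index -= 1        # AppendEntries rejected: retry one entry earlier
    # Overwrite the follower's conflicting suffix with the leader's entries,
    # forcing the follower's log to duplicate the leader's.
    del follower_log[next_index:]
    follower_log.extend(leader_log[next_index:])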
Safety
In order to maintain consistency across the set of server nodes, the Raft consensus algorithm ensures that the leader has all the entries committed in previous terms in its own log.
During a leader election, the RequestVote RPC also contains information about the candidate's log (the index and term of its last entry) to figure out which log is the latest. If the candidate requesting the vote has less up-to-date data than the Follower from which it is requesting the vote, the Follower simply doesn't vote for that candidate. The following excerpt from the original Raft paper states this precisely:
Raft determines which of two logs is more up-to-date by comparing the index and term of the
last entries in the logs. If the logs have last entries with different terms, then the log with the
later term is more up-to-date. If the logs end with the same term, then whichever log is longer
is more up-to-date.
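That comparison can be written as a small predicate; a sketch under the assumption that each side is summarized by the term and index of its last log entry:

def candidate_is_up_to_date(cand_last_term, cand_last_index,
                            my_last_term, my_last_index):
    # Different last terms: the log whose last entry has the later term wins.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    # Same last term: the longer log is more up-to-date.
    return cand_last_index >= my_last_index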
Advantages
The Raft protocol is designed to be easily understandable, considering that the most popular way to achieve consensus on distributed systems until then was the Paxos algorithm, which is very hard to understand and implement. Anyone with basic knowledge and common sense can understand major parts of the protocol and the research paper published by Diego Ongaro and John Ousterhout.
It is comparatively easier to implement than other alternatives, primarily Paxos, because of a more targeted use-case segment and its assumptions about the distributed system. Many open-source implementations of Raft are available on the internet; some are in Go, C++, and Java.
The Raft protocol has been decomposed into smaller subproblems which can be tackled relatively independently, for better understanding, implementation, debugging, and optimizing performance for a more specific use case.
The distributed system following the Raft consensus protocol will remain operational even when a minority of the servers fail. For example, if we have a 5-node cluster and 2 nodes fail, the system can still operate.
The leader election mechanism employed in Raft is designed so that, thanks to randomized election timeouts, one node will almost always gain a majority of votes within a term or two.
Raft employs RPCs (remote procedure calls) to request votes and to sync up the cluster (using AppendEntries), which keeps the communication pattern simple: followers talk only to the leader (or to candidates during elections), never to each other.
Raft was designed relatively recently, so it benefits from design lessons that were not yet understood at the time of the formulation of Paxos and similar protocols.
Any node in the cluster can become the leader, so it has a certain degree of fairness.
Many different open-source implementations for different use cases are already out there on GitHub and related places.
Companies like MongoDB and HashiCorp are using it.
Limitations and Alternatives
Raft is a strictly single-Leader protocol, so too much traffic can choke the system. Some variants of the Paxos algorithm exist that address this bottleneck.
A lot of assumptions are taken to be in force, such as the non-occurrence of Byzantine failures, which somewhat reduces the real-life applicability.
Raft is a more specialized approach towards a subset of the problems which arise in achieving consensus.
Cheap Paxos (a variant of Paxos) can work even when there is only one node functioning in the server cluster. To generalize, K+1 replicated servers can tolerate faults in, or shutdown of, K servers.
Permissioned Blockchain – Raft Consensus
The idea behind the Raft consensus algorithm is that the nodes (i.e., server computers) collectively select a leader, and the remaining nodes become the followers. The leader is responsible for replicating the state-transition log across the followers in a closed distributed environment, assuming that all the nodes are trustworthy and have no malicious intent.
The basic idea of Raft came from the observation that, in a distributed environment, we can come to a consensus based on the Paxos algorithm and elect a leader; and once we have a leader in the system, we can avoid multiple proposers proposing values concurrently altogether.
In the case of Paxos, we don't have any straightforward mechanism to elect a leader. Instead, multiple proposers may propose values simultaneously. Consequently, the protocol becomes complex: the acceptors have to accept one of the proposals from the proposers, the highest proposal number is used as the tie-breaking mechanism, and an algorithm is embedded in Paxos to ensure that every proposal coming from a different proposer is unique. All these internal details make Paxos more complicated.
The system starts up with a set of follower nodes. The follower nodes look for a leader; if a timeout happens, there is no leader, and one needs to be elected. A few candidates stand for leader in the election process, and the remaining nodes vote for a candidate. The candidate who receives the majority of votes becomes the leader. The leader then proposes a proposal, and the followers can vote either for or against that proposal.
[Figure: Raft consensus algorithm]
An example from database replication: we have multiple distributed replicated servers, and we want to build consensus among them. Whenever transactions come in from clients, we want these replicated servers to collectively decide whether to commit those transactions.
The first part of Raft is to elect a leader, and for that there should be some leader candidates. The nodes sense the network, and if there is no leader, then one of the nodes will announce that it wants to be a leader. The leader candidate requests votes. This voting request contains 2 parameters: the term and the index.
These algorithms work in multiple rounds, and the term indicates a new voting round. When the last voting round finishes, the next term will be the old term number + 1. The index indicates the committed transactions available to the candidate; it is just an increasing number used to distinguish between already committed and new transactions.
Once the nodes receive a voting request, their task is to vote for or against the candidate. This is the mechanism used to elect a leader in the Raft consensus algorithm. Each node compares the received term and index with the corresponding currently known values:
The node receives the voting request and compares the already-seen term with the newly received term. If the newly received term is not greater than the already-seen term, it discards the request, because the node considers it an old request.
If the newly received term is greater than the already-seen term, the node checks the newly received index number against the already-seen index number. If the newly received index number is at least as large as the already-seen one, it votes for the candidate; otherwise, it declines. This decision rule is sketched below.
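A minimal sketch of this voting rule; the variable names are illustrative, and the index comparison follows the 'at least as large' form above.

def decide_vote(seen_term, seen_index, recv_term, recv_index):
    if recv_term <= seen_term:
        return False    # stale round: discard as an old request
    if recv_index >= seen_index:
        return True     # candidate's log is at least as complete: vote for it
    return False        # candidate is missing committed entries: decline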
Electing the Leader: Majority Voting
Every node sends its vote, and the candidate who gets the majority of votes becomes the leader and commits the corresponding log entry. In other words, if a certain leader candidate receives the majority of the votes from the nodes, then that particular candidate becomes the leader and the other nodes become followers of that node, as the check sketched below illustrates.
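A sketch of the majority check itself; a strict majority of the cluster (including the candidate's own vote) is required.

def has_majority(votes_received, cluster_size):
    return votes_received > cluster_size // 2

# Example: in a 5-node cluster, 3 votes win; 2 do not.
assert has_majority(3, 5) and not has_majority(2, 5)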
Multiple Leader Candidates: Current Leader Failure
Let us understand a scenario where there is a leader and three followers, the current term is 10, and the commit index value is 100. Suppose the leader node has failed, or the followers didn't receive a heartbeat message within the heartbeat timeout period.
After the timeout, one of the nodes becomes a leader candidate, initiates a leader election, and becomes the new leader with term 11 and commit index value 100. The new leader periodically sends heartbeat messages to everyone to indicate its presence.
In the meantime, the old leader recovers and also receives a heartbeat message from the new leader. The old leader understands that a new term has started, and changes its status from leader to follower. So this is one way of handling a change of leader, by utilizing the term parameter; a sketch follows below.
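A minimal sketch of that step-down, under the assumption that each node tracks current_term and role as plain fields:

def on_heartbeat(node, heartbeat_term):
    # e.g. the recovered old leader (term 10) hears the new leader's term 11
    if heartbeat_term > node.current_term:
        node.current_term = heartbeat_term  # a new term has started
        node.role = "follower"              # the old leader demotes itself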
Now let us understand a scenario where there is a leader and three followers, the current term is 20, and the commit index value is 200. Suppose the leader node has failed, or the followers didn't receive a heartbeat message within the heartbeat timeout period. It may be possible that multiple followers sense the timeout simultaneously, become leader candidates, and initiate the leader election procedure independently. Two nodes send the request messages with term 21 at the same time in the network.
There are two leader candidates, and both are sending voting request messages, at the same time, for round (term) 21. Then they look for the majority vote. In this example, the first candidate receives two votes and the second candidate receives one vote, so based on majority voting, the first candidate is the winner.
The node which gets the majority of votes sends a heartbeat message to everyone. The other leader candidate also receives the heartbeat message from the winner, and falls back from leader candidate to follower.
Committing Entry Log
In the sections above, we have seen the procedure to elect a leader and the related special cases. Now we will understand how transactions are managed in a closed distributed environment. Let us consider that the current term value is 10 and the index value is 100, which means most of the nodes have seen and committed the transaction with index value 100.
The leader proposes a new transaction by adding a log entry with term 10 and the new transaction index value 101. Further, the leader sends a message called AppendEntries to all the followers, and they collectively vote either for or against this transaction.
The leader receives the votes for this transaction with index value 101. The follower nodes vote for or against the transaction, and if the majority say that they are fine with committing this particular log entry, the leader considers the transaction log approved by the followers.
After successful acceptance of the log entry, the leader sends an accept message, based on the majority vote, to all the individual followers to update the committed index to 101. The whole flow is sketched below.
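A minimal sketch of this commit flow; the leader and follower objects, their fields, and follower.append_entries() are illustrative assumptions, not a real API.

def commit_entry(leader, followers, command):
    entry = {"term": leader.current_term,       # e.g. term 10
             "index": leader.commit_index + 1,  # e.g. index 101
             "command": command}
    leader.log.append(entry)
    acks = 1                                    # the leader counts itself
    for follower in followers:
        if follower.append_entries([entry]):    # follower votes "for"
            acks += 1
    if acks > (len(followers) + 1) // 2:        # majority accepted the entry
        leader.commit_index = entry["index"]    # commit index becomes 101
        return True
    return False                                # not enough votes: not committed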
Handling Failure
Multiple kinds of failure exist in the environment; however, the Paxos and Raft consensus algorithms only tolerate crash or network faults. Followers may crash, but the system can tolerate up to floor((N-1)/2) failed nodes, where N is the total number of nodes in the environment (for example, 2 out of 5), because majority voting is unaffected. This requires that a majority of the followers are non-faulty and can send votes, so the leader can take the majority decision on whether to accept or reject a particular transaction.
Byzantine Agreement: Oral Message Algorithm
Pulse-1 is the initial pulse, in which the commander sends the message to all the lieutenants. Broadcast(N, t = 0), where N is the number of processes and t is the algorithm parameter, denotes the individual rounds. The commander decides his own value, and in this case the possible values are {retreat, attack}. In this example, N = 3: there are three different lieutenants trying to reach a consensus.
Base Condition for Lieutenant
Each lieutenant receives the message from the commander and checks whether it is a pulse-1 message or not. If it is a pulse-1 message and the sender is the commander, accept it; otherwise, wait for a pulse-1 message. Once a pulse-1 message is received, broadcast this message to all other processes in the network.
All the lieutenants broadcast their values to the other lieutenants, except the senders. At the end of the rounds, each lieutenant should hold N-1 values, except for values from offline lieutenants. In the end, they apply the majority-voting principle and achieve consensus.
In this agreement protocol, after N rounds each process holds N values; this is because the system is synchronous and has a reliable communication medium. Once they have the N values, they can apply the majority-voting principle and achieve consensus. However, to achieve consensus, the system should satisfy the conditions below:
The system must have a minimum of three lieutenants plus a commander (four processes in total). In general, out of N total processes, at most F may be faulty, where N >= 3*F + 1.
The system should be fully connected, and the receivers always know the identity of the senders.
The system should be synchronous and have a reliable communication medium.
The problem was explained aptly in a 1982 paper by Leslie Lamport, Robert Shostak, and Marshall Pease, then at SRI International:
Imagine that several divisions of the Byzantine army are camped outside an enemy city,
each division commanded by its own general. The generals can communicate with one
another only by messenger. After observing the enemy, they must decide upon a common
plan of action. However, some of the generals may be traitors, trying to prevent the loyal
generals from reaching an agreement. The generals must decide on when to attack the city,
but they need a strong majority of their army to attack at the same time. The generals must
have an algorithm to guarantee that (a) all loyal generals decide upon the same plan of
action, and (b) a small number of traitors cannot cause the loyal generals to adopt a bad
plan. The loyal generals will all do what the algorithm says they should, but the
traitors may do anything they wish. The algorithm must guarantee condition (a) regardless
of what the traitors do. The loyal generals should not only reach agreement, but should
agree upon a reasonable plan.
Byzantine fault tolerance can be achieved if the correctly working nodes in the network reach an agreement on their values. There can be a default vote value given to missing messages, i.e., we can assume that the message from a particular node is 'faulty' if the message is not received within a certain time limit. Furthermore, we can also assign a default response if the majority of nodes respond with a correct value.
Leslie Lamport proved that with 3m+1 processors in total, a consensus (agreement on the same state) can be reached if at most m processors are faulty, which means that strictly more than two-thirds of the total number of processors should be honest.
Types of Byzantine Failures:
There are two categories of failures that are considered. One is fail-stop (in which the node fails and stops operating) and the other is arbitrary-node failure. Some of the arbitrary node failures are given below:
· Failure to return a result.
· Respond with an incorrect result.
· Respond with a deliberately misleading result.
· Respond with a different result to different parts of the system.
Advantages of pBFT:
· Energy efficiency:
pBFT can achieve distributed consensus without carrying out complex mathematical computations (as in PoW). Zilliqa employs pBFT in combination with a PoW-like complex computation round for every 100th block.
· Transaction finality:
Transactions do not require multiple confirmations after they have been finalized and agreed upon (unlike the PoW mechanism in Bitcoin, where every node individually verifies all the transactions before adding the new block to the blockchain, and confirmations can take between 10 and 60 minutes depending upon how many entities confirm the new block).
· Low reward variance:
Every node in the network takes part in responding to the request by the client, and hence every node can be incentivized, leading to low variance in rewarding the nodes that help in decision making.
pBFT tries to provide a practical Byzantine state machine replication that can work even
when malicious nodes are operating in the system.
Nodes in a pBFT-enabled distributed system are sequentially ordered, with one node being the primary (or leader) node and the others referred to as secondary (or backup) nodes. Note here that any eligible node in the system can become the primary by transitioning from secondary to primary (typically, in the case of a primary node failure). The goal is that all honest nodes help in reaching a consensus regarding the state of the system using the majority rule.
A practical Byzantine Fault Tolerant system can function on the condition that the number of malicious nodes is less than one-third of all the nodes in the system. As the number of nodes increases, the system becomes more secure.
pBFT consensus rounds are broken into 4 phases:
· The client sends a request to the primary (leader) node.
· The primary (leader) node broadcasts the request to all the secondary (backup) nodes.
· The nodes (primary and secondaries) perform the service requested and then send back a reply to the client.
· The request is served successfully when the client receives m+1 replies from different nodes in the network with the same result, where m is the maximum number of faulty nodes allowed. This last check is sketched below.
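A minimal sketch of the client-side check in the last phase; the function name and the representation of replies as a plain list are illustrative assumptions.

from collections import Counter

def result_accepted(replies, m):
    # replies: list of results returned by different nodes;
    # m: maximum number of faulty nodes tolerated.
    if not replies:
        return None
    result, votes = Counter(replies).most_common(1)[0]
    # The request is served once m+1 nodes agree on the same result.
    return result if votes >= m + 1 else None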
Limitations of pBFT:
The pBFT consensus model works efficiently only when the number of nodes in the distributed network is small, due to the high communication overhead: every node talks to every other node, so the message complexity grows quadratically with every extra node in the network.
Variations of pBFT:
To enhance the quality and performance of pBFT for specific use cases and conditions, many variations have been proposed and employed. Some of them are:
· RBFT – Redundant BFT
· ABsTRACTs
· Q/U
· HQ – Hybrid Quorum Protocol for BFT
· Adapt
· Zyzzyva – Speculative Byzantine Fault Tolerance
· Aardvark
“several divisions of the Byzantine army are camped outside an enemy city, each division
commanded by its own general. The generals can communicate with one another only by
messenger. After observing the
enemy, they must decide upon a common plan of action.”
“Byzantine Generals” metaphor used in the classical paper by Lamport et al. [Lamport et al.,
1982]
The paper considered a synchronous system, i.e., a system in which there are known delay
bounds for processing and communication.
Byzantine Generals
The problem is given in terms of generals who have surrounded the enemy. The generals wish to organize a plan of action: to attack or to retreat. Each general observes the enemy and communicates his observations to the others.
Unfortunately, there are traitors among the generals, and the traitors want to influence the plan to the enemy's advantage. They may lie about whether they will support a particular plan and about what other generals told them.
The game theory analogy behind the Byzantine Generals Problem is that several generals are
besieging Byzantium. They have surrounded the city, but they must collectively decide when
to attack. If all generals attack at the same time, they will win, but if they attack at different
times, they will lose. The generals have no secure communication channels with one another
because any messages they send or receive may have been intercepted or deceptively sent by
Byzantium’s defenders. How can the generals organize to attack at the same time?
Condition A: All loyal generals decide upon the same plan of action.
Condition B: A small number of traitors cannot cause the loyal generals to adopt a bad plan.
What algorithm for decision making should the generals use to reach a consensus? What percentage of liars can the algorithm tolerate and still correctly determine a consensus?
Assume the plans of action are: attack or retreat. Let:
n be the number of generals,
v(i) be the opinion of general i (attack/retreat).
Each general i communicates the value v(i) by messenger to every other general j.
Each general makes his final decision by a majority vote among the values v(1), …, v(n).
To satisfy condition A:
Every general must apply the majority function to the same values v(1), …, v(n). But a traitor may send different values to different generals, so the generals may receive different values.
To satisfy condition B:
For each i, if the i-th general is loyal, then the value he sends must be used by every loyal general as the value v(i).
Let us reduce the consensus problem to a simpler situation in which we have: 1 commanding general (C) and n-1 lieutenant generals (L1, …, Ln-1).
Consensus: Interactive Consistency conditions.
IC1: All loyal lieutenant generals obey the same command.
IC2: The decision of the loyal lieutenants must agree with the commanding general's order if he is loyal.
The lieutenant generals send messages back and forth among themselves, reporting the command received from the Commanding General.
[Figure: interactive consistency example – (a) the generals announce their troop strengths (in units of 1 thousand soldiers); (b) the vectors that each general assembles based on (a); (c) the vectors that each general receives in step 3.]
The solution to the problem relies on an algorithm that can guarantee the conditions above.
1. Using Oral Messages
The Oral Message algorithm OM(m) is a solution for 3m+1 or more generals with m traitors.
Oral messages have the property that if a majority of the values vi equals v, then majority(v1, …, vn-1) equals v.
Each lieutenant collects an ordered set of values Vi. The algorithm is defined recursively:
Base case, OM(0): The commander sends his value to the lieutenants, and each lieutenant receives and records it, e.g. Vi = {v0 : attack}.
Recursive case, OM(m): Each lieutenant acts as the commander in OM(m-1), sending the value he received to 'his' lieutenants, and this continues recursively; finally, each lieutenant decides by majority over the values he holds. A sketch of this recursion follows below.
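The following is a loose sketch of the OM(m) recursion, under the simplifying assumption that all processes are loyal (traitor behaviour and the message transport are abstracted away); it shows only the recursive structure and the majority step.

def majority(values):
    # Majority over a small set of values such as {"attack", "retreat"};
    # ties fall to a fixed default, as the algorithm permits.
    values = sorted(values)
    return values[len(values) // 2]

def om(m, commander_value, lieutenants):
    # OM(0): every lieutenant simply records the commander's value.
    if m == 0:
        return {lt: commander_value for lt in lieutenants}
    decisions = {}
    for lt in lieutenants:
        # Each lieutenant acts as the commander in OM(m-1) for the others,
        # relaying the value it received.
        others = [x for x in lieutenants if x != lt]
        relayed = om(m - 1, commander_value, others)
        # The lieutenant decides by majority over its own value and the
        # values relayed by the other lieutenants.
        decisions[lt] = majority([commander_value] + list(relayed.values()))
    return decisions

# Example: one traitor tolerated with a commander and three lieutenants.
print(om(1, "attack", ["L1", "L2", "L3"]))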
So it stands to reason that the objective of a Byzantine Fault Tolerant system is to be able
to defend against Byzantine failures.
Therefore, the Byzantine Fault Tolerance model could help in resolving this problem. The
generals would need an algorithm that could guarantee the following conditions.
1. All the loyal generals would act and agree on the same plan of action.
2. The loyal generals of the Byzantine army would not follow a bad plan under the influence of
traitor generals.
3. The loyal generals would follow all the rules specified in the algorithm.
4. All the loyal generals of the Byzantine army must reach a consensus irrespective of the
actions of traitors.
5. Most important of all, the loyal generals should also reach an agreement on a specific and
reasonable plan.
Note that Byzantine faults are the most severe and difficult to deal with. Byzantine fault tolerance has been needed in airplane engine systems, nuclear power plants, and pretty much any system whose actions depend on the results of a large number of sensors.
In pBFT, the primary (leader) node is changed during every view (pBFT consensus round) and can be substituted via a view-change protocol if a predefined quantity of time has passed without the leading node broadcasting a request to the backups (secondaries). If needed, a majority of the honest nodes can vote on the legitimacy of the current leading node and replace it with the next leading node in line.
BFT over Asynchronous systems
What’s “asynchronous” Byzantine fault tolerance (ABFT)?
When a decentralized network is Byzantine fault tolerant, it means that the honest members, or nodes, of the network can be guaranteed to agree on the timing and order (consensus) of a set of transactions, regardless of whether some nodes are maliciously trying to prevent that consensus, even if just under one-third of the nodes try to negatively affect consensus by delaying transactions or otherwise corrupting things. This is the 'fault tolerance' of the network: how many nodes can act maliciously while the network still comes to an honest consensus.
An ABFT network allows for messages to be lost or indefinitely delayed, and assumes only that at some point an honest node's messages will eventually get through. It is much more challenging for an honest node to assess whether another node is following the rules if that node's messages can be indefinitely delayed, but this scenario much better reflects network reliability in the real world.