Distributed Systems
Election Algorithms
Prof. Dr.-Ing. Torben Weis
University Duisburg-Essen
Outline
Distributed Election
Definition, Problem statement
Algorithms
Bully Algorithm
Ring Algorithm
Paxos Algorithms
Distributed Systems Torben Weis 2
University Duisburg-Essen
Distributed Election
Problem
Election of one node from a set of connected nodes
Network topology
Each computer can communicate with all others
Each computer can only communicate with its direct
neighbours
Fault tolerance
Will it work with network partitions?
Consistency
Can nodes temporarily disagree on the outcome of
the election?
Availability
Will election always succeed?
Distributed Election
What qualifies a winner?
All nodes could potentially be elected
Most algorithms are not fair
Fairness means that all nodes have the same chance of
being elected
Winner by IP-Address or MAC-Address
The highest number wins
This property is stable
Running the algorithm multiple times yields the same
result if the last winner is still online
Winner by Time
The first node to propose itself for election will win
Bully Algorithm (1)
Basic Idea
The node with the highest number/address wins
Initial setting
Every node has a unique number
Each node knows all other nodes and their numbers
This forms a fully connected network graph
Initially, all nodes assume they are the winner
Repeatedly check the presence of nodes with a
higher ID
Bully Algorithm (2)
The Bully Algorithm (for a node P)
It allows node P to determine the node with the
highest number that is currently online
1. P sends an ELECTION message to all connected
nodes with higher numbers
2. If no one responds, P wins the election and
becomes coordinator
P informs all others that it has won the election
3. If one of the higher-ups answers, then it starts
an election itself and P's job is done
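The three steps can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: the function name `bully_election`, the `all_nodes` and `online` parameters, and the use of a work list instead of real network messages are hypothetical and not part of the slides. The starter is assumed to be online.

```python
def bully_election(starter, all_nodes, online):
    """Return the coordinator found when `starter` holds an election.

    all_nodes: every node number known to the nodes.
    online:    the node numbers that currently answer messages.
    """
    pending = [starter]   # nodes that still have to hold an election
    started = set()       # nodes that already held one
    coordinator = None
    while pending:
        p = pending.pop()
        if p in started:
            continue
        started.add(p)
        # Step 1: ELECTION goes to every node with a higher number.
        higher = [q for q in all_nodes if q > p]
        # Only nodes that are online can answer.
        responders = [q for q in higher if q in online]
        if not responders:
            # Step 2: nobody answered, so p wins and becomes coordinator.
            coordinator = p
        else:
            # Step 3: every responder starts its own election; p is done.
            pending.extend(responders)
    return coordinator
```

With nodes 0 to 7 and node 7 offline, an election started by node 4 ends with node 6 as coordinator, for example `bully_election(4, range(8), {0, 1, 2, 3, 4, 5, 6})`.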
Bully Algorithm (3)
(a) Node 4 holds an election
(b) Nodes 5 and 6 respond, telling 4 to stop
(c) Now 5 and 6 each hold an election
Bully Algorithm (4)
(d) Node 6 tells 5 to stop
(e) Node 6 wins and tells everyone.
Bully Algorithm (5)
Worst Case Complexity
Let n be the number of nodes
The node with the lowest number starts the election
-> (n-1) messages
These nodes each send back an OK message -> (n-1)
These nodes start an election themselves
-> (n-2) + (n-3) + ... + 1 = O(n^2)
The coordinator sends a notification to all -> (n-1)
In total O(3(n-1) + n^2) = O(n^2)
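To see the quadratic growth, the individual terms listed above can be tallied for a concrete n. The function name `bully_worst_case_messages` is hypothetical, and the count follows the terms on this slide only (it is a sketch, not a measurement of a real implementation).

```python
def bully_worst_case_messages(n):
    """Add up the message counts listed on the slide for n nodes."""
    election_by_lowest = n - 1                   # lowest node asks all higher nodes
    ok_replies = n - 1                           # each of them answers with OK
    follow_up_elections = sum(range(1, n - 1))   # (n-2) + (n-3) + ... + 1
    coordinator_notice = n - 1                   # the winner notifies all others
    return (election_by_lowest + ok_replies
            + follow_up_elections + coordinator_notice)
```

For n = 8 this gives 7 + 7 + 21 + 7 = 42 messages; for n = 100 the quadratic term already dominates (4851 of 5148 messages).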
Ring Election Algorithm (1)
Basic Idea
Send a message around the ring and record the node
with the highest number
Initial Setting
Every node has a unique number
All nodes are connected by a ring topology
Nodes need to know only their direct ring neighbour
Nodes do not need to know about numbers of any
other node
Ring Election Algorithm (2)
Algorithm
1. Each node can start an election by sending an
election message containing its number only
2. When a node receives an election message
A) If the first number in the message is its own:
The election is finished
Look up the highest number in the message
Optionally send the election result around the
ring to inform all others
B) Otherwise, append the node's number to the message and
forward the message to its neighbour
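A minimal sketch of the two cases, assuming the ring is given as a list of node numbers in ring order and that no node fails. The name `ring_election` and the returned message count are illustrative additions, not part of the slides.

```python
def ring_election(ring, starter_index):
    """Return (winner, messages_sent) for one trip around the ring."""
    n = len(ring)
    # Step 1: the election message contains only the starter's number.
    message = [ring[starter_index]]
    position = (starter_index + 1) % n   # the starter's direct neighbour
    sent = 1
    # Case B: the receiver is not the starter, so it appends and forwards.
    while ring[position] != message[0]:
        message.append(ring[position])
        position = (position + 1) % n
        sent += 1
    # Case A: the first number is the receiver's own, the election is finished.
    return max(message), sent
```

For the ring `[3, 7, 1, 5]` with node 1 starting (`starter_index=2`), the message travels once around the ring and node 7 is found as the winner after 4 messages.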
Ring Election Algorithm (3)
Ring Election Algorithm (4)
Complexity
Same in worst-case, average-case and best-case
Let n be the number of nodes
Send one message along the ring -> n messages
Notify all about the result -> n messages
In total: 2n messages
Size of the message:
1st message: 4 bytes, 2nd: 8 bytes, 3rd: 12 bytes, ...
This can be improved by sending only the number of
the node starting the election and the highest number
-> all messages have the same constant size
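The constant-size variant might look as follows. This is a sketch under the same assumptions as before (ring given as a list, no failures); the name `ring_election_constant_size` and the tuple layout are hypothetical.

```python
def ring_election_constant_size(ring, starter_index):
    """Ring election whose message is always a pair of two numbers."""
    n = len(ring)
    starter = ring[starter_index]
    # The message holds (number of the starting node, highest number so far).
    message = (starter, starter)
    position = (starter_index + 1) % n
    while ring[position] != message[0]:
        # Each node only updates the maximum instead of appending its number.
        message = (message[0], max(message[1], ring[position]))
        position = (position + 1) % n
    return message[1]
```

The message no longer grows with the number of nodes visited, while the result is the same as with the appended list.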
Election in Wireless Networks (1)
Basic Idea
Compute a sink tree and forward highest node
number to the root of the tree
Initial settings
Each node has a unique number
Each node can communicate with its neighbours only
Algorithm outline
1) Build the tree
2) Report the node with the highest number to root
Election in Wireless Networks (2)
Building the Tree
Send message to all neighbours
When receiving a message:
If first message, then sender becomes parent in the tree
Otherwise, tree remains unchanged
Acknowledge receipt of the message
Otherwise the sender cannot know whether it is a
leaf node
Report highest number
All leaf nodes send their number to their parent
Parent forwards maximum of its own number and the
numbers reported by its children
Eventually the root node will receive the highest
number
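Both phases can be sketched in a few lines. The sketch assumes the network is given as an adjacency dictionary and replaces real message passing by a FIFO queue, so "first message received" becomes "first time reached"; the names `build_tree` and `report_highest` are hypothetical.

```python
from collections import deque


def build_tree(neighbours, source):
    """Flood from the source; the first sender becomes a node's parent."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        sender = queue.popleft()
        for node in neighbours[sender]:
            if node not in parent:
                # First message: the sender becomes the parent in the tree.
                parent[node] = sender
                queue.append(node)
            # Otherwise the tree remains unchanged (the node only acknowledges).
    return parent


def report_highest(parent, source):
    """Forward the maximum node number from the leaves up to the root."""
    children = {node: [] for node in parent}
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    def best(node):
        # A leaf reports its own number; a parent forwards the maximum of
        # its own number and the numbers reported by its children.
        return max([node] + [best(child) for child in children[node]])

    return best(source)
```

For example, with `neighbours = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 9], 9: [4]}` and source 1, node 4 gets node 2 as parent and the root eventually learns that 9 is the highest number.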
Election in Wireless Networks (3)
Figure 6-22. Election algorithm in a wireless network, with node a as the
source.
(a) Initial network
(b)-(e) The build-tree phase
(f) Reporting of best node to source
Election and Faults
Availability
Will the algorithm always deliver a result?
Consistency
Will the algorithm always deliver the correct result?
If multiple elections are running, will they deliver the
same results?
Network and Node Failure
Can the network fall apart?
Can messages disappear?
Can nodes fail?
Availability
Availability is defined using the status function X(t):
X(t) = 1 if the system is working at time t, and X(t) = 0 otherwise
Availability at time t is defined as
A(t) = Pr[X(t) = 1] = E[X(t)]
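Since A(t) is the expected value of X(t), it can be estimated as the mean of observed status values. The sketch assumes X(t) is 1 when the system works at time t and 0 otherwise; the function name `availability` and the sample list are illustrative only.

```python
def availability(status_samples):
    """Estimate A(t) = E[X(t)] from observed values of X(t) (each 0 or 1)."""
    return sum(status_samples) / len(status_samples)
```

If the system was found working in three out of four independent observations at time t, `availability([1, 1, 0, 1])` yields 0.75.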
Election and Availability
Bully Algorithm
The elected coordinator fails before it can report its
success
Coordinator is single point of failure
Ring Algorithm
If one node drops the message, election fails
Every node is a single point of failure
Election in Wireless Networks
Inner node of the tree fails
Thus, the tree will never be fully constructed and the
algorithm cannot terminate
Election and Consistency
Bully Algorithm
The network is partitioned
In each partition a node starts an election
Both nodes will come back with different results
Ring Algorithm
Either it works correctly or not at all, i.e.
it is consistent but not highly available
Election in Wireless Networks
Either it works correctly or not at all, i.e.
it is consistent but not highly available
Improved Bully Algorithm
Idea
Do not wait for the coordinator
At all times a node believes that the winner is the
highest active node it received a message from
If no message has been received, the node believes it
is the winner itself
The algorithm is always available
At all times a node has a belief about who the winner is
The algorithm is still not consistent
Proof: Let all network communication fail
Now every node believes it is the winner itself
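The belief rule can be sketched as follows. Several details are assumptions made for illustration and are not stated on the slide: a node counts as active if a message from it arrived within `timeout` seconds, a node never believes in a winner with a lower number than its own, and the names `Node`, `on_message` and `believed_winner` are hypothetical.

```python
class Node:
    def __init__(self, number, timeout=3.0):
        self.number = number
        self.timeout = timeout    # assumed rule for deciding who is "active"
        self.last_heard = {}      # node number -> time of its last message

    def on_message(self, sender, now):
        """Remember when a message from `sender` was last received."""
        self.last_heard[sender] = now

    def believed_winner(self, now):
        """Highest active node heard from, or the node itself."""
        active = [n for n, t in self.last_heard.items()
                  if now - t <= self.timeout]
        return max(active + [self.number])
```

A node that has heard nothing answers with its own number, which is exactly the situation in the proof above: if all communication fails, every node believes it is the winner.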
Partitioning
Partitioning means nodes in the network cannot
exchange messages
Impossible to avoid partitioning in practice
Every packet drop can be seen as a partitioning
All networked computers may suffer from partitioning
Partitions cause problems for election
Either no election is held because some nodes are
alive but unreachable
Or an election is held although some nodes cannot
vote
Seems that one has to choose between
availability and consistency
CAP-Theorem
C = Strong Consistency
A = Availability
P = Partition-Tolerance
CAP Theorem
It is only possible for an algorithm to have two of the
three C, A, P properties at a time
Gilbert and Lynch. Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services. ACM SIGACT News (2002) vol. 33 (2)
pp. 59
CAP-Theorem
Possible combinations
CAP is not possible
CP
The system is always consistent
But sometimes it cannot provide a result, e.g. no
election is possible
Example use case: Money transfer between banks
AP
The system always provides an answer
The answers obtained by multiple clients can differ
Example use case: Search engine
CA
Useless in a distributed system since P cannot be
avoided
CAP Theorem Informal Proof
Setting:
Three nodes in two partitions {A, B} and {C}.
One client writes on C.
Option 1:
C allows writing. C cannot talk with A, B.
Thus {A,B} and {C} are not consistent
Option 2:
C does not allow writing to avoid
inconsistencies.
This implies the system is not available for
writing
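The two options can be played through with a toy model. Everything here is a hypothetical illustration: `Replica`, `write` and the flag `allow_when_partitioned` are invented names, and a partition is modelled simply as a shorter list of reachable peers.

```python
class Replica:
    def __init__(self):
        self.value = None


def write(target, reachable_peers, all_peers, value, allow_when_partitioned):
    """Try to write `value` on `target`; return True if the write is served."""
    partitioned = len(reachable_peers) < len(all_peers)
    if partitioned and not allow_when_partitioned:
        # Option 2: refuse the write, so the system stays consistent
        # but is not available for writing.
        return False
    # Option 1: accept the write; unreachable peers keep their old value.
    target.value = value
    for peer in reachable_peers:
        peer.value = value
    return True
```

With C cut off from A and B, `write(c, [], [a, b], "x", True)` succeeds but leaves A and B with a different value than C, while the same call with `False` keeps all replicas equal and rejects the client.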
Leader Election in a Cloud Cluster
Cloud Cluster Hardware
Up to 1000 nodes, some are always broken
Cloud Application Software
Tasks must be distributed among the machines
Some repository must know who is doing what
Repository Requirements
Availability -> Replication
Consistency
Fault tolerance
All three together are not possible -> CAP Theorem!
To avoid data loss, use a CP algorithm
Paxos Algorithm
Paxos produces a distributed consensus with
unreliable nodes
However, all nodes are trustworthy
Election is a special case of distributed consensus
Paxos produces a value on which all nodes agree
Every node can propose a value
Once there is a consensus, the value will never change
It is possible that no consensus can be reached
otherwise it would be a CAP algorithm
To build a repository we need to agree on a
sequence of values
This is achieved by Multi-Paxos
Paxos Roles
Client
Wants to propose or learn the consensus value
Proposer
Nodes that are allowed to propose values to the acceptors
Leader
A special proposer that is doing all the proposing.
All other proposers remain silent
There should be only one leader
If there is more than one leader, they can block each other and the
algorithm may fail to reach a consensus
Acceptor
Accepts proposals from the proposers and votes on them
Learner
Informs the client about the outcome of the consensus algorithm
In practice one node fulfils more than one role
Paxos Algorithm
Algorithm consists of two phases
Phase 1: Promise
Proposer sends a prepare request with a proposal number to the acceptors
Acceptors either promise to honour the number or reject it
Phase 2: Accept
Proposer asks the acceptors to accept a value
Acceptors will either do this or reject the value
Paxos: Preparations
Client wants to reach a consensus
If there is already a consensus, the client wants to
learn it
If there is no consensus yet, the client tries to
establish a consensus
A leader must be elected among the proposers
Multiple leaders do not break the system, but
eventually a single one should emerge
-> AP algorithm is sufficient
We can use the improved Bully algorithm
Paxos: Phase 1
Propose
Proposer selects a number N
Each proposer chooses strictly increasing numbers
Proposer sends the message prepare(N) to a majority
of acceptors
Promise
If the acceptor already got a proposal with number M>=N
then it sends a reject message
If the acceptor has accepted a value V because of a
previous proposal M<N, then it sends a promise
containing (M, V)
Otherwise, the acceptor sends a promise without a
value, i.e. (0, 0)
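The acceptor's side of phase 1 could be sketched like this. The class layout, the attribute names and the tagged return tuples are assumptions for illustration; proposal numbers are assumed to start at 1 so that (0, 0) can stand for "nothing accepted yet".

```python
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest proposal number seen so far
        self.accepted = (0, 0)   # (M, V) of the last accepted proposal

    def on_prepare(self, n):
        """Handle prepare(N): reject, or promise and report (M, V)."""
        if self.promised >= n:
            # A proposal with number M >= N was already seen.
            return ("reject", self.promised)
        self.promised = n
        # Either a previously accepted (M, V) with M < N, or (0, 0).
        return ("promise", self.accepted)
```

A fresh acceptor answers `prepare(1)` with `("promise", (0, 0))` and rejects a second `prepare(1)`, because it has already seen a number that is not smaller.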
Paxos: Phase 1
As in majority voting, the proposer waits for a
limited time
If no majority has responded in time, no consensus can be
reached
If the proposer got a reject (some acceptor already saw a number M>=N),
then there are multiple leaders. Choose a higher N and retry
If the proposer got only (0, 0) values, it can propose
its own value in phase 2
If it got one or more (M, V) responses, it must
propose the value V that came with the highest M,
(no new value can be proposed, instead support one
that is already circulating)
Result of the first phase: Proposer knows which
value to propose
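The rule for picking the value at the end of phase 1 fits into one small function. The name `choose_value` is hypothetical, and the sketch assumes that `promises` already contains the (M, V) pairs from a majority of acceptors and is not empty.

```python
def choose_value(promises, own_value):
    """Pick the value to propose in phase 2.

    promises: list of (M, V) pairs, where (0, 0) means "nothing accepted".
    """
    m, v = max(promises, key=lambda pair: pair[0])
    if m == 0:
        # Only (0, 0) answers: the proposer may propose its own value.
        return own_value
    # Otherwise it must support the value that came with the highest M.
    return v
```

For instance, `choose_value([(0, 0), (2, "B"), (1, "C")], "A")` returns `"B"`, while `choose_value([(0, 0), (0, 0)], "A")` returns `"A"`.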
Paxos: Phase 2
Accept Request
Proposer sends (N, V) to a majority of acceptors
Accept
If an acceptor has already given a promise (step 1) for
M>N then reject and do not accept
Otherwise, store (N, V) and send an accept message
Note: even if an acceptor has accepted a message,
there may still be no consensus
Once an acceptor has accepted (N,V), it can still accept
some (M,V2) if M>N
Paxos: Phase 2
An acceptor could accept (1, "Hallo") and later
on accept (2, "Huhu") as well
Once a majority of acceptors has agreed on
some value V, the result of phase 1 must
always be to propose value V again
Hence, the consensus cannot change once it has
been reached
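Why the consensus cannot change can be tried out with a compact single-process simulation of both phases. This is a sketch, not a full Paxos implementation: the names `Acceptor` and `propose` are hypothetical, messages are plain method calls that never get lost, and the proposer contacts all acceptors instead of just a majority.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest proposal number seen so far
        self.accepted = (0, 0)   # last accepted (N, V), (0, 0) if none

    def on_prepare(self, n):
        if self.promised >= n:
            return None          # reject
        self.promised = n
        return self.accepted     # promise, carrying (M, V) or (0, 0)

    def on_accept(self, n, v):
        if self.promised > n:
            return False         # a higher number was promised meanwhile
        self.promised = n
        self.accepted = (n, v)
        return True


def propose(acceptors, n, own_value):
    """Run phase 1 and phase 2; return the agreed value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [p for p in replies if p is not None]
    if len(promises) < majority:
        return None
    # Support the value with the highest M, or use the own value.
    m, v = max(promises, key=lambda pair: pair[0])
    value = own_value if m == 0 else v
    accepted = sum(a.on_accept(n, value) for a in acceptors)
    return value if accepted >= majority else None
```

With three acceptors, `propose(acceptors, 1, "Hallo")` establishes "Hallo"; a later `propose(acceptors, 2, "Huhu")` also returns "Hallo", because phase 1 forces the second proposer to adopt the value that is already agreed on.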
Paxos as Sequence Diagram
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(N)
| |<---------X--X--X | | Promise(N,M,V)
| X--------->|->|->| | | Accept!(N,V)
| |<---------X--X--X------>|->| Accepted(N,V)
|<---------------------------------X--X Response
| | | | | | |
Paxos: An Acceptor Fails
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(N)
| | | | ! | | !! FAIL !!
| |<---------X--X | | Promise(N,M,V)
| X--------->|->| | | Accept!(N,V)
| |<---------X--X--------->|->| Accepted(N,V)
|<---------------------------------X--X Response
| | | | | |
Paxos: A Learner Fails
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| |<---------X--X--X | | Promise(1)
| X--------->|->|->| | | Accept!(1,V)
| |<---------X--X--X------>|->| Accepted(1,V)
| | | | | | ! !! FAIL !!
|<---------------------------------X Response
| | | | | |
Paxos: A Proposer Fails
Client Leader Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1)
| | | | | | |
| | | | | | |
| X------------>| | | | | Accept!(1,Va)
| ! | | | | | !! Leader fails
| | | | | | | !! NEW LEADER !!
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2,1,V)
| X--------->|->|->| | | Accept!(2,V)
| |<---------X--X--X------>|->| Accepted(2,V)
|<---------------------------------X--X Response
| | | | | | |
Client Proposer Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1)
| ! | | | | | !! LEADER FAILS
| | | | | | | !! NEW LEADER
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2)
| | | | | | | | !! OLD LEADER recovers
| | | | | | | | !! OLD LEADER tries 2, denied
| X------------>|->|->| | | Prepare(2)
| |<------------X--X--X | | Nack(2)
| | | | | | | | !! OLD LEADER tries 3
| X------------>|->|->| | | Prepare(3)
| |<------------X--X--X | | Promise(3)
| | | | | | | | !! NEW LEADER proposes, denied
| | X--------->|->|->| | | Accept!(2,V)
| | |<---------X--X--X | | Nack(3)
| | | | | | | | !! NEW LEADER tries 4
| | X--------->|->|->| | | Prepare(4)
| | |<---------X--X--X | | Promise(4)
Multi Paxos
Goal: A distributed data repository that is
always consistent
Paxos: Distributed agreement for a single value
Multi-Paxos: Distributed agreement for a
sequence of values V1, V2, V3, ...
The repository is a state machine
V1 specifies the first state transition
V2 specifies the next state transition
Since all nodes agree on the sequence Vi, all must
agree on the current state of the state machine
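The state-machine view can be illustrated with a tiny repository that maps tasks to nodes. The representation is an assumption chosen for this sketch: each agreed value Vi is a (key, value) pair, and `apply_sequence` is a hypothetical helper name.

```python
def apply_sequence(initial_state, transitions):
    """Apply the agreed values V1, V2, V3, ... in order to a copy of the state."""
    state = dict(initial_state)
    for key, value in transitions:
        # Each agreed value is one state transition of the repository.
        state[key] = value
    return state
```

Two replicas that apply the same sequence, for example `[("task1", "nodeA"), ("task2", "nodeB"), ("task1", "nodeC")]`, both end up with `{"task1": "nodeC", "task2": "nodeB"}`, which is why agreeing on the sequence is enough to agree on the state.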
Multi Paxos (Optimization)
Idea: Run the first step only once, i.e. Prepare/Promise
Problem: Multiple proposers could propose the same number
N, which is avoided by Prepare/Promise
Let I be the instance number of the leader
Hence (N,I) is unique, even if two proposers think they are
leaders and both use the number N
Client Proposer Acceptor Learner
| | | | | | | --- First Request ---
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(N)
| |<---------X--X--X | | Promise(N,I)
| X--------->|->|->| | | Accept!(N,I,Vn)
| |<---------X--X--X------>|->| Accepted(N,I,Vn)
|<---------------------------------X--X Response
| | | | | | |
Multi Paxos
Here the leader wants to get an accept for number
(N+1, I) without getting a promise for (N+1) first
Client Proposer Acceptor Learner
| | | | | | | --- Following Requests
X-------->| | | | | | Request
| X--------->|->|->| | | Accept!(N+1,I,W)
| |<---------X--X--X------>|->| Accepted(N+1,I,W)
|<---------------------------------X--X Response
| | | | | | |