Distributed Systems: Election Algorithms

This document discusses distributed election algorithms. It outlines the bully algorithm, ring algorithm, and algorithms for wireless networks. The bully algorithm elects the node with the highest ID by having nodes check the presence of nodes with higher IDs. The ring algorithm sends an election message around a ring topology to find the highest ID. Wireless network algorithms build a sink tree to report the highest ID to the root. The document discusses issues like availability, consistency, and partitioning in distributed elections and how they relate to the CAP theorem.


Distributed Systems

Election Algorithms

Prof. Dr.-Ing. Torben Weis


University Duisburg-Essen
Outline

Distributed Election
Definition, Problem statement
Algorithms
Bully Algorithm
Ring Algorithm
Paxos Algorithms

Distributed Election

Problem
Election of one node from a set of connected nodes
Network topology
Each computer can communicate with all others, or
each computer can only communicate with its direct neighbours
Fault tolerance
Will it work with network partitions?
Consistency
Can nodes temporarily disagree on the outcome of the election?
Availability
Will the election always succeed?

Distributed Election

What qualifies a winner?


All nodes could potentially be elected
Most algorithms are not fair
Fairness means that all nodes have the same chance of being elected

Winner by IP address or MAC address
The highest number wins
This property is stable
Running the algorithm multiple times yields the same result if the last winner is still online

Winner by time
The first node to propose itself for election wins
Bully Algorithm (1)

Basic Idea
The node with the highest number/address wins

Initial setting
Every node has a unique number
Each node knows all other nodes and their numbers
This forms a fully connected network graph
Initially, all nodes assume they are the winner

Repeatedly check the presence of nodes with a higher ID

Bully Algorithm (2)

The Bully Algorithm (for a node P)
It allows node P to determine the node with the highest number that is currently online

1. P sends an ELECTION message to all connected nodes with higher numbers
2. If no one responds, P wins the election and becomes coordinator
P informs all others that it has won the election
3. If one of the higher-numbered nodes answers, that node starts an election itself and P's job is done
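
The three steps can be sketched as a small simulation. This is only an illustration under simplifying assumptions: messages are modelled as direct function calls, `online` is a hypothetical set of nodes that currently answer, and only one of the parallel elections started in step 3 is followed, since all of them end at the same node.

```python
def bully_election(starter, online):
    """Return the coordinator found when `starter` starts an election.

    `online` is the set of node numbers that currently answer messages.
    """
    # Step 1: contact all online nodes with a higher number.
    higher = sorted(n for n in online if n > starter)
    if not higher:
        # Step 2: nobody with a higher number answered, so the starter wins.
        return starter
    # Step 3: a higher node answered and starts its own election.
    return bully_election(higher[0], online)


nodes = {1, 2, 3, 4, 5, 6, 7}
online = nodes - {7}                  # assumption: node 7 has crashed
print(bully_election(4, online))      # 6
```

With node 4 as the starter, nodes 5 and 6 answer, and node 6 ends up as coordinator.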
Bully Algorithm (3)

(a) Node 4 holds an election


(b) Nodes 5 and 6 respond, telling 4 to stop
(c) Now 5 and 6 each hold an election
Bully Algorithm (4)

(d) Node 6 tells 5 to stop


(e) Node 6 wins and tells everyone.
Bully Algorithm (5)

Worst Case Complexity
Let n be the number of nodes
The node with the lowest number starts the election -> (n-1) messages
These nodes send back an OK message -> (n-1) messages
These nodes start an election themselves -> (n-2) + (n-3) + ... + 1 = O(n^2) messages
The coordinator sends a notification to all -> (n-1) messages

In total O(4(n-1) + n^2) = O(n^2)
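
The quadratic term comes from the arithmetic series of the follow-up elections. As a short worked step, with c standing for whatever constant multiplies the linear terms:

```latex
\[
(n-2) + (n-3) + \dots + 1 \;=\; \sum_{k=1}^{n-2} k \;=\; \frac{(n-1)(n-2)}{2} \;=\; O(n^2)
\]
\[
c\,(n-1) + \frac{(n-1)(n-2)}{2} \;=\; O(n^2) \quad \text{for any constant } c
\]
```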

Ring Election Algorithm (1)

Basic Idea
Send a message around the ring and record the node
with the highest number

Initial Setting
Every node has a unique number
All nodes are connected by a ring topology
Nodes need to know only their direct ring neighbour
Nodes do not need to know about numbers of any
other node

Ring Election Algorithm (2)

Algorithm
1. Each node can start an election by sending an election message containing only its own number
2. When a node receives an election message
A) If the first number in the message is its own:
The election is finished
Look up the highest number in the message
Optionally send the election result around the ring to inform all others
B) Otherwise, append the node's number to the message and forward the message to its neighbour
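
A minimal simulation of one election round, assuming the ring is given as a Python list in which each node forwards to the next element (wrapping around) and no message is lost. The function name and the list representation are illustrative choices, not part of the slides.

```python
def ring_election(ring, starter_index):
    """Simulate one election on a ring given as a list of node numbers.

    Returns (winner, number_of_messages_sent).
    """
    n = len(ring)
    message = [ring[starter_index]]       # step 1: only the starter's number
    position = (starter_index + 1) % n    # first hop to the neighbour
    sent = 1
    while ring[position] != message[0]:   # case A not reached yet
        message.append(ring[position])    # case B: append own number
        position = (position + 1) % n     # ... and forward to the neighbour
        sent += 1
    return max(message), sent             # case A: look up the highest number


print(ring_election([3, 8, 1, 5, 2], starter_index=2))   # (8, 5)
```

The message travels once around the ring, so the number of messages equals the number of nodes.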

Ring Election Algorithm (3)
Ring Election Algorithm (4)

Complexity
Same in worst-case, average-case and best-case
Let n be the number of nodes
Send one message along the ring -> n messages
Notify all about the result -> n messages
In total: 2n messages

Size of the message:
1st message: 4 bytes, 2nd: 8 bytes, 3rd: 12 bytes, ...
This can be improved by sending only the number of the node starting the election and the highest number seen so far
-> all messages have the same constant size
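
The constant-size variant can be sketched as follows. The message is the pair (starter, highest so far); the 4-byte encoding via `struct` is an assumption that matches the 4 bytes per number used above.

```python
import struct


def ring_election_constant(ring, starter_index):
    """Ring election whose message is always the pair (starter, highest)."""
    n = len(ring)
    starter = ring[starter_index]
    message = (starter, starter)
    position = (starter_index + 1) % n
    while ring[position] != starter:
        # Each node only updates the running maximum instead of appending.
        message = (starter, max(message[1], ring[position]))
        position = (position + 1) % n
    return message


result = ring_election_constant([3, 8, 1, 5, 2], starter_index=2)
print(result)                               # (1, 8)
print(len(struct.pack("<II", *result)))     # 8
```

Every message is two numbers, so its size stays at 8 bytes regardless of the ring length.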

Election in Wireless Networks (1)

Basic Idea
Compute a sink tree and forward highest node
number to the root of the tree

Initial settings
Each node has a unique number
Each node can communicate with its neighbours only

Algorithm outline
1) Build the tree
2) Report the node with the highest number to root

Election in Wireless Networks (2)

Building the tree
Send a message to all neighbours
When receiving a message:
If it is the first message, the sender becomes the parent in the tree
Otherwise, the tree remains unchanged
Acknowledge receipt of the message
Otherwise the sender cannot know whether it is a leaf node
Report highest number
All leaf nodes send their number to their parent
Each parent forwards the maximum of its own number and the numbers reported by its children
Eventually the root node will receive the highest number
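
Both phases can be sketched in a few lines. This is a simplified model: the flooding is simulated as a breadth-first traversal, so "first message received" is decided by queue order rather than by real radio timing, and acknowledgements are implicit. The graph and the function name are hypothetical.

```python
from collections import deque


def wireless_election(neighbours, source):
    """Build a tree by flooding from `source`, then report the maximum upward.

    `neighbours` maps each node number to the numbers it can reach directly.
    Returns (parent_map, highest_number_seen_by_source).
    """
    parent = {source: None}
    order = [source]                      # nodes in the order they joined
    queue = deque([source])
    while queue:                          # build-tree phase
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in parent:       # first message: sender is parent
                parent[other] = node
                order.append(other)
                queue.append(other)
    best = {node: node for node in parent}
    for node in reversed(order[1:]):      # report phase, children first
        up = parent[node]
        best[up] = max(best[up], best[node])
    return parent, best[source]


graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 9], 9: [4]}
tree, winner = wireless_election(graph, source=1)
print(tree)      # {1: None, 2: 1, 3: 1, 4: 2, 9: 4}
print(winner)    # 9
```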
Elections in Wireless Networks (3)

Figure 6-22. Election algorithm in a wireless network, with node a as the source. (a) Initial network. (b) to (e) The build-tree phase. (f) Reporting of best node to source.
Election and Faults

Availability
Will the algorithm always deliver a result?
Consistency
Will the algorithm always deliver the correct result?
If multiple elections are running, will they deliver the
same results?
Network and Node Failure
Can the network fall apart?
Can messages disappear?
Can nodes fail?

Availability

Using the status function X(t), where X(t) = 1 if the system is operational at time t and X(t) = 0 otherwise, availability at time t is defined as

A(t) = Pr[X(t) = 1] = E[X(t)]
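
Because the status function only takes the values 0 and 1, its expected value equals the probability of being up, so A(t) can be estimated by averaging observed status values. The samples below are made-up illustration data, not measurements.

```python
def estimate_availability(status_samples):
    """Estimate A(t) as the mean of observed status values in {0, 1}."""
    return sum(status_samples) / len(status_samples)


samples = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # hypothetical observations of X(t)
print(estimate_availability(samples))       # 0.8
```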

Election and Availability

Bully Algorithm
The elected coordinator fails before it can report its
success
Coordinator is single point of failure
Ring Algorithm
If one node drops the message, election fails
Every node is a single point of failure
Election in Wireless Networks
Inner node of the tree fails
Thus, the tree will never be fully constructed and the
algorithm cannot terminate

Election and Consistency

Bully Algorithm
The network is partitioned
In each partition a node starts an election
Both nodes will come back with different results
Ring Algorithm
Either it works correctly or not at all, i.e.
it is consistent but not highly available
Election in Wireless Networks
Either it works correctly or not at all, i.e.
it is consistent but not highly available

Improved Bully Algorithm

Idea
Do not wait for the coordinator
At all times a node believes that the winner is the highest active node it has received a message from
If no message has been received, the node believes it is the winner itself
The algorithm is always available
At all times a node has a belief about who the winner is
The algorithm is still not consistent
Proof: Let all network communication fail
Now every node believes it is the winner itself
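
The belief rule and the inconsistency argument can be illustrated with a small sketch. It assumes that a node also counts itself as a candidate when it has only heard from lower-numbered nodes; `heard_from` is a hypothetical set of node numbers from which messages arrived recently.

```python
def believed_winner(own_number, heard_from):
    """Winner as seen by one node: the highest node heard from, or itself."""
    return max(heard_from | {own_number})


nodes = {1, 2, 3}
# Healthy network: every node hears from all others.
connected = {n: believed_winner(n, nodes - {n}) for n in sorted(nodes)}
# All communication fails: nobody hears from anybody.
isolated = {n: believed_winner(n, set()) for n in sorted(nodes)}
print(connected)   # {1: 3, 2: 3, 3: 3}
print(isolated)    # {1: 1, 2: 2, 3: 3}
```

Every node always has an answer (available), but in the isolated case the answers differ (not consistent).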

Partitioning

Partitioning means that nodes in the network cannot exchange messages
It is impossible to avoid partitioning in practice
Every packet drop is a partitioning
All networked computers may suffer from partitioning
Partitions cause problems for elections
Either no election is held because some nodes are alive but unreachable
Or an election is held although some nodes cannot vote
It seems that one has to choose between availability and consistency
CAP-Theorem

C = Strong Consistency
A = Availability
P = Partition-Tolerance

CAP Theorem
An algorithm can have at most two of the three properties C, A, P at a time

Gilbert and Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News (2002) vol. 33 (2) pp. 51-59

CAP-Theorem

Possible combinations
CAP is not possible
CP
The system is always consistent
But sometimes it cannot provide a result, e.g. no
election is possible
Example use case: Money transfer between banks
AP
The system always provides an answer
The answers obtained by multiple clients can differ
Example use case: Search engine
CA
Useless in a distributed system since P cannot be
avoided

CAP Theorem Informal Proof

Setting:
Three nodes in two partitions {A, B} and {C}.
One client writes on C.
Option 1:
C allows writing. C cannot talk with A, B.
Thus {A,B} and {C} are not consistent
Option 2:
C does not allow writing to avoid inconsistencies.
This implies the system is not available for writing
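
The two options can be played through in a toy model. The `write` function, the dictionary of stores and the flag `allow_partitioned_writes` are illustrative names only; a write is applied to the nodes in the writer's partition and nowhere else.

```python
def write(stores, partition, value, allow_partitioned_writes):
    """Try to write `value` on the nodes in `partition`.

    `stores` maps node name -> stored value. Returns True if accepted.
    """
    partitioned = set(partition) != set(stores)
    if partitioned and not allow_partitioned_writes:
        return False                  # Option 2: consistent, not available
    for node in partition:
        stores[node] = value          # Option 1: available, may diverge
    return True


def consistent(stores):
    """True if all nodes store the same value."""
    return len(set(stores.values())) == 1


option1 = {"A": 0, "B": 0, "C": 0}
print(write(option1, {"C"}, 1, True), consistent(option1))    # True False
option2 = {"A": 0, "B": 0, "C": 0}
print(write(option2, {"C"}, 1, False), consistent(option2))   # False True
```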

Leader Election in a Cloud Cluster

Cloud Cluster Hardware


Up to 1000 nodes, some are always broken
Cloud Application Software
Tasks must be distributed among the machines
Some repository must know who is doing what
Repository Requirements
Availability -> Replication
Consistency
Fault tolerance
All three together are not possible -> CAP Theorem!
To avoid data loss, use a CP algorithm

Paxos Algorithm

Paxos produces a distributed consensus with unreliable nodes
However, all nodes are trustworthy, i.e. nodes may fail but do not lie
Election is a special case of distributed consensus
Paxos produces a value on which all nodes agree
Every node can propose a value
Once there is a consensus, the value will never change
It is possible that no consensus can be reached, otherwise it would be a CAP algorithm
To build a repository we need to agree on a
sequence of values
This is achieved by Multi-Paxos

Paxos Roles
Client
Wants to propose or learn the consensus value
Proposer
Nodes that are allowed to propose values to the acceptors
Leader
A special proposer that does all the proposing
All other proposers remain silent
There should be only one leader
While there is more than one leader, the algorithm may fail to reach a consensus
Acceptor
Accepts proposals from the proposers and votes on them
Learner
Informs the client about the outcome of the consensus algorithm

In practice one node fulfils more than one role

Paxos Algorithm

The algorithm consists of two phases

Phase 1: Promise
Proposer sends a numbered proposal to the acceptors
Acceptors either promise to consider it or reject it
Phase 2: Accept
Proposer asks the acceptors to accept a value
Acceptors either accept or reject the value

Paxos: Preparations

Client wants to reach a consensus
If there is already a consensus, the client wants to learn it
If there is no consensus yet, the client tries to establish a consensus

A leader must be elected among the proposers
Multiple leaders do not break the system, but eventually a single one should emerge
-> An AP algorithm is sufficient
We can use the improved Bully algorithm

Paxos: Phase 1

Propose
Proposer selects a number N
Each proposer chooses strictly increasing numbers
Proposer sends the message prepare(N) to a majority of acceptors
Promise
If the acceptor has already received a proposal with number M>=N, it sends a reject message
If the acceptor has accepted a value V because of a previous proposal with number M<N, it sends a promise containing (M, V)
Otherwise, the acceptor sends a promise without a value, i.e. (0, 0)
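
The acceptor side of phase 1 can be sketched as a small class. This is an illustration under assumptions: replies are plain tuples, proposal numbers start at 1 so that 0 can mean "nothing yet", and the class and method names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest prepare number seen so far
        self.accepted = (0, 0)     # (M, V) of the last accepted proposal

    def on_prepare(self, n):
        """Handle prepare(N): reject, or promise and report (M, V)."""
        if self.promised >= n:
            return ("reject",)
        self.promised = n
        return ("promise",) + self.accepted


a = Acceptor()
print(a.on_prepare(1))       # ('promise', 0, 0)
a.accepted = (1, "x")        # pretend phase 2 stored (1, "x")
print(a.on_prepare(1))       # ('reject',)
print(a.on_prepare(2))       # ('promise', 1, 'x')
```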

Paxos: Phase 1

As in majority voting, the proposer waits for a limited time
If no majority has responded in time, no consensus can be reached
If the proposer got a reject or a reply (M, V) with M>=N, then there are multiple leaders. Choose a higher N and retry
If the proposer got only (0, 0) values, it can propose its own value in phase 2
If it got one or more (M, V) responses, it must propose the value V that came with the highest M (no new value can be proposed; instead it supports one that is already circulating)

Result of the first phase: the proposer knows which value to propose
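
The decision rule at the end of phase 1 fits into one function. The sketch assumes that a majority has already answered, that the replies are collected as a non-empty list of (M, V) pairs, and that returning `None` signals "retry with a higher N"; all names are illustrative.

```python
def choose_value(n, promises, own_value):
    """Decide what to propose in phase 2 from the phase 1 replies.

    `promises` is a list of (M, V) pairs, with (0, 0) meaning "no value yet".
    Returns the value to propose, or None if a higher N must be tried.
    """
    if any(m >= n for m, _ in promises):
        return None                       # another leader is active: retry
    m, v = max(promises, key=lambda p: p[0])
    # Only (0, 0) replies: free to propose the own value.
    # Otherwise: adopt the value that came with the highest M.
    return own_value if m == 0 else v


print(choose_value(3, [(0, 0), (0, 0)], "a"))               # a
print(choose_value(3, [(0, 0), (1, "x"), (2, "y")], "a"))   # y
print(choose_value(3, [(4, "z"), (0, 0)], "a"))             # None
```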
Paxos: Phase 2

Accept Request
Proposer sends (N, V) to a majority of acceptors
Accept
If an acceptor has already given a promise (phase 1) for some M>N, it rejects and does not accept
Otherwise, it stores (N, V) and sends an accepted message

Note: even if an acceptor has accepted a message, there may still be no consensus
Once an acceptor has accepted (N, V), it can still accept some (M, V2) if M>N

Paxos: Phase 2

An acceptor could accept (1, "Hallo") and later on accept (2, "Huhu") as well
Once a majority of acceptors has agreed on some value V, the result of phase 1 must always be to propose value V again
Hence, the consensus cannot change once it has been reached
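
A compact single-value sketch shows why a later proposer cannot overturn a value that a majority has accepted. It is a simplification, not a full implementation: there are no timeouts or failures, the proposer talks to all acceptors synchronously, and `None` marks "no value yet". All names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0
        self.accepted = (0, None)      # (M, V) of the last accepted proposal

    def prepare(self, n):
        if n <= self.promised:
            return None                # reject
        self.promised = n
        return self.accepted           # promise, reporting (M, V)

    def accept(self, n, value):
        if n < self.promised:
            return False               # a higher number was promised
        self.promised = n
        self.accepted = (n, value)
        return True


def propose(acceptors, n, own_value):
    """Run both phases; return the chosen value or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    m, v = max(promises, key=lambda p: p[0])
    value = own_value if m == 0 else v     # adopt a circulating value
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "Hallo"))   # Hallo
print(propose(acceptors, 2, "Huhu"))    # Hallo
```

The second proposer wanted "Huhu", but phase 1 reports the already accepted (1, "Hallo"), so it has to propose "Hallo" again.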

Paxos as Sequence Diagram

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)
   |         |<---------X--X--X       |  |  Promise(N,M,V)
   |         X--------->|->|->|       |  |  Accept!(N,V)
   |         |<---------X--X--X------>|->|  Accepted(N,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Paxos: An Acceptor Fails

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)
   |         |          |  |  !       |  |  !! FAIL !!
   |         |<---------X--X          |  |  Promise(N,M,V)
   |         X--------->|->|          |  |  Accept!(N,V)
   |         |<---------X--X--------->|->|  Accepted(N,V)
   |<---------------------------------X--X  Response
   |         |          |  |          |  |

Paxos: A Learner Fails

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1)
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |         |          |  |  |       |  !  !! FAIL !!
   |<---------------------------------X     Response
   |         |          |  |  |       |

Paxos: A Proposer Fails

Client  Leader          Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1)
   |      |             |  |  |       |  |
   |      |             |  |  |       |  |
   |      X------------>|  |  |       |  |  Accept!(1,Va)
   |      !             |  |  |       |  |  !! Leader fails
   |         |          |  |  |       |  |  !! NEW LEADER !!
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2,1,V)
   |         X--------->|->|->|       |  |  Accept!(2,V)
   |         |<---------X--X--X------>|->|  Accepted(2,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Client   Proposer       Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1)
   |      !             |  |  |       |  |  !! LEADER FAILS
   |         |          |  |  |       |  |  !! NEW LEADER
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2)
   |      |  |          |  |  |       |  |  !! OLD LEADER recovers
   |      |  |          |  |  |       |  |  !! OLD LEADER tries 2, denied
   |      X------------>|->|->|       |  |  Prepare(2)
   |      |<------------X--X--X       |  |  Nack(2)
   |      |  |          |  |  |       |  |  !! OLD LEADER tries 3
   |      X------------>|->|->|       |  |  Prepare(3)
   |      |<------------X--X--X       |  |  Promise(3)
   |      |  |          |  |  |       |  |  !! NEW LEADER proposes, denied
   |      |  X--------->|->|->|       |  |  Accept!(2,V)
   |      |  |<---------X--X--X       |  |  Nack(3)
   |      |  |          |  |  |       |  |  !! NEW LEADER tries 4
   |      |  X--------->|->|->|       |  |  Prepare(4)
   |      |  |<---------X--X--X       |  |  Promise(4)

Multi Paxos

Goal: A distributed data repository that is always consistent
Paxos: Distributed agreement for a single value
Multi-Paxos: Distributed agreement for a sequence of values V1, V2, V3, ...

The repository is a state machine
V1 specifies the first state transition
V2 specifies the next state transition
Since all nodes agree on the sequence Vi, all must agree on the current state of the state machine
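
The state machine argument can be illustrated with a tiny replicated repository. The sketch assumes that each agreed value Vi is a (key, value) assignment such as "task -> node"; the task and node names are invented for the example.

```python
def apply_all(initial_state, transitions):
    """Apply the agreed sequence V1, V2, ... to a copy of the initial state."""
    state = dict(initial_state)
    for key, value in transitions:     # each Vi is one state transition
        state[key] = value
    return state


# Hypothetical agreed sequence: who is doing which task.
log = [("task-1", "node-a"), ("task-2", "node-b"), ("task-1", "node-c")]
replicas = [apply_all({}, log) for _ in range(3)]
print(replicas[0])                               # {'task-1': 'node-c', 'task-2': 'node-b'}
print(all(r == replicas[0] for r in replicas))   # True
```

Since every replica applies the same sequence in the same order, all replicas end in the same state.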

Multi Paxos (Optimization)

Idea: Run the first step, i.e. Prepare/Promise, only once
Problem: Multiple proposers could propose the same number N, which is otherwise avoided by Prepare/Promise
Let I be the instance number of the leader
Hence (N, I) is unique, even if two proposers think they are leaders and both use the number N

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |  --- First Request ---
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)
   |         |<---------X--X--X       |  |  Promise(N,I)
   |         X--------->|->|->|       |  |  Accept!(N,I,Vn)
   |         |<---------X--X--X------>|->|  Accepted(N,I,Vn)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |
Multi Paxos

Here the leader wants to get an accept for number (N+1, I) without getting a promise for (N+1) first

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |  --- Following Requests ---
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Accept!(N+1,I,W)
   |         |<---------X--X--X------>|->|  Accepted(N+1,I,W)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |
