Consensus & Agreement: Arvind Krishnamurthy Fall 2003
Group Communication
- Unicast messages: from a single source to a single destination.
- Multicast messages: from a single source to multiple destinations (designated as a group).
Issues:
- Fault tolerance: two kinds of faults in distributed systems.
  - Crash faults (also known as fail-stop or benign faults): the process fails and simply stops operating.
  - Byzantine faults: the process fails and acts in an arbitrary manner (or a malicious agent is trying to bring down the system).
- Ordering: achieve some kind of consistency in how messages of different multicasts are delivered to the processes.
Basic Multicast
- Channels are assumed to be reliable (they do not corrupt messages and deliver them exactly once).
- A straightforward way to implement B-multicast is to use a reliable one-to-one send operation, as in the sketch below.
- A basic multicast primitive guarantees that a correct process will eventually deliver the message, as long as the multicaster (sender) does not crash.
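A minimal sketch of this construction in Python; send, group, and B_deliver are assumed primitives (illustrative, not from the slides):

    # B-multicast on top of a reliable one-to-one send (sketch).
    # send(p, m) reliably delivers m to process p exactly once;
    # B_deliver(m) hands m to the application layer.

    def B_multicast(group, m, send):
        for p in group:              # the sender itself is a member of group
            send(p, m)

    def on_receive(m, B_deliver):
        B_deliver(m)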
Reliable Multicast
Desired properties:
- Integrity: a correct (i.e., non-faulty) process p delivers a message m at most once.
- Validity: if a correct process multicasts message m, then it will eventually deliver m. (Local liveness.)
- Agreement: if a correct process delivers message m, then all the other correct processes in group(m) will eventually deliver m. (All-or-nothing property.)
- Validity and agreement together ensure overall liveness.
Question: how do you build reliable multicast using basic multicast? One answer is sketched below.
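One standard answer, sketched under the assumption that B_multicast and R_deliver exist as primitives: each process re-B-multicasts a message the first time it B-delivers it, so that if the original sender crashes partway through, some correct recipient completes the multicast.

    received = set()                         # identifiers of messages already handled

    def R_multicast(group, m, B_multicast):
        B_multicast(group, m)                # the sender is a member of group

    def on_B_deliver(group, sender, me, m, B_multicast, R_deliver):
        if m.id in received:
            return                           # Integrity: deliver each message at most once
        received.add(m.id)
        if sender != me:
            B_multicast(group, m)            # relay, so every correct process eventually gets m
        R_deliver(m)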
Ordered Multicast
- FIFO ordering: if a correct process issues multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' will deliver m before m'.
- Causal ordering: if multicast(g, m) happened-before multicast(g, m'), then any correct process that delivers m' will deliver m before m'.
- Total ordering: if a correct process delivers message m before m', then any other correct process that delivers m' will deliver m before m'.
Relationships:
- Causal ordering implies FIFO ordering.
- Causal ordering does not imply total ordering.
- Total ordering does not imply causal ordering.
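To make FIFO ordering concrete, a minimal sketch of delivery at one receiver using per-sender sequence numbers and a hold-back queue (all names here are illustrative assumptions):

    import collections

    next_expected = collections.defaultdict(int)   # next sequence number expected per sender
    holdback = collections.defaultdict(dict)       # sender -> {sequence number: message}

    def on_fifo_receive(sender, seq, m, deliver):
        holdback[sender][seq] = m
        # Deliver this sender's messages strictly in sequence-number order.
        while next_expected[sender] in holdback[sender]:
            deliver(holdback[sender].pop(next_expected[sender]))
            next_expected[sender] += 1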
Implementing total ordering: multicast the message, solicit sequence numbers from the processes, then multicast an agreed sequence number computed from the solicited values.
[Figure 11.15: the ISIS algorithm for total ordering. A process multicasts a message to the group; the members reply with proposed sequence numbers; the sender then multicasts the agreed sequence number.]
A_q^g: the largest agreed sequence number process q has observed for group g. P_q^g: q's own largest proposed sequence number.
1. Process p B-multicasts <m, i> to g, where i is a unique identifier for m. 2. Each process q replies to the sender p with a proposal for the message's agreed sequence number: P_q^g := Max(A_q^g, P_q^g) + 1.
[Figure: message processing at each member. Incoming messages are placed on a hold-back queue and moved to the delivery queue, where they are delivered, once the delivery guarantees are met.]
3. p collects all the proposed sequence numbers and selects the largest as the next agreed sequence number, a. It B-multicasts <i, a> to g.
4. Each recipient q sets A_q^g := Max(A_q^g, a), attaches a to the message, and reorders its hold-back queue.
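A compact sketch of steps 1-4 from the point of view of a group member q; the hold-back queue API and message formats are assumptions:

    # ISIS-style sequence-number agreement at process q for group g (sketch).
    A = 0   # A_q^g: largest agreed sequence number seen so far
    P = 0   # P_q^g: largest sequence number proposed so far

    def on_b_deliver_message(msg_id, m, holdback_queue, reply_to_sender):
        # Step 2: hold the message back and propose the next sequence number.
        global P
        P = max(A, P) + 1
        holdback_queue.add(msg_id, m, proposed=P)
        reply_to_sender(msg_id, P)

    def on_b_deliver_agreed(msg_id, a, holdback_queue):
        # Step 4: record the agreed number, reorder the hold-back queue, and deliver
        # any message at the front whose sequence number has been agreed.
        global A
        A = max(A, a)
        holdback_queue.set_agreed(msg_id, a)
        holdback_queue.deliver_ready()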
Consensus
Consensus: N processes agree on a value.
- For example, a synchronized action (go / abort).
- Consensus may have to be reached in the presence of failure:
  - Process failure: process crash (fail-stop failure) or arbitrary failure.
  - Communication failure: lost or corrupted messages.
In a consensus algorithm:
- All Pi start in an undecided state.
- Each Pi proposes a value vi from a set D and communicates it to some or all other processes.
- A consensus is reached if all non-failed processes agree on the same value, d.
- Each non-failed Pi sets its decision variable to d and changes its state to decided.
Consensus Requirements
Termination: eventually each correct process sets its decision value.
Agreement: the decision value is the same for all correct processes, i.e., if pi and pj are correct and have entered the decided state, then di = dj.
Integrity: if all correct processes propose the same value d, then any correct process in the decided state has decision value d.
Interactive consistency: the processes agree on a vector of values, one value for each process.
[Figure: interactive consistency with four processes. P1, P2, P3 propose v1 = 5, v2 = 7, v3 = 2; P4 has crashed. Running the consensus algorithm, each correct process Pi decides the vector di = (5, 7, 2, -).]
Each general communicates its value to the others (assuming reliable communication). Once each general has collected all values, it determines the right value (attack or retreat).
Problem Equivalence
Interactive consistency (IC) can be solved given a solution to the Byzantine Generals (BG) problem:
- Run BG n times, once with each process acting as the commander.
Consensus (C) can be solved given a solution to IC:
- Run IC to produce a vector of values at each process.
- Apply the majority function to the vector; the resulting value is the consensus value (if there is no majority, choose a bottom value).
BG can be solved given a solution to C:
- The commander sends its proposed value to itself and to each of the other generals.
- All processes run C with the values received.
- The resulting consensus value is the value required by BG.
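The proof of correctness that follows concerns the standard synchronous algorithm that tolerates up to f crash faults by flooding values for f + 1 rounds and then applying a fixed choice rule. A minimal sketch, with broadcast and receive_round as assumed round primitives:

    def crash_consensus(my_value, f, broadcast, receive_round):
        values = {my_value}
        for r in range(f + 1):                 # f + 1 synchronous rounds
            broadcast(values)                  # (sending only newly learned values also works)
            for received in receive_round(r):  # sets of values received from other processes
                values |= received
        return min(values)                     # every surviving process applies the same rule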
Proof of correctness
Proof by contradiction:
- Assume that two processes differ in their final set of values; say pi possesses a value v that pj does not possess.
- Then some third process pk sent v to pi but crashed before sending v to pj.
- In turn, any process sending v in the previous round must have crashed; otherwise, both pk and pj would have received v.
- Proceeding in this way, we infer at least one crash in each of the preceding rounds. But we have assumed that at most f crashes can occur and there are f + 1 rounds, a contradiction.
A faulty process may send any message with any value at any time; or it may omit to send any message. In the case of arbitrary failure, no solution exists if N<=3f.
[Figure: two indistinguishable scenarios with N = 3 and f = 1. Left: a correct commander p1 sends v to p2 and p3, but faulty p3 relays 3:1:u to p2. Right: a faulty commander p1 sends w to p2 and x to p3, and p3 relays 3:1:x to p2. p2 receives conflicting reports in both cases and cannot tell which process is faulty.]
Solution
- To solve the Byzantine generals problem in a synchronous system, we require N >= 3f + 1.
- Consider N = 4, f = 1:
  - In the first round, the commander sends a value to each of the other generals.
  - In the second round, each of the other generals sends the value it received to its peers.
  - The correct generals then need only apply a simple majority function to the set of values received.
[Figure: two scenarios with N = 4 and f = 1. Left: a correct commander p1 sends v to p2, p3, p4; faulty p3 relays conflicting values (3:1:u, 3:1:w), but p2 and p4 still compute majority{v, v, u} = majority{v, v, w} = v. Right: a faulty commander p1 sends u, w, v to p2, p3, p4; the correct generals exchange what they received and each computes majority{u, w, v}, which has no majority, so they all choose the same default.]
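The majority step used by each correct general can be sketched as follows; Counter is from the Python standard library, and the comments refer to the two scenarios in the figure above:

    from collections import Counter

    def simple_majority(values, default=None):
        # Return the value held by a strict majority of the inputs, else a default.
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) // 2 else default

    # Left scenario: p2 holds {v, v, u}, so simple_majority returns v.
    # Right scenario: each correct general holds {u, w, v}, so there is no
    # majority and all of them fall back to the same default value.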
Each processor maintains a tree data structure in its local state. Each node of the tree is labeled with a sequence of processor indices with no repeats:
- The root's label is the empty sequence.
- The root has n children, labeled 0 through n-1.
- A child node labeled i has n-1 children, labeled 0 through i-1 and i+1 through n-1.
- In general, a node at level d with label v has n-d children, skipping any index already present in v.
- Nodes at level f+1 are the leaves.
Each processor fills in the tree nodes with values as the rounds go by:
- Initially, store your input in the root (level 0).
- Round 1: send level 0 of your tree (the root); store the value received from pj in node j (level 1).
- Round 2: send level 1 of your tree; store the value received from pj for node k in node k:j (level 2).
- In the last round, each processor uses the values in its tree to compute its decision.
The decision is resolve(root), where resolve(π) equals:
- the value in the tree node labeled π, if π is a leaf;
- majority{ resolve(π') : π' is a child of π }, otherwise.
[Figure: an example tree (level-1 nodes not shown). Values stored at the level-2 nodes: 0:1 = 5, 0:2 = 5, 0:3 = 3, 1:0 = 5, 1:2 = 5, 1:3 = 2, 2:0 = 5, 2:1 = 5, 2:3 = 4, 3:0 = 6, 3:1 = 7, 3:2 = 8.]
Assume that nodes 0, 1, and 2 are legitimate and contribute value 5, and that node 3 is Byzantine.
Resolving nodes
[Figure: resolving the example tree. The subtrees rooted at 0, 1, and 2 each resolve to the majority value 5; the subtree rooted at 3 resolves to majority{6, 7, 8}, which has no majority; the root resolves to 5.]
- Resolving a leaf node: return the value of the node.
- Resolving an internal node: return the majority value of its children's resolved values.
- Decision by a processor: resolve the root.
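A sketch of the resolve rule, with the tree stored as a dict mapping label tuples to stored values (this representation is an assumption, not something given in the slides):

    from collections import Counter

    def resolve(tree, label, n, default=None):
        # label is a tuple of processor indices; the empty tuple () labels the root.
        children = [label + (j,) for j in range(n)
                    if j not in label and label + (j,) in tree]
        if not children:                       # leaf: return the stored value
            return tree[label]
        resolved = [resolve(tree, c, n, default) for c in children]
        value, count = Counter(resolved).most_common(1)[0]
        return value if count > len(resolved) // 2 else default

    # Decision of a processor: resolve(tree, (), n)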
Proof of algorithm
Resolve Lemma: a non-faulty processor pi's resolved value for a node labeled π = π'j, where pj is non-faulty, equals what pj has stored for π'.
Proof: by induction on the height of π.
Basis: π is a leaf.
- Then pi stores in node π what pj sends it for π' in the last round.
- For leaves, the resolved value is the tree value.
Proof (contd.)
Induction: π is not a leaf.
- By the tree definition, π has at least n - f children.
- Since n > 3f, π has a majority of children whose labels end in non-faulty processor indices.
- Let πk be such a child of π, with pk non-faulty.
- Since pj is non-faulty, pj correctly reports to pk that it has some value v in node π'; thus pk stores v in node π = π'j.
- By induction, pi's resolved value for πk equals the value v that pk has in its tree node.
- So all of π's non-faulty children resolve to v in pi's tree, and thus π resolves to v in pi's tree.
Proof (contd.)
[Figure: non-faulty pj reports the value v in node π' to non-faulty pk, which stores v in node π'j; a majority of the children of π'j end in non-faulty indices, so non-faulty pi resolves π'j to v.]
Proof of Validity
- A non-faulty processor pi decides on resolve(root), which is the majority among resolve(j) for j = 0, ..., n-1.
- The Resolve Lemma implies that, for each non-faulty pj, pi's resolve(j) equals the value stored at the root of pj's tree.
- The value stored at pj's root is pj's input, which is v (all correct processors start with the same input v).
- Thus pi decides v.
Proof of Agreement
- Show that all non-faulty processors resolve to the same value for their tree roots.
- A node is common if all non-faulty processors resolve it to the same value. (We need to show that the root is common.)
Strategy:
- Show that every node with a certain property is common.
- Show that the root has the property.
Lemma: if every π-to-leaf path contains a common node, then π is common.
Proof by induction on the height of π:
Basis: π is a leaf. Then every π-to-leaf path consists solely of π, and since the path is assumed to contain a common node, that node must be π itself.
Lemma (contd.)
Induction step: π is not a leaf.
- Suppose, for contradiction, that π is not common.
- Then every child π' of π has the property that every π'-to-leaf path has a common node.
- Since the height of π' is smaller than the height of π, the inductive hypothesis implies that π' is common.
- Therefore, all non-faulty processors compute the same resolved value for every child of π, and thus π is common, contradicting the supposition.
The root has the property:
- There are f+2 nodes on a root-to-leaf path.
- The label of each non-root node on a root-to-leaf path ends in a distinct processor index (the processor from which the value is received); there are f+1 such indices and at most f faulty processors.
- So at least one of these indices is that of a non-faulty processor.
- The Resolve Lemma implies that the node whose label ends with a non-faulty processor index is a common node.
- The message size can be reduced with a simple algorithm that increases the number of processors to n > 4f and the number of rounds to 2(f + 1).
- Phase King Algorithm: uses f + 1 phases, each taking two rounds.
Code for pi:
  pref = my input
First round of phase k:
  send pref to all
  receive prefs of others
  let maj be the value that occurs > n/2 times among all prefs (0 if none)
  let mult be the number of times maj occurs
Algorithm (contd.)
Second round of phase k:
  if my_proc == k then send maj to all
  receive tie-breaker from pk
  if mult > n/2 + f then pref = maj
  else pref = tie-breaker
  if k == f+1 then decide pref
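Putting the two rounds together, one phase of Phase King as a Python sketch; exchange_prefs and king_round stand in for the round-based message exchange and are assumptions:

    from collections import Counter

    def phase_king_phase(k, my_id, pref, n, f, exchange_prefs, king_round):
        # Round 1: everyone exchanges preferences and computes the majority (0 if none).
        prefs = exchange_prefs(pref)            # list of n preferences, mine included
        maj, mult = Counter(prefs).most_common(1)[0]
        if mult <= n // 2:
            maj = 0
        # Round 2: the king of phase k broadcasts its majority value as the tie-breaker.
        tie_breaker = king_round(k, maj if my_id == k else None)
        # Keep the majority only if its multiplicity exceeds n/2 + f.
        return maj if mult > n // 2 + f else tie_breaker

    # A processor runs phases 1 through f+1 and decides on its preference after the last.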
Lemma: If all non-faulty processors prefer v at start of phase k, then all do at end of phase k. Proof:
- Each non-faulty processor receives at least n - f preferences for v (including its own) in the first round of phase k.
- Since n > 4f: n/2 > 2f, so n - n/2 > f + f, and hence n - f > n/2 + f.
- Thus each non-faulty processor still sees a majority with multiplicity above n/2 + f for v, and continues to prefer v.
Proof (contd.)
Lemma: if the king of phase k is non-faulty, then all non-faulty processors have the same preference at the end of phase k.
Proof: consider two non-faulty processors pi and pj.
- Case 1: pi and pj both use pk's tie-breaker. Since pk is non-faulty, they agree.
- Case 2: pi uses its own majority value v and pj uses the king's tie-breaker.
  - pi receives more than n/2 + f preferences for v, so more than n/2 of them come from non-faulty processors.
  - Hence pk also receives more than n/2 preferences for v, and pk's tie-breaker is v.
Proof (contd.)
- Case 3: pi and pj both use their own majority values.
  - pi's majority value is v, so pi receives more than n/2 + f preferences for v.
  - Hence pj receives more than n/2 preferences for v, so pj's majority value is also v.
- Since there are f + 1 phases, at least one phase has a non-faulty king.
- At the end of that phase, all non-faulty processors have the same preference.
- From that phase onward (by the first lemma), the non-faulty preferences stay the same.
- Thus the decisions are the same.
Fischer-Lynch-Paterson (1985)
No completely asynchronous consensus protocol can tolerate even a single unannounced process death
Assumptions
Fail-stop failure:
- The impossibility result then holds a fortiori for Byzantine failures.
Reliable communication:
- Messages are delivered correctly and exactly once.
Asynchronous:
- No assumptions regarding the relative speeds of processes or the delay time in delivering a message.
- No synchronized clocks, so algorithms based on time-outs cannot be used.
- No ability to detect the death of a process.
Requirement:
- A non-faulty process decides on a value in {0, 1} and stores the value in a write-once output register.
- All non-faulty processes that make a decision must choose the same value.
- For the proof, it is enough to assume that some process eventually makes a decision (a weaker requirement).
- The value 0 cannot be chosen arbitrarily: both 0 and 1 must be possible outcomes (non-triviality).
Notation
- A configuration consists of the internal state of each process together with the messages that have been sent but not yet delivered.
- A step (event) of a process p:
  - First, receive(p) returns a message m.
  - Based on p's internal state and m, p enters a new internal state and sends a finite set of messages to other processes.
A schedule from C
- A finite or infinite sequence σ of events that can be applied, in turn, starting from C.
- The associated sequence of steps is called a run.
- σ(C) denotes the resulting configuration, which is said to be reachable from C.
An accessible configuration:
- a configuration C that is reachable from some initial configuration.
Lemma 1
Suppose that from some configuration C, the schedules σ1 and σ2 lead to configurations C1 and C2, respectively.
- If the sets of processes taking steps in σ1 and σ2 are disjoint, then σ2 can be applied to C1 and σ1 can be applied to C2, and both lead to the same configuration: σ2(σ1(C)) = σ1(σ2(C)).
Definitions
- A process is non-faulty in a run if it takes infinitely many steps; otherwise it is faulty.
- A configuration C has decision value v if some process p is in a decision state with its output register containing v.
- Deciding run: a run in which some process reaches a decision state.
- Admissible run: at most one process is faulty and all messages sent to non-faulty processes are eventually received.
- Bivalent, 0-valent/1-valent: a configuration C is bivalent if both 0 and 1 are decision values of configurations reachable from C; it is 0-valent (1-valent) if only 0 (only 1) is reachable.
Correctness
- No trivial solutions: some accessible configurations lead to decision value 0 and some lead to decision value 1.
- No accessible configuration has more than one decision value.
- Every admissible run is a deciding run.
Theorem 1
- There must be some initial configuration that is bivalent.
- Consider some event e = (p, m) that is applicable to a bivalent configuration C:
  - Consider the set of configurations reachable from C without applying e (call this set 𝒞).
  - Apply e to each of these configurations to get the set D.
  - Show that D contains a bivalent configuration.
- Construct an infinite sequence of stages, where each stage starts with a bivalent configuration and ends with a bivalent configuration.
Lemma 2
Lemma: some initial configuration is bivalent. Proof (by contradiction):
- Consider the initial configurations C1 = {0, 0, 0, ..., 0} and C2 = {1, 1, 1, ..., 1}, and suppose every initial configuration is univalent.
- Walking from C1 to C2 by changing one process's input at a time, there must be two adjacent initial configurations C3 and C4 such that C3 is 0-valent, C4 is 1-valent, and some processor p changed its input value from 0 to 1 between them.
- Consider some admissible deciding run from C3 involving no p-events, and let σ be the associated schedule.
- Apply σ to C4: no process other than p can distinguish C3 from C4, so the resulting decision must again be 0. This contradicts C4 being 1-valent.
Lemma 3
Let C be a bivalent configuration and let e = (p, m) be an event that is applicable to C. Let 𝒞 be the set of configurations reachable from C without applying e, and let D = e(𝒞) = { e(E) | E in 𝒞 }. Then D contains a bivalent configuration.
Graphical Representation
[Figure: the bivalent configuration C (valency {0, 1}) and the configurations C1-C11 reachable from it; 𝒞 is the set of those reachable without applying e, and D = e(𝒞).]
Proof
Proof (by contradiction): assume D contains no bivalent configuration, so every member of D is 0-valent or 1-valent.
- Since C is bivalent, there is a 0-valent configuration E0 reachable from C.
- If E0 belongs to 𝒞, then F0 = e(E0) belongs to D.
- If E0 does not belong to 𝒞, then e was applied on the way to E0, so there is an F0 in D from which E0 is reachable.
- In either case there is an F0 in D that is 0-valent.
- Similarly, there exists an F1 in D that is 1-valent.
- So D contains both 0-valent and 1-valent configurations.
Two Cases
[Figure: the two cases for locating a 0-valent member and a 1-valent member of D among the configurations reachable from C.]
Proof (contd.)
There exist two adjacent configurations G0 and G1 in 𝒞, with G1 = e'(G0) for some event e' = (p', m'), such that e(G0) is 0-valent and e(G1) is 1-valent.
[Figure: applying e to the configurations in 𝒞 gives 0-valent results in some places and 1-valent results in others, so somewhere two adjacent configurations G0 and G1 must differ in the valency of e(G0) and e(G1).]
Proof (contd.)
[Figure: e applied to G0 yields the 0-valent D0; the event e' = (p', m') takes G0 to G1, and e applied to G1 yields the 1-valent D1.]
- Let e' = (p', m') be the event that transforms G0 into G1, and first suppose p' != p. (Recall that p is the processor with the delayed message, i.e., the delayed event e.)
- Then e' is applicable to D0 and transforms D0 into D1 (commutativity lemma).
- What does this imply? The 1-valent D1 would be reachable from the 0-valent D0, which is impossible; so this case cannot occur.
Proof (contd.)
[Figure: the case p' = p.]
If p' is the same as p: consider some configuration A that is reachable from G0 by a deciding run involving no events of p, and let σ be the associated schedule.
Proof (contd.)
[Figure: since σ contains no steps of p, it commutes with e and with e' followed by e (Lemma 1); hence E0 = e(A) = σ(D0) is 0-valent and E1 = e(e'(A)) = σ(D1) is 1-valent. Both are reachable from the deciding, hence univalent, configuration A, a contradiction.]
Proof Wrapup
Construct an admissible non-deciding run:
- Start with a bivalent initial configuration.
- Repeatedly delay some message; we can always find some other bivalent configuration that is reached by eventually delivering that message.
- Every configuration in the sequence is bivalent, so no process ever decides.
- Yet the run is admissible: no processor fails, each processor executes infinitely many steps, and all messages sent to a processor are delivered in finite time.
[Figure: pending messages (m0, m6, m9-m13) queued at processors P0-P3, before and after blocking a message for the next processor.]
Block a message for the next processor and construct another possible bivalent configuration. The construction can go on forever:
- No faults occur (each processor takes infinitely many steps; messages are delivered in finite time).
- The construction always moves from one bivalent configuration to another bivalent configuration.
Paxos Consensus
Assume a collection of processes that can propose values; the goal is to choose a single value among those proposed.
Requirements:
- Only a value that has been proposed may be chosen.
- Only a single value is chosen.
Model:
- A single process may act as more than one agent (proposer, acceptor, learner).
- Messages are asynchronous.
- Agents operate at arbitrary speed, may fail by stopping, and may restart. (If agents fail and restart, assume that there is non-volatile storage.)
- Guarantee safety, not liveness.
Simple solutions
A single acceptor:
- The acceptor chooses the first proposed value and rejects all subsequent values.
- Failure of the acceptor means no further progress.
Multiple acceptors:
- The proposer sends a value to a large enough set of acceptors.
- What is large enough? Some majority of the acceptors, which implies that only one value can be chosen, because any two majorities have at least one acceptor in common.
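The "large enough" criterion as a one-line check (illustrative only):

    def is_chosen(acceptors_that_accepted_v, all_acceptors):
        # A value is chosen once a strict majority of the acceptors have accepted it;
        # any two majorities share at least one acceptor.
        return len(acceptors_that_accepted_v) > len(all_acceptors) // 2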
No liveness requirements:
- If a proposal does not succeed, you can always start a new proposal.
The basic actions:
- Proposing a value.
- Accepting a value.
- Choosing a value (a value is chosen once a majority of acceptors accept it).
Refinements
- If acceptors can accept more than one proposal, multiple proposals could be chosen.
- P1: all of the chosen proposals must have the same value!
  - This trivially satisfies the condition that only a single value is chosen.
  - It requires coordination between proposers and acceptors.
More refinements
A proposal numbered n with value v is issued only if there is a set S consisting of a majority of acceptors such that either:
- no acceptor in S has accepted any proposal numbered less than n, or
- v is the value of the highest-numbered proposal among all the proposals numbered less than n accepted by the acceptors in S.
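A sketch of how a proposer can honor this condition in phase 1: after hearing from a majority S of acceptors, it adopts the value of the highest-numbered accepted proposal reported in the acks, and proposes a fresh value only if no acceptor in S has accepted anything. The message fields used here are assumptions:

    def choose_value_for_proposal(acks, my_value):
        # acks: phase-1 responses from a majority S; each carries the highest-numbered
        # proposal (number, value) the acceptor has accepted, or None.
        accepted = [a.accepted for a in acks if a.accepted is not None]
        if not accepted:
            return my_value                          # no constraint: propose any value
        number, value = max(accepted, key=lambda nv: nv[0])
        return value                                 # value of the highest-numbered proposal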
One can satisfy P4 by maintaining the invariant P5. How does one enforce P5?
(4) If the acceptor receives a request (propose, n, v), it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
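For completeness, a sketch of the acceptor side, keeping the two pieces of state that the persistent-storage note later says must survive restarts (class and message names are assumptions):

    class Acceptor:
        def __init__(self):
            self.promised = -1       # highest prepare number acked (to be persisted)
            self.accepted = None     # highest-numbered (number, value) accepted (persisted)

        def on_prepare(self, n):
            # Ack a prepare only if n exceeds every number promised so far,
            # reporting the highest-numbered proposal accepted so far (if any).
            if n > self.promised:
                self.promised = n
                return ("ack", n, self.accepted)
            return None              # ignore, or tell the proposer to back off

        def on_propose(self, n, v):
            # Accept unless a prepare with a number greater than n has been acked.
            if n >= self.promised:
                self.accepted = (n, v)
                return ("accepted", n, v)
            return None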
In Well-Behaved Runs
[Figure: message flow in a well-behaved run. A proposer P1 exchanges a prepare/ack round and a propose/accept round with the acceptors A1 ... An, after which the learners L1, L2, ..., Lk are informed.]
Example
[Figure: proposer P1 sends (prepare, proposal #2) to the acceptors A1-A6.]
Example
[Figure: the acceptors reply to P1 with (ack, 2, -, -); none of them has accepted any proposal yet.]
Example
[Figure: proposer P2 sends (prepare, proposal #1) to the acceptors.]
Example
[Figure: proposer P3 sends (prepare, proposal #3) to the acceptors, which reply with (ack, 3, -, -).]
Example
[Figure: P1 sends (propose, proposal #2, value 99); some acceptors accept (2, 99) and inform the learners L1-L3.]
Example
[Figure: P4 sends (prepare, proposal #4) while P3 sends (propose, proposal #3, value 42); acceptors accept (3, 42) and inform the learners L1-L3.]
Example
[Figure: P5 sends (prepare, proposal #5) and P4 sends (propose, proposal #4, value 42). The acceptors ack P5's prepare with the highest-numbered proposals they have accepted: (ack, 5, 2, 99) from some and (ack, 5, 3, 42) from others.]
Example
[Figure: P5 sends (propose, proposal #5, value 42), adopting the value of the highest-numbered accepted proposal reported in the acks; the acceptors accept (5, 42) and the learners L1-L3 are informed.]
- A proposer can abandon a proposal in the middle of the protocol at any time. (It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one.)
- If an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, it should probably inform the proposer, who should then abandon its proposal.
Persistent storage:
- Each acceptor needs to remember the highest-numbered proposal it has accepted and the highest-numbered prepare request that it has acked.
Progress
It is easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which is ever chosen:
- P completes phase 1 for a proposal numbered n1.
- Q completes phase 1 for a proposal numbered n2 > n1.
- P's accept requests in phase 2 are ignored by some of the acceptors (which have promised not to accept proposals numbered below n2).
- P begins a new proposal with a proposal number n3 > n2.
- And so on.