Lecture 14
CIS 5050
Software Systems
Plan for today
• Fault models
– Crash faults
– Rational faults
– Byzantine faults
• State-machine replication
– Consensus protocols
– FLP impossibility
• Paxos
– Intuition
– Algorithm
©2016-2024 Linh Thi Xuan Phan
Faults and failures
[Figure: a client issues "Set X:=5" and then asks "What is X?" in three scenarios. Correct: the answer is X=5. Fault (masked): the answer is still X=5. Faults causing failure: the answer is X=3.]
• Terminology:
– Fault: Some component is not working correctly
– Failure: System as a whole is not working correctly
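A minimal sketch of the fault/failure distinction. It assumes (this is not stated on the slide) that the client reads X from three replicas and takes the majority answer; the function name read_majority is hypothetical.

```python
from collections import Counter

def read_majority(replies):
    """Return the value reported by a strict majority of replicas, or None."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) / 2 else None

# No faults: every replica reports the value that was written.
print(read_majority([5, 5, 5]))   # 5
# One faulty replica: a fault exists, but the majority masks it.
print(read_majority([5, 3, 5]))   # 5
# Two faulty replicas: the faults now cause a failure (wrong answer).
print(read_majority([3, 3, 5]))   # 3
```

In the second call a component is broken but the system as a whole still answers correctly; in the third call the system as a whole is wrong.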
Faults in distributed systems
• What could possibly go wrong?
– Node loses power
– Hard disk fails
– Administrator accidentally erases data
– Administrator configures node incorrectly
– Software bug is triggered
– Network overloaded, drops lots of packets
– Hacker breaks into some of the nodes
– Disgruntled employee manipulates node
– Fire breaks out in data center where node resides
– Police confiscate node because of illegal activity
– ...
Common misconceptions about faults
• "Faults are rare exceptions"
– NO! At scale, faults are occurring all the time
– Stopping the system while handling the fault is NOT an option -
system needs to continue despite the fault
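A back-of-the-envelope sketch of why faults are not rare at scale. It assumes each node is faulty independently with the same probability; both the independence and the value 0.001 are illustrative assumptions, not figures from the lecture.

```python
def p_any_fault(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes is faulty, assuming independence."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# Hypothetical per-node fault probability of 0.1% over some time window.
for n in (1, 100, 10_000):
    print(n, p_any_fault(0.001, n))
```

With one node a fault is unlikely; with ten thousand nodes, at least one fault in the window is close to certain under these assumptions.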
Types of faults
• Crash faults
– Node simply stops
– Examples: OS crash, power loss
• Rational behavior
– Owner manipulates node to increase profit
– Example: Traffic attraction attack (see next slide)
• Byzantine faults
– Arbitrary - faulty node could do anything
(stop, tamper with data, tell lies, attack
other nodes, send spam, spy on user...)
– Example: Node compromised by a hacker,
data corruption, hardware defect...
Goldberg et al., "Rationality and Traffic Attraction: Incentives for honestly announcing paths in BGP", SIGCOMM 2010
Rational fault example
[Figure: Alice says "I need connectivity!". Her provider asks "Who knows how to get to YouTube?". One neighbor answers "I have a good route", another answers "I have an okay route". The links are annotated with dollar signs.]
• Example: Interdomain routing with BGP
– Control over the system is distributed
– Alice's provider can choose between several routes to the same destination
Rational fault example
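[Figure: the neighbor whose route was not chosen thinks "I wish my route had been chosen" and announces "I have a GREAT route to YouTube". The links are annotated with dollar signs.]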
http://status.aws.amazon.com/s3-20080720.html
Source: http://www.justice.gov/criminal/cybercrime/duronioIndict.htm
Some examples of Byzantine faults
Disgruntled UBS PaineWebber Employee Charged with Allegedly Unleashing “Logic Bomb” on
Company Computers
NEWARK - A disgruntled computer systems administrator for UBS PaineWebber was charged today with
using a “logic bomb” to cause more than $3 million in damage to the company’s computer network, and with
securities fraud for his failed plan to drive down the company’s stock with activation of the logic bomb, U.S.
Attorney Christopher J. Christie announced. Roger Duronio, 60, of Bogota, N.J., was charged today in a
two-count Indictment returned by a federal grand jury, according to Assistant U.S. Attorney William Devaney.
The Indictment alleges that Duronio, who worked at PaineWebber’s offices in Weehawken, N.J., planted the
logic bomb in some 1,000 of PaineWebber’s approximately 1,500 networked computers in branch offices
around the country. Duronio, who repeatedly expressed dissatisfaction with his salary and bonuses at Paine
Webber resigned from the company on Feb. 22, 2002. The logic bomb Duronio allegedly planted was
activated on March 4, 2002. In anticipation that the stock price of UBS PaineWebber’s parent company,
UBS, A.G., would decline in response to damage caused by the logic bomb, Duronio also purchased more
than $21,000 of “put option” contracts for UBS, A.G.’s stock, according to the charging document. A put
option is a type of security that increases in value when the stock price drops. Market conditions at the time
suggest there was no such impact on the UBS, A.G. stock price. [...] The Indictment alleges that, from about
November 2001 to February, Duronio constructed the logic bomb computer program. On March 4, as
planned, Duronio’s program activated and began deleting files on over 1,000 of UBS PaineWebber’s
computers. It cost PaineWebber more than $3 million to assess and repair the damage, according to the
Indictment. As one of the company’s computer systems administrators, Duronio had responsibility for, and
access to, the entire UBS PaineWebber computer network, according to the Indictment. He also had access
to the network from his home computer via secure Internet access. [...]
Correlated faults
• A single problem can cause many faults
– Example: Overloaded machine crashes, increases load on other
machines → domino effect
– Example: Bug is triggered in a program that is used on lots of
machines
– Example: Hacker manages to break into many computers due to a
shared vulnerability
– Example: Machines may be connected to the same power grid,
cooled by the same A/C, managed by the same admin
– ...
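A toy sketch of the domino effect from the first example. It assumes the total load is split evenly across the surviving machines and that any machine whose share exceeds its capacity crashes; the function name and the numbers are hypothetical.

```python
def surviving_machines(total_load, capacity, n_machines):
    """Count the machines left once no surviving machine is overloaded.

    Assumes an even load split and that an overloaded machine crashes.
    """
    alive = n_machines
    while alive > 0 and total_load / alive > capacity:
        alive -= 1  # an overloaded machine crashes; survivors absorb its share
    return alive

print(surviving_machines(95, 10, 10))  # 10: each machine carries 9.5 <= 10
print(surviving_machines(95, 10, 9))   # 0: one machine fewer overloads all the rest
```

Losing a single machine out of ten is enough to take down the whole group here, which is why the faults cannot be treated as independent.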
Recap: Faults and failures
• Faults happen all the time
– Hardware malfunction, software bug, manipulation, hacker break-ins,
misconfiguration, ...
– NOT a rare occurrence at scale - must design system to handle
them
Plan for today
• Fault models
– Crash faults
– Rational faults
– Byzantine faults
• State-machine replication NEXT
– Consensus protocols
– FLP impossibility
• Paxos
– Intuition
– Algorithm
Why should we care?
• Some services are so important that a failure or
downtime would be a disaster
– Examples: Google's central synchronization service,
Yahoo's ZooKeeper service, ...
Goal: Replicated service
[Figure: two replicas, each shown with the same request sequence 1,3,7,2.]
FLP: Consensus is "impossible"!
• No deterministic asynchronous algorithm for agreeing on a
one-bit value can guarantee that it will terminate in the
presence of even a single crash fault
– Even if no crash faults actually occur
• What now?
– Change the problem statement: Randomized algorithms,
approximate agreement, k-set agreement, ...
– Change the assumptions: Assume bounds on message delays, or
that we have a reliable oracle (failure detector) that tells us when a
node crashed
– Paxos: Guarantees safety but not liveness (termination)
M. Fischer, N. Lynch, M. Paterson, "Impossibility of distributed consensus with one faulty process",
Journal of the ACM, April 1985, 32(2):374-382. ACM Knuth Prize 2007!
Plan for today
• Fault models
– Crash faults
– Rational faults
– Byzantine faults
• State-machine replication
– Consensus protocols
– FLP impossibility
• Paxos NEXT
– Intuition
– Algorithm
What is Paxos?
An unusual paper...
The Paxos algorithm
...
37. Add 7 to X
38. Read Y
39. (nop)
40. Z:=X+Y
...
• Scenario:
– There is a replicated append-only log ("ledgers" in the paper)
– The instances of this log are kept consistent by the protocol
– We can use this log to record the sequence of operations
→ All nodes will process them in the same order
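A small sketch of why an agreed-upon order matters. It replays log entries like those above against a dictionary of variables; the initial values and function names are hypothetical, and "Read Y" is left out because it does not change the state.

```python
def add_7_to_x(state):
    state["X"] += 7

def nop(state):
    pass

def z_is_x_plus_y(state):
    state["Z"] = state["X"] + state["Y"]

def replay(log, initial):
    """Apply the operations in log order to a copy of the initial state."""
    state = dict(initial)
    for op in log:
        op(state)
    return state

initial = {"X": 0, "Y": 3, "Z": 0}          # hypothetical starting values
log = [add_7_to_x, nop, z_is_x_plus_y]

replica_a = replay(log, initial)
replica_b = replay(log, initial)
print(replica_a == replica_b)               # True: same log, same state

reordered = replay([z_is_x_plus_y, nop, add_7_to_x], initial)
print(replica_a["Z"], reordered["Z"])       # 10 3: a different order diverges
```

As long as the operations are deterministic, replicas that apply the same log in the same order end up in the same state.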
The Paxos algorithm
• Safety
– Only a value that has been proposed may be chosen
– Only a single value is chosen
– A node never learns that a value has been chosen unless
it actually has been chosen
• Liveness
– Some proposed value is eventually chosen
– A node can eventually learn the chosen value
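The three safety properties can be phrased as checks on the outcome of a run. This is only an illustrative sketch: the function name and the way a run is represented (a set of proposed values, a set of chosen values, and a map from node to learned value) are assumptions made for this example.

```python
def violates_safety(proposed, chosen, learned):
    """Return the list of safety properties that the given run violates."""
    problems = []
    if not chosen <= proposed:
        problems.append("a value was chosen that was never proposed")
    if len(chosen) > 1:
        problems.append("more than one value was chosen")
    if not set(learned.values()) <= chosen:
        problems.append("a node learned a value that was not chosen")
    return problems

print(violates_safety({"X", "Y"}, {"X"}, {"A": "X"}))    # []
print(violates_safety({"X", "Y"}, {"X", "Y"}, {}))       # one violation
```

Liveness cannot be checked this way on a finite prefix of a run, since "eventually" may still lie in the future.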
The Paxos algorithm
• Simple strawman solution (not Paxos):
– There is a single node (let's call it the "acceptor") that decides which
entries should go into the log
– Other nodes can send messages to the acceptor to propose a new
entry they want to add
– The acceptor accepts proposals in the order in which they are
received
– When the acceptor adds a new entry, it sends a message to the
other nodes, so they can learn what the entry was
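The strawman can be sketched in a few lines. Class and method names are hypothetical, and direct method calls stand in for messages.

```python
class Learner:
    def __init__(self):
        self.log = {}                 # index -> entry, as learned from the acceptor

    def learn(self, index, entry):
        self.log[index] = entry

class SingleAcceptor:
    """The one node that decides which entries go into the log."""

    def __init__(self, learners):
        self.log = []
        self.learners = learners

    def propose(self, entry):
        # Proposals are accepted in the order in which they are received.
        self.log.append(entry)
        index = len(self.log) - 1
        for learner in self.learners:
            learner.learn(index, entry)   # tell the other nodes what was added
        return index

learners = [Learner(), Learner()]
acceptor = SingleAcceptor(learners)
acceptor.propose("Add 7 to X")
acceptor.propose("Read Y")
print(learners[0].log == learners[1].log)   # True
```

The weakness is visible in the structure: everything goes through one acceptor object, so if that node crashes, no further entry can be added.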
Model
• Network:
– May lose messages (messenger leaves forever)
– May duplicate messages
– Asynchronous (messages can be delayed arbitrarily)
– But: No message corruption
• Nodes:
– Can fail by crashing (legislator leaves the Chamber)
– No central clock (hourglass timers)
– But: Have some persistent memory (ledgers)
– But: Strictly follow the protocol - no lying, data corruption...
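A sketch of the network assumptions as a test harness. The function name and the loss and duplication probabilities are hypothetical; arbitrary delay is approximated by shuffling the delivery order.

```python
import random

def unreliable_deliver(messages, rng, p_loss=0.2, p_dup=0.2):
    """Return the messages as the modeled network might deliver them."""
    delivered = []
    for msg in messages:
        if rng.random() < p_loss:
            continue                  # message lost forever
        delivered.append(msg)
        if rng.random() < p_dup:
            delivered.append(msg)     # message duplicated
    rng.shuffle(delivered)            # arbitrary delays show up as reordering
    return delivered                  # contents are never altered: no corruption

sent = ["PREPARE(5)", "ACK(5,-,-)", "ACCEPT(5,X)"]
print(unreliable_deliver(sent, random.Random(1)))
```

A protocol that is correct under this model must cope with missing, repeated, and reordered messages, but it may trust the content of any message that does arrive.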
Phase 1: Prepare
• Suppose a node A wants to propose a value X for an
entry e:
– A chooses a new proposal number n
– A sends PREPARE_e(n) to a majority ("quorum") of the other nodes
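The slide does not say how A picks a "new" proposal number. One common scheme, shown here purely as an assumption, gives each node its own residue class so that two nodes never pick the same number. The helper names are hypothetical.

```python
def next_proposal_number(last_seen: int, node_id: int, num_nodes: int) -> int:
    """Return a number greater than last_seen that only node_id can generate."""
    round_no = last_seen // num_nodes + 1
    return round_no * num_nodes + node_id

def majority(count: int) -> int:
    """Smallest number of nodes that forms a strict majority of count nodes."""
    return count // 2 + 1

# Nine nodes, numbered 0..8; node 1 has seen proposal numbers up to 5.
print(next_proposal_number(5, 1, 9))   # 10
print(majority(8))                     # 5: a quorum among the 8 other nodes
```

Any two majorities of the same node set share at least one node, which is what lets a later proposer find out about an earlier accepted value.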
Phase 1: Prepare
• If a node B receives PREPARE(n) from A:
– If B has already acknowledged a PREPARE(n') with n' > n for this
entry, then it does nothing.
– If B has previously accepted any proposals for this entry, it responds
with ACK(n, n', X'), where n' is the highest proposal number it has
accepted for this entry, and X' is the corresponding value.
– Otherwise, B responds with ACK(n, -, -).
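A sketch of B's side of Phase 1 for a single log entry. The class and field names are hypothetical, None stands for the "-" placeholder, and persistence to stable storage is not modeled.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = None       # highest PREPARE number acknowledged so far
        self.accepted_n = None       # highest proposal number accepted so far
        self.accepted_value = None   # value that came with accepted_n

    def on_prepare(self, n):
        """Handle PREPARE(n); return the ACK tuple, or None for 'do nothing'."""
        if self.promised_n is not None and self.promised_n > n:
            return None              # already acknowledged a higher-numbered PREPARE
        self.promised_n = n
        # ACK(n, n', X') if something was accepted before, else ACK(n, -, -).
        return ("ACK", n, self.accepted_n, self.accepted_value)

b = Acceptor()
print(b.on_prepare(5))   # ('ACK', 5, None, None)
print(b.on_prepare(3))   # None: 3 is lower than the acknowledged 5
```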
Phase 2: Accept
• If A receives ACKs from a majority of the other
nodes, it issues ACCEPT(n, X*)
– X* is the value from the ACK with the highest proposal number, or
the original X if none of the ACKs had a value
• If B receives ACCEPT(n, X*)
– If B has already acknowledged a PREPARE(n') with n' > n, then
B does nothing
– Otherwise, B accepts the proposal and sends ACCEPT(n, X*) to all
the learners
• If a learner L receives ACCEPT(n, X*) from a
majority of the acceptors, it decides X*
– L then sends DECIDE(X*) to all the other learners
– If another learner receives DECIDE(X*), it decides X* for this entry
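A sketch of the three Phase 2 rules for a single log entry. All names are hypothetical, None stands for "-", and the DECIDE forwarding between learners is omitted for brevity.

```python
def choose_value(acks, original_value):
    """Pick X* from ACK tuples of the form (n, accepted_n, accepted_value)."""
    with_value = [a for a in acks if a[1] is not None]
    if not with_value:
        return original_value                      # no ACK carried a value
    return max(with_value, key=lambda a: a[1])[2]  # value with the highest number

class Acceptor:
    def __init__(self):
        self.promised_n = None
        self.accepted_n = None
        self.accepted_value = None

    def on_accept(self, n, value):
        """Handle ACCEPT(n, X*); return the message for the learners, or None."""
        if self.promised_n is not None and self.promised_n > n:
            return None                            # acknowledged a higher PREPARE
        if self.accepted_n is None or n > self.accepted_n:
            self.accepted_n, self.accepted_value = n, value
        return ("ACCEPT", n, value)

class Learner:
    def __init__(self, num_acceptors):
        self.needed = num_acceptors // 2 + 1       # majority of the acceptors
        self.votes = {}                            # (n, value) -> set of acceptor ids
        self.decided = None

    def on_accept(self, acceptor_id, n, value):
        # A set of ids makes duplicated messages harmless.
        self.votes.setdefault((n, value), set()).add(acceptor_id)
        if len(self.votes[(n, value)]) >= self.needed:
            self.decided = value
        return self.decided

print(choose_value([(8, None, None), (8, 5, "X")], "Y"))   # X
```

Note that choose_value may force a proposer to give up its own value: once some acceptor reports an accepted value, the proposer must carry that value forward.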
Let's think about this
• What happens if two nodes concurrently propose
different values?
For clarity, we assume only A, B, C are learners. In general, all nodes are learners.
Example
Nodes: A B C D E F G H I
A's quorum: C, D, E, G, H
B's quorum: D, E, F, G, I
A: PREPARE(5)
C,D,E,G,H: ACK(5,-,-)
A: ACCEPT(5,X)
E: ACCEPT(5,X)
B: PREPARE(8)
D,F,G,I: ACK(8,-,-)
E: ACK(8,5,X)
B: ACCEPT(8,X)
D,..,I: ACCEPT(8,X)
A: DECIDE(X)
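The trace above can be replayed with a small sketch. The class and helper names are hypothetical, direct calls stand in for messages, and B's own original value "Y" is an assumption made for this example (the slide does not name it).

```python
class Acceptor:
    def __init__(self):
        self.promised_n = None
        self.accepted = None                  # (n, value) of highest accepted proposal

    def on_prepare(self, n):
        if self.promised_n is not None and self.promised_n > n:
            return None                       # ignore lower-numbered PREPARE
        self.promised_n = n
        return self.accepted or (None, None)  # (n', X') or (-, -)

    def on_accept(self, n, value):
        if self.promised_n is not None and self.promised_n > n:
            return False                      # acknowledged a higher PREPARE
        if self.accepted is None or n > self.accepted[0]:
            self.accepted = (n, value)
        return True

def pick_value(acks, original):
    accepted = [a for a in acks if a[0] is not None]
    return max(accepted, key=lambda a: a[0])[1] if accepted else original

nodes = {name: Acceptor() for name in "ABCDEFGHI"}

# A: PREPARE(5) to its quorum; all reply ACK(5,-,-), so A keeps its own value X.
acks_a = [nodes[x].on_prepare(5) for x in "CDEGH"]
value_a = pick_value(acks_a, "X")
# A's ACCEPT(5,X) has so far reached only E.
nodes["E"].on_accept(5, value_a)

# B: PREPARE(8) to its quorum; E reports (5, X), so B must propose X, not Y.
acks_b = [nodes[x].on_prepare(8) for x in "DEFGI"]
value_b = pick_value(acks_b, "Y")
accepts = [nodes[x].on_accept(8, value_b) for x in "DEFGI"]

print(value_b)                        # X
print(sum(accepts))                   # 5 acceptors accepted (8, X)
print(nodes["D"].on_accept(5, "X"))   # False: A's late ACCEPT(5,X) is ignored by D
```

Because the two quorums overlap (here in D, E, and G), B learns about the value A got accepted at E and carries it forward, so both proposers end up pushing the same value X.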
Recap: Paxos
• Goal: Build a replicated service
– Multiple machines acting 'as if' they were a single machine
– Can mask faults if not too many happen simultaneously
• Paxos implements an important building block: A
consistent append-only log
– Useful to make the replicas agree on the order in which to process
requests → prevent divergence
– More generally, consensus is useful in many other scenarios
• But: Paxos assumes crash faults
– Malicious nodes can easily disrupt the algorithm by telling lies
– Can we build a similar protocol that can tolerate malicious nodes as
well?
– Yes, we can! See next lecture.