Fault Tolerance
Tim Wood
• Cooling infrastructure
• Power converters
• Backup generators
Modular Data Center
• ...or use shipping containers
• Each container filled with
thousands of servers
• Can easily add new
containers
• “Plug and play”
• Just add electricity
Definitions
• Availability: whether the system is ready to use
at a particular time
• Reliability: whether the system can run
continuously without failure
• Safety: whether a disaster happens if the system
fails to run correctly at some point
• Maintainability: how easily a system can be
repaired after failure
Availability and Reliability
• System 1: crashes for 1 millisecond every hour
• Better than 99.9999% availability
• Not very good reliability...
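The System 1 numbers can be checked with a short calculation. This is a minimal sketch: the function name `availability` is hypothetical, and it assumes availability is simply the fraction of time the system is up.

```python
def availability(downtime_s: float, period_s: float) -> float:
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_s / period_s


# System 1 from the slide: down for 1 millisecond in every hour.
system1 = availability(downtime_s=0.001, period_s=3600.0)
print(f"System 1 availability: {system1:.7%}")

# Reliability is a separate question: the system still fails once per hour,
# which adds up to 24 * 365 failures in a (non-leap) year.
failures_per_year = 24 * 365
print(f"System 1 failures per year: {failures_per_year}")
```

The availability is above 99.9999%, yet the system never runs for more than an hour without failing, which is why its reliability is poor.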
Quantifying Reliability
• MTTF: Mean Time To Failure
• The average amount of time until a failure occurs
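• MTTR: Mean Time To Repair
• The average amount of time needed to recover from a failure
• MTBF: Mean Time Between Failures
• MTBF = MTTF + MTTR

|<---MTTF--->|<-MTTR->|<---MTTF--->|
|<-------MTBF-------->|
--------------------------------------> Time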
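The three quantities in the timeline are related by simple arithmetic. The sketch below assumes the common convention MTBF = MTTF + MTTR and steady-state availability = MTTF / MTBF. The function names and the example figures (1,000 hours and 2 hours) are hypothetical, chosen only for illustration.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: time running plus time being repaired."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical server: runs 1,000 hours on average, takes 2 hours to repair.
example_mtbf = mtbf(1000.0, 2.0)
example_availability = steady_state_availability(1000.0, 2.0)
print(f"MTBF: {example_mtbf} hours")
print(f"Availability: {example_availability:.4%}")
```

Shortening MTTR raises availability without changing MTTF, so faster repair improves availability but not reliability.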
Real Failure Rates
• Standard hard drive MTBF
• 600,000 hours = 68 years!
• 1/68 = 1.5% chance of failure per year
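The conversion from MTBF to a yearly failure chance can be reproduced as follows. This is a sketch under stated assumptions: the slide's 1/68 figure is the simple rate, while the exponential version assumes a constant failure rate, which real drives may not follow. The function names and the fleet size of 10,000 drives are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year


def mtbf_in_years(mtbf_hours: float) -> float:
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """Simple estimate used on the slide: one failure per MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance of at least one failure in a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


drive_mtbf_hours = 600_000
print(f"MTBF in years: {mtbf_in_years(drive_mtbf_hours):.1f}")
print(f"Simple annual failure rate: {annual_failure_rate(drive_mtbf_hours):.2%}")
print(f"Exponential model: {annual_failure_probability(drive_mtbf_hours):.2%}")

# With many drives, failures become routine (hypothetical fleet of 10,000).
expected_failures = 10_000 * annual_failure_rate(drive_mtbf_hours)
print(f"Expected failures per year across 10,000 drives: {expected_failures:.0f}")
```

Both estimates land near 1.5% per drive per year, so a large fleet should expect drive failures every few days.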
Reliability Challenges
• Typical failures in one year of a Google data center:
• 1000 individual machine failures
• thousands of hard drive failures
• 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly
disappear, budget 6 hours to come back)
• 1 rack-reorganization (You have plenty of warning: 500-1000 machines powered
down, about 6 hours)
• 1 network rewiring (rolling 5% of machines down over 2-day span)
• 20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• 5 racks go wonky (40-80 machines see 50% packet loss)
• 8 network maintenances (4 might cause ~30-minute random connectivity losses)
• 12 router reloads (takes out DNS and external virtual IP addresses (VIPs) for a
couple of minutes)
• 3 router failures (have to immediately pull traffic for an hour)
• 0.5% overheat (power down most machines in under five minutes, expect 1-2
days to recover)
• dozens of minor 30-second blips for DNS
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
Types of Failures
• Systems can fail in different ways
• How?
• Crash failure
• Timing failure
• Content failure
• Malicious failure
Fault Tolerance through Replication
• We can handle failures through redundancy
Fault Detection
• Detecting a fault can be difficult
• Crash failure
• Timing failure
• Content failure
• Malicious failure
• Approaches:
• Heartbeat messages
• Adaptive timeouts
• Voting / Quorums
• Authentication / signatures
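Heartbeat messages and adaptive timeouts can be combined in one small detector. The sketch below is only illustrative: the class `HeartbeatDetector`, its parameters, and the rule "mean interval plus a multiple of the standard deviation" are assumptions, not a method taken from the slides. Timestamps are passed in explicitly so the example is deterministic.

```python
from statistics import mean, pstdev


class HeartbeatDetector:
    """Suspects a node when its heartbeat is later than an adaptive timeout."""

    def __init__(self, safety_factor=4.0, initial_timeout=1.0, window=10):
        self.safety_factor = safety_factor      # how many deviations to tolerate
        self.initial_timeout = initial_timeout  # used until enough samples exist
        self.window = window                    # number of recent intervals kept
        self.last_arrival = None
        self.intervals = []

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def timeout(self):
        """Adapt the timeout to the observed heartbeat spacing and jitter."""
        if len(self.intervals) < 2:
            return self.initial_timeout
        return mean(self.intervals) + self.safety_factor * pstdev(self.intervals)

    def suspect(self, now):
        """True if the node has been silent for longer than the timeout."""
        if self.last_arrival is None:
            return False
        return now - self.last_arrival > self.timeout()


detector = HeartbeatDetector()
for arrival in (0.0, 1.0, 2.1, 3.0, 4.05):
    detector.heartbeat(arrival)

print(detector.suspect(4.5))   # shortly after the last heartbeat
print(detector.suspect(10.0))  # long silence
```

A suspicion is not proof of a crash: a slow node or a slow network looks the same to the detector, which is one reason detection is difficult.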
Detection is Hard
• Or maybe even impossible
Two Generals Problem
Fault Tolerance through Replication
• How to tolerate a crash failure?
• Inputs: 2+2
• P1: 2+2=4, output = 4
• P2: crash
• Need f+1 replicas to tolerate f crash failures
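The crash-tolerance idea can be sketched in a few lines: send the same input to every replica and use the first answer that comes back. The names `run_replicated`, `add`, and `crashing_add` are hypothetical, and a crash is modeled here as a raised exception, which is a simplification.

```python
def run_replicated(replicas, *inputs):
    """Return the result of the first replica that does not crash."""
    for replica in replicas:
        try:
            return replica(*inputs)
        except Exception:
            # Crash failure: the replica produces no output, so try the next one.
            continue
    raise RuntimeError("all replicas crashed")


def add(a, b):
    return a + b


def crashing_add(a, b):
    raise RuntimeError("P2 crashed")


def replicas_needed(f):
    """Replicas required to survive f crash failures."""
    return f + 1


# One crashed replica (f = 1) is tolerated with two replicas.
result = run_replicated([crashing_add, add], 2, 2)
print(result)
```

This works only because a crashed replica stays silent. A replica that returns a wrong answer needs a different approach.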
[Diagram: replicas A, B, and C exchange values; messages are labeled 4 and 5]
Replica   Receives   Action
A         4, 4, 5    = 4
B         4, 4, 5    = 4
C         4, 4, 5    = 4?
Byzantine Generals Problem
• There are N generals making plans for an attack
• They need to decide whether to Attack or Retreat
• Send your vote to everyone (0=retreat, 1=attack)
• But f generals may be traitors that lie and collude
• Can all correct replicas agree on what to do?
• Take majority vote of planned actions
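A plain majority vote breaks down when a traitor tells different generals different things. The sketch below uses a hypothetical scenario with three generals, where C is the traitor. The variable names are illustrative only.

```python
from collections import Counter


def majority(votes):
    """Return the most common vote (0 = retreat, 1 = attack)."""
    return Counter(votes).most_common(1)[0][0]


# Correct generals A and B send the same vote to everyone.
vote_a, vote_b = 1, 0

# Traitor C sends a different vote to each of them.
c_tells_a, c_tells_b = 1, 0

seen_by_a = [vote_a, vote_b, c_tells_a]
seen_by_b = [vote_a, vote_b, c_tells_b]

decision_a = majority(seen_by_a)
decision_b = majority(seen_by_b)
print(f"A decides {decision_a}, B decides {decision_b}")
```

A and B each follow the rule correctly but reach different decisions, so a single round of voting is not enough.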
Byzantine Generals Solved!
• Need more replicas to reach consensus
• Requires 3f+1 replicas to tolerate f Byzantine faults
• Step 1: Send your plan to everyone
• Step 2: Send learned plans to everyone
• Step 3: Use majority of each column
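At replica A:
Replica   Receives      Vote
A:        (1,0,1,1)     A: 1
B:        (1,0,0,1)     B: 0
C:        (1,1,1,1)     C: 1
D:        (1,0,1,1)     D: 1

At replica B:
Replica   Receives      Vote
A:        (1,0,1,1)     A: 1
B:        (1,0,0,1)     B: 0
C:        (0,0,0,0)     C: 0
D:        (1,0,0,1)     D: 1

[Diagram: generals A, B, C, and D exchanging messages, with messages labeled x, y, and z]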
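Step 3 can be reproduced from the vectors listed above for replicas A and B. This is a minimal sketch: the function name `column_majority` and the tie-breaking default (retreat on a tie) are assumptions that do not appear in the slides.

```python
def column_majority(reports, tie_value=0):
    """For each general, take the majority of what every replica reported."""
    decided = []
    for column in zip(*reports):
        ones = sum(column)
        zeros = len(column) - ones
        if ones > zeros:
            decided.append(1)
        elif zeros > ones:
            decided.append(0)
        else:
            decided.append(tie_value)  # assumed default: retreat on a tie
    return decided


def replicas_needed(f):
    """Replicas required to tolerate f Byzantine faults."""
    return 3 * f + 1


# Rows are the vectors received from A, B, C, and D, in that order.
received_at_a = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1), (1, 0, 1, 1)]
received_at_b = [(1, 0, 1, 1), (1, 0, 0, 1), (0, 0, 0, 0), (1, 0, 0, 1)]

print(column_majority(received_at_a))
print(column_majority(received_at_b))
print(replicas_needed(1))
```

Each position in the result is one general's vote as decided by that replica, matching the Vote column for A and for B.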
Can we make this any easier?
• Fundamental challenge in Byzantine fault tolerance (BFT):
• Nodes can misrepresent what they heard from other nodes
Denial of Service
• Attack to reduce the availability of a service
• Can also cause crashes if software is poorly written
[Figure label: Amazon.com]
Sept 2012 DDoS
• Six US banks attacked
• Attacks were announced in advance
• Banks still could not prevent the damage
• Attackers sent 65 gigabytes of data per second
Sept 2012 DDoS
• But it's not clear if that was the real source...
• Botnet machines have relatively low bandwidth
• Would need 65,000+ compromised machines
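The machine count follows from dividing the attack rate by the bandwidth of a single bot. In the sketch below, the per-machine figure of 1 megabyte per second is an assumption inferred from the slide's own numbers (65 gigabytes per second and 65,000 machines), and the function name `machines_needed` is hypothetical.

```python
import math

GB = 10**9  # bytes in a gigabyte (decimal)
MB = 10**6  # bytes in a megabyte (decimal)


def machines_needed(attack_bytes_per_s, per_machine_bytes_per_s):
    """Smallest number of machines whose combined upload reaches the attack rate."""
    return math.ceil(attack_bytes_per_s / per_machine_bytes_per_s)


# Assumed upload bandwidth per compromised machine: 1 MB/s.
bots = machines_needed(65 * GB, 1 * MB)
print(bots)
```

Lower per-machine bandwidth raises the required count proportionally, which is why the slide says "65,000+".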
Anonymous (?)
• The Anonymous "hacktivist" group has used
DDoS for various political causes
• Members run LOIC (Low Orbit Ion Cannon) software and target a specific site
• "Volunteer botnet"
• But be careful...
Defending against DDoS
• Some DDoS traffic can be easily distinguished
• Most web apps can safely ignore ICMP and UDP traffic
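Filtering by protocol is the simplest form of this defense. The sketch below assumes a web app that serves only TCP traffic. The packet dictionaries and the function `should_drop` are hypothetical stand-ins for whatever a real firewall or ISP filter would inspect. The protocol numbers are the standard IP protocol numbers.

```python
# Standard IP protocol numbers.
ICMP, TCP, UDP = 1, 6, 17

# Assumption: the web app only needs TCP.
ALLOWED_PROTOCOLS = {TCP}


def should_drop(ip_protocol: int) -> bool:
    """Drop anything the web app does not need."""
    return ip_protocol not in ALLOWED_PROTOCOLS


# Hypothetical incoming packets.
packets = [
    {"src": "10.0.0.1", "proto": TCP},
    {"src": "10.0.0.2", "proto": UDP},
    {"src": "10.0.0.3", "proto": ICMP},
    {"src": "10.0.0.4", "proto": TCP},
]

kept = [p for p in packets if not should_drop(p["proto"])]
print(len(kept))
```

Attack traffic sent as ordinary TCP requests passes this filter, so it handles only the easily distinguished cases.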
[Figure labels: ISP, Amazon.com]
Summary
• Software systems must worry about:
• Hardware and software failures
• Service availability
• Malicious attacks that affect reliability and/or availability
• Approaches:
• Redundancy
• Fault mitigation
• Fault detection