
Availability, Reliability, and Fault Tolerance


Guest Lecture for Software Systems Security

Professor Tim Wood - The George Washington University


Distributed Systems have Problems
• Hardware breaks
• Software is buggy
• People are jerks

• But software is increasingly important


• Runs nuclear reactors
• Controls engines and wings on a passenger jet
• Runs your Facebook wall!

• How can we make these systems more reliable?
• Particularly large distributed systems
2
Inside a Data Center
• Giant warehouse filled with:
• Racks of servers
• Disk arrays

• Cooling infrastructure
• Power converters
• Backup generators

3
Modular Data Center
• ...or use shipping containers
• Each container filled with
thousands of servers
• Can easily add new
containers
• “Plug and play”
• Just add electricity

• Allows data center to be easily expanded
• Pre-assembled, cheaper

4
Definitions
• Availability: whether the system is ready to use
at a particular time
• Reliability: whether the system can run
continuously without failure
• Safety: whether a disaster is avoided if the system
fails to run correctly at some point
• Maintainability: how easily a system can be
repaired after failure

5
Availability and Reliability
• System 1: crashes for 1 millisecond every hour
• Better than 99.9999% availability
• Not very good reliability...

• System 2: never crashes, but has to be shut down two weeks a year
• "Perfectly" reliable
• Only 96% availability

Is one more important?

7
Quantifying Reliability
• MTTF: Mean Time To Failure
• The average amount of time until a failure occurs

• MTTR: Mean Time To Repair


• The average amount of time to repair after a failure

• MTBF: Mean Time Between Failures
• The average time from one failure to the next: MTBF = MTTF + MTTR

[Timeline: each MTTF period of uptime ends in a failure, followed by an MTTR repair period; MTBF spans from one failure to the next]

8
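To tie these metrics back to the availability numbers two slides up, availability can be computed as MTTF / (MTTF + MTTR). The Python sketch below is illustrative only (not part of the lecture); the parameters are the two hypothetical systems from the earlier slide.

```python
# Minimal sketch (not from the lecture): relating MTTF and MTTR to availability.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is ready to use."""
    return mttf_hours / (mttf_hours + mttr_hours)

# System 1: up ~1 hour at a time, down 1 millisecond to recover.
sys1 = availability(1.0, 0.001 / 3600)
# System 2: "perfectly" reliable, but shut down two weeks per year.
sys2 = (52 - 2) / 52

print(f"System 1: {sys1:.6%} available")   # better than 99.9999%
print(f"System 2: {sys2:.2%} available")   # only ~96%
```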
Real Failure Rates
• Standard hard drive MTBF
• 600,000 hours = 68 years!
• 1/68 = 1.5% chance of failure per year

• A big Google data center:
• Has 200,000+ hard drives
• 1.5% x 200,000 ≈ 2,921 drive crashes per year
• or about 8 disk failures per day

• Failures happen a lot
• Need to design software to be resilient to all types of hardware failures
• Actual failure rates are closer to 3% per year

9
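The drive-failure arithmetic on this slide is easy to reproduce. A minimal Python sketch, using only the MTBF and fleet size quoted above (everything else is just unit conversion):

```python
# Minimal sketch: expected drive failures for a fleet, given per-drive MTBF.
MTBF_HOURS = 600_000            # per-drive mean time between failures
FLEET_SIZE = 200_000            # drives in the data center
HOURS_PER_YEAR = 24 * 365

annual_failure_rate = HOURS_PER_YEAR / MTBF_HOURS      # ~1.46% per drive per year
failures_per_year = annual_failure_rate * FLEET_SIZE   # ~2,920 drives
failures_per_day = failures_per_year / 365             # ~8 drives

print(f"{annual_failure_rate:.2%} per drive per year")
print(f"{failures_per_year:.0f} failures/year, {failures_per_day:.1f} failures/day")
```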
Reliability Challenges
• Typical failures in one year of a google data center:
• 1000 individual machine failures
• thousands of hard drive failures
• 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly
disappear, budget 6 hours to come back)
• 1 rack-reorganization (You have plenty of warning: 500-1000 machines powered
down, about 6 hours)
• 1 network rewiring (rolling 5% of machines down over 2-day span)
• 20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• 5 racks go wonky (40-80 machines see 50% packet loss)
• 8 network maintenances (4 might cause ~30-minute random connectivity losses)
• 12 router reloads (takes out DNS and external virtual IP address (VIPS) for a
couple minutes)
• 3 router failures (have to immediately pull traffic for an hour)
• 0.5% overheat (power down most machines in under five minutes, expect 1-2
days to recover)
• dozens of minor 30-second blips for DNS

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
10
Types of Failures
• Systems can fail in different ways

• How?

11
Types of Failures
• Systems can fail in different ways

• Crash failure
• Timing failure
• Content failure
• Malicious failure

• Are some easier to deal with than others?

12
Fault Tolerance through Replication
• We can handle failures through redundancy

• Have multiple replicas run the program


• May want to keep them away from each other
• May want to use different hardware platforms
• May want to use different software implementations

13
Fault Detection
• Detecting a fault can be difficult
• Crash failure
• Timing failure
• Content failure
• Malicious failure

• Approaches:
• Heartbeat messages
• Adaptive timeouts
• Voting / Quorums
• Authentication / signatures

14
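To make the heartbeat and adaptive-timeout ideas concrete, here is a minimal Python sketch; it is purely illustrative (the class name, safety factor, and fallback timeout are assumptions, not lecture material). The monitor learns typical heartbeat gaps and suspects a crash when the current gap is far outside them.

```python
# Minimal sketch (illustrative): heartbeat-based crash detection with an
# adaptive timeout derived from observed inter-arrival times.
import time
import statistics

class HeartbeatMonitor:
    def __init__(self, safety_factor: float = 4.0):
        self.intervals = []            # observed gaps between heartbeats
        self.last_beat = None
        self.safety_factor = safety_factor

    def record_heartbeat(self) -> None:
        now = time.monotonic()
        if self.last_beat is not None:
            self.intervals.append(now - self.last_beat)
        self.last_beat = now

    def timeout(self) -> float:
        """Adaptive timeout: mean gap plus a few standard deviations."""
        if len(self.intervals) < 2:
            return 10.0                # fall back to a fixed guess (assumed)
        mean = statistics.mean(self.intervals)
        stdev = statistics.stdev(self.intervals)
        return mean + self.safety_factor * stdev

    def suspected_failed(self) -> bool:
        if self.last_beat is None:
            return False
        return time.monotonic() - self.last_beat > self.timeout()
```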
Detection is Hard
• Or maybe even impossible

• How long should we set a timeout?

• How do we know heartbeat messages will go through?

15
Two Generals Problem

[Figure: the Ninja army and the Pirate army on opposite sides of the Enemy]

• The Ninja general and the Pirate general need to coordinate an attack
• Can (try to) send messengers back and forth
• Messengers can be shot
• How can they guarantee they will both attack at the
same time?
16
Two Generals Problem
• We need to worry about physical characteristics
when we build systems
• Packets can be lost, delayed, reordered
• Disks can be slow, fail, or crash

• Or things can be actively malicious


• Big trouble...

• What kinds of assumptions do we need to make?


• Network is ordered and reliable, but may be slow
• We can quantify the expected number of nodes that will fail

17
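The uncertainty behind the Two Generals Problem can be seen with a small simulation. The sketch below (Python, an illustrative assumption rather than anything from the lecture) runs a naive "send the plan, wait for one ack" protocol over a lossy link; whenever the final message is lost, the two generals end up in different states.

```python
# Minimal sketch: the Ninja general sends the attack time and attacks only if
# an ack comes back; the Pirate general attacks whenever the message arrived.
import random

def run_once(loss_probability: float) -> tuple[bool, bool]:
    msg_arrives = random.random() > loss_probability                 # Ninja -> Pirate
    ack_arrives = msg_arrives and random.random() > loss_probability # Pirate -> Ninja
    pirate_attacks = msg_arrives
    ninja_attacks = ack_arrives
    return ninja_attacks, pirate_attacks

random.seed(0)
trials = 100_000
disagreements = sum(a != b for a, b in (run_once(0.2) for _ in range(trials)))
print(f"Generals disagreed in {disagreements / trials:.1%} of runs")  # ~16% with 20% loss
```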
Fault Tolerance through Replication
• How to tolerate a crash failure?

• How to tolerate a content failure?

• How many replicas to tolerate f such failures?

18
Fault Tolerance through Replication
• How to tolerate a crash failure?
[Figure: inputs "2+2" go to replicas P1 and P2; P1 outputs 4 while P2 crashes, and the surviving replica's answer is used. f+1 replicas tolerate f crash failures.]

• How to tolerate a content failure?
[Figure: inputs "2+2" go to replicas P1, P2, and P3; P1 and P3 answer 4, P2 answers 5, and a majority voter outputs 4. 2f+1 replicas tolerate f content failures.]
19
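A minimal sketch of the majority voter for the 2f+1 case (Python, illustrative only; the function name and error handling are assumptions, not lecture material):

```python
# Majority voting over 2f+1 replica outputs masks up to f content failures.
from collections import Counter

def majority_vote(replies):
    """Return the value reported by a strict majority of replicas."""
    value, votes = Counter(replies).most_common(1)[0]
    if votes > len(replies) // 2:
        return value
    raise RuntimeError("no majority -- more than f replicas gave bad answers")

# f = 1 content failure tolerated with 2f+1 = 3 replicas.
print(majority_vote([4, 5, 4]))   # -> 4
```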
Agreement without Voters
• We can't always assume there is a perfectly
correct voter to validate the answers
• Better: Have replicas reach agreement amongst
themselves about what to do
• Exchange calculated value and have each node pick winner

[Figure: replicas A, B, and C exchange the values they computed]

Replica   Receives   Action
A         4, 4, 5    = 4
B         4, 4, 5    = 4
C         4, 4, 5    = 4?

20
Byzantine Generals Problem
• There are N generals making plans for an attack
• They need to decide whether to Attack or Retreat
• Send your vote to everyone (0=retreat, 1=attack)
• But f generals may be traitors that lie and collude
• Can all correct replicas agree on what to do?
• Take majority vote of planned actions

[Figure: A votes 1 (attack), B votes 0 (retreat); C is a traitor and tells A it votes 1 but tells B it votes 0]

Replica   Receives   Action
A         1, 0, 1    Attack!
B         1, 0, 0    Retreat!
C         1, 0, ?    ???
Majority voting doesn't work if a replica lies!

21
Byzantine Generals Solved!
• Need more replicas to reach consensus
• Requires 3f+1 replicas to tolerate f byzantine faults
• Step 1: Send your plan to everyone
• Step 2: Send learned plans to everyone
• Step 3: Use majority of each column
[Figure: four generals A, B, C, D exchange plans; C is a traitor and relays different vectors to different generals]

Replica A receives:          Vote:
A: (1,0,1,1)                 A: 1
B: (1,0,0,1)                 B: 0
C: (1,1,1,1)                 C: 1
D: (1,0,1,1)                 D: 1

Replica B receives:          Vote:
A: (1,0,1,1)                 A: 1
B: (1,0,0,1)                 B: 0
C: (0,0,0,0)                 C: 0
D: (1,0,0,1)                 D: 1
22
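As an illustrative sketch (not the lecture's own algorithm or code), the Python snippet below runs the three steps described above for n = 4 generals and f = 1 traitor: each general sends its plan, relays what it heard, and takes a majority per general. With 3f+1 = 4 generals, the loyal generals compute identical decision vectors.

```python
# Minimal sketch of one plan-exchange round plus one relay round, finished by
# a per-general majority. "Traitors" simply lie with arbitrary values.
import random
from collections import Counter

def majority(values):
    return Counter(values).most_common(1)[0][0]

def byzantine_round(plans: dict, traitors: set) -> dict:
    generals = list(plans)

    # Step 1: every general sends its plan to everyone (a traitor may lie,
    # and may lie differently to each receiver).
    def said(sender, receiver):
        return random.randint(0, 1) if sender in traitors else plans[sender]
    heard = {g: {s: said(s, g) for s in generals} for g in generals}

    # Step 2: every general relays what it heard (traitors may lie again).
    def relayed(relayer, receiver, about):
        return random.randint(0, 1) if relayer in traitors else heard[relayer][about]

    # Step 3: each loyal general takes a majority per column (per sender).
    decisions = {}
    for g in generals:
        if g in traitors:
            continue
        decisions[g] = {
            s: majority([heard[g][s]] +
                        [relayed(r, g, s) for r in generals if r not in (g, s)])
            for s in generals
        }
    return decisions

random.seed(1)
plans = {"A": 1, "B": 0, "C": 1, "D": 1}       # C will behave as the traitor
print(byzantine_round(plans, traitors={"C"}))  # loyal generals agree on every column
```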
Can we make this any easier?
• Fundamental challenge in BFT:
• Nodes can misrepresent what they heard from other nodes

• How can we fix this?

• Have nodes sign messages!


• Then liars can't forge messages with false information

• Crypto actually is useful!

23
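One way to picture "sign messages so liars can't forge them": attach an authentication tag only the original sender could have produced, and verify it before trusting a relayed value. The sketch below is a stand-in using Python's standard-library HMAC; real signed-message BFT protocols use public-key signatures (so verifiers hold no secret), and the key name here is an assumption for illustration.

```python
# Minimal sketch: tag each plan so a relayed "general A said 0" can be checked.
import hmac
import hashlib

KEY_A = b"general-A-secret"   # assumed key used to authenticate A's messages

def sign(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(key, message), tag)

plan = b"A:attack"
tag = sign(KEY_A, plan)

print(verify(KEY_A, plan, tag))          # True: relayed message is authentic
print(verify(KEY_A, b"A:retreat", tag))  # False: a forged relay is rejected
```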
Denial of Service
• Attack to reduce the availability of a service
• Can also cause crashes if software is poorly written

• "Unsophisticated but effective"


• Flood target with traffic
• No easy way to differentiate between a valid request and an
attacker

[Figure: many attacking hosts flood Amazon.com with traffic]

24
Sept 2012 DDoS
• Six US banks attacked
• Attacks were announced in advance
• Banks still could not prevent the damage
• Attackers sent 65 gigabits of data per second

• Iranian "Cyber Fighter" group claimed


responsibility
• Encouraged members to use Low Orbit Ion Cannon
software to flood banks with traffic
• Also used botnets as a traffic source

25
Sept 2012 DDoS
• But it's not clear if that was the real source...
• Botnet machines have relatively low bandwidth
• Would need 65,000+ compromised machines

• Most traffic to the banks was coming from about 200 IP addresses
• Appear to be a small set of compromised, high-powered web servers

• Not clear if the Iranian hacker group did all of this or if some other group was the mastermind
• Iranian government fighting against sanctions?
• Eastern European crime groups make fraudulent purchases and
then disrupt bank web activity long enough for them to go through

26
Anonymous (?)
• The Anonymous "hacktivist" group has used
DDoS for various political causes
• Members run LOIC software and target a specific site
• "Volunteer bot net"

• But be careful...

• In March 2012 the LOIC software had a trojan


• Ran a DDoS on the enemy...
• And stole your bank and gmail account info

27
Defending against DDoS
• Some DDoS traffic can be easily distinguished
• Most web apps can safely ignore ICMP and UDP traffic

• But performance impact will depend on where filtering is performed
• Firewall on server being attacked may limit impact on application, but still
clogs network
• Firewall at ISP is much better, but may be under someone else's control

[Figure: attack traffic passes through the ISP before reaching Amazon.com; filtering can happen at either point]

28
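A rough sketch of the filtering idea in Python (illustrative assumptions only: the allowed-protocol set, rate, and burst values are made up): drop protocols the web app never needs, and rate-limit what remains per source address.

```python
# Minimal sketch: protocol filter plus a per-source-IP token bucket.
import time
from collections import defaultdict

ALLOWED_PROTOCOLS = {"TCP"}      # a plain web app can ignore ICMP and UDP
RATE = 100.0                     # tokens (requests) per second per source
BURST = 200.0                    # maximum saved-up tokens per source

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def admit(src_ip: str, protocol: str) -> bool:
    if protocol not in ALLOWED_PROTOCOLS:
        return False              # cheap filter: wrong protocol, drop it
    b = buckets[src_ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False                  # source exceeded its budget, drop

print(admit("203.0.113.7", "UDP"))   # False: filtered by protocol
print(admit("203.0.113.7", "TCP"))   # True: within the per-source budget
```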
Summary
• Software systems must worry about:
• Hardware and software failures
• Service availability
• Malicious attacks that affect reliability and/or availability

• Approaches:
• Redundancy
• Fault mitigation
• Fault detection

29
