Fault Tolerance

Uploaded by

roy


Availability, Reliability, and Fault Tolerance

Guest Lecture for Software Systems Security

Tim Wood

Professor Tim Wood - The George Washington University

Distributed Systems have Problems
• Hardware breaks
• Software is buggy
• People are jerks

• But software is increasingly important

• Runs nuclear reactors
• Controls engines and wings on a passenger jet
• Runs your Facebook wall!

• How can we make these systems more reliable?
• Particularly large distributed systems
Inside a Data Center
• Giant warehouse filled with:
• Racks of servers
• Disk arrays

• Cooling infrastructure
• Power converters
• Backup generators

Modular Data Center
• ...or use shipping containers
• Each container filled with thousands of servers
• Can easily add new containers
• “Plug and play”
• Just add electricity

• Allows data center to be easily expanded
• Pre-assembled, cheaper

Definitions
• Availability: whether the system is ready to use at a particular time
• Reliability: whether the system can run continuously without failure
• Safety: whether a disaster happens if the system fails to run correctly at some point
• Maintainability: how easily a system can be repaired after failure

Availability and Reliability
• System 1: crashes for 1 millisecond every hour
• Better than 99.9999% availability
• Not very good reliability...

• System 2: never crashes, but has to be shut down two weeks a year
• "Perfectly" reliable
• Only 96% availability

Is one more important?
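The two availability figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `availability` is hypothetical, and it assumes a 365-day year and that "two weeks" means 14 full days of downtime.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(downtime_hours: float, period_hours: float) -> float:
    """Fraction of the period during which the system is ready to use."""
    return 1.0 - downtime_hours / period_hours


# System 1: down for 1 millisecond in every hour.
system1 = availability(downtime_hours=0.001 / 3600, period_hours=1.0)

# System 2: down for two weeks (14 days) in every year.
system2 = availability(downtime_hours=14 * 24, period_hours=HOURS_PER_YEAR)

print(f"System 1 availability: {system1:.7%}")  # better than 99.9999%
print(f"System 2 availability: {system2:.2%}")  # roughly 96%
```

System 1 is unavailable far less in total, yet it fails 8,760 times a year, which is why availability alone does not describe reliability.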

Quantifying Reliability
• MTTF: Mean Time To Failure
• The average amount of time until a failure occurs

• MTTR: Mean Time To Repair

• The average amount of time to repair after a failure

• MTBF: Mean Time Between Failures
• The average time from one failure to the next: MTBF = MTTF + MTTR

<---MTTF---><-MTTR-><---MTTF--->
<-------MTBF------->
-------------------------------> Time
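The timeline relates the three quantities, and the same numbers give a long-run availability estimate. The sketch below assumes the common convention that MTBF spans one working period plus one repair period; the function names and the sample values (1000 hours, 2 hours) are hypothetical.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: one run until failure plus one repair."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, takes 2 hours to repair.
example_mtbf = mtbf(1000.0, 2.0)
example_availability = steady_state_availability(1000.0, 2.0)

print(f"MTBF: {example_mtbf:.0f} hours")
print(f"Availability: {example_availability:.4%}")
```

Shortening MTTR improves availability even when MTTF stays the same, which is one reason maintainability appears among the definitions.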

Real Failure Rates
• Standard hard drive MTBF
• 600,000 hours = 68 years!
• 1/68 = 1.5% chance of failure per year

• A big Google data center:

• Has 200,000+ hard drives
• 1.5% x 200,000 = about 3,000 drive crashes per year
• or about 8 disk failures per day

• Failures happen a lot

• Need to design software to be resilient to all types of hardware failures
• Actual failure rates are closer to 3% per year
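The fleet-level numbers follow directly from the MTBF. This is a rough sketch that treats the yearly failure chance as hours-per-year divided by MTBF, which is a simplification that only holds when the rate is small and constant; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_failure_rate(mtbf_hours: float) -> float:
    """Approximate chance that a single drive fails within one year."""
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected number of drive failures across the whole fleet in a year."""
    return drive_count * annual_failure_rate(mtbf_hours)


rated = expected_failures_per_year(200_000, 600_000)  # from the rated MTBF
per_day = rated / 365
observed = 200_000 * 0.03  # using the ~3% per year seen in practice

print(f"Rated MTBF:    {rated:.0f} failures/year, {per_day:.1f} per day")
print(f"Observed rate: {observed:.0f} failures/year, {observed / 365:.1f} per day")
```

At the observed 3% rate, the same data center would see about twice as many failures per day as the rated MTBF suggests.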

Reliability Challenges
• Typical failures in one year of a Google data center:
• 1000 individual machine failures
• thousands of hard drive failures
• 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly disappear, budget 6 hours to come back)
• 1 rack-reorganization (You have plenty of warning: 500-1000 machines powered down, about 6 hours)
• 1 network rewiring (rolling 5% of machines down over 2-day span)
• 20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• 5 racks go wonky (40-80 machines see 50% packet loss)
• 8 network maintenances (4 might cause ~30-minute random connectivity losses)
• 12 router reloads (takes out DNS and external virtual IP addresses (VIPs) for a couple minutes)
• 3 router failures (have to immediately pull traffic for an hour)
• 0.5 overheating events (power down most machines in under five minutes, expect 1-2 days to recover)
• dozens of minor 30-second blips for DNS

Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
Types of Failures
• Systems can fail in different ways

• How?


• Crash failure
• Timing failure
• Content failure
• Malicious failure

• Are some easier to deal with than others?

Fault Tolerance through Replication
• We can handle failures through redundancy

• Have multiple replicas run the program

• May want to keep them away from each other
• May want to use different hardware platforms
• May want to use different software implementations

Fault Detection
• Detecting a fault can be difficult
• Crash failure
• Timing failure
• Content failure
• Malicious failure

• Approaches:
• Heartbeat messages
• Adaptive timeouts
• Voting / Quorums
• Authentication / signatures

Detection is Hard
• Or maybe even impossible

• How long should we set a timeout?

• How do we know heartbeat messages will go through?

Two Generals Problem

[Figure: Ninjas, The Enemy, Pirates]

• The Ninja general and the Pirate general need to coordinate an attack
• Can (try to) send messengers back and forth
• Messengers can be shot
• How can they guarantee they will both attack at the same time?
Two Generals Problem
• We need to worry about physical characteristics when we build systems
• Packets can be lost, delayed, reordered
• Disks can be slow, fail, or crash

• Or things can be actively malicious

• Big trouble...

• What kinds of assumptions do we need to make?

• Network is ordered and reliable, but may be slow
• We can quantify the expected number of nodes that will fail

Fault Tolerance through Replication
• How to tolerate a crash failure?

Inputs --> P1: 2+2=4 --> output = 4
       --> P2: crash

• f+1 replicas

• How to tolerate a content failure?

Inputs --> P1: 2+2=4 --+
       --> P2: 2+2=5 --+--> Majority Voter --> output = 4
       --> P3: 2+2=4 --+

• 2f+1 replicas

• How many replicas to tolerate f such failures?
• f+1 for crash failures, 2f+1 for content failures
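The majority voter for content failures can be sketched in a few lines. The function names are hypothetical, and the voter is assumed to be correct itself, which is exactly the assumption questioned next.

```python
from collections import Counter


def replicas_needed(f: int, failure_type: str) -> int:
    """Replica counts from the slide: f+1 for crashes, 2f+1 for content failures."""
    if failure_type == "crash":
        return f + 1
    if failure_type == "content":
        return 2 * f + 1
    raise ValueError(f"unknown failure type: {failure_type}")


def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three replicas compute 2+2; P2 has a content failure and answers 5.
result = majority_vote([4, 5, 4])
# With only two replicas there is no majority to break the disagreement.
undecided = majority_vote([4, 5])

print(result, undecided)
```

With 2f+1 replicas, the f faulty outputs are always outnumbered by the f+1 correct ones, so the voter can pick the right value.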
Agreement without Voters
• We can't always assume there is a perfectly correct voter to validate the answers
• Better: Have replicas reach agreement amongst themselves about what to do
• Exchange calculated value and have each node pick winner

[Figure: A and B exchange the value 4; C sends 5 to both]

Replica   Receives   Action
A         4, 4, 5    = 4
B         4, 4, 5    = 4
C         4, 4, 5    = 4?

Byzantine Generals Problem
• There are N generals making plans for an attack
• They need to decide whether to Attack or Retreat
• Send your vote to everyone (0=retreat, 1=attack)
• But f generals may be traitors that lie and collude
• Can all correct replicas agree on what to do?
• Take majority vote of planned actions

[Figure: A sends 1 and B sends 0 to everyone; C sends 1 to A but 0 to B]

Replica   Receives   Action
A         1, 0, 1    Attack!
B         1, 0, 0    Retreat!
C         1, 0, ?    ???

Majority voting doesn't work if a replica lies!
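The failure can be reproduced with the values from the table. This is an illustrative sketch only: the `sent` dictionary and function names are hypothetical, and treating a tie as a retreat is an assumption that does not affect this example.

```python
from collections import Counter

ATTACK, RETREAT = 1, 0

# sent[sender][receiver]: A and B are honest, C is a traitor that tells
# A "attack" and B "retreat".
sent = {
    "A": {"A": ATTACK, "B": ATTACK, "C": ATTACK},
    "B": {"A": RETREAT, "B": RETREAT, "C": RETREAT},
    "C": {"A": ATTACK, "B": RETREAT},
}


def received_by(replica):
    """Votes seen by one replica, in sender order A, B, C."""
    return [sent[s][replica] for s in ("A", "B", "C") if replica in sent[s]]


def decide(votes):
    """Simple majority vote; a tie is treated as a retreat (assumption)."""
    counts = Counter(votes)
    return "Attack!" if counts[ATTACK] > counts[RETREAT] else "Retreat!"


decision_a = decide(received_by("A"))  # A sees 1, 0, 1
decision_b = decide(received_by("B"))  # B sees 1, 0, 0

print(decision_a, decision_b)
```

Both honest generals follow the same rule, yet they reach different decisions because the traitor told each of them a different story.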

Byzantine Generals Solved!
• Need more replicas to reach consensus
• Requires 3f+1 replicas to tolerate f byzantine faults
• Step 1: Send your plan to everyone
• Step 2: Send learned plans to everyone
• Step 3: Use majority of each column
[Figure: replicas A, B, C, and D exchanging messages (labeled x, y, z)]

Replica   Receives        Vote
A         A: (1,0,1,1)    A: 1
          B: (1,0,0,1)    B: 0
          C: (1,1,1,1)    C: 1
          D: (1,0,1,1)    D: 1

B         A: (1,0,1,1)    A: 1
          B: (1,0,0,1)    B: 0
          C: (0,0,0,0)    C: 0
          D: (1,0,0,1)    D: 1
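Step 3, taking the majority of each column, can be sketched as follows using the vectors shown for replicas A and B. The function names are hypothetical, and the sketch covers only the column-majority step, not the two message-exchange rounds that produce the vectors.

```python
from collections import Counter


def min_replicas(f: int) -> int:
    """Replicas required to tolerate f Byzantine faults, per the slide."""
    return 3 * f + 1


def column_majority(vectors):
    """For each column (one per general), take the most common reported value."""
    result = []
    for column in zip(*vectors):
        value, _ = Counter(column).most_common(1)[0]
        result.append(value)
    return result


# Vectors of learned plans, in row order A, B, C, D, as received by each replica.
received_by_a = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1), (1, 0, 1, 1)]
received_by_b = [(1, 0, 1, 1), (1, 0, 0, 1), (0, 0, 0, 0), (1, 0, 0, 1)]

votes_a = column_majority(received_by_a)
votes_b = column_majority(received_by_b)

print(votes_a)  # votes attributed to A, B, C, D by replica A
print(votes_b)  # votes attributed to A, B, C, D by replica B
```

In these example vectors, both replicas derive the same votes for A, B, and D; only the entry for C differs.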
Can we make this any easier?
• Fundamental challenge in BFT:
• Nodes can misrepresent what they heard from other nodes

• How can we fix this?

• Have nodes sign messages!

• Then liars can't forge messages with false information

• Crypto actually is useful!
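A small sketch shows why an authenticated message cannot be altered in transit. It uses HMAC from the Python standard library as a stand-in for a real digital signature; that substitution is an assumption made to keep the example self-contained. With a shared-key MAC the verifier holds the same key as the sender, so a real BFT system would more likely use public-key signatures. The key and message contents are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag over the message (stand-in for a signature)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check the tag in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


key_a = b"hypothetical-secret-of-general-A"

original = b"A votes 1 (attack)"
tag = sign(key_a, original)

# A lying relay changes the vote but cannot produce a matching tag.
forged = b"A votes 0 (retreat)"

genuine_ok = verify(key_a, original, tag)
forged_ok = verify(key_a, forged, tag)

print(genuine_ok, forged_ok)
```

A traitor can still send different signed votes of its own to different nodes, but it can no longer misreport what another node said.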

Denial of Service
• Attack to reduce the availability of a service
• Can also cause crashes if software is poorly written

• "Unsophisticated but effective"

• Flood target with traffic
• No easy way to differentiate between a valid request and an attacker

[Figure: Amazon.com]

Sept 2012 DDoS
• Six US banks attacked
• Attacks were announced in advance
• Banks still could not prevent the damage
• Attackers sent 65 gigabits of data per second

• Iranian "Cyber Fighter" group claimed responsibility
• Encouraged members to use Low Orbit Ion Cannon software to flood banks with traffic
• Also used botnets as a traffic source

Sept 2012 DDoS
• But it's not clear if that was the real source...
• Botnet machines have relatively low bandwidth
• Would need 65,000+ compromised machines

• Most traffic to the banks was coming from about 200 IP addresses
• Appear to be a small set of compromised high-powered web servers

• Not clear if Iranian hacker group did all of this or if some other group was the mastermind
• Iranian government fighting against sanctions?
• Eastern European crime groups make fraudulent purchases and then disrupt bank web activity long enough for them to go through

Anonymous (?)
• The Anonymous "hacktivist" group has used DDoS for various political causes
• Members run LOIC software and target a specific site
• "Volunteer botnet"

• But be careful...

• In March 2012 the LOIC software had a trojan

• Ran a DDoS on the enemy...
• And stole your bank and Gmail account info

Defending against DDoS
• Some DDoS traffic can be easily distinguished
• Most web apps can safely ignore ICMP and UDP traffic

• But performance impact will depend on where filtering is performed
• Firewall on server being attacked may limit impact on application, but still clogs network
• Firewall at ISP is much better, but may be under someone else's control
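The "easily distinguished" case can be sketched as a protocol filter. This is a toy model, not a real firewall configuration: the `Packet` type, the allow-list of TCP ports 80 and 443, and the sample addresses (taken from documentation-reserved ranges) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source_ip: str
    protocol: str   # "TCP", "UDP", or "ICMP"
    dest_port: int  # 0 when the protocol has no ports (ICMP)


# Assumption: the web app only needs HTTP and HTTPS over TCP.
ALLOWED = {("TCP", 80), ("TCP", 443)}


def accept(packet: Packet) -> bool:
    """Keep only traffic the web app can actually use; drop ICMP and UDP."""
    return (packet.protocol.upper(), packet.dest_port) in ALLOWED


traffic = [
    Packet("192.0.2.10", "TCP", 443),     # ordinary HTTPS request
    Packet("198.51.100.7", "UDP", 53),    # UDP flood traffic
    Packet("198.51.100.8", "ICMP", 0),    # ping flood traffic
    Packet("203.0.113.5", "TCP", 80),     # HTTP: valid user or attacker?
]

kept = [p for p in traffic if accept(p)]
print(f"kept {len(kept)} of {len(traffic)} packets")
```

The last packet illustrates the limit of this approach: a flood of well-formed HTTP requests passes the filter, and wherever the filter runs, the dropped packets have already consumed bandwidth on every link before it.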

[Figure: ISP, Amazon.com]

Summary
• Software systems must worry about:
• Hardware and software failures
• Service availability
• Malicious attacks that affect reliability and/or availability

• Approaches:
• Redundancy
• Fault mitigation
• Fault detection
