Fault Tolerance in Distributed Systems
ICS 230
Prof. Nalini Venkatasubramanian
(with some slides modified from Prof.
Ghosh, University of Iowa)
Fundamentals
What is a fault?
A fault is a blemish, weakness, or shortcoming of a
particular hardware or software component.
Fault, error and failures
Why fault tolerance?
Availability, reliability, dependability, …
How to provide fault tolerance?
Replication
Checkpointing and message logging
Hybrid
Reliability
Reliability is an emerging and critical concern in
traditional and new settings
Transaction processing, mobile applications, cyberphysical
systems
New enhanced technology makes devices vulnerable to
errors due to high complexity and high integration
Technology scaling causes problems
Exponential increase of soft error rate
Mobile/pervasive applications running close to humans
E.g., failure of healthcare devices can have serious consequences
Redundancy techniques incur high overheads of power and
performance
TMR (Triple Modular Redundancy) may exceed 200% overheads
without optimization [Nieuwland, 06]
Challenging to optimize multiple properties (e.g.,
performance, QoS, and reliability)
Classification of failures
Byzantine failure
Transient failure
Not Heisenberg
[Figure: example faults by system layer. Application: bug. Middleware/OS: exception. Network: packet loss. Hardware: soft error.]
Hardware Errors and Error Control Schemes
Failures: soft errors, hard failures, system crash
Causes: external radiation, thermal effects, power loss, poor design, aging
Metrics: FIT, MTTF, MTBF
Traditional approaches: spatial redundancy (TMR, duplex, RAID-1, etc.) and data redundancy (EDC, ECC, RAID-5, etc.)
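The metrics listed above are related by standard definitions, sketched here as a few helper functions. The function names and the 1000 FIT example part are illustrative assumptions, not from the slides. The conversion from FIT to MTTF assumes a constant failure rate, and MTTR (mean time to recovery) is the repair time of a repairable system.

```python
HOURS_PER_BILLION = 1e9  # FIT = expected failures per 10^9 device-hours


def fit_to_mttf_hours(fit: float) -> float:
    """MTTF in hours, assuming a constant failure rate given in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def mtbf_hours(mttf: float, mttr: float) -> float:
    """MTBF of a repairable system: time to fail plus time to recover."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Steady-state fraction of time the system is up."""
    return mttf / (mttf + mttr)


# Hypothetical part rated at 1000 FIT, with a 10-hour recovery time.
mttf = fit_to_mttf_hours(1000.0)
print(mttf, mtbf_hours(mttf, 10.0), availability(mttf, 10.0))
```

Availability depends on both terms: a system can raise it either by failing less often (higher MTTF) or by recovering faster (lower MTTR).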
[Figure: transistor; annotations "5 hours MTTF" and "1 month MTTF"]
Failures: wrong outputs, infinite loops, crash
Causes: incomplete specification, poor software design, bugs, unhandled exception
Metrics: number of bugs/Klines, QoS, MTTF, MTBF
Traditional approaches: spatial redundancy (N-version programming, etc.), temporal redundancy (checkpoints and backward recovery, etc.)
Software errors become dominant as system complexity increases
(e.g.) Several bugs per thousand lines of code
Hard to debug, and redundancy techniques are expensive
(e.g.) Backward recovery with checkpoints is inappropriate for real-time applications
Memory leak
Processes fail to entirely free up the physical memory that has
been allocated to them. This effectively reduces the size of the
available physical memory over time. When this becomes smaller
than the minimum memory needed to support an application, it
crashes.
Failures: data losses, deadline misses, node (link) failure, system down
Causes: network congestion, noise/interference, malicious attacks
Metrics: packet loss rate, deadline miss rate, SNR, MTTF, MTBF, MTTR
Traditional approaches: resource reservation, data redundancy (CRC, etc.), temporal redundancy (retransmission, etc.), spatial redundancy (replicated nodes, MIMO, etc.)
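Two of the approaches listed above, data redundancy (CRC) and temporal redundancy (retransmission), are often combined. The sketch below is a minimal illustration under assumed conditions: the frame layout, the `lossy_channel` fault model, and the probabilities are hypothetical, and only `zlib.crc32` is a real library call.

```python
import random
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (data redundancy)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def lossy_channel(frame: bytes, rng: random.Random, p_drop: float, p_flip: float):
    """Hypothetical channel: may drop the frame or flip one bit in it."""
    r = rng.random()
    if r < p_drop:
        return None  # omission error: the frame is lost
    if r < p_drop + p_flip:
        corrupted = bytearray(frame)
        corrupted[rng.randrange(len(corrupted))] ^= 1 << rng.randrange(8)
        return bytes(corrupted)
    return frame


def send_reliably(payload, rng, max_tries=10, p_drop=0.3, p_flip=0.3):
    """Retransmit (temporal redundancy) until a frame passes the CRC check."""
    frame = make_frame(payload)
    for attempt in range(1, max_tries + 1):
        received = lossy_channel(frame, rng, p_drop, p_flip)
        if received is not None:
            data = check_frame(received)
            if data is not None:
                return data, attempt
    raise TimeoutError("no intact frame after max_tries attempts")
```

The CRC only detects corruption; recovery comes from sending the frame again, which costs time and is why retransmission can cause deadline misses.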
•SNR: Signal-to-Noise Ratio
•MTTR: Mean Time To Recovery
•CRC: Cyclic Redundancy Check
•MIMO: Multiple-Input Multiple-Output
Omission errors: lost/dropped messages
The network is unreliable (especially wireless networks)
Buffer overflow, collisions at the MAC layer, receiver out of range
Joint approaches across OSI layers have been investigated for
Classifying fault-tolerance
Masking tolerance.
Application runs as it is. The failure does not have a visible impact.
All properties (both liveness & safety) continue to hold.
Non-masking tolerance.
Safety property is temporarily affected, but not liveness.
Fail-safe tolerance
A given safety predicate is preserved, but liveness may be affected
Example. Due to failure, no process can enter its critical section for
an indefinite period. In a traffic crossing, failure changes the traffic in
both directions to red.
Graceful degradation
Application continues, but in a “degraded” mode. Much depends on
what kind of degradation is acceptable.
[Figure: numbered events (0 to 7) along a time axis, with a rollback]
Forward error recovery
Computation does not care about getting the history right, but
moves on, as long as eventually the safety property is restored.
True for self-stabilizing systems.
Conventional Approaches
Build redundancy into hardware/software
Modular Redundancy, N-Version Programming
TMR (Triple Modular Redundancy) can incur 200% overheads
without optimization.
Replication of tasks and processes may result in
overprovisioning
Error Control Coding
Checkpointing and rollbacks
Usually accomplished through logging (e.g. messages)
Backward Recovery with Checkpoints cannot guarantee the
completion time of a task.
Hybrid
Recovery Blocks
1) Modular Redundancy
Multiple identical replicas of hardware modules
Voter mechanism
Compare outputs and select the correct output
Tolerate most hardware faults
Effective but expensive
[Figure: Producer A and Producer B send data to a voter, which forwards the selected output to the Consumer; a fault strikes one producer]
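The voter mechanism can be sketched as a strict-majority vote over replica outputs. This is a minimal software illustration, not a hardware design: the replica functions and the stuck-bit fault model are hypothetical, and outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    if not outputs:
        raise ValueError("need at least one replica output")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x):
    return x * x


def faulty(x):
    # Hypothetical fault model: one bit of the result is flipped.
    return (x * x) ^ 0b100


print(run_tmr([healthy, faulty, healthy], 3))  # the faulty replica is outvoted
```

With three replicas, one faulty output is masked; two simultaneous faults leave no trustworthy majority, and the voter itself remains a single point of failure.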
2) N-version Programming
Different versions by different teams
Tolerate some software bugs
[Figure: versions written by Programmer K and Programmer L process the same data]
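As a minimal sketch of the idea, the example below runs three independently written versions of the same specification (integer square root) and accepts the majority result. The three versions, including the deliberately buggy one, are hypothetical stand-ins for code written by different teams.

```python
import math
from collections import Counter


def isqrt_library(n: int) -> int:
    """Version 1: relies on the standard library."""
    return math.isqrt(n)


def isqrt_binary_search(n: int) -> int:
    """Version 2: independent algorithm, largest r with r * r <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_buggy(n: int) -> int:
    """Version 3: deliberately buggy, rounds to nearest instead of down."""
    return round(math.sqrt(n))


def n_version(versions, n):
    """Run all versions on the same input and return the majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree and no majority exists")
    return value


print(n_version([isqrt_library, isqrt_binary_search, isqrt_buggy], 15))
```

Unlike identical hardware replicas, the versions differ in design, so a bug in one is outvoted only if the other teams did not make the same mistake.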
3) Error-Control Coding
Replication is effective but expensive
Error-Detection Coding
(example) Parity bit, Hamming code, CRC
Much less redundancy than replication
[Figure: data protected by error-control coding against a fault]
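To make the redundancy comparison concrete, here is a minimal sketch of a Hamming(7,4) code, which adds 3 check bits to 4 data bits and corrects any single flipped bit. The function names and the bit ordering (parity bits at positions 1, 2, and 4) are choices made for this illustration.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 1-based; 0 means no error seen
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1  # inject a single-bit fault at position 6
print(hamming74_decode(codeword))
```

Three check bits per four data bits is a 75% storage overhead, compared with 200% for keeping three full copies; the price is that the code handles only one flipped bit per codeword.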
Conventional Protection for Caches
Cache is the component most hit by soft errors
Conventional Protected Caches
Unaware of application
Unaware of fault tolerance at applications
Implement a redundancy technique such as ECC to protect all data for every access
Overkill for multimedia applications
ECC (e.g., a Hamming code) incurs a performance penalty of up to 95%, power overhead of up to 22%, and area cost of up to 25%
[Figure: cache with ECC; high cost]
PPC (Partially Protected Caches)
Observation
Not all data are equally failure critical
Multimedia data vs. control variables
Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, CASES06][Lee, TVLSI08]
Unprotected cache and Protected cache at the same level of memory hierarchy
Protected cache is typically smaller to keep power and delay the same as or less than those of Unprotected cache
[Figure: PPC with an Unprotected Cache and a Protected Cache in front of Memory]
PPC for Multimedia Applications
Hardware layer exploiting error-tolerance of multimedia data at application layer
Simple data partitioning for multimedia applications
Multimedia data is failure non-critical
All other data is failure critical
[Figure labels: Memory, Power/Delay Reduction, Fault Tolerance]
PPC for general purpose apps
Not all data are equally failure critical
Propose a PPC architecture to provide unequal protection
Support unequal protection at hardware layer by exploiting error-tolerance and vulnerability at application
DPExplore [Lee, PPCDIPES08]
Explore partitioning space by exploiting vulnerability of each data page
Vulnerable time
Data is vulnerable for the time until it is eventually read by the CPU or written back to Memory
Pages causing high vulnerable time are failure critical
[Figure: Application Data & Code, characterized by error-tolerance of MM data and vulnerability of data & code, feed Page Partitioning Algorithms, which label pages Failure Non-Critical (FNC) or Failure Critical (FC); FNC & FC are mapped into the Unprotected & Protected Caches of the PPC]
CC-PROTECT
An approach in which existing schemes cooperate across layers to mitigate the impact of soft errors on the failure rate and video quality in mobile video encoding systems
PPC (Partially Protected Caches) with EDC (Error Detection Codes) at hardware layer
DFR (Drop and Forward Recovery) at middleware
PBPAIR (Probability-Based Power Aware Intra Refresh) at application layer
Demonstrate the effectiveness of low-cost (about 50%) reliability (1,000x) at a minimal cost in QoS (less than 1%)
[Figure: layers of a Mobile Video Application over Error-prone Networks. Application: PBPAIR (error resilience). Middleware/OS: DFR (error correction). Hardware: Unprotected Cache and Protected Cache, with ECC/EDC.]
CC-PROTECT
[Figure: CC-PROTECT architecture. Labels: Original Video, QoS Loss, Mobile Video Application (frame K, frame K+1), Error-prone Networks, MW/OS (Monitor & Trigger, Translate SER, Selective DFR, Feedback, Parameter, Support EAVE & PPC), BER (Backward Error Recovery), DFR (Drop & Forward Recovery), Soft Error, EDC error detection, PPC (Unprotected Cache, Protected Cache), Application (Error-Prone or Error-Resilient)]
[Chart: impact of EDC + DFR + PBPAIR (CC-PROTECT), including energy saving. Values as extracted: 36%, 56%, 17% reduction compared to HW-PROTECT; 26%, 49%, 4% reduction compared to BASE.]
BASE = Error-prone video encoding + unprotected cache
HW-PROTECT = Error-prone video encoding + PPC with ECC
APP-PROTECT = Error-resilient video encoding + unprotected cache
MULTI-PROTECT = Error-resilient video encoding + PPC with ECC
CC-PROTECT1 = Error-prone video encoding + PPC with EDC
CC-PROTECT2 = Error-prone video encoding + PPC with EDC + DFR
CC-PROTECT = Error-resilient video encoding + PPC with EDC + DFR
[Figure label: Hardware (Unprotected or Protected)]
4) Checkpoints and Rollbacks
Checkpoint
A copy of an application's state
Save it in storage immune to the failures
Rollback
Restart the execution from a saved checkpoint
[Figure: Producer A sends data to the Consumer; Application State K is checkpointed; a rollback returns execution from state K to state (K-1)]
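The checkpoint and rollback steps can be sketched with a toy computation that sums a list while surviving one transient fault. Everything here is an illustrative assumption: the `stable_storage` variable stands in for failure-immune storage, and the fault is injected artificially at a chosen step.

```python
import copy


def run_with_recovery(inputs, fail_at, checkpoint_every=2):
    """Sum inputs, surviving one transient fault injected at step fail_at."""
    state = {"next": 0, "total": 0}
    stable_storage = copy.deepcopy(state)  # initial checkpoint
    rollbacks = 0
    pending_fault = fail_at
    while state["next"] < len(inputs):
        i = state["next"]
        try:
            if i == pending_fault:
                pending_fault = None   # transient: the fault strikes only once
                state["total"] = -999  # volatile state is corrupted
                raise RuntimeError("transient fault")
            state["total"] += inputs[i]
            state["next"] = i + 1
            if state["next"] % checkpoint_every == 0:
                stable_storage = copy.deepcopy(state)  # take a checkpoint
        except RuntimeError:
            state = copy.deepcopy(stable_storage)  # roll back and re-execute
            rollbacks += 1
    return state["total"], rollbacks


print(run_with_recovery([1, 2, 3, 4, 5], fail_at=3))
```

In this run the fault at step 3 forces step 2 to be executed again, since the last checkpoint was taken after step 1. That repeated work is the reason backward recovery cannot bound a task's completion time.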
[Figure: Egida (UT Austin); labels: Request handler, Communication Layer, IP Address, Network]