
Fault Tolerance in Distributed Systems

ICS 230
Prof. Nalini Venkatasubramanian
(with some slides modified from Prof. Ghosh, University of Iowa)
Fundamentals

 What is a fault?
  A fault is a blemish, weakness, or shortcoming of a particular hardware or software component.
  Fault, error, and failure are distinct concepts.
 Why fault tolerance?
  Availability, reliability, dependability, …
 How to provide fault tolerance?
  Replication
  Checkpointing and message logging
  Hybrid approaches
Reliability

 Reliability is an emerging and critical concern in traditional and new settings
  Transaction processing, mobile applications, cyber-physical systems
 New enhanced technology makes devices vulnerable to errors due to high complexity and high integration
  Technology scaling causes problems: an exponential increase in the soft error rate
 Mobile/pervasive applications run close to humans
  E.g., failure of a healthcare device can have serious consequences
 Redundancy techniques incur high power and performance overheads
  TMR (Triple Modular Redundancy) may exceed 200% overhead without optimization [Nieuwland, 06]
 It is challenging to optimize multiple properties (e.g., performance, QoS, and reliability)
Classification of failures

 Crash failure
 Omission failure
 Byzantine failure
 Software failure
 Security failure
 Temporal failure
 Transient failure
 Environmental perturbations

Crash failures

 Crash failure: the process halts. It is irreversible.

 In a synchronous system, it is easy to detect a crash failure (using heartbeat signals and timeouts). But in asynchronous systems, detection is never accurate, since it is not possible to distinguish between a process that has crashed and a process that is running very slowly.

 Some failures may be complex and nasty. Fail-stop failure is a simple abstraction that mimics crash failure even when program execution becomes arbitrary. Implementations help detect which processor has failed. If a system cannot tolerate fail-stop failure, then it cannot tolerate crashes.
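The heartbeat-and-timeout idea above can be sketched in a few lines. This is a minimal illustration, not part of the slides; the process names, period, and timeout values are assumptions. With a bounded-delay (synchronous) channel the timeout can be chosen so that suspicion is accurate; in an asynchronous system a slow process can exceed any fixed timeout, which is exactly the inaccuracy described above.

```python
import time

# Illustrative sketch of timeout-based crash detection: each process is
# expected to send a heartbeat periodically; a process silent longer
# than TIMEOUT is suspected of having crashed.
TIMEOUT = 3.0  # assumed bound on heartbeat period + message delay

class HeartbeatDetector:
    def __init__(self, processes, timeout=TIMEOUT, clock=time.monotonic):
        self.clock = clock
        self.timeout = timeout
        now = clock()
        self.last_seen = {p: now for p in processes}

    def on_heartbeat(self, process):
        # Record the arrival time of a heartbeat from `process`.
        self.last_seen[process] = self.clock()

    def suspects(self):
        # Suspect every process silent for longer than the timeout.
        now = self.clock()
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}
```

The injectable `clock` parameter is purely for testability; a real detector would simply read the wall clock.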
Transient failure

 (Hardware) An arbitrary perturbation of the global state. May be induced by power surges, weak batteries, lightning, radio-frequency interference, cosmic rays, etc.

 (Software) Heisenbugs are a class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible, so they are harder to detect during the testing phase.

 Over 99% of bugs in IBM DB2 production code are non-deterministic and transient (Jim Gray).
Temporal failures

 Inability to meet deadlines – correct results are generated, but too late to be useful. Very important in real-time systems.

 May be caused by poor algorithms, poor design strategy, or loss of synchronization among the processor clocks.
Byzantine failure

 Anything goes! Includes every conceivable form of erroneous behavior. It is the weakest failure model, in that it makes the fewest assumptions about how a faulty process behaves.

 Numerous possible causes, including malicious behaviors (like a process executing a different program instead of the specified one).

 The most difficult kind of failure to deal with.
Errors/Failures across system layers

 Faults or errors can cause failures at every layer:
  Application – bugs
  Middleware/Network – packet loss
  OS – exceptions
  Hardware – soft errors
Hardware Errors and Error Control Schemes

 Failures: soft errors, hard failures, system crash
 Causes: external radiation, thermal effects, power loss, poor design, aging
 Metrics: FIT, MTTF, MTBF
 Traditional approaches: spatial redundancy (TMR, duplex, RAID-1, etc.) and data redundancy (EDC, ECC, RAID-5, etc.)

 Hardware failures are increasing as technology scales
  (e.g.) SER increases by up to 1000 times [Mastipuram, 04]
 Redundancy techniques are expensive
  (e.g.) ECC-based protection in caches can incur a 95% performance penalty [Li, 05]

 FIT: Failures in Time (errors per 10^9 hours); MTTF: Mean Time To Failure; MTBF: Mean Time Between Failures; TMR: Triple Modular Redundancy; EDC: Error Detection Codes; ECC: Error Correction Codes; RAID: Redundant Array of Inexpensive Disks
Soft Errors (Transient Faults)

 SER increases exponentially as technology scales
  Due to integration, voltage scaling, altitude, latitude
 Caches are hit hardest because:
  They occupy a large portion of the processor (more than 50%)
  They have no masking effects (e.g., logical masking)

 [Figure: a particle strike flips a bit in a transistor; MTTF figures for the Intel Itanium II processor [Baumann, 05]. MTTF: Mean Time To Failure]
Soft errors

 Configuration                              | SER (FIT)                  | MTTF         | Reason
 1 Mbit @ 0.13 µm                           | 1000                       | 104 years    |
 64 MB @ 0.13 µm                            | 64x8x1000                  | 81 days      | High integration
 128 MB @ 65 nm                             | 2x1000x64x8x1000           | 1 hour       | Technology scaling and twice the integration
 A system @ 65 nm                           | 2x2x1000x64x8x1000         | 30 minutes   | Memory takes up 50% of soft errors in a system
 A system with voltage scaling @ 65 nm      | 100x2x2x1000x64x8x1000     | 18 seconds   | Exponential relationship b/w SER & supply voltage
 A system with voltage scaling in flight
 (35,000 ft, high altitude) @ 65 nm         | 800x100x2x2x1000x64x8x1000 | 0.02 seconds | High intensity of neutron flux at flight altitude

 Soft Error Rate (SER) – FIT (Failures in Time) = number of errors in 10^9 hours
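The FIT and MTTF columns above are related by simple arithmetic, which can be checked directly (a small sketch; the values are taken from the table rows above):

```python
# FIT counts failures per 10^9 device-hours, so MTTF in hours is 1e9 / SER.
def mttf_hours(ser_fit):
    return 1e9 / ser_fit

ser_64mb = 64 * 8 * 1000                   # 64 MB @ 0.13 µm
print(mttf_hours(ser_64mb) / 24)           # ~81 days

ser_system = 2 * 2 * 1000 * 64 * 8 * 1000  # a system @ 65 nm
print(mttf_hours(ser_system) * 60)         # ~30 minutes
```

Each extra integration, scaling, voltage, or altitude factor multiplies the SER and divides the MTTF accordingly.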
Software Errors and Error Control Schemes

 Failures: wrong outputs, infinite loops, crash
 Causes: incomplete specification, poor software design, bugs, unhandled exceptions
 Metrics: number of bugs per kilo-lines, QoS, MTTF, MTBF
 Traditional approaches: spatial redundancy (N-version programming, etc.), temporal redundancy (checkpoints and backward recovery, etc.)

 Software errors become dominant as system complexity increases
  (e.g.) Several bugs per kilo-lines of code
 Software is hard to debug, and redundancy techniques are expensive
  (e.g.) Backward recovery with checkpoints is inappropriate for real-time applications

 QoS: Quality of Service


Software failures

 Coding error or human error
  On September 23, 1999, NASA lost the $125 million Mars orbiter spacecraft because one engineering team used metric units while another used English units, leading to a navigation fiasco that caused it to burn up in the atmosphere.

 Design flaws or inaccurate modeling
  The Mars Pathfinder mission landed flawlessly on the Martian surface on July 4, 1997. However, its communication later failed due to a design flaw in the real-time embedded software kernel VxWorks. The problem was diagnosed as priority inversion: a medium-priority task could preempt a high-priority one.
Software failures

 Memory leak
  Processes fail to entirely free the physical memory that has been allocated to them. This effectively reduces the amount of available physical memory over time. When it becomes smaller than the minimum memory needed to support an application, the application crashes.

 Incomplete specification (example: Y2K)
  Year = 99 (1999 or 2099?)

 Many failures (like crash, omission, etc.) can be caused by software bugs too.
Network Errors and Error Control Schemes

 Failures: data losses, deadline misses, node (link) failure, system down
 Causes: network congestion, noise/interference, malicious attacks
 Metrics: packet loss rate, deadline miss rate, SNR, MTTF, MTBF, MTTR
 Traditional approaches: resource reservation, data redundancy (CRC, etc.), temporal redundancy (retransmission, etc.), spatial redundancy (replicated nodes, MIMO, etc.)

 Omission errors – lost/dropped messages
  The network is unreliable (especially wireless networks)
  Buffer overflow, collisions at the MAC layer, receiver out of range
 Joint approaches across OSI layers have been investigated

 SNR: Signal-to-Noise Ratio; MTTR: Mean Time To Recovery; CRC: Cyclic Redundancy Check; MIMO: Multiple-Input Multiple-Output
Classifying fault-tolerance

 Masking tolerance
  The application runs as-is. The failure does not have a visible impact. All properties (both liveness & safety) continue to hold.

 Non-masking tolerance
  The safety property is temporarily affected, but not liveness.
  Example 1: Clocks lose synchronization, but recover soon thereafter.
  Example 2: Multiple processes temporarily enter their critical sections, but thereafter normal behavior is restored.
Classifying fault-tolerance

 Fail-safe tolerance
  A given safety predicate is preserved, but liveness may be affected.
  Example: Due to failure, no process can enter its critical section for an indefinite period. In a traffic crossing, failure changes the lights in both directions to red.

 Graceful degradation
  The application continues, but in a “degraded” mode. Much depends on what kind of degradation is acceptable.
  Example: Consider message-based mutual exclusion. Processes will enter their critical sections, but not in timestamp order.
Failure detection

 The design of fault-tolerant algorithms is simpler if processes can detect failures.
  In synchronous systems with bounded-delay channels, crash failures can definitely be detected using timeouts.
  In asynchronous distributed systems, the detection of crash failures is imperfect. Two properties characterize a detector:
  Completeness – every crashed process is suspected.
  Accuracy – no correct process is suspected.
Example

 [Figure: a system of eight processes, numbered 0–7]

 Process 0 suspects {1,2,3,7} to have failed. Does this satisfy completeness? Does this satisfy accuracy?
Classification of completeness

 Strong completeness: every crashed process is eventually suspected by every correct process, and remains a suspect thereafter.

 Weak completeness: every crashed process is eventually suspected by at least one correct process, and remains a suspect thereafter.

 Note that we don’t care what mechanism is used for suspecting a process.
Classification of accuracy

 Strong accuracy: no correct process is ever suspected.

 Weak accuracy: there is at least one correct process that is never suspected.
Eventual accuracy

 A failure detector is eventually strongly accurate if there exists a time T after which no correct process is suspected. (Before that time, a correct process may be added to and removed from the list of suspects any number of times.)

 A failure detector is eventually weakly accurate if there exists a time T after which at least one correct process is no longer suspected.
Classifying failure detectors

 Perfect P: (strongly) complete and strongly accurate
 Strong S: (strongly) complete and weakly accurate
 Eventually perfect ◊P: (strongly) complete and eventually strongly accurate
 Eventually strong ◊S: (strongly) complete and eventually weakly accurate

 Other classes are feasible: W (weak completeness and weak accuracy) and ◊W.
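The eventual-accuracy classes above are usually realized with an adaptive timeout: whenever a suspected process turns out to be alive, its timeout is increased, so after delays stabilize each correct process is eventually never suspected again. A minimal sketch of that idea (the process names and doubling policy are illustrative, not from the slides):

```python
# Illustrative adaptive-timeout detection in the style of an eventually
# perfect detector (◊P): false suspicions cause the per-process timeout
# to grow, so a correct-but-slow process is eventually trusted forever.
class AdaptiveDetector:
    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_seen = {p: 0.0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, p, now):
        self.last_seen[p] = now
        if p in self.suspected:
            # The suspicion was false: the process was alive after all.
            self.suspected.discard(p)
            self.timeout[p] *= 2  # don't suspect it this quickly again

    def tick(self, now):
        # Suspect every process silent past its current timeout.
        for p, t in self.last_seen.items():
            if now - t > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected
```

Before the stabilization time T a correct process may move in and out of the suspect list any number of times, exactly as the eventual-accuracy definition allows.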
Backward vs. forward error recovery

 Backward error recovery
  When the safety property is violated, the computation rolls back and resumes from a previous correct state.

 Forward error recovery
  The computation does not care about getting the history right, but moves on, as long as the safety property is eventually restored. True for self-stabilizing systems.
Conventional Approaches

 Build redundancy into hardware/software
  Modular redundancy, N-version programming
  TMR (Triple Modular Redundancy) can incur 200% overhead without optimization
  Replication of tasks and processes may result in overprovisioning
  Error-control coding
 Checkpointing and rollbacks
  Usually accomplished through logging (e.g., of messages)
  Backward recovery with checkpoints cannot guarantee the completion time of a task
 Hybrid
  Recovery blocks
1) Modular Redundancy

 Multiple identical replicas of hardware modules
 A voter mechanism compares the outputs and selects the correct one
 Tolerates most hardware faults
 Effective but expensive

 [Figure: Producers A and B feed a voter, which delivers data to the Consumer despite a fault in one producer]
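The voter at the heart of modular redundancy is just majority selection over replica outputs. A minimal sketch (the replica functions are illustrative stand-ins for hardware modules):

```python
from collections import Counter

# Illustrative TMR-style majority voter: several replicas compute the
# same function; the voter returns the majority output, masking a fault
# in any single replica.
def vote(replicas, x):
    outputs = [r(x) for r in replicas]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: too many replicas failed")
    return value

good = lambda x: x * x
faulty = lambda x: x * x + 1  # illustrative faulty replica
print(vote([good, good, faulty], 5))  # 25 despite one faulty replica
```

With three replicas the voter masks one arbitrary fault; the cost is the 200% hardware overhead noted above.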
2) N-version Programming

 Different versions are built by different teams
  Different versions are unlikely to contain the same bugs
 A voter mechanism selects among the outputs
 Tolerates some software bugs

 [Figure: programs i and j, written by programmers K and L, feed a voter that delivers data to the Consumer]
3) Error-Control Coding

 Replication is effective but expensive
 Error-detection coding and error-correction coding
  (Examples) Parity bit, Hamming code, CRC
 Much less redundancy than replication

 [Figure: Producer A attaches error-control data, which the Consumer uses to detect or correct a fault]
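The redundancy saving can be made concrete with the classic Hamming(7,4) code mentioned above: 3 parity bits protect 4 data bits and let the receiver locate and correct any single-bit error, versus 200% overhead for triplication. A minimal sketch (bit lists rather than hardware):

```python
# Hamming(7,4): parity bits at positions 1, 2, and 4 each cover a group
# of codeword positions; the syndrome computed at the receiver is the
# (1-based) position of a single flipped bit, or 0 if the word is clean.
def encode(d):  # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7

def decode(c):
    # Re-check each parity group to form the syndrome.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # group {1,3,5,7}
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # group {2,3,6,7}
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # group {4,5,6,7}
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1         # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]]  # extract the data bits
```

Detection-only schemes (a single parity bit, CRC) are cheaper still, which is why the PPC work later in the deck pairs EDC with a recovery mechanism instead of using full ECC everywhere.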
Conventional Protection for Caches

 The cache is the structure hit hardest by soft errors
 Conventional protected caches
  Unaware of fault tolerance at the application level
  Implement a redundancy technique such as ECC to protect all data on every access
  Overkill for multimedia applications
  ECC (e.g., a Hamming code) incurs a performance penalty of up to 95%, power overhead of up to 22%, and area cost of up to 25%

 [Figure: a cache with ECC on all data – high cost]
PPC (Partially Protected Caches)

 Observation: not all data are equally failure-critical
  Multimedia data vs. control variables
 Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, CASES06][Lee, TVLSI08]
  An unprotected cache and a protected cache at the same level of the memory hierarchy
  The protected cache is typically smaller, to keep power and delay the same as or less than those of the unprotected cache

 [Figure: PPC – an unprotected cache and a protected cache side by side in front of memory]
PPC for Multimedia Applications

 Propose selective data protection [Lee, CASES06]
  Unequal protection at the hardware layer, exploiting the error-tolerance of multimedia data at the application layer
 Simple data partitioning for multimedia applications
  Multimedia data is failure non-critical
  All other data is failure critical

 [Figure: PPC with unprotected and protected caches in front of memory; the partitioning yields power/delay reduction together with fault tolerance]
PPC for general-purpose apps

 All data are not equally failure-critical
 Propose a PPC architecture to provide unequal protection
  Support unequal protection at the hardware layer by exploiting error-tolerance and vulnerability at the application layer
 DPExplore [Lee, DIPES08]
  Explore the partitioning space by exploiting the vulnerability of each data page
 Vulnerable time
  A datum is vulnerable during the time before it is eventually read by the CPU or written back to memory
  Pages causing high vulnerable time are failure critical

 [Figure: application data & code (error-tolerance of multimedia data, vulnerability of data & code) are partitioned by page-partitioning algorithms into failure non-critical (FNC) and failure critical (FC) pages, which are mapped to the unprotected and protected caches, respectively]
CC-PROTECT

 An approach that coordinates existing schemes across layers to mitigate the impact of soft errors on the failure rate and video quality in mobile video encoding systems
  PPC (Partially Protected Caches) with EDC (Error Detection Codes) at the hardware layer
  DFR (Drop and Forward Recovery) at the middleware layer
  PBPAIR (Probability-Based Power-Aware Intra Refresh) at the application layer
 Demonstrates the effectiveness of low-cost (about 50%) reliability improvement (1,000x) at minimal QoS cost (less than 1%)

 [Figure: layer stack for a mobile video application over error-prone networks – PBPAIR provides error resilience at the application layer, DFR provides error correction at the middleware/OS layer, and a PPC (unprotected cache + protected cache with EDC) sits at the hardware layer]

CC-PROTECT

 [Figure: the Error-Aware Video Encoder (EAVE) combines an error controller (e.g., frame drop) with an error-resilient encoder (e.g., PBPAIR) to produce error-aware video that tolerates frame drops and packet losses over error-prone networks, at some QoS loss. When the EDC in the PPC detects a soft error, the MW/OS layer monitors and triggers either BER (Backward Error Recovery) or selective DFR (Drop & Forward Recovery), feeding encoding parameters back to the application.]

 Energy saving: reductions of 17%–56% compared to HW-PROTECT and 4%–49% compared to BASE, depending on configuration.
 Configurations:
  BASE = error-prone video encoding + unprotected cache
  HW-PROTECT = error-prone video encoding + PPC with ECC
  APP-PROTECT = error-resilient video encoding + unprotected cache
  MULTI-PROTECT = error-resilient video encoding + PPC with ECC
  CC-PROTECT1 = error-prone video encoding + PPC with EDC
  CC-PROTECT2 = error-prone video encoding + PPC with EDC + DFR
  CC-PROTECT = error-resilient video encoding + PPC with EDC + DFR
4) Checkpoints & Rollbacks

 Checkpoint
  A copy of an application’s state, saved in storage immune to the failures
 Rollback
  Restart the execution from a previously saved checkpoint
 Recovers from transient and permanent hardware and software failures

 [Figure: Producer A saves state K-1 and state K as checkpoints; after a fault, execution rolls back to the last checkpoint]
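The checkpoint/rollback cycle above can be sketched in a few lines. This is an illustration only: "stable storage" is modeled as a plain list assumed to survive the failure, and the state is a dictionary chosen for the example.

```python
import copy

# Illustrative checkpoint/rollback: the application state is copied to
# stable storage before risky work, and restored after a fault.
class Checkpointed:
    def __init__(self, state):
        self.state = state
        self.stable_storage = []   # assumed immune to the failures

    def checkpoint(self):
        # Deep-copy so later updates cannot corrupt the saved state.
        self.stable_storage.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restart from the most recent checkpoint.
        self.state = copy.deepcopy(self.stable_storage[-1])

app = Checkpointed({"counter": 0})
app.state["counter"] = 41
app.checkpoint()                  # state saved
app.state["counter"] = 9999       # a transient fault corrupts the state
app.rollback()
print(app.state["counter"])       # back to 41
```

Note the deep copies: checkpointing a live reference instead of a copy would let post-checkpoint corruption leak into the "stable" state.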
Message Logging

 Tolerates crash failures
 Each process periodically records its local state and logs the messages received thereafter
 Once a crashed process recovers, its state must be consistent with the states of the other processes
 Orphan processes
  Surviving processes whose states are inconsistent with the recovered state of a crashed process
 Message-logging protocols guarantee that upon recovery no process is an orphan
Message logging protocols

 Pessimistic message logging
  Avoids the creation of orphans during execution
  No process p sends a message m until it knows that all messages delivered before sending m are logged; this enables quick recovery
  Can block a process for each message it receives, which slows down throughput
  Allows processes to communicate only from recoverable states; synchronously logs to stable storage any information that may be needed for recovery before allowing the process to communicate
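The pessimistic discipline above can be sketched as: log each message to stable storage synchronously before applying it, so recovery is a simple replay and no other process can become an orphan. A minimal illustration (the integer-sum state is an arbitrary stand-in for real process state):

```python
# Illustrative pessimistic message logging: every message is written to
# a stable log BEFORE it is applied, so the process only ever
# communicates from recoverable states. The synchronous write is the
# throughput cost mentioned above.
class PessimisticProcess:
    def __init__(self):
        self.state = 0
        self.stable_log = []         # assumed to survive crashes

    def deliver(self, msg):
        self.stable_log.append(msg)  # synchronous log first (blocking)
        self.apply(msg)              # only then apply and reply/send

    def apply(self, msg):
        self.state += msg            # stand-in for real state update

    def recover(self):
        # Replay the log to rebuild the exact pre-crash state; since
        # every delivered message was logged, no process is an orphan.
        self.state = 0
        for msg in self.stable_log:
            self.apply(msg)
```

Optimistic protocols defer the stable write and instead eliminate orphans at recovery time, trading recovery complexity for failure-free speed.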
Message logging protocols

 Optimistic message logging
  Takes appropriate actions during recovery to eliminate all orphans
  Better performance during failure-free runs
  Allows processes to communicate from non-recoverable states; failures may make these states permanently unrecoverable, forcing the rollback of any process that depends on such states
Causal Message Logging

 No orphans when failures happen, and no blocking of processes when failures do not occur
 Weakens the condition imposed by pessimistic protocols
 Allows the possibility that the state from which a process communicates is unrecoverable because of a failure, but only if this does not affect consistency
 Appends to all communication the information needed to recover the state from which the communication originates; this information is replicated in the memory of processes that causally depend on the originating state
KAN – A Reliable Distributed Object System (UCSB)

 Goals
  Language support for parallelism and distribution
  Transparent location/migration/replication
  Optimized method invocation
  Fault-tolerance
  Composition and proof reuse
 Log-based forward recovery scheme
  The log of recovery information for a node is maintained externally on other nodes
  Failed nodes are recovered to their pre-failure states, and correct nodes keep their states at the time of the failures
  Only node crash failures are considered: a processor stops taking steps, and failures are eventually detected
Basic Architecture of the Fault Tolerance Scheme

 [Figure: physical node i hosts logical nodes x and y, each with a fault detector, failure handler, request handler, and an external log, above a communication layer (IP address) connected to the network]
Egida (UT Austin)

 An object-oriented, extensible toolkit for low-overhead fault-tolerance
 Provides a library of objects that can be used to compose log-based rollback recovery protocols
  A specification language to express arbitrary rollback-recovery protocols
  Checkpointing: independent, coordinated, or induced by specific patterns of communication
  Message logging: pessimistic, optimistic, or causal
AQuA

 Adaptive Quality of Service Availability
  Developed at UIUC and BBN
 Goal: allow distributed applications to request and obtain a desired level of availability
 Fault tolerance via replication and reliable messaging
Features of AQuA

 Uses the QuO runtime to process and make availability requests
 The Proteus dependability manager configures the system in response to faults and availability requests
 Ensemble provides group communication services
 Provides a CORBA interface to application objects via the AQuA gateway
Group structure

 For reliable multicast and point-to-point communication
 Replication groups
 Connection groups
 Proteus Communication Service Group, for the replicated Proteus manager
  Replicas and the objects that communicate with the manager
  E.g., notification of a view change, a new QuO request
  Ensures that all replica managers receive the same information
 Point-to-point groups
  Proteus manager to object factory

AQuA Architecture

 [Figure: overall AQuA architecture]
Fault Model, Detection, and Handling

 Object fault model:
  Object crash failure – occurs when an object stops sending out messages; its internal state is lost
   A crash failure of an object is due to the crash of at least one element composing the object
  Value faults – a message arrives in time but with wrong content (caused by the application or the QuO runtime); detected by a voter
  Time faults – detected by a monitor
 Leaders report faults to Proteus; Proteus kills faulty objects if necessary and generates new objects
5) Recovery Blocks

 Multiple alternates perform the same functionality
  One primary module and one or more secondary modules
 Two approaches:
  1) Select a module whose output satisfies an acceptance test
  2) Recovery blocks with rollbacks: restart the execution from a previously saved checkpoint using a secondary module
 Tolerates software failures

 [Figure: Producer A chooses among blocks X, X2, Y, and Z; on a fault, execution rolls back from state K to the checkpointed state K-1 and retries with a secondary module]
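The scheme above combines checkpointing with an acceptance test: try the primary, and if its output fails the test (or it raises), roll the state back and try the next alternate. A minimal sketch with illustrative alternates and an illustrative acceptance test:

```python
# Illustrative recovery block: primary and secondary alternates are
# tried in order against an acceptance test, with the state restored
# from a checkpoint before each retry.
def recovery_block(alternates, acceptance_test, state):
    for alt in alternates:
        checkpoint = dict(state)        # save state before trying
        try:
            result = alt(state)
            if acceptance_test(result):
                return result           # output accepted
        except Exception:
            pass                        # alternate crashed: treat as fault
        state.clear()
        state.update(checkpoint)        # roll back, try next alternate
    raise RuntimeError("all alternates failed the acceptance test")

primary = lambda s: -1                  # buggy primary output
secondary = lambda s: s["x"] + 1        # correct secondary module
accept = lambda r: r >= 0               # acceptance test
print(recovery_block([primary, secondary], accept, {"x": 41}))  # 42
```

Unlike N-version programming, only one alternate runs at a time, so the common case pays no replication cost; the price is the coverage of the acceptance test.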
