Fault Tolerance in Distributed Systems

Uploaded by

Pedro Lopes

Fault Tolerant Distributed Systems

ICS 230
Prof. Nalini Venkatasubramanian
(with some slides modified from Prof. Ghosh, University of Iowa)
Fundamentals

• What is a fault?
  • A fault is a blemish, weakness, or shortcoming of a particular hardware or software component.
  • Fault, error, and failure
• Why fault tolerance?
  • Availability, reliability, dependability, …
• How to provide fault tolerance?
  • Replication
  • Checkpointing and message logging
  • Hybrid
Reliability
• Reliability is an emerging and critical concern in traditional and new settings
  • Transaction processing, mobile applications, cyber-physical systems
• New enhanced technology makes devices vulnerable to errors due to high complexity and high integration
  • Technology scaling causes problems
  • Exponential increase of soft error rate
• Mobile/pervasive applications running close to humans
  • E.g., failure of healthcare devices can have serious consequences
• Redundancy techniques incur high overheads of power and performance
  • TMR (Triple Modular Redundancy) may exceed 200% overheads without optimization [Nieuwland, 06]
• Challenging to optimize multiple properties (e.g., performance, QoS, and reliability)
Classification of failures

• Crash failure
• Omission failure
• Transient failure
• Byzantine failure
• Software failure
• Temporal failure
• Security failure
• Environmental perturbations

Crash failures

Crash failure = the process halts. It is irreversible.

In a synchronous system, it is easy to detect a crash failure (using heartbeat signals and a timeout). But in asynchronous systems, detection is never accurate, since it is not possible to distinguish between a process that has crashed and a process that is running very slowly.

Some failures may be complex and nasty. Fail-stop failure is a simple abstraction that mimics crash failure when program execution becomes arbitrary. Implementations help detect which processor has failed. If a system cannot tolerate fail-stop failure, then it cannot tolerate crash.
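
The heartbeat-and-timeout idea can be sketched as follows. This is a minimal illustration, not part of the original slides: the class name `HeartbeatDetector`, its methods, and the injectable `clock` parameter are all assumptions made for the example. With a bounded message delay the timeout can be chosen so that suspicions are always correct; without such a bound, a slow process is indistinguishable from a crashed one and may be suspected wrongly.

```python
import time


class HeartbeatDetector:
    """Hypothetical detector: suspect a process whose heartbeat is overdue."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable so the example can be tested
        self.last_seen = {}       # process id -> time of last heartbeat

    def heartbeat(self, pid):
        # Record the arrival time of the latest heartbeat from pid.
        self.last_seen[pid] = self.clock()

    def suspects(self):
        # A process is suspected when its last heartbeat is older than timeout.
        now = self.clock()
        return {pid for pid, seen in self.last_seen.items()
                if now - seen > self.timeout}
```

Passing a fake clock makes the behavior easy to check: a process that keeps sending heartbeats stays off the suspect list, while a silent one is added once the timeout has elapsed.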
Transient failure

(Hardware) Arbitrary perturbation of the global state. May be induced by power surge, weak batteries, lightning, radio-frequency interference, cosmic rays, etc.

Not Heisenberg

(Software) Heisenbugs are a class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible, so they are harder to detect during the testing phase.

Over 99% of bugs in IBM DB2 production code are non-deterministic and transient (Jim Gray).
Temporal failures

Inability to meet deadlines – correct results are generated, but too late to be useful. Very important in real-time systems.

May be caused by poor algorithms, poor design strategy, or loss of synchronization among the processor clocks.
Byzantine failure

Anything goes! Includes every conceivable form of erroneous behavior. The weakest type of failure model, since it makes no assumptions about how a faulty process behaves.

Numerous possible causes. Includes malicious behaviors (like a process executing a different program instead of the specified one) too.

Most difficult kind of failure to deal with.

Errors/Failures across system layers
• Faults or errors can cause failures

[Figure: system layers with a typical error at each. Application: bug; Middleware/OS: exception; Network: packet loss; Hardware: soft error]
Hardware Errors and Error Control Schemes

Failures: soft errors, hard failures, system crash
Causes: external radiation, thermal effects, power loss, poor design, aging
Metrics: FIT, MTTF, MTBF
Traditional approaches: spatial redundancy (TMR, duplex, RAID-1, etc.) and data redundancy (EDC, ECC, RAID-5, etc.)

• Hardware failures are increasing as technology scales
  • (e.g.) SER increases by up to 1000 times [Mastipuram, 04]
• Redundancy techniques are expensive
  • (e.g.) ECC-based protection in caches can incur 95% performance penalty [Li, 05]

• FIT: Failures in Time (10^9 hours)
• MTTF: Mean Time To Failure
• MTBF: Mean Time Between Failures
• TMR: Triple Modular Redundancy
• EDC: Error Detection Codes
• ECC: Error Correction Codes
• RAID: Redundant Array of Inexpensive Disks
Soft Errors (Transient Faults)
• SER increases exponentially as technology scales
  • Integration, voltage scaling, altitude, latitude
• Caches are most hit due to:
  • Larger portion in processors (more than 50%)
  • No masking effects (e.g., logical masking)

[Figure: a bit flip (1 to 0) in a transistor of the Intel Itanium II processor [Baumann, 05], with annotations "5 hours MTTF" and "1 month MTTF"]

• MTTF: Mean Time To Failure
Soft errors

Configuration | SER (FIT) | MTTF | Reason
1 Mbit @ 0.13 µm | 1000 | 104 years |
64 MB @ 0.13 µm | 64x8x1000 | 81 days | High integration
128 MB @ 65 nm | 2x1000x64x8x1000 | 1 hour | Technology scaling and twice integration
A system @ 65 nm | 2x2x1000x64x8x1000 | 30 minutes | Memory takes up 50% of soft errors in a system
A system with voltage scaling @ 65 nm | 100x2x2x1000x64x8x1000 | 18 seconds | Exponential relationship b/w SER & supply voltage
A system with voltage scaling @ flight (35,000 ft) @ 65 nm | 800x100x2x2x1000x64x8x1000 | 0.02 seconds | High intensity of neutron flux at flight (high altitude)

Soft Error Rate (SER) – FIT (Failures in Time) = number of errors in 10^9 hours
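
Since one FIT is one failure per 10^9 hours, the MTTF column follows from MTTF = 10^9 / SER hours. The sketch below recomputes it from the SER expressions in the table; the helper name `mttf_hours` and the 365-day year are assumptions of this example. Under that assumption the first row works out to roughly 114 years, slightly above the 104 years shown on the slide, while the remaining rows agree with the slide after rounding.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def mttf_hours(fit):
    """Convert a failure rate in FIT (failures per 10^9 hours) to MTTF in hours."""
    return 1e9 / fit


# SER expressions copied from the rows of the soft error table.
rows = [
    ("1 Mbit @ 0.13 um", 1000),
    ("64 MB @ 0.13 um", 64 * 8 * 1000),
    ("128 MB @ 65 nm", 2 * 1000 * 64 * 8 * 1000),
    ("system @ 65 nm", 2 * 2 * 1000 * 64 * 8 * 1000),
    ("system, voltage scaling @ 65 nm", 100 * 2 * 2 * 1000 * 64 * 8 * 1000),
    ("system, voltage scaling, flight @ 65 nm",
     800 * 100 * 2 * 2 * 1000 * 64 * 8 * 1000),
]

for label, fit in rows:
    hours = mttf_hours(fit)
    print(f"{label}: {fit:.3e} FIT, MTTF {hours:.3e} h ({hours * 3600:.3g} s)")
```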
Software Errors and Error Control Schemes

Failures: wrong outputs, infinite loops, crash
Causes: incomplete specification, poor software design, bugs, unhandled exception
Metrics: number of bugs/Klines, QoS, MTTF, MTBF
Traditional approaches: spatial redundancy (N-version programming, etc.), temporal redundancy (checkpoints and backward recovery, etc.)

• Software errors become dominant as a system's complexity increases
  • (e.g.) Several bugs per kilo lines
• Hard to debug, and redundancy techniques are expensive
  • (e.g.) Backward recovery with checkpoints is inappropriate for real-time applications

• QoS: Quality of Service
Software failures

Coding error or human error

On September 23, 1999, NASA lost the $125 million Mars orbiter spacecraft because one engineering team used metric units while another used English units, leading to a navigation fiasco that caused it to burn up in the atmosphere.

Design flaws or inaccurate modeling

The Mars Pathfinder mission landed flawlessly on the Martian surface on July 4, 1997. However, its communication later failed due to a design flaw in the real-time embedded software kernel VxWorks. The problem was later diagnosed as priority inversion: a medium-priority task could preempt a low-priority task that held a resource needed by a high-priority task, leaving the high-priority task blocked.
Software failures

Memory leak
Processes fail to entirely free up the physical memory that has been allocated to them. This effectively reduces the size of the available physical memory over time. When this becomes smaller than the minimum memory needed to support an application, the application crashes.

Incomplete specification (example: Y2K)

Year = 99 (1999 or 2099)?

Many failures (like crash, omission, etc.) can be caused by software bugs too.
Network Errors and Error Control Schemes

Failures: data losses, deadline misses, node (link) failure, system down
Causes: network congestion, noise/interference, malicious attacks
Metrics: packet loss rate, deadline miss rate, SNR, MTTF, MTBF, MTTR
Traditional approaches: resource reservation, data redundancy (CRC, etc.), temporal redundancy (retransmission, etc.), spatial redundancy (replicated nodes, MIMO, etc.)

• Omission errors: lost/dropped messages
• Network is unreliable (especially wireless networks)
  • Buffer overflow, collisions at the MAC layer, receiver out of range
• Joint approaches across OSI layers have been investigated for

• SNR: Signal to Noise Ratio
• MTTR: Mean Time To Recovery
• CRC: Cyclic Redundancy Check
• MIMO: Multiple-Input Multiple-Output
Classifying fault-tolerance

Masking tolerance.
Application runs as it is. The failure does not have a visible impact.
All properties (both liveness & safety) continue to hold.

Non-masking tolerance.
Safety property is temporarily affected, but not liveness.

Example 1. Clocks lose synchronization, but recover soon thereafter.

Example 2. Multiple processes temporarily enter their critical sections,
but thereafter, the normal behavior is restored.
Classifying fault-tolerance

Fail-safe tolerance
Given safety predicate is preserved, but liveness may be affected

Example. Due to failure, no process can enter its critical section for
an indefinite period. In a traffic crossing, failure changes the traffic in
both directions to red.
Graceful degradation
Application continues, but in a “degraded” mode. Much depends on
what kind of degradation is acceptable.

Example. Consider message-based mutual exclusion. Processes will enter their critical sections, but not in timestamp order.
Failure detection

The design of fault-tolerant algorithms will be simple if processes can detect failures.
• In synchronous systems with bounded-delay channels, crash failures can definitely be detected using timeouts.
• In asynchronous distributed systems, the detection of crash failures is imperfect.
  • Completeness: every crashed process is suspected.
  • Accuracy: no correct process is suspected.
Example

[Figure: a system of eight processes numbered 0 to 7]

0 suspects {1,2,3,7} to have failed. Does this satisfy completeness? Does this satisfy accuracy?
Classification of completeness

• Strong completeness. Every crashed process is eventually suspected by every correct process, and remains a suspect thereafter.
• Weak completeness. Every crashed process is eventually suspected by at least one correct process, and remains a suspect thereafter.

Note that we don't care what mechanism is used for suspecting a process.
Classification of accuracy

• Strong accuracy. No correct process is ever suspected.
• Weak accuracy. There is at least one correct process that is never suspected.
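
The completeness and accuracy properties can be checked mechanically for a single snapshot of the suspect lists. The sketch below is a simplification: the real definitions quantify over time ("eventually", "ever", "never"), whereas these hypothetical helper functions only inspect one moment, and the choice of which processes have crashed is invented for illustration.

```python
def strong_completeness(crashed, correct, suspects):
    """Every crashed process is suspected by every correct process."""
    return all(crashed <= suspects[p] for p in correct)


def weak_completeness(crashed, correct, suspects):
    """Every crashed process is suspected by at least one correct process."""
    return all(any(c in suspects[p] for p in correct) for c in crashed)


def strong_accuracy(correct, suspects):
    """No correct process is suspected by any correct process."""
    return all(not (suspects[p] & correct) for p in correct)


def weak_accuracy(correct, suspects):
    """At least one correct process is suspected by no correct process."""
    return any(all(q not in suspects[p] for p in correct) for q in correct)


# Hypothetical snapshot: processes 0..7, of which 1 and 2 have crashed.
processes = set(range(8))
crashed = {1, 2}
correct = processes - crashed
suspects = {p: {1, 2} for p in correct}   # suspect list of each correct process
suspects[0] = {1, 2, 3, 7}                # process 0 also suspects 3 and 7
```

In this snapshot both completeness properties hold, strong accuracy fails because process 0 suspects the correct processes 3 and 7, and weak accuracy holds because, for example, process 4 is suspected by nobody.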
Eventual accuracy

A failure detector is eventually strongly accurate if there exists a time T after which no correct process is suspected.

(Before that time, a correct process may be added to and removed from the list of suspects any number of times.)

A failure detector is eventually weakly accurate if there exists a time T after which at least one correct process is no longer suspected.
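
One common way to approximate eventual accuracy is to lengthen a process's timeout each time a suspicion about it turns out to be false. The sketch below illustrates that idea under stated assumptions: the class name `AdaptiveDetector`, the doubling policy, and the explicit `now` arguments are choices made for this example, and the scheme only stops making mistakes if message delays eventually become bounded.

```python
class AdaptiveDetector:
    """Hypothetical detector that grows its timeout after a false suspicion."""

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_seen = {p: 0.0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, pid, now):
        self.last_seen[pid] = now
        if pid in self.suspected:
            # The suspicion was wrong: withdraw it and wait longer next time.
            self.suspected.discard(pid)
            self.timeout[pid] *= 2

    def check(self, now):
        # Suspect every process whose heartbeat is overdue.
        for pid, seen in self.last_seen.items():
            if now - seen > self.timeout[pid]:
                self.suspected.add(pid)
        return set(self.suspected)
```

A correct but slow process may enter and leave the suspect list several times; each false suspicion doubles its timeout, so once the timeout exceeds the real delay bound it is no longer suspected. A crashed process never sends another heartbeat and stays suspected.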
Classifying failure detectors
Perfect P. (Strongly) complete and strongly accurate
Strong S. (Strongly) complete and weakly accurate
Eventually perfect ◊P. (Strongly) complete and eventually strongly accurate
Eventually strong ◊S. (Strongly) complete and eventually weakly accurate

Other classes are feasible: W (weak completeness and weak accuracy) and ◊W
Backward vs. forward error recovery

Backward error recovery
When the safety property is violated, the computation rolls back and resumes from a previous correct state.

[Figure: timeline showing a rollback to an earlier state]

Forward error recovery
The computation does not care about getting the history right, but moves on, as long as the safety property is eventually restored. True for self-stabilizing systems.
Conventional Approaches
• Build redundancy into hardware/software
  • Modular Redundancy, N-Version Programming
    • TMR (Triple Modular Redundancy) can incur 200% overheads without optimization.
  • Replication of tasks and processes may result in overprovisioning
  • Error Control Coding
• Checkpointing and rollbacks
  • Usually accomplished through logging (e.g. messages)
  • Backward recovery with checkpoints cannot guarantee the completion time of a task.
• Hybrid
  • Recovery Blocks
1) Modular Redundancy
• Modular Redundancy
  • Multiple identical replicas of hardware modules
  • Voter mechanism
    • Compare outputs and select the correct output
• Tolerate most hardware faults
• Effective but expensive

[Figure: Producer A and Producer B (one hit by a fault) send data through a voter to the Consumer]
2) N-version Programming
• N-version Programming
  • Different versions by different teams
  • Different versions may not contain the same bugs
  • Voter mechanism
• Tolerate some software bugs

[Figure: Program i (with a fault) and Program j, written by Programmer K and Programmer L, feed a voter between Producer A and the Consumer]
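
Both modular redundancy and N-version programming depend on a voter that compares outputs and selects the one a majority agrees on. A minimal sketch of such a voter follows; the function names `majority_vote` and `run_redundant` are hypothetical, outputs are assumed to be hashable, and a real voter would also need timeouts and a policy for inexact results such as floating-point values.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority, or raise if none exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among the outputs")
    return value


def run_redundant(versions, x):
    """Run every replica or independently written version on x, then vote."""
    return majority_vote([version(x) for version in versions])


def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_buggy(x):
    return x * x + 1  # deliberately faulty version


print(run_redundant([square_a, square_buggy, square_b], 3))
```

With three versions the voter masks one faulty output. It cannot help when a majority of versions share the same bug, which is why N-version programming relies on independently developed versions.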
3) Error-Control Coding
• Error-Control Coding
  • Replication is effective but expensive
  • Error-Detection Coding and Error-Correction Coding
    • (example) Parity Bit, Hamming Code, CRC
  • Much less redundancy than replication

[Figure: Producer A sends data plus error-control data to the Consumer; a fault corrupts the data in transit]
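
As an illustration of error-correction coding, here is a sketch of the classic Hamming(7,4) code, which adds 3 check bits to 4 data bits and corrects any single flipped bit. The function names are chosen for this example, and the bit layout (parity bits at positions 1, 2, and 4) is the textbook arrangement rather than anything specified on the slides.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a soft error in position 5
print(hamming74_decode(word))
```

The three check bits cost 75% extra storage for 4 data bits, compared with 200% for triplicating the data, which is the sense in which coding needs much less redundancy than replication.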
Conventional Protection for Caches
• Cache is the most hit by soft errors
• Conventional protected caches
  • Unaware of fault tolerance at applications
  • Implement a redundancy technique such as ECC to protect all data for every access
• Overkill for multimedia applications
  • ECC (e.g., a Hamming code) incurs a performance penalty of up to 95%, a power overhead of up to 22%, and an area cost of up to 25%

[Figure: an application-unaware cache fully protected by ECC, at high cost]
PPC (Partially Protected Caches)
• Observation
  • Not all data are equally failure critical
    • Multimedia data vs. control variables
• Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, CASES06][Lee, TVLSI08]
  • Unprotected cache and protected cache at the same level of the memory hierarchy
  • Protected cache is typically smaller to keep power and delay the same as or less than those of the unprotected cache

[Figure: PPC with an unprotected cache and a protected cache side by side above memory]
PPC for Multimedia Applications
• Propose a selective data protection [Lee, CASES06]
  • Unequal protection at hardware layer exploiting error-tolerance of multimedia data at application layer
• Simple data partitioning for multimedia applications
  • Multimedia data is failure non-critical
  • All other data is failure critical

[Figure: PPC (unprotected and protected caches above memory), with axes labeled fault tolerance and power/delay reduction]
PPC for general purpose apps
• Not all data are equally failure critical
• Propose a PPC architecture to provide unequal protection
  • Support unequal protection at hardware layer by exploiting error-tolerance and vulnerability at application
• DPExplore [Lee, DIPES08]
  • Explore partitioning space by exploiting vulnerability of each data page
• Vulnerable time
  • Data is vulnerable during the time after which it is eventually read by the CPU or written back to memory
  • Pages causing high vulnerable time are failure critical

[Figure: application data and code are characterized by error-tolerance of multimedia data and vulnerability of data and code; page partitioning algorithms split pages into failure non-critical (FNC) and failure critical (FC), which are mapped into the unprotected and protected caches of the PPC]
CC-PROTECT
• An approach in which existing schemes cooperate across layers to mitigate the impact of soft errors on the failure rate and video quality in mobile video encoding systems
  • PPC (Partially Protected Caches) with EDC (Error Detection Codes) at hardware layer
  • DFR (Drop and Forward Recovery) at middleware
  • PBPAIR (Probability-Based Power Aware Intra Refresh) at application layer
• Demonstrate the effectiveness of low-cost (about 50%) reliability (1,000x) at the minimal cost of QoS (less than 1%)

[Figure: layer stack. Application: PBPAIR (error resilience); Middleware/OS: DFR (error correction); Hardware: unprotected cache and protected cache, with EDC in place of ECC]
CC-PROTECT

[Figure: CC-PROTECT block diagram. Labels: original video; mobile video application; Error-Aware Video Encoder (EAVE), made up of an error controller (e.g., frame drop) and an error-resilient encoder (e.g., PBPAIR); error-prone networks (frame drop, packet loss, QoS loss); MW/OS: monitor and translate SER, feedback parameter, trigger selective DFR (Drop & Forward Recovery), BER (Backward Error Recovery); hardware: PPC (unprotected and protected caches) with EDC error detection on soft errors; frames K and K+1]

[Figure: energy saving chart for EDC, EDC + DFR, and EDC + DFR + PBPAIR (CC-PROTECT). Labels: 36%, 56%, 17% reduction compared to HW-PROTECT; 26%, 49%, 4% reduction compared to BASE]
• BASE = error-prone video encoding + unprotected cache
• HW-PROTECT = error-prone video encoding + PPC with ECC
• APP-PROTECT = error-resilient video encoding + unprotected cache
• MULTI-PROTECT = error-resilient video encoding + PPC with ECC
• CC-PROTECT1 = error-prone video encoding + PPC with EDC
• CC-PROTECT2 = error-prone video encoding + PPC with EDC + DFR
• CC-PROTECT = error-resilient video encoding + PPC with EDC + DFR
4) Checkpoints & Rollbacks
• Checkpoints and Rollbacks
  • Checkpoint
    • A copy of an application's state
    • Save it in storage immune to the failures
  • Rollback
    • Restart the execution from a previously saved checkpoint
• Recover from transient and permanent hardware and software failures

[Figure: after a fault in state K, the application rolls back to the checkpoint taken at state (K-1)]
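
Checkpointing and rollback can be sketched in a few lines. This is an illustrative toy, not an implementation from the slides: the class name `CheckpointedTask` is hypothetical, and the checkpoint is kept in memory with `copy.deepcopy`, whereas a real system would write it to storage that survives the failures being tolerated.

```python
import copy


class CheckpointedTask:
    """Hypothetical task wrapper that rolls back to its last checkpoint on failure."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Stand-in for saving the state to failure-immune storage.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restart from the saved copy.
        self.state = copy.deepcopy(self._saved)

    def run_step(self, step):
        """Apply step to the state; on any exception, roll back and report failure."""
        try:
            step(self.state)
        except Exception:
            self.rollback()
            return False
        self.checkpoint()
        return True


def add_five(state):
    state["total"] += 5


def faulty_step(state):
    state["total"] += 100            # partial update before the fault strikes
    raise RuntimeError("transient fault")


task = CheckpointedTask({"total": 0})
task.run_step(add_five)
task.run_step(faulty_step)
print(task.state)
```

The partial update made by the faulty step is undone by the rollback, so the state is the one saved after the last successful step. The work done since that checkpoint is lost, which is why backward recovery cannot bound a task's completion time.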
Message Logging
• Tolerate crash failures
• Each process periodically records its local state and logs the messages it receives afterward
  • Once a crashed process recovers, its state must be consistent with the states of other processes
  • Orphan processes
    • surviving processes whose states are inconsistent with the recovered state of a crashed process
  • Message logging protocols guarantee that upon recovery no processes are orphan processes
Message logging protocols
• Pessimistic Message Logging
  • avoid creation of orphans during execution
  • no process p sends a message m until it knows that all messages delivered before sending m are logged; quick recovery
  • can block a process for each message it receives, which slows down throughput
  • allows processes to communicate only from recoverable states; synchronously log to stable storage any information that may be needed for recovery before allowing process to communicate
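
The pessimistic rule can be illustrated with a receiver that forces each message to a log file before delivering it to the application. This is a sketch under several assumptions: the class `LoggedProcess` is hypothetical, a local file flushed with `os.fsync` stands in for stable storage, the application state is just a running sum, and message handling is deterministic so that replaying the log rebuilds the same state.

```python
import json
import os
import tempfile


class LoggedProcess:
    """Hypothetical process that logs every message before delivering it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.total = 0                      # application state: sum of messages

    def receive(self, value):
        # Pessimistic rule: make the message durable first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps(value) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # ...and only then let it affect the process state.
        self._deliver(value)

    def _deliver(self, value):
        self.total += value

    @classmethod
    def recover(cls, log_path):
        """Rebuild the pre-crash state by replaying the logged messages."""
        process = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    process._deliver(json.loads(line))
        return process


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "messages.log")
    p = LoggedProcess(path)
    p.receive(3)
    p.receive(4)
    recovered = LoggedProcess.recover(path)  # simulate a restart after a crash
    print(recovered.total)
```

Because every delivered message is already on stable storage, the recovered state matches what other processes may have observed, so no orphans arise. The cost is the synchronous write on every receive, which is the throughput penalty noted above.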
Message Logging
• Optimistic Message Logging
  • take appropriate actions during recovery to eliminate all orphans
  • better performance during failure-free runs
  • allows processes to communicate from non-recoverable states; failures may cause these states to be permanently unrecoverable, forcing rollback of any process that depends on such states
Causal Message Logging
• Causal Message Logging
  • creates no orphans when failures happen and does not block processes when failures do not occur
  • weakens the condition imposed by pessimistic protocols
  • allows the possibility that the state from which a process communicates is unrecoverable because of a failure, but only if it does not affect consistency
  • appends to all communication the information needed to recover the state from which the communication originates; this is replicated in the memory of processes that causally depend on the originating state
KAN – A Reliable Distributed Object System (UCSB)
• Goal
  • Language support for parallelism and distribution
  • Transparent location/migration/replication
  • Optimized method invocation
  • Fault-tolerance
  • Composition and proof reuse
• Log-based forward recovery scheme
  • Log of recovery information for a node is maintained externally on other nodes.
  • The failed nodes are recovered to their pre-failure states, and the correct nodes keep their states at the time of the failures.
• Only consider node crash failures.
  • Processor stops taking steps and failures are eventually detected.
Basic Architecture of the Fault Tolerance Scheme

[Figure: architecture diagram. Labels: physical node i; logical node x; logical node y; fault detector; failure handler; external log; request handler; communication layer; IP address; network]
Egida (UT Austin)

• An object-oriented, extensible toolkit for low-overhead fault-tolerance
• Provides a library of objects that can be used to compose log-based rollback recovery protocols.
  • Specification language to express arbitrary rollback-recovery protocols
  • Checkpointing
    • independent, coordinated, induced by specific patterns of communication
  • Message Logging
    • pessimistic, optimistic, causal
AQuA

• Adaptive Quality of Service Availability
• Developed at UIUC and BBN.
• Goal:
  • Allow distributed applications to request and obtain a desired level of availability.
• Fault tolerance
  • replication
  • reliable messaging
Features of AQuA
• Uses the QuO runtime to process and make availability requests.
• Proteus dependability manager to configure the system in response to faults and availability requests.
• Ensemble to provide group communication services.
• Provide CORBA interface to application objects using the AQuA gateway.
Group structure
• For reliable multicast and point-to-point communication
  • Replication groups
  • Connection groups
  • Proteus Communication Service Group for replicated Proteus manager
    • replicas and objects that communicate with the manager
    • e.g. notification of view change, new QuO request
    • ensure that all replica managers receive same info
  • Point-to-point groups
    • Proteus manager to object factory
AQuA Architecture
Fault Model, Detection and Handling

• Object Fault Model:
  • Object crash failure: occurs when an object stops sending out messages; internal state is lost
    • crash failure of an object is due to the crash of at least one element composing the object
  • Value faults: message arrives in time with wrong content (caused by application or QuO runtime)
    • Detected by voter
  • Time faults
    • Detected by monitor
• Leaders report faults to Proteus; Proteus will kill faulty objects if necessary, and generate new objects
5) Recovery Blocks
• Recovery Blocks
  • Multiple alternates to perform the same functionality
    • One primary module and secondary modules
  • Different approaches
    1) Select a module with output satisfying acceptance test
    2) Recovery Blocks and Rollbacks
       • Restart the execution from a previously saved checkpoint with secondary module
• Tolerate software failures

[Figure: alternate blocks X, X2, Y, and Z inside the application between Producer A and the Consumer; after a fault in state K, execution rolls back to the checkpoint at state (K-1)]

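The recovery block scheme, in which alternates are tried in turn from a saved checkpoint until one passes the acceptance test, might be sketched as follows. The function `recovery_block` and the sorting example are hypothetical, and the checkpoint is an in-memory deep copy rather than stable storage.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try the primary, then each secondary, until a result passes the test."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        # Every attempt starts from the checkpointed state (rollback).
        attempt_input = copy.deepcopy(checkpoint)
        try:
            result = alternate(attempt_input)
        except Exception:
            continue                      # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sort(values):
    values.reverse()                      # buggy primary: does not sort
    return values


def secondary_sort(values):
    return sorted(values)                 # simpler, slower, trusted alternate


def is_sorted(values):
    return all(a <= b for a, b in zip(values, values[1:]))


print(recovery_block([3, 1, 2], [primary_sort, secondary_sort], is_sorted))
```

Unlike N-version programming, the alternates run one after another rather than in parallel, and correctness hinges on the quality of the acceptance test rather than on a vote.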