fault tolerance techniques

The document discusses fault tolerance techniques, emphasizing the importance of maintaining system functionality despite hardware or software faults. It outlines two main recovery methods: forward error recovery, which makes selective corrections, and backward error recovery, which is more general but can be costly. Additionally, it categorizes faults based on their behavior and causes, highlighting the complexities and challenges in ensuring system reliability.

ROHINI College of Engineering and Technology

Fault Tolerance Techniques

Fault tolerance is defined informally as the ability of a system to deliver
the expected service even in the presence of faults.
A common misconception about real-time computing is that fault tolerance is
orthogonal to real-time requirements. It is often assumed that the
availability and reliability requirements of a system can be addressed
independently of its timing constraints.
A real-time system may fail to function correctly either because of errors
in its hardware and/or software or because it does not respond in time to meet
the timing requirements that are usually imposed by its "environment."
A hardware fault is a physical defect that can cause a component to
malfunction. A software fault is a bug that can cause a program to fail for a
given set of inputs.
An error is a manifestation of a fault. The fault latency is the duration
between the onset of a fault and its manifestation as an error.
The error latency is the duration between when an error is produced and
when it is either recognized as an error or causes the failure of the system.
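The two latencies defined above can be read off a simple timeline. The sketch below is only illustrative: the class name `FaultTimeline`, its field names, and the timestamps are assumptions made for this example, not part of the original notes.

```python
from dataclasses import dataclass


@dataclass
class FaultTimeline:
    """Hypothetical record of three instants in the life of one fault."""
    fault_onset: float     # time at which the fault first exists
    error_produced: float  # time at which the fault manifests as an error
    error_handled: float   # time at which the error is recognized or causes failure

    def fault_latency(self) -> float:
        # Duration between the onset of the fault and its manifestation as an error.
        return self.error_produced - self.fault_onset

    def error_latency(self) -> float:
        # Duration between the error being produced and being recognized
        # (or causing the failure of the system).
        return self.error_handled - self.error_produced


# Illustrative timestamps in seconds (made up for this example).
timeline = FaultTimeline(fault_onset=10.0, error_produced=12.5, error_handled=13.0)
print(timeline.fault_latency())  # 2.5
print(timeline.error_latency())  # 0.5
```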
Error recovery is the process by which the system attempts to recover from
the effects of an error.
Recovery from an error is fundamental to fault tolerance.
There are two main forms of recovery:
1. Forward error recovery
2. Backward error recovery
Forward error recovery
Forward recovery attempts to bring the system to a new stable state from which
it is possible to proceed. It is applied in situations where the nature of the
errors is known and a reset can be applied.

EC8791-Embedded and Realtime Systems

Forward error recovery continues from an erroneous state by making
selective corrections to the system state.
This includes making safe the controlled environment, which may be
hazardous or damaged because of the failure.
It is system specific and depends on accurate predictions of the location
and cause of errors (i.e., damage assessment).
Examples: Redundant pointers in data structures and the use of
self-correcting codes such as Hamming codes.
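As an illustration of the self-correcting codes mentioned above, here is a minimal sketch of a Hamming(7,4) code, which corrects any single flipped bit without rolling back. The function names and the bit layout (parity bits at positions 1, 2 and 4) are choices made for this example; the notes do not prescribe a particular layout.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Layout by 1-indexed position: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected codeword, error position); position 0 means no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the flipped position
    if position:
        w[position - 1] ^= 1  # selective correction of the erroneous bit
    return w, position


def hamming74_decode(word):
    """Correct at most one bit error and return the 4 data bits."""
    w, _ = hamming74_correct(word)
    return [w[2], w[4], w[5], w[6]]


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # simulate a single-bit fault at position 5
print(hamming74_correct(corrupted)[1])  # 5
print(hamming74_decode(corrupted))      # [1, 0, 1, 1]
```

The code fits the forward-recovery pattern because the error class (one flipped bit per codeword) is known up front. Two or more flipped bits in the same codeword fall outside that assumption and would be miscorrected.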
Advantage of forward error recovery:
1. Less overhead
Disadvantages of forward error recovery:
1. In order to work, all potential errors need to be accounted for up front.
2. Limited use, i.e. only when the impact of faults is understood.
3. Cannot be used as a general mechanism for error recovery.
4. Must be designed specifically for a particular system.
Backward error recovery
Backward recovery restores the system to a previously saved correct state
(a checkpoint) and resumes execution from there.
It is the most extensively used form of recovery in distributed systems and
is generally the safest. It can be incorporated into middleware layers.
Backward recovery is complicated in the case of process, machine or
network failure, and there is no guarantee that the same fault will not occur
again.
It cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM
withdrawal.
Advantages of backward error recovery:
1. Simple to implement
2. Can be used as a general recovery mechanism
3. Capable of providing recovery from arbitrary damage
Disadvantages of backward error recovery:
1. Checkpointing can be very expensive, especially when errors are very
rare.
2. Performance penalty
3. No guarantee that the fault will not occur again
4. Some components cannot be recovered
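The backward-recovery idea can be sketched as a checkpoint taken before an operation and a rollback when the operation fails. This is a minimal single-process illustration: the class `CheckpointedState`, the helper `run_with_rollback` and the account example are hypothetical names introduced here, and a real distributed checkpointing scheme involves considerably more.

```python
import copy


class CheckpointedState:
    """Holds a mutable state plus one saved copy to roll back to."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Copying the whole state is the cost paid even when no error occurs.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current (possibly damaged) state entirely.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(store, operation):
    """Run operation on the state; restore the checkpoint if it raises."""
    store.checkpoint()
    try:
        operation(store.state)
        return True
    except Exception:
        store.rollback()
        return False


def faulty_update(state):
    state["balance"] -= 30  # partial change made before the fault strikes
    raise RuntimeError("simulated fault")


store = CheckpointedState({"balance": 100})
ok = run_with_rollback(store, faulty_update)
print(ok, store.state)  # False {'balance': 100}
```

The rollback needs no knowledge of what went wrong, which is why the approach works as a general mechanism. It also shows two of the listed drawbacks: the state is copied on every call, and an effect that has already left the system (such as dispensed cash) cannot be undone by restoring memory.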
Causes of Failure
There are three causes of failure:
1. Errors in the specification or design
2. Defects in the components
3. Environmental effects
Mistakes in the specification and design are very difficult to guard against.
Many hardware failures and all software failures are due to such mistakes.
If the specification is wrong, everything that proceeds from it, design and
implementation, is likely to be unsatisfactory.
Fault Types
Faults are classified by their temporal behaviour and by their output
behaviour.
1. Temporal behaviour classification
• Faults are of three types: permanent, intermittent and transient.
• Transient faults: These occur once and then disappear. For example, a
network message transmission times out but works fine when attempted a
second time.
• Intermittent faults: These are the most annoying of component faults.
An intermittent fault occurs, vanishes, and then occurs again. An example
of this kind of fault is a loose connection.
• Permanent faults: This fault is persistent. It continues to exist until the
faulty component is repaired or replaced. Examples of this fault are disk
head crashes, software bugs, and burnt-out hardware.
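The transient-fault example above (a message that times out once and then succeeds) suggests a simple retry policy. The sketch below is an assumption-laden illustration: the exception classes, `call_with_retry` and `make_flaky_send` are hypothetical names, and treating a fault that survives all retries as permanent is a heuristic, since an intermittent fault could look the same.

```python
class TransientFault(Exception):
    """Raised when an attempt fails but a retry may succeed."""


class PermanentFault(Exception):
    """Raised when the fault persists across all attempts."""


def call_with_retry(operation, max_attempts=3):
    """Retry operation on transient faults; give up after max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientFault as exc:
            last_error = exc  # remember the cause and try again
    # Heuristic: a fault that never clears is handled as permanent.
    raise PermanentFault(f"still failing after {max_attempts} attempts") from last_error


def make_flaky_send(failures):
    """Build a fake send() that times out `failures` times, then succeeds."""
    remaining = {"count": failures}

    def send():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("message timed out")
        return "delivered"

    return send


print(call_with_retry(make_flaky_send(1)))  # delivered
```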
2. Output behaviour classification
• Malicious faults: The faulty component produces inconsistent outputs; for
example, it may report different values to different receivers.
• Nonmalicious faults: The output errors are consistent.
• Fail stop: The component responds to up to a certain maximum number of
failures by simply stopping, rather than putting out incorrect outputs. The
component simply stops working. For instance, a hard disk which refuses to
read or write.
• Fail safe: The failure mode is biased so that the application process does
not suffer catastrophe upon failure. A component under too much load is
likely to fail. A fail-safe system, on detecting a large amount of load,
processes each request more slowly to avoid failure.
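The fail-stop behaviour described above can be sketched as a wrapper that never passes on an output it cannot validate and halts once a failure limit is reached. The names `FailStopComponent` and `FailStopError`, the validity check and the temperature range are assumptions for this example only.

```python
class FailStopError(Exception):
    """Raised instead of returning an output the component cannot vouch for."""


class FailStopComponent:
    def __init__(self, read_fn, is_valid, max_failures=1):
        self._read_fn = read_fn            # underlying, possibly faulty, source
        self._is_valid = is_valid          # acceptance check on each output
        self._max_failures = max_failures  # failures tolerated before stopping
        self._failures = 0
        self.stopped = False

    def read(self):
        if self.stopped:
            # Once stopped, the source is never consulted again.
            raise FailStopError("component has stopped")
        value = self._read_fn()
        if self._is_valid(value):
            return value
        self._failures += 1
        if self._failures >= self._max_failures:
            self.stopped = True
        # An invalid value is suppressed, never handed to the caller.
        raise FailStopError("invalid output suppressed")


# Illustrative sensor: 999.0 lies outside the assumed valid range.
readings = iter([20.5, 999.0, 21.0])
sensor = FailStopComponent(lambda: next(readings), lambda v: -40.0 <= v <= 125.0)
print(sensor.read())  # 20.5
try:
    sensor.read()
except FailStopError:
    print("stopped:", sensor.stopped)  # stopped: True
```

Stopping makes the failure easy for the rest of the system to detect, at the price of losing the component's service entirely.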
3. Independence and correlation
• Component failures may be independent or correlated.
• Independent: A failure is said to be independent if it does not directly or
indirectly cause another failure.
• Correlated: Failures are said to be correlated if they are related in some
way.
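Why the distinction matters for redundancy can be shown with a small calculation. If two components fail independently, the probability that both fail is the product of the individual probabilities. The second function uses a deliberately simple common-cause model, introduced here as an assumption and not taken from the notes: with probability `c` a shared event (such as a power loss) fails both components, and otherwise they fail independently. All numbers are illustrative.

```python
def prob_both_fail_independent(p1, p2):
    """P(both fail) when the two failures are independent."""
    return p1 * p2


def prob_both_fail_common_cause(p1, p2, c):
    """P(both fail) under an assumed common-cause model.

    With probability c a shared event fails both components;
    otherwise (probability 1 - c) they fail independently.
    """
    return c + (1.0 - c) * p1 * p2


independent = prob_both_fail_independent(0.01, 0.01)
correlated = prob_both_fail_common_cause(0.01, 0.01, 0.001)
print(f"independent: {independent:.6f}")  # independent: 0.000100
print(f"correlated:  {correlated:.6f}")   # correlated:  0.001100
```

Under these assumed numbers, even a small shared cause dominates the chance of losing both components, so duplicating a component helps far less when failures are correlated.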
