
Distributed System - Failures

This document discusses failures, faults, and fault tolerance in systems. It defines key terms like failure, error, fault, and explains that while perfect software is impossible, fault tolerance aims to increase dependability by allowing systems to function correctly despite internal faults. Faults are classified by duration (transient or permanent) or cause (design faults or operational faults). The general process of fault tolerance includes error detection, error recovery, and fault treatment. Error detection identifies invalid states, while recovery restores the system to a valid state either by rolling back or moving forward. Fault treatment repairs or replaces the failed component.

Failures and Fault Tolerance

Classification of failures
Security
Fundamentals of Fault tolerance
It is simply not possible to devise absolutely
foolproof, 100% reliable software.
The best we can do is to reduce the
probability of failure to an "acceptable" level.
Fault tolerance is the ability of a system to
perform its function correctly even in the
presence of internal faults. The purpose of
fault tolerance is to increase the dependability
of a system.

A failure occurs when an actual running system
deviates from its specified behavior. The cause
of a failure is called an error.
An error represents an invalid system state, one
that is not allowed by the system behavior
specification. The error itself is the result of a
defect in the system, called a fault. The fault is the
root cause of a failure.
A fault may not necessarily result in an error, but
the same fault may result in multiple errors.
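
The chain from fault to error to failure can be illustrated with a small sketch. The `BoundedCounter` class and its limit are hypothetical and not part of the original slides; the fault is a deliberately missing bounds check that stays dormant until the limit is crossed.

```python
class BoundedCounter:
    """Toy component whose specification says 0 <= value <= LIMIT."""

    LIMIT = 3

    def __init__(self):
        self.value = 0

    def increment(self):
        # Fault: the bounds check is missing. It stays dormant
        # as long as value is below LIMIT.
        self.value += 1

    def in_error(self):
        # Error: the state is not allowed by the specification.
        return not (0 <= self.value <= self.LIMIT)

    def report(self):
        # Failure: the invalid state shows up in the delivered service.
        return f"{self.value}/{self.LIMIT}"


counter = BoundedCounter()
for _ in range(3):
    counter.increment()
fault_dormant = not counter.in_error()  # fault exists, but no error yet

counter.increment()                      # the fault is activated
error_present = counter.in_error()       # the state is now invalid
observed = counter.report()              # output deviates from the spec
```

The first three increments show that a fault does not necessarily result in an error; only the fourth one drives the system into an invalid state.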

Fault Classification
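Based on duration, faults can be classified as transient or
permanent. A transient fault disappears after a limited time,
while a permanent fault remains until it is repaired.
A different way to classify faults is by their underlying
cause.
Design faults are the result of design failures, that is,
mistakes built into the system before it runs.
Operational faults, on the other hand, are faults that occur during
the lifetime of the system and are invariably due to physical
causes.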
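
The duration-based classification matters in practice because the two kinds call for different reactions. The sketch below is an illustration only: the exception classes `TransientFault` and `PermanentFault` and the helper `call_with_retry` are hypothetical names, and retrying is just one common way of riding out a transient fault.

```python
class TransientFault(Exception):
    """A fault expected to disappear on its own."""


class PermanentFault(Exception):
    """A fault that persists until something is repaired."""


def call_with_retry(operation, attempts=3):
    """Retry on transient faults; let permanent faults propagate at once."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as exc:
            last_error = exc  # may clear by itself, so try again
    raise last_error


def make_flaky(failures):
    """Build an operation that fails transiently a fixed number of times."""
    remaining = {"count": failures}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("temporary glitch")
        return "ok"

    return operation


result = call_with_retry(make_flaky(2))  # succeeds on the third attempt
```

A `PermanentFault` is not caught by the retry loop, since repeating the operation would not help; it is left for fault treatment.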

General Fault Tolerant Procedure
Fault tolerance is a series of distinct activities that are typically
(although not necessarily) performed in
sequence.
Error detection is the process of identifying that
the system is in an invalid state. It is followed by damage
confinement, which limits the spread of the error.
Error recovery then comes before fault treatment. In other words, we first treat the
symptoms and then go after the underlying cause.
The most common techniques for error detection
are: replication checks, timing checks, run-time
constraint checking, and diagnostic checks.



Error Recovery
The system needs to be restored to a valid
state. Two general approaches exist.
In backward error recovery, the system is
restored to a previous known valid state. This
often requires checkpointing the system state
and, once an error is detected, rolling back the
system state to the last checkpointed state.
Where rolling back is not practical, forward error
recovery is more appropriate. This
involves driving the system from the erroneous
state to a new valid state.
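
Both recovery approaches can be sketched on a toy object. The `Account` class, its validity rule (balance must not be negative), and the choice of clamping to zero as the forward correction are all hypothetical; they only serve to contrast rolling back with moving forward.

```python
import copy


class Account:
    """Toy stateful component: a balance that must never be negative."""

    def __init__(self, balance):
        self.state = {"balance": balance}
        self._checkpoint = copy.deepcopy(self.state)

    def is_valid(self):
        return self.state["balance"] >= 0

    def checkpoint(self):
        # Save a copy of the current (valid) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: restore the last checkpointed state.
        self.state = copy.deepcopy(self._checkpoint)

    def recover_forward(self):
        # Forward recovery: correct the erroneous state into a new
        # valid one instead of returning to an earlier state.
        self.state["balance"] = max(self.state["balance"], 0)


a = Account(100)
a.checkpoint()
a.state["balance"] -= 150        # erroneous update, balance is now -50
if not a.is_valid():
    a.rollback()                 # back to the checkpointed balance

b = Account(100)
b.state["balance"] -= 150        # same erroneous update
if not b.is_valid():
    b.recover_forward()          # driven to a new valid state
```

After recovery, `a` holds its checkpointed balance again, while `b` ends in a state it was never in before the error.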

Fault Treatment
Repair procedure: the failed
component is repaired or replaced by a standby.
Standby components may be COLD, WARM, or HOT.
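
The slides do not define the three standby kinds, so the sketch below relies on their commonly used meanings as an assumption: a cold standby is not running, a warm standby is running but lacks the latest state, and a hot standby is running and already in sync. The `Standby` enum and `takeover_steps` function are hypothetical names.

```python
from enum import Enum


class Standby(Enum):
    COLD = "cold"  # assumed: not running until needed
    WARM = "warm"  # assumed: running, state not fully up to date
    HOT = "hot"    # assumed: running and kept in sync


def takeover_steps(kind):
    """List the work needed before the standby can replace a failed component."""
    steps = []
    if kind is Standby.COLD:
        steps.append("start standby")
    if kind in (Standby.COLD, Standby.WARM):
        steps.append("load latest state")
    steps.append("redirect traffic to standby")
    return steps


cold_plan = takeover_steps(Standby.COLD)
warm_plan = takeover_steps(Standby.WARM)
hot_plan = takeover_steps(Standby.HOT)
```

Under these assumptions, the hotter the standby, the fewer steps stand between the failure and the replacement taking over.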
