Software Fault Tolerance Methods
More and more people depend on computer systems. The growing need for computer systems also increases the need for fault-tolerant computer systems. Interest in fault-tolerant real-time systems is increasing.
Software faults may be introduced during the design of the system. It is virtually impossible to design and implement a completely fault-free system. Measures therefore have to be provided to detect and tolerate faults.
System
Is a set of interacting components with a design
Service
Delivered by a system is the behavior of that system which affects users or other systems
Dependability Attributes
Dependability impairments are undesired, but seldom unexpected, circumstances causing or resulting from undependability. When a system failure occurs, the system behaves in an unacceptable manner and no longer satisfies its specification.
System failure
A system that no longer delivers a service complying with its specification is said to suffer from a system failure
Error
A system state that is liable to lead to a subsequent failure
Fault
The condition that caused the error
To assess the severity of faults and to decide on measures for removing them, a classification is useful.
Whether or not an error leads to a failure depends on several factors. For example, a system that incorporates redundancy at some level may mask the error.
Failure Modes
Failure Domain
The value of the service does not comply with the specifications
Failure Perception
Experienced by the user of the system
Failure Consequences
Different levels of severity
Not every fault leads to an error, and not every error leads to a failure.
A fault is active when it produces an error. Errors are detected by error detection algorithms or mechanisms. A failure occurs when an error passes through the interface of the system.
Dependability means are the methods and techniques that make it possible to deliver a service on which reliance can be placed, and to reach confidence in that ability
A dependable software
Procurement (Fault prevention and Fault tolerance)
Methodology used to construct a dependable system
Fault prevention
How to prevent fault occurrence by construction
Fault tolerance
How to provide service when faults are present
Fault removal
How to minimize the presence of faults
Fault forecasting
How to estimate the creation and manifestation of faults
Since human activities are involved, these four means are goals that cannot be fully reached
Error Detection
The detection of an erroneous state that could lead to a subsequent failure
Damage Assessment
Performed once an error has been detected, to establish more precisely the extent to which the system is damaged
Error Processing
Error Recovery
An attempt to substitute the erroneous system state with one that is error-free
1. Backward recovery
2. Forward recovery
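The two recovery styles can be contrasted with a minimal sketch. Everything here is an assumption made for illustration: the state is a dictionary holding a `speed` value, an erroneous state is any speed outside 0..100, and `SAFE_STATE` is a hypothetical known-safe state. Backward recovery returns to a checkpoint taken before the update; forward recovery moves on to a new error-free state.

```python
import copy

SAFE_STATE = {"speed": 0}  # hypothetical known-safe state used by forward recovery


def is_erroneous(state):
    # Error detection: in this toy model the speed must stay within 0..100.
    return not (0 <= state["speed"] <= 100)


def update_backward(state, delta):
    """Backward recovery: restore the checkpoint taken before the update."""
    checkpoint = copy.deepcopy(state)  # save the last known-good state
    state["speed"] += delta
    if is_erroneous(state):
        state.clear()
        state.update(checkpoint)       # roll back to the saved state
    return state


def update_forward(state, delta):
    """Forward recovery: replace the erroneous state with a new error-free one."""
    state["speed"] += delta
    if is_erroneous(state):
        state.clear()
        state.update(SAFE_STATE)       # move forward to the safe state
    return state
```

Backward recovery needs no knowledge of the error itself, only a saved state; forward recovery needs an application-specific idea of what a safe state looks like.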
Fault Treatment
Diagnosis
Passivation
Redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system
A fault-tolerant system is assumed to support some level of redundancy, ensuring that faults can be tolerated using the four phases (error detection, damage assessment, error recovery, fault treatment)
Space
Hardware redundancy, denoted H
Information
Software redundancy, denoted S
Repetition
Time redundancy, denoted T
Dependability attributes enable the expected properties of a system to be expressed, and allow the quality of the system resulting from the impairments and the means opposing them to be assessed
Reliability
the extent to which a system continuously provides its service
Safety
the extent to which a system avoids catastrophic consequences on the environment
Security
the extent to which a system prevents unauthorized access and/or handling of information
Recovery Block (RB)
N-Version Programming (NVP)
Consensus Recovery Block (CRB)
Distributed Recovery Block (DRB)
N Self-Checking Programming (NSCP)
Data Diversity
In a recovery block, a module is rejected in any of the following cases
1. An error in the operation of the module is explicitly detected by the acceptance test
2. The module fails to terminate, which is detected by a time-out
3. An error is detected during execution of the module by one of the implicit error detection mechanisms
4. An inner recovery block has failed because all of its modules were rejected, either explicitly or implicitly
The types of faults tolerated by recovery blocks
Designing the primary and alternate modules
Designing the acceptance test
Designing the recovery cache mechanism
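The recovery block scheme can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: a deep copy of the input stands in for the recovery cache, any raised exception stands in for implicit error detection, and time-outs are not modelled. The names `recovery_block`, `primary_sort`, `alternate_sort` and `sort_acceptance_test` are hypothetical.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when the primary and every alternate have been rejected."""


def recovery_block(state, modules, acceptance_test):
    """Try the primary, then each alternate, until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)          # stand-in for the recovery cache
    for module in modules:
        candidate = copy.deepcopy(checkpoint)  # every module starts from the saved state
        try:
            result = module(candidate)
        except Exception:
            continue                           # implicit error detection: reject the module
        if acceptance_test(checkpoint, result):
            return result                      # explicit check passed
    raise RecoveryBlockFailure("all modules rejected")


def primary_sort(data):
    # Deliberately faulty primary (assumption for the demo): loses the last element.
    return sorted(data[:-1])


def alternate_sort(data):
    return sorted(data)


def sort_acceptance_test(original, result):
    # Accept only an ordered result with the same number of elements as the input.
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)
```

Calling `recovery_block([3, 1, 2], [primary_sort, alternate_sort], sort_acceptance_test)` rejects the primary's shortened output and returns the alternate's result. If every module is rejected, the failure is reported to the enclosing block, which corresponds to the fourth rejection case.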
A decision mechanism
A supervisory Program
In order for the decision mechanism to do its job, the outputs of the N versions must be synchronized
The types of faults tolerated by N-version programming
The initial specification
Generating independent versions
The decision mechanism
Cost of implementation
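A minimal sketch of N-version programming with a majority voter as the decision mechanism. The assumptions are that outputs are hashable and compared for exact equality, that the versions run one after another (a real supervisory program would typically run them in parallel and synchronize their outputs), and that the three versions of an absolute-value function are hypothetical stand-ins for independently developed versions.

```python
from collections import Counter


def majority_vote(outputs):
    """Decision mechanism: return the value produced by a strict majority of versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among versions")
    return value


def run_n_versions(versions, x):
    """Supervisory program: feed the same input to every version, then vote."""
    outputs = [version(x) for version in versions]  # collected before voting
    return majority_vote(outputs)


def version_a(x):
    return abs(x)


def version_b(x):
    return x if x >= 0 else -x


def version_c(x):
    # Deliberately faulty version (assumption for the demo): wrong for negative inputs.
    return x
```

With `run_n_versions([version_a, version_b, version_c], -4)` the faulty output of `version_c` is outvoted. The voter can only mask a fault when a majority of versions agree, which is why independence between versions matters.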
The distributed recovery block (DRB) integrates software and hardware fault tolerance into one single structure. Both the primary and the alternate modules are replicated and reside on two or more separate nodes interconnected by a network. Software faults are handled in the traditional recovery block fashion; hardware faults are handled by the backup nodes.
The system is divided into several self-checking components, each composed of different variants (equivalent to alternates in RB and versions in NVP)
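One way to picture a self-checking component is as a pair of variants whose results are compared. The sketch below assumes that form, runs the components one after another rather than in parallel, and treats every component after the first as a spare. The class and function names are hypothetical.

```python
class SelfCheckingComponent:
    """Two variants plus a comparison that checks the component's own result."""

    def __init__(self, variant_a, variant_b):
        self.variant_a = variant_a
        self.variant_b = variant_b

    def run(self, x):
        """Return (ok, value); ok is False if a variant raises or the variants disagree."""
        try:
            a = self.variant_a(x)
            b = self.variant_b(x)
        except Exception:
            return False, None
        return a == b, a


def nscp(components, x):
    """Deliver the result of the first component that reports no error."""
    for component in components:  # the first is the active one, the rest are spares
        ok, value = component.run(x)
        if ok:
            return value
    raise RuntimeError("all self-checking components reported an error")


def largest_builtin(xs):
    return max(xs)


def largest_faulty(xs):
    # Deliberately faulty variant (assumption for the demo): ignores all but the first item.
    return xs[0]


def largest_sorted(xs):
    return sorted(xs)[-1]


def largest_loop(xs):
    best = xs[0]
    for item in xs[1:]:
        if item > best:
            best = item
    return best
```

For the input `[1, 3, 2]`, a component built from `largest_builtin` and `largest_faulty` detects its own disagreement, and the spare built from `largest_sorted` and `largest_loop` delivers the result.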
Programs often fail only for special cases in the input space. Data diversity moves the input data out of the failure domain, using one of two approaches
Retry Block
N-copy programming
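The retry block can be sketched as below. The sketch assumes a program with a one-point failure domain and an exact re-expression of the input; `fragile_sin`, `reexpress` and `in_range` are hypothetical names invented for the demo. N-copy programming would instead run several copies on differently re-expressed inputs and vote on the results.

```python
import math


def fragile_sin(x):
    """Hypothetical program that fails only for the special input x == 0.5."""
    if x == 0.5:
        return 2.0  # wrong value, outside the valid range of sine
    return math.sin(x)


def reexpress(x, attempt):
    # Sine is periodic, so shifting by whole periods gives a logically
    # equivalent input; attempt 0 leaves the original input unchanged.
    return x + 2 * math.pi * attempt


def in_range(y):
    # Acceptance test: a sine value must lie in [-1, 1].
    return -1.0 <= y <= 1.0


def retry_block(program, x, acceptance_test, max_retries=3):
    """Run the same program again on re-expressed input until the result is accepted."""
    for attempt in range(max_retries + 1):
        result = program(reexpress(x, attempt))
        if acceptance_test(result):
            return result
    raise RuntimeError("all re-expressed inputs rejected")
```

Calling `retry_block(fragile_sin, 0.5, in_range)` rejects the first result and accepts the result computed from the shifted input, which matches `math.sin(0.5)` up to floating-point rounding. Unlike a recovery block, only one algorithm is needed; the diversity is in the data.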
Since all fault tolerance depends on some kind of redundancy, fault-tolerant systems will always be more expensive than their non-redundant counterparts. The choice of fault tolerance technique is highly application dependent. CRB and DRB are still mostly used in academic research.
Low-cost systems should use fault tolerance schemes that do not rely on hardware redundancy. High-cost systems should use schemes such as NVP, NSCP, or NCP (N-copy programming).