Software Fault Tolerance Methods

This document discusses concepts of dependability and software fault tolerance techniques. It defines dependability as the trustworthiness of a system and its ability to deliver services. Software faults may occur during design so fault tolerance measures are needed. Common techniques include recovery blocks, N-version programming, consensus recovery blocks, and distributed recovery blocks. The goal of fault tolerance is to enable systems to tolerate faults through redundancy.

Uploaded by

Monil Joshi
Copyright
© Attribution Non-Commercial (BY-NC)

Outline
Introduction
Concepts of Dependability
Software Fault Tolerance Techniques
Conclusion
Questions

More and more people depend on computer systems
The growing need for computer systems also increases the need for fault-tolerant computer systems
Interest in fault-tolerant real-time systems is increasing

Software faults may be introduced during the design of the system
It is virtually impossible to design and implement a completely fault-free system
Measures must therefore be provided to detect and tolerate faults

Dependability
Defined as the trustworthiness of a system, such that reliance can be placed on the service it provides

System
A set of interacting components with a design

Service
The behavior of a system as it affects its users or other systems

Dependability Impairments
Dependability Means
Dependability Attributes

Dependability Impairments
Undesired, but seldom unexpected, circumstances causing or resulting from undependability
The system behaves in an unacceptable manner
When a system failure occurs, the system no longer satisfies its specification

System failure
A system, which no longer delivers a service that complies with the specification of the system, is said to suffer from a system failure

Error
A system state that is liable to lead to a subsequent failure

Fault
The condition that caused the error

To assess the severity of faults and to decide on measures for removing them, a classification is useful

Whether or not an error leads to a failure depends on a set of factors
A system that incorporates redundancy at some level may mask the error

Failure Modes:
Failure Domain
The value of the service does not comply with the specifications

Failure Perception
Experienced by the user of the system

Failure Consequences
Different levels of severity

Not every fault leads to an error
Not every error leads to a failure

Faults are active when they produce errors
Errors are detected by error detection algorithms or mechanisms
Failures occur when an error passes through the interface of the system
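
The chain from fault to error to failure can be illustrated with a small sketch. Everything in it is hypothetical: the `average` function, its seeded fault, and the range check standing in for an error detection mechanism were chosen only to show a dormant fault, a detected error, and an undetected error that crosses the interface as a failure.

```python
def average(values):
    """Return the mean of `values`.

    Seeded (hypothetical) fault: the divisor is wrong whenever the list has
    more than one element, so the fault stays dormant for one-element lists.
    """
    divisor = len(values) if len(values) == 1 else len(values) - 1  # the fault
    return sum(values) / divisor


def checked_average(values):
    """Interface with a simple error detection mechanism.

    A mean must lie between the smallest and largest input. The check catches
    some erroneous results, but not all of them.
    """
    result = average(values)
    if not (min(values) <= result <= max(values)):
        raise ValueError("erroneous state detected before reaching the user")
    return result


if __name__ == "__main__":
    print(checked_average([4]))        # fault dormant, no error
    print(checked_average([1, 2, 3]))  # fault active, error undetected: failure
    try:
        checked_average([2, 4])        # fault active, error detected
    except ValueError as exc:
        print("detected:", exc)
```

The second call shows why detection coverage matters: the wrong value still satisfies the range check, so the error passes through the interface and the user experiences a failure.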

Dependability Means
Methods and techniques that make it possible to deliver a service on which reliance can be placed, and to reach confidence in that ability

Dependable software
Procurement (fault prevention and fault tolerance)
Methodology used to construct a dependable system

Validation (fault removal and fault forecasting)
Methodology used to ensure the dependability of a system

Fault prevention
How to prevent fault occurrence by construction

Fault tolerance
How to provide service when faults are present

Fault removal
How to minimize the presence of faults

Fault forecasting
How to estimate the creation and manifestation of faults

Since human activities are involved, these four means are goals that cannot be fully reached

Fault Tolerance
The ability of an operational system to tolerate the presence of faults


Four phases:
1. Error Detection
2. Damage Assessment
3. Error Processing
4. Fault Treatment

Error Detection
The detection of an erroneous state that could lead to a subsequent failure

Damage Assessment
Performed after an error has been detected, to establish more precisely the extent to which the system is damaged

Error Processing
Error Recovery
An attempt to substitute the erroneous system state with one which is error-free
1. Backward recovery
2. Forward recovery

Fault Treatment
Diagnosis
Passivation
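
The four phases can be sketched in a few lines, using backward recovery for the error processing step. This is a minimal illustration under stated assumptions, not a reference implementation: the state is a plain dict, `is_valid` is a hypothetical error detection check supplied by the caller, damage assessment is reduced to treating the whole state as damaged, and passivation simply means refusing to run an operation that has been diagnosed as faulty.

```python
import copy


def run_with_backward_recovery(state, operation, is_valid, passivated):
    """Apply `operation` to `state`; roll back and passivate it on error.

    Returns True if the operation was applied, False if it was skipped or
    rolled back. All names here are illustrative assumptions.
    """
    if operation.__name__ in passivated:
        return False  # fault treatment already removed this operation from use

    checkpoint = copy.deepcopy(state)  # recovery point for backward recovery

    try:
        operation(state)
        error_detected = not is_valid(state)  # 1. error detection
    except Exception:
        error_detected = True

    if not error_detected:
        return True

    # 2. damage assessment: simplest policy, assume the whole state is damaged
    # 3. error processing: backward recovery to the saved checkpoint
    state.clear()
    state.update(checkpoint)
    # 4. fault treatment: diagnose the operation as faulty and passivate it
    passivated.add(operation.__name__)
    return False
```

Forward recovery would replace step 3 with code that repairs the erroneous state in place instead of restoring an earlier one.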

Redundancy
The duplication of critical components or functions of a system with the intention of increasing the reliability of the system

A fault tolerant system is assumed to support some level of redundancy, ensuring that faults can be tolerated using the four phases

Space
Hardware redundancy, denoted as H

Information
Software redundancy, denoted as S

Repetition
Time redundancy, denoted as T
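
Of the three forms, repetition (time redundancy) is the easiest to show in code: the same operation is simply executed again. The helper below is a hypothetical sketch, and it assumes the fault is transient, so that repeating the operation can succeed.

```python
def with_time_redundancy(operation, attempts=3):
    """Repeat `operation` up to `attempts` times and return the first result.

    If every attempt raises, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # treat any exception as a failed attempt
            last_error = exc
    raise last_error
```

Repetition costs only time, which is why it suits systems that cannot afford extra hardware, but it does nothing against a permanent fault that fails the same way on every attempt.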

Dependability Attributes
Enable the expected properties of a system to be expressed, and allow the quality of the system resulting from the impairments and the means opposing them to be assessed

There are four main attributes:


Availability
the extent to which a system is ready for use

Reliability
the extent to which a system continuously provides its service

Safety
the extent to which a system avoids catastrophic consequences on the environment

Security
the extent to which a system prevents unauthorized access and/or handling of information

Software Fault Tolerance Techniques
Recovery Block (RB)
N-Version Programming (NVP)
Consensus Recovery Block (CRB)
Distributed Recovery Block (DRB)
N Self-Checking Programming (NSCP)
Data Diversity

Basic elements of RB:

One primary module
A program module which performs the desired operation

Zero or more alternate modules
Perform the same desired operation in a different way

One acceptance test
A test which confirms the output of the modules

A module of a recovery block is rejected when:
An error in the operation of the module is explicitly detected by the acceptance test
The module fails to terminate, which is detected by a time-out
An error is detected during execution of the module by one of the implicit error detection mechanisms
An inner recovery block has failed because all of its modules were rejected, either explicitly or implicitly
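
A recovery block built from these elements can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function and exception names are hypothetical, the recovery cache is approximated by giving every module a fresh copy of the input, and time-out detection is left out to keep the sketch short.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when the primary and all alternates have been rejected."""


def recovery_block(data, modules, acceptance_test):
    """Run the modules in order (primary first) until one result is accepted."""
    for module in modules:
        try:
            # A fresh copy of the input stands in for restoring the recovery
            # cache before the next alternate runs.
            result = module(copy.deepcopy(data))
        except Exception:
            # Implicit error detection. A failed inner recovery block also
            # lands here, because RecoveryBlockFailure is an Exception.
            continue
        if acceptance_test(data, result):  # explicit error detection
            return result
    raise RecoveryBlockFailure("every module was rejected")


# Hypothetical modules: a faulty primary and a simpler alternate.
def primary_sort(items):
    return sorted(set(items))  # seeded fault: silently drops duplicates


def alternate_sort(items):
    return sorted(items)


def sort_acceptance_test(original, result):
    """Accept only an ordered result of the same length as the input."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)
```

Called as `recovery_block([3, 1, 3], [primary_sort, alternate_sort], sort_acceptance_test)`, the primary's output is rejected because it has lost an element, and the alternate's output is returned instead. Note that the acceptance test here is deliberately weaker than a full correctness check, which is typical: it only has to be good enough to reject implausible results.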

RB problems:
The types of faults tolerated by recovery blocks
Designing the primary and alternate modules
Designing the acceptance test
Designing the recovery cache mechanism

Basic elements of NVP:

The initial specification
The specification of the functionality desired of the software

N software versions
Software modules which are all independently generated from the initial specification

A decision mechanism
Decides what the final result of the computation will be, using the results from the N versions as input

A supervisory program
A software structure used to drive the N versions and the decision mechanism

In order for the decision mechanism to do its job, the outputs of the N versions must be synchronized
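
The roles of the supervisory program and the decision mechanism can be sketched as below. This is an illustrative sketch under simplifying assumptions: the versions run one after another, so synchronization reduces to collecting all results before voting, the decision mechanism is a plain majority voter over exact (hashable) results, and the three square-root versions are hypothetical stand-ins for independently developed software. `math.isqrt` requires Python 3.8 or later.

```python
import math
from collections import Counter


class NoMajority(Exception):
    """Raised when the decision mechanism cannot select a result."""


def majority_vote(results, n):
    """Decision mechanism: the value returned by more than half of n versions."""
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > n:
            return value
    raise NoMajority(f"no majority among {n} versions: {results!r}")


def run_n_versions(versions, *args):
    """Supervisory program: drive every version, then invoke the voter."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashed version contributes no vote
    return majority_vote(results, len(versions))


# Three hypothetical versions of an integer square root.
def isqrt_library(x):
    return math.isqrt(x)


def isqrt_search(x):
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r


def isqrt_faulty(x):
    return x // 2  # seeded design fault
```

With `run_n_versions([isqrt_library, isqrt_search, isqrt_faulty], 16)` the two correct versions outvote the faulty one. Exact comparison works here because the results are integers; a voter for floating-point outputs would usually need a tolerance, which is one of the harder parts of designing the decision mechanism.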

NVP problems:
The types of faults tolerated by N-version programming
The initial specification
Generating independent versions
The decision mechanism

Consensus Recovery Block (CRB)
A synthesis of the original recovery block and N-version programming

CRB problems:
Includes the RB problems
Includes the NVP problems
Cost of implementation
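
One way to picture the consensus recovery block is as a two-stage decision: vote first, as in NVP, and fall back to an acceptance test, as in RB, when the vote is inconclusive. The sketch below follows that reading. The names are hypothetical, and the agreement threshold (a strict majority) is an assumption made for the example rather than something the slides specify.

```python
from collections import Counter


class CRBFailure(Exception):
    """Raised when neither the vote nor the acceptance test yields a result."""


def consensus_recovery_block(versions, acceptance_test, *args):
    """Run all versions, vote on the results, then fall back to the test."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashed version contributes no result

    # Stage 1 (NVP style): accept a result backed by a majority of versions.
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > len(versions):
            return value

    # Stage 2 (RB style): no consensus, so apply the acceptance test to the
    # individual results in version order and take the first one that passes.
    for result in results:
        if acceptance_test(result, *args):
            return result

    raise CRBFailure("no consensus and no result passed the acceptance test")
```

The sketch also makes the listed problems visible: it needs N versions and a voter (the NVP problems), an acceptance test (the RB problem), and it pays for all of them at once.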

Distributed Recovery Block (DRB)
Integrates software and hardware fault tolerance into one single structure
Both the primary and the alternate modules are replicated and reside on two or more separate nodes interconnected by a network
Software faults -> handled in the traditional recovery block fashion
Hardware faults -> handled by the backup nodes

N Self-Checking Programming (NSCP)
The system is divided into several self-checking components composed of different variants (equivalent to alternates in RB and versions in NVP)

NSCP problems:
The usual problems associated with the acceptance test
The comparison mechanism

Data Diversity
Programs fail for special cases in the input space
The input data is moved out of the failure domain using one of two approaches:
Retry block
N-copy programming
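
A retry block can be sketched as follows. The example is hypothetical throughout: `fragile_sin` has a seeded fault for one special input, the acceptance test is a simple range check, and the re-expression relies on the identity sin(x) = sin(x + 2πk) to move the input out of the failure domain while keeping the expected result (up to floating-point rounding) the same.

```python
import math


def fragile_sin(x):
    """Hypothetical routine that fails for a special case in the input space."""
    if abs(x - 1.0) < 1e-9:  # seeded fault: tiny failure domain around 1.0
        return 2.0
    return math.sin(x)


def reexpress(x, k):
    """Logically equivalent input, since sin(x) == sin(x + 2*pi*k)."""
    return x + 2.0 * math.pi * k


def acceptance_test(result):
    """A sine value must lie in [-1, 1]."""
    return -1.0 <= result <= 1.0


def retry_block(x, max_retries=3):
    """Run the same algorithm on re-expressed inputs until a result is accepted."""
    for k in range(max_retries + 1):  # k == 0 is the original input
        result = fragile_sin(reexpress(x, k))
        if acceptance_test(result):
            return result
    raise RuntimeError("all re-expressed inputs were rejected")
```

N-copy programming applies the same idea in parallel: several copies of the one program run on differently re-expressed inputs and a voter selects the result, which is why the acceptance test and the voter remain the weak points of the two approaches.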

Data diversity problems:
The acceptance test
The voter

Since all fault tolerance depends on some kind of redundancy, fault-tolerant systems will always be more expensive
The fault tolerance technique of choice is highly application dependent
CRB and DRB are still mostly used for academic research

Low-cost systems should use fault tolerance schemes that do not make use of hardware redundancy
High-cost systems should use schemes such as NVP, NSCP or NCP (N-copy programming)

