Week 09: Fault Tolerant System & Business Continuity Planning
Week-09: Fault Tolerant System
Topics Covered
• Introduction to fault-tolerant systems
• Fault, error, and failure
• MTTF and MTTR
• Fault-tolerance techniques
• Real-world high-availability systems
Introduction
• Reliability and availability have become
increasingly important in today’s
computer-dependent world.
• To achieve the needed reliability and
availability, we need fault-tolerant computers.
Fault tolerant system
• A fault-tolerant system can tolerate faults
by detecting failures and isolating defective
modules so that the rest of the system can
operate correctly.
Case Study
• According to a study of Tandem systems, the
percentage of outages caused by hardware
faults was 30% in 1985 but had decreased to
10% by 1989. Over the same period, outages
caused by software faults increased from 43%
to over 60%.
Fault, Error, Failure
• When a system or module is designed, its
behavior is specified. When in service, we can
observe its behavior. When the observed
behavior differs from the specified behavior,
we call it a failure.
• A failure occurs because of an error, caused by
a fault.
• For example, a cosmic ray that discharges a
memory cell is a fault, and the corrupted cell
contents are an error. When the memory cell
is read, the error becomes effective and we
have a memory failure.
Module reliability
• The reliability of a module is statistically
quantified as mean time to failure (MTTF).
• The average time it takes to repair a module
after the failure is detected is called
mean time to repair (MTTR).
Module Availability
• From MTTF and MTTR we get the module
availability, which is the ratio of service
accomplishment to elapsed time:
availability = MTTF / (MTTF + MTTR).
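The availability ratio is commonly written as MTTF / (MTTF + MTTR). The following is a minimal sketch of that arithmetic, not part of the original slides; the function names and the sample figures (999 hours between failures, 1 hour to repair) are illustrative assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of elapsed time during which the module delivers service."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of outage per year implied by the availability."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical module: fails every 999 hours on average, repaired in 1 hour.
    print(f"availability: {availability(999, 1):.4f}")
    print(f"downtime/year: {downtime_hours_per_year(999, 1):.2f} h")
```

Note that availability can be raised either by increasing MTTF (fewer failures) or by decreasing MTTR (faster repair).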
• We classify systems into different availability
classes, as shown in the table. Currently, most
general-purpose systems operate in class
3 or 4.
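The class table itself is not reproduced in this text. The sketch below assumes the common convention that the availability class is the number of leading nines in the availability figure (class 3 is at least 99.9%, class 4 is at least 99.99%), that is, floor(log10(1 / (1 - availability))). The function name and the sample MTTF/MTTR values are hypothetical.

```python
import math


def availability_class(mttf_hours: float, mttr_hours: float) -> int:
    """Number of leading nines in the availability (assumed class definition).

    Unavailability is MTTR / (MTTF + MTTR), so 1 / unavailability is
    (MTTF + MTTR) / MTTR. Working from that ratio avoids the rounding
    error of computing 1 - availability in floating point.
    """
    return math.floor(math.log10((mttf_hours + mttr_hours) / mttr_hours))


if __name__ == "__main__":
    # Hypothetical systems, each repaired in 1 hour on average.
    for mttf in (150, 5_000, 50_000):
        print(mttf, "h MTTF -> class", availability_class(mttf, 1))
```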
Fault-Tolerance Techniques
• Hardware redundancy
• Information redundancy
• Software redundancy
• Time redundancy
Hardware Redundancy
• A module can be made fail-fast by
duplication.
• Two identical copies of a module are
employed, with a comparator checking the
outputs of the two copies.
• When the outputs differ, a fault is detected.
• This is a widely used technique because it is
easy to realize and relatively cheap.
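The duplicate-and-compare idea above can be sketched in software as follows. This is only an illustration of the principle: in real systems the copies and the comparator are hardware. All names here (`FaultDetected`, `fail_fast`, the adder functions) are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when the two copies of a module disagree."""


def fail_fast(copy_a, copy_b):
    """Wrap two copies of a module behind a comparator.

    The wrapped module returns a result only when both copies agree.
    Otherwise it stops immediately instead of delivering a wrong answer.
    """
    def module(*args):
        out_a = copy_a(*args)
        out_b = copy_b(*args)
        if out_a != out_b:  # the comparator
            raise FaultDetected(f"outputs differ: {out_a!r} vs {out_b!r}")
        return out_a
    return module


def add(a, b):
    return a + b


def broken_add(a, b):
    return a + b + 1  # injected fault, for demonstration only


healthy_adder = fail_fast(add, add)
faulty_adder = fail_fast(add, broken_add)

if __name__ == "__main__":
    print(healthy_adder(2, 3))
    try:
        faulty_adder(2, 3)
    except FaultDetected as exc:
        print("fault detected:", exc)
```

Duplication detects a fault but cannot tell which copy is wrong. Masking the fault requires at least three copies and a vote.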
Information Redundancy
• Information redundancy is the addition of
extra information to data, to allow error
detection and correction.
• Typical forms are error-detecting codes,
error-correcting codes (ECC), and
self-checking circuits.
• Parity codes are used in most modern
computers for memory error detection.
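As a minimal sketch of information redundancy, the code below appends a single even-parity bit to a data word and checks it on read. The function names and the 4-bit example word are assumptions made for illustration; real memory systems implement this in hardware.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def encode(word: int) -> int:
    """Append the parity bit as the least significant bit of the codeword."""
    return (word << 1) | parity_bit(word)


def check(codeword: int) -> bool:
    """A valid codeword always contains an even number of 1 bits."""
    return bin(codeword).count("1") % 2 == 0


if __name__ == "__main__":
    stored = encode(0b1011)
    print(check(stored))            # intact codeword
    print(check(stored ^ 0b00100))  # one bit flipped: detected
    print(check(stored ^ 0b00110))  # two bits flipped: not detected
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them, and an even number of flips goes unnoticed. Error-correcting codes add more check bits to close that gap.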
Software Redundancy
• There are some important differences
between software and hardware errors.
• Software development is also a more complex
and immature art than hardware design.
• It is said that perfect software is possible;
it’s just a matter of time and money.
• There are two software fault-tolerance
techniques:
• N-version programming: Write the program N
times, then operate all N programs in parallel,
and take a majority vote for each answer.
• Transactions: Write the program as a
transaction. Use a consistency check at the
end, and if the conditions are not met, restart.
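Both techniques can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production design: the three "versions", the account dictionary, and every function name below are hypothetical.

```python
import copy
from collections import Counter


def n_version_vote(versions, x):
    """Run every version on the same input and return the majority answer."""
    answers = [version(x) for version in versions]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes <= len(answers) // 2:
        raise RuntimeError(f"no majority among answers {answers}")
    return answer


# Three independently written "versions" of squaring; the third has a bug.
SQUARE_VERSIONS = [lambda x: x * x, lambda x: x ** 2, lambda x: x * 2]


def run_as_transaction(state, body, is_consistent, max_attempts=3):
    """Run body on a private copy of state and commit only if the check passes.

    Returns the number of attempts used. The original state is left
    untouched by any attempt that fails the consistency check.
    """
    for attempt in range(1, max_attempts + 1):
        working = copy.deepcopy(state)
        body(working)
        if is_consistent(working):
            state.clear()
            state.update(working)  # commit
            return attempt
        # otherwise discard the working copy and restart
    raise RuntimeError("transaction failed the consistency check every time")


def make_flaky_transfer(amount):
    """Transfer from account 'a' to 'b'; the first call loses the credit."""
    calls = {"n": 0}

    def transfer(accounts):
        calls["n"] += 1
        accounts["a"] -= amount
        if calls["n"] == 1:
            return  # injected transient fault, for demonstration only
        accounts["b"] += amount

    return transfer


if __name__ == "__main__":
    print(n_version_vote(SQUARE_VERSIONS, 3))
    accounts = {"a": 100, "b": 0}
    attempts = run_as_transaction(
        accounts, make_flaky_transfer(30), lambda acc: sum(acc.values()) == 100
    )
    print(accounts, "after", attempts, "attempts")
```

Voting only helps when the versions fail independently. Restarting a transaction only helps when the fault is transient.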
Software Fault Detection
• Watchdog timers and timeouts: A watchdog
daemon process can monitor an application
by periodically sending the process a signal
and checking the return value to detect
whether it is alive.
• Consistency checking/self-checking: The
programs can use assertions to check the
results of computations.
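A minimal sketch of both detection techniques, assuming a POSIX system: sending signal 0 with `os.kill` performs the existence and permission checks without delivering a signal, so its outcome tells the watchdog whether the process still exists. The helper names (`is_alive`, `watch`, `checked_isqrt`) are hypothetical.

```python
import math
import os
import time


def is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for the process without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def watch(pid: int, checks: int = 3, interval_s: float = 1.0, probe=is_alive) -> bool:
    """Probe pid up to `checks` times; return False as soon as it looks dead."""
    for _ in range(checks):
        if not probe(pid):
            return False
        time.sleep(interval_s)
    return True


def checked_isqrt(n: int) -> int:
    """Self-checking computation: assert the result satisfies its specification."""
    root = math.isqrt(n)
    assert root * root <= n < (root + 1) * (root + 1), "isqrt result inconsistent"
    return root


if __name__ == "__main__":
    print(watch(os.getpid(), checks=2, interval_s=0.1))
    print(checked_isqrt(17))
```

A liveness probe like this only shows that the process exists, not that it is making progress. Detecting a hung application requires an application-level heartbeat.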
Time Redundancy
• Hardware and information redundancy
require extra hardware.
• This can be avoided by performing an operation
several times in the same module and checking
the results, instead of performing it in parallel
on several modules and comparing the outputs.
• This reduces the amount of hardware at the
expense of additional time.
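Time redundancy can be sketched as repeating one operation on the same module and voting on the results. The function names and the injected transient fault below are assumptions made for illustration.

```python
from collections import Counter


def run_with_time_redundancy(operation, *args, repeats=3):
    """Execute the same operation several times in sequence and vote."""
    results = [operation(*args) for _ in range(repeats)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= repeats // 2:
        raise RuntimeError(f"no majority among results {results}")
    return value


def make_transient_adder(faulty_call):
    """Adder that returns a wrong sum on exactly one call (simulated upset)."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        if calls["n"] == faulty_call:
            return a + b + 1  # injected transient fault
        return a + b

    return add


if __name__ == "__main__":
    adder = make_transient_adder(faulty_call=2)
    print(run_with_time_redundancy(adder, 2, 3))
```

Plain repetition masks transient faults only. A permanent fault in the module produces the same wrong result on every run, so the vote passes it through.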
Fault-Tolerance in General-Purpose
Computers
• Hardware techniques: a processor contains
many registers, which must be protected to
provide fault tolerance.
• Software techniques: database transactions.
High-availability computer systems
• Tandem Computers
• Stratus
• MARS
• Sun Netra ft 1800
• Fault-Tolerance on Clusters
Key points
• This chapter is an introduction to fault-
tolerance concepts and systems, mainly from
the hardware point of view.
• There are four fault-tolerance techniques:
hardware, information, software, and time
redundancy.
• Finally, some systems are studied as case
examples, including Tandem, Stratus, MARS,
and Sun Netra ft 1800.