11 Design of fault tolerant systems

Previous chapters have dealt with methods by which reliability can be determined and improved. We have also introduced the concepts of maintainability and the cost of reliability. This chapter adopts a rather different approach by looking at ways in which a piece of equipment or a system can be made to operate within specification (or to some defined lesser specification) when a fault is present.

The human body is an excellent example of a fault tolerant system. If we suffer an injury to a limb, the muscles in our other, fully functional limbs develop to compensate. If we lose one of our senses, our other senses develop to make up for that loss.

The concept of a fault tolerant system is quite simple; the important thing to remember is that fault tolerance has to be designed into a product. There are four basic stages in this process:

• Firstly, we need to know that a fault has occurred. This involves some means of monitoring the current level of performance of the equipment or system and detecting abnormal conditions when they arise.
• Secondly, we need to identify the nature and source of the fault in order to safeguard the operation of the system.
• Thirdly, we need to take the necessary action to maintain the performance of the system within specified limits or to a reduced specification (as appropriate). This can be achieved in a number of ways, including modifying the way in which the system operates (e.g. by switching a power source from one part of the system to another or by routing signals in a different direction) or by bringing redundant components and subsystems into operation (e.g. by switching to a backup power supply or a standby processor).
• Lastly, we need to update the status of the system, reporting the fault by generating appropriate messages, displays or alarms so that the user is made aware of the current state of the system. At some later stage, perhaps during a routine check cycle, the fault can be rectified and the system made to revert to its original state.
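The four stages above can be illustrated with a short Python sketch of a single monitoring pass over a hypothetical supply rail. Everything specific here is an assumption made for illustration: the class name `FaultTolerantSupply`, the 11 V to 13 V tolerance band and the choice of switching to a standby supply as the corrective action.

```python
from dataclasses import dataclass, field


@dataclass
class FaultTolerantSupply:
    """Hypothetical system with a main supply and a standby supply."""
    low_limit: float = 11.0        # assumed lower tolerance limit, volts
    high_limit: float = 13.0       # assumed upper tolerance limit, volts
    active_source: str = "main"
    status_log: list = field(default_factory=list)

    def detect(self, voltage: float) -> bool:
        # Stage 1: compare measured performance against its specification.
        return not (self.low_limit <= voltage <= self.high_limit)

    def identify(self, voltage: float) -> str:
        # Stage 2: establish the nature of the fault.
        return "undervoltage" if voltage < self.low_limit else "overvoltage"

    def act(self) -> None:
        # Stage 3: bring a redundant subsystem (the standby supply) into operation.
        self.active_source = "standby"

    def report(self, fault: str) -> None:
        # Stage 4: update the system status so that the user is made aware.
        self.status_log.append(
            f"{fault} on main supply; now running on {self.active_source}")

    def check(self, voltage: float) -> None:
        # One monitoring pass; only the main supply is monitored in this sketch.
        if self.active_source == "main" and self.detect(voltage):
            fault = self.identify(voltage)
            self.act()
            self.report(fault)
```

Calling `check()` with an in-range reading leaves the system on its main supply; an out-of-range reading runs all four stages in turn, after which the log holds one entry describing the fault and the new configuration.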
64 An Elementary Guide to Reliability

Fault tolerance
Fault tolerance can be defined as the ability of a system to operate within specification (or to some lesser defined specification) when a fault is present. Clearly, the more complex and more critical an item of equipment or system is, the more it can benefit from a degree of fault tolerance. Simple, non-critical equipment is unlikely to be a candidate for a fully fault tolerant design, even though some parts of it may exhibit a degree of fault tolerance in their operation.
Examples of fault tolerant systems include:

• Flight control systems: the Boeing 777 has no fewer than seven Inertial Reference Units (IRUs), while the Boeing 747-400 has three. Since only three IRUs are needed at any one time, the failure of one IRU on a Boeing 777 is not particularly serious.
• Computer networks: computer networks are made more robust by using adaptive, fault tolerant software. A token ring network, for example, works by passing a 'token' between the stations in the network. Network software can be made to detect lost or corrupt tokens and to invalidate any duplicate tokens that may subsequently be generated. This process is transparent to the network user, who is usually blissfully unaware that a fault has occurred.
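The text does not say how the outputs of redundant IRUs are combined. One widely used technique, assumed here purely for illustration, is mid-value (median) selection across the units that pass their self-test, so that a single wild reading cannot drag the result. The function name `select_pitch`, the health flags and the use of pitch angle are hypothetical; only the minimum of three units comes from the text.

```python
from statistics import median

MIN_UNITS = 3  # the text states that three IRUs are needed at any one time


def select_pitch(readings, healthy):
    """Return the median pitch angle (degrees) from the healthy units.

    readings: one pitch value per IRU (hypothetical values)
    healthy:  one boolean per IRU, e.g. from its built-in self-test
    Raises RuntimeError when fewer than MIN_UNITS healthy units remain.
    """
    usable = [r for r, ok in zip(readings, healthy) if ok]
    if len(usable) < MIN_UNITS:
        raise RuntimeError(
            f"only {len(usable)} healthy IRUs; {MIN_UNITS} required")
    # The median is unaffected by one outlier among three or more inputs.
    return median(usable)
```

With seven units, one failing its self-test and one giving a wild value, the result still reflects the five agreeing units. Pitch is used rather than heading because a plain median does not handle the wrap-around at 0/360 degrees.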

There are two basic approaches to making equipment or systems fault tolerant.
One method involves the use of additional hardware while the other involves the
use of software.

Hardware methods
Hardware methods involve the use of additional logic or a programmable logic array (PLA) to make logical decisions about the state of the system at any time. Hardware methods are well suited to performing such basic tasks as:

• detecting missing signals
• indicating out-of-specification supply voltages
• identifying timing or framing errors.

On a system without fault tolerance, once a fault condition has been detected, the output signal would typically be used to drive a warning device such as a signal lamp, LED, magnetic indicator, electromechanical flag or piezoelectric transducer. Where a system has a degree of fault tolerance, the output signal is instead used to initiate changeover to a redundant component or subsystem, or to modify the behaviour of the system in such a way as to safeguard essential aspects of its operation.

Figure 11.1 shows the simplified arrangement of a fault tolerant system based on hardware.

Figure 11.1. [Hardware-based fault tolerant system: a mains voltage sensor, a battery voltage sensor and signals A to E feed a logic system (PLA), whose outputs indicate 'Mains supply failed', 'Battery supply low or missing', 'Main display failure', 'Out-of-range input detected' and 'Framing error detected'.]
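The logic block in Figure 11.1 can be modelled as a set of Boolean equations from sensor inputs to fault outputs. The sketch below is a behavioural model in Python, not PLA programming; the input names, the thresholds and the way each output is derived are assumptions, since the figure does not give the logic equations or say which of signals A to E drives which output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inputs:
    mains_ok: bool          # mains voltage sensor (assumed True when mains present)
    battery_volts: float    # battery voltage sensor reading
    display_ack: bool       # hypothetical signal confirming the main display is driven
    sensor_value: float     # hypothetical analogue input to be range-checked
    framing_ok: bool        # hypothetical serial receiver framing status


BATTERY_MIN = 11.5          # assumed minimum acceptable battery voltage
INPUT_RANGE = (0.0, 10.0)   # assumed valid range for sensor_value


def fault_outputs(i: Inputs) -> dict:
    """Combinational mapping of inputs to the five fault outputs of Figure 11.1."""
    low, high = INPUT_RANGE
    return {
        "mains_supply_failed": not i.mains_ok,
        "battery_low_or_missing": i.battery_volts < BATTERY_MIN,
        "main_display_failure": not i.display_ack,
        "out_of_range_input": not (low <= i.sensor_value <= high),
        "framing_error": not i.framing_ok,
    }
```

Each output depends only on the present inputs, which is why dedicated logic can evaluate all of them continuously without using any processor time.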

Software methods
Software methods involve incorporating routines, procedures or functions within control programs that will:

• perform full system diagnostics during initialization;
• perform periodic diagnostic checks during program execution (for example, periodically reading a status byte);
• ensure that out-of-range indications are recognized and erroneous data is ignored;
• log faults as they occur, together (where possible) with sufficient information, including date and time, for the user to determine when the fault occurred and the circumstances prevailing at the time.
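The four software duties listed above can be sketched as small Python functions sharing a time-stamped fault log. The names (`FaultLogger`, `startup_diagnostics`, `periodic_check`, `accept_reading`), the temperature range and the one-bit-per-fault layout of the status byte are all assumptions made for illustration.

```python
from datetime import datetime

VALID_RANGE = (-40.0, 125.0)  # assumed valid range for a temperature input, deg C


class FaultLogger:
    """Hypothetical fault log: each entry holds a timestamp, a message and context."""

    def __init__(self):
        self.entries = []

    def log(self, message, **context):
        self.entries.append((datetime.now(), message, context))


def startup_diagnostics(checks, logger):
    """Run every named self-test once at initialization; True if all pass."""
    ok = True
    for name, test in checks.items():
        if not test():
            logger.log("start-up test failed", test=name)
            ok = False
    return ok


def periodic_check(read_status_byte, logger):
    """Read a status byte and log each set bit (bit n = fault n, assumed)."""
    status = read_status_byte()
    for bit in range(8):
        if status & (1 << bit):
            logger.log("status bit set", bit=bit)
    return status


def accept_reading(value, logger, source):
    """Return the reading if it is in range; otherwise log it and return None."""
    low, high = VALID_RANGE
    if low <= value <= high:
        return value
    logger.log("out-of-range reading ignored", source=source, value=value)
    return None
```

In a real controller `periodic_check` would be called from a timer or main loop, and the log would be kept in non-volatile storage so that it survives the fault it records.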

Certain fault conditions, such as loss of a power rail or of the signal from a fire detection loop, are so important that they must be dealt with immediately upon detection. These signals can be given a high priority within the system of interrupt signals sent to the CPU, either directly or via an external interrupt controller. Provided that the CPU is accepting interrupts at the current level (i.e. its internal logic has not been placed in a state in which interrupts are 'masked' or 'disabled'), the processor will suspend its current operation (saving important data so that an orderly return can be made to the point at which it was interrupted) and then determine the source of the interrupt, for example by polling each subsystem to establish which one issued the interrupt request. Having established the source of the interrupting signal, the processor can then execute an appropriate interrupt service routine (ISR) to deal with the fault before returning to the previously suspended task.
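The interrupt sequence described above (check the mask, save context, poll for the source, run its ISR, restore context) can be simulated in Python. This is a behavioural sketch only; real processors perform these steps in hardware and firmware, and the class names and the two example sources are hypothetical.

```python
class Subsystem:
    """Hypothetical subsystem that can raise an interrupt request."""

    def __init__(self, name, isr):
        self.name = name
        self.isr = isr              # interrupt service routine for this source
        self.requesting = False


class SimulatedCPU:
    def __init__(self, subsystems):
        self.subsystems = subsystems    # listed in descending priority order
        self.interrupts_enabled = True
        self.saved_context = None

    def interrupt(self, current_task_state):
        """Respond to the interrupt request line being asserted."""
        if not self.interrupts_enabled:
            # Masked: the request stays pending until interrupts are re-enabled.
            return current_task_state
        self.saved_context = dict(current_task_state)   # save context
        self.interrupts_enabled = False                  # no nesting in this sketch
        for sub in self.subsystems:                      # poll to find the source
            if sub.requesting:
                sub.isr()
                sub.requesting = False
                break
        self.interrupts_enabled = True
        restored, self.saved_context = self.saved_context, None
        return restored                                  # resume suspended task
```

Listing the subsystems in priority order means that, when several request service at once, the poll reaches the most important one (here the power rail) first.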

Figure 11.2. [Interrupt-driven fault handling: the CPU and several peripheral devices or subsystems share an information bus; interrupt inputs, including interrupts from other peripheral devices and subsystems, pass through a priority encoder, which presents a coded interrupt to the CPU.]
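The priority encoder in Figure 11.2 reduces several interrupt lines to a single code identifying the most urgent active request. Its behaviour can be modelled in a few lines of Python; the convention that the highest-numbered line has the highest priority matches common 8-line to 3-line encoders, but it is an assumption here, as is the function name `priority_encode`.

```python
def priority_encode(requests: int, width: int = 8):
    """Return (valid, code) for the highest-priority active request line.

    requests: bit n set means interrupt input n is asserted.
    Assumed convention: line width-1 has the highest priority.
    valid is False (and code 0) when no line is asserted.
    """
    for line in range(width - 1, -1, -1):
        if requests & (1 << line):
            return True, line
    return False, 0
```

Because the CPU receives the code directly, it can go straight to the ISR for that source instead of polling every subsystem.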

Other, less important, fault conditions can be detected by generating one or more status bytes in which each bit represents the signal from an input or output, and then placing these bytes on the data bus where they can be read at regular intervals by the CPU. Since a very large number of fault conditions could potentially be present within a complex system, a look-up table (LUT) can be used to hold a set of 'signature' bytes for each fault condition. When a fault is detected, this table is searched until a match is found, and the necessary corrective action is then taken (see Figure 11.3).
Figure 11.3. [Flowchart: set pointer to start of LUT; input diagnostic byte(s); compare byte(s) with LUT entry; if not the same, increment pointer and compare again; if the same, display error message/take appropriate recovery action.]
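The flowchart of Figure 11.3 translates almost line for line into a linear search. In this Python sketch the table contents, the two-byte signatures and the recovery-action names are hypothetical; the end-of-table exit is also an addition, since the flowchart does not show what happens when no entry matches.

```python
# Hypothetical look-up table: (signature bytes, error message, recovery action)
FAULT_LUT = [
    (bytes([0x01, 0x00]), "Mains supply failed", "switch_to_battery"),
    (bytes([0x02, 0x00]), "Battery supply low or missing", "alarm_only"),
    (bytes([0x00, 0x10]), "Framing error detected", "request_retransmit"),
]


def search_lut(diagnostic: bytes, lut=FAULT_LUT):
    """Compare the diagnostic bytes with each LUT entry in turn."""
    pointer = 0                                 # set pointer to start of LUT
    while pointer < len(lut):
        signature, message, action = lut[pointer]
        if diagnostic == signature:             # same?
            return message, action              # yes: display message, take action
        pointer += 1                            # no: increment pointer
    return None                                 # end of table: unrecognized fault
```

For exact-match signatures, a dictionary keyed on the signature bytes would find the entry in constant time; the linear form is kept here because it mirrors the flowchart step by step.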
Table 11.1 compares hardware and software methods.

Table 11.1

Hardware methods
  Advantages: Simple. Does not require computer processing time.
  Disadvantages: Not easy to modify or reconfigure. More suited to a limited number of fault conditions.

Software methods
  Advantages: Easy to modify and reconfigure. Easy to implement in systems that are already computer-based. Can easily cope with a large number of fault conditions.
  Disadvantages: Requires programming expertise. Higher initial cost. Requires computer processing time.

Combination of methods
Finally, it is possible to combine software and hardware methods and enjoy some of the benefits of both approaches. A typical example is a 'watchdog controller' within a programmable logic controller (PLC) system or as part of a bus-based industrial controller.
The watchdog controller usually comprises an interface card or module which incorporates its own intelligent processor, interface logic, control program and timer. The watchdog controller has shared access to all the system bus lines and is therefore able to determine the status of the system at any time. The controller is also able to alert the main CPU (or the current bus 'master' processor in a multiprocessing system) using one or more of the interrupt request (IRQ) or attention request (ATNRQ) lines. When these lines are asserted by the processor in the watchdog controller, the main CPU suspends normal processing within the main control program and starts to execute instructions that initiate recovery.
The facilities available from a simple watchdog controller generally include:

• generating a status byte that is periodically read (typically every 1 to 2.5 seconds) by the main control program. This status byte provides an indication of the current state of the system. If the status byte is not read within a predetermined period, the watchdog controller assumes that a fault condition has been encountered and the board takes appropriate action. Typical situations in which the watchdog status byte is not read include major system failures, attempts to access inoperative peripheral hardware (a hardware hang) and the software running in an uncontrolled infinite loop (a software hang).
• monitoring one or more of the power rails and generating appropriate signals when the voltage present fails to meet the defined tolerances for the rail concerned. Typical actions in the event of a low voltage being detected involve preserving important system variables before switching over to backup supplies or standby batteries.
• exploiting the multiprocessing capability of a system by making use of independent processors and, where necessary, duplicate I/O circuitry attached to independent signal conditioning boards. Typical actions in the event of detecting I/O failure involve momentarily suspending operation of the main control program, treating current data values as invalid, and switching to other multiplexed I/O lines or differently addressed devices.
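The status-byte facility of the watchdog controller can be sketched in Python as follows. Times are passed in explicitly so the behaviour is easy to follow; the class name `Watchdog`, the 3 s timeout (chosen to sit just above the 1 to 2.5 s read interval quoted in the text) and the latching behaviour are all assumptions.

```python
class Watchdog:
    """Hypothetical watchdog: the main program must read the status byte in time."""

    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s   # assumed limit, just above the normal read interval
        self.last_read = 0.0
        self.tripped = False
        self.status_byte = 0x00      # would report system state in a real controller

    def read_status(self, now: float) -> int:
        # Called by the main control program; each read shows it is still running.
        self.last_read = now
        return self.status_byte

    def tick(self, now: float) -> bool:
        # Called from the watchdog's own timer. Returns True once a hang is assumed;
        # the fault stays latched until recovery action resets the watchdog.
        if not self.tripped and now - self.last_read > self.timeout_s:
            self.tripped = True
        return self.tripped
```

Because the watchdog runs on its own processor and timer, it still trips when the main program is stuck in an infinite loop or waiting on dead hardware, which is exactly when a purely software check would fail to run.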
