Design of Fault Tolerant Systems

The document discusses fault tolerant systems and how they can be designed. It provides two examples of fault tolerant systems: flight control systems and computer networks. There are generally four stages to designing fault tolerance: detecting faults, identifying faults, taking action to maintain performance, and updating the system status. Fault tolerance can be achieved through hardware methods like additional logic circuits or software methods like diagnostic routines. The key is incorporating fault tolerance during the initial product design.

Uploaded by

Andre Mars
Copyright
© All Rights Reserved

11 Design of fault tolerant systems

Previous chapters have dealt with methods by which reliability can be determined
and improved. We have also introduced the concepts of maintainability and the
cost of reliability. This chapter adopts a rather different approach by looking at
ways in which an item of equipment or a system can be made to operate within
specification (or to some defined lesser specification) when a fault is present.

The human body is an excellent example of a fault tolerant system. If we should
suffer an injury to a limb, the muscles on our other fully functional limbs will
develop in order to compensate. If we suffer the loss of one of our senses, our
other senses will develop in order to make up for that loss.

The concept of a fault tolerant system is quite simple - the important thing to
remember is that fault tolerance has to be designed into a product. There are four
basic stages in this process:

• Firstly, we need to know that a fault has occurred - this process involves some
  means of monitoring the current level of performance of the equipment or
  system and detecting abnormal conditions when they arise.
• Secondly, we need to identify the nature and source of the fault in order to
  safeguard the operation of the system.
• Thirdly, we need to take the necessary action in order to maintain the
  performance of the system to within specified limits or to a reduced
  specification (as appropriate). This can be achieved in a number of different
  ways, including modifying the way in which the system operates (e.g. by
  switching a power source from one part of the system to another or by routing
  signals in a different direction) or by bringing redundant components and
  subsystems into operation (e.g. by switching to a backup power supply or a
  standby processor).
• Lastly, we need to update the status of the system, reporting the fault by
  generating appropriate messages, displays or alarms so that the user is made
  aware of the current state of the system. At some later stage - perhaps during
  a routine check cycle - the fault can be rectified and the system made to
  revert to its original state.
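
The four stages above can be sketched in a few lines of Python. This is only an illustrative model, not a design taken from the text: the class name, the sensor callback `read_main_voltage` and the voltage limits are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FaultTolerantSupply:
    """Toy power subsystem with a main source and a standby source."""
    read_main_voltage: Callable[[], float]   # hypothetical sensor callback
    low_limit: float = 4.75                  # assumed lower limit (volts)
    high_limit: float = 5.25                 # assumed upper limit (volts)
    active_source: str = "main"
    alarms: List[str] = field(default_factory=list)

    def detect(self) -> Optional[float]:
        """Stage 1: monitor performance and detect an abnormal condition."""
        volts = self.read_main_voltage()
        if self.low_limit <= volts <= self.high_limit:
            return None
        return volts

    def identify(self, volts: float) -> str:
        """Stage 2: identify the nature of the fault."""
        return "undervoltage" if volts < self.low_limit else "overvoltage"

    def act(self) -> None:
        """Stage 3: bring the redundant (standby) source into operation."""
        self.active_source = "standby"

    def report(self, fault: str) -> None:
        """Stage 4: update the system status and alert the user."""
        self.alarms.append(f"main supply {fault}; running on {self.active_source}")

    def step(self) -> None:
        """Run one monitoring cycle through all four stages."""
        volts = self.detect()
        # Only change over and report once; later cycles keep the standby source.
        if volts is not None and self.active_source == "main":
            fault = self.identify(volts)
            self.act()
            self.report(fault)
```

Calling `step()` at regular intervals corresponds to the monitoring described in the first stage. Reverting to the main source after repair is left out of the sketch.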
64 An Elementary Guide to Reliability

Fault tolerance
Fault tolerance can be defined as the ability of a system to operate within
specification (or to some lesser defined specification) when a fault is present.
Clearly, the more complex and more critical an item of equipment or system is,
the more it can benefit from a degree of fault tolerance. Simple, non-critical
equipment is unlikely to be a candidate for the implementation of a fault
tolerant system - even though some parts of the equipment may exhibit a degree
of fault tolerance in their operation.

Examples of fault tolerant systems include:

• Flight control systems - the new Boeing 777 has no fewer than seven Inertial
  Reference Units (IRU) while the Boeing 747-400 has three. Since only three
  IRUs are necessary at any one time, the failure of one IRU on a Boeing 777 is
  not particularly serious.
• Computer networks - computer networks are made more robust by using adaptive,
  fault tolerant software. A token ring network, for example, involves passing
  'tokens' between stations in the network. Network software can be made to
  detect lost or corrupt tokens or to render invalid duplicate tokens that may
  subsequently be generated. This process is quite transparent to the network
  user who is usually blissfully unaware that a fault has occurred.
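
The benefit of carrying more units than are strictly needed, as in the first example above, can be estimated with the standard 'k-out-of-n' reliability formula. The sketch below assumes that all units are identical and fail independently, which the text does not state, and the unit reliability of 0.9 used in the comparison is purely illustrative.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, unit_reliability: float) -> float:
    """Probability that at least k of n independent, identical units work."""
    r = unit_reliability
    # Sum the binomial probabilities of exactly i working units, for i = k..n.
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))


# Three units needed in both cases; 0.9 is an assumed figure for illustration.
three_of_three = k_out_of_n_reliability(3, 3, 0.9)
three_of_seven = k_out_of_n_reliability(3, 7, 0.9)
```

Under these assumptions a system that needs all three of its three units is noticeably less reliable than any single unit, while a system that needs any three of seven is considerably more reliable than a single unit.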

There are two basic approaches to making equipment or systems fault tolerant.
One method involves the use of additional hardware while the other involves the
use of software.

Hardware methods
Hardware methods involve the use of additional logic or a programmable logic
array (PLA) in order to make logical decisions concerning the state of the system
at any time. Hardware methods are well suited to performing such basic tasks as:

• detecting missing signals
• indicating out-of-specification supply voltages
• identifying timing or framing errors.

On a system without fault tolerance, once a fault condition has been detected, the
output signal would typically be used to drive a warning device such as a signal
lamp, LED, magnetic indicator, electromechanical flag or piezoelectric
transducer. Where a system has a degree of fault tolerance, the output signal is
used to initiate changeover to a redundant component or subsystem or to modify
the behaviour of the system in such a way as to safeguard essential aspects of its
operation.

Figure 11.1 shows the simplified arrangement of a fault tolerant system based
on hardware.

[Block diagram: a mains voltage sensor, a battery voltage sensor and Signals A
to E feed a logic system (PLA). The outputs of the logic system are: mains
supply failed; battery supply low or missing; main display failure;
out-of-range input detected; framing error detected.]

Figure 11.1.
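
The kind of decision made by the logic system can be modelled as a set of Boolean functions of its inputs. The sketch below covers only the two supply sensors. The `use_battery` changeover output and the `total_power_loss` output are assumptions added for illustration and do not appear in the figure.

```python
from typing import NamedTuple


class SupplyFlags(NamedTuple):
    mains_failed: bool
    battery_low_or_missing: bool
    use_battery: bool        # assumed changeover signal to the redundant supply
    total_power_loss: bool   # assumed extra output, not shown in the figure


def supply_logic(mains_ok: bool, battery_ok: bool) -> SupplyFlags:
    """Combinational logic: every output depends only on the present inputs."""
    mains_failed = not mains_ok
    battery_fault = not battery_ok
    return SupplyFlags(
        mains_failed=mains_failed,
        battery_low_or_missing=battery_fault,
        use_battery=mains_failed and battery_ok,
        total_power_loss=mains_failed and battery_fault,
    )


# Print the full truth table for the two sensor inputs.
for mains_ok in (True, False):
    for battery_ok in (True, False):
        print(mains_ok, battery_ok, supply_logic(mains_ok, battery_ok))
```

Because the outputs are pure functions of the inputs, the same table could be realised directly in gates or in a PLA, which is why no processor time is needed.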

Software methods
Software methods involve incorporating software routines, procedures or
functions within control programs that will:

• perform full system diagnostics during initialization;
• perform periodic diagnostic checks during program execution (for example,
  periodically reading a status byte);
• ensure that out-of-range indications are recognized and erroneous data is
  ignored;
• log faults as they occur, together (where possible) with sufficient
  information (including date and time) so that the user can determine the point
  at which the fault occurred and the circumstances that were prevailing at the
  time.
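
Three of the routines listed above (start-up diagnostics, rejection of out-of-range data and time-stamped fault logging) might look something like the following sketch. The function names, the check names and the log format are hypothetical and serve only to illustrate the idea.

```python
from datetime import datetime
from typing import Callable, Dict, List, Optional, Tuple

FaultLog = List[Tuple[datetime, str]]   # (date and time, description) pairs


def log_fault(log: FaultLog, message: str) -> None:
    """Record a fault together with the date and time at which it was seen."""
    log.append((datetime.now(), message))


def startup_diagnostics(checks: Dict[str, Callable[[], bool]], log: FaultLog) -> bool:
    """Run every start-up check; log each failure and report overall success."""
    all_ok = True
    for name, check in checks.items():
        if not check():
            log_fault(log, f"start-up check failed: {name}")
            all_ok = False
    return all_ok


def validated_reading(value: float, low: float, high: float,
                      log: FaultLog) -> Optional[float]:
    """Return the reading if it is in range; otherwise log it and ignore it."""
    if low <= value <= high:
        return value
    log_fault(log, f"out-of-range reading ignored: {value}")
    return None
```

A control program would call `startup_diagnostics` once during initialization and pass every subsequent input through `validated_reading`, treating a result of `None` as 'no new data'.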

Certain fault conditions, such as loss of a power rail or a signal from a fire
detection loop, are so important that they must be dealt with immediately upon
detection. These signals can be given a high priority within the system of
interrupt signals sent to the CPU - either directly or via an external interrupt
controller. Provided that the CPU is accepting interrupts at the current level
(i.e. that its internal logic has not been placed in a state in which interrupts
are 'masked' or 'disabled'), the processor will suspend its current operation
(saving important data so that an orderly return can be made to the point at
which it was interrupted) and then determine the source of the interrupt - for
example, by polling each subsystem to establish which was the instigator of the
interrupt request. Having established the source of the interrupting signal, the
processor can then execute an appropriate fault correcting interrupt service
routine (ISR) before returning to the previously suspended task.
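
The sequence just described (check the mask, save state, poll for the source, run the service routine, return) can be simulated as follows. This is a toy model rather than real processor behaviour: the class names are hypothetical, and the convention that a lower number means a higher priority is an assumption.

```python
from typing import Callable, Dict, List, Optional


class Subsystem:
    """A peripheral that can raise an interrupt request."""

    def __init__(self, name: str, priority: int) -> None:
        self.name = name
        self.priority = priority   # assumed convention: lower number = more urgent
        self.requesting = False


class ToyCPU:
    def __init__(self, subsystems: List[Subsystem],
                 handlers: Dict[str, Callable[[], None]]) -> None:
        # Poll in priority order so that the most urgent source is found first.
        self.subsystems = sorted(subsystems, key=lambda s: s.priority)
        self.handlers = handlers
        self.interrupts_enabled = True
        self.program_counter = 0

    def service_interrupt(self) -> Optional[str]:
        """Service one pending request; return its source name, or None."""
        if not self.interrupts_enabled:
            return None                        # interrupts masked or disabled
        saved_pc = self.program_counter        # save state for an orderly return
        source = next((s for s in self.subsystems if s.requesting), None)
        if source is None:
            return None                        # no subsystem was requesting
        self.handlers[source.name]()           # execute the service routine
        source.requesting = False              # the request has been dealt with
        self.program_counter = saved_pc        # resume the suspended task
        return source.name
```

If a fire detection loop and a low-priority device are both requesting service, repeated calls to `service_interrupt` deal with the fire detection loop first.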

[Block diagram: the CPU and several peripheral devices or subsystems are
connected to a shared information bus. Interrupt inputs from these, and from
other peripheral devices and subsystems, are applied to a priority encoder,
which passes a coded interrupt to the CPU.]

Figure 11.2.

Other, less important, fault conditions can be detected by generating one or
more status bytes in which each bit represents the signal from an input or output,
and then placing these bytes on the data bus where they can be read at regular
intervals by the CPU. Since a very large number of fault conditions could
potentially be present within a complex system, a 'look-up' table (LUT) can be
used to contain a set of 'signature' bytes for each fault condition. When a fault
is detected, this table is searched until a successful comparison is made and then
the necessary corrective action is taken (see Figure 11.3).

[Flowchart: set pointer to start of LUT; input diagnostic byte(s); compare
byte(s) with LUT entry; same? If NO, increment pointer and compare again. If
YES, display error message/take appropriate recovery action.]

Figure 11.3.
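
The status byte and the look-up table search can be sketched as follows. The bit assignments, the signature bytes and the recovery actions are invented for the example. The sketch also returns `None` when no entry matches, a case that the flowchart does not show.

```python
from typing import List, Optional, Tuple

# Hypothetical bit assignments within a single status byte.
STATUS_BITS = {0: "mains supply", 1: "battery supply",
               2: "main display", 3: "input range"}

# Hypothetical LUT: each entry pairs a signature byte with a recovery action.
FAULT_LUT: List[Tuple[int, str]] = [
    (0b00000001, "switch to battery supply"),
    (0b00000010, "warn: battery low or missing"),
    (0b00000011, "orderly shutdown: no power source"),
    (0b00000100, "switch to standby display"),
]


def faulty_signals(status_byte: int) -> List[str]:
    """Return the names of all signals whose bit is set in the status byte."""
    return [name for bit, name in STATUS_BITS.items() if status_byte & (1 << bit)]


def look_up(diagnostic_byte: int,
            lut: List[Tuple[int, str]] = FAULT_LUT) -> Optional[str]:
    """Search the LUT in order until a signature matches the diagnostic byte."""
    pointer = 0                              # set pointer to start of LUT
    while pointer < len(lut):
        signature, action = lut[pointer]
        if diagnostic_byte == signature:     # same? yes: take this action
            return action
        pointer += 1                         # no: increment pointer, try again
    return None                              # no match (not in the flowchart)
```

A linear search mirrors the flowchart. With a large table, a dictionary keyed on the signature byte would give the same result with a single lookup.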
Table 11.1 below shows how hardware and software methods compare:

Table 11.1

                    Advantages                    Disadvantages

Hardware methods    Simple. Does not require      Not easy to modify or
                    computer processing time.     reconfigure. More suited to a
                                                  limited number of fault
                                                  conditions.

Software methods    Easy to modify and            Requires programming
                    reconfigure. Easy to          expertise. Higher initial
                    implement in systems that     cost. Requires computer
                    are already computer-         processing time.
                    based. Can easily cope
                    with a large number of
                    fault conditions.

Combination of methods
Finally, it is possible to combine software and hardware methods and enjoy some
of the benefits of both hardware and software approaches. A typical example of
this is a 'watchdog controller' within a programmable logic controller (PLC)
system or as part of a bus-based industrial controller.

The watchdog controller usually comprises an interface card or module which
incorporates its own intelligent processor, interface logic, control program and
timer. The watchdog controller has shared access to all the system bus lines and
is therefore able to determine the status of the system at any point. The
controller is also able to alert the main CPU (or current bus 'master' processor
in a multiprocessing system) using one or more of the interrupt request (IRQ) or
attention request (ATNRQ) lines. When these lines are asserted by the processor
in the watchdog controller, the main CPU will suspend normal processing within
the main control program and start to execute instructions that will initiate
recovery.

The facilities available from a simple watchdog controller generally include:

• generating a status byte that is periodically read (typically every 1-2.5
  seconds) by the main control program. This status byte provides an indication
  of the current state of the system. If the status byte is not read within a
  predetermined period, the watchdog controller assumes that a fault condition
  has been encountered and the board takes appropriate action. Typical
  situations which result in the watchdog status byte not being read involve
  major system failures, attempts to access inoperative peripheral hardware (a
  hardware hang) or the software running in an uncontrolled infinite loop (a
  software hang).
• monitoring one or more of the power rails and generating appropriate signals
  when the voltage present fails to meet the defined tolerances for the rail
  concerned. Typical actions in the event of a low voltage being detected
  involve preserving important system variables before switching over to backup
  supplies or stand-by batteries.
• the ability to exploit the multiprocessing capability of a system by making
  use of independent processors and, where necessary, duplicate I/O circuitry
  attached to independent signal conditioning boards. Typical actions in the
  event of detecting I/O failure involve momentarily suspending operation of the
  main control program, treating current data values as invalid, and switching
  to other multiplexed I/O lines or differently addressed devices.
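
The first of these facilities, the status byte that must be read within a predetermined period, can be modelled with a simple timer. The sketch below is a software stand-in for what would normally be a separate card: the class name, the `on_timeout` callback and the status encoding (zero meaning 'no fault') are all assumptions.

```python
import time
from typing import Callable


class Watchdog:
    """Trips if the main program fails to read the status byte in time."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout     # hypothetical recovery action
        self.clock = clock               # injectable so the timer can be tested
        self.last_read = clock()
        self.tripped = False

    def read_status(self) -> int:
        """Called by the main control program; each read restarts the timer."""
        self.last_read = self.clock()
        return 0x00                      # assumed encoding: 0 means no fault

    def poll(self) -> None:
        """Called by the watchdog's own processor to check for a hang."""
        overdue = self.clock() - self.last_read > self.timeout_s
        if overdue and not self.tripped:
            self.tripped = True          # take the recovery action only once
            self.on_timeout()
```

A main program that reads the status byte every second or so never lets the timer expire. A hardware or software hang stops the reads, and the next `poll` after the timeout triggers recovery.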
