
1.10 Redundant and Voting Systems


T. J. WILLIAMS, E. J. SCHAFFER (1982) T. J. WILLIAMS (1995) A. ROHR (2003)

INTRODUCTION

Reliability in process control computing can be defined as the correct operation of a system up to a time t = T, given that it was operating correctly at the starting time t = 0.¹ However, correct operation can have many meanings, depending on the requirements previously established for the system. A common attitude today is that single or multiple failures can be accepted as long as the system does not go down or the desired operation is not interrupted or disturbed. Reliability is therefore a goal to be expected of a system and is set by the users.

To obtain a certain measure of reliability, the term fault-tolerant computing can be used. It may be defined as "the ability to execute specified algorithms correctly regardless of hardware errors and program errors."² Since different computers in different applications have widely different requirements for reliability, availability, recovery time, data protection, and maintainability, an opportunity exists for the use of many different fault-tolerant techniques.³

The understanding of fault tolerance can be helped by first understanding faults. A fault can be defined as "the deviation of one or more logic variables in the computer hardware from their design-specified values."¹ A logic value for a digital computer is either a zero or a one. A fault is the appearance of an incorrect value, such as a logic gate "stuck on zero" or "stuck on one." The fault causes an "error" if it, in turn, produces an incorrect operation of the previously correctly functioning logic elements. Therefore, the term fault is restricted to the actual hardware that fails.

Faults can be classified in several ways.⁴ Their most important characteristic is their duration: they can be either permanent (solid or "hard") or transient (intermittent or "soft"). Permanent faults are caused by solid failures of components. They are easier to diagnose but usually require the use of more drastic correction techniques than do transient faults. Transient faults cause 80 to 90% of faults in most systems.⁵ Transient faults, or intermittents, can be defined as random failures that prevent the proper operation of a unit for only a short period of time, not long enough to be tested and diagnosed as a permanent failure. Often, transient faults become permanent with further deterioration of the equipment. Then, permanent fault-tolerant techniques must be used for system recovery.

The goal of system reliability, or of fault-tolerant computing, is therefore either to prevent faults or to be able to recover from them and continue correct system operation. This also includes immunity to software faults induced into the system. To achieve high reliability, it is essential that component reliability be as high as possible. "As the complexity of computer systems increase, almost any level of guaranteed reliability of individual elements becomes insufficient to provide a satisfactory probability of successful task completion."⁶ Therefore, successful fault-tolerant computers must use a judicious selection of protective redundancy to help meet the reliability requirements. The three redundancy techniques are as follows:

1. Hardware redundancy
2. Software redundancy
3. Time redundancy

These three techniques cover all methods of fault tolerance. Hardware redundancy can be defined as any circuitry in the system that is not necessary for normal computer operation should no faults occur. Software redundancy, similarly, is additional program instructions present solely to handle faults. Any retrial of instructions is known as time redundancy.

Hardware Redundancy

Hardware redundancy can be described as the set of all hardware components that need to be introduced into the system to provide fault tolerance with respect to operational faults.¹ These components would be superfluous should no faults occur, and their removal would not diminish the computing power of the system in the absence of faults.

In achieving hardware fault tolerance, it is clear that one should use the most reliable components available.⁷ However, increasing component reliability has only a small impact on increasing system reliability. Therefore, it is "more important to be able to recover from failures than to prevent them."⁸ Redundant techniques allow recovery and are thus very important in achieving fault-tolerant systems. The techniques used in achieving hardware redundancy can be divided into two categories: static (or masking) redundancy and dynamic redundancy.


Static techniques are effective in handling both transient and permanent failures. Masking is virtually instantaneous and automatic. It can be defined as any computer error correction method that is transparent to the user and often to the software. Redundant components serve to mask the effect of hardware failures of other components.

Many different techniques of static redundancy can be applied. The simplest, or lowest level of complexity, is a massive replication of the individual components of the system.¹ For example, four diodes connected as two parallel pairs that are themselves connected in series will not fail if any one diode fails "open" or "short." Logical gates in similar quadded arrangements⁹,¹⁰ can also guard against single faults, and even some multiple faults, for largely replicated systems.

More sophisticated systems use replication at higher levels of complexity to mask failures. Instead of using a mere massive replication of components configured in fault-tolerant arrangements, identical nonredundant computer sections or modules can be replicated and their outputs voted upon. Examples are triple modular redundancy (TMR) and more massive modular redundancy (NMR), where N can stand for any odd number of modules.
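The voter at the heart of TMR is simple enough to sketch in software. The fragment below is an illustration added for this section (the function name is hypothetical, not from the original text): a bitwise 2-out-of-3 majority vote over the outputs of three identical modules, which masks any single faulty module without interrupting operation.

    def tmr_vote(a: int, b: int, c: int) -> int:
        """Bitwise 2-out-of-3 majority vote over three module outputs.

        Each result bit is 1 wherever at least two of the three inputs
        agree on 1, so a single faulty module is masked instantaneously.
        """
        return (a & b) | (b & c) | (a & c)

    # One module with a stuck-on-one bit is outvoted by the other two:
    good = 0b10110010
    faulty = good | 0b00001000
    assert tmr_vote(good, faulty, good) == good

The same majority expression generalizes to NMR with any odd N, at the cost of a wider voting element.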
In addition to component replication, coding can be used to mask faults as well as to detect them. With the use of some codes, data that has been garbled (i.e., bits changed due to hardware errors) can sometimes be recovered instantaneously with the use of redundant hardware. Dynamic recovery methods are, however, better able to handle many of these faults.

Higher levels of fault tolerance can be achieved more easily through dynamic redundancy, implemented through the dual actions of fault detection and recovery. This often requires software help in conjunction with hardware redundancy. Many of these methods are extensions of static techniques.

Massive redundancy in components can often be better utilized when controlled dynamically. Redundant modules, or spares, can have a better fault tolerance when they are left unpowered until needed, since they will not degrade while awaiting use.¹¹ This technique, standby redundancy, often uses dynamic voting techniques to achieve a high degree of fault tolerance. The union of the two methods is referred to as hybrid redundancy.¹² With this technique, additional hardware is needed to detect and switch out faulty modules and to switch in good spares.

Error detecting and error correcting codes¹³ can be used to achieve fault tolerance in a computing system dynamically. Coding refers to the addition of extra bits to, and the rearranging of the bits of, a binary word that contains information. The strategy of coding is to add a minimum number of check bits to the message in such a way that a given degree of error detection or correction is achieved.⁴ Error detection and correction are accomplished by comparing the new word, hopefully unchanged after transmission, storage, or processing, with a set of allowable configurations of bits. Discrepancies discovered in this manner signal the existence of a fault, which can sometimes be corrected if enough of the original information remains intact.

This means that the original binary word can be reconstructed with some such codes if a set number of bits in the coded word have not changed. Encoding and decoding words with the use of redundant hardware can be very effective in detecting errors. Through hardware or software algorithms, incorrect data can also often be reconstructed. Otherwise, the detected errors can be handled by module replacement and software recovery actions. The actions taken depend on the extent of the fault and on the recovery mechanisms available to the computing system.
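As a concrete sketch of check bits (an illustration added here, not taken from the original text), the classic Hamming(7,4) code guards four data bits with three parity bits. Comparing the received word against the allowable configurations then reduces to recomputing the parities; the resulting syndrome points directly at any single flipped bit.

    def hamming74_encode(d1: int, d2: int, d3: int, d4: int) -> list[int]:
        """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4], even parity."""
        p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(w: list[int]) -> list[int]:
        """Recompute the parities; the syndrome is the 1-based error position."""
        s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
        s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
        s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
        pos = s1 + 2 * s2 + 4 * s3
        if pos:                     # nonzero syndrome: flip the offending bit
            w = w.copy()
            w[pos - 1] ^= 1
        return w

    word = hamming74_encode(1, 0, 1, 1)
    garbled = word.copy()
    garbled[4] ^= 1                 # one bit changed by a hardware error
    assert hamming74_correct(garbled) == word

Three check bits thus buy single-error correction on a 7-bit word, illustrating the "minimum number of check bits" trade-off described above.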
Software Redundancy

Software redundancy refers to all additional software installed in a system that would not be needed for a fault-free computer. Software redundancy plays a major role in most fault-tolerant computers. Even computers that recover from failures mainly by hardware means use software to control their recovery and decision-making processes. The level of software used depends on the recovery system design. The recovery design depends on the type of error or malfunction that is expected. Different schemes have been found to be more appropriate for the handling of different errors. Some can be accomplished most efficiently solely by hardware means. Others need only software, but most use a mixture of the two.

For a functional system, i.e., one without hardware design faults, errors can be classified into two varieties: (1) software design errors and (2) hardware malfunctions.

The first category can be corrected mainly by means of software. It is extremely difficult to design hardware that corrects for programmers' errors. Software methods, though, are often used to correct hardware faults, especially transient ones. The reduction and correction of software design errors can be accomplished through the techniques outlined below.

Computers may be designed to detect several software errors.¹⁴,¹⁵ Examples include the use of illegal instructions (i.e., instructions that do not exist), the use of privileged instructions when the system has not been authorized to process them, and address violations. The latter refers to reading or writing into locations beyond usable memory. These limits can often be set physically on the hardware. Computers capable of detecting these errors allow the programmer to handle them by causing interrupts. The interrupts route the program to specific locations in memory. The programmer, knowing these locations, can then add code to branch to specific subroutines, which handle each error in a specified manner.


Software recovery from software errors can be accomplished via several methods. As mentioned before, parallel programming, in which alternative methods are used to determine a correct solution, can be used when an incorrect solution can be identified. Some less sophisticated systems print out diagnostics so that the user can correct the program off line from the machine. This should only be a last resort for a fault-tolerant machine. Nevertheless, a computer should always keep a log of all errors incurred, memory size permitting.

Preventive measures used with software methods refer mainly to the use of redundant storage. Hardware failures often result in a garbling or a loss of data or instructions that are read from memory. If hardware techniques such as coding cannot recover the correct bit pattern, those words become permanently lost. Therefore, it is important to at least duplicate all necessary program and data storage so that the information can be retrieved if one copy is destroyed. In addition, special measures should be taken so that critical programs, such as error recovery programs, are placed in nonvolatile storage, i.e., read-only memory. Critical data as well should be placed in nondestructive readout memories. An example of such a memory is a plated-wire memory.

The second task of the software in fault tolerance is to detect and diagnose errors. Software error-detection techniques for software errors can often be used to detect transient hardware faults. This is important, since "a relatively large number of malfunctions are intermittent in nature rather than solid failures."⁹ Time-redundant processes, i.e., repeated trials, can be used for their recovery.
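In software, such time redundancy is little more than a retry loop. The sketch below is an illustration with hypothetical names, not taken from the references: it repeats a failing operation a few times, on the assumption that a transient fault will not recur on the next trial, and escalates when the failure persists.

    def with_retrials(operation, trials: int = 3):
        """Run `operation`, repeating it on failure up to `trials` times.

        A transient fault is unlikely to strike twice in a row, so a
        successful retrial recovers the computation; persistent failure
        suggests a solid fault needing diagnosis and repair.
        """
        last_error = None
        for _ in range(trials):
            try:
                return operation()
            except Exception as err:   # stand-in for a detected error or trap
                last_error = err
        raise RuntimeError("persistent failure: suspect a solid fault") from last_error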
Software detection techniques do not localize the sources of the errors. Therefore, diagnostic test programs are frequently implemented to locate the module or modules responsible. These programs often test the extent of the faults at the time of failure or perform periodic tests to determine malfunctions before they manifest themselves as errors during program execution. Almost every computer system uses some form of diagnostic routines to locate faults. In a fault-tolerant system, the system itself initiates these tests and interprets their results, as opposed to the outside insertion of test programs by operators in other systems.

Fault-Tolerant Computer System Design

The design of a fault-tolerant industrial computer system should be different from that of a similar system for a spaceborne computer system. Maintenance is available in an industrial environment to replace any modules that may have failed. In addition, the system may be much larger, and a hierarchy of many computers of different sizes may be necessary to handle the various operations.¹⁶ Therefore, a fault-tolerant communication network may be required as well.

Future advancements must also be considered when studying the design of fault-tolerant systems. Throw-away processors on a chip are no longer a myth, and the manufacture of equivalent inexpensive mass memory is probably already here as well.⁶ Valid future designs must incorporate provisions for these advances and allow for larger replacement modules for quicker and simpler fault location and maintenance.

The ways in which faults manifest themselves have not changed.¹⁷ They may be summarized as the following:

1. Intramodule data errors
2. Intermodule data transfer errors
3. Address errors
4. Control signal errors
5. Power failure
6. Timing failure
7. Reconfiguration faults

The two main designs considered here are that of a duplex system, with two identical computers operating in parallel, and that of a triplex system (see Figure 1.10a). The triplex system has three computers operating synchronously. In addition to those error detecting and correcting capabilities already built into the computers, fault-tolerant features will be present in software for both systems. The duplex system will feature a comparison of data for fault detection, with rollback and recovery to handle transient errors. The triplex system will incorporate a software voting scheme with memory reload to recover from transient failures; this removes the overhead of rollback. Each duplicated system of computers will communicate internally via a parallel data bus that allows high-speed communication, plus a parallel control bus that initiates interrupts to handle any faults within the system. All computer elements will communicate with higher-level systems, through the protocol microprocessor, via a full-duplex synchronous serial bit bus, a bus that permits simultaneous message transfer in both directions. With these components, a fully reliable system should be realized.

[FIG. 1.10a  Comparison of duplex and triplex redundancy systems: a duplex (master/slave) pair and a triplex set (A, B, C), each on a parallel bus with a protocol processor and associated interface hardware, linked by a serial bit line to the supervisor and to other redundant systems.]
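The duplex scheme just described can be summarized in a few lines. The sketch below was composed for this section (the names and the dictionary-based checkpoint are illustrative assumptions, not the actual system's interfaces): both machines execute the same step, the outputs are compared, and a mismatch triggers rollback and retrial.

    def duplex_step(step, state, max_rollbacks: int = 3):
        """One control step on a duplex pair: compute twice, compare, roll back.

        `step` computes an output from a copy of the checkpointed state;
        calling it twice stands in for the two physical machines. A
        mismatch triggers rollback and retrial, which clears transient
        errors; a persistent mismatch is escalated for diagnosis.
        """
        for _ in range(max_rollbacks):
            checkpoint = dict(state)         # saved rollback point
            out_a = step(dict(checkpoint))   # machine A
            out_b = step(dict(checkpoint))   # machine B
            if out_a == out_b:
                return out_a                 # agreement: commit and continue
        raise RuntimeError("persistent disagreement between duplex machines")

The triplex variant replaces the comparison with a 2-out-of-3 vote, trading the rollback overhead for a third machine, as noted above.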


FIELD INSTRUMENT REDUNDANCY AND VOTING

The above concepts apply not only to process computers but also to basic process control systems (BPCSs) and safety instrumented systems (SISs), where they also improve performance, availability, and reliability. In the case of field instruments and final control elements, they mainly guarantee continuity of operation and increase uptime, whereas, in SIS systems, they minimize nuisance or spurious interventions and alarms.

The techniques used in BPCS and SIS systems are similar and were initially developed for the inherently more demanding SIS applications. For SIS systems, the need for international regulations has been recognized (ANSI/ISA-84.01-1996,¹⁸ IEC 61508-1998/2000,¹⁹ and IEC 61511,²⁰ in draft version) while, for non-safety-related control loops, this is left to good engineering practice. Therefore, the discussion of redundancy and voting techniques, as applied to the field instruments of BPCS systems, will be based on the SIS standards as guidelines. The BPCS goal is to improve control loop availability such that the trigger point for the intervention of the associated SIS system is unlikely ever to be reached. Thereby, redundancy in BPCS also improves safety. This is because increased availability reduces the number of shutdowns, which tend to shorten the life of the plant due to the resulting thermal and other stresses.

One of the main objectives of measurement and control specialists is to improve the availability and accuracy of measurements. To achieve that goal and to minimize systematic uncertainty while increasing reliability, correct specification, instrument selection, and installation are essential.

Assuming that the transmitters have been properly specified, selected, and installed, one can further improve total performance by measuring the same variable with more than one sensor. Depending on the importance of the measurement, redundancy can involve two or more detectors measuring the same process variable. When three or more sensors are used, one can select the "majority view" by voting. With this approach, one would select m measurements out of the total n number of signals such that m > n/2. In industrial practice, n is normally 3, so that m is 2.
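A minimal sketch of this m-out-of-n selection (illustrative code written for this section, with an assumed concurrency tolerance) looks as follows:

    from itertools import combinations

    def majority_view(readings: list[float], tol: float) -> float:
        """Mean of the largest subset of mutually concurrent readings,
        accepted only if that subset is a strict majority (m > n/2).

        Readings are "concurrent" when their spread is within `tol`;
        with n = 3 this reduces to the usual 2-out-of-3 vote.
        """
        n = len(readings)
        for size in range(n, n // 2, -1):          # try m = n, n-1, ..., > n/2
            for subset in combinations(readings, size):
                if max(subset) - min(subset) <= tol:
                    return sum(subset) / size
        raise ValueError("no majority of concurrent readings")

    # 2-out-of-3 in practice: one transmitter drifting high is outvoted.
    print(majority_view([50.1, 50.3, 61.0], tol=1.0))   # about 50.2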
The redundant and voting techniques have been standardized in various SIS-related documents, including ANSI/ISA-84.01, IEC 61508, and IEC 61511. The SIS systems usually evaluate on–off signals or threshold limits of analog signals whereas, in process control, redundancy and voting are obtained by the evaluation of multiple analog signals. The main difference between BPCS and SIS systems is that the SIS is a "dormant," but continuously self-checking, system that is called upon to operate only in an emergency. In addition, the SIS is fail safe; i.e., if it fails, it brings the plant to a safe status. SIS malfunctioning is inferred from diagnostic programs and not from plant conditions, because the plant cannot be shut down or brought to unsafe conditions just to test the SIS system. All international regulations follow this approach.

In contrast to SIS systems, the BPCS control loops are always active and, if they malfunction, they actuate alarms, which the operator immediately notices. The consequence is that the SIS-based definitions developed in IEC 61508 can, to some extent, also be used as guidelines for control loops that require high uptime and whose unavailability would, within a short time, drive the plant to conditions requiring plant shutdown.

IEC 61508 Part 6 gives the definition of the various architectures most commonly used in safety instrumented systems. They apply for use with one, two, or three elements and their various combinations. The elements used in a single or multiple configuration can be either transmitters or final control elements, but these configurations are applied mainly to transmitters and only very rarely to control valves, because of the substantial difference in costs. The control system, such as a DCS system, is usually configured with multiple controllers and redundant other system components (e.g., system bus, I/O bus, HMI). IEC 61508 considers and gives definitions to the configurations described below.


Single-Transmitter Configuration (Figure 1.10b)

1oo1  A single transmitter is used, as in many control loops. These loops consist of an analog transmitter and an analog controller (pneumatic or electronic). This configuration is the most prone to overall malfunctioning. Errors and failures can be caused by a sticking control valve or transmitter or by an out-of-range signal (up or down scale). In these loops, diagnostic protection is very limited. Remember, in the past, the burn-out feature of thermocouples was almost the only diagnostic function implemented in the mV/psi converters or mV/mA transducers.

1oo1D  A single transmitter is used, with diagnostic coverage integral to the transmitter (e.g., self-validating transmitters²¹,²²) and/or external in the control system.

[FIG. 1.10b  1oo1/1oo1D transmitter input: a single transmitter T on an AI channel of a multiloop controller on the system bus, alongside the loop's DI, DO, and AO channels.]

Two-Transmitter Configuration (Figure 1.10c)

1oo2  Two transmitters in parallel are used; the failure of one determines the loss of control. In principle, this definition cannot be borrowed from IEC 61508.

1oo2D  Two transmitters in parallel are used, with diagnostic coverage mainly residing in the control system. The types of diagnostic functions will be covered afterward.

2oo2  Two transmitters in parallel are used. The loss of control should be determined by the failure of both. In principle, this definition cannot be borrowed from IEC 61508.

[FIG. 1.10c  2oo2D transmitter input: transmitters Tx and Ty on separate AI channels of the multiloop controller.]

Three-Transmitter Configuration (Figure 1.10d)

2oo3  Three transmitters in parallel are used. The concurrent value indicated by two of them is assumed to be correct and representative of the process conditions. Concurrency means that they differ by no more than X%.

[FIG. 1.10d  2oo3 transmitter input: transmitters Tx, Ty, and Tz on separate AI channels of the multiloop controller.]

Diagnostic Coverage

The diagnostic coverage in the BPCS is much less than in the SIS, for the reasons outlined previously, and is provided mainly in and by the DCS, which has the capability of comparing the signals received from the transmitters and determining whether they are within the imposed limits so as to consider them concurrent. If an inconsistency is detected, the DCS is capable of signaling the abnormal situation and of maintaining control, at least in some instances, without operator intervention.

1oo1D  The diagnostic coverage can be partly integral to the transmitter and/or external in the control system (rate-of-change alarms, overrange alarms detecting the individual fault). In a broader sense, in addition, the material balance (data reconciliation) performed in the DCS can contribute to detecting a failure in the flow transmitters or their unreliable readings.


1oo2D  The signal from each transmitter is checked to verify whether it is within the validity limits (i.e., 4 to 20 mA). If a transmitter is outside the validity range, its signal is discarded, the controller receives the value from the other transmitter, and an alarm is issued to warn the operator about the malfunction. If both transmitters are within validity limits, the difference between their signals is calculated. If the difference is within a preset value (in the range of a few percent), the average value is assumed to be good and is used for the control function (Figure 1.10e).

The acceptable discrepancy between the two transmitters depends on the measurement conditions; for instance, the acceptable discrepancy in the level measurement of a steam drum is larger than in the case of a pressure measurement. As an indication, for two level transmitters installed at different ends of the steam drum, 5% discrepancy is acceptable. However, for pressure measurement, 2% should not be exceeded. Normally, in the process industry, it is not necessary to select a very small discrepancy (such as twice the declared accuracy) between the transmitted values, because the difference could be the result of many causes other than a transmitter failure or the need for recalibration (the main reason could be the installation). Sometimes, a common percentage discrepancy value is chosen and used for all measures, because experience has shown that it is unlikely that a transmitter fails to a value close to the correct one.

When the discrepancy is beyond the preset value but both signals are within validity limits, it is not possible to determine which one is invalid. In this case, an alarm is produced, and the controller is automatically forced to manual, with the output frozen at the last valid value. The operator then has the responsibility to discard one of the two transmitters, use the other as the input to the controller, and switch to auto again.

[FIG. 1.10e  1oo2D signal conditioning: Tx and Ty feed the multiloop controller; if ∆ < ∆0, their average drives the control valve; if not, an alarm is raised and the controller is forced to manual with its output frozen.]
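The 1oo2D conditioning of Figure 1.10e can be condensed into a small function. The following is a sketch written for this section (the names are illustrative, and the branch where both signals are invalid is our extrapolation of the same force-to-manual action), not a vendor's control block:

    def conditioner_1oo2d(x: float, y: float, delta0: float, last_good: float):
        """1oo2D signal conditioning, after Figure 1.10e.

        `x` and `y` are the transmitter signals in mA on a 4 to 20 mA
        range; `delta0` is the preset discrepancy. Returns a tuple
        (pv, alarm, force_manual).
        """
        def valid(s: float) -> bool:
            return 4.0 <= s <= 20.0

        if valid(x) and valid(y):
            if abs(x - y) <= delta0:
                return (x + y) / 2.0, False, False     # concurrent: use average
            return last_good, True, True               # ambiguous: freeze, go manual
        if valid(x) or valid(y):
            return (x if valid(x) else y), True, False # one invalid: alarm, keep control
        return last_good, True, True                   # both invalid: freeze, go manual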
2oo3  The signal from each transmitter is checked to verify whether it is within the validity limits (i.e., 4 to 20 mA). If a transmitter is outside the validity range, its signal is discarded as invalid, and the remaining two are used as if they were in a 1oo2D configuration. If no invalid signal is detected, the discrepancy between the values is calculated. Supposing the three signals are X, Y, and Z, the differences X − Y, Y − Z, and Z − X are calculated. If each of them is within the preset limits, the median value is taken as good and used as the process variable by the controller. If one difference exceeds the preset limit, an alarm is issued to the operator, and the median value is used as the process variable for the controller. If two differences exceed the preset limit, the value of the transmitter involved in both of the excessive differences is discarded, an alarm is issued to the operator, and the average value of the remaining two is used as the process value. If all three differences exceed the preset limit, at least two transmitters are not reliable. In this case, the controller is automatically forced to manual, with the output equal to the last valid value (Figure 1.10f).

[FIG. 1.10f  2oo3 signal conditioning: Tx, Ty, and Tz feed a computing module ahead of the multiloop controller; the three comparisons ∆ < ∆0 (A, B, C, corresponding to the differences X − Y, Y − Z, and Z − X) drive the truth table below.]

TRUTH TABLE (YES = difference within the preset limit ∆0)

  A                              YES      YES      YES      NO
  B                              YES      YES      NO       NO
  C                              YES      NO       NO       NO
  Computing module output        Median   Median   Average  None
  Alarm                          NO       YES      YES      YES
  Discard                        None     None     Tz       Tx, Ty, Tz
  Force contr. output to man.    NO       NO       NO       YES


The operator has the responsibility to select one of the three transmitters as the good one, use it as input to the controller, and switch to auto again. There are some possible variations in the algorithms used for the selection of the valid signals and the discarding of the unreliable ones, and they depend on the available control blocks of the involved DCS.
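In sketch form (illustrative code written for this section, following the algorithm and truth table above; a real system would be built from the DCS control blocks just mentioned), the 2oo3 conditioning becomes:

    import statistics

    def conditioner_2oo3(signals: list[float], delta0: float, last_good: float):
        """2oo3 signal conditioning, after Figure 1.10f and its truth table.

        `signals` holds the three transmitter values (X, Y, Z) in mA on a
        4 to 20 mA range. Returns (pv, alarm, force_manual, discarded).
        """
        valid = [i for i, s in enumerate(signals) if 4.0 <= s <= 20.0]
        if len(valid) < 3:                        # invalid signal(s): degrade
            bad = [i for i in range(3) if i not in valid]
            if len(valid) == 2:                   # behave as 1oo2D on the survivors
                a, b = (signals[i] for i in valid)
                if abs(a - b) <= delta0:
                    return (a + b) / 2.0, True, False, bad
            return last_good, True, True, bad

        pairs = [(0, 1), (1, 2), (2, 0)]          # X-Y, Y-Z, Z-X
        exceeded = [p for p in pairs
                    if abs(signals[p[0]] - signals[p[1]]) > delta0]
        if len(exceeded) <= 1:                    # median; alarm if one pair exceeds
            return statistics.median(signals), bool(exceeded), False, []
        if len(exceeded) == 2:                    # the common transmitter is bad
            bad = (set(exceeded[0]) & set(exceeded[1])).pop()
            rest = [s for i, s in enumerate(signals) if i != bad]
            return sum(rest) / 2.0, True, False, [bad]
        return last_good, True, True, [0, 1, 2]   # all exceed: freeze, go manual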

Engineering Redundant Measures

When redundant measures are performed, attention shall be given to avoiding common-mode failures. A possible common-mode cause of inaccuracy in flow measurement is the primary flow element, which can be erroneously calculated, wrongly installed, or show wear after extended operation. This situation is not easily corrected but can be detected by material balance and/or maintenance (i.e., by checking the size of the orifice or throat and the sharpness of the orifice edge).

Starting from the tapping point on the process, the most usual common-mode failures are examined. Two or three pressure transmitters connected to the same valved branch on a pipe or vessel are prone to common-mode failure, because the closing or clogging of the root valve puts all transmitters out of service. Such an installation is therefore to be avoided.

The same multicore cable containing the 4 to 20 mA signals of redundant transmitters is a possible cause of common-mode failure because, if it is cut, all the signals are lost at the same time. Therefore, the signals should be contained in different cables, and the cables themselves should be routed on different cable racks.

The same input card receiving the signals from redundant transmitters is a possible cause of common-mode failure because, if even a single channel of the input card fails (even one different from the channels under consideration) and the card is then replaced, all the transmitter inputs are lost simultaneously. It is appropriate to connect the redundant transmitters to different input cards.

If the multiloop controller in the DCS is not redundant (or fault tolerant), the input cards connected to the redundant transmitters should belong to different multiloop controllers to avoid common-mode faults. It is, however, highly preferable to use redundant multiloop controllers (rather than two independent nonredundant multiloop controllers) because of the increased traffic in the system bus and the loss of the loop consequent to a system bus failure.

Because all the transmitters are powered from the DCS, redundant power supplies to the DCS modules should be provided, coming from different sources.

For an exhaustive evaluation of failure modes, Tables B.5.1 and B.5.2 of ANSI/ISA-84.01-1996 provide a useful guideline, even though they cover the SIS.

These simple examples clearly indicate that it is not sufficient to duplicate or triplicate the transmitters and that, to obtain the best possible results, the complete engineering of the system shall be carried out correctly. However, it is necessary to evaluate properly to what extent common-mode failures must be avoided, as a function of the criticality of the application. If, for instance, all the cables are installed on the same tray, distributing the redundant transmitters among different cables does not improve the situation; it would be unusual for only one cable to break while the others remain in operation.

Complex Control Loops

In the power generation industry, two or three transmitters in redundant or voting configuration are commonly used for all control loops. As boiler control requires complex loops, a question to be answered is to what extent it can be justified that the transmitters be duplicated or triplicated.

Consider the steam drum level control, and suppose that duplicated measurements are the contractual requirement. Of course, the level transmitters and the relevant steam and water flow transmitters are duplicated to allow for a fully redundant three-element control. What about the pressure transmitter used to compensate the level measurement for density? If the boiler is operating at constant pressure, a single transmitter should be sufficient. If it fails, the last valid value could be frozen and used, obviously sending an alarm to the operator. If the boiler is operating in sliding pressure, then the pressure transmitter must also be duplicated.

Consider the steam flow compensation for pressure and temperature. If the flow transmitter is duplicated, the pressure and temperature transmitters do not always require duplication; if these variables are allowed to vary only slightly and the measure is not used for custody transfer, it should be sufficient to freeze the last valid value for compensation, with a warning to the operator. But, again, if the boiler is operating in sliding pressure mode or could possibly have a variable steam temperature, then the transmitters should be duplicated.

Notwithstanding these considerations, if a plant is supplied as a turnkey operation, the specifications uniformly will call for redundancy without closely examining the real need for duplication or triplication of the transmitters.

Final Control Elements

In very rare instances, the final control elements can be duplicated, as when the erosive/corrosive or sticking characteristics of the fluid could cause unacceptable downtime. It should be noted that the need for duplicating the valves is more stringent when the valves are more sophisticated and therefore more expensive, so justifying this duplication is often difficult. The price of a critical valve can be 10 to 20 times the price of a transmitter (and on occasion, even more).

If duplicated, the control valves do not operate in parallel; one is in "cold standby," so the analogy with redundant transmitters is not exact.

The possible configurations of the final control elements are as listed below:


1oo1  A single control valve is used, as in all typical control loops. This configuration covers the majority of the applications, being the simplest; however, a valve malfunction (e.g., sticking) could be detected, with some time delay, as a drift in process variables caused by the incorrect positioning of the trim.

1oo1D  A single valve is used, with some diagnostic coverage integral with the electropneumatic positioner. This is a check of the valve's actual position against the required one, and a verification that the dynamic response of the valve has not changed over time. This function is made possible by smart positioners, which generate a position signal that is transferred to the control system as feedback on valve behavior. The DCS compares the requested position with the actual one and gives an alarm if the discrepancy becomes excessive or does not tend to zero in a reasonable time. The value of the discrepancy being alarmed is somewhat higher than between two transmitters and is considered acceptable if it is about 5 to 7%, provided that this offset tends to zero in time.

For on–off control valves, the diagnostic function can occasionally command the valve to move briefly and slightly from the current condition, performing only a partial travel, to determine whether the trim is stuck and to prevent it from sticking. The result is monitored, and any malfunctioning sets off an alarm. This partial movement, however, shall be compatible with the process characteristics.

1oo2D  Two valves in cold standby are used, with diagnostic coverage mainly residing in the DCS. The diagnostics help to determine if and when the switch from the main valve to the spare one must be performed. To avoid encountering a faulty spare valve, it is necessary to exercise it regularly so as to achieve awareness of, and confidence in, its operability.

If the problem is associated with erosion or corrosion, the Cv of the valve increases at low lift. However, if the flow remains within the operational values, it is not easy to detect a problem. Only when the valve is almost closed and the passing flow creates process upsets can the erosion/corrosion situation be detected, unless a correlation between flow and valve opening can be established and logged over time.

An example of a different application that always involves redundancy of the final control element is the stroke control of dosing pumps. In boiler applications, the chemical dosing pumps are normally redundant and in cold standby because of the criticality of the application, while accuracy is not a major requirement (in some cases, the stroke adjustment is manual). If the stroke is controlled from the DCS, it is important that both pumps be controlled, but control redundancy is not essential. The solution can be a simple analog output feeding, in series, both of the I/P converters whose pneumatic output adjusts the pump stroke. The output to the stopped pump has no effect. If one I/P converter fails in short circuit, the signal through the other keeps flowing correctly. It is sufficient that the pump in operation is the one controlled by the healthy I/P converter, and the plant will have no problems. To prevent the stroke from going to zero if one I/P converter fails open, it is sufficient to install a zener diode on the terminals of each I/P converter to guarantee signal flow continuity (Figure 1.10g). In this way, if all the equipment is healthy, the 4 to 20 mA signal does not flow through the zener diodes. However, if a converter fails open, the current flows through the parallel diode without any deleterious effect.

[FIG. 1.10g  Single output to two I/P converters: one current loop feeds both converters in series, each with a zener diode (Vi = 6.2 V) across its terminals to maintain loop continuity should a converter fail open.]

Availability Considerations

Availability can be defined as "the probability that a system is operating at time t."²³ For control loops, availability is the major parameter, in that high availability means continuity of operation. If a single component (e.g., a transmitter) is used, its failure means that the control loop is no longer operable in automatic mode until the component has been repaired or replaced. This can require several hours and could involve a plant shutdown, which requires additional time for restarting, with consequent loss of production.

By duplicating the transmitter, the failure probability is doubled, but the measure becomes almost immediately available again (in terms of minutes), and the probability that the second transmitter will fail during the few hours needed to replace or repair the failed one is really close to zero. With the technique explained above, the duration of unavailability corresponds to the time needed for the operator to select the transmitter he trusts most and assign it as the process variable to the controller.

By triplicating the transmitters, the failure probability is tripled, but even the short period of unavailability of the measure is eliminated. The control system is able to automatically calculate, from the set of available measurements, the one that best represents the process.


Of course, these qualitative statements are true only if the maintenance activity is immediate and the faulty equipment is repaired or substituted within the shift or the day. The triple transmitters are less sensitive to the faulty operation of one (or even two) units, because it is possible for the operator to select the remaining healthy transmitter and use it as the process variable of the control loop.

These qualitative considerations can be studied quantitatively with statistical methods based on the fact that failures are random variables. Further information is available in the specialized literature.²³,²⁴
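As a taste of that quantitative treatment, the back-of-the-envelope estimate below (assumed round figures, composed for this section rather than taken from the cited references) computes the chance that the second transmitter of a duplicated pair fails inside the repair window; as stated above, it is really close to zero.

    import math

    # Assumed round numbers, for illustration only.
    mtbf_hours = 50.0 * 8760.0       # transmitter MTBF, here taken as 50 years
    repair_hours = 8.0               # time to repair/replace the failed unit
    lam = 1.0 / mtbf_hours           # constant (random) failure rate per hour

    # Exponential failure law: probability that the surviving transmitter
    # also fails before the first one is back in service.
    p_second_failure = 1.0 - math.exp(-lam * repair_hours)
    print(f"{p_second_failure:.1e}")  # about 1.8e-05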
References

1. Avizienis, A., Architecture of fault-tolerant computer systems, Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, June 1975, IEEE Computer Society, New York, 3–16.
2. Avizienis, A., Fault-tolerant computing—an overview, Computer, 4(1), 5–8, 1971.
3. Wensley, J. H., Levitt, K. N., and Neumann, P. G., A comparative study of architectures for fault-tolerance, Dig. 1974 Int. Symp. Fault-Tolerant Computing, Urbana, IL, IEEE Computer Society, New York, 4-16 to 4-21, 1974.
4. Schaeffer, E. J. and Williams, T. J., An Analysis of Fault Detection, Correction and Prevention in Industrial Computer Systems, Report No. 106, Purdue Laboratory for Applied Industrial Control, Purdue University, West Lafayette, IN, 1977.
5. Tasar, O. and Tasar, V., A study of intermittent faults in digital computers, AFIPS Conf. Proc., Vol. 46, AFIPS Press, Montvale, NJ, 807–811, 1977.
6. Short, R. A., The attainment of reliable digital systems through the use of redundancy—a survey, IEEE Computer Group News, 2(2), 2–17, 1968.
7. Lyons, R. E. and Vanderkulk, W., The use of triple modular redundancy to improve computer reliability, IBM J. Res. Dev., 6(2), 200–209, 1962.
8. Wulf, W. A., Reliable hardware/software architecture, IEEE Trans. Software Eng., SE-1 (2), 233–240, 1975.
9. Higgins, A. N., Error recovery through programming, AFIPS Conf. Proc., Vol. 33, AFIPS Press, Montvale, NJ, 39–43, 1968.
10. Tryon, J. G., Quadded logic, in Redundancy Techniques for Computing Systems, Spartan Press, Washington, DC, 205–228, 1962.
11. Losq, J., Influence of fault-detection and switching mechanisms on reliability of stand-by systems, Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France, June 1975, IEEE Computer Society, New York, 81–86.
12. Mathur, F. P. and Avizienis, A., Reliability analysis and architecture of a hybrid-redundant digital system: generalized triple modular redundancy with self-repair, AFIPS Conf. Proc., Vol. 36, AFIPS Press, Montvale, NJ, 375–383, 1970.
13. Szygenda, S. A. and Flynn, M. I., Coding techniques for failure recovery in a distributive modular memory organization, AFIPS Conf. Proc., Vol. 38, AFIPS Press, Montvale, NJ, 459–466, 1971.
14. Texas Instruments, Inc., Model 990/10 Computer System Hardware Reference Manual, Manual No. 945417-9701, Austin, TX, August 1977.
15. Digital Equipment Corp., Specification for DDCMP Digital Data Communications Message Protocol, 3rd ed., Maynard, MA, 1974.
16. Brosius, J. P. Jr. and Russell, B. J., Fault-tolerant plated wire memory for long duration space missions, Dig. 1973 Int. Symp. Fault-Tolerant Computing, Palo Alto, CA, June 1973, IEEE Computer Society, New York, 33–38.
17. Stiffler, J. J., Architectural design for near-100% fault coverage, Dig. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, June 1976, IEEE Computer Society, New York, 134–137.
18. ANSI/ISA-84.01-1996, Application of Safety Instrumented Systems for the Process Industry, ISA, Research Triangle Park, NC, 1997.
19. IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, International Electrotechnical Commission, Geneva, 1998/2000.
20. IEC 61511, Functional Safety: Safety Instrumented Systems for the Process Industry Sector (draft), International Electrotechnical Commission, Geneva.
21. Clarke, D. W., Validation of sensors, actuators and processes, Proc. Interkama ISA Tech. Conf., Düsseldorf, Germany, 1999.
22. Manus, H., Self-validating (SEVA) sensors—towards standards and products, Proc. 29th BIAS Int. Conf., Milan, Italy, 2000.
23. Goble, W. M., Control Systems Safety Evaluation & Reliability, ISA, Research Triangle Park, NC, 1998.
24. Gruhn, P. and Cheddie, H. L., Safety Shutdown Systems: Design, Analysis and Justification, ISA, Research Triangle Park, NC, 1998.
