1.10 Redundant and Voting Systems
© 2003 by Béla Lipták
Static techniques are effective in handling both transient and permanent failures. Masking is virtually instantaneous and automatic. It can be defined as any computer error correction method that is transparent to the user and often to the software. Redundant components serve to mask the effect of hardware failures of other components.

Many different techniques of static redundancy can be applied. The simplest or lowest level of complexity is by a massive replication of the individual components of the system.¹ For example, four diodes connected as two parallel pairs that are themselves connected in series will not fail if any one diode fails "open" or "short." Logical gates in similar quadded arrangements⁹,¹⁰ can also guard against single faults, and even some multiple faults, for largely replicated systems.

More sophisticated systems use replication at higher levels of complexity to mask failures. Instead of using a mere massive replication of components configured in fault-tolerant arrangements, identical nonredundant computer sections or modules can be replicated and their outputs voted upon. Examples are triple modular redundancy (TMR) and more massive modular redundancy (NMR), where N can stand for any odd number of modules.

In addition to component replication, coding can be used to mask faults as well as to detect them. With the use of some codes, data that has been garbled (i.e., bits changed due to hardware errors) can sometimes be recovered instantaneously with the use of redundant hardware. Dynamic recovery methods are, however, better able to handle many of these faults.

Higher levels of fault tolerance can be achieved more easily through dynamic redundancy and implemented through the dual actions of fault detection and recovery. This often requires software help in conjunction with hardware redundancy. Many of these methods are extensions of static techniques.

Massive redundancy in components can often be better utilized when controlled dynamically. Redundant modules, or spares, can have a better fault tolerance when they are left unpowered until needed, since they will not degrade while awaiting use.¹¹ This technique, standby redundancy, often uses dynamic voting techniques to achieve a high degree of fault tolerance. This union of the two methods is referred to as hybrid redundancy.¹² Additional hardware is needed for the detection and switching out of faulty modules and the switching in of good spares within the system by this technique.

Error detecting and error correcting codes¹³ can be used to dynamically achieve fault tolerance in a computing system. Coding refers to the addition of extra bits to and the rearranging of the bits of a binary word that contains information. The strategy of coding is to add a minimum number of check bits, the additional bits, to the message in such a way that a given degree of error detection or correction is achieved.⁴ Error detection and correction is accomplished by comparing the new word, hopefully unchanged after transmission, storage, or processing, with a set of allowable configurations of bits. Discrepancies discovered in this manner signal the existence of a fault, which can sometimes be corrected if enough of the original information remains intact. This means that the original binary word can be reconstructed with some such codes if a set number of bits in the coded word have not changed. Encoding and decoding words with the use of redundant hardware can be very effective in detecting errors. Through hardware or software algorithms, incorrect data can also often be reconstructed. Otherwise, the detected errors can be handled by module replacement and software recovery actions. The actions taken depend on the extent of the fault and of the recovery mechanisms available to the computing system.

Software Redundancy

Software redundancy refers to all additional software installed in a system that would not be needed for a fault-free computer. Software redundancy plays a major role in most fault-tolerant computers. Even computers that recover from failures mainly by hardware means use software to control their recovery and decision-making processes. The level of software used depends on the recovery system design. The recovery design depends on the type of error or malfunction that is expected. Different schemes have been found to be more appropriate for the handling of different errors. Some can be accomplished most efficiently solely by hardware means. Others need only software, but most use a mixture of the two.

For a functional system, i.e., one without hardware design faults, errors can be classified into two varieties: (1) software design errors and (2) hardware malfunctions.

The first category can be corrected mainly by means of software. It is extremely difficult for hardware to be designed to correct for programmers' errors. The software methods, though, are often used to correct hardware faults—especially transient ones. The reduction and correction of software design errors can be accomplished through the techniques outlined below.

Computers may be designed to detect several software errors. Examples include the use of illegal instructions¹⁴,¹⁵ (i.e., instructions that do not exist), the use of privileged instructions when the system has not been authorized to process them, and address violations. The latter refers to reading or writing into locations beyond usable memory. These limits can often be set physically on the hardware. Computers capable of detecting these errors allow the programmer to handle the errors by causing interrupts. The interrupts route the program to specific locations in memory. The programmer, knowing these locations, can then add his own code to branch to his specific subroutines, which can handle each error in a specified manner.

Software recovery from software errors can be accomplished via several methods. As mentioned before, parallel programming, in which alternative methods are used to determine a correct solution, can be used when an incorrect solution can be identified. Some less sophisticated systems print out diagnostics so that the user can correct the program off line.
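The voting that TMR and NMR arrangements perform in hardware is easy to state in software terms. The sketch below is illustrative only (the function names are not from the handbook): word-level majority voting over the outputs of N redundant modules, and the classic bitwise two-out-of-three vote used in TMR voters.

```python
from collections import Counter

def majority_vote(outputs):
    """Word-level NMR voter: return the value produced by a majority of
    the redundant modules, or raise if no majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many module failures")
    return value

def bitwise_vote(a, b, c):
    """TMR bitwise voter: each output bit is the majority of the three
    corresponding input bits, masking any single-module bit error."""
    return (a & b) | (a & c) | (b & c)

# One module delivers a corrupted word; voting masks the fault.
good, bad = 0b1011_0110, 0b1011_0111
assert majority_vote([good, good, bad]) == good
assert bitwise_vote(good, good, bad) == good
```

Note that the word-level voter needs a strict majority, which is why N is chosen odd; with N = 3 it tolerates one arbitrary module failure, matching the TMR case described above.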
FIELD INSTRUMENT REDUNDANCY AND VOTING

The above concepts apply not only to process computers but also to basic process control systems (BPCSs) and safety instrumented systems (SISs), where they also improve performance, availability, and reliability. In the case of field instruments and final control elements, they mainly guarantee continuity of operation and increase uptime, whereas, in SIS systems, they minimize nuisance or spurious interventions and alarms.

The techniques used in BPCS and SIS systems are similar and were initially developed for the inherently more demanding SIS applications. For SIS systems, the need for international regulations has been recognized (ANSI/ISA-84.01-1996,¹⁸ IEC 61508-1998/2000,¹⁹ and IEC 61511,²⁰ in draft version), while, for non-safety-related control loops, this is left to good engineering practice. Therefore, the discussion of redundancy and voting techniques, as applied to the field instruments of BPCS systems, will be based on the SIS standards as guidelines. The BPCS goal is to improve control loop availability such that the trigger point for the intervention of the associated SIS system is unlikely ever to be reached. Thereby, redundancy in BPCS also improves safety. This is because increased availability reduces the number of shutdowns, which tend to shorten the life of the plant due to the resulting thermal and other stresses.

One of the main objectives of measurement and control specialists is to improve the availability and accuracy of measurements. To achieve that goal and to minimize systematic uncertainty while increasing reliability, correct specification, instrument selection, and installation are essential.

Assuming that the transmitters have been properly specified, selected, and installed, one can further improve total performance by measuring the same variable with more than one sensor. Depending on the importance of the measurement, redundancy can involve two or more detectors measuring the same process variable. When three or more sensors are used, one can select the "majority view" by voting. With this approach, one would select m measurements out of the total n number of signals so that m > n/2. In industrial practice, n is normally 3 so that m is 2.

The redundant and voting techniques have been standardized in various SIS-related documents, including ANSI/ISA-84.01, IEC 61508, and IEC 61511. The SIS systems usually evaluate on–off signals or threshold limits of analog signals whereas, in process control, redundancy and voting are obtained by the evaluation of multiple analog signals. The main difference between BPCS and SIS systems is that SIS is a "dormant" system, but continuously self-checking, and it is called upon to operate only in an emergency. In addition, the SIS is fail safe; i.e., if it fails, it brings the plant to a safe status. SIS malfunctioning is inferred from diagnostic programs and not from plant conditions, because the plant cannot be shut down or brought to unsafe conditions just to test the SIS system. All international regulations follow this approach.

In contrast to SIS systems, the BPCS control loops are always active and, if they malfunction, they actuate alarms, which the operator immediately notices. The consequence is that the SIS-based definitions developed in IEC 61508 can, to some extent, also be used as guidelines for control loops that require high uptime and whose unavailability would, within a short time, drive the plant to conditions requiring plant shutdown.

IEC 61508 Part 6 gives the definition of the various architectures most commonly used in safety instrumented systems. They apply for use with one, two, or three elements and their various combinations. The elements that are used in a single or multiple configuration can be either transmitters or final control elements, but they are mainly used for transmitters, and only very rarely for control valves, because of the substantial difference in costs. The control system, such as a DCS system, is usually configured with multiple controllers and other redundant system components (e.g., system bus, I/O bus, HMI). IEC 61508 considers and gives definitions to the configurations described below.

[FIG. 1.10b — 1oo1/1oo1D transmitter input: a single transmitter T on an AI channel of a multiloop controller attached to the system bus, with DI, DO, and AO channels serving the loop.]

Single-Transmitter Configuration (Figure 1.10b)

1oo1 A single transmitter is used, as in many control loops. These loops consist of an analog transmitter and an analog controller (pneumatic or electronic). This configuration is the most prone to overall malfunctioning. Errors and failures can be caused by a sticking control valve or transmitter or by an out-of-range signal (up or down scale). In these loops,
diagnostic protection is very limited. Remember, in the past, the burn-out feature of thermocouples was almost the only diagnostic function implemented in the mV/psi converters or mV/mA transducers.

1oo1D A single transmitter is used, with diagnostic coverage integral to the transmitter (e.g., self-validating transmitters²¹,²²) and/or external in the control system.

Two-Transmitter Configuration (Figure 1.10c)

[FIG. 1.10c — 2oo2D transmitter input: two transmitters, Tx and Ty, on AI inputs of a multiloop controller, with DI, DO, and AO channels.]

1oo2 Two transmitters in parallel are used; the failure of one determines the loss of control. In principle, this definition cannot be borrowed from IEC 61508.

1oo2D Two transmitters in parallel are used, with diagnostic coverage mainly residing in the control system. The types of diagnostic functions will be covered afterward.

2oo2 Two transmitters in parallel are used. The loss of control should be determined by the failure of both. In principle, this definition cannot be borrowed from IEC 61508.

Three-Transmitter Configuration (Figure 1.10d)

[FIG. 1.10d — 2oo3 transmitter input: three transmitters, Tx, Ty, and Tz, on AI inputs of a multiloop controller, with DI, DO, and AO channels.]

2oo3 Three transmitters in parallel are used. The concurrent value indicated by two of them is assumed as correct and representative of the process conditions. Concurrency means that they differ by no more than X%.

Diagnostic Coverage

The diagnostic coverage in the BPCS is much less than in the SIS, for reasons outlined previously, and is provided mainly in and by the DCS, which has the capability of comparing the signals received from the transmitters and determining whether they are within the imposed limits so as to consider them to be concurrent. If an inconsistency is detected, the DCS is capable of signaling the abnormal situation and of maintaining control, at least in some instances, without operator intervention.

1oo1D The diagnostic coverage can be partly integral to the transmitter and/or external in the control system (rate-of-change alarms, overrange alarms detecting the individual fault). In a broader sense, in addition, the material balance (data reconciliation) performed in the DCS can contribute to detecting a failure in the flow transmitters or their unreliable reading.

1oo2D The signal from each transmitter is checked to verify that it is within the validity limits (i.e., 4–20 mA). If a transmitter is outside the validity range, its signal
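The validity-limit and concurrency checks described above can be sketched in software. This is an illustrative fragment, not vendor DCS code: the function name, the 1% default tolerance, and the choice to average a surviving concurrent pair are assumptions made for the example; only the 4–20 mA validity limits and the "differ by no more than X%" concurrency test come from the text.

```python
def select_2oo3(x, y, z, tol_pct=1.0, span=16.0):
    """Sketch of 2oo3 signal selection for three 4-20 mA transmitters.

    Signals outside the 4-20 mA validity limits are discarded; surviving
    values are 'concurrent' when they differ by no more than tol_pct
    percent of the instrument span (the X% of the text).
    """
    ok = sorted(s for s in (x, y, z) if 4.0 <= s <= 20.0)
    close = [abs(a - b) <= tol_pct / 100.0 * span
             for a, b in zip(ok, ok[1:])]
    if len(ok) == 3 and any(close):
        return ok[1], "ok"          # pass the median to the controller
    if len(ok) == 2 and close[0]:
        return sum(ok) / 2.0, "ok"  # average the concurrent pair (assumed)
    return None, "alarm"            # no usable majority: alarm the operator

# Healthy case: three valid, concurrent signals, so the median is selected.
pv, status = select_2oo3(12.00, 12.05, 12.10)   # -> (12.05, "ok")
```

With the median as the selected process variable, a single transmitter that drifts or fails out of range cannot disturb the control loop, which is the point of the 2oo3 arrangement.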
[FIG. 1.10f — 2oo3 signal conditioning: transmitters Tx, Ty, and Tz (signals A, B, and C) feed a computing module in the multiloop controller, which drives the control valve and an alarm. The figure's truth table is reproduced below.]

TRUTH TABLE

A                                   YES     YES     YES      NO
B                                   YES     YES     NO       NO
C                                   YES     NO      NO       NO
Computing module                    Median  Median  Average  NONE
Alarm                               NO      YES     YES      YES
Discard                             NONE    NONE    Tz       Tx, Ty, Tz
Force controller output to manual   NO      NO      NO       YES

The operator has the responsibility to select one of the three transmitters as the good one, use it as input to the controller, and switch to auto again. There are some possible variations in the algorithms used for the selection of the valid signals and the discarding of the unreliable ones, and they depend on the available control blocks of the involved DCS.

Of course, these qualitative statements are true only if the maintenance activity is immediate and the faulty equipment is repaired or substituted within the shift or the day. The triple transmitters are less sensitive to the faulty operation of one (or even two) units, because it is possible for the operator to select the remaining healthy transmitter and use it as the process variable of the control loop.

These qualitative considerations can be studied quantitatively with statistical methods based on the fact that failures are random variables. Further information is available in the specialized literature.²³,²⁴

…necessary to evaluate properly to what extent it is necessary to avoid common-mode failures as a factor of the criticality of the application. If, for instance, all the cables are installed on the same tray, distributing the redundant transmitters in different cables does not improve the situation; it would be unusual for only one cable to break while the others remain in operation.

References

18. ANSI/ISA-84.01-1996, Application of Safety Instrumented Systems for the Process Industry, ISA, Research Triangle Park, NC, 1997.
19. IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, International Electrotechnical Commission, Geneva, 1998/2000.
20. IEC 61511, Functional Safety: Safety Instrumented Systems for the Process Industry Sector (draft), International Electrotechnical Commission, Geneva.
21. Clarke, D. W., Validation of sensors, actuators and processes, Proc. Interkama ISA Tech. Conf., Düsseldorf, Germany, 1999.
22. Manus H., Self-validating (SEVA) sensors—towards standards and products, Proc. 29th BIAS Int. Conf., Milan, Italy, 2000.
23. Goble, W. M., Control Systems Safety Evaluation & Reliability, ISA, Research Triangle Park, NC, 1998.
24. Gruhn, P. and Cheddie, H. L., Safety Shutdown Systems: Design, Analysis and Justification, ISA, Research Triangle Park, NC, 1998.
Carter, W. C. and McCarthy, C. E., Implementation of an experimental fault-tolerant memory system, IEEE Trans. Comput., C-25 (6), 557–568, 1976.
Chandy, K. M. and Ramamoorthy, C. V., Rollback and recovery strategies, IEEE Trans. Comput., C-21 (6), 546–556, 1972.
Chandy, K. M., Ramamoorthy, C. V., and Coway, A., A framework for hardware-software trade-offs in the design of fault-tolerant computers, AFIPS Conf. Proc., Vol. 41, Part 2, AFIPS Press, Montvale, NJ, 53–56, 1972.
Conn, R. B., Alexandridis, N. A., and Avizienis, A., Design of a fault-tolerant, modular computer with dynamic redundancy, AFIPS Conf. Proc., Vol. 41, Part 2, AFIPS Press, Montvale, NJ, 1057–1067, 1972.
Daly, T. E., Tsou, H. S. E., Lewis, J. L., and Hollowich, M. E., The design and verification of a synchronous executive for a fault-tolerant system, Dig. 1973 Int. Symp. Fault-Tolerant Computing, Palo Alto, CA, IEEE Computer Society, New York, 3–9, 1973.
DeAnglis, D. and Lauro, J. A., Software recovery in the fault-tolerant spaceborne computer, Dig. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, IEEE Computer Society, New York, 143–147, 1976.
Elliot, B. H. and Williams, T. J., Proposal for a Hierarchical Computer Control System, Systems Engineering of Hierarchy Computer Control Systems for Large Steel Manufacturing Complexes, Report No. 84, Vol. 1, Chap. 1-4, Purdue Laboratory for Applied Industrial Control, Purdue University, West Lafayette, IN, August 1976, 4-71 to 4-119.
Elmondorf, W. R., Fault-tolerant programming, Dig. 1972 Int. Symp. Fault-Tolerance Computing, Newton, MA, IEEE Computer Society, New York, 79–83, 1972.
Hamming, R. W., Error detecting and error correcting codes, Bell System Technol. J., 29(2), 147–160, 1950.
Hayes, J. P., Checksum test methods, Dig. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, IEEE Computer Society, New York, 114–120, 1976.
Himmelblau, D. M., Fault Detection and Diagnosis in Chemical and Petrochemical Process, Elsevier Scientific, New York, 1978.
Hopkins, A. L. Jr. and Smith, T. B. III, The architectural elements of a symmetric fault-tolerant multiprocessor, Dig. 1974 Int. Symp. Fault-Tolerant Computing, Urbana, IL, IEEE Computer Society, New York, 4-2 to 4-6, 1974.
Horning, J. J., Lauer, H. C., Melliar-Smith, P. M., and Randell, B., A program structure for error detection and recovery, in Lecture Notes in Computer Science, G. Goos and J. Hartmanis, Eds., Vol. 16, Springer-Verlag, Berlin, 171–187, 1974.
IBM Systems Development Division, IBM Synchronous Data Link Control General Information, IBM Manual GA 27-3093-0, IBM Systems Development Division, Research Triangle Park, NC, 1974.
Jensen, P. A., Quadded nor logic, IEEE Trans. Reliability, Vol. R-12, 22–31, 1963.
Kim, K. H. and Ramamoorthy, C. V., Failure-tolerant parallel programming and its supporting system architecture, AFIPS Conf. Proc., Vol. 45, AFIPS Press, Montvale, NJ, 413–423, 1976.
King, J. C., Proving programs to be correct, IEEE Trans. Comput., C-20 (11), 1331–1336, 1971.
Kompass, E. J. and Williams, T. J., Eds., Total control systems availability, Proc. 15th Annu. Advanced Control Conf., Purdue University, West Lafayette, IN, September 11–13, 1989.
Kopetz, H., Fault tolerance in real-time systems, Proc. 11th IFAC World Congress, Tallinn, Estonia, Vol. 7, 111–118, 1990.
Losq, J., A highly efficient redundancy scheme: self-purging redundancy, IEEE Trans. Comput., C-25 (6), 569–578, 1976.
Mathur, F. P. and deSousa, P. T., GMR: general modular redundancy, Dig. 1974 Int. Symp. Fault-Tolerant Computing, Urbana, IL, IEEE Computer Society, New York, 2-20 to 2-25, 1974.
Mills, H. D., Structured programming in large systems, in Debugging Techniques in Large Systems, R. Rustin, Ed., Prentice Hall, Englewood Cliffs, NJ, 44–55, 1971.
Mine, H. and Koga, Y., Basic properties and construction method for fail-safe logical systems, IEEE Trans. Electronic Comput., EC-16 (3), 282–289, 1967.
O'Brien, F. J., Rollback point insertion, Dig. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, IEEE Computer Society, New York, 138–142, 1976.
Patel, A. M. and Hsiao, M. Y., An adaptive error correction scheme for computer memory system, AFIPS Conf. Proc., Vol. 41, Part 1, AFIPS Press, Montvale, NJ, 83–85, 1972.
Peterson, W. W. and Brown, D. T., Cyclic codes for error detection, Proc. IRE, 22(1), 228–235, 1961.
Peterson, W. W. and Weldon, E. J. Jr., Error Correcting Codes, MIT Press, Cambridge, MA, 1972.
Pradhan, D. K., Fault-Tolerant Computing—Theory and Techniques, Vols. I and II, Prentice Hall, Englewood Cliffs, NJ, 1986.
Randell, B., System structure for software fault-tolerance, IEEE Trans. Software Eng., SE-1 (2), 220–232, 1975.
Rennels, D. A. and Avizienis, A., RMS: a reliability modeling system for self-repairing computers, Dig. 1973 Int. Symp. Fault-Tolerant Computing, Palo Alto, CA, IEEE Computer Society, New York, 131–135, 1973.
Rohr, J. A., STAREX self-repair routines: software recovery in the JPL-STAR computer, Dig. 1973 Int. Symp. Fault-Tolerant Computing, Palo Alto, CA, IEEE Computer Society, New York, 11–16, 1973.
Sellers, F. F. Jr., Hsiao, M. Y., and Bearnson, L. W., Error Detection Logic for Digital Computers, McGraw-Hill, New York, 1968.
Shepherd, M. Jr., Distributed Computing Power: Opportunities and Challenges, 1977 National Computer Conference Address, Dallas, TX, 1977.
Texas Instruments, Inc., 990 Computer Family Systems Handbook, Manual No. 945250-9701, Texas Instruments, Inc., Austin, TX, 1976.
Wachter, W. J., System malfunction detection and correction, Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France, IEEE Computer Society, New York, 196–201, 1975.
Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, North Holland, New York, 1978.
Wensley, J. H., SIFT—software implemented fault-tolerance, AFIPS Conf. Proc., Vol. 41, Part 1, AFIPS Press, Montvale, NJ, 243–253, 1972.
Williams, T. J., The development of reliability in industrial control systems, Proc. IEEE Micro., 4(6), 66–80, 1984.
Williams, T. W. and Parker, K. P., Design for testability—a survey, IEEE Trans. Comput., C-31 (1), 2–14, 1982.