Dependable Computing
[email protected]
http://www.ece.ucsb.edu/~parhami
All rights reserved for the author. No part of this book may be reproduced,
stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without written permission. Contact the author at:
ECE Dept., Univ. of California, Santa Barbara, CA 93106-9560, USA ([email protected])
Dedication
To my academic mentors of long ago:
and
Preface
“The skill of writing is to create a context in which other people
can think.”
Edwin Schlossberg
Accounts of computer system errors, failures, and other mishaps range from humorous to
horrifying. On the lighter side, we have data entry or computation errors that lead to an
electric company sending a bill for hundreds of thousands of dollars to a small residential
customer. Such errors cause a chuckle or two as they are discussed at the dinner table and
are usually corrected to everyone's satisfaction, leaving no permanent damage (though in
a few early occurrences of this type, some customers suffered from their power being
disconnected due to nonpayment of the erroneous bill). At the other extreme, there are
dire consequences such as an airliner with hundreds of passengers on board crashing, two
high-speed commuter trains colliding, a nuclear reactor taken to the brink of meltdown,
or financial markets in a large industrial country collapsing. Causes of such annoying or
catastrophic behavior range from design deficiencies and the interaction of several rare or
unforeseen events to operator errors, external disturbances, and malicious actions.
In nearly all engineering and industrial disciplines, quality control is an integral part of
the design and manufacturing processes. There are also standards and guidelines that
make at least certain aspects of quality assurance more or less routine. This is far from
being the case in computer engineering, particularly with regard to software products.
True, we do offer dependable computing courses to our students, but in doing so, we
create an undesirable separation between design and dependability concerns. A structural
engineer does not learn about bridge-building in one course and about ensuring that
bridges do not collapse in another. A toaster or steam-iron manufacturer does not ship its
products with a warning label that there is no guarantee that the device will prepare toast
or remove wrinkles from clothing and that the manufacturer will not be liable for any
harm resulting from their use.
The field of dependable (aka fault-tolerant) computing was born in the late 1960s,
because longevity, safety, and robustness requirements of space and military applications
could not be met through conventional design. Space applications presented the need for
long-life systems, with either no possibility of on-board maintenance or repair (unmanned
missions) or with stringent reliability requirements in harsh, and relatively unpredictable
environments (manned missions). Military applications required extreme robustness in
the face of punishing operational conditions, partial system wipeout, or targeted attacks
by a sophisticated adversary. Early researchers of the field were thus predominantly
supported by aerospace and defense organizations.
As the field matured, application areas broadened beyond aerospace and military systems
and they now include a wide array of domains, from automotive computers to large
redundant disk arrays. In fact, many of the methods discussed in this book are routinely
utilized even in contexts that do not satisfy the traditional definitions of high-risk or
safety-critical systems, although the most elaborate techniques continue to be developed
for systems whose failure would endanger human lives. Systems in the latter category
include:
Such advanced techniques then trickle down and eventually find their way into run-of-
the-mill computer systems, such as traditional desktop and laptop computers.
The field of dependable computing has matured to the point that a dozen or so texts and
reference books have been published. Some of these books that cover dependable
computing in general (as opposed to special aspects or ad-hoc/unconventional methods)
are listed at the end of this preface. Each of these books possesses unique strengths and
has contributed to the formation and fruition of the field. The current text, Dependable
Computing: A Multilevel Approach, exposes the reader to the basic concepts of
dependable computing in sufficient detail to enable their use in many hardware/software
contexts. Covered methods include monitoring, redundancy, error detection/correction,
self-test, self-check, self-repair, adjudication, and fail-soft operation. The book is an
outgrowth of lecture notes that the author has developed and refined over many years.
Here are the most important features of this text in comparison with the listed books:
b. Emphasis on both the underlying theory and actual system designs: The ability to
cope with complexity requires both a deep knowledge of the theoretical
underpinnings of dependable computing and examples of designs that help us
understand the theory. Such designs also provide building blocks for synthesis as
well as reference points for cost-performance comparisons. This viewpoint is
reflected, for example, in the detailed coverage of error-coding techniques that
d. Wide coverage of important topics: The current text covers virtually all important
algorithmic and hardware design topics in dependable computing, thus providing
a balanced and complete view of the field. Topics such as testable design, voting
algorithms, software redundancy, and fail-safe systems do not all appear in other
textbooks.
Summary of Topics
The seven parts of this book, each composed of four chapters, have been written with the
following goals:
Part I sets the stage, gives a taste of what is to come, and provides a detailed perspective
on the assessment of dependability in computing systems and the modeling tools needed
for this purpose.
Part II deals with impairments to dependability at the physical (device) level, how they
may lead to system vulnerability and low integrated-circuit yield, and what
countermeasures are available for dealing with them.
Part III deals with impairments to dependability at the logical (circuit) level, how the
resulting faults can affect system behavior, and how redundancy methods can be used to
deal with them.
Part IV covers information-level impairments that lead to data-path and control errors,
methods for detecting/correcting such errors, and ways of preventing such errors from
propagating and causing problems at even higher levels of abstraction.
Part V deals with everything that can go wrong at the architectural level, that is, at the
level of interactions between subsystems, be they parts of a single computer or nodes in a
widely distributed system.
Part VI covers service-level impairments that may cause a system not to be able to
perform the required tasks, even though it has not totally failed in an absolute sense.
Part VII deals with breaches at the computation-result or outcome level, where the
success or failure of a computing system is ultimately judged and the costs of aberrant
results or actions must be borne.
For classroom use, the topics in each chapter of this text can be covered in a lecture
spanning 1-2 hours. In my own teaching, I have used the chapters primarily for 1.5-hour
lectures, twice a week, in a 10-week quarter, omitting or combining some chapters to fit
the material into 18-20 lectures. But the modular structure of the text lends itself to other
lecture formats, self-study, or review of the field by practitioners. In the latter two cases,
the readers can view each chapter as a study unit (for one week, say) rather than as a
lecture. Ideally, all topics in each chapter should be covered before moving to the next
chapter. However, if fewer lecture hours are available, then some of the subsections
located at the end of chapters can be omitted or introduced only in terms of motivations
and key results.
An instructor's solutions manual is planned. The author's detailed syllabus for the course
ECE 257A at UCSB is available at
http://www.ece.ucsb.edu/~parhami/ece_257a.htm
References to classical papers in dependable computing, key design ideas, and important
state-of-the-art research contributions are listed at the end of each chapter. These
references provide good starting points for doing in-depth studies or for preparing term
papers/projects. New ideas in the field of dependable computing appear in papers
presented at an annual technical gathering, the Dependable Systems and Networks (DSN)
conference, sponsored jointly by Institute of Electrical and Electronics Engineers (IEEE)
and International Federation for Information Processing (IFIP). DSN, which was formed
by merging meetings sponsored separately by IEEE and IFIP, covers all aspects of the
field, including techniques for dependable computing and communications, as well as
performance and other implications of dependability features.
Acknowledgments
General References
The list that follows contains references of two kinds: (1) books that have greatly
influenced the current text and (2) general reference sources for in-depth study or
research. Books and other information sources that are relevant to specific chapters are
listed in the end-of-chapter reference lists.
[Ande81] Anderson, T. and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice Hall, 1981.
[Ande85] Anderson, T. A. (ed.), Resilient Computing Systems, Collins, 1985. Also: Wiley, 1986.
[ComJ] The Computer Journal, published monthly by the British Computer Society.
[Comp] IEEE Computer, magazine published by the IEEE Computer Society. Has published several
special issues on dependable computing: Vol. 17, No. 6, August 1984; Vol. 23, No. 5, July
1990.
[CSur] ACM Computing Surveys, journal published by the Association for Computing Machinery.
[DCCA] Dependable Computing for Critical Applications, series of conferences later merged into DSN.
[Diab05] Diab, H. B. and A. Y. Zomaya (eds.), Dependable Computing Systems: Paradigms,
Performance Issues, and Applications, Wiley, 2005.
[DSN] Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks, Conference formed by
merging earlier series of meetings, the oldest of which (FTCS) dated back to 1971. URL:
http://www.dsn.org
[Dunn02] Dunn, W. R., Practical Design of Safety-Critical Computer Systems, Reliability Press, 2002.
[EDCC] Proc. European Dependable Computing Conf., Conference held 10 times from 1994 to 2014
and turned into an annual event in 2015, when it merged with the European Workshop on
Dependable Computing. URL: http://edcc2015.lip6.fr/
[FTCS] IEEE Symp. Fault-Tolerant Computing, series of annual symposia, begun in 1971 and
eventually merged into DSN.
[Geff02] Geffroy, J.-C. and G. Motet, Design of Dependable Computing Systems, Springer, 2002.
[Gray85] Gray, J., “Why Do Computers Stop and What Can Be Done About It?” Technical Report
TR85.7, Tandem Corp., 1985.
[ICCD] Proc. IEEE Int'l Conf. Computer Design, sponsored annually by the IEEE Computer Society
since 1983.
[IFIP] Web site of the International Federation for Information Processing Working Group WG 10.4
on Dependable Computing. http://www.dependability.org/wg10.4
[Jalo94] Jalote, P., Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
[John89] Johnson, B. W., Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley,
1989.
[Knig12] Knight, J., Fundamentals of Dependable Computing for Software Engineers, CRC Press, 2012.
[Kore07] Koren, I. and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[Lala01] Lala, P. K., Self-Checking and Fault-Tolerant Digital Design, Morgan Kaufmann, 2001.
[Levi94] Levi, S.-T. and A. K. Agrawala, Fault-Tolerant System Design, McGraw-Hill, 1994.
[Negr89] Negrini, R., M. G. Sami, and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI
and WSI Arrays, MIT Press, 1989.
[Nels87] Nelson, V. P. and B. D. Carroll (eds.), Tutorial: Fault-Tolerant Computing, IEEE Computer
Society Press, 1987.
[Prad86] Pradhan, D. K. (ed.), Fault-Tolerant Computing: Theory and Techniques, 2 Vols., Prentice
Hall, 1986.
[Prad96] Pradhan, D. K. (ed.), Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[PRDC] Proc. IEEE Pacific Rim Int'l Symp. Dependable Computing, sponsored by IEEE Computer
Society and held every 1-2 years since 1989.
[Rand13] Randell, B., J.-C. Laprie, H. Kopetz, and B. Littlewood (eds.), Predictably Dependable
Computing Systems, Springer, 2013.
[Shoo02] Shooman, M. L., Reliability of Computer Systems and Networks, Wiley, 2002.
[Siew82] Siewiorek, D. P. and R.S. Swarz, The Theory and Practice of Reliable System Design, Digital
Press, 1982.
[Siew92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd Edition, 1992. Also: A. K. Peters, 1998.
[Sori09] Sorin, D., Fault Tolerant Computer Architecture, Morgan & Claypool, 2009.
[SRDS] Proc. IEEE Symp. Reliable Distributed Systems, sponsored annually by IEEE Computer
Society.
[TCom] IEEE Trans. Computers, journal published by the IEEE Computer Society. Has published a
number of special issues on dependable computing: Vol. 41, No. 2, February 1992; Vol. 47,
No. 4, April 1998; Vol. 51, No. 2, February 2002.
[TDSC] IEEE Trans. Dependable and Secure Computing, journal published by the IEEE Computer
Society.
[Toy87] Toy, W. N., “Fault-Tolerant Computing,” Advances in Computers, Vol. 26, 1987, pp. 201-279.
[TRel] IEEE Trans. Reliability, quarterly journal published by IEEE Reliability Society.
Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).
Table of Contents
Part I Introduction: Dependable Systems
1 Background and Motivation 15
2 Dependability Attributes 44
3 Combinational Modeling 74
4 State-Space Modeling 103
Part II Defects: Physical Imperfections
5 Defect Avoidance 128
6 Defect Circumvention 147
7 Shielding and Hardening 163
8 Yield Enhancement 176
Part III Faults: Logical Deviations
9 Fault Testing 192
10 Fault Masking 210
11 Design for Testability 227
12 Replication and Voting 240
Part IV Errors: Information Distortions
13 Error Detection 255
14 Error Correction 275
15 Self-Checking Modules 295
16 Redundant Disk Arrays 310
Part V Malfunctions: Architectural Anomalies
17 Malfunction Diagnosis 331
18 Malfunction Tolerance 347
19 Standby Redundancy 361
20 Resilient Algorithms 371
Part VI Degradations: Behavioral Lapses
21 Degradation Allowance 385
22 Degradation Management 401
23 Robust Task Scheduling 415
24 Software Redundancy 427
Part VII Failures: Computational Breaches
25 Failure Confinement 444
26 Failure Recovery 457
27 Agreement and Adjudication 468
28 Fail-Safe System Design 487
Appendix
A Past, Present, and Future 498
Computer and information systems are important components of modern society,
which has grown increasingly reliant on the availability and proper functioning of such
systems. When a computer system fails:
A writer or reporter, who can no longer type on her personal computer, may
become less productive, perhaps missing a deadline.
A bank customer, unable to withdraw or deposit funds through a remote ATM,
may become irate and perhaps suffer financial losses.
A telephone company subscriber may miss personal or business calls or even
suffer loss of life due to delayed emergency medical care.
An airline pilot or passenger may experience delays and discomfort, perhaps even
perish in a crash or midair collision.
A space station or manned spacecraft may lose maneuverability or propulsion,
wandering off in space and becoming irretrievably lost.
Thus, consequences of computer system failures can range from inconvenience to loss of
life. Low-severity consequences, such as dissatisfaction and lost productivity, though not
as important on a per-occurrence basis, are much more frequent, thus leading to
nonnegligible aggregate effects on the society’s economic well-being and quality of life.
Serious injury or loss of life, due to the failure of a safety-critical system, is no doubt a
cause for concern. As computers are used for demanding and critical applications by an
ever expanding population of minimally trained users, the dependability of computer
hardware and software becomes even more important.
In one of the early proposals, dependability is defined succinctly as “the probability that
a system will be able to operate when needed” [Hosf60]. This simplistic definition,
which subsumes both of the well-known notions of reliability and availability, is only
valid for systems with a single catastrophic failure mode; i.e., systems that are either
completely operational or totally failed. The problem lies in the phrase “be able to
operate.” What we are actually interested in is “task accomplishment” rather than
“system operation.”
The following definition [Lapr82] is more suitable in this respect: “... dependability [is
defined] as the ability of a system to accomplish the tasks (or equivalently, to provide the
service[s]) which are expected from it.” This definition does have its own weaknesses.
For one thing, the common notion of “specified behavior” has been replaced by
“expected behavior” so that possible specification slips are accommodated in addition to
the usual design and implementation flaws. However, if our expectations are realistic and
precise, they might be considered as simply another form of system specification
(possibly a higher-level one). On the other hand, if we expect too much from a system, then this
definition is an invitation to blame our misguided expectations on the system’s
undependability.
A more useful definition has been provided by Carter [Cart82]: “... dependability may be
defined as the trustworthiness and continuity of computer system service such that
reliance can justifiably be placed on this service.” This definition has two positive
aspects: It takes the time element into account explicitly (“continuity”) and stresses the
need for dependability validation (“justifiably”). Laprie’s version of this definition
[Lapr85] can be considered a step backwards in that it substitutes “quality” for
“trustworthiness and continuity.” The notions of “quality” and “quality assurance” are
well-known in many engineering disciplines and their use in connection with computing
is a welcome trend. However, precision need not be sacrificed for compatibility.
In the most recent compilation of dependable computing terminology evolving from the
preceding efforts [Aviz04], dependability is defined as a system’s “ability to deliver
service that can justifiably be trusted” (original definition) and “ability to avoid service
failures that are more frequent and more severe than is acceptable” (alternate definition),
with trust defined as “accepted dependence.” Both of these definitions are rather
unhelpful and the first one appears to be circular.
Having defined computer system dependability, we next turn to the question of why we
need to be concerned about it. This we will do through a sequence of viewpoints, or
arguments. In these arguments, we use back-of-the-envelope calculations to illustrate the
three classes of systems for which dependability requirements are impossible to meet
without special provisions:
a. The reliability argument: Assume that electronic digital systems fail at a rate of
about λ = 10^(–9) per transistor per hour. This failure rate may be higher or lower for
different types of circuits, hardware technologies, and operating environments, but the
same argument is applicable if we change λ. Given a constant per-transistor failure rate λ,
an n-transistor digital system will have all of its components still working after t hours
with probability R(t) = e^(–nλt). We will justify this exponential reliability equation in
Chapter 2; for now, let's accept it on faith. For a fixed λ, as assumed here, R(t) is a
function of nt. Figure 1.1 shows the variation of R(t) with nt. While it is the case that not
every transistor failure translates into system failure, to be absolutely sure about correct
system operation, we have to proceed with this highly pessimistic assumption. Now, from
Fig. 1.1, we can draw some interesting conclusions.
The on-board computer for a 10-year unmanned space mission to explore the
solar system should be built out of only O(10^3) transistors if the O(10^5)-hour
mission is to have a 90% success probability.
The computerized flight control system on board an aircraft cannot contain more
than O(10^4) transistors if it is to fail with a likelihood of less than 1 in 10,000
during a 10-hour intercontinental flight.
The need for special design methods becomes quite apparent if we note that modern
microprocessors and digital signal processor (DSP) chips contain tens or hundreds of
millions of transistors.
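These estimates are easy to reproduce with a short computation. The following Python sketch (an illustration; the function names are not from the book, and the only assumption is the failure rate λ = 10^(–9) per transistor per hour used above) solves R(t) = e^(–nλt) ≥ r for the largest allowable transistor count n:

```python
# Back-of-the-envelope check of the reliability argument (illustrative sketch,
# not code from the book). Assumes the constant per-transistor failure rate
# lam = 1e-9 per hour quoted in the text.
import math

LAM = 1e-9  # failures per transistor per hour

def reliability(n, t_hours, lam=LAM):
    """R(t) = e^(-n*lam*t): probability that all n transistors survive t hours."""
    return math.exp(-n * lam * t_hours)

def max_transistors(t_hours, r_goal, lam=LAM):
    """Largest n for which an n-transistor system still meets reliability r_goal at time t."""
    return -math.log(r_goal) / (lam * t_hours)

# 10-year (about 10^5-hour) space mission with a 90% success probability:
print(max_transistors(10 * 365 * 24, 0.90))   # roughly 1.2e3, i.e., O(10^3)
# 10-hour flight with failure probability below 1 in 10,000:
print(max_transistors(10, 1 - 1e-4))          # roughly 1.0e4, i.e., O(10^4)
```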
c. The availability argument: A central telephone switching facility should not be down
for more than a few minutes per year; more outages would be undesirable not only
because they lead to customer dissatisfaction, but also owing to the potential loss of
revenues. If the diagnosis and replacement of subsystems that are known to be
malfunctioning take 20-30 minutes, say, a mean time between failures (MTBF) of
O(10^5) hours is required. A system with reliability R(t) = e^(–nλt), as assumed in our earlier
reliability argument, has an MTBF of 1/(nλ). Again, we accept this without proof for
now. With λ = 10^(–9)/transistor/hour, the telephone switching facility cannot contain more
than O(10^4) transistors if the stated availability requirement is to be met; this is several
orders of magnitude below the complexity of modern telephone switching centers.
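A similar sketch covers the availability argument; the 30-minute repair time and the five-minutes-per-year downtime budget used below are illustrative assumptions consistent with the figures quoted above:

```python
# Rough check of the availability argument (illustrative sketch). With repairs
# taking about half an hour and a downtime budget of a few minutes per year,
# the required MTBF, and hence the allowable transistor count, follow directly.
LAM = 1e-9                    # per-transistor failure rate (per hour)
MTTR_HOURS = 0.5              # assumed repair time of about 30 minutes
DOWNTIME_MIN_PER_YEAR = 5.0   # assumed downtime budget

minutes_per_year = 365 * 24 * 60
mtbf_needed = MTTR_HOURS * minutes_per_year / DOWNTIME_MIN_PER_YEAR  # in hours
n_max = 1 / (LAM * mtbf_needed)                                      # from MTBF = 1/(n*lam)

print(f"MTBF needed ~ {mtbf_needed:.2e} h")   # on the order of 10^5 hours
print(f"Max transistors ~ {n_max:.2e}")       # on the order of 10^4
```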
Fig. 1.1 Reliability R(t) = e^(–nλt) of an n-transistor system as a function of nt
(plotted on a logarithmic scale from 10^4 to 10^10 transistor-hours, for
λ = 10^(–9)/transistor/hour); R drops to 1/e ≅ 0.3679 where nλt = 1.
One of the important problems facing the designers of distributed computing systems is
ensuring data availability and integrity. Consider, for example, a system with five sites
(processors) that are interconnected pairwise via dedicated links, as depicted in Fig. 1.2.
The detailed specifications of the five sites, S0-S4, and 10 links, L0-L9, are not important
for this case study; what is important, is that sites and links can malfunction, making the
data stored at one site inaccessible to one or more other sites. If data availability to all
sites is critical, then everything must be stored in redundant form.
Fig. 1.2 A distributed computing system with five sites, S0-S4, interconnected
pairwise by 10 dedicated links, L0-L9.
Let the probability that a particular site (link) is available and functions properly during
any given attempt at data access be aS (aL) and assume that the sites and links
malfunction independently of each other.
Example 1.1: Home site and mirror site Quantify the improvement in data availability when
each data file Fi has a home or primary site H(Fi), with a copy stored at a mirror site M(Fi). This
doubles the aggregate storage requirements for the file system, but allows access to data despite
faulty sites and links.
Solution: To quantify the accessibility of a particular file, we note that a site can obtain a copy of
Fi if the home site H(Fi) and the link leading to it have not failed OR the home site is inaccessible,
but the mirror site M(Fi) can be accessed. Thus, the availability (measure of accessibility) A(Fi)
for a mirrored file Fi is:

A(Fi) = aSaL + (1 – aSaL) aSaL
In deriving the preceding equation, we have assumed that the file must be accessed directly from
one of the two sites that holds a copy. Analyzing the problem when indirect accesses via other
sites are also allowed is left as an exercise. As a numerical example, for aS = 0.99 and aL = 0.95,
we have:

A(Fi) = 0.9405 + 0.0595 × 0.9405 ≅ 0.9965
In other words, data unavailability has been reduced from 5.95% in the nonredundant case to
0.35% in the mirrored case.
Example 1.2: File triplication Suppose that the no-access probability of Example 1.1 is still too
high. Evaluate, and quantify the availability impact of, a scheme where three copies of each data
file are kept.
Solution: With copies of Fi kept at three different sites, the file is inaccessible only when all three
sites (or the links leading to them) are inaccessible: A(Fi) = 1 – (1 – aSaL)^3 ≅ 0.9998 for aS = 0.99
and aL = 0.95. Data unavailability is thus reduced to 0.02%. This improvement in data availability is achieved at
the cost of increasing the aggregate storage requirements by a factor of 3, compared with the
nonredundant case.
Example 1.3: File dispersion Let us now devise a more elaborate scheme that requires lower
redundancy. Each file Fi is viewed as a bit string of length l. It is possible to encode such a file
into approximately 5l/3 bits (a redundancy of 67%, as opposed to 100% and 200% in Examples
1.1 and 1.2, respectively) so that if the resulting bit string of length 5l/3 is divided equally into five
pieces, any three of these (l/3)-bit pieces can be used to reconstruct the original l-bit file. Let us
accept for now that the preceding data dispersion scheme is actually workable and assume that for
each file, we store one of the five pieces thus obtained in a different site in Fig. 1.2. Now, any site
needing access to a particular data file already has one of the three required pieces and can
reconstruct the file if it obtains two other pieces from the remaining four sites. Quantify the data
availability in this case.
Solution: In this case, a file Fi is available if two or more of the four sites holding its other pieces
are accessible:

A(Fi) = ∑k=2..4 C(4, k) (aSaL)^k (1 – aSaL)^(4–k)
Again, we have assumed that a file fragment must be accessed directly from the site that holds it.
With aS = 0.99 and aL = 0.95, we have:

A(Fi) ≅ 0.9992
Data unavailability is thus 0.08% in this case. This result is much better than that of Example 1.1
and is achieved at a lower redundancy as well. It also performs only slightly worse than the
triplication method of Example 1.2, which has considerably greater storage overhead.
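The availability figures of Examples 1.1-1.3 can be verified numerically. In the following Python sketch (an illustration; the variable names are not from the text), p = aSaL is the probability that one site-plus-link path is usable:

```python
# Numerical check of Examples 1.1-1.3 (an illustrative sketch).
from math import comb

a_S, a_L = 0.99, 0.95
p = a_S * a_L                                   # 0.9405: one site+link path is up

A_single     = p                                # nonredundant file at a remote site
A_mirrored   = 1 - (1 - p) ** 2                 # home site OR mirror site reachable
A_triplicate = 1 - (1 - p) ** 3                 # any one of three copies reachable
# Dispersion: the local piece is already present; at least 2 of the 4 remote
# pieces must be obtained to reconstruct the file.
A_dispersed  = sum(comb(4, k) * p ** k * (1 - p) ** (4 - k) for k in range(2, 5))

for name, a in [("single", A_single), ("mirrored", A_mirrored),
                ("triplicated", A_triplicate), ("dispersed", A_dispersed)]:
    print(f"{name:12s} unavailability = {100 * (1 - a):.2f}%")
# Prints approximately 5.95%, 0.35%, 0.02%, and 0.08%, matching the text.
```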
Members of the Newcastle Reliability Project, led by Professor Brian Randell, have
advocated a hierarchic view [Ande82]: A (computer) system is a set of components
(themselves systems) which interact according to a design (another system). The
recursion stops when we arrive at atomic systems whose internal structures are of no
interest at the particular level of detail with which we are concerned. System failure is
defined as deviation of its behavior from that predicted (required) by the system’s
authoritative specification. Such a behavioral deviation results from an erroneous system
state. An error is a part of an erroneous state which constitutes a difference from a valid
state. The cause of the invalid state transition which first establishes an erroneous state, is
a fault in one of the system’s components or in the system’s design. Similarly, the
component’s or design’s failure can be attributed to an erroneous state within the
corresponding (sub)system resulting from a component or design fault, and so on.
Therefore, at each level of the hierarchy, “the manifestation of a fault will produce errors
in the state of the system, which could lead to a failure” (Fig. 1.3).
Fig. 1.3 Aspects of a system and the associated impairments in the hierarchic view:
Structure-Fault, State-Error, Behavior-Failure.
With the preceding model, failure and fault are simply different views of the same
phenomenon. This is quite elegant and enlightening but introduces problems by the need
for continual establishment of frames of reference when discussing the causes (faults) and
effects (failures) of deviant system behavior at various levels of abstraction. While it is
true that a computer system may be viewed at many different levels of abstraction, it is
also true that some of these levels have proved more useful in practice. Avižienis
[Aviz82] takes four of these levels and proposes the use of distinct terminology for
impairments to dependability (“undesired events,” in his words) at each of these levels.
His proposal can be summarized in the cause-effect diagram of Fig. 1.4.
Fig. 1.4 Avižienis's four-level cause-effect view of undesired events, pairing each
universe with its impairment: Physical-Failure, Logical-Fault,
Informational-Error, External-Crash.
There are a few problems with the preceding choices of names for undesired events. The
term “failure” has traditionally been used both at the lowest and the highest levels of
abstraction; viz., failure rate, failure mode, and failure mechanism used by electrical
engineers and device physicists alongside system failure, fail-soft operation, and fail-safe
system coming from computer architects. To comply with the philosophy of distinct
naming for different levels, Avižienis retains “failure” at the physical level and uses
“crash” at the other end. However, this latter term is unsuitable. Inaccuracies or delays,
beyond what is expected according to system specifications, can hardly be considered
“crashes” in the ordinary sense of the term.
Furthermore, the term “crash” puts the emphasis on system operation rather than task
accomplishment and is thus unsuitable for fail-soft computer systems (such systems are
like airplanes in that they fail much more often than they crash!). We are thus led to a
definition of failure for computer systems that parallels that used by forensic experts in
structural engineering [Carp96]: Failure is an unacceptable difference between expected
and observed performance.
Another problem is that there are actually three external views of a computer system. The
maintainer’s external view consists of a set of interacting subsystems that must be
monitored for detecting possible malfunctions in order to reconfigure the system or,
alternatively, to guard against dangerous consequences (such as total system crash). The
operator’s external view, which is more abstract than the maintainer’s system-level view,
consists of a black box capable of providing certain computational services. Finally, the
end user’s external view is shaped by the system’s reaction to particular situations or
requests.
Fig. 1.5 The extended six-level model, pairing each abstraction with its impairment:
Component-Defect, Logic-Fault, Information-Error, System-Malfunction,
Service-Degradation, and Result-Failure. The left edge marks a first and a
second cycle (an unrolling of Fig. 1.3), while the right edge groups the
levels into low-level, mid-level, and high-level impairments.
Figure 1.5 depicts this extended model. There are two ways in which our extended six-
level impairment model can be viewed. The first view, shown on the left edge of Fig. 1.5,
is to consider it an unrolled version of the model in Fig. 1.3. This unrolling allows us to
talk about two cycles simultaneously without a danger of being verbose or ambiguous. A
natural question, then, is why stop at one unrolling. The reason is quite pragmatic. We are
looking at dependability from the eyes of the system architect who deals primarily with
module interactions and performance parameters, but also needs to be mindful of the
circuit level (perhaps even down to switches and transistors) in order to optimize/balance
the system from complex angles involving speed, size, power consumption,
upward/downward scalability, and so on.
The second view, which is the one that we will use in the rest of this book, is that shown
in the right half of Fig. 1.5:
High-level impairments, affecting the system as a whole, are of interest not only
to system architects but also to system integrators.
Taking into account the fact that a nonatomic component is itself a system, usage of the
term “failure” in failure rate, failure mode, and failure mechanism could be justified by
noting that a component is its designer’s end product (system). Therefore, we can be
consistent by associating the term “failure” with the highest and the term “defect” with
the lowest level of abstraction. The component designer’s failed system is the system
architect’s defective component. However, to maintain a consistent point of view
throughout the book, we will use the somewhat unfamiliar component-level terms defect
rate, defect mode, and defect mechanism from now on.
One final point: Computer systems are composed of hardware and software elements.
Thus, the reader may be puzzled by a lack of specific mention of impairments to software
dependability in our discussions thus far. The reason for this is quite practical. Whereas
one can meaningfully talk about defects, faults, errors, malfunctions, degradations, and
failures for software, it is the author’s experience that formulating statements, or
describing design methods, that apply to both hardware and software, requires some
stretching that makes the results somewhat convoluted and obscure. Perhaps, this would
not be the case if the author possessed greater expertise in software dependability.
Nevertheless, for the sake of completeness, we will discuss a number of algorithm and
software design topics in the later parts of the book, with the hope that some day the
discussion of dependable hardware and software can be truly integrated.
Anecdote: The development of the six-level model of Fig. 1.5 began in 1986, when the
author was a Visiting Professor at the University of Waterloo in Canada. Having six
levels is somewhat unsatisfactory, since successful models tend to have seven levels.
However, despite great effort expended in those days, the author was unable to add a
seventh type of impairment to the model. Undeterred by this setback, the author devised
the seven states of a system shown in Fig. 1.6.
The field of dependable computing deals with the procurement, forecasting, and
validation of computer system dependability. According to our discussions in Section 1.3,
impairments to dependability can be viewed from six abstraction levels. Thus, subfields
of dependable computing can be thought of as dealing with some aspects of one or more
of these levels. Specifically, we take the view that a computer system can be in one of
seven states: Ideal, defective, faulty, erroneous, malfunctioning, degraded, or failed, as
depicted in Fig. 1.6. Note that these states have nothing to do with whether or not the
system is “running.” A system may be running even in the failed state; the fact that it is
failed simply means that it isn’t delivering what is expected of it.
Upon the completion of its design and implementation, a system may end up in any of the
seven states, depending on the appropriateness and thoroughness of validation efforts.
Once in the initial state, the system moves from one state to another as a result of
deviations and remedies. Deviations are events that take the system to a lower (less
desirable) state, while remedies are techniques or measures that enable a system to make
the transition to a higher state.
The observability of the system state (ease of external recognition that the system is in a
particular state) increases as we move downward in Fig 1.6. For example, the inference
that a system is “ideal” can only be made through formal proof techniques; a proposition
that is currently impossible for complex computer systems. At the other extreme, a failed
system can usually be recognized with little or no effort. As examples of intermediate
states, the “faulty” state is recognizable by extensive off-line testing, while the
“malfunctioning” state is observable by on-line monitoring with moderate effort. It is,
therefore, common practice to force a system into a lower state (e.g., from “defective” to
“faulty,” under torture testing) in order to deduce its initial state.
Ideal
Defective
Faulty
Erroneous
Malfunctioning
Degraded
Failed
Fig. 1.6 System states and state transitions in the multilevel model
of dependable computing. Horizontal arrows on the left
denote entry points. Downward (upward) transitions
represent deviations (remedies). Self-loops model tolerance.
One can associate five attributes with each of the transitions in Fig. 1.6. These attributes
are:
For defect-induced failures, the sequence of transitions from defect to failure may be
quite slow, owing to large interlevel hindrances or latencies, or so quick as to defy
detection. Natural interlevel latencies can be increased through tolerance provisions or
reduced for making the deviations more readily observable (because deviations near the
bottom of Fig. 1.6 are more easily detected). The former methods are referred to as
defect tolerance, fault tolerance, error tolerance, malfunction tolerance, degradation
tolerance, and failure tolerance, while the latter are useful for defect testing, fault testing,
error testing, and so on.
We will discuss deviations and remedies in considerable detail in the rest of this book.
Here, we present just a few samples of how subfields of dependable computing can be
characterized according to their relevance to one or more of these transitions and their
attributes.
Possible causes for the sideways transitions in Fig. 1.6 include specification slips and
implementation mistakes (including the use of wrong or unfit building blocks). A
deviation may be caused by wearout, overloading, or external disturbance and is
additionally characterized by its duration as being permanent, intermittent, or transient.
Note, however, that even transient deviations can have permanent consequences; for
example, a fault-induced error may persist after the fault itself has vanished.
We can also classify deviations by their extent (local and global or catastrophic) and
consistency (determinate and indeterminate or Byzantine). A local deviation can quickly
develop into a catastrophic one if not promptly detected by monitors and isolated by
means of firewalls. Transient and indeterminate deviations are notoriously difficult to
handle. To see why, imagine feeling ill, but all the symptoms of your illness disappearing
each time you visit a doctor. Worse yet, imagine the symptoms changing as you go from
one doctor to another for obtaining a second opinion.
Let us use some familiar everyday phenomena to accentuate the states and state
transitions shown in Fig. 1.6 [Parh97]. These examples also show the relevance of such a
multilevel model in a wider context. Example 1.5 is similar to Example 1.4, but it
illustrates the lateral transitions of Fig. 1.6 and multilevel tolerance methods better.
Example 1.6 incorporates both tolerance and avoidance techniques.
Example 1.4: Automobile brake system Relate an automobile brake system to the multilevel
model of Fig. 1.6.
Solution: An automobile brake system with a weak joint in the brake fluid piping (e.g., caused by
a design flaw or a road hazard) is defective. If the weak joint breaks down, the brake system
becomes faulty. A careful (off-line) inspection of the automobile can reveal the fault. However,
the driver does not automatically notice the fault (on-line) while driving. The brake system state
becomes erroneous when the brake fluid level drops dangerously low. Again, the error is not
automatically noticed by the driver, unless a working brake fluid indicator light is present. A
malfunctioning brake system results from the improper state of its hydraulics when the brake pedal
is applied. With no brake fluid indicator light, the driver’s first realization that something is wrong
comes from noticing the degraded performance of the brake system (higher force needed or lower
deceleration). If this degraded performance is insufficient for slowing down or stopping the
vehicle when needed, the brake system has failed to act properly or deliver the expected result.
Example 1.5: Automobile tire Relate the functioning of an automobile tire to the multilevel
model of Fig. 1.6.
Solution: Take an automobile with one tire having a weak spot on its road surface. The defect
may be a result of corrosion or due to improper manufacture and inspection. Use of multiple layers
or steel reinforcement constitutes a possible defect tolerance technique. A hole in the tire is a fault.
It may result from the defect or be caused directly by a nail. Low tire pressure due to the hole, or
directly as a result of improper initialization, is viewed as an error. Automatic steering
compensation leads to error tolerance (at least for a while). A tire that is unfit for use, either due to
its pressure dropping below a threshold or because it was unfit to begin with (e.g., too small),
leads to a malfunction. A vehicle with multiple axles or twin tires can tolerate some tire
malfunctions. In the absence of tolerance provisions, one can still drive an automobile having a
flat or otherwise unfit tire, but the performance (speed, comfort, safety, etc.) will be seriously
degraded. Even a vehicle with three or more axles suffers performance degradation in terms of
load capacity when a tire malfunctions. Finally, as a result of the preceding sequence of events, or
because someone forgot to install a vital subsystem, the entire automobile system may fail.
Example 1.6: Goal-oriented organization Relate the functioning of a goal-oriented human
organization, such as a company or government agency, to the multilevel model of Fig. 1.6.

Solution: Defects in the organization's staff promotion policies may cause improper promotions,
viewed as faults. The ensuing ineptitudes and dissatisfactions are errors in the organization's state.
The organization’s personnel or departments probably start to malfunction as a result of the errors,
in turn causing an overall degradation of performance. The end result may be the organization’s
failure to achieve its goals. Many parallels exist between organizational procedures and
dependable computing terms such as defect removal (external reviews), fault testing (staff
evaluations), fault tolerance (friendly relations, teamwork), error correction (openness, alternate
rewards), and self-repair (mediation, on-the-job training).
Example 1.7: Leak and drainage analogies Discuss the similarities of avoidance and tolerance
methods in the model of Fig. 1.6 to those of stopping leaks in a water system and using drainage to
prevent leaks from causing extensive damage.
Solution: Figure 1.7 depicts a system of six concentric water reservoirs. Pouring water from
above corresponds to defects, faults, and other impairments, depending on the layer(s) being
affected. These impairments can be avoided by controlling the flow of water through valves or
tolerated through the provision of drains of acceptable capacities for the reservoirs. The system
fails if water ever gets to the outermost reservoir. This may happen, for example, by a broken
valve at some layer combined with inadequate drainage at the same and all outer layers. Wall
heights between adjacent reservoirs correspond to natural interlevel latencies in the multilevel
model of Fig. 1.6. Water overflowing from the outermost reservoir into the surrounding area is the
analog of a computer failure adversely affecting the larger physical, corporate, or societal system.
Fig. 1.7 Six concentric reservoirs are analogs of the six nonideal model levels, with
defect being innermost; opening drain valves represents tolerance techniques.
We will describe a number of dependable computer systems after learning about the
methods used in their design and implementation. However, it is desirable to have a brief
preview of the types of system that can benefit from these methods by way of motivation.
As mentioned in Section 1.2, three main categories of dependable computer systems can
be distinguished. These are reviewed in the following paragraphs. Most modern general-
purpose computers also include dependability enhancement features, but these are often
toned-down versions of the methods used in the following system classes to make them
implementable within more stringent cost constraints. For examples of dependability
enhancement methods found in general-purpose computers, see [Siew92], pp. 427-523.
Long-life systems: Long life computer systems are needed in application domains where
maintenance is impossible or costly. In the first category, we have computers on board
spacecraft, particularly those for deep space planetary probes. In a multiyear space probe,
for example, it is imperative that the computer be still fully functional near the end of the
mission, when the greatest payoffs in terms of discoveries and data collection are
expected. The second category includes machines installed in remote, treacherous, or
hostile environments that cannot be reached easily or safely. Requirements in the design
of such systems are similar, except for cost constraints, which are likely to be less
stringent for space applications. Take the case of the Galileo Orbiter which, with help
from a separate Probe module, collected data on Jupiter. The Galileo architecture was
based on the earlier Voyager system that explored the outer planets of the solar system,
with a main difference that the Galileo design used microprocessors extensively (a total
of 27 in the computing and instrumentation systems). Module replication, error coding,
activity monitoring, and shielding were some of the many diverse methods used to ensure
a long life.
operating in lock step, with a hardware voter connected to the system bus performing bit-
by-bit voting to prevent the propagation of erroneous information to the memory. In
SIFT, the off-the-shelf computing elements were connected via custom-designed buses
and bus interfaces. Each computing element had an executive, along with software voting
procedures that could mask out an erroneous result received from a malfunctioning unit.
A task requiring data from another task got them from units that executed the task
(typically by 2-out-of-3 voting), with the actual multiplicity of a task being a function of
its criticality. Subsequently, August Systems chose the SIFT philosophy for its
dependable computers aimed at the industrial control market.
Problems
In September 2018, gas explosions rocked a vast area in northeastern Massachusetts, leading to the loss of
one life, many injuries, and destruction of property. Investigations revealed that just before the explosions,
pipe pressure was 12 times higher than the safe limit. Using Internet sources, write a one-page report on
this incident, focusing on how/why pressure monitors, automatic shut-off mechanisms, and human
oversight failed to prevent the disaster.
2 Dependability Attributes
“The shifts of fortune test the reliability of friends.”
Cicero
In Chapter 1, we briefly touched upon the notions of reliability, safety, and availability,
as different facets of computer system dependability. In this chapter, we provide precise
definitions for these concepts and also introduce other related and distinct aspects of
computer systems dependability, such as testability, maintainability, serviceability,
graceful degradation, robustness, resilience, security, and integrity. Table 2.1 shows the
list of concepts that will be dealt with, along with typical qualitative usages and
associated quantitative aspects or measures. For example, in the case of reliability, we
encounter statements such as “this system is ultrareliable” (qualitative usage) and “the
reliability of this system for a one-year mission is 0.98" (quantitative usage).
We devote the remainder of this section to a brief review of some needed concepts from
probability theory. The review is intended as a refresher. Readers who have difficulties
with these notions should consult one of the introductory probability/statistics textbooks
listed at the end of the chapter; for example, [Papo90].
When we state that the probability of an event E is 0.1, that is,

prob[E] = 0.1
we mean that if the experiment is repeated many times, under the same conditions, the
outcome will be E in roughly 10% of the cases. For example, the statement

prob[a given system fails within 10 weeks] = 0.1

means that out of 1000 systems of the same type, all operating under the same application
and environmental conditions, about 100 will fail within 10 weeks.
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Term             Qualitative Usage(s)             Quantitative Measure(s)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Availability     Highly available,                (Pointwise) availability,
                 high-availability,               interval availability,
                 continuously available           MTBF, MTTR
Integrity        High-integrity,
                 tamper-proof, fool-proof
Performability                                    Performability, MCBF
Resilience       Resilient
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Abbreviations used: MCBF = mean computation between failures; MTBF = mean time between failures;
MTFF = mean time to first failure; MTTF = mean time to failure; MTTR = mean time to repair.
Thus, probability values have physical significance and can be determined via
experimentation. Of course, when the probabilities are very small, it may become
impossible or costly to conduct the requisite experiments. For example, experimental
verification that a particular failure probability is on the order of 10^(–9) would require
observing an impractically large number of systems over a very long time.
When multiple outcomes are of interest, we deal with composite, joint, and conditional
probabilities satisfying relations such as the following, for events A and B:

prob[A or B] = prob[A] + prob[B] – prob[A and B]
prob[A and B] = prob[A] prob[B | A]
prob[A and B] = prob[A] prob[B]   (when A and B are independent)
Suppose that we have observed 20 identical systems under the same conditions and
measured the time to failure for each. The top part of Fig. 2.1 shows the distribution of
the time to failure as a scatter plot. The cumulative distribution function (CDF) for the
time to failure x (life length of the system), defined as the fraction of the systems that
have failed before a given time t, is shown in the middle part of Fig. 2.1 in the form of a
staircase. Of course with a very large sample, we would get a continuous CDF curve that
goes from 0 for t = 0 to 1 for t = ∞. Finally, the probability density function (pdf) is
shown at the bottom of Fig. 2.1. The CDF represents the area under the pdf curve; i.e., the
following relationships hold:

F(t) = ∫0..t f(x) dx    and    f(t) = dF(t)/dt
Based on the preceding, the interpretation of the pdf f(t) in Fig. 2.1 is that the probability
of the system failing in the time interval [t, t + dt] is f(t) dt . So, where the dots in the
scatter plot are closer together, f(t) assumes a larger value.
Once the CDF or pdf for a random variable has been determined experimentally, we
might try to approximate it by a suitable equation and carry out various probabilistic
calculations based on such an analytical model. Examples of commonly used
distributions include uniform, normal, exponential, and binomial (Fig. 2.2).
Fig. 2.1 Scatter plot for the random variable representing the lifetime
of a system, along with its cumulative distribution function
(CDF) and probability density function (pdf).
Fig. 2.2 CDFs F(x) and pdfs f(x) of some commonly used distributions (uniform,
normal, exponential, and binomial).
Given a random variable x with a known probability distribution, its expected value,
denoted as E[x] or Ex, is defined as the probability-weighted average of the values that x
can assume:

Ex = ∑i xi prob[x = xi]   (discrete case)        Ex = ∫ x f(x) dx   (continuous case)

The interpretation of Ex is that it is the mean of the values observed for x over a large set
of experiments. The variance σx², and standard deviation σx, of a random variable x are
indicators of the spread of x values relative to Ex:

σx² = E[(x – Ex)²] = E[x²] – (Ex)²

When dealing with two random variables, the notion of covariance is of some interest:

σx,y = E[(x – Ex)(y – Ey)] = E[xy] – Ex Ey

Given the covariance σx,y, one can define the correlation coefficient

ρx,y = σx,y / (σx σy)

which is the expected value of the product of the centered and normalized random variables
(x – Ex)/σx and (y – Ey)/σy.
When ρx,y = 0, we say that x and y are uncorrelated and we have E[xy] = Ex Ey.
Independent random variables are necessarily uncorrelated, but the converse is not
always true (it is true for the normal distribution, though).
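These notions can be explored numerically; the following Python sketch (an illustration using arbitrary, randomly generated sample data) estimates the expected value, variance, covariance, and correlation coefficient, and exhibits a pair of uncorrelated variables:

```python
# Illustration (not from the book) of E[x], variance, covariance, and the
# correlation coefficient, estimated from randomly generated sample data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, 100_000)            # E[x] ~ 10, sigma_x ~ 2
y = 3.0 * x + rng.normal(0.0, 1.0, 100_000)   # strongly correlated with x
z = rng.normal(5.0, 1.0, 100_000)             # independent of x

print(x.mean(), x.var(), x.std())             # estimates of Ex, sigma_x^2, sigma_x
print(np.cov(x, y)[0, 1])                     # sample covariance of x and y
print(np.corrcoef(x, y)[0, 1])                # correlation coefficient close to +1
print(np.corrcoef(x, z)[0, 1])                # near 0: x and z are uncorrelated
```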
Fig. 2.3 State diagram of a two-state system with no repair: the system starts in the
Up state and moves to the Down state upon failure.
The reliability R(t), defined as the probability that the system remains in the “Good” state
throughout the time interval [0, t], was the only dependability measure of interest to early
designers of dependable computer systems. Such systems were typically used for
spacecraft guidance and control where repairs were impossible and the system was
effectively lost upon the first failure. Thus, the reliability R(tM) for the mission duration
tM accurately reflected the probability of successfully completing the mission, and
achieving acceptable reliabilities for large values of tM (the so-called long-life systems)
became the main challenge. Reliability is related to the CDF of the system lifetime, also
known as unreliability, by:

R(t) = 1 – F(t)
Let z(t) dt be the probability of system failure between times t and t + dt. The function z(t)
is called the hazard function. Then, the reliability R(t) satisfies:

R(t + dt) = R(t) [1 – z(t) dt]

This is simply a statement of the fact that for a system to be functional until time t + dt, it
must have been functional until time t and must not fail in the interval [t, t + dt] of
duration dt. From the preceding equation, we obtain:

dR(t)/dt = –z(t) R(t),   that is,   R(t) = exp(–∫0..t z(x) dx)

For a constant hazard rate λ, that is, for z(t) = λ, we obtain the exponential reliability law
R(t) = e^(–λt), which we took for granted in Section 1.1.
The hazard function z(t), reliability R(t), CDF of failure F(t), and pdf of failure f(t) are
related as follows:

f(t) = dF(t)/dt = –dR(t)/dt    and    z(t) = f(t)/R(t) = f(t)/[1 – F(t)]
We can thus view z(t) as the conditional probability of failure occurring in the time
interval [t, t + dt], given that failure has not occurred up to time t. With a constant hazard
rate λ, or exponential reliability law, failure of the system is independent of its age, that
is, the fact that it has already survived for a long time has no bearing on its failure
probability over the next unit-time interval.
Note that the reliability R(t) is a monotonic (nonincreasing) function of time. Thus, when
the survival of a system for the entire duration of a mission of length tM is at issue,
reliability can be specified by the single numeric index R(tM). Mean time to failure
(MTTF), or mean time to first failure (MTFF), is another single-parameter indicator of
reliability. The mean time to failure for a system is given by:

MTTF = ∫0..∞ t f(t) dt = ∫0..∞ R(t) dt
The first equality above is essentially the definition of expected value of the time to
failure while the second one, indicating that MTTF is equal to the area under the
reliability curve, is easily provable.
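Both the memoryless property of the constant-hazard-rate model and the area interpretation of MTTF are easy to check numerically. The Python sketch below (an illustration; the failure rate is an arbitrary choice) does so for the exponential reliability law:

```python
# Sketch verifying, for the exponential reliability law R(t) = e^(-lam*t),
# that (1) the area under R(t) equals 1/lam (the MTTF) and (2) survival over
# the next interval does not depend on the system's age (memorylessness).
import math

LAM = 2.0                                  # assumed failure rate (per year)
R = lambda t: math.exp(-LAM * t)

dt = 1e-4                                  # crude rectangle-rule integration
mttf = sum(R(i * dt) * dt for i in range(int(10.0 / dt)))
print(mttf, "vs 1/lam =", 1 / LAM)         # both are about 0.5 years

s = 0.25                                   # look-ahead interval (years)
for age in (0.0, 1.0, 5.0):
    print(R(age + s) / R(age), "=", R(s))  # conditional survival equals R(s)
```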
In addition to the constant hazard rate z(t) = λ, which leads to the exponential reliability
law, the distributions shown in Table 2.2 have been suggested for reliability modeling.
The Weibull distribution has two parameters: the shape parameter α and the scale
parameter λ. Both the exponential and Rayleigh distributions are special cases of the Weibull
distribution, corresponding to α = 1 and α = 2, respectively. Similarly, the Gamma
distribution covers the exponential and Erlang distributions as special cases (b = 1 and b
an integer, respectively). The parameters of the various reliability models in Table 2.2
can be derived based on field failure data.
The Gamma function Γ(α), used in the formulas in Table 2.2, is defined as:

Γ(α) = ∫0..∞ x^(α–1) e^(–x) dx    (2.2.Gam1)

Integration by parts yields Γ(α + 1) = α Γ(α), with Γ(n + 1) = n! for any positive integer n.
For this reason, the Γ function is called a generalized factorial. This last equation allows us
to compute Γ(α) recursively, based on the values of Γ(α) for 1 ≤ α < 2.
The discrete versions of the exponential, Weibull, and normal distributions are known as
the geometric, discrete Weibull, and binomial distributions, respectively. The geometric
distribution is obtained by replacing e^(–λ) by the discrete probability q of survival over one
time step and time t by the number k of time steps, leading to the reliability equation:

R(k) = q^k
In the case of the discrete Weibull distribution, e^(–λ) is replaced with q and t with k,
leading to:

R(k) = q^(k^α)
Closed-form formulas are generally hard to obtain for the parameters of interest with the
discrete Weibull distribution. Finally, the binomial distribution is characterized by the
reliability equation:
R(k) = 1 – ∑j=0..k C(n, j) p^j q^(n–j),   0 ≤ k ≤ n    (2.2.Bino)
When n is large, the binomial distribution can be approximated by the normal distribution
with parameters μ = np and σ = √(npq).
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Distribution   z(t)   f(t)                               R(t) = 1 – F(t)                     MTTF
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Exponential    λ      λ e^(–λt)                          e^(–λt)                             1/λ
Erlang                λ (λt)^(k–1) e^(–λt) / (k–1)!      e^(–λt) ∑i=0..k–1 (λt)^i / i!       k/λ
Gamma                 λ (λt)^(b–1) e^(–λt) / Γ(b)                                            b/λ
Normal*               [1/(σ√(2π))] e^(–(t–μ)²/(2σ²))
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
(*) Reliability and MTTF formulas for the normal distribution are quite involved. One can use numerical
tables listing the values of the integral (1/√(2π)) ∫ e^(–x²/2) dx to evaluate R((t – μ)/σ).
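As a sanity check on the tabulated formulas, the exponential and Erlang entries can be evaluated numerically; in the Python sketch below (an illustration), the failure rate and the number of Erlang stages are arbitrary choices:

```python
# Illustrative check (not from the book) of the exponential and Erlang rows of
# Table 2.2: evaluate R(t) and confirm that the area under the Erlang
# reliability curve approaches MTTF = k/lam.
import math

def r_exponential(t, lam):
    """R(t) = e^(-lam*t)."""
    return math.exp(-lam * t)

def r_erlang(t, lam, k):
    """R(t) = e^(-lam*t) * sum_{i=0}^{k-1} (lam*t)^i / i!."""
    return math.exp(-lam * t) * sum((lam * t) ** i / math.factorial(i) for i in range(k))

LAM, K = 1e-4, 3                      # assumed failure rate (per hour) and stage count
for t in (1_000, 10_000, 50_000):
    print(t, r_exponential(t, LAM), r_erlang(t, LAM, K))

dt, horizon = 10.0, 300_000.0         # rectangle-rule area under R(t)
mttf = sum(r_erlang(i * dt, LAM, K) * dt for i in range(int(horizon / dt)))
print("Erlang MTTF ~", mttf, "vs k/lam =", K / LAM)
```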
Fig. 2.4 Reliability curves R1(t) and R2(t) of two example systems versus time t,
marking the mission time tM, the times T1(rG) and T2(rG) at which each
reliability falls to the goal rG, and the two MTTFs, with T1(rG) < tM < T2(rG)
but MTTF2 < MTTF1.
Referring to Fig. 2.4, we observe that the time t1 to the first failure is a random variable
whose expected value is the system’s MTTF. Even though it is true that a higher MTTF
implies a higher reliability in the case of nonredundant systems, the use of MTTF is
misleading when redundancy is applied. For example, in Fig. 2.4, System 1 with
reliability function R1(t) is much less reliable than System 2 with reliability function R2(t)
for the mission duration tM, but it has a longer MTTF. Since usually tM << MTTF, the
shape of the reliability curve is much more important than the numerical value of MTTF.
The reliability difference R2 – R1 and the reliability gain R2/R1 are natural measures for
comparing two systems having reliabilities R1 and R2. In order to facilitate the
comparison of highly reliable systems (with reliability values very close to 1), several
comparative measures have been suggested, including the reliability improvement factor of
System 2 over System 1, for a given mission time tM,

RIF2/1(tM) = [1 – R1(tM)] / [1 – R2(tM)]

the reliability improvement index of System 2 over System 1, for the mission time tM,

RII2/1(tM) = log R1(tM) / log R2(tM)

and the mission time extension MTE2/1(rG) = T2(rG) – T1(rG) and mission time improvement
factor MTIF2/1(rG) = T2(rG) / T1(rG) for a given reliability goal rG,
where the (mission) time function T is the inverse of the reliability function R. Thus,
R(T(r)) = r and T(R(t)) = t.
Example 2.1: Comparing system reliabilities Systems 1 and 2 have constant failure rates of λ1
= 1/yr and λ2 = 2/yr. Quantify the reliability advantage of System 1 over System 2 for a one-
month period.
Solution: The reliabilities of the two systems for a one-month period are R1(1/12) = e^(–λ1/12) =
0.9200 and R2(1/12) = e^(–λ2/12) = 0.8465. The reliability advantage of System 1 over System 2
can be quantified in the following ways:
R1(1/12) – R2(1/12) = 0.9200 – 0.8465 = 0.0735
R1(1/12) / R2(1/12) = 0.9200 / 0.8465 = 1.0868
RIF1/2(1/12) = (1 – 0.8465) / (1 – 0.9200) = 1.9188
RII1/2(1/12) = log 0.8465 / log 0.9200 = 1.9986
For a reliability goal of 0.9, the mission time extension of System 1 over System 2 is derived as
MTE1/2(0.9) = (–ln 0.9)(1/λ1 – 1/λ2) = 0.0527 yr = 19.2 days
while the mission time improvement factor of System 1 over System 2 is:
MTIF1/2(0.9) = λ2/λ1 = 2.0
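The quantities of Example 2.1 can be reproduced with a few lines of Python; the sketch below assumes the exponential reliability law, and small discrepancies in the last digit relative to the example arise from rounding of the intermediate reliabilities.

```python
# Illustrative check of Example 2.1 (exponential reliability laws assumed).
from math import exp, log

lam1, lam2, tM, rG = 1.0, 2.0, 1/12, 0.9    # failure rates (/yr), mission time, goal

R1, R2 = exp(-lam1 * tM), exp(-lam2 * tM)   # about 0.9200 and 0.8465
diff = R1 - R2                              # reliability difference
gain = R1 / R2                              # reliability gain
RIF  = (1 - R2) / (1 - R1)                  # reliability improvement factor
RII  = log(R2) / log(R1)                    # reliability improvement index
MTE  = (-log(rG)) * (1/lam1 - 1/lam2)       # mission time extension (yr)
MTIF = lam2 / lam1                          # mission time improvement factor
print(round(diff, 4), round(gain, 4), round(RIF, 4), round(RII, 4),
      round(MTE, 4), MTIF)
```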
Example 2.2: Analog of Amdahl’s law for reliability Amdahl’s law states that if in a unit-time
computation a fraction f doesn’t change and the remaining fraction 1 – f is speeded up to run p
times as fast, the overall speedup will be s = 1 / (f + (1 – f)/p). Show that a similar formula applies
to the reliability improvement index after improving the failure rate for some parts of a system.
Solution: Consider a system with two segments, having failure rates φ and λ – φ, respectively.
Upon improving the failure rate of the second segment to (λ – φ)/p, we have RII = log Roriginal / log
Rimproved = λ / [φ + (λ – φ)/p]. Letting φ/λ = f, we obtain RII = 1 / (f + (1 – f)/p).
As mentioned earlier, at first reliability was the only measure of interest in evaluating
computer system dependability. The advent of time-sharing systems brought with it a
concern for the continuity of computer service (the so-called high-availability systems)
and thus minimizing the “down time” became a prime concern. Interval availability, or
simply availability, A(t), defined as the fraction of time that the system is operational
during the interval [0, t], is the natural dependability measure in this respect. The limit A
of A(t) as t tends to infinity, if it exists, is known as the steady-state availability.
Availability is a function not only of how rarely a system fails but also of how quickly it
can be repaired upon failure. Thus, the time to repair is important and maintainability is
used as a qualitative descriptor for ease of repair (i.e., faster or less expensive
maintenance procedures).
The probability a(t) that a system is available at time t is known as its pointwise
availability (which is the same as reliability when there is no repair). To take repair into
consideration, one can define a repair rate function zr(t), which results in the probability
of unsuccessful repair up to time t having an equation similar to that for reliability, with various
distributions possible. For example, one can have exponentially distributed repair times
with repair rate μ, in which case the probability that a repair begun at time 0 is still
incomplete at time t equals e^(–μt).
Consider a two-state repairable system, as shown in Fig. 2.5. The system begins
operation in the “Up” state, but then moves back and forth between the two states due to
failures and successful repairs. The duration of the system staying in the “Up” state is a
random variable corresponding to the time to first failure, while that of the “Down” state
is the time to successful repair.
Fig. 2.5 Two-state model of a repairable system: starting in the Up state, the system
alternates between the Up and Down states via failure and repair transitions.
Fig. 2.6 Up and down times of an example repairable two-state system, with failures at
times t1, t2, … and repair completions at times t'1, t'2, ….
Fig. 2.6 depicts the variations of the state of an example repairable two-state system with
time. Until time t1, the example system in Fig. 2.6 is continuously available and thus has
an interval availability of 1. After the first failure at time t1, availability drops below 1
and continues to decrease until the completion of the first repair at time t'1. The repair
time is thus t'1 – t1. The second failure occurs at time t2, yielding a time between failures
of t2 – t1. Over a period of time, the expected values of t'i – ti and ti+1 – ti are known as the
mean time to repair (MTTR) and mean time between failures (MTBF), respectively.
In the special case of zr(t) = μ, i.e., a constant repair rate, the steady-state availability for a
system with a constant failure rate λ is:
A = μ/(λ + μ) = MTTF/MTBF    (2.3.Av2)
where MTBF = MTTF + MTTR = 1/λ + 1/μ is the mean time between failures.
Pointwise availability a(t) and interval availability A(t) are related as follows:
A(t) = (1/t) ∫0..t a(x) dx    (2.3.Av4)
Both pointwise and interval availability are relatively difficult to derive in general.
However, for most practical purposes, the steady-state availability A can be used in lieu
of pointwise availability. This is because if a(t) can be assumed to be constant, then by the
preceding equation A(t) equals that same constant, which in the limit of large t is the
steady-state availability A. As an example, if a system is available 99% of the time in the long
run, then we can assume that at any given time instant, it will be available with
probability 0.99.
key attributes of such a system. He adds that a robust computer system “retains its ability
to deliver service in conditions which are beyond its normal domain of operation,
whether due to harsh treatment, or unreasonable service requests, or misoperation, or the
impact of faults, or lack of maintenance, etc.”
Example 2.3: Availability formula  Consider exponential failure and repair laws, with failure rate λ and
repair rate μ. In the time interval [0, t], we can expect λt failures, which take λt/μ time units to
repair, on the average. Thus, for large t, the system will be under repair for λt/μ time units out of t
time units, yielding the availability 1 – λ/μ. Is there anything wrong with this argument, given that
the availability was previously derived to be A = 1 – λ/(λ + μ)?
Solution: The number of expected failures is actually slightly less than λt, because the system is
operational only in a fraction A of the time t, where A is the availability. Correcting the argument, we
note that λAt failures are expected over the available time At, yielding an expected repair time of
λAt/μ time units. Thus, the availability A satisfies the equation A = 1 – λA/μ, where the last term
is the fraction of the time t that is spent on repair. This yields A = 1/(1 + λ/μ) = μ/(λ + μ).
Continual increase in system complexities and the difficulties in testing for initial system
verification and subsequent preventive and diagnostic maintenance have led to concern
for testability. Since any preventive or diagnostic maintenance procedure is based on
testing, maintainability and testability are closely related. Testability is often quantified
by the complementary notions of controllability and observability. In the context of
digital circuits, controllability is an indicator of the ease with which the various internal
points of the circuit can be driven to desired states by supplying values at the primary
inputs. Similarly, observability indicates the ease with which internal logic values can be
observed by monitoring the primary outputs of the circuit [Gold79].
In practice, computer systems have more than two states. There may be multiple
operational states where the system exhibits different computational capabilities.
Availability analysis for gracefully degrading systems will be discussed along with
performability in Section 2.4. It is noteworthy that performability and similar terms, some
of which were listed in the preceding paragraph, are sometimes collectively referred to as
the “-ilities”. Several other informally defined “-ilities” can be found in the literature
(survivability, reconfigurability, diagnosability, and so on), although none has found
widespread acceptance.
Widespread use of multiprocessors and gracefully degrading systems, which did not obey
the all-or-none mode of operation implicit in conventional reliability and availability
models, caused some difficulties in dependability evaluation. Consequently,
performability was suggested as a relevant measure. Again the desirability of a simple
numeric measure led to the suggestion of mean computation before failure (MCBF),
although the use of this measure did not become as widespread as the MTTF and MTBF
of the earlier era. These concepts will be discussed in the remainder of this section.
Consider a dual-processor computer system with two performance levels; both processors
working (worth = 2) and only one processor working (worth = 1), ignoring all other
resources. If the processors fail one at a time, and are repaired one at a time, then the
system’s state diagram is as shown in Fig. 2.7. In Chapter 4, we will show how the
steady-state probabilities for the system being in each of its states can be determined. If
these probabilities are pUp2, pUp1, and 1 – pUp2 – pUp1, then, the performability of the
system is:
P = 2pUp2 + pUp1
As a numerical example, pUp2 = 0.92 and pUp1 = 0.06 lead to P = 1.90. In other words,
the performance level of the system is equivalent to 1.9 processors on the average, with
the ideal performability being 2.
When processors fail and are repaired independently, and if Processor i has a steady-state
availability Ai, then performability of the system above becomes:
P = A1 + A2
More generally, i.e., when the resources are not identical, each availability Ai of a
resource must be multiplied by the worth of that resource. Note that the independent
repair assumption implies that maintenance resources are not limited. If there is only one
repair person, say, then this assumption may not be valid.
Example 2.4: Performability improvement factor With a low rate of failure, performability
will be close to its ideal value. For this reason, a performability improvement factor, PIF, can be
defined in a manner similar to RIF, given in equation 2.2.RIF. For the two-processor system
analyzed in the preceding paragraphs, determine the PIF relative to a fail-hard system that would
go down when either processor fails.
Solution: The performability of the fail-hard system is readily seen to be 2pUp2 = 2 × 0.92 = 1.84.
Given the ideal performability of 2, the performabilities of our fail-hard and fail-soft systems in
relative terms are 1.84/2 = 0.92 and 1.90/2 = 0.95, respectively. Thus:
PIFfail-soft/fail-hard = (1 – 0.92) / (1 – 0.95) = 1.6
Note that like RIF, PIF is useful for comparing different designs. The result 1.6 does not have any
physical significance. The performability ratio in this example is 0.95 / 0.92 = 1.033, which is a
true indicator of the increase in expected performance.
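A minimal Python sketch of the performability computation and the PIF comparison of Example 2.4 follows; the state probabilities are those quoted in the text, while the helper function and variable names are illustrative.

```python
# Illustrative sketch: performability as a probability-weighted sum of
# per-state worths, plus the PIF comparison worked out in Example 2.4.
def performability(worths, probs):
    """Expected worth: sum of worth_i * p_i over the system states."""
    return sum(w * p for w, p in zip(worths, probs))

p_up2, p_up1 = 0.92, 0.06                  # steady-state probabilities from the text
p_down = 1 - p_up2 - p_up1

P_fail_soft = performability([2, 1, 0], [p_up2, p_up1, p_down])   # 1.90
P_fail_hard = performability([2, 0, 0], [p_up2, p_up1, p_down])   # 1.84

ideal = 2
PIF = (1 - P_fail_hard / ideal) / (1 - P_fail_soft / ideal)        # 1.6
print(P_fail_soft, P_fail_hard, round(PIF, 2))
```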
Figure 2.8 depicts the variations of the state of an example three-state repairable system
with time. Up to time t1, the example system in Fig. 2.8 is continuously available with
maximal performance. After the partial failure at time t1, performance drops.
Subsequently, at time t2, the system fails completely. After partial repair is completed at
time t'2, system performance goes up and is eventually restored to its full level at time t'1.
The full repair time is, therefore, t'1 – t1, with partial repair taking t'2 – t2 time.
Fig. 2.8 System up, partially up, and down times contributing to its performability.
The shaded area in Fig. 2.8 represents the computational power that is available prior to
the first total failure. If this power is utilized in its entirety, then the shaded area
represents the amount of computation that is performed before total system failure. The
expected value of this parameter is known as mean computation before failure (MCBF).
Since no computation is performed in the totally failed state, MCBF can be viewed as
representing mean computation between failures. Thus, MCBF is to performability as
MTBF is to reliability.
Note that performability generalizes both reliability and performance. For a system that
never fails, performability is the same as performance. For a system that has a single
performance level (all or none), performability is synonymous with reliability; i.e.,
performance level is 100% iff the system has not failed.
If availability and MTTR are known, MTTF can be found from equation 2.3.Av3.
The measures discussed thus far deal with the operation and performance of the computer
system (and frequently only with the hardware) rather than with the integrity and success
of the computations performed. Neither availability nor performability distinguishes
between a system that experiences 30 two-minute outages per week and one that fails
once per week but takes an hour to repair. Increasing dependence on transaction
processing systems and safety-critical applications of computers has led to new concerns
such as integrity, safety, security, and privacy. The first two of these concerns, and the
corresponding dependability measures, are treated in this section. Security and privacy
will be discussed in Section 2.6.
The two attributes of integrity and safety are similar; integrity is inward-looking and
relates to the capacity of a system to protect its computational resources and data under
adverse circumstances, while safety is outward-looking and pertains to consequences of
incorrect actions for the system environment and users. One can examine system integrity
by assigning the potential causes and consequences of failures to a number or a
continuum of “severity” classes. If computational resources and data are not corrupted
due to low-severity causes, then the system fares well on integrity. If the failure of a
system seldom has severe external consequences, then the system is high on safety.
Safety, on the other hand, is almost always quantified. Leveson [Leve86] defines safety
as “the probability that conditions [leading] to mishaps (hazards) do not occur, whether
or not the intended function is performed”. Central to the quantification of safety is the
notion of risk. A standard dictionary defines risk as “the possibility of loss or injury”.
Reliability engineers use probability instead of possibility. The expected loss or risk
associated with a particular failure is a function of both its severity and its probability.
More precisely, the risk associated with a class of mishaps is the product of the frequency (or
probability) of such mishaps and the magnitude (severity) of their consequences, summed over
all mishap classes of interest.
For example, the approximate individual risk (early fatality probability per year)
associated with motor vehicle accidents is 3 × 10⁻⁴, which is about 10 times the risk of
drowning, 100 times the risk of railway accidents, and 1000 times the risk of being killed
by a hurricane ([Henl81], p. 11). Individual risks below 10⁻⁶ per year are generally
deemed acceptable. Computer scientists and engineers have so far only scratched the
surface of safety evaluation techniques and much more work in this area can be expected
in the coming years.
To put our discussion of safety on the same footing as those of reliability, availability,
and performability, we envisage the three-state system model of Fig. 2.9. Certain adverse
conditions may cause the system to fail in an unsafe manner; the probability of these
events must be minimized. Safe failures, on the other hand, are acceptable. The provision
of a separate “Safe Down” state, rather than merging it with the “Up” state, is useful in
that the two states may be given different weights in deriving numerical figures of merit
of various kinds. Furthermore, if we add transitions corresponding to repair from each
“Down” state to the “Up” state, we can quantify not only the risk of unsafe operation but
also the chances that the backup (manual) system may be stretched beyond its capacity
owing to overly long repair time.
Consider for example the more elaborate state model depicted in Fig. 2.10. Here, we
model the fact that a safe failure condition may turn into an unsafe one if the situation is
not handled properly and the possibility that the system can recover from a safe failure
through repair.
Fig. 2.10 Three-state repairable system with safe and unsafe failed states: from the Up
(start) state, failures lead to the Safe Down or Unsafe Down state, repair returns the
system from Safe Down to Up, and mishandling can turn a Safe Down condition into an
Unsafe Down one.
In both Figs. 2.9 and 2.10, one may use multiple unsafe failed states, one for each level of
severity, say. In this way, the probabilities of ending up in the various unsafe states can
be used for risk assessment using Eqn. (2.5.Risk2). State-space modeling techniques are
discussed in Chapter 4.
Despite several decades of research on privacy and security in computing systems, these
two aspects have resisted quantitative assessment. In theory, security can be quantified in
the same manner as safety, that is, by considering frequency or probability of a security
breach as one factor, and magnitude or severity of the event as another. However,
quantifying both factors is substantially more difficult in the case of security, compared
with safety. One aspect of the difficulty pertains to the fact that security breaches are
often not accidental, so they are ill-suited to a probabilistic treatment.
We end this section by noting that system security is orthogonal to both reliability and
safety. A system that automatically locks up when a security breach is suspected may be
deemed highly secure but it may not be very reliable or safe in the traditional sense.
Problems
3 Combinational Modeling
“Torture numbers, and they’ll confess to anything.”
Gregg Easterbrook
Given a set of components, subsystems, or other parts that comprise a system, one can
determine the probability of the system being operational by enumerating all possible
combinations of good and bad parts that result (or do not result) in system failure. The
overall probability of these subcases is the system unreliability (reliability). This method
works well when the number of parts is fairly small and their interactions and mutual
influences are well understood.
Solution: The common bus is a critical system part. The system fails if the bus, both processors,
or all four memory modules malfunction. This is the same as saying that the system functions
properly if the bus, one of the two processors, and one of the four memory modules work. This has
the probability R = rb[1 – (1 – rp)²][1 – (1 – rm)⁴].
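The reliability expression just derived is easy to evaluate programmatically; the Python sketch below is illustrative, with assumed values for rb, rp, and rm.

```python
# Illustrative evaluation of R = rb * [1 - (1 - rp)^2] * [1 - (1 - rm)^4];
# the numeric values are assumptions, not taken from the text.
def system_reliability(rb, rp, rm):
    proc_ok = 1 - (1 - rp) ** 2      # at least one of two processors works
    mem_ok  = 1 - (1 - rm) ** 4      # at least one of four memory modules works
    return rb * proc_ok * mem_ok

print(system_reliability(rb=0.999, rp=0.98, rm=0.95))
```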
Example 3.1 was simple enough to allow us to write the pertinent reliability equation
directly. We now illustrate how case analysis can be used to reduce the complexity of
reliability evaluation using the same simple example.
First consider two cases: the bus system works (probability rb) or it does not work
(probability 1 – rb). We begin constructing a tree, as in Fig. 3.1, where the root labeled
“No information” has two children, labeled “Bus okay” and “Bus bad.” If we are
interested in enumerating the operational system configurations, we can ignore the
subtree corresponding to “Bus bad.” We next focus on the processors and form three
subtrees for the “Bus okay” branch, labeling them “Both processors okay” (probability
rp²), “One processor okay” (probability 2rp(1 – rp)), and “Both processors bad”
(probability (1 – rp)²). We can merge the first two of these branches, assigning the resulting
“At least one processor okay” branch the probability 2rp – rp², because the two
are identical with respect to the proper functioning of the system. Continuing in this
manner, we arrive at all possible leaf nodes associated with working configurations.
Adding the probabilities of these leaf nodes yields the overall system reliability. We can
stop expanding each branch as soon as the reliability equation for the corresponding state
can be written directly.
Fig. 3.1 Case-analysis tree for Example 3.1: the root (“No information”) branches on the
bus with probabilities rb and 1 – rb, then on the processors with probabilities 1 – (1 – rp)²
and (1 – rp)², and finally on the memory modules with probabilities 1 – (1 – rm)⁴ and (1 – rm)⁴.
We further illustrate the method of case analysis with two additional examples.
Example 3.2: Data availability modeling with home and mirror sites Use case analysis to
derive the availability of data for the system in Example 1.1.
Solution: The required case-analysis tree is depicted in Fig. 3.2, leading to the availability
equation A = aSaL + (1 – aSaL)aSaL = 2aSaL – (aSaL)².
Example 3.3: Data availability modeling with triplication Use case analysis to derive the
availability of data for the system in Example 1.2.
Solution: The required case-analysis tree is depicted in Fig. 3.3, leading to the availability
equation A = aSaL + (1 – aSaL)aSaL + (1 – aSaL)²aSaL = 3aSaL – 3(aSaL)² + (aSaL)³.
Fig. 3.2 Case analysis used to derive the data availability equation for Example 1.1
(home and mirror sites).
Fig. 3.3 Case analysis used to derive the data availability equation for Example 1.2
(home site and two backups).
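For quick numeric checks of Examples 3.2 and 3.3, the following Python sketch (with assumed values of aS and aL) evaluates the two availability equations.

```python
# Illustrative check of the data availability equations (aS, aL values assumed).
def avail_two_sites(aS, aL):
    """Home and mirror sites: A = 2x - x^2, with x = aS*aL."""
    x = aS * aL
    return 2 * x - x ** 2

def avail_three_sites(aS, aL):
    """Home site and two backups: A = 3x - 3x^2 + x^3, with x = aS*aL."""
    x = aS * aL
    return 3 * x - 3 * x ** 2 + x ** 3

print(avail_two_sites(0.99, 0.95), avail_three_sites(0.99, 0.95))
```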
A series system of n components is one in which the proper operation of each of the n
components is required for the system to perform its function. Such a system is
represented as in Fig. 3.4a, with each rectangular block denoting one of the subsystems.
Fig. 3.4 Series system block diagram and example with valves that
are prone to stuck-on-shut failures.
Given the reliability Ri for the ith component of a series system, the overall reliability of
the system, assuming independence of failures in the subsystems, is given by:
R = ∏i=1..n Ri    (3.2.ser1)
If the ith component in a series system has a constant hazard rate λi, thus having
exponential reliability, the overall system will have exponential reliability with the hazard
rate λ = ∑i λi. This is a direct consequence of equation (3.2.ser1). With repairable components,
having hazard rate λi and repair rate μi, the availability of a series system of n
components is related to the individual module availabilities Ai = μi / (λi + μi) by:
A = ∏i=1..n Ai    (3.2.ser2)
A parallel system of n components is one in which the proper operation of a single one of
the n components is sufficient for the system to perform its function. Such a system is
represented as in Fig. 3.5a, with each rectangular block denoting one of the subsystems.
Fig. 3.5 Parallel system block diagram and example with valves that
are prone to stuck-on-shut failures.
Given the reliability Ri for the ith component of a parallel system, the overall reliability
of the system, assuming independence of failures in the subsystems, is given by:
R = 1 – ∏i=1..n (1 – Ri)    (3.2.par1)
For example, placing a valve on each of four branches of a pipe (Fig. 3.5b), with the
component valves being prone to stuck-on-shut failures, yields a parallel system; we can
still control access to the reservoir even if three of the valves fail in the stuck-on-shut
mode. Again, the term “parallel” in parallel system does not imply that the subsystems
are physically connected in parallel in the mechanical or electrical sense.
If the components in a parallel system are repairable, with the ith component having a
hazard rate λi and repair rate μi, the availability of a parallel system of n components is
related to the individual module availabilities Ai = μi / (λi + μi) by:
A = 1 – ∏i=1..n (1 – Ai)    (3.2.par2)
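The series and parallel formulas (3.2.ser1)-(3.2.par2) translate directly into code; the Python sketch below is illustrative, with assumed module reliabilities and failure/repair rates.

```python
# Minimal sketch of the series/parallel equations; module data are assumed.
from math import prod

def series_reliability(R):            # R = prod(R_i)
    return prod(R)

def parallel_reliability(R):          # R = 1 - prod(1 - R_i)
    return 1 - prod(1 - r for r in R)

def availability(lam, mu):            # A_i = mu_i / (lam_i + mu_i)
    return mu / (lam + mu)

mods = [(1e-4, 1e-2), (2e-4, 1e-2), (5e-5, 2e-2)]   # (failure, repair) rates
A = [availability(l, m) for l, m in mods]
print(series_reliability([0.99, 0.98, 0.95]), parallel_reliability([0.9, 0.9, 0.9]))
print(series_reliability(A), parallel_reliability(A))   # series / parallel availability
```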
Reliability and availability equations for series and parallel systems are quite simple. This
does not mean, however, that proper application of these equations does not require
careful thinking. The following example illustrates that care must be exercised in
performing even simple dependability analyses.
Example 3.4: A two-way parallel system  In a passenger plane, the failure rate of the cabin-
pressurizing system is 10⁻⁵/hr and the failure rate of the oxygen-mask deployment system is also
10⁻⁵/hr. What is the probability of loss of life due to both systems failing during a 10-hour flight?
Possible solution 1: Given the assumption of failure independence, both systems fail together at a
rate of 10⁻¹⁰/hr. Thus, the fatality probability for a 10-hour flight is 10⁻¹⁰ × 10 = 10⁻⁹. Fatality odds of
1 in a billion or less are generally deemed acceptable in safety-critical systems.
Possible solution 2: The probability of the cabin-pressurizing system failing during a 10-hour
flight is 10⁻⁴. The probability of the oxygen-mask system failing during the flight is also 10⁻⁴.
Given the assumption of independence, the probability of both systems failing together during the
flight is 10⁻⁸. This latter probability is higher than acceptable norms for safety-critical systems.
Analysis: So, which of the two solutions is correct? Neither one. Here’s why. When we multiply
the two per-hour failure rates and then take the flight duration into account, we are assuming that
only the failure of the two systems within the same hour is catastrophic. This produces the
optimistic reliability estimate 1 – 10⁻⁹. When we multiply the two flight-long failure probabilities, we are
assuming that the failures of both systems would be catastrophic, no matter when each occurs
during the flight. This produces the pessimistic reliability estimate 1 – 10⁻⁸. The reader should be
able to supply examples of when the two systems fail at different times during a flight, without
leading to a catastrophe.
The simple reliability equation (3.2.par1) for a parallel system is based on the assumption
that all n subsystems contribute to the proper system functioning at the same time, and
each is capable of performing the entire job, so that the failure of up to n – 1 of the n
subsystems will be noncritical. This happens, for example, if we have n concurrently
operating ventilation systems in a lab, each with its own power supply, in order to ensure
the proper removal of hazardous fumes. If the capacity of one of the subsystems is
inadequate and we need at least two of them to perform the job, we no longer have a
parallel system, but a 2-out-of-n system (see Section 3.3). Similarly, if only one of the
subsystems is active at any given time, with the others activated in turn upon detection of
a failure, then equation (3.2.par1) is valid only if failure detection is perfect and
instantaneous, and the activation of spares is always successful.
The simplest way to account for imperfect failure detection and activation of spares in a
parallel system is via the use of a coverage parameter c, with c < 1. The coverage
parameter is defined as the probability that the switchover from an operating module to a
spare module goes without a hitch. Thus, in a two-unit parallel system in which the
primary module has reliability r1 and the spare has reliability r2, the system reliability is:
R = r1 + (1 – r1)cr2 (3.2.cov1)
Equation (3.2.cov1) essentially tells us that the two-way parallel system with imperfect
coverage will work if unit 1 works, or if unit 1 fails, but the switchover is successful and
unit 2 works. With modules having identical reliability r, equation (3.2.cov1) becomes:
R = r[1 + c(1 – r)] = r [1 – c²(1 – r)²] / [1 – c(1 – r)]    (3.2.cov2)
More generally, for an n-way parallel system with identical module reliability r and coverage c
for each successive switchover, summing the geometric series r + cr(1 – r) + c²r(1 – r)² + … gives:
R = r [1 – cⁿ(1 – r)ⁿ] / [1 – c(1 – r)]    (3.2.cov3)
Note that, in practice, the coverage factor is not a constant, but deteriorates with more
spares. In this case the depiction of the effect of coverage in Fig. 3.6 may be viewed as
optimistic. So, adding a large number of spares is not only unhelpful (as suggested by the
saturation effect in Fig. 3.6), but it may actually be detrimental to system reliability.
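A small Python sketch, based on the geometric-series form of eqn. (3.2.cov3) as reconstructed above and with assumed values of r and c, makes the saturation effect easy to observe numerically.

```python
# Illustrative check of n-way parallel reliability with imperfect coverage c
# (geometric-series form); the values of r and c are assumptions.
def parallel_with_coverage(r, c, n):
    """R = r * [1 - (c*(1-r))**n] / [1 - c*(1-r)]."""
    x = c * (1 - r)
    return r * (1 - x ** n) / (1 - x)

r, c = 0.95, 0.9
for n in (1, 2, 3, 5, 10):
    print(n, round(parallel_with_coverage(r, c, n), 6))
# Reliability saturates near r / (1 - c*(1-r)) as n grows, well short of 1.
```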
In a k-out-of-n system, there are n modules, the proper functioning of any k of which is
sufficient for system operation. Note that both series (n-out-of-n) and parallel (1-out-of-n)
systems are special cases of this more general class. For example, if you have one spare
tire in the trunk of your car, then, ignoring the possible difference between a spare tire
and a regular tire, your tire system is a 4-out-of-5 system. You can continue driving your
car as long as at most one tire malfunctions, assuming successful switchover from a
regular tire to the spare tire. If you carry two spare tires in your trunk, then your tire
system may be described as a 4-out-of-6 system.
One of the most commonly used systems of this type is a 2-out-of-3 system, depicted in
Fig. 3.7. This redundancy scheme is known as triple modular redundancy (TMR) and
relies on a decision circuit, or voter, to deduce the correct output based on the outputs it
receives from three concurrently operating modules.
Assuming a perfect (always working) voting unit, the reliability of a TMR system with
module reliabilities r1, r2, and r3 is:
R = r1r2 + r2r3 + r3r1 – 2r1r2r3    (3.2.TMR1)
Assuming exponential reliability r = e^(–λt) for each of the three modules and taking the
voter to be perfectly reliable, the MTTF parameter of a TMR system can be obtained
based on eqns. (2.2.MTTF) and (3.2.TMR2) with rv = 1:
MTTFTMR = ∫0..∞ R(t) dt = ∫0..∞ [3e^(–2λt) – 2e^(–3λt)] dt = 3/(2λ) – 2/(3λ) = 5/(6λ)    (3.2.MTTF)
Note that even though the reliability of a TMR system is greater than that of a single
module, its MTTF deteriorates from 1/λ to 5/(6λ).
The reliability equation for a k-out-of-n system with an imperfect voting unit and
identical modules is:
R = rv ∑j=k..n C(n, j) r^j (1 – r)^(n–j)    (3.2.kofn)
In the special case of odd n with k = (n + 1)/2, the k-out-of-n scheme uses majority voting
and is sometimes referred to as n-modular redundancy (NMR). It is readily seen from
equations (3.2.TMR2) and (3.2.kofn) that TMR and NMR methods lead to significant
reliability improvement only if the voting unit is much more reliable than the modules
performing the system functions.
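The following Python sketch (with assumed values for the module failure rate and mission time) evaluates eqn. (3.2.kofn) for a TMR configuration and checks the MTTF result 5/(6λ) by crude numerical integration.

```python
# Illustrative k-out-of-n reliability and TMR MTTF check; r, rv, lam, t assumed.
from math import comb, exp

def k_of_n_reliability(k, n, r, rv=1.0):
    """R = rv * sum_{j=k}^{n} C(n,j) r^j (1-r)^(n-j)."""
    return rv * sum(comb(n, j) * r**j * (1 - r)**(n - j) for j in range(k, n + 1))

lam, t = 1e-3, 100.0
r = exp(-lam * t)
print(k_of_n_reliability(2, 3, r))        # TMR reliability at time t
print(r)                                  # single-module reliability, for comparison

# Crude numerical check of MTTF_TMR = 5/(6*lam), integrating R(t) = 3e^(-2*lam*t) - 2e^(-3*lam*t)
dt, T = 0.5, 20000.0
mttf = sum((3*exp(-2*lam*i*dt) - 2*exp(-3*lam*i*dt)) * dt for i in range(int(T / dt)))
print(mttf, 5 / (6 * lam))                # both values are close to 833.3
```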
A key element in the application of k-out-of-n redundancy schemes, and their special
cases of majority voting, is the design of appropriate “voting” circuits. Considerations in
the design of voting circuits are discussed in Chapter 12.
Replicating the voters and performing the entire computation in three separate and
independent channels is one way of removing the voting circuits from the critical system
core. Figure 3.8 shows how voter triplication in a TMR system will allow voter failures
as well as module failures to be tolerated. As the oval dashed boxes indicate, the voter
reliability can be lumped with module reliability, instead of it appearing separately, as in
equations (3.2.TMR2) and (3.2.kofn).
Fig. 3.8 TMR system with nonreplicated and replicated voting units.
Note that in writing the reliability equation (3.2.TMR1) for a TMR system, we have
pessimistically assumed that any two module failures will render the system useless. This
is not always the case. For example, if the modules produce single-bit outputs, then when
the output of one module is stuck-on-1, while a second module’s output is stuck-on-0, the
system can still produce the correct output, despite the occurrence of double module
failures. Such compensating failures, as well as situations where problems are detected
because the multiple modules produce distinct erroneous results, leading to a lack of
majority agreement, are discussed in Chapter 12.
Two variants of k-out-of-n systems also merit discussion, although they are not in
common use for modeling computer systems. We first note that the type of k-out-of-n
system we have covered thus far can be called a k-out-of-n:G system, with the added
qualifier “G” designating that the system is “good” when at least k of its n modules are
good. We may also encounter k-out-of-n:F systems, in which the failure of any k or more
subsystems is tantamount to system failure. Clearly, a k-out-of-n:F system is identical to
an (n – k + 1)-out-of-n:G system. So, the new notation is unnecessary for the type of
systems we have been discussing.
A consecutive k-out-of-n:G system is one in which the n modules are linearly ordered,
say, by indexing them from 1 to n, with the failure of any k consecutive modules causing
system failure. So, for example, such a system may not be able to function with exactly k
working modules, unless these k modules happen to be consecutive.
Example 3.5: Consecutive 3-out-of-5:G system Three values can be transmitted over three of
five buses, using shift switches at the interface, as depicted in Fig. 3.9. Shift switches are
controlled using a common set of control signals that puts all of them in the upward, straight, or
downward connection state. Such a reconfiguration scheme is easier and less costly to implement
than arbitrary (crossbar) connectivity and is thus quite popular.
Solution: The reliability of this system is different from an ordinary 3-out-of-5 system, because,
for example, the outage of the middle bus line is not tolerated, even if it is the only one. Let each
bus line have reliability r and assume that the switches are perfect. The system works when all 5
bus lines are okay, or any 4 are okay, except if the middle bus line is the bad one (4 cases in all),
or if the set of 3 bus lines {1, 2, 3}, {2, 3, 4}, or {3, 4, 5} are good, with the remaining 2 being
bad. Adding the three terms corresponding to the cases above, we get the system reliability
equation: R = r⁵ + 4r⁴(1 – r) + 3r³(1 – r)² = 3r³ – 2r⁴.
Fig. 3.9 Three values routed over three of five bus lines via shift switches, all governed
by a common set of switch-setting control signals.
The second variant is the class of consecutive k-out-of-n:F systems. Here, any k or more
consecutive failed modules render the system nonfunctional. So, such a system may
tolerate more than k module failures, provided the failed modules are not consecutive.
Example 3.6: Consecutive 2-out-of-n:F system Consider a system of n street lights, where the
lights provide a minimum level of illumination deemed adequate for safety, unless two or more
consecutive lights are out. What is the reliability of this consecutive 2-out-of-n:F system?
Solution: The reliability of this system is different from an ordinary (n – 1)-out-of-n system,
because, for example, the safety criterion is met even if every other light is out. Let each street
light have reliability r. Let f(n) be the reliability of a consecutive 2-out-of-n:F system. Then, we
can write f(n) = r f(n – 1) + r(1 – r) f(n – 2), with f(1) = 1 and f(2) = 2r – r². The two terms in the
equation for f(n) correspond to the two possible cases for the first street light. If that light is
working, then the system will be okay if the remaining n – 1 lights do not suffer 2 consecutive
outages. Otherwise, if the first light is out, then the second light must be working, and the
remaining n – 2 lights should not have 2 consecutive outages. The recurrence and its associated
initial conditions allow us to compute f(n) for any value of n, either numerically for a given value
of r or symbolically for arbitrary r. For example, we find f(5) = r² + 3r³ – 4r⁴ + r⁵.
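The recurrence of Example 3.6 is readily checked against brute-force enumeration of light configurations; the Python sketch below uses an assumed value of r and is practical only for small n.

```python
# Illustrative check of Example 3.6: recurrence versus brute-force enumeration.
from itertools import product

def f_recurrence(n, r):
    """f(n) = r*f(n-1) + r*(1-r)*f(n-2), with f(1) = 1 and f(2) = 2r - r**2."""
    if n == 1: return 1.0
    if n == 2: return 2*r - r**2
    a, b = 1.0, 2*r - r**2               # f(n-2), f(n-1)
    for _ in range(3, n + 1):
        a, b = b, r*b + r*(1 - r)*a
    return b

def f_bruteforce(n, r):
    total = 0.0
    for lights in product([0, 1], repeat=n):        # 1 = working, 0 = out
        if '00' not in ''.join(map(str, lights)):   # no two consecutive outages
            k = sum(lights)
            total += r**k * (1 - r)**(n - k)
    return total

r = 0.9
print(f_recurrence(5, r), f_bruteforce(5, r))       # both equal r^2 + 3r^3 - 4r^4 + r^5
```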
Fig. 3.11 Example reliability block diagram: module A feeds two parallel series branches,
B-D and C-E, which reconverge into modules F and G in series.
A reliability block diagram is best understood in terms of its success paths. A success
path is simply a path through the modules that leads from one side of the diagram to the
other. In the case of Fig. 3.11, the success paths are A-B-D-F-G and A-C-E-F-G. By
definition, the system modeled by a reliability block diagram is functional if all the
modules on at least one success path are functional.
The reliability equation corresponding to an RBD can be easily derived by applying the
series and parallel reliability equations (3.2.ser1) and (3.2.par1). In the case of the RBD
in Fig. 3.11, using rX to denote the reliability of module X, we have:
R = rA rF rG [1 – (1 – rB rD)(1 – rC rE)]    (3.4.RBD1)
If all modules in Fig. 3.11 have the same reliability r, equation (3.4.RBD1) reduces to
R = r⁵(2 – r²). Note that because 2 – r² > 1, the system modeled is more reliable than a
series system with five identical modules, as one would expect.
Example 3.7: Parallel-series and series-parallel systems Denoting the reliability of module j
in Fig. 3.12 as rj:
a. Derive the reliability equation for the parallel-series system of Fig. 3.12a.
b. Derive the reliability equation for the series-parallel system of Fig. 3.12b.
c. Compare the reliability expressions derived in parts a and b and discuss.
Solution: For parts a and b, we use equations (3.2.ser1) and (3.2.par1) in turn.
a. Ra = 1 – (1 – r1 r2)(1 – r3 r4)
b. Rb = [1 – (1 – r1)(1 – r3)] [1 – (1 – r2)(1 – r4)]
c. After some simple algebraic manipulation, the difference of the reliabilities for parts a and b
is found to be Rb – Ra = r1r4(1 – r2)(1 – r3) + r2r3(1 – r1)(1 – r4). Because the difference is
always positive, the series-parallel configuration of Fig. 3.12b always offers better reliability
compared with the parallel-series arrangement of Fig. 3.12a. We should have been able to
predict this advantage, which is precisely due to Fig. 3.12b surviving when modules 1 and 4
are operational, while modules 2 and 3 have failed, or vice versa.
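A quick numeric comparison of the two arrangements in Example 3.7, with assumed module reliabilities, is sketched below; it also confirms the expression for Rb – Ra.

```python
# Illustrative comparison for Example 3.7 (module reliabilities are assumed).
def parallel_series(r1, r2, r3, r4):      # Fig. 3.12a
    return 1 - (1 - r1*r2) * (1 - r3*r4)

def series_parallel(r1, r2, r3, r4):      # Fig. 3.12b
    return (1 - (1 - r1)*(1 - r3)) * (1 - (1 - r2)*(1 - r4))

r = (0.9, 0.8, 0.85, 0.95)
Ra, Rb = parallel_series(*r), series_parallel(*r)
print(Ra, Rb, Rb - Ra)                    # Rb - Ra is never negative
r1, r2, r3, r4 = r
print(r1*r4*(1-r2)*(1-r3) + r2*r3*(1-r1)*(1-r4))   # matches Rb - Ra
```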
The basic series-parallel RBDs discussed thus far can be extended in several different
ways. One way of extending RBDs is to allow k-out-of-n structures in addition to n-out-
of-n (series) and 1-out-of-n (parallel) constructs. Such a structure is drawn as a set of
parallel blocks with a suitable notation indicating that k out of the n blocks must be
functional. This can take the form of an annotation next to the blocks, or a voter-like
connector on the right-hand side of the parallel group into which the label “k of n” or “k /
n” is inscribed. The use of such k-out-of-n structures does not complicate the derivation
of the reliability equation: we simply use equation (3.3.kofn) in this case, in lieu of
equation (3.2.par1).
Fig. 3.13 Example extended RBD: module A in series with the parallel group {B, C, D},
followed by the 2-out-of-3 group {E, F, G} and module H.
A second way to extend RBDs is to allow connectivity patterns that are more general
than series-parallel. For example, the bridge pattern of Fig. 3.14 would constitute such an
extended RBD. In this example, one may view module 5 as being capable of replacing
modules 2 and 3 when the latter interact with modules 1 and 4 (but not when module 3
should cooperate with module 6).
Fig. 3.14 Extended RBD with a bridge pattern involving modules 1 through 6, with
module 2 serving as the bridging element.
A third way to extend RBDs is to allow repeated blocks [Misr70]. Figure 3.15 depicts
two ways of representing a 2-out-of-3 structure, using parallel-series and series-parallel
connection of blocks A, B, and C, with repetition.
Fig. 3.15 Two RBD representations of a 2-out-of-3 structure using repeated blocks: a
parallel connection of the series branches A-B, B-C, and C-A, and a series connection of
the parallel pairs {A, B}, {B, C}, and {C, A}.
When RBDs are not of the simple series/parallel variety or when they have repeated
elements, special methods are required for their analysis. The following two examples
demonstrate the method.
Example 3.8: Non-series/parallel RBDs Consider the extended RBD in Fig. 3.14 and denote
the reliability of module i by ri. Derive an expression for the overall system reliability.
Solution: The system functions properly if a string of healthy modules connect one side of the
diagram to the other. Because module 2 is the unit whose role deviates from series or parallel
connection, we will perform a case analysis by assuming that it works (replacing it with a line
connecting modules 1 and 3) or does not work (disconnecting modules 1 and 3). We thus get the
system reliability equation R = r2R2good + (1 – r2)R2bad, where R2good and R2bad are conditional
reliabilities for the two cases just mentioned. Our solution is complete upon noting that R2good =
r4[1 – (1 – r1)(1 – r6)][1 – (1 – r3)(1 – r5)] and R2bad = r4[1 – (1 – r1r5)(1 – r3r6)].
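The factoring (case-analysis) solution of Example 3.8 can be coded directly; the Python sketch below uses assumed module reliabilities.

```python
# Illustrative evaluation of the case-analysis solution of Example 3.8
# (module reliabilities are assumed).
def bridge_reliability(r1, r2, r3, r4, r5, r6):
    R2good = r4 * (1 - (1-r1)*(1-r6)) * (1 - (1-r3)*(1-r5))   # module 2 works
    R2bad  = r4 * (1 - (1 - r1*r5) * (1 - r3*r6))             # module 2 failed
    return r2 * R2good + (1 - r2) * R2bad

print(bridge_reliability(0.9, 0.9, 0.9, 0.9, 0.9, 0.9))
```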
Example 3.9: Extended RBDs with repeated elements Consider an extended RBD that is 3-
way parallel, with each of the parallel branches being a series connection, as follows: (a) 1-5-4,
(b) 1-2-3-4, and (c) 6-3-4. Boxes with the same number denote a common module. So, for
example, the two occurrences of 1 in the diagram represent a common module 1. This RBD may
be viewed as equivalent to that in Fig. 3.14, in that it has the same success paths. So the analysis of
this example is another way of solving Example 3.8. Derive a reliability expression for this RBD.
Solution: The inequality R ≤ 1 – ∏i (1 – Ri), where Ri is the reliability of the ith success path,
provides an upper bound on system reliability. The reason that the expression on the right-hand side
represents an upper bound rather than an exact value is that it treats multiple occurrences of the same
module as having independent failures. In the case of our example, we get R ≤ 1 – (1 – r1r5r4)(1 –
r1r2r3r4)(1 – r6r3r4). It turns out that if we multiply out the parenthesized terms on the right-hand side
of the foregoing inequality, but do so while reducing the higher powers of each reliability term to the
first power, an exact reliability formula results. For our example, the process just outlined yields
R = r1r4r5 + r1r2r3r4 + r3r4r6 – r1r2r3r4r5 – r1r3r4r5r6 – r1r2r3r4r6 + r1r2r3r4r5r6.
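The exact formula of Example 3.9 can be cross-checked by brute-force enumeration over the module states, using the three success paths; the sketch below (with assumed, identical module reliabilities) does this.

```python
# Illustrative check of Example 3.9: exact reliability from the success paths
# {1,5,4}, {1,2,3,4}, {6,3,4}; the module reliabilities are assumed.
from itertools import product

PATHS = [{1, 4, 5}, {1, 2, 3, 4}, {3, 4, 6}]

def exact_reliability(r):                 # r: dict of module reliabilities
    mods = sorted(r)
    total = 0.0
    for states in product([0, 1], repeat=len(mods)):
        up = {m for m, s in zip(mods, states) if s}
        if any(p <= up for p in PATHS):   # some success path fully operational
            prob = 1.0
            for m, s in zip(mods, states):
                prob *= r[m] if s else 1 - r[m]
            total += prob
    return total

r = {i: 0.9 for i in range(1, 7)}
formula = (r[1]*r[4]*r[5] + r[1]*r[2]*r[3]*r[4] + r[3]*r[4]*r[6]
           - r[1]*r[2]*r[3]*r[4]*r[5] - r[1]*r[3]*r[4]*r[5]*r[6]
           - r[1]*r[2]*r[3]*r[4]*r[6] + r[1]*r[2]*r[3]*r[4]*r[5]*r[6])
print(exact_reliability(r), formula)      # the two values agree
```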
Thus far, we have taken the modules in an RBD to be independent of each other. More
sophisticated models take the possibility of common-cause failures into account or allow
the failure of some modules to affect the proper functioning of others, perhaps after a
randomly variable amount of time [Levi13].
Figure 3.16 depicts an example reliability graph having success paths A-E-H-L-N, B-D-
G-M, and C-F-H-K-M, among others. A reliability graph can be analyzed by converting
it to a number of series/parallel structures through case analysis.
Reliability graphs are more powerful than simple series/parallel RBDs in terms of
representational power, but they are less powerful than the most general form of fault
trees, to be described next.
Fault tree is a tool for top-down reliability analysis. To construct a fault tree, we start at
the top with an undesirable event called a “top event” and determine all the possible ways
in which the top event can occur. The fault-tree method allows us to determine, in a
systematic fashion, how the top event can be caused by individual or combined lower-
level undesirable events. Figure 3.17 contains an informal description of the building
process as well as some of the pertinent symbols used.
Fig. 3.17 Constructing a fault tree and some of the pertinent symbols.
For example, if the top event is being late for work, its immediate causes might be clock
radio not turning on, family emergency, or the bus not running on time. Going one level
down, the clock radio might fail to turn on due to the coincidence of a power outage and
its battery being dead.
Once a fault tree has been built, it can be analyzed in at least two ways: using the cut-set
method and via conversion to a reliability block diagram.
A cut set is any set of initiators so that the failure of all of them induces the top event. A
minimal cut set is a cut set for which no subset is also a cut set. As an example, for the
fault tree of Fig. 3.18, the minimal cut sets are {a, b}, {a, d}, and {b, c}. The equivalent
RBD for a given fault tree is one that has the same minimal cut sets. An example is
depicted in Fig. 3.18. Note that the equivalent RBD may not be unique.
Other than allowing us to probabilistically analyze a fault tree, cut sets also help in
common-cause failure assessment and exposition of system vulnerability: a small cut set
indicates high vulnerability. The notion of path set is the dual of cut set. A path set is any
set of initiators so that if all of them are fault-free, the top event is inhibited. One can
derive the path sets for a fault tree by exchanging AND and OR gates and then obtaining
the cut sets for the transformed tree.
Example 3.10: Fault trees and RBDs  Consider a system exhibiting the minimal cut sets {a, b},
{a, c}, {a, d}, and {c, d, e, f}.
a. Construct a fault tree for the system.
b. Derive an equivalent RBD.
c. What are the path sets for this example?
Solution:
a. A fault tree for the system is depicted in Fig. 3.19a.
b. One possible RBD is depicted in Fig. 3.19b.
c. Exchanging AND and OR gates in Fig. 3.19a, we find the path sets of the original diagram as
the cut sets of the transformed diagram, thus obtaining: {a, c}, {a, d}, {a, e}, {a, f}, {b, c, d}.
Fig. 3.19 Fault tree and RBD associated with Example 3.10.
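As a cross-check of part c, the minimal path sets can also be obtained as the minimal “hitting sets” of the cut sets (every cut set must contain at least one element of each path set). The brute-force Python sketch below uses this alternative route rather than the AND/OR-exchange method described in the text.

```python
# Illustrative cross-check of Example 3.10, part c, via minimal hitting sets.
from itertools import combinations

CUT_SETS = [{'a','b'}, {'a','c'}, {'a','d'}, {'c','d','e','f'}]
elements = sorted(set().union(*CUT_SETS))

hitting = []
for size in range(1, len(elements) + 1):
    for cand in combinations(elements, size):
        s = set(cand)
        if all(s & c for c in CUT_SETS):           # touches every cut set
            if not any(h <= s for h in hitting):   # keep only minimal ones
                hitting.append(s)

print(sorted(sorted(h) for h in hitting))
# [['a','c'], ['a','d'], ['a','e'], ['a','f'], ['b','c','d']]
```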
In conclusion, we note that the combinational models introduced in this chapter constitute
a hierarchy in terms of their representational power [Malh94]. At the top of this
hierarchy, we have FTs with repeated elements. Reliability graphs are next, with
somewhat less representational power. Finally, at the bottom of the hierarchy we have
RBDs and ordinary FTs (no repeated elements), with these latter two models being
equivalent in the sense of having identical representational power.
Problems
System A
System B
a. Write the reliability equations for the two systems and determine the conditions under which
system A is more reliable than system B.
b. Repeat part a in the more general case when there are m pairs, rather than 3 pairs, of modules
arranged as series connection of m parallel pairs or parallel connection of two m-module series
chains.
c. Generalize the conclusions of part b to a case where instead of parallel pairs, we have k-wide
parallel connections. In other words, assume that there are km modules in all, with the parallel
parts being k-wide and the series parts being of length m.
A B
a. Write down the reliability equation of the series connection, and show that it is preferable to a
nonredundant diode iff open-circuit failures are less likely than closed-circuit failures (p < q).
b. Repeat part a for two diodes in parallel replacing a single diode, and derive the corresponding
condition.
c. Show that it is possible to arrange 4 diodes so that they act as a 3-out-of-4 system between A and
B, given the open-circuit and short-circuit failure modes.
4 State-Space Modeling
“All models are wrong, some are useful.”
G. E. P. Box
“When one admits that nothing is certain one must, I think, also
admit that some things are much more nearly certain than
others.”
Bertrand Russell
State-space models are suitable for modeling of systems that can be in multiple
states from the viewpoint of functionality and can move from state to state as a
result of deviations (e.g., malfunctions) and remedies (e.g., repairs). Even though
many classes of state-space models have been introduced and are in use, our focus
will be on Markov models that have been found most useful in practice and
possess sufficient flexibility and power to faithfully represent nearly all system
categories of practical interest. In this chapter, we introduce Markov chains as
models of probabilistically evolving systems and explore their use in evaluating
the dependability of computer systems, particularly those with repairable parts.
A Markov chain can be represented by a transition matrix M, where the element mi,j in
row i and column j denotes the probability associated with the transition from state i to
state j. When the Markov chain has no transition between two states, the corresponding
entry in M is 0. The transition matrix M0 for the Markov chain of Fig. 4.1 is:
A matrix such as the one in eqn. (4.1.exMm0), in which the entries in each row
add up to 1, is referred to as a Markov matrix. A second example of a Markov matrix is
provided by eqn. (4.1.exMm1).
At any given time, our knowledge about the state of an n-state Markov chain can be
represented by an n-vector, with its ith element representing the probability of being in
state i. For example, the state vector (1, 0, 0, 0) means that the system depicted in Fig. 4.1
is known to be in state 0, (1/2, 1/2, 0, 0) means that it is equally likely to be in state 0 or
state 1, and (1/4, 1/4, 1/4, 1/4) denotes complete uncertainty about the state of the system.
Clearly, the elements of a state vector must add up to 1.
Starting from the state vector s^(0) = (s0^(0), s1^(0), s2^(0), s3^(0)) at time 0, one can compute the state vector
at time step 1 by multiplying s^(0) by the transition matrix M, that is, s^(1) = s^(0)M. More
generally, the state vector of the system after k time steps is given by:
s^(k) = s^(0) M^k    (4.1.svec)
For example, if the system in Fig. 4.1 is initially in state 0 or 1 with equal probabilities,
its state vector after one and two time steps will be:
A discrete-time Markov chain can be viewed as a sequential machine with no input, also
known as an autonomous sequential machine. In such a machine, transitions are triggered
by the clock signal, with no other external influence. In conventional digital circuits,
clock-driven counters are examples of autonomous sequential machines that move
through a sequence of states on their own. If we have two or more possible input values,
then a general stochastic sequential machine results. Such a machine will have a separate
transition matrix for each input value and its state vector after k time steps will depend on
the k-symbol input sequence that it receives.
Example 4.1: Stochastic sequential machines Consider a 4-state, 2-input stochastic sequential
machine with the state transition probabilities provided by the matrix M0 of eqn. (4.1.exMm0) for
input value 0 and by the matrix M1 of eqn. (4.1.exMm1) for input value 1. Assuming that the
machine begins in state 0, what will be its state vector after receiving the input sequence 0100?
Solution: The initial state vector (1, 0, 0, 0) must be multiplied by M0M1M0M0 to find the final
state vector. This process yields the state vector (0.2620, 0.2844, 0.3586, 0.0950).
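Equation (4.1.svec) and the input-driven variant of Example 4.1 are easy to exercise in code. Because the matrices M0 and M1 of eqns. (4.1.exMm0) and (4.1.exMm1) are not reproduced here, the sketch below uses hypothetical 4-state Markov matrices; only the mechanics, not the resulting numbers, carry over.

```python
# Minimal sketch of s(k) = s(0) * M^k. The matrices are placeholders,
# NOT the M0 and M1 of the text.
import numpy as np

M0 = np.array([[0.6, 0.4, 0.0, 0.0],      # hypothetical Markov matrix for input 0
               [0.2, 0.5, 0.3, 0.0],
               [0.0, 0.3, 0.4, 0.3],
               [0.0, 0.0, 0.5, 0.5]])
M1 = np.array([[0.1, 0.9, 0.0, 0.0],      # hypothetical Markov matrix for input 1
               [0.0, 0.2, 0.8, 0.0],
               [0.0, 0.0, 0.3, 0.7],
               [0.6, 0.0, 0.0, 0.4]])

s0 = np.array([1.0, 0.0, 0.0, 0.0])       # known to start in state 0

# Autonomous chain: k steps under the same matrix.
print(s0 @ np.linalg.matrix_power(M0, 2))

# Stochastic sequential machine: apply the matrix selected by each input symbol.
s = s0
for symbol in [0, 1, 0, 0]:
    s = s @ (M0 if symbol == 0 else M1)
print(s, s.sum())                          # the state vector remains a probability vector
```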
Continuous-time Markov chains are more useful for modeling computer system
performance and dependability. In such chains, transitions are labeled with rates, rather
than probabilities. For example, a transition rate of 10⁻⁶ per hour means that over a very short
time period of dt hours, the transition will occur with probability 10⁻⁶dt. We often use
Greek letters to denote arbitrary transition rates, with λ being a common choice for
failure rate and μ for repair or service rate.
Markov chains are used widely for modeling systems of many different kinds. For
example, the process of computer programming may be modeled by considering various
states corresponding to problem formulation, program design, testing, debugging, and
running [Funk07]. State transitions are labeled with estimated probabilities of going from
one state (say, testing) to other states (say, debugging or running). Pattern recognition
problems may be tackled by the so-called “hidden Markov models” [Ghah01]. As a final
example, a person’s movement between home, office, and various other places of interest
can be studied via a Markov model that associates probabilities with going from one of
the places to each other place [Ashb03]. Many more applications exist [Bolc06].
Example 4.2: Markov model for a fail-soft multiprocessor Consider a parallel machine with 3
processors, each of which has a failure rate λ and a repair rate μ. When 2 or 3 processors fail, only
one can be repaired at a time. The machine can operate with any number of processors, but its
performance will be lower with only 1 or 2 processors working, thus providing fail-soft operation.
Construct a suitable Markov model for this parallel machine.
Solution: Each of the 3 processors can be in the up (1) or down (0) state, leading to 8 possible
states in all. For example, the state 111 corresponds to all processors being operational and 110
corresponds to the third one being down. Figure 4.2a depicts the resulting 8-state Markov model
and the associated state transitions. We know that the 3 processors are identical with regard to
their failure and repair characteristics. If they are also the same in other respects, there is no need to
distinguish the 3 states in which a single processor is down or the ones having 2 bad processors.
Merging these states, we get the simplified Markov model of Fig. 4.2b. The new failure rates are
obtained by summing the λ values on the merged transitions, but the repair rate μ does not change. In
the event that a single processor cannot provide sufficient computational power for the tasks of
interest, we might further merge states 0 and 1 in Fig. 4.2b, given that both of them imply the
status of the system being “down.” Thus, whether it is appropriate to simplify a Markov model by
merging some states depends on the model’s semantics, rather than its appearance.
Fig. 4.2 Markov models for a 3-processor fail-soft system. In part (a),
all solid arrows should be labeled λ and all dashed arrows μ.
In modeling nonrepairable systems, the reliability equation and parameters such as MTTF
are of interest. Such systems eventually fail, so our objective is often to determine the
system lifetime and devise methods for increasing it.
Example 4.3: Two-state nonrepairable system  A nonrepairable system has the failure rate λ.
Use a 2-state Markov model to find the reliability of this system as a function of time.
Solution: The requisite Markov model is depicted in Fig. 4.3a. Reliability can be viewed as the
probability of the system being in state 1. To find the reliability as a function of time, we note that
p1 changes at the rate –λp1 (a failure rate of λ means that the probability of failure over a short
time interval dt is λdt). Thus, we can set up the differential equation p1′ = –λp1, which has the
solution p1 = e^(–λt), given the initial condition p1(0) = 1. Figure 4.3b shows a plot of the reliability
R(t) = p1(t) as a function of time. The probability of the system being in state 0 can be found from
the identity p0 + p1 = 1 to be p0 = 1 – e^(–λt).
(a) System states and Markov model (b) System reliability over time
Fig. 4.3 Markov model and the reliability curve for a 2-state
nonrepairable system.
Example 4.4: n-module parallel system  Consider n lightbulbs lit at the same time, each failing
independently at the rate λ. Construct a Markov chain to model this system of n parallel lightbulbs
without replacement and use the model to find the expected time until all n lightbulbs die.
Solution: The requisite Markov model, which has n + 1 states labeled from n (start state, when all
lightbulbs are good) down to 0 (failure state), is depicted in Fig. 4.4 (ignore the dashed box, which is
for Example 4.5). The expected time to go from state k to state k – 1 is 1/(kλ). Thus, the system’s
expected lifetime, or the time to go from state n to state 0, is (1/λ)[1/n + 1/(n – 1) + … + 1/2 + 1].
We see that due to the use of n parallel lightbulbs, the lifetime 1/λ of a single lightbulb has been
extended by a factor equal to the harmonic sum Hn = 1 + 1/2 + … + 1/n, which is O(log n) for large n.
Example 4.5: k-out-of-n nonrepairable systems  Construct and solve a Markov model for a
nonrepairable k-out-of-n system.
Solution: The Markov model with n + 1 states, in which state i represents i of the n modules being
operational, can be simplified by merging the last k states into a single “failure” or “down” state.
The following balance equations can be written for the n – k + 1 good states:
p′n = –nλpn
p′n–1 = nλpn – (n – 1)λpn–1, and more generally, for j ≥ k,
p′j = (j + 1)λpj+1 – jλpj
From the system of n – k + 1 differential equations above, we find pn = e^(–nλt), given the initial
condition pn(0) = 1, and more generally, for j ≥ k,
pj = C(n, j) (e^(–λt))^j (1 – e^(–λt))^(n–j)
The probability pF for the failure state can be obtained from pn + pn–1 + … + pk + pF = 1.
Note that to solve the system of differential equations of Example 4.5, we did not need to
resort to the more general method involving the Laplace transform (see Section 4.6), because
the first equation was solvable directly and each additional equation introduced only one
new dependent variable.
In a repairable system, the effect of failures can be undone by repair actions. One way to
model variations in repair time is to associate a repair rate with each repair transition, as
depicted in Fig. 4.5a. A repair rate μ means that the repair time has the exponential
distribution, with the probability of the repair action taking more time than t being e^(–μt). The
effectiveness and timeliness of repair actions can be assessed by evaluating the system’s
steady-state availability, or the fraction of time the system is expected to be in the “Up”
state. The following example shows how the availability can be derived for the simplest
possible repairable system.
Example 4.6: Two-state repairable system  Consider a repairable system with failure rate λ
and repair rate μ. Derive a formula for the steady-state system availability.
Solution: The requisite Markov model is depicted in Fig. 4.5a. Availability can be viewed as the
probability of the system being in state 1. To find the steady-state availability, we set up the
balance equation –λp1 + μp0 = 0, which indicates that, over the long run, a transition out of state 1
has the same probability as a transition into it. The balance equation above, along with p0 + p1 = 1,
allows us to obtain p1 = μ/(λ + μ) and p0 = λ/(λ + μ). We will see later that the availability of this
system as a function of time is A(t) = μ/(λ + μ) + [λ/(λ + μ)]e^(–(λ+μ)t), which is consistent with the
steady-state availability A = μ/(λ + μ) just derived.
Fig. 4.5 Two-state repairable system: (a) system states and Markov model; (b) system availability over time.
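A small Python sketch (with assumed values of λ and μ) evaluates the time-dependent availability quoted in Example 4.6 and shows its convergence to the steady-state value μ/(λ + μ).

```python
# Illustrative check of Example 4.6; the rate values are assumptions.
from math import exp

lam, mu = 0.001, 0.1                      # failure and repair rates (per hour)

A_steady = mu / (lam + mu)
def A_t(t):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu)*t)."""
    return mu/(lam + mu) + lam/(lam + mu) * exp(-(lam + mu) * t)

for t in (0, 10, 50, 1000):
    print(t, round(A_t(t), 6))
print("steady state:", round(A_steady, 6))   # A(t) converges to mu/(lam+mu)
```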
As discussed in Section 2.5, we may want to distinguish different failure states to allow a
more refined safety analysis. Another reason for contemplating multiple failure states is
to model different rates of failures as well as different rates of repair in a repairable
system. Some failures may be repairable, while others aren’t, and classes of failures may
differ in the difficulty and latency of repair actions. The following example illustrates the
Markov modeling process for one class of such systems.
Example 4.7: Repairable system with two failure states Consider a repairable system with
failure rate = 1 + 2, divided into two parts 0 for failures of type 0 and 1 for failures of type 1.
Assuming the common repair rate for both failure types, derive a formula for the steady-state
system availability and the probabilities in being in the two failure states.
Solution: The requisite Markov model is depicted in Fig. 4.6a. To obtain steady-state probabilities
of the system being in each of its 3 states, we write the following two balance equations and use
them in conjuction with p0 + p1 + p2 = 1 to find p0 = 0/( + ), p1 = 1/( + ), and p2 = /( + ).
–p2 + p1 + p0 = 0
–p1 + 1p2 = 0
Figure 4.6b shows typical variations in state probabilities over time (with the derivation to be
discussed later) and their convergence to steady-state values just derived. If penalties or costs cj
are associated with being in the various failed states, analyses of the kind presented in this
example allow the computation of the total system risk as the sum Σj cjpj taken over the failure states j.
Fig. 4.6 (a) System states and Markov model; (b) state probabilities over time.
As discussed in Section 2.4, we may want to have multiple operational states, associated
with different levels of system capability, to allow for performability analysis. As in the
case of multiple failure states, discussed in Section 4.3, different operational states may
differ in their failure and repair rates, allowing for more accurate reliability and
availability analyses. Such states may also have different rewards or benefits bj associated
with them, so that a weighted total benefit Σj bjpj, summed over the operational states j, can be computed. The
following examples illustrate the Markov modeling process for such systems.
Example 4.8: Repairable system with two operational states Consider a repairable system
with operational states 1 and 2 and failure state 0, as depicted in Fig. 4.7a with its associated
failure and repair rates. Derive the probabilities of being in the various states and use them to
compute system availability and system performability, the latter under the assumption that the
performance or benefit b2 associated with state 2 is twice that of b1 = 1 for state 1.
Solution: The requisite Markov model is depicted in Fig. 4.7a. To obtain steady-state probabilities
for the system states, we write the following two balance equations and use them in conjunction
with p0 + p1 + p2 = 1 to find p0 = λ1λ2D/(μ1μ2), p1 = λ2D/μ2, and p2 = D, where D = 1/[1 + λ2/μ2 +
λ1λ2/(μ1μ2)].
–λ2p2 + μ2p1 = 0
λ1p1 – μ1p0 = 0
Figure 4.7b shows typical variations in state probabilities over time and their convergence to the
steady-state values just derived. System availability is simply A = p1 + p2 = D(1 + λ2/μ2). Assuming
a performance level of 1 unit in state 1 and 2 units in state 2, system performability is P = p1 + 2p2
= D(2 + λ2/μ2). If the performance level of 2 in state 2 results from 2 processors running in parallel,
it might be reasonable to assume λ2 = 2λ1 = 2λ and μ2 = μ1 = μ (single repair facility). Then,
assuming ρ = μ/λ, we get A = ρ(ρ + 2)/(ρ² + 2ρ + 2) and P = 2ρ(ρ + 1)/(ρ² + 2ρ + 2).
Fig. 4.7 (a) System states and Markov model; (b) state probabilities over time.
Example 4.9: Fail-soft systems with imperfect coverage We saw in Section 3.2 that adding
parallel redundancy without ensuring accurate and timely malfunction detection and switchover
may not be helpful. We also introduced a coverage parameter c and used it to derive the reliability
of an n-way parallel system under imperfect coverage in eqn. (3.2.cov3). Analyze a repairable fail-
soft system composed of 2 processors, so that upon a first processor failure, successful switching
to a single-processor configuration occurs with probability c.
Solution: The requisite Markov model is depicted in Fig. 4.8. Note that the transition labeled 2λ in
an ordinary fail-soft system has been replaced with two transitions: one labeled 2λc into state 1,
corresponding to successful reconfiguration, and another labeled 2λ(1 – c) into state 0, representing
catastrophic failure upon the first module outage. To obtain steady-state probabilities for the
system states, we write the following two balance equations and use them in conjunction with p0 +
p1 + p2 = 1 to find p0, p1, and p2.
–λ2p2 + μ2p1 = 0
λ2(1 – c)p2 + λ1p1 – μ1p0 = 0
Even though we have set up the model in full generality in terms of failure and repair rates, we
solve it only for λ2 = 2λ1 = 2λ and μ2 = μ1 = μ. Defining ρ = μ/λ and D = 1/[ρ² + (4 – 2c)ρ + 2],
we get p0 = 2[(1 – c)ρ + 1]D, p1 = 2ρD, and p2 = ρ²D. Note that just as reconfiguration can be
unsuccessful, so too repair might be imperfect or the system may be unable to reuse a properly
repaired module. Thus, the use of coverage is appropriate for repair transitions as well.
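The same linear-system view used in the earlier examples extends to this model. The following Python sketch (with arbitrary rates and coverage values) solves the general balance equations and checks the result against the closed-form expressions in terms of ρ = μ/λ:

import numpy as np

lam, mu, c = 0.001, 0.1, 0.95            # arbitrary failure rate, repair rate, and coverage
lam2, lam1, mu2, mu1 = 2 * lam, lam, mu, mu

# Unknowns ordered as [p2, p1, p0]; two balance equations plus normalization
A = np.array([[-lam2,            mu2,   0.0],    # balance at state 2
              [ lam2 * (1 - c),  lam1, -mu1],    # balance at state 0
              [ 1.0,             1.0,   1.0]])   # p2 + p1 + p0 = 1
b = np.array([0.0, 0.0, 1.0])
p2, p1, p0 = np.linalg.solve(A, b)

rho = mu / lam                           # closed-form check
D = 1 / (rho**2 + (4 - 2 * c) * rho + 2)
print(p2, rho**2 * D)
print(p1, 2 * rho * D)
print(p0, 2 * ((1 - c) * rho + 1) * D)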
In the preceding sections, we introduced the method of transition balancing for finding
the steady-state probabilities associated with the states of a Markov model. This method
is quite simple and is readily automated through the use of reliability modeling tools or
general solvers for systems of linear equations. Occasionally, however, we would like to
derive the transient solutions of a Markov chain to study the short-term behavior of a
system or to gain insight into how quickly the system reaches its steady state.
To find the transient solutions to a Markov model, we first set up a system of first-order
differential equations describing the time-variant balance (rather than the steady-state
balance) at each state. We then apply the Laplace transform to the system of differential
equations, solve the resulting system of algebraic equations, and find the final solutions
via the application of the inverse Laplace transform. The Laplace transform converts a time-
domain function f(t) to its transform-domain counterpart F(s) by:
F(s) = ∫0∞ f(t) e^–st dt
One more technique that we need is that of partial fraction expansion. Here is a brief
review, which is adequate for most cases. More details and a large set of examples can be
found in [Wiki15a].
Given a fraction N(s)/D(s), where N(s) and D(s) are the numerator and denominator
polynomials in s, with N(s) being of a lower degree than D(s), it can be expanded as a
sum of fractions of the form ai /(s – ri), assuming that the ri values are the roots of the
equation D(s) = 0. We assume that the roots are simple. Repeated roots require a
modification in this process that we will not discuss here (see Problem 4.21). This partial
fraction expansion allows us to convert an arbitrary fraction to a sum of fractions of the
forms that appear in the right-hand column of Table 4.1 and thus be able to apply the
inverse Laplace transform to them.
N(s)/D(s) = N(s)/[∏i (s – ri)] = a1/(s – r1) + a2/(s – r2) + ... + an/(s – rn)      (4.5.pfe1)
By converting the sum of the fractions on the right-hand side of equation (4.5.pfe1) to a
single fraction having the product of the denominators as its denominator and equating
the coefficients of the various powers of s in the numerators of both sides, we readily
derive the constants ai as:
ai = [(s – ri) N(s)/D(s)] evaluated at s = ri      (4.5.pfe2)
Example 4.10: Two-state repairable systems Find transient state probabilities for the 2-state
repairable system of Fig. 4.5a and show that the results are consistent with the transient
availability curve depicted in Fig. 4.5b.
Solution: We begin by setting up the balance differential equations for the two states.
p1′(t) = –λp1(t) + μp0(t)
p0′(t) = –μp0(t) + λp1(t)
Using the Laplace transform, we convert the equations above into algebraic equations, noting the
initial conditions p1(0) = 1 and p0(0) = 0.
sP1(s) – p1(0) = –λP1(s) + μP0(s)
sP0(s) – p0(0) = –μP0(s) + λP1(s)
Substituting these initial conditions, we find the solutions:
P1(s) = (s + μ) / [s² + (λ + μ)s]
P0(s) = λ / [s² + (λ + μ)s]
To apply the inverse Laplace transform to P1(s) and P0(s), we need to convert the right-hand sides
of the two equations above into function forms whose inverse Laplace transforms are known to us
from Table 4.1. This is done via the partial-fraction expansion discussed just before this example.
P0(s) = λ / [s² + (λ + μ)s] = a1/s + a2/(s + λ + μ)
The denominators s and s + λ + μ on the right are the factors of the original denominator on the
left, while a1 and a2 are constants to be determined by ensuring that the two sides are equal for all
values of s. From this process, we find a1 = λ/(λ + μ) and a2 = –λ/(λ + μ), allowing us to write
down the final result.
p0(t) = InverseLaplace[a1/s + a2/(s + λ + μ)] = λ/(λ + μ) – λ/(λ + μ) e^–(λ + μ)t
The inverse Laplace transform is similarly applied to P1(s), yielding:
p1(t) = μ/(λ + μ) + λ/(λ + μ) e^–(λ + μ)t
These results are consistent with p1(t) shown in Fig. 4.5b, decreasing from 1 at t = 0 to μ/(λ + μ) in
steady state (t = ∞).
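As a numerical cross-check on the inverse-transform algebra, the following sketch integrates the two balance differential equations directly and compares the result with the closed-form p1(t); the rate values are arbitrary:

import numpy as np
from scipy.integrate import solve_ivp

lam, mu = 0.5, 2.0                        # arbitrary failure and repair rates

def deriv(t, p):
    p1, p0 = p
    return [-lam * p1 + mu * p0,          # dp1/dt
             lam * p1 - mu * p0]          # dp0/dt

ts = np.linspace(0, 5, 6)
sol = solve_ivp(deriv, (0, 5), [1.0, 0.0], t_eval=ts, rtol=1e-8, atol=1e-10)

closed = mu / (lam + mu) + lam / (lam + mu) * np.exp(-(lam + mu) * ts)
print(np.allclose(sol.y[0], closed, atol=1e-6))   # True: numeric solution matches p1(t)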
Example 4.11: Triplicated system with repair The lifetime of a triplicated system with voting
(Fig. 3.7) can be extended by allowing repair or replacement to take place upon the first module
malfunction. In this way, it is possible for the system to return to full functionality, with 3 working
units, before a second module malfunction renders it inoperable. Only if the second malfunction
occurs before the completion of repair or replacement will the system experience failure. Analyze
the MTTF of such a TMR system with repair.
Solution: The Markov model for this system is depicted in Fig. 4.9. Steady-state probabilities for
the system states can be obtained from the following two balance equations, used in conjunction
with p0 + p1 + pF = 1. Unfortunately, this steady-state analysis is unhelpful, because it leads to p0 =
p1 = 0 and pF = 1, which isn’t surprising (why?).
–3λp0 + μp1 = 0
3λp0 – (2λ + μ)p1 = 0
In general, with a failure state that has no outgoing transitions, a so-called absorbing state, we get
pF = 1 in steady state.
Time-variant state probabilities can be obtained from the following differential equations, with the
initial conditions p0(0) = 1 and p1(0) = 0.
p0′ = –3λp0 + μp1
p1′ = 3λp0 – (2λ + μ)p1
Using the Laplace transform, we can solve the equations above to obtain R(t) = p0(t) + p1(t). The
result, with the notational conventions ρ = μ/λ and σ = (ρ² + 10ρ + 1)^(1/2), becomes:
R(t) = [(σ + ρ + 5)e^–(ρ + 5 – σ)λt/2 – (ρ + 5 – σ)e^–(ρ + 5 + σ)λt/2]/(2σ)
Integrating R(t), per eqn. (2.2.MTTF), we find, after simplification:
MTTF = [1 + (ρ – 1)/6]/λ = (1 + ρ/5)[5/(6λ)]
This result suggests that the provision of repair extends the MTTF of a TMR system by a factor of
1 + ρ/5 over that of a nonrepairable TMR system, which in turn has a smaller MTTF (by a factor
of 5/6) compared with a nonredundant module. For example, with λ = 10–6/hr and μ = 0.1/hr, we
have MTTFModule = 1M hr, MTTFTMR = 0.833M hr, and MTTFTMR with repair ≈ 16,668M hr.
Fig. 4.9 TMR system with repair: (a) system states and Markov model (states 0, 1, and F, with 1st- and 2nd-failure transitions); (b) reliabilities over time.
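Plugging the example's numbers into these formulas reproduces the three MTTF figures quoted above; the short sketch below does the arithmetic:

lam, mu = 1e-6, 0.1            # failure and repair rates per hour, as in Example 4.11
rho = mu / lam

mttf_module = 1 / lam                        # nonredundant module
mttf_tmr    = 5 / (6 * lam)                  # TMR without repair
mttf_tmr_r  = (1 + rho / 5) * 5 / (6 * lam)  # TMR with repair

print(mttf_module / 1e6, "M hr")   # 1.0
print(mttf_tmr / 1e6, "M hr")      # ~0.833
print(mttf_tmr_r / 1e6, "M hr")    # ~16,668 (more precisely, 16,667.5)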
A particularly useful Markov model is the so-called birth-and-death process. This model
is used in queuing-theory analysis, where the customers' arrival rate λ and the providers'
service rate μ determine the queue size and waiting time. Referring to the infinite birth-and-
death process depicted in Fig. 4.10, a transition from state j to state j + 1 is an arrival or
birth, and a transition from state j to state j – 1 is a departure or death. Closed-form solutions
for state probabilities are difficult to obtain in general, but steady-state probabilities are
easily obtained:
pj = p0 (λ0λ1 … λj–1)/(μ1μ2 … μj)      (4.6.BnD)
In the finite version of Fig. 4.10, the last state is labeled n and we have an (n + 1)-state
birth-and-death process. The following two examples deal with special cases of the
process shown in Fig. 4.10.
Example 4.12: A simple queuing system for bank customers Customers enter a bank at the
rate of λ and they are serviced by a single teller at the rate of μ. Even if λ < μ, the length of the
waiting line can grow indefinitely due to random variations. Use a birth-and-death process to
compute the probability of the queue length becoming n for different values of n and determine the
average queue length.
Solution: Using eqn. (4.6.BnD), with all the λi set to λ and all the μi set to μ, and taking ρ = μ/λ,
we find pj = p0/ρ^j. From Σj≥0 pj = 1, we obtain p0 = 1 – 1/ρ and pj = (1 – 1/ρ)/ρ^j. The average queue
length is Σj≥0 jpj = (1 – 1/ρ) Σj≥0 j/ρ^j = (1 – 1/ρ)(1/ρ)/(1 – 1/ρ)² = 1/(ρ – 1) = λ/(μ – λ). Note that for ρ = 1,
the queue length becomes unbounded, and for a service rate μ that is slightly greater than but very
close to the arrival rate λ, the queue can become quite long.
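The average-queue-length result can be confirmed by truncating the infinite sum numerically, as in the following sketch with arbitrary rates satisfying λ < μ:

lam, mu = 0.9, 1.0               # arbitrary arrival and service rates, lam < mu
rho = mu / lam                   # note: rho = mu/lam, as in Example 4.12

p0 = 1 - 1 / rho
avg = sum(j * p0 / rho**j for j in range(0, 10000))   # truncated sum of j * p_j

print(avg, lam / (mu - lam))     # both values are approximately 9.0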
Example 4.13: Gracefully degrading system with n identical modules The behavior of a
gracefully degrading system with n identical modules can be modeled by an (n + 1)-state birth-
and-death process, where state j represents the unavailability of j modules (state 0, with all the
modules being functional, is the initial state). Find the probability of the system being in state 0,
assuming up to s modules can be repaired at once (the so-called M/M/s queue, where M stands for
Markov process for failure and repair and s is the number of service stations or providers).
Solution: Figure 4.11 depicts the Markov model for the system under consideration. The repair
rates used depend on the value of s. For s = 1, all repair transitions are labeled μ. For general s,
repair transitions are labeled μ, 2μ, … , sμ, from the left side of the diagram, with the remaining
transitions, if any, all labeled sμ, the maximum repair rate with s service providers. Applying
balance equations and defining ρ = λ/μ, we can find the steady-state probabilities of being in the
various states as:
pj = (n – j + 1)λpj–1/(jμ), for 1 ≤ j ≤ s
pj = (n – j + 1)λpj–1/(sμ), for s + 1 ≤ j ≤ n
The equations above yield each state probability in terms of p0:
pj = C(n, j) ρ^j p0, for 1 ≤ j ≤ s
pj = C(n, j) ρ^j [j!/(s! s^(j–s))] p0, for s + 1 ≤ j ≤ n
Using p0 + p1 + … + pn = 1, we can compute p0 to complete the derivation:
p0 = 1/[Σ0≤j≤s C(n, j) ρ^j + Σs+1≤j≤n C(n, j) ρ^j j!/(s! s^(j–s))]
The state probabilities just derived can be used to compute system availability (summing over
states in which the system is deemed available) or performability (weighted sum, where the weight
for each state is a measure of system performance at that state).
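A small sketch that packages this derivation: the function below (an illustrative helper, not from any particular tool) returns the full state distribution for given n, s, and ρ, from which availability or performability follows as a weighted sum; the values in the usage lines are arbitrary.

from math import comb, factorial

def state_probs(n, s, rho):
    """Steady-state probabilities for the M/M/s repair model of Fig. 4.11.
    State j = number of failed modules; rho = lambda/mu."""
    w = []
    for j in range(n + 1):
        term = comb(n, j) * rho**j
        if j > s:
            term *= factorial(j) / (factorial(s) * s**(j - s))
        w.append(term)
    total = sum(w)
    return [x / total for x in w]

p = state_probs(n=4, s=1, rho=0.01)
print("p0 =", p[0])
print("availability, if up to 1 failed module is tolerated:", p[0] + p[1])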
Examples of software aids for dependability modeling include those offered by PTC
Windchill (formerly Relex) [PTCW20] and ReliaSoft [Reli20], companies specializing in
reliability engineering, University of Virginia’s Galileo [UVA20], and Iowa State
University’s HIMAP [IASU20]. There are also more limited tools associated with Matlab
and a number of Matlab-based systems. A 2004 study enumerates a set of possibly useful
tools [Bhad04]. A Google search for “reliability analysis tools” will reveal a host of other
products and guides.
Problems
[State diagram for this problem: a Markov model over the four states 00, 01, 10, 11, with the associated transition rates.]
a. Explain the assumptions under which the model was developed. In particular, pay attention to the
lack of full symmetry in the state diagram.
b. Solve the simplified model to derive the steady-state probabilities associated with the four states.
b. Use the model of part a to find the probability Ci that team i (i = 1, 2) wins the championship.
b. Is it feasible to solve the Markov model for k = 2 colors in an array of side length n = 10?
c. Write a program to experiment with and observe state changes in the example of part b.
d. Prove that eventually, all cells will assume the same color.
e. Prove that the probability that one of two initial colors prevails is equal to the proportion of cells
that are of that color at the start.
a. Does such a convergence occur for any Markov matrix M? If yes, explain your reasoning in full; if
not, provide a counterexample.
b. Assuming that convergence does occur, outline an efficient procedure for finding the fixed long-
term distribution (x0 x1 . . . xn–2 1–x0–x1– … –xn–2) from M, without having to compute many
powers of M to empirically verify convergence.
[Sidebar: levels of the multilevel model — defective, faulty, erroneous, malfunctioning, degraded, failed.]
“[An engineer] recognizes, unfortunate though they may be, that
defects are unplanned experiments that can teach one how to
make the next design better.”
Henry Petroski, To Engineer Is Human, 1985
“The search for perfection begins with detecting imperfection.”
Anonymous
5 Defect Avoidance
“Better a diamond with a flaw than a pebble without one.”
Chinese proverb
“The omission of the silicon that had been put in nickel [core of
the cathode] to make processing easier . . . raised the effective
life of vacuum tubes from 500 hours to 500,000 hours. The
marginal checking gave another factor of ten on that.”
Jay W. Forrester, reflecting on the Whirlwind I
computer, 1983
Complete defect avoidance, if possible, would be the preferred choice for dealing
with dependability concerns at the device level. Unfortunately, however, defect-
free devices and components may be very expensive (due to stringent
manufacturing and/or careful screening requirements) and perhaps even
impossible to build. Thus, we do what we can, within technical and budgetary
constraints, to reduce the occurrence of defects. We then handle what remains
through accurate modeling and appropriate circumvention techniques. In this
chapter, we deal with the understanding, detecting, and modeling of defects,
leaving the discussion of defect circumvention techniques to Chapter 6.
Defects can be viewed as imperfections or weaknesses that may lurk around without
causing any harm, but that can potentially give rise to faults and other undesirable
behavior. Any manufactured part is prone to defects, particularly if it is mass produced.
In this section, we review the types of defects that one finds in integrated circuits and in
certain mass storage devices as prominent examples of what can go wrong during
manufacturing and how the presence of defects can be detected.
Modern integrated circuits are produced via a sequence of complex processing steps
involving the deposition of material and structures in layers, beginning with a substrate.
As the structures shrink in size, things can and do go wrong. Small impurities in the
material involved, tiny particles or air bubbles, or even the natural variations associated
with automatic production can lead to problems. Figure 5.1 shows two examples of
defects in modern ICs that affect the circuit elements deposited on a surface and vertical
interconnections between different layers. Figure 5.2a stresses the fact that ideal designs
often acquire nonuniformities through the mass-production process, making circuit parameters
and other aspects of the system different from one point on the chip to another. The same
absolute difference in physical dimensions and shapes becomes much more serious in
relative terms as the technology is scaled down. Figure 5.2b shows the temperature
distribution across a chip. Because speed and other circuit parameters are affected by
temperature, the effect of temperature variations is similar to those of nonuniformity
resulting from the finite precision of the manufacturing process.
Fig. 5.1 Two examples of defects in modern ICs: (a) particle embedded between layers; (b) resistive open due to unfilled via.
Fig. 5.2 Process and run-time variations can lead to subtle defects,
and associated performance problems, arising from
changes in resistance, capacitance, and other parameters.
Because defects may not be directly noticeable, one approach to their detection is to
intentionally push the system from defective to faulty or erroneous state in the multilevel
model of Fig. 1.6, so as to make the system state more observable. Such burn-in or stress
tests will be discussed in Section 5.5.
Besides on-chip defects discussed thus far, defects can occur in elements found at higher
levels of the digital system packaging hierarchy, including in connectors, printed-circuit
boards, cabling, and enclosures. However, these types of defects are deemed less serious,
because they lend themselves to easier detection through visual inspection and assembly-
or system-level testing.
The currently dominant technologies for mass storage devices consist of writing data on
smooth surfaces, using magnetic or optical recording schemes. Like integrated circuits,
these recording schemes have also experienced exponential density improvements,
recording more and more bits per unit area. When the rectangular area devoted to the
storage of a single bit has sides that are measured in nanometers (see Fig. 5.3a), slight
impurities, surface variations, dust particles, and minute scratches can potentially wipe
out thousands of bits worth of stored data. It would be utterly impractical to discard every
manufactured magnetic or optical disk if it contained the slightest surface defect.
To see the magnitude of the problem resulting from the high recording density on a hard
magnetic disk, for example, consider the comparative dimensions depicted in Fig. 5.3b.
With such minute dimensions, a small scratch on the disk surface can affect multiple
tracks and thousands of bits. To make matters worse, the read/write head must be placed
very close to the disk surface to allow accurate reading and writing of data with the
densities shown in Fig. 5.3a. The read/write head actually flies on a cushion of air, nearly
touching the surface of the disk. Slight variations in the surface or the presence of
external particles can cause a head crash, leading to substantial damage to the disk
surface and the stored data. This is why modern disk drives are built with tightly sealed
enclosures.
Fig. 5.3 The high density and small head separation (less than 1 μm) in magnetic recording storage technology: (a) bits on the disk surface; (b) head separation from the surface.
Surface defects, and their impact on the stored data, are not unique to magnetic mass
storage devices. Similar considerations apply to other storage media, such as CDs and
DVDs.
Challenges from disk defects are similar to those faced by IC designers and
manufacturers: namely, the detection of these defects and appropriate schemes to
circumvent them. Data on disk memory is often protected by a cyclic redundancy check (CRC) or a
strong error-correcting code. When a sector exhibits repeated violations of the code, it may be
remapped to a different physical disk location and its original location flagged as
unusable. Computer operating systems routinely monitor disk operation, using externally
observable characteristics and certain sensor-provided information to avoid serious disk
failures and the ensuing data loss. Here is a partial list of monitored parameters in modern
disk drives: head flying height (a downward trend often portends a head crash); number
of remapped sectors; frequency of error correction via the built-in code. Additionally, the
following are signs of mechanical or electrical problems that may lead to future data loss:
changes in spin-up time, rising temperatures in the unit, reduction in data throughput.
Modern disk memories typically have strong protection built in against defect-caused
data corruption. As shown in Fig. 5.3disk, the protection may span multiple levels: individual
sectors (red protective coding), blocks of sectors (blue), and blocks of blocks of sectors (green).
The multistep chemical and physical processes that lead from a silicon crystal ingot to a
finished IC chip are depicted in Fig. 5.4. Defects on the sliced wafer lead to a certain
number of defective dies after the wafer has been patterned (converted into a collection of
dies, via a number of processing steps). In the example of Fig. 5.4, 11 of the 120 dies are
shown as defective. So, assuming that no other defects arise in the processes of dicing,
testing, and mounting the dies, a yield of 109/120 ≈ 91% is achieved.
Example 5.1: Financial gain from yield improvement Consider a company that manufactures
5000 wafers per week. A wafer holds 200 dies, with each good die generating a revenue of $15.
Estimate the annual revenue gain from each percentage point in improved yield.
Solution: Each percentage point improvement in yield results in 2 additional good dies per wafer,
corresponding to a revenue gain of $30 per wafer. So, the expected annual revenue gain for a 1% yield
improvement is $30/wafer × 5000 wafers/week × 52 weeks/year = $7.8M.
Experimental evidence suggests that the die yield is related to defect density (number of
defects per unit area) and die area, as shown in equation (5.2.yield). The parameter a is a
technology-dependent constant which is in the range 3-4 for modern CMOS processes.
Die yield = Wafer yield × [1 + (Defect density × Die area)/a]^(–a)      (5.2.yield)
Fig. 5.4 The multistep process leading from a silicon crystal ingot, through slicing into blank wafers and 20-30 patterning steps that produce a wafer with some defective dies, to dicing, die testing, mounting, and final part testing of usable dies (roughly 1 cm on a side) ready to ship.
It is this nonlinear relationship that causes die cost to grow superlinearly with the die size
(chip complexity). Note that assuming a fixed cost for a wafer and a good yield at the
wafer level (i.e., only a small number of wafers have to be discarded), the cost of a die is
directly proportional to its area and inversely proportional to the die yield.
Because, according to equation (5.2.yield), die yield is a decreasing function of the die
area, the cost of a die will grow superlinearly with its area. This effect is evident from
Fig. 5.5, where the same defect pattern has rendered 11 of the dies useless, leading to a
much smaller yield in the case of the larger dies of Fig. 5.5b than the smaller dies in Fig.
5.5a. In the extreme of using an entire wafer to implement a single integrated circuit, that
is, having one die per wafer, yield becomes a very serious problem. This is why many of
the defect circumvention methods, discussed in Chapter 6, were first suggested in
connection with wafer-scale integration.
Example 5.2: Effect of die size on yield and cost Assume that the dies in Fig. 5.5 are 1 × 1
and 2 × 2 cm in size and ignore the defect pattern shown. Assuming a defect density of 0.8/cm²,
how much more expensive will the 2 × 2 die be than the 1 × 1 die?
Solution: Let the wafer yield be w. From the die yield formula, we obtain a yield of 0.492w and
0.113w for the 1 × 1 and 2 × 2 dies, respectively, assuming a = 3. Plugging these values into the
formula for die cost, we find that the 2 × 2 die costs (120/26) × (0.492/0.113) ≈ 20.1 times as
much as the 1 × 1 die; this represents a factor of 120/26 = 4.62 greater cost attributable to the
smaller number of dies on a wafer and a factor of 0.492/0.113 = 4.35 due to the effect of yield.
With a = 4, the ratio assumes the somewhat larger value (120/26) × (0.482/0.095) ≈ 23.4.
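The computations of Example 5.2 are easily scripted; the following sketch evaluates the die-yield formula (5.2.yield), without the wafer-yield factor w, and the resulting cost ratios for a = 3 and a = 4:

def die_yield(defect_density, die_area, a):
    """Yield model of eqn. (5.2.yield), excluding the wafer-yield factor."""
    return (1 + defect_density * die_area / a) ** (-a)

for a in (3, 4):
    y_small = die_yield(0.8, 1.0, a)     # 1 x 1 cm die
    y_large = die_yield(0.8, 4.0, a)     # 2 x 2 cm die
    cost_ratio = (120 / 26) * (y_small / y_large)   # dies per wafer: 120 vs. 26
    # a = 3: yields 0.492 and 0.113, ratio about 20 (the example's 20.1 reflects rounded yields)
    # a = 4: yields 0.482 and 0.095, ratio about 23.4
    print(a, y_small, y_large, cost_ratio)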
The aforementioned effect of die size on yield is widely known and duly emphasized in
VLSI design courses. Another cost factor associated with yield, however, is often
ignored: low yield leads to a much higher testing cost if a given overall part quality is to be
achieved. This is illustrated in the following example.
Example 5.3: Effects of yield on testing and part reliability Assuming a part yield of 50%,
discuss how achieving an acceptable defective part rate of 100 defects per million (DPM) affects
the part cost. Include all factors contributing to cost.
Solution: Consider manufacturing 2M parts of which 1M are expected to be defective, given the
50% yield. To achieve the goal of 100 DPM in parts shipped, we must catch 999,900 of the 1M
defective parts. Any testing process is imperfect in that the test will miss some of the defects
(imperfect test coverage) and will also generate a number of false positives. Thus, we require a test
coverage of 99.99%. Going from a coverage of 99.9% (a fraction 10–3 or 0.1% of the defects
missed) to 99.99% (10–4 or 0.01% missed), for example, entails a significant investment in test
development and application times. False positives do not constitute a major cost in this particular
context, because discarding another 1-2% of the parts due to false positives in testing does not
change the scale of the financial loss.
To make the discussion in the solution to Example 5.3 more quantitative, we need to
model the testing cost as a function of test coverage (Fig. 5.6). This modeling cannot be
done in general, as testing cost depends on the tested circuit’s functionality and
implementation technology. There is a significant body of research, however, to assist us
with this task in specific cases of interest [Agra01].
Fig. 5.6 Testing cost rises sharply with a reduction in the desired fraction of missed defects (from 10–1 down to 10–5).
Defects are of two main types. Global or gross-area defects result from scratches (e.g.,
from wafer mishandling), mask misalignment, or over/under-etching. Such defects can be
eliminated or minimized by appropriate provisions and process adjustments. Local or
spot defects result from process imperfections (e.g., extra or missing material), process
variations, or effects of airborne particles. Even though spot defects are harder to deal
with, not all such defects lead to structural or parametric damage. The actual damage
suffered depends on the location and extent of the defect relative to feature size.
Two examples of defect modeling are depicted in Fig. 5.7. Excess material deposited on
the die surface can cause physically proximate conductors to become connected. If we
model extra-material defects as circles, then the lightly shaded rectangular regions in Fig.
5.7a indicate possible locations for the center of the defect circle of a certain size that
would lead to improper connectivity. Pinhole defects result from tiny areas where
material may be missing (due to burst bubbles, for example). This may cause problems
because missing dielectric material between two vertically adjacent conductors may lead
to their becoming connected. Critical regions for pinhole defects, shown as small lightly
shaded squares in Fig. 5.7b, correspond to overlapping conductors that are separated by a
thin dielectric layer.
Under such assumptions, the modeling process consists of determining the likelihood of
having defects that fall in the corresponding critical regions, based on some knowledge
about defect kind and size distributions.
Fig. 5.7 Critical areas (lightly shaded) for extra-material defects (a) and pinhole defects (b).
Fig. 5.8 A sample defect size distribution (defects/cm² versus defect diameter in nm) for an overall defect rate of 0.3/cm².
A commonly used model takes the density of defects of diameter x to be proportional to k/x^p for
sizes beyond the peak of the distribution. The exponent parameter p is typically in the range
[2.0, 3.5] and k is a normalizing constant. Figure 5.8 depicts a sample defect size distribution,
assuming an overall defect density of 0.3/cm².
Most components or systems do not have a constant failure rate over their lifetimes. A
simplified model that accounts for variations in the failure rate over time, known as the
bathtub curve, is based on the hypothesis that three factors contribute to failure. The first
factor, infant mortality, is due to imperfections in system design and construction that
may lead to early failures. Taking the analogy of a new car, mere factory
inspections and testing are inadequate for removing all defects, so quite a few
defective or low-quality cars (the so-called “lemons”) are marketed and sold. If the
particular car you buy survives this early phase, then it will likely function without much
trouble for many years.
The second factor, random failures, can arise at any time during a component’s or
system’s life due to environmental influences, normal stresses, or load conditions. The
constant failure rate is often used to model this phase of useful life.
The third factor is the wearing out of devices or circuits, leading to higher likelihood of
failures as a component or system ages. As depicted in Fig. 5.btc, the wearout effect is
more pronounced for mechanical devices than for electronics. In fact, much computer and
communication equipment becomes obsolete and is discarded so quickly that wearout isn't a
significant concern. On the other hand, fatigue or wearout is a major concern for aircraft
parts, including those forming the fuselage, and is dealt with by preventive maintenance
and periodic replacement. Interestingly, aging or deterioration is not limited to hardware
but has also been observed in software, owing to the accumulation of state information
(from setting of user preferences, updates, and extensions).
Using colorful expressions such as “the bathtub curve does not hold water” [Wong88],
reliability researchers have been pointing out the weaknesses of the bathtub curve model
for quite a long time. Our discussion of the bathtub curve is motivated by the fact that it
provides a useful pedagogical tool for drawing attention to infant mortality (and hence
the importance of rigorous post-manufacturing tests and burn-in tests) and wearout (often
avoided by preventive maintenance and early retirement of devices or systems that are
prone to deterioration with age). It also tells us why the constant failure rate assumption
might be appropriate during most systems’ post-burn-in, pre-wearout, useful lives.
Referring to Fig. 5.burn1, we note the effect of infant mortality on the reliability function,
driving home the point that unless we deal with the infant mortality problem, achieving
high reliability would be impossible.
Fig. 5.burn1 Reliability (in percent) over a span of many years, showing the effect of infant mortality when there is no significant wearout.
In order to expose existing and latent defects that lead to infant mortality, one needs to
test a component or system extensively under various environmental and usage scenarios.
An alternative to such extended testing, which may take an unreasonable amount of time,
is to expose the product to abnormally harsh conditions in an effort to accelerate the
exposure of defects. The name “burn-in” comes from the fact that for electronic circuits,
testing them under high temperatures (literally in ovens) is commonly used, given that
intense heat can accelerate the failure processes. In the extended sense, “burn-in” refers
to any harsher-than-normal treatment, including using greater loads, higher clock
frequencies, excessive shock and vibration, and so on.
The ovens used for high-temperature burn-in testing of electronic devices and systems are
quite elaborate and expensive, as they require fine controls to avoid damaging sensitive
parts in the circuits under test.
As depicted in Fig. 5.burn2, components that survive burn-in testing will be left with
very few residual defects that could potentially lead to early failures.
Fig. 5.burn2 Reliability (in percent) over 10 years for components burned in for 3 years versus normal components with no burn-in.
Besides initial or manufacturing imperfections, wear and tear during the course of a
device’s lifetime can lead to the emergence of defects. A harsh operating environment or
excessive load may speed up the development of defects. Such conditions can sometimes
be counteracted by operational measures such as temperature control, load redistribution,
or clock scaling. Radiation-induced defects can be minimized by proper shielding or
hardening (see Chapter 7) and those resulting from mishandling, shock, or vibration can
be mitigated by encasing, padding, or mechanical insulation.
One of the most commonly used strategies for active defect prevention is periodic or
preventive maintenance. Preventive maintenance forestalls latent defects from developing
into full-blown defects that produce faults, errors, and so on. To grasp the role of
preventive maintenance for computer parts, consider that passenger aircraft parts are
routinely replaced according to a fixed maintenance schedule so as to avoid fatigue-
induced failures. So, an aircraft engine may be replaced at the end of its nominal service
period, even though it exhibits no signs of impending failure. Referring to the bathtub
curve of Fig. 5.btc, this is akin to resetting the clock and avoiding the wearout phase of
the curve for the replaced part. For this strategy to be effective, however, we must also
make sure to avoid the infant mortality phase of the new engine by subjecting it to
rigorous burn-in and stress testing.
Given that preventive maintenance has an associated cost in terms of personnel and lost
system functionality, many studies have been performed to optimize the maintenance
schedule under various cost models and system characteristics, including whether the
preventive maintenance is perfect (rendering the system “like new”) or imperfect (e.g.,
reducing the effective age of the system that dictates its hazard rate, but not fully resetting
it to zero [Bart06]). Often, the resulting models for maintenance optimization are too
complex for analytical solution, necessitating the use of numerical solutions.
Problems
6 Defect Circumvention
“We grow tired of everything but turning others into ridicule, and
congratulating ourselves on their defects.”
William Hazlitt
As the densities of integrated circuits and memory devices rose through decades
of exponential growth, defect circumvention methods took on increasingly
important roles. Today, it is nearly impossible to build a defect-free silicon wafer
or a perfectly uniform magnetic-disk platter. So methods for detecting defective
areas, and avoiding them via initial configuration or subsequent reconfiguration,
are in widespread use. Defect circumvention methods covered in this chapter have
a great deal in common with switching and reconfiguration schemes employed at
the level of modules or nodes in parallel and distributed systems. Our focus here
is on methods that are particularly suited to fine-grain circumvention.
When a wafer emerges from the manufacturing process, visual inspections are performed
to identify obvious defects. During this phase, the inspector (human or machine) focuses
on the more problematic areas, such as the edges of a wafer.
Providing redundant components or cells, plus a capability to avoid or route around bad
elements is one way of avoiding defects. This approach is best-suited to systems that
have a regular or repetitive structure on the die. Examples include memories, FPGAs,
multicore chips, and chip-multiprocessors. Irregular or random logic implies greater
redundancy arising from replication, with the interconnect problem exacerbated by the
need for the replicated structures not to be too close to each other (to minimize common-
cause defects).
Example 6.1: Disk sector remapping Assume that bad disk sectors are detected with 100%
probability and that there is a hard limit on the number of remapped sectors due to performance
concerns. Suggest a reliability model for the disk.
Solution: To be provided.
Early semiconductor memories were less reliable than their immediate predecessors
(magnetic core memories). Thus, methods of dealing with defective bit cells in such
memories were developed early on. One class of methods involving error-
detecting/correcting codes will be discussed in Chapters 13 and 14. Here, we focus on
defect circumvention methods that allow us to bypass defective memory cells, assuming
that their presence is detected via appropriate tests or via concurrent error detection.
A commonly used scheme is to provide the memory array (as a whole or in subarrays of
smaller size) with spare rows and columns of cells. In the example of Fig. 6.masrc, the
memory array is shown to consist of two subarrays, each with its dedicated spare rows
and columns. When a bad memory cell is detected, the entire row or column containing it
is replaced with one of the spares. The choice of using a spare row or a spare column is
arbitrary when there is an isolated bad cell, whereas in the case of multiple cell defects in
the same row/column, one approach can be more efficient than the other. Switches at the
periphery of the array or subarray allow external connections to be routed to the spare
row/column in lieu of the one containing the bad memory cell(s). There are also defects
in wiring and other row/column access mechanisms that may disable an entire row or
column, in which case the choice of replacement is obvious.
Let us focus on an array or subarray with m data rows and s spare rows. Assuming perfect
switching and reconfiguration, the redundant arrangement can be modeled as an m-out-
of-(m + s) system. The modeling becomes somewhat more complex when we have both
spare rows and columns, but the relevant models are still combinatorial in nature.
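Under this m-out-of-(m + s) model, the survival probability of the (sub)array is a simple binomial sum. The following sketch computes it for illustrative values, with r denoting the assumed probability that an individual row (including its share of the switching) remains defect-free:

from math import comb

def m_out_of_n(m, n, r):
    """Probability that at least m of n identical, independent rows are good."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(m, n + 1))

m, s, r = 512, 4, 0.999               # data rows, spare rows, per-row survival probability
print(m_out_of_n(m, m + s, r))        # survival probability of the redundant array
print(r**m)                           # for comparison: the same array with no spare rows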
Given a particular pattern of memory cell defects, finding the optimal reconfiguration is
nontrivial. We will discuss the pertinent methods in connection with yield enhancement
for semiconductor memory chips in Chapter 8.
Example 6.2: Reliability modeling for redundant memory arrays Statement to be provided..
Solution: To be provided.
Moore and Shannon, in their pioneering work on the reliability of relay circuits [Moor56],
showed how one can build arbitrarily reliable circuits from unreliable, or in their words,
“crummy,” relays. Consider relays that are prone to short-circuiting when they are
supposed to be open. Let the probability of such an improper short-circuiting event be p.
Then, the relay circuit of Fig. 6.M&S will experience a similarly defective behavior (i.e.,
short-circuiting) with probability h(p) = (2p – p²)², because each of its two series stages,
consisting of a parallel pair of relays, shorts with probability 2p – p².
It is readily verified that h(p) < p, provided that p < 0.382. In other words, as long as
each relay isn't totally unreliable (a relay with p ≈ 1/3 is crummy indeed), some
improvement in behavior is achieved via the bridge circuit of Fig. 6.M&S with four-fold
redundancy. Recursive application of this scheme will lead to arbitrarily reliable relay
circuits having the reliability function h(h(h( … h(p)))).
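The effect of recursive application is easy to see numerically. The sketch below iterates h(p) = (2p – p²)² a few times for values of p below, near, and above the 0.382 threshold; below the threshold the composite short-circuit probability plunges toward 0, while above it the construction makes matters worse:

def h(p):
    # Short-circuit probability of the 4-relay circuit: two parallel pairs in series
    # (arrangement consistent with the (3 - sqrt(5))/2 ~ 0.382 threshold quoted above).
    return (2 * p - p * p) ** 2

for p in (0.3, 0.382, 0.45):
    q, trace = p, [round(p, 4)]
    for _ in range(5):
        q = h(q)
        trace.append(round(q, 6))
    print(p, trace)   # 0.3 decays toward 0; 0.382 barely moves; 0.45 grows toward 1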
FPGA and FPGA-like devices are particularly suitable for defect circumvention methods
via removal (bypassing). As shown in simplified form in Fig. 6.FPGA1, an FPGA
consists of an array of configurable logic blocks (CLBs) that have programmable
interconnects among themselves and with special I/O blocks at the chip boundaries. The
programmable interconnects, or routing resources, can take on different forms in FPGAs,
with an example depicted in Fig. 6.FPGA2. Defect circumvention in such devices is quite
natural because it relies on the same mechanisms that are used for layout constraints (e.g.,
use only blocks in the upper-left quadrant) or for blocks and routing resources that are no
longer available due to prior assignment.
FPGAs are examples of circuits that are composed of multiple identical parts that are
interchangeable. Similar methods are applicable to multicore chips and chip-
multiprocessors. In the latter systems, processors and memory modules may be the units
that are bypassed or replaced. However, defects may also impact the interconnection
network connecting the processors with each other, or linking processors with memory
modules. Such networks constitute the main defect circumvention challenge in this case.
We will discuss the switching and reconfiguration aspects of such systems when we get
to the malfunction level in our multilevel model.
Solution: To be provided.
Arrays can be built from identical nodes, of which several can be placed on a single chip.
If such nodes are independent of each other and have separate I/O connections, then it
would be an easy matter to avoid the use of any defective nodes. For example, to build a
massively parallel processor out of 64-processor chips, we might place 72 processors on
each chip to allow for up to 8 defective processors. We often prefer, however, to
interconnect the nodes on the chip for higher-bandwidth communication, both on-chip
and off-chip. As shown in Fig. 6.defarray-a, use of on-chip connections can lead to
shorter and more efficient links, while also allowing more pins for each off-chip channel.
Fig. 6.defarray (a) Building blocks; (b) 2-state switches; (c) 2D array with defects.
In the following, we assume 4-port, 2-state switches depicted in Fig. 6.defarray-b. For
example, a 1D array can be constructed from such switches and a set of functional and
spare nodes, as shown in Fig. 6.array1D-a. Alternatively, we can embed mux-switches in
each of the blocks so as to select one of two inputs (from the block immediately to the
left or from the block two positions to the left) and ignore the other input, based on
diagnostic information. Such embedded switches remove single points of failure that are
associated with the nonredundant switches of Fig. 6.array1D-a and also simplify the
reliability modeling process.
The same reconfiguration schemes used for 1D arrays can be applied to 2D mesh arrays,
as depicted in Fig. 6.arr2D1, with the switches allowing a node to be avoided by moving
to a different row or column.
Assuming that we also have the capability to bypass nodes within their own rows and
columns (e.g., via a separate switching scheme not shown in Fig. 6.arr2D1), we can
salvage a smaller working array from one with spare rows and/or columns, as depicted in
Fig. 6.arr2D2-a. The heavy arrows in Fig. 6.arr2D2-b denote how rows and columns have
shifted downward or rightward to avoid the bad nodes. We will discuss both the
reconfiguration capacity and the reliability modeling of such schemes in Section 8.5 in
connection with yield enhancement methods.
Example 6.4: Reliability modeling for processor arrays To be provided based on [Parh19].
Solution: To be provided.
The notion of “crummy” components, that occupied Moore and Shannon because of
unreliable electromechanical relays, is once again front and center as we enter the age of
nanoelectronics. The sheer density of nanoelectronic circuits makes precise
manufacturing almost impossible and the effects of even minor process variations quite
serious. It is, therefore, necessary to incorporate defect circumvention methods into the
design process and the structure of such circuits.
For example, hybrid-technology FPGAs, with CMOS logic elements and very compact
but unreliable crossbar nanoswitches, need defect circumvention schemes [Robi07] to be
deemed practical. Such hybrid schemes, as depicted in Fig. 6.nano, are expected to
produce an increase of 8-fold or more in density, while providing reliable operation via
defect circumvention. As another example, the use of memory architectures with block-
level redundancy has been proposed for hybrid semiconductor/nanodevice
implementation [Stru05]. The scheme uses error-correcting codes for defect tolerance, as
opposed to using them to overcome damage from operational or “soft” errors. A possible
structure is depicted in Fig. 6.memory.
Problems
6.4 Title
Intro
a. x
b. x
c. x
d. x
6.5 Title
Intro
a. x
b. x
c. x
d. x
6.6 Title
Intro
a. x
b. x
7 Shielding and Hardening
“Go on. Nothing that you can say can distress me now. I am
hardened.”
E. M. Forster, “The Machine Stops,” in The
Eternal Moment (Collection of Short Stories),
Harcourt Brace, 1928
Shielding is the act of isolating a part or subsystem from the external world, or
from other parts or subsystems, with the goal of preventing defects that are caused
or aggravated by external influences. This approach, which has been used for
decades to protect systems that operate in harsh environments, is now necessary
for run-of-the-mill digital systems, given the continually rising operating
frequencies and susceptibility of nanoscale components to electromagnetic
interference and particle impacts. As effective as shielding can be, it is often not
enough. Hardening is the complementary technique of increasing the resilience of
components with regard to the undesirable effects named above.
Shrinking feature sizes have made on-chip crosstalk a major problem. Increased clock
frequency is also an important contributing factor. At very high frequencies, the small,
distributed capacitance that exists between mutually insulated circuit nodes may lead to
an effective short to the ground, weakening the signals and affecting their ability to
perform the intended functions. Referring to Fig. 7.1a, the interwire capacitance CI can
easily exceed the load plus parasitic capacitance CL for long buses, affecting power
dissipation, speed, and signal integrity.
Materials and techniques exist for shielding hardware from a variety of external
influences such as static electricity, electromagnetic interference, or radiation. Many
advanced shielding methods have been developed for use with spacecraft computers that
may be subjected to extreme temperatures and other harsh environmental conditions.
Noteworthy among adverse conditions affecting electronic systems in space is
bombardment by high energy atomic particles.
As VLSI circuit features shrink, the radiation problem, formerly problematic only during
space missions, affects the proper operation of electronics even on earth. We will discuss
methods for dealing with the radiation problem in Sections 7.3 and 7.4.
Computing and other electronic equipment can be affected by radiation of two kinds:
electromagnetic and particle.
Particle radiation comes in a variety of forms. Alpha particles (helium nuclei) are the
least penetrating, so that even paper stops them. Beta particles (electrons) are somewhat
more penetrating, requiring the use of aluminum sheets. Neutron radiation is more
difficult to deal with, requiring rather bulky shielding. Finally, cosmic radiation comes
into play for space electronics. Besides primary radiation of the kinds just cited,
secondary radiation, arising from the interaction of primary radiation and material used
for shielding, is also of some interest.
As integrated circuits shrink in size, the damage done by high-energy particles, such as
protons or heavy ions, can be significant. Radiation ionizes the oxide, creating electrons
and holes; the electrons then flow out, creating a positive charge which leads to current
leakage across the channel. It also decreases the threshold voltage, which affects timing and
other operational parameters. It has been estimated that a one-way mission to Mars
exposes the electronics to about 1000 kilorad of radiation in total, which is near the limit
of what is now tolerable by advanced space electronics.
The most common negative impacts of radiation, and the associated terminology, are as
follows:
Single-event upset (SEU): A single ion changing the state of a memory or register bit;
multiple bits being affected is possible, but rare.
Single-event gate rupture (SEGR): A heavy ion hitting the gate region, combined with
applied high voltage, as in EEPROMs, creates breakdown.
One important point to keep in mind about enclosures used to mitigate radiation effects is
that proper care must be taken about the choice of material. Because of the possibility of
secondary radiation (radiation of a different kind produced as a result of the primary
radiation interacting with the packaging material), improper packaging may actually do
more harm than good in protecting against radiation effects.
Besides radiation, a variety of other environmental conditions can affect the proper
functioning of computer equipment. Vibrations, shocks, and spills constitute some of the
major categories of such conditions.
Vibration can be a problem when a computer system is installed in a car, truck, train,
boat, airplane, or space capsule (basically, anything that moves or spins). Certain factory
or process-control installations are also prone to excessive vibrations that may cause
loose connections to become undone or various mechanical parts to break down from
stress. Systems can be ruggedized to tolerate vibration by initial stress testing (screening
out products that are prone to fail when exposed to vibration) and use of special casing
that absorbs or neutralizes the unwanted movements.
(a) Ruggedized phone (b) Ruggedized laptop (c) Ruggedized disk drive
Protection against spills, and waterproofing in general, is technically quite simple, but of
course adds to the product cost. Watches and cameras have been marketed in waterproof
versions for many decades. The same methods can be applied to smartphones, laptops,
and other electronic devices. As mechanical moving parts, buttons, and levers disappear
from our devices, the task of waterproofing becomes simpler.
Laptop computers that have been partially ruggedized against shock and spills and are
aimed for use by children have been in existence for several years now. Other
environmental conditions against which protection may be sought include electrical noise
(needed for use in some industrial environments; see Section 7.1), radiation (see Section
7.3), and heat.
This section has not yet been written. The following paragraphs contain some of the
points to be made.
to determine the operational voltage swing. We present a control policy which achieves
our goals with minimal complexity; such simplicity is demonstrated by implementing the
policy in a synthesizable controller. Such a controller is an embodiment of a self-
calibrating circuit that compensates for significant manufacturing parameter deviations
and environmental variations. Experimental results show that energy savings amount up
to 42%, while at the same time meeting performance requirements.
Problems
8 Yield Enhancement
“As the soft yield of water cleaves obstinate stone, so to yield
with life solves the insolvable: To yield, I have learned, is to
come back again.”
Lao-Tzu
In Section 5.2, we introduced the notion of yield and explained why a small
deterioration in defect density is amplified in the way it affects the final product
cost. Despite significant strides in improving the design and manufacturing
processes for integrated circuits, yield has presented a greater challenge with each
generation of denser and more complex devices. Due to the significant impact of
yield on cost, modern production technologies for electronic devices incorporate
provisions for detecting and circumventing defects of various kinds to reduce the
need for discarding slightly defective parts. In this chapter we review defect
detection and circumvention methods that are particularly suitable for the goal of
yield enhancement in digital circuits and storage products.
Yield models are combinatorial in nature and range from primitive to highly
sophisticated. Let us begin with a very simple example.
Example 8.1: Modeling of yield Consider a square chip area of side 1 cm filled with parallel,
equally spaced nodes with width and separation of 1 μm. Assume there are an average of 10
random defects per cm². Defects are of the extra-material kind, with 80% being small defects of
diameter 0.5 μm and 20% being larger defects of diameter 1.5 μm. What is the expected yield of
this simple chip?
Solution: The expected number of defects on the chip is 10 (8 small, 2 large). Small defects
cannot lead to shorts, so we can ignore them. A large defect leads to a short if its center is within a
0.5-μm band halfway between two neighboring nodes. So we need to compute the probability of at
least 1 large defect appearing within a critical area of 0.25 cm², given an average of 2 such defects
in 1 cm². Because each of the 2 defects falls in the critical area with probability 1/4, the probability
of having at least 1 large defect in that area is 1 – (3/4)² = 7/16, giving a yield of 9/16 ≈ 56%.
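The arithmetic of Example 8.1 is captured by the short sketch below, which also shows, for comparison, the yield under the alternative (and commonly used) assumption that the number of large defects landing in the critical area is Poisson distributed with the same mean:

from math import exp

large_defects = 2          # expected number of large (1.5 um) defects on the 1 cm^2 chip
critical_fraction = 0.25   # fraction of chip area in which a large defect causes a short

# Treating the 2 large defects as landing independently and uniformly on the chip:
p_fail = 1 - (1 - critical_fraction) ** large_defects
print("yield =", 1 - p_fail)                     # 9/16, about 0.56

# Alternative assumption: Poisson-distributed defect count with the same mean
print("Poisson-model yield =", exp(-large_defects * critical_fraction))   # about 0.61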
Most yield models in practical use are based on defect distributions that provide
information about the frequencies and sizes of defects of various kinds. They then take
the exact circuit layout or some rough model of it into account in deriving critical areas
for each defect type/size. The ratio of the critical area to the total area is a measure of the
sensitivity of the particular layout to the corresponding defect type/size.
In automated tools for electronic circuit design, floorplanning and routing stages affect
the resulting yield. Thus, VLSI layout must be done with defect patterns and their
impacts in mind. Designers can mitigate the effect of extra- and missing-material defects
by adjusting the rules for floorplanning and routing. For example, wider wires are less
sensitive to missing-material defects and narrower wires are less likely to be shorted to
others by extra material, given the same center-to-center separation (Fig. 7.matdef). The
examples above indicate that designers face a complex array of optimizations and trade-
offs, as they must strike a balance with regard to sensitivity to different defect types.
Different chip layout/routing designs differ in their sensitivity to various defect classes.
Because of defect clustering, one good idea is to place blocks with similar sensitivities to
defects apart from each other.
Fig. 7.matdef Killer and latent defects caused by extra and missing material: (a) missing material; (b) defects with wider wires; (c) defects with narrower wires.
One approach to modeling the impact of defects on yield is to derive critical areas in the
layout where the presence of a defect of a given size would disable the circuit. The gray
regions in Fig. 7.critarea represent the results of such an effort for small extra-material
defects of a specific diameter and a larger defect of the size shown. The small defect is
seen to be noncritical in most areas, causing shorts between wires/vias only in the small,
fairly narrow gray regions shown. So, there is a relatively low probability that the small
defects would lead to an unusable chip. The larger defect, on the other hand, can lead to
shorts when centered in a significant portion of the circuit segment shown, making it a
killer defect with high probability.
The fraction of the chip area that is critical with respect to various defect sizes, combined
with information on the distribution of such defects (Fig. 5.8), allows us to compute the
overall probability that the chip will be rendered unusable by extra-material or missing-
material defects. If changes in the layout cannot improve an unreasonably low yield, then
redundancy techniques, discussed elsewhere in this chapter, might be called for.
Given a particular pattern of defective memory cells (bits), such as the dark cells in Fig.
memsrc-a, we would like to know if the available spare resources are adequate for
circumventing the defects. In other words, we want to find out an assignment of spare
rows and columns to the defective cells so that all defects are covered, if it exists, or to
conclude that the circumvention capacity of the system has been exceeded. For the
example set of 7 defects in Fig. memsrc-a, the assignment can be readily found by
inspection: use the spare columns to cover the defects in columns 2 and 4 (numbering
from 0) and the spare rows to circumvent the defects in rows 3 and 5. We may also be
interested to find the optimal assignment, that is, one that used a minimal number of
spare resources, in case more defects must be circumvented in future.
(a) Memory array, with spare rows/columns (b) A representation of the defect pattern
Fig. 8.memsrc Memory array with spare rows and columns, and the bipartite
graph representing the defect pattern shown.
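For small arrays, the adequacy of a given complement of spares can be checked by direct search. The Python sketch below is a minimal illustration, not an efficient repair-allocation algorithm; the defect coordinates are hypothetical and merely mimic the pattern described above. It tries every choice of spare rows and covers the remaining defects with spare columns.

from itertools import combinations

def repairable(defects, spare_rows, spare_cols):
    """Check whether defective cells (row, col) can be covered by at most
    spare_rows row replacements and spare_cols column replacements.
    Exhaustive over row choices; fine for small arrays."""
    rows = sorted({r for r, _ in defects})
    for k in range(min(spare_rows, len(rows)) + 1):
        for chosen in combinations(rows, k):
            remaining_cols = {c for r, c in defects if r not in chosen}
            if len(remaining_cols) <= spare_cols:
                return True, set(chosen), remaining_cols
    return False, None, None

# Hypothetical defect coordinates mirroring the 7-defect pattern in the text.
defects = [(3, 0), (3, 6), (5, 1), (5, 7), (0, 2), (2, 4), (6, 4)]
print(repairable(defects, spare_rows=2, spare_cols=2))
# -> (True, {3, 5}, {2, 4}): repair rows 3 and 5, columns 2 and 4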
The latter approach for bypassing a single defective switch is depicted in Fig. 8.red1Dsw,
where we see that even though the switching cell 3 is inoperative, communication
between neighbors among the remaining processors via the switching network is not
interrupted. However, the unusability of the switching cell 3 also makes processor 2
inaccessible. The use of distributed switching, as shown in Fig. 6.array1D-b, obviates the
need for considering redundancy schemes for the switches and for more complicated
models with separate provisions for switching reliability.
Whereas the 1D arrays discussed above have no limit on the array size and the number of
spares, the 2D array reconfiguration schemes of Fig. 6.arr2D1 are more constrained,
owing to the more rigid connectivity requirements between processor nodes. Referring to
Fig. 6.arr2D2-b, we note that a particular pattern of defects can be circumvented if
straight, nonintersecting, nonoverlapping paths (the solid arrows) can be drawn from the
spare row or column elements to each defective element. The 7 “compensation paths,”
shown as heavy arrows in Fig. 6.arr2D2-b, do not intersect or overlap, indicating that the
7 defective processors can be circumvented, as demonstrated in the same figure.
The discussion above is based on the assumption of 2-way shift-switching at the edges of
the array, so that a row/column is either connected straight through or with a one-position
shift downward/rightward (Fig. 8.shift-a). It is also possible to use 3-way shift-switching
at the edges (Fig. 8.shift-b), which would allow the spare row/column to be viewed as
being on either side of the array, depending on the defect pattern. This added flexibility
increases the size of the smallest noncircumventable defect pattern from 3 to 4 defective
processors, thus improving yield.
Example 6.x: Reliability modeling for processor arrays To be provided based on [Parh19].
Solution: To be provided.
We can go beyond the 3-defect limit for reconfigurable 2D arrays by providing spare
rows and columns on all array boundaries, that is, spare rows at the top and bottom and
spare columns on either side. Figuring out the worst-case defect pattern in this case is left
as an exercise.
As in most engineering problems, the optimal solution method for a particular application
and defect model may be a mix of the various methods available to us. Intuitively, we can
think of the effectiveness of a hybrid approach as being due to each method having
some weaknesses that are covered by the other(s).
The effectiveness of the hybrid approach just discussed can be explained thus. Row and
column spares are very effective for large numbers of defects when the defects cluster along
rows and columns. Error-correcting codes can overcome widespread random defects, as
long as no single word has too many defective cells. The latter weakness is covered by
row spares that allow us to circumvent an entire word (and several other words in the
same row).
As devices and interconnects are scaled down, integrated circuits become more error-
prone and vulnerable to both external influences and internal interference. One important
reason for such errors and vulnerabilities is manufacturing process variations [Ghos10].
Process imperfections cause transistors, wires, and other circuit elements to have
imperfect shapes, which can be viewed as mild defects. When circuit elements
are relatively large, small imperfections do not cause serious variations in electrical
properties, such as resistance or capacitance. However, for a tiny element, a small
irregularity in shape can translate into relatively large variations in electrical parameters, as
well as large variations between supposedly identical elements.
The same mechanisms that make process variations more serious in modern VLSI than in
previous generations of circuits may also lead to massive numbers of defects or new
kinds of defects that have not been observed before [Breu04], [Siri04].
Example 8.y: Effect of process variations on wire resistance Consider a wire of width 100 nm
on an IC chip. Process variations may cause the width of the wire along up to half of its length to
become either as small as 50 nm or as large as 150 nm. Quantify the change in the wire’s
resistance in the worst case.
Solution: Assuming no variation in the thickness (depth) of the wire, the resistance of a segment is
inversely proportional to its width, doubling when the width is halved. In the worst case, half of the wire
will retain its original resistance R/2 and the other half will have a resistance ranging from R/3 to R.
Thus, the total resistance of the wire will range from R/2 + R/3 = 5R/6 to R/2 + R = 3R/2, placing
the variations relative to the original resistance R in the interval [–17%, +50%], with a factor of 1.8
separating the maximum and minimum resistance values.
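The two extreme cases are confirmed by the short Python sketch below (purely illustrative).

def worst_case_resistance(R=1.0, w_nominal=100, w_min=50, w_max=150):
    """Half the wire keeps its nominal width; the other half is narrowed or
    widened.  Segment resistance scales as 1/width (constant thickness assumed)."""
    half = R / 2
    r_high = half + half * (w_nominal / w_min)   # 3R/2 when narrowed to 50 nm
    r_low  = half + half * (w_nominal / w_max)   # 5R/6 when widened to 150 nm
    return r_low, r_high

lo, hi = worst_case_resistance()
print(lo, hi, hi / lo)   # 0.833..., 1.5, factor of 1.8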
Problems
b. Is it advantageous to provide more than one spare row or column on each side of the array?
c. Would the defect tolerance capability change if both spare rows and spare columns were
on one side of the array?
a.
b.
Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).
Plutarch
“A fault that humbles a man is of greater value than a virtue that puffs him up.”
Anonymous
9. Fault Testing
10. Fault Masking
11. Design for Testability
12. Replication and Voting
9 Fault Testing
“As long as there are tests, there will be prayer in schools.”
Anonymous
“To test, or not to test; that is the question: Whether ‘tis nobler for
the tester’s soul to suffer the barbs and snickers of outraged
designers, or to take arms against a sea of failures, and by
testing, end them? To try: to test; to test . . .”
B. Beizer, Software Testing Techniques
Fault testing is performed in three contexts. Engineering tests aim to ascertain whether a
new system (a prototype, say) is correctly designed. Manufacturing tests are performed to
establish the correctness of an implementation. Maintenance tests, performed in the field,
check for correct operation, either because some problem was encountered (corrective
maintenance) or else in anticipation of potential problems (preventive maintenance). As
shown in Fig. 9.1, fault testing entails the three steps of test generation, test validation,
and test application. We present a brief overview of these three steps in the rest of this
section, later on focusing exclusively on test generation.
Test generation entails the selection of input test patterns whose application to the circuit
would expose the fault set of interest via observing the circuit’s outputs. The set of test
patterns may be preset, meaning that the entire set is applied before reaching a decision,
or adaptive, where the selection of a test to apply depends on previous test outcomes,
with test application stopping as soon as enough information is available.
Fig. 9.1 The three contexts of fault testing (engineering, manufacturing, maintenance) and its key aspects: fault model (switch- or gate-level; single/multiple stuck-at, bridging, etc.), fault coverage (from none or go/no-go checkout to full resolution), diagnosis extent, test-generation algorithm (D-algorithm, Boolean difference, etc.), simulation (parallel, deductive, or concurrent software, or a hardware simulation engine), fault injection, manual versus automatic application (ATE), test mode (off-line or built-in self-test), and concurrent on-line testing (self-checked design).
If we view the circuit or system under test as a black box, perhaps because we know
nothing about its internal design, we opt for functional testing. For example, functional
testing of an adder entails the application of various integers as inputs and checking that
the correct sum is produced in each case. A functional test is exhaustive if all possible
combinations of inputs are applied. Exhaustive testing is practical only for circuits with a
relatively small number of inputs (4-bit or 8-bit adder, but not 32-bit adder). Random
testing entails the selection of a random sample of input test patterns, which of course
provides no guarantee of catching all faults. In heuristic functional test generation, we
pick the tests to correspond to typical inputs as well as what we believe to be problematic
or “corner” cases. In the case of an adder, our selections may include both positive and
negative integers, small and large numbers, values that lead to overflow, inputs that
generate sums of extreme values, and, perhaps, inputs that generate carry chains of
varying lengths.
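The following Python sketch illustrates this style of heuristic test selection for an assumed 8-bit unsigned adder; the particular corner cases (extreme values, overflow-producing pairs, and 0x55/0xAA patterns that exercise carry chains) and the random-sample size are just one plausible choice, not a prescribed test set.

import random

WIDTH = 8
MASK = (1 << WIDTH) - 1

def adder_under_test(a, b):
    # Stand-in for the unit under test; modeled ideally here, so all checks pass.
    return (a + b) & MASK, (a + b) >> WIDTH          # sum and carry-out

def heuristic_test_patterns(n_random=20, seed=1):
    corners = [0, 1, MASK, MASK >> 1, (MASK >> 1) + 1, 0x55, 0xAA]
    tests = [(a, b) for a in corners for b in corners]           # corner cases
    rng = random.Random(seed)
    tests += [(rng.randint(0, MASK), rng.randint(0, MASK))       # random sample
              for _ in range(n_random)]
    return tests

for a, b in heuristic_test_patterns():
    s, c = adder_under_test(a, b)
    assert s == (a + b) & MASK and c == (a + b) >> WIDTH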
Once a set of tests has been generated, we need test validation to ensure that the chosen
tests will accomplish our goal of complete fault coverage or high-probability detection.
Some approaches to test validation are theoretical, meaning that they can provide
guaranteed results within the limitations of the chosen fault model. Experimental test
validation, which may provide partial assurance, is done via simulation (modeling the
circuit, with and without faults) or fault injection (purposely introducing faults of various
kinds in the circuit or system to see if they are detected).
Following validation, we enter the domain of test application, where we expose the
circuit or system under test to the chosen inputs and observe its outputs. When test
application is externally controlled, the circuit or system is placed in test mode and its
behavior is observed for an extended period of time. This kind of manual or automatic
test application is known as off-line testing. Increasingly, for modern digital systems, test
application is internally controlled via built-in provisions for testing and testability
enhancement. Internally controlled test application can be off-line, meaning that a special
test mode is entered during which the circuit or system ceases normal operation, or on-
line, where testing and normal operation are concurrent.
A circuit or system whose testing requires relatively low effort in test generation and
application scores high on testability. We will see in Chapter 11 that testability can be
quantified, but, for now, we are using testability in its qualitative sense. The set-up for
testing is depicted in Fig. 9.2. Testability requires that each point within the circuit under
test be controllable (through paths leading from primary inputs to that point) and
observable (by propagating the state of the desired point to one or more primary outputs).
Redundancy in circuits often curtails controllability and observability, thus having a
negative impact on testability.
Referring to Fig. 9.2, test patterns may be randomly generated, come from a preset list, or
be selected according to previous test outcomes. Test results emerging at the circuit’s
outputs may be used in raw form (implying high data volumes) or compressed into a
signature before the comparison. The reference value can come from a “gold” or trusted
version of the circuit that runs concurrently with it or from a precomputed table of
expected outcomes. Finally, test application may be off-line or on-line (concurrent).
Complete, high-coverage testing is critical, because any delay in fault detection has
important financial consequences. A useful rule of thumb: a fault caught at the component
level costs a few dollars; one that makes it to the circuit-board level, tens of dollars; one
that reaches the system level, hundreds of dollars; and one that requires in-field corrective
action, thousands of dollars.
Except perhaps in the case of exhaustive functional testing, test coverage may be well
below 100%, with reasons including the large input space, model inaccuracies, and
impossibility of dealing with all combinations of the modeled faults. So, testing, which
may be quite effective in the early stages of system design, when there may be many
residual bugs and faults, tends to be less convincing when bugs/faults are rare.
Paraphrasing Edsger W. Dijkstra, who made an oft-quoted statement in connection with
program bugs, we can say that testing can be used to show the presence of faults, but
never to show their absence. Also relevant is this observation by Steve C. McConnell:
“Trying to improve software quality by increasing the amount of testing is like trying to
lose weight by weighing yourself more often.”
At the gate or logic level, the most widely considered faults are the so-called “line stuck”
faults, where a circuit line/node assumes a constant logic value, independent of the
circuit’s inputs. We will focus on these stuck-at-0 (s-a-0) and stuck-at-1 (s-a-1) faults in
the rest of this chapter. For example, Fig. 9.3a shows the upper input of the rightmost
AND gate suffering from an s-a-0 fault. Line bridging faults result from unintended
connection between two lines/nodes, often leading to wired OR/AND of the respective
signals (Fig. 9.3a). Line open faults (bottom line in Fig. 9.3a) can sometimes be modeled
as an s-a-0 or s-a-1 fault. Delay faults (excessively long delays for some signals) are less
tractable than the previous fault types. Coupling and crosstalk faults are other examples
of fault types included in some models.
(a) Logic circuit and fault examples (b) Testing for a particular s-a-0 fault
The main ideas behind test design are controlling the faulty point from primary inputs
and propagating its behavior to some primary output. Considering the s-a-0 fault shown
in Fig. 9.3b, a test must consist of input values that force that particular line to 1 and then
propagate the resulting value (1 if the line is healthy, 0 if it is s-a-0) to one of the two
primary outputs. The process of propagating the 1/0 value to a primary output is known
as path sensitization. In the example of Fig. 9.3b, the path from the s-a-0 line to output K
is sensitized by choosing the lower AND gate input to be 1 (a 0 input inhibits
propagation, because it makes the AND gate output 0, regardless of the value on the
upper input) and the lower OR gate input to be 0.
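For circuits of modest size, the control-and-propagate requirement can be checked by brute force: simulate the fault-free circuit and the faulty circuit side by side and keep any input pattern on which they differ. The Python sketch below does this for a small full-adder-like circuit assumed for illustration (it is not necessarily the circuit of Fig. 9.3, and its internal line names only loosely follow those used in this chapter).

from itertools import product

def circuit(a, b, c, fault=None):
    """A small 3-input, 2-output circuit (assumed for illustration).
    'fault' names an internal line forced to 0 or 1, e.g. ('q', 0)."""
    def line(name, value):
        return fault[1] if fault and fault[0] == name else value
    m = line('m', a ^ b)
    q = line('q', m & c)         # AND-gate output of interest
    r = line('r', a & b)
    s = line('s', m ^ c)         # sum-like output
    k = line('k', q | r)         # carry-like output
    return s, k

def tests_for(fault):
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, fault=fault)]

print(tests_for(('q', 0)))   # inputs that expose line q stuck-at-0: (0,1,1) and (1,0,1)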
The path sensitization method just discussed is formalized in the D-algorithm [Roth66],
which is based on the D-calculus. A 1/0 on the logic circuit diagram (Fig. 9.3b) is
represented as D and 0/1 is represented as D̄. Then, D and D̄ values are propagated to
outputs via forward tracing (path sensitization) and towards the inputs via backward
tracing, eventually producing the required tests. In applying the D-algorithm, circuit lines
must be considered and labeled separately, as depicted in Fig. 9.DAlg-a. This is required
because in some cases, electrically connected lines (such as M, N, and P in Fig. 9.DAlg-
a) may not be affected in the same way by a fault on one of them.
(a) Circuit with all lines labeled (b) Reconvergent fan-out example
Fig. 9.DAlg Example circuit for, and a problem with, the D-algorithm.
The worst-case time complexity of the D-algorithm is exponential in the circuit size,
given that it must consider all path combinations. The presence of XOR gates in the
circuit causes the behavior to approach the worst case. However, the average case is
much better and tends to be quadratic in the circuit size. PODEM also has exponential
time complexity, but in the number of circuit inputs rather than the circuit size.
Once the set of possible tests for each fault of interest has been obtained, the rest of the
test generation process is as follows. We construct a table whose rows represent test
patterns (circuit input values) and whose columns correspond to faults. An “x” is placed
at a row-column intersection if the test associated with that row detects the specific fault
associated with the column; a hyphen is placed otherwise. A partial table of this kind is
shown in Table 9.1. Our task is completed upon choosing a minimal set of rows that
cover all faults. This covering problem can be solved quite efficiently in a manner similar
to choosing prime implicants for a minimal sum-of-products representation of a logic
function. For example, in the case of Table 9.1, if only the 4 faults shown are of interest,
then we have two minimal test sets: {(0, 0, 1), (0, 1, 1)} and {(0, 0, 1), (1, 0, 1)}.
Table 9.1 A partial fault table, with an “x” marking each fault detected by a test pattern.
  A B C     P s-a-0   P s-a-1   Q s-a-0   Q s-a-1
  0 0 0        -         -         -         x
  0 0 1        -         x         -         x
  0 1 1        x         -         x         -
  1 0 1        x         -         x         -
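Choosing a minimal set of rows is a small instance of the set-cover problem. The Python sketch below (illustrative; fault names such as 'P/0' are our shorthand for P s-a-0) finds all minimum-size test sets for the coverage data of Table 9.1 by exhaustive search.

from itertools import combinations

# Fault coverage from Table 9.1: test pattern (A, B, C) -> set of detected faults.
covers = {
    (0, 0, 0): {'Q/1'},
    (0, 0, 1): {'P/1', 'Q/1'},
    (0, 1, 1): {'P/0', 'Q/0'},
    (1, 0, 1): {'P/0', 'Q/0'},
}
faults = set().union(*covers.values())

def minimal_test_sets():
    for size in range(1, len(covers) + 1):
        found = [set(tests) for tests in combinations(covers, size)
                 if set().union(*(covers[t] for t in tests)) == faults]
        if found:
            return found
    return []

print(minimal_test_sets())
# -> the two minimal sets {(0,0,1), (0,1,1)} and {(0,0,1), (1,0,1)}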
It is easy to see that any test that detects P s-a-0 in Fig. 9.DAlg-a also detects L s-a-0 and
Q s-a-0. Such faults are said to be equivalent. In a similar manner, the faults Q s-a-1, R s-
a-1, and K s-a-1 are equivalent. Identifying equivalent faults before test generation leads
to savings in time and effort, because only one fault from each equivalence class needs to
be considered.
K = f(A, B, C) = AB ∨ BC ∨ CA (9.3.Cout1)
Intuitively, dK/dB = 1 (satisfied when A ≠ C) tells us that the value of K changes when B
changes from 0 to 1 or that K is sensitive to a change in the value of B. Conversely,
dK/dB = 0 (satisfied for A = C) indicates that the value of K is insensitive to a change in
the value of B.
Consider in Fig. 9.DAlg-a the line P being s-a-0. A stuck line behaves as an independent
variable rather than as a dependent one. So, considering P as an independent variable, the
Boolean equation for the output K can be obtained as:
K = PC ∨ AB (9.3.Cout2)
Tests that detect P s-a-0 are then the solutions of the equation P dK/dP = 1 (9.3.Psa0t),
where dK/dP denotes the Boolean difference of K with respect to P.
Similarly, tests that detect P s-a-1 are solutions to the equation P̄ dK/dP = 1 (9.3.Psa1t).
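These equations can be solved by simple enumeration. In the Python sketch below, the expression for K is the one of equation 9.3.Cout2; the function assumed for P (a fanout branch of the XOR output, so that P = A ⊕ B) is our assumption for illustration. The tests produced agree with the entries of Table 9.1.

from itertools import product

def P(a, b, c):
    return a ^ b                     # assumed: P is a branch of the XOR output

def K(a, b, c, p):
    return (p & c) | (a & b)         # output with P treated as an independent variable

def dK_dP(a, b, c):
    return K(a, b, c, 0) ^ K(a, b, c, 1)    # Boolean difference of K with respect to P

tests_sa0 = [v for v in product((0, 1), repeat=3) if P(*v) & dK_dP(*v)]
tests_sa1 = [v for v in product((0, 1), repeat=3) if (1 - P(*v)) & dK_dP(*v)]
print("P s-a-0 tests:", tests_sa0)   # (0, 1, 1) and (1, 0, 1)
print("P s-a-1 tests:", tests_sa1)   # (0, 0, 1)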
From our discussion of the Boolean difference method in Section 9.3, and in particular
from equations 9.3.Psa0t and 9.3.Psa1t, we see that the problem of generating tests for a
particular fault of interest can be converted to solving an instance of the satisfiability
(SAT) problem. The SAT problem is defined, in its decision form, as answering the
question: Is a particular Boolean expression satisfiable, that is, can it assume the value 1
for some assignment of values to its variables? According to a well-known theorem of
complexity theory [Cook71], SAT is NP-complete. That is, no polynomial-time algorithm
is known that solves an arbitrary instance of SAT, and the discovery of one is considered
highly unlikely. In fact, even
highly restricted versions of SAT remain NP-complete.
Conversion of the fault detection problem to SAT leads to the suspicion that perhaps fault
detection is also NP-complete. To show this, we must prove that an arbitrary instance of
SAT or some other NP-complete problem can be easily transformed to a fault-detection
problem. Such a proof was first constructed by Ibarra and Sahni [Ibar75] and
subsequently simplified by Fujiwara [Fuji82]. The demonstration that fault detection is an
NP-complete problem makes it unlikely that the task can be performed efficiently by
means of a general algorithm any time in the near future. We are thus motivated to seek
solutions to the problem in special cases and to devise heuristic algorithms that produce
acceptable, near-complete solutions for circuit classes of practical interest.
In the rest of this section, we provide a proof that fault detection is an NP-complete
problem. In fact, we will prove that a highly restricted form of the problem, namely
finding tests for stuck-at faults in certain 3-level circuits composed entirely of AND and
OR gates (with all primary inputs being uncomplemented), is NP-complete, by
transforming a known NP-complete problem to it. Circuits composed entirely of AND
and OR gates and lacking any complemented inputs are known as monotone circuits, so
named because if you increase the value of some of the inputs from 0 to 1, the output
either does not change or it changes from 0 to 1 (it can never go from 1 to 0). Thus, our
proof will show that fault detection in certain restricted classes of monotone logic circuits
is NP-complete, making the general problem NP-complete as well.
E = (a1 ∨ b1 ∨ c1 ∨ … ) (a2 ∨ b2 ∨ c2 ∨ … ) ⋯ (9.4.CM1)
Each of the ANDed (multiplied) clauses on the right-hand side of equation 9.4.CM1 is
the logical OR of n or fewer terms, all being either true variables or complemented
variables, but not both in the same clause. As an example, an instance of CM-SAT
expression with 7 variables might be the following, which contains two clauses with only
complemented variables and two clauses with only uncomplemented variables.
The variable vk is a new variable; thus, the CM-SAT problem obtained may have up to
twice as many variables as the original problem. Similarly, when a clause contains a
single uncomplemented variable and two complemented ones:
It is easy to see that the original 3SAT problem is satisfiable iff the derived CM-SAT
problem is satisfiable.
At the first level, we place a number of AND gates, one for each clause with
complemented inputs. Each AND receives the inputs that appear in the clause, but in
true form, and produces the complement of the clause as its output. For example, the
AND gate associated with the clause (x̄1 ∨ x̄4 ∨ x̄5 ∨ x̄6 ∨ x̄7) in equation 9.4.SAT2 will
have the output x1x4x5x6x7, which is the complement of that clause. At the second level, there are OR
gates, one receiving the outputs of the first-level AND gates as its inputs and the others
forming the remaining clauses with only uncomplemented variables. Finally, the outputs
of all the OR gates in level 2 are fed to a single AND gate in level 3. As an example, the
3-level circuit constructed as above for the CM-SAT instance of equation 9.4.SAT2 is
depicted in Fig. 9.SAT.
In the circuit of Fig. 9.SAT, an s-a-1 fault on line y1 is detectable iff there exists an input
pattern that sets all outputs of the level-1 AND gates to 0 and all outputs of the level-2
OR gates, other than the top one, to 1. It is easy to see that the test will then satisfy the
instance of the CM-SAT problem from which the circuit was derived. Given that the
conversion time from CM-SAT to fault detection is a polynomial in the number n of
variables, the fault detection problem must be NP-complete. We have thus proved:
The problem of detecting a single stuck-at fault in a 3-level monotone (AND-OR) logic circuit is NP-complete.
Even leaving the high complexity of test generation aside, it is still the case that
exponentially many (up to 2ⁿ) test patterns may be required for an n-input combinational
circuit. This may lead to a significant amount of time spent in applying and analyzing
tests. The presence of memory in the circuit expands the number of required test cases,
given that the circuit behavior is influenced by its state. To test a sequential machine, we
may need to apply different input sequences for each possible initial state. This double-
exponential complexity may render testing with 100% coverage impractical.
Memory devices are special sequential circuits for which a wide variety of testing
strategies have been devised. A simple-minded approach would be to write the all-0s and
all-1s bit-patterns into every memory word and read out to verify proper storage and
retrieval. This seems to ensure that every memory cell is capable of storing and correctly
reading out the bit values 0 and 1. The problems with this approach include the fact that it
does not test the memory access/decoding mechanism (there is no way to know that the
intended word was written into and retrieved) and does not allow for pattern-sensitive
faults, where cell operation is affected by the values stored in nearby cells. Furthermore,
modern high-density memories experience dynamic faults that are exposed only for
specific access sequences.
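As a concrete illustration of one strategy that goes beyond the all-0s/all-1s test, the Python sketch below implements a short march test in the spirit of MATS+ (the access order and data values shown are one common variant, not the only one). Such a test catches stuck-at cells and many address-decoder faults, but not all pattern-sensitive or dynamic faults.

def march_test(mem_size, read, write):
    """A simple MATS+-style march test.  'read(addr)' and 'write(addr, bit)'
    model single-bit access to the memory under test."""
    for a in range(mem_size):               # M0: ascending, write 0
        write(a, 0)
    for a in range(mem_size):               # M1: ascending, read 0 then write 1
        if read(a) != 0:
            return False
        write(a, 1)
    for a in reversed(range(mem_size)):     # M2: descending, read 1 then write 0
        if read(a) != 1:
            return False
        write(a, 0)
    return True

# Fault-free memory model used to exercise the routine.
mem = {}
print(march_test(16, read=lambda a: mem.get(a, 0),
                 write=lambda a, v: mem.__setitem__(a, v)))   # True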
The optimal way of testing memory units changes with each technology shift and
evolutionary step in their development. Given the trends in increasing size and
sensitivity, both arising from higher densities, built-in self-test appears to be the only
viable approach in the long run. A particular challenge in applying built-in self-test
methods is that any such method consumes some memory bandwidth, thus requiring
some sacrifice in performance. Memory testing continues to be an active research area.
Problems
9.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
10 Fault Masking
“Don’t find fault with what you don’t understand.”
French proverb
Fault masking can also be called fault tolerance, but we avoid using the latter term
because of its past association with the entire field of dependable computing.
There are two ways to mask faults. One way is to build upon the inherent
redundancy in logic circuits. This is akin to the redundancy observed in natural
language: you may be able to understand a sentence in this book even after
removing every vowel, or covering the lower half of every letter. Another way is
by using replicated circuits whose outputs feed a fusion or combining circuit,
often (inappropriately) called a voter. This approach leads to static, dynamic, and
hybrid redundancy methods. Masking of transient faults is also discussed.
Referring to Fig. 10.1, we note that there is a pleasant symmetry between avoidance and
masking strategies for dealing with faults (as explained in the introductory paragraph of
this chapter, we prefer to use fault masking in lieu of fault tolerance). Certain faults may
be unavoidable, while others may turn out to be unmaskable. Thus, practical strategies for
dealing with faults are often hybrid in that they encompass some avoidance and some
masking techniques.
In the 9-leaf tree of Fig. 10.1, this chapter’s focus will be on the rightmost two leaves
labeled “restored” and “unaffected.” Masking through restoration requires that a fault be
exposed, promptly detected by a monitor, and fully circumvented via reconfiguration.
This approach requires the application of dynamic redundancy, where redundant features
are built into the system but are not brought into service until they are needed to replace
faulty elements. Masking through concealment is based on static redundancy methods,
whereby redundant elements are fully integrated into the circuit in a way that they cover
for imperfections in other elements.
Interwoven redundant logic is based on the notions of critical and subcritical faults.
Referring to Fig. 10.quad-a, we see that with the given input values, if line a is s-a-0, the
circuit output will not change. We thus consider a s-a-0 a subcritical fault. The same
holds true for h s-a-0, c s-a-1, and d s-a-1. On the other hand, line b s-a-1 will change the
circuit output from 0 to 1, making it a critical fault for the input pattern shown. Because
not all faults are of the stuck-at type, henceforth we characterize each fault as a 0 → 1 flip
or a 1 → 0 flip.
We see from the preceding discussion that even a nonredundant logic circuit is capable of
masking some faults. This is both good and bad. It is good in the sense that not every
fault will affect the correct functioning of the circuit. It is bad in the sense of impacting
the timeliness of fault detection. Generally speaking, an AND gate with many inputs is
more sensitive to 1 → 0 flips, because they have the potential of causing a 1 → 0 flip at
the gate’s output. A multi-input OR gate, on the other hand, is more sensitive to 0 → 1
flips. We are thus motivated to seek methods involving alternating use of AND and OR
gates for turning this inherent masking capability of logic circuits into a general scheme
that also masks critical faults.
Consider, for example, the logic circuit of Fig. 10.quad-b with potentially critical 1 → 0
flips on either input of the shaded AND gate at the top left. Suppose we quadruplicate
this AND gate and the OR gate which it feeds, as well as all inputs and other signals, as
depicted in Fig. 10.quad-c. The four copies of any signal x are named x1, x2, x3, and x4,
with the value of x taken to be the 3-out-of-4 majority value among the four signal
copies. The connectivity pattern of inputs and other replicated signals is such that any
critical flip at the AND layer turns into a subcritical flip at the following OR layer. For
example, the critical 1 → 0 flip for a1 causes a subcritical 1 → 0 flip at the top input e1
to the OR gate it feeds at the next circuit layer.
One can show that to mask h critical faults with this alternating, interwoven AND-OR
arrangement, the number of gates must be multiplied by (h + 1)2 and the number of inputs
for each gate must increase by the factor h + 1. For h = 1, this interwoven redundant logic
scheme is known as quadded logic. Note that the alternating AND and OR layers can be
replaced by uniform NAND layers in the usual way.
For make contacts, we have a > c, while for break contacts a < c holds. No matter how
crummy the relays, that is, how close the values of a and c are to each other, one can
interconnect many of them in a redundant structure, using series and parallel elements, to
achieve an arbitrarily high reliability.
Example 10.M&S: Reliable circuits from crummy relays Consider the parallel-series quad of
make relays depicted in Fig. 10.relay-c. Derive the reliability parameters for the quad and
determine under what conditions it behaves better than a single make relay.
Solution: The a parameter of the quad circuit can be derived to be aquad = prob [connection made |
relays energized] = 2a² – a⁴. Thus, aquad > a if a > 0.62. Similarly, the quad’s c parameter can be
shown to be cquad = prob [connection made | relays not energized] = 2c² – c⁴. It is readily seen that
cquad < c for any c below 0.62, which certainly holds for a usable make relay. Thus, unless the a
parameter has an unreasonably low value, the quad circuit offers better reliability than a single make relay.
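A few lines of code confirm these expressions and the a ≈ 0.62 crossover (illustrative sketch only):

def quad_params(a, c):
    """Parallel connection of two series pairs of make relays."""
    a_quad = 1 - (1 - a * a) ** 2        # = 2a^2 - a^4
    c_quad = 1 - (1 - c * c) ** 2        # = 2c^2 - c^4
    return a_quad, c_quad

for a in (0.60, 0.62, 0.90, 0.99):
    aq, cq = quad_params(a, 0.05)
    print(a, round(aq, 4), aq > a, round(cq, 5))   # crossover near a = 0.618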
(a) Relay contact types (b) AND circuit (c) Parallel-series quad
Fig. 10.relay Relays and their use in building an AND circuit and a more
reliable make circuit.
RIFTMR/Simplex = (1 – Rm) / (1 – RTMR) = (1 – Rm) / (1 – 3Rm² + 2Rm³) = 1 / [(1 – Rm)(1 + 2Rm)] (10.3.RIF)
Assuming Rm = e^(–λt), the reliability curves for a simplex module and the TMR system are
shown as a function of λt in Fig. 10.tmr-c. We know that the MTTF for the simplex
system is 1/λ. The MTTF for the TMR system can be shown to be 5/(6λ).
The shorter MTTF of the TMR system is due to the fact that RTMR falls below Rm and stays
there for λt > ln 2, thus leading to a smaller area under the reliability curve.
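Both observations are easy to confirm numerically; the sketch below (illustrative, with λ = 1) integrates the TMR reliability curve to approximate its MTTF and evaluates both curves at the crossover point λt = ln 2.

from math import exp, log

lam = 1.0                                   # failure rate (arbitrary units)

def r_simplex(t):
    return exp(-lam * t)

def r_tmr(t):
    rm = r_simplex(t)
    return 3 * rm**2 - 2 * rm**3

# MTTF = area under the reliability curve (simple rectangle-rule integration).
dt, T = 0.001, 30.0
mttf_tmr = sum(r_tmr(i * dt) * dt for i in range(int(T / dt)))
print(round(mttf_tmr, 3))                   # close to 5/(6*lam) = 0.833
print(r_tmr(log(2) / lam), r_simplex(log(2) / lam))   # both 0.5 at the crossover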
(a) Simple TMR scheme (b) Reliability vs. Rm (c) Reliability over time
Fig. 10.tmr Triple modular redundancy (TMR), with its reliability plotted against
module reliability and over time.
One recent application of TMR is in the design of flip-flops that can withstand
single-event upsets (SEUs). As shown in Fig. 10.seuFF, triplication of the FF
itself, or of both the FF and its correction circuitry, allows single faults to be tolerated.
Dynamic redundancy, as the name implies, requires some sort of action to bypass a faulty
element, once a fault has been detected. The simplest form of dynamic redundancy
entails a fault detector tacked on to an operating unit and a mechanism that allows
switching to an alternate (standby, spare) unit (Fig. 10.dynr-a). In contrast, static or
masking redundancy (Fig. 10.tmr-a) simply hides the effects of a fault and continues
operation uninterrupted, provided that the scheme’s masking capacity is not exceeded.
The fault detector in Fig. 10.dynr-a may be of various kinds: a code checker, a watchdog
timer, or a comparator, in the event that the operational unit is itself duplicated. The latter
scheme is sometimes referred to as “pair and spare.”
[Elaborate further on fault detection.]
The standby or spare unit may be “cold,” meaning that it is powered off until needed.
When the spare is to be switched in, it must be powered on, initialized, and set to an
appropriate state to be able to continue the work of the operational unit from where it was
interrupted (preferable) or from a restart position. Alternatively, we may use a “hot”
spare which runs in parallel with the operational unit and is thus in sync with it. A hot
spare can be switched in with minimal delay, thus helping to improve availability. We
may also opt for the intermediate case of a “warm” spare that is powered on but not
completely in sync with the operational unit.
Fig. 10.dynr Dynamic (standby) redundancy and hybrid redundancy.
We can combine the advantages of static redundancy (uninterrupted operation) with those of dynamic
redundancy (lower hardware and energy costs) into a hybrid redundancy scheme, as
depicted in Fig. 10.dynr-b. Initially, the reconfiguration switch S is set so that the
outputs of units 1, 2, and 3 are selected and sent to the voting circuit. When a fault occurs
in one of the 3 operational units, the voting circuit masks the fault, but the voter output
allows the switch S to determine which of the operational units disagreed with the final
outcome. The disagreeing unit is then replaced with the spare, allowing the hybrid-
redundant system to continue its operation and to tolerate a second fault later on.
It is possible to ignore the first few instances of disagreement from one unit in the
expectation that they were due to transient faults. In this scheme, the switch maintains a
disagreement tally for each operational unit and replaces the unit only if its tally exceeds a
preset threshold. Another optimization is to switch to duplex or simplex operation when
the supply of spares has been exhausted and one of the operational units experiences a
fault. Continuing with duplex operation provides greater safety, whereas switching to
simplex mode extends the lifetime of the system.
Figure 10.sw depicts high-level designs of switches for standby and hybrid redundancy.
[Elaborate on switch design, including the self-purging variant and the associated
threshold voting scheme.]
We now summarize our discussion of static versus dynamic (and hybrid) redundancy.
Static redundancy provides immediate masking and thus uninterrupted operation. It is
also high on safety. Its disadvantages include power and area penalties and the fact that
the voting circuit is critical to correct operation of the system. Dynamic redundancy
consumes less power (especially with cold standbys) and provides longer life by simply
adding more spares. Tolerance is not immediate, causing availability and safety concerns.
Also, the assumption of longer life with more spares is critically dependent on the
coverage factor. In the absence of near-perfect coverage, the addition of more spares may
not help and may even be detrimental to system reliability. Hybrid redundancy has some of the
advantages of both schemes, as well as some of their disadvantages. The switch-voter
part of a hybrid-redundant system is both complex and critical.
We will see in Chapter 15 that, via a scheme known as self-checking design, fault
detection coverage of standby redundancy can be improved, making the technique more
attractive from a practical standpoint.
Instead of replicating a circuit or module, one may reuse the same unit multiple times to
derive several results. This strategy, sometimes referred to as retry, is particularly
effective for dealing with transient faults. Clearly, if the unit has a permanent fault, then
the same erroneous result will be obtained during each retry.
One way around the problem presented by permanent faults is to change things around
during recomputations so as to make it less likely for the same fault to lead to identical
errors, thus reducing the likelihood of the fault going undetected. For example, if an
adder is being used to compute the sum a + b, we may switch the operands and compute
b + a during the second run, or we may complement the inputs and output, thus aiming to
obtain the same result via the computation –(–a – b). Similarly, in multiplying a by the
even number 2b, we may try computing (2a) × b the second time. For arbitrary integer
operands a and b, we can find the product (2a) × ⌊b/2⌋, while initializing the cumulative
partial product to a if b0 = 1 and to 0 otherwise. Yet another alternative, assuming that a
and b are not very large, is to compute (2a) × (2b) and divide the result by 4 through
right-shifting by 2 bits.
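The idea is easy to demonstrate in code. In the Python sketch below, a toy multiplier with a hypothetical stuck output bit stands in for a unit with a permanent fault; recomputing with the modified operands (2a) × ⌊b/2⌋ plus the odd-bit correction produces a different faulty result, so comparing the two runs exposes the fault.

def faulty_mul(x, y):
    """Toy model of a multiplier whose result bit 3 is stuck at 0
    (a hypothetical permanent fault)."""
    return (x * y) & ~0b1000

def mul_with_recompute(a, b):
    r1 = faulty_mul(a, b)
    # Recompute with modified operands: (2a) * floor(b/2), then correct for odd b.
    r2 = faulty_mul(2 * a, b // 2) + (a if b & 1 else 0)
    return r1, r2, r1 == r2          # disagreement exposes the fault

print(mul_with_recompute(5, 7))      # (35, 27, False): the two runs disagree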
Both NMR and hybrid hardware redundancy have been used in practice. The Japanese
Shinkansen “Bullet” train employed a triple-duplex control system, implying 6-fold
redundancy. [Elaborate on the redundancy scheme.] Before they were permanently
retired in 2012, NASA’s Space Shuttles used computers with 5-way redundancy in
hardware and 2-way redundancy in software. The 5 hardware units originally consisted of
3 concurrently operating units and 2 spare units. Later, the configuration was changed to
4 + 1. Two independently developed software systems were used to protect against
software design faults.
One consequence of using static redundancy is reduced testability. The very act of
masking faults makes it impossible to detect them via the circuit’s inputs and outputs.
Consider, for example, the quadded logic scheme of Section 10.2. This scheme is
designed to mask a single stuck-at fault. So what happens if the redundant circuit
already contains a fault at the time of its manufacture? The fault cannot be detected by
simply testing the redundant circuit, and if a second fault develops during use (which is
really the first fault from the user’s viewpoint), it may not be masked. Similar difficulties
arise with regard to replication-based redundant systems. Thus, incorporating testability
features, of the types discussed in Chapter 11, is even more significant for systems
employing masking redundancy.
As an example, consider the TMR system of Fig. 10.tmrt-a. If any of the three units
becomes faulty after system assembly but prior to its use, then the first fault occurring
during the system’s operation may lead to an incorrect output. This is because the
presence of the first fault is not detectable using only the circuit’s primary inputs and
outputs. This problem can be fixed by using the arrangement depicted in Fig. 10.tmrt-b.
A multiplexer is used to select one of four values as the output from the redundant circuit:
option 0 selects the value produced by the voting unit, while options 1-3 select the output
supplied by one of the units 1-3. The only detrimental effect of this arrangement is the
added multiplexer delay during normal operation of the redundant circuit. This is more
than offset by the facility for testing each of the replicas initially as well as periodically
during system operation.
Problems
10.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
10.4 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
A comparator is a 2-input, 2-output logic circuit that receives the integers a and b as inputs and delivers c =
min(a, b) and d = max(a, b) as outputs. Thus, a comparator either sends the two inputs straight through or
exchanges them to achieve correctly ordered outputs.
a. Show the design of a 4-input sorting network using 5 comparators.
b. Assume that comparators can get “stuck on straight through” or “stuck on exchange,” thus not
providing the correct ordering in some cases. How can the network of part a be tested for such
stuck-type faults efficiently, assuming at most a single faulty comparator?
c. Is it possible to build a 4-input sorting network that masks the fault of any single comparator?
How, or why not?
a. Convert the circuit so that only NAND gates are used. Feel free to change the outputs to 𝑠̅ and 𝑐̅ if
it makes your job easier.
b. Draw the quadded version of the circuit in part a, using only NAND gates.
c. Is it always possible to convert a quadded AND-OR circuit into a quadded NAND circuit by
replacing each AND and OR gate with a NAND gate?
Whereas a simple circuit with small set of inputs and outputs can be tested with a
reasonable amount of time and effort, a complex unit, such as a microprocessor, would be
impossible to test solely based on its input/output behavior. Hence, nearly all modern
digital circuits are designed and built with special provisions for improved testability.
Because the cost of a fault is greatly reduced when it is caught early, as discussed in
Section 9.1, testability features appear at all levels of the digital system packaging
hierarchy, from the component or chip level, through board and subassembly levels, all
the way to the system level. Our discussions in this chapter pertain mostly to circuit- and
board-level techniques, but similar considerations apply elsewhere. In particular,
diagnosability features form important considerations at the malfunction level (see
Chapter 17).
In order for a unit to be easily testable, we must be able to control and observe its internal
points. Thus, controllability and observability are the cornerstones of testability, as we
will see in detail in Section 11.2.
To allow detection of a fault on line Li of a logic circuit (see Fig. 11.1), we must be able
both to control that line from the circuit’s primary inputs and to observe its behavior via
the primary outputs. Thus, good testability requires good controllability and good
observability. It is thus natural to define the testability T(Li) of a line Li as the product of
its controllability C(Li) and its observability O(Li). We will discuss suitable quantification
of controllability and observability shortly. However, assuming that we have already
determined C and O values in the interval [0, 1] for each line Li, 1 ≤ i ≤ n, within the
circuit, the overall testability of the circuit can be taken to be the average of the
testabilities for all points. Thus:
T = (1/n) Σi C(Li) O(Li) (11.2.test)
Note that the testability metric defined by equation (11.2.test) is an empirical measure that
does not have direct physical significance. In other words, if the testability of a circuit turns
out to be 0.23, that number does not tell us much about how difficult it would be to test the circuit.
It is only useful for comparing different circuits, different designs for the same circuit, or
different points within a circuit with regard to ease of testing. So, a design with a
testability of 0.23 may be preferred to one that has a testability of 0.16, say.
To determine the line controllabilities within a given circuit, we begin from the primary
inputs, each of which is assigned the perfect controllability of 1, and proceed toward the
outputs. The controllability of the single output line of a k-input gate is the product of the
gate’s controllability transfer factor (CTF) and the average of its input controllabilities:
C(output) = CTF × [C(input1) + C(input2) + … + C(inputk)] / k
Fig. 11.1 Logic circuit with a path to control the internal line P from the primary inputs and a path to observe P at a primary output.
The CTF of a gate depends on how easy it is to set its output to 0 or 1 at will. Taking a 3-
input AND gate of Fig. 11.2a as an example, 7 of its 8 input patterns set the output to 0
and only one pattern sets the output to 1. The relatively large difference between N(0) = 7
and N(1) = 1 indicates poor controllability at the gate’s output. In general, we use
equation 11.2.CTF to derive a gate’s CTF:
CTF = 1 – |N(0) – N(1)| / [N(0) + N(1)] (11.2.CTF)
When an equal number of input patterns set the gate’s output to 0 or 1, we have N(0) =
N(1), leading to the perfect CTF of 1. An XOR gate has the N(0) = N(1) property and is
thus a desirable circuit component in terms of testability. In the case of a line that fans out
into f lines, as in Fig. 11.2-b, the controllability of each of the output lines is given by:
A line of very low controllability constitutes a good place for the insertion of a testpoint.
Solution: The XOR gate has a CTF of 1, making the controllability of line M, and thus line P,
equal to 1. Two-input AND and OR gates have a CTF of 1/2. Thus both Q and R have a
controllability of 1/2, giving K a controllability of 1/4.
To determine the line observabilities within a given circuit, we begin from the primary
outputs, each of which is assigned the perfect observability of 1, and proceed toward the
inputs. The observability of each input line of a k-input gate is the product of the gate’s
observability transfer factor (OTF) and its output observability:
O(input) = OTF × O(output)
The OTF of a gate depends on how easy it is to sensitize a path from a gate input to the
output. Taking a 3-input AND gate of Fig. 11.3a as an example and considering a fault on
one of the inputs, only N(sp) = 1 of the 4 patterns on the other two input sensitizes a path
to the output, whereas N(ip) = 3 patterns inhibit the propagation. The relatively small
number of sensitizing options indicates poor observability of the gate’s inputs. In general,
we use equation 11.2.OTF to derive a gate’s OTF:
OTF = N(sp) / [N(sp) + N(ip)] (11.2.OTF)
When there are no inhibiting input patterns, we have N(ip) = 0, leading to the perfect
OTF of 1. An XOR gate has the N(ip) = 0 property and is thus a desirable circuit
component in terms of observability. In the case of a line that fans out into f lines, as in
Fig. 11.3b, the observability of each of the input lines is given by:
A line of very low observability constitutes a good place for the insertion of a testpoint.
Example 11.obser: Quantifying observability Derive the observabilities of lines B and P in the
logic circuit of Fig. 11.1.
Solution: Two-input AND and OR gates have an OTF of 1/2, leading to an observability of 1/2
for Q and 1/4 for P, tracing back from the primary output K. Line B has an observability of 1,
given the path consisting of two XORs (OTF = 1) from B to the primary output S.
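The propagation rules of this section are straightforward to mechanize. The Python sketch below is illustrative only: the netlist is an assumed full-adder-like circuit in the spirit of Fig. 11.1 (line names follow the worked examples), fanout branches are ignored, and the CTF/OTF values come from equations 11.2.CTF and 11.2.OTF.

def ctf(n0, n1):
    return 1 - abs(n0 - n1) / (n0 + n1)

def otf(n_sp, n_ip):
    return n_sp / (n_sp + n_ip)

# 2-input gate parameters: (N(0), N(1)) for CTF and (N(sp), N(ip)) for OTF.
GATE = {'AND': ((3, 1), (1, 1)), 'OR': ((1, 3), (1, 1)), 'XOR': ((2, 2), (2, 0))}

def out_controllability(kind, in_ctrls):
    n0, n1 = GATE[kind][0]
    return ctf(n0, n1) * sum(in_ctrls) / len(in_ctrls)

def in_observability(kind, out_obs):
    n_sp, n_ip = GATE[kind][1]
    return otf(n_sp, n_ip) * out_obs

# Assumed netlist: M = A xor B, Q = M and C, R = A and B, K = Q or R.
C_A = C_B = C_C = 1.0                               # primary inputs
C_M = out_controllability('XOR', [C_A, C_B])        # 1.0
C_Q = out_controllability('AND', [C_M, C_C])        # 0.5
C_R = out_controllability('AND', [C_A, C_B])        # 0.5
C_K = out_controllability('OR',  [C_Q, C_R])        # 0.25
O_K = 1.0                                           # primary output
O_Q = in_observability('OR', O_K)                   # 0.5
print(C_K, O_Q, C_Q * O_Q)                          # testability of line Q = 0.25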
If during testability analysis one or more lines in the circuit are shown to have low
testabilities, we might want to make those lines externally controllable and observable via
testpoint placement. Figure 11.tps shows how the testability of a system composed of two
cascaded modules can be improved by inserting degating logic at their interface. In this
way, each module can be tested separately and then in tandem to ensure both proper
module operation and correct interfacing.
Example 11.test: Placement of a testpoint Derive the testabilities of all lines in Fig. 10.quad-a
and from them, deduce the location of a single testpoint that would help most with testability.
Solution: To be provided.
If m testpoints are to be placed, it may not be a good idea to simply pick as their locations the
lines with the m lowest testability values. This is because locating a testpoint on the
lowest-testability line will in general affect the testabilities of many other lines. A better
strategy is to recalculate the testabilities of all circuit lines after the placement of each
testpoint, treating the testpoint as a new primary input/output, before choosing the next location.
(a) Partitioned design (b) Normal mode (c) Test mode for A
Sequential circuits are more difficult to test than combinational circuits, because their
behavior is state-dependent. We need exponentially many tests to test the sequential
circuit’s behavior for each initial condition. One way to reduce the number of test
patterns needed is to test the flip-flops’ proper operation and the correctness of the
combinational part separately. Inputs to the combinational part are the circuit’s primary
inputs and those coming from the FFs (Fig. 11.scand-a). The ability to load arbitrary bit
patterns into the circuit’s FFs will allow us to apply desired test patterns to the
combinational logic and then observe its response by looking at the primary outputs and
the new contents of the FFs.
Figure 11.scand-b shows one way of accomplishing this aim. All the FFs in the circuit are
strung into a long chain, with serial input and serial output. This chain is known as a scan
path. There is a multiplexer before each FF in the scan path that allows the FF to receive its
input from the scan path (test mode) or from the regular source within the circuit (normal
operation). During testing, we alternate between test mode and normal mode in many
phases. In test mode, we shift a desired pattern into the FFs, while shifting out the stored
pattern placed there by the previous phase of testing.
Given the difficulty of testing, built-in self-test (BIST) capability is incorporated into
many complex systems. BIST requires the incorporation of test-pattern generation and
pass/fail decision into the same package as the circuit under test. As in ordinary testing,
the test patterns may be generated randomly at the time of application or may be
precomputed and stored in memory. Among the most common methods of on-the-fly test
generation is the use of linear feedback shift registers (LFSRs).
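As an illustration of on-the-fly pattern generation, the Python sketch below models a 4-bit Fibonacci-style LFSR; the tap positions shown are one standard choice that yields a maximal-length sequence of 15 nonzero states.

def lfsr_patterns(width=4, taps=(3, 2), seed=0b0001):
    """Fibonacci-style LFSR: the feedback bit is the XOR of the tap bits.
    With taps (3, 2), a 4-bit register cycles through all 15 nonzero states."""
    state, seen = seed, set()
    while state not in seen:
        seen.add(state)
        yield state
        fb = ((state >> taps[0]) ^ (state >> taps[1])) & 1
        state = ((state << 1) | fb) & ((1 << width) - 1)

print([format(p, '04b') for p in lfsr_patterns()])   # 15 distinct test patterns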
Problems
Typical hardware units consist of a data path, where computations and other data
manipulations take place, and a control unit that is in charge of scheduling and
sequencing the data path operations (Fig. 12.dpcg). A small amount of glue logic binds
the main two parts together, allowing various optimizations as well as customization of
certain generic capabilities for applications of interest. Redundancy methods for dealing
with faults in data path, control unit, and glue logic are quite different. Generally
speaking, many more options are available for protecting data paths through redundancy,
as opposed to control circuitry (far fewer options) and the glue logic (extremely limited).
Fig. 12.dpcg Hardware unit with data path, control unit, and glue logic.
Options for the control unit may involve coding of control signals, control-flow
watchdogs, and self-checking design. Protection methods for the glue logic are limited to
simple replication and self-checking design.
The schemes depicted in Fig. 12.space have already been discussed in connection with
fault masking. In this section, we take a more detailed look at some of them, with the goal
of understanding the trade-offs involved in their use and the extent of protection they
offer in data path operations.
Let us begin by a more detailed examination of the TMR scheme of Fig. 12.space-c.
Previously, we viewed TMR as a 2-out-of-3 system and derived its reliability with an
imperfect voting circuit in equation 10.3.RTMR. In the following example, we consider
the effects of an imperfect voter on system reliability.
Example 12.TMR1: Modeling of TMR with imperfect voting circuit Consider a TMR
system with identical module reliabilities Rm and voter reliability Rv. Under what conditions will
the TMR system be more reliable than a simplex module?
Solution: The reliability of the TMR system can be written as R = Rv(3Rm² – 2Rm³). For R > Rm,
we must have Rv > 1/(3Rm – 2Rm²). On the other hand, for a given Rv, the system will be more
reliable than a module if (3 – √(9 – 8/Rv))/4 < Rm < (3 + √(9 – 8/Rv))/4 (see Fig. 12.TMR). For
example, if Rv = 0.95, reliability improvement requires that 0.56 < Rm < 0.94. In practice, Rv is
very close to 1, that is, Rv = 1 – ε for a small ε. We then have 1/Rv ≈ 1 + ε, so that 9 – 8/Rv ≈ 1 – 8ε
and √(1 – 8ε) ≈ 1 – 4ε. Thus, the condition for reliability improvement becomes 0.5 + ε < Rm < 1 – ε.
Because the simple voter tends to be much more reliable than the modules it serves (ε is very
small), improved reliability is virtually guaranteed.
Example 12.TMR3: TMR with compensating errors Not all double-module faults lead to an
erroneous output in TMR. Consider a TMR system in which each of the 3 modules sends a single
bit to the voting circuit. Let the module reliability be Rm = 1 – p0 – p1, where p0 and p1 are the
probabilities of a 0-fault and a 1-fault, respectively. Derive the system reliability R.
Solution: The system operates correctly if it has no more than one fault or if there are two faults
with compensating errors. Thus, R = (3Rm² – 2Rm³) + 6p0p1Rm, with the last term being the
contribution of compensating errors to system reliability. Take for example the numerical values
Rm = 0.998 and p0 = p1 = 0.001. These values yield R = 0.999 994 = 0.999 988 + 0.000 006, where
0.000 006 is the improvement to the ordinary TMR reliability 0.999 988 due to the modeling of
compensating errors. We can derive the respective reliability improvement factors thus:
RIFTMR/Simplex = 0.002 / 0.000 012 ≈ 167 and RIFCompens/TMR = 0.000 012 / 0.000 006 = 2.
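These figures are easy to reproduce with a short, illustrative sketch:

Rm, p0, p1 = 0.998, 0.001, 0.001

R_tmr     = 3 * Rm**2 - 2 * Rm**3            # ordinary TMR reliability
R_compens = R_tmr + 6 * p0 * p1 * Rm         # with compensating double faults

print(round(R_tmr, 6), round(R_compens, 6))          # 0.999988  0.999994
print((1 - Rm) / (1 - R_tmr))                        # RIF TMR/Simplex, about 167
print((1 - R_tmr) / (1 - R_compens))                 # RIF Compens/TMR, about 2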
(a) Using 2 adders and a multiplier (b) Using an adder and a multiplier
Section to be written.
Consider a TMR system with a 3-way voting circuit. Assuming that we do not worry
about how the system will behave in case of two faulty modules, a simple voter can be
designed from a word comparator and a 2-to-1 multiplexer, as depicted in Fig. 12.vote-a.
If x1 and x2 disagree, then x3 is chosen as the voting result; otherwise x1 = x2 is passed on
to the output. This comparison-based voting scheme can be readily extended to a larger
number of computation channels, but the number of comparators required rises sharply
with the number n of channels.
In applications where single-bit results (decisions) are to be voted on, the voting circuit is
referred to as a bit-voter. We can synthesize bit-voters from logic gates (Fig. 12.3of5),
but the circuit complexity quickly explodes with increasing n. It is thus imperative to find
systematic design schemes based on higher-level components that keep the design
complexity in check.
(a) Comparison-based voting circuit (b) Basics and notation for bit-voting
One possibility is the use of multiplexers, as depicted in Fig. 12.muxv. For example, in
Fig. 12.muxv-b, the inputs are partitioned into the subsets {x1, x2, x3} and {x4, x5}. If
inputs in the first set are identical, then the majority output is already known. If two
inputs in the first set are 1s, then the voting result is 1 iff at least one member of the
second set is 1. Finally, if one of the 3 inputs in the first set is 1, then x4 = x5 = 1 is
required for producing a 1 at the output.
Other design strategies for synthesizing bit-voters include the use of arithmetic circuits
(add the bits and compare the sum to a threshold) and the use of selection networks (the
majority of bit values is also their median). Synthesis of bit-voters based on these
strategies has been performed and the results compared (Fig. 12.cmplx). It is readily seen
that multiplexer-based designs have the edge for small values of n, but for larger values
of n, designs based on selection networks tend to win.
Word-voting circuits cannot be designed based on bit-voting on the various bit positions.
To see this, note that word inputs 00, 10, and 11 have majority results of 1 and 0 in their
two positions, but the result of bit-voting, that is, 10, is not a majority value. The situation
can even be worse. With word inputs 000, 101, and 110, the bitwise majority voting
result is 100, which isn’t even equal to one of the inputs.
Recently, a recursive method for the construction of voting networks has been proposed
that offers regularity, scalability, and power-efficiency benefits over previous design
methods [Parh21], [Parh21a]. The essence of the method is illustrated in Fig. 12.recTCN,
which shows an at-least-l-out-of-n threshold circuit built from a multiplexer (mux), an at-
least-l-out-of-(n–1) threshold circuit, and an at-least-(l–1)-out-of-(n–1) threshold circuit.
Figure 12.5of9 illustrates the unrolling of the recursive construction method in the
specific case of a 5-out-of-9 majority voter.
Fig. 12.recTCN Recursive construction of an at-least-l-out-of-n threshold circuit (TCN): input xn selects, via a mux, between an l-out-of-(n–1) TCN and an (l–1)-out-of-(n–1) TCN applied to x1 through xn–1.
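The recursion of Fig. 12.recTCN translates directly into code; the sketch below is an illustrative software model (not a gate-level design) of the at-least-l-out-of-n function, specialized to the 5-out-of-9 voter of Fig. 12.5of9.

def at_least(l, x):
    """At-least-l-out-of-n threshold function, built recursively as in
    Fig. 12.recTCN: the last input selects between an l-out-of-(n-1)
    and an (l-1)-out-of-(n-1) subnetwork."""
    n = len(x)
    if l <= 0:
        return 1
    if l > n:
        return 0
    # Mux with x[-1] as the select signal.
    return at_least(l - 1, x[:-1]) if x[-1] else at_least(l, x[:-1])

def majority_5_of_9(bits):
    return at_least(5, bits)

print(majority_5_of_9([1, 0, 1, 1, 0, 1, 0, 1, 0]))   # five 1s -> 1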
One variation on the theme of replication and voting is that of self-purging redundancy.
Instead of having n operational units and s spares, all n + s units contribute to the
computation by sending their results to a threshold voting circuit (Fig. 12.purge). When a
module disagrees with the voting outcome, it takes itself out of the computation by
resetting a flip-flop whose output enables the module output to go to the voting circuit.
The threshold may be fixed or it may vary as units are taken out of service.
An interesting design strategy is based on alternating logic. The basic strategy is depicted
in Fig. 12-alt. Alternating logic takes advantage of the fact that the same fault is unlikely
to affect a signal and its complement in the same way. This property is routinely used in
data transmission over buses by sending a data packet, then sending the bitwise
complement of the data, and comparing the two versions at the destination, allowing the
detection of bus lines s-a-0, s-a-1, and many transient faults. Let the dual of a Boolean
function f(x1, x2, … , xn) be defined as another function fd(x1, x2, … , xn) such that
fd(x1, x2, … , xn) = f ′(x1′, x2′, … , xn′)
that is, the complement of f applied to the complemented inputs.
The dual of a Boolean function (logic circuit) can be obtained by exchanging AND and
OR operators (gates). For example, the dual of f(a, b, c) = ab ∨ c is fd(a, b, c) = (a ∨ b)c.
Using the dual of a function instead of its identical duplicate provides greater protection
against common-cause and certain unidirectional multiple faults. If a function is self-
dual, a property that is held by many commonly used circuits such as binary adders (both
unsigned and 2's-complement), a form of time redundancy can be applied by using the same
circuit to compute the function and its dual.
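As a quick sanity check of these definitions, the following Python sketch (names ours) forms the dual as the complement of f applied to complemented inputs, confirms that it matches the AND/OR exchange for the example above, and verifies that a full adder's sum and carry functions are indeed self-dual.

  from itertools import product

  def dual(f):
      # fd(x1, ..., xn) = complement of f applied to the complemented inputs
      return lambda *x: 1 - f(*(1 - xi for xi in x))

  f = lambda a, b, c: (a & b) | c                # f(a, b, c) = ab OR c
  fd = dual(f)
  assert all(fd(a, b, c) == (a | b) & c          # matches (a OR b)c
             for a, b, c in product((0, 1), repeat=3))

  full_adder_sum = lambda a, b, c: a ^ b ^ c
  full_adder_carry = lambda a, b, c: (a & b) | (a & c) | (b & c)
  for g in (full_adder_sum, full_adder_carry):
      gd = dual(g)
      assert all(g(*x) == gd(*x) for x in product((0, 1), repeat=3))   # self-dual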
Problems
12.x Title
Intro
a. xx
b. xx
c. xx
Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).
The model's levels are: Defective, Faulty, Erroneous, Malfunctioning, Degraded, and Failed.
“To err is human––to blame it on someone else is even more human.”
Jacob's law
“Let me put it this way, Mr. Amor. The 9000 series is the most reliable computer ever made.
No 9000 computer has ever made a mistake or distorted information. We are all, by any
practical definition of the words, foolproof and incapable of error.”
13 Error Detection
”Thinking you know when in fact you don’t is a fatal mistake, to
which we are all prone.”
Bertrand Russell
“An expert is a person who has made all the mistakes that can
be made in a very narrow field.”
Niels Bohr
One way of dealing with errors is to ensure their prompt detection, so that they
can be counteracted by appropriate recovery actions. Another approach, discussed
in Chapter 10, is automatic error correction. Error detection requires some form of
result checking that might be done through time and/or informational redundancy.
Examples of time redundancy include retry, variant computation, and output
verification. The simplest form of informational redundancy is replication, where
each piece of information is duplicated, triplicated, and so on. This would imply a
redundancy of at least 100%. Our focus in this chapter is on lower-redundancy,
and thus more efficient, error-detecting codes.
Error detection schemes have been used since ancient times. Jewish scribes are said to
have devised methods, such as fitting a certain exact number of words on each line/page
or comparing a sample of a new copy with the original [Wikipedia], to prevent errors
during manual copying of text, thousands of years ago. When an error was discovered,
the entire page, or in cases of multiple errors, the entire manuscript, would be destroyed,
an action equivalent to the modern retransmission or recomputation. Discovery of the
Dead Sea Scrolls, dating from about 150 BCE, confirmed the effectiveness of these
quality control measures.
The most effective modern error detection schemes are based on redundancy. In the most
common set-up, k-bit information words are encoded as n-bit codewords, n > k. Changing
some of the bits in a valid codeword produces an invalid codeword, thus leading to
detection. The ratio r/k, where r = n – k is the number of redundant bits, indicates the
extent of redundancy or the coding cost. Hardware complexity of error detection is
another measure of cost. Time complexity of the error detection algorithm is a third
measure of cost, which we often ignore, because the process can almost always be
overlapped with useful communication/computation.
Figure 13.1a depicts the data space of size 2k, the code space of size 2n, and the set of 2n –
2k invalid codewords, which, when encountered, indicate the presence of errors.
Conceptually, the simplest redundant representation for error detection consists of
replication of data. For example, triplicating a bit b, to get the corresponding codeword
bbb allows us to detect errors in up to 2 of the 3 bits. For example, the valid codeword
000 can change to 010 due to a single-bit error or to 110 as a result of 2 bit-flips, both of
which are invalid codewords allowing detection of the errors. We will see in Chapter 14
that the same triplication scheme allows the correction of single-bit errors in lieu of
detection of up to 2 bit-errors.
A possible practical scheme for using duplication is depicted in Fig. 13.2a. The desired
computation y = f(x) is performed twice, preceded by encoding of the input x
(duplication) and succeeded by comparison to detect any disagreement between the two
results. Any error in one copy of x, even if accompanied by mishaps in the associated
computation copy, is detectable with this scheme. A variant, shown in Fig. 13.2b, is
based on computing the complement 𝑦̅ in the second channel, with differing outputs being a sign of
correctness. One advantage of the latter scheme, that is, encoding x as x𝑥̅ (x followed by its bitwise
complement), over straight duplication, or xx, is that it can detect any unidirectional error (0s changing to 1s, or 1s to
0s, but not both at the same time), even if the errors span both copies.
(a) Data and code spaces (b) Triplication at the bit level
Fig. 13.1 Data and code spaces in general (sizes 2k and 2n) and for
bit-level triplication (sizes 2 and 8).
Example 13.parity: Odd/even parity checking One of the simplest and oldest methods of
protecting data against accidental corruption is the use of a single parity bit for k bits of data (r =
1, n = k + 1). Show that the provision of an even or odd parity bit, that is, an extra bit that makes
the parity of the (k + 1)-bit codeword even or odd, will detect any single bit-flip or correct an
erasure error. Also, describe the encoding, decoding, and error-checking processes.
Solution: Any single bit-flip will change the parity from even to odd or vice versa, thus being
detectable. An erased bit value can be reconstructed by noting which of the two possibilities
would make the parity correct. During encoding, an even parity bit can be derived by XORing all
data bits together. An odd parity bit is simply the complement of the corresponding even parity
bit. No decoding is needed, as the code is separable. Error checking consists of recomputing the
parity bit and comparing it against the existing one.
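A minimal Python sketch of the processes just described (names ours): the even parity bit is the XOR of the data bits, and checking amounts to verifying that the whole codeword XORs to 0.

  def even_parity_bit(data_bits):
      p = 0
      for b in data_bits:               # XOR all data bits together
          p ^= b
      return p

  def parity_checks(codeword_bits):
      p = 0
      for b in codeword_bits:           # recompute parity over data plus check bit
          p ^= b
      return p == 0

  data = [1, 0, 1, 1, 0, 0, 1]
  codeword = data + [even_parity_bit(data)]
  assert parity_checks(codeword)
  corrupted = codeword[:]
  corrupted[3] ^= 1                     # any single bit-flip ...
  assert not parity_checks(corrupted)   # ... is detected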
The operators just defined propagate any input errors to their outputs, thus facilitating
error detection. For example, it is readily verified that (0, 1) (1, 1) = (1, 1) and (0, 1)
(0, 0) = (0, 0).
A particularly useful notion for the design and assessment of error codes is that of
Hamming distance, defined as the number of positions in which two bit-vectors differ.
The Hamming distance of a code is the minimum distance between its codewords. For
example, it is easy to see that a 5-bit code in which all codewords have weight 2 (the 2-
out-of-5 code) has Hamming distance 2. This code has 10 codewords and is thus suitable
for representing the decimal digits 0-9, among other uses.
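The following short Python check (our own illustration) enumerates the 2-out-of-5 code and confirms both claims: the code has 10 codewords and Hamming distance 2.

  from itertools import combinations

  def hamming_distance(u, v):
      return sum(a != b for a, b in zip(u, v))

  # all 5-bit words of weight 2
  code = [tuple(1 if i in positions else 0 for i in range(5))
          for positions in combinations(range(5), 2)]
  assert len(code) == 10
  assert min(hamming_distance(u, v) for u, v in combinations(code, 2)) == 2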
We next review the types of errors and various ways of modeling them. Error models
capture the relationships between errors and their causes, including circuit faults. Errors
are classified as single or multiple (according to the number of bits affected), inversion or
erasure (flipped or unrecognizable/missing bits), random or correlated, and symmetric or
asymmetric (regarding the relative likelihoods of 0→1 and 1→0 inversions). Note that nonbinary
codes have substitution rather than inversion errors. For certain applications, we also
need to deal with transposition errors (exchange of adjacent symbols). Also note that
errors are permanent by nature; in our terminology, we have transient faults, but no such
thing as transient errors.
Error codes, first used for and developed in connection with communication on noisy
channels, were later applied to protecting stored data against corruption. In computing
applications, where data is manipulated in addition to being transmitted and stored, a
commonly applied strategy is to use coded information during transmission and storage
and to strip/reinstate the redundancy via decoding/encoding before/after data
manipulations. Fig. 13.coding depicts this process. While any error-detecting/correcting
code can be used for protection against transmission and storage errors, most such codes
are not closed under arithmetic/logic operations. Arithmetic error codes, to be discussed
in Section 13.5, provide protection for data manipulation circuits as well as transmission
and storage systems.
[Fig. 13.coding: Data flow Input → Encode → Send → Store → Decode → Output, with the
send/store portion protected by encoding and the Manipulate step operating on unprotected
(decoded) data.]
We end this section with a brief review of criteria used to judge the effectiveness of error-
detecting as well as error-correcting codes. These include redundancy (r redundant bits
used for k information bits, for an overhead of r/k), encoding circuit/time complexity,
decoding circuit/time complexity (nonexistent for separable codes), error detection
capability (single, double, b-bit burst, byte, unidirectional), and possible closure under
operations of interest.
Note that error-detecting and error-correcting codes used for dependability improvement
are quite different from codes used for privacy and security. In the latter case, a simple
decoding process would be detrimental to the purpose for which the code is used.
Checksum codes constitute one of the most widely used classes of error-detecting codes.
In such codes, one or more check digits or symbols, computed by some kind of
summation, are attached to data digits or symbols.
Example 13.UPC: Universal product code, UPC-A In UPC-A, an 11-digit decimal product
number is augmented with a decimal check digit, which is considered the 12th digit and is
computed as follows. The odd-indexed digits (numbering begins with 1 at the left) are added up
and the sum is multiplied by 3. Then, the sum of the even-indexed digits is added to the previous
result, with the new result subtracted from the next higher multiple of 10, to obtain a check digit in
[0-9]. For instance, given the product number 03600029145, its check digit is computed thus. We
first find the weighted sum 3(0 + 6 + 0 + 2 + 1 + 5) + 3 + 0 + 0 + 9 + 4 = 58. Subtracting the latter
value from 60, we obtain the check digit 2 and the codeword 036000291452. Describe the error-
detection algorithm for UPC-A code. Then show that all single-digit substitution errors and nearly
all transposition errors (switching of two adjacent digits) are detectable.
Solution:
To detect errors, we recompute the check digit per the process outlined in the problem statement
and compare the result against the listed check digit. Any single-digit substitution error will add to
the weighted sum a positive or negative error magnitude that is one of the values in [1-9] or in
{12, 15, 18, 21, 24, 27}. None of the listed values is a multiple of 10, so the error is detectable. A
transposition error will add or subtract an error magnitude that is the difference between 3a + b
and 3b + a, that is, 2(a – b). As long as |a – b| ≠ 5, the error magnitude will not be divisible by 10
and the error will be detectable. The undetectable exceptions are thus adjacent transposed digits
that differ by 5 (i.e., 5 and 0; 6 and 1; etc.).
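The check-digit computation and the validity test can be captured in a few lines of Python (function names ours); the assertions reproduce the worked example above.

  def upc_a_check_digit(digits11):
      # digits listed left to right; odd/even positions are 1-based, as in the text
      odd_sum = sum(digits11[0::2])
      even_sum = sum(digits11[1::2])
      return (-(3 * odd_sum + even_sum)) % 10    # distance to the next multiple of 10

  def upc_a_is_valid(digits12):
      return upc_a_check_digit(digits12[:11]) == digits12[11]

  product_number = [0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]
  assert upc_a_check_digit(product_number) == 2       # as in Example 13.UPC
  assert upc_a_is_valid(product_number + [2])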
Generalizing from Example 13.UPC, checksum codes are characterized as follows. Given
the data vector x1, x2, … , xk, we attach a check digit xk+1 to the right end of the vector
so as to satisfy the check equation
w1x1 + w2x2 + … + wk+1xk+1 = 0 mod A (13.2.chksm)
where the wj are weights associated with the different digit positions and A is a check
constant. With this terminology, the UPC-A code of example 13.UPC has the weight
vector 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1 and the check constant A = 10. Such a checksum
code will detect all errors that add to the weighted sum an error magnitude that is not a
multiple of A. In some variants of checksum codes, the binary representations of the
vector elements are XORed together rather than added (after multiplication by their
corresponding weights). Because the XOR operation is simpler and faster, such XOR
checksum codes have a speed and cost advantage, but they are not as strong in terms of
error detection capabilities.
Another weight-based code is the separable Berger code, whose codewords are formed as
follows. We count the number of 0s in a k-bit data word and attach to the data word the
representation of the count as a ⌈log2(k + 1)⌉-bit binary number. Hence, the codewords
are of length n = k + r = k + ⌈log2(k + 1)⌉. Using a vertical bar to separate the data part
and check part of a Berger code, here are some examples of codewords with k = 6:
000000|110; 000011|100; 101000|100; 111110|001.
A Berger code can detect all unidirectional errors, regardless of the number of bits
affected. This is because 0→1 flips can only decrease the count of 0s in the data part, but
they either leave the check part unchanged or increase it. As for random errors, only
single errors are guaranteed to be detectable (why?). The following example introduces
an alternate form of Berger code.
Example 13.Berger: Alternate form of Berger code Instead of attaching the count of 0s as a
binary number, we may attach the 1’s-complement (bitwise complement) of the count of 1s. What
are the error-detection capabilities of the new variant?
Solution: A 0→1 flip increases the count of 1s, thus decreasing its 1's-complement. A similar
opposing direction of change applies to 1→0 flips. Thus, all
unidirectional errors remain detectable.
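Both forms of the Berger check are easy to express in Python (names ours); the assertions reproduce the k = 6 codewords listed earlier, and alternate=True gives the variant of Example 13.Berger.

  def berger_check(data_bits, alternate=False):
      k = len(data_bits)
      r = k.bit_length()                           # ceiling of log2(k + 1) check bits
      count = data_bits.count(1) if alternate else data_bits.count(0)
      bits = [(count >> i) & 1 for i in reversed(range(r))]
      if alternate:                                # 1's-complement of the count of 1s
          bits = [1 - b for b in bits]
      return bits

  for data, check in [("000000", "110"), ("000011", "100"),
                      ("101000", "100"), ("111110", "001")]:
      assert berger_check([int(b) for b in data]) == [int(b) for b in check]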
A cyclic code is any code in which a cyclic shift of a codeword always results in another
codeword. A cyclic code can be characterized by its generating polynomial G(x) of
degree r, with all of the coefficients being 0s and 1s. Here is an example generator
polynomial of degree r = 3:
G(x) = 1 + x + x3 (13.4.GenP)
Multiplying a polynomial D(x) that represents a data word by the generator polynomial
G(x) produces the codeword polynomial V(x). For example, given the 7-bit data word
1101001, associated with the polynomial 1 + x + x3 + x6, the corresponding codeword is
obtained via a polynomial multiplication in which coefficients are added modulo 2:
V(x) = (1 + x + x3)(1 + x + x3 + x6) = 1 + x2 + x7 + x9
which corresponds to the 10-bit codeword 1010000101.
The polynomial G(x) of a cyclic code is not arbitrary, but must be a factor of 1 + xn. For
example, given that
1 + x15 = (1 + x)(1 + x + x2)(1 + x + x4)(1 + x3 + x4)(1 + x + x2 + x3 + x4) (13.4.poly)
several potential choices are available for the generator polynomial of a 15-bit cyclic
code. Each of the factors on the right-hand side of equation 13.4.poly, or the product of
any subset of the 5 factors, can be used as a generator polynomial. The resulting 15-bit
cyclic codes will be different with respect to their error detection capabilities and
encoding/decoding schemes.
An n-bit cyclic code with k bits’ worth of data and r bits of redundancy (generator
polynomial of degree r = n – k) can detect all burst errors of width r or less. This is
because such a burst corresponds to an error polynomial xi E(x), with E(x) of degree less than r,
which cannot be divisible by the degree-r generator polynomial G(x).
What makes cyclic codes particularly desirable is that they require fairly simple hardware
for encoding and decoding. The linear shift register depicted in Fig. 13.cced-a receives
the data word D(x) bit-serially and produces the code vector V(x) bit-serially, beginning
with the constant term (the coefficient of x0). The equally simple cyclic-code decoding
hardware in Fig. 13.cced-b is readily understood if we note that B(x) = (x + x3)D(x) and
D(x) = V(x) + B(x) = (1 + x + x3)D(x) + (x + x3)D(x).
Cyclic codes, as defined so far, are not separable, thus potentially slowing down the
delivery of data due to the decoding latency. Here is one way to construct separable
cyclic codes that leads to the cyclic redundancy check (CRC) codes in common use for
disk memories. Given the degree-(k – 1) polynomial D(x) associated with the k-bit data
word and the degree-r generator polynomial G(x), the degree-(n – 1) polynomial
corresponding to the n-bit encoded word is:
V(x) = xrD(x) + [xrD(x) mod G(x)] (13.crc)
It is easy to see that V(x) computed from equation 13.crc is divisible by G(x). Because the
remainder polynomial in the square brackets is at most of degree r – 1, the data part D(x)
remains separate from the check component.
Example 13.CRC1: Separable cyclic code Consider a CRC code with 4 data bits and the
generator polynomial G(x) = 1 + x + x3. Form the CRC codeword associated with the data word
1001 and check your work by verifying the divisibility of the resulting polynomial V(x) by G(x).
Solution: Dividing x3D(x) = x3 + x6 by G(x), we get the remainder x + x2. The code-vector
polynomial is V(x) = [x3D(x) mod G(x)] + x3D(x) = x + x2 + x3 + x6, corresponding to the codeword
0111001. Rewriting V(x) as x(1 + x + x3) + x3(1 + x + x3) confirms that it is divisible by G(x).
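The separable CRC construction is easy to express in Python if polynomials over GF(2) are packed into integers, with bit i holding the coefficient of xi (a representation choice of ours, not from the text). The assertions reproduce Example 13.CRC1.

  def poly_mod(dividend, divisor):
      # remainder of GF(2) polynomial division
      dlen = divisor.bit_length()
      while dividend.bit_length() >= dlen:
          dividend ^= divisor << (dividend.bit_length() - dlen)
      return dividend

  def crc_encode(data_poly, gen_poly):
      # V(x) = x^r D(x) + [x^r D(x) mod G(x)], with r = deg G(x)
      r = gen_poly.bit_length() - 1
      shifted = data_poly << r
      return shifted | poly_mod(shifted, gen_poly)

  G = 0b1011                  # G(x) = 1 + x + x^3
  D = 0b1001                  # data word 1001, i.e., D(x) = 1 + x^3
  V = crc_encode(D, G)
  assert poly_mod(V, G) == 0  # V(x) is divisible by G(x)
  # codeword written with the x^0 coefficient first, as in the text: 0111001
  assert ''.join(str((V >> i) & 1) for i in range(7)) == '0111001'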
Example 13.CRC2: Even parity as CRC code Show that the use of a single even parity bit is a
special case of CRC and derive its generator polynomial G(x).
Solution: Let us take an example data word 1011, with D(x) = 1 + x2 + x3 and its even-parity
coded version 11011 with V(x) = 1 + x + x3 + x4 (note that the parity bit precedes the data). Given
that the remainder D(x) mod G(x) is a single bit, the generator polynomial, if it exists, must be
G(x) = 1 + x. We can easily verify that (1 + x2 + x3) mod (1 + x) = 1. For a general proof, we note
that xi mod (1 + x) = 1 for all values of i. Therefore, the number of 1s in the data word, which is
the same as the number of terms in the polynomial D(x), determines the number of 1s that must be
added (modulo 2) to form the check bit. The latter process coincides with the definition of an even
parity-check bit.
Example 13.arith1: Errors caused by single faults in arithmetic circuits Show that a single
logic fault in an unsigned binary adder can potentially flip an arbitrary number of bits in the sum.
Solution: Consider the addition of two 16-bit unsigned binary numbers 0010 0111 0010 0001 and
0101 1000 1101 0011, whose correct sum is 0111 1111 1111 0100. Indexing the bit positions from
the right end beginning with 0, we note that during this particular addition, the carry signal going
from position 3 to position 4 is 0. Now suppose that the output of the carry circuit in that position
is s-a-1. The erroneous carry of 1 will change the output to 1000 0000 0000 0100, flipping 12 bits
in the process and changing the numerical value of the output by 16.
Characterization of arithmetic errors is better done via the value added to or subtracted
from a correct result, rather than by the number of bits affected. When the amount added
or subtracted is a power of 2, as in Example 13.arith1, we say that we have an arithmetic
error of weight 1, or a single arithmetic error. When the amount added or subtracted can
be expressed as the sum or difference of two different powers of 2, we say we have an
arithmetic error of weight 2, or double arithmetic error.
Example 13.arith2: Arithmetic weight of error Consider the correct sum of two unsigned
binary numbers to be 0111 1111 1111 0100.
a. Characterize the arithmetic errors that transform the sum to 1000 0000 0000 0100.
b. Repeat part a for the transformation to 0110 0000 0000 0100.
Solution:
a. The error is +16 = +2^4, thus it is characterized as a positive weight-1 or single error.
b. The error is –8176 = –2^13 + 2^4, thus it is characterized as a negative weight-2 or double error.
Even though both x and 2ax are unknown, we do know that 2ax ends with a 0s. Thus,
equation 13.5.div allows us to compute the rightmost a bits of x, which become the next a
bits of 2ax. This a-bit-at-a-time process continues until all bits of x have been derived.
Example 13.arith3: Decoding the 15x code What 16-bit unsigned binary number is represented
by 0111 0111 1111 0100 1100 in the low-cost product code 15x?
Solution: We know that 16x is of the form ●●●● ●●●● ●●●● ●●●● 0000. We use equation
13.5.div to find the rightmost 4 bits of x as 0000 – 1100 = 0100, remembering the borrow-out in
position 3 for the next step. We now know that 16x is of the form ●●●● ●●●● ●●●● 0100 0000.
In the next step, we find 0100 – 0100 (–1) = (–1) 1111, where a parenthesized –1 indicates
borrow-in/out. We now know 16x to be of the form ●●●● ●●●● 1111 0100 0000. The next 4
digits of x are found thus: 1111 – 1111 (–1) = (–1) 1111. We now know 16x to be of the form
●●●● 1111 1111 0100 0000. The final 4 digits of x are found thus: 1111 – 0111 (–1) = 0111.
Putting all the pieces together, we have the answer x = 0111 1111 1111 0100.
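A Python rendering of this a-bit-at-a-time procedure (function name ours; the identity x = 2^a·x – (2^a – 1)x is presumably what equation 13.5.div expresses) reproduces the example.

  def decode_low_cost_product(y, a=4, k=16):
      # recover x from y = (2**a - 1)x: the rightmost a bits of (2**a)x are 0s, and
      # each subtraction step, with borrow, reveals the next a bits of x
      mask = (1 << a) - 1
      x, borrow = 0, 0
      for i in range(k // a):
          minuend = ((x << a) >> (a * i)) & mask    # next a bits of (2**a)x known so far
          subtrahend = (y >> (a * i)) & mask        # next a bits of y
          diff = minuend - subtrahend - borrow
          borrow = 1 if diff < 0 else 0
          x |= (diff & mask) << (a * i)
      return x

  x = decode_low_cost_product(0b01110111111101001100)      # Example 13.arith3
  assert x == 0b0111111111110100 and 15 * x == 0b01110111111101001100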
Product codes are closed under addition and subtraction, given that
Ax ± Ay = A(x ± y) (13.5.addsub)
Note that product codes are nonseparable, because data and redundant check information
are intermixed. We next consider a class of separable arithmetic error-detecting codes.
A residue error-detecting code represents an integer N by the pair (N, C = |N|A), where
|N|A is the residue of N modulo the chosen check modulus A. Because the data part N and
the check part C are separate, decoding is trivial. To encode a number N, we must
compute |N|A and attach it to N. This is quite easy for a low-cost modulus A = 2a – 1:
simply divide the number N into a-bit chunks, beginning at its right end, and add the
chunks together modulo A. Modulo-(2a – 1) addition is just like ordinary a-bit addition,
with the adder’s carry-out line connected to its carry-in line, a configuration known as
end-around carry.
Example 13.mod15: Computing low-cost residues Compute the mod-15 residue of the 16-bit
unsigned binary number 0101 1101 1010 1110.
Solution: The mod-15 residue of x is obtained thus: 0101 + 1101 + 1010 + 1110 mod 15.
Beginning at the right end, the first addition produces an outgoing carry, leading to the mod-15
sum 1110 + 1010 – 10000 + 1 = 1001. Next, we get 1001 + 1101 – 10000 + 1 = 0111. Finally, in
the last step, we get: 0111 + 0101 = 1100. Note that the end-around carry is 1 in the first two steps.
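The chunked computation with end-around carry is captured by the following Python sketch (names ours), which reproduces the example's result of 1100 (decimal 12).

  def low_cost_residue(n, a):
      # residue of n modulo 2**a - 1, one a-bit chunk at a time
      mask = (1 << a) - 1
      res = 0
      while n:
          res += n & mask               # add the next chunk
          n >>= a
          if res > mask:                # end-around carry: wrap the carry-out back in
              res = (res & mask) + 1
      return 0 if res == mask else res  # 2**a - 1 is congruent to 0

  assert low_cost_residue(0b0101110110101110, 4) == 0b1100   # Example 13.mod15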
For an arbitrary modulus A, we can use table-lookup to compute |N|A. Suppose that the
number N can be divided into m chunks of size b bits. Then, we need a table with m × 2b entries,
that is, 2b entries per chunk. For each chunk, we consult the corresponding part of the
table, which stores the mod-A residues of all possible b-bit chunks in that position, adding
all the entries thus read out, modulo A.
An inverse residue code uses the check part D = A – C instead of C = |N|A. This change is
motivated by the fact that certain unidirectional errors that are prevalent in VLSI circuits
tend to change the values of N and C in the same direction, raising the possibility that the
error will go undetected. For example if the least-significant bits of N and C both change
from 0 to 1, the value of both N and C will be increased by 1, potentially causing a match
between the residue of the erroneous N value and the erroneous C value. Inverse residue
encoding eliminates this possibility, given that unidirectional errors will affect the N and
D parts in opposite directions.
The theory of error-detecting codes is quite extensive. One can devise virtually an infinite
number of different error-detecting codes and entire textbooks have been devoted to the
study of such codes. Our study in the previous sections of this chapter was necessarily
limited to codes that have been found most useful in the design of dependable computer
systems. This section is devoted to the introduction and very brief discussion of a number
of other error-detecting codes, in an attempt to fill in the gaps.
Erasure errors cause some symbols to become unreadable, effectively reducing the
length of a codeword from n to m, when there are n – m erasures. The code ensures that
the original k data symbols are recoverable from the m available symbols. When m = k,
the erasure code is optimal, that is, any k bits can be used to reconstruct the n-bit
codeword and thus the k-bit data word. Near-optimal erasure codes require (1 + ε)k
symbols to recover the original data, where ε > 0. Examples of near-optimal erasure
codes include Tornado codes and low-density parity-check (LDPC) codes.
Given that 8-bit bytes are important units of data representation, storage, and
transmission in modern digital systems, codes that use bytes as their symbols are quite
useful for computing applications. As an example, a single-byte-error-correcting, double-
byte-error-detecting code [Kane82] may be contemplated.
Most codes are designed to deal with random errors, with particular distributions across a
codeword (say, uniform distribution). In certain situations, such as when bits are
transmitted serially or when a surface scratch on a disk affects a small disk area, the
possibility that multiple adjacent bits are adversely affected by an undesirable event
exists. Such errors, referred to as “burst errors,” are characterized by their extent or length.
For example, a single-bit-error-correcting, 6-bit-burst-error-detecting code might be of
interest in such contexts. Such a code would correct a single random error, while
providing safety against a modest-length burst error.
In this chapter, we have seen error-detecting codes applied at the bit-string or word level.
It is also possible to apply coding at higher levels of abstraction. Error-detecting codes
that are applicable at the data structure level (robust data structures) or at the application
level (algorithm-based error tolerance) will be discussed in Chapter 20.
Problems
In the old, 10-digit ISBN code, the 9-digit book identifier x1x2x3x4x5x6x7x8x9 was augmented with a 10th
check digit x10, derived as (11 – W) mod 11, where W is the modulo-11 weighted sum ∑1≤i≤9 (11 – i)xi.
Because the value of x10 is in the range [0, 10], the check digit is written as X when the residue is 10.
a. Provide an algorithm to check the validity of the 10-digit ISBN code x1x2x3x4x5x6x7x8x9x10.
b. Show that the code detects all single-digit substitution errors.
c. Show that the code detects all single transposition errors.
d. Since a purely numerical code would be more convenient, it is appealing to replace the digit value
X, when it occurs, with 0. How does this change affect the code’s capability in detecting
substitution errors?
e. Repeat part d for transposition errors.
14 Error Correction
“Error is to truth as sleep is to waking. As though refreshed, one
returns from erring to the path of truth.”
Johann Wolfgang von Goethe, Wisdom and
Experience
“Don’t argue for other people’s weaknesses. Don’t argue for your
own. When you make a mistake, admit it, correct it, and learn
from it, immediately.”
Stephen R. Covey
Instead of detecting errors and performing some sort of recovery action such as
retransmission or recomputation, one may aim for providing sufficient redundancy in the
code so as to correct the most common errors quickly. In contrast to the backward
recovery methods associated with error detection followed by additional actions, error-
correcting codes are said to allow forward recovery. In practice, we may use an error
correction scheme for simple, common errors in conjunction with error detection for rarer
or more extensive error patterns. A Hamming single-error-correcting/double-error-
detecting (SEC/DED) code provides a good example.
Error-correcting codes were also developed for communication over noisy channels and
were later adopted for use in computer systems. Notationally, we proceed as in the case
of error-detecting codes, discussed in Chapter 13. In other words, we assume that k-bit
information words are encoded as n-bit codewords, n > k. Changing some of the bits in a
valid codeword produces an invalid codeword, thus leading to detection, and with
appropriate provisions, to correction. The ratio r/k, where r = n – k is the number of
redundant bits, indicates the extent of redundancy or the coding cost. Hardware
complexity of error correction is another measure of cost. Time complexity of the error
correction algorithm is a third measure of cost, which we often ignore, because we
expect correction events to be very rare.
Figure 14.1a depicts the data space of size 2k, the code space of size 2n, and the set of 2n –
2k invalid codewords, which, when encountered, indicate the presence of errors.
Conceptually, the simplest redundant representation for error correction consists of
replication of data. For example, triplicating a bit b to get the corresponding codeword
bbb allows us to correct an error in 1 of the 3 bits. Now, if the valid codeword 000
changes to 010 due to a single-bit error, we can correct the error, given that the erroneous
value is closer to 000 than to 111. We saw in Chapter 13 that the same triplication
scheme can be used for the detection of single-bit and double-bit errors in lieu of
correction of a single-bit error. Of course, triplication does not have to be applied at the
bit level. A data bit-vector x of length k can be triplicated to become the 3k-bit codeword
xxx. Referring to Fig. 14.1b, we note that if the voter is triplicated to produce three copies
of the result y, the modified circuit would supply the coded yyy version of y, which can
then be used as input to other circuits.
Fig. 14.1 Data and code spaces for error-correction coding in general
(sizes 2k and 2n) and for triplication (sizes 2k and 23k).
The high-redundancy triplication code corresponding to the voting scheme of Fig. 14.1b
is conceptually quite simple. However, we would like to have lower-redundancy codes
that offer similar protection against potential errors. To correct a single-bit error in an n-
bit (non)codeword with r = n – k bits of redundancy, we must have 2r > k + r, which
dictates slightly more than log2 k bits of redundancy. One of our challenges in this
chapter is to determine whether we can approach this lower bound and come up with
highly efficient single-error-correcting codes and, if not, whether we can get close to the
bound. More generally, the challenge of designing efficient error-correcting codes with
different capabilities is what will drive us in the rest of this chapter. Let us begin with an
example that achieves single-error correction and double-error detection with a
redundancy of r = 2√k + 1 bits.
Solution: A single bit-flip will cause the parity check to be violated for exactly one row and one
column, thus pinpointing the location of the erroneous bit. A double-bit error is detectable because
it will lead to the violation of parity checks for 2 rows (when the errors are in the same column),
2 columns, or 2 rows and 2 columns. A pattern of 3 errors may be such that there are 2 errors in
row i and 2 errors in column j (one of them shared), leading to no parity check violation; this can
happen, for example, when the data bit at the intersection of row i and column j flips along with
the row-i and column-j parity bits.
The criteria used to judge error-correcting codes are quite similar to those used for error-
detecting codes. These include redundancy (r redundant bits used for k information bits,
for an overhead of r/k), encoding circuit/time complexity, decoding circuit/time
complexity (nonexistent for separable codes), error correction capability (single, double,
b-bit burst, byte, unidirectional), and possible closure under operations of interest.
Greater correction capability generally entails more redundancy. To correct c errors, a
minimum code distance of 2c + 1 is necessary. Codes may also have a combination of
correction and detection capabilities. To correct c errors and additionally detect d errors,
where d > c, a minimum code distance of c + d + 1 is needed. For example, a SEC/DED
code cannot have a distance of less than 4.
[Figure: three codewords c1, c2, c3 of a distance-3 code, with noncodewords e1, e2, e3 lying
at distance 1 or 2 from them.]
For each of the three codewords, distance-1 and distance-2 words from it are highlighted
by drawing a dashed box through them. For each codeword, there are 8 distance-1 words
and 16 distance-2 words. We note that distance-1 words, colored yellow, are distinct for
each of the three codewords. Thus, if a single-bit error transforms c1 to e1, say, we know
that the correct word is c1, because e1 is closer to c1 than to the other two codewords.
Certain distance-2 noncodewords, such as e2, are similarly closest to a particular valid
codeword, which may allow their correction. However, as seen from the example of the
noncodeword e3, which is at distance 2 from all three valid codewords, some double-bit
errors may not be correctable in a distance-3 code.
Hamming codes take their name from Richard W. Hamming, an American scientist who
is rightly recognized as their inventor. However, while Hamming’s publication of his idea
was delayed by Bell Laboratories’ legal department as part of their patenting strategy,
Claude E. Shannon independently discovered and published the code in his seminal 1949
book, The Mathematical Theory of Communication [Nahi13].
We begin our discussion of Hamming codes with the simplest possible example: a (7, 4)
single-error-correcting (SEC) code, with n = 7 total bits, k = 4 data bits, and r = 3
redundant parity-check bits. As depicted in Fig. 14.H74-a, each parity bit is associated
with 3 data bits and its value is chosen to make the group of 4 bits have even parity.
Thus, the data word 1001 is encoded as the codeword 1001|101. The evenness of parity
for pi’s group is checked by computing the syndrome si and verifying that it is 0. When
all three syndromes are 0s, the word is error-free and no correction is needed. When the
syndrome vector s2s1s0 contains one or more 1s, the combination of values points to a
unique bit that is in error and must be flipped to correct the 7-bit word. Figure 14.H74-b
shows the correspondence between the syndrome vector and the bit that is in error.
Encoding and decoding circuitry for the Hamming (7, 4) SEC code of Fig. 14.H74 is
shown in Fig. 14.Hed.
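The encoding and syndrome-based correction steps can be sketched in Python using the parity equations implied by the generator matrix of equation 14.2.gen (p1 = d1⊕d2⊕d3, p2 = d1⊕d3⊕d4, p3 = d2⊕d3⊕d4). The bit ordering and names below are our assumption and may not match the exact labeling of Fig. 14.H74-b, but the data word 1001 does encode to 1001|101 as in the text.

  # columns of the parity check matrix for positions d1, d2, d3, d4, p1, p2, p3
  H_COLUMNS = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1),
               (1, 0, 0), (0, 1, 0), (0, 0, 1)]

  def hamming74_encode(d):
      d1, d2, d3, d4 = d
      return [d1, d2, d3, d4, d1 ^ d2 ^ d3, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]

  def hamming74_correct(v):
      # compute the 3-bit syndrome and flip the single bit it points to (if any)
      s = tuple(sum(v[i] * H_COLUMNS[i][j] for i in range(7)) % 2 for j in range(3))
      if s != (0, 0, 0):
          v = v[:]
          v[H_COLUMNS.index(s)] ^= 1
      return v

  codeword = hamming74_encode([1, 0, 0, 1])
  assert codeword == [1, 0, 0, 1, 1, 0, 1]      # 1001|101
  received = codeword[:]
  received[2] ^= 1                              # inject a single-bit error
  assert hamming74_correct(received) == codeword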
The redundancy of 3 bits for 4 data bits (3/7 ≈ 43%) in the (7, 4) Hamming code of Fig.
14.H74 is unimpressive, but the situation gets better with more data bits. We can
construct (15, 11), (31, 26), (63, 57), (127, 120), (255, 247), (511, 502), and (1023, 1013)
SEC Hamming codes, with the last one on the list having a redundancy of only 10/1023,
which is less than 1%. The general pattern is having 2r – 1 total bits with r check bits. In
this way, the r syndrome bits can assume 2r possible values, one of which corresponds to
the no-error case and the remaining 2r – 1 are in one-to-one correspondence with the
code’s 2r – 1 bits. It is easy to see that the redundancy is in general r/(2r – 1), which
approaches 0 for very wide codes.
Fig. 14.H74 A Hamming (7, 4) SEC code, with n = 7 total bits, k = 4 data
bits, and r = 3 redundant parity-check bits.
We next discuss the structure of the parity check matrix H for an (n, k) Hamming code.
As seen in Fig. 14.pcm, the columns of H hold all the possible bit patterns of length n – k,
except the all-0s pattern. Hence, n = 2n–k – 1 is satisfied for any Hamming SEC code. The
last n – k columns of the parity check matrix form the identity matrix, given that (by
definition) each parity bit is included in only one parity set. The error syndrome s of
length r is derived by multiplying the r × n parity check matrix H by the n-bit received
word, where matrix-vector multiplication is done by using the AND operation instead of
multiplication and the XOR operation instead of addition.
By rearranging the columns of H so that they appear in ascending order of the binary
numbers they represent (and, of course, making the corresponding change in the
codeword), we can have the syndrome vector correspond to the column number directly
(Fig. 14.rpcm-a), allowing us to use the simple error corrector depicted in Fig. 14.rpcm-b.
The parity check matrix and the error corrector for a general Hamming code are depicted
in Fig. 14.gpcm.
Associated with each Hamming code is an n × d generator matrix G such that the product
of G by the d-element data vector is the n-element code vector. For example:
⎡1 0 0 0⎤            ⎡d1⎤
⎢0 1 0 0⎥   ⎡d1⎤    ⎢d2⎥
⎢0 0 1 0⎥   ⎢d2⎥    ⎢d3⎥
⎢0 0 0 1⎥ × ⎢d3⎥ =  ⎢d4⎥    (14.2.gen)
⎢1 1 1 0⎥   ⎣d4⎦    ⎢p1⎥
⎢1 0 1 1⎥            ⎢p2⎥
⎣0 1 1 1⎦            ⎣p3⎦
To convert a Hamming SEC code into a SEC-DED code, we add a row of all 1s and a
column corresponding to the extra parity bit pr to the parity check matrix, as shown in
Fig. 14.secded-a. [Elaborate further on the check matrix, the effects of a double-bit error,
and the error corrector in Fig. 14.secded-b.]
Hamming codes are examples of linear codes, but linear codes can be defined in other
ways too. A code is linear iff given any two codewords u and v, the bit-vector w = u ⊕ v
is also a codeword. Throughout the discussion that follows, data and code vectors are
considered to be column vectors. A linear code can be characterized by its generator
matrix or parity check matrix. The n × k (n rows, k columns) generator matrix G, when
multiplied by a k-vector d, representing the data, produces the n-bit coded version u of
the data
u = G d (14.3.enc)
Similarly, multiplying the r × n parity check matrix H (r = n – k) by a received n-bit vector v
yields the error syndrome s
s = H v (14.3.chk)
In this section, we introduce two widely used classes of codes that allow flexible design
of a variety of codes with desired error correction capabilities and simple decoding, with
one class being a subclass of the other. Alexis Hocquenghem in 1959 [Hocq59], and
later, apparently independently, Raj Bose and D. K. Ray-Chaudhuri in 1960 [Bose60],
invented a class of cyclic error-correcting codes that are named BCH codes in their
honor. Irving S. Reed and Gustave Solomon [Reed60] are credited with developing a
special class of BCH codes that has come to bear their names.
k ≤ 2s – 1 – 2t (14.4.RS)
Inequality 14.4.RS suggests that the symbol bit-width s grows with the data length k, and
this may be viewed as a disadvantage of RS codes. One important property of RS codes
is that they guarantee optimal minimum code distance, given the code parameters.
Example 14.RS The Reed-Solomon code RS(7, 3) Consider an RS(7, 3) code, with 3-bit
symbols, defined by a generator polynomial of the form g(x) = (1 + x)(α + x)(α2 + x)(α3 + x) =
α6 + α5x + α5x2 + α2x3 + x4, where α satisfies 1 + α + α3 = 0. This code can correct any double-
symbol error. What types of errors are correctable by this code at the bit level?
Solution: Given that each of the 8 symbols can be mapped to 3 bits, the RS(7, 3) code defined
here is a (21, 9) code, if we count bit positions, and has a code distance of 5. This means that any
random double-bit error will be correctable. Additionally, because the RS code can correct any
error that is confined to no more than 2 symbols, any burst error of length 4 is also correctable.
Note that a burst error of length 5 can potentially span 3 adjacent 3-bit symbols.
–––––––––––––––––––––––––––––––
Power Polynomial Vector
–––––––––––––––––––––––––––––––
--        0              000
1         1              001
α         α              010
α2        α2             100
α3        α + 1          011
α4        α2 + α         110
α5        α2 + α + 1     111
α6        α2 + 1         101
–––––––––––––––––––––––––––––––
BCH codes have the advantage of being binary codes, thus avoiding the additional
burden of converting nonbinary symbols into binary via encoding. A BCH(n, k) code can
be characterized by its n × (n – k) parity check matrix P, which allows the computation of
the error syndrome via the vector-matrix multiplication W × P, where W is the received
word. For example, Fig. 14.BCH shows how the syndrome for the BCH(15, 7) code is
computed. The example code shown is also characterized by its generator polynomial:
g(x) = 1 + x4 + x6 + x7 + x8 (14.4.BCH)
Practical applications of BCH codes include the use of BCH(511, 493) as a double-error-
correcting code in a video coding standard for videophones and the use of BCH(40, 32)
as SEC/DED code in ATM communication.
Fig. 14.BCH The parity check matrix and syndrome generation via
vector-matrix multiplication for BCH(15, 7) code.
Consider the (7, 15) biresidue code, which uses the mod-7 and mod-15 residues of a data
word as its check parts. The data word can be up to 12 bits wide and the attached check
parts are 3 and 4 bits, respectively, for a total code width of 19 bits. Figure 14.bires
shows the syndromes generated when data is corrupted by a weight-1 arithmetic error,
corresponding to the addition or subtraction of a power of 2 to/from the data. Because all
the syndromes in this table up to the error magnitude 211 are distinct, such weight-1
arithmetic errors are correctable by this code.
In general, a biresidue code with relatively prime low-cost check moduli A = 2a – 1 and B
= 2b – 1 supports ab bits of data for weight-1 error correction. The representational
redundancy of the code is (a + b)/(ab).
Thus, by increasing the values of a and b, we can reduce the amount of redundancy.
Figure 14.brH compares such biresidue codes with the Hamming SEC code in terms of
code and data widths. Figure 14.brarith shows a general scheme for performing
arithmetic operations and checking of results with biresidue-coded operands. The scheme
is very similar to that of residue codes for addition, subtraction, and multiplication,
except that two residue checks are required. Division and square-rooting remain difficult.
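A toy Python illustration of the checking step for biresidue-coded addition (names ours; correction via the syndrome table of Fig. 14.bires is not shown): the two small residue channels are computed independently of the main adder and then compared with the residues of its output.

  def checked_add(x, y):
      s = x + y                                  # main (possibly faulty) adder
      c7 = (x % 7 + y % 7) % 7                   # small mod-7 check adder
      c15 = (x % 15 + y % 15) % 15               # small mod-15 check adder
      return s, (s % 7 == c7) and (s % 15 == c15)

  s, ok = checked_add(100, 23)
  assert s == 123 and ok             # correct sum passes both residue checks
  bad = s + 32                       # a weight-1 error (e.g., a stuck carry adds 2**5)
  assert bad % 7 != s % 7            # the mod-7 check already catches it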
As in the case of error-detecting codes, we have only scratched the surface of the vast
theory of error-correcting codes in the preceding sections. So as to present a more
complete picture of the field, which is reflected in many excellent textbooks some of
which are cited in the references section, we touch upon a number of other error-
correcting codes in this section. The codes include:
Reed-Muller codes: RM codes have a recursive construction, with smaller codes used to
build larger ones.
Turbo codes: Turbo codes are highly efficient separable codes with iterative (soft)
decoding. A data word is augmented with two check words, one obtained directly from
an encoder and a second one formed based on an interleaved version of the data. The two
encoders for Turbo codes are generally identical. Soft decoding means that each of two
decoders provides an assessment of the probability of a bit being 1. The two decoders
then exchange information and refine their estimates iteratively. Turbo codes are
extensively used in cell phones and other communication applications.
Information dispersal: In this scheme, data is encoded in n pieces, such that any k of the
pieces would be adequate for reconstruction. Such codes are useful for protecting privacy
as well as data integrity.
In this chapter, we have seen error-correcting codes applied at the bit-string or word
level. It is also possible to apply coding at higher levels of abstraction. Error-correcting
codes that are applicable at the data structure level (robust data structures) or at the
application level (algorithm-based error tolerance) will be discussed in Chapter 20.
Problems
b. Show that the burst-error-correcting capability of the code of part b is greater than its random-
error-correcting capability.
c. What is the distance of the product of two codes of distances d1 and d2? Hint: Assume that each of
the two codes contains the all-0s codeword and a codeword of weight equal to the code distance.
d. What can we say about the burst-error-correction capability of the product of two codes in
general?
14.x Title
Intro
a. xxx
b. xxx
15 Self-Checking Modules
“I have not failed. I’ve just found 10,000 ways that won’t work.”
Thomas Edison
It is possible to check the operation of a function unit without the high circuit/area and
power overheads of replication, which entails a redundancy of at least 100%. A number
of methods are available to us for this purpose, which we shall discuss in this chapter.
Among these methods, those based on coding of the function unit’s input and output are
the most rigorous and readily verifiable.
Consider the input and output data spaces in Fig. 15.1a. The encoded input space is
divided into code space and error space; ditto for the encoded output space. The function
f to be realized maps points from the input code space into the output code space (the
solid arrow in Fig. 15.1a). This represents the normal functioning of the self-checking
circuit, during which the validity of the output is verified by the self-checking code
checker in Fig. 15.1b. When a particular fault occurs in the function unit, the designer
should ensure that either an error-space output is produced by the circuit or else the fault
is masked by producing the correct output (the dashed arrows in Fig. 15.1a). Thus, the
challenge of self-checking circuit design is to come up with strategies to ensure that any
fault from a designated fault-class of interest is either detected by the output it produces
or produces the correct output. This chapter is devoted to a review of such design
strategies and ways of assessing their effectiveness.
Fig. 15.1 Input and output code/error spaces of a function unit (a) and checking of
the unit’s encoded output by a self-checking code checker (b).
Figure 15.1b contains two blocks, each of which can be faulty or fault-free. Thus, we
need to consider four cases in deriving a suitable design process.
Both the function unit and the checker are okay: This is the expected or normal
mode of operation during which correct results are obtained and certified.
Only the function unit is okay: False alarm may be raised by the faulty checker,
but this situation is safe.
Only the checker is okay: We have either no output error or a detectable error.
Neither the function unit nor the checker is okay: The faulty checker may miss
detecting a fault-induced error at the function unit output. This problem leads us
to the design of checkers with at least two output lines; a single check signal, if
stuck-at-okay, will go undetected, thus raising the possibility that a double fault of
this kind is eventually encountered. We say that undetected faults increase the
probability of fault accumulation.
Self-monitoring circuits
Problems
16 Redundant Disk Arrays
“If two men on the same job agree all the time, then one is
useless. If they disagree all the time, then both are useless.”
Darryl F. Zanuck
Magnetic disk drives form the main vehicles for supplying stable archival storage in
many applications. Since the inception of hard disk drive in 1956, the recording density
of these storage devices has grown exponentially, much like the exponential growth in
the density of integrated circuits. Currently, tens of gigabytes of information can be
stored on each cm2 of disk surface, making it feasible to provide terabyte-class disk
drives for personal use and petabyte-class archival storage for enterprise systems.
Amdahl’s rules of thumb for system balance dictate that for each GIPS of performance
one needs 1 GB of main memory and 100 GB of disk storage. Thus the trend toward ever
greater performance brings with it the need for ever larger secondary storage.
Figure 16.1 shows a typical disk memory configuration and the terminology associated
with its design and use. There are 1-12 platters mounted on a spindle that rotates at
speeds of 3600 to well over 10,000 revolutions per minute. Data is recorded on both
surfaces of each platter along circular tracks. Each track is divided into several sectors,
with a sector being the unit of data transfer into and out of the disk. The recording density
is a function of the track density (tracks per centimeter or inch) and the linear bit density
along the track (bits per centimeter or inch). In the year 2010, the areal recording density
of inexpensive commercial disks was in the vicinity of 100 Gb/cm2. Early computer disks
had diameters of up to 50 cm, but modern disks are seldom outside the range of 1.0-3.5
inches (2.5-9.0 cm) in diameter.
The recording area on each surface does not extend all the way to the center of the
platter because the very short tracks near the middle cannot be efficiently utilized. Even
so, the inner tracks are a factor of 2 or more shorter than the outer tracks. Having the
same number of sectors on all tracks would limit the track capacity (and, hence, the disk capacity) by
what it is possible to record on the short inner tracks. For this reason, modern disks put
more sectors on the outer tracks. Bits recorded in each sector include a sector number at
the beginning, followed by a gap to allow the sector number to be processed and noted by
the read/write head logic, the sector data, and error detection/correction information.
There is also a gap between adjacent sectors. It is because of these gaps, the sector
number, and error-coding overhead, plus spare tracks that are often used to allow for
“repairing” bad tracks discovered at the time of manufacturing testing and in the course
of disk memory operation (see Section 6.2), that a disk’s formatted capacity is much
lower than its raw capacity based on data recording density.
[Fig. 16.1: Disk memory organization: platters mounted on a spindle, recording area with
tracks 0 through c – 1 on each surface, read/write arms moved by an actuator, and the
direction of platter rotation.]
An actuator can move the arms holding the read/write heads, of which we have as many
as there are recording surfaces, to align them with a desired cylinder consisting of tracks
with same diameter on different recording surfaces. Reading of very closely spaced data
on the disk necessitates that the head travel very close to the disk surface (within a
fraction of a micrometer). The heads are prevented from crashing onto the surface by a
thin cushion of air. Note that even the tiniest dust particle is so large in comparison with
the head separation from the surface that it will cause the head to crash onto the surface.
Such head crashes damage the mechanical parts and destroy a great deal of data on the
disk. To prevent these highly undesirable events, hard disks are typically sealed in
airtight packaging.
Disk performance is related to access latency and data transfer rate. Access latency is the
sum of cylinder seek time (or simply seek time) and rotational latency, the time needed
for the sector of interest to arrive under the read/write head. Thus:
Disk access latency = Seek time + Rotational latency
A third term, the data transfer time, is often negligible compared with the other two and
can be ignored.
Seek time depends on how far the head has to travel from its current position to the target
cylinder. Because this involves a mechanical motion, consisting of an acceleration phase,
a uniform motion, and a deceleration or braking phase, one can model the seek time for
moving by c cylinders as follows, where α, β, and γ are constants:
Seek time = α + β(c – 1) + γ√(c – 1)
The linear term β(c – 1), corresponding to the uniform motion phase, is a rather recent
addition to the seek-time equation; older disks simply did not have enough tracks, and/or
a high enough acceleration, for uniform motion to kick in.
Rotational latency is a function of where the desired sector is located on the track. In the
best case, the head is aligned with the track just as the desired sector is arriving. In the
worst case, the head just misses the sector and must wait for nearly one full revolution.
So, on average, the rotational latency is equal to the time for half a revolution:
Average rotational latency = 30 / rpm s = 30 000 / rpm ms (16.1.avgrl)
Hence, for a rotation speed of 10 000 rpm, the average rotational latency is 3 ms and its
range is 0-6 ms.
The data transfer rate is related to the rotational speed of the disk and the linear bit
density along the track. For example, suppose that a track holds roughly 2 Mb of data and
the disk rotates at 10,000 rpm. Then, every minute, 2 × 1010 (that is, 20 billion) bits pass under the head.
Because bits are read on the fly as they pass under the head, this translates to an average
data transfer rate of about 333 Mb/s = 42 MB/s. The overhead induced by gaps, sector
numbers, and CRC encoding causes the peak transfer rate to be somewhat higher than the
average transfer rate thus computed.
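The two performance formulas are easy to package as a small Python helper (names ours); the assertions reproduce the 3 ms and roughly 42 MB/s figures above, with MB taken as 10^6 bytes to match the decimal units used in the text.

  def avg_rotational_latency_ms(rpm):
      return 30_000 / rpm                        # half a revolution, per Eq. 16.1.avgrl

  def avg_transfer_rate_MBps(bits_per_track, rpm):
      bits_per_second = bits_per_track * rpm / 60
      return bits_per_second / (8 * 10**6)       # 8 bits per byte, decimal megabytes

  assert avg_rotational_latency_ms(10_000) == 3.0
  assert round(avg_transfer_rate_MBps(2 * 10**6, 10_000)) == 42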
While hard disk drives continue to grow in capacity, there are applications for which no
single disk can satisfy the storage needs. There are other applications that need high data
rates, in excess of what can be provided by a single disk, so as to keep a large number of
computational units usefully busy. In such cases, arrays of disks, sometimes numbering
in hundreds or thousands, are used.
Modern disk drives are highly reliable, boasting MTTFs that are measured in decades.
With hundreds or thousands of drives in the same disk-array system, however, a few
failures per year or even per month are to be expected. Thus, to avoid data loss which is
critically important for systems dealing with large data bases, reliability improvement
measures are required.
The intersection of the two requirements just discussed (improved capacity and
throughput on one side and higher reliability on the other), brought about the idea of
using a redundant array of independent disks (RAID) in lieu of a single disk unit. Another
motivating factor for RAID in the mid 1980s was that it no longer made economic sense
to design and manufacture special high-capacity, high-reliability disk units for mainframe
and supercomputer applications, given that the same capacity and throughput could be
provided by combining low-cost disks, mass-produced for the personal computer market.
In fact, the letter “I” in RAID originally stood for “Inexpensive.”
The steep downward trend in the per-gigabyte price of disk memories, and thus of RAID
systems, has made such storage systems affordable even for personal applications, such
as home storage servers.
Much of our discussion in this chapter is in terms of magnetic hard-disk drives, the
common components in RAID systems. Even though SSDs have different failure
mechanisms (they contain no moving parts to fail, but suffer from higher error rates and
limited erase counts), applicable high-level concepts are pretty much the same. Please
refer to [Jere11] for issues involved in designing SSD RAIDs. A good overview of SSD
RAID challenges and of products available on the market is provided in [CW13].
The two ideas of disk mirroring and striping are foundational in the composition of RAID
systems, so we discuss them here in detail before reviewing the structure of modern
redundant disk arrays.
Mirroring refers to duplicating each data file on a separate disk, so that it remains
available in the event of the original disk failing. The original file copy and the mirror
copy are updated together and are thus fully synchronized. Read operations, however,
take place on the original copy. So, even though a mirrored disk system provides no
improvement in access speed performance or data transfer bandwidth, it offers high
reliability due to acting as a two-way parallel system. Only if both disks containing the
original file and its mirror copy fail will we have data loss. The drawback of 100%
redundancy in storage space is what motivated the development of subsequent RAID
schemes based on various forms of low-redundancy data encoding. Mirroring is
sometimes referred to as RAID level 1 (RAID1, for short), because it was the first form of
redundancy applied to disk arrays.
In disk striping, we divide our data into smaller pieces, perhaps all the way down to the
byte or bit level, and store the pieces on different disks, so that all those pieces can be
read out or written into concurrently. This increases the disk system’s read and write
bandwidth for large files, but has the drawback that all the said disks must be functional
for us to be able to recover or manipulate the file. The disk system in effect behaves as a
series system and will thus have a lower reliability than a single disk. Striping is
sometimes referred to as RAID level 0 (RAID0, for short), even though no data
redundancy is involved in it.
The key idea for making disk arrays reliable is to spread each data block across multiple
disks in encoded form, so that the failure of one (or a small number) of the disk drives
does not lead to data loss. Many different encoding schemes have been tried or proposed
for this purpose. One feature that makes the encoding simpler compared with arbitrary
error-correcting codes is the fact that standard disk drives already come with strong error-
detecting and error-correcting codes built in. It is extremely unlikely, though not totally
impossible, for a data error on a specific disk drive to go undetected. This possibility is
characterized by the disk’s bit error rate (BER), which for modern disk drives is on the
order of 10–15. This vanishingly small probability of reading corrupt data off a disk
without detecting the error can be ignored for most practical applications, leading to the
assumption that disk data errors can be detected with perfect accuracy.
So, for all practical purposes, disk errors at the level of data access units (say sectors) are
of the erasure kind, rendering codes capable of dealing with erasure errors adequate for
encoding of data blocks across multiple disks. The simplest such code is duplication.
Note that duplication is inadequate for error correction with arbitrary errors. In the latter
case, when the two data copies disagree, there is no way of finding out which copy is in
error. However, if one of the two copies is accidentally lost or erased, then the other copy
supplies the correct data. Simple parity or checksum schemes can also be used for
erasure-error correction. Again, with arbitrary (inversion) errors, a parity bit or checksum
can only detect errors. The same parity bit or checksum, however, can be used to
reconstruct a lost or erased bit/byte/word/block.
Fig. 16.parity Parity-coded bits and blocks of data across multiple disks.
There are 4 data bits/bytes/words/blocks, followed by a
parity bit/byte/word/block.
Whether parity encoding is done with bits or blocks of data, the concepts are the same, so
we proceed with the more practical block-level parity encoding. The parity block P for
the four blocks A, B, C, and D of data depicted in Fig. 16.parity is obtained as:
P = A ⊕ B ⊕ C ⊕ D (16.3.prty)
Then, if one of the blocks, say B, is lost or erased, it can be rebuilt thus:
B = A ⊕ C ⊕ D ⊕ P (16.3.rbld)
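As a concrete illustration of equations (16.3.prty) and (16.3.rbld), the following Python sketch computes the parity block by bytewise XOR and rebuilds a lost block from the survivors; the four short byte strings are illustrative stand-ins for disk blocks, not part of any particular RAID implementation.

from functools import reduce

def parity_block(blocks):
    """XOR a list of equal-length byte blocks to form the parity block P."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

# Four data blocks A, B, C, D (in practice, padded to equal length).
A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
P = parity_block([A, B, C, D])          # Eq. (16.3.prty)

# If block B is lost (erased), rebuild it from the surviving blocks and P.
B_rebuilt = parity_block([A, C, D, P])  # Eq. (16.3.rbld)
assert B_rebuilt == B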
Note that if the disk drive holding the block B becomes inaccessible, one needs to read
the entire contents of the other four disks in order to reconstruct all the lost blocks. This is
a time-consuming process, raising the possibility that a second disk fails before the
reconstruction is complete. This is why it may be desirable to include a second coding
scheme to allow the possibility of two disk drive failures. For example, we may use a
second coding scheme in which the check block Q is derived from the data blocks A, B,
C, and D in a way that is different from P:
Q = g(A, B, C, D) (16.3.chk2)
Then, the data will continue to be protected even during the rebuilding process after the
first disk failure.
We have already alluded to RAID0 (provides high performance, but no error tolerance),
RAID1 (provides high data integrity, but is an overkill in terms of redundancy), and
RAID10 (more accurately, RAID1/0) as precursors of the more elaborate RAID2-RAID6
schemes to be discussed in this section.
[The following two links no longer work. I am looking for equivalent replacements]
http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations/
http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations/
From the website pointed to by the latter link, I find the following for Seagate disks
MTTF = 1.0-1.5 M hr
Using the data above, the poster finds mean time to data loss as follows:
Problems
a. Assuming that the situation for disk memories is similar, why does this data refute the assumption
that disk lifetimes are exponentially distributed?
b. Which distribution is called for in lieu of exponential distribution?
d. What is the reliability of the disk array of part c over a 1-year period?
e. For p = 0.1, which is more reliable: The disk array of part a or that of part c?
f. For p = 0.6, which is more reliable: The disk array of part a or that of part c?
Failed
Ernest K. Gann, The Black Watch
A system moves from erroneous to malfunctioning state when an error affects the
functional behavior of some constituent subsystem. Design or implementation
flaws can lead to malfunctions directly, even in the absence of errors. At the
architectural level, malfunctions have manifestations similar to those of faults
occurring at the logic level. One difference is that instead of pass/not-pass testing
to detect the presence of malfunctions, we tend to use diagnostic testing to detect
and locate the offending modules. We then concern ourselves with methods to
tolerate such malfunctions via redundancy and reconfiguration. Because modules
interacting at the architectural level tend to be rather complex, standby or
dynamic redundancy is often preferred to massive or static redundancy. Making
either type of redundancy work, however, is nontrivial in view of difficulties in
synchronization and maintenance of data integrity during and after switchovers.
We conclude this part with a discussion of malfunction tolerance by means of
robustness features in parallel processing systems.
17 Malfunction Diagnosis
“If information systems fail or seriously malfunction, societal
activities lose support, and this may sometimes result in
uncontrollable chaos in society as a whole.”
H. Inose and J. R. Pierce, Information
Technology and Civilization
Modern computer systems have processing capabilities that are distributed among
multiple modules, even when there is only one “processor” in the system. Examples of
units with processing capabilities include graphics cards, network interfaces, input/output
channels, and device controllers. Furthermore, multiple CPUs are being employed to
provide the required computational power in a wide spectrum of systems, given the
marked slowdown in clock frequency improvements and the greater energy efficiency of
slower processors. Thus, it makes sense to try to use these capabilities in performing
cross-diagnostic checks among such modules.
Self-diagnostic checks are quite common. When you turn on your desktop or laptop
computer, a diagnostic check is run to verify the correct functioning of major subsystems,
including the CPU, memory, disk drive, and various interfaces. The check is not
exhaustive but is intended to catch most common problems. In the context of dependable
computing, we need a bit more coverage than such quick sanity checks.
In our discussion of fault testing in Chapter 9, we assumed that a special tester unit
applied test patterns to the circuit under test and used the circuit’s outputs to render a
judgment about its health. In the case of intermodule diagnostic testing, this approach may
prove impractical, given that the complexity of the modules involved would require an
extremely heavy volume of test data to be passed between them. One way around this
difficulty is for the tester to initiate a self-diagnostic process in another module to
determine whether it is working properly. This self-diagnostic should not have a yes/no
answer, because such a binary outcome would increase the chances of a malfunctioning
module generating a “yes” answer.
Here is a workable strategy. The initiator supplies the module under test with a “seed
value” to be used in the diagnostic process. The self-diagnosing unit uses the supplied
seed as an argument of an extensive computation that exercises nearly all of its
components, including memory resources, ending up with an “answer” that is a known
function of the seed value. It is this answer that is returned to the initiator as the
diagnostic outcome. Given this outcome, it is then easy for the test initiator to compare
the returned value with the expected result and to deduce whether the reporting unit is
healthy. With a 64-bit diagnostic outcome, say, it is far less likely than with a one-bit
pass/fail answer for a malfunctioning unit to accidentally produce the correct result.
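A minimal sketch of this challenge-response idea is shown below; the hash-based check function and the 4-KB buffer it touches are illustrative assumptions standing in for the extensive computation described above, not a prescribed diagnostic routine.

import hashlib

def self_test(seed: int) -> int:
    """Stand-in for an extensive computation whose 64-bit result is a known function of the seed."""
    buf = bytearray((seed >> (8 * (i % 8))) & 0xFF for i in range(4096))  # exercise memory
    digest = hashlib.sha256(bytes(buf)).digest()                          # exercise computation
    return int.from_bytes(digest[:8], "big")                              # 64-bit outcome

def diagnose(module_response: int, seed: int) -> bool:
    """The initiator compares the returned outcome with the expected value for this seed."""
    return module_response == self_test(seed)

seed = 0x1234_5678_9ABC_DEF0
assert diagnose(self_test(seed), seed)   # a healthy module passes the check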
Note that when the health of a unit is suspect, there is no reason to trust its ability to
execute the very instructions that constitute the self-diagnostic routine. For this reason,
we often use a layered approach to self-diagnosis. At the beginning of the process, a
small core of the module is tested. This core may be in charge of executing some very
simple instructions and may have very limited memory and other resources. Once it has
been established that the core can be trusted with regard to its health, the circle of trust is
gradually extended to other parts of the module, in each phase using trusted parts whose
health has been previously established to test new parts.
The PMC system-level diagnosis model used in this book is due to Preparata, Metze, and
Chien [Prep67]. The system under diagnosis consists of a set of modules for which we
have defined a testing graph (Fig. 17.2a). A directed edge from Mi to Mj in the testing
graph indicates that module Mi can test module Mj. Note that the testing graph may or
may not correspond to the actual physical connectivity among the modules. For example,
the four modules of Fig. 17.2a may be interconnected by a bus, thus making it possible
for each of them to test any other module. In this case, the testing graph is a proper
subgraph of the complete graph characterizing the physical connectivity among the
modules. Among reasons for selecting a subset of the available physical links to form a
testing graph is the desire to reduce the communications overhead and to limit the module
workloads that result from administering self-diagnostic tests and interpreting the results.
The diagnosis verdicts Dij ∈ {0, 1}, with 0 meaning “pass” and 1 representing “fail,”
form an n × n Boolean diagnosis matrix D, which is usually quite sparse. In particular,
the diagonal entries of D are never used (unit i does not judge itself). We assume that a
good unit always renders a correct judgement about other units, that is, tests have perfect
coverage and no false alarms, but that a verdict rendered by a malfunctioning unit is
arbitrary and cannot be trusted. Note that the PMC model which we use for malfunction
diagnosis is often referred to as a model for “system-level fault diagnosis.” Following our
terminology, a system-level fault is referred to as a malfunction.
(a) Testing graph (b) Pass outcome of a test (c) Fail outcome of a test
Fig. 17.2 System-level testing graph and the two possible test
outcomes when module i tests module j.
Example 17.1: Interpreting test outcomes Consider the following diagnosis matrix D for the
4-module system of Fig. 17.2a, in which dashes denote lack of testing. If we know that no more
than one module can be malfunctioning, interpret the test outcomes in D.
      ⎡ −  0  −  − ⎤
  D = ⎢ −  −  0  0 ⎥
      ⎢ 1  −  −  − ⎥
      ⎣ 1  −  0  − ⎦
The PMC model is but one of the models available for malfunction diagnosis, but it is
widely used and highly suitable for the points we want to make regarding key notions of
system-level diagnosis. Other malfunction diagnosis models include the comparison
model [Maen81], in which an observer assigns tasks to processors and draws conclusions
regarding their health via comparing the results they return.
To establish 1-diagnosability, we consider all possible single malfunctions and verify that
the resulting syndromes (sets of diagnosis outcomes) are distinct from each other and
from the no-malfunction case, regardless of how the malfunctioning unit behaves in its
assessments of other modules.
Example 17.synd: One-step 1-diagnosability Find the syndromes for the system of Fig. 17.2a
and show that it is one-step 1-diagnosable but not 2-diagnosable.
Solution: Syndromes for the system of Fig. 17.2a are listed in Fig. 17.synd-a, with the results
confirming the system’s 1-diagnosability, given that the rows corresponding to different single
malfunctions differ from each other and from the malfunction-free case in at least one position. On
the other hand, we see that the system of Fig. 17.2a is not 2-diagnosable, because the syndrome for
the double malfunction {M0, M1} may be indistinguishable from the single malfunction {M0} in
some cases. Given the restriction to at most one malfunctioning unit, the syndrome dictionary in
Fig. 17.synd-b allows us to translate the observed 6-bit syndrome {D01, D12, D13, D20, D30, D32} to
the identity of the malfunctioning unit, if any.
Fig. 17.synd (a) Syndromes for single and a few double malfunctions; (b) syndrome dictionary.
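The exhaustive analysis of Example 17.synd can be mechanized for small systems. The sketch below enumerates, for the testing graph of Fig. 17.2a, all syndromes consistent with each malfunction set of size at most t (a malfunctioning tester’s verdicts are treated as arbitrary) and checks whether any two sets can produce the same syndrome; this brute-force approach is meant only for small n.

from itertools import product, combinations

TESTS = [(0, 1), (1, 2), (1, 3), (2, 0), (3, 0), (3, 2)]   # order: D01 D12 D13 D20 D30 D32

def syndromes(bad):
    """All outcome vectors consistent with the set 'bad' of malfunctioning modules."""
    choices = []
    for i, j in TESTS:
        if i in bad:
            choices.append((0, 1))                  # verdict of a bad tester is arbitrary
        else:
            choices.append((1,) if j in bad else (0,))
    return set(product(*choices))

def one_step_t_diagnosable(n, t):
    """True if every two distinct malfunction sets of size <= t yield disjoint syndrome sets."""
    sets = [frozenset(c) for k in range(t + 1) for c in combinations(range(n), k)]
    return all(syndromes(a).isdisjoint(syndromes(b))
               for a, b in combinations(sets, 2))

print(one_step_t_diagnosable(4, 1))   # True  -- one-step 1-diagnosable
print(one_step_t_diagnosable(4, 2))   # False -- not 2-diagnosable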
General results can be obtained for 1-step t-diagnosability that can be applied in some
cases in lieu of an exhaustive analysis of the kind done in Example 17.synd.
Based on Theorem 17.td-nec, the 4-module system of Fig. 17.2a can never become 2-
diagnosable, regardless of how many links we add to the testing graph. By Theorem
17.td-suff, the same system is 1-diagnosable.
The diagnosability problem has a lot in common with a collection of popular puzzles
about liars and truth-tellers. Consider the following setting. You visit an island whose
inhabitants are from two tribes: members of one tribe (“liars”) consistently lie; members
of the other tribe (“truth-tellers”) always tell the truth. Members of the two tribes are
indistinguishable to us, but they can recognize each other. The puzzles then ask us various
questions about how to deduce the truth about various situations from the unreliable
responses we receive. In fact, malfunction diagnosis corresponds to an extended, more
challenging, version of these puzzles in which a third tribe (“randoms”) is introduced.
Healthy modules correspond to “truth-tellers,” because they correctly diagnose other
modules. Malfunctioning modules correspond to “randoms,” because their judgments are
unrelated to the truth. Interestingly, liars aren’t as hard to deal with as randoms, because
their consistency provides us with more information.
Problem 17.a1 (the extent of 1-step t-diagnosability): Given a directed graph defining
the test links, find the largest value of t for which the system is 1-step t-diagnosable.
The foregoing problem is easy if no two units test one another and fairly difficult if
mutual testing is allowed. There exists a vast amount of published research dealing with
Problem 17.a1.
Problem 17.a2 (1-step malfunction diagnosis): Given a directed graph defining the test
links and a set of test outcomes, identify all malfunctioning units, assuming there are no
more than t such units.
Problem 17.a2, which arises when we want to repair or reconfigure a system using test
outcomes, is solved via table lookup or analytical methods.
Problem 17.s (connection assignment for 1-step t-diagnosability): Specify the test links
that would make an n-unit system 1-step t-diagnosable, using as few test links as
possible.
In Example 17.synd, we established that the system of Fig. 17.2a is not 2-diagnosable,
because the syndrome for the double malfunction {M0, M1} is potentially indistinguishable from
that of the single malfunction {M0}. We may note that a common syndrome for the two
malfunction patterns just listed does provide some useful diagnostic information: that M0
is definitely bad. So, we can potentially use this information to replace or repair M0
before further testing to identify other malfunctioning modules. This observation leads us
to the notion of sequential diagnosability, which means that the test syndrome points
unambiguously to at least one malfunctioning unit. Assuming that we began with k
malfunctions, replacing or repairing one bad unit leaves us with no more than k – 1
malfunctions, thus reducing the diagnosis problem to a simpler one. Iterating in this
manner allows us to identify all k malfunctioning units in k or fewer rounds.
Example 17.seqd1: Sequential diagnosability Show that the system of Fig. 17.2a isn’t
sequentially 2-diagnosable.
Solution: The desired result is readily established by noting that the syndromes for the
malfunction sets {M0, M2} and {M1, M3}, that is, (x 1 0 x 1 1) and (1 x x 0 x x), where x denotes
0 or 1 and test results are listed in the order shown in Fig. 17.synd-a, are potentially indistinguishable.
In fact, the result of Example 17.seqd1 could have been established based on the following
general theorem.
Solution: As an example, consider the 5-node directed ring of Fig. 17.dring-a. Possible test
outcomes for single and some double malfunctions are depicted in Fig. 17.dring-b. We note
that even though some of the syndromes shown overlap, they all point to M0 being malfunctioning.
So, under sequential diagnosability, there is no ambiguity and we can replace M0 before
proceeding. The set of all syndromes that point to M0 being malfunctioning is shown in Fig.
17.dring-c.
It is possible to prove that an n-node directed ring is sequentially t-diagnosable for any t
satisfying ⌊(t² – 1)/4⌋ + t + 2 ≤ n (see Problem 17.5), a result from which two parts of
Example 17.seqd2 follow.
Fig. 17.dring A 5-node directed ring and its test syndromes.
Analysis and synthesis problems for sequential diagnosability parallel those of 1-step
diagnosability, and have likewise been extensively studied.
Problem 17.a2-s (sequential malfunction diagnosis): Given a directed graph defining the
test links and a set of test outcomes, identify at least one malfunctioning unit (preferably
more), assuming there are no more than t such units.
Problem 17.s-s (connection assignment for sequential diagnosability): Specify the test
links that would make an n-unit system sequentially t-diagnosable, using as few test links
as possible.
An n-node ring, with n ≥ 2t + 1, with added test links from 2t – 2 other nodes to node 0
(besides node n – 1, which already tests it) has the required property.
So far in our discussions of diagnosability, we have demanded full accuracy in the sense
of requiring that all malfunctioning modules be identified (1-step diagnosability) or that
some malfunctioning modules, and only malfunctioning modules, be deduced (sequential
diagnosability) from the test outcomes. By relaxing these requirements, we may be able
to successfully diagnose systems that would be undiagnosable under the former, stricter
definitions.
Given the values of t and s, the problem of deciding whether a system is t/s-diagnosable
is co-NP-complete. However, there exist efficient, polynomial-time, algorithms to find
the largest integer t such that the system is t/t- or t/(t + 1)-diagnosable.
Finally, we can integrate the notion of safety, where some malfunctions are promptly
detected but not necessarily diagnosed, into our definitions. Safe diagnosability implies
that up to t malfunctions are correctly diagnosed and up to u are detected, where u > t.
This kind of diagnosability/testability, which is reminiscent of combined error-
correcting/detecting codes, ensures that there is no danger of incorrect diagnosis for a
larger number of malfunctions (up to u), thus increasing system safety.
Diagnosability results have been published for a great variety of regular interconnection
networks, such as the three topologies shown in Fig. 17.topo. Topologies like these have
been used in the design of many general-purpose parallel computers and special-purpose
architectures for high-performance computing. The specific examples shown in Fig.
17.topo are composed of degree-4 nodes and thus can be 1-step 4-diagnosable at best.
Proving that they indeed possess this level of diagnosability or deriving their level of
sequential diagnosability are active research areas.
Problems
as intercluster links, whereas the links of G are intracluster links. Thus, the node degree of Bsw(G) is one more
than that of G. The intercluster links define a bipartite subgraph; hence, the name “biswapped.” Study the
diagnosability of biswapped networks.
18 Malfunction Tolerance
“If you improve or tinker with something long enough, eventually
it will break or malfunction.”
Arthur Bloch
“The test of courage comes when we are in the minority. The test
of tolerance comes when we are in the majority.”
Ralph W. Sockman
For example, Fig. 18.1a shows a 16-processor parallel computer whose nodes are
interconnected via a 2D torus topology. It is readily seen that each source node is
connected to each possible destination node via 4 parallel node-and-edge-disjoint paths.
Because of this property, any set of 3 malfunctioning resources (nodes and/or links) can
be tolerated without cutting off intermodule communication. In graph-theoretic terms, we
say that the system in Fig. 18.1a is 4-connected. High connectivity is a desirable attribute
for malfunction tolerance.
As a second example, consider the 3-bus system of Fig. 18.1b, where each I/O port from
a 2-port module is connectable to one of two buses. The dashed boxes in the middle can
be viewed as programmable 2 × 2 switches that allow each module to communicate with
any other module via two different paths. Thus, besides tolerating malfunctioning
modules, we can route around a single malfunctioning bus as well.
We can abstract the following general strategy from the scheme of isolating a module
from a bus, as discussed in the preceding paragraph. When a critical shared resource is to
be accessed by a module, some form of external authorization is needed to ensure that a
run-away module does not cause the entire system to crash. The situation can be likened
to transactions at a bank. A customer can perform simple transactions at an ATM. If the
customer wants to withdraw a larger sum of money than the ATM’s transaction limit,
assistance from a teller must be sought (a form of external authorization). For very large
transactions, even the teller does not have the required authority and the branch manager
gets involved (the second external authorization).
(a) Isolating a module from a bus (b) Reading from and writing to a bus
Malfunction tolerance would be much easier if modules would simply stop functioning,
rather than engage in arbitrary behavior. Unpredictable or arbitrary behavior on the part
of a malfunctioning element, sometimes referred to as a “Byzantine malfunction,” is
notoriously difficult to handle. One source of the difficulty is that the module’s arbitrary
behavior may make it seem different to multiple external observers, some judging it to be
healthy and others detecting the malfunction.
Methods are available to ensure that a malfunctioning module stops in an inert state,
where it can’t confuse the system’s healthy modules. Here is one way to accomplish this
goal. Suppose modules run (approximately) synchronized clocks and have access to
reliable stable storage, where critical data can be stored. A k-malfunction-stop module
can be implemented from k + 1 identical units of this kind, operating in parallel. The key
element for this realization is an s-process that decides when the redundant module has
stopped and sets a “stop” flag in stable storage to “true.”
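The following sketch illustrates one plausible stop rule for the s-process; the rule shown (declare a stop when the k + 1 units disagree on a synchronized step, or one of them misses its reporting deadline) and the data structures standing in for stable storage are illustrative assumptions, not the only possible realization.

# A minimal sketch of the s-process for a k-malfunction-stop module built from
# k + 1 replicated units. The stop rule below is an illustrative assumption.

STOP_FLAG = {"stopped": False}        # stand-in for a flag kept in stable storage

def s_process(step_outputs):
    """step_outputs: list of k+1 values reported for the same synchronized step,
    with None marking a unit that failed to report by its deadline."""
    if STOP_FLAG["stopped"]:
        return None                              # once stopped, the module stays inert
    if None in step_outputs or len(set(step_outputs)) != 1:
        STOP_FLAG["stopped"] = True              # record the stop in stable storage
        return None
    return step_outputs[0]                       # agreed-upon output is released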
Malfunction-stop modules
Logs are essential tools for system recovery via undo and redo operations. The undo and
redo operations are quite similar to their namesakes in word-processing and other
common applications. When a detected malfunction makes it impossible or inadvisable to
continue processing as usual, the partial effects of incomplete transactions must be
undone to maintain consistency in stable storage. Similarly, the partially completed
transaction must be redone when circumstances allow it.
Regular arrays of modules are used extensively in certain applications that need high-
throughput processing of massive amounts of data. Examples include communication
routing, scientific modeling, visual rendering, and certain kinds of simulation. Note that
regularity in our discussions here refers to the interconnection pattern, not physical layout
(the latter may be the case for on-chip systems). The focus of our discussion will be on
2D arrays, although some of the techniques can be extended to higher dimensions in a
straightforward manner.
Row/column bypassing is a widely used method for reconfiguring 2D arrays. Let us first
focus on row bypassing for modules that communicate in one direction: from top to
bottom. As seen in Fig. 18.pass-a, placing a multiplexer at the input to each module
allows us to bypass the previous row, taking the input from the row immediately above it.
Applying the same scheme to the other 3 inputs of a 4-port module leads to the building
block of Fig. 18.pass-b, which allows both row and column bypassing.
Fig. 18.pass (a) Row bypassing in downward direction; (b) module with row/column bypass.
For 2D meshes without wraparound links (i.e., not torus networks), an ingenious
reconfiguration scheme allows the use of a single spare module for replacing any
malfunctioning module located in any row/column. The scheme, depicted in Fig.
18.spare-b, takes advantage of unused ports at the edges of a mesh to provide additional
links that are not used under normal conditions. When a module malfunctions, the
remaining healthy modules are renumbered (assigned new row/column numbers) and
their ports relabeled to form a new working mesh. In the example of Fig. 18.spare-b, once
the malfunctioning module 5 has been isolated, the system is reconfigured as shown, so
that, for example, the new row 0 will consist of modules 6, 7, 8, and 9, which are
renumbered 0, 1, 2, and 3.
Task scheduling problems are hard even when resource requirements and availability are
both fixed and known a priori. These problems become significantly more difficult when
resource requirements fluctuate and/or resource availability changes dynamically as
modules malfunction and are returned to service after repair.
Problems
18.1 xxx
Intro
a. xxx
b. xxx
c. xxx
18.2 xxx
Intro
a. xxx
b. xxx
c. xxx
d. xxx
19 Standby Redundancy
“A long life may not be good enough, but a good life is long
enough.”
Anonymous
No amount of spare resources is useful if the malfunctioning of the active module is not
promptly detected. Detection options include:
Activity monitor
Control-flow watchdog
Spare modules in standby redundancy are of two main kinds. A spare that is inactive,
perhaps even powered down to conserve energy and to reduce its exposure to wear and
tear and thus failure, is known as a cold spare. Conversely, an active spare that is fully
ready to take over the function of a malfunctioning active module at any time is known as
a hot spare. In a manner similar to the use of the term “firmware” as something that
bridges the gap between hardware and software, we designate a spare that falls between
the two extremes above, perhaps being powered up but not quite up to date with respect
to the state of the active module, as a warm spare.
Conditioning refers to preparing a spare module to take the place of an active module.
Switching mechanisms for standby sparing have a lot in common with those used for
defect circumvention, particularly when spares are shared among multiple units.
Problems
19.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
Parallel processors are divided into two classes: global-memory and distributed-
memory systems. In global-memory multiprocessors, a number of processing nodes are
connected to a large collection of memory modules, or banks, via a processor-to-memory
interconnection network, often implemented as a multistage structure of switching
elements. Instead of linking processors to memory banks, such multistage networks can
also be used to interconnect processing nodes to each other. In the latter processor-to-
processor interconnection usage, such networks are also called indirect networks, because
the connections among processors are established indirectly through switches, rather than
directly via links that connect the processors’ communication ports.
We will deal with multistage (indirect) networks in Section 20.6. The rest of this chapter
is devoted to problems associated with direct interprocessor communication networks
exemplified by the 64-node (6-dimensional) hypercube, depicted in Fig. 20.nets-a. A
wide variety of different interconnection networks have been proposed over the years, so
much so that the multitude of options available is often referred to as “the sea of
interconnection networks” (Fig. 20.nets-b). The proposed networks differ in topological,
performance, robustness, and realizability attributes.
We often assume that a parallel system is built from homogeneous processing nodes,
although interconnected heterogeneous nodes are sometimes considered. The internode
communication architecture is characterized by the type of routing scheme (packet
switching vs. wormhole or cut-through) and the routing protocols supported (e.g.,
whether nodes have buffer storage for messages that are passing through). Such details
don’t matter at the level of graph representation, which models only connectivity issues.
In robust parallel processing, we don’t make a distinction between ordinary resources and
spare resources. All resources are pooled and what would have been spare modules,
communication links, and the like are made available to boost performance in the absence
of malfunctions. The nominally extra processing and communication resources allow us
to overcome the effects of malfunctioning processors and transmission paths by simply
switching to alternates.
Two key notions in allowing the tolerance of malfunctioning nodes and links are those of
connectivity and parallel paths. Node-disjoint paths, which are useful for malfunction
tolerance, can also be used for improved performance via parallel transmission of pieces
of long messages.
Internode distances vary as a result of malfunctions. For example, if one link becomes
unavailable in a 2D mesh, the formerly distance-1 pair of nodes that it connected turn into
distance-3 nodes. Because a network of connectivity κ can become partitioned as a result
of κ or more malfunctions, it is common to analyze the behavior of direct networks in the
presence of worst-case patterns of κ – 1 malfunctions. For example, one may ask how the
diameter of a network, or its average internode distance, is affected in the presence of
such worst-case patterns of malfunctioning units.
One of the most challenging open problems of graph theory is synthesizing graphs that
have small diameters, while maintaining a desirably small node degree. More
specifically, using nodes of a given maximum degree, we seek to synthesize the largest
possible graphs with bounded diameter or, given a desired size, we wish to minimize the
resulting diameter [Chun87]. Of particular interest, in the context of dependable
computing, are graphs with small diameters that remain small after deleting a few nodes
or edges.
For very large, or for loosely-connected networks, it is more realistic to assume that each
node knows only about malfunctioning resources in its immediate neighborhood. Then,
path calculation must occur in a distributed manner. Such distributed routing decisions
may lead to:
Suboptimal paths: Messages may not travel via the shortest available paths
Deadlocks: Messages may interfere with, or circularly wait for, each other
Livelocks: Messages may wander around, never reaching their destinations
Embedding is a mapping of one network (the guest) onto another network (the host).
Emulation allows one network to behave as another. The two notions are related, in the
sense that a good embedding can be used to devise an efficient emulation. Both notions
are useful for malfunction tolerance.
Multistage networks use switches to interconnect nodes, instead of providing direct links
between them.
Just as was the case for direct networks, the design space for indirect networks is quite
vast, leading to the term “sea of indirect interconnection networks.” The proposed
networks differ in topological, performance, robustness, and realizability attributes.
Problems
Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).
[Multilevel model column of the table: Defective, Faulty, Erroneous, Malfunctioning, ...]
“Junk is the ultimate merchandise. The junk merchant does not
sell his product to the consumer, he sells the consumer to the
product. He does not improve and simplify his merchandise, he
degrades and simplifies the client.”
William S. Burroughs
Booker T. Washington
21 Degradation Allowance
“A hurtful act is the transference to others of the degradation
which we bear in ourselves.”
Simone Weil
The quotation “eighty percent of success is showing up,” from humorist Woody
Allen, can be rephrased for fail-soft systems as “eighty percent of not failing is
degradation allowance.” This is because malfunctions do not automatically lead to
degradation: they may engender an abrupt transition to failure. In other words,
providing mechanisms to allow operation in degraded mode is the primary
challenge in implementing fail-soft computer systems. For a malfunction to be
noncatastrophic, its identification must be quick and the module’s internal state
and associated data must be fully recoverable. Stable storage, checkpointing, and
rollback are some of the techniques at our disposal for this purpose.
Degradations occur in many different ways. A byte-sliced arithmetic/logic unit might lose
precision if a malfunctioning slice is bypassed through reconfiguration (inaccuracy). A
dual-processor real-time system with one malfunctioning unit might choose to ignore less
critical computations in order to keep up with the demands of time-critical events
(incompleteness). A malfunctioning disk drive within a transaction processing system can
effectively slow down the system’s response (tardiness). These are all instances of
degraded performance. In this broader sense, performance is quite difficult to define, but
Meyer [Meye80] does an admirable job:
Remember that in the quoted text above, faults/failures correspond to malfunctions in our
terminology. It was the concerns cited above that led to the definition of performability
(see Section 2.4) as a composite measure that encompasses performance and reliability
and that constitutes a proper generalization of both notions.
Graceful degradation isn’t a foregone conclusion when a system has resource redundancy
in the form of multiple processors, extra memory banks, parallel interconnecting buses,
and the like. Rather, active provisions are required to ensure that degradation, rather than
total interruption, of service will result upon module malfunctions. The title “Degradation
Allowance” for this chapter is intended to drive home the point that degradation must be
explicitly provided for in the design of a system.
Example 21.1: Degradation allowance is not automatic Describe a system that has more
resources of a particular kind than absolutely needed but that cannot gracefully degrade when even
one of those resources becomes unavailable.
Solution: Most automobiles have 4 wheels. In theory, a vehicle can operate with 3 wheels; in fact,
a variety of 3-wheeled autos exist. However, an ordinary 4-wheeled vehicle cannot operate if one
of the wheels becomes unavailable, because the design of 3-wheeled vehicles is quite different
from 4-wheeled ones.
Among the prerequisites for graceful degradation are quick diagnosis of isolated
malfunctions, effective removal and quarantine of malfunctioning elements, on-line
repair (preferably via hot-pluggable modules), and avoidance of catastrophic
malfunctions. The issues surrounding degradation management, that is, adaptation to
resource loss via task prioritization and load redistribution, monitoring of system
operation in degraded mode, returning the system to intact or less degraded state at the
earliest opportunity, and resuming normal operation when possible, will be discussed in
Chapter 22.
On-line and off-line repairs, and their impacts on system operation and performance, are
depicted in Fig. 21.fsoft. On-line repair is accomplished via the removal/replacement of
affected modules in a way that does not disrupt the operation of the remaining system
parts. Off-line repair involves shutting down the entire system while affected modules are
removed and their replacements are plugged in. Note that with on-line repair, it may be
possible to avoid system shut-down altogether, thus improving both availability and
performability of the system.
Next, a working configuration must be created that includes only properly functioning
units. Such a working configuration would exclude processors, channels, controllers, and
I/O elements (such as sensors) that have been identified as malfunctioning. Other
examples of resources that might be excluded are bad tracks on disk, garbled files, and
noisy communication channels. Additionally, virtual address remapping can be used to
remove parts of memory from use. In the case of a malfunctioning cache memory, one
might bypass the cache altogether or use a more restricted mapping scheme that exposes
only the properly functioning part.
The final considerations before resuming disrupted processes include the recovery of
state information from removed units, if possible, initializing any new resource that has
been brought on-line, and reactivating processes via rollback or restart.
When, at some future time, the removed units are to be returned to service (say, after
completion of repair or upon verification that the malfunction resulted from transient
rather than permanent conditions), the steps outlined above may have to be repeated.
A storage device or system is stable if it can never lose its contents under normal
operating conditions. This kind of permanence is lacking in certain storage devices, such
as register files and SRAM/DRAM chips, unless the system is provided with battery
backup for a time duration long enough to save the contents of volatile memories, such as
a disk cache, on a more permanent medium. Until recently, use of disk memory was the
main way of realizing stable storage in computing devices, but now there are other
options such as flash memory and magnetoresistive RAM. Combined stability and
reliability can be provided for data via RAID-like methods.
Malfunction tolerance would become much easier if affected modules simply stopped
functioning, rather than engaging in arbitrary behavior that may be disruptive to the rest of
the system. Unpredictable or Byzantine malfunctions are notoriously difficult to handle.
Thus, we are motivated to seek methods of designing modules that behave in a
malfunction-stop manner.
Given access to a reliable stable storage, along with its controlling s-process and
(approximately) synchronized clocks, we can implement a k-malfunction-stop module
from k + 1 units that can perform the same function. These units do not have to be
identical. Here is how the s-process decides whether the high-level module has stopped:
The simplest recovery scheme is restart, in which case the partially completed actions
during the unsuccessful execution of a transaction must be undone. One way to achieve
this end is by using logs, which form the main subject of this section. Another way is
through the use of a method known as shadow paging. Note, however, that recovery with
restart may be impossible in systems that operate in real time, performing actions that
cannot be undone. Examples abound in the process control domain and in space
applications. Such actions must be compensated for as part of the degradation
management strategy.
The use of recovery logs has been studied primarily in connection with database
management systems. Logs contain sufficient information to allow the restoration of our
system to a recent consistent state. They maintain information about the changes made to
the data by various transactions. A previously backed up copy of the data is typically
restored, followed by reapplying the operations of committed transactions, up to the time
of failure, found in the recovery log.
A common protocol for recovery logs is write-ahead logging (WAL). Log entries can be
of two kinds, undo-type entries and redo-type entries, with some logs containing both
kinds of entries. Undo-type log entries hold old data values, so that the values can be
restored if needed. Redo-type entries hold new data values to be used in case of operation
retry. The main idea of write-ahead logging is that no changes should be made before the
necessary log entries are created and saved. In this way, we are guaranteed to have the
proper record for all changes should recovery be required. [More elaboration needed.]
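The sketch below captures the write-ahead discipline with combined undo/redo entries; the in-memory structures standing in for stable storage, and the simple recovery policy (redo committed work in log order, undo uncommitted work in reverse), are simplifying assumptions.

# Minimal write-ahead-logging sketch: each log entry is appended *before* the
# corresponding change is applied and carries both the undo (old) and redo (new) value.
log = []              # append-only log (stand-in for the stable log)
data = {}             # stand-in for data items in stable storage

def wal_write(txn, key, new_value):
    log.append(("update", txn, key, data.get(key), new_value))  # log first ...
    data[key] = new_value                                       # ... then update

def wal_commit(txn):
    log.append(("commit", txn, None, None, None))

def recover():
    """Redo updates of committed transactions in log order, then undo the rest in reverse."""
    committed = {t for (kind, t, *_rest) in log if kind == "commit"}
    for kind, txn, key, old, new in log:                 # redo pass
        if kind == "update" and txn in committed:
            data[key] = new
    for kind, txn, key, old, new in reversed(log):       # undo pass
        if kind == "update" and txn not in committed:
            if old is None:
                data.pop(key, None)                      # item did not exist before
            else:
                data[key] = old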
Logs are sequential append-only files. The relative inefficiency of a sequential structure
isn’t a major concern, given that logs are rarely used to effect recovery.
An efficient scheme for using recovery logs is via deferred updates. In this method, any
changes to the data are not written directly to stable storage. Rather, the data affected by
updates is cached in main memory; only after all changes associated with a transaction
have been applied will the data be written to stable storage (preceded, of course, by
saving the requisite log entries). In this way, access to the typically slow stable storage is
minimized and performance is improved.
Checkpoints are placed at convenient locations along the course of a computation, not
necessarily at equal intervals, but we often assume equal checkpointing intervals for the
sake of analytical tractability. Checkpointing entails some overhead, consisting of the
program instructions needed to save the state and partial results and those needed to
recover from failure by reloading a previously saved computation state.
We see from Example 21.chkpt1 that not using checkpoints may lead to a small
probability of task completion within a specified time period. On the other hand, using
too many checkpoints may be counterproductive, given the associated overhead. Thus,
there may be an optimal configuration that leads to the best expected completion time.
We will discuss optimal checkpointing in Section 21.6.
Example 21.chkpt2: Data checkpointing Consider data objects stored on a primary site and k
backup sites. With appropriate design, such a scheme will be k-malfunction-tolerant. Each data
access request is sent to the primary site, where a read request is honored immediately. A write
request triggers a chain of events consisting of the primary site sending update transactions to the
backup sites and acknowledging the original write request only after acknowledgements have been
received from all backup sites. Argue that increasing the number k of backup sites beyond a
certain point may be counterproductive in that it may lead to lower data availability.
Solution: When the primary site is up and running, data is available. Data becomes unavailable in
three system states: (1) Recovery state, in which the primary site has malfunctioned and the system
is in the process of “electing” a new primary site from among the backup sites. (2) Checkpoint
state, in which the primary site is performing data backup. (3) Idle state, in which all sites, primary
and backup, are unavailable. As the number of backups increases, the system will spend more time
in states 1 and 2 and less time in state 3, so there may be an optimal choice for the number k of
backup sites. Analysis by Huang and Jalote [Huan89], with a number of specific assumptions,
indicates that data availability goes up from the nonredundant figure of 0.922 to 0.987 for k = 1,
0.996 for k = 2, 0.997 for k = 4, beyond which there is no improvement.
The checkpointing scheme depicted in Fig. 21.chkpt1 is synchronous in the sense of all
running processes doing their checkpointing at the same time, perhaps dictated by a
central system authority. In large or loosely coupled systems, it is more likely for
processes to schedule their checkpoints independently, based on their own state of
computation and when checkpointing is most convenient. These asynchronous
checkpoints, depicted in Fig. 21.chkpt2, do not create any difficulty if the processes are
independent and non-interacting. Upon a detected malfunction, all affected processes are
notified, with each process independently rolling back to its latest checkpoint.
One way of dealing with such dependencies is to also roll back certain interacting
processes when a given process is rolled back. There is a possibility of a chain reaction
that could lead to all processes having to restart from their very beginning. In general, we
need to identify a recovery line, or a consistent set of checkpoints, whose selection would
lead to correct re-execution of all processes. This is a nontrivial problem. An alternative
approach is to create stable logs of all interprocess communications, so that a process can
consult them upon re-execution.
There is a clear tradeoff in checkpoint insertion. Too few checkpoints lead to long
rollback and waste of computational resources in the event of a malfunction. Too many
checkpoints lead to excessive time overhead. These two opposing trends are depicted in
Fig. 21.optcp-a. As in many other engineering problems, there is often a happy medium
that can be found analytically or experimentally.
Solution: The computation can be viewed as having q + 1 states corresponding to the fraction i/q
of it completed, for i = 0 to q. From each state to the next one in sequence, the transition
probability over the time step T/q is 1 – λT/q, as depicted in the discrete-time Markov chain of Fig.
21.optcp-b. By using the latter linear approximation, we have implicitly assumed that T/q << 1/λ.
We can easily derive Ttotal = T/(1 – λT/q) + (q – 1)Tcp = T + λT²/(q – λT) + (q – 1)Tcp.
Differentiating Ttotal with respect to q and equating with 0 yields qopt = T(λ + √(λ/Tcp)). For
example, if we have T = 200 hr, λ = 0.01/hr, and Tcp = 1/8 hr, we get qopt ≅ 59 and Ttotal ≅ 211 hr.
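A quick numeric check of these formulas in code form (a sketch; exact figures depend on how q is rounded to an integer and on the approximations made above):

from math import sqrt

T, lam, Tcp = 200.0, 0.01, 1/8               # total work (hr), malfunction rate (/hr), per-checkpoint cost (hr)

q_opt = T * (lam + sqrt(lam / Tcp))          # number of segments minimizing Ttotal
T_total = T + lam * T**2 / (q_opt - lam * T) + (q_opt - 1) * Tcp

print(round(q_opt))          # about 59 checkpointing segments
print(round(T_total, 1))     # expected completion time Ttotal, in hours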
Solution: The expected rollback time due to a malfunction in the time interval [0, Pcp] is found by
integrating λ(a + bx)dx over [0, Pcp], yielding Trb = λPcp(a + bPcp/2). We can choose Pcp to
minimize the relative checkpointing overhead O = (Tcp + Trb)/Pcp = Tcp/Pcp + λ(a + bPcp/2) by
equating dO/dPcp with 0. The result is Pcpopt = √(2Tcp/(λb)). Let us assume, for example, that Tcp =
16 min and λ = 0.0005/min (corresponding to an MTTF of 33.3 hr). Then, the optimal
checkpointing period is Pcpopt = 800 min = 13.3 hr. If by using a faster memory for saving the
checkpoint snapshots we can reduce Tcp to 1 min (a factor of 16 reduction), the optimal
checkpointing period goes down by a factor of 4 to become Pcpopt = 200 min = 3.3 hr.
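The same calculation in code form; the value b = 0.1 is an assumption chosen here only to reproduce the quoted numbers, since a and b come from the problem statement and are not repeated above.

from math import sqrt

def optimal_checkpoint_period(Tcp, lam, b):
    """Pcpopt = sqrt(2*Tcp/(lambda*b)), from setting dO/dPcp = 0 in the solution above."""
    return sqrt(2 * Tcp / (lam * b))

# Values from the example; b = 0.1 is an assumed rollback-cost coefficient.
print(optimal_checkpoint_period(16, 0.0005, 0.1))   # 800.0 min
print(optimal_checkpoint_period(1, 0.0005, 0.1))    # 200.0 min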
Problems
21.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
22 Degradation Management
“Most of us don't think, we just occasionally rearrange our
prejudices.”
Frank Knox
Reliable data storage requires that the availability and integrity of data not be dependent
on the health of any one site. To ensure this property, data may be replicated at different
sites, or it may be dispersed so that losing pieces of the data does not preclude its accurate
reconstruction.
As discussed earlier, data replication can place a large burden on the system, perhaps
even leading in the extreme to the nullification of its advantages. The need for updating
all replicas before proceeding with further actions is one such burden. One way around
this problem is the establishment of read and write quorums. Consider the example in
Fig. 22.integ-a, where the 9 data replicas are logically viewed as forming a 2D array. If a
read operation is defined as accessing the 3 replicas in any one column (the read quorum)
and selecting the replica with the latest time-stamp, then the system can safely proceed
after updating any 3 replicas in one row (the write quorum). Because the read and write
quorums intersect in all cases, there is never a danger of using stale data and the system is
never bogged down if one or two replicas are out of date or unavailable.
Fig. 22.integ Ensuring data integrity and availability via replication and
dispersion.
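A sketch of the row-write/column-read rule of Fig. 22.integ-a follows; the 3 × 3 layout and timestamps mirror the figure, while the data representation and the simple counter used as a clock are illustrative assumptions.

# Grid quorums for 9 replicas arranged as a 3x3 array: a write updates all 3 replicas
# in one row, a read fetches all 3 replicas in one column and returns the value with
# the latest timestamp. Every row intersects every column, so a read always sees the
# most recent committed write.
replicas = [[(0, None) for _ in range(3)] for _ in range(3)]   # (timestamp, value)
clock = 0

def write(row, value):
    global clock
    clock += 1
    for col in range(3):                                 # write quorum: one full row
        replicas[row][col] = (clock, value)

def read(col):
    column = [replicas[row][col] for row in range(3)]    # read quorum: one full column
    return max(column)[1]                                # value carrying the latest timestamp

write(1, "v1")
write(2, "v2")
print(read(0))   # "v2" -- the read quorum intersects the latest write quorum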
A similar result can be achieved via the data dispersion scheme of Fig. 22.integ-b, where
a piece of data is divided into 6 pieces, which are then encoded in such a way that any
two of the encoded pieces suffice for reconstructing the original data. Such an encoding
requires 3-fold redundancy and is thus comparable to 3-way replication in terms of
storage overhead. Now, if we define read and write quorums to comprise any 4 of the 6
encoded pieces, gaining access to any 4 pieces would be sufficient for reconstructing the
data, because the 4 pieces are bound to have at least 2 pieces that are up to date (have the
latest time stamp). This scheme too eases the burden of updating the data copies by not
requiring that every single piece be kept up to date at all times.
Consider the following puzzle known as “the two generals problem.” The setting is as
follows. Two generals lead the two divisions of an army camped on the mountains on
each side of an enemy-occupied valley. The two army divisions can communicate only
via messengers. Messengers, who are loyal and highly reliable, may need an arbitrary
amount of time to cross the valley and in fact may never arrive due to being captured by
the enemy forces.
We need a scheme for the two generals G1 and G2 to agree on a common attack time,
given that attack by only one division would be disastrous. Here is a possible scenario.
General G1 decides on time T and sends a messenger to inform G2. Because G1 will not
attack unless he is certain that G2 has received the message about his proposed attack
time, G2 sends an acknowledgment to G1. Now, G2 will have to make sure that G1 has
received his acknowledgment, because he knows that G1 will not attack without it. So, G1
must acknowledge G2’s acknowledgment. This can go on forever, without either general
ever being sure that the other general will attack at time T.
The situation above is akin to what happens at a bank’s computer system when you try to
make cash withdrawal from an ATM. The ATM should not dispense cash unless it is
certain that the bank’s central computer checks the account balance and adjusts it after
the withdrawal. On the other hand, you will not like it if your account balance is reduced
without any cash being dispensed. So, the two sides, the ATM and the bank’s database,
must act in concert, either both completing the transaction or both abandoning it. Thus,
the withdrawal transaction, or electronic funds transfer between two accounts, must be an
atomic, all-or-none action.
Acquire (read or write) lock for a desired data object and operation
Perform operation while holding the lock
Release lock
One must take care to avoid deadlocks arising from circular waiting for locks.
To avoid participants being stranded in the “wait” state (e.g., when the coordinator
malfunctions), a time-out scheme may be implemented.
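For concreteness, here is a minimal sketch of the basic two-phase exchange between a coordinator and its participants; the participant class and its method names are illustrative assumptions, and the time-out handling just mentioned is omitted.

# A minimal two-phase-commit sketch: the coordinator collects votes in phase 1
# and broadcasts the decision in phase 2. Participant behavior is simplified.
def two_phase_commit(participants):
    # Phase 1 (voting): every participant must vote "yes" for the transaction to commit
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): the same decision is delivered to all participants
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit=True): self.can_commit = can_commit
    def prepare(self): return self.can_commit       # write a "ready" record, then vote
    def commit(self): pass                          # make the transaction's effects durable
    def abort(self): pass                           # undo the transaction's partial effects

print(two_phase_commit([Participant(), Participant(can_commit=False)]))   # "abort"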
To deal with the shortcomings of the two-phase commit, a three-phase commit protocol
may be devised. As shown in Fig. 22.3pcp, an extra “prepare” state is inserted between
the “wait” and “commit” states of two-phase commit. This protocol is safe from
blocking, given the absence of a local state that is adjacent to both a “commit” state and
an “abort” state.
Atomic broadcasting entails not only reliable message delivery but also requires that
multiple broadcasts be received in the same order by all nodes. If we use the scheme
outlined in the preceding paragraph, in-order delivery of messages will not be guaranteed,
so atomic broadcasting is much more complicated.
Many distributed systems, built from COTS nodes (processors plus memory) and
standard interconnects, contain sufficient resource redundancy to allow the
implementation of software-based malfunction tolerance schemes. Interconnect
malfunctions are dealt with by synthesizing reliable primitives for point-to-point and
collective communication (broadcast, multicast, and so on), as discussed in Section 22.3.
Node malfunctions are modeled differently, as illustrated in Fig. 22.malfns, with possible
models ranging from benign crash malfunctions, which are fairly easy to deal with, to
arbitrary or Byzantine malfunctions, which require greater care in protocol development
and much higher redundancy.
A potentially helpful resource in managing a group of cooperating nodes, that are subject
to malfunctions, is a reliable group membership service. The group’s membership may
expand and contract owing to changing processing requirements or because of
malfunctions and repairs. A reliable group membership service maintains up-to-date
status information and thus supports a reliable multicast, via which a message sent by one
group member is guaranteed to be received by all other members.
A perfect malfunction detector, having strong completeness and strong accuracy, is the
minimum required for interactive consistency. Strong completeness, along with eventual
weak accuracy, constitutes the minimum requirement for consensus [Rayn05]. [Elaboration to
be added.]
When the configuration of a system changes due to the detection and removal of
malfunctioning units, division of labor in ongoing computations must be reconsidered.
This can be done via a remapping scheme that determines the new division of
responsibilities among participating modules, or via load balancing (basic computational
assignments do not change, but the loads of the removed modules are distributed among
other modules). Load balancing is used not just to accommodate lost or recovered
resources due to malfunctions and repairs, but also to optimize system performance in the
face of changing computational requirements.
Even in the absence of a detected malfunction and the attendant system reconfiguration,
remapping of a computation to have its various pieces executed by different modules may
be useful for exposing hidden malfunctions. This is because the effects of a
malfunctioning module will likely be different on diverse computations, making it highly
unlikely to get the same final results for the original and remapped computation.
Let us consider a remapping example for a computation that runs on a 5-cell linear array.
By adding a redundant 6th cell at the end of the array (Fig. 22.remap), we can arrange for
the computation to be performed in two different ways: one starting in cell 1 and another
starting in cell 2 [Kwai97]. Each cell j + 1 can be instructed to compare the result of step j
in the computation that it received from the cell to its left to the result of step j that it
obtains within the second computation. A natural extension of this scheme is to provide 2
extra cells in the array and to perform three versions of the computation, with cell j + 2
voting on the three results obtained by cell j in the first computation, cell j + 1 in the
second version, and cell j itself in the third version.
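The sketch below illustrates the pairwise comparison for the single-shift case; the per-step function and array size are hypothetical stand-ins for the actual cell computation.

# Illustrative sketch of the shifted-remapping check described above: the same 5-step
# computation is run twice on a 6-cell array, once starting in cell 1 and once in cell 2,
# and cell j + 1 compares the step-j results of the two versions.
def f(step, x):
    return x + step          # hypothetical cell computation at step 'step'

def run(start_cell, steps=5, x0=0):
    """Return {cell: result} for a computation that begins in 'start_cell'."""
    results, x = {}, x0
    for j in range(steps):
        x = f(j, x)
        results[start_cell + j] = x      # step j is performed by cell start_cell + j
    return results

r1, r2 = run(start_cell=1), run(start_cell=2)
# Cell c holds the step result of the second run and receives the corresponding result
# of the first run from its left neighbor (cell c - 1); any disagreement flags a malfunction.
mismatches = [c for c in r2 if r1.get(c - 1) != r2[c]]
print(mismatches)   # [] when both versions agree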
A gracefully degradable system typically has one ideal or intact state, multiple degraded
states, and at least one failure state, as depicted in Fig. 22.degsys. To ensure that the
system degrades rather than fails or comes to a halt, we have to reduce the probability of
malfunctions going undetected, increase the accuracy of malfunction diagnosis, make
repair rates much greater than malfunction rates (typically, by keeping hot-pluggable
spares), and provide a sufficient safety factor in computational capacity.
Problems
22.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
22.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
23 Resilient Algorithms
“Perfection is achieved, not when there is nothing more to add,
but when there is nothing left to take away.”
Antoine de Saint Exupery
Many of the hardware and software redundancy methods assume that we are building the
entire system (or a significant part of it) from scratch. Many users of highly reliable
systems, however, do not have the capability to develop such systems and thus have one
of two options.
The first option is to buy dependable systems from vendors that specialize in such
systems and associated support services. Here is a partial list of companies which have
offered, or are now offering, fault-tolerant systems and related services:
An early experiment with the latter approach was performed in the 1970s, when Stanford
University built one of two “concept systems” for fly-by-wire aircraft control, using
mostly COTS components. The resulting multiprocessor, named SIFT (software-
implemented fault tolerance), was meant to introduce a fault tolerance scheme that
contrasted with the hardware-intensive approach of MIT’s FTMP (fault-tolerant
multiprocessor). The Stanford and MIT design teams strove to achieve a system failure
rate goal of 10^–9 per hour over a 10-hour flight, which is typical of avionics safety
requirements. Some fundamental results on, and methods for, clock synchronization
emerged from the SIFT project. To prevent errors from propagating in SIFT, processors
obtained multiple copies of data from different memories over different buses (local
voting).
The COTS approach to fault tolerance has some inherent limitations. Some modern
microprocessors have dependability features built in: they may use parity and other codes
in memory, TLB, and microcode store; they may take advantage of retry features at
various levels, from bus transmissions to full instructions; and they may provide
machine check facilities and registers to hold the check results. According to Avizienis,
however, these features are often not documented enough to allow users to build on them,
the protection provided is nonsystematic and uneven, recovery options may be limited to
shutdown and restart, description of error handling is scattered among a lot of other
detail, and there is no top-down view of the features and their interrelationships [Aviz97].
Manufacturers can incorporate both more advanced and new features, and at times have
experimented with a number of mechanisms, but until recently, the low volume of the
application base hindered commercial viability.
Stored and transmitted data can be protected against unwanted changes through encoding,
but coding does not protect the structure of the data. Consider, for example, an ordered
list of numbers. Individual numbers can be protected by encoding and the set of values
can be protected by a checksum; the ordering of data, however, remains unprotected with
either scheme. Some protection against an inadvertent change in ordering can be
provided by a weighted checksum. Another idea is to provide within the array a field that
records the difference between each element and the one that follows it. A natural
question at this point is whether we can devise general schemes for protecting data
structures of common interest.
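To make the preceding discussion concrete, the following Python sketch illustrates one hedged possibility: an ordinary checksum that protects the set of values, a weighted checksum that additionally reflects element order, and a difference field stored alongside each element. The function and variable names are invented for this illustration and do not come from any standard library.

# Sketch: protecting both the values and the ordering of a list.
# The weights (1, 2, 3, ...) make the checksum order-sensitive.

def plain_checksum(values, modulus=2**16):
    """Protects the multiset of values, but not their order."""
    return sum(values) % modulus

def weighted_checksum(values, modulus=2**16):
    """Order-sensitive: swapping two unequal elements changes the sum."""
    return sum(i * v for i, v in enumerate(values, start=1)) % modulus

def with_difference_fields(values):
    """Store, with each element, its difference from the next element."""
    return [(v, nxt - v) for v, nxt in zip(values, values[1:])] + [(values[-1], None)]

data = [3, 8, 15, 15, 42]
c1, c2 = plain_checksum(data), weighted_checksum(data)

swapped = [3, 15, 8, 15, 42]             # ordering error, same set of values
assert plain_checksum(swapped) == c1     # goes undetected by the plain checksum
assert weighted_checksum(swapped) != c2  # caught by the weighted checksum
print(with_difference_fields(data)[:2])  # [(3, 5), (8, 7)]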
Robust data structures provide fairly good protection with little design effort or run-time
overhead. Robustness features that protect the structure can be combined with coding
methods (such as checksums) that protect the content. Other robust data structures of
interest include trees, FIFOs, stacks or LIFOs, heaps, and queues. In general, a linked
data structure is 2-detectable and 1-correctable iff the link network is 2-connected.
Binary trees have weak connectivity and are thus quite vulnerable to corrupted or missing
links. One way to strengthen the connectivity of trees is to add parent links and/or threads
(links that connect a node to higher-level nodes). An example of a thread link is shown in
Fig. 23.rtree. Threads can be added with little overhead by taking advantage of unused
leaf links (one bit in every node can be used to identify leaves, thus freeing their link
fields for other uses).
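As a concrete illustration of how added links make a structure checkable, here is a hedged Python sketch of a circular doubly linked list augmented with a node count; the class and method names are invented for this example. Because each node is reachable through both forward and backward pointers, a single corrupted link can be detected (and, with a little more work, repaired) by cross-checking the two directions against the stored count.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # forward link
        self.prev = None   # backward link (redundant structural information)

class RobustList:
    """Circular doubly linked list with a stored node count."""
    def __init__(self, values):
        self.count = len(values)
        nodes = [Node(v) for v in values]
        for a, b in zip(nodes, nodes[1:] + nodes[:1]):
            a.next, b.prev = b, a
        self.head = nodes[0]

    def check(self):
        """Return True if forward links, backward links, and the count all agree."""
        node, seen = self.head, 0
        while seen < self.count:
            if node.next is None or node.next.prev is not node:
                return False          # forward/backward links disagree: corruption detected
            node, seen = node.next, seen + 1
        return node is self.head      # traversal must close the circle in exactly count steps

lst = RobustList([10, 20, 30, 40])
assert lst.check()
lst.head.next.next = lst.head         # simulate a corrupted forward link
assert not lst.check()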
It is sometimes possible to design algorithms and associated data structures so that the
computation becomes resilient to both representational and computational errors. A prime
example is provided by a method known as algorithm-based malfunction tolerance,
which is more widely known in the literature by the acronym ABFT (algorithm-based
fault tolerance).
Consider the 3 × 3 matrix M shown in Fig. 23.abmt1. Adding modulo-8 row checksums
and column checksums results in the row-checksum matrix Mr and the column-checksum
matrix Mc, respectively. Including both sets of checksums, with the lower-right matrix
element set to the checksum of the row checksums or of the column checksums (it can be
shown that the result is the same either way), leads to the full-checksum matrix Mf, a
representation of M that allows the correction of any single error in the matrix elements
and the detection of up to 3 errors; some patterns of 4 errors, such as in the 4 elements
enclosed in the dashed box in Fig. 23.abmt1, may go undetected.
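The following Python sketch, with invented function names, shows how the full-checksum representation just described might be formed and then used to locate and correct a single erroneous element; it assumes modulo-8 checksums, as in Fig. 23.abmt1, and a small example matrix chosen for this illustration.

# Sketch of algorithm-based malfunction tolerance (ABFT) for a small matrix,
# using modulo-8 row and column checksums.

M = 8  # checksum modulus (an assumption matching the text's example)

def full_checksum(A):
    """Append a checksum column and a checksum row (including the corner element)."""
    with_rows = [row + [sum(row) % M] for row in A]
    check_row = [sum(col) % M for col in zip(*with_rows)]
    return with_rows + [check_row]

def locate_single_error(Af):
    """Return (i, j) of a single corrupted element, or None if all checksums agree."""
    bad_rows = [i for i, row in enumerate(Af) if sum(row[:-1]) % M != row[-1]]
    bad_cols = [j for j, col in enumerate(zip(*Af)) if sum(col[:-1]) % M != col[-1]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

A = [[2, 6, 7], [1, 3, 4], [5, 0, 2]]
Af = full_checksum(A)
Af[1][2] = (Af[1][2] + 3) % M                             # inject a single error
i, j = locate_single_error(Af)
Af[i][j] = (Af[i][-1] - sum(Af[i][:-1]) + Af[i][j]) % M   # restore from the row checksum
assert Af == full_checksum(A)                             # the matrix is fully repaired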
Problems
23.x Title
Problem intro
a. xxx
b. xxx
c. xxx
d. xxx
24 Software Redundancy
“Those parts of the system that you can hit with a hammer (not
advised) are called hardware; those program instructions that
you can only curse at are called software.”
Anonymous
For a steam iron: There is no guarantee, explicit or implied, that this device will remove
wrinkles from clothing or that it will not lead to the user’s electrocution. The manufacturer is
not liable for any bodily harm or property damage resulting from the operation of this device.
For an electric toaster: The name “toaster” for this product is just a symbolic identifier.
There is no guarantee, explicit or implied, that the device will prepare toast. Bread slices
inserted in the product may be burnt from time to time, triggering smoke detectors or causing
fires. By opening the package, the user acknowledges that s/he is willing to assume sole
responsibility for any damages resulting from the product’s operation.
You may hesitate before buying such a steam iron or toaster, yet this is how we purchase
commodity software. Software producers and marketers, far from postulating dependable
operation, do not even promise correct functional behavior! The situation is only slightly
better for custom software, produced to exacting functional and reliability specifications.
Beginning with unit test (see Fig. 24.sdlc), major structural and logical problems
remaining in a piece of software are removed early on. What remains after extensive
verification and validation is a collection of tiny flaws which surface under rare
conditions or particular combinations of circumstances, thus giving software failure a
statistical nature. Software usually contains one or more flaws per thousand lines of code,
with < 1 flaw per thousand lines considered good (Linux has been estimated to have about
0.1). If there are f flaws in a software component, the hazard rate, that is, the rate of
failure occurrence per hour, is kf, with k being a constant of proportionality that is
determined experimentally (e.g., k = 0.0001). Software reliability is then modeled by the
exponential law R(t) = e^(–kft).
According to this model, the only way to improve software reliability is to reduce the
number of residual flaws through more rigorous verification and/or testing.
Given extensive testing, the residual flaws in software are by nature difficult to detect.
They manifest themselves primarily for unusual combinations of inputs and program
states (the so-called “corner cases”), schematically represented in Fig. 24.swflaw. Light
shading is used to denote the parts of input/state space for which the software is free from
flaws. Unshaded regions represent input/state combinations that are known to be
impossible, thus making them irrelevant to proper functioning of the software. Dark
shading is used for trouble spots, which have been missed during testing. Occasionally,
during the use of a released piece of software, a user’s operating conditions will hit one
of these trouble spots, thus exposing the associated flaw. Once a flaw has been exposed,
it is dealt with through a new release of the software (particularly if the flaw is deemed
important in the sense of its potential to affect many other users) or through a
software patch.
For a while, there was some resistance to the idea of treating software malfunctions in a
probabilistic fashion, much like what we do with hardware-caused malfunctions. The
argument went that software flaws, which are due to design errors, either exist or do not
exist and that they do not emerge probabilistically. However, because software flaws are
often exposed by rare combinations of inputs and internal states, as discussed in the
preceding paragraph, it does make sense to assume that there is a certain probability
distribution, derivable from input distributions, for a software malfunction to occur.
The idea of using software redundancy to counteract uncertainties in design quality and
residual bugs is a natural one. However, it is not clear what form the redundancy should
take: it is certainly not helpful to replicate the same piece of software, with identical
internal flaws/bugs. We will tackle this topic in the last three sections of this chapter.
A software flaw or bug can lead to operational error for certain combinations of inputs
and system states, causing a software-induced failure. Informally, the term “software
failure” is used to denote any software-related dependability problem. Flaw removal can
be modeled in various ways, two of which are depicted in Fig. 24.flawr. When removing
existing flaws does not introduce any new flaws, we have the optimistic model of Fig.
24.flawr-a. Flaw removal is quick in early stages, but as more flaws are removed, it
becomes more difficult to pinpoint the remaining ones, leading to a reduction in flaw
removal rate. The more realistic model of Fig. 24.flawr-b assumes that the number of
new flaws introduced is proportional to the removal rate. The following example is based
on the simpler model of Fig. 24.flawr-a.
Example 24.flaw: Modeling the software flaw removal process Assume that no new flaws are
introduced as we remove existing flaws in a piece of software estimated to have F0 = 130 flaws
initially and that the flaw removal rate linearly decreases with time. Model the reliability of this
software as a function of time.
Solution: Let F be the number of residual flaws and τ be the testing time in months. From the
problem statement, we can write dF(τ)/dτ = –(a – bτ), leading to F(τ) = F0 – aτ(1 – bτ/(2a)). The
hazard function is then z(τ) = kF(τ), where k is a constant of proportionality, and R(t) = e^(–kF(τ)t).
Taking k to be 0.000132, we find R(t) = exp(–0.000132(130 – 30τ(1 – τ/16))t). If testing is done
for τ = 8 months, the reliability equation becomes e^(–0.00132t), which corresponds to an MTTF of 758
hours. Note that no testing would have resulted in an MTTF of 58 hours and that testing for 2, 4,
and 6 months would have led to MTTFs of 98, 189, and 433 hours, respectively.
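A short Python sketch of the computation in Example 24.flaw follows; it simply evaluates the residual-flaw count F(τ) and the resulting MTTF = 1/(kF(τ)) for a few testing durations, under the parameter values assumed above (F0 = 130, a = 30, b = 3.75, k = 0.000132).

# Evaluate the flaw-removal model of Example 24.flaw.
F0, a, b, k = 130, 30, 3.75, 0.000132   # parameter values assumed in the example

def residual_flaws(tau):
    """F(tau) = F0 - a*tau + b*tau**2/2, valid while flaw removal is still in progress."""
    return F0 - a * tau + b * tau**2 / 2

def mttf_hours(tau):
    """With hazard rate z = k*F(tau), the exponential model gives MTTF = 1/(k*F(tau))."""
    return 1 / (k * residual_flaws(tau))

for months in [0, 2, 4, 6, 8]:
    print(f"testing {months} months: F = {residual_flaws(months):5.1f}, "
          f"MTTF = {mttf_hours(months):6.0f} hours")
# Expected output (approximately): 58, 98, 189, 433, and 758 hours.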
Linearly decreasing flaw removal rate isn’t the only option in modeling. Constant flaw
removal rate has also been considered, but it does not lead to a very realistic model.
Exponentially decreasing flaw removal rate is more realistic than linearly decreasing,
since flaw removal rate never really becomes 0. Model constants can be estimated via:
The primary reasons for software aging include accumulation of junk in the state part of
the system (which is reversible via restoration) and long-term cumulative effects of
updates via patching and the like. As the software’s structure deviates from its original
clean form, unexpected failures begin to occur. Eventually software becomes so mangled
that it must be discarded and redeveloped from scratch.
The recovery block method may be viewed as the software counterpart to standby sparing
for hardware. Suppose we can verify the results obtained by a software module by
subjecting them to an acceptance test. For now, let us assume that the acceptance test is
perfect in the sense of not missing any erroneous results and not producing any false
positives. Implications of imperfect acceptance tests will be discussed later. With these
assumptions, one can organize a number of different software modules all performing the
same computation in the form of a recovery block.
The comments next to the pseudocode lines in the program structure 24.5.rb provide an
example in which the task to be performed is that of sorting a list. The primary module
uses the quicksort algorithm, which has very good average-case running time but is rather
complicated in terms of programming, and thus prone to residual software bugs. The first
alternate uses the bubblesort algorithm, which is not as fast, but much easier to write and
debug. The longer running time of bubblesort may not be problematic, given that we
expect the alternates to be executed rarely. As we go down the list of alternates, even
simpler but perhaps lower-performing algorithms may be utilized. In this way, diversity
can be provided among the alternates, while also reducing the development cost relative
to N-version programming. Design diversity makes it more likely for one of the alternate
modules to succeed when the primary module fails to produce an acceptable result.
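The following Python sketch, with invented names, shows one hedged way the recovery block just described might be organized for the sorting task: a quicksort primary, a simpler alternate, and an acceptance test that checks both the ordering of the output and the preservation of the input values.

# Sketch of a recovery block for sorting: primary, alternate, acceptance test.
from collections import Counter

def acceptance_test(inp, out):
    """Accept if the output is nondescending and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return ordered and Counter(out) == Counter(inp)

def quicksort(lst):                      # primary module (faster, but more complex)
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    return quicksort([x for x in lst[1:] if x < pivot]) + [pivot] + \
           quicksort([x for x in lst[1:] if x >= pivot])

def bubblesort(lst):                     # first alternate (slower, but simpler to get right)
    out = list(lst)
    for i in range(len(out)):
        for j in range(len(out) - 1 - i):
            if out[j] > out[j + 1]:
                out[j], out[j + 1] = out[j + 1], out[j]
    return out

def recovery_block(inp, modules=(quicksort, bubblesort)):
    """Try each module in turn; return the first result that passes the acceptance test."""
    for module in modules:
        result = module(list(inp))       # each module starts from the saved input state
        if acceptance_test(inp, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

print(recovery_block([5, 3, 8, 1, 3]))   # [1, 3, 3, 5, 8]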
The acceptance test for our sorting example can take the form of a linear scan of the
output list to verify that its elements are in nondescending or nonascending order,
depending on the direction of sorting. Such an acceptance test will detect an improperly
sorted list, but may not catch the problem when the output list does not consist of exactly
the same set of values as the input list. We can of course make the acceptance test as
comprehensive as desired, but a price is paid in both software development effort and
running time, given that the acceptance test is on the critical path.
In general, the acceptance test can range from a simple reasonableness check to a
sophisticated and thorough validation effort. Note that performing the computation a
second time and comparing the two sets of results can be viewed as a form of acceptance
testing, in which the acceptance test module is of the same order of complexity as the
main computational module. Computations that have simple inverses lend themselves to
efficient acceptance testing. For example, the results of root finding for a polynomial can
be readily verified by polynomial evaluation.
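As a small illustration of acceptance testing via an easily computed inverse, the following hedged Python fragment checks candidate roots of a polynomial by evaluating the polynomial at each root; the function names and the tolerance value are assumptions made for this example.

# Acceptance test via an inverse computation: verify roots by polynomial evaluation.

def poly_eval(coeffs, x):
    """Evaluate a polynomial given as [a_n, ..., a_1, a_0] using Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def roots_acceptable(coeffs, roots, tol=1e-6):
    """Accept the root-finder's output if every claimed root nearly zeroes the polynomial."""
    return all(abs(poly_eval(coeffs, r)) < tol for r in roots)

p = [1.0, -3.0, 2.0]                         # x**2 - 3x + 2 = (x - 1)(x - 2)
assert roots_acceptable(p, [1.0, 2.0])
assert not roots_acceptable(p, [1.0, 2.5])   # an erroneous result is rejected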
We now present the elements of a general notation that facilitates the study and synthesis
of other software redundancy schemes.
Problems
“Engineers have learned so well from failures that a major failure
today is big news. We no longer expect bridges to collapse or
buildings to fall in or spacecraft to explode. Such disasters are
perceived as anomalous, . . . We grieve the lost lives, we search
among the designers for the guilty. Yet these disasters serve the
same function as the failures of an earlier era. Failures remain
the engineer's best teacher, his best laboratory.”
Henry Petroski
25 Failure Confinement
“I always turn to the sports page first, which records people’s
accomplishments. The front page has nothing but man’s
failures.”
Earl Warren
The provision of manual back-up and bypass capability is a good idea, even for systems
that are not safety-critical. On Friday, November 30, 1996 (Thanksgiving weekend), the
US railroad company Amtrak lost ticketing capability due to a communication system
disruption. Unfortunately for the company, station personnel had no up-to-date fare
information as back-up, and were thus unable to issue tickets manually. This lack of
foresight led to major customer inconvenience and loss of revenue for Amtrak. One
must note, however, that in certain cases, such as e-commerce Web sites, manual back-up
systems may be impractical.
Fig. 25.1 Computer failure may not lead to system failure or disaster.
The first step in proper handling of computer system failures is being aware of their
inevitability and, preferably, likelihood. Poring over failure data that are available from
experience reports and repositories is helpful to both computer designers and users.
System designers can get a sense of where the dependability efforts should be directed
and how to avoid common mistakes, while users can become empowered to face failure
events and to put in place appropriate recovery plans.
Unfortunately, much of the available failure data is incomplete and, at times,
misleading. Collecting experimental failure data isn’t easy. Widespread experiments are
impossible for one-of-a-kind or limited-issue systems, and performing them under
reasonably uniform conditions is quite a challenge for mass-produced systems. There is
also the embarrassment factor: system operators may be reluctant to report failures that
put their technical and administrative skills in doubt, and vendors, especially those who
boast about the reliability of their systems, may be financially motivated to hide or
obscure failure events. Once a failure event is logged, assigning a cause to it is nontrivial.
These caveats notwithstanding, whatever data are available should be used rather than
ignored, perhaps countering any potential bias by drawing information from multiple
independent sources.
Software failure data is available from the following two sources, among others. The
Promise Software Engineering Repository [PSER12] contains a collection of publicly
available datasets and tools to help researchers who build predictive software models and
the software engineering community at large. The failure data at the Software Forensics
Center [SFC12] is the largest of its kind in the world and includes specific details about
hundreds of projects, with links to thousands of cases.
Let us consider the application of failure data to assessing and validating reliability
models through an example.
Example 25.valid: Validating reliability models Consider the reliability state model shown in
Fig. 25.disk for mirrored disk pairs, where state i corresponds to i disks being healthy.
a. Solve the model and derive the disk pair’s MTTF, given a disk MTTF of 50 000 hr per
manufacturer’s claim and an estimated MTTR of 5 hr.
b. In 48 000 years of observation (2 years × 6000 systems × 4 disk pairs / system), 35 double disk
failures were logged. Do the observation results confirm the model of part a? Discuss.
Solution: From the MTTF and MTTR values, we find λ = 2 × 10^–5/hr and μ = 0.2/hr.
a. The model of Fig. 25.disk is readily solved to provide an effective disk pair failure rate of about
2λ^2/μ or an approximate disk pair MTTF of μ/(2λ^2) = 15,811 yr.
b. The observation data suggests a disk pair MTTF of 48 000/35 ≈ 1371 yr. The observed MTTF is
more than 11 times worse than the modeled value. The discrepancy may be attributed to one or
more of the following factors: exaggerated disk MTTF claim on the part of the manufacturer,
underestimation of repair time, imperfect coverage in recovering from the failure of a single disk
in the pair. The latter factor can be accounted for by including a transition from state 2 to state 0,
with an appropriate rate, in Fig. 25.disk.
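To experiment with the kind of model refinement mentioned at the end of the example, the following Python sketch computes the expected time to absorption (the pair MTTF) for a small three-state chain with an added coverage parameter c. The transition structure and the coverage values are assumptions made for illustration only, so the numerical output is not meant to reproduce the figures quoted in the example.

# Expected time to absorption (MTTF) for a mirrored-disk model with imperfect coverage.
lam = 2e-5        # per-disk failure rate, 1/hr (from a 50 000-hr disk MTTF)
mu = 0.2          # repair rate, 1/hr (from a 5-hr MTTR)
HOURS_PER_YEAR = 8760

def pair_mttf(c):
    """MTTF from state 2; c is the probability that a single-disk failure is covered."""
    # T2 = 1/(2*lam) + c*T1              (uncovered failures go straight to state 0)
    # T1 = 1/(lam+mu) + (mu/(lam+mu))*T2
    # Solving the two linear equations for T2:
    return (1 / (2 * lam) + c / (lam + mu)) / (1 - c * mu / (lam + mu))

for c in [1.0, 0.999, 0.99]:
    print(f"coverage {c}: pair MTTF = {pair_mttf(c) / HOURS_PER_YEAR:,.0f} years")

Even a small departure of c from 1 cuts the modeled MTTF drastically, which is one way the gap between a model and field data, as in part b above, can arise.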
Given its importance in our discussion here, let us reproduce equation (2.5.Risk1),
risk = frequency × magnitude, and its alternate form, equation (2.5.Risk2), giving them
new numbers for ready reference in this chapter.
Now, consider the following thought experiment: an attempt to establish how much your
life is worth to you.
You have a 1/10 000 chance of dying today. If it were possible to buy out
the risk, how much would you be willing to pay? Assume that you are not
limited by current assets, that is, you can use future earnings too.
An answer of $1000 (risk) combined with the frequency 10^–4 leads to a magnitude of
$10M, which is the implicit worth you assign to your life. Assigning monetary values to
lives, repugnant as it seems, happens all the time in risk assessment. The higher salaries
demanded by workers in certain dangerous lines of work and our willingness to pay the
cost of smoke detectors are examples of trade-offs involving the exchange of money for
endangering or protecting human lives.
In an eye-opening book [Tale07], author Nassim Nicholas Taleb discusses rare events
and how humans are ill-equipped for judging and comparing their probabilities. In a
subsequent book [Tale12], which also has important implications for the design of resilient
systems, Taleb discusses systems that not only survive disorder and harsh conditions, but
thrive and improve in such environments.
There are age-old design principles that engineers should be aware of in order to produce
reliable systems. These principles apply to any design, but they are particularly important
in developing highly complex hardware and software systems.
Many system failures would not occur if engineers were aware of their ethical
responsibilities toward the customers and the society at large. Modern engineering
curricula include some formal training in how to deal with ethical quandaries. This
training is often provided via a required or recommended course that reviews general
principles of ethics, discusses them in the context of engineering practice, and provides a
number of case studies in which students consider the impact of various career and design
decisions. In the author’s opinion, just as dependability should not be taught in a course
separate from those that deal with engineering design, so too discussion of ethics must be
integrated in all courses within an engineering curriculum.
All professional engineering societies have codes of ethics that outline the principles of
ethical behavior in the respective professions. For example, the IEEE Code of Ethics
[IEEE19] compels us engineers to follow rules of ethics in general (be fair, reject bribery,
avoid conflicts of interest) and in technical activities:
IEEE has a separate “Code of Conduct” [IEEE14] that spells out in greater detail
guidelines for the responsible practice of engineering.
Problems
25.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
26 Failure Recovery
“The first rule of holes: when you’re in one, stop digging.”
Molly Ivins
Planning to deal with computer system failures has a great deal in common with
preparations undertaken in anticipation of natural disasters. Since computers are
often components in larger control, corporate, or societal systems, interaction with
the users and environment must also be factored in. Recovery from computer
failures is made possible by systems and procedures for backing up data and
programs and for alternate facilities to run applications in case of complete system
shut-down due to a catastrophic failure, a natural disaster, or a malicious attack.
Once a failure has occurred, investigations may be conducted to establish the
cause, assign responsibility, and catalog the event for educational purposes and as
part of failure logs to help future designers.
Just as an organization might hold fire drills to familiarize its personnel with the
procedures to be followed in the event of a real fire, so too it must plan for dealing with,
and recovering from, computer system failures. Whether an anticipated failure has mild
consequences or leads to a disaster, the corresponding recovery procedures must be
properly documented and be part of the personnel training programs.
Recovery from a failure can be expressed in the same manner as the recovery block
scheme in the program structure 24.5.rb, with the manual or emergency procedure being
considered the last alternate and human judgment forming part of the acceptance test.
When the failure is judged to be a result of transient environmental conditions, the same
alternate may be executed multiple times, before moving on to the next one, including the
final initiation of manual recovery.
Many terms have been used to describe the process of recovery from computer system
failures. First, systems that are capable of working with diminished resources are referred
to as fail-slow or fail-soft. These terms imply that the system is resilient and won’t be
brought down when some resources become unavailable. At the opposite extreme, we
have fail-fast and fail-hard systems that are purposely designed to fail quickly and in a
physically obvious manner, following the philosophy that subtle failures that may go
unnoticed may be more dangerous than overt ones, such as total system shut-down or
crash. Extended failure detection latency is undesirable not only owing to potentially
errant system behavior, but also because the occurrence of subsequent unrelated failures
may overwhelm the system’s defenses. Along the same lines, a fail-stop system comes to
a complete stop rather than behave erratically upon failure. The latter category of systems
may be viewed as a special case of fail-safe systems, where the notion of safe behavior
generalizes that of halting all actions.
Alongside the terms reviewed in the preceding paragraph, we use fail-over to indicate a
situation when failure of a unit (such as a Web server) is overcome by another unit taking
over its workload. This is easily done when the failed unit does not carry much state. If a
video is being streamed by a server that fails, the server taking over needs to know only
the file name, the recipient’s address, and the current point of streaming within the
video. Fail-over software is available for Web servers as part of firewalls for most
popular operating systems. The term fail-back is used to refer to the failed system
returning to service, either as the primary unit or as a warm standby. Finally, the term
fail-forward, derived from the notion of forward error correction (the ability of error-
correcting codes to let the computation go forward after the occurrence of an error, as
opposed to error-detecting schemes leading to backward error correction via rollback to
a previous checkpoint), is sometimes, though rarely, used.
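The following Python sketch, with invented class and method names, illustrates the kind of minimal-state hand-off described above for a streaming server: because the session state is just (file name, client address, offset), a backup can resume service where the primary left off.

# Sketch of fail-over for a nearly stateless streaming service.
from dataclasses import dataclass

@dataclass
class StreamSession:
    file_name: str      # what is being streamed
    client_addr: str    # where it is going
    offset: int         # current point of streaming within the video

class StreamServer:
    def __init__(self, name):
        self.name, self.alive, self.sessions = name, True, {}

    def serve_chunk(self, session_id, chunk_size=4096):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.sessions[session_id].offset += chunk_size
        return self.sessions[session_id].offset

def fail_over(primary, backup, session_id):
    """Move a session to the backup server, preserving its small amount of state."""
    backup.sessions[session_id] = primary.sessions[session_id]
    return backup

primary, backup = StreamServer("primary"), StreamServer("backup")
primary.sessions["s1"] = StreamSession("movie.mp4", "203.0.113.7:9000", offset=0)
primary.serve_chunk("s1")
primary.alive = False                            # primary fails
server = fail_over(primary, backup, "s1")        # backup takes over the session
print(server.serve_chunk("s1"))                  # streaming continues: offset 4096 -> 8192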
Key elements in ensuring proper human reaction to failure events are the believability
and helpfulness of failure warnings.
A system’s human interface not only affects its usability but also contributes heavily to
its reliability, availability, and safety. A properly designed and tuned interface can help
prevent many human errors that form one of the main sources of system unreliability and
risk. The popular saying “The interface is the system” is in fact quite true. Here is a
common way of evaluating a user interface. A group of 3-5 usability experts and/or
nonexperts judges the interface based on a set of specific criteria. Here are some criteria,
which would be used to judge most interfaces [Gree09a]:
Visibility of system state: The user knows about what is happening inside the
computer from looking at the interface.
Speaks the user’s language: Uses concepts that are familiar to users. If there are
different user classes (say, novices and experts in the field), the interface is
understandable to all.
Minimizes human memory load: Human memory is fallible and people are likely
to make errors if they must remember information. Where possible, critical
information appears on the screen. Recognition and selection from a list are easier
than memory recall.
Provides feedback to the user: When the user acts, the interface confirms that
something happened. The feedback may range from a simple beep to indicate that
a button press was recorded to a detailed message that describes the consequences
of the action.
Provides good error messages: When errors occur, the user is given helpful
information about the problem. Poor error messages can be disastrous.
Consistency: Similar actions produce similar results. Visually similar objects
(colors, shapes) are related in an important way. Conversely, objects that are
fundamentally different have distinct visual appearance.
Backing up of data is a routine practice for corporate and enterprise systems. The data
banks held by a company, be they related to designs, inventory, personnel, suppliers,
or the customer base, are quite valuable to the company’s business and must be rigorously
protected against loss. Unfortunately, many personal computer users do not take data
backup seriously and end up losing valuable data as a result. Here are some simple steps
that nonexpert computer users can take to protect their data files:
In cases where the cause of a failure isn’t immediately evident, or when multiple parties
involved do not agree on the cause, computer forensics may come into play. Computer
forensics is a relatively new specialty of increasing utility in:
There are several journals on computer forensics, digital investigations, and e-discovery.
Many more such failures exist, as exemplified by the following partial list:
Automated reservations, ticketing, flight scheduling, fuel delivery, kitchens, and general
administration, United Airlines + Univac,
started 1966, target 1968, scrapped 1970, $50M
Hotel reservations linked with airline and car rental, Hilton + Marriott + Budget +
American Airlines, started 1988, scrapped 1992, $125M
IBM Workplace OS for PPC (Mach 3.0 + binary compatibility with AIX + DOS, Mac
OS, OS/400 + new clock mgmt + new RPC + new I/O + new CPU), started 1991,
scrapped 1996, $2B
US FAA’s Advanced Automation System (to replace a 1972 system), started 1982,
scrapped 1994, $6B
Learning from failures is one of the tenets of engineering practice. As elegantly pointed
out by Henry Petroski [Petr06]:
Problems
26.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
In Chapter 12, we studied simple voting schemes and their associated hardware
implementations. In this chapter, we introduce a variety of more flexible, and thus
computationally more complex, voting schemes that are often implemented in
software or a combination of hardware and software. Voting schemes constitute
particular instances of a process known as data fusion, where suspect or
incomplete data from various sources are used to derive more accurate or
trustworthy values. One complication with voting or data fusion in a distributed
environment is the possibility of communication errors and Byzantine (absolute
worst-case) failures. Discussions of these notions conclude this chapter.
From our discussion of hardware voting in Section xxxx, we are already familiar with the
notion of majority voting. A majority fuser sets the output y to be the value provided by a
majority of the inputs xi, if such a majority exists (Fig. 27.fuser-a). We also know that
majority fusers can be built from comparators and multiplexers.
Weighted fusion, which covers majority fusion as a special case, can be defined as
follows. Given n input data objects x1, x2, . . . , xn and associated nonnegative real weights
v1, v2, ... , vn, with v1 + v2 + . . . + vn = V, compute output y and its weight w such that y is “supported
by” a set of input objects with weights totaling w, where w satisfies a condition associated
with a chosen subscheme (a small program sketch following the list below illustrates the
idea). Here are some subschemes that can be used in connection with the general
arrangement of Fig. 27.fuser-b:
Unanimity: w = V
Majority: w > V/2
Supermajority: w ≥ 2V/3
Byzantine: w > 2V/3
Plurality: (w for y) ≥ (w for any z ≠ y)
Threshold: w > some preset lower bound
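Here is a hedged Python sketch of weighted fusion under a few of the subschemes listed above; the function names and the way ties are broken are assumptions made for this illustration.

# Sketch of weighted data fusion: returns the winning value y and its supporting weight w.
from collections import defaultdict

def weighted_fusion(inputs, weights, scheme="majority"):
    V = sum(weights)
    support = defaultdict(float)                 # total weight supporting each distinct value
    for x, v in zip(inputs, weights):
        support[x] += v
    y, w = max(support.items(), key=lambda kv: kv[1])
    satisfied = {
        "unanimity":     w == V,
        "majority":      w > V / 2,
        "supermajority": w >= 2 * V / 3,
        "byzantine":     w > 2 * V / 3,
        "plurality":     True,                   # the top value wins by construction
    }
    return (y, w) if satisfied[scheme] else (None, w)   # None: no value meets the condition

print(weighted_fusion([1, 3, 2, 3, 4], [1, 1, 1, 1, 1], "plurality"))   # (3, 2)
print(weighted_fusion([5, 5, 7], [2, 1, 2], "majority"))                # (5, 3)
print(weighted_fusion([5, 6, 7], [2, 1, 2], "majority"))                # (None, 2)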
Plurality fusion (in its special nonweighted case depicted in Fig. 27.fuser-c) selects the
value that appears on the largest number of inputs and presents it at the output. With the
input values {1, 3, 2, 3, 4}, the output will be 3. It is interesting to note that with
approximate values, selection of the plurality result may be nontrivial. If the inputs are
{1.00, 3.00, 0.99, 3.00, 1.01}, one can legitimately argue that 1.00 constitutes the proper
plurality result. We will discuss approximate voting in more detail later. For now, we
note that median fusion would have produced the output 1.01 in the latter example.
Fig. 27.fuser Three kinds of data fusers derived from voting schemes.
An agreement set for an n-input voting scheme is a subset of the n inputs so that if all
inputs in the subset are in agreement, then the output of the voting scheme is based on
that particular subset. A voting scheme can be fully characterized by its agreement sets.
For example, simple 2-out-of-3 majority voting has the agreement sets {x1, x2}, {x2, x3},
and {x3, x1}. Clearly, the agreement sets cannot be arbitrary if the voting outcome is to be
well-defined. For example, in a 4-input voting scheme, the agreement sets cannot include
{x1, x2} and {x3, x4}. It is readily proven that for a collection of agreement sets to make
sense, no two sets should have an empty intersection.
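The pairwise-intersection requirement just stated is easy to check mechanically; the following Python sketch (with invented names) verifies it for a proposed collection of agreement sets.

# Check that a collection of agreement sets defines a consistent voting scheme:
# no two agreement sets may be disjoint, or two different outputs could win at once.
from itertools import combinations

def agreement_sets_consistent(sets):
    return all(s & t for s, t in combinations([set(s) for s in sets], 2))

maj_2_of_3 = [{"x1", "x2"}, {"x2", "x3"}, {"x3", "x1"}]
bad_4_input = [{"x1", "x2"}, {"x3", "x4"}]

assert agreement_sets_consistent(maj_2_of_3)        # valid: every pair of sets overlaps
assert not agreement_sets_consistent(bad_4_input)   # invalid: {x1,x2} and {x3,x4} are disjoint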
Approximate voting
Approval voting
Interval voting
With some support from the system, in the form of a certain kind of centralized or
distributed reliable and tamperproof service, the number of replicas needed for Byzantine
resilience can be reduced from 3f + 1 to 2f + 1, which is the minimum possible. Several
proposals of this kind have been made since 2004, with the latest [Vero13] possessing
performance and efficiency advantages over the earlier ones.
Problems
27.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
27.2 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
A generalized voting scheme can be specified by listing its agreement sets. For example, simple 2-out-of-3
majority voting with inputs A, B, and C has the agreement sets {A, B}, {B, C}, {C, A}. Show that each of
the agreement sets below corresponds to a weighted threshold voting scheme and present a simple
hardware voting unit implementation for each case.
a. {A, B}, {A, C}, {A, D}, {B, C, D}
b. {A, B}, {A, C, D}, {B, C, D}
c. {A, B, C}, {A, C, D}, {B, C, D}
d. {A, B}, {A, C, D}, {B, D}
Safety-critical systems exist outside the domain of computing. Thus, over many years of
experience with such systems, a number of principles for designing safe systems have been
identified. These principles include [Gold87]:
1. Use barriers and interlocks to constrain access to critical system resources or states
2. Perform critical actions incrementally, rather than in a single step
3. Dynamically modify system goals to avoid or mitigate damages
4. Manage the resources needed to deal with a safety crisis, so that enough will be
available in an emergency
5. Exercise all critical functions and safety features regularly to assess and maintain their
viability
6. Design the operator interface to provide the information and power needed to deal with
exceptions
7. Defend the system against malicious attacks
Let us ignore all this complexity and focus on the simplest possible intersection with two
sets of 3 lights (green, yellow/amber, red) for the two intersecting roads. The simplest
algorithm for controlling the lights would cycle through various states, allowing cars to
move on one street for a fixed amount of time and then changing the lights to let cars on
the other road to proceed for a while. The state of the intersection at any given time can
be represented as XY, where each of the letters X and Y can assume a color code from
{G, Y, R}. With regard to safety, the state RR is very safe, but, of course, undesirable
from the traffic movement point of view, and the state GG is highly dangerous and
should never occur. One can say that with regard to safety, the states can be ordered in
the following way, from the safest to the most dangerous: RR, {RY, YR}, {RG, GR},
YY, {GY, YG}, GG. So, the objective in the design should be to assure that the state of
the traffic lights is one of the first five just listed, avoiding any of the final four states.
Now, if the lights are controlled by six separate signals that turn them on and off, there is
a chance that through some signal being stuck or incorrectly computed, one of the
prohibited states is entered.
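A hedged Python sketch of a software interlock for this intersection follows; the state encoding and the fallback action are assumptions made for this example, chosen so that any attempt to enter one of the four prohibited states forces the safe all-red state instead.

# Interlock sketch for a two-road intersection with states XY, X and Y in {G, Y, R}.
SAFE_STATES = {"RR", "RY", "YR", "RG", "GR"}          # acceptable states, safest first
PROHIBITED = {"YY", "GY", "YG", "GG"}                 # must never be entered

def apply_interlock(requested_state):
    """Allow only safe states; force all-red (the safest state) on any prohibited request."""
    if requested_state in SAFE_STATES:
        return requested_state
    return "RR"    # a stuck or miscomputed signal cannot drive the lights into GG, YY, etc.

cycle = ["RG", "RY", "RR", "GR", "YR", "RR"]          # a simple fixed-time cycle
print([apply_interlock(s) for s in cycle])            # the normal cycle passes unchanged
print(apply_interlock("GG"))                          # a faulty request is overridden to RR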
Any fail-safe system will have some false alarms: situations in which the system enters
its safe mode of operation, or stops functioning altogether, in the absence of any
imminent danger. Conversely, monitoring components in a fail-safe system may miss
hazardous conditions, thus allowing the system to remain operational in an unsafe mode.
False alarms lead to inconvenience, lower productivity, and perhaps financial losses.
However, these undesirable consequences are preferable to potentially huge losses that
might result from unsafe operation. This tradeoff between the frequency of false alarms
and the likelihood of missed warnings is a fundamental one in any system that relies on
tests with binary outcomes for diagnostics.
A binary test is characterized by its sensitivity and specificity. Sensitivity is the fraction of
positive cases (e.g., people having a particular disease) that the test diagnoses correctly.
Specificity is the fraction of negative cases (e.g., healthy people) that are correctly
identified. Figure 28.test contains a graphical depiction of the notions of test specificity
and sensitivity, as well as the trade-offs between them. We see that Test 1 in Fig. 28.xa is
high on specificity, because it does not lead to very many false positives, whereas Test 2
in Fig. 28.xb is highly sensitive in the sense of correctly diagnosing a great majority of
the positive cases. In general, highly sensitive tests tend to be less specific, and vice
versa. For safety-critical systems, we want to err on the side of too much sensitivity.
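The following Python fragment, a minimal sketch with invented names and hypothetical counts, computes sensitivity and specificity from tallies of true/false positives and negatives, making the trade-off discussed above easy to quantify for a given monitoring test.

# Sensitivity and specificity of a binary diagnostic test.
def sensitivity(true_pos, false_neg):
    """Fraction of genuinely positive (hazardous) cases that the test flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of genuinely negative (safe) cases that the test clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical monitoring results over 1000 checks: 20 real hazards, 980 safe conditions.
print(sensitivity(true_pos=19, false_neg=1))     # 0.95: most hazards are caught
print(specificity(true_neg=931, false_pos=49))   # 0.95: about 5% of checks raise false alarms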
Within a self-monitoring safety-critical system, a dead man’s switch can take the form of
a monitoring unit that continuously runs tests to verify the safety of the current
configuration and operational status. If some predetermined time interval passes without
the monitor asserting itself or confirming proper status, the system will automatically
enter its safe operating mode.
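Below is a hedged Python sketch of such a dead man's switch: a watchdog that enters the safe mode if the monitor has not confirmed proper status within a preset interval. The timing values and the class and method names are assumptions made for illustration.

# Dead man's switch: enter the safe mode if status confirmations stop arriving.
import time

class DeadMansSwitch:
    def __init__(self, timeout_s=2.0):
        self.timeout_s = timeout_s
        self.last_ok = time.monotonic()

    def confirm_ok(self):
        """Called by the monitoring unit after each successful safety check."""
        self.last_ok = time.monotonic()

    def check(self, enter_safe_mode):
        """Called periodically; trips the safe mode if confirmations have lapsed."""
        if time.monotonic() - self.last_ok > self.timeout_s:
            enter_safe_mode()

switch = DeadMansSwitch(timeout_s=0.1)
switch.check(lambda: print("safe mode entered"))   # nothing happens: confirmation is recent
time.sleep(0.2)                                    # simulate a silent (possibly failed) monitor
switch.check(lambda: print("safe mode entered"))   # prints: safe mode entered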
Interlocks, watchdog units, and other preventive and monitoring mechanisms are
commonplace in safety-critical systems.
Problems
In this appendix, we trace the history of dependable computing, from the earliest
digital computers to modern machines used in a variety of application domains,
from space exploration and real-time process control to banking and e-commerce.
We explore a few turning points along this impressive historical path, including
emergence of long-life systems for environments that make repair impossible,
development of safety-centered computer systems, meeting the stringent demands
of applications requiring high availability, and interaction of dependability with
security concerns. After discussing how advanced ideas and methods developed
in connection with system classes named above find their way into run-of-the-mill
commercial and personal systems, we conclude with a discussion of current
trends, future outlook, and resources for further study of dependable computing.
Even though fault-tolerant computing as a discipline had its origins in the late 1960s, key
activities in the field began some two decades earlier in Prague, where computer scientist
Antonin Svoboda (1907-1980) built the SAPO computer. The pioneering design of SAPO
employed triplication and voting to overcome the effects of poor component quality.
Svoboda’s efforts were little known in the West and thus did not exert much influence on
the field as we know it. JPL’s STAR (self-testing-and-repairing) computer, on the other
hand, was highly influential, both owing to its design innovations and because the project
leader, Professor Algirdas Avizienis, was one of the early leaders of the field in the US.
The STAR computer project was motivated by NASA’s plans for a Solar System
exploration mission taking 10 years. Known as “The Grand Tour,” the mission was later
scrapped in its original form, but a number of its pieces, including the highly fault-
tolerant computer, formed the basis of later space endeavors.
In the 2010s decade, we have continued to see the development and further maturation of
the field. New challenges to be faced in the rest of this decade include applying and
extending available techniques to new computing paradigms and environments, such as
cloud computing and its attendant mobile platforms (smartphones and compact tablets).
Among problems that need extensive study and innovative solutions is greater integration
of reliability and security concerns. At the end of this decade, the field of dependable
computing, and its flagship conference, will be preparing to celebrate their 50th
anniversary (DSN-50 will be held in 2020 in Valencia, Spain): a milestone that should be
cause for a detailed retrospective assessment of, and prospective planning for, the methods
and traditions of the field, as they have been practiced and as they might apply to new and
emerging technologies. By then, some of the pioneers of the field will no longer be with
us, but there are ample new researchers to carry the torch forward.
Interest in long-life computer systems began with the desire to launch spacecraft on
multiyear missions, where there is no possibility of repair. Even for manned space
missions, the limited repair capability isn’t enough to offset the effects of even more
stringent reliability requirements. Today, we are still interested in long-life systems for
space travel, but the need for such systems has expanded owing to many remotely located
or hard-to-access systems for intelligence gathering and environmental monitoring, to
name only two application domains.
Systems of historical and current interest that fall into the long-life category include
NASA’s OAO, the Galileo spacecraft, JPL STAR, the International Space Station,
communication satellites, and remote sensor networks.
Safety-critical computer systems were first employed in flight control, nuclear reactor
safety, and factory automation. Today, the scope of safety-critical systems has broadened
substantially, given the expansion of numerous applications requiring computerized
control: high-speed transportation, health monitoring, surgical robots.
Systems of historical and current interest that fall into the safety-critical category include
Carnegie Mellon University’s C.vmp, Stanford University’s SIFT, MIT’s FTMP, the
industrial control computers of August Systems, high-speed train controls, and
automotive computers.
Systems of historical and current interest that fall into the high-availability category
include AT&T ESS 1-5 (telephone switching, 1965-1982), Tandem’s various computers
(NonStop I-II, . . . , Cyclone, CLX800, 1976-1991), Stratus FT200-XA2000 (1981-1990),
banking systems, Internet portals (Google, Yahoo), and e-commerce (Amazon, eBay).
The following description of Tandem NonStop Cyclone is copied from [Parh99].
A fully configured Cyclone system consisted of 16 processors that were organized into
sections of 4 processors. Processors in each section were interconnected by a pair of 20
MB/s buses (Dynabus) and each could support two I/O subsystems capable of burst
transfer rates of 5 MB/s (Fig. A.Tand-a). An I/O subsystem consisted of two I/O
channels, each supporting up to 32 I/O controllers. Multiple independent paths were
provided to each I/O device via redundant I/O subsystems, channels, and controllers. Up
to 4 sections could be linked via the unidirectional fiber-optic Dynabus+ that allowed
multiple sections to be nonadjacent within a room or even housed in separate rooms (Fig.
A.Tand-b). By isolating Dynabus+ from Dynabus, full-bandwidth communications could
occur independently in each 4-processor section. Other features of the NonStop Cyclone
are briefly reviewed below.
Fig. A.Tand (a) One section of a NonStop Cyclone system; (b) Dynabus+ connecting 4 sections.
Processors: Cyclone’s 32-bit processors had advanced superscalar CISC designs. They
used dual 8-stage pipelines, an instruction pairing technique for instruction-level parallel
processing, sophisticated branch prediction algorithms for minimizing pipeline bubbles,
and separate 64 KB instruction and data caches. Up to 128 MB of main memory could be
provided for each Cyclone processor. The main memory was protected against errors
through the application of a SEC/DED code. Data transfers between memory and I/O
channels were performed via DMA and thus did not interfere with continued instruction
processing.
Hardware reliability: Use of multiple processors, buses, power supplies, I/O paths, and
mirrored disks was among the methods used to ensure continuous operation despite
hardware malfunctions. A fail-fast strategy was employed to reduce the possibility of
error propagation and data contamination. Packaging and cooling technologies had also
been selected to minimize the probability of failure and to allow components, such as
circuit boards, fans, and power supplies, to be hot-pluggable without a need to interrupt
system operation. When a malfunctioning processor was detected via built-in hardware
self-checking logic, its load was transparently distributed to other processors by the
operating system.
Software reliability: The GUARDIAN 90 operating system was a key to Cyclone’s high
performance and reliability. Every second, each processor was required to send an “I’m
alive” message to every other processor over all buses. Every 2 seconds, each processor
checked to see if it had received a message from every other processor. If a message had
not been received from a particular processor, it was assumed to be malfunctioning.
Other software mechanisms for malfunction detection included data consistency checks
and kernel-level assertions. Malfunctions in buses, I/O paths, and memory were tolerated
by avoiding the malfunctioning unit or path. Processor malfunctions led to deactivation
of the processor. For critical applications, GUARDIAN 90 maintained duplicate backup
processes on disjoint processors. To reduce overhead, the backup process was normally
inactive but was kept consistent with the primary process via periodic checkpointing
messages. Upon malfunction detection, the backup process was started from the last
checkpoint, perhaps using mirror copies of the data.
Due to highly unreliable components, early computers had extensive provisions for fault
masking and error detection/correction. Today’s components are ultrareliable, but there
are so many of them in a typical system that faults/errors/malfunctions are inevitable. It is
also the case that computers are being used not only by those with hardware/software
training but predominantly by nonexperts, or experts in other fields, who would be
immensely inconvenienced by service interruptions and erroneous results.
Systems of historical and current interest that fall into the commercial/personal category
include SAPO, IBM System 360, and IBM Power 6 [Reic08].
In this section, we review some of the active research areas in dependable computing and
provide a forecast of where the field is headed in the next few decades. Technology
forecasting is, of course, a perilous task and many forecasters look quite silly when their
predictions are examined decades hence. Examples of off-the-mark forecasts include
Thomas J. Watson’s “I think there is a world market for maybe five computers,” and Ken
Olson’s “There is no reason anyone would want a computer in their home.” Despite these
and other spectacular failures in forecasting, and cautionary anonymous bits of advice
such as “There are two kinds of forecasts: lucky and wrong,” I am going to heed the more
optimistic musing of Henri Poincare, who said “It is far better to foresee even without
certainty than not to foresee at all.”
Dependable computer systems and design methods continue to evolve. Over the years,
emphasis in the field has shifted from building limited-edition systems with custom
components to using commercial off-the-shelf (COTS) components to the extent
possible. This strategy implies incorporating dependability features through layers of
software and services that run on otherwise untrustworthy computing elements.
Designers of COTS components in turn provide features that enable and facilitate such a
layered approach. This trend, combined with changes in technology and scale of systems,
creates a number of challenges which will occupy computer system designers for decades
to come.
Nanotechnology brings about smaller and more energy-efficient devices, but it also
creates new challenges. Smaller devices are prone to unreliable operation, present
complex modeling problems (due to the need for taking parameter variations into
account), and exacerbate the already difficult testing problems.
Parameter variations [Ghos10] occur from one wafer to another, between dies on the
same wafer, and within each die. Reasons for parameter variations include vertical
nonuniformities in the layer thicknesses and horizontal nonuniformities in line and
spacing widths.
Cloud computing has been presented as a solution to all corporate and personal
computing problems, but we must be aware of its complicated reliability and availability
problems [Baue12]. Whereas it is true that multiplicity and diversity of resources can lead
to higher reliability by avoiding single points of failure, this benefit does not come about
automatically. Accidental and deliberate outages are real possibilities and identifying the
weakest link in this regard is no easy task. Assignment of blame in the event of failures
and rigorous risk assessment for both e-commerce and safety-critical systems are among
other difficulties.
The human brain is often viewed as an epitome of design elegance and efficiency.
Though quite energy-efficient (the brain uses around 20 W of power, whereas performing
a minute fraction of the brain’s function for a short period of time via simulation requires
supercomputers that consume megawatts of power), its design is anything but elegant.
The brain has grown in layers over millions of years of evolutionary time. Newer
capabilities are housed in the neocortex, developed fairly recently, and the older reptilian
brain parts are still there (Fig. A.brain-a). As a result, there is functional redundancy,
meaning that the same function (such as vision) may be performed in multiple regions.
Furthermore, the use of a fairly small number of building blocks makes it easy for one part
of the brain to cover for other parts when they are damaged. The death or disconnection
of a small number of neurons is often inconsequential, and those suffering substantial
brain injuries can and do recover to full brain function. The brain uses electrochemical
signaling that is extremely slow compared with electronic communication in modern
digital computers. Yet, the overall computational power of the brain is as yet unmatched
by even the most powerful supercomputers. Finally, memory and computational
functionalities are intermixed in the brain: there are no separate memory units and
computational elements.
Fig. A.brain (a) Main parts of the human brain; (b) Communication between neurons.
As for data longevity, typical media used for mass data storage have lifespans of 3-20
years. Data can be lost to both media decay and format obsolescence. [Elaborate]
Data preservation is particularly daunting when documents and records are produced and
consumed in digital form, as such data files have no nondigital back-up.
The field of dependable computing must deal with uncertainties of various kinds.
Therefore, research in this field has entailed methods for representing and reasoning
about uncertainties. Uncertainties can exist in both data (missing, imprecise, or estimated
data) and in rules for processing data (e.g., empirical rules). Approaches used to date fall
under four categories:
The interaction between failures and intrusions (security breaches) has become quite
important in recent years. Increasingly, attackers take advantage of extended system
vulnerabilities during failure episodes to compromise the affected systems.
[Elaborate]
As top-of-the-line supercomputers use more and more processing nodes (currently in the
hundreds of thousands, soon to be in the millions), system MTBF is shortened and the
reliability overhead increases. Checkpointing, for example, must be done more
frequently, which can lead to superlinear overhead of such methods in terms of the
number p of processors. Typically, computational speed-up increases with p (albeit
sometimes facing a flattening due to Amdahl’s law). The existence of reliability overhead
may mean that the speed-up can actually decline beyond a certain number of processors.
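To make the reliability-overhead argument concrete, here is a small Python sketch; the overhead model (checkpointing cost assumed to grow linearly with the number of processors) and the parameter values are assumptions chosen only to illustrate how effective speed-up can peak and then decline.

# Illustration: effective speed-up when reliability overhead grows with processor count p.
def amdahl_speedup(p, serial_fraction=0.01):
    return 1 / (serial_fraction + (1 - serial_fraction) / p)

def effective_speedup(p, serial_fraction=0.01, overhead_per_proc=2e-6):
    """Checkpointing overhead assumed to grow linearly with p, eating into the speed-up."""
    return amdahl_speedup(p, serial_fraction) / (1 + overhead_per_proc * p)

for p in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"p = {p:>9,}: ideal Amdahl = {amdahl_speedup(p):6.1f}, "
          f"with overhead = {effective_speedup(p):6.1f}")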
The field of dependable computing has undergone many changes since the appearance of
the first computer incorporating error detection and correction [Svob79]. The emergence
of new technologies and the unwavering quest for higher performance, usability, and
reliability are bound to create new challenges in the coming years. These will include
completely new challenges, as well as novel or transformed versions of the ones
discussed in the preceding paragraphs. Researchers in dependable computing, who helped
make digital computers into indispensable tools in the six-plus decades since the
introduction of fault tolerance notions, will thus have a significant role to play in making
them even more reliable and ubiquitous as we proceed through the second decade of the
Twenty-First Century.
Fault tolerance strategies to date have by and large relied on perfect hardware and/or
perfect detection of failures. Thus, we either get the correct results (nearly all of the time)
or hit an exceptional failure condition, which we detect and recover from. With modern
computing technology, where extreme complexity and manufacturing variability make
failures the norm, rather than the exception, we should design computers more like
biological systems in which failures (and noise) are routinely expected.
We summarize our discussions of the history, current status, and future of dependable
computing in the timeline depicted in Fig. A.tmln. As for resources that would allow the
reader to gain additional insights in dependable computing, we have already listed some
general references at the end of the Preface and topic-specific references at the end of
each chapter (and this appendix). Other resources, which are nowadays quite extensive,
thanks to electronic information dissemination, can be found through Internet search
engines. For example, a search for “fault-tolerant computer” on Google yields some 1.5
million items, not counting additional hits for “dependable computing,” “computer
system reliability,” and other related terms. The author maintains a list of Web resources
for dependable computing on his companion website for this book: it can be reached via
the author’s faculty Web site at University of California, Santa Barbara (UCSB).
In this book, we have built a framework and skimmed over a number of applicable
techniques at various levels of the system hierarchy, from devices to applications. This
framework can serve as your guide, allowing you to integrate new knowledge you gain in
the field and to see how it fits in the dependable computing universe.
Problems
A.3 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx
[Seid09] Seidel, F., “X-by-Wire,” accessed online on December 2, 2009, at: http://osg.informatik.tu-
chemnitz.de/lehre/old/ws0809/sem/online/x-by-wire.pdf
[Siew95] Siewiorek, D. (ed.), Fault-Tolerant Computing Highlights from 25 Years, Special Volume of
the 25th Int’l Symp. Fault-Tolerant Computing, 1995.
[Sing18] Singh, G., B. Raj, and R. K. Sarin, “Fault-Tolerant Design and Analysis of QCA-Based
Circuits,” IET Circuits, Devices & Systems, Vol. 12, No. 5, pp. 638-644, 2018.
[Svob79] Svoboda, A., “Oral History Interview OH35 by R. Mapstone,” 15 November 1979, Charles
Babbage Institute, University of Minnesota, Minneapolis. Transcripts available at:
http://www.cbi.umn.edu/oh/pdf.phtml?id=263
[Trav04] Traverse, P., I. Lacaze, and J. Souyris, “Airbus Fly-by-Wire: A Total Approach to
Dependability,” in Building the Information Society, ed. by P. Jacquart, pp. 191-212, Springer,
2004.
[Vinh16] Vinh, P. C. and E. Vassev, “Nature-Inspired Computation and Communication: A Formal
Approach,” Future Generation Computer Systems, Vol. 56, pp. 121-123, March 2016.
[Walf15] Walfish, M. and A. J. Blumberg, “Verifying Computations without Reexecuting Them,”
Communications of the ACM, Vol. 58, No. 2, pp. 74-84, February 2015.
[Yang12] Yang, X., Z. Wang, J. Xue, and Y. Zhou, “The Reliability Wall for Exascale Supercomputing,”
IEEE Trans. Computers, Vol. 61, No. 6, pp. 767-779, June 2012.