[email protected]
http://www.ece.ucsb.edu/~parhami

This is a draft of the forthcoming book


Dependable Computing: A Multilevel Approach,
by Behrooz Parhami, publisher TBD
ISBN TBD; Call number TBD

All rights reserved for the author. No part of this book may be reproduced,
stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without written permission. Contact the author at:
ECE Dept., Univ. of California, Santa Barbara, CA 93106-9560, USA ([email protected])

Dedication
To my academic mentors of long ago:

Professor Robert Allen Short (1927-2003),


who taught me digital systems theory
and encouraged me to publish my first research paper on
stochastic automata and reliable sequential machines,

and

Professor Algirdas Antanas Avižienis (1932- )


who provided me with a comprehensive overview
of the dependable computing discipline
and oversaw my maturation as a researcher.

About the Cover


The cover design shown is a placeholder. It will be replaced by the actual cover image
once the design becomes available. The two elements in this image convey the ideas that
computer system dependability is a weakest-link phenomenon and that modularization &
redundancy can be employed, in a manner not unlike the practice in structural
engineering, to prevent failures or to limit their impact.
Last modified: 2020-09-29

Preface
“The skill of writing is to create a context in which other people
can think.”
Edwin Schlossberg

“When a complex system succeeds, that success masks its
proximity to failure. ... Thus, the failure of the Titanic contributed
much more to the design of safe ocean liners than would have
her success. That is the paradox of engineering and design.”
Henry Petroski, Success through Failure: The
Paradox of Design

The Context of Dependable Computing

Accounts of computer system errors, failures, and other mishaps range from humorous to
horrifying. On the lighter side, we have data entry or computation errors that lead to an
electric company sending a bill for hundreds of thousands of dollars to a small residential
customer. Such errors cause a chuckle or two as they are discussed at the dinner table and
are usually corrected to everyone's satisfaction, leaving no permanent damage (though in
a few early occurrences of this type, some customers suffered from their power being
disconnected due to nonpayment of the erroneous bill). At the other extreme, there are
dire consequences such as an airliner with hundreds of passengers on board crashing, two
high-speed commuter trains colliding, a nuclear reactor taken to the brink of meltdown,
or financial markets in a large industrial country collapsing. Causes of such annoying or
catastrophic behavior include design deficiencies, interactions of several rare or
unforeseen events, operator errors, external disturbances, and malicious actions.

In nearly all engineering and industrial disciplines, quality control is an integral part of
the design and manufacturing processes. There are also standards and guidelines that
make at least certain aspects of quality assurance more or less routine. This is far from
being the case in computer engineering, particularly with regard to software products.
True, we do offer dependable computing courses to our students, but in doing so, we
create an undesirable separation between design and dependability concerns. A structural
engineer does not learn about bridge-building in one course and about ensuring that
bridges do not collapse in another. A toaster or steam-iron manufacturer does not ship its
products with a warning label that there is no guarantee that the device will prepare toast

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-09-29 4

or remove wrinkles from clothing and that the manufacturer will not be liable for any
harm resulting from their use.

The field of dependable (aka fault-tolerant) computing was born in the late 1960s,
because longevity, safety, and robustness requirements of space and military applications
could not be met through conventional design. Space applications presented the need for
long-life systems, with either no possibility of on-board maintenance or repair (unmanned
missions) or with stringent reliability requirements in harsh, and relatively unpredictable
environments (manned missions). Military applications required extreme robustness in
the face of punishing operational conditions, partial system wipeout, or targeted attacks
by a sophisticated adversary. Early researchers of the field were thus predominantly
supported by aerospace and defense organizations.

Of course, designing computer systems that were robust, self-reconfiguring, and
ultimately self-repairing was only one part of the problem. There was also a need to
quantify various aspects of system dependability via detailed and faithful models that
would allow the designers of such systems to gain the confidence and trust of the user
community. A failure probability of one-in-a-billion, say, was simply too low to allow
experimental assessment and validation, as such experiments would have required the
construction and testing of many millions of copies of the target system in order to allow
general conclusions to be drawn with reasonable confidence. Thus, research-and-
development teams in dependable computing pursued analytical and simulation models to
help with the evaluation process. Extending and fine-tuning of such models is one of the
main activity threads in the field.

As the field matured, application areas broadened beyond aerospace and military systems
and they now include a wide array of domains, from automotive computers to large
redundant disk arrays. In fact, many of the methods discussed in this book are routinely
utilized even in contexts that do not satisfy the traditional definitions of high-risk or
safety-critical systems, although the most elaborate techniques continue to be developed
for systems whose failure would endanger human lives. Systems in the latter category
include:

– Advanced infrastructure and transportation systems, such as high-speed trains
– Process control in hazardous environments, such as nuclear power plants
– Patient monitoring and emergency health-care procedures, as in surgical robots


Such advanced techniques then trickle down and eventually find their way into run-of-
the-mill computer systems, such as traditional desktop and laptop computers.

Scope and Features

The field of dependable computing has matured to the point that a dozen or so texts and
reference books have been published. Some of these books that cover dependable
computing in general (as opposed to special aspects or ad-hoc/unconventional methods)
are listed at the end of this preface. Each of these books possesses unique strengths and
has contributed to the formation and fruition of the field. The current text, Dependable
Computing: A Multilevel Approach, exposes the reader to the basic concepts of
dependable computing in sufficient detail to enable their use in many hardware/software
contexts. Covered methods include monitoring, redundancy, error detection/correction,
self-test, self-check, self-repair, adjudication, and fail-soft operation. The book is an
outgrowth of lecture notes that the author has developed and refined over many years.
Here are the most important features of this text in comparison with the listed books:

a. Division of material into lecture-size chapters: In my approach to teaching, a
lecture is a more or less self-contained module with links to past lectures and
pointers to what will transpire in future. Each lecture must have a theme or title
and must proceed from motivation, to details, to conclusion. In designing the text,
I have strived to divide the material into chapters, each of which is suitable for
one lecture (1-2 hours). A short lecture can cover the first few subsections while a
longer lecture can deal with variations, peripheral ideas, or more advanced
material near the end of the chapter. To make the structure hierarchical, as
opposed to flat or linear, lectures are grouped into seven parts, each composed of
four lectures and covering one level in our multilevel model (see the figure at the
end of this preface).

b. Emphasis on both the underlying theory and actual system designs: The ability to
cope with complexity requires both a deep knowledge of the theoretical
underpinnings of dependable computing and examples of designs that help us
understand the theory. Such designs also provide building blocks for synthesis as
well as reference points for cost-performance comparisons. This viewpoint is
reflected, for example, in the detailed coverage of error-coding techniques that
later leads to a better understanding of various self-checking design methods
and redundant disk arrays (Part IV). Another example can be found in Chapter 17,
where the rigorous discussion of malfunction diagnosis allows a more systematic
treatment of reconfiguration and self-repair.

c. Linking dependable computing to other subfields of computing: Dependable
computing is nourished by, and in turn feeds, other subfields of computer systems
and technology. Examples of such links abound. Parallel and distributed
computing is a case in point, given that such systems contain multiple resources
of each kind and thus offer the possibility of sharing spare subsystems for greater
efficiency. In fact, one can even find links to topics outside traditional science and
engineering disciplines. For example, designers of redundant systems with
replication and voting can learn a great deal from the treatment of voting systems
by mathematicians and social scientists. Such links are pointed out and pursued
throughout the text.

d. Wide coverage of important topics: The current text covers virtually all important
algorithmic and hardware design topics in dependable computing, thus providing
a balanced and complete view of the field. Topics such as testable design, voting
algorithms, software redundancy, and fail-safe systems do not all appear in other
textbooks.

e. Unified and consistent notation/terminology throughout the text: Every effort is
made to use consistent notation/terminology throughout the text. For example, R
always stands for reliability and s for the number of spare units. While other
authors have done this in the basic parts of their texts, there is a tendency to cover
more advanced research topics by simply borrowing the notation and terminology
from the reference source. Such an approach has the advantage of making the
transition between reading the text and the original reference source easier, but it
is utterly confusing to the majority of the students who rely on the text and do not
consult the original references except, perhaps, to write a research paper.
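To illustrate the notation in use, here is a small sketch (with made-up parameter values) of the standard reliability formula for one active unit backed by s cold-standby spares, R(t) = e^(–λt) Σ_{k=0}^{s} (λt)^k/k!, which assumes perfect failure detection and switchover:

```python
import math

def standby_reliability(lam: float, t: float, s: int) -> float:
    """R(t) for one active unit plus s cold-standby spares with constant
    failure rate lam, assuming perfect detection and switchover:
    R(t) = exp(-lam*t) * sum_{k=0}^{s} (lam*t)^k / k!"""
    x = lam * t
    return math.exp(-x) * sum(x**k / math.factorial(k) for k in range(s + 1))

# With s = 0 this reduces to the simplex reliability exp(-lam*t);
# each added spare raises R for the same mission time.
for s in range(3):
    print(s, standby_reliability(lam=1e-3, t=1000, s=s))
```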


Summary of Topics

The seven parts of this book, each composed of four chapters, have been written with the
following goals:

Part I sets the stage, gives a taste of what is to come, and provides a detailed perspective
on the assessment of dependability in computing systems and the modeling tools needed
for this purpose.

Part II deals with impairments to dependability at the physical (device) level, how they
may lead to system vulnerability and low integrated-circuit yield, and what
countermeasures are available for dealing with them.

Part III deals with impairments to dependability at the logical (circuit) level, how the
resulting faults can affect system behavior, and how redundancy methods can be used to
deal with them.

Part IV covers information-level impairments that lead to data-path and control errors,
methods for detecting/correcting such errors, and ways of preventing such errors from
propagating and causing problems at even higher levels of abstraction.

Part V deals with everything that can go wrong at the architectural level, that is, at the
level of interactions between subsystems, be they parts of a single computer or nodes in a
widely distributed system.

Part VI covers service-level impairments that may cause a system not to be able to
perform the required tasks, even though it has not totally failed in an absolute sense.

Part VII deals with breaches at the computation-result or outcome level, where the
success or failure of a computing system is ultimately judged and the costs of aberrant
results or actions must be borne.


Pointers on How to Use the Book

For classroom use, the topics in each chapter of this text can be covered in a lecture
spanning 1-2 hours. In my own teaching, I have used the chapters primarily for 1.5-hour
lectures, twice a week, in a 10-week quarter, omitting or combining some chapters to fit
the material into 18-20 lectures. But the modular structure of the text lends itself to other
lecture formats, self-study, or review of the field by practitioners. In the latter two cases,
the readers can view each chapter as a study unit (for one week, say) rather than as a
lecture. Ideally, all topics in each chapter should be covered before moving to the next
chapter. However, if fewer lecture hours are available, then some of the subsections
located at the end of chapters can be omitted or introduced only in terms of motivations
and key results.

Problems of varying complexity, from straightforward numerical examples or exercises
to more demanding studies or mini-projects, have been supplied for each chapter. These
problems form an integral part of the book and have not been added as afterthoughts to
make the book more attractive for use as a text. A total of xxx problems are included (xx-
xx per chapter). Assuming that two lectures are given per week, either weekly or
biweekly homework can be assigned, with each assignment having the specific coverage
of the respective half-part (two chapters) or full part (four chapters) as its "title".

An instructor's solutions manual is planned. The author's detailed syllabus for the course
ECE 257A at UCSB is available at

http://www.ece.ucsb.edu/~parhami/ece_257a.htm

References to classical papers in dependable computing, key design ideas, and important
state-of-the-art research contributions are listed at the end of each chapter. These
references provide good starting points for doing in-depth studies or for preparing term
papers/projects. New ideas in the field of dependable computing appear in papers
presented at an annual technical gathering, the Dependable Systems and Networks (DSN)
conference, sponsored jointly by the Institute of Electrical and Electronics Engineers (IEEE)
and the International Federation for Information Processing (IFIP). DSN, which was formed
by merging meetings sponsored separately by IEEE and IFIP, covers all aspects of the
field, including techniques for dependable computing and communications, as well as
performance and other implications of dependability features.


Other conferences include the Pacific Rim International Symposium on Dependable
Computing [PRDC], European Dependable Computing Conference [EDCC], Symposium
on Reliable Distributed Systems [SRDS], and International Conference on Computer
Design [ICCD]. The field's most pertinent archival journals are IEEE Transactions on
Dependable and Secure Computing [TDSC], IEEE Transactions on Reliability [TRel],
IEEE Transactions on Computers [TCom], and The Computer Journal [ComJ].

Acknowledgments

The current text, Dependable Computing: A Multilevel Approach, is an outgrowth of
lecture notes that I have used for the graduate course "ECE 257A: Fault-Tolerant
Computing" at the University of California, Santa Barbara, and, in rudimentary forms, at
several other institutions prior to 1988. The text has benefited greatly from keen
observations, curiosity, and encouragement of my many students in these courses. A
sincere thanks to all of them!


General References

The list that follows contains references of two kinds: (1) books that have greatly
influenced the current text and (2) general reference sources for in-depth study or
research. Books and other information sources that are relevant to specific chapters are
listed in the end-of-chapter reference lists.

[Ande81] Anderson, T. and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice Hall, 1981.
[Ande85] Anderson, T. A. (ed.), Resilient Computing Systems, Collins, 1985. Also: Wiley, 1986.
[ComJ] The Computer Journal, published monthly by the British Computer Society.
[Comp] IEEE Computer, magazine published by the IEEE Computer Society. Has published several
special issues on dependable computing: Vol. 17, No. 6, August 1984; Vol. 23, No. 5, July
1990.
[CSur] ACM Computing Surveys, journal published by the Association for Computing Machinery.
[DCCA] Dependable Computing for Critical Applications, series of conferences later merged into DSN.
[Diab05] Diab, H. B. and A. Y. Zomaya (eds.), Dependable Computing Systems: Paradigms,
Performance Issues, and Applications, Wiley, 2005.
[DSN] Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks, Conference formed by
merging earlier series of meetings, the oldest of which (FTCS) dated back to 1971. URL:
http://www.dsn.org
[Dunn02] Dunn, W. R., Practical Design of Safety-Critical Computer Systems, Reliability Press, 2002.
[EDCC] Proc. European Dependable Computing Conf., Conference held 10 times from 1994 to 2014
and turned into an annual event in 2015, when it merged with the European Workshop on
Dependable Computing. URL: http://edcc2015.lip6.fr/
[FTCS] IEEE Symp. Fault-Tolerant Computing, series of annual symposia, begun in 1971 and
eventually merged into DSN.
[Geff02] Geffroy, J.-C. and G. Motet, Design of Dependable Computing Systems, Springer, 2002.
[Gray85] Gray, J., “Why Do Computers Stop and What Can Be Done About It?” Technical Report
TR85.7, Tandem Corp., 1985.
[ICCD] Proc. IEEE Int'l Conf. Computer Design, sponsored annually by the IEEE Computer Society
since 1983.
[IFIP] Web site of the International Federation for Information Processing Working Group WG 10.4
on Dependable Computing. http://www.dependability.org/wg10.4
[Jalo94] Jalote, P., Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
[John89] Johnson, B. W., Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley,
1989.
[Knig12] Knight, J., Fundamentals of Dependable Computing for Software Engineers, CRC Press, 2012.
[Kore07] Koren, I. and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[Lala01] Lala, P. K., Self-Checking and Fault-Tolerant Digital Design, Morgan Kaufmann, 2001.
[Levi94] Levi, S.-T. and A. K. Agrawala, Fault-Tolerant System Design, McGraw-Hill, 1994.
[Negr89] Negrini, R., M. G. Sami, and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI
and WSI Arrays, MIT Press, 1989.


[Nels87] Nelson, V. P. and B. D. Carroll (eds.), Tutorial: Fault-Tolerant Computing, IEEE Computer
Society Press, 1987.
[Prad86] Pradhan, D. K. (ed.), Fault-Tolerant Computing: Theory and Techniques, 2 Vols., Prentice
Hall, 1986.
[Prad96] Pradhan, D. K. (ed.), Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[PRDC] Proc. IEEE Pacific Rim Int'l Symp. Dependable Computing, sponsored by IEEE Computer
Society and held every 1-2 years since 1989.
[Rand13] Randell, B., J.-C. Laprie, H. Kopetz, and B. Littlewood (eds.), Predictably Dependable
Computing Systems, Springer, 2013.
[Shoo02] Shooman, M. L., Reliability of Computer Systems and Networks, Wiley, 2002.
[Siew82] Siewiorek, D. P. and R.S. Swarz, The Theory and Practice of Reliable System Design, Digital
Press, 1982.
[Siew92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd Edition, 1992. Also: A. K. Peters, 1998.
[Sori09] Sorin, D., Fault Tolerant Computer Architecture, Morgan & Claypool, 2009.
[SRDS] Proc. IEEE Symp. Reliable Distributed Systems, sponsored annually by IEEE Computer
Society.
[TCom] IEEE Trans. Computers, journal published by the IEEE Computer Society. Has published a
number of special issues on dependable computing: Vol. 41, No. 2, February 1992; Vol. 47,
No. 4, April 1998; Vol. 51, No. 2, February 2002.
[TDSC] IEEE Trans. Dependable and Secure Computing, journal published by the IEEE Computer
Society.
[Toy87] Toy, W. N., “Fault-Tolerant Computing,” Advances in Computers, Vol. 26, 1987, pp. 201-279.
[TRel] IEEE Trans. Reliability, quarterly journal published by IEEE Reliability Society.


Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).


Table of Contents
Part I Introduction: Dependable Systems
1 Background and Motivation
2 Dependability Attributes
3 Combinational Modeling
4 State-Space Modeling
Part II Defects: Physical Imperfections
5 Defect Avoidance
6 Defect Circumvention
7 Shielding and Hardening
8 Yield Enhancement
Part III Faults: Logical Deviations
9 Fault Testing
10 Fault Masking
11 Design for Testability
12 Replication and Voting
Part IV Errors: Information Distortions
13 Error Detection
14 Error Correction
15 Self-Checking Modules
16 Redundant Disk Arrays
Part V Malfunctions: Architectural Anomalies
17 Malfunction Diagnosis
18 Malfunction Tolerance
19 Standby Redundancy
20 Resilient Algorithms
Part VI Degradations: Behavioral Lapses
21 Degradation Allowance
22 Degradation Management
23 Robust Task Scheduling
24 Software Redundancy
Part VII Failures: Computational Breaches
25 Failure Confinement
26 Failure Recovery
27 Agreement and Adjudication
28 Fail-Safe System Design
Appendix
A Past, Present, and Future


I Introduction: Dependable Systems


[Sidebar figure: the multilevel model, with levels Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed]

“Once every decade an unloaded gun will fire; once every
century a rake will fire.”
Russian saying about rifles used on stage

“The severity with which a system fails is directly proportional to
the intensity of the designer’s belief that it cannot.”
Anonymous

Chapters in This Part


1. Background and Motivation
2. Dependability Attributes
3. Combinational Modeling
4. State-Space Modeling

Dependability concerns are integral parts of engineering design. Ideally, we
would like our computer systems to be perfect, yielding timely and correct results
in all cases. However, just as bridges collapse and airplanes crash occasionally
(albeit rarely), so too computer hardware and software cannot be made totally
immune to unpredictable or catastrophic behavior. Despite great strides in
component reliability and programming methodology, the exponentially
increasing complexity of integrated circuits and software components makes the
design of perfect computer systems nearly impossible. In this part, after reviewing
some application areas for which conventionally designed hardware and software
do not offer the desired level of dependability, we discuss evaluation criteria and
tools for dependable computer systems. Put another way, the four chapters of this
part, listed above, answer the key questions of where we want to go and how we
know whether we have gotten there.


1 Background and Motivation


“Failing to plan is planning to fail.”
Effie Jones

“. . . on October 5, 1960, the warning system at NORAD
indicated that the United States was under massive attack by
Soviet missiles with a certainty of 99.9 percent. It turned out that
the BMEWS radar in Thule, Greenland, had spotted the rising
moon. Nobody had thought about the moon when specifying how
the system should act.”
A. Borning, Computer System Reliability and
Nuclear War

Topics in This Chapter


1.1. The Need for Dependability
1.2. A Motivating Case Study
1.3. Impairments to Dependability
1.4. A Multilevel Model
1.5. Examples and Analogies
1.6. Dependable Computer Systems

Learning how to do things right, without first developing an appreciation of why
those things need to be done, consequences of not doing them, and underlying
objectives, can be difficult. Dependable computing is no exception. In this
chapter, we establish the need for dependable computing, introduce the basic
terminology of the field, and review several application areas for which certain
facets of computer system dependability are important, but where dependability
requirements cannot be met with straightforward design. The terminology is
presented in the framework of a multilevel model that facilitates the study of
dependability, from the lowest levels of devices and circuits to the highest levels
of service adequacy and computational outcomes.


1.1 The Need for Dependability

Computer and information systems are important components of the modern society,
which has grown increasingly reliant on the availability and proper functioning of such
systems. When a computer system fails:

– A writer or reporter, who can no longer type on her personal computer, may
become less productive, perhaps missing a deadline.
– A bank customer, unable to withdraw or deposit funds through a remote ATM,
may become irate and perhaps suffer financial losses.
– A telephone company subscriber may miss personal or business calls or even
suffer loss of life due to delayed emergency medical care.
– An airline pilot or passenger may experience delays and discomfort, perhaps even
perish in a crash or midair collision.
– A space station or manned spacecraft may lose maneuverability or propulsion,
wandering off in space and becoming irretrievably lost.

Thus, consequences of computer system failures can range from inconvenience to loss of
life. Low-severity consequences, such as dissatisfaction and lost productivity, though not
as important on a per-occurrence basis, are much more frequent, thus leading to
nonnegligible aggregate effects on the society’s economic well-being and quality of life.
Serious injury or loss of life, due to the failure of a safety-critical system, is no doubt a
cause for concern. As computers are used for demanding and critical applications by an
ever expanding population of minimally trained users, the dependability of computer
hardware and software becomes even more important.

But what is dependability? Webster’s Ninth New Collegiate Dictionary defines
“dependable” as capable of being trusted or relied upon; a synonym for “reliable.” In the
technical sense of the term, dependability is viewed as a qualitative system attribute that
can be variously quantified by reliability, availability, performability, safety, and so on.
Thus, the use of “dependability” to represent the qualitative sense of the term “reliability”
allows us to restrict the application of the latter term to a precisely defined probabilistic
measure of survival.


In one of the early proposals, dependability is defined succinctly as “the probability that
a system will be able to operate when needed” [Hosf60]. This simplistic definition,
which subsumes both of the well-known notions of reliability and availability, is only
valid for systems with a single catastrophic failure mode; i.e., systems that are either
completely operational or totally failed. The problem lies in the phrase “be able to
operate.” What we are actually interested in is “task accomplishment” rather than
“system operation.”
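The distinction between the two subsumed notions can be made concrete with the standard textbook measures (the numbers below are merely illustrative): reliability R(t) = e^(–λt) is the probability of operating continuously through time t, while steady-state availability A = MTTF/(MTTF + MTTR) is the long-run fraction of time the system is operational.

```python
import math

def reliability(lam: float, t: float) -> float:
    """Probability of no failure in [0, t] with constant failure rate lam."""
    return math.exp(-lam * t)

def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: long-run fraction of up time."""
    return mttf / (mttf + mttr)

# A frequently failing but quickly repaired system can be highly
# available yet quite unreliable over a long mission.
print(availability(mttf=100.0, mttr=0.1))  # just under 0.999
print(reliability(lam=1/100.0, t=1000.0))  # about 4.5e-5
```

This is why a single "probability of being able to operate" cannot serve both purposes except in the degenerate case of a system that is either fully operational or totally failed.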

The following definition [Lapr82] is more suitable in this respect: “... dependability [is
defined] as the ability of a system to accomplish the tasks (or equivalently, to provide the
service[s]) which are expected from it.” This definition does have its own weaknesses.
For one thing, the common notion of “specified behavior” has been replaced by
“expected behavior” so that possible specification slips are accommodated in addition to
the usual design and implementation flaws. However, if our expectations are realistic and
precise, they might be considered as simply another form of system specification
(possibly a higher-level one). And if we expect too much from a system, then this
definition is an invitation to blame our misguided expectations on the system’s
undependability.

A more useful definition has been provided by Carter [Cart82]: “... dependability may be
defined as the trustworthiness and continuity of computer system service such that
reliance can justifiably be placed on this service.” This definition has two positive
aspects: It takes the time element into account explicitly (“continuity”) and stresses the
need for dependability validation (“justifiably”). Laprie’s version of this definition
[Lapr85] can be considered a step backwards in that it substitutes “quality” for
“trustworthiness and continuity.” The notions of “quality” and “quality assurance” are
well-known in many engineering disciplines and their use in connection with computing
is a welcome trend. However, precision need not be sacrificed for compatibility.

In the most recent compilation of dependable computing terminology evolving from the
preceding efforts [Aviz04], dependability is defined as a system’s “ability to deliver
service that can justifiably be trusted” (original definition) and “ability to avoid service
failures that are more frequent and more severe than is acceptable” (alternate definition),
with trust defined as “accepted dependence.” Both of these definitions are rather
unhelpful and the first one appears to be circular.

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-09-29 18

To present a more useful definition of dependable computing, we must examine the
various aspects of undependability. From a user’s viewpoint, undependability takes the
form of late, incomplete, inaccurate, or incorrect results or actions [Parh78]. The two
notions of trustworthiness (correctness, accuracy) and timeliness can be abstracted from
the above; completeness need not be considered separately, because any missing result or
action can be deemed infinitely late. We also note that dependability should not be
considered an intrinsic property of a computer system. Rather, it should be defined with
respect to particular (classes of) computations and/or interactions. A system that is
physically quite unreliable may well be made dependable for a particular class of
interactions through the use of reasonableness checks and other algorithmic methods.

The preceding discussion leads us to the following [Parh94]: Dependability of a computer
or computer-based system may be defined as justifiable confidence that it will perform
specified actions or deliver specified results in a trustworthy and timely manner. Note
that the preceding definition does not preclude the possibility of having different levels of
importance for diverse classes of user interactions or varying levels of criticality for
situations in which the computer is required to react. Such variations simply correspond
to different levels of confidence and dependability. Our definition retains the positive
elements of previous definitions, while presenting a result-level view of the time
dimension by replacing the notion of “service continuity” by timeliness of actions or
results.

Having defined computer system dependability, we next turn to the question of why we
need to be concerned about it. This we will do through a sequence of viewpoints, or
arguments. In these arguments, we use back-of-the-envelope calculations to illustrate the
three classes of systems for which dependability requirements are impossible to meet
without special provisions:

• Long-life = Fail-slow = Rugged = High-reliability
• Safety-critical = Fail-safe = Sound = High-integrity
• Nonstop = Fail-soft = Robust = High-availability

a. The reliability argument: Assume that electronic digital systems fail at a rate of
about λ = 10⁻⁹ per transistor per hour. This failure rate may be higher or lower for
different types of circuits, hardware technologies, and operating environments, but the
same argument is applicable if we change λ. Given a constant per-transistor failure rate λ,

an n-transistor digital system will have all of its components still working after t hours
with probability R(t) = e^(–nλt). We will justify this exponential reliability equation in
Chapter 2; for now, let’s accept it on faith. For a fixed λ, as assumed here, R(t) is a
function of nt. Figure 1.1 shows the variation of R(t) with nt. While it is the case that not
every transistor failure translates into system failure, to be absolutely sure about correct
system operation, we have to proceed with this highly pessimistic assumption. Now, from
Fig. 1.1, we can draw some interesting conclusions.

• The on-board computer for a 10-year unmanned space mission to explore the
solar system should be built out of only O(10³) transistors if the O(10⁵)-hour
mission is to have a 90% success probability.

• The computerized flight control system on board an aircraft cannot contain more
than O(10⁴) transistors if it is to fail with a likelihood of less than 1 in 10,000
during a 10-hour intercontinental flight.

The need for special design methods becomes quite apparent if we note that modern
microprocessors and digital signal processor (DSP) chips contain tens or hundreds of
millions of transistors.
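These back-of-the-envelope limits are easy to check in a few lines of code. The sketch below is illustrative only; it assumes the text’s constant failure rate λ = 10⁻⁹ per transistor per hour and inverts R(t) = e^(–nλt) to obtain the largest transistor budget n that meets a given reliability target.

```python
import math

LAM = 1e-9  # assumed failure rate: 1e-9 per transistor per hour (from the text)

def reliability(n, t, lam=LAM):
    """R(t) = exp(-n*lam*t): probability that all n transistors survive t hours."""
    return math.exp(-n * lam * t)

def max_transistors(t, target_r, lam=LAM):
    """Largest n for which an n-transistor system meets reliability target_r at time t."""
    return -math.log(target_r) / (lam * t)

# 10-year (about 1e5-hour) space mission, 90% success probability: n is O(10^3).
print(f"space-probe budget: {max_transistors(1e5, 0.90):.0f} transistors")

# 10-hour flight, failure probability below 1 in 10,000: n is O(10^4).
print(f"flight-computer budget: {max_transistors(10, 1 - 1e-4):.0f} transistors")
```

Under the same pessimistic model, a 10⁸-transistor processor would survive a single 10-hour flight with probability only e^(–1) ≈ 0.37, which is exactly why special design methods are needed.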

b. The safety argument: In a safety-critical system, the “cost” of failure can be
quantified by the product of hazard probability and severity of its consequences. Based
on the reliability argument above, a 1M-transistor flight control computer fails with
probability 10⁻² during a 10-hour flight. If 0.1% of all computer failures cause the
airplane to crash, an airline that operates O(10³) planes, each of which has O(10²) such
flights per year, will experience O(1) computer-related crashes per year, on the average.
If a crash kills O(10²) people and the cost per lost life to the airline is O(10⁷) dollars, then
the airline can expect safety mishaps resulting from computer failures to cost it billions of
dollars per year. Clearly, this is just one view of safety; other entities, such as
transportation safety boards, passengers, and their families may quantify the
consequences of a plane crash differently. However, the preceding argument is enough to
show that safety requirements alone justify investing in special design and
implementation methods to postpone, or reduce the chances of, computer system failures.
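The expected-value arithmetic in the safety argument can be laid out explicitly. Every number below is one of the text’s order-of-magnitude assumptions, not actual airline data.

```python
# All quantities are the text's illustrative order-of-magnitude assumptions.
p_fail_per_flight = 1e-2      # 1M-transistor computer, 10-hour flight (reliability argument)
p_crash_given_fail = 1e-3     # 0.1% of computer failures cause a crash
flights_per_year = 1e3 * 1e2  # O(10^3) planes, each with O(10^2) such flights per year
deaths_per_crash = 1e2        # O(10^2) people on board
cost_per_life = 1e7           # O(10^7) dollars per lost life

crashes_per_year = flights_per_year * p_fail_per_flight * p_crash_given_fail
annual_cost = crashes_per_year * deaths_per_crash * cost_per_life

print(f"expected crashes/year: {crashes_per_year:.2f}")
print(f"expected cost/year: ${annual_cost:,.0f}")
```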


c. The availability argument: A central telephone switching facility should not be down
for more than a few minutes per year; more outages would be undesirable not only
because they lead to customer dissatisfaction, but also owing to the potential loss of
revenues. If the diagnosis and replacement of subsystems that are known to be
malfunctioning takes 20-30 minutes, say, a mean time between failures (MTBF) of
O(10⁵) hours is required. A system with reliability R(t) = e^(–nλt), as assumed in our earlier
reliability argument, has an MTBF of 1/(nλ). Again, we accept this without proof for
now. With λ = 10⁻⁹/transistor/hour, the telephone switching facility cannot contain more
than O(10⁴) transistors if the stated availability requirement is to be met; this is several
orders of magnitude below the complexity of modern telephone switching centers.
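The availability numbers can be sanity-checked the same way. The sketch below assumes a 25-minute mean repair time (the midpoint of the text’s 20-30 minutes) and the same per-transistor failure rate; both values are illustrative.

```python
LAM = 1e-9            # assumed per-transistor failure rate (per hour)
MTTR_HOURS = 25 / 60  # assumed mean time to repair: 25 minutes
MIN_PER_YEAR = 8760 * 60

def downtime_minutes_per_year(n):
    """Expected annual downtime of an n-transistor system with MTBF = 1/(n*LAM)."""
    mtbf = 1 / (n * LAM)
    unavailability = MTTR_HOURS / (mtbf + MTTR_HOURS)
    return unavailability * MIN_PER_YEAR

# O(10^4) transistors -> MTBF of 1e5 hours -> only a couple of minutes per year.
print(f"n = 1e4: {downtime_minutes_per_year(1e4):.1f} min/year")
# A 10^7-transistor switch, under the same naive model, would be down for days.
print(f"n = 1e7: {downtime_minutes_per_year(1e7):.0f} min/year")
```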

[Plot: e^(–nλt) versus nt, with marked values .9999, .9990, .9900, .9048, and .3679 at
nt = 10⁵, 10⁶, 10⁷, 10⁸, and 10⁹, respectively; the horizontal axis spans 10⁴ to 10¹⁰.]

Fig. 1.1 Reliability of an n-transistor system after t hours as a function of nt
for λ = 10⁻⁹/transistor/hour.


1.2 A Motivating Case Study

One of the important problems facing the designers of distributed computing systems is
ensuring data availability and integrity. Consider, for example, a system with five sites
(processors) that are interconnected pairwise via dedicated links, as depicted in Fig. 1.2.
The detailed specifications of the five sites, S0-S4, and 10 links, L0-L9, are not important
for this case study; what is important is that sites and links can malfunction, making the
data stored at one site inaccessible to one or more other sites. If data availability to all
sites is critical, then everything must be stored in redundant form.

[Diagram: sites S0-S4, with each pair of sites joined by one of the ten links L0-L9.]

Fig. 1.2 Five-site distributed computer system with a dedicated direct link
connecting each pair of sites.

Let the probability that a particular site (link) is available and functions properly during
any given attempt at data access be aS (aL) and assume that the sites and links
malfunction independently of each other.


Example 1.1: Home site and mirror site Quantify the improvement in data availability when
each data file Fi has a home or primary site H(Fi), with a copy stored at a mirror site M(Fi). This
doubles the aggregate storage requirements for the file system, but allows access to data despite
faulty sites and links.

Solution: To quantify the accessibility of a particular file, we note that a site can obtain a copy of
Fi if the home site H(Fi) and the link leading to it have not failed OR the home site is inaccessible,
but the mirror site M(Fi) can be accessed. Thus, the availability (measure of accessibility) A(Fi)
for a mirrored file Fi is:

A(Fi) = aSaL + (1 – aSaL)aSaL = 2aSaL – (aSaL)²

In deriving the preceding equation, we have assumed that the file must be accessed directly from
one of the two sites that holds a copy. Analyzing the problem when indirect accesses via other
sites are also allowed is left as an exercise. As a numerical example, for aS = 0.99 and aL = 0.95,
we have:

A(Fi) = 0.99 0.95  (2 – 0.99 0.95) ≈ 0.9965

In other words, data unavailability has been reduced from 5.95% in the nonredundant case to
0.35% in the mirrored case.
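Example 1.1’s closed form is easy to verify numerically. In the sketch below, p = aS·aL is the probability that one copy is directly reachable; the values are the example’s.

```python
a_S, a_L = 0.99, 0.95  # site and link availabilities from Example 1.1
p = a_S * a_L          # one copy reachable: its site AND its link are both up

A_mirrored = p + (1 - p) * p                     # home copy works, or else mirror does
assert abs(A_mirrored - (2*p - p**2)) < 1e-12    # agrees with the closed form in the text

print(f"nonredundant: {p:.4f}  mirrored: {A_mirrored:.4f}")
```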

Example 1.2: File triplication Suppose that the no-access probability of Example 1.1 is still too
high. Evaluate, and quantify the availability impact of, a scheme where three copies of each data
file are kept.

Solution: In this case, the availability of a file Fi becomes:

A(Fi) = aSaL + (1 – aSaL)aSaL + (1 – aSaL)²aSaL = 3aSaL – 3(aSaL)² + (aSaL)³

Now, with aS = 0.99 and aL = 0.95, we have:

A(Fi) = 0.99 0.95  [3 – 30.990.95 + (0.990.95)2] ≈ 0.9998

Data unavailability is thus reduced to 0.02%. This improvement in data availability is achieved at
the cost of increasing the aggregate storage requirements by a factor of 3, compared with the
nonredundant case.
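Triplication is the r = 3 case of general replication: a file kept at r independent sites is unavailable only if all r copies fail, so A = 1 – (1 – p)^r. The sketch below (illustrative) checks that this compact form agrees with the expanded polynomial of Example 1.2.

```python
a_S, a_L = 0.99, 0.95
p = a_S * a_L  # one copy reachable with probability p

def replicated_availability(r):
    """Availability of a file kept at r independent sites, each reachable w.p. p."""
    return 1 - (1 - p) ** r

A3 = replicated_availability(3)
assert abs(A3 - (3*p - 3*p**2 + p**3)) < 1e-12  # expanded form from the text

print(f"r = 2: {replicated_availability(2):.4f}  r = 3: {A3:.4f}")
```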


Example 1.3: File dispersion Let us now devise a more elaborate scheme that requires lower
redundancy. Each file Fi is viewed as a bit string of length l. It is possible to encode such a file
into approximately 5l/3 bits (a redundancy of 67%, as opposed to 100% and 200% in Examples
1.1 and 1.2, respectively) so that if the resulting bit string of length 5l/3 is divided equally into five
pieces, any three of these (l/3)-bit pieces can be used to reconstruct the original l-bit file. Let us
accept for now that the preceding data dispersion scheme is actually workable and assume that for
each file, we store one of the five pieces thus obtained in a different site in Fig. 1.2. Now, any site
needing access to a particular data file already has one of the three required pieces and can
reconstruct the file if it obtains two other pieces from the remaining four sites. Quantify the data
availability in this case.

Solution: In this case, a file Fi is available if two or more of four sites are accessible:

A(Fi) = (aSaL)⁴ + 4(1 – aSaL)(aSaL)³ + 6(1 – aSaL)²(aSaL)² = 6(aSaL)² – 8(aSaL)³ + 3(aSaL)⁴

Again, we have assumed that a file fragment must be accessed directly from the site that holds it.
With aS = 0.99 and aL = 0.95, we have:

A(Fi) = (0.990.95)2  [6 – 80.990.95 + 3(0.990.95)2] ≈ 0.9992

Data unavailability is thus 0.08% in this case. This result is much better than that of Example 1.1
and is achieved at a lower redundancy as well. It also performs only slightly worse than the
triplication method of Example 1.2, which has considerably greater storage overhead.
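Since the requesting site already holds one of the five pieces, Example 1.3 reduces to a binomial 2-out-of-4 computation on the remote pieces. The sketch below recomputes it with a general k-out-of-n helper and checks the closed form.

```python
from math import comb

a_S, a_L = 0.99, 0.95
p = a_S * a_L  # one remote piece reachable: its site and link both up

def k_of_n(k, n, p):
    """Probability that at least k of n independent pieces are reachable."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

A_dispersed = k_of_n(2, 4, p)                                 # need 2 of 4 remote pieces
assert abs(A_dispersed - (6*p**2 - 8*p**3 + 3*p**4)) < 1e-12  # closed form from the text

print(f"dispersed: {A_dispersed:.4f}")
```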

In Examples 1.1-1.3, we ignored several important considerations such as how redundant
data are kept consistent, how malfunctioning sites/links are identified, how recovery is
accomplished when a malfunctioning site that has been repaired is to return to service,
and, finally, how data corrupted by the actions of an adversary (rather than by a site or
link malfunction) might be detected. However, the examples do point to the diversity of
methods for coping with dependability problems and to the sophistication required for
devising efficient or cost-effective schemes.


1.3 Impairments to Dependability

Impairments to dependability are variously described as hazards, flaws, defects, faults,
errors, malfunctions, failures, and crashes. There are no universally agreed-upon
definitions for these terms, causing different and sometimes conflicting usage. Attempts
at presenting precise definitions for the terms and standardizing their use have not been
very successful. In the following paragraphs, we review two key proposals on how to
view and describe impairments to dependability.

Members of the Newcastle Reliability Project, led by Professor Brian Randell, have
advocated a hierarchic view [Ande82]: A (computer) system is a set of components
(themselves systems) which interact according to a design (another system). The
recursion stops when we arrive at atomic systems whose internal structures are of no
interest at the particular level of detail with which we are concerned. System failure is
defined as deviation of its behavior from that predicted (required) by the system’s
authoritative specification. Such a behavioral deviation results from an erroneous system
state. An error is a part of an erroneous state which constitutes a difference from a valid
state. The cause of the invalid state transition that first establishes an erroneous state is
a fault in one of the system’s components or in the system’s design. Similarly, the
component’s or design’s failure can be attributed to an erroneous state within the
corresponding (sub)system resulting from a component or design fault, and so on.
Therefore, at each level of the hierarchy, “the manifestation of a fault will produce errors
in the state of the system, which could lead to a failure” (Fig. 1.3).

Aspect       Impairment
Structure    Fault
                ↓
State        Error
                ↓
Behavior     Failure

Fig. 1.3 Schematic diagram of the Newcastle hierarchical model and
impairments within one level (the fault-error-failure cycle).


With the preceding model, failure and fault are simply different views of the same
phenomenon. This is quite elegant and enlightening but introduces problems by the need
for continual establishment of frames of reference when discussing the causes (faults) and
effects (failures) of deviant system behavior at various levels of abstraction. While it is
true that a computer system may be viewed at many different levels of abstraction, it is
also true that some of these levels have proved more useful in practice. Avižienis
[Aviz82] takes four of these levels and proposes the use of distinct terminology for
impairments to dependability (“undesired events,” in his words) at each of these levels.
His proposal can be summarized in the cause-effect diagram of Fig. 1.4.

Universe        Impairment
Physical        Failure
                   ↓
Logical         Fault
                   ↓
Informational   Error
                   ↓
External        Crash

Fig. 1.4 Cause-effect diagram for Avižienis’ four-universe model of
impairments to dependability.

There are a few problems with the preceding choices of names for undesired events. The
term “failure” has traditionally been used both at the lowest and the highest levels of
abstraction; viz., failure rate, failure mode, and failure mechanism used by electrical
engineers and device physicists alongside system failure, fail-soft operation, and fail-safe
system coming from computer architects. To comply with the philosophy of distinct
naming for different levels, Avižienis retains “failure” at the physical level and uses
“crash” at the other end. However, this latter term is unsuitable. Inaccuracies or delays,
beyond what is expected according to system specifications, can hardly be considered
“crashes” in the ordinary sense of the term.

Furthermore, the term “crash” puts the emphasis on system operation rather than task
accomplishment and is thus unsuitable for fail-soft computer systems (such systems are

like airplanes in that they fail much more often than they crash!). We are thus led to a
definition of failure for computer systems that parallels that used by forensic experts in
structural engineering [Carp96]: Failure is an unacceptable difference between expected
and observed performance.

Another problem is that there are actually three external views of a computer system. The
maintainer’s external view consists of a set of interacting subsystems that must be
monitored for detecting possible malfunctions in order to reconfigure the system or,
alternatively, to guard against dangerous consequences (such as total system crash). The
operator’s external view, which is more abstract than the maintainer’s system-level view,
consists of a black box capable of providing certain computational services. Finally, the
end user’s external view is shaped by the system’s reaction to particular situations or
requests.

Abstraction     Impairment
Component       Defect        ┐
                   ↓          ├ Low-level
Logic           Fault         ┘
                   ↓
Information     Error         ┐
                   ↓          ├ Mid-level
System          Malfunction   ┘
                   ↓
Service         Degradation   ┐
                   ↓          ├ High-level
Result          Failure       ┘

(In the original figure, brackets on the left edge mark the first and second of the two
unrolled fault-error-failure cycles; brackets on the right group the six levels into low,
mid, and high.)

Fig. 1.5 Cause-effect diagram for an extended six-level view of
impairments to dependability.

Figure 1.5 depicts this extended model. There are two ways in which our extended
six-level impairment model can be viewed. The first view, shown on the left edge of Fig. 1.5,
is to consider it an unrolled version of the model in Fig. 1.3. This unrolling allows us to
talk about two cycles simultaneously without a danger of being verbose or ambiguous. A

natural question, then, is why stop at one unrolling. The reason is quite pragmatic. We are
looking at dependability from the eyes of the system architect who deals primarily with
module interactions and performance parameters, but also needs to be mindful of the
circuit level (perhaps even down to switches and transistors) in order to optimize/balance
the system from complex angles involving speed, size, power consumption,
upward/downward scalability, and so on.

The second view, which is the one that we will use in the rest of this book, is that shown
in the right half of Fig. 1.5:

• Low-level impairments, affecting the devices/components or circuit functions, are
within the realm of hardware engineers and logic designers.

• Mid-level impairments span the circuit-system interface, which includes the
register-transfer level and the processor-switch-memory level.

• High-level impairments, affecting the system as a whole, are of interest not only
to system architects but also to system integrators.

Taking into account the fact that a nonatomic component is itself a system, usage of the
term “failure” in failure rate, failure mode, and failure mechanism could be justified by
noting that a component is its designer’s end product (system). Therefore, we can be
consistent by associating the term “failure” with the highest and the term “defect” with
the lowest level of abstraction. The component designer’s failed system is the system
architect’s defective component. However, to maintain a consistent point of view
throughout the book, we will use the somewhat unfamiliar component-level terms defect
rate, defect mode, and defect mechanism from now on.

One final point: Computer systems are composed of hardware and software elements.
Thus, the reader may be puzzled by a lack of specific mention of impairments to software
dependability in our discussions thus far. The reason for this is quite practical. Whereas
one can meaningfully talk about defects, faults, errors, malfunctions, degradations, and
failures for software, it is the author’s experience that formulating statements, or
describing design methods, that apply to both hardware and software, requires some
stretching that makes the results somewhat convoluted and obscure. Perhaps, this would
not be the case if the author possessed greater expertise in software dependability.


Nevertheless, for the sake of completeness, we will discuss a number of algorithm and
software design topics in the later parts of the book, with the hope that some day the
discussion of dependable hardware and software can be truly integrated.

Anecdote: The development of the six-level model of Fig. 1.5 began in 1986, when the
author was a Visiting Professor at the University of Waterloo in Canada. Having six
levels is somewhat unsatisfactory, since successful models tend to have seven levels.
However, despite great effort expended in those days, the author was unable to add a
seventh type of impairment to the model. Undeterred by this setback, the author devised
the seven states of a system shown in Fig. 1.6.


1.4 A Multilevel Model

The field of dependable computing deals with the procurement, forecasting, and
validation of computer system dependability. According to our discussions in Section 1.3,
impairments to dependability can be viewed from six abstraction levels. Thus, subfields
of dependable computing can be thought of as dealing with some aspects of one or more
of these levels. Specifically, we take the view that a computer system can be in one of
seven states: Ideal, defective, faulty, erroneous, malfunctioning, degraded, or failed, as
depicted in Fig. 1.6. Note that these states have nothing to do with whether or not the
system is “running.” A system may be running even in the failed state; the fact that it is
failed simply means that it isn’t delivering what is expected of it.

Upon the completion of its design and implementation, a system may end up in any of the
seven states, depending on the appropriateness and thoroughness of validation efforts.
Once in the initial state, the system moves from one state to another as a result of
deviations and remedies. Deviations are events that take the system to a lower (less
desirable) state, while remedies are techniques or measures that enable a system to make
the transition to a higher state.
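For concreteness, the seven states and their downward/upward transitions can be sketched as a toy state machine. The single-step helpers below are purely illustrative; in the actual model, deviations and remedies need not move exactly one level at a time, and lateral entry points and self-loops (tolerance) exist as well.

```python
from enum import IntEnum

class State(IntEnum):
    """The seven system states of Fig. 1.6, ordered from most to least desirable."""
    IDEAL = 0
    DEFECTIVE = 1
    FAULTY = 2
    ERRONEOUS = 3
    MALFUNCTIONING = 4
    DEGRADED = 5
    FAILED = 6

def deviation(s: State) -> State:
    """A deviation takes the system to a lower (less desirable) state."""
    return State(min(s + 1, State.FAILED))

def remedy(s: State) -> State:
    """A remedy moves the system back toward a higher state."""
    return State(max(s - 1, State.IDEAL))

s = State.DEFECTIVE
s = deviation(s)  # the defect develops into a fault
s = deviation(s)  # the fault contaminates the system state: an error
s = remedy(s)     # error recovery returns the system to the faulty state
print(s.name)     # FAULTY
```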

The observability of the system state (ease of external recognition that the system is in a
particular state) increases as we move downward in Fig 1.6. For example, the inference
that a system is “ideal” can only be made through formal proof techniques; a proposition
that is currently impossible for complex computer systems. At the other extreme, a failed
system can usually be recognized with little or no effort. As examples of intermediate
states, the “faulty” state is recognizable by extensive off-line testing, while the
“malfunctioning” state is observable by on-line monitoring with moderate effort. It is,
therefore, common practice to force a system into a lower state (e.g., from “defective” to
“faulty,” under torture testing) in order to deduce its initial state.


Ideal
  ↕
Defective
  ↕
Faulty
  ↕
Erroneous
  ↕
Malfunctioning
  ↕
Degraded
  ↕
Failed

Fig. 1.6 System states and state transitions in the multilevel model
of dependable computing. Horizontal arrows on the left
denote entry points. Downward (upward) transitions
represent deviations (remedies). Self-loops model tolerance.


One can associate five attributes with each of the transitions in Fig. 1.6. These attributes
are:

• (Natural) Cause of the transition
• (Natural) Hindrance factors
• Facilitation schemes
• Avoidance techniques
• Modeling methods and tools

For defect-induced failures, the sequence of transitions from defect to failure may be
quite slow, owing to large interlevel hindrances or latencies, or so quick as to defy
detection. Natural interlevel latencies can be increased through tolerance provisions or
reduced for making the deviations more readily observable (because deviations near the
bottom of Fig. 1.6 are more easily detected). The former methods are referred to as
defect tolerance, fault tolerance, error tolerance, malfunction tolerance, degradation
tolerance, and failure tolerance, while the latter are useful for defect testing, fault testing,
error testing, and so on.

We will discuss deviations and remedies in considerable detail in the rest of this book.
Here, we present just a few samples of how subfields of dependable computing can be
characterized according to their relevance to one or more of these transitions and their
attributes.

Fault injection: cause[faulty]


Fault testing: facilitation[faultyerroneous]
Defect tolerance: avoidance[defectivefaulty]
Error modeling: modeling[erroneous] + modeling[faultyerroneous]

Thus, a byproduct of the preceding characterization might be an accurate indexing
scheme for reference sources in the field of dependable computing; viz., a technical paper
may be indexed by listing attributes or relevant transitions as above.

Possible causes for the sideways transitions in Fig. 1.6 include specification slips and
implementation mistakes (including the use of wrong or unfit building blocks). A
deviation may be caused by wearout, overloading, or external disturbance and is
additionally characterized by its duration as being permanent, intermittent, or transient.


Note, however, that even transient deviations can have permanent consequences; for
example, a fault-induced error may persist after the fault itself has vanished.

We can also classify deviations by their extent (local and global or catastrophic) and
consistency (determinate and indeterminate or Byzantine). A local deviation can quickly
develop into a catastrophic one if not promptly detected by monitors and isolated by
means of firewalls. Transient and indeterminate deviations are notoriously difficult to
handle. To see why, imagine feeling ill, but all the symptoms of your illness disappearing
each time you visit a doctor. Worse yet, imagine the symptoms changing as you go from
one doctor to another for obtaining a second opinion.


1.5 Examples and Analogies

Let us use some familiar everyday phenomena to accentuate the states and state
transitions shown in Fig. 1.6 [Parh97]. These examples also show the relevance of such a
multilevel model in a wider context. Example 1.5 is similar to Example 1.4, but it
illustrates the lateral transitions of Fig. 1.6 and multilevel tolerance methods better.
Example 1.6 incorporates both tolerance and avoidance techniques.

Example 1.4: Automobile brake system Relate an automobile brake system to the multilevel
model of Fig. 1.6.

Solution: An automobile brake system with a weak joint in the brake fluid piping (e.g., caused by
a design flaw or a road hazard) is defective. If the weak joint breaks down, the brake system
becomes faulty. A careful (off-line) inspection of the automobile can reveal the fault. However,
the driver does not automatically notice the fault (on-line) while driving. The brake system state
becomes erroneous when the brake fluid level drops dangerously low. Again, the error is not
automatically noticed by the driver, unless a working brake fluid indicator light is present. A
malfunctioning brake system results from the improper state of its hydraulics when the brake pedal
is applied. With no brake fluid indicator light, the driver’s first realization that something is wrong
comes from noticing the degraded performance of the brake system (higher force needed or lower
deceleration). If this degraded performance is insufficient for slowing down or stopping the
vehicle when needed, the brake system has failed to act properly or deliver the expected result.

Example 1.5: Automobile tire Relate the functioning of an automobile tire to the multilevel
model of Fig. 1.6.

Solution: Take an automobile with one tire having a weak spot on its road surface. The defect
may be a result of corrosion or due to improper manufacture and inspection. Use of multiple layers
or steel reinforcement constitutes a possible defect tolerance technique. A hole in the tire is a fault.
It may result from the defect or be caused directly by a nail. Low tire pressure due to the hole, or
directly as a result of improper initialization, is viewed as an error. Automatic steering
compensation leads to error tolerance (at least for a while). A tire that is unfit for use, either due to
its pressure dropping below a threshold or because it was unfit to begin with (e.g., too small),
leads to a malfunction. A vehicle with multiple axles or twin tires can tolerate some tire
malfunctions. In the absence of tolerance provisions, one can still drive an automobile having a
flat or otherwise unfit tire, but the performance (speed, comfort, safety, etc.) will be seriously
degraded. Even a vehicle with three or more axles suffers performance degradation in terms of
load capacity when a tire malfunctions. Finally, as a result of the preceding sequence of events, or
because someone forgot to install a vital subsystem, the entire automobile system may fail.


Example 1.6: Organizational decision-making Relate the decision-making processes in a
small organization (e.g., those related to staff promotions) to the multilevel model of Fig. 1.6.

Solution: Defects in the organization’s staff promotions policies may cause improper promotions,
viewed as faults. The ensuing ineptitudes and dissatisfactions are errors in the organization’s state.
The organization’s personnel or departments probably start to malfunction as a result of the errors,
in turn causing an overall degradation of performance. The end result may be the organization’s
failure to achieve its goals. Many parallels exist between organizational procedures and
dependable computing terms such as defect removal (external reviews), fault testing (staff
evaluations), fault tolerance (friendly relations, teamwork), error correction (openness, alternate
rewards), and self-repair (mediation, on-the-job training).

Example 1.7: Leak and drainage analogies Discuss the similarities of avoidance and tolerance
methods in the model of Fig. 1.6 to those of stopping leaks in a water system and using drainage to
prevent leaks from causing extensive damage.

Solution: Figure 1.7 depicts a system of six concentric water reservoirs. Pouring water from
above corresponds to defects, faults, and other impairments, depending on the layer(s) being
affected. These impairments can be avoided by controlling the flow of water through valves or
tolerated through the provision of drains of acceptable capacities for the reservoirs. The system
fails if water ever gets to the outermost reservoir. This may happen, for example, by a broken
valve at some layer combined with inadequate drainage at the same and all outer layers. Wall
heights between adjacent reservoirs correspond to natural interlevel latencies in the multilevel
model of Fig. 1.6. Water overflowing from the outermost reservoir into the surrounding area is the
analog of a computer failure adversely affecting the larger physical, corporate, or societal system.


[Diagram labels: closing inlet valves represents avoidance techniques; wall heights
represent interlevel latencies; the concentric reservoirs are analogs of the six nonideal
model levels, with defect being innermost; opening drain valves represents tolerance
techniques.]

Fig. 1.7 An analogy for our multilevel model of dependable computing.
Defects, faults, errors, malfunctions, degradations, and failures are
represented by pouring water from above. Valves represent avoidance
and tolerance techniques. The goal is to avoid overflow.


1.6 Dependable Computer Systems

We will describe a number of dependable computer systems after learning about the
methods used in their design and implementation. However, it is desirable to have a brief
preview of the types of systems that can benefit from these methods, by way of motivation.
As mentioned in Section 1.2, three main categories of dependable computer systems can
be distinguished. These are reviewed in the following paragraphs. Most modern general-
purpose computers also include dependability enhancement features, but these are often
toned-down versions of the methods used in the following system classes to make them
implementable within more stringent cost constraints. For examples of dependability
enhancement methods found in general-purpose computers, see [Siew92], pp. 427-523.

Long-life systems: Long life computer systems are needed in application domains where
maintenance is impossible or costly. In the first category, we have computers on board
spacecraft, particularly those for deep space planetary probes. In a multiyear space probe,
for example, it is imperative that the computer still be fully functional near the end of the
mission, when the greatest payoffs in terms of discoveries and data collection are
expected. The second category includes machines installed in remote, treacherous, or
hostile environments that cannot be reached easily or safely. Requirements in the design
of such systems are similar, except for cost constraints, which are likely to be less
stringent for space applications. Take the case of the Galileo Orbiter, which, with help
from a separate Probe module, collected data on Jupiter. The Galileo architecture was
based on the earlier Voyager system that explored the outer planets of the solar system,
with a main difference that the Galileo design used microprocessors extensively (a total
of 27 in the computing and instrumentation systems). Module replication, error coding,
activity monitoring, and shielding were some of the many diverse methods used to ensure
a long life.

Safety-critical systems: The failure of flight control computers on commercial aircraft,


safety monitors in nuclear power plants, and process control systems in chemical plants
can have dire consequences. Early designs for safety-critical applications included
Carnegie-Mellon University’s C.vmp (voted multiprocessor) and Stanford University’s
SIFT (software-implemented fault tolerance) systems. Both designs were based on the
notion of voting on the results of multiple independent computations. Such an approach
appears inevitable in view of the need for quick recovery from malfunctions in a real-
time system with hard deadlines. The C.vmp system used three microprocessors


operating in lock step, with a hardware voter connected to the system bus performing bit-
by-bit voting to prevent the propagation of erroneous information to the memory. In
SIFT, the off-the-shelf computing elements were connected via custom-designed buses
and bus interfaces. Each computing element had an executive, along with software voting
procedures that could mask out an erroneous result received from a malfunctioning unit.
A task requiring data from another task got them from units that executed the task
(typically by 2-out-of-3 voting), with the actual multiplicity of a task being a function of
its criticality. Subsequently, August Systems chose the SIFT philosophy for its
dependable computers aimed at the industrial control market.

High-availability systems: Research and development in high-availability computers


was pioneered by AT&T in the early 1960s in order to meet the stringent availability
goals for its electronic switching systems (ESS). In the early versions of the system, two
processors ran in tight synchronism, with multiple hardware comparators matching the
contents of various registers and buffers. A mismatch caused an interrupt to occur and
diagnostics to be run for identifying the malfunctioning unit and eventually the faulty
circuit pack. ESS processors have gone through many design modifications over the
years. Some of these were necessitated by the changing technology while others resulted
from field experience with previous designs. A second major application area for high-
availability computers is in on-line transaction processing systems of the types found in
banking and e-commerce. Although several sophisticated users of this type had custom-
designed highly dependable systems as far back as the mid 1960s, commercial marketing
of ready-made systems for such applications was started by Tandem Computers in the
late 1970s. Tandem’s early machines, named “NonStop”, and their successors were
designed with the aim of preventing any single hardware or software malfunction from
disabling the system. This objective was achieved by hardware and informational
redundancy as well as procedural safeguards, backup processes, consistency checks, and
recovery schemes in a distributed-memory multiprocessor, with redundant buses
connecting its various subsystems.


Problems

1.1 High-performability systems


To the classes of high-reliability, high-safety, and high-availability systems, characterized in Section 1.1,
one might add the class of high-performability systems. Suggest terms similar to those listed for the
preceding classes (e.g., nonstop = fail-soft = robust = high-availability) to characterize this new class.
Justify your choices.

1.2 The many facets of dependability


With reference to the three classes of fail-slow, fail-safe, and fail-soft systems, briefly characterized in
Section 1.1, discuss what each of the following terms might mean and how the resulting classes of
dependable systems are related to the above:
a. Fail-stop
b. Fail-hard
c. Fail-fast
d. Fool-proof
e. Tamper-proof
f. Resilient
g. Secure

1.3 The need for dependability


Suppose that the system failure rate does not grow linearly with the number n of transistors but is rather a
sublinear function of n. Pick a suitable function for the failure rate of an n-transistor system, draw the
corresponding reliability curve on Fig. 1.1, and discuss the resulting changes in the three arguments
presented at the end of Section 1.1.

1.4 Data availability in distributed systems


With reference to our discussion of data availability in Section 1.2:
a. Write an expression for the accessibility measure A(Fi) of Example 1.1 as a function of u = 1 –
aSaL and provide intuitive justification for the result.
b. Repeat part a for Example 1.2.

1.5 Data availability in distributed systems


The keen reader may have observed that in Examples 1.1-1.3 of Section 1.2, links are much less
reliable than sites. Suppose that we connect each pair of sites by two different links, each one allowing
access independently with probability aL = 0.95. Redo Examples 1.1-1.3, calculating the numerical
accessibility measure in each case. Compare the results with those of the original examples and discuss.

1.6 Data availability in a ring network


Redo Examples 1.1, 1.2, and 1.3, assuming that the 5 sites are interconnected as a bidirectional ring
network, rather than a complete network. Note that with ring connectivity, it no longer makes sense to
assume that access to the data is allowed only via a direct link. When data is accessed indirectly, all nodes
and links on the indirect path must be functional for access to be successful. In your analysis, consider the
worst case regarding where data copies are located relative to the user site.


1.7 Data availability in distributed systems


With reference to our discussion of data availability in Section 1.2:
a. Consider Examples 1.1-1.3 and assume that there are six, rather than five, sites in the distributed
system. Thus, in Example 1.3, an l-bit data file is encoded into six (l/3)-bit pieces, any three of
which can be used to reconstruct the original file. The data redundancy of 100% for this 3-out-of-6
scheme is identical to that of mirroring in Example 1.1. Redo Examples 1.1-1.3 and compare the
resulting data access probabilities.
b. In general, with 2k sites and encoding of files such that any k of their 2k pieces, stored one per site,
can be used for reconstruction, a data redundancy identical to mirroring, but with greater
accessibility, is obtained. Plot the data access probabilities for mirroring and the preceding k-out-
of-2k scheme in a 2k-site distributed system, for 1 ≤ k ≤ 8, and discuss the results.

1.8 Data availability in distributed systems


With reference to our discussion of data availability in Section 1.2:
a. Quantify the probability of being able to access a data file Fi in Example 1.1 if, whenever direct
access is impossible due to link malfunctions, the requesting site can obtain a copy of the desired
file indirectly via the other two sites, if they have access to the required information.
b. Repeat part a for Example 1.2.
c. Repeat part a for Example 1.3.

1.9 Impairments to dependability


With reference to the two-cycle interpretation shown on the left edge of Fig. 1.5, consider two other
unrollings, one beyond (preceding) the component level and another beyond (following) the result level.
Briefly state how one might characterize the abstraction levels and impairments for the resulting cycles.

1.10 Multilevel model of dependable computing


Relate the following notions to the transitions in Fig. 1.6 and their attributes in a manner similar to the
examples provided near the end of Section 1.4.
a. Preventive maintenance
b. Design for testability
c. Fault modeling
d. Error-detecting codes
e. Error-correcting codes
f. Self-repairing systems
g. Yield improvement for integrated circuits
h. Checkpointing

1.11 Multilevel model of dependable computing


a. Consider the floating-point division flaw in early Pentium processors. Place this flaw and its
consequences and remedies in the context of the model in Fig. 1.6.
b. Repeat part a for the Millennium bug, aka the Y2K (year-2000) problem, which would have
caused some programs using dates with 2-digit year fields to fail when the year turned from 1999
to 2000 if the problem were not fixed in time.
c. Repeat part a for a third widespread hardware or software flaw that you identify.


1.12 Multilevel model of dependable computing


Relate the following situations to the analogy in Fig. 1.7.
a. A parallel computer system is designed to tolerate up to two malfunctioning processors. When a
malfunction is diagnosed, the integrated-circuit card that holds the malfunctioning processor is
isolated, removed, and replaced with a good one, all in about 15 minutes.
b. When the Pentium division flaw was uncovered, a software package was modified through
“patches” to avoid operand values for which the flaw would lead to errors. Upon replacing the
Pentium chip with a redesigned chip (without the flaw), the patched software was not modified.

1.13 Dependable computer systems


Demonstrate the orthogonality of the three attributes of long life, high availability, and safety by discussing
how:
a. Long-life systems are not necessarily highly available or safe.
b. Highly available systems could be short-lived and/or unsafe.
c. Safety-critical systems may be neither long-life nor nonstop.

1.14 The human immune system


Avizienis [Aviz04] has suggested the human immune system as a suitable model for developing a
hardware-based infrastructure for fault tolerance. Discuss his proposal in a couple of paragraphs and relate
it to the multilevel model of Fig. 1.6.

1.15 The risks of poorly designed user interfaces


Olsen [Olse08] has described an incident in which an on-line banking customer lost $100,000 due to a
simple keying error, which (the customer alleges) should have been caught by the system’s user interface.
Study the article and write a one-page report about the nature of the error and why poorly designed user
interfaces constitute major risk factors in computer-based systems.

1.16 The Excel 2007 design flaw


According to news stories published in the last week of September 2007, the then newest version of
Microsoft Excel contained a flaw that led to incorrect values in rare cases. For example, when multiplying
77.1 by 850, 10.2 by 6,425 or 20.4 by 3,212.5, the number 100,000 was displayed instead of the correct
result 65,535. Similar errors were observed for calculations that produced results close to 65,536. Study this
problem using Internet sources and discuss in one typed page (single spacing is okay) the nature of the
flaw, why it went undetected, exactly what caused the errors, and how Microsoft dealt with the problem.

1.17 The USS Yorktown incident


USS Yorktown, one of the first “smart ships” deployed by the US Navy, was the subject of this 1998 news
headline: “Software Glitches Leave Navy Smart Ship Dead in the Water.” Write a 2-page investigative
report about the incident, what caused it, and actions taken to prevent similar failures.


1.18 Overheating batteries in airliners


In mid-January 2013, the world’s entire fleet of Boeing-787 (“Dreamliner”) aircraft was grounded while
the lithium-ion battery systems, believed to have caused multiple incidents of fire on board the aircraft,
were investigated. In late January, it was determined that overcharging of the batteries led to the
overheating and resulting fires. Boeing insisted that such battery overcharges were highly unlikely, given
that multiple systems are in place to prevent them from happening. Investigate the problem and write a 2-
page report of the incidents and the associated corrective actions, focusing on the multiple prevention
mechanisms claimed by Boeing and why they did not stop the overheating.

1.19 Reliability and robustness vs. efficiency


Read the viewpoint [Ackl13] explaining why undue emphasis on efficiency undermines reliability and
robustness/resilience. Moshe Vardi also discusses the problem in a lecture (starting at the 7:45 mark of the
following YouTube video), entitled “Lessons from COVID-19: Efficiency vs. Resilience”:
https://www.youtube.com/watch?v=SjGXbIosMQA
a. Write a 200-word abstract for [Ackl13], summarizing its main points and argument.
b. Repeat part a for Moshe Vardi’s lecture.

1.20 Sharing a secret


Eleven scientists collaborate on a project and they want to store project documents in a safe that can be
opened only when at least 6 of the 11 are present. Show that if this is to be done using ordinary locks and
keys, the minimal solution requires that the safe be equipped with 462 locks and each scientist provided
with 252 keys. [Hint: The numbers 462 and 252 equal C(11, 5) and C(10, 5), respectively, where C(n, m) is the
number of ways you can choose m items from among n.]

1.21 Simple, avoidable design errors


Computer maker Lenovo recalled more than 0.5 million laptop power cords in late 2014, citing their
tendency to overheat, melt, burn, and spark. It is difficult to imagine how such a design problem can exist
in one of the simplest items used in connection with a computer, a device that has been designed and built by
numerous vendors for decades. Investigate the problem using on-line sources and present a single-page
typeset report (single-spacing is okay) about the problem and its causes.

1.22 Building trust from untrustworthy components


The ultimate goal of designers of dependable systems is to ascertain trustworthiness of system results,
despite using components that are not totally trustworthy. Read the paper [Seth15] and present your
assessment of the practicality of this goal and some of the methods proposed in a single-spaced typed page.

1.23 Risks of automation


Read the paper [Neum16] by Peter G. Neumann, moderator of the “Inside Risks” forum, write a half-page
abstract for it, and answer the following questions:
a. What is total-system safety? (Provide a definition.)
b. Which area needs more work in the coming years: aviation safety or automotive safety?
c. What is the most important risk associated with the Internet of Things (IoT)?
d. Why are the requirements for security and law enforcement at odds?

1.24 Risks of infrastructure deterioration


In September 2018, gas explosions rocked a vast area in northeastern Massachusetts, leading to the loss of
one life, many injuries, and destruction of property. Investigations revealed that just before the explosions,
pipe pressure was 12 times higher than the safe limit. Using Internet sources, write a one-page report on
this incident, focusing on how/why pressure monitors, automatic shut-off mechanisms, and human
oversight failed to prevent the disaster.

1.25 The troubles of Boeing 737 Max 8


Following two crashes of Boeing 737 Max 8 passenger jets in late 2018 and early 2019, killing hundreds,
airlines and various aviation authorities around the world grounded the planes until crash causes could be
determined and attendant design flaws corrected. When Boeing 737 Max 8 was tested following its
introduction, certain stability problems were detected, but rather than redesigning the plane, Boeing chose
to augment them with a software system to compensate for the problems in flight. Both crashes were
attributed to flaws in the aforementioned software and lack of certain monitoring instruments that could
have warned the pilots of emerging challenges. As of late 2019, the planes remained grounded, because
Boeing’s purported fixes have not satisfied aviation authorities. Using on-line sources, study the causes of
the two crashes, reasons for grounding the planes, and results of investigations conducted after the
grounding, presenting your results in a 2-page report.

1.26 Risks of trusting the physics of sensors


Read the paper [Fu18] and answer the following questions:
a. Why do the authors think that there are risks involved in the physics of sensors?
b. How can a voice-activated system be tricked into exhibiting malicious behavior?
c. Which other application areas involving sensors are particularly vulnerable?
d. In what ways do current educational systems hinder the design of secure embedded systems?


References and Further Readings


[Ackl13] Ackley, David H., “Beyond Efficiency,” Communications of the ACM, Vol. 56, No. 10, pp. 38-
40, October 2013.
[Ande82] Anderson, T. and P. A. Lee, “Fault Tolerance Terminology Proposals,” Proc. Int’l Symp.
Fault-Tolerant Computing, 1982, pp. 29-33.
[Aviz82] Avižienis, A., “The Four-Universe Information System Model for the Study of Fault
Tolerance,” Proc. Int’l Symp. Fault-Tolerant Computing, 1982, pp. 6-13.
[Aviz04] Avižienis, A., J.-C. Laprie, B. Randell, and C. Landwehr, “Basic Concepts and Taxonomy of
Dependable and Secure Computing,” IEEE Trans. Dependable and Secure Computing, Vol. 1,
No. 1, pp. 11-33, January-March 2004.
[Aviz04a] Avižienis, A., “Dependable Systems of the Future: What Is Still Needed?” Proc. 18th IFIP
World Computer Congress, Toulouse, August 2004.
[Carp96] Carper, K. L., “Construction Pathology in the United States,” Structural Engineering
International, Vol. 6, No. 1, pp. 57-60, February 1996.
[Cart82] Carter, W. C., “A Time for Reflection,” Proc. Int’l Symp. Fault-Tolerant Computing, 1982, p.
41.
[Fu18] Fu, K. and W. Xu, “Risks of Trusting the Physics of Sensors” (“Inside Risks” Column),
Communications of the ACM, Vol. 61, No. 2, pp. 20-23, February 2018.
[Hosf60] Hosford, J. E., “Measures of Dependability,” Operations Research, Vol. 8, No. 1, pp. 53-64,
January/February 1960.
[Jain11] Jain, M. and R. Gupta, “Redundancy Issues in Software and Hardware Systems: An
Overview,” Int’l J. Reliability, Quality and Safety Engineering, Vol. 18, No. 1, pp. 61-98,
2011.
[Lapr82] Laprie, J.-C., “Dependability: A Unifying Concept for Reliable Computing,” Proc. Int’l Symp.
Fault-Tolerant Computing, 1982, pp. 18-21.
[Lapr85] Laprie, J.-C., “Dependable Computing and Fault Tolerance: Concepts and Terminology,” Proc.
Int’l Symp. Fault-Tolerant Computing, 1985, pp. 2-11.
[Neum16] Neumann, P. G., “Risks of Automation: A Cautionary Total-System Perspective of our
Cyberfuture,” Communications of the ACM, Vol. 59, No. 10, pp. 26-30, October 2016.
[Olse08] Olsen, K. A., “The $100,000 Keying Error,” IEEE Computer, Vol. 41, No. 4, pp. 108 & 106-
107, April 2008.
[Parh78] Parhami, B., “Errors in Digital Computers: Causes and Cures,” Australian Computer Bulletin,
Vol. 2, No. 2, pp. 7-12, March 1978.
[Parh94] Parhami, B., “A Multi-Level View of Dependable Computing,” Computers & Electrical
Engineering, Vol. 20, No. 4, pp. 347-368, 1994.
[Parh97] Parhami, B., “Defect, Fault, Error, ... , or Failure?” IEEE Trans. Reliability, Vol. 46, No. 4, pp.
450-451, December 1997.
[Seth15] Sethumadhavan, S., A. Waksman, M. Suozzo, Y. Huang, and J. Eum, “Trustworthy Hardware
from Untrusted Components,” Communications of the ACM, Vol. 58, No. 9, pp. 60-71,
September 2015.
[Siew92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd Ed., 1992.


2 Dependability Attributes
“The shifts of fortune test the reliability of friends.”
Cicero

“Then there is the man who drowned crossing a stream with an


average depth of six inches.”
W. I. E. Gates

Topics in This Chapter


2.1. Aspects of Dependability
2.2. Reliability and MTTF
2.3. Availability, MTTR, and MTBF
2.4. Performability and MCBF
2.5. Integrity and Safety
2.6. Privacy and Security

Based on our discussions in Chapter 1, computer system dependability is not a


single concept; rather, it possesses several different facets or components. These
include reliability, availability, testability, maintainability (the so-called “ilities”),
graceful degradation, robustness, safety, security, and integrity. In this chapter, we
quantify some of the key attributes of a dependable computer or computer-based
system, explore the relationships among these quantitative measures, and show
how they might be evaluated in rudimentary cases, with appropriate simplifying
assumptions. More detailed examination of dependability modeling techniques
will be presented in Chapter 3, covering combinational models, and Chapter 4,
introducing state-space models.


2.1 Aspects of Dependability

In Chapter 1, we briefly touched upon the notions of reliability, safety, and availability,
as different facets of computer system dependability. In this chapter, we provide precise
definitions for these concepts and also introduce other related and distinct aspects of
computer system dependability, such as testability, maintainability, serviceability,
graceful degradation, robustness, resilience, security, and integrity. Table 2.1 shows the
list of concepts that will be dealt with, along with typical qualitative usages and
associated quantitative aspects or measures. For example, in the case of reliability, we
encounter statements such as “this system is ultrareliable” (qualitative usage) and “the
reliability of this system for a one-year mission is 0.98” (quantitative usage).

We devote the remainder of this section to a brief review of some needed concepts from
probability theory. The review is intended as a refresher. Readers who have difficulties
with these notions should consult one of the introductory probability/statistics textbooks
listed at the end of the chapter; for example, [Papo90].

Let E be one of the possible outcomes of some experiment. By

prob[E] = 0.1

we mean that if the experiment is repeated many times, under the same conditions, the
outcome will be E in roughly 10% of the cases. For example

prob[System S fails within 10 weeks] = 0.1

means that out of 1000 systems of the same type, all operating under the same application
and environmental conditions, about 100 will fail within 10 weeks.
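This frequency interpretation can be illustrated by simulation. The sketch below (a hypothetical model, not data from any real system) assumes exponentially distributed lifetimes with the rate chosen so that the true probability of failure within 10 weeks is 0.1, then draws many lifetimes and counts early failures:

```python
import math
import random

random.seed(1)

# Hypothetical lifetime model: exponential lifetimes with rate lam chosen
# so that prob[failure within 10 weeks] = 1 - exp(-10*lam) = 0.1.
lam = -math.log(0.9) / 10            # per-week failure rate

n = 100_000                          # number of simulated systems
failed_early = sum(1 for _ in range(n) if random.expovariate(lam) <= 10)

print(failed_early / n)              # observed fraction, close to 0.1
```

With 100,000 simulated systems, the observed fraction settles near the true value of 0.1, just as the relative frequency of failures among 1000 physical systems would.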


Table 2.1 Dependability-related terms with their most common


qualitative usages and quantifications (if any).

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Term             Qualitative Usage(s)          Quantitative Measure(s)
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Availability     Highly available,             (Pointwise) availability,
                 high-availability,            interval availability,
                 continuously available        MTBF, MTTR

Integrity        High-integrity,
                 tamper-proof, fool-proof

Maintainability  Easily maintainable,
                 maintenance-free,
                 self-repairing

Performability                                 Performability, MCBF

Reliability      Reliable, highly reliable,    Reliability,
                 high-reliability,             MTTF or MTFF
                 ultrareliable

Resilience       Resilient

Robustness       Robust                        Impairment tolerance count

Safety           High-safety, fail-safe        Risk

Security         Highly secure,
                 high-security, fail-secure

Serviceability   Easily serviceable

Testability      Easily testable,              Controllability,
                 self-testing, self-checking   observability
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

Abbreviations used: MCBF = mean computation between failures; MTBF = mean time between failures;
MTFF = mean time to first failure; MTTF = mean time to failure; MTTR = mean time to repair.


Thus, probability values have physical significance and can be determined via
experimentation. Of course, when the probabilities are very small, it may become
impossible or costly to conduct the requisite experiments. For example, experimental
verification that

prob[Computer C fails within 10 weeks] = 10–6

requires experimentation with many millions of computers.

When multiple outcomes are of interest, we deal with composite, joint, and conditional
probabilities satisfying the following:

prob[not A] = 1 – prob[A] (2.1.CJC)

prob[A ∪ B] = prob[A] + prob[B] – prob[A ∩ B]
            = prob[A] + prob[B] if A and B are mutually exclusive
prob[A ∩ B] = prob[A] prob[B] if A and B are independent
prob[A | B] = prob[A ∩ B] / prob[B] {read: probability of A, given B}
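These identities can be verified mechanically on a small sample space. The sketch below (an illustrative toy example, not from the text) uses one roll of a fair die, with A = “even outcome” and B = “outcome greater than 3”:

```python
from fractions import Fraction

# Sample space: one roll of a fair die; events are predicates on outcomes.
omega = range(1, 7)

def prob(event):
    return Fraction(sum(1 for x in omega if event(x)), 6)

A = lambda x: x % 2 == 0             # even outcome: {2, 4, 6}
B = lambda x: x > 3                  # outcome > 3:  {4, 5, 6}

# Complement, inclusion-exclusion, and conditional probability:
assert prob(lambda x: not A(x)) == 1 - prob(A)
assert prob(lambda x: A(x) or B(x)) == \
       prob(A) + prob(B) - prob(lambda x: A(x) and B(x))
assert prob(lambda x: A(x) and B(x)) / prob(B) == Fraction(2, 3)
```

Note that A and B here are dependent (prob[A ∩ B] = 1/3 ≠ prob[A] prob[B] = 1/4), so the product rule for independent events does not apply to them.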

Suppose that we have observed 20 identical systems under the same conditions and
measured the time to failure for each. The top part of Fig. 2.1 shows the distribution of
the time to failure as a scatter plot. The cumulative distribution function (CDF) for the
time to failure x (life length of the system), defined as the fraction of the systems that
have failed before a given time t, is shown in the middle part of Fig. 2.1 in the form of a
staircase. Of course, with a very large sample, we would get a continuous CDF curve that
goes from 0 for t = 0 to 1 for t = ∞. Finally, the probability density function (pdf) is
shown at the bottom of Fig. 2.1. The CDF represents the area under the pdf curve; i.e., the
following relationships hold:

F(t) = prob[x ≤ t] = ∫₀^t f(x) dx (2.1.CDF)

f(t) = prob[t ≤ x ≤ t + dt] / dt = dF(t) / dt (2.1.pdf)

Based on the preceding, the interpretation of the pdf f(t) in Fig. 2.1 is that the probability
of the system failing in the time interval [t, t + dt] is f(t) dt. So, where the dots in the
scatter plot are closer together, f(t) assumes a larger value.
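The staircase CDF of Fig. 2.1 is simply the empirical distribution function of the observed sample. A sketch, with made-up failure times standing in for the 20 observations:

```python
# Hypothetical failure times for 20 observed systems (made-up numbers,
# in the spirit of the scatter plot of Fig. 2.1).
lifetimes = [3, 7, 9, 12, 14, 15, 18, 21, 22, 25,
             27, 29, 31, 33, 36, 38, 41, 43, 46, 49]

def F_hat(t):
    """Empirical CDF: fraction of systems that failed at or before time t."""
    return sum(1 for x in lifetimes if x <= t) / len(lifetimes)

print(F_hat(0), F_hat(25), F_hat(50))    # 0.0 0.5 1.0
```

Each observed failure time adds one step of height 1/20 to the staircase; with a much larger sample the steps shrink and the curve becomes effectively continuous.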


Once the CDF or pdf for a random variable has been determined experimentally, we
might try to approximate it by a suitable equation and carry out various probabilistic
calculations based on such an analytical model. Examples of commonly used
distributions include uniform, normal, exponential, and binomial (Fig. 2.2).

[Fig. 2.1 panels, top to bottom: scatter plot of observed failure times;
staircase CDF F(t), rising from 0.0 to 1.0; pdf f(t). The time axis runs
from 0 to 50 in all three panels.]

Fig. 2.1 Scatter plot for the random variable representing the lifetime
of a system, along with its cumulative distribution function
(CDF) and probability density function (pdf).


[Fig. 2.2 panels: CDF F(x) and pdf f(x) graphs for the uniform,
exponential, normal, and binomial distributions.]

Fig. 2.2 Some commonly used probability distributions, defined by


their CDF and pdf graphs.

Given a random variable x with a known probability distribution, its expected value,
denoted as E[x] or Ex, is defined as:

Ex = ∫ x f(x) dx, over all x, for continuous distributions (2.1.EV)
   = ∑k xk f(xk) for discrete distributions

The interpretation of Ex is that it is the mean of the values observed for x over a large set
of experiments. The variance σx² and standard deviation σx of a random variable x are
indicators of the spread of x values relative to Ex:

σx² = ∫ (x – Ex)² f(x) dx, over all x, for continuous distributions (2.1.Var1)
    = ∑k (xk – Ex)² f(xk) for discrete distributions

Based on the preceding definition, we easily find:

σx² = E[(x – Ex)²] = E[x²] – (Ex)² (2.1.Var2)
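Both forms of the variance give the same value, which is easy to confirm on a small discrete distribution (the values and probabilities below are arbitrary, chosen only for illustration):

```python
# A small discrete distribution: (value, probability) pairs (made-up numbers).
dist = [(1, 0.2), (2, 0.5), (4, 0.3)]

Ex  = sum(x * p for x, p in dist)                     # expected value E[x]
Ex2 = sum(x * x * p for x, p in dist)                 # E[x^2]
var_direct = sum((x - Ex) ** 2 * p for x, p in dist)  # E[(x - Ex)^2]

# The identity (2.1.Var2): E[(x - Ex)^2] = E[x^2] - (Ex)^2
assert abs(var_direct - (Ex2 - Ex ** 2)) < 1e-12
```

Here Ex = 2.4 and both computations yield a variance of 1.24.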

When dealing with two random variables x and y, the notion of covariance ψx,y is of some interest:


ψx,y = E[(x – Ex) (y – Ey)] (2.1.Cov)
     = E[xy] – Ex Ey

Given the covariance ψx,y, one can define the correlation coefficient

ρx,y = ψx,y / (σx σy) (2.1.Corr)

which is the expected value of the product of the centered and normalized random variables
(x – Ex)/σx and (y – Ey)/σy.

When ρx,y = 0, we say that x and y are uncorrelated and we have E[xy] = Ex Ey.
Independent random variables are necessarily uncorrelated, but the converse is not
always true (it is true for the normal distribution, though).
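The gap between uncorrelated and independent is visible in a classic construction: take x uniform on {–1, 0, 1} and let y = x². The sketch below (a textbook-style toy example, not from this chapter) checks that the covariance vanishes even though y is completely determined by x:

```python
from fractions import Fraction

# x uniform on {-1, 0, 1}; y = x^2. Each (x, y) pair has probability 1/3.
pts = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

Ex  = sum(p * x for x, _ in pts)
Ey  = sum(p * y for _, y in pts)
Exy = sum(p * x * y for x, y in pts)

assert Exy - Ex * Ey == 0    # covariance 0: x and y are uncorrelated
# Yet x and y are not independent:
# prob[x = 0 and y = 1] = 0, while prob[x = 0] * prob[y = 1] = (1/3)(2/3).
```

Zero correlation only rules out a *linear* relationship between the variables; here the relationship is exact but nonlinear.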


2.2 Reliability and MTTF

Reliability is an important engineering concept whose theoretical development was


apparently started by von Braun during World War II in Germany after the first series of
ten V-1 missiles all blew up on the launching pads [Henl81]. Principles of reliability
engineering can be found in many books (e.g., [Lewi87], [Tobi86]). Reliability is an
appropriate measure of dependability for a two-state system that starts in a fully
functional or “good” state and moves to a “failed” state when it can no longer operate as
specified. System functionality is completely lost when the transition into the failed state
occurs (Fig. 2.3).

[Fig. 2.3: Start → Up state ––Failure→ Down state]

Fig. 2.3 Two-state (nonrepairable) system.

The reliability R(t), defined as the probability that the system remains in the “Good” state
throughout the time interval [0, t], was the only dependability measure of interest to early
designers of dependable computer systems. Such systems were typically used for
spacecraft guidance and control where repairs were impossible and the system was
effectively lost upon the first failure. Thus, the reliability R(tM) for the mission duration
tM accurately reflected the probability of successfully completing the mission, and
achieving acceptable reliabilities for large values of tM (the so-called long-life systems)
became the main challenge. Reliability is related to the CDF of the system lifetime, also
known as unreliability, by:

F(t) = 1 – R(t) (2.2.Unrel)

Let z(t) dt be the probability of system failure between times t and t + dt. The function z(t)
is called the hazard function. Then, the reliability R(t) satisfies:

R(t + dt) = R(t) [1 z(t) dt] (2.2.Haz1)


This is simply a statement of the fact that for a system to be functional until time t + dt, it
must have been functional until time t and must not fail in the interval [t, t + dt] of
duration dt. From the preceding equation, we obtain:

dR(t)/R(t) = z(t) dt (2.2.Haz2)


R(t) = exp(− ∫ 𝑧(𝑥)𝑑𝑥 ) (2.2.Haz3)

For a constant hazard rate , that is, for z(t) = , we obtain the exponential reliability law
which we took for granted in Section 1.1.

R(t) = et (2.2.Exp)

The hazard function z(t), reliability R(t), CDF of failure F(t), and pdf of failure f(t) are
related as follows:

z(t) = f(t)/R(t) = f(t)/[1 – F(t)] (2.2.Haz4)

We can thus view z(t) as the conditional probability of failure occurring in the time
interval [t, t + dt], given that failure has not occurred up to time t. With a constant hazard
rate , or exponential reliability law, failure of the system is independent of its age, that
is, the fact that it has already survived for a long time has no bearing on its failure
probability over the next unit-time interval.
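This memoryless property can be checked numerically. The Python sketch below (with an arbitrary, illustrative failure rate) computes the conditional probability of surviving one more time unit after surviving to various ages; under a constant hazard rate, this probability is the same at every age.

```python
import math

def reliability_exp(t, lam):
    """Reliability under a constant hazard rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

lam = 0.5  # illustrative failure rate (per unit time)

# Conditional probability of surviving one more unit, given survival to age t:
# R(t + 1) / R(t).  Memorylessness means this ratio is the same for every t.
cond = [reliability_exp(t + 1, lam) / reliability_exp(t, lam) for t in (0, 5, 50)]
print(cond)  # each entry equals exp(-lam), regardless of age
```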

Note that the reliability R(t) is a monotonic (nonincreasing) function of time. Thus, when
the survival of a system for the entire duration of a mission of length tM is at issue,
reliability can be specified by the single numeric index R(tM). Mean time to failure
(MTTF), or mean time to first failure (MTFF), is another single-parameter indicator of
reliability. The mean time to failure for a system is given by:

MTTF = ∫_0^∞ t f(t) dt = ∫_0^∞ R(t) dt (2.2.MTTF)

The first equality above is essentially the definition of expected value of the time to
failure while the second one, indicating that MTTF is equal to the area under the
reliability curve, is easily provable.
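The equality of the two integrals in Eqn. (2.2.MTTF) can be illustrated numerically for the exponential law; the sketch below uses a simple midpoint quadrature with an illustrative failure rate, so both integrals should come out near 1/λ.

```python
import math

lam = 2.0  # illustrative constant failure rate
f = lambda t: lam * math.exp(-lam * t)  # lifetime pdf
R = lambda t: math.exp(-lam * t)        # reliability

def integrate(g, a, b, n=100000):
    """Midpoint-rule quadrature of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

mttf_pdf = integrate(lambda t: t * f(t), 0.0, 20.0)  # definition: integral of t f(t)
mttf_rel = integrate(R, 0.0, 20.0)                   # area under the reliability curve
print(mttf_pdf, mttf_rel)  # both close to 1/lam = 0.5
```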


In addition to the constant hazard rate z(t) = λ, which leads to the exponential reliability
law, the distributions shown in Table 2.2 have been suggested for reliability modeling.

The Weibull distribution has two parameters: the shape parameter α and the scale
parameter λ. Both the exponential and Rayleigh distributions are special cases of the
Weibull distribution, corresponding to α = 1 and α = 2, respectively. Similarly, the
Gamma distribution covers the exponential and Erlang distributions as special cases (b = 1
and b an integer, respectively). The parameters of the various reliability models in Table 2.2
can be derived based on field failure data.

The Gamma function (), used in the formulas in Table 2.2, is defined as:

() = ∫ 𝑥 𝑒 𝑑𝑥 (2.2.Gam1)

In particular, (1/2) =  , (1) = (2) = 1, (k) = (k – 1)! when k is an integer, and:

( + 1) =  () for any    (2.2.Gam2)

For this reason, the  function is called generalized factorial. This last equation allows us
to compute () recursively based on the values of () for 1 ≤  < 2.

The discrete versions of the exponential, Weibull, and normal distributions are known as
the geometric, discrete Weibull, and binomial distributions, respectively. The geometric
distribution is obtained by replacing e^(–λ) by the discrete probability q of survival over one
time step and the time t by the number k of time steps, leading to the reliability equation:

R(k) = q^k = (1 – p)^k (2.2.Geom)

In the case of the discrete Weibull distribution, e^(–λ^α) is replaced with q, and t^α with k^α,
leading to:

R(k) = q^(k^α) = (1 – p)^(k^α) (2.2.Weib)


Closed-form formulas are generally hard to obtain for the parameters of interest with the
discrete Weibull distribution. Finally, the binomial distribution is characterized by the
reliability equation:

R(k) = 1 – ∑_{j=0}^{k} C(n, j) p^j q^(n–j),  0 ≤ k ≤ n (2.2.Bino)

When n is large, the binomial distribution can be approximated by the normal distribution
with parameters μ = np and σ = √(npq).

Table 2.2 Some commonly assumed continuous failure distributions
and their associated reliability and MTTF formulas.

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Distribution   z(t)              f(t)                                R(t) = 1 – F(t)                    MTTF
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Exponential    λ                 λ e^(–λt)                           e^(–λt)                            1/λ

Rayleigh       2λ(λt)            2λ(λt) e^(–(λt)^2)                  e^(–(λt)^2)                        (1/λ) √π / 2

Weibull        αλ(λt)^(α–1)      αλ(λt)^(α–1) e^(–(λt)^α)            e^(–(λt)^α)                        (1/λ) Γ(1 + 1/α)

Erlang                           λ(λt)^(k–1) e^(–λt) / (k–1)!        e^(–λt) ∑_{i=0}^{k–1} (λt)^i/i!    k/λ

Gamma                            λ(λt)^(b–1) e^(–λt) / Γ(b)                                             b/λ

Normal*                          (1/(σ√(2π))) e^(–(t–μ)^2/(2σ^2))
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
(*) Reliability and MTTF formulas for the normal distribution are quite involved. One can use numerical
tables listing the values of the integral (1/√(2π)) ∫_{–∞}^{x} e^(–y^2/2) dy to evaluate R((t – μ)/σ).



Fig. 2.4 Example reliability functions for systems 1 and 2.

Referring to Fig. 2.4, we observe that the time t1 to the first failure is a random variable
whose expected value is the system’s MTTF. Even though it is true that a higher MTTF
implies a higher reliability in the case of nonredundant systems, the use of MTTF is
misleading when redundancy is applied. For example, in Fig. 2.4, System 1 with
reliability function R1(t) is much less reliable than System 2 with reliability function R2(t)
for the mission duration tM, but it has a longer MTTF. Since usually tM << MTTF, the
shape of the reliability curve is much more important than the numerical value of MTTF.
The reliability difference R2 – R1 and the reliability gain R2/R1 are natural measures for
comparing two systems having reliabilities R1 and R2. In order to facilitate the
comparison of highly reliable systems (with reliability values very close to 1), several
comparative measures have been suggested, including the reliability improvement factor
of System 2 over System 1, for a given mission time tM

RIF2/1(tM) = [1 – R1(tM)] / [1 – R2(tM)] (2.2.RIF)

reliability improvement index, of System 2 over System 1, for the mission time tM

RII2/1(tM) = log R1(tM) / log R2(tM) (2.2.RII)


mission time extension for a given reliability goal rG

MTE2/1(rG) = T2(rG) – T1(rG) = R2^(–1)(rG) – R1^(–1)(rG) (2.2.MTE)

and mission time improvement factor for a given reliability goal rG

MTIF2/1(rG) = T2(rG) / T1(rG) = R2^(–1)(rG) / R1^(–1)(rG) (2.2.MTIF)

where the (mission) time function T is the inverse of the reliability function R. Thus,
R(T(r)) = r and T(R(t)) = t.

Example 2.1: Comparing system reliabilities Systems 1 and 2 have constant failure rates of λ1
= 1/yr and λ2 = 2/yr. Quantify the reliability advantage of System 1 over System 2 for a one-
month period.

Solution: The reliabilities of the two systems for a one-month period are R1(1/12) = e^(–λ1 × 1/12) =
0.9200 and R2(1/12) = e^(–λ2 × 1/12) = 0.8465. The reliability advantage of System 1 over System 2
can be quantified in the following ways:
R1(1/12) – R2(1/12) = 0.9200 – 0.8465 = 0.0735
R1(1/12) / R2(1/12) = 0.9200 / 0.8465 = 1.0868
RIF1/2(1/12) = (1 – 0.8465) / (1 – 0.9200) = 1.9188
RII1/2(1/12) = log 0.8465 / log 0.9200 = 1.9986
For a reliability goal of 0.9, the mission time extension of System 1 over System 2 is derived as
MTE1/2(0.9) = (–ln 0.9)(1/λ1 – 1/λ2) = 0.0527 yr = 19.2 days
while the mission time improvement factor of System 1 over System 2 is:
MTIF1/2(0.9) = λ2/λ1 = 2.0

Example 2.2: Analog of Amdahl’s law for reliability Amdahl’s law states that if, in a unit-time
computation, a fraction f doesn’t change and the remaining fraction 1 – f is sped up to run p
times as fast, the overall speedup will be s = 1 / (f + (1 – f)/p). Show that a similar formula applies
to the reliability improvement index after improving the failure rate for some parts of a system.

Solution: Consider a system with two segments, having failure rates φ and λ – φ, respectively.
Upon improving the failure rate of the second segment to (λ – φ)/p, we have RII = log Roriginal / log
Rimproved = λ / (φ + (λ – φ)/p). Letting φ / λ = f, we obtain: RII = 1 / (f + (1 – f)/p)


2.3 Availability, MTTR, and MTBF

As mentioned earlier, at first reliability was the only measure of interest in evaluating
computer system dependability. The advent of time-sharing systems brought with it a
concern for the continuity of computer service (the so-called high-availability systems)
and thus minimizing the “down time” became a prime concern. Interval availability, or
simply availability, A(t), defined as the fraction of time that the system is operational
during the interval [0, t], is the natural dependability measure in this respect. The limit A
of A(t) as t tends to infinity, if it exists, is known as the steady-state availability.

A = limt∞ A(t) (2.3.Av1)

Availability is a function not only of how rarely a system fails but also of how quickly it
can be repaired upon failure. Thus, the time to repair is important and maintainability is
used as a qualitative descriptor for ease of repair (i.e., faster or less expensive
maintenance procedures).

Clearly, maintainability is closely related to availability in that high availability cannot be


achieved without attention to maintenance speed-up techniques. Serviceability is
sometimes used as a synonym for maintainability. As the concern with dependability
spread from highly advanced special-purpose systems to commercial environments, terms
such as maintainability and serviceability became a permanent part of computer jargon.
The attention to ease of maintenance is not new, as even some second-generation
computers had extensive hardware and software aids for this purpose [Cart64].

The probability a(t) that a system is available at time t is known as its pointwise
availability (which is the same as reliability when there is no repair). To take repair into
consideration, one can consider a repair rate function zr(t) which results in the probability
of unsuccessful repair up to time t having an equation similar to reliability, with various
distributions possible. For example, one can have exponentially distributed repair times,
with repair rate :

prob[repair is not completed in t time units] = e–t


Consider a two-state repairable system, as shown in Fig. 2.5. The system begins
operation in the “Up” state, but then moves back and forth between the two states due to
failures and successful repairs. The duration of the system staying in the “Up” state is a
random variable corresponding to the time to first failure, while that of the “Down” state
is the time to successful repair.


Fig. 2.5 Two-state repairable system.


Fig. 2.6 System up and down times contributing to its availability.

Fig. 2.6 depicts the variations of the state of an example repairable two-state system with
time. Until time t1, the example system in Fig. 2.6 is continuously available and thus has
an interval availability of 1. After the first failure at time t1, availability drops below 1
and continues to decrease until the completion of the first repair at time t'1. The repair
time is thus t'1 – t1. The second failure occurs at time t2, yielding a time between failures
of t2 – t1. Over a period of time, the expected value of t'i – ti and ti+1 – ti are known as
mean time to repair (MTTR) and mean time between failures (MTBF), respectively.


In the special case of zr(t) = μ, i.e., a constant repair rate, the steady-state availability for a
system with a constant failure rate λ is:

A = μ / (λ + μ) (2.3.Av2)

We will formally derive the preceding equation in Chapter 4 as a simple example of


state-space modeling. For now, we present an intuitive justification by noting that, with
exponential failure and repair time distributions, we have MTTF = 1/λ and MTTR = 1/μ,
leading to

A = (1/λ) / (1/λ + 1/μ) = MTTF / (MTTF + MTTR) = MTTF / MTBF (2.3.Av3)

where MTBF = MTTF + MTTR = 1/λ + 1/μ is the mean time between failures.
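Eqn. (2.3.Av3) translates directly into a short computation; the rates below are illustrative (mean time to failure of 1000 hours, mean repair time of 2 hours), and the two forms of the availability formula agree.

```python
lam = 1 / 1000  # failure rate: one failure per 1000 hours (illustrative)
mu = 1 / 2      # repair rate: mean repair time of 2 hours (illustrative)

mttf, mttr = 1 / lam, 1 / mu
mtbf = mttf + mttr

A = mttf / mtbf            # steady-state availability, Eqn. (2.3.Av3)
A_rates = mu / (lam + mu)  # the same quantity in rate form, Eqn. (2.3.Av2)
print(A, A_rates)          # about 0.998 for these rates
```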

Pointwise availability a(t) and interval availability A(t) are related as follows:

A(t) = (1/t) ∫_0^t a(x) dx (2.3.Av4)

Both pointwise and interval availability are relatively difficult to derive in general.
However, for most practical purposes, the steady-state availability A can be used in lieu
of pointwise availability. This is because if a(t) can be assumed to be a constant, then it
must be equal to A(t) by the preceding equation. Interval availability being a constant in
turn implies A(t) = A. As an example, if a system is available 99% of the time in the long
run, then we can assume that at any given time instant, it will be available with
probability 0.99.

A high-availability computer system must be robust and resilient. A standard dictionary


defines “robustness” as strength, vigor, ruggedness, health, the opposite of delicate, sickly,
and “resilience” as an ability to recover from or adjust easily to misfortune or change, a
synonym for elasticity. Anderson [Ande85] defines a resilient computing system as one
that is “capable of providing dependable service to its users over a wide range of
potentially adverse circumstances” and notes that trustworthiness and robustness are the


key attributes of such a system. He adds that a robust computer system “retains its ability
to deliver service in conditions which are beyond its normal domain of operation,
whether due to harsh treatment, or unreasonable service requests, or misoperation, or the
impact of faults, or lack of maintenance, etc.”

Example 2.3: Availability formula Consider exponential failure and repair laws, with failure rate λ
and repair rate μ. In the time interval [0, t], we can expect λt failures, which take λt/μ time units to
repair, on the average. Thus, for large t, the system will be under repair for λt/μ time units out of t
time units, yielding the availability 1 – λ/μ. Is there anything wrong with this argument, given that
the availability was previously derived to be A = 1 – λ/(λ + μ)?

Solution: The number of expected failures is actually slightly less than λt, because the system is
operational only in a fraction A of time t, where A is the availability. Correcting the argument, we
note that λAt failures are expected over the available time At, yielding an expected repair time of
λAt/μ time units. Thus, the availability A satisfies the equation A = 1 – λA/μ, where the last term
is the fraction of the time t that is spent on repair. This yields A = 1/(1 + λ/μ) = μ/(λ + μ).
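The corrected balance equation of Example 2.3 can also be solved by simple fixed-point iteration, which makes the self-referential nature of the argument concrete. The rates below are illustrative.

```python
lam, mu = 0.002, 0.1  # illustrative failure and repair rates

# Iterate the corrected balance equation A = 1 - lam * A / mu from Example 2.3.
A = 1.0  # initial guess
for _ in range(100):
    A = 1 - lam * A / mu

print(A, mu / (lam + mu))  # the iteration converges to mu/(lam + mu)
```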

Continual increase in system complexities and the difficulties in testing for initial system
verification and subsequent preventive and diagnostic maintenance have led to concern
for testability. Since any preventive or diagnostic maintenance procedure is based on
testing, maintainability and testability are closely related. Testability is often quantified
by the complementary notions of controllability and observability. In the context of
digital circuits, controllability is an indicator of the ease with which the various internal
points of the circuit can be driven to desired states by supplying values at the primary
inputs. Similarly, observability indicates the ease with which internal logic values can be
observed by monitoring the primary outputs of the circuit [Gold79].

In practice, computer systems have more than two states. There may be multiple
operational states where the system exhibits different computational capabilities.
Availability analysis for gracefully degrading systems will be discussed along with
performability in Section 2.4. It is noteworthy that performability and similar terms, some
of which were listed in the preceding paragraph, are sometimes collectively referred to as
the “-ilities”. Several other informally defined “-ilities” can be found in the literature
(survivability, reconfigurability, diagnosability, and so on), although none has found
widespread acceptance.


2.4 Performability and MCBF

Widespread use of multiprocessors and gracefully degrading systems, which did not obey
the all-or-none mode of operation implicit in conventional reliability and availability
models, caused some difficulties in dependability evaluation. Consequently,
performability was suggested as a relevant measure. Again the desirability of a simple
numeric measure led to the suggestion of mean computation before failure (MCBF),
although the use of this measure did not become as widespread as the MTTF and MTBF
of the earlier era. These concepts will be discussed in the remainder of this section.

The performability of a gracefully degrading system at time t depends on the set of


resources available, the computational capability provided by these resources, and the
“worth” associated with each capability level. As such, complete discussion of
performability is beyond the scope of this section. Rather we choose to illustrate the tools
and techniques by means of a simple example.

Consider a dual-processor computer system with two performance levels; both processors
working (worth = 2) and only one processor working (worth = 1), ignoring all other
resources. If the processors fail one at a time, and are repaired one at a time, then the
system’s state diagram is as shown in Fig. 2.7. In Chapter 4, we will show how the
steady-state probabilities for the system being in each of its states can be determined. If
these probabilities are pUp2, pUp1, and 1 – pUp2 – pUp1, then, the performability of the
system is:

P = 2pUp2 + pUp1

As a numerical example, pUp2 = 0.92 and pUp1 = 0.06 lead to P = 1.90. In other words,
the performance level of the system is equivalent to 1.9 processors on the average, with
the ideal performability being 2.

When processors fail and are repaired independently, and if Processor i has a steady-state
availability Ai, then performability of the system above becomes:

P = A1 + A2


More generally, i.e., when the resources are not identical, each availability Ai of a
resource must be multiplied by the worth of that resource. Note that the independent
repair assumption implies that maintenance resources are not limited. If there is only one
repair person, say, then this assumption may not be valid.

A fail-soft, or gracefully degrading, system may be said to be “available” when its


performance is at or above a minimum threshold. Thus, one can readily derive the
availability figure of merit based on performability calculations. Assuming that the
system of Fig. 2.7 is available in states Up2 and Up1, its availability with our preceding
assumptions will be A = pUp2 + pUp1 = 0.92 + 0.06 = 0.98.

Fig. 2.7 Three-state repairable system with different performance
parameters in its two nonfailed states.

Example 2.4: Performability improvement factor With a low rate of failure, performability
will be close to its ideal value. For this reason, a performability improvement factor, PIF, can be
defined in a manner similar to RIF, given in equation 2.2.RIF. For the two-processor system
analyzed in the preceding paragraphs, determine the PIF relative to a fail-hard system that would
go down when either processor fails.

Solution: The performability of the fail-hard system is readily seen to be 2pUp2 = 2 × 0.92 = 1.84.
Given the ideal performability of 2, the performabilities of our fail-hard and fail-soft systems in
relative terms are 1.84/2 = 0.92 and 1.90/2 = 0.95, respectively. Thus:
PIFfail-soft/fail-hard = (1 – 0.92) / (1 – 0.95) = 1.6
Note that like RIF, PIF is useful for comparing different designs. The result 1.6 does not have any
physical significance. The performability ratio in this example is 0.95 / 0.92 = 1.033, which is a
true indicator of the increase in expected performance.
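The performability and PIF calculations of this section reduce to a few lines of code, using the steady-state probabilities assumed earlier in the text.

```python
p_up2, p_up1 = 0.92, 0.06  # steady-state probabilities from the text

perf_soft = 2 * p_up2 + p_up1  # fail-soft performability (worth 2 and 1)
perf_hard = 2 * p_up2          # fail-hard: worth 2 only when both processors are up
ideal = 2.0

# Performability improvement factor, defined analogously to RIF (Eqn. 2.2.RIF)
pif = (1 - perf_hard / ideal) / (1 - perf_soft / ideal)
print(perf_soft, perf_hard, pif)
```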

Figure 2.8 depicts the variations of the state of an example three-state repairable system
with time. Up to time t1, the example system in Fig. 2.8 is continuously available with
maximal performance. After the partial failure at time t1, performance drops.
Subsequently, at time t2, the system fails completely. After partial repair is completed at


time t'2, system performance goes up and is eventually restored to its full level at time t'1.
The full repair time is, therefore, t'1 – t1, with partial repair taking t'2 – t2 time.

Fig. 2.8 System up, partially up, and down times contributing to its
performability.

The shaded area in Fig. 2.8 represents the computational power that is available prior to
the first total failure. If this power is utilized in its entirety, then the shaded area
represents the amount of computation that is performed before total system failure. The
expected value of this parameter is known as mean computation before failure (MCBF).
Since no computation is performed in the totally failed state, MCBF can be viewed as
representing mean computation between failures. Thus, MCBF is to performability as
MTBF is to reliability.

Note that performability generalizes both reliability and performance. For a system that
never fails, performability is the same as performance. For a system that has a single
performance level (all or none), performability is synonymous with reliability; i.e.,
performance level is 100% iff the system has not failed.

One approach to computing MCBF is to use:

MCBF = Performability × MTTF (2.4.MCBF)

If the availability and MTTR are known, MTTF can be found from equation 2.3.Av3.


2.5 Integrity and Safety

The measures discussed thus far deal with the operation and performance of the computer
system (and frequently only with the hardware) rather than with the integrity and success
of the computations performed. Neither availability nor performability distinguishes
between a system that experiences 30 two-minute outages per week and one that fails
once per week but takes an hour to repair. Increasing dependence on transaction
processing systems and safety-critical applications of computers has led to new concerns
such as integrity, safety, security, and privacy. The first two of these concerns, and the
corresponding dependability measures, are treated in this section. Security and privacy
will be discussed in Section 2.6.

The two attributes of integrity and safety are similar; integrity is inward-looking and
relates to the capacity of a system to protect its computational resources and data under
adverse circumstances, while safety is outward-looking and pertains to consequences of
incorrect actions for the system environment and users. One can examine system integrity
by assigning the potential causes and consequences of failures to a number or a
continuum of “severity” classes. If computational resources and data are not corrupted
due to low-severity causes, then the system fares well on integrity. If the failure of a
system seldom has severe external consequences, then the system is high on safety.

Integrity is a qualitative system attribute, although certain aspects of it can be quantified.


System integrity can be ensured via rapid fault/error detection, frequent data back-ups,
mechanisms that can isolate malfunctioning subsystems, and restoration via hot-swapped
modules. A high-integrity system continues to provide reasonable service, while also
protecting data files and other system resources, in the face of undesirable events
originating from inside or outside the system.

Safety, on the other hand, is almost always quantified. Leveson [Leve86] defines safety
as “the probability that conditions [leading] to mishaps (hazards) do not occur, whether
or not the intended function is performed”. Central to the quantification of safety is the
notion of risk. A standard dictionary defines risk as “the possibility of loss or injury”.
Reliability engineers use probability instead of possibility. The expected loss or risk
associated with a particular failure is a function of both its severity and its probability.
More precisely:


risk = frequency × magnitude (2.5.Risk1)

[consequence / unit time] = [events / unit time] × [consequence / event]

An alternate form of the risk equation is:

risk = probability × severity (2.5.Risk2)

The “magnitude” parameter in Eqn. (2.5.Risk1) or “severity” in Eqn. (2.5.Risk2) is
measured in some agreed-upon unit. In many cases, this is done by associating a dollar
cost with the occurrence of each undesirable event.

For example, the approximate individual risk (early fatality probability per year)
associated with motor vehicle accidents is 3 × 10^(–4), which is about 10 times the risk of
drowning, 100 times the risk of railway accidents, and 1000 times the risk of being killed
by a hurricane ([Henl81], p. 11). Individual risks below 10^(–6) per year are generally
deemed acceptable. Computer scientists and engineers have so far only scratched the
surface of safety evaluation techniques, and much more work in this area can be expected
in the coming years.
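Risk aggregation per Eqn. (2.5.Risk1) amounts to a frequency-weighted sum of consequences. The sketch below uses entirely hypothetical failure classes and dollar costs, purely to illustrate the bookkeeping.

```python
# Hypothetical failure classes: (events per year, dollar cost per event)
failure_modes = {
    "minor outage":  (12.0, 1_000.0),
    "data loss":     (0.5, 50_000.0),
    "safety mishap": (0.001, 5_000_000.0),
}

# risk = frequency * magnitude, summed over classes -> expected loss per year
total_risk = sum(freq * cost for freq, cost in failure_modes.values())
print(total_risk)  # expected dollars lost per year
```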

To put our discussion of safety on the same footing as those of reliability, availability,
and performability, we envisage the three-state system model of Fig. 2.9. Certain adverse
conditions may cause the system to fail in an unsafe manner; the probability of these
events must be minimized. Safe failures, on the other hand, are acceptable. The provision
of a separate “Safe Down” state, rather than merging it with the “Up” state, is useful in
that the two states may be given different weights in deriving numerical figures of merit
of various kinds. Furthermore, if we add transitions corresponding to repair from each
“Down” state to the “Up” state, we can quantify not only the risk of unsafe operation but
also the chances that the backup (manual) system may be stretched beyond its capacity
owing to overly long repair time.

Consider for example the more elaborate state model depicted in Fig. 2.10. Here, we
model the fact that a safe failure condition may turn into an unsafe one if the situation is
not handled properly and the possibility that the system can recover from a safe failure
through repair.


Fig. 2.9 Three-state nonrepairable system with safe and unsafe
failed states.


Fig. 2.10 Three-state repairable system with safe and unsafe failed
states.

In both Figs. 2.9 and 2.10, one may use multiple unsafe failed states, one for each level of
severity, say. In this way, the probabilities of ending up in the various unsafe states can
be used for risk assessment using Eqn. (2.5.Risk2). State-space modeling techniques are
discussed in Chapter 4.


2.6 Privacy and Security

In many application contexts, the bulk of impairments to dependability are human-related.
Such factors include operator errors (e.g., due to carelessness or improper
response to safety warnings) and malicious attacks (hackers, viruses, and the like).
Desirable system attributes pertaining to the impairments just mentioned include privacy
and security. Privacy is compromised, for example, when confidential or personal data
are disclosed to unauthorized parties, either due to inadvertent error or as a consequence
of malicious manipulation. Security is breached, for example, when account information
for a bank customer is incorrectly modified owing to inadvertent error (such as modifying
the account balance in an ATM cash withdrawal attempt, without dispensing the cash) or
malicious action.

Despite several decades of research on privacy and security in computing systems, these
two aspects have resisted quantitative assessment. In theory, security can be quantified in
the same manner as safety, that is, by considering frequency or probability of a security
breach as one factor, and magnitude or severity of the event as another. However,
quantifying both factors is substantially more difficult in the case of security, compared
with safety. One aspect of the difficulty pertains to the fact that security breaches are
often not accidental, so they are ill-suited to a probabilistic treatment.

We end this section by noting that system security is orthogonal to both reliability and
safety. A system that automatically locks up when a security breach is suspected may be
deemed highly secure but it may not be very reliable or safe in the traditional sense.


Problems

2.1 Mean time to failure


A particular computer model has a constant failure rate . What percentage reduction in  would lead to
MTTF improvement by 25%? By 100%?

2.2 Calculating event probabilities


Let, in a class of n students, A be the event that no student is born in December and B be the event that at
least two students have identical birthdays (only the day and month, not the year).
a. Which event, A or B, is more likely in a class with n = 10? Fully justify your answer.
b. For what value of n are the two events A and B roughly equiprobable?
c. Let an event be likely if its probability exceeds 0.9. For what values of n is event B likely?

2.3 The birthday paradox


Let, in a class of n students, p be the probability that at least two students have identical birthdays (only the
day and month, not the year). What is the smallest value of n for which p > 1/2? This is known as the
birthday paradox, because the answer is much smaller than what one might think.

2.4 The Monty Hall problem


This problem is named after one-time host of the TV game show “Let’s Make a Deal.” Imagine that you
are a contestant on this game show and there is a prize behind one of three doors with equal probabilities.
You pick door A. The host opens door B to reveal that there is no prize behind it. He then gives you a
chance to switch to door C, that is, to get the potential prize behind door C instead of door A. Is it better to
make the switch or to stick to your original choice? Hint: A possible argument is that the prize is equally
likely to be behind door A or door C, and this situation does not change by the opening of door B. A second
argument is that when you chose door A, you picked the prize with probability 1/3 and missed it with
probability 2/3. So, after opening door B, the probability that the prize is behind door C increases to 2/3.

2.5 Computing expected values


We roll five standard dice, with faces marked 1-6. With each roll, we record a number (not necessarily an
integer) as follows. If a number appears more than any other (2, 3, 4, or 5 times), we record that number. If
two numbers appear twice each, we record their average. If all five numbers are different, we record the
median of the five values.
a. Argue informally that the expected value of the number recorded is 3.5.
b. Present a formal proof of the result of part a.

2.6 Interval availability


Plot the interval availability A(t) as a function of t, 00:00 ≤ t ≤ 24:00, for a two-state repairable system that
fails at time 3:20, goes back into operation at time 3:35, fails again at time 18:30, becomes operational
again at time 19:10, and remains operational until t = 24:00. Explain the shape of the curve obtained.

2.7 Mission time for a given reliability


Find the mission-time function T(r) = R^(–1)(r) for as many of the distributions shown in Table 2.2 as
possible.

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-09-29 69

2.8 Order statistics


We roll n standard dice, with faces marked 1-6, and record the largest number L and the smallest number S
that appear. If we repeat this experiment many times, L and S can be viewed as random integer variables
ranging in [1, 6]. Let pk[j] = prob{k dice are rolled, and L = j} and qk[j] = prob{k dice are rolled, and L < j}.
Clearly, qk[1] = 0 and ∑1≤j≤6 pk[j] = 1, for all k.
a. Prove that pn[j] = (j/6) pn–1[j] + (1/6) qn–1[j].
b. Starting from p1[j] = 1/6 and q1[j] = (j – 1)/6, tabulate the values of pn[j] and qn[j] for n ≤ 6, using
the equality of part a.
c. Calculate the expected value E[L] for n ≤ 6 and plot it as a function of n.
d. Prove that E[L] + E[S] = 7, regardless of n.

2.9 Availability of a two-state system


A system never fails but is shut down for 30 minutes of preventive maintenance after every two hours of
operation. The first two hours of operation begin at time 00:00.
a. Plot the availability of this system in the interval [0, t], for 00:00 ≤ t ≤ 24:00, as a function of t.
b. Derive an expression of the interval availability A(t) as a function of t. Hint: The expression will
involve “floor” or “ceiling” operations.
c. Show that the availability A(t) of this system tends to 0.8 for large t.

2.10 System with two operational states


In the three-state system of Fig. 2.7, add a transition from the Up2 state to the Failed state. Supply an
appropriate label for the new transition and explain why it makes sense to consider this transition. Does a
reverse transition from the Failed state to the Up2 state also make sense?

2.11 Properties of expected value


Let x and y be random variables and let a and b denote constants. Prove the following results:
a. If f(x) is symmetrical about a, that is, f(a + x) = f(a – x), then E[x] = a
b. E[x + y] = E[x] + E[y]
c. E[a x + b] = a E[x] + b

2.12 Privacy and security


One of the most significant break-ins for a cellular telephone system was uncovered in Greece during early
2005. Read the article [Prev07] and use online sources describing the same incident to answer the following
questions in a single typed page (single-spacing is okay) as completely as possible.
a. Write an abstract (200 words or less) that describes the incident being reported.
b. The article’s main themes are privacy and security. However, system dependability features
(reliability, serviceability, maintainability) are also involved. Describe in 200 words or less the
interplay between dependability features and the focal points of privacy and security.
c. Speculate on whether similar incidents could have happened in more technologically advanced
countries within the same time frame (2004-05).


2.13 Interval failure probability


a. What is the probability that a system with reliability function R(t) fails in the time interval [a, b]?
b. Consider the special case of a constant hazard rate λ, corresponding to an exponential reliability
formula, and explain why the derived interval failure probability from part a is not λ(b – a).
c. Consider a time c in the interval [a, b]. What is the conditional probability that the system of part a
failed in the interval [a, c], given the knowledge that it definitely failed in the interval [a, b]?
d. Discuss the special case of part c for a constant hazard rate λ.

2.14 Secure data storage


A real number (secret) must be shared among three people so that no one of them knows the number but
any two can cooperate to discover what it is. Consider the following secret-sharing scheme. Each person i
is given a pair of real numbers ai and bi, so that y = aix + bi defines a line in the x-y plane. The three lines
intersect at a point whose x coordinate is the secret number [Blak79].
a. What is the informational redundancy in this scheme?
b. Does any of the three persons have partial information about the secret?
c. Will the scheme work just as well if the secret number is an integer?
d. Discuss the advantages and disadvantages of the following scheme for sharing two secret
numbers. The scheme works as above, except that the y coordinate of the intersection point
corresponds to the second secret number.

2.15 Probability concepts


During World War II, a hypothetical city laid out as a 10-by-10 grid of equal-size blocks was hit by 200
randomly dropped bombs. Thus, the probability of any particular bomb hitting a specific city block was
1/100 and each block was hit by an average of two bombs.
a. Find the probability of a particular city block not being hit by bombs at all.
b. Derive the expected number of city blocks hit by two or more bombs.

2.16 Probability concepts


A particular data file can be stored in any one of 10 locations, numbered 1 through 10. The probability of
the data file being in location i is inversely proportional to i. In other words, the probability of the data file
being in location 5 is twice that of it being in location 10. Assuming that the inspection of each location
takes 1 time unit, what is the expected length of time needed to retrieve the data file? Does the expected
value depend on the order in which we inspect the 10 locations?

2.17 Mean time to failure


A particular computer model has the failure rate λ.
a. If a large number m of identical machines of this type run continuously for a period of time equal
to their MTTF, how many are expected to be still working at the end? Hint: The answer isn’t m/2.
b. By what time should we expect 90% of the machines to have failed?


2.18 Probability concepts


Which one of the two patterns below is more random? Explain your answer.

2.19 Probability concepts


a. Which is more probable at your home or office: a power failure or an Internet outage? Which is
likely to last longer?
b. Which surgeon would you prefer for an operation that you must undergo: Surgeon A, who has
performed some 500 operations of the same type, with 5 of his patients perishing during or
immediately after surgery, or Surgeon B, who has a perfect record in 25 operations?
c. Which do you think is more likely: the event that every student in a class of 10 was born in the
first half of the year or the event that at least two students were born on the same day of the year?

2.20 Mean time to failure


Using integration by parts, show that the definition of MTTF in eqn. (2.2.MTTF) and the second integral in
that same equation are equivalent.

2.21 Coin-flipping puzzles


a. You are joining Coin Flippers of America and your dues will be decided by flipping a coin, until a
5-toss pattern of your choosing appears. For example, if you choose HHHHH and it takes 36 flips
before the pattern appears, your annual dues will be $36. Should you choose HHHHH, HTHTH,
or HHHTT? Does it even matter?
b. After joining Coin Flippers of America, you enter a tournament and face the first opponent. Each
of you picks a different head-tail sequence of length 5, and a coin is flipped repeatedly. The player
whose sequence appears first is the winner. What sequence would you choose if you were to go
first?

2.22 Overbooking by airlines


JetBlack Airlines has determined that on average 5% of those making flight reservations do not show up.
The company has thus decided to sell 78 tickets on each 75-passenger flight.
a. What is the probability that every passenger showing up will get a seat on such a flight?
b. How can the airline increase the probability of being able to accommodate all passengers to at
least 90%?

2.23 Comparing event probabilities


A city has two hospitals with maternity wards: a large one, where an average of 64 babies are born per day,
and a small one, with an average of 8 daily births. On average, half of the new arrivals in each hospital are
boys and half are girls. One day, however, one of the hospitals had three times as many boys born as girls.
In which hospital is this more likely to have occurred?


2.24 The birthday paradox extended


The birthday paradox (see Problem 2.3) tells us that with a fairly small number of randomly distributed
birthdays, it is more likely than not to have 2 on the same day. With 75 birthdays, the odds of having 2 on
the same day become 99.9%; that is, almost certain. These results are counterintuitive, hence the
designation “paradox.” Consider the corresponding result for having 3 birthdays on the same day. In other
words, what is the smallest number of randomly distributed birthdays that would make it more likely than
not to have 3 birthdays on the same day?

2.25 Amdahl’s reliability law


Express the main idea presented in [Parh15] in 200 or fewer words; that is, write an abstract for the paper.

2.26 Reliability inversion


The actual reliability of a highly reliable system is unknowable, so designers try to obtain lower bounds on
system reliability as part of the design evaluation process in order to assess whether the system is good
enough for a particular application. If we have two systems with actual reliabilities R1 and R2 and reliability
lower bounds r1 and r2, with r1 < r2, we cannot deduce that R1 < R2. The condition r1 < r2 and R1 > R2 is
known as reliability inversion [Parh20]. Why is this a problem and what can we do about it?


References and Further Readings


[Ande85] Anderson, T. A. (ed.), Resilient Computing Systems, Collins, London, 1985. Also: Wiley, New
York, 1986.
[Andr02] Andrews, J. D. and T. R. Moss, Reliability and Risk Assessment, American Society of
Mechanical Engineering, 2nd ed., 2002.
[Bent99] Bentley, J. P., Reliability and Quality Engineering, Addison-Wesley, 2nd ed., 1999.
[Blak79] Blakley, G., “Safeguarding Cryptographic Keys,” Proc. AFIPS Nat’l Computer Conf., 1979,
pp. 313-317.
[Cart64] Carter, W. C., H. C. Montgomery, R. J. Preiss, and H. J. Reinheimer, “Design of Serviceability
Features for the IBM System/360,” IBM J. Research and Development, Vol. 8, No. 2, pp.
115-126, April 1964.
[USDT13] US Department of Transportation, “Treatment of the Value of Preventing Fatalities and Injuries
in Preparing Economic Analysis,” Revised Guidance, 2013. On-line document:
https://www.transportation.gov/sites/dot.dev/files/docs/VSL%20Guidance_2013.pdf
[Gold79] Goldstein, L. H., “Controllability/Observability Analysis of Digital Circuits”, IEEE Trans.
Circuits and Systems, Vol. 26, No. 9, pp. 685-693, September 1979.
[Henl81] Henley, E. J. and H. Kumamoto, Reliability Engineering and Risk Assessment, Prentice-Hall,
Englewood Cliffs, NJ, 1981.
[Leve86] Leveson, N. G., “Software Safety: Why, What, and How?” Computing Surveys, Vol. 18, No.
2, pp. 125-163, June 1986.
[Levi15] Levitin, G., L. Xing, B. W. Johnson, and Y. Dai, “Mission Reliability, Cost and Time for Cold
Standby Computing Systems with Periodic Backup,” IEEE Trans. Computers, Vol. 64, No. 4,
pp. 1043-1057, April 2015.
[Lewi87] Lewis, E. E., Introduction to Reliability Engineering, Wiley, New York, 1987.
[Nguy16] Nguyen, T. A., D. S. Kim, and J. S. Park, “Availability Modeling and Analysis of a Data
Center for Disaster Tolerance,” Future Generation Computer Systems, Vol. 56, pp. 27-50,
March 2016.
[Papo90] Papoulis, A., Probability & Statistics, Prentice Hall, 1990.
[Parh15] Parhami, B., “Amdahl's Reliability Law: A Simple Quantification of the Weakest-Link
Phenomenon,” IEEE Computer, Vol. 48, No. 7, pp. 55-58, July 2015.
[Parh20] Parhami, B., “Reliability Inversion: A Cautionary Tale,” IEEE Computer, Vol. 53, No. 6, pp.
28-33, June 2020.
[Prev07] Prevelakis, V. and D. Spinellis, “The Athens Affair: How Some Extremely Smart Hackers
Pulled off the Most Audacious Cell-Network Break-in Ever,” IEEE Spectrum, Vol. 44, No. 7,
pp. 26-33, July 2007.
[Ross72] Ross, S. M., Introduction to Probability Models, Academic Press, New York, 1972.
[Sahn96] Sahner, R. A., K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of
Computer Systems: An Example-Based Approach Using the SHARPE Software Package,
Kluwer, 1996.
[Shoo02] Shooman, M. L., Reliability of Computer Systems and Networks, Wiley, New York, 2002.
[Siew92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd ed., 1992.
[Tobi86] Tobias, P. A. and D. C. Trindade, Applied Reliability, Van Nostrand Reinhold, New York,
1986.



3 Combinational Modeling
“Torture numbers, and they’ll confess to anything.”
Gregg Easterbrook

“Doubt is not a pleasant condition, but certainty is absurd.”


Voltaire

“No papers for this session will be published. The purpose of


this is to permit the speakers to be very candid regarding the
various computer disasters which they are describing.”
From “abstract” of the session on “Anatomies of
Computer Disasters” in Proc. First Int’l Conf.
Computing in Civil Engineering, 1981

Topics in This Chapter


3.1. Modeling by Case Analysis
3.2. Series and Parallel Systems
3.3. Classes of k-out-of-n Systems
3.4. Reliability Block Diagrams
3.5. Reliability Graphs
3.6. The Fault-Tree Method

Combinational, or stateless, models allow various system-level dependability


parameters to be calculated from the relevant parameters of the component parts
or subsystems that comprise the system. In this chapter, we introduce a number of
basic tools or building blocks for dependability evaluation: case analysis, series
systems, parallel systems, and k-out-of-n systems. We then apply and extend these
tools to the task of dependability analysis by means of three widely applicable
graphical representations for system dependability modeling. We start with the
simple and intuitive reliability block diagrams and end with the widely used fault
trees, covering the lesser-known reliability graphs in between.


3.1 Modeling by Case Analysis

Given a set of components, subsystems, or other parts that comprise a system, one can
determine the probability of the system being operational by enumerating all possible
combinations of good and bad parts that result (or do not result) in system failure. The
sum of the probabilities of the failing (working) combinations then gives the system
unreliability (reliability). This method works well when the number of parts is fairly small
and their interactions and mutual influences are well understood.

Example 3.1: Reliability modeling of a multiprocessor A multiprocessor system consists of


two processors, a common bus, and four memory modules. Given reliabilities for each of the three
component types, what is the system reliability, assuming that at least one processor and one
memory module, connected by the common bus, are needed for system operation?

Solution: The common bus is a critical system part. The system fails if the bus, both processors,
or all four memory modules malfunction. This is the same as saying that the system functions
properly if the bus, one of the two processors, and one of the four memory modules work. This has
the probability R = rb[1 – (1 – rp)2][1 – (1 – rm)4].
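As a sanity check, the expression just derived is easy to evaluate numerically. In the Python sketch below, the component reliabilities are invented for illustration and are not from the text:

```python
def multiprocessor_reliability(r_b, r_p, r_m):
    """Example 3.1: bus in series with a 2-way parallel processor pair
    and a 4-way parallel memory bank."""
    return r_b * (1 - (1 - r_p)**2) * (1 - (1 - r_m)**4)

# Invented values: even modest processor/memory reliabilities leave the
# bus as the dominant (series) factor.
print(multiprocessor_reliability(0.99, 0.95, 0.90))
```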

Example 3.1 was simple enough to allow us to write the pertinent reliability equation
directly. We now illustrate how case analysis can be used to reduce the complexity of
reliability evaluation using the same simple example.

First consider two cases: the bus system works (probability rb) or it does not work
(probability 1 – rb). We begin constructing a tree, as in Fig. 3.1, where the root labeled
“No information” has two children, labeled “Bus okay” and “Bus bad.” If we are
interested in enumerating the operational system configurations, we can ignore the
subtree corresponding to “Bus bad.” We next focus on the processors and form three
subtrees for the “Bus okay” branch, labeling them “Both processors okay” (probability
rp2), “One processor okay” (probability 2rp(1 – rp)), and “Both processors bad”
(probability (1 – rp)2). We can merge the first two of these branches and assigning the
resulting “At least one processor okay” branch the probability 2rp – rp2, because the two
are identical with respect to the proper functioning of the system. Continuing in this
manner, we arrive at all possible leaf nodes associated with working configurations.
Adding the probabilities of these leaf nodes yields the overall system reliability. We can
stop expanding each branch as soon as the reliability equation for the corresponding state
can be written directly.


[Figure content: a case-analysis tree. The root “No information” branches into “Bus okay” (probability rb) and “Bus bad” (1 – rb); “Bus okay” branches into “At least one processor okay” (1 – (1 – rp)2) and “Both processors bad” ((1 – rp)2); the former branches into “At least one memory module okay” (1 – (1 – rm)4) and “All four memory modules bad” ((1 – rm)4).]

Fig. 3.1 Example of reliability evaluation by case analysis.

We further illustrate the method of case analysis with two additional examples.

Example 3.2: Data availability modeling with home and mirror sites Use case analysis to
derive the availability of data for the system in Example 1.1.

Solution: The required case-analysis tree is depicted in Fig. 3.2, leading to the availability
equation A = aSaL + (1 – aSaL)aSaL = 2aSaL – (aSaL)2.

Example 3.3: Data availability modeling with triplication Use case analysis to derive the
availability of data for the system in Example 1.2.

Solution: The required case-analysis tree is depicted in Fig. 3.3, leading to the availability
equation A = aSaL + (1 – aSaL)aSaL + (1 – aSaL)2aSaL = 3aSaL – 3(aSaL)2 + (aSaL)3.
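Note that both results are instances of the parallel-system pattern 1 – (1 – a)^n with a = aSaL (n = 2 for mirroring, n = 3 for triplication). The Python sketch below, with illustrative availability values, verifies the algebra:

```python
def mirrored_availability(a):     # Example 3.2: home site plus mirror site
    return 2*a - a**2

def triplicated_availability(a):  # Example 3.3: home site plus two backups
    return 3*a - 3*a**2 + a**3

# Check against the parallel-system form 1 - (1 - a)**n for sample values.
for a_s, a_l in [(0.99, 0.95), (0.90, 0.80)]:
    a = a_s * a_l                 # a = aS * aL, availability of one copy
    assert abs(mirrored_availability(a) - (1 - (1 - a)**2)) < 1e-12
    assert abs(triplicated_availability(a) - (1 - (1 - a)**3)) < 1e-12
```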


[Figure content: (a) system configuration; (b) case-analysis tree with branch probabilities aSaL, (1 – aS)aL, and 1 – aL.]

Fig. 3.2 Case analysis used to derive the data availability equation
for Example 1.1 (home and mirror sites).

[Figure content: (a) system configuration; (b) case-analysis tree in which each of the two levels branches with probabilities aSaL, (1 – aS)aL, and 1 – aL.]

Fig. 3.3 Case analysis used to derive the data availability equation
for Example 1.2 (home site and two backups).


3.2 Series and Parallel Systems

A series system of n components is one in which the proper operation of each of the n
components is required for the system to perform its function. Such a system is
represented as in Fig. 3.4a, with each rectangular block denoting one of the subsystems.

(a) Block diagram (b) Example with valves

Fig. 3.4 Series system block diagram and example with valves that
are prone to stuck-on-shut failures.

Given the reliability Ri for the ith component of a series system, the overall reliability of
the system, assuming independence of failures in the subsystems, is given by:

R = ∏1≤i≤n Ri (3.2.ser1)

For example, if we place four valves in tandem on a segment of a pipe connected to a


reservoir (Fig. 3.4b), with the component valves being prone to stuck-on-shut failures, we
have a series system. Note that the term “series” in series system does not imply that the
subsystems are physically connected in series in the mechanical or electrical sense. If our
valves were prone to stuck-on-open failures only, then a four-unit series system would
actually consist of the valves being connected in parallel in the mechanical sense. In a
four-way system of parallel valves, the stuck-on-open failure of any one of the valves
will cause a stuck-on-open failure at the system level, thus with parallel valves that only
fail in the stuck-on-open mode, we have a series system in the reliability theoretic sense.

If the ith component in a series system has a constant hazard rate λi, thus having
exponential reliability, the overall system will have exponential reliability with the hazard
rate ∑1≤i≤n λi. This is a direct consequence of equation (3.2.ser1). With repairable components,
having hazard rate λi and repair rate μi, the availability of a series system of n
components is related to the individual module availabilities Ai = μi / (λi + μi) by:

A = ∏1≤i≤n Ai (3.2.ser2)


Equation (3.2.ser2) is valid when component failures/repairs are independent and we


have a separate repairperson or facility for each unit; in other words, concurrent failure of
multiple units does not slow down the repair of any one unit.

A parallel system of n components is one in which the proper operation of a single one of
the n components is sufficient for the system to perform its function. Such a system is
represented as in Fig. 3.5a, with each rectangular block denoting one of the subsystems.

(a) Block diagram (b) Example with valves

Fig. 3.5 Parallel system block diagram and example with valves that
are prone to stuck-on-shut failures.

Given the reliability Ri for the ith component of a parallel system, the overall reliability
of the system, assuming independence of failures in the subsystems, is given by:

R = 1 – ∏1≤i≤n (1 – Ri) (3.2.par1)

For example, placing a valve on each of four branches of a pipe (Fig. 3.5b), with the
component valves being prone to stuck-on-shut failures, yields a parallel system; we can
still control access to the reservoir even if three of the valves fail in the stuck-on-shut
mode. Again, the term “parallel” in parallel system does not imply that the subsystems
are physically connected in parallel in the mechanical or electrical sense.

If the components in a parallel system are repairable, with the ith component having a
hazard rate λi and repair rate μi, the availability of a parallel system of n components is
related to the individual module availabilities Ai = μi / (λi + μi) by:

A = 1 – ∏1≤i≤n (1 – Ai) (3.2.par2)


Equation (3.2.par2) is valid when component failures/repairs are independent and we


have a separate repairperson or facility for each unit; in other words, concurrent failure of
multiple units does not slow down the repair of any one unit.
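Equations (3.2.ser1) through (3.2.par2) translate directly into code. The Python sketch below, with invented component values, computes series and parallel reliability, plus per-component steady-state availability, under the independence assumptions just stated:

```python
from math import prod

def series_reliability(rs):
    """Eqn. (3.2.ser1): all components must work."""
    return prod(rs)

def parallel_reliability(rs):
    """Eqn. (3.2.par1): at least one component must work."""
    return 1 - prod(1 - r for r in rs)

def availability(lam, mu):
    """Steady-state availability A_i = mu_i / (lam_i + mu_i)."""
    return mu / (lam + mu)

# Four identical valves, as in Figs. 3.4b and 3.5b (reliability value invented):
valves = [0.98] * 4
print(series_reliability(valves))    # stuck-on-shut valves in tandem
print(parallel_reliability(valves))  # stuck-on-shut valves on parallel branches
```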

Reliability and availability equations for series and parallel systems are quite simple. This
does not mean, however, that proper application of these equations does not require
careful thinking. The following example illustrates that care must be exercised in
performing even simple dependability analyses.

Example 3.4: A two-way parallel system In a passenger plane, the failure rate of the
cabin-pressurizing system is 10–5/hr and the failure rate of the oxygen-mask deployment system is
also 10–5/hr. What is the probability of loss of life due to both systems failing during a 10-hour flight?

Possible solution 1: Given the assumption of failure independence, both systems fail together at a
rate of 10–10/hr. Thus, the fatality probability for a 10-hour flight is 10–10 × 10 = 10–9. Fatality odds of
1 in a billion or less are generally deemed acceptable in safety-critical systems.

Possible solution 2: The probability of the cabin-pressurizing system failing during a 10-hour
flight is 10–4. The probability of the oxygen-mask system failing during the flight is also 10–4.
Given the assumption of independence, the probability of both systems failing together during the
flight is 10–8. This latter probability is higher than acceptable norms for safety-critical systems.

Analysis: So, which of the two solutions is correct? Neither one. Here’s why. When we multiply
the two per-hour failure rates and then take the flight duration into account, we are assuming that
only the failure of the two systems within the same hour is catastrophic. This produces the
optimistic reliability estimate 1 – 10–9. When we multiply the two flight-long failure probabilities, we are
assuming that the failures of both systems would be catastrophic, no matter when each occurs
during the flight. This produces the pessimistic reliability estimate 1 – 10–8. The reader should be
able to supply examples of when the two systems fail at different times during a flight, without
leading to a catastrophe.

The simple reliability equation (3.2.par1) for a parallel system is based on the assumption
that all n subsystems contribute to the proper system functioning at the same time, and
each is capable of performing the entire job, so that the failure of up to n – 1 of the n
subsystems will be noncritical. This happens, for example, if we have n concurrently
operating ventilation systems in a lab, each with its own power supply, in order to ensure
the proper removal of hazardous fumes. If the capacity of one of the subsystems is
inadequate and we need at least two of them to perform the job, we no longer have a
parallel system, but a 2-out-of-n system (see Section 3.3). Similarly, if only one of the


subsystems is active at any given time, with the others activated in turn upon detection of
a failure, then equation (3.2.par1) is valid only if failure detection is perfect and
instantaneous, and the activation of spares is always successful.

The simplest way to account for imperfect failure detection and activation of spares in a
parallel system is via the use of a coverage parameter c, with c < 1. The coverage
parameter is defined as the probability that the switchover from an operating module to a
spare module goes without a hitch. Thus, in a two-unit parallel system in which the
primary module has reliability r1 and the spare has reliability r2, the system reliability is:

R = r1 + (1 – r1)cr2 (3.2.cov1)

Equation (3.2.cov1) essentially tells us that the two-way parallel system with imperfect
coverage will work if unit 1 works, or if unit 1 fails, but the switchover is successful and
unit 2 works. With modules having identical reliability r, equation (3.2.cov1) becomes:

R = r[1 + c(1 – r)] = r{1 – [c(1 – r)]^2} / [1 – c(1 – r)] (3.2.cov2)

The rightmost expression in equation (3.2.cov2) allows us to generalize the reliability


equation to the case of an n-way parallel system with imperfect coverage c:

R = r{1 – [c(1 – r)]^n} / [1 – c(1 – r)] (3.2.cov3)

Deriving equation (3.2.cov3) is left as an exercise. The crucial impact of coverage on


system reliability is evident from Fig. 3.6, assuming a module reliability of r = 0.95.


Fig. 3.6 Adding spares to a parallel system is unhelpful in the
absence of good coverage c. Module reliability is r = 0.95.

Note that, in practice, the coverage factor is not a constant, but deteriorates with more
spares. In this case the depiction of the effect of coverage in Fig. 3.6 may be viewed as
optimistic. So, adding a large number of spares is not only unhelpful (as suggested by the
saturation effect in Fig. 3.6), but it may actually be detrimental to system reliability.
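The saturation effect in Fig. 3.6 is easy to reproduce from equation (3.2.cov3). In the Python sketch below, module reliability is r = 0.95, as quoted for the figure; the coverage values chosen for the sweep are illustrative:

```python
def parallel_with_coverage(r, c, n):
    """Eqn. (3.2.cov3): n-way parallel system, module reliability r, coverage c.
    R = r * (1 - [c*(1-r)]**n) / (1 - c*(1-r))."""
    x = c * (1 - r)
    return r * (1 - x**n) / (1 - x)

r = 0.95
for c in (1.0, 0.99, 0.9):
    row = [round(parallel_with_coverage(r, c, n), 6) for n in range(1, 6)]
    print(f"c = {c}: {row}")   # gains flatten out quickly for c < 1
```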


3.3 Classes of k-out-of-n Systems

In a k-out-of-n system, there are n modules, the proper functioning of any k of which is
sufficient for system operation. Note that both series (n-out-of-n) and parallel (1-out-of-n)
systems are special cases of this more general class. For example, if you have one spare
tire in the trunk of your car, then, ignoring the possible difference between a spare tire
and a regular tire, your tire system is a 4-out-of-5 system. You can continue driving your
car as long as at most one tire malfunctions, assuming successful switchover from a
regular tire to the spare tire. If you carry two spare tires in your trunk, then your tire
system may be described as a 4-out-of-6 system.

One of the most commonly used systems of this type is a 2-out-of-3 system, depicted in
Fig. 3.7. This redundancy scheme is known as triple modular redundancy (TMR) and
relies on a decision circuit, or voter, to deduce the correct output based on the outputs it
receives from three concurrently operating modules.

Fig. 3.7 Triple modular redundancy with voting.

Assuming a perfect (always working) voting unit, the reliability of a TMR system with
module reliabilities r1, r2, and r3 is:

R = r1r2r3 + r1r2(1 – r3) + r2r3(1 – r1) + r3r1(1 – r2) (3.2.TMR1)

In the special case of identical modules of reliability r, equation (3.2.TMR1) becomes R


= 3r2 – 2r3. Accounting for an imperfect voting unit with reliability rv, a TMR system
with identical modules has reliability:

R = rv(3r2 – 2r3) (3.2.TMR2)


Assuming exponential reliability r = e^(–λt) for each of the three modules and taking the
voter to be perfectly reliable, the MTTF parameter of a TMR system can be obtained
based on eqns. (2.2.MTTF) and (3.2.TMR2) with rv = 1:

MTTFTMR = ∫0∞ R(t) dt = ∫0∞ [3e^(–2λt) – 2e^(–3λt)] dt = 3/(2λ) – 2/(3λ) = 5/(6λ) (3.2.MTTF)

Note that even though the reliability of a TMR system is greater than that of a single
module, its MTTF deteriorates from 1/λ to 5/(6λ).
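The closed form 5/(6λ) can be double-checked by numerically integrating the TMR reliability function. The Python sketch below uses a simple trapezoidal rule; the value of λ and the integration parameters are arbitrary choices:

```python
from math import exp

def tmr_mttf_numeric(lam, steps=200_000):
    """Trapezoidal integration of R_TMR(t) = 3e^(-2*lam*t) - 2e^(-3*lam*t)."""
    t_max = 30 / lam                       # far into the tail; R(t_max) ~ 0
    h = t_max / steps
    R = lambda t: 3*exp(-2*lam*t) - 2*exp(-3*lam*t)
    total = 0.5 * (R(0) + R(t_max)) + sum(R(i*h) for i in range(1, steps))
    return h * total

lam = 0.001
print(tmr_mttf_numeric(lam), 5 / (6*lam))  # the two values should nearly agree
```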

The reliability equation for a k-out-of-n system with an imperfect voting unit and
identical modules is:

R = rv ∑k≤i≤n C(n, i) r^i (1 – r)^(n–i) (3.2.kofn)

In the special case of odd n with k = (n + 1)/2, the k-out-of-n scheme uses majority voting
and is sometimes referred to as n-modular redundancy (NMR). It is readily seen from
equations (3.2.TMR2) and (3.2.kofn) that TMR and NMR methods lead to significant
reliability improvement only if the voting unit is much more reliable than the modules
performing the system functions.

A key element in the application of k-out-of-n redundancy schemes, and their special
cases of majority voting, is the design of appropriate “voting” circuits. Considerations in
the design of voting circuits are discussed in Chapter 12.

Replicating the voters and performing the entire computation in three separate and
independent channels is one way of removing the voting circuits from the critical system
core. Figure 3.8 shows how voter triplication in a TMR system will allow voter failures
as well as module failures to be tolerated. As the oval dashed boxes indicate, the voter
reliability can be lumped with module reliability, instead of it appearing separately, as in
equations (3.2.TMR2) and (3.2.kofn).


Fig. 3.8 TMR system with nonreplicated and replicated voting units.

Note that in writing the reliability equation (3.2.TMR1) for a TMR system, we have
pessimistically assumed that any two module failures will render the system useless. This
is not always the case. For example, if the modules produce single-bit outputs, then when
the output of one module is stuck-on-1, while a second module’s output is stuck-on-0, the
system can still produce the correct output, despite the occurrence of double module
failures. Such compensating failures, as well as situations where problems are detected
because the multiple modules produce distinct erroneous results, leading to a lack of
majority agreement, are discussed in Chapter 12.

Two variants of k-out-of-n systems also merit discussion, although they are not in
common use for modeling computer systems. We first note that the type of k-out-of-n
system we have covered thus far can be called k-out-of-n:G system, with the added
qualifier “G” designating that the system is “good” when at least k of its n modules are
good. We may also encounter k-out-of-n:F systems, in which the failure of any k or more
subsystems is tantamount to system failure. Clearly, a k-out-of-n:F system is identical to
an (n – k + 1)-out-of-n:G system. So, the new notation is unnecessary for the type of
systems we have been discussing.

A consecutive k-out-of-n:G system is one in which the n modules are linearly ordered,
say, by indexing them from 1 to n, with the failure of any k consecutive modules causing
system failure. So, for example, such a system may not be able to function with exactly k
working modules, unless these k modules happen to be consecutive.


Example 3.5: Consecutive 3-out-of-5:G system Three values can be transmitted over three of
five buses, using shift switches at the interface, as depicted in Fig. 3.9. Shift switches are
controlled using a common set of control signals that puts all of them in the upward, straight, or
downward connection state. Such a reconfiguration scheme is easier and less costly to implement
than arbitrary (crossbar) connectivity and is thus quite popular.

Solution: The reliability of this system is different from an ordinary 3-out-of-5 system, because,
for example, the outage of the middle bus line is not tolerated, even if it is the only one. Let each
bus line have reliability r and assume that the switches are perfect. The system works when all 5
bus lines are okay, or any 4 are okay, except if the middle bus line is the bad one (4 cases in all),
or if the set of 3 bus lines {1, 2, 3}, {2, 3, 4}, or {3, 4, 5} are good, with the remaining 2 being
bad. Adding the three terms corresponding to the cases above, we get the system reliability
equation: R = r^5 + 4r^4(1 – r) + 3r^3(1 – r)^2 = 3r^3 – 2r^4.
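The case analysis above is easy to mechanize. The Python sketch below (names are ours; bus line i is index i – 1 in 0-origin code) checks the closed form 3r^3 – 2r^4 against a brute-force sum over all 2^5 bus-line states:

```python
from itertools import product

def consec_3_of_5_reliability(r):
    """Exact reliability of the consecutive 3-out-of-5:G bus of Fig. 3.9, by
    enumerating all 2^5 bus-line states: the system works iff one of the
    consecutive triples {1,2,3}, {2,3,4}, {3,4,5} is entirely good."""
    triples = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]  # 0-origin indices
    total = 0.0
    for state in product([0, 1], repeat=5):      # 1 = good line, 0 = failed line
        if any(all(state[i] for i in t) for t in triples):
            p = 1.0
            for s in state:
                p *= r if s else 1 - r
            total += p
    return total

r = 0.9
print(consec_3_of_5_reliability(r), 3 * r**3 - 2 * r**4)  # both 0.8748
```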


(a) Shift switch (b) Bus with consecutive-3-out-of-5 redundancy

Fig. 3.9 A consecutive 3-out-of-5:G system.

The second variant is the class of consecutive k-out-of-n:F systems. Here, any k or more
consecutive failed modules render the system nonfunctional. So, such a system may
tolerate more than k module failures, provided the failed modules are not consecutive.


Example 3.6: Consecutive 2-out-of-n:F system Consider a system of n street lights, where the
lights provide a minimum level of illumination deemed adequate for safety, unless two or more
consecutive lights are out. What is the reliability of this consecutive 2-out-of-n:F system?

Solution: The reliability of this system is different from an ordinary (n – 1)-out-of-n system,
because, for example, the safety criterion is met even if every other light is out. Let each street
light have reliability r. Let f(n) be the reliability for a consecutive 2-out-of-n:F system. Then, we
can write f(n) = r f(n – 1) + r(1 – r) f(n – 2), with f(1) = 1 and f(2) = 2r – r^2. The two terms in the
equation for f(n) correspond to the two possible cases for the first street light. If that light is
working, then the system will be okay if the remaining n – 1 lights do not suffer 2 consecutive
outages. Otherwise, if the first light is out, then the second light must be working, and the
remaining n – 2 lights should not have 2 consecutive outages. The recurrence and its associated
initial conditions allow us to compute f(n) for any value of n, either numerically for a given value
of r or symbolically for arbitrary r. For example, we find f(5) = r^2 + 3r^3 – 4r^4 + r^5.
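The recurrence, with its coefficient r(1 – r) on the f(n – 2) term, is readily iterated bottom-up. A small Python sketch (function name ours) confirming the n = 5 closed form at a sample value of r:

```python
def consec_2_of_n_F_reliability(n, r):
    """f(n) for the street-light system: probability that no 2 consecutive
    lights are out, via f(n) = r*f(n-1) + r*(1-r)*f(n-2)."""
    if n == 1:
        return 1.0
    f2, f1 = 1.0, 2 * r - r * r  # f(1) and f(2)
    for _ in range(3, n + 1):
        f2, f1 = f1, r * f1 + r * (1 - r) * f2
    return f1

r = 0.9
print(consec_2_of_n_F_reliability(5, r))  # from the recurrence
print(r**2 + 3 * r**3 - 4 * r**4 + r**5)  # closed form for n = 5
```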

In some consecutive k-out-of-n:G or k-out-of-n:F systems, the module indexing is
considered circular, rather than linear. In this case, modules n and 1 are viewed as being
consecutive, so additional modes for the system being operational (k-out-of-n:G) or
failing (k-out-of-n:F) become possible.


3.4 Reliability Block Diagrams

A reliability block diagram (RBD) is simply a combination of modules connected in
series or parallel forms. The example RBD in Fig. 3.11 may represent a multiprocessor,
with A, F, and G being critical resources (such as bus, shared memory, and power
supply), and B-D and C-E being two separate processors with their associated memory
modules. Or the same diagram may represent an office with three critical employees and
four clerks that can pair up in a particular way to perform the tasks required of them, with
one pair (B-D or C-E) being adequate for the expected functions. The latter example is
quite useful for making the point that an RBD does not represent electrical or mechanical
linking of modules, but rather their interactions in terms of system reliability.


Fig. 3.11 Example reliability block diagram.

A reliability block diagram is best understood in terms of its success paths. A success
path is simply a path through the modules that leads from one side of the diagram to the
other. In the case of Fig. 3.11, the success paths are A-B-D-F-G and A-C-E-F-G. By
definition, the system modeled by a reliability block diagram is functional if all the
modules on at least one success path are functional.

The reliability equation corresponding to an RBD can be easily derived by applying the
series and parallel reliability equations (3.2.ser1) and (3.2.par1). In the case of the RBD
in Fig. 3.11, using rX to denote the reliability of module X, we have:

R = rA [1 – (1 – rB rD)(1 – rC rE)] rF rG (3.4.RBD1)

If all modules in Fig. 3.11 have the same reliability r, equation (3.4.RBD1) reduces to
R = r^5(2 – r^2). Note that because 2 – r^2 > 1, the system modeled is more reliable than a
series system with five identical modules, as one would expect.
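Reliability equations such as (3.4.RBD1) can be composed mechanically from series and parallel building blocks. A Python sketch (helper names are ours):

```python
def series(*rs):
    """Reliability of a series connection: product of module reliabilities."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    """Reliability of a parallel connection: 1 minus product of unreliabilities."""
    q = 1.0
    for r in rs:
        q *= 1 - r
    return 1 - q

def fig_3_11_reliability(rA, rB, rC, rD, rE, rF, rG):
    """Equation (3.4.RBD1): A in series with (B-D parallel to C-E), then F, G."""
    return series(rA, parallel(series(rB, rD), series(rC, rE)), rF, rG)

r = 0.9
print(fig_3_11_reliability(*[r] * 7), r**5 * (2 - r**2))  # identical-module case
```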


Example 3.7: Parallel-series and series-parallel systems Denoting the reliability of module j
in Fig. 3.12 as rj:
a. Derive the reliability equation for the parallel-series system of Fig. 3.12a.
b. Derive the reliability equation for the series-parallel system of Fig. 3.12b.
c. Compare the reliability expressions derived in parts a and b and discuss.

Solution: For parts a and b, we use equations (3.2.ser1) and (3.2.par1) in turn.
a. Ra = 1 – (1 – r1 r2)(1 – r3 r4)
b. Rb = [1 – (1 – r1)(1 – r3)] [1 – (1 – r2)(1 – r4)]
c. After some simple algebraic manipulation, the difference of the reliabilities for parts a and b
is found to be Rb – Ra = r1r4(1 – r2)(1 – r3) + r2r3(1 – r1)(1 – r4). Because the difference is
always positive, the series-parallel configuration of Fig. 3.12b always offers better reliability
compared with the parallel-series arrangement of Fig. 3.12a. We should have been able to
predict this advantage, which is precisely due to Fig. 3.12b surviving when modules 1 and 4
are operational, while modules 2 and 3 have failed, or vice versa.
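The identity derived in part c can be spot-checked numerically. A Python sketch with arbitrary illustrative module reliabilities:

```python
def parallel_series(r1, r2, r3, r4):
    """Fig. 3.12a: series pairs 1-2 and 3-4 connected in parallel."""
    return 1 - (1 - r1 * r2) * (1 - r3 * r4)

def series_parallel(r1, r2, r3, r4):
    """Fig. 3.12b: parallel pairs 1|3 and 2|4 connected in series."""
    return (1 - (1 - r1) * (1 - r3)) * (1 - (1 - r2) * (1 - r4))

r1, r2, r3, r4 = 0.9, 0.8, 0.7, 0.95  # arbitrary illustrative values
ra = parallel_series(r1, r2, r3, r4)
rb = series_parallel(r1, r2, r3, r4)
diff = r1 * r4 * (1 - r2) * (1 - r3) + r2 * r3 * (1 - r1) * (1 - r4)
print(rb - ra, diff)  # the two quantities agree, and are positive
```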

(a) Parallel-series (b) Series-parallel

Fig. 3.12 Parallel-series and series-parallel example reliability block diagrams.

The basic series-parallel RBDs discussed thus far can be extended in several different
ways. One way of extending RBDs is to allow k-out-of-n structures in addition to n-out-
of-n (series) and 1-out-of-n (parallel) constructs. Such a structure is drawn as a set of
parallel blocks with a suitable notation indicating that k out of the n blocks must be
functional. This can take the form of an annotation next to the blocks, or a voter-like
connector on the right-hand side on the parallel group into which the label “k of n” or “k /
n” is inscribed. The use of such k-out-of-n structures does not complicate the derivation
of the reliability equation: we simply use equation (3.2.kofn) in this case, in lieu of
equation (3.2.par1).


B E

A C F 2/3 H

D G

Fig. 3.13 Example of extended RBDs, with k-out-of-n structures.

A second way to extend RBDs is to allow connectivity patterns that are more general
than series-parallel. For example, the bridge pattern of Fig. 3.14 would constitute such an
extended RBD. In this example, one may view module 5 is being capable of replacing
modules 2 and 3 when the latter interact with modules 1 and 4 (but not when module 3
should cooperate with module 6).


Fig. 3.14 Example of an extended RBD, allowing more general connectivity patterns than series-parallel.

A third way to extend RBDs is to allow repeated blocks [Misr70]. Figure 3.15 depicts
two ways of representing a 2-out-of-3 structure, using parallel-series and series-parallel
connection of blocks A, B, and C, with repetition.


(a) Parallel-series: series pairs A-B, B-C, and C-A connected in parallel
(b) Series-parallel: parallel pairs A|B, B|C, and C|A connected in series

Fig. 3.15 Two ways of representing a 2-out-of-3 structure by means of repeated modules.

When RBDs are not of the simple series/parallel variety or when they have repeated
elements, special methods are required for their analysis. The following two examples
demonstrate such methods.

Example 3.8: Non-series/parallel RBDs Consider the extended RBD in Fig. 3.14 and denote
the reliability of module i by ri. Derive an expression for the overall system reliability.

Solution: The system functions properly if a string of healthy modules connect one side of the
diagram to the other. Because module 2 is the unit whose role deviates from series or parallel
connection, we will perform a case analysis by assuming that it works (replacing it with a line
connecting modules 1 and 3) or does not work (disconnecting modules 1 and 3). We thus get the
system reliability equation R = r2R2good + (1 – r2)R2bad, where R2good and R2bad are conditional
reliabilities for the two cases just mentioned. Our solution is complete upon noting that R2good =
r4[1 – (1 – r1)(1 – r6)][1 – (1 – r3)(1 – r5)] and R2bad = r4[1 – (1 – r1r5)(1 – r3r6)].
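The conditioning argument of Example 3.8 maps directly to code. A Python sketch (function name ours) for the bridge of Fig. 3.14:

```python
def bridge_reliability(r1, r2, r3, r4, r5, r6):
    """Reliability of the extended RBD of Fig. 3.14, conditioning on module 2
    as in Example 3.8: R = r2*R2good + (1 - r2)*R2bad."""
    r2good = r4 * (1 - (1 - r1) * (1 - r6)) * (1 - (1 - r3) * (1 - r5))
    r2bad = r4 * (1 - (1 - r1 * r5) * (1 - r3 * r6))
    return r2 * r2good + (1 - r2) * r2bad

print(bridge_reliability(*[0.9] * 6))  # about 0.8806
```

With all module reliabilities equal to 0.9, this evaluates to about 0.8806, noticeably better than the series chain 1-2-3-4 alone (0.9^4 ≈ 0.656).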

Example 3.9: Extended RBDs with repeated elements Consider an extended RBD that is 3-
way parallel, with each of the parallel branches being a series connection, as follows: (a) 1-5-4,
(b) 1-2-3-4, and (c) 6-3-4. Boxes with the same number denote a common module. So, for
example, the two occurrences of 1 in the diagram represent a common module 1. This RBD may
be viewed as equivalent to that in Fig. 3.14, in that it has the same success paths. So the analysis of
this example is another way of solving Example 3.8. Derive a reliability expression for this RBD.

Solution: The inequality R ≤ 1 – ∏i (1 – Ri), where Ri is the reliability of the ith success path,
provides an upper bound on system reliability. The reason that the expression on the right-hand side
represents an upper bound rather than an exact value is that it takes multiple occurrences of the same
module as having independent failures. In the case of our example, we get R ≤ 1 – (1 – r1r5r4)(1 –
r1r2r3r4)(1 – r6r3r4). It turns out that if we multiply out the parenthesized terms on the right-hand side of
the foregoing inequality, but do this by ignoring the higher powers of each reliability term, an exact
reliability formula results.


For our example, the process just outlined yields R = r1r4r5 + r1r2r3r4 + r3r4r6 – r1r2r3r4r5 –
r1r3r4r5r6 – r1r2r3r4r6 + r1r2r3r4r5r6.
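Because this RBD has only six distinct modules, the exact reliability can also be obtained by brute-force enumeration of all 2^6 module states, which sidesteps the algebra entirely and provides an independent check on hand-derived formulas. A Python sketch (names ours):

```python
from itertools import product

PATHS = [(1, 5, 4), (1, 2, 3, 4), (6, 3, 4)]  # the three success paths

def reliability_by_enumeration(rel):
    """Exact reliability: sum the probabilities of all 2^6 module states in
    which at least one success path is fully working. Repeated blocks cause no
    trouble, because each module's state is drawn exactly once per outcome."""
    total = 0.0
    for bits in product([0, 1], repeat=6):
        state = {m + 1: b for m, b in enumerate(bits)}  # 1 = working
        if any(all(state[m] for m in path) for path in PATHS):
            p = 1.0
            for m in range(1, 7):
                p *= rel[m] if state[m] else 1 - rel[m]
            total += p
    return total

print(reliability_by_enumeration({m: 0.9 for m in range(1, 7)}))  # about 0.8741
```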

Thus far, we have taken the modules in an RBD to be independent of each other. More
sophisticated models take the possibility of common-cause failures into account or allow
the failure of some modules to affect the proper functioning of others, perhaps after a
randomly variable amount of time [Levi13].


3.5 Reliability Graphs

A reliability graph (RG) is a schematic representation of system components, their
interactions, and their roles in proper system operation in a manner that is more general
than RBDs. An RG is an acyclic directed graph, with edges corresponding to system
components. There are unique source and sink nodes, typically drawn on the left side and
right side of the diagram, respectively, with a directed path from source to sink defining a
success path. As the names imply, a source node has no incoming edges, while a sink
node has no outgoing edges. Some edges are labeled “∞” and correspond to hypothetical
modules that are infinitely reliable.

Figure 3.16 depicts an example reliability graph having success paths A-E-H-L-N, B-D-
G-M, and C-F-H-K-M, among others. A reliability graph can be analyzed by converting
it to a number of series/parallel structures through case analysis.

Fig. 3.16 Example of a reliability graph.

Reliability graphs are more powerful than simple series/parallel RBDs in terms of
representational power, but they are less powerful than the most general form of fault
trees, to be described next.


3.6 The Fault-Tree Method

A fault tree is a tool for top-down reliability analysis. To construct a fault tree, we start at
the top with an undesirable event called a “top event” and determine all the possible ways
in which the top event can occur. The fault-tree method allows us to determine, in a
systematic fashion, how the top event can be caused by individual or combined lower-
level undesirable events. Figure 3.17 contains an informal description of the building
process, as well as some of the pertinent symbols used.

Fig. 3.17 Constructing a fault tree and some of the pertinent symbols.

For example, if the top event is being late for work, its immediate causes might be clock
radio not turning on, family emergency, or the bus not running on time. Going one level
down, the clock radio might fail to turn on due to the coincidence of a power outage and
its battery being dead.

Once a fault tree has been built, it can be analyzed in at least two ways: using the cut-set
method and via conversion to a reliability block diagram.

A cut set is any set of initiators so that the failure of all of them induces the top event. A
minimal cut set is a cut set for which no subset is also a cut set. As an example, for the
fault tree of Fig. 3.18, the minimal cut sets are {a, b}, {a, d}, and {b, c}. The equivalent
RBD for a given fault tree is one that has the same minimal cut sets. An example is
depicted in Fig. 3.18. Note that the equivalent RBD may not be unique.


(a) Fault tree (b) RBD

Fig. 3.18 An example fault tree and its RBD equivalent.

Other than allowing us to probabilistically analyze a fault tree, cut sets also help in
common-cause failure assessment and exposition of system vulnerability: a small cut set
indicates high vulnerability. The notion of path set is the dual of cut set. A path set is any
set of initiators so that if all of them are fault-free, the top event is inhibited. One can
derive the path sets for a fault tree by exchanging AND and OR gates and then obtaining
the cut sets for the transformed tree.
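The duality can be verified by brute force for the fault tree of Fig. 3.18, whose minimal cut sets are {a, b}, {a, d}, {b, c}. The sketch below (names ours) relies on the standard observation that every path set must intersect every cut set; otherwise some cut set could fail in its entirety while all of the path set's initiators remain fault-free:

```python
from itertools import combinations

CUT_SETS = [{"a", "b"}, {"a", "d"}, {"b", "c"}]  # minimal cut sets, Fig. 3.18

def is_path_set(candidate, cut_sets):
    """A path set must share at least one initiator with every cut set."""
    return all(candidate & cs for cs in cut_sets)

elements = sorted({e for cs in CUT_SETS for e in cs})
minimal_paths = []
for k in range(1, len(elements) + 1):
    for cand in map(set, combinations(elements, k)):
        # Keep a candidate only if no smaller path set is contained in it.
        if is_path_set(cand, CUT_SETS) and not any(p <= cand for p in minimal_paths):
            minimal_paths.append(cand)

print(sorted(map(sorted, minimal_paths)))  # [['a', 'b'], ['a', 'c'], ['b', 'd']]
```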

Example 3.10: Fault trees and RBDs Consider a system exhibiting the minimal cut sets {a, b},
{a, c}, {a, d}, and {c, d, e, f}.
a. Construct a fault tree for the system.
b. Derive an equivalent RBD.
c. What is the path set for this example?

Solution:
a. A fault tree for the system is depicted in Fig. 3.19a.
b. One possible RBD is depicted in Fig. 3.19b.
c. Exchanging AND and OR gates in Fig. 3.19a, we find the path set of the original diagram as
the cut set of the transformed diagram, thus obtaining: {a, c}, {a, d}, {a, e}, {a, f}, {b, c, d}.



(a) Fault tree (b) RBD

Fig. 3.19 Fault tree and RBD associated with Example 3.10.

In conclusion, we note that the combinational models introduced in this chapter constitute
a hierarchy in terms of their representational power [Malh94]. At the top of this
hierarchy, we have FTs with repeated elements. Reliability graphs are next, with
somewhat less representational power. Finally, at the bottom of the hierarchy we have
RBDs and ordinary FTs (no repeated elements), with these latter two models being
equivalent in the sense of having identical representational power.

Fig. 3.20 The hierarchy of combinational reliability models.


Problems

3.1 Series and parallel resistors


a. Consider n resistors connected in series, with the dominant failure mode being an open-circuit
between the resistor’s two terminals (the resistance becoming infinite). Describe this set of
resistors as a series or parallel system.
b. Repeat part a, this time assuming only short-circuit failures (the resistance becoming zero).
c. Repeat part a for the parallel connection of n resistors susceptible to open circuit failures.
d. Repeat part c, this time assuming only short-circuit failures (the resistance becoming zero).

3.2 Series and parallel diodes


Repeat problem 3.1, replacing every occurrence of the word “resistor” with the word “diode.” Note that an
ideal, properly functioning diode has infinite resistance in the backward direction and zero resistance in the
forward direction.

3.3 Series-parallel system reliability


Consider the following two series-parallel systems composed of six modules each. Assume identical
modules of reliability r.

System A
System B

a. Write the reliability equations for the two systems and determine the conditions under which
system A is more reliable than system B.
b. Repeat part a in the more general case when there are m pairs, rather than 3 pairs, of modules
arranged as series connection of m parallel pairs or parallel connection of two m-module series
chains.
c. Generalize the conclusions of part b to a case where instead of parallel pairs, we have k-wide
parallel connections. In other words, assume that there are km modules in all, with the parallel
parts being k-wide and the series parts being of length m.

3.4 Modeling of k-out-of-n systems


Figure 3.15 depicts two ways of modeling a 2-out-of-3 system as an RBD with repeated elements.
a. Is it possible to use an RBD with fewer than six blocks to achieve the same effect? Note that we
are excluding the RBD variant in Fig. 3.13, which would need only three blocks.
b. Present at least three RBDs that model a 3-out-of-5 system.
c. What is the minimum number of blocks in an RBD that models a k-out-of-n system?


3.5 Combinational model types


A multiprocessor system is composed of an interconnection network N and two processors P1 and P2, each
with its own memory (M1 and M2) and two disk drives (D1 and E1 for P1, D2 and E2 for P2). A processor can
function as long as the processor itself, its memory unit, and one of its two disk drives are operational.
Model this system by means of the following and obtain the corresponding reliability expressions.
a. Reliability block diagram
b. Reliability graph
c. Fault tree

3.6 Consecutive k-out-of-n systems


Prove that if we switch from linear indexing to circular indexing, reliability improves for consecutive k-out-
of-n:G systems and deteriorates for consecutive k-out-of-n:F systems.

3.7 Circular consecutive k-out-of-n systems


a. Reformulate the redundant bus arrangement of Example 3.5 so that it corresponds to a circular
consecutive 3-out-of-5:G system and derive the resulting reliability.
b. Reformulate the street-lights application of Example 3.6 so that it corresponds to a circular
consecutive 2-out-of-n:F system and calculate the resulting reliability.
c. Propose an example of a consecutive k-out-of-n:F system that occurs in computer or computer-
based systems.

3.8 RBD for a multiprocessor system


At the beginning of Section 3.4, Fig. 3.11 was described as possibly modeling a multiprocessor system.
Can you characterize the RBD of Fig. 3.13 in the same manner? Discuss.

3.9 Modeling of a redundant resistor


Five identical resistors, each of resistance r, are connected into a bridge (Fig. 3.12b, with a fifth resistor
included in the middle connection) and together act as one resistor. Each resistor fails to open with
probability p and to short with probability q. Any end-to-end resistance value between r/2 and 2r is deemed
acceptable. Derive an expression for the reliability of the overall component.

3.10 Reliability of k-out-of-n systems


A system consists of 3 processors and 8 memory modules with failure rates of 1 and 2 per 1000 hours,
respectively. All other parts of the system are perfectly reliable.
a. What is the probability that no subsystem fails in a one-month period?
b. Assuming no repair, what is the probability that in a one-month period at least 2 processors and 4
memory modules remain available (i.e., a minimal system survives)?
c. If you had enough money to add one processor or one memory module for part b, which one
would you choose and why?


3.11 Modeling of coverage in parallel systems


a. Model a 2-way parallel system with modules M1 and M2 having reliabilities r1 and r2 and
imperfect coverage c by placing a special module representing the imperfect coverage on the
parallel path for M2. Then, derive the reliability equation as a function of r1, r2, and c.
b. Use induction on n to prove equation (3.2.cov3) for the reliability of an n-way parallel system with
imperfect coverage.
c. How would equation (3.2.cov3) change if the coverage factor were different after each failure (c1,
c2, … , cn–1) and module reliabilities were also different (r1, r2, … , rn)?

3.12 Modeling of a two-phase mission


A phased mission is one which requires the availability of different resources in each of several phases of
operation. Consider for example a two-phase mission for a computer system with three resources
(subsystems) A, B, and C. During phase 1, which lasts T1 hours, only subsystem A needs to be operational.
During phase 2, of duration T2, proper functioning of subsystem B plus one of the other two subsystems
would suffice. Such a mission is deemed a success if both phases are completed with the required resources
being operational. Assuming exponential reliabilities, with constant failure rates λA, λB, and λC for the
three resources (regardless of their being in operation or idle), write down the reliability equation for the
two-phase mission just defined.

3.13 Modeling of a phased mission


Consider a system composed of n resources, numbered 1 to n, having reliabilities R1(t), R2(t), . . . , Rn(t),
and a φ-phase mission in which the set Sj of resources needed in phase j is a subset of Sj–1 for 2 ≤ j ≤ φ.
a. Write down an expression for the reliability of the system if the completion of all φ phases is
required for mission success.
b. Present the special case of your expression assuming exponential reliability formulas.

3.14 Series and parallel systems


Consider four nested rooms, the innermost of which contains a safe. To access the safe, one should go
through four locked doors, one per room. Now consider a second security arrangement, with a single large
room having four separate doors, any one of which can be used for entry to access the safe.
a. Discuss the suitability of each arrangement, and derive its reliability, when the door locks can only
fail by becoming stuck on shut (can’t be unlocked).
b. Repeat part a for locks that fail by becoming stuck on open (can't be locked).
c. Assuming that we are limited to the use of four doors, what alternative security arrangement
would you suggest if both types of lock failures mentioned in parts a and b are possible?
d. Derive the reliability of the arrangement you suggested in part c. State your assumptions and show
all intermediate steps.

3.15 Combinational modeling


In a circuit, two diodes in series are used in lieu of a single diode between points A and B. Each diode has
two failure modes: it fails in the open-circuit mode with probability p and in the closed-circuit (shorted)
mode with probability q. The reliability of each diode is thus 1 – p – q.

A B


a. Write down the reliability equation of the series connection, and show that it is preferable to a
nonredundant diode iff open-circuit failures are less likely than closed-circuit failures (p < q).
b. Repeat part a for two diodes in parallel replacing a single diode, and derive the corresponding
condition.
c. Show that it is possible to arrange 4 diodes so that they act as a 3-out-of-4 system between A and
B, given the open-circuit and short-circuit failure modes.

3.16 Assessing reliability


a. Which is more reliable: Plane X or Plane Y that carries four times as many passengers as Plane X
and is twice as likely to crash?
b. Which is more reliable: a 4-wheel vehicle with 1 spare tire or an 18-wheeler with 2 spare tires?

3.17 Reliability wall


Consider a parallel computer with p processors running tasks of which a fraction f cannot be parallelized
and the remaining fraction 1 – f is perfectly parallelizable on all p processors. Amdahl's constant-task speedup
formula, s = p/[1 + f(p – 1)], tells us that while computation speedup increases with p, it can never exceed
the upper limit 1/f as p approaches infinity. On the other hand, Gustafson’s constant-running-time scaled
speedup s = f + p(1 – f) continues to grow indefinitely for any f < 1 as p approaches infinity. With an
increase in the number p of processors in exascale computing, reliability enhancement methods must be
employed, which imply time and cost overheads. Checkpointing, for example, must be done more
frequently as p increases, implying superlinear overhead in terms of work. If the reliability overhead is
superlinear in p, then a reliability wall [Yang12] may inhibit further performance increases via the
introduction of additional processors. Show that a reliability wall exists under the Amdahl interpretation of
speedup, but not under Gustafson’s.

3.18 Series-parallel systems with coincident failures


Consider a highly simplified scheme for modeling coincident failures in series/parallel systems. Systems of
interest are composed of modules with identical failure probability p and coincident failure probability c for
any two modules, with c > p^2. The probability of coincident failures in 3 or more modules is negligible.
a. Write down the reliability equation of the series connection of n such modules.
b. Write down the reliability equation of the parallel connection of n such modules.
c. Discuss how to apply the model defined in this problem to RBDs.

3.19 Short- and open-circuit tolerance for resistors


A resistor can fail by becoming open (infinite resistance) or by short-circuiting (zero resistance). To build a
general model that accommodates both extremes as special cases, we assume that one of the resistors in the
following networks, each intended to serve as a robust resistor built from 4 identical resistors, can assume
the resistance r + δ, with –r < δ < ∞.
a. Derive a formula for the equivalent resistance of the network on the left as a function of r and δ.
b. Repeat part a for the network on the right.
c. How would you go about choosing one of these redundant networks for a particular application?


3.20 Series-parallel systems


A pumping station has 3 pumps. Pump 1 must be working at all times, while only one of Pump 2 or Pump 3
is needed to be operational. The reliabilities of the three pumps for one month of operation are R1 = 0.999,
R2 = 0.988, and R3 = 0.975.
a. What is the failure probability for this system of pumps?
b. Assuming that the exponential reliability law is applicable, find the failure rates for the three
pumps and an equivalent overall failure rate for the system of pumps.
c. Discuss the contribution of each pump to the equivalent overall failure rate of part b.
d. Consider improving one of the reliabilities by a small increment ε. If all three options cost the
same, which pump would you choose to improve and why?

3.21 k-out-of-n vs. k-out-of-k reliability


A 34-out-of-36 system (e.g., a reconfigurable 6-by-6 processor array that can tolerate up to 2 failed
elements) is more reliable than a 34-out-of-34 (series) system. Derive the extent of reliability improvement
by writing the reliability of a 34-out-of-36 system, with each component having reliability r, as r^34(1 + ε),
where ε is a function of r.

3.22 The Dr. Sanjay Gupta problem


The puzzle known as “The Monty Hall Problem,” bearing the name of an old-time game-show host, deals
with counter-intuitive probabilistic notions (see Problem 2.4). For the title of this problem dealing with a
probabilistic model of testing for diseases, I have borrowed the name of a CNN medical commentator. A
newly developed diagnostic test for a particular disease gives the result “positive” with 99% probability if
you have the disease and with 2% probability (false positive) if you don’t. Assume that 1% of the residents
of a city have that particular disease. A randomly chosen person from the city is administered the test, with
the result being “positive.” What is the probability that the person has the disease?


References and Further Readings


[Bent99] Bentley, J. P., Reliability and Quality Engineering, Addison-Wesley, 2nd ed., 1999.
[Buza70] Buzacott, J. A., “Network Approaches to Finding the Reliability of Repairable Systems,” IEEE
Trans. Computers, Vol. 19, No. 4, pp. 140-146, November 1970.
[Duga99] Dugan, J. B., “Fault Tree Analysis of Computer-Based Systems,” tutorial presentation,
accessed on July 22, 2012. http://www.fault-tree.net/papers/dugan-comp-sys-fta-tutor.pdf
[Kuo03] Kuo, W. and M. J. Zuo, Optimal Reliability Modeling: Principles and Applications, Wiley,
2003.
[Levi13] Levitin, G., L. Xing, H. Ben-Haim, and Y. Dai, “Reliability of Series-Parallel Systems with
Random Failure Propagation Time,” IEEE Trans. Reliability, Vol. 62, No. 3, pp. 637-647,
September 2013.
[Malh94] Malhotra, M. and K. S. Trivedi, “Power-Hierarchy of Dependability Model Types,” IEEE
Trans. Reliability, Vol. 43, No. 3, pp. 493-502, September 1994.
[Misr08] Misra, K. B. (ed.), Handbook of Performability Engineering, Springer, 2008.
[NRC81] US Nuclear Regulatory Commission, Fault-tree Handbook, 1981, accessed on September 25,
2012. http://www.nrc.gov/reading-rm/doc-collections/nuregs/staff/sr0492/sr0492.pdf
[Reib91] Reibman, A. L. and M. Veeraraghavan, “Reliability Modeling: An Overview for System
Designers,” IEEE Computer, Vol. 24, No. 4, pp. 49-57, April 1991.
[Sahn96] Sahner, R. A., K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of
Computer Systems: An Example-Based Approach Using the SHARPE Software Package,
Kluwer, 1996.
[Weib13] Weibul.com, “Fault Tree Analysis: An Overview of Basic Concepts,” accessed on September
25, 2013. http://www.weibull.com/basics/fault-tree/index.htm
[Wind13] Windchill, “Fault-Tree Reliability Analysis Program,” accessed on September 25, 2013.
http://www.ptc.com/product/relex/fault-tree
[Xing08] Xing, L. and S. V. Amari, “Fault Tree Analysis,” Chap. 38 in Handbook of Performability
Engineering, ed. by K. B. Misra, Springer, 2008, pp. 595-620.
[Yang12] Yang, X., Z. Wang, J. Xue, and Y. Zhou, “The Reliability Wall for Exascale Supercomputing,”
IEEE Trans. Computers, Vol. 61, No. 6, pp. 767-779, June 2012.


4 State-Space Modeling
“All models are wrong, some are useful.”
G. E. P. Box

“When one admits that nothing is certain one must, I think, also
admit that some things are much more nearly certain than
others.”
Bertrand Russell

Topics in This Chapter


4.1. Markov Chains and Models
4.2. Modeling Nonrepairable Systems
4.3. Modeling Repairable Systems
4.4. Modeling Fail-Soft Systems
4.5. Solving Markov Models
4.6. Dependability Modeling in Practice

State-space models are suitable for modeling systems that can be in multiple
states from the viewpoint of functionality and can move from state to state as a
result of deviations (e.g., malfunctions) and remedies (e.g., repairs). Even though
many classes of state-space models have been introduced and are in use, our focus
will be on Markov models that have been found most useful in practice and
possess sufficient flexibility and power to faithfully represent nearly all system
categories of practical interest. In this chapter, we introduce Markov chains as
models of probabilistically evolving systems and explore their use in evaluating
the dependability of computer systems, particularly those with repairable parts.


4.1 Markov Chains and Models

A discrete-time Markov chain can be viewed as a probabilistic sequential machine.


Probability values are assigned to transitions, in such a way that the sum of probabilities
associated with transitions out of any given state is 1. For example, Fig. 4.1 represents the
Markov diagram for a four-state system, with the initial state 0 and transition
probabilities as shown on the various edges. Note that the probabilities shown on the
transitions out of some states do not add up to 1, as the definition seems to require. For example, the two
transitions out of state 0 have a total probability of 0.7. This is because self-loops, or
transitions from one state into the same state, are often omitted to reduce clutter. The
presence of such self-loops is always implied in our Markov diagrams.

Fig. 4.1 State diagram of an example four-state Markov chain.

A Markov chain can be represented by a transition matrix M, where the element mi,j in
row i and column j denotes the probability associated with the transition from state i to
state j. When the Markov chain has no transition between two states, the corresponding
entry in M is 0. The transition matrix M0 for the Markov chain of Fig. 4.1 is:

      0.3  0.4  0.3  0.0
M0 =  0.5  0.4  0.0  0.1      (4.1.exMm0)
      0.0  0.2  0.7  0.1
      0.4  0.0  0.3  0.3

A matrix such as the one in eqn. (4.1.exMm0), in which the entries in each row
add up to 1, is referred to as a Markov matrix. A second example of a Markov matrix is
provided by eqn. (4.1.exMm1).


      0.5  0.2  0.1  0.2
M1 =  0.1  0.4  0.4  0.1      (4.1.exMm1)
      0.3  0.0  0.2  0.5
      0.2  0.6  0.0  0.2

At any given time, our knowledge about the state of an n-state Markov chain can be
represented by an n-vector, with its ith element representing the probability of being in
state i. For example, the state vector (1, 0, 0, 0) means that the system depicted in Fig. 4.1
is known to be in state 0, (1/2, 1/2, 0, 0) means that it is equally likely to be in state 0 or
state 1, and (1/4, 1/4, 1/4, 1/4) denotes complete uncertainty about the state of the system.
Clearly, the elements of a state vector must add up to 1.

Starting from the state vector s(0) = (s0(0), s1(0), s2(0), s3(0)) at time 0, one can compute the state vector
at time step 1 by multiplying s(0) by the transition matrix M, that is, s(1) = s(0)M. More
generally, the state vector of the system after k time steps is given by:

s(k) = s(0)M^k      (4.1.svec)

For example, if the system in Fig. 4.1 is initially in state 0 or 1 with equal probabilities,
its state vector after one and two time steps will be:

s(1) = s(0)M0 = (0.5, 0.5, 0, 0)M0 = (0.40, 0.40, 0.15, 0.05)

s(2) = s(1)M0 = (0.40, 0.40, 0.15, 0.05)M0 = (0.340, 0.350, 0.240, 0.070)
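The state-vector computation above is easy to reproduce, and to check, with a few lines of code. The following sketch (plain Python, written for this text rather than taken from it) propagates a state vector through successive steps of the chain of Fig. 4.1:

```python
# Transition matrix M0 of the 4-state Markov chain in Fig. 4.1 (eqn. 4.1.exMm0).
M0 = [[0.3, 0.4, 0.3, 0.0],
      [0.5, 0.4, 0.0, 0.1],
      [0.0, 0.2, 0.7, 0.1],
      [0.4, 0.0, 0.3, 0.3]]

def step(s, M):
    """One time step: multiply the row vector s by the transition matrix M."""
    n = len(s)
    return [sum(s[i] * M[i][j] for i in range(n)) for j in range(n)]

s1 = step([0.5, 0.5, 0.0, 0.0], M0)   # state vector after one step
s2 = step(s1, M0)                     # state vector after two steps
print(s1)   # (0.40, 0.40, 0.15, 0.05)
print(s2)
```

Besides illustrating eqn. (4.1.svec), such a snippet is a handy check on hand arithmetic; the elements of each resulting vector should always sum to 1.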

A discrete-time Markov chain can be viewed as a sequential machine with no input, also
known as an autonomous sequential machine. In such a machine, transitions are triggered
by the clock signal, with no other external influence. In conventional digital circuits,
clock-driven counters are examples of autonomous sequential machines that move
through a sequence of states on their own. If we have two or more possible input values,
then a general stochastic sequential machine results. Such a machine will have a separate


transition matrix for each input value and its state vector after k time steps will depend on
the k-symbol input sequence that it receives.

Example 4.1: Stochastic sequential machines Consider a 4-state, 2-input stochastic sequential
machine with the state transition probabilities provided by the matrix M0 of eqn. (4.1.exMm0) for
input value 0 and by the matrix M1 of eqn. (4.1.exMm1) for input value 1. Assuming that the
machine begins in state 0, what will be its state vector after receiving the input sequence 0100?

Solution: The initial state vector (1, 0, 0, 0) must be multiplied by M0M1M0M0 to find the final
state vector. This process yields the state vector (0.2620, 0.2844, 0.3586, 0.0950).
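The chain of matrix products in Example 4.1 can be coded in the same style. The sketch below (an illustrative implementation, not from the text) selects the transition matrix according to each input symbol; M0 and M1 are the matrices of eqns. (4.1.exMm0) and (4.1.exMm1):

```python
# Transition matrices for input symbols 0 and 1 (eqns. 4.1.exMm0 and 4.1.exMm1).
M = {
    0: [[0.3, 0.4, 0.3, 0.0], [0.5, 0.4, 0.0, 0.1],
        [0.0, 0.2, 0.7, 0.1], [0.4, 0.0, 0.3, 0.3]],
    1: [[0.5, 0.2, 0.1, 0.2], [0.1, 0.4, 0.4, 0.1],
        [0.3, 0.0, 0.2, 0.5], [0.2, 0.6, 0.0, 0.2]],
}

def run_machine(s, inputs):
    """Propagate state vector s through the matrices chosen by the input sequence."""
    for x in inputs:
        Mx = M[x]
        s = [sum(s[i] * Mx[i][j] for i in range(4)) for j in range(4)]
    return s

final = run_machine([1.0, 0.0, 0.0, 0.0], [0, 1, 0, 0])
print(final)   # (0.2620, 0.2844, 0.3586, 0.0950), as in Example 4.1
```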

Continuous-time Markov chains are more useful for modeling computer system
performance and dependability. In such chains, transitions are labeled with rates, rather
than probabilities. For example, a transition rate of 10^–6 per hour means that over a very short
time period of dt hours, the transition will occur with probability 10^–6 dt. We often use
Greek letters to denote arbitrary transition rates, with λ being a common choice for
failure rate and μ for repair or service rate.

Markov chains are used widely for modeling systems of many different kinds. For
example, the process of computer programming may be modeled by considering various
states corresponding to problem formulation, program design, testing, debugging, and
running [Funk07]. State transitions are labeled with estimated probabilities of going from
one state (say, testing) to other states (say, debugging or running). Pattern recognition
problems may be tackled by the so-called “hidden Markov models” [Ghah01]. As a final
example, a person’s movement between home, office, and various other places of interest
can be studied via a Markov model that associates probabilities with going from one of
the places to each other place [Ashb03]. Many more applications exist [Bolc06].


Example 4.2: Markov model for a fail-soft multiprocessor Consider a parallel machine with 3
processors, each of which has a failure rate λ and a repair rate μ. When 2 or 3 processors fail, only
one can be repaired at a time. The machine can operate with any number of processors, but its
performance will be lower with only 1 or 2 processors working, thus providing fail-soft operation.
Construct a suitable Markov model for this parallel machine.

Solution: Each of the 3 processors can be in the up (1) or down (0) state, leading to 8 possible
states in all. For example, the state 111 corresponds to all processors being operational and 110
corresponds to the third one being down. Figure 4.2a depicts the resulting 8-state Markov model
and the associated state transitions. We know that the 3 processors are identical with regard to
their failure and repair characteristics. If they are also the same in other respects, there is no need to
distinguish the 3 states in which a single processor is down or the ones having 2 bad processors.
Merging these states, we get the simplified Markov model of Fig. 4.2b. The new failure rates are
obtained by summing the values on the merged transitions, but the repair rate does not change. In
the event that a single processor cannot provide sufficient computational power for the tasks of
interest, we might further merge states 0 and 1 in Fig. 4.2b, given that both of them imply the
status of the system being “down.” Thus, whether it is appropriate to simplify a Markov model by
merging some states depends on the model’s semantics, rather than its appearance.

(a) Initial Markov model

(b) Simplified Markov model

Fig. 4.2 Markov models for a 3-processor fail-soft system. In part (a),
all solid arrows should be labeled λ and all dashed arrows μ.


4.2 Modeling Nonrepairable Systems

In modeling nonrepairable systems, the reliability equation and parameters such as MTTF
are of interest. Such systems eventually fail, so our objective is often to determine the
system lifetime and devise methods for increasing it.

Example 4.3: Two-state nonrepairable system A nonrepairable system has the failure rate λ.
Use a 2-state Markov model to find the reliability of this system as a function of time.

Solution: The requisite Markov model is depicted in Fig. 4.3a. Reliability can be viewed as the
probability of the system being in state 1. To find the reliability as a function of time, we note that
p1 changes at the rate –λp1 (a failure rate of λ means that the probability of failure over a short
time interval dt is λdt). Thus, we can set up the differential equation p1′ = –λp1, which has the
solution p1 = e^–λt, given the initial condition p1(0) = 1. Figure 4.3b shows a plot of the reliability
R(t) = p1(t) as a function of time. The probability of the system being in state 0 can be found from
the identity p0 + p1 = 1 to be p0 = 1 – e^–λt.

(a) System states and Markov model (b) System reliability over time

Fig. 4.3 Markov model and the reliability curve for a 2-state
nonrepairable system.

Example 4.4: n-module parallel system Consider n lightbulbs lit at the same time, each failing
independently at the rate λ. Construct a Markov chain to model this system of n parallel lightbulbs
without replacement and use the model to find the expected time until all n lightbulbs die.

Solution: The requisite Markov model has n + 1 states, labeled from n (start state, when all lightbulbs
are good) down to 0 (failure state), as depicted in Fig. 4.4 (ignore the dashed box, which is for
Example 4.5). The expected time to go from state k to state k – 1 is 1/(kλ). Thus, the system’s
expected lifetime, or the time to go from state n to state 0, is (1/λ)[1/n + 1/(n – 1) + … + 1/2 + 1].
We see that due to the use of n parallel lightbulbs, the lifetime 1/λ of a single lightbulb has been
extended by a factor equal to the nth harmonic number Hn = 1 + 1/2 + … + 1/n, which is O(log n) for large n.
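The harmonic-sum lifetime of Example 4.4 is easy to tabulate. A small sketch (illustrative, with a made-up failure rate) shows how slowly the extension factor grows with n:

```python
from math import log

def parallel_lifetime(n, lam):
    """Expected time until all n parallel units, each failing at rate lam, are dead.
    Sum of expected sojourn times 1/(k*lam) for k = n, n-1, ..., 1."""
    return sum(1.0 / (k * lam) for k in range(1, n + 1))

lam = 0.001                       # illustrative failure rate (per hour)
for n in (1, 2, 10, 100):
    factor = parallel_lifetime(n, lam) * lam   # extension factor = H_n
    print(n, factor)              # grows only logarithmically in n
```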


Fig. 4.4 Markov model for an n-module parallel or k-out-of-n system,
with the failure state being state 0 for the former and all the
states within the dashed box for the latter.

Example 4.5: k-out-of-n nonrepairable systems Construct and solve a Markov model for a
nonrepairable k-out-of-n system.

Solution: The Markov model with n + 1 states, in which state i represents i of the n modules being
operational, can be simplified by merging the last k states into a single “failure” or “down” state.
The following balance equations can be written for the n – k + 1 good states:
pn′ = –nλpn
pn–1′ = nλpn – (n – 1)λpn–1, and more generally, for j ≥ k,
pj′ = (j + 1)λpj+1 – jλpj
From the system of n – k + 1 differential equations above, we find pn = e^–nλt, given the initial
condition pn(0) = 1, and more generally, for j ≥ k,
pj = C(n, j)(e^–λt)^j (1 – e^–λt)^(n–j)
where C(n, j) denotes the binomial coefficient. The probability pF for the failure state can be obtained from pn + pn–1 + … + pk + pF = 1.
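Because the state probabilities in Example 4.5 form a binomial distribution (each module independently survives to time t with probability e^–λt), the reliability of a k-out-of-n nonrepairable system can be checked numerically. A sketch, with illustrative parameter values:

```python
from math import comb, exp

def state_prob(j, n, lam, t):
    """Probability that exactly j of n nonrepairable modules work at time t."""
    p = exp(-lam * t)                     # single-module survival probability
    return comb(n, j) * p**j * (1 - p)**(n - j)

def k_of_n_reliability(k, n, lam, t):
    """The system works iff at least k of the n modules work."""
    return sum(state_prob(j, n, lam, t) for j in range(k, n + 1))

# TMR (2-out-of-3) should match the classic formula 3p^2 - 2p^3.
lam, t = 1e-4, 1000.0                     # illustrative values
p = exp(-lam * t)
print(k_of_n_reliability(2, 3, lam, t), 3 * p**2 - 2 * p**3)
```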

Note that to solve the system of differential equations of Example 4.5, we did not need to
resort to the more general method involving the Laplace transform (see Section 4.5), because
the first equation was solvable directly and each additional equation introduced only one
new dependent variable.


4.3 Modeling Repairable Systems

In a repairable system, the effect of failures can be undone by repair actions. One way to
model variations in repair time is to associate a repair rate with each repair transition, as
depicted in Fig. 4.5a. A repair rate μ means that the repair time has the exponential
distribution, with the probability of a repair action taking more time than t being e^–μt. The
effectiveness and timeliness of repair actions can be assessed by evaluating the system’s
steady-state availability, or the fraction of time the system is expected to be in the “Up”
state. The following example shows how the availability can be derived for the simplest
possible repairable system.

Example 4.6: Two-state repairable system Consider a repairable system with failure rate λ
and repair rate μ. Derive a formula for the steady-state system availability.

Solution: The requisite Markov model is depicted in Fig. 4.5a. Availability can be viewed as the
probability of the system being in state 1. To find the steady-state availability, we set up the
balance equation –p1 + p0 = 0, which indicates that, over the long run, a transition out of state 1
has the same probability as a transition into it. The balance equation above, along with p0 + p1 = 1,
allow us to obtain p1 = /( + ) and p0 = /( + ). We will see later that availability of this
system as a function of time is A(t) = /( + ) + /( + )e–(+)t, which is consistent with the
steady-state availability A = /( + ) just derived.

(a) System states and Markov model (b) System availability over time

Fig. 4.5 Markov model and the availability curve for a 2-state
repairable system.
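The two-state availability result just derived can be checked numerically; the rates in the sketch below are illustrative, not from the text:

```python
from math import exp

def availability(t, lam, mu):
    """Transient availability A(t) of the 2-state repairable model (Fig. 4.5a)."""
    s = lam + mu
    return mu / s + (lam / s) * exp(-s * t)

lam, mu = 0.001, 0.1            # illustrative failure and repair rates (per hour)
A_steady = mu / (lam + mu)      # steady-state availability
print(availability(0.0, lam, mu))    # starts at 1 (system initially up)
print(A_steady)
```

As t grows, availability(t, lam, mu) decays monotonically from 1 toward A_steady, matching the curve of Fig. 4.5b.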


As discussed in Section 2.5, we may want to distinguish different failure states to allow a
more refined safety analysis. Another reason for contemplating multiple failure states is
to model different rates of failures as well as different rates of repair in a repairable
system. Some failures may be repairable, while others aren’t, and classes of failures may
differ in the difficulty and latency of repair actions. The following example illustrates the
Markov modeling process for one class of such systems.

Example 4.7: Repairable system with two failure states Consider a repairable system with
failure rate λ = λ0 + λ1, divided into two parts, λ0 for failures of type 0 and λ1 for failures of type 1.
Assuming the common repair rate μ for both failure types, derive a formula for the steady-state
system availability and the probabilities of being in the two failure states.

Solution: The requisite Markov model is depicted in Fig. 4.6a. To obtain steady-state probabilities
of the system being in each of its 3 states, we write the following two balance equations and use
them in conjunction with p0 + p1 + p2 = 1 to find p0 = λ0/(λ + μ), p1 = λ1/(λ + μ), and p2 = μ/(λ + μ).
–λp2 + μp1 + μp0 = 0
–μp1 + λ1p2 = 0
Figure 4.6b shows typical variations in state probabilities over time (with the derivation to be
discussed later) and their convergence to the steady-state values just derived. If penalties or costs cj
are associated with being in the various failed states, analyses of the kind presented in this
example allow the computation of the total system risk as ∑(over failure states) cjpj.

(a) System states and Markov model (b) State probabilities over time

Fig. 4.6 Markov model and state probabilities for a repairable
system with two failure states.


4.4 Modeling Fail-Soft Systems

As discussed in Section 2.4, we may want to have multiple operational states, associated
with different levels of system capability, to allow for performability analysis. As in the
case of multiple failure states, discussed in Section 4.3, different operational states may
differ in their failure and repair rates, allowing for more accurate reliability and
availability analyses. Such states may also have different rewards or benefits bj associated
with them, so that a weighted total benefit ∑(over operational states) bjpj can be computed. The
following examples illustrate the Markov modeling process for such systems.

Example 4.8: Repairable system with two operational states Consider a repairable system
with operational states 1 and 2 and failure state 0, as depicted in Fig. 4.7a with its associated
failure and repair rates. Derive the probabilities of being in the various states and use them to
compute system availability and system performability, the latter under the assumption that the
performance or benefit b2 associated with state 2 is twice that of b1 = 1 for state 1.

Solution: The requisite Markov model is depicted in Fig. 4.7a. To obtain steady-state probabilities
for the system states, we write the following two balance equations and use them in conjunction
with p0 + p1 + p2 = 1 to find p0 = Dλ1λ2/(μ1μ2), p1 = Dλ2/μ2, and p2 = D, where D = 1/[1 + λ2/μ2 +
λ1λ2/(μ1μ2)].
–λ2p2 + μ2p1 = 0
λ1p1 – μ1p0 = 0
Figure 4.7b shows typical variations in state probabilities over time and their convergence to the
steady-state values just derived. System availability is simply A = p1 + p2 = D(1 + λ2/μ2). Assuming
a performance level of 1 unit in state 1 and 2 units in state 2, system performability is P = p1 + 2p2
= D(2 + λ2/μ2). If the performance level of 2 in state 2 results from 2 processors running in parallel,
it might be reasonable to assume λ2 = 2λ1 = 2λ and μ2 = μ1 = μ (single repair facility). Then,
defining ρ = μ/λ, we get A = ρ(ρ + 2) / (ρ^2 + 2ρ + 2) and P = 2ρ(ρ + 1) / (ρ^2 + 2ρ + 2).

(a) System states and Markov model (b) State probabilities over time

Fig. 4.7 Markov model and state probabilities for a repairable
system with two operational states.
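The steady-state probabilities of Example 4.8 can be verified by plugging them back into the balance equations. A sketch with illustrative rates, using the two-processor interpretation (λ2 = 2λ1, μ2 = μ1):

```python
def fail_soft_steady_state(l1, l2, m1, m2):
    """Steady-state probabilities (p0, p1, p2) for the 3-state model of Fig. 4.7a."""
    D = 1.0 / (1.0 + l2 / m2 + l1 * l2 / (m1 * m2))
    return (l1 * l2 / (m1 * m2)) * D, (l2 / m2) * D, D

lam, mu = 0.01, 1.0                        # illustrative rates, not from the text
p0, p1, p2 = fail_soft_steady_state(lam, 2 * lam, mu, mu)
A = p1 + p2                                # availability
P = p1 + 2 * p2                            # performability with b1 = 1, b2 = 2
print(A, P)
```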


Example 4.9: Fail-soft systems with imperfect coverage We saw in Section 3.2 that adding
parallel redundancy without ensuring accurate and timely malfunction detection and switchover
may not be helpful. We also introduced a coverage parameter c and used it to derive the reliability
of an n-way parallel system under imperfect coverage in eqn. (3.2.cov3). Analyze a repairable fail-
soft system composed of 2 processors, so that upon a first processor failure, successful switching
to a single-processor configuration occurs with probability c.

Solution: The requisite Markov model is depicted in Fig. 4.8. Note that the transition labeled 2λ in
an ordinary fail-soft system has been replaced with two transitions: one labeled 2λc into state 1,
corresponding to successful reconfiguration, and another labeled 2λ(1 – c) into state 0, representing
catastrophic failure upon the first module outage. To obtain steady-state probabilities for the
system states, we write the following two balance equations and use them in conjunction with p0 +
p1 + p2 = 1 to find p0, p1, and p2.
–2λp2 + μ2p1 = 0
2λ(1 – c)p2 + λ1p1 – μ1p0 = 0
Even though we have set up the model in full generality in terms of failure and repair rates, we
solve it only for λ2 = 2λ1 = 2λ and μ2 = μ1 = μ. Defining ρ = μ/λ and D = 1/[2 + (4 – 2c)ρ + ρ^2],
we get p0 = 2D[(1 – c)ρ + 1], p1 = 2Dρ, and p2 = Dρ^2. Note that just as reconfiguration can be
unsuccessful, so too repair might be imperfect or the system may be unable to reuse a properly
repaired module. Thus, the use of coverage is appropriate for repair transitions as well.

Fig. 4.8 Markov model for a fail-soft system with imperfect coverage.
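A numerical check of the coverage model's steady-state solution (an illustrative sketch; the rates and coverage values are made up) shows how quickly the down-state probability grows as coverage drops:

```python
def coverage_steady_state(lam, mu, c):
    """Steady-state probabilities for Fig. 4.8 with lam2 = 2*lam, mu2 = mu1 = mu."""
    rho = mu / lam
    D = 1.0 / (2.0 + (4.0 - 2.0 * c) * rho + rho**2)
    p0 = 2.0 * D * ((1.0 - c) * rho + 1.0)   # both processors down
    p1 = 2.0 * D * rho                        # one processor up
    p2 = D * rho**2                           # both processors up
    return p0, p1, p2

for c in (1.0, 0.99, 0.9):
    p0, p1, p2 = coverage_steady_state(0.001, 0.1, c)
    print(c, p0)    # unavailability grows quickly as coverage drops
```

With c = 1, the solution reduces to the perfect-coverage availability formula of Example 4.8, which provides a useful consistency check.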


4.5 Solving Markov Models

In the preceding sections, we introduced the method of transition balancing for finding
the steady-state probabilities associated with the states of a Markov model. This method
is quite simple and is readily automated through the use of reliability modeling tools or
general solvers for systems of linear equations. Occasionally, however, we would like to
derive the transient solutions of a Markov chain to study the short-term behavior of a
system or to gain insight into how quickly the system reaches its steady state.

To find the transient solutions to a Markov model, we first set up a system of first-order
differential equations describing the time-variant balance (rather than the steady-state
balance) at each state. We then apply the Laplace transform to the system of differential
equations, solve the resulting system of algebraic equations, and find the final solutions
via the application of the inverse Laplace transform. The Laplace transform converts a time-
domain function f(t) to its transform-domain counterpart F(s) by:

F(s) = Laplace[f(t)] = ∫₀^∞ e^–st f(t) dt      (4.5.LPT)

By convention, an uppercase letter is used to denote the Laplace transform of a function


named with the corresponding lowercase letter. Table 4.1 lists what we need to know
about the Laplace transforms of simple functions to solve problems encountered in
analyzing dependable systems. More background on Laplace (also written as “LaPlace”)
transform, and a more extensive table of transforms, can be found in [Wiki15].

Table 4.1 Laplace transforms for some simple functions.

Time-domain function              Transform-domain function

k; constant                       k/s
e^–at                             1/(s + a)
t^k/k!; k ≥ 0                     1/s^(k+1)
t^k e^–at/k!                      1/(s + a)^(k+1)
k·h(t); constant k                k·H(s)
h(k·t); constant k > 0            H(s/k)/k
t·h(t)                            –dH(s)/ds
g(t) + h(t)                       G(s) + H(s)
h′(t); derivative of h(t)         s·H(s) – h(0); h(0) is the initial value of h


One more technique that we need is that of partial fraction expansion. Here is a brief
review, which is adequate for most cases. More details and a large set of examples can be
found in [Wiki15a].

Given a fraction N(s)/D(s), where N(s) and D(s) are the numerator and denominator
polynomials in s, with N(s) being of a lower degree than D(s), it can be expanded as a
sum of fractions of the form ai /(s – ri), assuming that the ri values are the roots of the
equation D(s) = 0. We assume that the roots are simple. Repeated roots require a
modification in this process that we will not discuss here (see Problem 4.21). This partial
fraction expansion allows us to convert an arbitrary fraction to a sum of fractions of the
forms that appear in the right-hand column of Table 4.1 and thus be able to apply the
inverse LaPlace transform to them.

N(s)/D(s) = N(s)/∏(s – ri) = a1/(s – r1) + a2/(s – r2) + … + ak/(s – rk)      (4.5.pfe1)

By converting the sum of the fractions on the right-hand side of equation (4.5.pfe1) to a
single fraction having the product of the denominators as its denominator and equating
the coefficients of the various powers of s in the numerators of both sides, we readily
derive the constants ai as:

ai = [(s – ri) N(s)/D(s)]s=ri      (4.5.pfe2)

The subscript s = ri in equation (4.5.pfe2) means that the bracketed expression is to be
evaluated at s = ri to yield the value of the constant ai.


Example 4.10: Two-state repairable systems Find transient state probabilities for the 2-state
repairable system of Fig. 4.5a and show that the results are consistent with the transient
availability curve depicted in Fig. 4.5b.

Solution: We begin by setting up the balance differential equations for the two states.
p1′(t) = –λp1(t) + μp0(t)
p0′(t) = –μp0(t) + λp1(t)
Using the Laplace transform, we convert the equations above into algebraic equations, noting the
initial conditions p1(0) = 1 and p0(0) = 0.
sP1(s) – p1(0) = –λP1(s) + μP0(s)
sP0(s) – p0(0) = –μP0(s) + λP1(s)
Solving this pair of algebraic equations, we find:
P1(s) = (s + μ) / [s^2 + (λ + μ)s]
P0(s) = λ / [s^2 + (λ + μ)s]
To apply the inverse Laplace transform to P1(s) and P0(s), we need to convert the right-hand sides
of the two equations above into function forms whose inverse Laplace transforms are known to us
from Table 4.1. This is done via the partial-fraction expansion discussed just before this example.
P0(s) = λ / [s^2 + (λ + μ)s] = a1/s + a2/(s + λ + μ)
The denominators s and s + λ + μ on the right are the factors of the original denominator on the
left, while a1 and a2 are constants to be determined by ensuring that the two sides are equal for all
values of s. From this process, we find a1 = λ/(λ + μ) and a2 = –λ/(λ + μ), allowing us to write
down the final result.
p0(t) = InverseLaplace[a1/s + a2/(s + λ + μ)] = λ/(λ + μ) – λ/(λ + μ)e^–(λ+μ)t
The inverse Laplace transform is similarly applied to P1(s), yielding:
p1(t) = μ/(λ + μ) + λ/(λ + μ)e^–(λ+μ)t
These results are consistent with p1(t), shown in Fig. 4.5b, decreasing from 1 at t = 0 to μ/(λ + μ) in
steady state (t = ∞).


Example 4.11: Triplicated system with repair The lifetime of a triplicated system with voting
(Fig. 3.7) can be extended by allowing repair or replacement to take place upon the first module
malfunction. In this way, it is possible for the system to return to full functionality, with 3 working
units, before a second module malfunction renders it inoperable. Only if the second malfunction
occurs before the completion of repair or replacement will the system experience failure. Analyze
the MTTF of such a TMR system with repair.

Solution: The Markov model for this system is depicted in Fig. 4.9. Steady-state probabilities for
the system states can be obtained from the following two balance equations, used in conjunction
with p0 + p1 + pF = 1. Unfortunately, this steady-state analysis is unhelpful, because it leads to p0 =
p1 = 0 and pF = 1, which isn’t surprising (why?).
–3λp0 + μp1 = 0
3λp0 – (2λ + μ)p1 = 0
In general, with a failure state that has no outgoing transitions, a so-called absorbing state, we get
pF = 1 in steady state.
Time-variant state probabilities can be obtained from the following differential equations, with the
initial conditions p0(0) = 1 and p1(0) = 0.
p0′ = –3λp0 + μp1
p1′ = 3λp0 – (2λ + μ)p1
Using the Laplace transform, we can solve the equations above to obtain R(t) = p0(t) + p1(t). The
result, with the notational conventions ρ = μ/λ and σ = (ρ^2 + 10ρ + 1)^(1/2), becomes:
R(t) = [(ρ + σ + 5)e^–(ρ–σ+5)λt/2 – (ρ – σ + 5)e^–(ρ+σ+5)λt/2]/(2σ)
Integrating R(t), per eqn. (2.2.MTTF), we find, after simplification:
MTTF = [1 + (ρ – 1)/6]/λ = (1 + ρ/5)[5/(6λ)]
This result suggests that the provision of repair extends the MTTF of a TMR system by a factor of
1 + ρ/5 over that of a nonrepairable TMR system, which in turn has a smaller MTTF (by a factor
of 5/6) compared with a nonredundant module. For example, with λ = 10^–6/hr and μ = 0.1/hr, we
have MTTFModule = 1M hr, MTTFTMR = 0.833M hr, and MTTFTMR with repair ≈ 16,668M hr.

(a) System states and Markov model (b) Reliabilities over time

Fig. 4.9 Markov model and graphic depiction of reliability and MTTF
improvements for a repairable TMR system.
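Plugging the example's numbers into the MTTF formulas confirms the dramatic gain from repair. A small sketch:

```python
def mttf_simplex(lam):
    return 1.0 / lam                     # nonredundant module

def mttf_tmr(lam):
    return 5.0 / (6.0 * lam)             # nonrepairable TMR

def mttf_tmr_repair(lam, mu):
    """MTTF of TMR with repair, per Example 4.11: (1 + rho/5) * 5/(6*lam)."""
    rho = mu / lam
    return (1.0 + rho / 5.0) * 5.0 / (6.0 * lam)

lam, mu = 1e-6, 0.1                      # rates used in the example
print(mttf_simplex(lam))                 # 1.0e6 hr
print(mttf_tmr(lam))                     # ~0.833e6 hr
print(mttf_tmr_repair(lam, mu))          # ~1.667e10 hr (about 16,668M hr)
```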


4.6 Dependability Modeling in Practice

A particularly useful Markov model is the so-called birth-and-death process. This model
is used in queuing-theory analysis, where the customer’s arrival rate and providers’
service rate determine the queue size and waiting time. Referring to the infinite birth-and-
death process depicted in Fig. 4.10, a transition from state j to state j + 1 is an arrival or
birth, and a transition from state j to state j – 1 is a departure or death. Closed-form solutions
for time-variant state probabilities are difficult to obtain in general, but steady-state probabilities are
easily obtained:

pj = p0 λ0λ1 … λj–1 / (μ1μ2 … μj)      (4.6.BnD)


Fig. 4.10 Markov model for a birth-and-death process.

In the finite version of Fig. 4.10, the last state is labeled n and we have an (n + 1)-state
birth-and-death process. The following two examples deal with special cases of the
process shown in Fig. 4.10.

Example 4.12: A simple queuing system for bank customers Customers enter a bank at the
rate λ and are serviced by a single teller at the rate μ. Even if λ < μ, the length of the
waiting line can grow indefinitely due to random variations. Use a birth-and-death process to
compute the probability of the queue length becoming n for different values of n and determine the
average queue length.

Solution: Using eqn. (4.6.BnD), with all the λi set to λ and all the μi set to μ, and taking ρ = μ/λ,
we find pj = p0/ρ^j. From ∑j≥0 pj = 1, we obtain p0 = 1 – 1/ρ and pj = (1 – 1/ρ)/ρ^j. The average queue
length is ∑j≥0 jpj = (1 – 1/ρ)∑j≥0 j/ρ^j = (1 – 1/ρ)(1/ρ)/(1 – 1/ρ)^2 = 1/(ρ – 1) = λ/(μ – λ). Note that for ρ = 1,
the queue length becomes unbounded, and for a service rate μ that is slightly greater than, but very
close to, the arrival rate λ, the queue can become quite long.


Example 4.13: Gracefully degrading system with n identical modules The behavior of a
gracefully degrading system with n identical modules can be modeled by an (n + 1)-state birth-
and-death process, where state j represents the unavailability of j modules (state 0, with all the
modules being functional, is the initial state). Find the probability of the system being in state 0,
assuming up to s modules can be repaired at once (the so-called M/M/s queue, where M stands for
Markov process for failure and repair and s is the number of service stations or providers).

Solution: Figure 4.11 depicts the Markov model for the system under consideration. The repair
rates used depend on the value of s. For s = 1, all repair transitions are labeled μ. For general s,
repair transitions are labeled μ, 2μ, … , sμ, from the left side of the diagram, with the remaining
transitions, if any, all labeled sμ, the maximum repair rate with s service providers. Applying
balance equations and defining ρ = λ/μ, we can find the steady-state probabilities of being in the
various states as:
pj = (n – j + 1)λpj–1/(jμ), for 1 ≤ j ≤ s
pj = (n – j + 1)λpj–1/(sμ), for s + 1 ≤ j ≤ n
The equations above yield each state probability in terms of p0, with C(n, j) denoting the binomial coefficient:
pj = C(n, j)ρ^j p0, for 1 ≤ j ≤ s
pj = C(n, j)ρ^j [j!/(s!s^(j–s))]p0, for s + 1 ≤ j ≤ n
Using p0 + p1 + … + pn = 1, we can compute p0 to complete the derivation:
p0 = 1/[∑j=0..s C(n, j)ρ^j + ∑j=s+1..n C(n, j)ρ^j j!/(s!s^(j–s))]
The state probabilities just derived can be used to compute system availability (summing over
states in which the system is deemed available) or performability (weighted sum, where the weight
for each state is a measure of system performance at that state).

Fig. 4.11 Markov model for an n-module gracefully degrading system.


Repair transition rates may vary, depending on the number
of servers or repair stations.
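The M/M/s state probabilities are easy to compute for concrete n and s. The sketch below (illustrative, with made-up parameters) builds the distribution and verifies it against the birth-and-death balance conditions:

```python
from math import comb, factorial

def state_probs(n, s, lam, mu):
    """Steady-state probabilities p_0..p_n for the model of Fig. 4.11:
    n identical modules with failure rate lam each, up to s repaired at once."""
    rho = lam / mu
    w = []
    for j in range(n + 1):
        wj = comb(n, j) * rho**j
        if j > s:                            # repair saturates at rate s*mu
            wj *= factorial(j) / (factorial(s) * s**(j - s))
        w.append(wj)
    total = sum(w)
    return [x / total for x in w]

p = state_probs(4, 2, 0.1, 1.0)              # 4 modules, 2 repair stations
avail = sum(p[:4])                           # available unless all 4 are down
print(p[0], avail)
```

Each adjacent pair of states must satisfy the flow balance (n – j + 1)λ p(j–1) = min(j, s)μ p(j), which the test below exercises; with s = n the weights reduce to the binomial distribution.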

Dependability modeling is an iterative process, as depicted in Fig. 4.12. Even if the


modeling approach is chosen correctly at the outset, fine-tuning of assumptions and
parameters may be required in multiple iterations, until satisfactory and trustworthy
results have been obtained. In carrying out the modeling process, various software aids
can be useful. A wide variety of such aids have been developed over the years and many
of them are available free of charge to the general public or for academic research.


Fig. 4.12 The dependability modeling process.

Examples of software aids for dependability modeling include those offered by PTC
Windchill (formerly Relex) [PTCW20] and ReliaSoft [Reli20], companies specializing in
reliability engineering, University of Virginia’s Galileo [UVA20], and Iowa State
University’s HIMAP [IASU15]. There are also more limited tools associated with Matlab
and a number of Matlab-based systems. A 2004 study enumerates a set of possibly useful
tools [Bhad04]. A Google search for “reliability analysis tools” will reveal a host of other
products and guides.


Problems

4.1 Four-processor fail-soft system


A multiprocessor system with 4 nodes has a per-processor malfunction rate of 1 per 500 hours and a repair
rate of 1 per 10 hours. All other system components are assumed to be perfectly reliable for this problem.
Each processor can perform 1 unit of computation per hour. The system is deemed available as long as at
least one processor is functioning. Derive the system’s availability, performability, and MCBF.

4.2 Automobile with a spare tire


Consider an automobile with four regular tires and one spare tire. The failure rate of a regular tire is λ. A
spare fails at the rate λ1 < λ when it is in the trunk and at the rate λ2 > λ when it is used. When the spare is
in use, the replaced failed tire is repaired at the rate μ. Assume no repair for a failed spare in the trunk,
because this event is usually not detected. The system fails when any two tires are unusable.
a. Construct a state-space model for this system and derive the associated reliability equation.
b. What do we need to add to the model in order to allow the derivation of steady-state availability?
c. Relate this example to a computer-based system that you describe.

4.3 Two-processor fail-soft system


The following Markov model is known to correspond to a two-processor fail-soft system.

1 10 2

1
00 1 11
2

2 1
01

a. Explain the assumptions under which the model was developed. In particular, pay attention to the
lack of full symmetry in the state diagram.
b. Solve the simplified model to derive the steady-state probabilities associated with the four states.

4.4 Delayed failure detection


Consider a state-space model for a system that can be in one of three states: G (good), U (bad, failure
undetected), F (bad, failure detected). Assume failure rate of λ, repair rate of μ, and failure detection “rate,”
modeling the latency of failure detection, of δ.
a. Calculate the steady-state availability of this system.
b. Discuss the implications of delayed failure detection by comparing your result with that of a two-
state system that has immediate failure detection.

4.5 Modeling a championship series


Two sports teams play a 7-game championship series, with the series ending as soon as one team has won 4
games. The state of the championship series at any time can be represented by the pair (w1, w2), where wi is
the number of games won by team i. Assume that team 1 wins each game with probability p, independent of
all previous results. Tie scores are not allowed.
a. Model this championship series as a Markov process.


b. Use the model of part a to find the probability Ci that team i (i = 1, 2) wins the championship.

4.6 Modeling a small tournament


Four sports teams play in a single-elimination tournament. In the first round, teams 1 and 4 play each other
in one game and teams 2 and 3 in another. Winners of these two games then play for the championship,
while the losers play for the third place. The probability that team i wins over team j in a game is pij,
independent of all previous results. Tie scores are not allowed.
a. Model this tournament as a Markov process.
b. Use the model of part a to find the probability Cij that team i finishes in jth place (1 ≤ i, j ≤ 4).

4.7 A parallel system of lightbulbs with replacement


Consider n lightbulbs, all lit at the same time, that fail independently at the rate λ. In Example 4.4, we
analyzed this system under the assumption of no replacement. Assume that a building custodian replaces
the failed lightbulbs at a fixed rate μ. Detection of a dead lightbulb is immediate.
a. Construct a Markov chain to model this system of n parallel lightbulbs with replacement and use
the model to find the expected time until all n lightbulbs die.
b. Find numerical values for the expected lifetimes without and with replacement, assuming a failure
rate of 0.001/hr and a replacement rate of 10/hr.

4.8 Weather in the Land of Oz


The Land of Oz is said to have a peculiar weather pattern. It never has two nice days in a row. The day
following a nice day is equally likely to be rainy or snowy. When it snows or rains, there is an even chance
that next day’s weather is the same. If there is a change from snow or rain, only half the time the change
involves having a nice day.
a. Represent the weather in the Land of Oz as a 3-state Markov chain.
b. Show that if the transition matrix M for the Markov model of part a is raised to the nth power, the
(i, j) entry of the resulting matrix M^n represents the probability of having type-j weather n days
after having type-i weather.
c. Show that in the long run, weather in the Land of Oz will be totally independent of today’s
weather.

4.9 Moving balls between urns


We have two urns, together holding 4 balls. At each step, one of the 4 balls is chosen at random and moved
from its current place to the other urn.
a. Taking the number of balls in urn 1 as the system state, represent this process as a Markov chain.
b. Assuming that we begin with no ball in urn 1, plot the variation of the expected number of balls in
urn 1 as a function of steps.
c. Repeat part b, this time assuming that urn 1 has 2 balls at the outset.

4.10 The stepping-stone model


The stepping-stone model is used in the study of genetics. We have a square array of side n, with each of
the n^2 cells having one of k different colors. Each cell has 8 nearest neighbors, even the cells on the edge of
the array, which are assumed to be neighbors with cells at the other end of the row or column. In each step,
a cell is chosen at random and is given the same color as a randomly chosen nearest neighbor.
a. How many states are there in the Markov model of this process?


b. Is it feasible to solve the Markov model for k = 2 colors in an array of side length n = 10?
c. Write a program to experiment with and observe state changes in the example of part b.
d. Prove that eventually, all cells will assume the same color.
e. Prove that the probability that one of two initial colors prevails is equal to the proportion of cells
that are of that color at the start.

4.11 Making bail


A man in jail has $300 but needs $800 to get out on bail. A prison guard agrees to make a series of bets with
him. If the prisoner bets $x, he wins $x with probability 0.4 and loses $x with probability 0.6.
a. Find the probability that the prisoner accumulates $800 before losing all of his money, assuming
he bets $100 each time.
b. Repeat part a, this time assuming that the prisoner bets as much as possible, but not more than the
amount he needs to reach $800.
c. Which strategy, the one in part a or that of part b, gives the prisoner a better chance of making
bail?

4.12 Absorbing Markov chains


An absorbing state in a Markov chain is one that has no outgoing transition. In an absorbing Markov chain,
some absorbing state can be reached from any transient state. The transition matrix M of an absorbing
Markov chain with r absorbing and n – r transient states has the canonical form
M = [Q R; 0 Ir] (in block form, with Q in the upper left and Ir in the lower right)
where Q is a square matrix of size n – r and Ir is the identity matrix of size r.
a. Provide an interpretation for the value of the (i, j) entry of the matrix Q^k.
b. Prove the identity N = ∑(k=0 to ∞) Q^k = (I – Q)^(–1).
c. Show how the fundamental matrix N of part b can be used to find the expected number of steps
before absorption, when starting from an arbitrary transient state.
d. Model the process of flipping a fair coin until the sequence HTH is observed as an absorbing
Markov chain and find the expected number of flips before the desired pattern occurs.

4.13 The game of tennis


The game of tennis may enter a tie state called “Deuce,” which requires one player to make two straight
points in order to win. If player A makes a point, the game goes from “Deuce” to the “Advantage A” state,
from which it either goes to the “A wins” state or back to the “Deuce” state, depending on who makes the
next point. The “Advantage B” and “B wins” states are defined analogously. Suppose that player A has a
probability p of winning any given point, regardless of the previous history of the game.
a. Set up a Markov chain to model the game of tennis, beginning from the “Deuce” state.
b. Find the probabilities of reaching the two absorbing states “A wins” and “B wins.”
c. Starting at “Deuce,” what is the expected number of points played before the game ends?

4.14 Stochastic sequential machines


Consider the stochastic sequential machine defined in Example 4.1.
a. Derive the machine’s final state vector, given an initial state vector and a very long sequence of 0s
as input.


b. Repeat part a for a very long sequence of 1s as input.


c. Repeat part a for a very long input sequence beginning with 0 and containing alternate 0s and 1s.

4.15 Modeling a 3-processor fail-soft system


Solve the simplified Markov model of Fig. 4.2b, as defined in Example 4.2, to find the steady-state
probabilities for the 4 states, system availability (which you define), and system performability.

4.16 Quintuplicated system with repair


Perform an analysis similar to that of Example 4.8 for a system with 5-way voting on the outputs of
identical modules, in which a disagreeing module and one of the good modules are removed from service
and the system switches to TMR operation while the bad module is being repaired. If before the repair of
the first failed module has been completed, one of the three operating modules fails, the good module that
was taken out of service is switched back in and the system continues operating in TMR mode. Assume
perfect failure detection and switching.

4.17 Fail-soft systems with imperfect coverage


We solved the Markov model of Example 4.9 in the special case of λ2 = 2λ1 and μ2 = μ1. Derive a solution
for the general case, without the latter assumptions.

4.18 Fail-soft system with imperfect coverage


As suggested at the end of Example 4.9, the notion of coverage can be applied to repair transitions as well, so
as to model imperfect repair or the inability of the system to reincorporate a repaired module into its
operation. Show how this might be done and solve the resulting Markov model.

4.19 Birth-and-death processes


The M/M/1 queue is a special case of the M/M/s queue discussed in Example 4.13. The M/M/1/b queue
assumes a limited buffer size b, so that arrivals or births are not accepted when the buffer is full. Solve the
Markov model corresponding to this birth-and-death process with limited buffer size.

4.20 Little’s theorem or law


Consider a bank or similar system in which customer arrivals and departures are modeled by a birth-and-
death process with arrival rate λ and service rate μ. The average time W a customer spends in the system is
the sum of the average time spent in the queue and the average service time 1/μ. Prove Little’s theorem or
law, which states that L = λW, where L is the average queue length.

4.21 Partial fraction expansion


To expand a fraction N(s)/(s – r)^k, which has the root r repeated k times, we write N(s)/(s – r)^k = a1/(s – r) +
a2/(s – r)^2 + … + ak/(s – r)^k, where the aj are constants to be determined. Use this method to find the partial
fraction expansion of (s + 3)/[s(s + 2)^2(s + 5)].

4.22 Convergence of Markov chains


In Problem 4.8, weather in the Land of Oz was modeled by a 3 × 3 Markov matrix and you were asked to
show that, in the long run, weather is totally independent of today’s weather. Another way of saying this in
the general case of an n × n Markov matrix is that for k large enough, M^k converges to a matrix with
identical rows (x0 x1 . . . xn–2 1–x0–x1– … –xn–2).


a. Does such a convergence occur for any Markov matrix M? If yes, explain your reasoning in full; if
not, provide a counterexample.
b. Assuming that convergence does occur, outline an efficient procedure for finding the fixed long-
term distribution (x0 x1 . . . xn–2 1–x0–x1– … –xn–2) from M, without having to compute many
powers of M to empirically verify convergence.

4.23 Modeling of system behavior


An e-commerce Web site has the following simplified discrete-time 4-state model for customer behavior.
Any customer starts in state 0. After one unit of time, the customer goes to the browse state 1. Over the
next time unit, the customer goes from the browse state to the purchase state 2 with probability 0.1 and to
the exit state 3 with probability 0.2. From the purchase state, the customer goes back to the browse state
with probability 0.2 and to the exit state with probability 0.7. There are no transitions out of the exit state.
a. Draw a diagram for the state-space model and write down the corresponding Markov matrix M.
b. Derive the probability that a user in start state 0 at time step t will be in the exit state 3 at time step
t + 4.
c. Derive the probability that any given customer makes a purchase before exiting.
d. Compute M^2, M^3, and M^4 and discuss their meanings. What is the limit of M^k as k tends to infinity?
e. Derive the probability that a customer makes more than one purchase (visits state 2 at least twice).


References and Further Readings


[Ashb03] Ashbrook, D. and T. Starner, “Using GPS to Learn Significant Locations and Predict
Movements Across Multiple Users,” Personal and Ubiquitous Computing, Vol. 7, No. 5, pp.
275-286, 2003.
[Behr00] Behrends, E., Introduction to Markov Chains, with Special Emphasis on Rapid Mixing,
Vieweg, 2000.
[Bhad04] Bhaduri, D., “Tools and Techniques for Evaluating Reliability Trade-offs for Nano-
Architectures,” MS thesis, Virginia Tech, 2004. [On-line version accessed on January 9, 2015.]
[Bolc06] Bolch, G., S. Greiner, H. de Meer, and K. S. Trivedi, Queueing Networks and Markov Chains:
Modeling and Performance Evaluation with Computer Science Applications, Wiley, 2006.
[Buza70] Buzacott, J. A., “Markov Approach to Finding the Failure Times of Repairable Systems,” IEEE
Trans. Reliability, Vol. 19, No. 4, pp. 128-134, November 1970.
[Funk07] Funk, A., J. R. Gilbert, D. Mizell, and V. Shah, “Modeling Programmer Workflows with
Timed Markov Models,” http://gauss.cs.ucsb.edu/publication/ctwatch-tmm.pdf
[Ghah01] Ghahramani, Z., “An Introduction to Hidden Markov Models and Bayesian Networks,” Int’l J.
Pattern Recognition and Artificial Intelligence, Vol. 15, No. 1, pp. 9-42, 2001.
[Grin06] Grinstead, C. M. and J. L. Snell, Introduction to Probability, on-line version of Chapter 11,
entitled “Markov Chains,” pp. 405-470, accessed on January 9, 2015.
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf
[IASU15] Iowa State University, “HIMAP Reliability Analysis Software,” accessed on January 9, 2015.
http://ecpe.ece.iastate.edu/dcnl/Tools/tools_HIMAP.htm
[Kuo03] Kuo, W. and M. J. Zuo, Optimal Reliability Modeling: Principles and Applications, Wiley,
2003.
[Malh94] Malhotra, M. and K. S. Trivedi, “Power-Hierarchy of Dependability Model Types,” IEEE
Trans. Reliability, Vol. 43, No. 3, pp. 493-502, September 1994.
[PTCW20] PTC Windchill (formerly Relex), https://www.ptc.com/en/products/windchill
[Puki98] Pukite, J. and P. Pukite, Modeling for Reliability Analysis: Markov Modeling for Reliability,
Maintainability, Safety, and Supportability Analyses of Complex Computer Systems, IEEE
Press, 1998.
[Reib91] Reibman, A. L. and M. Veeraraghavan, “Reliability Modeling: An Overview for System
Designers,” IEEE Computer, Vol. 24, No. 4, pp. 49-57, April 1991.
[Reli20] ReliaSoft, “Reliability Analysis and Management Software,” Company Web site:
https://www.reliasoft.com/products
[Sahn96] Sahner, R. A., K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of
Computer Systems: An Example-Based Approach Using the SHARPE Software Package,
Kluwer, 1996.
[Shoo02] Shooman, M. L., Reliability of Computer Systems and Networks, Wiley, 2002.
[UVA20] University of Virginia, “Galileo Reliability Analysis Software,”
http://www.cs.virginia.edu/~ftree/
[Wiki20] Wikipedia, “Laplace Transform,” http://en.wikipedia.org/wiki/Laplace_transform
[Wiki20a] Wikipedia, “Partial Fraction Decomposition,” http://en.wikipedia.org/wiki/Partial_fraction


II Defects: Physical Imperfections


Ideal
Defective
Faulty
Erroneous
Malfunctioning
Degraded
Failed

“[An engineer] recognizes, unfortunate though they may be, that
defects are unplanned experiments that can teach one how to
make the next design better.”
Henry Petroski, To Engineer Is Human, 1985

“The search for perfection begins with detecting imperfection.”
Anonymous

Chapters in This Part


5. Defect Avoidance
6. Defect Circumvention
7. Shielding and Hardening
8. Yield Enhancement

Any manufactured component, particularly if mass produced, is bound to have


killer and/or latent defects. A killer defect renders the part completely useless,
whereas latent defects potentially disrupt its operation or performance at a later
stage. Understanding the nature, and learning how to manage the consequences,
of such defects is the starting point of our journey through the six undesirable
states in the multilevel model of dependable computing. Advances in integrated-
circuit and other technologies used to build computers and computer-based
systems have unfortunately not helped in this arena: if anything, the exponentially
increasing complexities and much greater densities have made defects more
prevalent and harder to detect. In this part, after discussing general concepts
relating to defect avoidance and circumvention, we study a class of avoidance
methods based on shielding and hardening of components and then consider how
the harsh impact of defects on integrated circuit yield can be mitigated through
redundancy and switch-based reconfiguration methods.


5 Defect Avoidance
“Better a diamond with a flaw than a pebble without one.”
Chinese proverb

“The omission of the silicon that had been put in nickel [core of
the cathode] to make processing easier . . . raised the effective
life of vacuum tubes from 500 hours to 500,000 hours. The
marginal checking gave another factor of ten on that.”
Jay W. Forrester, reflecting on the Whirlwind I
computer, 1983

Topics in This Chapter


5.1. Types and Causes of Defects
5.2. Yield and Its Associated Costs
5.3. Defect Modeling
5.4. The Bathtub Curve
5.5. Burn-in and Stress Testing
5.6. Active Defect Prevention

Complete defect avoidance, if possible, would be the preferred choice for dealing
with dependability concerns at the device level. Unfortunately, however, defect-
free devices and components may be very expensive (due to stringent
manufacturing and/or careful screening requirements) and perhaps even
impossible to build. Thus, we do what we can, within technical and budgetary
constraints, to reduce the occurrence of defects. We then handle what remains
through accurate modeling and appropriate circumvention techniques. In this
chapter, we deal with the understanding, detecting, and modeling of defects,
leaving the discussion of defect circumvention techniques to Chapter 6.


5.1 Types and Causes of Defects

Defects can be viewed as imperfections or weaknesses that may lurk around without
causing any harm, but that can potentially give rise to faults and other undesirable
behavior. Any manufactured part is prone to defects, particularly if it is mass produced.
In this section, we review the types of defects that one finds in integrated circuits and in
certain mass storage devices as prominent examples of what can go wrong during
manufacturing and how the presence of defects can be detected.

Defects in integrated circuits

Modern integrated circuits are produced via a sequence of complex processing steps
involving the deposition of material and structures in layers, beginning with a substrate.
As the structures shrink in size, things can and do go wrong. Small impurities in the
material involved, tiny particles or air bubbles, or even the natural variations associated
with automatic production can lead to problems. Figure 5.1 shows two examples of
defects in modern ICs that affect the circuit elements deposited on a surface and vertical
interconnections between different layers. Figure 5.2a highlights the fact that ideal designs
often acquire nonuniformity through the mass-production process, making circuit parameters
and other aspects of the system different from one point on the chip to another. The same
absolute difference in physical dimensions and shapes becomes much more serious in
relative terms as the technology is scaled down. Figure 5.2b shows the temperature
distribution across a chip. Because speed and other circuit parameters are affected by
temperature, the effect of temperature variations is similar to that of the nonuniformity
resulting from the finite precision of the manufacturing process.

(a) Particle embedded between layers (b) Resistive open due to unfilled via

Fig. 5.1 Typical defects in high-density integrated circuits.


(a) Lithography variations (b) Thermal map of a chip

Fig. 5.2 Process and run-time variations can lead to subtle defects,
and associated performance problems, arising from
changes in resistance, capacitance, and other parameters.

Detection of integrated-circuit defects is a challenging proposition. Some obvious defects


may be detectable through visual inspection. Most defects, however, require much more
elaborate methods. Inspection via high-resolution imaging systems and testing of circuit
parameters (as opposed to its functionality, which may be unaffected by the presence of
defects) are some possibilities. One approach of the latter variety for CMOS circuits is
IDDQ testing, where the method’s name arises from the measurement of the supply current
IDD in the quiescent state. When a defect-free CMOS digital circuit is in quiescent state,
static current between the power supply and ground should correspond to a small amount
of leakage. Many common manufacturing defects lead to a significant rise in this static
current, thus facilitating their detection. Experience has shown that IDDQ testing reveals
not only manufacturing defects but also certain logic-level faults that are otherwise
difficult to detect by tests based on the stuck-at fault model (see Section 9.1).

Because defects may not be directly noticeable, one approach to their detection is to
intentionally push the system from defective to faulty or erroneous state in the multilevel
model of Fig. 1.6, so as to make the system state more observable. Such burn-in or stress
tests will be discussed in Section 5.5.

Besides on-chip defects discussed thus far, defects can occur in elements found at higher
levels of the digital system packaging hierarchy, including in connectors, printed-circuit
boards, cabling, and enclosures. However, these types of defects are deemed less serious,
because they lend themselves to easier detection through visual inspection and assembly-
or system-level testing.


Defects in disk storage devices

The currently dominant technologies for mass storage devices consist of writing data on
smooth surfaces, using magnetic or optical recording schemes. Like integrated circuits,
these recording schemes have also experienced exponential density improvements,
recording more and more bits per unit area. When the rectangular area devoted to the
storage of a single bit has sides that are measured in nanometers (see Fig. 5.3a), slight
impurities, surface variations, dust particles, and minute scratches can potentially wipe
out thousands of bits worth of stored data. It would be utterly impractical to discard every
manufactured magnetic or optical disk if it contained the slightest surface defect.

To see the magnitude of the problem resulting from the high recording density on a hard
magnetic disk, for example, consider the comparative dimensions depicted in Fig. 5.3b.
With such minute dimensions, a small scratch on the disk surface can affect multiple
tracks and thousands of bits. To make matters worse, the read/write head must be placed
very close to the disk surface to allow accurate reading and writing of data with the
densities shown in Fig. 5.3a. The read/write head actually flies on a cushion on air, nearly
touching the surface of the disk. Slight variations in the surface or the presence of an
external particles can cause a head crash, leading to substantial damage to the disk
surface and the stored data. This is why modern disk drives are built with tightly sealed
enclosures.

(a) Bits on the disk surface (b) Head separation from the surface

Fig. 5.3 The high density and small head separation (less than 1 μm)
in magnetic recording storage technology.


Surface defects, and their impact on the stored data, are not unique to magnetic mass
storage devices. Similar considerations apply to other storage media, such as CDs and
DVDs.

Challenges from disk defects are similar to those faced by IC designers and
manufacturers: namely, the detection of these defects and appropriate schemes to
circumvent them. Data on disk memory is often protected by a CRC or a similarly strong
error-detecting/correcting code. When a sector exhibits repeated violations of the code, it may be
remapped to a different physical disk location and its original location flagged as
unusable. Computer operating systems routinely monitor disk operation, using externally
observable characteristics and certain sensor-provided information to avoid serious disk
failures and the ensuing data loss. Here is a partial list of monitored parameters in modern
disk drives: head flying height (a downward trend often portends a head crash); number
of remapped sectors; frequency of error correction via the built-in code. Additionally, the
following are signs of mechanical or electrical problems that may lead to future data loss:
changes in spin-up time, rising temperatures in the unit, reduction in data throughput.

Modern disk memories typically have strong protection built in against defect-caused
data corruption. As shown in Fig. 5.3disk, the protection may span multiple levels, from
individual sectors (red protective coding), blocks of sectors (blue), and blocks of blocks
of sectors (green).

Fig. 5.3disk Multiple levels of protective error-coding in a Hitachi disk.


Black areas represent raw (non-redundant) data sectors.


5.2 Yield and Its Associated Costs

The multistep chemical and physical processes that lead from a silicon crystal ingot to a
finished IC chip are depicted in Fig. 5.4. Defects on the sliced wafer lead to a certain
number of defective dies after the wafer has been patterned (converted into a collection of
dies, via a number of processing steps). In the example of Fig. 5.4, 11 of the 120 dies are
shown as defective. So, assuming that no other defects arise in the processes of dicing,
testing, and mounting the dies, a yield of 109/120 ≈ 91% is achieved.

Example 5.1: Financial gain from yield improvement Consider a company that manufactures
5000 wafers per week. A wafer holds 200 dies, with each good die generating a revenue of $15.
Estimate the annual revenue gain from each percentage point in improved yield.

Solution: Each percentage point improvement in yield results in 2 additional good dies per wafer,
corresponding to a revenue gain of $30. So, the expected annual revenue gain for a 1% yield
improvement is $30 gain/wafer × 5000 wafers/week × 52 weeks/year = $7.8M.
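The arithmetic of Example 5.1 can be checked with a few lines of Python (an illustrative sketch, not part of the original text; the variable names are ours):

```python
# Example 5.1: annual revenue gain from a one-percentage-point yield improvement
wafers_per_week = 5000
dies_per_wafer = 200
revenue_per_good_die = 15                  # dollars
extra_good_dies = 0.01 * dies_per_wafer    # 1% of 200 dies = 2 more good dies/wafer
annual_gain = extra_good_dies * revenue_per_good_die * wafers_per_week * 52
# annual_gain evaluates to 7.8 million dollars, matching the text
```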

Experimental evidence suggests that the die yield is related to defect density (number of
defects per unit area) and die area, as shown in equation (5.2.yield). The parameter a is a
technology-dependent constant which is in the range 3-4 for modern CMOS processes.

Die yield = Wafer yield × [1 + (Defect density × Die area)/a]^(–a)    (5.2.yield)

[Figure: a silicon crystal ingot (15-30 cm across) is sliced into blank wafers; 20-30 processing steps produce a patterned wafer with defects; the wafer is diced, the dies are tested, good dies are mounted, and the mounted parts are tested again, yielding usable parts to ship.]

Fig. 5.4 The manufacturing process for an IC part.


(a) 120 dies, 109 good (b) 26 dies, 15 good

Fig. 5.5 Why larger dies lead to a dramatic reduction in yield.

It is this nonlinear relationship that causes die cost to grow superlinearly with the die size
(chip complexity). Note that assuming a fixed cost for a wafer and a good yield at the
wafer level (i.e., only a small number of wafers have to be discarded), the cost of a die is
directly proportional to its area and inversely proportional to the die yield.

Die cost = Wafer cost/(Dies per wafer × Die yield)    (5.2.cost)

Because, according to equation (5.2.yield), die yield is a decreasing function of the die
area, the cost of a die will grow superlinearly with its area. This effect is evident from
Fig. 5.5, where the same defect pattern has rendered 11 of the dies useless, leading to a
much smaller yield in the case of the larger dies of Fig. 5.5b than the smaller dies in Fig.
5.5a. In the extreme of using an entire wafer to implement a single integrated circuit, that
is, having one die per wafer, yield becomes a very serious problem. This is why many of
the defect circumvention methods, discussed in Chapter 6, were first suggested in
connection with wafer-scale integration.

Example 5.2: Effects of die size on yield and cost Assume that the dies in Fig. 5.5 are 1 × 1
and 2 × 2 cm in size and ignore the defect pattern shown. Assuming a defect density of 0.8/cm^2,
how much more expensive will the 2 × 2 die be than the 1 × 1 die?

Solution: Let the wafer yield be w. From the die yield formula, we obtain a yield of 0.492w and
0.113w for the 1  1 and 2  2 dies, respectively, assuming a = 3. Plugging these values into the
formula for die cost, we find that the 2  2 die costs (120/26)  (0.492/0.113) = 20.1 times as
much as the 1  1 die; this represents a factor of 120/26 = 4.62 greater cost attributable to the
smaller number of dies on a wafer and a factor of 0.492/0.113 = 4.35 due to the effect of yield.
With a = 4, the ratio assumes the somewhat larger value (120/26)  (0.482/0.095) = 23.4.
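The calculations of Example 5.2 can be reproduced with a short script based on equations (5.2.yield) and (5.2.cost). This is an illustrative sketch; the function names and the use of a unit wafer yield are our assumptions:

```python
def die_yield(defect_density, die_area, a=3.0, wafer_yield=1.0):
    """Die yield per equation (5.2.yield)."""
    return wafer_yield * (1 + defect_density * die_area / a) ** (-a)

def relative_die_cost(dies_small, dies_large, yield_small, yield_large):
    """Cost of the large die relative to the small die, per (5.2.cost)."""
    return (dies_small / dies_large) * (yield_small / yield_large)

d = 0.8  # defects per cm^2, as in Example 5.2
y1 = die_yield(d, 1.0)   # 1 x 1 cm die
y4 = die_yield(d, 4.0)   # 2 x 2 cm die
ratio = relative_die_cost(120, 26, y1, y4)
print(round(y1, 3), round(y4, 3), round(ratio, 1))
# 0.492 0.113 20.0  (the text's 20.1 comes from using the pre-rounded yields)
```

Changing the default `a=3.0` to 4.0 reproduces the 23.4 figure in the same way.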


The aforementioned effect of die size on yield is widely known and duly emphasized in
VLSI design courses. Another cost factor associated with yield, however, is often
ignored: low yield leads to much higher testing cost if an overall part-quality goal is to
be achieved. This is illustrated in the following example.

Example 5.3: Effects of yield on testing and part reliability Assuming a part yield of 50%,
discuss how achieving an acceptable defective part rate of 100 defects per million (DPM) affects
the part cost. Include all factors contributing to cost.

Solution: Consider manufacturing 2M parts of which 1M are expected to be defective, given the
50% yield. To achieve the goal of 100 DPM in parts shipped, we must catch 999,900 of the 1M
defective parts. Any testing process is imperfect in that the test will miss some of the defects
(imperfect test coverage) and will also generate a number of false positives. Thus, we require a test
coverage of 99.99%. Going from a coverage of 99.9% (a fraction 10^–3, or 0.1%, of the defects
missed) to 99.99% (10^–4, or 0.01%, missed), for example, entails a significant investment in test
development and application times. False positives do not constitute a major cost in this particular
context, because discarding another 1-2% of the parts due to false positives in testing does not
change the scale of the financial loss.
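The coverage requirement in Example 5.3 can be derived directly from the DPM target. A minimal sketch (function and variable names are ours):

```python
def required_coverage(parts, part_yield, target_dpm):
    """Test coverage needed so that shipped parts meet a DPM target.

    Defective parts that escape the test are shipped alongside the
    good parts; DPM counts escapes per million shipped parts.
    """
    good = parts * part_yield
    defective = parts - good
    # Solve escapes / (good + escapes) = target_dpm / 1e6 for escapes.
    f = target_dpm / 1e6
    escapes = f * good / (1 - f)
    return 1 - escapes / defective

cov = required_coverage(2_000_000, 0.5, 100)
print(f"{cov:.6f}")  # 0.999900 -> the 99.99% coverage requirement
```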

To make the discussion in the solution to Example 5.3 more quantitative, we need to
model the testing cost as a function of test coverage (Fig. 5.6). This modeling cannot be
done in general, as testing cost depends on the tested circuit’s functionality and
implementation technology. There is a significant body of research, however, to assist us
with this task in specific cases of interest [Agra01].
[Figure: testing cost rising steeply as the missed defect fraction decreases from 10^–1 to 10^–5]

Fig. 5.6 Testing cost rises sharply with a reduction in the desired
fraction of missed defects.


5.3 Defect Modeling

Defects are of two main types. Global or gross-area defects result from scratches (e.g.,
from wafer mishandling), mask misalignment, or over/under-etching. Such defects can be
eliminated or minimized by appropriate provisions and process adjustments. Local or
spot defects result from process imperfections (e.g., extra or missing material), process
variations, or effects of airborne particles. Even though spot defects are harder to deal
with, not all such defects lead to structural or parametric damage. The actual damage
suffered depends on the location and extent of the defect relative to feature size.

Two examples of defect modeling are depicted in Fig. 5.7. Excess material deposited on
the die surface can cause physically proximate conductors to become connected. If we
model extra-material defects as circles, then the lightly shaded rectangular regions in Fig.
5.7a indicate possible locations for the center of the defect circle of a certain size that
would lead to improper connectivity. Pinhole defects result from tiny areas where
material may be missing (due to burst bubbles, for example). This may cause problems
because missing dielectric material between two vertically adjacent conductors may lead
to their becoming connected. Critical regions for pinhole defects, shown as small lightly
shaded squares in Fig. 5.7b, correspond to overlapping conductors that are separated by a
thin dielectric layer.

Under such assumptions, the modeling process consists of determining the likelihood of
having defects that fall in the corresponding critical regions, based on some knowledge
about defect kind and size distributions.

(a) Excess-material critical areas    (b) Pinhole critical areas

Fig. 5.7 Excess-material defects (modeled as circular areas) and pinhole defects.


[Figure: defect density (defects/cm², from 0.0 to 0.3) vs. defect diameter (0 to 400 nm)]

Fig. 5.8 A sample defect size distribution for an overall defect rate of 0.3/cm².

Here is an empirical defect-size distribution model. Defects typically range from a
minimum size xmin to a maximum size xmax, with defects outside the range so rare as to
have a negligible effect. The defect density f(x) as a function of defect diameter x follows
a power law:

f(x) = k·x^(–p)    for xmin < x < xmax; 0 otherwise    (5.defect1)

The exponent parameter p is typically in the range [2.0, 3.5] and k is a normalizing
constant. Figure 5.8 depicts a sample defect size distribution, assuming an overall defect
density of 0.3/cm².
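The normalizing constant k is fixed by requiring that f(x) integrate to the overall defect density d over [xmin, xmax]. The sketch below uses illustrative parameter values (our assumptions, not values from the text) and checks the normalization numerically:

```python
def power_law_k(d, p, x_min, x_max):
    """Normalizing constant k so that the defect-size density
    f(x) = k * x**(-p) integrates to the overall defect density d
    over [x_min, x_max] (valid for p != 1)."""
    return d * (p - 1) / (x_min ** (1 - p) - x_max ** (1 - p))

# Illustrative values: overall density 0.3/cm^2, exponent p = 3,
# defect diameters between 10 and 400 (nm).
d, p, x_min, x_max = 0.3, 3.0, 10.0, 400.0
k = power_law_k(d, p, x_min, x_max)

# Numerical check: trapezoidal integration of f recovers d.
n = 100_000
h = (x_max - x_min) / n
xs = [x_min + i * h for i in range(n + 1)]
fs = [k * x ** (-p) for x in xs]
total = h * (sum(fs) - 0.5 * (fs[0] + fs[-1]))
print(round(total, 4))  # 0.3
```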


5.4 The Bathtub Curve

Most components or systems do not have a constant failure rate over their lifetimes. A
simplified model that accounts for variations in the failure rate over time, known as the
bathtub curve, is based on the hypothesis that three factors contribute to failure. The first
factor, infant mortality, is due to imperfections in system design and construction that
may lead to early failures. Taking the analogy of a new car, factory inspections and
testing alone are inadequate for removing all defects, so quite a few defective or
low-quality cars (the so-called “lemons”) are marketed and sold. If the particular car you
buy survives this early phase, then it will likely function without much trouble for many
years.

The second factor, random failures, can arise at any time during a component’s or
system’s life due to environmental influences, normal stresses, or load conditions. The
constant failure rate λ is often used to model this phase of useful life.

The third factor is the wearing out of devices or circuits, leading to higher likelihood of
failures as a component or system ages. As depicted in Fig. 5.btc, the wearout effect is
more pronounced for mechanical devices than for electronics. In fact, much computer and
communication equipment becomes obsolete and is discarded so quickly that wearout isn’t a
significant concern. On the other hand, fatigue or wearout is a major concern for aircraft
parts, including those forming the fuselage, and is dealt with by preventive maintenance
and periodic replacement. Interestingly, aging or deterioration is not limited to hardware
but has also been observed in software, owing to the accumulation of state information
(from setting of user preferences, updates, and extensions).
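The three factors can be viewed as additive contributions to a single hazard-rate function whose sum traces the bathtub shape. The component functions and parameter values below are illustrative assumptions, not from the text:

```python
import math

def hazard_rate(t, infant=0.05, tau=2.0, lam=0.01, wear=1e-6):
    """Bathtub hazard rate as the sum of three terms: a decaying
    infant-mortality term, a constant random-failure rate lam, and
    a wearout term that grows with the square of age."""
    return infant * math.exp(-t / tau) + lam + wear * t ** 2

# Early life is dominated by infant mortality (declining hazard),
# mid-life by the constant rate lam, and old age by wearout (rising).
early, useful, old = hazard_rate(0), hazard_rate(20), hazard_rate(80)
print(round(early, 4), round(useful, 4), round(old, 4))
```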

Fig. 5.btc The bathtub curve, showing the three phases of a


component’s life.


Using colorful expressions such as “the bathtub curve does not hold water” [Wong88],
reliability researchers have been pointing out the weaknesses of the bathtub curve model
for quite a long time. Our discussion of the bathtub curve is motivated by the fact that it
provides a useful pedagogical tool for drawing attention to infant mortality (and hence
the importance of rigorous post-manufacturing tests and burn-in tests) and wearout (often
avoided by preventive maintenance and early retirement of devices or systems that are
prone to deterioration with age). It also tells us why the constant failure rate assumption
might be appropriate during most systems’ post-burn-in, pre-wearout, useful lives.

Referring to Fig. 5.burn1, we note the effect of infant mortality on the reliability function,
driving home the point that unless we deal with the infant mortality problem, achieving
high reliability would be impossible.

[Figure: reliability (%) vs. time in years; an initial sharp drop due to infant mortality, followed by a gradual decline with no significant wearout]

Fig. 5.burn1 Survival probability of electronic components.


5.5 Burn-in and Stress Testing

In order to expose existing and latent defects that lead to infant mortality, one needs to
test a component or system extensively under various environmental and usage scenarios.
An alternative to such extended testing, which may take an unreasonable amount of time,
is to expose the product to abnormally harsh conditions in an effort to accelerate the
exposure of defects. The name “burn-in” comes from the fact that for electronic circuits,
testing them under high temperatures (literally in ovens) is commonly used, given that
intense heat can accelerate the failure processes. In the extended sense, “burn-in” refers
to any harsher-than-normal treatment, including using greater loads, higher clock
frequencies, excessive shock and vibration, and so on.

The ovens used for high-temperature burn-in testing of electronic devices and systems are
quite elaborate and expensive, as they require fine controls to avoid damaging sensitive
parts in the circuits under test.

As depicted in Fig. 5.burn2, components that survive burn-in testing will be left with
very few residual defects that could potentially lead to early failures.

[Figure: reliability (%) vs. time in years; components burned-in for 3 years retain higher reliability than normal components with no burn-in]

Fig. 5.burn2 Survival probability of electronic components.


5.6 Active Defect Prevention

Besides initial or manufacturing imperfections, wear and tear during the course of a
device’s lifetime can lead to the emergence of defects. A harsh operating environment or
excessive load may speed up the development of defects. Such conditions can sometimes
be counteracted by operational measures such as temperature control, load redistribution,
or clock scaling. Radiation-induced defects can be minimized by proper shielding or
hardening (see Chapter 7) and those resulting from mishandling, shock, or vibration can
be mitigated by encasing, padding, or mechanical insulation.

One of the most commonly used strategies for active defect prevention is periodic or
preventive maintenance. Preventive maintenance forestalls latent defects from developing
into full-blown defects that produce faults, errors, and so on. To grasp the role of
preventive maintenance for computer parts, consider that passenger aircraft parts are
routinely replaced according to a fixed maintenance schedule so as to avoid fatigue-
induced failures. So, an aircraft engine may be replaced at the end of its nominal service
period, even though it exhibits no signs of impending failure. Referring to the bathtub
curve of Fig. 5.btc, this is akin to resetting the clock and avoiding the wearout phase of
the curve for the replaced part. For this strategy to be effective, however, we must also
make sure to avoid the infant mortality phase of the new engine by subjecting it to
rigorous burn-in and stress testing.

Given that preventive maintenance has an associated cost in terms of personnel and lost
system functionality, many studies have been performed to optimize the maintenance
schedule under various cost models and system characteristics, including whether the
preventive maintenance is perfect (rendering the system “like new”) or imperfect (e.g.,
reducing the effective age of the system that dictates its hazard rate, but not fully resetting
it to zero [Bart06]). Often, the resulting models for maintenance optimization are too
complex for analytical solution, necessitating the use of numerical solutions.
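As a concrete instance of such numerical optimization, consider the classic age-replacement policy: a unit is replaced preventively at age T at cost cp, or upon failure at cost cf > cp, and T is chosen to minimize the long-run cost per unit time. The sketch below assumes Weibull lifetimes with an increasing hazard rate; all parameter values are illustrative:

```python
import math

def cost_rate(T, beta, eta, cp, cf, steps=1000):
    """Long-run cost per unit time of age replacement at age T,
    for Weibull lifetimes with R(t) = exp(-(t/eta)**beta)."""
    R = lambda t: math.exp(-((t / eta) ** beta))
    # Expected cycle length: integral of R from 0 to T (trapezoidal rule).
    h = T / steps
    mean_cycle = h * (sum(R(i * h) for i in range(1, steps)) + 0.5 * (1 + R(T)))
    expected_cost = cp * R(T) + cf * (1 - R(T))
    return expected_cost / mean_cycle

# Scan candidate replacement ages for the cheapest schedule.
beta, eta, cp, cf = 2.0, 10.0, 1.0, 10.0   # wearout regime (beta > 1)
best_T = min((0.5 * i for i in range(1, 61)),
             key=lambda T: cost_rate(T, beta, eta, cp, cf))
print(best_T)  # preventive replacement well before the mean life
```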


Problems

5.1 Defects and yield


Every three years, from 1980 to 1992, DRAM chips increased in capacity by a factor of 4 and in die area by
a factor of about 1.5 (beginning with 64 Kb and 0.15 cm² in 1980), while yields remained virtually constant
at around 45%. Assuming that these trends have continued up to the present day, what can you say about
trends in defect density and memory cost? State all your assumptions.

5.2 Defect modeling


In Section 5.3, it was noted that the density function for the defect size (diameter) x is usually taken to be
f(x) = k·x^(–p), for xmin < x < xmax, where k, p, xmin, and xmax are constants.
a. Derive the value of the constant k in terms of the other three constants and the overall defect
density d.
b. Estimate the numerical value of k for the sample defect distribution shown in Fig. 5.8.

5.3 Yield variation with die size


Figure 5.5 and Example 5.2 show the effect of increasing the die size from 1 × 1 cm² to 2 × 2 cm².
a. With the same assumptions as in Example 5.2, calculate the yield and relative die cost for 3 × 3
square dies.
b. Repeat part a for 2 × 4 rectangular dies.

5.4 Effects of yield on die cost


A wafer containing 100 copies of a complex processor die costs $900 to manufacture. The area occupied by
each processor is 2 cm² and the defect density is 2/cm². What is the manufacturing cost per die?

5.5 Number of dies on a wafer


Consider a circular wafer of diameter d. The number of square dies of side u on the wafer is upper-bounded
by πd²/(4u²). The actual number will be smaller because there are incomplete dies at the edge.
a. Argue that πd²/(4u²) – πd/(1.414u) is a fairly accurate estimate for the number of dies.
b. Apply the formula of part a to the wafers shown in Fig. 5.5 to obtain an estimate for the number
of dies and determine the error in each case. The dies are 1 × 1 and 2 × 2 and d = 14.
c. Suggest and justify a formula that would work for nonsquare u × v dies (e.g., 1 × 2 cm²).
d. Is it possible to make a general statement about whether square dies or rectangular dies of the
same area waste less space on a wafer?

5.6 Yield modeling


Parts a-c of this problem provide die sizes and associated yields for 3 different DRAM chips over time.
Estimate the corresponding defect density in each case.
a. 1 Gb DRAM chip: TBD
b. 4 Gb DRAM chip: TBD
c. 16 Gb DRAM chip: TBD
d. Derive a formula for the required defect density if the yield for an x Gb DRAM were to be 90%
and provide numerical values for the examples in parts a-c.


5.7 An alternative yield model


Consider another proposed yield model, in which die area is measured in cm² and defect density is given
per cm²:

Die yield = e^(–Defect density × Die area)
a. Provide an analysis that shows when this alternate model provides results that are nearly the same
as those from equation (5.2.yield).
b. Under what conditions would the two models provide significantly different results?
c. Which yield model do you think is more realistic?

5.8 Defect modeling


Intro
a. x
b. x
c. x

5.9 The bathtub curve


Which failure distribution can be used to model all three parts of a bathtub curve? Explain.

5.10 The bathtub curve


The bathtub curve can be viewed as the summation of three curves: a declining curve that represents
failures resulting from problems that exist in a product or system at time zero, a horizontal line
corresponding to random failures due to loading and operational stress, and a rising curve corresponding to
failures caused by wear and tear.
a. Research this idea for a particular system or class of systems.
b. Model the three parts using available or estimated parameters.
c. Verify that the sum of the three parts does yield a bathtub curve.

5.11 The roller-coaster curve


The roller-coaster curve was mentioned in Section 5.4 as a proposed substitute for the bathtub curve.
a. Present diagrams depicting the typical shapes of the roller-coaster curve.
b. Explain failure mechanisms and other reasons for advocating the roller-coaster curve over the
bathtub curve.
c. In terms of probability distributions, how can the roller-coaster curve be modeled?
d. How is the modeling of the roller-coaster curve different from that of the bathtub curve?
e. Try to discover if there are any arguments against using the roller-coaster curve.

5.12 Burn-in testing


Intro
a. x
b. x
c. x
d. x


5.13 Burn-in testing


Intro
a. x
b. x
c. x

5.14 Preventive maintenance


Intro
a. x
b. x
c. x
d. x

5.15 Preventive maintenance


Intro
a. x
b. x
c. x
d. x

5.16 Cost and benefits of yield improvement


In Example 5.1, we analyzed the financial gain from yield improvement, pretending that improving the
yield costs nothing. In reality, improving the yield does have a cost that rises sharply with the target yield,
exhibiting a trend similar to Fig. 5.6 (view the horizontal axis as representing the fraction of bad parts, so
that 10^–1 corresponds to a yield of 90%).
a. Discuss the reasons for the much sharper rise in the cost of yield improvement when yield is
already quite high.
b. Try to find the parameters of a cost model for yield improvement and reconsider Example 5.1 in
light of this cost.

5.17 Probabilistic design


PCMOS (probabilistically correct CMOS logic) adders have been designed that produce an incorrect value
0.25% of the time but consume 3.5 times less power. Further relaxing the probability of correctness to 0.92
leads to a factor-of-15 reduction in power [Anth13]. Additionally, noting that not all bits are created equal
allows us to focus on the more-significant bits at the expense of the less-significant bits, which then
become less reliable. See also [Pale09].
a. What is the reason for reduced energy consumption with probabilistic design?
b. Is the method applicable to both fixed-point and floating-point computations?
c. How does this method differ from lower-precision or adjustable-precision computation?
d. Name a few applications that would benefit from probabilistic design.


5.18 Defect classes


A consumer organization estimates that 29% of new cars delivered to dealers have a cosmetic defect, such
as a scratch or a dent, while 7% have functional defects related to a part or subsystem that does not work
properly. That same consumer organization estimates that 2% of all cars have defects of both types.
a. What is the probability that a new car has some kind of defect?
b. What is the probability that a new car has a cosmetic defect but no functional defect?
c. If your new car has a dent, what is the probability that it also has a functional defect?
d. Do you think that cosmetic and functional defects are independent of each other? Explain.


References and Further Readings


[Agra01] Agrawal, V. D., S. C. Seth, and P. Agrawal, “Fault Coverage Requirements in Production
Testing of LSI Circuits,” IEEE J. Solid-State Circuits, Vol. 17, No. 1, pp. 57-61, May 2001.
[Anth13] Anthes, G., “Inexact Design—Beyond Fault-Tolerance,” Communications of the ACM, Vol.
56, No. 4, pp. 18-20, April 2013.
[Bart06] Bartholomew-Biggs, M., B. Christianson, and M. Zuo, “Optimizing Preventive Maintenance
Models,” Computational Optimization and Applications, Vol. 35, No. 2, pp. 261-279, 2006.
[Cici95] Ciciani, B., Manufacturing Yield Evaluation of VLSI/WSI Systems, IEEE Computer Society
Press, 1995.
[Eler07] Elerath, J., “Hard Disk Drives: The Good, the Bad and the Ugly,” ACM Queue, Vol. 5, No. 6,
pp. 28-37, September-October 2007.
[Ghos10] Ghosh, S. and K. Roy, “Parameter Variation Tolerance and Error Resiliency: New Design
Paradigm for the Nanoscale Era,” Proc. IEEE, Vol. 98, No. 10, pp. 1718-1751, 2010.
[Hawk94] Hawkins, C. F., J. M. Soden, A. W. Righter, and F. J. Ferguson, “Defect Classes—An Overdue
Paradigm for CMOS IC Testing,” Proc. IEEE Int’l Test Conference, 1994, pp. 413-425.
[Khar96] Khare, J. B. and W. Maly, From Contamination to Defects, Faults and Yield Loss: Simulation
and Applications, Kluwer, 1996.
[Klut03] Klutke, G.-A., P. C. Kiessler, and M. A. Wortman, “A Critical Look at the Bathtub Curve,”
IEEE Trans. Reliability, Vol. 52, No. 1, pp. 125-129, March 2003.
[Kore07] Koren, I., and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[Liew00] Liew, T., S. W. Wua, S. K. Chowa, and C. T. Lim, “Surface and Subsurface Damages and
Magnetic Recording Pattern Degradation Induced by Indentation and Scratching,” Tribology
International, Vol. 33, No. 9, pp. 611-621, September 2000.
[Pale09] Palem, K., et al., “Sustaining Moore’s Law in Embedded Computing through Probabilistic and
Approximate Design,” Proc. Int'l Conf. Compilers, Architectures, and Synthesis for Embedded
Systems, October 2009.
[Sham09] Sham, K.-J., “Crosstalk Mitigation Techniques in High-Speed Serial Links,” M.S. Thesis,
Univ. of Minnesota, April 2009.
[Sode95] Soden, J. M. and C. F. Hawkins, “IDDQ Testing and Defect Classes—A Tutorial,” Proc. Custom
Integrated Circuits Conf., May 1995, pp. 633-642.
[Wang10] Wang, L.-T., C. E. Stroud, and N. A. Touba. System-on-Chip Test Architectures: Nanometer
Design for Testability, Morgan Kaufmann, 2010. Section 8.3, “Manufacturing Defects, Process
Variation, and Reliability.”
[Wong88] Wong, K. L., “The Bathtub Curve Does Not Hold Water Any More,” Quality and Reliability
Engineering Int’l, Vol. 4, No. 3, pp. 279-282, July/September 1988.


6 Defect Circumvention
“We grow tired of everything but turning others into ridicule, and
congratulating ourselves on their defects.”
William Hazlitt

“We have learned to live in a world of mistakes and defective
products as if they were necessary to life. It is time to adopt a
new philosophy in America.”
Norman Cousins

“The flaw which is hidden is deemed greater than it is.”
Marcus Aurelius

Topics in This Chapter


6.1. Detection of Defects
6.2. Redundancy and Reconfiguration
6.3. Defective Memory Arrays
6.4. Defects in Logic and FPGAs
6.5. Defective 1D and 2D Arrays
6.6. Other Circumvention Methods

As the densities of integrated circuits and memory devices rose through decades
of exponential growth, defect circumvention methods took on increasingly
important roles. Today, it is nearly impossible to build a defect-free silicon wafer
or a perfectly uniform magnetic-disk platter. So methods for detecting defective
areas, and avoiding them via initial configuration or subsequent reconfiguration,
are in widespread use. Defect circumvention methods covered in this chapter have
a great deal in common with switching and reconfiguration schemes employed at
the level of modules or nodes in parallel and distributed systems. Our focus here
is on methods that are particularly suited to fine-grain circumvention.


6.1 Detection of Defects

Defects are detected as a result of post-manufacturing inspections and testing, as well as
during normal system operation.

When a wafer emerges from the manufacturing process, visual inspections are performed
to identify obvious defects. During this phase, the inspector (human or machine) focuses
on the more problematic areas, such as the edges of a wafer.

Defect avoidance and circumvention methods are complementary. Avoidance schemes
include defect awareness in design, particularly in the floor planning and routing phases,
extensive quality control during the manufacturing process, and comprehensive
screening, including burn-in and stress tests. Defect circumvention methods fall under the
two strategies of defect removal and defect masking. To remove defects, we must first
identify them and then use built-in resources to bypass or disable the defective parts. This
approach is very similar to dynamic hardware redundancy at the module or system level.
Masking of defects requires static redundancy on the die or wafer. In this scheme,
defective parts continue to operate, but their effect is voided or muted by other healthy
parts that operate redundantly. Several examples of defect removal and masking
techniques will be discussed in the remaining sections of this chapter.


6.2 Redundancy and Reconfiguration

Providing redundant components or cells, plus a capability to avoid or route around bad
elements is one way of avoiding defects. This approach is best-suited to systems that
have a regular or repetitive structure on the die. Examples include memories, FPGAs,
multicore chips, and chip-multiprocessors. Irregular or random logic implies greater
redundancy arising from replication, with the interconnect problem exacerbated by the
need for the replicated structures not to be too close to each other (to minimize common-
cause defects).

A good example of the redundancy and reconfiguration approach to defect circumvention
is the method used to avoid bad sectors on a disk memory. Bad sectors are identified by
error detection during read operations. Post-manufacturing tests typically detect a number
of bad sectors that are included in the so-called P-list (permanent or primary defect table).
Such initially damaged sectors do not form part of the disk system’s storage capacity and
have no impact on its performance, given that performance data and guarantees already
include the effect of such sectors. As the disk is used, other defective sectors emerge,
whose addresses are included in the so-called G-list (growth or post-use defect table) by
the disk controller. Upon a disk write operation, such a bad sector is replaced with a spare
sector, with all subsequent accesses to it automatically redirected to the new location.
Because of this redirection, the presence of bad sectors in the G-list slows down access to
the data and affects the overall performance. Once the disk runs out of spare sectors, its
defect circumvention capacity has been exhausted and the entire disk must be replaced.
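The P-list/G-list bookkeeping described above can be sketched as a small remapping layer. This is a toy illustration, not actual disk-controller firmware; all names are ours:

```python
class SectorRemapper:
    """Toy model of bad-sector circumvention on a disk.

    Sectors in the factory P-list are excluded before the disk ships;
    sectors that go bad later are added to the G-list, and accesses to
    them are redirected to spare sectors.
    """
    def __init__(self, p_list, spares):
        self.p_list = set(p_list)      # factory-detected bad sectors
        self.g_list = {}               # grown bad sector -> spare sector
        self.spares = list(spares)     # pool of spare sectors

    def mark_bad(self, sector):
        """Record a newly detected bad sector, if spares remain."""
        if not self.spares:
            raise RuntimeError("spares exhausted: replace the disk")
        self.g_list[sector] = self.spares.pop(0)

    def resolve(self, sector):
        """Physical sector actually accessed for a logical sector."""
        return self.g_list.get(sector, sector)

disk = SectorRemapper(p_list=[7], spares=[1000, 1001])
disk.mark_bad(42)            # defect that emerged in the field
print(disk.resolve(42))      # 1000 (redirected to a spare)
print(disk.resolve(43))      # 43 (healthy sector, accessed directly)
```

The extra lookup in `resolve` mirrors the performance penalty of G-list redirection, and the `RuntimeError` models exhaustion of the circumvention capacity.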

Example 6.1: Disk sector remapping Assume that bad disk sectors are detected with 100%
probability and that there is a hard limit on the number of remapped sectors due to performance
concerns. Suggest a reliability model for the disk.

Solution: To be provided.


6.3 Defective Memory Arrays

Early semiconductor memories were less reliable than their immediate predecessors
(magnetic core memories). Thus, methods of dealing with defective bit cells in such
memories were developed early on. One class of methods involving error-
detecting/correcting codes will be discussed in Chapters 13 and 14. Here, we focus on
defect circumvention methods that allow us to bypass defective memory cells, assuming
that their presence is detected via appropriate tests or via concurrent error detection.

A commonly used scheme is to provide the memory array (as a whole or in subarrays of
smaller size) with spare rows and columns of cells. In the example of Fig. 6.masrc, the
memory array is shown to consist of two subarrays, each with its dedicated spare rows
and columns. When a bad memory cell is detected, the entire row or column containing it
is replaced with one of the spares. The choice of using a spare row or a spare column is
arbitrary when there is an isolated bad cell, whereas in the case of multiple cell defects in
the same row/column, one approach can be more efficient than the other. Switches at the
periphery of the array or subarray allow external connections to be routed to the spare
row/column in lieu of the one containing the bad memory cell(s). There are also defects
in wiring and other row/column access mechanisms that may disable an entire row or
column, in which case the choice of replacement is obvious.

Let us focus on an array or subarray with m data rows and s spare rows. Assuming perfect
switching and reconfiguration, the redundant arrangement can be modeled as an m-out-
of-(m + s) system. The modeling becomes somewhat more complex when we have both
spare rows and columns, but the relevant models are still combinatorial in nature.

Fig. 6.masrc Memory array with spare rows and columns.
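Under the stated assumption of perfect switching, and additionally assuming independent row failures with per-row survival probability r, the m-out-of-(m + s) model can be evaluated directly. The parameter values below are illustrative:

```python
from math import comb

def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n independent rows survive,
    each with survival probability r (binomial tail sum)."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(m, n + 1))

# A subarray with m = 8 data rows and s = 2 spare rows, row reliability 0.95:
m, s, r = 8, 2, 0.95
print(round(m_out_of_n_reliability(m, m + s, r), 4))  # 0.9885
```

Compare this with r**8 ≈ 0.66 for the same array without spares; the two spare rows raise the survival probability substantially.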


Given a particular pattern of memory cell defects, finding the optimal reconfiguration is
nontrivial. We will discuss the pertinent methods in connection with yield enhancement
for semiconductor memory chips in Chapter 8.

Example 6.2: Reliability modeling for redundant memory arrays Statement to be provided..

Solution: To be provided.


6.4 Defects in Logic and FPGAs

Moore and Shannon, in their pioneering work on the reliability of relay circuits [Moor56],
showed how one can build arbitrarily reliable circuits from unreliable, or in their words,
“crummy,” relays. Consider relays that are prone to short-circuiting when they are
supposed to be open. Let the probability of such an improper short-circuiting event be p.
Then, the relay circuit of Fig. 6.M&S will experience a similarly defective behavior (i.e.,
short-circuiting) with probability

h(p) = 4p² – 4p³ + p⁴    (6.4.relay)

It is readily verified that h(p) < p, provided that p < 0.382. In other words, as long as
each relay isn’t totally unreliable (a relay with p ≈ 1/3 is crummy indeed), some
improvement in behavior is achieved via the bridge circuit of Fig. 6.M&S with four-fold
redundancy. Recursive application of this scheme will lead to arbitrarily reliable relay
circuits having the reliability function h(h(h( … h(p)))).
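The fixed-point behavior of h can be checked numerically; below the crossover at p = 0.382, each recursion level sharply reduces the misbehavior probability. A small sketch, with the function name ours:

```python
def h(p):
    """Short-circuit probability of the 4-relay circuit of Fig. 6.M&S,
    built from relays that short improperly with probability p."""
    return 4 * p**2 - 4 * p**3 + p**4

# Below the fixed point at p = 0.382, recursion drives the circuit's
# misbehavior probability rapidly toward 0.
p = 0.1
for level in range(3):
    p = h(p)
    print(level + 1, p)

# Above the fixed point, recursion makes matters worse:
print(h(0.5) > 0.5)  # True
```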

The Moore-Shannon method just discussed is an example of defect circumvention via
masking. The defective relays remain part of the switching circuit but their effects are
counteracted by healthy relays.

(a) Redundancy scheme (b) Reliability function

Fig. 6.M&S Building reliable switching circuits from crummy relays.


Fig. 6.FPGA1 Bypassing of defective elements in an FPGA can be done using the
same methods that allow us to avoid already used or unavailable elements.

(a) Logic blocks and interconnects (b) Possible switch-box details

Fig. 6.FPGA2 Routing resources in an FPGA.

Fig. 6.CMP Defective processors or memory modules can be disabled or bypassed
in a multicore chip or chip-multiprocessor.


FPGA and FPGA-like devices are particularly suitable for defect circumvention methods
via removal (bypassing). As shown in simplified form in Fig. 6.FPGA1, an FPGA
consists of an array of configurable logic blocks (CLBs) that have programmable
interconnects among themselves and with special I/O blocks at the chip boundaries. The
programmable interconnects, or routing resources, can take on different forms in FPGAs,
with an example depicted in Fig. 6.FPGA2. Defect circumvention in such devices is quite
natural because it relies on the same mechanisms that are used for layout constraints (e.g.,
use only blocks in the upper-left quadrant) or for blocks and routing resources that are no
longer available due to prior assignment.

FPGAs are examples of circuits that are composed of multiple identical parts that are
interchangeable. Similar methods are applicable to multicore chips and chip-
multiprocessors. In the latter systems, processors and memory modules may be the units
that are bypassed or replaced. However, defects may also impact the interconnection
network connecting the processors with each other, or linking processors with memory
modules. Such networks constitute the main defect circumvention challenge in this case.
We will discuss the switching and reconfiguration aspects of such systems when we get
to the malfunction level in our multilevel model.
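To illustrate that avoiding defective blocks is the same computation as avoiding already-claimed ones, here is a toy breadth-first router over a hypothetical grid of logic blocks; the grid, its coordinates, and the '#' marking are illustrative, not an actual FPGA tool flow:

```python
from collections import deque

def route(grid, src, dst):
    """Find a path of free logic blocks from src to dst by BFS.
    A block marked '#' is unavailable -- whether it is defective or
    simply claimed by a prior assignment makes no difference here."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}
    q = deque([src])
    while q:
        r, c = q.popleft()
        if (r, c) == dst:
            path, node = [], dst
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != '#' and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None  # no route exists around the unavailable blocks

fabric = ["..#.",
          ".##.",
          "....",
          ".#.."]
print(route(fabric, (0, 0), (0, 3)))
```

The router makes a detour through the third row because the direct path is blocked; marking a block defective and marking it "in use" are interchangeable inputs to the same search.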

Example 6.3: Defect modeling for FPGAs Statement to be provided.

Solution: To be provided.


6.5 Defective 1D and 2D Arrays

Arrays can be built from identical nodes, of which several can be placed on a single chip.
If such nodes are independent of each other and have separate I/O connections, then it
would be an easy matter to avoid the use of any defective nodes. For example, to build a
massively parallel processor out of 64-processor chips, we might place 72 processors on
each chip to allow for up to 8 defective processors. We often prefer, however, to
interconnect the nodes on the chip for higher-bandwidth communication, both on-chip
and off-chip. As shown in Fig. 6.defarray-a, use of on-chip connections can lead to
shorter and more efficient links, while also allowing more pins for each off-chip channel.
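A quick binomial calculation shows what the 8 spare processors buy us; the 5% per-processor defect probability used below is purely illustrative:

```python
from math import comb

def chip_yield(n, spares, q):
    """Probability that at most `spares` of the n blocks are defective,
    with each block independently defective with probability q."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(spares + 1))

# Hypothetical per-processor defect probability of 5%:
# a bare 64-processor chip vs. 72 processors with 8 spares.
print(chip_yield(64, 0, 0.05))   # all 64 must be good
print(chip_yield(72, 8, 0.05))   # up to 8 of the 72 may be bad
```

With these assumed numbers, the chip without spares is usable only a few percent of the time, while the 72-processor version survives the vast majority of defect patterns.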

It is also possible to embed reconfiguration switches between nodes on a chip so as to
allow dynamic bypassing of bad nodes. A simple example of such switches, a 2 × 2
switch having the ‘bent’ and ‘crossed’ states, is depicted in Fig. 6.defarray-b. As shown
in Fig. 6.defarray-c, defects that make one or more nodes unusable can be circumvented
by the proper setting of reconfiguration switches so as to form complete rows and
columns. This method of salvaging a smaller working array from a larger initial array
is useful for both VLSI yield enhancement and run-time reconfiguration upon the
detection of malfunctioning nodes. The proposed methods differ in the types and
placement of switches (e.g., 4-port, single- or double-track), types and placement of spare
nodes, algorithms for deriving working configurations, ways of effecting reconfiguration,
and methods of assessing resilience.

(a) Building blocks (b) 2-state switches (c) 2D array with defects

Fig. 6.defarray Possible building blocks for arrays, 2-state reconfiguration switches
with ‘bent’ and ‘crossed’ states, and reconfigured rows and columns in a defective 2D array.


In the following, we assume the 4-port, 2-state switches depicted in Fig. 6.defarray-b. For
example, a 1D array can be constructed from such switches and a set of functional and
spare nodes, as shown in Fig. 6.array1D-a. Alternatively, we can embed mux-switches in
each of the blocks so as to select one of two inputs (from the block immediately to the
left or from the block two positions to the left) and ignore the other input, based on
diagnostic information. Such embedded switches remove the single points of failure
associated with the nonredundant switches of Fig. 6.array1D-a and also simplify the
reliability modeling process.
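Under the two-positions-back assumption just stated, whether a given defect pattern can be bypassed reduces to a simple check: no two adjacent blocks may both be defective, and the defects must not outnumber the spares. A minimal sketch:

```python
def chain_feasible(defective, spares):
    """Check whether a 1D array whose blocks each take input from one or
    two positions to the left can bypass all defective blocks.
    `defective` lists the block states (True = defective). Bypassing
    fails if two adjacent blocks are both defective, since a mux that
    reaches back only two positions cannot skip both; it also fails if
    there are more defects than spare blocks."""
    if sum(defective) > spares:
        return False
    return not any(a and b for a, b in zip(defective, defective[1:]))

print(chain_feasible([False, True, False, True, False], spares=2))  # True
print(chain_feasible([False, True, True, False, False], spares=2))  # False
```

The sketch ignores failures of the switching elements themselves, which the reliability models discussed in the text must account for.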

The same reconfiguration schemes used for 1D arrays can be applied to 2D mesh arrays,
as depicted in Fig. 6.arr2D1, with the switches allowing a node to be avoided by moving
to a different row or column.

(a) External 2-state switches

(b) Embedded mux-switches

Fig. 6.array1D Reconfigurable arrays with a track of external 2-state switches and
with embedded switching.

(a) External 2-state switches (b) Embedded mux-switches

Fig. 6.arr2D1 Two types of reconfiguration switching for 2D arrays.


Assuming that we also have the capability to bypass nodes within their own rows and
columns (e.g., via a separate switching scheme not shown in Fig. 6.arr2D1), we can
salvage a smaller working array from one with spare rows and/or columns, as depicted in
Fig. 6.arr2D2-a. The heavy arrows in Fig. 6.arr2D2-b denote how rows and columns have
shifted downward or rightward to avoid the bad nodes. We will discuss both the
reconfiguration capacity and the reliability modeling of such schemes in Section 8.5 in
connection with yield enhancement methods.
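A greatly simplified feasibility check conveys the flavor of such reconfiguration algorithms. Assume one spare row and one spare column, and that each defect is repaired by shifting the rest of its row rightward or the rest of its column downward, with at most one such shift per row and per column; this is a crude stand-in for single-track switching, and the algorithms cited in the literature are considerably more sophisticated:

```python
from itertools import product

def salvageable(defects):
    """Can an array with one spare row and one spare column absorb the
    given defects, if each defect is fixed by shifting the rest of its
    row rightward ('R') or the rest of its column downward ('D')?
    Simplification: at most one rightward shift per row and at most
    one downward shift per column."""
    for choice in product("RD", repeat=len(defects)):
        rows = [r for (r, c), d in zip(defects, choice) if d == "R"]
        cols = [c for (r, c), d in zip(defects, choice) if d == "D"]
        if len(rows) == len(set(rows)) and len(cols) == len(set(cols)):
            return True
    return False

# A 2 x 2 cluster of defects can be absorbed, but six defects filling
# a 3 x 2 block exceed what the two spares can compensate.
print(salvageable([(0, 0), (0, 1), (1, 0), (1, 1)]))                   # True
print(salvageable([(r, c) for r in range(3) for c in range(2)]))       # False
```

The brute-force search is exponential in the number of defects; practical schemes use matching or flow formulations instead, but the feasibility condition being tested is the same in spirit.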

(a) Array with unusable/unused nodes (b) Compensation paths

Fig. 6.arr2D2 Salvaging a 5 × 5 working array from a 6 × 6 one with a spare
row/column, and compensation paths associated with the 7 defective nodes shown.

Example 6.4: Reliability modeling for processor arrays To be provided based on [Parh19].

Solution: To be provided.


6.6 Other Circumvention Methods

The notion of “crummy” components, which occupied Moore and Shannon because of
unreliable electromechanical relays, is once again front and center as we enter the age of
nanoelectronics. The sheer density of nanoelectronic circuits makes precise
manufacturing almost impossible and the effects of even minor process variations quite
serious. It is, therefore, necessary to incorporate defect circumvention methods into the
design process and the structure of such circuits.

For example, hybrid-technology FPGAs, with CMOS logic elements and very compact
but unreliable crossbar nanoswitches, need defect circumvention schemes [Robi07] to be
deemed practical. Such hybrid schemes, as depicted in Fig. 6.nano, are expected to
produce an 8-fold or greater increase in density, while providing reliable operation via
defect circumvention. As another example, the use of memory architectures with block-
level redundancy has been proposed for hybrid semiconductor/nanodevice
implementation [Stru05]. The scheme uses error-correcting codes for defect tolerance, as
opposed to using them to overcome damage from operational or “soft” errors. A possible
structure is depicted in Fig. 6.memory.
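As a minimal illustration of using an error-correcting code against hard defects (the textbook Hamming (7, 4) code, not the specific block-level organization of [Stru05]), note that a stuck-at memory cell simply produces the same single-bit error on every read, which the code corrects:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword
    (positions 1..7, parity bits at positions 1, 2, 4)."""
    c = [0, 0, 0, d[0], 0, d[1], d[2], d[3]]  # index 0 unused
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def hamming74_decode(w):
    """Correct at most one flipped bit (e.g., one stuck-at cell)
    and return the 4 data bits."""
    c = [0] + list(w)
    s = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
      + 2 * (c[2] ^ c[3] ^ c[6] ^ c[7]) \
      + 4 * (c[4] ^ c[5] ^ c[6] ^ c[7])
    if s:
        c[s] ^= 1  # the syndrome points at the defective position
    return [c[3], c[5], c[6], c[7]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[2] ^= 1  # the cell at position 3 is stuck at the wrong value
assert hamming74_decode(word) == data
```

Using the code budget for a permanent defect means, of course, that it is no longer available to absorb a soft error in the same word, which is one motivation for the block-level spares in the cited scheme.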

(a) The hybrid technology (b) (c)

Fig. 6.nano Nanoelectronics with “crummy” components, used in conjunction with
CMOS logic, to provide significant improvement in circuit density [Robi07].


Fig. 6.memory Memory with block-level redundancy for efficient hybrid
semiconductor/nanodevice implementation [Stru05].


Problems

6.1 Circuit defect circumvention by masking


Consider that in Fig. 6.M&S, each of the relays is replaced by a resistor of resistance R. The four resistors
will then act like a single resistor of equivalent resistance R. Under what conditions would you say the
redundant resistor network is tolerant of one of the four resistors developing a defect that makes it open
(disconnecting its two ends) or causes it to short-circuit?

6.2 Crummy relays


Consider the scheme for building highly reliable relay circuits, as discussed at the beginning of Section 6.4.
a. Fully justify equation (6.4.relay).
b. Determine the number of levels that the method must be applied recursively if we want to get from
a relay reliability of 1 – p = 0.8 to a switching circuit reliability of 0.9999.
c. Given relays of reliability 1 – p = 0.9, how many relays do we need in a circuit to achieve a
reliability goal of 1 – 10⁻⁹?
d. Develop an approximate formula for the number of recursive levels required if our relays have the
reliability parameter 1 – p ≅ 1 – ε (where ε is quite small) and we have a reliability goal r > 1 – p.
e. How good is the approximation of part d when applied to the examples of parts c and b? Discuss.

6.3 Crummy relays


a. Analyze the switching circuit of Fig. 6.M&S-a, after removing the middle vertical connection
between the upper and lower relays.
b. Is the resulting circuit better or worse than the one in Fig. 6.M&S-a? Discuss.

6.4 Title
Intro
a. x
b. x
c. x
d. x

6.5 Title
Intro
a. x
b. x
c. x
d. x

6.6 Title
Intro
a. x
b. x


6.7 Reconfigurable 2D arrays


The following diagram represents a reconfigurable 2D array with embedded three-terminal switches, so
that each switch (shown as a small box) has two connections to neighboring cells and one connection to a
row/column bus that links all horizontally or vertically aligned switches together. Propose a suitable design
for the switches so that reconfiguration around defective cells becomes possible in a manner similar to the
scheme discussed in Section 6.5. Justify your design choices and show the switch settings for an example
2D array with defects that you specify.

Row bus

Column bus

6.8 Reconfigurable 2D arrays


We need 8 × 8 meshes of processors for a particular application. We want the manufactured arrays to be
able to circumvent any pattern of 5 defective processors. What would you suggest in terms of providing
redundant rows and columns for the array? Please provide complete reasoning for your proposal.

6.9 Reliability inversion


Read the paper [Parh19], where the notion of reliability inversion is defined and an example of where it
might occur is provided. Define and analyze a second example that demonstrates reliability inversion.

6.10 Reliability modeling of reconfigurable linear arrays


Using the modeling approach of [Parh19], set up appropriate reliability models for the processor arrays
depicted in Figs. 6.array1D-a and 6.array1D-b and compare the two schemes with reasonable assumptions
about the model parameters.


References and Further Readings


[Barl65] Barlow, R. E., F. Proschan, and L. C. Hunter, Mathematical Theory of Reliability, 1965, pp.
199-204. (Republished by SIAM, 1996.)
[Breu04] Breuer, M. A., S. K. Gupta, and T. M. Mak, “Defect and Error Tolerance in the Presence of
Massive Numbers of Defects,” IEEE Design & Test of Computers, Vol. 21, No. 3, pp. 216-227,
May-June 2004.
[Durb04] Durbeck, L. J. K. and N. J. Macias, “Obtaining Quadrillion-Transistor Logic Systems Despite
Imperfect Manufacture, Hardware Failure, and Incomplete System Specification,” Chapter 4 in
Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation,
S. K. Shukla and R. I. Bahar (eds.), Kluwer, 2004, pp. 109-132.
[Grah04] Graham, P. and M. Gokhale, “Nanocomputing in the Presence of Defects and Faults: A
Survey,” Chapter 2 in Nano, Quantum and Molecular Computing: Implications to High Level
Design and Validation, S. K. Shukla and R. I. Bahar (eds.), Kluwer, 2004, pp. 39-72.
[Koch59] Kochen, M., “Extension of Moore-Shannon Model for Relay Circuits,” IBM J., April 1959, pp.
169-186.
[Kore98] Koren, I. and Z. Koren, “Defect-Tolerant VLSI Circuits: Techniques and Yield Analysis,”
Proc. IEEE, Vol. 86, No. 9, pp. 1817-1836, September 1998.
[Kore07] Koren, I., and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[Mish04] Mishra, M. and S. C. Goldstein, “Defect Tolerance at the End of the Roadmap,” Chapter 3 in
Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation,
S. K. Shukla and R. I. Bahar (eds.), Kluwer, 2004, pp. 73-108.
[Moor56] Moore, E. F. and C. E. Shannon, “Reliable Circuits Using Less Reliable Relays” (Parts I & II),
J. Franklin Institute, Vol. 262, No. 3, pp. 191-208, September 1956, and No. 4, pp. 281-297,
October 1956.
[Parh19] Parhami, B., “Reliability Inversion: A Cautionary Tale,” submitted for publication.
[Robi07] Robinett, W., G. S. Snider, P. J. Kuekes, and R. S. Williams, “Computing with a Trillion
Crummy Components,” Communications of the ACM, Vol. 50, No. 9, pp. 35-39, September
2007.
[Stru05] Strukov, D. B. and K. K. Likharev, “Prospects for Terabit-Scale Nanoelectronic Memories,”
Nanotechnology, Vol. 16, pp. 137-148, January 2005.
[Tref15] Trefzer, M. A., J. A. Walker, S. J. Bale, and A. M. Tyrrell, “Fighting Stochastic Variability in
D-Type Flip-Flop with Transistor-Level Reconfiguration,” IET Computers & Digital
Techniques, Vol. 9, No. 4, pp. 190-196, July 2015.


7 Shielding and Hardening


“Should you shield the canyons from the windstorms you would
never see the true beauty of their carvings.”
Elisabeth Kubler-Ross

“Go on. Nothing that you can say can distress me now. I am
hardened.”
E. M. Forster, “The Machine Stops”, in The
Eternal Moment (Collection of Short Stories),
Harcourt Brace, 1928

Topics in This Chapter


7.1. Interference and Cross-Talk
7.2. Shielding via Enclosures
7.3. The Radiation Problem
7.4. Radiation Hardening
7.5. Vibrations, Shocks, and Spills
7.6. Current Practice and Trends

Shielding is the act of isolating a part or subsystem from the external world, or
from other parts or subsystems, with the goal of preventing defects that are caused
or aggravated by external influences. This approach, which has been used for
decades to protect systems that operate in harsh environments, is now necessary
for run-of-the-mill digital systems, given the continually rising operating
frequencies and susceptibility of nanoscale components to electromagnetic
interference and particle impacts. As effective as shielding can be, it is often not
enough. Hardening is the complementary technique of increasing the resilience of
components with regard to the undesirable effects named above.


7.1 Interference and Cross-Talk

Electromagnetic or radio-frequency interference (EMI, RFI) is a disturbance that affects
an electrical circuit owing to either electromagnetic conduction or electromagnetic
radiation emitted from an external source. The disturbance may interrupt, obstruct, or
otherwise degrade or limit the effective performance of the circuit. Interference can occur
through the air or via a shared power supply. Crosstalk (XT) refers to any phenomenon by
which a signal transmitted on one circuit or channel of a transmission system creates an
undesired effect in another circuit or channel. Crosstalk is usually caused by undesired
capacitive, inductive, or conductive coupling from one circuit, part of a circuit, or
channel, to another.

Shrinking feature sizes have made on-chip crosstalk a major problem. Increased clock
frequency is also an important contributing factor. At very high frequencies, the small,
distributed capacitance that exists between mutually insulated circuit nodes may lead to
an effective short to the ground, weakening the signals and affecting their ability to
perform the intended functions. Referring to Fig. 7.1a, the interwire capacitance CI can
easily exceed the load plus parasitic capacitance CL for long buses, affecting power
dissipation, speed, and signal integrity.
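The effect of the interwire capacitance CI can be quantified with the usual Miller approximation: a neighbor switching in the opposite direction doubles the voltage swing across CI, while one switching in the same direction contributes almost nothing. The femtofarad values below are illustrative only:

```python
def effective_capacitance(c_load, c_inter, neighbor="quiet"):
    """Effective capacitance seen by a driver on one bus wire.
    A neighbor switching the same way contributes ~0 across C_I,
    a quiet neighbor contributes C_I, and a neighbor switching the
    opposite way contributes ~2*C_I (Miller effect)."""
    factor = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}[neighbor]
    return c_load + factor * c_inter

# Illustrative values (fF); on a long bus C_I can exceed C_L.
CL, CI = 20.0, 50.0
for mode in ("same", "quiet", "opposite"):
    print(mode, effective_capacitance(CL, CI, mode))
```

When CI exceeds CL, worst-case (opposite-phase) switching makes the effective load several times the nominal one, which is why layouts and bus encodings that avoid opposite transitions on adjacent wires pay off in both delay and energy.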

(a) Interwire capacitance (b) On-chip twisted pair

Fig. 7.1 Source of crosstalk problems, and a mitigation method.


7.2 Shielding via Enclosures

Materials and techniques exist for shielding hardware from a variety of external
influences such as static electricity, electromagnetic interference, or radiation. Many
advanced shielding methods have been developed for use with spacecraft computers that
may be subjected to extreme temperatures and other harsh environmental conditions.
Noteworthy among the adverse conditions affecting electronic systems in space is
bombardment by high-energy atomic particles.

As VLSI circuit features shrink, the radiation problem, formerly a concern only in
space missions, affects the proper operation of electronics even on earth. We will discuss
methods for dealing with the radiation problem in Sections 7.3 and 7.4.

Fig. 7.encl Shielding via specialized enclosures or packaging.


7.3 The Radiation Problem

Computing and other electronic equipment can be affected by radiation of two kinds:
electromagnetic and particle.

Effects of electromagnetic radiation can be countered fairly easily. Ultraviolet (UV) radiation
is nonpenetrating and thus easily stopped. Both X-ray and gamma radiation can be
absorbed by atoms with heavy nuclei, such as lead. Other defensive measures include the
use of a thick layer of suitably reinforced concrete, as used, for example, in building
nuclear reactors.

Particle radiation comes in a variety of forms. Alpha particles (helium nuclei) are the
least penetrating, so that even paper stops them. Beta particles (electrons) are somewhat
more penetrating, requiring the use of aluminum sheets. Neutron radiation is more
difficult to deal with, requiring rather bulky shielding. Finally, cosmic radiation comes
into play for space electronics. Besides primary radiation of the kinds just cited,
secondary radiation, arising from the interaction of primary radiation and material used
for shielding, is also of some interest.

As integrated circuits shrink in size, the damage done by high-energy particles, such as
protons or heavy ions, can be significant. Radiation ionizes the oxide, creating electrons
and holes; the electrons then flow out, creating a positive charge which leads to current
leak across the channel. It also decreases the threshold voltage, which affects timing and
other operational parameters. It has been estimated that a one-way mission to Mars
exposes the electronics to about 1000 kilorad of radiation in total, which is near the limit
of what is now tolerable by advanced space electronics.


(a) Heavy ion radiation (b) Proton radiation

Fig. 7.rad Effects of heavy-ion and proton radiations on electronics.
[Source: http://parts.jpl.nasa.gov/docs/Radcrs_Final.pdf]

The most common negative impacts of radiation, and the associated terminology, are as
follows:

Single-event upset (SEU): A single ion changing the state of a memory or register bit;
multiple bits being affected is possible, but rare.

Single-event latchup (SEL) or snapback: A heavy ion or a high-energy particle
shorting the power source to the substrate (high currents may result).

Single-event transient (SET): The discharge of collected charge from an ionization
event creating a spurious signal.

Single-event induced burnout (SEB): A drain-source voltage exceeding the breakdown
threshold of the parasitic structures.

Single-event gate rupture (SEGR): A heavy ion hitting the gate region, combined with
applied high voltage, as in EEPROMs, creating breakdown.


7.4 Radiation Hardening

Radiation hardening is accomplished by a variety of methods, applied from the device
and circuit levels all the way to the system level. At the device and component level, four
approaches are noteworthy. First, instead of the common, and fairly inexpensive,
semiconductor substrate, an insulating or wide-band substrate may be used. Second,
sensitive parts may be replaced with more rugged, functionally equivalent components.
For example, SRAMs may be used instead of DRAMs, a strategy that is quite effective
but implies a nontrivial added cost. Third, the chip or package containing the circuit may
be shielded through the use of more resilient material in the chip’s composition. Fourth,
the packaging may be made radiation-resistant, an approach that is not as effective
against proton radiation as other kinds of radiation, but the ability of the packaging to
slow down the particles, if not completely stop them, may be valuable when used in
conjunction with other methods.

At the fault level, circuit duplication/triplication, along with comparison/voting, can be
used to guard against radiation-caused deviations. One level further up, we can use error
codes to detect or correct any incorrect value produced. Finally, at the system and
application levels, a number of strategies, including on-line or periodic testing, liveness
checks, and frequent system resets can help guard against radiation-caused problems.
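The payoff of triplication with voting can be quantified with the standard TMR formula, assuming independent copies and a perfect voter:

```python
def tmr_reliability(r):
    """Probability that a majority of three independent copies is
    correct: either all three, or exactly two, produce the right
    value: r^3 + 3 r^2 (1 - r)."""
    return r**3 + 3 * r**2 * (1 - r)

# TMR helps only when each copy is better than a coin flip.
for r in (0.6, 0.9, 0.99):
    print(r, tmr_reliability(r))
```

At r = 0.9 the voted output is correct with probability 0.972, and the advantage grows as r approaches 1; below r = 0.5, voting actually makes things worse, which is why TMR guards against rare radiation-induced deviations rather than chronically bad components.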

(a) Packaging example (b) Packaging effectiveness

Fig. 7.pack Packaging and its effectiveness against radiation.
[Source: http://parts.jpl.nasa.gov/docs/Radcrs_Final.pdf]


One important point to keep in mind about enclosures used to mitigate radiation effects is
that proper care must be taken in the choice of material. Because of the possibility of
secondary radiation (radiation of a different kind produced as a result of the primary
radiation interacting with the packaging material), improper packaging may actually do
more harm than good in protecting against radiation effects.


7.5 Vibrations, Shocks, and Spills

Besides radiation, a variety of other environmental conditions can affect the proper
functioning of computer equipment. Vibrations, shocks, and spills constitute some of the
major categories of such conditions.

Vibration can be a problem when a computer system is installed in a car, truck, train,
boat, airplane, or space capsule (basically, anything that moves or spins). Certain factory
or process-control installations are also prone to excessive vibrations that may cause
loose connections to become undone or various mechanical parts to break down from
stress. Systems can be ruggedized to tolerate vibration by initial stress testing (screening
out products that are prone to fail when exposed to vibration) and use of special casing
that absorbs or neutralizes the unwanted movements.

Shock is experienced from rough handling of devices or exposure to impact, as in an
accidental drop or car crash. Various levels of protection against shock can be provided,
ranging from special “skins” added to ordinary devices (as in phone or tablet cases/shells)
to total product redesign from specs. Many modern portable devices have built-in sensors
that detect the acceleration resulting from an impact or drop and initiate protective
actions, such as saving the state or securing the disk.

(a) Ruggedized phone (b) Ruggedized laptop (c) Ruggedized disk drive

Fig. 7.rugged Ruggedized phone (Casio G-Shock), laptop (Panasonic Toughbook),
and external disk drive (LaCie/Hitachi).


Protection against spills, and waterproofing in general, is technically quite simple, but of
course adds to the product cost. Watches and cameras have been marketed in waterproof
versions for many decades. The same methods can be applied to smartphones, laptops,
and other electronic devices. As mechanical moving parts, buttons, and levers disappear
from our devices, the task of waterproofing becomes simpler.

Laptop computers that have been partially ruggedized against shock and spills, intended
for use by children, have been in existence for several years now. Other
environmental conditions against which protection may be sought include electrical noise
(needed for use in some industrial environments; see Section 7.1), radiation (see Section
7.3), and heat.


7.6 Current Practice and Trends

This section has not yet been written. The following paragraphs contain some of the
points to be made.

Abstract of [Nemo97]: Single-event upset (SEU) tolerance for commercial 1Mbit
SRAMs, 4Mbit SRAMs, 16Mbit DRAMs, and 64Mbit DRAMs was evaluated by
irradiation tests using high-energy heavy ions with an LET range between 4.0 and 60.6
MeV/(mg/cm²). The threshold LET and the saturated cross-section were determined for
each device from the LET dependence of the SEU cross-section. We show these test
results and describe the SEU tolerance of highly integrated memory devices in
connection with their structures and fabrication processes. The SEU rates in actual space
were also calculated for these devices.

Abstract of [Karn04]: Radiation-induced single event upsets (SEUs) pose a major
challenge for the design of memories and logic circuits in high-performance
microprocessors in technologies beyond 90nm. Historically, we have considered power-
performance-area trade-offs. There is a need to include the soft error rate (SER) as
another design parameter. In this paper, we present radiation particle interactions with
silicon, charge collection effects, soft errors, and their effect on VLSI circuits. We also
discuss the impact of SEUs on system reliability. We describe an accelerated
measurement of SERs using a high-intensity neutron beam, the characterization of SERs
in sequential logic cells, and technology scaling trends. Finally, some directions for
future research are given.

Abstract of [Worm05]: Systems-on-Chip (SoC) design involves several challenges,
stemming from the extreme miniaturization of the physical features and from the large
number of devices and wires on a chip. Since most SoCs are used within embedded
systems, specific concerns are increasingly related to correct, reliable, and robust
operation. We believe that in the future most SoCs will be assembled by using large-scale
macro-cells and interconnected by means of on-chip networks. We examine some
physical properties of on-chip interconnect busses, with the goal of achieving fast,
reliable, and low-energy communication. These objectives are reached by dynamically
scaling down the voltage swing, while ensuring data integrity (in spite of the decreased
signal-to-noise ratio) by means of encoding and retransmission schemes. In particular, we
describe a closed-loop voltage swing controller that samples the error retransmission rate

to determine the operational voltage swing. We present a control policy which achieves
our goals with minimal complexity; such simplicity is demonstrated by implementing the
policy in a synthesizable controller. Such a controller is an embodiment of a self-
calibrating circuit that compensates for significant manufacturing parameter deviations
and environmental variations. Experimental results show that energy savings amount up
to 42%, while at the same time meeting performance requirements.


Problems

7.1 Designs with improved noise immunity


As devices and interconnects are scaled down, integrated circuits become more vulnerable to noise. Many
techniques have been proposed for reducing the vulnerability (increasing the immunity) to noise in such
circuits. Study the twin-transistor method for improving noise tolerance [Bala01] and write a 2-page report
about the essence of the method and the domain of its applicability.

7.2 Effects of radiation on logic circuits


Read the paper [Poli11] and answer the following questions.
a. How does radiation strike affect the output of a NAND gate with both inputs being 1?
b. Discuss the effect of radiation strike on the operation of a bit-serial adder.
c. Repeat part b for a ripple-carry adder.
d. Which of the adders of parts b and c is more likely to produce an erroneous sum due to radiation?

7.3 Selective hardening


Read the paper [Poli11a] and summarize its key ideas in one typeset page. Single-space the text and include
only one figure from the paper that you deem most important to conveying its main message.

7.4 Wave attacks


In addition to protecting computer systems against radiation and other electromagnetic waves caused by
specific environments and random phenomena in them, we should also be concerned with malicious attacks
that take advantage of system sensitivities to such external interference to force crashes or to compromise
security and data privacy. Study the latter problem and prepare a single-spaced 2-page report on the range
of threats and possible remedies.

7.5 Rugged laptops for space applications


Many technological advances have their roots in the safety and ruggedness requirements of space flight.
The GRiD (Graphical Retrieval Information Display) Compass, the first laptop in orbit, had a 21.6-cm
bright plasma display and was used by NASA on Space Shuttle missions through the early 1990s. The
rugged 4.5-kg laptop, costing $8150 at the time, reportedly survived the 1986 Space Shuttle Challenger
crash. Use Internet sources to discover the various design methods used to build the GRiD Compass and
present your findings in a 2-page report.


References and Further Readings


[Bala01] Balamurugan, G. and N. R. Shanbhag, “The Twin-Transistor Noise-Tolerant Dynamic Circuit
Technique,” IEEE J. Solid-State Circuits, Vol. 36, No. 2, pp. 273-280, February 2001.
[Carm99] Carmichael, C., E. Fuller, P. Blain, and M. Caffrey, “SEU Mitigation Techniques for Virtex
FPGAs in Space Applications,” Proc. MAPLD Int’l Conf., 1999.
[Clar16] Clark, L. T., D. W. Patterson, C. Ramamurthy, and K. E. Holbert, “An Embedded
Microprocessor Radiation Hardened by Microarchitecture and Circuits,” IEEE Trans.
Computers, Vol. 65, No. 2, pp. 382-395, February 2016.
[Duan09] Duan, C., V. Cordero, and S. P. Khatri, “Efficient On-Chip Crosstalk Avoidance CODEC
Design,” IEEE Trans. VLSI Systems, Vol. 17, No. 4, pp. 551-560, April 2009.
[Edmo00] Edmonds, L. D., C. E. Barnes, and L. Z. Scheick, “An Introduction to Space Radiation Effects
on Microelectronics,” JPL Publication 00-06, 83 pp., May 2000. [Available on-line at:
https://snebulos.mit.edu/projects/reference/NASA-Generic/JPL-00-06.pdf]
[Gatt16] Gatti, U., C. Calligaro, E. Pikhay, and Y. Roizin, “Radiation-Hardening Methodologies for
Flash ADC,” Analog Integrated Circuits and Signal Processing, Vol. 87, No. 2, pp. 141-154,
May 2016.
[Geet09] Geetha, S., K. K. Satheesh Kumar, C. R. K. Rao, M. Vijayan, and D. C. Trivedi, “EMI
Shielding: Methods and Materials—A Review,” J. Applied Polymer Science, Vol. 112, No. 4,
pp. 2073-2086, 2009.
[John98] Johnston, A. H., “Radiation Effects in Advanced Microelectronics Technologies,” IEEE Trans.
Nuclear Science, Vol. 45, No. 3, pp. 1339-1354, June 1998.
[JPL] Jet Propulsion Lab., NASA, “Space Radiation Effects on Microelectronics,” short course
slides, undated, available online at: http://parts.jpl.nasa.gov/docs/Radcrs_Final.pdf
[Karn04] Karnik, T., “Characterization of Soft Errors Caused by Single Event Upsets in CMOS
Processes,” IEEE Trans. Dependable and Secure Computing, Vol. 1, No. 2, pp. 128-143,
April-June 2004.
[Kern88] Kerns, S. E., et al., “The Design of Radiation-Hardened ICs for Space: A Compendium of
Approaches,” Proc. IEEE, Vol. 76, No. 11, pp. 1470-1509, November 1988.
[Khat01] Khatri, S. P., R. K. Brayton, and A. L. Sangiovanni-Vincentelli, Cross-Talk Noise Immune
VLSI Design Using Regular Layout Fabrics, Kluwer, 2001.
[Ma89] Ma, T. P. and P. V. Dressendorfer, Ionizing Radiation Effects in MOS Devices and Circuits,
Wiley, 1989.
[Nemo97] Nemoto, N., et al., “Evaluation of Single-Event Upset Tolerance on Recent Commercial
Memory ICs” Proc. 3rd ESA Electronic Components Conf., April 1997.
[Poli11] Polian, I., J. P. Hayes, S. M. Reddy, and B. Becker, “Modeling and Mitigating Transient Errors
in Logic Circuits,” IEEE Trans. Dependable and Secure Computing, Vol. 8, No. 4, pp. 537-
547, July/August 2011.
[Poli11a] Polian, I. and J. P. Hayes, “Selective Hardening: Toward Cost-Effective Error Tolerance,”
IEEE Design & Test of Computers, Vol. 28, No. 3, pp. 54-63, May-June 2011.
[Worm05] Worm, F., P. Ienne, P. Thiran, and G. De Micheli, “A Robust Self-Calibrating Transmission
Scheme for on-Chip Networks,” IEEE Trans. VLSI Systems, Vol. 13, No. 1, pp. 126-139,
January 2005.
[Yu09] Yu, H., L. He, and M.-C. F. Chang, “Robust On-Chip Signaling by Staggered and Twisted
Bundle,” IEEE Design & Test of Computers, Vol. 26, No. 5, pp. 92-104, September/October
2009.

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-10-11 177


8 Yield Enhancement
“As the soft yield of water cleaves obstinate stone, so to yield
with life solves the insolvable: To yield, I have learned, is to
come back again.”
Lao-Tzu

“. . . never give in except to convictions of honour and good
sense. Never yield to force; never yield to the apparently
overwhelming might of the enemy.”
Winston Churchill

Topics in This Chapter


8.1. Yield Models
8.2. Redundancy for Yield Enhancement
8.3. Floor-Planning and Routing
8.4. Improving Memory Yield
8.5. Regular Processor Arrays
8.6. Impact of Process Variations

In Section 5.2, we introduced the notion of yield and explained why a small
deterioration in defect density is amplified in the way it affects the final product
cost. Despite significant strides in improving the design and manufacturing
processes for integrated circuits, yield has presented a greater challenge with each
generation of denser and more complex devices. Due to the significant impact of
yield on cost, modern production technologies for electronic devices incorporate
provisions for detecting and circumventing defects of various kinds to reduce the
need for discarding slightly defective parts. In this chapter we review defect
detection and circumvention methods that are particularly suitable for the goal of
yield enhancement in digital circuits and storage products.


8.1 Yield Models

Yield models are combinatorial in nature and range from primitive to highly
sophisticated. Let us begin with a very simple example.

Example 8.1: Modeling of yield Consider a square chip area of side 1 cm filled with parallel,
equally spaced nodes with width and separation of 1 μm. Assume there are an average of 10
random defects per cm². Defects are of the extra-material kind, with 80% being small defects of
diameter 0.5 μm and 20% being larger defects of diameter 1.5 μm. What is the expected yield of
this simple chip?

Solution: The expected number of defects on the chip is 10 (8 small, 2 large). Small defects
cannot lead to shorts, so we can ignore them. A large defect leads to a short if its center is within a
0.5-μm band halfway between two neighboring nodes. So we need to compute the probability of at
least 1 large defect appearing within a critical area of 0.25 cm², given an average of 2 such defects
in 1 cm². Because each of the 2 defects falls in the critical area with probability 1/4, the probability
of having at least 1 large defect in that area is 1 – (3/4)² = 7/16, giving a yield of 9/16 ≈ 56%.
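The reasoning in the example can be checked numerically. The following Monte Carlo sketch adopts the example's own assumptions (exactly 2 large defects per chip, each independently landing in the critical quarter of the chip area) and should reproduce the analytic yield of 9/16:

```python
import random

random.seed(12345)  # fixed seed for reproducibility

CRITICAL_FRACTION = 0.25  # critical area as a fraction of the chip area
LARGE_DEFECTS = 2         # large (potentially killer) defects per chip

def chip_is_good():
    """A chip survives if no large-defect center lands in the critical area."""
    return all(random.random() >= CRITICAL_FRACTION
               for _ in range(LARGE_DEFECTS))

trials = 100_000
yield_est = sum(chip_is_good() for _ in range(trials)) / trials
print(f"estimated yield = {yield_est:.3f}, analytic = {9/16:.3f}")
```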

Most yield models in practical use are based on defect distributions that provide
information about the frequencies and sizes of defects of various kinds. They then take
the exact circuit layout or some rough model of it into account in deriving critical areas
for each defect type/size. The ratio of the critical area to the total area is a measure of the
sensitivity of the particular layout to the corresponding defect type/size.
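As a concrete (hypothetical) instance of this kind of model, the sketch below combines per-type defect densities with their critical areas under the common assumption of Poisson-distributed defects; the numeric values are made up for illustration:

```python
from math import exp

def poisson_yield(defect_densities, critical_areas):
    """Poisson yield model: a chip is good if no defect of any type lands in
    that type's critical area. defect_densities[i] is the mean number of
    type-i defects per cm^2; critical_areas[i] is the type-i critical area
    in cm^2 (for a 1-cm^2 chip)."""
    mean_killers = sum(d * a for d, a in zip(defect_densities, critical_areas))
    return exp(-mean_killers)

# Hypothetical numbers: small defects (8 per cm^2, critical area 0.02 cm^2)
# and large defects (2 per cm^2, critical area 0.25 cm^2).
print(poisson_yield([8, 2], [0.02, 0.25]))
```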


8.2 Redundancy for Yield Enhancement

Consider a device consisting of n identical cells. A simple-minded strategy for yield


improvement is to provide s spare cells so that we still have at least n good cells in the
presence of up to s defective ones. Such an approach appears to lend itself to modeling as
an n-out-of-(n + s) system. However, such a model would be appropriate only if any
defective cell is replaceable with any one of the spares. Placement of the cells on the chip
and connectivity among them may make such an arbitrary replacement impossible. For
example, replacement may have to be done in blocks (such as rows or columns), instead
of single cells. Such restrictions would lead to a significant reduction in the resulting
yield compared with what the n-out-of-(n + s) model predicts.
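Under the idealized assumption that any spare can replace any defective cell, and that cells fail independently, the n-out-of-(n + s) yield is a simple binomial sum. The sketch below evaluates it for made-up numbers:

```python
from math import comb

def n_out_of_n_plus_s_yield(n, s, p):
    """Yield of a device needing n good cells out of n + s, each cell being
    good independently with probability p (ideal-interchange model)."""
    total = n + s
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(n, total + 1))

# Hypothetical parameters: 64 needed cells, per-cell yield 0.98.
print(n_out_of_n_plus_s_yield(64, 0, 0.98))  # no spares
print(n_out_of_n_plus_s_yield(64, 2, 0.98))  # 2 fully interchangeable spares
```

As noted above, restricted replacement (e.g., block-level sparing) yields less than this ideal model predicts.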


8.3 Floor-Planning and Routing

In automated tools for electronic circuit design, floorplanning and routing stages affect
the resulting yield. Thus, VLSI layout must be done with defect patterns and their
impacts in mind. Designers can mitigate the effect of extra- and missing-material defects
by adjusting the rules for floorplanning and routing. For example, wider wires are less
sensitive to missing-material defects and narrower wires are less likely to be shorted to
others by extra material, given the same center-to-center separation (Fig. 7.matdef). The
examples above indicate that designers face a complex array of optimizations and trade-
offs, as they must strike a balance with regard to sensitivity to different defect types.

Different chip layout/routing designs differ in their sensitivity to various defect classes.
Because of defect clustering, one good idea is to place blocks with similar sensitivities to
defects apart from each other.

[Figure panels: (a) missing material; (b) defects with wider wires; (c) defects with narrower wires; killer and latent defects are marked in each panel.]

Fig. 7.matdef A real missing-material defect and simplified modeling of
both extra- and missing-material defects as circles.


One approach to modeling the impact of defects on yield is to derive critical areas in the
layout where the presence of a defect of a given size would disable the circuit. The gray
regions in Fig. 7.critarea represent the results of such an effort for small extra-material
defects of a specific diameter and for a larger defect of the size shown. The small defect is
seen to be noncritical in most areas, causing shorts between wires/vias only in the small,
fairly narrow gray regions shown. So, there is a relatively low probability that small
defects would lead to an unusable chip. The larger defect, on the other hand, can lead to
shorts when centered in a significant portion of the circuit segment shown, making it a
killer defect with high probability.

The fraction of the chip area that is critical with respect to various defect sizes, combined
with information on the distribution of such defects (Fig. 5.8), allows us to compute the
overall probability that the chip will be rendered unusable by extra-material or missing-
material defects. If changes in the layout cannot improve an unreasonably low yield, then
redundancy techniques, discussed elsewhere in this chapter, might be called for.

[Figure panels: small defect; large defect.]

Fig. 7.critarea Critical areas for different defect sizes.


8.4 Improving Memory Yield

Systems-on-chip (SoCs) found in many modern electronic products consist mostly of
memory, so improving memory yield is important to the overall yield.

The most common redundancy scheme applied to semiconductor memories is the


provision of spare rows and/or columns. The memory array in Fig. 8.memsrc-a has been
provided with 2 spare rows and 2 spare columns. In general, the number of spares need
not be the same in the two dimensions and the spares need not be global in the sense of
spanning all rows/columns. For example, the 6-row memory array of Fig. 8.memsrc-a
may be divided into two 3-row banks, with each bank having its own pair of spare
columns of length 3. Similarly, the spare rows may be segmented when the banks have
fewer columns than the entire memory array.

Replacement of defective rows/columns with spares requires the incorporation of
switching mechanisms around the memory array, so as to route accesses directed at a
defective entity to an assigned spare. [Elaborate on switching schemes.]

Given a particular pattern of defective memory cells (bits), such as the dark cells in Fig.
8.memsrc-a, we would like to know whether the available spare resources are adequate for
circumventing the defects. In other words, we want to find an assignment of spare
rows and columns to the defective cells that covers all defects, if one exists, or to
conclude that the circumvention capacity of the system has been exceeded. For the
example set of 7 defects in Fig. 8.memsrc-a, an assignment can be readily found by
inspection: use the spare columns to cover the defects in columns 2 and 4 (numbering
from 0) and the spare rows to circumvent the defects in rows 3 and 5. We may also be
interested in finding an optimal assignment, that is, one that uses a minimal number of
spare resources, in case more defects must be circumvented in the future.

The assignment problem discussed in the preceding paragraph is NP-complete in general.
We can conclude quite easily that any pattern of n defects can be circumvented by using
n spare rows and columns in any combination (r spare rows and n – r spare columns, for
any r). This sufficient number of spares is also necessary in the worst case, which occurs
when no two defects share a row or column. In most cases, however, we can do better,
circumventing significantly more than r + c defects with r spare rows and c spare
columns (7 defects, with 2 spare rows and 2 spare columns, in the example above).


[Figure panels: (a) memory array, with spare rows/columns; (b) a representation of the defect pattern.]

Fig. 8.memsrc Memory array with spare rows and columns, and the bipartite
graph representing the defect pattern shown.

The spare row/column assignment problem can be converted to a graph problem as
follows. Construct a bipartite graph, with vertices on the two sides corresponding to rows
(R0-Rr–1) and columns (C0-Cc–1) of the memory array, and with an edge connecting
vertices Ri and Cj whenever there is a defect in row i, column j. The assignment problem
then amounts to selecting r vertices among the Ri and c vertices among the Cj such that
every edge is incident to a selected vertex.
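For small spare counts, such a cover can be found by exhaustive search over which defective rows receive a spare; the general problem being NP-complete, the exponential cost in r is expected. A minimal sketch, applied to a hypothetical defect pattern in the spirit of Fig. 8.memsrc-a:

```python
from itertools import combinations

def can_repair(defects, r, c):
    """Can r spare rows and c spare columns cover all defects?
    defects: set of (row, col) positions of bad cells. Try every choice of
    defective rows to replace; the leftover defects must then fit within
    c columns."""
    rows = sorted({i for i, _ in defects})
    k = min(r, len(rows))
    for replaced in combinations(rows, k):
        leftover_cols = {j for i, j in defects if i not in replaced}
        if len(leftover_cols) <= c:
            return True
    return False

# A hypothetical 7-defect pattern: rows {3, 5} plus columns {2, 4} suffice.
defects = {(1, 2), (0, 2), (3, 2), (3, 4), (2, 4), (5, 4), (5, 1)}
print(can_repair(defects, 2, 2))  # True

# Three defects, no two sharing a row or column: 1 + 1 spares cannot cover.
print(can_repair({(0, 0), (1, 1), (2, 2)}, 1, 1))  # False
```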


8.5 Regular Processor Arrays

Processor arrays of 1 or 2 dimensions can be embedded with switches in order to


circumvent a defective processor. In a 1D scheme, exemplified by Fig. 6.array1D, any
number of defective processors can be circumvented by appropriate setting of the switch
states. Assuming the switches and defect detection to be perfect, the resulting system can
be modeled as an n-out-of-(n + s) system. Defective switches can be circumvented by
either embedding them in processor cells (Fig. 6.array1D-b) or by using a redundant
switching network in which defective switches can be bypassed.

The latter approach for bypassing a single defective switch is depicted in Fig. 8.red1Dsw,
where we see that even though the switching cell 3 is inoperative, communication
between neighbors among the remaining processors via the switching network is not
interrupted. However, the unusability of the switching cell 3 also makes processor 2
inaccessible. The use of distributed switching, as shown in Fig. 6.array1D-b, obviates the
need for considering redundancy schemes for the switches and for more complicated
models with separate provisions for switching reliability.

Whereas the 1D arrays discussed above have no limit on the array size and the number of
spares, the 2D array reconfiguration schemes of Fig. 6.arr2D1 are more constrained,
owing to the more rigid connectivity requirements between processor nodes. Referring to
Fig. 6.arr2D2-b, we note that a particular pattern of defects can be circumvented if
straight, nonintersecting, nonoverlapping paths (the solid arrows) can be drawn from the
spare row or column elements to each defective element. The 7 “compensation paths,”
shown as heavy arrows in Fig. 6.arr2D2-b, do not intersect or overlap, indicating that the
7 defective processors can be circumvented, as demonstrated in the same figure.

Fig. 8.red1Dsw Reconfigurable 1D array with redundant switching.


A natural question at this point is the maximum circumvention capacity of the
reconfigurable 2D array in the worst case. It is easy to show that with 1 spare row and 1
spare column, no more than 3 defective processors can be circumvented in the worst case.
Four defective processors, if they appear in a 2 × 2 block, will defeat the scheme, because
one of the processors will be “behind” the others as far as the spare nodes are concerned. In
fact, you should be able to come up with a 3-processor defect pattern that is also
noncircumventable.

The discussion above is based on the assumption of 2-way shift-switching at the edges of
the array, so that a row/column is either connected straight through or with a one-position
shift downward/rightward (Fig. 8.shift-a). It is also possible to use 3-way shift-switching
at the edges (Fig. 8.shift-b), which would allow the spare row/column to be viewed as
being on either side of the array, depending on the defect pattern. This added flexibility
increases the size of the smallest noncircumventable defect pattern to 4 processors, thus
improving yield.

An m × m array must then be modeled as an (m² – 2)-out-of-m² system with regard to
reliability or yield, assuming 2-way shift-switching at the boundaries, and as an (m² – 3)-
out-of-m² system with 3-way shift-switching. These reliability models are simple but
highly pessimistic. Precise modeling is more complicated, requiring that we enumerate
all the possible defect patterns that can/cannot be circumvented and compute a
probability for each case.
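The pessimistic k-out-of-n models just mentioned are easy to evaluate. The sketch below compares the no-redundancy, 2-way, and 3-way cases for a hypothetical 8 × 8 array with a made-up per-processor yield of 0.995:

```python
from math import comb

def k_out_of_n(k, n, p):
    """Probability that at least k of n independent units, each good with
    probability p, are functional."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

m, p = 8, 0.995          # hypothetical array side and per-processor yield
n = m * m
print(k_out_of_n(n, n, p))      # no defect tolerance
print(k_out_of_n(n - 2, n, p))  # 2-way shift-switching (pessimistic bound)
print(k_out_of_n(n - 3, n, p))  # 3-way shift-switching (pessimistic bound)
```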

Example 6.x: Reliability modeling for processor arrays To be provided based on [Parh19].

Solution: To be provided.

We can go beyond the 3-defect limit for reconfigurable 2D arrays by providing spare
rows and columns on all array boundaries, that is, spare rows at the top and bottom and
spare columns on either side. Figuring out the worst-case defect pattern in this case is left
as an exercise.


[Figure panels: (a) 2-way shift-switching; (b) 3-way shift-switching.]

Fig. 6.shift Shift-switching at the edges of a reconfigurable 2D array.

As in most engineering problems, the optimal solution method for a particular application
and defect model may be a mix of the various methods available to us. Intuitively, we can
think of the effectiveness of such a hybrid approach as being due to each method having
some weaknesses that are covered by the other(s).

A good example of a hybrid approach is provided by the problem of memory yield


enhancement. IBM’s 16 Mb memory chip [ref cite] used 16 spare rows and 24 spare
columns in each of its 4 quadrants, along with a single-error-correcting code that attached
9 check bits to each 128-bit data word (or 137 data word? Check the source).
Furthermore, bits assigned to the same word were separated by 8 bit positions, making it
less likely for a single defect to affect more than 1 bit in the same word. As shown in Fig.
6.rcs-ecc, the combination of row/column sparing and single-error-correcting code leads
to a significant improvement in yield for memory devices.

The effectiveness of the hybrid approach just discussed can be explained thus. Row and
column spares are very effective for large numbers of defects when the defects cluster
along rows and columns. Error-correcting codes are capable of overcoming widespread
random defects, as long as no word has too many defective cells. The latter weakness is
covered by spare rows, which allow us to circumvent an entire word (and several other
words in the same row).


Fig. 6.rcs-ecc Memory yield as a function of the average number of
defective cells, with sparing and error-correcting code.


8.6 Impact of Process Variations

As devices and interconnects are scaled down, integrated circuits become more error-
prone and vulnerable to both external influences and internal interference. One important
reason for such errors and vulnerabilities is manufacturing process variations [Ghos10].
Process imperfections cause transistors, wires, and other circuit elements to have
imperfect shapes, which can be considered mild defects. When circuit elements
are relatively large, small imperfections do not cause serious variations in electrical
properties, such as resistance or capacitance. For a tiny element, however, a small
irregularity in shape can translate into relatively large variations in electrical parameters, as
well as large variations between supposedly identical elements.

The same mechanisms that make process variations more serious in modern VLSI than in
previous generations of circuits may also lead to massive numbers of defects or to new
kinds of defects that have not been observed before [Breu04], [Siri04].

Example 8.y: Effect of process variations on wire resistance Consider a wire of width 100 nm
on an IC chip. Process variations may cause the width of the wire along up to half of its length to
become either as small as 50 nm or as large as 150 nm. Quantify the change in the wire’s
resistance in the worst case.

Solution: Assuming no variation in the thickness (depth) of the wire, the resistance is inversely
proportional to wire width, doubling when the width is halved. In the worst case, half of the wire
retains its original resistance of R/2, while the other half has a resistance ranging from R (at width
50 nm) down to R/3 (at width 150 nm). Thus, the total resistance of the wire ranges from
R/2 + R = 3R/2 down to R/2 + R/3 = 5R/6, placing the variations relative to the original resistance
R in the interval [–17%, +50%], with a factor of 1.8 separating the maximum and minimum
resistance values.
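The arithmetic of this example generalizes directly; a small sketch with the example's numbers:

```python
def worst_case_resistance(R=1.0, frac=0.5, w_nom=100.0, w_min=50.0, w_max=150.0):
    """Worst-case resistance range of a wire with nominal resistance R when a
    fraction `frac` of its length may narrow to w_min or widen to w_max;
    resistance varies as 1/width at constant thickness (Example 8.y numbers)."""
    fixed = (1 - frac) * R                       # unaffected portion
    r_max = fixed + frac * R * (w_nom / w_min)   # fully narrowed portion
    r_min = fixed + frac * R * (w_nom / w_max)   # fully widened portion
    return r_min, r_max

r_min, r_max = worst_case_resistance()
print(r_min, r_max, r_max / r_min)  # 5R/6, 3R/2, and the ratio 1.8
```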


Problems

8.1 Memory array with spare rows and columns


Consider an m × m memory array with r spare rows and c spare columns, where m >> max(r, c).
a. What is the smallest number of memory-cell defects that requires all of the redundant
resources to overcome, that is, that cannot be circumvented if we reduce r or c by 1?
b. What is the largest number of memory-cell defects that can be circumvented with the given spare
rows and columns?

8.2 Memory array with spare rows and columns


We learned in Section 8.4 that the spare row/column assignment problem for memory arrays with arbitrary
numbers of spare rows and columns is NP-complete.
a. The NP-completeness result applies to the general problem, not to any specific instance. Show that
when there is only one spare row and one spare column, the problem can be solved efficiently.
b. Present the result of part a in the form of a spare row/column assignment algorithm and supply an
argument for its correctness.
c. Derive the running time of the algorithm of part b as a function of the side length n of a square
memory array.

8.3 Memory array with spare rows and columns


Consider a memory array with spare rows and/or columns for yield enhancement. Which of the following
statements, if any, is correct?
a. Using a spare rows and b spare columns is preferable to using a + b spare rows or columns.
b. Using 2c spare rows and 2c spare columns for a 2n × 2n memory array is preferable to dividing
the memory array into four quadrants and providing each of the quadrants with c spare rows and c
spare columns.

8.4 Yield with defect circumvention


The manufacturing process for an integrated circuit die produces an average of 2.5 defects, with defect
distribution p(k), probability of having k defects, given in the following table. Approximately 40% of the
defects are killer defects.
k 0 1 2 3 4 5 6
p(k) 0.11 0.18 0.25 0.20 0.13 0.08 0.05
a. What is the expected yield of this process?
b. We provide reconfiguration mechanisms on the die to allow the circumvention of up to two
defects. Assuming that the reconfiguration logic is always free from defects and that it requires
negligible area, what is the new expected yield?

8.5 Reconfigurable 2D processor arrays


We concluded that a 2D processor array with 1 spare row and 1 spare column can circumvent up to 2
defective processors in the worst case.
a. What is the guaranteed number of circumventable defects in a 2D processor array with spare
rows/columns on all four sides?


b. Is it advantageous to provide more than one spare row or column on each side of the array?
c. Would the defect tolerance capability change if both spare rows and spare columns are
on one side of the array?

8.6 Reconfigurable 2D processor arrays


Intro

a.
b.

8.7 Critical areas for small and large defects


Consider Fig. 7.critarea, in which examples of small and large defects and their corresponding critical areas
in connection with a specific circuit layout are shown. How would the critical areas change under the
following defect sizes? Only approximate answers are needed.
a. Small defects halve in diameter.
b. Small defects double in diameter.
c. Large defects halve in diameter.
d. Large defects double in diameter.

8.8 Reliability modeling of reconfigurable linear arrays


Using the modeling approach of [Parh19], set up appropriate reliability models for the processor arrays
depicted in Figs. 6.array1Da and 8.red1Dsw and compare the two schemes with reasonable assumptions
about the model parameters.


References and Further Readings


[Breu04] Breuer, M. A., S. K. Gupta, and T. M. Mak, “Defect and Error Tolerance in the Presence of
Massive Numbers of Defects,” IEEE Design & Test of Computers, Vol. 21, No. 3, pp. 216-227,
May-June 2004. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1302088&isnumber=28929
[Chia07] Chiang, C. C. and J. Kawa, Design for Manufacturability and Yield for Nano-Scale CMOS,
Springer, 2007.
[Cici95] Ciciani, B., Manufacturing Yield Evaluation of VLSI/WSI Systems, IEEE Computer Society
Press, 1995.
[Ghai09] Ghaida, R. S., K. Doniger, and P. Zarkesh-Ha, “Random Yield Prediction Based on a
Stochastic Layout Sensitivity Model,” IEEE Trans. Semiconductor Manufacturing, Vol. 22,
No. 3, pp. 329-337, August 2009.
[Ghos10] Ghosh, S. and K. Roy, “Parameter Variation Tolerance and Error Resiliency: New Design
Paradigm for the Nanoscale Era,” Proc. IEEE, Vol. 98, No. 10, pp. 1718-1751, 2010.
[Huan03] Huang, C.-T., C.-F. Wu, J.-F. Li, and C.-W. Wu, “Built-in Redundancy Analysis for Memory
Yield Improvement,” IEEE Trans. Reliability, Vol. 52, No. 4, pp. 386-399, December 2003.
[Kore98] Koren, I. and Z. Koren, “Defect-Tolerant VLSI Circuits: Techniques and Yield Analysis,”
Proc. IEEE, Vol. 86, No. 9, pp. 1817-1836, September 1998.
[Kore07] Koren, I., and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (Chapter 8, pp.
249-283).
[Lu06] Lu, S. K., Y.-C. Tsai, C.-H. Hsu, A. Pao, K. Chiu, and E. Chen, “Efficient Built-in Redundancy
Analysis for Embedded Memories with 2-D Redundancy,” IEEE Trans. VLSI Systems, Vol. 14,
No. 1, pp. 34-42, January 2006.
[Parh19] Parhami, B., “Reliability Inversion: A Cautionary Tale,” IEEE Computer, Vol. 53, No. 6, pp.
28-33, June 2020.
[Parh20] Parhami, B., “Reliability and Modelability Advantages of Distributed Switching for
Reconfigurable 2D Processor Arrays,” Proc. 11th Annual IEEE Information Technology,
Electronics and Mobile Communication Conf., November 2020, to appear.
[Royc90] Roychowdhury, V. P., J. Bruck, and T. Kailath, “Efficient Algorithms for Reconfiguration in
VLSI/WSI Arrays,” IEEE Trans. Computers, Vol. 39, No. 4, pp. 480-489, April 1990.
[Sega00] Segal, J., L. Milor, and Y.-K. Peng, “Reducing Baseline Defect Density Through Modeling
Random Defect-Limited Yield,” IEEE Micro, January 2000. [Available on-line at:
http://micromagazine.fabtech.org/archive/00/01/segal.html]
[Siri04] Sirisantana, N., B. C. Paul, and K. Roy, “Enhancing Yield at the End of the Technology
Roadmap,” IEEE Design & Test of Computers, Vol. 21, No. 6, pp. 563-571, November-
December 2004.
[Stru05] Strukov, D. B. and K. K. Likharev, “Prospects for Terabit-Scale Nanoelectronic Memories,”
Nanotechnology, Vol. 16, No. 1, pp. 137-148, January 2005.
[Zhan95] Zhang, J. C. and M. A. Styblinski, Yield and Variability Optimization of Integrated Circuits,
Kluwer, 1995.



III Faults: Logical Deviations

“To find fault is easy; to do better may be difficult.”
Plutarch

“A fault that humbles a man is of greater value than a virtue that
puffs him up.”
Anonymous

[Sidebar diagram: the levels of the multilevel model (Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed).]

Chapters in This Part

9. Fault Testing
10. Fault Masking
11. Design for Testability
12. Replication and Voting

Faults, defined as circuit-level deviations from a system’s specified behavior, can
arise in two ways: from defective devices, when a segment of the circuit that
contains them is exercised, or directly from incorrect logic-signal values due to
external influences, intracircuit interactions, or design/implementation flaws. In
this part, we present key concepts pertaining to the complementary methods of
fault testing (detecting faults by forcing them to produce observable output errors)
and fault masking (using redundancy to ensure that faults do not produce errors).
Both fault testing and fault masking are facilitated by an abstract formulation of
the causes and manifestations of faults, known as a fault model. Prompted by the
complexities encountered in designing, validating, and applying fault detection
tests, we examine a number of strategies for improving the testability of circuits.
We conclude this part with a study of a class of logic-level redundancy
schemes based on circuit replication and voting, as a simple and widely used
example of fault masking techniques.


9 Fault Testing
“As long as there are tests, there will be prayer in schools.”
Anonymous

“To test, or not to test; that is the question: Whether ‘tis nobler for
the tester’s soul to suffer the barbs and snickers of outraged
designers, or to take arms against a sea of failures, and by
testing, end them? To try: to test; to test . . .”
B. Beizer, Software Testing Techniques

Topics in This Chapter


9.1. Overview and Fault Models
9.2. Path Sensitization and D-Algorithm
9.3. Boolean Difference Methods
9.4. The Complexity of Fault Testing
9.5. Testing of Units with Memory
9.6. Off-Line vs. Concurrent Testing

Fault detection by means of testing is used for the validation of engineering


prototypes, screening of manufactured devices, and corrective or preventive
maintenance in operational systems. The fault testing effort is a combination of
test generation, test validation, and test application. In this chapter, after
explaining the fundamental notions of test generation for combinational digital
circuits, we show that test generation is inherently difficult from a computational
standpoint. We then demonstrate how testing becomes even more complicated
when the circuit-under-test contains memory. Finally, we discuss how built-in
testing and self-test methods partially alleviate the aforementioned difficulties.


9.1 Overview and Fault Models

Fault testing is performed in three contexts. Engineering tests aim to ascertain whether a
new system (a prototype, say) is correctly designed. Manufacturing tests are performed to
establish the correctness of an implementation. Maintenance tests, performed in the field,
check for correct operation, either because some problem was encountered (corrective
maintenance) or else in anticipation of potential problems (preventive maintenance). As
shown in Fig. 9.1, fault testing entails the three steps of test generation, test validation,
and test application. We present a brief overview of these three steps in the rest of this
section, later on focusing exclusively on test generation.

Test generation entails the selection of input test patterns whose application to the circuit
would expose the fault set of interest via observing the circuit’s outputs. The set of test
patterns may be preset, meaning that the entire set is applied before reaching a decision,
or adaptive, where the selection of a test to apply depends on previous test outcomes,
with test application stopping as soon as enough information is available.

[Fig. 9.1 diagram content: FAULT TESTING (engineering, manufacturing, maintenance) comprises three activities.
TEST GENERATION (preset/adaptive): FUNCTIONAL (exhaustive/heuristic) or STRUCTURAL (analytic/heuristic), guided by a FAULT MODEL (switch- or gate-level; single/multiple stuck-at, bridging, etc.), FAULT COVERAGE, and DIAGNOSIS EXTENT (from none, i.e., checkout or go/no-go, to full resolution).
TEST VALIDATION: THEORETICAL, via ALGORITHM (D-algorithm, Boolean difference, etc.) or SIMULATION (parallel, deductive, concurrent; in software, or in hardware on a simulation engine), or EXPERIMENTAL, via FAULT INJECTION.
TEST APPLICATION: EXTERNALLY CONTROLLED (MANUAL or AUTOMATIC, using ATE) or INTERNALLY CONTROLLED (TEST MODE, i.e., off-line self-testing with BIST, or CONCURRENT, i.e., on-line testing with self-checked design).]

Fig. 9.1 A taxonomy of fault testing.


If we view the circuit or system under test as a black box, perhaps because we know
nothing about its internal design, we opt for functional testing. For example, functional
testing of an adder entails the application of various integers as inputs and checking that
the correct sum is produced in each case. A functional test is exhaustive if all possible
combinations of inputs are applied. Exhaustive testing is practical only for circuits with a
relatively small number of inputs (4-bit or 8-bit adder, but not 32-bit adder). Random
testing entails the selection of a random sample of input test patterns, which of course
provides no guarantee of catching all faults. In heuristic functional test generation, we
pick the tests to correspond to typical inputs as well as what we believe to be problematic
or “corner” cases. In the case of an adder, our selections may include both positive and
negative integers, small and large numbers, values that lead to overflow, inputs that
generate sums of extreme values, and, perhaps, inputs that generate carry chains of
varying lengths.
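To make the contrast concrete, the sketch below exercises a 4-bit adder model both exhaustively and with a hand-picked heuristic test list; the particular corner cases and the injected bug are illustrative choices, not taken from the text.

```python
# Reference behavior of a 4-bit adder: 5-bit result (sum with carry-out).
reference = lambda x, y: (x + y) & 0x1F

def failing_tests(tests, implementation):
    """Apply each test pattern and return those on which the
    implementation disagrees with the reference behavior."""
    return [(x, y) for x, y in tests if implementation(x, y) != reference(x, y)]

# Exhaustive functional test: all 2^8 input combinations (feasible for 4 bits).
exhaustive_tests = [(x, y) for x in range(16) for y in range(16)]

# Heuristic functional test: typical values plus suspected corner cases.
heuristic_tests = [
    (0, 0), (1, 1),        # small values
    (15, 15), (15, 1),     # extreme sums; overflow into the carry-out
    (5, 10), (10, 5),      # complementary bit patterns
    (7, 1), (3, 5),        # carry-propagation chains of varying lengths
]

# A buggy implementation that drops the carry out of bit position 3.
buggy = lambda x, y: (x + y) & 0x0F

assert failing_tests(exhaustive_tests, buggy)            # exhaustive testing exposes the bug
assert (15, 1) in failing_tests(heuristic_tests, buggy)  # as does a heuristic corner case
```

A heuristic list of 8 patterns here happens to catch this particular bug with 1/32 the effort of the exhaustive set, but, unlike the exhaustive set, it comes with no guarantee.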

In structural testing, knowledge of the circuit’s or system’s internal composition is used to make the testing more efficient; that is, we select inputs that are guaranteed to detect any fault from a fault set of interest (analytic) or that do so with high probability (heuristic).
Generating structural tests requires that we select a fault model (specifying the kinds of
faults that are expected and their effects on circuit behavior; more on this in Section 9.2),
assess fault coverage, contemplate the diagnosis extent (whether we would like to
pinpoint the location of a detected fault or to simply make a “go/no-go” decision), and
devise or select an algorithm for the task.

Once a set of tests has been generated, we need test validation to ensure that the chosen
tests will accomplish our goal of complete fault coverage or high-probability detection.
Some approaches to test validation are theoretical, meaning that they can provide
guaranteed results within the limitations of the chosen fault model. Experimental test
validation, which may provide partial assurance, is done via simulation (modeling the
circuit, with and without faults) or fault injection (purposely introducing faults of various
kinds in the circuit or system to see if they are detected).

Following validation, we enter the domain of test application, where we expose the
circuit or system under test to the chosen inputs and observe its outputs. When test
application is externally controlled, the circuit or system is placed in test mode and its
behavior is observed for an extended period of time. This kind of manual or automatic
test application is known as off-line testing. Increasingly, for modern digital systems, test


application is internally controlled via built-in provisions for testing and testability
enhancement. Internally controlled test application can be off-line, meaning that a special
test mode is entered during which the circuit or system ceases normal operation, or on-
line, where testing and normal operation are concurrent.

A circuit or system whose testing requires relatively low effort in test generation and
application scores high on testability. We will see in Chapter 11 that testability can be
quantified, but, for now, we are using testability in its qualitative sense. The set-up for
testing is depicted in Fig. 9.2. Testability requires that each point within the circuit under
test be controllable (through paths leading from primary inputs to that point) and
observable (by propagating the state of the desired point to one or more primary outputs).
Redundancy in circuits often curtails controllability and observability, thus having a
negative impact on testability.

Referring to Fig. 9.2, test patterns may be randomly generated, come from a preset list, or be selected according to previous test outcomes. Test results emerging at the circuit’s outputs may be used in raw form (implying high data volumes) or compressed into a signature before the comparison. The reference value can come from a “gold” or trusted version of the circuit that runs concurrently with it or from a precomputed table of expected outcomes. Finally, test application may be off-line or on-line (concurrent).

Complete, high-coverage testing is critical, because any delay in fault detection has significant financial consequences. A common rule of thumb is worth remembering: catching a fault at the component level costs a few dollars; tens of dollars if it goes undetected to the circuit-board level; hundreds of dollars if the faulty component makes it to the system level; and thousands of dollars if in-field corrective action is required.

Fig. 9.2 Test set-up, composed of test-pattern source, circuit under test, and comparator. The controllability and observability of a point within the circuit are highlighted by dashed lines.


Except perhaps in the case of exhaustive functional testing, test coverage may be well
below 100%, with reasons including the large input space, model inaccuracies, and
impossibility of dealing with all combinations of the modeled faults. So, testing, which
may be quite effective in the early stages of system design, when there may be many
residual bugs and faults, tends to be less convincing when bugs/faults are rare.
Paraphrasing Edsger W. Dijkstra, who made an oft-quoted statement in connection with
program bugs, we can say that testing can be used to show the presence of faults, but
never to show their absence. Also relevant is this observation by Steve C. McConnell:
“Trying to improve software quality by increasing the amount of testing is like trying to
lose weight by weighing yourself more often.”

We conclude this section by presenting an overview of fault models at different levels of abstraction. A fault model, a catalog of the types of deviations in logic values that one expects in the circuit under test, can be specified at various levels: transistor, gate, function, system. Gate-level fault models will be the focus of our discussion, so let us
briefly review the other three levels before proceeding. Transistor-level faults, which are
caused by defects, shorts/opens, electromigration, transients, and the like, may lead to
high current, incorrect output, or intermediate voltage, among others. They can be
modeled as stuck-on/off, bridging, delay, coupling, and crosstalk faults. Transistor-level
fault models quickly become intractable because of the large model space. Function-level
faults are selected in an ad hoc manner based on the defined function of a block (decoder,
ALU, memory). We will discuss system-level faults (malfunctions, in our terminology) in
Part V of this book.

At the gate or logic level, the most widely considered faults are the so-called “line stuck”
faults, where a circuit line/node assumes a constant logic value, independent of the
circuit’s inputs. We will focus on these stuck-at-0 (s-a-0) and stuck-at-1 (s-a-1) faults in
the rest of this chapter. For example, Fig. 9.3a shows the upper input of the rightmost
AND gate suffering from an s-a-0 fault. Line bridging faults result from unintended
connection between two lines/nodes, often leading to wired OR/AND of the respective
signals (Fig. 9.3a). Line open faults (bottom line in Fig. 9.3a) can sometimes be modeled
as an s-a-0 or s-a-1 fault. Delay faults (excessively long delays for some signals) are less
tractable than the previous fault types. Coupling and crosstalk faults are other examples
of fault types included in some models.


(a) Logic circuit and fault examples (b) Testing for a particular s-a-0 fault

Fig. 9.3 Gate- or logic-level faults and testing for them.


9.2 Path Sensitization and D-Algorithm

The main ideas behind test design are controlling the faulty point from primary inputs
and propagating its behavior to some primary output. Considering the s-a-0 fault shown
in Fig. 9.3b, a test must consist of input values that force that particular line to 1 and then
propagate the resulting value (1 if the line is healthy, 0 if it is s-a-0) to one of the two
primary outputs. The process of propagating the 1/0 value to a primary output is known
as path sensitization. In the example of Fig. 9.3b, the path from the s-a-0 line to output K
is sensitized by choosing the lower AND gate input to be 1 (a 0 input inhibits
propagation, because it makes the AND gate output 0, regardless of the value on the
upper input) and the lower OR gate input to be 0.

At this point, we have the following requirements to be satisfied through a suitable assignment of values to the primary inputs A, B, and C: the XOR gate output should be 1, the output of the lower AND gate should be 0, and C should be 1. It is easy to see that two test patterns satisfy these requirements: (A, B, C) = (0, 1, 1) or (1, 0, 1).
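The two tests can be confirmed by brute-force fault simulation. The sketch below assumes, consistent with the requirements just listed, that the relevant portion of the circuit computes K = (A ⊕ B)C ∨ AB, with the s-a-0 fault on the internal line carrying A ⊕ B:

```python
from itertools import product

def K(a, b, c, p_stuck_at_0=False):
    """Circuit output; optionally inject s-a-0 on the internal line P = A xor B."""
    p = 0 if p_stuck_at_0 else (a ^ b)
    return (p & c) | (a & b)

# A test pattern detects the fault iff healthy and faulty outputs differ.
tests = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
         if K(a, b, c) != K(a, b, c, p_stuck_at_0=True)]

assert tests == [(0, 1, 1), (1, 0, 1)]  # the two patterns found by path sensitization
```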

The path sensitization method just discussed is formalized in the D-algorithm [Roth66], which is based on the D-calculus. A 1/0 on the logic circuit diagram (Fig. 9.3b) is represented as D, and 0/1 is represented as D̄. Then, D and D̄ values are propagated to the outputs via forward tracing (path sensitization) and towards the inputs via backward tracing, eventually producing the required tests. In applying the D-algorithm, circuit lines must be considered and labeled separately, as depicted in Fig. 9.DAlg-a. This is required because in some cases, electrically connected lines (such as M, N, and P in Fig. 9.DAlg-a) may not be affected in the same way by a fault on one of them.

(a) Circuit with all lines labeled (b) Reconvergent fan-out example

Fig. 9.DAlg Example circuit for, and a problem with, the D algorithm.


The D-algorithm encounters difficulties with reconvergent fan-outs, but an extension of the algorithm, PODEM (path-oriented decision making) [Goel81], fixes the shortcomings. The reconvergent fan-out problem is illustrated in Fig. 9.DAlg-b. The letter D on line s indicates the intention to test it for s-a-0. Path sensitization dictates that we should force the other input of each of the two AND gates to 1, leading to D and D̄ on the OR gate inputs and 1 on the z output. Thus, the s-a-0 fault on line s cannot be detected. PODEM gets around this problem by setting the y input to 0 instead of 1, causing the output z to become D.

The worst-case time complexity of the D-algorithm is exponential in the circuit size, given that it must consider all path combinations. The presence of XOR gates in the circuit causes the behavior to approach the worst case. However, the average case is much better and tends to be quadratic in the circuit size. PODEM also has exponential time complexity, but in the number of circuit inputs, not the circuit size.

Once the set of possible tests for each fault of interest has been obtained, the rest of the
test generation process is as follows. We construct a table whose rows represent test
patterns (circuit input values) and whose columns correspond to faults. An “x” is placed
at a row-column intersection if the test associated with that row detects the specific fault
associated with the column; a hyphen is placed otherwise. A partial table of this kind is
shown in Table 9.1. Our task is completed upon choosing a minimal set of rows that
cover all faults. This covering problem can be solved quite efficiently in a manner similar
to choosing prime implicants for a minimal sum-of-products representation of a logic
function. For example, in the case of Table 9.1, if only the 4 faults shown are of interest,
then we have two minimal test sets: {(0, 0, 1), (0, 1, 1)} and {(0, 0, 1), (1, 0, 1)}.
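Selecting a minimal covering set of rows can be automated. A brute-force sketch over the entries of Table 9.1 (practical tools use covering heuristics akin to prime-implicant selection):

```python
from itertools import combinations

# Rows of Table 9.1: test pattern -> set of faults it detects.
detects = {
    (0, 0, 0): {"Q s-a-1"},
    (0, 0, 1): {"P s-a-1", "Q s-a-1"},
    (0, 1, 1): {"P s-a-0", "Q s-a-0"},
    (1, 0, 1): {"P s-a-0", "Q s-a-0"},
}
all_faults = set().union(*detects.values())

def minimal_covers(detects, faults):
    """Smallest sets of test patterns that together detect every fault."""
    for size in range(1, len(detects) + 1):
        covers = [set(combo) for combo in combinations(detects, size)
                  if set().union(*(detects[t] for t in combo)) >= faults]
        if covers:
            return covers
    return []

covers = minimal_covers(detects, all_faults)
assert covers == [{(0, 0, 1), (0, 1, 1)}, {(0, 0, 1), (1, 0, 1)}]
```

The two covers of size 2 found by the search match the two minimal test sets stated above.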

Table 9.1 The covering problem in test generation.

Test     P s-a-0   P s-a-1   Q s-a-0   Q s-a-1
A B C
0 0 0       -         -         -         x
0 0 1       -         x         -         x
0 1 1       x         -         x         -
1 0 1       x         -         x         -


It is easy to see that any test that detects P s-a-0 in Fig. 9.DAlg-a also detects L s-a-0 and
Q s-a-0. Such faults are said to be equivalent. In a similar manner, the faults Q s-a-1, R s-
a-1, and K s-a-1 are equivalent. Identifying equivalent faults before test generation leads
to savings in time and effort, because only one fault from each equivalence class needs to
be considered.


9.3 Boolean Difference Methods

Consider a Boolean function of several variables, such as the output K expressed as a function of A, B, and C in Fig. 9.DAlg-a:

K = f(A, B, C) = AB ∨ BC ∨ CA (9.3.Cout1)

The Boolean difference of K with respect to input variable B is defined as:

dK/dB = f(A, 0, C) ⊕ f(A, 1, C) = CA ⊕ (A ∨ C) = A ⊕ C (9.3.BD1)

Intuitively, dK/dB = 1 (satisfied when A ≠ C) tells us that the value of K changes when B changes from 0 to 1 or that K is sensitive to a change in the value of B. Conversely, dK/dB = 0 (satisfied for A = C) indicates that the value of K is insensitive to a change in the value of B.

Consider in Fig. 9.DAlg-a the line P being s-a-0. A stuck line behaves as an independent variable rather than as a dependent one. So, considering P as an independent variable, the Boolean equation for the output K can be obtained as:

K = PC ∨ AB (9.3.Cout2)

The Boolean difference of K with respect to P is:

dK/dP = AB ⊕ (C ∨ AB) = C(AB)′ (9.3.BD2)

Tests that detect P s-a-0 are solutions to the equation P dK/dP = 1.

P dK/dP = (A ⊕ B)C(AB)′ = 1 ⇒ C = 1, A ≠ B (9.3.Psa0t)

Similarly, tests that detect P s-a-1 are solutions to the equation P′ dK/dP = 1.

P′ dK/dP = (A ⊕ B)′C(AB)′ = 1 ⇒ C = 1, A = B = 0 (9.3.Psa1t)
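These derivations can be double-checked numerically. The sketch below computes Boolean differences by exhaustive enumeration instead of algebraic manipulation:

```python
from itertools import product

f = lambda a, b, c: (a & b) | (b & c) | (c & a)        # K = AB + BC + CA

# dK/dB = f(A,0,C) xor f(A,1,C); verified pointwise to equal A xor C.
for a, c in product((0, 1), repeat=2):
    assert (f(a, 0, c) ^ f(a, 1, c)) == (a ^ c)

# With P treated as an independent variable, K = PC + AB.
k = lambda p, a, b, c: (p & c) | (a & b)
dK_dP = lambda a, b, c: k(0, a, b, c) ^ k(1, a, b, c)  # equals C when AB = 0

# Tests for P s-a-0 solve P * dK/dP = 1; for P s-a-1, solve P' * dK/dP = 1,
# where P = A xor B as in the circuit of Fig. 9.DAlg-a.
sa0 = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
       if (a ^ b) & dK_dP(a, b, c)]
sa1 = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
       if (1 - (a ^ b)) & dK_dP(a, b, c)]

assert sa0 == [(0, 1, 1), (1, 0, 1)]   # C = 1 and A != B
assert sa1 == [(0, 0, 1)]              # C = 1 and A = B = 0
```

The s-a-0 tests found this way agree with the two patterns obtained by path sensitization in Section 9.2.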


9.4 The Complexity of Fault Testing

From our discussion of the Boolean difference method in Section 9.3, and in particular
from equations 9.3.Psa0t and 9.3.Psa1t, we see that the problem of generating tests for a
particular fault of interest can be converted to solving an instance of the satisfiability
(SAT) problem. The SAT problem is defined, in its decision form, as answering the
question: Is a particular Boolean expression satisfiable, that is, can it assume the value 1
for some assignment of values to its variables? According to a well-known theorem of
complexity theory [Cook71], SAT is NP-complete. That is, no efficient, subexponential-
time algorithm is known that would solve an arbitrary instance of SAT. In fact, even
highly restricted versions of SAT remain NP-complete.

Conversion of the fault detection problem to SAT leads to the suspicion that perhaps fault
detection is also NP-complete. To show this, we must prove that an arbitrary instance of
SAT or some other NP-complete problem can be easily transformed to a fault-detection
problem. Such a proof was first constructed by Ibarra and Sahni [Ibar75] and
subsequently simplified by Fujiwara [Fuji82]. The demonstration that fault detection is an
NP-complete problem makes it unlikely that the task can be performed efficiently by
means of a general algorithm any time in the near future. We are thus motivated to seek
solutions to the problem in special cases and to devise heuristic algorithms that produce
acceptable, near-complete solutions for circuit classes of practical interest.

In the rest of this section, we provide a proof that fault detection is an NP-complete problem. In fact, we will prove that a highly restricted form of the problem, namely, finding tests for stuck-at faults in certain 3-level circuits composed entirely of AND and OR gates (with all primary inputs being uncomplemented), is NP-complete, by transforming a known NP-complete problem to it. Circuits composed entirely of AND and OR gates and lacking any complemented inputs are known as monotone circuits, so named because if you increase the value of some of the inputs from 0 to 1, the output either does not change or it changes from 0 to 1 (it can never go from 1 to 0). Thus, our proof will show that fault detection in certain restricted classes of monotone logic circuits is NP-complete, making the general problem NP-complete as well.

We will use 3SAT, a problem whose NP-completeness is well-established [Cook71], as our starting point. In 3SAT, the logic expression to be satisfied is the product of clauses,


each of which is the logical OR of at most 3 true or complemented variables. For example, an instance of 3SAT might involve the Boolean expression:

E = (x1 ∨ x̄4 ∨ x5)(x2 ∨ x̄3 ∨ x7)(x̄2 ∨ x3 ∨ x6)(x1 ∨ x̄3 ∨ x6) (9.4.3SAT)

From the NP-completeness of 3SAT, we show that the lesser-known clause-monotone SAT (CM-SAT) is also NP-complete. The CM-SAT problem is as follows. Given n Boolean variables, we want to determine whether a Boolean expression of the following form is satisfiable, that is, whether it can assume the value 1 for some combination of input values.

E = Π (ai ∨ bi ∨ ci ∨ …) (9.4.CM1)

Each of the ANDed (multiplied) clauses on the right-hand side of equation 9.4.CM1 is the logical OR of n or fewer terms, all being either true variables or complemented variables, but not both in the same clause. As an example, an instance of CM-SAT with 7 variables might be the following expression, which contains two clauses with only complemented variables and two clauses with only uncomplemented variables.

E = (x̄1 ∨ x̄4 ∨ x̄5 ∨ x̄6 ∨ x̄7)(x̄2 ∨ x̄3 ∨ x̄7)(x2 ∨ x3 ∨ x4 ∨ x6)(x1 ∨ x3 ∨ x6) (9.4.CM2)

It is easy to convert an arbitrary instance of 3SAT to an instance of CM-SAT: all we need is to observe that if a clause in 3SAT has both complemented and uncomplemented variables, we can replace it with the product of two clauses to obtain a CM-SAT problem that is satisfiable iff the original 3SAT problem instance is satisfiable. For example, consider a clause involving the complemented variable x̄k and two uncomplemented variables xi and xj. Then:

(xi ∨ xj ∨ x̄k) is replaced with (xi ∨ xj ∨ vk)(v̄k ∨ x̄k) (9.4.CM3)

The variable vk is a new variable; thus, the CM-SAT problem obtained may have up to twice as many variables as the original problem. Similarly, when a clause contains a single uncomplemented variable and two complemented ones:

(x̄i ∨ x̄j ∨ xk) is replaced with (x̄i ∨ x̄j ∨ v̄k)(vk ∨ xk) (9.4.CM4)

It is easy to see that the original 3SAT problem is satisfiable iff the derived CM-SAT
problem is satisfiable.


Corresponding to a CM-SAT instance such as the one in equation 9.4.CM2, we construct a 3-level AND-OR-AND logic circuit as follows.

At the first level, we place a number of AND gates, one for each clause with complemented variables. Each AND gate receives the variables that appear in its clause, but in true form, and produces the complement of the clause as its output. For example, the AND gate associated with the clause x̄1 ∨ x̄4 ∨ x̄5 ∨ x̄6 ∨ x̄7 in equation 9.4.CM2 will have the output x1x4x5x6x7 = (x̄1 ∨ x̄4 ∨ x̄5 ∨ x̄6 ∨ x̄7)′. At the second level, there are OR gates, one receiving the outputs of the first-level AND gates as its inputs and the others forming the remaining clauses with only uncomplemented variables. Finally, the outputs of all the OR gates in level 2 are fed to a single AND gate in level 3. As an example, the 3-level circuit constructed as above for the CM-SAT instance of equation 9.4.CM2 is depicted in Fig. 9.SAT.

In the circuit of Fig. 9.SAT, an s-a-1 fault on line y1 is detectable iff there exists an input pattern that sets all outputs of the level-1 AND gates to 0 and all outputs of the level-2 OR gates, other than the top one, to 1. It is easy to see that such a test will then satisfy the instance of the CM-SAT problem from which the circuit was derived. Given that the conversion from CM-SAT to fault detection takes time polynomial in the number n of variables, the fault detection problem must be NP-complete. We have thus proved:

Theorem 9.fdnpc: The problem of detecting stuck-at faults in three-level monotone AND-OR-AND circuits is NP-complete.
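The reduction is easy to exercise on a toy instance. In the sketch below, the CM-SAT instance E = (x̄1 ∨ x̄2)(x̄2 ∨ x̄3)(x1 ∨ x3) is an arbitrary illustrative choice; the patterns detecting y1 s-a-1 in the constructed circuit turn out to be exactly the satisfying assignments of E:

```python
from itertools import product

# Clauses given as tuples of 0-based variable indices.
neg_clauses = [(0, 1), (1, 2)]   # complemented-only clauses -> Level-1 AND gates
pos_clauses = [(0, 2)]           # uncomplemented-only clauses -> Level-2 OR gates
n = 3

def circuit(x, y1_stuck_at_1=False):
    # Level 1: each AND gate outputs the complement of its clause (De Morgan).
    y = [int(all(x[i] for i in c)) for c in neg_clauses]
    if y1_stuck_at_1:
        y[0] = 1
    # Level 2: top OR merges the Level-1 outputs; others form the positive clauses.
    ors = [int(any(y))] + [int(any(x[i] for i in c)) for c in pos_clauses]
    # Level 3: single AND gate.
    return int(all(ors))

def satisfies(x):
    neg_ok = all(any(1 - x[i] for i in c) for c in neg_clauses)
    pos_ok = all(any(x[i] for i in c) for c in pos_clauses)
    return neg_ok and pos_ok

detecting = [x for x in product((0, 1), repeat=n)
             if circuit(x) != circuit(x, y1_stuck_at_1=True)]

assert detecting == [x for x in product((0, 1), repeat=n) if satisfies(x)]
```

The equality holds by construction for any CM-SAT instance, which is the heart of the reduction: a test generator for this circuit class would double as a CM-SAT solver.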

[Fig. 9.SAT diagram: inputs x1 through x7 feed level-1 AND gates with outputs y1 and y2; level-2 OR gates produce y3, y4, and y5; a level-3 AND gate produces the output z.]

Fig. 9.SAT Example circuit formed by converting the CM-SAT instance of equation 9.4.CM2 to a fault detection problem.


9.5 Testing of Units with Memory

Even leaving the high complexity of test generation aside, it is still the case that exponentially many (up to 2ⁿ) test patterns may be required for an n-input combinational circuit. This may lead to a significant amount of time spent in applying and analyzing tests. The presence of memory in the circuit expands the number of required test cases, given that the circuit behavior is influenced by its state. To test a sequential machine, we may need to apply different input sequences for each possible initial state. This double-exponential complexity may render testing with 100% coverage impractical.

Memory devices are special sequential circuits for which a wide variety of testing
strategies have been devised. A simple-minded approach would be to write the all-0s and
all-1s bit-patterns into every memory word and read out to verify proper storage and
retrieval. This seems to ensure that every memory cell is capable of storing and correctly
reading out the bit values 0 and 1. The problems with this approach include the fact that it
does not test the memory access/decoding mechanism (there is no way to know that the
intended word was written into and retrieved) and does not allow for pattern-sensitive
faults, where cell operation is affected by the values stored in nearby cells. Furthermore,
modern high-density memories experience dynamic faults that are exposed only for
specific access sequences.

The optimal way of testing memory units changes with each technology shift and
evolutionary step in their development. Given the trends in increasing size and
sensitivity, both arising from higher densities, built-in self-test appears to be the only
viable approach in the long run. A particular challenge in applying built-in self-test
methods is that any such method consumes some memory bandwidth, thus requiring
some sacrifice in performance. Memory testing continues to be an active research area.
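As an illustration of a test strategy that exercises more than the all-0s/all-1s patterns, here is a minimal sketch of a March-style memory test (patterned after the March C- element sequence, applied to a toy memory model; production memory tests operate at the cell and technology level):

```python
class FaultyMemory:
    """Toy bit-oriented memory with an optional stuck-at cell."""
    def __init__(self, size, stuck_addr=None, stuck_val=0):
        self.cells = [0] * size
        self.stuck_addr, self.stuck_val = stuck_addr, stuck_val
    def write(self, addr, val):
        self.cells[addr] = self.stuck_val if addr == self.stuck_addr else val
    def read(self, addr):
        return self.cells[addr]

def march_c_minus(mem, size):
    """Return True iff the memory passes. March C- element sequence:
    up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); down(r0)."""
    up, down = range(size), range(size - 1, -1, -1)
    for a in up:
        mem.write(a, 0)
    for order, (expect, write) in [(up, (0, 1)), (up, (1, 0)),
                                   (down, (0, 1)), (down, (1, 0))]:
        for a in order:
            if mem.read(a) != expect:
                return False
            mem.write(a, write)
    return all(mem.read(a) == 0 for a in down)

assert march_c_minus(FaultyMemory(8), 8)                                  # healthy memory passes
assert not march_c_minus(FaultyMemory(8, stuck_addr=3, stuck_val=1), 8)   # stuck cell is caught
```

Because every cell is read and written in both address orders with both values, such a test also catches many address-decoder and coupling faults that the naive all-0s/all-1s approach misses.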


9.6 Off-Line vs. Concurrent Testing

This section not yet written.


Problems

9.1 Testing of binary adders


A half-adder (HA) consists of an XOR gate and an AND gate producing the sum and carry-out bits,
respectively, when adding two input bits x and y. A full-adder (FA), with an additional carry-in input, can
be built from two HAs and an OR gate.
a. Discuss functional and structural testing of a single-bit FA built as above.
b. Repeat part a for a 4-bit ripple-carry adder built of four FAs.
c. How would you enhance the testability of the 4-bit ripple-carry adder if you were allowed to use
only one extra pin?

9.2 Path sensitization and D-algorithm


Consider circuits A, B, and C. Answer the following questions using path sensitization and D-algorithm.

[Figure: three example circuits. Circuit A has inputs x1, x2, x3, x4; Circuit B has inputs x1, x2, x1, x2; Circuit C has inputs x1, x2, x2, x3.]

a. Identify s-a-0 and s-a-1 tests for the input x3 in circuit A.


b. Identify s-a-0 and s-a-1 tests for the input x2 in circuit B.
c. Identify s-a-0 and s-a-1 tests for the input x2 in circuit C.

9.3 Boolean difference techniques


Redo Problem 9.2 using Boolean difference techniques.

9.4 Testing of XOR circuits


Show that a tree of 2-input XOR gates, implementing the logic function x1 ⊕ x2 ⊕ … ⊕ xn, can be tested for all single s-a-0 and s-a-1 faults with only three input test patterns, regardless of the number n of inputs. Hint: A single test detects all single s-a-1 faults.

9.5 Testing for stuck-at faults


Consider the following 8-input, 8-output circuit.
a. Derive a set of tests for detecting all stuck-at faults.
b. Describe the function performed by the circuit and relate your response to the tests in part a.


9.6 Testing with false positives


For digital circuits, tests may be incomplete, in the sense of missing some of the possible faults, but we
usually don’t have to worry about a test incorrectly identifying a healthy circuit as faulty, an outcome
known as “false positive.” In other areas, such as tests for diseases, false positives are common and must be
taken into account in assessing the effectiveness of testing. Consider a diagnostic test for a particular
disease that gives the result “positive” with 99% probability if the person tested has the disease and with
2% probability (false positive) if not. Assume that 1% of the residents of a city have that particular disease.
A randomly chosen person from the city is administered the test.
a. What is the probability that the test result is positive?
b. If the test result is positive, what is the probability that the person tested has the disease?
c. If the test result is negative, what is the probability that the person tested has the disease?

9.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Alar06] Al-Ars, Z., S. Hamdioui, and A. J. van de Goor, “Space of DRAM Fault Models and Corresponding Testing,” Proc. Design Automation and Test in Europe, 2006, pp. 1252-1257.
[Cook71] Cook, S., “The Complexity of Theorem Proving Procedures,” Proc. 3rd Annual Symp. Theory of Computing, 1971, pp. 151-158.
[Fuji82] Fujiwara, H. and S. Toida, “The Complexity of Fault Detection Problems for Combinational
Logic Circuits,” IEEE Trans. Computers, Vol. 31, pp. 555-559, 1982.
[Fuji85] Fujiwara, H., Logic Testing and Design for Testability, MIT Press, 1985.
[Goel81] Goel, P., “An Implicit Enumeration Algorithm to Generate Tests for Combinational Logic
Circuits,” IEEE Trans. Computers, Vol. 30, No. 3, pp. 215-222, March 1981.
[Hamd03] Hamdioui, S. and G. Gaydadjiev, “Future Challenges in Memory Testing,” Proc. PRORISC,
2003, pp. 78-83.
[Ibar75] Ibarra, O. H. and S. K. Sahni, “Polynomially Complete Fault Detection Problems,” IEEE
Trans. Computers, Vol. 24, No. 3, pp. 242-249, March 1975.
[Jha03] Jha, N. and S. Gupta, Testing of Digital Circuits, Cambridge Univ. Press, 2003.
[Krst98] Krstic, A. and K.-T. Cheng, Delay Fault Testing for VLSI Circuits, Kluwer, 1998.
[Lala97] Lala, P. K., Digital Circuit Testing and Testability, Academic Press, 1997.
[Roth66] Roth, J. P., “Diagnosis of Automata Failures: A Calculus and a Method,” IBM J. Research and
Development, Vol. 10, No. 4, pp. 278-291, July 1966.
[Wang06] Wang, L.-T., C.-W. Wu, and X. Wen (eds.), VLSI Test Principles and Architectures: Design
for Testability, Elsevier, 2006.


10 Fault Masking
“Don’t find fault with what you don’t understand.”
French proverb

“Life is [like a piano]. The discord is there, and the harmony is there. Study to play it correctly, and it will give forth the beauty; play it falsely, and it will give forth the ugliness. Life is not at fault.”
Anonymous

Topics in This Chapter


10.1. Fault Avoidance vs. Masking
10.2. Interwoven Redundant Logic
10.3. Static Redundancy with Replication
10.4. Dynamic and Hybrid Redundancy
10.5. Time Redundancy
10.6. Variations and Complications

Fault masking can also be called fault tolerance, but we avoid using the latter term
because of its past association with the entire field of dependable computing.
There are two ways to mask faults. One way is to build upon the inherent
redundancy in logic circuits. This is akin to the redundancy observed in natural
language: you may be able to understand a sentence in this book even after
removing every vowel, or covering the lower half of every letter. Another way is
by using replicated circuits whose outputs feed a fusion or combining circuit,
often (inappropriately) called a voter. This approach leads to static, dynamic, and
hybrid redundancy methods. Masking of transient faults is also discussed.


10.1 Fault Avoidance vs. Masking

Referring to Fig. 10.1, we note that there is a pleasant symmetry between avoidance and
masking strategies for dealing with faults (as explained in the introductory paragraph of
this chapter, we prefer to use fault masking in lieu of fault tolerance). Certain faults may
be unavoidable, while others may turn out to be unmaskable. Thus, practical strategies for
dealing with faults are often hybrid in that they encompass some avoidance and some
masking techniques.

In the 9-leaf tree of Fig. 10.1, this chapter’s focus will be on the rightmost two leaves
labeled “restored” and “unaffected.” Masking through restoration requires that a fault be
exposed, promptly detected by a monitor, and fully circumvented via reconfiguration.
This approach requires the application of dynamic redundancy, where redundant features
are built into the system but are not brought into service until they are needed to replace
faulty elements. Masking through concealment is based on static redundancy methods,
whereby redundant elements are fully integrated into the circuit in a way that they cover
for imperfections in other elements.

Fig. 10.1 Fault avoidance versus masking: a symmetrical view.


10.2 Interwoven Redundant Logic

Interwoven redundant logic is based on the notions of critical and subcritical faults.
Referring to Fig. 10.quad-a, we see that with the given input values, if line a is s-a-0, the
circuit output will not change. We thus consider a s-a-0 a subcritical fault. The same
holds true for h s-a-0, c s-a-1, and d s-a-1. On the other hand, line b s-a-1 will change the circuit output from 0 to 1, making it a critical fault for the input pattern shown. Because not all faults are of the stuck-at type, henceforth we characterize each fault as a 0 → 1 flip or a 1 → 0 flip.

We see from the preceding discussion that even a nonredundant logic circuit is capable of
masking some faults. This is both good and bad. It is good in the sense that not every
fault will affect the correct functioning of the circuit. It is bad in the sense of impacting
the timeliness of fault detection. Generally speaking, an AND gate with many inputs is more sensitive to 1 → 0 flips, because they have the potential of causing a 1 → 0 flip at the gate’s output. A multi-input OR gate, on the other hand, is more sensitive to 0 → 1 flips. We are thus motivated to seek methods involving alternating use of AND and OR gates for turning this inherent masking capability of logic circuits into a general scheme
that also masks critical faults.

(a) Critical/subcritical faults

(b) Original logic circuit (c) Part of the quadded version

Fig. 10.quad The notion of critical/subcritical faults and connectivity pattern in interwoven 4-way-redundant logic.


Consider, for example, the logic circuit of Fig. 10.quad-b with potentially critical 1 → 0
flips on either input of the shaded AND gate at the top left. Suppose we quadruplicate
this AND gate and the OR gate which it feeds, as well as all inputs and other signals, as
depicted in Fig. 10.quad-c. The four copies of any signal x are named x1, x2, x3, and x4,
with the value of x taken to be the 3-out-of-4 majority value among the four signal
copies. The connectivity pattern of inputs and other replicated signals is such that any
critical flip at the AND layer turns into a subcritical flip at the following OR layer. For
example, the critical 1 → 0 flip for a1 causes the subcritical 1 → 0 flip at the top input e1
to the OR gate it feeds at the next circuit layer.

One can show that to mask h critical faults with this alternating, interwoven AND-OR
arrangement, the number of gates must be multiplied by (h + 1)² and the number of inputs
for each gate must increase by the factor h + 1. For h = 1, this interwoven redundant logic
scheme is known as quadded logic. Note that the alternating AND and OR layers can be
replaced by uniform NAND layers in the usual way.

A similar masking effect is observed in the crummy-relays scheme of Moore and
Shannon [Moor56]. We consider relays of two kinds (Fig. 10.relay-a): a make contact is
normally open and closes when energized, while a break contact is normally closed and
opens when energized. Given two make contacts, their series connection computes the
AND function, as shown in Fig. 10.relay-b. We define the two probabilities a and c for
each relay as follows:

a = prob[contact made | relay is energized]
1 – a = prob[contact open | relay is energized]
c = prob[contact made | relay is not energized]
1 – c = prob[contact open | relay is not energized]

For make contacts, we have a > c, while for break contacts a < c holds. No matter how
crummy the relays, that is, how close the values of a and c to each other, one can
interconnect many of them in a redundant structure, using series and parallel elements, to
achieve an arbitrarily high reliability.


Example 10.M&S: Reliable circuits from crummy relays Consider the parallel-series quad of
make relays depicted in Fig. 10.relay-c. Derive the reliability parameters for the quad and
determine under what conditions it behaves better than a single make relay.

Solution: The a parameter of the quad circuit can be derived to be aquad = prob[connection made |
relays energized] = 2a² – a⁴. Thus, aquad > a whenever a > (√5 – 1)/2 ≈ 0.62. Similarly, the quad’s
c parameter can be shown to be cquad = prob[connection made | relays not energized] = 2c² – c⁴,
so cquad < c whenever c < 0.62, a condition that any usable relay satisfies, since c is the
probability of an unenergized contact being made spuriously. Thus, unless the a parameter has an
unreasonably low value, the quad circuit offers better reliability than a single make relay.
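The quad’s parameters are easily checked numerically; the following sketch (with an illustrative c value of 0.05) confirms that the crossover for the a parameter sits near (√5 – 1)/2 ≈ 0.62:

```python
def quad_params(a, c):
    # Parallel-series quad of Fig. 10.relay-c: two parallel branches,
    # each consisting of two make contacts in series.
    a_quad = 2 * a**2 - a**4   # prob[connection made | relays energized]
    c_quad = 2 * c**2 - c**4   # prob[connection made | relays not energized]
    return a_quad, c_quad

threshold = (5**0.5 - 1) / 2   # ~0.618: the quad beats one relay beyond this
for a in (0.55, 0.62, 0.90):
    a_quad, c_quad = quad_params(a, 0.05)
    print(a, round(a_quad, 4), a_quad > a)
```

With a = 0.55 the quad is worse than a single relay; with a = 0.62 it is marginally better, and with a = 0.90 it is markedly better, while the small c parameter always improves.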

(a) Relay contact types (b) AND circuit (c) Parallel-series quad

Fig. 10.relay Relays and their use in building an AND circuit and a more
reliable make circuit.

Interwoven logic has been proposed as a way of coping with the high unreliability of
nanoelectronic circuits and the massive numbers of defects that are inevitable in the mass
production of extremely small components. Figure 10.nano-a depicts a half-adder circuit,
with AND-OR and NAND-only implementations. The interwoven logic version of the
half-adder is depicted in Fig. 10.nano-b, where each of the three 2-input NAND gates is
replaced by four 4-input NAND gates and additional NAND gates act as inverters for the
signal bundles representing the a and b inputs.


(a) Nonredundant half-adders (b) Interwoven logic version

Fig. 10.nano Interwoven logic for nanoelectronics [Han05].


10.3 Static Redundancy with Replication

A conceptually simple static redundancy scheme is triple-modular redundancy (TMR),
which, as shown in Fig. 10.tmr-a, consists of triplicating the computational module and
voting on the results the three copies produce. The voter masks any fault that is limited to
only one of the modules. Two faults can potentially lead to an incorrect output from the
voter, so denoting the module reliability by Rm, a worst-case reliability analysis leads to:

RTMR = 3Rm² – 2Rm³ = Rm[1 + (1 – Rm)(2Rm – 1)] (10.3.RTMR)

The variation of RTMR as a function of Rm is shown in Fig. 10.tmr-b. The second
expression for RTMR in equation (10.3.RTMR) suggests that for the TMR scheme to lead
to reliability improvement over a nonredundant module, that is, for RTMR > Rm, we must
have (1 – Rm)(2Rm – 1) > 0 or Rm > 1/2. The reliability improvement factor achieved by
the TMR scheme is:

RIFTMR/Simplex = (1 – Rm) / (1 – RTMR) = 1 / [(1 – Rm)(1 + 2Rm)] (10.3.RIF)

Assuming Rm = e^(–λt), the reliability curves for a simplex module and the TMR system are
shown as a function of λt in Fig. 10.tmr-c. We know that the MTTF for the simplex
system is 1/λ. The MTTF for the TMR system can be shown to be:

MTTFTMR = 5/(6λ) (10.3.MTTF)

The shorter MTTF of a TMR system is due to the fact that RTMR falls below Rm and stays
there for λt > ln 2, thus leading to a smaller area under the reliability curve.

(a) Simple TMR scheme (b) Reliability vs. Rm (c) Reliability over time

Fig. 10.tmr Triple-modular redundancy (TMR): structure and reliability.


(a) (b)

Fig. 10.seuFF Two types of triplicated flip-flops designed to withstand
single event upsets (SEUs).

One recent application of TMR is in the design of flip-flops that can withstand single
event upsets (SEUs). As shown in Fig. 10.seuFF, triplicating the FF itself, or both the FF
and its correction circuitry, allows single faults to be tolerated.

Generalizing triple-modular redundancy to N units yields N-modular redundancy (NMR).
The parameter N is often taken to be an odd number, allowing the use of an (N + 1)/2-
out-of-N voter in 5MR, 7MR, and so on. One must note, however, that voter complexity
rises rapidly with increasing N. It is also possible to use an even value for N. For
example, we may use 3-out-of-4 voting in 4MR and also design the voter such that it
detects the presence of double faults, leading to greater reliability and safety compared
with TMR.
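Under the usual assumption of independent module failures and a perfect voter, the reliability of majority-voted NMR is a binomial sum; the sketch below shows the diminishing returns of larger odd N when Rm = 0.9:

```python
from math import comb

def r_nmr(n, rm):
    """Reliability of NMR with a perfect (n+1)/2-out-of-n majority voter:
    the system survives when at least a majority of modules survive."""
    k = (n + 1) // 2
    return sum(comb(n, j) * rm**j * (1 - rm)**(n - j) for j in range(k, n + 1))

for n in (1, 3, 5, 7):
    print(n, round(r_nmr(n, 0.9), 6))
# 1 -> 0.9, 3 -> 0.972, 5 -> 0.99144, 7 -> 0.997272
```

The n = 3 case reproduces the TMR formula 3Rm² – 2Rm³; each added pair of modules buys less additional reliability, while voter complexity keeps growing.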


10.4 Dynamic and Hybrid Redundancy

Dynamic redundancy, as the name implies, requires some sort of action to bypass a faulty
element, once a fault has been detected. The simplest form of dynamic redundancy
entails a fault detector tacked on to an operating unit and a mechanism that allows
switching to an alternate (standby, spare) unit (Fig. 10.dynr-a). In contrast, static or
masking redundancy (Fig. 10.tmr-a) simply hides the effects of a fault and continues
operation uninterrupted, provided that the scheme’s masking capacity is not exceeded.

The fault detector in Fig. 10.dynr-a may be of various kinds: a code checker, a watchdog
timer, or a comparator, in the event that the operational unit is itself duplicated. The latter
scheme is sometimes referred to as “pair and spare.”
[Elaborate further on fault detection.]

The standby or spare unit may be “cold,” meaning that it is powered off until needed.
When the spare is to be switched in, it must be powered on, initialized, and set to an
appropriate state to be able to continue the work of the operational unit from where it was
interrupted (preferable) or from a restart position. Alternatively, we may use a “hot”
spare which runs in parallel with the operational unit and is thus in sync with it. A hot
spare can be switched in with minimal delay, thus helping to improve availability. We
may also opt for the intermediate case of a “warm” spare that is powered on but not
completely in sync with the operational unit.

(a) Detect and replace (b) Mask, diagnose, and switch

Fig. 10.dynr Dynamic (standby) and hybrid redundancy schemes.


We can combine the advantages of static redundancy (uninterrupted operation) with
those of dynamic redundancy (lower hardware and energy costs) into a hybrid
redundancy scheme, as depicted in Fig. 10.dynr-b. Initially, the reconfiguration switch S
is set so that the outputs of units 1, 2, and 3 are selected and sent to the voting circuit.
When a fault occurs in one of the 3 operational units, the voting circuit masks the fault,
but the voter output allows the switch S to determine which of the operational units
disagreed with the final outcome. The disagreeing unit is then replaced with the spare,
allowing the hybrid-redundant system to continue its operation and to tolerate a second
fault later on.

It is possible to ignore the first few instances of disagreement from one unit in the
expectation that they were due to transient faults. In this scheme, the switch maintains a
disagreement tally for each operational unit and replaces it only if the tally exceeds a
preset threshold. Another optimization is to switch to duplex or simplex operation when
the supply of spares has been exhausted and one of the operational units experiences a
fault. Continuing with duplex operation provides greater safety, whereas switching to
simplex mode extends the lifetime of the system.
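The disagreement-tally idea can be sketched as follows; the unit names, the threshold value, and the use of majority-of-3 voting are illustrative assumptions, not details taken from the figure:

```python
from collections import defaultdict
from statistics import mode

class HybridRedundant:
    """Toy model of hybrid redundancy (Fig. 10.dynr-b) with a
    disagreement tally: a unit is swapped out for a spare only after
    disagreeing with the vote more than `threshold` times."""
    def __init__(self, active, spares, threshold=2):
        self.active, self.spares = list(active), list(spares)
        self.tally = defaultdict(int)
        self.threshold = threshold

    def step(self, outputs):
        result = mode(outputs.values())      # majority value of the outputs
        for unit, out in outputs.items():
            if out != result:
                self.tally[unit] += 1        # count this disagreement
                if self.tally[unit] > self.threshold and self.spares:
                    self.active.remove(unit)             # bypass the suspect unit
                    self.active.append(self.spares.pop(0))
        return result

sys_ = HybridRedundant(["U1", "U2", "U3"], ["S1"], threshold=1)
sys_.step({"U1": 5, "U2": 5, "U3": 7})   # first disagreement: tolerated
sys_.step({"U1": 5, "U2": 5, "U3": 7})   # second: U3 replaced by S1
print(sys_.active)                        # ['U1', 'U2', 'S1']
```

A transient fault that corrupts one output once leaves the configuration untouched; only repeated disagreement, suggesting a permanent fault, triggers replacement.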

Figure 10.sw depicts high-level designs of switches for standby and hybrid redundancy.
[Elaborate on switch design, including the self-purging variant and the associated
threshold voting scheme.]

Spares Spares

(a) Standby (b) Hybrid

Fig. 10.sw Switches for standby and hybrid redundancy.


We now summarize our discussion of static versus dynamic (and hybrid) redundancy.
Static redundancy provides immediate masking and thus uninterrupted operation. It is
also high on safety. Its disadvantages include power and area penalties and the fact that
the voting circuit is critical to correct operation of the system. Dynamic redundancy
consumes less power (especially with cold standbys) and provides longer life by simply
adding more spares. Tolerance is not immediate, causing availability and safety concerns.
Also, the assumption of longer life with more spares is critically dependent on the
coverage factor. In the absence of near-perfect coverage, the addition of more spares may
not help and may even be detrimental to system reliability. Hybrid redundancy has some of the
advantages of both schemes, as well as some of their disadvantages. The switch-voter
part of a hybrid-redundant system is both complex and critical.

We will see in Chapter 15 that, via a scheme known as self-checking design, fault
detection coverage of standby redundancy can be improved, making the technique more
attractive from a practical standpoint.


10.5 Time Redundancy

Instead of replicating a circuit or module, one may reuse the same unit multiple times to
derive several results. This strategy, sometimes referred to as retry, is particularly
effective for dealing with transient faults. Clearly, if the unit has a permanent fault, then
the same erroneous result will be obtained during each retry.

One way around the problem presented by permanent faults is to change things around
during recomputations so as to make it less likely for the same fault to lead to identical
errors, thus reducing the likelihood of the fault going undetected. For example, if an
adder is being used to compute the sum a + b, we may switch the operands and compute
b + a during the second run, or we may complement the inputs and output, thus aiming to
obtain the same result via the computation –(–a – b). Similarly, in multiplying a by the
even number 2b, we may try computing (2a) × b the second time. For arbitrary integer
operands a and b, we can find the product as (2a) × ⌊b/2⌋, while initializing the cumulative
partial product to a if b0 = 1 and to 0 otherwise. Yet another alternative, assuming that a
and b are not very large, is to compute (2a) × (2b) and divide the result by 4 through
right-shifting by 2 bits.
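As a concrete sketch of recomputing with shifted operands (with ordinary Python multiplication standing in for the hardware multiplier), the two runs and the mismatch check look like this:

```python
def multiply_with_retry(a, b):
    """Compute a*b twice, the second time with both operands doubled and
    the product shifted right by 2 bits to undo the scaling. A permanent
    fault in a fixed multiplier cell is unlikely to corrupt both runs
    identically, so a mismatch flags an error."""
    first = a * b
    second = ((2 * a) * (2 * b)) >> 2   # recomputation with shifted operands
    return first if first == second else None

print(multiply_with_retry(37, 21))   # 777
```

In a real design the second run uses the same (possibly faulty) multiplier, so the shifted operands route the data through different cells of the array, which is what exposes a fault fixed in one position.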

The final example in the preceding paragraph is an instance of the method of
recomputing with shifted operands.


10.6 Variations and Complications

Both NMR and hybrid hardware redundancy have been used in practice. The Japanese
Shinkansen “Bullet” train employed a triple-duplex control system, implying 6-fold
redundancy. [Elaborate on the redundancy scheme.] Before they were permanently
retired in 2012, NASA’s Space Shuttles used computers with 5-way redundancy in
hardware and 2-way redundancy in software. The 5 hardware units originally consisted of
3 concurrently operating units and 2 spare units. Later, the configuration was changed to
4 + 1. Two independently developed software systems were used to protect against
software design faults.

One consequence of using static redundancy is reduced testability. The very act of
masking faults makes it impossible to detect them via the circuit’s inputs and outputs.
Consider, for example, the quadded logic scheme of Section 10.2. This scheme is
designed to mask a single stuck-at fault. So what happens if the redundant circuit
already contains a fault at the time of its manufacture? The fault cannot be detected by
simply testing the redundant circuit, and if a second fault develops during use (which is
really the first fault from the user’s viewpoint), it may not be masked. Similar difficulties
arise with regard to replication-based redundant systems. Thus, incorporating testability
features, of the types discussed in Chapter 11, is even more significant for systems
employing masking redundancy.

As an example, consider the TMR system of Fig. 10.tmrt-a. If any of the three units
becomes faulty after system assembly but prior to its use, then the first fault occurring
during the system’s operation may lead to an incorrect output. This is because the
presence of the first fault is not detectable using only the circuit’s primary inputs and
outputs. This problem can be fixed by using the arrangement depicted in Fig. 10.tmrt-b.
A multiplexer is used to select one of four values as the output from the redundant circuit:
option 0 selects the value produced by the voting unit, while options 1-3 select the output
supplied by one of the units 1-3. The only detrimental effect of this arrangement is the
added multiplexer delay during normal operation of the redundant circuit. This is more
than offset by the facility for testing each of the replicas initially as well as periodically
during system operation.


These lines are controllable but not observable

(a) Testability problem (b) Testability improvement

Fig. 10.tmrt TMR configuration with provisions for testing.


Problems

10.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx

10.2 MTTF of a TMR system


Prove equation (10.3.MTTF), showing that the MTTF of a TMR system with perfect voting is 5/6 of the
MTTF of a simplex module.

10.3 Comparing fault-masking schemes


We are considering two possible implementations of a 4-way redundant system. The redundancy is 300%
in both cases, so ignoring the complexity of comparison, voting, and switching mechanisms, cost is not a
factor in our decision. Option 1 is the use of pair-and-spare scheme with comparison used for fault
detection and the active pair replaced with the spare pair upon a detected fault. Option 2 is triplication with
voting and one spare. For both cases, repair is performed to bring the faulty unit or pair back into service.
In answering the following questions, no calculation is necessary. However, answers based on quantitative
evidence will receive extra credit.
a. Which scheme is preferable with respect to reliability in a safety-critical setting?
b. Which scheme is preferable with respect to system availability?

10.4 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx

10.5 Modeling for fault masking


Consider the following fault-masking redundancy scheme with voting. There are five modules that feed a
3-out-of-5 voting unit. When a disagreement is detected among the modules, the disagreeing module is
purged (switched out), along with one of the good ones, and the voting unit is reconfigured to act as a 2-
out-of-3 unit. The next detected disagreement causes the disagreeing module and one of the good ones to
be purged, leaving a simplex system. As usual, we ignore the possibility of simultaneous multiple faults.
a. Construct and solve a state-space reliability model for this system, assuming perfect coverage,
switching, and voting. State all your assumptions clearly.
b. Outline a method for improving the reliability of this hardware redundancy scheme that does not
involve adding extra modules (only the switching and voting parts can change).

10.6 Comparison-based sorting networks


A comparator is a 2-input, 2-output logic circuit that receives the integers a and b as inputs and delivers c =
min(a, b) and d = max(a, b) as outputs. Thus, a comparator either sends the two inputs straight through or
exchanges them to achieve correctly ordered outputs.
a. Show the design of a 4-input sorting network using 5 comparators.
b. Assume that comparators can get “stuck on straight through” or “stuck on exchange,” thus not
providing the correct ordering in some cases. How can the network of part a be tested for such
stuck-type faults efficiently, assuming at most a single faulty comparator?
c. Is it possible to build a 4-input sorting network that masks the fault of any single comparator?
How, or why not?

10.7 Quadded logic with NAND gates


Consider the half-adder circuit shown below.

a. Convert the circuit so that only NAND gates are used. Feel free to change the outputs to 𝑠̅ and 𝑐̅ if
it makes your job easier.
b. Draw the quadded version of the circuit in part a, using only NAND gates.
c. Is it always possible to convert a quadded AND-OR circuit into a quadded NAND circuit by
replacing each AND and OR gate with a NAND gate?

10.8 Voting in NMR systems


a. Does 2-out-of-4 voting make sense in a 4MR system? Discuss.
b. Generalize your answer and discussion for part a to K-out-of-N voting in an NMR system, where
K isn’t (N + 1)/2. Consider both the cases of K ≤ N/2 and K > N/2 + 1.

10.9 Design of voters for NMR systems


a. Is it possible to design a 3-out-of-5 voter, using only 2-out-of-3 voters and no other component?
Fully justify your answer.
b. Generalize your answer to part a for the case of designing an m-out-of-n voter from 2-out-of-3
voters.
c. Design an efficient 3-out-of-5 voter using 2-out-of-3 voters and inverters, if helpful.
d. Repeat part c for a 5-out-of-9 voter.


References and Further Readings


[Han05] Han, J., J. Gao, P. Jonker, Q. Yan, and J. A. B. Fortes, “Toward Hardware-Redundant Fault-
Tolerant Logic for Nanoelectronics,” IEEE Design & Test of Computers, Vol. 22, No. 4, pp.
328-339, July-August 2005.
[Han15] Han, J., E. Leung, L. Liu, and F. Lombardi, “A Fault-Tolerant Technique Using Quadded
Logic and Quadded Transistors,” IEEE Trans. VLSI Systems, Vol. 23, No. 8, pp. 1562-1566,
August 2015.
[Hsu95] Hsu, Y.-M., E. E. Swartzlander Jr., and V. Piuri, “Recomputing by Operand Exchanging: A
Time Redundancy Approach for Fault-Tolerant Neural Networks,” Proc. IEEE Int’l Conf.
Application-Specific Array Processors, 1995, pp. 54-
[Jens63] Jensen, P. A., “Quadded NOR Logic,” IEEE Trans. Reliability, Vol. 12, pp. 22-31, September
1963.
[Lyon62] Lyons, R. E. and W. Vanderkulk, “The Use of Triple Modular Redundancy to Improve
Computer Reliability,” IBM J. Research and Development, Vol. 6, No. 2, pp. 200-209, April
1962.
[Moor56] Moore, E. F. and C. E. Shannon, “Reliable Circuits Using Less Reliable Relays,” J. Franklin
Institute, Vol. 262, pp. 191-208 and 281-297, September/October 1956.
[Mukh15] Mukherjee, A. and A. S. Dhar, “Fault Tolerant Architecture Design Using Quad-Gate-
Transistor Redundancy,” IET Circuits, Devices & Systems, Vol. 9, No. 3, pp. 152-160, May
2015.
[Pier65] Pierce, W. H., Failure-Tolerant Computer Design, Academic Press, 1965.
[Tryo62] Tryon, J. G., “Quadded Logic,” in Redundancy Techniques for Computing Systems, R. H.
Wilcox and W.C. Mann (eds.), Spartan, 1962, pp. 205-228.


11 Design for Testability


“The real fault is to have faults and not to amend them.”
Confucius

“When something goes wrong with a computer program or an


engineering structure, the scrutiny under which the ill-fated
object comes often uncovers a host of other innocuous bugs and
faults that might have gone forever unnoticed had the accident
not happened.”
Henry Petroski, To Engineer Is Human, 1985

Topics in This Chapter


11.1. The Importance of Testability
11.2. Testability Modeling
11.3. Testpoint Insertion
11.4. Sequential Scan Techniques
11.5. Boundary Scan Design
11.6. Built-in Self-Test

Highly complex digital systems cannot be tested exhaustively to validate their
functionalities. Adequate structural testing is also often impractical, given the
limited number of primary inputs and outputs that are externally controllable and
observable. It has thus become necessary to incorporate testability features in
modern digital circuits so as to simplify the testing process and to reduce its cost.
Collectively, these features provide better access to the normally hidden internal
nodes, thus making it possible to control and observe them without disassembling
the circuit under test. Recently, boundary scan design has become the de facto
standard for making an assembly of communicating modules testable.


11.1 The Importance of Testability

Whereas a simple circuit with a small set of inputs and outputs can be tested with a
reasonable amount of time and effort, a complex unit, such as a microprocessor, would be
impossible to test solely based on its input/output behavior. Hence, nearly all modern
digital circuits are designed and built with special provisions for improved testability.

Because the cost of a fault is greatly reduced when it is caught early, as discussed in
Section 9.1, testability features appear at all levels of the digital system packaging
hierarchy, from the component or chip level, through board and subassembly levels, all
the way to the system level. Our discussions in this chapter pertain mostly to circuit- and
board-level techniques, but similar considerations apply elsewhere. In particular,
diagnosability features form important considerations at the malfunction level (see
Chapter 17).

In order for a unit to be easily testable, we must be able to control and observe its internal
points. Thus, controllability and observability are the cornerstones of testability, as we
will see in detail in Section 11.2.


11.2 Testability Modeling

To allow detection of a fault on line Li of a logic circuit (see Fig. 11.1), we must be able
both to control that line from the circuit’s primary inputs and to observe its behavior via
the primary outputs. Thus, good testability requires good controllability and good
observability. It is thus natural to define the testability T(Li) of a line Li as the product of
its controllability C(Li) and its observability O(Li). We will discuss suitable quantification
of controllability and observability shortly. However, assuming that we have already
determined C and O values in the interval [0, 1] for each line Li, 1 ≤ i ≤ n, within the
circuit, the overall testability of the circuit can be taken to be the average of the
testabilities for all lines. Thus:

Tcircuit = (1/n) ∑i C(Li) × O(Li) (11.2.test)

Note that the testability metric defined by equation (11.2.test) is an empirical measure that
does not have a physical significance. In other words, if the testability of a circuit turns
out to be 0.23, it does not tell us much about how difficult it would be to test the circuit.
It is only useful for comparing different circuits, different designs for the same circuit, or
different points within a circuit with regard to ease of testing. So, a design with a
testability of 0.23 may be preferred to one that has a testability of 0.16, say.

To determine the line controllabilities within a given circuit, we begin from primary
inputs, each of which is assigned the perfect controllability of 1, and proceed toward the
outputs. The controllability of the single output line of a k-input gate is the product of the
gate’s controllability transfer factor (CTF) and the average of its input controllabilities:

Path to
control P

Path to
observe P

Fig. 11.1 Good testability with respect to an internal circuit fault on
line P requires good controllability over P from the inputs
and good observability of its behavior from the outputs.


(a) Gate output controllability (b) Controllability after fan-out

Fig. 11.2 Quantifying the controllability of a circuit line.

Coutput = CTF × (∑i Ci) / k (11.2.contr)

The CTF of a gate depends on how easy it is to set its output to 0 or 1 at will. Taking the
3-input AND gate of Fig. 11.2a as an example, 7 of its 8 input patterns set the output to 0
and only one pattern sets the output to 1. The relatively large difference between N(0) = 7
and N(1) = 1 indicates poor controllability at the gate’s output. In general, we use
equation (11.2.CTF) to derive a gate’s CTF:

CTF = 1 – |N(0) – N(1)| / (N(0) + N(1)) (11.2.CTF)

When an equal number of input patterns set the gate’s output to 0 or 1, we have N(0) =
N(1), leading to the perfect CTF of 1. An XOR gate has the N(0) = N(1) property and is
thus a desirable circuit component in terms of testability. In the case of a line that fans out
into f lines, as in Fig. 11.2-b, the controllability of each of the output lines is given by:

Cf-way fan-out = Cinput / (1 + log2 f) (11.2.fo-c)

A line of very low controllability constitutes a good place for the insertion of a testpoint.

(a) Gate input observability (b) Observability before fan-out

Fig. 11.3 Quantifying the observability of a circuit line.


Example 11.contr: Quantifying controllability Derive the controllabilities of lines P and K in


the logic circuit of Fig. 11.1.

Solution: The XOR gate has a CTF of 1, making the controllability of line M, and thus line P,
equal to 1. Two-input AND and OR gates have a CTF of 1/2. Thus both Q and R have a
controllability of 1/2, giving K a controllability of 1/4.
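The CTF values and the chained controllabilities in the example above can be verified with a short script; the carry chain below assumes, for illustration only, that Fig. 11.1 is a full adder with K as its carry output:

```python
from itertools import product

def ctf(gate, k):
    # Controllability transfer factor, Eq. (11.2.CTF):
    # CTF = 1 - |N(0) - N(1)| / (N(0) + N(1))
    outs = [gate(*bits) for bits in product((0, 1), repeat=k)]
    n0, n1 = outs.count(0), outs.count(1)
    return 1 - abs(n0 - n1) / (n0 + n1)

AND = lambda *x: int(all(x))
OR  = lambda *x: int(any(x))
XOR = lambda *x: sum(x) % 2

print(ctf(AND, 3), ctf(AND, 2), ctf(XOR, 2))   # 0.25 0.5 1.0

# Chaining per Eq. (11.2.contr): gate output controllability is
# CTF times the average of the input controllabilities.
c_q = ctf(AND, 2) * (1 + 1) / 2   # AND gate fed by fully controllable lines
c_r = ctf(AND, 2) * (1 + 1) / 2
c_k = ctf(OR, 2) * (c_q + c_r) / 2
print(c_k)                         # 0.25, matching Example 11.contr
```

The N(0)/N(1) counts come straight from truth-table enumeration, so the same function works for any gate type and fan-in.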

To determine the line observabilities within a given circuit, we begin from primary
outputs, each of which is assigned the perfect observability of 1, and proceed toward the
inputs. The observability of each input line of a k-input gate is the product of the gate’s
observability transfer factor (OTF) and its output observability:

Oinput i = OTF × Ooutput (11.2.obser)

The OTF of a gate depends on how easy it is to sensitize a path from a gate input to the
output. Taking the 3-input AND gate of Fig. 11.3a as an example and considering a fault on
one of the inputs, only N(sp) = 1 of the 4 patterns on the other two inputs sensitizes a path
to the output, whereas N(ip) = 3 patterns inhibit the propagation. The relatively small
number of sensitizing options indicates poor observability of the gate’s inputs. In general,
we use equation (11.2.OTF) to derive a gate’s OTF:

OTF = N(sp) / (N(sp) + N(ip)) (11.2.OTF)

When there are no inhibiting input patterns, we have N(ip) = 0, leading to the perfect
OTF of 1. An XOR gate has the N(ip) = 0 property and is thus a desirable circuit
component in terms of observability. In the case of a line that fans out into f lines, as in
Fig. 11.3b, the observability of each of the input lines is given by:

Of-way fan-out = 1 – ∏i (1 – Oi) (11.2.fo-o)

A line of very low observability constitutes a good place for the insertion of a testpoint.


Example 11.obser: Quantifying observability Derive the observabilities of lines B and P in the
logic circuit of Fig. 11.1.

Solution: Two-input AND and OR gates have an OTF of 1/2, leading to an observability of 1/2
for Q and 1/4 for P, tracing back from the primary output K. Line B has an observability of 1,
given the path consisting of two XORs (OTF = 1) from B to the primary output S.
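Similarly, the OTF values and the observability chain of this example can be checked mechanically; the path structure of Fig. 11.1 is assumed to be as described in the solution:

```python
from itertools import product

def otf(gate, k):
    # Observability transfer factor, Eq. (11.2.OTF):
    # OTF = N(sp) / (N(sp) + N(ip)), counting patterns on the other
    # k-1 inputs that propagate (sensitize) a flip on the watched input.
    sp = sum(gate(0, *rest) != gate(1, *rest)
             for rest in product((0, 1), repeat=k - 1))
    return sp / 2**(k - 1)

AND = lambda *x: int(all(x))
OR  = lambda *x: int(any(x))
XOR = lambda *x: sum(x) % 2

print(otf(AND, 3), otf(OR, 2), otf(XOR, 2))   # 0.25 0.5 1.0

# Chaining per Eq. (11.2.obser), tracing back from the primary outputs:
o_q = otf(OR, 2) * 1.0           # Q feeds the primary output K directly
o_p = otf(AND, 2) * o_q          # 0.25, matching Example 11.obser
o_b = otf(XOR, 2) * otf(XOR, 2)  # 1.0 through two XOR gates to S
print(o_p, o_b)
```

A pattern sensitizes a path exactly when flipping the watched input flips the gate output, which is what the inequality inside the sum tests.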


11.3 Testpoint Insertion

If testability analysis shows one or more lines in the circuit to have low testability, we
might want to make those lines externally controllable and observable via
testpoint placement. Figure 11.tps shows how the testability of a system composed of two
cascaded modules can be improved by inserting degating logic at their interface. In this
way, each module can be tested separately and then in tandem to ensure both proper
module operation and correct interfacing.

Example 11.test: Placement of a testpoint Derive the testabilities of all lines in Fig. 10.quad-a
and from them, deduce the location of a single testpoint that would help most with testability.

Solution: To be provided.

If m testpoints are to be placed, it may not be a good idea to pick as their locations the
lines with the m lowest testability values. This is because locating a testpoint on the
lowest-testability line will in general affect the testabilities of many other lines. A better
strategy is to place a single testpoint optimally, recalculate the testabilities of all circuit
lines while treating the inserted point as a primary input/output, and only then choose the
next location.

Fig. 11.tps Testpoints improve controllability and observability.

(a) Partitioned design (b) Normal mode (c) Test mode for A

Fig. 11.partn Design partitioning to improve testability.


11.4 Sequential Scan Techniques

Sequential circuits are more difficult to test than combinational circuits, because their
behavior is state-dependent. We need exponentially many tests to test the sequential
circuit’s behavior for each initial condition. One way to reduce the number of test
patterns needed is to test the flip-flops’ proper operation and the correctness of the
combinational part separately. Inputs to the combinational part are the circuit’s primary
inputs and those coming from the FFs (Fig. 11.scand-a). The ability to load arbitrary bit
patterns into the circuit’s FFs will allow us to apply desired test patterns to the
combinational logic and then observe its response by looking at the primary outputs and
the new contents of the FFs.

Figure 11.scand-b shows one way of accomplishing this aim. All the FFs in the circuit are
strung into a long chain, with serial input and serial output. This chain is known as a scan
path. There is a multiplexer before each FF in the scan path that allows the FF to receive its
input from the scan path (test mode) or from the regular source within the circuit (normal
operation). During testing, we alternate between test mode and normal mode in many
phases. In test mode, we shift a desired pattern into the FFs, while shifting out the stored
pattern placed there by the previous phase of testing.
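The shift-in/shift-out mechanics can be sketched in a few lines of Python. This is a behavioral model only; the function name and bit ordering are illustrative choices, not part of any scan standard:

```python
def scan_cycle(flip_flops, pattern):
    """One test phase on a scan path: shift `pattern` in serially while
    capturing the bits shifted out, which carry the response left in
    the FFs by the previous normal-mode cycle."""
    shifted_out = []
    for bit in pattern:
        shifted_out.append(flip_flops[-1])     # serial output: last FF in the chain
        flip_flops = [bit] + flip_flops[:-1]   # serial input: first FF in the chain
    return flip_flops, shifted_out

# Shift test pattern 1011 into a 4-FF chain holding previous response 0110.
new_state, response = scan_cycle([0, 1, 1, 0], [1, 0, 1, 1])
```

Note that loading an n-bit pattern costs n clock ticks, which is why scan testing trades test-application time for the drastically reduced pattern count discussed above.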

(a) Generic sequential circuit (b) Flip-flops chained in a scan path

Fig. 11.scand Scan design for improving the testability of sequential


circuits.


Boundary scan design is a methodology used to facilitate board-level testing of digital


systems. For each component to be placed on the board, the input and output lines pass
through specially designed scan cells (the small square boxes in Fig. 11.scanp-a) that
operate in one of two modes. During normal operation, the scan cells are transparent and
simply pass the signals from their PI inputs to PO outputs. Thus, during normal
operation, the scan cells are invisible and have no effect on the operation of the circuit,
except for introducing a slight additional delay resulting from their relaying the signals.

When we want to test the board, . . .

(a) Scan path for a circuit (b) Cascading of scan paths

Fig. 11.scanp Scan path at the circuit and system levels.

Fig. 11.scanc Basic boundary scan cell.


11.5 Built-in Self-Test

Given the difficulty of testing, built-in self-test (BIST) capability is incorporated into
many complex systems. BIST requires the incorporation of test-pattern generation and
pass/fail decision into the same package as the circuit under test. As in ordinary testing,
the test patterns may be generated randomly at the time of application or may be
precomputed and stored in memory. Among the most common methods of on-the-fly test
generation is the use of linear feedback shift registers (LFSRs).
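A rough behavioral sketch of such a generator follows; the register width and tap positions are illustrative choices (taps {0, 3} happen to give a maximal-length sequence for 4 bits), not values taken from the text:

```python
def lfsr_stream(state, taps, count):
    """Fibonacci-style LFSR: each clock, the XOR of the tapped stages
    is shifted in at one end while the last stage is emitted as a
    pseudorandom test bit."""
    out = []
    for _ in range(count):
        out.append(state[-1])          # emit the last stage
        feedback = 0
        for t in taps:
            feedback ^= state[t]       # XOR of tapped stages
        state = [feedback] + state[:-1]  # shift with feedback
    return out

# A 4-bit LFSR with taps {0, 3} cycles through all 15 nonzero states
# before repeating, so it supplies 15 distinct test patterns.
bits = lfsr_stream([1, 0, 0, 0], [0, 3], 30)
```

The same LFSR structure, run in "compaction" mode, also serves as the response analyzer (signature register) in many BIST schemes.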

Fig. 11.bist Ordinary testing versus built-in self-test.


11.6 Testing of Analog and Hybrid Circuits

Discuss based on [Liu12].


Problems

11.1 Testability modeling


Quantify the testability of each line of a single-bit full-adder circuit built of two half-adders and an OR
gate. Recall that a half-adder consists of an XOR gate and an AND gate producing the sum and carry-out
bits, respectively, when adding two input bits x and y. Based on the results obtained, where would you
place a single testpoint?

11.2 Testability modeling


a. Suppose we replace a 4-input AND gate (inputs a, b, c, d; output z) with three 2-input AND gates,
connected so as to produce the same output z = abcd. The 3-gate circuit has internal lines x and y,
besides the primary inputs and output. Show how this replacement changes the testabilities of the
primary input and output lines.
b. Discuss the reasonableness of the results in part a.
c. Repeat parts a and b with OR gates.
d. Discuss what happens in the case of XOR gates.

11.3 Testability under reconvergent fan-out


a. Consider the circuit of Fig. 9.DAlg-b that contains a reconvergent fan-out. Show that the
testability of line s is the same when computed from each of the two paths to the output.
b. Provide an example of a circuit with a reconvergent fan-out where different testability values are
obtained for the fan-out point depending on the path chosen.
c. In the situation of part b, how are testability values to be assigned?

11.4 Controllability and Observability transfer factors


a. Derive the CTF of a 4-bit less-than comparator (two unsigned 4-bit binary inputs, 1 output line).
b. Repeat part a for OTF.
c. Repeat part a for a 4-bit less-than comparator for 2’s-complement integers.
d. Repeat part c for OTF.

11.5 Controllability and Observability transfer factors


Discuss a method for obtaining the CTF and OTF parameters of a k-input, l-output component. Apply your
method to a 4-bit unsigned binary comparator that provides less-than, greater-than, and equality indications
(3 output lines).

11.6 Controllability and Observability transfer factors


a. Derive the CTF parameter for a 4-input, 1-output logic block that implements the logic function
z = x0x2 ∨ x1x3.
b. Repeat part a for OTF.
c. How would the CTF of part a change if we use the block as a multiplexer, by connecting its x3
input to x̄2?
d. Repeat part c for OTF.


References and Further Readings


[Agra93] Agrawal, V. D., C. R. Kime, and K. K. Saluja, “A Tutorial on Built-in Self-Test, Part 1:
Principles” (Part 2: Applications), IEEE Design and Test of Computers, Vol. 10, No. 1 (2), pp.
73-82 (69-77), January (April) 1993.
[Benn84] Bennetts, R. G., Design of Testable Logic Circuits, Addison-Wesley, 1984.
[Fuji85] Fujiwara, H., Logic Testing and Design for Testability, MIT Press, 1985.
[Lala85] Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall, 1985.
[Lala97] Lala, P. K., Digital Circuit Testing and Testability, Academic Press, 1997.
[Liu12] Liu, R. W., Testing and Diagnosis of Analog Circuits and Systems, Springer, 2012.
[Step76] Stephenson, J. E. and J. Grason, “A Testability Measure for Register Transfer Level Digital
Circuits,” Proc. Int’l Symp. Fault-Tolerant Computing, 1976, pp. 101-107.
[Wang06] Wang, L.-T., C.-W. Wu, and X. Wen (eds.), VLSI Test Principles and Architectures: Design
for Testability, Elsevier, 2006.
[Wang10] Wang, L.-T., C. E. Stroud, and N. A. Touba. System-on-Chip Test Architectures: Nanometer
Design for Testability, Morgan Kaufmann, 2010.


12 Replication and Voting


“Honest disagreement is often a good sign of progress.”
Mahatma Gandhi

“Half of the American people never read a newspaper. Half
never voted for President. One hopes it is the same half.”
Gore Vidal

Topics in This Chapter


12.1. Hardware Redundancy Overview
12.2. Replication in Space
12.3. Replication in Time
12.4. Mixed Space/Time Replication
12.5. Switching and Voting Units
12.6. Variations and Design Issues

As noted in Chapter 10, a widely used fault masking technique is based on
replicated circuits or modules whose outputs feed a fusion or combining circuit,
known as a vote taker. In this chapter, we augment the abstract discussions of
Chapter 10 with practical considerations in designing such redundant circuits.
Among the implementation questions discussed here are how to synchronize the
multiple circuits so that data fusion occurs correctly, how to reduce the volume of
data to be compared or fused, how to maximize reliability for a given level of
replication, and how to ensure that the switching and fusion elements do not
become weak links in the system.


12.1 Hardware Redundancy Overview

Typical hardware units consist of a data path, where computations and other data
manipulations take place, and a control unit that is in charge of scheduling and
sequencing the data path operations (Fig. 12.dpcg). A small amount of glue logic binds
the main two parts together, allowing various optimizations as well as customization of
certain generic capabilities for applications of interest. Redundancy methods for dealing
with faults in data path, control unit, and glue logic are quite different. Generally
speaking, many more options are available for protecting data paths through redundancy,
as opposed to control circuitry (far fewer options) and the glue logic (extremely limited).

Options for protecting the data path span a wide range:

● Replication in space, including duplication with comparison, triplication with
voting (TMR), pair-and-spare, NMR, and hybrid schemes (rather costly)
● Replication in time, including recomputation with comparison, recomputation
with voting, alternating logic, recomputation after shifting, recomputation after
swapping, and replication of operand segments (rather slow)
● Mixed space-time replication, based on a combination of the methods listed under
the preceding two bullet points
● Monitoring via mechanisms, such as watchdog timers and activity monitors, that
are economical but have imperfect coverage
● Low-redundancy coding-based schemes, including parity prediction, residue
checking, and self-checking design

Fig. 12.dpcg Hardware unit with data path, control unit, and glue logic.


Options for the control unit may involve coding of control signals, control-flow
watchdogs, and self-checking design. Protection methods for the glue logic are limited to
simple replication and self-checking design.


12.2 Replication in Space

The schemes depicted in Fig. 12.space have already been discussed in connection with
fault masking. In this section, we take a more detailed look at some of them, with the goal
of understanding the trade-offs involved in their use and the extent of protection they
offer in data path operations.

Let us begin by a more detailed examination of the TMR scheme of Fig. 12.space-c.
Previously, we viewed TMR as a 2-out-of-3 system and derived its reliability with an
imperfect voting circuit in equation 10.3.RTMR. In the following example, we consider
the effects of an imperfect voter on system reliability.

Example 12.TMR1: Modeling of TMR with imperfect voting circuit Consider a TMR
system with identical module reliabilities Rm and voter reliability Rv. Under what conditions will
the TMR system be more reliable than a simplex module?

Solution: The reliability of the TMR system can be written as R = Rv(3Rm² – 2Rm³). For R > Rm,
we must have Rv > 1/(3Rm – 2Rm²). On the other hand, for a given Rv, the system will be more
reliable than a module if (3 – √(9 – 8/Rv))/4 < Rm < (3 + √(9 – 8/Rv))/4 (see Fig. 12.TMR). For
example, if Rv = 0.95, reliability improvement requires that 0.56 < Rm < 0.94. In practice, Rv is
very close to 1, that is, Rv = 1 – ε. We then have 1/Rv ≈ 1 + ε and √(1 – 8ε) ≈ 1 – 4ε. Thus, the
condition for reliability improvement becomes 0.5 + ε < Rm < 1 – ε. Because modules tend to be
much more reliable than voters, improved reliability is virtually guaranteed.
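These bounds are easy to check numerically; the following Python sketch (the function names are ours, not from the text) reproduces the Rv = 0.95 figures:

```python
import math

def tmr_reliability(rm, rv):
    """R = Rv(3Rm^2 - 2Rm^3): TMR with module reliability rm and voter rv."""
    return rv * (3 * rm**2 - 2 * rm**3)

def rm_bounds(rv):
    """Module-reliability range over which TMR beats a simplex module,
    obtained by solving Rv(3Rm^2 - 2Rm^3) > Rm for Rm."""
    d = math.sqrt(9 - 8 / rv)
    return (3 - d) / 4, (3 + d) / 4

lo, hi = rm_bounds(0.95)   # roughly 0.56 and 0.94, as in the example
```

Plotting tmr_reliability against rm for a few voter reliabilities is a quick way to visualize how the improvement region of Fig. 12.TMR shrinks as the voter degrades.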

Next, we try to model the effect of compensating errors in TMR.

Example 12.TMR3: TMR with compensating errors Not all double-module faults lead to an
erroneous output in TMR. Consider a TMR system in which each of the 3 modules sends a single
bit to the voting circuit. Let the module reliability be Rm = 1 – p0 – p1, where p0 and p1 are
probabilities of 0-fault and 1-fault, respectively. Derive the system reliability R.

Solution: The system operates correctly if it has no more than one fault or if there are two faults
with compensating errors. Thus, R = (3Rm² – 2Rm³) + 6p0p1Rm, with the last term being the
contribution of compensating errors to system reliability. Take for example the numerical values
Rm = 0.998 and p0 = p1 = 0.001. These values yield R = 0.999 994 = 0.999 988 + 0.000 006, where
0.000 006 is the improvement to the ordinary TMR reliability 0.999 988 due to the modeling of
compensating errors. We can derive the respective reliability improvement factors thus:
RIFTMR/Simplex = 0.002 / 0.000 012 ≈ 167 and RIFCompens/TMR = 0.000 012 / 0.000 006 = 2.
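As a sanity check, the formula is easy to evaluate directly (with the stated parameters the expression works out to R ≈ 0.999 994); the function name below is ours:

```python
def tmr_1bit_reliability(p0, p1):
    """Reliability of a 1-bit TMR system in which each module fails to 0
    with probability p0 and to 1 with probability p1.  The 6*p0*p1*Rm
    term credits double faults whose opposite errors cancel in the vote:
    3 module pairs, 2 ways to assign the 0-error and 1-error per pair."""
    rm = 1 - p0 - p1
    return (3 * rm**2 - 2 * rm**3) + 6 * p0 * p1 * rm

r = tmr_1bit_reliability(0.001, 0.001)
```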


(a) Duplicate and compare (b) Triplicate and vote

(c) Pair-and-spare (d) NMR/hybrid

Fig. 12.space Several replication methods in space.

Fig. 12.TMR Analysis of TMR with imperfect voting unit.


12.3 Replication in Time

Section to be based on the following slides:

(a) Using 2 adders and a multiplier (b) Using an adder and a multiplier

Fig. 12.time Duplication in time, with scheduling considerations.

Fig. 12.time2 Triplication in time, with scheduling considerations.


12.4 Mixed Space/Time Replication

Section to be written.


12.5 Switching and Voting Units

Consider a TMR system with a 3-way voting circuit. Assuming that we do not worry
about how the system will behave in case of two faulty modules, a simple voter can be
designed from a word comparator and a 2-to-1 multiplexer, as depicted in Fig. 12.vote-a.
If x1 and x2 disagree, then x3 is chosen as the voting result; otherwise x1 = x2 is passed on
to the output. This comparison-based voting scheme can be readily extended to a larger
number of computation channels, but the number of comparators required rises sharply
with the number n of channels.
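The comparator-plus-mux scheme reduces to a one-line decision rule. This behavioral sketch treats channel outputs as plain values and, like the circuit, assumes at most one faulty channel:

```python
def tmr_word_voter(x1, x2, x3):
    """Word voter of Fig. 12.vote-a, built from a comparator and a
    2-to-1 mux: if x1 and x2 agree, pass x1 (= x2); otherwise x3 must
    hold the majority value, given at most one faulty channel."""
    return x1 if x1 == x2 else x3

# With a single corrupted channel, the majority word always wins:
assert tmr_word_voter(0b1011, 0b1011, 0b0000) == 0b1011
assert tmr_word_voter(0b1011, 0b0000, 0b1011) == 0b1011
assert tmr_word_voter(0b0000, 0b1011, 0b1011) == 0b1011
```

Note the economy: one k-bit comparator and one k-bit mux, versus the three comparators a full pairwise-agreement check would need.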

In applications where single-bit results (decisions) are to be voted on, the voting circuit is
referred to as a bit-voter. We can synthesize bit-voters from logic gates (Fig. 12.3of5),
but the circuit complexity quickly explodes with increasing n. It is thus imperative to find
systematic design schemes based on higher-level components that keep the design
complexity in check.

(a) Comparison-based voting circuit (b) Basics and notation for bit-voting

Fig. 12.vote Designs for simple voting circuits.

Fig. 12.3of5 Design of a 3-out-of-5 bit-voter from 2-input gates.


(a) Using 2-to-1 muxes (b) Using a single 8-to-1 mux

Fig. 12.muxv Designs for a 3-out-of-5 bit-voter based on multiplexers.

One possibility is the use of multiplexers, as depicted in Fig. 12.muxv. For example, in
Fig. 12.muxv-b, the inputs are partitioned into the subsets {x1, x2, x3} and {x4, x5}. If
inputs in the first set are identical, then the majority output is already known. If two
inputs in the first set are 1s, then the voting result is 1 iff at least one member of the
second set is 1. Finally, if one of the 3 inputs in the first set is 1, then x4 = x5 = 1 is
required for producing a 1 at the output.
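The case analysis behind this partition can be checked exhaustively in software; the following sketch mirrors the logic of Fig. 12.muxv-b (the function name is ours):

```python
from itertools import product

def vote_3of5(x1, x2, x3, x4, x5):
    """3-out-of-5 majority via the partition {x1,x2,x3} / {x4,x5}:
    the count of 1s in the first set determines how many 1s are
    still needed from the second set."""
    ones = x1 + x2 + x3
    if ones == 3:
        return 1             # majority already decided
    if ones == 2:
        return x4 | x5       # one more 1 suffices
    if ones == 1:
        return x4 & x5       # both remaining bits must be 1
    return 0                 # at most two 1s are possible overall

# Exhaustive check against the direct definition of majority:
assert all(vote_3of5(*v) == (sum(v) >= 3) for v in product((0, 1), repeat=5))
```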

Other design strategies for synthesizing bit-voters include the use of arithmetic circuits
(add the bits and compare the sum to a threshold) and the use of selection networks (the
majority of bit values is also their median). Synthesis of bit-voters based on these
strategies has been performed and the results compared (Fig. 12.cmplx). It is readily seen
that multiplexer-based designs have the edge for small values of n, but for larger values
of n, designs based on selection networks tend to win.

Word-voting circuits cannot be designed based on bit-voting on the various bit positions.
To see this, note that word inputs 00, 10, and 11 have majority results of 1 and 0 in their
two positions, but the result of bit-voting, that is, 10, is not a majority value. The situation
can even be worse. With word inputs 000, 101, and 110, the bitwise majority voting
result is 100, which isn’t even equal to one of the inputs.
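Both counterexamples from the text can be reproduced with a few lines of Python:

```python
def bitwise_majority(a, b, c):
    """Position-by-position bit-voting over three equal-length bit strings."""
    return ''.join('1' if x + y + z >= 2 else '0'
                   for x, y, z in zip(map(int, a), map(int, b), map(int, c)))

# Bit-voting on words 00, 10, 11 yields 10, on which no two inputs agree,
# and on 000, 101, 110 it yields 100, which matches no input at all.
assert bitwise_majority('00', '10', '11') == '10'
assert bitwise_majority('000', '101', '110') == '100'
assert '100' not in ('000', '101', '110')
```

This is why word voters must compare entire words (as in Fig. 12.vote-a) rather than delegating to independent bit-voters.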


Fig. 12.cmplx Complexity of bit-voting circuits synthesized based on
various strategies.

Recently, a recursive method for the construction of voting networks has been proposed
that offers regularity, scalability, and power-efficiency benefits over previous design
methods [Parh21], [Parh21a]. The essence of the method is illustrated in Fig. 12.recTCN,
which shows an at-least-l-out-of-n threshold circuit built from a multiplexer (mux), an at-
least-l-out-of-(n–1) threshold circuit, and an at-least-(l–1)-out-of-(n–1) threshold circuit.
Figure 12.5of9 illustrates the unrolling of the recursive construction method in the
specific case of a 5-out-of-9 majority voter.
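The recursion translates directly into code, with the mux of Fig. 12.recTCN becoming a conditional on the last input (a behavioral model, not a gate-level description):

```python
def at_least(l, bits):
    """At-least-l-out-of-n threshold via the recursion of Fig. 12.recTCN:
    input xn selects, mux-like, between an l-of-(n-1) network (xn = 0)
    and an (l-1)-of-(n-1) network (xn = 1)."""
    if l <= 0:
        return 1             # threshold already satisfied
    if len(bits) < l:
        return 0             # too few inputs remain to reach l ones
    *rest, xn = bits
    return at_least(l - 1, rest) if xn else at_least(l, rest)

# 5-out-of-9 majority, as in Fig. 12.5of9 (this pattern has five 1s):
assert at_least(5, [1, 1, 0, 1, 0, 1, 0, 1, 0]) == 1
```

The two base cases correspond to the constant-0 and constant-1 leaves produced when the recurrence is fully unrolled.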


Fig. 12.recTCN Recursive method for building an at-least-l-out-of-n
threshold counting network from a multiplexer and two
smaller threshold counting networks. Unrolling the
recurrence leads to a multiplexer-based design.


Fig. 12.5of9 Recursive construction of a 5-out-of-9 majority voter. Circles
represent multiplexers controlled by the inputs x4, x5, x6, x7,
x8, and x9, going in columns from left to right. The 3-input
gates on the left edge, including the majority gate in the
middle-left of the diagram, receive the inputs x1, x2, and x3.


12.6 Variations and Design Issues

One variation on the theme of replication and voting is that of self-purging redundancy.
Instead of having n operational units and s spares, all n + s units contribute to the
computation by sending their results to a threshold voting circuit (Fig. 12.purge). When a
module disagrees with the voting outcome, it takes itself out of the computatioon by
resetting a flip-flop whose output enables the module output to go to the voting circuit.
The threshold may be fixed or it may vary as units are taken out of service.

An interesting design strategy is based on alternating logic. The basic strategy is depicted
in Fig. 12.alt. Alternating logic takes advantage of the fact that the same fault is unlikely
to affect a signal and its complement in the same way. This property is routinely used in
data transmission over buses by sending a data packet, then sending the bitwise
complement of the data, and comparing the two versions at the destination, allowing the
detection of bus lines s-a-0, s-a-1, and many transient faults. Let the dual of a Boolean
function f(x1, x2, … , xn) be defined as another function fd(x1, x2, … , xn) such that:

fd(x1, x2, … , xn) = (f(x1′, x2′, … , xn′))′ (12.6.dual)

where ′ denotes complementation. The dual of a Boolean function (logic circuit) can be
obtained by exchanging AND and OR operators (gates). For example, the dual of
f(a, b, c) = ab ∨ c is fd(a, b, c) = (a ∨ b)c. Using the dual of a function instead of its
identical duplicate provides greater protection against common-cause and certain
unidirectional multiple faults. If a function is self-dual, a property that is held by many
commonly used circuits such as binary adders (both unsigned and 2’s-complement), a
form of time redundancy can be applied by using the same circuit to compute the
function and its dual.
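The duality relation can be checked exhaustively for small functions; in this Python sketch the complement operator ′ is modeled as 1 − x:

```python
from itertools import product

def dual(f):
    """fd(x1, ..., xn) = (f(x1', ..., xn'))', with ' modeled as 1 - x."""
    return lambda *xs: 1 - f(*(1 - x for x in xs))

# The example above: the dual of f(a,b,c) = ab OR c is (a OR b)c.
f = lambda a, b, c: (a & b) | c
fd = dual(f)
assert all(fd(a, b, c) == (a | b) & c for a, b, c in product((0, 1), repeat=3))

# A full adder's sum bit (3-input XOR) is self-dual, so the same
# circuit can recompute the sum on complemented inputs for checking.
s = lambda a, b, c: a ^ b ^ c
assert all(dual(s)(*v) == s(*v) for v in product((0, 1), repeat=3))
```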

Fig. 12.purge Self-purging redundancy with a threshold voting circuit.


Fig. 12.alt Duplication with alternating logic.

Recomputing with transformed operands (Fig. 12.xform) constitutes another class of
strategies for fault detection and masking. The main idea is that the redundant
computation is performed on transformed operands in an effort to make it unlikely that
similar faults would lead to identical errors in the two results. When f is binary addition,
we can use shifts for encoding and decoding, given that the sum of shifted operands is a
shifted version of the original sum. For operations that are symmetric, recomputation
after swapping the two operands will cause any existing fault to be exercised differently,
and thus to produce a different error, which makes it detectable.
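The shifted-operand idea for addition can be sketched as follows; the checker function is our own illustrative wrapper, with `adder` standing in for the (possibly faulty) hardware under check:

```python
def checked_add(x, y, adder, k=1):
    """Recompute with left-shifted operands: since (x<<k) + (y<<k)
    equals (x+y)<<k, a mismatch between the two runs exposes a
    fault in `adder`."""
    s1 = adder(x, y)
    s2 = adder(x << k, y << k)
    if s2 != s1 << k:
        raise RuntimeError("adder fault detected")
    return s1

# A stuck fault forcing result bit 2 to 1 corrupts the two runs
# inconsistently, so the shifted recomputation catches it:
faulty_adder = lambda a, b: (a + b) | 0b100
```

Because the fault sits at a fixed bit position while the operands move past it, the two runs disagree, which is exactly the "exercised differently" effect described above.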

Fig. 12.xform Computing with transformed operands.

(a) Two-way segmented adder (b) A 4-way segmented adder

Fig. 12.segm Two designs for time-redundant, segmented addition.


Problems

12.1 Voting units for TMR and 5MR


a. Show at least one implementation (using logic gates and/or standard building-block combinational
circuits, such as comparators, muxes, and the like) for 2-out-of-3 word-voting. The three inputs
and the voting unit output are k-bit words. The voting unit need not detect the lack of majority, but
must produce the majority value, if one exists, at its output.
b. Can your design be extended to a 3-out-of-5 voter?

12.2 How not to do TMR


The International Space Station (ISS) experienced a computer-related crisis in June 2007. According to
NASA documents, “On 13 June, a complete shutdown of secondary power to all [three] central computer
and terminal computer channels occurred, resulting in the loss of capability to control ISS Russian segment
systems.” Study this ISS incident using Internet sources and discuss in one typed page the nature of the
crisis, its underlying causes, how the problems were dealt with, and what we can learn from this experience
in terms of how to design dependable systems with diverse subsystems and suppliers.

12.3 Design of switch-voting units


For the redundancy scheme discussed in Problem 10.5, design the required 5-input, 1-output switch-voting
unit, with 5 internal FFs indicating which units are contributing to the formation of the output. Assume that
the five modules produce 1-bit results. Feel free to use muxes, comparators, and other standard circuits in
your design (the design need not be at the gate level).

12.4 TMR with imperfect voting unit


Consider a TMR system with identical logic modules of reliability R and a voting unit of reliability r.
a. Derive the range of R values for which TMR offers improved reliability over a simple module,
given the use of a voting unit with reliability r.
b. Plot the upper and lower bounds of part a as functions of r and discuss.

12.5 TMR vs. standby sparing


Which is more reliable over a given mission time T: A 2-out-of-3 system with voter reliability 0.9 over time
T or a 2-way parallel system (one working module and one spare) with coverage factor 0.8?
a. If module reliability over time T is 0.9.
b. If module reliability over time T is 0.7.

12.6 Recursively-built voters


Use the recursive design method illustrated in Figs. 12.recTCN and 12.5of9 to build voters of the
following kinds.
a. 4-out-of-7 majority voter
b. 5-out-of-7 supermajority voter
c. 6-out-of-7 near-unanimity voter
d. 6-out-of-9 supermajority voter
e. 8-out-of-9 near-unanimity voter


12.7 Recursively-built threshold circuits


By offering appropriate circuit designs for specific examples, show that the recursive design method
illustrated in Figs. 12.recTCN and 12.5of9 can also be applied to the design of various other types of
threshold circuits.
a. At-least-3-out-of-7 threshold circuit
b. Less-than-3-out-of-7 inverse-threshold circuit
c. No-more-than-3-out-of-7 inverse-threshold circuit
d. Exactly-4-out-of-7 weight-checking circuit
e. At-least-3-and-no-more-than-4-out-of-7 between-limits threshold circuit



References and Further Readings


[Lyon62] Lyons, R. E. and W. Vanderkulk, “The Use of Triple Modular Redundancy to Improve
Computer Reliability,” IBM J. Research and Development, Vol. 6, No. 2, pp. 200-209, April
1962.
[Parh91] Parhami, B., “Voting Networks,” IEEE Trans. Reliability, Vol. 40, No. 3, pp. 380-394, August
1991.
[Parh91a] Parhami, B., “Design of m-out-of-n Bit-Voters,” Proc. 25th Asilomar Conf. Signals, Systems,
and Computers, 1991, pp. 1260-1264.
[Parh96] Parhami, B., “A Taxonomy of Voting Schemes for Dependable Multi-Channel Computations,”
Reliability Engineering and System Safety, Vol. 52, No. 2, pp. 139-151, May 1996.
[Parh21] Parhami, B., S. Bakhtavari Mamaghani, and G. Jaberipur, “Recursive Construction of Counting
Networks,” Submitted for publication.
[Parh21a] Parhami, B., S. Bakhtavari Mamaghani, and G. Jaberipur, “Recursive Construction of Large
Voters,” In preparation.
[Town03] Townsend, W. J., J. A. Abraham, and E. E. Swartzlander Jr., “Quadruple Time Redundancy
Adders [Error Correcting Adder],” Proc. 18th IEEE Int’l Symp. Defect and Fault Tolerance in
VLSI Systems, 2003, pp. 250-256.
[Yama05] Yamasaki, H. and T. Shibata, “A High-Speed Median Filter VLSI Using Floating-Gate-
CMOS-Based Low-Power Majority Voting Circuits,” Proc. 31st European Solid-State Circuits
Conf., 2005, pp. 125-128.


Dedication
To my academic mentors of long ago:

Professor Robert Allen Short (1927-2003),


who taught me digital systems theory
and encouraged me to publish my first research paper on
stochastic automata and reliable sequential machines,

and

Professor Algirdas Antanas Avižienis (1932- )


who provided me with a comprehensive overview
of the dependable computing discipline
and oversaw my maturation as a researcher.

About the Cover


The cover design shown is a placeholder. It will be replaced by the actual cover image
once the design becomes available. The two elements in this image convey the ideas that
computer system dependability is a weakest-link phenomenon and that modularization &
redundancy can be employed, in a manner not unlike the practice in structural
engineering, to prevent failures or to limit their impact.

Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).


IV Errors: Informational Distortions

[Sidebar: the multilevel model levels: Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed]

“To err is human––to blame it on someone else is even more
human.”
Jacob's law

“Let me put it this way, Mr. Amer. The 9000 series is the most
reliable computer ever made. No 9000 computer has ever made
a mistake or distorted information. We are all, by any practical
definition of the words, foolproof and incapable of error.”
HAL, an on-board “superintelligent” computer in
the movie “2001: A Space Odyssey”

Chapters in This Part


13. Error Detection
14. Error Correction
15. Self-Checking Modules
16. Redundant Disk Arrays

An error is any difference in a system’s state (contents of its memory elements)
compared with a reference state, as defined by its specification. Errors are either
built into a system by improper initialization or develop due to fault-induced
deviations. Error models characterize the information-level consequences of
logic faults. Countermeasures against errors are available in the form of error-
detecting and error-correcting codes, which are the information-level counterparts
of fault testing and fault masking at the logic level. Redundant information
representation to detect and/or correct errors has a long history in the realms of
symbolic codes and electronic communication. We present a necessarily brief
overview of coding theory, adding to the classical methods a number of options
that are particularly suitable for computation, as opposed to communication.
Applications of coding techniques to the design of self-checking modules and
redundant arrays of independent disks conclude this part of the book.


13 Error Detection
”Thinking you know when in fact you don’t is a fatal mistake, to
which we are all prone.”
Bertrand Russell

“In any collection of data, the figure most obviously correct,


beyond all need of checking, is the mistake.”
Finagle’s Third Law

“An expert is a person who has made all the mistakes that can
be made in a very narrow field.”
Niels Bohr

Topics in This Chapter


13.1. Basics of Error Detection
13.2. Checksum Codes
13.3. Weight-Based and Berger Codes
13.4. Cyclic Codes
13.5. Arithmetic Error-Detecting Codes
13.6. Other Error-Detecting Codes

One way of dealing with errors is to ensure their prompt detection, so that they
can be counteracted by appropriate recovery actions. Another approach, discussed
in Chapter 10, is automatic error correction. Error detection requires some form of
result checking that might be done through time and/or informational redundancy.
Examples of time redundancy include retry, variant computation, and output
verification. The simplest form of informational redundancy is replication, where
each piece of information is duplicated, triplicated, and so on. This would imply a
redundancy of at least 100%. Our focus in this chapter is on lower-redundancy,
and thus more efficient, error-detecting codes.


13.1 Basics of Error Detection

A method for detecting transmission/computation errors, used along with subsequent
retransmission/recomputation, can ensure dependable communication/computation. If the
error detection scheme is completely fool-proof, then no error will ever go undetected
and we have at least a fail-safe mode of operation. Retransmissions and recomputations
are most effective in dealing with transient causes that are likely to disappear over time,
thus making one of the subsequent attempts successful. In practice, no error detection
scheme is completely fool-proof and we have to use engineering judgment as to the
extent of the scheme’s effectiveness vis-à-vis its cost, covering for any weaknesses via
the application of complementary methods.

Error detection schemes have been used since ancient times. Jewish scribes are said to
have devised methods, such as fitting a certain exact number of words on each line/page
or comparing a sample of a new copy with the original [Wikipedia], to prevent errors
during manual copying of text, thousands of years ago. When an error was discovered,
the entire page, or in cases of multiple errors, the entire manuscript, would be destroyed,
an action equivalent to the modern retransmission or recomputation. Discovery of the
Dead Sea Scrolls, dating from about 150 BCE, confirmed the effectiveness of these
quality control measures.

The most effective modern error detection schemes are based on redundancy. In the most
common set-up, k-bit information words are encoded as n-bit codewords, n > k. Changing
some of the bits in a valid codeword produces an invalid codeword, thus leading to
detection. The ratio r/k, where r = n – k is the number of redundant bits, indicates the
extent of redundancy or the coding cost. Hardware complexity of error detection is
another measure of cost. Time complexity of the error detection algorithm is a third
measure of cost, which we often ignore, because the process can almost always be
overlapped with useful communication/computation.

Figure 13.1a depicts the data space of size 2^k, the code space of size 2^n, and the set of
2^n – 2^k invalid codewords, which, when encountered, indicate the presence of errors.
Conceptually, the simplest redundant representation for error detection consists of
replication of data. For example, triplicating a bit b, to get the corresponding codeword
bbb, allows us to detect errors in up to 2 of the 3 bits. For example, the valid codeword
000 can change to 010 due to a single-bit error or to 110 as a result of 2 bit-flips, both of

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-11-01 261

which are invalid codewords allowing detection of the errors. We will see in Chapter 14
that the same triplication scheme allows the correction of single-bit errors in lieu of
detection of up to 2 bit-errors.

A possible practical scheme for using duplication is depicted in Fig. 13.2a. The desired
computation y = f(x) is performed twice, preceded by encoding of the input x
(duplication) and succeeded by comparison to detect any disagreement between the two
results. Any error in one copy of x, even if accompanied by mishaps in the associated
computation copy, is detectable with this scheme. A variant, shown in Fig. 13.2b, is
based on computing the complement ȳ in the second channel, with differing outputs
being a sign of correctness. One advantage of the latter scheme, that is, encoding x as
xx̄, over straight duplication, or xx, is that it can detect any unidirectional error (0s
changing to 1s, or 1s to 0s, but not both at the same time), even if the errors span both copies.
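The difference between the two duplication schemes can be sketched in a few lines of Python (the function and variable names below are ours, not from the text): straight duplication misses a unidirectional error that hits the same bit positions in both copies, while inverted duplication catches it.

```python
def encode_straight(x):
    """Straight duplication: codeword is the pair (x, x)."""
    return (x, x)

def encode_inverted(x, k):
    """Inverted duplication: codeword is (x, x-bar) for a k-bit word x."""
    return (x, x ^ ((1 << k) - 1))

def check_straight(a, b):
    """Valid iff the two copies agree."""
    return a == b

def check_inverted(a, b, k):
    """Valid iff the two halves are bitwise complements."""
    return a ^ b == (1 << k) - 1

# A unidirectional (0 -> 1) error hitting bit 1 of BOTH halves:
x = 0b0101
a, b = encode_straight(x)
print(check_straight(a | 0b0010, b | 0b0010))     # True: error missed

a, b = encode_inverted(x, 4)
print(check_inverted(a | 0b0010, b | 0b0010, 4))  # False: error caught
```

A 0→1 flip in the inverted scheme always turns some complementary bit pair (0, 1) or (1, 0) into (1, 1), which can never belong to a valid codeword.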

(a) Data and code spaces (b) Triplication at the bit level

Fig. 13.1 Data and code spaces in general (sizes 2^k and 2^n) and for
bit-level triplication (sizes 2 and 8).

(a) Straight duplication (b) Inverted duplication

Fig. 13.2 Straight and inverted duplication are examples of high-


redundancy encodings.


Example 13.parity: Odd/even parity checking One of the simplest and oldest methods of
protecting data against accidental corruption is the use of a single parity bit for k bits of data (r =
1, n = k + 1). Show that the provision of an even or odd parity bit, that is, an extra bit that makes
the parity of the (k + 1)-bit codeword even or odd, will detect any single bit-flip or correct an
erasure error. Also, describe the encoding, decoding, and error-checking processes.

Solution: Any single bit-flip will change the parity from even to odd or vice versa, thus being
detectable. An erased bit value can be reconstructed by noting which of the two possibilities
would make the parity correct. During encoding, an even parity bit can be derived by XORing all
data bits together. An odd parity bit is simply the complement of the corresponding even parity
bit. No decoding is needed, as the code is separable. Error checking consists of recomputing the
parity bit and comparing it against the existing one.
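The encoding, checking, and erasure-recovery steps of the example can be sketched as follows (a minimal Python illustration; the function names are ours):

```python
def even_parity_bit(data_bits):
    """XOR of all data bits: makes the (k+1)-bit codeword's parity even."""
    p = 0
    for b in data_bits:
        p ^= b
    return p

def parity_check(codeword_bits):
    """Valid under even parity iff the total number of 1s is even."""
    return sum(codeword_bits) % 2 == 0

def recover_erased(codeword_bits, pos):
    """Reconstruct an erased bit: choose the value that makes parity even."""
    return sum(b for i, b in enumerate(codeword_bits) if i != pos) % 2

data = [1, 0, 1, 1]
cw = data + [even_parity_bit(data)]   # [1, 0, 1, 1, 1]
```

An odd parity bit is simply the complement of `even_parity_bit(data)`.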

Representing a bit b by the bit-pair (b, b̄) is known as two-rail encoding. A two-rail-
encoded signal can be inverted by exchanging its two bits, turning (b, b̄) into (b̄, b).
Taking the possibility of error into account, a two-rail signal is written as (t, c), where
under error-free operation we have t = b (the true part) and c = b̄ (the complement part).
Logical inversion will then convert (t, c) into (c, t). Logical operators, other than
inversion, can also be defined for two-rail-encoded signals:

NOT: ¬(t, c) = (c, t) (13.1.2rail)
AND: (t1, c1) ∧ (t2, c2) = (t1t2, c1 ∨ c2)
OR: (t1, c1) ∨ (t2, c2) = (t1 ∨ t2, c1c2)
XOR: (t1, c1) ⊕ (t2, c2) = (t1c2 ∨ t2c1, t1t2 ∨ c1c2)

The operators just defined propagate any input errors to their outputs, thus facilitating
error detection. For example, it is readily verified that (0, 1) ∨ (1, 1) = (1, 1) and (0, 1) ∨
(0, 0) = (0, 0).
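Equations 13.1.2rail translate directly into code. The sketch below (our own naming) implements the four operators on (t, c) pairs and shows how the error-indicating pairs (1, 1) and (0, 0) propagate through a gate:

```python
def tr_not(a):
    t, c = a
    return (c, t)                 # inversion = swap the two rails

def tr_and(a, b):
    (t1, c1), (t2, c2) = a, b
    return (t1 & t2, c1 | c2)

def tr_or(a, b):
    (t1, c1), (t2, c2) = a, b
    return (t1 | t2, c1 & c2)

def tr_xor(a, b):
    (t1, c1), (t2, c2) = a, b
    return (t1 & c2 | t2 & c1, t1 & t2 | c1 & c2)

def is_valid(a):
    """A two-rail pair is error-free iff its rails are complementary."""
    t, c = a
    return t != c

print(tr_or((0, 1), (1, 1)))   # (1, 1): the error state propagates
print(tr_or((0, 1), (0, 0)))   # (0, 0): likewise
```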

A particularly useful notion for the design and assessment of error codes is that of
Hamming distance, defined as the number of positions in which two bit-vectors differ.
The Hamming distance of a code is the minimum distance between its codewords. For
example, it is easy to see that a 5-bit code in which all codewords have weight 2 (the 2-
out-of-5 code) has Hamming distance 2. This code has 10 codewords and is thus suitable
for representing the decimal digits 0-9, among other uses.
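The distance claim for the 2-out-of-5 code is easy to verify by brute force; the short Python sketch below (our naming) computes the minimum pairwise Hamming distance over all codewords.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions in which two equal-length bit-vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def code_distance(codewords):
    """Hamming distance of a code: minimum over all codeword pairs."""
    return min(hamming(u, v) for u, v in combinations(codewords, 2))

# All 5-bit vectors of weight 2: the 2-out-of-5 code
two_of_five = [tuple(1 if i in ones else 0 for i in range(5))
               for ones in combinations(range(5), 2)]
print(len(two_of_five))            # 10 codewords, one per decimal digit
print(code_distance(two_of_five))  # 2
```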


To detect d errors, a code must have a minimum distance of d + 1. Correction of c errors
requires a code distance of at least 2c + 1. For correcting c errors as well as detecting d
errors (d > c), a minimum Hamming distance of c + d + 1 is needed. Thus, a single-error-
correcting/double-error-detecting (SEC/DED) code requires a minimum distance of 4.

We next review the types of errors and various ways of modeling them. Error models
capture the relationships between errors and their causes, including circuit faults. Errors
are classified as single or multiple (according to the number of bits affected), inversion or
erasure (flipped or unrecognizable/missing bits), random or correlated, and symmetric or
asymmetric (regarding the likelihood of 0→1 or 1→0 inversions). Note that nonbinary
codes have substitution rather than inversion errors. For certain applications, we also
need to deal with transposition errors (exchange of adjacent symbols). Also note that
errors are permanent by nature; in our terminology, we have transient faults, but no such
thing as transient errors.

Error codes, first used for and developed in connection with communication on noisy
channels, were later applied to protecting stored data against corruption. In computing
applications, where data is manipulated in addition to being transmitted and stored, a
commonly applied strategy is to use coded information during transmission and storage
and to strip/reinstate the redundancy via decoding/encoding before/after data
manipulations. Fig. 13.coding depicts this process. While any error-detecting/correcting
code can be used for protection against transmission and storage errors, most such codes
are not closed under arithmetic/logic operations. Arithmetic error codes, to be discussed
in Section 13.5, provide protection for data manipulation circuits as well as transmission
and storage systems.

[Figure body omitted: a pipeline of Input → Encode → Send/Store → Decode →
Manipulate → Encode → Send/Store → Decode → Output, in which only the
Manipulate stage is unprotected by encoding.]

Fig. 13.3 Application of coding to error control.


We end this section with a brief review of criteria used to judge the effectiveness of error-
detecting as well as error-correcting codes. These include redundancy (r redundant bits
used for k information bits, for an overhead of r/k), encoding circuit/time complexity,
decoding circuit/time complexity (nonexistent for separable codes), error detection
capability (single, double, b-bit burst, byte, unidirectional), and possible closure under
operations of interest.

Note that error-detecting and error-correcting codes used for dependability improvement
are quite different from codes used for privacy and security. In the latter case, a simple
decoding process would be detrimental to the purpose for which the code is used.


13.2 Checksum Codes

Checksum codes constitute one of the most widely used classes of error-detecting codes.
In such codes, one or more check digits or symbols, computed by some kind of
summation, are attached to data digits or symbols.

Example 13.UPC: Universal product code, UPC-A In UPC-A, an 11-digit decimal product
number is augmented with a decimal check digit, which is considered the 12th digit and is
computed as follows. The odd-indexed digits (numbering begins with 1 at the left) are added up
and the sum is multiplied by 3. Then, the sum of the even-indexed digits is added to the previous
result, with the new result subtracted from the next higher multiple of 10, to obtain a check digit in
[0-9]. For instance, given the product number 03600029145, its check digit is computed thus. We
first find the weighted sum 3(0 + 6 + 0 + 2 + 1 + 5) + 3 + 0 + 0 + 9 + 4 = 58. Subtracting the latter
value from 60, we obtain the check digit 2 and the codeword 036000291452. Describe the error-
detection algorithm for UPC-A code. Then show that all single-digit substitution errors and nearly
all transposition errors (switching of two adjacent digits) are detectable.

Solution:
To detect errors, we recompute the check digit per the process outlined in the problem statement
and compare the result against the listed check digit. Any single-digit substitution error will add to
the weighted sum a positive or negative error magnitude that is one of the values in [1-9] or in
{12, 15, 18, 21, 24, 27}. None of the listed values is a multiple of 10, so the error is detectable. A
transposition error will add or subtract an error magnitude that is the difference between 3a + b
and 3b + a, that is, 2(a – b). As long as |a – b| ≠ 5, the error magnitude will not be divisible by 10
and the error will be detectable. The undetectable exceptions are thus adjacent transposed digits
that differ by 5 (i.e., 5 and 0; 6 and 1; etc.).
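The check-digit computation and validity test from the example can be expressed compactly (a Python sketch; the names are ours):

```python
def upc_check_digit(data):
    """data: the 11 product-number digits, leftmost first.
    Odd positions (1-indexed) get weight 3, even positions weight 1;
    the check digit brings the weighted sum up to a multiple of 10."""
    weighted = 3 * sum(data[0::2]) + sum(data[1::2])
    return (-weighted) % 10

def upc_valid(codeword):
    """codeword: all 12 digits, check digit last."""
    return upc_check_digit(codeword[:11]) == codeword[11]

digits = [0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]
print(upc_check_digit(digits))   # 2, as in the example
print(upc_valid(digits + [2]))   # True
```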

Generalizing from Example 13.UPC, checksum codes are characterized as follows. Given
the data vector x1, x2, … , xk, we attach a check digit xk+1 to the right end of the vector
so as to satisfy the check equation

∑1≤j≤k+1 wj xj = 0 mod A (13.2.chksm)

where the wj are weights associated with the different digit positions and A is a check
constant. With this terminology, the UPC-A code of example 13.UPC has the weight
vector 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1 and the check constant A = 10. Such a checksum
code will detect all errors that add to the weighted sum an error magnitude that is not a
multiple of A. In some variants of checksum codes, the binary representations of the
vector elements are XORed together rather than added (after multiplication by their


corresponding weights). Because the XOR operation is simpler and faster, such XOR
checksum codes have a speed and cost advantage, but they are not as strong in terms of
error-detection capabilities.


13.3 Weight-Based and Berger Codes

The 2-out-of-5 or 5-bit weight-2 code, alluded to in Section 13.1, is an instance of


constant-weight codes. A constant-weight n-bit code consists of codewords having the
weight w and can detect any unidirectional error, regardless of the number of bit-flips.

Example 13.cwc: Capabilities of constant-weight codes How many random or bidirectional


errors can be detected by a w-out-of-n, or n-bit weight-w, code? How many errors can be
corrected? How should we choose the code weight w so as to maximize the number of codewords?

Solution: Any codeword of a constant-weight code can be converted to another codeword by


flipping a 0 bit and a 1 bit. Thus, code distance is 2, regardless of the values of the parameters w
and n. With a minimum distance of 2, the code can detect a single random error and has no error
correction capability. The number of codewords is maximized if we choose w = n/2 for even n or
one of the values (n – 1)/2 or (n + 1)/2 for odd n.

Another weight-based code is the separable Berger code, whose codewords are formed as
follows. We count the number of 0s in a k-bit data word and attach to the data word the
representation of the count as a ⌈log2(k + 1)⌉-bit binary number. Hence, the codewords
are of length n = k + r = k + ⌈log2(k + 1)⌉. Using a vertical bar to separate the data part
and check part of a Berger code, here are some examples of codewords with k = 6:
000000|110; 000011|100; 101000|100; 111110|001.
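Berger encoding is a one-liner plus a bit-width computation. The sketch below (our naming) reproduces the k = 6 codewords listed above.

```python
from math import ceil, log2

def berger_encode(data_bits):
    """Append the count of 0s as a ceil(log2(k+1))-bit binary number."""
    k = len(data_bits)
    r = ceil(log2(k + 1))
    zeros = data_bits.count(0)
    check = [(zeros >> i) & 1 for i in range(r - 1, -1, -1)]
    return data_bits + check

print(berger_encode([0, 0, 0, 0, 0, 0]))  # data 000000, check 110
print(berger_encode([1, 1, 1, 1, 1, 0]))  # data 111110, check 001
```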

A Berger code can detect all unidirectional errors, regardless of the number of bits
affected. This is because 0→1 flips can only decrease the count of 0s in the data part, but
they either leave the check part unchanged or increase it. As for random errors, only
single errors are guaranteed to be detectable (why?). The following example introduces
an alternate form of Berger code.

Example 13.Berger: Alternate form of Berger code Instead of attaching the count of 0s as a
binary number, we may attach the 1’s-complement (bitwise complement) of the count of 1s. What
are the error-detection capabilities of the new variant?

Solution: The number of 1s increases by 0→1 flips, thus increasing the count of 1s and decreasing
its 1's-complement. A similar opposing direction of change applies to 1→0 flips. Thus, all
unidirectional errors remain detectable.


13.4 Cyclic Codes

A cyclic code is any code in which a cyclic shift of a codeword always results in another
codeword. A cyclic code can be characterized by its generating polynomial G(x) of
degree r, with all of the coefficients being 0s and 1s. Here is an example generator
polynomial of degree r = 3:

G(x) = 1 + x + x^3 (13.4.GenP)

Multiplying a polynomial D(x) that represents a data word by the generator polynomial
G(x) produces the codeword polynomial V(x). For example, given the 7-bit data word
1101001, associated with the polynomial 1 + x + x^3 + x^6, the corresponding codeword is
obtained via a polynomial multiplication in which coefficients are added modulo 2:

V(x) = D(x) × G(x) = (1 + x + x^3 + x^6)(1 + x + x^3) (13.4.CodeW)
     = 1 + x^2 + x^7 + x^9

The result in equation 13.4.CodeW corresponds to the 10-bit codeword 1010000101.


Because of the use of polynomial multiplication to form codewords, a given word is a
valid codeword iff the corresponding polynomial is divisible by G(x).
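Carry-less (GF(2)) polynomial arithmetic on bitmasks makes this easy to check in software. In the sketch below (our own helper names), bit i of an integer holds the coefficient of x^i, and the worked example above is reproduced.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials held as bitmasks (bit i = coeff of x^i)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a    # add (XOR) a shifted copy for each 1 coefficient
        a <<= 1
        b >>= 1
    return prod

def gf2_mod(a, g):
    """Remainder of GF(2) polynomial a modulo g (long division by XOR)."""
    dg = g.bit_length() - 1
    while a and a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

D = 0b1001011          # 1 + x + x^3 + x^6, i.e., data word 1101001
G = 0b1011             # 1 + x + x^3
V = gf2_mul(D, G)
print(bin(V))          # 0b1010000101, i.e., 1 + x^2 + x^7 + x^9
print(gf2_mod(V, G))   # 0: valid codewords are divisible by G(x)
```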

The polynomial G(x) of a cyclic code is not arbitrary, but must be a factor of 1 + x^n. For
example, given that

1 + x^15 = (1 + x)(1 + x + x^2)(1 + x + x^4)(1 + x^3 + x^4)(1 + x + x^2 + x^3 + x^4) (13.4.poly)

several potential choices are available for the generator polynomial of a 15-bit cyclic
code. Each of the factors on the right-hand side of equation 13.4.poly, or the product of
any subset of the 5 factors, can be used as a generator polynomial. The resulting 15-bit
cyclic codes will be different with respect to their error-detection capabilities and
encoding/decoding schemes.

An n-bit cyclic code with k bits' worth of data and r bits of redundancy (generator
polynomial of degree r = n – k) can detect all burst errors of width r or less. This is
because the burst-error polynomial x^i E(x), where E(x) is of degree less than r, cannot be
divisible by the degree-r generator polynomial G(x).


What makes cyclic codes particularly desirable is that they require fairly simple hardware
for encoding and decoding. The linear shift register depicted in Fig. 13.cced-a receives
the data word D(x) bit-serially and produces the code vector V(x) bit-serially, beginning
with the constant term (the coefficient of x^0). The equally simple cyclic-code decoding
hardware in Fig. 13.cced-b is readily understood if we note that B(x) = (x + x^3)D(x) and
D(x) = V(x) + B(x) = (1 + x + x^3)D(x) + (x + x^3)D(x).

(a) Encoding: Forming the polynomial product D(x)  G(x)

(b) Decoding: Forming the polynomial quotient V(x) / G(x)

Fig. 13.cced Encoding and decoding hardware for a cyclic code.

Cyclic codes, as defined so far, are not separable, thus potentially slowing down the
delivery of data due to the decoding latency. Here is one way to construct separable
cyclic codes that leads to the cyclic redundancy check (CRC) codes in common use for
disk memories. Given the degree-(k – 1) polynomial D(x) associated with the k-bit data
word and the degree-r generator polynomial G(x), the degree-(n – 1) polynomial
corresponding to the n-bit encoded word is:

V(x) = [x^r D(x) mod G(x)] + x^r D(x) (13.crc)

It is easy to see that V(x) computed from equation 13.crc is divisible by G(x). Because the
remainder polynomial in the square brackets is at most of degree r – 1, the data part D(x)
remains separate from the check component.


Example 13.CRC1: Separable cyclic code Consider a CRC code with 4 data bits and the
generator polynomial G(x) = 1 + x + x^3. Form the CRC codeword associated with the data word
1001 and check your work by verifying the divisibility of the resulting polynomial V(x) by G(x).

Solution: Dividing x^3 D(x) = x^3 + x^6 by G(x), we get the remainder x + x^2. The code-vector
polynomial is V(x) = [x^3 D(x) mod G(x)] + x^3 D(x) = x + x^2 + x^3 + x^6, corresponding to the codeword
0111001. Rewriting V(x) as x(1 + x + x^3) + x^3(1 + x + x^3) confirms that it is divisible by G(x).
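Equation 13.crc can be checked mechanically. This Python sketch (our naming; bit i of an integer holds the coefficient of x^i) reproduces Example 13.CRC1:

```python
def gf2_mod(a, g):
    """Remainder of GF(2) polynomial a modulo g (long division by XOR)."""
    dg = g.bit_length() - 1
    while a and a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def crc_encode(d, g):
    """Separable cyclic encoding: V(x) = [x^r D(x) mod G(x)] + x^r D(x)."""
    r = g.bit_length() - 1       # degree of the generator polynomial
    shifted = d << r             # x^r D(x)
    return gf2_mod(shifted, g) ^ shifted

V = crc_encode(0b1001, 0b1011)   # data 1001, G(x) = 1 + x + x^3
print(bin(V))                    # 0b1001110: codeword 0111001, low degree first
print(gf2_mod(V, 0b1011))        # 0, so V(x) is divisible by G(x)
```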

Example 13.CRC2: Even parity as CRC code Show that the use of a single even parity bit is a
special case of CRC and derive its generator polynomial G(x).

Solution: Let us take an example data word 1011, with D(x) = 1 + x^2 + x^3 and its even-parity
coded version 11011 with V(x) = 1 + x + x^3 + x^4 (note that the parity bit precedes the data). Given
that the remainder D(x) mod G(x) is a single bit, the generator polynomial, if it exists, must be
G(x) = 1 + x. We can easily verify that (1 + x^2 + x^3) mod (1 + x) = 1. For a general proof, we note
that x^i mod (1 + x) = 1 for all values of i. Therefore, the number of 1s in the data word, which is
the same as the number of terms in the polynomial D(x), determines the number of 1s that must be
added (modulo 2) to form the check bit. The latter process coincides with the definition of an even
parity-check bit.


13.5 Arithmetic Error-Detecting Codes

Arithmetic error-detecting codes came about because of the unsuitability of conventional


codes in dealing with arithmetic errors. There are two aspects to the problem. First,
ordinary coded numbers are not closed under arithmetic operations, leading to the need
for removing the redundancy via decoding, performing arithmetic, and finally encoding
the result. Meanwhile, data remains unprotected between the decoding and encoding
steps (Fig. 13.3). Second, a logic fault in arithmetic circuits can lead to errors that are more
extensive than the single-bit, double-bit, burst, and other errors targeted by conventional
codes. Let us consider an example of the latter problem.

Example 13.arith1: Errors caused by single faults in arithmetic circuits Show that a single
logic fault in an unsigned binary adder can potentially flip an arbitrary number of bits in the sum.

Solution: Consider the addition of two 16-bit unsigned binary numbers 0010 0111 0010 0001 and
0101 1000 1101 0011, whose correct sum is 0111 1111 1111 0100. Indexing the bit positions from
the right end beginning with 0, we note that during this particular addition, the carry signal going
from position 3 to position 4 is 0. Now suppose that the output of the carry circuit in that position
is s-a-1. The erroneous carry of 1 will change the output to 1000 0000 0000 0100, flipping 12 bits
in the process and changing the numerical value of the output by 16.

Characterization of arithmetic errors is better done via the value added to or subtracted
from a correct result, rather than by the number of bits affected. When the amount added
or subtracted is a power of 2, as in Example 13.arith1, we say that we have an arithmetic
error of weight 1, or a single arithmetic error. When the amount added or subtracted can
be expressed as the sum or difference of two different powers of 2, we say we have an
arithmetic error of weight 2, or a double arithmetic error.

Example 13.arith2: Arithmetic weight of error Consider the correct sum of two unsigned
binary numbers to be 0111 1111 1111 0100.
a. Characterize the arithmetic errors that transform the sum to 1000 0000 0000 0100.
b. Repeat part a for the transformation to 0110 0000 0000 0100.

Solution:
a. The error is +16 = +2^4, thus it is characterized as a positive weight-1 or single error.
b. The error is –8176 = –2^13 + 2^4, thus it is characterized as a negative weight-2 or double error.


Arithmetic error-detecting codes are characterized by the arithmetic weights of detectable


errors. Furthermore, they allow arithmetic operations to be performed directly on coded
operands, either by the usual arithmetic algorithms or by specially modified versions that
are not much more complex. We will discuss two classes of such codes in the remainder
of this section: product codes and residue codes.

Product or AN codes form a class of arithmetic error codes in which a number N is


encoded by the number AN, where A is the code or check constant, chosen according to
the discussion that follows. Detecting all weight-1 arithmetic errors requires only that A
be odd. Particularly suitable odd values for A include numbers of the form 2^a – 1, because
such low-cost check constants make the encoding and decoding processes simple.
Arithmetic errors of weight 2 or more may go undetected. For example, the error
magnitude 32 736 = 2^15 – 2^5 would be undetectable with A = 3, 11, or 31.

For AN codes, encoding consists of multiplication by A, an operation that becomes shift-subtract
for A = 2^a – 1. Error detection consists of verifying divisibility by A. Decoding
consists of division by A. Both of the latter operations can be performed quite efficiently
a bits at a time when A = 2^a – 1. This is because, given (2^a – 1)x, we can find x from:

x = 2^a x – (2^a – 1)x (13.5.div)

Even though both x and 2^a x are unknown, we do know that 2^a x ends with a 0s. Thus,
equation 13.5.div allows us to compute the rightmost a bits of x, which become the next a
bits of 2^a x. This a-bit-at-a-time process continues until all bits of x have been derived.

Example 13.arith3: Decoding the 15x code What 16-bit unsigned binary number is represented
by 0111 0111 1111 0100 1100 in the low-cost product code 15x?

Solution: We know that 16x is of the form ●●●● ●●●● ●●●● ●●●● 0000. We use equation
13.5.div to find the rightmost 4 bits of x as 0000 – 1100 = 0100, remembering the borrow-out in
position 3 for the next step. We now know that 16x is of the form ●●●● ●●●● ●●●● 0100 0000.
In the next step, we find 0100 – 0100 (–1) = (–1) 1111, where a parenthesized –1 indicates
borrow-in/out. We now know 16x to be of the form ●●●● ●●●● 1111 0100 0000. The next 4
digits of x are found thus: 1111 – 1111 (–1) = (–1) 1111. We now know 16x to be of the form
●●●● 1111 1111 0100 0000. The final 4 digits of x are found thus: 1111 – 0111 (–1) = 0111.
Putting all the pieces together, we have the answer x = 0111 1111 1111 0100.
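The a-bit-at-a-time subtraction in the example mechanizes naturally. Below is a sketch (our naming) that recovers x from (2^a – 1)x without dividing, by repeatedly using the fact that chunk i of 2^a·x equals chunk i – 1 of x:

```python
def an_decode(coded, a, k):
    """Recover the k-bit x from (2**a - 1)*x without division, using
    x = 2**a * x - (2**a - 1)*x, computed a bits at a time from the right."""
    mask = (1 << a) - 1
    x = 0
    borrow = 0
    y_chunk = 0                        # chunk of 2^a*x at this position (low chunk is 0)
    for i in range(k // a):
        c = (coded >> (a * i)) & mask
        diff = y_chunk - c - borrow    # subtract chunkwise, tracking borrows
        borrow = 1 if diff < 0 else 0
        x |= (diff & mask) << (a * i)
        y_chunk = diff & mask          # this chunk of x is the next chunk of 2^a*x
    return x

coded = 0b01110111111101001100       # 15x from Example 13.arith3
x = an_decode(coded, 4, 16)
print(hex(x))                        # 0x7ff4, i.e., 0111 1111 1111 0100
print(15 * x == coded)               # True
```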


The arithmetic operations of addition and subtraction on product-coded operands are


straightforward, because they are done exactly as ordinary addition and subtraction.

Ax ± Ay = A(x ± y) (13.5.addsub)

Multiplication requires a final correction via division by A, given that Ax × Ay = A^2 xy.
Even though it is possible to perform the division Ax / Ay via premultiplying Ax by A,
performing the ordinary division A^2 x / Ay, and doing a final correction to ensure that the
remainder is of correct sign and within the acceptable range, the added complexity may
be unacceptable in many applications. Calculating the square root of Ax via applying a
standard square-rooting algorithm to A^2 x encounters problems similar to division.

Note that product codes are nonseparable, because data and redundant check information
are intermixed. We next consider a class of separable arithmetic error-detecting codes.

A residue error-detecting code represents an integer N by the pair (N, C = |N|A), where
|N|A is the residue of N modulo the chosen check modulus A. Because the data part N and
the check part C are separate, decoding is trivial. To encode a number N, we must
compute |N|A and attach it to N. This is quite easy for a low-cost modulus A = 2^a – 1:
simply divide the number N into a-bit chunks, beginning at its right end, and add the
chunks together modulo A. Modulo-(2^a – 1) addition is just like ordinary a-bit addition,
with the adder's carry-out line connected to its carry-in line, a configuration known as
end-around carry.

Example 13.mod15: Computing low-cost residues Compute the mod-15 residue of the 16-bit
unsigned binary number 0101 1101 1010 1110.

Solution: The mod-15 residue of x is obtained thus: 0101 + 1101 + 1010 + 1110 mod 15.
Beginning at the right end, the first addition produces an outgoing carry, leading to the mod-15
sum 1110 + 1010 – 10000 + 1 = 1001. Next, we get 1001 + 1101 – 10000 + 1 = 0111. Finally, in
the last step, we get: 0111 + 0101 = 1100. Note that the end-around carry is 1 in the first two steps.
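End-around-carry residue computation is equally simple in software. The sketch below (our naming) folds a number into a-bit chunks and reabsorbs the carries, reproducing the mod-15 result of the example:

```python
def low_cost_residue(n, a):
    """Residue of n modulo 2**a - 1 via chunkwise addition with end-around carry."""
    mask = (1 << a) - 1
    s = 0
    while n:
        s += n & mask               # add the next a-bit chunk
        n >>= a
    while s > mask:
        s = (s & mask) + (s >> a)   # fold the carry back in (end-around carry)
    return 0 if s == mask else s    # the all-1s chunk also represents 0 mod 2^a - 1

print(low_cost_residue(0b0101110110101110, 4))   # 12, i.e., 1100
```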

For an arbitrary modulus A, we can use table lookup to compute |N|A. Suppose that the
number N can be divided into m chunks of size b bits. Then, we need a table with a total
of m × 2^b entries, 2^b per chunk. For each chunk, we consult the corresponding part of the


table, which stores the mod-A residues of all possible b-bit chunks in that position, adding
all the entries thus read out, modulo A.

An inverse residue code uses the check part D = A – C instead of C = |N|A. This change is
motivated by the fact that certain unidirectional errors that are prevalent in VLSI circuits
tend to change the values of N and C in the same direction, raising the possibility that the
error will go undetected. For example, if the least-significant bits of N and C both change
from 0 to 1, the values of both N and C will be increased by 1, potentially causing a match
between the residue of the erroneous N value and the erroneous C value. Inverse residue
encoding eliminates this possibility, given that unidirectional errors will affect the N and
D parts in opposite directions.


13.6 Other Error-Detecting Codes

The theory of error-detecting codes is quite extensive. One can devise virtually an infinite
number of different error-detecting codes and entire textbooks have been devoted to the
study of such codes. Our study in the previous sections of this chapter was necessarily
limited to codes that have been found most useful in the design of dependable computer
systems. This section is devoted to the introduction and very brief discussion of a number
of other error-detecting codes, in an attempt to fill in the gaps.

Erasure errors cause some symbols to become unreadable, effectively reducing the
length of a codeword from n to m, when there are n – m erasures. The code ensures that
the original k data symbols are recoverable from the m available symbols. When m = k,
the erasure code is optimal, that is, any k symbols can be used to reconstruct the n-symbol
codeword and thus the k-symbol data word. Near-optimal erasure codes require (1 + ε)k
symbols to recover the original data, where ε > 0. Examples of near-optimal erasure
codes include Tornado codes and low-density parity-check (LDPC) codes.

Given that 8-bit bytes are important units of data representation, storage, and
transmission in modern digital systems, codes that use bytes as their symbols are quite
useful for computing applications. As an example, a single-byte-error-correcting, double-
byte-error-detecting code [Kane82] may be contemplated.

Most codes are designed to deal with random errors, with particular distributions across a
codeword (say, uniform distribution). In certain situations, such as when bits are
transmitted serially or when a surface scratch on a disk affects a small disk area, the
possibility exists that multiple adjacent bits are adversely affected by an undesirable
event. Such errors, referred to as "burst errors," are characterized by their extent or length.
For example, a single-bit-error-correcting, 6-bit-burst-error-detecting code might be of
interest in such contexts. Such a code would correct a single random error, while
providing safety against a modest-length burst error.

In this chapter, we have seen error-detecting codes applied at the bit-string or word level.
It is also possible to apply coding at higher levels of abstraction. Error-detecting codes
that are applicable at the data structure level (robust data structures) or at the application
level (algorithm-based error tolerance) will be discussed in Chapter 20.


Problems

13.1 Checksum codes


a. Describe the checksum scheme used in the (9 + 1)-digit international serial book number (ISBN)
code and state its error detecting capabilities.
b. Repeat part a for the (12 + 1)-digit ISBN code that replaced the older code in 2005.
c. Repeat part a for the (8 + 1)-digit bank identification or routing number code that appears to the
left of account number on bank checks.

13.2 Detecting substitution and transposition errors


Consider the design of an n-symbol q-ary code that detects single substitution or transposition errors.
a. Show that for q = 2, the code can have 2^n/3 codewords.
b. Show that for q ≥ 3, the code can have q^(n–1) codewords.
c. Prove that the results of parts a and b are optimal [Abde98].

13.3 Unidirectional error-detecting codes


An (n, t) Borden code is an n-bit code composed of ⌊n/2⌋-out-of-n codewords and all codewords of the form
m-out-of-n, where m = ⌊n/2⌋ mod (t + 1). Show that the (n, t) Borden code is an optimal t-unidirectional
error-detecting code [Pies96].

13.4 Binary cyclic codes


Prove or disprove the following for a binary cyclic code.
a. If x + 1 is a factor of G(x), the code cannot have codewords of odd weight.
b. If V(x) is the polynomial associated with some codeword, then so is x^i V(x) mod (1 + x^n) for any i.
c. If the generator polynomial is divisible by x + 1, then the all-1s vector is a codeword [Pete72].

13.5 Error detection in UPC-A


a. Explain why all one-digit errors are caught by the UPC-A coding scheme based on a mod-10 checksum
on 11 data digits and 1 check digit, using the weight vector 3 1 3 1 3 1 3 1 3 1 3 1.
b. Explain why not all transposition errors (adjacent digits switching positions) are caught.

13.6 Error detection for performance enhancement


Read the article [Blaa09] and discuss, in one typewritten page, how error detection methods can be put to
other uses, such as speed enhancement and energy economy.

13.7 Parity generator or check circuit


Present the design of a logic circuit that accepts 8 data bits, plus 1 odd or even parity bit, and either checks
the parity or generates a parity bit.

13.8 The 10-digit ISBN code


In the old, 10-digit ISBN code, the 9-digit book identifier x1x2x3x4x5x6x7x8x9 was augmented with a 10th
check digit x10, derived as (11 – W) mod 11, where W is the modulo-11 weighted sum ∑1≤i≤9 (11 – i)xi.
Because the value of x10 is in the range [0, 10], the check digit is written as X when the residue is 10.
a. Provide an algorithm to check the validity of the 10-digit ISBN code x1x2x3x4x5x6x7x8x9x10.
b. Show that the code detects all single-digit substitution errors.
c. Show that the code detects all single transposition errors.
d. Since a purely numerical code would be more convenient, it is appealing to replace the digit value
X, when it occurs, with 0. How does this change affect the code’s capability in detecting
substitution errors?
e. Repeat part d for transposition errors.

13.9 Binary cyclic codes


A check polynomial H(x) is a polynomial such that V(x)H(x) = 0 mod (x^n + 1) for any codeword polynomial V(x).
a. Prove that each cyclic code has a check polynomial and show how to derive it.
b. Sketch the design of a hardware unit that can check the validity of a data word.

13.10 Shortening a cyclic code


Starting with a cyclic code of redundancy r, consider shortening the code in its first or last s bits by
choosing all codewords that contain 0s in those positions and then deleting the positions.
a. Show that the resulting code is not a cyclic code.
b. Show that the resulting code still detects all bursts of length r, provided they don’t wrap around.
c. Provide an example of an undetectable burst error of length r that wraps around.

13.11 An augmented Berger code


In a code with a 63-bit data part, we attach both the number of 0s and the number of 1s in the data as check
bits (two 6-bit checks), getting a 75-bit code. Assess this new code with respect to redundancy, ease of
encoding and decoding, and error detection/correction capability, comparing it in particular to Berger code.

13.12 A modified Berger code


In a Berger code, when the number of data bits is k = 2^a, we need a check part of a + 1 bits to represent
counts of 0s in the range [0, 2^a], leading to a code length of n = k + a + 1. Discuss the pros and cons of the
following modified Berger code and deduce whether the modification is worthwhile. Counts of 0s in the
range [0, 2^a – 1] are represented as usual as a-bit binary numbers. The count 2^a is represented as 0, that is,
the counts are considered to be modulo-2^a.


References and Further Readings


[Abde98] Abdel-Ghaffar, K. A. S., “Detecting Substitutions and Transpositions of Characters,”
Computer J., Vol. 41, No. 4, pp. 238-242, 1998.
[Aviz71] Avizienis, A., “Arithmetic Error Codes: Cost and Effectiveness Studies for Application in
Digital System Design,” IEEE Trans. Computers, pp. 1322-1331.
[Aviz73] Avizienis, A., “Arithmetic Algorithms for Error-Coded Operands,” IEEE Trans. Computers,
Vol. 22, No. 6, pp. 567-572, June 1973.
[Berg61] Berger, J. M., “A Note on Error Detection Codes for Asymmetric Channels,” Information and
Control, Vol. 4, pp. 68-73, March 1961.
[Blaa09] Blaauw, D. and S. Das, “CPU, Heal Thyself,” IEEE Spectrum, Vol. 46, No. 8, pp. 41-43 and
52-56, August 2009.
[Chen06] Cheng, C. and K. K. Parhi, “High-Speed Parallel CRC Implementation Based on Unfolding,
Pipelining, and Retiming,” IEEE Trans. Circuits and Systems II, Vol. 53, No. 10, pp. 1017-
1021, October 2006.
[Das09] Das, S., et al., “RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance,”
IEEE J. Solid-State Circuits, Vol. 44, No. 1, pp. 32-48, January 2009.
[Garn66] Garner, H. L., “Error Codes for Arithmetic Operations,” IEEE Trans. Electronic Computers,
[Grib16] Gribaudo, M., M. Iacono, and D. Manini, “Improving Reliability and Performances in Large
Scale Distributed Applications with Erasure Codes and Replication,” Future Generation
Computer Systems, Vol. 56, pp. 773-782, March 2016.
[Kane82] Kaneda, S. and E. Fujiwara, “Single Byte Error Correcting Double Byte Error Detecting Codes
for Memory Systems,” IEEE Trans. Computers, Vol. 31, No. 7, pp. 596-602, July 1982.
[Knut86] Knuth, D. E., “Efficient Balanced Codes,” IEEE Trans. Information Theory, Vol. 32, No. 1, pp.
51-53, January 1986.
[Parh78] Parhami, B. and A. Avizienis, “Detection of Storage Errors in Mass Memories Using
Arithmetic Error Codes,” IEEE Trans. Computers, Vol. 27, No. 4, pp. 302-308, April 1978.
[Pete59] Peterson, W. W. and M. O. Rabin, “On Codes for Checking Logical Operations,” IBM J.,
[Pete72] Peterson, W. W. and E. J. Weldon Jr., Error-Correcting Codes, MIT Press, 2nd ed., 1972.
[Pies96] Piestrak, S. J., “Design of Self-Testing Checkers for Borden Codes,” IEEE Trans. Computers,
pp. 461-469, April 1996.
[Raab16] Raab, P., S. Kramer, and J. Mottok, “Reliability of Data Processing and Fault Compensation in
Unreliable Arithmetic Processors,” Microprocessors and Microsystems, Vol. 40, pp. 102-112,
February 2016.
[Rao74] Rao, T. R. N., Error Codes for Arithmetic Processors, Academic Press, 1974.
[Rao89] Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice Hall,
1989.
[Wake75] Wakerly, J. F., “Detection of Unidirectional Multiple Errors Using Low-Cost Arithmetic Error
Codes,” IEEE Trans. Computers
[Wake78] Wakerly, J. F., Error Detecting Codes, Self-Checking Circuits and Applications, North-
Holland, 1978.


14 Error Correction
“Error is to truth as sleep is to waking. As though refreshed, one
returns from erring to the path of truth.”
Johann Wolfgang von Goethe, Wisdom and
Experience

“Don’t argue for other people’s weaknesses. Don’t argue for your
own. When you make a mistake, admit it, correct it, and learn
from it / immediately.”
Stephen R. Covey

Topics in This Chapter


14.1. Basics of Error Correction
14.2. Hamming Codes
14.3. Linear Codes
14.4. Reed-Solomon and BCH Codes
14.5. Arithmetic Error-Correcting Codes
14.6. Other Error-Correcting Codes

Automatic error correction may be used to prevent distorted subsystem states


from adversely affecting the rest of the system, in much the same way that fault
masking is used to hinder the fault-to-error transition. Avoiding data corruption,
and its service- or result-level consequences, requires prompt error correction;
otherwise, errors might lead to malfunctions, thus moving the computation one
step closer to failure. To devise cost-effective error correction schemes, we need a
good understanding of the types of errors that might be encountered and the cost
and correction capabilities of various informational redundancy methods; hence,
our discussion of error-correcting codes in this chapter.


14.1 Basics of Error Correction

Instead of detecting errors and performing some sort of recovery action, such as
retransmission or recomputation, one may aim to provide sufficient redundancy in the
code so as to correct the most common errors quickly. In contrast to the backward
recovery methods associated with error detection followed by additional actions, error-
correcting codes are said to allow forward recovery. In practice, we may use an error
correction scheme for simple, common errors in conjunction with error detection for rarer
or more extensive error patterns. A Hamming single-error-correcting/double-error-
detecting (SEC/DED) code provides a good example.

Error-correcting codes were also developed for communication over noisy channels and
were later adopted for use in computer systems. Notationally, we proceed as in the case
of error-detecting codes, discussed in Chapter 13. In other words, we assume that k-bit
information words are encoded as n-bit codewords, n > k. Changing some of the bits in a
valid codeword produces an invalid codeword, thus leading to detection, and with
appropriate provisions, to correction. The ratio r/k, where r = n – k is the number of
redundant bits, indicates the extent of redundancy or the coding cost. Hardware
complexity of error correction is another measure of cost. Time complexity of the error
correction algorithm is a third measure of cost, which we often ignore, because we
expect correction events to be very rare.

Figure 14.1a depicts the data space of size 2^k, the code space of size 2^n, and the set of
2^n – 2^k invalid codewords, which, when encountered, indicate the presence of errors.
Conceptually, the simplest redundant representation for error correction consists of
replication of data. For example, triplicating a bit b to get the corresponding codeword
bbb allows us to correct an error in 1 of the 3 bits. Now, if the valid codeword 000
changes to 010 due to a single-bit error, we can correct the error, given that the erroneous
value is closer to 000 than to 111. We saw in Chapter 13 that the same triplication
scheme can be used for the detection of single-bit and double-bit errors in lieu of
correction of a single-bit error. Of course, triplication does not have to be applied at the
bit level. A data bit-vector x of length k can be triplicated to become the 3k-bit codeword
xxx. Referring to Fig. 14.1b, we note that if the voter is triplicated to produce three copies
of the result y, the modified circuit would supply the coded yyy version of y, which can
then be used as input to other circuits.
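As an illustrative sketch (not part of the text), triplication and its majority-vote correction can be expressed in a few lines of Python:

```python
def encode_triplicate(bits):
    """Encode the k-bit data vector x as the 3k-bit codeword xxx."""
    return bits * 3

def decode_triplicate(word):
    """Majority-vote each position across the three copies; any
    single-bit error among a position's three copies is corrected."""
    k = len(word) // 3
    copies = (word[:k], word[k:2*k], word[2*k:])
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(*copies)]

cw = encode_triplicate([1, 0, 1])
cw[4] ^= 1                        # inject a single-bit error
print(decode_triplicate(cw))      # -> [1, 0, 1]
```

Note that voting is per bit position, so one error in each of the k positions can be tolerated simultaneously, at the price of 200% redundancy.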


(a) Data and code spaces (b) Triplication as coding

Fig. 14.1 Data and code spaces for error-correction coding in general
(sizes 2^k and 2^n) and for triplication (sizes 2^k and 2^(3k)).

The high-redundancy triplication code corresponding to the voting scheme of Fig. 14.1b
is conceptually quite simple. However, we would like to have lower-redundancy codes
that offer similar protection against potential errors. To correct a single-bit error in an n-
bit (non)codeword with r = n – k bits of redundancy, we must have 2^r > k + r, which
dictates slightly more than log2 k bits of redundancy. One of our challenges in this
chapter is to determine whether we can attain this lower bound and come up with
highly efficient single-error-correcting codes and, if not, whether we can get close to the
bound. More generally, the challenge of designing efficient error-correcting codes with
different capabilities is what will drive us in the rest of this chapter. Let us begin with an
example that achieves single-error correction and double-error detection with a
redundancy of r = 2√k + 1 bits.

Example 14.1: Two-dimensional error-correcting code A conceptually simple error-
correcting code is as follows. Let the width k of our data word be a perfect square and arrange the
bits of a given data word in a 2D square matrix (in row-major order, say). Now attach a parity bit
to each row, yielding a new bit-matrix with √k rows and √k + 1 columns. Then attach a parity bit
to each column of the latter matrix, ending up with a (√k + 1) × (√k + 1) bit-matrix. Show that
this coding scheme is capable of correcting all single-bit errors and detecting all double-bit errors.
Provide an example of a triple-bit error that goes undetected.

Solution: A single bit-flip will cause the parity check to be violated for exactly one row and one
column, thus pinpointing the location of the erroneous bit. A double-bit error is detectable because
it will lead to the violation of parity checks for 2 rows (when the errors are in the same column),
2 columns, or 2 rows and 2 columns. A pattern of 3 errors may be such that there are 2 errors in
row i and 2 errors in column j (one of them shared), leading to no parity check violation.
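The scheme of Example 14.1 is easy to simulate; the sketch below (illustrative, assuming even parity and k a perfect square) corrects a single-bit error by intersecting the violated row and column checks:

```python
def encode_2d(bits):
    """Arrange k = m*m data bits row-major, append an even-parity bit
    to each row, then to each column; yields an (m+1) x (m+1) matrix."""
    m = int(len(bits) ** 0.5)
    rows = [bits[i*m:(i+1)*m] for i in range(m)]
    rows = [r + [sum(r) % 2] for r in rows]              # row parities
    rows.append([sum(col) % 2 for col in zip(*rows)])    # column parities
    return rows

def correct_2d(mat):
    """Flip the bit at the intersection of the (unique) violated
    row and column parity checks, if any."""
    bad_r = [i for i, r in enumerate(mat) if sum(r) % 2]
    bad_c = [j for j, c in enumerate(zip(*mat)) if sum(c) % 2]
    if len(bad_r) == 1 and len(bad_c) == 1:
        mat[bad_r[0]][bad_c[0]] ^= 1
    return mat

good = encode_2d([1, 0, 1, 1, 1, 0, 0, 1, 0])   # k = 9, m = 3
bad = [row[:] for row in good]
bad[1][2] ^= 1                     # inject a single-bit error
print(correct_2d(bad) == good)     # -> True
```

A double-bit error leaves two violated rows and/or columns, so the single-intersection test above fails and the error is only detected, matching the solution's argument.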


The criteria used to judge error-correcting codes are quite similar to those used for error-
detecting codes. These include redundancy (r redundant bits used for k information bits,
for an overhead of r/k), encoding circuit/time complexity, decoding circuit/time
complexity (nonexistent for separable codes), error correction capability (single, double,
b-bit burst, byte, unidirectional), and possible closure under operations of interest.
Greater correction capability generally entails more redundancy. To correct c errors, a
minimum code distance of 2c + 1 is necessary. Codes may also have a combination of
correction and detection capabilities. To correct c errors and additionally detect d errors,
where d > c, a minimum code distance of c + d + 1 is needed. For example, a SEC/DED
code cannot have a distance of less than 4.

The notion of an adequate Hamming distance in a code allowing error correction is


illustrated in Fig. 14.dist, though the reader must bear in mind that the 2D representation
of codewords and their distances is somewhat inaccurate. Let each circle in Fig. 14.dist
represent a word, with the red circles labeled c1, c2, and c3 representing codewords and all
others corresponding to noncodewords. The Hamming distance between words is
represented by the minimal number of horizontal, vertical, or diagonal steps on the grid
needed to move from one word to the other. It is easy to see that the particular code in
Fig. 14.dist has minimum distance 3, given that we can get from c3 to c1 or c2 in 3 steps
(1 vertical move and 2 diagonal moves).


Fig. 14.dist Illustration of how a code with Hamming distance of 3 might


allow the correction of single-bit errors.


For each of the three codewords, distance-1 and distance-2 words from it are highlighted
by drawing a dashed box through them. For each codeword, there are 8 distance-1 words
and 16 distance-2 words. We note that distance-1 words, colored yellow, are distinct for
each of the three codewords. Thus, if a single-bit error transforms c1 to e1, say, we know
that the correct word is c1, because e1 is closer to c1 than to the other two codewords.
Certain distance-2 noncodewords, such as e2, are similarly closest to a particular valid
codeword, which may allow their correction. However, as seen from the example of the
noncodeword e3, which is at distance 2 from all three valid codewords, some double-bit
errors may not be correctable in a distance-3 code.


14.2 Hamming Codes

Hamming codes take their name from Richard W. Hamming, an American scientist who
is rightly recognized as their inventor. However, while Hamming’s publication of his idea
was delayed by Bell Laboratory’s legal department as part of their patenting strategy,
Claude E. Shannon independently discovered and published the code in his seminal 1949
book, The Mathematical Theory of Communication [Nahi13].

We begin our discussion of Hamming codes with the simplest possible example: a (7, 4)
single-error-correcting (SEC) code, with n = 7 total bits, k = 4 data bits, and r = 3
redundant parity-check bits. As depicted in Fig. 14.H74-a, each parity bit is associated
with 3 data bits and its value is chosen to make the group of 4 bits have even parity.
Thus, the data word 1001 is encoded as the codeword 1001|101. The evenness of parity
for pi’s group is checked by computing the syndrome si and verifying that it is 0. When
all three syndromes are 0s, the word is error-free and no correction is needed. When the
syndrome vector s2s1s0 contains one or more 1s, the combination of values points to a
unique bit that is in error and must be flipped to correct the 7-bit word. Figure 14.H74-b
shows the correspondence between the syndrome vector and the bit that is in error.
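With the parity equations p1 = d1 ⊕ d2 ⊕ d3, p2 = d1 ⊕ d3 ⊕ d4, and p3 = d2 ⊕ d3 ⊕ d4 (one common assignment, consistent with 1001 being encoded as 1001|101 above; the exact groups of Fig. 14.H74 may differ), the encode/correct cycle can be sketched as follows:

```python
def hamming74_encode(d):
    """Encode [d1, d2, d3, d4] as d1 d2 d3 d4 p1 p2 p3 (even parity)."""
    d1, d2, d3, d4 = d
    return d + [d1 ^ d2 ^ d3, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]

def hamming74_correct(w):
    """Recompute the three parities; a nonzero syndrome pattern
    identifies the single erroneous bit, which is then flipped."""
    d1, d2, d3, d4, p1, p2, p3 = w
    s = (p1 ^ d1 ^ d2 ^ d3, p2 ^ d1 ^ d3 ^ d4, p3 ^ d2 ^ d3 ^ d4)
    where = {(1, 1, 0): 0, (1, 0, 1): 1, (1, 1, 1): 2, (0, 1, 1): 3,
             (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}
    w = list(w)
    if s in where:                 # s == (0, 0, 0) means no error
        w[where[s]] ^= 1
    return w

cw = hamming74_encode([1, 0, 0, 1])
print(cw)                             # -> [1, 0, 0, 1, 1, 0, 1]
bad = cw.copy(); bad[2] ^= 1
print(hamming74_correct(bad) == cw)   # -> True
```

Each data bit belongs to at least two parity groups and each check bit to exactly one, so the seven possible single-bit errors produce seven distinct nonzero syndromes.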

Encoding and decoding circuitry for the Hamming (7, 4) SEC code of Fig. 14.H74 are
shown in Fig. 14.Hed.

The redundancy of 3 bits for 4 data bits (3/7 ≈ 43%) in the (7, 4) Hamming code of Fig.
14.H74 is unimpressive, but the situation gets better with more data bits. We can
construct (15, 11), (31, 26), (63, 57), (127, 120), (255, 247), (511, 502), and (1023, 1013)
SEC Hamming codes, with the last one on the list having a redundancy of only 10/1023,
which is less than 1%. The general pattern is having 2^r – 1 total bits with r check bits. In
this way, the r syndrome bits can assume 2^r possible values, one of which corresponds to
the no-error case and the remaining 2^r – 1 are in one-to-one correspondence with the
code's 2^r – 1 bits. It is easy to see that the redundancy is in general r/(2^r – 1), which
approaches 0 for very wide codes.
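The progression of code sizes just listed is easy to tabulate (an illustrative one-liner per code):

```python
# SEC Hamming codes: n = 2^r - 1 total bits for r check bits
for r in range(3, 11):
    n = 2**r - 1
    print(f"({n}, {n - r}): redundancy {r}/{n} = {r / n:.2%}")
```

The r = 10 line confirms the sub-1% redundancy quoted for the (1023, 1013) code.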


(a) Parity bits and syndromes (b) Error correction table

Fig. 14.H74 A Hamming (7, 4) SEC code, with n = 7 total bits, k = 4 data
bits, and r = 3 redundant parity-check bits.

Fig. 14.Hed Encoder and decoder (composed of syndrome generator


and error corrector) circuits for the Hamming (7, 4) SEC
code of Fig. 14.H74.


Fig. 14.pcm Matrix formulation of Hamming SEC code.

We next discuss the structure of the parity check matrix H for an (n, k) Hamming code.
As seen in Fig. 14.pcm, the columns of H hold all the possible bit patterns of length n – k,
except the all-0s pattern. Hence, n = 2^(n–k) – 1 is satisfied for any Hamming SEC code. The
last n – k columns of the parity check matrix form the identity matrix, given that (by
definition) each parity bit is included in only one parity set. The error syndrome s of
length r is derived by multiplying the r × n parity check matrix H by the n-bit received
word, where matrix-vector multiplication is done by using the AND operation instead of
multiplication and the XOR operation instead of addition.

By rearranging the columns of H so that they appear in ascending order of the binary
numbers they represent (and, of course, making the corresponding change in the
codeword), we can make the syndrome vector correspond to the column number directly
(Fig. 14.rpcm-a), allowing us to use the simple error corrector depicted in Fig. 14.rpcm-b.

(a) Rearranged parity check matrix (b) Error correction

Fig. 14.rpcm Matrix formulation of Hamming SEC code, with rearranged columns.


(a) Rearranged parity check matrix (b) Error correction

Fig. 14.gpcm Matrix formulation of a general Hamming SEC code.

The parity check matrix and the error corrector for a general Hamming code are depicted
in Fig. 14.gpcm.

Associated with each Hamming code is an n × k generator matrix G such that the product
of G by the k-element data vector is the n-element code vector. For example:

| 1 0 0 0 |              | d1 |
| 0 1 0 0 |    | d1 |    | d2 |
| 0 0 1 0 |    | d2 |    | d3 |
| 0 0 0 1 | ×  | d3 |  = | d4 |     (14.2.gen)
| 1 1 1 0 |    | d4 |    | p1 |
| 1 0 1 1 |              | p2 |
| 0 1 1 1 |              | p3 |

Recall that matrix-vector multiplication is done with AND/XOR instead of multiply/add.
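This AND/XOR matrix-vector product can be sketched in a few lines of Python (an illustrative sketch; G holds the generator rows of Eq. (14.2.gen), and H is the corresponding parity check matrix):

```python
G = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1],
     [1,1,1,0], [1,0,1,1], [0,1,1,1]]     # 7x4 generator matrix
H = [[1,1,1,0,1,0,0],
     [1,0,1,1,0,1,0],
     [0,1,1,1,0,0,1]]                     # 3x7 parity check matrix

def mod2_matvec(M, v):
    """Matrix-vector product with AND for multiply and XOR for add."""
    out = []
    for row in M:
        s = 0
        for m, x in zip(row, v):
            s ^= m & x
        out.append(s)
    return out

u = mod2_matvec(G, [1, 0, 0, 1])   # encode the data word 1001
print(u)                           # -> [1, 0, 0, 1, 1, 0, 1]
print(mod2_matvec(H, u))           # -> [0, 0, 0]: a valid codeword
```

Since every codeword Gd satisfies H(Gd) = 0, a nonzero syndrome flags an invalid word.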

To convert a Hamming SEC code into a SEC-DED code, we add a row of all 1s and a
column corresponding to the extra parity bit pr to the parity check matrix, as shown in
Fig. 14.secded-a. [Elaborate further on the check matrix, the effects of a double-bit error,
and the error corrector in Fig. 14.secded-b.]


(a) Augmented parity check matrix (b) Error correction

Fig. 14.secded Hamming SEC-DED code.


14.3 Linear Codes

Hamming codes are examples of linear codes, but linear codes can be defined in other
ways too. A code is linear iff, given any two codewords u and v, the bit-vector w = u ⊕ v
is also a codeword. Throughout the discussion that follows, data and code vectors are
considered to be column vectors. A linear code can be characterized by its generator
matrix or parity check matrix. The n × k (n rows, k columns) generator matrix G, when
multiplied by a k-vector d, representing the data, produces the n-bit coded version u of
the data

u=Gd (14.3.enc)

where matrix-vector multiplication for this encoding process is performed by using


modulo-2 addition (XOR) in lieu of standard addition. Checking of codewords is
performed by multiplying the (n – k) × n parity check matrix H by the potentially
erroneous n-bit codeword v

s=Hv (14.3.chk)

with an all-0s syndrome vector s indicating that v is a correct codeword.

Example: A Hamming code as a linear code

Example: A non-Hamming linear code

All linear codes have the following two properties:


1. The all-0s vector is a codeword.
2. The code’s distance equals the minimum Hamming weight of its codewords.
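Both properties, along with the ⊕-closure that defines linearity, can be verified exhaustively for the small (7, 4) Hamming code (an illustrative sketch; the generator rows used are those of Eq. (14.2.gen)):

```python
from itertools import product

G = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1],
     [1,1,1,0], [1,0,1,1], [0,1,1,1]]

def encode(d):
    """Mod-2 matrix-vector product G d, returned as a tuple."""
    return tuple(sum(g & x for g, x in zip(row, d)) % 2 for row in G)

codewords = {encode(d) for d in product([0, 1], repeat=4)}

# Property 1: the all-0s vector is a codeword.
assert (0,) * 7 in codewords

# Property 2: code distance = minimum weight of a nonzero codeword.
min_weight = min(sum(c) for c in codewords if any(c))
print(min_weight)   # -> 3, i.e., the code has distance 3

# Linearity: the XOR of any two codewords is again a codeword.
for u in codewords:
    for v in codewords:
        assert tuple(a ^ b for a, b in zip(u, v)) in codewords
```

The second property holds because u ⊕ v is itself a codeword whose weight equals the Hamming distance between u and v.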


14.4 Reed-Solomon and BCH Codes

In this section, we introduce two widely used classes of codes that allow flexible design
of a variety of codes with desired error correction capabilities and simple decoding, with
one class being a subclass of the other. Alexis Hocquenghem in 1959 [Hocq59], and
later, apparently independently, Raj Bose and D. K. Ray-Chaudhuri in 1960 [Bose60],
invented a class of cyclic error-correcting codes that are named BCH codes in their
honor. Irving S. Reed and Gustave Solomon [Reed60] are credited with developing a
special class of BCH codes that has come to bear their names.

We begin by discussing Reed-Solomon (RS) error-correcting codes. RS codes are
nonbinary, in the sense of having a multivalued symbol set from which data and check
symbols are drawn. A popular instance of the latter code, which has been used in CD
symbols are drawn. A popular instance of the latter code, which has been used in CD
players, digital audiotapes, and digital television is the RS(255, 223) code: it has 8-bit
symbols, 223 bytes of data, and 32 check bytes, for an informational redundancy of
32/255 ≈ 12.5% (the redundancy rises to 32/223 ≈ 14.3% if judged relative to data bytes,
rather than the total number of bytes in the code). The aforementioned code can correct
up to 16 symbol errors. A symbol error is defined as any change in the symbol value, be
it flipping of only one bit or changing of any number of bits in the symbol. When
symbols are 8 bits wide, symbol error is also referred to as a “byte error.”

In general, a Reed-Solomon code requires 2t check symbols, each s bits wide, if it is to
correct up to t symbol errors. To correct t errors, which implies a distance-(2t + 1) code,
the number k of data symbols must satisfy:

k ≤ 2^s – 1 – 2t (14.4.RS)

Inequality 14.4.RS suggests that the symbol bit-width s grows with the data length k, and
this may be viewed as a disadvantage of RS codes. One important property of RS codes
is that they guarantee optimal minimum code distance, given the code parameters.

Theorem 14.RSdist (minimum distance of RS codes): An (n, k) Reed-Solomon code has


the optimal minimum distance n – k + 1.


In what follows, we focus on the properties of t-error-correcting RS codes with n = 2^s – 1
code symbols, n – r = 2^s – 1 – 2t data symbols, and r = 2t check symbols. Such a code is
characterized by its degree-2t generator polynomial:

g(x) = ∏b≤j≤b+2t–1 (x + α^j) = x^(2t) + g_(2t–1) x^(2t–1) + … + g_1 x + g_0 (14.4.RSgen)

In equation 14.4.RSgen, b is usually 0 or 1 and α is a primitive element of GF(2^s). Recall
that a primitive element is one that generates all nonzero elements of the field by its
powers. For example, three different representations of the elements in GF(2^3) are
shown in Table 14.GF8. As usual, the product of the generator polynomial and the data
polynomial yields the code polynomial, and thus the corresponding codeword. All
polynomial coefficients are s-bit numbers and all arithmetic is performed in GF(2^s).

Example 14.RS: The Reed-Solomon code RS(7, 3) Consider an RS(7, 3) code, with 3-bit
symbols, defined by the generator polynomial g(x) = (1 + x)(α + x)(α^2 + x)(α^3 + x) =
α^6 + α^5 x + α^5 x^2 + α^2 x^3 + x^4, where α satisfies 1 + α + α^3 = 0. This code can
correct any double-symbol error. What types of errors are correctable by this code at the bit level?

Solution: Given that each of the 8 symbols can be mapped to 3 bits, the RS(7, 3) code defined
here is a (21, 9) code, if we count bit positions, and has a code distance of 5. This means that any
random double-bit error will be correctable. Additionally, because the RS code can correct any
error that is confined to no more than 2 symbols, any burst error of length 4 is also correctable.
Note that a burst error of length 5 can potentially span 3 adjacent 3-bit symbols.

Table 14.GF8 Three different representations of elements in GF(2^3). The
polynomials are reduced modulo α^3 + α + 1.

–––––––––––––––––––––––––––––––
Power    Polynomial       Vector
–––––––––––––––––––––––––––––––
--       0                000
1        1                001
α        α                010
α^2      α^2              100
α^3      α + 1            011
α^4      α^2 + α          110
α^5      α^2 + α + 1      111
α^6      α^2 + 1          101
–––––––––––––––––––––––––––––––
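The table entries, and the generator polynomial of Example 14.RS, can be reproduced with a small GF(2^3) arithmetic sketch (illustrative; field elements are 3-bit integers, and multiplication reduces by α^3 = α + 1):

```python
def gf8_mul(a, b):
    """Multiply two GF(2^3) elements (3-bit ints), reducing by the
    primitive polynomial x^3 + x + 1, i.e., alpha^3 = alpha + 1."""
    p = 0
    for _ in range(3):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:
            a ^= 0b1011            # reduce by x^3 + x + 1
    return p

# successive powers of alpha = 010 reproduce the table's Vector column
x = 1
for i in range(7):
    print(f"alpha^{i} = {x:03b}")
    x = gf8_mul(x, 0b010)

def poly_mul(p, q):
    """Multiply polynomials with GF(2^3) coefficients (low degree
    first); coefficient addition is XOR."""
    r = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] ^= gf8_mul(pi, qj)
    return r

# g(x) = (1 + x)(alpha + x)(alpha^2 + x)(alpha^3 + x) of Example 14.RS
g = [1, 1]                                 # 1 + x
for root in (0b010, 0b100, 0b011):         # alpha, alpha^2, alpha^3
    g = poly_mul(g, [root, 1])
print(g)   # -> [5, 7, 7, 4, 1], i.e., alpha^6, alpha^5, alpha^5, alpha^2, 1
```

The final print confirms the coefficients α^6, α^5, α^5, α^2, 1 stated in the example.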


Reed-Solomon codes can also be introduced via a matrix formulation in lieu of
polynomials. [Elaborate]

BCH codes have the advantage of being binary codes, thus avoiding the additional
burden of converting nonbinary symbols into binary via encoding. A BCH(n, k) code can
be characterized by its n × (n – k) parity check matrix P, which allows the computation of
the error syndrome via the vector-matrix multiplication W × P, where W is the received
word. For example, Fig. 14.BCH shows how the syndrome for the BCH(15, 7) code is
computed. The example code shown is also characterized by its generator polynomial:

g(x) = 1 + x^4 + x^6 + x^7 + x^8 (14.4.BCH)

Practical applications of BCH codes include the use of BCH(511, 493) as a double-error-
correcting code in a video coding standard for videophones and the use of BCH(40, 32)
as SEC/DED code in ATM communication.
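The generator polynomial of Eq. (14.4.BCH) also supports systematic encoding in CRC fashion: append to the 7 data bits the remainder of data(x)·x^8 divided by g(x). The sketch below is illustrative (function names and bit ordering are assumptions, not taken from Fig. 14.BCH, which computes the syndrome via the parity check matrix instead):

```python
G_POLY = 0b111010001    # g(x) = x^8 + x^7 + x^6 + x^4 + 1 as a bit mask

def gf2_remainder(reg):
    """Remainder of a degree-<15 polynomial divided by g(x) over GF(2)."""
    for shift in range(14, 7, -1):         # polynomial long division
        if reg & (1 << shift):
            reg ^= G_POLY << (shift - 8)
    return reg                             # 8-bit remainder

def bch15_7_encode(data):
    """Systematic encoding: append to the 7 data bits the remainder
    of data(x) * x^8 divided by g(x)."""
    reg = 0
    for bit in data:                       # high-order coefficient first
        reg = (reg << 1) | bit
    check = gf2_remainder(reg << 8)
    return data + [(check >> i) & 1 for i in range(7, -1, -1)]

def bch_syndrome(word):
    """A zero remainder indicates a valid codeword."""
    reg = 0
    for bit in word:
        reg = (reg << 1) | bit
    return gf2_remainder(reg)

cw = bch15_7_encode([1, 0, 1, 1, 0, 0, 1])
print(bch_syndrome(cw))       # -> 0
cw[3] ^= 1
print(bch_syndrome(cw) != 0)  # -> True: the error is detected
```

Every codeword polynomial is a multiple of g(x), so re-dividing a received word yields an all-0s remainder exactly when no detectable error has occurred.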

Fig. 14.BCH The parity check matrix and syndrome generation via
vector-matrix multiplication for BCH(15, 7) code.


14.5 Arithmetic Error-Correcting Codes

Consider the (7, 15) biresidue code, which uses mod-7 and mod-15 residues of a data
word as its check parts. The data word can be up to 12 bits wide and the attached check
parts are 3 and 4 bits, respectively, for a total code width of 19 bits. Figure 14.bires
shows the syndromes generated when data is corrupted by a weight-1 arithmetic error,
corresponding to the addition or subtraction of a power of 2 to/from the data. Because all
the syndromes in this table up to the error magnitude 2^11 are distinct, such weight-1
arithmetic errors are correctable by this code.
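The distinctness of these syndromes can be verified directly; the sketch below (illustrative, covering only the added errors +2^i; subtracted errors yield the complementary residue pairs) tabulates the mod-7 and mod-15 residues of each error magnitude:

```python
def biresidue_syndromes():
    """Syndrome (mod-7 residue, mod-15 residue) of each weight-1
    error +2^i on a 12-bit data word, as in Fig. 14.bires."""
    table = {}
    for i in range(12):
        s = (2**i % 7, 2**i % 15)
        assert s not in table          # distinct syndromes -> correctable
        table[s] = i
    return table

table = biresidue_syndromes()
print(len(table))       # -> 12 distinct syndromes
print(table[(2, 2)])    # -> 1: syndrome (2, 2) points to the error 2^1
```

Distinctness follows because 2^i mod 7 cycles with period 3 and 2^i mod 15 with period 4, so the pair repeats only after lcm(3, 4) = 12 positions.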

In general, a biresidue code with relatively prime low-cost check moduli A = 2^a – 1 and B
= 2^b – 1 supports ab bits of data for weight-1 error correction. The representational
redundancy of the code is:

(a + b)/ab = 1/a + 1/b (14.5.redun)

Thus, by increasing the values of a and b, we can reduce the amount of redundancy.
Figure 14.brH compares such biresidue codes with the Hamming SEC code in terms of
code and data widths. Figure 14.brarith shows a general scheme for performing
arithmetic operations and checking of results with biresidue-coded operands. The scheme
is very similar to that of residue codes for addition, subtraction, and multiplication,
except that two residue checks are required. Division and square-rooting remain difficult.

Fig. 14.bires Syndromes for single arithmetic errors, having magnitudes


that are powers of 2, in the (7, 15) biresidue code.


Fig. 14.brH A comparison of biresidue arithmetic codes with Hamming


SEC codes in terms of code and data widths.

Fig. 14.brarith Arithmetic operations and checking of results for biresidue-


coded operands.


14.6 Other Error-Correcting Codes

As in the case of error-detecting codes, we have only scratched the surface of the vast
theory of error-correcting codes in the preceding sections. To present a more complete
picture of the field, which is covered in many excellent textbooks, some of which are
cited in the references section, we touch upon a number of other error-
correcting codes in this section. These codes include:

Reed-Muller codes: RM codes have a recursive construction, with smaller codes used to
build larger ones.

Turbo codes: Turbo codes are highly efficient separable codes with iterative (soft)
decoding. A data word is augmented with two check words, one obtained directly from
an encoder and a second one formed based on an interleaved version of the data. The two
encoders for Turbo codes are generally identical. Soft decoding means that each of two
decoders provides an assessment of the probability of a bit being 1. The two decoders
then exchange information and refine their estimates iteratively. Turbo codes are
extensively used in cell phones and other communication applications.

Low-density parity-check codes: In LDPC codes, each parity check is defined on a
small set of bits, so both encoding and error checking are very fast; error correction is
more difficult and entails an iterative process.

Information dispersal: In this scheme, data is encoded in n pieces, such that any k of the
pieces would be adequate for reconstruction. Such codes are useful for protecting privacy
as well as data integrity.

In this chapter, we have seen error-correcting codes applied at the bit-string or word
level. It is also possible to apply coding at higher levels of abstraction. Error-correcting
codes that are applicable at the data structure level (robust data structures) or at the
application level (algorithm-based error tolerance) will be discussed in Chapter 20.


Problems

14.1 A Hamming SEC code


Consider the (15, 11) Hamming SEC code, with 11 data bits d0-d10 and 4 parity bits p0-p3.
a. Write down the parity check matrix H for this code.
b. Rearrange the columns of H so that the syndrome directly identifies the location of an erroneous
bit.
c. Find the generator matrix of the resulting Hamming SEC code.
d. Use the generator matrix of part c to encode a particular data word that you choose, introduce a
single-bit error in the resulting codeword, and verify that the syndrome generated by the parity
check matrix correctly identifies the location of the error.

14.2 Hamming SEC/DED codes


For data words of widths 4 through 64, in increments of 4 bits, specify the total number of bits required by
a Hamming SEC/DED code.

14.3 Redundancy for error correction


Show that it is impossible to have a double error-correcting code with 16 information bits and 8 check bits
(i.e., k = 16, r = 8, n = k + r = 24). Then, derive a lower bound on the number of check bits for double error
correction with k information bits.

14.4 Error detection vs. error correction


Figures 13.1a and 14.1a appear identical at first sight. See if you can identify a key difference and its
significance in the two figures.

14.5 2D error-correcting code


Show that in the code defined in Example 14.1, the resulting (√k + 1) × (√k + 1) matrix will be the same
whether we attach the row parity bits first, followed by column parities, or vice versa.

14.6 Cyclic Hamming codes


Some cyclic codes are Hamming codes.
a. Show that there are 30 distinct binary (7, 4) Hamming codes.
b. Show that the (7, 4) cyclic code with G(x) = 1 + x2 + x3 is one of the Hamming codes in part a.
c. Show that the (7, 4) cyclic code with G(x) = 1 + x + x3 is one of the Hamming codes in part a.
d. Show that the codes in parts b and c are the only cyclic codes among the 30 codes of part a.

14.7 Product of codes


There are various ways in which two codes can be combined (“multiplied”) to form a new code. One way
is a generalized version of the 2D parity code of Example 14.1. The k = k1k2 data bits are arranged in
column-major order into a k1 × k2 matrix, which is augmented along the columns by r1 check bits, so that
each column is a codeword of a separable (n1, k1) code of distance d1, and along the rows by r2 check bits
of a separable (n2, k2) code of distance d2. The result is an (n1n2, k1k2) code.
a. Characterize the resulting code if each of the row and column codes is a Hamming (7, 4) code.


b. Show that the burst-error-correcting capability of the code of part a is greater than its random-
error-correcting capability.
c. What is the distance of the product of two codes of distances d1 and d2? Hint: Assume that each of
the two codes contains the all-0s codeword and a codeword of weight equal to the code distance.
d. What can we say about the burst-error-correction capability of the product of two codes in
general?

14.8 Redundancy bounds for binary error-correcting codes


Let V(n, m) = ∑0≤i≤m C(n, i), where C(n, i) is the binomial coefficient, and consider an n-bit c-error-correcting code with code rate R = 1 – r/n (i.e., with r
redundant bits).
a. Prove the Hamming lower bound for redundancy: r ≥ log2 V(n, c)
b. Prove the Gilbert-Varshamov upper bound for redundancy: r ≤ log2 V(n, 2c)
c. Plot the two bounds for n up to 1000 and discuss.

14.9 Reliable digital filters


Study the paper [Gao15] and prepare a 1-page summary (single-spaced, with at most one figure)
highlighting how error-correcting codes are used to ensure reliable computation in the particular application
domain discussed.

14.10 Two-dimensional error checking


A class grade list has m variously weighted columns (homework assignments, projects, exams, and the like)
and n rows. For each column, the mean, median, minimum, and maximum of the n student grades are
calculated and listed at the bottom. For each row, the weighted sum or composite grade is calculated and
entered on the right. Devise a scheme, similar to a 2D code, for checking the grade calculations for possible
errors and discuss the error detection or correction capabilities of your scheme.



References and Further Readings


[AlBa93] Al-Bassam, S. and B. Bose, “Design of Efficient Error-Correcting Balanced Codes,” IEEE
Trans. Computers, pp. 1261-1266, 1993.
[Araz87] Arazi, B., A Commonsense Approach to the Theory of Error-Correcting Codes, MIT Press,
1987.
[Bose60] Bose, R. C. and D. K. Ray-Chaudhuri, “On a Class of Error Correcting Binary Group Codes,”
Information and Control, Vol. 3, pp. 68-79, 1960.
[Bruc92] Bruck, J. and M. Blaum, “New Techniques for Constructing EC/AUED Codes,” IEEE Trans.
Computers, pp. 1318-1324, October 1992.
[Gao15] Gao, Z., P. Reviriego, W. Pan, Z. Xu, M. Zhao, J. Wang, and J. A. Maestro, “Fault Tolerant
Parallel Filters Based on Error Correction Codes,” IEEE Trans. VLSI Systems, Vol. 23, No. 2,
pp. 384-387, February 2015.
[Garr04] Garrett, P., The Mathematics of Coding Theory, Prentice Hall, 2004, p. 283.
[Guru09] Guruswami, V. and A. Rudra, “Error Correction up to the Information-Theoretic Limit,”
Communications of the ACM, Vol. 52, No. 3, pp. 87-95, March 2009.
[Hank00] Hankerson, R. et al., Coding Theory and Cryptography: The Essentials, Marcel Dekker, 2000,
p. 120.
[Hocq59] Hocquenghem, A., “Codes Correcteurs d’Erreurs,” Chiffres, Vol. 2, pp. 147-156, September
1959.
[Kund90] Kundu, S. and S. M. Reddy, “On Symmetric Error Correcting and All Unidirectional Error
Detecting Codes,” IEEE Trans. Computers, pp. 752-761, June 1990.
[Lin88] Lin, D. J. and B. Bose, “Theory and Design of t-Error Correcting and d(d>t)-Unidirectional
Error Detecting (t-EC d-UED) Codes,” IEEE Trans. Computers, pp. 433-439, 1988.
[More06] Morelos-Zaragoza, R. H., The Art of Error Correcting Coding, Wiley, 2006.
[Nahi13] Nahin, P. J., The Logician and the Engineer: How George Boole and Claude Shannon Created
the Information Age, Princeton, 2013.
[Pete72] Peterson, W. W. and E. J. Weldon Jr., Error-Correcting Codes, MIT Press, 2nd ed., 1972.
[Plan97] Plank, J. S., “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems,”
Software: Practice and Experience, Vol. 27, No. 9, pp. 995-1012, 1997.
[Rao89] Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice Hall,
1989.
[Reed60] Reed, I. and G. Solomon, “Polynomial Codes over Certain Finite Fields,” SIAM J. Applied
Mathematics, Vol. 8, pp. 300-304, 1960.
[Skla04] Sklar, B. and F. J. Harris, “The ABCs of Linear Block Codes,” IEEE Signal Processing
Magazine, Vol. 21, No. 4, pp. 14-35, July 2004.
[Tay16] Tay, T. F. and C.-H. Chang, “A Non-Iterative Multiple Residue Digit Error Detection and
Correction Algorithm in RRNS,” IEEE Trans. Computers, Vol. 65, No. 2, pp. 396-, February
2016.


15 Self-Checking Modules
“I have not failed. I’ve just found 10,000 ways that won’t work.”
Thomas Edison

“But the Committee of the Mending Apparatus now came


forward, and allayed the panic with well-chosen words. It
confessed that the Mending Apparatus was itself in need of
repair.”
E. M. Forster, The Machine Stops

Topics in This Chapter


15.1. Checking of Function Units
15.2. Error Signals and Their Combining
15.3. Totally Self-Checking Design
15.4. Self-Checking Checkers
15.5. Self-Checking State Machines
15.6. Practical Self-Checking Design

We observed in Chapter 9 that testing of digital circuits is quite difficult, given
that a circuit’s internal points may be inadequately controllable or observable.
This difficulty motivated us to consider design methods for testable logic circuits
in Chapter 11. In self-checking design, which is the topic of this chapter, we go
one step further: we ensure that any fault from a predefined class of faults is either
masked, in the sense of not affecting the correct output of the circuit, or detected,
because it produces an invalid output. Distinguishing valid from invalid outputs is
made possible by the use of error-detecting codes.


15.1 Checking of Function Units

It is possible to check the operation of a function unit without the high circuit/area and
power overheads of replication, which entails a redundancy of at least 100%. A number
of methods are available to us for this purpose, which we shall discuss in this chapter.
Among these methods, those based on coding of the function unit’s input and output are
the most rigorous and readily verifiable.

Consider the input and output data spaces in Fig. 15.1a. The encoded input space is
divided into code space and error space; ditto for the encoded output space. The function
f to be realized maps points from the input code space into the output code space (the
solid arrow in Fig. 15.1a). This represents the normal functioning of the self-checking
circuit, during which the validity of the output is verified by the self-checking code
checker in Fig. 15.1b. When a particular fault occurs in the function unit, the designer
should ensure that either an error-space output is produced by the circuit or else the fault
is masked by producing the correct output (the dashed arrows in Fig. 15.1a). Thus, the
challenge of self-checking circuit design is to come up with strategies for ensuring that any
fault from a designated fault class of interest is either detected via the invalid output it produces
or masked by production of the correct output. This chapter is devoted to a review of such design
strategies and ways of assessing their effectiveness.

(a) Mappings for self-checking design (b) Self-checking function unit

Fig. 15.1 Data and code spaces in general (sizes 2^k and 2^n) and for
bit-level triplication (sizes 2 and 8).


Figure 15.1b contains two blocks, each of which can be faulty or fault-free. Thus, we
need to consider four cases in deriving a suitable design process.

 Both the function unit and the checker are okay: This is the expected or normal
mode of operation during which correct results are obtained and certified.

 Only the function unit is okay: A false alarm may be raised by the faulty checker,
but this situation is safe.

 Only the checker is okay: We have either no output error or a detectable error.

 Neither the function unit nor the checker is okay: The faulty checker may miss
detecting a fault-induced error at the function unit output. This problem leads us
to the design of checkers with at least two output lines; a single check signal, if
stuck-at-okay, will go undetected, thus raising the possibility that a double fault of
this kind is eventually encountered. We say that undetected faults increase the
probability of fault accumulation.
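The last case above is what motivates two-rail (dual-rail) error signaling, in which each check indication is a complementary pair of lines: (0, 1) and (1, 0) are valid, while (0, 0) and (1, 1) signal trouble. The behavioral sketch below models the standard gate-level cell that merges two such pairs into one; the function name is ours, chosen for illustration.

```python
def two_rail_combine(a, a_bar, b, b_bar):
    """Combine two two-rail error-signal pairs into one pair.

    Valid pairs are (0, 1) and (1, 0); the output pair is valid
    exactly when both input pairs are valid, so an invalid pair
    anywhere in a tree of these cells propagates to the root.
    """
    e = (a & b) | (a_bar & b_bar)      # two AND gates feeding an OR
    f = (a & b_bar) | (a_bar & b)      # two AND gates feeding an OR
    return e, f

# Valid inputs yield a valid (complementary) output pair:
assert two_rail_combine(0, 1, 1, 0) == (0, 1)
# An input line stuck at 1, making one pair (1, 1), yields an invalid output:
assert two_rail_combine(1, 1, 0, 1) == (1, 1)
```

Because no single output line can be stuck-at-okay without eventually producing an invalid pair, a tree of such cells avoids the fault-accumulation problem noted above.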


15.2 Error Signals and Their Combining

This section to be based on the following slides:


15.3 Totally Self-Checking Design

This section to be based on the following slides:

Self-monitoring circuits


15.4 Self-Checking Checkers

This section to be based on the following slides:


15.5 Self-Checking State Machines

This section to be based on the following slides:


15.6 Practical Self-Checking Design

Design based on parity codes


Design with residue encoding
FPGA-based design
General synthesis rules
Partially self-checking design
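Of the topics listed, residue encoding is the easiest to illustrate in executable form. The toy sketch below checks a multiplier with low-cost mod-3 residues: the check channel predicts the result's residue from the operand residues alone, so a mismatch flags a fault. The function names and the injected fault are our illustration, not a design from the text.

```python
def residue(x, m=3):
    """Residue check symbol for x; m = 3 needs only 2 check bits."""
    return x % m

def checked_multiply(x, y, m=3, inject_fault=False):
    """Multiply x and y, with a mod-m residue check on the result.

    Residues obey residue(x*y) = (residue(x) * residue(y)) mod m,
    so the check channel never handles the full-width product.
    Errors whose magnitude is a multiple of m escape detection;
    larger check moduli shrink that blind spot.
    """
    result = x * y                     # main function unit
    if inject_fault:
        result ^= 1 << 4               # simulate a single-bit fault
    predicted = (residue(x, m) * residue(y, m)) % m
    if residue(result, m) != predicted:
        raise RuntimeError("residue mismatch: fault detected")
    return result

# A correct multiply passes the check silently:
assert checked_multiply(123, 456) == 56088
```

A single-bit flip changes the product by a power of 2, which is never a multiple of 3, so the mod-3 check catches it.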


Problems

15.1 Self-checking checker for 2-out-of-4 code


Show that a self-checking checker for 2-out-of-4 code can be built from three 2-input OR gates and three
2-input AND gates.

15.2 Self-checking m-out-of-n code checkers


Consider the following strategy for designing a self-checking checker for an m-out-of-n code. Begin by
dividing the n inputs into two roughly equal subsets S = {s1, s2, . . . , sk} and T = {t1, t2, . . . , tn–k}. Then,
select the nonnegative integers A and B such that A + B = 2^q – 1 – m for an arbitrary integer q. Next, design
two circuits to compute U = A + s1 + s2 + . . . + sk and V = B + t1 + t2 + . . . + tn–k.
a. Complete the description of the design. Hint: What is U + V for codewords and noncodewords?
b. Show that the resulting design is indeed self-checking.
c. Design a self-checking checker for 3-out-of-6 code based on this method.

15.3 Dependability and other system attributes


Read the article [Blaa09] and discuss, in one typewritten page, how dependable computing methods can be
put to other uses, such as speed enhancement and energy economy.

15.4 Self-checking Berger-code checkers


a. Design a totally self-checking checker for a Berger code with 31 data bits and 5 check bits.
b. Describe how your design in the answer to part a will change when the Berger code’s check part
holds 31 – count(1s), instead of the commonly used count(0s).
c. Extend your answer to part b to the general case of k data bits and r = ⌈log2(k + 1)⌉ check bits.

15.5 Totally-self-checking decoder


Suppose that the correct functioning of a 2-to-4 decoder is to be monitored. Design a totally-self-checking
checker to be used at the output of the decoder. Show that your design is indeed totally self-checking.

15.6 Totally-self-checking checker


Consider a code with multiple parity check bits applied to different subsets of the data bits. Hamming SEC
and SEC/DED codes are examples of this class of parity-based codes. Devise a strategy for designing a
totally self-checking checker for such a code. Fully justify your answer.

15.7 Coded computing


Read the article [Dutt20] and answer the following questions, using no more than a couple of sentences for
each answer.
a. What is coded computing?
b. How do the codes used in the article differ from discrete codes of Chapters 13-14 in our textbook?
c. What are the “seven dwarfs”?
d. Which four of the seven dwarfs do the authors tackle?
e. Why do you think the other three dwarfs were not considered?


15.8 Use of parity-preserving and parity-inverting codes


Intro


References and Further Readings


[Akba14] Akbar, M. A. and J.-A. Lee, “Comments on ‘Self-Checking Carry-Select Adder Design Based
on Two-Rail Encoding’,” IEEE Trans. Circuits and Systems I, Vol. 61, No. 7, pp. 2212-2214,
July 2014.
[Ande73] Anderson, D. A. and G. Metze, “Design of Totally Self-Checking Check Circuits for m-out-of-
n Codes,” IEEE Trans. Computers, 1973.
[Ashj77] Ashjaee, M. J. and S. M. Reddy, “On Totally Self-Checking Checkers for Separable Codes,”
IEEE Trans. Computers, pp. 737-744, August 1977.
[Chan99] Chang, W.-F. and C.-W. Wu, “Low-Cost Modular Totally Self-Checking Checker Design for
m-out-of-n Codes,” IEEE Trans. Computers, Vol. 48, No. 8, pp. 815-826, August 1999.
[Chua78] Chuang, H. and S. Das, “Design of Fail-Safe Sequential Machines Using Separable
Codes,” IEEE Trans. Computers, Vol. 27, No. 3, pp. 249-251, March 1978.
[Dutt20] Dutta, S., H. Jeong, Y. Yang, V. Cadambe, T. M. Low, and P. Grover, “Addressing
Unreliability in Emerging Devices and Non-von Neumann Architectures Using Coded
Computing,” Proceedings of the IEEE, Vol. 108, No. 8, pp. 1219-1234, August 2020.
[Gait08] Details to be supplied
[Lala01] Lala, P. K., Self-checking and Fault-Tolerant Digital Design, Morgan Kaufmann, 2001.
[Lala03] Lala, P. K. and A. L. Burress, “Self-Checking Logic Design for FPGA Implementation,” IEEE
Trans. Instrumentation and Measurement, Vol. 52, No. 5, pp. 1391-1398, October 2003.
[Parh76] Parhami, B., “Design of Self-Monitoring Logic Circuits for Applications in Fault-Tolerant
Digital Systems,” Proc. Int’l Symp. Circuits and Systems, 1976.
[Wake78] Wakerly, J. F., Error Detecting Codes, Self-Checking Circuits and Applications, North-
Holland, 1978.


16 Redundant Disk Arrays


“Success is the ability to go from one failure to another with no
loss of enthusiasm.”
Winston Churchill

“If two men on the same job agree all the time, then one is
useless. If they disagree all the time, then both are useless.”
Darryl F. Zanuck

Topics in This Chapter


16.1. Disk Drives and Disk Arrays
16.2. Disk Mirroring and Striping
16.3. Data Encoding Schemes
16.4. RAID and Its Levels
16.5. Disk Array Performance
16.6. Disk Array Reliability Modeling

Large storage capacities are required in many modern computer applications.


Early in the history of high-performance computing, special disk drives were
developed to cater to the capacity and reliability needs of such applications.
Beginning in the late 1980s, however, it was realized that economy of scale
favored the use of low-cost disk drives, produced in high volumes for the personal
computer market. Even though each such disk unit is fairly reliable, when arrays
of hundreds or perhaps thousands of them are used to offer multiterabyte capacity,
potential data loss due to disk failure becomes a serious concern. In this chapter,
we study methods for building reliable disk arrays.


16.1 Disk Drives and Disk Arrays

Magnetic disk drives form the main vehicles for supplying stable archival storage in
many applications. Since the inception of hard disk drive in 1956, the recording density
of these storage devices has grown exponentially, much like the exponential growth in
the density of integrated circuits. Currently, tens of gigabytes of information can be
stored on each cm2 of disk surface, making it feasible to provide terabyte-class disk
drives for personal use and petabyte-class archival storage for enterprise systems.
Amdahl’s rules of thumb for system balance dictate that for each GIPS of performance
one needs 1 GB of main memory and 100 GB of disk storage. Thus the trend toward ever
greater performance brings with it the need for ever larger secondary storage.

Figure 16.1 shows a typical disk memory configuration and the terminology associated
with its design and use. There are 1-12 platters mounted on a spindle that rotates at
speeds of 3600 to well over 10,000 revolutions per minute. Data is recorded on both
surfaces of each platter along circular tracks. Each track is divided into several sectors,
with a sector being the unit of data transfer into and out of the disk. The recording density
is a function of the track density (tracks per centimeter or inch) and the linear bit density
along the track (bits per centimeter or inch). In the year 2010, the areal recording density
of inexpensive commercial disks was in the vicinity of 100 Gb/cm2. Early computer disks
had diameters of up to 50 cm, but modern disks are seldom outside the range of 1.0-3.5
inches (2.5-9.0 cm) in diameter.

The recording area on each surface does not extend all the way to the center of the
platter because the very short tracks near the middle cannot be efficiently utilized. Even
so, the inner tracks are a factor of 2 or more shorter than the outer tracks. Having the
same number of sectors on all tracks would limit the track (and, hence, disk capacity) by
what it is possible to record on the short inner tracks. For this reason, modern disks put
more sectors on the outer tracks. Bits recorded in each sector include a sector number at
the beginning, followed by a gap to allow the sector number to be processed and noted by
the read/write head logic, the sector data, and error detection/correction information.
There is also a gap between adjacent sectors. It is because of these gaps, the sector
number, and error-coding overhead, plus spare tracks that are often used to allow for
“repairing” bad tracks discovered at the time of manufacturing testing and in the course
of disk memory operation (see Section 6.2), that a disk’s formatted capacity is much
lower than its raw capacity based on data recording density.


Sector Read/write head

Actuator

Recording area

Track c – 1
Track 2
Track 1
Track 0

Arm

Direction of Platter
rotation
Spindle

Figure 16.1 Disk memory elements and key terms.

An actuator can move the arms holding the read/write heads, of which we have as many
as there are recording surfaces, to align them with a desired cylinder consisting of tracks
with same diameter on different recording surfaces. Reading of very closely spaced data
on the disk necessitates that the head travel very close to the disk surface (within a
fraction of a micrometer). The heads are prevented from crashing onto the surface by a
thin cushion of air. Note that even the tiniest dust particle is so large in comparison with
the head separation from the surface that it will cause the head to crash onto the surface.
Such head crashes damage the mechanical parts and destroy a great deal of data on the
disk. To prevent these highly undesirable events, hard disks are typically sealed in
airtight packaging.

Disk performance is related to access latency and data transfer rate. Access latency is the
sum of cylinder seek time (or simply seek time) and rotational latency, the time needed
for the sector of interest to arrive under the read/write head. Thus:

Disk access latency = Seek time + Rotational latency (16.1.access)

A third term, the data transfer time, is often negligible compared with the other two and
can be ignored.

Seek time depends on how far the head has to travel from its current position to the target
cylinder. Because this involves a mechanical motion, consisting of an acceleration phase,
a uniform motion phase, and a deceleration or braking phase, one can model the seek time for
moving by c cylinders as follows, where α, β, and γ are constants:


Seek time = α + β(c – 1) + γ√(c – 1), for c ≥ 1 (16.1.seek)

The linear term β(c – 1), corresponding to the uniform motion phase, is a rather recent
addition to the seek-time equation; older disks simply did not have enough tracks, and/or
a high enough acceleration, for uniform motion to kick in.

Rotational latency is a function of where the desired sector is located on the track. In the
best case, the head is aligned with the track just as the desired sector is arriving. In the
worst case, the head just misses the sector and must wait for nearly one full revolution.
So, on average, the rotational latency is equal to the time for half a revolution:

Average rotational latency = 30/rpm s = 30 000/rpm ms (16.1.avgrl)

Hence, for a rotation speed of 10 000 rpm, the average rotational latency is 3 ms and its
range is 0-6 ms.

The data transfer rate is related to the rotational speed of the disk and the linear bit
density along the track. For example, suppose that a track holds roughly 2 Mb of data and
the disk rotates at 10,000 rpm. Then, every minute, 2 × 10^10 bits pass under the head.
Because bits are read on the fly as they pass under the head, this translates to an average
data transfer rate of about 333 Mb/s = 42 MB/s. The overhead induced by gaps, sector
numbers, and CRC encoding causes the peak transfer rate to be somewhat higher than the
average transfer rate thus computed.

Figure 16.2 The three components of access time for a disk.
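The latency model above is easy to exercise numerically. In the sketch below, the seek-time constants α, β, and γ are illustrative placeholders rather than figures from any real drive's data sheet; only the rotational-latency term follows directly from the rpm value, per Eq. (16.1.avgrl).

```python
import math

def access_latency_ms(c, rpm, alpha=2.0, beta=0.01, gamma=0.1):
    """Average disk access latency (ms) for a seek of c >= 1 cylinders.

    Seek time: alpha + beta*(c - 1) + gamma*sqrt(c - 1), per the
    three-phase motion model; rotational latency averages half a
    revolution.  Data transfer time is ignored, as in the text.
    """
    seek = alpha + beta * (c - 1) + gamma * math.sqrt(c - 1)
    rotational = 30_000 / rpm          # (30 000 / rpm) ms
    return seek + rotational

# A 101-cylinder seek on a 10 000-rpm drive:
#   seek = 2.0 + 1.0 + 1.0 = 4.0 ms, rotational = 3.0 ms
print(round(access_latency_ms(101, 10_000), 3))   # 7.0
```

Note how, for short seeks, the fixed α term and the half-revolution rotational wait dominate the total, which is why request scheduling matters more than raw seek speed for small transfers.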


While hard disk drives continue to grow in capacity, there are applications for which no
single disk can satisfy the storage needs. There are other applications that need high data
rates, in excess of what can be provided by a single disk, so as to keep a large number of
computational units usefully busy. In such cases, arrays of disks, sometimes numbering
in hundreds or thousands, are used.

Modern disk drives are highly reliable, boasting MTTFs that are measured in decades.
With hundreds or thousands of drives in the same disk-array system, however, a few
failures per year or even per month are to be expected. Thus, to avoid data loss which is
critically important for systems dealing with large data bases, reliability improvement
measures are required.

The intersection of the two requirements just discussed (improved capacity and
throughput on one side and higher reliability on the other), brought about the idea of
using a redundant array of independent disks (RAID) in lieu of a single disk unit. Another
motivating factor for RAID in the mid 1980s was that it no longer made economic sense
to design and manufacture special high-capacity, high-reliability disk units for mainframe
and supercomputer applications, given that the same capacity and throughput could be
provided by combining low-cost disks, mass-produced for the personal computer market.
In fact, the letter “I” in RAID originally stood for “Inexpensive.”

The steep downward trend in the per-gigabyte price of disk memories, and thus of RAID
systems, has made such storage systems affordable even for personal applications, such
as home storage servers.

Much of our discussion in this chapter is in terms of magnetic hard-disk drives, the
common components in RAID systems. Even though SSD storage has different failure
mechanisms (they contain no moving parts to fail, but suffer from higher error rates and
limited erase counts), applicable high-level concepts are pretty much the same. Please
refer to [Jere11] for issues involved in designing SSD RAIDS. A good overview of SSD
RAID challenges and of products available on the market is provided in [CW13].


16.2 Disk Mirroring and Striping

The two ideas of disk mirroring and striping are foundational in the composition of RAID
systems, so we discuss them here in detail before reviewing the structure of modern
redundant disk arrays.

Mirroring refers to duplicating each data file on a separate disk, so that it remains
available in the event of the original disk failing. The original file copy and the mirror
copy are updated together and are thus fully synchronized. Read operations, however,
take place on the original copy. So, even though a mirrored disk system provides no
improvement in access speed performance or data transfer bandwidth, it offers high
reliability due to acting as a two-way parallel system. Only if both disks containing the
original file and its mirror copy fail will we have data loss. The drawback of 100%
redundancy in storage space is what motivated the development of subsequent RAID
schemes based on various forms of low-redundancy data encoding. Mirroring is
sometimes referred to as RAID level 1 (RAID1, for short), because it was the first form of
redundancy applied to disk arrays.

In disk striping, we divide our data into smaller pieces, perhaps all the way down to the
byte or bit level, and store the pieces on different disks, so that all those pieces can be
read out or written into concurrently. This increases the disk system’s read and write
bandwidth for large files, but has the drawback that all the said disks must be functional
for us to be able to recover or manipulate the file. The disk system in effect behaves as a
series system and will thus have a lower reliability than a single disk. Striping is
sometimes referred to as RAID level 0 (RAID0, for short), even though no data
redundancy is involved in it.
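A minimal sketch of byte-level striping (RAID0) makes the series-system behavior concrete: writing spreads bytes round-robin over the drives, and reading back requires every drive to be present. The striping unit (one byte) and the function names are our choices for illustration; real arrays stripe in much larger units.

```python
def stripe(data: bytes, n_disks: int) -> list[bytes]:
    """Spread data round-robin across n_disks, one byte per turn."""
    disks = [bytearray() for _ in range(n_disks)]
    for i, byte in enumerate(data):
        disks[i % n_disks].append(byte)
    return [bytes(d) for d in disks]

def unstripe(disks: list[bytes]) -> bytes:
    """Reassemble the original data; every stripe must be intact."""
    out = bytearray()
    for i in range(max(len(d) for d in disks)):
        for d in disks:
            if i < len(d):
                out.append(d[i])
    return bytes(out)

stripes = stripe(b"ABCDEFG", 3)         # [b"ADG", b"BE", b"CF"]
assert unstripe(stripes) == b"ABCDEFG"  # needs all 3 stripes
```

Losing any one stripe loses the file, which is exactly why RAID0 by itself has lower reliability than a single disk even as it raises bandwidth.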


16.3 Data Encoding Schemes

The key idea for making disk arrays reliable is to spread each data block across multiple
disks in encoded form, so that the failure of one (or a small number) of the disk drives
does not lead to data loss. Many different encoding schemes have been tried or proposed
for this purpose. One feature that makes the encoding simpler compared with arbitrary
error-correcting codes is the fact that standard disk drives already come with strong error-
detecting and error-correcting codes built in. It is extremely unlikely, though not totally
impossible, for a data error on a specific disk drive to go undetected. This possibility is
characterized by the disk’s bit error rate (BER), which for modern disk drives is on the
order of 10^–15. This vanishingly small probability of reading corrupt data off a disk
without detecting the error can be ignored for most practical applications, leading to the
assumption that disk data errors can be detected with perfect accuracy.

So, for all practical purposes, disk errors at the level of data access units (say sectors) are
of the erasure kind, rendering codes capable of dealing with erasure errors adequate for
encoding of data blocks across multiple disks. The simplest such code is duplication.
Note that duplication is inadequate for error correction with arbitrary errors. In the latter
case, when the two data copies disagree, there is no way of finding out which copy is in
error. However, if one of the two copies is accidentally lost or erased, then the other copy
supplies the correct data. Simple parity or checksum schemes can also be used for
erasure-error correction. Again, with arbitrary (inversion) errors, a parity bit or checksum
can only detect errors. The same parity bit or checksum, however, can be used to
reconstruct a lost or erased bit/byte/word/block.

Fig. 16.parity Parity-coded bits and blocks of data across multiple disks.
There are 4 data bits/bytes/words/blocks, followed by a
parity bit/byte/word/block.


Whether parity encoding is done with bits or blocks of data, the concepts are the same, so
we proceed with the more practical block-level parity encoding. The parity block P for
the four blocks A, B, C, and D of data depicted in Fig. 16.parity is obtained as:

P=ABCD (16.3.prty)

Then, if one of the blocks, say B, is lost or erased, it can be rebuilt thus:

B=ACDP (16.3.rbld)

Note that if the disk drive holding the block B becomes inaccessible, one needs to read
the entire contents of the other four disks in order to reconstruct all the lost blocks. This is
a time-consuming process, raising the possibility that a second disk fails before the
reconstruction is complete. This is why it may be desirable to include a second coding
scheme to allow the possibility of two disk drive failures. For example, we may use a
second coding scheme in which the check block Q is derived from the data blocks A, B,
C, and D in a way that is different from P:

Q = g(A, B, C, D) (16.3.chk2)

Then, the data will continue to be protected even during the rebuilding process after the
first disk failure.
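Equations (16.3.prty) and (16.3.rbld) can be sketched directly in code: the very same XOR routine both generates the parity block and rebuilds an erased one. The block contents below are arbitrary illustrative bytes.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks (the parity operation)."""
    p = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            p[i] ^= b
    return bytes(p)

# Four data blocks and their parity block, as in Fig. 16.parity:
A, B, C, D = b"\x0f\x0f", b"\xf0\x00", b"\x33\x33", b"\x55\x55"
P = xor_blocks([A, B, C, D])

# Erasure recovery: XOR of the survivors and the parity gives back B.
assert xor_blocks([A, C, D, P]) == B
```

This works only because the drive's own CRC turns an unknown error into an erasure at a known position; with the failed position known, a single parity block suffices for recovery.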


16.4 RAID and Its Levels

We have already alluded to RAID0 (provides high performance, but no error tolerance),
RAID1 (provides high data integrity, but is an overkill in terms of redundancy), and
RAID10 (more accurately, RAID1/0) as precursors of the more elaborate RAID2-RAID6
schemes to be discussed in this section.

Fig. 16.levels Schematic representation of key RAID levels and their


associated redundancy schemes.


16.5 Disk Array Performance

This section to be based on the following slides:


16.6 Disk Array Reliability Modeling

This section to be based on the following slides and notes.


RAID Reliability calculations

[The following two links no longer work. I am looking for equivalent replacements]

http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations/

http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations/

From the website pointed to by the latter link, I find the following for Seagate disks

MTTF = 1.0-1.5 M hr

Bit error rate = 1 in 10^14 to 1 in 10^15

Using the data above, the poster finds mean time to data loss as follows:

SAS RAID5 = 53 000 yr

SATA RAID6 = 6.7 M yr
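Mean-time-to-data-loss figures like those above come out of simple Markov-style approximations. The sketch below uses the classical textbook estimates for RAID5 and RAID6, assuming independent exponential disk lifetimes, a repair time MTTR, and no unrecoverable read errors during rebuild; the array width and rebuild time plugged in are our assumptions, not the poster's.

```python
HOURS_PER_YEAR = 8766                  # 365.25 days

def mttdl_raid5_hr(n, mttf, mttr):
    """MTTDL for an n-disk RAID5: data loss requires a second
    failure among the n - 1 survivors during the repair window."""
    return mttf**2 / (n * (n - 1) * mttr)

def mttdl_raid6_hr(n, mttf, mttr):
    """MTTDL for an n-disk RAID6: three overlapping failures."""
    return mttf**3 / (n * (n - 1) * (n - 2) * mttr**2)

# 8 disks, MTTF = 1.0 M hr, 24-hr rebuild:
print(mttdl_raid5_hr(8, 1e6, 24) / HOURS_PER_YEAR)  # roughly 85 000 yr
print(mttdl_raid6_hr(8, 1e6, 24) / HOURS_PER_YEAR)  # roughly 6e8 yr
```

The orders of magnitude are consistent with the figures quoted above; exact values depend on array width, rebuild time, and whether bit errors encountered during rebuild are modeled.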

Fig. 16.prdcts Typical RAID products.


Problems

16.1 System with two disk drives


Discuss the pros and cons of using two identical disk drives as independent storage units, in a RAID0
configuration, or in a RAID1 configuration.

16.2 Hybrid RAID architectures


Besides RAID 10, discussed in Section 16.4, there are other hybrid RAID architectures. Perform an Internet
search to identify two such hybrid architectures, describe each one briefly, and compare the two with
respect to cost and reliability.

16.3 Disk array reliability evaluation


A computer’s input/output system is composed of the following components, with the given MTTF values:
(1) Four disk drives, at 0.5M hr each; (2) Four IDE cables, at 2M hr each; (3) One IDE controller, at 5M hr;
(4) A power supply, at 0.1M hr. Assume exponential distribution of component lifetimes and similar disk
controller complexities for all systems.
a. Derive the MTTF and mean time to data loss (MTDL) for this I/O system.
b. Expanding storage needs forces the computer system manager to double the number of disk drives
and to use RAID0 for organizing them. What are the new system’s MTTF and MTDL?
c. Reliability concerns motivate the computer system manager to convert to a 6-disk RAID3
system, with one parity disk and a hot spare disk. Reconstructing data on the spare disk takes 2 hr.
Calculate the new system’s MTTF and MTDL.
d. Some time later, the computer system manager decides to switch to a RAID5 configuration,
without adding any storage space. Does this action improve or degrade the MTDL?

16.4 Disk failure rate modeling


The figure below plots the expected remaining time to the next failure in a cluster node as a function of
time elapsed since the last failure, with both times given in minutes [PDSI09].

a. Assuming that the situation for disk memories is similar, why does this data refute the assumption
that disk lifetimes are exponentially distributed?
b. Which distribution is called for in lieu of exponential distribution?

16.5 Disk array reliability


Suppose each disk in a disk array with d data disks and r parity (redundant) disks fails independently with
probability p over a 1-year period. Assume that the control system never fails and ignore the recovery time.
a. Consider a disk array with d = 2 and r = 1. What is the expected number of disk failures in a year?
b. What is the reliability of the disk array of part a over a 1-year period?
c. Consider a disk array with d = 3 and r = 2. What is the expected number of disk failures in a year?


d. What is the reliability of the disk array of part c over a 1-year period?
e. For p = 0.1, which is more reliable: The disk array of part a or that of part c?
f. For p = 0.6, which is more reliable: The disk array of part a or that of part c?

16.6 RAID 5 with different parity group sizes


We have 11 identical disks and want to build a RAID system with an effective capacity equivalent to 8
disks. Discuss the pros and cons of the following schemes (both using the RAID 5 architecture) in terms of
data loss probability and ease/latency of data reconstruction.
a. Nine disks are used as a single parity group and 2 disks are designated as spares.
b. Ten disks are arranged into two separate parity groups of 5 disks each and 1 disk is designated as
spare.


References and Further Readings


[Chen94] Chen, P. M., E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, “RAID: High-
Performance Reliable Secondary Storage,” ACM Computing Surveys, Vol. 26, No. 2, pp. 145-
185, June 1994.
[Corn12] Cornwell, M., “Anatomy of a Solid-State Drive,” Communications of the ACM, Vol. 55, No.
12, pp. 59-63, December 2012.
[CW13] Computer Weekly, “SSD RAID Essentials; What You Need to Know about Flash and RAID.”
http://www.computerweekly.com/feature/SSD-RAID-essentials-What-you-need-to-know-
about-flash-and-RAID
[Eler07] Elerath, J., “Hard Disk Drives: The Good, the Bad and the Ugly,” ACM Queue, Vol. 5, No. 6,
pp. 28-37, September-October 2007.
[Feng05] Feng, G.-L., R. H. Deng, F. Bao, and J.-C. Shen, “New Efficient MDS Array Codes for
RAID—Part I: Reed-Solomon-Like Codes for Tolerating Three Disk Failures,” IEEE Trans.
Computers, Vol. 54, No. 9, pp. 1071-1080, September 2005.
[Feng05a] Feng, G.-L., R. H. Deng, F. Bao, and J.-C. Shen, “New Efficient MDS Array Codes for
RAID—Part II: Rabin-Like Codes for Tolerating Multiple (4) Disk Failures,” IEEE Trans.
Computers, Vol. 54, No. 12, pp. 1473-1483, December 2005.
[Gray00] Gray, J. and P. Shenoy, “Rules of Thumb in Data Engineering,” Microsoft Research MS-TR-
99-100, revised ed., March 2000.
[Grib16] Gribaudo, M., M. Iacono, and D. Manini, “Improving Reliability and Performances in Large
Scale Distributed Applications with Erasure Codes and Replication,” Future Generation
Computer Systems, Vol. 56, pp. 773-782, March 2016.
[Jere11] Jeremic, N., G. Muhl, A. Busse, and J. Richling, “The Pitfalls of Deploying Solid-State Drive
RAIDs,” Proc. 4th Int’l Conf. Systems and Storage, Article No. 14, 2011.
[Patt88] Patterson, D. A., G. Gibson, and R. H. Katz, “A Case for Redundant Arrays of Inexpensive
Disks (RAID),” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 109-116, June 1988.
[PDSI09] Petascale Data Storage Institute, “Analyzing Failure Data,” project Web site, accessed on
November 8, 2012: http://www.pdl.cmu.edu/PDSI/FailureData/index.html
[Plan97] Plank, J. S., “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like
Systems,” Software: Practice and Experience, Vol. 27, No. 9, pp. 995-1012, 1997.
[Schr07] Schroeder, B. and G. A. Gibson, “Understanding Disk Failure Rates: What Does an MTTF of
1,000,000 Hours Mean to You?” ACM Trans. Storage, Vol. 3, No. 3, Article 8, 31 pp., October
2007.
[Thom09] Thomasian, A. and M. Blaum, “Higher Reliability Redundant Disk Arrays: Organization,
Operation, and Coding,” ACM Trans. Storage, Vol. 5, No. 3, pp. 7-1 to 7-59, November 2009.
[WWW1] News and reviews of storage-related products, http://www.storagereview.com/guide/index.html
(This link is obsolete; it will be replaced by a different resource in future).
[WWW2] Battle Against Any RAID-F (Free, Four, Five), http://www.baarf.com (accessed on November
8, 2012).


V Malfunctions: Architectural Anomalies


[Sidebar: the states of the multilevel model (Ideal, Defective, Faulty, Erroneous,
Malfunctioning, Degraded, Failed), with the malfunctioning state highlighted.]

“The Internet treats censorship as a malfunction and routes around it.”
John Perry Barlow

“. . . the technical demands of modern warfare are so complex [that] a
considerable percentage of our material is bound to malfunction even
before it is deployed against a foe.”
Ernest K. Gann, The Black Watch
Ernest K. Gann, The Black Watch

Chapters in This Part


17. Malfunction Diagnosis
18. Malfunction Tolerance
19. Standby Redundancy
20. Robust Parallel Processing

A system moves from erroneous to malfunctioning state when an error affects the
functional behavior of some constituent subsystem. Design or implementation
flaws can lead to malfunctions directly, even in the absence of errors. At the
architectural level, malfunctions have manifestations similar to those of faults
occurring at the logic level. One difference is that instead of pass/not-pass testing
to detect the presence of malfunctions, we tend to use diagnostic testing to detect
and locate the offending modules. We then concern ourselves with methods to
tolerate such malfunctions via redundancy and reconfiguration. Because modules
interacting at the architectural level tend to be rather complex, standby or
dynamic redundancy is often preferred to massive or static redundancy. Making
either type of redundancy work, however, is nontrivial in view of difficulties in
synchronization and maintenance of data integrity during and after switchovers.
We conclude this part with a discussion of malfunction tolerance by means of
robustness features in parallel processing systems.


17 Malfunction Diagnosis
“If information systems fail or seriously malfunction, societal
activities lose support, and this may sometimes result in
uncontrollable chaos in society as a whole.”
H. Inose and J. R. Pierce, Information
Technology and Civilization

“In diagnosis think of the easy first.”


Martin H. Fischer

Topics in This Chapter


17.1. Self-Diagnosis in Subsystems
17.2. Malfunction Diagnosis Models
17.3. One-Step Diagnosability
17.4. Sequential Diagnosability
17.5. Diagnostic Accuracy and Resolution
17.6. Other Topics in Diagnosis

A malfunctioning subsystem must be identified and isolated quickly and


effectively, in order to channel its impact toward a service-level degradation (soft
failure) rather than a result-level breach (hard failure). Malfunction diagnosis,
which is sometimes referred to as “system-level fault diagnosis” in the literature,
encompasses a spectrum of techniques, from self-assessment, through hierarchical
or stepwise testing, to cooperative diagnosis with centralized or distributed
control. Unlike fault testing at the logic level, malfunction diagnosis entails the
determination of not just the occurrence of a malfunction but also the identity or
location of the offending subsystem.


17.1 Self-Diagnosis in Subsystems

Modern computer systems have processing capabilities that are distributed among
multiple modules, even when there is only one “processor” in the system. Examples of
units with processing capabilities include graphics cards, network interfaces, input/output
channels, and device controllers. Furthermore, multiple CPUs are being employed to
provide the required computational power in a wide spectrum of systems, given the
marked slowdown in clock frequency improvements and the greater energy efficiency of
slower processors. Thus, it makes sense to try to use these capabilities in performing
cross-diagnostic checks among such modules.

Self-diagnostic checks are quite common. When you turn on your desktop or laptop
computer, a diagnostic check is run to verify the correct functioning of major subsystems,
including the CPU, memory, disk drive, and various interfaces. The check is not
exhaustive but is intended to catch most common problems. In the context of dependable
computing, we need a bit more coverage than such quick sanity checks.

In our discussion of fault testing in Chapter 9, we assumed that a special tester unit
applied test patterns to the circuit under test and used the circuit’s outputs to render a
judgment about its health. In the case of intermodule diagnostic testing, this approach may
prove impractical, given that the complexity of the modules involved would generate an
extremely heavy volume of data being passed between them. One way around this
difficulty is for the tester to initiate a self-diagnostic process in another module to
determine whether it is working properly. This self-diagnostic should not have a yes/no
answer, because such a binary outcome would increase the chances of a malfunctioning
module generating a “yes” answer.

Here is a workable strategy. The initiator supplies the module under test with a “seed
value” to be used in the diagnostic process. The self-diagnosing unit uses the supplied
seed as an argument of an extensive computation that exercises nearly all of its
components, including memory resources, ending up with an “answer” that is a known
function of the seed value. It is this answer that is returned to the initiator as the
diagnostic outcome. Given this outcome, it is then easy for the test initiator to compare
the returned value with the expected result and to deduce whether the reporting unit is
healthy. With a 64-bit diagnostic outcome, say, it is quite unlikely for a malfunctioning unit
to accidentally produce the correct result.
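
The seed-based strategy just described can be sketched as follows. The hash-based computation and the function names here are stand-ins of my own devising, not the book's; a real self-diagnostic routine would exercise the module's ALU, registers, and memory directly.

```python
import hashlib

def self_diagnostic(seed: int, memory: bytes) -> int:
    # Stand-in for an extensive computation: fold all of memory into a
    # digest seeded by the initiator's value. A real routine would also
    # exercise the ALU, registers, and other module resources.
    h = hashlib.sha256(seed.to_bytes(8, "little"))
    for i in range(0, len(memory), 8):
        h.update(memory[i:i + 8])
    return int.from_bytes(h.digest()[:8], "little")   # 64-bit outcome

def initiator_check(seed: int, reported: int, reference: bytes) -> bool:
    # The initiator knows the expected answer for this seed and compares.
    return reported == self_diagnostic(seed, reference)

mem = bytes(range(256)) * 4
outcome = self_diagnostic(0x1234, mem)
print(initiator_check(0x1234, outcome, mem))        # True: healthy unit
print(initiator_check(0x1234, outcome ^ 1, mem))    # False: corrupted answer
```

Because the 64-bit answer depends on the seed, a malfunctioning unit cannot simply replay a stored "pass" response.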


Note that when the health of a unit is suspect, there is no reason to trust its ability to
execute the very instructions that constitute the self-diagnostic routine. For this reason,
we often use a layered approach to self-diagnosis. At the beginning of the process, a
small core of the module is tested. This core may be in charge of executing some very
simple instructions and may have very limited memory and other resources. Once it has
been established that the core can be trusted with regard to its health, the circle of trust is
gradually extended to other parts of the module, in each phase using trusted parts whose
health has been previously established to test new parts.

Fig. 17.1 Layered approach to self-diagnosis begins by verifying the


health of a small core and then gradually expands the circle
of trust to subunits A, B, and C.


17.2 Malfunction Diagnosis Models

The PMC system-level diagnosis model used in this book is due to Preparata, Metze, and
Chien [Prep67]. The system under diagnosis consists of a set of modules for which we
have defined a testing graph (Fig. 17.2a). A directed edge from Mi to Mj in the testing
graph indicates that module Mi can test module Mj. Note that the testing graph may or
may not correspond to the actual physical connectivity among the modules. For example,
the four modules of Fig. 17.2a may be interconnected by a bus, thus making it possible
for each of them to test any other module. In this case, the testing graph is a proper
subgraph of the complete graph characterizing the physical connectivity among the
modules. Among reasons for selecting a subset of the available physical links to form a
testing graph is the desire to reduce the communications overhead and to limit the module
workloads that result from administering self-diagnostic tests and interpreting the results.

The diagnosis verdicts Dij ∈ {0, 1}, with 0 meaning “pass” and 1 representing “fail,”
form an n × n Boolean diagnosis matrix D, which is usually quite sparse. In particular,
the diagonal entries of D are never used (unit i does not judge itself). We assume that a
good unit always renders a correct judgment about other units, that is, tests have perfect
coverage and no false alarms, but that a verdict rendered by a malfunctioning unit is
arbitrary and cannot be trusted. Note that the PMC model, which we use for malfunction
diagnosis, is often referred to as a model for “system-level fault diagnosis.” Following our
terminology, a system-level fault is referred to as a malfunction.


(a) Testing graph (b) Pass outcome of a test (c) Fail outcome of a test

Fig. 17.2 System-level testing graph and the two possible test
outcomes when module i tests module j.


Example 17.1: Interpreting test outcomes Consider the following diagnosis matrix D for the
4-module system of Fig. 17.2a, in which dashes denote lack of testing. If we know that no more
than one module can be malfunctioning, interpret the test outcomes in D.
        | −  0  −  − |
    D = | −  −  0  0 |
        | 1  −  −  − |
        | 1  −  0  − |

Solution: Module 0 is judged by both M2 and M3 to be malfunctioning (D2,0 = D3,0 = 1). If M0


were good, then both M2 and M3 would have to be malfunctioning, which, by assumption, can’t be
the case. On the other hand, M0 malfunctioning and the other 3 modules being good are consistent
with all the test results in D. Thus, with the assumption of no more than one malfunctioning
module, M0 must be the culprit.
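
The deduction in Example 17.1 can be mechanized: enumerate all malfunction sets of size at most one and keep those consistent with the matrix, using the PMC rules that a good tester's verdict must match the testee's true state while a bad tester's verdict constrains nothing. The following Python sketch is illustrative and not from the text.

```python
from itertools import combinations

# Test links of Fig. 17.2a with the outcomes of Example 17.1 (0 = pass, 1 = fail)
tests = {(0, 1): 0, (1, 2): 0, (1, 3): 0, (2, 0): 1, (3, 0): 1, (3, 2): 0}

def consistent(bad, tests):
    # A good tester's verdict must match the testee's true state; a bad
    # tester's verdict is arbitrary and thus constrains nothing.
    return all(i in bad or d == (j in bad) for (i, j), d in tests.items())

candidates = [set(c) for k in range(2) for c in combinations(range(4), k)]
print([sorted(b) for b in candidates if consistent(b, tests)])   # [[0]]
```

Only {M0} survives, confirming the reasoning of the example.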

The PMC model is but one of the models available for malfunction diagnosis, but it is
widely used and highly suitable for the points we want to make regarding key notions of
system-level diagnosis. Other malfunction diagnosis models include the comparison
model [Maen81], in which an observer assigns tasks to processors and draws conclusions
regarding their health by comparing the results they return.


17.3 One-Step Diagnosability

A system characterized by a testing graph is said to be one-step diagnosable with respect


to a predefined set of malfunction patterns if any pattern of malfunctions from the given
set is correctly diagnosable from the resulting diagnosis matrix in every case. The
qualifier “one-step” implies that a single round of testing, albeit potentially involving
many tests, suffices for reaching a correct diagnosis decision. In particular, a system is
one-step t-diagnosable if correct diagnosis is possible for any set of t malfunctioning
modules. We often delete the default qualifier “one-step,” and call such a system t-
diagnosable. With this terminology, the system in Fig. 17.2a is (one-step) 1-diagnosable.

To establish 1-diagnosability, we consider all possible single malfunctions and verify that
the resulting syndromes (sets of diagnosis outcomes) are distinct from each other and
from the no-malfunction case, regardless of how the malfunctioning unit behaves in its
assessments of other modules.

Example 17.synd: One-step 1-diagnosability Find the syndromes for the system of Fig. 17.2a
and show that it is one-step 1-diagnosable but not 2-diagnosable.

Solution: Syndromes for the system of Fig. 17.2a are listed in Fig. 17.synd-a, with the results
confirming the system’s 1-diagnosability, given that the rows corresponding to different single
malfunctions differ from each other and from the malfunction-free case in at least one position. On
the other hand, we see that the system of Fig. 17.2a is not 2-diagnosable, because the syndrome for
the double malfunction {M0, M1} may be indistinguishable from the single malfunction {M0} in
some cases. Given the restriction to at most one malfunctioning unit, the syndrome dictionary in
Fig. 17.synd-b allows us to translate the observed 6-bit syndrome {D01, D12, D13, D20, D30, D32} to
the identity of the malfunctioning unit, if any.

(a) Syndromes for single and a few double malfunctions (b) Syndrome dictionary

Fig. 17.synd Diagnosis syndromes and syndrome dictionary for single


malfunctions.
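
The exhaustive analysis of Example 17.synd can be automated: compute, for each malfunction set, the full set of syndromes it can produce under the PMC rules, then check the syndrome sets for pairwise disjointness. A sketch (illustrative, not from the text):

```python
from itertools import product

# Test links of Fig. 17.2a, in the order (D01, D12, D13, D20, D30, D32)
tests = [(0, 1), (1, 2), (1, 3), (2, 0), (3, 0), (3, 2)]

def syndrome_set(bad):
    # All syndromes a malfunction set can produce: a good tester reports the
    # testee's true state, while a bad tester's verdict is arbitrary.
    free = [k for k, (i, _) in enumerate(tests) if i in bad]
    base = [int(j in bad) for _, j in tests]
    out = set()
    for bits in product([0, 1], repeat=len(free)):
        s = base[:]
        for pos, v in zip(free, bits):
            s[pos] = v
        out.add(tuple(s))
    return out

singles = [set(), {0}, {1}, {2}, {3}]
syn = [syndrome_set(b) for b in singles]
# 1-diagnosable: the five syndrome sets are pairwise disjoint
print(all(syn[a].isdisjoint(syn[b]) for a in range(5) for b in range(a + 1, 5)))
# Not 2-diagnosable: {M0, M1} can produce the same syndrome as {M0} alone
print(syndrome_set({0, 1}).isdisjoint(syndrome_set({0})))   # False
```

The first check prints True, confirming 1-diagnosability; the overlap in the second check is exactly the {M0, M1} vs. {M0} ambiguity noted in the example.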


General results can be obtained for 1-step t-diagnosability that can be applied in some
cases in lieu of an exhaustive analysis of the kind done in Example 17.synd.

Theorem 17.td-nec (Necessary conditions for 1-step t-diagnosability): An n-unit system


is 1-step t-diagnosable only if: (1) n ≥ 2t + 1, that is, one-step diagnosability requires the
bad modules to be in the minority; (2) each module is tested by at least t other modules.

The second necessary condition in Theorem 17.td-nec becomes a sufficient condition if
we add the restriction that no two modules test each other. Intuitively, the absence of
mutual testing facilitates the deduction process: when mutual testing is allowed, two
malfunctioning units can potentially vindicate each other, making an error in judgment
during the decision process more likely.

Theorem 17.td-suff (A sufficient condition for 1-step t-diagnosability): An n-unit system


in which no two units test one another is 1-step t-diagnosable iff each unit is tested by at
least t other units.

Based on Theorem 17.td-nec, the 4-module system of Fig. 17.2a can never become 2-
diagnosable, regardless of how many links we add to the testing graph. By Theorem
17.td-suff, the same system is 1-diagnosable.
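
Theorems 17.td-nec and 17.td-suff translate into a quick check on a testing graph. The sketch below (with a hypothetical testers_of representation of my own choosing) computes the largest t the necessary conditions allow and tests whether the no-mutual-testing restriction holds:

```python
def max_one_step_t(n, testers_of):
    # testers_of[j] = set of units that test unit j (the testing graph).
    # Necessary (Theorem 17.td-nec): n >= 2t + 1 and in-degree >= t for
    # every unit; these bounds are also sufficient (Theorem 17.td-suff)
    # when no two units test each other.
    min_indegree = min(len(testers_of[j]) for j in range(n))
    return min((n - 1) // 2, min_indegree)

def no_mutual_testing(n, testers_of):
    return not any(i in testers_of[j] and j in testers_of[i]
                   for i in range(n) for j in range(i + 1, n))

# Fig. 17.2a: M0 tests M1; M1 tests M2, M3; M2 tests M0; M3 tests M0, M2
testers = {0: {2, 3}, 1: {0}, 2: {1, 3}, 3: {1}}
print(max_one_step_t(4, testers), no_mutual_testing(4, testers))   # 1 True
```

For the system of Fig. 17.2a this yields t = 1 with no mutual testing, matching the conclusion in the text.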

The diagnosability problem has a lot in common with a collection of popular puzzles
about liars and truth-tellers. Consider the following setting. You visit an island whose
inhabitants are from two tribes: members of one tribe (“liars”) consistently lie; members
of the other tribe (“truth-tellers”) always tell the truth. Members of the two tribes are
indistinguishable to us, but they can recognize each other. The puzzles then ask us various
questions about how to deduce the truth about various situations from the unreliable
responses we receive. In fact, malfunction diagnosis corresponds to an extended, more
challenging, version of these puzzles in which a third tribe (“randoms”) is introduced.
Healthy modules correspond to “truth-tellers,” because they correctly diagnose other
modules. Malfunctioning modules correspond to “randoms,” because their judgments are
unrelated to the truth. Interestingly, liars aren’t as hard to deal with as randoms, because
their consistency provides us with more information.


In dealing with 1-step diagnosability in a collection of interconnected modules, we are


faced with two kinds of problems: analysis and synthesis. The analysis problem is itself
of two kinds.

Problem 17.a1 (the extent of 1-step t-diagnosability): Given a directed graph defining
the test links, find the largest value of t for which the system is 1-step t-diagnosable.

The foregoing problem is easy if no two units test one another and fairly difficult if
mutual testing is allowed. There exists a vast amount of published research dealing with
Problem 17.a1.

Problem 17.a2 (1-step malfunction diagnosis): Given a directed graph defining the test
links and a set of test outcomes, identify all malfunctioning units, assuming there are no
more than t such units.

Problem 17.a2, which arises when we want to repair or reconfigure a system using test
outcomes, is solved via table lookup or analytical methods.

The synthesis problem associated with 1-step diagnosability is as follows.

Problem 17.s (connection assignment for 1-step t-diagnosability): Specify the test links
that would make an n-unit system 1-step t-diagnosable, using as few test links as
possible.

As an example, a degree-t directed chordal ring, in which nodes are numbered 0 to n – 1


and node i tests the t nodes i + 1, i + 2, … , i + t (all expressions being mod n) has the
required property.
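
As an illustrative sketch, the chordal-ring construction and its properties are easy to verify programmatically:

```python
def chordal_ring_tests(n, t):
    # Node i tests nodes i+1, ..., i+t (mod n): each node is then tested
    # by exactly t others, and for n >= 2t + 1 no two nodes test each
    # other, so Theorem 17.td-suff gives 1-step t-diagnosability.
    return {(i, (i + k) % n) for i in range(n) for k in range(1, t + 1)}

links = chordal_ring_tests(7, 3)        # n = 2t + 1 = 7, t = 3
tested_by = {j: {i for (i, jj) in links if jj == j} for j in range(7)}
print(sorted(len(s) for s in tested_by.values()))  # [3, 3, 3, 3, 3, 3, 3]
print(any((j, i) in links for (i, j) in links))    # False: no mutual tests
```

With n = 7 and t = 3, each node is tested by exactly three others and no pair tests mutually, so the construction meets the sufficient condition with the minimum number of nodes.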


A straightforward process for solving Problem 17.a2 is given in Algorithm 17.diag1,


which uses exhaustive search with backtracking.

Algorithm 17.diag1 An O(n^3)-step diagnosis algorithm


Input: The testing graph and a diagnosis matrix
Output: Every unit labeled G (good) or B (bad)
while some unit remains unlabeled do
choose an unlabeled unit and label it G or B
use labeled units to label as many other units as possible
if the new label leads to a contradiction
then backtrack
endif
endwhile
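
A direct Python rendering of this search idea follows; it is a sketch of my own, not the book's implementation. Note that this naive formulation can backtrack exponentially in the worst case; the O(n^3) bound requires a more careful choice of which unit to label next.

```python
def diagnose(n, tests, t):
    # Backtracking search in the spirit of Algorithm 17.diag1: label each
    # unit G (good) or B (bad), pruning any partial labeling that either
    # exceeds t bad units or contradicts a test outcome. Only the verdict
    # of a unit labeled G constrains anything.
    def ok(label):
        return all(d == (label[j] == 'B')
                   for (i, j), d in tests.items()
                   if label.get(i) == 'G' and j in label)

    def search(u, label, bad):
        if u == n:
            return dict(label)
        for verdict, cost in (('G', 0), ('B', 1)):
            if bad + cost > t:
                continue
            label[u] = verdict
            if ok(label):
                result = search(u + 1, label, bad + cost)
                if result is not None:
                    return result
            del label[u]
        return None

    return search(0, {}, 0)

# Test outcomes of Example 17.1 (0 = pass, 1 = fail), at most t = 1 bad unit
tests = {(0, 1): 0, (1, 2): 0, (1, 3): 0, (2, 0): 1, (3, 0): 1, (3, 2): 0}
print(diagnose(4, tests, 1))   # {0: 'B', 1: 'G', 2: 'G', 3: 'G'}
```

On the matrix of Example 17.1 the search correctly labels M0 as the lone bad unit.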

A somewhat more efficient O(n^2.5)-step diagnosis algorithm is as follows. From the
original testing graph, derive an L-graph that has the same set of nodes and a link from
node i to node j iff node i can be assumed to be malfunctioning when node j is known to
be good. The unique minimal vertex cover of the L-graph, that is, a subset of its nodes
that contains at least one endpoint of each edge, corresponds to the set of t or
fewer malfunctioning units.


17.4 Sequential Diagnosability

In Example 17.synd, we established that the system of Fig. 17.2a is not 2-diagnosable,
because the syndrome for the double malfunction {M0, M1} is potentially indistinguishable
from that of the single malfunction {M0}. We may note that a common syndrome for the two
malfunction patterns just listed does provide some useful diagnostic information: that M0
is definitely bad. So, we can potentially use this information to replace or repair M0
before further testing to identify other malfunctioning modules. This observation leads us
to the notion of sequential diagnosability, which means that the test syndrome points
unambiguously to at least one malfunctioning unit. Assuming that we began with k
malfunctions, replacing or repairing one bad unit leaves us with no more than k – 1
malfunctions, thus reducing the diagnosis problem to a simpler one. Iterating in this
manner allows us to identify all k malfunctioning units in k or fewer rounds.

Example 17.seqd1: Sequential diagnosability Show that the system of Fig. 17.2a isn’t
sequentially 2-diagnosable.

Solution: The desired result is readily established by noting that the syndromes for the
malfunction sets {M0, M2} and {M1, M3}, that is, (x 1 0 x 1 1) and (1 x x 0 x x), where x denotes
0/1 and test results are listed in the order shown in Fig. 17.synd-a, can be identical.

In fact, the result of Example 17.seqd1 could have been established based on the following
general theorem.

Theorem 17.seqd (A necessary condition for sequential diagnosability): An n-unit


system is sequentially t-diagnosable only if n ≥ 2t + 1, that is, sequential diagnosability is
possible only if the bad modules are in the minority.

Example 17.seqd2: Sequential diagnosability of a unidirectional ring Show that a system


whose testing graph is a unidirectional or directed ring is sequentially 2-diagnosable, but not one-
step 2-diagnosable or sequentially 3-diagnosable.

Solution: As an example, consider the 5-node directed ring of Fig. 17.dring-a. Possible test
outcomes for single and some double malfunctions are depicted in Fig. 17.dring-b. We note
that even though some of the syndromes shown overlap, they all point to M0 being malfunctioning.
So, under sequential diagnosability, there is no ambiguity and we can replace M0 before
proceeding. The set of all syndromes that point to M0 being malfunctioning is shown in Fig.
17.dring-c.


It is possible to prove that an n-node directed ring is sequentially t-diagnosable for any t
satisfying (t^2 – 1)/4 + t + 2 ≤ n (see Problem 17.5), a result from which two parts of
Example 17.seqd2 follow.


(a) Directed ring (b) Diagnosis syndromes (c) Syndrome dictionary

Fig. 17.dring Diagnosis syndromes and syndrome dictionary for


sequential diagnosability of a 5-node directed ring.
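
The claims of Example 17.seqd2 for the 5-node directed ring can be checked by brute force under the PMC assumptions. The following sketch is illustrative and not from the text:

```python
from itertools import combinations, product

n, t = 5, 2
tests = [(i, (i + 1) % n) for i in range(n)]   # node i tests node i + 1 (mod 5)

def syndromes(bad):
    # Every syndrome the malfunction set `bad` can produce: good testers
    # report the truth; a bad tester's verdict ranges over both values.
    free = [k for k, (i, _) in enumerate(tests) if i in bad]
    base = [int(j in bad) for _, j in tests]
    for bits in product([0, 1], repeat=len(free)):
        s = base[:]
        for pos, v in zip(free, bits):
            s[pos] = v
        yield tuple(s)

explanations = {}             # syndrome -> all <= t-malfunction explanations
for k in range(t + 1):
    for bad in combinations(range(n), k):
        for s in syndromes(set(bad)):
            explanations.setdefault(s, []).append(set(bad))

# Sequentially 2-diagnosable: every realizable nonzero syndrome implicates at
# least one unit that is bad under ALL of its explanations.
seq_ok = all(set.intersection(*cands)
             for s, cands in explanations.items() if any(s))
# Not 1-step 2-diagnosable: some syndrome has two or more distinct explanations.
ambiguous = any(len(cands) > 1 for cands in explanations.values())
print(seq_ok, ambiguous)   # True True
```

The check confirms that while syndromes overlap (ruling out one-step 2-diagnosis), each one still pins down at least one definitely bad unit, which is all that sequential diagnosis requires.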

Analysis and synthesis problems for sequential diagnosability parallel those of 1-step
diagnosability, and have likewise been extensively studied.

Problem 17.a1-s (the extent of sequential diagnosability): Given a directed graph


defining the test links, find the largest value of t for which the system is sequentially t-
diagnosable.

Problem 17.a2-s (sequential malfunction diagnosis): Given a directed graph defining the
test links and a set of test outcomes, identify at least one malfunctioning unit (preferably
more), assuming there are no more than t such units.

Problem 17.s-s (connection assignment for sequential diagnosability): Specify the test
links that would make an n-unit system sequentially t-diagnosable, using as few test links
as possible.

An n-node ring, with n ≥ 2t + 1, with added test links from 2t – 2 other nodes to node 0
(besides node n – 1, which already tests it) has the required property.


17.5 Diagnostic Accuracy and Resolution

So far in our discussions of diagnosability, we have demanded full accuracy, in the sense
of requiring that all malfunctioning modules be identified (1-step diagnosability) or that
some malfunctioning modules, and only malfunctioning modules, be deduced (sequential
diagnosability) from the test outcomes. By relaxing these requirements, we may be able
to successfully diagnose systems that would be undiagnosable under the former, more
strict definitions.

An n-unit system is 1-step t/s-diagnosable if a set of no more than t malfunctioning units


can always be identified to within a set of s units, where s ≥ t. By allowing s – t healthy
units to be potentially included among those flagged for repair and replacement,
diagnosis may become simpler or possible in some cases. Note that our original notion of
1-step t-diagnosability does not correspond to the widely studied special case of 1-step
t/t-diagnosability in this new notation. The reason is that in t/t-diagnosability, it is
admissible to identify a set of t modules for replacement when there are in fact only t – 1
malfunctioning units.

Given the values of t and s, the problem of deciding whether a system is t/s-diagnosable
is co-NP-complete. However, there exist efficient, polynomial-time, algorithms to find
the largest integer t such that the system is t/t- or t/(t + 1)-diagnosable.

A similar relaxation in the diagnosis accuracy is possible for sequential diagnosis. An n-


unit system is sequentially t/r-diagnosable if from a set of up to t malfunctioning units, r
can be identified in one step, where r < t and the identified set contains at least one
malfunctioning unit.

Finally, we can integrate the notion of safety, where some malfunctions are promptly
detected but not necessarily diagnosed, into our definitions. Safe diagnosability implies
that up to t malfunctions are correctly diagnosed and up to u are detected, where u > t.
This kind of diagnosability/testability, which is reminiscent of combination error-
correcting/detecting codes, ensures that there is no danger of incorrect diagnosis for a
larger number of malfunctions (up to u), thus increasing system safety.


17.6 Other Topics in Diagnosis

Diagnosability results have been published for a great variety of regular interconnection
networks, such as the three topologies shown in Fig. 17.topo. Topologies like these have
been used in the design of many general-purpose parallel computers and special-purpose
architectures for high-performance computing. The specific examples shown in Fig.
17.topo are composed of degree-4 nodes and thus can be 1-step 4-diagnosable at best.
Proving that they indeed possess this level of diagnosability or deriving their level of
sequential diagnosability are active research areas.

What comes after malfunction diagnosis? The next three chapters take up this question,
dealing with methods of tolerating the identified malfunctions via redundancy and
reconfiguration.

(a) 2D torus (b) 4D hypercube (c) Chordal ring

Fig. 17.topo Some interconnection networks for parallel processing.


Problems

17.1 Diagnosability in a directed ring


Prove directly (i.e., by forming malfunction syndromes and comparing them with each other, rather than by
using general theorems about diagnosability) that an n-node directed ring network is 1-step 1-diagnosable
but not 1-step 2-diagnosable. Hint: Take advantage of symmetry to reduce the amount of work.

17.2 Malfunction diagnosis in a bidirectional ring


a. Show that a bidirectional ring with five or more nodes is one-step 2-diagnosable.
b. Prove that the result of part a is the strongest possible. In other words, a bidirectional ring cannot
be 3-diagnosable, and a ring with 3 or 4 nodes is not 2-diagnosable.

17.3 Diagnosability in a 2D torus


Consider 1-step t-diagnosability of a collection of subsystems interconnected as an m × m 2D torus network
in which node (i, j) in row i, column j, is connected to the four neighboring nodes (i ± 1, j) and (i, j ± 1),
where all arithmetic is modulo m and m > 4. Links are bidirectional and allow testing in either direction.
a. By defining a suitable testing graph, show that the 2D torus is at least 1-step 2-diagnosable (i.e., t ≥ 2).
b. Can you prove a stronger diagnosability result for the 2D torus? [Arak00]

17.4 Emergency notification system


An emergency notification system is built to detect any one of k possible hazardous conditions. It utilizes r
alarm units with sensors, where each alarm can detect a different subset of the k conditions.
a. Formulate this problem as a malfunction diagnosis problem and its associated graph
representation.
b. Assuming that alarms do not fail, under what conditions do the alarms collectively pinpoint which
hazardous conditions exist? Explain.
c. Now consider the possibility of failures in the alarms. Discuss the requirements for correct
identification of the existing hazardous condition(s).

17.5 Sequential diagnosability of directed rings


Prove that an n-node directed ring is sequentially t-diagnosable for any t satisfying (t^2 – 1)/4 + t + 2 ≤ n.

17.6 Diagnosability of swapped/OTIS networks


A swapped (aka OTIS) network based on an n-node basis network or graph G is built as follows [Parh05].
The network Sw(G) has n^2 nodes belonging to n different copies of G. Each copy of G is connected
internally as in the original basis network G. Additionally, node i in copy j of G is connected to node j in
copy i of G. The latter links are known as intercluster links, whereas the links of G are intracluster links.
Thus, the node degree of Sw(G) is one more than that of G. Study the diagnosability of swapped networks.

17.7 Diagnosability of biswapped networks


A biswapped network based on an n-node basis network or graph G is built as follows [Xiao10]. The
network Bsw(G) has 2n² nodes belonging to 2n different copies of G, half of the copies appearing in part 0
and half in part 1. Each copy of G is connected internally as in the original basis network G. Additionally,
node i in copy j of G in part k is connected to node j in copy i of G in part 1 – k. The latter links are known

Dependable Computing: A Multilevel Approach (B. Parhami, UCSB)


Last modified: 2020-11-12 351

as intercluster links, whereas the links of G are intracluster links. Thus, the node degree of Bsw(G) is one
more than that of G. The intercluster links define a bipartite subgraph; hence, the name “biswapped.” Study
the diagnosability of biswapped networks.

17.8 One-step 1-diagnosability


In example 17.synd, we established that the system depicted by the testing graph of Fig. 17.2a is 1-step 1-
diagnosable.
a. Remove a minimum number of test links from the graph so that the system it represents is no
longer 1-step 1-diagnosable.
b. What is the maximum number of links that can be removed from the testing graph so that the
system it represents remains 1-step 1-diagnosable? Fully justify your answer.

17.9 One-step 1-diagnosability


Why does the necessary and sufficient condition “each unit being tested by at least t other units” along with
“no two units testing each other” guarantee the satisfaction of condition 1 in Theorem 17.td-nec?

17.10 Sequential diagnosability of directed rings


Prove that the sufficient condition (t² – 1)/4 + t + 2 ≤ n for an n-node directed ring being sequentially t-
diagnosable implies n ≥ 2t + 1, but that the necessary condition n ≥ 2t + 1 is not sufficient for an n-node
directed ring to be sequentially t-diagnosable.

17.11 A necessary and sufficient condition for t-diagnosability


Prove that a graph G is t-diagnosable iff it contains a star of order t at each node v (consisting of v
connected to nodes x1, x2, … , xt), with each node xi connected to a distinct node yi. [Hsu07]

17.12 A sufficient condition for t/t-diagnosability


Prove that a k-connected regular graph is t/t-diagnosable for t = 2k – 2 – g, where g is the maximum
number of neighbors shared by adjacent nodes. [Hao16]


References and Further Readings


[Arak00] Araki, T. and Y. Shibata, “Diagnosability of Networks Represented by the Cartesian Product,”
IEICE Trans. Fundamentals, Vol. E83-A, No. 3, pp. 465-470, March 2000.
[Butl08] Butler, R. W., “A Primer on Architectural Level Fault Tolerance,” NASA Technical
Memorandum TM-2008-215108, 48 pp., February 2008.
[Chan10] Chang, G.-Y., “(t, k)-Diagnosability for Regular Networks,” IEEE Trans. Computers, Vol. 59,
No. 9, pp. 1153-1157, September 2010.
[Gu18] Gu, M.-M., R.-X. Hao, and J.-B. Lu, “The Pessimistic Diagnosability of Data Center
Networks,” Information Processing Letters, Vol. 134, pp. 52-56, June 2018.
[Haki74] Hakimi, S. L. and A. T. Amin, “Characterization of Connection Assignment of Diagnosable
Systems,” IEEE Trans. Computers, Vol. 23, pp. 86-88, 1974.
[Hao16] Hao, R.-X., M.-M. Gu, and Y.-Q. Feng, “The Pessimistic Diagnosabilities of Some General
Regular Graphs,” Theoretical Computer Science, Vol. 609, Pt. 2, pp. 413-420, January 2016.
[Hsu07] Hsu, G. H. and J. J. M. Tan, “A Local Diagnosability Measure for Multiprocessor Systems,”
IEEE Trans. Parallel and Distributed Systems, Vol. 18, No. 5, pp. 598-607, 2007.
[Karu79] Karunanithi, S. and A. D. Friedman, “Analysis of Digital Systems Using a New Measure of
System Diagnosis,” IEEE Trans. Computers, Vol. 28, No. 2, pp. 121-123, February 1979.
[Maen81] Maeng, J. and M. Malek, “A Comparison Connection Assignment for Self-Diagnosis of
Multiprocessor Systems,” Proc. 11th Int’l Symp. Fault-Tolerant Computing, 1981, pp. 173-
175.
[Parh05] Parhami, B., “Swapped Interconnection Networks: Topological, Performance, and Robustness
Attributes,” J. Parallel and Distributed Computing, Vol. 65, No. 11, pp. 1443-1452, November
2005.
[Parh16] Parhami, B., N. Wu, and S. Tao, “Taxonomy and Overview of Distributed Malfunction
Diagnosis in Networks of Intelligent Nodes,” J. Computer Science and Engineering, Vol. 13,
No. 2, pp. 23-31, 2016.
[Prep67] Preparata, F. P., G. Metze, and R. T. Chien, “On the Connection Assignment Problem of
Diagnosable Systems,” IEEE Trans. Electronic Computers, Vol. 16, pp. 848-854, 1967.
[Seng92] Sengupta, A. and A. Dahbura, “On Self-Diagnosable Multiprocessor Systems: Diagnosis by the
Comparison Approach,” IEEE Trans. Computers, Vol. 41, No. 11, pp. 1386-1396, 1992.
[Soma87] Somani, A. K., V. K. Agarwal, and D. Avis, “A Generalized Theory for System Level
Diagnosis,” IEEE Trans. Computers, Vol. 36, pp. 538-546, 1987.
[Sull88] Sullivan, G., “An O(t3 + |E|) Fault Identification Algorithm for Diagnosable Systems,” IEEE
Trans. Computers, Vol. 37, pp. 388-397, 1988.
[Xiao10] Xiao, W. J., B. Parhami, W. D. Chen, M. X. He, and W. H. Wei, “Fully Symmetric Swapped
Networks Based on Bipartite Cluster Connectivity,” Information Processing Letters, Vol. 110,
No. 6, pp. 211-215, 15 February 2010.


18 Malfunction Tolerance
“If you improve or tinker with something long enough, eventually
it will break or malfunction.”
Arthur Bloch

“The test of courage comes when we are in the minority. The test
of tolerance comes when we are in the majority.”
Ralph W. Sockman

“I have seen gross intolerance shown in support of tolerance.”


Samuel Taylor Coleridge

Topics in This Chapter


18.1. System-Level Reconfiguration
18.2. Isolating a Malfunctioning Element
18.3. Data and State Recovery
18.4. Regular Arrays of Modules
18.5. Low-Redundancy Sparing
18.6. Malfunction-Tolerant Scheduling

Once a malfunctioning subsystem has been identified, the system must be


reconfigured to work without it or with a replacement unit (spare). In either case,
any informational resource residing in the malfunctioning unit must be
reconstructed or worked around. Details of the reconfiguration and recovery
strategies are system- and application-dependent. Thus, our focus in this chapter
is on general mechanisms that facilitate the implementation of such strategies.
Some of the indispensable ideas include module isolation techniques,
reconfiguration with spares, tradeoffs in switching complexity and overhead, and
task scheduling strategies that account for malfunctions.


18.1 System-Level Reconfiguration

A system consists of modular resources (processors, memory banks, disk storage,


interface units, and the like) plus interconnects. Redundant resources can mitigate the
effect of module malfunctions. Thus, a key challenge in reconfiguration is dealing with
interconnects. Throughout our discussion in this chapter, we will assume that module and
interconnect malfunctions are promptly diagnosed via self-checking, external monitoring,
or concurrently executed system-level tests. We can model system resources, including
both modules and interconnects, by means of directed or undirected graphs. To be able to
overcome the effect of link malfunctions, it is necessary to have multiple paths from each
potential source to every possible destination.

For example, Fig. 18.1a shows a 16-processor parallel computer whose nodes are
interconnected via a 2D torus topology. It is readily seen that each source node is
connected to each possible destination node via 4 parallel node-and-edge-disjoint paths.
Because of this property, any set of 3 malfunctioning resources (nodes and/or links) can
be tolerated without cutting off intermodule communication. In graph-theoretic terms, we
say that the system in Fig. 18.1a is 4-connected. High connectivity is a desirable attribute
for malfunction tolerance.
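The 4-connectedness claim can be verified computationally. The sketch below (an illustrative helper of our own, not from the text) counts internally node-disjoint paths between two torus nodes via a unit-capacity maximum flow with node splitting, which by Menger's theorem equals the size of a minimum node cut:

```python
from collections import deque

def torus_neighbors(i, j, m):
    return [((i + 1) % m, j), ((i - 1) % m, j),
            (i, (j + 1) % m), (i, (j - 1) % m)]

def node_disjoint_paths(m, s, t):
    # Node splitting: each node v becomes v_in -> v_out with capacity 1
    # (unbounded for s and t), so the max s-t flow equals the number of
    # internally node-disjoint s-t paths (Menger's theorem).
    cap, adj = {}, {}
    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for i in range(m):
        for j in range(m):
            v = (i, j)
            add((v, 'in'), (v, 'out'), m * m if v in (s, t) else 1)
            for w in torus_neighbors(i, j, m):
                add((v, 'out'), (w, 'in'), 1)
    src, snk, flow = (s, 'out'), (t, 'in'), 0
    while True:  # Edmonds-Karp: repeatedly augment along a BFS path
        parent, q = {src: None}, deque([src])
        while q and snk not in parent:
            u = q.popleft()
            for v in adj[u]:
                if cap[(u, v)] > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if snk not in parent:
            return flow
        v = snk
        while parent[v] is not None:
            cap[(parent[v], v)] -= 1
            cap[(v, parent[v])] += 1
            v = parent[v]
        flow += 1

print(node_disjoint_paths(4, (0, 0), (2, 2)))  # 4
```

Since every node has degree 4, four disjoint paths is the best possible, matching the 4-connectedness of the torus.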

(a) Parallel paths in a 2D torus (b) Parallel paths in a bus-based system

Fig. 18.1 Architectural redundancy in a torus-connected parallel


computer and in a bus-based system.


As a second example, consider the 3-bus system of Fig. 18.1b, where each I/O port from
a 2-port module is connectable to one of two buses. The dashed boxes in the middle can
be viewed as programmable 2 × 2 switches that allow each module to communicate with
any other module via two different paths. Thus, besides tolerating malfunctioning
modules, we can route around a single malfunctioning bus as well.


18.2 Isolating a Malfunctioning Element

Reconfiguration techniques allow us to route around malfunctioning modules or


interconnects but do not necessarily ensure that the circumvented elements do not
interfere with the proper functioning of the healthy ones. A prime example occurs in bus-
based systems. When we decide not to use a malfunctioning module in a bus-based
system, the bad module may still place nonsensical data on the bus, thus preventing the
good modules from properly communicating on the bus. A possible solution is depicted
in Fig. 18.isol-a. Each module can place data on the bus only if it has permission from
two other modules. Ignoring the problems associated with assigning and managing such
permissions, which are admittedly nontrivial, we see that it would take at least 3
malfunctioning modules in order to cause interference on the bus.
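The double-permission rule can be captured abstractly. In the sketch below (hypothetical names, with the optimistic assumption that a healthy grantor never asserts permission out of turn), bus interference requires the babbling module and both of its permission grantors to malfunction together:

```python
def interference_possible(faulty, grantors):
    # The bus is corrupted only if some module is faulty AND both of its
    # permission grantors are faulty (a healthy grantor withholds
    # permission from a module that is not scheduled to transmit).
    return any(m in faulty and all(g in faulty for g in grantors[m])
               for m in grantors)

# Example: 6 modules, module i gated by its neighbors i-1 and i+1 (mod 6).
grantors = {m: ((m - 1) % 6, (m + 1) % 6) for m in range(6)}
print(interference_possible({2}, grantors))        # False
print(interference_possible({1, 2}, grantors))     # False
print(interference_possible({1, 2, 3}, grantors))  # True: 3 malfunctions
```

As the example shows, no single or double malfunction can monopolize the bus under this model; three suitably placed malfunctions are needed.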

We can abstract the following general strategy from the scheme of isolating a module
from a bus, as discussed in the preceding paragraph. When a critical shared resource is to
be accessed by a module, some form of external authorization is needed to ensure that a
run-away module does not cause the entire system to crash. The situation can be likened
to transactions at a bank. A customer can perform simple transactions at an ATM. If the
customer wants to withdraw a larger sum of money than the ATM’s transaction limit,
assistance from a teller must be sought (a form of external authorization). For very large
transactions, even the teller does not have the required authority and the branch manager
gets involved (the second external authorization).


(a) Isolating a module from a bus (b) Reading from and writing to a bus

Fig. 18.isol Methods of isolating malfunctioning modules to ensure


noninterference in the proper functioning of other modules.


An alternative to the double-permission scheme of Fig. 18.isol-a, with its elaborate


assignment and collaborative management requirements, is the use of a single connection
flip-flop, as depicted in Fig. 18.isol-b. To isolate a module from the bus, one resets its
connection flip-flop, which then disables reading from and writing to the bus. The
multiple-bus scheme of Fig. 18.1b can be realized by using this circuitry for each of the
heavy dots connecting a module I/O port to a bus. In this case, any problems in the
connection flip-flop or associated circuitry can be treated as a bus malfunction for both
modeling and circumvention purposes.

Isolation of modules is somewhat simpler in systems with point-to-point communication.


Referring to Fig. 18.1a, as long as each healthy module is aware of the identity of its
malfunctioning neighbors, it can simply ignore all communications from those modules.

Malfunction tolerance would be much easier if malfunctioning modules simply stopped,
rather than engaging in arbitrary behavior. Unpredictable or arbitrary behavior on the part
of a malfunctioning element, sometimes referred to as a “Byzantine malfunction,” is
notoriously difficult to handle. One source of the difficulty is that the module’s arbitrary
behavior may make it seem different to multiple external observers, some judging it to be
healthy and others detecting the malfunction.

Methods are available to ensure that a malfunctioning module stops in an inert state,
where it can’t confuse the system’s healthy modules. Here is one way to accomplish this
goal. Suppose modules run (approximately) synchronized clocks and have access to
reliable stable storage, where critical data can be stored. A k-malfunction-stop module
can be implemented from k + 1 identical units of this kind, operating in parallel. The key
element for this realization is an s-process that decides when the redundant module has
stopped and sets a “stop” flag in stable storage to “true.”
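A rough sketch of the stopping rule follows (illustrative only; a real implementation involves synchronized clocks, replicated s-processes, and atomic writes to stable storage):

```python
def malfunction_stop_step(outputs, stable_storage):
    # One step of a k-malfunction-stop module built from k+1 replicas.
    # If the replicas disagree, or the stop flag is already set, the
    # module halts in an inert state instead of emitting arbitrary values.
    if stable_storage.get('stop') or len(set(outputs)) > 1:
        stable_storage['stop'] = True
        return None  # inert: nothing reaches the rest of the system
    return outputs[0]

store = {}
print(malfunction_stop_step([5, 5, 5], store))  # 5
print(malfunction_stop_step([5, 7, 5], store))  # None (stop flag set)
print(malfunction_stop_step([5, 5, 5], store))  # None (stays stopped)
```

Note that once the stop flag is set in stable storage, the module remains inert even if the replicas later agree, which is exactly the malfunction-stop behavior desired.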

[More details to be supplied.]


Malfunction-stop modules


18.3 Data and State Recovery

Logs are essential tools for system recovery via undo and redo operations. The undo and
redo operations are quite similar to their namesakes in word-processing and other
common applications. When a detected malfunction makes it impossible or inadvisable to
continue processing as usual, the partial effects of incomplete transactions must be
undone to maintain consistency in stable storage. Similarly, partially completed
transactions must be redone when circumstances allow.
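A minimal sketch of log-based recovery follows (the record format is hypothetical; real systems add checkpoints, log sequence numbers, and idempotence guarantees):

```python
def recover(log, committed, store):
    # Redo/undo recovery over a write-ahead log whose records are
    # (transaction, key, old_value, new_value) tuples.
    for txn, key, old, new in log:            # redo committed work, in order
        if txn in committed:
            store[key] = new
    for txn, key, old, new in reversed(log):  # undo incomplete transactions
        if txn not in committed:
            store[key] = old

# A crash left T2's partial write (y = 5) on stable storage.
store = {'x': 0, 'y': 5}
log = [('T1', 'x', 0, 1), ('T2', 'y', 0, 5), ('T1', 'x', 1, 2)]
recover(log, committed={'T1'}, store=store)
print(store)  # {'x': 2, 'y': 0}: T1 redone, T2 undone
```

After recovery, the committed transaction T1 is fully reflected in stable storage, while the incomplete transaction T2 has been rolled back.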


18.4 Regular Arrays of Modules

Regular arrays of modules are used extensively in certain applications that need high-
throughput processing of massive amounts of data. Examples include communication
routing, scientific modeling, visual rendering, and certain kinds of simulation. Note that
regularity in our discussions here refers to the interconnection pattern, not physical layout
(the latter may be the case for on-chip systems). The focus of our discussion will be on
2D arrays, although some of the techniques can be extended to higher dimensions in a
straightforward manner.

Row/column bypassing is a widely used method for reconfiguring 2D arrays. Let us first
focus on row bypassing for modules that communicate in one direction: from top to
bottom. As seen in Fig. 18.pass-a, placing a multiplexer at the input to each module
allows us to bypass the previous row, taking the input from the row immediately above it.
Applying the same scheme to the other 3 inputs of a 4-port module leads to the building
block of Fig. 18.pass-b, which allows both row and column bypassing.
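Functionally, row bypassing in the downward direction amounts to the following (a toy model; the actual mechanism is a multiplexer at each module input, as in Fig. 18.pass-a):

```python
def column_output(modules, bypassed, x):
    # Feed x through a column of top-to-bottom modules; a bypassed row's
    # input multiplexer forwards the signal from the row above, so that
    # row's module is skipped entirely.
    for row, f in enumerate(modules):
        if row not in bypassed:
            x = f(x)
    return x

inc = lambda v: v + 1                              # each module adds 1
print(column_output([inc] * 4, bypassed={2}, x=0)) # 3: row 2 is skipped
```

With row 2 bypassed, only three of the four modules process the data, exactly as if the malfunctioning row had been excised from the array.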

(a) Row bypassing in downward direction (b) Module with row/column bypass

Fig. 18.pass Reconfiguration via row/column bypassing in 2D arrays.


Choosing rows and/or columns to bypass

Switch modules in FPGAs


An array reconfiguration scheme

One-track and two-track switching schemes


18.5 Low-Redundancy Sparing

Row/column bypassing is attractive in view of its simple reconfiguration and decision


logic. However, it is rather wasteful when an entire row or column of modules is
discarded to circumvent a single malfunctioning unit. It is thus natural to ask whether we
can come up with lower-redundancy schemes that make more efficient use of the spare
resources. As shown in Fig. 18.spare-a, it might be possible to place shared spares in
certain strategic locations, where they can readily replace a malfunctioning module from
an assigned set. To achieve this goal, the spare modules usually need greater connectivity
than the primary modules, given the flexibility requirements for different replacement
patterns. For example, if each spare module in Fig. 18.spare-a is to be able to replace any
of the four primary modules within its cluster while maintaining the same communication
structure, it will need at least 8 ports.

For 2D meshes without wraparound links (i.e., not torus networks), an ingenious
reconfiguration scheme allows the use of a single spare module for replacing any
malfunctioning module located in any row/column. The scheme, depicted in Fig.
18.spare-b, takes advantage of unused ports at the edges of a mesh to provide additional
links that are not used under normal conditions. When a module malfunctions, the
remaining healthy modules are renumbered (assigned new row/column numbers) and
their ports relabeled to form a new working mesh. In the example of Fig. 18.spare-b, once
the malfunctioning module 5 has been isolated, the system is reconfigured as shown, so
that, for example, the new row 0 will consist of modules 6, 7, 8, and 9, which are
renumbered 0, 1, 2, and 3.
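One plausible relabeling rule consistent with the numbers in this example (an assumption on our part, since the exact rule depends on the details of Fig. 18.spare-b) renumbers the healthy modules in cyclic order starting just after the malfunctioning one:

```python
def relabel(N, failed):
    # N working modules (labels 0..N-1) plus one spare (label N).
    # Assumed rule: healthy modules are renumbered 0..N-1 in cyclic
    # order of their old labels, starting just past the failed module.
    order = [(failed + 1 + k) % (N + 1) for k in range(N)]
    return {old: new for new, old in enumerate(order)}

lab = relabel(16, failed=5)   # 4x4 working array, spare labeled 16
new_row0 = [old for old, new in sorted(lab.items(), key=lambda p: p[1])[:4]]
print(new_row0)  # [6, 7, 8, 9]: the new row 0, renumbered 0, 1, 2, 3
```

Under this rule, modules 6, 7, 8, and 9 indeed become the new row 0, matching the renumbering described in the text.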

(a) Shared spare in a cluster (b) Single-spare scheme for a 2D array

Fig. 18.spare Low-redundancy sparing in 2D arrays of modules.


18.6 Malfunction-Tolerant Scheduling

One of the considerations in computer systems composed of multiple computational


resources is the assignment of various tasks or subtasks to the available modules. This is
known as task scheduling. Schedules can be static (determined at the outset and adhered
to throughout the course of the computation) or dynamic. In the latter case, changes due to
resource fluctuations or imbalance in load that may result from inaccuracies in task
running-time predictions are possible. Clearly, malfunction tolerance requires some
degree of dynamism in task scheduling to allow changes when modules are removed
from service due to malfunctions.

Task scheduling problems are hard even when resource requirements and availability are
both fixed and known a priori. These problems become significantly more difficult when
resource requirements fluctuate and/or resource availability changes dynamically due to
modules malfunctioning and being returned to service after repair.
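As an illustration of the dynamic case, the following greedy sketch (with a hypothetical task/load model of our own) moves a malfunctioning module's tasks to the least-loaded surviving modules; real schedulers must additionally handle task-state recovery and deadlines:

```python
def reschedule(tasks, modules, assignment, failed):
    # tasks: task -> running time; assignment: task -> module.
    # Greedily reassign the failed module's tasks, largest first,
    # to whichever surviving module currently has the least load.
    load = {m: sum(tasks[t] for t, mm in assignment.items() if mm == m)
            for m in modules if m != failed}
    orphans = sorted((t for t, m in assignment.items() if m == failed),
                     key=lambda t: -tasks[t])
    for t in orphans:
        m = min(load, key=load.get)
        assignment[t] = m
        load[m] += tasks[t]
    return assignment

tasks = {'a': 3, 'b': 2, 'c': 2, 'd': 1}
asg = {'a': 0, 'b': 1, 'c': 2, 'd': 2}
print(reschedule(tasks, [0, 1, 2], asg, failed=2))
# {'a': 0, 'b': 1, 'c': 1, 'd': 0}
```

Largest-first greedy placement is a classic load-balancing heuristic; it is not optimal in general, but it illustrates the kind of remapping a malfunction-tolerant scheduler must perform on the fly.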


Problems

18.1 xxx
Intro
a. xxx
b. xxx
c. xxx

18.2 xxx
Intro
a. xxx
b. xxx
c. xxx
d. xxx

18.3 Placement of spare modules


Consider a linear array of n modules similar to that in Fig. 6.array1D. When an active module is replaced
with a spare, the contents of some of the modules must be shifted so as to maintain the correct module
ordering in the array. It makes sense to try to place the spare modules in locations along the array that
would minimize the amount of data shifting required [Parh77].
a. Show that a single spare module is best placed at the middle of the array when n is odd and at
either of the two central locations when n is even. State all your assumptions clearly.
b. How would the answer to part a change if the modules are arranged in a ring rather than a linear
array?
c. Show that multiple spare modules should be spaced equally along the array.
d. Discuss whether using the same criteria presented in this problem, the placement of spare rows
and columns within 2D arrays can be optimized.

18.4 Shared spares in a regular array


Consider a 2D array of modules with side length m = 2a, with a shared spare provided for every 4
modules, as depicted in Fig. 18.spare-a. Assume that the switching mechanisms are perfectly reliable and
that all modules, including spares, have the same reliability r(t).
a. Construct a combinational reliability model for the system.
b. We noted that for the 4  4 array, each spare module needs 8 ports. What is the required number p
of ports for a spare module when a > 2?
c. When a spare replaces an ordinary module, one of its p ports should be assigned for
communication in each of the east, west, north, and south directions. How many possible
neighbors does the replaced module have in each direction?
d. Design the required switching mechanism to allow the port assignment alluded to in part c.
e. Can the sparing scheme of this problem be extended to the 2D torus interconnection pattern?


References and Further Readings


[Bruc93] Bruck, J., R. Cypher, and C.-T. Ho, “Fault-Tolerant Meshes and Hypercubes with Minimal
Numbers of Spares,” IEEE Trans. Computers, Vol. 42, No. 9, pp. 1089-1104, September 1993.
[Butl08] Butler, R. W., “A Primer on Architectural Level Fault Tolerance,” NASA Technical
Memorandum TM-2008-215108, 48 pp., February 2008.
[Cast15] Castro-Leon, M., H. Meyer, D. Rexachs, and E. Luque, “Fault Tolerance at System Level
Based on RADIC Architecture,” J. Parallel and Distributed Computing, Vol. 86, pp. 98-111,
December 2015.
[Jian15] Jiang, G., J. Wu, Y. Ha, Y. Wang, and J. Sun, “Reconfiguring Three-Dimensional Processor
Arrays for Fault-Tolerance: Hardness and Heuristic Algorithms,” IEEE Trans. Computers, Vol.
64, No. 10, pp. 2926-2939, October 2015.
[Parh77] Parhami, B., “Optimal Placement of Spare Modules in a Cascaded Chain,” IEEE Trans.
Reliability, Vol. 26, No. 4, pp. 280-282, October 1977.
[Parh20] Parhami, B., “Reliability and Modelability Advantages of Distributed Switching for
Reconfigurable 2D Processor Arrays,” Proc. 11th Annual IEEE Information Technology,
Electronics and Mobile Communication Conf., November 2020, to appear.


19 Standby Redundancy
“A long life may not be good enough, but a good life is long
enough.”
Anonymous

“The major difference between a thing that might go wrong and a


thing that cannot possibly go wrong is that when a thing that
cannot possibly go wrong goes wrong, it usually turns out to be
impossible to get at or repair.”
Douglas Adams

Topics in This Chapter


19.1. Malfunction Detection
19.2. Cold and Hot Spare Units
19.3. Conditioning of Spares
19.4. Switching over to Spares
19.5. Self-Repairing Systems
19.6. Modeling of Self-Repair

In a system with standby or dynamic redundancy, also known as a sparing system,


redundant modules appear on the periphery of the operational or active modules.
Once an active module has been determined to be malfunctioning, it is removed
from service and a spare module is switched in to take its place. Standby/dynamic
redundancy is more efficient than masking/static redundancy, both in the
extended system lifetime that it offers and in energy consumption. However, it
leads to the added challenge of ensuring timely malfunction detection and reliable
switchover to a spare. Without satisfactory solutions to these problems, a standby
system would not offer high reliability, availability, or safety.


19.1 Malfunction Detection

No amount of spare resources is useful if the malfunctioning of the active module is not
promptly detected. Detection options include:

Coding of control signals

Monitoring via watchdog timers


Activity monitor

Control-flow watchdog

Control-flow checking is done through the extraction of a precise control-flow graph


(CFG) describing how instructions are chained together. Building a precise CFG is a
difficult task with an unrestricted instruction-set architecture. Indirect jumps make the
task even harder, so forbidding such jumps has been advocated for providing integrity-
friendly semantics [Gonz19].
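As a toy illustration of CFG-based checking (the basic-block names and graph here are invented for the example), a control-flow watchdog can flag any executed transition that is not an edge of the precomputed CFG:

```python
def control_flow_ok(cfg, trace):
    # cfg maps each basic block to the set of legal successor blocks.
    # The watchdog accepts a trace iff every observed transition
    # corresponds to an edge of the precomputed control-flow graph.
    return all(b in cfg.get(a, ()) for a, b in zip(trace, trace[1:]))

cfg = {'entry': {'loop'}, 'loop': {'loop', 'exit'}, 'exit': set()}
print(control_flow_ok(cfg, ['entry', 'loop', 'loop', 'exit']))  # True
print(control_flow_ok(cfg, ['entry', 'exit']))                  # False
```

A transition such as entry-to-exit, which skips the loop body, is not a CFG edge and is therefore flagged as a control-flow violation, the kind of behavior a malfunctioning or subverted processor might exhibit.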


19.2 Cold and Hot Spare Units

Spare modules in standby redundancy are of two main kinds. A spare that is inactive,
perhaps even powered down to conserve energy and to reduce its exposure to wear and
tear and thus failure, is known as a cold spare. Conversely, an active spare that is fully
ready to take over the function of a malfunctioning active module at any time is known as
a hot spare. In a manner similar to the use of the term “firmware” as something that
bridges the gap between hardware and software, we designate a spare that falls between
the two extremes above, perhaps being powered up but not quite up to date with respect
to the state of the active module, as a warm spare.


19.3 Conditioning of Spares

Conditioning refers to preparing a spare module to take the place of an active module.


19.4 Switching over to Spares

Switching mechanisms for standby sparing have a lot in common with those used for
defect circumvention, particularly when spares are shared among multiple units.


19.5 Self-Repairing Systems

Self-repair is the ability of a system to go from one working configuration to another


(after a detected malfunction), without human intervention. Such a self-repair capability
is one of the key features of autonomic systems, which are designed to manage
themselves and hide the ever-increasing complexities in system components and
interfaces from end users, allowing the users to focus on running their applications
instead of worrying about managing the system. In the latter context, self-repair isn’t just
a mechanism for improving system reliability, but also a tool for reducing system
maintenance, operational, and support costs.


19.6 Modeling of Self-Repair

Both combinational and state-space models to be discussed in this section.


Problems

19.1 Impact of coverage on reliability


We have a nonredundant subsystem with a malfunction rate of 20 per 1M hours. Compute and compare the
reliabilities of the following standby sparing arrangements for operating times of 1000, 10 000, and 100
000 hours. Assume exponential reliability.
a. One spare module, with a coverage factor of 0.999
b. Two spare modules, with a coverage factor of 0.99
c. Three spare modules, with a coverage factor of 0.95
d. Four spare modules, with a coverage factor of 0.9
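Under one classic model (cold spares that cannot fail while powered down, exponentially distributed active-module lifetimes with rate λ, and per-switchover coverage c), the reliability of one active module backed by s spares is R(t) = e^(–λt) Σ_{k=0..s} (cλt)^k / k!. The following sketch evaluates this formula for the four arrangements above; it is offered as one reasonable model, not the only acceptable solution approach:

```python
from math import exp, factorial

def standby_reliability(lam, t, s, c):
    # R(t) = exp(-lam*t) * sum_{k=0}^{s} (c*lam*t)^k / k!
    # (cold spares, exponential active-module failures, coverage c)
    return exp(-lam * t) * sum((c * lam * t) ** k / factorial(k)
                               for k in range(s + 1))

lam = 20e-6  # 20 malfunctions per million hours
for s, c in [(1, 0.999), (2, 0.99), (3, 0.95), (4, 0.9)]:
    print(s, [round(standby_reliability(lam, t, s, c), 6)
              for t in (1000, 10000, 100000)])
```

Running the loop shows how, for long mission times, imperfect coverage erodes the benefit of additional spares.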

19.2 Optimal number of spares for a given coverage


a. If a standby sparing scheme with module reliability 0.99 has a coverage factor of c = 0.95, what is
the optimal number s of spare modules?
b. Repeat part a, this time assuming that the coverage factor decreases by 0.01 with each added spare
module, that is c = 0.96 – 0.01s, where s is the number of spare modules.

19.3 Optimal number of spare modules


A detailed study of a standby sparing system, with one active module and s spare modules, has determined
that s = 2 is the most cost-effective choice for the number of spare modules, with s = 3 being a close second
choice. Deduce the shape of the reliability curve as a function of the number s of spares and discuss the
contribution of the coverage factor to these conclusions.

19.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Arno73] Arnold, T. F., “The Concept of Coverage and Its Effect on the Reliability Model of a
Repairable System,” IEEE Trans. Computers, Vol. 22, No. 3, pp. 251-254, March 1973.
[Borg75] Borgerson, B. R. and R. F. Freitas, “A Reliability Model for Gracefully Degrading Standby-
Sparing Systems,” IEEE Trans. Computers, Vol. 24, pp. 517-525, 1975.
[Bour69] Bouricius, W. G., W. C. Carter, and P. R. Schneider, “Reliability Modeling Techniques for
Self-Repairing Computer Systems,” Proc. 12th ACM National Conf., 1969, pp. 295-305.
[Bour71] Bouricius, W. G., W.C. Carter, D. C. Jessep, P. R. Schneider, and A. B. Wadia, “Reliability
Modeling for Fault-Tolerant Computers,” IEEE Trans. Computers, Vol. 20, No. 11, pp. 1306-
1311, November 1971.
[Bued09] Buede, D. M., The Engineering Design of Systems: Models and Methods, Wiley, 2nd ed., 2009.
[Gonz19] Gonzalvez, A. and R. Lashermes, “A Case Against Indirect Jumps for Secure Programs,” Proc.
9th Workshop Software Security, Protection, and Reverse Engineering, San Juan, United
States, December 2019, pp. 1-10.
[Shoo02] Shooman, M. L., Reliability of Computer Systems and Networks, Wiley, 2002.


20 Robust Parallel Processing


“I know you and Frank were planning to disconnect me, and I'm
afraid that’s something I cannot allow to happen.”
HAL, the on-board computer in 2001: A Space
Odyssey

“In a FORTRAN program controlling the United States’ first


mission to Venus, a programmer coded a DO loop as DO 3 I =
1.3 ... coding a period instead of a comma. However, the
compiler treated this as an acceptable assignment statement
[DO3I = 1.3, leading to] the failure of the mission.”
G.J. Meyers, Software Reliability: Principles and
Practices

Topics in This Chapter


20.1. A Graph-Theoretic Framework
20.2. Connectivity and Parallel Paths
20.3. Dilated Internode Distances
20.4. Malfunction-Tolerant Routing
20.5. Embeddings and Emulations
20.6. Robust Multistage Networks

Parallel processors offer built-in redundancy in computation and communication


resources. These resources are not separated into active and spare. Rather, all
system resources are usable at all times, so that when there is no malfunction,
additional resources contribute to system performance. As nodes and links of a
parallel processor malfunction, they are removed from service, making the system
smaller and more limited in its capabilities, but perhaps still functioning and
capable of executing its critical tasks. As in standby sparing, rapid and complete
malfunction detection is a key challenge here. However, the process of switching
in of spares is replaced by a computation remapping process.


20.1 A Graph-Theoretic Framework

Parallel processors are divided into two classes: global-memory and distributed-
memory systems. In global-memory multiprocessors, a number of processing nodes are
connected to a large collection of memory modules, or banks, via a processor-to-memory
interconnection network, often implemented as a multistage structure of switching
elements. Instead of linking processors to memory banks, such multistage networks can
also be used to interconnect processing nodes to each other. In the latter processor-to-
processor interconnection usage, such networks are also called indirect networks, because
the connections among processors are established indirectly through switches, rather than
directly via links that connect the processors’ communication ports.

We will deal with multistage (indirect) networks in Section 20.6. The rest of this chapter
is devoted to problems associated with direct interprocessor communication networks
exemplified by the 64-node (6-dimensional) hypercube, depicted in Fig. 20.nets-a. A
wide variety of different interconnection networks have been proposed over the years, so
much so that the multitude of options available is often referred to as “the sea of
interconnection networks” (Fig. 20.nets-b). The proposed networks differ in topological,
performance, robustness, and realizability attributes.

(a) Hypercube (6D, 64-node) (b) Sea of direct interconnection networks

Fig. 20.nets Examples of direct interconnection networks.


We often assume that a parallel system is built from homogeneous processing nodes,
although interconnected heterogeneous nodes are sometimes considered. The internode
communication architecture is characterized by the type of routing scheme (packet
switching vs. wormhole or cut-through) and the routing protocols supported (e.g.,
whether nodes have buffer storage for messages that are passing through). Such details
don’t matter at the level of graph representation, which models only connectivity issues.

In robust parallel processing, we don’t make a distinction between ordinary resources and
spare resources. All resources are pooled and what would have been spare modules,
communication links, and the like are made available to boost performance in the absence
of malfunctions. The nominally extra processing and communication resources allow us
to overcome the effects of malfunctioning processors and transmission paths by simply
switching to alternates.

Attributes of interconnection networks


20.2 Connectivity and Parallel Paths

Two key notions in allowing the tolerance of malfunctioning nodes and links are those of
connectivity and parallel paths. Node-disjoint paths, which are useful for malfunction
tolerance, can also be used for improved performance via parallel transmission of pieces
of long messages.
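For two hypercube nodes at Hamming distance k, k node-disjoint paths are obtained by correcting the differing dimensions in cyclically rotated orders (the remaining disjoint paths through nonproductive dimensions are omitted). A minimal Python sketch of this classic construction, with function names of our choosing:

```python
def disjoint_paths(a, b, q):
    """Node-disjoint paths between nodes a and b of a q-cube, one per
    differing dimension: path i corrects the differing dimensions in a
    cyclically rotated order, so no two paths share an inner node."""
    dims = [d for d in range(q) if (a ^ b) >> d & 1]
    paths = []
    for i in range(len(dims)):
        order = dims[i:] + dims[:i]      # rotated correction order
        node, path = a, [a]
        for d in order:
            node ^= 1 << d               # traverse the dimension-d link
            path.append(node)
        paths.append(path)
    return paths

# Nodes 0 and 7 of a 3-cube differ in all 3 dimensions, so three
# parallel paths exist; a long message could be split across them.
for p in disjoint_paths(0, 7, 3):
    print(p)
```

Each path is usable as a backup route when another path contains a malfunctioning node, or all can carry message pieces concurrently.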

Discuss notions related to network reliability, such as two-terminal reliability.


20.3 Dilated Internode Distances

Internode distances vary as a result of malfunctions. For example, if one link becomes
unavailable in a 2D mesh, the formerly distance-1 pair of nodes that it connected turns into
a distance-3 pair. Because a network of connectivity κ can become partitioned as a result
of κ or more malfunctions, it is common to analyze the behavior of direct networks in the
presence of worst-case patterns of κ – 1 malfunctions. For example, one may ask how the
diameter of a network, or its average internode distance, is affected in the presence of
such worst-case patterns of malfunctioning units.
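The mesh example above is easy to check with a small breadth-first-search sketch (Python, with helper names of our choosing):

```python
from collections import deque

def mesh_edges(rows, cols):
    """Undirected links of a rows-by-cols 2D mesh; nodes are (r, c) pairs."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                edges.add(((r, c), (r + 1, c)))
            if c + 1 < cols:
                edges.add(((r, c), (r, c + 1)))
    return edges

def distance(edges, src, dst):
    """Internode distance by breadth-first search; None if partitioned."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for w in adj.get(node, []):
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return None

links = mesh_edges(4, 4)
print(distance(links, (0, 0), (0, 1)))   # healthy mesh: distance 1
links.discard(((0, 0), (0, 1)))          # one link malfunctions
print(distance(links, (0, 0), (0, 1)))   # dilated distance: 3
```

Running the same search over all node pairs, with worst-case deletions, yields the malfunction diameter of a given network.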

Malfunction diameter of the hypercube


One of the most challenging open problems of graph theory is synthesizing graphs that
have small diameters, while maintaining a desirably small node degree. More
specifically, given nodes of a given maximum degree, we seek to synthesize the largest
possible graphs with bounded diameter or, given a desired size, we wish to minimize the
resulting diameter [Chun87]. Of particular interest, in the context of dependable
computing, are graphs with small diameters that remain small after deleting a few nodes
or edges.


20.4 Malfunction-Tolerant Routing

Routing messages in a network of nodes containing malfunctions may be based on two
strategies. In one strategy, the set of malfunctioning resources is known globally and
thus, every node sending a message can precompute a viable path for the message to take.
Specification of the chosen path may be attached to the message, which will then find its
way through the network with no need for additional computation along the way. The
distributed version of this approach allows each node to compute the best outgoing link
for a message to take toward its destination, with computing the rest of the path delegated
to intermediate nodes. One advantage of the distributed version is that it allows changes
in the network to occur dynamically as messages are in transit.

For very large or loosely connected networks, it is more realistic to assume that each
node knows only about malfunctioning resources in its immediate neighborhood. Then,
path calculation must occur in a distributed manner. Such distributed routing decisions
may lead to:

Suboptimal paths: Messages may not travel via the shortest available paths
Deadlocks: Messages may interfere with, or circularly wait for, each other
Livelocks: Messages may wander around, never reaching their destinations

Adaptive routing in a hypercube
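A purely local (distributed) routing sketch for the hypercube illustrates these hazards: it prefers productive dimensions, detours over a healthy unproductive link only when every productive neighbor is malfunctioning, and uses a hop bound as a crude livelock guard. This is an illustrative Python sketch under our own naming, not an algorithm from the text:

```python
def adaptive_route(src, dst, q, bad, max_hops=None):
    """Greedy local routing in a q-cube: at each node, prefer links that
    reduce the Hamming distance to dst; detour when every productive
    neighbor is in the set `bad` of malfunctioning nodes.

    Purely local decisions, so the path may be suboptimal and, without
    extra safeguards, susceptible to livelock (hence the hop bound)."""
    if max_hops is None:
        max_hops = 4 * q                  # crude livelock guard
    node, path = src, [src]
    while node != dst and len(path) <= max_hops:
        productive = [d for d in range(q) if (node ^ dst) >> d & 1]
        healthy = [d for d in productive if node ^ (1 << d) not in bad]
        if healthy:
            d = healthy[0]
        else:                             # detour: any healthy neighbor
            detours = [d for d in range(q)
                       if d not in productive and node ^ (1 << d) not in bad]
            if not detours:
                return None               # boxed in by malfunctions
            d = detours[0]
        node ^= 1 << d
        path.append(node)
    return path if node == dst else None

# 0 -> 7 in a 3-cube with node 1 malfunctioning: the route detours
# around the bad node but still arrives in 3 hops.
print(adaptive_route(0, 7, 3, bad={1}))   # [0, 2, 3, 7]
```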


Adaptive routing in a mesh network

Routing with nonconvex malfunction regions


20.5 Embeddings and Emulations

Embedding is a mapping of one network (the guest) onto another network (the host).
Emulation allows one network to behave as another. The two notions are related, in the
sense that a good embedding can be used to devise an efficient emulation. Both notions
are useful for malfunction tolerance.

Dilation: Longest path onto which an edge is mapped (routing-distance slowdown)
Congestion: Maximum number of edges mapped onto one edge (contention slowdown)
Load factor: Maximum number of nodes mapped onto one node (processing slowdown)

Example: Mesh/torus embedding in a hypercube
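The three measures can be computed mechanically for a given mapping. The Python sketch below (all names ours) evaluates the classic Gray-code embedding of an 8-node ring, i.e., a 1D torus, into a 3-cube, for which dilation, congestion, and load factor are all 1:

```python
def gray(i):                     # reflected binary Gray code
    return i ^ (i >> 1)

def cube_path(a, b):
    """Hypercube route from a to b, as a list of (lo, hi) link labels,
    correcting differing dimensions in ascending order."""
    path, node = [], a
    for d in range((a ^ b).bit_length()):
        if (a ^ b) >> d & 1:
            nxt = node ^ (1 << d)
            path.append((min(node, nxt), max(node, nxt)))
            node = nxt
    return path

def embedding_metrics(guest_edges, phi, host_path):
    """Dilation, congestion, and load factor of the embedding phi, which
    maps guest nodes to host nodes; host_path routes each guest edge."""
    load = {}
    for g, h in phi.items():
        load[h] = load.get(h, 0) + 1
    congestion, dilation = {}, 0
    for u, v in guest_edges:
        path = host_path(phi[u], phi[v])
        dilation = max(dilation, len(path))
        for e in path:
            congestion[e] = congestion.get(e, 0) + 1
    return dilation, max(congestion.values()), max(load.values())

# Gray-code ring in a 3-cube: every ring edge maps to a single cube link.
ring = [(i, (i + 1) % 8) for i in range(8)]
phi = {i: gray(i) for i in range(8)}
print(embedding_metrics(ring, phi, cube_path))   # (1, 1, 1)
```

Mapping the ring node i to cube node i instead (no Gray coding) raises the dilation to 3, showing how much the choice of embedding matters.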


20.6 Robust Multistage Networks

Insert a general discussion of interconnection redundancy here. [Parh79]

Multistage networks use switches to interconnect nodes instead of providing direct links
between them.

Just as was the case for direct networks, the design space for indirect networks is quite
vast, leading to the term “sea of indirect interconnection networks.” The proposed
networks differ in topological, performance, robustness, and realizability attributes.


The capability to bypass malfunctioning switches is needed for robustness in multistage
interconnection networks. This feature becomes even more critical in architectures such
as the butterfly network that have a single routing path between a source node on one side
and a destination node on the other side.


Problems

20.1 Extra-stage butterfly network


The malfunction-tolerant extra-stage butterfly network, discussed in Section 20.6, essentially provides
connectivity between 16 inputs and 16 outputs. Can a 16-input butterfly network provide the same
function? If so, how; if not, why not?

20.2 Malfunction-tolerant routing


A p-node square torus has at most two malfunctioning processors known only to their neighbors. The
sender of a message does not necessarily know of the existence or location of the malfunctioning
processor(s), but it does know that no more than two processors are malfunctioning.
a. What is the worst-case diameter of the incomplete torus?
b. What is the worst-case bisection width?
c. Outline the design of a malfunction-tolerant packet routing algorithm that operates in a distributed
fashion (uses only local routing decisions).
d. How many extra steps (hops) does your routing algorithm require compared with a shortest path,
in the worst case?

20.3 Malfunction-tolerant routing


A node in a hypercube network with some malfunctioning nodes is said to be k-capable if every healthy
node at distance k from it is reachable via a shortest path [Chiu97].
a. Show that in a q-cube, q-bit capability vectors of all nodes can be computed recursively through a
simple algorithm.
b. Devise a malfunction-tolerant routing algorithm for the q-cube whereby each node makes its
routing decisions solely on the basis of its own and its neighbors’ capability vectors.

20.4 Routing via alternate paths


Consider a two-dimensional 4 × 4 torus network in which nodes are always functional, but each link fails
with probability p, independently of others. What is the probability of not being able to send a message
from a source node to a desired destination node in the worst case? Assume that p is very small, so that the
probability of j links all being functional is approximately 1 – jp. Hint: Use the full node- and edge-symmetry of the torus
network to reduce the number of cases that must be considered.

20.5 Two-terminal reliability


For a network with n nodes numbered 0 to n – 1, its 2-terminal reliability for nodes i and j is the probability
that a healthy routing path exists between nodes i and j. The network’s 2-terminal reliability is the
minimum of all 2-terminal reliabilities for every i and j. Find the 2-terminal reliability of an n-node
undirected ring, assuming that nodes do not fail and each link has reliability r.


References and Further Readings


[Berm16] Bermudez Garzon, D. F., C. G. Requena, M. Engracia Gomez, P. Lopez, and J. Duato, “A
Family of Fault-Tolerant Efficient Indirect Topologies,” IEEE Trans. Parallel and Distributed
Systems, Vol. 27, No. 4, pp. 927-940, April 2016.
[Chen01] Chen, C.-L. and G.-M. Chiu, “A Fault-Tolerant Routing Scheme for Meshes with Nonconvex
Faults,” IEEE Trans. Parallel and Distributed Systems, Vol. 12, No. 5, pp. 467-475, May 2001.
[Chen09] Chen, W., W. J. Xiao, and B. Parhami, “Swapped (OTIS) Networks Built of Connected Basis
Networks are Maximally Fault Tolerant,” IEEE Trans. Parallel and Distributed Systems, Vol.
20, No. 3, pp. 361-366, March 2009.
[Chiu97] Chiu, G.-M. and K.-S. Chen, “Use of Routing Capability for Fault-Tolerant Routing in
Hypercube Multicomputers,” IEEE Trans. Computers, Vol. 46, No. 8, pp. 953-958, August
1997.
[Chun87] Chung, F. R. K., “Diameters of Graphs: Old Problems and New Results,” Congressus
Numerantium, Vol. 60, 1987, pp. 295-317.
[Fubi14] Fu, B., Y. Han, H. Li, and X. Li, “ZoneDefense: A Fault-Tolerant Routing for 2-D Meshes
Without Virtual Channels,” IEEE Trans. VLSI Systems, Vol. 22, No. 1, pp. 113-126, January
2014.
[Gu18] Gu, M. M. and R.-X. Hao, “Reliability Analysis of Cayley Graphs Generated by
Transpositions,” Discrete Applied Mathematics, Vol. 244, pp. 94-102, July 2018.
[Kane97] Kanellakis, P. C. and A. A. Shvartsman, Fault-Tolerant Parallel Computation, Kluwer, 1997.
[Kris87] Krishnamoorthy, M. S. and B. Krishnamurthy, “Fault Diameter of Interconnection Networks,”
Computers & Mathematics with Applications, Vol. 13, Nos. 5/6, pp. 577-582, 1987.
[Ledu16] Leduc-Primeau, F., V. Gripon, M. G. Rabbat, and W. J. Gross, “Fault-Tolerant Associative
Memories Based on c-Partite Graphs,” IEEE Trans. Signal Processing, Vol. 64, No. 4, pp.
829-841, February 2016.
[Parh79] Parhami, B., “Interconnection Redundancy for Reliability Enhancement in Fault-Tolerant
Digital Systems,” Digital Processes, Vol. 5, Nos. 3-4, pp. 199-211, 1979.
[Parh99] Parhami, B., Introduction to Parallel Processing: Algorithms and Architectures, Plenum, 1999.
[Wu14] Wu, J., T. Srikanthan, G. Jiang, and K. Wang, “Constructing Sub-Arrays with Short
Interconnects from Degradable VLSI Arrays,” IEEE Trans. Parallel and Distributed Systems,
Vol. 25, No. 4, pp. 929-938, April 2014.
[Xu01] Xu, J., Topological Structure and Analysis of Interconnection Networks, Kluwer, 2001.


[email protected]
http://www.ece.ucsb.edu/~parhami

This is a draft of the forthcoming book


Dependable Computing: A Multilevel Approach,
by Behrooz Parhami, publisher TBD
ISBN TBD; Call number TBD

All rights reserved for the author. No part of this book may be reproduced,
stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without written permission. Contact the author at:
ECE Dept., Univ. of California, Santa Barbara, CA 93106-9560, USA ([email protected])

Dedication
To my academic mentors of long ago:

Professor Robert Allen Short (1927-2003),


who taught me digital systems theory
and encouraged me to publish my first research paper on
stochastic automata and reliable sequential machines,

and

Professor Algirdas Antanas Avižienis (1932- )


who provided me with a comprehensive overview
of the dependable computing discipline
and oversaw my maturation as a researcher.

About the Cover


The cover design shown is a placeholder. It will be replaced by the actual cover image
once the design becomes available. The two elements in this image convey the ideas that
computer system dependability is a weakest-link phenomenon and that modularization &
redundancy can be employed, in a manner not unlike the practice in structural
engineering, to prevent failures or to limit their impact.
Last modified: 2020-11-19 12

Structure at a Glance
The multilevel model on the right of the following table is shown to emphasize its
influence on the structure of this book; the model is explained in Chapter 1 (Section 1.4).


VI Degradations: Behavioral Lapses


Ideal → Defective → Faulty → Erroneous → Malfunctioning → Degraded → Failed

“Junk is the ultimate merchandise. The junk merchant does not
sell his product to the consumer, he sells the consumer to the
product. He does not improve and simplify his merchandise, he
degrades and simplifies the client.”
William S. Burroughs

“I will permit no man to narrow and degrade my soul by making
me hate him.”
Booker T. Washington

Chapters in This Part


21. Degradation Allowance
22. Degradation Management
23. Resilient Algorithms
24. Software Redundancy

We have defined a dependable computer system as one that produces trustworthy
and timely results. Neither trustworthiness nor timeliness, however, is a binary
(all or none) attribute: results may be incomplete or inaccurate, rather than
missing or completely wrong, and they may be tardy enough to cause some
inconvenience, without being totally useless or obsolete. Thus, various levels of
inaccuracy, incompleteness, and tardiness may be distinguished and those that fall
within certain margins might be viewed as degradations rather than failures. The
first challenge in designing gracefully degrading systems is in mechanisms that
allow degradations to occur without violating performance or safety requirements.
The next challenge is to manage module switch-outs and switch-ins, as
malfunctions are diagnosed and as the affected modules return to service
following repair or recovery. We conclude this part by considering two specific
examples of degradation allowance via resilient algorithms and degradation
management by means of software redundancy.


21 Degradation Allowance
“A hurtful act is the transference to others of the degradation
which we bear in ourselves.”
Simone Weil

“My voice had a long, nonstop career. It deserves to be put to
bed with quiet and dignity, not yanked out every once in a while
to see if it can still do what it used to do. It can’t.”
Beverly Sills

Topics in This Chapter


21.1. Graceful Degradation
21.2. Diagnosis, Isolation, and Repair
21.3. Stable Storage
21.4. Process and Data Recovery
21.5. Checkpointing and Rollback
21.6. Optimal Checkpoint Insertion

The quotation “eighty percent of success is showing up,” from humorist Woody
Allen, can be rephrased for fail-soft systems as “eighty percent of not failing is
degradation allowance.” This is because malfunctions do not automatically lead to
degradation: they may engender an abrupt transition to failure. In other words,
providing mechanisms to allow operation in degraded mode is the primary
challenge in implementing fail-soft computer systems. For a malfunction to be
noncatastrophic, its identification must be quick and the module’s internal state
and associated data must be fully recoverable. Stable storage, checkpointing, and
rollback are some of the techniques at our disposal for this purpose.


21.1 Graceful Degradation

A dependable computer system produces trustworthy and timely results. In reality,
neither trustworthiness nor timeliness is a binary, all-or-none, attribute. For example,
results may be incomplete or inaccurate, rather than totally missing or completely wrong,
and they may be tardy enough to cause some inconvenience, without being totally useless
or obsolete. Thus, various levels of inaccuracy, incompleteness, and tardiness can be
distinguished and those that fall below particular thresholds might be viewed as
degradations rather than failures. A system that is capable of operating in such
intermediate states between fully operational and totally failed is characterized as
gracefully degradable, gracefully degrading, or fail-soft. The noun form referring to the
pertinent system attribute is graceful degradation.

Degradations occur in many different ways. A byte-sliced arithmetic/logic unit might lose
precision if a malfunctioning slice is bypassed through reconfiguration (inaccuracy). A
dual-processor real-time system with one malfunctioning unit might choose to ignore less
critical computations in order to keep up with the demands of time-critical events
(incompleteness). A malfunctioning disk drive within a transaction processing system can
effectively slow down the system’s response (tardiness). These are all instances of
degraded performance. In this broader sense, performance is quite difficult to define, but
Meyer [Meye80] does an admirable job:

“Evaluations of computer performance and computer reliability are each concerned,
in part, with the important question of computer system ‘effectiveness’. [Therefore,]
performance evaluations (of the fault-free system) will generally not suffice since
structural changes, due to faults, may be the cause of degraded performance. By the
same token, traditional views of reliability (probability of success, mean time to
failure, etc.) no longer suffice since ‘success’ can take on various meanings and, in
particular, it need not be identified with ‘absence of system failure’.”

Remember that in the quoted text above, faults/failures correspond to malfunctions in our
terminology. It was the concerns cited above that led to the definition of performability
(see Section 2.4) as a composite measure that encompasses performance and reliability
and that constitutes a proper generalization of both notions.

Graceful degradation isn’t a foregone conclusion when a system has resource redundancy
in the form of multiple processors, extra memory banks, parallel interconnecting buses,
and the like. Rather, active provisions are required to ensure that degradation, rather than
total interruption, of service will result upon module malfunctions. The title “Degradation
Allowance” for this chapter is intended to drive home the point that degradation must be
explicitly provided for in the design of a system.

Example 21.1: Degradation allowance is not automatic Describe a system that has more
resources of a particular kind than absolutely needed but that cannot gracefully degrade when even
one of those resources becomes unavailable.

Solution: Most automobiles have 4 wheels. In theory, a vehicle can operate with 3 wheels; in fact,
a variety of 3-wheeled autos exist. However, an ordinary 4-wheeled vehicle cannot operate if one
of the wheels becomes unavailable, because the design of 3-wheeled vehicles is quite different
from 4-wheeled ones.

Among the prerequisites for graceful degradation are quick diagnosis of isolated
malfunctions, effective removal and quarantine of malfunctioning elements, on-line
repair (preferably via hot-pluggable modules), and avoidance of catastrophic
malfunctions. The issues surrounding degradation management, that is, adaptation to
resource loss via task prioritization and load redistribution, monitoring of system
operation in degraded mode, returning the system to an intact or less degraded state at the
earliest opportunity, and resuming normal operation when possible, will be discussed in
Chapter 22.

On-line and off-line repairs, and their impacts on system operation and performance are
depicted in Fig. 21.fsoft. On-line repair is accomplished via the removal/replacement of
affected modules in a way that does not disrupt the operation of the remaining system
parts. Off-line repair involves shutting down the entire system while affected modules are
removed and their replacements are plugged in. Note that with on-line repair, it may be
possible to avoid system shut-down altogether, thus improving both availability and
performability of the system.


Fig. 21.fsoft A fail-soft system with possible on-line repair.


21.2 Diagnosis, Isolation, and Repair

The first step in allowing degradations is to correctly diagnose a malfunction. Removing
a malfunctioning unit is done by updating the system resource tables within the operating
system and, perhaps, via physical isolation (see Fig. 18.isol, for example) to ensure that
the rest of the system is not affected by improper or random behavior on the part of the
logically removed unit.

Next, a working configuration must be created that includes only properly functioning
units. Such a working configuration would exclude processors, channels, controllers, and
I/O elements (such as sensors) that have been identified as malfunctioning. Other
examples of resources that might be excluded are bad tracks on disk, garbled files, and
noisy communication channels. Additionally, virtual address remapping can be used to
remove parts of memory from use. In the case of a malfunctioning cache memory, one
might bypass the cache altogether or use a more restricted mapping scheme that exposes
only the properly functioning part.

The final considerations before resuming disrupted processes include the recovery of
state information from removed units, if possible, initializing any new resource that has
been brought on-line, and reactivating processes via rollback or restart.

When, at some future time, the removed units are to be returned to service (say, after
completion of repair or upon verification that the malfunction resulted from transient
rather than permanent conditions), the steps outlined above may have to be repeated.


21.3 Stable Storage

A storage device or system is stable if it can never lose its contents under normal
operating conditions. This kind of permanence is lacking in certain storage devices, such
as register files and SRAM/DRAM chips, unless the system is provided with battery
backup for a time duration long enough to save the contents of volatile memories, such as
a disk cache, on a more permanent medium. Until recently, use of disk memory was the
main way of realizing stable storage in computing devices, but now there are other
options such as flash memory and magnetoresistive RAM. Combined stability and
reliability can be provided for data via RAID-like methods.

Malfunction tolerance would become much easier if affected modules simply stopped
functioning, rather than engaging in arbitrary behavior that may be disruptive to the rest of
the system. Unpredictable or Byzantine malfunctions are notoriously difficult to handle.
Thus, we are motivated to seek methods of designing modules that behave in a
malfunction-stop manner.

Given access to a reliable stable storage, along with its controlling s-process and
(approximately) synchronized clocks, we can implement a k-malfunction-stop module
from k + 1 units that can perform the same function. These units do not have to be
identical. Here is how the s-process decides whether the high-level module has stopped:

Algorithm 21.mstop Behavior of s-process for a k-malfunction-stop module


Input: Bag R of received requests with appropriate timestamps
Output: possible setting of the variable stop to TRUE
if (|R| = k + 1) ∧ (¬stop) ∧ (all requests are identical and from different sources)
then if the request is a write
then perform the write operation in stable storage
else if the request is a read, send the value to all processors
endif
else set the variable stop in stable storage to TRUE
endif
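A direct Python transcription of this decision rule, with stable storage modeled as a dictionary and each request as a (source, op, address, value) tuple (value None for reads); all names here are ours:

```python
def s_process(requests, k, storage):
    """One decision round of the s-process guarding a k-malfunction-stop
    module built from k + 1 redundant units (Algorithm 21.mstop).

    `requests` is the bag R of (source, op, address, value) tuples for
    this round; `storage` models stable storage, with a reserved 'stop'
    flag.  Any disagreement halts the module cleanly."""
    sources = {src for src, *_ in requests}
    ops = {tuple(rest) for _, *rest in requests}
    if (len(requests) == k + 1 and not storage.get('stop')
            and len(ops) == 1 and len(sources) == k + 1):
        op, addr, value = next(iter(ops))
        if op == 'write':
            storage[addr] = value          # perform the agreed write
            return None
        return storage.get(addr)           # read: value sent to all units
    storage['stop'] = True                 # halt rather than misbehave
    return None

st = {}
s_process([(0, 'write', 'x', 5), (1, 'write', 'x', 5), (2, 'write', 'x', 5)], 2, st)
print(st)                    # agreed write performed: {'x': 5}
s_process([(0, 'write', 'x', 9), (1, 'write', 'x', 8), (2, 'write', 'x', 9)], 2, st)
print(st['stop'])            # one unit disagrees: module stops (True)
```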


21.4 Process and Data Recovery

The simplest recovery scheme is restart, in which case the partially completed actions
during the unsuccessful execution of a transaction must be undone. One way to achieve
this end is by using logs, which form the main subject of this section. Another way is
through the use of a method known as shadow paging. Note, however, that recovery with
restart may be impossible in systems that operate in real time, performing actions that
cannot be undone. Examples abound in the process control domain and in space
applications. Such actions must be compensated for as part of the degradation
management strategy.

The use of recovery logs has been studied primarily in connection with database
management systems. Logs contain sufficient information to allow the restoration of the
system to a recent consistent state. They maintain information about the changes made to
the data by various transactions. A previously backed up copy of the data is typically
restored, followed by reapplying the operations of committed transactions, up to the time
of failure, found in the recovery log.

A common protocol for recovery logs is write-ahead logging (WAL). Log entries can be
of two kinds, undo-type entries and redo-type entries, with some logs containing both
kinds of entries. Undo-type log entries hold old data values, so that the values can be
restored if needed. Redo-type entries hold new data values to be used in case of operation
retry. The main idea of write-ahead logging is that no changes should be made before the
necessary log entries are created and saved. In this way, we are guaranteed to have the
proper record for all changes should recovery be required. [More elaboration needed.]

Logs are sequential append-only files. The relative inefficiency of a sequential structure
isn’t a major concern, given that logs are rarely used to effect recovery.
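A minimal sketch of the write-ahead idea, with an in-memory dictionary standing in for stable storage (class and method names are ours, not a real DBMS API):

```python
class WriteAheadLog:
    """Minimal write-ahead logging sketch: every update appends an
    undo/redo record to an append-only log *before* touching the data,
    so a consistent state can always be reconstructed after a crash."""

    def __init__(self):
        self.log = []        # append-only: (txn, key, old_value, new_value)
        self.data = {}       # stands in for stable storage
        self.committed = set()

    def write(self, txn, key, value):
        self.log.append((txn, key, self.data.get(key), value))  # log first
        self.data[key] = value                                  # then update

    def commit(self, txn):
        self.committed.add(txn)

    def recover(self):
        """Undo every update of uncommitted transactions, newest first."""
        for txn, key, old, _ in reversed(self.log):
            if txn not in self.committed:
                if old is None:
                    self.data.pop(key, None)
                else:
                    self.data[key] = old

wal = WriteAheadLog()
wal.write('T1', 'x', 10); wal.commit('T1')
wal.write('T2', 'y', 20)          # T2 never commits before the crash
wal.recover()
print(wal.data)                   # only committed work survives: {'x': 10}
```

The old values in each record serve undo; the new values would serve redo if updates were deferred rather than applied in place.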

An efficient scheme for using recovery logs is via deferred updates. In this method, any
changes to the data are not written directly to stable storage. Rather, the data affected by
updates is cached in main memory; only after all changes associated with a transaction
have been applied will the data be written to stable storage (preceded, of course, by
saving the requisite log entries). In this way, access to the typically slow stable storage is
minimized and performance is improved.


An example of deferred updates is shadow paging. In order to avoid in-place updates,
which may create inconsistencies, any page to be modified is copied into a shadow page,
which is then freely updated, given that there are no external references to it [Hitz95].
Once the page becomes ready for assuming durable status, all pages that refer to the
original are updated to point to the new page. The idea is similar to the method used in
old batch-processing systems in which two copies of all daily updates were maintained
on separate disks, with one disk kept as back-up and the other one used as the starting
point for next day’s operation.


21.5 Checkpointing and Rollback

Long-running computations, whose execution times are comparable to or exceed the
hardware’s MTTF, are unlikely to complete before a hardware crash necessitates a restart.
This situation was a routine occurrence in early digital computers whose MTTF was
measured in hours, leading to many attempts at program execution before a successful
run to termination. Thus, programmers of early computers devised methods for recording
intermediate results and state of a computation so that after recovery from a hardware
failure, they did not have to restart the computation from the very beginning. This
technique came to be known as checkpointing. Modern digital systems have a much
longer MTTF but they also execute more complex programs, some of which may run for
days or even weeks. So, checkpointing is still a useful technique.

Example 21.chkpt1: Effect of checkpointing on task completion probability Suppose a
computation’s running time T is twice the MTTF of the machine used to execute it. Ignoring the
checkpointing time overhead, compare the probability of completing the computation in 2T time:
a. Assuming no checkpointing.
b. Assuming checkpointing at regular intervals of T/2

Solution: We assume an exponential reliability formula R = e^(–λt), with MTTF = 1/λ.


a. The system reliability over the time t = T = 2/λ is e^(–λt) = e^(–2) = 0.135 335, which is the probability
that the task completes in time T. The no-checkpointing case can be modeled as a 2-state discrete
Markov chain, with time step T and states S and C, corresponding to computation start and
completion. There is a single transition from S to C, with an associated probability 0.135 335,
leading to the transition matrix having rows (0.864 665 0.135 335) and (0 1). Beginning in state
(1 0), the state after two units T of time will be (0.747 646 0.252 354), with 0.252 354, or about
25%, being the completion probability after 2T time.
b. Under checkpointing at regular intervals T/2, the Markov chain will also have an intermediate
state H, where the task is half-completed, with transition probability from S to H and from H to C
being e^(–1) = 0.367 879. The unit time in this case is T/2 = 1/λ. Beginning with state (1 0 0), the
system will go through states (0.632 121 0.367 879 0), (0.399 577 0.465 088 0.135 335),
and (0.252 581 0.440 988 0.306 431) at times T/2, T, and 3T/2, before ending up in state
(0.159 662 0.371 677 0.468 661) at time 2T, with 0.468 661, or about 47%, being the
completion probability after 2T time.
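The figures in this example are easy to verify numerically; the sketch below recasts the two Markov chains as counts of successful attempts or intervals:

```python
import math

lam = 1.0                      # malfunction rate; MTTF = 1/lam
p_full = math.exp(-2.0)        # survive one whole run of length T = 2/lam
p_half = math.exp(-1.0)        # survive one checkpoint interval of T/2

# (a) No checkpointing: done within 2T iff one of the two length-T
# attempts runs malfunction-free.
done_a = 1 - (1 - p_full) ** 2

# (b) Checkpointing every T/2: done within 2T iff at least 2 of the 4
# intervals succeed (each success advances the state S -> H -> C; a
# crash merely repeats the current interval).
q = 1 - p_half
done_b = 1 - q ** 4 - 4 * p_half * q ** 3

print(round(done_a, 4), round(done_b, 4))   # ~0.2524 vs ~0.4687
```

Both values match the Markov-chain state vectors computed in the example.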


Checkpoints are placed at convenient locations along the course of a computation, not
necessarily at equal intervals, but we often assume equal checkpointing intervals for the
sake of analytical tractability. Checkpointing entails some overhead consisting of the
program instructions needed to save the state and partial results and those needed to
recover from failure by reloading a previously saved computation state.

We see from Example 21.chkpt1 that not using checkpoints may lead to a small
probability of task completion within a specified time period. On the other hand, using
too many checkpoints may be counterproductive, given the associated overhead. Thus,
there may be an optimal configuration that leads to the best expected completion time.
We will discuss optimal checkpointing in Section 21.6.
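As a preview, the trade-off is visible in the standard model where an interval of work t, retried from its checkpoint until it succeeds under exponential malfunctions of rate λ, takes expected time (e^(λt) – 1)/λ. The parameter values in this Python sketch are illustrative, not from the text:

```python
import math

def expected_completion(T, lam, s, n):
    """Expected time to finish a task of nominal length T that is
    checkpointed n times at equal intervals, with malfunction rate lam
    and per-checkpoint overhead s.  Each interval of length t, retried
    until it succeeds, takes expected time (e^(lam*t) - 1)/lam."""
    t = T / n + s                        # work per interval plus overhead
    return n * (math.exp(lam * t) - 1) / lam

T, lam, s = 2.0, 1.0, 0.05               # T = 2 * MTTF, 5% checkpoint cost
best = min(range(1, 30), key=lambda n: expected_completion(T, lam, s, n))
print(best, round(expected_completion(T, lam, s, best), 3))
```

For these values the expected time first drops as checkpoints are added, then rises again once the per-checkpoint overhead dominates, confirming that an intermediate checkpoint count is optimal.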

Once a system malfunction disrupts an in-progress computation, the computation must be
rolled back to its latest checkpoint. Thus, checkpointing and rollback go hand in hand in
recovering from system malfunctions. Process rollback or restart creates no problem for
tasks that perform I/O only at their start and termination. Referring to Fig. 21.chkpt1,
recovery from the detected malfunction, which affects processes 2 and 6, is readily
accomplished by rolling back process 2 to its checkpoint 2 and restarting process 6.
Interacting processes require more care during rollback, as we will see shortly.

Fig. 21.chkpt1 Checkpointing for multiple independent, noncommunicating processes.


Example 21.chkpt2: Data checkpointing Consider data objects stored on a primary site and k
backup sites. With appropriate design, such a scheme will be k-malfunction-tolerant. Each data
access request is sent to the primary site, where a read request is honored immediately. A write
request triggers a chain of events consisting of the primary site sending update transactions to the
backup sites and acknowledging the original write request only after acknowledgements have been
received from all backup sites. Argue that increasing the number k of backup sites beyond a
certain point may be counterproductive in that it may lead to lower data availability.

Solution: When the primary site is up and running, data is available. Data becomes unavailable in
three system states: (1) Recovery state, in which the primary site has malfunctioned and the system
is in the process of “electing” a new primary site from among the backup sites. (2) Checkpoint
state, in which the primary site is performing data backup. (3) Idle state, in which all sites, primary
and backup, are unavailable. As the number of backups increases, the system will spend more time
in states 1 and 2 and less time in state 3, so there may be an optimal choice for the number k of
backup sites. Analysis by Huang and Jalote [Huan89], with a number of specific assumptions,
indicates that data availability goes up from the nonredundant figure of 0.922 to 0.987 for k = 1,
0.996 for k = 2, 0.997 for k = 4, beyond which there is no improvement.

The checkpointing scheme depicted in Fig. 21.chkpt1 is synchronous in the sense that all
running processes do their checkpointing at the same time, perhaps dictated by a
central system authority. In large or loosely coupled systems, it is more likely for
processes to schedule their checkpoints independently, based on their own state of
computation and on when checkpointing is most convenient. These asynchronous
checkpoints, depicted in Fig. 21.chkpt2, do not create any difficulty if the processes are
independent and non-interacting. Upon a detected malfunction, all affected processes are
notified, with each process independently rolling back to its latest checkpoint.

Fig. 21.chkpt2 Asynchronous checkpointing for multiple independent communicating processes.


If the independent processes interact, however, as shown by the dashed arrows
representing message exchanges in Fig. 21.chkpt2, complications might arise. For
example, if a malfunction is detected at the instant shown, Process 2 is rolled back to its
latest checkpoint 2.1, with no other action necessary, given that the process has had no
interaction with other processes since that checkpoint. On the other hand, rolling back
Process 5 to its latest checkpoint 5.2 creates the problem that the process will be missing
events 5.2 and 5.3, corresponding to messages arriving from Processes 1 and 3,
respectively, upon its repeated execution. This is because Processes 1 and 3 have
progressed in their execution (they were not affected by the malfunction) and will thus
not resend those two messages.

One way of dealing with such dependencies is to also roll back certain interacting
processes when a given process is rolled back. There is a possibility of a chain reaction
that could lead to all processes having to restart from their very beginning. In general, we
need to identify a recovery line, or a consistent set of checkpoints, whose selection would
lead to correct re-execution of all processes. This is a nontrivial problem. An alternative
approach is to create stable logs of all interprocess communications, so that a process can
consult them upon re-execution.
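The recovery-line search described above can be sketched as an iterative rollback: start every process at its latest available checkpoint, then keep rolling back any process whose saved state records the receipt of a message that its rolled-back sender would never resend. The process names, checkpoint times, and messages below are illustrative (not taken from Fig. 21.chkpt2), and lost in-transit messages, which message logging would handle, are ignored.

```python
def recovery_line(checkpoints, messages):
    """checkpoints: {process: sorted checkpoint times, 0 = process start};
    messages: (sender, send_time, receiver, receive_time) tuples.
    Returns the latest consistent checkpoint per process: no message may be
    recorded as received by a state that the rolled-back sender never sent."""
    line = {p: cps[-1] for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for s, ts, r, tr in messages:
            if tr <= line[r] and ts > line[s]:   # orphan message detected
                # Roll the receiver back to before the receipt; time 0
                # (the process start) always qualifies.
                line[r] = max(c for c in checkpoints[r] if c < tr)
                changed = True
    return line

# P1 is the process hit by the malfunction, so only its checkpoints at times
# 0 and 5 survive; its message sent at time 6 orphans P2's checkpoint at 8,
# and P2's rollback in turn orphans P3's checkpoint at 6 (a domino effect).
cps = {"P1": [0, 5], "P2": [0, 4, 8], "P3": [0, 6]}
msgs = [("P1", 6, "P2", 7), ("P2", 5, "P3", 6)]
print(recovery_line(cps, msgs))   # {'P1': 5, 'P2': 4, 'P3': 0}
```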


21.6 Optimal Checkpoint Insertion

There is a clear tradeoff in checkpoint insertion. Too few checkpoints lead to long
rollback and waste of computational resources in the event of a malfunction. Too many
checkpoints lead to excessive time overhead. These two opposing trends are depicted in
Fig. 21.optcp-a. As in many other engineering problems, there is often a happy medium
that can be found analytically or experimentally.

Example 21.optcp1: Optimal number of checkpoints Consider a computation of running
time T divided into q segments, so that there are q – 1 checkpoints. Let λ denote the malfunction
rate and Tcp be the time needed to capture a checkpoint snapshot. Determine an optimal value for q
that minimizes the total running time.

Solution: The computation can be viewed as having q + 1 states corresponding to the fraction i/q
of it completed, for i = 0 to q. From each state to the next one in sequence, the transition
probability over the time step T/q is 1 – λT/q, as depicted in the discrete-time Markov chain of Fig.
21.optcp-b. By using the latter linear approximation, we have implicitly assumed that T/q << 1/λ.
We can easily derive Ttotal = T/(1 – λT/q) + (q – 1)Tcp = T + λT²/(q – λT) + (q – 1)Tcp.
Differentiating Ttotal with respect to q and equating with 0 yields qopt = T(λ + √(λ/Tcp)). For
example, if we have T = 200 hr, λ = 0.01/hr, and Tcp = 1/8 hr, we get qopt = 59 and Ttotal ≈ 211 hr.
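The optimal segment count can be checked numerically; this sketch just evaluates the example's closed-form optimum qopt = T(λ + √(λ/Tcp)) for its parameters and reports the resulting checkpointing overhead.

```python
import math

def q_opt(T, lam, T_cp):
    """Optimal number of segments: q_opt = T * (lam + sqrt(lam / T_cp)),
    rounded to the nearest integer."""
    return round(T * (lam + math.sqrt(lam / T_cp)))

q = q_opt(200, 0.01, 1/8)
print(q, (q - 1) * 1/8)   # 59 segments, i.e., 58 checkpoints costing 7.25 hr
```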

(a) Opposing trends (b) Determination of optimal checkpoint frequency

Fig. 21.optcp Tradeoffs in checkpoint insertion.


Example 21.optcp2: Optimal checkpointing in transaction processing Thus far, we have
incorporated rollback time into the checkpointing overhead and assumed the composite overhead
to be a constant. In some application contexts, such as transaction processing, it is possible that the
rollback time increases (say, linearly) with the time interval over which the computation is to be
rolled back. Representing the checkpointing period by Pcp and the rollback overhead by a + bx,
where x (0 < x < Pcp) is the malfunction time within a checkpointing period and b is a relatively
small constant of proportionality that accounts for the time needed for certain actions such as
updates, find the optimal value of Pcp.

Solution: The expected rollback time due to a malfunction in the time interval [0, Pcp] is found by
integrating λ(a + bx)dx over [0, Pcp], yielding Trb = λPcp(a + bPcp/2). We can choose Pcp to
minimize the relative checkpointing overhead O = (Tcp + Trb)/Pcp = Tcp/Pcp + λ(a + bPcp/2) by
equating dO/dPcp with 0. The result is Pcpopt = √(2Tcp/(λb)). Let us assume, for example, that Tcp =
16 min and λ = 0.0005/min (corresponding to an MTTF of 33.3 hr). Then, the optimal
checkpointing period is Pcpopt = 800 min = 13.3 hr. If by using a faster memory for saving the
checkpoint snapshots we can reduce Tcp to 1 min (a factor of 16 reduction), the optimal
checkpointing period goes down by a factor of 4 to become Pcpopt = 200 min = 3.3 hr.
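The optimum above is easy to evaluate numerically. Note that the example leaves the slope b of the rollback cost unspecified; b = 0.1 (minutes of rollback per minute of elapsed period) is an assumed value that reproduces the quoted 800-minute optimum.

```python
import math

def p_cp_opt(T_cp, lam, b):
    """Optimal checkpointing period: P_cp_opt = sqrt(2 * T_cp / (lam * b))."""
    return math.sqrt(2 * T_cp / (lam * b))

b = 0.1   # assumed rollback-cost slope; not given in the example
print(round(p_cp_opt(16, 0.0005, b), 3))   # 800.0 min, as in the example
print(round(p_cp_opt(1, 0.0005, b), 3))    # 200.0 min after the 16x reduction
```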


Problems

21.1 Checkpointing for long computations


In the optimal checkpointing example for long computations in Section 21.6, q = 59 (insertion of 58
checkpoints) was determined to be optimal. Because each checkpoint entails 1/8 hr of overhead, this
accounts for a tad over 7 hr of the running time extension from 200 hr to 211 hr. What do you think is the
source of the additional 4 hr in estimated additional running time?

21.2 Optimal checkpointing


We discussed optimal checkpointing under the assumption that time overhead per checkpoint is a constant.
Suppose that checkpointing overhead is a linear function of checkpointing period, that is, the longer the
time interval between checkpoints, the more information there is to store and the longer the time overhead
for each checkpoint. Present an analysis of optimal checkpointing in this case, stating all your assumptions.

21.3 Effect of checkpointing on task completion probability


Continue Example 21.chkpt1 with checkpointing at regular intervals of T/3 and T/4. Discuss.

21.4 Discrete optimal checkpointing


In discussing optimal checkpointing, we assumed that we can insert a checkpoint at any desired point along
the course of the computation with the same overhead Tcp. In reality, a computation may have a finite set of
feasible times t1 < t2 < . . . < tm where checkpoints can realistically be placed and they have corresponding
checkpointing overheads T1, T2, … , Tm, with each Ti > 0 being a known constant. Outline a procedure for
finding an optimal subset of checkpoints from among the m choices. Is it possible for the optimal number
of checkpoints to be 0? What about m checkpoints being optimal?

21.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Hitz95] Hitz, D., J. Lau, and M. Malcolm, “File System Design for an NFS File Server Appliance,”
Network Appliance Corp. Technical Report 3002, Rev. C, March 1995.
https://atg.netapp.com/wp-content/uploads/2000/01/file-system-design.pdf
[Huan89] Huang, Y. and P. Jalote, “Analytic Models for the Primary Site Approach for Fault Tolerance,”
Acta Informatica, Vol. 26, pp. 543-557, 1989.
[Jalo94] Jalote, P., Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
[Kris93] Krishna, C. M. and A. D. Singh, “Reliability of Checkpointed Real-Time Systems Using Time
Redundancy,” IEEE Trans. Reliability, Vol. 42, No. 3, pp. 427-435, September 1993.
[Prad94] Pradhan, D. K. and N. H. Vaidya, “Roll-Forward Checkpointing Scheme: A Novel Fault-
Tolerant Architecture,” IEEE Trans. Computers, Vol. 43, No. 10, pp. 1163-1174, October
1994.
[Schl83] Schlichting, R. D. and F. B. Schneider, “Fail Stop Processors: An Approach to Designing
Fault-Tolerant Computing Systems,” ACM Trans. Computer Systems, Vol. 1, No. 3, pp. 222-
238, August 1983.
[Sour19] Souravlas, S. and A. Sifaleras, “Trends in Data Replication Strategies: A Survey,” Int’l J.
Parallel, Emergent and Distributed Systems, Vol. 34, No. 2, pp. 222-239, 2019.


22 Degradation Management
“Most of us don't think, we just occasionally rearrange our
prejudices.”
Frank Knox

“The communications links were constantly tested by means of
sending filler messages. At the time of the false alerts, these filler
messages had the same form as attack messages, but with a
zero filled in for the number of missiles detected. The system did
not use any of the standard error correction or detection
schemes for these messages. When the chip failed, the system
started filling in the ‘missiles detected’ field with random digits.”
A. Borning, Computer System Reliability and
Nuclear War

Topics in This Chapter


22.1. Data Distribution Methods
22.2. Multiphase Commit Protocols
22.3. Dependable Communication
22.4. Dependable Collaboration
22.5. Remapping and Load Balancing
22.6. Modeling of Degradable Systems

Assuming that malfunctioning units are correctly identified, offending subsystems
are isolated, reconfiguration is appropriately performed, and recovery processes
are successfully executed, several other steps are still necessary for the system to
be able to function in a degraded mode. Tasks must be prioritized and those that
cannot be executed with the limited resources disabled or removed. Similarly,
adaptation in the opposite direction is required when previously malfunctioning
resources are returned to service following checkout or repair. Thus, degradation
management aims to ensure the smooth functioning of the system under resource
fluctuations (losses and reactivations).


22.1 Data Distribution Methods

Reliable data storage requires that the availability and integrity of data not be dependent
on the health of any one site. To ensure this property, data may be replicated at different
sites, or it may be dispersed so that losing pieces of the data does not preclude its accurate
reconstruction.

As discussed earlier, data replication can place a large burden on the system, perhaps
even leading in the extreme to the nullification of its advantages. The need for updating
all replicas before proceeding with further actions is one such burden. One way around
this problem is the establishment of read and write quorums. Consider the example in
Fig. 22.integ-a, where the 9 data replicas are logically viewed as forming a 2D array. If a
read operation is defined as accessing the 3 replicas in any one column (the read quorum)
and selecting the replica with the latest time-stamp, then the system can safely proceed
after updating any 3 replicas in one row (the write quorum). Because the read and write
quorums intersect in all cases, there is never a danger of using stale data and the system is
never bogged down if one or two replicas are out of date or unavailable.
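The row/column intersection property of Fig. 22.integ-a can be checked exhaustively; the numbering of replicas below is illustrative.

```python
from itertools import product

# The 3-by-3 grid of Fig. 22.integ-a, replicas numbered 0..8 row-major:
# a write quorum is any full row, a read quorum any full column.
grid = [[3 * r + c for c in range(3)] for r in range(3)]
write_quorums = [set(row) for row in grid]
read_quorums = [{grid[r][c] for r in range(3)} for c in range(3)]

# Every read quorum meets every write quorum in exactly one replica, so a
# reader that takes the latest time-stamp in its column always sees the
# newest committed write.
assert all(w & r for w, r in product(write_quorums, read_quorums))
print("every row quorum intersects every column quorum")
```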

(a) Replication with quorums (b) Data dispersion example

Fig. 22.integ Ensuring data integrity and availability via replication and
dispersion.


A similar result can be achieved via the data dispersion scheme of Fig. 22.integ-b, where
a piece of data is divided into 6 pieces, which are then encoded in such a way that any
two of the encoded pieces suffice for reconstructing the original data. Such an encoding
requires 3-fold redundancy and is thus comparable to 3-way replication in terms of
storage overhead. Now, if we define read and write quorums to comprise any 4 of the 6
encoded pieces, gaining access to any 4 pieces would be sufficient for reconstructing the
data, because the 4 pieces are bound to have at least 2 pieces that are up to date (have the
latest time stamp). This scheme too eases the burden of updating the data copies by not
requiring that every single piece be kept up to date at all times.
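The "at least 2 up-to-date pieces" guarantee of Fig. 22.integ-b is a pigeonhole argument (4 + 4 − 6 = 2), which can also be verified exhaustively:

```python
from itertools import combinations

# Fig. 22.integ-b: data encoded into 6 pieces, any 2 of which suffice for
# reconstruction; read and write quorums are any 4 of the 6 pieces.
pieces = range(6)
min_overlap = min(len(set(w) & set(r))
                  for w in combinations(pieces, 4)
                  for r in combinations(pieces, 4))
print(min_overlap)   # 2: any read quorum holds at least 2 up-to-date pieces
```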


22.2 Multiphase Commit Protocols

Consider the following puzzle known as “the two generals problem.” The setting is as
follows. Two generals lead the two divisions of an army camped on the mountains on
each side of an enemy-occupied valley. The two army divisions can communicate only
via messengers. Messengers, who are loyal and highly reliable, may need an arbitrary
amount of time to cross the valley and in fact may never arrive due to being captured by
the enemy forces.

We need a scheme for the two generals G1 and G2 to agree on a common attack time,
given that attack by only one division would be disastrous. Here is a possible scenario.
General G1 decides on time T and sends a messenger to inform G2. Because G1 will not
attack unless he is certain that G2 has received the message about his proposed attack
time, G2 sends an acknowledgment to G1. Now, G2 will have to make sure that G1 has
received his acknowledgment, because he knows that G1 will not attack without it. So, G1
must acknowledge G2’s acknowledgment. This can go on forever, without either general
ever being sure that the other general will attack at time T.

The situation above is akin to what happens at a bank’s computer system when you try to
make cash withdrawal from an ATM. The ATM should not dispense cash unless it is
certain that the bank’s central computer checks the account balance and adjusts it after
the withdrawal. On the other hand, you will not like it if your account balance is reduced
without any cash being dispensed. So, the two sides, the ATM and the bank’s database,
must act in concert, either both completing the transaction or both abandoning it. Thus,
the withdrawal transaction, or electronic funds transfer between two accounts, must be an
atomic, all-or-none action.

A key challenge is maintaining the atomicity of such actions in the presence of
malfunctions in various system components. In centralized systems, atomicity can be
ensured via locking mechanisms. Each operation is performed in three phases:

Acquire (read or write) lock for a desired data object and operation
Perform operation while holding the lock
Release lock

One must take care to avoid deadlocks arising from circular waiting for locks.


Fig. 22.2pcp Coordinator and participant states in the two-phase commit protocol.

An alternative to the use of locks is devising a protocol that requires cross-checking
before committing to changes arising from a transaction. The simplest such protocol is
known as “the two-phase commit protocol.” The protocol is executed between a
coordinator and a number of participants, which have the states depicted in Fig. 22.2pcp.
[Details to be supplied.]

To avoid participants being stranded in the “wait” state (e.g., when the coordinator
malfunctions), a time-out scheme may be implemented.
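Pending the full protocol details, the message flow of two-phase commit can be sketched as follows. The function and message names are illustrative; a real implementation would also log each state transition to stable storage and time out on missing votes.

```python
def two_phase_commit(participants):
    """Phase 1: the coordinator asks every participant to prepare and
    collects their yes/no votes. Phase 2: it commits iff all voted yes,
    otherwise aborts; participants waiting in the 'ready' state apply
    whichever decision the coordinator broadcasts."""
    votes = [vote() for vote in participants]           # "vote-request" round
    decision = "commit" if all(votes) else "abort"      # global decision
    return decision, [decision for _ in participants]   # decision broadcast

ok = lambda: True          # a participant that can prepare successfully
refuse = lambda: False     # a participant that must vote no

print(two_phase_commit([ok, ok, ok]))      # ('commit', ...)
print(two_phase_commit([ok, refuse, ok]))  # one no vote forces ('abort', ...)
```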

To deal with the shortcomings of the two-phase commit, a three-phase commit protocol
may be devised. As shown in Fig. 22.3pcp, an extra “prepare” state is inserted between
the “wait” and “commit” states of two-phase commit. This protocol is safe from
blocking, given the absence of a local state that is adjacent to both a “commit” state and
an “abort” state.


Fig. 22.3pcp Coordinator and participant states in the three-phase commit protocol.


22.3 Dependable Communication

Point-to-point messages can be protected against communication errors due to
malfunctioning links or intermediate nodes through encoding, requiring receipt
acknowledgments, and implementation of a time-out mechanism.

It is sometimes required that a message be reliably broadcast or multicast to a set of
nodes, so that it is guaranteed to be received by all the intended nodes. One way of
accomplishing reliable broadcast is to send the message along the branches of a broadcast
tree, with possible repetition. In this scheme, duplicate messages can be recognized from
their sequence numbers.

In order to cut down on the amount of communication during reliable broadcasting,
acknowledgment messages may be piggybacked on subsequent broadcast messages.
Suppose node P broadcasts a message m1. Upon receiving m1, node Q may tack on an
acknowledgment for m1 to its own broadcast message m2. If a third node R did not
receive m1, it will find out about it from Q’s acknowledgment and will take steps to
acquire the missed message, perhaps by asking P for a retransmission.
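The piggybacked-acknowledgment idea can be sketched as follows; the node names match the P, Q, R scenario above, but the data structures and message ids are illustrative.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.delivered = set()   # ids of broadcasts this node has seen
        self.missing = set()     # ids learned of only via piggybacked acks

    def receive(self, msg_id, piggybacked_acks):
        """Deliver a broadcast; any piggybacked ack for a message we never
        delivered reveals a loss, to be fixed by requesting retransmission."""
        self.delivered.add(msg_id)
        self.missing |= piggybacked_acks - self.delivered

P, Q, R = Node("P"), Node("Q"), Node("R")

P.receive("m1", set())        # P delivers its own broadcast m1
Q.receive("m1", set())        # Q receives m1, but R never does (m1 lost)
for node in (P, R):           # Q's next broadcast m2 piggybacks its ack of m1
    node.receive("m2", {"m1"})

print(R.missing)              # R learns it missed m1 and can ask P for a resend
```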

Atomic broadcasting entails not only reliable message delivery but also requires that
multiple broadcasts be received in the same order by all nodes. If we use the scheme
outlined in the preceding paragraph, in-order delivery of messages will not be guaranteed,
so atomic broadcasting is much more complicated.

Another form of reliable broadcasting is causal broadcast, which requires that if m2 is
sent after m1, any message triggered by m2 must not cause actions before those of m1 have
been completed.


22.4 Dependable Collaboration

Many distributed systems, built from COTS nodes (processors plus memory) and
standard interconnects, contain sufficient resource redundancy to allow the
implementation of software-based malfunction tolerance schemes. Interconnect
malfunctions are dealt with by synthesizing reliable primitives for point-to-point and
collective communication (broadcast, multicast, and so on), as discussed in Section 22.3.
Node malfunctions are modeled differently, as illustrated in Fig. 22.malfns, with possible
models ranging from benign crash malfunctions, which are fairly easy to deal with, to
arbitrary or Byzantine malfunctions, which require greater care in protocol development
and much higher redundancy.

A potentially helpful resource in managing a group of cooperating nodes that are subject
to malfunctions, is a reliable group membership service. The group’s membership may
expand and contract owing to changing processing requirements or because of
malfunctions and repairs. A reliable group membership service maintains up-to-date
status information and thus supports a reliable multicast, via which a message sent by one
group member is guaranteed to be received by all other members.

Fig. 22.malfns Node malfunctions range from benign to Byzantine.


Another recent development in the theory of distributed systems is the notion of
malfunction detector, a distributed oracle tasked with monitoring system resources for
signs of malfunctions. As part of its operation, a malfunction detector creates and
maintains a list of suspected processes characterized by two properties: completeness
(having all malfunctioning processes in the list) and accuracy (having no healthy
processes). Using specialized malfunction detectors decouples the effort to detect
malfunctions from that of the actual computation, leading to greater modularity. It also
improves portability, because the same application can be used on a different platform if
suitable malfunction detectors are available for it.

A perfect malfunction detector, having strong completeness and strong accuracy, is the
minimum required for interactive consistency. Strong completeness, along with eventual
weak accuracy, constitutes the minimum requirement for consensus [Rayn05]. [Elaboration to
be added.]


22.5 Remapping and Load Balancing

When the configuration of a system changes due to the detection and removal of
malfunctioning units, division of labor in ongoing computations must be reconsidered.
This can be done via a remapping scheme that determines the new division of
responsibilities among participating modules, or via load balancing (basic computational
assignments do not change, but the loads of the removed modules are distributed among
other modules). Load balancing is also used not just to accommodate lost/recovered
resources due to malfunctions and repairs, but also to optimize system performance in the
face of changing computational requirements.

Even in the absence of a detected malfunction and the attendant system reconfiguration,
remapping of a computation to have its various pieces executed by different modules may
be useful for exposing hidden malfunctions. This is because the effects of a
malfunctioning module will likely be different on diverse computations, making it highly
unlikely to get the same final results for the original and remapped computation.

Let us consider a remapping example for a computation that runs on a 5-cell linear array.
By adding a redundant 6th cell at the end of the array (Fig. 22.remap), we can arrange for
the computation to be performed in two different ways: one starting in cell 1 and another
starting in cell 2 [Kwai97]. Each cell j + 1 can be instructed to compare the result of step j
in the computation that it received from the cell to its left to the result of step j that it
obtains within the second computation. A natural extension of this scheme is to provide 2
extra cells in the array and to perform three versions of the computation, with cell j + 2
voting on the three results obtained by cell j in the first computation, cell j + 1 in the
second version, and cell j itself in the third version.
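A small simulation of this shifted-recomputation check, under an assumed fault model (a faulty cell adds 1 to any result it produces) and an assumed computation (a running sum); any per-step disagreement between the two versions flags a malfunction. The cell numbering and step function are illustrative, not taken from [Kwai97].

```python
def run(step_count, faulty_cell, start_cell):
    """Compute step-by-step results on a linear array: step j executes in
    cell start_cell + j - 1; a faulty cell corrupts its result (adds 1)."""
    results, acc = [], 0
    for step in range(1, step_count + 1):
        cell = start_cell + step - 1   # cell executing this step
        acc += step                    # the computation: a running sum
        if cell == faulty_cell:
            acc += 1                   # fault corrupts this cell's output
        results.append(acc)
    return results

q, bad = 5, 3                          # cell 3 malfunctions
v1 = run(q, bad, start_cell=1)         # version 1 in cells 1..5
v2 = run(q, bad, start_cell=2)         # version 2, shifted, in cells 2..6
mismatches = [s + 1 for s in range(q) if v1[s] != v2[s]]
print(mismatches)   # [2]: versions disagree at step 2, flagging the malfunction
```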

Fig. 22.remap Recomputation with shift in space.


22.6 Modeling of Degradable Systems

A gracefully degradable system typically has one ideal or intact state, multiple degraded
states, and at least one failure state, as depicted in Fig. 22.gdsys. To ensure that the
system degrades rather than fails or comes to a halt, we have to reduce the probability of
malfunctions going undetected, increase the accuracy of malfunction diagnosis, make
repair rates much greater than malfunction rates (typically, by keeping hot-pluggable
spares), and provide a sufficient safety factor in computational capacity.

In practice, besides the indirect paths to failure corresponding to resource exhaustion, as
depicted in Fig. 22.gdsys, there may be direct or semidirect paths that lead to failure in a
shorter amount of time (Fig. 22.paths). These faster paths to failure arise from imperfect
coverage in malfunction detection or in the attendant reconfiguration to tolerate a
detected malfunction. While the provision of additional spare capacity lengthens the
indirect paths, it does nothing to avoid the direct paths; on the contrary, it may increase
the probability of taking a direct path, given the growth in the complexity of system-level
reconfiguration mechanisms. For a given coverage factor, addition of resources beyond a
certain point would not be cost-effective with regard to the resulting reliability gain. This
is quite similar to the effect we saw previously in standby sparing.
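The diminishing return described above can be illustrated with a simple repairless model; the specific assumptions here are ours, not the text's: the number of malfunctions during a mission is Poisson with mean λt, each malfunction is survived with coverage probability c, and at most n spare-supported degradations are possible. Reliability then can never exceed the coverage-imposed ceiling e^–(1–c)λt, no matter how many spares are added.

```python
import math

def mission_reliability(n_spares, c, lam_t):
    """Survive iff the Poisson(lam_t) number of malfunctions is at most
    n_spares and each one is covered (probability c, independently)."""
    return sum(c**k * math.exp(-lam_t) * lam_t**k / math.factorial(k)
               for k in range(n_spares + 1))

for n in (0, 1, 2, 4, 8):
    print(n, round(mission_reliability(n, c=0.95, lam_t=1.0), 4))
# No number of spares can beat the ceiling e**(-(1 - c) * lam_t):
print("ceiling:", round(math.exp(-0.05), 4))
```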

Fig. 22.gdsys State-space model for a gracefully degradable system, with multiple degradation levels and multiple failure states.


Fig. 22.paths Direct, semidirect, and indirect paths to failure in a degradable system.


Problems

22.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx

22.2 Soft failures in data centers


Read the paper [Sank14] and prepare a one-page, single-spaced report covering the nature of the failures
considered, their frequency (e.g., single event or recurrent), their effects on down time, and possible
countermeasures. Begin your report with a one-sentence summary of key findings.

22.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Jalo94] Jalote, P., Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
[Kwai97] Kwai, D.-M. and B. Parhami, “An On-Line Fault Diagnosis Scheme for Linear Processor
Arrays,” Microprocessors and Microsystems, Vol. 20, No. 7, pp. 423-428, March 1997.
[Rabi89] Rabin, M., “Efficient Dispersal of Information for Security, Load Balancing, and Fault
Tolerance,” J. ACM, Vol. 36, No. 2, pp. 335-348, April 1989.
[Rayn05] Raynal, M., “A Short Introduction to Failure Detectors for Asynchronous Distributed
Systems,” ACM SIGACT News, Vol. 36, No. 1, pp. 53-70, March 2005.
[Sank14] Sankar, S. and S. Gurumurthi, “Soft Failures in Large Data Centers,” IEEE Computer
Architecture Letters, Vol. 13, No. 2, pp. 105-108, July-December 2014.


23 Resilient Algorithms
“Perfection is achieved, not when there is nothing more to add,
but when there is nothing left to take away.”
Antoine de Saint Exupery

“The catastrophic nature of some program failures, in which the
program collapses suddenly and utterly, has its analogue in the
metallurgical phenomenon of brittle fracture, in which the crack
propagates at nearly the speed of sound.”
P.W. Abrahams, The Role of Failure in Software
Design

Topics in This Chapter


23.1. COTS-Based Paradigms
23.2. Robust Data Structures
23.3. Data Diversity and Fusion
23.4. Self-Checking Algorithms
23.5. Self-Adapting Algorithms
23.6. Other Algorithmic Methods

One approach to degradation allowance is through the design of resilient algorithms
that are by design insensitive to resource malfunctions. Resilient algorithms have
built-in redundancy in their computations and data structures, so that when a
limited number of malfunctions are experienced, the resulting errors are detected
or even corrected. This approach is very attractive because it may allow the use of
commercial off-the-shelf system components instead of specially designed, and
rather expensive, hardware. Beginning from primitive, or building-block,
algorithms for communication and other forms of collaboration, a multilevel
algorithmic structure is built that can function correctly under adverse conditions.


23.1 COTS-Based Paradigms

Many of the hardware and software redundancy methods assume that we are building the
entire system (or a significant part of it) from scratch. Many users of highly reliable
systems, however, do not have the capability to develop such systems and thus have one
of two options.

The first option is to buy dependable systems from vendors that specialize in such
systems and associated support services. Here is a partial list of companies which have
offered, or are now offering, fault-tolerant systems and related services:

ARM: Fault-tolerant ARM (launched in late 2006), automotive applications
Nth Generation Computing: High-availability and enterprise storage systems
Resilience Corp.: Emphasis on data security
Stratus Technologies: “The Availability Company”
Sun Microsystems: Fault-tolerant SPARC (ft-SPARC™)
Tandem Computers: An early leader, part of HP/Compaq since 1997

An alternative is to build upon commercial off-the-shelf (COTS) components and
systems some protective layers that ensure dependable operation. A number of algorithm
and data-structure design methods that are resilient to imperfect hardware are available.

An early experiment with the latter approach was performed in the 1970s, when Stanford
University built one of two “concept systems” for fly-by-wire aircraft control, using
mostly COTS components. The resulting multiprocessor, named SIFT (software-
implemented fault tolerance) was meant to introduce a fault tolerance scheme that
contrasted with the hardware-intensive approach of MIT’s FTMP (fault-tolerant
multiprocessor). The Stanford and MIT design teams strived to achieve a system failure
rate goal of 10^–9 per hour over a 10-hour flight, which is typical of avionics safety
requirements. Some fundamental results on, and methods for, clock synchronization
emerged from the SIFT project. To prevent errors from propagating in SIFT, processors
obtained multiple copies of data from different memories over different buses (local
voting).

The COTS approach to fault tolerance has some inherent limitations. Some modern
microprocessors have dependability features built in: they may use parity and other codes


in memory, TLB, and microcode store; they may take advantage of retry features at
various levels, from bus transmissions to full instructions; they may provide
machine check facilities and registers to hold the check results. According to Avizienis,
however, these features are often not documented enough to allow users to build on them,
the protection provided is nonsystematic and uneven, recovery options may be limited to
shutdown and restart, description of error handling is scattered among a lot of other
detail, and there is no top-down view of the features and their interrelationships [Aviz97].

Manufacturers can incorporate more advanced, as well as entirely new, features, and at
times have experimented with a number of mechanisms, but until recently, the low
volume of the application base hindered commercial viability.


23.2 Robust Data Structures

Stored and transmitted data can be protected against unwanted changes through encoding,
but coding does not protect the structure of the data. Consider, for example, an ordered
list of numbers. Individual numbers can be protected by encoding and the set of values
can be protected by a checksum; the ordering of data, however, remains unprotected with
either scheme. Some protection against an inadvertent change in ordering can be
provided by a weighted checksum. Another idea is to provide within the array a field that
records the difference between each element and the one that follows it. A natural
question at this point is whether we can devise general schemes for protecting data
structures of common interest.
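As a concrete illustration of these ideas, the following sketch (the encoding and audit routines are my own construction, not from the text) pairs a list of values with successor differences and a modular checksum; an audit then cross-checks the forms of redundancy against each other:

```python
def encode(values, mod=2**16):
    """Pair the values with successor differences and a modular checksum."""
    diffs = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    return {"values": list(values), "diffs": diffs,
            "checksum": sum(values) % mod}

def audit(enc, mod=2**16):
    """Return a list of detected inconsistencies (empty if none)."""
    errors = []
    v = enc["values"]
    if sum(v) % mod != enc["checksum"]:
        errors.append("value checksum mismatch")
    for i, d in enumerate(enc["diffs"]):
        if v[i + 1] - v[i] != d:
            errors.append("ordering mismatch at position %d" % i)
    return errors
```

Swapping two values leaves the sum-based checksum intact but trips the difference fields, which is precisely why ordering needs its own protection.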

Let us first consider linked lists (Fig. 23.rlist). [Details to be supplied.]

• Robust data structures provide fairly good protection with little design effort or
run-time overhead
• Audits can be performed during idle time
• Reuse possibility makes the method even more effective
• Robustness features to protect the structure can be combined with coding methods
(such as checksums) to protect the content

Fig. 23.rlist Robustness of linked lists.
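Pending the details above, a minimal sketch (my own construction, in the spirit of [Tayl80]) shows how a doubly linked list's redundant backward chain lets an audit detect, and a repair routine fix, a single corrupted forward link:

```python
class Node:
    def __init__(self, key):
        self.key, self.next, self.prev = key, None, None

def build(keys):
    """Build a doubly linked list; return (head, tail)."""
    nodes = [Node(k) for k in keys]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes[0], nodes[-1]

def audit(head):
    """Count forward links not mirrored by the matching backward link."""
    bad, node = 0, head
    while node is not None and node.next is not None:
        if node.next.prev is not node:
            bad += 1
        node = node.next
    return bad

def repair(tail):
    """After a single corrupted forward link, rebuild it by walking the
    intact backward chain from the tail; returns the key at the fix."""
    order, node = [], tail
    while node is not None:
        order.append(node)
        node = node.prev
    order.reverse()                     # true front-to-back order
    for a, b in zip(order, order[1:]):
        if a.next is not b:
            a.next = b                  # restore from the redundant direction
            return a.key
    return None
```

This is the 2-connectivity argument in miniature: the backward chain is a second disjoint path to every node, so one severed forward link is correctable.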


Other robust data structures of interest include trees, FIFOs, stacks or LIFOs, heaps, and
queues. In general, a linked data structure is 2-detectable and 1-correctable iff the link
network is 2-connected.

Binary trees have weak connectivity and are thus quite vulnerable to corrupted or missing
links. One way to strengthen the connectivity of trees is to add parent links and/or threads
(links that connect a node to higher-level nodes). An example of a thread link is shown in
Fig. 23.rtree. Threads can be added with little overhead by taking advantage of unused
leaf links (one bit in every node can be used to identify leaves, thus freeing their link
fields for other uses).
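A small sketch (the node layout and audit routine are hypothetical, not from the text) shows how mirrored parent links expose a corrupted child pointer:

```python
class TNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right, self.parent = key, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self       # mirror each downward link

def audit(root):
    """Return keys of nodes with a child link not mirrored by a parent link."""
    bad, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in (node.left, node.right):
            if child is None:
                continue
            if child.parent is not node:
                bad.append(node.key)      # the two directions disagree
            else:
                stack.append(child)       # descend only along consistent links
    return bad
```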

Adding redundancy to data structures has three types of cost:

• Storage requirements for the additional information
• Slightly more difficult updating procedures
• Time overhead for periodic checking of structural integrity

Fig. 23.rtree Improving the robustness of a binary tree.


23.3 Data Diversity and Fusion

Re-expressing the same information in alternate formulations (input re-expression) is
known as data diversity. For example, a rectangle can be specified by its two sides x and
y, by the length z of its diagonals and the angle θ between them, or by the radii r and R
of its inscribed and circumscribed circles. As shown in Fig. 23.ddiv, diverse
representations lead to diverse calculations, thus reducing the chance of encountering the
same errors during multiple computations.
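The rectangle example can be sketched in code as follows; the formulas for the diagonal/angle and circle-based representations are my reading of the text's description (with r taken as the radius of the largest inscribed circle, i.e., half the shorter side):

```python
import math

def area_from_sides(x, y):
    return x * y

def area_from_diagonals(z, theta):
    # equal diagonals of length z crossing at angle theta
    return 0.5 * z * z * math.sin(theta)

def area_from_circles(r, R):
    # shorter side 2r; half-diagonal R gives the longer side by Pythagoras
    return 4 * r * math.sqrt(R * R - r * r)

def diverse_area(x, y, tol=1e-9):
    """Compute the area from three diverse representations and compare."""
    z = math.hypot(x, y)                          # diagonal length
    theta = 2 * math.atan2(min(x, y), max(x, y))  # angle between diagonals
    r, R = min(x, y) / 2, z / 2
    results = [area_from_sides(x, y),
               area_from_diagonals(z, theta),
               area_from_circles(r, R)]
    if max(results) - min(results) > tol * max(results):
        raise ValueError("diverse computations disagree")
    return sum(results) / 3
```

The three computations exercise different operations (multiplication, trigonometry, square root), so a single computational error is unlikely to corrupt all of them identically.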

The inverse of input re-expression is output fusion. When information is provided by
diverse sources, perhaps with different resolutions or formats, the task of reconciling the
differences in order to derive a more reliable assessment of the prevailing conditions is
quite nontrivial. [Elaborate.]

Fig. 23.ddiv Diverse representations and associated area calculations for a rectangle.


23.4 Self-Checking Algorithms

It is sometimes possible to design algorithms and associated data structures so that the
computation becomes resilient to both representational and computational errors. A prime
example is provided by a method known as algorithm-based malfunction tolerance,
which is more widely known in the literature by the acronym ABFT (algorithm-based
fault tolerance).

Consider the 3 × 3 matrix M shown in Fig. 23.abmt1. Adding modulo-8 row checksums
and column checksums results in the row-checksum matrix Mr and column-checksum
matrix Mc, respectively. Including both sets of checksums, with the lower-right matrix
element set to the checksum of the row checksums or of the column checksums (it can be
shown that the result is the same either way), leads to the full-checksum matrix Mf, a
representation of M that allows the correction of any single error in the matrix elements
and detection of up to 3 errors; some patterns of 4 errors, such as in the 4 elements
enclosed in the dashed box in Fig. 23.abmt1, may go undetected.
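The encoding and single-error correction just described can be sketched as follows (a hypothetical realization; matrix elements are taken to be residues modulo 8, and errors in the checksum entries themselves are only detected, not corrected, in this simplified version):

```python
def full_checksum(M, mod=8):
    """Append a modulo-8 checksum column, then a checksum row."""
    n = len(M)
    Mf = [row[:] + [sum(row) % mod] for row in M]
    Mf.append([sum(Mf[i][j] for i in range(n)) % mod for j in range(n + 1)])
    return Mf

def check_and_correct(Mf, mod=8):
    """Locate a single bad data element at the crossing of the failing
    row and column checks, and restore it from its row checksum."""
    n = len(Mf) - 1
    bad_rows = [i for i in range(n) if sum(Mf[i][:n]) % mod != Mf[i][n]]
    bad_cols = [j for j in range(n)
                if sum(Mf[i][j] for i in range(n)) % mod != Mf[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        Mf[i][j] = (Mf[i][n] - sum(Mf[i][k] for k in range(n) if k != j)) % mod
        return ("corrected", i, j)
    if not bad_rows and not bad_cols:
        return ("ok", None, None)
    return ("detected", bad_rows, bad_cols)
```

A corrupted element fails exactly one row check and one column check, so their intersection pinpoints it; the row checksum then supplies the correct residue.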

In addition to correction or detection of representational errors, as outlined in the
preceding paragraph, the matrix representation depicted in Fig. 23.abmt1 allows matrix
multiplication to be performed on encoded matrices, thus helping with the detection of
errors resulting from incorrect arithmetic operations (Fig. 23.abmt2).
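The multiplication property can be sketched as follows, with all arithmetic performed modulo 8 so that the checksum relations are preserved (a simplified rendition of the scheme of [Huan84]): a column-checksum version of A times a row-checksum version of B yields the full-checksum version of AB.

```python
def row_checksum(B, mod=8):
    return [row + [sum(row) % mod] for row in B]

def col_checksum(A, mod=8):
    cols = list(zip(*A))
    return [list(row) for row in A] + [[sum(c) % mod for c in cols]]

def full_checksum(C, mod=8):
    Cf = row_checksum(C, mod)
    return Cf + [[sum(col) % mod for col in zip(*Cf)]]

def matmul_mod(X, Y, mod=8):
    """Ordinary matrix product, with every entry reduced modulo 8."""
    return [[sum(a * b for a, b in zip(row, col)) % mod for col in zip(*Y)]
            for row in X]
```

Any single erroneous product entry thus violates a checksum of the result and is caught by the same check-and-correct procedure used for stored matrices.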

Fig. 23.abmt1 A 3 × 3 matrix and its modulo-8 row-checksum, column-checksum, and full-checksum matrices.


Fig. 23.abmt2 Multiplication of matrices with row and column checksums.


23.5 Self-Adapting Algorithms

This section to be written.


23.6 Other Algorithmic Methods

This section to be written.


Problems

23.1 Algorithm-based malfunction tolerance


Consider extending the algorithm-based malfunction tolerance scheme for matrix computations by adding a
second check row and a second check column. The added row and column for a matrix M will contain
modulo-m′ residues of the column and row checksums for the full-checksum matrix Mf, where m′ is
relatively prime with respect to the original modulus m used in forming Mf.
a. Independent of whether the scheme is practical for matrix computations, derive its error detection
and correction capabilities.
b. Discuss whether the matrix multiplication algorithm can be made to work for the new encoding.

23.2 Time redundancy at the application level


We are given a probabilistic algorithm for solving a problem that does not lend itself to deterministic
solution. The given algorithm has been tested on a large number of problem instances and is known to
produce a correct solution in 82% of the cases. The algorithm makes random choices during its execution,
so different runs of the algorithm can be considered statistically independent as far as correctness of the
result is concerned. Discuss whether and how the algorithm can be used to produce a solution that is correct
at the 99.99% confidence level.

23.3 Quantifying the reliability of programs


Carbin et al. [Carb16] have suggested that reliability of programs under soft errors in the underlying
hardware can be quantified by estimating the probability of correctness for variable values and verifying
that they exceed predefined thresholds. Discuss the method proposed in the paper with regard to the
following.
a. How realistic it is to determine the desired correctness probabilities or lower bounds for them.
b. The fraction of all program failures that can be attributed to soft errors of the kinds considered.
c. Options for corrective actions should the estimated correctness probabilities be unacceptable.
d. Whether the methods proposed might be adaptable to cover other causes of program failures.

23.x Title
Problem intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Amma88] Ammann, P.E. and J.C. Knight, “Data Diversity: An Approach to Software Fault Tolerance,”
IEEE Trans. Computers, Vol. 37, No. 4, pp. 418-425, April 1988.
[Aviz97] Avizienis, A., “Toward Systematic Design of Fault-Tolerant Systems,” IEEE Computer, Vol.
30, No. 4, pp. 51-58, April 1997.
[Carb16] Carbin, M., S. Misailovic, and M. C. Rinard, “Verifying Quantitative Reliability for Programs
that Execute on Unreliable Hardware,” Communications of the ACM, Vol. 59, No. 8, pp. 83-91,
August 2016.
[Fino09] Finocchi, I., F. Grandoni, and G. F. Italiano, “Optimal Resilient Sorting and Searching in the
Presence of Memory Faults,” Theoretical Computer Science, Vol. 410, No. 44, pp. 4457-4470,
October 2009.
[Golo06] Goloubeva, O., M. Rebaudengo, M. S. Reorda, and M. Violante, Software-Implemented
Hardware Fault Tolerance, Springer, 2006.
[Huan84] Huang, K. H. and J. A. Abraham, “Algorithm-Based Fault Tolerance for Matrix Operations,”
IEEE Trans. Computers, Vol. 33, No. 6, pp. 518-528, June 1984.
[John89] Johnson, B. W., Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley,
1989.
[Kant90] Kant, K. and A. Ravichandran, “Synthesizing Robust Data Structures—An Introduction,” IEEE
Trans. Computers, Vol. 39, No. 2, pp. 161-173, February 1990.
[Siew92] Siewiorek, D. P., and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd ed., 1992. Also: A. K. Peters, 1998.
[Tayl80] Taylor, D. J., J. P. Black, and D. E. Morgan, “Redundancy in Data Structures: Improving
Software Fault Tolerance,” IEEE Trans. Software Engineering, Vol. 6, No. 6, pp. 585-594,
November 1980.
[Vija97] Vijay, M. and R. Mittal, “Algorithm-Based Fault Tolerance: A Review,” Microprocessors and
Microsystems, Vol. 21, pp. 151-161, 1997.


24 Software Redundancy
“Those parts of the system that you can hit with a hammer (not
advised) are called hardware; those program instructions that
you can only curse at are called software.”
Anonymous

“. . . even perfect program verification can only establish that a
program meets its specification. The hardest part of the software
task is arriving at a complete and consistent specification, and
much of the essence of building a program is in fact the
debugging of the specification.”
Frederick P. Brooks, Jr., Essence and Accidents
of Software Engineering

Topics in This Chapter


24.1. Software Dependability
24.2. Software Malfunction Models
24.3. Software Verification and Validation
24.4. N-Version Programming
24.5. The Recovery-Block Method
24.6. Hybrid Software Redundancy

Software and hardware malfunctions are on the surface quite different. It is
sometimes argued that software does not age, that programs are not subject to
external interference or transient faults, and that all software-related problems are
due to design flaws. Thus, software replication does not help, the argument
continues, because all copies of the software will have the same flaws. If you tend
to agree with the arguments above, you will be quite surprised to learn in this
chapter about the use of more or less similar techniques for dealing with both
hardware and software malfunctions. In fact, combining both classes of methods is
our best bet in building ultradependable systems.


24.1 Software Dependability

Imagine the following product disclaimers:

For a steam iron: There is no guarantee, explicit or implied, that this device will remove
wrinkles from clothing or that it will not lead to the user’s electrocution. The manufacturer is
not liable for any bodily harm or property damage resulting from the operation of this device.

For an electric toaster: The name “toaster” for this product is just a symbolic identifier.
There is no guarantee, explicit or implied, that the device will prepare toast. Bread slices
inserted in the product may be burnt from time to time, triggering smoke detectors or causing
fires. By opening the package, the user acknowledges that s/he is willing to assume sole
responsibility for any damages resulting from the product’s operation.

You may hesitate before buying such a steam iron or toaster, yet this is how we purchase
commodity software. Software producers and marketers, far from guaranteeing dependable
operation, do not even promise correct functional behavior! The situation is only slightly
better for custom software, produced to exacting functional and reliability specifications.

Software unreliability is caused predominantly by design slips, not by operational
deviations. Latent design slips, which form the main mechanisms for software
malfunctions, are becoming common in hardware as well, given the phenomenal levels of
complexity in modern systems.

The curse of complexity is best illustrated through an example. The 7-Eleven
convenience store chain reportedly spent some $9M to make its point-of-sale software
Y2K-compliant for its 5200 stores, shortly before the year-2000 problem (caused by the
use of 2 digits for the year field in some databases, leading to the problem of years 1900
and 2000 becoming indistinguishable) was to hit the world’s computerized information
systems. The modified software was subjected to 10,000 tests, all of which were
successful. The company’s management and information system professionals were
relieved, as the system worked flawlessly throughout the year 2000. On January 1, 2001,
however, the system began rejecting credit cards, because it somehow “believed” the year
to be 1901. The problem was identified and corrected within a day, but it left the lasting
message that removing one bug can sometimes just transform or relocate the problem,
rather than fix it.


Fig. 24.sdlc Phases in software development life cycle where flaws might creep in.

To see where flaws might be introduced in software, it is instructive to examine the
various phases in the software development life cycle (Fig. 24.sdlc). The specifications
stage and later stages are relevant only if commodity software cannot satisfy the
requirements.
Beginning with unit test (see Fig. 24.sdlc), major structural and logical problems
remaining in a piece of software are removed early on. What remains after extensive
verification and validation is a collection of tiny flaws that surface under rare
conditions or particular combinations of circumstances, thus giving software failure a
statistical nature. Software usually contains one or more flaws per thousand lines of code,
with fewer than 1 flaw per thousand lines considered good (Linux has been estimated to
have 0.1). If there are f flaws in a software component, the hazard rate, that is, the rate of
failure occurrence per hour, is kf, with k being a constant of proportionality that is
determined experimentally (e.g., k = 0.0001). Software reliability is then modeled by:

R(t) = e–kft (24.1.swrel)

According to this model, the only way to improve software reliability is to reduce the
number of residual flaws through more rigorous verification and/or testing.
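The model's implication, that MTTF is inversely proportional to the residual flaw count, can be checked numerically (the flaw count f = 20 used below is illustrative only, not a measured figure):

```python
import math

def reliability(t, f, k=0.0001):
    """R(t) = exp(-k*f*t): constant hazard rate k*f for f residual flaws."""
    return math.exp(-k * f * t)

def mttf(f, k=0.0001):
    """For a constant hazard rate, MTTF = 1/(k*f) hours."""
    return 1.0 / (k * f)
```

With k = 0.0001 and f = 20 flaws, MTTF is 500 hours; halving the residual flaw count doubles the MTTF, which is the sense in which flaw removal is the only lever this model offers.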


Fig. 24.swflaw Residual software flaws within the input space.

Given extensive testing, the residual flaws in software are by nature difficult to detect.
They manifest themselves primarily for unusual combinations of inputs and program
states (the so-called “corner cases”), schematically represented in Fig. 24.swflaw. Light
shading is used to denote the parts of input/state space for which the software is free from
flaws. Unshaded regions represent input/state combinations that are known to be
impossible, thus making them irrelevant to proper functioning of the software. Dark
shading is used for trouble spots, which have been missed during testing. Occasionally,
during the use of a released piece of software, a user’s operating conditions will hit one
of these trouble spots, thus exposing the associated flaw. Once a flaw has been exposed,
it is dealt with through a new release of the software (particularly if the flaw is deemed
important in the sense of its potentials for affecting many other users) or through a
software patch.

For a while, there was some resistance to the idea of treating software malfunctions in a
probabilistic fashion, much like what we do with hardware-caused malfunctions. The
argument went that software flaws, which are due to design errors, either exist or do not
exist and that they do not emerge probabilistically. However, since software flaws are
often exposed by rare combinations of inputs and internal states, as discussed in the
preceding paragraph, it does make sense to assume that there is a certain probability
distribution, derivable from input distributions, for a software malfunction to occur.

The idea of using software redundancy to counteract uncertainties in design quality and
residual bugs is a natural one. However, it is not clear what form the redundancy should
take: it is certainly not helpful to replicate the same piece of software, with identical
internal flaws/bugs. We will tackle this topic in the last three sections of this chapter.


24.2 Software Malfunction Models

A software flaw or bug can lead to operational error for certain combinations of inputs
and system states, causing a software-induced failure. Informally, the term “software
failure” is used to denote any software-related dependability problem. Flaw removal can
be modeled in various ways, two of which are depicted in Fig. 24.flawr. When removing
existing flaws does not introduce any new flaws, we have the optimistic model of Fig.
24.flawr-a. Flaw removal is quick in early stages, but as more flaws are removed, it
becomes more difficult to pinpoint the remaining ones, leading to a reduction in flaw
removal rate. The more realistic model of Fig. 24.flawr-b assumes that the number of
new flaws introduced is proportional to the removal rate. The following example is based
on the simpler model of Fig. 24.flawr-a.

Example 24.flaw: Modeling the software flaw removal process Assume that no new flaws are
introduced as we remove existing flaws in a piece of software estimated to have F0 = 130 flaws
initially and that the flaw removal rate linearly decreases with time. Model the reliability of this
software as a function of time.

Solution: Let F be the number of residual flaws and τ be the testing time in months. From the
problem statement, we can write dF(τ)/dτ = –(a – bτ), leading to F(τ) = F0 – aτ(1 – bτ/(2a)). The
hazard function is then z(τ) = kF(τ), where k is a constant of proportionality, and R(t) = e–kF(τ)t.
Taking k to be 0.000132, we find R(t) = exp(–0.000132(130 – 30τ(1 – τ/16))t). If testing is done
for τ = 8 months, the reliability equation becomes e–0.00132t, which corresponds to an MTTF of 758
hours. Note that no testing would have resulted in an MTTF of 58 hours and that testing for 2, 4,
and 6 months would have led to MTTFs of 98, 189, and 433 hours, respectively.
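The example's arithmetic can be reproduced directly; the parameter values below (F0 = 130 flaws, a = 30 flaws per month, b/(2a) = 1/16, k = 0.000132) are those consistent with the MTTF figures quoted in the solution:

```python
def residual_flaws(tau, F0=130, a=30, b_over_2a=1 / 16):
    """F(tau) = F0 - a*tau*(1 - (b/(2a))*tau): flaws left after tau months."""
    return F0 - a * tau * (1 - b_over_2a * tau)

def mttf_hours(tau, k=0.000132):
    """Constant hazard rate k*F(tau) once testing stops at time tau."""
    return 1 / (k * residual_flaws(tau))
```

Evaluating at τ = 0, 2, 4, 6, and 8 months reproduces the MTTFs of 58, 98, 189, 433, and 758 hours cited above.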

(a) Removing flaws, without generating new ones (b) New flaws introduced are proportional to removal rate

Fig. 24.flawr Modeling flaw removal from software.


A linearly decreasing flaw removal rate is not the only modeling option. A constant flaw
removal rate has also been considered, but it does not lead to a very realistic model. An
exponentially decreasing flaw removal rate is more realistic than a linearly decreasing
one, since the flaw removal rate never really becomes 0. Model constants can be estimated via:

• Using a handbook: public ones, or ones compiled from in-house data
• Matching moments (mean, 2nd moment, . . .) to flaw removal data
• Least-squares estimation, particularly with multiple data sets
• Maximum-likelihood estimation (a statistical method)

In addition to the exponential reliability model based on estimates of the remaining
number of flaws in a piece of software, leading to a constant hazard rate or a hazard-rate
function for a specific amount of testing, a phenomenon similar to wearout in the case of
hardware has been observed for software. Of course, software does not wear out or age in
the same sense as hardware. Yet we do observe some deterioration in the performance of
a piece of software that has been running for a long time. This wearout phenomenon,
along with the large number of flaws before testing, makes the defect-related bathtub
curve of Fig. 5.btc applicable to software as well.

The primary reasons for software aging include accumulation of junk in the state part of
the system (which is reversible via restoration) and long-term cumulative effects of
updates via patching and the like. As the software’s structure deviates from its original
clean form, unexpected failures begin to occur. Eventually software becomes so mangled
that it must be discarded and redeveloped from scratch.


24.3 Software Verification and Validation

Basic concepts and terms in software verification and validation

Formal proofs for software verification

Software flaw tolerance


24.4 N-Version Programming

Introduction to N-version programming and justifications for it.

Some objections to N-version programming, and responses to them.

Reliability modeling for N-version programs.


Some applications of N-version programming.


24.5 The Recovery Block Method

The recovery block method may be viewed as the software counterpart to standby sparing
for hardware. Suppose we can verify the results obtained by a software module by
subjecting them to an acceptance test. For now, let us assume that the acceptance test is
perfect in the sense of not missing any erroneous results and not producing any false
positive. Implications of imperfect acceptance tests will be discussed later. With these
assumptions, one can organize a number of different software modules all performing the
same computation in the form of a recovery block.

Recovery block: (24.5.rb)


ensure acceptance test ; e.g., sorted list
by primary module ; e.g., quicksort
else by first alternate ; e.g., bubblesort
.
.
.
else by last alternate ; e.g., insertion sort
else fail

The program structure 24.5.rb encapsulates a primary software module, which is
executed to completion and its results subjected to the acceptance test. Passing of the
acceptance test, which occurs in a vast majority of cases, terminates the recovery block.
Failing the test triggers the execution of the first alternate module, with the process
repeated as above: for each alternate, passing of the acceptance test terminates the block
and failing it triggers the execution of the next alternate. A failure event is indicated once
all alternates have been tried without success.

The comments next to the pseudocode lines in the program structure 24.5.rb provide an
example in which the task to be performed is that of sorting a list. The primary module
uses the quicksort algorithm, which has very good average-case running time but is rather
complicated in terms of programming, and thus prone to residual software bugs. The first
alternate uses the bubblesort algorithm, which is not as fast, but much easier to write and
debug. The longer running time of bubblesort may not be problematic, given that we
expect the alternates to be executed rarely. As we go down the list of the alternates, even
simpler but perhaps lower-performing algorithms may be utilized. In this way, diversity
among the alternates can be provided, while also reducing the
development cost relative to N-version programming. Design diversity makes it more


likely for one of the alternate modules to succeed when the primary module fails to
produce an acceptable result.
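The sorting instance of recovery block 24.5.rb might be realized as follows (a schematic rendition; per the text, the acceptance test checks only the ordering of the output):

```python
def acceptance_test(inp, out):
    """Accept iff the output list is in nondescending order."""
    return all(a <= b for a, b in zip(out, out[1:]))

def quicksort(lst):                   # primary: fast but more complex
    if len(lst) <= 1:
        return list(lst)
    pivot = lst[len(lst) // 2]
    return (quicksort([x for x in lst if x < pivot])
            + [x for x in lst if x == pivot]
            + quicksort([x for x in lst if x > pivot]))

def bubblesort(lst):                  # alternate: slower but simpler
    out = list(lst)
    for i in range(len(out)):
        for j in range(len(out) - 1 - i):
            if out[j] > out[j + 1]:
                out[j], out[j + 1] = out[j + 1], out[j]
    return out

def recovery_block(modules, accept, inp):
    for module in modules:            # primary first, then the alternates
        try:
            result = module(inp)
            if accept(inp, result):
                return result         # acceptance test passed: done
        except Exception:
            pass                      # a crashed module also counts as failure
    raise RuntimeError("recovery block failed: all alternates exhausted")
```

A primary with a residual bug that returns its input unsorted is caught by the acceptance test, and the bubblesort alternate then delivers the result.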

The acceptance test for our sorting example can take the form of a linear scan of the
output list to verify that its elements are in nondescending or nonascending order,
depending on the direction of sorting. Such an acceptance test will detect an improperly
sorted list, but may not catch the problem when the output list does not consist of exactly
the same set of values as the input list. We can of course make the acceptance test as
comprehensive as desired, but a price is paid in both software development effort and
running time, given that the acceptance test is on the critical path.

In general, the acceptance test can range from a simple reasonableness check to a
sophisticated and thorough validation effort. Note that performing the computation a
second time and comparing the two sets of results can be viewed as a form of acceptance
testing, in which the acceptance test module is of the same order of complexity as the
main computational module. Computations that have simple inverses lend themselves to
efficient acceptance testing. For example, the results of root finding for a polynomial can
be readily verified by polynomial evaluation.
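The root-finding example can be sketched as an inverse-based acceptance test (a hypothetical realization): evaluating the polynomial at each reported root is far cheaper than the root finding itself.

```python
def poly_eval(coeffs, x):
    """Horner's rule; coeffs are listed from the highest degree down."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def accept_roots(coeffs, roots, tol=1e-6):
    """Accept the roots iff the polynomial nearly vanishes at each one."""
    return all(abs(poly_eval(coeffs, r)) <= tol for r in roots)
```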


24.6 Hybrid Software Redundancy

The various software redundancy methods, including N-version programming and
recovery blocks, can be unified in a way that allows the discovery of novel combinations
of replication and acceptance testing and facilitates the comparison of existing methods
and newly proposed methods. We begin by representing two hybrid redundancy schemes
in the form of block diagrams composed of software modules and acceptance tests.

We now present the elements of a general notation that facilitates the study and synthesis
of other software redundancy schemes.


Problems

24.1 The year-2038 design flaw


Study the so-called “year 2038” (abbreviated Y2038 or Y2K38) problem and write a two-page report about
it. In your report, discuss implications of the Y2038 problem to computer system reliability as well as
similarities and differences with the infamous Y2K problem.

24.2 Multiversion programming


The t/(n – 1) version programming [Xu97] combines the ideas of N-version programming, discussed in this
chapter, and the malfunction diagnosis techniques of Chapter 17. Write a two-page report describing the
method and its advantages relative to other software redundancy schemes.

24.3 Software aging and rejuvenation


A concept that has emerged in recent years to counteract the effects of software aging is that of software
rejuvenation. Research the notions of software aging and software rejuvenation and present your findings in
a two-page report (typed, single-space). In your report, provide precise definitions for the notions
introduced, and paint an accurate picture of the application domains and impact of each method discussed.
A good starting point for finding relevant references is [Silv09].

24.4 Acceptance testing [Kore02]


The correct output, z, of some program has as its probability density function the truncated exponential
function given below, where L and λ are known positive constants: f(z) = if 0 ≤ z ≤ L then λe–λz/(1 – e–λL) else 0.
On any particular input, the program fails with probability q, in which case it produces an arbitrary value
with uniform distribution in [0, L]. The penalty of producing an incorrect value is E, while that of
producing no value at all is S, where E and S are known constants. An acceptance test is to be set up in the
form of a range check which rejects any output that does not fall in [0, R]. Find the optimal value of R for
which the expected total penalty is minimized.

24.5 N-channel computation in software and hardware


Discuss the similarities and differences between N-version programming and the NMR (replication and
voting) method as used for hardware. Include in your discussion both implementation aspects and
reliability modeling considerations.

24.6 Software debugging [Kore02]


Let the probability of uncovering a bug after applying t seconds of testing to a software module, given that
it has at least one bug, be 1 – e–λt. You believe at the outset, perhaps based on past experience with similar
software, that the probability of having at least one bug in your software is q = 1 – p. Assume that after t
seconds of testing, you fail to find a bug.
a. Prove: prob{the software is bug-free | t seconds of testing revealed no bugs} = 1/[1 + (q/p)e–λt]
b. Assuming q = 0.9, plot the variation of the confidence factor of part a for λ = 0.001, 0.01, and 0.1,
as t varies between 0 and 10 000.
c. Assuming λ = 0.01, plot the variation of the confidence factor of part a for q = 0.9, 0.7, and 0.5, as
t varies between 0 and 10 000.
d. Discuss the results of parts b and c and draw appropriate conclusions from them.


24.7 Program correctness proof


Consider the program fragment: s = 0; k = 0; while k ≤ n do s = s + 2k; k = k + 1; endwhile
a. Prove that the program fragment computes f(n) = n(n + 1) when n > 0 is an integer.
b. What does the fragment compute when n is a positive real number?
c. What will the fragment compute if we reorder the two statements inside the while loop?

24.8 Voting for 3-version software


The 3 versions of a program produce the following sets of output values at consecutive voting points, with
the voter output also shown. Determine the voting algorithm used in decision schemes a and b. Discuss and
fully justify your answers.
Data set Values produced Decision a Decision b
1 48.3 48.2 48.1 48.2 48.2
2 48.2 48.7 48.0 48.2 48.3
3 48.1 49.4 47.7 48.1 48.4
4 48.3 51.3 47.9 48.3 48.1
5 48.0 52.6 48.2 48.2 48.1
6 48.3 53.7 48.1 48.3 48.2
7 48.1 54.5 47.9 48.1 48.0

24.9 Software redundancy in a Mars mission


In a one-page, single-spaced report, describe the role of software redundancy in how NASA’s Curiosity
Rover reached Mars and functioned as intended by its designers [Holz14].

24.10 The leap-second problem


Read the article [Sava15] about the impact of leap seconds on the reliable operation of computer systems.
In one single-space typed page, describe the problem, along with possible solutions or workarounds.

24.11 Verification of program reliability


Carbin, Misailovic, and Rinard [Carb16] propose an interesting method for quantifying program reliability.
Study the cited paper and outline in one typewritten page the essence of their method and its practical
implications. Pay special attention to the malfunction model assumed.

24.12 Dependability considerations for controllers


Read the paper [Alka19] and address the following questions in a one-page report.
a. What is different about controllers compared with other components, such as memory modules,
CPUs, GPUs, and the like?
b. What are the two most important techniques described in the paper for improving controller
reliability?
c. What are the redundancy and other cost factors associated with the methods of part b?
d. Why do the authors focus on FPGA-based implementation of controllers?


References and Further Readings


[Alka19] Alkady, G. I., R. M. Daoud, H. H. Amer, M. Y. ElSalamouny, and I. Adly, “Failures in Fault-
Tolerant FPGA-Based Controllers—A Case Study,” Microprocessors and Microsystems, Vol.
64, pp. 178-184, 2019.
[Arms10] Armstrong, J., “Erlang,” Communications of the ACM, Vol. 53, No. 9, pp. 68-75, September
2010. {Erlang is an open-source language that allows for easy programming of multicore
systems and fault-tolerant distributed applications.}
[Bica97] Bicarregui, J., J. Dick, B. Matthews, and E. Woods, “Making the Most of Formal Specification
through Animation, Testing, and Proof,” Science of Computer Programming, Vol. 29, Nos. 1-
2, pp. 53-78, July 1997.
[Carb16] Carbin, M., S. Misailovic, and M. C. Rinard, “Verifying Quantitative Reliability for Programs
that Execute on Unreliable Hardware,” Communications of the ACM, Vol. 59, No. 8, pp. 83-91,
August 2016.
[Hlav01] Hlavaty, T., L. Preucil, and P. Stepan, “Case Study: Formal Specification and Verification of
Railway Interlocking System,” Proc. 27th Euromicro Conf., pp. 258-263, 2001.
[Hlav04] Hlavacek, I., J. Chleboun, and I. Babuska, Uncertain Input Data Problems and the Worst
Scenario Method, Elsevier, 2004.
[Holz14] Holzmann, G. J., “Mars Code,” Communications of the ACM, Vol. 57, No. 2, pp. 64-73,
February 2014.
[Kirb99] Kirby, J., M. Archer, and C. Heitmeyer, “Applying Formal Methods to an Information Security
Device: An Experience Report,” Proc. 4th IEEE Int’l Symp. High-Assurance Systems
Engineering, 1999.
[Kore02] Koren, I. and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[Lyu05] Lyu, M. R. (ed.), Software Fault Tolerance, Wiley, 2005. Available online at:
http://www.cse.cuhk.edu.hk/~lyu/book/sft/index.html
[NASA00] US National Aeronautics and Space Administration, “Software Fault Tolerance: A Tutorial,”
NASA report, 2000.
[Parh01] Parhami, B., “An Approach to Component-Based Synthesis of Fault-Tolerant Software,”
Informatica, Vol. 25, pp. 533-543, November 2001.
[Requ00] Requet, A. and G. Bossu, “Embedding Formally Proved Code in a Smart Card: Converting B
to C,” Proc. Int’l Conf. Formal Engineering Methods, pp. 15-22, 2000.
[Roth89] Rothermel, K. and C. Mohan, “ARIES/NT: A Recovery Method Based on Write-Ahead
Logging for Nested Transactions,” IBM Thomas J. Watson Research Division, 1989.
[Sala14] Salako, K. and L. Strigini, “When Does ‘Diversity’ in Development Reduce Common
Failures? Insights from Probabilistic Modeling,” IEEE Trans. Dependable and Secure Computing,
Vol. 11, No. 2, pp. 193-206, March/April 2014.
[Sava15] Savage, N., “Split Second,” Communications of the ACM, Vol. 58, No. 9, pp. 12-14,
September 2015.
[Silv09] Silva, L. M., J. Alonso, and J. Torres, “Using Virtualization to Improve Software
Rejuvenation,” IEEE Trans. Computers, Vol. 58, No. 11, November 2009, pp. 1525-1538.
[Xu97] Xu, J. and B. Randell, “Software Fault Tolerance: t/(n – 1)-Variant Programming,” IEEE
Trans. Reliability, Vol. 46, No. 1, pp. 60-68, March 1997.


VII Failures: Computational Breaches


[Multilevel model states: Ideal, Defective, Faulty, Erroneous, Malfunctioning, Degraded, Failed]

“Engineers have learned so well from failures that a major failure
today is big news. We no longer expect bridges to collapse or
buildings to fall in or spacecraft to explode. Such disasters are
perceived as anomalous, . . . We grieve the lost lives, we search
among the designers for the guilty. Yet these disasters serve the
same function as the failures of an earlier era. Failures remain
the engineer's best teacher, his best laboratory.”
J. Schallan, reviewing ‘To Engineer is Human’

“A man may fall many times, but he won’t be a failure until he
says that someone pushed him.”
Elmer G. Letterman

Chapters in This Part


25. Failure Confinement
26. Failure Recovery
27. Agreement and Adjudication
28. Fail-Safe System Design

A failure occurs when a system’s degradation allowance/management capacity
has been exceeded in space (due to catastrophic malfunctions) or in time (due to
resource exhaustion after a sequence of malfunctions). This is essentially the end
of the road and what we have been trying to avoid all along. There is a silver
lining, however, in that the larger system, of which the computer is a part, might
survive such a failure if adequate provisions are made. Failure confinement
techniques, to prevent the spread of damage to data and other system resources,
and failure recovery strategies, to return the system to normal operation swiftly,
are key elements of the required provisions. We conclude this part of the book by
discussing the use of agreement and adjudication schemes to protect data integrity
and the design of fail-safe systems to prevent catastrophes.


25 Failure Confinement
“I always turn to the sports page first, which records people’s
accomplishments. The front page has nothing but man’s
failures.”
Earl Warren

“Programming today is a race between software engineers


striving to build bigger and better idiot-proof programs, and the
Universe trying to produce bigger and better idiots. So far, the
Universe is winning.”
Rich Cook

Topics in This Chapter


25.1. From Failure to Disaster
25.2. Failure Awareness
25.3. Failure and Risk Assessment
25.4. Limiting the Damage
25.5. Failure Avoidance Strategies
25.6. Ethical Considerations

A common goal of the methods discussed in the preceding chapters is to prevent
or postpone computer system failures. In practice, what is actually accomplished
is a sharp reduction in the probability, rather than full eradication, of failures. It is
thus important to be prepared for failures and to have a good understanding of
their underlying causes, consequences, damage confinement and assessment
methods, and possible remedial actions. Like other engineering professionals,
computer engineers have ethical obligations in reporting problems and following
proper design practices to avoid failures. Thus, we conclude the chapter with a
review of ethics guidelines for engineers.


25.1 From Failure to Disaster

Computers are components in larger economic, technical, or societal systems. Viewed in
this way, a computer failure need not result in insurmountable threats or losses. In many
cases, prompt failure detection and activation of back-up systems can avert potential
disasters. This strategy is in widespread use for safety-critical systems: jetliners have
manual controls and human override options; spacecraft are capable of being controlled
from the ground as back-up for their on-board navigation systems; nuclear reactor control
systems can be manually bypassed, when needed. Such manual replacement and bypass
provisions constitute a buffer between the failed state in our multilevel model and
potential disaster (Fig. 25.1).

The provision of manual back-up and bypass capability is a good idea, even for systems
that are not safety-critical. On Friday, November 30, 1996 (Thanksgiving weekend), the
US railroad company Amtrak lost ticketing capability due to a communication system
disruption. Unfortunately for the company, station personnel had no up-to-date fare
information as back-up, and were thus unable to issue tickets manually. This lack of
foresight led to major customer inconvenience and loss of revenue for Amtrak. One
must note, however, that in certain cases, such as e-commerce Web sites, manual back-up
systems may be impractical.

Fig. 25.1 Computer failure may not lead to system failure or disaster.


25.2 Failure Awareness

The first step in proper handling of computer system failures is being aware of their
inevitability and, preferably, likelihood. Poring over failure data that are available from
experience reports and repositories is helpful to both computer designers and users.
System designers can get a sense of where the dependability efforts should be directed
and how to avoid common mistakes, while users can become empowered to face failure
events and to put in place appropriate recovery plans.

Unfortunately, much of the available failure statistics are incomplete and, at times,
misleading. Collecting experimental failure data isn’t easy. Widespread experiments are
impossible for one-of-a-kind or limited-issue systems and performing them under
reasonably uniform conditions is quite a challenge for mass-produced systems. There is
also the embarrassment factor: system operators may be reluctant to report failures that
put their technical and administrative skills in doubt, and vendors, especially those who
boast about the reliability of their systems, may be financially motivated to hide or
obscure failure events. Once a failure event is logged, assigning a cause to it is nontrivial.
These caveats notwithstanding, whatever data is available should be used rather than
ignored, perhaps countering any potential bias by drawing information from multiple
independent sources.


Software failure data is available from the following two sources, among others. The
Promise Software Engineering Repository [PSER12] contains a collection of publicly
available datasets and tools to help researchers who build predictive software models and
the software engineering community at large. The failure data at the Software Forensics
Center [SFC12] is the largest of its kind in the world and includes specific details about
hundreds of projects, with links to thousands of cases.


Let us consider the application of failure data to assessing and validating reliability
models through an example.

Example 25.valid: Validating reliability models Consider the reliability state model shown in
Fig. 25.disk for mirrored disk pairs, where state i corresponds to i disks being healthy.
a. Solve the model and derive the disk pair’s MTTF, given a disk MTTF of 50 000 hr per
manufacturer’s claim and an estimated MTTR of 5 hr.
b. In 48 000 years of observation (2 years × 6000 systems × 4 disk pairs / system), 35 double disk
failures were logged. Do the observation results confirm the model of part a? Discuss.

Solution: From the MTTF and MTTR values, we find λ = 2 × 10⁻⁵/hr and μ = 0.2/hr.
a. The model of Fig. 25.disk is readily solved to provide an effective disk pair failure rate of about
2λ²/μ or an approximate disk pair MTTF of μ/(2λ²) = 15,811 yr.
b. The observation data suggests a disk pair MTTF of 48 000/35 ≈ 1371 yr. The observed MTTF is
more than 11 times worse than the modeled value. The discrepancy may be attributed to one or
more of the following factors: exaggerated disk MTTF claim on the part of the manufacturer,
underestimation of repair time, imperfect coverage in recovering from the failure of a single disk
in the pair. The latter factor can be accounted for by including a transition from state 2 to state 0,
with an appropriate rate, in Fig. 25.disk.

Fig. 25.disk State-space reliability model for a mirrored disk pair.
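The arithmetic in part b of the example can be double-checked with a short Python sketch (the modeled MTTF of 15,811 yr is carried over from part a):

```python
# Example 25.valid, part b: observed disk-pair MTTF vs. the modeled value.
observation_years = 2 * 6000 * 4       # 2 yr x 6000 systems x 4 disk pairs
double_disk_failures = 35

observed_mttf_yr = observation_years / double_disk_failures   # about 1371 yr
modeled_mttf_yr = 15_811                                      # from part a

# The observed MTTF is more than 11 times worse than the modeled one.
discrepancy_factor = modeled_mttf_yr / observed_mttf_yr       # about 11.5
```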


25.3 Failure and Risk Assessment

Given its importance in our discussion here, let’s reproduce equation (2.5.Risk1) and its
alternate form, equation (2.5.Risk2), here, giving them new numbers for ready reference
in this chapter.

risk = frequency × magnitude (25.3.Risk1)
[consequence / unit time] = [events / unit time] × [consequence / event]

risk = probability × severity (25.3.Risk2)

Like system reliability, the probability or frequency of failures is unknowable. As we
derive lower bounds on reliability using pessimistic models, we also obtain upper bounds
on failure probability. This is the best we can do, given that overestimating reliability
(underestimating the probability of failure) is quite dangerous. Clearly, it is to our
advantage to ensure that our reliability lower bound is as close as possible to the actual,
unknowable quantity. The more accurate our frequency/probability term in the equations
above, the closer our assessed risk will be to the true risk involved.

The magnitude/severity term in the preceding equations is estimated via economic
analysis. Before discussing the methods used, we should note that consequences depend
on how promptly and thoroughly we handle the failures. Systems must be designed such
that failure events are communicated to operators/users via clean, unambiguous
messages. Listing the options and the urgency of various actions is a good idea. Two-way
communication, that is, adding user feedback to the process, is helpful.

Now, consider the following thought experiment: an attempt to establish how much your
life is worth to you.

You have a 1/10 000 chance of dying today. If it were possible to buy out
the risk, how much would you be willing to pay? Assume that you are not
limited by current assets, that is, you can use future earnings too.

An answer of $1000 (risk) combined with the frequency 10⁻⁴ leads to a magnitude of
$10M, which is the implicit worth you assign to your life. Assigning monetary values to
lives, repugnant as it seems, happens all the time in risk assessment. The higher salaries


demanded by workers in certain dangerous lines of work and our willingness to pay the
cost of smoke detectors are examples of trade-offs involving the exchange of money for
endangering or protecting human lives.
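As a concrete illustration of equation (25.3.Risk1) and the thought experiment above, the following sketch (variable names are illustrative) recovers the implied magnitude from the accepted risk and the event frequency:

```python
# Sketch of risk = frequency x magnitude (Eqn. 25.3.Risk1). Applied to the
# thought experiment: the $1000 one would pay to buy out a 1-in-10,000
# chance of dying implies a $10M magnitude (the implicit worth of a life).
def risk(frequency, magnitude):
    # [consequence / unit time] = [events / unit time] x [consequence / event]
    return frequency * magnitude

accepted_risk = 1_000            # dollars: the risk one is willing to pay off
event_frequency = 1e-4           # 1-in-10,000 chance of the event today
implied_magnitude = accepted_risk / event_frequency   # $10,000,000
```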

In an eye-opening book [Tale07], author Nassim Nicholas Taleb discusses rare events
and how humans are ill-equipped for judging and comparing their probabilities. In a
subsequent book [Tale12], which also has important implications for the design of resilient
systems, Taleb discusses systems that not only survive disorder and harsh conditions, but
thrive and improve in such environments.


25.4 Limiting the Damage

This section to be written based on the following slides.


25.5 Failure Avoidance Strategies

There are age-old design principles that engineers should be aware of in order to produce
reliable systems. These principles apply to any design, but they are particularly important
in developing highly complex hardware and software systems.

– Limit novelty [stick with proven methods and technologies]
– Adopt sweeping simplifications
– Get something simple working soon
– Iteratively add capability
– Give incentives for reporting errors
– Descope [reduce goals/specs] early
– Give control to (and keep it in) a small design team


25.6 Ethical Considerations

Many system failures would not occur if engineers were aware of their ethical
responsibilities toward the customers and the society at large. Modern engineering
curricula include some formal training in how to deal with ethical quandaries. This
training is often provided via a required or recommended course that reviews general
principles of ethics, discusses them in the context of engineering practice, and provides a
number of case studies in which students consider the impact of various career and design
decisions. In the author’s opinion, just as dependability should not be taught in a course
separate from those that deal with engineering design, so too discussion of ethics must be
integrated in all courses within an engineering curriculum.

All professional engineering societies have codes of ethics that outline the principles of
ethical behavior in the respective professions. For example, the IEEE Code of Ethics
[IEEE19] compels us engineers to follow rules of ethics in general (be fair, reject bribery,
avoid conflicts of interest) and in technical activities:

accept responsibility in making decisions consistent with the safety,
health, and welfare of the public, and to disclose promptly factors that
might endanger the public or the environment;

maintain and improve our technical competence and to undertake
technological tasks for others only if qualified by training or experience,
or after full disclosure of pertinent limitations;

seek, accept, and offer honest criticism of technical work, to acknowledge
and correct errors, and to credit properly the contributions of others;

IEEE has a separate “Code of Conduct” [IEEE14] that spells out in greater detail
guidelines for responsible practice of engineering.

The Association for Computing Machinery is similarly explicit in its ethical
recommendations to its members [ACM18]:

minimize malfunctions by following generally accepted standards for
system design and testing;

give comprehensive and thorough evaluations of computer systems and
their impacts, including analysis of possible risks;


In its comprehensive code of engineering ethics, the National Society of Professional
Engineers provides rules of practice (when to accept an assignment, whistle-blowing) and
professional obligations (acknowledging errors, being open to suggestions) based on the
following six fundamental canons [NSPE18]. According to NSPE, professional
engineers:

1. Hold paramount the safety, health, and welfare of the public
2. Perform services only in areas of their competence
3. Issue public statements only in an objective and truthful manner
4. Act for each employer or client as faithful agents or trustees
5. Avoid deceptive acts
6. Conduct themselves honorably, responsibly, ethically, and lawfully


Problems

25.1 MTTF for disk pairs


The solution to Example 25.valid ends with the suggestion that the model can be better matched to reality
by including a transition from state 2 to state 0, with an appropriate rate that reflects coverage. Derive the
latter rate to bring the MTTF close to the observed value of 1371 yr.

25.2 Trustworthy artificial intelligence


As AI systems get more powerful and less transparent due to ever-increasing complexity, much discussion
is going on about whether or not we can put our trust into machine-learning and other AI systems. Using
[EurC19] and [Flor19] as your sources, write a 2-page report that addresses both of the following aspects of
the problem.
a. Ethical concerns in delegating critical tasks to autonomous systems.
b. The extent to which we can trust AI with critical decision-making.



References and Further Readings


[ACM18] Association for Computing Machinery, “ACM Code of Ethics and Professional Conduct,” on-
line document, accessed on November 10, 2019. https://www.acm.org/code-of-ethics
[EurC19] European Commission, “Ethics Guidelines for Trustworthy AI,” High-Level Expert Group on
Artificial Intelligence, April 8, 2019. https://ec.europa.eu/futurium/en/ai-alliance-consultation
[Flor19] Floridi, L. and J. Cowls, “A Unified Framework of Five Principles for AI in Society,” Harvard
Data Science Review, June 14, 2019. https://hdsr.mitpress.mit.edu/pub/l0jsh9d1
[IEEE14] Institute of Electrical and Electronics Engineers, “IEEE Code of Conduct,” June 2014, on-line
document, accessed on November 10, 2019. https://www.ieee.org/content/dam/ieee-
org/ieee/web/org/about/ieee_code_of_conduct.pdf
[IEEE19] Institute of Electrical and Electronics Engineers, “IEEE Code of Ethics,” on-line document,
accessed on November 10, 2019. https://www.ieee.org/about/corporate/governance/p7-8.html
[LANL12] Los Alamos National Laboratory,
[NSPE18] National Society of Professional Engineers, “Code of Ethics for Engineers,” 2018, on-line
document, accessed on November 10, 2019.
http://www.nspe.org/Ethics/CodeofEthics/index.html
[PSER12] Promise Software Engineering Repository,
[Roch12] Rochester University, “Memory Hardware Error Research Project,”
[Schr07] Schroeder, B. and G. A. Gibson, “Understanding Disk Failure Rates: What Does an MTTF of
1,000,000 Hours Mean to You?” ACM Trans. Storage, Vol. 3, No. 3, Article 8, 31 pp., October
2007.
[Schr07a] Schroeder, B. and G. A. Gibson, “Understanding Failures in Petascale Computers,” J. Physics:
Conference Series, Vol. 78, No. 1, Article 012022, 2007.
[Schr09] Schroeder, B. and G. A. Gibson, “A Large-Scale Study of Failures in High-Performance
Computing Systems,” IEEE Trans. Dependable and Secure Computing, to appear (available
online from the IEEE TDSC Web site).
[SFC12] Software Forensics Center,
[Siew92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital
Press, 2nd ed., 1992. Also: A. K. Peters, 1998.
[Tale07] Taleb, N. N., Black Swan: The Impact of the Highly Improbable, Random House, 2007.
[Tale12] Taleb, N. N., Antifragile: Things that Gain from Disorder, Random House, 2012.
[Usen12] Usenix, “Computer Failure Data Repository,”
[Xin04] Xin, Q., E. L. Miller, and T. J. E. Schwarz, “Evaluation of Distributed Recovery in Large-Scale
Storage Systems,” Proc. 13th IEEE Int’l Symp. High Performance Distributed Computing, pp.
172-181, 2004.


26 Failure Recovery
“The first rule of holes: when you’re in one, stop digging.”
Molly Ivins

“. . . failures appear to be inevitable in the wake of prolonged
success, which encourages lower margins of safety. Failures in
turn lead to greater safety margins and, hence, new periods of
success. To understand what engineering is and what engineers
do is to understand how failures can happen and how they can
contribute more than successes to advance technology.”
Henry Petroski, To Engineer is Human—The
Role of Failure in Successful Design

Topics in This Chapter


26.1. Planning for Recovery
26.2. Types of Recovery
26.3. Interfaces and the Human Link
26.4. Backup Systems and Processes
26.5. Blame Assessment and Liability
26.6. Learning from Failures

Planning to deal with computer system failures has a great deal in common with
preparations undertaken in anticipation of natural disasters. Since computers are
often components in larger control, corporate, or societal systems, interaction with
the users and environment must also be factored in. Recovery from computer
failures is made possible by systems and procedures for backing up data and
programs and for alternate facilities to run applications in case of complete system
shut-down due to a catastrophic failure, a natural disaster, or a malicious attack.
Once a failure has occurred, investigations may be conducted to establish the
cause, assign responsibility, and catalog the event for educational purposes and as
part of failure logs to help future designers.


26.1 Planning for Recovery

Just as an organization might hold fire drills to familiarize its personnel with the
procedures to be followed in the event of a real fire, so too it must plan for dealing with,
and recovering from, computer system failures. Whether an anticipated failure has mild
consequences or leads to a disaster, the corresponding recovery procedures must be
properly documented and be part of the personnel training programs.

Recovery from a failure can be expressed in the same manner as the recovery block
scheme in the program structure 24.5.rb, with the manual or emergency procedure being
considered the last alternate and human judgment forming part of the acceptance test.
When the failure is judged to be a result of transient environmental conditions, the same
alternate may be executed multiple times, before moving on to the next one, including the
final initiation of manual recovery.
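The retry-then-escalate structure just described can be outlined in a few lines; the sketch below is a hedged illustration, not the notation of program structure 24.5.rb, and the alternates, acceptance test, and retry count are placeholders:

```python
# Recovery-block-style recovery: try each alternate in turn, re-executing an
# alternate when the failure may stem from transient conditions; the last
# alternate stands in for the manual/emergency procedure, and human judgment
# can be folded into the acceptance test.
def recover(alternates, acceptance_test, tries_per_alternate=2):
    for alternate in alternates:                # last entry: manual procedure
        for _ in range(tries_per_alternate):    # retry to ride out transients
            try:
                result = alternate()
            except Exception:
                continue                        # a crash counts as a failed try
            if acceptance_test(result):
                return result
    raise RuntimeError("all alternates exhausted; recovery failed")
```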


26.2 Types of Recovery

Many terms have been used to describe the process of recovery from computer system
failures. First, systems that are capable of working with diminished resources are referred
to as fail-slow or fail-soft. These terms imply that the system is resilient and won’t be
brought down when some resources become unavailable. At the opposite extreme, we
have fail-fast and fail-hard systems that are purposely designed to fail quickly and in a
physically obvious manner, following the philosophy that subtle failures that may go
unnoticed may be more dangerous than overt ones, such as total system shut-down or
crash. Extended failure detection latency is undesirable not only owing to potentially
errant system behavior, but also because the occurrence of subsequent unrelated failures
may overwhelm the system’s defenses. Along the same lines, a fail-stop system comes to
a complete stop rather than behave erratically upon failure. The latter category of systems
may be viewed as a special case of a fail-safe systems, where the notion of safe behavior
generalizes that of halting all actions.

Alongside the terms reviewed in the preceding paragraph, we use fail-over to indicate a
situation when failure of a unit (such as Web server) is overcome by another unit taking
over its workload. This is easily done when the failed unit does not carry much state. If a
video is being streamed by a server that fails, the server taking over needs to know only
the file name, the recipient’s address, and the current point of streaming within the
video. Fail-over software is available for Web servers as part of firewalls for most
popular operating systems. The term fail-back is used to refer to the failed system
returning to service, either as the primary unit or as a warm standby. Finally, the term
fail-forward, derived from the notion of forward error correction (the ability of error-
correcting codes to let the computation go forward after the occurrence of an error, as
opposed to error-detecting schemes leading to backward error correction via rollback to
a previous checkpoint) is sometimes, though rarely, used.
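A minimal fail-over sketch for the streaming example above, assuming hypothetical server objects with a send method:

```python
# Fail-over for a nearly stateless service: the small amount of state (file
# name, recipient, streaming offset) is handed to a backup server when the
# primary fails, so streaming resumes where it left off.
def stream_with_failover(servers, file_name, recipient, offset):
    for server in servers:                 # primary first, then backups
        try:
            return server.send(file_name, recipient, offset)
        except ConnectionError:
            continue                       # fail over to the next server
    raise RuntimeError("no server available; stream aborted")
```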


26.3 Interfaces and the Human Link

Key elements in ensuring proper human reaction to failure events are the believability
and helpfulness of failure warnings.

Human factors in automated systems

A system’s human interface not only affects its usability but also contributes heavily to
its reliability, availability, and safety. A properly designed and tuned interface can help
prevent many human errors that form one of the main sources of system unreliability and
risk. The popular saying “The interface is the system” is in fact quite true. Here is a
common way of evaluating a user interface. A group of 3-5 usability experts and/or
nonexperts judges the interface based on a set of specific criteria. Here are some criteria,
which would be used to judge most interfaces [Gree09a]:

• Simplicity: The interface is clean and easy to use.
• Designed with errors in mind: The interface assumes that the user will make
errors. Errors are avoidable (via the requirement for confirmation of critical
actions) and easy to reverse.


• Visibility of system state: The user knows about what is happening inside the
computer from looking at the interface.
• Speaks the user’s language: Uses concepts that are familiar to users. If there are
different user classes (say, novices and experts in the field), the interface is
understandable to all.
• Minimizes human memory load: Human memory is fallible and people are likely
to make errors if they must remember information. Where possible, critical
information appears on the screen. Recognition and selection from a list are easier
than memory recall.
• Provides feedback to the user: When the user acts, the interface confirms that
something happened. The feedback may range from a simple beep to indicate that
a button press was recorded to a detailed message that describes the consequences
of the action.
• Provides good error messages: When errors occur, the user is given helpful
information about the problem. Poor error messages can be disastrous.
• Consistency: Similar actions produce similar results. Visually similar objects
(colors, shapes) are related in an important way. Conversely, objects that are
fundamentally different have distinct visual appearance.

When an interface is implicated in an accident, the most common problems are
inconsistency, hiding of the system state, and failure to design for error.


26.4 Backup Systems and Processes

Backing up of data is a routine practice for corporate and enterprise systems. The data
banks held by a company, be they data related to designs, inventory, personnel, suppliers,
or customer base, are quite valuable to the company’s business and must be rigorously
protected against loss. Unfortunately, many personal computer users do not take data
backup seriously and end up losing valuable data as a result. Here are some simple steps
that nonexpert computer users can take to protect their data files:

Use removable storage: external hard drive, flash drive, CD/DVD
E-mail a file copy to yourself (not usable for very large files)
Run a backup utility periodically (full or incremental backup)
Subscribe to an on-line backup service (plain or encrypted)
Do not overwrite updated files, but create new versions of them

[Elaborate on the importance of on-site and off-site backups.]
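The last item on the list, creating new versions rather than overwriting, can be sketched as follows; the naming scheme (notes.v1.txt, notes.v2.txt, ...) is illustrative only:

```python
# Versioned backup: never overwrite an earlier copy; each call adds a new
# numbered version of the file to the backup folder.
import shutil
from pathlib import Path

def backup_version(src, backup_dir):
    src_path, dest_dir = Path(src), Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Count existing versions to pick the next version number.
    count = len(list(dest_dir.glob(f"{src_path.stem}.v*{src_path.suffix}")))
    dest = dest_dir / f"{src_path.stem}.v{count + 1}{src_path.suffix}"
    shutil.copy2(src_path, dest)           # copies data and timestamps
    return dest
```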


26.5 Blame Assessment and Liability

In cases where the cause of a failure isn’t immediately evident, or when multiple parties
involved do not agree on the cause, computer forensics may come into play. Computer
forensics is a relatively new specialty of increasing utility in:

Scanning computers belonging to defendants or litigants in legal cases
Gathering evidence against those suspected of wrongdoing
Learning about a system to debug, optimize, or reverse-engineer it
Assessing compromised computer systems to determine intrusion method
Analyzing failed computer systems to determine cause of failure
Recovering data after computer failures

There are several journals on computer forensics, digital investigations, and e-discovery.


26.6 Learning from Failures

As we discussed in Section 1.4 in connection with our multilevel model of dependable
computing, a system may fail when it degrades beyond its degradation management
capacity (downward transition) or it may begin its life in the failed state (sideways
transition). A system of the latter kind never gets off the ground and its design may be
scrapped before it becomes operational. The following are a few examples of such failed
systems, along with the cost of the project to build them:

Automated reservations, ticketing, flight scheduling, fuel delivery, kitchens, and general
administration, United Airlines + Univac,
started 1966, target 1968, scrapped 1970, $50M

Hotel reservations linked with airline and car rental, Hilton + Marriott + Budget +
American Airlines, started 1988, scrapped 1992, $125M

IBM workplace OS for PPC (Mach 3.0 + binary compatibility with AIX + DOS, Mac
OS, OS/400 + new clock mgmt + new RPC + new I/O + new CPU), started 1991,
scrapped 1996, $2B

US FAA’s Advanced Automation System (to replace a 1972 system), started 1982,
scrapped 1994, $6B

London Ambulance Dispatch Service, started 1991, scrapped 1992,
20 lives lost in 2 days, $2.5M

Many more such failures exist, as exemplified by the following partial list:

• Portland, Oregon, Water Bureau, $30M, 2002
• Washington D.C., Payroll system, $34M, 2002
• Southwick air traffic control system $1.6B, 2002
• Sobey’s grocery inventory, Canada, $50M, 2002
• King County financial mgmt system, $38M, 2000
• Australian submarine control system, 100M, 1999
• California lottery system, $52M
• Hamburg police computer system, 70M, 1998


• Kuala Lumpur total airport management system, $200M, 1998
• UK Dept. of Employment tracking, $72M, 1994
• Bank of America Masternet accounting system, $83M, 1988
• FBI virtual case, 2004
• FBI Sentinel case management software, 2006

Learning from failures is one of the tenets of engineering practice. As elegantly pointed
out by Henry Petroski [Petr06]:

“When a complex system succeeds, that success masks its proximity to
failure. . . . Thus, the failure of the Titanic contributed much more to the
design of safe ocean liners than would have her success. That is the
paradox of engineering and design.”


Problems

26.1 Resilience engineering


According to a published discussion [Kris12], companies with critical computing infrastructures have
started to adopt resilience engineering, a practice that has been common in other high-risk industries such
as aviation and health care. In a maximum of one typed page, describe the key notions of resilience
engineering (~1/2 page), Amazon’s adopted strategy (~1/4 page), and Google’s approach to it (~1/4 page).

26.x Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Bain87] Bainbridge, L. “Ironies of Automation: Increasing Levels of Automation Can Increase, Rather
than Decrease, the Problems of Supporting the Human Operator,” in New Technology and
Human Error, J. Rasmussen, K. Duncan, and J. Leplat (eds.), pp. 276–283, Wiley, 1987.
[Dekk06] Dekker, S., The Field Guide to Understanding Human Error, Ashgate, 2006.
[Gree09] Greengard, S., “Making Automation Work,” Communications of the ACM, Vol. , No. 12, pp. ,
December 2009.
[Gree09a] Green, M., “Error and Injury in Computers & Medical Devices,” Web page accessed on
November 29, 2012. http://www.visualexpert.com/Resources/compneg.html
[Kris12] Krishnan, K., “Practice: Weathering the Unexpected,” Communications of the ACM, Vol. 55,
No. 11, pp. 48-52, November 2012.
[Petr06] Petroski, H., Success Through Failure: The Paradox of Design, Princeton Univ. Press, 2006, p.
95.


27 Agreement and Adjudication


“There is nothing more likely to start disagreement among
people or countries than an agreement.”
E. B. White

“We judge ourselves by what we feel capable of doing, while
others judge us by what we have already done.”
Henry Wadsworth Longfellow

Topics in This Chapter


27.1. Voting and Data Fusion
27.2. Weighted Voting
27.3. Voting with Agreement Sets
27.4. Variations in Voting
27.5. Distributed Agreement
27.6. Byzantine Resiliency

In Chapter 12, we studied simple voting schemes and their associated hardware
implementations. In this chapter, we introduce a variety of more flexible, and thus
computationally more complex, voting schemes that are often implemented in
software or a combination of hardware and software. Voting schemes constitute
particular instances of a process known as data fusion, where suspect or
incomplete data from various sources are used to derive more accurate or
trustworthy values. One complication with voting or data fusion in a distributed
environment is the possibility of communication errors and Byzantine (absolute
worst-case) failures. Discussions of these notions conclude this chapter.


27.1 Voting and Data Fusion

This section to be written based on the following slides.



27.2 Weighted Voting

From our discussion of hardware voting in Section xxxx, we are already familiar with the
notion of majority voting. A majority fuser sets the output y to be the value provided by a
majority of the inputs xi, if such a majority exists (Fig. 27.fuser-a). We also know that
majority fusers can be built from comparators and multiplexers.

Weighted fusion, which covers majority fusion as a special case, can be defined as
follows. Given n input data objects x1, x2, . . . , xn and associated nonnegative real weights
v1, v2, . . . , vn, with v1 + v2 + . . . + vn = V, compute output y and its weight w such that y is
“supported by” a set of input objects with weights totaling w, where w satisfies a condition
associated with a chosen subscheme. Here are some subschemes that can be used in
connection with the general arrangement of Fig. 27.fuser-b:

Unanimity	w = V
Majority	w > V/2
Supermajority	w ≥ 2V/3
Byzantine	w > 2V/3
Plurality	(w for y) ≥ (w for any z ≠ y)
Threshold	w > some preset lower bound

Plurality fusion (in its special nonweighted case depicted in Fig. 27.fuser-c) selects the
value that appears on the largest number of inputs and presents it at the output. With the
input values {1, 3, 2, 3, 4}, the output will be 3. It is interesting to note that with
approximate values, selection of the plurality result may be nontrivial. If the inputs are
{1.00, 3.00, 0.99, 3.00, 1.01}, one can legitimately argue that 1.00 constitutes the proper
plurality result. We will discuss approximate voting in more detail later. For now, we
note that median fusion would have produced the output 1.01 in the latter example.

(a) Nonweighted majority (b) Generalized weighted (c) Nonweighted plurality

Fig. 27.fuser Three kinds of data fusers derived from voting schemes.
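The subschemes above can be rendered in software as a single fuser routine. The sketch below is illustrative only (function and parameter names are our own); it matches input values exactly and so does not address the approximate-agreement complications just noted:

```python
from collections import defaultdict

def weighted_vote(inputs, weights, scheme="majority", threshold=None):
    """Fuse n input values with nonnegative weights.  Returns (y, w), where
    y is the best-supported value and w is its total supporting weight, or
    None if the chosen subscheme's condition on w is not satisfied."""
    V = sum(weights)                      # total weight of all inputs
    support = defaultdict(float)
    for x, v in zip(inputs, weights):
        support[x] += v                   # accumulate weight behind each value
    y, w = max(support.items(), key=lambda kv: kv[1])
    conditions = {
        "unanimity":     w == V,
        "majority":      w > V / 2,
        "supermajority": w >= 2 * V / 3,
        "byzantine":     w > 2 * V / 3,
        "plurality":     all(w >= u for u in support.values()),
        "threshold":     threshold is not None and w > threshold,
    }
    return (y, w) if conditions[scheme] else None
```

For instance, with the nonweighted inputs {1, 3, 2, 3, 4} of the plurality example above, the function returns (3, 2): value 3 with two supporting votes.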


Implementing weighted plurality voting units [Parh91].

Details of a sorting-based plurality voting unit [Parh91].

Threshold voting and its generalizations


Usefulness of weighted threshold voting

Implementing weighted threshold voting units



27.3 Voting with Agreement Sets

An agreement set for an n-input voting scheme is a subset of the n inputs such that if all
inputs in the subset are in agreement, then the output of the voting scheme is based on
that particular subset. A voting scheme can be fully characterized by its agreement sets.
For example, simple 2-out-of-3 majority voting has the agreement sets {x1, x2}, {x2, x3},
and {x3, x1}. Clearly, the agreement sets cannot be arbitrary if the voting outcome is to be
well-defined. For example, in a 4-input voting scheme, the agreement sets cannot include
both {x1, x2} and {x3, x4}. It is readily proven that for a collection of agreement sets to
make sense, no two sets should have an empty intersection.
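The nonempty-intersection condition, and voting driven by agreement sets, can be sketched as follows. This is an illustrative software rendering (with hypothetical function names), not a hardware voting unit:

```python
from itertools import combinations

def is_well_defined(agreement_sets):
    """Agreement sets are consistent only if every pair intersects;
    otherwise two disjoint sets could each be internally unanimous
    while supporting different output values."""
    return all(a & b for a, b in combinations(agreement_sets, 2))

def agreement_set_vote(inputs, agreement_sets):
    """Return the agreed value if all inputs of some agreement set
    (each set given as 0-based input indices) match; else None."""
    for s in agreement_sets:
        values = {inputs[i] for i in s}
        if len(values) == 1:          # all members of this set agree
            return values.pop()
    return None
```

With the 2-out-of-3 majority agreement sets {x1, x2}, {x2, x3}, {x3, x1}, the inputs (5, 5, 9) yield 5, since the first set is internally unanimous.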

Implementing agreement-set voting units


27.4 Variations in Voting

This section to be written based on the following slides.

Generalized median voting

The impossibility of perfect voting


Approximate voting

Approval voting


Interval voting


27.5 Distributed Agreement

This section to be written based on the following slides.

Byzantine failures in distributed voting

The interactive consistency (IC) algorithm


Building upon consensus protocols

Correctness and performance of ICA



27.6 Byzantine Resiliency

The Byzantine generals problem

Byzantine generals with reliable messengers

Bounds for Byzantine resiliency: To tolerate f Byzantine failures:

We need 3f + 1 or more FCRs (fault containment regions)
FCRs must be interconnected via at least 2f + 1 disjoint paths
Inputs must be exchanged in at least f + 1 rounds
Corollary 1: Simple 3-way majority voting is not Byzantine resilient


Corollary 2: Because we need 2f + 1 good nodes out of a total of
3f + 1 nodes, a fraction (2f + 1)/(3f + 1) = 2/3 + 1/(9f + 3) of the nodes
must be healthy. This is greater than a supermajority (2/3) requirement.

With some support from the system, in the form of a certain kind of centralized or
distributed reliable and tamperproof service, the number of replicas needed for Byzantine
resilience can be reduced from 3f + 1 to 2f + 1, which is the minimum possible. Several
proposals of this kind have been made since 2004, with the latest [Vero13] possessing
performance and efficiency advantages over the earlier ones.
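The classical bounds above are direct formulas in f, so they can be restated in a small helper routine (the function and key names below are our own, chosen for illustration):

```python
def byzantine_requirements(f: int) -> dict:
    """Minimum resources to tolerate f Byzantine failures in the
    classical setting (no trusted tamperproof service)."""
    return {
        "fault_containment_regions": 3 * f + 1,   # 3f + 1 or more FCRs
        "disjoint_paths": 2 * f + 1,              # interconnection paths
        "exchange_rounds": f + 1,                 # rounds of input exchange
        # (2f + 1)/(3f + 1) = 2/3 + 1/(9f + 3) of nodes must be healthy
        "healthy_fraction": (2 * f + 1) / (3 * f + 1),
    }
```

For f = 1, this gives 4 regions, 3 disjoint paths, 2 rounds, and a required healthy fraction of 3/4, which exceeds the 2/3 supermajority level for every f, consistent with Corollary 2.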


Problems

27.1 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx

27.2 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx

27.3 Weighted threshold voting


Consider a 4-input weighted threshold voting unit with input weights v + 3, v + 2, v + 1, v, and a threshold
of 2v + 4, where v is an integer.
a. Show, with a precise and convincing proof, that the given voting unit is equivalent to one with
input weights 3, 2, 2, 1 and threshold 5.
b. What are the practical implications of the equivalence of part a?
c. If we change the threshold to 2v + 3, what would be a simplified set of weights and threshold?
d. Show a possible circuit realization for the voting unit in part c, using an ordinary nonweighted
voting unit.

27.4 Nonuniform voting schemes


The United Nations Security Council consists of five permanent members and 10 nonpermanent members
that serve two-year terms. For a resolution to be approved by the Council, all five permanent members and
at least four nonpermanent members must agree. Can this decision scheme be formulated as weighted
voting? How, or why not?

27.5 Weighted threshold voting


Find a simpler set of weights, and the associated threshold value, that would produce results identical to
each of the following weighted threshold voting schemes, or show that no simplification is possible.
a. Weights: 3, 3, 2, 2, 2. Threshold: 8 (Answer: 2, 2, 1, 1, 1; 5)
b. Weights: 5, 4, 3, 2, 1. Threshold: 9 (Solution: Calling the simplified weights a, b, c, d, and e,
respectively, the minimal sets are {a, b}, {a, c, d}, {a, c, e}, {b, c, d}. No simplification is
possible, as no two votes can be equal to each other.)
c. Weights: k + 1, k + 1, k + 1, k + 1, k + 1, 1, 1, 1, . . . , 1 (k 1s in all). Threshold: 5k + 9.

27.6 Generalized and weighted voting


A generalized voting scheme can be specified by listing its agreement sets. For example, simple 2-out-of-3
majority voting with inputs A, B, and C has the agreement sets {A, B}, {B, C}, {C, A}. Show that each of
the agreement sets below corresponds to a weighted threshold voting scheme and present a simple
hardware voting unit implementation for each case.
a. {A, B}, {A, C}, {A, D}, {B, C, D}
b. {A, B}, {A, C, D}, {B, C, D}
c. {A, B, C}, {A, C, D}, {B, C, D}
d. {A, B}, {A, C, D}, {B, D}

27.7 Generalized and weighted voting


Is there any generalized voting scheme on 4 inputs that cannot be realized as a weighted threshold voting
scheme? Fully justify your answer.

27.8 Assent in social choice


Problem to be designed based on [Bald13].

27.9 Computation of majority


Problem to be designed based on [DeMa12].


References and Further Readings


[Bald13] Baldiga, K. A. and J. R. Green, “Assent-Maximizing Social Choice,” Social Choice and
Welfare, Vol. 40, No. 2, pp. 439-460, February 2013.
[Bend15] Bendahmane, A., M. Essaaidi, A. El Moussaoui, and A. Younes, “The Effectiveness of
Reputation-Based Voting for Collusion Tolerance in Large-Scale Grids,” IEEE Trans.
Dependable and Secure Computing, Vol. 12, No. 6, pp. 665-674, November-December 2015.
[Cons13] Consortium for Mathematics and Its Applications, For All Practical Purposes: Mathematical
Literacy in Today’s World, W. H. Freeman, 9th ed., 2013, 912 pp. [Part III, Voting and Social
Choice: 9. Social Choice: The Impossible Dream; 10. The Manipulability of Voting Systems;
11. Weighted Voting Systems; 12. Electing the President.]
[Corr05] Correia, M., N. F. Neves, L. C. Lung, and P. Verissimo, “Low Complexity Byzantine-Resilient
Consensus,” Distributed Computing, Vol. 17, No. 3, pp. 237-249, March 2005.
[DeMa12] De Marco, G., E. Kranakis, and G. Wiener, “Computing Majority with Triple Queries,”
Theoretical Computer Science, Vol. 461, pp. 17-26, November 2012.
[Deno19] Denoeux, T., “Decision-Making with Belief Functions: A Review,” Int’l J. Approximate
Reasoning, Vol. 109, pp. 87-110, June 2019.
[Frie07] Friedman, R., A. Mostefaoui, S. Rajsbaum, and M. Raynal, “Asynchronous Agreement and Its
Relation with Error-Correcting Codes,” IEEE Trans. Computers, Vol. 56, No. 7, pp. 865-875,
July 2007.
[Kilg06] Kilgour, D. M., S. J. Brams, and R. Sanver, “How to Elect a Representative Committee Using
Approval Balloting,” in [Sime06], pp. 83-95.
[Laha15] Lahat, D., T. Adali, and C. Jutten, “Multimodal Data Fusion: An Overview of Methods,
Challenges, and Prospects,” Proc. IEEE, Vol. 103, No. 9, pp. 1449-1477, September 2015.
[Levi13] Levitin, G. and K. Hausken, “Defending Threshold Voting Systems with Identical
Voting Units,” IEEE Trans. Reliability, Vol. 62, No. 2, pp. 466-477, June 2013.
[Mesk06] Meskanen, T. and H. Nurmi, “Distance from Consensus: A Theme and Variations,” in
[Sime06], pp. 117-132.
[Nage06] Nagel, J. H., “A Strategic Problem in Approval Voting,” in [Sime06], pp. 133-150.
[Parh91] Parhami, B., “Voting Networks,” IEEE Trans. Reliability, Vol. 40, No. 3, pp. 380-394, August
1991.
[Saar06] Saari, D. G., “Hidden Mathematical Structures of Voting,” in [Sime06], pp. 221-234.
[Sime06] Simeone, B. and F. Pukelsheim (eds.), Mathematics and Democracy: Recent Advances in
Voting Systems and Collective Choice, Springer, 2006.
[Vemp13] Vempaty, A., L. Tong, and P. Varshney, “Distributed Inference with Byzantine Data: State-of-
the-Art Review on Data Falsification Attacks,” IEEE Signal Processing, Vol. 30, No. 5, pp. 65-
75, September 2013.
[Vero13] Veronese, G. S., M. Correia, A. N. Bessani, L. C. Lung, and P. Verissimo, “Efficient Byzantine
Fault-Tolerance,” IEEE Trans. Computers, Vol. 62, No. 1, pp. 16-30, January 2013.


28 Fail-Safe System Design


“A common mistake that people make when trying to design
something completely foolproof is to underestimate the ingenuity
of complete fools.”
Douglas Adams

“They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety.”
Benjamin Franklin

Topics in This Chapter


28.1. Fail-Safe System Concepts
28.2. Principles of Safety Engineering
28.3. Fail-Safe Specifications
28.4. Fail-Safe Combinational Logic
28.5. Fail-Safe State Machines
28.6. System- and User-Level Safety

We would prefer our computers and computer-based systems to be immune to
failures. However, given that complete freedom from failures is impossible to
ascertain in all but the simplest systems, the next best thing is to follow a fail-safe
design paradigm: to ensure that failures do not create safety hazards for humans
and do not lead to the loss of valuable resources, including stored data. The long
tradition of the field of safety engineering, as applied to older engineering
disciplines, is a useful guide for our discussions. Following this guide, we study
some methods relating to fail-safe specifications, design of fail-safe logic circuits,
and safety concerns at the system and user levels.


28.1 Fail-Safe System Concepts

This section to be written based on the following slides.


28.2 Principles of Safety Engineering

Safety-critical systems exist outside the domain of computing as well. Thus, over many years of
experience with such systems, a number of principles for designing safe systems have been
identified. These principles include [Gold87]:

1. Use barriers and interlocks to constrain access to critical system resources or states
2. Perform critical actions incrementally, rather than in a single step
3. Dynamically modify system goals to avoid or mitigate damages
4. Manage the resources needed to deal with a safety crisis, so that enough will be
available in an emergency
5. Exercise all critical functions and safety features regularly to assess and maintain their
viability
6. Design the operator interface to provide the information and power needed to deal with
exceptions
7. Defend the system against malicious attacks


28.3 Fail-Safe Specifications

This section to be written based on the following slides.

Example: Amusement park train

Example: Traffic light controller


Today’s traffic lights are computer-controlled, with a microprocessor embedded in a
control box near the intersection running the requisite software. There are typically many
inputs, including data from vehicle sensors embedded in the road, input from pedestrian
crossing buttons, and information about traffic conditions received from a remote source
or collected from cameras installed at the intersection. The light fixtures themselves may
contain conventional green, yellow/amber, and red lights, as well as special turn lights.
Finally, there may be various pedestrian signals.

Let us ignore all this complexity and focus on the simplest possible intersection with two
sets of 3 lights (green, yellow/amber, red) for the two intersecting roads. The simplest
algorithm for controlling the lights would cycle through various states, allowing cars to
move on one street for a fixed amount of time and then changing the lights to let cars on
the other road proceed for a while. The state of the intersection at any given time can
be represented as XY, where each of the letters X and Y can assume a color code from
{G, Y, R}. With regard to safety, the state RR is very safe, but, of course, undesirable
from the traffic movement point of view, and the state GG is highly dangerous and


should never occur. One can say that with regard to safety, the states can be ordered in
the following way, from the safest to the most dangerous: RR, {RY, YR}, {RG, GR},
YY, {GY, YG}, GG. So, the objective in the design should be to ensure that the state of
the traffic lights is one of the first five just listed, avoiding any of the final four states.

The cycling should be

RG → RY → RR → GR → YR → RR → RG

The RR states, maintained for a few seconds, are inserted to provide a safe buffer
between cars moving on the two streets.

Now, if the lights are controlled by six separate signals that turn them on and off, there is
a chance that through some signal being stuck or incorrectly computed, one of the
prohibited states is entered.
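The safe cycling just described can be sketched as a simple state machine. This is an illustrative model only (the state labels and the fail-safe default are our own): the two RR buffer states are distinguished internally so the cycle can be encoded as a ring, and any state outside the legal cycle, such as GG reached through a stuck signal, fails safe by forcing all-red:

```python
# Internal states; RR(a) and RR(b) both display all-red on the streets.
CYCLE = ["RG", "RY", "RR(a)", "GR", "YR", "RR(b)"]

def next_state(state: str) -> str:
    """Advance the controller one step; any unrecognized or prohibited
    state (e.g., GG) fails safe to an all-red buffer state."""
    if state not in CYCLE:
        return "RR(a)"                                  # fail-safe default
    return CYCLE[(CYCLE.index(state) + 1) % len(CYCLE)]

def displayed(state: str) -> str:
    """Colors actually shown on the two streets (X, then Y)."""
    return state[:2]
```

Stepping six times from RG traverses the full cycle and returns to RG, while feeding in the dangerous state GG immediately yields all-red.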


28.4 Fail-Safe Combinational Logic

This section to be written based on the following slides.

Example: Fail-safe 2-out-of-4 code checker
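Pending the full example, the validity condition being checked can be sketched as follows: a 2-out-of-4 codeword has exactly two 1s among its four bits. This shows only the logical test, not the self-checking circuit realization a true fail-safe checker requires:

```python
from itertools import product

def is_valid_2_of_4(bits) -> bool:
    """A 2-out-of-4 codeword has exactly two 1s among its four bits;
    any other weight (0, 1, 3, or 4 ones) signals an error."""
    return len(bits) == 4 and sum(bits) == 2

# Enumerate the C(4,2) = 6 valid codewords.
valid_codewords = [b for b in product((0, 1), repeat=4) if is_valid_2_of_4(b)]
```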


28.5 Fail-Safe State Machines

This section to be written based on the following slides.


28.6 System- and User-Level Safety

Any fail-safe system will have some false alarms: situations in which the system enters
its safe mode of operation, or stops functioning altogether, in the absence of any
imminent danger. Conversely, monitoring components in a fail-safe system may miss
hazardous conditions, thus allowing the system to remain operational in an unsafe mode.
False alarms lead to inconvenience, lower productivity, and perhaps financial losses.
However, these undesirable consequences are preferable to potentially huge losses that
might result from unsafe operation. This tradeoff between the frequency of false alarms
and the likelihood of missed warnings is a fundamental one in any system that relies on
tests with binary outcomes for diagnostics.

A binary test is characterized by its sensitivity and specificity. Sensitivity is the fraction of
positive cases (e.g., people having a particular disease) that the test diagnoses correctly.
Specificity is the fraction of negative cases (e.g., healthy people) that are correctly
identified. Figure 28.test contains a graphical depiction of the notions of test specificity
and sensitivity, as well as the trade-offs between them. We see that Test 1 in Fig. 28.test-a is
high on specificity, because it does not lead to very many false positives, whereas Test 2
in Fig. 28.test-b is highly sensitive in the sense of correctly diagnosing a great majority of
the positive cases. In general, highly sensitive tests tend to be less specific, and vice
versa. For safety-critical systems, we want to err on the side of too much sensitivity.
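The two measures follow directly from the four outcome counts of a binary test; the function below is a straightforward restatement of the definitions:

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity = fraction of positive cases diagnosed correctly;
    specificity = fraction of negative cases correctly identified."""
    sensitivity = tp / (tp + fn)   # true positives over all positives
    specificity = tn / (tn + fp)   # true negatives over all negatives
    return sensitivity, specificity
```

For example, a test that catches 90 of 100 positive cases while clearing 80 of 100 negative cases has sensitivity 0.9 and specificity 0.8.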

A dead-man’s switch, sometimes called a kill switch, is used in a number of safety-
critical systems that should not be operable without an operator’s presence. It usually
takes the form of a handle or pedal that the operator must touch or press continuously,
and was first used in trains to stop them in the event of operator incapacitation. Early
trains serving urban areas carried two operators, so that the second one could take over if
the first operator was incapacitated for any reason, much like modern passenger aircraft
using a pilot and a co-pilot. The invention of the dead-man’s switch came about as a cost-
saving measure.

Within a self-monitoring safety-critical system, a dead-man’s switch can take the form of
a monitoring unit that continuously runs tests to verify the safety of the current
configuration and operational status. If some predetermined time interval passes without
the monitor asserting itself or confirming proper status, the system will automatically
enter its safe operating mode.
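Such a monitor can be sketched in software as follows. The class and method names are hypothetical, and a real system would typically use a hardware watchdog timer rather than this polling scheme:

```python
import time

class SoftwareDeadMansSwitch:
    """If the monitor fails to confirm proper status within the timeout,
    the system is forced into its safe operating mode."""

    def __init__(self, timeout_s: float, enter_safe_mode):
        self.timeout_s = timeout_s
        self.enter_safe_mode = enter_safe_mode   # callback to the safe mode
        self.last_ok = time.monotonic()

    def confirm_ok(self):
        """Called by the monitoring unit after each successful self-test."""
        self.last_ok = time.monotonic()

    def poll(self) -> bool:
        """Called periodically; trips the safe mode on a missed deadline."""
        if time.monotonic() - self.last_ok > self.timeout_s:
            self.enter_safe_mode()
            return True
        return False
```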


(a) Specific test (b) Sensitive test

Fig. 28.test The notions of specificity and sensitivity in binary tests.

Interlocks, watchdog units, and other preventive and monitoring mechanisms are
commonplace in safety-critical systems.


Problems

28.1 Exception-free arithmetic


One of the sources of difficulty in designing dependable systems is dealing with arithmetic exceptions,
such as overflow, underflow, and various disallowed or illegal operations. Read the paper [Haye09] and
answer the following questions.
a. What is exception-free arithmetic and how can it be implemented?
b. What is the performance overhead, if any, of exception-free arithmetic?
c. How does exception-freedom contribute to dependability?
d. Does the method proposed in the paper eliminate all exceptions or only some of them?


References and Further Readings


[Chua78] Chuang, H. and S. Das, “Design of Fail-Safe Sequential Machines Using Separable
Codes,” IEEE Trans. Computers, Vol. 27, No. 3, pp. 249-251, March 1978.
[Haye09] Hayes, B., “The Higher Arithmetic: How to Count to a Zillion without Falling off the End of
the Number Line,” American Scientist, Vol. 97, No. 5, pp. 364-368, September-October 2009.
[Nico98] Nicolaidis, M., “Fail-Safe Interfaces for VLSI: Theoretical Foundations and Implementation,”
IEEE Trans. Computers, Vol. 47, No. 1, pp. 62-77, January 1998.


A Past, Present, and Future


“You can’t change the past, but you can ruin the present by
worrying about the future.”
Anonymous

“The distinction between the past, present and future is only a
stubbornly persistent illusion.”
Albert Einstein

Topics in This Appendix


A.1. Historical Perspective
A.2. Long-Life Systems
A.3. Safety-Critical Systems
A.4. High-Availability Systems
A.5. Commercial and Personal Systems
A.6. Trends, Outlook, and Resources

In this appendix, we trace the history of dependable computing, from the earliest
digital computers to modern machines used in a variety of application domains,
from space exploration and real-time process control to banking and e-commerce.
We explore a few turning points along this impressive historical path, including
emergence of long-life systems for environments that make repair impossible,
development of safety-centered computer systems, meeting the stringent demands
of applications requiring high availability, and interaction of dependability with
security concerns. After discussing how advanced ideas and methods developed
in connection with system classes named above find their way into run-of-the-mill
commercial and personal systems, we conclude with a discussion of current
trends, future outlook, and resources for further study of dependable computing.


A.1 Historical Perspective

Even though fault-tolerant computing as a discipline had its origins in the late 1960s, key
activities in the field began some two decades earlier in Prague, where computer scientist
Antonin Svoboda (1907-1980) built the SAPO computer. The pioneering design of SAPO
employed triplication and voting to overcome the effects of poor component quality.
Svoboda’s efforts were little known in the West and thus did not exert much influence on
the field as we know it. JPL’s STAR (self-testing-and-repairing) computer, on the other
hand, was highly influential, both owing to its design innovations and because the project
leader, Professor Algirdas Avizienis, was one of the early leaders of the field in the US.
The STAR computer project was motivated by NASA’s plans for a Solar System
exploration mission taking 10 years. Known as “The Grand Tour,” the mission was later
scrapped in its original form, but a number of its pieces, including the highly fault-
tolerant computer, formed the basis of later space endeavors.

With this introduction, we embark on reviewing the development of dependable and
fault-tolerant computing by discussing each of the seven decades 1950-2010 very briefly
in the rest of this section.

(a) SAPO; Prague, 1951 (b) STAR; JPL, 1971

Fig. A.early Two early fault-tolerant computers.


Dependable computing in the 1950s

Dependable computing in the 1960s

Dependable computing in the 1970s


Dependable computing in the 1980s

Dependable computing in the 1990s

Dependable computing in the 2000s


In the 2010s decade, we have continued to see the development and further maturation of
the field. New challenges to be faced in the rest of this decade include applying and
extending available techniques to new computing paradigms and environments, such as
cloud computing and its attendant mobile platforms (smartphones and compact tablets).
Among problems that need extensive study and innovative solutions is greater integration
of reliability and security concerns. At the end of this decade, the field of dependable
computing, and its flagship conference, will be preparing to celebrate their 50th
anniversary (DSN-50 will be held in 2020): a milestone that should be cause for a
detailed retrospective assessment of, and prospective planning for, the methods and
traditions of the field, as they have been practiced and as they might apply to new and
emerging technologies. By then, some of the pioneers of the field will no longer be with
us, but there are ample new researchers to carry the torch forward.


A.2 Long-Life Systems

Interest in long-life computer systems began with the desire to launch spacecraft on
multiyear missions, where there is no possibility of repair. Even for manned space
missions, the limited repair capacity isn’t enough to offset the effects of even greater
reliability requirements. Today, we are still interested in long-life systems for space travel,
but the need for such systems has expanded owing to many remotely located or hard-to-
access systems for intelligence gathering and environmental monitoring, to name only
two application domains.

Systems of historical and current interest that fall into the long-life category include
NASA’s OAO, the Galileo spacecraft, JPL STAR, the International Space Station,
communication satellites, and remote sensor networks.

Description of the JPL STAR computer


A.3 Safety-Critical Systems

Safety-critical computer systems were first employed in flight control, nuclear reactor
safety, and factory automation. Today, the scope of safety-critical systems has broadened
substantially, given the expansion of numerous applications requiring computerized
control: high-speed transportation, health monitoring, surgical robots.

Systems of historical and current interest that fall into the safety-critical category include
Carnegie Mellon University’s C.vmp, Stanford University’s SIFT, MIT’s FTMP, the
industrial control computers of August Systems, high-speed train controls, and
automotive computers.

Avionics fly-by-wire systems

Automotive drive-by-wire systems



A.4 High-Availability Systems

High-availability computer systems were first developed by telephone companies for
their electronic switching systems. Even short interruptions in telephone service are both
embarrassing to communication companies and sources of significant revenue loss.
Today, the availability of communication services is even more important, not only for
telephony but also for a host of applications such as on-line banking, e-commerce, social
networking, and other systems that can ill-afford even very short down times.

Systems of historical and current interest that fall into the high-availability category
include AT&T ESS 1-5 (telephone switching, 1965-1982), Tandem’s various computers
(NonStop I-II, . . . , Cyclone, CLX800, 1976-1991), Stratus FT200-XA2000 (1981-1990),
banking systems, Internet portals (Google, Yahoo), and e-commerce (Amazon, eBay).
The following description of Tandem NonStop Cyclone is copied from [Parh99].

The first Tandem NonStop, a MIMD distributed-memory bus-based multiprocessor, was
announced in 1976 for database and transaction processing applications requiring high
reliability and data integrity. Since then, several versions have appeared. The Tandem
NonStop Cyclone, described in this section [Prad96], was first introduced in 1989. A
main objective of Cyclone’s design was to prevent any single hardware or software
malfunction from disabling the system. This objective was achieved by hardware and
informational redundancy as well as procedural safeguards, backup processes,
consistency checks, and recovery schemes.

A fully configured Cyclone system consisted of 16 processors that were organized into
sections of 4 processors. Processors in each section were interconnected by a pair of 20
MB/s buses (Dynabus) and each could support two I/O subsystems capable of burst
transfer rates of 5 MB/s (Fig. A.Tand-a). An I/O subsystem consisted of two I/O
channels, each supporting up to 32 I/O controllers. Multiple independent paths were
provided to each I/O device via redundant I/O subsystems, channels, and controllers. Up
to 4 sections could be linked via a unidirectional fiber-optic Dynabus+ that allowed
multiple sections to be nonadjacent within a room or even housed in separate rooms (Fig.
A.Tand-b). By isolating Dynabus+ from Dynabus, full-bandwidth communications could
occur independently in each 4-processor section. Other features of the NonStop Cyclone
are briefly reviewed below.


(a) One section of NonStop Cyclone system (b) Dynabus + connecting 4 sections

Fig. A.Tand Tandem NonStop Cyclone system of the late 1980s.

Processors: Cyclone’s 32-bit processors had advanced superscalar CISC designs. They
used dual 8-stage pipelines, an instruction pairing technique for instruction-level parallel
processing, sophisticated branch prediction algorithms for minimizing pipeline bubbles,
and separate 64 KB instruction and data caches. Up to 128 MB of main memory could be
provided for each Cyclone processor. The main memory was protected against errors
through the application of a SEC/DED code. Data transfers between memory and I/O
channels were performed via DMA and thus did not interfere with continued instruction
processing.
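The SEC/DED protection mentioned above can be illustrated at toy scale. The sketch below uses a Hamming(7,4) code extended with an overall parity bit; the 4-data-bit word size is an assumption chosen for brevity (Cyclone’s memory would have used a much wider code, such as 8 check bits over 64 data bits):

```python
def secded_encode(d):
    """Encode 4 data bits as Hamming(7,4) plus an overall parity bit (8 bits)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # check bit covering positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # check bit covering positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # check bit covering positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for b in word:
        p0 ^= b                # overall parity over the 7-bit codeword
    return word + [p0]

def secded_decode(w):
    """Return (data, status); status is 'ok', 'corrected', or 'double-error'."""
    c = list(w[:7])
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # position of a single flipped bit
    overall = 0
    for b in w:
        overall ^= b
    if syndrome == 0 and overall == 0:
        return [c[2], c[4], c[5], c[6]], 'ok'
    if overall == 1:                   # odd error count: assume single, correct
        if syndrome != 0:
            c[syndrome - 1] ^= 1       # syndrome 0 means the parity bit flipped
        return [c[2], c[4], c[5], c[6]], 'corrected'
    return None, 'double-error'        # even parity but nonzero syndrome
```

A single flipped bit is corrected transparently, while two flipped bits are detected but not corrected, which is precisely what the acronym SEC/DED (single-error-correcting, double-error-detecting) promises.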

System performance: Performance estimates published in 1990 indicated that, after
accounting for cache misses and other overhead, the custom-designed Cyclone processor
could execute each instruction in an average of 1.5-2 clock cycles. Thus, with a clock
period of 10 ns, the peak performance of a fully configured NonStop Cyclone system was
about 1000 MIPS. Since each of the two I/O subsystems connected to a processor could
transmit data at a burst rate of 5 MB/s, a peak aggregate I/O bandwidth of 160 MB/s was
available.

Hardware reliability: Use of multiple processors, buses, power supplies, I/O paths, and
mirrored disks was among the methods used to ensure continuous operation despite
hardware malfunctions. A fail-fast strategy was employed to reduce the possibility of
error propagation and data contamination. Packaging and cooling technologies had also
been selected to minimize the probability of failure and to allow components, such as


circuit boards, fans, and power supplies, to be hot-pluggable without a need to interrupt
system operation. When a malfunctioning processor was detected via built-in hardware
self-checking logic, its load was transparently distributed to other processors by the
operating system.

Software reliability: The GUARDIAN 90 operating system was a key to Cyclone’s high
performance and reliability. Every second, each processor was required to send an “I’m
alive” message to every other processor over all buses. Every 2 seconds, each processor
checked to see if it had received a message from every other processor. If a message had
not been received from a particular processor, it was assumed to be malfunctioning.
Other software mechanisms for malfunction detection included data consistency checks
and kernel-level assertions. Malfunctions in buses, I/O paths, and memory were tolerated
by avoiding the malfunctioning unit or path. Processor malfunctions led to deactivation
of the processor. For critical applications, GUARDIAN 90 maintained duplicate backup
processes on disjoint processors. To reduce overhead, the backup process was normally
inactive but was kept consistent with the primary process via periodic checkpointing
messages. Upon malfunction detection, the backup process was started from the last
checkpoint, perhaps using mirror copies of the data.
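The “I’m alive” protocol just described can be sketched in a few lines. This is a simplified single-process simulation; the 1-second and 2-second constants come from the text, while the class structure and names are my own illustration:

```python
import time

HEARTBEAT_PERIOD = 1.0   # each processor sends "I'm alive" every second
CHECK_PERIOD = 2.0       # each processor checks for messages every 2 seconds

class HeartbeatMonitor:
    """Tracks 'I'm alive' messages; flags silent peers as malfunctioning."""
    def __init__(self, peer_ids):
        now = time.monotonic()
        self.last_seen = {p: now for p in peer_ids}

    def record_alive(self, peer_id):
        """Called whenever an 'I'm alive' message arrives from a peer."""
        self.last_seen[peer_id] = time.monotonic()

    def check(self):
        """Return peers silent for more than CHECK_PERIOD (presumed failed)."""
        now = time.monotonic()
        return {p for p, t in self.last_seen.items() if now - t > CHECK_PERIOD}
```

A processor flagged by check() would then have its workload redistributed by the operating system, and any backup processes it hosted would be restarted from their last checkpoint.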

Related systems: In addition to NonStop Cyclone and its subsequent RISC-based
Himalaya servers, Tandem offered the Unix-based Integrity S2 uniprocessor system
which tolerated malfunctions via triplication and voting. It used R4000 RISC processors
and offered both application and hardware-level compatibility with other Unix-based
systems. Commercial reliable multiprocessors were also offered by Stratus (XA/R Series
300, using a pair-and-spare hardware redundancy scheme) and Sequoia (Series 400, using
self-checking modules with duplication). Both of the latter systems were tightly coupled
shared-memory multiprocessors.
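The triplication-and-voting scheme of the Integrity S2 reduces, at its core, to a majority vote over three replicated outputs. The following minimal sketch captures the logic (a real voter is a hardware unit and must itself be trusted or replicated):

```python
from collections import Counter

def tmr_vote(results):
    """Majority vote over the outputs of three replicated modules.
    Returns (value, ok); ok is False when no two modules agree."""
    if len(results) != 3:
        raise ValueError("TMR requires exactly three module outputs")
    value, count = Counter(results).most_common(1)[0]
    return (value, True) if count >= 2 else (None, False)
```

With one faulty module, the vote masks the error entirely; with two disagreeing faulty modules, the voter can only signal failure.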

Example: Google data center architecture


A.5 Commercial and Personal Systems

Due to highly unreliable components, early computers had extensive provisions for fault
masking and error detection/correction. Today’s components are ultrareliable, but there
are so many of them in a typical system that faults/errors/malfunctions are inevitable. It is
also the case that computers are being used not only by those with hardware/software
training but predominantly by nonexperts, or experts in other fields, who would be
immensely inconvenienced by service interruptions and erroneous results.

Systems of historical and current interest that fall into the commercial/personal category
include SAPO, IBM System 360, and IBM Power 6 [Reic08].

Description of the IBM Power 6 microprocessor


A.6 Trends, Outlook, and Resources

In this section, we review some of the active research areas in dependable computing and
provide a forecast of where the field is headed in the next few decades. Technology
forecasting is, of course, a perilous task and many forecasters look quite silly when their
predictions are examined decades hence. Examples of off-the-mark forecasts include
Thomas J. Watson’s “I think there is a world market for maybe five computers,” and Ken
Olson’s “There is no reason anyone would want a computer in their home.” Despite these
and other spectacular failures in forecasting, and cautionary anonymous bits of advice
such as “There are two kinds of forecasts: lucky and wrong,” I am going to heed the more
optimistic musing of Henri Poincaré, who said “It is far better to foresee even without
certainty than not to foresee at all.”

Dependable computer systems and design methods continue to evolve. Over the years,
emphasis in the field has shifted from building limited-edition systems with custom
components to using commercial off-the-shelf (COTS) components to the extent
possible. This strategy implies incorporating dependability features through layers of
software and services that run on otherwise untrustworthy computing elements.
Designers of COTS components in turn provide features that enable and facilitate such a
layered approach. This trend, combined with changes in technology and scale of systems,
creates a number of challenges which will occupy computer system designers for decades
to come.

Challenge 1: The curse of shrinking electronic devices (nanotechnology)
Challenge 2: Reliability in cloud computing (opportunities and problems)
Challenge 3: Smarter and lower-power redundancy schemes (brain-inspired)
Challenge 4: Ensuring data longevity over centuries, even millennia
Challenge 5: Dependability verification through reasoning about uncertainty
Challenge 6: Counteracting the combined effects of faults and intrusions
Challenge 7: Reliability as a roadblock for exascale systems (reliability wall)

Nanotechnology brings about smaller and more energy-efficient devices, but it also
creates new challenges. Smaller devices are prone to unreliable operation, present
complex modeling problems (due to the need for taking parameter variations into
account), and exacerbate the already difficult testing problems.


Parameter variations [Ghos10] occur from one wafer to another, between dies on the
same wafer, and within each die. Reasons for parameter variations include vertical
nonuniformities in the layer thicknesses and horizontal nonuniformities in line and
spacing widths.

Cloud computing has been presented as a solution to all corporate and personal
computing problems, but we must be aware of its complicated reliability and availability
problems [Baue12]. Whereas it is true that multiplicity and diversity of resources can lead
to higher reliability by avoiding single points of failure, this benefit does not come about
automatically. Accidental and deliberate outages are real possibilities and identifying the
weakest link in this regard is no easy task. Assignment of blame in the event of failures
and rigorous risk assessment for both e-commerce and safety-critical systems are among
other difficulties.

The human brain is often viewed as an epitome of design elegance and efficiency.
Though quite energy-efficient (the brain uses around 20 W of power, whereas performing
a minute fraction of the brain’s function for a short period of time via simulation requires
supercomputers that consume megawatts of power), its design is anything but elegant.
The brain has grown in layers over millions of years of evolutionary time. Newer
capabilities are housed in the neocortex, developed fairly recently, and the older reptilian
brain parts are still there (Fig. A.brain-a). As a result, there is functional redundancy,
meaning that the same function (such as vision) may be performed in multiple regions.
Furthermore, the use of a fairly small number of building blocks makes it easy for one part
of the brain to cover for other parts when they are damaged. The death or disconnection
of a small number of neurons is often inconsequential, and those suffering substantial
brain injuries can and do recover to full brain function. The brain uses electrochemical
signaling that is extremely slow compared with electronic communication in modern
digital computers. Yet, the overall computational power of the brain is as yet unmatched
by even the most powerful supercomputers. Finally, memory and computational
functionalities are intermixed in the brain: there are no separate memory units and
computational elements.


(a) Main parts of the human brain (b) Communication between neurons

Fig. A.brain The human brain’s redundancy and uniformity of parts make it more reliable.

As for data longevity, typical media used for mass data storage have lifespans of 3-20
years. Data can be lost to both media decay and format obsolescence. [Elaborate]

Data preservation is particularly daunting when documents and records are produced and
consumed in digital form, as such data files have no nondigital back-up.


The field of dependable computing must deal with uncertainties of various kinds.
Therefore, research in this field has entailed methods for representing and reasoning
about uncertainties. Uncertainties can exist in both data (missing, imprecise, or estimated
data) and in rules for processing data (e.g., empirical rules). Approaches used to date fall
under four categories:

Probabilistic analysis (Bayesian logic)
Certainty/confidence factors
Evidence/belief theory (Dempster-Shafer theory)
Continuous truth values (fuzzy logic)
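As a tiny illustration of the first approach, Bayes’ rule converts a diagnostic alarm with known detection and false-alarm probabilities into a posterior probability that a fault is actually present (the rates used in the example below are hypothetical):

```python
def bayes_update(prior_fault, p_alarm_given_fault, p_alarm_given_ok):
    """Posterior P(fault | alarm) by Bayes' rule."""
    p_alarm = (p_alarm_given_fault * prior_fault
               + p_alarm_given_ok * (1 - prior_fault))
    return p_alarm_given_fault * prior_fault / p_alarm

# With a 1% prior fault rate, 95% detection probability, and a 5% false-alarm
# rate, a single alarm raises the fault probability only to about 16%.
posterior = bayes_update(0.01, 0.95, 0.05)
```

The counterintuitively low posterior explains why dependable systems combine multiple, diverse checks before declaring a unit malfunctioning.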

The interaction between failures and intrusions (security breaches) has become quite
important in recent years. Increasingly, attackers take advantage of the extended
vulnerability of computer systems during failure episodes to compromise such systems.
[Elaborate]

As top-of-the-line supercomputers use more and more processing nodes (currently in the
hundreds of thousands, soon to be in the millions), system MTBF is shortened and the
reliability overhead increases. Checkpointing, for example, must be done more
frequently, which can lead to superlinear overhead of such methods in terms of the
number p of processors. Typically, computational speed-up increases with p (albeit
sometimes facing a flattening due to Amdahl’s law). The existence of reliability overhead
may mean that the speed-up can actually decline beyond a certain number of processors.
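The shrinking of system MTBF with the processor count p, and its effect on checkpointing frequency, can be illustrated with Young’s first-order approximation for the optimal checkpoint interval, tau = sqrt(2 * C * MTBF), where C is the cost of taking one checkpoint. The per-node MTBF and checkpoint cost below are assumptions chosen purely for illustration:

```python
import math

def young_interval(mtbf, checkpoint_cost):
    """Young's approximation: optimal checkpoint interval = sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

PER_NODE_MTBF = 5 * 365 * 24 * 3600   # assume a 5-year MTBF per node, in seconds
C = 600                                # assume a 10-minute checkpoint cost

for p in (10**3, 10**5, 10**6):
    mtbf = PER_NODE_MTBF / p           # system MTBF shrinks linearly with p
    tau = young_interval(mtbf, C)
    print(f"p = {p:>7}: system MTBF = {mtbf / 3600:7.2f} h, "
          f"checkpoint every {tau / 3600:5.2f} h")
```

Under these assumptions, at a million nodes the recommended checkpoint interval actually exceeds the system MTBF, so the machine would spend its time checkpointing and recovering rather than computing.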

[See Problem 3.17 in Chapter 3 on modeling]


Amdahl’s speedup formula: speedup = p/[1 + f(p – 1)], where f is the sequential fraction;
the speedup cannot exceed 1/f, no matter how large p, and tends to the constant 1/f as p
approaches infinity.
Gustafson’s constant-running-time scaled speedup: speedup = f + p(1 – f); nearly p for
small f; tends to infinity as p approaches infinity.
If the reliability overhead is superlinear in p, then a reliability wall exists.
Theorem: A reliability wall exists iff the speedup remains finite as p approaches infinity.
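These formulas, together with an assumed superlinear reliability overhead, can be explored numerically; the overhead model 1 + k * p**1.5 and the constant k below are illustrative assumptions, not values from the literature:

```python
def amdahl_speedup(p, f):
    """Amdahl's law; f is the inherently sequential fraction."""
    return p / (1 + f * (p - 1))

def gustafson_speedup(p, f):
    """Gustafson's constant-running-time scaled speedup."""
    return f + p * (1 - f)

def effective_speedup(p, f, k=1e-6, a=1.5):
    """Scaled speedup deflated by a reliability overhead assumed to grow
    superlinearly in p, as (1 + k * p**a); any a > 1 creates a wall."""
    return gustafson_speedup(p, f) / (1 + k * p**a)
```

With f = 0.05, the effective speedup peaks and then declines as p grows, so adding processors beyond a certain point actually slows the computation down: the reliability wall.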

Other topics of interest:


Linear threshold gates [Ayme12]
Verifying computations without reexecuting them [Walf15]


The field of dependable computing has undergone many changes since the appearance of
the first computer incorporating error detection and correction [Svob79]. The emergence
of new technologies and the unwavering quest for higher performance, usability, and
reliability are bound to create new challenges in the coming years. These will include
completely new challenges, as well as novel or transformed versions of the ones
discussed in the preceding paragraphs. Researchers in dependable computing, who helped
make digital computers into indispensable tools in the six-plus decades since the
introduction of fault tolerance notions, will thus have a significant role to play in making
them even more reliable and ubiquitous as we proceed through the second decade of the
Twenty-First Century.
Fault tolerance strategies to date have by and large relied on perfect hardware and/or
perfect detection of failures. Thus, we either get the correct results (nearly all of the time)
or hit an exceptional failure condition, which we detect and recover from. With modern
computing technology, where extreme complexity and manufacturing variability make
failures the norm, rather than the exception, we should design computers more like
biological systems in which failures (and noise) are routinely expected.

Fig. A.tmln Dependable computing through the decades. [Incomplete]


We summarize our discussions of the history, current status, and future of dependable
computing in the timeline depicted in Fig. A.tmln. As for resources that would allow the
reader to gain additional insights in dependable computing, we have already listed some
general references at the end of the Preface and topic-specific references at the end of
each chapter (and this appendix). Other resources, which are nowadays quite extensive,
thanks to electronic information dissemination, can be found through Internet search
engines. For example, a search for “fault-tolerant computer” on Google yields some 1.5
million items, not counting additional hits for “dependable computing,” “computer
system reliability,” and other related terms. The author maintains a list of Web resources
for dependable computing on his companion website for this book: it can be reached via
the author’s faculty Web site at the University of California, Santa Barbara (UCSB).

Resources for dependable computing

In this book, we have built a framework and skimmed over a number of applicable
techniques at various levels of the system hierarchy, from devices to applications. This
framework can serve as your guide, allowing you to integrate new knowledge you gain in
the field and to see how it fits in the dependable computing universe.


Problems

A.1 Safety-critical systems


It was reported widely that on November 19, 2009, a computer failure at the US Federal Aviation
Administration (FAA) caused massive flight cancellations and delays. The failure made some flight data
(such as flight numbers, destinations, and altitudes) unavailable, leading to manual data entry and forcing
flight controllers to space the aircraft further apart for safety reasons. Initially, very few technical details
were available about the incident, with news reports blaming it on the failure of a single circuit board in
Salt Lake City, Utah. However, more details have emerged since then. Prepare a 2-page report about this
incident, citing the main technical reasons for the disruption and the role played by dependability
enhancement features, or lack thereof.

A.2 Data privacy and government access privileges


There have been suggestions in recent years to make encrypted private communications accessible to
certain government entities. The article [Neum15] likens such privacy circumvention methods to placing
house keys under doormats, with the implication that backdoor access channels can be abused. Prepare a
half-page, single-spaced abstract for the article in which you cite a couple of key reasons for and against
such mandated access.

A.3 Title
Intro
a. xxx
b. xxx
c. xxx
d. xxx


References and Further Readings


[Aviz71] Avizienis, A., G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, “The
STAR (Self-Testing-And-Repairing) Computer: An Investigation of the Theory and Practice of
Fault-Tolerant Computer Design,” IEEE Trans. Computers, Vol. 20, No. 11, pp. 1312-1321,
November 1971.
[Aviz87] Avizienis, A., H. Kopetz, and J.-C. Laprie (eds.), The Evolution of Fault-Tolerant Computing
(Dependable Computing and Fault-Tolerant Systems, Vol. 1), Springer, 1987.
[Ayme12] Aymerich, N. and A. Rubio, “Fault-Tolerant Nanoscale Architecture Based on Linear
Threshold Gates,” Microprocessors and Microsystems, Vol. 36, No. 5, pp. 420-426, July 2012.
[Barr03] Barroso, L. A., J. Dean, and U. Hölzle, “Web Search for a Planet: The Google Cluster
Architecture,” IEEE Micro, Vol. 23, No. 2, pp. 22-28, March-April 2003.
[Bart07] Bartley, G. F., “Boeing B-777: Fly-By-Wire Flight Controls,” in Digital Avionics Handbook,
ed. by C. R. Spitzer, Vol. 1, pp. 23-1 to 23-14, CRC Press, 2006. Electronic version of the
handbook available at: http://www.engnetbase.com/books/5549/8438fm.pdf
[Baue12] Bauer, E. and R. Adams, Reliability and Availability of Cloud Computing, IEEE Press, 2012.
[Cart64] Carter, W. C., H. C. Montgomery, R. J. Preiss, and H. J. Reinheimer, “Design of Serviceability
Features for the IBM System/360,” IBM J. Research and Development, Vol. 8, No. 2, pp. 115-
126, April 1964.
[Ghos10] Ghosh, S. and K. Roy, “Parameter Variation Tolerance and Error Resiliency: New Design
Paradigm for the Nanoscale Era,” Proceedings of the IEEE, Vol. 98, No. 10, pp. 1718-1751,
October 2010.
[Han05] Han, J., E. Taylor, J. Gao, and J. Fortes, “Reliability Modeling of Nanoelectronic Circuits,”
Proc. 5th IEEE Conf. Nanotechnology, Vol. 1, pp. 104-107, July 2005.
[Hitt07] Hitt, E. F. and D. Mulcare, “Fault-Tolerant Avionics,” in Digital Avionics Handbook, ed. by C.
R. Spitzer, Vol. 2, pp. 8-1 to 8-24, CRC Press, 2007. Electronic version of the handbook
available at: http://www.engnetbase.com/books/6327/8441_fm.pdf
[Lati18] Latif, S. S., “Fault Tolerance in Reversible Logic,” MS thesis, U. Lethbridge, 2018.
[Lin14] Lin, C.-C., A. Chakrabarti, and N. K. Jha, “FTQLS: Fault-Tolerant Quantum Logic Synthesis,”
IEEE Trans. VLSI Systems, Vol. 22, No. 6, pp. 1350-1363, June 2014.
[Mack07] Mack, M. J., W. M. Sauer, S. B. Swaney, and B. G. Mealey, “IBM Power6 Reliability,” IBM J.
Research and Development, Vol. 51, No. 6, pp. 763-, 2007.
[Msad15] Msadek, N., R. Kiefhaber, and T. Ungerer, “A Trustworthy, Fault-Tolerant and Scalable Self-
Configuration Algorithm for Organic Computing Systems,” J. System Architecture, Vol. 61,
No. 10, pp. 511-519, November 2015.
[Neum15] Neumann, P. G. et al., “Inside Risks: Keys under Doormats,” Communications of the ACM,
Vol. 58, No. 10, pp. 24-26, October 2015.
[Parh99] Parhami, B., Introduction to Parallel Processing: Algorithms and Architectures, Plenum, 1999.
[Prad96] Pradhan, D. K., “Case Studies in Fault-Tolerant Multiprocessor and Distributed Systems,”
Chapter 4 in Fault-Tolerant Computer System Design, Prentice–Hall, 1996, pp. 236–281.
[Reic08] Reick, K., P. N. Sanda, S. Swaney, J. W. Kellington, M. Mack, M. Floyd, and D. Henderson,
“Fault-Tolerant Design of the IBM Power6 Microprocessor,” IEEE Micro, Vol. 28, No. 2, pp.
30-38, March-April 2008.
[Renn84] Rennels, D. A., “Fault-Tolerant Computing—Concepts and Examples,” IEEE Trans.
Computers, Vol. 33, No. 12, pp. 1116-1129, December 1984.


[Seid09] Seidel, F., “X-by-Wire,” accessed online on December 2, 2009, at: http://osg.informatik.tu-
chemnitz.de/lehre/old/ws0809/sem/online/x-by-wire.pdf
[Siew95] Siewiorek, D. (ed.), Fault-Tolerant Computing Highlights from 25 Years, Special Volume of
the 25th Int’l Symp. Fault-Tolerant Computing, 1995.
[Sing18] Singh, G., B. Raj, and R. K. Sarin, “Fault-Tolerant Design and Analysis of QCA-Based
Circuits,” IET Circuits, Devices & Systems, Vol. 12, No. 5, pp. 638-644, 2018.
[Svob79] Svoboda, A., “Oral History Interview OH35 by R. Mapstone,” 15 November 1979, Charles
Babbage Institute, University of Minnesota, Minneapolis. Transcripts available at:
http://www.cbi.umn.edu/oh/pdf.phtml?id=263
[Trav04] Traverse, P., I. Lacaze, and J. Souyris, “Airbus Fly-by-Wire: A Total Approach to
Dependability,” in Building the Information Society, ed. by P. Jacquart, pp. 191-212, Springer,
2004.
[Vinh16] Vinh, P. C. and E. Vassev, “Nature-Inspired Computation and Communication: A Formal
Approach,” Future Generation Computer Systems, Vol. 56, pp. 121-123, March 2016.
[Walf15] Walfish, M. and A. J. Blumberg, “Verifying Computations without Reexecuting Them,”
Communications of the ACM, Vol. 58, No. 2, pp. 74-84, February 2015.
[Yang12] Yang, X., Z. Wang, J. Xue, and Y. Zhou, “The Reliability Wall for Exascale Supercomputing,”
IEEE Trans. Computers, Vol. 61, No. 6, pp. 767-779, June 2012.
