2021 - Caps3-6 - Software Architecture in Practice - Bass, Clements, Kazman

Part II: Quality Attributes

3
Understanding Quality Attributes
Quality is never an accident; it is always the result of high intention,
sincere effort, intelligent direction and skillful execution.
—William A. Foster

Many factors determine the qualities that must be provided for in a
system’s architecture. These qualities go beyond functionality, which is
the basic statement of the system’s capabilities, services, and behavior.
Although functionality and other qualities are closely related, as you will
see, functionality often takes the front seat in the development scheme.
This preference is shortsighted, however. Systems are frequently
redesigned not because they are functionally deficient—the replacements
are often functionally identical—but because they are difficult to
maintain, port, or scale; or they are too slow; or they have been
compromised by hackers. In Chapter 2, we said that architecture was the
first place in software creation in which the achievement of quality
requirements could be addressed. It is the mapping of a system’s
functionality onto software structures that determines the architecture’s
support for qualities. In Chapters 4–14, we discuss how various qualities
are supported by architectural design decisions. In Chapter 20, we show
how to integrate all of your drivers, including quality attribute decisions,
into a coherent design.
We have been using the term “quality attribute” loosely, but now it is
time to define it more carefully. A quality attribute (QA) is a measurable
or testable property of a system that is used to indicate how well the
system satisfies the needs of its stakeholders beyond the basic function
of the system. You can think of a quality attribute as measuring the
“utility” of a product along some dimension of interest to a stakeholder.
In this chapter our focus is on understanding the following:

How to express the qualities we want our architecture to exhibit
How to achieve the qualities through architectural means
How to determine the design decisions we might make with respect
to the qualities

This chapter provides the context for the discussions of individual
quality attributes in Chapters 4–14.

3.1 Functionality
Functionality is the ability of the system to do the work for which it was
intended. Of all of the requirements, functionality has the strangest
relationship to architecture.
First of all, functionality does not determine architecture. That is,
given a set of required functionality, there is no end to the architectures
you could create to satisfy that functionality. At the very least, you could
divide up the functionality in any number of ways and assign the sub-
pieces to different architectural elements.
In fact, if functionality were the only thing that mattered, you
wouldn’t have to divide the system into pieces at all: A single monolithic
blob with no internal structure would do just fine. Instead, we design our
systems as structured sets of cooperating architectural elements—
modules, layers, classes, services, databases, apps, threads, peers, tiers,
and on and on—to make them understandable and to support a variety of
other purposes. Those “other purposes” are the other quality attributes
that we’ll examine in the remaining sections of this chapter, and in the
subsequent quality attribute chapters in Part II.
Although functionality is independent of any particular structure, it is
achieved by assigning responsibilities to architectural elements. This
process results in one of the most basic architectural structures—module
decomposition.
Although responsibilities can be allocated arbitrarily to any module,
software architecture constrains this allocation when other quality
attributes are important. For example, systems are frequently (or perhaps
always) divided so that several people can cooperatively build them. The
architect’s interest in functionality is how it interacts with and constrains
other qualities.
Functional Requirements
After more than 30 years of writing about and discussing the
distinction between functional requirements and quality
requirements, the definition of functional requirements still eludes
me. Quality attribute requirements are well defined: Performance
has to do with the system’s timing behavior, modifiability has to do
with the system’s ability to support changes in its behavior or other
qualities after initial deployment, availability has to do with the
system’s ability to survive failures, and so forth.
Function, however, is a much more slippery concept. An
international standard (ISO 25010) defines functional suitability as
“the capability of the software product to provide functions which
meet stated and implied needs when the software is used under
specified conditions.” That is, functionality is the ability to provide
functions. One interpretation of this definition is that functionality
describes what the system does and quality describes how well the
system does its function. That is, qualities are attributes of the
system and function is the purpose of the system.
This distinction breaks down, however, when you consider the
nature of some of the “function.” If the function of the software is
to control engine behavior, how can the function be correctly
implemented without considering timing behavior? Is the ability to
control access by requiring a user name/password combination not
a function, even though it is not the purpose of any system?
I much prefer using the word “responsibility” to describe
computations that a system must perform. Questions such as “What
are the timing constraints on that set of responsibilities?”, “What
modifications are anticipated with respect to that set of
responsibilities?”, and “What class of users is allowed to execute
that set of responsibilities?” make sense and are actionable.
The achievement of qualities induces responsibility; think of the
user name/password example just mentioned. Further, one can
identify responsibilities as being associated with a particular set of
requirements.
So does this mean that the term “functional requirement”
shouldn’t be used? People have an understanding of the term, but
when precision is desired, we should talk about sets of specific
responsibilities instead.
Paul Clements has long ranted against the careless use of the
term “nonfunctional,” and now it’s my turn to rant against the
careless use of the term “functional”—which is probably equally
ineffectual.
—LB

3.2 Quality Attribute Considerations


Just as a system’s functions do not stand on their own without due
consideration of quality attributes, neither do quality attributes stand on
their own; they pertain to the functions of the system. If a functional
requirement is “When the user presses the green button, the Options
dialog appears,” a performance QA annotation might describe how
quickly the dialog will appear; an availability QA annotation might
describe how often this function is allowed to fail, and how quickly it
will be repaired; a usability QA annotation might describe how easy it is
to learn this function.
Quality attributes as a distinct topic have been studied by the software
community at least since the 1970s. A variety of taxonomies and
definitions have been published (we discuss some of these in Chapter
14), many of which have their own research and practitioner
communities. However, there are three problems with most discussions
of system quality attributes:
1. The definitions provided for an attribute are not testable. It is
meaningless to say that a system will be “modifiable.” Every
system will be modifiable with respect to one set of changes and
not modifiable with respect to another. The other quality attributes
are similar in this regard: A system may be robust with respect to
some faults and brittle with respect to others, and so forth.
2. Discussion often focuses on which quality a particular issue
belongs to. Is a denial-of-service attack on a system an aspect of
availability, an aspect of performance, an aspect of security, or an
aspect of usability? All four attribute communities would claim
“ownership” of the denial-of-service attack. All are, to some
extent, correct. But this debate over categorization doesn’t help us,
as architects, understand and create architectural solutions to
actually manage the attributes of concern.
3. Each attribute community has developed its own vocabulary. The
performance community has “events” arriving at a system, the
security community has “attacks” arriving at a system, the
availability community has “faults” arriving, and the usability
community has “user input.” All of these may actually refer to the
same occurrence, but they are described using different terms.
A solution to the first two problems (untestable definitions and
overlapping issues) is to use quality attribute scenarios as a means of
characterizing quality attributes (see Section 3.3). A solution to the third
problem is to illustrate the concepts that are fundamental to that attribute
community in a common form, which we do in Chapters 4–14.
We will focus on two categories of quality attributes. The first
category includes those attributes that describe some property of the
system at runtime, such as availability, performance, or usability. The
second category includes those that describe some property of the
development of the system, such as modifiability, testability, or
deployability.
Quality attributes can never be achieved in isolation. The achievement
of any one will have an effect—sometimes positive and sometimes
negative—on the achievement of others. For example, almost every
quality attribute negatively affects performance. Take portability: The
main technique for achieving portable software is to isolate system
dependencies, which introduces overhead into the system’s execution,
typically as process or procedure boundaries, which then hurts
performance. Determining a design that may satisfy quality attribute
requirements is partially a matter of making the appropriate tradeoffs;
we discuss design in Chapter 21.
In the next three sections, we focus on how quality attributes can be
specified, what architectural decisions will enable the achievement of
particular quality attributes, and what questions about quality attributes
will enable the architect to make the correct design decisions.

3.3 Specifying Quality Attribute Requirements: Quality Attribute Scenarios

We use a common form to specify all QA requirements as scenarios.
This addresses the vocabulary problems we identified previously. The
common form is testable and unambiguous; it is not sensitive to whims
of categorization. Thus it provides regularity in how we treat all quality
attributes.
Quality attribute scenarios have six parts:

Stimulus. We use the term “stimulus” to describe an event arriving at
the system or the project. The stimulus can be an event to the
performance community, a user operation to the usability
community, or an attack to the security community, and so forth. We
use the same term to describe a motivating action for developmental
qualities. Thus a stimulus for modifiability is a request for a
modification; a stimulus for testability is the completion of a unit of
development.
Stimulus source. A stimulus must have a source—it must come from
somewhere. Some entity (a human, a computer system, or any other
actor) must have generated the stimulus. The source of the stimulus
may affect how it is treated by the system. A request from a trusted
user will not undergo the same scrutiny as a request by an untrusted
user.
Response. The response is the activity that occurs as the result of the
arrival of the stimulus. The response is something the architect
undertakes to satisfy. It consists of the responsibilities that the
system (for runtime qualities) or the developers (for development-
time qualities) should perform in response to the stimulus. For
example, in a performance scenario, an event arrives (the stimulus)
and the system should process that event and generate a response. In
a modifiability scenario, a request for a modification arrives (the
stimulus) and the developers should implement the modification—
without side effects—and then test and deploy the modification.
Response measure. When the response occurs, it should be
measurable in some fashion so that the scenario can be tested—that
is, so that we can determine if the architect achieved it. For
performance, this could be a measure of latency or throughput; for
modifiability, it could be the labor or wall clock time required to
make, test, and deploy the modification.
These four characteristics of a scenario are the heart of our quality
attribute specifications. But two more characteristics are important, yet
often overlooked: environment and artifact.

Environment. The environment is the set of circumstances in which
the scenario takes place. Often this refers to a runtime state: The
system may be in an overload condition or in normal operation, or
some other relevant state. For many systems, “normal” operation can
refer to one of a number of modes. For these kinds of systems, the
environment should specify in which mode the system is executing.
But the environment can also refer to states in which the system is
not running at all: when it is in development, or testing, or refreshing
its data, or recharging its battery between runs. The environment sets
the context for the rest of the scenario. For example, a request for a
modification that arrives after the code has been frozen for a release
may be treated differently than one that arrives before the freeze.
The fifth successive failure of a component may be treated
differently than the first failure of that component.
Artifact. The stimulus arrives at some target. This is often captured
as just the system or project itself, but it’s helpful to be more precise
if possible. The artifact may be a collection of systems, the whole
system, or one or more pieces of the system. A failure or a change
request may affect just a small portion of the system. A failure in a
data store may be treated differently than a failure in the metadata
store. Modifications to the user interface may have faster response
times than modifications to the middleware.

To summarize, we capture quality attribute requirements as six-part
scenarios. While it is common to omit one or more of these six parts,
particularly in the early stages of thinking about quality attributes,
knowing that all of the parts are there forces the architect to consider
whether each part is relevant.
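The six parts just described can be captured in a simple record type. The sketch below is a minimal illustration in Python; the field values of the example (an availability scenario in the spirit of Figure 3.2) are invented for illustration and are not taken from the book.

```python
from dataclasses import dataclass

@dataclass
class QualityAttributeScenario:
    """A six-part quality attribute scenario."""
    source: str            # the entity that generates the stimulus
    stimulus: str          # the arriving event, request, or condition
    artifact: str          # the part of the system the stimulus targets
    environment: str       # the circumstances under which it arrives
    response: str          # the activity the system should perform
    response_measure: str  # how achievement of the response is tested

# An invented concrete availability scenario:
server_crash = QualityAttributeScenario(
    source="heartbeat monitor",
    stimulus="server stops responding",
    artifact="order-processing service",
    environment="normal operation",
    response="requests are rerouted to a standby replica",
    response_measure="no more than 30 seconds of cumulative downtime",
)
```

Making all six fields mandatory mirrors the point above: the writer is forced to consider whether each part is relevant before the scenario is considered complete.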
We have created a general scenario for each of the quality attributes
presented in Chapters 4–13 to facilitate brainstorming and elicitation of
concrete scenarios. We distinguish general quality attribute scenarios—
general scenarios—which are system independent and can pertain to any
system, from concrete quality attribute scenarios—concrete scenarios—
which are specific to the particular system under consideration.
To translate these generic attribute characterizations into requirements
for a particular system, the general scenarios need to be made system
specific. But, as we have found, it is much easier for a stakeholder to
tailor a general scenario into one that fits their system than it is for them
to generate a scenario from thin air.
Figure 3.1 shows the parts of a quality attribute scenario just
discussed. Figure 3.2 shows an example of a general scenario, in this
instance for availability.

Figure 3.1 The parts of a quality attribute scenario

Figure 3.2 A general scenario for availability

Not My Problem
Some time ago I was doing an architecture analysis on a complex
system created by and for Lawrence Livermore National
Laboratory. If you visit this organization’s website (llnl.gov) and try
to figure out what Livermore Labs does, you will see the word
“security” mentioned over and over. The lab focuses on nuclear
security, international and domestic security, and environmental and
energy security. Serious stuff . . .
Keeping this emphasis in mind, I asked my clients to describe
the quality attributes of concern for the system that I was
analyzing. I’m sure you can imagine my surprise when security
wasn’t mentioned once! The system stakeholders mentioned
performance, modifiability, evolvability, interoperability,
configurability, and portability, and one or two more, but the word
“security” never passed their lips.
Being a good analyst, I questioned this seemingly shocking and
obvious omission. Their answer was simple and, in retrospect,
straightforward: “We don’t care about it. Our systems are not
connected to any external network, and we have barbed-wire
fences and guards with machine guns.”
Of course, someone at Livermore Labs was very interested in
security. But not the software architects. The lesson here is that the
software architect may not bear the responsibility for every QA
requirement.
—RK

3.4 Achieving Quality Attributes through Architectural Patterns and Tactics
We now turn to the techniques an architect can use to achieve the
required quality attributes: architectural patterns and tactics.
A tactic is a design decision that influences the achievement of a
quality attribute response—it directly affects the system’s response to
some stimulus. Tactics may impart portability to one design, high
performance to another, and integrability to a third.
An architectural pattern describes a particular recurring design
problem that arises in specific design contexts and presents a well-
proven architectural solution for the problem. The solution is specified
by describing the roles of its constituent elements, their responsibilities
and relationships, and the ways in which they collaborate. Like the
choice of tactics, the choice of an architectural pattern has a profound
effect on quality attributes—usually more than one.
Patterns typically comprise multiple design decisions and, in fact,
often comprise multiple quality attribute tactics. We say that patterns
often bundle tactics and, consequently, frequently make tradeoffs among
quality attributes.
We will look at example relationships between tactics and patterns in
each of our quality attribute–specific chapters. Chapter 14 explains how
a set of tactics for any quality attribute can be constructed; those tactics
are, in fact, the steps we used to produce the sets found in this book.
While we discuss patterns and tactics as though they were
foundational design decisions, the reality is that architectures often
emerge and evolve as a result of many small decisions and business
forces. For example, a system that was once tolerably modifiable may
deteriorate over time, through the actions of developers adding features
and fixing bugs. Similarly, a system’s performance, availability, security,
and any other quality may (and typically does) deteriorate over time,
again through the well-intentioned actions of programmers who are
focused on their immediate tasks and not on preserving architectural
integrity.
This “death by a thousand cuts” is common on software projects.
Developers may make suboptimal decisions due to a lack of
understanding of the structures of the system, schedule pressures, or
perhaps a lack of clarity in the architecture from the start. This kind of
deterioration is a form of technical debt known as architecture debt. We
discuss architecture debt in Chapter 23. To reverse this debt, we typically
refactor.
Refactoring may be done for many reasons. For example, you might
refactor a system to improve its security, placing different modules into
different subsystems based on their security properties. Or you might
refactor a system to improve its performance, removing bottlenecks and
rewriting slow portions of the code. Or you might refactor to improve
the system’s modifiability. For example, when two modules are affected
by the same kinds of changes over and over because they are (at least
partial) duplicates of each other, the common functionality could be
factored out into its own module, thereby improving cohesion and
reducing the number of places that need to be changed when the next
(similar) change request arrives.
Code refactoring is a mainstay practice of agile development projects,
as a cleanup step to make sure that teams have not produced duplicative
or overly complex code. However, the concept applies to architectural
elements as well.
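The factor-out refactoring described above can be shown in miniature. This is a toy sketch in Python; the email-validation rule and the function names are invented solely to illustrate moving a duplicated responsibility into a single module.

```python
# Before: two functions each embed the same validation rule, so any
# change to the rule must be made (and tested) in two places.

def register_user_v1(email: str) -> bool:
    return "@" in email and len(email) <= 254   # duplicated rule

def invite_user_v1(email: str) -> bool:
    return "@" in email and len(email) <= 254   # duplicated rule

# After: the common responsibility is factored into one place,
# improving cohesion and localizing the next (similar) change.

def is_valid_email(email: str) -> bool:
    """Single home for the email-validation responsibility."""
    return "@" in email and len(email) <= 254

def register_user(email: str) -> bool:
    return is_valid_email(email)

def invite_user(email: str) -> bool:
    return is_valid_email(email)
```

When the next change request arrives (say, a stricter validation rule), only `is_valid_email` needs to be modified and retested.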
Successfully achieving quality attributes often involves process-
related decisions, in addition to architecture-related decisions. For
example, a great security architecture is worthless if your employees are
susceptible to phishing attacks or do not choose strong passwords. We
are not dealing with the process aspects in this book, but be aware that
they are important.

3.5 Designing with Tactics


A system design consists of a collection of decisions. Some of these
decisions help control the quality attribute responses; others ensure
achievement of system functionality. We depict this relationship in
Figure 3.3. Tactics, like patterns, are design techniques that architects
have been using for years. In this book, we isolate, catalog, and describe
them. We are not inventing tactics here, but rather just capturing what
good architects do in practice.

Figure 3.3 Tactics are intended to control responses to stimuli.

Why do we focus on tactics? There are three reasons:


1. Patterns are foundational for many architectures, but sometimes
there may be no pattern that solves your problem completely. For
example, you might need the high-availability high-security broker
pattern, not the textbook broker pattern. Architects frequently need
to modify and adapt patterns to their particular context, and tactics
provide a systematic means for augmenting an existing pattern to
fill the gaps.
2. If no pattern exists to realize the architect’s design goal, tactics
allow the architect to construct a design fragment from “first
principles.” Tactics give the architect insight into the properties of
the resulting design fragment.
3. Tactics provide a way of making design and analysis more
systematic within some limitations. We’ll explore this idea in the
next section.
Like any design concept, the tactics that we present here can and
should be refined as they are applied to design a system. Consider
performance: Schedule resources is a common performance tactic. But
this tactic needs to be refined into a specific scheduling strategy, such as
shortest-job-first, round-robin, and so forth, for specific purposes. Use
an intermediary is a modifiability tactic. But there are multiple types of
intermediaries (layers, brokers, proxies, and tiers, to name just a few),
which are realized in different ways. Thus a designer will employ
refinements to make each tactic concrete.
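As one illustration of refining the schedule resources tactic into a specific strategy, here is a sketch of shortest-job-first in Python; the job names and cost estimates are invented.

```python
import heapq

def shortest_job_first(jobs):
    """Refinement of 'schedule resources': always run the pending job
    with the smallest estimated cost first."""
    # Build a min-heap keyed on estimated cost.
    heap = [(cost, name) for name, cost in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _cost, name = heapq.heappop(heap)
        order.append(name)
    return order

jobs = [("render report", 40), ("health check", 1), ("resize image", 5)]
print(shortest_job_first(jobs))
# ['health check', 'resize image', 'render report']
```

A different context might instead refine the same tactic into round-robin or priority-based scheduling; the tactic names the decision, and the refinement makes it concrete.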
In addition, the application of a tactic depends on the context. Again,
consider performance: Manage sampling rate is relevant in some real-
time systems but not in all real-time systems, and certainly not in
database systems or stock-trading systems where losing a single event is
highly problematic.
Note that there are some “super-tactics”—tactics that are so
fundamental and so pervasive that they deserve special mention. For
example, the modifiability tactics of encapsulation, restricting
dependencies, using an intermediary, and abstracting common services
are found in the realization of almost every pattern ever! But other
tactics, such as the scheduling tactic from performance, also appear in
many places. For example, a load balancer is an intermediary that does
scheduling. We see monitoring appearing in many quality attributes: We
monitor aspects of a system to achieve energy efficiency, performance,
availability, and safety. Thus we should not expect a tactic to live in only
one place, for just a single quality attribute. Tactics are design primitives
and, as such, are found over and over in different aspects of design. This
is actually an argument for why tactics are so powerful and deserving of
our attention—and yours. Get to know them; they’ll be your friends.

3.6 Analyzing Quality Attribute Design Decisions: Tactics-Based Questionnaires

In this section, we introduce a tool the analyst can use to understand
potential quality attribute behavior at various stages through the
architecture’s design: tactics-based questionnaires.
Analyzing how well quality attributes have been achieved is a critical
part of the task of designing an architecture. And (no surprise) you
shouldn’t wait until your design is complete before you begin to do it.
Opportunities for quality attribute analysis crop up at many different
points in the software development life cycle, even very early ones.
At any point, the analyst (who might be the architect) needs to
respond appropriately to whatever artifacts have been made available for
analysis. The accuracy of the analysis and expected degree of confidence
in the analysis results will vary according to the maturity of the available
artifacts. But no matter the state of the design, we have found tactics-
based questionnaires to be helpful in gaining insights into the
architecture’s ability (or likely ability, as it is refined) to provide the
needed quality attributes.
In Chapters 4–13, we include a tactics-based questionnaire for each
quality attribute covered in the chapters. For each question in the
questionnaire, the analyst records the following information:

Whether each tactic is supported by the system’s architecture.
Whether there are any obvious risks in the use (or nonuse) of this
tactic. If the tactic has been used, record how it is realized in the
system, or how it is intended to be realized (e.g., via custom code,
generic frameworks, or externally produced components).
The specific design decisions made to realize the tactic and where in
the code base the implementation (realization) may be found. This is
useful for auditing and architecture reconstruction purposes.
Any rationale or assumptions made in the realization of this tactic.

To use these questionnaires, simply follow these four steps:


1. For each tactics question, fill the “Supported” column with “Y” if
the tactic is supported in the architecture and with “N” otherwise.
2. If the answer in the “Supported” column is “Y,” then in the
“Design Decisions and Location” column describe the specific
design decisions made to support the tactic and enumerate where
these decisions are, or will be, manifested (located) in the
architecture. For example, indicate which code modules,
frameworks, or packages implement this tactic.
3. In the “Risk” column indicate the risk of implementing the tactic
using a (H = High, M = Medium, L = Low) scale.
4. In the “Rationale” column, describe the rationale for the design
decisions made (including a decision to not use this tactic). Briefly
explain the implications of this decision. For example, explain the
rationale and implications of the decision in terms of the effort on
cost, schedule, evolution, and so forth.
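The four steps might be recorded as one table row per tactic. The sketch below shows one hypothetical way to represent such a row in Python; the tactic, locations, risk rating, and rationale are invented examples, not values prescribed by the method.

```python
# Hypothetical questionnaire row; the keys mirror the four steps above.
row = {
    "tactic": "Use an intermediary",
    "supported": "Y",                    # step 1: Y/N
    "design_decisions_and_location": (   # step 2
        "All client traffic passes through a message broker; "
        "see the (hypothetical) gateway/ and broker/ modules"
    ),
    "risk": "M",                         # step 3: H/M/L
    "rationale": (                       # step 4
        "Decouples clients from services at the cost of added latency"
    ),
}

def needs_attention(r):
    """Flag rows an analyst should revisit: unsupported tactics and
    high-risk realizations."""
    return r["supported"] == "N" or r["risk"] == "H"
```

Sorting or filtering rows with a helper like `needs_attention` is one way the completed questionnaire feeds back into design review.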
While this questionnaire-based approach might sound simplistic, it
can actually be very powerful and insightful. Addressing the set of
questions forces the architect to take a step back and consider the bigger
picture. This process can also be quite efficient: A typical questionnaire
for a single quality attribute takes between 30 and 90 minutes to
complete.

3.7 Summary
Functional requirements are satisfied by including an appropriate set of
responsibilities within the design. Quality attribute requirements are
satisfied by the structures and behaviors of the architecture.
One challenge in architectural design is that these requirements are
often captured poorly, if at all. To capture and express a quality attribute
requirement, we recommend the use of a quality attribute scenario. Each
scenario consists of six parts:
1. Source of stimulus
2. Stimulus
3. Environment
4. Artifact
5. Response
6. Response measure
An architectural tactic is a design decision that affects a quality
attribute response. The focus of a tactic is on a single quality attribute
response. An architectural pattern describes a particular recurring design
problem that arises in specific design contexts and presents a well-
proven architectural solution for the problem. Architectural patterns can
be seen as “bundles” of tactics.
An analyst can understand the decisions made in an architecture
through the use of a tactics-based checklist. This lightweight architecture
analysis technique can provide insights into the strengths and
weaknesses of the architecture in a very short amount of time.

3.8 For Further Reading


Some extended case studies showing how tactics and patterns are used in
design can be found in [Cervantes 16].
A substantial catalog of architectural patterns can be found in the five-
volume set Pattern-Oriented Software Architecture, by Frank
Buschmann et al.
Arguments showing that many different architectures can provide the
same functionality—that is, that architecture and functionality are
largely orthogonal—can be found in [Shaw 95].

3.9 Discussion Questions


1. What is the relationship between a use case and a quality attribute
scenario? If you wanted to add quality attribute information to a use
case, how would you do it?
2. Do you suppose that the set of tactics for a quality attribute is finite
or infinite? Why?
3. Enumerate the set of responsibilities that an automatic teller
machine should support and propose a design to accommodate that
set of responsibilities. Justify your proposal.
4. Choose an architecture that you are familiar with (or choose the
ATM architecture you defined in question 3) and walk through the
performance tactics questionnaire (found in Chapter 9). What
insight did these questions provide into the design decisions made
(or not made)?
4
Availability
Technology does not always rhyme
with perfection and reliability.
Far from it in reality!
—Jean-Michel Jarre

Availability refers to a property of software—namely, that it is there and
ready to carry out its task when you need it to be. This is a broad
perspective and encompasses what is normally called reliability
(although it may encompass additional considerations such as downtime
due to periodic maintenance). Availability builds on the concept of
reliability by adding the notion of recovery—that is, when the system
breaks, it repairs itself. Repair may be accomplished by various means,
as we’ll see in this chapter.
Availability also encompasses the ability of a system to mask or repair
faults such that they do not become failures, thereby ensuring that the
cumulative service outage period does not exceed a required value over
a specified time interval. This definition subsumes concepts of
reliability, robustness, and any other quality attribute that involves a
concept of unacceptable failure.
A failure is the deviation of the system from its specification, where
that deviation is externally visible. Determining that a failure has
occurred requires some external observer in the environment.
A failure’s cause is called a fault. A fault can be either internal or
external to the system under consideration. Intermediate states between
the occurrence of a fault and the occurrence of a failure are called errors.
Faults can be prevented, tolerated, removed, or forecast. Through these
actions, a system becomes “resilient” to faults. Among the areas with
which we are concerned are how system faults are detected, how
frequently system faults may occur, what happens when a fault occurs,
how long a system is allowed to be out of operation, when faults or
failures may occur safely, how faults or failures can be prevented, and
what kinds of notifications are required when a failure occurs.
Availability is closely related to, but clearly distinct from, security. A
denial-of-service attack is explicitly designed to make a system fail—
that is, to make it unavailable. Availability is also closely related to
performance, since it may be difficult to tell when a system has failed
and when it is simply being egregiously slow to respond. Finally,
availability is closely allied with safety, which is concerned with keeping
the system from entering a hazardous state and recovering or limiting the
damage when it does.
One of the most demanding tasks in building a high-availability fault-
tolerant system is to understand the nature of the failures that can arise
during operation. Once those are understood, mitigation strategies can be
designed into the system.
Since a system failure is observable by users, the time to repair is the
time until the failure is no longer observable. This may be an
imperceptible delay in a user’s response time or it may be the time it
takes someone to fly to a remote location in the Andes to repair a piece
of mining machinery (as was recounted to us by a person responsible for
repairing the software in a mining machine engine). The notion of
“observability” is critical here: If a failure could have been observed,
then it is a failure, whether or not it was actually observed.
In addition, we are often concerned with the level of capability that
remains when a failure has occurred—a degraded operating mode.
Distinguishing between faults and failures allows us to discuss repair
strategies. If code containing a fault is executed but the system is able to
recover from the fault without any observable deviation from the
otherwise specified behavior, we say that no failure has occurred.
The availability of a system can be measured as the probability that it
will provide the specified services within the required bounds over a
specified time interval. A well-known expression is used to derive
steady-state availability (which came from the world of hardware):
MTBF / (MTBF + MTTR)
where MTBF refers to the mean time between failures and MTTR refers
to the mean time to repair. In the software world, this formula should be
interpreted to mean that when thinking about availability, you should
think about what will make your system fail, how likely it is that such an
event will occur, and how much time will be required to repair it.
From this formula, it is possible to calculate probabilities and make
claims like “the system exhibits 99.999 percent availability” or “there is
a 0.001 percent probability that the system will not be operational when
needed.” Scheduled downtimes (when the system is intentionally taken
out of service) should not be considered when calculating availability,
since the system is deemed “not needed” then; of course, this is
dependent on the specific requirements for the system, which are often
encoded in a service level agreement (SLA). This may lead to seemingly
odd situations where the system is down and users are waiting for it, but
the downtime is scheduled and so is not counted against any availability
requirements.
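As an illustration (the code and names here are ours, not part of the original text), the steady-state availability expression can be evaluated directly:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails, on average, every 1,000 hours and takes
# 1 hour to repair is available roughly 99.9 percent of the time:
availability = steady_state_availability(mtbf_hours=1000, mttr_hours=1)
```

Note how the formula rewards both making failures rare (raising MTBF) and making repair fast (lowering MTTR).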
Detected faults can be categorized prior to being reported and
repaired. This categorization is commonly based on the fault’s severity
(critical, major, or minor) and service impact (service-affecting or non-
service-affecting). It provides the system operator with a timely and
accurate system status and allows for an appropriate repair strategy to be
employed. The repair strategy may be automated or may require manual
intervention.
As just mentioned, the availability expected of a system or service is
frequently expressed as an SLA. The SLA specifies the availability level
that is guaranteed and, usually, the penalties that the provider will suffer
if the SLA is violated. For example, Amazon provides the following
SLA for its EC2 cloud service:
AWS will use commercially reasonable efforts to make the Included
Services each available for each AWS region with a Monthly Uptime
Percentage of at least 99.99%, in each case during any monthly billing
cycle (the “Service Commitment”). In the event any of the Included
Services do not meet the Service Commitment, you will be eligible to
receive a Service Credit as described below.
Table 4.1 provides examples of system availability requirements and
associated threshold values for acceptable system downtime, measured
over observation periods of 90 days and one year. The term high
availability typically refers to designs targeting availability of 99.999
percent (“5 nines”) or greater. As mentioned earlier, only unscheduled
outages contribute to system downtime.
Table 4.1 System Availability Requirements

Availability    Downtime/90 Days      Downtime/Year
99.0%           21 hr, 36 min         3 days, 15.6 hr
99.9%           2 hr, 10 min          8 hr, 45 min, 36 sec
99.99%          12 min, 58 sec        52 min, 34 sec
99.999%         1 min, 18 sec         5 min, 15 sec
99.9999%        8 sec                 32 sec
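The downtime thresholds in Table 4.1 follow from simple arithmetic; here is a small helper (ours, for illustration) that reproduces them:

```python
HOURS_PER_YEAR = 365 * 24      # non-leap year
HOURS_PER_90_DAYS = 90 * 24

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Maximum unscheduled downtime allowed over a period, in hours."""
    return (1 - availability_pct / 100.0) * period_hours

# "Five nines" over 90 days allows about 78 seconds of downtime,
# i.e., roughly 1 minute, 18 seconds:
five_nines_secs = downtime_hours(99.999, HOURS_PER_90_DAYS) * 3600
```

Each additional "nine" cuts the allowed downtime by a factor of ten.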

4.1 Availability General Scenario


We can now describe the individual portions of an availability general
scenario as summarized in Table 4.2.
Table 4.2 Availability General Scenario

Source: This specifies where the fault comes from.
  Possible values: Internal/external: people, hardware, software, physical
  infrastructure, physical environment

Stimulus: The stimulus to an availability scenario is a fault.
  Possible values: Fault: omission, crash, incorrect timing, incorrect
  response

Artifact: This specifies which portions of the system are responsible for
and affected by the fault.
  Possible values: Processors, communication channels, storage, processes,
  affected artifacts in the system’s environment

Environment: We may be interested in not only how a system behaves in its
“normal” environment, but also how it behaves in situations such as when it
is already recovering from a fault.
  Possible values: Normal operation, startup, shutdown, repair mode,
  degraded operation, overloaded operation

Response: The most commonly desired response is to prevent the fault from
becoming a failure, but other responses may also be important, such as
notifying people or logging the fault for later analysis. This section
specifies the desired system response.
  Possible values:
  - Prevent the fault from becoming a failure
  - Detect the fault:
    - Log the fault
    - Notify the appropriate entities (people or systems)
  - Recover from the fault:
    - Disable the source of events causing the fault
    - Be temporarily unavailable while a repair is being effected
    - Fix or mask the fault/failure or contain the damage it causes
    - Operate in a degraded mode while a repair is being effected

Response measure: We may focus on a number of measures of availability,
depending on the criticality of the service being provided.
  Possible values:
  - Time or time interval when the system must be available
  - Availability percentage (e.g., 99.999 percent)
  - Time to detect the fault
  - Time to repair the fault
  - Time or time interval in which the system can be in degraded mode
  - Proportion (e.g., 99 percent) or rate (e.g., up to 100 per second) of a
    certain class of faults that the system prevents, or handles without
    failing

An example concrete availability scenario derived from the general
scenario in Table 4.2 is shown in Figure 4.1. The scenario is this: A
server in a server farm fails during normal operation, and the system
informs the operator and continues to operate with no downtime.
Figure 4.1 Sample concrete availability scenario

4.2 Tactics for Availability


A failure occurs when the system no longer delivers a service that is
consistent with its specification and this failure is observable by the
system’s actors. A fault (or combination of faults) has the potential to
cause a failure. Availability tactics, in turn, are designed to enable a
system to prevent or endure system faults so that a service being
delivered by the system remains compliant with its specification. The
tactics we discuss in this section will keep faults from becoming failures
or at least bound the effects of the fault and make repair possible, as
illustrated in Figure 4.2.

Figure 4.2 Goal of availability tactics


Availability tactics have one of three purposes: fault detection, fault
recovery, or fault prevention. The tactics for availability are shown in
Figure 4.3. These tactics will often be provided by a software
infrastructure, such as a middleware package, so your job as an architect
may be choosing and assessing (rather than implementing) the right
availability tactics and the right combination of tactics.

Figure 4.3 Availability tactics

Detect Faults
Before any system can take action regarding a fault, the presence of the
fault must be detected or anticipated. Tactics in this category include:
Monitor. This component is used to monitor the state of health of
various other parts of the system: processors, processes, I/O,
memory, and so forth. A system monitor can detect failure or
congestion in the network or other shared resources, such as from a
denial-of-service attack. It orchestrates software using other tactics
in this category to detect malfunctioning components. For example,
the system monitor can initiate self-tests, or be the component that
detects faulty timestamps or missed heartbeats.1
1. When the detection mechanism is implemented using a counter
or timer that is periodically reset, this specialization of the
system monitor is referred to as a watchdog. During nominal
operation, the process being monitored will periodically reset the
watchdog counter/timer as part of its signal that it’s working
correctly; this is sometimes referred to as “petting the watchdog.”
Ping/echo. In this tactic, an asynchronous request/response message
pair is exchanged between nodes; it is used to determine reachability
and the round-trip delay through the associated network path. In
addition, the echo indicates that the pinged component is alive. The
ping is often sent by a system monitor. Ping/echo requires a time
threshold to be set; this threshold tells the pinging component how
long to wait for the echo before considering the pinged component
to have failed (“timed out”). Standard implementations of ping/echo
are available for nodes interconnected via Internet Protocol (IP).
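A minimal sketch of ping/echo, assuming a TCP connection attempt serves as the ping and an accepted connection as the echo (the function and its name are illustrative, not a standard implementation):

```python
import socket
import time
from typing import Optional

def ping(host: str, port: int, timeout_s: float = 1.0) -> Optional[float]:
    """Send a connection probe and measure the round-trip delay.

    Returns the delay in seconds, or None if the pinged component did
    not respond within the threshold (it has "timed out")."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None
```

A system monitor would typically call such a probe periodically and treat a run of None results as a detected fault.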
Heartbeat. This fault detection mechanism employs a periodic
message exchange between a system monitor and a process being
monitored. A special case of heartbeat is when the process being
monitored periodically resets the watchdog timer in its monitor to
prevent it from expiring and thus signaling a fault. For systems
where scalability is a concern, transport and processing overhead can
be reduced by piggybacking heartbeat messages onto other control
messages being exchanged. The difference between heartbeat and
ping/echo lies in who holds the responsibility for initiating the health
check—the monitor or the component itself.
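A watchdog, the counter/timer specialization described in the footnote above, can be sketched in a few lines (the class and method names are ours):

```python
import threading
import time

class Watchdog:
    """Monitor-side watchdog timer: the monitored process must call
    pet() ("petting the watchdog") before the timeout expires, or the
    watchdog declares a fault."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def pet(self) -> None:
        """Called periodically by the monitored process."""
        with self._lock:
            self._last_beat = time.monotonic()

    def expired(self) -> bool:
        """Polled by the system monitor; True signals a fault."""
        with self._lock:
            return time.monotonic() - self._last_beat > self.timeout_s
```

Note the inversion of responsibility relative to ping/echo: here the monitored process initiates the health signal.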
Timestamp. This tactic is used to detect incorrect sequences of
events, primarily in distributed message-passing systems. A
timestamp of an event can be established by assigning the state of a
local clock to the event immediately after the event occurs. Sequence
numbers can also be used for this purpose, since timestamps in a
distributed system may be inconsistent across different processors.
See Chapter 17 for a fuller discussion of the topic of time in a
distributed system.
Condition monitoring. This tactic involves checking conditions in a
process or device, or validating assumptions made during the design.
By monitoring conditions, this tactic prevents a system from
producing faulty behavior. The computation of checksums is a
common example of this tactic. However, the monitor must itself be
simple (and, ideally, provably correct) to ensure that it does not
introduce new software errors.
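A checksum-based condition monitor can be as simple as the following sketch (using CRC-32 for illustration; the function names are ours):

```python
import zlib

def store(payload: bytes) -> tuple:
    """Store data together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)

def check(payload: bytes, checksum: int) -> bool:
    """Condition monitor: recompute the checksum and compare.

    A mismatch signals a fault before it can propagate as a failure."""
    return zlib.crc32(payload) == checksum

data, crc = store(b"sensor reading: 42")
assert check(data, crc)
assert not check(b"sensor reading: 43", crc)  # corruption detected
```

The monitor itself is a one-line comparison, in keeping with the requirement that it be simple enough not to introduce errors of its own.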
Sanity checking. This tactic checks the validity or reasonableness of
specific operations or outputs of a component. It is typically based
on a knowledge of the internal design, the state of the system, or the
nature of the information under scrutiny. It is most often employed at
interfaces, to examine a specific information flow.
Voting. Voting involves comparing computational results from
multiple sources that should be producing the same results and, if
they are not, deciding which results to use. This tactic depends
critically on the voting logic, which is usually realized as a simple,
rigorously reviewed, and tested singleton so that the probability of
error is low. Voting also depends critically on having multiple
sources to evaluate. Typical schemes include the following:
Replication is the simplest form of voting; here, the components
are exact clones of each other. Having multiple copies of
identical components can be effective in protecting against
random failures of hardware but cannot protect against design or
implementation errors, in hardware or software, since there is no
form of diversity embedded in this tactic.
Functional redundancy, in contrast, is intended to address the
issue of common-mode failures (where replicas exhibit the same
fault at the same time because they share the same
implementation) in hardware or software components, by
implementing design diversity. This tactic attempts to deal with
the systematic nature of design faults by adding diversity to
redundancy. The outputs of functionally redundant components
should be the same given the same input. The functional
redundancy tactic is still vulnerable to specification errors—and,
of course, functional replicas will be more expensive to develop
and verify.
Analytic redundancy permits not only diversity among
components’ private sides, but also diversity among the
components’ inputs and outputs. This tactic is intended to
tolerate specification errors by using separate requirement
specifications. In embedded systems, analytic redundancy helps
when some input sources are likely to be unavailable at times.
For example, avionics programs have multiple ways to compute
aircraft altitude, such as using barometric pressure, with the
radar altimeter, and geometrically using the straight-line distance
and look-down angle of a point ahead on the ground. The voter
mechanism used with analytic redundancy needs to be more
sophisticated than just letting majority rule or computing a
simple average. It may have to understand which sensors are
currently reliable (or not), and it may be asked to produce a
higher-fidelity value than any individual component can, by
blending and smoothing individual values over time.
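The replication form of voting reduces to a majority function; here is a sketch (ours) that refuses to pick a winner when no strict majority exists:

```python
from collections import Counter
from typing import Optional, Sequence

def majority_vote(results: Sequence) -> Optional[object]:
    """Replication-style voter: return the value produced by a strict
    majority of sources, or None if no majority exists (e.g., all
    three replicas disagree), so the fault can be escalated."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

assert majority_vote([42, 42, 41]) == 42   # one faulty replica outvoted
assert majority_vote([1, 2, 3]) is None    # no majority: escalate
```

As the text notes, a voter for analytic redundancy would need far more sophistication than this, since its inputs are not expected to be identical.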
Exception detection. This tactic focuses on the detection of a system
condition that alters the normal flow of execution. It can be further
refined as follows:
System exceptions will vary according to the processor hardware
architecture employed. They include faults such as divide by
zero, bus and address faults, illegal program instructions, and so
forth.
The parameter fence tactic incorporates a known data pattern
(such as 0xDEADBEEF) placed immediately after any variable-
length parameters of an object. This allows for runtime detection
of overwriting the memory allocated for the object’s variable-
length parameters.
Parameter typing employs a base class that defines functions
that add, find, and iterate over type-length-value (TLV)
formatted message parameters. Derived classes use the base
class functions to provide functions to build and parse messages.
Use of parameter typing ensures that the sender and the receiver
of messages agree on the type of the content, and detects cases
where they don’t.
Timeout is a tactic that raises an exception when a component
detects that it or another component has failed to meet its timing
constraints. For example, a component awaiting a response from
another component can raise an exception if the wait time
exceeds a certain value.
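A timeout can be realized by running the monitored operation on a worker thread and bounding the wait (a sketch; `call_with_timeout` is our own name, not a standard API):

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s: float):
    """Raise an exception if fn fails to meet its timing constraint."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        # Raises concurrent.futures.TimeoutError if the wait expires:
        return future.result(timeout=timeout_s)

assert call_with_timeout(lambda: "fast", 1.0) == "fast"
```

The raised exception is then handled by whatever exception-handling tactic the system employs.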
Self-test. Components (or, more likely, whole subsystems) can run
procedures to test themselves for correct operation. Self-test
procedures can be initiated by the component itself or invoked from
time to time by a system monitor. These may involve employing
some of the techniques found in condition monitoring, such as
checksums.

Recover from Faults


Recover from faults tactics are refined into preparation and repair tactics
and reintroduction tactics. The latter are concerned with reintroducing a
failed (but rehabilitated) component back into normal operation.
Preparation and repair tactics are based on a variety of combinations
of retrying a computation or introducing redundancy:

Redundant spare. This tactic refers to a configuration in which one
or more duplicate components can step in and take over the work if
the primary component fails. This tactic is at the heart of the hot
spare, warm spare, and cold spare patterns, which differ primarily in
how up-to-date the backup component is at the time of its takeover.
Rollback. A rollback permits the system to revert to a previous
known good state (referred to as the “rollback line”)—rolling back
time—upon the detection of a failure. Once the good state is
reached, then execution can continue. This tactic is often combined
with the transactions tactic and the redundant spare tactic so that
after a rollback has occurred, a standby version of the failed
component is promoted to active status. Rollback depends on a copy
of a previous good state (a checkpoint) being available to the
components that are rolling back. Checkpoints can be stored in a
fixed location and updated at regular intervals, or at convenient or
significant times in the processing, such as at the completion of a
complex operation.
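A rollback mechanism needs only a saved copy of the last known good state; a minimal sketch (the class name is ours):

```python
import copy

class CheckpointedStore:
    """Keeps a copy of the last known good state (the "rollback line")
    so that the system can revert to it when a fault is detected."""

    def __init__(self, initial_state: dict):
        self.state = dict(initial_state)
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Record the current state as known-good, e.g., after a
        complex operation completes successfully."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Revert to the checkpoint upon detection of a failure."""
        self.state = copy.deepcopy(self._checkpoint)

store = CheckpointedStore({"balance": 100})
store.state["balance"] = 120
store.checkpoint()              # 120 is now the rollback line
store.state["balance"] = -999   # a faulty update occurs
store.rollback()
assert store.state["balance"] == 120
```

In a real system the checkpoint would live in durable storage accessible to the standby component being promoted.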
Exception handling. Once an exception has been detected, the
system will handle it in some fashion. The easiest thing it can do is
simply to crash—but, of course, that’s a terrible idea from the point
of availability, usability, testability, and plain good sense. There are
much more productive possibilities. The mechanism employed for
exception handling depends largely on the programming
environment employed, ranging from simple function return codes
(error codes) to the use of exception classes that contain information
helpful in fault correlation, such as the name of the exception, the
origin of the exception, and the cause of the exception. Software can
then use this information to mask or repair the fault.
Software upgrade. The goal of this tactic is to achieve in-service
upgrades to executable code images in a non-service-affecting
manner. Strategies include the following:
Function patch. This kind of patch, which is used in procedural
programming, employs an incremental linker/loader to store an
updated software function into a pre-allocated segment of target
memory. The new version of the software function will employ
the entry and exit points of the deprecated function.
Class patch. This kind of upgrade is applicable for targets
executing object-oriented code, where the class definitions
include a backdoor mechanism that enables the runtime addition
of member data and functions.
Hitless in-service software upgrade (ISSU). This leverages the
redundant spare tactic to achieve non-service-affecting upgrades
to software and associated schema.
In practice, the function patch and class patch are used to deliver bug
fixes, while the hitless ISSU is used to deliver new features and
capabilities.
Retry. The retry tactic assumes that the fault that caused a failure is
transient, and that retrying the operation may lead to success. It is
used in networks and in server farms where failures are expected and
common. A limit should be placed on the number of retries that are
attempted before a permanent failure is declared.
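A retry loop with a bounded attempt count might look like this sketch (treating OSError as the transient fault class, purely for illustration):

```python
import time

def retry(operation, max_attempts: int = 3, delay_s: float = 0.1):
    """Retry an operation whose failures may be transient; declare a
    permanent failure (re-raise) once the retry limit is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise            # permanent failure declared
            time.sleep(delay_s)  # brief pause before retrying

# A flaky operation that succeeds on the third call:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return "ok"

assert retry(flaky, max_attempts=5, delay_s=0) == "ok"
```

Production retry logic usually adds exponential backoff and jitter so that many retrying clients do not overwhelm a recovering server.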
Ignore faulty behavior. This tactic calls for ignoring messages sent
from a particular source when we determine that those messages are
spurious. For example, we would like to ignore the messages
emanating from the live failure of a sensor.
Graceful degradation. This tactic maintains the most critical system
functions in the presence of component failures, while dropping less
critical functions. This is done in circumstances where individual
component failures gracefully reduce system functionality, rather
than causing a complete system failure.
Reconfiguration. Reconfiguration attempts to recover from failures
by reassigning responsibilities to the (potentially restricted)
resources or components left functioning, while maintaining as
much functionality as possible.

Reintroduction occurs when a failed component is reintroduced after it
has been repaired. Reintroduction tactics include the following:

Shadow. This tactic refers to operating a previously failed or in-
service upgraded component in a “shadow mode” for a predefined
duration of time prior to reverting the component back to an active
role. During this duration, its behavior can be monitored for
correctness and it can repopulate its state incrementally.
State resynchronization. This reintroduction tactic is a partner to the
redundant spare tactic. When used with active redundancy—a
version of the redundant spare tactic—the state resynchronization
occurs organically, since the active and standby components each
receive and process identical inputs in parallel. In practice, the states
of the active and standby components are periodically compared to
ensure synchronization. This comparison may be based on a cyclic
redundancy check calculation (checksum) or, for systems providing
safety-critical services, a message digest calculation (a one-way hash
function). When used alongside the passive redundancy version of
the redundant spare tactic, state resynchronization is based solely on
periodic state information transmitted from the active component(s)
to the standby component(s), typically via checkpointing.
Escalating restart. This reintroduction tactic allows the system to
recover from faults by varying the granularity of the component(s)
restarted and minimizing the level of service affectation. For
example, consider a system that supports four levels of restart,
numbered 0–3. The lowest level of restart (Level 0) has the least
impact on services and employs passive redundancy (warm spare),
where all child threads of the faulty component are killed and
recreated. In this way, only data associated with the child threads is
freed and reinitialized. The next level of restart (Level 1) frees and
reinitializes all unprotected memory; protected memory is
untouched. The next level of restart (Level 2) frees and reinitializes
all memory, both protected and unprotected, forcing all applications
to reload and reinitialize. The final level of restart (Level 3) involves
completely reloading and reinitializing the executable image and
associated data segments. Support for the escalating restart tactic is
particularly useful for the concept of graceful degradation, where a
system is able to degrade the services it provides while maintaining
support for mission-critical or safety-critical applications.
Nonstop forwarding. This concept originated in router design, and
assumes that functionality is split into two parts: the supervisory or
control plane (which manages connectivity and routing information)
and the data plane (which does the actual work of routing packets
from sender to receiver). If a router experiences the failure of an
active supervisor, it can continue forwarding packets along known
routes—with neighboring routers—while the routing protocol
information is recovered and validated. When the control plane is
restarted, it implements a “graceful restart,” incrementally rebuilding
its routing protocol database even as the data plane continues to
operate.

Prevent Faults
Instead of detecting faults and then trying to recover from them, what if
your system could prevent them from occurring in the first place?
Although it might sound as if some measure of clairvoyance would be
required, it turns out that in many cases it is possible to do just that.2
2. These tactics deal with runtime means to prevent faults from
occurring. Of course, an excellent way to prevent faults—at least in
the system you’re building, if not in systems that your system must
interact with—is to produce high-quality code. This can be done by
means of code inspections, pair programming, solid requirements
reviews, and a host of other good engineering practices.

Removal from service. This tactic refers to temporarily placing a
system component in an out-of-service state for the purpose of
mitigating potential system failures. For example, a component of a
system might be taken out of service and reset to scrub latent faults
(such as memory leaks, fragmentation, or soft errors in an
unprotected cache) before the accumulation of faults reaches the
service-affecting level, resulting in system failure. Other terms for
this tactic are software rejuvenation and therapeutic reboot. If you
reboot your computer every night, you are practicing removal from
service.
Transactions. Systems targeting high-availability services leverage
transactional semantics to ensure that asynchronous messages
exchanged between distributed components are atomic, consistent,
isolated, and durable—properties collectively referred to as the
“ACID properties.” The most common realization of the transactions
tactic is the “two-phase commit” (2PC) protocol. This tactic prevents
race conditions caused by two processes attempting to update the
same data item at the same time.
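The two-phase commit protocol can be sketched with in-memory participants (a toy model, ours, omitting durability, logging, and timeouts):

```python
class Participant:
    """A resource manager in a two-phase commit (2PC) sketch."""

    def __init__(self, can_commit: bool = True):
        self.can_commit = can_commit
        self.committed = False

    def prepare(self) -> bool:   # phase 1: vote yes or no
        return self.can_commit

    def commit(self) -> None:    # phase 2: make the update final
        self.committed = True

    def abort(self) -> None:
        self.committed = False

def two_phase_commit(participants) -> bool:
    """Commit only if every participant votes yes; otherwise abort
    all, keeping the distributed update atomic."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single dissenting vote in phase 1 causes every participant to abort, which is exactly the atomicity guarantee the transactions tactic relies on.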
Predictive model. A predictive model, when combined with a
monitor, is employed to monitor the state of health of a system
process to ensure that the system is operating within its nominal
operating parameters, and to take corrective action when the system
nears a critical threshold. The operational performance metrics
monitored are used to predict the onset of faults; examples include
the session establishment rate (in an HTTP server), threshold
crossing (monitoring high and low watermarks for some constrained,
shared resource), statistics on the process state (e.g., in-service, out-
of-service, under maintenance, idle), and message queue length
statistics.
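A threshold-crossing predictor over a sliding window is one simple realization of this tactic (the class name and watermark values are illustrative):

```python
from collections import deque

class ThresholdPredictor:
    """Minimal predictive monitor: tracks a sliding window of a health
    metric (e.g., message queue length) and flags an impending fault
    when the moving average crosses a high watermark."""

    def __init__(self, high_watermark: float, window: int = 5):
        self.high_watermark = high_watermark
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record a sample; True means corrective action is advised."""
        self.samples.append(value)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.high_watermark

mon = ThresholdPredictor(high_watermark=80.0, window=3)
assert mon.observe(50) is False
assert mon.observe(70) is False   # average so far: 60
assert mon.observe(90) is False   # average of (50, 70, 90) = 70
assert mon.observe(95) is True    # average of (70, 90, 95) = 85
```

Averaging over a window rather than reacting to a single sample keeps the monitor from triggering on momentary spikes.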
Exception prevention. This tactic refers to techniques employed for
the purpose of preventing system exceptions from occurring. The
use of exception classes, which allows a system to transparently
recover from system exceptions, was discussed earlier. Other
examples of exception prevention include error-correcting code
(used in telecommunications), abstract data types such as smart
pointers, and the use of wrappers to prevent faults such as dangling
pointers or semaphore access violations. Smart pointers prevent
exceptions by doing bounds checking on pointers, and by ensuring
that resources are automatically de-allocated when no data refers to
them, thereby avoiding resource leaks.
Increase competence set. A program’s competence set is the set of
states in which it is “competent” to operate. For example, the state
when the denominator is zero is outside the competence set of most
divide programs. When a component raises an exception, it is
signaling that it has discovered itself to be outside its competence
set; in essence, it doesn’t know what to do and is throwing in the
towel. Increasing a component’s competence set means designing it
to handle more cases—faults—as part of its normal operation. For
example, a component that assumes it has access to a shared
resource might throw an exception if it discovers that access is
blocked. Another component might simply wait for access or return
immediately with an indication that it will complete its operation on
its own the next time it does have access. In this example, the second
component has a larger competence set than the first.
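The two components in this example can be contrasted directly (a sketch; the function names are ours):

```python
import threading

def small_competence_set(resource_lock) -> str:
    """Blocked access to the shared resource is outside this
    component's competence set, so it throws."""
    if not resource_lock.acquire(blocking=False):
        raise RuntimeError("resource unavailable")
    try:
        return "done"
    finally:
        resource_lock.release()

def larger_competence_set(resource_lock) -> str:
    """Treats blocked access as a normal case: report "deferred" and
    complete the operation on a later cycle, rather than raising."""
    if not resource_lock.acquire(blocking=False):
        return "deferred"
    try:
        return "done"
    finally:
        resource_lock.release()

lock = threading.Lock()
assert larger_competence_set(lock) == "done"
lock.acquire()   # simulate another process holding the resource
assert larger_competence_set(lock) == "deferred"
```

The second component handles the blocked-resource case as normal operation, which is precisely what it means to have the larger competence set.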

4.3 Tactics-Based Questionnaire for Availability


Based on the tactics described in Section 4.2, we can create a set of
availability tactics-inspired questions, as presented in Table 4.3. To gain
an overview of the architectural choices made to support availability, the
analyst asks each question and records the answers in the table. The
answers to these questions can then be made the focus of further
activities: investigation of documentation, analysis of code or other
artifacts, reverse engineering of code, and so forth.
Table 4.3 Tactics-Based Questionnaire for Availability
Tactics Tactics Question Su RDesig Rati
Group pp i n onale
or s Decis and
t? kions Assu
( Y and mpti
/ N L ocat ons
) ion
Detect Does the system use ping/ echo to detect
Faults failure of a component or connection, or
network congestion?
Tactics Tactics Question Su RDesig Rati
Group pp i n onale
or s Decis and
t? kions Assu
( Y and mpti
/ N L ocat ons
) ion
Does the system use a component to
monitor the state of health of other parts of
the system? A system monitor can detect
failure or congestion in the network or other
shared resources, such as from a denial-of-
service attack.
Does the system use a heartbeat—a periodic
message exchange between a system monitor
and a process—to detect failure of a
component or connection, or network
congestion?
Does the system use a timestamp to detect
incorrect sequences of events in distributed
systems?
Does the system use voting to check that
replicated components are producing the
same results?

The replicated components may be identical


replicas, functionally redundant, or
analytically redundant.
Does the system use ex ception detection to
detect a system condition that alters the
normal flow of execution (e.g., system
exception, parameter fence, parameter
typing, timeout)?
Can the system do a self- test to test itself for
correct operation?
Tactics Tactics Question Su RDesig Rati
Group pp i n onale
or s Decis and
t? kions Assu
( Y and mpti
/ N L ocat ons
) ion
Recover Does the system employ redundant spares?
from
Faults Is a component’s role as active versus spare
(Prepara fixed, or does it change in the presence of a
tion and fault? What is the switchover mechanism?
Repair) What is the trigger for a switchover? How
long does it take for a spare to assume its
duties?
Does the system employ ex ception handling
to deal with faults?

Typically the handling involves either


reporting, correcting, or masking the fault.
Does the system employ rollback, so that it
can revert to a previously saved good state
(the “rollback line”) in the event of a fault?
Can the system perform in-service software
upgrades to executable code images in a
non-service-affecting manner?
Does the system systematically retry in
cases where the component or connection
failure may be transient?
Can the system simply ignore faulty
behavior (e.g., ignore messages when it is
determined that those messages are
spurious)?
Tactics Tactics Question Su RDesig Rati
Group pp i n onale
or s Decis and
t? kions Assu
( Y and mpti
/ N L ocat ons
) ion
Does the system have a policy of degradation
when resources are compromised,
maintaining the most critical system
functions in the presence of component
failures, and dropping less critical functions?
Does the system have consistent policies and
mechanisms for reconfiguration after
failures, reassigning responsibilities to the
resources left functioning, while maintaining
as much functionality as possible?
Recover Can the system operate a previously failed or
from in-service upgraded component in a
Faults “shadow mode” for a predefined time prior
(Reintro to reverting the component back to an active
duction) role?
If the system uses active or passive
redundancy, does it also employ state
resynchronization to send state information
from active components to standby
components?
Does the system employ escalating restart
to recover from faults by varying the
granularity of the component(s) restarted and
minimizing the level of service affected?
Can message processing and routing portions
of the system employ nonstop forwarding,
where functionality is split into supervisory
and data planes?
Prevent Faults:
Can the system remove components from service, temporarily placing a
system component in an out-of-service state for the purpose of
preempting potential system failures?
Does the system employ transactions—
bundling state updates so that asynchronous
messages exchanged between distributed
components are atomic, consistent, isolated,
and durable?
Does the system use a predictive model to
monitor the state of health of a component to
ensure that the system is operating within
nominal parameters?
When conditions are detected that are predictive of likely future
faults, the model initiates corrective action.
4.4 Patterns for Availability
This section presents a few of the most important architectural patterns
for availability.
The first three patterns are all centered on the redundant spare tactic,
and will be described as a group. They differ primarily in the degree to
which the backup components’ state matches that of the active
component. (A special case occurs when the components are stateless, in
which case the first two patterns become identical.)
Active redundancy (hot spare). For stateful components, this refers
to a configuration in which all of the nodes (active or redundant
spare) in a protection group3 receive and process identical inputs in
parallel, allowing the redundant spare(s) to maintain a synchronous
state with the active node(s). Because the redundant spare possesses
an identical state to the active processor, it can take over from a
failed component in a matter of milliseconds. The simple case of one
active node and one redundant spare node is commonly referred to
as one-plus-one redundancy. Active redundancy can also be used for
facilities protection, where active and standby network links are used
to ensure highly available network connectivity.
3. A protection group is a group of processing nodes in which one
or more nodes are “active,” with the remaining nodes serving as
redundant spares.
Passive redundancy (warm spare). For stateful components, this
refers to a configuration in which only the active members of the
protection group process input traffic. One of their duties is to
provide the redundant spare(s) with periodic state updates. Because
the state maintained by the redundant spares is only loosely coupled
with that of the active node(s) in the protection group (with the
looseness of the coupling being a function of the period of the state
updates), the redundant nodes are referred to as warm spares.
Passive redundancy provides a solution that achieves a balance
between the more highly available but more compute-intensive (and
expensive) active redundancy pattern and the less available but
significantly less complex cold spare pattern (which is also
significantly cheaper).
Spare (cold spare). Cold sparing refers to a configuration in which
redundant spares remain out of service until a failover occurs, at
which point a power-on-reset4 procedure is initiated on the
redundant spare prior to its being placed in service. Due to its poor
recovery performance, and hence its high mean time to repair, this
pattern is poorly suited to systems having high-availability
requirements.
4. A power-on-reset ensures that a device starts operating in a
known state.
Benefits:
The benefit of a redundant spare is a system that continues to
function correctly after only a brief delay in the presence of a
failure. The alternative is a system that stops functioning
correctly, or stops functioning altogether, until the failed
component is repaired. This repair could take hours or days.
Tradeoffs:
The tradeoff with any of these patterns is the additional cost and
complexity incurred in providing a spare.
The tradeoff among the three alternatives is the time to recover
from a failure versus the runtime cost incurred to keep a spare
up-to-date. A hot spare carries the highest cost but leads to the
fastest recovery time, for example.
Other patterns for availability include the following.
Triple modular redundancy (TMR). This widely used
implementation of the voting tactic employs three components that
do the same thing. Each component receives identical inputs and
forwards its output to the voting logic, which detects any
inconsistency among the three output states. Faced with an
inconsistency, the voter reports a fault. It must also decide which
output to use, and different instantiations of this pattern use different
decision rules. Typical choices are letting the majority rule or
choosing some computed average of the disparate outputs.
Of course, other versions of this pattern that employ 5 or 19 or 53
redundant components are also possible. However, in most cases, 3
components are sufficient to ensure a reliable result.
Benefits:
TMR is simple to understand and to implement. It is blissfully
independent of what might be causing disparate results, and is
only concerned about making a reasonable choice so that the
system can continue to function.
Tradeoffs:
There is a tradeoff between increasing the level of replication,
which raises the cost, and the resulting availability. In systems
employing TMR, the statistical likelihood of two or more
components failing is vanishingly small, and three components
represents a sweet spot between availability and cost.
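The voting logic at the heart of TMR is simple enough to sketch in a few lines. The following is a minimal illustration in Python (the book is language-neutral; the function name and the decision rule of "majority rules, no tie-breaking" are choices made for this sketch, not part of the pattern's definition):

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over the outputs of three redundant components.

    Returns (value, fault_detected). A fault is flagged whenever the
    three outputs are not unanimous. If all three disagree, this sketch
    simply raises, since it has no averaging or tie-breaking rule.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    tally = Counter(outputs)
    value, count = tally.most_common(1)[0]
    if count == 3:
        return value, False   # unanimous: no fault detected
    if count == 2:
        return value, True    # majority rules; one component disagreed
    raise RuntimeError("no majority: all three outputs disagree")
```

A variant that computes an average of the disparate outputs, as the text mentions, would replace the final `raise` with a computed value.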
Circuit breaker. A commonly used availability tactic is retry. In the
event of a timeout or fault when invoking a service, the invoker
simply tries again—and again, and again. A circuit breaker keeps the
invoker from trying countless times, waiting for a response that
never comes. In this way, it breaks the endless retry cycle when it
deems that the system is dealing with a fault. That’s the signal for
the system to begin handling the fault. Until the circuit breaker is
“reset,” subsequent invocations will return immediately without
passing along the service request.
Benefits:
This pattern can remove from individual components the policy
about how many retries to allow before declaring a failure.
At worst, endless fruitless retries would make the invoking
component as useless as the invoked component that has failed.
This problem is especially acute in distributed systems, where
you could have many callers calling an unresponsive
component and effectively going out of service themselves,
causing the failure to cascade across the whole system. The
circuit breaker, in conjunction with software that listens to it
and begins recovery procedures, prevents that problem.
Tradeoffs:
Care must be taken in choosing timeout (or retry) values. If the
timeout is too long, then unnecessary latency is added. But if
the timeout is too short, then the circuit breaker will be tripping
when it does not need to—a kind of “false positive”—which
can lower the availability and performance of these services.
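A minimal sketch of the circuit breaker in Python may make the mechanism concrete. The class name, thresholds, and the simple "half-open after a timeout" reset policy are illustrative choices, not a prescribed implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive faults,
    invocations fail fast until `reset_timeout` seconds have elapsed."""

    def __init__(self, call, max_failures=3, reset_timeout=30.0):
        self.call = call
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def invoke(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast without passing along the service request.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0              # a success resets the failure count
        return result
```

Note how the two tunable values embody the tradeoff described above: `max_failures` controls how many retries are tolerated before tripping, and `reset_timeout` controls how long callers fail fast before the breaker probes the service again.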
Other availability patterns that are commonly used include the
following:
Process pairs. This pattern employs checkpointing and rollback. In
case of failure, the backup has been checkpointing and (if necessary)
rolling back to a safe state, so is ready to take over when a failure
occurs.
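The checkpoint-and-rollback behavior of process pairs can be sketched as follows. In a real system the primary and backup are separate processes and the checkpoint is shipped over a channel; this single-object Python sketch only models the state handling, and all names are illustrative:

```python
import copy

class ProcessPair:
    """Sketch of the process-pairs pattern: the primary periodically
    checkpoints its state; on failure, the backup takes over from the
    last checkpointed (known-good) state."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.checkpoint_state = copy.deepcopy(initial_state)  # backup's copy

    def update(self, key, value):
        self.state[key] = value

    def checkpoint(self):
        # In a real system this snapshot would be sent to the backup process.
        self.checkpoint_state = copy.deepcopy(self.state)

    def failover(self):
        # The backup resumes from the last safe state; any updates made
        # after the final checkpoint are lost, which is the pattern's cost.
        self.state = copy.deepcopy(self.checkpoint_state)
        return self.state
```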
Forward error recovery. This pattern provides a way to get out of an
undesirable state by moving forward to a desirable state. This often
relies upon built-in error-correction capabilities, such as data
redundancy, so that errors may be corrected without the need to fall
back to a previous state or to retry. Forward error recovery finds a
safe, possibly degraded state from which the operation can move
forward.
4.5 For Further Reading
Patterns for availability:
You can read about patterns for fault tolerance in [Hanmer 13].
General tactics for availability:
A more detailed discussion of some of the availability tactics in this
chapter is given in [Scott 09]. This is the source of much of the
material in this chapter.
The Internet Engineering Task Force has promulgated a number of
standards supporting availability tactics. These standards include
Non-Stop Forwarding [IETF 2004], Ping/Echo (ICMP [IETF 1981]
or ICMPv6 [RFC 2006b] Echo Request/Response), and MPLS (LSP
Ping) networks [IETF 2006a].
Tactics for availability—fault detection:
Triple modular redundancy (TMR) was developed in the early 1960s
by Lyons [Lyons 62].
The fault detection in the voting tactic is based on the fundamental
contributions to automata theory by Von Neumann, who
demonstrated how systems having a prescribed reliability could be
built from unreliable components [Von Neumann 56].
Tactics for availability—fault recovery:
Standards-based realizations of active redundancy exist for
protecting network links (i.e., facilities) at both the physical layer of
the seven-layer OSI (Open Systems Interconnection) model
[Bellcore 98, 99; Telcordia 00] and the network/link layer [IETF
2005].
Some examples of how a system can degrade through use
(degradation) are given in [Nygard 18].
Mountains of papers have been written about parameter typing, but
[Utas 05] writes about it in the context of availability (as opposed to
bug prevention, its usual context). [Utas 05] has also written about
escalating restart.
Hardware engineers often use preparation and repair tactics.
Examples include error detection and correction (EDAC) coding,
forward error correction (FEC), and temporal redundancy. EDAC
coding is typically used to protect control memory structures in
high-availability distributed real-time embedded systems [Hamming
80]. Conversely, FEC coding is typically employed to recover from
physical layer errors occurring in external network links [Morelos-
Zaragoza 06]. Temporal redundancy involves sampling spatially
redundant clock or data lines at time intervals that exceed the pulse
width of any transient pulse to be tolerated, and then voting out any
defects detected [Mavis 02].
Tactics for availability—fault prevention:
Parnas and Madey have written about increasing an element's
competence set [Parnas 95].
The ACID properties, important in the transactions tactic, were
introduced by Gray in the 1970s and discussed in depth in [Gray 93].
Disaster recovery:
A disaster is an event such as an earthquake, flood, or hurricane that
destroys an entire data center. The U.S. National Institute of
Standards and Technology (NIST) identifies eight different types of
plans that should be considered in the event of a disaster. See
Section 2.2 of NIST Special Publication 800-34, Contingency
Planning Guide for Federal Information Systems,
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-34r1.pdf.
4.6 Discussion Questions
1. Write a set of concrete scenarios for availability using each of the
possible responses in the general scenario.
2. Write a concrete availability scenario for the software for a
(hypothetical) driverless car.
3. Write a concrete availability scenario for a program like Microsoft
Word.
4. Redundancy is a key strategy for achieving high availability. Look
at the patterns and tactics presented in this chapter and decide how
many of them exploit some form of redundancy and how many do
not.
5. How does availability trade off against modifiability and
deployability? How would you make a change to a system that is
required to have 24/7 availability (i.e., no scheduled or unscheduled
down time, ever)?
6. Consider the fault detection tactics (ping/echo, heartbeat, system
monitor, voting, and exception detection). What are the
performance implications of using these tactics?
7. Which tactics are used by a load balancer (see Chapter 17) when it
detects a failure of an instance?
8. Look up recovery point objective (RPO) and recovery time
objective (RTO), and explain how these can be used to set a
checkpoint interval when using the rollback tactic.
5
Deployability
From the day we arrive on the planet
And blinking, step into the sun
There's more to be seen than can ever be seen
More to do than can ever be done
—The Lion King
There comes a day when software, like the rest of us, must leave home
and venture out into the world and experience real life. Unlike the rest of
us, software typically makes the trip many times, as changes and updates
are made. This chapter is about making that transition as orderly and as
effective and—most of all—as rapid as possible. That is the realm of
continuous deployment, which is most enabled by the quality attribute of
deployability.
Why has deployability come to take a front-row seat in the world of
quality attributes?
In the “bad old days,” releases were infrequent—large numbers of
changes were bundled into releases and scheduled. A release would
contain new features and bug fixes. One release per month, per quarter,
or even per year was common. Competitive pressures in many domains
—with the charge being led by e-commerce—resulted in a need for
much shorter release cycles. In these contexts, releases can occur at any
time—possibly hundreds of releases per day—and each can be instigated
by a different team within an organization. Being able to release
frequently means that bug fixes in particular do not have to wait until the
next scheduled release, but rather can be made and released as soon as a
bug is discovered and fixed. It also means that new features do not need
to be bundled into a release, but can be put into production at any time.
This is not desirable, or even possible, in all domains. If your software
exists in a complex ecosystem with many dependencies, it may not be
possible to release just one part of it without coordinating that release
with the other parts. In addition, many embedded systems, systems in
hard-to-access locations, and systems that are not networked would be
poor candidates for a continuous deployment mindset.
This chapter focuses on the large and growing numbers of systems for
which just-in-time feature releases are a significant competitive
advantage, and just-in-time bug fixes are essential to safety or security or
continuous operation. Often these systems are microservice and cloud-
based, although the techniques here are not limited to those technologies.
5.1 Continuous Deployment
Deployment is a process that starts with coding and ends with real users
interacting with the system in a production environment. If this process
is fully automated—that is, if there is no human intervention—then it is
called continuous deployment. If the process is automated up to the point
of placing (portions of) the system into production and human
intervention is required (perhaps due to regulations or policies) for this
final step, the process is called continuous delivery.
To speed up releases, we need to introduce the concept of a
deployment pipeline: the sequence of tools and activities that begin when
you check your code into a version control system and end when your
application has been deployed for users to send it requests. In between
those points, a series of tools integrate and automatically test the newly
committed code, test the integrated code for functionality, and test the
application for concerns such as performance under load, security, and
license compliance.
Each stage in the deployment pipeline takes place in an environment
established to support isolation of the stage and perform the actions
appropriate to that stage. The major environments are as follows:
Code is developed in a development environment for a single module
where it is subject to standalone unit tests. Once it passes the tests,
and after appropriate review, the code is committed to a version
control system that triggers the build activities in the integration
environment.
An integration environment builds an executable version of your
service. A continuous integration server compiles1 your new or
changed code, along with the latest compatible versions of code for
other portions of your service and constructs an executable image for
your service.2 Tests in the integration environment include the unit
tests from the various modules (now run against the built system), as
well as integration tests designed specifically for the whole system.
When the various tests are passed, the built service is promoted to
the staging environment.
1. If you are developing software using an interpreted language
such as Python or JavaScript, there is no compilation step.
2. In this chapter, we use the term “service” to denote any
independently deployable unit.
A staging environment tests for various qualities of the total system.
These include performance testing, security testing, license
conformance checks, and possibly user testing. For embedded
systems, this is where simulators of the physical environment
(feeding synthetic inputs to the system) are brought to bear. An
application that passes all staging environment tests—which may
include field testing—is deployed to the production environment,
using either a blue/green model or a rolling upgrade (see Section
5.6). In some cases, partial deployments are used for quality control
or to test the market response to a proposed change or offering.
Once in the production environment, the service is monitored closely
until all parties have some level of confidence in its quality. At that
point, it is considered a normal part of the system and receives the
same amount of attention as the other parts of the system.
You perform a different set of tests in each environment, expanding
the testing scope from unit testing of a single module in the development
environment, to functional testing of all the components that make up
your service in the integration environment, and ending with broad
quality testing in the staging environment and usage monitoring in the
production environment.
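The stage-by-stage promotion just described can be sketched in a few lines. This Python model is purely illustrative (environment names follow the text; the `run_tests` gate function and its signature are assumptions of the sketch):

```python
# Each environment pairs with the kinds of tests run there, per the text.
PIPELINE = [
    ("development", ["unit tests"]),
    ("integration", ["unit tests (built system)", "integration tests"]),
    ("staging",     ["performance tests", "security tests", "license checks"]),
    ("production",  ["usage monitoring"]),
]

def promote(run_tests):
    """Promote a build through each environment in order.

    `run_tests(env, tests)` is assumed to return True when all tests for
    that environment pass. Promotion stops at the first failure, so the
    returned list shows how far the build got.
    """
    reached = []
    for env, tests in PIPELINE:
        if not run_tests(env, tests):
            return reached          # build is not deployed any further
        reached.append(env)
    return reached
```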
But not everything always goes according to plan. If you find
problems after the software is in its production environment, it is often
necessary to roll back to a previous version while the defect is being
addressed.
Architectural choices affect deployability. For example, by employing
the microservice architecture pattern (see Section 5.6), each team
responsible for a microservice can make its own technology choices; this
removes incompatibility problems that would previously have been
discovered at integration time (e.g., incompatible choices of which
version of a library to use). Since microservices are independent
services, such choices do not cause problems.
Similarly, a continuous deployment mindset forces you to think about
the testing infrastructure earlier in the development process. This is
necessary because designing for continuous deployment requires
continuous automated testing. In addition, the need to be able to roll
back or disable features leads to architectural decisions about
mechanisms such as feature toggles and backward compatibility of
interfaces. These decisions are best taken early on.
The Effect of Virtualization on the Different Environments
Before the widespread use of virtualization technology, the
environments that we describe here were physical facilities. In most
organizations, the development, integration, and staging
environments comprised hardware and software procured and
operated by different groups. The development environment might
consist of a few desktop computers that the development team
repurposed as servers. The integration environment was operated
by the test or quality-assurance team, and might consist of some
racks, populated with previous-generation equipment from the data
center. The staging environment was operated by the operations
team and might have hardware similar to that used in production.
A lot of time was spent trying to figure out why a test that
passed in one environment failed in another environment. One
benefit of environments that employ virtualization is the ability to
have environment parity, where environments may differ in scale
but not in type of hardware or fundamental structure. A variety of
provisioning tools support environment parity by allowing every
team to easily build a common environment and by ensuring that
this common environment mimics the production environment as
closely as possible.
Three important ways to measure the quality of the pipeline are as
follows:
Cycle time is the pace of progress through the pipeline. Many
organizations will deploy to production several or even hundreds of
times a day. Such rapid deployment is not possible if human
intervention is required. It is also not possible if one team must
coordinate with other teams before placing its service in production.
Later in this chapter, we will see architectural techniques that allow
teams to perform continuous deployment without consulting other
teams.
Traceability is the ability to recover all of the artifacts that led to an
element having a problem. That includes all the code and
dependencies that are included in that element. It also includes the
test cases that were run on that element and the tools that were used
to produce the element. Errors in tools used in the deployment
pipeline can cause problems in production. Typically, traceability
information is kept in an artifact database. This database will
contain code version numbers, version numbers of elements the
system depends on (such as libraries), test version numbers, and tool
version numbers.
Repeatability is getting the same result when you perform the same
action with the same artifacts. This is not as easy as it sounds. For
example, suppose your build process fetches the latest version of a
library. The next time you execute the build process, a new version
of the library may have been released. As another example, suppose
one test modifies some values in the database. If the original values
are not restored, subsequent tests may not produce the same results.
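Traceability and repeatability reinforce each other: if an artifact-database record pins every version that went into a build, then identical records should identify identical builds. A minimal sketch of such a record, in Python (the field names and the idea of deriving an artifact id from a digest are illustrative choices, not a standard schema):

```python
import hashlib
import json

def build_record(code_version, dependencies, test_suite_version, tool_versions):
    """One traceability entry for the artifact database: everything needed
    to reconstruct (and repeat) the build that produced an element."""
    record = {
        "code_version": code_version,                        # e.g., a VCS commit id
        "dependencies": dict(sorted(dependencies.items())),  # pinned, never "latest"
        "test_suite_version": test_suite_version,
        "tool_versions": dict(sorted(tool_versions.items())),
    }
    # A digest over the record gives the artifact a stable identity:
    # identical inputs yield an identical id, which is what repeatability demands.
    payload = json.dumps(record, sort_keys=True).encode()
    record["artifact_id"] = hashlib.sha256(payload).hexdigest()[:12]
    return record
```

The "fetches the latest version of a library" problem in the text shows up here as a changed entry in `dependencies`, which changes the artifact id and makes the non-repeatable build visible.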
DevOps
DevOps—a portmanteau of “development” and “operations”—is a
concept closely associated with continuous deployment. It is a
movement (much like the Agile movement), a description of a set
of practices and tools (again, much like the Agile movement), and a
marketing formula touted by vendors selling those tools. The goal
of DevOps is to shorten time to market (or time to release). The
goal is to dramatically shorten the time between a developer
making a change to an existing system—implementing a feature or
fixing a bug—and the system reaching the hands of end users, as
compared with traditional software development practices.
A formal definition of DevOps captures both the frequency of
releases and the ability to perform bug fixes on demand:
DevOps is a set of practices intended to reduce the time between
committing a change to a system and the change being placed
into normal production, while ensuring high quality. [Bass 15]
Implementing DevOps is a process improvement effort. DevOps
encompasses not only the cultural and organizational elements of
any process improvement effort, but also a strong reliance on tools
and architectural design. All environments are different, of course,
but the tools and automation we describe are found in the typical
tool chains built to support DevOps.
The continuous deployment strategy we describe here is the
conceptual heart of DevOps. Automated testing is, in turn, a
critically important ingredient of continuous deployment, and the
tooling for that often represents the highest technological hurdle for
DevOps. Some forms of DevOps include logging and post-
deployment monitoring of those logs, for automatic detection of
errors back at the “home office,” or even monitoring to understand
the user experience. This, of course, requires a “phone home” or
log delivery capability in the system, which may or may not be
possible or allowable in some systems.
DevSecOps is a flavor of DevOps that incorporates approaches
for security (for the infrastructure and for the applications it
produces) into the entire process. DevSecOps is increasingly
popular in aerospace and defense applications, but is also valid in
any application area where DevOps is useful and a security breach
would be particularly costly. Many IT applications fall in this
category.
5.2 Deployability
Deployability refers to a property of software indicating that it may be
deployed—that is, allocated to an environment for execution—within a
predictable and acceptable amount of time and effort. Moreover, if the
new deployment is not meeting its specifications, it may be rolled back,
again within a predictable and acceptable amount of time and effort. As
the world moves increasingly toward virtualization and cloud
infrastructures, and as the scale of deployed software-intensive systems
inevitably increases, it is one of the architect’s responsibilities to ensure
that deployment is done in an efficient and predictable way, minimizing
overall system risk.3
3. The quality attribute of testability (see Chapter 12) certainly plays a
critical role in continuous deployment, and the architect can provide
critical support for continuous deployment by ensuring that the
system is testable, in all the ways just mentioned. However, our
concern here is the quality attribute directly related to continuous
deployment over and above testability: deployability.
To achieve these goals, an architect needs to consider how an
executable is updated on a host platform, and how it is subsequently
invoked, measured, monitored, and controlled. Mobile systems in
particular present a challenge for deployability in terms of how they are
updated because of concerns about bandwidth. Some of the issues
involved in deploying software are as follows:

How does it arrive at its host (i.e., push, where updates deployed are
unbidden, or pull, where users or administrators must explicitly
request updates)?
How is it integrated into an existing system? Can this be done while
the existing system is executing?
What is the medium, such as DVD, USB drive, or Internet delivery?
What is the packaging (e.g., executable, app, plug-in)?
What is the resulting integration into an existing system?
What is the efficiency of executing the process?
What is the controllability of the process?

With all of these concerns, the architect must be able to assess the
associated risks. Architects are primarily concerned with the degree to
which the architecture supports deployments that are:
Granular. Deployments can be of the whole system or of elements
within a system. If the architecture provides options for finer
granularity of deployment, then certain risks can be reduced.
Controllable. The architecture should provide the capability to
deploy at varying levels of granularity, monitor the operation of the
deployed units, and roll back unsuccessful deployments.
Efficient. The architecture should support rapid deployment (and, if
needed, rollback) with a reasonable level of effort.
These characteristics will be reflected in the response measures of the
general scenario for deployability.
5.3 Deployability General Scenario
Table 5.1 enumerates the elements of the general scenario that
characterize deployability.
Table 5.1 General Scenario for Deployability
Portion of Scenario | Description | Possible Values

Source | The trigger for the deployment | End user, developer, system administrator, operations personnel, component marketplace, product owner.

Stimulus | What causes the trigger | A new element is available to be deployed. This is typically a request to replace a software element with a new version (e.g., fix a defect, apply a security patch, upgrade to the latest release of a component or framework, upgrade to the latest version of an internally produced element). A new element is approved for incorporation. An existing element/set of elements needs to be rolled back.

Artifacts | What is to be changed | Specific components or modules, the system's platform, its user interface, its environment, or another system with which it interoperates. Thus the artifact might be a single software element, multiple software elements, or the entire system.

Environment | Staging, production (or a specific subset of either) | Full deployment. Subset deployment to a specified portion of users, VMs, containers, servers, platforms.

Response | What should happen | Incorporate the new components. Deploy the new components. Monitor the new components. Roll back a previous deployment.

Response measure | A measure of cost, time, or process effectiveness for a deployment, or for a series of deployments over time | Cost in terms of: number, size, and complexity of affected artifacts; average/worst-case effort; elapsed clock or calendar time; money (direct outlay or opportunity cost); new defects introduced. Extent to which this deployment/rollback affects other functions or quality attributes. Number of failed deployments. Repeatability of the process. Traceability of the process. Cycle time of the process.
Figure 5.1 illustrates a concrete deployability scenario: “A new
release of an authentication/authorization service (which our product
uses) is made available in the component marketplace and the product
owner decides to incorporate this version into the release. The new
service is tested and deployed to the production environment within 40
hours of elapsed time and no more than 120 person-hours of effort. The
deployment introduces no defects and no SLA is violated.”
Figure 5.1 Sample concrete deployability scenario
5.4 Tactics for Deployability
A deployment is catalyzed by the release of a new software or hardware
element. The deployment is successful if these new elements are
deployed within acceptable time, cost, and quality constraints. We
illustrate this relationship—and hence the goal of deployability tactics—
in Figure 5.2.
Figure 5.2 Goal of deployability tactics
The tactics for deployability are shown in Figure 5.3 . In many cases,
these tactics will be provided, at least in part, by a CI/CD (continuous
integration/continuous deployment) infrastructure that you buy rather
than build. In such a case, your job as an architect is often one of
choosing and assessing (rather than implementing) the right
deployability tactics and the right combination of tactics.
Figure 5.3 Deployability tactics

Next, we describe these six deployability tactics in more detail. The
first category of deployability tactics focuses on strategies for managing
the deployment pipeline, and the second category deals with managing
the system as it is being deployed and once it has been deployed.
Manage Deployment Pipeline
Scale rollouts. Rather than deploying to the entire user base, scaled
rollouts deploy a new version of a service gradually, to controlled
subsets of the user population, often with no explicit notification to
those users. (The remainder of the user base continues to use the
previous version of the service.) By gradually releasing, the effects
of new deployments can be monitored and measured and, if
necessary, rolled back. This tactic minimizes the potential negative
impact of deploying a flawed service. It requires an architectural
mechanism (not part of the service being deployed) to route a
request from a user to either the new or old service, depending on
that user’s identity.
Roll back. If it is discovered that a deployment has defects or does
not meet user expectations, then it can be “rolled back” to its prior
state. Since deployments may involve multiple coordinated updates
of multiple services and their data, the rollback mechanism must be
able to keep track of all of these, or must be able to reverse the
consequences of any update made by a deployment, ideally in a fully
automated fashion.
Script deployment commands. Deployments are often complex and
require many steps to be carried out and orchestrated precisely. For
this reason, deployment is often scripted. These deployment scripts
should be treated like code—documented, reviewed, tested, and
version controlled. A scripting engine executes the deployment
script automatically, saving time and minimizing opportunities for
human error.
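The scale rollouts tactic described above depends on a mechanism, outside the service being deployed, that routes each request to the new or old version based on the user's identity. A minimal sketch of such a router, assuming a stable hash of the user ID defines the rollout cohort (the function name and bucketing scheme are illustrative, not drawn from any particular CI/CD tool):

```python
import hashlib

def route(user_id: str, rollout_percent: int) -> str:
    """Return which version of the service should serve this user.

    A stable hash keeps each user on the same version for the whole
    rollout; raising rollout_percent gradually widens the cohort.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # each user falls in a bucket 0..99
    return "new" if bucket < rollout_percent else "old"

# Widen the rollout in stages, e.g., 1% -> 5% -> 25% -> 100% of users,
# monitoring the effects of the new deployment at each stage.
```

Because the bucket depends only on the user ID, a user is never bounced between versions as the rollout widens; users admitted at 5 percent remain admitted at 25 percent.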

Manage Deployed System


Manage service interactions. This tactic accommodates
simultaneous deployment and execution of multiple versions of
system services. Multiple requests from a client could be directed to
either version in any sequence. Having multiple versions of the same
service in operation, however, may introduce version
incompatibilities. In such cases, the interactions between services
need to be mediated so that version incompatibilities are proactively
avoided. This tactic is a resource management strategy, obviating the
need to completely replicate the resources so as to separately deploy
the old and new versions.
Package dependencies. This tactic packages an element together
with its dependencies so that they get deployed together and so that
the versions of the dependencies are consistent as the element moves
from development into production. The dependencies may include
libraries, OS versions, and utility containers (e.g., sidecar, service
mesh), which we will discuss in Chapter 9. Three means of
packaging dependencies are using containers, pods, or virtual
machines; these are discussed in more detail in Chapter 16.
Feature toggle. Even when your code is fully tested, you might
encounter issues after deploying new features. For that reason, it is
convenient to be able to integrate a “kill switch” (or feature toggle)
for new features. The kill switch automatically disables a feature in
your system at runtime, without forcing you to initiate a new
deployment. This provides the ability to control deployed features
without the cost and risk of actually redeploying services.
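The feature toggle tactic can be realized with a small runtime registry of flags. The sketch below is a minimal in-process illustration; in practice the flags would be flipped through an operations endpoint or a central configuration service, and the class and flag names here are hypothetical:

```python
import threading

class FeatureToggles:
    """In-process registry of feature toggles (kill switches).

    set() is called at runtime (e.g., by an operations endpoint),
    so a feature can be disabled without a new deployment.
    """

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def set(self, feature: str, enabled: bool) -> None:
        with self._lock:
            self._flags[feature] = enabled

    def is_enabled(self, feature: str) -> bool:
        with self._lock:
            return self._flags.get(feature, False)  # unknown features are off

toggles = FeatureToggles()
toggles.set("new-recommendations", True)   # feature ships enabled
toggles.set("new-recommendations", False)  # kill switch: disable at runtime
```

Guarding each new code path with `toggles.is_enabled(...)` gives operators control over deployed features without redeploying the service.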
5.5 Tactics-Based Questionnaire for Deployability
Based on the tactics described in Section 5.4, we can create a set of
deployability tactics–inspired questions, as presented in Table 5.2. To
gain an overview of the architectural choices made to support
deployability, the analyst asks each question and records the answers in
the table. The answers to these questions can then be made the focus of
subsequent activities: investigation of documentation, analysis of code or
other artifacts, reverse engineering of code, and so forth.
Table 5.2 Tactics-Based Questionnaire for Deployability
(For each question, the analyst records: Supported? (Y/N); Risk; Design
Decisions and Location; Rationale and Assumptions.)

Tactics Group: Manage deployment pipeline
  Do you scale rollouts, rolling out new releases gradually (in contrast
  to releasing in an all-or-nothing fashion)?
  Are you able to automatically roll back deployed services if you
  determine that they are not operating in a satisfactory fashion?
  Do you script deployment commands to automatically execute complex
  sequences of deployment instructions?

Tactics Group: Manage deployed system
  Do you manage service interactions so that multiple versions of
  services can be safely deployed simultaneously?
  Do you package dependencies so that services are deployed along with
  all of the libraries, OS versions, and utility containers that they
  depend on?
  Do you employ feature toggles to automatically disable a newly released
  feature (rather than rolling back the newly deployed service) if the
  feature is determined to be problematic?

5.6 Patterns for Deployability


Patterns for deployability can be organized into two categories. The first
category contains patterns for structuring services to be deployed. The
second category contains patterns for how to deploy services, which can
be parsed into two broad subcategories: all-or-nothing or partial
deployment. The two main categories for deployability are not
completely independent of each other, because certain deployment
patterns depend on certain structural properties of the services.

Patterns for Structuring Services

Microservice Architecture


The microservice architecture pattern structures the system as a
collection of independently deployable services that communicate only
via messages through service interfaces. There is no other form of
interprocess communication allowed: no direct linking, no direct reads of
another team’s data store, no shared-memory model, no back-doors
whatsoever. Services are usually stateless, and (because they are
developed by a single relatively small team4) are relatively small—hence
the term microservice. Service dependencies are acyclic. An integral part
of this pattern is a discovery service so that messages can be
appropriately routed.
4. At Amazon, service teams are constrained in size by the “two pizza
rule”: The team must be no larger than can be fed by two pizzas.
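The discovery service that this pattern calls for can be sketched as a simple registry that maps service names to live instances and hands out an instance per request. The class, method names, and addresses below are illustrative only:

```python
class DiscoveryService:
    """Toy discovery service: instances register under a service name,
    and clients resolve the name to one live instance (round-robin)."""

    def __init__(self) -> None:
        self._instances = {}  # service name -> list of addresses
        self._cursor = {}     # service name -> next round-robin index

    def register(self, name: str, address: str) -> None:
        self._instances.setdefault(name, []).append(address)

    def deregister(self, name: str, address: str) -> None:
        self._instances[name].remove(address)

    def resolve(self, name: str) -> str:
        addresses = self._instances[name]
        index = self._cursor.get(name, 0) % len(addresses)
        self._cursor[name] = index + 1
        return addresses[index]

registry = DiscoveryService()
registry.register("billing", "10.0.0.5:8443")
registry.register("billing", "10.0.0.6:8443")
first, second = registry.resolve("billing"), registry.resolve("billing")
```

Production discovery services add health checking, leases, and replication, but the core contract, register, deregister, and resolve, is the same.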
Benefits:

Time to market is reduced. Since each service is small and
independently deployable, a modification to a service can be
deployed without coordinating with teams that own other services.
Thus, once a team completes its work on a new version of a service
and that version has been tested, it can be deployed immediately.
Each team can make its own technology choices for its service, as
long as the technology choices support message passing. No
coordination is needed with respect to library versions or
programming languages. This reduces errors due to incompatibilities
that arise during integration—and which are a major source of
integration errors.
Services are more easily scaled than coarser-grained applications.
Since each service is independent, dynamically adding instances of
the service is straightforward. In this way, the supply of services can
be more easily matched to the demand.

Tradeoffs:

Overhead is increased, compared to in-memory communication,
because all communication among services occurs via messages
across a network. This can be mitigated somewhat by using the
service mesh pattern (see Chapter 9), which constrains the
deployment of some services to the same host to reduce network
traffic. Furthermore, because of the dynamic nature of microservice
deployments, discovery services are heavily used, adding to the
overhead. Ultimately, those discovery services may become a
performance bottleneck.
Microservices are less suitable for complex transactions because of
the difficulty of synchronizing activities across distributed systems.
The freedom for every team to choose its own technology comes at a
cost—the organization must maintain those technologies and the
required experience base.
Intellectual control of the total system may be difficult because of
the large number of microservices. This introduces a requirement for
catalogs and databases of interfaces to assist in maintaining
intellectual control. In addition, the process of properly combining
services to achieve a desired outcome may be complex and subtle.
Designing the services to have appropriate responsibilities and an
appropriate level of granularity is a formidable design task.
To achieve the ability to deploy versions independently, the
architecture of the services must be designed to allow for that
deployment strategy. Using the manage service interactions tactic
described in Section 5.4 can help achieve this goal.

Organizations that have heavily employed the microservice
architecture pattern include Google, Netflix, PayPal, Twitter, Facebook,
and Amazon. Many other organizations have adopted the microservice
architecture pattern as well; books and conferences exist that focus on
how an organization can adopt the microservice architecture pattern for
its own needs.

Patterns for Complete Replacement of Services


Suppose there are N instances of Service A and you wish to replace them
with N instances of a new version of Service A, leaving no instances of
the original version. You wish to do this with no reduction in quality of
service to the clients of the service, so there must always be N instances
of the service running.
Two different patterns for the complete replacement strategy are
possible, both of which are realizations of the scale rollouts tactic. We’ll
cover them both together:
1. Blue/green. In a blue/green deployment, N new instances of the
service would be created and each populated with new Service A
(let’s call these the green instances). After the N instances of new
Service A are installed, the DNS server or discovery service would
be changed to point to the new version of Service A. Once it is
determined that the new instances are working satisfactorily, then
and only then are the N instances of the original Service A
removed. Before this cutoff point, if a problem is found in the new
version, it is a simple matter of switching back to the original (the
blue services) with little or no interruption.
2. Rolling upgrade. A rolling upgrade replaces the instances of
Service A with instances of the new version of Service A one at a
time. (In practice, you can replace more than one instance at a
time, but only a small fraction are replaced in any single step.) The
steps of the rolling upgrade are as follows:
a. Allocate resources for a new instance of Service A (e.g., a
virtual machine).
b. Install and register the new version of Service A.
c. Begin to direct requests to the new version of Service A.
d. Choose an instance of the old Service A, allow it to
complete any active processing, and then destroy that
instance.
e. Repeat the preceding steps until all instances of the old
version have been replaced.
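The steps above can be sketched as a simple orchestration loop. The four callbacks are hypothetical stand-ins for whatever platform API actually provisions, installs, routes, and destroys instances:

```python
def rolling_upgrade(old_instances, provision, install, route_to, drain_and_destroy):
    """Replace old instances one at a time, following steps a-e.

    The callbacks are hypothetical stand-ins for a platform API:
      provision()             step a: allocate resources, return a handle
      install(handle)         step b: install and register the new version
      route_to(handle)        step c: begin directing requests to it
      drain_and_destroy(old)  step d: let active work finish, then destroy
    """
    new_instances = []
    for old in list(old_instances):  # step e: repeat until all are replaced
        handle = provision()
        install(handle)
        route_to(handle)
        drain_and_destroy(old)
        new_instances.append(handle)
    return new_instances
```

At every point in the loop there are at least N serving instances, which is exactly the property that distinguishes a rolling upgrade from an all-at-once replacement.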
Figure 5.4 shows a rolling upgrade process as implemented by
Netflix’s Asgard tool on Amazon’s EC2 cloud platform.
Figure 5.4 A flowchart of the rolling upgrade pattern as
implemented by Netflix’s Asgard tool

Benefits:

The benefit of these patterns is the ability to completely replace
deployed versions of services without having to take the system out
of service, thus increasing the system’s availability.

Tradeoffs:

The peak resource utilization for a blue/green approach is 2N
instances, whereas the peak utilization for a rolling upgrade is N + 1
instances. In either case, resources to host these instances must be
procured. Before the widespread adoption of cloud computing,
procurement meant purchase: An organization had to purchase
physical computers to perform the upgrade. Most of the time there
was no upgrade in progress, so these additional computers largely
sat idle. This made the financial tradeoff clear, and rolling upgrade
was the standard approach. Now that computing resources can be
rented on an as-needed basis, rather than purchased, the financial
tradeoff is less compelling but still present.
Suppose you detect an error in the new Service A when you deploy
it. Despite all the testing you did in the development, integration,
and staging environments, when your service is deployed to
production, there may still be latent errors. If you are using
blue/green deployment, by the time you discover an error in the
new Service A, all of the original instances may have been deleted
and rolling back to the old version could take considerable time. In
contrast, a rolling upgrade may allow you to discover an error in the
new version of the service while instances of the old version are
still available.
From a client’s perspective, if you are using the blue/green
deployment model, then at any point in time either the new version
or the old version is active, but not both. If you are using the rolling
upgrade pattern, both versions are simultaneously active. This
introduces the possibility of two types of problems: temporal
inconsistency and interface mismatch.
Temporal inconsistency. In a sequence of requests by Client C
to Service A, some may be served by the old version of the
service and some may be served by the new version. If the
versions behave differently, this may cause Client C to
produce erroneous, or at least inconsistent, results. (This can
be prevented by using the manage service interactions tactic.)
Interface mismatch. If the interface to the new version of
Service A is different from the interface to the old version of
Service A, then invocations by clients of Service A that have
not been updated to reflect the new interface will produce
unpredictable results. This can be prevented by extending the
interface but not modifying the existing interface, and using
the mediator pattern (see Chapter 7) to translate from the
extended interface to an internal interface that produces correct
behavior. See Chapter 15 for a fuller discussion.
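The extend-but-don't-modify approach with a mediator can be illustrated with a small sketch. The billing domain, class names, and signatures here are invented for illustration; the point is that old clients keep calling the unmodified interface while new clients use the extended one, and both are translated onto a single internal implementation:

```python
class InternalBilling:
    """Internal implementation, working in integer cents (hypothetical)."""

    def charge_cents(self, account: str, cents: int, currency: str) -> str:
        return f"charged {account} {cents} {currency}"

class BillingMediator:
    """Exposes the unmodified old interface plus an extended one, and
    translates both onto the internal implementation."""

    def __init__(self, internal: InternalBilling) -> None:
        self._internal = internal

    def charge(self, account: str, dollars: float) -> str:
        # Old clients keep calling the original, unmodified signature.
        return self._internal.charge_cents(account, round(dollars * 100), "USD")

    def charge_v2(self, account: str, cents: int, currency: str = "USD") -> str:
        # Updated clients call the extended interface.
        return self._internal.charge_cents(account, cents, currency)

mediator = BillingMediator(InternalBilling())
old_result = mediator.charge("acct-9", 12.50)    # old clients unaffected
new_result = mediator.charge_v2("acct-9", 1250)  # same behavior, new interface
```

Because neither version's clients can observe the other interface, the two service versions can coexist during a rolling upgrade without interface mismatch.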

Patterns for Partial Replacement of Services


Sometimes changing all instances of a service is undesirable. Partial-
deployment patterns aim at providing multiple versions of a service
simultaneously for different user groups; they are used for purposes such
as quality control (canary testing) and marketing tests (A/B testing).

Canary Testing
Before rolling out a new release, it is prudent to test it in the production
environment, but with a limited set of users. Canary testing is the
continuous deployment analog of beta testing.5 Canary testing designates
a small set of users who will test the new release. Sometimes, these
testers are so-called power users or preview-stream users from outside
your organization who are more likely to exercise code paths and edge
cases that typical users may use less frequently. Users may or may not
know that they are being used as guinea pigs—er, that is, canaries.
Another approach is to use testers from within the organization that is
developing the software. For example, Google employees almost never
use the release that external users would be using, but instead act as
testers for upcoming releases. When the focus of the testing is on
determining how well new features are accepted, a variant of canary
testing called dark launch is used.
5. Canary testing is named after the 19th-century practice of bringing
canaries into coal mines. Coal mining releases gases that are
explosive and poisonous. Because canaries are more sensitive to
these gases than humans, coal miners brought canaries into the mines
and watched them for signs of reaction to the gases. The canaries
acted as early warning devices for the miners, indicating an unsafe
environment.
In both cases, the users are designated as canaries and routed to the
appropriate version of a service through DNS settings or through
discovery-service configuration. After testing is complete, users are all
directed to either the new version or the old version, and instances of the
deprecated version are destroyed. Rolling upgrade or blue/green
deployment could be used to deploy the new version.
Benefits:

Canary testing allows real users to “bang on” the software in ways
that simulated testing cannot. This allows the organization deploying
the service to collect “in use” data and perform controlled
experiments with relatively low risk.
Canary testing incurs minimal additional development costs, because
the system being tested is on a path to production anyway.
Canary testing minimizes the number of users who may be exposed
to a serious defect in the new system.

Tradeoffs:

Canary testing requires additional up-front planning and resources,
and a strategy for evaluating the results of the tests needs to be
formulated.
If canary testing is aimed at power users, those users have to be
identified and the new version routed to them.

A/B Testing
A/B testing is used by marketers to perform an experiment with real
users to determine which of several alternatives yields the best business
results. A small but meaningful number of users receive a different
treatment from the remainder of the users. The difference can be minor,
such as a change to the font size or form layout, or it can be more
significant. For example, HomeAway (now Vrbo) has used A/B testing
to vary the format, content, and look-and-feel of its worldwide websites,
tracking which editions produced the most rentals. The “winner” would
be kept, the “loser” discarded, and another contender designed and
deployed. Another example is a bank offering different promotions to
open new accounts. An oft-repeated story is that Google tested 41
different shades of blue to decide which shade to use to report search
results.
As in canary testing, DNS servers and discovery-service
configurations are set to send client requests to different versions. In A/B
testing, the different versions are monitored to see which one provides
the best response from a business perspective.
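The variant assignment itself is typically deterministic, so that a given user always sees the same treatment within an experiment. A minimal sketch, with hypothetical experiment and variant names (salting the hash with the experiment name keeps assignments uncorrelated across experiments):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically assign a user to one variant of an experiment.

    Hashing the experiment name together with the user ID keeps each
    user's assignment stable within an experiment while remaining
    independent of assignments in other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

variant = assign_variant("user-42", "results-page-blue", ["shade-A", "shade-B"])
# Route the request to the version serving this variant, then log business
# metrics (e.g., click-through) tagged with the variant for later comparison.
```

The monitoring side then compares the tagged metrics per variant to decide which version "wins" from a business perspective.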

Benefits:

A/B testing allows marketing and product development teams to run
experiments on, and collect data from, real users.
A/B testing can allow for targeting of users based on an arbitrary set
of characteristics.

Tradeoffs:

A/B testing requires the implementation of alternatives, one of
which will be discarded.
Different classes of users, and their characteristics, need to be
identified up front.

5.7 For Further Reading


Much of the material in this chapter is adapted from Deployment and
Operations for Software Engineers by Len Bass and John Klein [Bass
19] and from [Kazman 20b].
A general discussion of deployability and architecture in the context
of DevOps can be found in [Bass 15].
The tactics for deployability owe much to the work of Martin Fowler
and his colleagues, which can be found in [Fowler 10], [Lewis 14], and
[Sato 14].
Deployment pipelines are described in much more detail in [Humble
10].
Microservices and the process of migrating to microservices are
described in [Newman 15].

5.8 Discussion Questions


1. Write a set of concrete scenarios for deployability using each of the
possible responses in the general scenario.
2. Write a concrete deployability scenario for the software for a car
(such as a Tesla).
3. Write a concrete deployability scenario for a smartphone app. Now
write one for the server-side infrastructure that communicates with
this app.
4. If you needed to display the results of a search operation, would you
perform A/B testing or simply use the color that Google has
chosen? Why?
5. Referring to the structures described in Chapter 1, which structures
would be involved in implementing the package dependencies
tactic? Would you use the uses structure? Why or why not? Are
there other structures you would need to consider?
6. Referring to the structures described in Chapter 1, which structures
would be involved in implementing the manage service interactions
tactic? Would you use the uses structure? Why or why not? Are
there other structures you would need to consider?
7. Under what circumstances would you prefer to roll forward to a
new version of service, rather than to roll back to a prior version?
When is roll forward a poor choice?
6
Energy Efficiency
Energy is a bit like money: If you have a positive balance, you can
distribute it in various ways, but according to the classical laws that were
believed at the beginning of the century, you weren’t allowed to be
overdrawn.
—Stephen Hawking

Energy used by computers used to be free and unlimited—or at least
that’s how we behaved. Architects rarely gave much consideration to the
energy consumption of software in the past. But those days are now
gone. With the dominance of mobile devices as the primary form of
computing for most people, with the increasing adoption of the Internet
of Things (IoT) in industry and government, and with the ubiquity of
cloud services as the backbone of our computing infrastructure, energy
has become an issue that architects can no longer ignore. Power is no
longer “free” and unlimited. The energy efficiency of mobile devices
affects us all. Likewise, cloud providers are increasingly concerned with
the energy efficiency of their server farms. In 2016, it was reported that
data centers globally accounted for more energy consumption (by 40
percent) than the entire United Kingdom—about 3 percent of all energy
consumed worldwide. More recent estimates put that share up as high as
10 percent. The energy costs associated with running and, more
importantly, cooling large data centers have led people to calculate the
cost of putting whole data centers in space, where cooling is free and the
sun provides unlimited power. At today’s launch prices, the economics
are actually beginning to look favorable. Notably, server farms located
underwater and in arctic climates are already a reality.
At both the low end and the high end, energy consumption of
computational devices has become an issue that we should consider.
This means that we, as architects, now need to add energy efficiency to
the long list of competing qualities that we consider when designing a
system. And, as with every other quality attribute, there are nontrivial
tradeoffs to consider: energy usage versus performance or availability or
modifiability or time to market. Thus considering energy efficiency as a
first-class quality attribute is important for the following reasons:
1. An architectural approach is necessary to gain control over any
important system quality attribute, and energy efficiency is no
different. If system-wide techniques for monitoring and managing
energy are lacking, then developers are left to invent them on their
own. This will, in the best case, result in an ad hoc approach to
energy efficiency that produces a system that is hard to maintain,
measure, and evolve. In the worst case, it will yield an approach
that simply does not predictably achieve the desired energy
efficiency goals.
2. Most architects and developers are unaware of energy efficiency as
a quality attribute of concern, and hence do not know how to go
about engineering and coding for it. More fundamentally, they lack
an understanding of energy efficiency requirements—how to
gather them and analyze them for completeness. Energy efficiency
is not taught, or typically even mentioned, as a programmer’s
concern in today’s educational curricula. In consequence, students
may graduate with degrees in engineering or computer science
without ever having been exposed to these issues.
3. Most architects and developers lack suitable design concepts—
models, patterns, tactics, and so forth—for designing for energy
efficiency, as well as managing and monitoring it at runtime. But
since energy efficiency is a relatively recent concern for the
software engineering community, these design concepts are still in
their infancy and no catalog yet exists.
Cloud platforms typically do not have to be concerned with running
out of energy (except in disaster scenarios), whereas this is a daily
concern for users of mobile devices and some IoT devices. In cloud
environments, scaling up and scaling down are core competencies, so
decisions must be made on a regular basis about optimal resource
allocation. With IoT devices, their size, form factors, and heat output all
constrain their design space—there is no room for bulky batteries. In
addition, the sheer number of IoT devices projected to be deployed in
the next decade makes their energy usage a concern.
In all of these contexts, energy efficiency must be balanced with
performance and availability, requiring engineers to consciously reason
about such tradeoffs. In the cloud context, greater allocation of resources
—more servers, more storage, and so on—creates improved
performance capabilities as well as improved robustness against failures
of individual devices, but at the cost of energy and capital outlays. In the
mobile and IoT contexts, greater allocation of resources is typically not
an option (although shifting the computational burden from a mobile
device to a cloud back-end is possible), so the tradeoffs tend to center on
energy efficiency versus performance and usability. Finally, in all
contexts, there are tradeoffs between energy efficiency, on the one hand,
and buildability and modifiability, on the other hand.

6.1 Energy Efficiency General Scenario


From these considerations, we can now determine the various portions of
the energy efficiency general scenario, as presented in Table 6.1.
Table 6.1 Energy Efficiency General Scenario
Source
  Description: This specifies who or what requests or initiates a
  request to conserve or manage energy.
  Possible values: End user, manager, system administrator, automated
  agent

Stimulus
  Description: A request to conserve energy.
  Possible values: Total usage, maximum instantaneous usage, average
  usage, etc.

Artifacts
  Description: This specifies what is to be managed.
  Possible values: Specific devices, servers, VMs, clusters, etc.

Environment
  Description: Energy is typically managed at runtime, but many
  interesting special cases exist, based on system characteristics.
  Possible values: Runtime, connected, battery-powered, low-battery
  mode, power-conservation mode

Response
  Description: What actions the system takes to conserve or manage
  energy usage.
  Possible values: One or more of the following: disable services;
  deallocate runtime services; change allocation of services to
  servers; run services at a lower consumption mode;
  allocate/deallocate servers; change levels of service; change
  scheduling

Response measure
  Description: The measures revolve around the amount of energy saved
  or consumed and the effects on other functions or quality attributes.
  Possible values: Energy managed or saved in terms of:
  maximum/average kilowatt load on the system; average/total amount of
  energy saved; total kilowatt hours used; time period during which
  the system must stay powered on—all while still maintaining a
  required level of functionality and acceptable levels of other
  quality attributes

Figure 6.1 illustrates a concrete energy efficiency scenario: A
manager wants to save energy at runtime by deallocating unused
resources at non-peak periods. The system deallocates resources while
maintaining worst-case latency of 2 seconds on database queries, saving
on average 50 percent of the total energy required.
Figure 6.1 Sample energy efficiency scenario

6.2 Tactics for Energy Efficiency


An energy efficiency scenario is catalyzed by the desire to conserve or
manage energy while still providing the required (albeit not necessarily
full) functionality. This scenario is successful if the energy responses are
achieved within acceptable time, cost, and quality constraints. We
illustrate this simple relationship—and hence the goal of energy
efficiency tactics—in Figure 6.2.

Figure 6.2 Goal of energy efficiency tactics

Energy efficiency is, at its heart, about effectively utilizing resources.
We group the tactics into three broad categories: resource monitoring,
resource allocation, and resource adaptation (Figure 6.3). By “resource,”
we mean a computational device that consumes energy while providing
its functionality. This is analogous to the definition of a hardware
resource in Chapter 9, which includes CPUs, data stores, network
communications, and memory.

Figure 6.3 Energy efficiency tactics

Monitor Resources
You can’t manage what you can’t measure, and so we begin with
resource monitoring. The tactics for resource monitoring are metering,
static classification, and dynamic classification.

Metering. The metering tactic involves collecting data about the
energy consumption of computational resources via a sensor
infrastructure, in near real time. At the coarsest level, the energy
consumption of an entire data center can be measured from its power
meter. Individual servers or hard drives can be measured using
external tools such as amp meters or watt-hour meters, or using
built-in tools such as those provided with metered rack PDUs (power
distribution units), ASICs (application-specific integrated circuits),
and so forth. In battery-operated systems, the energy remaining in a
battery can be determined through a battery management system,
which is a component of modern batteries.
Static classification. Sometimes real-time data collection is
infeasible. For example, if an organization is using an off-premises
cloud, it might not have direct access to real-time energy data. Static
classification allows us to estimate energy consumption by
cataloging the computing resources used and their known energy
characteristics—the amount of energy used by a memory device per
fetch, for example. These characteristics are available as
benchmarks, or from manufacturers’ specifications.
Dynamic classification. In cases where a static model of a
computational resource is inadequate, a dynamic model might be
required. Unlike static models, dynamic models estimate energy
consumption based on knowledge of transient conditions such as
workload. The model could be a simple table lookup, a regression
model based on data collected during prior executions, or a
simulation.
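As a concrete illustration of dynamic classification, a regression model fit to measurements of prior executions can predict the energy a proposed workload will consume. The sketch below is a minimal least-squares fit over one workload feature; the measurement data is invented for illustration:

```python
def fit_energy_model(samples):
    """Least-squares fit of energy ≈ slope * cpu_seconds + intercept,
    using (cpu_seconds, joules) measurements from prior executions."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda cpu_seconds: slope * cpu_seconds + intercept

# Hypothetical measurements from earlier runs: (CPU-seconds, joules).
estimate = fit_energy_model([(1, 12), (2, 22), (4, 42), (8, 82)])
predicted = estimate(3)  # estimated joules for a 3-CPU-second workload
```

A production model would use more features (I/O volume, network traffic, clock frequency) and be refit as new measurements arrive, but the structure, observe, fit, predict, is the same.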

Allocate Resources
Resource allocation means assigning resources to do work in a way that
is mindful of energy consumption. The tactics for resource allocation are
to reduce usage, discovery, and scheduling.

Reduce usage. Usage can be reduced at the device level by device-specific
activities such as reducing the refresh rate of a display or
darkening the background. Removing or deactivating resources
when demands no longer require them is another method for
decreasing energy consumption. This may involve spinning down
hard drives, turning off CPUs or servers, running CPUs at a slower
clock rate, or shutting down current to blocks of the processor that
are not in use. It might also take the form of moving VMs onto the
minimum number of physical servers (consolidation), combined
with shutting down idle computational resources. In mobile
applications, energy savings may be realized by sending part of the
computation to the cloud, assuming that the energy consumption of
communication is lower than the energy consumption of
computation.
Discovery. As we will see in Chapter 7, a discovery service matches
service requests (from clients) with service providers, supporting the
identification and remote invocation of those services. Traditionally
discovery services have made these matches based on a description
of the service request (typically an API). In the context of energy
efficiency, this request could be annotated with energy information,
allowing the requestor to choose a service provider (resource) based
on its (possibly dynamic) energy characteristics. For the cloud, this
energy information can be stored in a “green service directory”
populated by information from metering, static classification, or
dynamic classification (the resource monitoring tactics). For a
smartphone, the information could be obtained from an app store.
Currently such information is ad hoc at best, and typically
nonexistent in service APIs.
Schedule resources. Scheduling is the allocation of tasks to
computational resources. As we will see in Chapter 9, the schedule
resources tactic can increase performance. In the energy context, it
can be used to effectively manage energy usage, given task
constraints and respecting task priorities. Scheduling can be based
on data collected using one or more resource monitoring tactics.
Using an energy discovery service in a cloud context, or a controller
in a multi-core context, a computational task can dynamically switch
among computational resources, such as service providers, selecting
the ones that offer better energy efficiency or lower energy costs. For
example, one provider may be more lightly loaded than another,
allowing it to adapt its energy usage, perhaps using some of the
tactics described earlier, and consume less energy, on average, per
unit of work.
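The selection step in the schedule resources tactic can be sketched as choosing, from an energy directory, the provider with the lowest estimated energy cost for the work at hand. The directory entries, provider names, and per-unit figures below are hypothetical:

```python
def pick_provider(directory: dict, work_units: int) -> str:
    """Pick the provider with the lowest estimated energy for this work.

    directory maps provider name -> estimated joules per unit of work
    at its current load, as might be published in a "green service
    directory" populated by the resource monitoring tactics.
    """
    return min(directory, key=lambda name: directory[name] * work_units)

directory = {
    "provider-eu": 0.8,  # lightly loaded, most efficient right now
    "provider-us": 1.3,
    "provider-ap": 1.1,
}
chosen = pick_provider(directory, work_units=500)
```

Because the per-unit figures are dynamic, a task can be re-scheduled to a different provider as loads (and hence energy characteristics) shift.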

Reduce Resource Demand


This category of tactics is detailed in Chapter 9. Tactics in this category
—manage event arrival, limit event response, prioritize events (perhaps
letting low-priority events go unserviced), reduce computational
overhead, bound execution times, and increase resource usage efficiency
—all directly increase energy efficiency by doing less work. These
tactics complement reduce usage: reduce usage assumes that the
demand stays the same, whereas the reduce resource demand tactics
explicitly manage (and reduce) that demand.
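As one concrete illustration of limiting event response to save energy, the sketch below services high-priority events unconditionally and drops low-priority events once a per-interval budget is exhausted. The priority scheme and the budget are assumptions for this example, not part of any particular framework.

```python
# Illustrative "limit event response" filter in the energy context:
# events that are never serviced cost no computation, and hence no energy.
def filter_events(events, budget_per_tick):
    """events: list of (tick, priority) pairs, priority 0 = highest.
    Returns the events actually serviced: priority-0 events always,
    others only while the tick's service budget remains."""
    serviced = []
    spent = {}  # tick -> number of events serviced in that tick
    for tick, priority in events:
        if priority == 0 or spent.get(tick, 0) < budget_per_tick:
            serviced.append((tick, priority))
            spent[tick] = spent.get(tick, 0) + 1
    return serviced

events = [(1, 0), (1, 2), (1, 2), (1, 2), (2, 1)]
print(filter_events(events, budget_per_tick=2))
# [(1, 0), (1, 2), (2, 1)]
```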
6.3 Tactics-Based Questionnaire for Energy
Efficiency
As described in Chapter 3, the tactics-based questionnaire is intended to
help an analyst very quickly understand the degree to which an
architecture employs specific tactics to manage energy efficiency.
Based on the tactics described in Section 6.2, we can create a set of
tactics-inspired questions, as presented in Table 6.2. To gain an overview
of the architectural choices made to support energy efficiency, the
analyst asks each question and records the answers in the table. The
answers to these questions can then be made the focus of further
activities: investigation of documentation, analysis of code or other
artifacts, reverse engineering of code, and so forth.
Table 6.2 Tactics-Based Questionnaire for Energy Efficiency

For each question, the analyst records: whether the tactic is supported
(Y/N), the risk, the design decisions made to realize it and their
location in the architecture, and the rationale and assumptions behind
those decisions.

Tactics Group: Resource Monitoring

Does your system meter the use of energy? That is, does the system
collect data about the actual energy consumption of computational
devices via a sensor infrastructure, in near real time?

Does the system statically classify devices and computational
resources? That is, does the system have reference values to estimate
the energy consumption of a device or resource (in cases where
real-time metering is infeasible or too computationally expensive)?

Does the system dynamically classify devices and computational
resources? In cases where static classification is not accurate due to
varying load or environmental conditions, does the system use dynamic
models, based on prior data collected, to estimate the varying energy
consumption of a device or resource at runtime?

Tactics Group: Resource Allocation

Does the system reduce usage to scale down resource usage? That is, can
the system deactivate resources when demands no longer require them, in
an effort to save energy? This may involve spinning down hard drives,
darkening displays, turning off CPUs or servers, running CPUs at a
slower clock rate, or shutting down memory blocks of the processor that
are not being used.

Does the system schedule resources to more effectively utilize energy,
given task constraints and respecting task priorities, by switching
computational resources, such as service providers, to the ones that
offer better energy efficiency or lower energy costs? Is scheduling
based on data collected (using one or more resource monitoring tactics)
about the state of the system?

Does the system make use of a discovery service to match service
requests to service providers? In the context of energy efficiency, a
service request could be annotated with energy requirement information,
allowing the requestor to choose a service provider based on its
(possibly dynamic) energy characteristics.

Tactics Group: Reduce Resource Demand

Do you consistently attempt to reduce resource demand? Here, you may
insert the questions in this category from the tactics-based
questionnaire for performance from Chapter 9.
6.4 Patterns
Some examples of patterns used for energy efficiency include sensor
fusion, kill abnormal tasks, and power monitor.

Sensor Fusion
Mobile apps and IoT systems often collect data from their environment
using multiple sensors. In this pattern, data from low-power sensors can
be used to infer whether data needs to be collected from higher-power
sensors. A common example in the mobile phone context is using
accelerometer data to assess if the user has moved and, if so, to update
the GPS location. This pattern assumes that accessing the low-power
sensor is much cheaper, in terms of energy consumption, than accessing
the higher-power sensor.
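A minimal sketch of this pattern, with hypothetical sensor interfaces (plain callables standing in for the accelerometer and GPS drivers):

```python
# Sensor fusion sketch: a cheap accelerometer reading gates access to
# the expensive GPS. The movement threshold is an assumption.
class FusedLocator:
    MOVE_THRESHOLD = 0.5  # acceleration (m/s^2) we treat as "user moved"

    def __init__(self, accelerometer, gps):
        self._accel = accelerometer  # low-power sensor: returns acceleration
        self._gps = gps              # high-power sensor: returns a position fix
        self._last_fix = None

    def location(self):
        """Return a cached fix unless the low-power sensor says we moved."""
        if self._last_fix is None or self._accel() > self.MOVE_THRESHOLD:
            self._last_fix = self._gps()  # only now pay the GPS energy cost
        return self._last_fix

gps_calls = []
locator = FusedLocator(
    accelerometer=lambda: 0.1,  # user is stationary
    gps=lambda: gps_calls.append(1) or (48.85, 2.35))
locator.location()      # first call: no cached fix, so GPS is consulted
locator.location()      # user still: cached fix returned, GPS stays off
print(len(gps_calls))   # 1
```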

Benefits:

The obvious benefit of this pattern is the ability to minimize the
usage of more energy-intensive devices in an intelligent way, rather
than, for example, just reducing the frequency of consulting the
more energy-intensive sensor.

Tradeoffs:

Consulting and comparing multiple sensors adds up-front complexity.
The higher-energy-consuming sensor will provide higher-quality
data, albeit at the cost of increased power consumption. And it will
provide this data more quickly, since using the more energy-
intensive sensor alone takes less time than first consulting a
secondary sensor.
In cases where the inference frequently results in accessing the
higher-power sensor, this pattern could result in overall higher
energy usage.

Kill Abnormal Tasks


Mobile systems, because they are often executing apps of unknown
provenance, may end up unknowingly running some exceptionally
power-hungry apps. This pattern provides a way to monitor the energy
usage of such apps and to interrupt or kill energy-greedy operations. For
example, if an app is issuing an audible alert and vibrating the phone and
the user is not responding to these alerts, then after a predetermined
timeout period the task is killed.
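A simplified sketch of such a watchdog follows; the power samples and the kill hook are stand-ins for platform-specific monitoring and process-control facilities.

```python
# "Kill abnormal tasks" sketch: apps whose measured power draw exceeds
# a budget for a sustained period are killed via a supplied hook.
def watchdog(readings, budget_watts, timeout_ticks, kill):
    """readings: iterable of (app, watts) samples, one per tick per app.
    Kills an app once it stays above budget for timeout_ticks samples."""
    over_budget = {}  # app -> consecutive samples above budget
    for app, watts in readings:
        if watts > budget_watts:
            over_budget[app] = over_budget.get(app, 0) + 1
            if over_budget[app] >= timeout_ticks:
                kill(app)
                over_budget[app] = 0  # reset after intervening
        else:
            over_budget[app] = 0      # back within budget

killed = []
samples = [("maps", 3.0), ("game", 6.0), ("maps", 2.5),
           ("game", 6.5), ("game", 7.0)]
watchdog(samples, budget_watts=5.0, timeout_ticks=3, kill=killed.append)
print(killed)  # ['game']
```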

Benefits:

This pattern provides a “fail-safe” option for managing the energy
consumption of apps with unknown energy properties.

Tradeoffs:

Any monitoring process adds a small amount of overhead to system
operations, which may affect performance and, to a small extent,
energy usage.
The usability of this pattern needs to be considered. Killing energy-
hungry tasks may be counter to the user’s intention.

Power Monitor
The power monitor pattern monitors and manages system devices,
minimizing the time during which they are active. This pattern attempts
to automatically disable devices and interfaces that are not being actively
used by the application. It has long been used within integrated circuits,
where blocks of the circuit are shut down when they are not being used,
in an effort to save energy.
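A simplified sketch of this pattern is shown below; the device names, timestamps, and idle thresholds are illustrative assumptions, not a real platform API.

```python
# Power monitor sketch: devices report when they were last used, and a
# periodic sweep powers down any device idle past a configured limit.
class PowerMonitor:
    def __init__(self, idle_limit_s):
        self._idle_limit = idle_limit_s
        self._last_used = {}   # device -> timestamp of last use
        self._powered = set()  # devices currently powered on

    def touch(self, device, now):
        """Record device activity, powering it (back) on if needed."""
        self._last_used[device] = now
        self._powered.add(device)

    def sweep(self, now):
        """Disable devices that have been idle past the limit."""
        for device, last in self._last_used.items():
            if device in self._powered and now - last > self._idle_limit:
                self._powered.discard(device)

    def is_on(self, device):
        return device in self._powered

monitor = PowerMonitor(idle_limit_s=30)
monitor.touch("wifi", now=0)
monitor.touch("gps", now=0)
monitor.touch("wifi", now=20)  # Wi-Fi still in use
monitor.sweep(now=40)          # GPS idle 40 s > 30 s -> powered down
print(monitor.is_on("gps"), monitor.is_on("wifi"))  # False True
```

A real implementation would also account for the wake-up latency and wake-up energy tradeoffs discussed next when choosing the idle limit.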

Benefits:

This pattern can allow for intelligent savings of power at little to no
impact to the end user, assuming that the devices being shut down
are truly not needed.

Tradeoffs:

Once a device has been switched off, switching it on adds some
latency before it can respond, as compared with keeping it
continually running. And, in some cases, the startup may be more
energy expensive than a certain period of steady-state operation.
The power monitor needs to have knowledge of each device and its
energy consumption characteristics, which adds up-front
complexity to the system design.

6.5 For Further Reading


The first published set of energy tactics appeared in [Procaccianti 14].
These were, in part, the inspiration for the tactics presented here. The
2014 paper subsequently inspired [Paradis 21]. Many of the tactics
presented in this chapter owe a debt to these two papers.
For a good general introduction to energy usage in software
development—and what developers do not know—you should read
[Pang 16].
Several research papers have investigated the consequences of design
choices on energy consumption, such as [Kazman 18] and [Chowdhury
19].
A general discussion of the importance of creating “energy-aware”
software can be found in [Fonseca 19].
Energy patterns for mobile devices have been catalogued by [Cruz 19]
and [Schaarschmidt 20].

6.6 Discussion Questions


1. Write a set of concrete scenarios for energy efficiency using each of
the possible responses in the general scenario.
2. Create a concrete energy efficiency scenario for a smartphone app
(for example, a health monitoring app).
3. Create a concrete energy efficiency scenario for a cluster of data
servers in a data center. What are the important distinctions between
this scenario and the one you created for question 2?
4. Enumerate the energy efficiency techniques that are currently
employed by your laptop or smartphone.
5. What are the energy tradeoffs in your smartphone between using
Wi-Fi and the cellular network?
6. Calculate the amount of greenhouse gases in the form of carbon
dioxide that you, over an average lifetime, will exhale into the
atmosphere. How many Google searches does this equate to?
7. Suppose Google reduced its energy usage per search by 1 percent.
How much energy would that save per year?
8. How much energy did you use to answer question 7?
