Software Architecture in Practice
Len Bass
Paul Clements
Rick Kazman
Preface
Acknowledgments
PART I INTRODUCTION
CHAPTER 4 Availability
4.1 Availability General Scenario
4.2 Tactics for Availability
4.3 Tactics-Based Questionnaire for Availability
4.4 Patterns for Availability
4.5 For Further Reading
4.6 Discussion Questions
CHAPTER 5 Deployability
5.1 Continuous Deployment
5.2 Deployability
5.3 Deployability General Scenario
5.4 Tactics for Deployability
5.5 Tactics-Based Questionnaire for Deployability
5.6 Patterns for Deployability
5.7 For Further Reading
5.8 Discussion Questions
CHAPTER 7 Integrability
7.1 Evaluating the Integrability of an Architecture
7.2 General Scenario for Integrability
7.3 Integrability Tactics
7.4 Tactics-Based Questionnaire for Integrability
7.5 Patterns
7.6 For Further Reading
7.7 Discussion Questions
CHAPTER 8 Modifiability
8.1 Modifiability General Scenario
8.2 Tactics for Modifiability
8.3 Tactics-Based Questionnaire for Modifiability
8.4 Patterns
8.5 For Further Reading
8.6 Discussion Questions
CHAPTER 9 Performance
9.1 Performance General Scenario
9.2 Tactics for Performance
9.3 Tactics-Based Questionnaire for Performance
9.4 Patterns for Performance
9.5 For Further Reading
9.6 Discussion Questions
CHAPTER 10 Safety
10.1 Safety General Scenario
10.2 Tactics for Safety
10.3 Tactics-Based Questionnaire for Safety
10.4 Patterns for Safety
10.5 For Further Reading
10.6 Discussion Questions
CHAPTER 11 Security
11.1 Security General Scenario
11.2 Tactics for Security
11.3 Tactics-Based Questionnaire for Security
11.4 Patterns for Security
11.5 For Further Reading
11.6 Discussion Questions
CHAPTER 12 Testability
12.1 Testability General Scenario
12.2 Tactics for Testability
12.3 Tactics-Based Questionnaire for Testability
12.4 Patterns for Testability
12.5 For Further Reading
12.6 Discussion Questions
CHAPTER 13 Usability
13.1 Usability General Scenario
13.2 Tactics for Usability
13.3 Tactics-Based Questionnaire for Usability
13.4 Patterns for Usability
13.5 For Further Reading
13.6 Discussion Questions
CHAPTER 16 Virtualization
16.1 Shared Resources
16.2 Virtual Machines
16.3 VM Images
16.4 Containers
16.5 Containers and VMs
16.6 Container Portability
16.7 Pods
16.8 Serverless Architecture
16.9 Summary
16.10 For Further Reading
16.11 Discussion Questions
PART VI CONCLUSIONS
References
About the Authors
Index
Preface
Writing (on our part) and reading (on your part) a book about software
architecture, which distills the experience of many people, presupposes that
1. having a reasonable software architecture is important to the
successful development of a software system and
2. there is a sufficient body of knowledge about software architecture to
fill up a book.
There was a time when both of these assumptions needed justification.
Early editions of this book tried to convince readers that both of these
assumptions are true and, once you were convinced, supply you with basic
knowledge so that you could apply the practice of architecture yourself.
Today, there seems to be little controversy about either aim, and so this
book is more about the supplying than the convincing.
The basic principle of software architecture is that every software system is
constructed to satisfy an organization’s business goals, and that the
architecture of a system is a bridge between those (often abstract) business
goals and the final (concrete) resulting system. While the path from abstract
goals to concrete systems can be complex, the good news is that software
architectures can be designed, analyzed, and documented using known
techniques that will support the achievement of these business goals. The
complexity can be tamed, made tractable.
These, then, are the topics for this book: the design, analysis, and
documentation of architectures. We will also examine the influences,
principally in the form of business goals that lead to quality attribute
requirements, that inform these activities.
In this chapter, we will focus on architecture strictly from a software
engineering point of view. That is, we will explore the value that a software
architecture brings to a development project. Later chapters will take
business and organizational perspectives.
Architecture Is an Abstraction
Since architecture consists of structures, and structures consist of elements1
and relations, it follows that an architecture comprises software elements
and how those elements relate to each other. This means that architecture
specifically and intentionally omits certain information about elements that
is not useful for reasoning about the system. Thus an architecture is
foremost an abstraction of a system that selects certain details and
suppresses others. In all modern systems, elements interact with each other
by means of interfaces that partition details about an element into public
and private parts. Architecture is concerned with the public side of this
division; private details of elements—details having to do solely with
internal implementation—are not architectural. This abstraction is essential
to taming the complexity of an architecture: We simply cannot, and do not
want to, deal with all of the complexity all of the time. We want—and need
—the understanding of a system’s architecture to be many orders of
magnitude easier than understanding every detail about that system. You
can’t keep every detail of a system of even modest size in your head; the
point of architecture is to make it so you don’t have to.
1. In this book, we use the term “element” when we mean either a module
or a component, and don’t want to distinguish between the two.
System Architecture
A system’s architecture is a representation of a system in which there
is a mapping of functionality onto hardware and software components,
a mapping of the software architecture onto the hardware architecture,
and a concern for the human interaction with these components. That
is, system architecture is concerned with the totality of hardware,
software, and humans.
A system architecture will influence, for example, the functionality
that is assigned to different processors and the types of networks that
connect those processors. The software architecture will determine
how this functionality is structured and how the software programs
residing on the various processors interact.
A description of the software architecture, as it is mapped to
hardware and networking components, allows reasoning about
qualities such as performance and reliability. A description of the
system architecture will allow reasoning about additional qualities
such as power consumption, weight, and physical dimensions.
When designing a particular system, there is frequently negotiation
between the system architect and the software architect over the
distribution of functionality and, consequently, the constraints placed
on the software architecture.
Enterprise Architecture
Enterprise architecture is a description of the structure and behavior of
an organization’s processes, information flow, personnel, and
organizational subunits. An enterprise architecture need not include
computerized information systems—clearly, organizations had
architectures that fit the preceding definition prior to the advent of
computers—but these days enterprise architectures for all but the
smallest businesses are unthinkable without information system
support. Thus a modern enterprise architecture is concerned with how
software systems support the enterprise’s business processes and
goals. Typically included in this set of concerns is a process for
deciding which systems with which functionality the enterprise should
support.
An enterprise architecture will specify, for example, the data model
that various systems use to interact. It will also specify rules for how
the enterprise’s systems interact with external systems.
Software is only one concern of enterprise architecture. How the
software is used by humans to perform business processes and the
standards that determine the computational environment are two other
common concerns addressed by enterprise architecture.
Sometimes the software infrastructure that supports communication
among systems and with the external world is considered a portion of
the enterprise architecture; at other times, this infrastructure is
considered one of the systems within an enterprise. (In either case, the
architecture of that infrastructure is a software architecture!) These
two views will result in different management structures and spheres
of influence for the individuals concerned with the infrastructure.
Are These Disciplines in Scope for This Book? Yes! (Well, No.)
The system and the enterprise provide environments for, and
constraints on, the software architecture. The software architecture
must live within the system and the enterprise, and increasingly is the
focus for achieving the organization’s business goals. Enterprise and
system architectures share a great deal with software architectures. All
can be designed, evaluated, and documented; all answer to
requirements; all are intended to satisfy stakeholders; all consist of
structures, which in turn consist of elements and relationships; all have
a repertoire of patterns at their respective architects’ disposal; and the
list goes on. So to the extent that these architectures share
commonalities with software architecture, they are in the scope of this
book. But like all technical disciplines, each has its own specialized
vocabulary and techniques, and we won’t cover those. Copious other
sources exist that do.
Decomposition structure. The units are modules that are related to each
other by the “is-a-submodule-of” relation, showing how modules are
decomposed into smaller modules recursively until the modules are
small enough to be easily understood. Modules in this structure
represent a common starting point for design, as the architect
enumerates what the units of software will have to do and assigns each
item to a module for subsequent (more detailed) design and eventual
implementation. Modules often have products (such as interface
specifications, code, and test plans) associated with them. The
decomposition structure determines, to a large degree, the system’s
modifiability. That is, do changes fall within the purview of a few
(preferably small) modules? This structure is often used as the basis for
the development project’s organization, including the structure of the
documentation, and the project’s integration and test plans. Figure 1.5
shows an example of a decomposition structure.
Figure 1.5 A decomposition structure
Layer structure. The modules in this structure are called layers. A layer
is an abstract “virtual machine” that provides a cohesive set of services
through a managed interface. Layers are allowed to use other layers in a
managed fashion; in strictly layered systems, a layer is only allowed to
use a single other layer. This structure imbues a system with portability
—that is, the ability to change the underlying virtual machine. Figure
1.7 shows a layer structure of the UNIX System V operating system.
Data model. The data model describes the static information structure
in terms of data entities and their relationships. For example, in a
banking system, entities will typically include Account, Customer, and
Loan. Account has several attributes, such as account number, type
(savings or checking), status, and current balance. A relationship may
dictate that one customer can have one or more accounts, and one
account is associated with one or more customers. Figure 1.9 shows an
example of a data model; a minimal code sketch appears after this list.
Service structure. The units here are services that interoperate through a
service coordination mechanism, such as messages. The service
structure is an important structure to help engineer a system composed
of components that may have been developed independently of each
other.
Concurrency structure. This C&C structure allows the architect to
determine opportunities for parallelism and the locations where
resource contention may occur. The units are components, and the
connectors are their communication mechanisms. The components are
arranged into “logical threads.” A logical thread is a sequence of
computations that could be allocated to a separate physical thread later
in the design process. The concurrency structure is used early in the
design process to identify and manage issues associated with concurrent
execution.
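To make the data model structure concrete, here is a minimal sketch, in Python, of the banking entities described above; the class names and fields are illustrative only and are not drawn from any particular system.

from dataclasses import dataclass, field
from enum import Enum


class AccountType(Enum):
    SAVINGS = "savings"
    CHECKING = "checking"


@dataclass
class Customer:
    customer_id: int
    name: str


@dataclass
class Account:
    account_number: int
    account_type: AccountType
    status: str
    current_balance: float
    # One account is associated with one or more customers.
    owners: list = field(default_factory=list)


# One customer can have one or more accounts.
alice = Customer(customer_id=1, name="Alice")
checking = Account(1001, AccountType.CHECKING, "open", 250.0, owners=[alice])
savings = Account(1002, AccountType.SAVINGS, "open", 4000.0, owners=[alice])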
Fewer Is Better
Not all systems warrant consideration of many architectural structures. The
larger the system, the more dramatic the difference between these structures
tends to be; but for small systems, we can often get by with fewer
structures. For example, instead of working with each of several C&C
structures, usually a single one will do. If there is only one process, then the
process structure collapses to a single node and need not be explicitly
represented in the design. If no distribution will occur (that is, if the system
is implemented on a single processor), then the deployment structure is
trivial and need not be considered further. In general, you should design and
document a structure only if doing so brings a positive return on the
investment, usually in terms of decreased development or maintenance
costs.
Architectural Patterns
In some cases, architectural elements are composed in ways that solve
particular problems. These compositions have been found to be useful over
time and over many different domains, so they have been documented and
disseminated. These compositions of architectural elements, which provide
packaged strategies for solving some of the problems facing a system, are
called patterns. Architectural patterns are discussed in detail in Part II of
this book.
1.4 Summary
The software architecture of a system is the set of structures needed to
reason about the system. These structures comprise software elements,
relations among them, and properties of both.
There are three categories of structures:
Module structures show the system as a set of code or data units that
have to be constructed or procured.
Component-and-connector structures show the system as a set of
elements that have runtime behavior (components) and interactions
(connectors).
Allocation structures show how elements from module and C&C
structures relate to nonsoftware structures (such as CPUs, file systems,
networks, and development teams).
The strategies for these and other quality attributes are supremely
architectural. But an architecture alone cannot guarantee the functionality
or quality required of a system. Poor downstream design or implementation
decisions can always undermine an adequate architectural design. As we
like to say (mostly in jest): What the architecture giveth, the implementation
may taketh away. Decisions at all stages of the life cycle—from
architectural design to coding and implementation and testing—affect
system quality. Therefore, quality is not completely a function of an
architectural design. But that’s where it starts.
the user is concerned that the system is fast, reliable, and available
when needed;
the customer (who pays for the system) is concerned that the
architecture can be implemented on schedule and according to budget;
the manager is worried that (in addition to cost and schedule concerns)
the architecture will allow teams to work largely independently,
interacting in disciplined and controlled ways; and
the architect is worried about strategies to achieve all of those goals.
Enhanced reuse
More regular and simpler designs that are more easily understood and
communicated, and bring more reliably predictable outcomes
Easier analysis with greater confidence
Shorter selection time
Greater interoperability
Unprecedented designs are risky. Proven designs are, well, proven. This
is not to say that software design can never be innovative or offer new and
exciting solutions. It can. But these solutions should not be invented for the
sake of novelty; rather, they should be sought when existing solutions are
insufficient to solve the problem at hand.
Properties of software follow from the choice of architectural tactics or
patterns. Tactics and patterns that are more desirable for a particular
problem should improve the resulting design solution, perhaps by making it
easier to arbitrate conflicting design constraints, by increasing insights into
poorly understood design contexts, and by helping surface inconsistencies
in requirements. We will discuss architectural tactics and patterns in Part II.
2.14 Summary
Software architecture is important for a wide variety of technical and
nontechnical reasons. Our List of Thirteen includes the following benefits:
1. An architecture will inhibit or enable a system’s driving quality
attributes.
2. The decisions made in an architecture allow you to reason about and
manage change as the system evolves.
3. The analysis of an architecture enables early prediction of a system’s
qualities.
4. A documented architecture enhances communication among
stakeholders.
5. The architecture is a carrier of the earliest, and hence most-
fundamental, hardest-to-change design decisions.
6. An architecture defines a set of constraints on subsequent
implementation.
7. The architecture dictates the structure of an organization, or vice
versa.
8. An architecture can provide the basis for incremental development.
9. An architecture is the key artifact that allows the architect and the
project manager to reason about cost and schedule.
10. An architecture can be created as a transferable, reusable model that
forms the heart of a product line.
11. Architecture-based development focuses attention on the assembly of
components, rather than simply on their creation.
12. By restricting design alternatives, architecture productively channels
the creativity of developers, reducing design and system complexity.
13. An architecture can be the foundation for training of a new team
member.
Many factors determine the qualities that must be provided for in a system’s
architecture. These qualities go beyond functionality, which is the basic
statement of the system’s capabilities, services, and behavior. Although
functionality and other qualities are closely related, as you will see,
functionality often takes the front seat in the development scheme. This
preference is shortsighted, however. Systems are frequently redesigned not
because they are functionally deficient—the replacements are often
functionally identical—but because they are difficult to maintain, port, or
scale; or they are too slow; or they have been compromised by hackers. In
Chapter 2, we said that architecture was the first place in software creation
in which the achievement of quality requirements could be addressed. It is
the mapping of a system’s functionality onto software structures that
determines the architecture’s support for qualities. In Chapters 4–14, we
discuss how various qualities are supported by architectural design
decisions. In Chapter 20, we show how to integrate all of your drivers,
including quality attribute decisions, into a coherent design.
We have been using the term “quality attribute” loosely, but now it is
time to define it more carefully. A quality attribute (QA) is a measurable or
testable property of a system that is used to indicate how well the system
satisfies the needs of its stakeholders beyond the basic function of the
system. You can think of a quality attribute as measuring the “utility” of a
product along some dimension of interest to a stakeholder.
In this chapter our focus is on understanding the following:
3.1 Functionality
Functionality is the ability of the system to do the work for which it was
intended. Of all of the requirements, functionality has the strangest
relationship to architecture.
First of all, functionality does not determine architecture. That is, given a
set of required functionality, there is no end to the architectures you could
create to satisfy that functionality. At the very least, you could divide up the
functionality in any number of ways and assign the sub-pieces to different
architectural elements.
In fact, if functionality were the only thing that mattered, you wouldn’t
have to divide the system into pieces at all: A single monolithic blob with
no internal structure would do just fine. Instead, we design our systems as
structured sets of cooperating architectural elements—modules, layers,
classes, services, databases, apps, threads, peers, tiers, and on and on—to
make them understandable and to support a variety of other purposes.
Those “other purposes” are the other quality attributes that we’ll examine in
the remaining sections of this chapter, and in the subsequent quality
attribute chapters in Part II.
Although functionality is independent of any particular structure, it is
achieved by assigning responsibilities to architectural elements. This
process results in one of the most basic architectural structures—module
decomposition.
Although responsibilities can be allocated arbitrarily to any module,
software architecture constrains this allocation when other quality attributes
are important. For example, systems are frequently (or perhaps always)
divided so that several people can cooperatively build them. The architect’s
interest in functionality is how it interacts with and constrains other
qualities.
Functional Requirements
After more than 30 years of writing about and discussing the
distinction between functional requirements and quality requirements,
the definition of functional requirements still eludes me. Quality
attribute requirements are well defined: Performance has to do with
the system’s timing behavior, modifiability has to do with the system’s
ability to support changes in its behavior or other qualities after initial
deployment, availability has to do with the system’s ability to survive
failures, and so forth.
Function, however, is a much more slippery concept. An
international standard (ISO 25010) defines functional suitability as
“the capability of the software product to provide functions which
meet stated and implied needs when the software is used under
specified conditions.” That is, functionality is the ability to provide
functions. One interpretation of this definition is that functionality
describes what the system does and quality describes how well the
system does its function. That is, qualities are attributes of the system
and function is the purpose of the system.
This distinction breaks down, however, when you consider the
nature of some of these “functions.” If the function of the software is to
control engine behavior, how can the function be correctly
implemented without considering timing behavior? Is the ability to
control access by requiring a user name/password combination not a
function, even though it is not the purpose of any system?
I much prefer using the word “responsibility” to describe
computations that a system must perform. Questions such as “What
are the timing constraints on that set of responsibilities?”, “What
modifications are anticipated with respect to that set of
responsibilities?”, and “What class of users is allowed to execute that
set of responsibilities?” make sense and are actionable.
The achievement of qualities induces responsibility; think of the
user name/password example just mentioned. Further, one can
identify responsibilities as being associated with a particular set of
requirements.
So does this mean that the term “functional requirement” shouldn’t
be used? People have an understanding of the term, but when
precision is desired, we should talk about sets of specific
responsibilities instead.
Paul Clements has long ranted against the careless use of the term
“nonfunctional,” and now it’s my turn to rant against the careless use
of the term “functional”—which is probably equally ineffectually.
—LB
Not My Problem
Some time ago I was doing an architecture analysis on a complex
system created by and for Lawrence Livermore National Laboratory. If
you visit this organization’s website (llnl.gov) and try to figure out
what Livermore Labs does, you will see the word “security”
mentioned over and over. The lab focuses on nuclear security,
international and domestic security, and environmental and energy
security. Serious stuff . . .
Keeping this emphasis in mind, I asked my clients to describe the
quality attributes of concern for the system that I was analyzing. I’m
sure you can imagine my surprise when security wasn’t mentioned
once! The system stakeholders mentioned performance, modifiability,
evolvability, interoperability, configurability, and portability, and one
or two more, but the word “security” never passed their lips.
Being a good analyst, I questioned this seemingly shocking and
obvious omission. Their answer was simple and, in retrospect,
straightforward: “We don’t care about it. Our systems are not
connected to any external network, and we have barbed-wire fences
and guards with machine guns.”
Of course, someone at Livermore Labs was very interested in
security. But not the software architects. The lesson here is that the
software architect may not bear the responsibility for every QA
requirement.
—RK
3.7 Summary
Functional requirements are satisfied by including an appropriate set of
responsibilities within the design. Quality attribute requirements are
satisfied by the structures and behaviors of the architecture.
One challenge in architectural design is that these requirements are often
captured poorly, if at all. To capture and express a quality attribute
requirement, we recommend the use of a quality attribute scenario. Each
scenario consists of six parts:
1. Source of stimulus
2. Stimulus
3. Environment
4. Artifact
5. Response
6. Response measure
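Purely as an illustration (this is not a notation used in this book), the six parts can be captured as a simple record; the Python names and the sample availability scenario below are hypothetical.

from dataclasses import dataclass


@dataclass
class QualityAttributeScenario:
    source_of_stimulus: str
    stimulus: str
    environment: str
    artifact: str
    response: str
    response_measure: str


# A hypothetical availability scenario expressed in the six-part form.
scenario = QualityAttributeScenario(
    source_of_stimulus="heartbeat monitor",
    stimulus="server becomes unresponsive",
    environment="normal operation",
    artifact="web tier",
    response="fail over to a standby server",
    response_measure="no more than 30 seconds of downtime",
)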
An architectural tactic is a design decision that affects a quality attribute
response. The focus of a tactic is on a single quality attribute response. An
architectural pattern describes a particular recurring design problem that
arises in specific design contexts and presents a well-proven architectural
solution for the problem. Architectural patterns can be seen as “bundles” of
tactics.
An analyst can understand the decisions made in an architecture through
the use of a tactics-based checklist. This lightweight architecture analysis
technique can provide insights into the strengths and weaknesses of the
architecture in a very short amount of time.
Portion of Scenario: Response
Possible values: be temporarily unavailable while a repair is being effected;
operate in a degraded mode while a repair is being effected.

Portion of Scenario: Response measure
Possible values: availability percentage (e.g., 99.999 percent); proportion
(e.g., 99 percent) or rate (e.g., up to 100 per second) of a certain class of
faults that the system prevents, or handles without failing.
Detect Faults
Before any system can take action regarding a fault, the presence of the
fault must be detected or anticipated. Tactics in this category include:
Prevent Faults
Instead of detecting faults and then trying to recover from them, what if
your system could prevent them from occurring in the first place? Although
it might sound as if some measure of clairvoyance would be required, it
turns out that in many cases it is possible to do just that.2
2. These tactics deal with runtime means to prevent faults from occurring.
Of course, an excellent way to prevent faults—at least in the system
you’re building, if not in systems that your system must interact with—
is to produce high-quality code. This can be done by means of code
inspections, pair programming, solid requirements reviews, and a host
of other good engineering practices.
Removal from service. This tactic refers to temporarily placing a system
component in an out-of-service state for the purpose of mitigating
potential system failures. For example, a component of a system might
be taken out of service and reset to scrub latent faults (such as memory
leaks, fragmentation, or soft errors in an unprotected cache) before the
accumulation of faults reaches the service-affecting level, resulting in
system failure. Other terms for this tactic are software rejuvenation and
therapeutic reboot. If you reboot your computer every night, you are
practicing removal from service.
Transactions. Systems targeting high-availability services leverage
transactional semantics to ensure that asynchronous messages
exchanged between distributed components are atomic, consistent,
isolated, and durable—properties collectively referred to as the “ACID
properties.” The most common realization of the transactions tactic is
the “two-phase commit” (2PC) protocol. This tactic prevents race
conditions caused by two processes attempting to update the same data
item at the same time.
Predictive model. A predictive model, when combined with a monitor,
is employed to monitor the state of health of a system process to ensure
that the system is operating within its nominal operating parameters,
and to take corrective action when the system nears a critical threshold.
The operational performance metrics monitored are used to predict the
onset of faults; examples include the session establishment rate (in an
HTTP server), threshold crossing (monitoring high and low watermarks
for some constrained, shared resource), statistics on the process state
(e.g., in-service, out-of-service, under maintenance, idle), and message
queue length statistics.
Exception prevention. This tactic refers to techniques employed for the
purpose of preventing system exceptions from occurring. The use of
exception classes, which allows a system to transparently recover from
system exceptions, was discussed earlier. Other examples of exception
prevention include error-correcting code (used in telecommunications),
abstract data types such as smart pointers, and the use of wrappers to
prevent faults such as dangling pointers or semaphore access violations.
Smart pointers prevent exceptions by doing bounds checking on
pointers, and by ensuring that resources are automatically de-allocated
when no data refers to them, thereby avoiding resource leaks.
Increase competence set. A program’s competence set is the set of
states in which it is “competent” to operate. For example, the state when
the denominator is zero is outside the competence set of most divide
programs. When a component raises an exception, it is signaling that it
has discovered itself to be outside its competence set; in essence, it
doesn’t know what to do and is throwing in the towel. Increasing a
component’s competence set means designing it to handle more cases—
faults—as part of its normal operation. For example, a component that
assumes it has access to a shared resource might throw an exception if it
discovers that access is blocked. Another component might simply wait
for access or return immediately with an indication that it will complete
its operation on its own the next time it does have access. In this
example, the second component has a larger competence set than the
first.
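A minimal sketch of the increase competence set idea, assuming Python and a lock as a stand-in for the shared resource: the first component treats a busy resource as an exception, while the second handles it as part of normal operation.

import threading
import time

lock = threading.Lock()  # stand-in for some shared resource


class NarrowComponent:
    """Outside its competence set when the resource is busy: it gives up."""

    def do_work(self):
        if not lock.acquire(blocking=False):
            raise RuntimeError("resource unavailable")
        try:
            return "done"
        finally:
            lock.release()


class WiderComponent:
    """Treats a busy resource as a normal case rather than a fault."""

    def do_work(self, retries=3, delay=0.1):
        for _ in range(retries):
            if lock.acquire(blocking=False):
                try:
                    return "done"
                finally:
                    lock.release()
            time.sleep(delay)  # wait and try again instead of failing
        return "deferred"  # tell the caller the work will complete later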
You can read about patterns for fault tolerance in [Hanmer 13].
Disaster recovery:
There comes a day when software, like the rest of us, must leave home and
venture out into the world and experience real life. Unlike the rest of us,
software typically makes the trip many times, as changes and updates are
made. This chapter is about making that transition as orderly and as
effective and—most of all—as rapid as possible. That is the realm of
continuous deployment, which is most enabled by the quality attribute of
deployability.
Why has deployability come to take a front-row seat in the world of
quality attributes?
In the “bad old days,” releases were infrequent—large numbers of
changes were bundled into releases and scheduled. A release would contain
new features and bug fixes. One release per month, per quarter, or even per
year was common. Competitive pressures in many domains—with the
charge being led by e-commerce—resulted in a need for much shorter
release cycles. In these contexts, releases can occur at any time—possibly
hundreds of releases per day—and each can be instigated by a different
team within an organization. Being able to release frequently means that
bug fixes in particular do not have to wait until the next scheduled release,
but rather can be made and released as soon as a bug is discovered and
fixed. It also means that new features do not need to be bundled into a
release, but can be put into production at any time.
This is not desirable, or even possible, in all domains. If your software
exists in a complex ecosystem with many dependencies, it may not be
possible to release just one part of it without coordinating that release with
the other parts. In addition, many embedded systems, systems in hard-to-
access locations, and systems that are not networked would be poor
candidates for a continuous deployment mindset.
This chapter focuses on the large and growing numbers of systems for
which just-in-time feature releases are a significant competitive advantage,
and just-in-time bug fixes are essential to safety or security or continuous
operation. Often these systems are microservice and cloud-based, although
the techniques here are not limited to those technologies.
DevOps
DevOps—a portmanteau of “development” and “operations”—is a
concept closely associated with continuous deployment. It is a
movement (much like the Agile movement), a description of a set of
practices and tools (again, much like the Agile movement), and a
marketing formula touted by vendors selling those tools. The goal of
DevOps is to shorten time to market (or time to release). The goal is to
dramatically shorten the time between a developer making a change to
an existing system—implementing a feature or fixing a bug—and the
system reaching the hands of end users, as compared with traditional
software development practices.
A formal definition of DevOps captures both the frequency of
releases and the ability to perform bug fixes on demand:
DevOps is a set of practices intended to reduce the time between
committing a change to a system and the change being placed into
normal production, while ensuring high quality. [Bass 15]
Implementing DevOps is a process improvement effort. DevOps
encompasses not only the cultural and organizational elements of any
process improvement effort, but also a strong reliance on tools and
architectural design. All environments are different, of course, but the
tools and automation we describe are found in the typical tool chains
built to support DevOps.
The continuous deployment strategy we describe here is the
conceptual heart of DevOps. Automated testing is, in turn, a critically
important ingredient of continuous deployment, and the tooling for
that often represents the highest technological hurdle for DevOps.
Some forms of DevOps include logging and post-deployment
monitoring of those logs, for automatic detection of errors back at the
“home office,” or even monitoring to understand the user experience.
This, of course, requires a “phone home” or log delivery capability in
the system, which may or may not be possible or allowable in some
systems.
DevSecOps is a flavor of DevOps that incorporates approaches for
security (for the infrastructure and for the applications it produces)
into the entire process. DevSecOps is increasingly popular in
aerospace and defense applications, but is also valid in any
application area where DevOps is useful and a security breach would
be particularly costly. Many IT applications fall in this category.
5.2 Deployability
Deployability refers to a property of software indicating that it may be
deployed—that is, allocated to an environment for execution—within a
predictable and acceptable amount of time and effort. Moreover, if the new
deployment is not meeting its specifications, it may be rolled back, again
within a predictable and acceptable amount of time and effort. As the world
moves increasingly toward virtualization and cloud infrastructures, and as
the scale of deployed software-intensive systems inevitably increases, it is
one of the architect’s responsibilities to ensure that deployment is done in
an efficient and predictable way, minimizing overall system risk.3
3. The quality attribute of testability (see Chapter 12) certainly plays a
critical role in continuous deployment, and the architect can provide
critical support for continuous deployment by ensuring that the system
is testable, in all the ways just mentioned. However, our concern here is
the quality attribute directly related to continuous deployment over and
above testability: deployability.
To achieve these goals, an architect needs to consider how an executable
is updated on a host platform, and how it is subsequently invoked,
measured, monitored, and controlled. Mobile systems in particular present
a challenge for deployability in terms of how they are updated because of
concerns about bandwidth. Some of the issues involved in deploying
software are as follows:
How does it arrive at its host (i.e., push, where updates deployed are
unbidden, or pull, where users or administrators must explicitly request
updates)?
How is it integrated into an existing system? Can this be done while the
existing system is executing?
What is the medium, such as DVD, USB drive, or Internet delivery?
What is the packaging (e.g., executable, app, plug-in)?
What is the resulting integration into an existing system?
What is the efficiency of executing the process?
What is the controllability of the process?
With all of these concerns, the architect must be able to assess the
associated risks. Architects are primarily concerned with the degree to
which the architecture supports deployments that are:
Average/worst-case effort
Next, we describe these six deployability tactics in more detail. The first
category of deployability tactics focuses on strategies for managing the
deployment pipeline, and the second category deals with managing the
system as it is being deployed and once it has been deployed.
Microservice Architecture
The microservice architecture pattern structures the system as a collection
of independently deployable services that communicate only via messages
through service interfaces. There is no other form of interprocess
communication allowed: no direct linking, no direct reads of another team’s
data store, no shared-memory model, no back-doors whatsoever. Services
are usually stateless, and (because they are developed by a single relatively
small team4) are relatively small—hence the term microservice. Service
dependencies are acyclic. An integral part of this pattern is a discovery
service so that messages can be appropriately routed.
4. At Amazon, service teams are constrained in size by the “two pizza
rule”: The team must be no larger than can be fed by two pizzas.
Benefits:
Tradeoffs:
Benefits:
The benefit of these patterns is the ability to completely replace
deployed versions of services without having to take the system out of
service, thus increasing the system’s availability.
Tradeoffs:
Canary Testing
Before rolling out a new release, it is prudent to test it in the production
environment, but with a limited set of users. Canary testing is the
continuous deployment analog of beta testing.5 Canary testing designates a
small set of users who will test the new release. Sometimes, these testers are
so-called power users or preview-stream users from outside your
organization who are more likely to exercise code paths and edge cases that
typical users may use less frequently. Users may or may not know that they
are being used as guinea pigs—er, that is, canaries. Another approach is to
use testers from within the organization that is developing the software. For
example, Google employees almost never use the release that external users
would be using, but instead act as testers for upcoming releases. When the
focus of the testing is on determining how well new features are accepted, a
variant of canary testing called dark launch is used.
5. Canary testing is named after the 19th-century practice of bringing
canaries into coal mines. Coal mining releases gases that are explosive
and poisonous. Because canaries are more sensitive to these gases than
humans, coal miners brought canaries into the mines and watched them
for signs of reaction to the gases. The canaries acted as early warning
devices for the miners, indicating an unsafe environment.
In both cases, the users are designated as canaries and routed to the
appropriate version of a service through DNS settings or through
discovery-service configuration. After testing is complete, users are all
directed to either the new version or the old version, and instances of the
deprecated version are destroyed. Rolling upgrade or blue/green
deployment could be used to deploy the new version.
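The routing step might look like the following sketch, which assumes a simple in-memory lookup table standing in for DNS or discovery-service configuration; the user names and URLs are hypothetical.

CANARY_USERS = {"power_user_1", "preview_tester_7"}  # hypothetical canary population

SERVICE_VERSIONS = {
    "checkout": {
        "stable": "http://checkout-v1.internal",
        "canary": "http://checkout-v2.internal",
    },
}


def resolve(service_name, user_id):
    """Route canaries to the new version; everyone else gets the current one."""
    versions = SERVICE_VERSIONS[service_name]
    return versions["canary"] if user_id in CANARY_USERS else versions["stable"]


# After testing completes, either promote the canary entry for all users or
# delete it and destroy the deprecated instances.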
Benefits:
Canary testing allows real users to “bang on” the software in ways that
simulated testing cannot. This allows the organization deploying the
service to collect “in use” data and perform controlled experiments with
relatively low risk.
Canary testing incurs minimal additional development costs, because
the system being tested is on a path to production anyway.
Canary testing minimizes the number of users who may be exposed to a
serious defect in the new system.
Tradeoffs:
A/B Testing
A/B testing is used by marketers to perform an experiment with real users
to determine which of several alternatives yields the best business results. A
small but meaningful number of users receive a different treatment from the
remainder of the users. The difference can be minor, such as a change to the
font size or form layout, or it can be more significant. For example,
HomeAway (now Vrbo) has used A/B testing to vary the format, content,
and look-and-feel of its worldwide websites, tracking which editions
produced the most rentals. The “winner” would be kept, the “loser”
discarded, and another contender designed and deployed. Another example
is a bank offering different promotions to open new accounts. An oft-
repeated story is that Google tested 41 different shades of blue to decide
which shade to use to report search results.
As in canary testing, DNS servers and discovery-service configurations
are set to send client requests to different versions. In A/B testing, the
different versions are monitored to see which one provides the best
response from a business perspective.
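A sketch of the mechanics, assuming Python and deterministic hashing so that each user consistently sees the same treatment; the variant names and metrics are hypothetical.

import hashlib
from collections import Counter

VARIANTS = ["layout_a", "layout_b"]  # hypothetical treatments
views = Counter()
conversions = Counter()


def assign_variant(user_id, experiment="signup_form"):
    """Deterministically bucket a user so repeat visits get the same treatment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def record_view(user_id):
    views[assign_variant(user_id)] += 1


def record_conversion(user_id):
    conversions[assign_variant(user_id)] += 1


def conversion_rates():
    # The business measure used to pick the "winner."
    return {v: conversions[v] / views[v] for v in VARIANTS if views[v]}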
Benefits:
Tradeoffs:
Portion of Scenario: Response
Possible values: disable services; deallocate runtime services; change
allocation of services to servers; allocate/deallocate servers; change
scheduling.

Portion of Scenario: Response measure
Description: The measures revolve around the amount of energy saved or
consumed and the effects on other functions or quality attributes.
Possible values: energy managed or saved in terms of maximum/average
kilowatt load on the system; average/total amount of energy saved.
Monitor Resources
You can’t manage what you can’t measure, and so we begin with resource
monitoring. The tactics for resource monitoring are metering, static
classification, and dynamic classification.
Metering. The metering tactic involves collecting data about the energy
consumption of computational resources via a sensor infrastructure, in
near real time. At the coarsest level, the energy consumption of an
entire data center can be measured from its power meter. Individual
servers or hard drives can be measured using external tools such as amp
meters or watt-hour meters, or using built-in tools such as those
provided with metered rack PDUs (power distribution units), ASICs
(application-specific integrated circuits), and so forth. In battery-
operated systems, the energy remaining in a battery can be determined
through a battery management system, which is a component of modern
batteries.
Static classification. Sometimes real-time data collection is infeasible.
For example, if an organization is using an off-premises cloud, it might
not have direct access to real-time energy data. Static classification
allows us to estimate energy consumption by cataloging the computing
resources used and their known energy characteristics—the amount of
energy used by a memory device per fetch, for example. These
characteristics are available as benchmarks, or from manufacturers’
specifications; a brief sketch of this tactic appears after this list.
Dynamic classification. In cases where a static model of a
computational resource is inadequate, a dynamic model might be
required. Unlike static models, dynamic models estimate energy
consumption based on knowledge of transient conditions such as
workload. The model could be a simple table lookup, a regression
model based on data collected during prior executions, or a simulation.
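A sketch of the static classification tactic described above, assuming per-operation energy figures taken from benchmarks or manufacturers’ specifications; the numbers below are placeholders, not real measurements.

# Hypothetical per-operation energy costs (joules), e.g., from specifications.
ENERGY_PER_OPERATION_JOULES = {
    "memory_fetch": 1.0e-9,
    "disk_read_block": 5.0e-4,
    "network_packet_sent": 1.0e-6,
}


def estimate_energy(operation_counts):
    """Estimate consumption from catalogued resource characteristics."""
    return sum(ENERGY_PER_OPERATION_JOULES[op] * count
               for op, count in operation_counts.items())


estimate = estimate_energy(
    {"memory_fetch": 2_000_000, "disk_read_block": 150, "network_packet_sent": 10_000}
)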
Allocate Resources
Resource allocation means assigning resources to do work in a way that is
mindful of energy consumption. The tactics for resource allocation are to
reduce usage, discovery, and scheduling.
6.4 Patterns
Some examples of patterns used for energy efficiency include sensor fusion,
kill abnormal tasks, and power monitor.
Sensor Fusion
Mobile apps and IoT systems often collect data from their environment
using multiple sensors. In this pattern, data from low-power sensors can be
used to infer whether data needs to be collected from higher-power sensors.
A common example in the mobile phone context is using accelerometer
data to assess if the user has moved and, if so, to update the GPS location.
This pattern assumes that accessing the low-power sensor is much cheaper,
in terms of energy consumption, than accessing the higher-power sensor.
Benefits:
The obvious benefit of this pattern is the ability to minimize the usage
of more energy-intensive devices in an intelligent way rather than, for
example, just reducing the frequency of consulting the more energy-
intensive sensor.
Tradeoffs:
Benefits:
Tradeoffs:
Power Monitor
The power monitor pattern monitors and manages system devices,
minimizing the time during which they are active. This pattern attempts to
automatically disable devices and interfaces that are not being actively used
by the application. It has long been used within integrated circuits, where
blocks of the circuit are shut down when they are not being used, in an
effort to save energy.
Benefits:
Tradeoffs:
Portion of Scenario: Source of stimulus
Possible values: mission/system stakeholder; component marketplace;
component vendor.

Portion of Scenario: Stimulus
Description: What is the stimulus? That is, what kind of integration is
being described?
Possible values include: add new component.

Portion of Scenario: Artifact
Possible values: entire system; component metadata; component configuration.

Portion of Scenario: Environment
Description: What state is the system in when the stimulus occurs?
Possible values: development; integration; deployment; runtime.

Portion of Scenario: Response
Description: How will an “integrable” system respond to the stimulus?

Portion of Scenario: Response measure
Possible values include (one or more of the following): effort; money;
calendar time.
Limit Dependencies
Encapsulate
Encapsulation is the foundation upon which all other integrability tactics
are built. It is therefore seldom seen on its own, but its use is implicit in the
other tactics described here.
Encapsulation introduces an explicit interface to an element and ensures
that all access to the element passes through this interface. Dependencies on
the element internals are eliminated, because all dependencies must flow
through the interface. Encapsulation reduces the probability that a change
to one element will propagate to other elements, by reducing either the
number of dependencies or their distances. These strengths are, however,
reduced because the interface limits the ways in which external
responsibilities can interact with the element (perhaps through a wrapper).
In consequence, the external responsibilities can only directly interact with
the element through the exposed interface (indirect interactions, such as
dependence on quality of service, will likely remain unchanged).
Encapsulation may also hide interfaces that are not relevant for a
particular integration task. An example is a library used by a service that
can be completely hidden from all consumers and changed without these
changes propagating to the consumers.
Encapsulation, then, can reduce the number of dependencies as well as
the syntactic, data, and behavior semantic distances between C and S.
Use an Intermediary
Intermediaries are used for breaking dependencies between a set of
components Ci or between Ci and the system S. Intermediaries can be used
to resolve different types of dependencies. For example, intermediaries such
as a publish–subscribe bus, shared data repository, or dynamic service
discovery all reduce dependencies between data producers and consumers
by removing any need for either to know the identity of the other party.
Other intermediaries, such as data transformers and protocol translators,
resolve forms of syntactic and data semantic distance.
Determining the specific benefits of a particular intermediary requires
knowledge of what the intermediary actually does. An analyst needs to
determine whether the intermediary reduces the number of dependencies
between a component and the system and which dimensions of distance, if
any, it addresses.
Intermediaries are often introduced during integration to resolve specific
dependencies, but they can also be included in an architecture to promote
integrability with respect to anticipated scenarios. Including a
communication intermediary such as a publish–subscribe bus in an
architecture, and then restricting communication paths to and from sensors
to this bus, is an example of using an intermediary with the goal of
promoting integrability of sensors.
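A minimal sketch of a publish–subscribe intermediary in Python; the topic name and subscribers are illustrative.

from collections import defaultdict


class PublishSubscribeBus:
    """Producers and consumers interact only with the bus, never with each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


bus = PublishSubscribeBus()
# A sensor publishes readings without knowing who, if anyone, consumes them.
bus.subscribe("temperature", lambda reading: print("logger got", reading))
bus.subscribe("temperature", lambda reading: print("dashboard got", reading))
bus.publish("temperature", {"celsius": 21.5})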
Restrict Communication Paths
This tactic restricts the set of elements with which a given element can
communicate. In practice, this tactic is implemented by restricting an
element’s visibility (when developers cannot see an interface, they cannot
employ it) and by authorization (i.e., restricting access to only authorized
elements). The restrict communication paths tactic is seen in service-
oriented architectures (SOAs), in which point-to-point requests are
discouraged in favor of forcing all requests to go through an enterprise
service bus so that routing and preprocessing can be done consistently.
Adhere to Standards
Standardization in system implementations is a primary enabler of
integrability and interoperability, across both platforms and vendors.
Standards vary considerably in terms of the scope of what they prescribe.
Some focus on defining syntax and data semantics. Others include richer
descriptions, such as those describing protocols that include behavioral and
temporal semantics.
Standards similarly vary in their scope of applicability or adoption. For
example, standards published by widely recognized standards-setting
organizations such as the Institute of Electrical and Electronics Engineers
(IEEE), the International Organization for Standardization (ISO), and the
Object Management Group (OMG) are more likely to be broadly adopted.
Conventions that are local to an organization, particularly if well
documented and enforced, can provide similar benefits as “local standards,”
though with less expectation of benefits when integrating components from
outside the local standard’s sphere of adoption.
Adopting a standard can be an effective integrability tactic, although its
effectiveness is limited to benefits based on the dimensions of difference
addressed in the standard and how likely it is that future component
suppliers will conform to the standard. Restricting communication with a
system S to require use of the standard often reduces the number of
potential dependencies. Depending on what is defined in a standard, it may
also address syntactic, data semantic, behavioral semantic, and temporal
dimensions of distance.
Abstract Common Services
Where two elements provide services that are similar but not quite the
same, it may be useful to hide both specific elements behind a common
abstraction for a more general service. This abstraction might be realized as
a common interface implemented by both, or it might involve an
intermediary that translates requests for the abstract service to more specific
requests for the elements hidden behind the abstraction. The resulting
encapsulation hides the details of the elements from other components in
the system. In terms of integrability, this means that future components can
be integrated with a single abstraction rather than separately integrated with
each of the specific elements.
When the abstract common services tactic is combined with an
intermediary (such as a wrapper or adapter), it can also normalize syntactic
and semantic variations among the specific elements. For example, we see
this when systems use many sensors of the same type from different
manufacturers, each with its own device drivers, accuracy, or timing
properties, but the architecture provides a common interface to them. As
another example, your browser may accommodate various kinds of ad-
blocking plug-ins, yet because of the plug-in interface the browser itself can
remain blissfully unaware of your choice.
Abstracting common services allows for consistency when handling
common infrastructure concerns (e.g., translations, security mechanisms,
and logging). When these features change, or when new versions of the
components implementing these features change, the changes can be made
in a smaller number of places. An abstract service is often paired with an
intermediary that may perform processing to hide syntactic and data
semantic differences among specific elements.
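A sketch of the sensor example, assuming Python and two hypothetical vendors whose drivers report temperature in different units; clients depend only on the common abstraction.

from abc import ABC, abstractmethod


class TemperatureSensor(ABC):
    """The common abstraction that future components integrate against."""

    @abstractmethod
    def read_celsius(self) -> float:
        ...


class VendorASensor(TemperatureSensor):
    def read_celsius(self) -> float:
        return self._driver_read() / 100.0  # vendor A reports hundredths of a degree

    def _driver_read(self) -> int:
        return 2150  # stand-in for the real device driver


class VendorBSensor(TemperatureSensor):
    def read_celsius(self) -> float:
        return (self._driver_read() - 32) * 5 / 9  # vendor B reports Fahrenheit

    def _driver_read(self) -> float:
        return 70.7  # stand-in for the real device driver


def average_temperature(sensors):
    # Clients see one interface, regardless of which vendor element is behind it.
    return sum(s.read_celsius() for s in sensors) / len(sensors)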
Adapt
Discover
A discovery service is a catalog of relevant addresses, which comes in
handy whenever there is a need to translate from one form of address to
another, whenever the target address may have been dynamically bound, or
when there are multiple targets. It is the mechanism by which applications
and services locate each other. A discovery service may be used to
enumerate variants of particular elements that are used in different products.
Entries in a discovery service are there because they were registered.
This registration can happen statically, or it can happen dynamically when a
service is instantiated. Entries in the discovery service should be de-
registered when they are no longer relevant. Again, this can be done
statically, such as with a DNS server, or dynamically. Dynamic de-
registration can be handled by the discovery service itself performing health
checks on its entries, or it can be carried out by an external piece of
software that knows when a particular entry in the catalog is no longer
relevant.
A discovery service may include entries that are themselves discovery
services. Likewise, entries in a discovery service may have additional
attributes, which a query may reference. For example, a weather discovery
service may have an attribute of “cost of forecast”; you can then ask a
weather discovery service for a service that provides free forecasts.
The discover tactic works by reducing the dependencies between
cooperating services, which should be written without knowledge of each
other. This enables flexibility in the binding between services, as well as
when that binding occurs.
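A sketch of a discovery service with queryable attributes, echoing the weather-forecast example above; the service names, addresses, and attributes are hypothetical.

class DiscoveryService:
    """A catalog of addresses; entries carry attributes that queries can reference."""

    def __init__(self):
        self._entries = []

    def register(self, name, address, **attributes):
        self._entries.append({"name": name, "address": address, **attributes})

    def deregister(self, address):
        self._entries = [e for e in self._entries if e["address"] != address]

    def lookup(self, name, **required):
        return [e for e in self._entries
                if e["name"] == name
                and all(e.get(k) == v for k, v in required.items())]


registry = DiscoveryService()
registry.register("weather-forecast", "http://wx-premium.example", cost_of_forecast="paid")
registry.register("weather-forecast", "http://wx-free.example", cost_of_forecast="free")
free_forecasters = registry.lookup("weather-forecast", cost_of_forecast="free")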
Tailor Interface
Tailoring an interface is a tactic that adds capabilities to, or hides
capabilities in, an existing interface without changing the API or
implementation. Capabilities such as translation, buffering, and data
smoothing can be added to an interface without changing it. An example of
removing capabilities is hiding particular functions or parameters from
untrusted users. A common dynamic application of this tactic is intercepting
filters that add functionality such as data validation to help prevent SQL
injections or other attacks, or to translate between data formats. Another
example is using techniques from aspect-oriented programming that weave
in preprocessing and postprocessing functionality at compile time.
The tailor interface tactic allows functionality that is needed by many
services to be added or hidden based on context and managed
independently. It also enables services with syntactic differences to
interoperate without modification to either service.
This tactic is typically applied during integration; however, designing an
architecture so that it facilitates interface tailoring can support integrability.
Interface tailoring is commonly used to resolve syntactic and data semantic
distance during integration. It can also be applied to resolve some forms of
behavioral semantic distance, though it can be more complex to do (e.g.,
maintaining a complex state to accommodate protocol differences) and is
perhaps more accurately categorized as introducing an intermediary.
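A sketch of interface tailoring via an intercepting wrapper that adds data validation without changing the wrapped element’s API; the order service and its checks are hypothetical.

class OrderService:
    def place_order(self, item_id, quantity):
        return f"ordered {quantity} of {item_id}"


class ValidatingOrderService:
    """Adds validation to the interface; callers use the same API as before."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def place_order(self, item_id, quantity):
        if not isinstance(quantity, int) or quantity <= 0:
            raise ValueError("quantity must be a positive integer")
        if ";" in str(item_id):  # crude guard against injection-style input
            raise ValueError("invalid characters in item_id")
        return self._wrapped.place_order(item_id, quantity)


service = ValidatingOrderService(OrderService())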
Configure Behavior
The tactic of configuring behavior is used by software components that are
implemented to be configurable in prescribed ways that allow them to more
easily interact with a range of components. The behavior of a component
can be configured during the build phase (recompile with a different flag),
during system initialization (read a configuration file or fetch data from a
database), or during runtime (specify a protocol version as part of your
requests). A simple example is configuring a component to support
different versions of a standard on its interfaces. Ensuring that multiple
options are available increases the chances that the assumptions of S and a
future C will match.
Building configurable behavior into portions of S is an integrability tactic
that allows S to support a wider range of potential Cs. This tactic can
potentially address syntactic, data semantic, behavioral semantic, and
temporal dimensions of distance.
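A minimal sketch in Python (the ExportService name, configuration keys, and file format are illustrative assumptions) shows behavior being configured both at initialization, from a configuration file, and at runtime, per request:

    import json

    class ExportService:
        """A component S whose behavior is configurable in prescribed ways."""

        def __init__(self, config_path):
            # Initialization-time configuration: read prescribed options from a file,
            # e.g. {"protocol_version": "2.0", "supported_formats": ["json", "xml"]}
            with open(config_path) as f:
                cfg = json.load(f)
            self.default_version = cfg.get("protocol_version", "1.0")
            self.formats = cfg.get("supported_formats", ["json"])

        def export(self, data, protocol_version=None, fmt="json"):
            # Runtime configuration: a future component C may select a protocol
            # version per request instead of being forced onto a single one.
            version = protocol_version or self.default_version
            if fmt not in self.formats:
                raise ValueError(f"format {fmt!r} not supported")
            return {"version": version, "format": fmt, "payload": data}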
Coordinate
Orchestrate
Orchestrate is a tactic that uses a control mechanism to coordinate and
manage the invocation of particular services so that they can remain
unaware of each other.
Orchestration helps with the integration of a set of loosely coupled
reusable services to create a system that meets a new need. Integration costs
are reduced when orchestration is included in an architecture in a way that
supports the services that are likely to be integrated in the future. This tactic
allows future integration activities to focus on integration with the
orchestration mechanism instead of point-to-point integration with multiple
components.
Workflow engines commonly make use of the orchestrate tactic. A
workflow is a set of organized activities that order and coordinate software
components to complete a business process. It may consist of other
workflows, each of which may itself consist of aggregated services. The
workflow model encourages reuse and agility, leading to more flexible
business processes. Business processes can be managed under a philosophy
of business process management (BPM) that views processes as a set of
competitive assets to be managed. Complex orchestration can be specified
in a language such as BPEL (Business Process Execution Language).
Orchestration works by reducing the number of dependencies between a
system S and new components {Ci}, and eliminating altogether the explicit
dependencies among the components {Ci}, by centralizing those
dependencies at the orchestration mechanism. It may also reduce syntactic
and data semantic distance if the orchestration mechanism is used in
conjunction with tactics such as adherence to standards.
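A minimal sketch of the tactic in Python follows; the Orchestrator class and the order-processing steps are invented for illustration, and a production system would more likely use a workflow engine as described above:

    class Orchestrator:
        """Central coordinator: services register a callable and never call each other."""

        def __init__(self):
            self._steps = []

        def add_step(self, name, service_callable):
            self._steps.append((name, service_callable))

        def run(self, request):
            context = {"request": request}
            for name, call in self._steps:
                # Each service sees only the shared context, not the other services.
                context[name] = call(context)
            return context

    # Illustrative services for a hypothetical order workflow:
    def check_inventory(ctx):
        return {"in_stock": True}

    def charge_payment(ctx):
        return {"charged": ctx["check_inventory"]["in_stock"]}

    flow = Orchestrator()
    flow.add_step("check_inventory", check_inventory)
    flow.add_step("charge_payment", charge_payment)
    print(flow.run({"order_id": 42}))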
Manage Resources
A resource manager is a specific form of intermediary that governs access
to computing resources; it is similar to the restrict communication paths
tactic. With this tactic, software components are not allowed to directly
access some computing resources (e.g., threads or blocks of memory), but
instead request those resources from a resource manager. Resource
managers are typically responsible for allocating resource access across
multiple components in a way that preserves some invariants (e.g., avoiding
resource exhaustion or concurrent use), enforces some fair access policy, or
both. Examples of resource managers include operating systems,
transaction mechanisms in databases, use of thread pools in enterprise
systems, and use of the ARINC 653 standard for space and time partitioning
in safety-critical systems.
The manage resource tactic works by reducing the resource distance
between a system S and a component C, by clearly exposing the resource
requirements and managing their common use.
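As a minimal sketch of the tactic, the following Python connection pool (the ConnectionPool name and its use of a semaphore are illustrative choices, not a prescribed design) mediates access to a fixed set of resources so that clients cannot exhaust them:

    import threading

    class ConnectionPool:
        """A resource manager: components request and release connections
        rather than creating them directly, preventing resource exhaustion."""

        def __init__(self, size):
            self._available = list(range(size))        # stand-ins for real connections
            self._slots = threading.Semaphore(size)    # invariant: at most `size` in use
            self._lock = threading.Lock()

        def acquire(self):
            self._slots.acquire()                      # blocks if the pool is exhausted
            with self._lock:
                return self._available.pop()

        def release(self, conn):
            with self._lock:
                self._available.append(conn)
            self._slots.release()

    pool = ConnectionPool(size=2)
    conn = pool.acquire()
    # ... use the connection ...
    pool.release(conn)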
7.4 Tactics-Based Questionnaire for Integrability
Based on the tactics described in Section 7.3, we can create a set of
integrability tactics–inspired questions, as presented in Table 7.2. To gain
an overview of the architectural choices made to support integrability, the
analyst asks each question and records the answers in the table. The
answers to these questions can then be made the focus of further activities:
investigation of documentation, analysis of code or other artifacts, reverse
engineering of code, and so forth.
Table 7.2 Tactics-Based Questionnaire for Integrability
For each question, the analyst records whether the tactic is supported (Y/N), the associated risk, the design decisions made and their location in the architecture, and the rationale and assumptions behind them.

Tactics group: Limit Dependencies
Does the system encapsulate the functionality of each element by introducing explicit interfaces and requiring that all access to the elements passes through these interfaces?
Does the system broadly use intermediaries for breaking dependencies between components—for example, removing a data producer's knowledge of its consumers?
Does the system abstract common services, providing a general, abstract interface for similar services?
Does the system provide a means to restrict communication paths between components?
Does the system adhere to standards in terms of how components interact and share information with each other?

Tactics group: Adapt
Does the system provide the ability to statically (i.e., at compile time) tailor interfaces—that is, the ability to add or hide capabilities of a component's interface without changing its API or implementation?
Does the system provide a discovery service, cataloguing and disseminating information about services?
Does the system provide a means to configure the behavior of components at build, initialization, or runtime?

Tactics group: Coordinate
Does the system include an orchestration mechanism that coordinates and manages the invocation of components so they can remain unaware of each other?
Does the system provide a resource manager that governs access to computing resources?
7.5 Patterns
The first three patterns are all centered on the tailor interface tactic, and are
described here as a group:
Benefits:
All three patterns allow access to an element without forcing a change
to the element or its interface.
Tradeoffs:
Benefits:
Tradeoffs:
Dynamic Discovery
Dynamic discovery applies the discovery tactic to enable the discovery of
service providers at runtime. Consequently, a runtime binding can occur
between a service consumer and a concrete service.
Use of a dynamic discovery capability sets the expectation that the
system will clearly advertise both the services available for integration with
future components and the minimal information that will be available for
each service. The specific information available will vary, but typically
comprises data that can be mechanically searched during discovery and
runtime integration (e.g., identifying a specific version of an interface
standard by string match).
Benefits:
Tradeoffs:
Change happens.
Study after study shows that most of the cost of the typical software
system occurs after it has been initially released. If change is the only
constant in the universe, then software change is not only constant but
ubiquitous. Changes happen to add new features, to alter or even retire old
ones. Changes happen to fix defects, tighten security, or improve
performance. Changes happen to enhance the user’s experience. Changes
happen to embrace new technology, new platforms, new protocols, new
standards. Changes happen to make systems work together, even if they
were never designed to do so.
Modifiability is about change, and our interest in it is to lower the cost
and risk of making changes. To plan for modifiability, an architect has to
consider four questions:
What can change? A change can occur to any aspect of a system: the
functions that the system computes, the platform (the hardware,
operating system, middleware), the environment in which the system
operates (the systems with which it must interoperate, the protocols it
uses to communicate with the rest of the world), the qualities the system
exhibits (its performance, its reliability, and even its future
modifications), and its capacity (number of users supported, number of
simultaneous operations).
What is the likelihood of the change? One cannot plan a system for all potential changes—the system would never be done, or if it were done, it would be far too expensive and would likely suffer quality attribute problems in other dimensions. Although anything might change, the
architect has to make the tough decisions about which changes are
likely, and hence which changes will be supported and which will not.
When is the change made and who makes it? Most commonly in the
past, a change was made to source code. That is, a developer had to
make the change, which was tested and then deployed in a new release.
Now, however, the question of when a change is made is intertwined
with the question of who makes it. An end user changing the screen
saver is clearly making a change to one aspect of the system. Equally
clear, it is not in the same category as changing the system so that it
uses a different database management system. Changes can be made to
the implementation (by modifying the source code), during compilation
(using compile-time switches), during the build (by choice of libraries),
during configuration setup (by a range of techniques, including
parameter setting), or during execution (by parameter settings, plug-ins,
allocation to hardware, and so forth). A change can also be made by a
developer, an end user, or a system administrator. Systems that learn
and adapt supply a whole different answer to the question of when a
change is made and “who” makes it—it is the system itself that is the
agent for change.
What is the cost of the change? Making a system more modifiable
involves two types of costs:
The cost of introducing the mechanism(s) to make the system more
modifiable
The cost of making the modification using the mechanism(s)
For example, the simplest mechanism for making a change is to wait for
a change request to come in, then change the source code to accommodate
the request. In such a case, the cost of introducing the mechanism is zero
(since there is no special mechanism); the cost of exercising it is the cost of
changing the source code and revalidating the system.
Toward the other end of the spectrum is an application generator, such as
a user interface builder. The builder takes as input a description of the designed UI, produced through direct manipulation techniques, and generates source code from that description. The cost of introducing the mechanism is
the cost of acquiring the UI builder, which may be substantial. The cost of
using the mechanism is the cost of producing the input to feed the builder
(this cost can be either substantial or negligible), the cost of running the
builder (close to zero), and finally the cost of whatever testing is performed
on the result (usually much less than for hand-coding).
Still further along the spectrum are software systems that discover their
environments, learn, and modify themselves to accommodate any changes.
For those systems, the cost of making the modification is zero, but that
ability was purchased along with implementing and testing the learning
mechanisms, which may have been quite costly.
For N similar modifications, a simplified justification for a change mechanism is that
N × (cost of making the change without the mechanism) ≥
cost of creating the mechanism + (N × cost of making the change using the mechanism)
Here, N is the anticipated number of modifications that will use the
modifiability mechanism—but it is also a prediction. If fewer changes than
expected come in, then an expensive modification mechanism may not be
warranted. In addition, the cost of creating the modifiability mechanism
could be applied elsewhere (opportunity cost)—in adding new functionality,
in improving the performance, or even in non-software investments such as
hiring or training. Also, the equation does not take time into account. It
might be cheaper in the long run to build a sophisticated change-handling
mechanism, but you might not be able to wait for its completion. However,
if your code is modified frequently, not introducing some architectural
mechanism and simply piling change on top of change typically leads to
substantial technical debt. We address the topic of architectural debt in
Chapter 23.
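The inequality can be read as a break-even test. A minimal sketch in Python (the person-day figures are invented purely for illustration) computes the smallest anticipated N at which a proposed mechanism pays for itself:

    import math

    def mechanism_pays_off(n, cost_without, cost_mechanism, cost_with):
        """True if N changes made via the mechanism cost no more than N changes without it."""
        return cost_mechanism + n * cost_with <= n * cost_without

    def break_even_n(cost_without, cost_mechanism, cost_with):
        """Smallest anticipated number of changes that justifies building the mechanism."""
        if cost_with >= cost_without:
            return None   # the mechanism never pays off
        return math.ceil(cost_mechanism / (cost_without - cost_with))

    # Invented numbers: each ad hoc change costs 10 person-days, the mechanism
    # costs 60 person-days to build, and each change through it costs 2 person-days.
    print(break_even_n(cost_without=10, cost_mechanism=60, cost_with=2))   # -> 8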
Change is so prevalent in the life of software systems that special names
have been given to specific flavors of modifiability. Some of the common
ones are highlighted here:
Scalability is about accommodating more of something. In terms of
performance, scalability means adding more resources. Two kinds of
performance scalability are horizontal scalability and vertical
scalability. Horizontal scalability (scaling out) refers to adding more
resources to logical units, such as adding another server to a cluster of
servers. Vertical scalability (scaling up) refers to adding more resources to a physical unit, such as adding more memory to a single computer. The problem that arises with either type of scaling is how to use the additional resources effectively. Being effective means that the additional resources result in a measurable improvement of some system quality, do not require undue effort to add, and do not unduly disrupt operations. In cloud-based environments, horizontal scalability
is called elasticity. Elasticity is a property that enables a customer to
add or remove virtual machines from the resource pool (see Chapter 17
for further discussion of such environments).
Variability refers to the ability of a system and its supporting artifacts,
such as code, requirements, test plans, and documentation, to support
the production of a set of variants that differ from each other in a
preplanned fashion. Variability is an especially important quality
attribute in a product line, which is a family of systems that are similar
but vary in features and functions. If the engineering assets associated
with these systems can be shared among members of the family, then
the overall cost of the product line plummets. This is achieved by
introducing mechanisms that allow the artifacts to be selected and/or
adapt to usages in the different product contexts that are within the
product line’s scope. The goal of variability in a software product line is
to make it easy to build and maintain products in that family over a
period of time.
Portability refers to the ease with which software that was built to run
on one platform can be changed to run on a different platform.
Portability is achieved by minimizing platform dependencies in the
software, isolating dependencies to well-identified locations, and
writing the software to run on a “virtual machine” (for example, a Java
Virtual Machine) that encapsulates all the platform dependencies.
Scenarios describing portability deal with moving software to a new
platform by expending no more than a certain level of effort or by
counting the number of places in the software that would have to
change. Architectural approaches to dealing with portability are
intertwined with those for deployability, a topic addressed in Chapter 5.
Location independence refers to the case where two pieces of
distributed software interact and the location of one or both of the
pieces is not known prior to runtime. Alternatively, the location of these
pieces may change during runtime. In distributed systems, services are
often deployed to arbitrary locations, and clients of those services must
discover their location dynamically. In addition, services in a distributed
system must often make their location discoverable once they have been
deployed to a location. Designing the system for location independence
means that the location will be easy to modify with minimal impact on
the rest of the system.
In the modifiability general scenario, responses include testing the modification, deploying the modification, and having the system self-modify; response measures include the effort and elapsed time involved.
Increase Cohesion
Several tactics involve redistributing responsibilities among modules. This
step is taken to reduce the likelihood that a single change will affect
multiple modules.
Reduce Coupling
We now turn to tactics that reduce the coupling between modules. These
tactics overlap with the integrability tactics described in Chapter 7, because
reducing dependencies among independent components (for integrability) is
similar to reducing coupling among modules (for modifiability).
Defer Binding
Because the work of people is almost always more expensive and more error-prone than the work of computers, letting computers handle a change as much as possible will almost always reduce the cost of making that change. If we
design artifacts with built-in flexibility, then exercising that flexibility is
usually cheaper than hand-coding a specific change.
Parameters are perhaps the best-known mechanism for introducing
flexibility, and their use is reminiscent of the abstract common services
tactic. A parameterized function f(a, b) is more general than the similar
function f(a) that assumes b = 0. When we bind the value of some
parameters at a different phase in the life cycle than the one in which we
defined the parameters, we are deferring binding.
In general, the later in the life cycle we can bind values, the better.
However, putting the mechanisms in place to facilitate that late binding
tends to be more expensive—a well-known tradeoff. And so the equation
given earlier in the chapter comes into play. We want to bind as late as
possible, as long as the mechanism that allows it is cost-effective.
Mechanisms that can be used to defer the binding of values include the following:
Configuration-time binding
Resource files
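A minimal sketch in Python (the settings.ini file, section, and option names are invented for illustration) defers the binding of two values from implementation time to configuration time:

    import configparser

    config = configparser.ConfigParser()
    # Defaults are bound here, at implementation time ...
    config.read_dict({"payments": {"gateway_url": "https://localhost/pay",
                                   "retry_limit": "1"}})
    # ... and overridden, if the file exists, at configuration time:
    #
    #   [payments]
    #   gateway_url = https://sandbox.example.com/pay
    #   retry_limit = 3
    config.read("settings.ini")

    GATEWAY_URL = config.get("payments", "gateway_url")
    RETRY_LIMIT = config.getint("payments", "retry_limit")
    print(GATEWAY_URL, RETRY_LIMIT)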
8.4 Patterns
Patterns for modifiability divide the system into modules in such a way that
the modules can be developed and evolved separately with little interaction
among them, thereby supporting portability, modifiability, and reuse. There
are probably more patterns designed to support modifiability than for any
other quality attribute. We present a few that are among the most commonly
used here.
Client-Server Pattern
The client-server pattern consists of a server providing services
simultaneously to multiple distributed clients. The most common example
is a web server providing information to multiple simultaneous users of a
website.
The interactions between a server and its clients follow this sequence:
Discovery:
Communication is initiated by a client, which uses a discovery
service to determine the location of the server.
The server responds to the client using an agreed-upon protocol.
Interaction:
The client sends requests to the server.
The server processes the requests and responds.
The server may have multiple instances if the number of clients grows
beyond the capacity of a single instance.
If the server is stateless with respect to the clients, each request from a
client is treated independently.
If the server maintains state with respect to the clients, then:
Each request must identify the client in some fashion.
The client should send an “end of session” message so that the
server can remove resources associated with that particular client.
The server may time out if the client has not sent a request in a
specified time so that resources associated with the client can be
removed.
Benefits:
Tradeoffs:
Benefits:
Tradeoffs:
Layers Pattern
The layers pattern divides the system in such a way that the modules can be
developed and evolved separately with little interaction among the parts,
which supports portability, modifiability, and reuse. To achieve this
separation of concerns, the layers pattern divides the software into units
called layers. Each layer is a grouping of modules that offers a cohesive set
of services. The allowed-to-use relationship among the layers is subject to a
key constraint: The relations must be unidirectional.
Layers completely partition a set of software, and each partition is
exposed through a public interface. The layers are created to interact
according to a strict ordering relation. If (A, B) is in this relation, we say
that the software assigned to layer A is allowed to use any of the public
facilities provided by layer B. (In a vertically arranged representation of
layers, which is almost ubiquitous, A will be drawn higher than B.) In some
cases, modules in one layer are required to directly use modules in a
nonadjacent lower layer, although normally only next-lower-layer uses are
allowed. This case of software in a higher layer using modules in a
nonadjacent lower layer is called layer bridging. Upward usages are not
allowed in this pattern.
Benefits:
Tradeoffs:
If the layering is not designed correctly, it may actually get in the way,
by not providing the lower-level abstractions that programmers at the
higher levels need.
Layering often adds a performance penalty to a system. If a call is
made from a function in the top-most layer, it may have to traverse
many lower layers before being executed by the hardware.
If many instances of layer bridging occur, the system may not meet its
portability and modifiability goals, which strict layering helps to
achieve.
Publish-Subscribe Pattern
Publish-subscribe is an architectural pattern in which components
communicate primarily through asynchronous messages, sometimes
referred to as “events” or “topics.” The publishers have no knowledge of
the subscribers, and subscribers are only aware of message types. Systems
using the publish-subscribe pattern rely on implicit invocation; that is, the
component publishing a message does not directly invoke any other
component. Components publish messages on one or more events or topics,
and other components register an interest in the publication. At runtime,
when a message is published, the publish–subscribe (or event) bus notifies
all of the elements that registered an interest in the event or topic. In this
way, the message publication causes an implicit invocation of (methods in)
other components. The result is loose coupling between the publishers and
the subscribers.
The publish-subscribe pattern has three types of elements:
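These element types are the publisher components that send (publish) messages, the subscriber components that register an interest in messages, and the event or message bus that manages subscriptions and performs the implicit invocation. A minimal in-process sketch in Python (the EventBus class and the topic names are illustrative, not a particular product's API):

    from collections import defaultdict

    class EventBus:
        """The bus: keeps subscriptions and performs the implicit invocation."""
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self._subscribers[topic].append(callback)

        def publish(self, topic, message):
            for callback in self._subscribers[topic]:
                callback(message)       # the publisher never calls subscribers directly

    bus = EventBus()
    bus.subscribe("order.created", lambda msg: print("billing saw", msg))
    bus.subscribe("order.created", lambda msg: print("shipping saw", msg))
    bus.publish("order.created", {"order_id": 42})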
Concurrency
Concurrency is one of the more important concepts that an architect
must understand and one of the least-taught topics in computer science
courses. Concurrency refers to operations occurring in parallel. For
example, suppose there is a thread that executes the statements
x = 1;
x++;
and another thread that executes the same statements. What is the
value of x after both threads have executed those statements? It could
be either 2 or 3. I leave it to you to figure out how the value 3 could
occur—or should I say I interleave it to you?
Concurrency occurs anytime your system creates a new thread,
because threads, by definition, are independent sequences of control.
Multitasking on your system is supported by independent threads.
Multiple users are simultaneously supported on your system through
the use of threads. Concurrency also occurs anytime your system is
executing on more than one processor, whether those processors are
packaged separately or as multi-core processors. In addition, you must
consider concurrency when you use parallel algorithms, parallelizing
infrastructures such as map-reduce, or NoSQL databases, or when you
use one of a variety of concurrent scheduling algorithms. In other
words, concurrency is a tool available to you in many ways.
Concurrency, when you have multiple CPUs or wait states that can
exploit it, is a good thing. Allowing operations to occur in parallel
improves performance, because delays introduced in one thread allow
the processor to progress on another thread. But because of the
interleaving phenomenon just described (referred to as a race
condition), concurrency must also be carefully managed.
As our example shows, race conditions can occur when two threads
of control are present and there is shared state. The management of
concurrency frequently comes down to managing how state is shared.
One technique for preventing race conditions is to use locks to
enforce sequential access to state. Another technique is to partition the
state based on the thread executing a portion of code. That is, if we
have two instances of x, x is not shared by the two threads and no race
condition will occur.
Race conditions are among the hardest types of bugs to discover;
the occurrence of the bug is sporadic and depends on (possibly
minute) differences in timing. I once had a race condition in an
operating system that I could not track down. I put a test in the code
so that the next time the race condition occurred, a debugging process
was triggered. It took more than a year for the bug to recur so that the
cause could be determined.
Do not let the difficulties associated with concurrency dissuade you
from utilizing this very important technique. Just use it with the
knowledge that you must carefully identify critical sections in your
code and ensure (or take actions to ensure) that race conditions will
not occur in those sections.
—LB
The performance general scenario (portion of scenario, description, possible values):

Source of stimulus. Possible values:
A user request
A request from an external system
Data arriving from a sensor or other system
Internal: one component may make a request of another component, or a timer may generate a notification

Stimulus. The stimulus is the arrival of an event. The event can be a request for service or a notification of some state of either the system under consideration or an external system. Possible values: the arrival of a periodic, sporadic, or stochastic event, where:
A periodic event arrives at a predictable interval.
A stochastic event arrives according to some probability distribution.
A sporadic event arrives according to a pattern that is neither periodic nor stochastic.

Artifact. The artifact stimulated may be the whole system or just a portion of the system. For example, a power-on event may stimulate the whole system. A user request may arrive at (stimulate) the user interface. Possible values:
Whole system
Component within the system

Environment. The state of the system or component when the stimulus arrives. Unusual modes—error mode, overloaded mode—will affect the response. For example, three unsuccessful login attempts are allowed before a device is locked out. Possible values: runtime; the system or component can be operating in:
Normal mode
Emergency mode
Error correction mode
Peak load
Overload mode
Degraded operation mode
Some other defined mode of the system

Response. The system will process the stimulus. Processing the stimulus will take time. This time may be required for computation, or it may be required because processing is blocked by contention for shared resources. Requests can fail to be satisfied because the system is overloaded or because of a failure somewhere in the processing chain. Possible values:
System returns a response
System returns an error
System generates no response
System ignores the request if overloaded
System changes the mode or level of service
System services a higher-priority event
System consumes resources

Response measure. Timing measures can include latency or throughput. Systems with timing deadlines can also measure jitter of response and ability to meet the deadlines. Measuring how many of the requests go unsatisfied is also a type of measure, as is how much of a computing resource (e.g., a CPU, memory, thread pool, buffer) is utilized. Possible values:
The (maximum, minimum, mean, median) time the response takes (latency)
The number or percentage of satisfied requests over some time interval (throughput) or set of events received
The number or percentage of requests that go unsatisfied
The variation in response time (jitter)
Usage level of a computing resource
At any instant during the period after an event arrives but before the
system’s response to it is complete, either the system is working to respond
to that event or the processing is blocked for some reason. This leads to the
two basic contributors to the response time and resource usage: processing
time (when the system is working to respond and actively consuming
resources) and blocked time (when the system is unable to respond).
Whatever the cause, you must identify places in the architecture where
resource limitations might cause a significant contribution to overall
latency.
With this background, we turn to our tactic categories. We can either
reduce demand for resources (control resource demand) or make the
resources we have available handle the demand more effectively (manage
resources).
Manage Resources
Even if the demand for resources is not controllable, the management of
these resources can be. Sometimes one resource can be traded for another.
For example, intermediate data may be kept in a cache or it may be
regenerated depending on which resources are more critical: time, space, or
network bandwidth. Here are some resource management tactics:
Scheduling Policies
A scheduling policy conceptually has two parts: a priority assignment
and dispatching. All scheduling policies assign priorities. In some
cases, the assignment is as simple as first-in/first-out (or FIFO). In
other cases, it can be tied to the deadline of the request or its semantic
importance. Competing criteria for scheduling include optimal
resource usage, request importance, minimizing the number of
resources used, minimizing latency, maximizing throughput,
preventing starvation to ensure fairness, and so forth. You need to be
aware of these possibly conflicting criteria and the effect that the
chosen scheduling policy has on the system’s ability to meet them.
A high-priority event stream can be dispatched—assigned to a
resource—only if that resource is available. Sometimes this depends
on preempting the current user of the resource. Possible preemption
options are as follows: can occur anytime, can occur only at specific
preemption points, or executing processes cannot be preempted. Some
common scheduling policies are these:
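Familiar examples include first-in/first-out, fixed-priority scheduling, and dynamic priority policies such as round-robin and earliest-deadline-first. As a minimal sketch of the two parts of a scheduling policy, priority assignment and dispatching, the following Python fragment implements a fixed-priority dispatcher with FIFO tie-breaking (the class and task names are invented for illustration):

    import heapq
    import itertools

    class FixedPriorityScheduler:
        """Priority assignment plus dispatching: lower number means higher priority."""

        def __init__(self):
            self._queue = []
            self._order = itertools.count()   # FIFO tie-break among equal priorities

        def submit(self, request, priority):
            heapq.heappush(self._queue, (priority, next(self._order), request))

        def dispatch(self):
            """Assign the highest-priority waiting request to the resource."""
            if not self._queue:
                return None
            _, _, request = heapq.heappop(self._queue)
            return request

    sched = FixedPriorityScheduler()
    sched.submit("log rotation", priority=5)
    sched.submit("deadline-critical control update", priority=1)
    print(sched.dispatch())   # -> 'deadline-critical control update'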
Benefits:
Tradeoffs:
Load Balancer
A load balancer is a kind of intermediary that handles messages originating
from some set of clients and determines which instance of a service should
respond to those messages. The key to this pattern is that the load balancer
serves as a single point of contact for incoming messages—for example, a
single IP address—but it then farms out requests to a pool of providers
(servers or services) that can respond to the request. In this way, the load
can be balanced across the pool of providers. The load balancer implements
some form of the schedule resources tactic. The scheduling algorithm may
be very simple, such as round-robin, or it may take into account the load on
each provider, or the number of requests awaiting service at each provider.
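A minimal sketch in Python of a round-robin load balancer (the class name and provider addresses are invented for illustration):

    import itertools

    class RoundRobinBalancer:
        """Single point of contact that farms requests out to a pool of providers."""

        def __init__(self, providers):
            self._cycle = itertools.cycle(providers)

        def route(self, request):
            provider = next(self._cycle)           # round-robin scheduling
            return provider, request

    balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    for i in range(4):
        print(balancer.route({"req": i}))          # cycles through the three providers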
Benefits:
Tradeoffs:
Throttling
The throttling pattern is a packaging of the manage work requests tactic. It
is used to limit access to some important resource or service. In this pattern,
there is typically an intermediary—a throttler—that monitors (requests to)
the service and determines whether an incoming request can be serviced.
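A minimal sketch in Python (the Throttler class and its one-second window are illustrative choices) shows an intermediary that rejects requests once a rate limit is reached:

    import time

    class Throttler:
        """Intermediary that limits how many requests per second reach a service."""

        def __init__(self, service, max_per_second):
            self._service = service
            self._max = max_per_second
            self._window_start = time.monotonic()
            self._count = 0

        def handle(self, request):
            now = time.monotonic()
            if now - self._window_start >= 1.0:    # start a new one-second window
                self._window_start, self._count = now, 0
            if self._count >= self._max:
                return "rejected: rate limit exceeded"
            self._count += 1
            return self._service(request)

    throttled = Throttler(lambda req: f"handled {req}", max_per_second=2)
    print([throttled.handle(i) for i in range(3)])   # the third call is rejected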
Benefits:
Tradeoffs:
Map-Reduce
The map-reduce pattern efficiently performs a distributed and parallel sort
of a large data set and provides a simple means for the programmer to
specify the analysis to be done. Unlike our other patterns for performance,
which are independent of any application, the map-reduce pattern is
specifically designed to bring high performance to a specific kind of
recurring problem: sort and analyze a large data set. This problem is
experienced by any organization dealing with massive data—think Google,
Facebook, Yahoo, and Netflix—and all of these organizations do in fact use
map-reduce.
The map-reduce pattern has three parts:
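These parts are a map function that transforms each piece of raw input into intermediate (key, value) pairs, a shuffle step that groups the intermediate pairs by key, and a reduce function that combines each group into a result. A minimal single-process word-count sketch in Python follows; a real map-reduce infrastructure would distribute the map and reduce work across many machines, and the function names here are illustrative:

    from collections import defaultdict

    def map_phase(document):
        # Map: emit an intermediate (key, value) pair for each word.
        return [(word.lower(), 1) for word in document.split()]

    def shuffle(pairs):
        # Shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: combine each group into a single result.
        return {key: sum(values) for key, values in groups.items()}

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(shuffle(pairs)))   # e.g. {'the': 3, 'fox': 2, ...}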
Benefits:
Tradeoffs:
If you do not have large data sets, the overhead incurred by the map-
reduce pattern is not justified.
If you cannot divide your data set into similarly sized subsets, the
advantages of parallelism are lost.
Operations that require multiple reduces are complex to orchestrate.
Safety is also concerned with detecting and recovering from these unsafe
states to prevent or at least minimize resulting harm.
Any portion of the system can lead to an unsafe state: The software, the
hardware portions, or the environment can behave in an unanticipated,
unsafe fashion. Once an unsafe state is detected, the potential system
responses are similar to those enumerated for availability (in Chapter 4).
The unsafe state should be recognized and the system should be made safe.
The safety general scenario (portion of scenario, description, possible values):

Stimulus. A fault or error arising through a sensor, a software component, a communication channel, or a device (such as a clock). Possible values:
A specific instance of an omission: a function is never performed.
A specific instance of a commission: a function is performed incorrectly; a device produces a spurious event; a device produces incorrect data.
A specific instance of incorrect data: a sensor reports incorrect data; a software component produces incorrect results.
A timing failure: data arrives too late or too early; a generated event occurs too late or too early or at the wrong rate; events occur in the wrong order.

Environment. System operating mode. Possible values:
Normal operation
Degraded operation
Manual operation
Recovery mode

Response. Possible values:
Avoid the unsafe state
Recover
Continue in degraded or safe mode
Shut down
Switch to manual operation
Switch to a backup system
Notify appropriate entities (people or systems)
Log the unsafe state (and the response to it)

Response measure. Possible values:
Amount or percentage of unsafe states from which the system can (automatically) recover
Change in risk exposure: size(loss) * prob(loss)
Percentage of time the system can recover
Amount of time the system is in a degraded or safe mode
Amount or percentage of time the system is shut down
Elapsed time to enter and recover (from manual operation, from a safe or degraded mode)
Substitution
This tactic employs protection mechanisms—often hardware-based—for
potentially dangerous software design features. For example, hardware
protection devices such as watchdogs, monitors, and interlocks can be used
in lieu of software versions. Software versions of these mechanisms can be
starved of resources, whereas a separate hardware device provides and
controls its own resources. Substitution is typically beneficial only when the
function being replaced is relatively simple.
Predictive Model
The predictive model tactic, as introduced in Chapter 4, predicts the state of
health of system processes, resources, or other properties (based on
monitoring the state), not only to ensure that the system is operating within
its nominal operating parameters but also to provide early warning of a
potential problem. For example, some automotive cruise control systems
calculate the closing rate between the vehicle and an obstacle (or another
vehicle) ahead and warn the driver before the distance and time become too
small to avoid a collision. A predictive model is typically combined with
condition monitoring, which we discuss later.
Timeout
The timeout tactic is used to determine whether the operation of a
component is meeting its timing constraints. This might be realized in the
form of an exception being raised, to indicate the failure of a component if
its timing constraints are not met. Thus this tactic can detect late timing and
omission failures. Timeout is a particularly common tactic in real-time or
embedded systems and distributed systems. It is related to the availability
tactics of system monitor, heartbeat, and ping-echo.
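A minimal sketch in Python (the read_sensor function and the 50-millisecond budget are invented for illustration) uses a future with a timeout to detect a component that misses its timing constraint:

    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def read_sensor():
        time.sleep(0.2)      # simulate a component that is running late
        return 42

    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(read_sensor)
        try:
            value = future.result(timeout=0.05)   # timing constraint: 50 ms
        except TimeoutError:
            # Late or omission failure detected; trigger the safety response here.
            value = None
    print(value)   # -> None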
Timestamp
As described in Chapter 4, the timestamp tactic is used to detect incorrect
sequences of events, primarily in distributed message-passing systems. A
timestamp of an event can be established by assigning the state of a local
clock to the event immediately after the event occurs. Sequence numbers
can also be used for this purpose, since timestamps in a distributed system
may be inconsistent across different processors.
Condition Monitoring
This tactic involves checking conditions in a process or device, or
validating assumptions made during the design, perhaps by using assertions.
Condition monitoring identifies system states that may lead to hazardous
behavior. However, the monitor should be simple (and, ideally, provable) to
ensure that it does not introduce new software errors or contribute
significantly to overall workload. Condition monitoring provides the input
to a predictive model and to sanity checking.
Sanity Checking
The sanity checking tactic checks the validity or reasonableness of specific
operation results, or inputs or outputs of a component. This tactic is
typically based on a knowledge of the internal design, the state of the
system, or the nature of the information under scrutiny. It is most often
employed at interfaces, to examine a specific information flow.
Comparison
The comparison tactic allows the system to detect unsafe states by
comparing the outputs produced by a number of synchronized or replicated
elements. Thus the comparison tactic works together with a redundancy
tactic, typically the active redundancy tactic presented in the discussion of
availability. When the number of replicas is three or greater, the
comparison tactic can not only detect an unsafe state but also indicate
which component has led to it. Comparison is related to the voting tactic
used in availability. However, a comparison may not always lead to a vote;
another option is to simply shut down if outputs differ.
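A minimal sketch in Python of comparison with majority voting over three replicas (the replica names and outputs are invented for illustration):

    from collections import Counter

    def compare(outputs):
        """outputs: mapping of replica name -> produced value."""
        counts = Counter(outputs.values())
        majority_value, votes = counts.most_common(1)[0]
        if votes == len(outputs):
            return majority_value, []                  # all replicas agree
        suspects = [name for name, v in outputs.items() if v != majority_value]
        return majority_value, suspects                # the disagreeing replica(s)

    print(compare({"replica_a": 10.0, "replica_b": 10.0, "replica_c": 10.7}))
    # -> (10.0, ['replica_c'])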
Containment
Containment tactics seek to limit the harm associated with an unsafe state
that has been entered. This category includes three subcategories:
redundancy, limit consequences, and barrier.
Redundancy
The redundancy tactics, at first glance, appear to be similar to the various
sparing/redundancy tactics presented in the discussion of availability.
Clearly, these tactics overlap, but since the goals of safety and availability
are different, the use of backup components differs. In the realm of safety,
redundancy enables the system to continue operation in the case where a
total shutdown or further degradation would be undesirable.
Replication is the simplest redundancy tactic, as it just involves having
clones of a component. Having multiple copies of identical components can
be effective in protecting against random failures of hardware, but it cannot
protect against design or implementation errors in hardware or software
since there is no form of diversity embedded in this tactic.
Functional redundancy, by contrast, is intended to address the issue of
common-mode failures (where replicas exhibit the same fault at the same
time because they share the same implementation) in hardware or software
components, by implementing design diversity. This tactic attempts to deal
with the systematic nature of design faults by adding diversity to
redundancy. The outputs of functionally redundant components should be
the same given the same input. The functional redundancy tactic is still
vulnerable to specification errors, however, and of course, functional
replicas will be more expensive to develop and verify.
Finally, the analytic redundancy tactic permits not only diversity of
components, but also a higher-level diversity that is visible at the input and
output level. As a consequence, it can tolerate specification errors by using
separate requirement specifications. Analytic redundancy often involves
partitioning the system into high assurance and high performance (low
assurance) portions. The high assurance portion is designed to be simple
and reliable, whereas the high performance portion is typically designed to
be more complex and more accurate, but less stable: It changes more
rapidly, and may not be as reliable as the high assurance portion. (Hence,
here we do not mean high performance in the sense of latency or
throughput; rather, this portion “performs” its task better than the high
assurance portion.)
Limit Consequences
The second subcategory of containment tactics is called limit consequences.
These tactics are all intended to limit the bad effects that may result from
the system entering an unsafe state.
The abort tactic is conceptually the simplest. If an operation is
determined to be unsafe, it is aborted before it can cause damage. This
technique is widely employed to ensure that systems fail safely.
The degradation tactic maintains the most critical system functions in the
presence of component failures, dropping or replacing functionality in a
controlled way. This approach allows individual component failures to
gracefully reduce system functionality in a planned, deliberate, and safe
way, rather than causing a complete system failure. For example, a car
navigation system may continue to operate using a (less accurate) dead
reckoning algorithm in a long tunnel where it has lost its GPS satellite
signal.
The masking tactic masks a fault by comparing the results of several
redundant components and employing a voting procedure in case one or
more of the components differ. For this tactic to work as intended, the voter
must be simple and highly reliable.
Barrier
The barrier tactics contain problems by keeping them from propagating.
The firewall tactic is a specific realization of the limit access tactic,
which is described in Chapter 11. A firewall limits access to specified
resources, typically processors, memory, and network connections.
The interlock tactic protects against failures arising from incorrect
sequencing of events. Realizations of this tactic provide elaborate
protection schemes by controlling all access to protected components,
including controlling the correct sequencing of events affecting those
components.
Recovery
The final category of safety tactics is recovery, which acts to place the
system in a safe state. It encompasses three tactics: rollback, repair state,
and reconfiguration.
The rollback tactic permits the system to revert to a saved copy of a
previous known good state—the rollback line—upon the detection of a
failure. This tactic is often combined with checkpointing and transactions,
to ensure that the rollback is complete and consistent. Once the good state
is reached, then execution can continue, potentially employing other tactics
such as retry or degradation to ensure that the failure does not reoccur.
The repair state tactic repairs an erroneous state—effectively increasing
the set of states that a component can handle competently (i.e., without
failure)—and then continues execution. For example, a vehicle’s lane keep
assist feature will monitor whether a driver is staying within their lane and
actively return the vehicle to a position between the lines—a safe state—if
it drifts out. This tactic is inappropriate as a means of recovery from
unanticipated faults.
Reconfiguration attempts to recover from component failures by
remapping the logical architecture onto the (potentially limited) resources
left functioning. Ideally, this remapping allows full functionality to be
maintained. When this is not possible, the system may be able to maintain
partial functionality in combination with the degradation tactic.
Benefits:
The cost of certifying the system is reduced because you need to
certify only a (usually small) portion of the total system.
Cost and safety benefits accrue because the effort focuses on just
those portions of the system that are germane to safety.
Tradeoffs:
The work involved in performing the separation can be expensive,
such as installing two different networks in a system to partition
safety-critical and non-safety-critical messages. However, this
approach limits the risk and consequences of bugs in the non-
safety-critical portion from affecting the safety-critical portion.
Separating the system and convincing the certification agency that
the separation was performed correctly and that there are no
influences from the non-safety-critical portion on the safety-
critical portion is difficult, but is far easier than the alternative:
having the agency certify everything to the same rigid level.
Privacy
An issue closely related to security is the quality of privacy. Privacy
concerns have become more important in recent years and are
enshrined into law in the European Union through the General Data
Protection Regulation (GDPR). Other jurisdictions have adopted
similar regulations.
Achieving privacy is about limiting access to information, which in
turn is about which information should be access-limited and to whom
access should be allowed. The general term for information that
should be kept private is personally identifiable information (PII). The
National Institute of Standards and Technology (NIST) defines PII as
“any information about an individual maintained by an agency,
including (1) any information that can be used to distinguish or trace
an individual’s identity, such as name, social security number, date
and place of birth, mother’s maiden name, or biometric records; and
(2) any other information that is linked or linkable to an individual,
such as medical, educational, financial, and employment
information.”
The question of who is permitted access to such data is more
complicated. Users are routinely asked to review and agree to privacy
agreements initiated by organizations. These privacy agreements
detail who, outside of the collecting organization, is entitled to see
PII. The collecting organization itself should have policies that govern
who within that organization can have access to such data. Consider,
for example, a tester for a software system. To perform tests, realistic
data should be used. Does that data include PII? Generally, PII is
obscured for testing purposes.
Frequently the architect, perhaps acting for the project manager, is
asked to verify that PII is hidden from members of the development
team who do not need to have access to PII.
The security general scenario (portion of scenario, description, possible values):

Source of stimulus. The source of the attack, which is:
Inside the organization
Outside the organization
Previously identified
Unknown

Stimulus. The stimulus is an attack. Possible values: an unauthorized attempt to:
Display data
Capture data
Change or delete data
Access system services
Change the system's behavior
Reduce availability

Artifact. What is the target of the attack? Possible values:
System services
Data within the system
A component or resources of the system
Data produced or consumed by the system

Environment. What is the state of the system when the attack occurs? Possible values: the system is:
Online or offline
Connected to or disconnected from a network
Behind a firewall or open to a network
Fully operational
Partially operational
Not operational

Response. Possible values:
Data or services are protected from unauthorized access
Data or services are not being manipulated without authorization
Parties to a transaction are identified with assurance
The parties to the transaction cannot repudiate their involvements
The data, resources, and system services will be available for legitimate use
Recording access or modification
Recording attempts to access data, resources, or services
Notifying appropriate entities (people or systems) when an apparent attack is occurring

Response measure. Possible values:
Accuracy of attack detection
How much time passed before an attack was detected
How many attacks were resisted
How long it takes to recover from a successful attack
How much data is vulnerable to a particular attack
Figure 11.1 shows a sample concrete scenario derived from the general
scenario: A disgruntled employee at a remote location attempts to
improperly modify the pay rate table during normal operations. The
unauthorized access is detected, the system maintains an audit trail, and
the correct data is restored within one day.
Figure 11.1 Sample scenario for security
Detect Attacks
The detect attacks category consists of four tactics: detect intrusion, detect
service denial, verify message integrity, and detect message delay.
Resist Attacks
There are a number of well-known means of resisting an attack:
React to Attacks
Several tactics are intended to respond to a potential attack.
Intercepting Validator
This pattern inserts a software element—a wrapper—between the source
and the destination of messages. This approach assumes greater importance
when the source of the messages is outside the system. The most common
responsibility of this pattern is to implement the verify message integrity
tactic, but it can also incorporate tactics such as detect intrusion and detect
service denial (by comparing messages to known intrusion patterns), or
detect message delivery anomalies.
Benefits:
Depending on the specific validator that you create and deploy, this
pattern can cover most of the waterfront of the “detect attack” category
of tactics, all in one package.
Tradeoffs:
These systems can encompass most of the “detect attacks” and “react
to attacks” tactics.
Tradeoffs:
The patterns of activity that an IPS looks for change and evolve over
time, so the patterns database must be constantly updated.
Systems employing an IPS incur a performance cost.
IPSs are available as commercial off-the-shelf components, which
makes them unnecessary to develop but perhaps not entirely suited to a
specific application.
The testability general scenario (portion of scenario, description, possible values):

Source of stimulus. Possible values:
Unit testers
Integration testers
System testers
Acceptance testers
End users

Stimulus. Tests are run to validate system functions, validate qualities, or discover emerging threats to quality. Possible triggers:
The completion of a coding increment such as a class, layer, or service
The completed integration of a subsystem
The complete implementation of the whole system
The deployment of the system into a production environment
A testing schedule

Artifacts. The artifact is the portion of the system being tested and any required test infrastructure. The portion being tested:
A unit of code (corresponding to a module in the architecture)
Components
Services
Subsystems
The test infrastructure

Response. The system and its test infrastructure can be controlled to perform the desired tests, and the results from the test can be observed.

Response measure. One or more of the following:
Effort to achieve a given percentage of state space coverage
Probability of a fault being revealed by the next test
Time to perform tests
Effort to detect faults
Length of time to prepare test infrastructure
Effort required to bring the system into a specific state
Reduction in risk exposure: size(loss) × probability(loss)
All of these tactics add some capability or abstraction to the software that
(were we not interested in testing) otherwise would not be there. They can
be seen as augmenting bare-bones, get-the-job-done software with more
elaborate software that has some special capabilities designed to enhance
the efficiency and effectiveness of testing.
In addition to the testability tactics, a number of techniques are available
for replacing one component with a different version of itself that facilitates
testing:
Limit Complexity
Complex software is much harder to test. Its operating state space is large,
and (all else being equal) it is more difficult to re-create an exact state in a
large state space than to do so in a small state space. Because testing is not
just about making the software fail, but also about finding the fault that
caused the failure so that it can be removed, we are often concerned with
making behavior repeatable. This category includes two tactics:
When an injector creates the service and injects it into the client, the client is written with no knowledge of a concrete implementation. In other words, all of the implementation specifics are injected, typically at runtime.
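A minimal sketch in Python (the gateway classes and CheckoutService are invented for illustration) shows a client whose dependency is injected, allowing a test double to replace the real implementation:

    class RealGateway:
        def charge(self, amount):
            raise RuntimeError("would contact the real payment provider")

    class FakeGateway:
        """Test double injected during testing; records calls instead of charging."""
        def __init__(self):
            self.charged = []
        def charge(self, amount):
            self.charged.append(amount)

    class CheckoutService:
        def __init__(self, gateway):     # the dependency is injected, not constructed here
            self._gateway = gateway
        def checkout(self, amount):
            self._gateway.charge(amount)

    # Production wiring would inject RealGateway(); a test injects the double:
    gateway = FakeGateway()
    CheckoutService(gateway).checkout(25)
    assert gateway.charged == [25]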
Benefits:
Tradeoffs:
Strategy Pattern
In the strategy pattern, a class’s behavior can be changed at runtime. This
pattern is often employed when multiple algorithms can be employed to
perform a given task, and the specific algorithm to be used can be chosen
dynamically. The class simply contains an abstract method for the desired
functionality, with the concrete version of this method being selected based
on contextual factors. This pattern is often used to replace non-test versions
of some functionality with test versions that provide additional outputs,
additional internal sanity checking, and so forth.
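A minimal sketch in Python (the pricing classes are invented for illustration) shows a test strategy with extra sanity checking being substituted at runtime:

    class ProductionPricing:
        def price(self, basket):
            return sum(basket)                      # the ordinary algorithm

    class InstrumentedPricing:
        """Test strategy: same interface, with extra sanity checks and output."""
        def price(self, basket):
            assert all(item >= 0 for item in basket), "negative price in basket"
            total = sum(basket)
            print("priced", len(basket), "items ->", total)
            return total

    class Checkout:
        def __init__(self, pricing_strategy):
            self._pricing = pricing_strategy        # chosen dynamically

        def total(self, basket):
            return self._pricing.price(basket)

    print(Checkout(InstrumentedPricing()).total([3, 4, 5]))   # -> 12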
Benefits:
Tradeoffs:
The strategy pattern, like all design patterns, adds a small amount of
up-front complexity. If the class is simple or if there are few runtime
choices, this added complexity is likely wasted.
For small classes, the strategy pattern can make code slightly less
readable. However, as complexity grows, breaking up the class in this
way can enhance readability.
Benefits:
This pattern, like the strategy pattern, makes classes simpler, by not
placing all of the pre- and post-processing logic in the class.
Using an intercepting filter can be a strong motivator for reuse and can
dramatically reduce the size of the code base.
Tradeoffs:
If a large amount of data is being passed to the service, this pattern can
be highly inefficient and can add a nontrivial amount of latency, as
each filter makes a complete pass over the entire input.
The usability general scenario (portion of scenario, description, possible values):

Environment. When does the stimulus reach the system? The user actions with which usability is concerned always occur at runtime or at system configuration time.

Artifacts. What portion of the system is being stimulated? Common examples include:
A GUI
A command-line interface
A voice interface
A touch screen

Response measure. Possible values include:
Number of errors
Learning time
User satisfaction
Cancel. When the user issues a cancel command, the system must be
listening for it (thus there is the responsibility to have a constant listener
that is not blocked by the actions of whatever is being canceled); the
activity being canceled must be terminated; any resources being used by
the canceled activity must be freed; and components that are
collaborating with the canceled activity must be informed so that they
can also take appropriate action.
Undo. To support the ability to undo, the system must maintain a
sufficient amount of information about system state so that an earlier
state may be restored, at the user’s request. Such a record may take the
form of state “snapshots”—for example, checkpoints—or a set of
reversible operations. Not all operations can be easily reversed. For
example, changing all occurrences of the letter “a” to the letter “b” in a
document cannot be reversed by changing all instances of “b” to “a”,
because some of those instances of “b” may have existed prior to the
original change. In such a case, the system must maintain a more
elaborate record of the change. Of course, some operations cannot be
undone at all: You can’t unship a package or unfire a missile, for
example.
Undo comes in flavors. Some systems allow a single undo (where
invoking undo again reverts you to the state in which you commanded
the first undo, essentially undoing the undo). In other systems,
commanding multiple undo operations steps you back through many
previous states, either up to some limit or all the way back to the time
when the application was last opened.
Pause/resume. When a user has initiated a long-running operation—say,
downloading a large file or a set of files from a server—it is often
useful to provide the ability to pause and resume the operation. Pausing
a long-running operation may be done to temporarily free resources so
that they may be reallocated to other tasks.
Aggregate. When a user is performing repetitive operations, or
operations that affect a large number of objects in the same way, it is
useful to provide the ability to aggregate the lower-level objects into a
single group, so that the operation may be applied to the group, thus
freeing the user from the drudgery, and potential for mistakes, of doing
the same operation repeatedly. An example is aggregating all of the
objects in a slide and changing the text to 14-point font.
Model-View-Controller
MVC is likely the most widely known pattern for usability. It comes in
many variants, such as MVP (model-view-presenter), MVVM (model-
view-view-model), MVA (model-view-adapter), and so forth. Essentially all
of these patterns are focused on separating the model—the underlying
“business” logic of the system—from its realization in one or more UI
views. In the original MVC model, the model would send updates to a view,
which a user would see and interact with. User interactions—key presses,
button clicks, mouse motions, and so forth—are transmitted to the
controller, which interprets them as operations on the model and then sends
those operations to the model, which changes its state in response. The
reverse path was also a portion of the original MVC pattern. That is, the
model might be changed and the controller would send updates to the view.
The sending of updates depends on whether the MVC is in one process
or is distributed across processes (and potentially across the network). If the
MVC is in one process, then the updates are sent using the observer pattern
(discussed in the next subsection). If the MVC is distributed across
processes, then the publish-subscribe pattern is often used to send updates
(see Chapter 8).
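To make this collaboration concrete, here is a minimal, single-process sketch of the pattern in Java. The counter model, view, and controller names are invented for illustration; in this in-process form, the model-to-view update path is simply the observer mechanism discussed in the next subsection.

// A minimal, in-process MVC sketch (hypothetical names; not from the book).
// The model notifies its views of state changes; the controller translates
// user input into operations on the model.
import java.util.ArrayList;
import java.util.List;

interface View {
    void render(int value);          // called when the model's state changes
}

class CounterModel {
    private int value;
    private final List<View> views = new ArrayList<>();

    void attach(View v) { views.add(v); }

    void increment() {
        value++;
        for (View v : views) {
            v.render(value);         // push the update to every registered view
        }
    }
}

class CounterController {
    private final CounterModel model;

    CounterController(CounterModel model) { this.model = model; }

    // Interprets a raw user gesture (e.g., a button click) as a model operation.
    void onIncrementClicked() { model.increment(); }
}

public class MvcDemo {
    public static void main(String[] args) {
        CounterModel model = new CounterModel();
        model.attach(v -> System.out.println("Counter view shows: " + v));

        CounterController controller = new CounterController(model);
        controller.onIncrementClicked();   // prints "Counter view shows: 1"
        controller.onIncrementClicked();   // prints "Counter view shows: 2"
    }
}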
Benefits:
Tradeoffs:
Observer
The observer pattern is a way to link some functionality with one or more
views. This pattern has a subject—the entity being observed—and one or
more observers of that subject. Observers need to register themselves with
the subject; then, when the state of the subject changes, the observers are
notified. This pattern is often used to implement MVC (and its variants)—
for example, as a way to notify the various views of changes to the model.
Benefits:
Tradeoffs:
The observer pattern is overkill if multiple views of the subject are not
required.
The observer pattern requires that all observers register and de-register
with the subject. If observers neglect to de-register, then their memory
is never freed, which effectively results in a memory leak. In addition,
this can negatively affect performance, since obsolete observers will
continue to be invoked.
Observers may need to do considerable work to determine if and how
to reflect a state update, and this work may be repeated for each
observer. For example, suppose the subject is changing its state at a
fine granularity, such as a temperature sensor that reports 1/100th
degree fluctuations, but the view updates changes only in full degrees.
In such cases where there is an “impedance mismatch,” substantial
processing resources may be wasted.
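The following Java sketch illustrates the pattern and its tradeoffs under invented names. The explicit deregister operation is what prevents the leak described above, and the whole-degree display shows the impedance mismatch between a fine-grained subject and a coarse-grained observer.

// A minimal observer sketch (hypothetical names). Note the deregister method:
// forgetting to call it is the source of the "memory leak" tradeoff above.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

interface Observer {
    void update(double temperatureCelsius);
}

class TemperatureSensor {                       // the subject
    private final List<Observer> observers = new CopyOnWriteArrayList<>();

    void register(Observer o)   { observers.add(o); }
    void deregister(Observer o) { observers.remove(o); }   // must be called when a view goes away

    void setTemperature(double t) {
        for (Observer o : observers) {
            o.update(t);                        // notify every registered observer
        }
    }
}

public class ObserverDemo {
    public static void main(String[] args) {
        TemperatureSensor sensor = new TemperatureSensor();

        // This observer cares only about whole-degree changes, illustrating the
        // "impedance mismatch" tradeoff: it is notified far more often than needed.
        Observer display = new Observer() {
            private int lastWholeDegree = Integer.MIN_VALUE;
            public void update(double t) {
                int whole = (int) Math.floor(t);
                if (whole != lastWholeDegree) {
                    lastWholeDegree = whole;
                    System.out.println("Display: " + whole + " degrees C");
                }
            }
        };

        sensor.register(display);
        sensor.setTemperature(20.01);   // prints 20 degrees C
        sensor.setTemperature(20.07);   // notified, but no visible change
        sensor.setTemperature(21.00);   // prints 21 degrees C
        sensor.deregister(display);     // without this, the observer is never reclaimed
    }
}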
Memento
The memento pattern is a common way to implement the undo tactic. This
pattern features three major components: the originator, the caretaker, and
the memento. The originator is processing some stream of events that
change its state (originating from user interaction). The caretaker is sending
events to the originator that cause it to change its state. When the caretaker
is about to change the state of the originator, it can request a memento—a
snapshot of the existing state—and can use this artifact to restore that
existing state if needed, by simply passing the memento back to the
originator. In this way, the caretaker knows nothing about how state is
managed; the memento is simply an abstraction that the caretaker employs.
Benefits:
The obvious benefit of this pattern is that you delegate the complicated
process of implementing undo, and figuring out what state to preserve,
to the class that is actually creating and managing that state. In
consequence, the originator’s abstraction is preserved and the rest of
the system does not need to know the details.
Tradeoffs:
Depending on the nature of the state being preserved, the memento can
consume arbitrarily large amounts of memory, which can affect
performance. In a very large document, try cutting and pasting many
large sections, and then undoing all of that. This is likely to result in
your text processor noticeably slowing down.
In some programming languages, it is difficult to enforce the memento
as an opaque abstraction.
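A minimal Java sketch of the pattern follows; the text document and its operations are hypothetical. The caretaker keeps a stack of opaque mementos to support multi-level undo without ever inspecting the originator's state.

// A minimal memento sketch (hypothetical names). The caretaker stores opaque
// snapshots and never looks inside them, preserving the originator's abstraction.
import java.util.ArrayDeque;
import java.util.Deque;

class TextDocument {                               // the originator
    private StringBuilder text = new StringBuilder();

    // The memento: an opaque snapshot of the originator's state.
    static final class Memento {
        private final String savedText;
        private Memento(String savedText) { this.savedText = savedText; }
    }

    Memento save()          { return new Memento(text.toString()); }
    void restore(Memento m) { text = new StringBuilder(m.savedText); }
    void append(String s)   { text.append(s); }
    String contents()       { return text.toString(); }
}

public class UndoDemo {                            // the caretaker
    public static void main(String[] args) {
        TextDocument doc = new TextDocument();
        Deque<TextDocument.Memento> undoStack = new ArrayDeque<>();

        undoStack.push(doc.save());                // snapshot before each change
        doc.append("Hello");
        undoStack.push(doc.save());
        doc.append(", world");

        System.out.println(doc.contents());        // Hello, world
        doc.restore(undoStack.pop());              // undo the second change
        System.out.println(doc.contents());        // Hello
        doc.restore(undoStack.pop());              // undo the first change
        System.out.println(doc.contents());        // (empty)
    }
}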
Chapters 4–13 each dealt with a particular quality attribute (QA) that is
important to software systems. Each of those chapters discussed how its
particular QA is defined, gave a general scenario for that QA, and showed
how to write specific scenarios to express precise shades of meaning
concerning that QA. In addition, each provided a collection of techniques to
achieve that QA in an architecture. In short, each chapter presented a kind
of portfolio for specifying and designing to achieve a particular QA.
However, as you can no doubt infer, those ten chapters only begin to
scratch the surface of the various QAs that you might need in a software
system you’re working on.
This chapter will show how to build the same kind of specification and
design approach for a QA not covered in our “A list.”
Development Distributability
Development distributability is the quality of designing the software to
support distributed software development. Like modifiability, this quality is
measured in terms of the activities of a development project. Many systems
these days are developed using globally distributed teams. One problem that
must be overcome when adopting this approach is coordinating the teams’
activities. The system should be designed so that coordination among teams
is minimized—that is, the major subsystems should exhibit low coupling.
This minimal coordination needs to be achieved both for the code and for
the data model. Teams working on modules that communicate with each
other may need to negotiate the interfaces of those modules. When a
module is used by many other modules, each developed by a different team,
communication and negotiation become more complex and burdensome.
Thus the architectural structure and the social (and business) structure of the
project need to be reasonably aligned. Similar considerations apply for the
data model. Scenarios for development distributability will deal with the
compatibility of the communication structures and data model of the system
being developed and the coordination mechanisms utilized by the
organizations doing the development.
ISO 25010 lists the following QAs that deal with product quality:
Functional suitability. Degree to which a product or system provides
functions that meet the stated and implied needs when used under the
specified conditions.
Performance efficiency. Performance relative to the amount of
resources used under the stated conditions.
Compatibility. Degree to which a product, system, or component can
exchange information with other products, systems, or components,
and/or perform its required functions, while sharing the same hardware
or software environment.
Usability. Degree to which a product or system can be used by specified
users to achieve specified goals with effectiveness, efficiency, and
satisfaction in a specified context of use.
Reliability. Degree to which a system, product, or component performs
the specified functions under the specified conditions for a specified
period of time.
Security. Degree to which a product or system protects information and
data so that persons or other products or systems have the degree of
data access appropriate to their types and levels of authorization.
Maintainability. Degree of effectiveness and efficiency with which a
product or system can be modified by the intended maintainers.
Portability. Degree of effectiveness and efficiency with which a system,
product, or component can be transferred from one hardware, software,
or other operational or usage environment to another.
Arrival rate
Queuing discipline
Scheduling algorithm
Service time
Topology
Network bandwidth
Routing algorithm
These are the only parameters that can affect latency within this model. This
is what gives the model its power. Furthermore, each of these parameters
can be affected by various architectural decisions. This is what makes the
model useful for an architect. For example, the routing algorithm can be
fixed or it could be a load-balancing algorithm. A scheduling algorithm
must be chosen. The topology can be affected by dynamically adding or
removing new servers. And so forth.
If you are creating your own model, your set of scenarios will inform
your investigation. The model's parameters can be derived from the stimuli (and their sources), the responses (and their measures), the artifacts (and their properties), and the environment (and its characteristics).
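As a minimal illustration of how such parameters drive latency, the following sketch assumes the simplest single-server queuing model (M/M/1), for which the average time in the system is 1 / (service rate - arrival rate). This is a deliberate simplification of the richer model described above, and the numbers are purely illustrative.

// A back-of-the-envelope latency sketch, assuming an M/M/1 queue
// (a simplification; the full model has more parameters).
public class QueueLatencySketch {
    // Average time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    static double averageLatency(double arrivalRatePerSec, double serviceRatePerSec) {
        if (arrivalRatePerSec >= serviceRatePerSec) {
            throw new IllegalArgumentException("Utilization >= 1: the queue grows without bound");
        }
        return 1.0 / (serviceRatePerSec - arrivalRatePerSec);
    }

    public static void main(String[] args) {
        double serviceRate = 100.0;                       // server handles 100 requests/sec
        for (double arrivalRate : new double[] {50, 80, 95, 99}) {
            System.out.printf("arrival=%.0f/s utilization=%.2f latency=%.1f ms%n",
                    arrivalRate, arrivalRate / serviceRate,
                    1000 * averageLatency(arrivalRate, serviceRate));
        }
        // Architectural decisions change these parameters: adding a server behind a
        // load balancer effectively raises the service rate; caching lowers the
        // arrival rate seen by the back end.
    }
}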
Multiple Interfaces
It is possible to split a single interface into multiple interfaces. Each of
these has a related logical purpose, and serves a different class of actors.
Multiple interfaces provide a kind of separation of concerns. A specific
class of actor might require only a subset of the functionality available; this
functionality can be provided by one of the interfaces. Conversely, the
provider of an element may want to grant actors different access rights,
such as read or write, or to implement a security policy. Multiple interfaces
support different levels of access. For example, an element might expose its
functionality through its main interface and give access to debugging or
performance monitoring data or administrative functions via separate
interfaces. There may be public read-only interfaces for anonymous actors
and private interfaces that allow authenticated and authorized actors to
modify the state of an element.
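The following Java sketch shows one element exposing three such interfaces under invented names: a public read-only interface, a privileged read-write interface, and a separate monitoring interface. Each class of actor is handed only the interface it needs.

// A sketch of the multiple-interfaces idea (hypothetical names): one element
// implements a read-only interface for anonymous actors, a mutating interface
// for authorized actors, and a monitoring interface for administrators.
interface CatalogReader {                     // public, read-only access
    String describeItem(String itemId);
}

interface CatalogEditor {                     // privileged, read-write access
    void updatePrice(String itemId, double newPrice);
}

interface CatalogMonitoring {                 // administrative/diagnostic access
    long requestCount();
}

class CatalogService implements CatalogReader, CatalogEditor, CatalogMonitoring {
    private long requests;

    public String describeItem(String itemId) {
        requests++;
        return "Item " + itemId;              // placeholder for real lookup logic
    }

    public void updatePrice(String itemId, double newPrice) {
        requests++;
        // ... update the backing store ...
    }

    public long requestCount() { return requests; }
}

public class MultipleInterfacesDemo {
    public static void main(String[] args) {
        CatalogService element = new CatalogService();

        CatalogReader publicView = element;       // anonymous actors see only reads
        CatalogEditor adminView = element;        // authorized actors may also write
        CatalogMonitoring opsView = element;      // operators see only diagnostics

        System.out.println(publicView.describeItem("42"));
        adminView.updatePrice("42", 9.99);
        System.out.println("Requests so far: " + opsView.requestCount());
    }
}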
Resources
Resources have syntax and semantics:
Interface Evolution
All software evolves, including interfaces. Software that is encapsulated by
an interface is free to evolve without impact to the elements that use this
interface as long as the interface itself does not change. An interface,
however, is a contract between an element and its actors. Just as a legal
contract can be changed only within certain constraints, software interfaces
should be changed with care. Three techniques can be used to change an
interface: deprecation, versioning, and extension.
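The following Java sketch shows, under invented names, what each technique can look like in code: an operation marked as deprecated, a default method serving as a non-breaking extension, and a separate versioned interface for an incompatible change.

// A sketch of the three interface-change techniques (hypothetical names).
public class InterfaceEvolutionSketch {

    interface PaymentService {
        @Deprecated
        void charge(String account, double amount);         // deprecation: still works, but flagged

        // Extension: a default method adds a new operation without breaking
        // existing implementations of this interface.
        default void charge(String account, double amount, String currency) {
            charge(account, amount);                         // delegate to the old operation
        }
    }

    // Versioning: an incompatible contract lives alongside the old one
    // (for example, exposed at a /v2 endpoint) until actors have migrated.
    interface PaymentServiceV2 {
        String charge(String account, long amountInMinorUnits, String currency);
    }

    public static void main(String[] args) {
        PaymentService legacy = (account, amount) ->
                System.out.println("charged " + account + " " + amount);
        legacy.charge("acct-1", 10.0);            // old operation, now deprecated
        legacy.charge("acct-1", 10.0, "USD");     // new operation via the default method
    }
}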
Interface Scope
The scope of an interface defines the collection of resources directly
available to the actors. You, as an interface designer, might want to reveal
all resources; alternatively, you might wish to constrain the access to certain
resources or to certain actors. For example, you might want to constrain
access for reasons of security, performance management, and extensibility.
A common pattern for constraining and mediating access to resources of
an element or a group of elements is to establish a gateway element. A
gateway—often called a message gateway—translates actor requests into
requests to the target element’s (or elements’) resources, and so becomes an
actor for the target element or elements. Figure 15.2 provides an example of
a gateway. Gateways are useful for the following reasons:
Interaction Styles
Interfaces are meant to be connected together so that different elements can
communicate (transfer data) and coordinate (transfer control). There are
many ways for such interactions to take place, depending on the mix
between communication and coordination, and on whether the elements
will be co-located or remotely deployed. For example:
Many different interaction styles exist, but we will focus on two of the
most widely used: RPC and REST.
Although not the only protocol that can be used with REST, HTTP is the
most common choice. HTTP, which has been standardized by the World
Wide Web Consortium (W3C), has the basic form of <command><URI>.
Other parameters can be included, but the heart of the protocol is the
command and the URI. Table 15.1 lists the five most important commands
in HTTP and describes their relationship to the traditional CRUD (create,
read, update, delete) database operations.
Table 15.1 Most Important Commands in HTTP and Their Relationship to
CRUD Database Operations
HTTP Command    CRUD Operation Equivalent
post            create
get             read
put             update/replace
patch           update/modify
delete          delete
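To illustrate the mapping, the sketch below builds one request for each of the five commands against a hypothetical /orders resource, using Java's built-in HTTP client API (Java 11 or later); the endpoint and payloads are invented.

// A sketch mapping the HTTP commands in Table 15.1 onto CRUD operations for a
// hypothetical /orders resource.
import java.net.URI;
import java.net.http.HttpRequest;

public class RestCrudSketch {
    public static void main(String[] args) {
        String base = "https://api.example.com/orders";       // hypothetical resource

        HttpRequest create = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"item\":\"book\"}"))
                .build();                                      // post -> create

        HttpRequest read = HttpRequest.newBuilder(URI.create(base + "/17"))
                .GET()
                .build();                                      // get -> read

        HttpRequest replace = HttpRequest.newBuilder(URI.create(base + "/17"))
                .PUT(HttpRequest.BodyPublishers.ofString("{\"item\":\"ebook\"}"))
                .build();                                      // put -> update/replace

        HttpRequest modify = HttpRequest.newBuilder(URI.create(base + "/17"))
                .method("PATCH", HttpRequest.BodyPublishers.ofString("{\"qty\":2}"))
                .build();                                      // patch -> update/modify

        HttpRequest remove = HttpRequest.newBuilder(URI.create(base + "/17"))
                .DELETE()
                .build();                                      // delete -> delete

        // Sending would use java.net.http.HttpClient, e.g.:
        //   HttpClient.newHttpClient().send(read, HttpResponse.BodyHandlers.ofString());
        for (HttpRequest r : new HttpRequest[] {create, read, replace, modify, remove}) {
            System.out.println(r.method() + " " + r.uri());
        }
    }
}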
Protocol Buffers
The Protocol Buffer technology originated at Google and was used
internally for several years before being released as open source in 2008.
Like JSON, Protocol Buffers use data types that are close to programming-
language data types, making serialization and deserialization efficient. As
with XML, Protocol Buffer messages have a schema that defines a valid
structure, and that schema can specify both required and optional elements
and nested elements. However, unlike both XML and JSON, Protocol
Buffers are a binary format, so they are extremely compact and use memory
and network bandwidth resources quite efficiently. In this respect, Protocol
Buffers harken back to a much earlier binary representation called Abstract
Syntax Notation One (ASN.1), which originated in the early 1980s when
network bandwidth was a precious resource and no bit could be wasted.
The Protocol Buffers open source project provides code generators to
allow easy use of Protocol Buffers with many programming languages. You
specify your message schema in a proto file, which is then compiled by a
language-specific protocol buffer compiler. The procedures generated by
the compilers will be used by an actor to serialize and by an element to
deserialize the data.
As when using XML and JSON, the interacting elements may be written
in different languages. Each element then uses the Protocol Buffer compiler
specific to its language. Although Protocol Buffers can be used for any
data-structuring purpose, they are mostly employed as part of the gRPC
protocol.
Protocol Buffers are specified using an interface description language.
Since they are compiled by language-specific compilers, the specification is
necessary to ensure correct behavior of the interface. It also acts as
documentation for the interfaces. Placing the interface specification in a
database allows for searching it to see how values propagate through the
various elements.
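The sketch below suggests what this looks like from Java, assuming a hypothetical SensorReading message has been defined in a proto file and compiled with the protocol buffer compiler; the generated class and the protobuf-java runtime library are required for it to build.

// A sketch of Protocol Buffers usage. The .proto schema below is hypothetical;
// running protoc with --java_out on it generates the SensorReading class used here.
//
//   syntax = "proto3";
//   message SensorReading {
//     string sensor_id   = 1;
//     double temperature = 2;
//     int64  timestamp   = 3;
//   }
//
public class ProtobufSketch {
    public static void main(String[] args) throws Exception {
        // The actor serializes a message using the generated builder API...
        SensorReading reading = SensorReading.newBuilder()
                .setSensorId("thermo-7")
                .setTemperature(21.4)
                .setTimestamp(System.currentTimeMillis())
                .build();
        byte[] wireFormat = reading.toByteArray();      // compact binary encoding

        // ...and the element deserializes it, possibly in another language
        // using that language's generated code.
        SensorReading received = SensorReading.parseFrom(wireFormat);
        System.out.println(received.getSensorId() + " = " + received.getTemperature());
    }
}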
Error Handling
When designing an interface, architects naturally concentrate on how it is
supposed to be used in the nominal case, when everything works according
to plan. The real world, of course, is far from the nominal case, and a well-
designed system must know how to take appropriate action in the face of
undesired circumstances. What happens when an operation is called with
invalid parameters? What happens when a resource requires more memory
than is available? What happens when a call to an operation never returns,
because it has failed? What happens when the interface is supposed to
trigger a notification event based on the value of a sensor, but the sensor
isn’t responding or is responding with gibberish?
Actors need to know whether the element is working correctly, whether
their interaction is successful and whether an error has occurred. Strategies
to do so include the following:
Indicating the source of the error helps the system choose the appropriate
correction and recovery strategy. Temporary errors with idempotent
operations can be dealt with by waiting and retrying. Errors due to invalid
input require fixing the bad requests and resending them. Missing
dependencies should be reinstalled before reattempting to use the interface.
Implementation bugs should be fixed by adding the usage failure scenario
as an additional test case to avoid regressions.
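As one concrete example of these recovery strategies, the following sketch retries an idempotent operation after temporary errors, backing off between attempts. The names and policy values are illustrative rather than prescriptive, and a production version would also distinguish retryable from non-retryable error sources.

// A sketch of one recovery strategy named above: retrying an idempotent
// operation after a temporary error, with exponential backoff.
import java.util.concurrent.Callable;

public class RetrySketch {
    static <T> T retryIdempotent(Callable<T> operation, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        Exception lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception temporaryError) {     // assume the caller passes only
                lastError = temporaryError;          // operations whose failures are retryable
                Thread.sleep(delay);
                delay *= 2;                          // back off to avoid hammering the element
            }
        }
        throw lastError;                             // persistent failure: surface it to the actor
    }

    public static void main(String[] args) throws Exception {
        String result = retryIdempotent(() -> {
            if (Math.random() < 0.5) {
                throw new RuntimeException("transient network error");
            }
            return "ok";
        }, 5, 100);
        System.out.println(result);
    }
}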
15.4 Summary
Architectural elements have interfaces, which are boundaries over which
elements interact with each other. Interface design is an architectural duty,
because compatible interfaces allow architectures with many elements to do
something productive and useful together. A primary use of an interface is
to encapsulate an element’s implementation, so that this implementation
may change without affecting other elements.
Elements may have multiple interfaces, providing different types of
access and privileges to different classes of actors. Interfaces state which
resources the element provides to its actors as well as what the element
needs from its environment to function correctly. Like architectures
themselves, interfaces should be as simple as possible, but no simpler.
Interfaces have operations, events, and properties; these are the parts of
an interface that the architect can design. To do so, the architect must
decide the element’s
Interface scope
Interaction style
Representation, structure, and semantics of the exchanged data
Error handling
A hypervisor requires that its guest VMs use the same instruction set as
the underlying physical CPU—the hypervisor does not translate or simulate
instruction execution. For example, if you have a VM for a mobile or
embedded device that uses an ARM processor, you cannot run that virtual
machine on a hypervisor that uses an x86 processor. Another technology,
related to hypervisors, supports cross-processor execution; it is called an
emulator. An emulator reads the binary code for the target or guest
processor and simulates the execution of guest instructions on the host
processor. The emulator often also simulates guest I/O hardware devices.
For example, the open source QEMU emulator1 can emulate a full PC
system, including BIOS, x86 processor and memory, sound card, graphics
card, and even a floppy disk drive.
1. qemu.org
Hosted/Type 2 hypervisors and emulators allow a user to interact with
the applications running inside the VM through the host machine’s on-
screen display, keyboard, and mouse/touchpad. Developers working on
desktop applications or working on specialized devices, such as mobile
platforms or devices for the Internet of Things, may use a hosted/Type 2
hypervisor and/or an emulator as part of their build/test/integrate toolchain.
A hypervisor performs two main functions: (1) It manages the code
running in each VM, and (2) it manages the VMs themselves. To elaborate:
1. Code that communicates outside the VM by accessing a virtualized
disk or network interface is intercepted by the hypervisor and
executed by the hypervisor on behalf of the VM. This allows the
hypervisor to tag these external requests so that the response to these
requests can be routed to the correct VM.
The response to an external request to an I/O device or the network is
an asynchronous interrupt. This interrupt is initially handled by the
hypervisor. Since multiple VMs are operating on a single physical
host machine and each VM may have I/O requests outstanding, the
hypervisor must have a method for forwarding the interrupt to the
correct VM. This is the purpose of the tagging mentioned earlier.
2. VMs must be managed. For example, they must be created and
destroyed, among other things. Managing VMs is a function of the
hypervisor. The hypervisor does not decide on its own to create or
destroy a VM, but rather acts on instructions from a user or, more
frequently, from a cloud infrastructure (you’ll read more about this in
Chapter 17). The process of creating a VM involves loading a VM
image (discussed in the next section).
In addition to creating and destroying VMs, the hypervisor monitors
them. Health checks and resource usage are part of the monitoring.
The hypervisor is also located inside the defensive security perimeter
of the VMs, as a defense against attacks.
Finally, the hypervisor is responsible for ensuring that a VM does not
exceed its resource utilization limits. Each VM has limits on CPU
utilization, memory, and disk and network I/O bandwidth. Before
starting a VM, the hypervisor first ensures that sufficient physical
resources are available to satisfy that VM’s needs, and then the
hypervisor enforces those limits while the VM is running.
A VM is booted just as a bare-metal physical machine is booted. When
the machine begins executing, it automatically reads a special program
called the boot loader from disk storage, either internal to the computer or
connected through a network. The boot loader reads the operating system
code from disk into memory, and then transfers execution to the operating
system. In the case of a physical computer, the connection to the disk drive
is made during the power-up process. In the case of the VM, the connection
to the disk drive is established by the hypervisor when it starts the VM. The
“VM Images” section discusses this process in more detail.
From the perspective of the operating system and software services
inside a VM, it appears as if the software is executing inside of a bare-metal
physical machine. The VM provides a CPU, memory, I/O devices, and a
network connection.
Given the many concerns that it must address, the hypervisor is a
complicated piece of software. One concern with VMs is the overhead
introduced by the sharing and isolation needed for virtualization. That is,
how much slower does a service run on a virtual machine, compared to
running directly in a bare-metal physical machine? The answer to this
question is complicated: It depends on the characteristics of the service and
on the virtualization technology used. For example, services that perform
more disk and network I/O incur more overhead than services that do not
share these host resources. Virtualization technology is improving all the
time, but overheads of approximately 10% have been reported by Microsoft
on its Hyper-V hypervisor.2
2. https://docs.microsoft.com/en-us/biztalk/technical-guides/system-
resource-costs-on-hyper-v
There are two major implications of VMs for an architect:
1. Performance. Virtualization incurs a performance cost. While Type 1
hypervisors carry only a modest performance penalty, Type 2
hypervisors may impose a significantly larger overhead.
2. Separation of concerns. Virtualization allows an architect to treat
runtime resources as commodities, deferring provisioning and
deployment decisions to another person or organization.
16.3 VM Images
We call the contents of the disk storage that we boot a VM from a VM
image. This image contains the bits that represent the instructions and data
that make up the software that we will run (i.e., the operating system and
services). The bits are organized into files and directories according to the
file system used by your operating system. The image also contains the boot
load program, stored in its predetermined location.
There are three approaches you can follow to create a new VM image:
1. You can find a machine that is already running the software you want
and make a snapshot copy of the bits in that machine’s memory.
2. You can start from an existing image and add additional software.
3. You can create an image from scratch. Here, you start by obtaining
installation media for your chosen operating system. You boot your
new machine from the install media, and it formats the machine’s disk
drive, copies the operating system onto the drive, and adds the boot
loader in the predetermined location.
For the first two approaches, repositories of machine images (usually
containing open-source software) are available that provide a variety of
minimal images with just OS kernels, other images that include complete
applications, and everything in between. These efficient starting points can
support you in quickly trying out a new package or program.
However, some issues may arise when you are pulling down and running
an image that you (or your organization) did not create:
These images are very large, so transferring them over a network can be
very slow.
An image is bundled with all of its dependencies.
You can build a VM image on your development computer and then
deploy it to the cloud.
You may wish to add your own services to the VM.
While you could easily install services when creating an image, this
would lead to a unique image for every version of every service. Aside
from the storage cost, this proliferation of images becomes difficult to keep
track of and manage. Thus it is customary to create images that contain
only the operating system and other essential programs, and then add
services to these images after the VM is booted, in a process called
configuration.
16.4 Containers
VMs solve the problem of sharing resources and maintaining isolation.
However, VM images can be large, and transferring VM images around the
network is time-consuming. Suppose you have an 8 GB(yte) VM image.
You wish to move this from one location on the network to another. In
theory, on a 1 Gb(it) per second network, this will take 64 seconds.
However, in practice a 1 Gbps network operates at around 35% efficiency.
Thus transferring an 8 GB VM image will take more than 3 minutes in the
real world. Although you can adopt some techniques to reduce this transfer
time, the result will still be a duration measured in minutes. After the image
is transferred, the VM must boot the operating system and start your
services, which takes still more time.
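The back-of-the-envelope arithmetic above can be captured in a few lines; the 35 percent efficiency figure is the working assumption used in the text, not a general law.

// The transfer-time arithmetic from the text, as a sketch.
public class TransferTimeSketch {
    public static void main(String[] args) {
        double imageGigabytes = 8.0;
        double imageGigabits  = imageGigabytes * 8;                      // 64 Gb
        double linkGbps       = 1.0;
        double efficiency     = 0.35;                                    // practical throughput fraction

        double idealSeconds  = imageGigabits / linkGbps;                 // 64 s in theory
        double actualSeconds = imageGigabits / (linkGbps * efficiency);  // ~183 s, i.e., more than 3 minutes

        System.out.printf("ideal: %.0f s, realistic: %.0f s (%.1f minutes)%n",
                idealSeconds, actualSeconds, actualSeconds / 60);
    }
}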
Containers are a mechanism to maintain most of the advantages of
virtualization while reducing the image transfer time and startup time. Like
VMs and VM images, containers are packaged into executable container
images for transfer. (However, this terminology is not always followed in
practice.)
Reexamining Figure 16.1, we see that a VM executes on virtualized
hardware under the control of the hypervisor. In Figure 16.3, we see several
containers operating under the control of a container runtime engine, which
in turn is running on top of a fixed operating system. The container runtime
engine acts as a virtualized operating system. Just as all VMs on a physical
host share the same underlying physical hardware, all containers within a
host share the same operating system kernel through the runtime engine
(and through the operating system, they share the same underlying physical
hardware). The operating system can be loaded either onto a bare-metal
physical machine or a virtual machine.
Figure 16.3 Containers on top of a container runtime engine on top of
an operating system on top of a hypervisor (or bare metal)
16.7 Pods
Kubernetes is open source orchestration software for deploying, managing,
and scaling containers. It has one more element in its hierarchy: Pods. A
Pod is a group of related containers. In Kubernetes, nodes (hardware or
VMs) contain Pods, and Pods contain containers, as shown in Figure 16.4.
The containers in a Pod share an IP address and port space to receive
requests from other services. They can communicate with each other using
interprocess communication (IPC) mechanisms such as semaphores or
shared memory, and they can share ephemeral storage volumes that exist for
the lifetime of the Pod. They have the same lifetime—the containers in
Pods are allocated and deallocated together. For example, service meshes,
discussed in Chapter 9, are often packaged as a Pod.
16.9 Summary
Virtualization has been a boon for software and system architects, as it
provides efficient, cost-effective allocation platforms for networked
(typically web-based) services. Hardware virtualization allows for the
creation of several virtual machines that share the same physical machine. It
does this while enforcing isolation of the CPU, memory, disk storage, and
network. Consequently, the resources of the physical machine can be shared
among several VMs, while the number of physical machines that an
organization must purchase or rent is minimized.
A VM image is the set of bits that are loaded into a VM to enable its
execution. VM images can be created by various techniques for
provisioning, including using operating system functions or loading a pre-
created image.
Containers are a packaging mechanism that virtualizes the operating
system. A container can be moved from one environment to another if a
compatible container runtime engine is available. The interface to container
runtime engines has been standardized.
Placing several containers into a Pod means that they are all allocated
together and any communication between the containers can be done
quickly.
Serverless architecture allows for containers to be rapidly instantiated
and moves the responsibility for allocation and deallocation to the cloud
provider infrastructure.
Suppose you wish to have a VM allocated for you in the cloud. You send
a request to the management gateway asking for a new VM instance. This
request has many parameters, but three essential parameters are the cloud
region where the new instance will run, the instance type (e.g., CPU and
memory size), and the ID of a VM image. The management gateway is
responsible for tens of thousands of physical computers, and each physical
computer has a hypervisor that manages the VMs on it. So, the
management gateway will identify a hypervisor that can manage an additional VM of the type you have selected by checking whether that physical machine has enough unallocated CPU and memory capacity to meet your needs. If so, it will ask that hypervisor to create an additional
VM; the hypervisor will perform this task and return the new VM’s IP
address to the management gateway. The management gateway then sends
that IP address to you. The cloud provider ensures that enough physical
hardware resources are available in its data centers so that your request will
never fail due to insufficient resources.
The management gateway returns not only the IP address for the newly
allocated VM, but also a hostname. The hostname returned after allocating
a VM reflects the fact that the IP address has been added to the cloud
Domain Name System (DNS). Any VM image can be used to create the
new VM instance; that is, the VM image may comprise a simple service or
be just one step in the deployment process to create a complex system.
The management gateway performs other functions in addition to
allocating new VMs. It supports collecting billing information about the
VM, and it provides the capability to monitor and destroy the VM.
The management gateway is accessed through messages over the Internet
to its API. These messages can come from another service, such as a
deployment service, or they can be generated from a command-line
program on your computer (allowing you to script operations). The
management gateway can also be accessed through a web-based application
operated by the cloud service provider, although this kind of interactive
interface is not efficient for more than the most trivial operations.
Timeouts
Recall from Chapter 4 that timeout is a tactic for availability. In a
distributed system, timeouts are used to detect failure. There are several
consequences of using timeouts:
Hedged requests. Make more requests than are needed and then cancel
the requests (or ignore responses) after sufficient responses have been
received. For example, suppose 10 instances of a microservice (see
Chapter 5) are to be launched. Issue 11 requests and after 10 have
completed, terminate the request that has not responded yet.
Alternative requests. A variant of the hedged request technique is called
alternative request. In the just-described scenario, issue 10 requests.
When 8 requests have completed, issue 2 more, and when a total of 10
responses have been received, cancel the 2 requests that are still
remaining.
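The following Java sketch captures the hedged-request idea under invented names: it issues one extra asynchronous request, keeps the first ten responses, and cancels (or ignores) the straggler.

// A sketch of hedged requests: issue N+1 requests, keep the first N responses.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class HedgedRequestSketch {
    static CompletableFuture<String> launchInstance(int id) {
        // Stand-in for an asynchronous call with a long-tail latency distribution.
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep((long) (Math.random() * 1000)); } catch (InterruptedException e) { }
            return "instance-" + id;
        });
    }

    public static void main(String[] args) throws Exception {
        int needed = 10;
        int hedged = needed + 1;                       // issue one extra request
        List<CompletableFuture<String>> requests = new ArrayList<>();
        List<String> results = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(needed);

        for (int i = 0; i < hedged; i++) {
            CompletableFuture<String> f = launchInstance(i);
            requests.add(f);
            f.thenAccept(name -> {
                synchronized (results) {
                    if (results.size() < needed) {     // keep only the first N responses
                        results.add(name);
                        done.countDown();
                    }
                }
            });
        }

        done.await();                                  // wait until 10 have completed
        requests.forEach(f -> f.cancel(true));         // cancel (or ignore) the straggler
        System.out.println(results.size() + " instances ready");
    }
}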
Autoscaling VMs
Returning to Figure 17.4, suppose that the two clients generate more
requests than can be handled by the two service instances shown.
Autoscaling creates a third instance, based on the same virtual machine
image that was used for the first two instances. The new instance is
registered with the load balancer so that subsequent requests are distributed
among three instances rather than two. Figure 17.5 shows a new component, the autoscaler, which monitors the utilization of the server instances and scales their number accordingly. Once the autoscaler creates a new service instance, it
notifies the load balancer of the new IP address so that the load balancer
can distribute requests to the new instance, in addition to the requests it
distributes to the other instances.
Figure 17.5 An autoscaler monitoring the utilization
Because the clients do not know how many instances exist or which
instance is serving their requests, autoscaling activities are invisible to
service clients. Furthermore, if the client request rate decreases, an instance
can be removed from the load balancer pool, halted, and deallocated, again
without the client’s knowledge.
As an architect of a cloud-based service, you can set up a collection of
rules for the autoscaler that govern its behavior. The configuration
information you provide to the autoscaler includes the following items:
Autoscaling Containers
Because containers are executing on runtime engines that are hosted on
VMs, scaling containers involves two different types of decisions. When
scaling VMs, an autoscaler decides that additional VMs are required, and
then allocates a new VM and loads it with the appropriate software. Scaling
containers means making a two-level decision. First, decide that an
additional container (or Pod) is required for the current workload. Second,
decide whether the new container (or Pod) can be allocated on an existing
runtime engine instance or whether a new instance must be allocated. If a
new instance must be allocated, you need to check whether a VM with
sufficient capacity is available or if an additional VM needs to be allocated.
The software that controls the scaling of containers is independent of the
software that controls the scaling of VMs. This allows the scaling of
containers to be portable across different cloud providers. It is possible that
the evolution of containers will integrate the two types of scaling. In such a
case, you should be aware that you may be creating a dependency between
your software and the cloud provider that could be difficult to break.
17.4 Summary
The cloud is composed of distributed data centers, with each data center
containing tens of thousands of computers. It is managed through a
management gateway that is accessible over the Internet and is responsible
for allocating, deallocating, and monitoring VMs, as well as measuring
resource usage and computing billing.
Because of the large number of computers in a data center, failure of a
computer in such a center happens quite frequently. You, as an architect of
a service, should assume that at some point, the VMs on which your service
is executing will fail. You should also assume that your requests for other
services will exhibit a long tail distribution, such that as many as 5 percent
of your requests will take 5 to 10 times longer than the average request.
Thus you must be concerned about the availability of your service.
Because single instances of your service may not be able to satisfy all
requests in a timely manner, you may decide to run multiple VMs or
containers containing instances of your service. These multiple instances sit
behind a load balancer. The load balancer receives requests from clients and
distributes the requests to the various instances.
The existence of multiple instances of your service and multiple clients
has a significant impact on how you handle state. Different decisions on
where to keep the state will lead to different results. The most common
practice is to keep services stateless, because stateless services allow for
easier recovery from failure and easier addition of new instances. Small
amounts of data can be shared among service instances by using a
distributed coordination service. Distributed coordination services are
complicated to implement, but several proven open source implementations
are available for your use.
The cloud infrastructure can automatically scale your service by creating
new instances when demand grows and removing instances when demand
shrinks. You specify the behavior of the autoscaler through a set of rules
giving the conditions for the creation or deletion of instances.
The telephone will be used to inform people that a telegram has been
sent.
—Alexander Graham Bell
So, what did Alexander Graham Bell know, anyway? Mobile systems,
including and especially phones, are ubiquitous in our world today. Besides
phones, they include trains, planes, and automobiles; they include ships and
satellites, entertainment and personal computing devices, and robotic
systems (autonomous or not); they include essentially any system or device
that has no permanent connection to a continuous abundant power source.
A mobile system has the ability to be in movement while continuing to
deliver some or all of its functionality. This makes dealing with some of its
characteristics a different matter from dealing with fixed systems. In this
chapter we focus on five of those characteristics:
1. Energy. Mobile systems have limited sources of power and must be
concerned with using power efficiently.
2. Network connectivity. Mobile systems tend to deliver much of their
functionality by exchanging information with other devices while
they are in motion. They must therefore connect to those devices, but
their mobility makes these connections tricky.
3. Sensors and actuators. Mobile systems tend to gain more information
from sensors than fixed systems do, and they often use actuators to
interact with their environment.
4. Resources. Mobile systems tend to be more resource-constrained than
fixed systems. For one thing, they are often quite small, such that
physical packaging becomes a limiting factor. For another, their
mobility often makes weight a factor. Mobile devices that must be
small and lightweight have limits on the resources they can provide.
5. Life cycle. Testing mobile systems differs from the testing of other
systems. Deploying new versions also introduces some special issues.
When designing a system for a mobile platform, you must deal with a
large number of domain-specific requirements. Self-driving automobiles
and autonomous drones must be safe; smartphones must provide an open
platform for a variety of vastly different applications; entertainment
systems must work with a wide range of content formats and service
providers. In this chapter, we’ll focus on the characteristics shared by many
(if not all) mobile systems that an architect must consider when designing a
system.
18.1 Energy
In this section, we focus on the architectural concerns most relevant to
managing the energy of mobile systems. For many mobile devices, their
source of energy is a battery with a very finite capacity for delivering that
energy. Other mobile devices, such as cars and planes, run on the power
produced by generators, which in turn may be powered by engines that run
on fuel—again, a finite resource.
Within all of these categories, the technologies and the standards are
evolving rapidly.
Reading raw data. The lowest level of the stack is a software driver to
read the raw data. The driver reads the sensor either directly or, in the
case where the sensor is a portion of a sensor hub, through the hub. The
driver gets a reading from the sensor periodically. The sampling period is a parameter that influences both the processor load incurred by reading and processing the sensor data and the accuracy of the resulting representation.
Smoothing data. Raw data usually has a great deal of noise or variation.
Voltage variations, dirt or grime on a sensor, and a myriad of other
causes can make two successive readings of a sensor differ. Smoothing
is a process that uses a series of measurements over time to produce an
estimate that tends to be more accurate than single readings. Calculating a moving average and using a Kalman filter are two of the many techniques for smoothing data (a minimal moving-average sketch follows this list).
Converting data. Sensors can report data in many formats—from
voltage readings in millivolts to altitude above sea level in feet to
temperature in degrees Celsius. It is possible, however, that two
different sensors measuring the same phenomenon might report their
data in different formats. The converter is responsible for converting
readings from whatever form is reported by the sensor into a common
form meaningful to the application. As you might imagine, this function
may need to deal with a wide variety of sensors.
Sensor fusion. Sensor fusion combines data from multiple sensors to
build a more accurate or more complete or more dependable
representation of the environment than would be possible from any
individual sensor. For example, how does an automobile recognize
pedestrians in its path or likely to be in its path by the time it gets there,
day or night, in all kinds of weather? No single sensor can accomplish
this feat. Instead, the automobile must intelligently combine inputs from
sensors such as thermal imagers, radar, lidar, and cameras.
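As referenced in the smoothing item above, here is a minimal moving-average smoother in Java. The window size and readings are invented, and a production sensor stack might well use a more sophisticated filter such as a Kalman filter.

// A minimal smoothing sketch: a simple moving average over the last N raw readings.
import java.util.ArrayDeque;
import java.util.Deque;

class MovingAverageSmoother {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    MovingAverageSmoother(int windowSize) { this.windowSize = windowSize; }

    double smooth(double rawReading) {
        window.addLast(rawReading);
        sum += rawReading;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();        // drop the oldest reading
        }
        return sum / window.size();             // estimate based on recent history
    }

    public static void main(String[] args) {
        MovingAverageSmoother smoother = new MovingAverageSmoother(4);
        double[] noisyReadings = {20.1, 20.9, 19.8, 20.4, 25.0, 20.2};   // one spike
        for (double r : noisyReadings) {
            System.out.printf("raw=%.1f smoothed=%.2f%n", r, smoother.smooth(r));
        }
    }
}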
18.4 Resources
In this section, we discuss computing resources from the perspective of
their physical characteristics. For example, in devices where energy comes
from batteries, we need to be concerned with battery volume, weight, and
thermal properties. The same holds true for resources such as networks,
processors, and sensors.
The tradeoff in the choice of resources is between the contribution of the
particular resource under consideration and its volume, weight, and cost.
Cost is always a factor. Costs include both the manufacturing costs and
nonrecurring engineering costs. Many mobile systems are manufactured by
the millions and are highly price-sensitive. Thus a small difference in the
price of a processor multiplied by the millions of copies of the system in
which that processor is embedded can make a significant difference to the
profitability of the organization producing the system. Volume discounts
and reuse of hardware across different products are techniques that device
vendors use to reduce costs.
Volume, weight, and cost are constraints given both by the marketing
department of an organization and by the physical considerations of its use.
The marketing department is concerned with customers’ reactions. The
physical considerations for the device’s use depend on both human and
usage factors. Smartphone displays must be large enough for a human to
read; automobiles are constrained by weight limits on roads; trains are
constrained by track width; and so forth.
Other constraints on mobile system resources (and therefore on software
architects) reflect the following factors:
Hardware First
For many mobile systems, the hardware is chosen before the software is
designed. Consequently, the software architecture must live with the
constraints imposed by the chosen hardware.
The main stakeholders in early hardware choices are management, sales,
and regulators. Their concerns typically focus on ways to reduce risks
rather than ways to promote quality attributes. The best approach for a
software architect is to actively drive these early discussions, emphasizing
the tradeoffs involved, instead of passively awaiting their outcomes.
Testing
Mobile devices present some unique considerations for testing:
Deploying Updates
In a mobile device, updates to the system fix issues, provide new functionality, or complete features that were unfinished, and perhaps only partially installed, at the time of an earlier release. Such an update may
target the software, the data, or (less often) the hardware. Modern cars, for
example, require software updates, which are fetched over networks or
downloaded via USB interfaces. Beyond providing for the capability of
updates during operation, the following specific issues relate to deploying
updates:
Logging
Logs are critical when investigating and resolving incidents that have
occurred or may occur. In mobile systems, the logs should be offloaded to a
location where they are accessible regardless of the accessibility of the
mobile system itself. This is useful not only for incident handling, but also
for performing various types of analyses on the usage of the system. Many
software applications do something similar when they encounter a problem
and ask for permission to send the details to the vendor. For mobile
systems, this logging capability is particularly important, and they may very
well not ask permission to obtain the data.
18.6 Summary
Mobile systems span a broad range of forms and applications, from
smartphones and tablets to vehicles such as automobiles and aircraft. We
have categorized the differences between mobile systems and fixed systems
as being based on five characteristics: energy, connectivity, sensors,
resources, and life cycle.
The energy in many mobile systems comes from batteries. Batteries are
monitored to determine both the remaining time on the battery and the
usage of individual applications. Energy usage can be controlled by
throttling individual applications. Applications should be constructed to
survive power failures and restart seamlessly when power is restored.
Connectivity means connecting to other systems and the Internet through
wireless means. Wireless communication can be via short-distance
protocols such as Bluetooth, medium-range protocols such as Wi-Fi
protocols, and long-distance cellular protocols. Communication should be
seamless when moving from one protocol class to another, and
considerations such as bandwidth and cost help the architect decide which
protocols to support.
Mobile systems utilize a variety of sensors. Sensors provide readings of
the external environment, which the architect then uses to develop a
representation within the system of the external environment. Sensor
readings are processed by a sensor stack specific to each operating system;
these stacks will deliver readings meaningful to the representation. It may
take multiple sensors to develop a meaningful representation, with the
readings from these sensors then being fused (integrated). Sensors may also
become degraded over time, so multiple sensors may be needed to get an
accurate representation of the phenomenon being measured.
Resources have physical characteristics such as size and weight, have
processing capabilities, and carry a cost. The design choices involve
tradeoffs among these factors. Critical functions may require more
powerful and reliable resources. Some functions may be shared between the
mobile system and the cloud, and some functions may be shut down in
certain modes to free up resources for other functions.
Life-cycle issues include choice of hardware, testing, deploying updates,
and logging. Testing of the user interface may be more complicated with
mobile systems than with fixed systems. Likewise, deployment is more
complicated because of bandwidth, safety considerations, and other issues.
Once you have a utility tree filled out, you can use it to make important
checks. For instance:
19.6 Summary
Architectures are driven by architecturally significant requirements. An
ASR must have:
Why do we explicitly capture the design purpose? You need to make sure
that you are clear about your goals for a round. In an incremental design
context comprising multiple rounds, the purpose for a design round may be,
for example, to produce a design for early estimation, to refine an existing
design to build a new increment of the system, or to design and generate a
prototype to mitigate certain technical risks. In addition, you need to know
the existing architecture’s design, if this is not greenfield development.
At this point, the primary functionality—typically captured as a set of
use cases or user stories—and QA scenarios should have been prioritized,
ideally by your most important project stakeholders. (You can employ
several different techniques to elicit and prioritize them, as discussed in
Chapter 19). You, the architect, must now “own” these. For example, you
need to check whether any important stakeholders were overlooked in the
original requirements elicitation process, and whether any business
conditions have changed since the prioritization was performed. These
inputs really do “drive” design, so getting them right and getting their
priority right are crucial. We cannot stress this point strongly enough.
Software architecture design, like most activities in software engineering, is
a “garbage-in-garbage-out” process. The results of ADD cannot be good if
the inputs are poorly formed.
The drivers become part of an architectural design backlog that you
should use to perform the different design iterations. When you have made
design decisions that account for all of the items in the backlog, you’ve
completed this round. (We discuss the idea of a backlog in more depth in
Section 20.8.)
Steps 2–7 make up the activities for each design iteration carried out
within this design round.
Iterate If Necessary
You should perform additional iterations and repeat steps 2–7 for every
driver that was considered. More often than not, however, this kind of
repetition will not be possible because of time or resource constraints that
force you to stop the design activities and move on to implementation.
What are the criteria for evaluating if more design iterations are
necessary? Let risk be your guide. You should at least have addressed the
drivers with the highest priority. Ideally, you should have certainty that
critical drivers are satisfied or, at least, that the design is “good enough” to
satisfy them.
Creation of Prototypes
In case the previously mentioned analysis techniques do not guide you to
make an appropriate selection of design concepts, you may need to create
prototypes and collect measurements from them. Creating early
“throwaway” prototypes is a useful technique to help in the selection of
externally developed components. This type of prototype is usually created
without consideration for maintainability, reuse, or allowance for achieving
other important goals. Such a prototype should not be used as a basis for
further development.
Although the creation of prototypes can be costly, certain scenarios
strongly motivate them. When thinking about whether you should create a
prototype, ask these questions:
If most of your answers to these questions are “yes,” then you should
strongly consider the creation of a throwaway prototype.
When you instantiate a design concept, you may actually affect more
than one structure. For example, in a particular iteration, you might
instantiate the passive redundancy (warm spare) pattern, introduced in
Chapter 4. This will result in both a C&C structure and an allocation
structure. As part of applying this pattern, you will need to choose the
number of spares, the degree to which the state of the spares is kept
consistent with that of the active node, a mechanism for managing and
transferring state, and a mechanism for detecting the failure of a node.
These decisions are responsibilities that must live somewhere in the
elements of a module structure.
Instantiating Elements
Here’s how instantiation might look for each of the design concept
categories:
Defining Interfaces
Interfaces establish a contractual specification that allows elements to
collaborate and exchange information. They may be either external or
internal.
External interfaces are interfaces of other systems with which your
system must interact. These may form constraints for your system, since
you usually cannot influence their specification. As we noted earlier,
establishing a system context at the beginning of the design process is
useful to identify external interfaces. Since external entities and the system
under development interact via interfaces, there should be at least one
external interface per external system (as shown in Figure 20.2).
Internal interfaces are interfaces between the elements that result from
the instantiation of design concepts. To identify the relationships and the
interface details, you need to understand how the elements interact with
each other to support use cases or QA scenarios. As we said in Chapter 15
in our discussion of software interfaces, "interacts" means anything one
element does that can impact the processing of another element. A
particularly common type of interaction is the runtime exchange of
information.
Behavioral representations such as UML sequence diagrams, statecharts,
and activity diagrams (see Chapter 22) allow you to model the information
that is exchanged between elements during execution. This type of analysis
is also useful to identify relationships between elements: If two elements
need to exchange information directly or otherwise depend on each other,
then a relationship between these elements exists. Any information that is
exchanged becomes part of the specification of the interface.
The identification of interfaces is usually not performed equally across
all design iterations. When you are starting the design of a greenfield
system, for example, your first iterations will produce only abstract
elements such as layers; these elements will then be refined in later
iterations. The interfaces of abstract elements such as layers are typically
underspecified. For example, in an early iteration you might simply specify
that the UI tier sends “commands” to the business logic tier, and the
business logic tier sends “results” back. As the design process proceeds,
and particularly when you create structures to address specific use cases
and QA scenarios, you will need to refine the interfaces of the elements that
participate in these interactions.
In some special cases, identifying the appropriate interfaces may be
greatly simplified. For example, if you choose a complete technology stack
or a set of components that have been designed to interoperate, then the
interfaces will already be defined by those technologies. In such a case, the
specification of interfaces is a relatively trivial task, as the chosen
technologies have “baked in” many interface assumptions and decisions.
Finally, be aware that not all of the internal interfaces need to be
identified in any given ADD iteration. Some may be delegated to later
design activities.
Also, you may add more items to the backlog as decisions are made. As
a case in point, if you choose a reference architecture, you will probably
need to add specific concerns, or QA scenarios derived from them, to the
architectural design backlog. For example, if we choose a web application
reference architecture and discover that it does not provide session
management, then that becomes a concern that needs to be added to the
backlog.
20.7 Summary
Design is hard. Methods are needed to make it more tractable (and
repeatable). In this chapter, we discussed the attribute-driven design (ADD)
method in detail; it allows an architecture to be designed in a systematic and
cost-effective way.
We also discussed several important aspects that need to be considered in
the steps of the design process. These aspects include the identification and
selection of design concepts, their use in producing structures, the
definition of interfaces, the production of preliminary documentation, and
ways to track design progress.
The importance of the decision. The more important the decision, the
more care should be taken in making it and making sure it’s right.
The number of potential alternatives. The more alternatives, the more
time could be spent in evaluating them.
Good enough as opposed to perfect. Many times, two possible
alternatives do not differ dramatically in their consequences. In such a
case, it is more important to make a choice and move on with the design
process than it is to be absolutely certain that the best choice is being
made.
Evaluation by Outsiders
Outside evaluators can cast a more objective eye on an architecture.
“Outside” is relative; this may mean outside the development project,
outside the business unit where the project resides but within the same
company, or outside the company altogether. To the degree that evaluators
are “outside,” they are less likely to be afraid to bring up sensitive
problems, or problems that aren’t apparent because of organizational culture
or because “we’ve always done it that way.”
Often, outsiders are chosen to participate in the evaluation because they
possess specialized knowledge or experience, such as knowledge about a
quality attribute that’s important to the system being examined, skill with a
particular technology being employed, or long experience in successfully
evaluating architectures.
Also, whether justified or not, managers tend to be more inclined to
listen to problems uncovered by an outside team hired at considerable cost
than by team members within the organization. This can be understandably
frustrating to project staff who may have been complaining about the same
problems, to no avail, for months.
In principle, an outside team may evaluate a completed architecture, an
incomplete architecture, or a portion of an architecture. In practice, because
engaging them is complicated and often expensive, they tend to be used to
evaluate complete architectures.
Project decision makers. These people are empowered to speak for the
development project or have the authority to mandate changes to it.
They usually include the project manager and, if an identifiable
customer is footing the bill for the development, a representative of that
customer may be present as well. The architect is always included—a
cardinal rule of architecture evaluation is that the architect must
willingly participate.
Architecture stakeholders. Stakeholders have a vested interest in the
architecture performing as advertised. They are the people whose ability
to do their job hinges on the architecture promoting modifiability,
security, high reliability, or the like. Stakeholders include developers,
testers, integrators, maintainers, performance engineers, users, and
builders of systems interacting with the one under consideration. Their
job during an evaluation is to articulate the specific quality attribute
goals that the architecture should meet for the system to be considered a
success. A rule of thumb—and that is all it is—is that you should
expect to enlist 10 to 25 stakeholders for the evaluation of a large
enterprise-critical architecture. Unlike the evaluation team and the
project decision makers, stakeholders do not participate in the entire
exercise.
Table 21.2 shows the four phases of the ATAM, who participates in each
phase, and the typical cumulative time spent on the activity—possibly in
several segments.
Table 21.2 ATAM Phases and Their Characteristics
Phase 0, Partnership and preparation. Participants: evaluation team leadership and key project decision makers. Typical cumulative time: proceeds informally as required, perhaps over a few weeks.
Phase 1, Evaluation. Participants: evaluation team and project decision makers. Typical cumulative time: 1–2 days.
Phase 2, Evaluation (continued). Participants: evaluation team, project decision makers, and stakeholders. Typical cumulative time: 2 days.
Phase 3, Follow-up. Participants: evaluation team and evaluation client. Typical cumulative time: 1 week.
The frequency of heartbeats affects the time in which the system can
detect a failed component. Some assignments will result in
unacceptable values of this response; these are risks.
The frequency of heartbeats determines the time for detection of a fault.
Higher frequency leads to improved availability but also consumes
more processing time and communication bandwidth (potentially
leading to reduced performance). This is a tradeoff.
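As a rough back-of-the-envelope illustration of this tradeoff, the following sketch (all values and names are hypothetical) computes the worst-case detection time and the bandwidth consumed by heartbeats for an assumed period and failure-declaration policy:

// Illustrative only: worst-case fault-detection time and bandwidth cost
// for a given heartbeat period. All values are hypothetical assumptions.
public class HeartbeatTradeoff {
    public static void main(String[] args) {
        double periodMs = 500;        // assumed heartbeat period
        int missesBeforeFailure = 3;  // assumed policy: declare failure after 3 missed beats
        int payloadBytes = 64;        // assumed heartbeat message size
        int monitoredNodes = 200;     // assumed number of monitored components

        double worstCaseDetectionMs = periodMs * missesBeforeFailure;
        double bandwidthBytesPerSec = monitoredNodes * payloadBytes * (1000.0 / periodMs);

        System.out.printf("Worst-case detection time: %.0f ms%n", worstCaseDetectionMs);
        System.out.printf("Heartbeat bandwidth: %.0f bytes/s%n", bandwidthBytesPerSec);
    }
}

Halving the period halves the worst-case detection time but doubles the heartbeat bandwidth, which is exactly the kind of sensitivity point and tradeoff the analysis records.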
These issues, in turn, may catalyze a deeper analysis, depending on how
the architect responds. For example, if the architect cannot characterize the
number of clients and cannot say how load balancing will be achieved by
allocating processes to hardware, there is little point in proceeding to any
performance analysis. If such questions can be answered, the evaluation
team can perform at least a rudimentary, or back-of-the-envelope, analysis
to determine if these architectural decisions are problematic vis-à-vis the
quality attribute requirements they are meant to address.
The analysis during step 6 is not meant to be comprehensive. The key is
to elicit sufficient architectural information to establish some link between
the architectural decisions that have been made and the quality attribute
requirements that need to be satisfied.
Figure 21.1 shows a template for capturing the analysis of an
architectural approach for a scenario. As shown in the figure, based on the
results of this step, the evaluation team can identify and record a set of risks
and non-risks, sensitivity points, and tradeoffs.
Figure 21.1 Example of architecture approach analysis (adapted from
[Clements 01b])
At the end of step 6, the evaluation team should have a clear picture of
the most important aspects of the entire architecture, the rationale for key
design decisions, and a list of risks, non-risks, sensitivity points, and
tradeoff points.
At this point, phase 1 is concluded.
Tactics-Based Questionnaires
Another, even lighter-weight, evaluation method that we
discussed in Chapter 3 is the tactics-based questionnaire. A tactics-
based questionnaire focuses on a single quality attribute at a time. It
can be used by the architect to aid in reflection and introspection, or it
can be used to structure a question-and-answer session between an
evaluator (or evaluation team) and an architect (or group of designers).
This kind of session is typically short—around one hour per quality
attribute—but can reveal a great deal about the design decisions taken,
and those not taken, in pursuit of control of a quality attribute and the
risks that are often buried within those decisions. We have provided
quality attribute–specific questionnaires in Chapters 4–13 to help
guide you in this process.
A tactics-based analysis can lead to surprising results in a very
short time. For example, once I was analyzing a system that managed
healthcare data. We had agreed to analyze the quality attribute of
security. During the session, I dutifully walked through the security
tactics–based questionnaire, asking each question in turn (as you may
recall, in these questionnaires each tactic is transformed into a
question). For example, I asked, “Does the system support the
detection of intrusions?”, “Does the system support the verification of
message integrity?”, and so forth. When I got to the question “Does
the system support data encryption?”, the architect paused and smiled.
Then he (sheepishly) admitted that the system had a requirement that
no data could be passed over a network “in the clear”—that is,
without encryption. So they XOR’ed all data before sending it over
the network.
This is a great example of the kind of risk that a tactics-based
questionnaire can uncover, very quickly and inexpensively. Yes, they
had met the requirement in a strict sense—they were not sending any
data in the clear. But the encryption algorithm that they chose could
be cracked by a high school student with modest abilities!
—RK
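To make the risk concrete, here is a minimal sketch, with hypothetical names and a hypothetical key, of why XOR with a fixed key offers no real confidentiality: applying the same key a second time recovers the plaintext, so anyone who guesses or recovers the single-byte key can read everything.

// Illustrative only: XOR with a fixed key is not encryption in any meaningful
// sense -- applying the same key again recovers the plaintext. Names and key
// are hypothetical.
public class XorIsNotEncryption {
    static byte[] xor(byte[] data, byte key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] record = "patient-id:12345".getBytes();
        byte key = 0x5A;                        // hypothetical "secret"
        byte[] onTheWire = xor(record, key);    // what the system actually sent
        byte[] recovered = xor(onTheWire, key); // anyone with the key gets this back
        System.out.println(new String(recovered)); // prints the original record
    }
}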
21.7 Summary
If a system is important enough for you to explicitly design its architecture,
then that architecture should be evaluated.
The number of evaluations and the extent of each evaluation may vary
from project to project. A designer should perform an evaluation during the
process of making an important decision.
The ATAM is a comprehensive method for evaluating software
architectures. It works by having project decision makers and stakeholders
articulate a precise list of quality attribute requirements (in the form of
scenarios) and by illuminating the architectural decisions relevant to
analyzing each high-priority scenario. The decisions can then be understood
in terms of risks or non-risks to find any trouble spots in the architecture.
Lightweight evaluations can be performed regularly as part of a project’s
internal peer review activities. Lightweight Architecture Evaluation, based
on the ATAM, provides an inexpensive, low-ceremony architecture
evaluation that can be carried out in less than a day.
22.2 Notations
Notations for documenting views differ considerably in their degree of
formality. Roughly speaking, there are three main categories of notation: informal, semiformal, and formal.
22.3 Views
Perhaps the most important concept associated with software architecture
documentation is that of the view. A software architecture is a complex
entity that cannot be described in a simple one-dimensional fashion. A view
is a representation of a set of system elements and relations among them—
not all system elements, but those of a particular type. For example, a
layered view of a system would show elements of type “layer”; that is, it
would show the system’s decomposition into layers, along with the relations
among those layers. A pure layered view would not, however, show the
system’s services, or clients and servers, or data model, or any other type of
element.
Thus views let us divide the multidimensional entity that is a software
architecture into a number of (we hope) interesting and manageable
representations of the system. The concept of views leads to a basic
principle of architecture documentation:
Documenting an architecture is a matter of documenting the relevant
views and then adding documentation that applies to more than one view.
What are the relevant views? This depends entirely on your goals. As we
saw previously, architecture documentation can serve many purposes: a
mission statement for implementers, a basis for analysis, the specification
for automatic code generation, the starting point for system understanding
and reverse engineering, or the blueprint for project estimation and
planning.
Different views also expose different quality attributes to different
degrees. In turn, the quality attributes that are of most concern to you and
the other stakeholders in the system’s development will affect which views
you choose to document. For instance, a module view will let you reason
about your system’s maintainability, a deployment view will let you reason
about your system’s performance and reliability, and so forth.
Because different views support different goals and uses, we do not
advocate using any particular view or collection of views. The views you
should document depend on the uses you expect to make of the
documentation. Different views will highlight different system elements
and relations. How many different views to represent is the result of a
cost/benefit decision. Each view has a cost and a benefit, and you should
ensure that the expected benefits of creating and maintaining a particular
view outweigh its costs.
The choice of views is driven by the need to document a particular
pattern in your design. Some patterns are composed of modules, others
consist of components and connectors, and still others have deployment
considerations. Module views, component-and-connector (C&C) views,
and allocation views are the appropriate mechanisms for representing these
considerations, respectively. These categories of views correspond, of
course, to the three categories of architectural structures described in
Chapter 1. (Recall from Chapter 1 that a structure is a collection of
elements, relations, and properties, whereas a view is a representation of
one or more architectural structures.)
In this section, we explore these three categories of structure-based views
and then introduce a new category: quality views.
Module Views
A module is an implementation unit that provides a coherent set of
responsibilities. A module might take the form of a class, a collection of
classes, a layer, an aspect, or any decomposition of the implementation unit.
Example module views are decomposition, uses, and layers. Every module
view has a collection of properties assigned to it. These properties express
important information associated with each module and the relationships
among the modules, as well as constraints on the module. Example
properties include responsibilities, visibility information (what other
modules can use it), and revision history. The relations that modules have to
one another include is-part-of, depends-on, and is-a.
The way in which a system’s software is decomposed into manageable
units remains one of the important forms of system structure. At a
minimum, it determines how a system’s source code is decomposed into
units, what kinds of assumptions each unit can make about services
provided by other units, and how those units are aggregated into larger
ensembles. It also includes shared data structures that impact, and are
impacted by, multiple units. Module structures often determine how
changes to one part of a system might affect other parts and hence the
ability of a system to support modifiability, portability, and reuse.
The documentation of any software architecture is unlikely to be
complete without at least one module view. Table 22.1 summarizes the
characteristics of module views.
Table 22.1 Summary of Module Views
Elements: Modules, which are implementation units of software that provide a coherent set of responsibilities.
Relations: Is-part-of, which defines a part/whole relationship between the submodule (the part) and the aggregate module (the whole).
Name. A module’s name is, of course, the primary means to refer to it.
A module’s name often suggests something about its role in the system.
In addition, a module’s name may reflect its position in a
decomposition hierarchy; the name A.B.C, for example, refers to a
module C that is a submodule of a module B, which is itself a
submodule of A.
Responsibilities. The responsibility property for a module is a way to
identify its role in the overall system and establishes an identity for it
beyond the name. Whereas a module’s name may suggest its role, a
statement of responsibility establishes that role with much more
certainty. Responsibilities should be described in sufficient detail to
make clear to the reader what each module does. A module’s
responsibilities are often captured by tracing to a project’s requirements
specification, if there is one.
Implementation information. Modules are units of implementation. It is
therefore useful to record information related to their implementation
from the point of view of managing their development and building the
system that contains them. This might include:
Mapping to source code units. This identifies the files that
constitute the implementation of a module. For example, a module
Account, if implemented in Java, might have several files that
constitute its implementation: IAccount.java (an interface),
AccountImpl.java (implementation of Account functionality), and
perhaps even a unit test AccountTest.java (a brief sketch appears after this list).
Test information. The module’s test plan, test cases, test harness,
and test data are important to document. This information may
simply be a pointer to the location of these artifacts.
Management information. A manager may need information about
the module’s predicted schedule and budget. This information may
simply be a pointer to the location of these artifacts.
Implementation constraints. In many cases, the architect will have
an implementation strategy in mind for a module or may know of
constraints that the implementation must follow.
Revision history. Knowing the history of a module, including its
authors and particular changes, may help you when you’re
performing maintenance activities.
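As an illustration only, the mapping for the hypothetical Account module might correspond to files like the following; the method signatures are invented for the sketch.

// IAccount.java -- the module's published interface
public interface IAccount {
    double balance();
    void deposit(double amount);
}

// AccountImpl.java -- implementation of the Account functionality
public class AccountImpl implements IAccount {
    private double balance;

    public double balance() { return balance; }

    public void deposit(double amount) {
        if (amount < 0) throw new IllegalArgumentException("negative deposit");
        balance += amount;
    }
}

// AccountTest.java -- a unit test (e.g., with JUnit) would exercise AccountImpl

Recording this mapping lets a developer or build engineer go from the module named in the architecture to the concrete files that implement it.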
A module view can be used to explain the system’s functionality to
someone not familiar with it. The various levels of granularity of the
module decomposition provide a top-down presentation of the system’s
responsibilities and, therefore, can guide the learning process. For a system
whose implementation is already in place, module views, if kept up-to-date,
are helpful because they explain the structure of the code base to a new
developer on the team.
Conversely, it is difficult to use the module views to make inferences
about runtime behavior, because these views are just a static partition of the
functions of the software. Thus a module view is not typically used for
analysis of performance, reliability, and many other runtime qualities. For
those purposes, we rely on component-and-connector and allocation views.
Component-and-Connector Views
C&C views show elements that have some runtime presence, such as
processes, services, objects, clients, servers, and data stores. These elements
are termed components. Additionally, C&C views include as elements the
pathways of interaction, such as communication links and protocols,
information flows, and access to shared storage. Such interactions are
represented as connectors in C&C views. Example C&C views include
client-server, microservice, and communicating processes.
A component in a C&C view may represent a complex subsystem, which
itself can be described as a C&C subarchitecture. A component’s
subarchitecture may employ a different pattern than the one in which the
component appears.
Simple examples of connectors include service invocation, asynchronous
message queues, event multicast supporting publish-subscribe interactions,
and pipes that represent asynchronous, order-preserving data streams.
Connectors often represent much more complex forms of interaction, such
as a transaction-oriented communication channel between a database server
and a client, or an enterprise service bus that mediates interactions between
collections of service users and providers.
Connectors need not be binary; that is, they need not have exactly two
components with which they interact. For example, a publish-subscribe
connector might have an arbitrary number of publishers and subscribers.
Even if the connector is ultimately implemented using binary connectors,
such as a procedure call, it can be useful to adopt n-ary connector
representations in a C&C view. Connectors embody a protocol of
interaction. When two or more components interact, they must obey
conventions about order of interactions, locus of control, and handling of
error conditions and timeouts. The protocol of interaction should be
documented.
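As a rough illustration of an n-ary connector, the following sketch (all names are hypothetical) shows a publish-subscribe connector to which any number of subscribers can attach. A real connector's documented protocol would also cover ordering, locus of control, and error handling, which this sketch omits.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only: an n-ary publish-subscribe connector -- any number of
// subscribers may attach, so the connector is not a binary relationship.
public class PubSubConnector {
    public interface Subscriber {
        void onEvent(String topic, String payload);
    }

    private final List<Subscriber> subscribers = new CopyOnWriteArrayList<>();

    public void attach(Subscriber s) {        // the attachment relation: component to connector
        subscribers.add(s);
    }

    public void publish(String topic, String payload) {
        for (Subscriber s : subscribers) {    // delivery order is part of the documented protocol
            s.onEvent(topic, payload);
        }
    }
}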
The primary relation within a C&C view is attachment. Attachments
indicate which connectors are attached to which components, thereby
defining a system as a graph of components and connectors. Compatibility
often is defined in terms of information type and protocol. For example, if a
web server expects encrypted communication via HTTPS, then the client
must perform the encryption.
An element (component or connector) of a C&C view will have various
properties associated with it. Specifically, every element should have a
name and type, with its additional properties depending on the type of
component or connector. As an architect, you should define values for the
properties that support the intended analyses for the particular C&C view.
The following are examples of some typical properties and their uses:
Relations: Attachments. Components are associated with connectors to yield a graph.
Constraints: Components can only be attached to connectors, and connectors can only be attached to components.
Allocation Views
Allocation views describe the mapping of software units to elements of an
environment in which the software is developed or in which it executes.
The environment in such a view varies; it might be the hardware, the
operating environment in which the software is executed, the file systems
supporting development or deployment, or the development organization(s).
Table 22.3 summarizes the characteristics of allocation views. These
views consist of software elements and environmental elements. Examples
of environmental elements are a processor, a disk farm, a file or folder, or a
group of developers. The software elements come from a module or C&C
view.
Table 22.3 Summary of Allocation Views
Elements: Software element and environmental element. A software element has properties that are required of the environment. An environmental element has properties that are provided to the software.
Relations: Allocated-to. A software element is mapped (allocated to) an environmental element.
Constraints: Varies by view.
Usage: For reasoning about performance, availability, security, and safety. For reasoning about distributed development and allocation of work to teams. For reasoning about concurrent access to software versions. For reasoning about the form and mechanisms of system installation.
Quality Views
Module, C&C, and allocation views are all structural views: They primarily
show the structures that the architect has designed into the architecture to
satisfy functional and quality attribute requirements.
These views are excellent choices for guiding and constraining
downstream developers, whose primary job is to implement those
structures. However, in systems in which certain quality attributes (or, for
that matter, any stakeholder concerns) are particularly important and
pervasive, structural views may not be the best way to present the
architectural solution to those needs. The reason is that the solution may be
spread across multiple structures that are cumbersome to combine (e.g.,
because the element types shown in each structure are different).
Another kind of view, which we call a quality view, can be tailored for
specific stakeholders or to address specific concerns. Quality views are
formed by extracting the relevant pieces of structural views and packaging
them together. Here are five examples:
C&C views with each other. Because all C&C views show runtime
relations among components and connectors of various types, they tend
to combine well. Different (separate) C&C views tend to show different
parts of the system, or tend to show decomposition refinements of
components in other views. The result is often a set of views that can be
combined easily.
Deployment view with any C&C view that shows processes. Processes
are the components that are deployed onto processors, virtual machines,
or containers. Thus there is a strong association between the elements in
these views.
Decomposition view and any work assignment, implementation, uses, or
layered views. The decomposed modules form the units of work,
development, and uses. In addition, these modules populate layers.
Use cases describe how actors can use a system to accomplish their
goals; they are frequently used to capture the functional requirements
for a system. UML provides a graphical notation for use case diagrams
but does not specify how the text of a use case should be written. The
UML use case diagram is a good way to provide an overview of the
actors and the behavior of a system. Its description, which is textual,
should include the following items: the use case name and a brief
description, the actor or actors who initiate the use case (primary
actors), other actors who participate in the use case (secondary actors),
the flow of events, alternative flows, and non-success cases.
A UML sequence diagram shows a sequence of interactions among
instances of elements pulled from the structural documentation. It is
useful, when designing a system, for identifying where interfaces need
to be defined. The sequence diagram shows only the instances
participating in the scenario being documented. It has two dimensions:
vertical, representing time, and horizontal, representing the various
instances. The interactions are arranged in time sequence from top to
bottom. Figure 22.2 is an example of a sequence diagram that illustrates
the basic UML notation. Sequence diagrams are not explicit about
showing concurrency. If that is your goal, use activity diagrams instead.
Figure 22.2 A simple example of a UML sequence diagram
When you study a diagram that represents an architecture, you see the
end product of a thought process but can’t always easily understand the
decisions that were made to achieve this result. Recording design decisions
beyond the representation of the chosen elements, relationships, and
properties is fundamental to help in understanding how you arrived at the
result; in other words, it lays out the design rationale.
When your iteration goal involves satisfying an important quality
attribute scenario, some of the decisions that you make will play a
significant role in achieving the scenario response measure. Consequently,
you should take the greatest care in recording these decisions: They are
essential to facilitate analysis of the design you created, to facilitate
implementation, and, still later, to aid in understanding the architecture
(e.g., during maintenance). Given that most design decisions are “good
enough,” and seldom optimal, you also need to justify the decisions made,
and to record the risks associated with your decisions so that they may be
reviewed and possibly revisited.
You may perceive recording design decisions as a tedious task. However,
depending on the criticality of the system being developed, you can adjust
the amount of information that is recorded. For example, to record a
minimum of information, you can use a simple table such as Table 22.4. If
you decide to record more than this minimum, the following information
might prove useful:
What evidence was produced to justify decisions?
Who did what?
Why were shortcuts taken?
Why were tradeoffs made?
What assumptions did you make?
In the same way that we suggest that you record responsibilities as you
identify elements, you should record the design decisions as you make
them. If you leave it until later, you will not remember why you did things.
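As a minimal sketch, assuming hypothetical fields rather than the exact columns of Table 22.4, a design decision might be captured as a small structured record like this:

// Illustrative only: a minimal design-decision record. The fields are
// hypothetical, not necessarily those of Table 22.4.
public record DesignDecision(
        String id,          // e.g., "DD-017" (hypothetical identifier scheme)
        String decision,    // what was decided
        String driver,      // the quality attribute scenario or requirement it addresses
        String rationale,   // why this alternative was judged "good enough"
        String risks) {     // known risks, so they can be reviewed and possibly revisited
}

Keeping even this much, at the moment the decision is made, preserves the rationale that is otherwise lost.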
Modeling Tools
Many commercially available modeling tools support the
specification of architectural constructs in a defined notation; SysML is a
widely used choice. Many of these tools offer features aimed at practical
large-scale use in industrial settings: interfaces that support multiple users,
version control, syntactic and semantic consistency checking of the models,
support for trace links between models and requirements or models and
tests, and, in some cases, automatic generation of executable source code
that implements the models. In many projects, these are must-have
capabilities, so the purchase price of the tool—which is not insignificant in
some cases—should be evaluated against what it would cost the project to
achieve these capabilities on its own.
Document what is true about all versions of your system. Your web
browser doesn’t go out and grab just any piece of software when it
needs a new plug-in; a plug-in must have specific properties and a
specific interface. And that new piece of software doesn’t just plug in
anywhere, but rather in a predetermined location in the architecture.
Record those invariants. This process may make your documented
architecture more a description of constraints or guidelines that any
compliant version of the system must follow. That’s fine.
Document the ways the architecture is allowed to change. In the
examples mentioned earlier, this will usually mean adding new
components and replacing components with new implementations. The
place to do this is the variability guide discussed in Section 22.6.
Generate interface documentation automatically. If you use explicit
interface mechanisms such as protocol buffers (described in Chapter
15), then there are always up-to-date definitions of component
interfaces; otherwise, the system would not work. Incorporate those
interface definitions into a database so that revision histories are
available and the interfaces can be searched to determine what
information is used in which components.
Traceability
Architecture, of course, does not live in a bubble, but in a milieu of
information about the system under development that includes
requirements, code, tests, budgets and schedules, and more. The purveyors
of each of these areas must ask themselves, “Is my part right? How do I
know?” This question takes on different specific forms in different areas;
for example, the tester asks, “Am I testing the right things?” As we saw in
Chapter 19, architecture is a response to requirements and business goals,
and its version of the “Is my part right?” question is to ensure that those
have been satisfied. Traceability means linking specific design decisions to
the specific requirements or business goals that led to them, and those links
should be captured in the documentation. If, at the end of the day, all ASRs
are accounted for (“covered”) in the architecture’s trace links, then we have
assurance that the architecture part is right. Trace links may be represented
informally—a table, for instance—or may be supported technologically in
the project’s tool environment. In either case, trace links should be part of
the architecture documentation.
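To make the idea of trace links concrete, here is a minimal sketch (hypothetical data and names) of a trace-link table kept as plain data, together with a coverage check that flags ASRs not yet linked to any design decision:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: trace links kept as data, plus a coverage check.
public class TraceLinks {
    record Link(String asrId, String decisionId) {}

    // Returns the ASRs that no design decision is linked to ("uncovered" ASRs).
    static Set<String> uncoveredAsrs(Set<String> allAsrIds, List<Link> links) {
        Set<String> covered = links.stream().map(Link::asrId).collect(Collectors.toSet());
        return allAsrIds.stream().filter(a -> !covered.contains(a)).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> asrs = Set.of("ASR-1", "ASR-2", "ASR-3");
        List<Link> links = List.of(new Link("ASR-1", "DD-017"), new Link("ASR-3", "DD-004"));
        System.out.println("Uncovered ASRs: " + uncoveredAsrs(asrs, links)); // [ASR-2]
    }
}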
22.10 Summary
Writing architectural documentation is much like other types of writing.
The golden rule is: Know your reader. You must understand the uses to
which the writing will be put and the audience for the writing. Architectural
documentation serves as a means for communication among various
stakeholders: up the management chain, down into the developers, and
across to peers.
An architecture is a complicated artifact, best expressed by focusing on
particular perspectives, called views, which depend on the message to be
communicated. You must choose the views to document and choose the
notation to document these views. This may involve combining various
views that have a large overlap. You must not only document the structure
of the architecture but also the behavior.
In addition, you should document the relations among the views in your
documentation, the patterns you use, the system’s context, any variability
mechanisms built into the architecture, and the rationale for your major
design decisions.
There are other practical considerations for creating, maintaining, and
distributing the documentation, such as choosing a release strategy,
choosing a dissemination tool such as a wiki, and creating documentation
for architectures that change dynamically.
Some debts are fun when you are acquiring them, but none are fun when
you set about retiring them.
—Ogden Nash
Without careful attention and the input of effort, designs become harder to
maintain and evolve over time. We call this form of entropy “architecture
debt,” and it is an important and highly costly form of technical debt. The
broad field of technical debt has been intensively studied for more than a
decade—primarily focusing on code debt. Architecture debt is typically
more difficult to detect and more difficult to eradicate than code debt
because it involves nonlocal concerns. The tools and methods that work
well for discovering code debt—code inspections, code quality checkers,
and so forth—typically do not work well for detecting architecture debt.
Of course, not all debt is burdensome and not all debt is bad debt.
Sometimes a principle is violated when there is a worthy tradeoff—for
example, sacrificing low coupling or high cohesion to improve runtime
performance or time to market.
This chapter introduces a process to analyze existing systems for
architecture debt. This process gives the architect both the knowledge and
the tools to identify and manage such debt. It works by identifying
architecturally connected elements—with problematic design relations—
and analyzing a model of their maintenance costs. If that model indicates
the existence of a problem, typically signaled by an unusually high amount
of changes and bugs, this signifies an area of architecture debt.
Once architecture debt has been identified, if it is bad enough, it should
be removed through refactoring. Without quantitative evidence of payoff,
typically it is difficult to get project stakeholders to agree to this step. The
business case (without architecture debt analysis) goes like this: “I will take
three months to refactor this system and give you no new functionality.”
What manager would agree to that? However, armed with the kinds of
analyses we present here, you can make a very different pitch to your
manager, one couched in terms of ROI and increased productivity that pays
the refactoring effort back, and more, in a short time.
The process that we advocate requires three types of information: the source code itself (including the dependencies among its files), the project's revision control history, and its issue-tracking data.
The model for analyzing debt identifies areas of the architecture that are
experiencing unusually high rates of bugs and churn (committed lines of
code) and attempts to associate these symptoms with design flaws.
The matrix shown in Figure 23.1 is quite sparse. It means that these files
are not heavily structurally coupled to each other and, as a consequence,
you might expect that it would be relatively easy to change these files
independently. In other words, this system seems to have relatively little
architecture debt.
Now consider Figure 23.2, which overlays historical co-change
information on Figure 23.1. Historical co-change information is extracted
from the version control system. This indicates how often two files change
together in commits.
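To make the co-change idea concrete, the following sketch (hypothetical commit data) counts how often each pair of files appears together in a commit; unusually high counts between files that are not structurally coupled are a symptom of architecture debt.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: counting how often two files change together in commits.
// Input is a list of commits, each a list of changed file names (hypothetical data).
public class CoChange {
    static Map<String, Integer> coChangeCounts(List<List<String>> commits) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> files : commits) {
            for (int i = 0; i < files.size(); i++) {
                for (int j = i + 1; j < files.size(); j++) {
                    String a = files.get(i), b = files.get(j);
                    // canonical pair ordering so A,B and B,A count as the same pair
                    String key = a.compareTo(b) <= 0 ? a + " <-> " + b : b + " <-> " + a;
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> commits = List.of(
                List.of("A.java", "B.java"),
                List.of("A.java", "B.java", "C.java"),
                List.of("C.java"));
        System.out.println(coChangeCounts(commits)); // A.java <-> B.java co-changes twice
    }
}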
Not every file in a hotspot will be tightly coupled to every other file.
Instead, a collection of files may be tightly coupled to each other and
decoupled from other files. Each such collection is a potential hotspot and
is a potential candidate for debt removal, through refactoring.
Figure 23.3 is a DSM based on files in Apache Cassandra—a widely
used NoSQL database. It shows an example of a clique (a cycle of
dependencies). In this DSM, you can see that the file on row 8
(locator.AbstractReplicationStrategy) depends on file 4
(service.WriteResponseHandler) and aggregates file 5
(locator.TokenMetadata). Files 4 and 5, in turn, depend on file 8, thus
forming a clique.
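Because a clique of this kind is simply a cycle in the dependency graph, it can be detected mechanically. The following sketch (hypothetical data, echoing the Cassandra example) finds such a cycle with a depth-first search:

import java.util.*;

// Illustrative only: detecting a dependency cycle (a "clique" in this chapter's
// terms) in a file-dependency graph using depth-first search. Data is hypothetical.
public class CycleDetector {
    static boolean hasCycle(Map<String, List<String>> deps) {
        Set<String> visiting = new HashSet<>(), done = new HashSet<>();
        for (String node : deps.keySet()) {
            if (dfs(node, deps, visiting, done)) return true;
        }
        return false;
    }

    private static boolean dfs(String node, Map<String, List<String>> deps,
                               Set<String> visiting, Set<String> done) {
        if (done.contains(node)) return false;
        if (!visiting.add(node)) return true;          // back edge: cycle found
        for (String dep : deps.getOrDefault(node, List.of())) {
            if (dfs(dep, deps, visiting, done)) return true;
        }
        visiting.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
                "AbstractReplicationStrategy", List.of("WriteResponseHandler", "TokenMetadata"),
                "WriteResponseHandler", List.of("AbstractReplicationStrategy"),
                "TokenMetadata", List.of("AbstractReplicationStrategy"));
        System.out.println(hasCycle(deps)); // true: the three files form a cycle
    }
}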
23.3 Example
We illustrate this process with a case study, which we call SS1, done with
SoftServe, a multinational software outsourcing company. At the time of
the analysis, the SS1 system contained 797 source files, and we captured its
revision history and issues over a two-year period. SS1 was maintained by
six full-time developers and many more occasional contributors.
Identifying Hotspots
During the period that we studied SS1, 2,756 issues were recorded in its Jira
issue-tracker (1,079 of which were bugs) and 3,262 commits were recorded
in the Git version control repository.
We identified hotspots using the process just described. In the end, three
clusters of architecturally related files were identified as containing the
most harmful anti-patterns and hence the most debt in the project. The debt
from these three clusters represented a total of 291 files, out of 797 files in
the entire project, or a bit more than one-third of the project’s files. The
number of defects associated with these three clusters covered 89 percent of
the project’s total defects (265).
The chief architect of the project agreed that these clusters were
problematic but had difficulty explaining why. When presented with this
analysis, he acknowledged that these were true design problems, violating
multiple design rules. The architect then crafted a number of refactorings,
focusing on remedying the flawed relations among the files identified in the
hotspots. These refactorings were based on removing the anti-patterns in
the hotspots, so the architect had a great deal of guidance in how to do this.
But does it pay to do these kinds of refactorings? After all, not all debts
are worth paying off. This is the topic of the next section.
23.4 Automation
This form of architectural analysis can be fully automated. Each of the anti-
patterns introduced in Section 23.2 can be identified in an automated
fashion and the tooling can be built into a continuous integration tool suite
so that architecture debt is continuously monitored. This analysis process
requires the following tools: a reverse-engineering tool to extract dependencies from the source code, access to the project's issue tracker and revision control system, and tools to build and analyze the resulting DSM.
The only specialized tools needed for this process are the ones to build
the DSM and analyze the DSM. Projects likely already have issue tracking
systems and revision histories, and plenty of reverse-engineering tools are
available, including open source options.
23.5 Summary
This chapter has presented a process for identifying and quantifying
architecture debt in a project. Architecture debt is an important and highly
costly form of technical debt. Compared to code-based technical debt,
architecture debt is often harder to identify because its root causes are
distributed among several files and their interrelationships.
The process outlined in this chapter involves gathering information from
the project’s issue tracker, its revision control system, and the source code
itself. Using this information, architecture anti-patterns can be identified
and grouped into hotspots, and the impact of these hotspots can be
quantified.
This architecture debt monitoring process can be automated and built
into a system’s continuous integration tool suite. Once architecture debt has
been identified, if it is bad enough, it should be removed through
refactoring. The output of this process provides the quantitative data
necessary to make the business case for refactoring to project management.
After that, use the needs of the architecture’s stakeholders as a guide when
crafting the contents of subsequent releases.
Work with the project’s stakeholders to determine the release tempo and
the contents of each project increment.
Your first architectural increment should include module decomposition
and uses views, as well as a preliminary C&C view.
Use your influence to ensure that early releases deal with the system’s
most challenging quality attribute requirements, thereby ensuring that
no unpleasant architectural surprises appear late in the development
cycle.
Stage your architecture releases to support those project increments and
to support the needs of the development stakeholders as they work on
each increment.
24.5 Summary
Software architects do their work in the context of a development project of
some sort. As such, they need to understand their role and responsibilities
from that perspective.
The project manager and the software architect may be seen as
occupying complementary roles: The manager runs the project from an
administrative perspective, and the architect runs the project from a
technical solution perspective. These two roles intersect in various ways,
and the architect can support the manager to enhance the project’s chance
of success.
In a project, architectures do not spring fully formed from Zeus’s
forehead, but rather are released in increments that are useful to
stakeholders. Thus the architect needs to have a good understanding of the
architecture’s stakeholders and their information needs.
Agile methodologies focus on incremental development. Over time,
architecture and Agile (although they got off to a rough start together) have
become indispensable partners.
Global development creates a need for an explicit coordination strategy
that is based on more formal strategies than are needed for co-located
development.
If software architecture is worth doing, then surely it’s worth doing well.
Most of the literature about architecture concentrates on the technical
aspects. This is not surprising; it is a deeply technical discipline. But
architectures are created by architects working in organizations that are full
of actual human beings. Dealing with these humans is a decidedly
nontechnical undertaking. What can be done to help architects, especially
architects-in-training, be better at this important dimension of their job?
And what can be done to help organizations do a better job of encouraging
their architects to produce their best work?
This chapter is about the competence of individual architects and the
organizations that wish to produce high-quality architectures.
Since the architecture competence of an organization depends, in part, on
the competence of architects, we begin by asking what it is that architects
are expected to do, know, and be skilled at. Then we’ll look at what
organizations can and should do to help their architects produce better
architectures. Individual and organizational competencies are intertwined.
Understanding only one or the other won’t do.
These examples purposely illustrate that skills and knowledge are important
(only) for supporting the ability to carry out duties effectively. As another
example, “documenting the architecture” is a duty, “ability to write clearly”
is a skill, and “ISO Standard 42010” is part of the related body of
knowledge. Of course, a skill or knowledge area can support more than one
duty.
Knowing the duties, skills, and knowledge of architects (or, more
precisely, the duties, skills, and knowledge that are needed of architects in a
particular organizational setting) can help establish measurement and
improvement strategies for individual architects. If you want to improve
your individual architectural competence, you should take the following
steps:
1. Gain experience carrying out the duties. Apprenticeship is a
productive path to achieving experience. Education alone is not
enough, because education without on-the-job application merely
enhances knowledge.
2. Improve your nontechnical skills. This dimension of improvement
involves taking professional development courses, for example, in
leadership or time management. Some people will never become truly
great leaders or communicators, but we can all improve on these
skills.
3. Master the body of knowledge. One of the most important things a
competent architect must do is master the body of knowledge and
remain up-to-date on it. To emphasize the importance of keeping
current with the field, consider the advances in knowledge required
for architects that have emerged in just the last few years. For
example, architectures to support computing in the cloud (Chapter 17)
were not important several years ago. Taking courses, becoming
certified, reading books and journals, visiting websites, reading blogs,
attending architecture-oriented conferences, joining professional
societies, and meeting with other architects are all useful ways to
improve knowledge.
Duties
This section summarizes a wide variety of architects’ duties. Not every
architect in every organization will perform every one of these duties on
every project. However, competent architects should not be surprised to
find themselves engaged in any of the activities listed here. We divide these
duties into technical duties (Table 25.1) and nontechnical duties (Table
25.2). One immediate observation you should make is the large number of
nontechnical duties. An obvious implication, for those of you who
wish to be architects, is that you must pay adequate attention to the
nontechnical aspects of your education and your professional activities.
Table 25.1 Technical Duties of a Software Architect
General duty area: Architecting
Creating an architecture: Design or select an architecture. Create a software architecture design plan. Build a product line or product architecture. Make design decisions. Expand details and refine the design to converge on a final design. Identify the patterns and tactics, and articulate the principles and key mechanisms of the architecture. Partition the system. Define how the components fit together and interact. Create prototypes.
Evaluating and analyzing an architecture: Evaluate an architecture (for your current system or for other systems) to determine the satisfaction of use cases and quality attribute scenarios. Create prototypes. Participate in design reviews. Review the designs of the components designed by junior engineers. Review designs for compliance with the architecture. Compare software architecture evaluation techniques. Model alternatives. Perform tradeoff analysis.
Documenting an architecture: Prepare architectural documents and presentations useful to stakeholders. Document or automate the documentation of software interfaces. Produce documentation standards or guidelines. Document variability and dynamic behavior.
Working with and transforming existing system(s): Maintain and evolve an existing system and its architecture. Measure architecture debt. Migrate existing system to new technology and platforms. Refactor existing architectures to mitigate risks. Examine bugs, incident reports, and other issues to determine revisions to existing architecture.
Performing other architecting duties: Sell the vision. Keep the vision alive. Participate in product design meetings. Give technical advice on architecture, design, and development. Provide architectural guidelines for software design activities. Lead architecture improvement activities. Participate in software process definition and improvement. Provide architecture oversight of software development activities.
General duty area: Duties concerned with life-cycle activities other than architecting
Managing the requirements: Analyze functional and quality attribute software requirements. Understand business, organizational, and customer needs, and ensure that the requirements meet these needs. Listen to and understand the scope of the project. Understand the client's key design needs and expectations. Advise on the tradeoffs between software design choices and requirements choices.
Evaluating future technologies: Analyze the current IT environment and recommend solutions for deficiencies. Work with vendors to represent the organization's requirements and influence future products. Develop and present technical white papers.
Selecting tools and technology: Manage the introduction of new software solutions. Perform technical feasibility studies of new technologies and architectures. Evaluate commercial tools and software components from an architectural perspective. Develop internal technical standards and contribute to the development of external technical standards.
Skills
Given the wide range of duties enumerated in the previous section, which
skills does an architect need to possess? Much has been written about the
architect’s special role of leadership in a project; the ideal architect is an
effective communicator, manager, team builder, visionary, and mentor.
Some certificate or certification programs emphasize nontechnical skills.
Common to these certification programs are assessment areas of leadership,
organization dynamics, and communication.
Table 25.3 enumerates the set of skills most useful to an architect.
Table 25.3 Skills of a Software Architect
General skill area: Communication skills
Outward communication (beyond the team): Ability to make oral and written communications and presentations. Ability to present and explain technical information to diverse audiences. Ability to transfer knowledge. Ability to persuade. Ability to see from and sell to multiple viewpoints.
Inward communication (within the team): Ability to listen, interview, consult, and negotiate. Ability to understand and express complex topics.
General skill area: Interpersonal skills
Team relationships: Ability to be a team player. Ability to work effectively with superiors, subordinates, colleagues, and customers. Ability to maintain constructive working relationships. Ability to work in a diverse team environment. Ability to inspire creative collaboration. Ability to build consensus. Ability to be diplomatic and respect others. Ability to mentor others. Ability to handle and resolve conflict.
General skill area: Work skills
Leadership: Ability to make decisions. Ability to take initiative and be innovative. Ability to demonstrate independent judgment, be influential, and command respect.
Workload management: Ability to work well under pressure, plan, manage time, and estimate. Ability to support a wide range of issues and work on multiple complex tasks concurrently. Ability to effectively prioritize and execute tasks in a high-pressure environment.
Skills to excel in the corporate environment: Ability to think strategically. Ability to work under general supervision and under constraints. Ability to organize workflow. Ability to detect where the power is and how it flows in an organization. Ability to do what it takes to get the job done. Ability to be entrepreneurial, to be assertive without being aggressive, and to receive constructive criticism.
Skills for handling information: Ability to be detail-oriented while maintaining overall vision and focus. Ability to see the big picture.
Skills for handling the unexpected: Ability to tolerate ambiguity. Ability to take and manage risks. Ability to solve problems. Ability to be adaptable, flexible, open-minded, and resilient.
Ability to think abstractly: Ability to look at different things and find a way to see how they are, in fact, just different instances of the same thing. This may be one of the most important skills for an architect to have.
Knowledge
A competent architect has an intimate familiarity with an architectural body
of knowledge. Table 25.4 gives a set of knowledge areas for an architect.
Table 25.4 Knowledge Areas of a Software Architect
General knowledge area: Computer science knowledge
Knowledge of architecture concepts: Knowledge of architecture frameworks, architectural patterns, tactics, structures and views, reference architectures, relationships to system and enterprise architecture, emerging technologies, architecture evaluation models and methods, and quality attributes.
Knowledge of software engineering: Knowledge of software development knowledge areas, including requirements, design, construction, maintenance, configuration management, engineering management, and software engineering process. Knowledge of systems engineering.
Design knowledge: Knowledge of tools and design and analysis techniques. Knowledge of how to design complex multi-product systems. Knowledge of object-oriented analysis and design, and UML and SysML diagrams.
Programming knowledge: Knowledge of programming languages and programming language models. Knowledge of specialized programming techniques for security, real time, safety, etc.
General knowledge area: Knowledge of technologies and platforms
Specific technologies and platforms: Knowledge of hardware/software interfaces, web-based applications, and Internet technologies. Knowledge of specific software/operating systems.
General knowledge of technologies and platforms: Knowledge of the IT industry's future directions and the ways in which infrastructure impacts an application.
General knowledge area: Knowledge about the organization's context and management
Domain knowledge: Knowledge of the most relevant domains and domain-specific technologies.
Industry knowledge: Knowledge of the industry's best practices and industry standards. Knowledge of how to work in onshore/offshore team environments.
Business knowledge: Knowledge of the company's business practices, and its competition's products, strategies, and processes. Knowledge of business and technical strategy, and business reengineering principles and processes. Knowledge of strategic planning, financial models, and budgeting.
Leadership and management techniques: Knowledge of how to coach, mentor, and train software team members. Knowledge of project management. Knowledge of project engineering.
An architecture-centric organization can support its architects through personnel-related, process-related, and technology-related practices such as the following.
Personnel-related:
Hire talented architects.
Establish a career track for architects.
Make the position of architect highly regarded through visibility,
rewards, and prestige.
Have architects join professional organizations.
Establish an architect certification program.
Establish a mentoring program for architects.
Establish an architecture training and education program.
Measure architects’ performance.
Have architects receive external architect certifications.
Reward or penalize architects based on project success or failure.
Process-related:
Establish organization-wide architecture practices.
Establish a clear statement of responsibilities and authority for
architects.
Establish a forum for architects to communicate and share
information and experience.
Establish an architecture review board.
Include architecture milestones in project plans.
Have architects provide input into product definition.
Hold an organization-wide architecture conference.
Measure and track the quality of architectures produced.
Bring in outside expert consultants on architecture.
Have architects advise on the development team structure.
Give architects influence throughout the entire project life cycle.
Technology-related:
Establish and maintain a repository of reusable architectures and
architecture-based artifacts.
Create and maintain a repository of design concepts.
Provide a centralized resource to analyze and help with architecture
tools.
Be Mentored
While experience may be the best teacher, most of us will not have the
luxury, in a single lifetime, to gain firsthand all the experience needed to
make us great architects. But we can gain experience secondhand. Find a
skilled architect whom you respect, and attach yourself to that person. Find
out if your organization has a mentoring program that you can join. Or
establish an informal mentoring relationship—find excuses to interact, ask
questions, or offer to help (for instance, offer to be a reviewer).
Your mentor doesn’t have to be a colleague. You can also join
professional societies where you can establish mentor relationships with
other members. There are meetups. There are professional social networks.
Don’t limit yourself to just your organization.
Mentor Others
You should also be willing to mentor others as a way of giving back or
paying forward the kindnesses that have enriched your career. But there is a
selfish reason to mentor as well: We find that teaching a concept is the
litmus test of whether we deeply understand that concept. If we can’t teach
it, it’s likely we don’t really understand it—so that can be part of your goal
in teaching and mentoring others in the profession. Good teachers almost
always report their delight in how much they learn from their students, and
how much their students’ probing questions and surprising insights add to
the teachers’ deeper understanding of the subject.
25.4 Summary
When we think of software architects, we usually first think of the technical
work that they produce. But, in the same way that an architecture is much
more than a technical “blueprint” for a system, an architect is much more
than a designer of an architecture. This has led us to try to understand, in a
more holistic way, what an architect and an architecture-centric
organization must do to succeed. An architect must carry out the duties,
hone the skills, and continuously acquire the knowledge necessary to be
successful.
The key to becoming a good and then a better architect is continuous
learning, mentoring, and being mentored.
What will the future bring in terms of developments that affect the practice
of software architecture? Humans are notoriously bad at predicting the
long-term future, but we keep trying because, well, it’s fun. To close our
book, we have chosen to focus on one particular aspect that is firmly rooted
in the future but seems tantalizingly close to reality: quantum computing.
Quantum computers will likely become practical over the next five to ten
years. Consider that the system you are currently working on may have a
lifetime on the order of tens—plural—of years. Code written in the 1960s
and 1970s is still being used today on a daily basis. If the systems you are
working on have lifetimes on that order, you may need to convert them to
take advantage of quantum computer capabilities when quantum computers
become practical.
Quantum computers are generating high interest because of their
potential to perform calculations at speeds that far outpace the most capable
and powerful of their classical counterparts. In 2019, Google announced
that its quantum computer completed a complex computation in 200
seconds. That same calculation, claimed Google, would take even the most
powerful supercomputers approximately 10,000 years to finish. It isn’t that
quantum computers do what classical computers do, only extraordinarily
faster; rather, they do what classical computers can’t do using the
otherworldly properties of quantum physics.
Quantum computers won’t be better than classical computers at solving
every problem. For example, for many of the most common transaction-
oriented data-processing tasks, they are likely irrelevant. They will be good
at problems that involve combinatorics and are computationally difficult for
classic computers. However, it is unlikely that a quantum computer will
ever power your phone or watch or sit on your office desk.
Understanding the theoretical basis of a quantum computer involves deep
understanding of physics, including quantum physics, and that is far outside
our scope. For context, the same was also true of classical computers when
they were invented in the 1940s. Over time, the requirement for
understanding how CPUs and memory work has disappeared due to the
introduction of useful abstractions, such as high-level programming
languages. The same thing will happen with quantum computers. In this
chapter, we introduce the essential concepts of quantum computing without
reference to the underlying physics (which has been known to make heads
actually explode).
Operations on Qubits
Some single qubit operations are analogs of classical bit operations,
whereas others are specific to qubits. One characteristic of most quantum
operations is that they are invertible; that is, given the result of an operation,
it is possible to recover the input into that operation. Invertibility is another
distinction between classical bit operations and qubit operations. The one
exception to invertibility is the READ operation: Since measurement is
destructive, the result of a READ operation does not allow the recovery of
the original qubit. Examples of qubit operations include the following:
1. A READ operation takes as input a single qubit and produces as
output either a 0 or a 1 with probabilities determined by the
amplitudes of the input qubit. The value of the input qubit collapses to
either a 0 or a 1.
2. A NOT operation takes a qubit in superposition and flips the
amplitudes. That is, the probability of the resulting qubit being 0 is
the original probability of it being 1, and vice versa.
3. A Z operation adds π to the phase of the qubit (modulo 2π).
4. A HAD (short for Hadamard) operation creates an equal
superposition, which means the amplitudes of qubits with value 0 and
1, respectively, are equal. A 0 input value generates a phase of 0
radians, and a 1 input value generates a phase of π radians.
It is possible to chain multiple operations together to produce more
sophisticated units of functionality.
Some operations work on more than one qubit. The primary two-qubit
operation is CNOT, a controlled NOT. The first qubit is the control qubit: If it is
1, the operation performs a NOT on the second qubit; if the first qubit
is 0, the second qubit remains unchanged.
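To make these operations concrete, here is a minimal state-vector sketch in Python with NumPy. It is purely illustrative: the matrices for NOT (X), Z, HAD (H), and CNOT are the standard ones, but the read() helper and the surrounding scaffolding are our own choices, not drawn from the book or from any particular quantum SDK.

import numpy as np

ket0 = np.array([1, 0], dtype=complex)   # a qubit is a 2-element vector of amplitudes; this is |0>

X = np.array([[0, 1],
              [1, 0]], dtype=complex)    # NOT: swaps the amplitudes for 0 and 1
Z = np.array([[1, 0],
              [0, -1]], dtype=complex)   # Z: adds a phase of pi to the amplitude for 1
H = (1 / np.sqrt(2)) * np.array([[1, 1],
                                 [1, -1]], dtype=complex)   # HAD: equal superposition
CNOT = np.array([[1, 0, 0, 0],           # two-qubit controlled NOT; the first qubit is the control
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def read(state, rng=np.random.default_rng()):
    # READ: collapse to 0 or 1 with probabilities given by the squared amplitude magnitudes.
    return int(rng.choice(len(state), p=np.abs(state) ** 2))

plus = H @ ket0                          # equal superposition of 0 and 1
print(read(plus))                        # 0 or 1, each with probability 0.5
print(np.allclose(H @ plus, ket0))       # True: X, Z, and H are their own inverses; READ is not invertible

The last line illustrates the invertibility point made earlier: applying HAD a second time recovers the original state, whereas read() leaves only a classical bit behind.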
Entanglement
Entanglement is one of the key elements of quantum computing. It has no
analog in classical computing, and gives quantum computing some of its
very strange and wondrous properties, allowing it to do what classical
computers cannot.
Two qubits are said to be “entangled” if, when measured, the second
qubit's measurement matches the measurement of the first. Entanglement holds
no matter how much time passes between the two measurements or how far apart
the qubits are; the sketch below shows this matching on a simulated pair.
This property leads us to what is called quantum teleportation. Buckle up.
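Continuing in the same style (plain NumPy, helper names of our own choosing), the following self-contained sketch shows the matching-measurement behavior. Applying HAD to qubit A and then CNOT from A to B is the standard recipe for entangling a pair; every simulated joint measurement then yields either (0, 0) or (1, 1).

import numpy as np

H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

ket00 = np.array([1, 0, 0, 0], dtype=complex)   # both qubits start at |0>
H_on_A = np.kron(H, np.eye(2))                  # HAD on the first qubit only
bell = CNOT @ H_on_A @ ket00                    # amplitudes [1/sqrt(2), 0, 0, 1/sqrt(2)]

def read_pair(state, rng=np.random.default_rng()):
    # Measure both qubits at once; returns (bit_A, bit_B).
    outcome = int(rng.choice(4, p=np.abs(state) ** 2))
    return outcome >> 1, outcome & 1

# Every trial yields (0, 0) or (1, 1): the second measurement always matches the first.
print({read_pair(bell) for _ in range(20)})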
26.2 Quantum Teleportation
Recall that it is not possible to copy one qubit to another directly. Thus, if
we want to copy one qubit to another, we must use indirect means.
Furthermore, we must accept the destruction of the state of the original
qubit. The recipient qubit will have the same state as the original, destroyed
qubit. Quantum teleportation is the name given to this copying of the state.
There is no requirement that the original qubit and the recipient qubit have
any physical relationship, nor are there constraints on the distance that
separates them. In consequence, it is possible to transfer information over
great distances, even hundreds or thousands of kilometers, between qubits
that have been physically implemented.
The teleportation of the state of a qubit depends on entanglement. Recall
that entanglement means that a measurement of one entangled qubit will
guarantee that a measurement of the second qubit will have the same value.
Teleportation utilizes three qubits. Qubits A and B are entangled, and then
qubit ψ is entangled with qubit A. The state of qubit ψ is teleported to the
location of qubit B and becomes the state of qubit B. Roughly speaking,
teleportation proceeds through these four steps (a simulation sketch follows
the list):
1. Entangle qubits A and B. We discussed what this means in the prior
section. The locations of A and B can be physically separate.
2. Prepare the “payload.” The payload qubit will have the state to be
teleported. The payload, which is the qubit ψ, is prepared at the
location of A.
3. Propagate the payload. The propagation involves two classical bits
that are transferred to the location of B. The propagation also involves
measuring A and ψ, which destroys the state of both of these qubits.
4. Re-create the state of ψ in B.
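The following sketch simulates those four steps on a classical machine with NumPy, so that you can watch B end up holding the payload's state. It is an illustration only: the cnot() helper, the qubit ordering, and the example payload are our own choices, and the two if statements at the end are the re-creation step driven by the two classical bits.

import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)               # NOT
Z = np.array([[1, 0], [0, -1]], dtype=complex)              # phase flip
H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=complex)

def cnot(n, control, target):
    # CNOT on an n-qubit register; qubit 0 is the highest-order bit.
    dim = 2 ** n
    U = np.zeros((dim, dim), dtype=complex)
    for i in range(dim):
        bits = [(i >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        j = sum(b << (n - 1 - k) for k, b in enumerate(bits))
        U[j, i] = 1
    return U

rng = np.random.default_rng()
payload = np.array([0.6, 0.8], dtype=complex)               # the state of qubit psi to be teleported

# Step 1: entangle A and B (qubits 1 and 2). Step 2: prepend the payload psi (qubit 0).
bell_AB = cnot(2, 0, 1) @ np.kron(H, I2) @ np.array([1, 0, 0, 0], dtype=complex)
state = np.kron(payload, bell_AB)

# Step 3: propagate: CNOT(psi -> A), HAD on psi, then measure psi and A (destroying them).
state = np.kron(H, np.eye(4)) @ cnot(3, 0, 1) @ state
amps = state.reshape(2, 2, 2)                               # indices: (psi, A, B)
outcome_probs = (np.abs(amps) ** 2).sum(axis=2).ravel()     # probabilities of the four (psi, A) outcomes
m = int(rng.choice(4, p=outcome_probs))
m_psi, m_A = m >> 1, m & 1                                  # the two classical bits sent to B's location
qubit_B = amps[m_psi, m_A] / np.sqrt(outcome_probs[m])      # B's collapsed state

# Step 4: re-create psi at B, using only the two classical bits.
if m_A:
    qubit_B = X @ qubit_B
if m_psi:
    qubit_B = Z @ qubit_B
print(np.allclose(qubit_B, payload))                        # True: B now holds the payload's state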
We have omitted many key details, but the point is this: Quantum
teleportation is an essential ingredient of quantum communication. It relies
on transmitting two bits over conventional communication channels. It is
inherently secure, since all that an eavesdropper can determine are the two
bits sent over conventional channels. Because A and B are linked only
through entanglement, the teleported state is never physically sent over a
communication line. The U.S. National Institute of Standards and Technology (NIST) is
considering a variety of different quantum-based communication protocols
to be the basis of a transport protocol called HTTPQ, which is intended to
be a replacement for HTTPS. Given that it takes decades to replace one
communication protocol with another, the goal is for HTTPQ to be adopted
prior to the availability of quantum computers that can break HTTPS.
QRAM
Quantum random access memory (QRAM) is a critical element for
implementing and applying many quantum algorithms. QRAM, or
something similar, will be necessary to provide efficient access to large
amounts of data such as that used in machine learning applications.
Currently, no implementation of QRAM exists, but several research groups
are exploring how such an implementation could work.
Conventional RAM is a hardware device that takes as input a
memory location and returns as output the contents of that memory
location. QRAM is conceptually similar: It takes as input a memory
location (more likely, a superposition of memory locations) and returns as output
the contents of those memory locations in superposition. The memory
locations whose contents are returned were written conventionally—that is,
each bit has one value. The values are returned in superposition, and the
amplitudes are determined by the specification of the memory locations to
be returned. Because the original values were conventionally written, they
can be copied in a nondestructive fashion.
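Since no QRAM exists yet, the best we can offer is a toy illustration of the behavior just described: the memory cells hold ordinary, conventionally written values, the query is a superposition over addresses, and the result pairs each queried address with its contents while preserving the amplitudes. The data structures here are entirely our own choices.

import numpy as np

memory = [0b1011, 0b0010, 0b1110, 0b0001]        # conventionally written classical values

def qram_lookup(address_superposition):
    # address_superposition: dict mapping address -> complex amplitude.
    # The result keeps the same amplitudes, pairing each address with its contents.
    return {(addr, memory[addr]): amp
            for addr, amp in address_superposition.items()}

query = {0: 1 / np.sqrt(2), 2: 1 / np.sqrt(2)}   # an equal superposition of addresses 0 and 2
print(qram_lookup(query))                        # {(0, 11): 0.707..., (2, 14): 0.707...}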
A problem with the proposed QRAM is that the number of physical
resources required scales linearly with the number of bits retrieved. Thus it
may not be practical to construct QRAM for very large retrievals. As with
much of the discussion of quantum computers, QRAM is in the theoretical
discussion stage rather than the engineering phase. Stay tuned.
The remaining algorithms we discuss assume the existence of a
mechanism for efficiently accessing the data manipulated by an algorithm,
such as with QRAM.
Matrix Inversion
Matrix inversion underlies many problems in science. Machine learning, for
example, requires the ability to invert large matrices. Quantum computers
hold promise to speed up matrix inversion in this context. The HHL
algorithm, by Harrow, Hassidim, and Lloyd, solves a system of linear equations,
subject to some constraints. The general problem is to solve the equation Ax
= b, where A is an N × N matrix, x is a set of N unknowns, and b is a set of
N known values. You learned about the simplest case (N = 2) in elementary
algebra. As N grows, however, matrix inversion becomes the standard
technique to solve the set of equations.
The following constraints apply when solving this problem with quantum
computers:
1. The b’s must be quickly accessible. This is the problem that QRAM is
supposed to solve.
2. The matrix A must satisfy certain conditions. If it is a sparse matrix,
then it likely can be processed efficiently on a quantum computer.
The matrix must also be well conditioned; that is, its determinant
must not be zero or close to zero (see the sketch after this list). A small
determinant also causes issues when inverting a matrix on a classical
computer, so this is not a problem unique to quantum computing.
3. The result of applying the HHL algorithm is that the x values appear
in superposition. Thus a mechanism is needed for efficiently isolating
the actual values from the superposition.
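As a point of comparison, the classical version of this problem is a one-liner. The following NumPy sketch solves Ax = b and checks the conditioning constraint from item 2; the threshold used for "ill conditioned" is an arbitrary illustrative choice of ours.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # the N = 2 case from elementary algebra
b = np.array([9.0, 8.0])

cond = np.linalg.cond(A)                 # ratio of largest to smallest singular value
if cond > 1e8:                           # ill conditioned: a problem for classical and quantum solvers alike
    raise ValueError(f"A is ill conditioned (condition number {cond:.2e})")

x = np.linalg.solve(A, b)                # classically O(N^3); HHL promises a speedup under its constraints
print(x)                                 # [2. 3.]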
The actual algorithm is too complicated for us to present here. One
noteworthy element, however, is that it relies on an amplitude
magnification technique based on using phases.
About the Authors
Len Bass is an award-winning author who has lectured widely around the
world. His books on software architecture are standards; in addition, he has
written books on user interface software and DevOps. Len has over 50 years' experience in
software development, 25 of those at the Software Engineering Institute of
Carnegie Mellon. He also worked for three years at NICTA in Australia and
is currently an adjunct faculty member at Carnegie Mellon University,
where he teaches DevOps.
Dr. Paul Clements is the Vice President of Customer Success at BigLever
Software, Inc., where he works to spread the adoption of systems and
software product line engineering. Prior to this, he was a senior member of
the technical staff at Carnegie Mellon University’s Software Engineering
Institute, where for 17 years he worked leading or co-leading projects in
software product line engineering and software architecture design,
documentation, and analysis. Prior to the SEI, he was a computer scientist
with the U.S. Naval Research Laboratory in Washington, DC, where his
work involved applying advanced software engineering principles to real-
time embedded systems.
In addition to this book, Clements is the co-author of two other
practitioner-oriented books about software architecture: Documenting
Software Architectures: Views and Beyond and Evaluating Software
Architectures: Methods and Case Studies. He also co-wrote Software
Product Lines: Practices and Patterns and was co-author and editor of
Constructing Superior Software. In addition, Clements has authored about a
hundred papers in software engineering, reflecting his long-standing
interest in the design and specification of challenging software systems.
Rick Kazman is a Professor at the University of Hawaii and a Visiting
Researcher at the Software Engineering Institute of Carnegie Mellon
University. His primary research interests are software architecture, design
and analysis tools, software visualization, and software engineering
economics. Kazman has been involved in the creation of several highly
influential methods and tools for architecture analysis, including the ATAM
(Architecture Tradeoff Analysis Method), the CBAM (Cost-Benefit
Analysis Method), and the Dali and Titan tools. In addition to this book, he
is the author of over 200 publications and is co-author of three patents and
eight books, including Technical Debt: How to Find It and Fix It, Designing
Software Architectures: A Practical Approach, Evaluating Software
Architectures: Methods and Case Studies, and Ultra-Large-Scale Systems:
The Software Challenge of the Future. His research has been cited over
25,000 times, according to Google Scholar. He is currently the chair of the
IEEE TAC (Technical Activities Committee), Associate Editor for IEEE
Transactions on Software Engineering, and a member of the ICSE Steering
Committee.
Index
A/B testing, 86
Abort tactic, 159
Abstract common services, 108
Abstract data sources for testability, 189
Abstraction, architecture as, 3
ACID (atomic, consistent, isolated, and durable) properties, 61
Acronym lists in documentation, 346
Active redundancy, 66
Activity diagrams for traces, 342–343
Actors
attack, 174
elements, 217
Actuators
mobile systems, 263, 267–268
safety concerns, 151–152
Adapt tactic for integrability, 108–109, 111
ADD method. See Attribute-Driven Design (ADD) method
ADLs (architecture description languages), 331
Aggregation for usability, 201
Agile development, 370–373
Agile Manifesto, 371–372
Air France flight 447, 152
Allocated-to relation
allocation views, 337
deployment structure, 15
Allocation structures, 10, 15–16
Allocation views
documentation, 348–350
overview, 337–338
Allowed-to-use relationship, 128–129
Alternative requests in long tail latency, 252
Amazon service-level agreements, 53
Analysis
ADD method, 295, 304–305
ATAM, 318–319, 321
automated, 363–364
Analysts
documentation, 350
software interface documentation for, 229
Analytic redundancy tactic
availability, 58
safety, 159
Apache Camel project, 356–359
Apache Cassandra database, 360–361
Applications for quantum computing, 396–397
Approaches
ATAM, 317–319, 321
CIA, 169
Lightweight Architecture Evaluation, 325
Architects
communication with, 29
competence, 379–385
duties, 379–383
evaluation by, 311
knowledge, 384–385
mentoring, 387–388
mobile system concerns, 264–273
role. See Role of architects
skills, 383–384
Architectural debt
automation, 363–364
determining, 356–358
example, 362–363
hotspots, 358–362
introduction, 355–356
quantifying, 363
summary, 364
Architectural structures, 7–10
allocation, 15–16
C&C, 14–16
limiting, 18
module, 10–14
relating to each other, 15–18
selecting, 18
table of, 17
views, 5–6
Architecturally significant requirements (ASRs)
ADD method, 289–290
from business goals, 282–284
change, 286
introduction, 277–278
from requirements documents, 278–279
stakeholder interviews, 279–282
summary, 286–287
utility trees for, 284–286
Architecture
changes, 27
cloud. See Cloud and distributed computing
competence. See Competence
debt. See Architectural debt
design. See Design and design strategy
documentation. See Documentation
evaluating. See Evaluating architecture
integrability, 102–103
modifiability. See Modifiability
patterns. See Patterns
performance. See Performance
QAW drivers, 281
QAW plan presentation, 280
quality attributes. See Quality attributes
requirements. See Architecturally significant requirements (ASRs);
Requirements
security. See Security
structures. See Architectural structures
tactics. See Tactics
testability. See Testability
usability. See Usability
Architecture description languages (ADLs), 331
Architecture Tradeoff Analysis Method (ATAM), 313
approaches, 317–319, 321
example exercise, 321–324
outputs, 314–315
participants, 313–314
phases, 315–316
presentation, 316–317
results, 321
scenarios, 318
steps, 316–321
Ariane 5 explosion, 151
Artifacts
ADD method, 291
availability, 53
continuous deployment, 74
deployability, 76
energy efficiency, 91
in evaluation, 312
integrability, 104
modifiability, 120–121
performance, 136
quality attributes expressions, 43–44
safety, 154
security, 171
testability, 186
usability, 198
Aspects for testability, 190
ASRs. See Architecturally significant requirements (ASRs)
Assertions for system state, 190
Assurance levels in design, 164
Asynchronous electronic communication, 375
ATAM. See Architecture Tradeoff Analysis Method (ATAM)
Atomic, consistent, isolated, and durable (ACID) properties, 61
Attachment relation for C&C structures, 14–16
Attachments in C&C views, 335
Attribute-Driven Design (ADD) method
analysis, 295, 304–305
design concepts, 295–298
design decisions, 294
documentation, 301–303
drivers, 292–294
element choice, 293–294
element instantiation, 299–300
inputs, 292
overview, 289–291
prototypes, 297–298
responsibilities, 299–300
steps, 292–295
structures, 298–301
summary, 306
views, 294, 301–302
Attributes. See Quality attributes
Audiences for documentation, 330–331
Audits, 176
Authenticate actors tactic, 174
Authorize actors tactic, 174
Automation, 363–364
Autoscaling in distributed computing, 258–261
Availability
CIA approach, 169
cloud, 253–261
detect faults tactic, 56–59
general scenario, 53–55
introduction, 51–52
patterns, 66–69
prevent faults tactic, 61–62
questionnaires, 62–65
recover from faults tactics, 59–61
tactics overview, 55–56
Availability of resources tactic, 139
Availability quality attribute, 285
Availability zones, 248
E-scribes, 314
Earliest-deadline-first scheduling strategy, 143
Early design decisions, 31
EC2 cloud service, 53, 184
ECUs (electronic control units) in mobile systems, 269–270
Edge cases in mobile systems, 271
Education, documentation as, 330
Efficiency, energy. See Energy efficiency
Efficient deployments, 76
Einstein, Albert, 385
Electric power for cloud centers, 248
Electronic control units (ECUs) in mobile systems, 269–270
Elements
ADD method, 293–294, 299–300
allocation views, 337
C&C views, 336
defined, 4
modular views, 333
software interfaces, 217–218
Emergent approach, 370–371
Emulators for virtual machines, 236
Enabling quality attributes, 26
Encapsulation in integrability, 106
Encrypt data tactic, 175
Encryption in quantum computing, 394–395
End users, documentation for, 349–350
Energy efficiency, 89–90
general scenario, 90–91
patterns, 97–98
questionnaire, 95–97
tactics, 92–95
Energy for mobile systems, 263–265
Entanglement in quantum computing, 393–394
Enterprise architecture vs. system architecture, 4–5
Environment
allocation views, 337–338
availability, 54
continuous deployment, 72
deployability, 76
energy efficiency, 91
integrability, 104
modifiability, 120–121
performance, 136
quality attributes expressions, 43–44
safety, 154
security, 171
software interfaces, 217
testability, 186
usability, 198
virtualization effects, 73
Environmental concerns with mobile systems, 269
Errors
description, 51
error-handling views, 339
software interface handling of, 227–228
in usability, 197
Escalating restart tactic, 60–61
Estimates, cost and schedule, 33–34
Evaluating architecture
architect duties, 311, 381
ATAM. See Architecture Tradeoff Analysis Method (ATAM)
contextual factors, 312–313
key activities, 310–311
Lightweight Architecture Evaluation, 324–325
outsider analysis, 312
peer review, 311–312
questionnaires, 326
risk reduction, 309–310
summary, 326–327
Events
performance, 133
software interfaces, 219–220
Evolution of software interfaces, 220–221
Evolutionary dependencies in architectural debt, 356
Exception detection tactic, 58–59
Exception handling tactic, 59
Exception prevention tactic, 62
Exception views, 339
Exchanged data in software interfaces, 225–227
Executable assertions for system state, 190
Experience in design, 296
Expressiveness concern for exchanged data representation, 225
Extendability in mobile systems, 273
EXtensible Markup Language (XML), 226
Extensions for software interfaces, 220
External interfaces, 300–301
Externalizing change, 125
Failures
availability. See Availability
cloud, 251–253
description, 51
Fault tree analysis (FTA), 153
Faults
description, 51–52
detection, 55
prevention, 61–62
recovery from, 59–61
Feature toggle in deployment, 80
FIFO (first-in/first-out) queues, 143
Firewall tactic, 159
First-in/first-out (FIFO) queues, 143
First principles from tactics, 47
Fixed-priority scheduling, 143
Flexibility
defer binding tactic, 124
independently developed elements for, 35
Follow-up phase in ATAM, 316
Forensics, documentation for, 330
Formal documentation notations, 331
Forward error recovery pattern, 68
Foster, William A., 39
FTA (fault tree analysis), 153
Fuller, R. Buckminster, 1
Function patches, 59
Function testing in mobile systems, 272
Functional redundancy
availability, 58
containment, 159
Functional requirements, 40–41
Functional suitability of quality attributes, 211
Functionality
C&C views, 336
description, 40
Fusion of mobile system sensors, 268
Future computing. See Quantum computing
Tactics
ADD method, 299–300
architecture evaluation, 326
availability, 55–65
deployability, 78–81
energy efficiency, 92–97
integrability, 105–112
modifiability, 121–125
performance, 137–146
quality attributes, 45–46, 48–49
safety, 156–162
security, 172–178
testability, 187–192
usability, 200–203
Tailor interface tactic, 109
Team building skills, 383
Teams in ATAM, 313–314
Technical debt. See Architectural debt
Technology knowledge of architects, 385
Technology-related competence, 387
Teleportation in quantum computing, 394
Temporal distance in architecture integrability, 103
Temporal inconsistency in deployability, 85
10-18 Monkey, 185
Test harnesses, 184
Testability
general scenario, 186–187
introduction, 183–185
patterns, 192–194
questionnaires, 192
tactics, 187–191
Testable requirements, 278
Testers, documentation for, 349
Tests and testing
continuous deployment, 72–73
mobile systems, 271–272
modules, 334
Therac 25 radiation overdose, 151
Therapeutic reboot tactic, 61
Thermal limits in mobile systems, 269
Threads
concurrency, 135
virtualization, 234
Throttling mobile system power, 265
Throttling pattern for performance, 148
Throughput of systems, 137
Tiered system architectures in REST, 225
Time and time management
architect role, 368
performance, 133
Time coordination in distributed computing, 257
Time to market, independently developed elements for, 35
Timeout tactic
availability, 58–59
safety, 157–158
Timeouts in cloud, 251–252
Timestamp tactic
availability, 57
safety, 158
Timing as safety factor, 153
TMR (triple modular redundancy), 67
Traceability
continuous deployment, 74
documentation, 352–353
Traces for behavior documentation, 341–342
Tradeoffs in ATAM, 315
Traffic systems, 144
Training, architecture for, 36
Transactions in availability, 61
Transducers in mobile systems, 267
Transferable models, 34
Transforming existing systems, 381
Transparency in exchanged data representation, 226
Triple modular redundancy (TMR), 67
Two-phase commits, 61
Type 1 hypervisors, 235
Type 2 hypervisors, 235