Design Patterns for High Availability

embedded.com/design-patterns-for-high-availability

Embedded Staff July 31, 2002

It is possible to achieve five-nines reliability with everyday commercial-quality hardware and software. The key is the way in which these components are combined.

Our society has come to expect uninterrupted service from many of the systems that it
employs, such as phones, automated teller machines, and credit card verification
networks. Many of these are implemented as embedded systems.

Embedded designers are increasingly being asked to create systems that run reliably to the degree that they're in service 99.999% of the time (termed “five-nines” availability), which is equivalent to less than one second of downtime per day. These are called high availability systems.

The design of high availability systems is based on a combination of redundant hardware components and software to manage fault detection and correction without human intervention. In this article, we'll quickly review some definitions tied to high availability and fault management, and then go on to discuss some hardware and software design patterns for fault tolerant systems.

Fault vs. failure

When we design a high availability system, we need to focus a major proportion of our
design effort on failures and faults. For want of a better definition, we can define a failure
as a situation in which the service provided by a system does not meet its specification. I'd
prefer a definition that compares the service provided by a system with our expectations,
rather than a spec. But often our expectations are subjective and not well documented. So
we'll stick with a comparison of a system's services to its specification, even though that
leaves us open to problems coming from specification errors or omissions.

A fault, on the other hand, is a failure of an interacting system. It's what we think of as the
possible cause of something undesirable happening. So a fault might be a failure of a
subsystem of our system, a component failure, a failure of an external system, or a
programming mistake. Faults can trigger more faults of various kinds. Or they can trigger
failures. Or they can happen without triggering any failure at all.

For example, a fault might be the failure of a trucker to respect the load limit of the
Golden Gate Bridge. If the truck exceeds the load limit by 20% or 30%, we don't expect
the bridge to fail. But another fault might be that same truck crashing into the median
divider of the bridge roadway. This wouldn't cause the bridge to physically fail either, but
it might cause the bridge to fail in its specified purpose of bringing people and vehicles
across the Golden Gate waterway for a number of hours.

Faults may be transient, permanent, or intermittent. When they are activated, they may
cause errors in the state of a system or subsystem. These errors may trigger failures in
your system. There are four main approaches to dealing with faults:

Fault forecasting
Fault avoidance
Fault removal
Fault tolerance

Fault forecasting uses mathematical models as well as practical experiments to provide estimates about the presence of faults and their consequences. For example, one practical fault forecasting technique is to inject faults into a system and study any resulting failures.

Fault avoidance and removal are implemented by using rigorous system, hardware, and
software development processes, perhaps including formal specification and verification
techniques.

Fault tolerance is implemented by using redundant, perhaps diverse, implementations of a system to avoid the effects of faults. One way to implement fault tolerance is called “graceful degradation” (or “fail soft”): providing sensible partial functionality if full system performance cannot be provided. Another way is called “fail safe” (or “fail stop”): stopping the system in a safe state rather than continuing operation when a fault appears.

The main concept for fault tolerance is redundancy. It's based on the idea (or the hope)
that multiple independent faults will not strike your system together. A fault tolerant
system should be designed to avoid single points of failure. In other words, if a part of the
system can suffer a fault, there should be a redundant part of the system that can cover for
the fault and thus avoid a failure.

Redundancy comes in a variety of forms:

Hardware redundancy (low-level, high-level, or both)
Software redundancy
Time redundancy
Information redundancy

Examples of hardware redundancy include self-checking logic circuits, and multiple flight
computers in a single airplane. Software redundancy might use two completely different
algorithms to calculate the same result. Time redundancy may be done by communication
re-transmissions. And information redundancy can be done using backups, checksums,
and error correction codes.
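
To make information redundancy concrete, here is a minimal sketch in C (the 8-bit additive checksum and all names are chosen purely for illustration) that stores a checksum alongside a record and verifies it later. A production design would more likely use a CRC or an error-correcting code, which can detect more error patterns and even repair some of them.

```c
#include <stdint.h>
#include <stdio.h>

/* Simple 8-bit additive checksum over a buffer.
   Real designs usually prefer a CRC, which catches more error patterns. */
static uint8_t checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

int main(void)
{
    uint8_t record[9] = "ABCDEFGH";     /* 8 payload bytes + 1 checksum byte */
    record[8] = checksum(record, 8);    /* redundant information stored with the data */

    record[3] ^= 0x01;                  /* inject a single-bit fault for the demo */

    if (checksum(record, 8) != record[8])
        printf("fault detected: record failed its checksum\n");
    else
        printf("record accepted\n");
    return 0;
}
```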

Redundancy may be dynamic or static. Both use replicated system elements. In static
redundancy, all replicas are active at the same time. If one replica “throws a fault,” the
other replicas can be used immediately to allow the system to continue correct operation.
In dynamic redundancy, one replica is active and others are passive. If the active replica
throws a fault, a previously passive replica is activated and takes over critical operations.

So what does all of this have to do with achieving high availability? First, a definition.
High availability is the ability of a system to tolerate faults and continue to provide service
according to its spec. Systems can use all of the concepts and approaches described here
to achieve high availability.

Availability is often measured in units of “availability percentage” or “downtime per year.” Typical fault tolerant systems may be expected to be available 99.99% of the time, or down less than an hour per year (under 10 seconds on a typical day). But high availability systems are expected to be available 99.999% of the time, or down roughly five minutes per year (less than one second per day). This generally means that when a fault appears, it must be handled automatically; humans are too slow to remove or mask any problem within the time required.
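
The arithmetic behind these numbers is simple enough to check; the short snippet below (illustrative only) converts an availability percentage into permitted downtime per year and per day.

```c
#include <stdio.h>

int main(void)
{
    const double availabilities[] = { 99.99, 99.999 };    /* "four nines" and "five nines" */
    const double sec_per_year = 365.0 * 24.0 * 3600.0;

    for (int i = 0; i < 2; i++) {
        double down_fraction = 1.0 - availabilities[i] / 100.0;
        double down_per_year = down_fraction * sec_per_year;
        printf("%.3f%% available: %.0f s/year (%.1f min/year, %.1f s/day)\n",
               availabilities[i], down_per_year,
               down_per_year / 60.0, down_per_year / 365.0);
    }
    return 0;
}
```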

Hardware redundancy

Rather than building hardware as individual super-reliable modules constructed of super-reliable components, it is usually more cost-effective to use redundant replicated modules of everyday commercial quality hardware built with everyday commercial quality components.

Each of the replicas is often designed to exhibit “fail fast” or “fail stop” behavior. This vastly simplifies fault management decision making: every failure makes the hardware stop dead in its tracks, rather than limping along and challenging a fault manager to figure out which outputs of the module are now faulty and which are still good.

For fault tolerance using static redundancy, each replica module can be of everyday
commercial reliability. Using two replicas is called pairing or duplexing. If N replicas are
used, this is called N-plexing.

Figure 1: Triple redundant hardware

Figure 1 shows a 3-plexed or triple redundant hardware design. The three replicas are
shown near the bottom of the diagram. They provide their outputs to a “voter,” which
determines the actual final output of the subsystem. When N >= 3 in an N-plexed design, a “voter” will usually use majority decision making. However, this needs to be a majority of the non-failed replicas, not just a simple majority of the total number of replicas (failed and non-failed).

But isn't a voter just a piece of hardware and software that could fail like any of the other
modules in our system? In fact, it could; and it would bring disaster to our system if it did.
But voters are normally very simple units that can be designed and tested to ensure their
robustness. Alternatively, you can create designs involving replicated voters and second-
level voters of voters, etc. But we won't discuss that here.
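
A minimal sketch of the voter's decision rule, assuming each replica reports its output together with a self-diagnosed “failed” flag (the structure and names are invented for the example). As described above, the vote is taken over the non-failed replicas only.

```c
#include <stdbool.h>
#include <stdio.h>

#define N_REPLICAS 3

struct replica_output {
    bool failed;   /* replica has fail-stopped or missed its deadline */
    int  value;    /* output value produced by the replica */
};

/* Return the majority value among non-failed replicas, or -1 if no majority exists. */
static int vote(const struct replica_output out[], int n)
{
    for (int i = 0; i < n; i++) {
        if (out[i].failed)
            continue;
        int live = 0, agree = 0;
        for (int j = 0; j < n; j++) {
            if (out[j].failed)
                continue;
            live++;
            if (out[j].value == out[i].value)
                agree++;
        }
        if (2 * agree > live)   /* strict majority of the non-failed replicas */
            return out[i].value;
    }
    return -1;                  /* no majority: escalate to a higher-level fault handler */
}

int main(void)
{
    /* Replica 1 has fail-stopped; replicas 0 and 2 agree, so their value wins. */
    struct replica_output out[N_REPLICAS] = { { false, 42 }, { true, 0 }, { false, 42 } };
    printf("voted output: %d\n", vote(out, N_REPLICAS));
    return 0;
}
```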

For fault tolerance using dynamic redundancy, the replicated modules can still be of
everyday commercial reliability. One way to do this uses a redundant pair consisting of
one active module and one standby module. Another way is to use a cluster of modules.
The modules do not have to be precise duplicates of one another, and can have different
characteristics, interfaces, and capacities. A cluster requires a failover strategy to decide
how to manage multiple modules when the primary module throws a fault. Here are some
choices:

Standby backup. While the primary module runs in the system, a backup module is
waiting in “standby,” watching the primary module for faults and ready to fire up
and take over. For example, high-availability web servers can be designed using this
approach.
Rotating standby. While the primary module runs in the system, there may be a
number of backup modules. One backup will take over running the system in case of
a fault in the primary. The flight computers on the space shuttle have been designed
with this philosophy: the primary module consists of a pair of computers that must
always agree with one another. The first backup module is a similar pair. But the
second backup module on the space shuttle is a single computer that can take over
only via human command.
Failover to non-critical module. The primary module runs the critical resources of
the system. A backup module can run other non-critical things, but it can take over
the most critical services of the primary in case of fault. This is what we do, as
humans, when a PC's high-speed Internet connection fails as we're trying to send an
urgent e-mail, and we quickly switch over to that old modem we never thought we'd
need again.
Mutual takeover. Each module runs its own critical resources, but can take over the
critical resources of another module in case of fault. For example, in a cardiac
intensive care ward, there should be a heart monitoring computer for every eight
patients. But each one can handle an additional eight patients (perhaps with some
graceful degradation) if a neighboring heart monitoring computer goes down.

The correct implementation of failover is crucial. It would be a disaster if a primary module that was throwing faults were to continue running while another module tried to take over its services: their services could collide in unexpected ways. And it might be equally disastrous if a primary module were stopped after throwing a fault, and no other module came up to take over its services. So the validation and testing of failover is critical, although it's not something that many of us enjoy doing.
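
One way to keep those two failure modes apart is to make the takeover sequence explicit: the standby declares the primary dead only after several missed heartbeats, fences it (forces it off or confirms it has fail-stopped), and only then assumes the primary role. The sketch below is a hypothetical illustration of that ordering; the platform hooks and the threshold are invented.

```c
#include <stdbool.h>
#include <stdio.h>

enum role { STANDBY, PRIMARY };

struct module {
    enum role role;
    int missed_heartbeats;      /* consecutive heartbeat periods with no message */
};

#define MAX_MISSED 3            /* how long to wait before declaring the primary dead */

/* Hypothetical platform hooks: a real system might toggle a power relay,
   assert a reset line, or read back a hardware "alive" signal. */
static void force_primary_off(void)     { puts("primary forced off"); }
static bool primary_confirmed_off(void) { return true; }

/* Called once per heartbeat period by the standby module. */
static void standby_tick(struct module *m, bool heartbeat_received)
{
    if (heartbeat_received) {
        m->missed_heartbeats = 0;
        return;
    }
    if (++m->missed_heartbeats < MAX_MISSED)
        return;                          /* not yet sure the primary is gone */

    force_primary_off();                 /* fence the suspect primary first... */
    if (primary_confirmed_off()) {
        m->role = PRIMARY;               /* ...and only then take over its services */
        puts("standby promoted to primary");
    }
}

int main(void)
{
    struct module backup = { STANDBY, 0 };
    for (int t = 0; t < 3; t++)          /* simulate three silent heartbeat periods */
        standby_tick(&backup, false);
    return 0;
}
```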

Software redundancy

Most hardware faults are random and result from physical defects, either sustained
during manufacturing or developed over time as components wear out or suffer shocks
from the surrounding physical world. Software faults, on the other hand, are not physical;
software does not wear out. Instead, software faults result from the invocation of software
paths that contain defects in the software design or implementation that were always
there. Since software is typically more complex than hardware, it can be expected to have
many more built-in defects, resulting in more software faults than hardware faults.
Software fault tolerance is also more expensive to design than hardware fault tolerance.

N-version programming is a veteran design pattern for software fault tolerance. When I first ran across it in the '70s, it was called dissimilar software. It's the software equivalent of hardware N-plexing (see Figure 1). But it's not as simple as replicating one piece of software N times, as in hardware N-plexing: the N copies would contain the same faults and produce the same errors N times. Instead, if N units of some software functionality need to run in parallel, they must be N disparate implementations of that functionality, independently implemented by N separate development teams. That's N-version programming.
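
A toy illustration of the pattern: three versions of an integer square root, each written a different way (standing in for implementations from three independent teams), with a simple majority check over their results. The function names and the agreement rule are invented for the example, and the inputs are assumed small enough that the arithmetic cannot overflow.

```c
#include <stdio.h>

/* Version 1: linear search. */
static unsigned isqrt_v1(unsigned x)
{
    unsigned r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Version 2: binary search. */
static unsigned isqrt_v2(unsigned x)
{
    unsigned lo = 0, hi = x / 2 + 1;
    while (lo < hi) {
        unsigned mid = (lo + hi + 1) / 2;
        if (mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Version 3: Newton iteration. */
static unsigned isqrt_v3(unsigned x)
{
    if (x == 0)
        return 0;
    unsigned r = x, prev;
    do {
        prev = r;
        r = (r + x / r) / 2;
    } while (r < prev);
    return prev;
}

/* Accept a result only if at least two of the three versions agree on it. */
static int vote3(unsigned a, unsigned b, unsigned c, unsigned *out)
{
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

int main(void)
{
    unsigned x = 1000, result;
    if (vote3(isqrt_v1(x), isqrt_v2(x), isqrt_v3(x), &result) == 0)
        printf("isqrt(%u) = %u (majority of 3 versions)\n", x, result);
    else
        printf("versions disagree: raise a fault\n");
    return 0;
}
```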

In June 1996, the first flight of the Ariane-5 satellite-launcher rocket exploded in a ball of
fire as it reached an altitude of 4,000 meters. The failure was caused by faults in the
rocket's inertial reference system (a part of its digital flight controls) despite hardware
redundancy, because software redundancy had not been handled properly. There were
two inertial reference computers on the Ariane-5, one active and the other a “hot”
standby. Both ran in parallel, with identical hardware and software. The software was
nearly the same as that on the Ariane-4, an older and successful launch vehicle. But some
of the flight parameter values on the Ariane-5 were larger than on the Ariane-4, so data
values overflowed. The error was handled by shutting down the computer. Since the redundant computer was running the same software, it was hit by the same data overflow, was quickly shut down too, and the whole inertial reference system went dead. As a
result, the nozzles of the engines swiveled to extreme positions, causing the rocket to turn
abruptly and break apart before self-destructing.[1] The way the data overflow errors were
handled would have been appropriate for randomly occurring hardware errors, but it was
not appropriate for similar software errors occurring on two computers. And similar
software errors on the two computers could have been avoided with N-version
programming.

Back in the '70s we thought of N-version programming as the state-of-the-art in software fault tolerance. Since then, a number of problems with this design pattern have emerged: software development costs skyrocket when you use it, since you need to pay the N independent groups to implement the N independent software designs. But if you try to skimp on some of the costs, you'll run into what's called the “Average IQ” problem: less expensive development teams have less-qualified software engineers who will produce lower-quality code. So you may end up with N diverse programs that are all riddled with faults, created in N different ways.

Another downfall of N-version programming is the issue of what to provide as input to the
N independent development teams. In general, a single specification is photocopied and
provided to all N development teams. But if this specification is flawed, you'll get N
independently developed versions of similarly flawed software that'll all do the wrong
thing. And if specification or usage errors are identified after the system is released, each
new error must be fixed N times, once in each of the N diverse implementations, thus making maintenance costs exorbitant. Nowadays, the price tag of N-version programming
is usually thought to be better spent by asking one top-notch software development team
to develop one high-quality software version using the best available infrastructure,
software development tools, techniques, and testing.

Checkpointing

In contrast to the static redundancy of N-version programming, a number of software fault tolerance design patterns are based on dynamic redundancy. All of them take a four-stage approach:

1. Error detection
2. Damage assessment and confinement (sometimes called “firewalling”)
3. Error recovery
4. Fault treatment and continued service

In the second of these stages, a fail-stop approach is often taken when software errors are
detected. But software is highly complex, so it is often unclear how to undo the effects of
faulty software behavior surrounding an error. One helpful tool in this regard is the
concept of a transaction. A transaction is a collection of operations on the state of an
application, such that the beginning and end of a transaction are points at which the
application is in a consistent state. For example, every city's town hall has a file system
containing information about all the residents of the town. When two people get married,
their names and date of marriage are recorded in a file called “Married Couples.” In
addition, the status of the groom gets changed from single to married in the file called
“Male Residents.” And the status of the bride gets changed from single to married in the
file called “Female Residents.” If one of the three file updates fails or if the software
crashes along the way, we need to go back to the beginning of the marriage “transaction.”
Otherwise we might end up with an inconsistent state like the groom being listed as
married while his wife is still listed as single. Things are only consistent at the beginning
of the marriage transaction, and at the successful end of the marriage transaction.

If we want to use the concept of transactions for fault tolerance, our system has to be able
to save its state at the beginning of a transaction. This is called checkpointing. It involves
taking a “snapshot” of the software's state just as it is about to begin the first step of the new transaction. The snapshot is only taken if the previous transaction was completed in
an error-free state. The basic recovery strategy here is re-execution: when an error is
detected during a transaction, the transaction is fail-stopped and the system is reloaded
back to the last saved checkpoint. Then service is continued from this checkpoint,
allowing new transactions to build on its consistent state. However, the transaction that
was fail-stopped is lost. This kind of error recovery is referred to as backward error
recovery, since the software state is rolled back to a past error-free point.
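
A stripped-down sketch of this backward recovery, reusing the marriage-records example. The “snapshot” here is just an in-memory copy of the application state; the record layout, the injected fault, and the function names are all invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy application state: the three records touched by a "marriage" transaction. */
struct town_records {
    char couples[32];       /* entry in "Married Couples"  */
    char groom_status[16];  /* entry in "Male Residents"   */
    char bride_status[16];  /* entry in "Female Residents" */
};

static struct town_records state = { "", "single", "single" };
static struct town_records checkpoint;      /* snapshot taken at transaction start */

/* Attempt the transaction; return false if a step fails midway. */
static bool marry(const char *groom, const char *bride, bool inject_fault)
{
    snprintf(state.couples, sizeof state.couples, "%s & %s", groom, bride);
    strcpy(state.groom_status, "married");
    if (inject_fault)
        return false;                        /* crash before the bride's record is updated */
    strcpy(state.bride_status, "married");
    return true;
}

int main(void)
{
    checkpoint = state;                      /* checkpoint: the state is consistent here */
    if (!marry("Bob", "Alice", true)) {
        state = checkpoint;                  /* backward recovery: roll back, drop the transaction */
        puts("transaction failed: rolled back to checkpoint");
    }
    printf("groom=%s bride=%s\n", state.groom_status, state.bride_status);
    return 0;
}
```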

Figure 2: Checkpoint-rollback scenario

Simple checkpointing has its own dangerous single point of failure. There might be a fault
during the taking of the snapshot of the checkpoint itself. But there's a solution to this,
sometimes called checkpoint-rollback. It's illustrated in Figure 2. In this figure, the
ellipses represent a software client and a software server that communicate with one
another by sending messages through queues. A single transaction may include many
Request messages going from client to server, and many Response messages going from
server to client. During a transaction, data are modified within the server. At the end of a
transaction, a consistent set of data should be recorded on each of the two persistent
mass-storage devices shown on the right. Together with the data, a transaction sequence
number should be recorded. If an error is detected sometime later and the server is
stopped, this server may be restarted (or a replica server may be started). As part of the
startup recovery procedure, the transaction sequence numbers on the two mass-storage
devices are checked. The server data will be recovered from the device containing the
highest sequence number. (The other mass-storage device may contain a lower sequence
number because the fault occurred during the checkpointing to that device.)
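
A sketch of that recovery-time decision, assuming each mass-storage device holds the checkpointed data alongside its transaction sequence number (the structure and names are hypothetical). The server checkpoints to the two devices one after the other, so a fault during the second write leaves that device at most one transaction behind; recovery simply trusts whichever device carries the higher sequence number.

```c
#include <stdio.h>

/* What each persistent device holds: checkpointed data plus its sequence number. */
struct checkpoint_record {
    unsigned long seq;      /* transaction sequence number */
    int data;               /* stands in for the server's consistent state */
};

/* Checkpoint to both devices, one after the other. */
static void save_checkpoint(struct checkpoint_record dev[2], unsigned long seq, int data)
{
    dev[0].seq = seq; dev[0].data = data;
    dev[1].seq = seq; dev[1].data = data;
}

/* Recovery: trust the device with the higher sequence number. */
static struct checkpoint_record recover(const struct checkpoint_record dev[2])
{
    return (dev[0].seq >= dev[1].seq) ? dev[0] : dev[1];
}

int main(void)
{
    struct checkpoint_record devices[2] = { { 7, 70 }, { 7, 70 } };

    /* Transaction 8 completes, but the server dies after updating only device 0. */
    devices[0].seq = 8; devices[0].data = 80;

    struct checkpoint_record good = recover(devices);
    printf("recovered from seq %lu, data %d\n", good.seq, good.data);

    save_checkpoint(devices, 9, 90);     /* normal checkpoint of the next transaction */
    return 0;
}
```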

Process pairs

A limitation of this checkpointing design pattern is that recovery after a fault may take a
long time. Starting or restarting a server might require a lot of processing, as might
recovering the data back to the checkpoint. A “hot backup” server working directly with
persistent mass storage device(s) of its own could speed up recovery. This design pattern
is called process pairs; it is illustrated in Figure 3.


Figure 3: Process pairs

Here we see a primary server at the center of the diagram working much as it does in the
previous checkpointing scenario. Clients work directly with this primary server. Whenever
the primary server successfully completes an entire transaction, it passes information
about its new consistent state to a backup server (on the right). Both the primary and
backup servers record the data in their persistent mass-storage devices. In this way, the
backup server is kept current about completed transactions. While the primary server is
up and available to clients, it sends regular heartbeat messages to the backup server.
These heartbeat messages may be combined with some of the checkpointing messages. If
the backup server detects that the stream of heartbeat messages has stopped, it
understands that the primary server is dead or unavailable, and it will very quickly take
over as a new primary server.

This design pattern can work when client, primary server, and backup server are all on
different processors or modules, as well as when they're all on a single processor.
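
The backup's side of this protocol can be sketched as a small message loop: checkpoint messages keep it current with completed transactions, heartbeats only reset its liveness timer, and a run of silent periods triggers takeover from the last checkpointed state. The message format, names, and timeout threshold below are invented for the example.

```c
#include <stdio.h>

/* Messages the primary sends to its backup: completed-transaction checkpoints,
   and plain heartbeats when there is nothing new to report. */
enum msg_type { MSG_HEARTBEAT, MSG_CHECKPOINT };

struct msg {
    enum msg_type type;
    unsigned long seq;      /* sequence number of the completed transaction */
    int state;              /* consistent state after that transaction */
};

struct backup {
    unsigned long last_seq; /* newest transaction the primary has reported */
    int last_state;
    int silent_periods;     /* heartbeat periods with no message from the primary */
    int is_primary;
};

#define TAKEOVER_AFTER 3    /* hypothetical threshold, tuned per system */

/* Called once per heartbeat period; m is NULL when nothing arrived. */
static void backup_tick(struct backup *b, const struct msg *m)
{
    if (m != NULL) {
        b->silent_periods = 0;
        if (m->type == MSG_CHECKPOINT) {    /* stay current with completed transactions */
            b->last_seq = m->seq;
            b->last_state = m->state;
        }
        return;
    }
    if (++b->silent_periods >= TAKEOVER_AFTER && !b->is_primary) {
        b->is_primary = 1;                  /* take over from the last checkpointed state */
        printf("backup takes over at transaction %lu, state %d\n",
               b->last_seq, b->last_state);
    }
}

int main(void)
{
    struct backup b = { 0, 0, 0, 0 };
    struct msg cp = { MSG_CHECKPOINT, 41, 100 };
    struct msg hb = { MSG_HEARTBEAT, 0, 0 };

    backup_tick(&b, &cp);                   /* transaction 41 checkpointed */
    backup_tick(&b, &hb);                   /* primary still alive */
    for (int t = 0; t < 3; t++)             /* then the primary falls silent */
        backup_tick(&b, NULL);
    return 0;
}
```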

With all its sophistication, the process pairs design pattern still leaves some issues open.
For example, checkpointing messages may get lost or arrive at the backup server out of
order. Or there may be problems if a primary server is the controller of a physical device
and it fails in mid-operation. The backup server may not know how to complete the
operation when it takes over, since it does not know how far the primary got before
failing. (For example: the primary server has moved the graphite rods of a nuclear reactor
10cm of the required 30cm when it fails.)

Once again, in the process pairs design pattern the transaction that was underway when
the primary server failed may well be lost or dropped in mid-execution. The backup server
will take over in the state of the last previously completed transaction reported to it by the
primary server before it failed.

Recovery blocks

Another design pattern for dynamic software redundancy is called recovery blocks. It's
based on checkpointing, but it adds an acceptance test for the results of software
processing. You need to prepare a number of alternative ways of processing your data, as in N-version programming, but you don't run all of your processing alternatives for each
transaction. Instead, you select one way of processing your data as your primary
alternative, and first try to do the transaction using the primary alternative's processing
only. At the end of this primary processing, you run the acceptance test. If the acceptance
test is passed, you're done. If the acceptance test is failed, you go on to try the next
alternative processing, as shown in Figure 4.


Figure 4: Recovery blocks

As seen in this diagram, checkpointing must be done before first entering the recovery
block. And the software must roll back to the checkpoint state after every failed
acceptance test. The acceptance test is performed after every processing alternative that is
tried. In this way, processing alternatives are run until a processing alternative succeeds
in delivering results that pass the acceptance test. These “good” outputs comprise the final
results of the recovery block.
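
The structure of a recovery block is often summarized as “ensure acceptance test by primary alternative else by secondary alternative else fail.” Here is a minimal sketch under invented names, with a deliberately buggy primary so that the secondary gets exercised:

```c
#include <stdbool.h>
#include <stdio.h>

static double state;                      /* application state the alternatives modify */
static double checkpoint;                 /* snapshot taken on entry to the recovery block */

/* Primary alternative: nominally precise, deliberately buggy for the demo. */
static void primary_alternative(double input)   { state = input * 0.0; }

/* Secondary alternative: cruder, independently written. */
static void secondary_alternative(double input) { state = input / 2.0 + 1.0; }

/* Acceptance test: is the new state plausible for this input? */
static bool acceptance_test(double input)       { return state > 0.0 && state <= input; }

static bool recovery_block(double input)
{
    void (*alternatives[])(double) = { primary_alternative, secondary_alternative };

    checkpoint = state;                            /* checkpoint before first entry */
    for (unsigned i = 0; i < 2; i++) {
        alternatives[i](input);                    /* try the next alternative */
        if (acceptance_test(input))
            return true;                           /* "good" outputs: we're done */
        state = checkpoint;                        /* roll back before the next attempt */
    }
    return false;                                  /* every alternative failed the test */
}

int main(void)
{
    state = 1.0;
    if (recovery_block(10.0))
        printf("recovery block succeeded, state = %g\n", state);
    else
        printf("recovery block failed, state restored to %g\n", state);
    return 0;
}
```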

Forward error recovery

Checkpoint-rollback, process pairs, and recovery blocks are all ways of doing backward
error recovery. Most dynamic software fault tolerance is done using backward error
recovery, but forward error recovery is another option. Its main idea is to continue from
an erroneous state and make corrections that will clean up the damaged state. Forward
error recovery depends on accurate damage assessment and is often system-specific. An
example of forward error recovery is exception handling, which passes control to
specialized exception handling software when a problem is detected, rather than rolling
back and continuing from some previously checkpointed state.

One design pattern for forward error recovery is called alternative processing. This
approach can be used when there are two (or more) processing alternatives for a
transaction, with one alternative normally being very precise but computationally complex, and the other alternative being simpler and faster. The
second alternative can be used as both a test and a secondary results provider, if the
primary processing is faulty. For example, a square-root algorithm can be the primary
processing and a table-lookup-interpolation can be its alternative. Once the algorithm has
run, table-lookup-interpolation can be used both to test the quality of the square-root
algorithm results and to quickly fix some errors in those results.
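
A sketch of that square-root example, with hypothetical numbers: a coarse table lookup with linear interpolation serves as the yardstick, and its value is substituted whenever the primary result strays outside an invented acceptance band.

```c
#include <math.h>
#include <stdio.h>

/* Simplistic alternative: square roots tabulated at 0, 25, 50, 75, 100
   with linear interpolation in between. */
static double sqrt_by_table(double x)
{
    static const double xs[] = { 0.0, 25.0, 50.0, 75.0, 100.0 };
    static const double ys[] = { 0.0, 5.0, 7.0711, 8.6603, 10.0 };
    for (int i = 1; i < 5; i++) {
        if (x <= xs[i]) {
            double t = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
            return ys[i - 1] + t * (ys[i] - ys[i - 1]);
        }
    }
    return ys[4];
}

/* Primary, precise computation (stands in for a complex algorithm). */
static double sqrt_primary(double x) { return sqrt(x); }

/* Forward recovery: accept the primary result if it agrees with the yardstick
   within a tolerance; otherwise continue with the alternative's result. */
static double checked_sqrt(double x)
{
    double precise = sqrt_primary(x);
    double rough   = sqrt_by_table(x);
    double tol     = 0.15 * rough + 0.5;     /* hypothetical acceptance band */

    if (fabs(precise - rough) <= tol)
        return precise;                      /* primary passes the acceptance test */
    return rough;                            /* substitute the simpler, trusted result */
}

int main(void)
{
    printf("checked_sqrt(60.0) = %f\n", checked_sqrt(60.0));   /* inputs assumed in [0, 100] */
    return 0;
}
```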


Figure 5: Alternative processing for aircraft flight control

In Figure 5, we see two digital flight control systems being run at the same time to control
a single airplane (Boeing 777).[2] The decision logic on the right side of the diagram uses
the outputs of the simpler flight control system as a yardstick against which to test the
outputs of the complex primary flight control system. If the acceptance test fails, then the
outputs of the simpler flight control system could also be used as an alternative to the
failed primary outputs. (The arrows going to the left show that the results of the
alternative processing may also be used to provide feedback to the primary processing.)

Design patterns such as these allow everyday commercial-quality hardware and software
to be used as building blocks for true high-availability systems, systems that can, without
human intervention, achieve “five-nines” or greater availability.

David Kalinsky is director of customer education at OSE Systems. He is a lecturer and seminar leader on technologies for embedded software. In recent years, David has built high-tech training programs on aspects of software engineering for the development of real-time and embedded systems. Before that, he was involved in the design of many embedded medical and aerospace systems. David holds a PhD in nuclear physics from Yale. He can be reached at .

Endnotes

1. SIAM (Society for Industrial and Applied Mathematics) News, vol. 29, #8, October 1996.

2. Chandra, Ramesh and Lui Sha. “On Scheduling Tasks in Reliable Real-Time Control Systems,” Proceedings of the 20th IEEE Real-Time Systems Symposium, 1998.


