Breaking the Availability Barrier II: Achieving Century Uptimes with Active/Active Systems
About this ebook
This book is Volume 2 of a three-part series on active/active systems. It describes techniques that can be used today for extending the time between system failures from years to centuries, often at little or no additional cost.
As our daily lives and corporate well-being become more dependent upon computers, system reliability grows increasingly important. No longer are frequent system outages acceptable. In many cases, failure intervals must now be measured in centuries.
Starting with a summary of Volume 1, techniques for achieving extraordinary availabilities are reviewed. These techniques use active/active architectures, in which multiple independent nodes using a common distributed database are cooperating in a common application. Should a node fail, all that is required is to switch the users on that node to a surviving node.
Equally important to the achievement of high availability is the ability to upgrade the system hardware and software without denying service to the users. The procedures to do this within an active/active system are described.
The secret to high availability is to let it fail, but fix it fast. This volume explores the server, database, and network redundancy techniques that allow fast-fix to happen. The cost considerations involved in such redundant architectures are also explored.
Dr. Bruce Holenstein
Dr. Bill Highleyman, Paul J. Holenstein, and Dr. Bruce Holenstein have a combined experience of over 90 years in the implementation of fault-tolerant, highly available computing systems. This experience ranges from the early days of custom redundant systems to today’s fault-tolerant offerings from HP (NonStop) and Stratus. Dr. Bill Highleyman has done extensive work on the effect of failure mode reduction on system availability. He has built fault-tolerant systems for train control, racetrack wagering, securities trading, message communication, and other applications. He is the Managing Editor of the Availability Digest (availabilitydigest.com). Paul J. Holenstein and Dr. Bruce Holenstein have architected and implemented the various data replication techniques required for the availability enhancements described in this book. Their company, Gravic, provides the Shadowbase line of data replication products to the fault-tolerant community.
Breaking the Availability Barrier II - Dr. Bruce Holenstein
CONTENTS
Dedication
Foreword
What is “This Book”?
Achieving Extreme Availabilities
A Roadmap Through This Book
Authors’ Notes
Acknowledgements
About the Authors
Part 1-Survivable Systems for Enterprise Computing
Chapter 1-Achieving Century Uptimes
What is Reliability?
System Availability
The 9s Measure of Availability
The Price of Reliability
The Why of Century Uptimes
The How of Century Uptimes
The Acceptance of Active/Active Technology
What’s Next
Chapter 2-Reliability of Distributed Computing Systems
Active/Active Systems Reviewed
The Availability Relationship
The Importance of Repair Time
The Importance of Recovery Time and the 4 Rs
System Splitting
Failover Time
Failover Faults
Environmental Faults
What’s Next?
Chapter 3-An Active/Active Primer
A General Solution
Database Locality
Database Synchronization
Synchronous Replication
Failure Mechanisms
Controlling Database Costs
The Availability/Performance/Cost Compromise
What’s Next?
Part 2-Building and Managing Active/Active Systems
Chapter 4-Active/Active Topologies
Architectural Topologies
Network Topologies
What’s Next
Chapter 5-Redundant Reliable Networks
The Need for Network Redundancy
Reliability Is More Than Just Redundancy
The Great Protocol Wars of the Twentieth Century
Redundancy Configurations
Backup Networks
Reconfigurable Networks
The Internet
Fault Detection
Fault Recovery
Fault Repair
Transaction, Session, and Connection Loss
Cost
A Case Study
What’s Next
Chapter 6-Distributed Databases
The Need for Distributed Databases
Database Synchronization
Issues with Distributing a Database
Issues with Remote Access
Failover
Database Recovery
What’s Next
Chapter 7-Node Failures
Causes of Node Failures
Detecting Failures
Failover
Switching Users
Node Recovery
Other Issues
What’s Next
Chapter 8-Eliminating Planned Outages with Zero Downtime Migrations
Introduction
Yes! Application Availability Counts
The ZDM Solution
Uses for ZDM
The Online Copy Facility
Zero Downtime Migrations with Low Risk
Planned Outages Eliminated
What’s Next?
Chapter 9-Total Cost of Ownership (TCO)
Choosing the Solution
Return on Investment (ROI)
Total Cost of Ownership (TCO)
Initial System Cost
Recurring Costs
Putting It All Together
What’s Next
Appendices
Appendix 1-Rules of Availability
Volume 1 Rules
Volume 2 Rules
Volume 3 Rules
References and Suggested Reading
About the Authors
Endnotes
Dedication
This book is dedicated to our spouses,
Denise, Janice, and Karen,
for their enduring patience and support.
We also dedicate this book to Jim Gray
for his fundamental contributions to transaction
processing technology on which this book is based.
Jim, an avid sailor, has been missing at sea
since January 28, 2007.
Foreword
Given today’s technology, [six 9s] is unachievable for all practical purposes, and an unrealistic goal.
-Evan Marcus and Hal Stern, 2000¹
My, how things change in just a few years! Not only are we going to talk about achieving systems with six 9s availability but also with eight 9s availability and beyond. Furthermore, we are not talking just about system availability. We are talking about application service availability. After all, following a failure of some sort, if the users of an application are being serviced in an unacceptable manner (such as experiencing excessively long response times), then the application is essentially not available.
If you could configure your current system to:
• provide extreme availability, with MTBFs measured in centuries,
• affect only a subset of users upon a failure,
• recover from any failure in subseconds to seconds,
• lose little if any data as the result of a failure,
• eliminate planned downtime,
• achieve disaster tolerance,
• use all available capacity,
• load balance at will,
• be easily expandable,
• require no change to existing applications,
• all at little or no additional cost,
wouldn’t you be interested? We think so, and that is what this book is all about. Active/active systems can and do provide these benefits today.
Abe Lincoln said that it is better to remain silent and be thought a fool than to speak out and remove all doubt.
At the risk of sounding foolish to some, we recognize that there are naysayers who will argue that extreme availabilities cannot be achieved. In this book we are speaking out, confident that the many examples of successful installations of active/active systems will prove us not to be fools, notwithstanding Abe.
What is “This Book”?
We referred to “this book” in the previous section. Actually, when we started to write “this book,” we intended it to be the second in a series on active/active systems. However, when we finished it, it became apparent that it was much too long to be a comfortable single book to read. Therefore, we decided to break it up into two volumes.
We will refer to the (now) three volumes as Volumes 1, 2, and 3. “This book” comprises Volumes 2 and 3. The titles of the active/active trilogy are:
Volume 1: Breaking the Availability Barrier: Survivable Systems for Enterprise Computing, published by AuthorHouse in 2004,
Volume 2: Breaking the Availability Barrier II: Achieving Century Uptimes with Active/Active Systems, published by AuthorHouse in 2007 along with Volume 3.
Volume 3: Breaking the Availability Barrier III: Active/Active Systems in Practice, published by AuthorHouse in 2007 along with Volume 2.
In keeping with Volumes 2 and 3 being essentially one book, this Foreword is the same in each volume. However, the content of each volume is markedly different.
Let us now return to the introduction of active/active systems.
Achieving Extreme Availabilities
The secret to the achievement of extreme availabilities is in the configuration. By configuring (or re-configuring) your monolithic system as an active/active architecture, the benefits described in our introduction can all be achieved.
What is an active/active system? We define it as a network of independent processing nodes, each having access to a common replicated database. All nodes can cooperate in a common application, and users can be serviced by multiple nodes.
[Figure: An Active/Active System]
Note an important implication of this definition. Active/active architectures are not just about protecting against hardware failures. In most cases, any event that will bring down a monolithic system will only bring down one node in an active/active system. Such failure events include not only hardware faults, but also software faults, operator errors, environmental failures (air conditioning, power, etc.), and manmade or natural disasters. Active/active architectures protect users against all of these faults, allowing service to be continued by simply switching users from a failed node to one or more surviving nodes.
Another implication is what active/active is not. Active/active is not a technology; it is a business solution. Active/active is not about distributed database synchronization; it is about achieving century uptimes. More specifically,
• Active/active systems are not co-located clusters. A basic tenet of active/active systems is that they protect against area-wide problems. If the nodes cannot be geographically separated, then they are not part of an active/active system.
• Active/active systems are not independent nodes using a common database. In such an architecture, the database cannot be geographically distributed and represents a single point of failure.
• Active/active systems are not those that use hardware replication for database synchronization. Hardware replication cannot guarantee referential integrity.² As a consequence, applications at synchronized sites cannot use the database copies.
• By the same token, active/active systems are not those that use software replication engines that do not guarantee referential integrity.
• Active/active systems are not clusters. Users on an active/active system can be put back into service in seconds by switching them to another operating node. Clusters require that another node be brought online, a process that typically takes minutes. This time delay precludes century uptimes.
• Active/active systems are not lock-stepped or voting systems because such designs require each node to process the same requests, thus precluding scalability.
• Active/active systems are not limited to enterprise applications. There are active/active distributed database systems on the market that are loosely coupled and synchronized by replication.
• Active/active systems do not require distributed disk-resident databases. Many active/active systems maintain their databases in memory.
Of course, in some cases, there may be no need for a database in an application (for example, a cluster of Web servers). In such systems, there is no context saved between operations. Implementing clusters of systems such as these is not a difficult task as it is only necessary to route any transaction to any surviving server. However, if an active database is involved such that context is retained from transaction to transaction, then providing a redundant synchronized database is necessary. This brings with it a myriad of issues. These volumes concentrate on applications which depend upon an integrated and updatable distributed database.
In many cases, the nodes in the application network are completely symmetric. Any transaction can be routed to any node, which can read or update any set of data items in the database. Should a node fail, users at the other nodes are unaffected. Furthermore, the users at the failed node can be switched quickly to surviving nodes, with their services restored in seconds or less.
In seconds is the secret. Common today is the use of cluster technology to provide high availability. Should a node in the cluster fail, users are switched to a backup node. However, the applications on that node must be brought up and database tables and files opened before application services can be offered to the users. This process typically takes several minutes or more. In active/active configurations, all applications are already up and running on each node and are actively processing transactions. All that must be done is to switch over affected users to surviving nodes.
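The switchover described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `switch_users`, the round-robin spreading of displaced users, and the in-memory user-to-node map are all hypothetical and are not taken from any product discussed in this book.

```python
from itertools import cycle

def switch_users(assignments, failed_node, nodes):
    """Return a new user->node map with users of failed_node moved to survivors.

    Users on surviving nodes are left untouched; displaced users are spread
    round-robin over the survivors (an assumed policy, chosen for simplicity).
    """
    survivors = [n for n in nodes if n != failed_node]
    if not survivors:
        raise RuntimeError("no surviving nodes to switch users to")
    targets = cycle(survivors)
    new_assignments = dict(assignments)
    for user in sorted(assignments):          # sorted for a repeatable result
        if assignments[user] == failed_node:
            new_assignments[user] = next(targets)
    return new_assignments

# Hypothetical three-node application network.
before = {"alice": "A", "bob": "B", "carol": "A", "dave": "C"}
after = switch_users(before, failed_node="A", nodes=["A", "B", "C"])
print(after)
```

Because the applications are already running on every node, nothing in this sketch starts a process or opens a database; only the routing of the affected users changes.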
Let us say that an active/active system can recover services in three seconds and that the equivalent cluster can recover in five minutes (300 seconds). The cluster will be down one hundred times longer than the active/active system. This lops off two nines from the cluster’s availability relative to the equivalent active/active system. A six 9s active/active system would be reduced to an availability of four 9s if it were in a cluster configuration. No wonder in 2007 many pundits still state that six 9s is not possible. But it is, as we will show in these volumes.
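The two-nines arithmetic can be checked with a short calculation. The sketch below assumes a simple model in which unavailability is the recovery time divided by the sum of the failure interval and the recovery time. The failure interval used here is an illustrative value, chosen so that a three-second recovery yields six 9s; it is not a figure from the text.

```python
import math

def unavailability(failure_interval_s, recovery_s):
    """Fraction of time the service is down, assuming each failure costs recovery_s."""
    return recovery_s / (failure_interval_s + recovery_s)

def nines(failure_interval_s, recovery_s):
    """Availability expressed as a count of 9s: -log10(unavailability)."""
    return -math.log10(unavailability(failure_interval_s, recovery_s))

# Assumed interval between failures, in seconds (illustrative only). It is
# picked so that a 3-second recovery gives an unavailability of one in a million.
FAILURE_INTERVAL_S = 3 * (10**6 - 1)

active_active_nines = nines(FAILURE_INTERVAL_S, 3)    # 3-second recovery
cluster_nines = nines(FAILURE_INTERVAL_S, 300)        # 5-minute recovery

print(f"active/active: {active_active_nines:.2f} nines")
print(f"cluster:       {cluster_nines:.2f} nines")
```

With the same failure interval, multiplying the recovery time by one hundred multiplies the unavailability by very nearly one hundred, which is a loss of two 9s.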
This leads to one of our availability rules:
Rule 36: To achieve extreme reliabilities, let it fail; but fix it fast.³
Are extreme availabilities important to you? Are the four 9s available with HP NonStop servers or with PC or Unix clusters acceptable? As we will discuss later, surveys have shown that the costs of downtime can range from USD $100,000 to several million dollars an hour, depending upon the application. Perhaps even worse, downtime can lead to the dreaded “CNN Moment” and massive losses in stock value (see Chapter 9, Total Cost of Ownership (TCO), in Volume 2 for what happened to AOL in 1996 and eBay in 1999). At the extreme, downtime can lead to significant property loss or even loss of life.
Only you can make this judgment. If extreme availabilities are important to your enterprise, “this book” is for you.
A Roadmap Through This Book
As we explained earlier, “this book” is in fact Volumes 2 and 3 of our trilogy describing how to achieve extreme availabilities with active/active systems. The first volume in this series, published in 2004 by AuthorHouse (www.authorhouse.com) and entitled Breaking the Availability Barrier: Survivable Systems for Enterprise Computing, referred to herein as Volume 1, lays the groundwork and the theory supporting the concepts of active/active systems. These two current volumes focus more on the practical aspects of implementing these systems.
They are broken into four parts, Parts 1 and 2 being in Volume 2 and Parts 3 and 4 being in Volume 3:
• Part 1, Survivable Systems for Enterprise Computing, summarizes and expands on Volume 1 and provides the background for the further topics discussed in these Volumes 2 and 3. Volume 1 is not needed to understand the content or the conclusions of Volumes 2 and 3.
• Part 2, Building and Managing Active/Active Systems, demonstrates how to build the redundancy required by active/active systems and how to control their cost and performance.
• Part 3, Infrastructure Case Study, describes an example of commercially available infrastructure products known to the authors to be suitable for production active/active systems. It also provides a valuable performance analysis tool for these products.
• Part 4, Active/Active Systems at Work, summarizes many of the beneficial uses of active/active systems, provides several case studies of active/active systems in use today, and describes various related technologies and issues.
The authors’ intended audience for these Volumes 2 and 3 and their predecessor Volume 1 includes IT executives who feel that they must reduce the downtime of their systems, system architects and senior developers who must build these systems or modify existing systems to achieve the required availability, and operations staff who must run these systems and recover from system faults.
Part 1-Survivable Systems for Enterprise Computing
As the French biologist Louis Pasteur said, “Chance favors the prepared mind.”
To prepare ourselves to understand active/active systems, Volume 1 of this series laid the groundwork for active/active systems and supported the concepts with mathematical analyses. As said earlier, Part 1 of this Volume 2 summarizes and expands upon the contents of Volume 1.
In Chapter 1, Achieving Century Uptimes, we discuss what reliability is and how to quantify it. We then extend these concepts to extremely reliable system configurations called active/active systems.
Chapter 2, Reliability of Distributed Computing Systems, summarizes the mathematical foundations for active/active systems. Readers who are averse to mathematics will be pleased to know that the rest of this book uses minimal mathematics (except for the data replication engine performance model, which is relegated to Appendix 2). In fact, Chapter 2 can be skipped without missing the main points of the material in the following chapters.
An overview of active/active systems is given in Chapter 3, An Active/Active Primer. Here we discuss in some detail the structure and characteristics of the all-important data replication engine. We also look briefly at the various failure modes and how to recover from them as well as how to control costs of active/active architectures. These latter subjects are analyzed in much greater detail in Part 2 of this volume.
Part 2-Building and Managing Active/Active Systems
The whole rationale behind active/active systems is active redundancy, which masks failures by recovering from them so rapidly that no one notices. A similar but localized philosophy is used in HP’s NonStop servers, in which critical software processes are supported by backup processes in other processors resident in the same node and ready to take over in subsecond time. Also, all databases are redundant so that disk faults are masked.
There are a variety of application network topologies that have the characteristics of active/active systems. In Chapter 4, Active/Active Topologies, examples of many of these configurations are described.
In active/active systems, the inherent redundancy includes networks, databases, and processing nodes. Chapter 5, Redundant Reliable Networks, discusses ways in which to build the reliable networks needed for data replication to provide database synchronization between distributed database copies, for heartbeats to monitor the health of the processing nodes, and for users to be switched between nodes.
Chapter 6, Distributed Databases, describes how data replication engines can be used to keep in synchronism the multiple copies of a database in the application network. It discusses issues with replication such as data collisions and loss of data following a failure. Recovery from a failed database copy and access to a viable database copy following a node or network failure are explored.
The monitoring of a processing node’s health is discussed in Chapter 7, Node Failures. A node can be considered to have failed if the processing system comprising that node has failed, if its database has failed, or if it has lost connectivity to the rest of the application network due to network faults. Techniques for recovering from a node failure are discussed, including issues such as tugs-of-war and operating in split-brain mode.
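One common way to monitor node health is a heartbeat timeout, sketched below. This is a minimal illustration, not the method of Chapter 7: the class name `HeartbeatMonitor`, the timeout value, and the use of caller-supplied timestamps are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports the silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}                  # node name -> time of last heartbeat

    def record_heartbeat(self, node, now):
        """Note that a heartbeat from node arrived at time now (in seconds)."""
        self.last_seen[node] = now

    def suspected_failures(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

# Hypothetical two-node example with a 3-second timeout.
monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=4.0)   # node-b has gone silent
suspects = monitor.suspected_failures(now=5.0)
print(suspects)
```

Silence alone cannot tell the monitor whether the node itself has failed or only the network path to it, which is why the result is named `suspected_failures` and why split-brain operation needs separate handling.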
A highly beneficial use of controlled failures is shown in Chapter 8, Eliminating Planned Outages with Zero Downtime Migrations (ZDM). Planned downtime is one of the major causes of reduced application availability. In many installations, the planned downtime required to upgrade a system or to execute other maintenance functions far exceeds unplanned downtime due to faults. In active/active systems, a node can be taken out of service purposefully with little or no impact on the users. This capability can be used to advantage to upgrade hardware, operating system software, application software, database structures, and so on. This technique also allows the capacity of the application network to be easily expanded by adding new nodes online.
Controlling the cost of an active/active system is as important as it is with any other system. However, active/active systems present an additional level of complexity. There are many ways to configure an active/active system to manage the appropriate compromise between cost, availability, and performance. As we look at different potential configurations, how do we know which contenders are the least costly? What are the factors that enter into the total cost of ownership equation? These topics are discussed in some detail in Chapter 9, Total Cost of Ownership (TCO).
Part 3-Infrastructure Case Study
In the first two parts of “this book,” we describe why active/active systems can provide such high availability and how to build these systems. A set of tools is described that forms a basis for the implementation of active/active systems. In Part 3, we look at a set of commercially available tools that fill the needs of active/active systems, and a performance model that can be used to gauge the effectiveness of such tools. The tools described are necessarily ones with which the authors are quite familiar but are otherwise reflective of several such tools in the marketplace.⁴
The above chapters have covered two of the three legs of the active/active triangle: availability and cost. The third leg is performance. At the heart of most active/active systems is the data replication engine, and the performance of an active/active system is directly related to this engine. In Chapter 10, Performance of Active/Active Systems, we create a performance model for a generic data replication engine and show how its various performance measures are affected by a variety of replication engine architectures. The mathematics behind the performance model are left for Appendix 2, Replication Engine Performance Model, in Volume 3.
The primary facility that is required is an appropriate data replication engine. Chapter 11, Shadowbase, describes the Shadowbase data replication engine that has been used in many such implementations. Shadowbase is an example of a data replication engine with a very low replication latency (the time it takes for a change that is made to a source database to be propagated to the target database). Low replication latency is important to minimize data collisions and also to minimize data loss following a failure.
In order to take a node out of service and later return it to service, it is important to have a database copy facility that can copy the contents of an active database to a node about to be put into service (or even after it has been placed into service) while the source database is being updated. Chapter 12, SOLV, describes such a utility. Working with Shadowbase, SOLV can efficiently make a copy of an active database even while that database is being updated. In addition, future versions of SOLV will verify that two online databases are in synchronism and will resynchronize two active databases by repairing rows with differing content.⁵
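The idea of verifying two database copies and repairing rows that differ can be illustrated generically. The sketch below is not a description of how SOLV works: it models each database copy as an in-memory dictionary keyed by primary key, treats the source copy as authoritative, and ignores updates arriving during the comparison. The function names are hypothetical.

```python
_MISSING = object()   # sentinel for "row not present in this copy"

def differing_keys(source, target):
    """Return the primary keys whose rows differ between the two copies."""
    all_keys = set(source) | set(target)
    return sorted(
        key for key in all_keys
        if source.get(key, _MISSING) != target.get(key, _MISSING)
    )

def repair(source, target):
    """Make target match source row by row; return the keys that were repaired."""
    repaired = differing_keys(source, target)
    for key in repaired:
        if key in source:
            target[key] = source[key]     # insert a missing row or fix a stale one
        else:
            del target[key]               # remove a row the source no longer has
    return repaired

# Hypothetical account tables on two nodes.
source_db = {1: ("alice", 100), 2: ("bob", 250), 3: ("carol", 75)}
target_db = {1: ("alice", 100), 2: ("bob", 200), 4: ("dave", 10)}
repaired_keys = repair(source_db, target_db)
print(repaired_keys)
```

A production facility would also have to cope with rows that change while the comparison is running, which this sketch deliberately leaves out.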
In Chapter 13, ZDM with Shadowbase, we discuss the use of Shadowbase and SOLV to upgrade nodes in an active/active system without taking down the applications. With Zero Downtime Migrations, planned downtime can be completely eliminated since nodes in an application network can be upgraded without denying service to any user. Upgrades can include the hardware, operating system, applications, database, and networks, among others. In addition, ZDM can be used to add nodes dynamically into an application network to expand its capacity.
Part 4-Active/Active Systems at Work
After learning how to build an active/active system and having seen an example of a tool set needed to do this, Part 4 looks at some actual uses of this technology in place today. It also describes some related technologies and issues.
We start in Chapter 14, Benefits of Multiple Nodes in Practice, by summarizing the various active/active system benefits that we have discussed in the book. These benefits include achieving extreme availability and very fast response time in the face of unplanned outages and even disasters, the elimination of scheduled downtime, the efficient use of all available processing capacity, the simplification of recovery testing, and application capacity expansion, both symmetric and asymmetric.
In Chapter 15, Case Studies, we look at a variety of actual uses of active/active technology. Our examples come from a wide variety of industries, including financial institutions, telecommunications, travel, web services, brokerages, plant management, and even casinos.
Finally, in Chapter 16, Related Technologies and Drivers, we explore some technologies that are related to availability. They include Grid Computing, the NonStop Server Advanced Architecture, Split Mirrors, the Real-Time Enterprise, Bulletproof Storage, and Virtual Tape. We also discuss the large number of regulatory requirements that may affect your availability decisions.
Appendices
Throughout all three volumes of this trilogy, a variety of rules applicable to highly available systems have been stated. These rules are summarized in Appendix 1, Rules of Availability. These are annotated with volume and chapter so that their context can easily be found and studied.
Appendix 1 is contained in both Volumes 2 and 3. The remaining appendices will be found in Volume 3.
Appendix 2, Replication Engine Performance Model, sets forth the detailed mathematics behind the data replication engine performance model summarized in Chapter 10, Performance of Active/Active Systems. It also structures the resulting model into a set of tables suitable for creating an Excel spreadsheet for convenient performance calculations.
Appendix 3, Regulatory Requirements, summarizes the various regulatory issues that may have a bearing on the availability and operations of processing systems. These regulations are referenced in Chapter 16, Related Technologies and Drivers.
Additionally, we asked a noted consultant in the field of highly available systems, Dr. Werner Alexi, President of CS Software, Concepts, and Solutions, GmbH, to provide his comments and critique on active/active systems. His views are presented in Appendix 4, A Consultant’s Critique.
Authors’ Notes
You may have noted that this is a long book when both volumes are considered. As Winston Churchill said, the length of this document defends it well against the risk of its being read.
To mitigate this, we would like to point out that most detail is summarized in snippets that can easily be scanned, often as rules. For instance, you might want to just hunt for the rules and read the supporting text. This will give you a good feeling for where we are trying to take you.
In many places throughout this book, reference is made to HP NonStop systems. NonStop systems were originally developed by Tandem Computers to provide very high availability. Tandem Computers was subsequently acquired by Compaq Computers, and Compaq was then acquired by HP. HP has changed the name of the Tandem systems to HP NonStop servers. The authors have considerable experience with these systems. However, concepts and recommendations presented in this book are extendable to all types of commodity systems to make them redundant, including HP Superdome, Windows Server clusters, Unix clusters, Linux servers, and IBM Parallel Sysplex systems.
Each of the chapters in this book has been written to be self-standing at the risk of some repetition. Therefore, the reader is encouraged to pick and choose the topics of interest and to read only those chapters that apply. Adequate reference is made to other chapters to suggest further reading.
Acknowledgements
All three volumes of Breaking the Availability Barrier have benefited from reviews by many people. We gratefully acknowledge Mary Heck for her contributions to Appendix 3 and Dr. Werner Alexi for his critique, published in Appendix 4. We also thank Burt Liebowitz and John Carson, whose book Multiple Processing Systems for Real-Time Applications provided background for this work, and Jim Gray, whose many writings fueled the fire. They and others who have influenced this volume include:
Werner Alexi, CS Software
Wendy Bartlett, HP
Victor Berutti, Gravic
Richard Buckle, Insession
Robert Cline, SunGard Securities Processing
Dan Coughlin, First Data Corp.
Michael Crispyn, Fifth Third Bank
Terry Cumaranatunge, Motorola
Dick Davis, Gravic
Giampaolo Gandini, Telecom Italia Mobile
Jeff Glatstein, SunGard Securities Processing
Jim Gray, Microsoft
Jon Healy, SunGard Securities Processing
Mary Heck, Gravic
Tom Hoffmann, Motorola
Bill Holenstein, Gravic
Denise Holenstein, Gravic
Dan Hoppmann, A. G. Edwards
ITUG Connection staff
Clark Jablon, Akin Gump
Gene Jarema, Gravic
Jim Johnson, The Standish Group
Tim Keefauver, HP
Rob Klotz, First Data Corp.
Bill Knapp, Gravic
Bob Kossler, HP
Burt Liebowitz, Consultant
Bob Loftis, HP
Mike Nemerowski, SunGard Securities Processing
Carl Niehaus, HP
Kate Noer, SunGard Securities Processing
Gianfranco Pompado, Telecom Italia Mobile
Tullio Privitera, Telecom Italia Mobile
Janice Reeder, The Sombers Group
Steve Saltwick, HP
Harry Scott, Carr Scott Software
Scott Sitler, HP
Gary Strickler, Gravic
Bart van Leeuwen, Rabobank
Joanne Welk, Motorola
About the Authors
Paul J. Holenstein is Executive Vice President of Gravic, Inc., the maker of the Shadowbase line of data replication products. Shadowbase is a low latency, high-performance, real-time data replication engine that provides business continuity as well as heterogeneous data integration and synchronization.