Calculating Total System Availability: Hoda Rohani, Azad Kamali Roosta
Abstract— In a mission-critical application, “Availability” is the very first requirement to consider. Thus, understanding what it is, what affects it, and how to calculate it is vital. Although many methods have been proposed to calculate the availability of a device and/or a simple system, calculating the availability of a Business Application within a very complex organization is still not easily achievable. In this project, we propose a method to enable the IT management team of KLM to predict the availability of their Business Applications based on the configuration and the components used in their infrastructure.
It is obvious that the more details we include in the model, the more precise the result will be. But this might end up trading calculation time and model complexity for precision we do not really need. So we have to select carefully only those components whose effect we want to consider. One way of determining which to select is to rely on the opinion of experienced experts. Another good practice is to go through the historical data of the incident management system. If we are dealing with a system that has been in place for a rather long time and its incident records are accessible, the latter may be more useful. On the other hand, if we have a new system with most of its behavior yet to be known, the former is the better choice.
$$A = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} \tag{1}$$
$$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \tag{2}$$
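As a minimal illustration of formulas (1) and (2), the following Python sketch (the function names are ours, for illustration only) computes availability either from observed uptime/downtime or from MTBF/MTTR:

```python
def availability_from_uptime(uptime_h: float, downtime_h: float) -> float:
    """Formula (1): the fraction of time the component was actually up."""
    return uptime_h / (uptime_h + downtime_h)


def availability_from_mtbf(mtbf_h: float, mttr_h: float) -> float:
    """Formula (2): long-run availability from Mean Time Between Failures
    and Mean Time To Repair."""
    return mtbf_h / (mtbf_h + mttr_h)


# A component failing on average every 10,000 hours, taking 2 hours to repair:
print(availability_from_mtbf(10_000, 2))  # ~0.9998
```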
¹ Although environmental availability affects the whole system in general, because there might be different environments for different components (e.g. having more than one datacenter, which is the case for KLM), we have to consider this parameter at the component level. If this were not the case, we could simply have considered this parameter once, at the final level.
A final, virtual component called the “End User” depends on all major non-database applications. The End User’s availability is considered 1.0 (always available).
- Arbitrary Failure
Besides having different sources, a failure may be of different types, one being Arbitrary Failures. These failures have undetermined results, and a component/system facing them shows unpredictable behavior. For instance, in the presence of such failures, a simple calculator might give different results for an identical equation and parameters over time. Such behavior generally makes it hard to determine whether a component has failed. This becomes especially important when there are other replicas of a component.
As an example, assume that a user uses either of two instances of a redundant application through a dispatcher. The dispatcher’s role is to determine which of the instances is currently available to serve. Normally, this job is easily done via simple tests, making the dispatcher simple and relatively reliable. But in the presence of arbitrary failures in the applications, the dispatcher may mistakenly route the user to the wrong instance. In this case, the dispatcher has failed because of a failure in the application. Such dependent availability calculations are not permitted in our model.
In order to avoid such a situation, it is important to calculate those applications’ availability more accurately. Also, the dispatcher should become wiser, so that it can cover such failures. Finally, the availability of the dispatcher should be calculated independently of the applications.
In general, it is best to eliminate the sources of such failures as much as possible. In software, this type of failure mostly stems from faults in the design and/or coding phases, and can be reduced by in-depth reviewing and testing of software artifacts.
- Peaceful Degradation
Clusters of components might be built as a way of masking their failures, but also as a way of increasing their processing power. The former case is referred to as High Availability, while the latter is called Load Balancing.
It is obvious that as soon as a node in a load-balanced cluster fails, the availability of that cluster becomes dependent on the load on the remaining nodes. If the load becomes higher than what they are capable of delivering, there will be interruptions to the service (i.e. users may experience a slower response). This situation is referred to as Peaceful Degradation. In such a situation, the service might still be considered available, but with limited functionality.
An example of such a situation is a 4-engine aircraft flying at 30,000 feet that loses 3 of its engines. The plane is still able to fly, but cannot keep its altitude at 30,000 feet.
Although one may consider such a degraded state as still available, because subsequent failures of nodes in the cluster may result in the failure of the whole cluster (even though there might still be some available nodes in it), we will not mark the degraded state as available; we will mark it as failed.
This being said, throughout this project, when we talk about clusters, we mean clusters built solely for high availability and not for load balancing.
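To make this distinction concrete, here is a minimal sketch (our own simplification, assuming identical and independent nodes) contrasting a pure high-availability cluster with a load-balanced cluster in which the degraded state is counted as failed:

```python
from math import comb


def k_of_n_availability(a: float, n: int, k: int) -> float:
    """Probability that at least k of n identical, independent nodes are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# High availability: the cluster serves while at least 1 of 4 nodes is up.
print(k_of_n_availability(0.99, 4, 1))  # ~0.99999999
# Load balancing where 3 nodes are needed to carry the full load; under our
# "degraded counts as failed" convention, fewer than 3 nodes up means failure.
print(k_of_n_availability(0.99, 4, 3))  # ~0.9994
```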
- Failure rate
Failure Rate is the frequency with which the system fails.
For components without moving parts, assuming a constant failure rate is not far from reality, and since all components engaged in our model are solely electrical (except for the Hard Disk Drives), we have no moving parts. Disk drives, however, are devices without repair: any kind of failure leads to the failed disk being replaced, so there are no consecutive failures. Hence, for these devices, the constant failure rate assumption also holds (they fail only once during their operational life) [4].
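For completeness (this standard relation is implied by, though not spelled out in, [4]): under a constant failure rate $\lambda$, the time to failure is exponentially distributed, so

$$R(t) = e^{-\lambda t}, \qquad \text{MTBF} = \frac{1}{\lambda}$$

where $R(t)$ is the probability of surviving until time $t$ without failure.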
- Bathtub
Hardware failures are usually described by the “bathtub curve”. The first period is called infant mortality; during this period, the hardware failure rate is high. The next period is the normal life. Failures usually occur randomly during this time, but the point is that the failure rate is predictable, constant, and fairly low. The causes of failures may include undetectable defects, design flaws, higher random stress than expected, human factors, and environmental failures. The last period, called the wear-out period, is when the units are old and begin to fail at a high rate due to degradation of component characteristics [4, 5, 8].
Availability 9s Downtime
90% One 36.5 days/year
99% Two 3.65 days/year
99.9% Three 8.76 hours/year
99.99% Four 52 minutes/year
99.999% Five 5 minutes/year
99.9999% Six 31 seconds/year
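The downtime column follows directly from the availability fraction; the short sketch below (our own illustration) recomputes it:

```python
# Recompute the "nines" table: yearly downtime implied by an availability level.
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in range(1, 7):
    a = 1 - 10**-nines
    mins = (1 - a) * MIN_PER_YEAR
    if mins >= 24 * 60:
        label = f"{mins / (24 * 60):.2f} days/year"
    elif mins >= 60:
        label = f"{mins / 60:.2f} hours/year"
    elif mins >= 1:
        label = f"{mins:.1f} minutes/year"
    else:
        label = f"{mins * 60:.1f} seconds/year"
    print(f"{100 * a:.4f}% -> {label}")
```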
A. Serial Configuration
A typical example of this configuration is an “RDBMS server” and its “storage”. It is obvious that to be able to serve a database, both parts must function properly at any instant of time. Another example is the HDDs in a RAID 0 (striping) configuration, in which the failure of either disk results in losing all the data on the array.
In this configuration, as the system is considered working only as long as all the components are working, the total availability is the multiplication of each component’s “independent” availability [14]:
$$A_{total} = \prod_{i} A_i$$
As the availability of each component is a number lower than 1, the total system availability becomes lower than that of any of its components. For example, the total availability of the system in Fig. 2 would be the product of its components’ availabilities.
This actually makes sense as it is said that “A chain is only as strong as its weakest link”.
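A direct implementation of the serial rule (a sketch of ours, not the project code):

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Serial configuration: the system is up only while every component
    is up, so the independent availabilities multiply."""
    return prod(availabilities)


# An RDBMS server (0.999) in series with its storage (0.9995):
print(serial_availability([0.999, 0.9995]))  # 0.9985005 -- below both parts
```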
B. Parallel Configuration
Sometimes, system designers put identical (or even similar) components together in such a way that as long as one of the components is available, the system survives. Components in this configuration are said to be made redundant (Fig. 4).
Figure 4- A system with two parallel components
Unlike the serial configuration, components in a parallel configuration are either identical components, or two (or more) components with the same function. The main idea here is that the component failures are not arbitrary and the components are fail-safe: a component is either operating correctly, or it stops working. If this is not the case, there should be a third component acting as an output evaluator, which decides which of the components placed in parallel should be used at a time (failing that, a “voting” mechanism is used, which requires at least 3 components in parallel).
The system in this configuration fails only if all of its components fail [11, 14]:
$$A_{total} = 1 - \prod_{i} (1 - A_i)$$
Since the unavailability of this system is smaller than the unavailability of each of its components, the total availability is even higher than that of the most available component.
The key to these calculations is that each component’s availability must be calculated independently. Correspondingly, failure sources that are common to the components of a parallel configuration (rather than unique to each) should be excluded from the components’ availabilities and calculated separately.
Examples of this configuration are an aircraft with 4 engines that can still fly with only 1 engine operating, or disk mirroring as used in a RAID 1 configuration.
It is also worth mentioning that we will not consider peaceful degradation as a complete parallel configuration.
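The parallel rule can be sketched the same way (again an illustration of ours, assuming independent, fail-safe components):

```python
from math import prod


def parallel_availability(availabilities: list[float]) -> float:
    """Parallel (redundant) configuration: the system fails only if every
    component fails, so the independent unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Two mirrored disks (RAID 1), each 0.99 available:
print(parallel_availability([0.99, 0.99]))  # 0.9999 -- above either disk
```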
C. Hybrid Configuration
A hybrid configuration occurs when the system consists of multiple components, some of which are serial and others parallel. In order to calculate the availability of such a system, one may repeatedly take any consecutive serial/parallel components and replace them with a single block carrying the combined availability, until the calculation is complete.
This method is known as Reliability Block Diagram (RBD) modeling [15] and works best as long as we can see each component as either serial or parallel (and not both) in a system. Fig. 5 is an example of a system for which deriving the RBD is not so easy.
$$A_{total} = P(\text{system is available}) = 1 - \sum_{s \in FS} P(s)$$
where $FS$ is the set of failure scenarios and $P(s)$ is the probability of scenario $s$ occurring.
Finally, we introduced a criticality function in order to evaluate the role of each component in the total (un)availability of the system and find the component with the most influence on that (un)availability. The criticality of a component is defined as the total number of that component’s appearances as a faulty component, multiplied by the component’s unavailability.
$$Crit(C) = (1 - A(C)) \times \sum_{s \in FS,\; C \in s} 1$$
Eventually, the higher the criticality of a component, the more effect it has on the system’s unavailability.
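A sketch of this criticality function (the data structures are illustrative assumptions, not the project's actual ones): each failure scenario is modelled as the set of components that failed in it.

```python
def criticality(component: str,
                availability: dict[str, float],
                failure_scenarios: list[set[str]]) -> float:
    """Crit(C) = (1 - A(C)) * number of failure scenarios containing C."""
    appearances = sum(1 for s in failure_scenarios if component in s)
    return (1 - availability[component]) * appearances


avail = {"Switch_1": 0.9995, "hst11": 0.9999}
scenarios = [{"Switch_1"}, {"hst11"}, {"Switch_1", "hst11"}]
most_critical = max(avail, key=lambda c: criticality(c, avail, scenarios))
print(most_critical)  # Switch_1 -- highest unavailability x appearance count
```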
“appT” and “appEDB” are two database services supporting these applications. Table 3 represents these
relations (dependencies).
apps.csv: columns “Host Name, Application Name, Clone Count” (e.g. hst01,appCSA,1); contains hosts, their applications, and clone counts.
netnods.csv: columns “Network Node, Direct Neighbor 1, Direct Neighbor 2, …, Direct Neighbor n” (e.g. Switch_1,Switch_3); list of network devices and their direct neighbors.
hostnicsw.csv: columns “Host Name, Ethernet Card, Switch Name” (e.g. hst08,eth2,Switch_1); network connectivity of components.
*) The availability file contains the components’ availability parameters. It is assumed that each single component’s availability has been calculated beforehand and is presented in this file. However, as for many of the components we are dealing with the MTBF and MTTR are at hand, if the availability is not provided directly for a component, the program will calculate it on the fly, based on the known formula (2), $A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$. If these parameters are not presented either, the availability of the component is assumed to be 1.0. Fig. 8 shows the component availability flowchart.
[Fig. 8 - Component availability flowchart (makeit.py): for each component, use the given availability if present; otherwise, if MTBF and MTTR are present, compute $A = \text{MTBF}/(\text{MTBF} + \text{MTTR})$; otherwise, set A = 1.0]
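The flowchart's decision logic amounts to the following (a hypothetical rendering; parameter names are ours, not the actual acalculator.py code):

```python
def component_availability(avail: float | None,
                           mtbf: float | None,
                           mttr: float | None) -> float:
    """Resolve a component's availability as in Fig. 8: use the given value
    if present; otherwise derive it from MTBF/MTTR; otherwise assume the
    component is always available."""
    if avail is not None:
        return avail
    if mtbf is not None and mttr is not None:
        return mtbf / (mtbf + mttr)
    return 1.0
```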
Availability is entered as a number between 0 and 1, and each of these parameters can be left empty (except for the component name). Note that if the application is served by a third party with whom we have a contract (and probably an SLA), like Amadeus, the availability mentioned in the contract can be a good choice for that application’s availability [17].
It is also worth mentioning that, as these availability values are “independent” of each other, the availability of application replicas will be the same.
For this experiment, we generated 54 random numbers in the interval (0.9999, 1.0) as availability parameters, with 0.9999 being the smallest and 0.999997 the largest number, and an average of 0.99995.
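Such parameters can be reproduced with something like the following (the seed and exact call are our own choices, not the project's):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only
availabilities = [random.uniform(0.9999, 1.0) for _ in range(54)]
print(min(availabilities), max(availabilities), sum(availabilities) / 54)
```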
C. Processing Files
The program is organized into three different executable files, of which one calculates the components’ availability (acalculator.py), one prepares the input files (makeit.py), and the other (run.py) runs the main program over the data structures. There are some other auxiliary files that facilitate the execution. All these files and their relations can be seen in Fig. 9.
[Fig. 9 - Total Availability Calculation Process: the input files (code template template.py, dependency list dep.csv, network node list netnods.csv, host-NIC-switch relations hostnicsw.csv, redundancy list clusters.csv, app-host relations hostapp.csv, app/replica counts apps.csv, and availability parameters availability.csv) feed the intermediate preparation step; component availability parameters are supplied by acalculator.py; the main code (exe.py) then runs the calculation over these structures and produces the output.]
D. Assumptions
Apart from the general assumptions for our model, we have made the following assumptions
regarding this experiment:
- Each single node’s independent availability is either pre-calculated, or its MTBF and MTTR parameters are present. If neither is present, a random number between 0.9999 and 0.999997 is assigned as the availability.
- Whenever there is a physical network path between two network nodes, it implies a network connection between them. In other words, no network segmentation exists in the upper layers.
- Physical connectors (like cables) are considered as always available.
- Network devices are seen as a single component even if they are modular.
- There is no virtualization involved.
- There is only one web server on each OS.
- Hosts include: Web Server, Operating System and Host hardware (except for the NIC).
- All network cards of a server are able to take over for the other cards.
- In the network layer, Redundancy is made by using separate paths. There is no Stacked
Switch.
- Environmental and human-related factors are ruled out for simplicity.
E. Execution
We ran the program 6 times, starting from a maximum of 1 failure at a time, up to a maximum of 6 simultaneous failures. The results were as follows:
1) Maximum of 1 simultaneous failure
In this case, 5 different failure scenarios occurred, which were caused by the failure of:
- hst11
- appEDB
- Switch_3
- Switch_1
- End User
Note that, as the “End User’s” availability is 1.0, it will not affect the total availability; but as it is considered a component, it is shown in the result (this is true for all components with an availability of 1.0).
The total availability at the “End User” was calculated to be 99.9781477%.
2) Maximum of 2 simultaneous failures
When considering up to 2 concurrent failures, the result obviously contains the single failures, as well as all pairs of component failures that include one of the single-failure components.
There are 55 components: 5 scenarios for the single failures alone, 5 × 50 = 250 for the pairs combining one of those 5 with one of the remaining 50 components, and $\binom{5}{2} = 10$ for the pairs among those 5, which makes 5 + 250 + 10 = 265.
Hence, we expect at least 265 scenarios in this case. The final count is 280 failure scenarios. The 15 extra scenarios are caused by the failures of the components shown in Table 5:
Table 5 - Failure scenarios in the presence of a maximum of 2 failures
2 'hst09' 'hst10'
4 'appT.REP1' 'hst10'
5 'hst02' 'hst01->eth2'
7 'hst02' 'hst01'
9 'hst01->eth2' 'hst02->eth2'
10 'appCSA.REP2' 'appCSA.REP1'
11 'appCSA.REP2' 'hst01'
12 'appCSA.REP1' 'hst02->eth2'
13 'hst02->eth2' 'hst01'
14 'Switch_2' 'hst11->eth1'
15 'hst11->eth1' 'hst11->eth2'
Note that “A.REPx” shows the xth replica of the application “A”.
3) Maximum simultaneous failures > 2
We continued the simulations up to a maximum of 7 simultaneous failures, which ended up quite similar to the previous results.
Table 6-Test Case Result
It is worth mentioning that the more concurrent failures are taken into account, the more precise the calculated availability becomes; but there is a point where the precision we gain is beyond what we really need. Moreover (considering the high availability of the single components used in enterprise networks), the probability of multiple failures happening together is negligible.
So we can say that the current test case has an availability of 99.97809%, even with the maximum number of concurrent failures equal to the total number of components.
According to our results, the most critical component turned out to be the “Switch_1” switch.
X. FUTURE WORKS
The method we proposed here can be improved especially in two aspects.
A. Optimizing the algorithm
The program was written as a proof of concept, with around 600 lines of Python. Neither the choice of programming tool nor the way the program is written is particularly efficient.
Moreover, the algorithm could become smarter in detecting failure scenarios when examining the redundancies, and also in merging some components and summarizing their availability.
Passing the data via text files is not really a good idea in production. It would be good if the program could read its data directly from the AITIH database.
B. More criticality options
The criticality function defined here is not the only way to find the weakest link of our chain. One can define other (even more accurate) functions to highlight the effect of particular components on the whole system.
Functions can also be defined to find any over-qualified components (if any): those that, despite having a high availability, do not affect the whole system as much as expected. Such a function can be used in the decision-making process to save costs.
ACKNOWLEDGMENT
We would like to take this chance to thank the whole KLM AITIH department, and especially Betty and Leon Gommans, Peter Huisman, Maarten Hogendoorn, and David van Leerdam, who helped a lot during different parts of this project, through both their supportive behavior and their cooperative actions.
REFERENCES
1. ISO/IEC, Information technology - Security techniques - Management of information and communications technology
security, in Part 1: Concepts and models for information and communications technology security management.
2004.
2. Wikipedia. Reliability engineering. [cited 2014 January]; Available from:
http://en.wikipedia.org/wiki/Reliability_engineering.
3. Fault-tolerant computer system design. ed. K.P. Dhiraj. 1996, Prentice-Hall, Inc. 550.
4. Vargas, E. and S. BluePrints, High availability fundamentals. Sun Blueprints series, 2000.
5. Torell, W. and V. Avelar, Mean time between failure: Explanation and standards. White Paper, 2004. 78.
6. Cisco. Designing and Managing High Availability IP Networks. [cited 2014 January]; Available from:
https://www.cisco.com/en/US/prod/collateral/iosswrel/ps6537/ps6550/prod_presentation0900aecd8031069b.pdf.
7. Wikipedia. Availability. [cited 2014 January]; Available from: http://en.wikipedia.org/wiki/Availability.
8. eventhelix. [cited 2014 January]; Available from:
http://www.eventhelix.com/realtimemantra/faulthandling/reliability_availability_basics.htm#.UvfKjJA1iM9.
9. Weygant, P.S., Clusters for High Availability: A Primer of HP Solutions. 2001: Prentice Hall Professional.
10. Colville, R.J. and G. Spafford, Configuration Management for Virtual and Cloud Infrastructures. Gartner, http://www.rbiassets.com/getfile.ashx/42112626510, 2010.
11. Oggerino, C., High Availability Network Fundamentals: A Practical Guide to Predicting Network Availability. 2001:
Cisco Press. 256.
12. Vallath, M., Oracle real application clusters. 2004: Access Online via Elsevier.
13. Weygant, P.S., Primer on Clusters for High Availability. Technical Paper at Hewlett-Packard Labs, CA, 2000.
14. Xin, J., et al., Network Service Reliability Analysis Model. CHEMICAL ENGINEERING, 2013. 33.
15. Shooman, M.L., Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. 2003: Wiley.
16. Vesely, W.E., U.S.N.R.C.D.o. Systems, and R. Research, Fault tree handbook. 1981: Systems and Reliability Research,
Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission.
17. Fishman, D.M., Application Availability: An Approach to Measurement. Sun Microsystems. Recuperado el, 2000. 8.
XI. APPENDIX 1.
The diagram in this attachment shows all components in our proof-of-concept experiment and their relations.
There are four different categories of nodes: Applications, Host Hardware, Network Interface Cards, and Networking Devices (Switches). The lines between nodes represent a “relation”, which is interpreted based on the type of the components in the relation, as shown in Table 3.