Modul 4. Availability Concepts

This document discusses availability concepts and how to calculate and measure availability of systems. It defines availability as a percentage of uptime over a given period, with 99.9% being common for systems. Downtime is also defined in terms of hours per year. Availability is calculated using Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). While 100% uptime is impossible, high availability can be achieved through redundancy, failover, and reducing sources of unavailability like human errors, software bugs, and complexity.


Availability Concepts

Introduction
• Everyone expects their infrastructure to be available all the time.
In this age of global, always-on, always-connected systems,
disturbances in availability are noticed immediately. A 100%
guaranteed availability of an infrastructure, however, is
impossible. No matter how much effort is spent on creating highly
available infrastructures, there is always a chance of downtime.
It’s just a fact of life.
• According to a survey from the 2014 Uptime Symposium, 46% of
companies using their own datacenter had at least one “business-
impacting” datacenter outage over 12 months.
• This chapter discusses the concepts and technologies used to
create highly available systems. It includes calculating availability,
managing human factors, the reliability of infrastructure
components, how to design for resilience, and – if everything else
fails – business continuity management and disaster recovery.
Calculating availability
• In general, availability can neither be calculated, nor guaranteed
upfront. It can only be reported on afterwards, when a system has
run for some years. This makes designing for high availability a
complicated task. Fortunately, over the years, much knowledge
and experience has been gained on how to design high availability
systems, using design patterns like failover, redundancy, structured
programming, avoiding Single Points of Failure (SPOFs), and
implementing sound systems management. But first, let’s discuss
how availability is expressed in numbers.
Availability Percentages and
Intervals
• The availability of a system is usually expressed as a percentage of
uptime in a given time period (usually one year or one month).
The following table shows the maximum downtime for a
particular percentage of availability
Availability %            Downtime per year   Downtime per month   Downtime per week
99.8%                     17.5 hours          86.2 minutes         20.2 minutes
99.9% (“three nines”)     8.8 hours           43.2 minutes         10.1 minutes
99.99% (“four nines”)     52.6 minutes        4.3 minutes          1.0 minutes
99.999% (“five nines”)    5.3 minutes         25.9 seconds         6.1 seconds
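• As an illustration (not part of the original table), the downtime figures can be
approximated with a small Python sketch; the 30-day month is an assumption:

  # Sketch: derive maximum downtime from an availability percentage.
  MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600 minutes
  MINUTES_PER_MONTH = 30 * 24 * 60        # assuming a 30-day month
  MINUTES_PER_WEEK = 7 * 24 * 60

  def max_downtime(availability_pct: float, period_minutes: int) -> float:
      """Maximum downtime (in minutes) allowed in a period for a given availability %."""
      return (1 - availability_pct / 100) * period_minutes

  for pct in (99.8, 99.9, 99.99, 99.999):
      print(f"{pct}%: {max_downtime(pct, MINUTES_PER_YEAR) / 60:.1f} h/year, "
            f"{max_downtime(pct, MINUTES_PER_MONTH):.1f} min/month, "
            f"{max_downtime(pct, MINUTES_PER_WEEK):.1f} min/week")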


Availability levels
• Typical requirements used in service level agreements today are
99.8% or 99.9% availability per month for a full IT system. To
meet this requirement, the availability of the underlying
infrastructure must be much higher, typically in the range of
99.99% or higher.
• 99.999% uptime is also known as carrier grade availability. This
level of availability originates from telecommunication system
components (not full systems!) that need an extremely high
availability. Higher availability levels for a complete system are
very uncommon, as they are almost impossible to reach.
Availability levels
• As a comparison: the electricity supply in the Netherlands is very
reliable. Over the last years, the average downtime per household
was 23 minutes per year. This is equivalent to an availability of
99.9956%. Some other European countries:
• Germany: 21 minutes = 99.9960%
• United Kingdom: 75 minutes = 99.9857%
• France: 71 minutes = 99.9865%
• Poland: 260 minutes = 99.9506%
• The average downtime in the USA is 127 minutes, leading to an
availability of 99.9759%.
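• The percentages above follow directly from the downtime figures; a quick sketch
(assuming a 365-day year) shows the conversion:

  # Sketch: convert average downtime in minutes per year to an availability percentage.
  MINUTES_PER_YEAR = 365 * 24 * 60

  def availability_pct(downtime_minutes_per_year: float) -> float:
      return 100 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)

  for country, minutes in [("Netherlands", 23), ("Germany", 21), ("United Kingdom", 75),
                           ("France", 71), ("Poland", 260), ("USA", 127)]:
      print(f"{country}: {availability_pct(minutes):.4f}%")   # e.g. Netherlands: 99.9956%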
• While 99.9% uptime means 525 minutes of downtime per year, this
downtime should not occur in one event, nor should one-minute downtimes
occur 525 times a year. It is therefore good practice to agree on a maximum
frequency of unavailability.

Unavailability (minutes)   Number of events (per year)
0 – 5                      <= 35
5 – 10                     <= 10
10 – 20                    <= 5
20 – 30                    <= 2
> 30                       <= 1

• In this example, it means that the system can be unavailable for 25 minutes
no more than twice a year. It is also allowed, however, to be unavailable for 3
minutes three times each month. For each availability requirement, a
frequency table like this should be provided in addition to the availability
percentage.
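• A minimal sketch of how such a frequency agreement could be checked against a
year of recorded outages (the bucket boundaries and limits are taken from the
table above; everything else is an assumption):

  # Sketch: check a list of outage durations (minutes) against the agreed frequency table.
  FREQUENCY_LIMITS = [(5, 35), (10, 10), (20, 5), (30, 2), (float("inf"), 1)]

  def within_agreement(outages_minutes: list[float]) -> bool:
      """True if the outages of one year stay within the agreed maximum frequencies."""
      counts = [0] * len(FREQUENCY_LIMITS)
      for duration in outages_minutes:
          for i, (upper_bound, _) in enumerate(FREQUENCY_LIMITS):
              if duration <= upper_bound:
                  counts[i] += 1
                  break
      return all(count <= limit for count, (_, limit) in zip(counts, FREQUENCY_LIMITS))

  print(within_agreement([25, 25]))        # True: two 25-minute outages per year are allowed
  print(within_agreement([25, 25, 25]))    # False: a third 25-minute outage exceeds the limit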
MTBF and MTTR
• The factors involved in calculating availability are Mean Time
Between Failures (MTBF), which is the average time that passes
between failures, and Mean Time To Repair (MTTR), which is the
time it takes to recover from a failure.
• The term “mean” indicates that the numbers expressed by MTBF and
MTTR are statistically calculated average values.
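• The usual relation between these numbers (not spelled out on this slide, but
standard practice) is availability = MTBF / (MTBF + MTTR). A small sketch:

  # Sketch: steady-state availability from MTBF and MTTR (same time unit for both).
  def availability(mtbf_hours: float, mttr_hours: float) -> float:
      """Fraction of time a component is up: uptime / (uptime + downtime)."""
      return mtbf_hours / (mtbf_hours + mttr_hours)

  # Hypothetical example: MTBF of 750,000 hours, MTTR of 8 hours.
  print(f"{availability(750_000, 8):.6%}")   # 99.998933%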
Mean Time Between Failures
(MTBF)
• The MTBF is expressed in hours (how many hours the
component or service will work without failure).

Component                  MTBF (hours)
Hard disk                  750,000
Power Supply               100,000
Fan                        100,000
Ethernet Network Switch    350,000
RAM                        1,000,000
Mean Time Between Failures
(MTBF)
• It is important to understand how these numbers are calculated. No
manufacturer can test whether a hard disk will continue to work without failing for
750,000 hours (= 85 years). Instead, manufacturers run tests on large batches
of components. In the case of, for instance, hard disks, 1,000 disks could have been
tested for 3 months. If in that period of time five disks fail, the MTBF is
calculated as follows:
• The test time is 3 months. One year has four of those periods. So if the test had
lasted one year, 4 * 5 = 20 disks would have failed.
• In one year, the disks would have run:
• 1,000 disks * 365 * 24 = 8,760,000 running hours.
• This means that the MTBF = 8,760,000 hours / 20 failed drives = 438,000
hours/failure.
• So, MTBF actually only says something about the chance of failure in the first
months of use. It is an extrapolated value for the expected failure behavior of a disk.
• It would be better to specify the annual failure rate instead (in this example, 2%
of all disks will fail in the first year), but that is not very good advertising.
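• The calculation above can be reproduced with a few lines of Python (the numbers
are the ones from the example):

  # Sketch: MTBF from a batch test of 1,000 disks over 3 months with 5 failures.
  disks_tested = 1000
  test_period_months = 3
  failures_in_test = 5

  # Extrapolate to a full year: 12 / 3 = 4 test periods per year.
  failures_per_year = failures_in_test * (12 // test_period_months)   # 20 failed drives
  running_hours_per_year = disks_tested * 365 * 24                    # 8,760,000 hours

  mtbf = running_hours_per_year / failures_per_year
  annual_failure_rate = failures_per_year / disks_tested

  print(f"MTBF = {mtbf:,.0f} hours/failure")                 # 438,000 hours/failure
  print(f"Annual failure rate = {annual_failure_rate:.0%}")  # 2%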
Mean Time To Repair (MTTR)
• When a component breaks, it needs to be repaired. Usually the repair time
(expressed as a Mean Time To Repair – MTTR) is kept low by having a service
contract with the supplier of the component. Sometimes spare parts are kept
on-site to lower the MTTR (making MTTR more like Mean Time To Replace).
• Typically, a faulty component is not repaired immediately.
• Some examples of what might be needed to complete a repair are:
• Notification of the fault (time before seeing an alarm message)
• Processing the alarm
• Finding the root cause of the error
• Looking up repair information
• Getting spare components from storage
• Having a technician come to the datacenter with the spare component
• Physically repairing the fault
• Restarting and testing the component
• Instead of these manual actions, the best way to keep the MTTR low is to
introduce automated redundancy and failover.
Sources of unavailability
• Human Errors
• Software Bugs
• Planned Maintenance
• Physical Defects
• Environmental issues
• Complexity of the Infrastructure
Human Errors
• Usually only 20% of the failures leading to unavailability are technology
failures. Through 2015, 80% of outages impacting mission-critical services
will be caused by people and process issues and more than 50% of those
outages will be caused by change/configuration/release integration and
hand-off issues.
• Of course, it helps to have highly qualified and trained staff, with a healthy
sense of responsibility.
• Errors are human, however, and there is no cure for it. End users can
introduce downtime by misuse of the system.
• When a user for instance starts the generation of ten very large reports at the same
time, the performance of the system could suffer to such a degree that the system
becomes unavailable to other users. Also, when a user forgets a password (and
maybe tries an incorrect password more than five times) he or she is locked out
and the system is unavailable for that user.
• If that user has a very responsible job, like approving some steps in a business
process, being locked out could mean that a business process is unavailable to
other users as well
Human Errors
• Most unavailability issues, however, are the result of actions from systems
managers. Some typical actions (or the lack thereof) are:
• Performing a test in the production environment (not recommended at all)
• Switching off the wrong component (not the defective server that needs repair, but
the one still operating)
• Swapping a good working disk in a RAID set instead of the defective one
• Restoring the wrong backup tape to production.
• Accidentally removing files (mail folders, configuration files) or database entries
• Making incorrect changes to configurations (for instance, the routing table of a
network router, or a change in the Windows registry).
• Incorrect labeling of cables, later leading to errors when changes are made to the
cabling.
• Performing maintenance on an incorrect virtual machine (the one in production
instead of the one in the test environment)
• Making a typo in a system command environment
• Insufficient testing, for instance, the fallback procedure to move operations from
the primary datacenter to the secondary was never tested, and failed when it was
really needed
Human Errors
• Many of these mistakes can be avoided by using proper systems management
procedures, like having a standard template for creating new servers,
using formal deployment strategies with the appropriate tools, using
administrative accounts only when absolutely needed, etc.
• As an example, when in some UNIX environments users log in with an
administrative account (root), they automatically get the following message:
We assume you have received the usual lecture from the local
systems manager. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type
#3) With great power comes great responsibility
• Simple measures like this make people aware of the impact of their actions,
leading to fewer mistakes.
• Of course, there are also bad people out there. Hackers can create downtime,
for instance by executing a Denial of Service attack.
Software Bugs
• After human errors, software bugs are the number two reason for
unavailability. Because of the complexity of most software, it is
nearly impossible (and very costly) to create bug-free software.
• Software bugs in applications or system drivers can stop an entire
system (like the infamous Blue Screen of Death on Windows
systems), or create downtime in other ways.
• Since operating systems are software too, operating systems
contain bugs that can lead to corrupted file systems, network
failures, or other sources of unavailability.
Planned Maintenance
• Planned maintenance is sometimes needed to perform systems
management tasks like upgrading hardware or software,
implementing software changes, migrating data, or the creation of
backups.
• Since most systems today must be available 24/7, planned
maintenance should only be performed on parts of the
infrastructure while other parts keep serving clients.
• When the infrastructure has no single point of failure (SPOF),
downtime of a single component does not lead to downtime of the
entire system.
• This way it is possible to, for instance, upgrade an operating
system to the latest software, while the infrastructure as a whole
remains available.
Planned Maintenance
• During planned maintenance, however, the system is more
vulnerable to downtime than under normal circumstances.
• When the systems manager makes a mistake during planned
maintenance, the risk of downtime is higher than normal.
• When planned maintenance is performed on a component, a SPOF
could be introduced: the component that is not under
maintenance.
• When that component breaks during the planned maintenance,
the whole system can become unavailable.
• Another example is the upgrade of systems in a high availability
cluster. When one component is upgraded and the other is not
upgraded yet, the cluster may temporarily not provide high
availability. During that period of time the system is vulnerable to
downtime.
Physical Defects
• Of course, everything breaks down eventually, but mechanical
parts are most likely to break first. Some examples of mechanical
parts are:
• Fans for cooling the equipment. Fans usually have a limited lifespan.
They usually break because of dust in the bearings, causing the motor to
work harder until it breaks.
• Disk drives. Disk drives contain two main moving parts: the motor
spinning the platters and the linear motor that moves the read/write
heads.
• Tapes and tape drives. Tapes are very vulnerable to defects as the tape is
spun on and off the reels all the time. Tape drives and especially tape
robots contain very sensitive pieces of mechanics that can break easily.
• Apart from mechanical failures because of normal usage, parts
also break because of external factors like ambient temperature,
moisture, vibrations, and aging.
Physical Defects
• Most parts favor stable temperatures. When the temperature
fluctuates, parts expand and shrink, leading to for instance contact
problems in connectors or solder joints. This effect also occurs
when parts are exposed to vibrations and when parts are switched
on and off frequently.
• Some parts also age over time. Not only mechanical parts wear
out, but also some electronic parts like large capacitors, which
contain fluids, and transformers, which vibrate due to the AC
current creating fluctuating magnetic fields. Solder joints also age
over time, just like on/off switches that are used frequently.
• When used heavily or over an extended period of time, a power
supply will wear out; it slowly loses some of its initial power
capacity. It is recommended to calculate a surplus of at least 25%
of continuously available power for 24/7 running equipment
Physical Defects
• Network cables, especially when they are moved around a lot,
tend to fail over time. Another type of cable that is highly sensitive
to mechanical stress is fiber optics cable.
• Some components, like PC system boards and external disk
caches, are equipped with batteries. Batteries, including
rechargeable batteries, are known to fail often. Other components
that can fail are the oscillators used on system boards. These
oscillators are also in effect mechanical parts and prone to failure.
• In most cases the availability of a component follows a so-called
bathtub curve.
Physical Defects

[Figure: the bathtub curve of component failure rate over time]
Physical Defects
• A component failure is most likely when the component is new.
In the first month of use the chance of a component failure is
relatively high. Sometimes a component doesn’t even work at all
when unpacked for the first time. This is called a DOA component
– Dead On Arrival.
• When a component still works after the first month, it is likely
that it will continue working without failure until the end of its
technical life cycle. This is the other end of the bathtub – the
chance of failure rises suddenly at the end of the life cycle of a
component.
Environmental Issues
• Environmental issues can cause downtime as well. Issues with
power and cooling, and external factors like fire, earthquakes and
flooding can cause entire datacenters to fail.
• Power can fail for a short or long period of time, and can have
voltage drops or spikes. Power outages can cause downtime, and
power spikes can cause power supplies to fail. The effect of these
power issues can be eliminated by using an Uninterruptable
Power Supply (UPS).
• Failure of the air conditioning system can lead to high
temperatures in the datacenter. When the temperature rises too
much, systems must (or will automatically) be shut down to avoid
damage.
Complexity of the infrastructure
• Adding more components to an overall system design can undermine high
availability, even if the extra components are implemented to achieve high
availability.
• Complex systems inherently have more potential points of failure and are
more difficult to implement correctly. Also, a complex system is harder to
manage: more knowledge is needed to maintain the system, and errors are
made more easily.
• Sometimes it is better to just have an extra spare system in the closet than to
use complex redundant systems. When a workstation fails, most people can
work on another machine, and the defective machine can be swapped in 15
minutes. This is probably a better choice than implementing high availability
measures in the workstation, like dual network cards, dual connections to
dual network switches that can fail over, failover drivers for the network
card in the workstation, dual power supplies in the workstation fed via two
separate cables and power outlets on two fuse boxes, etc.
• The same goes for high availability measures on other levels.
Complexity of the infrastructure
I once had a very unstable set of redundant ATM (Asynchronous
Transfer Mode) network switches in the core of a network. I could not
get the system to fail over reliably, leading to multiple instances of
downtime of a few minutes each.

When I removed the redundancy in the network, the network never
failed again for at least a year. The leftover switches were loaded with
a working configuration and put in the closet.

If the core switch failed, we could swap it in 10 minutes (which, given
that this would not happen more than once a year and probably even
less often, leads to an availability of at least 99.995%).
Availability Patterns
• A single point of failure (SPOF) is a component in the
infrastructure that, if it fails, causes downtime to the entire
system. SPOFs should be avoided in IT infrastructures, as they
pose a large risk to the availability of a system.
• For example, in most storage systems, the failure of one disk does
not affect the availability of the storage system. Technologies like
RAID (Redundant Arrays of Independent Disks) can be used to
handle the failure of a single disk, eliminating disks as a SPOF.
Server clusters, double network connections, and dual datacenters
– they are all meant to eliminate SPOFs. The trick is to find SPOFs
that are not that obvious.
Availability Patterns
• While it sounds easy to eliminate single points of failure, in
practice it is not always feasible or cost effective. Take for instance
the internet connection your organization uses to send e-mail.
• Do you have multiple internet connections from your e-mail server?
• Are these connections running over separate cables in the building?
• What about outside of the building?
• Do you use multiple internet providers?
• Do they share their backbones?
While users should not notice a failure, the systems managers should! I have seen in practice
that a failing disk did not stop the system, because of RAID technology, but the systems
managers – lacking proper monitoring tools – did not notice it.

But when the second disk failed, both the users and the system managers noticed the
downtime!
Availability Patterns
• While eliminating SPOFs is very important, it is good to realize
that there is always something shared in an infrastructure (like
the building, the electricity provider, the metropolitan area, or the
country). We just need to know what is shared and if the risk of
sharing is acceptable.
• To eliminate SPOFs, a combination of redundancy, failover, and
fallback can be used.
Redundancy
• Redundancy is the duplication of critical components in a single
system, to avoid a SPOF.
• In IT infrastructure components, redundancy is usually
implemented in power supplies (a single component has two power
supplies; if one fails, the other takes over), network interfaces, and
SAN HBAs (Host Bus Adapters) for connecting storage.
Failover
• Failover is the (semi) automatic switch-over to a standby system
(component), either in the same or in another datacenter, upon
the failure or abnormal termination of the previously active
system (component).
• Examples are Windows Server failover clustering, VMware High
Availability and (on the database level) Oracle Real Application
Clusters (RAC).
• Failover is discussed in the chapters on the corresponding building
blocks.
Fallback
• Fallback is the manual switchover to an identical standby computer
system in a different location, typically used for disaster recovery.
• There are three basic forms of fallback solutions:
• Hot site
• Warm site
• Cold site
Fallback – Hot Site
• A hot site is a fully configured fallback datacenter, fully equipped
with power and cooling.
• The applications are installed on the servers, and data is kept up-to-date
to fully mirror the production system.
• Staff and operators should be able to walk in and begin full
operation in a very short time (typically one or two hours).
• This type of site requires constant maintenance of the hardware,
software, data, and applications to be sure the site accurately
mirrors the state of the production site at all times.
Fallback – Warm Site
• A warm site could best be described as a mix between a hot site
and a cold site.
• Like a hot site, the warm site is a computer facility readily
available with power, cooling, and computers, but the applications
may not be installed or configured.
• But external communication links and other data elements, that
commonly take a long time to order and install, will be present.
• To start working in a warm site, applications and all their data will
need to be restored from backup media and tested.
• This typically takes a day.
• The benefit of a warm site compared to a hot site is that it needs
less attention when not in use and is much cheaper.
Fallback – Cold Site
• A cold site differs from the other two in that it is ready for
equipment to be brought in during an emergency, but no
computer hardware is available at the site.
• The cold site is a room with power and cooling facilities, but
computers must be brought on-site if needed, and communication
links may not be ready.
• Applications will need to be installed and current data fully
restored from backups.
• Although a cold site provides minimal fallback protection, if an
organization has very little to spend on a fallback site, a cold site
may be better than nothing.
Business Continuity
• Although many measures can be taken to provide high availability,
the availability of the IT infrastructure can never be guaranteed in
all situations.
• In case of a disaster, the infrastructure could become unavailable,
in some cases for a longer period of time.
• Business continuity is about identifying the threats an organization
faces and providing an effective response.
• Business Continuity Management (BCM) and Disaster Recovery
Planning (DRP) are processes to handle the effects of a disaster.
Business Continuity Management
• BCM is not about IT alone. It includes managing business
processes, and the availability of people and work places in
disaster situations.
• It includes disaster recovery, business recovery, crisis
management, incident management, emergency management,
product recall, and contingency planning.
• A Business Continuity Plan (BCP) describes the measures to be
taken when a critical incident occurs in order to continue running
critical operations, and to halt non-critical processes.
• The BS 25999 standard describes guidelines on how to implement
BCM.
Disaster Recovery Planning
• Disaster recovery planning (DRP) contains a set of measures to
take in case of a disaster, when (parts of) the IT infrastructure
must be accommodated in an alternative location.
• An IT disaster is defined as an irreparable problem in a datacenter,
making the datacenter unusable. In general, disasters can be
classified into two broad categories.
• The first is natural disasters such as floods, hurricanes, tornadoes or
earthquakes.
• The second category is man-made disasters, including hazardous material
spills, infrastructure failure, or bio-terrorism.
• In a survey performed among eighteen very experienced IT
professionals, it was found that a disaster as defined above is very
unlikely in western Europe.
Estimated occurrence of disasters
• The figure below shows the estimated occurrence of disasters.
• Based on this figure, disasters in western Europe are expected to
happen no more than once every 30 years.
Estimated occurrence of disasters
• The IT disaster recovery standard BS 25777 can be used to
implement DRP.
• DRP assesses the risk of failing IT systems and provides solutions.
• A typical DRP solution is the use of fallback facilities and having a
Computer Emergency Response Team (CERT) in place.
• A CERT is usually a team of systems managers and senior
management that decides how to handle a certain crisis once it
becomes reality.
• The steps that need to be taken to resolve a disaster highly
depend on the type of disaster.
• It could be that the organization’s building is damaged or destroyed
(for instance in case of a fire), maybe even people got hurt or died.
• One of the first worries is of course to save people.
• But after that, procedures must be followed to restore IT operations as
soon as possible.
Estimated occurrence of disasters
• A new (temporary) building might be needed, temporary staff
might be needed, and new equipment must be installed or hired.
• After that, steps must be taken to get the systems up and running
again and to have the data restored.
• Connections to the outside world must be established (not only to
the internet, but also to business partners) and business
processes must be initiated again.
RTO and RPO
• Two important objectives of disaster recovery planning are the
Recovery Time Objective (RTO) and the Recovery Point Objective
(RPO).
RTO and RPO
• The RTO is the maximum duration of time within which a
business process must be restored after a disaster, in order to
avoid unacceptable consequences (like bankruptcy).
• The RTO only applies in case of a disaster; it is not the acceptable
downtime under normal circumstances.
• Measures like failover and fallback must be taken in order to fulfill
the RTO requirements.
RTO and RPO
• The RPO is the point in time to which data must be recovered,
considering some “acceptable loss” in a disaster situation.
• It describes the amount of data loss a business is willing to accept
in case of a disaster, measured in time.
• For instance, when a backup is made of all data each day, and a
disaster destroys all data, the maximum RPO is 24 hours – the
maximum amount of data lost between the last backup and the
occurrence of the disaster.
• To lower the RPO, a different backup regime could be
implemented.
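• A minimal sketch (assuming the only recovery point is the last completed backup,
with hypothetical timestamps) showing why a daily backup leads to an RPO of up
to 24 hours:

  # Sketch: worst-case data loss equals the time since the last completed backup.
  from datetime import datetime

  def data_loss(last_backup: datetime, disaster: datetime):
      """Data written after the last backup is lost when the disaster strikes."""
      return disaster - last_backup

  last_backup = datetime(2024, 1, 1, 2, 0)   # nightly backup at 02:00
  disaster = datetime(2024, 1, 2, 1, 30)     # disaster just before the next backup
  print(data_loss(last_backup, disaster))    # 23:30:00 -> close to the 24-hour RPO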
Question and Answer
