Module 4. Availability Concepts
Introduction
• Everyone expects their infrastructure to be available all the time.
In this age of global, always-on, always-connected systems,
disturbances in availability are noticed immediately. A 100%
guaranteed availability of an infrastructure, however, is
impossible. No matter how much effort is spent on creating highly
available infrastructures, there is always a chance of downtime.
It’s just a fact of life.
• According to a survey from the 2014 Uptime Symposium, 46% of
companies using their own datacenter had at least one “business-
impacting” datacenter outage over 12 months.
• This chapter discusses the concepts and technologies used to
create highly available systems. It includes calculating availability,
managing human factors, the reliability of infrastructure
components, how to design for resilience, and – if everything else
fails – business continuity management and disaster recovery.
Calculating availability
• In general, availability can neither be calculated nor guaranteed
upfront. It can only be reported on afterwards, when a system has
run for some years. This makes designing for high availability a
complicated task. Fortunately, over the years, much knowledge
and experience has been gained on how to design high availability
systems, using design patterns like failover, redundancy, structured
programming, avoiding Single Points of Failure (SPOFs), and
implementing sound systems management. But first, let’s discuss
how availability is expressed in numbers.
Availability Percentages and Intervals
• The availability of a system is usually expressed as a percentage of
uptime in a given time period (usually one year or one month).
The following table shows the maximum downtime for a
particular percentage of availability:

Availability %   Downtime per year   Downtime per month   Downtime per week
99.8%            17.5 hours          86.2 minutes         20.2 minutes
• In this example, the system may be unavailable for 25 minutes
no more than twice a year. It may also, however, be unavailable for 3
minutes three times each month. For each availability requirement, a
frequency table should be provided in addition to the availability
percentage.
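• The downtime figures for a given availability percentage can be checked with a short calculation. The following is a minimal sketch; the function name and the period lengths (a 365-day year and a 7-day week) are assumptions made for illustration.

```python
# Sketch: maximum downtime allowed by an availability percentage.
# Period lengths are assumptions: a 365-day year and a 7-day week.

HOURS_PER_YEAR = 365 * 24
MINUTES_PER_WEEK = 7 * 24 * 60


def max_downtime(availability_pct: float, period_length: float) -> float:
    """Return the downtime allowed in a period, in the period's own unit."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    # The unavailable fraction of the period is (100% - availability).
    return (1 - availability_pct / 100) * period_length


print(f"{max_downtime(99.8, HOURS_PER_YEAR):.1f} hours per year")
print(f"{max_downtime(99.8, MINUTES_PER_WEEK):.1f} minutes per week")
```

For 99.8% this gives 17.5 hours per year and 20.2 minutes per week. A monthly figure depends on how long a month is assumed to be, so it is left out of the sketch.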
MTBF and MTTR
• The factors involved in calculating availability are Mean Time
Between Failures (MTBF), which is the average time that passes
between failures, and Mean Time To Repair (MTTR), which is the
time it takes to recover from a failure.
• The term “mean” means that the numbers expressed by MTBF and
MTTR are statistically calculated values.
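• A commonly used way to combine the two values is availability = MTBF / (MTBF + MTTR). The sketch below applies that relation; the function name and the example figures are hypothetical and do not describe any real component.

```python
# Sketch: availability from MTBF and MTTR, using the common relation
# availability = MTBF / (MTBF + MTTR). Both values must use the same unit.


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 10,000 hours
# and takes on average 8 hours to repair.
print(f"{availability(10_000, 8):.4%}")
```

The relation shows that availability improves both when failures become rarer (higher MTBF) and when repairs become faster (lower MTTR).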
Mean Time Between Failures (MTBF)
• The MTBF is expressed in hours (how many hours will the
component or service work without failure).
But when the second disk failed, both the users and the system managers noticed the
downtime!
Availability Patterns
• While eliminating SPOFs is very important, it is good to realize
that there is always something shared in an infrastructure (like
the building, the electricity provider, the metropolitan area, or the
country). We just need to know what is shared and if the risk of
sharing is acceptable.
• To eliminate SPOFs, a combination of redundancy, failover, and
fallback can be used.
Redundancy
• Redundancy is the duplication of critical components in a single
system, to avoid a SPOF.
• In IT infrastructure components, redundancy is usually
implemented in power supplies (a single component has two power
supplies; if one fails, the other takes over), network interfaces, and
SAN HBAs (Host Bus Adapters) for connecting storage.
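• The effect of redundancy on availability can be estimated with a small calculation. The sketch below assumes that the redundant parts fail independently of each other and that one working part is enough; both are simplifying assumptions, and the 99% figure is hypothetical.

```python
# Sketch: availability of n redundant parts where one working part suffices.
# Assumes independent failures, which real parts sharing a chassis,
# a power feed, or a firmware version may not satisfy.


def redundant_availability(part_availability: float, count: int) -> float:
    """Return the combined availability of `count` redundant parts."""
    if not 0 <= part_availability <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    if count < 1:
        raise ValueError("at least one part is required")
    # The set is down only when every part is down at the same time.
    return 1 - (1 - part_availability) ** count


# Hypothetical power supply that is available 99% of the time.
single = redundant_availability(0.99, 1)
dual = redundant_availability(0.99, 2)
print(f"single: {single:.4f}, dual: {dual:.4f}")
```

Under these assumptions a second power supply lifts the availability of the power function from 99% to 99.99%.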
Failover
• Failover is the (semi) automatic switch-over to a standby system
(component), either in the same or in another datacenter, upon
the failure or abnormal termination of the previously active
system (component).
• Examples are Windows Server failover clustering, VMware High
Availability and (on the database level) Oracle Real Application
Clusters (RAC).
• Failover is discussed in the chapters on the corresponding building
blocks.
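• The switch-over decision at the heart of failover can be sketched in a few lines. This is an illustration only: the `Node` class and `select_active` function are hypothetical names, and the sketch is not modeled on how any of the products mentioned above work internally.

```python
# Sketch: choosing which system serves requests in an active/standby pair.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def select_active(active: Node, standby: Node) -> Node:
    """Return the node that should serve requests."""
    if active.healthy:
        return active
    if standby.healthy:
        # Failover: the standby takes over from the failed active node.
        return standby
    raise RuntimeError("no healthy node available")


primary = Node("primary")
secondary = Node("secondary")
print(select_active(primary, secondary).name)  # primary is healthy

primary.healthy = False  # simulate a failure of the active system
print(select_active(primary, secondary).name)  # standby takes over
```

In practice the hard part is deciding reliably that the active system has failed; the sketch simply takes the health status as given.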
Fallback
• Fallback is the manual switchover to an identical standby computer
system in a different location, typically used for disaster recovery.
• There are three basic forms of fallback solutions:
• Hot site
• Warm site
• Cold site
Fallback – Hot Site
• A hot site is a fully configured fallback datacenter, fully equipped
with power and cooling.
• The applications are installed on the servers, and data is kept up-to-date
to fully mirror the production system.
• Staff and operators should be able to walk in and begin full
operation in a very short time (typically one or two hours).
• This type of site requires constant maintenance of the hardware,
software, data, and applications to be sure the site accurately
mirrors the state of the production site at all times.
Fallback – Warm Site
• A warm site could best be described as a mix between a hot site
and a cold site.
• Like a hot site, the warm site is a computer facility readily
available with power, cooling, and computers, but the applications
may not be installed or configured.
• However, external communication links and other data elements that
commonly take a long time to order and install will be present.
• To start working in a warm site, applications and all their data will
need to be restored from backup media and tested.
• This typically takes a day.
• The benefit of a warm site compared to a hot site is that it needs
less attention when not in use and is much cheaper.
Fallback – Cold Site
• A cold site differs from the other two in that it is ready for
equipment to be brought in during an emergency, but no
computer hardware is available at the site.
• The cold site is a room with power and cooling facilities, but
computers must be brought on-site if needed, and communication
links may not be ready.
• Applications will need to be installed and current data fully
restored from backups.
• Although a cold site provides minimal fallback protection, if an
organization has very little budget for a fallback site, a cold site may be
better than nothing.
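• The trade-off between the three site types can be sketched as a simple lookup. The recovery times for the hot site (two hours) and warm site (one day) are taken from the descriptions above; no figure is given for a cold site, so it is left unknown. The cost ordering (cold cheapest, hot most expensive) and the function name are assumptions.

```python
# Sketch: pick the cheapest fallback site type that can recover in time.
from typing import Optional

# (site type, typical recovery time in hours), cheapest first.
# The cold site has no stated recovery time, so it is None (unknown)
# and is never selected automatically.
SITES = [
    ("cold", None),
    ("warm", 24.0),
    ("hot", 2.0),
]


def cheapest_site_for(max_recovery_hours: float) -> Optional[str]:
    """Return the cheapest site type with a known recovery time that fits."""
    for name, recovery_hours in SITES:
        if recovery_hours is not None and recovery_hours <= max_recovery_hours:
            return name
    return None  # no site type with a known recovery time is fast enough


print(cheapest_site_for(48.0))  # warm
print(cheapest_site_for(4.0))   # hot
print(cheapest_site_for(1.0))   # None
```

A real decision would also weigh the cost of downtime against the cost of maintaining the site, which this sketch does not attempt.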
Business Continuity
• Although many measures can be taken to provide high availability,
the availability of the IT infrastructure can never be guaranteed in
all situations.
• In case of a disaster, the infrastructure could become unavailable,
in some cases for a longer period of time.
• Business continuity is about identifying threats an organization
faces and providing an effective response.
• Business Continuity Management (BCM) and Disaster Recovery
Planning (DRP) are processes to handle the effects of a disaster.
Business Continuity Management
• BCM is not about IT alone. It includes managing business
processes, and the availability of people and work places in
disaster situations.
• It includes disaster recovery, business recovery, crisis
management, incident management, emergency management,
product recall, and contingency planning.
• A Business Continuity Plan (BCP) describes the measures to be
taken when a critical incident occurs in order to continue running
critical operations, and to halt non-critical processes.
• The BS 25999 standard describes guidelines on how to implement
BCM.
Disaster Recovery Planning
• Disaster recovery planning (DRP) contains a set of measures to
take in case of a disaster, when (parts of) the IT infrastructure
must be accommodated in an alternative location.
• An IT disaster is defined as an irreparable problem in a datacenter,
making the datacenter unusable. In general, disasters can be
classified into two broad categories.
• The first is natural disasters such as floods, hurricanes, tornadoes or
earthquakes.
• The second category is manmade disasters, including hazardous material
spills, infrastructure failure, or bio-terrorism.
• A survey among eighteen very experienced IT professionals
indicated that a disaster as defined above is very unlikely in
western Europe.
Estimated occurrence of disasters
• The figure below shows the estimated occurrence of disasters.
• Based on this figure, disasters in western Europe are expected to
happen no more than once every 30 years.
• The IT disaster recovery standard BS 25777 can be used to implement
DRP.
• DRP assesses the risk of failing IT systems and provides solutions.
• A typical DRP solution is the use of fallback facilities and having a
Computer Emergency Response Team (CERT) in place.
• A CERT is usually a team of systems managers and senior
management that decides how to handle a certain crisis once it
becomes reality.
• The steps that need to be taken to resolve a disaster highly
depend on the type of disaster.
• It could be that the organization’s building is damaged or destroyed
(for instance in case of a fire), and people may even have been hurt or killed.
• The first concern is of course to save people.
• After that, procedures must be followed to restore IT operations as
soon as possible.
• A new (temporary) building might be needed, temporary staff
might be needed, and new equipment must be installed or hired.
• After that, steps must be taken to get the systems up and running
again and to have the data restored.
• Connections to the outside world must be established (not only to
the internet, but also to business partners) and business
processes must be initiated again.
RTO and RPO
• Two important objectives of disaster recovery planning are the
Recovery Time Objective (RTO) and the Recovery Point Objective
(RPO).
• The RTO is the maximum duration of time within which a
business process must be restored after a disaster, in order to
avoid unacceptable consequences (like bankruptcy).
• The RTO applies only in case of a disaster; it is not the acceptable
downtime under normal circumstances.
• Measures like failover and fallback must be taken in order to fulfill
the RTO requirements.
• The RPO is the point in time to which data must be recovered
considering some “acceptable loss” in a disaster situation.
• It describes the amount of data loss a business is willing to accept
in case of a disaster, measured in time.
• For instance, when a backup of all data is made each day, and a
disaster destroys all data, the maximum RPO is 24 hours – the
maximum amount of data lost between the last backup and the
occurrence of the disaster.
• To lower the RPO, a different back-up regime could be
implemented.
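• The relation between the backup interval and the RPO can be sketched as follows. The function names and the timestamps are hypothetical, and the sketch assumes that a backup is complete and usable at the moment it is taken.

```python
# Sketch: data loss window and a simple RPO check for a backup regime.
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Return how much recent work is lost when restoring the last backup."""
    if disaster < last_backup:
        raise ValueError("the disaster must occur after the last backup")
    return disaster - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case, a full backup interval of data is lost."""
    return backup_interval <= rpo


# Hypothetical case: nightly backup at 02:00, disaster at 23:30 the same day.
lost = data_loss_window(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 23, 30))
print(lost)  # 21:30:00

print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))  # True
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))   # False
```

In the last case a daily backup cannot satisfy a 4-hour RPO, so backups would have to be made at least every 4 hours.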
Question and Answer