Best Practices for Continuous Application Availability
Donna Scott
6-8 June 2005
Marriott Wardman Park Hotel
Washington, District of Columbia
These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via e-mail: [email protected].
© 2005 Gartner, Inc. and/or its affiliates. All rights reserved. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.
High-Profile Downtime Is Down
High-profile downtime incidents in 2004 were much less frequent than in
1999-2001. Because they are less frequent, they become part of doing
business. That does not mean there is no cost of downtime; it just
means that people/customers are more accommodating when they are
rarely vs. frequently impacted.
Date | Event | Cause
10/8/04 | 4-day slow-down/outage of PayPal | Code changes
9/14/04 | 5-hour FAA radio outage disrupted air travel; some planes flew dangerously close | Lack of maintenance
8/1/04 | 3-hour plane grounding at American & U.S. Airways | Unintentional user error
Since 2001, a greater number of organizations have started to systematically reduce risks in their IT environments, and therefore improve end-to-end availability. This included an emphasis on designing for availability and managing for availability through improved problem and change management processes. As availability levels have risen, there have been fewer high-profile downtime incidents in the news. You certainly can still find them, as shown in the graphic, but they are less frequent. Because they are less frequent, they become part of "doing business" and have less overall impact than when they were occurring frequently. That does not mean that there is no cost of downtime; it just means that people/customers are more accommodating when they are rarely impacted (vs. frequently impacted).
While all enterprises still have downtime, levels of uptime have risen. Still, because of IT-business process interdependencies, there is increased desire to continue to improve and to operate critical IT services and applications 24 hours per day, seven days per week (24x7). The complexity of today's IT infrastructures and applications makes managing these systems to high levels of availability difficult. Achieving continuous availability requires a multipronged strategy that addresses and mitigates the risks of failures and of planned maintenance/upgrades.
Continuous availability must be designed in and requires substantial levels of cross-organizational people/process
discipline and control. This presentation focuses on strategies to achieve continuous end-to-end IT
service/application availability.
Less Downtime Supported by Data Center
Conference Polling Results
[Chart: percentage of respondents at each end-to-end availability level, comparing Data Center Conference poll results from Dec. 2000 (n=N/A), Dec. 2003 (n=151) and Dec. 2004 (n=165).]
Key:
Average (98% or less; 175 or more downtime hours per year)
Very Good (99%; 87 or fewer downtime hours per year)
Outstanding (99.5%; 43 or fewer downtime hours per year)
Best-in-Class (99.9%; 9 or fewer downtime hours per year)
100% Availability (zero unplanned downtime)
The results from a poll conducted at Gartner's Data Center Conference in December 2004 as well as in
December 2003 indicate that many enterprises have made outstanding progress in improving application
availability since 2000. Business processes' increasing reliance on IT and the pervasive ideal of the real-time
enterprise will continue to drive this trend toward around-the-clock availability. To achieve this, enterprises
must have a multipronged strategy that addresses application architecture, technology infrastructure and IT
process maturity.
These conclusions are based on the findings of a poll conducted among CIOs, heads of IT operations and data
center managers attending Gartner's Data Center Conference in December 2004 as well as in December 2003
(which attracted more than 1,400 attendees, who were given electronic polling devices with selected questions
inserted in each conference presentation). Respondents answered two questions that focused on end-to-end
availability levels for their enterprises' most-critical applications. Responses to questions on unplanned
downtime were compared to results from a similar poll in 2000. Although Gartner recognizes that respondents
don't necessarily represent a statistically significant distribution, we believe the results of these polls are of
interest to users.
Client Issues
1. How will enterprises define and measure continuous availability, and how much does it
cost?
2. How should IT services be architected for continuous availability?
3. What IT process best practices and strategies will enterprises adopt to achieve continuous
IT service availability?
Client Issue: How will enterprises define and measure continuous availability, and how much
does it cost?
Tactical Guideline: Most enterprises that are seeking to improve availability initially focus on
reducing unplanned downtime.
A highly available IT service provides user access to applications and data for a minimum of 99 percent of
scheduled time, despite unscheduled incidents. It typically implies the ability to eliminate/avoid (via error
detection, circumvention, correction and recovery) or minimize (via rapid restart) unscheduled outages. Implied
in high availability is application and data integrity, as well as acceptable (as defined by the user) application
performance. Ultimately, high availability must be measured from a user's perspective. If a user can't access an IT service (the set of applications and their underlying infrastructure) during scheduled hours, the application is
considered unavailable. Most business-critical applications, including business-to-business (B2B), business-to-
consumer (B2C) and enterprise applications, have some planned or scheduled downtime to perform maintenance.
Scheduled downtime should be negotiated with users to avoid times of peak or seasonal business demand.
Further, scheduled downtime should be clearly communicated to users to set expectations and avoid the dissatisfaction that occurs when users try to access a site or service that is unavailable. A continuously operable
site enables access during expanded hours, often near 24x7 or the full 24x7. Continuous availability is the
combination of high availability and continuous operations, and it enables expanded hours of user access (near
24x7 or 24x7) a high percentage (99 percent or more) of the time.
Conclusion: 24x7 availability is designed in, not bought; is expensive; and requires a strategy and plan.
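To make these thresholds concrete, the relationship between an availability percentage and the annual downtime it allows is simple arithmetic. The sketch below is illustrative only; the function name and the 8,760 scheduled hours per year assumed for a 24x7 service are examples, not figures from the presentation.

```python
# Illustrative arithmetic: translate an availability target and scheduled hours
# into an annual downtime budget. The function name and figures are examples only.

def downtime_budget_hours(availability_pct: float, scheduled_hours_per_year: float) -> float:
    """Hours of allowable downtime per year for a given availability target."""
    return (1 - availability_pct / 100.0) * scheduled_hours_per_year

# A 24x7 service has 8,760 scheduled hours per year (365 x 24).
print(downtime_budget_hours(99.0, 8760))   # ~87.6 hours per year
print(downtime_budget_hours(99.5, 8760))   # ~43.8 hours per year
print(downtime_budget_hours(99.9, 8760))   # ~8.8 hours per year
```

The 99 percent, 99.5 percent and 99.9 percent results line up roughly with the 87-, 43- and 9-hour tiers used in the polling key and in later slides.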
Availability Defined in User, Not Component Terms
Continuous availability provides expanded user access to application services (24x7 or near 24x7) while also providing access to a high percentage (equal to or more than 99 percent) of scheduled time, despite unscheduled incidents.
[Diagram: continuous availability combines high availability (fault avoidance, rapid recovery, integrity, application performance management, availability reporting/metrics; minimizing unplanned downtime) with continuous operations (minimizing planned downtime).]
Strategic Planning Assumption: The percentage of large enterprises and outsourcers that measure end-to-end service availability will rise from 25 percent today to more than 50 percent by 2007 and 75 percent by 2009 (0.8 probability).
Key Issue: How will enterprises define and measure continuous availability, and how much does
it cost? As IT organizations move toward IT service management, providing various levels of service to customers at various levels of cost, they must also regularly measure and report their service performance. A
critical aspect of service performance is end-to-end application service availability. The end-to-end service to be
measured is typically defined jointly by the business and IT organization as part of the service-level management
process. It includes a set of applications and underlying infrastructure that are critical to a business process.
Examples may include all applications and infrastructure associated with e-commerce, call center or enterprise
resource planning. Fulfilling the end-to-end service requirements may be done in-house or outsourced, or a
combination of both. Although most IT organizations and outsourcers measure the availability of IT components,
most do not yet measure end-to-end service availability. However, the trend for measurement is consistent with the
IT service management trend and, as a result, the percentage of large enterprises and outsourcers that measure end-to-end service availability will rise from 25 percent today to more than 50 percent by 2007 and 75 percent by 2009 (0.8 probability). Most enterprises don't measure end-to-end service availability today because it's difficult to do, crosses many business, IT, outsourcer, organizational and process boundaries, and requires significant manual effort to determine outage business impact and cost. Furthermore, many enterprises are just starting to organize IT for service management. Action Item: You can't improve what you don't measure. Develop and implement a method for measuring end-to-end availability from the user's perspective, that is, the set of application functions critical to their business process.
IT Services/Application Availability Reporting Best Practices
Measure on IT services/products, not individual components.
General formula = 1 - (total downtime minutes / total available minutes).
Weigh downtime by a pain index: typically the number of users affected, weighted for the severity of the impact.
Predetermine the conditions that constitute downtime. If an e-commerce site has 1 percent of functions down, does this count?
Map downtime conditions to severity/priority levels in the service desk and event management systems.
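The general formula and the pain-index weighting above can be combined into a single calculation. The sketch below is a minimal illustration; the Outage record, its fields and the severity weights are assumptions introduced for the example rather than Gartner-defined terms.

```python
# Illustrative sketch of pain-weighted IT service availability reporting.
# Field names and the weighting scheme are assumptions, not Gartner definitions.
from dataclasses import dataclass

@dataclass
class Outage:
    minutes: float          # outage duration
    users_affected: int     # users impacted during the outage
    severity_weight: float  # e.g., 1.0 = full outage, 0.5 = degraded service

def weighted_availability(outages: list[Outage], total_minutes: float, total_users: int) -> float:
    """1 - (pain-weighted downtime user-minutes / total available user-minutes)."""
    weighted_downtime = sum(o.minutes * o.users_affected * o.severity_weight for o in outages)
    return 1.0 - weighted_downtime / (total_minutes * total_users)

# One month (43,200 minutes) of a service with 500 users and two incidents:
incidents = [Outage(90, 500, 1.0), Outage(240, 50, 0.5)]
print(round(weighted_availability(incidents, 43_200, 500), 5))  # ~0.99764
```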
Best Practice #1: Measure to the User's View
[Diagram: end-to-end IT service stack spanning business rules, application code, desktop, servers, Internet/network, middleware and production objects, database, transactions and storage.]
Many enterprises find it difficult to measure end-to-end IT service availability, and use response time measures as a proxy for availability.
Key Issue: How will enterprises define and measure continuous availability, and how much does
it cost? Business requirements for application service availability and disaster recovery should be defined during
the business requirements phase. Ignoring requirements early often results in a solution that does not meet
requirements and ultimately requires significant re-architecture to improve service. We recommend a
classification scheme of supported service levels and associated costs. These drive tasks and spending in
development/application architecture, system architecture and operations. Business managers then develop a
business case for a particular classification of service. When moving toward IT service management (ITSM),
enterprises define IT services that are meaningful to their customers, and establish formal service-level agreements (SLAs) that set service quality commitments. Enterprises not yet ready for ITSM, but wanting to improve
availability levels, should follow the same process and set informal SLAs from which to monitor their progress
toward goals. Whether formal or not, these measures must trickle down into IT and supplier performance goals
that correlate with the end-to-end SLA. It is vital for IT organizations to set realistic internal targets and roll them up to confirm that the BU/IT organization service-level target is achievable; target numbers should be based on actual experience, not guesswork. Action Item: When moving to IT service management, IT
organizations should ideally measure their performance for six to 12 months prior to agreeing to formal BU/IT
organization SLAs.
Best Practice #2: Determine SLAs Early in Life Cycle; Classify and Consider SLA Chain
[Diagram: the SLA chain runs from the business unit through the IT organization to internal and external service providers, spanning the application, database, systems software, servers, facilities/environment, outsourced wide-area networks and other outsourced components.]
Class | SLA | Typical IT Services
1-RTE | 24x7, 99.9; RTO = 2 hrs; RPO = 0 | Customer/partner facing; significant revenue and/or service impact
2-Critical | 24x6-3/4, 99.5; RTO < 8 hrs; RPO < 4 hrs | Supply chain; medium impact on customer service
3-Important | 18x7, 99.2; RTO = 72 hrs; RPO = 24 hrs | Back-office applications
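One way to make such a classification actionable is to encode it as data that project templates and operational tooling can read. The sketch below mirrors the three classes in the table; the dataclass, field names and the upper-bound handling for Class 2 are illustrative assumptions.

```python
# Illustrative encoding of the SLA classes from the table above.
# The dataclass and field names are assumptions; the values come from the slide.
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceClass:
    name: str
    coverage: str            # scheduled hours of operation
    availability_pct: float  # end-to-end availability target
    rto_hours: float         # recovery time objective
    rpo_hours: float         # recovery point objective
    typical_services: str

SLA_CLASSES = [
    ServiceClass("1-RTE", "24x7", 99.9, rto_hours=2, rpo_hours=0,
                 typical_services="Customer/partner facing; significant revenue and/or service impact"),
    # Class 2 values are upper bounds on the slide (RTO < 8 hrs; RPO < 4 hrs).
    ServiceClass("2-Critical", "24x6-3/4", 99.5, rto_hours=8, rpo_hours=4,
                 typical_services="Supply chain; medium impact on customer service"),
    ServiceClass("3-Important", "18x7", 99.2, rto_hours=72, rpo_hours=24,
                 typical_services="Back-office applications"),
]

for c in SLA_CLASSES:
    print(f"{c.name}: {c.coverage}, {c.availability_pct}%, RTO {c.rto_hours}h, RPO {c.rpo_hours}h")
```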
Strategic Planning Assumption: By 2007, more than 50 percent of enterprises will classify IT
service availability and disaster recovery requirements during the early phases of the project
life cycle an increase from 20 percent today (0.7 probability).
Tactical Guideline: Breaking the service level down into manageable components is the best
way to gain confidence that service levels are realistic and achievable.
Strategic Planning Assumption: Through 2007, no more than 10 percent of critical application
services will achieve 99.9 percent (best-in-class) availability (0.8 probability).
Key Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability?
Business executives want the IT organization to deliver services like a utility: turn it on and it is always there; turn it up or down as needed. IT services, however, are not standardized or commoditized; rather, service levels reflect the interworking of components, services and management domains. The end result reflects the infrastructure's ability to deliver service, and also depends on the design of the application and the other
services being delivered by shared resources. Wishful thinking by those delivering or consuming the service
will not guarantee availability levels. Here we provide an end-to-end IT service availability ranking based on
annual service downtime, categorized independently for planned vs. unplanned downtime. It shows average,
very good, outstanding and best-in-class levels of availability. Most enterprises would not bother measuring
availability for services requiring average availability levels, but they would reserve their efforts for those
mission-critical services requiring higher levels of availability.
Action Item: Benchmarking can be a useful exercise, but ultimately, you should measure against your business requirements, and not against some other enterprise's requirements or achievements.
Best Practice #3: Know How Your IT Services' Availability Metrics Stack Up
Hours Down/Year/IT Service | Unplanned | Planned
Average | More than 175 hours (less than 98%) | More than 250 hours
Very Good | Between 70 and 87 hours (99% - 99.2%) | Less than 200 hours
Outstanding | Less than 43 hours (99.5%) | Less than 50 hours
Best in Class | Less than 9 hours (99.9%) | Less than 12 hours
Key Issue: How will enterprises define and measure continuous availability, and how much
does it cost?
Building highly available applications is expensive, costing about 2.5 times as much as a standard (not highly available) application. The costs are not just capital costs, such as redundancy, but also come in the form of greater diligence in IT processes (such as performance monitoring). Continuous availability is even more expensive: at least three-and-a-half times the cost of a standard application. Most enterprises define
satisfactory levels of availability for mission-critical, externally accessed Web applications at 99.5 percent
(about 43 hours per year of unplanned downtime), and two to eight hours per month of planned downtime. A
key differentiator for those enterprises justifying high availability and near-continuous operations is an
investment in architectures and standards for application design/development/testing and operations. Once
development standards are implemented, it costs little more to design and develop a highly available
application than a standard application. Disregarding availability requirements during design frequently causes
costly re-architecting at a later date to meet growing availability requirements.
Action Item: Considering availability requirements during design will save money throughout the life of the IT
service, by avoiding the costs of retrofitting and re-architecting at a later date.
Tactical Planning Guideline: A continuously available IT service will cost at least 3.5 times as much as a standard, non-highly-available service.
From a design/development perspective, it costs less to do it right the first
time than to retrofit it later. Standards are necessary to make the process
repeatable.
Best Practice #4: Know Your Costs of Delivering Availability; Use in Project Justification
CA = continuous availability; HA = high availability
[Chart: relative cost (2X to 4X the cost of a standard application) of HA and CA services. Each bar stacks the cost of the standard application plus redundancy, operations/management, HA design/development/testing, HA support, additional technology costs for CA, CA design/testing/scheduling/operations, and retrofit/redesign. Plus one-time project costs for: service and operations architecture and standards, including change management and impact analysis; and design, testing and development architecture and standards.]
Client Issue: How should IT services be architected for continuous availability?
Strategic Imperative: Achieving high levels of application service availability requires infusing
it into corporate culture. A critical success factor is defining repeatable processes through the
creation of infrastructure, software and operational architectures.
Achieving high levels of availability requires enterprises to understand the impact of architecture on availability. We recommend that availability SLAs be tied to architectural standards for infrastructure, software and operations. These standards ensure that a repeatable process is used to deliver the specified level of availability, and they build SLA planning and execution into the infrastructure, application and operational design. In the slide example, three levels of SLAs are offered for acceptable levels of unplanned downtime: standard (98 percent to 99 percent availability), silver (99 percent to 99.5 percent) and gold (99.5 percent to 99.9 percent). Specifying architectural requirements provides justification for the increased cost of higher levels of service. It also provides a basis for benchmarking costs of service with external service
providers. Furthermore, based on the architectural requirements for achieving higher quality of service, an IT
organization can provide standard multipliers required to achieve the higher levels of service. For example, if
the standard SLA is X, the silver level may be 2*X, while the gold level may be 3*X. This will help the
business process/application owner and relationship manager during the negotiation process (where sometimes
requirements change based on available budget).
Best Practice #5: Invest in Service-Level Management and Architecture Standards
[Table: three service classes, each with downtime per year, price and infrastructure, software and operations requirements.]
Class 3 Standard: 98%-99% availability; 87 to 175 downtime hours per year; Price: X
Class 2 Silver: 99.0%-99.5%; 43 to 87 hours; Price: 2*X
Class 1 Gold: 99.5%-99.9%; eight to 43 hours; Price: 3*X
Requirements escalate with each class. Infrastructure requirements range from stand-alone servers, auto-restart, tested backup and no single point of failure in the user network, through parallel clusters, hot-plug hardware, use of GA products and spare parts on-site, up to redundant architecture, auto-failover, consistent configurations and vendor MTTR SLAs. Software requirements range from application start/stop, tested recovery plans and user account testing, up to auto-recovery, no transaction re-entry, replicated databases, a test environment equal to production, application-designed failover, auto-diagnostics, scalability and security. Operations requirements range from change management, event monitoring, backup practices, well-trained staff and tested recovery plans, up to real-time alarming with business impact, capacity planning, outage analysis and prevention, proactive tuning, proactive availability and performance management, and proactive problem management/root-cause analysis.
Best Practice #6: Invest in Holistic, Resilient Application and Infrastructure Architectures
[Diagram: IT service design principles for CA, grouped into strategies to reduce application downtime and strategies to reduce application downtime impact.]
Design for automated, rolling updates, change and scale
Embed management instrumentation
Use asynchronous application integration/messaging
Design for functional degradation
Partition databases
Architect for redundancy, failover and horizontal scaling
Architect for active/active; user transparency through failures
Use stateless but persistent architectures
Design applications to communicate with users
Client Issue: How should IT services be architected for continuous availability? Designing for
continuous availability reduces downtime or minimizes user impact for any outage that occurs. Although
unexpected component outages cannot be completely eliminated, masking outages from users creates the
perception of uninterrupted availability. This approach is the underpinning of any availability strategy.
Although enterprises would like to rely on the underlying technology infrastructure exclusively for the quality
of service of their applications, application availability also depends on application design. For example,
component architectures make it easier to upgrade an application in flight. Stateless application components
allow for a more flexible, powerful use of application server load balancing, providing greater availability and persistent user access despite component outages. Moreover, de-coupled connections
between components and applications are more tolerant of failures than synchronous connections. Effective
applications management starts at development time with instrumentation. Clearly identified "interfaces"
between components or services, transactions and business processes are primary candidates for this activity.
Action Item: Base infrastructure resiliency on levels of application resiliency; high levels of application
resiliency will modify requirements for infrastructure resiliency. Consideration of application availability must
be present from the start of the project and not delayed until application deployment because architectural
changes cannot easily be reversed.
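As one illustration of the asynchronous-integration principle mentioned above, the sketch below decouples the caller from a downstream service with a queue and a retry loop, so a downstream outage delays delivery instead of failing the user transaction. It is a single-process stand-in for durable messaging middleware, and every name in it is invented for the example.

```python
# Minimal single-process sketch of asynchronous application integration.
# In production this would be durable message-queue middleware; here an
# in-memory queue and a retry loop stand in for it. All names are illustrative.
import queue
import threading
import time

outbox: "queue.Queue[dict]" = queue.Queue()

def submit_order(order: dict) -> None:
    """The caller enqueues work and returns immediately; it does not block on the downstream system."""
    outbox.put(order)

def deliver(order: dict) -> None:
    """Stand-in for a call to a downstream service that may be temporarily unavailable."""
    print(f"delivered order {order['id']}")

def dispatcher(stop: threading.Event) -> None:
    """Drains the queue, retrying failed deliveries instead of failing the caller."""
    while not stop.is_set() or not outbox.empty():
        try:
            order = outbox.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            deliver(order)
        except Exception:
            time.sleep(0.5)      # back off, then retry later
            outbox.put(order)    # re-queue; the user transaction already completed

stop = threading.Event()
worker = threading.Thread(target=dispatcher, args=(stop,), daemon=True)
worker.start()
submit_order({"id": 1})
submit_order({"id": 2})
time.sleep(0.5)
stop.set()
worker.join()
```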
Strategic Planning Assumptions: Through 2008, less than 20 percent of large enterprises
will systematically design application architectures to achieve near-continuous availability
(0.8 probability). Through 2008, 80 percent of large enterprises will continue to rely on the
infrastructure to achieve near-continuous availability, rather than systematically build it into
the application architecture (0.8 probability).
Best Practice #7: Invest in Multisite Architectures for Built-In Disaster Recovery
[Diagram: a geographic load balancer directs users to a primary or secondary site; each site contains a site load balancer and Web, application and database server clusters, with disk, point-in-time (PIT) image and tape backup. Replication options between sites include application replication, DB/host replication (IBM, Microsoft, NSI, Oracle, Quest, Veritas) and remote copy (EMC, Hitachi, HP, IBM).]
Strategic Imperative: For short (measured in seconds) and transparent recovery, the capability must be designed into the application architecture, rather than added to the infrastructure as an afterthought.
Client Issue: How should IT services be architected for continuous availability?
For application services with continuous availability requirements, including short recovery time objective
(RTO) and recovery point objective (RPO), multisite architectures are used. Often, a new real-time enterprise
(RTE) application service starts with a single-site architecture and migrates to multiple sites as its risks grow.
Multiple sites complicate applications architecture design (for example, load balancing, database partitioning,
database replication and site synchronization must be designed into the architecture). For nontransaction
processing applications, multiple sites run concurrently, connecting users to the closest or least-used site. To
reduce complexity, most transaction processing (TP) applications replicate databases (or disks) to an alternate
site, but the alternate databases are idle unless a disaster occurs. Then, a switch to the alternate site can be
accomplished in typically 15 to 30 minutes. Some enterprises prefer to partition databases and split the TP load
between sites, and then consolidate data later for decision support and reporting. This reduces the impact of a
site outage, affecting only a portion of the user base. A small number of organizations prefer more-complex
architectures with bidirectional replication between sites to maintain a single database image and the highest
levels of availability.
Action Item: The shorter the requirement for failover, the closer the replication must be to the application level
in the architecture.
Best Practice Case Study: UPS Architects for 100 Percent Availability
[Diagram: DIADs, customers and UPS operating centers feed the New Jersey and Georgia data centers, with data replication between the two sites for critical and important information.]
Client Issue: How should IT services be architected for continuous availability?
UPS architected its package delivery status data so that customers could access it over the Internet 24x7. An
architectural overview follows and refers to the graphic.
At the time of package delivery, UPS drivers record package data electronically into a PDA-like device called a
DIAD. After each package delivery, the driver transmits the data over a cellular network, in real time, to a
mainframe in New Jersey (NJ). If NJ is inaccessible, the DIAD transmits to the Georgia (GA) center.
As soon as the data is received by a front-end collection application in the NJ or GA data centers, it's written to an
IMS database and to WebSphere MQ; the latter replicates the transaction to the other data center.
Both locations run application tasks to take the IMS data and apply the updates/inserts to the IBM DB2 database.
This process enables up-to-date package data status from the database.
At the end of each day, UPS drivers physically go to one of the 1,800 regional package processing centers, and all
data collected in their DIADs is transferred to distributed Intel/Windows systems.
All data is transferred daily from the regional facilities via FTP to the NJ data center and processed into the IBM
DB2 databases. For added protection, all data is stored in the regional centers for up to three days.
Action Item: Companies that want uninterrupted, 24x7 access to business applications and data must consider
certain requirements in the early phases of IT enterprise architecture design.
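The dual-write step described above (persist the scan locally, then let messaging replicate it to the other data center) can be sketched generically as follows. This is not UPS's implementation: SQLite and an in-memory list stand in for the IMS/DB2 databases and WebSphere MQ, and all table, queue and function names are invented.

```python
# Illustrative sketch of the pattern described above: persist a delivery scan
# locally, then hand it to a messaging layer that replicates it to the other
# data center. SQLite and a plain list are stand-ins; all names are invented.
import json
import sqlite3

local_db = sqlite3.connect(":memory:")
local_db.execute("CREATE TABLE package_status (tracking_id TEXT, status TEXT, ts TEXT)")
replication_queue: list[str] = []   # stand-in for a cross-site message queue

def record_delivery(tracking_id: str, status: str, ts: str) -> None:
    """Write the scan to the local database and enqueue it for the remote site."""
    local_db.execute("INSERT INTO package_status VALUES (?, ?, ?)", (tracking_id, status, ts))
    local_db.commit()
    replication_queue.append(json.dumps({"tracking_id": tracking_id, "status": status, "ts": ts}))

def apply_replicated(remote_db: sqlite3.Connection) -> None:
    """At the other site, drain the queue and apply the updates/inserts."""
    while replication_queue:
        msg = json.loads(replication_queue.pop(0))
        remote_db.execute("INSERT INTO package_status VALUES (?, ?, ?)",
                          (msg["tracking_id"], msg["status"], msg["ts"]))
    remote_db.commit()

remote_db = sqlite3.connect(":memory:")
remote_db.execute("CREATE TABLE package_status (tracking_id TEXT, status TEXT, ts TEXT)")
record_delivery("1Z999", "DELIVERED", "2005-06-08T10:15:00")
apply_replicated(remote_db)
print(remote_db.execute("SELECT * FROM package_status").fetchall())
```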
IT organizations evolve over time toward IT management process maturity. In the more-traditional mainframe
data centers, it took 10 to 20 years to achieve Level 3 service management maturity. By comparison, distributed
heterogeneous computing is relatively new, and few IT organizations have reached Level 3 maturity.
Each maturity level provides the foundation for the higher levels. People, processes and tools must be in place
at one level before the enterprise will be able to proceed to the next level. Even as IT organizations move up,
there will always be additional work, continuous engineering and improvements going on at each level.
Business processes that are heavily dependent on IT require a minimum of Level 3 IT management process
maturity to have the necessary rigor and predictability in the delivery of IT services. Therefore, setting
business/IT SLAs requires that all objects within the IT service (for example, systems, storage, networks,
applications, database, OS software, facilities and environmental factors) also be managed at Level 3 maturity.
Otherwise, the weak link (the objects managed at lower levels) will cause the entire service to fail to meet the
business SLAs. Action Item: Understand where you are today in the IT management process maturity model,
set a goal of what level you need to reach to best support the business, then use the model to help guide
investments in people, processes and technology to achieve higher levels of maturity.
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability?
Strategic Planning Assumption: Through 2005, fewer than 25 percent, and through 2007, fewer
than 35 percent, of large enterprises will achieve IT service management maturity (0.8
probability).
Best Practice #8: Know Your IT Management Process Maturity Level
Level | Maturity | Processes
4 | Value | IT/business metric linkage
3 | Service | Capacity planning, service-level management
2 | Proactive | Performance, change, problem, configuration and availability management; automation; job scheduling
1 | Reactive | Event up/down, console, trouble ticket, backup, topology, inventory
0 | Chaotic | Multiple help desks, nonexistent IT operations, user call notification
Service Management Benefits: service quality, customer satisfaction, understanding of costs and benchmarking, labor costs, risk
Strategic Imperative: Through 2007, excessive downtime will cause most IT organizations to increase their attention to IT process re-engineering.
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability? Based on extensive feedback from clients, we estimate that, on
average, 40 percent of unplanned mission-critical application downtime is caused by application failures
(including bugs, performance issues or changes to applications that cause problems); 40 percent by operator errors (including performing an operations task incorrectly or not at all); and about 20 percent by hardware (for
example, server and network), operating systems, environmental factors (for example, heating, cooling and
power failures), and natural or manmade disasters. To address the 80 percent of unplanned downtime caused by
people failures (vs. technology failures or disasters), enterprises should invest in improving their change and
problem management processes (to reduce the downtime caused by application failures); automation tools, such
as job scheduling and event management (to reduce the downtime caused by operator errors); and improving
availability through application architecture. The balance should be addressed by eliminating single points of
failure through redundancy or reducing time-to-repair through technology support/maintenance agreements. To
reduce planned downtime, investing in change management and application/DBMS architecture/design/
development processes will have the greatest effect and the highest return on investment. Action Item:
Enterprises will reach an availability wall over which they can't climb, unless they invest in re-engineering
IT processes. These processes include availability, change, configuration, problem and performance
management, as well as application architecture and capacity planning.
Best Practice #9: Know Why Your IT Service Is Down
Unplanned downtime: 40% operations errors; 40% application failure; 20% environmental factors, hardware, OS, power and disasters.
Planned downtime: 65% application and database; 13% hardware and systems software; 10% backup and recovery; 10% batch application processing; 2% physical plant/environmentals.
Best Practice #10: Collect the Right Metrics
for Downtime Analysis and Future Prevention
IT services metrics/trending:
Frequency of unplanned outages
Mean-time-to-resolution/repair
Service downtime and impact
Response time
Root-cause analysis/postmortem
Root-cause coding (multilevel): hardware; application software; system software; facilities/environment; service provider; people (human error); process; external events; capacity; technology; each classified as preventable or unpreventable.
Drive the right behavior with employee and department performance metrics.
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability? Measuring and trending end-to-end availability will help the IT
organization understand how it is doing relative to defined service-level goals. However, improving availability
requires a more-granular understanding of availability and the root cause of outages, so that similar outages
may be prevented. Root-cause analysis consists of a postmortem outage review by a cross-functional IT team to identify
the reason the outage occurred. This may be, for example, due to a component failure, application failure,
changes that resulted in unanticipated problems or configuration inconsistency. Further classification to
determine whether the outage was preventable under existing processes will aid in correcting human error and
process failures. However, some outages may be preventable only with additional investment, which the
enterprise and IT organization must justify, or it must accept the outage risk. IT and business performance
metrics can run counter to the goals of availability. For example, if developers are assessed on code timeliness and not quality, the result is buggy code, which causes downtime. Few enterprises define availability as a goal across business and IT; those that have done so find significant benefits: better availability, better planning and less fire-fighting.
Action Item: Consider which metrics drive the desired behavior to achieve high levels of availability.
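A structured outage record is one simple way to operationalize the root-cause coding and preventability classification described above. The sketch below loosely follows the categories on the slide; the record layout, field names and sample data are assumptions made for illustration.

```python
# Illustrative outage record for root-cause coding and trend analysis.
# Categories loosely follow the slide; the structure and sample data are assumptions.
from dataclasses import dataclass
from collections import Counter

ROOT_CAUSE_CATEGORIES = {
    "hardware", "application software", "system software", "facilities/environment",
    "service provider", "people (human error)", "process", "external events",
    "capacity", "technology",
}

@dataclass
class OutageRecord:
    service: str
    minutes: float
    root_cause: str      # one of ROOT_CAUSE_CATEGORIES
    preventable: bool    # preventable under existing processes?

def downtime_by_cause(outages: list[OutageRecord]) -> Counter:
    """Total downtime minutes per root-cause category, for trending and prevention."""
    totals: Counter = Counter()
    for o in outages:
        assert o.root_cause in ROOT_CAUSE_CATEGORIES, f"unknown category: {o.root_cause}"
        totals[o.root_cause] += o.minutes
    return totals

history = [
    OutageRecord("e-commerce", 95, "application software", preventable=True),
    OutageRecord("e-commerce", 30, "people (human error)", preventable=True),
    OutageRecord("call center", 240, "facilities/environment", preventable=False),
]
print(downtime_by_cause(history).most_common())
```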
Strategic Planning Assumption: Through 2007, fewer than 10 percent of large enterprises will
assess business and IT performance on availability metrics (0.8 probability).
Strategic Planning Assumption: Through 2008, fewer than 5 percent of successful Internet attacks will exploit a "day-zero" vulnerability (0.8 probability).
[Chart: vulnerabilities exploited, by type: old patch available, recent patch available, new vulnerability, misconfiguration.]
The percentage of vulnerabilities that are attacked within one month of the patch release will double from 15 percent in 2003 to 30 percent by 2006 (0.7 probability).
Best Practice #11: Invest in Security; We
Have Met the Enemy, and They Are Us
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability? Most vulnerabilities are known long before an attack happens.
However, it is tremendously expensive and complex for companies to patch software on all desktops and servers.
Patches often require patches themselves, and many patches (at least 30 percent) break at least one corporate application. To many IT operations organizations, patching is seen as riskier than getting attacked.
The security industry likes to hype the possibility of day-zero attacks, that is, attacks that exploit vulnerabilities that no one knew about prior to the attack. However, day-zero attacks will represent less than 5 percent of successful attacks through 2008, as attackers continue to focus on reverse engineering patches to develop exploit code.
Software developers should focus more on reducing configuration errors (making default, out-of-the-box configurations more secure) than on reducing attack surfaces (the external interfaces that provide traction for attacks).
Action Item: The most-important step to increase Internet security is to stop using software that has frequent
security vulnerabilities. Barring that, greatly increase the resources that you apply to security to assure that all
software is safely configured and patched.
IT Change Management Key Components: project change management, and operational change management (request, change implementation and change monitoring).
IT Change Management Benefits: availability, IT/business alignment, customer satisfaction.
Best Practice #12: Invest in IT Change
Management
Strategic Planning Assumption: Through 2008, investments in change management processes
will have the highest impact on IT service levels (0.8 probability).
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability? IT change management is a process that enables an enterprise to
modify any part of its IT and communications environment, and supports the acceptance, approval and
implementation of the modifications. The goal is to enable controlled changes while preserving the integrity
and service quality of the production environment. Business processes rely on IT and expect it to be available
and provide high service quality. Enterprises can't achieve this quality without effective IT change management processes. This includes project change management (managing the design, development, testing and implementation of change) and operational change management (managing the approval, scheduling and coordination of change). Improving change management processes is one of the best investments enterprises can make, as availability can increase by 25 percent to 35 percent. Change management is difficult because it requires changes in human behavior. No longer can changes be made by an individual in isolation; they become public, for the betterment of the enterprise as a whole, to reduce business and technical risk. Reshaping users' behavior requires a significant amount of education to raise awareness. Support by senior management also is
needed to reinforce the consequences of breaching the process. Action Item: Clients must document and
instrument their change management process across development, operations and the lines of business.
Best Practice #13: Invest in Testing
Application Development/Quality Assurance
Infrastructure Changes
Security Patches
Operations/Production Control
Failover/Fail-back
Business Continuity and Disaster Recovery
Configuration Audits
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability?
Availability should be designed in, but testing should provide additional confidence. Testing should be done for all changes in the environment prior to deployment, including application, application integration, infrastructure, software, security patch and operations/production control changes. In addition, all changes should have a roll-back plan, should unforeseen problems occur as a result of the change. Moreover, it is critical to test automated failover architectures to ensure they will work when needed. Failover clustering frequently fails due to out-of-sync conditions; testing helps ensure that configurations are consistent and provides confidence that the architecture will work as planned. Configuration audits (using software that compares a gold configuration to the actual one) can help with this process and enable more consistent configurations, which will improve quality of service. Finally, testing of business continuity plans is vital to ensuring recovery in the event of a disaster scenario, including testing of technology and people processes (for example, crisis management, public relations and interface with the press, damage assessment, invocation of plans, execution of plans and more). Action Item: Invest in comprehensive testing of all changes to the production environment.
Audit configurations for compliance and enforcement.
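The configuration-audit idea (comparing a gold configuration to the actual configuration of each node) amounts to a structured diff. The sketch below is a generic illustration rather than any specific audit product; every key and value in it is invented.

```python
# Illustrative configuration audit: compare a "gold" configuration to the
# actual configuration of each cluster node and report drift.
# All keys and values are invented examples.
GOLD = {
    "os_patch_level": "SP4",
    "db_version": "9.2.0.6",
    "failover_heartbeat_seconds": 5,
    "max_connections": 500,
}

def audit(node_name: str, actual: dict) -> list[str]:
    """Return a list of drift findings for one node; an empty list means compliant."""
    findings = []
    for key, expected in GOLD.items():
        value = actual.get(key, "<missing>")
        if value != expected:
            findings.append(f"{node_name}: {key} = {value!r}, expected {expected!r}")
    return findings

nodes = {
    "db-a": {"os_patch_level": "SP4", "db_version": "9.2.0.6",
             "failover_heartbeat_seconds": 5, "max_connections": 500},
    "db-b": {"os_patch_level": "SP3", "db_version": "9.2.0.6",
             "failover_heartbeat_seconds": 10, "max_connections": 500},
}
for name, cfg in nodes.items():
    for finding in audit(name, cfg) or [f"{name}: compliant with gold configuration"]:
        print(finding)
```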
Strategic Imperative: We estimate that unplanned downtime can be reduced by 20 percent to
40 percent through testing.
Best Practice #14 Case Study: Invest in Reduced
MTTR through Effective 24x7 L1 Support
EDS Case Study Benefits: Improved Incident Response Time and Productivity
L1 operations center (OC) turns an alert into an incident in 7.9 minutes (vs. 14.7 minutes previously)
Reduced alert volume by 72 percent and number of tickets by 25 percent
Enterprise clients supported per OC FTE rose from 1.2 to 4.2
Improved operator job satisfaction
[Diagram: EDS Agile Control Center. The Operations Center (L1) is fed by monitoring rules, restoration procedures, and pre-established SLAs with an inheritance dependency tree, built up through outsourcing client pre-assessment and investment. An integrated control center (i.e., CA, InfoVista, Opsware, SMARTS) and documented procedures enable the OC to focus on fast service restoration, with integrated, automated ticketing and escalation (with information attached); SLA rules dynamically determine severity and process rules for L2/L3.]
Client Issue: What IT process best practices and strategies will enterprises adopt to achieve
continuous IT service availability? To provide more value to its outsourcing clients at a lower cost, EDS has
embarked on a broad strategy for RTI, which it calls the Agile Enterprise. As a foundation, EDS has implemented
the Agile Control Center, which focuses on incident/problem prevention and fast restoration. While most companies have weak support processes, where operators work reactively with too many alerts and inadequate procedures and tools to restore service, EDS implements smarter and more-integrated processes, giving its operators the tools and procedures to do their jobs, thus increasing L1 incident resolution and shortening overall
time-to-repair. To achieve these benefits, EDS invests skilled resources in creating monitors, alert
correlation/suppression rules, restoration procedures, and SLAs associated with IT service topology/resource
dependencies. L1 operators view all alerts from a single pane of glass and have specific procedures to follow when
alerts occur. Drill-down tools are integrated with the alerts (with a "right click"), so, for example, the operator can
determine whether changes were made to a server that is experiencing slow response time. All actions taken are
automatically attached to the incident ticket. Further, ticket severity level is dynamically assigned based on SLAs,
thus ensuring L2/L3 are working on the right priorities. Action Item: Invest in building monitors, correlation and restoration procedures to enable L1 support to achieve lower mean-time-to-repair for IT service outages.
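The dynamic severity assignment described above, where SLA rules determine the priority handed to L2/L3, can be sketched as a small rule lookup. The classes, thresholds and severity numbers below are illustrative assumptions, not EDS's actual rules.

```python
# Illustrative sketch of SLA-driven ticket severity: the severity handed to
# L2/L3 depends on the affected service's SLA class and the observed impact.
# Classes, thresholds and severity labels are invented examples.
SLA_CLASS_BY_SERVICE = {"order-entry": "1-RTE", "supply-chain": "2-Critical", "payroll": "3-Important"}

def ticket_severity(service: str, users_affected: int, service_down: bool) -> int:
    """Lower number = higher severity; tickets are routed to L2/L3 queues accordingly."""
    sla_class = SLA_CLASS_BY_SERVICE.get(service, "3-Important")
    if service_down and sla_class == "1-RTE":
        return 1
    if service_down or (sla_class != "3-Important" and users_affected > 100):
        return 2
    return 3

print(ticket_severity("order-entry", users_affected=5000, service_down=True))   # 1
print(ticket_severity("payroll", users_affected=20, service_down=False))        # 3
```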
Strategic Imperative: Reduce mean-time-to-repair by providing the operations command
center with meaningful tools, skills and procedures.
Recommendations
Develop an availability strategy, plan and architecture (reducing planned and unplanned
downtime) that crosses business units, customer relationship management, applications,
architecture, IT infrastructure and operations.
Design application architectures for continuous availability, reducing complexity where
possible.
Invest in maturing IT management processes toward service-level management. IT
processes cut through the individual department silos. Begin setting targets and
measuring service levels, even if informally.
Don't be complacent; actively test, test, test all changes, integration of changes, switch-over processes and business continuity plans.
To quickly drive availability improvements, set corporate goals and performance metrics for
business units and IT.