2024 Resiliency Survey Outages Report
Annual outage
analysis 2024
The causes and impacts of IT and data center outages
Avoiding digital infrastructure failures remains paramount for data center owners
and operators. This report analyzes recent Uptime Institute data on IT and data center
outage trends: their causes, costs and consequences.
Authors
Douglas Donnellan, Research Analyst
Andy Lawrence, Executive Director of Research
Key findings
• Data relating to outages should be treated skeptically. All methodologies used to
track the frequency, severity and costs of outages are subject to uncertainty, partly
because of a lack of transparency and reliable reporting mechanisms.
• Uptime research suggests that overall outage frequency and severity continue to
decline. However, cyber-related incidents, which are responsible for many of the
most severe outages, are on the rise, causing extensive and serious disruption.
• Outages are costly. More than half (54%) of the respondents to the 2023 Uptime
Institute data center survey say their most recent significant, serious or severe
outage cost more than $100,000, with 16% saying that their most recent outage cost
more than $1 million.
• Power issues are consistently the most common cause of serious and severe data
center outages. However, network-related issues are the largest single cause of IT
service outages.
• Four in five respondents to the 2023 Uptime Institute data center survey say that
their most recent serious outage could have been prevented with better management,
processes and configuration. This suggests that, as in previous years, there is an
opportunity to reduce outages through training and process review.
• Uptime data suggests that each year there are, on average, 10 to 20 high-profile IT
outages or data center events globally that cause serious or severe financial loss,
business and customer disruption, reputational loss and, in extreme cases, loss of life.
Contents
Introduction
Power outages
Networking outages
System and software outages
The human factor
Cost of outages
Summary
Appendix: Sources and methodology
Figures and tables
Table 1 How Uptime Institute tracks outages
Table 2 Outage severity rating
Figure 1 While majority of operators experienced an outage, most had negligible impact
Figure 2 Physical site redundancy still climbing
Table 3 Publicly reported outages tracked by Uptime, 2016 to 2023
Figure 3 Proportion of publicly reported outages that were serious or severe, 2016 to 2023
Figure 4 Power remains the number one root cause of outages
Figure 5 Most common causes of IT service-related outages
Figure 6 Causes of publicly reported outages 2023
Figure 7 Publicly reported outages by sector, 2016 to 2023
Figure 8 Most say cloud only resilient enough for some workloads
Figure 9 Most common causes of major third-party outages
Figure 10 Most common causes of major power-related outages
Figure 11 Most common causes of major network-related outages
Figure 12 Most common causes of major IT system- / software-related outages
Figure 13 Most common causes of major human error-related outages
Figure 14 Most say outages preventable with better management
Figure 15 Durations of publicly reported outages, 2017 to 2023
Table 4 Ten major outages in 2023 and 2024
Figure 16 Half of impactful outages cost more than $100,000
Introduction
This report on digital infrastructure outage trends is the sixth edition in an ongoing
series from Uptime Institute Intelligence that analyzes IT service resiliency. The analysis
is based on data from a variety of sources, including publicly available reports (e.g.,
information reported in news and social media), Uptime surveys (e.g., the Uptime Institute
Global Survey of IT and Data Center Managers 2023 and the Uptime Institute Data Center
Resiliency Survey 2024) and other data aggregated and anonymized from Uptime
members and partners. Each of these data sources has limitations (see Table 1).
Tracking outages is neither simple nor consistent. Some outages are visible and well
publicized, while others remain confidential. Some managers, staff and customers may be aware
of outages, while others in different roles may not. In addition, some major slowdowns or
disruptions may not be classified as outages. Uptime Institute uses multiple means to track
the overall trends and incidents, but none provide a clear picture on their own.
This table shows the methods used by Uptime (see the Appendix for further information).
Table 1 How Uptime Institute tracks outages
Uptime Abnormal Incident Report (AIRs) database | Good / very good | Detailed, accurate site / facility-level data shared under a non-disclosure agreement | Information primarily facility / site-based. All data anonymous.
Throughout this report we discuss and categorize outages according to their obvious
or perceived severity, using terms such as “serious” or “severe”. The way we categorize
outages according to severity is shown in Table 2.
Table 2 Outage severity rating
3 | Significant | Customer / user service disruptions, mostly of limited scope, duration or effect. Minimal or no financial effect. Some reputational or compliance impact(s).
4 | Serious | Disruption of service and/or operation. Ramifications include some financial losses, compliance breaches, reputational damage and possibly safety concerns. Customer losses possible.
5 | Severe | Major and damaging disruption of services and / or operations with ramifications including large financial losses and possibly safety issues, compliance breaches, customer losses and reputational damage.
Figure 1 While majority of operators experienced an outage, most had negligible impact
On a scale of 1 (negligible) to 5 (severe) how would you classify your data center’s most
impactful outage in the past three years, either in your own facility or because of a third-party
service provider? (n=781)
Did you experience an outage in the past three years? Yes 55%, No 45%
Severe 4%
Serious 6%
Significant 17%
Minimal 32%
Negligible 41%
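As an illustration only, the five-point severity scale and the Figure 1 breakdown can be written down as a small data structure, which makes it easy to total the shares for "serious or severe" (categories 4 and 5). This is a minimal sketch: the names `Severity`, `FIGURE_1_SHARES` and `share_at_or_above` are hypothetical and are not part of Uptime's methodology. The percentages are the ones reported for Figure 1.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Outage severity categories as named in the report (1 = least severe)."""
    NEGLIGIBLE = 1
    MINIMAL = 2
    SIGNIFICANT = 3
    SERIOUS = 4
    SEVERE = 5


# Figure 1 severity breakdown (percent of responses), as reported in the text.
FIGURE_1_SHARES = {
    Severity.NEGLIGIBLE: 41,
    Severity.MINIMAL: 32,
    Severity.SIGNIFICANT: 17,
    Severity.SERIOUS: 6,
    Severity.SEVERE: 4,
}


def share_at_or_above(shares, threshold):
    """Sum the percentage shares for categories at or above `threshold`."""
    return sum(pct for severity, pct in shares.items() if severity >= threshold)


print(sum(FIGURE_1_SHARES.values()))                             # 100
print(share_at_or_above(FIGURE_1_SHARES, Severity.SERIOUS))      # 10
print(share_at_or_above(FIGURE_1_SHARES, Severity.SIGNIFICANT))  # 27
```

On these figures, categories 4 and 5 together account for 10% of the Figure 1 severity responses, and categories 3 to 5 for 27%.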
What do these outage rates tell us? The answer is complicated because the technology is
never static and external factors (such as weather or utility reliability) can play a significant
role. Uptime Intelligence can, however, make the following observations:
• Overall decrease. The data collected over several years suggests that outages are
decreasing relative to the overall rise in IT. This is likely due to many different reasons
(see below).
• No complacency. Although the outage frequency has decreased, there is no room
for (or any sign of) complacency. Instead, there is strong consensus across the
sector, including among regulators, that outage rates are still concerning. The high
financial and/or reputational costs, which can result from a data center outage, mean
that resiliency consistently registers as one of the top concerns among industry
stakeholders and is a strong driver of investment.
• Public cloud. The move to the public cloud does not necessarily mean that there will
be fewer outages. However, it may mean that, for example, “third-party supplier” is
registered as the cause of more IT service disruptions while fewer on-premises data
center outages are recorded.
• Long COVID. The COVID-19 pandemic had a significant impact on the data center
industry, particularly in terms of decreasing and then increasing demand, straining
supply chains and distorting outage rates. These aftershocks are still being felt in
2024, even if indirectly, while their longer-term impact remains unclear. For example,
supply chain disruptions continue to stall capital projects, which has led many
organizations to delay maintenance and infrastructure upgrades. It is possible that
these factors have temporarily reduced the rate of incidents, which can sometimes
cause an outage, and that a rebound effect in the near- to medium-term will be seen.
• Grid instability. There is evidence that the global shift toward more transactive, dynamic
and renewable power grids is reducing, or will reduce, grid reliability. If this is the case,
data centers may experience an increase in outages. Many outages occur when an
uninterruptible power supply (UPS) or generator fails to respond to a grid disruption.
• Climate change. Extreme weather events — such as high and low temperatures,
high winds and floods, and forest fires — exacerbated by climate change have been
associated with data center outages over the past few years. This trend is likely to
intensify and will increase the outage risks until pre-emptive action is taken.
• Adoption of new technology. To comply with anticipated and recently passed
regulations around the reporting and improvement of resiliency and energy
performance, operators may adopt technologies and practices that require careful
management. These may even add new risks, for example:
• The use of distributed, software-based resiliency (i.e., moving traffic and workloads
dynamically) can reduce outage risks and their associated impact over time, but
during an introductory period, these may increase.
• The use of liquid cooling may reduce some thermal risks, but the impact of
component failure may reduce thermal ride-through times, which, in some cases,
increases risk.
Despite the increase in risk factors, Uptime’s annual survey data up to 2023 suggests that
the rate of outages per facility is falling. What could be driving this trend? One factor stands
out: Uptime research finds that, year-on-year, most organizations are investing more in
physical infrastructure redundancy (see Figure 2).
This trend contradicts expectations that multisite approaches will undermine expensive,
physical site redundancy strategies. While the industry may indeed move further toward
distributed and software-based resiliency models, maintaining and increasing site-level
redundancy remains a high priority for most operators.
The proportion of publicly reported outages that were either serious or severe continued
to increase in 2023. Figure 3 shows that category 4 and 5 outages combined increased
by 12 percentage points compared with 2022. This may reflect the growing dependence
of other industries on IT — when outages happen, the consequences are often
widespread and affect millions of users; however, it may also mean that the media have
become less interested in reporting minor outages. Nevertheless, it is likely that major
incidents do cause greater financial and reputational damage than in previous years and,
as such, are more likely to be reported on by public media outlets.
Figure 3 Proportion of publicly reported outages that were serious or severe, 2016 to 2023
Categories shown: Severe (category 5), Serious (category 4), Minimal / significant (categories 2 and 3)
This data suggests that each year there will probably be 10 to 20 high-profile IT outages
or data center events across the world that cause serious or severe financial loss,
business and customer disruption, reputational loss and, in extreme cases, loss of life.
Outage causes
Establishing the root cause of a data center outage is imperative for preventing repeat
instances of disruption and for identifying areas that require greater investment to mitigate
the risks. However, assessing outage data poses challenges due to the multifaceted nature
of most incidents, which often stem from a combination of factors.
Uptime’s annual surveys consistently show that disruptions to on-site power distribution
are the most common cause behind impactful outages (see Figure 4). This is unsurprising
given the intolerance of IT hardware to any significant power disturbances, such as voltage
fluctuations or complete loss of power, that last more than fractions of a second.
Conversely, failures or underperformance of cooling equipment are generally tolerated for
longer durations, often measured in minutes, due to thermal ride-through mechanisms
or network traffic redirection capabilities. While IT-originating failures may occur more
frequently, they often have isolated, minor effects that go unrecorded and primarily impact
specific applications or datasets.
Third-party provider issues have seen a marginal but consistent uptick since 2020, rising
by five percentage points to account for nearly one in 10 outages in 2023. This steady
increase reflects the growing reliance on cloud / hosting, software as a service (SaaS)
and colocation providers.
Figure 4 Power remains the number one root cause of outages
Categories shown include power (52%), cooling and third-party provider.
For a more general view of outage causes that extend beyond (but include) the data center,
Uptime’s annual resiliency survey also asks about the most common causes of any end-
to-end IT service outages, regardless of whether they were the most recent or the most
impactful. The responses have consistently shown that network-related issues are the most
common cause, considerably ahead of power-related outages (see Figure 5).
Figure 5 Most common causes of IT service-related outages
Categories shown include IT system / software, cooling, third-party IT service, no IT service outages and other.
Figure 6 Causes of publicly reported outages 2023
Categories shown include IT (software / configuration), fiber, fire, cooling and network (cabling).
Figure 7 Publicly reported outages by sector, 2016 to 2023
Sectors shown include telecoms and transportation.
(Only top 6 outage categories are shown.)
(Category 1, negligible, outages and outages in which the root cause was not determined or disclosed are omitted for all years.)
Figure 8 Most say cloud only resilient enough for some workloads
Regarding public cloud services, do you think public cloud is resilient enough to run all of your
organization’s mission-critical IT workloads, run most of them, run only some of them, or is public
cloud not resilient enough to run any of your organization’s mission-critical IT workloads? (n=441)
In the past three years, many organizations have backed away from a “cloud-first”
strategy and are taking a more cautious, selective approach. This has shown up in Uptime
survey data, which, to some extent, has predicted the slowdown of large-scale corporate
migrations to the cloud.
The main reasons preventing organizations from further adopting cloud services
for critical applications have shifted over the years. Rather than a lack of clarity and
transparency regarding providers’ operational resiliency (as it has been in previous
years), most operators (64%, n=240) in 2024 cite data security concerns as the main
barrier to increasing the adoption of public cloud services. Notably, only one in five (20%)
respondents cite resiliency concerns as their organization’s main deterrent.
It is likely that these data security concerns follow widely publicized cyberattacks on some
cloud providers, in which services were taken offline and confidential information was
compromised. Indeed, one in five (20%) operators that experienced an outage due to a
third-party IT provider cite a malicious cyberattack as the root cause — an uptick of seven
percentage points from 2022. However, software or configuration errors from third-party
IT service providers are more than three times (62%) as likely to cause an outage as
cybersecurity issues (see Figure 9).
Figure 9 Most common causes of major third-party outages
Software or configuration error 62%
Networking / connectivity issues 35%
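The "more than three times" comparison above is simple arithmetic on the two reported shares, and the seven-percentage-point uptick implies the earlier cyberattack share. The sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Shares of third-party IT provider outages by cited root cause (from the text).
SOFTWARE_OR_CONFIG_ERROR_PCT = 62
CYBERATTACK_PCT = 20
CYBERATTACK_UPTICK_PP = 7  # percentage-point rise compared with 2022

# How many times more often software/configuration errors are cited.
ratio = SOFTWARE_OR_CONFIG_ERROR_PCT / CYBERATTACK_PCT

# Share implied for 2022 by subtracting the percentage-point uptick.
cyberattack_pct_2022 = CYBERATTACK_PCT - CYBERATTACK_UPTICK_PP

print(f"software/config vs cyberattack: {ratio:.1f}x")            # 3.1x
print(f"implied 2022 cyberattack share: {cyberattack_pct_2022}%")  # 13%
```

Note that the uptick is expressed in percentage points, so it is subtracted directly from the share and not applied as a relative change.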
Power outages
Power-related disruptions often lead to the most severe downtime incidents. While diagnosis
and restoration of power are often quick, restarting the IT equipment and synchronizing
the databases can take several hours — assuming these systems were not damaged during
the disruption.
Uptime’s annual survey data consistently shows that power-related issues are not only the
most common cause of impactful outages for data centers (see Outage causes) but also
represent a growing share of overall outages year on year.
Challenges with electrical grids may exacerbate this trend in the years ahead. Grid reliability is
under threat due to a combination of factors, including aging infrastructure and transmission
systems, escalating demand, the decommissioning of older power generation plants, severe
weather events and an increasing reliance on intermittent renewable energy sources.
In terms of power-related outages at their site, 30% of operators who responded to the
2024 Uptime resiliency survey suffered an impactful outage caused by a problem with
power systems in the past three years (see Figure 10). UPS issues are cited as the most
common underlying cause of these outages, as they have been in every year that we
have conducted this survey.
There are several reasons for this. Uptime engineers working across many data centers
report the following as the most common problems with static UPS systems:
• Fans fail frequently because they are usually inexpensive and constantly in operation.
A single fan failure does not take down a unit, but the failure of multiple fans may.
• Snubber capacitors can fail from wear and tear. Regular preventative maintenance
will reduce the number of failures.
• Batteries fail because of age. They require good management, close monitoring
and adherence to replacement schedules. Many batteries fail because they are not
monitored closely enough by experienced technicians.
• Inverter stack failures are least common. These are more likely to occur when the unit
is overloaded, although wear and tear can also cause failures.
UPS problems become more frequent as equipment ages, so supply chain /
replacement problems may lead to more failures. Operators of data centers without
trusted concurrently maintainable designs (the ability to bypass any item of equipment
for maintenance without interrupting overall service) may be more likely to postpone
maintenance or replacement.
Generators are reliable, but require regular scheduled maintenance, fuel checks and testing.
Automatic transfer switch (ATS) units are generally robust, but failures may occur with active
controls or with a loss of direct current (DC) power to those controls. Other less common
failures are caused by mechanical issues, such as worn-out bearings or a jammed switch.
Networking outages
Networking issues have caused an increasing portion of IT service outages in recent years.
The 2024 Uptime resiliency survey finds that the two most common causes of networking-
or connectivity-related outages are configuration / change management failure and third-
party network provider failures — similar numbers to previous years (see Figure 11).
As demand patterns for applications evolve, data center networks also undergo
corresponding changes. The growing use of virtualization to accommodate this demand
heightens the reliance on software components, such as management, monitoring and
automation systems.
These tools can help prevent network-related incidents that are caused by human error.
However, they require script modifications when network changes occur, which can lead to
errors during reconfiguration. When organizations utilize multiple hardware vendors, this
becomes more challenging because it requires further maintenance and the adaptation of
multiple scripts with each introduced change.
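The script-maintenance burden described above can be illustrated with a small sketch in which one vendor-neutral change is rendered through per-vendor templates. Everything here is hypothetical: `vendor_a`, `vendor_b`, the command syntaxes, `TEMPLATES` and `render_change` are invented for illustration and do not correspond to any real device or tool. The point is that every new change type needs a template per vendor, and a missing one should fail loudly before anything is pushed to the network.

```python
# Made-up command syntaxes for two fictional vendors (not real device commands).
# Each vendor needs its own template for every supported change type.
TEMPLATES = {
    "vendor_a": {"vlan": "ADD-VLAN id={vlan_id} label={name}"},
    "vendor_b": {"vlan": "vlan.create({vlan_id}, '{name}')"},
}


def render_change(change, inventory):
    """Render one vendor-neutral change into a command string per device.

    Raises KeyError if any device's vendor has no template for the change
    type, so a script that was not updated is caught before deployment.
    """
    commands = {}
    for device, vendor in inventory.items():
        vendor_templates = TEMPLATES.get(vendor, {})
        if change["type"] not in vendor_templates:
            raise KeyError(
                f"no {change['type']!r} template for {vendor!r} ({device})"
            )
        commands[device] = vendor_templates[change["type"]].format(**change["params"])
    return commands


# Hypothetical inventory and change request.
inventory = {"edge-1": "vendor_a", "edge-2": "vendor_b"}
change = {"type": "vlan", "params": {"vlan_id": 120, "name": "storage"}}

for device, command in render_change(change, inventory).items():
    print(device, "->", command)
```

With two vendors, a new change type means two templates to write and test; the number grows with each additional vendor, which is the maintenance cost the text refers to.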
Figure 11 Most common causes of major network-related outages
Response split shown: Yes 46%, No 46%, Don’t know 8%
Configuration / change management failure 46%
Third-party network provider failure 42%
Hardware failure 33%
Firmware / software error 20%
Line breakages 19%
Network / congestion failure 15%
Corrupted firewall / routing tables issues 13%
Cyberattack 12%
Weather-related incident 7%
Configuration errors, firmware errors and corrupted routing tables all play a significant role in
network-related failures, while the more traditional worries of weather and cable breaks take
on a much smaller role overall. Congestion and capacity issues can also cause failures, but
these are often the result of programming or configuration issues.
In a complex and high-throughput environment, small errors can propagate across networks,
resulting in cascading failures that can be difficult to stop, diagnose and fix. The high number
of network or software failures has clearly contributed to the rise in telecommunications
failures in publicly reported outages.
Figure 12 Most common causes of major IT system- / software-related outages
Configuration / change management issue 60%
Capacity / congestion issue 21%
Cyberattack / security issue 16%
Don’t know 12%
Software problems primarily arise from configuration and change management issues,
patches, upgrades and other changes, which can lead to instability and unanticipated
errors. These errors become more difficult to contain once they have propagated across
networks. Hardware and software faults are less likely than configuration and change
management issues to cause an outage — but taken together, they still contribute to a
significant number of outages.
Compared with 2022, cyberattacks, including ransomware and distributed denial of
service (DDoS) attacks, increased by six percentage points. When such incidents occur, the
consequences can be severe, leading to data loss, financial loss and reputational damage.
While even the most robust methods of training and effective processes for staff cannot
prevent all possible failures, nearly four in five (78%) operators report that better
management and processes would have prevented their organization’s most recent
downtime incident (see Figure 14). This proportion has been highly consistent. Every year
since 2020, at least 75% of operators have noted the preventability of their most recent
outage; this suggests that there is a major opportunity to make significant reductions
in downtime.
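Figure 14 Most say outages preventable with better management
Yes 78%, No 22%
Figure 15 Durations of publicly reported outages, 2017 to 2023
Duration bands shown: 0-4 hours, 4-12 hours, 12-24 hours, 24-48 hours, >48 hours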
Cost of outages
Even though operators appear to have reduced the likelihood of the most serious
and severe outages, those that do happen tend to be expensive. More than half of the
respondents (54%) to Uptime’s 2023 annual survey say their most recent significant,
serious or severe outage cost more than $100,000, with 16% saying that their most recent
material outage cost more than $1 million (see Figure 16).
These cost figures are slightly lower than in previous years. However, as noted in
the Outages frequency and severity section of this report, changes to the survey’s
methodology may have affected year-on-year comparisons.
The high cost of outages stems from multiple factors, including inflation, penalties for
breaching service level agreements (SLAs), labor costs, callouts and the expense of
replacing parts. However, the primary driver is the increasing reliance of corporate
economic operations on digital services and data centers. The failure of critical IT services
frequently results in immediate business disruption and revenue loss.
Uptime does not calculate an average cost of outages because the insights gained are
rarely useful — businesses and outage impacts vary widely. Each year, a few large outlier
outages are so costly that they can distort the overall picture. Some result in compensation,
fines and lost business, with costs adding up to millions or even tens of millions of dollars.
The high costs resulting from outages will likely increase over time as the dependency on
digital services also rises. Stronger SLAs, expected by some businesses because of this
growing reliance, could make outages even more costly, as will more and higher regulatory
fines and compensation for customers who experience a service disruption. This, in turn,
justifies better analysis of the causes and costs of outages and the continued or increased
investment in resiliency.
Figure 16 Half of impactful outages cost more than $100,000
Under $100,000: 46%
$100,000 to $1 million: 38%
Over $1 million: 16%
Source: Uptime Institute Global Data Center Survey 2023
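As a quick consistency check, the three cost bands reported for Figure 16 can be summed to confirm that they cover all responses and that the "more than half (54%)" figure is the sum of the two upper bands. The dictionary keys below are illustrative labels, not survey terminology.

```python
# Figure 16 cost bands (percent of respondents), as reported in the text.
COST_BANDS_PCT = {
    "under_100k": 46,
    "100k_to_1m": 38,
    "over_1m": 16,
}

# Share of respondents whose most recent outage cost more than $100,000.
over_100k = COST_BANDS_PCT["100k_to_1m"] + COST_BANDS_PCT["over_1m"]

print(sum(COST_BANDS_PCT.values()))  # 100
print(over_100k)                     # 54
```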
Summary
High availability and resiliency (which means outage prevention and effective recovery) are
priorities for all involved in the digital infrastructure supply chain. It is sometimes assumed
that progress in this area is as reliable as Moore’s law has been in the past three decades.
This is not the case: Uptime’s data shows that progress is gradual, hard-won and — when
failures occur — expensive.
This year’s data shows that progress is being made in improving the outage rate relative to
overall IT capacity. This can be attributed to a range of measures: greater investment, the
combined effects of software-based resiliency and on-site physical redundancy, improved
training, the outsourcing and greater professionalism of some third-party operators, and
overall continuing vigilance.
But Uptime’s research also identifies several trends in 2024 that could undermine
progress in outage rates and overall resiliency.
First, the widespread adoption of distributed architectures — where IT workloads are
spread across multiple sites — aims to mitigate localized failures. However, Uptime data
suggests that this shift may play a role in the increase in network, software or system-
related incidents. This will likely be a transitional effect.
Second is the increase in system and network complexity, which includes the use of more
automated and remote management at a facility level, supported by software-based
optimization strategies. The greater use of both on-site and distributed systems enables
data center operators to improve maintenance, energy management, capacity planning
and incident monitoring — but it also requires greater integration with IT and operational
technology (OT) systems, a heightened awareness of security risks and rigorous testing.
The greater use of these systems can create additional access pathways and widen the
attack surface for malicious actors, which may be a factor in the notable rise in outages
caused by cybersecurity incidents.
Third is the ongoing challenge of recruiting and training staff as well as the establishment
of proven management processes. To make significant progress in reducing downtime,
owners and operators will have to allocate increased efforts and resources to these areas.
Finally, there are two areas of increasing risk that are external to the data center itself:
grid stability, which may deteriorate due to high demand and the transition to intermittent,
renewable energy; and the effects of climate change (notably, extreme weather). While
operators can do little about these directly, there are many steps that they can take to
reduce their exposure.
In summary, outage prevention requires ongoing vigilance and investment — and currently,
the digital infrastructure industry is on an improving trajectory. Robust data center design,
detailed attention to IT architectures and topology, physical infrastructure redundancy,
testing, improved training and continuous review will continue to be necessary if this is
to be maintained.
Appendix
Uptime compiles a database of public outages, which is used for some of the findings in
this report. However, it is not a comprehensive list of all the outages that have occurred.
There are some limitations that should be considered:
• If a failure is not reported or picked up by the media or Uptime, it will not be recorded.
This immediately means there is a bias toward coverage of large, public-facing IT
services, particularly in geographies with well-developed and open media.
• Uptime limits failures to those that had a noticeable impact on end users — a major
fire during data center commissioning, for example, may never be registered. All
category 1 outages have been eliminated, i.e., small, short failures where the business
or reputational impact is negligible.
• The amount of information available varies widely from outage to outage. In some
of the analyses, it has been necessary to include outages for which the cause is “not
known” — which means information was never disclosed.
• Although IT system failures are included, cybersecurity breaches are generally not
included, apart from those that lead to complete service interruptions.
Douglas Donnellan
Douglas Donnellan is a Research Analyst at Uptime Intelligence covering sustainability in
data centers. His background includes environmental research and communications, with
a strong focus on education.