Incident Metrics in SRE
Critically Evaluating
MTTR and Friends
Štěpán Davidovič
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Incident Metrics
in SRE, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
The views expressed in this work are those of the author, and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts to
ensure that the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions, includ‐
ing without limitation responsibility for damages resulting from the use of or reli‐
ance on this work. Use of the information and instructions contained in this work is
at your own risk. If any code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual property rights of oth‐
ers, it is your responsibility to ensure that your use thereof complies with such licen‐
ses and/or rights.
This work is part of a collaboration between O’Reilly and Google. See our statement
of editorial independence.
978-1-098-10313-2
Incident Metrics in SRE
Abstract
Measuring improvements as a result of a process change, product
purchase, or technological change is commonplace. In reliability
engineering, statistics such as mean time to recovery (MTTR) or
mean time to mitigation (MTTM) are often measured. These statis‐
tics are sometimes used to evaluate improvements or track trends.
In this report, I use a simple Monte Carlo simulation process (which
can be applied in many other situations), as well as statistical analy‐
sis, to demonstrate that these statistics are poorly suited for decision
making or trend analysis in the context of production incidents. In
their place, I propose approaches better suited to answering the same
questions in some contexts.
Introduction
One of the key responsibilities of a site reliability engineer (SRE) is
to manage incidents of the production system(s) they are responsi‐
ble for. Within an incident, SREs contribute to debugging the sys‐
tem, choosing the right immediate mitigation, and organizing the
incident response if it requires broader coordination.
But the responsibility of an SRE is not limited just to managing inci‐
dents. Some of the work involves prevention, such as devising
robust strategies for performing changes in production or automati‐
cally responding to problems and reverting the system to a known-
safe state. The work also includes mitigation, such as better
processes for communication, improvements in monitoring, or
development of tooling that provides assistance during debugging of
the incident. As a matter of fact, there are products dedicated to
improving the process of incident response.
You want your incidents (if you must have any at all!) to have as lit‐
tle impact as possible. That often means short incident durations,
which I’ll focus on here. Understanding how a process change or a
product purchase shortens the durations of incidents is important,
especially if there are real costs associated with the incidents. How‐
ever, we can’t jump to conclusions from a single incident; an analysis
of a whole body of incidents is required.
A quick search with your favorite search engine might reveal many
articles that state that MTTx metrics (including mean time to recov‐
ery and mean time to mitigation) should be considered the key per‐
formance indicators of your service’s reliability. These articles are
sometimes authored by high-profile companies with a track record
of delivering their services reliably or providing reliability-related
tooling. But are these metrics good indicators of reliability? In fact,
are they indicators that can even be used at all? How can you tell?
When applying MTTx metrics, the goal is to understand the evolu‐
tion of the reliability of your systems. But the reality is that applying
these metrics is trickier than it seems, and these popular metrics are
dangerously misleading in most practical scenarios.
This report will show that MTTx is not useful in most typical SRE
settings, for reasons that apply to many summary statistics and do
not depend on company size or strictness of enforcement of produc‐
tion practices. Whatever metric you choose to use, it is important to
test that it can give you robust insights regardless of the shape of the
incident duration distribution. There may not be a “silver bullet”
metric that could serve as a general-purpose replacement where
MTTx is currently considered, but you may have more success in
measurement by tailoring the metric to the question at hand. I’ll end
this report by exploring some alternative methods for achieving
these measurements.
2 John Allspaw, “Moving Past Shallow Incident Data”, Adaptive Capacity Labs, March 23,
2018.
3 “Mean time to recovery”, Wikipedia.
I also collected incident data from Google (Figure 3), and Google’s
data set—in my analysis—represents a very large company focused
on internet services. The Google data set was collected over a one-
year period—shorter than any of the data sets shown in Figure 2—
but it also contains internal incidents (e.g., those impacting only
developer productivity). I cannot share the numbers, but Google’s
incident data set is several times larger than any of the three public
data sets, as expected given the company size.
4 Laura Nolan, “What Breaks Our Systems: A Taxonomy of Black Swans” (video), SRE‐
con19 Americas, March 25, 2019.
5 “Normal probability plot”, Wikipedia.
6 See “A List of Post-mortems!” and “Postmortem Index”.
Analyzing Improvements
All right, you’ve got a clear picture of what your incident durations
look like. Now it’s time to make your incidents shorter!
Imagine you are offered a reliability-enhancing product that helps
you shorten the mitigation and resolution time of incidents by 10%.
For example, a daylong incident shrinks to a little over 21 and a half
hours. You are offered a trial to evaluate the product. How can you
tell that the product delivers on its promises? This report explores
the use of MTTR and similar metrics, so that’s the metric I’ll use.
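Spelled out, that promise applied to a daylong incident is:

$$24\,\text{h} \times (1 - 0.10) = 21.6\,\text{h} = 21\,\text{h}\,36\,\text{m}$$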
I chose this artificial scenario intentionally because its structure
matches many real-world situations. Whether you are changing a policy,
developing software, or introducing a new incident-management
process, the objective is often to shorten your incidents and try to
evaluate the success of the change.
7 Notice, for example, that Company C has incident durations often aligned with whole
hours, and this manifests as spikes on the graph.
To gauge whether a product delivers on its promise of shortening
the incident duration by 10%, you could set a threshold of a 10%
decrease in MTTR compared to before you began using the product.
A looser test is to require any improvement at all. You would decide
that the product is successful if you see any shortening of incidents
at all, regardless of magnitude.
You want to have a crisp understanding of how you expect the met‐
ric to behave and be confident that the chosen metric (such as
MTTR) faithfully measures what you want it to measure. There
would be real and severe risks and costs if you were to rely on a poor
metric. These can be direct, such as purchasing a product for the
wrong reasons, but they can also be very subtle. For example, your
employees’ morale may suffer upon realizing that their incident-
management efforts are evaluated using unproven or suspect
metrics.
8 A simulation done by repeated sampling to model a behavior—in this case, the behav‐
ior of incident resolution times.
You are drawing two samples, with sizes N1 and N2, where N1 = N2. The
50/50 split gives the strongest analysis; I will briefly touch on why in
“Analytical Approach” later in this report.
Simply put, you visit thousands of parallel universes where you sim‐
ulate that the product delivers on its promises and compare the
resulting MTTR against the incidents that weren’t treated. Mechani‐
cally, this can be done using tools such as a Python script and a CSV
file with the data or a sufficiently capable SQL engine, and does not
require any specialized tooling or additional knowledge.
You are now operating on probabilities, so you need to add one
more condition to your test: some tolerance of random flukes. Let’s
say that you’re tolerating up to 10% of these parallel universes to
mislead you. More formally, you might recognize this as requiring
statistical significance α = .10. This is arguably a generous value.
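As a concrete illustration of those mechanics, here is a minimal Python sketch of such a Monte Carlo test. The CSV file name, the column name, and the 50-incidents-per-group split are placeholder assumptions, not taken from any of the data sets discussed in this report:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per incident, with a duration in minutes.
# The file name and column name are placeholders for your own data.
durations = pd.read_csv("incidents.csv")["duration_minutes"].to_numpy()

rng = np.random.default_rng(seed=42)
n_universes = 10_000   # number of simulated "parallel universes"
sample_size = 50       # incidents per group, i.e., a 50/50 split of 100 incidents
improvement = 0.90     # the product's promised 10% reduction in duration

deltas = []
for _ in range(n_universes):
    # Untreated group: incidents sampled as they actually happened.
    control = rng.choice(durations, size=sample_size, replace=True)
    # Treated group: pretend the product shortened every incident by 10%.
    treated = rng.choice(durations, size=sample_size, replace=True) * improvement
    deltas.append(treated.mean() - control.mean())

deltas = np.array(deltas)
print("universes where MTTR appears to drop at all:",
      np.mean(deltas < 0))
print("universes where MTTR appears to drop by >= 10%:",
      np.mean(deltas <= -0.10 * durations.mean()))
```

Running the same loop with improvement = 1.0 yields the distribution of MTTR changes produced by chance alone, which is the comparison the rest of this section focuses on.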
Table 1. Incident count, mean, and standard deviation across the three data sets.

                     Company A   Company B   Company C
Incidents (all)      779         348         2157
Incidents (2019)     173         103         609
Mean TTR             2h 26m      2h 31m      4h 31m
Standard deviation   5h 16m      5h 1m       6h 53m
9 As of late summer 2020, I felt that just using the last 12 months could lead to an
unusual data set, swayed by world events.
Having performed the simulation, I plotted it out to see what hap‐
pens (Figure 5).
We’ve learned that even without any intentional change to the inci‐
dent durations, many simulated universes would make you believe
that the MTTR got much shorter—or much longer—without any
structural change. If you can’t tell when things aren’t changing,
you’ll have a hard time telling when they do.
The simulation setup is the same as before, except that now the incident
durations keep coming from the same distribution (not changed by
any incident-handling improvement), and you evaluate the typical
change in the statistics.
From here on, I will simplify the discussion and focus only on the
scenario of showing what the change in MTTR can be if nothing
changed the incidents, forgoing the analysis of improvements.
Consequently, what’s most interesting is the shape of the resulting
distribution: put bluntly, we want to know how flat it gets.
The variance of the sample mean converges to

$$\sigma^2_{\text{sample mean}} = \frac{\sigma^2_{\text{incidents}}}{N}$$
in the limit. In line with your intuition, this indicates that the var‐
iance seen in the observed MTTR value decreases as the sample size
(i.e., the incident count) increases. That can easily be demonstrated.
Table 2 shows 90% confidence intervals for MTTR for several inci‐
dent counts.
Recall that you are drawing two samples from the incident-duration
distribution. So if you are trying to find out how good an analysis
you can make with N incidents total, you draw two samples with
size N1 and N2, where N1 = N2.
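Numbers in the spirit of Table 2 and Figure 7 can be produced for your own incidents with a sketch like the following; it reuses the same hypothetical CSV as the earlier sketch, and the list of sample sizes is arbitrary:

```python
import numpy as np
import pandas as pd

# Same hypothetical input as above: one duration (in minutes) per incident.
durations = pd.read_csv("incidents.csv")["duration_minutes"].to_numpy()

rng = np.random.default_rng(seed=42)
n_universes = 10_000

def null_mttr_change_interval(durations, n_per_group, ci=0.90):
    # Distribution of the apparent MTTR change when nothing actually changed.
    deltas = []
    for _ in range(n_universes):
        a = rng.choice(durations, size=n_per_group, replace=True)
        b = rng.choice(durations, size=n_per_group, replace=True)
        deltas.append(b.mean() - a.mean())
    return np.percentile(deltas, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])

for n in (10, 25, 50, 100, 200):
    lo, hi = null_mttr_change_interval(durations, n)
    print(f"N = {n:3d} per group: 90% of chance-only MTTR changes fall in "
          f"[{lo:+.0f}, {hi:+.0f}] minutes")
```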
11 See the chapters “Sampling Distribution of the Mean,” “Sampling Distribution of Dif‐
ference Between Means,” “Testing of Means,” and others in Online Statistics Education,
project leader David M. Lane, Rice University.
Figure 7. Decrease of width of 90% confidence interval as sample size
increases.
While the results for Company A and Company B were fairly similar
in MTTR for these percentile measures, you can see the impact of
the differences between their incident durations.
Geometric mean
Another aggregate statistic you might be interested in is the geometric
mean, which is calculated as $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$.
This is especially appealing given that the incident duration distribution
isn’t too far off from a lognormal distribution, and the geometric mean is
to the lognormal distribution what the arithmetic mean is to the normal
distribution. As before, this can be simulated quickly (Table 5).
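As a small, made-up illustration, the geometric mean can be computed with SciPy's gmean (or equivalently by exponentiating the mean of the logarithms) and dropped into the simulation wherever the arithmetic mean was used:

```python
import numpy as np
from scipy.stats import gmean

durations = np.array([12.0, 45.0, 90.0, 240.0, 1440.0])  # minutes; made-up values
print("arithmetic mean:", durations.mean())   # pulled up strongly by the 1440m outlier
print("geometric mean: ", gmean(durations))   # same as np.exp(np.log(durations).mean())
```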
12 Depending on your business, this reasoning might be flawed. Consider that having a
single one-hour-long incident a month might impact your users (and your business)
very differently than 60 one-minute-long incidents. This same concern also applies to
the commonly used service level objective language.
Counting incidents
This report discusses whether you can detect improvement in han‐
dling of incidents, focusing on analyzing how an incident is
resolved. Going from having an incident to not having an incident
at all is outside the scope of this report.
However, since I’ve gathered all this data, at the very least I can take
a brief look at the data sets to understand the behavior of the inci‐
dent counts over time. I will not attempt a deeper analysis here.
The incident count is just as erratic as incident durations. Even
aggregated to whole years, as shown in Figure 8, the values jump
around wildly. At the resolution of months or quarters, it is even
worse. At best, some egregious trends can perhaps be gleaned from
this graph: Company C has seen a steep increase in its incident
count in 2019 (the trend continues into 2020, not shown in graph)
compared to years prior. This trend is only apparent at a multiyear
time scale, which is especially visible when compared to the erratic
trends of Company A and Company B.
Figure 8. Incidents per year, as proportion of total incident count, per
company. Incomplete years (year 2020 when the data was collected
and the first year of each data set) were excluded.
Analytical Approach
So far, I’ve been using Monte Carlo simulations. However, you can
also take an analytical approach. Could you rely on the central limit
theorem and calculate the confidence intervals rather than simulate
them? Well, sometimes.
The central limit theorem says that the distribution of the sample
mean will tend toward normal distribution in the limit. However,
with incidents being an infrequent occurrence, there might be so
few of them that the central limit theorem does not even apply yet.
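One way to judge whether your incident volume is large enough is, again, to simulate: draw many samples of your actual size from a similarly skewed distribution and check how normal the resulting sample means look. The sketch below uses a lognormal stand-in for incident durations; its parameters are arbitrary assumptions, not values fitted to any of the data sets:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
# Skewed stand-in for incident durations; the parameters are arbitrary.
population = rng.lognormal(mean=5.0, sigma=1.2, size=100_000)

for n in (10, 30, 100, 500):
    # 10,000 simulated samples of n incidents each; one sample mean per row.
    sample_means = rng.choice(population, size=(10_000, n), replace=True).mean(axis=1)
    skewness = stats.skew(sample_means)
    _, p_value = stats.normaltest(sample_means)   # D'Agostino-Pearson normality test
    print(f"n = {n:3d}: skew of sample means = {skewness:+.2f}, "
          f"normality p-value = {p_value:.2g}")
```

With a distribution this skewed, the sample means remain visibly skewed even at a few hundred incidents per sample, which is exactly the caveat raised above.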
13 Rick Branson, “Stop Counting Production Incidents”, Medium, January 31, 2020.
The variance of the sample mean converges to:¹⁶

$$\sigma^2_{\text{sample mean}} = \frac{\sigma^2_{\text{incidents}}}{N}$$
The difference of two independent normally distributed variables is itself
normally distributed, with a variance equal to the sum of the two variances:¹⁷

$$\sigma^2_{A-B} = \sigma^2_A + \sigma^2_B$$
For this case, where variance and sample size are the same for both
sample mean normal distributions, this gives:
$$\sigma^2_{A-B} = \frac{2\sigma^2_{\text{incidents}}}{N}$$
This also explains the previous assertion that a 50/50 split is the best
choice, since a different ratio of sample sizes would lead to a greater
variance and therefore worse results.
You can then apply a two-tailed z-test. You can expand the z-test
formula; knowing that the distribution mean is 0, you are looking
for a given change in MTTR and also expanding it with the variance
calculation:
$$z = \frac{\Delta\text{MTTR}}{\sqrt{\dfrac{2\sigma^2}{N}}}$$
You can also turn it around: you can look up the corresponding z-
score (the z-score for a two-tailed test at our α = .10 is ~1.644) and
find the confidence interval of the MTTR change:
$$\pm\Delta\text{MTTR} = \pm z\sqrt{\frac{2\sigma^2}{N}}$$
16 See the chapters in Online Statistics Education, as well as “Distribution of the sample
mean”, Wikipedia.
17 Eric W. Weisstein, “Normal Difference Distribution”, from MathWorld—A Wolfram
Web Resource, updated March 5, 2021.
For example, with Company A’s standard deviation of 5h16m and N = 50:

$$\pm\Delta\text{MTTR} = \pm 1.644\sqrt{\frac{2 \cdot (5\text{h}16\text{m})^2}{50}} = \pm 1\text{h}\,44\text{m}$$
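The same lookup and arithmetic take only a few lines of Python, reproducing the ±1h44m figure; SciPy is used here only for the z-score:

```python
import math
from scipy.stats import norm

alpha = 0.10               # tolerance for random flukes, as above
sigma = 5 * 60 + 16        # Company A standard deviation, in minutes (5h16m)
n = 50                     # sample size used in the worked example above

z = norm.ppf(1 - alpha / 2)          # ~1.64 for a two-tailed test at alpha = 0.10
half_width = z * math.sqrt(2 * sigma**2 / n)
print(f"z = {z:.3f}, +/- Delta MTTR = {half_width:.0f} minutes (about 1h44m)")
```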
Conclusion
I have demonstrated that even in a favorable analysis setup, MTTx
cannot be used for many practical purposes where it has been adver‐
tised to be useful, such as evaluating reliability trends, evaluating
results of policies or products, or understanding the overall system
reliability. The operators of systems, whether DevOps engineers or SREs,
should move away from defaulting to the assumption that MTTx is useful. Its
application should be treated with skepticism, unless its applicability
has been shown in a particular situation.
Acknowledgments
The author would like to thank Kathy Meier-Hellstern for her
review, advice, and suggestions; Ben Appleton for his review of this
work as well as of some initial work leading up to this text; Michael
Brundage for further review and for inspiring additional analysis;
Scott Williams for further review; and Cassie Kozyrkov for her work
to make statistical thinking an increasingly accessible topic.
About the Author
Štěpán Davidovič is a site reliability engineer at Google. He cur‐
rently works on internal infrastructure for automatic monitoring. In
previous Google SRE roles, he developed Canary Analysis Service
and has worked on both a wide range of shared infrastructure
projects and AdSense reliability. He obtained his bachelor’s degree
from Czech Technical University, Prague, in 2010.