Debugging Incidents in Google’s Distributed Systems
How experts debug production issues in complex distributed systems
CHARISMA CHAN AND BETH COOPER

Google has published two books about SRE
(Site Reliability Engineering) principles, best
practices, and practical applications.1,2 In
the heat of the moment when handling a
production incident, however, a team’s actual
response and debugging approaches often differ from
ideal best practices.
This article covers the outcomes of research
performed in 2019 on how engineers at Google debug
production issues, including the types of tools, high-level
strategies, and low-level tasks that engineers use in
varying combinations to debug effectively. It examines
the research approach used to capture data, summarizing
the common engineering journeys for production
investigations and sharing examples of how experts
debug complex distributed systems. Finally, the article
extends the Google specifics of this research to provide
guidance that can be applied to debugging in other organizations.
RESEARCH APPROACH
As this study began, its focus was to develop an empirical
understanding of the debugging process, with the
overarching goal of creating optimal product solutions
that meet the needs of Google engineers. We wanted to
capture the data that engineers need when debugging,
when they need it, the communication process among
the teams involved, and the types of mitigations that are
successful. The hypothesis was that commonalities exist
across the types of questions that engineers are trying to
answer while debugging production incidents, as well as
the mitigation strategies they apply.
To this end, we analyzed postmortem results over the
last year and extracted time to mitigation, root causes,
and correlated mitigations for each. We then selected 20
recent incidents for qualitative user studies. This approach
allowed us to understand and evaluate the processes
and practices of engineers in a real-world setting and to
deep-dive into user behavior and patterns that couldn’t be
extracted by analyzing trends in postmortem documents.
The first step was trying to understand user behavior:
At the highest level, what did the end-to-end debugging
experience look like at Google? The study was broken
down into the following phases (which are unpacked in the
sections that follow):
• Phase 0 – Define a way to segment the incident
responder and incident type populations.
• Phase 1 – Audit the postmortem documentation from
the past year of incidents.
Incident Responders
First, the incident responders (or on-callers) were segmented
into two distinct groups: SWEs (software engineers), who
typically work with a product team, and SREs (Site Reliability
Engineers), who are often responsible for the reliability of
many products. These two groups were further segmented
according to tenure at Google. We found that debugging
behaviors differed across these user cohorts.
Incident Types
We also examined incidents across the following
dimensions, and found some common patterns for each:
• Scale and complexity. The larger the problem’s blast
radius (i.e., its location(s), the affected systems, the
importance of the user journey affected, etc.), the more
complex the issue.
• Size of the responding team. As more people are
involved in an investigation, communication channels
among teams grow, and tighter collaboration and handoffs
between teams become even more critical.
• Underlying cause. On-callers are likely to respond
to symptoms that map to six common underlying issues:
capacity problems; code changes; configuration changes;
dependency issues (a system/service my system/service
depends on is broken); underlying infrastructure issues
(network or servers are down); and external traffic issues.
Our investigation intentionally did not look at security or
privacy incidents.
Phase 3: Mapping the responders’ journeys
This study allowed us to generate snapshots of what
an actual incident investigation lifecycle looks like at
Google. By mapping out each responder’s journey and then
aggregating those views, we extracted common patterns,
tools, and questions asked around debugging that apply
to virtually every type of incident. Figure 1 is a sample
of the visual mapping of the steps taken by each of the
responders interviewed. These building blocks are often
repeated as the user investigates the issue, and each block
can happen in a nonsequential and, sometimes, cyclical order.
Detect
The on-caller discovers the issue via an alert, a customer
escalation, or a proactive investigation by an engineer
on the team. A common question would be: What is the
severity of this issue?
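As a rough illustration of this detect step, the following is a minimal sketch of an error-ratio alert check that would page the on-caller; the function names and the 1 percent threshold are illustrative assumptions, not Google’s actual alerting setup.

```python
# Minimal sketch of an error-ratio alert check (hypothetical names and
# thresholds, not Google's alerting pipeline).

def error_ratio(error_count: int, total_count: int) -> float:
    """Fraction of requests that failed in the last evaluation window."""
    return error_count / total_count if total_count else 0.0

def should_page(error_count: int, total_count: int, threshold: float = 0.01) -> bool:
    """Page the on-caller when more than `threshold` of requests are failing."""
    return error_ratio(error_count, total_count) > threshold

if __name__ == "__main__":
    # e.g. 750 errors out of 50,000 requests -> 1.5% error ratio -> page.
    print(should_page(error_count=750, total_count=50_000))
```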
Triage loop
The on-caller’s goal is to assess the situation quickly by
examining the situation’s blast radius (the severity and
impact of the issue) and determining whether there is a
need to escalate (pull in other teams, inform internal and
external stakeholders). This stage can happen multiple
times in a single incident as more information comes in.
Common questions include: Should I escalate? Do I need
to address this issue immediately, or can this wait? Is this
outage local, regional, or global? If the outage is local or
regional, could it become global (for example, a rollout
contained by a canary analysis tool likely won’t trigger a
global outage, whereas a query of death triggered by a
rollout that is now spreading across your systems might)?
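To make the triage questions concrete, here is a hedged sketch that classifies an outage as local, regional, or global from per-location error ratios and decides whether to escalate; the location names, thresholds, and helper functions are assumptions for illustration, not a Google tool.

```python
# Hypothetical triage helper: classify blast radius from per-location error ratios.
from typing import Dict

UNHEALTHY = 0.05  # assumed: a location counts as "affected" above a 5% error ratio

def classify_blast_radius(error_ratio_by_location: Dict[str, float]) -> str:
    """Return 'none', 'local', 'regional', or 'global' based on affected locations."""
    affected = [loc for loc, ratio in error_ratio_by_location.items() if ratio > UNHEALTHY]
    if not affected:
        return "none"
    fraction = len(affected) / len(error_ratio_by_location)
    if fraction >= 0.75:
        return "global"
    if fraction >= 0.25:
        return "regional"
    return "local"

def should_escalate(scope: str) -> bool:
    """Escalate (pull in other teams, notify stakeholders) for regional or global outages."""
    return scope in ("regional", "global")

if __name__ == "__main__":
    ratios = {"us-east": 0.12, "us-west": 0.09, "europe": 0.01, "asia": 0.00}
    scope = classify_blast_radius(ratios)
    print(scope, should_escalate(scope))  # -> regional True
```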
Investigate loop
The on-caller forms hypotheses about potential issues and
gathers data using a variety of monitoring tools to validate
or disprove theories. The on-caller then attempts to
mitigate or fix the underlying problem. This stage typically
happens multiple times in a single incident as the on-caller
collects data to validate or disprove any hypotheses about
what caused the issue.
Common questions include: Was there a spike in
errors and latency? Was there a change in demand?
How unhealthy is this service? (Is this a false alarm, or
are customers still experiencing issues?) What are the
problematic dependencies? Were there production
changes in services or dependencies?
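Several of these investigate-loop questions reduce to comparing a current metric against a recent baseline window. The sketch below shows one assumed formulation of that comparison; the sample values and the 2x spike factor are illustrative, not drawn from the study.

```python
# Hypothetical hypothesis check: did a metric spike relative to its baseline window?
from statistics import mean
from typing import Sequence

def spiked(current: float, baseline: Sequence[float], factor: float = 2.0) -> bool:
    """True if the current value exceeds `factor` times the baseline average (assumed heuristic)."""
    base = mean(baseline) if baseline else 0.0
    return current > factor * base

if __name__ == "__main__":
    # Last hour of 5-minute error-rate samples vs. the value right now.
    baseline_error_rate = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5]     # errors/sec, hypothetical
    print(spiked(current=2.3, baseline=baseline_error_rate))  # -> True: investigate the errors
    baseline_qps = [1200, 1150, 1250, 1230, 1180, 1210]
    print(spiked(current=1240, baseline=baseline_qps))        # -> False: demand is steady
```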
Mitigate loop
The on-caller’s goal is to determine what mitigation action
could fix the issue. Sometimes a mitigation attempt can
make the issue worse or cause an adverse ripple effect
on one of its dependent services. Remediation (or full
resolution of the issue) usually takes the longest of all the
debugging steps. This step can, and often does, happen
multiple times in a single incident.
Common questions include: What mitigation should be
taken? How confident are you that this is the appropriate
mitigation? Did this mitigation fix the issue?

Resolve/root-cause loop
The on-caller’s goal is to figure out the underlying issue
in order to prevent the problem from occurring again.
This step typically occurs after the issue is mitigated and
is no longer time sensitive, and it may involve significant
code changes. Responders write the postmortem
documentation during this stage.
Common questions include: What went wrong? What’s
the root cause of the problem? How can you make your
processes and systems more resilient?
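As a rough sketch of what the postmortem documentation from this stage might capture (and of the fields this study later mined, such as root cause and time to mitigation), here is a minimal, hypothetical record structure; the field names are assumptions, not Google’s postmortem template.

```python
# Hypothetical minimal postmortem record; field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

@dataclass
class Postmortem:
    summary: str
    detected_at: datetime
    mitigated_at: datetime
    root_cause: str                 # e.g. "config change", "capacity", "dependency"
    mitigation: str                 # e.g. "rolled back config to previous version"
    action_items: List[str] = field(default_factory=list)

    @property
    def time_to_mitigation(self) -> timedelta:
        return self.mitigated_at - self.detected_at

if __name__ == "__main__":
    pm = Postmortem(
        summary="Frontend 500s in two regions after bad flag flip",
        detected_at=datetime(2020, 3, 1, 14, 0),
        mitigated_at=datetime(2020, 3, 1, 14, 20),
        root_cause="config change",
        mitigation="rolled back the flag to its previous value",
        action_items=["add canary analysis for flag flips"],
    )
    print(pm.time_to_mitigation)  # -> 0:20:00
```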
Communication
Throughout the entire process, incident responders
document their findings, work with teammates on
debugging, and communicate outside of their team as
needed.
OBSERVABILITY DATA
In every single interview, on-callers reported that they
relied on some combination of metrics, logs, and traces
while debugging. Even with experienced engineers, great
technology, and powerful tooling, things can—and do—go
wrong in unexpected ways; some responders in this study
wound up unintentionally applying bad changes to production.

An exemplary debugging journey
The following is an example of a successful debugging
session, where the SRE follows best practices and
mitigates a service-critical issue in less than 20 minutes.
While sitting in a meeting, the SRE on-caller receives
a page informing her that the front-end server is seeing
a 500 server error. While she’s initially looking at service
health dashboards, a pager-storm starts, and she sees
many more alerts firing and errors surfacing. She responds
quickly and immediately identifies that her service isn’t
healthy.
She then determines the severity of the issue, first
asking herself how many users are impacted. After
looking at a few error rate charts, she confirms that a
few locations have been hit with this outage, and she
suspects that it will significantly worsen if not immediately
addressed. This line of questioning is referred to as the
triage loop, similar to triage processes used in health care
(for example, emergency rooms that sort patients by
urgency or type of service). The SRE needs to determine
whether to escalate and how quickly the issue must be addressed.
Triage effectively
Once you have the prerequisites of SLOs and accurate
monitoring in place, you need to be able to quickly
determine both the severity of user pain and the total blast
radius. You should also know how to set up the proper
communication channels based on the severity of the issue.
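One way to turn the severity of user pain into a number, assuming an availability SLO is already defined, is an error-budget burn rate. The sketch below shows that calculation under assumed values; the threshold policy at the end is illustrative, not a prescribed rule.

```python
# Hypothetical severity signal: error-budget burn rate against an availability SLO.

def burn_rate(observed_error_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than sustainable the error budget is being spent.

    With a 99.9% SLO the error budget is 0.1%; an observed 1% error ratio
    burns that budget 10x faster than allowed.
    """
    error_budget = 1.0 - slo_target
    return observed_error_ratio / error_budget

if __name__ == "__main__":
    rate = burn_rate(observed_error_ratio=0.01, slo_target=0.999)
    print(round(rate, 2))                       # -> 10.0
    print("escalate" if rate > 1 else "watch")  # assumed policy: burn rate > 1 needs action
```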
Mitigate early
Documenting a set of mitigation strategies that are safe
for your service can help on-callers temporarily fix user-
facing issues and buy your team critical time to identify the
root cause. For more information on implementing generic
mitigations, see “Reducing the Impact of Service Outages
with Generic Mitigations with Jennifer Mace.”5 The ability
to easily identify what changed in your service—either
in its critical dependencies or in your user traffic—is also
helpful in determining what mitigation attempt to move
forward with. As mentioned in the exemplary debugging
case, asking a series of common questions and having
metrics, logs, and traces can help speed up the process of
validating your theories about what went wrong.
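A documented set of generic mitigations can be as simple as a lookup from the kind of change detected to a safe first action. The sketch below uses the underlying-cause categories described earlier in this article, but the specific actions are assumptions about a hypothetical service, not universal recommendations.

```python
# Hypothetical mapping from "what changed" to a documented generic mitigation.

GENERIC_MITIGATIONS = {
    "code change":           "roll back the most recent release",
    "configuration change":  "revert to the last known-good config",
    "capacity problem":      "add capacity or shed low-priority load",
    "dependency issue":      "fail over to a replica or serve cached/degraded results",
    "infrastructure issue":  "drain traffic away from the affected location",
    "external traffic":      "rate-limit or block the offending traffic pattern",
}

def first_mitigation(change_type: str) -> str:
    """Look up the documented first-response mitigation for an observed change type."""
    return GENERIC_MITIGATIONS.get(
        change_type, "escalate: no documented mitigation for this case")

if __name__ == "__main__":
    print(first_mitigation("configuration change"))  # -> revert to the last known-good config
    print(first_mitigation("unknown"))               # -> escalate: no documented mitigation ...
```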
References
1. Beyer, B., Jones, C., Petoff, J., Murphy, N. R., eds. 2016.
Site Reliability Engineering: How Google Runs Production
Systems. O’Reilly Media; https://landing.google.com/sre/books/.
2. Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K.,
Thorne, S., eds. 2018. The Site Reliability Workbook.
O’Reilly Media; https://landing.google.com/sre/books/.