
Debugging Incidents in Google’s Distributed Systems

How experts debug production issues in complex distributed systems

CHARISMA CHAN AND BETH COOPER

Google has published two books about SRE
(Site Reliability Engineering) principles, best
practices, and practical applications [1,2]. In
the heat of the moment when handling a
production incident, however, a team’s actual
response and debugging approaches often differ from
ideal best practices.
This article covers the outcomes of research
performed in 2019 on how engineers at Google debug
production issues, including the types of tools, high-level
strategies, and low-level tasks that engineers use in
varying combinations to debug effectively. It examines
the research approach used to capture data, summarizing
the common engineering journeys for production
investigations and sharing examples of how experts
debug complex distributed systems. Finally, the article
extends the Google specifics of this research to provide
some practical strategies that you can apply in your
organization.

RESEARCH APPROACH
As this study began, its focus was to develop an empirical
understanding of the debugging process, with the
overarching goal of creating optimal product solutions
that meet the needs of Google engineers. We wanted to
capture the data that engineers need when debugging,
when they need it, the communication process among
the teams involved, and the types of mitigations that are
successful. The hypothesis was that commonalities exist
across the types of questions that engineers are trying to
answer while debugging production incidents, as well as
the mitigation strategies they apply.
To this end, we analyzed postmortem results over the
last year and extracted time to mitigation, root causes,
and correlated mitigations for each. We then selected 20
recent incidents for qualitative user studies. This approach
allowed us to understand and evaluate the processes
and practices of engineers in a real-world setting and to
deep-dive into user behavior and patterns that couldn’t be
extracted by analyzing trends in postmortem documents.
The first step was trying to understand user behavior:
At the highest level, what did the end-to-end debugging
experience look like at Google? The study was broken
down into the following phases (which are unpacked in the
sections that follow):
- Phase 0 – Define a way to segment the incident responder and
incident type populations.
- Phase 1 – Audit the postmortem documentation from a spread of
actual Google incidents.
- Phase 2 – Conduct in-depth user interviews with first responders
who worked on those incidents.
- Phase 3 – Map the responders’ journeys across those incidents,
detailing common patterns, questions, and steps taken.

Phase 0: Segment incident responder and incident type populations
The preliminary approach to segment the population under
study was designed to ensure that a sufficiently broad set
of incidents and interviewees was included, from which we
could capture a comprehensive set of data.

Incident Responders
First, the incident responders (or on-callers) were segmented
into two distinct groups: SWEs (software engineers), who
typically work with a product team, and SREs (Site Reliability
Engineers), who are often responsible for the reliability of
many products. These two groups were further segmented
according to tenure at Google. We found the following
behaviors across the different user cohorts:

SWE vs. SRE mental models and tools


SWEs are more likely to consult logs earlier in their
debugging workflow, where they look for errors that could
indicate where a failure occurred.
SREs rely on a more generic approach to debugging:
Because SREs are often on call for multiple services, they
apply a general approach to debugging based on known
characteristics of their system(s). They look for common
failure patterns across service health metrics (e.g., errors
and latency for requests) to isolate where the issue is
happening, and often dig into logs only if they’re still
uncertain about the best mitigation strategy.

Experience level of the incident responder


Newer engineers are more likely to use recently developed
tools, while engineers with extensive experience (10-plus
years running complex, distributed systems at Google)
tend to use more legacy tools. Intuitively, this finding
makes sense—people tend to use the tools they are most
comfortable with, particularly in emergency situations.

Incident Types
We also examined incidents across the following
dimensions, and found some common patterns for each:
- Scale and complexity. The larger the blast radius of the problem
(i.e., its location(s), the affected systems, the importance of the
user journey affected, etc.), the more complex the issue.
- Size of the responding team. As more people are involved in an
investigation, communication channels among teams grow, and tighter
collaboration and handoffs between teams become even more critical.
- Underlying cause. On-callers are likely to respond to symptoms
that map to six common underlying issues: capacity problems; code
changes; configuration changes; dependency issues (a system/service
my system/service depends on is broken); underlying infrastructure
issues (network or servers are down); and external traffic issues.
Our investigation intentionally did not look at security or
data-correctness issues, which fall outside the scope of the tools
this work focuses on.
- Detection. On-callers learn about issues through human or machine
detection based on availability or performance problems. Common
mechanisms include alerts on white-box metrics, synthetic traffic,
and SLO (service-level objective) violations, as well as
user-detected issues. (A rough sketch of the machine-detection path
follows this list.)
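
To make the machine-detection path concrete, here is a minimal sketch of an availability check that an alerting pipeline might run against white-box metrics. It is an illustration only, not Google's alerting stack; the metric inputs and the 99.9 percent target are assumptions for the example.

```python
# Minimal sketch of machine detection based on an availability SLO.
# The metric source, window, and 99.9% target are illustrative
# assumptions, not the tooling described in the article.

def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as healthy rather than divide by zero
    return good_requests / total_requests


def should_page(good_requests: int, total_requests: int,
                slo_target: float = 0.999) -> bool:
    """Page the on-caller when measured availability drops below the SLO target."""
    return availability(good_requests, total_requests) < slo_target


if __name__ == "__main__":
    # 1,200 failures out of 150,000 requests -> 99.2% availability, below a 99.9% SLO.
    print(should_page(good_requests=148_800, total_requests=150_000))  # True
```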

Phase 1: Postmortem documentation analysis


Once the different categories of incidents were determined,
we read the postmortems for the 20 incidents identified for
qualitative studies, mapping the steps responders took for
each case. This approach allowed us to validate the common
factors that affect how responders handled these incidents
and the challenges they faced. We could also ensure
that the incidents selected for deep-dive analysis were
distributed across the dimensions, as just described.
Google has a strong culture of blameless
postmortems [4]. It is common for teams to look at the
history of their failures to ensure that their services are
continuing to run reliably. Because of this, postmortem
documents are readily available internally and were an
invaluable resource for analyzing debugging behavior.
Detailed chat transcripts linked to these postmortems
helped form a base understanding of what happened, when
it happened, and what went wrong. We could then start
mapping a prototype of the debugging journey. Future
research could extend this work by applying natural-
language processing to further validate response patterns
in the incident response chats.

Phase 2: In-depth interviews


To round out this study, in-depth interviews were
conducted with the first responders identified in these 20
postmortems so that any gaps in the postmortem documents
could be filled in. These data sources added significant
color to the debugging journey we were mapping, and
surfaced a core set of building blocks that make up the
overall debugging process.

Phase 3: Mapping the responders’ journeys
This study allowed us to generate snapshots of what
an actual incident investigation lifecycle looks like at
Google. By mapping out each responder’s journey and then
aggregating those views, we extracted common patterns,
tools, and questions asked around debugging that apply
to virtually every type of incident. Figure 1 is a sample
of the visual mapping of the steps taken by each of the
responders interviewed.

COMMON PATTERNS AROUND DEBUGGING


A canonical debugging journey consists of the
stages and sub-journeys shown in figure 2 and described
in this section. These building blocks are often repeated as
the user investigates the issue, and each block can happen
in a nonsequential and, sometimes, cyclical order.
During the detection-to-mitigation stages, investigations
are typically time sensitive—especially when the issue
affects the end-user experience. An on-caller will always
try to mitigate the issue or “stop the bleeding” before
uncovering the root cause. After mitigation, on-callers and
developers often perform a deeper analysis of the code and
apply measures to prevent a similar situation from recurring.

FIGURE 1: Building Blocks

Detect
The on-caller discovers the issue via an alert, a customer
escalation, or a proactive investigation by an engineer
on the team. A common question would be: What is the
severity of this issue?

FIGURE 2: User Journey

Triage loop
The on-caller’s goal is to assess the situation quickly by
examining the situation’s blast radius (the severity and
impact of the issue) and determining whether there is a
need to escalate (pull in other teams, inform internal and
external stakeholders). This stage can happen multiple
times in a single incident as more information comes in.
Common questions include: Should I escalate? Do I need
to address this issue immediately, or can this wait? Is this
outage local, regional, or global? If the outage is local or
regional, could it become global (for example, a rollout
contained by a canary analysis tool likely won’t trigger a
global outage, whereas a query of death triggered by a
rollout that is now spreading across your systems might)?
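
To make the triage questions concrete, the following sketch classifies an outage as local, regional, or global from per-location error rates and decides whether to escalate. It is illustrative only; the 2 percent unhealthy threshold and the escalate-on-regional rule are assumptions, not guidance from the study.

```python
# Illustrative triage helper: classify blast radius from per-location
# error rates and decide whether to escalate. Thresholds and the
# escalation rule are assumptions for the example.

def blast_radius(error_rates: dict[str, float],
                 unhealthy_threshold: float = 0.02) -> str:
    """Return 'none', 'local', 'regional', or 'global' based on unhealthy locations."""
    unhealthy = [loc for loc, rate in error_rates.items()
                 if rate >= unhealthy_threshold]
    if not unhealthy:
        return "none"
    if len(unhealthy) == 1:
        return "local"
    if len(unhealthy) < len(error_rates):
        return "regional"
    return "global"


def should_escalate(scope: str) -> bool:
    """Escalate (pull in other teams, inform stakeholders) for regional or global impact."""
    return scope in ("regional", "global")


if __name__ == "__main__":
    rates = {"us-east": 0.15, "us-west": 0.04, "europe": 0.001}
    scope = blast_radius(rates)
    print(scope, should_escalate(scope))  # regional True
```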

Investigate loop
The on-caller forms hypotheses about potential issues and
gathers data using a variety of monitoring tools to validate
or disprove theories. The on-caller then attempts to
mitigate or fix the underlying problem. This stage typically
happens multiple times in a single incident as the on-caller
collects data to validate or disprove any hypotheses about
what caused the issue.
Common questions include: Was there a spike in
errors and latency? Was there a change in demand?
How unhealthy is this service? (Is this a false alarm, or
are customers still experiencing issues?) What are the
problematic dependencies? Were there production
changes in services or dependencies?
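
The last of these questions (were there production changes in services or dependencies?) lends itself to a simple correlation check. The sketch below flags rollouts, configuration pushes, or experiments that landed shortly before the errors began; the event schema and the 30-minute lookback window are assumptions for illustration.

```python
# Illustrative hypothesis check for the investigate loop: which production
# changes (rollouts, config pushes, experiments) landed shortly before the
# error rate started climbing? Event schema and lookback window are assumed.
from datetime import datetime, timedelta

def suspect_changes(events: list[dict], incident_start: datetime,
                    lookback: timedelta = timedelta(minutes=30)) -> list[dict]:
    """Return production events that occurred in the lookback window before the incident."""
    return [
        e for e in events
        if incident_start - lookback <= e["time"] <= incident_start
    ]

if __name__ == "__main__":
    t0 = datetime(2020, 3, 1, 12, 0)
    events = [
        {"time": t0 - timedelta(minutes=12), "type": "rollout", "service": "frontend"},
        {"time": t0 - timedelta(hours=3), "type": "config_push", "service": "backend"},
    ]
    for e in suspect_changes(events, incident_start=t0):
        print(e["type"], e["service"])  # rollout frontend
```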

Mitigate loop
The on-caller’s goal is to determine what mitigation action
could fix the issue. Sometimes a mitigation attempt can
make the issue worse or cause an adverse ripple effect
on one of its dependent services. Remediation (or full
resolution of the issue) usually takes the longest of all the
debugging steps. This step can, and often does, happen
multiple times in a single incident.

Common questions include: What mitigation should be
taken? How confident are you that this is the appropriate
mitigation? Did this mitigation fix the issue?

Resolve/root-cause loop
The on-caller’s goal is to figure out the underlying issue
in order to prevent the problem from occurring again.
This step typically occurs after the issue is mitigated and
is no longer time sensitive, and it may involve significant
code changes. Responders write the postmortem
documentation during this stage.
Common questions include: What went wrong? What’s
the root cause of the problem? How can you make your
processes and systems more resilient?

Communication
Throughout the entire process, incident responders
document their findings, work with teammates on
debugging, and communicate outside of their team as
needed.

OBSERVABILITY DATA
In every single interview, on-callers reported that they
started working with time-series metrics that indicate the
health of a given service, performing a breadth-first search
to identify which components of the system were broken.
The majority of the teams that were interviewed evaluated
the following items:
- RPC (remote procedure call) latency and error metrics (similar to
the metrics derived from the open-source gRPC libraries).
- Changes in external traffic, including QPS (queries per second).
- Changes in production, such as rollouts, configuration pushes,
and experiments.
- Underlying job metrics, such as memory and CPU consumption.
Both alerts and realtime dashboards use these metrics.
On-callers typically used logs and traces only after they
identified a component as broken, and they then needed to
drill down to the specific issue.
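
The breadth-first pattern described above can be pictured as a small traversal over a service dependency graph: check coarse error metrics for every component first, and only then drill into the unhealthy ones with logs and traces. The graph, the stubbed metric source, and the 1 percent error threshold in this sketch are assumptions, not part of the study.

```python
# Illustrative breadth-first health check over a service dependency graph.
# The graph shape, get_error_rate stub, and 1% threshold are assumptions;
# real investigations would read RPC error/latency metrics from monitoring.
from collections import deque

DEPENDENCIES = {
    "frontend": ["api"],
    "api": ["auth", "storage"],
    "auth": [],
    "storage": [],
}

ERROR_RATES = {"frontend": 0.08, "api": 0.07, "auth": 0.0, "storage": 0.09}  # stub data

def get_error_rate(service: str) -> float:
    return ERROR_RATES.get(service, 0.0)

def unhealthy_components(root: str, threshold: float = 0.01) -> list[str]:
    """Breadth-first walk from the entry point, returning components over the error threshold."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        svc = queue.popleft()
        if get_error_rate(svc) >= threshold:
            order.append(svc)
        for dep in DEPENDENCIES.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order

if __name__ == "__main__":
    # Only after narrowing to these components would an on-caller dig into their logs.
    print(unhealthy_components("frontend"))  # ['frontend', 'api', 'storage']
```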

ANECDOTES FROM THE FRONT LINE


Some of the interviewees applied SRE best practices
to debug complex distributed systems, methodically
eliminating their theories on what could go wrong, applying
temporary mitigations to prevent user pain, and, finally,
successfully resolving and root-causing the problem that
set off the outage in the first place.
Many other responders hit unexpected roadblocks.
Some responders were impacted by a complex set
of changes throughout the stack that occurred
simultaneously. Therefore, it was extremely challenging
to isolate the actual issue and figure out how to resolve
it. Other responders cited process and awareness issues:
Some did not fully understand how their production
tooling worked, or the appropriate standard course of
action to take. Some responders wound up unintentionally
applying bad changes to production.
Following are some (anonymous) stories to illustrate
successful and problematic debugging sessions. These
anecdotes are intended to show that even with the most

experienced engineers, great technology, and powerful
tooling, things can—and do—go wrong in unexpected ways.

An exemplary debugging journey
The following is an example of a successful debugging
session, where the SRE follows best practices and
mitigates a service-critical issue in less than 20 minutes.
While sitting in a meeting, the SRE on-caller receives
a page informing her that the front-end server is seeing
a 500 server error. While she’s initially looking at service
health dashboards, a pager-storm starts, and she sees
many more alerts firing and errors surfacing. She responds
quickly and immediately identifies that her service isn’t
healthy.
She then determines the severity of the issue, first
asking herself how many users are impacted. After
looking at a few error rate charts, she confirms that a
few locations have been hit with this outage, and she
suspects that it will significantly worsen if not immediately
addressed. This line of questioning is referred to as the
triage loop, similar to triage processes used in health care
(for example, emergency rooms that sort patients by
urgency or type of service). The SRE needs to determine
if the alert is noise, if she needs to handle it now, and
whether to escalate the issue to other teams and
stakeholders.
Now that she knows this is a real and relatively severe
issue, the SRE starts pulling in other people from her
team to help with the investigation. She also sets up
communication channels to inform other teams that may
be affected, and to let them know her team is addressing
the outage.
She then focuses on temporarily mitigating the issue
for end users. She tasks a teammate with configuring the
load balancers to stop routing traffic to the unhealthy
locations. For the moment, this action stops the
issue from propagating, which leaves her free to conduct a
deeper investigation using monitoring data.
Next, she asks a series of questions that help her
narrow down the potential cause and figure out how best
to mitigate the issue permanently. She largely uses time-
series metrics (e.g., Cloud Monitoring metrics [3]) to help
answer these questions quickly:
- To narrow down the breadth of the investigation: Which specific
parts of the service are unhealthy? Are the errors coming from the
front end or back end? Are there "slices" of data that are
problematic? Are there outliers in the data?
- To identify the severity of the issue and rule out causes: Is the
shape of the graph a step (something changed suddenly and stayed at
the new level), a spike (something changed, then stopped), or a
slope (a gradual rollout is happening)? How quickly did the error
rate ramp up? (A rough way to automate this shape check is sketched
after this list.)
- To identify the severity: What is the blast radius? (If errors
occur globally, this indicates a severe issue that will most likely
have end-user impact.)
- To rule out underlying causes: When did the problem start? What
production events in the service or in its dependencies correlate
with this issue?
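
As noted in the list above, the step/spike/slope distinction can be roughly automated. This sketch compares the error-rate level at the start, middle, and end of a window; the thresholds are arbitrary illustrative choices rather than values used by the responders.

```python
# Rough illustration of classifying an error-rate time series as a step
# (changed and stayed changed), a spike (changed, then recovered), or a
# slope (gradually ramping). Thresholds are arbitrary, for illustration only.

def classify_shape(samples: list[float], jump: float = 2.0) -> str:
    """Compare the first, middle, and last thirds of the window."""
    n = len(samples)
    if n < 6:
        return "not enough data"
    third = n // 3
    start = sum(samples[:third]) / third
    middle = sum(samples[third:2 * third]) / third
    end = sum(samples[-third:]) / third
    baseline = max(start, 1e-9)
    if middle > jump * baseline and end <= jump * baseline:
        return "spike"   # changed, then stopped
    if end > jump * baseline and abs(end - middle) / max(middle, 1e-9) < 0.2:
        return "step"    # changed suddenly and remained at the new level
    if end > middle > start:
        return "slope"   # gradual ramp, e.g. a rollout in progress
    return "unclear"

if __name__ == "__main__":
    print(classify_shape([0.01] * 5 + [0.2] * 10))               # step
    print(classify_shape([0.01] * 5 + [0.2] * 4 + [0.01] * 6))   # spike
    print(classify_shape([0.01 * i for i in range(1, 13)]))      # slope
```
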
Once the issue is mitigated, the SRE drills into logs and
traces, confirming that a new line of code was crashing
the jobs in the regions with issues. She decides to roll back
to the last stable version of the service, and validates
that the issue is resolved when the affected locations are
brought back online.

Debugging journey where the tooling failed to support the on-caller
The following is an example of a journey where Google
on-callers hit unexpected hurdles as they debugged, and
where applying best practices could have reduced the time
to mitigation.
The on-caller receives a page that informs him that the
service’s overall server-side availability SLO (service-level
objective) was down from 99.9 percent to 91 percent, and
that specific user actions failed. He begins his investigation
by looking at graphs of metrics that confirm (1) when the
error rate started to increase; (2) errors were mostly caused
by timeouts; and (3) request durations were about equal to
the duration of the timeout. He then slices the metrics to the
failing user actions identified before, checks the associated
server errors and queries-per-second metrics, and digs into
server logs to find specific errors. Up to this point, he has
followed common practices for debugging.

At the same time, another on-caller for a back-end


service dependency notices that the service is nearing its
quota limitations and suspects that this situation might
have an impact on the investigation. This on-caller tries to
allocate some quota through a configuration change, hoping
to alleviate the problem. Because of a misunderstanding
in the configuration push tooling, however, this change
accidentally removes a back-end server in one location
instead of adding quota, which increases the error rates in
the other locations. Additionally, since he considered this
change to be safe, the on-caller didn’t monitor the rollout
of the updated configuration as closely as best practices
recommend, and initially missed indicators that overall
capacity was actually reduced because of the removed
location. At this point, the on-caller breaks from best
practices by performing a global push of a nonvalidated
configuration that includes a completely unrelated change—
the action of dropping a back end should be separate from
adding capacity.
While this is happening, the first on-caller goes deep in
the logs and finds “permission-denied” errors increased
at the time the back-end server was removed. He does
this through a breadth-first search of a number of the
supporting back ends and an analysis of their aggregated
logs. Here, he notices that when one server was removed,
more requests were funneling to the servers that were
experiencing issues. Only after digging into logs and
opening a number of tools is he able to connect the errors
to the configuration change in the dependency.
Better tooling could have prevented the user from
performing an unanticipated change. Tooling could also
have helped validate what the change would actually do.
Additionally, better tooling to support monitoring the
effects of the changes to the system could have helped the
on-callers draw these conclusions earlier.
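
Tooling of the kind described here could be as simple as a pre-push validation step that compares a change's actual diff against the operator's stated intent and rejects pushes that bundle unrelated edits, such as dropping a back end when only quota was meant to change. The sketch below is hypothetical; the field names and validation rule are invented for illustration and are not the configuration system involved in this incident.

```python
# Hypothetical pre-push validation: reject a configuration change whose diff
# contains edits outside the operator's stated intent (e.g., removing a
# backend when the intent was only to add quota). Field names are invented.

def config_diff(old: dict, new: dict) -> dict[str, tuple]:
    """Return {field: (old_value, new_value)} for every field that changed."""
    fields = set(old) | set(new)
    return {f: (old.get(f), new.get(f)) for f in fields if old.get(f) != new.get(f)}

def validate_push(old: dict, new: dict, intended_fields: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the push looks safe to roll out."""
    problems = []
    for field, (before, after) in config_diff(old, new).items():
        if field not in intended_fields:
            problems.append(f"unintended change to '{field}': {before!r} -> {after!r}")
    return problems

if __name__ == "__main__":
    old = {"quota": 100, "backends": ["us-east", "us-west"]}
    new = {"quota": 150, "backends": ["us-east"]}  # quota bump accidentally drops a backend
    for p in validate_push(old, new, intended_fields={"quota"}):
        print(p)  # flags the unintended backend removal
```
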
The on-callers then connect to share their findings.
Once connected, the first on-caller rolls back the
configuration push that reduced capacity, identifies the
back-end dependency whose change caused the permission errors,
and works with the back-end team to get bad changes
rolled back.

TRANSLATING INSIGHTS INTO CONCRETE ACTION


If you are responsible for running a distributed service, you
might find yourself dealing with scenarios similar to what
the teams we interviewed experienced. Our study revealed
that teams that apply the following principles are typically
able to mitigate service problems faster.

Establish SLOs and accurate monitoring


You need to have SLOs and/or metrics that you can alert on
and, optionally, report on. These should accurately reflect
user pain and allow for slicing by failure domains. These
should also be associated with alerts that have clear next
steps and links to the most important information.
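
One lightweight way to support slicing by failure domains is to evaluate the SLI per domain (for example, per location) against the target, rather than only the global aggregate, so that a regional problem is visible even when the overall number still looks healthy. The sketch below assumes an availability SLI, a 99.9 percent target, and invented location names.

```python
# Sketch of an availability SLO evaluated per failure domain (here, location),
# so that a regional problem is visible even when the global aggregate still
# meets the target. Names, domains, and the 99.9% target are assumptions.
from dataclasses import dataclass

@dataclass
class SliceResult:
    domain: str
    availability: float
    meets_slo: bool

def evaluate_slo(per_domain_counts: dict[str, tuple[int, int]],
                 target: float = 0.999) -> list[SliceResult]:
    """per_domain_counts maps domain -> (good_requests, total_requests)."""
    results = []
    for domain, (good, total) in sorted(per_domain_counts.items()):
        availability = good / total if total else 1.0
        results.append(SliceResult(domain, availability, availability >= target))
    return results

if __name__ == "__main__":
    counts = {
        "us-east": (99_990, 100_000),   # 99.99%: fine
        "us-west": (98_000, 100_000),   # 98.0%: violating the SLO in this slice
        "europe": (99_950, 100_000),    # 99.95%: fine
    }
    for r in evaluate_slo(counts):
        print(r.domain, f"{r.availability:.4f}", "OK" if r.meets_slo else "ALERT")
```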

Triage effectively
Once you have the prerequisites of SLOs and accurate
monitoring in place, you need to be able to quickly
determine both the severity of user pain and the total blast
radius. You should also know how to set up the proper
communication channels based on the severity of the issue.

Mitigate early
Documenting a set of mitigation strategies that are safe
for your service can help on-callers temporarily fix user-
facing issues and buy your team critical time to identify the
root cause. For more information on implementing generic
mitigations, see “Reducing the Impact of Service Outages
with Generic Mitigations with Jennifer Mace" [5]. The ability
to easily identify what changed in your service—either
in its critical dependencies or in your user traffic—is also
helpful in determining what mitigation attempt to move
forward with. As mentioned in the exemplary debugging
case, asking a series of common questions and having
metrics, logs, and traces can help speed up the process of
validating your theories about what went wrong.
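
One way to document safe mitigation strategies ahead of time is to keep them in a small, machine-readable playbook keyed by symptom, so an on-caller can look up a pre-vetted action and its rollback under pressure. The entries below echo mitigations mentioned in this article (draining a location, rolling back a release, adding capacity); the schema and lookup are assumptions for illustration.

```python
# Sketch of a documented "generic mitigations" playbook: pre-vetted, reversible
# actions an on-caller can apply before the root cause is known. The entries
# echo mitigations mentioned in the article; the schema is illustrative.
from dataclasses import dataclass, field

@dataclass
class Mitigation:
    name: str
    when_to_use: str
    action: str
    rollback: str
    risks: list[str] = field(default_factory=list)

PLAYBOOK = [
    Mitigation(
        name="drain_location",
        when_to_use="Errors concentrated in one or a few locations",
        action="Configure load balancers to stop sending traffic to the unhealthy locations",
        rollback="Re-enable traffic once the locations are healthy again",
        risks=["Remaining locations must have spare capacity"],
    ),
    Mitigation(
        name="rollback_release",
        when_to_use="Errors correlate with a recent rollout or config push",
        action="Roll back to the last known-good version",
        rollback="Re-roll forward after the fix is verified",
    ),
    Mitigation(
        name="add_capacity",
        when_to_use="Incoming QPS or resource usage is spiking",
        action="Request an emergency capacity loan or scale the service out",
        rollback="Return the loaned capacity after the spike subsides",
    ),
]

def find_mitigations(symptom: str) -> list[Mitigation]:
    """Very rough lookup: match the symptom text against each entry's description."""
    return [m for m in PLAYBOOK if symptom.lower() in m.when_to_use.lower()]

if __name__ == "__main__":
    for m in find_mitigations("rollout"):
        print(m.name, "->", m.action)
```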

Apply established mitigation strategies for common issues


Although every service is different, the following patterns
emerged in the underlying issues we examined and the
mitigations associated with them. When you’re dealing
with a problem that you’ve never seen before, it can be
helpful to think about what type of issue your service is
facing, the questions you should ask, and the associated
mitigations based on the answers.
- Service errors. This was the most common cause for an alert
firing in our study. As such, it also had the largest variety of
mitigations. Some factors to consider in determining mitigation
strategies include: (1) Are the errors occurring globally? Check
for correlated rollouts, configuration/data changes, and
experiments. (2) Is incoming QPS spiking? Add capacity and/or start
load shedding to drop traffic that your service can’t handle.
(3) Is a bad actor causing a change in QPS? If so, block the user.
- Performance. Latency can make for a bad user
experience and degrade into errors over time. These issues
can be difficult to debug if there is no obvious correlated
capacity or production change. Typically, responders look
through traces to identify which components in the stack
are affected and try to determine a solution from there.
- Capacity. Capacity issues are some of the easiest to
spot, especially if you have capacity-specific alerts. Like
errors and performance issues, these can manifest as
both fast and slow burns. If a service is going to run out
of capacity immediately, teams typically ask for more
capacity in an “emergency loan” to scale up their service
(or they may attempt to scale out). For a slow burn,
responders perform additional analyses and planning
to determine if there are other underlying issues. These
types of alerts surface only when automated capacity
systems hit their authorized maximum, and acquiring more
resources requires human intervention.
- Dependency issues. A critical dependency—even if it’s
deep within the service stack—can contribute to the failure
of the entire service. Knowing your hard dependencies
(those in the critical path of your code) and being able to
view the health of these dependencies can be helpful in
ruling out whether the problem actually lies with another
service.
- Debugging microservices. Most of the teams we interviewed have a
microservice architecture. Frequently, the error may be deeper in
the stack than where it manifested to the on-caller. Similar to
debugging dependencies, it’s helpful to be able to traverse the
stack quickly, associate production changes, and understand
service architecture.

CONCLUSIONS
SREs continuously strive to improve systems and expose
vulnerabilities in order to limit the probability of failures, near
misses, and inefficiencies in production. Even under the most ideal
conditions, things inevitably go wrong. By surfacing, preserving,
and disseminating the commonalities—both positive and negative—in
the debugging workflow, the aim is to prevent the same class of
problem from recurring, or, when prevention isn’t possible, to
minimize the duration or impact of unavoidable outages. Hopefully,
other organizations can apply these findings in practice too.

Related articles
- The Calculus of Service Availability
  You’re only as available as the sum of your dependencies.
  Ben Treynor, Mike Dahlin, Vivek Rau, Betsy Beyer
  https://queue.acm.org/detail.cfm?id=3096459
- Why SRE Documents Matter
  How documentation enables SRE teams to manage new and existing services
  Shylaja Nukala and Vivek Rau
  https://queue.acm.org/detail.cfm?id=3283589
- Weathering the Unexpected
  Failures happen, and resilience drills help organizations prepare for them.
  Kripa Krishnan
  https://queue.acm.org/detail.cfm?id=2371516

References
1. Beyer, B., Jones, C., Petoff, J., Murphy, N. R., eds. 2016. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media; https://landing.google.com/sre/books/.
2. Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., Thorne, S., eds. 2018. The Site Reliability Workbook. O’Reilly Media; https://landing.google.com/sre/books/.

3. Google Cloud. 2020. Metric list; https://cloud.google.com/monitoring/api/metrics.
4. Lunney, J., Lueder, S. 2017. Postmortem culture: learning from failure. O’Reilly Media; https://landing.google.com/sre/sre-book/chapters/postmortem-culture/.
5. Mace, J. 2019. Spotlight on Cloud: Reducing the Impact of Service Outages with Generic Mitigations with Jennifer Mace. O’Reilly Media; https://www.oreilly.com/library/view/spotlight-on-cloud/0636920347927/.

Beth Cooper is a Product Manager at Google NYC. She
focuses on building Google-scale monitoring for both site
reliability and software engineers. Prior to Google, she
worked on Microsoft Azure building products for cloud and
datacenter automation.

Charisma Chan is a user experience design researcher at
Google UK in London. Prior to joining Google, she led design
research and strategy for consumer and enterprise products
in the financial services and media sectors, and she holds a
bachelor’s degree from Cornell University in Ithaca, New York.
Copyright © 2020 held by owner/author. Publication rights licensed to ACM.
