SLO Guide & FAQ
SLO Guide
DRI: Zac Brown, Jeremy Edwards, Siddarth Chandrasekaran, Preston O'Neal
Last updated: Apr 13, 2022
Status: published (living document)
Table of Contents
Context
SLO’s - A Brief Recap
FAQ
General Questions
What’s a Service Level Objective (SLO)?
What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
Why are we doing SLO’s?
Why do SLO’s need to have real-time alerting?
If a team comes to me and says they need a stronger SLO, can I push back?
How should managers and leaders think about their SLO’s?
Why do we use a fixed 30 day window for measuring SLO’s?
Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
What resources are available for me to go deeper on SLO’s?
Stripe SLO Monitoring Questions
What tools are available to Stripes for monitoring SLO’s?
Do I have to use Stripe’s SLO Monitoring?
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s SLO Monitoring?
Why are we focused on StripeNext Horizon services first for SLO Monitoring?
What happens if my service runs out of Error Budget?
I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my service have to have a 99.999% SLO?
Can I use SLO Monitoring as the only alerts for my service’s health?
Can I tune my error budget burn rate alerts for SLO’s?
Why are you using Burn Rate alerts instead of Threshold-based alerts?
Resources
Context
Historically, Stripe has set broad single target goals for reliability. As we’ve grown, it has become apparent that
a one-size-fits-all approach does not work. Different teams have different needs and goals when making the
tradeoff between reliability and engineering velocity. This document provides an overview of the patterns and
practices needed to build a service designed for different levels of reliability.
Additionally, we want to clarify and standardize how concepts in our current infrastructure map to the changes
being introduced in the StripeNext work. This will help better align reporting scenarios for leadership and raise
the floor for how we monitor our next generation of services.
There’s a lot of confusion around SLO’s. Common questions that the Reliability [2] organization gets are:
- What time window should we measure SLO’s on?
- My service falls under the PaymentIntents API which has a target SLO of 99.999%, but we rely on a
downstream system that only reports an SLO of 99.9%. How can I make this reliable?
- What metrics should I use to measure my SLO’s?
- Should those metrics be client side? Server side? Both?
- What should my API/service/system Availability SLO be?
Unfortunately, there’s no one-size-fits-all answer to these questions. The details depend on the service or
system being monitored and how it impacts customers when it’s in a degraded state. On top of that, any
blanket guidance that the Reliability organization could provide would do a disservice to the service owner.
SLO’s are a tool for balancing reliability and engineering velocity. They provide a signal to Engineering Teams
and their respective parent organizations on how a given service is doing against the broader priorities of the
organization.
Putting that in context, if the Reliability organization were to set SLO targets for other teams, we would be
trying to push priorities onto teams that may have other higher priority work. This won’t work and it won’t scale.
In brief, the responsibilities for SLO’s are laid out as follows:
- RBCLEAP owns and produces standardized dashboarding, alerting, and reporting on SLO’s across
Stripe. This tooling will be a standardized minimum bar (min-bar) that is flexible enough in the future to
account for scenario-specific SLO’s.
- RBCLEAP owns setting guidance, education, and hands-on assistance for teams that have
Reliability-related issues.
- Teams outside of RBCLEAP own setting SLO’s for themselves and reporting on them within their
Operating Group. This ownership includes both the decision making process for raising or lowering the
SLO based on the priorities of that team, their OG’s agreement, and the maturity of the product/service.
As we started the planning for StripeNext, we recognized we needed to set clear expectations with our partner
teams as well as the leadership team. We will use SLO’s to drive that point, setting targets that strike the right
balance between reliability (e.g. availability, latency) and engineering velocity.
Additionally, SLO’s provide a tool for communicating availability, latency, throughput, and so on to partner
teams. Teams can then choose to implement additional strategies for improving the reliability of their own
service (e.g. retries).
[1] Error Budget is defined as (1 - SLO). For example, an SLO of 99.99% is an Error Budget of 0.01%.
[2] RBCLEAP
Making Sense of the 9’s
The following table shows the error budget for the year and for a 30 day period based on the number of 9’s
specified. Notably, it takes less than half a minute (26 seconds) to exhaust the error budget for a service
targeting 99.999% availability.
The boxes where the time durations are decorated in red are those where the implied Mean Time to Recovery
(MTTR) is too short for a human to realistically respond in time.
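These durations follow directly from the Error Budget definition in footnote [1]; the following is a minimal sketch of the standard math behind the table, with no Stripe-specific assumptions:

from datetime import timedelta

# Error Budget = (1 - SLO), expressed here as allowable downtime over a window.
def error_budget(slo: float, window: timedelta) -> timedelta:
    return timedelta(seconds=round(window.total_seconds() * (1 - slo)))

for slo in (0.999, 0.9995, 0.9999, 0.99995, 0.99999):
    per_30_days = error_budget(slo, timedelta(days=30))
    per_year = error_budget(slo, timedelta(days=365))
    print(f"SLO {slo:.3%}: {per_30_days} per 30 days, {per_year} per year")

# SLO 99.900%: 0:43:12 per 30 days, 8:45:36 per year
# SLO 99.950%: 0:21:36 per 30 days, 4:22:48 per year
# SLO 99.990%: 0:04:19 per 30 days, 0:52:34 per year
# SLO 99.995%: 0:02:10 per 30 days, 0:26:17 per year
# SLO 99.999%: 0:00:26 per 30 days, 0:05:15 per year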
To put this in perspective, void-originate was estimated to have taken our trailing 30 day availability from
99.9997% to 99.947%.
Your Error Budget is a tool for managing risk. If you have a particularly risky change you want to flight, your
Error Budget gives you breathing room to test that change in production. Similarly, if your product is early in its
lifecycle and you want to iterate quickly, you can set your SLO at a lower target (e.g. 99.95%). This sets the
right expectations within Stripe for how you’re prioritizing reliability versus engineering velocity.
For Infrastructure, this is a more nuanced conversation, but the following table shows examples.
[3] When we’ve discussed SLO’s with teams in the past, a common question is “How do Availability Tiers map to SLO’s? Do
I need to set both?” The short answer is - you need to set both today.
The long answer is that Availability Tiers need to be abstracted away into the infrastructure platform. They are an
implementation detail that is better represented by teams setting an SLO target. As we continue investing in the platform,
this is one of several areas we’d like to abstract away.
Plan for Resiliency
Services should strive, where possible, to avoid being stateful. This means only storing data in external
sources like a Data Layer (e.g. MySQL, Mongo) and, optionally, a Cache Layer (e.g. Redis). When service
instances maintain state, they often require coordination when an instance is replaced.
Similarly, a leaderless service is preferable to leader-elected services. Leader election is a specific type of
stateful service where one instance acts as the coordinator for other instances. If the leader fails, a new leader
must be elected. This failover to a new leader is costly in terms of Mean Time to Recovery. Leader election
takes O(minutes) to complete, which means it would be impossible to meet a 5 9’s target.
To meet both our efficiency and reliability goals, services need to be intelligently auto-scaled. When our
services auto-scale, they also need to include headroom to absorb unanticipated traffic.
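As a rough illustration of what headroom means in practice (the 60% utilization target below is an assumed value for illustration, not a Stripe standard): the lower your steady-state utilization target, the larger the traffic spike the fleet can absorb while the autoscaler catches up.

# Rough headroom math: if autoscaling targets 60% utilization (assumed value),
# the fleet can absorb roughly a 1.67x traffic spike before saturating,
# giving the autoscaler time to add capacity.
target_utilization = 0.60
absorbable_spike = 1 / target_utilization
print(f"Can absorb roughly a {absorbable_spike:.2f}x traffic spike")  # ~1.67x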
Idempotency
API’s should be implemented to support idempotency wherever possible. Most of our API surface already
supports the notion of idempotent API calls. However, it’s important to plan for this from the start as it can
influence design decisions further down the road.
The design principle is related to designing Stateless Services. If your API is idempotent, then the failure of an
arbitrary instance of your service should not negatively impact the customer. We can either retry the API
automatically for them or they can retry it if we’ve hit our own timeout.
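As a sketch of the pattern (the store and function below are hypothetical, not Stripe’s actual API framework): the caller supplies an idempotency key, and the service returns the previously recorded result on a retry instead of re-executing the operation.

from typing import Any, Callable, Dict

# Hypothetical in-memory store; a real service would use a durable store
# (e.g. the data tier) so retries after an instance failure still dedupe.
_results: Dict[str, Any] = {}

def idempotent(key: str, operation: Callable[[], Any]) -> Any:
    """Run `operation` at most once per idempotency key."""
    if key in _results:
        return _results[key]          # safe to return on a client retry
    result = operation()              # perform the side effect exactly once
    _results[key] = result
    return result

# A retried request with the same key gets the original result back.
charge = idempotent("req_abc123", lambda: {"charge_id": "ch_1", "amount": 500})
retry = idempotent("req_abc123", lambda: {"charge_id": "ch_2", "amount": 500})
assert charge == retry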
Similarly, how we modify our systems should be idempotent. Most modern system tooling designs with this
principle, but it is still important to take this into consideration when looking at introducing new tooling or
practices.
Every service also relies on a set of core infrastructure dependencies:
- compute - where the actual process runs for the service (e.g. EC2 instance, Kubernetes)
- network - underlying networking infrastructure that connects compute to the broader Internet
- data - the database (e.g. MySQL) or message bus (e.g. Kafka) layers that allow applications to store
state and communicate that state to other services
Beyond that, a Service-Oriented Architecture usually means that the service handling your application is not
the same service doing user authentication or hosting your dashboards. That means there are both infrastructure
service dependencies and functional dependencies. Understanding what your service depends on is
crucial to the reliability profile of your service. The SLO’s of the services you depend on will inform how you
think about your failure modes for handling outages in those dependencies.
These critical dependencies are where your primary focus for having redundancy should be. This usually
means deploying your service to multiple regions with automated regional failover, autoscaled deployments,
and a sharded redundant data layer.
Systems Degrade Gracefully
By keeping your critical dependencies at a minimum, you can focus on how to gracefully degrade when one of
your less critical dependencies becomes unavailable.
Consider Google.com, whose core scenario is search; auxiliary features like auto-complete or the site info
modal can fail without taking search down. If any of those scenarios fail, the product’s core functionality is still
preserved. You can still enter a search term and get back a page of results.
Where possible, we need to consider the same types of graceful degradation in our systems. This isn’t always
possible, but it’s important to consider because it is one of several tools available to preserve Stripe’s core
product scenarios like the Charge Path.
Keep in mind that availability, latency, saturation, and throughput interact with one another:
- If your service is experiencing high latency for a long enough period of time, it will manifest as an
availability issue. The excess waiting for calls to complete will starve your service of resources and
cause you to be unavailable.
- If you max out your allocated resources and saturate the service, your availability will drop and latency
will increase due to an inability to service the throughput.
- If you exceed your known throughput limits, you’ll see availability dip and latency increase as it takes
more effort to complete work. Similarly, this can lead to saturation.
As Stripe matures, we’ll need to expand how we think about what it means to be Available. Today we think
about a couple of scenarios:
- did the RPC succeed?
- did the charge succeed when it should have (mostly Payments)?
Relatedly, POST API’s and GET API’s often do not have the same performance profile. Maybe POST requests
have a p50 latency of 50ms while GET API’s have a p50 latency of 250ms. This might seem like a small delta,
but at high RPS it can make a difference. If you host both the POST and GET API’s off the same service
instances, it can obscure subtle shifts in the overall performance monitoring of those services. More
importantly, if you have a massive influx of GET API requests, it could starve out CPU resources available to
service the POST requests, since every GET request takes five times as long as every POST request.
Similarly, from a monitoring point of view, it is easier to monitor two separate services for their respective
SLO’s. You can more easily reason about them within the larger system, and it’s clear from the delineation
what each service’s reliability, performance, and efficiency profiles are.
Put another way, it might seem like less operational overhead to serve a single deployment of your service with
multiple sets of functionality. However, at a large enough scale, it can exacerbate problems and make system
decomposition and monitoring more difficult in the long term.
“You inherit the reliability of your most used dependencies.”
Services often have many external dependencies. These might include:
- Database (MySQL)
- Message Bus (Kafka)
- 3rd party services not part of Stripe’s business (e.g. Visa)
- etcetera
If any of these dependencies makes up a plurality of the calls your own service makes, then your service will
inherit that dependency’s reliability. For example, if the 3rd party service offers an SLA of 99.95% and your
service’s job is to process payments through that service (e.g. it’s Visa), then, absent any additional
reliability strategies, you can expect 0.05% of all requests to fail.
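To see why the most-used dependency dominates, note that the availabilities of dependencies you must traverse in series multiply together; a minimal sketch with illustrative numbers (assuming independent failures and no mitigations):

# End-to-end availability of a request that must traverse dependencies in
# series is the product of their availabilities.
dependencies = {"own service": 0.9999, "data tier": 0.9995, "card network": 0.9995}

combined = 1.0
for availability in dependencies.values():
    combined *= availability

print(f"Expected end-to-end availability: {combined:.4%}")  # ~99.89%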
So what to do? There are a number of strategies available to us that we’ll cover in the next section.
Retries
This is a common pattern across Stripe for handling issues with reliability in downstream dependencies (both
internal and external). Generally, this is a very effective strategy but it comes with one potentially high cost:
latency.
Systems which expect to retry need to make calls that have timeouts. Crucially, callers need to make sure
they’re only retrying on errors which indicate that the call might actually succeed in the future. Depending on
the system, the timeout duration will need to be tuned - some systems need longer timeouts and others
need shorter ones. Without timeouts, or if the timeouts are too long, tail latency [4] will grow, further impacting the
callers. Additionally, retries need to be intelligent, using jitter and exponential backoff to ensure a thundering
herd is not unleashed on the dependency.
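A minimal sketch of that retry shape (the error type, attempt limits, and delays below are placeholders rather than a Stripe library): exponential backoff capped at a maximum, full jitter, and retries only on errors that might succeed later.

import random
import time

class RetryableError(Exception):
    """Placeholder for errors that indicate the call may succeed later."""

def call_with_retries(call, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry `call` with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()                  # the call itself should enforce a timeout
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                      # retry budget exhausted; surface the error
            # Full jitter: sleep a random amount up to the capped backoff.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))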
If possible, building systems that can detect the health of downstream dependencies will provide the most
robust protection against issues like thundering herds and retry amplification. While Stripe does not have a
comprehensive approach to this today, it’s a goal we’ll work toward in the future.
Recommendation(s):
- Retries are a good tool but make sure to include jitter and exponential backoff in the retry strategy and
make appropriate choices around timeouts [5].
- To mitigate retry amplification, if a retry fails, rewrite retriable error codes into a non-retriable error
code. [6]
[4] “Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently.” (source: https://brooker.co.za/blog/2021/04/19/latency.html)
[5] gRPC Networking Configuration
[6] No standardized set of error codes for this exists today.
If possible, the error text should also provide the user some notion of next steps. Whether it’s an indicator that
their inputs were invalid or a retry time horizon, providing these to callers will help them adapt their services to
better integrate with yours.
Recommendation(s):
- Make it clear whether an error is a client issue or server issue.
- Make it clear whether an error is retryable.
- Make it clear, in the error context, how a user can improve their usage of the service.
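One way to carry this information is a structured error payload; the field names below are illustrative only, since no standardized set of error codes exists today [6].

from dataclasses import dataclass
from typing import Optional

@dataclass
class ServiceError:
    code: str                       # machine-readable error code
    message: str                    # human-readable explanation and next steps
    client_error: bool              # True if the caller's input was at fault
    retryable: bool                 # whether retrying could succeed
    retry_after_seconds: Optional[float] = None   # backoff hint, if known

# Example: a transient dependency failure the caller may safely retry.
err = ServiceError(
    code="dependency_unavailable",
    message="Downstream data tier timed out; retry with backoff.",
    client_error=False,
    retryable=True,
    retry_after_seconds=0.5,
)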
Managing Timeouts
<TBD>
In many cases, we already run multiple copies of a dependency today at Stripe. For example, Shared-MSP
services run multiple copies of the same container for a given service. This also improves resiliency in case an
instance of the dependency fails. While an individual request might fail, the entire service won’t be down.
Sharding
Sharding is related to running multiple copies of a dependency, however it tends to be focused on how data is
stored. When we talk about the “availability” of a data tier dependency, there are actually two dimensions we
have to consider:
- Is data for any customer available?
- Is data for every customer available?
It is possible, with sharding, to build systems that can ensure one or both of these guarantees though the cost
can be significant.
Today, Stripe uses sharding for distributing customers across Mongo and MySQL. However, we don’t duplicate
data, so a single shard being down impacts a subset of customers. This means we’re solving the “Is data for
any customer available?” question.
Other mechanisms of sharding might mimic Amazon’s Cell Architecture [7]. This pattern separates multiple
instances of services across regions and availability zones so that failures are confined to boundaries that are
unlikely to fail together.
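Returning to the two dimensions above, here is a sketch of how deterministic shard assignment addresses “Is data for any customer available?” (the hashing scheme below is illustrative, not Stripe’s actual shard assignment): a single shard outage impacts only the customers mapped to that shard.

import hashlib

NUM_SHARDS = 16   # illustrative shard count

def shard_for_customer(customer_id: str) -> int:
    """Deterministically map a customer to a shard.

    If one shard is down, only the customers hashing to it are impacted;
    data for the remaining customers stays available.
    """
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for_customer("acct_123"))  # stable shard index in [0, 15]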
Caching
Using caching can improve reliability in a similar way to utilizing retries. By caching and setting reasonable
timeouts, you reduce the need to query downstream dependencies like your data tier or dependency service.
However, depending on your approach to caching, you may be trading one system’s reliability for another’s.
Caches that need frequent updating are typically external to the service itself - for example, a Redis cluster. This is
another service that can potentially go down, but your service should gracefully fall back to its data tier.
[7] AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
If the cached data is relatively static (e.g. top 100 customers by NPV [8]), then you can store these in static files
that are loaded once in memory for your service. This improves latency on lookups for something like
translating OrgId to AccountId because the data for your heaviest users is already in memory.
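Here is a sketch of the graceful fallback described above (the cache client and data tier calls are stand-ins, not a specific Stripe library): on a cache miss or a cache outage, the lookup falls back to the data tier rather than failing the request.

def get_account_id(org_id: str, cache, data_tier) -> str:
    """Look up an AccountId, preferring the cache but degrading gracefully."""
    try:
        cached = cache.get(org_id)          # e.g. a Redis client (stand-in)
        if cached is not None:
            return cached
    except Exception:
        pass                                # cache outage: degrade, don't fail

    account_id = data_tier.lookup(org_id)   # authoritative source of truth
    try:
        cache.set(org_id, account_id, ttl=300)   # best-effort repopulation
    except Exception:
        pass
    return account_id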
We discussed how Google.com’s core scenario is search in Systems Degrade Gracefully above. The auto-complete
service or site info modal service can adhere to lower SLO’s without notably impacting the primary
expectations of Google’s users. Similarly, we should aim to build services that keep as few dependencies as
possible and handle loss of the dependency gracefully. Doing this will mitigate the risk of the less reliable
services failing.
Knowing what is most critical in a given service’s functions helps to ensure we can still serve traffic in a
degraded state. While this isn’t ideal, it’s better than being hard down.
Replacing a dependency is a high cost solution but sometimes the only solution. As always, this is about
business prioritization and balancing cost in development versus onboarding to a new 3rd party.
If you capture data across internal and external dependencies, you can identify where these weaknesses are
and have data-driven discussions with partners about how to improve the situation.
Recommendation: This path is a “path of last resort.” Poorly implemented, this looks like teams bullying each
other about their SLO’s to get unblocked. Teams should absolutely surface reliability concerns with partners,
but they must also recognize that priorities may vary across the organization.
Multi-Region (Future)
TBD
[8] Net Payment Volume
FAQ
General Questions
What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
The limitation of our historic strategy with SLO’s has been that they don’t have real-time monitoring. Without
real-time monitoring, there’s no opportunity to preserve the SLO and avoid a miss. Additionally, we have not
applied any standardization to how we’ve measured SLO’s in the past. For example, there’s no standardized
time window, definition of availability, definition of latency and more.
If a team comes to me and says they need a stronger SLO, can I push back?
Yes! A common misunderstanding of using SLO’s is that all of your dependencies must have an SLO greater
than or equal to your own. This is inaccurate and would be untenable - you’d basically end up in a standoff on
who’s going to implement reliability improvements first.
There are strategies that upstream callers can implement like sharding and retrying of downstream
dependencies to improve their own reliability. Upstream callers can also carefully design their service to ensure
they minimize the number of critical dependencies they have.
However, expect that teams will ask this question. It’s a reasonable one and should be treated as a signal that
your service may not be meeting your customers’ needs.
[9] https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals
SLO’s are business metrics that help with planning and prioritization. If your services aren’t performing well
against their SLO’s, that’s an indicator that you need to prioritize reliability improvements. Similarly, if your
services are performing exceptionally well against their SLO’s, you can plan more risky changes.
Additionally, a fixed 30 day window provides sufficient time that a team could reasonably respond and mitigate
a degradation in their service. If you use a time window of 6 hours, that leaves very little time to respond to a
degradation and preserve your SLO.
Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
No! There are two reasons for this:
1. Services should be designed to reduce the number of critical dependencies they have. By reducing the
number of critical dependencies a service has, you improve the resiliency of your service which allows
you to provide a higher SLO.
2. Upstream callers can implement strategies to mitigate lower SLO’s in downstream dependencies. See
this section for more information.
For v1 services using the pay-server stack, the ServiceCommitments dashboard is still the primary way to
monitor SLO’s. Please note that it does not provide real-time alerting.
However, teams will have the ability to configure alerting for their service’s SLO’s.
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
Not yet, but we’re hoping to add this in 2022Q4 or 2023Q1.
I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s
SLO Monitoring?
Yes… if it’s running as a Horizon service. We recognize that not all teams will be building Horizon services,
however to ensure we complete at least a subset of the end-to-end scenarios, the workstream is focused on
Horizon services.
Why are we focused on StripeNext Horizon services first for SLO Monitoring?
For better or worse, we believe it’s more important to complete one end-to-end scenario and expand outward
based on what we learn. Given the amount of work going into rewriting or building net new services in Horizon,
we believe it would give the greatest bang for buck.
I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my
service have to have a 99.999% SLO?
No! Please see Does every service have to have the same SLO? Does my SLO have to be as high as my
upstream callers?
Can I use SLO Monitoring as the only alerts for my service’s health?
You could, but we wouldn’t recommend it. The SLO Monitoring alerts are a “floor raiser”, meaning they provide
a minimum bar for alerting for services. We expect that as teams gain a better understanding of how their
service behaves in production, they’ll create custom alerts and dashboards that give them detailed insight into
their service’s health.
The SLO Monitoring workstream is meant to provide a coherent high level picture of the health of Stripe’s
services across engineering organizations.
Can I tune my error budget burn rate alerts for SLO’s?
Yes! In 2022 Q2, we’ll be adding the ability to tune different aspects of the burn rate alerts. You’ll also be able
to disable alerts but we highly discourage that :). While we’ll allow disabling of alerts, that will be reported as
part of RPD health scoring.
Why are you using Burn Rate alerts instead of Threshold-based alerts?
In general we can create alerts with better precision and recall by alerting on error budget burn rate over
multiple time windows. More information on this is available in SLO Alerting.
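For intuition, a burn rate of 1 consumes the error budget exactly over the SLO window, while a burn rate of 14.4 consumes a 30 day budget in roughly two days. Below is a minimal sketch of a multi-window check; the thresholds and windows follow the common pattern from the Google SRE workbook and are not necessarily the exact values used by Stripe’s SLO Monitoring.

def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed relative to plan.

    A burn rate of 1.0 exhausts the budget exactly at the end of the window;
    14.4 exhausts a 30 day budget in roughly two days.
    """
    return error_ratio / (1 - slo)

def should_page(slo, errors_1h, total_1h, errors_5m, total_5m) -> bool:
    # Multi-window check: the long window gives precision, the short window
    # confirms the problem is still happening (good recall, fast reset).
    long_burn = burn_rate(errors_1h / total_1h, slo)
    short_burn = burn_rate(errors_5m / total_5m, slo)
    return long_burn > 14.4 and short_burn > 14.4

# Example: a 99.9% SLO with 2% of requests failing burns budget 20x too fast.
print(should_page(0.999, errors_1h=2_000, total_1h=100_000,
                  errors_5m=200, total_5m=10_000))   # True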
Resources
- Google SRE book(s)
- https://sre.google/books/
- Microsoft’s Azure Architecture Reliability docs:
https://docs.microsoft.com/en-us/azure/architecture/reliability/architect
- Google’s Cloud Architecture docs: https://cloud.google.com/architecture/framework/reliability
- Command Query Responsibility Separation (CQRS) design pattern
- https://martinfowler.com/bliki/CQRS.html
- https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs
- Setting SLO’s for services with dependencies
- https://cloud.google.com/blog/products/devops-sre/defining-slos-for-services-with-dependencies-cre-life-lessons
- On retries, timeouts, and jitter
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/