
This document has been migrated to Trailhead (go/slo/docs) and is now deprecated.

SLO Guide
DRI: Zac Brown, Jeremy Edwards, Siddarth Chandrasekaran, Preston O’Neal
Last updated: Apr 13, 2022
Status: published (living document)

Table of Contents
Table of Contents

Context
SLO’s - A Brief Recap

How many 9’s do you need?
Making Sense of the 9’s
Reliability versus Engineering Velocity
Error Budgets == Flexible Spending Account
So how many 9’s do I need?

Reliability Patterns and Resources
Plan for Resiliency
Stateless & Leaderless
Autoscaled & Overprovisioned (within reason)
Idempotency
Resilient System Degradation & Dependencies
Dependencies are Known
Critical Dependencies are Minimized
Systems Degrade Gracefully
High Latency == Availability Drop == Throughput Limit Exceeded == Fully Saturated
Workloads Segregated by Performance/Priority Profile
“You inherit the reliability of your most used dependencies.”
Retries
Error Codes and User Actionability
Managing Timeouts
Running Multiple Copies of the Dependency
Sharding
Caching
Graceful Degradation & Critical Path Pruning
Replacing Dependencies with a More Reliable Alternative
Asking for a Stronger SLO
Multi-Region (Future)

FAQ
General Questions
What’s a Service Level Objective (SLO)?
What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
Why are we doing SLO’s?
Why do SLO’s need to have real-time alerting?
If a team comes to me and says they need a stronger SLO, can I push back?
How should managers and leaders think about their SLO’s?
Why do we use a fixed 30 day window for measuring SLO’s?
Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
What resources are available for me to go deeper on SLO’s?
Stripe SLO Monitoring Questions
What tools are available to Stripes for monitoring SLO’s?
Do I have to use Stripe’s SLO Monitoring?
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s SLO Monitoring?
Why are we focused on StripeNext Horizon services first for SLO Monitoring?
What happens if my service runs out of Error Budget?
I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my service have to have a 99.999% SLO?
Can I use SLO Monitoring as the only alerts for my service’s health?
Can I tune my error budget burn rate alerts for SLO’s?
Why are you using Burn Rate alerts instead of Threshold-based alerts?

Resources

Context

Historically, Stripe has set broad single target goals for reliability. As we’ve grown, it has become apparent that a one-size-fits-all approach does not work. Different teams have different needs and goals when making the tradeoff between reliability and engineering velocity. This document provides an overview of the patterns and practices needed to build a service designed for different levels of reliability.

Additionally, we want to clarify and standardize how concepts in our current infrastructure map to the changes being introduced in the StripeNext work. This will help better align reporting scenarios for leadership and raise the floor for how we monitor our next generation of services.

SLO’s - A Brief Recap

Service Level Objectives (SLOs) are internal objectives for a team around the level of availability and performance for their service. They enable teams to codify the balance between reliability and engineering velocity, allowing teams to make the right prioritization decisions based on how much remaining error budget they have. The Error Budget¹ represents the amount of risk a team can take on while still meeting their quality bar.

There’s a lot of confusion around SLO’s. Common questions that the Reliability² organization gets are:
- What time window should we measure SLO’s on?
- My service falls under the PaymentIntents API which has a target SLO of 99.999%, but we rely on a downstream system that only reports an SLO of 99.9%. How can I make this reliable?
- What metrics should I use to measure my SLO’s?
- Should those metrics be client side? Server side? Both?
- What should my API/service/system Availability SLO be?

Unfortunately there’s no one-size-fits-most/all answer to these questions. The details depend on the service or system being monitored and how it impacts customers when it’s in a degraded state. On top of that, any blanket guidance that the Reliability organization could provide would do a disservice to the service owner. SLO’s are a tool for balancing reliability and engineering velocity. They provide a signal to Engineering Teams and their respective parent organizations on how a given service is doing against the broader priorities of the organization.

Putting that in context, if the Reliability organization were to set SLO targets for other teams, we would be trying to push priorities onto teams that may have other higher priority work. This won’t work and it won’t scale.

In brief, the responsibilities for SLO’s are laid out as follows:
- RBCLEAP owns and produces standardized dashboarding, alerting, and reporting on SLO’s across Stripe. This tooling will be a standardized minimum bar (min-bar) that is flexible enough in the future to account for scenario-specific SLO’s.
- RBCLEAP owns setting guidance, education, and hands-on assistance for teams that have Reliability-related issues.
- Teams outside of RBCLEAP own setting SLO’s for themselves and reporting on them within their Operating Group. This ownership includes both the decision making process for raising or lowering the SLO based on the priorities of that team, their OG’s agreement, and the maturity of the product/service.

How many 9’s do you need?

In 2021, we set an overall availability target of 99.999% or “5 9’s of reliability”. Stripes take great pride in setting the bar high and delivering a great customer experience. However, this target lacks both nuance and realistic expectations.

As we started the planning for StripeNext, we recognized we needed to set clear expectations with our partner teams as well as the leadership team. We will use SLO’s to drive that point, setting targets that strike the right balance between reliability (e.g. availability, latency) and engineering velocity.

Additionally, SLO’s provide a tool for communicating the availability, latency, throughput, and so on for partner teams. Teams can then choose to implement additional strategies for improving the reliability of their own service (e.g. retries).

¹ Error Budget is defined as (1 - SLO). For example, an SLO of 99.99% is an Error Budget of 0.01%.
² RBCLEAP

Making Sense of the 9’s

The following table shows the error budget for the year and for a 30 day period based on the number of 9’s specified. Notably, it takes less than half a minute (26 seconds) to exhaust the error budget for a service targeting 99.999% availability.

Availability           Error Budget (Year)   Error Budget (Last 30d)
90% (1 nine)           36.53 days            3 days
99% (2 nines)          3.65 days             7.3 hours
99.9% (3 nines)        8.77 hours            43.65 minutes
99.99% (4 nines)       52.60 minutes         4.36 minutes
99.995% (4.5 nines)    26.30 minutes         2.18 minutes
99.999% (5 nines)      5.26 minutes          26 seconds

The durations decorated in red are those with a required Mean Time to Recovery (MTTR) too short for a human to realistically respond in time.

To put this in perspective, void-originate was estimated to have taken our trailing 30 day availability from 99.9997% to 99.947%.
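
The table above is just arithmetic on the error budget. A minimal sketch (Python is an illustrative choice here, not part of the original guide) that converts an availability target into the downtime it allows over a window:

```python
from datetime import timedelta

def error_budget_downtime(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed by an availability SLO (e.g. 0.99999) over `window`."""
    return timedelta(seconds=window.total_seconds() * (1 - slo))

# Reproduces the "5 nines" row of the table above.
print(error_budget_downtime(0.99999, timedelta(days=365)))  # ~5.26 minutes
print(error_budget_downtime(0.99999, timedelta(days=30)))   # ~26 seconds
```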

Reliability versus Engineering Velocity

SLO’s are a tool for reasoning about a service’s reliability and engineering velocity. When you set your Availability SLO to 99.99%, you are also setting your error budget. Error budget is defined as: 1 - SLO. In our example, our Error Budget is 0.01% and that means that 0.01% of all requests to our service can fail.
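
Because the budget is a fraction of requests, it is easy to track how much of it has been spent in a window. A minimal sketch with hypothetical request counts (not figures from this guide):

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent over the measurement window."""
    allowed_failures = (1 - slo) * total_requests    # failures the SLO permits
    return 1 - (failed_requests / allowed_failures)  # 1.0 = untouched, <= 0 = exhausted

# A 99.99% SLO over 50M requests permits 5,000 failures; 1,200 failures leaves 76% of budget.
print(error_budget_remaining(0.9999, 50_000_000, 1_200))  # 0.76
```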

Error Budgets == Flexible Spending Account

I find it easiest to reason about Error Budgets like a Flexible Spending Account (FSA). In many healthcare plans, you can set aside a predetermined amount per year into an FSA. If you haven’t used that money by the end of the year, it is forfeited. Similarly, if you don’t use your Error Budget, then you don’t get it back.

Your Error Budget is a tool for managing risk. If you have a particularly risky change you want to flight, your Error Budget gives you breathing room to test that change in production. Similarly, if your product is early in its lifecycle and you want to iterate quickly, you can set your SLO at a lower target (e.g. 99.95%). This sets the right expectations within Stripe for how you’re prioritizing reliability versus engineering velocity.

So how many 9’s do I need?

There is no one-size-fits-all recommendation outside of a couple of specific instances (e.g. Charge Path). Each service owner needs to set the right expectations within their organization as to how reliable they will be. While the Reliability team could set a blanket SLO for all services, this would undermine the value of SLO’s as a prioritization tool. The reliability of a product is one of several dimensions that feed into a team’s prioritization.

That being said, the following tables provide a rough mapping of API or System Priority to SLO to Availability Tier³. Please see footnote three for more depth on Availability Tiers’ relevance.

There are a few caveats to these tables:

- These tables assume a single region, the current layout of Stripe today.
- The mappings are not necessarily representative of reality today.
- Astute readers might note that the author has been talking about “the SLO of services” the entire time but these tables don’t mention services.

API Priority                             Min. SLO (Availability)   Availability Tier
P0/P1 (Charge Path POST)                 99.995%                   A100-A110
P2 (Default POST/DELETE)                 99.99%                    A120
P3 (Default GET Retrieve, Dashboards)    99.95%                    A130
P4 (Default GET List)                    99.9%                     A140
P5 (Testmode APIs)                       99.5%                     A150+

For Infrastructure, this is a more nuanced conversation, but the following table shows examples.

System Priority                            Min. SLO (Availability)   Availability Tier
MSP worker nodes, Observability Systems    99.995%                   A100-A110
Database Systems - any merchant            99.995%                   A120
Database Systems - specific merchant       99.99%                    A120
Control Planes                             99.99%                    A130
CI/CD Pipeline                             99.95%                    A140

Reliability Patterns and Resources

This section includes a non-exhaustive list of patterns that can help improve the reliability of a service or system. The content is meant to provide a primer on the topic but every service and system is different. The patterns used in one service may not apply to another.

³ When we’ve discussed SLO’s with teams in the past, a common question is “How do Availability Tiers map to SLO’s? Do I need to set both?” The short answer is - you need to set both today.

The long answer is that Availability Tiers need to be abstracted away into the infrastructure platform. They are an implementation detail that is better represented by teams setting an SLO target. As we continue investing in the platform, this is one of several areas we’d like to abstract away.

Plan for Resiliency

Stateless & Leaderless

Where possible, treat your service instances like cattle, not pets.

Services should strive, where possible, to avoid being stateful. This means only storing data in external sources like a Data Layer (e.g. MySQL, Mongo) and, optionally, a Cache Layer (e.g. Redis). When service instances maintain state, they often require coordination when an instance is replaced.

Similarly, a leaderless service is preferable to a leader-elected service. Leader election is a specific type of stateful service where one instance acts as the coordinator for other instances. If the leader fails, a new leader must be elected. This failover to a new leader is costly in terms of Mean Time to Recovery. Leader Election is O(minutes) to elect a new leader, which means it would be impossible to meet a 5 9’s target.

Autoscaled & Overprovisioned (within reason)

Most services do not experience consistent traffic and this is true for Stripe as well. As traffic fluctuates, we experience an increase or decrease in overall resource utilization of our systems. On top of that, we may have unexpected increases in load due to circumstances beyond our control like a merchant shifting more traffic to us, a DDoS attack, or a flash sale from one of our merchants. However, it’s not efficient for us to stay statically overprovisioned for our peak anticipated load.

To meet both our efficiency and reliability goals, services need to be intelligently auto-scaled. When our services auto-scale, they also need to include headroom to absorb unanticipated traffic.

Designing services to be autoscaled is nuanced, but there are a few key properties:

- stateless - As mentioned above, stateless services are easy to reconstitute because there’s no state to share. The less state there is to maintain, the faster the instance comes up and is ready to serve traffic. Similarly, the less state an instance tracks, the lighter weight it is and the more we can run per underlying compute host.
- clear autoscaling signals - For some services, this can be as simple as queue depth (i.e. how many messages are we behind on processing); see the sketch after this list. In other cases, we’ll need to build ML models that predict traffic based on past performance.
- scalable and redundant data layer - While we strive to make our service instances stateless, we still need to persist data. The data layer choices we make influence how well a service can auto-scale.
  - NOTE TO SELF: This is largely a problem Foundation needs to solve, but we should empower Product teams to provide feedback on what capabilities they need and identify how we can meet those needs.
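
To make the queue-depth signal concrete, here is a minimal sketch of a desired-replica calculation. The per-worker throughput, drain target, and headroom values are illustrative assumptions, not numbers from this guide:

```python
import math

def desired_replicas(queue_depth: int,
                     msgs_per_worker_per_min: int = 600,  # assumed worker throughput
                     drain_target_min: int = 5,           # drain the backlog within 5 minutes
                     headroom: float = 0.3,               # 30% extra capacity for surprises
                     min_replicas: int = 2,
                     max_replicas: int = 200) -> int:
    """Scale workers so the current backlog drains within the target, plus headroom."""
    needed = queue_depth / (msgs_per_worker_per_min * drain_target_min)
    padded = math.ceil(needed * (1 + headroom))
    return max(min_replicas, min(max_replicas, padded))

print(desired_replicas(queue_depth=90_000))  # 39 replicas with the defaults above
```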

Idempotency

API’s should be implemented to support idempotency wherever possible. Most of our API surface already supports the notion of idempotent API calls. However, it’s important to plan for this from the start as it can influence design decisions further down the road.

This design principle is related to designing Stateless Services. If your API is idempotent, then the failure of an arbitrary instance of your service should not negatively impact the customer. We can either retry the API automatically for them or they can retry it if we’ve hit our own timeout.

Similarly, how we modify our systems should be idempotent. Most modern system tooling designs with this principle, but it is still important to take this into consideration when looking at introducing new tooling or practices.
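
As an illustration of the principle (a simplified in-memory sketch, not Stripe’s actual implementation), an idempotent endpoint can record the result of the first attempt under a caller-supplied idempotency key so that retries return the original result instead of repeating the side effect:

```python
import threading

class IdempotentChargeHandler:
    """Toy example; a real service would persist keys in its data layer."""

    def __init__(self):
        self._results = {}            # idempotency_key -> previously returned response
        self._lock = threading.Lock()

    def create_charge(self, idempotency_key: str, amount: int) -> dict:
        with self._lock:
            if idempotency_key in self._results:
                # Retry of a request we already processed: no duplicate charge.
                return self._results[idempotency_key]
            response = {"status": "succeeded", "amount": amount}  # stand-in for real work
            self._results[idempotency_key] = response
            return response

handler = IdempotentChargeHandler()
first = handler.create_charge("key-123", 500)
retry = handler.create_charge("key-123", 500)  # safe to retry after a timeout
assert first == retry
```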

Resilient System Degradation & Dependencies

Dependencies are Known

Part of building reliable and resilient systems is understanding the dependency graph of a given service. In a Service Oriented Architecture, no service exists without some set of dependencies - at a minimum, services will depend on:

- compute - where the actual process runs for the service (e.g. EC2 instance, Kubernetes)
- network - underlying networking infrastructure that connects compute to the broader Internet
- data - the database (e.g. MySQL) or message bus (e.g. Kafka) layers that allow applications to store state and communicate that state to other services

Beyond that, a Service-Oriented Architecture usually means that the service handling your application is not the same service doing user authentication or hosting your dashboards. That means there are both infrastructure service dependencies as well as functional dependencies. Understanding what your service is depending on is crucial to the reliability profile of your service. The SLO’s your dependencies provide will inform how you think about your failure modes for handling outages in those dependencies.

Critical Dependencies are Minimized

A special case of dependencies for services is critical dependencies. These are dependencies that, when down, fully block your service from working at all. For example, if your service exists in a single AWS region and network egress is hard down, there is no way for your service to handle inbound requests.

These critical dependencies are where your primary focus for having redundancy should be. This usually means deploying your service to multiple regions with automated regional failover, autoscaled deployments, and a sharded redundant data layer.

By keeping your critical dependencies at a minimum, you can focus on how to gracefully degrade when one of your less critical dependencies becomes unavailable.

Systems Degrade Gracefully

Resilient services degrade gracefully. For example, take Google Search - it degrades gracefully but you may not realize it. When you load google.com, the core scenario is entering a search string and getting a page rendered back with results. However, there is a list of functionality that is non-critical that allows Google Search to degrade gracefully:

- auto-complete for the search textbox
- ads in the search results
- infinite scrolling
- “About this result” modal dialog

If any of those scenarios fail, the product’s core functionality is still preserved. You can still enter a search term and get back a page of results.

Where possible, we need to consider the same types of graceful degradation in our systems. This isn’t always possible, but it’s important to consider because it is one of several tools available to preserve Stripe’s core product scenarios like the Charge Path.

High Latency == Availability Drop == Throughput Limit Exceeded == Fully Saturated

There is a relationship that may not be obvious between the Four Golden Signals of Availability, Latency, Throughput, and Saturation. If one degrades, the others may follow.

- If your service is experiencing high latency for a long enough period of time, it will manifest as an availability issue. The excess waiting for calls to complete will starve your service of resources and cause you to be unavailable.
- If you max out your allocated resources and saturate the service, your availability will drop and latency will increase due to an inability to service the throughput.
- If you exceed your known throughput limits, you’ll see availability dip and latency increase as it takes more effort to complete work. Similarly, this can lead to saturation.

As Stripe matures, we’ll need to expand how we think about what it means to be Available. Today we think about a couple of scenarios:
- did the RPC succeed?
- did the charge succeed when it should have (mostly Payments)?

In the future, we’ll want to add a third (see the sketch after this list):
- did the RPC respond fast enough to matter?
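
One way to express that third question as an SLI is to count a request as “good” only if it both succeeded and responded within a latency threshold. A minimal sketch; the 300ms threshold is an assumption for illustration, not an agreed Stripe standard:

```python
def availability_sli(requests: list[dict], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that succeeded AND responded within the latency threshold."""
    good = sum(1 for r in requests
               if r["ok"] and r["latency_ms"] <= latency_threshold_ms)
    return good / len(requests)

sample = [
    {"ok": True,  "latency_ms": 42},
    {"ok": True,  "latency_ms": 1200},  # succeeded, but too slow to matter
    {"ok": False, "latency_ms": 30},
]
print(availability_sli(sample))  # 0.333... - only one of the three requests counts as good
```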

Workloads Segregated by Performance/Priority Profile

Many teams own a service that has multiple scenarios associated with it. In this context, the service is a set of related functionality that has both POST and GET API’s associated with it. For example, the Charge Path has both API’s for creating charges as well as API’s for getting those transactions. In the Charge Path, the POST API’s for creating charges are considered to be the highest priority while the GET API’s have a lower priority.

Relatedly, POST API’s and GET API’s often do not have the same performance profile. Maybe POST requests have a p50 latency of 50ms while GET API’s have a p50 latency of 250ms. This might seem like a small delta, but at high RPS it can make a difference. If you host both the POST and GET API’s off the same service instances, it can obscure subtle shifts in the overall performance monitoring of those services. More importantly, if you have a massive influx of GET API requests, it could starve out CPU resources available to service the POST requests - in this example, each GET request takes 5 times as long as a POST request, so GET traffic consumes a disproportionate share of resources.

Similarly, from a monitoring point of view, it is easier to monitor two separate services for their respective SLO’s. You can more easily reason about them within the larger system, and it’s clear from the delineation what each service’s reliability, performance, and efficiency profiles are.

Put another way, it might seem like less operational overhead to serve a single deployment of your service with multiple sets of functionality. However, at a large enough scale, it can exacerbate problems and make system decomposition and monitoring more difficult in the long term.
“You inherit the reliability of your most used dependencies.”
Services often have many external dependencies. These might include:
- Database (MySQL)
- Message Bus (Kafka)
- 3rd party services not part of Stripe’s business (e.g. Visa)
- etcetera

If any of these dependencies makes up a plurality of the calls your own service makes, then your service will inherit that dependency’s reliability. For example, if the 3rd party service offers an SLA of 99.95% and your service’s job is to process payments through that service (e.g. it’s Visa), then if you implement no additional reliability strategies, you can expect 0.05% of all requests to fail.
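
A rough sketch of the math behind that statement: if every request must touch a set of serially required dependencies, your best-case availability is bounded by the product of theirs. The numbers below are illustrative, not measured Stripe figures:

```python
import math

def inherited_availability(dependency_slos: list[float]) -> float:
    """Upper bound on availability when every listed dependency must succeed per request."""
    return math.prod(dependency_slos)

# A 99.95% third party plus a 99.99% database already caps you below 99.95%.
print(inherited_availability([0.9995, 0.9999]))  # ~0.9994
```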

So what to do? There are a number of strategies available to us that we’ll cover in the following sections.

Retries

This is a common pattern across Stripe for handling issues with reliability in downstream dependencies (both internal and external). Generally, this is a very effective strategy but it comes with one potentially high cost: latency.

Systems which expect to retry need to make calls that have timeouts. Crucially, callers need to make sure they’re only retrying on errors which indicate that the call might actually succeed in the future. Depending on the system, the timeout duration will need to be tuned - some systems need longer timeouts and others need shorter ones. Without timeouts, or if the timeouts are too long, tail latency⁴ will grow, further impacting the callers. Additionally, retries need to be intelligent, using jitter and exponential backoff to ensure a thundering herd is not unleashed on the dependency.

If possible, building systems that can detect the health of downstream dependencies will provide the most robust protection against issues like thundering herds and retry amplification. While Stripe does not have a comprehensive approach to this today, it’s a goal we’ll work toward in the future.

Recommendation(s):
- Retries are a good tool but make sure to include jitter and exponential backoff in the retry strategy and make appropriate choices around timeouts⁵ (see the sketch after this list).
- To mitigate retry amplification, if a retry fails, rewrite retriable error codes into a non-retriable error code.⁶
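
A minimal sketch of those recommendations; the attempt counts, timeouts, and backoff values are illustrative parameters, not a blessed Stripe client configuration:

```python
import random
import time

class RetriableError(Exception):
    """Raised for failures where a later attempt might succeed (e.g. timeouts, 5xx)."""

def call_with_retries(call, max_attempts=3, timeout_s=1.0,
                      base_backoff_s=0.1, max_backoff_s=2.0):
    """Retry only retriable errors, with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(timeout=timeout_s)  # bound every attempt with a timeout
        except RetriableError:
            if attempt == max_attempts:
                # Surface a non-retriable failure so upstream callers don't amplify retries.
                raise RuntimeError("dependency unavailable; retries exhausted") from None
            backoff = min(max_backoff_s, base_backoff_s * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, backoff))  # full jitter avoids a thundering herd
```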

Error Codes and User Actionability

Whether your users are internal or external, they need error codes that make it obvious what their next step is. For example, it should be clear to your callers whether or not an error is retryable. Errors should also make it clear whether they were the result of the user (e.g. HTTP 4xx status code) or the result of some failure internal to the service (e.g. HTTP 5xx status code).

⁴ “Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently.” (source: https://brooker.co.za/blog/2021/04/19/latency.html)
⁵ gRPC Networking Configuration
⁶ No standardized set of error codes for this exists today.

If possible, the error text should also provide the user some notion of next steps. Whether it’s an indicator that their inputs were invalid or a retry time horizon, providing these to callers will help them adapt their services to better integrate with yours.

Recommendation(s) (see the sketch after this list):
- Make it clear whether an error is a client issue or server issue.
- Make it clear whether an error is retryable.
- Make it clear, in the error context, how a user can improve their usage of the service.
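
As a small illustration of those recommendations - the status-code mapping is an assumption for the sketch, since footnote 6 notes no standardized set of error codes exists today - a response can carry explicit client/server and retryability signals rather than forcing callers to guess:

```python
from dataclasses import dataclass

@dataclass
class ServiceError:
    http_status: int
    code: str
    message: str  # should tell the caller what to do next

    @property
    def is_client_error(self) -> bool:
        return 400 <= self.http_status < 500

    @property
    def is_retryable(self) -> bool:
        # Assumed mapping: server-side and rate-limit failures may succeed later;
        # other client errors require the caller to fix their request first.
        return self.http_status >= 500 or self.http_status == 429

err = ServiceError(429, "rate_limited", "Too many requests; retry after 30 seconds.")
print(err.is_client_error, err.is_retryable)  # True True
```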

Managing Timeouts
<TBD>

Running Multiple Copies of the Dependency

Some dependencies may have a scaling limit on their throughput. For whatever reason, there’s no viable way to raise the throughput ceiling without also impacting the availability or latency of that service. A common solution is to run multiple copies of that dependency.

In many cases, we already do this today at Stripe. For example, Shared-MSP services run multiple copies of the same container for a given service. This also improves resiliency in case an instance of the dependency fails. While the request might fail, the entire service won’t be down.

Sharding

Sharding is related to running multiple copies of a dependency, however it tends to be focused on how data is stored. When we talk about the “availability” of a data tier dependency, there are actually two dimensions we have to consider:
- Is data for any customer available?
- Is data for every customer available?
It is possible, with sharding, to build systems that can ensure one or both of these guarantees though the cost can be significant.

Today, Stripe uses sharding for distributing customers across Mongo and MySQL. However, we don’t duplicate data, so a single shard being down impacts a subset of customers. This means we’re solving the “Is data for any customer available?” question.

Other mechanisms of sharding might mimic Amazon’s Cell Architecture⁷. This pattern separates multiple instances of services across regions and availability zones to distribute failures across boundaries that are less likely to fail.
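
A minimal sketch of the first flavor described above (customers spread across shards, no duplication); the `customer_id` key and shard count are hypothetical choices for illustration:

```python
import hashlib

def shard_for(customer_id: str, num_shards: int = 16) -> int:
    """Deterministically map a customer to one of `num_shards` data shards."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# All of a customer's data lives on one shard, so losing shard N only affects the
# subset of customers that hash there ("is data for ANY customer available?").
print(shard_for("acct_12345"))
```

Note that plain modulo hashing makes changing the shard count expensive; consistent hashing or a directory service is the more common production choice.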

Caching

Using caching can improve reliability in a similar way to utilizing retries. By caching and setting reasonable timeouts, you reduce the need to query downstream dependencies like your data tier or dependency service.

However, depending on your approach to caching, you may be trading one system’s reliability for another. Caches that need updating often are typically external to the service itself - for example a Redis cluster. This is another service that can potentially go down, but your service should gracefully fall back to its data tier.

⁷ AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)

If the cached data is relatively static (e.g. top 100 customers by NPV⁸), then you can store these in static files that are loaded once in memory for your service. This improves latency on lookups for something like translating OrgId to AccountId because the data for your heaviest users is already in memory.
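
A minimal read-through sketch of that tradeoff; the cache and database clients here are stand-ins, not real Stripe interfaces. The cache is consulted first, but a cache outage degrades to the data tier instead of failing the request:

```python
def get_account_id(org_id: str, cache, database, ttl_s: int = 300):
    """Prefer the cache, but fall back to the data tier if the cache is unavailable."""
    try:
        cached = cache.get(org_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache outage: degrade gracefully rather than failing the lookup

    account_id = database.lookup_account(org_id)  # source of truth
    try:
        cache.set(org_id, account_id, ttl=ttl_s)  # best-effort repopulation
    except ConnectionError:
        pass
    return account_id
```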

Graceful Degradation & Critical Path Pruning

Earlier, we noted that it’s almost impossible to build a service that has no dependencies. You can, however, aggressively reduce the number of dependencies that are critical to your service’s operation.

We discussed in Systems Degrade Gracefully above that Google.com’s core scenario is search. The auto-complete service or site info modal service can adhere to lower SLO’s without notably impacting the primary expectations of Google’s users. Similarly, we should aim to build services that keep as few dependencies as possible and handle loss of a dependency gracefully. Doing this will mitigate the risk of the less reliable services failing.

Knowing what is most critical in a given service’s functions helps to ensure we can still serve traffic in a degraded state. While this isn’t ideal, it’s better than being hard down.
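
One common shape for this in code is to wrap non-critical calls so their failure degrades the response instead of failing the core scenario. A simplified sketch; the recommendations dependency is a hypothetical non-critical feature, not a real Stripe service:

```python
import logging

def handle_checkout(order, payments_client, recommendations_client):
    """Charging the order is the critical path; recommendations are prunable."""
    charge = payments_client.charge(order)  # if this fails, the request fails

    recommendations = []
    try:
        recommendations = recommendations_client.for_order(order, timeout=0.2)
    except Exception:
        # Non-critical dependency is down or slow: serve a degraded response.
        logging.warning("recommendations unavailable; serving without them")

    return {"charge": charge, "recommendations": recommendations}
```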

Replacing Dependencies with a More Reliable Alternative

This is one of the most aggressive responses to handling a dependency with an SLO too low for your service’s needs. Most often, this will come up in discussions about a 3rd party dependency that isn’t reliable enough (e.g. SignalFX).

Replacing a dependency is a high cost solution but sometimes the only solution. As always, this is about business prioritization and balancing the cost of development versus onboarding to a new 3rd party.

Asking for a Stronger SLO

With sufficient data, we can have discussions with dependencies to drive them toward improving their SLO. This might happen both internally and externally to Stripe. For example, a payment provider with an SLA of 99.9% introduces a lot of overhead for Stripe to be more resilient.

If you capture data across internal and external dependencies, you can identify where these weaknesses are and have data-driven discussions with partners about how to improve the situation.

Recommendation: This path is a “path of last resort.” Poorly implemented, this looks like teams bullying each other about their SLO’s to get unblocked. Teams should absolutely surface reliability concerns with partners, but they must also recognize that priorities may vary across the organization.

Multi-Region (Future)
TBD

⁸ Net Payment Volume

FAQ

General Questions

What’s a Service Level Objective (SLO)?

An SLO (Service Level Objective) is a target level of reliability for a service. Typically, all services will have SLO’s around the Four Golden Signals⁹ of Availability, Latency, Throughput, and Saturation.

What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
The limitation of our historic strategy with SLO’s has been that they don’t have real-time monitoring. Without real-time monitoring, there’s no opportunity to preserve the SLO and avoid a miss. Additionally, we have not applied any standardization to how we’ve measured SLO’s in the past. For example, there’s no standardized time window, definition of availability, definition of latency, and more.

Why are we doing SLO’s?

Closely monitoring our SLO’s ensures we’re meeting the level of service our customers expect from Stripe. Having this data also surfaces important information about the health of the business to leaders, which helps prioritize the right investments. For example, if a team is consistently meeting 99.9999% availability and their SLO is only 99.995%, they can take on additional risk and ship new features for customers at a faster rate.

Why do SLO’s need to have real-time alerting?

Real-time alerting for SLO’s gives teams an opportunity to preserve the quality of service customers expect of Stripe. Additionally, this is another layer of monitoring that is standard across Stripe’s services and provides a backstop for ensuring we at least monitor the Four Golden Signals.

If a team comes to me and says they need a stronger SLO, can I push back?
Yes! A common misunderstanding of using SLO’s is that all of your dependencies must have an SLO greater than or equal to your own. This is inaccurate and would be untenable - you’d basically end up in a standoff on who’s going to implement reliability improvements first.

There are strategies that upstream callers can implement, like sharding and retrying of downstream dependencies, to improve their own reliability. Upstream callers can also carefully design their service to ensure they minimize the number of critical dependencies they have.

However, expect that teams will ask this question. It’s a reasonable one and should be treated as a signal that your service may not be meeting your customers’ needs.

How should managers and leaders think about their SLO’s?

SLO’s provide a high level measure of the health of your team or organization’s services. Teams should monitor the health of the Four Golden Signals of Availability, Latency, Throughput, and Saturation as well as custom metrics that are unique to their service.

⁹ https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals

SLO’s are business metrics that help with planning and prioritization. If your services aren’t performing well against their SLO’s, that’s an indicator that you need to prioritize reliability improvements. Similarly, if your services are performing exceptionally well against their SLO’s, you can plan more risky changes.

Why do we use a fixed 30 day window for measuring SLO’s?

Part of what makes SLO’s useful is that they provide a common language to discuss reliability. One dimension of measuring reliability is the timeframe you measure over. By standardizing on 30 days for monitoring SLO’s, we simplify discussions around comparing the relative reliability of services.

Additionally, a fixed 30 day window provides sufficient time that a team could reasonably respond to and mitigate a degradation in their service. If you use a time window of 6 hours, that leaves very little time to respond to a degradation and preserve your SLO.

Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
No! There are two reasons for this:
1. Services should be designed to reduce the number of critical dependencies they have. By reducing the number of critical dependencies a service has, you improve the resiliency of your service which allows you to provide a higher SLO.
2. Upstream callers can implement strategies to mitigate lower SLO’s in downstream dependencies. See this section for more information.

What resources are available for me to go deeper on SLO’s?

- See the Resources section for a reading list.
- The Reliability Architecture team is a resource for any questions you have.

Stripe SLO Monitoring Questions

What tools are available to Stripes for monitoring SLO’s?

The StripeNext SLO Workstream is building standardized tooling, infrastructure, dashboarding, and alerting for all existing and future Horizon services. For more information, have a look at SLO Monitoring Overview.

For v1 services using the pay-server stack, the Service Commitments dashboard is still the primary way to monitor SLO’s. Please note that it does not provide real-time alerting.

Do I have to use Stripe’s SLO Monitoring?

In short, yes. As part of StripeNext, all Horizon services will have a mandatory field requiring service owners to define SLO’s for Availability and Latency. In the future, we’ll also add Throughput, Saturation, and the option to include custom SLO’s based on service-specific SLI’s.

However, teams will have the ability to configure alerting for their service’s SLO’s.
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
Not yet, but we’re hoping to add this in 2022Q4 or 2023Q1.

I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s SLO Monitoring?
Yes… if it’s running as a Horizon service. We recognize that not all teams will be building Horizon services; however, to ensure we complete at least a subset of the end-to-end scenarios, the workstream is focused on Horizon services.

In the future, we’ll look to expand support for:

- infrastructure - e.g. database systems, networking infrastructure, etc
- control planes - e.g. those written in Golang
- proxies - e.g. mproxy, MySQL Inspector, kproxy
- Golang-based services used by Foundation and Security

Again, if you build a Horizon service, that will be fully supported.

Why are we focused on StripeNext Horizon services first for SLO Monitoring?
For better or worse, we believe it’s more important to complete one end-to-end scenario and expand outward based on what we learn. Given the amount of work going into rewriting or building net new services in Horizon, we believe this gives the greatest bang for the buck.

What happens if my service runs out of Error Budget?

Today - you’ll receive real-time notifications of Error Budget exhaustion.

In the future, we’ll:

- surface Error Budget issues in the Reliability Posture Dashboard (RPD). The reporting in RPD will be surfaced in Ops Reviews and brought for discussion if the issue is severe enough.
- integrate Error Budget exhaustion with the go/production-gates program

I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my service have to have a 99.999% SLO?
No! Please see Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?

Can I use SLO Monitoring as the only alerts for my service’s health?
You could, but we wouldn’t recommend it. The SLO Monitoring alerts are a “floor raiser”, meaning they provide a minimum bar for alerting for services. We expect that as teams gain a better understanding of how their service behaves in production, they’ll create custom alerts and dashboards that give them detailed insight into their service’s health.

The SLO Monitoring workstream is meant to provide a coherent high level picture of the health of Stripe’s services across engineering organizations.

Can I tune my error budget burn rate alerts for SLO’s?
Yes! In 2022 Q2, we’ll be adding the ability to tune different aspects of the burn rate alerts. You’ll also be able to disable alerts but we highly discourage that :). While we’ll allow disabling of alerts, that will be reported as part of RPD health scoring.

Why are you using Burn Rate alerts instead of Threshold-based alerts?
In general, we can create alerts with better precision and recall by alerting on error budget burn rate over multiple time windows. More information on this is available in SLO Alerting.
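
For intuition, here is a minimal sketch of a multi-window burn-rate check in the style popularized by the Google SRE Workbook; the window pair and threshold are illustrative, not the values Stripe’s tooling actually uses:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent relative to plan (1.0 = exactly on budget)."""
    return error_ratio / (1 - slo)

def should_page(slo: float, error_ratio_1h: float, error_ratio_5m: float) -> bool:
    """Page only if both the long and short windows burn fast (better precision AND recall)."""
    threshold = 14.4  # sustained, this would exhaust a 30-day budget in ~2 days
    return (burn_rate(error_ratio_1h, slo) >= threshold and
            burn_rate(error_ratio_5m, slo) >= threshold)

# 0.2% errors against a 99.9% SLO is a 2x burn: worth watching, not page-worthy yet.
print(should_page(0.999, error_ratio_1h=0.002, error_ratio_5m=0.002))  # False
```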

Resources
- Google SRE book(s)
  - https://sre.google/books/
- Microsoft’s Azure Architecture Reliability docs: https://docs.microsoft.com/en-us/azure/architecture/reliability/architect
- Google’s Cloud Architecture docs: https://cloud.google.com/architecture/framework/reliability
- Command Query Responsibility Segregation (CQRS) design pattern
  - https://martinfowler.com/bliki/CQRS.html
  - https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs
- Setting SLO’s for services with dependencies
  - https://cloud.google.com/blog/products/devops-sre/defining-slos-for-services-with-dependencies-cre-life-lessons
- On retries, timeouts, and jitter
  - https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
