SLO Guide & FAQ
SLO Guide
DRI: Zac Brown, Jeremy Edwards, Siddarth Chandrasekaran, Preston O'Neal
Last updated: Apr 13, 2022
Status: published (living document)
Table of Contents
Context
SLO’s - A Brief Recap
FAQ
General Questions
What’s a Service Level Objective (SLO)?
What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
Why are we doing SLO’s?
Why do SLO’s need to have real-time alerting?
If a team comes to me and says they need a stronger SLO, can I push back?
How should managers and leaders think about their SLO’s?
Why do we use a fixed 30 day window for measuring SLO’s?
Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
What resources are available for me to go deeper on SLO’s?
Stripe SLO Monitoring Questions
What tools are available to Stripes for monitoring SLO’s?
Do I have to use Stripe’s SLO Monitoring?
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s SLO Monitoring?
Why are we focused on StripeNext Horizon services first for SLO Monitoring?
What happens if my service runs out of Error Budget?
I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my service have to have a 99.999% SLO?
Can I use SLO Monitoring as the only alerts for my service’s health?
Can I tune my error budget burn rate alerts for SLO’s?
Why are you using Burn Rate alerts instead of Threshold-based alerts?
Resources
Context
Historically, Stripe has set broad single target goals for reliability. As we’ve grown, it has become apparent that
a one-size-fits-all approach does not work. Different teams have different needs and goals when making the
tradeoff between reliability and engineering velocity. This document provides an overview of the patterns and
practices needed to build a service designed for different levels of reliability.
Additionally, we want to clarify and standardize how concepts in our current infrastructure map to the changes
being introduced in the StripeNext work. This will help better align reporting scenarios for leadership and raise
the floor for how we monitor our next generation of services.
There’s a lot of confusion around SLO’s. Common questions that the Reliability [2] organization gets are:
- What time window should we measure SLO’s on?
- My service falls under the PaymentIntents API which has a target SLO of 99.999%, but we rely on a
downstream system that only reports an SLO of 99.9%. How can I make this reliable?
- What metrics should I use to measure my SLO’s?
- Should those metrics be client side? Server side? Both?
- What should my API/service/system Availability SLO be?
Unfortunately, there’s no one-size-fits-all answer to these questions. The details depend on the service or
system being monitored and how it impacts customers when it’s in a degraded state. On top of that, any
blanket guidance that the Reliability organization could provide would do a disservice to the service owner.
SLO’s are a tool for balancing reliability and engineering velocity. They provide a signal to Engineering Teams
and their respective parent organizations on how a given service is doing against the broader priorities of the
organization.
Putting that in context, if the Reliability organization were to set SLO targets for other teams, we would be
trying to push priorities onto teams that may have other higher priority work. This won’t work and it won’t scale.
In brief, the responsibilities for SLO’s are laid out as follows:
- RBCLEAP owns and produces standardized dashboarding, alerting, and reporting on SLO’s across
Stripe. This tooling will be a standardized minimum bar (min-bar) that is flexible enough in the future to
account for scenario-specific SLO’s.
- RBCLEAP owns setting guidance, education, and hands-on assistance for teams that have
Reliability-related issues.
- Teams outside of RBCLEAP own setting SLO’s for themselves and reporting on them within their
Operating Group. This ownership includes both the decision making process for raising or lowering the
SLO based on the priorities of that team, their OG’s agreement, and the maturity of the product/service.
As we started the planning for StripeNext, we recognized we needed to set clear expectations with our partner
teams as well as the leadership team. We will use SLO’s to drive that point, setting targets that strike the right
balance between reliability (e.g. availability, latency) and engineering velocity.
Additionally, SLO’s provide a tool for communicating availability, latency, throughput, and so on to partner
teams. Teams can then choose to implement additional strategies for improving the reliability of their own
service (e.g. retries).
[1] Error Budget is defined as (1 - SLO). For example, an SLO of 99.99% is an Error Budget of 0.01%.
[2] RBCLEAP
Making Sense of the 9’s
The following table shows the error budget for the year and for a 30 day period based on the number of 9’s
specified. Notably, it takes less than half a minute (26 seconds) to exhaust the error budget for a service
targeting 99.999% availability.
The boxes where the time durations are decorated in red are those where the implied Mean Time to Recovery
(MTTR) is too short for a human to realistically respond in time.
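These durations follow directly from the Error Budget definition in footnote [1]; the following is a minimal sketch of the standard math behind the table, with no Stripe-specific assumptions:

from datetime import timedelta

# Error Budget = (1 - SLO), expressed here as allowable downtime over a window.
def error_budget(slo: float, window: timedelta) -> timedelta:
    return timedelta(seconds=round(window.total_seconds() * (1 - slo)))

for slo in (0.999, 0.9995, 0.9999, 0.99995, 0.99999):
    per_30_days = error_budget(slo, timedelta(days=30))
    per_year = error_budget(slo, timedelta(days=365))
    print(f"SLO {slo:.3%}: {per_30_days} per 30 days, {per_year} per year")

# SLO 99.900%: 0:43:12 per 30 days, 8:45:36 per year
# SLO 99.950%: 0:21:36 per 30 days, 4:22:48 per year
# SLO 99.990%: 0:04:19 per 30 days, 0:52:34 per year
# SLO 99.995%: 0:02:10 per 30 days, 0:26:17 per year
# SLO 99.999%: 0:00:26 per 30 days, 0:05:15 per year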
To put this in perspective, void-originate was estimated to have taken our trailing 30 day availability from
99.9997% to 99.947%.
Your Error Budget is a tool for managing risk. If you have a particularly risky change you want to flight, your
Error Budget gives you breathing room to test that change in production. Similarly, if your product is early in its
lifecycle and you want to iterate quickly, you can set your SLO at a lower target (e.g. 99.95%). This sets the
right expectations within Stripe for how you’re prioritizing reliability versus engineering velocity.
For Infrastructure, this is a more nuanced conversation, but the following table shows examples.
[3] When we’ve discussed SLO’s with teams in the past, a common question is “How do Availability Tiers map to SLO’s? Do
I need to set both?” The short answer is - you need to set both today.
The long answer is that Availability Tiers need to be abstracted away into the infrastructure platform. They are an
implementation detail that is better represented by teams setting an SLO target. As we continue investing in the platform,
this is one of several areas we’d like to abstract away.
Plan for Resiliency
Services should strive, where possible, to avoid being stateful. This means only storing data in external
sources like a Data Layer (e.g. MySQL, Mongo) and, optionally, a Cache Layer (e.g. Redis). When service
instances maintain state, they often require coordination when an instance is replaced.
Similarly, a leaderless service is preferable to leader-elected services. Leader election is a specific type of
stateful service where one instance acts as the coordinator for other instances. If the leader fails, a new leader
must be elected. This failover to a new leader is costly in terms of Mean Time to Recovery. Leader election
takes O(minutes) to complete, which means it would be impossible to meet a 5 9’s target.
To meet both our efficiency and reliability goals, services need to be intelligently auto-scaled. When our
services auto-scale, they also need to include headroom to absorb unanticipated traffic.
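As a rough illustration of what headroom means in practice (the 60% utilization target below is an assumed value for illustration, not a Stripe standard): the lower your steady-state utilization target, the larger the traffic spike the fleet can absorb while the autoscaler catches up.

# Rough headroom math: if autoscaling targets 60% utilization (assumed value),
# the fleet can absorb roughly a 1.67x traffic spike before saturating,
# giving the autoscaler time to add capacity.
target_utilization = 0.60
absorbable_spike = 1 / target_utilization
print(f"Can absorb roughly a {absorbable_spike:.2f}x traffic spike")  # ~1.67x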
Idempotency
API’s should be implemented to support idempotency wherever possible. Most of our API surface already
supports the notion of idempotent API calls. However, it’s important to plan for this from the start as it can
influence design decisions further down the road.
The design principle is related to designing Stateless Services. If your API is idempotent, then the failure of an
arbitrary instance of your service should not negatively impact the customer. We can either retry the API
automatically for them or they can retry it if we’ve hit our own timeout.
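As a sketch of the pattern (the store and function below are hypothetical, not Stripe’s actual API framework): the caller supplies an idempotency key, and the service returns the previously recorded result on a retry instead of re-executing the operation.

from typing import Any, Callable, Dict

# Hypothetical in-memory store; a real service would use a durable store
# (e.g. the data tier) so retries after an instance failure still dedupe.
_results: Dict[str, Any] = {}

def idempotent(key: str, operation: Callable[[], Any]) -> Any:
    """Run `operation` at most once per idempotency key."""
    if key in _results:
        return _results[key]          # safe to return on a client retry
    result = operation()              # perform the side effect exactly once
    _results[key] = result
    return result

# A retried request with the same key gets the original result back.
charge = idempotent("req_abc123", lambda: {"charge_id": "ch_1", "amount": 500})
retry = idempotent("req_abc123", lambda: {"charge_id": "ch_2", "amount": 500})
assert charge == retry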
Similarly, how we modify our systems should be idempotent. Most modern system tooling designs with this
principle, but it is still important to take this into consideration when looking at introducing new tooling or
practices.
Every service also relies on a set of core infrastructure dependencies:
- compute - where the actual process runs for the service (e.g. EC2 instance, Kubernetes)
- network - underlying networking infrastructure that connects compute to the broader Internet
- data - the database (e.g. MySQL) or message bus (e.g. Kafka) layers that allow applications to store
state and communicate that state to other services
Beyond that, a Service-Oriented Architecture usually means that the service handling your application is not
the same service doing user authentication or hosting your dashboards. That means there are both infrastructure
service dependencies and functional dependencies. Understanding what your service depends on is
crucial to the reliability profile of your service. The SLO’s of the services you depend on will inform how you
think about your failure modes for handling outages in those dependencies.
These critical dependencies are where your primary focus for having redundancy should be. This usually
means deploying your service to multiple regions with automated regional failover, autoscaled deployments,
and a sharded redundant data layer.
Systems Degrade Gracefully
By keeping your critical dependencies at a minimum, you can focus on how to gracefully degrade when one of
your less critical dependencies becomes unavailable.
Consider Google.com, whose core scenario is search; auxiliary features like auto-complete or the site info
modal can fail without taking search down. If any of those scenarios fail, the product’s core functionality is still
preserved. You can still enter a search term and get back a page of results.
Where possible, we need to consider the same types of graceful degradation in our systems. This isn’t always
possible, but it’s important to consider because it is one of several tools available to preserve Stripe’s core
product scenarios like the Charge Path.
Keep in mind that availability, latency, saturation, and throughput interact with one another:
- If your service is experiencing high latency for a long enough period of time, it will manifest as an
availability issue. The excess waiting for calls to complete will starve your service of resources and
cause you to be unavailable.
- If you max out your allocated resources and saturate the service, your availability will drop and latency
will increase due to an inability to service the throughput.
- If you exceed your known throughput limits, you’ll see availability dip and latency increase as it takes
more effort to complete work. Similarly, this can lead to saturation.
As Stripe matures, we’ll need to expand how we think about what it means to be Available. Today we think
about a couple of scenarios:
- did the RPC succeed?
- did the charge succeed when it should have (mostly Payments)?
Relatedly, POST API’s and GET API’s often do not have the same performance profile. Maybe POST requests
have a p50 latency of 50ms while GET API’s have a p50 latency of 250ms. This might seem like a small delta,
but at high RPS it can make a difference. If you host both the POST and GET API’s off the same service
instances, it can obscure subtle shifts in the overall performance monitoring of those services. More
importantly, if you have a massive influx of GET API requests, it could starve out CPU resources available to
service the POST requests, since every GET request takes five times as long as every POST request.
Similarly, from a monitoring point of view, it is easier to monitor two separate services for their respective
SLO’s. You can more easily reason about them within the larger system, and it’s clear from the delineation
what each service’s reliability, performance, and efficiency profiles are.
Put another way, it might seem like less operational overhead to serve a single deployment of your service with
multiple sets of functionality. However, at a large enough scale, it can exacerbate problems and make system
decomposition and monitoring more difficult in the long term.
“You inherit the reliability of your most used dependencies.”
Services often have many external dependencies. These might include:
- Database (MySQL)
- Message Bus (Kafka)
- 3rd party services not part of Stripe’s business (e.g. Visa)
- etcetera
If any of these dependencies makes up a plurality of the calls your own service makes, then your service will
inherit that dependency’s reliability. For example, if the 3rd party service offers an SLA of 99.95% and your
service’s job is to process payments through that service (e.g. it’s Visa), then, absent any additional
reliability strategies, you can expect 0.05% of all requests to fail.
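To see why the most-used dependency dominates, note that the availabilities of dependencies you must traverse in series multiply together; a minimal sketch with illustrative numbers (assuming independent failures and no mitigations):

# End-to-end availability of a request that must traverse dependencies in
# series is the product of their availabilities.
dependencies = {"own service": 0.9999, "data tier": 0.9995, "card network": 0.9995}

combined = 1.0
for availability in dependencies.values():
    combined *= availability

print(f"Expected end-to-end availability: {combined:.4%}")  # ~99.89%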
So what to do? There are a number of strategies available to us that we’ll cover in the next section.
Retries
This is a common pattern across Stripe for handling issues with reliability in downstream dependencies (both
internal and external). Generally, this is a very effective strategy but it comes with one potentially high cost:
latency.
Systems which expect to retry need to make calls that have timeouts. Crucially, callers need to make sure
they’re only retrying on errors which indicate that the call might actually succeed in the future. Depending on
the system, the timeout duration will need to be tuned - some systems need longer timeouts and others
need shorter ones. Without timeouts, or if the timeouts are too long, tail latency [4] will grow, further impacting the
callers. Additionally, retries need to be intelligent, using jitter and exponential backoff to ensure a thundering
herd is not unleashed on the dependency.
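A minimal sketch of that retry shape (the error type, attempt limits, and delays below are placeholders rather than a Stripe library): exponential backoff capped at a maximum, full jitter, and retries only on errors that might succeed later.

import random
import time

class RetryableError(Exception):
    """Placeholder for errors that indicate the call may succeed later."""

def call_with_retries(call, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry `call` with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()                  # the call itself should enforce a timeout
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                      # retry budget exhausted; surface the error
            # Full jitter: sleep a random amount up to the capped backoff.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))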
If possible, building systems that can detect the health of downstream dependencies will provide the most
robust protection against issues like thundering herds and retry amplification. While Stripe does not have a
comprehensive approach to this today, it’s a goal we’ll work toward in the future.
Recommendation(s):
- Retries are a good tool but make sure to include jitter and exponential backoff in the retry strategy and
make appropriate choices around timeouts [5].
- To mitigate retry amplification, if a retry fails, rewrite retriable error codes into a non-retriable error
code. [6]
[4] “Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently.” (source: https://brooker.co.za/blog/2021/04/19/latency.html)
[5] gRPC Networking Configuration
[6] No standardized set of error codes for this exists today.
If possible, the error text should also provide the user some notion of next steps. Whether it’s an indicator that
their inputs were invalid or a retry time horizon, providing these to callers will help them adapt their services to
better integrate with yours.
Recommendation(s):
- Make it clear whether an error is a client issue or server issue.
- Make it clear whether an error is retryable.
- Make it clear, in the error context, how a user can improve their usage of the service.
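One way to carry this information is a structured error payload; the field names below are illustrative only, since no standardized set of error codes exists today [6].

from dataclasses import dataclass
from typing import Optional

@dataclass
class ServiceError:
    code: str                       # machine-readable error code
    message: str                    # human-readable explanation and next steps
    client_error: bool              # True if the caller's input was at fault
    retryable: bool                 # whether retrying could succeed
    retry_after_seconds: Optional[float] = None   # backoff hint, if known

# Example: a transient dependency failure the caller may safely retry.
err = ServiceError(
    code="dependency_unavailable",
    message="Downstream data tier timed out; retry with backoff.",
    client_error=False,
    retryable=True,
    retry_after_seconds=0.5,
)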
Managing Timeouts
<TBD>
In many cases, we already run multiple copies of a dependency today at Stripe. For example, Shared-MSP
services run multiple copies of the same container for a given service. This also improves resiliency in case an
instance of the dependency fails. While an individual request might fail, the entire service won’t be down.
Sharding
Sharding is related to running multiple copies of a dependency, however it tends to be focused on how data is
stored. When we talk about the “availability” of a data tier dependency, there are actually two dimensions we
have to consider:
- Is data for any customer available?
- Is data for every customer available?
It is possible, with sharding, to build systems that can ensure one or both of these guarantees though the cost
can be significant.
Today, Stripe uses sharding for distributing customers across Mongo and MySQL. However, we don’t duplicate
data, so a single shard being down impacts a subset of customers. This means we’re solving the “Is data for
any customer available?” question.
Other mechanisms of sharding might mimic Amazon’s Cell Architecture [7]. This pattern separates multiple
instances of services across regions and availability zones so that failures are confined to boundaries that are
unlikely to fail together.
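Returning to the two dimensions above, here is a sketch of how deterministic shard assignment addresses “Is data for any customer available?” (the hashing scheme below is illustrative, not Stripe’s actual shard assignment): a single shard outage impacts only the customers mapped to that shard.

import hashlib

NUM_SHARDS = 16   # illustrative shard count

def shard_for_customer(customer_id: str) -> int:
    """Deterministically map a customer to a shard.

    If one shard is down, only the customers hashing to it are impacted;
    data for the remaining customers stays available.
    """
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for_customer("acct_123"))  # stable shard index in [0, 15]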
Caching
Using caching can improve reliability in a similar way to utilizing retries. By caching and setting reasonable
timeouts, you reduce the need to query downstream dependencies like your data tier or dependency service.
However, depending on your approach to caching, you may be trading one system’s reliability for another’s.
Caches that need frequent updating are typically external to the service itself - for example, a Redis cluster. This is
another service that can potentially go down, but your service should gracefully fall back to its data tier.
[7] AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
If the cached data is relatively static (e.g. top 100 customers by NPV [8]), then you can store these in static files
that are loaded once in memory for your service. This improves latency on lookups for something like
translating OrgId to AccountId because the data for your heaviest users is already in memory.
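Here is a sketch of the graceful fallback described above (the cache client and data tier calls are stand-ins, not a specific Stripe library): on a cache miss or a cache outage, the lookup falls back to the data tier rather than failing the request.

def get_account_id(org_id: str, cache, data_tier) -> str:
    """Look up an AccountId, preferring the cache but degrading gracefully."""
    try:
        cached = cache.get(org_id)          # e.g. a Redis client (stand-in)
        if cached is not None:
            return cached
    except Exception:
        pass                                # cache outage: degrade, don't fail

    account_id = data_tier.lookup(org_id)   # authoritative source of truth
    try:
        cache.set(org_id, account_id, ttl=300)   # best-effort repopulation
    except Exception:
        pass
    return account_id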
We discussed how Google.com’s core scenario is search in Systems Degrade Gracefully above. The auto-complete
service or site info modal service can adhere to lower SLO’s without notably impacting the primary
expectations of Google’s users. Similarly, we should aim to build services that keep as few dependencies as
possible and handle loss of the dependency gracefully. Doing this will mitigate the risk of the less reliable
services failing.
Knowing what is most critical in a given service’s functions helps to ensure we can still serve traffic in a
degraded state. While this isn’t ideal, it’s better than being hard down.
Replacing a dependency is a high cost solution but sometimes the only solution. As always, this is about
business prioritization and balancing cost in development versus onboarding to a new 3rd party.
If you capture data across internal and external dependencies, you can identify where these weaknesses are
and have data-driven discussions with partners about how to improve the situation.
Recommendation: This path is a “path of last resort.” Poorly implemented, this looks like teams bullying each
other about their SLO’s to get unblocked. Teams should absolutely surface reliability concerns with partners,
but they must also recognize that priorities may vary across the organization.
Multi-Region (Future)
TBD
[8] Net Payment Volume
FAQ
General Questions
What’s wrong with the way we’ve been doing SLO’s (aka SLC’s)?
The limitation of our historic strategy with SLO’s has been that they don’t have real-time monitoring. Without
real-time monitoring, there’s no opportunity to preserve the SLO and avoid a miss. Additionally, we have not
applied any standardization to how we’ve measured SLO’s in the past. For example, there’s no standardized
time window, definition of availability, definition of latency and more.
If a team comes to me and says they need a stronger SLO, can I push back?
Yes! A common misunderstanding of using SLO’s is that all of your dependencies must have an SLO greater
than or equal to your own. This is inaccurate and would be untenable - you’d basically end up in a standoff on
who’s going to implement reliability improvements first.
There are strategies that upstream callers can implement like sharding and retrying of downstream
dependencies to improve their own reliability. Upstream callers can also carefully design their service to ensure
they minimize the number of critical dependencies they have.
However, expect that teams will ask this question. It’s a reasonable one and should be treated as a signal that
your service may not be meeting your customers’ needs.
[9] https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals
SLO’s are business metrics that help with planning and prioritization. If your services aren’t performing well
against their SLO’s, that’s an indicator that you need to prioritize reliability improvements. Similarly, if your
services are performing exceptionally well against their SLO’s, you can plan more risky changes.
Additionally, a fixed 30 day window provides sufficient time that a team could reasonably respond and mitigate
a degradation in their service. If you use a time window of 6 hours, that leaves very little time to respond to a
degradation and preserve your SLO.
Does every service have to have the same SLO? Does my SLO have to be as high as my upstream callers?
No! There are two reasons for this:
1. Services should be designed to reduce the number of critical dependencies they have. By reducing the
number of critical dependencies a service has, you improve the resiliency of your service which allows
you to provide a higher SLO.
2. Upstream callers can implement strategies to mitigate lower SLO’s in downstream dependencies. See
this section for more information.
For v1 services using the pay-server stack, the ServiceCommitments dashboard is still the primary way to
monitor SLO’s. Please note that it does not provide real-time alerting.
However, teams will have the ability to configure alerting for their service’s SLO’s.
Can I create custom SLI’s to set SLO’s on with Stripe’s SLO Monitoring?
Not yet, but we’re hoping to add this in 2022Q4 or 2023Q1.
I’m a Foundation team providing common infrastructure to Stripe. Can I use Stripe’s
SLO Monitoring?
Yes… if it’s running as a Horizon service. We recognize that not all teams will be building Horizon services,
however to ensure we complete at least a subset of the end-to-end scenarios, the workstream is focused on
Horizon services.
Why are we focused on StripeNext Horizon services first for SLO Monitoring?
For better or worse, we believe it’s more important to complete one end-to-end scenario and expand outward
based on what we learn. Given the amount of work going into rewriting or building net new services in Horizon,
we believe it would give the greatest bang for buck.
I keep hearing that we want to achieve 99.999% availability SLO Stripe-wide. Does my
service have to have a 99.999% SLO?
No! Please see Does every service have to have the same SLO? Does my SLO have to be as high as my
upstream callers?
Can I use SLO Monitoring as the only alerts for my service’s health?
You could, but we wouldn’t recommend it. The SLO Monitoring alerts are a “floor raiser”, meaning they provide
a minimum bar for alerting for services. We expect that as teams gain a better understanding of how their
service behaves in production, they’ll create custom alerts and dashboards that give them detailed insight into
their service’s health.
The SLO Monitoring workstream is meant to provide a coherent high level picture of the health of Stripe’s
services across engineering organizations.
Can I tune my error budget burn rate alerts for SLO’s?
Yes! In 2022 Q2, we’ll be adding the ability to tune different aspects of the burn rate alerts. You’ll also be able
to disable alerts but we highly discourage that :). While we’ll allow disabling of alerts, that will be reported as
part of RPD health scoring.
Why are you using Burn Rate alerts instead of Threshold-based alerts?
In general we can create alerts with better precision and recall by alerting on error budget burn rate over
multiple time windows. More information on this is available in SLO Alerting.
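For intuition, a burn rate of 1 consumes the error budget exactly over the SLO window, while a burn rate of 14.4 consumes a 30 day budget in roughly two days. Below is a minimal sketch of a multi-window check; the thresholds and windows follow the common pattern from the Google SRE workbook and are not necessarily the exact values used by Stripe’s SLO Monitoring.

def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed relative to plan.

    A burn rate of 1.0 exhausts the budget exactly at the end of the window;
    14.4 exhausts a 30 day budget in roughly two days.
    """
    return error_ratio / (1 - slo)

def should_page(slo, errors_1h, total_1h, errors_5m, total_5m) -> bool:
    # Multi-window check: the long window gives precision, the short window
    # confirms the problem is still happening (good recall, fast reset).
    long_burn = burn_rate(errors_1h / total_1h, slo)
    short_burn = burn_rate(errors_5m / total_5m, slo)
    return long_burn > 14.4 and short_burn > 14.4

# Example: a 99.9% SLO with 2% of requests failing burns budget 20x too fast.
print(should_page(0.999, errors_1h=2_000, total_1h=100_000,
                  errors_5m=200, total_5m=10_000))   # True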
Resources
- Google SRE book(s)
- https://sre.google/books/
- Microsoft’s Azure Architecture Reliability docs:
https://docs.microsoft.com/en-us/azure/architecture/reliability/architect
- Google’s Cloud Architecture docs: https://cloud.google.com/architecture/framework/reliability
- Command Query Responsibility Separation (CQRS) design pattern
- https://martinfowler.com/bliki/CQRS.html
- https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs
- Setting SLO’s for services with dependencies
- https://cloud.google.com/blog/products/devops-sre/defining-slos-for-services-with-dependencies-cre-life-lessons
- On retries, timeouts, and jitter
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/