Kubernetes Reliability at Scale
How to Improve Uptime with Resiliency Management
Introduction: The Kubernetes Reliability Gap
There’s more pressure than ever to deliver high-availability Kubernetes systems.
Consumers expect applications to be available at all times and have zero patience for
outages or downtime. At the same time, businesses have created intricate webs of APIs
and dependencies that rely on your applications.
Unfortunately, building reliable systems is easier said than done. Every system has
potential points of failure that lead to outages—known as reliability risks. And when
you’re dealing with the complex, ephemeral nature of Kubernetes, there’s an even higher
possibility that risks will go undetected until they cause incidents.
The traditional approach to reliability starts with using observability to instrument your
systems. Any issue or anomalous spike in your metrics creates an alert, which is then
resolved using your incident response runbook.
This reactive approach can only surface reliability risks after the failure has occurred. This
creates a gap between where you think the reliability of your system is and where you find
out it actually is when there’s an outage.
In order to meet the availability demands of your users, you need to fill in that gap with a
standards-based approach to your system’s resiliency.
By focusing on the resiliency of your Kubernetes deployment, you can surface reliability
risks proactively and address them before they cause outages and downtime.
Kubernetes is deployed in a series of distinct layers that are key to its resiliency and
adaptability. The layers of pods, nodes, and clusters provide distinct separation that’s
essential for being able to scale up or down and restart as necessary while maintaining availability.
[Figure: Reliability risks at each Kubernetes layer — infrastructure risks (power outages, hardware failures, cloud provider outages), cluster risks (autoscaling problems), node risks (kernel panics), pod deployment risks (misconfigured deployments), and pod risks (application crashes, unhandled exceptions).]
Unfortunately, these same layers also increase the potential points of failure. Their
interconnected nature can take a small error, such as a container consuming more
CPU or memory than expected, then compound it across other nodes and cause a
wide-scale outage.
The first step in Kubernetes resiliency management is to look at the potential reliability risks inherent in every Kubernetes deployment. As you build resiliency standards, you'll want to account for the risks at each layer of the deployment.
Infrastructure risks
Every software application is dependent on the infrastructure below it for stability, and Kubernetes is no exception. Infrastructure-level risks include:
Power outages
Hardware failures
Cloud provider outages
Architecting to minimize these risks includes multi-Availability Zone redundancy, multi-region deployments, and other best practices. You'll need to be able to test whether these safeguards actually protect you when the infrastructure fails.
Cluster risks
Clusters lay out the core policies and configurations for all of the nodes, pods, and containers deployed within them. When reliability risks occur at the cluster level, these problems can affect everything running on the cluster. Common cluster-level risks include:
Autoscaling problems
Detecting and testing for these risks will include looking at your cluster configuration, how it
responds to increases in resource demands, and its response to changing network activity.
Node risks
Any reliability risks in the nodes will immediately inhibit the ability to run pods. While cloud-hosted Kubernetes providers can automatically restart problem nodes, if there are core issues with the control plane, the problem will just be replicated again. These issues could include kernel panics or nodes that lose contact with the control plane.
Node-based reliability risks are usually focused around the ability of the node to
communicate with the rest of the cluster, get the resources it needs, and correctly
manage pods.
Pod deployment risks
Even when the infrastructure, cluster, and nodes are healthy, pods can still fail to deploy correctly. Common pod deployment risks include:
Misconfigured deployments
Too few (or too many) replicas
Missing or failing container images
Deployment failures due to limited cluster capacity
These risks affect the ability to run pods and the containers within them, and are often tied to limited resources or finding/communicating with container images.
Pod risks
Even if everything goes well with pod deployment, there can still be issues that affect the applications running inside the pods, including:
Application crashes
Unhandled exceptions
Many of these "last-mile" risks can be uncovered by monitoring and testing for specific failure conditions in the application itself.
Unfortunately, there’s never enough time or money to fix every single reliability risk. In the
well-known balance between expense, quality, and speed, the demands of business make it
impossible. Instead, you need to find that balance where you’re addressing the critical
reliability risks that could have the greatest impact and deprioritizing the risks that would
have a minor impact. When you consider the number of moving pieces and potential
reliability risks present in Kubernetes, this kind of prioritization and identification becomes
even
more important.
The exact list will change from organization to organization and service to service, but you can start by checking that your deployment maintains core capabilities such as:
Consistency/Integrity - Are pods using the same image and running smoothly?
Any issue that interferes with these capabilities poses a critical risk to your Kubernetes deployment.
A standardized approach
Kubernetes teams need the autonomy and speed to ship changes with the tools they have in hand. At the same time, the interconnected nature of Kubernetes means that building reliability requires clear standards and governance to ensure uniform resiliency across all deployments.
Traditionally, these two practices have been at odds. Governance and heavy testing gates tend to slow down deployments, while a high-speed DevOps approach stresses a fast rate of change.
If you're going to bring the two together, you need a different approach, one based on standards, but with automation and technological capabilities that allow every team to move quickly. The approach needs to be able to surface known risks from all layers of a Kubernetes deployment, and it needs to include standardized metrics and processes that can be used across the organization so all Kubernetes deployments are held to the same reliability standards.
Framework for Kubernetes resiliency
Any approach for Kubernetes resiliency would have to combine the known possible
reliability risks above with the criteria for critical risk identification and organization-
wide governance.
When these are paired with the technology of Fault Injection testing, it creates a resiliency management framework: one that combines validation testing, exploratory testing, and continuous risk monitoring with the reporting and processes needed to keep the entire organization aligned.
Resiliency standards
Some reliability risks are common to all Kubernetes deployments. For example, every
Kubernetes deployment should be tested against how it’s going to respond during a surge
in demand for resources, a drop in network communications, or a loss of connection
to dependencies.
These are recorded under Organizational Standards, which inform the standard set of
reliability risks that every team should test against. While you start with common reliability
risks, this list should expand to include risks unique to your company that are common
across your organization. For example, if every service connects to a specific database,
then you should standardize around testing what happens if there’s latency in
that connection.
Reliability is often measured by either the binary “Currently up/currently down” status or
the backwards-facing “uptime vs. downtime” metric. But neither of these measurements
will help you see the posture of potential reliability risks before they become outages—and
whether you’ve addressed them or gained more risks over time.
This is why it’s essential to have metrics, reporting, and dashboards that show the results
of your resiliency tests and risk monitoring. These dashboards give the various teams
core data to align around and be accountable for results. By showing how each service
performs on tests built against the defined resiliency standards, you get an accurate view
of your reliability posture that can inform important prioritization conversations.
Some Kubernetes risks, such as missing memory limits, can be quick and easy to fix, but can also cause massive outages if unaddressed. The complexity of Kubernetes can make it easy to miss these issues. Fortunately, many of these known reliability risks are common across all Kubernetes deployments, which means you can operationalize their detection.
Many of these critical risks can be located by scanning configuration files and container
statuses. These scans should run continuously on Kubernetes deployments so these risks
can be surfaced and addressed quickly.
Validation testing using standardized test suites
These test suites cover the known failure conditions every deployment should be resilient to, such as a spike in CPU demand or a drop in network connectivity to key dependencies. They are the scenarios you're testing against to validate that your system meets the specific requirements.
Validation testing differs from exploratory testing: validation confirms resilience to known failure conditions, while exploratory testing probes for unknown ones.
Metrics and reporting
Most organizations lack a consistent, agreed-upon method for identifying reliability risks
that can be shared and understood across their teams. It’s not that the information isn’t
out there—almost every engineer knows the common ways their service will fail—it’s
that there’s no centralized way for all reliability risks and potential failure points to be
cataloged, tested for, and compared between services.
Tracking resilience tests gives you that central alignment. When you track the results over
time, individual teams can show exactly what risks are and aren’t present in their services,
taking that knowledge out of engineers' heads and putting it into a place where the
entire organization can benefit from it.
By tracking reliability risks over time, engineers and operators can show the effectiveness
of their efforts by pointing to the test that previously failed but now passes, proving that
the risk is no longer present.
Finally, tracking reliability risks creates a metric that can be used to track reliability over
time across the organization. This is where standards and testing come together to
produce actionable organizational alignment.
Over time, this gives the entire organization a common set of reliability metrics to align around and an accurate picture of the reliability posture of their entire Kubernetes system.
The current status of every resiliency test falls into one of three results:
1 Passed
The deployment performed as expected and no reliability risk exists.
2 Failed
The deployment did not perform as expected, and a reliability risk is known to exist.
3 Not run
The test hasn’t been run recently enough to be certain of the result. A known
reliability risk may or may not exist—which is, in itself, a reliability risk.
By looking at it this way, test results can be pooled into a binary state where a point is
scored for any passed tests (no reliability risk present) and a zero is scored for a failed or
not-run test (known or possible reliability risk present).
How regular testing creates a metric of scores
When you run a series of tests to build a reliability score, this creates a numeric data point
that shows the reliability posture of your Kubernetes deployment at a specific time.
By regularly running resiliency test suites, you create a metric of your reliability posture over time. Like any metric, this can be plotted to show trends, and each data point can be examined to see which risks appeared or were resolved.
Every organization will have different requirements, and your standards owner should set your specific testing cadence, but a good goal is to work towards weekly testing of production systems. A weekly cadence gives you an accurate view that will always be, at most, a week old.
Create dashboards for alignment and reporting
By combining reliability scores with regular testing, you create reliability metrics. So the
next step is to create a system for reporting those metrics with dashboards.
The goal with these dashboards isn’t to assign blame or point out failures. Instead, they
should be used to plan engineering work and applaud successes. For example, if a team
shipped a new feature and their reliability score decreased, this might be expected with
the large amount of new code added to the system. The decrease in score then shows the
team that time should be spent ensuring the reliability of the new feature before moving on to
the next one. At the same time, if they come back two weeks later and the score has
increased, then they should be celebrated for how much they improved the new
feature’s reliability.
As with reliability scores, these detected risks can be broken down into a binary metric: either the risk is present or it isn't. And just like with reliability scores, tracking the detection of these risks over time creates a reliability metric. Like the reliability metrics gained from testing, these scanned reliability metrics should be tracked and reported on regularly.
By tracking reliability metrics, you enable your organization to operate on a whole new level of reliability, one where they have the processes and tooling in place to find reliability risks, prioritize which risks need the most attention, and then report the results back to the greater business.
Risk monitoring
and mitigation
While resiliency testing is necessary for uncovering some reliability risks, the nature of
Kubernetes makes it possible to scan for key misconfigurations, bad default values, or anti-
patterns that create reliability risks within the cluster. You can deploy a tool across your
cluster to detect Kubernetes resources and analyze configurations across the deployment.
This makes it possible to automatically detect key reliability risks and surface them before
they start causing behavior that could lead to an outage.
Kubernetes risk monitoring uses the cluster, node, and pod data to uncover critical
reliability risks automatically without testing or waiting for an observability alert. Many of
these are caused by configuration issues or require small changes to images that can be
relatively quick to address. By setting up a system to monitor for these key risks, you can
proactively surface them without the delay of other methods.
The nature of Kubernetes and the complexity of deployments has the potential to create
a large number of risks, but there’s a core group of ten that should be included in any risk
monitoring practice. These are the most common critical risks that could cause major
outages if left unaddressed. When building out your Kubernetes reliability tooling and
standards, start by making sure these ten reliability risks are being detected and covered.
From there, you can add other reliability risks to your monitoring list.
Resource risks
Running out of resources directly impacts system stability. If your nodes don’t have
enough CPU or RAM available, they may start slowing down, locking up, or terminating
resource-intensive pods to make room.
Setting requests is the first step towards preventing this, because they specify the
minimum resources needed to run a pod. Limits are somewhat the opposite and set an
upper cap on how much RAM a pod can use, preventing a memory leak from taking all of
a node’s resources.
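As a concrete illustration, here's a minimal pod spec showing how requests and limits are declared; the name, image, and values are hypothetical and should be tuned to your workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                  # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.2.3   # hypothetical image
      resources:
        requests:
          cpu: "250m"                # minimum CPU reserved; used for scheduling decisions
          memory: "512Mi"            # minimum memory reserved; used for scheduling decisions
        limits:
          cpu: "500m"                # upper cap; CPU above this is throttled
          memory: "1Gi"              # upper cap; exceeding this gets the container OOM-killed
```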
Missing CPU requests
A common risk is deploying pods without setting a CPU request. While it may seem like a low-impact, low-severity issue, not using CPU requests can have a big impact, including preventing your pod from running.
Further reading: Find out how to detect missing CPU requests and how to resolve the reliability risk: How to ensure your Kubernetes Pods have enough CPU
CPU requests tell Kubernetes the minimum amount of the resource to allocate to a pod. This helps Kubernetes determine which node to schedule the pod on and how to schedule it relative to other pods.
Without this, Kubernetes might schedule a pod onto a node that doesn't have enough
capacity for it. Even if the pod uses a small amount of CPU at first, that amount could
increase over time, leading to CPU exhaustion.
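One way to guard against pods shipping without CPU requests is a namespace-level LimitRange, which applies a default request to any container that doesn't declare one. A minimal sketch, with a hypothetical namespace and values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-cpu-request    # hypothetical name
  namespace: production        # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "250m"            # applied to any container that omits a CPU request
      default:
        cpu: "500m"            # default CPU limit for containers that omit one
```

This doesn't replace per-service tuning, but it keeps a forgotten request from becoming an invisible scheduling risk.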
A memory request specifies how much RAM should be reserved for a pod's container.
When you deploy a pod that needs a minimum amount of memory, such as 512 MB or
1 GB, you can define that in your pod's manifest. Kubernetes then uses that information
to determine where to deploy the pod so it has at least the amount of memory requested.
When deploying a pod without a memory request, Kubernetes has to make a best-guess
decision about where to deploy the pod.
If the pod ends up consuming more memory than its node has available, a kernel process called the OOMKiller jumps in and terminates the process before the entire system becomes unstable.
Setting a limit and a request creates a range of memory that the pod could consume,
making it easier for both you and Kubernetes to determine how much memory the pod
will use on deployment.
Unfortunately, containers often crash, terminate, or restart with little warning. Even
before that point, they can have less visible problems like memory leaks, network latency,
and disconnections. Liveness probes allow you to detect these problems, then terminate and restart the affected container.
On the node level, you should set up Kubernetes in multiple availability zones (AZs) for
high availability. When these risks are remediated, your system will be able to detect pod failures and fail over to healthy nodes if there's an AZ failure.
These two reliability risks directly affect your deployment’s ability to have the redundancy
necessary to be resilient to pod, node, cluster, or AZ failure.
The power of liveness probes is in their ability to detect container failures and
automatically restart failed containers. This recovery mechanism is built into Kubernetes
itself without the need for a third-party tool. Service owners can define liveness probes as
part of their deployment manifests, and their containers will always be deployed with
liveness probes.
In theory, the only time a service owner should have to manually check their containers
is if the liveness probe fails to restart a container (like the dreaded CrashLoopBackOff
state). But in order to restart the container, a liveness probe has to be defined in the
container’s manifest.
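For example, a liveness probe is defined directly on the container spec. This sketch assumes the application exposes a health endpoint at /healthz on port 8080, which is an assumption about your app, not a Kubernetes default:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                    # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.2.3     # hypothetical image
      livenessProbe:
        httpGet:
          path: /healthz               # hypothetical health endpoint exposed by the app
          port: 8080
        initialDelaySeconds: 10        # give the app time to start before probing
        periodSeconds: 15              # probe every 15 seconds
        failureThreshold: 3            # restart the container after 3 consecutive failures
```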
By default, many Kubernetes cloud providers provision new clusters within a single
Availability Zone (AZ). Because these AZs are isolated, one AZ can experience an incident
or outage without affecting other AZs, creating redundancy—but only if your application is
set up in multiple AZs.
If a cluster is set up in a single AZ and that AZ fails, the entire cluster will also fail along
with any applications and services running on it. This is why the AWS Well-Architected
Framework recommends having at least two redundant AZs for High Availability.
Kubernetes natively supports deploying across multiple AZs, both in its control plane (the
systems responsible for running the cluster) and worker nodes (the systems responsible
for running your application pods).
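On the worker-node side, one common way to take advantage of multiple AZs is to spread a deployment's pods across zones with topology spread constraints. A minimal sketch with hypothetical names and replica counts:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                              # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                               # keep zone counts within 1 of each other
          topologyKey: topology.kubernetes.io/zone # spread across availability zones
          whenUnsatisfiable: DoNotSchedule         # hold pods rather than pile into one zone
          labelSelector:
            matchLabels:
              app: example-app
      containers:
        - name: app
          image: example.com/app:1.2.3             # hypothetical image
```

If keeping replicas running matters more than strict spreading, ScheduleAnyway is the softer alternative to DoNotSchedule.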
There are also times when a pod simply can’t be scheduled to run. Commonly, this
happens because the cluster doesn’t have the resources, or your pod requires a persistent
volume that isn’t available.
Containers in the following states should be able to be restarted when a failure occurs, but are unable to, creating a risk to the resiliency of your deployment.
Pods in CrashLoopBackOff
CrashLoopBackOff is the state that a pod enters after repeatedly terminating due to an error. Normally, if a container crashes, Kubernetes waits for a short delay and restarts the pod.
Further reading: Get tips for CrashLoopBackOff troubleshooting, detecting it, and verifying your fixes: How to fix and prevent CrashLoopBackOff events in Kubernetes
The time between when a pod crashes and when it restarts is called the delay. On each restart, Kubernetes exponentially increases the length of the delay, starting at 10 seconds, then 20 seconds, then 40 seconds, continuing in that pattern up to a maximum of 5 minutes. If the pod keeps failing, Kubernetes continues restarting it at that capped delay and reports its status as CrashLoopBackOff.
A CrashLoopBackOff can be caused by many issues, including:
Application errors that cause the process to crash
Problems connecting to third-party services or dependencies
Trying to allocate unavailable resources to the container, like ports that are already in
use or more memory than what's available
A failed liveness probe.
There are many more reasons why a CrashLoopBackOff can happen, and this is why it's
one of the most common issues that even experienced Kubernetes developers run into.
Images in ImagePullBackOff
Before Kubernetes can create a container, it first needs an image to use as the basis for the
container. An image is a static, compressed folder containing all of the files and executable
code needed to run the software embedded within the image.
If Kubernetes can't pull the image for any reason (such as an invalid image name, poor
network connection, or trying to download from a private repository), it will retry after a
set amount of time. Like a CrashLoopBackOff, it will exponentially increase the amount of
time it waits before retrying, up to a maximum of 5 minutes. If it still can't pull the image
after 5 minutes, it will stop trying and set the container's status to ImagePullBackOff.
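If the image lives in a private registry, the pod also needs pull credentials. A minimal sketch, assuming the registry credentials have already been stored in a Secret (the Secret name and image reference are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                                 # hypothetical name
spec:
  imagePullSecrets:
    - name: regcred                                 # hypothetical Secret holding registry credentials
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3    # fully qualified image reference (hypothetical)
      imagePullPolicy: IfNotPresent                 # only pull when the image isn't already cached on the node
```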
Unschedulable pods
A pod is unschedulable when it's been put into Kubernetes' scheduling queue, but can't be
deployed to a node. This can be for a number of reasons, including
The cluster not having enough CPU or RAM available to meet the pod's requirements
Pod affinity or anti-affinity rules preventing it from being deployed to available nodes
Nodes being cordoned due to updates or restarts
The pod requiring a persistent volume that's unavailable, or bound to an unavailable node
Although the reasons vary, an unschedulable pod is almost always a symptom of a larger
problem. The pod itself may be fine, but the cluster isn't operating the way it should,
which makes resolving the issue even more critical.
Another common application risk is introduced by using init containers. These are handy
for preparing an environment for the main container, but introduce a potential point of
failure where the init container can’t run and causes the main container to fail.
Both of these risks, relying on mutable image tags and depending on init containers, occur at the application level, which means infrastructure or cluster-level detection could miss them.
Container images can be identified in two ways:
Tags, which are assigned by the image's creator to identify a version of a container. Multiple container versions can have the same tag, meaning a single tag could refer to multiple different container versions over time.
Digests, which are the result of running the image through a hashing function (usually SHA256). Each digest identifies one single version of a container. Changing the container in any way also changes the digest.
Tags are easier to read than digests, but they come with a catch: a single tag could refer to
multiple image versions. The most infamous example is latest, which always points to
the most recently released version of a container image. If you deploy a pod using the
latest tag today, then deploy another pod tomorrow, you could end up with two
completely different versions of the same pod running side-by-side.
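To make deployments repeatable, you can pin the image by digest instead of a mutable tag. In this sketch the image name is hypothetical and the digest is a placeholder, not a real hash:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                        # hypothetical name
spec:
  containers:
    - name: app
      # A tag like example.com/app:latest can point to different image versions over time.
      # Pinning by digest always resolves to exactly one version (placeholder digest shown):
      image: example.com/app@sha256:6b0f0a8d9e4f3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0a9b8c
```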
An init container is a container that runs before the main container in a pod. They're often
used to prepare the environment so the main container has everything it needs to run.
For example, imagine you want to deploy a large language model (LLM) in a pod. LLMs
require datasets that can be several GB. You can create an init container that downloads
these datasets to the node so that when the LLM container starts, it immediately has
access to the data it needs.
Init containers are incredibly useful for setting up a pod before handing it off to the main
container, but they introduce an additional point of failure.
Init containers run during the pod's initialization process and must finish running before
the main container starts. To add to this, if you have multiple init containers defined,
they'll all run sequentially until they've either completed successfully or failed.
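Continuing the LLM example above, here's a sketch of what that might look like; the images, download URL, and volume layout are all hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server                              # hypothetical name
spec:
  volumes:
    - name: model-data
      emptyDir: {}                              # shared scratch space for the dataset
  initContainers:
    - name: download-dataset
      image: example.com/dataset-fetcher:1.0    # hypothetical helper image
      command: ["sh", "-c", "wget -O /data/model.bin https://example.com/model.bin"]  # hypothetical URL
      volumeMounts:
        - name: model-data
          mountPath: /data
  containers:
    - name: llm
      image: example.com/llm-server:1.0         # hypothetical image
      volumeMounts:
        - name: model-data
          mountPath: /data                      # the main container reads the downloaded dataset
```

If the download fails, the init container fails and the main container never starts, which is exactly the extra point of failure described above.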
When to test in your SDLC and which exact tests to run will vary depending on your
individual organization’s standards and the maturity of your resilience practice. But there
is a core set of resiliency tests that should be run for every Kubernetes deployment, as
well as best practices to help determine when in your SDLC your teams should run
resiliency tests.
Exploratory testing
Exploratory testing is used to better understand your systems and suss out the unknowns in how they respond to external pressures. Many of the experiments performed under the
practice of Chaos Engineering make use of exploratory tests to find unknown points of
potential failure.
To minimize the impact on your systems, exploratory tests should always be done in a
controlled manner. While a trustworthy Fault Injection tool will contain safeguards like
automatic rollback in case of problems, the injection of faults can potentially cause
disruption when doing exploratory tests. Be sure to follow Chaos Engineering best
practices like limiting the blast radius and carefully defining the boundaries of the
experiment. Ideally, these tests should start with individual services, then expand broader
into the organization as you become more confident in the results and impact of the test.
For example, a common type of exploratory test is making sure your Kubernetes deployments scale properly in response to high demand. You can set up a Horizontal Pod Autoscaling (HPA) rule on your deployment to increase the number of pods when CPU utilization crosses a threshold, then inject CPU load to verify that new pods are actually created.
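A sketch of what such an HPA rule might look like (the target deployment name and thresholds are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app              # hypothetical deployment to scale
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU utilization exceeds 70% of requests
```

An exploratory test that consumes CPU on the target pods should push utilization past that threshold and produce new replicas; if it doesn't, you've found a reliability risk.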
Validation testing
Once you have a standardized set of known failures and reliability risks, you can test
your resilience to them with validation testing. Using Fault Injection, validation tests
inject specific failure conditions into your systems to verify resilience to failures. Unlike
exploratory testing, which is done manually, validation testing works best when it can be
automated on a schedule. Ideally, these tests should run weekly, but many organizations
will start with monthly testing, then gradually increase the frequency as they become
more comfortable with the testing process.
Resource tests
Any Kubernetes deployment needs to be resilient to sudden spikes in traffic, demand,
or resource needs. These two tests will verify that your services are resilient to sudden
resource spikes. Depending on your architecture, you may also want to add a Disk I/O
scalability test to this mix.
CPU Scalability: Test that your service scales as expected when CPU capacity is
limited. This should be done in three stages of 50%, 75%, and 90% CPU consumption.
Estimated test length: 15 minutes
Memory Scalability: Test that your service scales as expected when memory is limited.
Memory consumption should be done in three stages: 50%, 75%, and 90% capacity.
Estimated test length: 15 minutes
Redundancy tests
Make sure that your deployments are resilient to infrastructure failures. These tests
shut down a host or access to an availability zone to verify that your deployment has the
redundancy in place to stay up when a host or zone goes down. If your standards call for
multi-region redundancy, then you should add tests that make regions unavailable.
Zone Redundancy: Test your service's availability when a randomly selected zone is
unreachable from the other zones.
When customizing suites, you should do so based on data from sources like:
Incidents - When there’s an outage, it’s a good practice to set up tests to detect
and prevent the same incident from happening in the future. For example, if you
experienced a DNS-based outage, then you may want to set up weekly tests to make
sure you can failover to a fallback DNS service
Observability alerts - There’s plenty of application behavior that doesn’t directly
create an outage, but is still a definite warning sign. Perhaps a service owner has
noticed that compute resource spikes that take up 85% of compute capacity don’t
take the system down, but still create a situation where a spike in traffic would cause
an outage. In this case, you’d want to add tests that simulate compute resource usage
at 85% capacity to ensure resilience to this potential failure
Exploratory testing - As covered above, it’s important to work directly with service
operators to adjust testing parameters and fit the needs of specific services. Using
exploratory testing, operators can determine exactly what the failures are so you can
design tests against them. Critical services, for example, should have higher resilience
standards than internal services, and the test suite should be customized to fit
these standards.
Industry models - There are many architecture models, such as the AWS Well-
Architected Framework, that have specific best practices to improve reliability. If
you’re using these architecture standards, then you can adjust your testing suites to
verify compliance with those standards
Industry compliance requirements - Highly regulated industries, such as finance, can
often have resilience and reliability standards unique to their industry. Often these can
be much more strict than common best practices, and you should adjust test suites
accordingly to fit these compliance requirements.
Gating release candidates on running tests
Pros: Catches some resilience risks before production. Fits into existing QA/performance testing cycles.
Cons: Expensive and difficult to run production-like test environments. Can miss infrastructure and network-level risks. Slows down the QA process. Can lead to false confidence without also testing in production.
Testing in staging
Testing in a staging environment prevents any potential downtime caused by testing from impacting customers. However, perfectly duplicating production workloads, resources, and traffic in a staging environment is cost-prohibitive and time intensive. Additionally, there are changes outside your control, such as network topology, that can't be accounted for in staging environments.
Ultimately, while testing in staging can catch key reliability risks, it can’t give you an
accurate view of the reliability of your system in production.
Due to the nature of Fault Injection testing, a full battery of tests could take several hours.
If you’re releasing on a weekly or monthly schedule, holding up a deployment to run these
tests could be worth it for the reliability risks you uncover. However, if you’re set up for
multiple releases a day, then the time spent on the tests prevents them from being used as
a gating mechanism. In this case, you should consider testing in production, either post-
deployment or on a regular schedule.
Remember, the goal of resiliency testing is to uncover reliability risks in production. While
some of these can be uncovered before deployment, you should fit testing into your SDLC
where it makes the most sense and can be the most effective at uncovering reliability risks
in production before they impact customers.
Kubernetes systems are constantly changing with new deployments, resource changes,
network topography shifts, and more. A service that had very few reliability risks two
weeks ago could suddenly have a much more vulnerable reliability posture due to new changes. This is why reliability risks need to be continuously resurfaced through regular, automated validation testing. Ideally, aim to have weekly scheduled tests in production, or fold testing into your release automation, and use the results to inform your prioritization and resourcing.
Roles and responsibilities
Any Kubernetes resilience effort requires contributions from three key roles if it's going to be successful. Everyone working on resiliency falls into one of these three roles, each of which is essential to resiliency management.
1 Leadership roles set priorities, dedicate resources, and drive accountability across the organization.
2 Standards roles set standards, manage tooling, and oversee the execution of the framework.
3 Operations roles run the resilience tests, report the results, and remediate reliability risks in the services they own.
[Figure: Resiliency roles. Standards: define common resilience patterns to test broadly; manage tooling for testing, reporting & metrics. Operations: perform regular resilience tests; remediate reliability risks. Leadership: prioritize reliability; dedicate resources; drive accountability. The roles overlap through a shared reliability mandate and aligned goals.]
The roles aren’t tied to specific titles, and it’s common for one person or team to take on
two of the roles: for example, performance engineering teams or centralized SRE teams
often take on both setting the standards and performing tests and mitigations—at least
initially. But without someone stepping in to take on the requirements of each role, teams
often struggle to make progress.
Leadership role
The leadership role is the one responsible for setting the priorities of engineering teams
and allocating resources. In some companies this is held by someone in the C-suite, while
in others it’s held by Vice Presidents or Directors. The defining factor is that anyone in this
role has the authority to make organization-wide priorities and direct resources
towards them.
Operator role
This role could have a wide variety of titles, but the defining characteristics are that they’re
responsible for the resiliency of specific services. They make sure the tests are run, report
the results, and make sure any prioritized risks are addressed.
Next steps: Your first 30 days of resiliency
Your Kubernetes resiliency management practice will mature, grow, and change over
time, but it doesn’t have to take months to start creating results. In fact, you can start
uncovering reliability risks and having a demonstrable impact on your Kubernetes
reliability with this roadmap for your first 30 days.
You can speed the process along by getting alignment on a specific group of services being
used for the pilot. These are your early-adopter services, and the more you have everyone
involved on board, the better results you’re going to get.
In fact, having a small group of teams who are invested and focused can often be more
effective than trying with a wider, more hesitant group right out of the gate. Once you
start proving results with early adopters, then you’ll get less resistance as you roll the
program out more broadly.
When choosing these services, you should start with ones that are important to your
business to provide the greatest immediate value. Dependencies are a common source of
reliability risks, so a good choice is to start with central services that have fully-connected
dependencies. You could also select services that are fully loaded with production data and
dependencies, but aren’t launched yet, such as services during a migration or about to be
launched. The last common choice is services that are already having reliability issues, thus
allowing you to prove your effectiveness and address an area of concern at the same time.
If you’re building your own tool, this can be pretty complicated, but a reliability platform
like Gremlin streamlines the process of installing agents and setting up permissions.
Once the agent is set up, you’ll want to define your risk monitoring parameters and core
validation test suites. A good place to start is with the critical risks from Chapter 4 and
the test suites from Chapter 5. (Gremlin has these set up as default test suites for
any service.)
Generally, these core risks and test suites are a good place to start, then you can adjust them as you become more comfortable with testing. But you can also alter these test suites to fit unique standards for your organization or to include tests that validate resilience to failures specific to your environment.
Almost every Kubernetes system has at least one of the critical risks above. Since risk
monitoring uses continuous detection and scanning rather than active Fault Injection
testing, it’s a faster, easier way to uncover active critical reliability risks.
Follow these steps to quickly find risks, fix them, and prove the results:
1 Run a risk monitoring scan across your early-adopter services.
2 The scan will return a list of reliability risks, along with a mitigation status.
3 Work with the team behind the service to address unmitigated risks. Most of these,
such as missing memory limits, can be a relatively light lift to fix.
4 Deploy the fix and go back to the monitoring dashboard. Any risk you addressed should
be shown as mitigated.
Since risk monitoring is automated and non-invasive, this is also an easier way to spur
adoption of resiliency management with other teams. Show those teams the results you
were able to create, then help them to set up their own risk monitoring.
Now that you’ve addressed some of the more pressing reliability risks, it’s time to start
running Fault Injection tests. Run the validation test suites you set up to get a baseline
report for the reliability posture of your early-adopter services.
These first results will usually return a lot of existing reliability risks, which can be a good
thing. It means your resiliency testing is effectively uncovering reliability risks before they
cause outages.
You should have a list of critical Kubernetes risks that you’ve addressed, and by looking
at the before and after results from risk monitoring and validation tests, you’ll be able
to show the effectiveness of your resiliency efforts—and show exactly how you’ve
improved the reliability of your Kubernetes deployment.
Gremlin offers a free trial that includes all of the capabilities you need to take the actions above.
Over the course of four weeks, you’ll be able to stretch your resiliency wings, prove the
effectiveness of your efforts, and have a lasting impact on the reliability of your Kubernetes
deployment.
Gremlin is the Enterprise Reliability Platform that helps teams
proactively test their systems, build and enforce reliability and
resiliency standards, and automate their reliability practices
organization-wide.