Workloads
Understand Pods, the smallest deployable compute object in Kubernetes, and
the higher-level abstractions that help you to run them.
1: Pods
1.1: Pod Lifecycle
1.2: Init Containers
1.3: Disruptions
1.4: Ephemeral Containers
1.5: Pod Quality of Service Classes
1.6: User Namespaces
1.7: Downward API
2: Workload Resources
2.1: Deployments
2.2: ReplicaSet
2.3: StatefulSets
2.4: DaemonSet
2.5: Jobs
2.6: Automatic Cleanup for Finished Jobs
2.7: CronJob
2.8: ReplicationController
Kubernetes pods have a defined lifecycle. For example, once a pod is running in your cluster
then a critical fault on the node where that pod is running means that all the pods on that
node fail. Kubernetes treats that level of failure as final: you would need to create a new Pod
to recover, even if the node later becomes healthy.
However, to make life considerably easier, you don't need to manage each Pod directly.
Instead, you can use workload resources that manage a set of pods on your behalf. These
resources configure controllers that make sure the right number of the right kind of pod are
running, to match the state you specified.
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide
additional behaviors. Using a custom resource definition, you can add in a third-party
workload resource if you want a specific behavior that's not part of Kubernetes' core. For
example, if you wanted to run a group of Pods for your application but stop work unless all
the Pods are available (perhaps for some high-throughput distributed task), then you can
implement or install an extension that does provide that feature.
What's next
As well as reading about each resource, you can learn about specific tasks that relate to them:
To learn about Kubernetes' mechanisms for separating code from configuration, visit
Configuration.
There are two supporting concepts that provide background on how Kubernetes manages Pods for applications:
Garbage collection tidies up objects from your cluster after their owning resource has
been removed.
The time-to-live after finished controller removes Jobs once a defined time has passed
since they completed.
Once your application is running, you might want to make it available on the internet as a Service or, for web applications only, using an Ingress.
1 - Pods
Pods are the smallest deployable units of computing that you can create and manage in
Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared
storage and network resources, and a specification for how to run the containers. A Pod's
contents are always co-located and co-scheduled, and run in a shared context. A Pod models
an application-specific "logical host": it contains one or more application containers which are
relatively tightly coupled. In non-cloud contexts, applications executed on the same physical
or virtual machine are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain init containers that run during Pod
startup. You can also inject ephemeral containers for debugging if your cluster offers this.
What is a Pod?
Note: While Kubernetes supports more container runtimes than just Docker, Docker is
the most commonly known runtime, and it helps to describe Pods using some
terminology from Docker.
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other
facets of isolation - the same things that isolate a container. Within a Pod's context, the
individual applications may have further sub-isolations applied.
A Pod is similar to a set of containers with shared namespaces and shared filesystem
volumes.
Using Pods
The following is an example of a Pod which consists of a container running the image
nginx:1.14.2 .
pods/simple-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Pods are generally not created directly; instead, they are created using workload resources. See Working with Pods for more information on how Pods are used with workload resources.
Pods in a Kubernetes cluster are used in two main ways:
Pods that run a single container. The "one-container-per-Pod" model is the most
common Kubernetes use case; in this case, you can think of a Pod as a wrapper around
a single container; Kubernetes manages Pods rather than managing the containers
directly.
Pods that run multiple containers that need to work together. A Pod can
encapsulate an application composed of multiple co-located containers that are tightly
coupled and need to share resources. These co-located containers form a single
cohesive unit of service—for example, one container serving data stored in a shared
volume to the public, while a separate sidecar container refreshes or updates those files.
The Pod wraps these containers, storage resources, and an ephemeral network identity
together as a single unit.
Each Pod is meant to run a single instance of a given application. If you want to scale your
application horizontally (to provide more overall resources by running more instances), you
should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as
replication. Replicated Pods are usually created and managed as a group by a workload
resource and its controller.
See Pods and controllers for more information on how Kubernetes uses workload resources,
and their controllers, to implement application scaling and auto-healing.
For example, you might have a container that acts as a web server for files in a shared volume, and a separate "sidecar" container that updates those files from a remote source.
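As a hedged sketch of that web-server-plus-sidecar pattern (the image names, mount paths, and refresh command below are illustrative assumptions, not part of the original example):

apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar            # illustrative name
spec:
  volumes:
  - name: shared-data
    emptyDir: {}                    # volume shared by both containers
  containers:
  - name: web-server
    image: nginx:1.14.2
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html   # serves whatever the sidecar writes
  - name: content-refresher         # "sidecar" that keeps the served files up to date
    image: busybox:1.28
    # a real sidecar might fetch content from a remote source; this just rewrites a file
    command: ['sh', '-c', 'while true; do date > /data/index.html; sleep 30; done']
    volumeMounts:
    - name: shared-data
      mountPath: /data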
Some Pods have init containers as well as app containers. Init containers run and complete
before the app containers are started.
Pods natively provide two kinds of shared resources for their constituent containers:
networking and storage.
Note: Restarting a container in a Pod should not be confused with restarting a Pod. A Pod
is not a process, but an environment for running container(s). A Pod persists until it is
deleted.
The name of a Pod must be a valid DNS subdomain value, but this can produce unexpected
results for the Pod hostname. For best compatibility, the name should follow the more
restrictive rules for a DNS label.
Pod OS
FEATURE STATE: Kubernetes v1.25 [stable]
You should set the .spec.os.name field to either windows or linux to indicate the OS on which you want the pod to run. These are the only two operating systems supported for now by Kubernetes. In the future, this list may be expanded.
In Kubernetes v1.27, the value you set for this field has no effect on scheduling of the pods. Setting .spec.os.name helps to identify the pod OS authoritatively and is used for validation. The kubelet refuses to run a Pod where you have specified a Pod OS that isn't the same as the operating system for the node where that kubelet is running. The Pod security standards also use this field to avoid enforcing policies that aren't relevant to that operating system.
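As a minimal sketch, a Pod that declares its OS might look like this (the Pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: linux-pod-example        # illustrative name
spec:
  os:
    name: linux                  # must match the operating system of the node that runs the Pod
  containers:
  - name: app
    image: nginx:1.14.2          # image reused from the earlier example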
Here are some examples of workload resources that manage one or more Pods:
Deployment
StatefulSet
DaemonSet
Pod templates
Controllers for workload resources create Pods from a pod template and manage those Pods
on your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources
such as Deployments, Jobs, and DaemonSets.
Each controller for a workload resource uses the PodTemplate inside the workload object to
make actual Pods. The PodTemplate is part of the desired state of whatever workload
resource you used to run your app.
The sample below is a manifest for a simple Job with a template that starts one container.
The container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
name: hello
spec:
template:
# This is the pod template
spec:
containers:
- name: hello
image: busybox:1.28
command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
restartPolicy: OnFailure
# The pod template ends here
Modifying the pod template or switching to a new pod template has no direct effect on the
Pods that already exist. If you change the pod template for a workload resource, that resource
needs to create replacement Pods that use the updated template.
For example, the StatefulSet controller ensures that the running Pods match the current pod
template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the
StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old
Pods are replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to the Pod template.
If you want to read more about StatefulSet specifically, read Update strategy in the StatefulSet
Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the details around pod
templates and updates; those details are abstracted away. That abstraction and separation of
concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior
without changing existing code.
Kubernetes doesn't prevent you from managing Pods directly. It is possible to update some
fields of a running Pod, in place. However, Pod update operations like patch and replace have some limitations:
Most of the metadata about a Pod is immutable. For example, you cannot change the
namespace , name , uid , or creationTimestamp fields; the generation field is unique. It
only accepts updates that increment the field's current value.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod can access the
shared volumes, allowing those containers to share data. Volumes also allow persistent data
in a Pod to survive in case one of the containers within needs to be restarted. See Storage for
more information on how Kubernetes implements shared storage and makes it available to
Pods.
Pod networking
Each Pod is assigned a unique IP address for each address family. Every container in a Pod
shares the network namespace, including the IP address and network ports. Inside a Pod (and
only then), the containers that belong to the Pod can communicate with one another using
localhost . When containers in a Pod communicate with entities outside the Pod, they must
coordinate how they use the shared network resources (such as ports). Within a Pod,
containers share an IP address and port space, and can find each other via localhost . The
containers in a Pod can also communicate with each other using standard inter-process
communications like SystemV semaphores or POSIX shared memory. Containers in different
Pods have distinct IP addresses and can not communicate by OS-level IPC without special
configuration. Containers that want to interact with a container running in a different Pod can
use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the configured
name for the Pod. There's more about this in the networking section.
Any container in a pod can run in privileged mode to use operating system administrative
capabilities that would otherwise be inaccessible. This is available for both Windows and
Linux.
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the
API server observing them. Whereas most Pods are managed by the control plane (for
example, a Deployment), for static Pods, the kubelet directly supervises each static Pod (and
restarts it if it fails).
Static Pods are always bound to one Kubelet on a specific node. The main use for static Pods
is to run a self-hosted control plane: in other words, using the kubelet to supervise the
individual control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each
static Pod. This means that the Pods running on a node are visible on the API server, but
cannot be controlled from there.
Note: The spec of a static Pod cannot refer to other API objects (e.g., ServiceAccount,
ConfigMap, Secret, etc).
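As an illustrative sketch, a static Pod is an ordinary Pod manifest saved as a file in the kubelet's static Pod directory; the directory is kubelet configuration, so treat the path below as an assumption rather than a universal default:

# Save as a file in the kubelet's staticPodPath,
# e.g. /etc/kubernetes/manifests/static-web.yaml (path varies by cluster setup)
apiVersion: v1
kind: Pod
metadata:
  name: static-web               # the mirror Pod on the API server gets the node name appended
spec:
  containers:
  - name: web
    image: nginx:1.14.2
    ports:
    - containerPort: 80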
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet either executes code within the container or makes a network request.
You can read more about probes in the Pod Lifecycle documentation.
What's next
Learn about the lifecycle of a Pod.
Learn about RuntimeClass and how you can use it to configure different Pods with
different container runtime configurations.
Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
Pod is a top-level resource in the Kubernetes REST API. The Pod object definition
describes the object in detail.
The Distributed System Toolkit: Patterns for Composite Containers explains common
layouts for Pods with more than one container.
Read about Pod topology spread constraints
To understand the context for why Kubernetes wraps a common Pod API in other resources
(such as StatefulSets or Deployments), you can read about the prior art, including:
Aurora
Borg
Marathon
Omega
Tupperware.
1.1 - Pod Lifecycle
Whilst a Pod is running, the kubelet is able to restart containers to handle some kinds of faults. Within a Pod, Kubernetes tracks different container states and determines what action to take to make the Pod healthy again.
In the Kubernetes API, Pods have both a specification and an actual status. The status for a
Pod object consists of a set of Pod conditions. You can also inject custom readiness
information into the condition data for a Pod, if that is useful to your application.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled (assigned) to a Node,
the Pod runs on that Node until it stops or is terminated.
Pod lifetime
Like individual application containers, Pods are considered to be relatively ephemeral (rather
than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes
where they remain until termination (according to restart policy) or deletion. If a Node dies,
the Pods scheduled to that node are scheduled for deletion after a timeout period.
Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is
deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node
maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the
work of managing the relatively disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead, that Pod
can be replaced by a new, near-identical Pod, with even the same name if desired, but with a
different UID.
When something is said to have the same lifetime as a Pod, such as a volume, that means that
the thing exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted
for any reason, and even if an identical replacement is created, the related thing (a volume, in
this example) is also destroyed and created anew.
Pod diagram
A multi-container Pod that contains a file puller and a web server that uses a persistent volume for
shared storage between the containers.
Pod phase
A Pod's status field is a PodStatus object, which has a phase field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The
phase is not intended to be a comprehensive rollup of observations of container or Pod state,
nor is it intended to be a comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded. Other than what is
documented here, nothing should be assumed about Pods that have a given phase value.
Pending: The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running: The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
Succeeded: All containers in the Pod have terminated in success, and will not be restarted.
Failed: All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.
Unknown: For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.
Since Kubernetes 1.27, the kubelet transitions deleted pods, except for static pods and force-
deleted pods without a finalizer, to a terminal phase ( Failed or Succeeded depending on
the exit statuses of the pod containers) before their deletion from the API server.
If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for
setting the phase of all Pods on the lost node to Failed.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a
Pod. You can use container lifecycle hooks to trigger events to run at certain points in a
container's lifecycle.
Once the scheduler assigns a Pod to a Node, the kubelet starts creating containers for that
Pod using a container runtime. There are three possible container states: Waiting , Running ,
and Terminated .
To check the state of a Pod's containers, you can use kubectl describe pod <name-of-pod> .
The output shows the state for each container within that Pod.
Waiting
If a container is not in either the Running or Terminated state, it is Waiting . A container in
the Waiting state is still running the operations it requires in order to complete start up: for
example, pulling the container image from a container image registry, or applying Secret data.
When you use kubectl to query a Pod with a container that is Waiting , you also see a
Reason field to summarize why the container is in that state.
Running
The Running status indicates that a container is executing without issues. If there was a postStart hook configured, it has already executed and finished. When you use kubectl to query a Pod with a container that is Running , you also see information about when the container entered the Running state.
Terminated
A container in the Terminated state began execution and then either ran to completion or
failed for some reason. When you use kubectl to query a Pod with a container that is
Terminated , you see a reason, an exit code, and the start and finish time for that container's
period of execution.
If a container has a preStop hook configured, this hook runs before the container enters the
Terminated state.
Container restart policy
The spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
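A minimal sketch of how the restart policy appears in a Pod spec (the name, image, and command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: restart-demo             # illustrative name
spec:
  restartPolicy: OnFailure       # applies to all containers in this Pod; default is Always
  containers:
  - name: task
    image: busybox:1.28
    command: ['sh', '-c', 'exit 1']   # exits with failure, so the kubelet restarts it with back-off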
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or
has not passed. Kubelet manages the following PodConditions:
Ready : the Pod is able to serve requests and should be added to the load balancing
pools of all matching Services.
lastTransitionTime: Timestamp for when the Pod last transitioned from one status to another.
Pod readiness
FEATURE STATE: Kubernetes v1.14 [stable]
Your application can inject extra feedback or signals into PodStatus: Pod readiness. To use this,
set readinessGates in the Pod's spec to specify a list of additional conditions that the
kubelet evaluates for Pod readiness.
Readiness gates are determined by the current state of status.condition fields for the Pod.
If Kubernetes cannot find such a condition in the status.conditions field of a Pod, the status
of the condition is defaulted to " False ".
Here is an example:
kind: Pod
...
spec:
readinessGates:
- conditionType: "www.example.com/feature-1"
status:
conditions:
- type: Ready # a built in PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
- type: "www.example.com/feature-1" # an extra PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
containerStatuses:
- containerID: docker://abcd...
ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label key format.
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both the following statements apply:
All containers in the Pod are ready.
All conditions specified in readinessGates are True.
When a Pod's containers are Ready but at least one custom condition is missing or False , the kubelet sets the Pod's condition to ContainersReady .
After a Pod gets scheduled on a node, it needs to be admitted by the Kubelet and have any
volumes mounted. Once these phases are complete, the Kubelet works with a container
runtime (using Container runtime interface (CRI)) to set up a runtime sandbox and configure
networking for the Pod. If the PodHasNetworkCondition feature gate is enabled, Kubelet
reports whether a pod has reached this initialization milestone through the PodHasNetwork
condition in the status.conditions field of a Pod.
The PodHasNetwork condition is set to False by the Kubelet when it detects a Pod does not
have a runtime sandbox with networking configured. This occurs in the following scenarios:
Early in the lifecycle of the Pod, when the kubelet has not yet begun to set up a sandbox
for the Pod using the container runtime.
Later in the lifecycle of the Pod, when the Pod sandbox has been destroyed due to
either:
the node rebooting, without the Pod getting evicted
for container runtimes that use virtual machines for isolation, the Pod sandbox
virtual machine rebooting, which then requires creating a new sandbox and fresh
container network configuration.
The PodHasNetwork condition is set to True by the kubelet after the successful completion of
sandbox creation and network configuration for the Pod by the runtime plugin. The kubelet
can start pulling container images and create containers after PodHasNetwork condition has
been set to True .
For a Pod with init containers, the kubelet sets the Initialized condition to True after the
init containers have successfully completed (which happens after successful sandbox creation
and network configuration by the runtime plugin). For a Pod without init containers, the
kubelet sets the Initialized condition to True before sandbox creation and network
configuration starts.
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a
diagnostic, the kubelet either executes code within the container, or makes a network
request.
Check mechanisms
There are four different ways to check a container using a probe. Each probe must define
exactly one of these four mechanisms:
exec
Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.
grpc
Performs a remote procedure call using gRPC. The target should implement gRPC health
checks. The diagnostic is considered successful if the status of the response is SERVING.
httpGet
Performs an HTTP GET request against the Pod's IP address on a specified port and path.
The diagnostic is considered successful if the response has a status code greater than or
equal to 200 and less than 400.
tcpSocket
Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is
considered successful if the port is open. If the remote system (the container) closes the
connection immediately after it opens, this counts as healthy.
Caution: Unlike the other mechanisms, an exec probe forks a new process every time it runs. On clusters with high pod density, or with low initialDelaySeconds and periodSeconds values, probes that use the exec mechanism can therefore add noticeable CPU overhead on the node. In such scenarios, consider using one of the alternative probe mechanisms to avoid that overhead.
Probe outcome
Each probe has one of three results:
Success: The container passed the diagnostic.
Failure: The container failed the diagnostic.
Unknown: The diagnostic failed (no action should be taken, and the kubelet will make further checks).
Types of probe
The kubelet can optionally perform and react to three kinds of probes on running containers:
livenessProbe
Indicates whether the container is running. If the liveness probe fails, the kubelet kills the
container, and the container is subjected to its restart policy. If a container does not
provide a liveness probe, the default state is Success.
readinessProbe
Indicates whether the container is ready to respond to requests. If the readiness probe
fails, the endpoints controller removes the Pod's IP address from the endpoints of all
Services that match the Pod. The default state of readiness before the initial delay is
Failure. If a container does not provide a readiness probe, the default state is Success.
startupProbe
Indicates whether the application within the container is started. All other probes are
disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the
kubelet kills the container, and the container is subjected to its restart policy. If a container
does not provide a startup probe, the default state is Success.
For more information about how to set up a liveness, readiness, or startup probe, see
Configure Liveness, Readiness and Startup Probes.
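As a hedged sketch of how these probes are declared on a container (the paths, ports, and timing values are illustrative assumptions, not recommendations):

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo               # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.14.2
    ports:
    - containerPort: 80
    startupProbe:                # other probes are disabled until this succeeds
      httpGet:
        path: /                  # illustrative path
        port: 80
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:               # failure makes the kubelet kill and restart the container
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
    readinessProbe:              # failure removes the Pod from matching Service endpoints
      tcpSocket:
        port: 80
      periodSeconds: 5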
If the process in your container is able to crash on its own whenever it encounters an issue or
becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will
automatically perform the correct action in accordance with the Pod's restartPolicy .
If you'd like your container to be killed and restarted if a probe fails, then specify a liveness
probe, and specify a restartPolicy of Always or OnFailure.
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness
probe. In this case, the readiness probe might be the same as the liveness probe, but the
existence of the readiness probe in the spec means that the Pod will start without receiving
any traffic and only start receiving traffic after the probe starts succeeding.
If you want your container to be able to take itself down for maintenance, you can specify a
readiness probe that checks an endpoint specific to readiness that is different from the
liveness probe.
If your app has a strict dependency on back-end services, you can implement both a liveness
and a readiness probe. The liveness probe passes when the app itself is healthy, but the
readiness probe additionally checks that each required back-end service is available. This
helps you avoid directing traffic to Pods that can only respond with error messages.
If your container needs to work on loading large data, configuration files, or migrations during
startup, you can use a startup probe. However, if you want to detect the difference between
an app that has failed and an app that is still processing its startup data, you might prefer a
readiness probe.
Note: If you want to be able to drain requests when the Pod is deleted, you do not
necessarily need a readiness probe; on deletion, the Pod automatically puts itself into an
unready state regardless of whether the readiness probe exists. The Pod remains in the
unready state while it waits for the containers in the Pod to stop.
Startup probes are useful for Pods that have containers that take a long time to come into
service. Rather than set a long liveness interval, you can configure a separate configuration
for probing the container as it starts up, allowing a time longer than the liveness interval
would allow.
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to allow
those processes to gracefully terminate when they are no longer needed (rather than being
abruptly stopped with a KILL signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate,
but also be able to ensure that deletes eventually complete. When you request deletion of a
Pod, the cluster records and tracks the intended grace period before the Pod is allowed to be
forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful
shutdown.
Typically, the container runtime sends a TERM signal to the main process in each container.
Many container runtimes respect the STOPSIGNAL value defined in the container image and
send this instead of TERM. Once the grace period has expired, the KILL signal is sent to any
remaining processes, and the Pod is then deleted from the API Server. If the kubelet or the
container runtime's management service is restarted while waiting for processes to
terminate, the cluster retries from the start including the full original grace period.
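A minimal sketch of the fields involved in graceful termination (the 45-second value and the preStop command are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo   # illustrative name
spec:
  terminationGracePeriodSeconds: 45   # default is 30 seconds if unset
  containers:
  - name: app
    image: nginx:1.14.2
    lifecycle:
      preStop:
        exec:
          command: ['sh', '-c', 'sleep 10']   # e.g. let in-flight requests drain before TERM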
An example flow:
1. You use the kubectl tool to manually delete a specific Pod, with the default grace
period (30 seconds).
2. The Pod in the API server is updated with the time beyond which the Pod is considered
"dead" along with the grace period. If you use kubectl describe to check on the Pod
you're deleting, that Pod shows up as "Terminating". On the node where the Pod is
running: as soon as the kubelet sees that a Pod has been marked as terminating (a
graceful shutdown duration has been set), the kubelet begins the local Pod shutdown
process.
1. If one of the Pod's containers has defined a preStop hook, the kubelet runs that
hook inside of the container. If the preStop hook is still running after the grace
period expires, the kubelet requests a small, one-off grace period extension of 2
seconds.
Note: If the preStop hook needs longer to complete than the default grace
period allows, you must modify terminationGracePeriodSeconds to suit this.
2. The kubelet triggers the container runtime to send a TERM signal to process 1
inside each container.
Note: The containers in the Pod receive the TERM signal at different times and
in an arbitrary order. If the order of shutdowns matters, consider using a
preStop hook to synchronize.
3. At the same time as the kubelet is starting graceful shutdown of the Pod, the control
plane evaluates whether to remove that shutting-down Pod from EndpointSlice (and
Endpoints) objects, where those objects represent a Service with a configured selector.
ReplicaSets and other workload resources no longer treat the shutting-down Pod as a
valid, in-service replica. Pods that shut down slowly should not continue to serve regular
traffic and should start terminating and finish processing open connections. Some
applications need to go beyond finishing open connections and need more graceful
termination - for example: session draining and completion. Any endpoints that
represent the terminating pods are not immediately removed from EndpointSlices, and
a status indicating terminating state is exposed from the EndpointSlice API (and the
legacy Endpoints API). Terminating endpoints always have their ready status as false (for backward compatibility with versions before 1.26), so load balancers will not use them for regular traffic. If traffic draining on a terminating Pod is needed, the actual readiness can be checked via the serving condition. You can find more details on how to implement connection draining in the tutorial Pods And Endpoints Termination Flow.
4. When the grace period expires, the kubelet triggers forcible shutdown. The container runtime sends SIGKILL to any processes still running in any container in the Pod. The kubelet also cleans up a hidden pause container if that container runtime uses one.
5. The kubelet transitions the Pod into a terminal phase ( Failed or Succeeded depending on the end state of its containers). This step is guaranteed since version 1.27.
6. The kubelet triggers forcible removal of the Pod object from the API server, by setting grace period to 0 (immediate deletion).
7. The API server deletes the Pod's API object, which is then no longer visible from any client.
Caution: Forced deletions can be potentially disruptive for some workloads and their
Pods.
By default, all deletes are graceful within 30 seconds. The kubectl delete command
supports the --grace-period=<seconds> option which allows you to override the default and
specify your own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server. If
the pod was still running on a node, that forcible deletion triggers the kubelet to begin
immediate cleanup.
Note: You must specify an additional flag --force along with --grace-period=0 in order
to perform force deletions.
When a force deletion is performed, the API server does not wait for confirmation from the
kubelet that the Pod has been terminated on the node it was running on. It removes the Pod
in the API immediately so a new Pod can be created with the same name. On the node, Pods
that are set to terminate immediately will still be given a small grace period before being force
killed.
Caution: Immediate deletion does not wait for confirmation that the running resource
has been terminated. The resource may continue to run on the cluster indefinitely.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation
for deleting Pods from a StatefulSet.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up
terminated Pods (with a phase of Succeeded or Failed ), when the number of Pods exceeds
the configured threshold (determined by terminated-pod-gc-threshold in the kube-
controller-manager). This avoids a resource leak as Pods are created and terminated over
time.
Additionally, PodGC cleans up any Pods which satisfy any of the following conditions:
When the PodDisruptionConditions feature gate is enabled, along with cleaning up the pods,
PodGC will also mark them as failed if they are in a non-terminal phase. Also, PodGC adds a
pod disruption condition when cleaning up an orphan pod (see also: Pod disruption
conditions).
What's next
Get hands-on experience attaching handlers to container lifecycle events.
For detailed information about Pod and container status in the API, see the API
reference documentation covering .status for Pod.
1.2 - Init Containers
You can specify init containers in the Pod specification alongside the containers array (which describes app containers).
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it
succeeds. However, if the Pod has a restartPolicy of Never, and an init container fails
during startup of that Pod, Kubernetes treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers field into the Pod specification,
as an array of container items (similar to the app containers field and its contents). See
Container in the API reference for more details.
If you specify multiple init containers for a Pod, kubelet runs each init container sequentially.
Each init container must succeed before the next can run. When all of the init containers have
run to completion, kubelet initializes the application containers for the Pod and runs them as
usual.
Init containers can contain utilities or custom code for setup that are not present in an
app image. For example, there is no need to make an image FROM another image just to
use a tool like sed , awk , python , or dig during setup.
The application image builder and deployer roles can work independently without the
need to jointly build a single app image.
Init containers can run with a different view of the filesystem than app containers in the
same Pod. Consequently, they can be given access to Secrets that app containers cannot
access.
Because init containers run to completion before any app containers start, init
containers offer a mechanism to block or delay app container startup until a set of
preconditions are met. Once preconditions are met, all of the app containers in a Pod
can start in parallel.
Init containers can securely run utilities or custom code that would otherwise make an
app container image less secure. By keeping unnecessary tools separate you can limit
the attack surface of your app container image.
Examples
Here are some ideas for how to use init containers:
Wait for a Service to be created, using a shell one-liner command like:
for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit
Register this Pod with a remote server, using values from the downward API.
Wait for some time before starting the app container with a command like:
sleep 60
Place values into a configuration file and run a template tool to dynamically generate a
configuration file for the main app container. For example, place the POD_IP value in a
configuration and generate the main app configuration file using Jinja.
apiVersion: v1
kind: Pod
metadata:
name: myapp-pod
labels:
app.kubernetes.io/name: MyApp
spec:
containers:
- name: myapp-container
image: busybox:1.28
command: ['sh', '-c', 'echo The app is running! && sleep 3600']
initContainers:
- name: init-myservice
image: busybox:1.28
command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubern
- name: init-mydb
image: busybox:1.28
command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.
You can start this Pod by applying the manifest above; the output is similar to this:
pod/myapp-pod created
Inspecting the Pod with kubectl describe shows output similar to this:
Name: myapp-pod
Namespace: default
[...]
Labels: app.kubernetes.io/name=MyApp
Status: Pending
[...]
Init Containers:
init-myservice:
[...]
State: Running
[...]
init-mydb:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Containers:
myapp-container:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath
--------- -------- ----- ---- -------------
16s 16s 1 {default-scheduler }
16s 16s 1 {kubelet 172.17.4.201} spec.initContainers
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers
At this point, those init containers will be waiting to discover Services named mydb and
myservice .
Here's a configuration you can use to make those Services appear:
---
apiVersion: v1
kind: Service
metadata:
name: myservice
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
name: mydb
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9377
Once you create the mydb and myservice Services, the output is similar to this:
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod Pod moves into
the Running state:
This simple example should provide some inspiration for you to create your own init
containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the networking and
storage are ready. Then the kubelet runs the Pod's init containers in the order they appear in
the Pod's spec.
Each init container must exit successfully before the next container starts. If a container fails
to start due to the runtime or exits with failure, it is retried according to the Pod
restartPolicy . However, if the Pod restartPolicy is set to Always, the init containers use
restartPolicy OnFailure.
A Pod cannot be Ready until all init containers have succeeded. The ports on an init container
are not aggregated under a Service. A Pod that is initializing is in the Pending state but
should have a condition Initialized set to false.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field. Altering an init
container image field is equivalent to restarting the Pod.
Because init containers can be restarted, retried, or re-executed, init container code should
be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for
the possibility that an output file already exists.
Init containers have all of the fields of an app container. However, Kubernetes prohibits
readinessProbe from being used because init containers cannot define readiness distinct
from completion. This is enforced during validation.
Use activeDeadlineSeconds on the Pod to prevent init containers from failing forever. The
active deadline includes init containers. However it is recommended to use
activeDeadlineSeconds only if teams deploy their application as a Job, because
activeDeadlineSeconds has an effect even after initContainer finished. The Pod which is
already running correctly would be killed by activeDeadlineSeconds if you set.
The name of each app and init container in a Pod must be unique; a validation error is thrown
for any container sharing a name with another.
Resources
Given the ordering and execution for init containers, the following rules for resource usage
apply:
The highest of any particular resource request or limit defined on all init containers is the effective init request/limit. If any resource has no resource limit specified, this is considered the highest limit.
The Pod's effective request/limit for a resource is the higher of:
the sum of all app containers request/limit for a resource
the effective init request/limit for a resource
Scheduling is done based on effective requests/limits, which means init containers can
reserve resources for initialization that are not used during the life of the Pod.
The Pod's effective QoS (quality of service) tier is the QoS tier for init containers and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Pod level control groups (cgroups) are based on the effective Pod request and limit, the same
as the scheduler.
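As an illustrative sketch of these rules (the names and quantities are made up): with one init container requesting 200m CPU and two app containers requesting 100m and 50m, the effective init request is 200m, and the Pod's effective CPU request is max(200m, 100m + 50m) = 200m.

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo            # illustrative name
spec:
  initContainers:
  - name: setup
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 5']
    resources:
      requests:
        cpu: 200m                # effective init CPU request = 200m (highest init request)
  containers:
  - name: app-a
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: 100m
  - name: app-b
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: 50m                 # app containers sum to 150m
  # Pod's effective CPU request = max(200m, 100m + 50m) = 200m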
Pod restart reasons
A Pod can restart, causing re-execution of init containers, for the following reasons:
The Pod infrastructure container is restarted. This is uncommon and would have to be done by someone with root access to nodes.
All containers in a Pod are terminated while restartPolicy is set to Always, forcing a restart, and the init container completion record has been lost due to garbage collection.
The Pod will not be restarted when the init container image is changed, or the init container
completion record has been lost due to garbage collection. This applies for Kubernetes v1.20
and later. If you are using an earlier version of Kubernetes, consult the documentation for the
version you are using.
What's next
Read about creating a Pod that has an init container
1.3 - Disruptions
This guide is for application owners who want to build highly available applications, and thus
need to understand what types of disruptions can happen to Pods.
It is also for cluster administrators who want to perform automated cluster actions, like
upgrading and autoscaling clusters.
Except for the out-of-resources condition, all these conditions should be familiar to most
users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
deleting the deployment or other controller that manages the pod
updating a deployment's pod template causing a restart
directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
draining a node for repair or upgrade
draining a node from a cluster to scale the cluster down
removing a pod from a node to permit something else to fit on that node
These actions might be taken directly by the cluster administrator, or by automation run by
the cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation
to determine if any sources of voluntary disruptions are enabled for your cluster. If none are
enabled, you can skip creating Pod Disruption Budgets.
Caution: Not all voluntary disruptions are constrained by Pod Disruption Budgets. For
example, deleting deployments or pods bypasses Pod Disruption Budgets.
The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no
automated voluntary disruptions (only user-triggered ones). However, your cluster
administrator or hosting provider may run some additional services which cause voluntary
disruptions. For example, rolling out node software updates can cause voluntary disruptions.
Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to
defragment and compact nodes. Your cluster administrator or hosting provider should have
documented what level of voluntary disruptions, if any, to expect. Certain configuration
options, such as using PriorityClasses in your pod spec can also cause voluntary (and
involuntary) disruptions.
Kubernetes offers features to help you run highly available applications even when you
introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A
PDB limits the number of Pods of a replicated application that are down simultaneously from
voluntary disruptions. For example, a quorum-based application would like to ensure that the
number of replicas running is never brought below the number needed for a quorum. A web
front end might want to ensure that the number of replicas serving load never falls below a
certain percentage of the total.
Cluster managers and hosting providers should use tools which respect
PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or
deployments.
For example, the kubectl drain subcommand lets you mark a node as going out of service.
When you run kubectl drain , the tool tries to evict all of the Pods on the Node you're taking
out of service. The eviction request that kubectl submits on your behalf may be temporarily
rejected, so the tool periodically retries all failed requests until all Pods on the target node are
terminated, or until a configurable timeout is reached.
A PDB specifies the number of replicas that an application can tolerate having, relative to how
many it is intended to have. For example, a Deployment which has a .spec.replicas: 5 is
supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then
the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same
as the one used by the application's controller (deployment, stateful-set, etc).
The "intended" number of pods is computed from the .spec.replicas of the workload
resource that is managing those pods. The control plane discovers the owning workload
resource by examining the .metadata.ownerReferences of the Pod.
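A minimal sketch of a PodDisruptionBudget (the name, label, and minAvailable value are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                  # illustrative name
spec:
  minAvailable: 2                # alternatively, use maxUnavailable
  selector:
    matchLabels:
      app: web                   # should match the labels used by the workload's own selector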
Involuntary disruptions cannot be prevented by PDBs; however they do count against the
budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count
against the disruption budget, but workload resources (such as Deployment and StatefulSet)
are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during
application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully terminated, honoring the
terminationGracePeriodSeconds setting in its PodSpec.
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1 through node-3 . The cluster is running several
applications. One of them has 3 replicas initially called pod-a , pod-b , and pod-c . Another,
unrelated pod without a PDB, called pod-x , is also shown. Initially, the pods are laid out as
follows:
node-1: pod-a available, pod-x available
node-2: pod-b available
node-3: pod-c available
All 3 pods are part of a deployment, and they collectively have a PDB which requires there be
at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to
fix a bug in the kernel. The cluster administrator first tries to drain node-1 using the kubectl
drain command. That tool tries to evict pod-a and pod-x . This succeeds immediately. Both
pods go into the terminating state at the same time. This puts the cluster in this state:
node-1 (draining): pod-a terminating, pod-x terminating
node-2: pod-b available
node-3: pod-c available
The deployment notices that one of the pods is terminating, so it creates a replacement called
pod-d . Since node-1 is cordoned, it lands on another node. Something has also created
pod-y as a replacement for pod-x .
(Note: for a StatefulSet, pod-a , which would be called something like pod-0 , would need to
terminate completely before its replacement, which is also called pod-0 but has a different
UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
At some point, the pods terminate, and the cluster looks like this:
node-1 (drained): no pods
node-2: pod-b available, pod-d starting
node-3: pod-c available, pod-y available
At this point, if an impatient cluster administrator tries to drain node-2 or node-3 , the drain
command will block, because there are only 2 available pods for the deployment, and its PDB
requires at least 2. After some time passes, pod-d becomes available.
Now, the cluster administrator tries to drain node-2 . The drain command will try to evict the
two pods in some order, say pod-b first and then pod-d . It will succeed at evicting pod-b .
But, when it tries to evict pod-d , it will be refused because that would leave only one pod
available for the deployment.
The deployment creates a replacement for pod-b called pod-e . Because there are not
enough resources in the cluster to schedule pod-e the drain will again block. The cluster may
end up in this state:
node-1 (drained): no pods
node-2: pod-d available
node-3: pod-c available, pod-y available
no node: pod-e pending
At this point, the cluster administrator needs to add a node back to the cluster to proceed
with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
Pod disruption conditions
Note: In order to use this behavior, you must have the PodDisruptionConditions feature gate enabled in your cluster.
When enabled, a dedicated Pod DisruptionTarget condition is added to indicate that the
Pod is about to be deleted due to a disruption. The reason field of the condition additionally
indicates one of the following reasons for the Pod termination:
PreemptionByScheduler
Pod is due to be preempted by the scheduler in order to accommodate a new Pod with a higher priority; see Pod priority and preemption.
DeletionByTaintManager
Pod is due to be deleted by Taint Manager (which is part of the node lifecycle controller
within kube-controller-manager) due to a NoExecute taint that the Pod does not tolerate;
see taint-based evictions.
EvictionByEvictionAPI
Pod has been marked for eviction using the Kubernetes API .
DeletionByPodGC
Pod, that is bound to a no longer existing Node, is due to be deleted by Pod garbage
collection.
TerminationByKubelet
Pod has been terminated by the kubelet, because of either node pressure eviction or the
graceful node shutdown.
Note: A Pod disruption might be interrupted. The control plane might re-attempt to
continue the disruption of the same Pod, but it is not guaranteed. As a result, the
DisruptionTarget condition might be added to a Pod, but that Pod might then not
actually be deleted. In such a situation, after some time, the Pod disruption condition will
be cleared.
When the PodDisruptionConditions feature gate is enabled, along with cleaning up the pods,
the Pod garbage collector (PodGC) will also mark them as failed if they are in a non-terminal
phase (see also Pod garbage collection).
When using a Job (or CronJob), you may want to use these Pod disruption conditions as part
of your Job's Pod failure policy.
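As a hedged sketch of that idea (the Job name, image, and command are illustrative; depending on your cluster version, podFailurePolicy may also require the Job pod failure policy feature to be available):

apiVersion: batch/v1
kind: Job
metadata:
  name: disruption-tolerant-job  # illustrative name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore             # don't count disruption-caused failures against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never       # required for a Job that uses podFailurePolicy
      containers:
      - name: main
        image: busybox:1.28
        command: ['sh', '-c', 'echo working && sleep 30']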
Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the cluster manager and the application owner as separate roles with limited knowledge of each other. This separation of responsibilities may make sense in these scenarios:
when there are many application teams sharing a Kubernetes cluster, and there is natural specialization of roles
when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between
the roles.
If you do not have such a separation of responsibilities in your organization, you may not
need to use Pod Disruption Budgets.
What's next
Follow steps to protect your application by configuring a Pod Disruption Budget.
Learn about updating a deployment including steps to maintain its availability during the
rollout.
1.4 - Ephemeral Containers
This page provides an overview of ephemeral containers: a special type of container that runs temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You use ephemeral containers to inspect services rather than to build applications.
Sometimes it's necessary to inspect the state of an existing Pod, however, for example to
troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in
an existing Pod to inspect its state and run arbitrary commands.
Ephemeral containers may not have ports, so fields such as ports , livenessProbe ,
readinessProbe are disallowed.
Ephemeral containers are created using a special ephemeralcontainers handler in the API
rather than by adding them directly to pod.spec , so it's not possible to add an ephemeral
container using kubectl edit .
Like regular containers, you may not change or remove an ephemeral container after you
have added it to a Pod.
In particular, distroless images enable you to deploy minimal container images that reduce
attack surface and exposure to bugs and vulnerabilities. Since distroless images do not
include a shell or any debugging utilities, it's difficult to troubleshoot distroless images using
kubectl exec alone.
When using ephemeral containers, it's helpful to enable process namespace sharing so you
can view processes in other containers.
What's next
Learn how to debug pods using ephemeral containers.
1.5 - Pod Quality of Service Classes
Guaranteed
Pods that are Guaranteed have the strictest resource limits and are least likely to face
eviction. They are guaranteed not to be killed until they exceed their limits or there are no
lower-priority Pods that can be preempted from the Node. They may not acquire resources
beyond their specified limits. These Pods can also make use of exclusive CPUs using the
static CPU management policy.
Criteria
For a Pod to be given a QoS class of Guaranteed :
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
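For example, a container with the following resources section (values are illustrative) satisfies these criteria, because each request equals the corresponding limit:
resources:
  requests:
    memory: "200Mi"    # equals the memory limit
    cpu: "700m"        # equals the CPU limit
  limits:
    memory: "200Mi"
    cpu: "700m"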
Burstable
Pods that are Burstable have some lower-bound resource guarantees based on the request,
but do not require a specific limit. If a limit is not specified, it defaults to a limit equivalent to
the capacity of the Node, which allows the Pods to flexibly increase their resources if
resources are available. In the event of Pod eviction due to Node resource pressure, these
Pods are evicted only after all BestEffort Pods are evicted. Because a Burstable Pod can
include a Container that has no resource limits or requests, a Pod that is Burstable can try
to use any amount of node resources.
Criteria
A Pod is given a QoS class of Burstable if:
The Pod does not meet the criteria for QoS class Guaranteed .
At least one Container in the Pod has a memory or CPU request or limit.
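For example, a Pod whose single container sets only the following resources (values are illustrative) is classified as Burstable: it has a memory request and limit but no CPU request or limit, so it does not qualify as Guaranteed:
resources:
  requests:
    memory: "100Mi"    # lower-bound guarantee
  limits:
    memory: "200Mi"    # no CPU request or limit, so the Pod is not Guaranteed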
BestEffort
Pods in the BestEffort QoS class can use node resources that aren't specifically assigned to
Pods in other QoS classes. For example, if you have a node with 16 CPU cores available to the
kubelet, and you assign 4 CPU cores to a Guaranteed Pod, then a Pod in the BestEffort QoS
class can try to use any amount of the remaining 12 CPU cores.
The kubelet prefers to evict BestEffort Pods if the node comes under resource pressure.
Criteria
A Pod has a QoS class of BestEffort if it doesn't meet the criteria for either Guaranteed or
Burstable . In other words, a Pod is BestEffort only if none of the Containers in the Pod
have a memory limit or a memory request, and none of the Containers in the Pod have a CPU
limit or a CPU request. Containers in a Pod can request other resources (not CPU or memory)
and still be classified as BestEffort .
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. The memory requests and limits of containers in a pod are used to set specific interfaces, memory.min and memory.high , provided by the memory controller. When memory.min is set to the memory request, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures memory availability for Kubernetes pods. If memory limits are set in the container, the system needs to limit container memory usage; Memory QoS uses memory.high to throttle a workload approaching its memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.
Memory QoS relies on QoS class to determine which settings to apply; however, these are
different mechanisms that both provide controls over quality of service.
Any Container exceeding a resource limit will be killed and restarted by the kubelet
without affecting other Containers in that Pod.
If a Container exceeds its resource request and the node it runs on faces resource
pressure, the Pod it is in becomes a candidate for eviction. If this occurs, all Containers
in the Pod will be terminated. Kubernetes may create a replacement Pod, usually on a
different node.
The resource request of a Pod is equal to the sum of the resource requests of its
component Containers, and the resource limit of a Pod is equal to the sum of the
resource limits of its component Containers.
The kube-scheduler does not consider QoS class when selecting which Pods to preempt.
Preemption can occur when a cluster does not have enough resources to run all the
Pods you defined.
What's next
Learn about resource management for Pods and Containers.
Learn about Node-pressure eviction.
Learn about Pod priority and preemption.
Learn about Pod disruptions.
Learn how to assign memory resources to containers and pods.
This page explains how user namespaces are used in Kubernetes pods. A user namespace
isolates the user running inside the container from the one in the host.
A process running as root in a container can run as a different (non-root) user in the host; in
other words, the process has full privileges for operations inside the user namespace, but is
unprivileged for operations outside the namespace.
You can use this feature to reduce the damage a compromised container can do to the host or to other pods on the same node. Several security vulnerabilities rated either HIGH or CRITICAL were not exploitable when user namespaces were active, and user namespaces are expected to mitigate some future vulnerabilities too.
This is a Linux-only feature and support is needed in Linux for idmap mounts on the
filesystems used. This means:
On the node, the filesystem you use for /var/lib/kubelet/pods/ , or the custom
directory you configure for this, needs idmap mount support.
All the filesystems used in the pod's volumes must support idmap mounts.
In practice this means you need at least Linux 6.3, as tmpfs started supporting idmap mounts
in that version. This is usually needed as several Kubernetes features use tmpfs (the service
account token that is mounted by default uses a tmpfs, Secrets use a tmpfs, etc.)
Some popular filesystems that support idmap mounts in Linux 6.3 are: btrfs, ext4, xfs, fat,
tmpfs, overlayfs.
In addition, support is needed in the container runtime to use this feature with Kubernetes
stateless pods:
CRI-O: version 1.25 (and later) supports user namespaces for containers.
Please note that although containerd v1.7 supports user namespaces for containers, that support is not compatible with the user namespace support in Kubernetes 1.27; it should not be used with Kubernetes 1.27 (and later).
Introduction
User namespaces is a Linux feature that allows mapping users in the container to different users on the host. Furthermore, the capabilities granted to a pod in a user namespace are valid only in the namespace and void outside of it.
A pod can opt-in to use user namespaces by setting the pod.spec.hostUsers field to false .
The kubelet will pick host UIDs/GIDs a pod is mapped to, and will do so in a way to guarantee
that no two stateless pods on the same node use the same mapping.
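A minimal sketch of such a Pod (the name and image are placeholders, and the cluster must have the user namespaces feature gate enabled) only needs hostUsers set to false:
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo                 # placeholder name
spec:
  hostUsers: false                  # opt this Pod in to a user namespace
  containers:
  - name: shell
    image: debian:stable            # placeholder image
    command: ["sleep", "infinity"]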
The runAsUser , runAsGroup , fsGroup , etc. fields in the pod.spec always refer to the user
inside the container.
The valid UIDs/GIDs when this feature is enabled is the range 0-65535. This applies to files
and processes ( runAsUser , runAsGroup , etc.).
Files using a UID/GID outside this range will be seen as belonging to the overflow ID, usually
65534 (configured in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid ).
However, it is not possible to modify those files, even by running as the 65534 user/group.
Most applications that need to run as root, but don't access other host namespaces or resources, should continue to run fine without any changes needed if user namespaces are activated.
When creating a pod, by default, several new namespaces are used for isolation: a network
namespace to isolate the network of the container, a PID namespace to isolate the view of
processes, etc. If a user namespace is used, this will isolate the users in the container from
the users in the node.
This means containers can run as root and be mapped to a non-root user on the host. Inside
the container the process will think it is running as root (and therefore tools like apt , yum ,
etc. work fine), while in reality the process doesn't have privileges on the host. You can verify this, for example, by checking which user the container process is running as: execute ps aux on the host, and the user that ps shows is not the same as the user you see if you run the id command inside the container.
This abstraction limits what can happen, for example, if the container manages to escape to the host. Given that the container is running as a non-privileged user on the host, what it can do to the host is limited.
Furthermore, as the users in each pod are mapped to different, non-overlapping users on the host, what they can do to other pods is also limited.
Capabilities granted to a pod are also limited to the pod's user namespace and are mostly invalid outside of it; some are even completely void. Here are two examples:
CAP_SYS_MODULE does not have any effect if granted to a pod using user namespaces; the pod isn't able to load kernel modules.
CAP_SYS_ADMIN is limited to the pod's user namespace and invalid outside of it.
Without a user namespace, a container running as root has, in the case of a container breakout, root privileges on the node; and if some capability was granted to the container, that capability is valid on the host too. Neither is true when user namespaces are in use.
If you want to know more details about what changes when user namespaces are in use, see
man 7 user_namespaces .
The kubelet will assign UIDs/GIDs higher than 65535 to pods. Therefore, to guarantee as much isolation as possible, the UIDs/GIDs used by the host's files and host's processes should be in the range 0-65535.
Note that this recommendation is important to mitigate the impact of CVEs like CVE-2021-25741, where a pod can potentially read arbitrary files on the host. If the UIDs/GIDs of the pod and the host don't overlap, what a pod would be able to do is limited: the pod's UIDs/GIDs won't match the host's file owner/group.
Limitations
When using a user namespace for the pod, it is disallowed to use other host namespaces. In
particular, if you set hostUsers: false then you are not allowed to set any of:
hostNetwork: true
hostIPC: true
hostPID: true
A pod that uses user namespaces is allowed to use no volumes at all or, if it uses volumes, only these volume types:
configmap
secret
projected
downwardAPI
emptyDir
What's next
Take a look at Use a User Namespace With a Pod
It is sometimes useful for a container to have information about itself, without being overly
coupled to Kubernetes. The downward API allows containers to consume information about
themselves or the cluster without using the Kubernetes client or API server.
In Kubernetes, there are two ways to expose Pod and container fields to a running container:
as environment variables
as files in a downwardAPI volume
Together, these two ways of exposing Pod and container fields are called the downward API.
Available fields
Only some Kubernetes API fields are available through the downward API. This section lists
which fields you can make available.
You can pass information from available Pod-level fields using fieldRef . At the API level, the
spec for a Pod always defines at least one Container. You can pass information from
available Container-level fields using resourceFieldRef .
metadata.name
metadata.namespace
metadata.uid
metadata.annotations['<KEY>']
metadata.labels['<KEY>']
the text value of the pod's label named <KEY> (for example, metadata.labels['mylabel'])
spec.serviceAccountName
spec.nodeName
status.hostIP
status.podIP
The following information is available through a downwardAPI volume fieldRef , but not as
environment variables:
metadata.labels
all of the pod's labels, formatted as label-key="escaped-label-value" with one label per
line
metadata.annotations
resource: limits.cpu
resource: requests.cpu
resource: limits.memory
resource: requests.memory
resource: limits.hugepages-*
resource: requests.hugepages-*
resource: limits.ephemeral-storage
resource: requests.ephemeral-storage
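For example, the following Pod sketch (the name, image, and chosen fields are illustrative) exposes a Pod-level field through fieldRef and a container-level field through resourceFieldRef as environment variables:
apiVersion: v1
kind: Pod
metadata:
  name: downward-api-demo                    # placeholder name
spec:
  containers:
  - name: main
    image: busybox:1.36                      # placeholder image
    command: ["sh", "-c", "env && sleep 3600"]
    resources:
      requests:
        cpu: "250m"
      limits:
        cpu: "500m"
    env:
    - name: MY_POD_NAME                      # Pod-level field via fieldRef
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: MY_CPU_LIMIT                     # container-level field via resourceFieldRef
      valueFrom:
        resourceFieldRef:
          containerName: main
          resource: limits.cpu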
What's next
You can read about downwardAPI volumes.
You can try using the downward API to expose container- or Pod-level information:
as environment variables
as files in downwardAPI volume
2 - Workload Resources
2.1 - Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
You describe a desired state in a Deployment, and the Deployment Controller changes the
actual state to the desired state at a controlled rate. You can define Deployments to create
new ReplicaSets, or to remove existing Deployments and adopt all their resources with new
Deployments.
Use Case
The following are typical use cases for Deployments:
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:
controllers/nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
In this example:
The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the
.spec.replicas field.
The .spec.selector field defines how the created ReplicaSet finds which Pods to
manage. In this case, you select a label that is defined in the Pod template ( app: nginx ).
However, more sophisticated selection rules are possible, as long as the Pod template
itself satisfies the rule.
The Pods are labeled app: nginx using the .metadata.labels field.
The Pod template's specification, or .template.spec field, indicates that the Pods
run one container, nginx , which runs the nginx Docker Hub image at version
1.14.2.
Create one container and name it nginx using the
.spec.template.spec.containers[0].name field.
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:
1. Create the Deployment by applying the manifest above, for example by running kubectl apply -f controllers/nginx-deployment.yaml .
2. Run kubectl get deployments to check if the Deployment was created. If the Deployment is still being created, the output shows the Deployment with no replicas ready yet.
When you inspect the Deployments in your cluster, the following fields are displayed:
READY displays how many replicas of the application are available to your users. It
follows the pattern ready/desired.
UP-TO-DATE displays the number of replicas that have been updated to achieve
the desired state.
AVAILABLE displays how many replicas of the application are available to your
users.
AGE displays the amount of time that the application has been running.
3. To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment . The output is similar to:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
deployment "nginx-deployment" successfully rolled out
4. Run kubectl get deployments again a few seconds later. The output is similar to this:
Notice that the Deployment has created all three replicas, and all replicas are up-to-date
(they contain the latest Pod template) and available.
5. To see the ReplicaSet ( rs ) created by the Deployment, run kubectl get rs . The output
is similar to this:
DESIRED displays the desired number of replicas of the application, which you
define when you create the Deployment. This is the desired state.
CURRENT displays how many replicas are currently running.
READY displays how many replicas of the application are available to your users.
AGE displays the amount of time that the application has been running.
The HASH string is the same as the pod-template-hash label on the ReplicaSet.
6. To see the labels automatically generated for each Pod, run kubectl get pods --show-
labels . The output is similar to:
The created ReplicaSet ensures that there are three nginx Pods.
Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in
this case, app: nginx ).
Do not overlap labels or selectors with other controllers (including other Deployments
and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple
controllers have overlapping selectors those controllers might conflict and behave
unexpectedly.
Pod-template-hash label
The pod-template-hash label is added by the Deployment controller to every ReplicaSet that
a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by
hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value
that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the
ReplicaSet might have.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the Deployment's Pod template
(that is, .spec.template) is changed, for example if the labels or container images of the
template are updated. Other updates, such as scaling the Deployment, do not trigger a
rollout.
1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of the
nginx:1.14.2 image.
deployment.apps/nginx-deployment edited
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
or
deployment "nginx-deployment" successfully rolled out
After the rollout succeeds, you can view the Deployment by running kubectl get
deployments . The output is similar to this:
Run kubectl get rs to see that the Deployment updated the Pods by creating a new
ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0
replicas.
kubectl get rs
Running get pods should now show only the new Pods:
Next time you want to update these Pods, you only need to update the Deployment's
Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being
updated. By default, it ensures that at least 75% of the desired number of Pods are up
(25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the
desired number of Pods. By default, it ensures that at most 125% of the desired number
of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates
a new Pod, then deletes an old Pod, and creates another new one. It does not kill old
Pods until a sufficient number of new Pods have come up, and does not create new
Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3
Pods are available and that at max 4 Pods in total are available. In case of a Deployment
with 4 replicas, the number of Pods would be between 3 and 5.
Get details of your Deployment:
Name: nginx-deployment
Namespace: default
CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=2
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 una
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m deployment-controller Scaled up replica
Normal ScalingReplicaSet 24s deployment-controller Scaled up replica
Normal ScalingReplicaSet 22s deployment-controller Scaled down repli
Normal ScalingReplicaSet 22s deployment-controller Scaled up replica
Normal ScalingReplicaSet 19s deployment-controller Scaled down repli
Normal ScalingReplicaSet 19s deployment-controller Scaled up replica
Normal ScalingReplicaSet 14s deployment-controller Scaled down repli
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-
deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the
Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it
up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and
scaled up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4
Pods were created at all times. It then continued scaling up and down the new and the
old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available
replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Note: Kubernetes doesn't count terminating Pods when calculating the number of
availableReplicas, which must be between replicas - maxUnavailable and replicas +
maxSurge. As a result, you might notice that there are more Pods than expected during a
rollout, and that the total resources consumed by the Deployment is more than replicas
+ maxSurge until the terminationGracePeriodSeconds of the terminating Pods expires.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a
new ReplicaSet as per the update and starts scaling that up, and rolls over the ReplicaSet that it
was scaling up previously -- it will add it to its list of old ReplicaSets and start scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2 , but
then update the Deployment to create 5 replicas of nginx:1.16.1 , when only 3 replicas of
nginx:1.14.2 had been created. In that case, the Deployment immediately starts killing the 3
nginx:1.14.2 Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not
wait for the 5 replicas of nginx:1.14.2 to be created before changing course.
Note: In API version apps/v1, a Deployment's label selector is immutable after it gets
created.
Selector additions require the Pod template labels in the Deployment spec to be
updated with the new label too, otherwise a validation error is returned. This change is a
non-overlapping one, meaning that the new selector does not select ReplicaSets and
Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating
a new ReplicaSet.
Selector updates change the existing value in a selector key and result in the same behavior as additions.
Selector removals remove an existing key from the Deployment selector and do not require any changes in the Pod template labels. Existing ReplicaSets are not orphaned,
and a new ReplicaSet is not created, but note that the removed label still exists in any
existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. A Deployment's revision is created only when its Pod template is changed, which means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
Suppose that you made a typo while updating the Deployment, by putting the image
name as nginx:1.161 instead of nginx:1.16.1 :
The rollout gets stuck. You can verify it by checking the rollout status:
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Press Ctrl-C to stop the above rollout status watch. For more information on stuck
rollouts, read more here.
You see that the number of old replicas ( nginx-deployment-1564180365 and nginx-
deployment-2035384211 ) is 2, and new replicas (nginx-deployment-3066724191) is 1.
kubectl get rs
Looking at the Pods created, you see that 1 Pod created by new ReplicaSet is stuck in an
image pull loop.
Note: The Deployment controller stops the bad rollout automatically, and stops
scaling up the new ReplicaSet. This depends on the rollingUpdate parameters
(maxUnavailable specifically) that you have specified. Kubernetes by default sets the
value to 25%.
Get the description of the Deployment:
Name: nginx-deployment
Namespace: default
CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700
Labels: app=nginx
Selector: app=nginx
Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.161
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created)
NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created)
Events:
FirstSeen LastSeen Count From SubObjectPath Type
--------- -------- ----- ---- ------------- -----
1m 1m 1 {deployment-controller } Norma
22s 22s 1 {deployment-controller } Norma
22s 22s 1 {deployment-controller } Norma
22s 22s 1 {deployment-controller } Norma
21s 21s 1 {deployment-controller } Norma
21s 21s 1 {deployment-controller } Norma
13s 13s 1 {deployment-controller } Norma
13s 13s 1 {deployment-controller } Norma
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
First, check the revisions of this Deployment by running kubectl rollout history deployment/nginx-deployment . The output is similar to this:
deployments "nginx-deployment"
REVISION CHANGE-CAUSE
1 kubectl apply --filename=https://k8s.io/examples/controllers/ngin
2 kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
3 kubectl set image deployment/nginx-deployment nginx=nginx:1.161
1. Now you've decided to undo the current rollout and roll back to the previous revision, by running kubectl rollout undo deployment/nginx-deployment . (To roll back to a specific revision instead, add the --to-revision flag.) For more details about rollout related commands, read kubectl rollout .
The Deployment is now rolled back to a previous stable revision. As you can see, a DeploymentRollback event for rolling back to revision 2 is generated by the Deployment controller.
2. To check whether the rollback was successful and the Deployment is running as expected, run kubectl describe deployment nginx-deployment . The output is similar to this:
Name: nginx-deployment
Namespace: default
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=4
kubernetes.io/change-cause=kubectl set image deployme
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 una
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 12m deployment-controller Scaled up replica
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica
Normal ScalingReplicaSet 11m deployment-controller Scaled down replic
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica
Normal ScalingReplicaSet 11m deployment-controller Scaled down replic
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica
Normal ScalingReplicaSet 11m deployment-controller Scaled down replic
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica
Normal DeploymentRollback 15s deployment-controller Rolled back deploy
Normal ScalingReplicaSet 15s deployment-controller Scaled down replic
Scaling a Deployment
You can scale a Deployment by using the following command:
kubectl scale deployment/nginx-deployment --replicas=10
The output is similar to this:
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler
for your Deployment and choose the minimum and maximum number of Pods you want to
run based on the CPU utilization of your existing Pods. For example:
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
The output is similar to this:
deployment.apps/nginx-deployment scaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same
time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a
rollout (either in progress or paused), the Deployment controller balances the additional
replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This
is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and
maxUnavailable=2.
You update to a new image which happens to be unresolvable from inside the cluster.
kubectl get rs
Then a new scaling request for the Deployment comes along. The autoscaler increments
the Deployment replicas to 15. The Deployment controller needs to decide where to add
these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be
added in the new ReplicaSet. With proportional scaling, you spread the additional
replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most
replicas and lower proportions go to ReplicaSets with less replicas. Any leftovers are
added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not
scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to
the new ReplicaSet. The rollout process should eventually move all replicas to the new
ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
Pausing and Resuming a rollout of a Deployment
When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates. For example, pause the current rollout by running kubectl rollout pause deployment/nginx-deployment . The output is similar to this:
deployment.apps/nginx-deployment paused
If you then update the Deployment (for example, change its image), no new rollout starts while the Deployment is paused. You can verify this by checking the rollout history with kubectl rollout history deployment/nginx-deployment . The output is similar to this:
deployments "nginx"
REVISION CHANGE-CAUSE
1 <none>
Get the rollout status to verify that the existing ReplicaSet has not changed:
kubectl get rs
You can make as many updates as you wish, for example, update the resources that will
be used:
The initial state of the Deployment prior to pausing its rollout will continue its function,
but new updates to the Deployment will not have any effect as long as the Deployment
rollout is paused.
Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up
with all the new updates:
deployment.apps/nginx-deployment resumed
kubectl get rs -w
kubectl get rs
Note: You cannot rollback a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out
a new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
The Deployment creates a new ReplicaSet.
The Deployment is scaling up its newest ReplicaSet.
The Deployment is scaling down its older ReplicaSet(s).
New Pods become ready or available (ready for at least MinReadySeconds ).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the
following attributes to the Deployment's .status.conditions :
type: Progressing
status: "True"
You can monitor the progress for a Deployment by using kubectl rollout status .
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
All of the replicas associated with the Deployment have been updated to the latest
version you've specified, meaning any updates you've requested have been completed.
All of the replicas associated with the Deployment are available.
No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the
following attributes to the Deployment's .status.conditions :
type: Progressing
status: "True"
reason: NewReplicaSetAvailable
This Progressing condition will retain a status value of "True" until a new rollout is
initiated. The condition holds even when availability of replicas changes (which does instead
affect the Available condition).
You can check if a Deployment has completed by using kubectl rollout status . If the
rollout completed successfully, kubectl rollout status returns a zero exit code.
echo $?
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever
completing. This can occur due to some of the following factors:
Insufficient quota
Readiness probe failures
Image pull errors
Insufficient permissions
Limit ranges
Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment
spec: ( .spec.progressDeadlineSeconds ). .spec.progressDeadlineSeconds denotes the
number of seconds the Deployment controller waits before indicating (in the Deployment
status) that the Deployment progress has stalled.
The following kubectl command sets the spec with progressDeadlineSeconds to make the
controller report lack of progress of a rollout for a Deployment after 10 minutes:
kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
The output is similar to this:
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a
DeploymentCondition with the following attributes to the Deployment's .status.conditions :
type: Progressing
status: "False"
reason: ProgressDeadlineExceeded
This condition can also fail early and is then set to a status value of "False" due to reasons such as
ReplicaSetCreateError . Also, the deadline is not taken into account anymore once the
Deployment rollout completes.
See the Kubernetes API conventions for more information on status conditions.
Note: Kubernetes takes no action on a stalled Deployment other than to report a status
condition with reason: ProgressDeadlineExceeded. Higher level orchestrators can take
advantage of it and act accordingly, for example, rollback the Deployment to its previous
version.
Note: If you pause a Deployment rollout, Kubernetes does not check progress against
your specified deadline. You can safely pause a Deployment rollout in the middle of a
rollout and resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that
you have set or due to any other kind of error that can be treated as transient. For example,
let's suppose you have insufficient quota. If you describe the Deployment you will notice the
following section:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml , the Deployment status is
similar to this:
status:
availableReplicas: 2
conditions:
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: Replica set "nginx-deployment-4262182780" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
- lastTransitionTime: 2016-10-04T12:25:42Z
lastUpdateTime: 2016-10-04T12:25:42Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: e
object-counts, requested: pods=1, used: pods=3, limited: pods=2'
reason: FailedCreate
status: "True"
type: ReplicaFailure
observedGeneration: 3
replicas: 2
unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the
status and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling
down other controllers you may be running, or by increasing quota in your namespace. If you
satisfy the quota conditions and the Deployment controller then completes the Deployment
rollout, you'll see the Deployment's status update with a successful condition ( status:
"True" and reason: NewReplicaSetAvailable ).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
type: Available with status: "True" means that your Deployment has minimum
availability. Minimum availability is dictated by the parameters specified in the deployment
strategy. type: Progressing with status: "True" means that your Deployment is either in
the middle of a rollout and it is progressing or that it has successfully completed its progress
and the minimum required new replicas are available (see the Reason of the condition for the
particulars - in our case reason: NewReplicaSetAvailable means that the Deployment is
complete).
You can check if a Deployment has failed to progress by using kubectl rollout status .
kubectl rollout status returns a non-zero exit code if the Deployment has exceeded the
progression deadline.
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
echo $?
Clean up Policy
You can set .spec.revisionHistoryLimit field in a Deployment to specify how many old
ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the
background. By default, it is 10.
Note: Explicitly setting this field to 0 will result in cleaning up all the history of your Deployment, so that Deployment will not be able to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can
create multiple Deployments, one for each release, following the canary pattern described in
managing resources.
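For illustration, the sketch below (names, labels, and images are placeholders following the track-label convention from managing resources) runs a stable Deployment and a canary Deployment side by side; a Service that selects only app: guestbook would send traffic to Pods from both:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook-stable                     # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: guestbook
      track: stable
  template:
    metadata:
      labels:
        app: guestbook
        track: stable                        # label the stable track
    spec:
      containers:
      - name: frontend
        image: registry.example/guestbook:v1 # current release (placeholder image)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook-canary                     # placeholder name
spec:
  replicas: 1                                # small canary footprint
  selector:
    matchLabels:
      app: guestbook
      track: canary
  template:
    metadata:
      labels:
        app: guestbook
        track: canary                        # label the canary track
    spec:
      containers:
      - name: frontend
        image: registry.example/guestbook:v2 # release under test (placeholder image)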
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion , .kind , and .metadata fields. When the control plane creates new Pods for a Deployment, the .metadata.name of the
Deployment is part of the basis for naming those Pods. The name of a Deployment must be a
valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames.
For best compatibility, the name should follow the more restrictive rules for a DNS label.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec .
The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is
nested and does not have an apiVersion or kind .
In addition to required fields for a Pod, a Pod template in a Deployment must specify
appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with
other controllers. See selector.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to
1.
Should you manually scale a Deployment, for example via kubectl scale deployment/nginx-deployment --replicas=X , and then update that Deployment based on a manifest (for example: by running kubectl apply -f deployment.yaml ), then applying that manifest overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for
a Deployment, don't set .spec.replicas .
Instead, allow the Kubernetes control plane to manage the .spec.replicas field
automatically.
Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this
Deployment.
A Deployment may terminate Pods whose labels match the selector if their template is
different from .spec.template or if the total number of such Pods exceeds .spec.replicas .
It brings up new Pods with .spec.template if the number of Pods is less than the desired
number.
Note: You should not create other Pods whose labels match this selector, either directly,
by creating another Deployment, or by creating another controller such as a ReplicaSet or
a ReplicationController. If you do so, the first Deployment thinks that it created these
other Pods. Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with
each other and won't behave correctly.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new ones.
.spec.strategy.type can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default
value.
Recreate Deployment
All existing Pods are killed before new ones are created when
.spec.strategy.type==Recreate .
Note: This will only guarantee Pod termination previous to creation for upgrades. If you
upgrade a Deployment, all Pods of the old revision will be terminated immediately.
Successful removal is awaited before any Pod of the new revision is created. If you
manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement
will be created immediately (even if the old Pod is still in a Terminating state). If you need
an "at most" guarantee for your Pods, you should consider using a StatefulSet.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the
maximum number of Pods that can be unavailable during the update process. The value can
be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%).
The absolute number is calculated from percentage by rounding down. The value cannot be 0
if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of
desired Pods immediately when the rolling update starts. Once new Pods are ready, old
ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring
that the total number of Pods available at all times during the update is at least 70% of the
desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum
number of Pods that can be created over the desired number of Pods. The value can be an
absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The
value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the
percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately
when the rolling update starts, such that the total number of old and new Pods does not
exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be
scaled up further, ensuring that the total number of Pods running at any time during the
update is at most 130% of desired Pods.
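For example, the following Deployment spec fragment (the numbers are illustrative) allows one extra Pod and one unavailable Pod during a rollout:
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most replicas + 1 = 5 Pods exist during the rollout
      maxUnavailable: 1      # at least replicas - 1 = 3 Pods stay available during the rollout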
Revision History Limit
A Deployment's revision history is stored in the ReplicaSets it controls. .spec.revisionHistoryLimit is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history is cleaned up.
Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused is that any changes to the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
What's next
Learn more about Pods.
Run a stateless application using a Deployment.
Read the Deployment to understand the Deployment API.
Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
Use kubectl to create a Deployment.
2.2 - ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As
such, it is often used to guarantee the availability of a specified number of identical Pods.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which
specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have
their owning ReplicaSet's identifying information within their ownerReferences field. It's
through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans
accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no
OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet's
selector, it will be immediately acquired by said ReplicaSet.
This actually means that you may never need to manipulate ReplicaSet objects: use a
Deployment instead, and define your application in the spec section.
Example
controllers/frontend.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: frontend
labels:
app: guestbook
tier: frontend
spec:
# modify replicas according to your case
replicas: 3
selector:
matchLabels:
tier: frontend
template:
metadata:
labels:
tier: frontend
spec:
containers:
- name: php-redis
image: gcr.io/google_samples/gb-frontend:v3
Saving this manifest into frontend.yaml and submitting it to a Kubernetes cluster will create
the defined ReplicaSet and the Pods that it manages.
You can then check the current ReplicaSets deployed with kubectl get rs , and check on the state of the ReplicaSet with kubectl describe rs/frontend . You will see output similar to:
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotati
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: gcr.io/google_samples/gb-frontend:v3
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 117s replicaset-controller Created pod: frontend-wt
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-b2
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-vc
And lastly you can check for the Pods brought up, with kubectl get pods .
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet.
To do this, get the yaml of one of the Pods running, for example with kubectl get pods frontend-b2zdv -o yaml :
The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's
ownerReferences field:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2020-02-12T07:06:16Z"
generateName: frontend-
labels:
tier: frontend
name: frontend-b2zdv
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: frontend
uid: f391f6db-bb9b-4c09-ae74-6a1f77f3d5cf
...
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
pods/pod-rs.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod1
labels:
tier: frontend
spec:
containers:
- name: hello1
image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
name: pod2
labels:
tier: frontend
spec:
containers:
- name: hello2
image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner reference and match the
selector of the frontend ReplicaSet, they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up
its initial Pod replicas to fulfill its replica count requirement:
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the
ReplicaSet would be over its desired count.
The output shows that the new Pods are either already terminated, or in the process of being
terminated:
If instead you create the Pods first and only then create the frontend ReplicaSet, you will see that the ReplicaSet has acquired the Pods and has created new ones according to its spec only until the number of its new Pods plus the original Pods matches its desired count. Fetching the Pods shows this:
When the control plane creates new Pods for a ReplicaSet, the .metadata.name of the
ReplicaSet is part of the basis for naming those Pods. The name of a ReplicaSet must be a
valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames.
For best compatibility, the name should follow the more restrictive rules for a DNS label.
Pod Template
The .spec.templateis a pod template which is also required to have labels in place. In our
frontend.yaml example we had one label: tier: frontend . Be careful not to overlap with
the selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field, .spec.template.spec.restartPolicy , the only allowed
value is Always , which is the default.
Pod Selector
The .spec.selector field is a label selector. As discussed earlier these are the labels used to
identify potential Pods to acquire. In our frontend.yaml example, the selector was:
matchLabels:
tier: frontend
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas . The
ReplicaSet will create/delete its Pods to match this number.
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete . The garbage collector automatically deletes all of the dependent Pods by default. When using the REST API or the client-go library, you must set propagationPolicy to Background or Foreground in the -d option (the body of the DELETE request).
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods by using kubectl delete with the --cascade=orphan option. Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new .spec.selector are the same, then the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod template. To update Pods to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling update directly.
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field.
The ReplicaSet controller ensures that a desired number of Pods with a matching label
selector are available and operational.
When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods, using the following general priority order: pending (and unschedulable) pods are scaled down first, then pods with a lower controller.kubernetes.io/pod-deletion-cost annotation value, then pods on nodes with more replicas, and then pods that were created more recently.
Pod deletion cost
Using the controller.kubernetes.io/pod-deletion-cost annotation, you can set a preference for which pods to remove first when downscaling a ReplicaSet. The annotation should be set on the pod; the allowed range is [-2147483647, 2147483647]. It represents the cost of deleting a pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are preferred to be deleted before pods with higher deletion cost.
The implicit value of this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by the API server.
This feature is beta and enabled by default. You can disable it using the feature gate
PodDeletionCost in both kube-apiserver and kube-controller-manager.
Note:
This is honored on a best-effort basis, so it does not offer any guarantees on pod
deletion order.
Users should avoid updating the annotation frequently, such as updating it based
on a metric value, because doing so will generate a significant number of pod
updates on the apiserver.
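For illustration, the annotation goes in an individual Pod's metadata; the value below is arbitrary and must be quoted because annotation values are strings:
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"   # lower cost: deleted before sibling Pods with higher cost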
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA); that is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet created in the previous example.
controllers/hpa-rs.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: frontend-scaler
spec:
scaleTargetRef:
kind: ReplicaSet
name: frontend
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should
create the defined HPA that autoscales the target ReplicaSet depending on the CPU usage of
the replicated Pods.
Alternatively, you can use the kubectl autoscale command to accomplish the same (and it's easier!), for example:
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50
Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and their Pods via
declarative, server-side rolling updates. While ReplicaSets can be used independently, today
they're mainly used by Deployments as a mechanism to orchestrate Pod creation, deletion
and updates. When you use Deployments you don't have to worry about managing the
ReplicaSets that they create. Deployments own and manage their ReplicaSets. As such, it is
recommended to use Deployments when you want ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are
deleted or terminated for any reason, such as in the case of node failure or disruptive node
maintenance, such as a kernel upgrade. For this reason, we recommend that you use a
ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process
supervisor, only it supervises multiple Pods across multiple nodes instead of individual
processes on a single node. A ReplicaSet delegates local container restarts to some agent on
the node such as Kubelet.
Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on their own (that
is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-level function, such
as machine monitoring or machine logging. These Pods have a lifetime that is tied to a
machine lifetime: the Pod needs to be running on the machine before other Pods start, and
are safe to terminate when the machine is otherwise ready to be rebooted/shutdown.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose,
and behave similarly, except that a ReplicationController does not support set-based selector
requirements as described in the labels user guide. As such, ReplicaSets are preferred over
ReplicationControllers.
What's next
Learn about Pods.
Learn about Deployments.
Run a Stateless Application Using a Deployment, which relies on ReplicaSets to work.
ReplicaSet is a top-level resource in the Kubernetes REST API. Read the ReplicaSet
object definition to understand the API for replica sets.
Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
2.3 - StatefulSets
StatefulSet is the workload API object used to manage stateful applications.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the
ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These pods
are created from the same spec, but are not interchangeable: each has a persistent identifier
that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a
StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to
failure, the persistent Pod identifiers make it easier to match existing volumes to the new
Pods that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following:
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling.
Limitations
The storage for a given Pod must either be provisioned by a PersistentVolume
Provisioner based on the requested storage class , or pre-provisioned by an admin.
Deleting and/or scaling a StatefulSet down will not delete the volumes associated with
the StatefulSet. This is done to ensure data safety, which is generally more valuable than
an automatic purge of all related StatefulSet resources.
StatefulSets currently require a Headless Service to be responsible for the network
identity of the Pods. You are responsible for creating this Service.
StatefulSets do not provide any guarantees on the termination of pods when a
StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the
StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ),
it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: nginx # has to match .spec.template.metadata.labels
serviceName: "nginx"
replicas: 3 # by default is 1
minReadySeconds: 10 # by default is 0
template:
metadata:
labels:
app: nginx # has to match .spec.selector.matchLabels
spec:
terminationGracePeriodSeconds: 10
containers:
- name: nginx
image: registry.k8s.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
Pod Selector
You must set the .spec.selector field of a StatefulSet to match the labels of its
.spec.template.metadata.labels . Failing to specify a matching Pod Selector will result in a
validation error during StatefulSet creation.
Pod Identity
StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity,
and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled
on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer
ordinal, that is unique over the Set. By default, pods will be assigned ordinals from 0 up
through N-1.
Start ordinal
FEATURE STATE: Kubernetes v1.27 [beta]
.spec.ordinals is an optional field that allows you to configure the integer ordinals assigned
to each Pod. It defaults to nil. You must enable the StatefulSetStartOrdinal feature gate to
use this field. Once enabled, you can configure the following option:
.spec.ordinals.start : if set, Pods will be assigned ordinals from .spec.ordinals.start up
through .spec.ordinals.start + .spec.replicas - 1 .
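For example, a minimal sketch (reusing the names from the example above and assuming the StatefulSetStartOrdinal feature gate is enabled) that starts numbering at 5, so three replicas become web-5, web-6 and web-7:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 3
  ordinals:
    start: 5          # Pods get ordinals 5, 6 and 7 instead of 0, 1 and 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8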
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the
ordinal of the Pod. The pattern for the constructed hostname is $(statefulset
name)-$(ordinal) . The example above will create three Pods named web-0,web-1,web-2 . A
StatefulSet can use a Headless Service to control the domain of its Pods. The domain
managed by this Service takes the form: $(service name).$(namespace).svc.cluster.local ,
where "cluster.local" is the cluster domain. As each Pod is created, it gets a matching DNS
subdomain, taking the form: $(podname).$(governing service domain) , where the governing
service is defined by the serviceName field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS
name for a newly-run Pod immediately. This behavior can occur when other clients in the
cluster have already sent queries for the hostname of the Pod before it was created. Negative
caching (normal in DNS) means that the results of previous failed lookups are remembered
and reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
Query the Kubernetes API directly (for example, using a watch) rather than relying on
DNS lookups.
Decrease the time of caching in your Kubernetes DNS provider (typically this means
editing the config map for CoreDNS, which currently caches for 30 seconds).
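For the second option, the cache duration lives in the Corefile held by the CoreDNS ConfigMap. This is only a sketch of what that looks like, assuming the default coredns ConfigMap in the kube-system namespace; your cluster's Corefile will likely contain additional plugins and may differ:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 5        # lowered from the usual 30 seconds
        loop
        reload
    }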
As mentioned in the limitations section, you are responsible for creating the Headless Service
responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name, and StatefulSet name,
and how they affect the DNS names for the StatefulSet's Pods. For instance, with the nginx
example above (Service nginx and StatefulSet web in the default namespace) and the default
cluster domain cluster.local , the Pods get the DNS names
web-0.nginx.default.svc.cluster.local , web-1.nginx.default.svc.cluster.local and
web-2.nginx.default.svc.cluster.local .
Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one
PersistentVolumeClaim. In the nginx example above, each Pod receives a single
PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage.
If no StorageClass is specified, then the default StorageClass will be used. When a Pod is
(re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with
its PersistentVolumeClaims. Note that the PersistentVolumes associated with the Pods'
PersistentVolumeClaims are not deleted when the Pods, or the StatefulSet, are deleted. This must
be done manually.
Deployment and Scaling Guarantees
When the nginx example above is created, three Pods will be deployed in the order web-0,
web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will
not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running
and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is
successfully relaunched and becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that
replicas=1 , web-2 would be terminated first. web-1 would not be terminated until web-2 is
fully shut down and deleted. If web-0 were to fail after web-2 has been terminated and is
completely shut down, but prior to web-1's termination, web-1 would not be terminated until
web-0 is Running and Ready.
Update strategies
A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated
rolling updates for containers, labels, resource request/limits, and annotations for the Pods in
a StatefulSet. There are two possible values:
OnDelete
When a StatefulSet's .spec.updateStrategy.type is set to OnDelete , the StatefulSet controller
will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to
cause the controller to create new Pods that reflect modifications made to a StatefulSet's
.spec.template .
RollingUpdate
The RollingUpdate update strategy implements automated, rolling updates for the Pods in
a StatefulSet. This is the default update strategy.
Rolling Updates
When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate , the StatefulSet
controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same
order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at
a time.
The Kubernetes control plane waits until an updated Pod is Running and Ready prior to
updating its predecessor. If you have set .spec.minReadySeconds (see Minimum Ready
Seconds), the control plane additionally waits that amount of time after the Pod turns ready,
before moving on.
You can control the maximum number of Pods that can be unavailable during an update by
specifying the .spec.updateStrategy.rollingUpdate.maxUnavailable field. The value can be
an absolute number (for example, 5 ) or a percentage of desired Pods (for example, 10% ).
Absolute number is calculated from the percentage value by rounding it up. This field cannot
be 0. The default setting is 1.
This field applies to all Pods in the range 0 to replicas - 1 . If there is any unavailable Pod
in the range 0 to replicas - 1 , it will be counted towards maxUnavailable .
Note: The maxUnavailable field is in Alpha stage and it is honored only by API servers that
are running with the MaxUnavailableStatefulSet feature gate enabled.
Forced rollback
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ), it's
possible to get into a broken state that requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for
example, due to a bad binary or application-level configuration error), StatefulSet will stop the
rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a
known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which
never happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already
attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods
using the reverted template.
PersistentVolumeClaim retention
FEATURE STATE: Kubernetes v1.27 [beta]
The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are
deleted during the lifecycle of a StatefulSet. You must enable the StatefulSetAutoDeletePVC
feature gate to use this field. Once enabled, there are two policies you can configure for each
StatefulSet:
whenDeleted
configures the volume retention behavior that applies when the StatefulSet is deleted
whenScaled
configures the volume retention behavior that applies when the replica count of the
StatefulSet is reduced; for example, when scaling down the set.
For each policy that you can configure, you can set the value to either Delete or Retain .
Delete
The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod
affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate
are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs
corresponding to Pod replicas being scaled down are deleted, after their Pods have been
deleted.
Retain (default)
PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the
behavior before this new feature.
Bear in mind that these policies only apply when Pods are being removed due to the
StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet
fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet
retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the
node where the new Pod is about to launch.
The default for policies is Retain , matching the StatefulSet behavior before this new feature.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Delete
...
The StatefulSet controller adds owner references to its PVCs, which are then deleted by the
garbage collector after the Pod is terminated. This enables the Pod to cleanly unmount all
volumes before the PVCs are deleted (and before the backing PV and volume are deleted,
depending on the retain policy). When you set the whenDeleted policy to Delete , an owner
reference to the StatefulSet instance is placed on all PVCs associated with that StatefulSet.
The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a
Pod is deleted for another reason. When reconciling, the StatefulSet controller compares its
desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod whose
ordinal is greater than or equal to the replica count is condemned and marked for deletion. If the whenScaled
policy is Delete , the condemned Pods are first set as owners to the associated StatefulSet
template PVCs, before the Pod is deleted. This causes the PVCs to be garbage collected only
after the condemned Pods have terminated.
This means that if the controller crashes and restarts, no Pod will be deleted before its owner
reference has been updated appropriate to the policy. If a condemned Pod is force-deleted
while the controller is down, the owner reference may or may not have been set up,
depending on when the controller crashed. It may take several reconcile loops to update the
owner references, so some condemned Pods may have set up owner references and others
may not. For this reason we recommend waiting for the controller to come back up, which will
verify owner references before terminating Pods. If that is not possible, the operator should
verify the owner references on PVCs to ensure the expected objects are deleted when Pods
are force-deleted.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to
1.
Should you manually scale a StatefulSet, for example via kubectl scale statefulset
statefulset --replicas=X , and then you update that StatefulSet based on a manifest (for
example: by running kubectl apply -f statefulset.yaml ), then applying that manifest
overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for
a Statefulset, don't set .spec.replicas . Instead, allow the Kubernetes control plane to
manage the .spec.replicas field automatically.
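As a sketch of what that looks like, an autoscaling/v2 HorizontalPodAutoscaler can target the web StatefulSet from the example above (the CPU target and replica bounds here are illustrative); the StatefulSet manifest itself then omits .spec.replicas :
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50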
What's next
Learn about Pods.
Find out how to use StatefulSets
Follow an example of deploying a stateful application.
Follow an example of deploying Cassandra with Stateful Sets.
Follow an example of running a replicated stateful application.
Learn how to scale a StatefulSet.
Learn what's involved when you delete a StatefulSet.
Learn how to configure a Pod to use a volume for storage.
Learn how to configure a Pod to use a PersistentVolume for storage.
StatefulSet is a top-level resource in the Kubernetes REST API. Read the StatefulSet
object definition to understand the API for stateful sets.
Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
2.4 - DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the
cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are
garbage collected. Deleting a DaemonSet will clean up the Pods it created.
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon.
A more complex setup might use multiple DaemonSets for a single type of daemon, but with
different flags and/or different memory and cpu requests for different hardware types.
controllers/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# these tolerations are to have the daemonset runnable on control plane nodes
# remove them if your control plane nodes should not run pods
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion , kind , and metadata
fields. For general information about working with config files, see running stateless
applications and object management using kubectl.
Pod Template
The .spec.template is one of the required fields in .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is
nested and does not have an apiVersion or kind .
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify
appropriate labels (see pod selector).
Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.selector of a
Job.
You must specify a pod selector that matches the labels of the .spec.template . Also, once a
DaemonSet is created, its .spec.selector can not be mutated. Mutating the pod selector
can lead to the unintentional orphaning of Pods, and it was found to be confusing to users.
The user can specify a different scheduler for the Pods of the DaemonSet, by setting the
.spec.template.spec.schedulerName field of the DaemonSet.
DaemonSet Pods are scheduled by the default Kubernetes scheduler. The DaemonSet controller
adds a node affinity term like the following to each DaemonSet Pod, so that the scheduler binds
the Pod to its intended node:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- target-host-name
Taints and tolerations
The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods. For example,
the toleration for node.kubernetes.io/network-unavailable (effect NoSchedule ) is only added for
DaemonSet Pods that request host networking, i.e., Pods having spec.hostNetwork: true . Such
DaemonSet Pods can be scheduled onto nodes with an unavailable network.
You can add your own tolerations to the Pods of a DaemonSet as well, by defining these in the
Pod template of the DaemonSet.
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
Push: Pods in the DaemonSet are configured to send updates to another service, such
as a stats database. They do not have clients.
NodeIP and Known Port: Pods in the DaemonSet can use a hostPort , so that the pods
are reachable via the node IPs. Clients know the list of node IPs somehow, and know the
port by convention.
DNS: Create a headless service with the same pod selector, and then discover
DaemonSets using the endpoints resource or retrieve multiple A records from DNS.
Service: Create a service with the same Pod selector, and use the service to reach a
daemon on a random node. (No way to reach specific node.)
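As an illustration of the Service pattern, here is a minimal sketch of a Service that reuses the fluentd-elasticsearch DaemonSet's Pod labels (the port number is illustrative; add clusterIP: None to turn it into a headless Service for the DNS pattern instead):
apiVersion: v1
kind: Service
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
spec:
  selector:
    name: fluentd-elasticsearch   # same labels as the DaemonSet's Pod template
  ports:
  - name: metrics
    port: 24231                   # illustrative port exposed by the daemon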
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes
and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to
be updated. Also, the DaemonSet controller will use the original template the next time a
node (even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=orphan with kubectl , then the Pods
will be left on the nodes. If you subsequently create a new DaemonSet with the same selector,
the new DaemonSet adopts the existing Pods. If any Pods need replacing, the DaemonSet
replaces them according to its updateStrategy .
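The update strategy is set under .spec.updateStrategy . A minimal sketch in the abbreviated-manifest style used above (the maxUnavailable value is illustrative; RollingUpdate is the default and OnDelete is the alternative):
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  updateStrategy:
    type: RollingUpdate      # the default; the alternative is OnDelete
    rollingUpdate:
      maxUnavailable: 1      # illustrative: replace at most one node's Pod at a time
  ...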
Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using
init , upstartd , or systemd ). This is perfectly fine. However, there are several advantages
to running such processes via a DaemonSet:
Ability to monitor and manage logs for daemons in the same way as applications.
Same config language and tools (e.g. Pod templates, kubectl ) for daemons and
applications.
Running daemons in containers with resource limits increases isolation between
daemons from app containers. However, this can also be accomplished by running the
daemons in a container but not in a Pod.
Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a
DaemonSet replaces Pods that are deleted or terminated for any reason, such as in the case
of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason,
you should use a DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These
are called static pods. Unlike DaemonSet, static Pods cannot be managed with kubectl or
other Kubernetes API clients. Static Pods do not depend on the apiserver, making them useful
in cluster bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have
processes which are not expected to terminate (e.g. web servers, storage servers).
Use a Deployment for stateless services, like frontends, where scaling up and down the
number of replicas and rolling out updates are more important than controlling exactly which
host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod always run
on all or certain hosts, if the DaemonSet provides node-level functionality that allows other
Pods to run correctly on that particular node.
For example, network plugins often include a component that runs as a DaemonSet. The
DaemonSet component makes sure that the node where it's running has working cluster
networking.
What's next
Learn about Pods.
Learn about static Pods, which are useful for running Kubernetes control plane
components.
Find out how to use DaemonSets
Perform a rolling update on a DaemonSet
Perform a rollback on a DaemonSet (for example, if a roll out didn't work how you
expected).
Understand how Kubernetes assigns Pods to Nodes.
Learn about device plugins and add ons, which often run as DaemonSets.
DaemonSet is a top-level resource in the Kubernetes REST API. Read the DaemonSet
object definition to understand the API for daemon sets.
2.5 - Jobs
A Job creates one or more Pods and will continue to retry execution of the Pods until a
specified number of them successfully terminate. As pods successfully complete, the Job
tracks the successful completions. When a specified number of successful completions is
reached, the task (i.e., the Job) is complete. Deleting a Job will clean up the Pods it created.
Suspending a Job will delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The
Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node
hardware failure or a node reboot).
If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.
Running an example Job
Here is an example Job config. It computes π to 2000 places and prints it out. It takes
around 10s to complete.
controllers/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
spec:
containers:
- name: pi
image: perl:5.34.0
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
backoffLimit: 4
You can run the example with kubectl apply -f controllers/job.yaml ; the output is similar to:
job.batch/pi created
Check on the status of the Job with kubectl describe job pi ; the output is similar to:
Name: pi
Namespace: default
Selector: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae
Labels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae
batch.kubernetes.io/job-name=pi
...
Annotations: batch.kubernetes.io/job-tracking: ""
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7
batch.kubernetes.io/job-name=pi
Containers:
pi:
Image: perl:5.34.0
Port: <none>
Host Port: <none>
Command:
perl
-Mbignum=bpi
-wle
print bpi(2000)
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 21s job-controller Created pod: pi-xf9p4
Normal Completed 18s job-controller Job completed
To list all the Pods that belong to a Job in a machine readable form, you can use a command
like this:
pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
The output is similar to:
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --output=jsonpath option
specifies an expression with the name from each Pod in the returned list.
View the standard output of one of the Pods with kubectl logs $pods ; the output is similar to:
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089
When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of
the basis for naming those Pods. The name of a Job must be a valid DNS subdomain value,
but this can produce unexpected results for the Pod hostnames. For best compatibility, the
name should follow the more restrictive rules for a DNS label. Even when the name is a DNS
subdomain, the name must be no longer than 63 characters.
Job Labels
Job labels will have batch.kubernetes.io/ prefix for job-name and controller-uid .
Pod Template
The .spec.template is the only required field of the .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is
nested and does not have an apiVersion or kind .
In addition to required fields for a Pod, a pod template in a Job must specify appropriate
labels (see pod selector) and an appropriate restart policy. Only a RestartPolicy equal to
Never or OnFailure is allowed.
Pod selector
The .spec.selector field is optional. In almost all cases you should not specify it. See section
specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable for running as a Job:
1. Non-parallel Jobs
normally, only one Pod is started, unless the Pod fails.
the Job is complete as soon as its Pod terminates successfully.
2. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions .
the Job represents the overall task, and is complete when there are
.spec.completions successful Pods.
3. Parallel Jobs with a work queue:
leave .spec.completions unset; the Pods must coordinate amongst themselves or with an
external service to determine what each should work on.
once any Pod has exited with success, no other Pod should still be doing any work
for this task or writing any output. They should all be in the process of exiting.
For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset.
When both are unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions to the number of
completions needed. You can set .spec.parallelism , or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism
to a non-negative integer.
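Putting the fixed completion count case into a manifest, here is a minimal sketch (the name, image and command are illustrative) of a Job that runs at most two Pods at a time until five have succeeded:
apiVersion: batch/v1
kind: Job
metadata:
  name: fixed-count-example    # illustrative name
spec:
  completions: 5       # the Job is done after 5 Pods succeed
  parallelism: 2       # run at most 2 Pods at any one time
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo processing one work item && sleep 5"]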
For more information about how to make use of the different types of job, see the job
patterns section.
Controlling parallelism
The requested parallelism ( .spec.parallelism ) can be set to any non-negative value. If it is
unspecified, it defaults to 1. If it is specified as 0, then the Job is effectively paused until it is
increased.
Actual parallelism (number of pods running at any instant) may be more or less than
requested parallelism, for a variety of reasons:
For fixed completion count Jobs, the actual number of pods running in parallel will not
exceed the number of remaining completions. Higher values of .spec.parallelism are
effectively ignored.
For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining
Pods are allowed to complete, however.
If the Job Controller has not had time to react.
If the Job controller failed to create Pods for any reason (lack of ResourceQuota , lack of
permission, etc.), then there may be fewer pods than requested.
The Job controller may throttle new Pod creation due to excessive previous pod failures
in the same Job.
When a Pod is gracefully shut down, it takes time to stop.
Completion mode
FEATURE STATE: Kubernetes v1.24 [stable]
Jobs with fixed completion count - that is, jobs that have non null .spec.completions - can
have a completion mode that is specified in .spec.completionMode :
NonIndexed (default): the Job is considered complete when there have been
.spec.completions successfully completed Pods. In other words, each Pod completion
is homologous to each other. Note that Jobs that have null .spec.completions are
implicitly NonIndexed .
Indexed : the Pods of a Job get an associated completion index from 0 to
.spec.completions-1 . The index is available in the Pod annotation
batch.kubernetes.io/job-completion-index , in the environment variable
JOB_COMPLETION_INDEX , and as part of the Pod hostname ( $(job-name)-$(index) ). The Job is
considered complete when there is one successfully completed Pod for each index.
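A minimal sketch of an Indexed Job (the name and command are illustrative); each Pod reads its own index from the JOB_COMPLETION_INDEX environment variable:
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-example        # illustrative name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed      # each Pod gets an index from 0 to 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo processing item $JOB_COMPLETION_INDEX"]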
Note: Although rare, more than one Pod could be started for the same index (due to
various reasons such as node failures, kubelet restarts, or Pod evictions). In this case, only
the first Pod that completes successfully will count towards the completion count and
update the status of the Job. The other Pods that are running or completed for the same
index will be deleted by the Job controller once they are detected.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited
with a non-zero exit code, or the container was killed for exceeding a memory limit. If this
happens, and .spec.template.spec.restartPolicy = "OnFailure" , then the Pod stays on the
node, but the container is re-run.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the
node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the
.spec.template.spec.restartPolicy = "Never" . When a Pod fails, then the Job controller
starts a new Pod. This means that your application needs to handle the case when it is
restarted in a new pod. In particular, it needs to handle temporary files, locks, incomplete
output and the like caused by previous runs.
By default, each pod failure is counted towards the .spec.backoffLimit limit, see pod
backoff failure policy. However, you can customize handling of pod failures by setting the
Job's pod failure policy.
If you do specify .spec.parallelism and .spec.completions both greater than 1, then there
may be multiple pods running at once. Therefore, your pods must also be tolerant of
concurrency.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as
an immediate failure, even if that Pod later terminates with phase: "Succeeded" .
Pod backoff failure policy
There are situations where you want to fail a Job after some amount of retries due to a logical
error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries
before considering a Job as failed. The back-off limit is set by default to 6.
The number of retries is calculated in two ways: the number of Pods with .status.phase =
"Failed" , and, when using restartPolicy = "OnFailure" , the number of retries in all the
containers of Pods with .status.phase equal to Pending or Running .
If either of the calculations reaches the .spec.backoffLimit , the Job is considered failed.
Note: If your job has restartPolicy = "OnFailure", keep in mind that your Pod running
the Job will be terminated once the job backoff limit has been reached. This can make
debugging the Job's executable more difficult. We suggest setting restartPolicy =
"Never" when debugging the Job or using a logging system to ensure output from failed
Jobs is not lost inadvertently.
Note: You can only configure a Pod failure policy for a Job if you have the
JobPodFailurePolicy feature gate enabled in your cluster. Additionally, it is
recommended to enable the PodDisruptionConditions feature gate in order to be able to
detect and handle Pod disruption conditions in the Pod failure policy (see also: Pod
disruption conditions). Both feature gates are available in Kubernetes 1.27.
A Pod failure policy, defined with the .spec.podFailurePolicy field, enables your cluster to
handle Pod failures based on the container exit codes and the Pod conditions.
In some situations, you may want to have a better control when handling Pod failures than
the control provided by the Pod backoff failure policy, which is based on the Job's
.spec.backoffLimit . These are some examples of use cases:
To optimize costs of running workloads by avoiding unnecessary Pod restarts, you can
terminate a Job as soon as one of its Pods fails with an exit code indicating a software
bug.
To guarantee that your Job finishes even if there are disruptions, you can ignore Pod
failures caused by disruptions (such as preemption, API-initiated eviction or taint-based
eviction) so that they don't count towards the .spec.backoffLimit limit of retries.
You can configure a Pod failure policy, in the .spec.podFailurePolicy field, to meet the
above use cases. This policy can handle Pod failures based on the container exit codes and
the Pod conditions.
controllers/job-pod-failure-policy-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: job-pod-failure-policy-example
spec:
completions: 12
parallelism: 3
template:
spec:
restartPolicy: Never
containers:
- name: main
image: docker.io/library/bash:5
command: ["bash"] # example command simulating a bug which triggers the failing of the job
args:
- -c
- echo "Hello world!" && sleep 5 && exit 42
backoffLimit: 6
podFailurePolicy:
rules:
- action: FailJob
onExitCodes:
containerName: main # optional
operator: In # one of: In, NotIn
values: [42]
- action: Ignore # one of: Ignore, FailJob, Count
onPodConditions:
- type: DisruptionTarget # indicates Pod disruption
In the example above, the first rule of the Pod failure policy specifies that the Job should be
marked failed if the main container fails with the 42 exit code. The following are the rules for
the main container specifically:
an exit code of 0 means that the container succeeded
an exit code of 42 means that the entire Job failed
any other exit code represents that the container failed, and hence the entire Pod. The
Pod will be re-created if the total number of restarts is below backoffLimit . If the
backoffLimit is reached, the entire Job failed.
Note: Because the Pod template specifies a restartPolicy: Never, the kubelet does not
restart the main container in that particular Pod.
The second rule of the Pod failure policy, specifying the Ignore action for failed Pods with
condition DisruptionTarget excludes Pod disruptions from being counted towards the
.spec.backoffLimit limit of retries.
Note: If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and
the Job is running multiple Pods, Kubernetes terminates all the Pods in that Job that are
still Pending or Running.
Here are some requirements and semantics of the API:
if you want to use a .spec.podFailurePolicy field for a Job, you must also define that
Job's pod template with .spec.restartPolicy set to Never .
the Pod failure policy rules you specify under spec.podFailurePolicy.rules are
evaluated in order. Once a rule matches a Pod failure, the remaining rules are ignored.
When no rule matches the Pod failure, the default handling applies.
you may want to restrict a rule to a specific container by specifying its name
in spec.podFailurePolicy.rules[*].containerName . When not specified the rule applies
to all containers. When specified, it should match one of the container or initContainer
names in the Pod template.
you may specify the action taken when a Pod failure policy is matched by
spec.podFailurePolicy.rules[*].action . Possible values are:
FailJob : use to indicate that the Pod's job should be marked as Failed and all
running Pods should be terminated.
Ignore : use to indicate that the counter towards the .spec.backoffLimit should
not be incremented and a replacement Pod should be created.
Count : use to indicate that the Pod should be handled in the default way. The
counter towards the .spec.backoffLimit should be incremented.
Note: When you use a podFailurePolicy, the job controller only matches Pods in the
Failed phase. Pods with a deletion timestamp that are not in a terminal phase (Failed or
Succeeded) are considered still terminating. This implies that terminating pods retain a
tracking finalizer until they reach a terminal phase. Since Kubernetes 1.27, Kubelet
transitions deleted pods to a terminal phase (see: Pod Phase). This ensures that deleted
pods have their finalizers removed by the Job controller.
Another way to terminate a Job is by setting an active deadline. Do this by setting the
.spec.activeDeadlineSeconds field of the Job to a number of seconds. The
activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are
created. Once a Job reaches activeDeadlineSeconds , all of its running Pods are terminated
and the Job status will become type: Failed with reason: DeadlineExceeded .
Example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-timeout
spec:
backoffLimit: 5
activeDeadlineSeconds: 100
template:
spec:
containers:
- name: pi
image: perl:5.34.0
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have an
activeDeadlineSeconds field. Ensure that you set this field at the proper level.
Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no
automatic Job restart once the Job status is type: Failed . That is, the Job termination
mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit result in
a permanent Job failure that requires manual intervention to resolve.
Another way to clean up finished Jobs (either Complete or Failed ) automatically is to use a
TTL mechanism provided by a TTL controller for finished resources, by specifying the
.spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its
dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its
lifecycle guarantees, such as finalizers, will be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-ttl
spec:
ttlSecondsAfterFinished: 100
template:
spec:
containers:
- name: pi
image: perl:5.34.0
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
The Job pi-with-ttl will be eligible to be automatically deleted, 100 seconds after it
finishes.
If the field is set to 0 , the Job will be eligible to be automatically deleted immediately after it
finishes. If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.
Note:
It is recommended to set the ttlSecondsAfterFinished field because unmanaged jobs (Jobs
that you created directly, and not indirectly through other workload APIs such as CronJob)
have a default deletion policy of orphanDependents causing Pods created by an
unmanaged Job to be left around after that Job is fully deleted. Even though the
control plane eventually garbage collects the Pods from a deleted Job after they either fail
or complete, sometimes those lingering pods may cause cluster performance
degradation or in the worst case cause the cluster to go offline due to this degradation.
You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources
that a particular namespace can consume.
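For instance, a sketch of a ResourceQuota that caps how many Jobs and Pods a namespace may hold at once (the namespace name and the limits are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
  namespace: batch-jobs        # illustrative namespace
spec:
  hard:
    count/jobs.batch: "20"     # at most 20 Job objects in this namespace
    pods: "50"                 # at most 50 non-terminal Pods at any one time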
Job patterns
The Job object can be used to support reliable parallel execution of Pods. The Job object is not
designed to support closely-communicating parallel processes, as commonly found in
scientific computing. It does support parallel processing of a set of independent but related
work items. These might be emails to be sent, frames to be rendered, files to be transcoded,
ranges of keys in a NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just
considering one set of work items that the user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and
weaknesses. The tradeoffs are:
One Job object for each work item, vs. a single Job object for all work items. The latter is
better for large numbers of work items. The former creates some overhead for the user
and for the system to manage large numbers of Job objects.
Number of pods created equals number of work items, vs. each Pod can process
multiple work items. The former typically requires less modification to existing code and
containers. The latter is better for large numbers of work items, for similar reasons to
the previous bullet.
Several approaches use a work queue. This requires running a queue service, and
modifications to the existing program or container to make it use the work queue. Other
approaches are easier to adapt to an existing containerised application.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above
tradeoffs. The pattern names are also links to examples and more detailed description.
When you specify completions with .spec.completions , each Pod created by the Job
controller has an identical spec . This means that all pods for a task will have the same
command line and the same image, the same volumes, and (almost) the same environment
variables. These patterns are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism and .spec.completions for
each of the patterns. Here, W is the number of work items.
Pattern | .spec.completions | .spec.parallelism
Queue with Pod Per Work Item | W | any
Queue with Variable Pod Count | null | any
Indexed Job with Static Work Assignment | W | any
Job with Pod-to-Pod Communication | W | W
Job Template Expansion | 1 | should be 1
Advanced usage
Suspending a Job
FEATURE STATE: Kubernetes v1.24 [stable]
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the
Job's requirements and will continue to do so until the Job is complete. However, you may
want to temporarily suspend a Job's execution and resume it later, or start Jobs in suspended
state and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you
want to resume it again, update it to false. Creating a Job with .spec.suspend set to true will
create it in the suspended state.
When a Job is resumed from suspension, its .status.startTime field will be reset to the
current time. This means that the .spec.activeDeadlineSeconds timer will be stopped and
reset when a Job is suspended and resumed.
When you suspend a Job, any running Pods that don't have a status of Completed will be
terminated with a SIGTERM signal. The Pod's graceful termination period will be honored and
your Pod must handle this signal in this period. This may involve saving progress for later or
undoing changes. Pods terminated this way will not count towards the Job's completions
count.
An example Job definition in the suspended state can be like so:
apiVersion: batch/v1
kind: Job
metadata:
name: myjob
spec:
suspend: true
parallelism: 1
completions: 5
template:
spec:
...
You can also toggle Job suspension by patching the Job using the command line. Suspend an
active Job:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":true}}'
Resume a suspended Job:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'
The Job's status can be used to determine if a Job is suspended or has been suspended in the
past:
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
conditions:
- lastProbeTime: "2021-02-05T13:14:33Z"
lastTransitionTime: "2021-02-05T13:14:33Z"
status: "True"
type: Suspended
startTime: "2021-02-05T13:13:48Z"
The Job condition of type "Suspended" with status "True" means the Job is suspended; the
lastTransitionTime field can be used to determine how long the Job has been suspended
for. If the status of that condition is "False", then the Job was previously suspended and is now
running. If such a condition does not exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
Name: myjob
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl
Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl
Normal Suspended 11m job-controller Job suspended
Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44
Normal Resumed 3s job-controller Job resumed
The last four events, particularly the "Suspended" and "Resumed" events, are directly a result
of toggling the .spec.suspend field. In the time between these two events, we see that no
Pods were created, but Pod creation restarted as soon as the Job was resumed.
Mutable Scheduling Directives
In most cases a parallel job will want the pods to run with constraints, like all in the same
zone, or all either on GPU model x or y but not a mix of both.
The suspend field is the first step towards achieving those semantics. Suspend allows a
custom queue controller to decide when a job should start; However, once a job is
unsuspended, a custom queue controller has no influence on where the pods of a job will
actually land.
This feature allows updating a Job's scheduling directives before it starts, which gives custom
queue controllers the ability to influence pod placement while at the same time offloading
actual pod-to-node assignment to kube-scheduler. This is allowed only for suspended Jobs
that have never been unsuspended before.
The fields in a Job's pod template that can be updated are node affinity, node selector,
tolerations, labels, annotations and scheduling gates.
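As a sketch, a custom queue controller could create a Job suspended and then patch one of those fields, for example the node selector, before unsuspending it (the label value and names below are illustrative):
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-job              # illustrative name
spec:
  suspend: true                 # created suspended; a queue controller flips this later
  parallelism: 2
  completions: 2
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # may still be updated while suspended
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo working"]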
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector . The system
defaulting logic adds this field when the Job is created. It picks a selector value that will not
overlap with any other Jobs.
However, in some cases, you might need to override this automatically set selector. To do
this, you can specify the .spec.selector of the Job.
Be very careful when doing this. If you specify a label selector which is not unique to the pods
of that Job, and which matches unrelated Pods, then pods of the unrelated job may be
deleted, or this Job may count other Pods as completing it, or one or both Jobs may refuse to
create Pods or run to completion. If a non-unique selector is chosen, then other controllers
(e.g. ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes
will not stop you from making a mistake when specifying .spec.selector .
Here is an example of a case when you might want to use this feature.
Say Job old is already running. You want existing Pods to keep running, but you want the
rest of the Pods it creates to use a different pod template and for the Job to have a new name.
You cannot update the Job because these fields are not updatable. Therefore, you delete Job
old but leave its pods running, using kubectl delete jobs/old --cascade=orphan . Before
deleting it, you make a note of what selector it uses:
kind: Job
metadata:
name: old
...
spec:
selector:
matchLabels:
batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
Then you create a new Job with name new and you explicitly specify the same selector. Since
the existing Pods have label batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-
42010af00002 , they are controlled by Job new as well.
You need to specify manualSelector: true in the new Job since you are not using the
selector that the system normally generates for you automatically.
kind: Job
metadata:
name: new
...
spec:
manualSelector: true
selector:
matchLabels:
batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002 .
Setting manualSelector: true tells the system that you know what you are doing and to
allow this mismatch.
Job tracking with finalizers
Note: The control plane doesn't track Jobs using finalizers, if the Jobs were created when
the feature gate JobTrackingWithFinalizers was disabled, even after you upgrade the
control plane to 1.26.
The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is
removed from the API server. To do that, the Job controller creates Pods with the finalizer
batch.kubernetes.io/job-tracking . The controller removes the finalizer only after the Pod
has been accounted for in the Job status, allowing the Pod to be removed by other controllers
or users.
Jobs created before upgrading to Kubernetes 1.26 or before the feature gate
JobTrackingWithFinalizers is enabled are tracked without the use of Pod finalizers. The Job
controller updates the status counters for succeeded and failed Pods based only on the
Pods that exist in the cluster. The control plane can lose track of the progress of the Job if Pods
are deleted from the cluster.
You can determine if the control plane is tracking a Job using Pod finalizers by checking if the
Job has the annotation batch.kubernetes.io/job-tracking . You should not manually add or
remove this annotation from Jobs. Instead, you can recreate the Jobs to ensure they are
tracked using Pod finalizers.
Elastic Indexed Jobs
You can scale Indexed Jobs up or down by mutating both .spec.parallelism and
.spec.completions together such that .spec.parallelism == .spec.completions . When the
ElasticIndexedJob feature gate on the API server is disabled, .spec.completions is
immutable.
Use cases for elastic Indexed Jobs include batch workloads which require scaling an indexed
Job, such as MPI, Horovod, Ray, and PyTorch training jobs.
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be
restarted. However, a Job will create new Pods to replace terminated ones. For this reason, we
recommend that you use a Job rather than a bare Pod, even if your application requires only a
single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods
which are not expected to terminate (e.g. web servers), and a Job manages Pods that are
expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to
OnFailure or Never . (Note: If RestartPolicy is not set, the default value is Always .)
Single Job starts controller Pod
Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a
sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat
complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn
starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a
Job object, but maintains complete control over what Pods are created and how work is
assigned to them.
What's next
Learn about Pods.
Read about different ways of running Jobs:
Coarse Parallel Processing Using a Work Queue
Fine Parallel Processing Using a Work Queue
Use an indexed Job for parallel processing with static work assignment
Create multiple Jobs based on a template: Parallel Processing using Expansions
Follow the links within Clean up finished jobs automatically to learn more about how
your cluster can clean up completed and / or failed tasks.
Job is part of the Kubernetes REST API. Read the Job object definition to understand the
API for jobs.
Read about CronJob, which you can use to define a series of Jobs that will run based on a
schedule, similar to the UNIX tool cron .
Practice how to configure handling of retriable and non-retriable pod failures using
podFailurePolicy , based on the step-by-step examples.
2.6 - Automatic Cleanup for Finished Jobs
When your Job has finished, it's useful to keep that Job in the API (and not immediately delete
the Job) so that you can tell whether the Job succeeded or failed.
Kubernetes' TTL-after-finished controller provides a TTL (time to live) mechanism to limit the
lifetime of Job objects that have finished execution.
The TTL-after-finished controller assumes that a Job is eligible to be cleaned up TTL seconds
after the Job has finished. The timer starts once the status condition of the Job changes to
show that the Job is either Complete or Failed ; once the TTL has expired, that Job becomes
eligible for cascading removal. When the TTL-after-finished controller cleans up a job, it will
delete it cascadingly, that is to say it will delete its dependent objects together with it.
Kubernetes honors object lifecycle guarantees on the Job, such as waiting for finalizers.
You can set the TTL seconds at any time. Here are some examples for setting the
.spec.ttlSecondsAfterFinished field of a Job:
Specify this field in the Job manifest, so that a Job can be cleaned up automatically some
time after it finishes.
Manually set this field of existing, already finished Jobs, so that they become eligible for
cleanup.
Use a mutating admission webhook to set this field dynamically at Job creation time.
Cluster administrators can use this to enforce a TTL policy for finished jobs.
Use a mutating admission webhook to set this field dynamically after the Job has
finished, and choose different TTL values based on job status, labels. For this case, the
webhook needs to detect changes to the .status of the Job and only set a TTL when
the Job is being marked as completed.
Write your own controller to manage the cleanup TTL for Jobs that match a particular
selector.
Caveats
Updating TTL for finished Jobs
You can modify the TTL period, that is, the .spec.ttlSecondsAfterFinished field of a Job, after the
job is created or has finished. If you extend the TTL period after the existing
ttlSecondsAfterFinished period has expired, Kubernetes doesn't guarantee to retain that
Job, even if an update to extend the TTL returns a successful API response.
Time skew
Because the TTL-after-finished controller uses timestamps stored in the Kubernetes jobs to
determine whether the TTL has expired or not, this feature is sensitive to time skew in your
cluster, which may cause the control plane to clean up Job objects at the wrong time.
Clocks aren't always correct, but the difference should be very small. Please be aware of this
risk when setting a non-zero TTL.
What's next
Read Clean up Jobs automatically
Refer to the Kubernetes Enhancement Proposal (KEP) for adding this mechanism.
2.7 - CronJob
FEATURE STATE: Kubernetes v1.21 [stable]
CronJob is meant for performing regular scheduled actions such as backups, report
generation, and so on. One CronJob object is like one line of a crontab (cron table) file on a
Unix system. It runs a job periodically on a given schedule, written in Cron format.
CronJobs have limitations and idiosyncrasies. For example, in certain circumstances, a single
CronJob can create multiple concurrent Jobs. See the limitations below.
When the control plane creates new Jobs and (indirectly) Pods for a CronJob, the
.metadata.name of the CronJob is part of the basis for naming those Pods. The name of a
CronJob must be a valid DNS subdomain value, but this can produce unexpected results for
the Pod hostnames. For best compatibility, the name should follow the more restrictive rules
for a DNS label. Even when the name is a DNS subdomain, the name must be no longer than
52 characters. This is because the CronJob controller will automatically append 11 characters
to the name you provide and there is a constraint that the length of a Job name is no more
than 63 characters.
Example
This example CronJob manifest prints the current time and a hello message every minute:
application/job/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: hello
spec:
schedule: "* * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox:1.28
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in more detail).
Cron schedule syntax
The .spec.schedule field is required. The value of that field follows the Cron syntax:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
# │ │ │ │ │
# * * * * *
For example, 0 0 13 * 5 states that the task must be started every Friday at midnight, as
well as on the 13th of each month at midnight.
The format also includes extended "Vixie cron" step values. As explained in the FreeBSD
manual:
Step values can be used in conjunction with ranges. Following a range with /<number>
specifies skips of the number's value through the range. For example, 0-23/2 can be used
in the hours field to specify command execution every other hour (the alternative in the V7
standard is 0,2,4,6,8,10,12,14,16,18,20,22 ). Steps are also permitted after an asterisk,
so if you want to say "every two hours", just use */2 .
Note: A question mark (?) in the schedule has the same meaning as an asterisk *, that is,
it stands for any available value in a given field.
Other than the standard syntax, some macros like @monthly can also be used:
Entry | Description | Equivalent to
@yearly (or @annually) | Run once a year at midnight of 1 January | 0 0 1 1 *
@monthly | Run once a month at midnight of the first day of the month | 0 0 1 * *
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0
@daily (or @midnight) | Run once a day at midnight | 0 0 * * *
@hourly | Run once an hour at the beginning of the hour | 0 * * * *
To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
Job template
The .spec.jobTemplate defines a template for the Jobs that the CronJob creates, and it is
required. It has exactly the same schema as a Job, except that it is nested and does not have
an apiVersion or kind . You can specify common metadata for the templated Jobs, such as
labels or annotations. For information about writing a Job .spec , see Writing a Job Spec.
Deadline for delayed Job start
The .spec.startingDeadlineSeconds field is optional. If you set it, it defines a deadline (in
whole seconds) for starting the Job, if that Job misses its scheduled time for any reason.
After missing the deadline, the CronJob skips that instance of the Job (future occurrences are
still scheduled). For example, if you have a backup job that runs twice a day, you might allow it
to start up to 8 hours late, but no later, because a backup taken any later wouldn't be useful:
you would instead prefer to wait for the next scheduled run.
For Jobs that miss their configured deadline, Kubernetes treats them as failed Jobs. If you
don't specify startingDeadlineSeconds for a CronJob, the Job occurrences have no deadline.
If the .spec.startingDeadlineSeconds field is set (not null), the CronJob controller measures
the time between when a job is expected to be created and now. If the difference is higher
than that limit, it will skip this execution.
For example, if it is set to 200 , it allows a job to be created for up to 200 seconds after the
actual schedule.
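For example, a minimal fragment matching the twice-a-day backup scenario above (the values are illustrative, not prescriptive):
spec:
  schedule: "0 */12 * * *"          # twice a day
  startingDeadlineSeconds: 28800    # tolerate up to 8 hours (28800 s) of delay, then skip that run
  jobTemplate:
    # ...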
Concurrency policy
The .spec.concurrencyPolicy field is also optional. It specifies how to treat concurrent
executions of a Job that is created by this CronJob. The spec may specify only one of the
following concurrency policies:
Allow (default): the CronJob allows concurrently running Jobs.
Forbid: the CronJob does not allow concurrent runs; if it is time for a new Job run and the previous Job run hasn't finished yet, the CronJob skips the new Job run.
Replace: if it is time for a new Job run and the previous Job run hasn't finished yet, the CronJob replaces the currently running Job run with a new Job run.
Note that concurrency policy only applies to the Jobs created by the same CronJob. If there
are multiple CronJobs, their respective Jobs are always allowed to run concurrently.
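A sketch of how this looks in a manifest (Forbid is chosen here purely as an example):
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid   # skip a new run while the previous Job is still running
  jobTemplate:
    # ...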
Schedule suspension
You can suspend execution of Jobs for a CronJob, by setting the optional .spec.suspend field
to true. The field defaults to false.
This setting does not affect Jobs that the CronJob has already started.
If you do set that field to true, all subsequent executions are suspended (they remain
scheduled, but the CronJob controller does not start the Jobs to run the tasks) until you
unsuspend the CronJob.
Caution: Executions that are suspended during their scheduled time count as missed
jobs. When .spec.suspend changes from true to false on an existing CronJob without a
starting deadline, the missed jobs are scheduled immediately.
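One convenient way to toggle suspension (shown as an illustration, assuming the hello CronJob from the earlier example) is kubectl patch:
kubectl patch cronjob hello -p '{"spec":{"suspend":true}}'    # pause: keep the schedule, but start no new Jobs
kubectl patch cronjob hello -p '{"spec":{"suspend":false}}'   # resume; missed runs may start immediately (see the caution above)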
For another way to clean up jobs automatically, see Clean up finished jobs automatically.
Time zones
FEATURE STATE: Kubernetes v1.27 [stable]
For CronJobs with no time zone specified, the kube-controller-manager interprets schedules
relative to its local time zone.
You can specify a time zone for a CronJob by setting .spec.timeZone to the name of a valid
time zone. For example, setting .spec.timeZone: "Etc/UTC" instructs Kubernetes to
interpret the schedule relative to Coordinated Universal Time.
A time zone database from the Go standard library is included in the binaries and used as a
fallback in case an external database is not available on the system.
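For example, a fragment pinning the schedule to UTC (times shown are illustrative):
spec:
  schedule: "30 6 * * *"   # 06:30 every day...
  timeZone: "Etc/UTC"      # ...interpreted as UTC rather than the kube-controller-manager's local zone
  jobTemplate:
    # ...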
CronJob limitations
Unsupported TimeZone specification
The implementation of the CronJob API in Kubernetes 1.27 lets you set the .spec.schedule
field to include a timezone; for example: CRON_TZ=UTC * * * * * or TZ=UTC * * * * * .
Specifying a timezone that way is not officially supported (and never has been).
If you try to set a schedule that includes TZ or CRON_TZ timezone specification, Kubernetes
reports a warning to the client. Future versions of Kubernetes will prevent setting the
unofficial timezone mechanism entirely.
Modifying a CronJob
By design, a CronJob contains a template for new Jobs. If you modify an existing CronJob, the
changes you make will apply to new Jobs that start to run after your modification is complete.
Jobs (and their Pods) that have already started continue to run without changes. That is, the
CronJob does not update existing Jobs, even if those remain running.
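For instance (an illustrative command, again assuming the hello CronJob from the example above):
kubectl edit cronjob hello   # edits apply only to Jobs created after the change; running Jobs are untouched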
Job creation
A CronJob creates a Job object approximately once per execution time of its schedule. The
scheduling is approximate because there are certain circumstances where two Jobs might be
created, or no Job might be created. Kubernetes tries to avoid those situations, but does not
completely prevent them. Therefore, the Jobs that you define should be idempotent.
For every CronJob, the CronJob Controller checks how many schedules it missed in the
duration from its last scheduled time until now. If there are more than 100 missed schedules,
then it does not start the job and logs the error:
Cannot determine if job needs to be started. Too many missed start time (> 100).
It is important to note that if the startingDeadlineSeconds field is set (not nil ), the
controller counts how many missed jobs occurred from the value of
startingDeadlineSeconds until now rather than from the last scheduled time until now. For
example, if startingDeadlineSeconds is 200 , the controller counts how many missed jobs
occurred in the last 200 seconds.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For
example, if concurrencyPolicy is set to Forbid and a CronJob was attempted to be
scheduled when there was a previous schedule still running, then it would count as missed.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at
08:30:00, and its startingDeadlineSeconds field is not set. If the CronJob controller
happens to be down from 08:29:00 to 10:21:00, the Job does not start, because more than 100
schedules were missed during that window.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one
minute beginning at 08:30:00, and its startingDeadlineSeconds is set to 200 seconds. If the
CronJob controller happens to be down for the same period as in the previous example
(08:29:00 to 10:21:00), the Job will still start at 10:22:00. This happens because the controller
now checks how many schedules were missed in the last 200 seconds (that is, 3 missed
schedules), rather than from the last scheduled time until now.
The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn
is responsible for the management of the Pods it represents.
What's next
Learn about Pods and Jobs, two concepts that CronJobs rely upon.
Read about the detailed format of CronJob .spec.schedule fields.
For instructions on creating and working with CronJobs, and for an example of a CronJob
manifest, see Running automated tasks with CronJobs.
CronJob is part of the Kubernetes REST API. Read the CronJob API reference for more
details.
2.8 - ReplicationController
Note: A Deployment that configures a ReplicaSet is now the recommended way to set up
replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one
time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of
pods is always up and available.
A simple case is to create one ReplicationController object to reliably run one instance of a
Pod indefinitely. A more complex use case is to run several identical replicas of a replicated
service, such as web servers.
controllers/replication.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
Run the example by downloading the example file and then running this command:
kubectl apply -f https://k8s.io/examples/controllers/replication.yaml
The output is similar to this:
replicationcontroller/nginx created
Check on the status of the ReplicationController using this command:
kubectl describe replicationcontrollers/nginx
The output is similar to this:
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath
--------- -------- ----- ---- -------------
20s 20s 1 {replication-controller }
20s 20s 1 {replication-controller }
20s 20s 1 {replication-controller }
Here, three pods are created, but none is running yet, perhaps because the image is being
pulled. A little later, the same command may show:
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
To list all the pods that belong to the ReplicationController in a machine readable form, you
can use a command like this:
pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods
Here, the selector is the same as the selector for the ReplicationController (seen in the
kubectl describe output), and in a different form in replication.yaml . The --
output=jsonpath option specifies an expression with the name from each pod in the returned
list.
When the control plane creates new Pods for a ReplicationController, the .metadata.name of
the ReplicationController is part of the basis for naming those Pods. The name of a
ReplicationController must be a valid DNS subdomain value, but this can produce unexpected
results for the Pod hostnames. For best compatibility, the name should follow the more
restrictive rules for a DNS label.
For general information about working with configuration files, see object management.
Pod Template
The .spec.template is the only required field of the .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is
nested and does not have an apiVersion or kind .
In addition to required fields for a Pod, a pod template in a ReplicationController must specify
appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with
other controllers. See pod selector.
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
For local container restarts, ReplicationControllers delegate to an agent on the node, for
example the Kubelet.
Pod Selector
The .spec.selector field is a label selector. A ReplicationController manages all the pods
with labels that match the selector. It does not distinguish between pods that it created or
deleted and pods that another person or process created or deleted. This allows the
ReplicationController to be replaced without affecting the running pods.
Also, you should not normally create any pods whose labels match this selector, either
directly, with another ReplicationController, or with another controller such as Job. If you do
so, the ReplicationController thinks that it created the other pods. Kubernetes does not stop
you from doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to
manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas to the
number of pods you would like to have running concurrently. The number running at any
time may be higher or lower, such as if the replicas were just increased or decreased, or if a
pod is gracefully shut down and a replacement starts early.
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all of its pods, use kubectl delete; kubectl scales the
ReplicationController to zero and waits for it to delete each pod before deleting the
ReplicationController itself.
When using the REST API or client library, you need to do the steps explicitly (scale replicas to
0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods: with kubectl, specify
the --cascade=orphan option to kubectl delete.
When using the REST API or client library, you can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long
as the old and new .spec.selector are the same, then the new one will adopt the old pods.
However, it will not make any effort to make existing pods match a new, different pod
template. To update pods to a new spec in a controlled way, use a rolling update.
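As an illustration, the kubectl equivalents of the two deletion modes for the nginx example look like this:
kubectl delete replicationcontroller nginx                    # delete the ReplicationController and its pods
kubectl delete replicationcontroller nginx --cascade=orphan   # delete only the ReplicationController; its pods keep running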
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually
or by an auto-scaling control agent, by updating the replicas field.
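For example, a one-line sketch (assuming the nginx controller above):
kubectl scale replicationcontroller nginx --replicas=5   # sets .spec.replicas to 5; the controller converges on 5 pods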
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing
pods one-by-one.
The recommended approach is to create a new ReplicationController with 1 replica, scale the
new controller up and the old one down one pod at a time, and then delete the old controller
after it reaches 0 replicas. This predictably updates the set of pods regardless of unexpected
failures.
Ideally, the rolling update controller would take application readiness into account, and would
ensure that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating
label, such as the image tag of the primary container of the pod, since it is typically image
updates that motivate rolling updates.
For instance, a service might target all pods with tier in (frontend), environment in
(prod) . Now say you have 10 replicated pods that make up this tier. But you want to be able
to 'canary' a new version of this component. You could set up a ReplicationController with
replicas set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod,
track=stable , and another ReplicationController with replicas set to 1 for the canary, with
labels tier=frontend, environment=prod, track=canary . Now the service is covering both
the canary and non-canary pods. But you can mess with the ReplicationControllers separately
to test things out, monitor the results, etc.
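A compressed sketch of the stable controller (the name, image, and replica count are illustrative; the canary controller would differ only in its name, replicas: 1, and track: canary in both the selector and the template labels):
apiVersion: v1
kind: ReplicationController
metadata:
  name: frontend-stable               # hypothetical name
spec:
  replicas: 9
  selector:
    tier: frontend
    environment: prod
    track: stable
  template:
    metadata:
      labels:
        tier: frontend
        environment: prod
        track: stable
    spec:
      containers:
      - name: frontend
        image: example.com/frontend:stable   # hypothetical image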
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived
as services. Services may be composed of pods controlled by multiple ReplicationControllers,
and it is expected that many ReplicationControllers may be created and destroyed over the
lifetime of a service (for instance, to perform an update of pods that run the service). Both
services themselves and their clients should remain oblivious to the ReplicationControllers
that maintain the pods of the services.
The ReplicationController is forever constrained to this narrow responsibility. It itself will not
perform readiness nor liveness probes. Rather than performing auto-scaling, it is intended to
be controlled by an external auto-scaler (as discussed in #492), which would change its
replicas field. We will not add scheduling policies (for example, spreading) to the
ReplicationController. Nor should it verify that the pods controlled match the currently
specified template, as that would obstruct auto-sizing and other automated processes.
Similarly, completion deadlines, ordering dependencies, configuration expansion, and other
features belong elsewhere. We even plan to factor out the mechanism for bulk pod creation
(#170).
API Object
Replication controller is a top-level resource in the Kubernetes REST API. More details about
the API object can be found at: ReplicationController API object.
Alternatives to ReplicationController
ReplicaSet
ReplicaSet is the next-generation ReplicationController that supports the new set-based
label selector. It's mainly used by Deployment as a mechanism to orchestrate pod creation,
deletion and updates. Note that we recommend using Deployments instead of directly using
ReplicaSets, unless you require custom update orchestration or don't require updates at all.
Deployment (Recommended)
Deployment is a higher-level API object that updates its underlying ReplicaSets and their
Pods. Deployments are recommended if you want the rolling update functionality, because
they are declarative, server-side, and have additional features.
Bare Pods
Unlike in the case where a user directly created pods, a ReplicationController replaces pods
that are deleted or terminated for any reason, such as in the case of node failure or disruptive
node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a
ReplicationController even if your application requires only a single pod. Think of it similarly to
a process supervisor, only it supervises multiple pods across multiple nodes instead of
individual processes on a single node. A ReplicationController delegates local container
restarts to some agent on the node, such as the kubelet.
Job
Use a Job instead of a ReplicationController for pods that are expected to terminate on their
own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicationController for pods that provide a machine-level
function, such as machine monitoring or machine logging. These pods have a lifetime that is
tied to a machine lifetime: the pod needs to be running on the machine before other pods
start, and are safe to terminate when the machine is otherwise ready to be
rebooted/shutdown.
What's next
Learn about Pods.
Learn about Deployment, the replacement for ReplicationController.
ReplicationController is part of the Kubernetes REST API. Read the
ReplicationController object definition to understand the API for replication controllers.