k8s Primer
Master
The Master is the controlling element of the cluster; some people call it the “brain” of the cluster. It is the
only endpoint that is open to the users of the cluster. For fault tolerance, a cluster may have multiple
masters.
Master has 4 parts:
1. API server: This is the front end that communicates with the user. It is a REST-based API that is
designed to consume JSON input. By default, it serves on port 443.
2. Scheduler: The Scheduler watches the API server for new Pod requests and assigns those Pods to
Nodes, taking the available resources and any constraints into account.
3. Cluster store: The cluster store is persistent storage holding the cluster state and configuration
details. It uses etcd (an open-source distributed key-value store) to store this data.
4. Controller: Includes Node controller, Endpoint Controller, Namespace Controller, etc.
Nodes (Slaves/Minions)
Nodes are the workers: they do all the “work” assigned to the cluster. Inside a Node,
there are 3 main components, apart from the “Pods” (I will talk about Pods later on). Those 3 parts are:
1. Kubelet: Kubelets do a lot of work inside a Node. They register the node with the cluster, watch for
work assignments from the scheduler, instantiate new Pods, report back to the master, etc.
2. Container Engine: The container engine is responsible for managing containers. It does all the image
pulling, container starting, stopping, etc. The most widely used container engine is Docker, but you can
also use rkt (Rocket).
3. Kube Proxy: kube-proxy handles Service networking on the Node. Each Pod gets its own IP address,
and kube-proxy maintains the rules that route Service traffic to those Pods, which is how Services
load-balance across them.
Apart from these components, Nodes run some default Pods of their own for logging, health
checking, DNS, etc. Each node also exposes a set of read-only endpoints, usually on localhost:10255:
● /spec
● /healthz
● /pods
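For example, from a node you can query these endpoints directly (assuming the kubelet's read-only port 10255 is enabled; it is disabled by default on many newer clusters):
curl http://localhost:10255/healthz
curl http://localhost:10255/pods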
Essential Components of Kubernetes
There are a few main components of the Kubernetes cluster architecture that anyone should know before
starting to work with Kubernetes. The first one is the Pod:
Pods
A Pod is a ring-fenced environment with its own network stack and kernel namespaces. It has
containers inside; no Pod can exist without a container. There can be single-container Pods or
multi-container Pods depending on the application we deploy.
For example, if you have a tightly coupled application with an API and a logging component, you can use
one container for the API and another for the logging component, and deploy both of them in the same
Pod. However, the industry best practice is to go with single-container Pods.
Another small thing to note about Pods is that they are “mortal”. Confused? Let me explain. A Pod’s
life-cycle has 3 stages: Pending → Running → Succeeded (or Failed).
This is similar to born → living → dead. There is no resurrection and no re-birth. If a Pod dies without
completing its task, a new Pod is created to replace the dead one. The most important thing is that this
new Pod’s IP and all its other attributes will be different from the dead Pod’s.
Deployment Controller
To manage Pods, Kubernetes provides numerous controllers. The controller used for deployments and
declarative updates is known as the Deployment controller.
In the Deployment object (the most common format is a YAML file, but in this tutorial I use the command
line) we describe our “desired state”: which image to deploy, which ports to expose, how many replicas
to run, which labels to add, etc. The Deployment Controller checks this desired state continuously and
makes changes in the cluster to ensure the desired state is achieved.
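For example, the same desired state can be declared entirely from the command line (the Deployment name and image here are illustrative):
kubectl create deployment hello --image=nginx
kubectl scale deployment hello --replicas=3
kubectl get deployments
The Deployment controller then works in the background to keep three nginx Pods running.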
Service
Another component I am going to use in this tutorial is the “Service”. Before explaining what a Service is,
I will describe why we need one.
As I mentioned earlier, Pods are mortal. When a pod dies, a new one is born to take its place. It doesn’t
have the same IP address as the dead one.
So think of a scenario where we have a system with both a frontend service and a backend service. For
the frontend to call the backend, we need an IP or URL. Let's assume we hard-coded the Pod IP of the
backend inside the frontend code. We face three issues:
1. We need to first deploy our backend and take its IP, then include it in the frontend code
before building the Docker image. This order must be followed.
2. What if we want to scale our backend? We need to update the frontend again with the
new Pod IPs.
3. If the backend Pod dies, a new Pod is created, so we need to change the frontend code
with the new Pod IP, rebuild the Docker image, and swap the image in the frontend. This
becomes even more problematic if the backend has several Pods.
That is too much complicated work. This is why we need a “Service”.
How Kubernetes Service works
A Service has its own stable IP address and DNS name, so the frontend is successfully decoupled from
the backend. A Service is therefore a stable, high-level abstraction in front of multiple Pods.
For the discovery of Pods, a Service uses “labels”: Pods belong to a Service via labels. When initializing
the Service, we describe which labels it should look for via the selector field. If the Service finds a Pod
with all the labels mentioned in the selector, it adds the Pod to its endpoint list. (A Pod may carry extra
labels beyond those mentioned, but it must not be missing any label listed in the selector.)
When a request comes to the Service, it uses a method such as round-robin or random selection to pick
the Pod to forward the request to.
Using a Service object gives us many advantages, like forwarding requests only to healthy Pods, load
balancing, and switching between versions by changing the selector. But the most important advantage
of a Service is the successful decoupling of system components.
There are four types of Services available in Kubernetes, which we can choose according to our purpose:
(Source: Kubernetes.io, 2019)
1. ClusterIP: Exposes the service on a cluster-internal IP. Choosing this value makes the service
only reachable from within the cluster. This is the default ServiceType.
2. NodePort: Exposes the service on each Node’s IP at a static port (the NodePort). A ClusterIP
service, to which the NodePort service will route, is automatically created. You’ll be able to
contact the NodePort service, from outside the cluster, by requesting <NodeIP>:<NodePort>.
3. LoadBalancer: Exposes the service externally using a cloud provider’s load balancer. NodePort
and ClusterIP services, to which the external load balancer will route, are automatically created.
4. ExternalName: Maps the service to the contents of the externalName field (e.g.
foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set
up. This requires version 1.7 or higher of kube-dns
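As a hedged sketch, a minimal ClusterIP Service that selects Pods labelled app: backend and forwards port 80 to the Pods' port 8080 might look like this (the name, label, and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  type: ClusterIP
  selector:
    app: backend
  ports:
  - port: 80
    targetPort: 8080
Frontend Pods can then reach the backend at the stable DNS name backend (or backend.<namespace>.svc.cluster.local), regardless of which backend Pods are currently alive.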
Minikube
Setup
Common
OPTIONAL
In this particular case Docker was using the cgroupfs cgroup driver, which I changed to systemd.
Create the file as follows:
[root@k8smaster ~]# vim /etc/docker/daemon.json
{ "exec-opts": ["native.cgroupdriver=systemd"] }
[root@k8smaster ~]# systemctl restart docker
[root@k8smaster ~]# systemctl status docker
Master Node
Dashboard
kubectl apply -f
https://raw.githubusercontent.com/kubernetes/dashboard/v2.5.0/aio/deploy/recommended.yaml
kubectl get services --all-namespaces
kubectl -n kubernetes-dashboard edit service kubernetes-dashboard
kubectl proxy --address='0.0.0.0' --disable-filter=true &
kubectl get services --all-namespaces
http://192.168.1.244:8001/api/v1/namespaces/kubernetes-dashboard/services/http:kubernetes-dashboard:/proxy/#/workloads?namespace=default
The easiest way to access the Kubernetes API when running minikube is to use
kubectl proxy --port=8080
Kubectl
Port Forwarding
Kubectl port-forward allows you to access and interact with internal Kubernetes cluster processes from
your localhost. You can use this method to investigate issues and adjust your services locally without the
need to expose them beforehand.
Even though Kubernetes is a highly automated orchestration system, the port forwarding process
requires direct and recurrent user input. A connection terminates once the pod instance fails, and it’s
necessary to establish a new forwarding by entering the same command manually.
1. The port-forward command specifies the cluster resource name and defines the port number to
port-forward to.
2. As a result, the Kubernetes API server establishes a single HTTP connection between your
localhost and the resource running on your cluster.
3. The user is now able to engage that specific pod directly, either to diagnose an issue or debug if
necessary.
Port forwarding is a work-intensive method. However, in some cases, it is the only way to access internal
cluster resources.
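For example, assuming a Pod named my-app-pod listening on port 8080, and a Service named my-app in front of it, either of the following forwards local port 9090 into the cluster:
kubectl port-forward pod/my-app-pod 9090:8080
kubectl port-forward svc/my-app 9090:8080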
K8S API
In Kubernetes, an indexer is a client-side component (part of the client-go cache machinery) that provides
efficient indexing and querying capabilities for resources fetched from the API server. It is responsible for
maintaining an index of the watched objects based on specified fields.
The indexer is used by various Kubernetes controllers, clients, and other components to quickly retrieve
and filter objects based on specific criteria. It allows for efficient searching and retrieval of resources
without the need to iterate through all objects.
When an object is created or updated in the Kubernetes API server, the indexer updates its index
accordingly. This allows for fast lookup and retrieval of objects based on various fields, such as labels,
annotations, or custom-defined fields.
The indexer plays a crucial role in enabling efficient and performant operations within the Kubernetes
ecosystem, especially when dealing with large-scale deployments and managing numerous resources. It
enhances the overall responsiveness and scalability of the Kubernetes API server by providing optimized
access to resources based on different search criteria.
Cache
In Kubernetes, the cache object is a client-side cache mechanism that is utilized by various components
to store and retrieve information from the Kubernetes API server. It acts as an in-memory cache of the
Kubernetes API resources, allowing for faster access and reducing the need for repeated API calls.
The cache object is typically used by controllers, clients, and other Kubernetes components that require
frequent access to resource information. It helps improve the performance and efficiency of operations
by reducing the network latency and API server load.
When using the cache object, the client fetches the desired resources from the API server and stores
them in the cache. Subsequent requests for the same resources can then be served directly from the
cache, eliminating the need for additional API calls. The cache object automatically manages the
synchronization and refreshing of the stored resource information to ensure its accuracy and consistency.
Additionally, the cache object provides functionalities such as indexing and event handling. It can index
resources based on specific fields, allowing for efficient lookup and filtering operations. It also receives
and processes events from the API server, keeping the cached resources up to date with any changes
happening in the cluster.
By utilizing the cache object, Kubernetes components can optimize their interactions with the API server,
improve performance, and reduce the overall load on the cluster.
Informers
In Kubernetes, the Informer object is a client-side caching and event handling mechanism provided by
the client-go library. It enables efficient tracking and retrieval of resource changes from the Kubernetes
API server.
The Informer acts as a controller that watches and synchronizes a specific set of resources with the API
server. It maintains a local cache of the watched resources and keeps it up to date by handling incoming
events. These events can include creations, updates, deletions, or other changes to the watched
resources.
By using the Informer, client applications can avoid making frequent direct API calls to the server and
instead rely on the cached data. This improves performance and reduces unnecessary network traffic
and API server load.
In addition to caching, the Informer provides a convenient way to handle events related to the watched
resources. It allows developers to define event handlers that get executed when specific types of events
occur. This enables applications to react to changes in real-time and take appropriate actions, such as
updating local state, triggering additional processes, or sending notifications.
The Informer can be configured with various options, such as the resource type, namespace, and update
frequency, to tailor its behavior according to the application's needs.
Overall, the Informer object simplifies resource synchronization and event handling in Kubernetes client
applications, improving efficiency and responsiveness while reducing the reliance on direct API calls.
ClientSet
In Kubernetes, the clientset is a client library provided by client-go, which is the official Go client for
interacting with the Kubernetes API server. The clientset serves as a high-level interface that simplifies
the process of interacting with various Kubernetes resources and performing operations on them.
The clientset is generated from the Kubernetes API specification and provides a set of typed client
objects for each resource type in the API. These client objects offer methods and functions that abstract
away the complexities of directly interacting with the API server, making it easier for developers to
perform CRUD (Create, Read, Update, Delete) operations on Kubernetes resources.
With the clientset, developers can easily create, retrieve, update, and delete resources such as pods,
services, deployments, namespaces, and more. It handles the low-level details of authenticating with
the API server, constructing API requests, and processing responses.
The clientset also supports various optional configuration parameters that allow customization, such as
specifying the API server URL, authentication credentials, timeouts, and transport settings.
By utilizing the clientset, developers can write Kubernetes applications in Go and interact with the
Kubernetes API server in a more intuitive and efficient manner. It provides a convenient and idiomatic
way to work with Kubernetes resources, reducing the amount of boilerplate code needed and improving
productivity.
Controllers
Kubernetes controllers are components that track at least one Kubernetes resource type. Each resource
object has a spec field that represents the desired state. The controller(s) for that resource are
responsible for ensuring that the current state is as close as possible to the desired state. There are
many different types of controllers in Kubernetes, each with its own specific purpose.
What is the Controller Manager?
The Controller Manager is a key component of the Kubernetes control plane. It is responsible for running
various controllers that watch the state of Kubernetes resources and reconcile the actual state with the
desired state.
The Controller Manager runs as a set of processes on the Kubernetes master node. Examples of
controllers that ship with Kubernetes today are the replication controller, endpoints controller,
namespace controller, and serviceaccounts controller. Each of these controllers has a specific set of
responsibilities, such as managing the number of replicas for a given deployment or ensuring that the
endpoints for a service are up to date.
● Controller: Tracks at least one Kubernetes resource type and is responsible for making the
current state come closer to the desired state.
● Controller Manager: A Kubernetes control plane component that runs multiple controllers
and ensures that they are functioning correctly.
In simpler terms, a controller is responsible for managing a specific resource’s desired state, while the
controller manager manages multiple controllers and ensures they are working as intended.
As a beginner, it can be overwhelming to learn about all the controllers available in Kubernetes. But not
all controllers are created equal. In this post, we will identify the top 10 Kubernetes controllers that will
provide a focused learning plan to master them.
1. ReplicaSet Controller
The ReplicaSet controller is responsible for ensuring that the specified number of replicas of a pod is
running at all times. It is a successor to the deprecated Replication Controller and is widely used in
Kubernetes deployments.
2. Deployment Controller
The Deployment controller is used to manage the deployment of pods in a declarative way. It can scale
up or down the number of replicas, do a rolling update, and revert to an earlier version if necessary.
3. StatefulSet Controller
The StatefulSet controller is used to manage stateful applications that require unique network identifiers
and stable storage. It ensures that pods are deployed in a predictable order and that each pod has a
unique hostname.
4. DaemonSet Controller
The DaemonSet controller ensures that a copy of a pod runs on each node in the cluster. It is commonly
used for tasks such as logging and monitoring.
5. Job Controller
The Job controller manages batch tasks in the Kubernetes cluster. It ensures that the job is completed
successfully and terminates when the specified number of completions is reached.
6. CronJob Controller
The CronJob controller is used to create jobs that run on a schedule. It is commonly used for tasks such
as backups and cleanup.
7. Namespace Controller
The Namespace controller is used to create and manage namespaces in the Kubernetes cluster.
Namespaces are a way to divide cluster resources between multiple users.
8. ServiceAccount Controller
The ServiceAccount controller is used to manage service accounts in the Kubernetes cluster. Service
accounts are used to provide an identity to pods and control access to resources.
9. Service Controller
The Service controller is used to manage Kubernetes services. It ensures that requests are routed to the
appropriate pods based on labels and selectors.
10. Ingress Controller
The Ingress controller is used to manage the Ingress resources in the Kubernetes cluster. It allows
external traffic to access the services in the cluster.
A custom controller built with client-go typically relies on the following components:
1. Reflector: A reflector watches the Kubernetes API for the specified resource type (kind). This could be
a built-in resource or a custom resource. When it receives a notification about the existence of a new
resource instance through the watch API, it gets the newly created object using the corresponding list
API. It then puts the object in a Delta FIFO queue.
2. Informer: An informer pops objects from the Delta FIFO queue. Its job is to save objects for later
retrieval and to invoke the controller code, passing it the object.
3. Indexer: An indexer provides indexing functionality over objects. A typical indexing use-case is to
create an index based on object labels. Indexers can maintain indexes based on several indexing
functions. The indexer uses a thread-safe data store to store objects and their keys. There is a default
function that generates an object’s key as the <namespace>/<name> combination for that object.
You can try out our Postgres custom resource to see how these components fit together in real code.
This custom resource has been developed following the sample-controller available in Kubernetes.
Ref
1.: Generating ClientSet/Informers/Lister and CRD for Custom Resources | Writing K8S Operator - …
2. Writing a Kubernetes custom controller (ekspose) from scratch to expose your deployment | Par…
Operators
Pods
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared storage and
network resources, and a specification for how to run the containers. A Pod's contents are always
co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical
host": it contains one or more application containers which are relatively tightly coupled. In non-cloud
contexts, applications executed on the same physical or virtual machine are analogous to cloud
applications executed on the same logical host.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
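Assuming this manifest is saved as nginx-pod.yaml, the Pod can be created and inspected with:
kubectl apply -f nginx-pod.yaml
kubectl get pod nginx -o wide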
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods
are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or
indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains
on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of
resources, or the node fails.
You can use workload resources to create and manage multiple Pods for you. A controller for the
resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a
Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement
Pod. The scheduler places the replacement Pod onto a healthy Node.
Here are some examples of workload resources that manage one or more Pods:
● Deployment
● StatefulSet
● DaemonSet
Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle, starting in the Pending phase,
moving through Running if at least one of its primary containers starts OK, and then through either the
Succeeded or Failed phases depending on whether any container in the Pod terminated in failure.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled (assigned) to a Node, the Pod
runs on that Node until it stops or is terminated.
Like individual application containers, Pods are considered to be relatively ephemeral (rather than
durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they
remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to
that node are scheduled for deletion after a timeout period.
Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted;
likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes
uses a higher-level abstraction, called a controller, that handles the work of managing the relatively
disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead, that Pod can be
replaced by a new, near-identical Pod, with even the same name if desired, but with a different UID.
When something is said to have the same lifetime as a Pod, such as a volume, that means that the thing
exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted for any reason, and
even if an identical replacement is created, the related thing (a volume, in this example) is also destroyed
and created anew.
A Pod's phase can take the following values:
● Pending: The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has
not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well
as the time spent downloading container images over the network.
● Running: The Pod has been bound to a node, and all of the containers have been created. At least one
container is still running, or is in the process of starting or restarting.
● Succeeded: All containers in the Pod have terminated in success, and will not be restarted.
● Failed: All containers in the Pod have terminated, and at least one container has terminated in failure.
That is, the container either exited with non-zero status or was terminated by the system.
● Unknown: For some reason the state of the Pod could not be obtained. This phase typically occurs due
to an error in communicating with the node where the Pod should be running.
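The phase is reported in the Pod's status and is what the STATUS column of kubectl get pods is largely derived from; it can also be read directly, for example:
kubectl get pod nginx -o jsonpath='{.status.phase}'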
Pod Conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or has not
passed:
● status: Indicates whether that condition is applicable, with possible values "True", "False", or
"Unknown".
● lastTransitionTime: Timestamp for when the Pod last transitioned from one status to another.
● message: Human-readable message indicating details about the last status transition.
Pod Readiness
Your application can inject extra feedback into PodStatus through readiness gates. Readiness gates are
determined by the current state of status.conditions fields for the Pod. If Kubernetes cannot find such a
condition in the status.conditions field of a Pod, the status of the condition is defaulted to "False".
Here is an example:
kind: Pod
...
spec:
  readinessGates:
  - conditionType: "www.example.com/feature-1"
status:
  conditions:
  - type: Ready # a built in PodCondition
    status: "False"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  - type: "www.example.com/feature-1" # an extra PodCondition
    status: "False"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
  - containerID: docker://abcd...
    ready: true
...
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both of the following
statements apply:
● All containers in the Pod are ready.
● All conditions specified in readinessGates are "True".
Container Probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the
kubelet calls a handler implemented by the container. There are three types of handlers:
● ExecAction: Executes a specified command inside the container. The diagnostic is considered
successful if the command exits with a status code of 0.
● TCPSocketAction: Performs a TCP check against the Pod's IP address on a specified port. The
diagnostic is considered successful if the port is open.
● HTTPGetAction: Performs an HTTP GET request against the Pod's IP address on a specified port
and path. The diagnostic is considered successful if the response has a status code greater than
or equal to 200 and less than 400.
The kubelet can optionally perform and react to three kinds of probes on running containers:
● livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet
kills the container, and the container is subjected to its restart policy. If a Container does not
provide a liveness probe, the default state is Success.
● readinessProbe: Indicates whether the container is ready to respond to requests. If the readiness
probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all
Services that match the Pod. The default state of readiness before the initial delay is Failure. If a
Container does not provide a readiness probe, the default state is Success.
● startupProbe: Indicates whether the application within the container is started. All other probes
are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet
kills the container, and the container is subjected to its restart policy. If a Container does not
provide a startup probe, the default state is Success.
For more information about how to set up a liveness, readiness, or startup probe, see Configure
Liveness, Readiness and Startup Probes.
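As a hedged sketch, here is what all three probe types might look like on a single container (the paths, port, and timings are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: probes-demo
spec:
  containers:
  - name: app
    image: my-app:1.0        # illustrative image
    ports:
    - containerPort: 8080
    startupProbe:            # gives a slow-starting app up to 30 x 10s to come up
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:           # restart the container if this fails
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:          # remove the Pod from Service endpoints if this fails
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5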
If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and
specify a restartPolicy of Always or OnFailure.
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In
this case, the readiness probe might be the same as the liveness probe, but the existence of the
readiness probe in the spec means that the Pod will start without receiving any traffic and only start
receiving traffic after the probe starts succeeding.
If you want your container to be able to take itself down for maintenance, you can specify a readiness
probe that checks an endpoint specific to readiness that is different from the liveness probe.
If your app has a strict dependency on back-end services, you can implement both a liveness and a
readiness probe. The liveness probe passes when the app itself is healthy, but the readiness probe
additionally checks that each required back-end service is available. This helps you avoid directing traffic
to Pods that can only respond with error messages.
If your container needs to work on loading large data, configuration files, or migrations during startup,
you can use a startup probe. However, if you want to detect the difference between an app that has
failed and an app that is still processing its startup data, you might prefer a readiness probe.
Note: If you want to be able to drain requests when the Pod is deleted, you do not necessarily need a
readiness probe; on deletion, the Pod automatically puts itself into an unready state regardless of
whether the readiness probe exists. The Pod remains in the unready state while it waits for the
containers in the Pod to stop.
Startup probes are useful for Pods that have containers that take a long time to come into service.
Rather than set a long liveness interval, you can configure a separate configuration for probing the
container as it starts up, allowing a time longer than the liveness interval would allow.
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to allow those
processes to gracefully terminate when they are no longer needed (rather than being abruptly stopped
with a KILL signal and having no chance to clean up).
Typically, the container runtime sends a TERM signal to the main process in each container. Many
container runtimes respect the STOPSIGNAL value defined in the container image and send this instead
of TERM. Once the grace period has expired, the KILL signal is sent to any remaining processes, and the
Pod is then deleted from the API Server.
An example flow:
1. You use the kubectl tool to manually delete a specific Pod, with the default grace period (30
seconds).
2. The Pod in the API server is updated with the time beyond which the Pod is considered "dead"
along with the grace period. If you use “kubectl describe” to check on the Pod you're deleting,
that Pod shows up as "Terminating". On the node where the Pod is running: as soon as the
kubelet sees that a Pod has been marked as terminating (a graceful shutdown duration has been
set), the kubelet begins the local Pod shutdown process.
1. If one of the Pod's containers has defined a preStop hook, the kubelet runs that hook
inside of the container. If the preStop hook is still running after the grace period expires,
the kubelet requests a small, one-off grace period extension of 2 seconds.
Note: If the preStop hook needs longer to complete than the default grace period
allows, you must modify terminationGracePeriodSeconds to suit this (see the sketch after this list).
2. The kubelet triggers the container runtime to send a TERM signal to process 1 inside
each container.
Note: The containers in the Pod receive the TERM signal at different times and in an
arbitrary order. If the order of shutdowns matters, consider using a preStop hook to
synchronize.
3. At the same time as the kubelet is starting graceful shutdown, the control plane removes that
shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these
represent a Service with a configured selector. ReplicaSets and other workload resources no
longer treat the shutting-down Pod as a valid, in-service replica. Pods that shut down slowly
cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from
the list of endpoints as soon as the termination grace period begins.
4. When the grace period expires, the kubelet triggers forcible shutdown. The container runtime
sends SIGKILL to any processes still running in any container in the Pod. The kubelet also cleans
up a hidden pause container if that container runtime uses one.
5. The kubelet triggers forcible removal of Pod objects from the API server, by setting grace period
to 0 (immediate deletion).
6. The API server deletes the Pod's API object, which is then no longer visible from any client.
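A minimal sketch of the per-Pod settings referenced in the flow above (the preStop command is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo
spec:
  terminationGracePeriodSeconds: 60   # default is 30 seconds
  containers:
  - name: app
    image: nginx:1.14.2
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]   # e.g. wait for in-flight requests to drain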
By default, all deletes are graceful within 30 seconds. The kubectl delete command supports the
--grace-period=<seconds> option which allows you to override the default and specify your own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server. If the pod
was still running on a node, that forcible deletion triggers the kubelet to begin immediate cleanup.
Note: You must specify an additional flag --force along with --grace-period=0 in order to perform force
deletions.
When a force deletion is performed, the API server does not wait for confirmation from the kubelet that
the Pod has been terminated on the node it was running on. It removes the Pod in the API immediately
so a new Pod can be created with the same name. On the node, Pods that are set to terminate
immediately will still be given a small grace period before being force killed.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation for
deleting Pods from a StatefulSet.
The control plane cleans up terminated Pods (with a phase of Succeeded or Failed), when the number of
Pods exceeds the configured threshold (determined by terminated-pod-gc-threshold in the
kube-controller-manager). This avoids a resource leak as Pods are created and terminated over time.
Labels
Labels are key/value pairs that are attached to objects, such as Pods. Labels are intended to be used to
specify identifying attributes of objects that are meaningful and relevant to users, but do not directly
imply semantics to the core system. Labels can be used to organize and to select subsets of objects.
Labels can be attached to objects at creation time and subsequently added and modified at any time.
Each object can have a set of key/value labels defined. Each Key must be unique for a given object.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying
information should be recorded using annotations.
Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated
by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending
with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and
alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a
series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/created-by: controller-manager
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
Valid label values must be 63 characters or less (can be empty) and, unless empty, must begin and end
with an alphanumeric character ([a-z0-9A-Z]), with dashes (-), underscores (_), dots (.), and
alphanumerics between.
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry
the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping
primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-based. A label selector can
be made of multiple requirements which are comma-separated. In the case of multiple requirements, all
must be satisfied so the comma separator acts as a logical AND (&&) operator.
The semantics of empty or non-specified selectors are dependent on the context, and API types that use
selectors should document the validity and meaning of them.
Note: For some API types, such as ReplicaSets, the label selectors of two instances must not overlap
within a namespace, or the controller can see that as conflicting instructions and fail to determine how
many replicas should be present.
Caution: For both equality-based and set-based conditions there is no logical OR (||) operator. Ensure
your filter statements are structured accordingly.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and values. Matching objects
must satisfy all of the specified label constraints, though they may have additional labels as well. Three
kinds of operators are admitted =,==,!=. The first two represent equality (and are synonyms), while the
latter represents inequality. For example:
environment = production
tier != frontend
One usage scenario for equality-based label requirement is for Pods to specify node selection criteria.
For example, the sample Pod below selects nodes with the label "accelerator=nvidia-tesla-p100".
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-test
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators
are supported: in, notin and exists (only the key identifier).
For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
● The first example selects all resources with key equal to environment and value equal to
production or qa.
● The second example selects all resources with key equal to tier and values other than frontend
and backend, and all resources with no labels with the tier key.
● The third example selects all resources including a label with key partition; no values are
checked.
● The fourth example selects all resources without a label with a key partition; no values are
checked.
Similarly the comma separator acts as an AND operator. So filtering resources with a partition key (no
matter the value) and with an environment different than qa can be achieved using
partition,environment notin (qa). The set-based label selector is a general form of equality since
environment=production is equivalent to environment in (production); similarly for != and notin.
Set-based requirements can be mixed with equality-based requirements. For example: partition in
(customerA, customerB),environment!=qa.
LIST and WATCH operations may specify label selectors to filter the sets of objects returned
using a query parameter. Both requirements are permitted (presented here as they would
appear in a URL query string):
● Equality-based requirements:
?labelSelector=environment%3Dproduction,tier%3Dfrontend
● Set-based requirements:
?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend
%29
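The same selector syntax is what kubectl's -l (--selector) flag accepts, for example:
kubectl get pods -l environment=production,tier=frontend
kubectl get pods -l 'environment in (production),tier in (frontend)'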
Both label selector styles can also be used when defining resources. Older resources, such as Services
and ReplicationControllers, support only equality-based selectors, for example:
"selector": {
"component" : "redis",
or
selector:
component: redis
selector:
matchLabels:
component: redis
matchExpressions:
- {key: tier, operator: In, values: [cache]}
- {key: environment, operator: NotIn, values: [dev]}
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to
an element of matchExpressions, whose key field is "key", the operator is "In", and the values array
contains only "value". matchExpressions is a list of pod selector requirements. Valid operators include In,
NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of
the requirements, from both matchLabels and matchExpressions are ANDed together -- they must all be
satisfied in order to match.
Replication Controller
Note: A Deployment that configures a ReplicaSet is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time.
In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is
always up and available.
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the
ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a
ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example,
your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this
reason, you should use a ReplicationController even if your application requires only a single pod.
This example ReplicationController config runs three copies of the nginx web server.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
To delete a ReplicationController and all its pods, use kubectl delete. Kubectl will scale the
ReplicationController to zero and wait for it to delete each pod before deleting the ReplicationController
itself.
You can delete a ReplicationController without affecting any of its pods.
Using kubectl, specify the --cascade=orphan option to kubectl delete.
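For example, with a reasonably recent kubectl and the nginx ReplicationController defined above:
kubectl delete rc nginx --cascade=orphan
kubectl get pods -l app=nginx   # the pods are still running, now unmanaged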
When using the REST API or Go client library, you can delete the ReplicationController object. Once the
original is deleted, you can create a new ReplicationController to replace it. As long as the old and new
.spec.selector are the same, then the new one will adopt the old pods.
Pods may be removed from a ReplicationController's target set by changing their labels. This technique
may be used to remove pods from service for debugging and data recovery. Pods that are removed in
this way will be replaced automatically (assuming that the number of replicas is not also changed).
Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a
ReplicationController will ensure that the specified number of pods exists, even in the event of node
failure or pod termination (for example, due to an action by another control agent).
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an
auto-scaling control agent, by updating the replicas field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods
one-by-one.
The recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1)
and old (-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This
predictably updates the set of pods regardless of unexpected failures.
Ideally, the rolling update controller would take application readiness into account, and would ensure
that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label, such as
the image tag of the primary container of the pod, since it is typically image updates that motivate rolling
updates.
For instance, a service might target all pods with tier in (frontend), environment in (prod). Now say you
have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this
component. You could set up a ReplicationController with replicas set to 9 for the bulk of the replicas,
with labels tier=frontend, environment=prod, track=stable, and another ReplicationController with
replicas set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary. Now the
service is covering both the canary and non-canary pods. But you can mess with the
ReplicationControllers separately to test things out, monitor the results, etc.
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as
services. Services may be composed of pods controlled by multiple ReplicationControllers, and it is
expected that many ReplicationControllers may be created and destroyed over the lifetime of a service
(for instance, to perform an update of pods that run the service). Both services themselves and their
clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.
Replica Set
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such,
it is often used to guarantee the availability of a specified number of identical Pods.
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire,
a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying
the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills
its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet
needs to create new Pods, it uses its Pod template. A ReplicaSet is linked to its Pods via the Pods'
metadata.ownerReferences field.
A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a
Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to
Pods along with a lot of other useful features. Therefore, we recommend using Deployments instead of
directly using ReplicaSets, unless you require custom update orchestration or don't require updates at
all.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google_samples/gb-frontend:v3
kubectl get rs
NAME       DESIRED   CURRENT   READY   AGE
frontend   3         3         3       6s
Running kubectl describe rs/frontend shows the ReplicaSet's details (output truncated):
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{},"labels":{"app":"guestbook","tier":"frontend"},"name":"frontend",...
Replicas:     3 current / 3 desired
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  tier=frontend
  Containers:
    php-redis:
      Image:        gcr.io/google_samples/gb-frontend:v3
      Port:         <none>
      Host Port:    <none>
      Environment:  <none>
      Mounts:       <none>
  Volumes:          <none>
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  117s  replicaset-controller  Created pod: frontend-wtsmm
  Normal  SuccessfulCreate  116s  replicaset-controller  Created pod: frontend-b2zdv
  Normal  SuccessfulCreate  116s  replicaset-controller  Created pod: frontend-vcmts
You can also verify that the owner reference of these Pods is set to the frontend ReplicaSet. To do this,
get the YAML of one of the Pods running, for example:
kubectl get pods frontend-b2zdv -o yaml
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove
Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced
automatically (assuming that the number of replicas is not also changed).
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field. The ReplicaSet
controller ensures that a desired number of Pods with a matching label selector are available and
operational.
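For example, you can either edit .spec.replicas in the manifest and re-apply it, or scale imperatively:
kubectl scale rs frontend --replicas=5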
When scaling down, the ReplicaSet controller chooses which Pods to delete by sorting the available
Pods, generally preferring to remove: Pods that are pending (and unschedulable), then Pods with a lower
controller.kubernetes.io/pod-deletion-cost annotation value, then Pods on nodes that run more replicas of
the ReplicaSet, and finally the more recently created Pods.
Note:
● This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
● Users should avoid updating the annotation frequently, such as updating it based on a metric
value, because doing so will generate a significant number of pod updates on the apiserver.
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can be
auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous
example.
controllers/hpa-rs.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should create the defined
HPA that autoscales the target ReplicaSet depending on the CPU usage of the replicated Pods.
Alternatively, you can use the kubectl autoscale command to accomplish the same (and it's easier!)
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50
Replica Set is the next generation of Replication Controller. Replication controller is kinda imperative, but
replica sets try to be as declarative as possible.
ReplicaSet:
1. Supports the new set-based selector, which gives more flexibility. For example, environment in
(production, qa) selects all resources with key equal to environment and value equal to production or qa.
2. The rollout command is used for updating a ReplicaSet. Even though a ReplicaSet can be used
independently, it is best used along with Deployments, which make it declarative.
Replication Controller:
1. Only supports the equality-based selector. For example, environment = production selects all
resources with key equal to environment and value equal to production.
2. The rolling-update command is used for updating a ReplicationController. It replaces the specified
ReplicationController with a new one by updating one Pod at a time to use the new PodTemplate.
Deployments
You describe a desired state in a Deployment, and the Deployment Controller changes the actual
state to the desired state at a controlled rate. You can define Deployments to create new
ReplicaSets, or to remove existing Deployments and adopt all their resources with new
Deployments.
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In this example, a Deployment named nginx-deployment is created (the .metadata.name field). It creates
a ReplicaSet that maintains three replicated Pods (the .spec.replicas field), selected via the app: nginx
label, each running the nginx:1.14.2 image on container port 80.
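Assuming the manifest is saved as nginx-deployment.yaml, the Deployment can be created and its rollout managed from the command line (the new image tag is illustrative):
kubectl apply -f nginx-deployment.yaml
kubectl rollout status deployment/nginx-deployment
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
kubectl rollout undo deployment/nginx-deployment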
Config Map
A ConfigMap is an API object used to store non-confidential data in key-value pairs. Pods can consume
ConfigMaps as environment variables, command-line arguments, or as configuration files in a volume.
A ConfigMap allows you to decouple environment-specific configuration from your container images, so
that your applications are easily portable.
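As a hedged sketch, a small ConfigMap and a Pod that consumes one of its keys as an environment variable might look like this (the names, keys, and values are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_MODE: "production"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    env:
    - name: APP_MODE
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: APP_MODE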
Examples
MongoDB
Standalone Instance
To deploy a standalone MongoDB instance, follow the steps below.
Create a StorageClass
A StorageClass defines how persistent volumes are provisioned for the persistent volume claims that
Pods use. To create a StorageClass:
1. Use a text editor to create a YAML file to store the storage class configuration.
vim StorageClass.yaml
2. Specify your storage class configuration in the file. The example below defines the
mongodb-storageclass:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: mongodb-storageclass
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Create a Persistent Volume
1. Create a YAML file for the persistent volume configuration:
vim PersistentVolume.yaml
2. In the file, allocate storage that belongs to the storage class defined in the previous step. Specify the
node that will be used in pod deployment in the nodeAffinity section. The node is identified using the
label created in Step 1.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: mongodb-storageclass
  local:
    path: /mnt/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: size
          operator: In
          values:
          - large
3. Create another YAML for the configuration of the persistent volume claim:
vim PersistentVolumeClaim.yaml
4. Define the claim named mongodb-pvc and instruct Kubernetes to claim volumes belonging to
mongodb-storageclass.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mongodb-pvc
spec:
  storageClassName: mongodb-storageclass
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
Create a ConfigMap
1. Create a YAML file for the ConfigMap configuration:
vim ConfigMap.yaml
2. Use the file to store information about system paths, users, and roles. The following is an example of
a ConfigMap file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mongodb-configmap
data:
  mongo.conf: |
    storage:
      dbPath: /data/db
  ensure-users.js: |
    const targetDbStr = 'test';
    const rootUser = cat('/etc/k8-test/admin/MONGO_ROOT_USERNAME');
    const rootPass = cat('/etc/k8-test/admin/MONGO_ROOT_PASSWORD');
    const usersStr = cat('/etc/k8-test/MONGO_USERS_LIST');
    // Assumed setup: authenticate as root and open the target database
    // (needed by the createUser call below).
    const adminDb = db.getSiblingDB('admin');
    adminDb.auth(rootUser, rootPass);
    const targetDb = db.getSiblingDB(targetDbStr);
    usersStr
      .trim()
      .split(';')
      .map(s => s.split(':'))
      .forEach(user => {
        const username = user[0];
        const rolesStr = user[1];
        const password = user[2];
        if (!rolesStr || !password) {
          return;
        }
        // Assumed: build the user document from the parsed roles
        const roles = rolesStr.split(',');
        const userDoc = { user: username, pwd: password, roles: roles };
        try {
          targetDb.createUser(userDoc);
        } catch (err) {
          if (!~err.message.toLowerCase().indexOf('duplicate')) {
            throw err;
          }
        }
      });
Create a StatefulSet