A Beginner-Friendly Introduction to Kubernetes
With a hands-on MLFlow deployment example
Introduction
In a nutshell, Kubernetes (K8s) is simply a container orchestration framework. What this
essentially means is that K8s is a system designed to automate the lifecycle of
containerized applications, from predictability and scalability to availability.
If you’re using Kubernetes to set up your data science infrastructure, do check out
Saturn Cloud, a scalable, flexible data science platform which offers compute including
GPUs.
Complex applications are often made up of hundreds or even thousands of
microservices. Scaling these microservices up while ensuring availability is an
extremely painful process if we manage all these different components using
custom-written programs or scripts, hence the demand for a proper way of
managing these components.
Cue Kubernetes.
Benefits of Kubernetes
Kubernetes promises to solve the above problem with the following features:
1. High Availability — this simply means that your application will always be up
and running, whether you have a new update to roll out or have some
unexpected pods crashing.
2. Scalability — your application can easily be scaled up or down to keep
performance high as load changes.
3. Disaster Recovery — this ensures that your application will always have the
latest data and states of your application if something unfortunate happens to
your physical or cloud-based infrastructure.
Example K8s setup with a single master and two slave nodes (Illustrated by Author)
Master Node(s)
As its name suggests, the Master node is the boss of the cluster, deciding the cluster
state and what each worker node does. In order to set up a Master node, 4 processes
are required to run on it:
1. API Server
Main entrypoint for users to interact with the cluster (i.e., cluster gateway); it is
where requests are sent when we use kubectl
2. Scheduler
Decides which node the next pod will be spun up on, but does NOT spin up the
pod itself (kubelet does this)
3. Controller Manager
Detects cluster state changes (e.g., pods dying) and tries to restore the cluster
back to its original state
For example, if a pod unexpectedly dies, the Controller Manager makes a request
to the Scheduler to decide which node to spin up a new pod on to replace the dead
one. Kubelet then spins up the new pod.
4. etcd
Cluster BRAIN!
Application data is NOT stored here, only cluster state data. Remember, the
master node does not do the work; it is the brain of the cluster. Specifically, etcd
stores the cluster state information so that the other processes above know what
is going on in the cluster
Slave/Worker Node(s)
Each worker node has to be installed with 3 node processes in order to allow
Kubernetes to interact with it and to independently spin up pods within each node.
The 3 processes required are:
1. Kubelet
In charge of taking configuration files and spinning up the pod using the
container runtime (see below!) installed on the node
2. Container Runtime
The software that actually runs the containers inside each pod (e.g., Docker or
containerd)
3. Kube-proxy
A network proxy that implements part of the Kubernetes Service concept (details
below)
Sits between nodes and forwards the requests intelligently (either intra-node or
inter-node forwarding)
Components of Kubernetes
Now that we know how K8s works, let's look at some of the most common components of
Kubernetes that we will use to deploy our applications.
1. Pod
Smallest unit of K8s and usually houses an instance of your application
2. Service
Because pods are meant to be ephemeral, Service provides a way to “give” pods
a permanent IP address
With Service, if the pod dies, its IP address will not change upon re-creation
Acts almost as a load balancer that routes traffic to pods while maintaining a
static IP
Like load balancers, a Service can also be internal or external: an external
Service is public facing (public IP), while an internal Service is meant for
internal applications (private IP)
3. Ingress
With Services, we may now have a web application exposed on a certain port,
say 8080 on an IP address, say 10.104.35. In practice, it is impractical to access a
public-facing application on http://10.104.35:8080 .
In essence, Ingress exposes HTTP and HTTPS routes from outside the cluster to
services within the cluster [1].
Ingress can also handle SSL termination (a.k.a. SSL offloading), i.e., HTTPS is
terminated at the Ingress so that traffic to the Service and its Pods is in plaintext
That being said, creating an Ingress resource alone has no effect. An Ingress
controller is also required to satisfy an Ingress.
4. Ingress Controller
Load balances incoming traffic to services in the cluster
Also manages egress traffic for services that require communication with
external services
Ingress contains the rules for routing traffic, deciding which Service the incoming
request should route to within the cluster.
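To make this concrete, here is a minimal sketch of an Ingress manifest that routes a hypothetical host name to the MLFlow Service we create later in this article; it assumes an Ingress controller (e.g., NGINX) is already installed in the cluster:
# ingress.yaml (illustrative sketch only)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
spec:
  rules:
  - host: mlflow.example.com          # hypothetical domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mlflow-tracking-server   # the Service defined later in this article
            port:
              number: 5000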
5. ConfigMap
As its name suggests, it is essentially a configuration file that you want exposed
for users to modify
6. Secret
Also a configuration file, but for sensitive information like passwords
Base64-encoded
7. Volumes
Used for persistent data storage
Volumes can be stored locally on the same node running your pods or remotely
(e.g., cloud storage, NFS)
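As a minimal sketch (not how the MLFlow example below stores its state; that lives in Cloud SQL and GCS instead), persistent storage is typically requested through a PersistentVolumeClaim and then mounted into a pod as a volume. All names here are illustrative:
# pvc.yaml (illustrative sketch only)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi          # ask the cluster for 1 GiB of persistent storage
---
# A pod then mounts the claim as a volume:
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: app
    image: nginx            # any image works for the illustration
    volumeMounts:
    - name: data
      mountPath: /data      # path inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc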
8. Deployment
Used to define a blueprint for pods
Deployments usually have replicas such that when any component of the
application dies, there is always a backup
Let’s Practice!
Because this article focuses on understanding the components of K8s themselves
rather than how to set up a K8s cluster, we will simply use minikube to set up our own
local cluster. After that, we will deploy a simple but realistic application: an
MLFlow server.
If you want to follow along with the source code, I have included them in a GitHub
repo here.
MLflow with remote Tracking Server, backend and artifact stores (Image credits: MLFlow documentation)
For those who are unaware, MLFlow is mainly an experiment tracking tool that
allows Data Scientists to track their experiments by logging data and
model artifacts, with the option of deploying their models using a standardized
package defined by MLFlow. For the purposes of this article, we will deploy the
MLFlow tracking web server with a PostgreSQL backend (hosted on Cloud SQL) and
blob store (on Google Cloud Storage).
Before that, we’ll have to install a few things (skip ahead if you already have these
installed).
Installation
1. Docker
2. K8s command line tool, kubectl . Our best friend — we use this to interact with
our K8s cluster, be it minikube, cloud or a hybrid cluster
3. minikube, which we will use to spin up our local cluster
4. [Optional] Power tools for kubectl , kubens and kubectx . Follow this to install.
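With minikube installed, spinning up the local cluster is typically a single command (the Docker driver here is an assumption on my part; any supported driver works):
# Start a local single-node cluster
minikube start --driver=docker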
Once the cluster is up, you can verify that its various components are created with minikube
status . If you have several K8s cluster contexts, make sure you switch to minikube.
# Check context
kubectx
# If not on minikube, switch context
kubectx minikube
With our local cluster setup, let’s start by setting up external components and then
move on to deploying Kubernetes objects.
1. Pull the MLFlow Docker image
We first need a Docker image of the MLFlow web server that we will be deploying.
Unfortunately, MLFlow does not have an official image that we can use on
DockerHub, so I've created one here for everyone to use. Let's pull the image I've
created from DockerHub.
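Assuming the same image tag referenced in the Deployment manifest later on, the pull looks like this:
docker pull davidcjw/example-mlflow:1.0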
2. Create a PostgreSQL instance on Cloud SQL
This will be used to store metadata for the runs logged onto the MLFlow tracking server.
As mentioned earlier, it is easier to create stateful applications outside of your
Kubernetes cluster.
First of all, create an account and project on GCP if you don't already have one. Then, create the instance with the gcloud CLI:
gcloud sql instances create <your_instance_name> \
  --assign-ip \
  --authorized-networks=<your_ip_address>/32 \
  --database-version=POSTGRES_14 \
  --region=<your_region> \
  --cpu=2 \
  --memory=3840MiB \
  --root-password=<your_password>
To find <your_ip_address> , simply Google “what is my ip”. For <your_region> , you can
specify a region that is close to you. For me, I've specified asia-southeast1 .
NOTE! These configs are intended for this example deployment and are not
suitable for production environments. For production, you would want, at a
minimum, multi-zonal availability connected over a private IP.
3. Create a Google Cloud Storage Bucket
This will be used to store data and model artefacts logged by the user. Create a
bucket on GCP and take note of the URI for later. For myself, I’ve created one at
gs://example-mlflow-artefacts using the following command:
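With the Google Cloud SDK installed, a gsutil invocation along these lines creates the bucket (the region is just the one I mentioned above; any region works):
gsutil mb -l asia-southeast1 gs://example-mlflow-artefacts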
Now, the exciting part: deploying onto our Kubernetes cluster the various
components that are needed. Before that, it's absolutely essential to know a few
things about K8s objects.
Kubernetes resources are created using .yaml files with specific formats (refer to the
Kubernetes documentation [2] for any resource type you’re creating). They are used to
define what containerized applications are running on which port and more
importantly, the policies around how those applications behave.
apiVersion : the version of the Kubernetes API used to create the object (e.g. v1 , apps/v1 )
kind : defines the component type (e.g. Secret, ConfigMap, Pod, etc)
metadata : data that uniquely identifies an object, including name , UID and
namespace (more about this in the future!)
spec : the desired state of the object, i.e. the actual configuration of the component
4a. Let’s start with the ConfigMap as these configurations will be needed when we
deploy our MLFlow application using Deployment (NOTE: Order of resource creation
matters, especially when there are configurations or secrets attached to deployments).
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mlflow-configmap
data:
  DEFAULT_ARTIFACT_ROOT: <your_gs_uri>
  DB_NAME: postgres
  DB_USERNAME: postgres
  DB_HOST: <your_cloud_sql_public_ip>
💡 Pro Tip! Always have a tab of the official K8s documentation open so you can
reference the example .yaml file they have for each K8s component.
4b. Next, let’s create one for Secrets. Note that secrets have to be base64-encoded. It
can simply be done using:
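For example (echo -n avoids encoding a trailing newline):
echo -n '<your_password>' | base64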
The only thing that we have to encode is the password for our PostgreSQL instance,
defined earlier when we created it on Cloud SQL. Let's base64-encode that
and copy the stdout into the .yaml file below.
# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-postgresql-credentials
type: Opaque
data:
  postgresql-password: <your_base64_encoded_password>
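Since order of creation matters, the ConfigMap and Secret should be applied before the Deployment that references them. Assuming the manifests sit in a k8s/ folder, like the deployment file applied later in this article:
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secrets.yaml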
5a. Let’s start with Deployment. To understand deployments, let’s take a step back
and recall that the main difference between Deployment and Pod is that the former
helps to create replicas of the pod that will be deployed. As such, the yaml file for
Deployment consists of the configurations for the Pod, as well as the number of
replicas we want to create.
If we take a look at the yaml file below, we notice metadata and spec appearing
twice in the configuration, the first time at the top of the config file and the second
time below the “template” key. This is because everything defined BELOW the
“template” key is used for the Pod configuration.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking-server
  labels:
    app: mlflow-tracking-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-tracking-server-pods
  template:
    metadata:
      labels:
        app: mlflow-tracking-server-pods
    spec:
      containers:
      - name: mlflow-tracking-server-pod
        image: davidcjw/example-mlflow:1.0
        ports:
        - containerPort: 5000
        resources:
          limits:
            memory: 1Gi
            cpu: "2"
          requests:
            memory: 1Gi
            cpu: "1"
        imagePullPolicy: Always
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mlflow-postgresql-credentials
              key: postgresql-password
        - name: DB_USERNAME
          valueFrom:
            configMapKeyRef:
              name: mlflow-configmap
              key: DB_USERNAME
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: mlflow-configmap
              key: DB_HOST
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              name: mlflow-configmap
              key: DB_NAME
        - name: DEFAULT_ARTIFACT_ROOT
          valueFrom:
            configMapKeyRef:
              name: mlflow-configmap
              key: DEFAULT_ARTIFACT_ROOT
Two important questions to answer: 1) How do the pod replicas group together to be
identified as one by the Deployment? 2) How does the Deployment know which group
of pod replicas belong to it?
1. template > metadata > labels : Unlike other components like ConfigMap and
Secret, this metadata key labels is mandatory because each pod replica created
under this deployment will have a unique ID (e.g., mlflow-tracking-xyz, mlflow-
tracking-abc). To be able to collectively identify them as a group, labels are used
so that each of these pod replicas receives the same set of labels.
2. selector > matchLabels : Used to determine which group of pods are under this
deployment. Note that the labels here have to exactly match the labels in (1).
containers > image : the image that will be used by each pod
containers > env : here is where we specify the environment variables that will
be initialized in each pod, referenced from the ConfigMap and Secret we have
created earlier.
5b. Service — As mentioned above, Service is used almost like a load balancer to
distribute traffic to each of the pod replicas. As such, here are some important
things to note about Service.
selector : This key-value pair should match the template > metadata > labels
specified earlier in Deployment, so that Service knows which set of pods to route
the request to.
type : This defaults to ClusterIP , which is the internal IP address of the cluster
(a list of other service types can be found here). For our use case, we will
use NodePort to expose our web application on a port of our node's IP address.
Do note that the value for nodePort can only be between 30000–32767.
targetPort : This refers to the port that your pod is exposing the application on,
which is specified in Deployment.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: mlflow-tracking-server
  name: mlflow-tracking-server
spec:
  type: NodePort
  selector:
    app: mlflow-tracking-server-pods
  ports:
  - port: 5000
    protocol: TCP
    targetPort: 5000
    nodePort: 30001
You can in fact put several .yaml configurations in one file — specifically the
Deployment and Service configurations, since we will be applying those changes
together. To do so, simply use a --- to demarcate these two configs in one file:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
...
---
apiVersion: v1
kind: Service
...
Finally, we apply these changes using kubectl apply -f k8s/deployment.yaml .
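Before accessing the server, a quick sanity check (in the default namespace) confirms the resources came up as expected:
kubectl get deployments,pods,services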
Congrats! You can now access your MLFlow server at <node_IP>:<nodePort> . Here’s
how to find out what your node_IP is:
# one option: list the node together with its internal IP
kubectl get nodes -o wide
# or equivalently:
minikube ip
If, like me, you're using the Docker driver on macOS (Darwin) or Windows/WSL, the node IP
will not be directly reachable using the above method. Complete steps 4 and 5 listed
in this link to access your application.
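Alternatively, minikube itself can tunnel the NodePort Service and print a reachable URL (keep the terminal open while you use it):
minikube service mlflow-tracking-server --url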
Cleaning Up
Finally, we’re done with our test application and cleaning up is as simple as minikube
delete --all .
Final Words
Thanks for reading and hope this helps you in your understanding of Kubernetes.
Please let me know if you spot any mistakes or if you would like to know more in
another article!
Support me! — If you like my content and are not subscribed to Medium, do consider
supporting me and subscribing via my referral link here (NOTE: a portion of your
membership fees will be apportioned to me as referral fees).
References
[1] What is Ingress?