Infrastructure Design For Student Collaboration Projects Using Kubernetes
DIPLOMA THESIS
Supervisor
Lect. Dr. Pop Emilia
Author
Sidorac Constantin-Radu
2024
ABSTRACT
One of the banes of a software developer’s existence is getting their code to properly work on a different machine. This is even more the case for students, who don’t have the necessary experience to efficiently troubleshoot deployment issues or to properly document the process of running their application.
Students need to collaborate on projects as part of their education, but working in a team can become a pain point for them, often for the reason stated above. Even outside of the school context, students with different knowledge sets and preferences may come together to build something.
The following pages lay out the choices made when designing the infrastructure
for a complex collaborative project between students.
First, the core tools will be introduced. Kubernetes will be used as the base of the infrastructure, with applications running in Docker containers.
Next, both cost and performance optimizations will be tackled. To be able to quantify the possible improvements, multiple results will be presented.
The design of the infrastructure will be explored in-depth, tackling topics such
as observability, security, and stability. Multiple tools or methods will be explored
in this part.
A chapter will deal with the processes and tools involved in developing an application, and how these are used in the presented project.
Finally, there will be some case studies of applications developed by students
using this infrastructure: a volunteer management solution, and some websites.
Unless otherwise noted, this paper only contains my own work based on my
own research. I have neither given nor received unauthorized assistance on this
thesis.
Sidorac Constantin-Radu
Contents

1 Introduction
4 Infrastructure design
  4.1 Kubernetes Cluster
    4.1.1 VM configuration
    4.1.2 Network configuration
    4.1.3 Software Configuration
  4.2 Managing the Cluster
    4.2.1 ArgoCD
    4.2.2 Repository file structure & namespaces
  4.3 Observing the Cluster
    4.3.1 Metrics & Alerts
    4.3.2 Logs
    4.3.3 Traces
  4.4 Securing the Cluster
    4.4.1 Encrypted traffic
    4.4.2 Secret management
    4.4.3 Network policies
  4.5 Repairing the Cluster
    4.5.1 Backups
    4.5.2 Restoring
    4.5.3 Dynamic DNS
  4.6 Downtime
  4.7 Centralized services
    4.7.1 Authentication service - OAuth2 proxy
    4.7.2 Central database - PostgreSQL
5 Application development
  5.1 Source Control
    5.1.1 Git & GitHub
    5.1.2 Contributing strategy
  5.2 Password Manager
    5.2.1 Need
    5.2.2 Implementation
  5.3 Communication
6 Case studies
  6.1 Volunteer Management
    6.1.1 Design
    6.1.2 Results
  6.2 Project Websites
    6.2.1 Design
    6.2.2 Results
Bibliography
Chapter 1
Introduction
The main goal of this paper is establishing a set of tools and practices as the foundation on which multiple collaborative projects between students will be built. Most of the decisions presented throughout the thesis are made with a concrete purpose in mind, but alternatives and their use cases will always be presented as well.
The real-life use case that this thesis is tackling boils down to building the infrastructure that is going to be used by a number of Computer Science students. These students are part of a students’ association, which means that their work will, for all intents and purposes, be regarded as volunteering.
The background of these students varies: they can be first-years or pursuing a Master’s degree. They may already work in the field, or have no professional experience. The languages, frameworks, and interests of any student can also vary widely. All this variation will be felt to a higher degree as time passes and new volunteers try to deploy changes to the applications created by the older ones.
To make all of this variation manageable, we’ll be using a microservice-based architecture, enforcing Docker as the main tool of choice. Kubernetes is used to orchestrate the containers, and Helm is used to bundle up the applications that will be deployed on Kubernetes. These concepts will be tackled in Chapter 2 - Cloud Computing and Kubernetes.
An important part of any project is budgeting, and the budget can be rather sparse in the case of nonprofit organizations (as a students’ association is). For this reason we want to maximize what we can get out of the available resources by employing multiple optimization strategies. Chapter 3 - Cost and performance optimizations will present some of these strategies.
Afterwards, in Chapter 4 - Infrastructure design, the actual design of the Kubernetes cluster will be tackled. A big part of the chapter will consist of the presentation of different tools and the purposes for which they’re used. By the end of this chapter, a production-ready cluster will be in place.
The cluster needs to be used, and applications need to be developed, tested, and deployed on it for any of the prior work to have any use. Chapter 5 - Application development delves into topics such as Source Control, documentation, and CI/CD.
Finally, two projects will be studied in Chapter 6 - Case studies. These were developed by student volunteers, each with a different amount of experience.
Chapter 2
Cloud Computing and Kubernetes
For the purposes of this paper, Docker is the main building block, and it lends
itself well to student projects and the problems that may arise in them:
• “It worked on my machine” - even veterans run into this issue, and it’s the
norm for students, but with the help of containerization the problem becomes
trivial.
• Automation - building and deploying are done as easily as running one command. Switching to a new version is nothing more than using a different image.
2.1.3 Microservices
Microservices can be thought of as a set of small services that are developed and deployed independently, but work together, communicating through APIs.
Infrastructure as Code (IaC) means keeping infrastructure definitions in files under version control, providing a single source of truth that is used to automatically apply changes to systems. Things are built the same way every time and the usual development practices can be applied (modularization, versioning). Looking at the code also provides a real overview of the infrastructure that’s in place.
Some of the tools presented in the following chapters are based on this concept, mainly Kubernetes (which uses yaml files called ’manifests’ for its resource definitions), and Helm (which templates and bundles up Kubernetes manifests).
Terraform is another IaC tool used to configure resources from different cloud
providers.
All the infrastructure presented in this paper is defined as code, including every single VM and all of the preparation needed for them. All of the cloud-provider resources are deployed using Terraform, and everything Kubernetes-related (besides some initial steps) is deployed using ArgoCD. Besides the advantages presented above, IaC is used to future-proof the infrastructure: if everything was lost, or we had to switch cloud providers, there would be an extremely fast and simple way to get everything up and running again.
2.2 Kubernetes
This is the main tool that’s going to be used, and everything else is going to be built on top of it. All of the applications, tools, and services are going to be running in Kubernetes, as well as all of the configuration and routing.
2.2.1 Overview
Kubernetes[7] is an open source platform used to manage containerized workloads. It can be considered a container orchestrator that facilitates automation and declarative configuration. Kubernetes is highly customizable, and focuses on offering flexibility to its users by providing building blocks which can then be implemented, extended, and configured in multiple ways.
Kubernetes runs on a cluster, which is a set of machines called ‘nodes’. These
nodes host the basic deployment unit of Kubernetes, the Pod, which contains the
containerized applications.
The state of the cluster is dictated by Kubernetes Resources, which are objects
expressed in the ‘yaml’ format. These objects have unique names per resource, but
can be isolated further through the use of namespaces, which provide a scope for
names and enable grouping of resources.
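As a minimal illustration of this declarative format, the following sketch shows a Namespace and a Pod scoped to it; the names and image are hypothetical and only meant to show how a resource manifest is structured.

# Hypothetical example: a namespace and a pod scoped to it
apiVersion: v1
kind: Namespace
metadata:
  name: apps
---
apiVersion: v1
kind: Pod
metadata:
  name: hello-web          # must be unique within the 'apps' namespace
  namespace: apps
  labels:
    app: hello-web
spec:
  containers:
    - name: web
      image: nginx:1.25    # any container image would work here
      ports:
        - containerPort: 80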
2.2.3 Workloads
A pod contains one or more containers, all of which share the same context,
as well as storage and network resources. Pods are meant to run one instance of
an application, meaning that you can easily horizontally scale your application by
adding more pods.
Pods are scheduled on nodes mostly based on the resources they request, but
there are other mechanisms that ensure certain pods run only on certain nodes.
Most of the time pods are managed by other workload resources, which are briefly presented below:
Deployments are generally used for stateless applications that need a certain number of pods to be running. They handle pod failures, and do rolling updates in case something about the pod definition changes.
Jobs are used for one-time batch jobs. There is also a CronJob resource that creates
Jobs based on a predefined schedule.
StatefulSets are used for stateful applications, like databases, which store their
state using persistent storage.
DaemonSets ensure that all nodes have an instance of the pod running on them.
They’re used, for example, for getting node metrics.
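To make the Deployment and scheduling behaviour concrete, here is a hedged sketch of a Deployment that asks for three replicas and sets resource requests, which the scheduler uses when placing the pods on nodes; the application name and image are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: article-api        # hypothetical microservice name
  namespace: apps
spec:
  replicas: 3              # horizontal scaling: three identical pods
  selector:
    matchLabels:
      app: article-api
  template:
    metadata:
      labels:
        app: article-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/article-api:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 100m          # used by the scheduler to pick a node
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi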
2.2.4 Networking
All pods receive their own unique IP address within the cluster, meaning that all pods are normally able to communicate with each other. There are a few resources used to manage networking:
The Service resource is used to expose an application that’s running as one or more
pods. Because pods can be killed and created at any point, the backend is
dynamic, and the service takes care of knowing which pods are part of the
backend at any point.
The Ingress resource is used to route HTTP requests to services based on rules. For
the ingress to work, an ingress controller needs to be running. There are many
ingress controllers, some of which are native to a certain cloud provider, and
some which are vendor-independent.
Moreover, there are DNS records made for every Service and Pod, enabling the
use of DNS names instead of IPs.
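A hedged sketch of a Service selecting the pods of the hypothetical article-api Deployment above follows; the Service is reachable inside the cluster under a stable name regardless of which pods are currently backing it, and the target port is an assumption.

apiVersion: v1
kind: Service
metadata:
  name: article-api
  namespace: apps
spec:
  selector:
    app: article-api       # matches the pod labels set by the Deployment
  ports:
    - port: 80             # port exposed by the Service
      targetPort: 8080     # port the container listens on (assumed)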
2.2.5 Configuration
Kubernetes has a few resources that are used to configure applications: ConfigMaps and Secrets. ConfigMaps are used for storing non-confidential application configuration in a place separate from the application code. Secrets are used to store confidential data, like passwords.
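As a small illustration, a ConfigMap and a Secret might look like the following sketch; the keys and values are made up, and Secret values are only base64-encoded by Kubernetes, not encrypted.

apiVersion: v1
kind: ConfigMap
metadata:
  name: article-api-config
  namespace: apps
data:
  LOG_LEVEL: "info"                # non-confidential configuration
  FEATURE_COMMENTS: "true"
---
apiVersion: v1
kind: Secret
metadata:
  name: article-api-secrets
  namespace: apps
type: Opaque
stringData:
  DATABASE_PASSWORD: "change-me"   # stored base64-encoded, not encrypted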
2.2.6 Example
This subsection presents an example of an application running in Kubernetes.
The cluster contains 4 nodes, each having 2 CPUs, 4GB of memory, and 30GB of storage. The same mock web application described in 2.1.3 is deployed, but an observability tool is also added for collecting node metrics. Figure 2.2 shows how pods could be distributed across the nodes based on the requests they set.
Figure 2.5: Services and the backend Pods they point to.
A Service resource is created for each microservice. The grouping of pods can be
seen in Figure 2.5.
Figure 2.6: Routing of public Services through Ingresses. Non-public services can
communicate with each other using the Cluster’s DNS names.
Two ingresses are set up, one for the application itself, and one for Prometheus (which will be described in more detail in section 4.3). All client requests are routed based on the rules set on the ingress. This means that requests made to the article.example.com domain are routed to one of three services, depending on the most specific prefix they match against, and all of the requests that were made to the prometheus.example.com domain go directly to the Prometheus service.
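A hedged sketch of what the first ingress’s rules might look like; the path prefixes and service names are hypothetical, only the hostname follows the example above. The Prometheus ingress would be analogous, with a single rule for prometheus.example.com.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: article-app
  namespace: apps
spec:
  ingressClassName: nginx              # assumed ingress controller
  rules:
    - host: article.example.com
      http:
        paths:
          - path: /api                 # hypothetical prefix for the backend API
            pathType: Prefix
            backend:
              service:
                name: article-api
                port:
                  number: 80
          - path: /comments            # hypothetical prefix for a second microservice
            pathType: Prefix
            backend:
              service:
                name: comments-api
                port:
                  number: 80
          - path: /                    # everything else goes to the frontend
            pathType: Prefix
            backend:
              service:
                name: article-frontend
                port:
                  number: 80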
It is not desirable to have all services exposed to the internet. As such, in those cases communication between services should be done internally to the cluster. Pods can be referenced by IP, by their DNS name, or by the DNS name of a Service that selects them: to make a request to the PostgreSQL database you could use ‘10.0.2.4’, ‘postgresql-pod-1.apps’ or ‘postgresql.apps’. In practice, the DNS name of a service is used for internal communication.
Figure 2.6 shows the communication paths between services, including public-facing ones.
2.3.1 Helm
Kubernetes uses declarative configuration, which means that the desired state
of the cluster is held in manifest files. Manifest files are yaml files in which the
specification of a Kubernetes resource is declared.
A Kubernetes application usually consists of multiple manifests, and the sheer
amount of files can become unwieldy if the application has any complexity. Helm[8]
is a tool built to manage these manifests, fulfilling the role of a package manager.
Helm packages are called ‘Charts’, and an example of the file structure can be seen in Figure 2.7, where the same web application presented in the previous sections was used.
The ‘templates’ folder contains yaml files templated using the Go template language, which should be rendered to valid manifests after being processed by the templating engine using the values found in the ‘values.yaml’ file. The resulting manifests are then applied to the cluster.
Figure 2.7: Example of the elements of a helm chart
This paper makes heavy use of helm, packaging the entire state of the cluster in
charts. Helm is also used for automation, being used internally by ArgoCD (which
will be tackled in a future section).
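To illustrate the templating mechanism, a hedged sketch of a values.yaml entry and a template that consumes it follows; the keys are made up and do not reflect the actual charts used in this infrastructure.

# values.yaml (made-up keys)
image:
  repository: ghcr.io/example/article-api
  tag: "1.0.0"
replicaCount: 3

# templates/deployment.yaml, rendered using the values above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-api
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-api
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-api
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

Running helm template or helm install renders the template with the values and applies the result to the cluster.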
2.3.2 Kustomize
Kustomize[9] is a tool used to apply modifications to manifest files without modifying the original files. It works by patching over the original resources, making it a great tool for when you want slightly different configuration based on the environment.
There are cases where Kustomize is the better tool for the job than Helm, and that’s the case for the infrastructure presented in this paper as well (ArgoCD is deployed using Kustomize).
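A hedged sketch of how a kustomization might adjust a base manifest per environment; the directory layout, image name, and patch file are illustrative only.

# overlays/development/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: development
resources:
  - ../../base               # the unmodified base manifests
images:
  - name: ghcr.io/example/article-api
    newTag: "1.1.0-rc1"      # override the image tag only in this overlay
patches:
  - path: replica-patch.yaml # small patch reducing replicas for development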
The infrastructure described in this paper uses operators made by the Kubernetes community to easily deploy and configure some tools. These tools will be covered in chapter 4.
Chapter 3
Cost and performance optimizations
Figure 3.1: Opportunities for free resources from different Cloud providers
Microsoft - offers licenses for Office365, Windows, Visual Studio, and other Microsoft products.
GitHub - offers the GitHub Student Developer Pack, which contains benefits provided directly by GitHub (like GitHub Copilot and GitHub Pro), as well as benefits provided by partners.
The infrastructure developed in this paper is open source, as are all of the projects
developed using it. Thanks to that, GitHub Actions as well as GitHub Packages are
used as part of the deployment pipeline.
The infrastructure presented in this paper makes heavy use of the tools found in
the Cloud Native Landscape. Helm, kube-router, k3s, Redis, PostgreSQL, ArgoCD,
Nginx, flannel, cert-manager, CoreDNS, Oauth2-proxy, Prometheus and Grafana are
just some of the tools used, with many more potential tools added as the needs of
the infrastructure increase.
3.2.1 VM Configuration
Most cloud providers offer a high number of VM types. The total number of
VM types comes from a combination of parameters: operating system, number of
virtual CPUs, amount of RAM, processor type, temporary storage, hosting region,
and disks. For the rest of the section Azure VM pricing[30] will be used.
ARM was chosen as the cheapest processor, and the one that’s the most lightweight, and Debian was chosen from the list provided by Azure, as it supports the ARM processor.
Disks can get a bit complicated because of the options available and the limitations set on them. Research is needed to come to the most cost-efficient result. For example, in Azure a Premium SSD v2 has much better IOPS and is much cheaper than a normal Premium SSD, but you are unable to use a Premium SSD v2 as a boot disk, meaning that adding a second disk to the VM would be better overall. At the same time, at lower amounts of storage, a Standard SSD may be more performant, as well as cheaper, than a Premium SSD.
Spot VMs[32] are VMs which make use of the extra capacity the Cloud provider has on hand. They come with a significant cost reduction (around 75%), but no Service-Level Agreement exists for them, meaning that they can be taken down at any time with just 30 seconds of notice. They are also not guaranteed, because the cloud provider may not have that many spare resources.
Spot VMs are great for workloads that can handle interruptions, so they would
be a great addition to a Kubernetes cluster. A possible strategy would be scheduling
pods on Spot VMs and falling back on normal VMs when the extra capacity is gone.
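A hedged sketch of how such a preference could be expressed on a workload, assuming the spot nodes carry a custom label like node-type=spot (the label name is an assumption, not something Azure or k3s sets automatically):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                   # hypothetical interruption-tolerant workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      affinity:
        nodeAffinity:
          # prefer spot nodes when they exist, fall back to normal nodes otherwise
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type     # hypothetical label applied to spot nodes
                    operator: In
                    values: ["spot"]
      containers:
        - name: worker
          image: ghcr.io/example/batch-worker:1.0.0   # placeholder image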
Because of a quota limit, only 1 Spot VM can be used by this infrastructure, but the
terraform scripts provided can be configured to use more.
K3s was chosen for this paper, as it had the smallest binary, was easy to set up quickly, and had just enough features for the desired use-cases. Those features could also be turned off and replaced with the desired ones.
debian:12.5                      117 MB
debian:12.5-slim                 74.8 MB
ubuntu:24.04                     76.2 MB
distroless:base-debian12         20.7 MB
distroless:base-nossl-debian12   14.8 MB
distroless:static-debian12       1.99 MB
busybox:3.16.1                   4.26 MB
Ideally, you would like to have the smallest image possible, but sometimes that comes at the cost of time. Getting your application to work is harder, or downright impossible, when it depends on a lot of things you’d usually find in a normal Linux distribution (some images don’t even have a shell available).

Figure 3.4: Image sizes for different Python base images
python:3.11            1.01 GB
python:3.11-slim       131 MB
python:3.11-alpine     52.7 MB
python3 distroless     52.8 MB
For example, for alpine-based images, you would usually add dependencies using apk (alpine’s package manager). For distroless images, you would need to compile your application to a binary in a different build step, and copy that binary, as well as any required libraries, on top of the distroless base.
Figure 3.4 shows how big of a size decrease can be achieved by using the right base image for a python-based application. In most use-cases, you’d probably use the python:3.11-slim base image, but that still provides an 87% decrease in image size compared to the default image.
Chapter 4
Infrastructure design
This chapter will be presenting the actual infrastructure that was created using
some of the techniques, practices and tools presented in this paper. The context
and reasoning for each choice will be presented, along with possible alternatives.
Because of the sheer breadth of tools and information available, some choices were
based on popularity and good reputation.
Once again, this infrastructure is made to be used by a group of Computer Science students that are part of a nonprofit organization. Fast, small, cheap solutions are most of the time preferred to resilient ones.
4.1.1 VM configuration
All of the infrastructure is based on having a couple of VMs organized in a cluster
structure and running Kubernetes on them.
The Azure Grant for Nonprofits provides $2000 worth of credits to spend in a year. Going above that amount would entail paying out of pocket, and considering how expensive the cloud can get, and how little money a students’ association would have, minimizing costs is a must.
For this reason, the cheapest VMs were searched for. It turns out that on Azure,
a B2pls VM hosted in Central India with 2vCPU and 4GB of memory is as cheap
as they come. Memory is always a concern, which is why the smaller VM variants
weren’t considered. As mentioned, a high number of smaller VMs provides a higher
total amount of resources per dollar. This VM has an ARM processor, and Debian
installed on it.
Disk-wise, the default for the OS was used, which is a 30GB Standard SSD, but most VMs would also have a secondary disk, a 32GB Premium SSD v2. The OS disk would only be used as ephemeral storage by Kubernetes, as well as to cache docker images, so its performance is not that important. The secondary disk would be used to store persistent data, especially for databases, meaning that the best was desired (and, funnily enough, this disk type is also the cheapest for what it provides).
In total, 7 VMs were provisioned and were paid for upfront for a 41% cost reduction that came from reserving the instances for 1 year. Because the amount of money the organization has is fixed per year, it makes sense to get as much computing power as possible, even though not all of it would be consistently used. All of those 7 VMs have their mandatory boot disk, as well as a secondary disk. The cost for all of this, in total for a year, is $1300, leaving $700 for traffic costs, as well as unexpected costs.
Using Spot VMs was desired and planned for, but unfortunately Azure has a quota for them, only allowing the use of one B2pls v2 on the subscription that was used. In any case, the terraform module supports Spot VMs, and their configuration would have been just slightly different: because of their ephemeral nature, nothing would be persisted on them, meaning they wouldn’t have that secondary disk. Extra measures would be necessary to use Spot VMs, such as running a scheduled job that regularly turns the VM back on if it was deallocated.
If more capacity is required, Azure support will probably be contacted to increase the quota.
Everything passes through the master VM, as all nodes are connected directly to it, and all traffic goes to it first. This makes the master VM the single point of failure of this infrastructure. For all intents and purposes, this infrastructure isn’t crucial, and the possibility of downtime is worth the cost optimizations.
For DNS, Cloudflare is used, without proxying. CNAME records are used for everything besides the root domain, because all subdomains would point to the same public IP.
The cluster is made up of three types of nodes:
• the master node, which is where the k3s server resides and which holds the public IP
• normal nodes, which have the same VM configuration as the master node, but have the k3s agent installed on them
• spot nodes, which are just like normal nodes, but they don’t have a secondary disk and don’t allow persistent workloads on them
The local-path provisioner that comes with k3s is used for storing persistent data.
It makes use of the existing storage of the nodes, and stores data in a folder on them.
It’s configured to use only the second disk for persistent storage, which is mounted
on /data/ssd.
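For reference, a hedged sketch of how the provisioner can be pointed at the mounted disk, assuming the upstream local-path-provisioner ConfigMap format (the ConfigMap name and namespace may differ between k3s versions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: local-path-config      # name used by the upstream local-path-provisioner
  namespace: kube-system
data:
  config.json: |
    {
      "nodePathMap": [
        {
          "node": "DEFAULT_PATH_FOR_NON_LISTED_NODES",
          "paths": ["/data/ssd"]
        }
      ]
    }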
4.2.1 ArgoCD
ArgoCD is, as its name implies, a Kubernetes tool for continuous delivery. It uses Git repositories as sources of truth, watches them, and modifies the cluster in such a way that the cluster state matches what is declared in a repository.
After the initial startup script that installs ArgoCD, it is used to deploy everything else in the cluster, including itself. Application resources are used to define what is deployed and where, and both helm and kustomize are supported for packaging Kubernetes applications.
ArgoCD makes deploying a new version of an application as simple as changing a version number in a file. ArgoCD checks the repository at least every 3 minutes, and then makes the necessary changes.
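A hedged sketch of an Application resource pointing ArgoCD at a chart in a Git repository; the repository URL, path, and namespaces are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/infrastructure.git  # placeholder repository
    targetRevision: main
    path: charts/production-apps        # helm chart (or kustomize directory) to deploy
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from the repository
      selfHeal: true     # revert manual changes made directly in the cluster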
A nice dashboard is provided, which also allows the users to manage resources.
ArgoCD is configured to only allow users of the organization to be able to access
the UI, and users can access certain parts of the UI based on the Google Group
they’re part of (for example, developers only get access to the production-apps and
development-apps applications).
An important feature that is used here is helm subcharts: you can define a dependency on an external chart, and have that deployed. Many charts are already made by other people, and this provides a cleaner way to use them.
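As a hedged illustration, declaring such dependencies in a chart’s Chart.yaml might look like this; the chart names, version constraints, and repository URLs are examples rather than the ones actually used.

apiVersion: v2
name: platform
version: 0.1.0
dependencies:
  - name: postgresql
    version: "15.x.x"                            # example version constraint
    repository: https://charts.bitnami.com/bitnami
  - name: oauth2-proxy
    version: "7.x.x"
    repository: https://oauth2-proxy.github.io/manifests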
The hermes resources are split in two: production and development. Production resources are the ones that contain the state that can be seen by the outside world, like the main website, while development resources can only be accessed by developers. The helmcharts are templated using the values.yaml file for the development environment and the values.prd.yaml file for the production environment.
The repository could be organized differently, and may be modified over time.
Observability is usually broken down into three types of signals: metrics, logs, and traces.
In Kubernetes, there are a few popular tools used for collecting and visualizing metrics, as well as sending alerts. These are Prometheus, Grafana, and Alertmanager.
Prometheus is a complex tool: it scrapes metrics from endpoints, stores them in a time-series database, and allows querying of said metrics, with the possibility of also creating alerts.
Grafana is a tool used for visualization. It can be used to visualize all the observability components: logs, metrics, and traces alike. Grafana supports multiple data sources, meaning that you may only need one visualization tool for the entire observability stack.
Alertmanager is used to send alerts to different receivers. It receives alerts from
Prometheus, and then forwards them. For this paper, Alertmanager was configured
to send alerts to a Discord channel, as that is where the project communication is
done.
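As a hedged example, assuming the Prometheus Operator (for instance via kube-prometheus-stack) is deployed, an alerting rule that fires on high node memory usage and reaches the Discord channel through Alertmanager could look roughly like this; the namespace and thresholds are assumptions.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alerts
  namespace: monitoring            # assumed namespace for the monitoring stack
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeMemoryHigh
          # node_exporter metrics: fraction of memory in use
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} is using more than 90% of its memory"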
4.3.2 Logs
Logs are part of any application, and they’re crucial in figuring out what went
wrong where.
Clusters containing many applications have a massive amount of logs. Thankfully, this isn’t the case just yet, so for now the ArgoCD UI is enough to be able to debug applications.
Later, when logs start to matter more, a solution consisting of fluent-bit for picking up all the logs from the cluster, Loki for storing them, and Grafana for visualization would be used.
4.3.3 Traces
Traces are records that capture the flow and timing of requests as they move through the system. They are important when it comes to understanding the behaviour of microservices.
Ideally, a Kubernetes cluster would have a tracing solution. As things stand, this
wasn’t a priority, but might become one when more and more services are added.
For traffic to and from the cluster to be considered secure, certificates need to be
generated. Doing it manually is a time-consuming process, but may be desired for
some production workloads where paid certificates are used because it’s important
to verify business legitimacy (for shops for example).
For the purposes of this infrastructure, automation and cost-efficiency are a priority, and free certificates encrypt traffic in the same way as paid certificates.
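cert-manager, one of the tools listed earlier, automates this with free certificate authorities such as Let’s Encrypt. A hedged sketch of a ClusterIssuer using the ACME HTTP-01 challenge follows; the e-mail address and ingress class are placeholders.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com            # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-production-key  # secret storing the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx                # assumed ingress controller

Ingresses can then request certificates by referencing the issuer, for example through the cert-manager.io/cluster-issuer annotation.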
This adds management overhead that may result in undesired behaviours: creating these secrets also requires some form of access to the cluster, so people might start creating normal secrets and not have them in source control anymore.
4.5.1 Backups
Thanks to Google Workspaces for Nonprofits, the organization has 100TB of free Google Drive storage. Other forms of cloud storage would cost money, and this is a huge amount of free storage that’s just sitting there, waiting to be taken advantage of.
As most backup solutions focus on using disks or bucket stores, some custom
backup solutions were created with the help of the rclone tool.
For the postgresql database, a CronJob runs daily, dumps the database to a file,
archives and compresses it, and then uses rclone to copy that file to Google Drive.
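A hedged sketch of what such a CronJob could look like; the image, secret name, and rclone remote are hypothetical, and the actual job may differ.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgresql-backup
  namespace: apps
spec:
  schedule: "0 3 * * *"            # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: ghcr.io/example/pg-backup:latest   # hypothetical image containing pg_dump and rclone
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # dump, compress, and upload the database to Google Drive
                  pg_dump "$DATABASE_URL" | gzip > /tmp/backup-$(date +%F).sql.gz
                  rclone copy /tmp/ gdrive:cluster-backups/postgresql/
              envFrom:
                - secretRef:
                    name: postgresql-backup-credentials  # hypothetical secret with DATABASE_URL and rclone config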
A similar approach was taken for the secret manager: rclone saves the necessary
files daily to Google Drive.
There are currently no backup solutions in place for persistent volumes, but a
similar approach would probably be taken.
4.5.2 Restoring
In case of an incident in which everything is lost, the whole infrastructure needs to be restored. This process should be as simple and fast as possible, to limit downtime and to ensure that restoration is actually possible.
There are currently 4 things which would need to be restored: the secret manager, the secret manager database, the postgresql database, and the Kubernetes cluster.
The secret manager is deployed using terraform, and only requires setting a few variables needed for proper functioning. The backups only need to be copied from the bucket and dropped onto the VM.
The postgresql database has a restoration job which takes the specified backup
from the drive, decompresses it, and restores the database.
The Kubernetes cluster would be restored in the same way the secret manager would be, but it may take longer and should be observed just in case something goes wrong. As of now, persistent data other than the database could not be restored.
4.6 Downtime
Availability in general is going to suffer in favor of more computing power and
lower costs in the current implementation of the infrastructure.
The trade-off is worth it because traffic is generally low and downtime has close to zero impact, meaning that, as things currently stand, highly available solutions are not a priority.
At some point, as the apps become more and more used, availability will become
important, and the focus will be shifted to accommodate that as well.
Currently, the Oauth2-proxy is used for all the infrastructure dashboards (ArgoCD, Grafana, Alertmanager, Prometheus), as well as the student management application.
Figure 4.6: Basic flow of accessing apps protected by the Oauth2 proxy
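A hedged sketch of how a dashboard ingress can be put behind the proxy, assuming the ingress-nginx controller and an oauth2-proxy reachable at an external auth endpoint; the hostnames and service names are placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    # ingress-nginx asks oauth2-proxy whether the request is already authenticated
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
    # unauthenticated users are redirected to the login flow
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start?rd=$scheme://$host$request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80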
Chapter 5
Application development
When one wants to add new functionality to an application, they should first create a branch based on the main branch. This kind of branch is called a ‘feature branch’. When development is done and the feature is considered complete, the feature branch should be merged into the main branch. This is done through something called a ‘Pull Request’, in which a user requests that their feature branch be merged.
In GitHub, pull requests have a lot of options available, but the most important one is approvals. Another person (or more, depending on the strategy that’s in place) needs to look at the code that’s about to be merged and approve the changes. If something is wrong or needs to be modified, the reviewer can leave some comments, after which the initial developer resolves them. This process continues until approval is given, and the feature branch is merged through a squash commit (this means that all of the commits of the feature branch become just one commit on the main branch).
This branching strategy is uncomplicated, and allows the main repository to only
contain commits that add entire features.
5.2.1 Need
A software solution can depend on sensitive information that comes in the form
of passwords, tokens, keys, and generally secrets. Teams should be able to share
this information in a safe manner, which is why a password manager should be
employed.
Password managers aren’t only useful for software development though. The activity of volunteers includes quite a few of these secrets that need to be shared: Instagram or Facebook credentials, shared email accounts, and so on.
In the past, all of these credentials would be shared through channels like private messaging, some text files, and even public Google Docs (which were only safe thanks to obfuscation).
5.2.2 Implementation
Most password managers that allow access for groups of people need to be paid for. There are self-hosted solutions, but they should be hosted somewhere separate from the Kubernetes cluster. As mentioned in subsection 3.1.3, Google has some free tier products, and one of those is a month’s worth of computing power for a small VM, along with a public IP, which is enough for hosting a password manager.
5.3 Communication
Developer communication is important, to help one another, discuss implementation details, and make decisions. As such, the communication channels that are used become important.
Discord was chosen for communication. It has built-in screen sharing capabilities, voice and video calls, as well as text channels. Another very important feature is the possibility of extending Discord functionality through the use of bots and webhooks (which is how Kubernetes alerts are received in a Discord channel). The actual organization of the server is not definitive, with changes being made as the needs of the projects are more and more understood.
Entire communities exist on Discord, with complex servers managing thousands
of users and a plethora of topics, so it should lend itself quite well to handling a few
students working on computer science projects.
5.5 CI/CD
CI/CD[43] is a set of practices and tools designed to improve the software development lifecycle.
Continuous integration means that code changes are merged into the main branch
multiple times per day. The integration of the code can have a lot of automation
behind it: validating, building and testing, and only then allowing the merge to
happen.
Continuous deployment is concerned with the automation of the deployment
process.
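A hedged sketch of a GitHub Actions workflow that builds a Docker image on every push to main and stores it in GitHub Packages; the registry path and action versions are typical, but the project’s actual workflows may differ.

# .github/workflows/build.yaml (illustrative)
name: Build and push image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write            # allow pushing to GitHub Packages (ghcr.io)
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}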
5.6 Documentation
Documentation is written using Markdown, with at least a README.md file containing basic information about what the application does, how to run it, and how it can be configured. Developers may fail to write proper documentation even in professional environments, but, considering the ephemeral nature of computer science students, it is crucial to have adequate documentation for all of the developed software solutions. As applications grow, API documentation will also be required.
Chapter 6
Case studies
This chapter will tackle two ‘projects’, their design, and how the infrastructure presented in this paper was used to make them come to life.
6.1.1 Design
The volunteer management application is made up of two distinct projects: the volunteer-api[44] and the volunteer-management-frontend[45]. The idea was to decouple the management itself from the interface, to allow for alternative implementations in the future.
The API has a few endpoints which allow CRUD operations on the students, as well as recruitment campaigns and candidates. The frontend consumes this API and provides the actual management operations.
A volunteer built both of these components, using Go for the API and React + TypeScript for the frontend, with minimal adjustments made by me to allow them to properly run in Kubernetes.
6.1.2 Results
As can be seen in Figure 6.2, the application makes use of the oauth2 proxy to only allow a certain group of people to manage volunteers. The ingress is configured in such a way that all requests made to endpoints that start with ‘/api/v1’ go to the API service, while all other requests go directly to the frontend. Cert-manager is used to generate a certificate for the desired subdomain. The application connects to the centralised PostgreSQL database using secrets deployed from a private repository with the use of ArgoCD. To reach the database pod, the cluster DNS was used.
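A hedged sketch of the routing described above; the hostname and service names are placeholders rather than the actual ones.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: volunteer-management
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production  # certificate issued automatically
spec:
  ingressClassName: nginx
  tls:
    - hosts: [volunteers.example.com]
      secretName: volunteers-tls
  rules:
    - host: volunteers.example.com
      http:
        paths:
          - path: /api/v1            # API requests go to the backend service
            pathType: Prefix
            backend:
              service:
                name: volunteer-api
                port:
                  number: 80
          - path: /                  # everything else is served by the frontend
            pathType: Prefix
            backend:
              service:
                name: volunteer-management-frontend
                port:
                  number: 80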
The Docker image of the application is built every time a new change is pushed
to the main branch thanks to GitHub actions, and the resulting image is stored in
GitHub Packages.
Grafana was used to check some metrics of the application, specifically how
much CPU and memory was used when the application is idle. During the initial
deployment phases, alerts from alertmanager showed that the application wasn’t
deployed properly, and the ArgoCD UI was used to check the logs of the containers.
6.2.1 Design
The old Laravel applications were containerized and updated to a version that supported ARM processors. Minimal adjustments were made to bring the websites up to date, but they are subpar by all standards, and they need to be migrated to more lightweight and simpler technologies.
6.2.2 Results
For each of these websites, the design was simple. There would be 1 pod running
the containerized application, with an Ingress pointing to it, and a certificate created
for encryption.
Grafana and ArgoCD were used in the process of making the websites work.
GitHub actions and GitHub packages were made use of here as well.
Chapter 7
Conclusions
This paper has shown how Kubernetes could be used to host computer science
projects developed by students. A focus was put on minimizing costs, with quite
a few strategies being presented. The usage of community-created tools was a big
part of the infrastructure, providing efficient and cost-effective solutions to many
problems.
Future work would expand more on the processes involved during the development of applications, particularly on the improvement of the CI pipeline. Focus would be put on the actual management of the projects, and how best to handle ephemeral contributors, taking inspiration from the open source movement.
The tool belt offered by the cluster could also be improved, mainly by adding a logging solution to the observability stack. Some first steps that could be taken to make the cluster more secure would be implementing a VPN solution and having concrete NetworkPolicies.
Bibliography
[2] Sachchidanand Singh and Nirmala Singh. Containers & docker: Emerging roles & future of cloud technology. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pages 804–807, 2016.
[3] Tamanna Siddiqui, Shadab Alam Siddiqui, and Najeeb Ahmad Khan. Comprehensive analysis of container technology. In 2019 4th International Conference on Information Systems and Computer Networks (ISCON), pages 218–223, 2019.
[6] Kief Morris. Infrastructure as Code: Dynamic Systems for the Cloud Age. O’Reilly Media, Inc., 2nd edition, 2020.
[11] Google. Google Free Trial and Free Tier Services. https://cloud.google.com/free/docs/free-cloud-features#free-tier. Online; accessed May 2024.
[19] Amazon Web Services. AWS Programs for Research and Education. https://aws.amazon.com/grants/. Online; accessed May 2024.
[26] Synopsys. 2024 open source security and risk analysis report. https://www.synopsys.com/software-integrity/resources/analyst-reports/open-source-security-risk-analysis.html. Online; accessed May 2024.
[27] Manuel Hoffmann, Frank Nagle, and Zhou Yanuo. The value of open source software. Harvard Business School Strategy Unit Working Paper No. 24-038, available at SSRN: https://ssrn.com/abstract=4693148 or http://dx.doi.org/10.2139/ssrn.4693148, January 1, 2024.
[31] Khushi Gupta and Tushar Sharma. Changing trends in computer architecture: A comprehensive analysis of arm and x86 processors. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, pages 619–631, 06 2021.
[33] Georgios Koukis, Sotiris Skaperas, Ioanna Angeliki Kapetanidou, Lefteris Mamatas, and Vassilis Tsaoussidis. Performance evaluation of kubernetes networking approaches across constraint edge environments, 2024.
[35] Ameer Khan. Brief overview of cache memory. Technical report, 04 2020.
[36] Sarina Sulaiman, Siti Mariyam Shamsuddin, Ajith Abraham, and Shahida Sulaiman. Web caching and prefetching: What, why, and how? In 2008 International Symposium on Information Technology, volume 4, pages 1–8, 2008.
[38] Shahab Bakhtiyari. Performance evaluation of the apache traffic server and varnish reverse proxies. 2012.
[40] John D. Blischak, Emily R. Davenport, and Greg Wilson. A quick introduction to version control with git and github. PLOS Computational Biology, 12(1):1–18, 01 2016.
[43] Mojtaba Shahin, Muhammad Ali Babar, and Liming Zhu. Continuous integration, delivery and deployment: A systematic review on approaches, tools, challenges and practices. IEEE Access, PP, 03 2017.