Cloud Native Networking Deep Dive River Publishers Rapids Series in Communications and Networking
Chander Govindarajan
IBM Research, India
Priyanka Naik
IBM Research, India
River Publishers
Published 2023 by River Publishers
River Publishers
Alsbjergvej 10, 9260 Gistrup, Denmark
www.riverpublishers.com
© 2023 River Publishers. All rights reserved. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted in any form or by any
means, mechanical, photocopying, recording or otherwise, without prior written
permission of the publishers.
Contents
1 Introduction 1
2 Introduction to Kubernetes Concepts 5
3 Workers and Containers 11
4 Container-container Networking 21
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Container-container Networking in Docker . . . . . . . . . . . . . . . . . . . . . 21
4.3 Multi-host Container–Container Networking Techniques . . . . . . . . . . . . . . . 23
4.4 CNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 k8s-netsim: CNI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.6 Technology: cnitool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.7 k8s-netsim: Flannel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.8 k8s-netsim: Container Networking . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.9 k8s-netsim: Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Services 37
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Technology: Kube-proxy and Kube-dns . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 k8s-netsim: Achieving Services . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 Introduction to Nftables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Hands-on: Nftables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6 k8s-netsim: Implementing Kube-proxy . . . . . . . . . . . . . . . . . . . . . . . 45
6 Exposing Services 49
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 Ingress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 k8s-simulator: Exposing Services . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.4 k8s-simulator: Hands-on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7 Multi-cluster Networking 55
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2 Multi-cluster Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Skupper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8 Retrospective 65
Index 67
About the Authors
CHAPTER 1
Introduction
Hello. If you are interested in a deep dive into the networking fundamentals
that power Kubernetes, you have come to the right place.
But a system providing all of these great features comes at a cost: a lot of
complexity. Unlike traditional server deployments, where the system is quite
intuitive, Kubernetes is quite opaque – there are a number of layers working
together to achieve all this functionality. Many of these layers have evolved from
various requirements and attempts to solve problems and do not really flow
together intuitively.
This book is about gaining a deeper understanding of how networking works in Kubernetes, using a hands-on approach to build a simplified networking stack that mimics the real thing – while using as many real components as possible.
Before we go deeper into this, let us be clear what this book is not about:
• Not about how to use Kubernetes as a developer: how to deploy and manage app lifecycle.
• Not about how to manage Kubernetes as an operator or sysadmin.
• Not a comparison of various technology choices to use with Kubernetes.
The second approach allows you to examine and make changes to the simulator if desired, although you can use the provided image directly to follow along for the rest of this book.
Figure 1.1: A glimpse of the network topology we will simulate in this book.
CHAPTER 2
Introduction to Kubernetes Concepts
In this chapter, we take a brief look at some Kubernetes concepts. Readers who are already familiar with Kubernetes and have used it before may skip this chapter entirely.
The concepts of cloud and cloud computing have evolved over the past several decades. It was realized in the 1950s that sharing compute, memory, disk, and other resources would be more cost efficient, which led to the introduction of mainframe computers. Further, the need for time sharing across geographies was addressed by ARPANET, which gave systems at different locations the ability to communicate with each other, i.e., the birth of the Internet. As the demand for resource sharing grew, IBM introduced the Virtual Machine operating system in 1972. The virtual operating system ran over the existing operating system but was dedicated to a particular user. This enabled sharing of the resources of a machine in isolated environments, and it was used on IBM mainframes. In the 1990s, isolation was extended beyond compute, disk, and memory to the network itself, with “virtual” private networks offered as a rentable service.
The era of what we know today as cloud computing began in the 2000s, with Amazon's AWS and Google's Docs application hosted on their infrastructure, shared and editable by users in any geography in real time. Companies like IBM and Google collaborated with universities to build large server farms, or data centers, providing the infrastructure required to run these cloud applications. They used orchestration platforms like OpenStack and F5 to run virtual machines for various users and their applications. A virtual machine based cloud infrastructure provided isolation but had performance overheads due to the additional layer of a guest operating system. Containers helped fill that gap, providing isolation on shared infrastructure without the additional OS overhead. Since 2013, Docker has become the de facto standard for deploying applications as containers on a single machine. In 2014, Kubernetes was launched as a cloud native (container) orchestration platform to enable container deployment on large cloud infrastructures. This chapter will cover the various concepts in Kubernetes that enabled its wide acceptance in the cloud community.
1. https://www.openstack.org/
2. https://www.f5.com/
3. https://kubernetes.io/
A pod is the smallest deployable unit in Kubernetes: an application runs as one or more containers packaged together as a pod. Init containers are part of the app pod, and the application container starts only after the init containers finish execution. This enables dependency chains across micro-services where there are configuration dependencies between the micro-services and their startup order is critical. Sidecars are another example of containers running alongside the application container to handle tasks such as monitoring, logging, and updating configuration on the fly.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.23.4
    ports:
    - containerPort: 80
Listing 2.1: Example pod YAML
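As a quick aside, on a real cluster with kubectl already configured (our simulator does not use kubectl), such a manifest would typically be applied and inspected as follows; the file name pod.yaml is just an example:
$ kubectl apply -f pod.yaml
$ kubectl get pods -o wide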
A Deployment manages a set of identical pod replicas, taking care of scaling and restarts; Listing 2.2 shows an example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.23.4
        ports:
        - containerPort: 80
Listing 2.2: Example deployment YAML
4. https://kubebyexample.com/concept/deployments
Since a Deployment manages the application pod's lifecycle and scaling, the pod's cluster IP can change on restarts. However, a client needs a stable IP and endpoint to access the microservice. A Service is used to expose the application (see Listing 2.3). There are various service types:
• ClusterIP: Used for accessing an application internally in a cluster, using a cluster-internal IP as the service IP.
• NodePort: Exposes the service using the node IP and a static port on the node.
• LoadBalancer: Exposes the service using a cloud provider's load balancer. This helps for applications with multiple replicas.
• ExternalName: Maps the service to an external DNS name (e.g., xyz.com).
apiVersion: v1
kind: Service
metadata:
  name: serviceA
spec:
  selector:
    app.kubernetes.io/name: App1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9080
Listing 2.3: Example service YAML
These services now form a higher-layer abstraction that can be used safely without worrying about where or how many pods are being run. Later in this book, we shall see how this is achieved.
Finally, now that we have services running in a cluster that pods can talk to, we need an additional layer of control for exposing these services outside the cluster. This is needed not just from a security point of view, but also from a standard application management viewpoint.
There are a lot more pieces in Kubernetes that we have not touched upon here. This introduction is deliberately light, and we invite readers who are new to Kubernetes to take some time now to read more about it. The training available on the official website may be a good place to start.
5. https://kubernetes.io/training/
CHAPTER 3
Workers and Containers
In this chapter, we shall start with the very basics of running a container cluster
orchestrator, which is running the worker/host nodes and containers on them.
In this chapter, our containers will not be able to talk to each other yet; this is
something we will add in subsequent chapters.
Worker nodes are connected to each other over switches and routers – the
exact networking between them depends on the placement of the workers in
the data center and their types. If they are themselves virtual, then there are at
least two layers of networking already (apart from the new layers that we shall examine deeply in this book).
1. https://fedoraproject.org/coreos/
2. https://www.flatcar.org/
In our simulation, we do not care about the exact configuration of the workers
or the networking between them – these details are not relevant when trying to
understand container networking.
Mininet is a well established tool for creating virtual networks, used in research for developing new algorithms (for example, in switching and routing) and in academia for teaching networking. Using simple commands, users can launch virtual networks running real kernel and switch code, with software switches such as Open vSwitch managed by real SDN controllers. Mininet has an easy to use CLI that enables the creation of nodes and links, and supports standard network tools such as ping and tcpdump. You can also run your own custom code on any component created in the Mininet network. Thus, Mininet is well suited for working on new protocols and network topologies, and is a great tool for learning networking.
3. http://mininet.org/
you will see a number of printed statements, ending in a prompt that looks like:
mininet>
... (truncated)
Exercise: verify that the underlay is working and that you can ping one
worker from another. Specifically, check the ip of C0w1 and run the ping
command as “ping <ip>” from the second worker.
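In the Mininet CLI, prefixing a command with a node name runs it on that node, so the check could look like the following sketch (the actual IP will differ on your run):
mininet> C0w1 ip addr show
mininet> C0w2 ping -c 3 <ip of C0w1>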
When you are done, pressing Ctrl-D or “exit” in the prompt will shut down
the simulator with output like the following:
***Stopping 0 controllers
In this book, we are only interested in the networking aspects of a container. So,
we will simulate containers using only “network namespaces”.
Our “container” will thus share the process ID space, file system, and all other resources with the underlying host. There are no isolation guarantees in our setup (apart from networking, that is), by design.
Namespaces are one of the core technologies underlying containers. They provide an isolated, fresh view of a single sub-system's resources, completely independent of the host. The types of namespaces provided by the Linux kernel include:
• User namespace: a process can have root privileges within its own user namespace.
• Process ID (PID) namespace: PIDs inside the namespace are independent of other namespaces.
• Network namespace: an independent network stack (with its own routing rules and IP addresses), as shown in Figure 3.1.
This runs our docker image as a container without running the simulator, loading an environment that is already set up with the needed tools. Alternatively, you can run the commands in this section on any Linux host.
$ ifconfig
$ ip netns list
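The namespace itself is created with the standard iproute2 command; the name netns1 used here and below is purely illustrative:
$ ip netns add netns1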
Run the list command above to verify that the new namespace has been created.
We can now run arbitrary commands inside our new namespace using the
following syntax:
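A minimal sketch of the standard iproute2 syntax, reusing the illustrative namespace name from above:
$ ip netns exec netns1 <command>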
Run “ifconfig” inside this network namespace and verify that the output is different from that in the root namespace.
In fact, you will notice that the new net namespace is empty of interfaces.
This is the expected behavior. Let us bring up the lo interface inside this
namespace:
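A sketch of the command, again assuming the namespace created above:
$ ip netns exec netns1 ip link set lo up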
Now, ping localhost from the root namespace and examine the counter
reported by ifconfig. Do the same inside the network namespace and check that
the two counters are completely independent.
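One possible way to run this check (the grep simply trims the ifconfig output down to the packet counters):
$ ping -c 3 127.0.0.1
$ ifconfig lo | grep packets
$ ip netns exec netns1 ping -c 3 127.0.0.1
$ ip netns exec netns1 ifconfig lo | grep packets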
Hopefully, the previous section has given you an idea of network namespaces. In our simulator, each container on a worker is a new network namespace created on the worker host. Before diving into how connectivity is provided in the topology, let us discuss its various components, as shown in Figure 3.2. We emulate two clusters, cluster 1 and cluster 2, each having an underlay switch (s0), and assume some connection between them, say via switch t0. Each cluster has three workers, w1, w2, and w3, each with a container/pod (c1, c2, and c3 respectively) deployed on it.
Start up the simulator and run commands in the container using the following syntax:
You will get an output as shown. We will see in the next chapter how IPs are assigned to the container; for the moment, check that the IP of the container is different from that of the worker.
mininet> py C0w1.create_container("c10")
mininet> py C0w1.delete_container("c10")
While following along with the rest of the book, you may want to run multiple shells simultaneously on various workers and containers. The Mininet CLI provided by the simulator only allows a single command at a time, which can be limiting.
To open different shells concurrently, you can use the following commands.
$ ./utils/exec 0 w1
0:w1>
$ ./utils/exec 0 w1 c1
0:w1:c1>
You may find this approach much easier for playing with the simulator. In the rest of this book, we use it and recommend you do the same. We also use the prompt convention to indicate where to run commands. If you see a code block that starts with a prompt like “0:w1>”, it means you need to exec into w1 of cluster 0. Similarly, “0:w1:c1>” means that you should exec into c1 on w1 of cluster 0 using the command shown.
As you may have seen in the introduction chapter and elsewhere, Kubernetes deals with pods as its smallest unit of abstraction. A pod is composed of one or more containers running together.
For simplicity, in this book, we consider pods and containers as one and
the same. How is this acceptable? This is because all containers within a pod
share the same network namespace. Since our containers are nothing more than
network namespaces, this works out very well for us.
3.11 Summary
So far, our containers are running, but they cannot really talk to each other.
In the next chapter, we introduce container networking – or how pods are
assigned IPs and how Kubernetes allows containers to talk to each other.
CHAPTER 4
Container-container Networking
4.1 Introduction
In the previous chapter, we saw that when new network namespaces are created,
they are completely empty of any interfaces. This is true for real containers
running on Kubernetes too. Then, how do we make containers talk to each other?
Specifically, in this chapter, we will try to enable the red lines that we see
in Figure 4.1.
We can think of three things that need to be done after a container comes
up:
• An interface has to be assigned to the network namespace, like “eth0”.
• This interface has to be assigned an IP address.
• Packets coming from a pod have to be routed to other pods.
If we take a step back, we realize that the same problem holds for Docker containers. How is this different in Kubernetes?
Figure 4.1: In the last chapter, we got the squares and circles connected with the black lines.
Red lines show connectivity between containers in a single cluster. We achieve this using the
Flannel CNI plugin in this chapter.
3. Packets entering one end of a veth pair exit the other end.
4. The bridge device acts as a simple L2 bridge and switches packets from all attached interfaces to all other attached interfaces.
This simple approach works for local containers but, as you can see, this
cannot be extended to multiple workers.
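To make the single-host mechanism concrete, here is a minimal sketch of wiring a namespace to a bridge by hand; all names here (br0, veth0/veth1, ns1) and the address are illustrative and not what Docker or our simulator actually uses:
# create a bridge on the host and bring it up
$ ip link add br0 type bridge
$ ip link set br0 up
# create a veth pair and move one end into the namespace
$ ip link add veth0 type veth peer name veth1
$ ip link set veth1 netns ns1
# attach the host end to the bridge
$ ip link set veth0 master br0
$ ip link set veth0 up
# configure the namespace end of the pair
$ ip netns exec ns1 ip link set veth1 up
$ ip netns exec ns1 ip addr add 172.18.0.2/24 dev veth1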
If you abstract away the problem, there are many techniques to achieve multi-host networking.
Another set of techniques instead brings these new endpoints in as first-class members of the underlay network. BGP is an example of this approach, where
the new containers can be connected to the existing BGP routers by advertising
routes. This approach, however, requires programming of the L3 devices that
connect the workers.
Since there are many different ways to approach this, the Kubernetes community has standardized an approach known as the “container network interface”, or CNI for short. In the next section, we shall look at the CNI.
4.4 CNI
Any container running under any runtime or orchestrator, such as Kubernetes, CRI-O, or Mesos, has certain requirements in terms of how its networking is managed. The requirements can vary across containers and be similar across runtimes. The CNI (container network interface) assists with this by providing a specification and support for plugins to configure the container network, as shown in Figure 4.3. Some plugins can, for example, (i) act as a bridge, creating an interface for a container and supporting intra-host communication, (ii) create an overlay using vxlan, or (iii) assign IP addresses: static, DHCP based, or host-local. The benefit of having these plugins is that they can be chained together based on the networking requirements: for example, a bridge plugin for the interface, vxlan for the overlay, and a bandwidth plugin to limit bandwidth utilization. CNI comprises a specification, a JSON file to specify the network requirements, and plugins to handle them. Some plugins widely used today are:
• Flannel: One of the most commonly used CNIs for Kubernetes. Flannel provides a layer 3 network between containers across multiple nodes. That is, it handles inter-node connectivity and delegates single-host networking to additional plugins. We will build our setup around Flannel in this chapter.
1. https://www.cni.dev/
2. https://github.com/flannel-io/cni-plugin
• Cilium: Cilium is an eBPF based solution that, along with networking, supports observability and security. One can add L3-L7 network policies using Cilium. Since it is built on eBPF, it also supports adding dynamic code at various hook points in the kernel.
• Calico: Calico is also a network and policy enforcement CNI that lets you choose the network dataplane: vanilla Linux, eBPF based, or DPDK based VPP. Calico is popular for its performance and scalability.
Now that we have seen how Kubernetes organizes its container networking, let us look at how we achieve it in our simulator. We try to keep our setup as close as possible to reality, so we also run a CNI plugin, specifically Flannel, to configure networking for our containers.
The real Kubernetes setup has mechanisms to call the CNI plugins when it creates and deletes containers. However, we don't have a real scheduler or orchestrator – just function calls that create containers (that is, network namespaces).
3. https://cilium.io/
4. https://www.tigera.io/tigera-products/calico/
“cnitool” is a development aid that allows you to run CNI plugins for a specific network namespace. You can check the documentation for more details.
5. https://www.cni.dev/docs/cnitool/
CNI_PATH=/opt/cni/bin
NETCONFPATH=/tmp/knetsim/<name>
cnitool add|del <name> /var/run/netns/<nsname>
We shall use a rather standard and well-known CNI plugin called Flannel in our setup. Before we get into how the networking itself works, let us look at the logistics of how Flannel is deployed. Normally, this is a task done at cluster setup time by Kubernetes administrators, but here we shall run the pieces manually in our simulator. This process will make it much clearer what is configured when.
1. A “FlannelD” binary runs as a DaemonSet (meaning one on each worker node) to manage per
worker configuration. We will run the process directly on our Mininet host.
2. “FlannelD” daemons connect to the cluster “etcd” setup to share and synchronize configuration.
Etcd is a distributed key-value store used to manage a lot of the configuration across masters
and workers in a Kubernetes cluster. Since we don’t have the default out-of-the box etcd cluster,
we run our own two node cluster – running as two new hosts in the Mininet topology.
3. The “flannel” CNI plugin binary is installed and available on each worker. Whenever containers come up or go down, this plugin is called. We make this plugin available on all of our worker hosts.
Now, that we have seen the pieces, what is the exact startup sequence in our
simulator?
1. Cluster wide configuration is loaded into etcd first. This includes global configuration such as
IP ranges, what inter-worker connectivity option to use, etc.
2. “FlannelD” daemons are started on each worker. These daemons read global configuration from
etcd and generate per worker configuration.
3. Setup CNI configuration to be used for containers on all workers. This is the file to be used by
cnitool to know which plugins to run.
4. Containers can now be created and deleted.
{
  "Network": "11.0.0.0/8",
  "SubnetLen": 20,
  "SubnetMin": "11.10.0.0",
  "SubnetMax": "11.99.0.0",
  "Backend": {
    "Type": "vxlan",
    "VNI": 100,
    "Port": 8472
  }
}
This configuration file specifies a few global options for this cluster:
1. We specify a container network of “11.0.0.0/8” for all containers managed by Flannel. This
means all containers will be assigned IPs of the form “11.xxx.xxx.xxx” in this cluster.
2. The next few options talk about subnets. Each worker is assigned a different subnet to use for
containers running on it. This allows a clean partition of IPs across workers, so there is no
assignment conflict and it becomes easy to route packets between workers.
3. The next few lines specify the backend to use. This is the technology used to connect containers
across workers. In this example, we use VXLAN tunnels between workers. The configuration
specifies the virtual network identifier (VNI) and port to use to setup the tunnels.
Notice that we use the same configuration in our simulator for all clusters.
Each cluster thus will have containers assigned to the same IP ranges. This is
generally acceptable, since containers form a virtual network and solutions like
Flannel are only concerned with connecting containers within a cluster.
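As a sketch of step 1 of the startup sequence, loading this configuration into etcd is a single write; here we assume an etcd v3 endpoint and Flannel's default key prefix (the simulator may do this slightly differently):
$ etcdctl put /coreos.com/network/config '{"Network": "11.0.0.0/8", "SubnetLen": 20, "SubnetMin": "11.10.0.0", "SubnetMax": "11.99.0.0", "Backend": {"Type": "vxlan", "VNI": 100, "Port": 8472}}'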
Once this configuration is loaded into the cluster etcd and all FlannelD daemons are started up, each daemon generates a local configuration. This can be found at the path "/tmp/knetsim/<worker>/flannel-subnet.env" in the simulator environment, where worker names are of the form “C0w1”, “C0w2”, etc.
FLANNEL_NETWORK=11.0.0.0/8
FLANNEL_SUBNET=11.10.224.1/20
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
(the exact contents of this file will vary between workers and runs).
The second line indicates that this FlannelD has carved out the subnet
“11.10.224.1/20” for use for containers on this worker.
Exercise: Look at the running containers in the worker and confirm that
the IP falls within the subnet. Create a few new containers and see that this
constraint is always respected. Check the subnets of all workers and see that
they are non-overlapping.
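One way to go about this exercise (worker and container names as in our default topology; the interface name inside the container is an assumption):
$ cat /tmp/knetsim/C0w1/flannel-subnet.env
$ cat /tmp/knetsim/C0w2/flannel-subnet.env
$ ./utils/exec 0 w1 c1
0:w1:c1> ip addr show eth0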
Finally, let us look at the CNI configuration being used on one of the
workers:
{
  "name": "C0w1",
  "type": "flannel",
  "subnetFile": "/tmp/knetsim/C0w1/flannel-subnet.env",
  "dataDir": "/tmp/knetsim/C0w1/flannel",
  "delegate": {"isDefaultGateway": true}
}
The “type” field tells the caller (“cnitool” in this case) to search for a binary named “flannel” to use as the CNI plugin. The “subnetFile” is the same file we saw above. The “dataDir” is used to store other generated files.
The “delegate” field is used to pass params to other CNI plugins being
called. This is important, since Flannel automatically creates configuration to
call the “bridge” and “host-local” plugins to manage single-host networking.
These are CNI plugins that come out of the box and operate much like their
Docker counterparts. Here, we set “isDefaultGateway” true to enable routing
of all traffic from the container through the Flannel provided bridge network.
This is needed since (as we saw earlier), our network namespaces are otherwise
empty of interfaces.
Now that we have understood how Flannel is configured and how it runs, let us examine the actual networking that it enables, as described in Figure 4.4.
With the environment running, open new shells to exec into the containers, so that we can ping them from each other.
Now, you can ping this container (running on worker 2) from worker 1:
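A sketch of the check, run from a shell on worker 1; the target is whatever IP Flannel assigned to the container on worker 2 in your run:
0:w1> ping -c 3 <ip of c2>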
You can follow the progress of the packets (while the ping above is running)
by examining the interfaces.
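For example, capturing ICMP packets on the Flannel bridge of worker 1 (any of the interfaces discussed below can be substituted):
0:w1> tcpdump -ni cni0 icmp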
You should notice similar output on the other end, i.e. the “cni0” interface
of “C0w2”. This “cni0” interface is the default bridge created by Flannel.
Now, examine the interfaces “flannel.100” and “eth0” (note: the eth0 interface is usually named “<worker>-eth0” by Mininet) on both workers for ICMP packets. What do you observe?
We see that the ICMP packets are visible on the “flannel.100” interfaces
but nothing is seen on the “eth0” interfaces. But, the packets are going from
worker 1 to worker 2 (and back) and these workers are only connected via their
“eth0” interfaces. So, how is this happening?
You can see traffic relevant to the tunnel, using a command as follows:
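One way to do this, run on worker 1 (the interface name follows our Mininet convention):
0:w1> tcpdump -ni C0w1-eth0 udp port 8472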
Now, if you run the ping, you will see some traffic here. But how do we confirm that this is indeed the same ping packets being encapsulated? We can do that with the “tshark” command.
6. https://tshark.dev/
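The capture command looks roughly like the following reconstruction (run on worker 1; the interface choice is ours):
0:w1> tshark -i C0w1-eth0 -V -d udp.port==8472,vxlan port 8472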
In this, we:
1. “-V”: run it in verbose mode
2. “-d udp.port==8472,vxlan”: define that UDP traffic on port 8472 be parsed as vxlan. (This is needed since the default port for vxlan is different.)
3. “port 8472”: filter traffic only on this port. (We do this just to reduce the noise in the capture.)
Frame 1: 148 bytes on wire (1184 bits), 148 bytes captured (1184 bits)
on interface
C0w1-eth0, id 0 [447/834]
Interface id: 0 (C0w1-eth0)
Interface name: C0w1-eth0
Encapsulation type: Ethernet (1)
Arrival Time: Mar 9, 2023 09:27:07.240587208 UTC
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1678354027.240587208 seconds
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
0000 89 aa 03 00 00 00 00 00 10 11 12 13 14 15 16 17 ................
0010 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 ........ !"#$%&’
0020 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37 ()*+,-./01234567
Data: 89aa030000000000101112131415161718191a1b1c1d1e1f202
122232425262728292a2b?
[Length: 48]
This shows one frame, which is a UDP packet on port 8472. Below this is the
packet decoded, which is seen as a VXLAN packet. Note that the VNI is 100,
which is what we setup in the Flannel config.
Inside the VXLAN packet is the actual payload, which contains an Ethernet
frame with an IPv4 packet. This is how tunneling works – normal packets are
encapsulated as payload into other packets on the underlay with the VXLAN
header. Tshark has decoded this payload packet as well and shows that it is
actually an ICMP packet!
See the source and destination IPs of the inner packet – these are the source
and destination containers. Note the src and dest IPs of the outer packet – these
are the worker IPs!
Exercise: Run this capture yourself and confirm that this is the same ICMP packet that you observe on the other, higher level interfaces. Examine the source and destination IPs of the inner packet in the capture.
Examine the Flannel logs in the "/tmp/knetsim" folder in the simulator environment.
4.10 Summary
In this chapter, we looked at the CNI, the reason why it is needed, what the spec
provides and how to use it. We then looked at one CNI plugin, Flannel, and saw
how to run it in our simulator to achieve container–container networking.
CHAPTER 5
Services
5.1 Introduction
In the last chapter we saw how individual containers can talk to each other. However, while that is needed, it is not enough to implement the service abstraction – to allow containers to talk to services.
1. https://kubernetes.io/docs/concepts/overview/components/#kube-proxy
2. https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
(There are other models, such as NodePort, which may involve no virtual IP for services.)
For the purposes of the simulator we shall only focus on the “kube-proxy” component and shall not implement the “kube-dns” component. The chief reason for this is that DNS is a well known approach and its application in Kubernetes is nothing out of the ordinary – unlike “kube-proxy”, which is a more custom component.
This interface declares containers c2 and c3 to form the backing pods of a service reachable at the virtual IP 100.64.10.1.
How does the simulator implement this interface? While kube-proxy uses IPTables, we internally use “nftables” to achieve the same effect. In the next few sections we shall go into the details of nftables.
Even though it has features that look similar to iptables in terms of rules, nftables has several benefits over iptables:
• Unification: With iptables, you had separate tools for ipv4, ipv6, and ARP rules (iptables, ip6tables, arptables). With nftables, one solution manages ipv4, ipv6, arp, and more.
• Flexibility: Similar to iptables, nftables supports chaining of rules. However, it does not start with built-in base chains and hence provides more flexibility.
• Performance: With iptables, every rule has to be evaluated for a packet even if it does not match, which adds unnecessary overhead. Nftables uses maps and concatenations to structure the ruleset more efficiently.
Figure 5.2 shows a high level view of how nftables compares to IPTables
if you are familiar with the older tool. The kernel modules implementing the
actual processing of packets are completely different, as is the API layer. However, to help users migrate their older scripts, a compatibility tool, iptables-nft, has been provided, which can run older iptables-format rules on the newer nftables plumbing.
3. https://www.nftables.org
In this section, we shall play with “nftables” to demonstrate its use. As before, start up a fresh instance of the simulator container. The environment already comes with nftables installed for use.
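To inspect the current state, list the ruleset:
$ nft list ruleset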
As expected, it will be empty. (Note that the simulator is not running in this
setup.)
For a full list of nft commands, refer to the official reference page.
4. https://wiki.nftables.org/wiki-nftables/index.php/Main_Page
Rules in “nft” are organized in chains which are placed inside tables. So, we
start with creating tables first.
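A sketch of the creation command, matching the description that follows:
$ nft add table ip table1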
This creates a new table “table1” meant to handle the ip family type. The allowed family types are: “arp”, “bridge”, “ip”, “ip6”, “inet”, and “netdev”.
You can check that the new (empty) table exists by listing the ruleset as before. Next, let us add a chain to the table:
nft add chain ip table1 chain1 { type filter hook output priority 0 \; policy accept \; }
Now “list ruleset” as before and check that the newly created chain has been
loaded into the table.
Now, we have a table and a chain, but nothing useful can happen without a
rule. So, let us create a new rule:
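As a representative example, consistent with the ping test that follows, we can add a rule that counts packets destined to Quad9's public resolver at 9.9.9.9:
$ nft add rule ip table1 chain1 ip daddr 9.9.9.9 counter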
You have seen the main components of a rule: a match condition and the
statement.
5. https://www.quad9.net/
$ ping -c 1 9.9.9.9
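After the ping, list the ruleset again; the counter on the rule above should have incremented:
$ nft list ruleset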
To delete rules, we need to use handles. First, list rules with handles:
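Listing with the -a flag prints rule handles, which can then be used to delete a specific rule (the handle number shown here is illustrative):
$ nft -a list table ip table1
$ nft delete rule ip table1 chain1 handle 2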
6. https://wiki.nftables.org/wiki-nftables/index.php/Quick_reference-nftables_in_10_minutes#Matches
List the ruleset and confirm that this rule has been dropped.
Now that you know how to create and delete nft elements, play around with
the nft command.
As an exercise, add a rule to “drop” packets to this IP and then try pinging.
It should no longer work.
Nftables has a lot of other features that we do not cover here, features that
make it much more powerful than IPTables. These include:
• Intervals to support ranges of IPs like “192.168.0.1–192.168.0.250” or ports "1–1024".
• Concatenation syntax to work on pairs of structures like (IP . port).
• Math operations like hashing, number series.
• Sets and maps: data structures to help decouple rules from the data it operates on.
• Quotas: rules that match only until a number of bytes have passed.
• Flowtables: a network stack fast path that lets established flows bypass the normal per-packet rule processing (and potentially be offloaded to hardware).
Earlier in this section, we talked about hook points; now is a good time to revisit them. Figure 5.3 shows the Netfilter hook points in the Linux kernel, with a focus only on the IP layer. One way to understand hook points is that they are places in the kernel's networking stack where you can stop, inspect, and act on packets, for example to modify them or control their flow. We have used nft in this chapter to manipulate packets at these hook points, but there are other mechanisms that rely on them as well; for example, eBPF programs can be run at these hook points.
Figure 5.3: Netfilter hook points in the kernel – focus only on the IP layer.
There are similar hook points in the ARP layer, not shown. As we mentioned earlier, not all operation types make sense for all hook points. For a full version of this diagram, refer to the official documentation on hook points.
7. https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks
On startup, we create a table and chain set to the hook point prerouting –
for packets originating from within pods. Similarly, we create a table and chain
attached to the hook point output – for packets originating on the hosts. We also
add an IP route using the following command on all workers:
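(A sketch; the 100.64.0.0/16 prefix here is only an assumption based on the VIPs used in our examples.)
0:w1> ip route add 100.64.0.0/16 dev flannel.100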
to route packets destined to our VIP range into the local flannel interface.
Then, for each new service added to our cluster with the command:
we create the following rule in the two chains that we created above:
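(A sketch of what such a rule can look like in nftables; the table and chain names, the VIP, and the backend pod IPs are all illustrative, and the chain is assumed to be a nat-type chain.)
$ nft 'add rule ip mytable prerouting ip daddr 100.64.10.1 dnat to numgen random mod 2 map { 0 : 11.10.224.2, 1 : 11.11.0.2 }'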
So, all in all, we construct a new rule for every service that randomly chooses one of the specified backends to send packets to.
You may wonder whether all packets of the same flow are sent to the same destination; otherwise the basic structure of the connection would not hold. This is already taken care of: the nat hook only applies to the first packet of a new flow, and subsequent packets automatically follow the translation applied to the first one. So, we do not have to do anything further to ensure correct operation.
CHAPTER 6
Exposing Services
6.1 Introduction
In the last chapter, we got the service abstraction up and running for pods in the cluster. Although a lot of inter-service communication happens within the cluster, the micro-service graph is first activated, and the whole process really starts, from an external request.
In this chapter, we will see how services can be exposed out of the cluster.
6.2 Ingress
This is where Ingress comes in. You may need to specify your Ingress resource as something like Listing 6.1.
1. https://kubernetes.io/docs/concepts/services-networking/ingress/
port:
number: 8000
Listing 6.1: Example Ingress YAML
How do these Ingress controllers work? They typically wrap an existing third party technology like HAProxy, Nginx, or Traefik, and program the underlying tool using the given configuration, translating the provided Ingress YAML into a format suitable for that tool.
This is because these tools – many predating Kubernetes – are all well established projects that focus on the key use case of acting as a service frontend, doing things like TLS termination, load balancing, ingress security controls, authentication, etc. Thus, they are well suited to play the role of the Ingress implementor.
As you may have guessed by now, we statically configure one of these tools to implement Ingress-like functionality in our simulator. The tool we have chosen is nginx.
Nginx calls itself an HTTP and reverse proxy server, with a number of features needed for HTTP serving and tools to programmatically process requests. It is a very well established tool and serves a large part of the internet.
2. https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
3. https://nginx.org/en/
Figure 6.1: Exposing two services under a single endpoint using Ingress.
C0.get("w1").run_ingress("grp1", 8001,
[{"path": "/svc1", "endpoint": "100.64.11.1:8000"},
{"path": "/svc2", "endpoint": "100.64.11.2:8000"}])
where we:
• Run an Ingress on worker w1 of C0 at port 8001.
• Expose two services under the two paths.
• Refer to the underlying services by the service VIPs we used in the last chapter. Note that if we had a DNS set up, we could use simple names and ports.
Figure 6.1 shows the Ingress setup we will achieve with the code above.
Assuming this works for the moment, you may wonder how we test it. So far, our services have been simple empty containers and we have only ever pinged them. But now we need to set up some sort of HTTP service to verify that the expose works.
To this end, we have included a simple HTTP server that serves a hardcoded string in the simulator (in the path “utils/ss.py”). So, to enable the two services above, we use code like the following:
Internally, our simulator runs nginx with a generated config that looks like
the following:
events{}
http {
server {
listen 8001;
location /svc1 { proxy_pass http://100.64.11.1:8000; }
location /svc2 { proxy_pass http://100.64.11.2:8000; }
}
}
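Before testing the Ingress, we can first hit one of the service VIPs directly from a worker; a sketch consistent with the VIPs configured above:
0:w1> wget -qO- http://100.64.11.1:8000/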
If you run this a few times, you should see one of the two responses, “S1-W2” and “S1-W3”. This is exactly what we saw in the last chapter – a basic service implementation.
Now, let us access the same service via the Ingress, from the same worker
node:
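A sketch of the check, using wget against the port we gave the Ingress above:
0:w1> wget -qO- http://localhost:8001/svc1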
Run it a few times and verify that the behavior is ok. Now, also try reaching
the “/svc2” path and check if it works as expected.
You can try now reaching this IP from outside, say the e1 node (which is not
a Kubernetes worker at all):
This is the Ingress working correctly, as expected. Just to summarize the flow now:
• The client on e1 connects to the nginx Ingress exposed at port 8001.
• Nginx on worker 1 reroutes the request to the service VIP 100.64.11.1, port 8000.
• Our kube-proxy equivalent (based on nftables) translates the VIP to one of the real backing pod IPs.
• Flannel routes the request to the correct location based on the pod IP.
You can see how the concepts we have introduced in each chapter have built
on top of the previous one.
CHAPTER 7
Multi-cluster Networking
7.1 Introduction
In the previous chapter we saw how services can be exposed outside of a cluster. While that is useful for interactions between end users and services running on a cluster, it does not generalize to the broader use cases of multi-cluster applications.
What are these multi-cluster or multi-cloud applications? Why are they the
future?
While a single cluster of machines scales, and scales well, it has certain limitations; for example, all worker nodes are within a single zone in a data center.
Having multiple clusters is going to be the norm rather than the exception in the coming future.
• An application will run across multiple clouds (from the same or different providers) for the
purpose of geo-distribution. Note that this is different from simple sharding, where each data
center caters to one section of users. This involves app micro services in multiple clouds talking
to each other to synchronize state etc.
• A lot of deployments are hybrid-cloud: some portion of the app runs on public cloud providers and some on internal enterprise data centers. The portions running on the internal sites are typically the crown jewels – due to security, proprietary hardware, or other requirements. In these cases, there is a certain amount of communication between the two clouds in the normal workflows.
Figure 7.1: Example of a multi-cloud scenario spanning multiple public clouds, private clouds
and edge clouds.
• Edge/Telco clouds are an emerging field – where Kubernetes clusters are run close to the end
user. Some amount of latency sensitive processing is best offloaded to such clouds; however
these clouds will have limited resources, such as computing and storage, and thus cannot run
entire cloud apps. This is yet another scenario where enabling cross-cloud networking is a must.
Clearly, all of these are problems faced within a single cluster too, but, as
we saw, there are several layers for in-cluster networking and several tools for
each layer that solve these problems.
In this chapter, we want to achieve the blue line shown in Figure 7.2. That is, we
have multiple clusters connected with some sort of underlay that connects them;
Figure 7.2: Extending our simulator to allow pods to talk between clusters. We want to achieve
the blue line in this chapter.
thus workers can talk between clusters if needed, although they will typically
be behind a firewall and not allowed to freely communicate. Each cluster is an
independent solution – none of the lower layers that we have discussed up until
now should be changed.
1. https://docs.cilium.io/en/stable/network/clustermesh/
2. https://istio.io/latest/docs/ops/configuration/traffic-management/multicluster/
3. https://www.consul.io/docs/k8s/installation/multi-cluster
4. https://github.com/submariner-io/submariner
5. https://skupper.io/index.html
7.3 Skupper
Skupper is a layer 7 service interconnect: it lets services communicate across clusters without any low-level changes to the networking setup. All inter-cluster communication is secured using mutual TLS, and Skupper does not need any admin privileges on the clusters it works on.
"router", {
"id": "c0",
"mode": "interior",
"helloMaxAgeSeconds": "3",
"metadata": "{\"id\":\"c0\",\"version\":\"1.0.2\"}"
}
where we identify the name of the Skupper router. We have one router per cluster (which we run on workern1). The mode is identified as either “interior” or “edge” – the only difference being that edge routers terminate the routing graph and don't connect outward to any other routers. In our use case, we mark every cluster's router as an “interior” router.
"listener",
{
"name": "interior-listener",
"role": "inter-router",
"port": 55671,
"maxFrameSize": 16384,
"maxSessionFrames": 640
}
"connector",
{
"name": "link1",
"role": "inter-router",
"host": "<ip of remote>",
"port": "55671",
"cost": 1,
"maxFrameSize": 16384,
"maxSessionFrames": 640
}
Note how the port set in the connector is the port used by the listener on the remote side. This section of the config tells the Skupper router to form a link with a remote Skupper router, with some config parameters. The “cost” parameter can be used to set up weights for the links, for example.
Now that we have seen the config elements relevant to connecting clusters, how are services exposed and consumed? Much like the “listener” and “connector” sections we saw just now, Skupper also has “tcpConnector” and “tcpListener” sections, which look like the following.
"tcpConnector",
{
"name": "backend",
"host": "localhost",
"port": "8090",
"address": "backend:8080",
"siteId": "c1"
}
where we indicate to the Skupper router (on site c1) that it should create a new service with the address “backend:8080” (a string identifier) and connect to “host” and “port” for the actual service. The “name” is only used to differentiate between tcpConnectors on the router. Note that we have set the address to refer to a different port purely for illustration; any normal string would work as well.
"tcpListener",
{
"name": "backend:8080",
"port": "1028",
"address": "backend:8080",
"siteId": "c0"
}
which, on the Skupper router for cluster c0, creates a new local port 1028 to reach the service tagged with the address “backend:8080”. What this does is set up the underlying Skupper network so that local services can reach <local skupper ip>:1028 and get routed to the remote service on cluster c1.
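In other words, a client in cluster c0 can now reach the remote backend through its local Skupper router, roughly like this (the address placeholder is illustrative):
$ wget -qO- http://<local skupper ip>:1028/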
We set up the Skupper portions in our simulator with code that generates configurations like the ones shown above.
Understand the steps above and run your own wget checks to verify that the
connectivity is working. Now, there are a number of exercises you can do here:
• Look at the generated Skupper conf files in the “/tmp/knetsim/skupper” directory in the simulator
environment.
• We exposed a pod IP instead of a service VIP for simplicity, but, based on your understanding of
the previous chapters, it should be easy to see how multi-cluster connectivity would work with
either service VIPs or Ingress. Check your understanding by modifying the code above to work
in one of those two modes.
• Follow the flow of the packets, using tcpdump and tshark as shown above. Note: this can be
challenging with the way Skupper works in encapsulating packets. Look into the AMQP network
protocol for documentation on how the links between clusters are managed.
• Extend the setup to have more clusters. Try out different topologies, or update the code to change
link weights and follow the packet flow.
7.5 Summary
CHAPTER 8
Retrospective
We are at the end of the deep dive. Hopefully you learnt a thing or two!
This hands-on approach should have given you a peek under the hood of the complex machine that is Kubernetes. Where to from here?
• Although we have nudged the reader to look into the full code of the simulator, this may be a good point to reinforce the suggestion. You understand most of it already.
• For simplicity, we omitted several variants of the technologies shown. For example, we did not
explore how a DNS is used in Kubernetes. Similarly, we only looked at the IPTables variant of
kube-proxy (which is the most common) and not at the others such as IPVS based services.
We only explored the VXLAN tunneling approach to cross worker connectivity, though a lot of
other options are available. We hope that this introduction will allow you to confidently approach
these variants.
• We didn’t go into the more advanced functionalities possible in each layer. Now may be a time
to explore features such as connection security (using encryption), policies, observability, etc.
• In this book, we didn’t explore the space of service meshes – mainly in the interests of simplicity.
We invite the reader to explore the space and understand how it fits with the layers shown. The
enterprising reader is invited to think about how we can extend the simulator we have built
to demonstrate service meshes in a simple manner, and contribute back to our open-source
simulator.
• Now that you have a solid grasp of the fundamentals that underpin them, take a second look at
the Kubernetes and CNCF ecosystem. We suggest you approach the projects layer by layer as
we have shown here: starting from container engines to CNI plugins to Ingress Controllers and
finally to multi-cluster solutions. There are many alternatives that work at each layer offering
various features, though the basics remain the same.
Happy journey!
Index
C
cloud native networking, 6
containers, 6, 7, 15, 21, 22, 24, 37, 52, 65
K
kubernetes, 6, 9, 37, 49, 50
M
multi cloud networking, ix
mininet, 12, 13, 14, 54