Observing Enterprise Kubernetes Clusters at Scale

This document discusses Giant Swarm's journey in monitoring Kubernetes clusters at scale. It began by using a single Prometheus server on the control plane to monitor all tenant clusters, but this led to scalability issues as the number of clusters grew. The document outlines plans to improve scalability through Prometheus horizontal scaling: running a separate Prometheus instance for each tenant cluster, managed by a new operator. This allows metrics to be sharded across Prometheus servers for improved reliability and scalability as the number of monitored clusters increases.


Observing Enterprise Kubernetes Clusters at Scale

Joe Salisbury
@salisbury_joe
Product Owner - Internal Platform Team

How do we empower Product teams?

2
Giant Swarm manages Kubernetes clusters for enterprises

3
Control plane for managing Kubernetes clusters

All Kubernetes clusters completely managed

4
Scale
- ~35 people
- 100s of Clusters
- 1000s of Nodes
- EU, USA, China

5
Providers
- AWS
- Azure
- On-Prem

6
Giant Swarm takes care of your infrastructure

You focus on your business value

7
Fully managed
==
Responsible for everything

8
What is Everything?
- Managed Apps
- Kubernetes
- Actual Infrastructure

9
Responsible for everything
==
Monitoring for everything

10
Observing Kubernetes

11
Monitoring Domains
- Metrics
- Logging
- Tracing

12
Logging
- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
  - Lighter, with Prometheus / Grafana integration

13
Tracing
- Looking at Jaeger
- Helpful for our API services (request-response)
  - Tip of the iceberg
  - Most likely will kill these in the future
- Still researching tracing for operators
  - Async background processing
  - Lots of small traces

14
Metrics -> Prometheus

15
Our Prometheus Journey
- Present
- Pains
- Plans

16
Present

Monitoring is an evolutionary process

17
[Diagram: the control plane (API, operators, monitoring) alongside the tenant clusters (API server, kubelets, etc.)]

18
‘We need to monitor clusters’
- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
- This was maybe a good idea at the time

19
Dependencies
- Tenant clusters routable from the control plane
- Peering / IPAM

20
control plane: 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255)
tenant clusters: 10.1.0.0/16 (10.1.0.0 -> 10.1.255.255), split into one /24 per cluster

[Diagram: Control Plane VPC (10.0.0.0/16) peered with Tenant Cluster VPCs at 10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24, ...; see the sketch below]

21
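To make the IPAM scheme concrete, here is a minimal Go sketch (illustrative only, not Giant Swarm's actual code) that carves per-tenant-cluster /24 subnets out of the 10.1.0.0/16 tenant range shown above:

    package main

    import (
        "fmt"
        "net/netip"
    )

    // tenantSubnet returns the /24 assigned to the n-th tenant cluster,
    // e.g. n=0 -> 10.1.0.0/24, n=1 -> 10.1.1.0/24.
    func tenantSubnet(tenantRange netip.Prefix, n int) (netip.Prefix, error) {
        if n < 0 || n > 255 {
            return netip.Prefix{}, fmt.Errorf("only 256 /24s fit in a /16, got index %d", n)
        }
        base := tenantRange.Addr().As4()
        base[2] = byte(n) // the third octet selects the /24 within the /16
        return netip.PrefixFrom(netip.AddrFrom4(base), 24), nil
    }

    func main() {
        tenantRange := netip.MustParsePrefix("10.1.0.0/16")
        for i := 0; i < 3; i++ {
            p, _ := tenantSubnet(tenantRange, i) // error is always nil for i < 256
            fmt.Println(p)                       // 10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24
        }
    }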
Configuration
- Automatically adding tenant clusters to Prometheus

22
prometheus-config-controller
- Sidecar for Prometheus
- Watches for Kubernetes Custom Resources
- Updates the Prometheus ConfigMap
- Fetches certificates, shares them via emptyDir
- Reloads Prometheus on changes (sketch below)
23
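A rough sketch of that sync loop in Go, assuming client-go and a Prometheus started with --web.enable-lifecycle. The Cluster type and the renderScrapeConfigs helper are hypothetical stand-ins for illustration, not the real prometheus-config-controller:

    package controller

    import (
        "context"
        "fmt"
        "net/http"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // Cluster is a stand-in for the tenant-cluster custom resource being watched.
    type Cluster struct{ Name, APIEndpoint string }

    // renderScrapeConfigs turns the list of tenant clusters into a prometheus.yml body.
    func renderScrapeConfigs(clusters []Cluster) string {
        cfg := "scrape_configs:\n"
        for _, c := range clusters {
            cfg += fmt.Sprintf("- job_name: %s\n  static_configs:\n  - targets: [%q]\n", c.Name, c.APIEndpoint)
        }
        return cfg
    }

    // sync writes the rendered config into the ConfigMap mounted by Prometheus,
    // then asks the co-located Prometheus to reload its configuration.
    func sync(ctx context.Context, k8s kubernetes.Interface, clusters []Cluster) error {
        cm := &corev1.ConfigMap{
            ObjectMeta: metav1.ObjectMeta{Name: "prometheus", Namespace: "monitoring"},
            Data:       map[string]string{"prometheus.yml": renderScrapeConfigs(clusters)},
        }
        if _, err := k8s.CoreV1().ConfigMaps("monitoring").Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
            return err
        }
        // Reload endpoint requires Prometheus to run with --web.enable-lifecycle.
        _, err := http.Post("http://127.0.0.1:9090/-/reload", "", nil)
        return err
    }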
[Diagram: prometheus-config-controller watches cluster CRs and certificates, syncs the Prometheus ConfigMap and a certificate volume, and reloads Prometheus, which reads both]
24
[Diagram: a single Prometheus on the control plane scrapes every tenant cluster (API server, kubelets, etc.)]

25
Also add node-exporter, ingress-controllers, CoreDNS, custom exporters, all the control plane services, the kitchen sink...

26
- AlertManager & OpsGenie

- Heartbeats for each installation


- Always firing alert in Prometheus
- Special routing to OpsGenie in AlertManager
- Heartbeat support in OpsGenie (page if no
ping)

27
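A minimal sketch of the heartbeat plumbing, written here as Go string constants holding the two config fragments. Rule names, labels, intervals, and the receiver URL are assumptions for illustration, not Giant Swarm's actual configuration:

    package heartbeat

    // Always-firing Prometheus rule: vector(1) always returns a sample, so the
    // alert never resolves and keeps flowing through AlertManager.
    const heartbeatRule = `
    groups:
    - name: heartbeat
      rules:
      - alert: Heartbeat
        expr: vector(1)
        labels:
          type: heartbeat
    `

    // AlertManager route: heartbeat alerts bypass the normal paging path and are
    // re-sent frequently to a receiver that forwards them as OpsGenie heartbeat
    // pings; OpsGenie pages if the pings stop arriving. The forwarder URL is a
    // placeholder.
    const heartbeatRoute = `
    route:
      receiver: default
      routes:
      - match:
          type: heartbeat
        receiver: opsgenie-heartbeat
        repeat_interval: 1m
    receivers:
    - name: default
    - name: opsgenie-heartbeat
      webhook_configs:
      - url: https://heartbeat-forwarder.example/ping
    `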
[Diagram: Installations 1, 2, and 3 each run their own Prometheus and AlertManager; Installation 2's heartbeat stops, and OpsGenie pages: 'Installation 2 is down, ding ding ding']
28
And it works!
- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes

29
Pains

Roll for Initiative

30
Prometheus Memory Usage
- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage

31
- Currently forced to scale vertically
- Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
- Trend will only continue

32
Prometheus v2.9.1 (from v2.6.0)

- Go 1.12!
33
- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters
- We can drop metrics we don’t need (e.g: cadvisor for customer workloads) as needed
- But, not a long term solution

34
Reliability
- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
  - e.g: lose monitoring for some percentage of tenant clusters

35
Querying
- Having separate installations is great most of the time
- Pain in the ass for querying
  - Digging into a global view
  - Have to look at multiple Grafanas
- Percentage of data we see will decrease over time (human patience is a constant)

36
Plans

A collection of ideas for the future

37
Goal for 2019 is to improve the scalability of our metrics infrastructure

38
Addressing Prometheus Scaling
- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)

39
- prometheus-operator
  - Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures CRs for prometheus-operator (sketch below)

40
[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches the Prometheus CRs and ensures the Prometheus servers]
41
[Diagram: the control plane now runs one Prometheus per tenant cluster, each scraping its own cluster (API server, kubelets, etc.)]

42
Codify our Prometheus topology in one service

43
- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator

44
- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
- We can transition into it (see the sketch below)
  - e.g: prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster
- Benefit of having topology codified in one operator

45
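One way such a split could work is to assign each scrape target deterministically to one of N shards, so every Prometheus CR for the cluster scrapes a disjoint subset (the same idea behind Prometheus's hashmod relabelling). A tiny, hypothetical Go sketch:

    package sharding

    import "hash/fnv"

    // shardFor maps a scrape target to a shard index in [0, shards).
    // shards must be > 0; the same target always lands on the same shard.
    func shardFor(target string, shards int) int {
        h := fnv.New32a()
        h.Write([]byte(target))
        return int(h.Sum32() % uint32(shards))
    }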
- Sharding Prometheus allows us to scale horizontally
- Increases scalability and reliability
  - Can scale control plane horizontally
  - Failure modes are better

46
Global Observability
- Still early days
- Let’s try Cortex!
- All Prometheus servers remote-write to a Cortex backend (sketch below)
- Use Cortex for global querying (one Grafana to rule them all)
- Keep alerting at the installation level

47
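On the Prometheus side this boils down to a remote_write stanza pointing at Cortex. A minimal illustration as a Go string constant; the URL is a placeholder, and the exact push endpoint, auth, and tenancy depend on the Cortex deployment:

    package globalview

    // Hypothetical remote_write stanza sending every sample to a shared Cortex
    // backend for global querying; alerting rules stay local to the installation.
    const remoteWriteConfig = `
    remote_write:
    - url: https://cortex.example/api/prom/push
    `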
Empowerment

48
What does this help us do in the future?

49
Giant Swarm builds and operates one product

No custom infrastructure

50
Feedback loop

- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy

Detect, Fix, Deploy

51
Learnings from one installation rolled out to all customers

52
- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop

- Kind of the point of an internal platform team :D

53
Good observability is not just reactive

Aim to work proactively

54
Bam! What questions do you have?

Tobias is doing a workshop tomorrow!

55
Thank you!

Joe Salisbury
@salisbury_joe
- e.g: Adidas reports an issue with 95th percentile DNS latency
- Add alerting for high 95th percentile DNS latency (rule sketch below)
- Improve the DNS dashboard to better show the distribution
- Update the default CoreDNS configuration to mitigate (autopath)
- Fix the lib-musl issue (don’t use the library)

57
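For the 'add alerting' step above, the rule could look roughly like this (a hypothetical sketch as a Go string constant; the CoreDNS metric name and the 100ms threshold are assumptions that depend on the CoreDNS version and the workload):

    package dnsalerts

    // Hypothetical alerting rule for high 95th-percentile CoreDNS latency.
    const dnsLatencyRule = `
    groups:
    - name: coredns
      rules:
      - alert: CoreDNSHighP95Latency
        expr: |
          histogram_quantile(0.95,
            sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.1
        for: 10m
        annotations:
          summary: 95th percentile DNS latency is above 100ms
    `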
