Observing Enterprise Kubernetes Clusters at Scale
Joe Salisbury
@salisbury_joe
Product Owner - Internal Platform Team
2
Giant Swarm manages Kubernetes
clusters for enterprises
3
Control plane for managing
Kubernetes clusters
4
Scale
- ~35 people
- 100s of Clusters
- 1000s of Nodes
- EU, USA, China
5
Providers
- AWS
- Azure
- On-Prem
6
Giant Swarm takes care of your
infrastructure
7
Fully managed
==
Responsible for everything
8
What is Everything?
- Managed Apps
- Kubernetes
- Actual Infrastructure
9
Responsible for everything
==
Monitoring for everything
10
Observing Kubernetes
11
Monitoring Domains
- Metrics
- Logging
- Tracing
12
Logging
- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
- Lighter, Prometheus / Grafana integration
13
Tracing
- Looking at Jaeger
- Helpful for our API services (request-response)
- Tip of the iceberg
- Most likely will kill these in future
- Still researching tracing for operators
- Async background processing
- Lots of small traces
14
Metrics -> Prometheus
15
Our Prometheus Journey
- Present
- Pains
- Plans
16
Present
Monitoring is an evolutionary process
17
[Diagram: control plane and tenant clusters]
18
‘We need to monitor clusters’
- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
19
- Dependencies
20
[Diagram: control plane network 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255), divided with a /24 mask]
21
- Configuration
- Automatically adding tenant clusters to Prometheus
22
prometheus-config-controller
- Sidecar for Prometheus
23
[Diagram: prometheus-config-controller reads ChartConfig CRs and certificates, writes the certificates to a volume, and reloads Prometheus]
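Concretely, the sidecar's job is: watch the CRs and certificates on the control plane, render a Prometheus scrape config onto a shared volume, and ask Prometheus to reload. A minimal sketch of that loop in Go, assuming a hypothetical listTenantClusters helper in place of the real CR watch, illustrative paths and cluster names, and Prometheus running with --web.enable-lifecycle (this is not the actual prometheus-config-controller code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"text/template"
	"time"
)

// TenantCluster is a minimal, illustrative view of one tenant cluster.
type TenantCluster struct {
	ID        string // e.g. "abc12"
	APIServer string // e.g. "master.abc12.example.com:443"
}

// scrapeConfigTmpl renders one apiserver scrape job per tenant cluster,
// using per-cluster client certificates mounted on a shared volume.
const scrapeConfigTmpl = `scrape_configs:
{{- range . }}
- job_name: {{ .ID }}-apiserver
  scheme: https
  tls_config:
    ca_file: /certs/{{ .ID }}/ca.pem
    cert_file: /certs/{{ .ID }}/crt.pem
    key_file: /certs/{{ .ID }}/key.pem
  static_configs:
  - targets: ['{{ .APIServer }}']
{{- end }}
`

// listTenantClusters is a stand-in for watching the cluster CRs on the
// control plane; a real controller would use the Kubernetes API.
func listTenantClusters() []TenantCluster {
	return []TenantCluster{{ID: "abc12", APIServer: "master.abc12.example.com:443"}}
}

// writeConfig renders the scrape config to the volume shared with Prometheus.
func writeConfig(path string, clusters []TenantCluster) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return template.Must(template.New("cfg").Parse(scrapeConfigTmpl)).Execute(f, clusters)
}

// reloadPrometheus asks Prometheus to pick up the new config; this needs
// Prometheus to run with --web.enable-lifecycle.
func reloadPrometheus(url string) error {
	resp, err := http.Post(url+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	for {
		clusters := listTenantClusters()
		if err := writeConfig("/config/prometheus.yml", clusters); err != nil {
			log.Fatal(err)
		}
		if err := reloadPrometheus("http://localhost:9090"); err != nil {
			log.Println(err)
		}
		time.Sleep(time.Minute)
	}
}
```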
24
[Diagram: control plane Prometheus scraping the tenant clusters]
25
also add node-exporter, ingress-controllers, coredns, custom exporters, all the control plane services, the kitchen sink...
26
- AlertManager & OpsGenie
27
[Diagram: each installation runs its own Prometheus and Alertmanager, and the installations alert on each other - when Installation 2's Alertmanager is down, Installations 1 and 3 page for it]
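The point is that each installation also keeps an eye on the other installations' Alertmanagers, so a dead Alertmanager still gets paged on. As a rough illustration of the idea only (in practice this would be Prometheus scraping the peers plus an up == 0 style alert routed to OpsGenie, not a standalone probe), a tiny health check against Alertmanager's /-/healthy endpoint, with placeholder peer URLs:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// peers lists the Alertmanagers of the other installations; the URLs here
// are illustrative placeholders.
var peers = []string{
	"https://alertmanager.installation2.example.com",
	"https://alertmanager.installation3.example.com",
}

// healthy probes Alertmanager's built-in /-/healthy endpoint.
func healthy(base string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(base + "/-/healthy")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	for _, p := range peers {
		if !healthy(p) {
			// In a real setup this would be an alert routed to OpsGenie,
			// not a log line.
			log.Printf("peer Alertmanager down: %s", p)
		}
	}
}
```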
28
And it works!
- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes
29
Pains
Roll for Initiative
30
Prometheus Memory Usage
- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage
31
- Currently forced to scale vertically
- Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
- Trend will only continue
32
Prometheus v2.9.1 (from v2.6.0)
- Go 1.12!
33
- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters
- We can drop metrics we don’t need (e.g: cadvisor for customer workloads) as needed (sketch below)
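Dropping unneeded series normally happens at scrape time via Prometheus relabelling; the sketch below just mimics that filter in Go to show the idea (the metric prefix and namespace list are illustrative assumptions, not our production rules):

```go
package main

import (
	"fmt"
	"regexp"
)

// These mimic a Prometheus metric_relabel_configs entry with action: drop.
// The metric prefix and the list of system namespaces are illustrative.
var (
	cadvisorSeries   = regexp.MustCompile(`^container_`)
	systemNamespaces = regexp.MustCompile(`^(kube-system|giantswarm)$`)
)

// keep reports whether a scraped series should be kept: cadvisor series are
// only kept for system namespaces, everything else passes through.
func keep(metricName, namespace string) bool {
	if cadvisorSeries.MatchString(metricName) {
		return systemNamespaces.MatchString(namespace)
	}
	return true
}

func main() {
	fmt.Println(keep("container_memory_usage_bytes", "customer-app")) // false: dropped
	fmt.Println(keep("container_memory_usage_bytes", "kube-system"))  // true: kept
	fmt.Println(keep("kube_pod_info", "customer-app"))                // true: kept
}
```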
34
Reliability
- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
- e.g: lose monitoring for some percentage of tenant clusters
35
Querying
- Having separate installations is great most of the time
- Pain in the ass for querying
- Digging into a global view
- Have to look at multiple Grafanas
- Percentage of data we see will decrease over time (human patience is a constant)
36
Plans
A collection of ideas for the future
37
Goal for 2019 is to improve the scalability of our metrics infrastructure
38
Addressing Prometheus Scaling
- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)
39
- prometheus-operator
- Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures Prometheus CRs for prometheus-operator (sketch below)
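A minimal sketch of what that new operator's reconcile loop could look like, with the Kubernetes client abstracted behind an interface (the type names, the one-Prometheus-per-cluster default, and the naming scheme are assumptions for illustration, not the real prometheus-config-operator):

```go
package main

import (
	"fmt"
	"log"
)

// Cluster is a minimal, illustrative view of one of our Cluster CRs.
type Cluster struct {
	ID string
}

// PrometheusCR is a minimal view of the prometheus-operator Prometheus CR we
// want to exist for each tenant cluster.
type PrometheusCR struct {
	Name      string
	Namespace string
	Replicas  int
}

// Client abstracts the Kubernetes API calls the operator needs; a real
// operator would back this with client-go or controller-runtime.
type Client interface {
	ListClusters() ([]Cluster, error)
	EnsurePrometheus(cr PrometheusCR) error
}

// reconcile codifies the topology in one place: one Prometheus per tenant
// cluster. Supporting multiple Prometheus servers per cluster later would
// only mean emitting more CRs from this loop.
func reconcile(c Client) error {
	clusters, err := c.ListClusters()
	if err != nil {
		return err
	}
	for _, cluster := range clusters {
		cr := PrometheusCR{
			Name:      fmt.Sprintf("prometheus-%s", cluster.ID),
			Namespace: cluster.ID,
			Replicas:  1,
		}
		if err := c.EnsurePrometheus(cr); err != nil {
			return err
		}
	}
	return nil
}

// fakeClient stands in for the Kubernetes API so the sketch runs on its own.
type fakeClient struct{}

func (fakeClient) ListClusters() ([]Cluster, error) {
	return []Cluster{{ID: "abc12"}, {ID: "def34"}}, nil
}

func (fakeClient) EnsurePrometheus(cr PrometheusCR) error {
	fmt.Printf("ensuring Prometheus CR %s/%s\n", cr.Namespace, cr.Name)
	return nil
}

func main() {
	if err := reconcile(fakeClient{}); err != nil {
		log.Fatal(err)
	}
}
```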
40
[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches the Prometheus CRs and ensures the Prometheus servers]
41
[Diagram: control plane and tenant clusters]
42
Codify our Prometheus topology in one service
43
- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator
44
- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
- We can transition into it
- e.g: prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster
- Benefit of having topology codified in one operator
45
- Sharding Prometheus allows us to scale horizontally
- Increases scalability and reliability
- Can scale control plane horizontally
- Failure modes are better
46
Global Observability
- Still early days
- Let’s try Cortex!
- All Prometheus servers use remote write to write to a Cortex backend (sketch below)
- Use Cortex for global querying (one Grafana to rule them all)
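Wiring this up is mostly configuration: each per-cluster Prometheus gets a remote_write entry pointing at Cortex, which prometheus-config-operator can stamp onto every Prometheus CR it creates. A rough sketch that renders such a CR (the Cortex push URL, namespace, and naming scheme are illustrative assumptions):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// prometheusCR builds a minimal prometheus-operator Prometheus manifest for
// one tenant cluster, with remote write pointed at Cortex. The Cortex URL
// and naming here are illustrative, not our production values.
func prometheusCR(clusterID, cortexURL string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "monitoring.coreos.com/v1",
		"kind":       "Prometheus",
		"metadata": map[string]interface{}{
			"name":      fmt.Sprintf("prometheus-%s", clusterID),
			"namespace": clusterID,
		},
		"spec": map[string]interface{}{
			"replicas": 1,
			// Every shard writes its samples to the same Cortex backend,
			// which then serves global queries for the single Grafana.
			"remoteWrite": []map[string]interface{}{
				{"url": cortexURL},
			},
		},
	}
}

func main() {
	cr := prometheusCR("abc12", "http://cortex.example.com/api/prom/push")
	out, err := json.MarshalIndent(cr, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```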
47
Empowerment
48
What does this help us do in the future?
49
Giant Swarm builds and operates one product
No custom infrastructure
50
Feedback loop
- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy
51
Learnings from one installation rolled out to all customers
52
- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop
53
Good observability is not just reactive
54
Bam! What questions do you have?
55
Thank you!
Joe Salisbury
@salisbury_joe
- e.g: Adidas reports issue with 95th percentile DNS latency
- Add alerting for high 95th percentile DNS latency
- Improve DNS dashboard to better show distribution
- Update default CoreDNS configuration to mitigate (autopath)
- Fix lib-musl issue (don’t use the library)