Observing Enterprise Kubernetes Clusters at Scale
Joe Salisbury
@salisbury_joe
Product Owner - Internal Platform Team
2
Giant Swarm manages Kubernetes
clusters for enterprises
3
Control plane for managing
Kubernetes clusters
4
Scale
- ~35 people
- 100s of Clusters
- 1000s of Nodes
- EU, USA, China
5
Providers
- AWS
- Azure
- On-Prem
6
Giant Swarm takes care of your
infrastructure
7
Fully managed
==
Responsible for everything
8
What is Everything?
- Managed Apps
- Kubernetes
- Actual Infrastructure
9
Responsible for everything
==
Monitoring for everything
10
Observing Kubernetes
11
Monitoring Domains
- Metrics
- Logging
- Tracing
12
Logging
- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
- Lighter, Prometheus / Grafana integration
13
Tracing
- Looking at Jaeger
- Helpful for our API services (request-response)
- Tip of the iceberg
- Most likely will kill these in future
- Still researching tracing for operators
- Async background processing
- Lots of small traces
14
Metrics -> Prometheus
15
Our Prometheus Journey
- Present
- Pains
- Plans
16
Present
Monitoring is an evolutionary process
17
[Diagram: control plane and tenant clusters]
18
‘We need to monitor clusters’
- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
19
- Dependencies
20
[Diagram: control plane network 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255), divided with a /24 mask]
21
- Configuration
- Automatically adding tenant clusters to Prometheus
22
prometheus-config-controller
- Sidecar for Prometheus
23
[Diagram: prometheus-config-controller reads ChartConfig CRs and certificates, writes the certificates to a volume, and reloads Prometheus]
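Concretely, the sidecar's job is: watch the CRs and certificates on the control plane, render a Prometheus scrape config onto a shared volume, and ask Prometheus to reload. A minimal sketch of that loop in Go, assuming a hypothetical listTenantClusters helper in place of the real CR watch, illustrative paths and cluster names, and Prometheus running with --web.enable-lifecycle (this is not the actual prometheus-config-controller code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"text/template"
	"time"
)

// TenantCluster is a minimal, illustrative view of one tenant cluster.
type TenantCluster struct {
	ID        string // e.g. "abc12"
	APIServer string // e.g. "master.abc12.example.com:443"
}

// scrapeConfigTmpl renders one apiserver scrape job per tenant cluster,
// using per-cluster client certificates mounted on a shared volume.
const scrapeConfigTmpl = `scrape_configs:
{{- range . }}
- job_name: {{ .ID }}-apiserver
  scheme: https
  tls_config:
    ca_file: /certs/{{ .ID }}/ca.pem
    cert_file: /certs/{{ .ID }}/crt.pem
    key_file: /certs/{{ .ID }}/key.pem
  static_configs:
  - targets: ['{{ .APIServer }}']
{{- end }}
`

// listTenantClusters is a stand-in for watching the cluster CRs on the
// control plane; a real controller would use the Kubernetes API.
func listTenantClusters() []TenantCluster {
	return []TenantCluster{{ID: "abc12", APIServer: "master.abc12.example.com:443"}}
}

// writeConfig renders the scrape config to the volume shared with Prometheus.
func writeConfig(path string, clusters []TenantCluster) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return template.Must(template.New("cfg").Parse(scrapeConfigTmpl)).Execute(f, clusters)
}

// reloadPrometheus asks Prometheus to pick up the new config; this needs
// Prometheus to run with --web.enable-lifecycle.
func reloadPrometheus(url string) error {
	resp, err := http.Post(url+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	for {
		clusters := listTenantClusters()
		if err := writeConfig("/config/prometheus.yml", clusters); err != nil {
			log.Fatal(err)
		}
		if err := reloadPrometheus("http://localhost:9090"); err != nil {
			log.Println(err)
		}
		time.Sleep(time.Minute)
	}
}
```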
24
[Diagram: control plane Prometheus scraping the tenant clusters]
25
also add node-exporter, ingress-controllers, coredns, custom exporters, all the control plane services, the kitchen sink...
26
- AlertManager & OpsGenie
27
[Diagram: each installation runs its own Prometheus and Alertmanager, and the installations alert on each other - when Installation 2's Alertmanager is down, Installations 1 and 3 page for it]
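The point is that each installation also keeps an eye on the other installations' Alertmanagers, so a dead Alertmanager still gets paged on. As a rough illustration of the idea only (in practice this would be Prometheus scraping the peers plus an up == 0 style alert routed to OpsGenie, not a standalone probe), a tiny health check against Alertmanager's /-/healthy endpoint, with placeholder peer URLs:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// peers lists the Alertmanagers of the other installations; the URLs here
// are illustrative placeholders.
var peers = []string{
	"https://alertmanager.installation2.example.com",
	"https://alertmanager.installation3.example.com",
}

// healthy probes Alertmanager's built-in /-/healthy endpoint.
func healthy(base string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(base + "/-/healthy")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	for _, p := range peers {
		if !healthy(p) {
			// In a real setup this would be an alert routed to OpsGenie,
			// not a log line.
			log.Printf("peer Alertmanager down: %s", p)
		}
	}
}
```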
28
And it works!
- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes
29
Pains
Roll for Initiative
30
Prometheus Memory Usage
- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage
31
- Currently forced to scale vertically
- Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
- Trend will only continue
32
Prometheus v2.9.1 (from v2.6.0)
- Go 1.12!
33
- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters
- We can drop metrics we don’t need (e.g: cadvisor for customer workloads) as needed (sketch below)
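Dropping unneeded series normally happens at scrape time via Prometheus relabelling; the sketch below just mimics that filter in Go to show the idea (the metric prefix and namespace list are illustrative assumptions, not our production rules):

```go
package main

import (
	"fmt"
	"regexp"
)

// These mimic a Prometheus metric_relabel_configs entry with action: drop.
// The metric prefix and the list of system namespaces are illustrative.
var (
	cadvisorSeries   = regexp.MustCompile(`^container_`)
	systemNamespaces = regexp.MustCompile(`^(kube-system|giantswarm)$`)
)

// keep reports whether a scraped series should be kept: cadvisor series are
// only kept for system namespaces, everything else passes through.
func keep(metricName, namespace string) bool {
	if cadvisorSeries.MatchString(metricName) {
		return systemNamespaces.MatchString(namespace)
	}
	return true
}

func main() {
	fmt.Println(keep("container_memory_usage_bytes", "customer-app")) // false: dropped
	fmt.Println(keep("container_memory_usage_bytes", "kube-system"))  // true: kept
	fmt.Println(keep("kube_pod_info", "customer-app"))                // true: kept
}
```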
34
Reliability
- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
- e.g: lose monitoring for some percentage of tenant clusters
35
Querying
- Having separate installations is great most of the time
- Pain in the ass for querying
- Digging into a global view
- Have to look at multiple Grafanas
- Percentage of data we see will decrease over time (human patience is a constant)
36
Plans
A collection of ideas for the future
37
Goal for 2019 is to improve the scalability of our metrics infrastructure
38
Addressing Prometheus Scaling
- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)
39
- prometheus-operator
- Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures Prometheus CRs for prometheus-operator (sketch below)
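A minimal sketch of what that new operator's reconcile loop could look like, with the Kubernetes client abstracted behind an interface (the type names, the one-Prometheus-per-cluster default, and the naming scheme are assumptions for illustration, not the real prometheus-config-operator):

```go
package main

import (
	"fmt"
	"log"
)

// Cluster is a minimal, illustrative view of one of our Cluster CRs.
type Cluster struct {
	ID string
}

// PrometheusCR is a minimal view of the prometheus-operator Prometheus CR we
// want to exist for each tenant cluster.
type PrometheusCR struct {
	Name      string
	Namespace string
	Replicas  int
}

// Client abstracts the Kubernetes API calls the operator needs; a real
// operator would back this with client-go or controller-runtime.
type Client interface {
	ListClusters() ([]Cluster, error)
	EnsurePrometheus(cr PrometheusCR) error
}

// reconcile codifies the topology in one place: one Prometheus per tenant
// cluster. Supporting multiple Prometheus servers per cluster later would
// only mean emitting more CRs from this loop.
func reconcile(c Client) error {
	clusters, err := c.ListClusters()
	if err != nil {
		return err
	}
	for _, cluster := range clusters {
		cr := PrometheusCR{
			Name:      fmt.Sprintf("prometheus-%s", cluster.ID),
			Namespace: cluster.ID,
			Replicas:  1,
		}
		if err := c.EnsurePrometheus(cr); err != nil {
			return err
		}
	}
	return nil
}

// fakeClient stands in for the Kubernetes API so the sketch runs on its own.
type fakeClient struct{}

func (fakeClient) ListClusters() ([]Cluster, error) {
	return []Cluster{{ID: "abc12"}, {ID: "def34"}}, nil
}

func (fakeClient) EnsurePrometheus(cr PrometheusCR) error {
	fmt.Printf("ensuring Prometheus CR %s/%s\n", cr.Namespace, cr.Name)
	return nil
}

func main() {
	if err := reconcile(fakeClient{}); err != nil {
		log.Fatal(err)
	}
}
```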
40
[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches the Prometheus CRs and ensures the Prometheus servers]
41
[Diagram: control plane and tenant clusters]
42
Codify our Prometheus topology in one service
43
- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator
44
- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
- We can transition into it
- e.g: prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster
- Benefit of having topology codified in one operator
45
- Sharding Prometheus allows us to scale horizontally
- Increases scalability and reliability
- Can scale control plane horizontally
- Failure modes are better
46
Global Observability
- Still early days
- Let’s try Cortex!
- All Prometheus servers use remote write to write to a Cortex backend (sketch below)
- Use Cortex for global querying (one Grafana to rule them all)
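Wiring this up is mostly configuration: each per-cluster Prometheus gets a remote_write entry pointing at Cortex, which prometheus-config-operator can stamp onto every Prometheus CR it creates. A rough sketch that renders such a CR (the Cortex push URL, namespace, and naming scheme are illustrative assumptions):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// prometheusCR builds a minimal prometheus-operator Prometheus manifest for
// one tenant cluster, with remote write pointed at Cortex. The Cortex URL
// and naming here are illustrative, not our production values.
func prometheusCR(clusterID, cortexURL string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "monitoring.coreos.com/v1",
		"kind":       "Prometheus",
		"metadata": map[string]interface{}{
			"name":      fmt.Sprintf("prometheus-%s", clusterID),
			"namespace": clusterID,
		},
		"spec": map[string]interface{}{
			"replicas": 1,
			// Every shard writes its samples to the same Cortex backend,
			// which then serves global queries for the single Grafana.
			"remoteWrite": []map[string]interface{}{
				{"url": cortexURL},
			},
		},
	}
}

func main() {
	cr := prometheusCR("abc12", "http://cortex.example.com/api/prom/push")
	out, err := json.MarshalIndent(cr, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```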
47
Empowerment
48
What does this help us do in the future?
49
Giant Swarm builds and operates one product
No custom infrastructure
50
Feedback loop
- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy
51
Learnings from one installation rolled out to all customers
52
- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop
53
Good observability is not just reactive
54
Bam! What questions do you have?
55
Thank you!
Joe Salisbury
@salisbury_joe
- e.g: Adidas reports issue with 95th percentile DNS latency
- Add alerting for high 95th percentile DNS latency
- Improve DNS dashboard to better show distribution
- Update default CoreDNS configuration to mitigate (autopath)
- Fix lib-musl issue (don’t use the library)