Troubleshooting and Workaround in Kubernetes
瑞嘉軟體 Jack Kuo
Speaker
Jack Kuo
Current focus areas
○ Agile
○ DevOps
○ Cloud Native
Agenda
● About Us
● Case 1: Kubernetes Pod Restart
○ CrashLoopBackOff
○ ThreadPool
○ Warmup Setting
● Case 2: Client Get 504 Timeout
○ Network Troubleshooting
○ Prometheus Metrics
● Case 3: P99 Latency Is High
○ Resource Issue
○ Inconsistent Performance
● Takeaways
About Us
SRE Team: Area of Responsibility
Infra Team: Area of Responsibility
Not Just Troubleshooting
But Also Workarounds
https://static.learnk8s.io/cbe079ed7f6e7764445aa746b2c9f295.png
Case 1
Kubernetes Pod Restart
Kubernetes Pod Restart - CrashLoopBackOff
● Application Issue
○ Code Error
○ Environment Config Error
○ Dependent Applications Not Ready
● Resource Issue
○ CPU
○ Memory
● Health Check
○ Startup Probe
○ Liveness Probe
○ Readiness Probe
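Whatever the root cause in the list above, the CrashLoopBackOff state itself follows a fixed pattern: the kubelet restarts the failing container with an exponential back-off that starts at 10 seconds, doubles after each restart, and is capped at 5 minutes. A minimal Python sketch of that schedule follows; the function name and parameters are illustrative and not part of any Kubernetes API.

```python
def crashloop_backoff_delays(restarts, base_s=10, cap_s=300):
    """Return the wait in seconds before each of the first `restarts` restarts."""
    delays = []
    delay = base_s
    for _ in range(restarts):
        # The delay doubles each time but never exceeds the cap.
        delays.append(min(delay, cap_s))
        delay *= 2
    return delays


if __name__ == "__main__":
    print(crashloop_backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After a handful of crashes the Pod is mostly waiting rather than running, which is why a restart counter that climbs slowly can still hide a container that never starts successfully.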
So... what about that unknown state?
Kubernetes Pod Restart - ThreadPool
.NET ThreadPool
● ThreadPool.SetMaxThreads
○ Bigger is not necessarily better
○ Context Switching is expensive
● ThreadPool.SetMinThreads
○ The pool does not always keep at least MinThreads threads alive
○ MinThreads is the number of threads the pool creates on demand without delay; beyond that, thread creation is throttled
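To see why MinThreads matters under a traffic burst, here is a rough back-of-the-envelope model in Python. It assumes every work item in the burst blocks and needs its own thread, and that beyond MinThreads the pool adds roughly one thread every 0.5 seconds. That interval is an assumption for illustration only; the real .NET thread injection algorithm is adaptive and differs between runtime versions.

```python
def seconds_until_burst_served(burst_size, min_threads, injection_interval_s=0.5):
    """Estimate how long a burst waits for enough threads to exist.

    Threads up to `min_threads` are created on demand without delay;
    every additional thread is assumed to cost one injection interval.
    """
    extra_threads = max(0, burst_size - min_threads)
    return extra_threads * injection_interval_s


if __name__ == "__main__":
    print(seconds_until_burst_served(50, 8))   # 21.0
    print(seconds_until_burst_served(50, 64))  # 0.0
```

Under these assumptions a burst of 50 blocking requests against MinThreads = 8 leaves the last request waiting about 21 seconds, long enough for a liveness probe to fail and the Pod to be restarted.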
Kubernetes Pod Restart - Warmup Setting
The Vicious Cycle of Latency, High CPU Consumption, and HPA Scaling
minReadySeconds ≠ minimum seconds before a Pod becomes Ready (it sets how long a Ready Pod must stay Ready before it counts as Available)
Please Tell Me When You Are Ready
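One way to let the application say when it is ready is to back the readiness probe with a flag that flips only after warmup has finished. The Python sketch below shows the idea; `WarmupGate` and `warm_up` are hypothetical names, and the warmup tasks (for example, calling hot endpoints once or priming caches) are placeholders.

```python
import threading


class WarmupGate:
    """Tracks whether the application has finished warming up."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def readiness_status(self):
        # 200 lets the readiness probe pass so the Pod receives traffic;
        # 503 keeps the Pod out of the Service endpoints.
        return 200 if self._ready.is_set() else 503


def warm_up(gate, warmup_tasks):
    """Run every warmup task, then open the gate."""
    for task in warmup_tasks:
        task()
    gate.mark_ready()
```

A readiness endpoint would return `gate.readiness_status()` as its HTTP status code, so a freshly scaled Pod takes no traffic until its own warmup has completed.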
Case 2
Client Get 504 Timeout
● Application Issue
○ Application is too busy to handle requests from Client
○ There is a problem with Dependent Applications
● Network Issue
○ CDN
○ Load Balancer
○ Kubernetes Ingress
○ Kubernetes Service
○ Kubernetes Application Pod
Bottom-Up
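The bottom-up order can be sketched as a loop that probes each hop, starting at the Pod and moving outward, and stops at the first one that fails. In this Python sketch the `probe` callable is an assumption: it stands for whatever check is used at each hop (for example, an HTTP request sent directly to that hop's address).

```python
# Hops from the slide, ordered bottom-up.
HOPS_BOTTOM_UP = [
    "Kubernetes Application Pod",
    "Kubernetes Service",
    "Kubernetes Ingress",
    "Load Balancer",
    "CDN",
]


def first_failing_hop(probe, hops=HOPS_BOTTOM_UP):
    """Probe each hop from the Pod upward.

    Returns the first hop whose probe fails, or None if every hop passes.
    """
    for hop in hops:
        if not probe(hop):
            return hop
    return None
```

If the Pod and Service answer directly but the Ingress probe fails, the search narrows to the Ingress layer without touching the Load Balancer or CDN.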
A Chain Is Only As Strong As Its Weakest Link
Client Get 504 Timeout - Network Troubleshooting
● Pod to Pod
● Service to Pod
● Ingress to Service
● HAProxy to Ingress-Nginx-Controller
This seems rather tedious and cumbersome
Client Get 504 Timeout - Prometheus Metrics
● Ingress Request Volume By Status
● Ingress Percentile Response Time
● Endpoint RPS
● Requests Currently In Progress By Endpoint
Workaround
Separate Kubernetes Deployment By Ingress Host And Path
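The routing idea behind this workaround can be sketched in Python as a small table that maps an Ingress host and path prefix to its own Deployment, so that a slow endpoint only ties up the Pods of its own Deployment. All hosts, paths, and Deployment names below are hypothetical; the matching is a simplified version of prefix matching on whole path segments.

```python
# (host, path prefix, deployment): every name here is hypothetical.
ROUTES = [
    ("api.example.com", "/reports", "api-reports"),
    ("api.example.com", "/", "api-default"),
    ("admin.example.com", "/", "admin"),
]


def _prefix_matches(prefix, path):
    """Match on whole path segments, so /reports does not match /reportsx."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def deployment_for(host, path, routes=ROUTES):
    """Return the Deployment with the longest matching prefix, or None."""
    matches = [
        (prefix, deployment)
        for route_host, prefix, deployment in routes
        if route_host == host and _prefix_matches(prefix, path)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: len(match[0]))[1]
```

With this table, requests to `/reports/...` land on `api-reports`, while all other paths on the same host keep their own pool of Pods.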
Case 3
P99 Latency Is High
P99 Latency Is A Leading Indicator Of Problems
● Application Issue
● Resource Issue
○ Resource Competition
○ Memory Leak
● Performance Inconsistency
○ Sticky Session Setting
○ VM Host Issue
P99 Latency Is High - Resource Issue
Resource Competition: Pod Requests and Limits vs. worker node resources
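Because the scheduler places Pods according to their requests, not their limits, the limits on a node can add up to more than the node actually has. The Python sketch below compares both sums with the node's allocatable CPU; the dictionary layout and the numbers in the usage example are made up for illustration.

```python
def node_commitment(pods, allocatable_millicpu):
    """Compare summed CPU requests and limits (millicores) with node allocatable."""
    total_requests = sum(pod["request"] for pod in pods)
    total_limits = sum(pod["limit"] for pod in pods)
    return {
        "requests_ratio": total_requests / allocatable_millicpu,
        "limits_ratio": total_limits / allocatable_millicpu,
        # True when Pods could together ask for more CPU than the node has.
        "overcommitted": total_limits > allocatable_millicpu,
    }


if __name__ == "__main__":
    pods = [{"request": 500, "limit": 2000} for _ in range(3)]
    print(node_commitment(pods, allocatable_millicpu=4000))
```

In the example, requests use only 37.5% of the node, yet limits total 150% of it, so Pods start competing for CPU as soon as several of them burst at once.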
Memory Leak
Workaround
A CronJob To Detect Memory Leaks
Schedule: */2 * * * * (every 2 minutes)
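The slides do not show what the job runs every two minutes, so the following Python sketch is only one possible check, not the speaker's implementation. It assumes the job can read recent memory samples per Pod and flags a Pod whose usage keeps rising and is close to its memory limit; the thresholds are arbitrary defaults.

```python
def looks_like_leak(samples_mib, limit_mib, usage_threshold=0.9, min_samples=5):
    """Flag a Pod whose memory keeps rising and is close to its limit.

    `samples_mib` is a list of memory readings in MiB, oldest first.
    """
    if len(samples_mib) < min_samples:
        return False
    recent = samples_mib[-min_samples:]
    # Every reading is at least as high as the one before it.
    steadily_rising = all(b >= a for a, b in zip(recent, recent[1:]))
    near_limit = recent[-1] >= usage_threshold * limit_mib
    return steadily_rising and near_limit
```

A flagged Pod could then be restarted gracefully before it is OOM-killed in the middle of serving requests.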
P99 Latency Is High - Inconsistent Performance
Pod RPS
Sticky Session Setting
nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_actual_ip"
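Hashing on the client IP header pins each client to one Pod, so a few heavy clients translate directly into uneven Pod RPS. The Python sketch below illustrates the effect with a plain SHA-256 modulo hash; this is a stand-in chosen for simplicity and is not the consistent-hashing algorithm that ingress-nginx actually uses. The IP addresses and request counts are invented.

```python
import hashlib
from collections import Counter


def pod_for(client_ip, pod_count):
    """Map a client IP to a Pod index; the same IP always maps to the same Pod."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pod_count


def requests_per_pod(requests_by_ip, pod_count):
    """Sum the request counts that land on each Pod."""
    load = Counter()
    for client_ip, request_count in requests_by_ip.items():
        load[pod_for(client_ip, pod_count)] += request_count
    return load


if __name__ == "__main__":
    traffic = {"10.0.0.1": 900, "10.0.0.2": 50, "10.0.0.3": 50}
    print(requests_per_pod(traffic, pod_count=3))
```

Whichever Pod the heavy client hashes to receives at least 900 of the 1,000 requests, regardless of how many replicas are running.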
Inconsistent Pod Memory Resource Usage
.NET Runtime Bug On AMD Machines
Inconsistent Pod CPU Resource Usage
VM Host Issue
When CPU Usage Exceeds The Tipping Point, Performance Degrades
Workaround
Dummy Pod
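The slides do not spell out how the Dummy Pod is sized. One plausible reading, stated here as an assumption, is that a placeholder Pod requests part of a worker node's CPU so that the scheduler cannot pack real workloads past the tipping point. Under that assumption, the request could be derived as in this Python sketch; the function name and the target value are illustrative.

```python
def dummy_pod_cpu_request(allocatable_millicpu, target_utilization):
    """CPU request (millicores) for a placeholder Pod.

    The request is sized so that real workloads can only be scheduled onto
    `target_utilization` of the node's allocatable CPU.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return round(allocatable_millicpu * (1 - target_utilization))


if __name__ == "__main__":
    print(dummy_pod_cpu_request(8000, 0.75))  # 2000
```

On an 8-core node with a 75% target, the placeholder would request 2000m, leaving 6000m schedulable for real Pods.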
Figure: Dummy Pod and worker node resources
A Workaround Doesn’t Mean The Problem Is Solved
Takeaways
● Warmup Setting With Kubernetes Health Check