Troubleshooting and Workaround in Kubernetes

Troubleshooting And Workaround In Kubernetes
瑞嘉軟體 Jack Kuo

Speaker

Jack Kuo

● 瑞嘉軟體 CTO Department SRE Team
○ Senior SRE
○ Speaker
● Current Focus Areas
○ Agile
○ DevOps
○ Cloud Native

Agenda
● About Us
● Case 1: Kubernetes Pod Restart
○ CrashLoopBackOff
○ ThreadPool
○ Warmup Setting
● Case 2: Client Get 504 Timeout
○ Network Troubleshooting
○ Prometheus Metrics
● Case 3: P99 Latency Is High
○ Resource Issue
○ Inconsistent Performance
● Takeaways

About Us
SRE Team Area Of Responsibility
Infra Team Area Of Responsibility

Not Only Troubleshooting

But Also Workarounds

https://static.learnk8s.io/cbe079ed7f6e7764445aa746b2c9f295.png
Case 1
Kubernetes Pod Restart

Kubernetes Pod Restart - CrashLoopBackOff

● Application Issue
○ Code Error
○ Environment Config Error
○ Dependent Applications Not Ready
● Resource Issue
○ CPU
○ Memory
● Health Check
○ Startup Probe
○ Liveness Probe
○ Readiness Probe
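
The three categories above tend to leave different traces in the Pod status. A minimal sketch, assuming one entry of `status.containerStatuses` has already been loaded as a dict (for example from `kubectl get pod -o json`). The function name and the mapping from exit codes to categories are illustrative assumptions, not something stated in the talk.

```python
def classify_restart(container_status: dict) -> str:
    """Map one entry of status.containerStatuses to a rough restart category."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if terminated is None:
        return "no previous termination recorded"
    reason = terminated.get("reason", "")
    exit_code = terminated.get("exitCode")
    if reason == "OOMKilled":
        return "resource issue: memory limit exceeded"
    if exit_code in (137, 143):
        # 128 + SIGKILL(9) or 128 + SIGTERM(15): the process was stopped from
        # outside, which is what happens after a failed liveness/startup probe.
        return "killed by signal: check probes and resource limits"
    if exit_code not in (0, None):
        return "application issue: process exited with an error"
    return "clean exit: check the container command and restart policy"
```

The result is only a first hint about where to look; the container logs and Pod events are still needed to confirm the cause.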

So... What About The Unknown State?

Kubernetes Pod Restart - ThreadPool

.NET ThreadPool

● ThreadPool.SetMaxThreads
○ Bigger is not necessarily better
○ Context switching is expensive
● ThreadPool.SetMinThreads
○ The number of threads is not guaranteed to stay at or above MinThreads
○ MinThreads is the number of threads the pool will create on demand without delay
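
The effect of MinThreads on a traffic burst can be illustrated with a toy model. This is a sketch in Python, not the .NET runtime's actual algorithm: it assumes threads up to MinThreads are created immediately and that every thread beyond that costs one fixed injection interval. The function name and the 0.5 second default are assumptions chosen for illustration.

```python
def seconds_until_threads(needed: int, min_threads: int,
                          injection_interval: float = 0.5) -> float:
    """Estimate how long a burst waits before `needed` threads exist."""
    if needed <= min_threads:
        return 0.0  # created on demand, without delay
    # Assumption: each thread beyond MinThreads costs one injection interval.
    return (needed - min_threads) * injection_interval
```

Under these assumptions, a burst that needs 108 threads waits about 50 seconds with MinThreads at 8 and about 4 seconds with MinThreads at 100. A delay of that size could be enough for a health check to time out.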

Kubernetes Pod Restart - Warmup Setting
The Vicious Cycle Of Latency, High CPU Consumption And HPA Scaling

minReadySeconds ≠ min Ready Seconds

Please Tell Me When You Are Ready


While warming up: return non-200 to Kubernetes
Once ready: return 200 to Kubernetes
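
One way to let the application tell Kubernetes when it is ready is a readiness endpoint that fails until warmup has finished. A minimal sketch using only the Python standard library. The `/ready` path, port 8080 and the empty `warm_up` body are hypothetical placeholders, and the talk's own service is .NET, so this only shows the idea.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

warmed_up = threading.Event()  # set once warmup has finished

def readiness_status(ready: bool) -> int:
    """HTTP status for the readiness probe: 200 only after warmup."""
    return 200 if ready else 503

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/ready" is a hypothetical path; it must match the probe's httpGet.path.
        if self.path != "/ready":
            self.send_error(404)
            return
        self.send_response(readiness_status(warmed_up.is_set()))
        self.end_headers()

def warm_up() -> None:
    # Placeholder: prime caches, connection pools and hot code paths here.
    warmed_up.set()

def serve(port: int = 8080) -> None:
    # Warm up in the background while the probe endpoint already answers 503.
    threading.Thread(target=warm_up, daemon=True).start()
    HTTPServer(("0.0.0.0", port), ReadinessHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Until the probe sees a success status, the Pod is kept out of the Service endpoints, so it receives no client traffic during warmup.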

Case 2
Client Get 504 Timeout

Client Get 504 Timeout
● Application Issue
○ The application is too busy to handle client requests
○ A dependent application has a problem
● Network Issue
○ CDN
○ Load Balancer
○ Kubernetes Ingress
○ Kubernetes Service
○ Kubernetes Application Pod

Troubleshooting direction: Bottom-Up (from the Application Pod back up to the CDN)

A Chain Is Only As Strong As Its Weakest Link

Client Get 504 Timeout - Network Troubleshooting
[Diagram: Node 1, Node 2, Node 3]

● Pod to Pod
● Service to Pod
● Ingress to Service
○ curl -v -H "Host: <Host>" http://<Ingress-Nginx-Controller IP>:<Ingress-Nginx-Controller Port>/<Path>
● HAProxy to Ingress-Nginx-Controller
○ curl -v -H "Host: <Host>" http://<HAProxy IP>:<HAProxy Port>/<Path>
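
The hop-by-hop checks above can be scripted. A minimal sketch, assuming each hop is reachable over plain HTTP from the machine running the script. The function names, the hop labels and the rule that a missing response or a 5xx status counts as a failure are assumptions for illustration.

```python
import urllib.error
import urllib.request

def probe(url: str, host: str, timeout: float = 5.0):
    """Return the HTTP status for one hop, or None if no response arrived."""
    req = urllib.request.Request(url, headers={"Host": host})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the hop answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # refused, unreachable, or timed out

def first_failing_hop(results):
    """results: (hop name, status) pairs ordered bottom-up, Pod first."""
    for name, status in results:
        if status is None or status >= 500:
            return name
    return None
```

For example, `first_failing_hop([("pod", 200), ("service", 200), ("ingress", 504)])` returns `"ingress"`, which points at the first layer, counted from the Pod upward, that stops answering correctly.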

This Seems A Bit Tedious And Cumbersome

Client Get 504 Timeout - Prometheus Metrics
● Ingress Request Volume By Status
● Ingress Percentile Response Time
● Endpoint RPS
● Requests Currently In Progress By Endpoint
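
The four panels above could be backed by queries along these lines. A sketch only: the `nginx_ingress_controller_*` series are exposed by ingress-nginx, while `http_requests_received_total`, `http_requests_in_progress` and the `endpoint` label are assumed application metrics that depend on the instrumentation library in use. The helper names are hypothetical.

```python
from urllib.parse import urlencode

def promql_queries(ingress: str, window: str = "5m") -> dict:
    """Build one PromQL expression per dashboard panel."""
    sel = f'ingress="{ingress}"'
    return {
        "ingress_requests_by_status":
            f"sum by (status) (rate(nginx_ingress_controller_requests{{{sel}}}[{window}]))",
        "ingress_p99_response_time":
            "histogram_quantile(0.99, sum by (le) (rate("
            f"nginx_ingress_controller_request_duration_seconds_bucket{{{sel}}}[{window}])))",
        # Assumed application-level metric names, see the note above.
        "endpoint_rps":
            f"sum by (endpoint) (rate(http_requests_received_total[{window}]))",
        "in_progress_by_endpoint":
            "sum by (endpoint) (http_requests_in_progress)",
    }

def query_url(base_url: str, query: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"
```

Reading the panels together shows which status codes are rising, whether latency rose first, and which endpoint is holding requests open.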

Workaround
Separate Kubernetes Deployment By Ingress Host And Path
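
A sketch of what such a split could look like on the Ingress side, written as a Python function that builds the manifest. It assumes each path prefix already has its own Deployment and Service. The function name, the `nginx` ingress class and the example service names are hypothetical.

```python
def split_ingress(name: str, host: str, routes: dict, port: int = 80) -> dict:
    """Build an Ingress that sends each path prefix to its own Service."""
    paths = [
        {
            "path": path,
            "pathType": "Prefix",
            "backend": {"service": {"name": service, "port": {"number": port}}},
        }
        for path, service in routes.items()
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "ingressClassName": "nginx",
            "rules": [{"host": host, "http": {"paths": paths}}],
        },
    }
```

For example, `split_ingress("api", "api.example.com", {"/reports": "api-reports", "/": "api-default"})` routes the `/reports` prefix to a dedicated backend, so a slow endpoint there no longer shares Pods with the rest of the traffic.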


Case 3
P99 Latency Is High

P99 Latency Is A Leading Indicator Of Problems
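
A small worked example of why P99 reacts before the typical request does. The sketch uses a nearest-rank percentile; the function name and the sample latencies are made up for illustration.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here `p50` is still 20 ms while `p99` is already 2000 ms, so the tail exposes a problem that the median hides.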

P99 Latency Is High
● Application Issue
● Resource Issue
○ Resource Competition
○ Memory Leak
● Performance Inconsistency
○ Sticky Session Setting
○ VM Host Issue

P99 Latency Is High - Resource Issue
Resource Competition
[Diagram: worker node resource]

● Requests < Limits: Resource Competition !!!
○ Up to Requests: guaranteed resource for container
○ Between Requests and Limits: available resource for container
○ Limits: max resource container can use
● Limits = Requests
○ guaranteed resource for container = max resource container can use
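
Kubernetes derives a Pod's QoS class from the relationship between requests and limits, which is one way to check which of the two situations above a workload is in. A simplified sketch for a Pod with a single container. The function name is hypothetical and quantities are compared as plain strings, so `"500m"` and `"0.5"` would not be treated as equal here.

```python
def qos_class(requests: dict, limits: dict) -> str:
    """QoS class for a Pod with a single container (simplified)."""
    if not requests and not limits:
        return "BestEffort"
    resources = ("cpu", "memory")
    has_all_limits = all(r in limits for r in resources)
    # A missing request defaults to the limit for that resource.
    if has_all_limits and all(
        requests.get(r, limits[r]) == limits[r] for r in resources
    ):
        return "Guaranteed"
    return "Burstable"
```

A container with requests below its limits comes out as `Burstable`: everything between the request and the limit is shared with neighbours on the same node, which is where the competition comes from.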


P99 Latency Is High - Resource Issue
Pod Anti-Affinity
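
A sketch of an anti-affinity rule that keeps replicas of the same application off the same node, built as a Python dict that could be placed under `spec.affinity` of a Pod template. The function name and the `app` label key are assumptions; the talk does not show its actual rule.

```python
def pod_anti_affinity(app_label: str, required: bool = True) -> dict:
    """Anti-affinity that spreads Pods with the same app label across nodes."""
    term = {
        "labelSelector": {"matchLabels": {"app": app_label}},
        "topologyKey": "kubernetes.io/hostname",  # one node = one domain
    }
    if required:
        rule = {"requiredDuringSchedulingIgnoredDuringExecution": [term]}
    else:
        rule = {"preferredDuringSchedulingIgnoredDuringExecution": [
            {"weight": 100, "podAffinityTerm": term}]}
    return {"podAntiAffinity": rule}
```

The `required` form leaves Pods unschedulable when there are more replicas than nodes; the `preferred` form only asks the scheduler to spread them when it can.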

P99 Latency Is High - Resource Issue
Memory Leak

Workaround
A CronJob To Detect Memory Leaks

● Schedule: */2 * * * *
● Calculate Memory Usage (Prometheus API)
● If Memory Usage < Memory Target: no action
● If Memory Usage > Memory Target: Notify To Slack, Restart Deployment
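
The workflow could be implemented by a small script that the CronJob runs every two minutes. A sketch under stated assumptions: memory is read from the cAdvisor series `container_memory_working_set_bytes`, Pods are matched by a `<deployment>-` name prefix, Slack is reached through an incoming webhook, and `kubectl` is available in the job's image. All function names and parameters are hypothetical.

```python
import json
import subprocess
import urllib.parse
import urllib.request

def memory_usage_bytes(prom_url: str, deployment: str) -> float:
    """Sum the working-set memory of the deployment's containers."""
    query = ('sum(container_memory_working_set_bytes'
             f'{{pod=~"{deployment}-.*", container!=""}})')
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def should_restart(usage: float, target: float) -> bool:
    return usage > target

def notify_slack(webhook_url: str, text: str) -> None:
    body = json.dumps({"text": text}).encode()
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10).close()

def run(prom_url: str, webhook_url: str, deployment: str, target: float) -> None:
    usage = memory_usage_bytes(prom_url, deployment)
    if should_restart(usage, target):
        notify_slack(webhook_url,
                     f"{deployment}: {usage:.0f} B > {target:.0f} B, restarting")
        subprocess.run(["kubectl", "rollout", "restart",
                        f"deployment/{deployment}"], check=True)
```

Sending the Slack message before the restart keeps a record of every automatic restart, which helps when the leak itself is investigated later.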

P99 Latency Is High - Inconsistent Performance
Pod RPS

P99 Latency Is High - Inconsistent Performance
Sticky Session Setting

Request headers: User-Agent, X-Forwarded-For, X-Actual-IP

nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_actual_ip"

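
Hashing on a client header explains how Pods behind the same Service can end up with very different RPS. A toy sketch: it uses a plain MD5-modulo hash as a stand-in for the controller's own hashing, and the client IPs and Pod names are invented.

```python
import hashlib
from collections import Counter

def pick_pod(client_key: str, pods: list) -> str:
    """Map a hash key (for example the X-Actual-IP value) to one Pod."""
    digest = hashlib.md5(client_key.encode()).hexdigest()
    return pods[int(digest, 16) % len(pods)]

def requests_per_pod(request_keys, pods) -> Counter:
    """Count how many requests land on each Pod."""
    return Counter(pick_pod(key, pods) for key in request_keys)

pods = ["pod-a", "pod-b", "pod-c"]
# Hypothetical traffic: one heavy client sends 90 of 100 requests.
keys = ["10.0.0.1"] * 90 + ["10.0.0.2"] * 10
counts = requests_per_pod(keys, pods)
```

Because every request with the same key lands on the same Pod, one Pod in `counts` receives at least 90 of the 100 requests, regardless of how many replicas exist.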
P99 Latency Is High - Inconsistent Performance
● Inconsistent Pod Memory Resource Usage
● .NET Runtime Bug On AMD Machines
● Inconsistent Pod CPU Resource Usage
● VM Host Issue

When CPU Usage Exceeds A Tipping Point, Performance Degrades

Workaround
Dummy Pod

[Diagram: worker node resource, partly occupied by a Dummy Pod]
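
One reading of the diagram is that the Dummy Pod reserves part of a node so that real workloads cannot be scheduled past the tipping point. Under that assumption, its CPU request could be sized as follows. The function name and the example numbers are hypothetical.

```python
def dummy_pod_cpu_millicores(allocatable_millicores: int,
                             max_utilization: float) -> int:
    """CPU request for a dummy Pod that caps schedulable CPU on one node."""
    if not 0.0 < max_utilization <= 1.0:
        raise ValueError("max_utilization must be in (0, 1]")
    usable = int(allocatable_millicores * max_utilization)
    return allocatable_millicores - usable
```

For a node with 8000m allocatable CPU and a 75% ceiling, the dummy Pod would request 2000m. The request only limits what the scheduler places on the node; it does not cap what running containers actually consume.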

Workaround Doesn’t Mean That The Problem Is Solved

Takeaways
● Warmup Setting With Kubernetes Health Check
● Prometheus Metrics Are Helpful For Network Troubleshooting
● Same Pod But Inconsistent Performance In Kubernetes
● Workaround Doesn’t Mean That The Problem Is Solved
