DevOps Shack - 100 Common Kubernetes Errors and Solutions
19. Pod Status Unknown
20. DaemonSet Not Deploying Pods on All Nodes
21-30
21. The Connection to the Server Was Refused
22. Pods Stuck in ContainerCreating State
23. RBAC Permission Denied Error
24. Pod IP Not Reachable
25. Kubelet Service Not Running
26. Deployment Not Updating
27. FailedAttachVolume
28. Namespace Deletion Stuck
29. Pods Stuck in Init State
30. Ingress Shows 404
31-40
31. Invalid Memory/CPU Requests
32. Service NodePort Not Accessible
33. Pods Not Being Scheduled
34. Pod Cannot Connect to External Services
35. Deployment Not Scaling Properly
36. Cluster Autoscaler Not Adding Nodes
37. API Server Timeout
38. Pod Cannot Access ConfigMap
39. DaemonSet Pods Not Deploying
40. Pods Cannot Pull Secrets
41-50
41. Service External-IP Pending
42. Pod Terminated With Exit Code 137
43. Failed to Start Pod Sandbox
44. Helm Release Stuck in PENDING_INSTALL
45. Ingress Returning 502 Bad Gateway
46. Error: context deadline exceeded
47. Volume Mount Permissions Denied
48. Pods Exceeding Resource Quota
49. Cluster Certificate Expired
50. HPA Not Responding to Metrics
51-60
51. Pod IP Conflict
52. Service Account Not Found
53. Pod Fails Liveness Probe
54. Cannot Delete Namespace
55. Cannot Access API from External Client
56. Ingress Controller Pods CrashLooping
57. Kubelet Certificate Rotation Failing
58. Metrics Server Shows No Data
59. DNS Resolution Intermittent
60. Pod Cannot Access PersistentVolume
61-70
61. Error: Forbidden when Running kubectl Commands
62. Node Cannot Join Cluster
63. API Server High Latency
64. DaemonSet Pods Not Running on Specific Node
65. Service Not Forwarding Traffic
66. Error: Unauthorized when Accessing API Server
67. Pod Logs Not Available
68. Cannot Delete Stuck Pod
69. Pods Restarting Frequently
70. Cluster Nodes Unreachable
71-80
71. Pods Stuck in ImagePullBackOff
72. Node Disk Pressure Causes Pod Evictions
73. Pod Cannot Access Secret
74. PVC Pending Due to StorageClass Issues
75. Cannot Scale Deployment Beyond Node Capacity
76. CoreDNS Pods CrashLooping
77. HPA Not Responding to Custom Metrics
78. Pod Scheduling Ignored Node Selector
79. Ingress SSL/TLS Configuration Fails
80. Kube-proxy Failing
81-90
81. Cluster Autoscaler Scaling Too Slowly
82. Pod Logs Truncated
83. Pods Stuck in Evicted State
84. Pod Cannot Access Cluster Internal DNS
85. PersistentVolume Stuck in Released State
86. Job Failing to Complete
87. ConfigMap Too Large
88. Services Intermittently Unreachable
89. Error: Connection Refused When Accessing Service
90. Cluster High CPU Usage
91-100
91. Pod Stuck in Pending Due to Node Affinity
92. Control Plane Components Not Starting
93. Pods Stuck in Terminating State
94. Cluster Upgrade Fails
95. Network Policy Blocking Traffic
96. Pods Stuck in Unknown State
97. PersistentVolume Not Resizing
98. Ingress Redirect Loop
99. Pod Security Context Misconfiguration
100. Pods Overloaded Due to Missing HPA
Introduction
Before diving into specific Kubernetes errors, start by checking these essential
aspects to quickly identify potential root causes:
10 Most Important Things to Check for Diagnosing Kubernetes Errors
Category | What to Check | Command/Action
These are the first key areas to investigate when diagnosing Kubernetes errors to quickly identify and resolve issues.
Error 1: CrashLoopBackOff Error
Cause:
The container is unable to start successfully and keeps restarting.
Solution:
1. Check the logs of the failing pod to identify the issue.
Example:
kubectl logs <pod-name>
2. Review the logs to understand the root cause of the issue. Common
reasons include application misconfiguration, missing environment
variables, or incorrect startup commands.
3. Verify that the container image is built and pushed correctly to the
registry.
Example:
docker build -t <image-name> .
docker push <image-name>
4. If environment variables are missing, update the deployment
configuration.
Example:
Edit the deployment using:
kubectl edit deployment <deployment-name>
Add the missing environment variables under the env section.
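The steps above can be sketched end to end; my-app and DB_HOST are hypothetical names standing in for your deployment and the missing variable:

```shell
# Inspect the current and previous container logs; the previous
# instance usually holds the actual crash message.
kubectl logs my-app-7d4b9c-xk2p1
kubectl logs my-app-7d4b9c-xk2p1 --previous

# Check the container's last state and restart count.
kubectl describe pod my-app-7d4b9c-xk2p1 | grep -A5 "Last State"

# Add a missing environment variable without opening an editor.
kubectl set env deployment/my-app DB_HOST=db.internal
```

`kubectl set env` triggers a new rollout, so the pods restart with the updated environment automatically.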
Error 2: ImagePullBackOff
Cause:
Kubernetes is unable to pull the container image from the registry.
Solution:
1. Check the pod description to identify the error details.
Example:
kubectl describe pod <pod-name>
2. Verify the image name and tag in your deployment configuration. Ensure
the image exists in the registry.
3. Authenticate with the container registry if required. For private
registries, create a secret and link it to your deployment:
Example:
Create a secret:
kubectl create secret docker-registry <secret-name> --docker-server=<registry-server> --docker-username=<username> --docker-password=<password>
Update the deployment to use the secret:
kubectl edit deployment <deployment-name>
Add the secret under the imagePullSecrets section.
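As a concrete sketch with hypothetical names (regcred, registry.example.com, my-app), the secret can be attached with a patch instead of an interactive edit:

```shell
# Create the registry credential.
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=deployer \
  --docker-password='s3cret'

# Attach it to the deployment's pod template in one step.
kubectl patch deployment my-app -p \
  '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
```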
Cause:
A node is in a NotReady state due to issues like disk pressure, memory
pressure, or network problems.
Solution:
1. Check the status of the nodes.
Example:
kubectl get nodes
2. Describe the problematic node to find the cause.
Example:
kubectl describe node <node-name>
3. Address the specific issue:
o If disk pressure is mentioned, free up disk space.
o If memory pressure is mentioned, reduce resource consumption
or increase node capacity.
4. Restart the kubelet service on the affected node if necessary.
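The kubelet restart in step 4 is performed on the node itself, not through kubectl; a minimal sketch:

```shell
# Run on the affected node: check kubelet health, review recent logs
# for the NotReady reason, then restart the service.
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100 --no-pager
sudo systemctl restart kubelet

# Back on a workstation, watch the node return to Ready.
kubectl get node <node-name> -w
```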
kubectl edit deployment <deployment-name>
Modify the resources section under containers.
3. Ensure the target port matches the container's exposed port.
4. If using a NodePort or LoadBalancer service, ensure that the firewall
allows traffic on the specified port.
5. Test the service using a temporary pod.
Example:
kubectl run test-pod --image=busybox --rm -it --restart=Never -- /bin/sh
From the shell, use wget (busybox ships wget rather than curl) to test the service from inside the cluster.
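A one-shot version of the in-cluster test, assuming a hypothetical service my-service on port 80 in the default namespace (note that busybox includes wget, not curl):

```shell
# Spawn a throwaway pod, fetch the service once, and clean up.
kubectl run test-pod --image=busybox --rm -it --restart=Never -- \
  wget -qO- http://my-service.default.svc.cluster.local:80
```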
3. Create the missing ConfigMap.
Example:
kubectl create configmap <configmap-name> --from-literal=<key>=<value>
4. Restart the affected deployment to apply the changes.
Example:
kubectl rollout restart deployment <deployment-name>
2. If the PVC is in a Pending state, describe it to see the reason.
Example:
kubectl describe pvc <pvc-name>
3. Ensure the storage class and volume configuration match the PVC
request.
Example:
Update the storage class or create a matching PV using:
kubectl apply -f <pv-definition.yaml>
4. Check the pod's volume configuration and ensure the PVC is referenced
correctly.
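A minimal matching PV/PVC pair illustrating step 3; the names are hypothetical, hostPath is only suitable for single-node test clusters, and storageClassName must agree on both objects:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: manual
  hostPath:
    path: /mnt/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: manual
  resources:
    requests:
      storage: 1Gi
EOF

# The claim should report STATUS Bound once it matches the PV.
kubectl get pvc demo-pvc
```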
1. Check the node status.
Example:
kubectl get nodes
kubectl describe node <node-name>
2. Identify and clean up unused Docker images and containers on the node.
Example:
docker system prune -f
3. Increase disk space or attach additional storage to the node.
4. If using a cloud provider, scale the cluster to add more nodes.
1. Check the HPA details.
Example:
kubectl describe hpa <hpa-name>
2. Verify the CPU or memory metrics are available.
Example:
kubectl top pod
3. Ensure that resource requests and limits are set in the deployment.
Example:
Edit the deployment:
kubectl edit deployment <deployment-name>
Add resource requests and limits under resources.
4. If metrics are missing, verify that the metrics server is running.
Example:
kubectl get pods -n kube-system | grep metrics-server
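The HPA prerequisites above can be checked and fixed non-interactively; my-app and my-app-hpa are hypothetical names:

```shell
# HPA needs resource requests on the target pods; set them without
# an editor session.
kubectl set resources deployment my-app \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi

# Confirm the metrics pipeline end to end.
kubectl get pods -n kube-system | grep metrics-server
kubectl top pod
kubectl describe hpa my-app-hpa
```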
Error 17: PVC in Lost State
Cause:
The PersistentVolumeClaim (PVC) is in a Lost state because the underlying
storage is unavailable.
Solution:
1. Check the PV and PVC status.
Example:
kubectl get pv
kubectl get pvc
2. Describe the PV to find the reason for the Lost state.
Example:
kubectl describe pv <pv-name>
3. Verify that the storage backend (e.g., NFS, EBS, etc.) is accessible and
functioning.
4. If the storage backend is no longer available, recreate the PV and PVC
with a new backend.
3. Test the permissions using:
Example:
kubectl auth can-i <action> <resource>
4. Restart the DaemonSet to reapply its configuration.
Example:
kubectl rollout restart daemonset <daemonset-name>
3. If a volume mount is causing the problem, ensure that the referenced
PersistentVolume is bound correctly.
4. If resource constraints are an issue, free up resources or scale the cluster.
4. Ensure that the nodes can communicate with each other on the required
ports.
4. Monitor the rollout progress.
Example:
kubectl rollout status deployment <deployment-name>
3. If the namespace still doesn't delete, edit the namespace and remove
the finalizers.
Example:
kubectl edit namespace <namespace-name>
Error 31: Invalid Memory/CPU Requests
Cause:
The resource requests or limits specified in the deployment are invalid or
exceed node capacity.
Solution:
1. Check the pod description to find the invalid resource specification.
Example:
kubectl describe pod <pod-name>
2. Verify the current node capacity.
Example:
kubectl describe node <node-name>
3. Update the deployment to set valid resource requests and limits.
Example:
Edit the deployment:
kubectl edit deployment <deployment-name>
Ensure resources.requests and resources.limits are set appropriately.
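To compare what a pod asks for against what the node can actually offer, and then correct the values in one step (a sketch using the document's placeholder names):

```shell
# What the node can offer after system reservations.
kubectl describe node <node-name> | grep -A6 "Allocatable"

# What the pod is currently requesting.
kubectl get pod <pod-name> \
  -o jsonpath='{.spec.containers[*].resources}'

# Set valid requests/limits without an interactive edit.
kubectl set resources deployment <deployment-name> \
  --requests=cpu=250m,memory=256Mi --limits=cpu=1,memory=512Mi
```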
Error 33: Pods Not Being Scheduled
Cause:
The scheduler cannot find a suitable node for the pod due to resource
constraints or taints.
Solution:
1. Describe the pod to see why it is not being scheduled.
Example:
kubectl describe pod <pod-name>
2. Check for node taints that might prevent scheduling.
Example:
kubectl describe node <node-name>
3. Update the pod's tolerations or affinity rules if necessary.
4. Ensure sufficient resources are available on the nodes.
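The taint/toleration fix in steps 2 and 3 can be sketched as follows; the dedicated=infra taint and my-app deployment are hypothetical examples:

```shell
# List taints across all nodes.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Either remove the taint (the trailing "-" deletes it)...
kubectl taint nodes <node-name> dedicated=infra:NoSchedule-

# ...or add a matching toleration to the pod template.
kubectl patch deployment my-app --type=json -p '[
  {"op":"add","path":"/spec/template/spec/tolerations","value":
    [{"key":"dedicated","operator":"Equal","value":"infra","effect":"NoSchedule"}]}
]'
```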
Error 35: Deployment Not Scaling Properly
Cause:
The deployment is not scaling the number of replicas as expected.
Solution:
1. Verify the current number of replicas.
Example:
kubectl get deployment <deployment-name>
2. Check for resource constraints on the nodes.
3. Scale the deployment manually to test scaling functionality.
Example:
kubectl scale deployment <deployment-name> --replicas=<number>
4. If using HPA, ensure the metrics server is functioning correctly.
1. Describe the DaemonSet to identify the issue.
Example:
kubectl describe daemonset <daemonset-name>
2. Check the node selector or tolerations in the DaemonSet configuration.
3. Ensure that the nodes have sufficient resources to schedule the pods.
4. Restart the DaemonSet.
Example:
kubectl rollout restart daemonset <daemonset-name>
1. Verify the cloud provider integration with the cluster.
Example:
kubectl get nodes -o wide (check that the nodes have the correct cloud provider labels)
2. Check the service description for details.
Example:
kubectl describe svc <service-name>
3. Ensure that the cloud provider account has sufficient permissions to
create load balancers.
4. If using a local cluster, use a NodePort service instead of a LoadBalancer.
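On clusters without a cloud load balancer (kind, minikube, bare metal), step 4 can be applied directly; my-service is a hypothetical name:

```shell
# Switch the service type so it becomes reachable on every node.
kubectl patch svc my-service -p '{"spec":{"type":"NodePort"}}'

# Note the allocated port in the 30000-32767 range.
kubectl get svc my-service
```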
1. Check the kubelet logs for sandbox errors.
Example:
journalctl -u kubelet
2. Restart the container runtime on the node.
Example:
systemctl restart docker (or containerd depending on the runtime)
3. If the issue persists, check network configurations like CNI plugins.
4. Remove any stale sandboxes.
Example:
docker ps -a | grep <sandbox-id>, then remove it with docker rm <container-id>.
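On containerd-based nodes the docker CLI is absent; the same cleanup can be done with crictl (assuming it is installed and pointed at the runtime socket):

```shell
# List pod sandboxes known to the container runtime.
sudo crictl pods

# Remove a stale sandbox by its ID, then restart the runtime.
sudo crictl rmp <sandbox-id>
sudo systemctl restart containerd
```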
1. Check the ingress logs for errors.
Example:
kubectl logs <ingress-controller-pod> -n kube-system
2. Verify the backend service and pod are running and accessible.
Example:
kubectl get svc and kubectl get pods
3. Ensure the backend service's target port matches the pod's exposed
port.
4. Test the service manually to ensure it responds.
kubectl edit deployment <deployment-name>
Add:
securityContext:
  runAsUser: <uid>
  fsGroup: <gid>
3. Ensure the volume has the correct ownership and permissions.
4. If necessary, reconfigure the CNI plugin with a new CIDR range.
Cause:
The namespace is stuck due to finalizers on resources.
Solution:
1. Describe the namespace to identify the finalizers.
Example:
kubectl describe namespace <namespace-name>
2. Remove the finalizers manually.
Example:
kubectl edit namespace <namespace-name>
3. Delete the namespace again.
Example:
kubectl delete namespace <namespace-name>
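If editing the namespace does not clear the finalizers, a common fallback is to empty spec.finalizers through the finalize subresource (this sketch assumes jq is installed):

```shell
NS=<namespace-name>

# Fetch the namespace, strip its finalizers, and submit the result
# to the finalize subresource so the API server completes deletion.
kubectl get namespace "$NS" -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f -
```

Investigate why the finalizers were stuck before forcing this, as it can orphan resources the finalizer was meant to clean up.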
1. Check the ingress controller pod logs.
Example:
kubectl logs <ingress-pod-name>
2. Verify the ingress controller configuration.
3. Restart the ingress controller pods.
Example:
kubectl rollout restart deployment <ingress-controller-name>
4. Test ingress functionality with a basic configuration.
2. Verify that the metrics server is deployed with valid configurations.
3. Restart the metrics server deployment.
Example:
kubectl rollout restart deployment metrics-server -n kube-system
4. Test metrics collection using:
Example:
kubectl top pod
3. If the storage backend is unavailable, resolve the issue or recreate the
volume.
1. Check API server logs for high latency events.
2. Increase the resources allocated to the API server.
3. Reduce the number of requests to the API server by optimizing scripts or
automation.
1. Verify the credentials in the kubeconfig file.
2. Generate a new token if needed and update the configuration.
Cause:
Nodes cannot communicate due to network issues.
Solution:
1. Check node status.
Example:
kubectl get nodes
2. Verify the network configuration and resolve connectivity issues.
2. Free up disk space by removing unused images and containers.
Example:
docker system prune -f
3. Add additional storage to the node if necessary.
Error 75: Cannot Scale Deployment Beyond Node Capacity
Cause:
The nodes lack sufficient resources to schedule additional pods.
Solution:
1. Check the deployment's resource requests and limits.
2. Verify node capacity.
Example:
kubectl describe nodes
3. Scale the cluster or add new nodes.
3. Update the HPA to reference the correct custom metrics.
1. Check the kube-proxy pod logs.
Example:
kubectl logs <kube-proxy-pod-name> -n kube-system
2. Restart the kube-proxy daemonset.
Example:
kubectl rollout restart daemonset kube-proxy -n kube-system
1. Delete the evicted pods.
Example:
kubectl delete pod <pod-name>
2. Address the resource issue causing the evictions.
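Evicted pods are left behind in phase Failed, so they can be cleaned up in bulk rather than one by one:

```shell
# Delete all failed (including evicted) pods in one namespace...
kubectl delete pods --field-selector=status.phase=Failed -n <namespace>

# ...or across every namespace.
kubectl delete pods --field-selector=status.phase=Failed -A
```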
Error 87: ConfigMap Too Large
Cause:
ConfigMap data is capped at roughly 1 MiB (the etcd object size limit).
Solution:
1. Split the ConfigMap into smaller ConfigMaps.
2. Use external storage solutions for larger configurations.
1. Check resource usage.
Example:
kubectl top nodes
2. Scale up the cluster or optimize workloads.
Cause:
Pods are stuck because Kubernetes cannot clean up resources or communicate
with the node.
Solution:
1. Force delete the pods.
Example:
kubectl delete pod <pod-name> --grace-period=0 --force
2. Check the resources (e.g., volumes, secrets) that the pod was using.
2. Describe the policies to check the rules.
Example:
kubectl describe networkpolicy <policy-name> -n <namespace>
3. Update the policy to allow the required traffic.
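A hypothetical policy for step 3 that allows ingress to pods labeled app=web on port 8080 from pods labeled role=frontend; adapt the labels and port to your workload:

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF
```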
2. Ensure the metrics server is running and resource requests/limits are
defined for the pods.
Conclusion
Kubernetes is a powerful but intricate system, and its flexibility comes with its
own set of challenges. This guide has aimed to equip you with the knowledge
and tools necessary to address 100 of the most common Kubernetes errors.
Each error has been dissected to uncover its root cause and followed by
practical, step-by-step solutions to resolve the problem efficiently.
By understanding these common pitfalls and learning how to fix them, you can
significantly enhance your troubleshooting skills and confidence in managing
Kubernetes environments. Beyond resolving immediate issues, the insights
provided in this guide encourage the adoption of best practices, such as
monitoring resource usage, automating repetitive tasks, and implementing
proper security measures, which are crucial for long-term success in
Kubernetes operations.
As Kubernetes continues to evolve, so too will the challenges and opportunities
it presents. Staying informed, leveraging community support, and continuously
updating your knowledge base are key to staying ahead in this dynamic
ecosystem. This guide is a stepping stone, but it’s your curiosity,
experimentation, and willingness to learn that will make you a Kubernetes
expert.
Thank you for taking this journey into Kubernetes problem-solving. May this
guide empower you to build, manage, and scale your applications with greater
confidence and reliability. Remember, every challenge is an opportunity to
grow, and every error resolved brings you one step closer to mastering
Kubernetes.