You’ve finally taken the leap and deployed your workloads on Kubernetes. What could possibly go wrong? As you have no doubt experienced: A LOT, A WHOLE LOT! Don’t let that deter you, though; everything is prone to failures and misconfigurations. We hope this guide will alleviate some of those pains, or at least take away a little of the mystery of what is going on.
First and foremost, you should familiarize yourself with this fantastic visualization from the good folks over at Learnk8s.io.
Throughout this article we will reference sample deployments and services from our GitHub page.
Pending
Pods usually stay in a Pending state for one of the following reasons:
- Resource issues resulting in the pod not being scheduled on a node.
- Image pull failure, for example when the image name, tag, or image pull secret is incorrect. Pull issues are typically easy to spot, as the status will eventually move from Pending to ContainerCreating and then ErrImagePull (and eventually ImagePullBackOff); a pull-secret fix is sketched below.
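If a private registry is involved, the usual fix is to create a docker-registry secret and reference it from the pod spec. A minimal sketch, assuming a hypothetical registry, image, and secret name (substitute your own values):

$ kubectl create secret docker-registry regcred \
    --docker-server=registry.example.com \
    --docker-username=<username> \
    --docker-password=<password>

Then reference it in the pod spec and double-check the image name and tag:

spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/app:1.0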
For resource issues, you should check the resource requests and limits on the pod along with any resource quotas defined in the namespace. To give an idea of how to troubleshoot this, we will use some sample manifests to create a pod in a pending state.
After creating a new pod/deployment, you notice it is stuck in a Pending state. A very useful starting point is the describe command:
$ kubectl describe pod <pod-name>
A lot of information can be obtained from the describe output. For troubleshooting, the events section can provide some key information for determining the issue. Let’s go ahead and deploy a pod to get an idea of what we are looking at:
$ kubectl apply -f https://raw.githubusercontent.com/gleamingthekube/troubleshooting-kubernetes/main/pending-pod.yaml
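For reference, the manifest behind that URL boils down to a deployment whose single container requests far more memory than any node can provide. A rough sketch of its shape (the exact contents in the repo may differ; the container command is a placeholder, and the request sizes match the events we will see shortly):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: 500m
              memory: 500Gi    # deliberately unschedulable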
View the pods to check the state:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox-65cf7d7587-vrcv8 0/1 Pending 0 49s
As you can see, it is sitting in a Pending state. So why not just use the logs command? The logs command retrieves the application logs, and since our pod has not yet been scheduled anywhere, the application is not running and no logs are available. Take a look at the describe output and note the amount of detail returned: details about the container(s) in the pod, environment variables, volumes and mounts, tolerations, and so on. At the end you will see a list of events.
As the name suggests, this is a list of events that have occurred. As you can see below, the pod failed to schedule for two reasons. First, neither of our nodes (control plane or worker) has sufficient memory to run the pod. This makes perfect sense, since the test pod requests 500Gi of RAM, which is wildly excessive but serves to illustrate the point. Second, the control-plane node has a taint that prevents pods from being scheduled on it.
If you don’t see the message about the taint, you are likely running a single-node cluster (e.g., minikube) or you have removed the taint from the control-plane node. By default, Kubernetes prevents you from scheduling pods on the control plane.
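You can verify the taint on the control-plane node and, on a lab cluster where you are happy to schedule workloads there, remove it. The trailing minus removes the taint; note that newer Kubernetes releases use node-role.kubernetes.io/control-plane rather than master:

$ kubectl describe node <cp-node-name> | grep -i taint
$ kubectl taint nodes <cp-node-name> node-role.kubernetes.io/master:NoSchedule-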
$ kubectl describe pod busybox-65cf7d7587-vrcv8
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 59s default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node-role.kubernetes.io/master:NoSchedule: }, that the pod didn't tolerate.
For this example, the simple solution is to either increase the amount of RAM on one or more of your nodes or simply decrease the requested resources in the pod spec.
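In the pod spec, that means trimming the requests block down to something your nodes can actually satisfy, for example (the values here are illustrative):

          resources:
            requests:
              cpu: 100m
              memory: 128Mi    # small enough to fit on an ordinary node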
This is great if the pod is stuck in a pending state, but what about when a pod doesn’t even show up in your output?
Non-Existent Pods
You’ve created a deployment, but when checking the pod status you see that no new pods were created. To replicate this, we will run through the steps below. First, create a new namespace to test in, so we avoid messing with any other workloads.
$ kubectl create ns quota
Apply these two files from our GitHub page:
$ kubectl apply -f https://raw.githubusercontent.com/gleamingthekube/troubleshooting-kubernetes/main/resource-quota.yaml -n quota
$ kubectl apply -f https://raw.githubusercontent.com/gleamingthekube/troubleshooting-kubernetes/main/pending-pod.yaml -n quota
Take a look at the pods in the quota namespace and notice that nothing was created. With no pod to describe, it is difficult to follow the advice above.
$ kubectl get pod -n quota
No resources found in quota namespace.
In this case, we can look at the events of the entire namespace. Notice we are hitting a resource quota problem. When you have a lot of pods in the namespace this can be tedious to look through; a filtered query is shown after the output below.
$ kubectl get events -n quota
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 17s replicaset-controller Error creating: pods "stress-ng-99f96b676-87qcx" is forbidden: exceeded quota: pods-low, requested: requests.cpu=500m,requests.memory=500Gi, used: requests.cpu=0,requests.memory=0, limited: requests.cpu=1m,requests.memory=1m
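To cut down on the noise, you can limit the query to warnings and sort by time; both are standard kubectl flags:

$ kubectl get events -n quota --field-selector type=Warning --sort-by='.lastTimestamp'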
We could also take an alternative approach to our troubleshooting. We don’t see a pod running, but remember, we created a deployment, so our next step is to go to the topmost level and make sure there were no problems with the deployment itself.
$ kubectl describe deploy stress-ng -n quota
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 3m38s deployment-controller Scaled up replica set stress-ng-99f96b676 to 1
Nothing suspicious here: the deployment did its job, creating the ReplicaSet and requesting that it scale up to 1. Let’s move ahead and take a look at the ReplicaSet.
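If you don’t have the ReplicaSet name handy, list the ReplicaSets in the namespace first; the generated name is the deployment name followed by a pod-template hash:

$ kubectl get rs -n quota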
$ kubectl describe rs stress-ng-99f96b676 -n quota
Here we can again see the events that caused our failure.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 6m58s replicaset-controller Error creating: pods "stress-ng-99f96b676-87qcx" is forbidden: failed quota: pods-low: must specify cpu,memory
You can, though, go ahead and explore the resource quota on the namespace further. You will see this quota is set extremely low. You can adjust or remove it so the deployment is allowed to create the pod (keep in mind that the sample pod’s 500Gi memory request will still leave it Pending, as in the first section, unless you lower that as well).
$ kubectl describe quota pods-low -n quota
Name: pods-low
Namespace: quota
Resource Used Hard
-------- ---- ----
pods 0 1
requests.cpu 0 1m
requests.memory 0 1m
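To unblock the deployment, you can either delete the quota or raise it to something sensible. A sketch of a more reasonable ResourceQuota (the values are illustrative) and the delete alternative:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-low
  namespace: quota
spec:
  hard:
    pods: "10"
    requests.cpu: "2"
    requests.memory: 4Gi

$ kubectl delete resourcequota pods-low -n quota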
That wraps up this article; we will tackle application and network troubleshooting in future articles.