A Pod Restarts. So, What’s Going on?
In the Kubernetes world, pods are considered relatively ephemeral (rather than durable) entities; we cannot expect a pod to be a long-running resource. There are various reasons for the termination, restart, or re-initialization of pods when any change is introduced, and those changes can come from multiple dimensions.
A software system can only be perfectly stable if it exists in a vacuum. If we stop changing the codebase, we stop introducing bugs. If the underlying hardware or libraries never change, neither of these components will introduce bugs. If we freeze the current user base, we’ll never have to scale the system.
Ref: https://landing.google.com/sre/sre-book/chapters/simplicity/
A pod can have one or more containers: the application container itself, an init container that terminates once it has done its specific task or once the application container is ready to do its job, and a sidecar container that runs attached to the main application container.
Let’s first dig into how we can see the pods and check their restarts and health. How can we know how many containers are in a pod? Simply describing the pod gives the details: kubectl describe pod [pod-name].
Also, a detailed view of the pods running in a particular namespace of a cluster can be seen with kubectl get pods:
In the above scenario of the monitoring namespace, we can see that the first, second, and fourth pods have a READY value of 2/2 and the rest are 1/1. This means two out of two containers are healthy and ready to serve in the first case. The rest are single-container pods, and they are healthy too.
The fourth column shows the restart count. The fifth pod has a RESTARTS value of 2, meaning the pod was restarted twice in the 6 days and 13 hours since its creation. The rest of the pods have not been restarted. One thing not to be confused about: a restart does not mean re-creation of the pod. Restart and re-creation (or re-initialization) are different things, which we will also discuss below.
Coming back to the point of why a pod restarts. I am also combining the cases of re-initialization of pods in these points. The difference: a restart keeps the pod name the same (when using a Deployment), while re-initialization creates a new pod with a new suffix in its name:
1. New deployment
When a new version of a container is to be deployed, the pod is re-initialized.
$ kubectl create deploy nginx --image nginx:1.17.0-alpine -n devops
deployment.apps/nginx created

$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
nginx-5759f56c55-cjv57 1/1 Running 0 7s
Now, I need to upgrade the nginx version to 1.18.0-alpine.
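One common way to do this (a sketch; the exact command used is not shown in the original) is kubectl set image, which triggers a rolling update of the Deployment:
$ kubectl set image deployment/nginx nginx=nginx:1.18.0-alpine -n devops
$ kubectl rollout status deployment/nginx -n devops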
While the new version of the pod was being deployed (which took around 10s), it had a STATUS of ContainerCreating, and after it was ready, the old pod got killed.
2. Change in a pod’s environment variables
We can define environment variables for a single pod or for multiple pods. Listing the defined environment variables:
$ kubectl set env pods --all --list -n devops
# Pod nginx-5777594854-8pnxg, container nginx
We add new variable and check the pods:
$ kubectl set env deployment/nginx DATE=$(date '+%d/%m/%Y-%H:%M:%S') -n devops
deployment.extensions/nginx env updated

$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
nginx-7849b54d8d-tzwjx 1/1 Running 0 14s
nginx-85c988d647-4s7sr 0/1 ContainerCreating 0 2s
Yes, it re-initialized a new pod, and after it was ready, the older one got terminated. When there are many pods running, they get updated gradually rather than all at once when any environment variable is added or updated, as the rolling-update sketch below illustrates.
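That gradual replacement is controlled by the Deployment’s rolling-update strategy. A minimal sketch (the replica count and field values here are illustrative, not taken from the original post):
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod above the desired count during the update
      maxUnavailable: 0  # never take an old pod down before its replacement is ready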
3. HealthCheck failure
There are three probes for health check of a pod: liveness, readiness and startup probes.
The readiness probe indicates that the container is ready to serve traffic; a load balancer will not send traffic to the container unless its readiness probe succeeds.
The liveness probe recovers a pod when it is deadlocked or stuck and has become useless. This probe is mainly responsible for restarting a container, so it is where we focus while debugging a restarting container. A simple definition:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3
Here, the liveness check starts after a delay of 3 seconds and probes the /healthz path with an HTTP GET request on port 8080. If the check fails, the container is killed by the kubelet and keeps on restarting until the probe succeeds.
The continuous restarting of the pod changes its STATUS to CrashLoopBackOff.
Let’s do a test. I changed initialDelaySeconds: 1 and periodSeconds: 1 and applied the manifest. Here is the result:
$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
liveness-http 1/1 Running 1 21s

$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
liveness-http 1/1 Running 2 32s

$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
liveness-http 1/1 Running 4 92s

$ kubectl get po -n devops
NAME READY STATUS RESTARTS AGE
liveness-http 0/1 CrashLoopBackOff 5 2m27s
The restart count gradually increased and ultimately resulted in CrashLoopBackOff. But how can we debug this? Describe the pod and check the events at the end:
kubectl describe pod liveness-http
.....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m30s default-scheduler Successfully assigned devops/liveness-http to ip-10-0-1-117.us-west-2.compute.internal
Normal Pulled 4m3s (x3 over 4m29s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Successfully pulled image "k8s.gcr.io/liveness"
Normal Created 4m3s (x3 over 4m29s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Created container liveness
Normal Started 4m3s (x3 over 4m29s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Started container liveness
Normal Pulling 3m50s (x4 over 4m30s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Pulling image "k8s.gcr.io/liveness"
Warning Unhealthy 3m50s (x9 over 4m18s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Liveness probe failed: HTTP probe failed with statuscode: 500
Normal Killing 3m50s (x3 over 4m16s) kubelet, ip-10-0-1-117.us-west-2.compute.internal Container liveness failed liveness probe, will be restarted
It clearly shows that the liveness probe failed with an HTTP status code of 500, which resulted in multiple restarts.
For debugging, we can increase the liveness check’s initial delay, or remove the check for some time, and figure out the problem by going through the pod logs.
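Since the crashing container keeps getting replaced, the logs of the previous instance are often the most useful. A quick sketch, reusing the pod and namespace from the example above:
$ kubectl logs liveness-http --previous -n devops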
4. Draining Node
In the course of node maintenance, which could be for updating specs, upgrading the version, or fixing problems, the pods scheduled on the node(s) have to be drained, meaning the pods need to be re-initialized on a healthy node.
$ kubectl drain node1
node/node1 cordoned
evicting pod "liveness-http"
pod/liveness-http evicted
node/node1 evicted

$ kubectl get no
NAME STATUS ROLES AGE VERSION
node1 Ready,SchedulingDisabled <none> 28m v1.14.8
When we drain node1, it evicts the liveness-http pod. If compute resources are available on other nodes, the pod will be scheduled there; otherwise, it will remain in a Pending state. If any load balancer was sending traffic, it would return errors, as none of the pods with the label are in a healthy state. This leads to downtime! One way of minimizing the downtime is a PodDisruptionBudget. I have written a long post on implementing the budget on the following blog:
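As a quick illustration, here is a minimal PodDisruptionBudget sketch (the name and label selector are assumptions, not from the original post) that keeps at least one matching pod available during voluntary disruptions such as a drain:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1   # never evict the last remaining pod with this label
  selector:
    matchLabels:
      app: nginx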
The draining of a node can be either graceful or forceful. If we are using Spot instances in AWS or Preemptible instances in GCP to save cost, we get a few minutes of termination notice, followed by cordoning of the node (making it unschedulable for pods).
In that short duration, the pods scheduled on the to-be-deleted node have to be re-scheduled. There is a Helm chart for the Spot Termination Notice Handler which re-schedules the pods, but when there is a single pod running for a label and there are many pods running on the node, it might still lead to a short period of downtime.
5. OOM (Out of Memory) Kill
This is one of the common reasons for a restarting container: it happens when resource usage is not configured or the application itself behaves unpredictably.
If we have allocated 600Mi of memory for a container and it tries to allocate more than this limit, the pod will be killed with OOM. The requests value, on the other hand, is the pre-allocation for the container.
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        memory: "600Mi"
      requests:
        memory: "100Mi"
Getting an idea of a container’s behavior in terms of memory/CPU usage and limits depends solely on the application type, the load it is handling, the heap memory it uses, and so on. The limits and requests have to be set after observing the fluctuations through load testing and performance analysis.
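To confirm that a restart was due to an OOM kill, describing the pod shows the last state of the container. A sketch (pod name is illustrative and the output is trimmed):
$ kubectl describe pod app -n devops
...
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137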
6. High Node Pressure
Resource sharing is both a challenge and a feature of any distributed system, Kubernetes included. Based on the pressure on a compute node, pods can be rescheduled to nodes with lower pressure. The kubelet uses the CFS (Completely Fair Scheduler) quota to enforce pod CPU limits. When a node runs many CPU-bound pods, the workload can move between CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time.
Even when all containers are within their limits, a node whose memory is full can trigger OOM kills, resulting in rescheduling. If the node disk is full and space has to be freed, the pods there might be evicted.
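To see whether a node is under pressure, describing it shows its conditions such as MemoryPressure and DiskPressure. A sketch (node name reused from the drain example, output trimmed and values illustrative):
$ kubectl describe node node1
...
Conditions:
  Type             Status
  MemoryPressure   False
  DiskPressure     False
  PIDPressure      False
  Ready            True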