PodDisruptionBudget — A Key for Zero Downtime

In finance, things go well when the budget is planned well. Even in an extreme scenario or a disaster, you can sustain yourself if there is a plan. Kubernetes is no different.

In the Kubernetes world, the budget is for pods. We cannot expect everything to go well all the time. Changes happen, whether to a pod or to the node itself: updates, upgrades, or even disasters. Here, planning means we don't let everything go down; instead, we define a policy so that our service neither burns out nor do we leave extra resources allocated and sitting unused.

Coming to the point. Let's consider a scenario: we need to upgrade the node version or update its spec, which happens often. Cluster downscaling is also a normal condition. In these cases, the pods running on the to-be-deleted nodes need to be drained. Say I have three nodes:
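Something like the following, where the node names and versions are hypothetical, for illustration only:

```shell
$ kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
node-1   Ready    <none>   40d   v1.27.3
node-2   Ready    <none>   40d   v1.27.3
node-3   Ready    <none>   40d   v1.27.3
```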

And many pods are running on these nodes:
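For example (the deployment label app=myapp and the pod names are assumptions; some columns of the real `-o wide` output are trimmed here for readability):

```shell
$ kubectl get pods -o wide
NAME                     READY   STATUS    NODE
myapp-7d4b9c8f6d-2xkqp   1/1     Running   node-1
myapp-7d4b9c8f6d-8nslw   1/1     Running   node-1
myapp-7d4b9c8f6d-fj9vz   1/1     Running   node-2
myapp-7d4b9c8f6d-qk27d   1/1     Running   node-3
```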

We need to remove a node from the pool, which we cannot do by detaching it instantly, as that would terminate all the pods running on it and could take services down.

The first step before detaching a node is to mark it unschedulable.
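This is done with `kubectl cordon`, shown here against the hypothetical node name node-1:

```shell
$ kubectl cordon node-1
node/node-1 cordoned

# The node now reports SchedulingDisabled
$ kubectl get nodes node-1
NAME     STATUS                     ROLES    AGE   VERSION
node-1   Ready,SchedulingDisabled   <none>   40d   v1.27.3
```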

Now, if I create new pods, none of them will be scheduled on the cordoned node, but the pods that were already there keep running as they are. We need a way to drain them.

If you have pods with local data, an additional argument is required. The drain command cordons the node by itself if that wasn't done earlier.
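A sketch of the drain step, again using the hypothetical node-1. Here `--delete-emptydir-data` is the extra flag needed when pods use local (emptyDir) storage, and `--ignore-daemonsets` is usually required because DaemonSet pods cannot be evicted:

```shell
# Evicts all evictable pods from the node (cordons it first if needed)
$ kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
```

On older kubectl versions the local-data flag was called `--delete-local-data`.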

If you quickly check the pods, you will see that the drain terminates all the running pods on that node at once. This could lead to downtime! If you are running only a few pods and all of them happen to be scheduled on the same node, it will take some time for the pods to be rescheduled on another node.

In a real scenario, you don't do this one node at a time; it has to be done for lots of nodes, for example by passing a label selector. This impacts application performance, if it doesn't take the application down entirely, because we lose a big number of running containers in our Kubernetes cluster.

To prevent such cases, we set a budget for pods, called a PodDisruptionBudget (PDB).

A PDB limits the number of concurrent disruptions that an application's pods can experience while nodes are being managed. A Deployment, ReplicationController, ReplicaSet, or StatefulSet can be bound to a PodDisruptionBudget via a label selector. The budget is specified using either of two values:

minAvailable: the minimum number of pods that must keep running for the selected label. For example, say we have 20 pods running and minAvailable is set to 10. If the node is to be drained for some reason, or the pods are to be evicted, only 10 will start terminating, and the rest will be drained gradually. At least 10 of the pods will stay in the ready state so that the application can keep serving requests. The number should be decided based on the traffic or workload the pods have to handle.

maxUnavailable: the maximum number of pods that may be unavailable while a node is being drained.

In both cases, we can specify either an absolute number or a percentage. For example, if we have 20 pods running and maxUnavailable is set to 50%, then up to 10 pods can be unavailable.
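Putting it together, a minimal PDB manifest might look like this. The name myapp-pdb and the label app: myapp are assumptions for illustration, and you specify either minAvailable or maxUnavailable, not both:

```yaml
apiVersion: policy/v1          # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 10             # absolute number, or a percentage like "50%"
  selector:
    matchLabels:
      app: myapp               # must match the labels on your pods
```

Apply it with `kubectl apply -f pdb.yaml`. Once it is in place, `kubectl drain` evicts pods through the Eviction API and will wait rather than let availability dip below the budget.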

For the PodDisruptionBudget to work, there must be at least 2 pods running for the label selector; otherwise the node cannot be drained gracefully, and the pods will be evicted forcefully when the grace period ends.

The disruption budget can be checked with kubectl get poddisruptionbudgets (short name: pdb).
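For example, assuming a hypothetical PDB named myapp-pdb covering the 20 pods from earlier with minAvailable: 10, the output would look roughly like:

```shell
$ kubectl get poddisruptionbudgets
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
myapp-pdb   10              N/A               10                    1h
```

ALLOWED DISRUPTIONS tells you how many pods can be evicted right now without violating the budget.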

This allows for higher availability while still permitting the cluster administrator to manage the cluster's nodes.

Say hi to me on Twitter and LinkedIn, where I keep sharing interesting updates.

DevOps | SRE | #GDE