After a base Kubernetes cluster setup, the next question is: what happens when something goes wrong or an incident occurs? Running a containerized application in an orchestrated environment is step one, but monitoring the application and the cluster is just as important. This holds even on a managed Kubernetes cluster like GKE, where we don’t need to worry about master nodes, autoscaling, auto-healing and so on. We still need to know when things go south. In this post, I will share some of the base-level monitoring we need to set up, assuming we are using Prometheus for the metrics.
For a quick monitoring setup, we can use the kube-prometheus-stack chart from the Prometheus community, which is a Swiss Army knife for a starter setup. It combines Kubernetes manifests, Grafana dashboards, and Prometheus rules.
Let’s dive into the key metrics and their Prometheus queries. We can simply create a Grafana dashboard for them and set up alerting from there, which makes things much easier.
- Container Restart and CrashLoopBackOff: There can be many reasons for a container restart, which we can dig into later through the logs. Monitoring at least makes us aware that there is an issue somewhere when containers restart too often within a given window. The cause could be a failing health check, Out of Memory (OOM), poor error handling in the application, a node issue, etc. We can set the restart count, the interval, and the severity based on the type of application.
- Pod in Pending state: For a pod to be scheduled and reach the Running state, a node must have the resources the pod asks for (its requests in K8s), such as CPU and memory. If no node has those resources available, the pod stays in the Pending state, waiting for compute resources. This is common early on, when we are not yet sure how much the application uses or what its usage pattern looks like, and also when the node auto-scaling limit has been reached.
- Pod termination reason: If everything goes as expected, pods terminate through the normal flow. But sometimes a pod gets killed for an unexpected reason, such as a non-zero exit code from the application, OOM, etc. Alerting on these metrics helps in debugging the reason for termination. We can go through the Kubernetes events or logs later.
- Ingress monitoring: Monitoring the resources inside Kubernetes is not always sufficient. The end goal of running our application in an orchestrated environment is to provide a good user experience, even in adverse cases. By monitoring ingress traffic, status codes, and data transfer rate, we can tell whether end users are experiencing issues. For example, if a user waits 1 minute for an API response, we won’t see that in system metrics, but it shows up on the load balancer or ingress side. If distributed tracing or custom application metrics are in place, those can help as well.
- CPU/Memory/Network usage of pod: Both over- and under-utilization are bad for end-user experience, cloud bills, and application stability. Provisioning resources based on the usage history of a specific set of applications gives the best results. For this, monitoring base metrics like CPU, memory, network, and disk I/O is very helpful. We can also set up auto-scaling based on the results.
- Node resource usage: When a node is under high pressure, the containers running on it can behave abnormally. Monitoring the disk, memory, CPU, and bandwidth utilization of the node helps us apply auto-scaling rules as well as choose the right type of node.
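
To make the list above more concrete, here is a minimal sketch of one possible PromQL query per item, kept as strings in a small Python module. It is a starting point under stated assumptions, not a definitive rule set. The metric names come from kube-state-metrics, cAdvisor, and node-exporter, which kube-prometheus-stack normally deploys. The ingress queries assume the ingress-nginx controller and will not match other ingress controllers or a cloud load balancer. The thresholds, the time windows, the dictionary keys, and the helper `instant_query_url` are illustrative choices of mine and should be tuned per application.

```python
from urllib.parse import urlencode

# Illustrative PromQL expressions, one or more per item in the list above.
# Thresholds (e.g. "> 3") and windows (e.g. "[15m]") are assumptions.
QUERIES = {
    # Container restarts and CrashLoopBackOff (kube-state-metrics)
    "container_restarts": "increase(kube_pod_container_status_restarts_total[15m]) > 3",
    "crash_loop_backoff": 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1',
    # Pods stuck in Pending (kube-state-metrics)
    "pods_pending": 'sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0',
    # Last termination reason, here OOM kills (kube-state-metrics)
    "oom_killed": 'kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1',
    # Ingress error ratio and latency (assumes ingress-nginx metrics)
    "ingress_5xx_ratio": (
        'sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))'
        " / sum(rate(nginx_ingress_controller_requests[5m]))"
    ),
    "ingress_p95_latency_seconds": (
        "histogram_quantile(0.95, sum by (le, ingress) "
        "(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])))"
    ),
    # Pod CPU, memory, and network usage (cAdvisor)
    "pod_cpu_cores": 'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))',
    "pod_memory_bytes": 'sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})',
    "pod_network_rx_bytes_per_s": "sum by (namespace, pod) (rate(container_network_receive_bytes_total[5m]))",
    # Node CPU, memory, and disk usage as ratios from 0 to 1 (node-exporter)
    "node_cpu_used_ratio": '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    "node_memory_used_ratio": "1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes",
    "node_disk_used_ratio": (
        '1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}'
        ' / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}'
    ),
}


def instant_query_url(base_url: str, name: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a named query.

    base_url is wherever Prometheus is reachable (an assumption of this
    sketch, e.g. a local port-forward); name is a key of QUERIES.
    """
    query = QUERIES[name]
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Print every query so it can be pasted into Grafana or the Prometheus UI.
    for name, expr in QUERIES.items():
        print(f"{name}: {expr}")
```

Each expression can be pasted into a Grafana panel or used as the expression of an alert rule. For alerts such as pending pods, it usually makes sense to also require the condition to hold for some minutes before firing, so short-lived scheduling delays do not page anyone.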