Setting up infrastructure and getting application services running is one story; their reliability and uptime are another critical consideration, from both a technical and a business point of view. Failures can originate at many points in a setup, and Chaos Engineering is the practice of proactively identifying weaknesses, improving resilience, and enhancing overall system reliability. In essence, it is a series of simulations of different failure scenarios to check how the application or infrastructure reacts when a real fault occurs.
In this blog post, we will discuss paving a path for Chaos Testing in a Kubernetes-based architecture using LitmusChaos, an open source Chaos Engineering platform.
My first motivation for initiating Chaos Testing was to confirm whether monitoring and alerts were actually in place. We often set up tools for monitoring and alerting on cloud infrastructure, containers, and application processes, and then assume they work. Only later might we find out that the assumption was wrong. With Chaos Engineering tools, we can mock the actual failure scenarios. Some that I use in a basic setup are:
- Pod termination/restart
- Stress on a pod/container (CPU, memory, IO)
- Stress on a node (CPU, memory, IO, restart, drain, disk fill)
- Pod network latency and DNS errors
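To give a feel for how one of these experiments is declared, here is a minimal sketch of a ChaosEngine resource that runs pod termination. The target deployment (`nginx-demo`), its labels, and the service account name are hypothetical examples, not values from this setup:

```yaml
# Hypothetical ChaosEngine: periodically deletes pods of an
# example "nginx-demo" deployment for 30 seconds.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-demo-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=nginx-demo   # label selector of the target app (example)
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin   # example service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # run chaos for 30 seconds
              value: "30"
            - name: CHAOS_INTERVAL         # delete a pod every 10 seconds
              value: "10"
```

Each experiment exposes its tunables as environment variables like these, which is what makes the stress and network experiments configurable as well.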
You can browse the 50+ experiments already available in the Litmus hub, and you can also add or update your own experiments in the chaos-charts repo. I found some incompatibility issues in the experiments with the newer version (3.0) of Litmus.
Apart from Litmus, there are other Chaos Testing tools in the Cloud Native Landscape, such as Chaos Mesh, Chaos Toolkit, ChaosBlade, and Gremlin (extensive, but comes with a price). After trying out a few of the open source options, I found that Litmus covered use cases at multiple infrastructure levels and had more community support.
The installation of Litmus was pretty simple following the installation doc using Helm 3. I exposed the frontend through an NGINX ingress, and the UI comes with built-in username/password authentication.
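For reference, the Helm 3 steps look roughly like the following (chart and release names follow the Litmus install doc; your namespace and release name may differ):

```shell
# Add the LitmusChaos Helm repository and install the chart
# into a dedicated "litmus" namespace.
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install chaos litmuschaos/litmus \
  --namespace=litmus --create-namespace
```

Exposing the frontend is then a matter of pointing an Ingress at the litmus-frontend service.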
After choosing the environment (Production/non-Production) and setting the scope of the Chaos infrastructure, we get a file with the Chaos and Argo CRDs (Custom Resource Definitions) to apply with kubectl in the cluster. By the end of the setup, we should see these components running in the litmus namespace: chaos-exporter, chaos-operator, event-tracker, litmus-auth-server, litmus-frontend, litmus-server, subscriber and workflow-controller.
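Applying the downloaded file and verifying the components is a two-command check (the manifest filename here is just an example; use whatever the UI gave you):

```shell
# Apply the CRD manifest downloaded from the Litmus UI,
# then confirm the control-plane pods are running.
kubectl apply -f litmus-chaos-infra.yaml   # example filename
kubectl get pods -n litmus
```

If any of the listed components is missing or crash-looping, the chaos infrastructure will not be able to run experiments.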
A ChaosHub is a reference to the hub registry from which experiment definitions are pulled; it can be a public or private Git repository, authenticated with SSH or an access token. I used my fork of the chaos-charts repo, fixing some version incompatibilities as well as tuning some experiment definitions.
Resilience probes can be executed at the start of a test, at its end, or continuously during it, and they give us an idea of the actual resiliency of our system. There are different types of probes we can use: http, cmd, k8s, and prometheus. We can also apply multiple probes under different conditions for each experiment. For example, we could expect a certain URL to respond with a 200 status code both before and after the test, confirming that the test leaves no lasting impact.
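That before-and-after URL check can be sketched as an httpProbe. This is a hedged example: the URL is a placeholder, and the field names follow the probe schema I used (the 3.0 schema differs slightly, e.g. in how timeouts are expressed):

```yaml
# Hypothetical httpProbe attached to an experiment: expects an
# example health endpoint to return 200. "Edge" mode runs the
# probe both before and after the chaos injection.
probe:
  - name: check-frontend-availability
    type: httpProbe
    mode: Edge
    httpProbe/inputs:
      url: http://my-service/healthz   # placeholder URL
      insecureSkipVerify: false
      method:
        get:
          criteria: "=="
          responseCode: "200"
    runProperties:
      probeTimeout: 5   # seconds to wait for a response
      interval: 2       # pause between retries
      retry: 3          # attempts before marking the probe failed
```

A cmd or prometheus probe follows the same shape, swapping the `httpProbe/inputs` section for its own inputs block.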
This is the main place where we define the actual tests against our environment. The frontend dashboard lets us create and modify experiments through workflows, so we do not need to hand-edit YAML files; it generates the file at the end, which can be downloaded and saved, or applied manually from the command line later.
An experiment can be triggered instantly or on a schedule. Each experiment step can have its own resilience probes and separate authentication measures.
There are already experiments targeting some providers, but we can customize them or add our own as needed.
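Customization is mostly a matter of overriding the experiment's environment variables in the generated manifest. As an illustrative fragment (values are examples, not recommendations), tuning the pod CPU stress experiment might look like:

```yaml
# Hypothetical override of pod-cpu-hog tunables inside a
# ChaosEngine/workflow manifest.
experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          - name: CPU_CORES              # stress 2 cores in the target pod
            value: "2"
          - name: TOTAL_CHAOS_DURATION   # sustain the load for 60 seconds
            value: "60"
```

The same pattern applies to memory, IO, and node-level stress experiments, each with its own set of env tunables documented in the hub.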
Running these experiments in real environments helped my project in multiple ways. Mocking resource hogs and confirming application behavior, auto-scaling triggers, and alerts during service disruption gave us confidence in our monitoring system. Network disruption and delay, node disk fill-up, and node outages were rare issues that had been hard to replicate and easy to ignore; now they can be tested. Simply working through the existing experiments surfaced many use cases and gave us a real feel for our system's resiliency.