Building Resilience With Chaos Engineering and Litmus

The scalability, agility, and continuous delivery offered by microservices architecture make it a popular option for businesses today. Nevertheless, microservices architectures are not invulnerable to disruptions. Various factors, such as network communication, inter-service dependencies, external dependencies, and scalability issues, can contribute to outages.

Prominent companies like Slack, Twitter, Robinhood Trading, Amazon, Microsoft, Google, and others have recently encountered outages resulting in significant downtime expenses. These incidents underscore the wide-ranging causes of outages in microservices architectures, encompassing configuration errors, database issues, infrastructure scaling failures, and code problems.

To mitigate the impact of outages and enhance system availability, it is essential for businesses to prioritize resiliency principles during the design, development, and operation of microservices architectures. This article will explore how chaos engineering can assist in improving system resiliency and minimizing outages.

I recently spoke at Chaos Carnival on the same topic. You can also watch my talk here.

What Is Chaos Engineering?

Chaos engineering is a technique used to assess the resilience and dependability of intricate systems by deliberately introducing controlled failures. Its objective is to proactively identify and draw attention to system flaws before they result in real-world issues such as outages, data loss, or security breaches.

This is achieved by simulating diverse failure scenarios, including network disruptions, server malfunctions, or unexpected surges in traffic, and observing the system's response. By intentionally inducing failures in a controlled environment, chaos engineering allows teams to gain deeper insights into the limitations and failure domains of their systems, enabling them to develop strategies for mitigating or preventing such failures in the future.

Prominent companies like Netflix, Amazon, Google, and Microsoft have recognized the significance of chaos engineering in ensuring site reliability. Netflix, for example, has introduced tools like Chaos Monkey, Chaos Kong, and ChAP, which target different levels of infrastructure to uphold their service level agreements. Amazon has incorporated the concept of Gamedays into their AWS Well-Architected Framework, wherein various teams collaborate to simulate chaos in their environment, fostering knowledge and reinforcing system reliability as a whole.

What Is Resiliency Testing?

The primary focus of resiliency testing revolves around assessing a system's capacity to bounce back from disruptions or failures and maintain its intended functionality. The objective of resiliency testing is to enhance the overall reliability and availability of a system while mitigating the impact of potential disruptions or failures. Through the identification and resolution of potential vulnerabilities or weaknesses in system design or implementation, resiliency testing ensures the system's continuous operation in the presence of unforeseen events or circumstances.

Why Should I Test Resiliency?

Resiliency testing is essential for a number of reasons. Here are a few examples:

Thus, conducting resilience testing holds significant importance in guaranteeing the reliability, availability, and swift recovery of your system from failures or outages. Through the identification and resolution of potential points of failure, you can establish a more sturdy and resilient system that not only enhances the user experience but also adheres to regulatory requirements.

Why Should I Test Resiliency in Kubernetes?

Resiliency testing in Kubernetes is crucial due to its complex and distributed nature, catering to large-scale, mission-critical applications. While Kubernetes offers features like automatic scaling, self-healing, and rolling updates for resiliency, glitches or failures can still occur in a Kubernetes cluster.

Here are the top reasons why we should test resiliency in Kubernetes:

Considering all of this, testing resiliency in Kubernetes is important to ensure that your application can handle interruptions and continue to function as intended.

Chaos vs Resiliency vs Reliability

Chaos, resiliency, and reliability are interconnected concepts, but they should not be used interchangeably. Below is a brief overview of each concept:

In essence, chaos engineering involves intentionally introducing failures into a system to test its resilience, which refers to its ability to bounce back from such failures. Reliability, on the other hand, measures the consistent and predictable performance of a system over an extended period. All three concepts (chaos engineering, resilience, and reliability) are crucial for building and sustaining robust, dependable systems, and each contributes in its own way to the overall quality and resilience of a system.

What Are Available Tools To Test System Resiliency?

Litmus, Gremlin, Chaos Mesh, and Chaos Monkey are all popular open-source tools used for chaos engineering. As we will be using AWS cloud infrastructure, we will also explore AWS Fault Injection Simulator (FIS). While they share the same goals of testing and improving the resilience of a system, there are some differences between them. Here are some comparisons:

Scope             | Chaos Mesh | Chaos Monkey | Litmus         | Gremlin    | AWS FIS
Kubernetes-native | Yes        | Yes          | Yes            | Yes        | No
Cloud-native      | No         | No           | Yes            | Yes        | Yes (AWS)
Bare metal        | No         | No           | No             | Yes        | No
Built-in library  | Basic      | Basic        | Extensive      | Extensive  | Basic
Customization     | Using YAML | Using YAML   | Using Operator | Using DSL  | Using SSM docs
Dashboard         | No         | No           | Yes            | Yes        | No
OSS               | Yes        | Yes          | Yes            | Yes        | No

The bottom line is that while all of these tools share similar features, we chose Litmus because it gives us the flexibility to leverage AWS SSM documents to execute chaos in our AWS infrastructure. Now let's see how we can use Litmus to inject chaos, such as terminating pods in Kubernetes and EC2 instances in AWS.

Installing Litmus in Kubernetes

First, let's see how to install Litmus in Kubernetes so we can execute chaos in the environment.

Here are the basic installation steps for LitmusChaos:

  1. Set up a K8s cluster: We need a running Kubernetes cluster. For this article, we will use k3d.

Shell
 
$ k3d cluster create


Shell
 
$ kubectl cluster-info

Kubernetes control plane is running at https://0.0.0.0:38537

CoreDNS is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Metrics-server is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


  2. Install Helm and add the LitmusChaos chart repository: Install Helm by following the instructions on the Helm website, then run the following command to add the LitmusChaos chart repository:

Shell
 
$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
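
Optionally, you can refresh the local chart index afterwards so Helm picks up the latest published chart versions:

Shell
 
$ helm repo update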


  3. Install LitmusChaos: Run the following command to install LitmusChaos:

Shell
 
$ helm install litmuschaos litmuschaos/litmus --namespace=litmus


This will install the LitmusChaos control plane in the litmus namespace. You can change the namespace to your liking.
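
Note: the install command above assumes the litmus namespace already exists. If it does not, create it first (or add Helm's --create-namespace flag to the install command):

Shell
 
$ kubectl create namespace litmus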

  4. Verify the installation: Run the following command to verify that LitmusChaos is running:

Shell
 
$ kubectl get pods -n litmus
NAME                                       READY   STATUS    RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          6m22s
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          6m22s
chaos-litmus-server-585786dd9c-16xj7       1/1     Running   0          6m22s


This should show the LitmusChaos control plane pods running.

  5. Log in to the Litmus portal using port forwarding:

Shell
 
$ kubectl port-forward svc/chaos-litmus-frontend-service -nlitmus 9091:9091


Open http://localhost:9091 in your browser and log in. Once you log in, the litmus-agent (called self-agent) components are installed in the cluster. Verify them:

Shell
 
$  kubectl get pods -n litmus

NAME                                       READY   STATUS    RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          9m6s
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          9m6s
chaos-litmus-server-585786dd9c-16xj7       1/1     Running   0          9m6s
subscriber-686d9b8dd9-bjgjh                1/1     Running   0          9m6s
chaos-operator-ce-84bc885775-kzwzk         1/1     Running   0          92s
chaos-exporter-6c9b5988cd-1wmpm            1/1     Running   0          94s
event-tracker-744b6fd8cf-rhrfc             1/1     Running   0          94s
workflow-controller-768b7d94dc-xr6vy       1/1     Running   0          92s


With these steps, you should have LitmusChaos installed and ready to use on your Kubernetes cluster.

Experimenting With Chaos

Experimenting with chaos in a cloud-native environment typically involves using a chaos engineering tool to simulate various failure scenarios and test the resilience of the system. Most cloud-native application infrastructure consists of Kubernetes plus the underlying cloud components. For this article, we will look at chaos in both Kubernetes and the cloud environment, i.e., AWS.

Chaos in K8s

For evaluating the resilience of a Kubernetes cluster we can test the following failure scenarios:

By executing these failure scenarios, you can pinpoint potential weaknesses in the cluster's resilience and improve the system to ensure it achieves high availability and reliability.

Scenario: Killing a Pod

In this experiment, we will kill a pod using Litmus. We will use an Nginx deployment as the sample application under test (AUT).

Shell
 
$ kubectl create deploy nginx --image=nginx -nlitmus
Shell
 
$ kubectl get deploy -nlitmus | grep nginx

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
nginx  1/1     1            1           109m


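The experiment can be scheduled from the Litmus portal as a chaos workflow (the kill-pod-test workflow shown in the output below), or defined declaratively as a ChaosEngine resource. As a rough sketch, a minimal pod-delete ChaosEngine targeting the nginx deployment might look like the following; the engine name, service account, label selector, and durations here are assumptions, and the pod-delete ChaosExperiment CR plus a suitable service account must already exist in the namespace:

Shell
 
$ kubectl apply -n litmus -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  # Application under test: the nginx deployment created above
  appinfo:
    appns: litmus
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
EOF
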
Shell
 
$ kubectl get pods -nlitmus

NAME                                       READY   STATUS      RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running     0          32m
chaos-mongodb-68f8b9444c-w2kkm             1/1     Running     0          32m
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running     0          32m
chaos-litmus-server-585786dd9c-16xj7       1/1     Running     0          32m
subscriber-686d9b8dd9-bjgjh                1/1     Running     0          24m
chaos-operator-ce-84bc885775-kzwzk         1/1     Running     0          24m
chaos-exporter-6c9b5988cd-1wmpm            1/1     Running     0          24m
event-tracker-744b6fd8cf-rhrfc             1/1     Running     0          24m
workflow-controller-768b7d94dc-xr6vy       1/1     Running     0          24m
kill-pod-test-1683898747-869605847         0/2     Completed   0          9m36s
kill-pod-test-1683898747-2510109278        2/2     Running     0          5m49s
pod-delete-tdoklgkv-runner                 1/1     Running     0          4m29s
pod-delete-swkok2-pj48x                    1/1     Running     0          3m37s
nginx-76d6c9b8c-mnk8f                      1/1     Running     0          4m29s


You can review the series of events to understand the entire process. Some of the main events are shown below: the experiment pod is created, the nginx pod (AUT) is deleted, a replacement nginx pod is created, and the experiment completes successfully.

Shell
 
$ kubectl get events -nlitmus                                              

66s   Normal    Started            pod/pod-delete-swkok2-pj48x                  Started container pod-delete-swkok2
62s   Normal    Awaited            chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
58s   Normal    PreChaosCheck      chaosengine/pod-delete-tdoklgkv              AUT: Running
58s   Normal    Killing            pod/nginx-76d6c9b8c-c8vv7                    Stopping container nginx
58s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-mnk8f
44s   Normal    Killing            pod/nginx-76d6c9b8c-mnk8f                    Stopping container nginx
44s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-kqtgq
43s   Normal    Scheduled          pod/nginx-76d6c9b8c-kqtgq                    Successfully assigned litmus/nginx-76d6c9b8c-kqtgq to k3d-k3s-default-server-0
12s   Normal    PostChaosCheck     chaosengine/pod-delete-tdoklgkv              AUT: Running
8s    Normal    Pass               chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
8s    Normal    Summary            chaosengine/pod-delete-tdoklgkv              pod-delete experiment has been Passed
3s    Normal    Completed          job/pod-delete-swkok2                        Job completed
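
Besides the event stream, you can inspect the ChaosResult resource directly to see the verdict and experiment status (the resource name below is taken from the events above):

Shell
 
$ kubectl describe chaosresult pod-delete-tdoklgkv-pod-delete -n litmus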


Chaos in AWS

Chaos can be introduced in any AWS environment with the following failures:

Scenario: Terminate EC2 Instance

In this scenario, we will run a single chaos experiment that terminates an EC2 instance. Litmus leverages AWS SSM documents to execute experiments in AWS. The scenario requires two manifest files: a ConfigMap containing the script for the SSM document, and a manifest describing the complete workflow of the scenario. Both manifest files can be found here.
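
Before running AWS experiments, Litmus needs AWS credentials with permission to run SSM commands and act on the target instance. These are typically supplied as a Kubernetes secret in the chaos namespace; a rough sketch is shown below. The secret and key names follow the common Litmus convention, so verify them against the experiment you use, and never commit real keys:

Shell
 
$ kubectl apply -n litmus -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
  namespace: litmus
type: Opaque
stringData:
  cloud_config.yml: |-
    [default]
    aws_access_key_id = <your-access-key-id>
    aws_secret_access_key = <your-secret-access-key>
EOF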

Shell
 
$ kubectl apply -f https://raw.githubusercontent.com/rutu-k/litmus-ssm-docs/main/terminate-instance-cm.yaml


Shell
 
$ kubectl logs aws-ssm-chaos-by-id-vSoazu-w6tmj -n litmus -f

time="2023-05-11T13:05:10Z" level=info msg="Experiment Name: aws-ssm-chaos-by-id"
time="2023-05-11T13:05:14Z" level=info msg="The instance information is as follows" Chaos Namespace=litmus Instance ID=i-0da74bcaa6357ad60 Sequence=parallel Total Chaos Duration=960
time="2023-05-11T13:05:14Z" level=info msg="[Info]: The instances under chaos(IUC) are: [i-0da74bcaa6357ad60]"
time="2023-05-11T13:07:25Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:07:26Z" level=info msg="The ssm command status is Success"
time="2023-05-11T13:07:28Z" level=info msg="[Wait]: Waiting for chaos interval of 120s"
time="2023-05-11T13:09:28Z" level=info msg="[Info]: Target instanceID list, [i-0da74bcaa6357ad60]"
time="2023-05-11T13:09:28Z" level=info msg="[Chaos]: Starting the ssm command"
time="2023-05-11T13:09:28Z" level=info msg="[Wait]: Waiting for the ssm command to get in InProgress state"
time="2023-05-11T13:09:28Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:30Z" level=info msg="The ssm command status is InProgress"
time="2023-05-11T13:09:32Z" level=info msg="[Wait]: waiting for the ssm command to get completed"
time="2023-05-11T13:09:32Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:32Z" level=info msg="The ssm command status is Success"
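
You can also cross-check from outside the cluster that the target instance was actually acted upon. A quick spot check with the AWS CLI, assuming it is configured for the same account and region (the instance ID is the one from the logs above):

Shell
 
$ aws ec2 describe-instances \
    --instance-ids i-0da74bcaa6357ad60 \
    --query 'Reservations[].Instances[].State.Name' \
    --output text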


What To Do Next?

Design a Resiliency Framework

A resiliency framework is a structured approach, or a collection of principles and strategies, that uses chaos engineering to establish resilience and guarantee overall reliability. The following sections describe the typical steps, or lifecycle, of such a framework:

Assign a Resiliency Score

The resiliency score can be considered as a metric that quantifies and assesses the level of resilience and robustness in a system. It takes into account several factors, such as the system's architecture, mean time to recover (MTTR), mean time between failures (MTBF), redundancy measures, availability, scalability, fault tolerance, monitoring capabilities, and recovery strategies. The calculation of the resiliency score varies between systems and organizations, as it depends on their specific priorities and requirements.
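
For example, one common input to such a score is steady-state availability, which can be estimated as MTBF / (MTBF + MTTR): a service with an MTBF of 720 hours and an MTTR of 1 hour yields roughly 720 / 721 ≈ 99.86% availability, while cutting MTTR to 15 minutes through better runbooks and automation pushes it above 99.96%. The exact weighting of such inputs in the overall score is up to each organization.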

The resiliency score assists organizations in assessing the resiliency level of their systems and pinpointing areas that require enhancement. A higher resiliency score signifies a more robust system that can effectively handle failures while minimizing disruptions to functionality and user experience. By consistently measuring and monitoring the resiliency score, organizations can track their progress in enhancing system resiliency and ensure ongoing improvement.

Gamedays

Gamedays are structured and scheduled events in which organizations simulate real-world failure scenarios and assess the resiliency of their systems within a secure and controlled setting. In a Gameday, teams intentionally induce failures or introduce chaos into the system to carefully observe and analyze its behavior and response.

It is essential for organizations to engage in Gamedays as they provide valuable opportunities for teams to enhance their incident response and troubleshooting capabilities. By participating in Gamedays, teams can improve their collaboration, communication, and coordination skills in high-pressure situations, which are crucial when dealing with real-world failures or incidents. Gamedays demonstrate an organization's proactive commitment to ensuring that its system can withstand unforeseen events and maintain operations without significant disruptions.

In summary, Gamedays play a crucial role in enhancing system resiliency, validating recovery mechanisms, and fostering a culture of preparedness and continuous improvement within organizations.

Incorporate Resiliency Checks in CI/CD Pipelines

Incorporating resiliency checks into CI/CD pipelines brings numerous benefits, contributing to the overall strength and dependability of software systems. Here are some notable advantages of integrating resiliency checks into these pipelines:
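
As a rough illustration of what such a check might look like, the pipeline stage below applies a pre-defined ChaosEngine and fails the build if the experiment verdict is not Pass. The manifest name, ChaosResult name, and wait time are assumptions tied to the earlier pod-delete example:

Shell
 
$ kubectl apply -f chaosengine.yaml -n litmus
$ sleep 120   # wait for the experiment to finish; tune to your chaos duration
$ verdict=$(kubectl get chaosresult pod-delete-demo-pod-delete -n litmus \
    -o jsonpath='{.status.experimentStatus.verdict}')
$ if [ "$verdict" != "Pass" ]; then echo "Resiliency check failed: ${verdict}"; exit 1; fi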

Improve Observability Posture

It is presumed that the system is actively monitored, and relevant metrics, logs, traces, and other events are captured and analyzed before inducing chaos in the system. Make sure your observability tools and processes provide visibility into the system's health, performance, and potential issues, triggering alerts or notifications when anomalies or deviations from the steady state are detected. 

If you lack visibility into issues that arise during the chaos, it is important to add and enhance observability measures accordingly. Each chaos experiment is an opportunity to discover observability data you are missing and add it to your system.
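
If dashboards are not yet in place, even a simple probe during the chaos window is better than nothing. A minimal sketch that polls the nginx application from earlier while an experiment runs (the local port and poll interval are arbitrary choices):

Shell
 
$ kubectl port-forward deploy/nginx -n litmus 8080:80 &
$ while true; do
    # Record a timestamped HTTP status code every two seconds during the experiment
    code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/)
    echo "$(date +%T) HTTP ${code}"
    sleep 2
  done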

Conclusion

In this article, we learned what chaos engineering, resiliency, and reliability are, and how the three relate to each other. We reviewed the tools available for executing chaos and why we chose Litmus for our use case.

Further, we explored the types of chaos experiments we can execute in Kubernetes and AWS environments and walked through a demo of experiments in both. We also looked at how to design a resiliency framework and incorporate resiliency scoring, Gamedays, resiliency checks in CI/CD pipelines, and observability improvements into the platform.

Thanks for reading! Hope you found this blog post helpful. If you have any questions or suggestions, please do reach out to me on LinkedIn.
