Building Resilience With Chaos Engineering and Litmus

The scalability, agility, and continuous delivery offered by microservices architecture make it a popular option for businesses today. Nevertheless, microservices architectures are not invulnerable to disruptions. Various factors, such as network communication, inter-service dependencies, external dependencies, and scalability issues, can contribute to outages.

Prominent companies like Slack, Twitter, Robinhood Trading, Amazon, Microsoft, Google, and others have recently encountered outages resulting in significant downtime expenses. These incidents underscore the wide-ranging causes of outages in microservices architectures, encompassing configuration errors, database issues, infrastructure scaling failures, and code problems.

To mitigate the impact of outages and enhance system availability, it is essential for businesses to prioritize resiliency principles during the design, development, and operation of microservices architectures. This article will explore how chaos engineering can assist in improving system resiliency and minimizing outages.

I recently spoke at Chaos Carnival on the same topic. You can also watch my talk here.

What Is Chaos Engineering?

Chaos engineering is a technique used to assess the resilience and dependability of intricate systems by deliberately introducing controlled failures. Its objective is to proactively identify and draw attention to system flaws before they result in real-world issues such as outages, data loss, or security breaches.

This is achieved by simulating diverse failure scenarios, including network disruptions, server malfunctions, or unexpected surges in traffic, and observing the system's response. By intentionally inducing failures in a controlled environment, chaos engineering allows teams to gain deeper insights into the limitations and failure domains of their systems, enabling them to develop strategies for mitigating or preventing such failures in the future.

Prominent companies like Netflix, Amazon, Google, and Microsoft have recognized the significance of chaos engineering in ensuring site reliability. Netflix, for example, has introduced tools like Chaos Monkey, Chaos Kong, and ChAP, which target different levels of infrastructure to uphold their service level agreements. Amazon has incorporated the concept of Gamedays into their AWS Well-Architected Framework, wherein various teams collaborate to simulate chaos in their environment, fostering knowledge and reinforcing system reliability as a whole.

What Is Resiliency Testing?

The primary focus of resiliency testing revolves around assessing a system's capacity to bounce back from disruptions or failures and maintain its intended functionality. The objective of resiliency testing is to enhance the overall reliability and availability of a system while mitigating the impact of potential disruptions or failures. Through the identification and resolution of potential vulnerabilities or weaknesses in system design or implementation, resiliency testing ensures the system's continuous operation in the presence of unforeseen events or circumstances.

Why Should I Test Resiliency?

Resiliency testing is essential for a number of reasons. Here are a few examples:

Thus, conducting resilience testing holds significant importance in guaranteeing the reliability, availability, and swift recovery of your system from failures or outages. Through the identification and resolution of potential points of failure, you can establish a more sturdy and resilient system that not only enhances the user experience but also adheres to regulatory requirements.

Why Should I Test Resiliency in Kubernetes?

Resiliency testing in Kubernetes is crucial due to its complex and distributed nature, catering to large-scale, mission-critical applications. While Kubernetes offers features like automatic scaling, self-healing, and rolling updates for resiliency, glitches or failures can still occur in a Kubernetes cluster.

Here are the top reasons why we should test resiliency in Kubernetes:

Considering all of this, testing resiliency in Kubernetes is important to ensure that your application can handle interruptions and continue to function as intended.

Chaos vs Resiliency vs Reliability

Chaos, resiliency, and reliability are interconnected concepts, but they should not be used interchangeably. Below is a brief overview of each concept:

In essence, chaos engineering involves intentionally introducing failures into a system to test its resilience, which refers to its ability to bounce back from such failures. Reliability, on the other hand, measures the consistent and predictable performance of a system over an extended period. All three concepts (chaos engineering, resilience, and reliability) are crucial for building and sustaining robust, dependable systems, and each contributes in its own way to the overall quality and resilience of a system.

What Are Available Tools To Test System Resiliency?

Litmus, Gremlin, Chaos Mesh, and Chaos Monkey are all popular open-source tools used for chaos engineering. As we will be using AWS cloud infrastructure, we will also explore AWS Fault Injection Simulator (FIS). While they share the same goals of testing and improving the resilience of a system, there are some differences between them. Here are some comparisons:

Scope             | Chaos Mesh | Chaos Monkey | Litmus         | Gremlin    | AWS FIS
Kubernetes-native | Yes        | Yes          | Yes            | Yes        | No
Cloud-native      | No         | No           | Yes            | Yes        | Yes (AWS)
Bare metal        | No         | No           | No             | Yes        | No
Built-in library  | Basic      | Basic        | Extensive      | Extensive  | Basic
Customization     | Using YAML | Using YAML   | Using Operator | Using DSL  | Using SSM docs
Dashboard         | No         | No           | Yes            | Yes        | No
OSS               | Yes        | Yes          | Yes            | Yes        | No

The bottom line is that while all of these tools share similar features, we chose Litmus because it gives us the flexibility to leverage AWS SSM documents to execute chaos in our AWS infrastructure. Now let's see how we can use Litmus to inject chaos, such as terminating pods in Kubernetes and EC2 instances in AWS.

Installing Litmus in Kubernetes

First, let's see how to install Litmus in Kubernetes so we can execute chaos in the environment.

Here are the basic installation steps for LitmusChaos:

  1. Set up a K8s cluster: We need a running Kubernetes cluster. For this article, we will use k3d.

Shell
 
$ k3d cluster create


Shell
 
$ kubectl cluster-info

Kubernetes control plane is running at https://0.0.0.0:38537

CoreDNS is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Metrics-server is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


  2. Install Helm and add the LitmusChaos chart repository: Install Helm by following the instructions on the Helm website, then run the following command to add the LitmusChaos chart repository:

Shell
 
$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
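
Optionally, you can refresh the local chart index afterwards so Helm picks up the latest published chart versions:

Shell
 
$ helm repo update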


  3. Install LitmusChaos: Run the following command to install LitmusChaos:

Shell
 
$ helm install litmuschaos litmuschaos/litmus --namespace=litmus


This will install the LitmusChaos control plane in the litmus namespace. You can change the namespace to your liking.
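
Note: the install command above assumes the litmus namespace already exists. If it does not, create it first (or add Helm's --create-namespace flag to the install command):

Shell
 
$ kubectl create namespace litmus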

  4. Verify the installation: Run the following command to verify that LitmusChaos is running:

Shell
 
$ kubectl get pods -n litmus
NAME                                       READY   STATUS    RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          6m22s
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          6m22s
chaos-litmus-server-585786dd9c-16xj7       1/1     Running   0          6m22s


This should show the LitmusChaos control plane pods running.

  5. Log in to the Litmus portal using port forwarding:

Shell
 
$ kubectl port-forward svc/chaos-litmus-frontend-service -nlitmus 9091:9091


Open http://localhost:9091 in your browser and log in. Once you log in, the litmus-agent (called self-agent) components are installed in the cluster. Verify them:

Shell
 
$  kubectl get pods -n litmus

NAME                                       READY   STATUS    RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          9m6s
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          9m6s
chaos-litmus-server-585786dd9c-16xj7       1/1     Running   0          9m6s
subscriber-686d9b8dd9-bjgjh                1/1     Running   0          9m6s
chaos-operator-ce-84bc885775-kzwzk         1/1     Running   0          92s
chaos-exporter-6c9b5988cd-1wmpm            1/1     Running   0          94s
event-tracker-744b6fd8cf-rhrfc             1/1     Running   0          94s
workflow-controller-768b7d94dc-xr6vy       1/1     Running   0          92s


With these steps, you should have LitmusChaos installed and ready to use on your Kubernetes cluster.

Experimenting With Chaos

Experimenting with chaos in a cloud-native environment typically involves using a chaos engineering tool to simulate various failure scenarios and test the resilience of the system. Most cloud-native application infrastructure consists of Kubernetes plus the underlying cloud components. For this article, we will look at chaos in both Kubernetes and the cloud environment, i.e., AWS.

Chaos in K8s

For evaluating the resilience of a Kubernetes cluster we can test the following failure scenarios:

By executing these failure scenarios, you can pinpoint potential weaknesses in the cluster's resilience and improve the system to ensure it achieves high availability and reliability.

Scenario: Killing a Pod

In this experiment, we will kill a pod using Litmus. We will use an Nginx deployment as the sample application under test (AUT).

Shell
 
$ kubectl create deploy nginx --image=nginx -nlitmus
Shell
 
$ kubectl get deploy -nlitmus | grep nginx

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
nginx  1/1     1            1           109m


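The experiment can be scheduled from the Litmus portal as a chaos workflow (the kill-pod-test workflow shown in the output below), or defined declaratively as a ChaosEngine resource. As a rough sketch, a minimal pod-delete ChaosEngine targeting the nginx deployment might look like the following; the engine name, service account, label selector, and durations here are assumptions, and the pod-delete ChaosExperiment CR plus a suitable service account must already exist in the namespace:

Shell
 
$ kubectl apply -n litmus -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  # Application under test: the nginx deployment created above
  appinfo:
    appns: litmus
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
EOF
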
Shell
 
$ kubectl get pods -nlitmus

NAME                                       READY   STATUS      RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running     0          32m
chaos-mongodb-68f8b9444c-w2kkm             1/1     Running     0          32m
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running     0          32m
chaos-litmus-server-585786dd9c-16xj7       1/1     Running     0          32m
subscriber-686d9b8dd9-bjgjh                1/1     Running     0          24m
chaos-operator-ce-84bc885775-kzwzk         1/1     Running     0          24m
chaos-exporter-6c9b5988cd-1wmpm            1/1     Running     0          24m
event-tracker-744b6fd8cf-rhrfc             1/1     Running     0          24m
workflow-controller-768b7d94dc-xr6vy       1/1     Running     0          24m
kill-pod-test-1683898747-869605847         0/2     Completed   0          9m36s
kill-pod-test-1683898747-2510109278        2/2     Running     0          5m49s
pod-delete-tdoklgkv-runner                 1/1     Running     0          4m29s
pod-delete-swkok2-pj48x                    1/1     Running     0          3m37s
nginx-76d6c9b8c-mnk8f                      1/1     Running     0          4m29s


You can review the series of events to understand the entire process. Some of the main events are shown below: the experiment pod is created, the nginx pod (AUT) is deleted, a replacement nginx pod is created, and the experiment completes successfully.

Shell
 
$ kubectl get events -nlitmus                                              

66s   Normal    Started            pod/pod-delete-swkok2-pj48x                  Started container pod-delete-swkok2
62s   Normal    Awaited            chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
58s   Normal    PreChaosCheck      chaosengine/pod-delete-tdoklgkv              AUT: Running
58s   Normal    Killing            pod/nginx-76d6c9b8c-c8vv7                    Stopping container nginx
58s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-mnk8f
44s   Normal    Killing            pod/nginx-76d6c9b8c-mnk8f                    Stopping container nginx
44s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-kqtgq
43s   Normal    Scheduled          pod/nginx-76d6c9b8c-kqtgq                    Successfully assigned litmus/nginx-76d6c9b8c-kqtgq to k3d-k3s-default-server-0
12s   Normal    PostChaosCheck     chaosengine/pod-delete-tdoklgkv              AUT: Running
8s    Normal    Pass               chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
8s    Normal    Summary            chaosengine/pod-delete-tdoklgkv              pod-delete experiment has been Passed
3s    Normal    Completed          job/pod-delete-swkok2                        Job completed
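
Besides the event stream, you can inspect the ChaosResult resource directly to see the verdict and experiment status (the resource name below is taken from the events above):

Shell
 
$ kubectl describe chaosresult pod-delete-tdoklgkv-pod-delete -n litmus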


Chaos in AWS

Chaos can be introduced in any AWS environment with the following failures:

Scenario: Terminate EC2 Instance

In this scenario, we will run a single chaos experiment that terminates an EC2 instance. Litmus leverages AWS SSM documents to execute experiments in AWS. The scenario requires two manifest files: a ConfigMap containing the script for the SSM document, and a manifest describing the complete workflow of the scenario. Both manifest files can be found here.
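
Before running AWS experiments, Litmus needs AWS credentials with permission to run SSM commands and act on the target instance. These are typically supplied as a Kubernetes secret in the chaos namespace; a rough sketch is shown below. The secret and key names follow the common Litmus convention, so verify them against the experiment you use, and never commit real keys:

Shell
 
$ kubectl apply -n litmus -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
  namespace: litmus
type: Opaque
stringData:
  cloud_config.yml: |-
    [default]
    aws_access_key_id = <your-access-key-id>
    aws_secret_access_key = <your-secret-access-key>
EOF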

Shell
 
$ kubectl apply -f https://raw.githubusercontent.com/rutu-k/litmus-ssm-docs/main/terminate-instance-cm.yaml


Shell
 
$ kubectl logs aws-ssm-chaos-by-id-vSoazu-w6tmj -n litmus -f

time="2023-05-11T13:05:10Z" level=info msg="Experiment Name: aws-ssm-chaos-by-id"
time="2023-05-11T13:05:14Z" level=info msg="The instance information is as follows" Chaos Namespace=litmus Instance ID=i-0da74bcaa6357ad60 Sequence=parallel Total Chaos Duration=960
time="2023-05-11T13:05:14Z" level=info msg="[Info]: The instances under chaos(IUC) are: [i-0da74bcaa6357ad60]"
time="2023-05-11T13:07:25Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:07:26Z" level=info msg="The ssm command status is Success"
time="2023-05-11T13:07:28Z" level=info msg="[Wait]: Waiting for chaos interval of 120s"
time="2023-05-11T13:09:28Z" level=info msg="[Info]: Target instanceID list, [i-0da74bcaa6357ad60]"
time="2023-05-11T13:09:28Z" level=info msg="[Chaos]: Starting the ssm command"
time="2023-05-11T13:09:28Z" level=info msg="[Wait]: Waiting for the ssm command to get in InProgress state"
time="2023-05-11T13:09:28Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:30Z" level=info msg="The ssm command status is InProgress"
time="2023-05-11T13:09:32Z" level=info msg="[Wait]: waiting for the ssm command to get completed"
time="2023-05-11T13:09:32Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:32Z" level=info msg="The ssm command status is Success"
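
You can also cross-check from outside the cluster that the target instance was actually acted upon. A quick spot check with the AWS CLI, assuming it is configured for the same account and region (the instance ID is the one from the logs above):

Shell
 
$ aws ec2 describe-instances \
    --instance-ids i-0da74bcaa6357ad60 \
    --query 'Reservations[].Instances[].State.Name' \
    --output text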


What To Do Next?

Design a Resiliency Framework

A resiliency framework is a structured approach, or a collection of principles and strategies, that uses chaos engineering to establish resilience and guarantee overall reliability. The following sections describe the typical steps, or lifecycle, of such a framework:

Assign a Resiliency Score

The resiliency score can be considered as a metric that quantifies and assesses the level of resilience and robustness in a system. It takes into account several factors, such as the system's architecture, mean time to recover (MTTR), mean time between failures (MTBF), redundancy measures, availability, scalability, fault tolerance, monitoring capabilities, and recovery strategies. The calculation of the resiliency score varies between systems and organizations, as it depends on their specific priorities and requirements.
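
For example, one common input to such a score is steady-state availability, which can be estimated as MTBF / (MTBF + MTTR): a service with an MTBF of 720 hours and an MTTR of 1 hour yields roughly 720 / 721 ≈ 99.86% availability, while cutting MTTR to 15 minutes through better runbooks and automation pushes it above 99.96%. The exact weighting of such inputs in the overall score is up to each organization.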

The resiliency score assists organizations in assessing the resiliency level of their systems and pinpointing areas that require enhancement. A higher resiliency score signifies a more robust system that can effectively handle failures while minimizing disruptions to functionality and user experience. By consistently measuring and monitoring the resiliency score, organizations can track their progress in enhancing system resiliency and ensure ongoing improvement.

Gamedays

Gamedays are structured and scheduled events in which organizations simulate real-world failure scenarios and assess the resiliency of their systems within a secure and controlled setting. In a Gameday, teams intentionally induce failures or introduce chaos into the system to carefully observe and analyze its behavior and response.

It is essential for organizations to engage in Gamedays as they provide valuable opportunities for teams to enhance their incident response and troubleshooting capabilities. By participating in Gamedays, teams can improve their collaboration, communication, and coordination skills in high-pressure situations, which are crucial when dealing with real-world failures or incidents. Gamedays demonstrate an organization's proactive commitment to ensuring that its system can withstand unforeseen events and maintain operations without significant disruptions.

In summary, Gamedays play a crucial role in enhancing system resiliency, validating recovery mechanisms, and fostering a culture of preparedness and continuous improvement within organizations.

Incorporate Resiliency Checks in CI/CD Pipelines

Incorporating resiliency checks into CI/CD pipelines brings numerous benefits, contributing to the overall strength and dependability of software systems. Here are some notable advantages of integrating resiliency checks into these pipelines:
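
As a rough illustration of what such a check might look like, the pipeline stage below applies a pre-defined ChaosEngine and fails the build if the experiment verdict is not Pass. The manifest name, ChaosResult name, and wait time are assumptions tied to the earlier pod-delete example:

Shell
 
$ kubectl apply -f chaosengine.yaml -n litmus
$ sleep 120   # wait for the experiment to finish; tune to your chaos duration
$ verdict=$(kubectl get chaosresult pod-delete-demo-pod-delete -n litmus \
    -o jsonpath='{.status.experimentStatus.verdict}')
$ if [ "$verdict" != "Pass" ]; then echo "Resiliency check failed: ${verdict}"; exit 1; fi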

Improve Observability Posture

It is presumed that the system is actively monitored, and relevant metrics, logs, traces, and other events are captured and analyzed before inducing chaos in the system. Make sure your observability tools and processes provide visibility into the system's health, performance, and potential issues, triggering alerts or notifications when anomalies or deviations from the steady state are detected. 

If you lack visibility into issues that arise during the chaos, it is important to add and enhance observability measures accordingly. Each chaos experiment is an opportunity to discover observability data you are missing and add it to your system.
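
If dashboards are not yet in place, even a simple probe during the chaos window is better than nothing. A minimal sketch that polls the nginx application from earlier while an experiment runs (the local port and poll interval are arbitrary choices):

Shell
 
$ kubectl port-forward deploy/nginx -n litmus 8080:80 &
$ while true; do
    # Record a timestamped HTTP status code every two seconds during the experiment
    code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/)
    echo "$(date +%T) HTTP ${code}"
    sleep 2
  done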

Conclusion

In this article, we learned what chaos engineering, resiliency, and reliability are, and how the three relate to each other. We reviewed the tools available for executing chaos and why we chose Litmus for our use case.

Further, we explored the types of chaos experiments we can execute in Kubernetes and AWS environments and walked through a demo of experiments in both. We also looked at how to design a resiliency framework and incorporate resiliency scoring, Gamedays, resiliency checks in CI/CD pipelines, and observability improvements into the platform.

Thanks for reading! Hope you found this blog post helpful. If you have any questions or suggestions, please do reach out to me on LinkedIn.
