Chaos Testing Your Microservices With Istio
When architecting distributed cloud applications, assume that failures will happen and design your applications for resiliency. A microservice ecosystem will fail at some point, so you need to learn to embrace failure. In short, design your microservices with failure in mind.
Chaos testing is the practice of intentionally introducing failures into your system to test the resiliency and recovery of your microservices architecture. Modern architectures need to minimize the Mean Time to Recovery (MTTR). It is therefore beneficial to validate different failure scenarios ahead of time and to take the steps needed to stabilize the system and make it more resilient.
Chaos Monkey is a popular resiliency tool created by Netflix that helps teams verify that their applications tolerate random instance failures. Chaos Monkey randomly terminates virtual machine instances and containers running in your production environment to surface errors and exception scenarios. Exposing the development team to failures more frequently helps them build resilient services.
Fault Injection With Istio
With Istio, failures such as HTTP errors or delays can be injected at the application layer to test the resiliency of the application. You can configure faults to be injected into requests that match specific conditions. You can inject either delays or aborts into the requests, which mimics latency between service calls and service failures.
Injecting planned errors and delays into your production system shows how resilient your microservice ecosystem is. It is a good way to find out whether errors cascade to other services and whether notifications reach the development teams when there is an outage. It also shows whether enough observability is in place to identify the root cause of the outage and, most importantly, to recover from the failure.
Istio enables you to inject two types of faults: HTTP Error Codes and Time Delays.
Injecting HTTP Errors
The following VirtualService manifest adds a fault injection rule that returns 503 errors for 50% of the traffic to ServiceB v2:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: serviceB
spec:
  hosts:
  - serviceB
  http:
  - fault:
      abort:
        # Return HTTP 503 instead of forwarding the request
        httpStatus: 503
        # Share of requests to abort; replaces the deprecated integer "percent" field
        percentage:
          value: 50
    route:
    - destination:
        host: serviceB
        subset: v2
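The route above refers to a subset named v2, and Istio resolves subsets through a DestinationRule for the same host. The article does not show one, so the following is only a minimal sketch of what it might look like. The resource name and the `version` pod labels are assumptions for illustration, not values taken from the original setup.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: service-b-destination   # hypothetical name
spec:
  host: serviceB                # same host the VirtualService routes to
  subsets:
  - name: v1
    labels:
      version: v1               # assumed pod label for the v1 deployment
  - name: v2
    labels:
      version: v2               # assumed pod label for the v2 deployment
```

Without a matching subset definition, requests routed to `subset: v2` have no destination to resolve to, so it is worth checking that the labels match the ones on your pods.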
Injecting Time Delays
The following VirtualService manifest introduces a delay of 10 seconds for 50% of the incoming traffic to ServiceB v1:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: serviceB
spec:
  hosts:
  - serviceB
  http:
  - fault:
      delay:
        # Hold the request for 10 seconds before forwarding it
        fixedDelay: 10s
        # Share of requests to delay; replaces the deprecated integer "percent" field
        percentage:
          value: 50
    route:
    - destination:
        host: serviceB
        subset: v1
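As noted earlier, faults can also be limited to requests that match specific conditions. The sketch below is one possible variant of the delay manifest above, intended to replace it rather than sit alongside it: only requests carrying a test header are delayed, and everything else is routed normally. The header name `x-chaos-test`, its value, and the resource name are hypothetical and chosen purely for illustration.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b-chaos          # hypothetical name
spec:
  hosts:
  - serviceB
  http:
  # Rule 1: only requests carrying the (assumed) test header are delayed.
  - match:
    - headers:
        x-chaos-test:
          exact: "true"
    fault:
      delay:
        fixedDelay: 10s
        percentage:
          value: 100
    route:
    - destination:
        host: serviceB
        subset: v1
  # Rule 2: all other requests are routed normally, with no fault.
  - route:
    - destination:
        host: serviceB
        subset: v1
```

Rules are evaluated in order, so the matched rule has to come first. Scoping the fault to a header like this keeps the experiment away from regular user traffic while you observe how callers react to the delay.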
Conclusion
Istio provides an easy way to test the resiliency of your services. The injection of errors and delays is transparent to the application and does not require any code-level changes. Since Envoy intercepts all incoming and outgoing network traffic, it handles the fault injection in the proxy itself, outside the application code.
Check the previous articles related to Istio Service Mesh Resiliency features:
Istio Circuit Breaker With Outlier Detection
Resilient Microservices With Istio Circuit Breaker
Handling Service Timeouts Using Istio
Retry Design Pattern With Istio
Additional Resources:
Istio Fault Injection