Redis Reconnection Resiliency

Background

It's a world of microservices. Such applications often need to store data temporarily and access it frequently and very quickly, so they avoid disk I/O by using in-memory databases such as Redis. They typically run multiple in-memory database clusters to handle large amounts of traffic and to avoid request failures. To access this data quickly, an application needs a preconfigured pool of established connections that is ready to serve requests.

Problem Statement

Applications built for resiliency have backup options in case of application or infrastructure failures. In-memory database clusters that exist in different data centers on different servers allow for backup connectivity in case of data center or server issues.

A multi-region resilient application that uses an in-memory database should be able to reconnect to the cluster after a disconnect, and should connect to a backup cluster if the primary cluster is unavailable, to avoid request failures.

A true region-agnostic solution dynamically selects the in-memory database cluster for persisting or retrieving data, making the transaction seamless to the client even when errors occur while the primary cluster's connections are being rehydrated. This dynamic rehydration is difficult when the application relies on modern automated dependency injection and auto-wiring, which typically create such objects once at startup.

Once an incoming request has reached the application, the request should be handled correctly even if the underlying infrastructure in that region experiences failures.

When the primary cluster has issues, the data for an incoming request needs to be persisted to or fetched from a backup in-memory database cluster dynamically. At the same time, the connections to the failed cluster should be rehydrated in its connection pool so that subsequent requests can be served without depending on the backup cluster.

Solution Summary

A simple solution to the above problem is to recreate all the connections that were pooled against the in-memory database that has connectivity issues.

The application then uses the newly established connection pool and makes sure that all remaining transactions take their connections from it. The key technical aspect is managing these pools dynamically at runtime.

Singleton

Consider a service that uses a custom extended framework to restrict objects to a single instance using the singleton design pattern. For in-memory databases, the framework lets the application instantiate one template (similar to Spring's RestTemplate) to handle all database transactions. The underlying template, which is configured with a connection factory and a pool configuration, establishes a connection to execute each transaction. When this template becomes unusable for establishing connections, it self-heals by rehydrating the pool of connections from its factory: the framework discards the old template and creates a new one. This new single template then serves all incoming requests.
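The discard-and-recreate step can be sketched in a few lines of Java. This is only an illustration under assumptions: `TemplateHolder` and its `Supplier`-based factory are hypothetical names, not part of the framework described here, and the real template would wrap a Redis connection factory and pool configuration.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical holder that keeps exactly one live template and can swap it at runtime. */
final class TemplateHolder<T> {
    // Assumed to build a fresh template together with its connection factory and pool.
    private final Supplier<T> templateFactory;
    private final AtomicReference<T> current;

    TemplateHolder(Supplier<T> templateFactory) {
        this.templateFactory = templateFactory;
        this.current = new AtomicReference<>(templateFactory.get());
    }

    /** Returns the single template that all requests share. */
    T get() {
        return current.get();
    }

    /**
     * Replaces the template only if the caller's broken instance is still the live one,
     * so several requests failing at the same time trigger a single rebuild.
     */
    synchronized T rehydrate(T broken) {
        if (current.get() == broken) {
            current.set(templateFactory.get());
        }
        return current.get();
    }
}
```

A caller that catches a connection error would pass the template it was using to `rehydrate` and retry with the returned instance. Comparing against the broken instance is a design choice that keeps late-arriving failures from discarding a template that has already been rebuilt.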

Active and Backup Connections

The service creates and uses multiple in-memory database templates: one to connect to the primary cluster and another to connect to a backup cluster. The primary cluster is determined by the “region” of the server the application is running on. A region can represent a geographical location, a data center, or both. The same application running in multiple regions connects to the corresponding database through the template configuration mechanism.
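As a rough sketch of how the region could drive the choice of primary and backup cluster, the Java below resolves both from a region-to-cluster map. The class name `RegionRouting`, the region names, and the host names in the usage note are all hypothetical. Choosing the first other region as the backup is an assumption made for brevity.

```java
import java.util.Map;

/** Hypothetical result of resolving which cluster is primary and which is backup. */
final class RegionRouting {
    final String primary;
    final String backup;

    private RegionRouting(String primary, String backup) {
        this.primary = primary;
        this.backup = backup;
    }

    /**
     * Picks the cluster configured for the local region as primary and the first
     * cluster of any other region (in the map's iteration order) as backup.
     */
    static RegionRouting forRegion(String localRegion, Map<String, String> clusterByRegion) {
        String primary = clusterByRegion.get(localRegion);
        if (primary == null) {
            throw new IllegalArgumentException("No cluster configured for region " + localRegion);
        }
        for (Map.Entry<String, String> entry : clusterByRegion.entrySet()) {
            if (!entry.getKey().equals(localRegion)) {
                return new RegionRouting(primary, entry.getValue());
            }
        }
        throw new IllegalArgumentException("No backup cluster configured for region " + localRegion);
    }
}
```

With a `LinkedHashMap` containing `east -> redis-east:6379` and `west -> redis-west:6379`, calling `RegionRouting.forRegion("west", clusters)` yields `redis-west:6379` as primary and `redis-east:6379` as backup. The same build deployed in the east region resolves the opposite pair.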

Cross Region Resiliency

A service is “cross-region” resilient when the application is deployed on multiple servers in multiple regions and the underlying in-memory database cluster infrastructure is similarly deployed on different servers in different regions.

If an entire “region” experiences issues, an incoming client request is handled by the application running against the auto-replicated cluster in another region. However, once a request has reached the application, if the underlying database cluster experiences issues, the service dynamically forwards the operations to the backup cluster so that the request still completes successfully. At the same time, the service discards all the pre-created connections for the failed region, recreates a new set of connections according to the supplied pool configuration, and establishes a new template for subsequent operations. It repeats this process until reconnection succeeds.
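The forwarding behavior can be illustrated with a small Java sketch. `FailoverExecutor` is a hypothetical name, and the primary and backup templates and the rehydration hook are passed in as plain functional interfaces so the example stays independent of any Redis client. For brevity it treats every `RuntimeException` as a connectivity problem, which real code would narrow.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical executor: try the primary template, fall back to the backup on failure. */
final class FailoverExecutor<T> {
    private final Supplier<T> primary;        // supplies the current primary template
    private final Supplier<T> backup;         // supplies the backup-region template
    private final Runnable rehydratePrimary;  // assumed hook that rebuilds the primary template

    FailoverExecutor(Supplier<T> primary, Supplier<T> backup, Runnable rehydratePrimary) {
        this.primary = primary;
        this.backup = backup;
        this.rehydratePrimary = rehydratePrimary;
    }

    <R> R execute(Function<T, R> operation) {
        try {
            return operation.apply(primary.get());
        } catch (RuntimeException primaryFailure) {
            // Start rebuilding the primary so that later requests no longer need the backup.
            rehydratePrimary.run();
            try {
                // Complete the current request against the backup cluster.
                return operation.apply(backup.get());
            } catch (RuntimeException backupFailure) {
                backupFailure.addSuppressed(primaryFailure);
                throw backupFailure;
            }
        }
    }
}
```

The request that hits the failure completes on the backup, while the next call to `execute` asks the primary supplier again and therefore picks up the rebuilt template if rehydration has succeeded.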

Connection Rehydration

A broken connection between the application and the in-memory database cluster must be removed, because it cannot recover if no action is taken. When the framework is implemented around a singleton template, it holds on to broken connections and does not easily support removing them. An incoming request fails if the connection provided by the template was broken previously. Services must therefore support runtime connection rehydration by removing all the connections from the pool and creating new ones. When a connection is broken, the application recreates the templates and connections to provide a clean reconnection to the in-memory database cluster for the next transaction.

Runtime and Performance

Every service should determine its primary and backup regions on startup, so that the decision about which database cluster handles default operations is made once rather than on every request. When an operation fails on the active cluster, the service should recreate the underlying templates and connections and forward the operation to the backup cluster. This implementation allows quick, dynamic reconnection and multi-region resiliency across database clusters. Because the newly created templates are singletons and are fetched at runtime, only the request that hit the failure first pays a performance cost.

Detailed Description

Overview

Connecting to in-memory databases using existing technologies is pretty simple. The problems that we generally miss in 90 percent of applications arise when the databases are rehydrated or when a network connection is temporarily lost.

The application holds a pooled set of preconfigured connections that are produced by a factory and cached, with an eviction algorithm that runs at configured intervals to verify connection validity. The diagram below shows multiple templates that connect to multiple region-based in-memory databases. Each template is configured with a factory whose connections are created and assigned to a pool. The factory then retrieves a connection from the pool as needed when a transaction is requested.
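To make the factory, pool, and eviction relationship concrete, here is a deliberately small Java sketch. `ValidatingPool` is a hypothetical class, not the pool used by any particular Redis client, and the factory and validator are supplied by the caller. Closing evicted connections is left out to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool that is filled by a factory and checked by a validator. */
final class ValidatingPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;    // creates a new connection
    private final Predicate<C> validator; // returns true while a connection is still usable
    private final int size;

    ValidatingPool(int size, Supplier<C> factory, Predicate<C> validator) {
        this.size = size;
        this.factory = factory;
        this.validator = validator;
        for (int i = 0; i < size; i++) {
            idle.push(factory.get()); // pre-create connections so requests do not wait
        }
    }

    /** Hands out an idle connection, or creates one if the pool is momentarily empty. */
    synchronized C borrow() {
        C connection = idle.poll();
        return connection != null ? connection : factory.get();
    }

    /** Returns a connection to the pool; extras beyond the configured size are dropped. */
    synchronized void giveBack(C connection) {
        if (idle.size() < size) {
            idle.push(connection);
        }
    }

    /** Eviction pass: removes idle connections that fail validation and tops the pool up. */
    synchronized int evictInvalid() {
        int before = idle.size();
        idle.removeIf(validator.negate());
        int evicted = before - idle.size();
        while (idle.size() < size) {
            idle.push(factory.get());
        }
        return evicted;
    }
}
```

In a running service, `evictInvalid` would be called at the configured interval, for example from a `ScheduledExecutorService` using `scheduleWithFixedDelay`. Note that an eviction pass only sees idle connections, which is one reason connections held elsewhere need separate treatment.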

Lost Connections

The diagram below represents a connection being lost because of DB rehydration, a DB restart, a DB network connection issue, a firewall issue, or any other reason. The entire pool of connections has now become invalid because the connections have lost their socket connections, resulting in connection-refused error messages when transactions are requested.

Invalid Connections Pool

Even after the database comes back up, the entire pool of connections is still invalid. The pooled connections have lost track of what happened to the database in the meantime, so instead of re-establishing a connection they start producing broken pipe errors.
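One way to recognize these failures in Java is to walk an exception's cause chain looking for socket-level errors. The helper below is a sketch under assumptions: `ConnectionFailures` is a hypothetical name, and Redis client libraries usually wrap the low-level exception in their own types, which are not named in this article. In the JDK, a refused connection surfaces as `java.net.ConnectException` (a subclass of `SocketException`), and a broken pipe on a blocking socket typically surfaces as a `SocketException`.

```java
import java.io.EOFException;
import java.net.SocketException;

/** Hypothetical helper that decides whether an error means the connection itself is unusable. */
final class ConnectionFailures {
    private static final int MAX_DEPTH = 16; // guards against unusually long or cyclic cause chains

    private ConnectionFailures() {
    }

    static boolean isConnectionFailure(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            // SocketException covers ConnectException (connection refused) and broken pipe;
            // EOFException covers a peer that closed the stream mid-reply.
            if (current instanceof SocketException || current instanceof EOFException) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```

An error that passes this check is a signal to rehydrate the pool, whereas an error such as an invalid argument should be returned to the caller without touching the connections.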

Rehydrated Connections

All the connections inside a pool require rehydration before a transaction can execute successfully. This requires the factory to create new connections and place them in a pool for the corresponding templates, which also need to be recreated. When a failed connection is detected, the system recreates all the templates connected to the corresponding region-based databases.

Connections are held in a thread context and are not closed or returned to the pool, so that the next transaction can execute quickly. These are called cached connections: they are cached from the pool into a thread context. When a connection fails to execute a transaction, it is difficult to find which connection was used and how many such broken connections exist inside the pool. It takes less time to recreate the pool of connections than to go through the connections one by one and close them. Closing connections individually can also eventually lead to an exhausted pool at runtime, with no connections available for upcoming transactions.

After the in-memory database server restarts, an operation that is holding a connection in a thread will start throwing broken pipe errors. To avoid this, rehydrating the connections is the best solution, as represented below.
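One possible way to invalidate thread-held connections without hunting for each one is to stamp every cached connection with a generation number. This is an assumption for illustration, not necessarily how the framework described here does it, and `ThreadBoundConnections` is a hypothetical class. Closing the superseded connections is omitted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical per-thread connection cache that can be invalidated for all threads at once. */
final class ThreadBoundConnections<C> {
    /** A connection together with the generation it was created in. */
    private static final class Stamped<T> {
        final int generation;
        final T connection;

        Stamped(int generation, T connection) {
            this.generation = generation;
            this.connection = connection;
        }
    }

    private final Supplier<C> factory; // assumed to hand out a connection from the current pool
    private final AtomicInteger generation = new AtomicInteger();
    private final ThreadLocal<Stamped<C>> cached = new ThreadLocal<>();

    ThreadBoundConnections(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Returns this thread's cached connection, replacing it if it predates the last invalidation. */
    C current() {
        int live = generation.get();
        Stamped<C> stamped = cached.get();
        if (stamped == null || stamped.generation != live) {
            stamped = new Stamped<>(live, factory.get());
            cached.set(stamped);
        }
        return stamped.connection;
    }

    /** Marks every thread's cached connection as stale without enumerating them. */
    void invalidateAll() {
        generation.incrementAndGet();
    }
}
```

Calling `invalidateAll()` when a failure is detected costs a single increment, and each thread picks up a fresh connection the next time it calls `current()`. This reflects the point above that replacing everything is cheaper than inspecting connections one by one.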

Flowchart That Can Explain the End-to-End Flow
