How to Simulate a Docker Hub Outage to Test Image Caching

First off, happy new year! 

After facing a few patches of public registry downtime at the end of the year (both Docker Hub and ECR), one of our first goals of 2022 was to implement pull-through caching of public images. This would allow Shipyard users to continue building, testing, and reviewing their applications, even when outages happen.

The implementation was straightforward (s/o to Shipyard engineer extraordinaire Rogério Shieh), but when it came time to verify, we were faced with the Fun Engineering Problem™️ of simulating a service outage.

Quick Context: Pull-Through Caching Of Public Images

In general, it’s rare for applications to add a whole new service (e.g. postgres, redis, mysql). Most iteration is on existing services. That means that Shipyard is usually pulling the same images from Docker Hub over and over.

So if we maintained a cache of public images, Shipyard could keep building environments even when the rare Docker Hub outage happens.

Thankfully, the standard image registry server has this functionality built-in. We’ll spare the implementation details for another blog post, but in short, to enable pull-through caching, you need to:
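For orientation only, here is a rough sketch of what the documented setup for the open-source registry image generally looks like. This is not necessarily Shipyard's configuration: the container name, port 5000, the mirror URL http://localhost:5000, and both function names are assumptions made up for illustration.

```shell
# Sketch only: run the stock registry image as a pull-through cache of Docker Hub.
# REGISTRY_PROXY_REMOTEURL is the environment override for the registry's
# `proxy.remoteurl` configuration key.
start_pull_through_cache() {
  docker run -d --name registry-cache -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    registry:2
}

# Print the Docker daemon setting that points clients at the cache.
# The output is meant to be merged into /etc/docker/daemon.json by hand,
# followed by a daemon restart. The default URL is a placeholder.
print_mirror_config() {
  mirror_url="${1:-http://localhost:5000}"
  printf '{\n  "registry-mirrors": ["%s"]\n}\n' "$mirror_url"
}
```

Nothing runs until you call the functions, so you can read through them first and adapt the names and URLs to your own setup.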

But once that's implemented, we face a tricky question: how do we simulate a Docker Hub outage?

Step 1: Simulating Outage For Docker Builds

To pull an image, a Docker client has to fetch metadata from registry-1.docker.io. If you add the following line to /etc/hosts, that hostname resolves to your own machine instead of Docker Hub, so you won’t be able to pull Docker images:

127.0.0.1 registry-1.docker.io

If you try to docker pull after adding that line, you’ll get the following error response:

[Screenshot: error output from docker pull]

Easy peasy.
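If you expect to repeat this experiment, the /etc/hosts change can be scripted so it is easy to undo. The following is a minimal sketch, not something from our actual tooling: the function names block_host and unblock_host are made up, and the hosts file path is a parameter so you can try it on a copy before touching the real file (which needs root).

```shell
# Sketch: add and remove a loopback override for a registry hostname.

block_host() {
  host="$1"
  hosts_file="${2:-/etc/hosts}"
  # Append the override only if an identical line is not already present.
  grep -qxF "127.0.0.1 ${host}" "$hosts_file" 2>/dev/null ||
    printf '127.0.0.1 %s\n' "$host" >> "$hosts_file"
}

unblock_host() {
  host="$1"
  hosts_file="${2:-/etc/hosts}"
  tmp="$(mktemp)"
  # Keep every line except the exact override, then write the result back
  # in place so the file keeps its owner and permissions.
  grep -vxF "127.0.0.1 ${host}" "$hosts_file" > "$tmp" || true
  cat "$tmp" > "$hosts_file"
  rm -f "$tmp"
}
```

As root, calling block_host registry-1.docker.io starts the simulated outage and unblock_host registry-1.docker.io ends it.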

Step 2: Simulating Outage For Pull-Through Cache Registry

Whenever you pull Docker images, they are pulled from https://docker.io. Overriding the pull-through registry’s DNS via /etc/hosts didn’t seem to break anything, so we had to go one level deeper.

First, we checked which IPs docker.io points to with:

 dig docker.io

[Screenshot: dig output listing the IPs for docker.io]
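If you want just the addresses for the next step, something like the following could work. It is a small sketch: resolve_ipv4 is a made-up helper name, and it assumes dig is installed. The grep drops any non-address lines (such as CNAME targets) that dig +short may print.

```shell
# Sketch: list the unique IPv4 addresses a hostname currently resolves to,
# one per line.
resolve_ipv4() {
  dig +short "$1" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort -u
}
```

Keep in mind that the addresses behind a public hostname can change over time, so it is worth re-running this right before the test.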

Then, since we’re on AWS, we updated the routing table of the cluster’s VPC to send traffic for those IPs to basically anything else (we pointed them at a random EC2 instance):

[Screenshot: VPC route table with the Docker Hub IPs routed to an EC2 instance]
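The same kind of change can be expressed with the AWS CLI. The sketch below is one possible way to do it, not a record of what we ran: the function names and the route table and instance IDs you pass in are placeholders. It only prints the aws ec2 create-route and delete-route commands for IPs read from stdin, so you can review them before running anything.

```shell
# Sketch: print (not run) the AWS CLI commands that would route each IP
# read from stdin to a placeholder EC2 instance.
print_divert_routes() {
  route_table_id="$1"
  instance_id="$2"
  while read -r ip; do
    [ -n "$ip" ] || continue   # skip blank lines
    printf 'aws ec2 create-route --route-table-id %s --destination-cidr-block %s/32 --instance-id %s\n' \
      "$route_table_id" "$ip" "$instance_id"
  done
}

# Sketch: print the commands that remove those routes again after the test.
print_restore_routes() {
  route_table_id="$1"
  while read -r ip; do
    [ -n "$ip" ] || continue
    printf 'aws ec2 delete-route --route-table-id %s --destination-cidr-block %s/32\n' \
      "$route_table_id" "$ip"
  done
}
```

For example, piping the output of dig +short docker.io into print_divert_routes with your own route table ID and instance ID gives one command per address. Save the IP list somewhere so the restore commands target the same routes later.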

Next, let’s verify that we don’t have the ability to pull brand new images that haven’t been cached.

If you use the Docker client to pull an image from Docker Hub that has never been pulled before, you’ll get the following error response:

[Screenshot: error output when pulling an uncached image]

But now, if you pull an image that is already cached, the pull goes through, even though neither the client nor the pull-through registry can contact Docker Hub:

[Screenshot: successful pull of a cached image]

Voilà. Pull-through image caching at work.
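To run both checks in one go, a small helper along these lines could do the job. It is only a sketch: check_pulls is a made-up name, and which images count as cached or never-pulled depends entirely on your own cache.

```shell
# Sketch: try to pull each image and report whether the pull succeeded.
check_pulls() {
  for image in "$@"; do
    if docker pull "$image" >/dev/null 2>&1; then
      echo "pulled: $image"
    else
      echo "failed: $image"
    fi
  done
}
```

During the simulated outage, you would expect "pulled" for an image the cache already holds and "failed" for one it has never seen.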

Final Thoughts

It’s always fun to work on problems that need creative solutions. Like any engineering problem, there are definitely other ways to achieve this simulation (e.g. service mesh, iptables), but that’s precisely where the art of engineering lies: choose what works for you.
