Fairwinds | Blog

Eliminate The Registry As A Single Point of Kubernetes Failure Using Saffire

Written by Andy Suderman | Mar 10, 2021 11:00:00 AM

Has this ever happened to you? You're running a kops cluster with flannel as the CNI, and Quay experiences a day-long outage. Suddenly, the well-oiled machine that is your autoscaling Kubernetes cluster grinds to a halt. New nodes never join the cluster, and other random services can't start.

You've fallen victim to one of the few remaining single points of failure in your Kubernetes cluster: the container registry. This may seem like a strange thing to write an entire tool to solve, but at Fairwinds this problem causes us massive headaches and lost time across the dozens of clusters that we run for our clients. If it isn't Quay and flannel today, it will be another cluster-critical service tomorrow.

Introducing Saffire

Saffire is a controller that runs in your cluster and watches for pods that are having trouble pulling their images. When it finds one, it attempts to patch the object that owns that pod, swapping the image reference for an alternative that you have specified. Within seconds of the failed pull, your workload is happily running on an image pulled from a different registry.
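To make the "watches for pods" part concrete, here is a minimal sketch of how a pull failure can be recognized from a pod's status. It is not Saffire's actual code. It assumes the pod is available as a plain dictionary shaped like the Kubernetes API object, and the function name `failing_images` and the example image are made up for illustration. `ErrImagePull` and `ImagePullBackOff` are the waiting reasons Kubernetes reports when a pull fails.

```python
from __future__ import annotations

# Waiting reasons reported on a container when its image cannot be pulled.
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def failing_images(pod: dict) -> list[str]:
    """Return the images of containers that are stuck on an image pull.

    Hypothetical helper: it only reads the pod's status, it does not patch anything.
    """
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    failing = []
    for container_status in statuses:
        waiting = container_status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_FAILURE_REASONS:
            failing.append(container_status["image"])
    return failing


# Hypothetical pod, trimmed down to the fields the check reads.
pod = {
    "metadata": {"name": "app-abc12", "namespace": "demo"},
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "image": "quay.io/example/app:1.0",
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }
        ]
    },
}

print(failing_images(pod))
```

A real controller would receive these objects from a watch on the API server rather than from a hard-coded dictionary.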

How Does It Work?

Saffire provides a Custom Resource Definition; you deploy a resource of that type to specify the list of registries that are considered equivalent for the images in that namespace. This, coupled with a package called controller-utils that a friend and co-worker of mine wrote, allows Saffire to look up the top-level controller of the failing pods and patch it with a new image source.
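The two ideas in that paragraph, swapping between equivalent image sources and finding the top-level controller, can be sketched roughly as follows. This is an illustration under stated assumptions, not Saffire's or controller-utils' implementation: the function names, the `objects_by_uid` index, and the example repositories are all hypothetical, and the equivalence list is passed as a plain Python list rather than read from the custom resource. The `ownerReferences` field and its `controller` and `uid` keys are standard Kubernetes metadata.

```python
from __future__ import annotations


def swap_repository(image: str, equivalents: list[str]) -> str | None:
    """Rewrite an image to the next equivalent repository, keeping its tag or digest.

    Returns None when the image matches nothing in the list or there is no alternative.
    """
    if len(equivalents) < 2:
        return None
    for i, repo in enumerate(equivalents):
        # Match a bare "repo", "repo:tag", or "repo@digest".
        if image == repo or image.startswith(repo + ":") or image.startswith(repo + "@"):
            replacement = equivalents[(i + 1) % len(equivalents)]
            return replacement + image[len(repo):]
    return None


def top_level_owner(obj: dict, objects_by_uid: dict[str, dict]) -> dict:
    """Follow controller ownerReferences upward until an object has no known controller."""
    while True:
        refs = obj.get("metadata", {}).get("ownerReferences", [])
        controller = next((r for r in refs if r.get("controller")), None)
        if controller is None or controller["uid"] not in objects_by_uid:
            return obj
        obj = objects_by_uid[controller["uid"]]


# Hypothetical objects: a pod owned by a ReplicaSet owned by a Deployment.
deployment = {"kind": "Deployment", "metadata": {"name": "app", "uid": "d-1"}}
replica_set = {
    "kind": "ReplicaSet",
    "metadata": {
        "name": "app-5d9c",
        "uid": "rs-1",
        "ownerReferences": [{"kind": "Deployment", "name": "app", "uid": "d-1", "controller": True}],
    },
}
pod = {
    "kind": "Pod",
    "metadata": {
        "name": "app-5d9c-abc12",
        "uid": "p-1",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "app-5d9c", "uid": "rs-1", "controller": True}],
    },
}
objects_by_uid = {"d-1": deployment, "rs-1": replica_set}

equivalents = ["quay.io/example/app", "ghcr.io/example/app"]
owner = top_level_owner(pod, objects_by_uid)
new_image = swap_repository("quay.io/example/app:1.0", equivalents)
print(owner["kind"], owner["metadata"]["name"], new_image)
```

Patching the top-level controller rather than the pod itself matters because a pod recreated by its ReplicaSet would otherwise come back with the original, unpullable image.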

Doesn't This Require Some Planning On My End?

Yep! You're going to have to push all your images to two repositories at the same time. You may be wondering why Fairwinds chose this approach instead of something like a pull-through cache or an in-cluster registry that mirrors all of our images. There are a couple of reasons for this:

1. We wanted to avoid running our own infrastructure to solve this issue.

We were looking for a Kubernetes-native approach that didn't require us to run long-lived, high-maintenance infrastructure.


2. We wanted to utilize existing services as much as possible.

Google Container Registry (GCR), Amazon Elastic Container Registry (ECR), Quay, and Docker Hub are all great services that experience outages relatively infrequently. With Saffire, you can still use these services; you're just not dependent on a single one.
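The planning step mentioned above, pushing every image to two repositories, is usually a small addition to a build pipeline. As a rough sketch, assuming the image has already been built locally and you are logged in to both registries, the script below generates the `docker tag` and `docker push` commands for each target. The function name `mirror_commands` and the repository names are hypothetical.

```python
from __future__ import annotations


def mirror_commands(local_image: str, repositories: list[str], tag: str) -> list[list[str]]:
    """Build the docker commands that tag and push one local image to every repository."""
    commands = []
    for repo in repositories:
        target = f"{repo}:{tag}"
        commands.append(["docker", "tag", local_image, target])
        commands.append(["docker", "push", target])
    return commands


# Hypothetical repositories in two different registries.
commands = mirror_commands(
    "app:build",
    ["quay.io/example/app", "ghcr.io/example/app"],
    "1.0",
)
for command in commands:
    # Printed rather than executed; pass each list to subprocess.run to run it.
    print(" ".join(command))
```

Using the same tag in both repositories keeps the image references interchangeable, which is what allows one to stand in for the other during an outage.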


What's With The Name?

At Fairwinds, we've taken a liking to space-themed names. When we originally named Polaris, we were stuck in the nautical theme that constantly plagues the Kubernetes community. But Polaris allowed us to bridge the gap between sea and space. All the tools we have created since are space-themed.

Saffire is a reference to a series of missions that NASA conducted to test how fires behaved on spacecraft and space stations. It seemed a fitting name for a tool that mitigates fires in our Kubernetes environments.


Hope You Enjoy It

Head on over to the GitHub repository for Saffire and let us know there if you have any questions or issues!