


Cluster Health Checkup: Reliability (Part Two)


As I mentioned in part one, a Site Reliability Engineer cares a lot about reliability in a Kubernetes environment. In most cases, reliability is synonymous with stability and provides a better user experience. Reliability becomes much easier to obtain with the right configurations. That's where Fairwinds Insights comes in.

Fairwinds Insights is an Open Source as a Service platform, integrating best-of-breed Kubernetes auditing tools that improve cluster security, workload reliability, and engineering productivity. In part one, I covered how Fairwinds Insights helps with setting resource requests and limits. Here, I'll cover autoscaling and liveness and readiness probes.


Autoscaling

Autoscaling increases cluster reliability by allowing the cluster to respond to changes in load without things “falling over” or breaking. There are two types of autoscaling that work together to provide a stable cluster.

Horizontal Pod Autoscaling (HPA)

This type of scaling allows your deployments to increase or decrease the number of replicas they run based on a metric. CPU and memory are easy metrics to use, since some cloud providers make them available automatically, but other metrics providers can be used as well. HPA is easy to configure by creating an object in your cluster; you simply specify a target for the metric, and the HPA will adjust the number of desired pods to try to keep the average of the metric at the target.
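As a sketch of what that object looks like, here is a minimal HPA targeting 75% average CPU utilization; the deployment name, replica bounds, and target value are placeholders you would replace with your own:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app          # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2        # never scale below this
  maxReplicas: 10       # never scale above this
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # HPA adds/removes pods to hold CPU near 75%
```

When average CPU across the pods rises above 75%, the HPA increases the desired replica count (up to 10); when it falls below, replicas are scaled back down (to a floor of 2).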

If configured correctly, HPA can allow your workloads to handle large swings in load or traffic, keeping your application and your cluster stable.

Cluster Autoscaling

You might be wondering what happens when the number of pods exceeds the capacity of the nodes in the cluster. This is the obvious next problem to tackle after configuring HPA correctly. To solve it, we deploy the cluster autoscaler. This controller watches for pods that cannot be scheduled and increases the number of nodes to make room for them.

Note that the cluster autoscaler will have a hard time doing its job if your resource requests are not set correctly. This is because the cluster autoscaler relies on the scheduler to know that a pod won’t fit in the current set of nodes, and it also relies on the resource request to determine whether adding a new node will make room for the pod. So as always, reliability requires good resource requests and limits first.
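To make that concrete, the scheduler (and therefore the cluster autoscaler) can only reason about a pod's footprint if each container declares its requests. A minimal container resources stanza, with illustrative values only, looks like this:

```yaml
# Fragment of a pod/deployment container spec; values are examples,
# not recommendations — size them from your workload's real usage.
resources:
  requests:
    cpu: 100m        # what the scheduler uses to decide if the pod fits a node
    memory: 128Mi
  limits:
    cpu: 500m        # hard ceilings enforced at runtime
    memory: 256Mi
```

Without the `requests` block, the cluster autoscaler cannot tell whether adding a node would actually make room for a pending pod.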

Liveness and Readiness Probes

Another important facet of cluster reliability is one that we often refer to as “self healing.” The idea here is to automatically detect when there are issues in the cluster and automatically fix those issues. This concept is built into Kubernetes in the form of liveness and readiness probes.

These probes are checks that the Kubernetes cluster will perform on your containers at a set interval. They have two states, pass and fail. There is a threshold for how many times the probe has to fail or succeed before the state is changed. Usually these are configured in the form of HTTP calls against your service, but there are other types such as TCP and exec.
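As a sketch of how the interval and thresholds are expressed, here is an HTTP-based probe; the path and port are hypothetical, and the same timing fields apply to TCP and exec probes as well:

```yaml
# Fragment of a container spec. /healthz and port 8080 are assumptions —
# use whatever endpoint your application exposes for health checks.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # wait before the first check
  periodSeconds: 15         # the interval between checks
  failureThreshold: 3       # consecutive failures before the probe is marked failed
  successThreshold: 1       # consecutive successes before it is marked passing again
```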

Liveness Probes

Let’s first talk about the more important of the two probe types, liveness. This probe indicates whether the container is running, or alive. If this probe moves into a failing state, Kubernetes will automatically kill the container and restart it according to the pod’s restart policy.

If a container does not have a liveness probe, then a faulty or non-functioning container will continue to run indefinitely, using up valuable resources and most likely causing errors in your application. This is why the liveness probe is fundamental to the proper functioning of a Kubernetes cluster.

Readiness Probes

This probe is used to indicate when a container is ready to serve traffic. If the pod is behind a Kubernetes service, the pod will not be added to the list of available endpoints in that service until all of the containers in that pod are marked as ready. This allows you to keep pods that are not healthy from serving any traffic or accepting any requests, preventing your application from exposing errors.

Self Healing Pods

These two probe types, when configured correctly on all of your containers, provide the cluster with the ability to “self heal”. Problems that arise in containers will be automatically detected and pods will be killed or taken out of service automatically. This is why we strongly recommend that all containers have both probes. To this end, Fairwinds Insights will detect any deployments that do not have both a liveness and readiness probe configured for each container in the deployment. This is factored into your reliability score in Insights.
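Putting the two together, a container spec with both probes configured might look like the following; the image, paths, ports, and timing values are all placeholders:

```yaml
# Hypothetical container with both probes. A failing liveness probe
# restarts the container; a failing readiness probe only removes the
# pod from service endpoints until it recovers.
containers:
  - name: web
    image: example/web:1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```

Note that the two probes should generally check different things: liveness answers “is this process wedged?”, while readiness answers “can this pod take traffic right now?”.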

Need help with building reliable clusters?

Try Fairwinds Insights

In Conclusion

There are many factors that need to be considered when building a stable and reliable Kubernetes cluster. I hope that this article has given you the first steps that you will need to take in that direction, and I also hope that Fairwinds Insights can help you continually improve upon these things in your clusters.