Cluster Health Checkup: Reliability

As I mentioned in part one, site reliability engineers care a great deal about reliability in Kubernetes environments. In most cases, reliability is synonymous with stability and provides a better user experience. It's easier to ensure reliability with the right configurations. That's where Fairwinds Insights comes in.

Fairwinds Insights integrates best-of-breed Kubernetes auditing tools that improve cluster security, workload reliability, and engineering productivity at scale. In part one, I covered how Fairwinds Insights helps with setting resource requests and limits. Now let's cover autoscaling and liveness and readiness probes.

Fairwinds Insights is available to use for free. You can sign up here.

Autoscaling

Autoscaling increases cluster reliability by allowing it to respond to changes in load without things “falling over” or breaking. There are two types of autoscaling that work together to provide stability for your clusters.

Horizontal Pod Autoscaling (HPA)

This type of scaling lets your deployments change the number of replicas they run based on a metric, such as CPU or memory utilization. These resource metrics are available automatically in some cloud providers, but you can use other metrics as well. You configure HPA by creating an object in your cluster: you specify a target for the metric, and the HPA adjusts the desired number of pods to keep the metric's average at the target.
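To make the "keep the average at the target" idea concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes HPA documentation. The function name and the sample numbers are illustrative assumptions, and the sketch ignores details such as minimum and maximum replica bounds, stabilization windows, and pods that are not yet ready.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Return the replica count that moves the average metric toward the target."""
    ratio = current_metric / target_metric
    # When the metric is already close to the target, leave the count alone
    # so small fluctuations do not cause constant rescaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the workload is never left short of capacity.
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 90% CPU against a 60% target
print(desired_replicas(4, 90.0, 60.0))  # prints 6
```

The same calculation scales down as well: ten pods averaging 30% against a 60% target would be reduced to five.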

If configured correctly, HPA allows your workloads to handle large swings in load or traffic, keeping your application and cluster stable.

Cluster Autoscaling

You may wonder what happens when the number of pods exceeds the capacity of the nodes in the cluster. Once HPA is configured correctly, deploy the cluster autoscaler. This controller watches for pods that cannot be scheduled and increases the number of nodes to make room for them.

The cluster autoscaler will have difficulty with this if your resource requests are not set correctly. It relies on the scheduler to detect that a pod won’t fit on the current set of nodes, and it uses the pod's resource requests to determine whether adding a new node will make room for the pod. Reliability always requires good resource requests and limits first.
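A rough Python sketch shows why resource requests matter here. It is a simplified model, not the cluster autoscaler's actual code: the `Resources` type, the function names, and the node sizes are all hypothetical, and real scheduling also considers taints, affinity, and other constraints.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    cpu_millicores: int
    memory_mib: int

def fits(request: Resources, free: Resources) -> bool:
    """A pod fits when its requests do not exceed what is still free."""
    return (request.cpu_millicores <= free.cpu_millicores
            and request.memory_mib <= free.memory_mib)

def needs_new_node(request: Resources, free_per_node: list,
                   new_node_capacity: Resources) -> bool:
    """True when no existing node can hold the pod but a fresh node could."""
    fits_somewhere = any(fits(request, free) for free in free_per_node)
    return (not fits_somewhere) and fits(request, new_node_capacity)

# Hypothetical free capacity on two existing nodes, and the size of a new node
free_now = [Resources(200, 512), Resources(500, 256)]
new_node = Resources(2000, 4096)

print(needs_new_node(Resources(500, 512), free_now, new_node))  # prints True
print(needs_new_node(Resources(0, 0), free_now, new_node))      # prints False
```

The second call is the failure mode: a pod with no requests appears to fit anywhere, so nothing signals that the cluster needs more nodes, even if the pod will use far more than is available.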

Liveness & Readiness Probes

Another important aspect of cluster reliability is often referred to as “self healing.” The idea is to detect issues in the cluster and fix them automatically. Kubernetes has this concept built in, in the form of liveness probes and readiness probes.

These probes are checks that Kubernetes performs on your containers at a set interval. Each probe has two states, pass and fail, and a threshold sets how many consecutive times the probe has to fail or succeed before the state changes. Probes are usually configured as HTTP calls against your service, but there are other types too, such as TCP and exec.
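The threshold behavior can be modeled in a few lines of Python. This is a simplified sketch rather than the kubelet's implementation; the class name is hypothetical, and the defaults of 3 failures and 1 success mirror the Kubernetes defaults for `failureThreshold` and `successThreshold`.

```python
class ProbeState:
    """Tracks pass/fail state for one probe using consecutive-result thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1, healthy=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self.streak = 0  # consecutive results that disagree with the current state

    def record(self, passed: bool) -> bool:
        """Record one probe result and return the resulting state."""
        if passed == self.healthy:
            # The result agrees with the current state, so any streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.success_threshold if passed else self.failure_threshold
        if self.streak >= needed:
            self.healthy = passed
            self.streak = 0
        return self.healthy

probe = ProbeState()
results = [False, False, True, False, False, False]
print([probe.record(r) for r in results])
# prints [True, True, True, True, True, False]
```

Two failures followed by a success do not change the state; only three failures in a row do. That is what keeps a single slow response from triggering a restart.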

Liveness Probes

Let’s explore the most important probe type, liveness probes. These probes indicate whether or not the container is running, or alive. If this probe moves into a failing state, Kubernetes automatically kills the container and restarts it according to the pod's restart policy.

If a container does not have a liveness probe, a faulty or non-functioning container can continue to run indefinitely, using valuable resources and probably causing errors in your application. This is why liveness probes are fundamental to the proper function of Kubernetes clusters.

Readiness Probes

This probe indicates whether a container is ready to serve traffic. If the pod is behind a Kubernetes service, the pod won't be added to the list of available endpoints in that service until all the containers in that pod are marked as ready. This enables you to keep pods that aren't healthy from serving traffic or accepting requests, preventing your application from exposing errors.
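On the application side, HTTP liveness and readiness probes need something to call. The Python sketch below uses only the standard library. The `/healthz` and `/readyz` paths, the port, and the `APP_STATE` flags are assumptions chosen for illustration; your probe configuration would need to point at whatever paths your service actually exposes. Kubernetes treats any HTTP status from 200 up to (but not including) 400 as a pass.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state. A real service would update these flags,
# for example setting "ready" once its caches and connections are warmed up.
APP_STATE = {"alive": True, "ready": False}

def probe_status(path: str, state: dict) -> int:
    """Map a probe path to the HTTP status the probe should see."""
    if path == "/healthz":   # liveness: is the process still functioning?
        return 200 if state["alive"] else 500
    if path == "/readyz":    # readiness: can it serve traffic right now?
        return 200 if state["ready"] else 503
    return 404

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = probe_status(self.path, APP_STATE)
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok\n" if status == 200 else b"fail\n")

def make_server(port: int = 8080) -> HTTPServer:
    """Build the probe server; the caller decides when to start it."""
    return HTTPServer(("", port), ProbeHandler)
```

Calling `make_server().serve_forever()` would start the server. Keeping the two endpoints separate matters: a container that is alive but still warming up should fail readiness without being restarted.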

Self Healing Pods

These two probe types, when configured correctly on all of your containers, provide the cluster with the ability to “self heal.” Problems that arise in containers are detected automatically, and the affected containers are restarted or their pods are taken out of service. This is why we strongly recommend that all containers have both probes. To that end, Fairwinds Insights detects any deployments that don't have both a liveness and readiness probe configured for each container in the deployment, which is factored into the reliability score in Insights.
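To give a feel for what such a check involves, here is a small Python sketch that scans a deployment manifest for containers missing either probe. This is only an illustration and not how Fairwinds Insights implements its check; the function name and the sample manifest are hypothetical, while `livenessProbe` and `readinessProbe` are the standard Kubernetes container fields.

```python
def containers_missing_probes(deployment: dict) -> dict:
    """Return {container name: [missing probe fields]} for a deployment manifest."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = {}
    for container in containers:
        absent = [field for field in ("livenessProbe", "readinessProbe")
                  if field not in container]
        if absent:
            missing[container["name"]] = absent
    return missing

# Hypothetical manifest, already parsed into a dict
deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "web",
         "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
         "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8080}}},
        {"name": "sidecar",
         "livenessProbe": {"tcpSocket": {"port": 9000}}},
    ]}}}
}

print(containers_missing_probes(deployment))
# prints {'sidecar': ['readinessProbe']}
```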

Building a Stable, Reliable Kubernetes Cluster

There are a lot of factors to consider as you seek to build a stable and reliable Kubernetes cluster. This article provides the first steps to take in that direction. Our Managed Kubernetes is a people-led service that can help you by architecting, building, and managing Kubernetes, and Fairwinds Insights allows you to manage that in house, so you can continually improve reliability, cost efficiency, and security in your clusters.
