Cluster Health Checkup: Reliability

A Site Reliability Engineer cares a lot about reliability in a Kubernetes environment, as you might have guessed from the job title. Reliability is important to a software organization because it provides stability and generally a better user experience. In addition, it makes the jobs of the operators and developers that work in the organization easier.

In a Kubernetes environment, reliability can be easy to obtain, but if things are configured incorrectly, it becomes a hard-to-reach goal. At Fairwinds, we do a lot of things to ensure that our Kubernetes clusters are reliable and stable for our customers. This can involve recommendations for application changes as well as changes to cluster configuration. In these next few articles, I will outline a few of these practices and how Fairwinds Insights might be able to help.

Resource Requests and Limits

Resource requests and limits for CPU and memory are the heart of what allows the Kubernetes scheduler to do its job well.

When the scheduler decides which node a pod should run on, it looks at the pod's resource requests and places the pod on a node that has enough unreserved capacity to satisfy them. This ensures that the pod can get the resources its workload needs. If a resource request is not set, or is set incorrectly, the pod can end up on a node that is already out of resources, and the workload will not run as efficiently as possible (or it may not run at all).
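To make the placement idea concrete, here is a minimal sketch in Python. It is a deliberately simplified model, not the real Kubernetes scheduler (which also filters and scores nodes on many other criteria). The `Pod`, `Node`, and `pick_node` names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_request_m: int    # CPU request in millicores (1000m = 1 core)
    mem_request_mib: int  # memory request in MiB


@dataclass
class Node:
    name: str
    cpu_allocatable_m: int
    mem_allocatable_mib: int
    pods: list = field(default_factory=list)

    def free_cpu_m(self) -> int:
        # Capacity not yet reserved by the requests of pods already placed here.
        return self.cpu_allocatable_m - sum(p.cpu_request_m for p in self.pods)

    def free_mem_mib(self) -> int:
        return self.mem_allocatable_mib - sum(p.mem_request_mib for p in self.pods)

    def fits(self, pod: Pod) -> bool:
        return (pod.cpu_request_m <= self.free_cpu_m()
                and pod.mem_request_mib <= self.free_mem_mib())


def pick_node(pod, nodes):
    """Return the first node whose unreserved capacity covers the pod's requests."""
    for node in nodes:
        if node.fits(pod):
            return node
    return None


# Hypothetical cluster: node-a is almost fully reserved, node-b is empty.
nodes = [
    Node("node-a", 2000, 4096, pods=[Pod("busy", 1800, 3500)]),
    Node("node-b", 2000, 4096),
]
with_requests = pick_node(Pod("api", 500, 512), nodes)
without_requests = pick_node(Pod("no-requests", 0, 0), nodes)
print(with_requests.name, without_requests.name)
```

In this toy model, the pod that declares its requests skips the crowded node, while the pod with no requests "fits" on it. That is the failure mode described above: without requests, the placement decision has nothing to work with.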

Limits keep a pod from interfering with other pods by consuming all of the available resources on a node. This is referred to as the “noisy neighbor problem.” If a single pod is allowed to consume all of the node's CPU and memory, then other pods will have no resources available. If you set limits on what a pod can consume, then you will prevent this from happening.
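As a rough illustration, the sketch below scans a pod spec for containers that are missing CPU or memory requests or limits. The spec is written as a Python dictionary that mirrors the `containers[].resources.requests` and `containers[].resources.limits` fields of a Kubernetes pod spec. The function name, container names, and values are hypothetical.

```python
def missing_resource_settings(pod_spec):
    """Return (container, field) pairs for unset CPU/memory requests or limits."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for resource in ("cpu", "memory"):
                if resource not in values:
                    problems.append((container["name"], f"{section}.{resource}"))
    return problems


# Hypothetical pod spec: "web" is fully specified, "sidecar" is not.
pod_spec = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        {
            "name": "sidecar",
            "resources": {"requests": {"cpu": "50m"}},
        },
    ]
}

for name, missing in missing_resource_settings(pod_spec):
    print(f"{name}: {missing} is not set")
```

A check like this only confirms that the settings exist. Whether the values are sensible is a separate question that depends on how the workload actually behaves.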

Need help with resource requests and limits?

Fairwinds Insights can help you with these settings using an open-source tool called Goldilocks. Goldilocks uses the recommendations provided by the Vertical Pod Autoscaler (VPA) to create action items in Insights that suggest new resource requests and limits. It does this by creating a VPA object, which in turn looks at the historical memory and CPU usage of your workloads to determine a recommendation.
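The sketch below shows one way a VPA recommendation could be turned into suggested settings. The `containerRecommendations` entries with `lowerBound`, `target`, and `upperBound` follow the shape of a VPA object's status. The mapping itself (target for both values in a "guaranteed" style, lower and upper bounds in a "burstable" style) is an assumption made for illustration and is not presented as how Goldilocks or Insights computes action items. The function name and sample values are hypothetical.

```python
def suggest_resources(vpa, qos="guaranteed"):
    """Map VPA container recommendations to suggested requests and limits.

    Assumed mapping, for illustration only:
      guaranteed -> requests and limits both use the VPA target
      burstable  -> requests use lowerBound, limits use upperBound
    """
    recs = (vpa.get("status", {})
               .get("recommendation", {})
               .get("containerRecommendations", []))
    suggestions = {}
    for rec in recs:
        if qos == "guaranteed":
            requests, limits = rec["target"], rec["target"]
        else:
            requests, limits = rec["lowerBound"], rec["upperBound"]
        suggestions[rec["containerName"]] = {
            "requests": dict(requests),
            "limits": dict(limits),
        }
    return suggestions


# Hypothetical VPA object with made-up recommendation values.
vpa = {
    "status": {
        "recommendation": {
            "containerRecommendations": [
                {
                    "containerName": "app",
                    "lowerBound": {"cpu": "25m", "memory": "64Mi"},
                    "target": {"cpu": "100m", "memory": "128Mi"},
                    "upperBound": {"cpu": "400m", "memory": "512Mi"},
                }
            ]
        }
    }
}

print(suggest_resources(vpa))
print(suggest_resources(vpa, qos="burstable"))
```

Either way, the recommendation is only as good as the usage history behind it, so it is worth letting a workload run under realistic load before acting on the numbers.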

Interested in using Fairwinds Insights? It’s available for free! Learn more here.

Once you have good resource requests and limits set, you can move on to enabling autoscaling for your application and your cluster. Next up, we'll talk about how Fairwinds Insights can help with autoscaling.