Building a Strong Reliability Foundation in Kubernetes: From Crawl to Run

In the world of modern application development, Kubernetes is the de facto container orchestration platform. It helps platform and development teams manage applications and services in distributed environments reliably and at scale. However, Kubernetes does not make your workloads stable and highly available on its own; you have to configure it to do so. So, how can you build a strong reliability foundation in Kubernetes, starting from the basics and moving to more advanced strategies?

Foundation — Basic Configuration (Crawl)

Let's start with the basics of building a strong foundation. These are the first things that you should configure to increase the reliability of your workloads in Kubernetes.

Liveness Probes

A liveness probe is how Kubernetes determines whether the application inside a container is still running. When a liveness probe fails repeatedly, Kubernetes restarts the failing container, which can recover an app that is deadlocked or otherwise stuck.
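As a minimal sketch of what this looks like in practice, the snippet below builds a container spec with an HTTP liveness probe and prints it as JSON, which Kubernetes accepts alongside YAML. The container name, image, port, /healthz path, and timing values are all assumptions chosen for illustration, not recommendations.

```python
import json

# Hypothetical container spec: the name, image, port, and /healthz path
# are placeholders for illustration only.
container = {
    "name": "web",
    "image": "example.com/web:1.0",
    "ports": [{"containerPort": 8080}],
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "initialDelaySeconds": 10,  # wait before the first check
        "periodSeconds": 10,        # probe every 10 seconds
        "failureThreshold": 3,      # restart after 3 consecutive failures
    },
}

print(json.dumps(container, indent=2))
```

With these values, a container that stops answering would be restarted after roughly three failed checks in a row. Tune the numbers to how long your app takes to start and respond.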

Readiness Probes

While liveness probes determine whether the application is running, readiness probes determine whether a container is ready to serve traffic. If a container fails its readiness probe, Kubernetes removes the pod from the Service's endpoints, so it stops receiving traffic until it passes the probe again. Using readiness probes where appropriate helps ensure that traffic is only directed to healthy, responsive pods.
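The difference between the two probes is easiest to see from the application's side. The sketch below assumes a hypothetical app that exposes /healthz for liveness and /ready for readiness; the paths, port, and the dependencies_ready flag are illustrative names, not part of any Kubernetes API. For HTTP probes, any status from 200 up to (but not including) 400 counts as success.

```python
def probe_response(path: str, dependencies_ready: bool) -> int:
    """Return the HTTP status a hypothetical app gives to each probe path."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200
    if path == "/ready":
        # Readiness: succeed only once dependencies (cache, database
        # connections, and so on) are usable.
        return 200 if dependencies_ready else 503
    return 404


# Matching probe configuration for the container spec (values are assumptions).
readiness_probe = {
    "httpGet": {"path": "/ready", "port": 8080},
    "periodSeconds": 5,
    "failureThreshold": 2,
}

print(probe_response("/healthz", dependencies_ready=False))  # 200: alive
print(probe_response("/ready", dependencies_ready=False))    # 503: not ready yet
```

During startup, this app would be considered alive but not ready, so it is left running while traffic is held back.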

Resource Requests/Limits

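Resource requests tell the Kubernetes scheduler the minimum CPU and memory a container needs, and the scheduler uses them to decide where to place pods. Limits cap what a container may consume at runtime: a container that exceeds its CPU limit is throttled, and one that exceeds its memory limit can be terminated. Set correctly, the two together prevent the resource starvation and overconsumption that can hurt overall application reliability.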
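The sketch below shows a resources block and a small sanity check that the CPU request does not exceed the CPU limit, which Kubernetes would reject. The specific quantities are assumptions, and cpu_to_millicores is a hypothetical helper that handles only the two common CPU notations ("250m" and plain numbers such as "0.5").

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


# Illustrative values only; size these from observed usage.
resources = {
    "requests": {"cpu": "250m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}

request_m = cpu_to_millicores(resources["requests"]["cpu"])
limit_m = cpu_to_millicores(resources["limits"]["cpu"])

if request_m > limit_m:
    raise ValueError("CPU request must not exceed the CPU limit")

print(f"request={request_m}m limit={limit_m}m")
```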

Multiple Replicas of Each Workload

Kubernetes provides the ability to run multiple replicas of each workload, enhancing both availability and reliability. In case one of the instances fails, traffic can be routed to the remaining instances.
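For illustration, here is a minimal Deployment manifest with three replicas, built as a Python dictionary and printed as JSON. The names, labels, and image are placeholders; the part that matters is spec.replicas and the selector that ties the Deployment to its pods.

```python
import json

labels = {"app": "web"}  # hypothetical label

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,  # run three copies so one failure does not take the app down
        "selector": {"matchLabels": labels},
        "template": {
            "metadata": {"labels": labels},
            "spec": {
                "containers": [
                    {"name": "web", "image": "example.com/web:1.0"}
                ]
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

The selector's matchLabels must match the pod template's labels, which is why the sketch reuses one dictionary for both.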

Next Steps (Walk)

Once you've set the foundation in the previous steps, you can start implementing some more advanced strategies.

Horizontal Pod Autoscaling (HPA)

HPA adjusts the number of pod replicas based on observed CPU or memory utilization, or on custom metrics that you define. You configure it by creating a HorizontalPodAutoscaler object in your cluster: specify a target for the metric, and the HPA adjusts the desired number of pods to keep the metric's average at that target. This allows your application to handle spikes in load or traffic more efficiently, making it more reliable. CPU and memory may not be the best metrics to scale on, but they are a simple and quick way to start autoscaling your workloads.
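The core of the scaling decision can be sketched in a few lines. The function below follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), and skips scaling when the ratio is within a tolerance of 1.0 (0.1 by default). It is a simplification: the real controller also applies minReplicas and maxReplicas, stabilization windows, and handling for pods that are not ready.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation; ignores min/max bounds and stabilization."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas averaging 90% CPU against a 60% target
print(desired_replicas(4, 90, 60))  # -> 6
# 4 replicas averaging 63% CPU: within tolerance, no change
print(desired_replicas(4, 63, 60))  # -> 4
```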

Basic Cluster Autoscaling

Just as HPA scales the number of pods, cluster autoscaling scales the number of nodes in the cluster. The cluster autoscaler is a controller that watches for pods that cannot be scheduled and adds nodes so that they can be. When the number of pods decreases, it can also remove underused nodes. The result is a cluster that changes size dynamically to accommodate your pods. Note that this works best alongside HPA and requires that you set resource requests, because the autoscaler bases its decisions on requests rather than on actual usage.
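To make the "unschedulable pod" trigger concrete, here is a toy model, not the cluster autoscaler's actual logic. It considers only CPU requests, places pods first-fit onto nodes with a given amount of free CPU, and returns the requests that fit nowhere. Those leftover pods are the ones that would sit in Pending and prompt a scale-up. The function name and inputs are hypothetical.

```python
def unschedulable_pods(node_free_cpu_m, pod_requests_m):
    """Toy first-fit placement on CPU requests (millicores) only."""
    free = list(node_free_cpu_m)  # copy so the caller's list is untouched
    pending = []
    for request in pod_requests_m:
        for i, capacity in enumerate(free):
            if request <= capacity:
                free[i] -= request  # reserve the requested CPU on this node
                break
        else:
            # No node had room: this pod would stay Pending.
            pending.append(request)
    return pending


# Two nodes with 500m and 300m free; three pods requesting 400m, 400m, 200m.
print(unschedulable_pods([500, 300], [400, 400, 200]))  # -> [400]
```

The example also shows why requests matter: without them, the scheduler and autoscaler have nothing to reason about.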

Monitoring

Monitoring pod metrics, such as CPU throttling, memory usage, and network I/O, helps you identify issues before they impact the reliability of your applications. Prometheus is a popular open-source monitoring solution that is widely used with Kubernetes. Additionally, there are many commercial products that provide all the monitoring and metrics you might need.
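As one example of turning raw metrics into a signal, CPU throttling is commonly derived from two cAdvisor counters, container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. The sketch below computes the throttled fraction between two samples of those counters; the function name and sample values are made up for illustration, and in practice you would usually express this as a query in your monitoring system.

```python
def throttle_ratio(periods_start: int, periods_end: int,
                   throttled_start: int, throttled_end: int) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    periods = periods_end - periods_start
    if periods <= 0:
        # No scheduling periods elapsed (or the counter reset): nothing to report.
        return 0.0
    return (throttled_end - throttled_start) / periods


# Hypothetical counter samples taken a few minutes apart.
print(throttle_ratio(1000, 1600, 50, 200))  # -> 0.25
```

A container throttled in a quarter of its periods is a hint that its CPU limit may be too low for its workload.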

Best Things (Run)

Finally, it's time to make sure everything is running as smoothly as possible by applying these strategies for maximizing reliability in Kubernetes.

Horizontal Pod Autoscaling (HPA) on "Golden Signals"

The four golden signals are latency, traffic, errors, and saturation. By scaling on one or more of these metrics, your applications are better able to handle variations in load, which can significantly increase overall reliability. Using latency, traffic, or another user-facing metric for your HPAs lets you scale the number of pods more accurately to what is needed to serve traffic. Because these are not built-in resource metrics, the HPA needs them exposed through the custom or external metrics API, typically by a metrics adapter.
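A sketch of an HPA that scales on traffic rather than CPU might look like the following. It uses the autoscaling/v2 API with a Pods metric. The metric name http_requests_per_second, the target of 100 requests per second per pod, and the replica bounds are assumptions; the metric only exists if something in your cluster publishes it through the custom metrics API.

```python
import json

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "web"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "web",  # hypothetical Deployment name
        },
        "minReplicas": 3,
        "maxReplicas": 20,
        "metrics": [
            {
                "type": "Pods",
                "pods": {
                    # Assumed custom metric; must be served by a metrics adapter.
                    "metric": {"name": "http_requests_per_second"},
                    "target": {"type": "AverageValue", "averageValue": "100"},
                },
            }
        ],
    },
}

print(json.dumps(hpa, indent=2))
```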

Intelligent Autoscaling (such as Karpenter)

Karpenter is an open-source node autoscaler originally built by Amazon Web Services (AWS) to be more flexible and responsive than the Kubernetes cluster-autoscaler. It evaluates the specific needs of each pending pod, including resource requests and scheduling constraints, and provisions nodes to match, which can make it more effective at maintaining reliability than the cluster-autoscaler that most organizations have been using. Karpenter can also use multiple instance types more efficiently than cluster-autoscaler, providing better bin-packing (assuming your resource requests are set correctly).
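The idea of choosing among instance types can be illustrated with a toy selection function. This is not Karpenter's algorithm: it simply picks the cheapest type from a hypothetical catalog that can hold the combined requests of the pending pods, and it ignores scheduling constraints, per-node overhead, and availability. The instance names, sizes, and prices are invented.

```python
from typing import NamedTuple, Optional, Sequence


class InstanceType(NamedTuple):
    name: str
    cpu_m: int          # allocatable CPU in millicores
    memory_mi: int      # allocatable memory in MiB
    hourly_price: float


def cheapest_fit(types: Sequence[InstanceType],
                 cpu_m: int, memory_mi: int) -> Optional[InstanceType]:
    """Return the cheapest type that fits the requested CPU and memory."""
    candidates = [t for t in types
                  if t.cpu_m >= cpu_m and t.memory_mi >= memory_mi]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


# Hypothetical catalog, for illustration only.
catalog = [
    InstanceType("small", 2000, 4096, 0.05),
    InstanceType("medium", 4000, 8192, 0.10),
    InstanceType("large", 8000, 16384, 0.20),
]

choice = cheapest_fit(catalog, cpu_m=3000, memory_mi=6000)
print(choice.name if choice else "no single instance type fits")  # -> medium
```

Note how the result depends entirely on the requested CPU and memory: if requests are inflated or missing, the chosen node size will be wrong too.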

Application Performance Monitoring

Application performance monitoring, or APM, moves beyond basic metrics to provide more in-depth insight into how your applications are performing. By monitoring application-level behavior, from request latency to database query performance, you can proactively identify and address issues, which helps you improve both application performance and reliability.
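One small example of the kind of insight this gives: request latency is usually tracked as percentiles rather than averages, because a few slow requests can hide behind a healthy-looking mean. The sketch below computes nearest-rank percentiles over a made-up set of latency samples; APM tools do this for you, and the function here is only an illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 17]

print(percentile(latencies_ms, 50))  # -> 14
print(percentile(latencies_ms, 95))  # -> 250
```

Here the median looks fine, while the 95th percentile exposes the slow request that some users actually experienced.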

Build Your Kubernetes Reliability Foundation

Kubernetes itself and the broader open-source community offer many ways to enhance the reliability of your applications. By starting with a strong foundation and building up to more advanced strategies, you can make your applications as reliable as possible in Kubernetes environments.

Building a strong reliability foundation in Kubernetes is a process of continuous improvement, both as you learn more about your own environment and as Kubernetes itself matures. As you move through these stages, you'll find that Kubernetes provides an incredibly flexible and powerful platform for running your applications reliably, even though it comes with challenges.

The Fairwinds Insights free tier provides Kubernetes guardrails to help you take control over your environment and make it more reliable, secure, and cost efficient.