Kubernetes Best Practices for Reliability

Many Kubernetes adopters try to carry over all of the infrastructure automation tooling that predates cloud native applications and Kubernetes, such as Puppet, Chef, Ansible, Packer, and Terraform. Using these tools in a non-cloud native manner does not yield the best results. For example, using a configuration management tool to build container images adds unnecessary time and complexity to your application deployments, which can cost you agility and the ability to recover.

Kubernetes reliability becomes much easier to achieve with the right configuration. The degree to which you use pre-Kubernetes tools, and the way you use them, will likely change as you adopt containers, Kubernetes, and cloud native architecture.


Reliability in a Kubernetes environment is synonymous with stability, streamlined development and operations, and a better user experience.

In Kubernetes, it’s easy to configure things incorrectly. Deploy the right configuration to ensure stability, streamlined development and operations, and a better user experience.


In a Kubernetes environment, reliability becomes much easier to achieve with the right configuration. Here I suggest four Kubernetes best practice tips for increased reliability. You can read my complete recommendations here.

Tip 1: Embrace Ephemeral

Use cloud native architecture to help embrace the ephemeral nature of containers and Kubernetes pods (running instances of your application containers). Two examples:

Use service discovery to help users and other applications reach your application. The number of Kubernetes pods (running containers) changes as your application scales to meet demand, and service discovery provides a means of accessing those pods without needing to know their location.
Instead of attempting to modify an existing container, abstract your application configuration from its container image, and build and deploy a new container image through your CI pipeline. Containers are ephemeral, and running configuration management software inside application containers adds unnecessary complexity and overhead. For additional information about separating application configuration from container images, see our How to Kube video.
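
To make both examples concrete, here is a minimal Python sketch of an application that takes its configuration from environment variables and reaches a dependency through a Service DNS name rather than a pod address. The service name (orders), namespace (shop), port, and variable names (LOG_LEVEL, ORDERS_URL) are hypothetical, and the cluster domain is assumed to be the common default, cluster.local.

```python
import os


def service_url(service, namespace="default", port=80, cluster_domain="cluster.local"):
    """Build the in-cluster DNS URL of a Kubernetes Service.

    The URL stays stable while the pods behind the Service come and go.
    cluster_domain is an assumption; clusters can be set up with another one.
    """
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}"


def load_config(env=None):
    """Read settings from the environment instead of baking them into the image."""
    env = os.environ if env is None else env
    return {
        # Hypothetical setting names, with defaults for local runs.
        "log_level": env.get("LOG_LEVEL", "info"),
        "orders_url": env.get(
            "ORDERS_URL", service_url("orders", namespace="shop", port=8080)
        ),
    }
```

Because nothing in the image is specific to one environment, the same image can be promoted from staging to production by changing only the values injected at deploy time.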

Tip 2: Avoid Single Points of Failure

Kubernetes helps improve reliability by providing redundant components and by making it possible to schedule application containers across multiple nodes and multiple availability zones (AZs) in the cloud. Use anti-affinity or node selection to help spread your applications across the Kubernetes cluster for high availability.

Node selection allows you to define which nodes in your cluster are eligible to run your application based on labels. The labels typically represent node characteristics like bandwidth or special resources like GPUs.

Anti-affinity lets you further constrain where your application is not allowed to run, based on labels. With pod anti-affinity, for example, the scheduler looks at the labels of pods already running on a node, which lets you keep replicas of your application off the same node, or keep them away from nodes that already run other components of the same application. Read some more fault tolerance advice here.
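
As a rough illustration, the sketch below builds a Deployment manifest as a Python dictionary that combines both techniques. The function name, the app label, and the disktype=ssd node label are hypothetical. The manifest field names come from the Kubernetes API, and kubectl accepts the JSON form as well as YAML.

```python
import json


def spread_deployment(name, image, replicas=3):
    """Return a Deployment manifest that spreads replicas across nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Node selection: only nodes carrying this (hypothetical)
                    # label are eligible to run the pods.
                    "nodeSelector": {"disktype": "ssd"},
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never place two pods with the same
                            # app label on the same node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


def to_json(manifest):
    """Serialize a manifest so it can be saved to a file for kubectl."""
    return json.dumps(manifest, indent=2)
```

Changing topologyKey to topology.kubernetes.io/zone would apply the same rule per availability zone instead of per node. With a hard rule like this one, replicas that cannot be placed stay pending, so the replica count should not exceed the number of eligible nodes.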

Tip 3: Set Resource Requests and Limits

Resource requests and limits for CPU and memory are at the heart of what allows the Kubernetes scheduler to do its job. If a single pod is allowed to consume all of a node's CPU and memory, other pods, and potentially Kubernetes components, will be starved of resources. Setting limits on a pod's consumption increases reliability by keeping pods from consuming all of the available resources on a node (this is referred to as the “noisy neighbor problem”).
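
The following Python sketch shows what a container spec with requests and limits might look like, along with a small sanity check that each request does not exceed its limit. The specific values are placeholders to tune per workload, and the quantity parsers are deliberately simplified: they handle only plain numbers, the m suffix for CPU, and the Ki, Mi, and Gi suffixes for memory.

```python
def container_with_resources(name, image):
    """Return a container spec with CPU and memory requests and limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            # Requests: what the scheduler reserves when placing the pod.
            "requests": {"cpu": "250m", "memory": "128Mi"},
            # Limits: the ceiling the container may not exceed at runtime.
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
    }


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as '128Mi' to bytes (simplified)."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def requests_within_limits(resources):
    """Check that each request is no larger than the matching limit."""
    req, lim = resources["requests"], resources["limits"]
    return (
        cpu_millicores(req["cpu"]) <= cpu_millicores(lim["cpu"])
        and memory_bytes(req["memory"]) <= memory_bytes(lim["memory"])
    )
```

A check like this could run in a CI pipeline before manifests are applied, so that a request accidentally set above its limit is caught early.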

Tip 4: Use Liveness and Readiness Probes

By default, Kubernetes will begin sending traffic to application containers as soon as they start, whether or not the application inside is ready. Increase the robustness of your application by setting health checks that tell Kubernetes when your application pods are ready to receive traffic (readiness probes) or if they have become unresponsive (liveness probes). See my advice for setting these probes.
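
As an illustration, the Python sketch below builds a container spec with both probe types. The /ready and /healthz paths, the port, and the timing values are assumptions. Your application has to serve those endpoints, and the delays should reflect how long it really takes to start.

```python
def container_with_probes(name, image, port=8080):
    """Return a container spec with HTTP readiness and liveness probes."""
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Readiness: the pod receives Service traffic only while this passes.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "initialDelaySeconds": 5,
            "periodSeconds": 10,
        },
        # Liveness: the container is restarted after repeated failures.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 15,
            "periodSeconds": 20,
            "failureThreshold": 3,
        },
    }
```

Keeping the two endpoints separate lets a pod report that it is alive but temporarily not ready, for example while it warms a cache, without being restarted.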

If you want a more in-depth analysis of building reliable clusters, check out our Kubernetes Best Practices. You can also read more about best practices for security and efficiency.