2023 Benchmark Kubernetes Report: The State of Kubernetes Workload Reliability

Organizations continue to move more applications and services to the cloud. Despite economic uncertainty, most organizations anticipate that their cloud usage and spend will be the same as or higher than planned in the year ahead. According to Flexera’s 2023 State of the Cloud report, only 10% of respondents expect their usage and spend to be somewhat or significantly lower than planned. As they work to control high cloud costs, organizations must also ensure Kubernetes workload reliability. After all, keeping your costs low is just as important as keeping your users happy. As organizations move ever more production workloads to Kubernetes, it is important to understand both how to secure all aspects of Kubernetes and how to track and monitor workload security over time.

Using data from over 150,000 workloads and hundreds of organizations, Fairwinds created the 2023 Kubernetes Benchmark Report to analyze trends in 2022 and compare the data to the previous year’s benchmark. A CNCF report indicated recently that 96% of respondents were using or evaluating Kubernetes, but aligning to Kubernetes best practices can still be a challenging task for organizations large and small. That lack of alignment can result in some real consequences: greater security risks, uncontrolled cloud costs, and decreased reliability of cloud apps and services. Let’s walk through the six areas in the benchmark related to reliability.

Missing Memory Limits & Memory Requests

Kubernetes best practices state that resource limits and requests should always be set on your workloads, but it can be difficult to know what values to use for each application. This leads to two problematic outcomes: some teams never set requests or limits at all, while others set them too high during initial testing and never return to adjust them. The 2021 benchmark report showed that 41% of organizations had set memory requests and limits for over 90% of their workloads. In 2022, that number is down to 17%, less than half of the previous year’s figure. Overall, more workloads are impacted by missing memory limits than in the previous year. There are two likely causes, which may occur together: developers and DevOps teams don’t know what limits to set, and Kubernetes consumption is growing while visibility into configurations isn’t keeping pace. To ensure that scaling actions work properly in your Kubernetes cluster, dial in the memory limits and requests on each pod so that workloads run efficiently. When you set your memory limits and requests appropriately, your applications on Kubernetes run as efficiently and reliably as possible.
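One way to regain visibility is to scan your manifests for containers that lack memory settings. The following is a minimal Python sketch, not the tooling used for the benchmark. It assumes a pod spec that has already been parsed into a dictionary (for example, from YAML); the function name, container names, and sample values are hypothetical.

```python
def containers_missing_memory(pod_spec):
    """Return {container name: [missing fields]} for a parsed pod spec."""
    findings = {}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        # A container is flagged for each of requests/limits lacking "memory".
        missing = [
            field
            for field in ("requests", "limits")
            if "memory" not in (resources.get(field) or {})
        ]
        if missing:
            findings[container["name"]] = missing
    return findings


# Hypothetical pod spec: "api" is fully configured, "sidecar" has no limit.
pod_spec = {
    "containers": [
        {
            "name": "api",
            "image": "example/api:1.0",
            "resources": {
                "requests": {"memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            },
        },
        {
            "name": "sidecar",
            "image": "example/sidecar:1.0",
            "resources": {"requests": {"memory": "64Mi"}},
        },
    ]
}

print(containers_missing_memory(pod_spec))  # {'sidecar': ['limits']}
```

A check like this only tells you that a value is absent. Choosing the right value still requires observing how much memory the application actually uses.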

Missing Liveness and Readiness Probes

At the most basic level, a liveness probe lets Kubernetes determine whether a container is still running properly, while a readiness probe indicates whether a container is ready to receive traffic. Probes run periodically to monitor the health of an application. If a liveness probe moves into a failing state, Kubernetes restarts the container automatically, which can restore your service to an operational state. You need a liveness probe in each container in the pod; otherwise, a faulty or non-functioning pod may keep running indefinitely. This uses up valuable resources and can even cause application errors. In the latest benchmark report, 83% of organizations were not setting liveness or readiness probes for more than 10% of workloads. In comparison, the previous benchmark showed that 65% were not setting liveness or readiness probes for more than 10% of workloads. The problem is getting worse, not better.
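To make the probe settings concrete, here is a small Python sketch that adds HTTP probes to a container definition held as a dictionary and reports which probes are missing. The field names (`livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are standard Kubernetes container fields. The helper names, the `/healthz` path, port 8080, and the timing values are assumptions chosen for illustration.

```python
def http_probe(path, port, period_seconds=10, failure_threshold=3):
    """Build a fresh HTTP probe definition."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def missing_probes(container):
    """List which of the two probes the container does not define."""
    return [k for k in ("livenessProbe", "readinessProbe") if k not in container]


def with_probes(container, path="/healthz", port=8080):
    """Return a copy of the container with any missing probes filled in."""
    patched = dict(container)
    for key in missing_probes(container):
        # Liveness failures lead to a container restart; readiness failures
        # only take the pod out of Service traffic until it recovers.
        patched[key] = http_probe(path, port)
    return patched


# Hypothetical container with no probes configured.
container = {"name": "api", "image": "example/api:1.0"}

print(missing_probes(container))               # ['livenessProbe', 'readinessProbe']
print(missing_probes(with_probes(container)))  # []
```

In practice the two probes often point at different endpoints, because "the process is alive" and "the process can serve requests" are different questions.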

Pull Policy Not Always

Relying on cached versions of a container image can become a reliability issue. When the pull policy is IfNotPresent rather than Always, an image is pulled only if it isn’t already cached on the node attempting to run it. The problem is that this can cause variations in the images running on each node. It could even provide a way to gain access to an image without having direct access to the ImagePullSecret. This year, the number of impacted workloads increased. Twenty-five percent of organizations relied on cached images for nearly all their workloads, an increase from 15% in the previous benchmark. This negatively impacts the reliability of applications.
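The pull policy that actually applies is not always obvious, because Kubernetes fills in a default when `imagePullPolicy` is omitted: `Always` if the image has no tag or uses `:latest`, and `IfNotPresent` otherwise. The Python sketch below mirrors that defaulting so you can see which containers would rely on cached images. It is an illustration only, not the benchmark’s check, and the function names and image names are hypothetical.

```python
def effective_pull_policy(container):
    """Approximate the pull policy Kubernetes applies to a container."""
    explicit = container.get("imagePullPolicy")
    if explicit:
        return explicit
    image = container["image"]
    if "@" in image:
        # Image referenced by digest: cached copies are reused.
        return "IfNotPresent"
    # Look for a tag only in the last path segment, so a registry port
    # such as "registry:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else "latest"
    return "Always" if tag == "latest" else "IfNotPresent"


def relies_on_cache(container):
    """True when the container may start from a node-local cached image."""
    return effective_pull_policy(container) != "Always"


print(effective_pull_policy({"image": "nginx:1.25"}))   # IfNotPresent
print(effective_pull_policy({"image": "nginx"}))        # Always
print(relies_on_cache({"image": "nginx:1.25", "imagePullPolicy": "Always"}))  # False
```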

Deployment Missing Replicas

This year’s benchmark data included a new check for deployments that have only a single replica, which can also negatively impact reliability. Frequently, workloads were not configured to use multiple replicas. According to the benchmark, 25% of organizations are running over half their workloads with only a single replica. Deploying multiple replicas can help organizations protect the stability and high availability of their containers. If a node crashes, a Deployment with a replica count of 1 will still replace its pod; however, until the replacement is running there will be 0 replicas, which may cause the application to be unavailable.
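A single-replica check is simple to approximate yourself. The Python sketch below walks a list of parsed manifests and flags Deployments with fewer than two replicas, treating an omitted `spec.replicas` as 1, which is the Kubernetes default. The function name and the sample manifests are hypothetical.

```python
def single_replica_deployments(manifests):
    """Return names of Deployments that run fewer than two replicas."""
    flagged = []
    for manifest in manifests:
        if manifest.get("kind") != "Deployment":
            continue
        # Kubernetes defaults spec.replicas to 1 when it is omitted.
        replicas = manifest.get("spec", {}).get("replicas", 1)
        if replicas < 2:
            flagged.append(manifest["metadata"]["name"])
    return flagged


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "web"}, "spec": {"replicas": 3}},
    {"kind": "Deployment", "metadata": {"name": "worker"}, "spec": {"replicas": 1}},
    {"kind": "Deployment", "metadata": {"name": "cron-ui"}, "spec": {}},
    {"kind": "Service", "metadata": {"name": "web"}, "spec": {}},
]

print(single_replica_deployments(manifests))  # ['worker', 'cron-ui']
```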

CPU Limits Missing

In the benchmark report based on data from 2021, an impressive 36% of organizations were missing CPU limits on fewer than 10% of their workloads. Unfortunately, the number of impacted workloads increased across the board in 2022. The percentage of organizations that had more than 10% of workloads impacted rose from 64% to 86%. It’s important to specify CPU limits; otherwise, the container has no upper bound on how much CPU it can consume. A CPU-intensive container can exhaust all the CPU available on the node and slow down other workloads, negatively impacting reliability.
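To see where your own organization would fall relative to the 10% threshold, you can compute the share of workloads with a missing CPU limit. The Python sketch below assumes that a workload counts as impacted when any of its containers lacks `resources.limits.cpu`; how the benchmark itself counts workloads is not stated here, so treat that rule, the function names, and the sample data as assumptions.

```python
def percent_missing_cpu_limits(workloads):
    """Return the percentage of workloads with a container lacking a CPU limit."""
    if not workloads:
        return 0.0
    impacted = 0
    for pod_spec in workloads:
        for container in pod_spec.get("containers", []):
            limits = (container.get("resources") or {}).get("limits") or {}
            if "cpu" not in limits:
                impacted += 1
                break  # count each workload at most once
    return 100.0 * impacted / len(workloads)


def sample_container(cpu_limit=None):
    """Hypothetical helper that builds a container with an optional CPU limit."""
    resources = {"limits": {"cpu": cpu_limit}} if cpu_limit else {}
    return {"name": "app", "resources": resources}


workloads = [
    {"containers": [sample_container("500m")]},
    {"containers": [sample_container("1")]},
    {"containers": [sample_container("250m"), sample_container()]},
    {"containers": [sample_container("2")]},
]

pct = percent_missing_cpu_limits(workloads)
print(pct, pct > 10)  # 25.0 True
```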

CPU Requests Missing

In the new benchmark report, more organizations are missing CPU requests. In the previous year, only 50% of organizations were missing requests on at least 10% of their workloads. The latest data shows that 78% of organizations have more than 10% of workloads impacted. There was also a significant increase, from 0% to 17%, in organizations with 71-80% of workloads missing CPU requests. This can be problematic: if a single pod is allowed to consume all of the node’s CPU and memory, another pod may become starved for resources. Setting resource requests appropriately increases the reliability of your apps and services because it guarantees that the pod will have access to those resources. It also prevents other pods from consuming all of the available resources on a node.
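The guarantee comes from scheduling: a pod is placed on a node only if the CPU it requests fits within what the node still has available, so a container with no request reserves nothing. The Python sketch below illustrates that arithmetic with Kubernetes CPU quantities (`500m` is 500 millicores, `1` is 1000). It is a deliberate simplification that ignores init containers, pod overhead, and the defaulting of a missing request to the limit, and every name and number in it is hypothetical.

```python
def cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '500m' or '0.5' to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def requested_cpu(pod_spec):
    """Sum the CPU requests of all containers; a missing request counts as 0."""
    total = 0
    for container in pod_spec.get("containers", []):
        requests = (container.get("resources") or {}).get("requests") or {}
        total += cpu_millicores(requests.get("cpu", "0"))
    return total


def fits_on_node(pod_spec, allocatable_m, already_requested_m):
    """True when the pod's CPU requests fit in the node's remaining capacity."""
    return already_requested_m + requested_cpu(pod_spec) <= allocatable_m


with_requests = {
    "containers": [
        {"name": "api", "resources": {"requests": {"cpu": "300m"}}},
        {"name": "proxy", "resources": {"requests": {"cpu": "0.25"}}},
    ]
}
without_requests = {"containers": [{"name": "api"}, {"name": "proxy"}]}

# Node with 2000m allocatable CPU, of which 1500m is already requested.
print(fits_on_node(with_requests, 2000, 1500))     # False
print(fits_on_node(without_requests, 2000, 1500))  # True
```

The second pod is admitted even though the node is nearly full, which is exactly how pods without requests end up competing for CPU that was never reserved for them.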

Kubernetes Adoption Is Growing, but Reliability Configurations Remain Challenging

Kubernetes brings exceptional value to organizations today. Rapid adoption of Kubernetes and increased deployment to production environments mean that it is critical to understand the many configurations available in Kubernetes and how to adjust them appropriately for your environment and business requirements. Use the Kubernetes Benchmark Report to understand where other organizations are missing the mark and make changes so that your organization’s deployment is as secure, reliable, and cost-efficient as possible.

Read the Kubernetes Benchmark Report today.