
5 Ways You Can Diagnose & Prevent OOMKilled Errors in Kubernetes

Learn the steps you can take to diagnose an OOMKilled (Out of Memory) error in a Linux-based system. Out of memory errors in Kubernetes typically occur when a container or pod requests more memory than is available on the node, or when the container or pod uses more memory than anticipated. Container engines also report an exit code when a container terminates, which you can review to learn why the container was terminated. Exit code 137 indicates that a container was killed because it ran out of memory: an OOMKilled error.

Exit code 137 indicating an OOMKilled error
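
You can confirm this from the pod's status itself. The sketch below uses hypothetical pod and namespace names; the JSONPath expression reads standard fields from the pod status.

    # Check the exit code of the last terminated container in a pod
    # (<pod-name> and <namespace> are placeholders)
    kubectl get pod <pod-name> -n <namespace> \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
    # A value of 137 (128 + SIGKILL) typically means the container was OOMKilled.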

The following four factors commonly contribute to OOM errors in Kubernetes:

  1. Resource requests and limits are not properly defined, resulting in a mismatch between the resources requested and the available resources (see the example manifest after this list).

  2. Memory leaks in the application code or the container itself could cause memory usage to grow over time.

  3. Increased demand for resources due to spikes in traffic or an increase in resource-intensive workloads.

  4. Heavy resource usage by other containers or pods running on the same node.
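
To illustrate the first factor above, here is a minimal sketch of a pod manifest with explicit requests and limits. The name, image, and values are placeholders rather than recommendations; size them from the workload's observed usage.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app            # placeholder name
    spec:
      containers:
        - name: app
          image: nginx:1.25        # placeholder image
          resources:
            requests:
              memory: "256Mi"      # the scheduler reserves this much for the container
              cpu: "250m"
            limits:
              memory: "512Mi"      # exceeding this memory limit triggers an OOMKilled termination
              cpu: "500m"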

The role of kubelet in OOMKilled errors

The kubelet is the primary node agent that runs on each worker node in a Kubernetes cluster. Among other things, the kubelet manages the state of each pod and container running on the node, including reporting resource usage to the API server and enforcing resource limits.

When a container running on a node exceeds its memory limit, the kubelet sends a notification to the Kubernetes control plane to indicate that the container has been terminated due to an OOMKilled error. The container is then restarted according to the pod's restart policy, or the workload is rescheduled on a different node.

To prevent OOMKilled errors in Kubernetes, it's important to configure memory limits and requests for containers properly, as well as monitor memory usage and performance metrics. Prometheus is an open source systems monitoring and alerting toolkit that gathers metrics from your workloads and nodes to provide resource usage data. Some teams also use Grafana, another open source project, to query and visualize that data in dashboards.

Metrics that help you understand OOMKilled errors in Kubernetes

Several metrics help you understand and anticipate OOMKilled errors in Kubernetes, including CPU and memory usage, network traffic, pod status, and node resources. In addition, you can instrument your application with custom metrics to surface errors and unexpected behavior. Monitoring these metrics helps you debug and troubleshoot OOMKilled errors, which in turn improves the stability and reliability of your applications. The key metrics to monitor include:

  1. Memory usage: This metric measures the amount of memory used by containers or pods. When you monitor memory usage, it helps you identify when a container or pod is close to exceeding its memory limit, which can help prevent OOMKilled errors.

  2. Memory limits: This metric specifies the maximum amount of memory a container or pod is allowed to consume. When you set appropriate memory limits, it helps you prevent containers or pods from consuming too much memory and triggering OOMKilled errors.

  3. CPU usage: This metric measures the amount of CPU resources used by containers or pods. When you monitor CPU usage, it can help you identify unusually high consumption, which can be a contributing factor in OOMKilled errors.

  4. Restart count: This metric measures the number of times a container or pod has been restarted. When you monitor restart count, it helps you identify whether a container or pod is experiencing OOMKilled errors and determine whether the issue is persistent or intermittent.

  5. Node resource utilization: As mentioned above, heavy usage from neighboring pods on a node can cause resource exhaustion on the node, which can be a factor in causing OOMKilled errors.
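
As a quick spot check on the restart count and node utilization metrics above, you can sort pods by restart count and inspect node usage with kubectl. This is a minimal sketch; the namespace is a placeholder, and kubectl top requires the metrics-server add-on to be installed.

    # Sort pods by the first container's restart count (highest values last)
    kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'

    # Show per-node CPU and memory usage (requires metrics-server)
    kubectl top node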

The Linux kernel provides a feature called the OOM Killer that Kubernetes uses to manage container lifecycles. This mechanism monitors node memory, identifies processes consuming too much of it, and decides which should be killed. The kernel assigns an oom_score to every process running on the host; as this score increases, so does the chance of the process being killed. A related value, oom_score_adj, lets users adjust a process's oom_score and therefore influence which processes are terminated first.

Kubernetes sets the oom_score_adj value for each container based on its pod's Quality of Service (QoS) class, which determines which pods are killed first when a node runs out of memory. These are the three QoS classes (an example follows the list):

  • Guaranteed

  • Burstable

  • BestEffort
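
A pod's QoS class is derived from its requests and limits: Guaranteed pods set limits equal to requests for every container, Burstable pods set at least one request or limit without meeting the Guaranteed criteria, and BestEffort pods set none. Below is a minimal sketch, with placeholder values, of a container resources block that yields the Guaranteed class, followed by a command to check the class Kubernetes assigned.

    # Giving every container limits equal to its requests yields the Guaranteed QoS class
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "512Mi"
        cpu: "500m"

    # Check which QoS class Kubernetes assigned to a pod (placeholder pod name)
    kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'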

Open source projects, such as Goldilocks, can help you adjust Kubernetes resource requests. By monitoring these metrics and setting limits, you can identify and address memory-related issues in your Kubernetes cluster and prevent OOMKilled errors from impacting the quality of service.

Symptoms of an OOMKilled error in Kubernetes

In Kubernetes, a container or pod may be restarted for a number of reasons, including to recover from runtime failures, to update the application or configuration, or due to resource constraints. If a container or pod experiences an OOMKilled error, it may be restarted automatically by Kubernetes, depending on the configuration of your cluster. Kubernetes provides multiple options for you to control how often and under what conditions a container or pod should be restarted.
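
The main setting is the pod's restartPolicy. As a minimal sketch with placeholder names: when a container is OOMKilled under Always or OnFailure, the kubelet restarts it on the same node with an exponential back-off.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app            # placeholder name
    spec:
      restartPolicy: Always        # Always (the default), OnFailure, or Never
      containers:
        - name: app
          image: nginx:1.25        # placeholder image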

Restarting the container or pod can provide a temporary solution to the problem by freeing up memory and CPU resources. However, if you do not address the root cause of the OOMKilled error, the container or pod may continue to experience the same error and be restarted repeatedly.

If you see repeated restarts, it may indicate a persistent issue with the container or pod, such as a memory leak or inadequate resource allocation. In these cases, you need to diagnose the root cause of the error and address it to prevent further OOMKilled errors and keep your Kubernetes cluster running smoothly.

Diagnose & troubleshoot OOMKilled errors in Kubernetes logs

kubectl is the command-line tool you use to interact with Kubernetes clusters. It provides a wide range of commands for managing and troubleshooting Kubernetes resources, including pods, containers, services, and nodes. You can use the following commands for this purpose (a combined example follows the list):

  1. kubectl get pods: retrieves a list of all pods in the current namespace. The output includes information about the pod name, status, and restart count. By inspecting the restart count for a specific pod, you can determine whether it has been restarted multiple times due to OOMKilled errors.

Output showing the pod name, status, and restart count

  2. kubectl describe pod: retrieves detailed information about a specific pod, including its current state, events, and container status. By inspecting the container status for a specific pod, you can determine whether a container has been terminated due to OOMKilled errors.

  3. kubectl logs: retrieves the logs for a specific container within a pod. By inspecting the logs for a container that has been terminated due to OOMKilled errors, you may be able to identify the root cause of the error, such as a memory leak or excessive resource usage.

  4. kubectl top: shows a snapshot of resource capacity and utilization. It can run against nodes and pods. For nodes, it shows the node's capacity and the percentage of CPU and memory currently in use; for pods, it shows current CPU and memory utilization.
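
Putting these commands together, a typical investigation of an OOMKilled pod might look like the following sketch. The pod, container, and namespace names are placeholders, and kubectl top requires the metrics-server add-on.

    # 1. Look for pods with high restart counts
    kubectl get pods -n <namespace>

    # 2. Inspect the pod's events and container state
    #    (look for Reason: OOMKilled and Exit Code: 137)
    kubectl describe pod <pod-name> -n <namespace>

    # 3. Read the logs of the previous (killed) container instance
    kubectl logs <pod-name> -c <container-name> -n <namespace> --previous

    # 4. Check current resource usage for the pod and its node
    kubectl top pod <pod-name> -n <namespace>
    kubectl top node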

Using kubectl with other diagnostic tools, such as monitoring systems or profiling tools, helps you diagnose and troubleshoot OOMKilled errors in your Kubernetes cluster as well as set parameters appropriately. You can also look at a node's YAML to see a breakdown of its allocatable resources.
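
For example, assuming a placeholder node name, a node's allocatable CPU and memory (its capacity minus system reservations) are recorded in its status:

    # Show the resources available for pods on a node
    kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

    # Or use describe to see both Allocatable resources and what is currently allocated
    kubectl describe node <node-name>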

The role of the Kubernetes API

The API server is a key component of the Kubernetes control plane and serves as the central point of communication for all other components in the cluster. When a pod or container experiences an OOMKilled error, the kubelet reports an event and status update to the Kubernetes API server indicating that the container was terminated due to an out of memory error. Other Kubernetes components, such as the scheduler or the ReplicaSet controller, can use this information to determine how to handle the terminated container or pod.

The Kubernetes API server also exposes a range of APIs that can be used to manage and monitor Kubernetes resources, including pods, containers, services, and nodes. Using these APIs in conjunction with other diagnostic tools, such as kubectl, can help you diagnose and troubleshoot OOMKilled errors.
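
Because everything flows through the API server, you can also query it directly for a pod's last termination state. A minimal sketch, assuming placeholder names and that jq is available for formatting:

    # Query the API server through kubectl's raw API access
    kubectl get --raw "/api/v1/namespaces/<namespace>/pods/<pod-name>" | jq '.status.containerStatuses[].lastState'

    # Or list recent events recorded for the pod
    kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>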

Identifying which container triggered the error

When a container in a pod is terminated due to an OOMKilled error, the pod's status is updated to reflect this, setting the container's status within the pod to Terminated and the reason for termination to OOMKilled. The pod's overall status then depends on the number of containers still running in that pod.

If there are other containers still running within the pod, the pod's status is Running. If all the containers within the pod have been terminated, the pod's status is updated to Failed. This pod status change can trigger Kubernetes to restart or reschedule the pod on a different node.
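
In a multi-container pod, you can list each container's name alongside its last termination reason to see exactly which one was OOMKilled. A minimal sketch with a placeholder pod name:

    # Print each container's name and its last termination reason, if any
    kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'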

Steps to Prevent OOMKilled Errors in Kubernetes

Adjust resource limits and allocations for the affected container

While memory limits play a crucial role in preventing OOMKilled errors in Kubernetes, setting appropriate CPU limits can also help to reduce the risk of OOMKilled errors.

CPU limits are used to control the amount of CPU resources a container or pod can consume. When a container or pod exceeds its CPU limit, Kubernetes throttles the container or pod, which can cause it to slow down or become unresponsive. If the container or pod continues to consume excessive resources, it may eventually contribute to OOMKilled errors.

By setting appropriate CPU limits, you can ensure that containers or pods do not consume more CPU resources than they require. Setting CPU limits also helps ensure that other containers or pods on the same node have access to sufficient CPU resources, which can prevent resource starvation in your Kubernetes cluster.

Setting CPU limits will not prevent all OOMKilled errors, of course. Other factors, such as memory leaks or inefficient resource allocation, can still cause them. Setting appropriate CPU limits alongside memory limits, however, helps reduce the overall risk.

Optimize the container to reduce memory usage

Memory limits play a crucial role in preventing OOMKilled errors in Kubernetes. When you define a memory limit for a container or pod, Kubernetes enforces a cap on the amount of memory that container or pod can use.

If a container or pod exceeds its memory limit, Kubernetes may terminate the container or pod and generate an OOMKilled error because the system is unable to allocate more memory to the container or pod.

Setting appropriate memory limits can prevent OOMKilled errors by ensuring that containers or pods do not consume more memory than they require. If you find that a container or pod needs more memory than its current limit, you can increase the limit to accommodate the additional memory requirements.
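
As a minimal sketch of that adjustment, assuming a placeholder deployment name and values sized from observed usage, you can raise the memory request and limit in place:

    # Update memory requests and limits on an existing deployment
    kubectl set resources deployment <deployment-name> --requests=memory=256Mi --limits=memory=512Mi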

Conclusion

Platform engineers work closely with development teams to prevent and address OOMKilled errors in Kubernetes. Platform engineering and DevOps teams manage the deployment, operation, and maintenance of applications in production, and they work hard to ensure that those applications and services are reliable, stable, and scalable.

When it comes to preventing OOMKilled errors, here are several steps you can take:

  1. Properly configure memory limits and memory requests for containers and pods to ensure that they have enough resources to run without running out of available memory.

  2. Monitor memory usage and performance metrics to identify potential issues before they lead to OOMKilled errors.

  3. Ensure that Kubernetes clusters are properly provisioned with sufficient resources to support the applications running on them.

  4. Implement tools to detect and respond to OOMKilled errors in real-time.

  5. Analyze OOMKilled errors to identify root causes and develop preventative measures.

By adopting best practices and leveraging tools and technologies that support containerization and orchestration, you can help ensure the stability and reliability of your organization’s applications and services in production.

Fairwinds Insights enables you to quickly identify potential problems and proactively prevent downtime or disruptions in Kubernetes. Try the free tier today!

Learn how to manage Kubernetes spend without putting scalability and reliability at risk!