What Not to Monitor and What Not to Alert On in Kubernetes

Kubernetes, the de facto container orchestrator, offers considerable flexibility and power. But monitoring every single thing in Kubernetes can quickly become overwhelming. What should you focus on, and what can you leave alone? In an earlier post, I addressed what you do need to monitor in Kubernetes and why, with a focus on what to do when you’re building a Kubernetes platform.

Here are a few things that you shouldn’t monitor and alert on in Kubernetes in general:

Metrics that are always changing: Some metrics change constantly, such as CPU usage and memory usage. While it’s important to set resource requests and limits, the metrics alone aren’t very useful for alerting, because they don’t necessarily indicate a problem.
Metrics that aren’t relevant to your applications: If you’re not using a particular feature of Kubernetes, don’t monitor the metrics for that feature. For example, if for some reason you are not using horizontal pod autoscaling, don’t monitor the horizontal pod autoscaler metrics.
Metrics that are too noisy: Some metrics are very noisy. Monitoring and alerting on them generates a lot of data and a lot of alerts, but very few of those alerts point to something specific you can do to fix an issue.

Now let’s take a look at some of the things you can safely skip monitoring and alerting on in Kubernetes.

Container Restarts

Container restarts are a symptom of another issue, so looking at restarts alone isn’t going to tell you a whole lot. The container may have restarted because it ran out of memory and was killed (an OOMKill), or because of an application error. These things happen, and it’s not necessary to alert every time a container restarts.

Hopefully, your application has proper retry mechanisms and fault tolerance, so a container restart is not a big deal; the application should just close its connections cleanly. What is more relevant to alert on are the potential underlying causes. For example, you should know when OOMKills are happening so that you can investigate them during the workday. A container restart is not something that needs to wake you up in the middle of the night unless it is negatively affecting end users at that time.
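To make that distinction concrete, here is a minimal sketch of routing a restart by its cause instead of alerting on the restart itself. It assumes the container status has the shape Kubernetes reports under a pod's status.containerStatuses, where the previous termination reason appears at lastState.terminated.reason. The routing labels and the user_facing_errors flag are hypothetical stand-ins for whatever your alerting setup uses.

```python
from typing import Optional

# Hypothetical routing labels; map these to your own alerting channels.
PAGE, TICKET, IGNORE = "page", "ticket", "ignore"


def last_termination_reason(container_status: dict) -> Optional[str]:
    """Return the reason for the container's previous termination, if any."""
    last_state = container_status.get("lastState") or {}
    terminated = last_state.get("terminated")
    return terminated.get("reason") if terminated else None


def route_restart(container_status: dict, user_facing_errors: bool) -> str:
    """Decide how loudly to react to a restarted container."""
    # Only wake someone up when end users are affected right now.
    if user_facing_errors:
        return PAGE
    # OOMKills are worth investigating, but during the workday.
    if last_termination_reason(container_status) == "OOMKilled":
        return TICKET
    # Any other restart is left to Kubernetes' self-healing.
    return IGNORE


status = {
    "restartCount": 3,
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}
print(route_restart(status, user_facing_errors=False))
```

The point of the sketch is the ordering of the checks: user impact decides whether anyone gets paged, and the restart cause only decides whether a follow-up is recorded.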

CPU and Memory Resources

In general, alerting on CPU and memory usage is a holdover from the time when you had one thing running on a node and there were no self-healing capabilities. In the Kubernetes world, CPU and memory load are not something to wake people up for. Spikes are normal. Instead, focus on the key performance indicators that directly affect system performance or user experience.

If high CPU or memory load lasts for a long time, that’s something you’ll want to follow up on. It makes sense to have the monitoring in place so you can see what’s happening and identify and resolve potential problems, but in themselves, spikes and drops in CPU and memory load are not worth waking folks up for.
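The difference between a spike and a sustained load can be sketched in a few lines. This is an illustrative check, not a recommendation for specific values: the sample format, the 0.9 threshold, and the ten-minute duration are all assumptions chosen for the example.

```python
from typing import Iterable, Tuple


def sustained_above(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if the value stayed above threshold for min_duration_s.

    samples are (timestamp_seconds, value) pairs sorted by timestamp.
    """
    run_start = None  # timestamp where the current above-threshold run began
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= min_duration_s:
                return True
        else:
            # Any dip below the threshold ends the run.
            run_start = None
    return False


# A one-minute spike to 95% CPU does not count as sustained load.
spike = [(0, 0.30), (60, 0.95), (120, 0.40)]
# Ten straight minutes at 95% CPU does.
sustained = [(minute * 60, 0.95) for minute in range(11)]

print(sustained_above(spike, threshold=0.9, min_duration_s=600))
print(sustained_above(sustained, threshold=0.9, min_duration_s=600))
```

Even when the sustained case is detected, it fits better as something to follow up on during working hours than as a page.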

CPU Throttling

CPU throttling can be a sign of a problem in your Kubernetes cluster, such as a container that is using too much CPU or a node that is overloaded. But it can be noisy: even containers that are not causing problems can be throttled, and the cause can be difficult to troubleshoot.

However, CPU throttling can be a valuable metric to look at. It can help you identify performance problems and prevent resource exhaustion. If you are reevaluating your resource requests and limits to get more performance out of your container, that metric can help you see whether performance is where you want it or still needs improvement. In the middle of the night, though, getting an alert on CPU throttling out of context is not helpful.
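When you do review throttling while tuning requests and limits, a ratio is usually easier to reason about than a raw count. The sketch below assumes you can read two cumulative counters for a container at the start and end of a window, such as the container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total counters that cAdvisor exposes. The function name and the sample numbers are made up for illustration.

```python
from typing import Optional


def throttle_ratio(
    periods_start: float,
    periods_end: float,
    throttled_start: float,
    throttled_end: float,
) -> Optional[float]:
    """Fraction of CPU scheduling periods in which the container was throttled.

    Inputs are cumulative counter readings taken at the start and end of a
    window. Returns None when the window has no usable data, for example
    after a counter reset or when no periods elapsed.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0 or throttled < 0:
        return None
    return throttled / periods


# Made-up readings: 1000 periods elapsed, 250 of them throttled.
ratio = throttle_ratio(1000, 2000, 100, 350)
print(f"throttled in {ratio:.0%} of periods")
```

A number like this is context for a tuning session, which is why it belongs on a dashboard and not in a pager rule.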

Logs

I like to say that logs can be fragile and limited. Logs are long. They’re verbose and they include things like stack traces and a lot of other details, depending on how you have configured your logs. They lack context when you look at them. If you’re alerting on one log line, or even on one log line appearing multiple times, the alert doesn’t tell you what is happening around it. You’ll get woken up with very limited information to review and nothing actionable. It’s much more useful to get an alert from one of the four Golden Signals (latency, traffic, errors, and saturation) than it is to get an alert on a log.

If you've implemented logging in your application, in the system itself, and in the Kubernetes cluster, there are a ton of logs, and event logging in Kubernetes is extremely noisy. If you’re logging too much, your monitoring system can fall behind in processing all those logs, especially if an event happens that causes your service or system to create more logs because there's a problem. That could result in an overwhelmed logging platform and then a lag in your alerts. Logs are an output and a debug tool, not a source of alerting.
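As a contrast to alerting on a log line, here is a minimal sketch of alerting on the errors signal instead. It assumes you already have request and failure counts for a recent window. The 5% error threshold and the 100-request minimum are hypothetical values for the example, not recommendations.

```python
def error_ratio(total_requests: int, failed_requests: int) -> float:
    """Share of requests in the window that failed."""
    if total_requests <= 0:
        return 0.0
    return failed_requests / total_requests


def should_page(
    total_requests: int,
    failed_requests: int,
    max_error_ratio: float = 0.05,  # hypothetical threshold
    min_requests: int = 100,  # hypothetical floor to avoid paging on tiny samples
) -> bool:
    """Page only when enough traffic shows a user-visible error rate."""
    if total_requests < min_requests:
        return False
    return error_ratio(total_requests, failed_requests) > max_error_ratio


print(should_page(total_requests=1000, failed_requests=80))
print(should_page(total_requests=1000, failed_requests=10))
```

Once an alert like this fires, the logs are where you go to find out why, which keeps them in their role as a debugging tool.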

Focus on User Experience

Monitor the metrics that matter most to how your end users experience your applications and services. Using the right metrics will help you identify problems early and take corrective action before they negatively impact your users.

A few more tips about what not to monitor and alert on in Kubernetes:

Understand your applications and their dependencies to identify the metrics that are important to monitor.
Use a monitoring tool with built-in defaults. This provides a good starting point that you can adjust to meet your unique needs.
Experiment and optimize your monitoring configuration. Some metrics may not be as important as you thought, or you may need to add new ones.

Make sure that you are monitoring and alerting on metrics that are useful and actionable, so when you do get woken up in the middle of the night, it’s only on problems that need human intervention. If you set up the right guardrails and governance, you can maximize Kubernetes’ self-healing nature and minimize those middle-of-the-night alerts.

Learn about What to Monitor and Why instead in this on-demand webinar.