
Kubernetes Problem Solving

No doubt when running Kubernetes you will spend a good amount of time problem solving. Our team does a lot of problem solving across many different Kubernetes deployments. Here are a few common issues and how we solve them.

1. OOMKilled pod

In a Kubernetes cluster, memory is a finite resource, and the underlying operating system can terminate containers when it runs low. Processes are ranked for termination based on whether their usage of the starved resource exceeds their requests, and then by priority.

To avoid problems, you’ll need to set memory requests and limits that accommodate your application’s usage over its lifetime. If your memory limits are too low, you’ll get familiar with OOMKilled as a status showing that your Pod was terminated for using too much memory! But if you set your limits too high, you're inherently wasting money by overallocating. 
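
To make this concrete, memory requests and limits are set per container in the Pod spec. The sketch below is illustrative only; the names and numbers are placeholders, not recommendations, and the right values depend on your application’s observed usage.

```yaml
# Illustrative Pod spec: the request is what the scheduler reserves for the
# container, the limit is the ceiling beyond which it becomes an OOMKill candidate.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.4.2
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "512Mi"
```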

Ideally you want to set memory requests and limits in that sweet spot of “just enough.” Finding it means reviewing the requests and limits on every Pod, which takes a lot of time and is prone to human error when done manually. We solve this problem by using Goldilocks, an open source tool that helps teams allocate resources to their Kubernetes deployments and get those resource calibrations just right.
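
A minimal sketch of how we typically enable Goldilocks on a namespace, based on the project’s documentation (the chart repo, label, and service names may change, so verify against the current README):

```sh
# Install Goldilocks from the Fairwinds Helm repository (it builds its
# recommendations on top of the Vertical Pod Autoscaler).
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Opt a namespace in; Goldilocks then generates request/limit recommendations
# for every workload running in it.
kubectl label namespace default goldilocks.fairwinds.com/enabled=true

# Browse the recommendations in the dashboard.
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
```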

2. Higher than expected costs

Cost reduction is one of the many benefits of Kubernetes, so you should have some idea of what your clusters ought to cost. If your bill at the end of the month shows skyrocketing costs, you likely have a problem.

To solve this problem you must compare recommendations to actual costs. Many organizations set their CPU and memory requests and limits too high, but don’t know where to make changes. We use Goldilocks, as mentioned above, to review requests and limits. For cost analysis, though, we use Fairwinds Insights, which ingests Goldilocks data and helps us prioritize which workloads need tuning of requests and limits. We also use Cluster Autoscaler to ensure any extra nodes are removed when they are unused, which saves time and money.
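
As a quick sanity check before any tooling, you can compare actual usage against what you have requested with kubectl alone (a rough sketch; kubectl top requires metrics-server, and Goldilocks and Insights automate this comparison):

```sh
# Actual current usage per container (requires metrics-server).
kubectl top pods --all-namespaces --containers

# What each Pod has requested; compare with the usage above to spot
# workloads that are heavily over-provisioned.
kubectl get pods --all-namespaces -o custom-columns=\
'NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```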

3. Knowing when to update Helm charts

Patching a Kubernetes add-on isn’t typically that hard. Keeping track of when to update, on the other hand, can be. While Kubernetes itself follows a predictable release schedule, Helm chart updates are much harder to monitor and predict. Our Fairwinds team uses Nova, an open source command-line interface for cross-checking the Helm charts running in your cluster against the latest versions available. Nova will let you know if you’re running a chart that’s out-of-date or deprecated, so you’re always aware of pending updates. If you use Fairwinds Insights, it can ingest Nova data and alert you when new updates are available, rather than you having to run Nova periodically.
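
Running Nova is straightforward; a minimal sketch (the Homebrew tap is one install option, and releases are also published on GitHub, so check the Nova docs for what fits your setup):

```sh
# Install Nova (one option among several).
brew install fairwinds/tap/nova

# Cross-check every Helm release in the current cluster context against the
# latest chart versions published upstream; out-of-date or deprecated charts
# are called out in the output.
nova find
```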

4. Handling an unexpected burst in traffic

Whether you have suddenly gone viral or are under a DoS attack, Kubernetes offers autoscaling so your app won’t fail. While that is good if you’ve gone viral, it can be really bad (and expensive) if a DoS attack is what’s driving the traffic.
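
On the scaling side, a HorizontalPodAutoscaler is usually the first lever. This is a minimal sketch; the Deployment name, replica bounds, and the 70% CPU target are assumptions for illustration, not values from this post.

```yaml
# Illustrative HPA: scales the hypothetical "web" Deployment between 2 and 10
# replicas, targeting ~70% average CPU utilization (requires metrics-server).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

On the DoS side, a few practices help: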

  • Set up rate-limiting in your application, as well as in your service mesh or ingress controller. This prevents any single client from consuming too much of your capacity. For example, nginx-ingress can limit the requests per second or per minute, the payload size, and the number of concurrent connections from a single IP address (see the Ingress sketch after this list).
  • Run load testing to understand how well your application scales. You’ll want to set up your application on a staging cluster, and hit it with traffic. It’s a bit harder to test against distributed DoS attacks, where traffic could be coming from many different IPs at once, but there are a few services out there to help with this.
  • Enlist a third-party service like Cloudflare to do DoS/DDoS protection.
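
For the nginx-ingress rate limiting mentioned above, limits are applied with annotations. This is a hedged sketch; the host, service name, and the specific numbers are placeholders to tune for your own traffic.

```yaml
# Illustrative ingress-nginx rate limiting: caps requests, concurrent
# connections, and request body size per client IP.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"          # requests per second per IP
    nginx.ingress.kubernetes.io/limit-connections: "5"   # concurrent connections per IP
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"    # reject oversized payloads
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```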

Referring back to point one, you’ll also need to make sure your resource requests and limits are set so that your regular traffic can still reach your services while you scale.

5. Missing tags

Container registries generally allow you to overwrite a tag when you push a container. A common example is the latest tag, which may even be set automatically to match the last tag pushed. Using this tag in production is dangerous: you can’t be sure exactly what code is running in a given container, and you can unintentionally end up with multiple versions of your code running in production at once. To help catch problems like these, we use Polaris, another open source tool that runs a number of checks against the images specified by your Pods. A missing tag (or the latest tag) will surface as images.tagNotSpecified, and a pull policy other than Always as images.pullPolicyNotAlways.
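
In practice, that means pinning every image to an explicit, immutable tag (or a digest) and setting an explicit pull policy. A minimal sketch with placeholder names and versions:

```yaml
# Illustrative container spec: an explicit version tag instead of "latest",
# plus an explicit pull policy, satisfies both Polaris checks above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.4.2  # pinned tag, not "latest"
          imagePullPolicy: Always
```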

 

For those of you who don’t have the resources to work with all of these open source projects, we’ve combined them into Fairwinds Insights, a configuration validation tool that scans your clusters and checks for configurations that may be costing you money, leaving you vulnerable, or causing downtime.

Interested in using Fairwinds Insights? It’s available for free! Learn more here.

You can check it out and try it by spinning up a cluster on GKE, AKS or EKS to see how it can help with Kubernetes problem solving.
