Running Kubernetes in production means you spend a lot of time troubleshooting, regardless of whether you are running web apps, data pipelines, or AI/ML and GenAI workloads. The good news is that most problems look similar across clusters, and a small set of questions can uncover the real issues quickly. These are seven questions we see again and again across Kubernetes environments, and they’re usually where the biggest wins come from.
This post walks through common Kubernetes problems, the questions to ask, and practical ways to solve them.
1. Why Are My Pods Getting OOMKilled?
In Kubernetes, OOMKills happen either because a container exceeds its own cgroup memory limit or because the node itself runs out of memory and the kernel OOM killer terminates processes based on their QoS class and usage. In practice, that means a Pod can look perfectly healthy during development, then suddenly get killed under production load when it bursts above its memory settings.
Questions to ask:
- Are memory requests and limits set for this workload?
- Do those settings reflect real usage over time, especially for spiky workloads like AI inference or batch jobs?
- Is the node itself under memory pressure, triggering kubelet evictions or the kernel OOM killer?
How to approach the solution:
- Start by inspecting historical memory usage for the Pod and container (for example, via Prometheus or your cloud monitoring tools). Aim for limits that comfortably exceed typical peaks, not just averages.
- If limits are too low, Kubernetes will kill the container as soon as it crosses that boundary. If they are too high everywhere, you’ll burn money by over‑provisioning nodes.
- For large or variable workloads (like model servers and data transforms), combine right‑sized requests/limits with horizontal or vertical autoscaling so the platform can adapt to changing load; a minimal example follows this list.
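As a minimal sketch (the workload name, image, and numbers are illustrative placeholders, not recommendations), right‑sizing pairs memory requests near typical observed usage with a limit comfortably above observed peaks:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server   # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:1.4.2  # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"   # close to typical observed usage
            limits:
              memory: "4Gi"   # comfortably above observed peaks, not averages
```

Note that no CPU limit is set here; many teams deliberately omit CPU limits to avoid throttling while keeping memory limits strict, though the right choice depends on how strongly you need to isolate tenants.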
2. Why Are My Kubernetes and Cloud Bills So High?
Kubernetes is often adopted to improve efficiency, but many organizations see costs rise unexpectedly once they scale.
Questions to ask:
- Which namespaces, teams, or applications are driving the most CPU, memory, and GPU consumption?
- Are requests and limits consistently higher than actual usage across workloads?
- Are there idle nodes (especially GPU nodes), abandoned workloads, or over‑aggressive autoscaling policies?
How to approach the solution:
- First, break costs down by tenant (teams, services, or projects) using labels and annotations; the snippet after this list shows one labeling approach. This can help deconstruct a single large bill into a map of who is using what.
- Compare real usage against resource requests. A common anti-pattern is setting requests to safe but very high values: the padding lowers the risk of node‑level memory pressure, but it forces the scheduler to reserve capacity that is never used, which means larger or more nodes than necessary.
- Tune autoscaling policies so that they scale up fast enough to handle real traffic patterns, but also scale down promptly when load drops; this is especially important for AI/ML and GenAI workloads with bursty traffic and expensive GPU instances.
- Clean up unused resources regularly (old namespaces, CronJobs, stale persistent volumes) and ensure cluster autoscalers are allowed to remove idle nodes.
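For the label‑based breakdown above, one minimal approach (names and values here are hypothetical) is to stamp every namespace, and ideally every workload, with ownership labels that cost tooling can aggregate on:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: recommendations   # hypothetical namespace
  labels:
    team: search          # who owns the spend
    cost-center: "1234"   # ties back to finance reporting
    environment: production
```

Cost tools can then slice spend by namespace and label instead of presenting one undifferentiated bill.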
Use open source tooling:
- Goldilocks analyzes workload usage and recommends just‑enough CPU and memory requests/limits for Deployments and StatefulSets. Instead of manually tuning every YAML file, teams run Goldilocks to generate recommendations and then apply those changes via GitOps or their preferred workflow. Under the hood, Goldilocks drives the Vertical Pod Autoscaler in recommendation mode; if you let VPA run in update mode, requests and limits can even be adjusted automatically.
- OpenCost is an open source project that gives you detailed, real‑time cost allocation for Kubernetes clusters across clouds, broken down by cluster, namespace, workload, and more.
- kube‑green can automatically scale down or sleep non‑critical workloads during quiet periods, reducing both node count and allocated CPU/memory; a sketch of its SleepInfo resource follows.
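As an example of the kube‑green pattern, the project defines a SleepInfo custom resource. The sketch below follows its documented examples (the namespace is hypothetical; verify field names against the version you install):

```yaml
apiVersion: kube-green.com/v1alpha1
kind: SleepInfo
metadata:
  name: working-hours
  namespace: dev-sandbox   # hypothetical non-critical namespace
spec:
  weekdays: "1-5"          # Monday through Friday
  sleepAt: "20:00"         # scale workloads down in the evening
  wakeUpAt: "08:00"        # restore them before the workday starts
  timeZone: "Europe/Rome"
```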
3. How Do I Know When to Update Helm Charts and Add‑ons?
Patching a Kubernetes add-on or Helm release is straightforward. Knowing when you should upgrade (especially across many clusters) is harder.
Questions to ask:
- Which Helm charts and versions are running today across all clusters?
- Which of these are outdated, deprecated, or have known security issues?
- Do we have a safe process to test and roll out updates without breaking production?
How to approach the solution:
- Maintain an inventory of all charts, their versions, and which clusters/namespaces use them, commonly via helm list automation or inventory tools.
- Regularly check those versions against the latest releases from upstream chart repositories, paying special attention to charts that bundle operators, ingress controllers, service meshes, and storage drivers.
- Adopt a repeatable process: compare differences with tools like helm-diff, try upgrades in a non‑production environment, run smoke and regression tests, then roll out with canaries or progressive delivery in production (see the commands below).
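A minimal version of that inventory‑and‑diff loop, using standard Helm commands (the release name, chart, and version are placeholders):

```bash
# Inventory every release across all namespaces, in machine-readable form.
helm list --all-namespaces --output json

# Install the helm-diff plugin once, then preview what an upgrade would change
# before touching any environment.
helm plugin install https://github.com/databus23/helm-diff
helm diff upgrade my-release example-repo/example-chart --version 2.0.0 -f values.yaml
```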
Using Fairwinds open source:
- Nova is a CLI that checks the Helm charts running in your cluster against available versions and flags outdated or deprecated charts. Teams use it in CI pipelines or periodic checks to answer: “Which charts are out of date or deprecated?” so that upgrades become proactive instead of reactive.
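A quick sketch of how teams run it (flags may vary by version; check Nova's docs):

```bash
# Compare in-cluster Helm releases against the latest upstream chart versions.
nova find

# Check running container images for newer tags as well.
nova find --containers
```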
4. Can My Cluster Handle Sudden Traffic Spikes or Attacks?
Unexpected bursts in traffic can come from good events (launch day, going viral) or bad ones (DoS attempts, misconfigured clients). Kubernetes provides powerful scaling primitives, but they must be configured correctly.
Questions to ask:
- What metrics drive scaling decisions: CPU, memory, requests per second, queue length, or latency?
- Are there rate limits and quotas in place at ingress or API gateway levels?
- What happens when traffic exceeds the capacity of downstream dependencies like databases, vector stores, or external APIs?
How to approach the solution:
- Configure Horizontal Pod Autoscalers (or other autoscaling mechanisms) using metrics that correlate to user experience, such as latency or concurrency. This is critical for GenAI endpoints where CPU may be low but latency spikes as models get overloaded.
- Implement rate limiting at the edge (ingress controllers, API gateways, or service mesh) so a single client or group of clients cannot monopolize capacity. For example, the nginx.ingress.kubernetes.io/limit-rps and limit-rpm annotations are commonly used with ingress‑nginx (see the sketch after this list). Many teams are also moving to Gateway API implementations for more flexible traffic policies, but the core ideas (rate limiting, timeouts, retries) remain the same.
- Perform load testing in a staging or pre‑production environment that mirrors production as closely as possible, covering both positive bursts and negative scenarios like partial outages of dependencies.
- Consider external services for DDoS protection and global traffic management when exposing public APIs at scale (for example, Cloudflare and cloud‑provider edge services).
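As a sketch of edge rate limiting with ingress‑nginx (the host, service name, and thresholds are placeholders to tune for your traffic patterns):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"   # requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "600"  # requests per minute per client IP
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```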
5. Do I Really Know Which Images and Tags Are Running?
Container registries make it easy to overwrite tags, and many workflows default to tags like latest. In production, this can create dangerous ambiguity and complicate incident response.
Questions to ask:
- Are production workloads pinned to specific, immutable tags or digests, or do they use floating tags like latest?
- Can the team answer, with confidence, “what code and base image is running in this Pod right now”?
- Is there any policy preventing untagged or unscanned images from being deployed?
How to approach the solution:
- Use explicit, versioned tags (often linked to build IDs or Git SHAs) or image digests for production workloads, and avoid mutable tags for anything critical; an example follows this list.
- Align CI/CD so that image build, scan, and deploy are connected: new images are scanned, tagged immutably, and then referenced by manifests.
- Enforce admission controls that reject Pods with risky image configurations, such as missing tags, untrusted registries, or unsigned images.
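A minimal illustration of image pinning (the registry, tag, and digest are placeholders; replace <digest> with the real value from your registry or CI before applying):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      # Versioned tag for humans, digest for immutability; when both are
      # present, the runtime pulls by digest, so an overwritten tag cannot
      # change what actually runs.
      image: registry.example.com/team/api:1.8.3@sha256:<digest>
```

To answer "what is running in this Pod right now," kubectl get pod api -o jsonpath='{.status.containerStatuses[*].imageID}' returns the resolved digest for each container.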
6. Are My Kubernetes Configurations Following Best Practices?
As clusters, teams, and workloads multiply, configuration drift and small misconfigurations become a major source of outages, security findings, and performance problems.
Questions to ask:
- Are there consistent policies across clusters for security (running as non‑root), reliability (liveness probes), and cost (resource requests and limits)?
- How do we detect when a new deployment violates those policies?
- Is configuration review happening manually in code review only, or also via automated checks?
How to approach the solution:
- Codify expectations as policies; for example:
- all Pods must set requests/limits
- no privileged containers
- no hostPath mounts in certain namespaces
- Integrate configuration scanning into CI and into the clusters themselves via admission controllers so misconfigurations are caught both before merge and at deploy time.
- Standardize golden templates for common workload types (web services, batch jobs, data jobs, and AI/ML workloads) so that teams start from a safe baseline instead of reinventing specs each time.
Using open source tooling:
- Polaris can be run against manifests or live clusters to highlight misconfigurations across categories like reliability, security, and efficiency. Teams often integrate Polaris into CI using its CLI or GitHub Action to prevent new regressions and use it periodically to assess existing workloads.
- Kyverno is a Kubernetes‑native policy engine that lets you define policies as Kubernetes resources in YAML; it can validate, mutate, or generate configurations and enforce them at admission time or via background scans (see the example after this list).
- OPA Gatekeeper builds on Open Policy Agent to provide a constraint‑based model for admission control, with reusable policy libraries and cluster‑wide auditing. Many organizations use Kyverno or Gatekeeper alongside tools like Polaris for a layered policy‑as‑code approach.
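As an example of the Kyverno approach, a policy requiring resource requests and memory limits looks roughly like this (adapted from the pattern in Kyverno's public policy library; start in audit mode and switch to enforcement once existing violations are cleaned up):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Audit   # switch to Enforce when ready
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # any non-empty value passes
                    memory: "?*"
                  limits:
                    memory: "?*"
```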
7. Where Should Teams Start with Kubernetes Problem Solving?
Kubernetes can feel overwhelming, especially when you are supporting many types of workloads, whether traditional apps, data processing, or AI/ML and GenAI services. A structured set of questions can guide where to start.
Useful starting questions:
- Are we regularly seeing Pod failures (CrashLoopBackOff, OOMKilled, Evicted), and do we understand why?
- Do we know which workloads and teams are responsible for most of our resource consumption and cost?
- Are our add‑ons and Helm charts kept reasonably up to date?
- Do we have clear policies for images, security, and configuration hygiene?
- Are observability and load testing built into our routine, or are they just used during incidents?
From there, you can:
- Use open source tools (like Goldilocks and OpenCost) to right‑size resources and connect usage to cost.
- Use policy engines (Polaris, Kyverno, Gatekeeper) to enforce configuration best practices consistently across clusters and pipelines.
- Use tools like Nova and your Git/GitOps process to maintain visibility into Helm chart freshness and plan safe upgrades as part of regular maintenance windows.
By routinely asking these questions and backing the answers with open source tools and automation, platform and SRE teams can make Kubernetes a reliable foundation for everything from simple web applications to complex data and GenAI platforms, without turning every incident into a one‑off firefight.
If you’re asking these questions today but don’t have the people or time to stay ahead of them, Fairwinds can run Kubernetes for you. With Managed Kubernetes‑as‑a‑Service, our SREs design, harden, and operate your clusters across EKS, AKS, and GKE, so your teams can focus on features, AI/ML, and GenAI instead of staying on top of upgrades, add‑ons, and OOMKills.
Want to explore whether Managed KaaS is a fit for your team? Start a conversation with Fairwinds and we’ll map these questions to your current Kubernetes environment and roadmap, and outline what a managed model could look like for you.
This post was originally published on July 31, 2020 and has been significantly updated and restructured to answer common questions today.