How Do I Run AI Workloads on Kubernetes Without Wasting GPUs?

Written by Stevie Caldwell | May 20, 2026 6:31:05 PM

Many teams want to run AI and ML workloads on Kubernetes but are worried about wasting GPUs, overcomplicating the platform, or breaking reliability for the rest of their services. The good news is that Kubernetes can work very well for model training, batch jobs, and real-time inference, including LLM APIs and vector search services, as long as you plan for GPU scheduling, right-size your workloads, and put guardrails in place so expensive nodes and jobs don’t run unchecked.

Most GPU waste isn’t because Kubernetes is the wrong place for AI. It comes from treating GPUs as static, all-or-nothing resources in environments where demand is spiky, multi-tenant, and uneven. GPU costs compound fast when nodes run idle, jobs are overprovisioned, or sharing isn't configured. Instead of standing up a separate AI platform that’s hard to run, you can treat AI workloads as first-class citizens on the Kubernetes infrastructure you already have, with a few extra patterns for GPUs. The patterns here apply whether you're running LLM inference, model training pipelines, batch scoring jobs, or embedding services.

This article stays focused on infrastructure and platform patterns for running AI workloads efficiently on Kubernetes. It doesn't cover model architecture choices, training algorithms, or low-level CUDA/kernel tuning.

Should I Even Run AI Workloads on Kubernetes?

Before you invest in GPU node pools and scheduling strategies, it helps to be clear about when Kubernetes is the right foundation for AI and when something simpler will work. The answer depends on how many teams are involved, how often you ship models, and how much standardization you want across your stack.

When Kubernetes Makes Sense for AI and ML

Kubernetes is a strong fit for AI and ML when you need to run many services or jobs, share GPUs across teams, and manage everything with the same control plane and tooling as your other applications. That pattern shows up in organizations where AI is maturing from early experiments into production services.

You gain the most value from Kubernetes when:

Multiple teams or products rely on models and AI-powered features.
You deploy and update inference services frequently.
Security, isolation, and compliance rules already live in Kubernetes and need to apply to AI as well.

Kubernetes works especially well for production inference, batch jobs, and pipelines that need autoscaling, isolation, and repeatable deployments. The same kinds of platform patterns apply whether you're thinking about cloud native infrastructure for AI workloads broadly or running AI specifically on GKE, EKS, or AKS.

When a Simpler Platform Might Be Better

If you're only training a few models occasionally, a managed notebook, a hosted training service, or a single-VM setup can be cheaper and simpler to operate, and usually provides enough orchestration without a full Kubernetes footprint.

A team that runs one experiment every few months doesn't benefit much from pod scheduling, autoscaling, or multi-tenant quotas. If your current situation is one person running an experiment on a single GPU, a notebook platform or a hosted training service is often the right starting point. As usage grows and models move into production, the case for Kubernetes becomes stronger, especially when you want the same platform to handle both AI and non-AI workloads.

How Do I Schedule AI Workloads and GPUs on Kubernetes?

Once you decide Kubernetes is the right home for AI, the next challenge is getting scheduling right. Making GPUs explicit in your cluster starts with how you define node pools, resource requests, and workload boundaries.

Use Node Pools and Labels for GPU Workloads

The first step is to make GPU capacity explicit in your cluster. Create dedicated node pools with GPUs and use labels, taints, and tolerations so only AI workloads land there.

A practical pattern might look like this:

Create one or more GPU node groups and label them with a key such as node-type=gpu.
Add taints to GPU nodes so general-purpose workloads don't schedule there by default.
Configure AI workloads with matching tolerations and node selectors that target the GPU pools.

This keeps non-AI workloads away from expensive GPU nodes and makes GPU consumption easier to track. It also lets you apply different autoscaling and maintenance policies to GPU pools without impacting the rest of the cluster.
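
As a rough sketch, this is what a labeled and tainted GPU node could look like from the API's point of view. It assumes the node-type=gpu label from the steps above and a gpu=true:NoSchedule taint, a key picked for illustration. The node name is hypothetical, and in practice you would usually set labels and taints through your node pool or node group configuration rather than by editing Node objects directly.

```yaml
# Illustrative only: labels and taints are normally applied by the
# node pool / node group definition, not by hand-editing the Node.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1            # hypothetical node name
  labels:
    node-type: gpu            # targeted by nodeSelector on AI workloads
spec:
  taints:
    - key: gpu                # assumed taint key
      value: "true"
      effect: NoSchedule      # pods without a matching toleration stay off this node
```

A pod lands here only if it both selects the label and tolerates the taint. The label attracts AI workloads; the taint repels everything else.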

Here is a minimal example of a deployment that targets GPU nodes (this assumes you're using the standard NVIDIA device plugin; resource names may vary for other hardware providers):


apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
    spec:
      # Ensures the pod only lands on nodes labeled for GPUs
      nodeSelector:
        node-type: gpu
      # Allows the pod to schedule on tainted GPU nodes
      tolerations:
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: model-server
          image: your-registry/model-server:latest
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
              nvidia.com/gpu: 1
            limits:
              cpu: "1"
              memory: "4Gi"
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8080

Note on Fractional Requests: In the example above, nvidia.com/gpu: 1 is an integer. By default, Kubernetes treats a GPU as an indivisible unit. You cannot request 0.5 of a GPU here; extended resources such as nvidia.com/gpu only accept whole numbers, so the API server rejects a fractional value. This all-or-nothing scheduling is precisely why many GPUs sit at low utilization: you’re forced to reserve an entire expensive chip for a service that might only need a fraction of its power. To fix this, you’ll need one of the sharing strategies (MIG or time slicing) discussed below.

Request GPU Resources Explicitly

Kubernetes won't infer which pods need accelerators. You must declare those requirements. Use device plugins and resource requests so the scheduler understands which pods need GPUs and how many.

Key steps include:

Installing the GPU device plugin that matches your hardware so GPUs appear as allocatable resources.
Requesting GPUs explicitly in pod specifications, for example by setting nvidia.com/gpu in the resources section.
Keeping CPU and memory requests realistic so the scheduler has a complete view of each pod.

When GPU needs are explicit, the scheduler can pack workloads more efficiently and avoid leaving capacity stranded on partially filled nodes. Clear requests are also a foundation for policy, reporting, and quota controls around GPU use. The way you set resource requests and limits for any other service applies directly here, even though that guidance isn’t AI-specific.

Separate Training, Batch, and Real-Time Inference

Training jobs, offline batch inference, and low-latency online inference behave very differently. They compete for GPUs in different ways and tolerate different levels of delay and disruption.

To improve your results:

Use distinct namespaces or labels for training, batch, and real-time inference.
Map those categories to different node pools, priorities, and disruption budgets.
Apply separate autoscaling rules and quotas based on the reliability and cost profile of each workload type.

Structuring those environments well on GKE follows the same logic, with a few cloud-specific considerations around node pools and workload isolation.
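
One way to express different priorities and disruption tolerance is sketched below, using PriorityClass and PodDisruptionBudget objects. All names and priority values are hypothetical and only meant to show the shape. The PodDisruptionBudget selects the app: ai-inference label from the earlier deployment example.

```yaml
# Higher priority for latency-sensitive inference (value is an assumption).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 100000
globalDefault: false
description: "Online inference that must win GPU capacity under contention."
---
# Lower priority for training and batch; never preempts other pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 1000
preemptionPolicy: Never
globalDefault: false
description: "Training and batch jobs that can wait for capacity."
---
# Keep at least one inference replica up during voluntary disruptions
# such as node drains and upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ai-inference
```

Workloads opt in by setting priorityClassName in their pod spec. Batch jobs that checkpoint regularly can usually tolerate eviction, so they may not need a disruption budget at all.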

How Do I Know If We’re Wasting GPUs on Kubernetes?

Before you change anything, you need to confirm whether GPU waste is actually a problem and where it's coming from. Kubernetes GPU utilization metrics are the starting point; a few simple checks will tell you quickly whether there is obvious waste to go after.

Common signs of GPU waste include:

Low Aggregate Utilization: GPU metrics stay consistently low, even during peak demand.
Stranded Resources: Pods request a full GPU but only use a small fraction of its available compute or memory.
Idle Infrastructure: GPU nodes stay online overnight or between batch windows with no active workloads.
Lack of Visibility: No clear view of GPU usage and cost at the workload, namespace, or team level.

When checking these metrics, distinguish between GPU utilization (the GPU-Util figure in nvidia-smi, which reports the share of time at least one kernel was running) and VRAM usage (the memory footprint). In AI workloads, VRAM saturation is often the real bottleneck; a model might occupy 100% of the memory while the actual compute utilization stays below 10%. Identifying this mismatch is the first step toward efficient sharing.

These are rule-of-thumb indicators rather than hard thresholds. For example, a workload with very strict latency SLOs might justify lower average utilization, but it's still worth understanding the tradeoff.
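
As a sketch of how these signs might be turned into checks, the Prometheus rule file below assumes GPU metrics come from NVIDIA's DCGM exporter (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE). The thresholds, windows, and alert names are illustrative assumptions, not recommendations, and the memory ratio is approximate because it ignores any reserved framebuffer memory.

```yaml
groups:
  - name: gpu-waste-signals
    rules:
      # Sign 1: a GPU that averages under 10% utilization over six hours.
      - alert: GPUConsistentlyUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[6h]) < 10
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "GPU has averaged under 10% utilization for 6h"
      # Sign 2: memory nearly full while compute is mostly idle,
      # which often points to a sharing or batching opportunity.
      - alert: GPUMemoryFullComputeIdle
        expr: |
          (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9)
          and
          (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 10)
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "GPU memory above 90% used while compute stays under 10%"
```

Low-severity alerts like these are meant to prompt a review rather than page anyone, which fits the rule-of-thumb nature of the indicators.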

How Do I Avoid Wasting GPU Capacity on Kubernetes?

Getting scheduling right helps, but it doesn't guarantee high utilization. To avoid waste, you need data on how workloads actually use GPUs and mechanisms to turn that data into scaling and budgeting decisions.

Right-Size GPU Requests and Job Duration

In most clusters, the biggest source of GPU waste is overprovisioning that nobody revisits once the service is online. Start by measuring how much GPU memory and compute your jobs actually use over time.

At minimum, do the following:

Export GPU utilization at the workload level so idle capacity is visible by service and team.
Inspect memory and compute usage during realistic traffic patterns.
Identify services and jobs that consistently use a small fraction of their requested GPU capacity.

If a workload always requests a full GPU but only uses a small portion of it, a large share of your accelerator budget is idle capacity. Even before you adopt GPU sharing, adjusting CPU, memory, and replica counts based on data can reduce the number of GPU nodes required for the same level of service.

Industry reports often put fleet-average GPU utilization across production Kubernetes clusters well below 30 percent, and some put it as low as 5 percent. If your workloads are in that range, there's usually room to reclaim capacity with better sizing, batching, or scheduling.

Rightsizing based on actual usage data is more accurate than setting requests by instinct or copying defaults from another service.

Job duration matters as well. Training runs that fully occupy GPUs for long periods tend to justify their cost more easily than many short jobs that each spin up large nodes, perform little work, and tear down.

Kubernetes GPU Sharing: MIG, Time Slicing, and When to Use Each

One of the clearest ways to cut GPU waste is to stop treating every workload as if it needs an entire physical device. For many inference services, dev environments, and smaller models, whole-GPU allocation is simply too blunt.

There are three main GPU allocation patterns worth considering:

GPU allocation approach	Best fit	Benefits	Tradeoffs
Whole GPU per pod	Large training jobs, high-throughput inference on big models	Simple to set up and reason about, predictable performance	Often wastes capacity for smaller workloads, especially at low to medium traffic levels
MIG (Multi-Instance GPU)	Production inference and multi-tenant clusters on supported NVIDIA hardware	Strong isolation between slices, better packing of medium-sized workloads, more predictable QoS	Requires MIG-capable NVIDIA data center GPUs (for example, A100 or H100; not every Ampere-or-newer card supports it) and more operational work to manage profiles
Time slicing	Dev and test environments, lighter or bursty workloads that can tolerate some variability	Easy to increase utilization, no hardware partitioning, multiple pods share a device over time	Weaker isolation, less predictable tail latency, careful tuning needed for production use

You don't need to adopt these patterns everywhere or all at once. A practical approach is to start by identifying services that routinely use a fraction of a GPU, then test MIG or time slicing on that subset before redesigning the entire cluster. For any sharing approach, test under realistic load, watch tail latency and error rates, and roll it out gradually before you point critical production traffic at it.
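
To make the two sharing options concrete, here is a hedged sketch. The first document is a time-slicing configuration in the format the NVIDIA device plugin reads; the ConfigMap name, namespace, data key, and replica count are placeholders, and how it gets wired in depends on whether you run the GPU Operator or the standalone plugin. The second document shows a pod requesting a MIG slice; the resource name nvidia.com/mig-1g.5gb is one example profile and depends on your GPU model and MIG strategy.

```yaml
# Time slicing: advertise each physical GPU as 4 schedulable nvidia.com/gpu units.
# Note that time slicing provides no memory or fault isolation between pods.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config     # placeholder name
  namespace: gpu-operator       # placeholder namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
---
# MIG: request one hardware-isolated slice instead of a whole GPU.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-example   # hypothetical
spec:
  containers:
    - name: model-server
      image: your-registry/model-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # example profile; varies by GPU and MIG strategy
```

With time slicing, pods still request nvidia.com/gpu: 1, but several of them can now share one device. With MIG, each pod gets a dedicated partition with its own memory.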

Use Autoscaling and Queueing for Bursty Workloads

Batch and training workloads often arrive in bursts. If you size your cluster for the largest burst and leave it there, average GPU utilization will be low and costs will be high.

A more efficient pattern is:

Put batch and training work behind queues so total demand is visible.
Use cluster autoscaling, or a tool such as Karpenter, to add GPU nodes when queues grow.
Allow GPU pools to scale down once queues drain, so you stop paying for idle accelerators.

For real time inference, configure autoscaling to respond to request latency, queue depth, and GPU utilization directly. Tune autoscaling to maximize GPU utilization while protecting your latency and availability targets. When you tune scale down behavior, account for model load times and cold start impact. You don’t want to trade cost savings for latency spikes your SLOs can’t absorb.
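
A minimal sketch of an inference autoscaler with a cautious scale-down follows. It assumes a per-pod custom metric named inference_queue_depth, which is hypothetical and would need to be exposed through a metrics adapter. The replica bounds, target value, and scale-down windows are illustrative and should be tuned against your own model load times.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"            # scale out above ~10 queued requests per pod
  behavior:
    scaleDown:
      # Wait 10 minutes before scaling down so brief lulls don't
      # trigger expensive model reloads on the next spike.
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1                      # remove at most one pod...
          periodSeconds: 300            # ...every 5 minutes
```

The slow, one-pod-at-a-time scale-down trades a little idle GPU time for protection against cold starts.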

Sedai has a useful writeup on GPU resource management strategies that goes deeper on the queue-based and autoscaling side of this.

Turn Idle GPU Time into a Visible Cost

Engineering teams respond to data that connects technical choices to money. Turning idle GPU consumption into visible, shared metrics makes optimization part of normal operations instead of a finance-driven surprise. The standard stack for this is NVIDIA's DCGM exporter feeding into Prometheus and Grafana, which gives you per-workload GPU utilization alongside CPU and memory. If you're using Fairwinds Insights, GPU metrics are also available alongside CPU, memory, and cost data in the same view.

Useful views include:

Dashboards that show GPU utilization by namespace, team, and application.
Simple measures, such as idle GPU hours per week and estimated cost for that idle time.
Reports that highlight workloads with low GPU utilization and high spend across a period.

When platform, data, and finance stakeholders all see the same picture, it becomes much easier to justify changes to resource settings, scheduling policies, and templates.
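
As one possible starting point for those views, the Prometheus recording rules below pre-aggregate DCGM exporter data by namespace. They assume the exporter attaches a namespace label to GPU metrics, which depends on how it is deployed; GPUs with no pod attached may show up with an empty namespace. The rule names and the 5 percent idle threshold are assumptions.

```yaml
groups:
  - name: gpu-cost-visibility
    rules:
      # Average GPU utilization per namespace, for dashboards.
      - record: namespace:gpu_utilization:avg
        expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
      # Number of GPUs per namespace that averaged under 5% over the last hour.
      # Summing this hourly value over a week approximates idle GPU hours.
      - record: namespace:gpu_idle:count
        expr: count by (namespace) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 5)
```

Multiplying idle GPU hours by your hourly GPU node price in the dashboard layer gives the estimated cost of idle time without hardcoding prices into the rules.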

How Do I Keep AI Workloads Reliable on Kubernetes?

AI workloads don't only stress GPUs. They also introduce new failure modes and dependencies. Reliability comes from treating models as services, shipping them safely, and accounting for the data and storage constraints that come with them.

Treat Models as Services with Clear SLOs

In production, a model is another service that happens to do inference. It should have clear service-level objectives for response time and availability, exactly as you would define for any other API.

That means:

Documenting p95 or p99 latency targets and an availability goal for each model.
Using error budgets to decide when to prioritize reliability work over new features.
Building capacity plans that connect resource settings back to those objectives.

When reliability expectations are explicit, you can reason about how much headroom you need on GPUs and how aggressive autoscaling can be. You can also decide when it's acceptable to run closer to saturation in order to save cost.

Use Health Checks and Gradual Rollouts for Models

Models can fail in subtle ways that don't show up as container crashes. You want Kubernetes to recognize unhealthy model servers quickly and to roll out new versions carefully.

Key practices include:

Wrapping model servers in containers that provide lightweight health endpoints.
Configuring readiness and liveness probes so Kubernetes only routes traffic to healthy pods and can recycle unhealthy ones.
Using rolling updates or canary strategies for new model versions, combined with automated rollback based on metrics such as error rate or latency.

With these pieces in place, you can ship new models frequently without taking on more risk than you intend to. Reliability for AI services looks very similar to reliability for any other service running in the cluster. Many of the same hard-earned Kubernetes lessons still apply.
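
The probe and rollout practices above might look roughly like this. The /healthz and /ready paths are hypothetical endpoints your model server would need to provide, and the timings are placeholders to adjust for how long your model takes to load. GPU scheduling fields are omitted here to keep the focus on health and rollout settings.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # bring up one new pod at a time
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
    spec:
      containers:
        - name: model-server
          image: your-registry/model-server:latest
          ports:
            - containerPort: 8080
          # Gives the model up to 10 minutes (60 x 10s) to load
          # before liveness checks begin.
          startupProbe:
            httpGet:
              path: /healthz   # hypothetical endpoint
              port: 8080
            periodSeconds: 10
            failureThreshold: 60
          # Traffic is routed only once the model reports ready.
          readinessProbe:
            httpGet:
              path: /ready     # hypothetical endpoint
              port: 8080
            periodSeconds: 5
          # Restarts the container if the server stops responding.
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
```

Keep in mind that maxSurge: 1 needs spare GPU capacity for the extra pod during a rollout. If GPUs are fully allocated, you may need to allow some unavailability instead.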

Plan for Data, Storage, and Networking Constraints

GPU tuning alone doesn't keep AI workloads healthy. Large models and datasets place real pressure on storage and network paths, especially during rollouts and cold starts.

Plan for:

Fast Model Loading: Use local or node-attached storage for frequently used models to reduce I/O wait times.
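Optimized Image Pulls: Use image streaming or lazy pulling where your platform supports it (for example, Image streaming on GKE), or a peer-to-peer registry mirror such as Spegel, to mitigate cold starts caused by 10GB+ container images.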
Efficient Caching: Implement caching layers or artifact repositories that handle large binaries without bottlenecking.
Controlled Rollouts: Use staggered deployment patterns so you don't flood the network or storage systems every time a new model ships.

Because AI container images that bundle CUDA libraries and PyTorch are massive, your service may sit in ContainerCreating for minutes while the image pulls. Even with a fast autoscaler, these image-pull bottlenecks can break your recovery time objectives (RTO) if you don't use streaming or aggressive caching. If you account for these constraints along with CPU and GPU, you reduce the risk of intermittent slowdowns and timeouts that only appear at scale.
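
As a small illustration of the fast model loading idea, the pod below reads model files from a node-local directory instead of downloading them at startup. The host path, pod name, and image tag are hypothetical, and it assumes something else (for example, a DaemonSet or node bootstrap script) has already populated that directory. Many clusters restrict hostPath volumes for security reasons, so local persistent volumes may be a better fit in your environment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server-cached       # hypothetical
spec:
  containers:
    - name: model-server
      image: your-registry/model-server:v1   # pinned tag (placeholder)
      imagePullPolicy: IfNotPresent           # reuse the image if the node already has it
      volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true
  volumes:
    - name: model-cache
      hostPath:
        path: /mnt/local-ssd/models   # assumed pre-populated node-local directory
        type: Directory               # pod fails to start if the directory is missing
```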

How Do We Operate AI on Kubernetes Without Burning Out the Platform Team?

From a platform perspective, the challenge is less about a single cluster and more about repeatability across teams and workloads. Standardization and guardrails protect both the cluster and the people who run it.

Standardize Templates for AI Workloads

Platform teams feel the most pain when every AI workload shows up as a one-off configuration. A small number of well-designed templates can eliminate a lot of that overhead.

Create deployment templates or Helm charts specifically for:

Training jobs and other long-running batch processes.
Scheduled batch inference jobs.
Real-time inference services.

Each template should include defaults for GPU requests and node selectors, baseline CPU and memory requests, health checks, logging, and metric export. Getting those defaults right is largely the same problem as rightsizing any Kubernetes workload, and the same Kubernetes best practices apply.
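
For example, a template for scheduled batch inference might start from something like the CronJob below. The name, schedule, image, deadlines, and the workload-type label are all placeholder defaults, and the node selector and toleration reuse the illustrative node-type=gpu label and gpu=true taint from earlier in the article.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-batch-scoring       # hypothetical
spec:
  schedule: "0 2 * * *"             # every day at 02:00
  concurrencyPolicy: Forbid         # don't start a new run while one is still going
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 14400  # stop runs that exceed 4 hours
      ttlSecondsAfterFinished: 3600 # clean up finished jobs after an hour
      template:
        metadata:
          labels:
            workload-type: batch-inference
        spec:
          restartPolicy: Never
          nodeSelector:
            node-type: gpu
          tolerations:
            - key: "gpu"
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
          containers:
            - name: scorer
              image: your-registry/batch-scorer:latest   # placeholder image
              resources:
                requests:
                  cpu: "1"
                  memory: "4Gi"
                  nvidia.com/gpu: 1
                limits:
                  cpu: "2"
                  memory: "8Gi"
                  nvidia.com/gpu: 1
```

The deadline and concurrency settings act as built-in cost guardrails: a stuck job can't hold a GPU indefinitely, and overlapping runs can't silently double your GPU usage.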

Build in Guardrails for Cost and Reliability

Guardrails let platform teams support growth without reviewing every YAML file line by line. Policies around GPU usage and node types prevent runaway costs and unintentional impact on critical workloads.

Useful guardrails include:

Requiring resource requests and limits on AI workloads before they can be deployed.
Enforcing maximum GPU counts or specific node types per namespace or team.
Blocking deployments that skip approved templates or omit basic metadata, such as ownership and cost allocation labels.

You can enforce these policies with policy engines such as Kyverno and with the admission control layer in Kubernetes. Over time, these controls keep your clusters stable and your GPU usage aligned with budget and reliability goals.
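
Two of these guardrails are sketched below: a ResourceQuota that caps GPU requests in one namespace, and a Kyverno policy that rejects pods without ownership and cost labels. The namespace, quota size, label keys, and policy name are assumptions. The Kyverno fields follow the kyverno.io/v1 ClusterPolicy schema, which can differ between Kyverno versions, so check it against the version you run.

```yaml
# Cap the total GPUs a team's namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-a              # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"    # illustrative limit
---
# Reject pods in that namespace that lack owner and cost-center labels.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ownership-labels
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-owner-and-cost-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - ml-team-a
      validate:
        message: "Pods must set the owner and cost-center labels."
        pattern:
          metadata:
            labels:
              owner: "?*"           # any non-empty value
              cost-center: "?*"
```

Starting a new policy in Audit mode before switching to Enforce is a common way to see what would be blocked without disrupting teams.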

Turn AI on Kubernetes into a Repeatable Practice

Start by defining GPU node pools, standard templates for AI jobs and services, and basic cost and reliability checks for those workloads. Then iterate. Tighten rightsizing with real usage data, improve GPU observability with better metrics and dashboards, adjust autoscaling thresholds, and refine policies as usage grows.

If you're unsure where to start, these three steps in the next sprint can shift your trajectory:

Put all GPU nodes into labeled and tainted node pools.
Add explicit GPU requests and realistic CPU and memory requests to the top three GPU-consuming workloads.
Create a simple dashboard that shows per-workload GPU utilization and idle GPU hours by namespace.

With those in place, every other optimization in this article becomes easier to prioritize and measure.

As these practices solidify, AI on Kubernetes feels like just another part of your platform. At that point, it often makes sense to offload some of the operational burden to a partner that specializes in AI-ready Kubernetes clusters, while your teams stay focused on models and product features. That’s where AI-ready infrastructure and managed EKS services help.

AI Workloads and Kubernetes FAQ

Can Kubernetes manage GPU resources for AI workloads?

Yes. With GPU-aware node pools, device plugins, and explicit GPU requests, Kubernetes can schedule and isolate GPU workloads effectively. Autoscaling and queue-based patterns help keep those GPUs utilized rather than idle.

How do I measure GPU utilization on Kubernetes?

Install GPU metrics exporters, feed the data into your monitoring stack, and build views that show utilization by workload, namespace, and team. Make GPU usage visible alongside CPU, memory, and cost so teams can see where GPUs sit idle and where spend is concentrated.

What is GPU scheduling in Kubernetes?

GPU scheduling means configuring node pools, labels, taints, tolerations, and resource requests so the scheduler places GPU workloads on the right nodes, honors isolation rules, and packs workloads without stranding expensive capacity.

When should I use MIG or time slicing?

Use MIG when you have supported NVIDIA hardware and need stronger isolation and more predictable performance for shared GPUs, especially in multi-tenant or production inference environments. Use time slicing when you need a simpler way to share GPUs across lighter, less sensitive workloads such as development, testing, or bursty internal services, and test its impact on latency and reliability before you roll it out more broadly.
