Teams run their own LLM inference for a range of reasons: data privacy, cost control at volume, latency, the ability to fine-tune, or just the need to operate independently of third-party rate limits and pricing changes. When they say they want to run inference on Kubernetes, they usually mean they want to host a model themselves rather than routing prompts to an API like OpenAI or Anthropic.
That means running a large file of trained weights, typically pulled from Hugging Face, stored in Amazon S3 buckets, or in Amazon Elastic File System (EFS), on your own infrastructure. The Qwen 3 family, for example, ranges from 7 billion to 235 billion parameters. A 27 billion parameter model can require 24 to 32 GB of GPU memory just to load.
Getting that infrastructure right is harder than it looks, and it tends to generate the same specific questions. We recently ran a live, hands-on session on EKS and answered questions from attendees. This post covers the key takeaways and those questions directly.
Kubernetes practitioners who haven't run inference before often assume it works like a typical web app: containerize the model server, write a Deployment, configure an HPA, and you’re done. The core assumptions behind that workflow don't apply here.
An inference pipeline runs through three distinct stages:
That shifting resource profile means an inference pod doesn't behave like a standard web app pod. A web app pod is interchangeable: if one is unhealthy, another can handle the traffic. Inference pods carry large amounts of state, take 15 to 30 minutes to come up after loading model weights, and have codependencies in disaggregated deployments that standard Kubernetes scheduling doesn't account for.
If you try to scale an inference workload like a web app, you'll get bad results.
There's also a cost dimension that doesn't exist for most workloads. G5, G6, and P5 instances with multiple GPUs run into tens of dollars per hour. G6.12xlarge instances, which provide four L4 GPUs, are a common choice for mid-size models. P5 instances with higher-end accelerators are significantly more expensive. GPU memory is often the binding constraint that determines which instance types are viable.
Four areas of the cluster need to be addressed before deploying any inference tooling.
Kubernetes doesn't recognize GPUs without the NVIDIA device plugin, deployed as a DaemonSet. On Bottlerocket AMIs, this is baked in. On EKS, Karpenter handles node provisioning: configure node pools with the instance types that carry the GPU accelerators you need, add taints so general workloads don't land on GPU nodes, and it will spin up the right hardware when a workload requests it. Getting familiar with the instance type, GPU type, GPU count, and GPU memory matrix for your target accelerators is necessary groundwork before anything else.
In a disaggregated inference setup, pre-fill and decode pods are distinct components that depend on each other. If pre-fill pods come up but decode pods don't, none of them are doing useful work. Standard Kubernetes will schedule partial pod sets without complaint. A gang scheduler understands those codependencies and handles placement accordingly. Even in simpler, all-in-one setups, pod topology constraints and affinity rules require careful attention.
Tensor parallelism distributes model weights across multiple GPUs, which means the pods need to be scheduled close to each other to keep cross-GPU communication latency low. Inference responses stream rather than return all at once, which means proxy timeouts and WebSocket support become configuration concerns at the ingress layer.
An HPA configured for CPU and memory is the wrong tool for inference pods. The meaningful signals are GPU utilization, queue depth, and time to first token. KEDA supports custom scaling metrics and can drive autoscaling decisions based on those values. This requires that the inference tooling you choose exports those metrics, which not all options do.
Three common options cover a spectrum from quick-to-run to production-ready. Knowing where each one breaks down matters as much as knowing what it does well.
vLLM is a high-performance inference engine that supports multi-GPU tensor parallelism, extensive configuration, and Prometheus metrics. Deploying it directly from the official Helm chart gives you a single pod running all stages of the inference pipeline on one node.
It's the right starting point for proving a model runs in your environment. Load the model weights from S3 into a persistent volume to speed up restarts, then verify the endpoint returns valid responses. That's the extent of what this approach gives you. It provides no autoscaling, no built-in observability, and no model lifecycle management. Treat it as a proof of concept and don't build production infrastructure on top of it.
Ollama is designed for simplicity. It's easy to deploy, supports multiple models simultaneously, and exposes a clean API. It's the standard choice for local development. Running multiple models side by side on G6 instances works well.
The production limitations for Ollama are real. Ollama doesn't expose Prometheus metrics in a standard format and it has no built-in autoscaling. It's a reasonable choice for a small model on a single GPU, a development environment, or an internal use case where operational visibility isn't a priority. For anything requiring elastic scaling, scale-to-zero, or metrics-driven alerting, it's the wrong tool.
KubeAI is built specifically for production inference on Kubernetes and addresses the gaps in both of the preceding options directly.
The core abstraction is a Model custom resource. You declare what model you want, which engine to use (vLLM for GPU workloads, Ollama for CPU), the accelerator profile, GPU parallelism, and replica bounds. KubeAI handles the rest. It spins up a preload job that caches model weights into EFS, uses that cache on subsequent pod starts to reduce startup time, and cleans up the cache automatically when a model is deleted so EFS doesn't accumulate stale data.
All inference requests route through KubeAI's own service before reaching the model pod. That routing layer is what enables autoscaling and scale-to-zero: KubeAI has visibility into queue depth and can bring GPU nodes up when demand appears and shut them down when it doesn't, without any change to the endpoint the caller sees.
Prometheus metrics are available out of the box: token throughput, time to first token, request duration broken out by prompt and generation length, context window utilization. That's the observability foundation required to run inference workloads in production.
The performance difference from multi-GPU parallelism is concrete. Running the same model with 4-GPU tensor parallelism under KubeAI on a G6.12xlarge produces significantly faster token output than the same model on a single GPU under Ollama. Same model, same cluster, different infrastructure underneath it.
The tradeoff: KubeAI has more configuration surface area than Ollama and more operational complexity than a standalone vLLM chart. It's not the right tool for a quick proof of concept. It’s the right tool when you're building something that needs to stay running.
Use the vLLM Helm chart when you want to verify a model runs in your cluster before committing to more complex infrastructure. Use vLLM through KubeAI when you're building something that needs autoscaling, observability, model caching, and lifecycle management.
KubeAI uses vLLM as its GPU inference engine, so you're not choosing between them so much as choosing whether to use vLLM with or without an orchestration layer around it.
Start with time to first token, GPU utilization, and queue depth.
GPU memory is frequently the bottleneck: if a model is consuming all available GPU memory on a node, performance degrades before the pod shows any obvious problems. Also check whether you're hitting context window limits, which can cause generation to stop or degrade. CPU and memory still matter for the non-GPU parts of the pipeline, but they're rarely the primary constraint on nodes dedicated to inference.
Decide on your deployment architecture first: all-in-one pods running all pipeline stages per GPU, or a disaggregated setup with separate pre-fill and decode pods.
That decision drives your networking requirements, gang scheduling needs, and pod placement strategy. Then choose your inference tooling and make sure it exports the metrics you need. Next, configure your node pools, Karpenter profiles, taints, and RBAC. Trying to retrofit those decisions after the fact is painful.
Karpenter's on-demand node provisioning prevents GPU nodes from running when nothing needs them. KubeAI's scale-to-zero support takes it further, shutting down GPU instances when no models are actively serving requests.
Both mechanisms help, but they don't replace governance. Giving engineers unconstrained access to GPU node pools will produce expensive instances sitting at low utilization. RBAC, node pool restrictions, and cost observability are required layers.
The split between platform and application concerns doesn't disappear with inference workloads, it gets more pronounced.
The platform team owns the Kubernetes fundamentals: autoscaling, node management, RBAC, workload isolation, GitOps, policy guardrails, the GPU node pools, and the inference orchestration layer. Application teams own the model selection, inference configuration, and the workloads that consume the inference endpoints. Getting those boundaries defined early prevents a lot of ambiguity about who owns what when something breaks.
Kubernetes is still the foundation. Inference is a new workload type with specific infrastructure requirements, but it runs in Kubernetes.
Every investment in cluster fundamentals, node management, policy, and platform tooling transfers directly. If the cluster is in solid shape, adding GPU node pools and inference tooling is an extension of existing work. If the fundamentals are shaky, that will surface before you get anywhere near a working inference endpoint.
Check out the full walkthrough, including live demos on EKS with vLLM, Ollama, and KubeAI. Questions are welcome in the Fairwinds community Slack.
If you're working through this for your own infrastructure, talk to the Fairwinds team.