How We Migrated a Production Cluster from Ingress NGINX to Gateway API

Back in January, we published a post about the retirement of ingress-nginx and laid out what that means for teams still running it. We also gave a CNCF CloudNative Live talk a couple of years ago that walked through the case for Gateway API.

This post is what came after: we actually did it with a customer. One migration, on real infrastructure, under real traffic. Here's the plan Stevie Caldwell designed and what I learned executing it.

Why Gateway API

Gateway API is a Kubernetes-native specification for routing traffic, not an API gateway product, and not a drop-in replacement for any specific ingress controller. Our January post covers the full decision framework. The short version: ingress-nginx has served teams well for years, but its configuration model has real limits. As our CTO Andy Suderman put it in the CNCF talk, when you get into complex routing or rate limiting, configuration gets fragmented across annotations and ConfigMaps in ways that are difficult to audit and even harder to hand off to service owners.

Gateway API splits responsibility across objects. The cluster admin owns the Gateway. Application teams own the HTTPRoutes. The resources are typed and validated at the API level rather than parsed from string annotations at runtime. When Stevie demoed an annotation typo in the talk (an integer where a string was expected), it failed silently. Gateway API would have rejected that at apply time.

That separation also opens up better traffic control. Stevie demoed weighted routing between two backends using only HTTPRoute fields, no service mesh required. That capability became directly useful in this migration, as you'll see in the cutover section.
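
As a minimal sketch of what weighted routing looks like using only HTTPRoute fields (this is not the manifest from the talk; the route, Gateway, namespace, Service names, and port are all hypothetical):

```yaml
# Illustrative only: every name, namespace, and port here is a placeholder.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-weighted-route
  namespace: example-app
spec:
  parentRefs:
    - name: example-gateway        # the Gateway owned by the cluster admin
      namespace: gateway-system
  hostnames:
    - app.example.com
  rules:
    - backendRefs:
        # Weights are proportional: about 90% of requests go to v1, 10% to v2.
        - name: example-app-v1
          port: 8080
          weight: 90
        - name: example-app-v2
          port: 8080
          weight: 10
```

The application team can change the weights in its own namespace without touching the Gateway, which is the ownership split described above.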

Why kgateway

Choosing Gateway API still means choosing an implementation. We evaluated several, including Envoy Gateway, Cilium Gateway, Traefik, and NGINX Gateway Fabric. We landed on kgateway. Here’s why:

Maturity

kgateway started as Gloo by Solo.io in 2018 and was accepted into the CNCF Sandbox in March 2025. The production history matters: it's not a new project.

Feature Coverage

For this migration, our client needed Cross-Origin Resource Sharing (CORS) support, a Web Application Firewall (WAF) with customizable rulesets, and a replacement for ExternalName services. kgateway covered all three. Check the kgateway feature docs to confirm coverage for your specific requirements before committing to an implementation.

Architecture

When evaluating implementations, check whether the project supports multiple Gateways per cluster. At the time of our evaluation, some implementations only supported one due to control and data plane coupling. That's a real constraint if you need separate internal and external gateways.

Translating Ingress to Gateway API

Before you touch any tooling, sort out how many Gateways you need, whether they're external or internal, and whether you're using wildcard certs or per-host TLS. Those decisions affect your Gateway and listener structure and are harder to change mid-migration.

The kgateway ingress2gateway tool handles a lot of the mechanical translation. It's a fork of the upstream tool with expanded ingress-nginx annotation support and generates kgateway-specific resources like TrafficPolicy and BackendConfigPolicy directly. Run it with --providers=ingress-nginx --emitter=kgateway against your cluster or against files. You'll also need a GatewayClass in your cluster before any Gateway resources will work. Think of it the same way as IngressClass: it's what binds the Gateway to the kgateway controller.
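
For reference, a GatewayClass is a small object. The sketch below assumes the controller name kgateway.dev/kgateway and a class named kgateway; treat both as assumptions and check what your installation actually registers, since the Helm chart may already create the class for you:

```yaml
# Assumption: controllerName and the class name match your kgateway install.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: kgateway
spec:
  controllerName: kgateway.dev/kgateway
```

Each Gateway then points at this class through spec.gatewayClassName, the same way an Ingress points at an IngressClass through spec.ingressClassName.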

One thing to know before you run it: it doesn't handle pathType: ImplementationSpecific. If any of your Ingress objects use that path type, the tool errors out for those objects and produces nothing. Most instances can be updated to pathType: Prefix without behavior changes, but review each one before making that decision.
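
The change itself is one field per path. A hypothetical Ingress after the update might look like this (host, path, Service name, and port are placeholders):

```yaml
# Illustrative only: names, host, and port are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix   # was: ImplementationSpecific
            backend:
              service:
                name: example-app
                port:
                  number: 8080
```

Prefix matches on whole path elements, so paths that relied on regular expressions or partial-segment matches are the ones that deserve a closer look.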

Beyond ingress2gateway, the broader translation work involves remapping nginx annotations to kgateway policy resources. Annotations that controlled proxy behavior, timeouts, buffer sizes, and redirects now live in ListenerPolicy, TrafficPolicy, or BackendConfigPolicy objects, each targeting a specific Gateway, HTTPRoute, or Service. A timeout that was a single annotation becomes a separate YAML object you apply independently. Running ingress2gateway gives you a good starting point, and it prints warnings for anything it couldn't convert, which makes it easy to identify what requires manual translation. For those, look up the appropriate mapping in the kgateway documentation and create the corresponding policy resources by hand. Plan time for this inventory work: the more annotations your Ingress objects carry, the longer it takes.
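
To make the shape of that change concrete, here is a rough sketch of a request timeout expressed as a TrafficPolicy. The apiVersion and the timeouts.request field are my reading of the kgateway API and may differ in your version, the names are hypothetical, and the nginx annotation in the comment is only an approximate counterpart:

```yaml
# Approximate counterpart on the old Ingress:
#   nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
# Assumption: field names match your kgateway version; verify against its API reference.
apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: example-timeout
  namespace: example-app
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: example-httproute
  timeouts:
    request: 60s
```

The semantics aren't identical (nginx measures time between reads from the upstream, while a request timeout covers the whole request), so treat each translated value as something to re-validate rather than copy across.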

Moving Traffic Safely

NLB and ALB setups diverge here, so know which you're on before you start.

NLB

I used weighted DNS via external-dns to cut over without downtime. Add weighted routing annotations to the existing Ingress, deploy the HTTPRoute with its Route 53 DNS record at a weight of zero, verify both paths independently, then gradually shift traffic by adjusting the Route 53 weighted DNS records managed by external-dns. Once the HTTPRoute's DNS record reaches a weight of 100, retire the Ingress.

For example:


# On the existing Ingress
annotations:
  external-dns.alpha.kubernetes.io/aws-weight: "100"
  external-dns.alpha.kubernetes.io/set-identifier: ingress-<appname>

# On the HTTPRoute (start at zero)
annotations:
  external-dns.alpha.kubernetes.io/aws-weight: "0"
  external-dns.alpha.kubernetes.io/set-identifier: httproute-<appname>

Source: external-dns weighted routing documentation

One thing I caught during cutover: spec.hostnames on the HTTPRoute has to match the production ingress hostname exactly when you're ready to switch over. If you tested with a wildcard or a different subdomain level, you may also need to update the hostname on the matching listener (spec.listeners[].hostname) on the Gateway itself.
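
A sketch of the two places the hostname has to line up, using placeholder names, a placeholder certificate Secret, and an assumed GatewayClass name of kgateway:

```yaml
# Illustrative only: names, hostnames, and the Secret are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
  namespace: gateway-system
spec:
  gatewayClassName: kgateway          # assumption: your class name may differ
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: "*.example.com"       # must cover the HTTPRoute hostname below
      tls:
        mode: Terminate
        certificateRefs:
          - name: example-wildcard-tls
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-httproute
  namespace: example-app
spec:
  parentRefs:
    - name: example-gateway
      namespace: gateway-system
  hostnames:
    - app.example.com                 # the exact production hostname
  rules:
    - backendRefs:
        - name: example-app
          port: 8080
```

A route only attaches when its hostnames intersect with the listener's hostname, and a wildcard listener matches subdomains only, so a listener set to *.staging.example.com during testing will not accept a route for app.example.com.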

ALB

The weighted cutover doesn't apply for ALB. ALBs use a different provisioning path through kgateway: the Envoy service runs as ClusterIP, kgateway creates an Ingress object for it, and that Ingress carries the ALB annotations. When external-dns sees that Ingress, it will try to create Route 53 records pointing at the ClusterIP, which isn't externally routable. Add external-dns.alpha.kubernetes.io/exclude: "true" to any HTTPRoute that shouldn't get its own DNS record managed by external-dns.

For example:


apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-httproute
  annotations:
    external-dns.alpha.kubernetes.io/exclude: "true"

Source: external-dns annotation documentation

If you forget this on a production hostname, it can result in a traffic interruption.

kgateway Behaviors That Differ From nginx

Several kgateway behaviors differ from nginx in ways that only surface under real traffic. Here's what to expect.

WebSockets

nginx allows upgrades by default. kgateway doesn't. Without enabling WebSocket upgrades on the Gateway listener via HTTPListenerPolicy and upgradeConfig, clients got 403 errors even when the backend and HTTPRoute were correctly configured. WebSocket apps also need their own HTTPRoute rather than inheriting the same policies as regular HTTP traffic.
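
A minimal sketch of that policy, assuming the gateway.kgateway.dev/v1alpha1 API version and an enabledUpgrades list under upgradeConfig (confirm both against the kgateway version you run; the names are hypothetical):

```yaml
# Assumption: apiVersion and field names match your kgateway version.
apiVersion: gateway.kgateway.dev/v1alpha1
kind: HTTPListenerPolicy
metadata:
  name: example-allow-websocket
  namespace: gateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: example-gateway
  upgradeConfig:
    enabledUpgrades:
      - websocket
```

Because the policy targets the Gateway, it lives with the cluster admin's resources rather than with the application's HTTPRoute.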

Intermittent 503s After Deploys

Envoy can reuse pooled upstream connections that outlive pod termination, resulting in requests being sent to backends that are no longer available. These are gateway-to-upstream connection failures, not application errors. The fix is to align connection timeouts on both sides: set idleTimeout on BackendConfigPolicy (the default is 1 hour) and/or reduce keepalive TTL in the app so the gateway closes idle connections before reusing them.
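
As a rough sketch of the gateway side of that fix, the policy below sets a shorter idle timeout for one Service. I'm assuming idleTimeout sits under commonHttpProtocolOptions, which is how I understand the BackendConfigPolicy API; the names and the 30s value are placeholders, not a recommendation:

```yaml
# Assumption: field placement matches your kgateway version; verify before applying.
apiVersion: gateway.kgateway.dev/v1alpha1
kind: BackendConfigPolicy
metadata:
  name: example-idle-timeout
  namespace: example-app
spec:
  targetRefs:
    - group: ""
      kind: Service
      name: example-app
  commonHttpProtocolOptions:
    # Keep this shorter than the app's own keepalive timeout so the gateway
    # closes idle connections before the backend does.
    idleTimeout: 30s
```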

Rate Limiting Isn't Per-Client IP by Default

The local TrafficPolicy token bucket is shared per gateway pod. Per-IP protection requires global rate limiting with RemoteAddress, Redis, and a rate limit service. If you're migrating nginx rate limiting and expecting the same behavior, you won't get it without additional configuration.
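
For comparison, this is roughly what the local variant looks like. The rateLimit.local.tokenBucket layout is my understanding of the kgateway TrafficPolicy API and should be checked against your version; the names and numbers are placeholders:

```yaml
# Assumption: field names match your kgateway version.
apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: example-local-ratelimit
  namespace: example-app
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: example-httproute
  rateLimit:
    local:
      tokenBucket:
        # One bucket per gateway pod, shared by every client hitting that pod.
        maxTokens: 100
        tokensPerFill: 100
        fillInterval: 60s
```

Nothing in this object keys on the client address, so a single noisy client can drain the bucket for everyone else routed through the same pod.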

Before chasing any of these as misconfigurations, check the kgateway GitHub issue tracker. Nginx parity gaps, edge-case failures, and doc corrections show up there first, often with a fix already on the roadmap. Some of the issues early migrations hit may already be resolved by the time you're reading this.

Lessons from the Migration

Install kgateway and update supporting tools in separate PRs. Bundling kgateway, cert-manager, and external-dns Gateway API support into one pull request can cause issues. In an Argo CD environment with the Argo CD Vault Plugin (AVP), sync order isn't guaranteed, and AVP environments are prone to crashlooping when dependencies land out of order. Splitting the work into two PRs avoids that.

Run the kgateway controller at two replicas from the start. Deployed as a single replica, the controller can block the rolling process during a node rotation event.

Sort out cert-manager namespace topology before you start. cert-manager creates certificate secrets in the same namespace as the Gateway. Cross-namespace certificate references aren't supported yet. If your current setup has certs and Ingress objects in different namespaces, that affects where the Gateway has to live. Also note that cert-manager's Gateway API support is off by default: you have to enable it explicitly in your cert-manager deployment (a feature gate in older releases, a dedicated enable option in newer ones) before it will provision certificates for Gateway resources. Account for this at design time.
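
As a sketch of how the namespace constraint shows up in practice, here is a Gateway annotated for cert-manager. The issuer name, hostname, Secret name, and class name are hypothetical, and it assumes Gateway API support is already enabled in cert-manager:

```yaml
# Illustrative only: issuer, hostname, Secret, and class names are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
  namespace: gateway-system
  annotations:
    cert-manager.io/cluster-issuer: example-issuer
spec:
  gatewayClassName: kgateway          # assumption: your class name may differ
  listeners:
    - name: https
      hostname: app.example.com
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          # cert-manager creates this Secret in the Gateway's own namespace
          # (gateway-system here), not in the application's namespace.
          - name: app-example-com-tls
```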

Keeping Kubernetes Current

There's a lot of work involved in keeping a Kubernetes cluster current. This migration is one concrete example of what that looks like: both the planning and the things you only find out after traffic hits. If you're still on ingress-nginx, it's time to figure out where you're moving.

Questions about how this applies to your cluster? Talk to the Fairwinds team.