This is the transcript of a talk presented at KubeCon Japan titled "Add-Ons Need Love Too: Increasing Cluster Security Through Add-On Maintenance."
Stevie Caldwell: My name is Stevie Caldwell. I’m a Tech Lead for the SRE team at Fairwinds.
If you’re not familiar with Fairwinds, we’re a company that builds and maintains Kubernetes clusters and the underlying infrastructure on which they run.
Andy Suderman: I'm Andy Suderman, the CTO at Fairwinds. I'm a CNCF Ambassador and the author and maintainer of several open source projects you may have heard of, including Goldilocks and Pluto.
Stevie is also the author and maintainer of another project we’ll talk about later called GoNoGo.
We’re both longtime Kubernetes users and always excited to talk about anything open source or CNCF-related.
Stevie: Our agenda for today is to first discuss why add-ons matter. We always talk about Kubernetes as a whole, and we often talk about add-ons individually in sessions like this—but we don’t generally address add-ons as an ecosystem. So we’re going to start by talking about what add-ons are, their purpose, and why they’re important. Then we’ll cover vulnerability trends, which are one of the main reasons we care about add-on security and upgrades. Next, we’ll discuss the operational struggle that prevents teams from upgrading add-ons on a reliable cadence. And finally, we’ll talk about best practices. Hopefully, you’ll leave here with some tools and ideas for managing add-ons in your clusters more reliably and diligently.
Stevie: So, why do add-ons matter? What are they? Why do we have them?
Let’s start with the story of building a cluster. Imagine you have an application you're really excited about, and you want to put it into a Kubernetes cluster so the world can experience it. You’ve got advice from trusted peers who say, “Yes, run this in Kubernetes”—so that’s what you’re going to do.
You spin up a Kubernetes cluster, starting with a baseline setup. With a baseline cluster, what you get is essentially your control plane. That includes all the components that manage the cluster: they schedule your workloads, reconcile the actual state of the cluster with your desired state, and store that state.
Then you have your data plane. That’s where your worker nodes live—the nodes where your actual workloads will run. The data plane includes components that manage communication between the data and control planes, and others that handle the underlying container runtime on the nodes.
You’ll also have components running in the cluster that enable pod-to-pod traffic across nodes and support service discovery.
Now, as a side note, some of those components I just mentioned are technically referred to as "add-ons." But here’s the thing: they’re not really optional. If you want your cluster to function—if you want your application to actually run—you need those particular add-ons. So it's a bit strange that they’re called add-ons, since they’re essential.
At this point, you’ve got your baseline cluster running, your application is deployed, it’s happy—and you're feeling good.
Now you want people to access it. You want to expose the application to users outside of your cluster. This is the first point—getting traffic into your cluster—where you really start to consider add-ons.
Technically, you don’t need an add-on to do this. You could use a Kubernetes primitive, like a Service of type LoadBalancer. But—let’s be honest—it’s just not done that way in production.
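For reference, that primitive is only a few lines of YAML. Here's a minimal sketch (the app name and ports are placeholders, not from the talk):

```yaml
# Minimal sketch: exposing an app with a plain LoadBalancer Service.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80          # port exposed by the cloud load balancer
      targetPort: 8080  # port the application container listens on
```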
If you're running a real workload, you're going to deploy an ingress controller or some kind of gateway. And that’s your first true add-on.
Stevie: Now you’ve got traffic coming into your cluster—but you want to secure that traffic. You want to encrypt it. Maybe you also want to encrypt east-west traffic—traffic between your services inside the cluster. That’s another place where you might start considering add-ons.
So, you choose an add-on to handle TLS. Now your incoming traffic is encrypted.
Next, you probably want to know what’s happening inside your cluster. How is your application performing? How is the cluster performing? If something starts going wrong, you’ll want metrics to help troubleshoot. That’s where observability add-ons come in.
So now, in this “baby” Kubernetes cluster—in its very infancy—you’ve already installed three or four add-ons just to get the basics working. And that’s just the beginning.
For a production-grade Kubernetes cluster, you’re likely going to add many, many more add-ons to support a range of capabilities.
Add-ons, it turns out, are really important for realizing Kubernetes’ full potential. Everything we talk about—scalability, reliability, efficiency, security—often hinges on the add-ons you install.
They also make life easier for engineers managing clusters. Add-ons help automate tasks like DNS updates, IP address changes, upgrades, and more—so we’re not doing those manually.
Here’s just a sampling of add-ons available in the ecosystem. And for each of them, there are multiple options. It’s not just the NGINX Ingress Controller—you also have Traefik, and others. There’s Kyverno, KEDA, the Horizontal Pod Autoscaler—just to name a few.
These add-ons are expected to be managed in lockstep with your Kubernetes version. That means: when you upgrade Kubernetes, you should also upgrade your add-ons. They’re meant to be treated as part of a unified whole.
Why is that important? Because when you have 10, 15, or more add-ons in your cluster, and any one of them gets exposed, it can do serious damage.
Let’s illustrate that with a couple of examples.
Here’s an RBAC role for an add-on. This particular add-on has access to all the secrets in your cluster. It can create, update, and access secrets across all namespaces. And we all know what secrets often contain—database credentials, certificates, sensitive config values. If this add-on is ever compromised, you’re going to have a very bad day.
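As a hedged illustration (not the exact role from the slide), a ClusterRole with that kind of reach looks something like this:

```yaml
# Illustrative only—a ClusterRole that can read, create, and update
# Secrets in every namespace of the cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-addon
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```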
Here’s another example—this time a security context from a different add-on. In this context, we see that it adds capabilities like SYS_PTRACE and SYS_ADMIN.
These capabilities mean the container can trace and inspect other processes on the node (SYS_PTRACE) and perform a wide range of administrative operations, such as mounting filesystems (SYS_ADMIN).
On top of that, privileged: true is set—which is essentially like running as root on the node. The container gets all Linux capabilities, can access host devices, and can bypass security controls like AppArmor or SELinux.
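Here's a hedged sketch of what such a container spec looks like (illustrative, not the exact manifest from the slide):

```yaml
# Illustrative only—an over-privileged container securityContext.
containers:
  - name: example-addon
    image: example/addon:latest   # placeholder image
    securityContext:
      privileged: true            # effectively root on the node
      capabilities:
        add:
          - SYS_PTRACE            # trace and inspect other processes
          - SYS_ADMIN             # broad admin operations, e.g. mounting filesystems
```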
Again—not something you want exposed.
So what does this mean for us as engineers?
You’ve got multiple add-ons running in your cluster that potentially have dangerous levels of access. If a vulnerability exists and it’s exploitable, you could be in real trouble.
There are two main ways to mitigate this risk:
Know what permissions your add-ons have. If there’s an opportunity to reduce permissions—like adjusting RBAC settings—do it. That said, some add-ons need these elevated permissions. For example, if an add-on uses eBPF under the hood, it will require kernel access and host networking.
If vulnerabilities are found and fixes are released, you need to be ready to upgrade quickly. Keeping your add-ons updated means you’re protected against known vulnerabilities.
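To make the first of those two points concrete, here's a hedged sketch of what tightening the earlier wildcard secrets access could look like—scoping it to a single namespace and a named Secret, read-only (all names are placeholders):

```yaml
# Illustrative only—narrowing secrets access from a cluster-wide wildcard
# to one named Secret in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-addon
  namespace: example-addon
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["example-addon-config"]
    verbs: ["get", "watch"]
```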
Because the vulnerabilities aren’t going away. The best defense is proactive maintenance.
Andy: These add-ons exist in all of our critical paths—north-south, east-west traffic, secrets management—they have access to nearly all of our cluster functionality.
The reality is, vulnerabilities in open source are only getting worse as our adoption of open source continues to grow.
According to the Black Duck Open Source Security and Risk Analysis Report, 86% of the software they analyzed contained vulnerable open source components. Now, this data is focused on the software supply chain—it’s not specifically about Kubernetes add-ons—but I believe the two go hand in hand. Much of what we see in the broader software supply chain also impacts our add-on supply chain.
We’re also seeing a sharp increase in CVEs reported in the Kubernetes ecosystem. In the 10 years Kubernetes has existed, the number of CVEs has grown dramatically—up 463%.
If you've been monitoring your CVE reports or have a vulnerability scanner running, you've likely seen this trend. This stat comes from a Sonatype report released last year that looks back over the past decade.
What’s even more concerning is that, despite the growing number of vulnerabilities, we’re getting slower at remediating them.
Ten years ago, it typically took 30 to 50 days to remediate a vulnerability. Fast forward to 2024—when this report came out—and we’re now looking at average remediation times of over a year for many CVEs.
As cluster operators, we’re under a lot of pressure from this constant influx of vulnerabilities across all the software we rely on.
Stevie: We’re still getting notifications to remediate CVEs from 2024—several came in just this week, in fact. And it’s like: “Okay, now we’re fixing these”—but they’re already pretty old. And it’s still happening.
Andy: On the flip side, while we have increasing pressure from the supply side of vulnerabilities, we’re also facing increasing pressure from the regulatory side to fix them.
There are now multiple frameworks and executive orders that all point toward the need to spend more time addressing and patching vulnerabilities.
For example, the US executive order on improving the nation’s cybersecurity, NIST’s secure software development guidance, and industry standards like PCI DSS all set expectations for how quickly known vulnerabilities need to be remediated.
So, across the board, we’re seeing increased regulatory pressure—no matter which standard or governing body is relevant to the software we’re building.
Meanwhile, the number of vulnerabilities continues to rise.
As cluster operators, we’re stuck in a difficult position—caught between the growing supply of vulnerabilities and the increasing demand to fix them quickly and thoroughly.
Stevie: So why is it so hard for us to maintain these add-ons? Why is it so difficult to upgrade them on a reasonable cadence?
It’s because the struggle is real.
One of the best ways to understand this is by looking at where we’ve come with Kubernetes upgrades—because we used to face very similar challenges there.
In the early days, it wasn’t unusual for an engineer to deploy a Kubernetes cluster and just leave it at that version. If everything was running fine and there weren’t any must-have features needed by the team, you'd let that cluster ride for as long as possible.
Why? Because upgrading was scary—and Kubernetes is complex.
Remember the Kubernetes 1.5 migration when you had to upgrade etcd? Back then, we didn’t have tools like Velero to easily handle cluster snapshots and backups. If you were backing up your cluster at all—and let’s be honest, most of us weren’t—you were probably running a script that used etcdctl commands.
And even if you were backing it up, you likely weren’t testing those backups or doing restore drills. So you definitely didn’t feel confident saying, “If something goes wrong during the upgrade, we’ll just restore.” That level of confidence didn’t exist.
So what did people do instead? They stayed on older Kubernetes versions as long as they could to avoid the risk.
At our company at the time, we—and many others—just spun up brand new clusters and migrated workloads over. That felt like the safer and easier option.
Andy: We took the more fun route of just taking down the whole control plane all at once so that we could upgrade etcd and let the workloads keep running.
Stevie: Oh, you were living the risky life—how exciting.
Yes, upgrading is scary. A small mistake can absolutely destroy your cluster and make things very bad for you.
On top of that, Kubernetes is complex. There are a lot of moving parts, and sometimes the complexity isn’t even about Kubernetes itself—it’s about the things that use Kubernetes.
When you wanted to upgrade to a new version, you had to look through your codebase, find any deprecated API versions, and update them.
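As one well-known example of what those searches turn up, Ingress objects eventually had to move off their original API group. A minimal sketch (the v1 schema also changed slightly, so the bump was rarely a one-line edit):

```yaml
# Before: removed in Kubernetes 1.22
apiVersion: extensions/v1beta1
kind: Ingress
---
# After: the supported API version
apiVersion: networking.k8s.io/v1
kind: Ingress
```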
And let’s be real—many of you probably weren’t doing that work yourselves. SREs and platform engineers typically aren’t the ones managing application deployments. That meant chasing down developers and asking them to update their manifests and redeploy—just so you could do a cluster upgrade.
Developers often didn’t want to deal with it—because they didn’t understand why it mattered. They didn’t feel the operational urgency behind the request. That lack of alignment made Kubernetes upgrades hard.
But things have started to improve.
According to the 2023 Datadog report, we’re gaining velocity in keeping Kubernetes versions up to date. The report showed that about 50% of users were running clusters that were 16 months old or newer.
So, people are upgrading clusters more frequently now. Why?
One big factor is the rise of cloud service providers managing Kubernetes.
According to the 2024 CNCF survey, 46% of users are running their clusters on providers like EKS, GKE, and AKS.
Why does that matter? Because those providers want you to upgrade—and often, they won’t give you a choice.
So given all this progress—how we’ve moved from risky, painful Kubernetes upgrades to a more stable and predictable cadence—why are we still struggling with add-ons?
What’s stopping us from reaching that same kind of velocity when it comes to maintaining and upgrading the add-ons in our clusters?
"Set it and forget it" syndrome is very real.
I’d wager that everyone in this room has, at some point, deployed an application or add-on—and then just moved on. As long as it wasn’t broken, wasn’t causing headaches, and was doing what it was supposed to do, we left it alone.
Much like those old Kubernetes clusters, we just didn’t touch it.
So why does that happen? Well, first of all—we have a lot on our plates. If it ain’t broke, don’t fix it, right?
Second, these add-ons often come with complex compatibility requirements. Upgrading them isn’t always straightforward.
Some add-ons require specific versions of Kubernetes, so you may go to update one and realize—oh no, it also requires a cluster upgrade you’re not ready for. Now you’ve got a bigger problem.
And sometimes, the dependencies aren’t even directly related to the add-on you’re upgrading.
Take cert-manager, for example. Not to pick on cert-manager, but a few years ago they changed their annotation prefix—from certmanager.k8s.io to cert-manager.io.
That change required every Ingress object in your cluster to be updated with the new annotations. If you didn’t, and you were using something like the nginx Ingress Controller with the cert-manager shim, your certificate renewals would fail.
Any time you deployed a new service expecting a cert to be issued, it would just break.
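Concretely, every Ingress needed an edit along these lines (a hedged sketch; the issuer name is a placeholder):

```yaml
# Illustrative sketch of the annotation rename on an Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    # Old prefix—no longer recognized after the upgrade:
    # certmanager.k8s.io/cluster-issuer: letsencrypt-prod
    # New prefix:
    cert-manager.io/cluster-issuer: letsencrypt-prod
```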
So, upgrading cert-manager suddenly turned into a major effort—way beyond a single Helm upgrade.
Then there’s the upgrade process itself. It’s manual. Even if you’re using GitOps tools like Argo CD or Flux, you still have to: track down the new version, read the release notes or changelog, check for breaking changes and CRD updates, bump the chart version, and adjust any values that changed.
Sometimes you’re digging through GitHub changelogs or PRs, other times you get lucky and there’s a clean, bulleted list on a website. But there’s no standardization.
And remember—this isn’t just for one add-on. If you’ve got five add-ons in your cluster, that’s five different places you have to check for upgrade steps and risks.
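For context, this is roughly where that version bump lives when an add-on is managed with Flux—a minimal sketch (the exact apiVersion depends on your Flux release, and Argo CD Applications have an equivalent field):

```yaml
# Illustrative sketch: the chart version you bump for each add-on upgrade.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 1h
  chart:
    spec:
      chart: cert-manager
      version: "1.x.y"   # placeholder—this is the line you change after reading the changelog
      sourceRef:
        kind: HelmRepository
        name: jetstack
```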
It’s homework. And it’s not easy.
Then there’s testing. Of course, we all deploy to lower environments before pushing to production. None of us are deploying straight to prod… right?
But that means more steps—testing the upgrade, validating it works, and then figuring out how to roll it back if it doesn’t. And rollback isn’t always simple, especially when you’re dealing with Custom Resource Definitions (CRDs).
All of this takes time—and time is something we rarely have. Most of us don’t have extra cycles to dedicate to this kind of work.
So, we end up in a situation where even critical add-ons go unpatched—not because we don’t care, but because we’re overwhelmed.
So, what’s the solution? How do we address this challenge? How do we move forward? How do we get our add-ons to the same place we’ve gotten Kubernetes version upgrades—where upgrades are routine, expected, and manageable?
Andy: Those are great questions. (The easy answer is: hire Fairwinds for add-on management—but don’t worry, this isn’t a pitch.)
I want to share a quote I love, one that goes back to the early days of DevOps. It’s from the Continuous Delivery book, and a fellow engineer said this to me years ago. It’s been a mantra of mine ever since:
“If it hurts, do it more often, and bring the pain forward.”
— Jez Humble, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation
The idea is simple: Break things into smaller chunks. Do them sooner. Do them more often. Eventually, they stop hurting so much.
And I’d extend that thinking to say:
“Just as we release our software continuously, we should patch our infrastructure continuously.”
— Andy Suderman (me)
That’s the Holy Grail. We’ve been talking about it for a long time: not just patching apps, but patching our infrastructure—all the underlying components, all the add-ons—continuously, alongside everything else we do.
Now, as Stevie mentioned earlier, time is a major barrier. But there is tooling that can help reduce the burden.
At Fairwinds, we’ve built and maintained several open source projects to support this kind of continuous infrastructure maintenance: Pluto, which detects deprecated Kubernetes API versions in your manifests and Helm releases; Nova, which tells you when the Helm charts and container images you’re running have newer versions available; and GoNoGo, which helps you define and check whether an add-on upgrade is safe to roll out.
This is the kind of automation and tooling that can move us toward safer, faster, more consistent add-on upgrades—and help bring that pain forward so it hurts a lot less over time.
If you want to fully automate the process—let things update on their own and then deal with any breakage afterward—you can use a tool like Renovate.
There are strategies to make that less painful and avoid taking down your production environment. One common pattern is to automatically update your non-production environments, observe what breaks, fix any issues, and then roll out the updates to production.
You can also use tools like Dependabot, which is similar to Renovate. It helps you see what updates need to happen and can automatically generate pull requests against your infrastructure-as-code repos.
If you don’t have strict configuration requirements—or simply don’t want to manage certain things—some cloud-managed add-ons are available. Services like GKE Autopilot or EKS Auto Mode can handle updates for you.
They’re limited in scope and configurability, so it depends on your needs, but they can remove some of the burden of keeping add-ons current.
In summary: upgrade your add-ons more frequently and in smaller steps, lean on tooling and automation to cut down the manual work, and stage your rollouts so breakage shows up somewhere other than production.
Stevie: The last thing you want is for a severe vulnerability to be disclosed that requires a patch to a specific version—and you’re many, many versions behind. That situation is far more painful and more likely to introduce breaking changes into your environment.
So yes, increased frequency is a big one.
Andy: For the foreseeable future—unless you’re paying Chainguard a lot of money—your security updates and feature updates will remain tightly coupled with the open source tools you rely on. So we have to stay on top of them. We need to treat our add-ons the same way we’ve learned to treat our Kubernetes clusters—keep them updated and keep moving forward.
And of course—just to say it again—use a staged approach. Please don’t test in production.
Learn more about Fairwinds Add-on Management.