Top 5 Tips for Better Kubernetes Self Service

When you've got a medium- to large-sized company, you need a platform to help your application teams ship code into production in a standardized way. This ensures that your applications are easier to maintain, scalable, secure, and cost-efficient. A platform can provide a standardized development environment, automated deployment and scaling, centralized monitoring and logging, and cost optimization. In other words, a platform can help you to improve the quality and efficiency of your application development and deployment process.

Kubernetes creates a common language for dealing with cloud infrastructure. It provides a single interface for provisioning machines to run your workloads, for provisioning disk, for getting ingress set up, for provisioning certificates for your applications, for scaling your apps and services, and many other things. That means that Kubernetes can really be the backbone of your infrastructure. But Kubernetes is kind of like AWS; you wouldn't want to just give your developers keys to the Kubernetes cluster and say, “do whatever you want.” You really need to layer some more abstraction on top of K8s to make sure that your dev teams are doing things in a secure, efficient, reliable way.

TL;DR

To really run a platform, you need Kubernetes itself as well as cluster add-ons to provision SSL certificates, get traffic to your cluster, put policies in place, and monitor resource usage; a way to deploy and run apps in your K8s cluster; a consistent and integrated way of delivering feedback to dev teams; and a method of governance for Kubernetes to align to best practices and minimize problems
Kubernetes offers a lot of opportunities, but you also need to onboard developers and make it easy for them to deploy and run apps and services while following best practices; in other words, you need to improve Kubernetes self service
1. Your first step to improving K8s self service is understanding the lay of the land: ask a lot of questions to help you understand what’s working and what’s not
2. Once you start getting those answers, you’ll almost certainly discover a lot of issues that you’ll want to address, so scan your environments and shift your findings left in the SDLC to get information back to your devs in the tools they already use, like Slack and Jira, then decide which issues you want to block or warn on
3. Set up guardrails that enforce Kubernetes best practices automatically using an admission controller, which will help your devs become more confident shipping code quickly because they aren’t worried about breaking something
4. While guardrails are a great way to keep things in line and your devs working on delivering code, you still need to monitor your environment to evaluate areas for improvement, particularly related to security, availability, and cost-efficiency; monitoring and getting feedback to developers is another great way to improve self service
5. You also need to solicit feedback from your dev teams so you understand where their challenges are and can resolve them: ask questions and listen to the answers to figure out how to make the process better for everyone

Cluster Add-ons

When you spin up a new cluster in your cloud environment (e.g. an EKS cluster on AWS, or a GKE cluster on GCP), you just have a vanilla Kubernetes cluster with nothing installed in it. It has a lot of functionality, but it doesn't have everything you need for a production application. You'll need some solutions for:

Deploying your application
Provisioning SSL certificates
Getting traffic into that cluster
Putting policies in place
Monitoring how many resources people are using

To do so, you need to install add-ons to do some of the peripheral tasks that core Kubernetes doesn't take care of.

Deployment

You also need a way for your development teams to get their applications into the cluster and running. In the early days, you might have had your engineers run kubectl to push some resources into the cluster, but what you really want is a CI/CD process that kicks off a whole deployment process every time your developers push code or tag a particular commit. You may be using Helm under the hood to get those resources into the cluster; it's up to your platform team to figure out what's going to work for your organization.
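As a rough illustration of that deployment step, the sketch below builds the Helm command a CI job might run when a commit is tagged. The release name, chart path, namespace, and the image.tag value key are all hypothetical; many charts read an image tag from a key like that, but yours may not.

```python
import shlex


def helm_deploy_command(release: str, chart: str, namespace: str, image_tag: str) -> list[str]:
    """Build (but do not run) a `helm upgrade --install` invocation."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--set", f"image.tag={image_tag}",  # assumes the chart reads image.tag
        "--wait",  # wait for the release's resources to become ready
    ]


if __name__ == "__main__":
    # Hypothetical values; a CI system would supply the tag from the pushed commit.
    cmd = helm_deploy_command("my-app", "./charts/my-app", "staging", "v1.2.3")
    print(shlex.join(cmd))
```

Keeping the command construction in one function means every team's pipeline deploys the same way, which is the standardization the platform is meant to provide.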

Feedback

Once you've got cluster add-ons and a deployment process, you've got something running. But eventually your developers are going to hit an issue where things aren't deploying, apps aren't scaling well, or an application has crashed. You're going to need to provide routes for them to:

Figure out how to troubleshoot what's going on
Figure out how to add an environment variable to their application
Raise issues when they have questions

Whether that's devs opening up a pull request, surfacing an issue in Slack, or opening up a Jira ticket, figuring out how you're going to handle those interactions between your platform team and your development teams is super important.

Governance

What's the difference between governance and feedback? They may sound the same, but there is a difference. Governance is all about making sure that your development teams don't do anything they're not supposed to do, and about preventing problems from recurring. Feedback is all about how your development teams surface problems to your platform team.

The best way to handle governance is by putting policy in place for the development teams, such as:

Don't allow any workloads to run as root
Don't allow workloads to request more than five CPUs per pod
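The two example policies above could be expressed as a small check over a pod spec. This is a minimal sketch, not a real policy engine: it assumes the pod spec is already loaded as a Python dict, it only looks at the runAsNonRoot field (real root detection also involves runAsUser and the image's default user), and it treats "five CPUs per pod" as the sum of container CPU requests.

```python
from typing import Any

MAX_CPU_PER_POD = 5.0  # cores; taken from the example policy in the text


def parse_cpu(quantity: Any) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def check_pod(pod_spec: dict[str, Any]) -> list[str]:
    """Return a list of human-readable policy violations for one pod spec."""
    violations = []
    pod_ctx = pod_spec.get("securityContext", {})
    total_cpu = 0.0
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # A container-level setting overrides the pod-level one.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if not non_root:
            violations.append(f"container {container['name']!r} may run as root")
        cpu = container.get("resources", {}).get("requests", {}).get("cpu", "0")
        total_cpu += parse_cpu(cpu)
    if total_cpu > MAX_CPU_PER_POD:
        violations.append(f"pod requests {total_cpu} CPUs (limit {MAX_CPU_PER_POD})")
    return violations
```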

Every time you see your development teams create a problem in production, it can be helpful to put some policy in place, either in a warning mode or an enforcement mode, to make sure that those same problems don't come up again in the future.

Better Self-service in Kubernetes

1. Assess the situation

If you're trying to move towards a self-service model, the first thing to understand is what's going on now. Here are a few important questions you should get the answers to:

How are developers deploying now?
How messy are the things in production?
How consistent are they?

The answers to these questions will help you figure out what sorts of policies you might want to put in place. For example, suppose you want everybody to have auto-scaling set up: how many people have auto-scaling set up today, and how many will be affected if you put a policy in place enforcing that? To figure this out, deploy auditing tools to understand what versions of everything you are running, how many applications there are, and how many people are deploying via CI/CD versus manually. Talk to your development teams, do some automated auditing of your environments, and check out your logs. This will help you get a feeling for what's working today and what's not.
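To make the auto-scaling question concrete, here is a minimal sketch that lists Deployments with no HorizontalPodAutoscaler pointing at them. It assumes you have already exported both resource lists as JSON (for example with kubectl get deployments --all-namespaces -o json and kubectl get hpa --all-namespaces -o json) and passed in their "items" arrays; the function name is hypothetical.

```python
def deployments_without_autoscaling(deployments: list[dict], hpas: list[dict]) -> list[tuple[str, str]]:
    """Return (namespace, name) for every Deployment that no HPA targets."""
    covered = {
        (h["metadata"]["namespace"], h["spec"]["scaleTargetRef"]["name"])
        for h in hpas
        if h["spec"]["scaleTargetRef"]["kind"] == "Deployment"
    }
    return sorted(
        (d["metadata"]["namespace"], d["metadata"]["name"])
        for d in deployments
        if (d["metadata"]["namespace"], d["metadata"]["name"]) not in covered
    )


if __name__ == "__main__":
    # Tiny hand-written sample in place of real cluster output.
    deployments = [
        {"metadata": {"namespace": "shop", "name": "api"}},
        {"metadata": {"namespace": "shop", "name": "worker"}},
    ]
    hpas = [
        {
            "metadata": {"namespace": "shop", "name": "api-hpa"},
            "spec": {"scaleTargetRef": {"kind": "Deployment", "name": "api"}},
        }
    ]
    for namespace, name in deployments_without_autoscaling(deployments, hpas):
        print(f"{namespace}/{name} has no autoscaler")
```

The length of that list is a quick estimate of how many teams an "auto-scaling required" policy would affect on day one.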

2. Shift findings left

When you assess your situation, you're going to find issues: people are using or asking for way more resources than they need, they're not setting up auto-scaling, they've got security vulnerabilities in their Docker containers, they're missing best practices, they’re not setting health probes, and so on. You could go into Slack and message every developer to tell them not to do things this way. But that approach is neither helpful nor scalable. The best thing to do is shift the findings left (think of the developers as being on the left side of the deployment process and your SREs all the way on the right).

The best way to do this is to start integrating auditing tools into your CI/CD pipelines. Start scanning for any issues with best practices or any policy violations. You should be scanning:

Helm charts
Docker images
Terraform

At the very beginning, start by warning developers of the issues. For example, add comments on GitHub PRs that say: “You added a new workload here, and it doesn't have any liveness probes set up.” Or a message saying, “You created a new Docker image here; it has 10 vulnerabilities in it and two of them are critical.” That helps the developers understand what they are doing wrong. Eventually, you'll want to block issues, too. If somebody opens a pull request that's creating new issues or adding new vulnerabilities, they should not be allowed to merge it unless they fix the issue or get approval from somebody.
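The warn-first, block-later progression can be sketched as a small CI gate. This is an illustration under assumptions: the Finding shape, the severity labels, and the evaluate function are hypothetical, and a real pipeline would get its findings from whatever scanner you run and post the text through your code host's API.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule: str
    message: str
    severity: str  # hypothetical labels: "low", "medium", "high", "critical"


def evaluate(findings: list[Finding], block_on: frozenset = frozenset()) -> tuple[str, int]:
    """Return (PR comment text, exit code). A non-zero exit code fails the CI job."""
    lines = []
    blocked = False
    for finding in findings:
        is_blocking = finding.severity in block_on
        blocked = blocked or is_blocking
        label = "BLOCK" if is_blocking else "WARN"
        lines.append(f"[{label}] {finding.rule}: {finding.message}")
    return "\n".join(lines), (1 if blocked else 0)


if __name__ == "__main__":
    findings = [
        Finding("liveness-probe", "new workload has no liveness probe", "medium"),
        Finding("image-vulnerability", "image has a critical vulnerability", "critical"),
    ]
    # Phase 1: warn only. Phase 2: block on critical findings.
    print(evaluate(findings))
    print(evaluate(findings, block_on=frozenset({"critical"})))
```

Starting with an empty block_on set lets developers see every finding without failing builds; you can then add severities or specific rules to the set as teams get used to the feedback.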

It can be a journey figuring out how much you want to scan, how much feedback you want to surface, and what you want to block on. Use the findings from that first step of auditing your environment to decide what you are going to scan first in CI/CD, what you are going to block first, and what is the most important thing to start enforcing as early in the pipeline as possible.

3. Set up guardrails

It's easy to circumvent CI/CD. Development teams can force merge PRs or use kubectl and Helm to edit directly in the cluster without going through the CI/CD process. That’s why it's important to start enforcing policies at a lower level to ensure your development teams aren't circumventing those best practices at the CI/CD layer. A Kubernetes admission controller is a pretty hard wall blocking people from putting things into your cluster that are going to cause problems.

You can set up an admission controller in your Kubernetes cluster that says: if you see a resource that doesn't have memory requests or CPU requests, reject it, don't allow that into the cluster. Then you have a pretty strong guarantee that you're not going to see that problem in production ever, even if a developer forces a merge or the CI/CD doesn't catch it. Guardrails are a great mechanism to make sure you have certain guarantees about what does and does not get into your cluster.
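The decision logic of such an admission check might look like the sketch below. It assumes a validating admission webhook that receives an admission.k8s.io/v1 AdmissionReview for a Pod and answers with allowed true or false; the HTTPS server, TLS certificates, and webhook registration that a real deployment needs are left out, and only bare Pods are handled (not Deployments or other controllers). In practice many teams use an existing policy engine rather than writing this by hand.

```python
def missing_requests(pod: dict) -> list[str]:
    """List every container in the Pod that lacks a CPU or memory request."""
    problems = []
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        for resource in ("cpu", "memory"):
            if resource not in requests:
                name = container.get("name", "?")
                problems.append(f"container {name!r} has no {resource} request")
    return problems


def review(admission_review: dict) -> dict:
    """Build the AdmissionReview response for an incoming Pod request."""
    request = admission_review["request"]
    problems = missing_requests(request["object"])
    response = {"uid": request["uid"], "allowed": not problems}
    if problems:
        # The message is shown to whoever tried to create the resource.
        response["status"] = {"message": "; ".join(problems)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Because the check runs inside the cluster's API request path, it applies equally to CI/CD deploys, forced merges, and manual kubectl or Helm changes.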

Sometimes platform engineering teams are worried that guardrails will slow down the development teams by putting roadblocks in their way. These guardrails actually help devs move faster because they know they're not going to catastrophically break anything unintentionally. It makes devs more confident to ship quickly and improves self-service.

4. Deliver feedback

Once you have guardrails in place, you have a good baseline of an environment. But you still need to have some kind of auditing in place for your environment so your SREs can look at it and determine how healthy different applications are. Evaluate how well they are scaling and whether any are redlining in terms of memory, CPU, or ability to scale horizontally.

You need to be able to surface feedback that shows whether you are spending $900 a month on an application that should really cost only $100 a month. You need to do some monitoring and send feedback to the developers. It can also be helpful to allow teams to compare the security, reliability, and cost-efficiency of their workloads so they know where to focus the efforts of your platform team.
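A cost comparison like that boils down to pricing what a workload requests against what it actually uses. The sketch below shows the arithmetic only; the per-core and per-GiB monthly prices are made-up placeholders, and the usage figures would come from your own monitoring.

```python
from dataclasses import dataclass

# Hypothetical monthly prices; substitute your cloud provider's real rates.
CPU_PRICE_PER_CORE = 25.0
MEMORY_PRICE_PER_GIB = 3.0


@dataclass
class Resources:
    cpu_cores: float
    memory_gib: float


def monthly_cost(resources: Resources) -> float:
    """Price a bundle of CPU and memory for one month."""
    return (
        resources.cpu_cores * CPU_PRICE_PER_CORE
        + resources.memory_gib * MEMORY_PRICE_PER_GIB
    )


def monthly_waste(requested: Resources, used: Resources) -> float:
    """Cost of requested-but-unused capacity; never negative."""
    return max(0.0, monthly_cost(requested) - monthly_cost(used))


if __name__ == "__main__":
    requested = Resources(cpu_cores=4, memory_gib=8)
    used = Resources(cpu_cores=1, memory_gib=2)
    print(f"requested: ${monthly_cost(requested):.2f}/month")
    print(f"estimated waste: ${monthly_waste(requested, used):.2f}/month")
```

A real report would leave some headroom above observed usage before calling the rest waste, since workloads need room for traffic spikes.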

5. Solicit feedback

You also need to solicit feedback. Ask your devs what's working and what's not. Talk to them so you understand what problems they are running into regularly. One of the biggest metrics around the success of a platform is how fast your developers can go. A big piece of making sure that devs can self-service is asking the developers how equipped they feel to do their job. Here are a few more questions you should be asking:

How often are you getting stuck?
How much time does it take you to trigger a new deployment?
How much time does it take you to make slight changes, such as changing your auto-scaling settings or adding a new environment variable?
How well do you understand how your application is performing in terms of security, efficiency, and so on?

It’s incredibly important, both formally and informally, to have those lines of communication open with your development teams. Make sure they have a place to surface problems when they do happen, that they're able to do their job, and that they're not spending 50% of their time wrestling with infrastructure. Soliciting feedback is the only way to get that information.

Enable Better Kubernetes Self Service

These five tips will help you improve your dev teams’ ability to self-service in Kubernetes environments, and so much of it is about enabling communication. Whether you’re assessing the situation, shifting findings earlier in the SDLC, setting up guardrails, or delivering or soliciting feedback, it’s all about getting the right information to the right people at the right time.

And remember, this is an ongoing process. You can’t set up a self-service platform in Q4 and be set forever. You need to iterate constantly because your needs are changing constantly. Teams are changing as you hire and lose people. You also have new business requirements, such as new compliance standards, to adhere to. It is an ongoing process, one filled with constant improvement and iteration. Having open lines of communication across your teams will set you up for success with a self service approach.