Never Should You Ever In Kubernetes Part 3: 6 K8s Reliability Mistakes

As we outlined in our first post in this series, there are some things that you should simply never, ever do in Kubernetes. Corey Quinn, Chief Cloud Economist at The Duckbill Group, had a lively conversation with Kendall Miller, President of Fairwinds, and Stevie Caldwell, Senior Site Reliability Engineer at Fairwinds, about the things that development and operations teams should never, ever do in Kubernetes if they want to get the most out of the leading container orchestrator. If you’re looking to maximize Kubernetes reliability, here’s what you need to avoid.

(Remember, this is never should you ever, so the headlines might seem a bit odd, a bit obvious, or even surprising!)

1. Configure all the things via a GUI

The problem with configuring everything via a graphical user interface (GUI) is that you end up with one person who maybe remembers all the buttons they pushed to set everything up. If they don’t remember the process, or if that person leaves, it’s really hard to replicate your environment with clicks. You can certainly configure via a GUI, but if you choose to do that, you have to write everything down in a workbook that describes each step and what to click (and where), and then cross your fingers and hope you can find the workbook when you really need it (maybe just don’t do this).

Let’s say everything falls over and you need to go in and recreate your clusters. Pointing and clicking is the slowest way to do that. That said, there is an evolution in how you set things up, from least advanced to most advanced. You might start off pointing and clicking in the console. Then you graduate to something like Terraform or CloudFormation (both infrastructure as code tools). Beyond that, you get into more dynamic solutions like the Cloud Development Kit (CDK), which lets you define Kubernetes and infrastructure components using familiar programming languages. The final step, perhaps surprisingly, is going back to the console. The console has a great UI and it’s worth using, but something has to capture each change that gets made and codify it. Some providers do that: if you change anything, you can click a button and it will spit out the code for all of the settings.
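To make "codify that" a little more concrete, here is a minimal sketch in Python of what configuration as code can look like at its simplest. It is not Terraform, CloudFormation, or the CDK; it only builds a Kubernetes Deployment manifest from a few parameters and prints it as JSON. The names `deployment_manifest` and `render`, the app name `web`, and the image `example.com/web:1.0.0` are all made up for illustration.

```python
import json


def deployment_manifest(name, image, replicas=2):
    """Build a Kubernetes Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def render(manifest):
    """Serialize with sorted keys so the output diffs cleanly in version control."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical app name and image, used only for illustration.
    print(render(deployment_manifest("web", "example.com/web:1.0.0", replicas=3)))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the output can be saved to a file, reviewed, and committed. The point is not this particular script; it is that the settings live somewhere other than one person's memory of which buttons they clicked.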

If your provider doesn’t do that (AWS, for example), Console Recorder is a browser extension that Ian Mckay made and maintains. It sits in Chrome or Firefox, watches everything you’re doing in the console, and then generates code for what you did in whatever language you want. It also spits out a scoped IAM policy for what you just did. It’s a really useful extension.

2. Let your SREs do RDD

You should never, ever let your site reliability engineers (SREs) do resume-driven development (RDD). If your operations team says they want to use XYZ because it’s the hottest new thing and they want to be able to put it on their resumes, that’s probably not the best way to decide how to build your entire infrastructure. That’s not to say that there isn’t a lot of interesting tech out there, and your team should have the opportunity to explore it. Just don’t do it for the sole purpose of getting your SREs their next job. Exploring new technology is important, but don’t sacrifice the reliability of your infrastructure in the pursuit of filling in a resume (there are always non-prod environments to play in!).

3. Run Docker Swarm on top of Nomad on top of Kubernetes

The heading for this one alone makes you think, wow, maybe I shouldn’t do that? And you’re right - don’t run Docker Swarm on top of Nomad on top of Kubernetes. It sounds ridiculous, but we’ve seen companies attempt to do things just as complicated; they use one tool to do one piece of the orchestration, another to do the scheduling, and another to do something else. For some reason, adding layers of complexity is attractive to some people. There are a lot of great solutions you can explore in the Cloud Native Computing Foundation ecosystem, but the goal is to reduce complexity and increase reliability, not the opposite.

4. Easter egg single points of failure

When we say Easter egg single points of failure, we mean knowing about really interesting single points of failure and hiding them. You should never, ever hide those. The most common example is the undocumented thing that “everyone” knows, until you look around and discover that all of the people who knew have taken jobs elsewhere.

There are always single points of failure; it’s just a question of where in the stack they are going to be. Even if you think you’ve got rid of all of them, there are still some there. The trick is to be realistic about what it takes to take down your site and to know what your points of failure are rather than hiding them.
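One small, concrete way to start knowing where your single points of failure are is to look for workloads that run only one copy. The sketch below assumes you already have your manifests loaded as Python dictionaries; `single_replica_workloads` is a hypothetical helper, not part of any Kubernetes tooling, and a replica count is only one of many places a single point of failure can hide.

```python
from typing import List


def single_replica_workloads(manifests: List[dict]) -> List[str]:
    """Return "Kind/name" for Deployments and StatefulSets with fewer than two replicas."""
    flagged = []
    for manifest in manifests:
        if manifest.get("kind") not in {"Deployment", "StatefulSet"}:
            continue
        # Kubernetes defaults replicas to 1 when the field is omitted.
        replicas = manifest.get("spec", {}).get("replicas", 1)
        if replicas < 2:
            flagged.append(f'{manifest["kind"]}/{manifest["metadata"]["name"]}')
    return flagged


if __name__ == "__main__":
    # Made-up manifests, trimmed to the fields this check reads.
    manifests = [
        {"kind": "Deployment", "metadata": {"name": "web"}, "spec": {"replicas": 3}},
        {"kind": "Deployment", "metadata": {"name": "billing"}, "spec": {}},
        {"kind": "Service", "metadata": {"name": "web"}, "spec": {}},
    ]
    for name in single_replica_workloads(manifests):
        print(f"single replica: {name}")
```

A list like this doesn't remove the risk, but it puts it in writing where the whole team can see it.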

5. Weekend at Bernie’s your liveness probes

If you haven’t seen Weekend at Bernie’s, it’s a movie in which people pretend that someone who is dead is still alive. That’s something you don’t want to do with your liveness probes. Don’t ignore the fact that things have died. The goal is to make sure that you have working liveness probes so that you can respond accordingly.
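As a rough illustration of a liveness probe that tells the truth, here is a minimal health endpoint using only Python's standard library. The `/healthz` path, port 8080, and the `healthy` flag are assumptions made for this sketch; a real application would clear the flag from whatever check actually tells it that it can no longer do useful work.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag: the application clears it when it detects that it
# can no longer do useful work (for example, a stuck worker thread).
healthy = threading.Event()
healthy.set()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            status, body = 404, b"not found"
        elif healthy.is_set():
            status, body = 200, b"ok"
        else:
            # Report the failure instead of pretending to be alive.
            status, body = 503, b"unhealthy"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def make_server(port=8080):
    """Create the health server; port 0 asks the OS for any free port."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

A container's `livenessProbe` with an `httpGet` check on that path and port treats any response code from 200 up to (but not including) 400 as success, so returning 503 when the app is wedged is what lets Kubernetes restart the container. An endpoint that always returns 200 no matter what is the Weekend at Bernie's version.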

For a lot of shops, the monitoring system they pay attention to is Twitter or the customer support desk. That’s the only way they know that things are down. Don't do that.

6. Skip monitoring “for now”

There are a lot of people who are just trying to get to production and decide that they’ll figure out how to monitor later. The reason you shouldn’t do this is that the sooner you start monitoring, the sooner you can start getting baselines for how your application should be behaving. This data will inform a lot of decisions you make down the line, such as how you scale and where your traffic spikes are.

Monitoring helps you get to efficiency. If you’ve been monitoring, your baselines will help you decide how to configure things, how to right-size things, and how to scale up or scale down as appropriate.
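To show how a baseline can feed a right-sizing decision, here is a small sketch that summarizes observed CPU usage and suggests a CPU request from it. The sample values are invented, and the policy (95th percentile plus 20% headroom) is an assumption chosen for illustration, not a recommendation from the webinar.

```python
import math
import statistics


def baseline(samples):
    """Summarize observed samples as a median and a 95th percentile."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": statistics.median(samples), "p95": cuts[94]}


def suggested_cpu_request(cpu_millicores, headroom=1.2):
    """Suggest a CPU request in millicores: p95 usage times an assumed headroom factor."""
    return math.ceil(baseline(cpu_millicores)["p95"] * headroom)


if __name__ == "__main__":
    # Invented CPU usage samples in millicores, e.g. one reading per minute.
    usage = [110, 120, 115, 130, 125, 480, 118, 122, 127, 119]
    print(baseline(usage))
    print(suggested_cpu_request(usage))
```

None of this works without the data, which is the whole argument for starting early: the longer you have been collecting samples, the more of your real traffic pattern the baseline reflects.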

Watch the entirely entertaining webinar on demand to learn what else you should never, ever do in Kubernetes.