
Fairwinds Polaris 1.0 - Best Practices for Kubernetes Workloads

🎉 I’m excited to announce that we’ve released Fairwinds Polaris 1.0!

We launched Polaris almost a year ago to help Kubernetes users avoid common mistakes when configuring their workloads. Over the course of managing hundreds of clusters for dozens of organizations, the Fairwinds SRE team kept seeing the same mistakes over and over: resource requests and limits going unset, liveness and readiness probes being ignored, and containers requesting completely unnecessary security permissions. These are sure-fire ways to create headaches down the line – from outages, to cost overruns, to security breaches. We saw Polaris as a way to encode all our battle scars into a single configuration validator that could benefit the entire Kubernetes community.
 
Polaris began as a simple dashboard, showing all the workloads in your cluster that fall short of best practices. We soon added a Validating Webhook for the truly paranoid, which rejects kubectl apply for any workload that fails one of the danger-level checks. But our users still weren’t satisfied - they wanted to run Polaris in CI/CD, so they could catch errors before they were merged into master. Soon Polaris could also run on YAML files and Helm charts.
 
Now, after over 1200 stars on GitHub and lots of great feedback from the community, we’ve learned a ton, and we’re excited to announce a few amazing new features in 1.0:
  • Custom checks using JSON Schema
  • Support for all controller types, including OpenShift and custom resources
  • Check exemptions, for workloads that really do need special permissions
  • Simplified configuration and output formats

With Fairwinds Insights, software to standardize and enforce development best practices, you can run Polaris in multiple clusters, track results over time, and integrate with Slack, Datadog, and Jira. It's free to use! Compare Polaris and Insights.

Custom Checks with JSON Schema

This is by far the biggest and most exciting change to Polaris: you can now define your own custom checks using JSON Schema.
 
Originally, we had hard-coded our checks in Golang. We would manually check things like:
if cv.Container.LivenessProbe == nil {
    cv.addFailure(
        messages.LivenessProbeFailure,
        conf.HealthChecks.LivenessProbeMissing,
        category,
        id)
} else {
    cv.addSuccess(messages.LivenessProbeSuccess, category, id)
}
This proved to be a bit cumbersome and bug-prone in the long run. For example, we eventually realized that Jobs and CronJobs should probably be exempt from liveness probe checks, which involved wrapping the above statement in some extra conditionals. As we discovered more exceptions like this, our Go code became rather unwieldy.
 
What’s more, there was no easy way for our users to add their own checks - they’d need to add them into our codebase, and not every check is appropriate for every organization.
 
So we decided to move towards using a configuration language for checks. After heavily investigating OPA (more on that below), we decided to go with JSON Schema. Now, the check above, plus all of its exceptions and configuration, looks something like this:
 
successMessage: Liveness probe is configured
failureMessage: Liveness probe should be configured
category: Health Checks
controllers:
  exclude:
  - Job
  - CronJob
containers:
  exclude:
  - initContainer
target: Container
schema:
  '$schema': http://json-schema.org/draft-07/schema
  type: object
  required:
  - livenessProbe
  properties:
    livenessProbe:
      type: object
      not:
        const: null
Now all the configuration related to a particular check lives in the same place. We can adjust the controllers it applies to, the messages it generates, and even the logic of the check itself, just by editing some YAML.
 
JSON Schema is incredibly powerful - you can force numbers to fall between maxima and minima, force strings and object keys to match particular patterns, set bounds on array lengths, and even combine schemas using allOf, oneOf, or anyOf. We’ve even extended JSON Schema to let you set minima and maxima for CPU and Memory using human-readable strings, like 1GB. To help you get started, we’ve provided some sample schemas for things like restricting memory usage and disallowing particular Docker registries.
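As a rough illustration, here’s what a custom check restricting container memory limits could look like. The resourceMinimum and resourceMaximum keywords are the human-readable extension described above; the check name and exact layout here are a sketch modeled on the sample schemas, so double-check the repo before copy-pasting:

checks:
  # memoryLimitRange is an illustrative check name, not a built-in one
  memoryLimitRange: warning

customChecks:
  memoryLimitRange:
    successMessage: Memory limits fall within the allowed range
    failureMessage: Memory limits should be between 100M and 6G
    category: Resources
    target: Container
    schema:
      '$schema': http://json-schema.org/draft-07/schema
      type: object
      required:
      - resources
      properties:
        resources:
          type: object
          required:
          - limits
          properties:
            limits:
              type: object
              required:
              - memory
              properties:
                memory:
                  type: string
                  resourceMinimum: 100M
                  resourceMaximum: 6G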
 

But wait, what about OPA?

Kubernetes veterans will probably be wondering why we didn’t go with OPA (Open Policy Agent), a heavyweight policy framework that has been developed by the community. We looked at the project carefully, but ultimately decided that it was much more complex and powerful than we (or our users) needed. It takes some time and effort to get used to Rego, the DSL that drives OPA policies. JSON Schema, on the other hand, is already an integral part of Kubernetes - it’s used by CRDs and core resources for validation, and is part of OpenAPI, which drives the Kubernetes API.
 

New Controller Types

When we originally built Polaris, it only checked Deployments, arguably the most common controller type in Kubernetes. But soon after launching we had requests for more controller types, like StatefulSets, CronJobs, and DaemonSets. After adding support for half a dozen different controllers (as well as writing a large amount of boilerplate code, thanks to Go’s rigid type system), we realized we’d never be able to keep up with all the controllers out there - in addition to the core Kubernetes controllers, there are non-standard controllers like OpenShift’s DeploymentConfig.
 
So we decided to try a different tack - instead of fetching controllers, we’d fetch the underlying pods that the controllers create. We could then use Owner References to walk up the hierarchy until we hit something without an owner - i.e. the parent controller.
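To sketch the idea in code (this is an illustrative outline, not the actual Polaris implementation; the resolveGVR helper that maps an owner’s apiVersion/kind to a resource is assumed here, where a RESTMapper would normally do that job):

// A minimal sketch of walking a Pod's ownerReferences up to its
// top-level controller, using client-go's dynamic client.
package validator

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// topController follows controller ownerReferences until it reaches an object
// with no controller owner - that object is treated as the parent controller.
func topController(
	ctx context.Context,
	client dynamic.Interface,
	resolveGVR func(apiVersion, kind string) (schema.GroupVersionResource, error), // assumed helper
	obj *unstructured.Unstructured,
) (*unstructured.Unstructured, error) {
	for {
		owner := metav1.GetControllerOf(obj)
		if owner == nil {
			// Nothing owns this object: it's the top of the hierarchy.
			return obj, nil
		}
		gvr, err := resolveGVR(owner.APIVersion, owner.Kind)
		if err != nil {
			return nil, err
		}
		// Note: cluster-scoped owners (e.g. a Node owning a static pod)
		// would need a lookup without the Namespace() scoping.
		parent, err := client.Resource(gvr).Namespace(obj.GetNamespace()).
			Get(ctx, owner.Name, metav1.GetOptions{})
		if err != nil {
			return nil, fmt.Errorf("fetching owner %s %s: %w", owner.Kind, owner.Name, err)
		}
		obj = parent
	}
}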
 
(As an interesting aside, we learned that some workloads are owned not by a controller, but by the node itself! In the 1.0 dashboard, you’ll notice some almost-duplicate entries in kube-system due to this fact.)
 
With this change in place, we’re able to support any controllers out there in the wild, whether or not we’d even heard of them before. So go ahead and build your own controllers! We’ll still bug you about your liveness probes. 😄
 
Note, however, that we can’t yet catch these new controller types in the Validating Webhook - the webhook still watches for a fixed set of controller types. We may build support for this down the line, but need to be careful that we still allow workloads to scale up properly.
 

Check Exemptions

Some workloads really do need to do things like run as root, or have access to the host network. This is true for a lot of things that live in kube-system, as well as some utilities like cert-manager and nginx-ingress.
 
To help cut down on the noise generated by these workloads, we’ve created a library of exemptions, which allows particular workloads to ignore particular checks. You can add your own workloads to this configuration, or you can add annotations like polaris.fairwinds.com/exempt=true and polaris.fairwinds.com/cpuRequestsMissing-exempt=true.
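For reference, an entry in the exemptions section of the Polaris config looks roughly like the snippet below - the controller and rule names are just examples (cert-manager is one of the utilities mentioned above):

exemptions:
  # Example: let cert-manager skip two security checks it legitimately trips
  - controllerNames:
    - cert-manager
    rules:
    - runAsRootAllowed
    - notReadOnlyRootFilesystem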
 
We’d like to continue building this library, so if you’re running common utilities like Istio or Linkerd, let us know what else we should add!
 

New Configuration and Output Formats

Our configuration syntax initially looked something like this: 


networking:
  hostNetworkSet: warning
  hostPortSet: warning
security:
  runAsPrivileged: error
  capabilities:
    error:
      ifAnyAdded:
        - SYS_ADMIN
        - NET_ADMIN
        - ALL
    warning:
      ifAnyAddedBeyond:
        - CHOWN
        - DAC_OVERRIDE
        - FSETID
The struggle was that some checks were just booleans, like hostNetworkSet, while others, like capabilities, required some extra configuration. So after a lot of internal discussion, we settled on a somewhat inconsistent but minimal syntax.
 
In 1.0, we’ve done away with all the extra configuration in favor of JSON Schema (see above). So now the syntax is a lot simpler:
checks:
  # networking
  hostNetworkSet: warning
  hostPortSet: warning
  # security
  dangerousCapabilities: danger
  insecureCapabilities: warning
Note that we also changed error to danger to differentiate between unexpected errors and failing checks. The output format has also been simplified a bunch, but it’s not entirely user-facing, so we won’t go into it here. If you’re interested in seeing it, here’s an example.

Onward to 2.0

So what’s next for Polaris?
 
First, we’d like to add a whole lot more checks. Now that we’ve got a scalable way of building them, and an easy way for the community to contribute new ones, we’d like to build up a large library of checks that users can turn on and off. And we’d like to start checking more than just workloads - things like Ingresses, Services, and RBAC need validation too.
 
We may also add support for OPA down the line. While JSON Schema will remain the format of choice in Polaris, OPA would appeal to organizations that have already invested time in creating Rego policies, or who want to write complicated checks. Adding OPA support would make Polaris a highly flexible validation framework.
 
Finally, we will look to build a deeper integration with Fairwinds Insights, our commercial configuration validation platform. Fairwinds Insights allows you to aggregate Polaris results across multiple clusters, track the lifecycle of findings over time, and push the data out to third parties like Slack and Datadog. Fairwinds Insights also has plugins for other open-source auditing tools, like Trivy, kube-bench, and Goldilocks, so you can be confident you’ve covered all the bases, from security to efficiency to reliability.
 
If you have some time to try out Polaris, or if you’re already using it, we’d love to hear from you! Reach out on GitHub and Slack, or send an email to opensource@fairwinds.com.
 
See how Fairwinds Insights reduces your Kubernetes risk!