I’m excited to announce the next major release of Fairwinds Insights!
Fairwinds Insights is a platform for policy enforcement and auditing in Kubernetes. It integrates with a dozen open source tools, including Polaris, Trivy, and OPA, and can check your Infrastructure-as-Code in CI, protect your cluster via Admission Control, or run as an in-cluster Agent that continuously monitors for security, efficiency, and reliability issues.
Insights already generates a lot of critical data, so with this release we focused on ways to help you visualize, navigate, and triage it. Below are a few of the big features that landed in 3.0.
When Insights was first conceived, I was the only person working on it. I quickly threw together a UI with horizontal navigation, which was perfectly sufficient for the two or three features available at the time.
Two years later, the feature set had grown to the point that we were running out of space! Thankfully, we now have a wonderful UI/UX designer working with us, who reimagined our application’s information hierarchy from the ground up:
The top section allows you to switch contexts, moving between different Kubernetes clusters and Infrastructure-as-Code repositories
The middle section allows you to navigate within a given context
The bottom section is where you can log out, view a different organization, or find helpful links to documentation
The new vertical navigation is much closer to that of other cloud-native SaaS products like Datadog or Google Cloud Console, so it should feel more familiar to users. If those products are any indication, it will give us room to grow the application for years to come.
Fairwinds Insights has had functionality related to cost and resource usage for a while now, but the data was pretty coarse: we used Goldilocks to maintain a running average, minimum, and maximum of CPU and memory usage for each workload. For stable workloads, this let us provide reasonable recommendations for where requests and limits should be set, but for workloads with spiky usage, we needed more information.
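To make that limitation concrete, here is a minimal JavaScript sketch of the kind of summary described above: a running average, minimum, and maximum, updated one sample at a time. The `RunningStats` class and the sample values are hypothetical and for illustration only; this is not how Goldilocks itself computes its recommendations.

```javascript
// Hypothetical sketch: summarize a workload's usage with a running average,
// minimum, and maximum, updated one sample at a time without storing history.
// This illustrates the idea only; it is not Goldilocks' implementation.
class RunningStats {
  constructor() {
    this.count = 0;
    this.mean = 0;
    this.min = Infinity;
    this.max = -Infinity;
  }

  // Fold one new sample into the summary.
  add(sample) {
    this.count += 1;
    // Incremental mean update, equivalent to sum / count.
    this.mean += (sample - this.mean) / this.count;
    this.min = Math.min(this.min, sample);
    this.max = Math.max(this.max, sample);
  }
}

// Illustrative memory samples in MiB (made-up values, not real data).
const memStats = new RunningStats();
for (const mib of [200, 220, 210, 900, 205]) {
  memStats.add(mib);
}
console.log(memStats.mean, memStats.min, memStats.max); // 347 200 900
```

Here a single spike to 900 MiB pulls the average up to 347 MiB and sets the maximum, but those three numbers alone can’t tell you whether the spike happened once or every hour, which is exactly the gap described above.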
To solve this problem, we introduced the Prometheus Collector report, which captures CPU and memory usage every 30 seconds. We keep two weeks’ worth of data, so you can see how your usage changes over the course of a day or a week.
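Keeping two weeks of 30-second samples works out to 40,320 data points per time series (for example, one workload’s memory usage). The sketch below shows that arithmetic, plus a simple way to discard samples that fall outside the retention window. The constant names, the `{ timestamp, value }` sample shape, and the `pruneOld` helper are hypothetical; this is not how Insights actually stores the data.

```javascript
// Hypothetical sketch of the retention arithmetic described above.
const SAMPLE_INTERVAL_SECONDS = 30;
const RETENTION_SECONDS = 14 * 24 * 60 * 60; // two weeks = 1,209,600 seconds
const SAMPLES_PER_SERIES = RETENTION_SECONDS / SAMPLE_INTERVAL_SECONDS; // 40,320

// Keep only samples newer than the two-week cutoff.
// Each sample is assumed to look like { timestamp: <seconds>, value: <number> }.
function pruneOld(samples, nowSeconds) {
  const cutoff = nowSeconds - RETENTION_SECONDS;
  return samples.filter((sample) => sample.timestamp > cutoff);
}

console.log(SAMPLES_PER_SERIES); // 40320
```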
By showing this information alongside your current requests and limits, you can easily see how well they’re set. For example, in the above chart, because the lines are mostly within the shaded blue area, we can conclude that we’ve picked good values for memory requests and limits; the workload uses a maximum of about 75% of the memory limit, and on average utilizes at least as much memory as we’ve requested.
By contrast, in the graph below, the lines sit well below the blue box, which tells us that this workload is over-provisioned and is probably costing us more than it needs to:
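The reasoning in the two examples above boils down to two ratios: peak usage relative to the limit, and average usage relative to the request. Here is a minimal sketch of that check. The `assessMemoryFit` function, the 95% and 50% thresholds, and the sample values are all assumptions made for illustration, not rules that Insights applies.

```javascript
// Hypothetical sketch: judge how well a memory request and limit fit
// observed usage. Thresholds below are illustrative assumptions.
function assessMemoryFit(samplesMiB, requestMiB, limitMiB) {
  const peak = samplesMiB.reduce((max, v) => Math.max(max, v), -Infinity);
  const avg = samplesMiB.reduce((sum, v) => sum + v, 0) / samplesMiB.length;

  const peakVsLimit = peak / limitMiB; // how close usage gets to the limit
  const avgVsRequest = avg / requestMiB; // how much of the request is used

  let verdict;
  if (peakVsLimit >= 0.95) {
    verdict = 'close to limit'; // little headroom before an OOM kill
  } else if (peakVsLimit < 0.5 && avgVsRequest < 0.5) {
    verdict = 'over-provisioned'; // paying for memory that sits idle
  } else {
    verdict = 'reasonable';
  }
  return { peakVsLimit, avgVsRequest, verdict };
}

// Peaks at 75% of the limit and averages above the request.
const tuned = assessMemoryFit([300, 450, 500, 750], 400, 1000);
// Uses a small fraction of both the request and the limit.
const idle = assessMemoryFit([100, 120, 110, 130], 512, 1024);

console.log(tuned.verdict); // reasonable
console.log(idle.verdict); // over-provisioned
```

The first call mirrors the well-tuned workload (peak at 75% of the limit, average usage above the request), and the second mirrors the over-provisioned one.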
Armed with all this extra data, plus some clever visualizations, our users are better able to right-size their workloads, which allows them to both save money and maintain a more stable environment.
If you’d like to try it out, log into your Insights account and add the Prometheus Collector report to your cluster via the Report Hub.
One of the most common pieces of feedback we’ve received about Insights is that the sheer amount of data can be overwhelming. Furthermore, many of the Action Items produced by Fairwinds Insights are not immediately actionable, because the affected resources are controlled by a third party or by core Kubernetes infrastructure.
To help address this problem, we introduced Automation Rules, which let you automatically triage Action Items based on conditions you define. For instance, you might want to ignore any issues that appear in the kube-system namespace, since it’s managed by Kubernetes itself. To do that, you can write a bit of JavaScript:
// Ignore anything in kube-system, which is managed by Kubernetes itself
if (ActionItem.ResourceNamespace === 'kube-system') {
  ActionItem.Resolution = WILL_NOT_FIX_RESOLUTION;
}
This is a powerful way to cut down on the number of Action Items you need to review, so that you can focus on the issues that matter most to your organization.
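The rule above refers to an `ActionItem` object and a `WILL_NOT_FIX_RESOLUTION` constant without defining them, which suggests Insights supplies both when it evaluates the rule against each Action Item. If you want to reason about a rule’s effect before enabling it, a local harness like the one below can help. Everything here is a hypothetical stand-in: the constant’s value, the `ignoreKubeSystem` and `triage` helpers, and the plain objects playing the role of Action Items.

```javascript
// Hypothetical local harness; inside Insights the platform supplies these.
// Stand-in value: the real constant's value is not assumed here.
const WILL_NOT_FIX_RESOLUTION = 'will-not-fix (stand-in)';

// The rule from above, wrapped in a function that takes one Action Item.
function ignoreKubeSystem(ActionItem) {
  if (ActionItem.ResourceNamespace === 'kube-system') {
    ActionItem.Resolution = WILL_NOT_FIX_RESOLUTION;
  }
}

// Run a rule over every item, then return the items it left unresolved.
function triage(items, rule) {
  for (const item of items) {
    rule(item);
  }
  return items.filter((item) => item.Resolution === undefined);
}

// Made-up Action Items carrying only the field the rule inspects.
const items = [
  { ResourceNamespace: 'kube-system' },
  { ResourceNamespace: 'payments' },
  { ResourceNamespace: 'kube-system' },
];
const remaining = triage(items, ignoreKubeSystem);
console.log(remaining.length); // 1
```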
You can also use Automation Rules to send notifications to Slack, for example when a critical vulnerability appears in your production cluster. We plan to keep expanding this functionality to include creating GitHub and Jira tickets and sending arbitrary HTTP requests.
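The decision behind an alert like that is just a condition over an Action Item’s fields. Here is a minimal sketch of that condition kept separate from the notification itself. The field names (`ReportType`, `Cluster`, `Severity`, `Title`), the 0.9 cutoff for “critical”, the `'production'` cluster name, and the `notify` callback are all assumptions for illustration, not the documented Automation Rules API; check the Automation Rules documentation for the real fields and the real Slack call.

```javascript
// Hypothetical sketch: decide whether an Action Item warrants a Slack alert.
// ASSUMPTIONS: field names, the 0.9 "critical" cutoff, the 'production'
// cluster name, and the notify callback are illustrative only.
const CRITICAL_SEVERITY = 0.9;

function isCriticalProdVulnerability(item) {
  return (
    item.ReportType === 'trivy' && // Trivy findings as the vulnerability source
    item.Cluster === 'production' &&
    item.Severity >= CRITICAL_SEVERITY
  );
}

// Call the supplied notifier (for example, a Slack wrapper) only when needed.
function maybeNotify(item, notify) {
  if (!isCriticalProdVulnerability(item)) {
    return false;
  }
  notify(`Critical vulnerability in ${item.Cluster}: ${item.Title}`);
  return true;
}

// Example with a notifier that just collects messages.
const sent = [];
const collect = (message) => sent.push(message);
maybeNotify({ ReportType: 'trivy', Cluster: 'production', Severity: 0.95, Title: 'example finding' }, collect);
maybeNotify({ ReportType: 'trivy', Cluster: 'staging', Severity: 0.95, Title: 'example finding' }, collect);
console.log(sent.length); // 1
```

Keeping the condition in a pure function makes it easy to test on its own, separately from the side effect of actually posting to Slack.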
To get started with Automation Rules, check out the documentation.
We added support for OPA policies in 2.0, but policy management was handled entirely through a CLI. We still think the CLI should be the primary workflow, since it lets you store your policies in an Infrastructure-as-Code repository, but we’ve now also built a user interface to help you view and manage policies.
This should help users get value from OPA policies more quickly, and makes it easy to view, edit, and enable or disable policies on the fly.
We also provide a growing library of policies for you to clone and modify:
I’m very excited about the progress we’ve made on Fairwinds Insights. After building a strong MVP that could surface security, efficiency, and reliability issues in any Kubernetes cluster, we’ve put a lot of work into making those findings more actionable and easier to understand.
Interested in using Fairwinds Insights? It’s available for free! Learn more here.
In the coming quarter, we’ll continue to refine our user interface with more visualizations and ways to customize the results, and will continue to watch for new and interesting open source tools that we can connect to the Report Hub. Stay tuned!