Kubernetes Clinic Spotlight on Liz Fong-Jones: Observing Observability

If you are a developer or engineer interested in observability and containerization, you probably know the name Liz Fong-Jones. She’s spent her career solving challenging and important technical problems, and has had such an impact on the developer community – from both technical and ethical standpoints – that when she left Google for her current role at Honeycomb, news outlets covered the move.

Liz is a regular on the conference circuit, bringing her big ideas and commitment to developer advocacy front and center. When events were put on hold in March 2020, Liz searched for a way to keep hearing about developer concerns and successes. Her answer democratized events and welcomed people directly to her doorstep. Her Virtual Office Hours have reached the far corners of the globe and have ensured that developers who want to share their ideas and hear her thoughts are able to, regardless of budget or travel restrictions.

I was excited to get to do a Zoom call with Liz and ask her questions about her role as a developer advocate and the importance of cloud services -- and watch as she dug underneath some of my questions to answer at a deeper level. It’s impossible to walk away from a conversation with Liz without thinking a little bigger and a little more urgently, and becoming a lot more aware of how we respect one another. To thank Liz for sharing her thoughts with this community, Fairwinds has made a donation to Trans Lifeline in her name.

Thank you, Liz, for taking time to chat with us today. We have to ask first - how are you doing these days?

Those weekends that are mercifully free of big news are helpful.

Cloud services have become a lifeline for all of us – workers, families, students and movements. We saw you participate in a panel discussion recently about being a Site Reliability Engineer (SRE) at a time when the world is counting on them. What advice do you have for SREs today and in the future given the crucial importance of cloud services?

We always talk about this idea of revisiting your service level objectives (SLOs) – if your service is depended on more, you should probably increase your SLOs. If it’s suddenly less important because other things are more important, then you should relax your SLOs. That’s the main lesson. Your priorities might change, so don’t try to keep up a system that’s not worth maintaining. If your service is now super critical, you need to advocate for more funding, more resources, etc. so that you can keep the world moving.
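
To make the SLO point concrete, here is a minimal sketch of what tightening or relaxing a target means in practice. It assumes an availability-style SLO measured over a 30-day window; the function name, the window length and the three example targets are illustrative choices, not something Liz or Honeycomb specified.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window


def error_budget_minutes(slo_target: float,
                         window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return how many minutes of unavailability the SLO tolerates per window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Illustrative targets only: a relaxed, a current and a tightened SLO.
    for label, target in [("relaxed", 0.99), ("current", 0.999), ("tightened", 0.9999)]:
        budget = error_budget_minutes(target)
        print(f"{label:>9}: {target:.2%} -> {budget:.1f} min / 30 days")
```

Each extra nine cuts the tolerated downtime by a factor of ten, which is why a service that has suddenly become critical usually needs more funding and people to go with its stricter target.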

You spent much of your career in a variety of SRE roles at Google. What did you take away from your experiences there that allows you to advocate for SREs at your current company, Honeycomb?

I worked in a variety of both back-end and front-end teams at Google. My bio says from Google Cloud Load Balancer to Google Flights – those are completely different services. One of them is a millions-of-queries per second service that routes every single request to Google.com. The other is a relatively low traffic service that is still like, “hey, how much does this flight cost and what are the routing options?” We had to think about how to put logic around it.

That experience taught me that it’s not just back-end systems that matter. Every system has a role to play, which applies to empathy for SREs at their organizations, especially smaller organizations. When I joined Google Flights it was a product that was started because Google acquired a company called ITA Software that had hundreds of employees. As a result, it was a team that was operationally overloaded, but they also didn’t appreciate people coming to them and saying “Do it the Google way.” I learned the hard way you can’t go to a team and tell them how to do it a different way.

One of the things that caught our attention a few years ago was your thinking on managing up and sidewise as an SRE. Is this still a skill that you believe SREs need to master? Has it changed over the past 3 – 4 years?

It’s still super critical for all kinds of engineers, including SREs, to master because you have to communicate about your impact and persuade other people about your technical ideas. That’s particularly true for marginalized people, who have to document what they are doing and prove that they are doing great things because their manager isn’t paying attention to them.

Still focusing on SREs, you gave a talk at Monitorama PDX 2019 called “Tradeoffs on the Road to Observability” that talked about efficiency and sustainability for SREs. What is your biggest takeaway from that talk?

The biggest takeaway from that talk is don’t reinvent the wheel – don’t build it yourself if there is another alternative. I wanted to persuade people to really ask the question “what is my organization rewarding?” Is it rewarding building for the sake of building, or is it rewarding impact? A lot of companies decide to spin up a Kubernetes team or an observability team and build something in house. But they would be better off going with a managed service provider. You have to ask if it is truly business differentiating for you or not.

Honeycomb just issued research about observability. Tell us about that research. Were you surprised by the findings?

This was actually a project that we did in 2019 where we were attempting to build a maturity model around observability. We hired an independent research firm to go in and research the observability community to find out what were people practicing and what’s next? Is there a correlation between business success and observability?

It’s interesting to see where the market is now – what percentage of people are currently trying to find the path versus what set are already on board. There are a lot of people that want to get on the observability train but don’t know how. We are gearing up to do the report again in 2021 with the same type of questions so that we can see what was baseline and where the needle has moved in the past 18 months.

The thing that was a little surprising about this research is that I'd had selection bias in the conversations I'd had. I talked to people who are very interested in progressing along observability. When you look at where the market is as a whole you realize there’s a set of people you haven’t reached yet -- that was what was most surprising to me.

Aside from very high numbers of adoption, what else do you see on the horizon for Kubernetes observability that we may want to prepare for?

This is the thing that really conflicts me because a lot of companies are promoting Kubernetes agents. The challenge is that when you instrument your services at either the raw service log layer or the system metrics layer or the service mesh layer, you aren’t capturing the details of the applications, which is where the rich detail is. You have to combine these approaches. I think that’s something that people need to understand: dropping an agent in isn’t going to magically give you 100% observability.

As containerization and Kubernetes, in particular, continue to grow, how do you think that companies can continue to focus on and maximize the quality of service for clients?

I’m not sure the goal is to maximize the quality of service. The goal is getting good enough quality of service.

I should be clear that Honeycomb doesn’t use Kubernetes – we don’t use it in production. It’s overkill for our use case. The reason we see people adopting Kubernetes is not even from a reliability point of view, but from a cost point of view. They want to have workloads together. Many of these workloads are submachine. Our workloads fill whole machines and then some. And we have this machinery to automatically replace nodes. If one dies, we auto spin up a fresh one.

When it comes to Kubernetes, people initially adopt it for cost reasons or for container management because they want to spin up fresh copies of things. If you are going to do that, a lot of the reliability benefit comes from it being able to spin up a new container. Unfortunately, people are doing a lot of persistent workloads with Kubernetes which is fraught with peril. I understand there are ways to do it, but you have to test and make sure your workloads will survive having the pods restarted. People aren’t used to that. I say we don’t use Kubernetes, but we use Kubernetes concepts.

What effect do you think that dichotomy has on observability – for someone who really values observability, does it make sense to steer clear of Kubernetes or container-based orchestration?

This is part of the reason why observability matters. If you have many more services because you no longer have monoliths running on single machines, that means your logs don’t persist; they are being constantly scrubbed away and restarted, and you have many more services to trace execution flow through. Yes, it’s more complex and more difficult when it’s not on a single machine, but that’s why you should adopt observability. I’d go so far as to say it’s more valuable in those situations.

You’ve been very open and welcoming in hosting private office hours online for folks to chat about Observability & SRE. In a world where we’re now connecting with each other remotely more and more, the idea of virtual office hours is particularly intriguing to me. How has that experience been for you and how has it helped you connect with folks in the larger community?

I started doing Virtual Office Hours in March when it became clear that we weren’t going to have conferences for at least 18 months. That was a moment of realization that I, as a developer advocate, needed to continue doing community outreach but differently. I need to understand what developers are thinking about, what they are talking about and what problems they are experiencing. In the past I could go to a conference and talk to people about this and find out how their services were doing and what challenges they had. I realized that I couldn’t do that unless I started inviting people to come to my doorstep.

It turns out that the Virtual Office Hours were really neat – people from almost every continent have reached out to me. Even in a world where we do resume conferences, I’ll keep doing them. It’s so democratic – anyone can do it instead of just the people that can afford to fly to a place and stay in a hotel to see me.

Your Twitter is a combination of innovation, advocacy and inclusion. These are important topics in business, too. Do you have any feedback for companies about how diversity, inclusion, and innovation improve how they service their customers?

It’s kind of interesting that there’s the justice argument for diversity and inclusion and then there’s the economic argument. It can be really tricky to try to figure out how to talk about those things and I’m not sure I’ve mastered the recipe.

As for why I’ve chosen my Twitter to be how it is - it’s because I’m an entire person. If you don’t fundamentally respect me as a person, I don’t want to be giving you business advice. How I bring that lens into how I work is that if your service is working fine for 99% of people but it’s hard for 1% of people, that’s maybe something you should do something about. You should maybe figure out who those 1% of people are and why they are impacted poorly. Do they share common traits? And what can I do to rectify the problem? That’s an analogy for how the rest of the world is going. There are groups of people for whom the quality of service – as it were – is less good and we should try to fix that.
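
The “who are the 1%?” question can be asked of request data directly. The sketch below is only an illustration of that idea: it assumes each request is recorded as a dictionary with a hypothetical `ok` flag and a hypothetical `locale` field, and the sample data is made up. It groups requests by a shared trait and reports the groups whose success rate falls below a chosen target.

```python
from collections import defaultdict


def success_rate_by_group(events, key):
    """Map each value of `key` to a (success_rate, request_count) pair."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, requests]
    for event in events:
        bucket = totals[event[key]]
        if event["ok"]:
            bucket[0] += 1
        bucket[1] += 1
    return {group: (ok / n, n) for group, (ok, n) in totals.items()}


def underserved_groups(events, key, target):
    """Return the groups whose success rate is below `target`, sorted by name."""
    rates = success_rate_by_group(events, key)
    return sorted(group for group, (rate, _) in rates.items() if rate < target)


if __name__ == "__main__":
    # Made-up data: the service looks 99% healthy overall, but every
    # failure lands on one small group of users.
    events = [{"locale": "en", "ok": True} for _ in range(990)]
    events += [{"locale": "tr", "ok": False} for _ in range(10)]

    overall = sum(e["ok"] for e in events) / len(events)
    print(f"overall success rate: {overall:.1%}")
    print("groups below target:", underserved_groups(events, "locale", 0.99))
```

An aggregate number hides this kind of pattern; slicing by the traits people share is what shows who is being served poorly.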