Platform Engineering Trends in Cloud-Native: Q&A With Tom Wilkie
The rise of Kubernetes, cloud-native, and microservices spawned major changes in architectures and abstractions that developers use to create modern applications. In this multi-part series, I talk with some of the leading experts across various layers of the stack — from networking infrastructure to application infrastructure and middleware to telemetry data and modern observability concerns — to understand emergent platform engineering patterns that are affecting developer workflow around cloud-native. The next participant in our series is Tom Wilkie, CTO at Grafana Labs, where he leads engineering for Grafana Cloud.
Q: We are nearly a decade into containers and Kubernetes (Kubernetes was first released in Sept 2014). How would you characterize how Kubernetes has influenced modern thinking around distributed systems?
A: One dramatic way is that the marginal cost of “one more service” is effectively zero. Once you’ve paid the price of going to Kubernetes, adding another service is effectively free. Whereas if you think back a decade or 15 years, deploying another service meant spinning up some more VMs, writing some more Ansible, building a new binary, and packaging it as an RPM. The marginal cost of adding a new service back then was high. Now it’s just run of the mill.
This has had a huge impact on how you think about your architecture. When building monoliths, you had to worry about the isolation of the different services within the monolith. When you’re building microservices, you don’t have to — or at least, not nearly as much. The new patterns in microservices stem from this reduction in the marginal cost of running extra services, and they let you defer thinking about how to isolate different functions within the same piece of software. It also allows you to take more risks. If I write a new service and it ends up crashing all the time, it ideally doesn’t impact the rest of my application. Whereas in the monolithic world, it could take out my entire application.
On a different topic, there’s a long, storied history of people pontificating about software composability. I think Kubernetes has achieved a degree of composability that did not exist before. I was not a fan of CRDs and operators in Kubernetes — I thought this was massively over-complicated. However, CRDs and operators have made it possible to compose different technologies more easily than before, and this is making it possible to deliver solutions such as traffic management via service meshes. Exciting stuff.
Finally, I think the Kubernetes API machinery is going to transcend cluster scheduling. New software is being architected around Kubernetes APIs because it is the best way to get a large group of people working together asynchronously and collaborating without needing massive communication bandwidth. What’s impressed me about Kubernetes is its community and its ability to bring together people from different companies, even competitors, and have them collaborate. The standardization that Kubernetes has brought around patterns for working with APIs and distributed workforces has really enabled the dream of composability.
Q: How has this affected how we think about the dividing line between operators, developers, and new patterns around platform engineering?
A: Ten years ago, DevOps wasn’t really a thing. I was first introduced to these ideas at Google, where I was an SRE manager. In a lot of ways, SRE is just the re-branding of Ops.
But it’s not the same as Ops. Ops was often a very unscalable human process. SRE embodied the philosophy of “let’s treat Ops as an engineering problem.” Fundamentally, the problem is still Ops. The problem is still running software that other people have written, where you’re just going to try to engineer your way out of the parts that don’t scale well. I know it’s not a very popular opinion, but I still consider SRE to be Ops 2.0 or Ops++, which is not to say that it’s a bad thing. But it is fundamentally different from DevOps and fundamentally different from platform engineering.
Platform engineering is an extension of any other type of product engineering; the key difference is that your customers are your own engineers. Internally, a platform team, very specifically, is not responsible for operating other people’s software. Today, these platform teams typically consume the Kubernetes platform from one of the cloud service providers and then enrich it to make their developers’ lives easier. They act as the team that builds, deploys, operates, and standardizes CI/CD systems, observability tools, config management, and so on. They take this collection of your organization’s technology and layer on a set of opinions that make it a more coherent, consistent platform for developers.
As you get to bigger teams, the amount of “stuff” that individuals are responsible for on-call can increase to the point where it’s no longer possible to hold it all in your head. So, there is a world where specialists are required and dedicated SREs need to be introduced. But for most enterprises below the scale of Google or Netflix, I think SRE was a bit oversold; the reality is that most developers should be able to operate their own software, whether they’re supported by platform teams, DevOps processes, or a combination of the two.
Q: The popularity of the major cloud service platforms, and all of the thrust behind the last ten years of SaaS applications and cloud-native, created a ton of new abstractions that raise the level at which developers interact with underlying cloud and network infrastructure. How has this trend of raising the abstraction affected developers? And how has it changed the dynamics of how humans reason about systems, and the role of observability?
A: Cloud-native architectures have led to an explosion in telemetry data. There’s a saying that these microservices architectures “wear their complexity on their sleeve.” They don’t pretend to be simple or try to hide what’s going on inside the architecture from their operators; they expose it all.
In some ways, that makes them easy to operate. If you know the right questions, you can go in and ask them what’s going on: why is it behaving like this? Unfortunately, at first glance, it makes these systems appear very complicated.
It’s this idea of intrinsic complexity vs. extrinsic complexity. A lot of these distributed systems are inherently complicated. And would you rather have a distributed system that hid that complexity from you, which is going to make it hard to figure out what’s gone wrong? Or would you like a system that is inherently complicated and shows you that inherent complexity?
When I say these systems wear their complexity on their sleeve, I mean they expose a lot of telemetry. You look at a lot of these distributed systems, and they are natively instrumented in great detail. They make it very easy to understand what’s going on. The thrust of cloud-native technologies like eBPF and OpenTelemetry is making it even easier to gather new types of profiling information and to automatically instrument compiled applications with distributed traces. It’s all part of the same story: an explosion in telemetry, which means you need tooling that allows you to answer these questions.
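To make that concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK; the service name, span names, and attributes are invented for illustration and are not from the interview.

```python
# Minimal OpenTelemetry tracing sketch; assumes `pip install opentelemetry-sdk`.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that exports spans to stdout; in a real system this
# would typically be an OTLP exporter pointed at a collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def handle_request(order_id: str) -> None:
    # Each request becomes a span; attributes become queryable telemetry.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_card"):
            pass  # business logic would go here


if __name__ == "__main__":
    handle_request("abc-123")
```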
Q: What are the areas where it makes sense for developers to have to really think about underlying systems?
A: There is a term called “mechanical sympathy,” which basically expresses the value of understanding how a system works. You have to be able to phrase the questions you ask in a way that takes advantage of the underlying machine.

Often, that is more important than knowing the latest newfangled abstractions. I really think the cloud-native world is divided into two camps. There are people building software against these really wide APIs that — let’s face it — don’t have to handle that much throughput and load. And then there are the people building databases: really narrow APIs, but huge amounts of complexity and huge amounts of scale. I’d argue that generally, service meshes, serverless, no code — those kinds of things don’t even figure into the second kind of system; you still need to know the fundamentals of how software works. You still need to be able to use a profiler. You still need to be able to do basic optimizations, understand how computers work, and build your software with that in mind.
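As a toy illustration of mechanical sympathy, the sketch below profiles two ways of doing the same work; the matrix size and function names are invented, and the cache effect it demonstrates is far more pronounced in compiled languages, but the profiling workflow is the same.

```python
# "Mechanical sympathy" sketch: same result, two access patterns, profiled
# with the standard library. Traversing data in the order it is laid out is
# kinder to caches and interpreter overhead alike.
import cProfile
import pstats

N = 1000
matrix = [[i * N + j for j in range(N)] for i in range(N)]


def sum_row_major() -> int:
    # Visits elements in the order the rows are stored.
    total = 0
    for row in matrix:
        for value in row:
            total += value
    return total


def sum_column_major() -> int:
    # Jumps between rows on every access: same result, worse locality.
    total = 0
    for j in range(N):
        for i in range(N):
            total += matrix[i][j]
    return total


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    sum_row_major()
    sum_column_major()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```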
Inherently, complexity is not bad. There are complex problems. Making 1,000 computers appear as one consistent API is complicated. Anyone who tells you otherwise is trying to trick you. I would rather work with systems that allow me to see what’s going on inside them than ones that pretend nothing is wrong when everything is broken. And I think there’s evidence that developers today prefer technologies that are easier to observe.
Q: Despite the obvious immense popularity of cloud-native, much of the world's infrastructure (especially in highly regulated industries) is still running in on-prem datacenters. What does the future hold for all this legacy infrastructure and millions of servers humming along in data centers? What are the implications for managing this mixed cloud-native infrastructure together with legacy data centers over time? And what are the implications for observability across these mixed environments?
A: We’re all familiar with the cliché that every company is a tech company, whether you know it or not. And as a tech company, your ability to compete is based on your ability to innovate. If half my team is focused on keeping the lights on and running a server farm, doing a load of “undifferentiated heavy lifting,” replacing disks that have failed, racking and stacking more servers — I’m going to be half as competitive as a company that doesn’t have that problem and can just be elastic and spin up capacity on demand.
So, while macroeconomics might lead you to think there’d be a reduction in cloud spend and more investment in companies running their own servers, I’ve seen the opposite. Macroeconomics has meant smaller teams are being asked to do more. They are paying for SaaS and managed services because they realize that trying to run these services more cost-effectively than the providers is not their core business. And they are putting more pressure on the SaaS providers to build features that show them where they may be wasting money or where there are opportunities to reduce spend.
For the enterprises that are still managing on-prem deployments, I see them looking for ways to make those environments look as similar to the new cloud-native world as possible. They don’t want the cognitive load of developers thinking about two different environments. They want these cloud-native tools in the data center. The general trend is toward managing your on-prem the same way you manage your cloud. I think that is where this is all headed.
Q: What do you think are some of the modern checklist items that developers care most about in terms of their workflow and how platform engineering makes their lives more productive? And where does observability fit in?
A: I think for platform engineering to be done right, there has to be a great config management story. Most companies have a terrible config management story. Or worse, they have multiple different config management systems, and they’re application and team-specific, and they’re proprietary.
Config management is a superpower. If you can get it right, if you can have really consistent, high-quality config management with loads of automation around it, that is how you scale.
The platform engineering team’s job is to force multiply. Their job is to build a CD system and then have every team in the company use the same continuous deployment system. If you don’t have a consistent config management system, how can you achieve a consistent continuous deployment system? You end up in a world where every team does its own config management, every team does its own CI, every team does its own CD, every team does its own vulnerability management, every team does its own SLO management, and so on. And that makes it very hard to build a scalable, force-multiplying platform team.
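As a loose sketch of what that consistency can look like in practice, imagine a single platform-owned template that every team renders its deployment from; the registry, labels, and defaults below are hypothetical, not a description of any particular company’s setup.

```python
# Hedged sketch of a "consistent config management story": one platform-owned
# function renders a standardized Kubernetes Deployment for any service, and a
# shared CD system applies the output. All names and values are illustrative.
import json

PLATFORM_DEFAULTS = {
    "replicas": 3,
    "registry": "registry.internal.example.com",  # hypothetical registry
}


def render_deployment(team, service, version, replicas=None):
    """Render one standardized Deployment manifest for any team's service."""
    labels = {"team": team, "app": service}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": service, "labels": labels},
        "spec": {
            "replicas": replicas or PLATFORM_DEFAULTS["replicas"],
            "selector": {"matchLabels": {"app": service}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": service,
                        "image": f"{PLATFORM_DEFAULTS['registry']}/{service}:{version}",
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Every team calls the same function instead of maintaining its own config.
    print(json.dumps(render_deployment("payments", "checkout", "1.4.2"), indent=2))
```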
Enterprises that want to attract top talent today also have to understand that software engineers are very nomadic. They stay in jobs for a few years, and they want to develop transferable skills. They want to work on Kubernetes. They want to work with Prometheus and Grafana and other emergent standards in the space. Google’s release of Kubernetes addressed this challenge. Previously, they would hire a bunch of engineers, and it would take them many months to learn the internal tech stack. One of their solutions was to externalize a lot of their technology choices and train a new generation of engineers who were already familiar with their tech.
Now, most modern companies have a similar tech stack. They have some Kubernetes clusters, they have some basic open source observability tooling, they use an open source CI/CD system, and so on. You can move to a different company and be relatively productive relatively quickly because you’re acquiring transferable skills. It also helps, as an employer, to be able to say: work for us, we use all the standard cloud-native tooling, and you’re not going to have to re-learn everything and spend the next six months getting used to the proprietary choices we made.