Why teams fail with Kubernetes—and what to do about it

Kubernetes offers a powerful operating model for running cloud-native systems, but adopting it is anything but straightforward.

Yes, Kubernetes helps reduce the operational complexity of microservices, and it provides useful abstractions for deploying and running containers. But moving to Kubernetes is akin to adopting an elephant as a pet.

There are major implications to how teams must interact when you're using Kubernetes—especially as you scale. Fail to address those issues, and you'll put your entire endeavor at risk. Here's what you need to keep in mind.

It's all about team interactions

Kubernetes adoption is not just about the operations/infrastructure team migrating the infrastructure setup to Kubernetes clusters while product teams deploy and run services in Kubernetes pods. Those are the core inputs to the engine, but you'll face many other tasks and responsibilities when running Kubernetes—even if you're using a managed service.

Fail to address the questions "Who is responsible for x?" and "Who is affected by y?" and you'll put all your efforts at risk. For example, replace "x" above with "deciding on namespaces versus clusters for service and environment isolation" or "upgrading all clusters to a new Kubernetes version," and you start to see why you need to clarify the boundaries of responsibility and their impacts.

The way teams interact, and the behaviors promoted by your culture, are more accurate predictors of a successful Kubernetes adoption than technical expertise or infrastructure costs and metrics. That holds, at least, if you measure success as enabling faster and sustainable delivery of customer-focused value (via features, better user experience, more resilience).

Having clarity of purpose, and understanding the responsibilities and behaviors around the teams operating Kubernetes (operations/infrastructure/platform) as well as the teams using Kubernetes (product/feature/stream) are all key to success.

Abstractions, cognitive load, and DevEx

Using Kubernetes might be a sound decision from an engineering standpoint, but the developer experience (DevEx) is often subpar, and the abstractions are at a lower level than any individual developer would need because Kubernetes was designed as a generic platform to meet every possible use case.

Extraneous cognitive load is the amount of human working memory used to understand and perform a task that is not directly related to the business outcome you're trying to achieve.

Poor DevEx and complicated abstractions and interfaces mean that the cognitive load for the average developer who lacks deep Kubernetes expertise will increase steeply when you adopt Kubernetes. That is, unless you explicitly consider and manage that potential overload.

You need a digital platform on top of Kubernetes

Kelsey Hightower, staff developer advocate for the Google Cloud Platform, said Kubernetes should be an implementation detail of an organization's change management system.

In other words, you need to focus on clarifying the interfaces and enhancing the usage experience of the internal services that your product teams rely upon to quickly and safely build, deploy, and run the services they are responsible for. These systems can range from CI/CD pipelines to monitoring and metrics collection.

You need to abstract away the details that are extraneous to your organization's build and run processes. You need to increase the reliability, predictability, and security of that small set of critical internal services, and provide adequate support (including on-call support) and communication channels for fast feedback.

All of this is captured in Evan Bottcher's simple definition of a digital platform: "A digital platform is a foundation of self-service APIs, tools, services, knowledge and support, which are arranged as a compelling internal product."

Kubernetes is not a digital platform, although it might well be the foundation for one (whether or not it runs under a managed service such as Amazon Elastic Kubernetes Service). Failing to understand and address this difference is the prime reason for poor adoption in many organizations.

Not defining this internal platform leads to inconsistencies in the use of external services. It also leads to unreasonable demands on product teams that are already being pulled in many directions while, ironically, being pressured to deliver more features faster, since they now have Kubernetes.

But Kubernetes is no silver bullet. Its complexity presents a steep learning curve for newcomers. If your engineers are being asked to rely on Kubernetes documentation to learn to solve their problems, no matter how good that documentation is, you do not have a digital platform.

You likely have a gap in operational capabilities and a maturity issue that needs to be addressed before you can reap the force-multiplier benefits that Kubernetes can bring about.

The size of a digital platform varies with the organization's maturity and scale. For a startup, a simple wiki page specifying which cloud services to use with some sensible defaults, tricks, and caveats might be enough. You might rely on your more experienced engineers to provide documentation and support on an as-needed basis. In our book, Team Topologies: Organizing Business and Technology Teams for Fast Flow, Matthew Skelton and I call this a "thinnest viable platform."

As your startup grows, so will your platform, as the product teams begin to need more internal services. Eventually, a platform group might include multiple platform teams, each aligned to a small set of platform services. These teams need strong product management to create a compelling internal product that makes life easier for the other engineering teams (the platform clients).

How Airbnb enabled 1,000+ engineers with Kubernetes

Airbnb is a good example of a digital platform on top of Kubernetes that evolved based on the needs of its engineering teams. Melanie Cebula, infrastructure engineer at Airbnb, spoke at QCon London about the way her team wraps Kubernetes into easy-to-consume internal services for its development teams.

As she explained, instead of creating a set of dreaded YAML files (deployment, ConfigMap, service) per environment (dev, canary, production), development teams need only provide their project-specific, service-focused inputs and then run the internal service kube-gen (alias k gen).

This simple command takes care of generating all the required YAML files, ensuring their correctness (not just syntax-wise but also semantically in terms of expected values), and finally applying them in the corresponding Kubernetes cluster(s).

Figure 1. The kube-gen wrapper generates the needed configuration files per environment at Airbnb. Source: Melanie Cebula, Airbnb.
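
To make the idea concrete, here is a minimal sketch of a kube-gen-style wrapper. It is not Airbnb's implementation: the function names, the per-environment defaults, the input fields, and the namespace naming scheme are all assumptions made for illustration. It takes a small, service-focused input, validates it semantically, and expands it into a Deployment and a Service per environment. The manifests are emitted as JSON, which Kubernetes tooling accepts alongside YAML.

```python
import json

# Hypothetical per-environment defaults that a platform team might own.
ENVIRONMENTS = {
    "dev": {"replicas": 1},
    "canary": {"replicas": 1},
    "production": {"replicas": 3},
}


def validate(service):
    """Reject inputs that are well-formed but semantically wrong."""
    for key in ("name", "image", "port"):
        if key not in service:
            raise ValueError(f"missing required field: {key}")
    if not 1 <= service["port"] <= 65535:
        raise ValueError(f"port out of range: {service['port']}")


def generate(service):
    """Return {environment: [deployment, service]} as plain dicts."""
    validate(service)
    name, image, port = service["name"], service["image"], service["port"]
    manifests = {}
    for env, defaults in ENVIRONMENTS.items():
        labels = {"app": name, "env": env}
        # Assumed convention: one namespace per service and environment.
        metadata = {"name": name, "namespace": f"{name}-{env}", "labels": labels}
        deployment = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": metadata,
            "spec": {
                "replicas": defaults["replicas"],
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        "containers": [
                            {
                                "name": name,
                                "image": image,
                                "ports": [{"containerPort": port}],
                            }
                        ]
                    },
                },
            },
        }
        svc = {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": metadata,
            "spec": {
                "selector": labels,
                "ports": [{"port": port, "targetPort": port}],
            },
        }
        manifests[env] = [deployment, svc]
    return manifests


if __name__ == "__main__":
    # Hypothetical service input; the image URL is a placeholder.
    out = generate(
        {"name": "payments", "image": "example.com/payments:1.0", "port": 8080}
    )
    print(json.dumps(out["production"][0], indent=2))
```

The point of the sketch is the shape of the interface: a product team supplies three fields, and everything else (labels, replica counts, namespaces) is a platform decision encoded once rather than copied into every repository.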

The infrastructure team at Airbnb is saving hundreds, if not thousands, of hours for 1,000+ engineers who can now use a much simpler abstraction that has been adapted to their needs, with a user experience that's familiar to them.

Other internal services provided by the infrastructure team include k deploy, to create new namespaces; k diagnose, to collect information from multiple sources on malfunctioning pods and services; and templates for new services and deployment pipelines.

Effectively, they are providing a digital platform for their engineers that embeds their evolving understanding of what engineering teams need to perform better, as well as good practices and tooling around security, logging, debugging, and so on. Crucially, they are doing this without asking for more of the engineering teams' cognitive load. Instead, engineers are free to focus on business outcomes with clearly defined, simple service boundaries.

Figure 2. The infrastructure platform at Airbnb establishes clear boundaries and reduces the cognitive load on development teams. Source: Team Topologies: Organizing Business and Technology Teams for Fast Flow.

Clear team interactions are key to sustained success

The success of an internal platform is influenced by the behaviors and interaction modes of the responsible teams to a much larger extent than by its technical achievements. If the platform team does not see its mission as reducing the extraneous cognitive load of engineering teams by means of a compelling internal product, then it might dwell on the technical complexity of a service and forget to check whether it serves the needs of the team that requested it.

If the platform team does not collaborate closely with the product teams during the initial stages of a new service (or the evolution of an existing one) to get fast feedback, then the developer experience will suffer, and usage will drop because the platform will stop being a compelling product.

If the platform team does not provide timely (on-call and office hours) support for its internal services with clear response times, service status pages, and communication channels, then the platform will not be seen as reliable and engineering teams might resort to other options.

On the other hand, product teams need to carefully reconsider whether they really need to go off the "paved road" provided by the platform for any specific service or tooling requirements. If they go off on their own without talking to the platform team and without a clear use case for adopting some new technology, then they will break the trust boundaries with the platform team and end up having too much unnecessary cognitive load.

Product teams need to be open and frank about their needs while understanding whatever limitations the platform teams might be working under. Blameless interactions are key.

A general pattern of interaction between product and platform teams is to collaborate closely during the initial discovery stages for a new platform service (or evolution) required by a product team. Over time, this intentional collaboration effort will diminish as the needs, boundaries, and interfaces for this service become clearer, until eventually it can be consumed by all product teams as a service.

Figure 3. The evolution pattern of team interactions for a new platform service (or evolution), from initial discovery with high collaboration to "X as a service" with no need to collaborate any more. Source: Team Topologies: Organizing Business and Technology Teams for Fast Flow.

In the end, it's all about teams having a clear purpose, responsibilities and ways of interacting in order to set the right expectations and behaviors.

How to get started

Take these three simple steps to nudge your organization's Kubernetes adoption with a human- and team-centric approach.

Assess cognitive load. Ask your teams if they truly understand how to build, deploy, and run the applications they are responsible for in Kubernetes.
Visualize the platform. Kubernetes is not your internal platform. Document how your organization is currently using it, along with your recommended practices, sensible defaults, and other useful information in a wiki page. Then start adding the missing pieces for a true digital platform.
Clarify team interactions. Set the right expectations between teams in terms of who is responsible for what, who is affected, and what types of behaviors to adopt in which circumstances.
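
One lightweight way to start on the third step is to write the answers to "Who is responsible for x?" and "Who is affected by y?" down as data and look for gaps. The sketch below is only an illustration: the team names, the ownership assignments, and the helper functions are hypothetical, and the two tasks are the examples mentioned earlier in this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Responsibility:
    task: str
    owner: Optional[str] = None  # None means nobody has claimed the task yet
    affected: List[str] = field(default_factory=list)


def unowned(items):
    """Tasks with no responsible team: the gaps to discuss first."""
    return [r.task for r in items if r.owner is None]


def affected_by(items, team):
    """Tasks a team should be consulted on, even though it does not own them."""
    return [r.task for r in items if team in r.affected and r.owner != team]


# Hypothetical teams and assignments, for illustration only.
responsibilities = [
    Responsibility(
        task="deciding on namespaces versus clusters for isolation",
        owner=None,
        affected=["platform", "checkout", "search"],
    ),
    Responsibility(
        task="upgrading all clusters to a new Kubernetes version",
        owner="platform",
        affected=["checkout", "search"],
    ),
]

if __name__ == "__main__":
    print("Unowned:", unowned(responsibilities))
    print("Affects checkout:", affected_by(responsibilities, "checkout"))
```

Even a list this small surfaces the useful conversations: an unowned decision, and product teams that are affected by an upgrade they do not control.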

Follow the initial steps above and you'll start to understand the gap between your current Kubernetes implementation and having an internal digital platform (and teams) that accelerates software delivery through reduced cognitive load, a first-class developer experience, and a compelling platform that is resilient and fit for purpose.

You'll also gain insights into how your teams interact today, and the anti-patterns and misaligned expectations that are creating friction between teams and individuals. You'll be moving toward a healthier, more organic work environment that acknowledges the complex socio-technical nature of software systems today.

Want to know more about digital platforms and reducing the cognitive load of Kubernetes? Attend my talk, "The Elephant in the Kubernetes Room: Team Interactions at Scale," at KubeCon + CloudNativeCon North America in San Diego, California, which runs November 18-21, 2019. I'll be speaking on Thursday.
