6 best practices for highly available Kubernetes clusters

Meaghan Kjelland, Software Engineer, Google

Karan Goel, a software engineer at Google, cowrote this story.

Everyone running a Kubernetes cluster in production wants it to be reliable. Many administrators implement a multi-master setup, but a multi-master setup alone often isn't enough to make a cluster highly available.

A highly available system must gracefully handle the failures of its components: if one part of the system fails in any way, the system can recover without significant downtime.

So how exactly can you achieve a highly available, highly reliable, multi-master Kubernetes cluster? One way is to run a regional Kubernetes cluster on Google Kubernetes Engine, a managed version of Kubernetes hosted on Google Cloud Platform.

To help you achieve highly available Kubernetes clusters, here are some best practices based on what we learned while operating Google Kubernetes Engine at scale at Google.

High availability for the Kubernetes control plane

The Kubernetes control plane consists of the controller manager, scheduler, and API server. When running a highly available Kubernetes cluster, the first thing to focus on is running multiple replicas of these control plane components.
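As a quick sketch, on clusters where the control plane components run as static pods (a kubeadm-style setup; managed offerings such as Google Kubernetes Engine hide the control plane from you), you can inspect those components with kubectl:

```shell
# List the control plane pods; kubeadm labels them tier=control-plane.
kubectl get pods -n kube-system -l tier=control-plane

# Check the reported health of the scheduler, controller manager, and etcd.
kubectl get componentstatuses
```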

1. Figure out the types of failures you need to protect your cluster from

Some common failure domains that people consider are tied to their networks, disk/data, machines, power sources, cooling systems, etc.

Kubernetes is designed to handle brief control plane failures. Workloads will continue to run and be accessible on the worker nodes. However, if worker nodes fail while the control plane is down, there is nothing available to reschedule the work or to reconfigure routing within the cluster. This can leave workloads and services inaccessible.

Google Kubernetes Engine's regional clusters run the machines that make up the cluster across multiple Google Compute Engine zones. This allows the control plane to continue to run while one zone is experiencing a failure.

On Google Compute Engine, regions are independent geographic areas, or campuses of data centers, that consist of zones. Zones are deployment areas for cloud platform resources within a region. From an availability standpoint, each zone should be treated as a single failure domain.

Currently on Google Compute Engine, you have a choice of 18 regions, 55 zones, and over 100 points of presence across 35 countries.

2. Run multiple replicas of the control plane components across failure domains

Once you have chosen the failure domains that you care about, you can run multiple replicas of the control plane across those domains. The idea behind replication is to run multiple copies of a process so that if one fails, another one can pick up the job.

If you're running multiple copies of the same workload, you need to define where those are running. For us that means different Google Compute Engine zones that are physically located in different places.

For someone else that could mean simply running on different racks within a data center. For still others that separation could mean running on a different continent. That’s something you need to decide based on your business and technical needs.

On Google Kubernetes Engine, if all the replicas are running within the same zone and you suffer a zonal outage, your whole cluster will be temporarily unavailable. Regional clusters reduce this risk because the replicas run across multiple Google Compute Engine zones. In fact, Google Kubernetes Engine itself is a global service.
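For illustration, creating a regional cluster on Google Kubernetes Engine is a single gcloud command; the cluster name and region here are placeholders:

```shell
# Create a regional cluster: the control plane and nodes are replicated
# across the zones of us-central1 (my-ha-cluster and the region are placeholders).
gcloud container clusters create my-ha-cluster \
    --region us-central1 \
    --num-nodes 1   # one node per zone, so three nodes in total
```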

High availability for etcd, the cluster data store

Kubernetes uses etcd as a data store for all cluster-related data. That means the information about the pods you're running, the nodes you have in your clusters, and the secrets—all of that is stored in etcd. In a reliable system, etcd needs to be able to handle failures without losing data.

3. Run etcd as a multi-node cluster in production

It's critical to keep etcd running because it functions as the brain of the entire cluster. Like the control plane components above, etcd can run with multiple replicas.

However, etcd is a bit more complicated because it is stateful: the data stored in one replica needs to match all the others. Because the replicas share data, they must coordinate with one another, which makes leader election more involved than it is for stateless components.

A common replication pattern in distributed systems is active-passive replication: a single instance is elected as the leader, while the other instances wait to take over if the leader goes down. This is a simple form of leader election.

This pattern works well for stateless components, such as the controller manager, but it has reliability implications for stateful components, such as etcd. Etcd uses a quorum-based leader-election algorithm that requires a strict majority of replicas to elect a leader. Cluster members elect a single leader, and all other members become followers.
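As a minimal sketch, one member of a three-node etcd cluster can be started like this. The member names (etcd-a, etcd-b, etcd-c) and the 10.0.0.x addresses are placeholders; the other two members are started the same way with their own names and URLs, and an odd number of members keeps quorum simple:

```shell
# Start the first member of a three-node etcd cluster.
# Names and addresses below are placeholders for your own topology.
etcd --name etcd-a \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-cluster etcd-a=http://10.0.0.1:2380,etcd-b=http://10.0.0.2:2380,etcd-c=http://10.0.0.3:2380 \
  --initial-cluster-state new
```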

4. Back up etcd

You can never prevent or predict all the failures that could happen in a system. To prevent a failure from taking out the entire etcd cluster, back up the data in etcd periodically so you can recover.

Etcd provides a simple snapshot command through its command-line interface, etcdctl, that you can use to capture the current state of the data in your cluster. Ideally, backups run regularly and are stored somewhere isolated from where the cluster is running, so a failure that takes down the etcd cluster can't also take out the backup.

It's critical to test the freshness of your periodic backups in addition to regularly testing the restore action from your backups.
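A hedged sketch of what that looks like with etcdctl; the backup path and restore data directory are placeholders:

```shell
# Take a snapshot of the current etcd state (etcd v3 API).
ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-snapshot.db

# Verify the snapshot's hash, revision, and size before trusting it.
ETCDCTL_API=3 etcdctl snapshot status /backups/etcd-snapshot.db

# Restoring is also an etcdctl command; exercise it regularly,
# not just when disaster strikes.
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored
```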

How to run your own workloads in a high-availability way on Kubernetes

Kubernetes makes it easier to run your own applications with high availability, but it is not automatic. The same principles we use to run the control plane apply to your own workloads; the difference is that you have to configure them yourself.

5. Choose a leader election algorithm

The first question to ask when configuring your application is whether it needs leader election at all. If it does, there is no one right answer: the leader election algorithm you choose depends on the kind of workload you're running and on your requirements.

For example, etcd uses a quorum-based leader election algorithm, while the Kubernetes scheduler uses an active-passive leader election.
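You can watch the scheduler's active-passive election in action. The current leader is recorded in an annotation on an Endpoints object (newer Kubernetes versions use a coordination.k8s.io Lease object instead); this assumes a cluster that exposes these system objects:

```shell
# Show which scheduler replica currently holds the leader lock.
kubectl -n kube-system get endpoints kube-scheduler \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
```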

6. Run your application across failure domains

Similar to what you do with the control plane components, you want to run multiple replicas of your own workload. You can easily do this using a Deployment or a StatefulSet on Kubernetes.

You can also use pod anti-affinity rules to balance your multiple replicas across failure domains. This will reduce the risk that hardware failures will cause outages in your applications.
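As a sketch, a Deployment can declare pod anti-affinity so the scheduler spreads its replicas across zones. The my-app name and image are placeholders, and the failure-domain.beta.kubernetes.io/zone topology key is the well-known zone label (newer clusters use topology.kubernetes.io/zone):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Prefer (rather than require) placing replicas in different zones.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: my-app
        image: my-app:1.0  # placeholder image
```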

There's more to learn

As you can see, running Kubernetes can be very difficult. The learning curve is steep, and there are many nuances to running production workloads, but Kubernetes also provides great flexibility to meet your unique needs.

Some additional considerations, aside from high availability, are scaling the cluster itself, upgrades to the machines on which you're running Kubernetes, and upgrades to the version of Kubernetes you're using.

There are many systems, including Google Kubernetes Engine, that will do these things for you. And because you don't have to know all the intricate details of how the system runs, you can focus on building your own workloads.

For example, the goal with Google Kubernetes Engine regional clusters is to ensure that your control plane runs in different physical locations. You don't have to do anything; the built-in automation takes care of it all.

Next steps

During our talk at KubeCon, we'll share a variety of resources, including white papers and documentation for Google Kubernetes Engine's regional clusters, which became generally available in June. We'll also discuss more about how we think about failure domains.

Kubernetes is constantly evolving to better meet the needs of the community. The best way to stay in the know is to be involved by watching conference talks, reading current articles, and even participating in special-interest group meetings.

Then, after you look at these resources, try playing around with Kubernetes yourself.

To learn more about best practices for highly available Kubernetes clusters, come to our session at KubeCon + CloudNativeCon, December 10-13, in Seattle, Washington. We'll be speaking Thursday at 1:45 pm.