With immutable infrastructure, your systems can rise from the dead

Would you like to command your own personal army of zombies? With immutable infrastructure and Kubernetes, you can.

Immutable infrastructure is built out of service components that resemble code objects: each one is intended to perform a limited function. As with a code object, you don't change an infrastructure object in place; you replace it with a new version. Your infrastructure components act like your own personal army of zombies that rise from the dead when summoned!

This methodology can be shocking to the unbeliever, as it may mean scrapping ideas and processes put forward by tools such as Puppet, Chef, and Salt in favor of declarative user data or images. Why? More on that a bit later.

Immutable infrastructure, fresh out of the oven

I first came across the idea of immutable infrastructure several years ago at WWDC while hanging out with DevOps engineers from a large European media company. The concept was not new—Amazon had been steering AWS users in this direction for several years. But I had questions about how the DevOps engineers had been able to set up such a dynamic infrastructure to handle the Olympics. That's when they explained the concepts of immutable infrastructure and described a process they had created, which they called “The Bakery.”

The Bakery is a tool that takes updates from the engineers' continuous integration process and rolls them into Amazon Machine Images (AMIs). These images are then gradually deployed throughout the infrastructure in a rolling-instance rebuild.

Because there is no need for live updates, and no need for Puppet, the approach greatly reduces the number of potential points of failure. The engineers explained how secure this process is for them, because they only expose the exact ports their application uses and even disable SSH.

Support, too, is greatly improved. Logs from the servers are offloaded for the engineers to analyze and to inform future versions of their AMI. Critical failures can be solved by rolling back to a previous AMI version, and minor failures result in the simple action of deleting and replacing the instance.

Fast forward to the age of containers

If you’re container-savvy, you’re probably thinking to yourself, “Hey, I already do this with containers!”

If so, good for you, as this is the correct way to use containers in most cases. However, many people stick to their old ways when thinking about Kubernetes clusters. I’m not saying that there is never a place for server management systems, but in my 16 years of experience with 4 companies running large multimillion/billion-dollar infrastructures, the cost of such systems piles up. Kubernetes already has the functionality to operate independently of these systems, and adding more can just make an already complicated system that much more unwieldy and costly.

How to apply user data to server configurations

User data is my preferred way to apply a declarative configuration to my servers, because it is generally supported on most cloud and on-premises virtual machine systems.

Different operating system types support different types of user data, so Google around to see which setup is right for you. I prefer CoreOS with Cloud-Config. But the immutable infrastructure concept is not tied to a specific technology. Using a system similar to what my media company friends use may be more up your alley.

So how do I apply this to Kubernetes? First, consider the Kubernetes worker nodes (minions, to longtime Kubernetes users). These are the easiest to… immutablize? Here's a punch list of the things you want to configure:

  • The kubelet (configured to connect to a master)

  • kube-proxy (configured to connect to a master)

  • Docker and Flannel (if not already configured by default)

That’s all you need. I am sure some readers might want a few other goodies, but there is no need for things such as ZooKeeper, Puppet, etc.

At this point, you might be thinking, "What if one of those services fails?" The answer is to delete and rebuild the instance. I typically have a process, such as an AWS Elastic Load Balancing health check or similar, that watches the health of the service ports. If they misbehave, I delete or replace the instance.
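
As a rough illustration of that health watch, here is a minimal Python sketch that probes a node's service ports over TCP and reports whether the node should be replaced. It is not the process I run in production. The function names and the SERVICE_PORTS table are hypothetical. The only port listed is 10250, the kubelet's default serving port; add whatever ports your own nodes expose. The actual delete-and-replace step is left to your cloud provider's tooling.

```python
import socket

# Hypothetical port table: the article does not name specific ports.
# 10250 is the kubelet's default serving port; extend this for your own nodes.
SERVICE_PORTS = {"kubelet": 10250}


def port_is_healthy(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failing_services(host, ports=SERVICE_PORTS, timeout=2.0):
    """Return the sorted names of services whose port does not accept connections."""
    return sorted(
        name for name, port in ports.items()
        if not port_is_healthy(host, port, timeout)
    )


def should_replace(host, ports=SERVICE_PORTS, timeout=2.0):
    """A node is a replacement candidate if any watched service port fails."""
    return bool(failing_services(host, ports, timeout))
```

A TCP connect only shows that something is listening. A load balancer health check that requests an HTTP health endpoint is a stricter test, if your services provide one.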

The beauty of Kubernetes is that when that happens, the pods simply move to another healthy instance. The resources that would otherwise be reserved for management get dedicated to my workload, and I maintain stability across my application. Upgrades to these components are applied by updating your user data or image and slowly rolling the new configuration across your infrastructure.
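
The slow rollout can be sketched as a loop over small batches that halts as soon as a replacement comes up unhealthy. The Python below is a simplified sketch under assumptions, not a real deployment tool. The replace and is_healthy callables are hypothetical stand-ins for whatever your cloud provider's API and health checks offer, and instances are represented by plain strings.

```python
from typing import Callable, Iterable, List, Tuple


def plan_batches(instances: Iterable[str], batch_size: int) -> List[List[str]]:
    """Split the instance list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(instances)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def rolling_replace(
    instances: Iterable[str],
    batch_size: int,
    replace: Callable[[str], str],      # hypothetical: rebuilds an instance, returns the new ID
    is_healthy: Callable[[str], bool],  # hypothetical: health check for a new instance
) -> Tuple[List[str], bool]:
    """Replace instances batch by batch.

    Returns the IDs of the verified replacements and a success flag. The
    rollout halts at the first batch containing an unhealthy replacement,
    leaving the remaining old instances untouched.
    """
    verified: List[str] = []
    for batch in plan_batches(instances, batch_size):
        new_ids = [replace(old) for old in batch]
        if not all(is_healthy(new) for new in new_ids):
            return verified, False
        verified.extend(new_ids)
    return verified, True
```

Keeping the batch size small relative to the cluster leaves enough healthy nodes for the pods that get rescheduled while each batch is rebuilt.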

Master servers can be more difficult. There are a few additional processes to consider, and you must ensure that you have a plan for the etcd database that backs Kubernetes. Most large organizations separate their etcd cluster from their master nodes.

If you do this, configuring the master nodes becomes almost as simple as configuring the worker nodes. It is possible, however, to have etcd reside on the master nodes in an immutable setup. Here is an example of a master config where each Kubernetes master node is also an etcd member.

If you are running a solo master, services such as Puppet and ZooKeeper may be a good fit. If you run a multi-master setup (which I recommend for production), you can treat your master servers much as you treat the workers.
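
One way to see why the master count matters when etcd lives on the masters: etcd needs a majority of its members available to keep accepting writes. The article doesn't spell out that majority rule; I'm adding it here as background. The short Python sketch below, with hypothetical function names, works out how many members can be down at once, and therefore whether a single master can be deleted and rebuilt without losing the cluster state.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be unavailable while a majority remains."""
    return members - quorum_size(members)


def safe_to_replace_one(members: int) -> bool:
    """True if one member can be deleted and rebuilt without losing quorum."""
    return tolerated_failures(members) >= 1
```

With a single master, tolerated_failures(1) is 0, so deleting it takes the database down with it. With three masters one can be replaced at a time, and with five, two can be unavailable.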

Make it all work in production

So how well does this setup actually perform in production? This configuration may not be ideal for every organization, so be sure to do your homework. That said, we have been running this exact setup in production for well over a year now, with 3,000 to 4,000 pods, and the difference between this setup and our previous, Chef-based setup is like night and day.

Our support calls have been greatly reduced. Our support staff is able to quickly resolve the few issues we do have by deleting the offending instance. And the engineering staff has been able to quickly trace each failure to its root cause and update the user data configuration to resolve it.

This style of infrastructure thinking flies in the face of everything an old-school DevOps engineer holds dear. I can tell you from personal experience, however, that it is a relief to no longer fear the dreaded 3 a.m. PagerDuty call.

Want to learn more? Drop in on my Lightning Talk, Zombie Kubernetes—Making nodes rise from the dead, at CloudNativeCon + KubeCon Europe 2017 in Berlin. 
