5 best practices for container orchestration in IT production

If your enterprise IT operations organization has moved to container technology such as Docker, you’re likely dealing with container orchestration systems within IT production. These systems include Apache Mesos, Kubernetes (which originated at Google), Docker Swarm, and a few smaller players.

If you haven’t been paying attention to container orchestration technology, you should. It's just as important as the containers themselves. The products allow you to schedule containers to start and stop, as well as scale container usage through managed container clusters.

Here's what's important about container orchestration engines: Without them, containers would be a really nice, distributed, and portable architecture, but they wouldn't be able to scale to enterprise needs. Container orchestration engines solve the scaling problem, or at least part of the problem.

At issue is IT production, which is tasked with actually making this stuff work, and making it work well. At least four nines (i.e., 99.99%) uptime is expected these days, and considering that most of this technology is new in the market, that’s asking a lot from those charged with container production. 

So what best practices should IT operations managers and staff charged with IT production be considering as they move container-based applications into production? While there are existing operations patterns around virtualization, those in IT operations find out quickly that containers are not virtual machines. Indeed, there are not many existing IT operations analogs to consider.

Considering this void, the time is right to define core best practices for container orchestration for IT production. Here are five of the most important steps. 


1. Set up demarcation lines for moving into production

While this is a common traditional practice, those who deal with containers often don’t understand the path from development to production. When dealing with container orchestration, there needs to be a staging platform, which is typically at the end of a DevOps process and tool chain. Those containers need to be tested, integrated, validated, and made ready for staging.

When in staging, they should run within an orchestration system, such as Kubernetes, that is configured as an exact copy of production. Once proven stable, the containers can be promoted from staging to production. Finally, it must be possible to roll them back at any time if issues occur with the new deployment. In many cases, rollback is an automatic process.
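The promote-then-roll-back flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real orchestrator client: the Environment class, the promote function, and the is_healthy callback are all hypothetical names, and a production setup would call the orchestration system's own deployment and rollback features instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Environment:
    """Hypothetical stand-in for a cluster managed by an orchestrator."""
    name: str
    history: List[str] = field(default_factory=list)  # image tags, oldest first

    @property
    def current(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def deploy(self, image_tag: str) -> None:
        self.history.append(image_tag)

    def rollback(self) -> None:
        # Drop the newest release, but never remove the last remaining one.
        if len(self.history) > 1:
            self.history.pop()


def promote(image_tag: str,
            staging: Environment,
            production: Environment,
            is_healthy: Callable[[Environment], bool]) -> bool:
    """Deploy to staging first; touch production only if staging is healthy."""
    staging.deploy(image_tag)
    if not is_healthy(staging):
        staging.rollback()
        return False

    production.deploy(image_tag)
    if not is_healthy(production):
        # Automatic rollback: production returns to the previous release.
        production.rollback()
        return False
    return True
```

The design choice worth noting is that production is never changed until the same release has passed the health check in staging, which is the demarcation line this step argues for.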

2. Automate reporting of issues found in container orchestration production

Things can go wrong, especially when you consider how containers operate within orchestration systems.

Given that production and development are now linked (via DevOps practices), it’s important that there be automatic reporting of issues found within containers that move into production.

Developers need continuous reporting to understand what’s going wrong, and they need to respond with fixes that are continuously tested, integrated, and deployed so that issues are resolved quickly.
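As a rough sketch of what automatic reporting could look like, the Python below groups container failure events by image and failure reason and produces one report per recurring problem, ready to hand to an issue tracker. The ContainerEvent type, the threshold value, and the report fields are assumptions made for illustration; the reason strings are only sample values, and a real pipeline would read events from the orchestration system and post the payload to whatever tracker the team uses.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ContainerEvent:
    """Hypothetical failure event collected from production."""
    container: str
    image: str
    reason: str  # sample values: "OOMKilled", "CrashLoopBackOff"


def build_issue_reports(events: Iterable[ContainerEvent],
                        threshold: int = 3) -> List[Dict[str, object]]:
    """Return one report per (image, reason) pair seen at least `threshold` times."""
    counts = Counter((e.image, e.reason) for e in events)
    reports = []
    for (image, reason), occurrences in sorted(counts.items()):
        if occurrences >= threshold:
            reports.append({
                "title": f"{reason} in {image}",
                "image": image,
                "reason": reason,
                "occurrences": occurrences,
            })
    return reports


def to_payload(reports: List[Dict[str, object]]) -> str:
    """Serialize reports as JSON for an issue tracker or chat webhook."""
    return json.dumps(reports, indent=2)
```

Grouping by image rather than by individual container keeps developers from receiving one ticket per restart of the same faulty release.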

3. Monitor, monitor, monitor

The nice thing about running container orchestration systems, whether in the cloud or on premises, is the number of monitoring and management tools that are available to watch over the containers. These monitoring systems have several core capabilities and advantages including:

  • The ability to gather detailed data over time and use that data to spot trends that could indicate you’re moving toward a failure. These tools pull data from the container orchestration systems, such as use of memory, processor, network, I/O, etc., and they determine relationships that indicate system health, including aspects of the system that may need attention.
  • The ability for the monitoring system to take automatic action based on its findings. For instance, if network errors begin to show up on the console, shutting down the hub that appears to be originating those errors could avoid a total outage. Policies set up within the monitoring software allow you to do this via established rules.
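Both capabilities, trend spotting and rule-based action, can be illustrated with a small Python sketch. It fits a straight line to recent usage samples, estimates how many sampling intervals remain before a limit is reached, and maps that estimate to an action. The function names, the "scale_out" action label, and the linear-trend assumption are illustrative only; commercial monitoring tools use their own, usually more sophisticated, models and policy languages.

```python
from typing import Optional, Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of the samples, assuming evenly spaced readings."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def samples_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate the intervals left before `limit` is hit; None if not rising."""
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return max(0.0, (limit - values[-1]) / slope)


def evaluate_policy(values: Sequence[float], limit: float, warn_within: float) -> str:
    """Hypothetical rule: act when the limit is projected within `warn_within` samples."""
    remaining = samples_until_limit(values, limit)
    if remaining is not None and remaining <= warn_within:
        return "scale_out"
    return "none"
```

For example, memory readings of 50, 60, 70, and 80 percent rise by 10 points per sample, so a 100 percent limit is projected two samples away and a rule with warn_within=3 would trigger.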

4. Back up data automatically, including disaster recovery and business continuity

There are those who manage container orchestration production without a good understanding of where the data is or how it needs to be backed up, preserved, and available for restoration. These are requirements that must be dealt with, whether you’re on the public cloud or not.

Containers, including containers that work within orchestration systems, store data either within the container where the application is running, or, more likely, via an external database that may be container-based but typically is not. No matter where the data exists, it must be replicated to secondary and independent storage systems and protected in some way.

While many believe that public clouds have disaster recovery already built in, in most cases, you’re going to be recovering data that’s been accidentally removed or corrupted. While public cloud does have some failover capabilities, you need to ensure that these more fine-grained data recovery operations are defined and workable. They are not automatic; you need to set them up and test them well.
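"Set them up and test them well" can be made concrete with a restore test. The Python sketch below copies a data file to a secondary location, verifies the copy by checksum, and then proves that the backup can actually be restored and still matches. The function names and the file-based approach are assumptions for illustration; real container data usually lives in volumes or databases that need their own snapshot or dump tooling.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and fail loudly if the copy differs."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target


def restore_test(backup: Path, expected_sha256: str) -> bool:
    """Restore into a scratch directory and confirm the content is intact."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup.name
        shutil.copy2(backup, restored)
        return sha256_of(restored) == expected_sha256
```

Running the restore test on a schedule, rather than only when data is lost, is what turns a backup into a recovery capability you can rely on.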

In addition, many of these backup and recovery mechanisms should be user-driven processes available to a range of users. If you limit control to only a few operations managers, you will soon find that developers and other end users need to recover data more often than those managers are available to help. Security and governance controls allow these non-ops staffers to recover what they need to recover, in line with enterprise policies and laws.

5. Plan for production capacity

Most important of all of the best practices listed here is capacity planning for production. Again, both on-premises and public cloud–based systems need this consideration.

The idea is simple in theory but difficult to carry out. You need to understand the current capacity requirements, in terms of infrastructure needed by the container orchestration systems. This includes servers, storage, network, databases, etc. Moreover, you need to predict what will be needed in the near future, mid-range future, and longer term.

The trick is to understand the interrelationship between the containers, container orchestration, and any supporting systems (e.g., databases) and their impact upon capacity. For example, say you have five instances of container orchestration systems, two for staging and three for production, which together require 20 servers configured in specific ways.

These servers can be configured virtually within a public cloud provider, or physically using traditional methods. Of course, these servers have needs as well, including storage, networking, security, monitoring, power, etc. You need to model that capacity as well.

The point is to understand current containers in production, as well as what the growth will look like over the next five years. Using the forecast growth of the containers in production, you should be able to figure out the impact to other infrastructure and understand those capacity issues. That needs to be modeled so there are no surprises around growth.
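A simple model makes this forecasting exercise tangible. The Python sketch below projects container counts under a constant annual growth rate and converts each year's count into a server count with some headroom held back. The growth rate, the containers-per-server density, and the 20 percent headroom are hypothetical inputs chosen for illustration; real planning should use measured densities and growth figures from your own environment.

```python
import math
from typing import List


def forecast_containers(current: int, annual_growth: float, years: int) -> List[int]:
    """Projected container counts for year 0 through `years`, compounding annually."""
    return [math.ceil(current * (1 + annual_growth) ** year)
            for year in range(years + 1)]


def servers_needed(containers: int,
                   containers_per_server: int,
                   headroom: float = 0.2) -> int:
    """Servers required when a `headroom` fraction of each server is kept free."""
    usable_slots = containers_per_server * (1 - headroom)
    return math.ceil(containers / usable_slots)


def capacity_plan(current: int, annual_growth: float, years: int,
                  containers_per_server: int) -> List[dict]:
    """One row per year: projected containers and the servers they imply."""
    return [
        {"year": year,
         "containers": count,
         "servers": servers_needed(count, containers_per_server)}
        for year, count in enumerate(
            forecast_containers(current, annual_growth, years))
    ]
```

The same per-year rows can then be extended with storage, network, and power figures per server, so that growth in containers flows through to every dependent piece of infrastructure.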

Public cloud users are happy to know that they can provision capacity as it’s needed, on demand. However, this does not solve all their problems in terms of budgeting and understanding which cloud servers will be needed. What’s more, container orchestration systems are often delivered as cloud services themselves, so their management may be less in your hands than in the hands of the cloud provider.

Time for trial-and-error

The success of IT production in the age of containers is based upon the ability to keep an open mind and experiment with new processes and technology. While a trial-and-error approach may scare many folks in IT production, the reality is that you have little choice.

Of course, this should not be too tall an order for IT production, which typically changes its processes and tools every five to ten years. Change in the world of IT is a constant. The rise of containers and container orchestration requires that you change again. 

Those charged with production of container orchestration need to understand that they are breaking new ground, but they can rely on older processes to provide a good starting point for how to operate these systems. Given the rise of DevOps and the logical coupling of development and operations, this is a good time to renew your processes and set your culture in the right direction.
