How to move a large monolithic app to continuous delivery with zero downtime

Pierre Vincent, Head of Site Reliability Engineering, Glofox

The first line of code in my company's internal communications platform was compiled more than 12 years ago. I remember it well, because it was my first job as a junior developer. And back then, our team of five engineers released changes to customers only two or three times per year. We were blissfully unaware of how release management was about to change in the era of continuous delivery.

The releases were major events, and it wasn't unusual for the platform to be offline for four hours or more during the upgrade—which is why they usually happened late on Sunday nights.

A decade later, product development at Poppulo has grown to more than 70 engineers. We've adopted continuous delivery practices, so our deployment frequency has picked up. But until recently, deployments still meant scheduling downtime for maintenance, which limited us to releasing no more often than every four weeks. And, as is true for many companies, there is no longer such a thing as "off-hours" for our worldwide customer base.

The downtime requirement for upgrades was so ingrained in our culture that we came to view it as an inevitability that only a full rewrite could solve. The reality turned out to be much simpler, though, and today we have transitioned from monthly deployments with a few hours of downtime to zero-downtime deployments every two weeks—and we no longer have to work Sunday nights.

Here's how we got there, and how your team can too.

Mapping how deployments impact users

The architecture of the core Poppulo platform is monolithic. The application is built from a large monorepo, yielding artifacts for a few dozen Java processes. These processes are deployed together and are tightly coupled through a MySQL database and JMS queues.

Despite our preconceived idea that zero downtime might never happen for us, we set out to look at our deployment process pragmatically, with the goal of reducing the deployment impact as much as possible. Our first step was to map out the reality of our deployment process, step by step. It looked something like this:

Figure 1. By mapping our main deployment steps, alongside the associated user impact, we were able to prioritize work to progressively reduce downtime.


Online database schema migration

What became apparent from our mapping session was that the database migration phase was not only the No. 1 contributor to the effective downtime, but also very hard to predict in duration from one release to the next.

To take this component out of the equation, we set out to decouple applying database migrations from code deployment. Separating these two steps required the schema changes to be backward-compatible, since the old code would keep running against the new schema. We achieved this by following the "expand/contract" pattern, explained on Martin Fowler's site.
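As an illustration, here is a minimal sketch of the "expand" half of such a change, written as plain JDBC against MySQL. The table and column names are invented for the example, and this is not our actual migration tooling; the point is only that the expanded schema keeps working for code that has never heard of it.

```java
// A minimal sketch of the "expand" half of an expand/contract schema change,
// using plain JDBC against MySQL (driver assumed on the classpath).
// Table and column names are hypothetical, for illustration only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ExpandMigrationSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/app", "app", "secret");
             Statement stmt = conn.createStatement()) {

            // Expand: add the new column as nullable, so old code, which never
            // writes it, keeps working against the new schema.
            stmt.execute("ALTER TABLE messages ADD COLUMN sent_at DATETIME NULL");

            // Backfill existing rows (shown naively here, in a single statement).
            stmt.executeUpdate(
                "UPDATE messages SET sent_at = created_at WHERE sent_at IS NULL");

            // Contract happens in a *later* release, once no running code reads
            // the old column any more:
            //   ALTER TABLE messages DROP COLUMN legacy_sent_date;
        }
    }
}
```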

Applying schema changes to a database with active users also required us to be careful about the impact on the running application. The most common issue we faced was commands that lock tables, which can degrade performance or even lead to outages.

To make life easier for developers writing migrations, we included automated checks in the build pipelines to flag migrations susceptible to locking tables; these would need approval before going forward.
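A check like this can be as simple as a small program that scans pending migration files for risky statements and fails the pipeline job until someone approves. The sketch below (in Java, for consistency with the rest of the examples) shows the idea; the patterns and the db/migrations directory are assumptions for illustration, not the exact rules we run.

```java
// Hedged sketch of a pipeline check that flags migrations likely to lock
// MySQL tables. The patterns and directory layout are illustrative only.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LockingMigrationCheck {
    // Statements that commonly take table-level locks; not an exhaustive list.
    private static final List<Pattern> RISKY = List.of(
            Pattern.compile("(?i)ALTER\\s+TABLE(?!.*ALGORITHM\\s*=\\s*INPLACE)"),
            Pattern.compile("(?i)OPTIMIZE\\s+TABLE"),
            Pattern.compile("(?i)LOCK\\s+TABLES"));

    public static void main(String[] args) throws IOException {
        boolean flagged = false;
        List<Path> migrations;
        try (Stream<Path> files = Files.walk(Path.of("db/migrations"))) {
            migrations = files.filter(p -> p.toString().endsWith(".sql"))
                              .collect(Collectors.toList());
        }
        for (Path file : migrations) {
            String sql = Files.readString(file);
            for (Pattern risky : RISKY) {
                if (risky.matcher(sql).find()) {
                    System.out.printf("Needs approval: %s matches %s%n",
                            file, risky.pattern());
                    flagged = true;
                }
            }
        }
        // A non-zero exit fails the pipeline job, pausing the release until
        // someone reviews and approves the flagged migration.
        System.exit(flagged ? 1 : 0);
    }
}
```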

If you are facing similar challenges, read the chapter "Loosening the Application/Database Coupling" from the excellent e-book DevOps for the Database, by Baron Schwartz, which goes into much more detail about zero-downtime database operations.

Ensuring constant uptime with rolling upgrades

With database migrations out of the picture, the next question was whether we really needed to take the entire application down for upgrades.

For horizontal scaling purposes, we are running multiple instances of the different processes that make up the core Poppulo application. These instances share work by consuming tasks from JMS queues.

Before an upgrade, all work in the queue had to be completed, and no new work could be accepted until the upgrade was done. During this queue-draining phase—which we called "maintenance mode"—users could still access the application, but with limited functionality (e.g., able to edit drafts, but not to publish them).

This "drain all, then upgrade all" approach was rooted in the tight coupling of our processes through Java-serialized JMS messages. Since Java serialization of messages didn't guarantee backward compatibility, we moved to JSON messages and, here as well, followed the expand-contract pattern to ensure that new code would work with old messages (and vice versa).

Finally, we moved the draining logic to work on a per-instance basis, instead of application-wide. This enabled us to gradually drain and upgrade instances, keeping all functionality available for end users throughout the upgrade.

Figure 2. Instead of upgrading all instances at once, a rolling upgrade ensures that at least one instance is up at any given time, guaranteeing that the feature remains available.
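Stripped of the JMS specifics, per-instance draining amounts to a small piece of state in each worker: stop accepting new work, let in-flight tasks finish, and then report that the instance is safe to upgrade. The sketch below is a simplified illustration under those assumptions, not our production code.

```java
// Simplified sketch of a drainable worker: new work is refused while
// draining, in-flight tasks are allowed to finish, and drain() reports
// whether the instance is safe to upgrade. Names are illustrative.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class DrainableWorker {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Called by the message listener for each incoming task.
    public boolean offer(Runnable task) {
        if (draining.get()) {
            return false; // reject: the message stays on the queue for other instances
        }
        pool.execute(task);
        return true;
    }

    // Called before upgrading this instance only; the others keep serving users.
    public boolean drain(long timeout, TimeUnit unit) throws InterruptedException {
        draining.set(true);  // stop taking new work
        pool.shutdown();     // let in-flight tasks run to completion
        return pool.awaitTermination(timeout, unit); // true => safe to upgrade
    }

    public static void main(String[] args) throws InterruptedException {
        DrainableWorker worker = new DrainableWorker();
        worker.offer(() -> sleepQuietly(500)); // simulate an in-flight task
        System.out.println("Safe to upgrade: " + worker.drain(5, TimeUnit.SECONDS));
    }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}
```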

Reducing the mental overload of operating deployments

With most of the code-related blockers addressed, we started looking at the actual task of deploying an upgrade. Large, monolithic deployments can be stressful, due to the number of moving parts and to bad memories of near-misses or deployments that escalated into production incidents. And adding active users to the mix without the cushion of a maintenance window undeniably raises the stakes.

We didn't want deployments to be stressful. We needed them to be consistent, repeatable, and observable for us to have the confidence to run them with live traffic, during peak hours, and, ultimately, more frequently.

We started by fully automating the upgrade process with GitLab Pipelines, specifying the entire rolling upgrade procedure in code. This immediately benefited whoever was in charge of a deployment by removing the burden of following manual steps and the risk of making mistakes under pressure.

A procedure written as code and committed to source control also meant we could review and track changes. Finally, the GitLab pipeline became the single place from which deployments are controlled, with adequate auditing, access control, and execution logs.
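The pipeline definition itself is GitLab-specific configuration, so it isn't reproduced here, but the loop it encodes looks roughly like the sketch below (kept in Java for consistency with the other examples). The hostnames, the /health endpoint, and the deploy-tool command are placeholders, not our actual tooling.

```java
// Rough sketch of a rolling upgrade loop: drain one instance, upgrade it,
// wait until it is healthy again, then move on. Hosts, endpoints, and the
// "deploy-tool" command are hypothetical placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class RollingUpgradeSketch {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        List<String> instances = List.of("app-1.internal", "app-2.internal", "app-3.internal");
        for (String host : instances) {
            run("deploy-tool", "drain", host);   // 1. drain this instance only
            run("deploy-tool", "upgrade", host); // 2. upgrade it
            while (!healthy(host)) {             // 3. wait before touching the next one
                Thread.sleep(5_000);
            }
        }
    }

    private static boolean healthy(String host) {
        try {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://" + host + "/health")).GET().build();
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).statusCode() == 200;
        } catch (Exception e) {
            return false; // not back up yet; keep waiting
        }
    }

    private static void run(String... command) throws Exception {
        int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
        if (exit != 0) {
            throw new IllegalStateException(
                    "Step failed, stopping the rollout: " + String.join(" ", command));
        }
    }
}
```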

The next thing we needed if we were to be confident about deploying with live traffic was visibility during and after the upgrade. For this, we leveraged Prometheus and Grafana to build a single dashboard compiling the key signals we needed during deployments. These included health checks for the different services, the status of synthetic monitoring of core user journeys, error rates, latency, and queue saturation.
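On the application side, most of these signals are just metrics the Java processes expose for Prometheus to scrape. Here is a hedged sketch of what that might look like with the Prometheus Java simpleclient; the metric names and port are assumptions for illustration, and the Grafana dashboard itself is configuration not shown here.

```java
// Hedged sketch: exposing a few deployment signals with the Prometheus Java
// simpleclient. Metric names, labels, and the port are illustrative only.
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

public class DeploymentSignals {
    // Health of this instance: 1 when ready to serve, 0 while draining/upgrading.
    static final Gauge READY = Gauge.build()
            .name("app_ready").help("Instance readiness (1 = serving).").register();

    // Queue saturation: how much work is piling up behind the remaining instances.
    static final Gauge QUEUE_DEPTH = Gauge.build()
            .name("app_queue_depth").help("Pending messages on the work queue.")
            .labelNames("queue").register();

    // Error rate: Grafana can graph rate(app_errors_total[5m]) during a rollout.
    static final Counter ERRORS = Counter.build()
            .name("app_errors_total").help("Requests that failed.").register();

    public static void main(String[] args) throws Exception {
        HTTPServer scrapeEndpoint = new HTTPServer(9400); // serves /metrics
        READY.set(1);
        QUEUE_DEPTH.labels("publish-tasks").set(12);
        ERRORS.inc();
        Thread.currentThread().join(); // keep serving scrapes
    }
}
```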

By also highlighting the rolling upgrade progress, this dashboard became our central place to monitor deployments. No more flicking between terminals, tailing logs, or refreshing pages.

It’s easier than you think

While zero-downtime deployments were always an aspiration of ours, we postponed tackling the problem for a long time because it felt overwhelming to challenge the way we had deployed software for more than a decade. Once we started looking at it for what it really was, however, it became a much simpler problem.

Zero-downtime deployments don't mean everything always stays up, or that everything is immediately running the latest version; they simply mean users don't notice a thing while all this is happening.

Beyond eliminating the customer impact of regular maintenance, the series of small changes we made transformed the way we work. Deployments during working hours massively improved work-life balance for our operations team, which used to do this work at 8 PM on Sundays. Similarly, automation and visibility greatly reduced stress.

From a product development perspective, getting to zero downtime for deployments opens the door to deploying on demand, which in turn leads to working in smaller batches, reducing risk, and getting faster feedback from customers.

We have already increased our deployment frequency from every four weeks to every two, and we have set our sights on daily deployments. Going faster has already surfaced the next big obstacle: reducing the time between code commit and releasable changes, which we hope to achieve with better continuous integration and the introduction of trunk-based development.

As Dave Farley and Jez Humble perfectly put it in Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation: "If it hurts, do it more frequently, and bring the pain forward."

For more practical information on how to do zero-downtime deploys of monolithic applications, see my session at DevOpsDays Portugal on June 3-4.
