A summary of a panel discussion on the dos and don'ts of large-scale deployments.

The dos and don'ts of large-scale deployments

Malcolm Isaacs, Senior Researcher, Micro Focus

I recently had the privilege of being on a panel on large-scale deployments, part of Electric Cloud's Continuous Discussions (#c9d9) series, hosted by Electric Cloud's Sam Fell. Joining me on the panel were Andrew Siemer, chief architect at Clear Measure; Seb Rose, independent consultant and coach, and lead author of The Cucumber for Java Book; and Simon McCartney, developer and system engineer at HP Cloud. We got together on a Google Hangout to discuss the differences between deploying software to 10 or 20 servers, which can be pretty challenging in itself, and deploying to thousands of servers around the world.

I'm going to summarize the main points that were made during the discussion and add some more that I didn't get the chance to talk about during the session.

The team

It all starts with having the right team in place in a DevOps environment. To enable the team to minimize risk, they should have self-contained development environments that replicate production environments and include automated regression tests to ensure repeatability. Practice, as we know, makes perfect, and it's essential in large-scale deployments. Seb mentioned Amazon's "two-pizza" teams that have full responsibility over small areas of functionality—from provisioning the environments, to getting the continuous deployment pipeline in place, to monitoring and handling service calls. No one is forced to practice, but because the team has overall responsibility, they naturally make sure their deployments work well even in early stages of the pipeline. Consider introducing microservices into your architecture to reduce dependencies between teams. And try to have a team member on board who has been involved with large-scale deployments before so you can leverage their experience. The team's culture must eschew "hacks," such as manually copying files or editing configuration files, and require that only automated deployment scripts be used.

Fidelity of environments

It's always important to remove variations between environments, but it's essential at a large scale. If you make a code change, that should be the only variable, whether you're testing it in your development environment, in QA, in your integration environment, or in production. One way to achieve this is through virtualization of hardware and services and by using container technology such as Docker. But in practice, many teams hit a snag when they need to integrate with legacy systems, because the company isn't willing to invest in making the legacy environments identical or in automating the deployment of these environments. We also see "shadow IT," where teams provision their own hardware and software configurations during development instead of using the company's official configuration. This can result in downtime in production because of mismatches between the configurations, leading to the dreaded "well, it works on my machine" scenario. If you can quantify the cost of downtime, which can be significant, you may be able to convince management to invest in aligning the environments.

Fidelity of process

As a developer, you can simply put a file into a directory to make something work. Don't do that. The deployment process must be fully transparent, from development through to production, and this is achieved through automation, with your configuration files under source control. This approach is called infrastructure as code, where you treat your infrastructure configuration exactly as you treat your source code files. If you need to make a change to the configuration, you edit the configuration file and check that in to your source control system. The change will then automatically and consistently flow out through the different environments into production. Note that you can still introduce manual approval steps if you really have to by pausing the automation and having it continue after approval is granted.
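To make the idea concrete, here is a minimal sketch of the infrastructure-as-code pattern in Python. In practice you would use a dedicated tool such as Ansible, Terraform, or Puppet; the data structure and function names below are invented for illustration. The key properties are that the desired state lives in a file you check in to source control, and that applying it is deterministic and idempotent.

```python
# Hypothetical sketch: desired state is data under source control,
# and a script applies it the same way in every environment.
import pathlib

# This dict would live in a version-controlled file; the keys are illustrative.
DESIRED_STATE = {
    "app_server": {"port": 8080, "workers": 4},
    "log_level": "INFO",
}

def render_config(state: dict) -> str:
    """Render a config file deterministically from the desired state."""
    lines = [
        f"port={state['app_server']['port']}",
        f"workers={state['app_server']['workers']}",
        f"log_level={state['log_level']}",
    ]
    return "\n".join(lines) + "\n"

def apply_config(path: str, state: dict) -> bool:
    """Write the rendered config only if it differs from what's on disk.
    Returns True if a change was made (idempotent apply)."""
    p = pathlib.Path(path)
    desired = render_config(state)
    if p.exists() and p.read_text() == desired:
        return False  # already in the desired state; nothing to do
    p.write_text(desired)
    return True
```

Because the apply step is idempotent, running it repeatedly is safe, which is what lets the same change flow automatically through every environment in the pipeline.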

You need to make sure that your process also includes rollback capabilities, which must also be tested. The design and testing of the rollback procedure must ensure that your users don't lose any data. As you test, make sure that you're not only using actual production data but also production quantities of data, because that's typically much greater than the minimal data sets used in the development and testing environments. But don't do all your testing on production data sets: most unit tests should use small data sets to be effective and finish quickly.
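The rollback requirement above can be sketched as keeping the previous release available until the new one is proven. This is a deliberately minimal illustration with invented names; the genuinely hard part, protecting data written between deploy and rollback, usually means keeping schema changes backward-compatible so the old release can still read records the new one wrote.

```python
# Hypothetical sketch: retain prior releases so a rollback is always possible.
class DeploymentHistory:
    def __init__(self):
        self.releases = []   # stack of previously live versions
        self.live = None     # currently live version

    def deploy(self, version: str) -> None:
        """Make `version` live, keeping the old release for rollback."""
        if self.live is not None:
            self.releases.append(self.live)
        self.live = version

    def rollback(self) -> str:
        """Restore the most recent previous release and return it."""
        if not self.releases:
            raise RuntimeError("no previous release to roll back to")
        self.live = self.releases.pop()
        return self.live
```

The point of testing this path with production-sized data, as noted above, is that a rollback that works on a small test data set can still time out or lose writes at production scale.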

Another challenge in larger organizations, particularly those with legacy systems, is that different teams working on the same product might have different procedures for deployment. As you scale up, you'll have to grasp that nettle and figure out how to align the procedures, because the cost of not doing so will be higher.

Monitoring

You need to keep track of your deployment's health, and you do this through monitoring. You can monitor your hardware, operating systems, middleware, and application servers, but you must also monitor your end-user experience, perhaps by using synthetic transactions. This is how you'll know what your users are really seeing and doing, and it lets you track business value. Keep an eye on security considerations, such as unauthorized access to your systems. Log files are key to analyzing production problems, so make sure you evaluate your log data during development and ensure that it contains clear information that can be used in the event of a problem. Don't just monitor your production systems, though—you should be monitoring all your environments in your pipeline.
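A synthetic transaction can be as simple as a probe that periodically exercises a user-facing endpoint and records whether it responded the way a real user would expect. The sketch below is a minimal illustration using only the standard library; the URL, timeout, and latency threshold are invented, and a real deployment would feed these records into a monitoring system rather than return them.

```python
# Hypothetical sketch of a synthetic-transaction probe.
import time
import urllib.request

def probe(url: str, timeout: float = 5.0, max_latency: float = 1.0) -> dict:
    """Run one synthetic check of `url` and return a health record."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False  # unreachable, timed out, or returned an error
    latency = time.monotonic() - start
    # Healthy means both a good response and acceptable latency.
    return {"url": url, "ok": ok and latency <= max_latency, "latency": latency}
```

Running the same probe against every environment in the pipeline, not just production, matches the advice above about monitoring the whole pipeline.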

Arrange to have a prominently displayed central dashboard with good visualization, showing overall system health and sub-system metrics. This will let you see very quickly when something goes wrong, and where.

Deployment strategies

We discussed a number of different strategies that can be adopted when deploying software, depending on your situation:

  • Feature toggles let you release partially written code. By wrapping the code with a toggle, you can ensure that the toggle is off until the feature is complete. This is used to develop larger features while still releasing frequently to production. If you use feature toggles, consider them part of your technical debt and remove them when the feature is done. Otherwise, your code will become cluttered with redundant toggles.
  • Blue-green deployment involves maintaining two separate but identical deployment environments. Only one of them—say, the blue one—is live in production at any given time. When you deploy a change to the system, you deploy it to the green one and test it there. Once the green one is working well, you take the blue system offline and make the green one live. If you discover problems in the green one, you can quickly take it offline and put the blue one back online until you solve the problem. If you do this, you'll need to make sure that you don't lose any data that may have been added to the green system before it was disabled. Test this capability before the green system goes live for the first time.
  • Canary releases (or incremental rollouts) are similar to blue-green releases, the difference being that you deploy your change to the green system, but rather than switch the blue system off, you route some of the users to the green system while most of the users stay on the blue one. As you gain confidence in the green system, you gradually wean more of your users off the blue one until they're all using the green system. As with blue-green deployments, you must make sure that you can move users back to the blue system if there are problems without losing any data.
  • A/B testing shouldn't be confused with canary releases. A/B testing is a strategy for testing a hypothesis using two different implementations, rather than for deploying a change. Your A/B testing could be a long-running exercise, whereas canary releases are over in minutes or hours. Some companies, such as Wix, use A/B testing for every new feature and can have hundreds to thousands of experiments running at any given moment, all providing valuable input to the team. It's hard to keep track of all the information, so they developed a visual dashboard called Petri, which they donated to open source, so you can try it out yourself.
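The toggle and canary strategies above are often combined in practice: a toggle that is off, fully on, or enabled for a stable percentage of users chosen by hashing their ID. The sketch below is a minimal illustration with invented names; real systems typically use a feature-flag service rather than hand-rolled code.

```python
# Hypothetical sketch of a percentage-rollout feature toggle.
import hashlib

class FeatureToggle:
    def __init__(self, name: str, rollout_percent: int = 0):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 = off, 100 = everyone

    def is_enabled(self, user_id: str) -> bool:
        """Deterministically place each user in a bucket from 0 to 99.
        Hashing keeps a given user's assignment stable as the
        percentage is gradually increased during the canary rollout."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent
```

Because the bucket is derived from the user ID, raising the percentage only adds users to the new code path; no one flips back and forth between releases, which keeps the canary experience consistent for each user.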

Meet the challenge

Large-scale deployment is challenging, but at the same time, it's doable and rewarding. As we discussed on the panel, get the right team in place using DevOps practices, and encourage a culture of shared responsibility. Let the teams develop without risk by introducing and maintaining automated testing throughout the pipeline, and guarantee consistent deployment by treating the infrastructure as code and deploying nothing manually. Employ system-wide monitoring with a simple dashboard, so everyone knows about a problem as soon as it happens, and make sure you have a proven strategy in place to roll your deployments back without data loss if you have to take part of the system offline. Understanding these ideas should give you the confidence to meet the challenge of large-scale deployment.