Test in production? Yes you can (and you should)

"I don't always test," muses the Most Interesting Man in the World, on one of the most evergreen tech meme posters of all time, "but when I do, I test in production." I have been laughing at that meme ever since someone posted it on the wall at work way back in ... 2013? It's hilarious!

Since then, "test in prod" has become shorthand for all of the irresponsible ways we interact with production services, or cut corners in our rush to ship. And for this I blame the Most Interesting Man and his irresistible meme.

Because, frankly, we all test in production almost all the time. It's not a bad thing or a sign of negligence. What's killing us is our inability to own up to it and to correctly identify and invest in the tooling we need for testing in production.

Here's how to get over it and get on with it. But before I get to that, let's back up a second and define some terms.

Defining the problem

Testing means establishing the quality, performance, or reliability of something, especially before it is taken into widespread use. Production is the place (infrastructure and code) where your users are.

Testing is about reducing uncertainty. If I run a piece of deterministic code in a particular environment, I expect the result to succeed or fail in a repeatable way. That gives me confidence in that code, for that environment.

Modern systems are built out of those building blocks, plus:

Many concurrent connections
A specific network stack with specific tunables, firmware, and network interface cards
Iffy or nonexistent ability to serialize within a connection
Race conditions
Services loosely coupled over networks
Network flakiness
Ephemeral runtimes
Specific CPUs and their bugs, multiprocessors
Specific hardware RAM and memory bugs
Specific operating system distro, kernel, and OS version
Specific library versions for all dependencies
Build environment
Deployment code and process
Runtime restarts
Cache hits or misses
Specific containers or VMs and their bugs
Specific schedulers and their quirks
Clients with their own specific backoffs and retries and timeouts
The Internet at large
Noisy neighbors
Thundering herds
Queues
Human operators and debuggers
Environment settings
Deaths, trials, and other real-world events

When we testers say "production," we mean the constellation of all of these things and more. Despite our best efforts to abstract away such pesky concepts as "what firmware version is your eth0 card?" I am here to tell you that, once in a blue moon, you will still have to care about those things.

Why everyone tests in prod

You aren't testing just code anymore. You are testing complex systems made up of users, code, environment, infrastructure, and a point in time. These systems have unpredictable interactions, a lack of predictable ordering, and emergent properties that defy your ability to deterministically test.

If testing is about reducing uncertainty, you "test" any time you deploy to production, because every deploy is a unique, never-to-be-replicated combination of artifact, environment, infrastructure, and time of day.

As anyone who has ever created a typo such as "producktion" can attest, some amount of uncertainty is unavoidable. There's an irreducible amount of uncertainty to every code deploy. It just can't be eliminated.

So if for that reason alone, we all test in prod. But it's more than that. The phrase "I don't always test, but when I do, I test in production" insinuates that you can do only one or the other: test before prod, or test in prod. In fact, all responsible teams do both kinds of testing. We just confess only to the first kind, the responsible kind.

Nobody admits to testing in prod, or talks about how we could do it better and more safely. And nobody invests in their "test in prod" tooling. That's a damned shame.

Engineering cycles are the scarcest resource in the world for most of us. Any time we choose to do something with our time, we implicitly choose not to do hundreds of other things. Choosing what to spend our precious time on is one of the most difficult things any team can do. It can literally make or break your company.

And we have systematically under-invested in tooling for production systems. The way we talk about testing and the way we actually work with software have exclusively centered on preventing problems from ever reaching production. Admitting that some bugs will make it to prod, no matter what we do, has been an unspeakable reality.

And because of this, we find ourselves starved of ways to understand, observe, or rigorously test our code in its most important phase of development. We find ourselves literally flying blind.

Case in point: The Ubuntu situation

For example, a few weeks ago we decided to upgrade Ubuntu across our fleet. The Ubuntu 16.04 AMI was about to age out of support, and it hadn't been systematically rolled since I first set up our infrastructure in 2016. We did all the responsible things: We tested it, we wrote a script, and we rolled it out to staging and to the cluster servers we use to monitor our own production.

Then we decided to roll it out to production.

Things did not go entirely as planned.

We make extensive use of autoscaling groups, and our data storage nodes bootstrap one from the other. There was an issue with cron jobs running on the hour while the bootstrap was still running. Turns out, we had tested bootstrapping during only 50 out of every 60 minutes of the hour.

The problems we saw were associated with our storage nodes. We had issues with data expiring while rsyncing over, and our staff was panicking when they did not see metadata for segment files, or vice versa. We also experienced minor issues around instrumentation, graceful restarts, and namespacing.

This is a great example of responsibly testing in prod. We did the appropriate amount of testing in a faux environment. We did as much as we could in non-prod environments first. We built in safeguards, we practiced observability-driven development, and we added instrumentation so we could watch progress and spot failures. And we rolled it out while watching closely for any unfamiliar behavior or scary problems.

Could we have ironed out all of the bugs before running it in prod? You can never, ever guarantee that you have ironed out everything. We certainly could have spent an infinite amount of time trying to increase our confidence that we had ironed out all possible bugs, but you quickly reach a point of fast-diminishing returns.

We're a startup, and startups don't tend to fail because they moved too fast. They tend to fail because they obsess over trivialities that don't provide business value. It was important that we reach a reasonable level of confidence, and handle errors, and have multiple levels of fail-safes (e.g., backups).

Balance the risk equation

Risk management is one of the things that separates our senior engineers from the juniors. We conduct experiments in risk management every day, often unconsciously. Every time you decide to merge to master or deploy to prod, you're taking a risk. And every time you decide not to merge or deploy, you're also taking a risk. If you think too hard about all the risks you are taking, it can be paralyzing.

It might feel like it's less risky to not deploy than to deploy, but this is false; it's simply a different kind of risk. When you decide not to deploy, you risk not shipping things your users need or want, you risk having a sluggish deployment culture, and you risk losing out to your competition. It is better to practice risky things often and in small chunks with a limited blast radius, rather than to avoid risky things.

Organizations differ in their appetite for risk. Even within an organization, there is a wide range of risk tolerances. Tolerance tends to be lowest and paranoia highest the closer you get to laying bits down on disk, especially when it comes to user data or billing data. Tolerance tends to be higher on the developer tools side or with offline or stateless services, where mistakes are less user-visible or permanent.

Many engineers, if you ask them, will declare their absolute rejection of all risk. They passionately believe that any error is one error too many. Yet these engineers somehow manage to leave the house each morning, and sometimes even drive cars (the horror!). Risk pervades everything that we do.

The risks of not acting are less visible, but no less deadly. They're simply harder to internalize when they are amortized over longer periods of time, or felt by different teams. Good engineering discipline consists of forcing oneself to take little risks every day and to keep in good practice.

This is an industry that's largely in denial about failure, and the denial is only just beginning to lift.

The fact is, distributed systems exist in a continual state of partial degradation. Failure is the only constant. Failure is happening on your systems right now, in a hundred ways you aren't aware of and may never learn about. So obsessing over individual errors will drive you straight to the madhouse.

Get over your fear

The cure for this insanity is to embrace error budgets via service-level objectives and service-level indicators, thinking critically about how much failure users can tolerate, and hooking up feedback loops to empower software engineers to own their systems from end to end.
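To make the error-budget idea concrete, here is a minimal sketch of the arithmetic for a request-based SLO. The class name, field names, and the numbers (a 99.9% target over one million requests) are illustrative assumptions, not figures from any real service.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based SLO over one measurement window.

    All names and numbers here are hypothetical, chosen for illustration.
    """
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Negative once the budget is overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def fraction_consumed(self) -> float:
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget consumed:  {budget.fraction_consumed:.0%}")
```

The point of the exercise is the question it forces: with roughly a thousand failures "allowed" in this window and a quarter of them spent, there is room to ship something risky, and a team that had spent all of them would have a concrete reason to slow down.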

This means we need to help engineers get over their fear and paranoia around production systems. You should be up to your elbows in prod every single day. Prod is where your users live. Prod is where users interact with your code on your infrastructure.

And we as an industry have systematically under-invested in prod-related tooling. We have chosen to bar people from prod rather than building guard rails or building tools to help them do the right thing by default and make it hard to do the wrong thing. We have assigned tooling deployment to interns, not to our most senior engineers. We have built a glass castle, where we ought to have a playground.

It should be muscle memory for every engineer who is shipping code to look at that code as it runs in prod. No pull request should be accepted unless you can answer the question, "How will I know if this breaks?" While you're deploying your code, you should go look at your instrumentation to see if it is doing what you expected it to, and if anything else looks weird.
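One way to have an answer to "How will I know if this breaks?" is to ship the instrumentation with the change itself. The sketch below is a minimal example under stated assumptions: `METRICS` is an in-memory stand-in for whatever metrics or observability backend you actually use, and `instrumented` and `parse_order` are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a real metrics backend (an assumption for this sketch).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative latency under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then re-raise so callers still see it.
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("parse_order")
def parse_order(raw: str) -> int:
    # Hypothetical business logic: raises ValueError on bad input.
    return int(raw)


parse_order("42")
try:
    parse_order("not a number")
except ValueError:
    pass

print(dict(METRICS["parse_order"]))
```

With something like this in place, the answer to the pull request question becomes specific: the error count for `parse_order` climbs, or its latency shifts, right after the deploy.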

There are many categories of uncertainty that can only ever be truly tested in prod, such as behavioral testing, A/B testing, realistic load testing, and so on. And there are many tools and techniques for testing more safely in prod that are beginning to gain wider usage, such as feature flags, observability, progressive deployments, etc.
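Feature flags and progressive deployments often rest on the same small trick: deterministically placing each user in a bucket, then widening the set of enabled buckets over time. The following is a rough sketch of that idea, not a description of any particular flagging product; the function name `in_rollout`, the flag name, and the user IDs are all made up for illustration.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Return True if `user_id` falls inside the first `percent` of buckets for `flag`.

    Hashing flag and user together gives each flag its own stable ordering of
    users, so the same user is not always the first to get every new feature.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical flag and user IDs, for illustration only.
users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_rollout("new-storage-path", u, 5)}
at_25 = {u for u in users if in_rollout("new-storage-path", u, 25)}

print(len(at_5), len(at_25))
print(at_5 <= at_25)  # widening the rollout never removes a user who already had it
```

Because the bucket depends only on the flag and the user, raising the percentage only ever adds users. That keeps the blast radius of a bad change limited to the slice you chose, and setting the percentage back to zero turns it off for everyone.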

Every moment you spend in a non-prod environment is a moment when you are not learning about the real world—what commands to run, what is safe or dangerous, how it feels. You're also learning the wrong things about the world. You are informing your intuitive corpus with bad data.

There's a lot of daylight between throwing your code over the wall and waiting to get paged, and shipping with alert eyes on it as it goes out, watching your instrumentation, and actively flexing the new code. The job of modern software engineers is not done until they have watched users use their code in production.

There's a real place for testing before prod. But as Caitie McCaffrey, principal software engineering lead at Azure Sphere, says, you can catch 80% of the bugs with 20% of the effort.

Bottom line: Test before prod—and after

Yes, you should test before prod and test in prod. But frankly, if I had to choose—and thankfully I do not—I would choose the ability to watch my code in prod over all the pre-prod hardening in the world.

Why? Because only one represents reality. And as our systems get ever more complex and exhibit ever more emergent behaviors, pre-prod testing will only continue to lose ground and relevance in the larger scheme of things.

Test in prod. It's the only way to be sure of anything.

Want to know more? During my Velocity conference session, I'll offer more tips about testing in production. The conference runs June 10-13 in San Jose, California.
