
Don't do performance testing in production environments only

Shane Evans, Senior Product Manager, HP LoadRunner, HP
 

"Performance testing is dead!" "Performance belongs to developers!" "Nobody needs to do performance testing in a lab anymore!"

These are just some of the claims I've heard about performance testing. Only production environments matter, the thinking goes. Test environments are a waste of time. I get the logic behind these statements, but they're too simplistic. Testing in production is a good idea, but if this is your only methodology, you're setting yourself up for disaster.

How to do performance testing wrong

For example, on Black Friday 2014, a well-known online retailer suffered an outage in production, potentially losing millions in revenue, and its share price took a minor dip. Industry pundits had a field day, saying the retailer should have tested differently or used tool X over tool Y.

In fact, the customer had tested exactly as instructed by their "leading cloud testing provider." The retailer had been following the provider's best practices of testing in production prior to the big day. What went wrong?

From what I can piece together, the internal testing team, responsible for all enterprise application testing, had been unsuccessful in convincing the online business to use the same tools it had been using for years. That team had a strong record of success, but, as is often the case, the online team bought into the hype of testing only in production, using a tool designed to run only from the cloud.

It is sometimes necessary to test on a global scale, against the environment nearest to production itself. But if that were the best and only way to test, shouldn't everyone be doing it all the time? Absolutely not.

Why not test in production?

There are scenarios where testing in production is the right approach, but it's not the right strategy for every situation. This type of large-scale, multi-geography test takes massive coordination and planning across multiple teams.

You also have to ease your real customers onto a pre-production environment that has been verified to work, and you have to do so at a time when usage is low. That means these tests need to succeed the first time, and they need to provide the information required to fix any issues before the big day, or all of that time and energy was for nothing.

To get the most out of a "big bang" test, there should be several testing stages before it. These smaller, more focused scenarios should be run in lab environments, with all external dependencies either replicated or virtualized so that you can control the dependent services, network conditions, and responses, and exercise both the positive and the negative cases.
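Commercial service virtualization tools handle this at scale, but even a simple stub illustrates the idea. Here is a minimal sketch in Python using Flask, assuming a hypothetical downstream /inventory/<sku> endpoint; the latency and error-rate knobs are illustrative settings that let a lab test exercise both the happy path and degraded conditions.

# A minimal stand-in for a dependent service (the endpoint and settings
# are hypothetical). Adjust the knobs per test run to simulate slow or
# flaky downstream behavior.
import random
import time

from flask import Flask, jsonify

app = Flask(__name__)

ADDED_LATENCY_SECONDS = 0.25   # simulated network/service delay
ERROR_RATE = 0.05              # fraction of requests that fail

@app.route("/inventory/<sku>")
def inventory(sku):
    time.sleep(ADDED_LATENCY_SECONDS)
    if random.random() < ERROR_RATE:
        return jsonify({"error": "upstream timeout"}), 503
    return jsonify({"sku": sku, "in_stock": True})

if __name__ == "__main__":
    app.run(port=8081)

Pointing the application under test at a stub like this instead of the real inventory service keeps each lab run repeatable and lets you isolate the component you actually want to measure.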

The purpose here is to drive out any individual component bottlenecks well before putting it all together. That way, when it comes to the big bang test, you can be sure that no one component will fail.

You also need to really understand your production usage. Which customers, coming from where, and doing what: those are the key variables to understand here.
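One straightforward way to get that picture is to mine your access logs. The sketch below is a rough Python example; the log format, field positions, and flow buckets are all assumptions for illustration.

# Summarize who is hitting the site, from where, and doing what.
# Field positions and path prefixes are assumptions about the log format.
from collections import Counter

countries = Counter()
flows = Counter()

with open("access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 7:
            continue
        country, path = fields[2], fields[6]
        countries[country] += 1
        if path.startswith("/checkout"):
            flows["checkout"] += 1
        elif path.startswith("/cart"):
            flows["cart"] += 1
        elif path.startswith("/search"):
            flows["search"] += 1
        else:
            flows["browse"] += 1

print("Top geographies:", countries.most_common(5))
print("Flow mix:", flows.most_common())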

Why performance tests fail

Going back to our major online retailer: it turned out that the customer flows that were tested didn't truly represent what users were doing that Black Friday when the system went down.

Instead of following the standard checkout process you would expect from users trying to get that big-ticket item, shoppers had added the items to their carts days in advance, presumably multiple times. When the clock struck midnight and the items went on sale, a lot of customers simply jumped straight to the cart to check out.

Now, that's not the typical scenario. The only way to have known it was coming would have been to analyze the previous year's usage and project for growth.
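In load-model terms, that analysis shows up as the mix of scripted flows and their weights. As a rough illustration, here is a sketch using the open-source Locust tool; the endpoints, payloads, and weights are hypothetical, with the weights skewed toward the "pre-filled cart, straight to checkout" behavior described above, and higher virtual-user counts at run time standing in for projected growth.

# Hypothetical load model: most virtual users jump straight to a cart
# that was filled days earlier, rather than browsing first.
# Run with something like:
#   locust -f holiday_shopper.py --host https://staging.example.com
from locust import HttpUser, task, between


class HolidayShopper(HttpUser):
    wait_time = between(1, 5)  # think time between actions, in seconds

    @task(2)
    def browse_and_add(self):
        # The "standard" flow: view a product, then add it to the cart.
        self.client.get("/product/12345")
        self.client.post("/cart", json={"sku": "12345", "qty": 1})

    @task(8)
    def checkout_prefilled_cart(self):
        # The flow that dominated on the day: go straight to the cart
        # and check out.
        self.client.get("/cart")
        self.client.post("/checkout", json={"payment": "stored-card"})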

It also would have made sense to keep an eye on production in the run-up to the event, which I can only assume wasn't done here, though I'm not sure it would have made any difference if the behavior was new. Because the big bang test was most likely run weeks before Black Friday, there wouldn't have been enough time to catch what was happening in production and coordinate another test.

Use continuous testing to ramp up to big bang testing

So how could this retailer have done things differently? Hindsight is 20/20: I would've recommended a series of tests, starting with internal systems and using usage parameters pulled from the previous year's event.

This would need to start several weeks prior to a big bang test against the production environment. Starting these tests early would allow us to pinpoint any bottlenecks in individual components, such as web application servers and services, and provide a baseline to compare to later.

These component tests can be run automatically from the continuous integration system after each and every build to detect any anomalies caused by changes in the code for each component. We can also virtualize dependencies such as other services and networks to provide consistent conditions, both positive and negative, for every run.
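The comparison itself can be a simple gate at the end of the pipeline. Here is a rough sketch, assuming the load tool exports p95 latency per transaction to results.json and that baseline.json holds the numbers from a known-good run; the file names, formats, and threshold are all assumptions.

# Fail the build if any transaction's p95 latency regresses too far
# from the stored baseline.
import json
import sys

ALLOWED_REGRESSION = 1.15  # allow up to 15% growth in p95 latency

with open("baseline.json") as f:
    baseline = json.load(f)   # e.g. {"checkout": 420, "search": 180} in ms
with open("results.json") as f:
    results = json.load(f)

failed = False
for transaction, baseline_p95 in baseline.items():
    current_p95 = results.get(transaction)
    if current_p95 is None:
        continue
    if current_p95 > baseline_p95 * ALLOWED_REGRESSION:
        print(f"REGRESSION: {transaction} p95 {current_p95}ms "
              f"vs baseline {baseline_p95}ms")
        failed = True

sys.exit(1 if failed else 0)  # a non-zero exit fails the CI job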

Then we run the big bang test a week or two before the big event, using the same assets we used in the lab: the same scripts and, to the extent possible, the same settings. We schedule two tests instead of one: the first to compare against our baseline tests in the lab, and the second to compare against what we see in production.

This is assuming the two aren't exactly the same. Remember, our testing in the lab should be based on the previous year's production data. The second test should be against the latest data available, which should be the production data leading up to the event. This will give us a comparison between what we saw last year and what has changed since.

If we've done our job before testing in production, this should just be a validation of our assumptions. If all our hopes are riding on this test, we're already risking too much. Continuous testing starts during development, and shouldn't be left to the end.

Performance engineering through application monitoring

This is only one example of how performance testing can be shifted earlier in the lifecycle and continued throughout. It doesn't end at release.

In fact, this is where performance testing results are validated by production monitoring systems, both synthetic and real. The data captured by application performance monitoring (APM) systems is then fed back into our development cycle for the next release. This is essential to improving performance, rather than just testing for it.

But that's another post entirely.
