No fear: How to test software in production

Amber Race, Senior Software Development Engineer in Test, Big Fish Games

When I started as a software tester, testing in production was what happened when teams didn't take QA seriously. But your application is being tested in production every single day by the people who use it. You just need to find a way to use all the data users are already generating.

The infinite variety of devices, operating systems, browsers, and general user behavior in production is an invaluable source of information, not only for debugging, but also for finding issues that are difficult to spot in less realistic testing environments.

Here's how to put all that information to good use.

Choose the right tools

To get the most out of your production data, you need a means to gather the information and display it in ways that can expose potential problems. There are many tools available for collecting and visualizing data, both commercial and open source. Here are some of the more popular open-source tools:

  • StatsD: A lightweight daemon that collects and aggregates statistics before sending them to a back-end aggregator such as Graphite
  • Graphite: A service for collecting, querying, and graphing statistical data
  • Grafana: A browser application for creating graphs and dashboards
  • ELK Stack: A combination of Elasticsearch, Logstash, and Kibana that allows for analysis and visualization of log files

If your ops team has dashboards for watching server health, it is very likely already using one or more of these tools.

Getting started

My team owns a service that is used by multiple games played by over 700,000 people every day all over the world. It handles thousands of requests per second, 24 hours a day, 7 days a week.

Yet when we first released the service, we had no idea which parts of the API were being used, or to what extent. Sometimes the service slowed to a crawl, and we could not tell why.

Once, during a game release, we had a database meltdown, and it took hours to diagnose the cause: a debug flag had been left on in the client, so we were writing far more data than we could handle.

In the beginning

When we started out, we had the following information about the production environment:

  • A simple dashboard showed system info for each host box, including CPU, memory, load average, and ports available.
  • A separate dashboard existed for our MySQL databases; it was powerful, but difficult to use.
  • Apache service logs showed each request, the response code, and any exception stack traces. Because of the nature of our service, there wasn't any information on specific APIs called or request data sent unless it was part of a stack trace.

Improving the logs

The first thing we tried was adding more information to the log files, such as the API name and user ID. This helped a little bit, but with multiple logs spread across several log servers, gathering relevant data was still a painful process.

We added more information on exceptions and put the information in a JSON string so it could potentially be added to the company's ELK stack. This helped some, but querying the logs has a significant learning curve, and it was a challenge to figure out how to extract the information we needed.
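
As a rough illustration of what a JSON log line of this kind might look like, here is a minimal Python sketch. The function name and every field name are assumptions made for the example; they are not the fields our service actually logs.

```python
import json
import time


def format_log_entry(api_name, user_id, status, exception=None, now=None):
    """Return one log line as a JSON string (field names are illustrative)."""
    entry = {
        "timestamp": int(now if now is not None else time.time()),
        "api": api_name,
        "user_id": user_id,
        "status": status,
    }
    if exception is not None:
        # Record the exception type separately so it can be filtered on later.
        entry["exception_type"] = type(exception).__name__
        entry["exception_message"] = str(exception)
    # sort_keys keeps the field order stable from line to line.
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    print(format_log_entry("save_player", 42, 500, exception=ValueError("bad payload")))
```

Writing one JSON object per line means a log pipeline can parse each entry into named fields rather than relying on pattern matching against free text.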

Instrumenting the code

What worked best for our team was adding code to the service itself, to send statistics directly to a Graphite server. For each call, we would collect the following information:

  • ID of the game making the call
  • Host name of the specific server box
  • Name of the target API
  • How long the request took to complete
  • Whether the request succeeded or failed
  • Exception type, if applicable

We also started tracking some things not specific to an individual request:

  • Memcache activity for each API
  • Number of active threads on each box
  • Response time for a critical third-party service

Exposing the bugs

With this information, we could finally see exactly which APIs were being used by each game and how often they were being called. It is one thing to use an application and track what calls are made during that session, but it is quite another to see the activity of all your customers in real time.

Things that can slide by during individual sessions become very obvious when presented in bulk. Here are some of the bugs our team discovered from analyzing production data:

  • One game was saving player data after every move instead of at the end of a level; the save API comprised nearly 90% of all the calls it made.
  • The mysterious service slowdowns were being caused by network issues, with a third-party API clogging up our service. With this information, we were able to take steps to isolate the third-party call so it wouldn't affect the rest of the service activity.
  • One API was attempting to save null values to memcache, causing unnecessary load on the memcache servers.
  • A certain API was failing every time it was called by a certain game. It turned out the dev team had meant to remove the call from the client but had forgotten.
  • One of the most heavily used APIs was making a database call that returned empty 95% of the time. We were able to rework the API to reduce the load on our database by about 7%.

Better performance testing

One of the biggest benefits of having more production information was that it became much easier to design load tests that accurately reflected actual server load. Instead of playing a few games and using those sessions to guess at the average pattern of API calls, I can just look at our graphs and see what percentage of requests are going to each API.
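
To show how observed percentages can drive a load test, here is a small Python sketch that turns per-API call counts into traffic shares and then draws a weighted random sequence of calls. The API names and counts are made up for the example, and the helper functions are hypothetical; a real load tool would issue the requests rather than just list them.

```python
import random


def call_shares(call_counts):
    """Convert per-API call counts into fractions of total traffic."""
    total = sum(call_counts.values())
    return {api: count / total for api, count in call_counts.items()}


def sample_calls(call_counts, n, seed=None):
    """Pick n API names at random, weighted by the observed call counts."""
    rng = random.Random(seed)
    apis = list(call_counts)
    weights = [call_counts[api] for api in apis]
    return rng.choices(apis, weights=weights, k=n)


if __name__ == "__main__":
    # Hypothetical counts, as they might be read off a dashboard.
    observed = {"save_player": 900, "load_player": 90, "login": 10}
    print(call_shares(observed))
    print(sample_calls(observed, 10, seed=1))
```

Passing a fixed seed makes the generated sequence repeatable, which helps when comparing two load test runs.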

More importantly, increased confidence in our production monitoring makes it possible to run load tests in production. So instead of running a test on a stage server and trying to extrapolate from that how our production servers will handle an equivalent load, I can just run directly in our actual production environment. This means I can be much more confident about how new APIs will affect the existing production load.

Continuous improvement

This just scratches the surface of what is possible with testing in production. In the future I hope to extract even more detailed information about each request to expose different behaviors among various devices, player states, and other details that I wouldn't necessarily be able to come up with on my own. I hope you will be inspired to start exploring your production data and see what it can show you about your services and applications.

For more about testing in production, attend Amber's presentation at TestBash San Francisco on November 9.