When I started as a software tester, testing in production was what happened when teams didn't take QA seriously. But your application is being tested in production every single day by the people who use it. You just need to find a way to use all the data users are already generating.
The infinite variety of devices, operating systems, browsers, and general user behavior in production is an invaluable source of information, not only for debugging, but also for finding issues that are difficult to spot in less realistic testing environments.
Here's how to put all that information to good use.
Choose the right tools
To get the most out of your production data, you need a means to gather the information and display it in ways that can expose potential problems. There are many tools available for collecting and visualizing data, both commercial and open source. Here are some of the more popular open-source tools:
- StatsD: A lightweight daemon that collects and aggregates statistics before sending them to a back-end aggregator such as Graphite
- Graphite: A service for collecting, querying, and graphing statistical data
- Grafana: A browser application for creating graphs and dashboards
- ELK Stack: A combination of Elasticsearch, Logstash, and Kibana that allows for analysis and visualization of log files
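To make the pipeline concrete: StatsD accepts metrics as plain-text UDP packets in the form `name:value|type`, which it aggregates and forwards to a back end such as Graphite. Here is a minimal sketch of that wire format, using stdlib sockets only; the metric name is a hypothetical example, and the host/port are StatsD's conventional defaults.

```python
import socket

def send_metric(metric: str, value: int, metric_type: str,
                host: str = "localhost", port: int = 8125) -> bytes:
    """Send one metric to a StatsD daemon over UDP.

    StatsD's wire format is 'name:value|type', where type is
    'c' (counter), 'ms' (timer), or 'g' (gauge).
    """
    payload = f"{metric}:{value}|{metric_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget; UDP never blocks on a reply
    sock.close()
    return payload  # returned so the format is easy to inspect

# A counter increment for a hypothetical API:
sent = send_metric("myservice.api.save.calls", 1, "c")
```

Because the transport is fire-and-forget UDP, instrumentation like this adds negligible overhead to the request path, which is what makes it safe to run in production.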
If your ops team has dashboards for watching server health, it is very likely already using one or more of these tools.
Getting started
My team owns a service that is used by multiple games played by over 700,000 people every day all over the world. It handles thousands of requests per second, 24 hours a day, 7 days a week.
Yet when we first released the service, we had no idea which parts of the API were being used, or to what extent. Sometimes the service slowed to a crawl, and we couldn't tell why.
Once, during a game release, we had a database meltdown; it took hours to diagnose that a debug flag had been left on in the client, which meant we were writing far more data than we could handle.
In the beginning
When we started out, we had the following information about the production environment:
- A simple dashboard showed system info for each host box, including CPU, memory, load average, and ports available.
- A separate dashboard existed for our MySQL databases; it was powerful, but difficult to use.
- Apache service logs showed each request, the response code, and any exception stack traces. Because of the nature of our service, there wasn't any information on specific APIs called or request data sent unless it was part of a stack trace.
Improving the logs
The first thing we tried was adding more information to the log files, such as the API name and user ID. This helped a little bit, but with multiple logs spread across several log servers, gathering relevant data was still a painful process.
We added more information on exceptions and put it in a JSON string so it could potentially be added to the company's ELK stack. This helped some, but querying the logs had a significant learning curve, and it was a challenge to figure out how to extract the information we needed.
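Structured logging of this kind might look like the following sketch. The field names and API names here are illustrative, not the team's actual schema; the point is that emitting one JSON object per line lets a log shipper such as Logstash index each field without custom parsing.

```python
import json
import logging
import traceback

logger = logging.getLogger("service")

def log_exception(api_name: str, user_id: str, exc: Exception) -> str:
    """Emit one exception as a single JSON line with the context
    (API name, user ID) attached as queryable fields."""
    record = {
        "level": "ERROR",
        "api": api_name,
        "user_id": user_id,
        "exception": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exc(),  # full stack trace, as in the original logs
    }
    line = json.dumps(record)
    logger.error(line)
    return line

# Hypothetical usage inside a request handler:
try:
    raise ValueError("bad request payload")
except ValueError as e:
    entry = log_exception("save_player", "user-123", e)
```

Even with clean structure, the lesson we learned stands: logs answer "what happened in this request," but aggregate questions still require pulling data out of many files.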
Instrumenting the code
What worked best for our team was adding code to the service itself to send statistics directly to a Graphite server. For each call, we would collect the following information:
- ID of the game making the call
- Host name of the specific server box
- Name of the target API
- How long the request took to complete
- Whether the request succeeded or failed
- Exception type, if applicable
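Per-request instrumentation like this is often wrapped around each handler so no call path is missed. The following is a minimal stdlib-only sketch, assuming a StatsD daemon on its default port; the metric naming scheme, game/host/API names, and handler are all hypothetical, not our service's actual code.

```python
import socket
import time
from functools import wraps

STATSD_ADDR = ("localhost", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(metric: str, value, metric_type: str) -> None:
    # StatsD wire format: name:value|type (c = counter, ms = timer)
    _sock.sendto(f"{metric}:{value}|{metric_type}".encode(), STATSD_ADDR)

def instrumented(game_id: str, host: str, api: str):
    """Wrap an API handler and report timing, success/failure,
    and exception type under a game.host.api metric path."""
    prefix = f"{game_id}.{host}.{api}"
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
                emit(f"{prefix}.success", 1, "c")
                return result
            except Exception as exc:
                emit(f"{prefix}.failure", 1, "c")
                emit(f"{prefix}.exception.{type(exc).__name__}", 1, "c")
                raise
            finally:
                elapsed_ms = int((time.monotonic() - start) * 1000)
                emit(f"{prefix}.time", elapsed_ms, "ms")
        return wrapper
    return decorator

@instrumented("game42", "web-01", "save_player")
def save_player(player_id: str) -> bool:
    return True  # placeholder for real save logic
```

Because every handler reports under a consistent `game.host.api` path, Graphite can aggregate across games, drill into a single box, or compare one API's failure rate against another without any extra code.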
We also started tracking some things not specific to an individual request:
- Memcache activity for each API
- Number of active threads on each box
- Response time for a critical third-party service
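Stats like these aren't tied to a single request, so they're typically sampled on a timer and sent as StatsD gauges rather than counters. A hedged sketch, using the process's own thread count as a stand-in (the metric name is hypothetical):

```python
import socket
import threading

def report_gauges(addr=("localhost", 8125)) -> dict:
    """Sample process-wide stats and send them as StatsD
    gauges ('g' type), which record a current level rather
    than accumulating like counters."""
    stats = {
        "service.threads.active": threading.active_count(),
    }
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for name, value in stats.items():
        sock.sendto(f"{name}:{value}|g".encode(), addr)
    sock.close()
    return stats

# In production this would run on a schedule, e.g. every 10 seconds:
# threading.Timer(10, report_gauges).start()
sampled = report_gauges()
```

Gauges are the right fit here because a dashboard should show "how many threads are active right now," not a running total.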
Exposing the bugs
With this information, we could finally see exactly which APIs were being used by each game and how often they were being called. It is one thing to use an application and track what calls are made during that session, but it is quite another to see the activity of all your customers in real time.
Things that can slide by during individual sessions become very obvious when presented in bulk. Here are some of the bugs our team discovered from analyzing production data:
- One game was saving player data after every move instead of at the end of a level; the save API comprised nearly 90% of all the calls it made.
- The mysterious service slowdowns were being caused by network issues, with a third-party API clogging up our service. With this information, we were able to take steps to isolate the third-party call so it wouldn't affect the rest of the service activity.
- One API was attempting to save null values to memcache, causing unnecessary load on the memcache servers.
- A certain API was failing every time it was called by a certain game. It turned out the dev team had meant to remove the call from the client but had forgotten.
- One of the most heavily used APIs was making a database call that returned empty 95% of the time. We were able to rework the API to reduce the load on our database by about 7%.
Better performance testing
One of the biggest benefits of having more production information was that it became much easier to design load tests that accurately reflected actual server load. Instead of playing a few games and using those sessions to guess at the average pattern of API calls, I can just look at our graphs and see what percentage of requests are going to each API.
More importantly, increased confidence in our production monitoring makes it possible to run load tests in production. So instead of running a test on a stage server and trying to extrapolate from that how our production servers will handle an equivalent load, I can just run the test directly in our actual production environment. This means I can be much more confident about how new APIs will affect the existing production load.
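Shaping a load test from production graphs can be as simple as weighting the request mix by each API's observed share of traffic. The percentages and API names below are hypothetical examples, not our service's real numbers:

```python
import random

# Hypothetical call shares read off the production graphs:
# fraction of total traffic observed going to each API.
PRODUCTION_MIX = {
    "save_player": 0.55,
    "load_player": 0.30,
    "get_leaderboard": 0.10,
    "misc": 0.05,
}

def plan_load_test(total_requests: int, seed: int = 0) -> list:
    """Generate a request sequence whose API mix matches the
    percentages observed in production monitoring."""
    rng = random.Random(seed)  # seeded for a repeatable test plan
    apis = list(PRODUCTION_MIX)
    weights = [PRODUCTION_MIX[a] for a in apis]
    return rng.choices(apis, weights=weights, k=total_requests)

plan = plan_load_test(1000)
save_share = plan.count("save_player") / len(plan)  # ~0.55
```

A test generated this way stresses the service in the same proportions real players do, instead of in whatever proportions a tester happens to guess.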
Continuous improvement
This just scratches the surface of what is possible with testing in production. In the future I hope to extract even more detailed information about each request to expose different behaviors among various devices, player states, and other details that I wouldn't necessarily be able to come up with on my own. I hope you will be inspired to start exploring your production data and see what it can show you about your services and applications.
For more about testing in production, attend Amber's presentation at TestBash San Francisco on November 9.