How to create test databases that reflect real life

The first thing people see when they walk into my office is a big sign that says, "It's not done until QA says it's done." As a QA team lead, it's my responsibility to ensure the quality of the product on which my team is working.

Developers share that responsibility as well. If they pass code to us that blows up the moment we run it, they'll be working on bug fixes late into the night. At the end of the day, however, it's my team's responsibility to give the code the final stamp of approval so it can ship to customers. That means we need to create tests that effectively simulate how the software will be used in the real world.

The last thing we want is for defects to be discovered in production by customers. If that happens, it's because we missed something, and the software engineers and QA staff will be spending the night in red-alert mode fixing production issues. No one wants that.

The challenge is that testing that simulates real-world conditions is difficult. The diversity of interactions between users and live production systems is immense, and there's a wide range of test cases that QA needs to cover. These include differences in:

End cases
Data combinations
Errors in data
Methods of entering data
Data boundaries (for example, what happens if you try to push 3,000 characters into a 30-character field?)
Corrupt data scenarios (data packets do sometimes get lost or corrupted in transit, and computers can crash suddenly)
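
To make the data-boundary case concrete, here is a minimal sketch of a boundary check for a 30-character field. The users table, the save_name function, and the MAX_LEN limit are all hypothetical stand-ins for your own application's write path, not part of any real product.

```python
import sqlite3

MAX_LEN = 30  # hypothetical width of the field under test


def boundary_inputs(max_len):
    """Return strings at, just under, just over, and far beyond the limit."""
    return {
        "empty": "",
        "one_under": "x" * (max_len - 1),
        "at_limit": "x" * max_len,
        "one_over": "x" * (max_len + 1),
        "far_over": "x" * (max_len * 100),  # 3,000 characters for a 30-character field
    }


def save_name(conn, name):
    """Hypothetical application-layer write that enforces the field width."""
    if len(name) > MAX_LEN:
        raise ValueError("name exceeds field width")
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def run_boundary_checks():
    """Try every boundary input and record whether it was accepted or rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    results = {}
    for label, value in boundary_inputs(MAX_LEN).items():
        try:
            save_name(conn, value)
            results[label] = "accepted"
        except ValueError:
            results[label] = "rejected"
    return results
```

The point of checking both sides of the limit is that off-by-one mistakes tend to show up at exactly 30 and 31 characters, not at 3,000.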

We use the four methods below to build a test database that faithfully represents production systems. These enable our team to cover as many real-life test cases as possible.

Synthesize data

We create data artificially, but it's not enough to just pour data into the test database. The complex relationships between tables in relational databases, or between objects in object databases, are difficult to reproduce by hand. If you don't create the data through normal application flows, you can end up with inconsistent data. The right way to do it is to write programs that simulate user behavior, much as automated tests do. Because the application should prevent inconsistent data from reaching the database, this approach generates consistent, coherent data, and you can grow the database to whatever size you need.
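
As an illustration of creating data through application flows rather than raw inserts, here is a minimal sketch. The schema and the register_customer and place_order functions are hypothetical stand-ins for a real application's entry points; the simulator only ever calls those functions, so every order it creates belongs to a customer that exists.

```python
import random
import sqlite3


def create_schema(conn):
    """Hypothetical two-table schema with a parent/child relationship."""
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount_cents INTEGER NOT NULL CHECK (amount_cents > 0)
        );
    """)


def register_customer(conn, name):
    """Application flow: sign up a customer and return the new id."""
    cur = conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))
    return cur.lastrowid


def place_order(conn, customer_id, amount_cents):
    """Application flow: refuse orders for customers that do not exist."""
    exists = conn.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if exists is None:
        raise ValueError("unknown customer")
    conn.execute(
        "INSERT INTO orders (customer_id, amount_cents) VALUES (?, ?)",
        (customer_id, amount_cents),
    )


def simulate_users(conn, n_customers, seed=0):
    """Drive the application flows the way a population of users might."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    for i in range(n_customers):
        customer_id = register_customer(conn, f"customer-{i}")
        for _ in range(rng.randint(0, 5)):
            place_order(conn, customer_id, rng.randint(100, 50_000))
    conn.commit()


def orphan_orders(conn):
    """Count orders whose customer is missing; a consistent database has none."""
    return conn.execute("""
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """).fetchone()[0]
```

Raising n_customers is all it takes to grow the database, and the orphan check gives a quick way to confirm that the data stayed coherent as it grew.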

However, automated programs cover only about 60 percent of real-life use cases, so synthesized data isn't sufficient on its own to create a representative test database.

Use a customer's production database

There's nothing better than using a real customer's systems for testing. One challenge here is that you'll need permission from your customer, which isn't always easy to obtain. Enterprises tend to be cautious with the data they own. Nevertheless, as a software vendor, you may be able to get permission to view your customer's database in order to analyze and uncover production issues. In one case, my team cooperated with a customer to sort out performance issues with their installation. Our analysis of the database showed it to be severely overpopulated due to an unusual configuration that we never would have anticipated. By engaging with this customer, we uncovered an unexpected use case that we could address.

However, observation and analysis of the customer database isn't enough. You need to have that data in a test database where you can add, modify, and delete data in order to ascertain whether the application is working correctly. That means getting the customer's permission to make a copy of the production database to run on your own system.

Making a copy of a customer database isn't trivial, especially when the database schema changes—fields, tables, and objects may be added or removed or their names could be changed. To make a copy of your customer's data, go through an orderly migration process, using automation tools to ensure that the data remains intact and coherent.

We go through this process whenever we update our database schemas. We migrate data from our own test accounts and then apply a battery of tests to ensure the latest build works correctly with the new database. Only then do we migrate production databases.
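
The idea of rehearsing a migration on a copy before touching the real database can be sketched as follows. This is an illustration only, using SQLite from Python's standard library; the accounts table and the added email column are hypothetical, and a real migration would run through whatever automation tools you already use.

```python
import sqlite3


def copy_database(source):
    """Copy a database into a fresh in-memory one; the original is untouched."""
    target = sqlite3.connect(":memory:")
    source.backup(target)
    return target


def migrate(conn):
    """Hypothetical schema change: add an email column to accounts."""
    conn.execute("ALTER TABLE accounts ADD COLUMN email TEXT")
    conn.commit()


def row_count(conn, table):
    """Count rows; the table name comes from our own code, never user input."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def rehearse_migration(source):
    """Migrate a copy and verify that no rows were lost and the column exists."""
    copy = copy_database(source)
    before = row_count(copy, "accounts")
    migrate(copy)
    after = row_count(copy, "accounts")
    columns = [row[1] for row in copy.execute("PRAGMA table_info(accounts)")]
    return before == after and "email" in columns
```

A row-count comparison is only the simplest possible check. In practice the rehearsal would be followed by the full battery of tests against the migrated copy.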

Simulate erratic behavior to generate unexpected end cases

The volume of usage in production systems can generate use cases that your team could never anticipate. Objects fall on keyboards, causing unexpected input, and packets may be corrupted or lost for any number of reasons while traveling between a user's computer and a database that may be hosted thousands of miles away. While there will always be that one use case for which we didn't test, we try to cover as many scenarios as we can by using automation tests that simulate erratic user behavior and unusual flows. In this way, if we manage to crash the program or the database, we have a way of analyzing it—and preventing the same thing from happening in our own production systems.
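
One way to approximate this kind of erratic input is to take a valid request and damage it at random. The sketch below is a simplified illustration, not our actual test harness: handle_request is a hypothetical handler, and the JSON payload with a name field is an assumed message format.

```python
import json
import random


def corrupt(payload, rng, n_changes=3):
    """Randomly flip bits, drop bytes, or insert stray bytes into a payload."""
    data = bytearray(payload)
    for _ in range(n_changes):
        action = rng.choice(["flip", "drop", "insert"])
        if action == "flip" and data:
            pos = rng.randrange(len(data))
            data[pos] ^= 1 << rng.randrange(8)
        elif action == "drop" and data:
            del data[rng.randrange(len(data))]
        else:
            # Also used as the fallback when the payload is already empty.
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    return bytes(data)


def handle_request(raw):
    """Hypothetical handler: reject anything that is not a valid request."""
    try:
        message = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "rejected"
    if not isinstance(message, dict) or not isinstance(message.get("name"), str):
        return "rejected"
    return "ok"


def fuzz(n_runs=1000, seed=0):
    """Feed damaged payloads to the handler and collect any unhandled errors."""
    rng = random.Random(seed)  # seeded so a failing input can be replayed
    valid = json.dumps({"name": "Ada"}).encode("utf-8")
    crashes = []
    for _ in range(n_runs):
        raw = corrupt(valid, rng)
        try:
            handle_request(raw)
        except Exception as exc:
            crashes.append((raw, repr(exc)))
    return crashes
```

Each entry in the returned list pairs the offending input with the error it caused, which is what makes a crash analyzable after the fact.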

Use performance engineering to simulate heavy loads

Load testing is just one aspect of performance engineering, but you can use it to simulate two scenarios. In the first scenario, performance engineers run tests that can last for hours or even days, generating huge volumes of data. This is a great opportunity to see how your database behaves as it scales up to gargantuan proportions. In one case, we ran across a database that became overloaded with millions of lines of log data that was never used for anything. This led us to change our implementation. We decided to aggregate log data, rather than store it all in raw form.
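
To show what aggregating log data instead of storing it raw can look like, here is a minimal sketch. The log line format (an ISO timestamp, a level, then a free-text message) is an assumption made for the example, not the format our product uses.

```python
from collections import Counter
from datetime import datetime


def aggregate(log_lines):
    """Collapse raw log lines into counts per (hour, level) bucket."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<ISO timestamp> <LEVEL> <message>"
        timestamp, level, _message = line.split(" ", 2)
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:00")
        counts[(hour, level)] += 1
    return counts
```

With this approach, storage grows with the number of distinct hour and level combinations rather than with the number of log lines, at the cost of losing the individual messages.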

In the second scenario, performance engineers simulate numerous concurrent users. This is a great opportunity to examine your security mechanisms and ensure that user data gets pulled from or stored in the right databases and that permissions won't be breached.
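
A concurrent-user test of this kind can be sketched with a thread pool. TenantStore below is a hypothetical in-memory stand-in for the system under test; in a real run the simulated users would call the application itself, and the isolation check would compare what each user can see against what that customer owns.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TenantStore:
    """Hypothetical stand-in for the system under test."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}  # tenant name -> list of records

    def write(self, tenant, record):
        with self._lock:
            self._rows.setdefault(tenant, []).append(record)

    def read(self, tenant):
        with self._lock:
            return list(self._rows.get(tenant, []))


def simulate_user(store, tenant, user_id, n_writes):
    """Write records as one user, then check nothing foreign is visible."""
    for i in range(n_writes):
        store.write(tenant, (tenant, user_id, i))
    # Every record visible to this tenant must belong to this tenant.
    return all(record[0] == tenant for record in store.read(tenant))


def run_load(n_tenants=5, users_per_tenant=20, n_writes=50):
    """Run all simulated users concurrently and report isolation and totals."""
    store = TenantStore()
    jobs = [
        (f"tenant-{t}", u)
        for t in range(n_tenants)
        for u in range(users_per_tenant)
    ]
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(
            pool.map(lambda job: simulate_user(store, job[0], job[1], n_writes), jobs)
        )
    totals = {tenant: len(store.read(tenant)) for tenant in {job[0] for job in jobs}}
    return all(results), totals
```

The totals matter as much as the isolation flag: if a tenant ends up with fewer records than its users wrote, data was lost or stored in the wrong place.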

Performance testing addresses not only how your database behaves with large quantities of data but also how it behaves with many customers and their users.

Go live

A test database representative of live systems is critical if you want to thoroughly test new developments. While synthesized data can cover some scenarios, it can't match the diversity of live systems. If you want to cover as many test cases as possible, you need to use live data from production systems.

Getting permission from your customer to use their data is a win-win. Your customer gets much higher assurance that they won't run into unforeseen issues in the next release, and both your developers and your QA team will have the best possible infrastructure for testing new developments going forward.
