3 highly effective strategies for managing test data
Think back to the first automated test you wrote. If you're like most testing professionals, you probably used an existing user and password, then wrote verification points using data already in the system. Then you ran the test. If it passed, it was because the data in the system was the same as when you wrote the test. And if it failed, it was probably because the data had changed.
Most new automated testers experience this. But they quickly learn that they can't rely on specific data residing in the system when the test script executes. Test data must be set up in the system so that tests run credibly and report accurately.
Over the last year, I've researched, written, and spoken coast-to-coast on strategies for managing test data, and the common patterns you can use to resolve these issues. The set of solutions surrounding test data is what I call "data strategies for testing." Here are three patterns for managing your own test data more effectively. If after reading you want to dig in more deeply, drop in on my presentation on these patterns at the upcoming Automation Guild conference.
Three strategies for managing test data
Each data strategy has two components: a "creational strategy" and a "cleanup strategy." The creational strategy creates the data the tests need. The cleanup strategy removes that data afterward.
1. The elementary approach
I call the approach I described at the beginning of this article “the elementary approach” because it has no creational strategy. The test automation code does nothing to create the data that the tests use. Likewise, the approach does nothing to clean up data after each test case runs.
While this approach doesn't work in most environments, or with most applications under test, it does serve as a foundation for other patterns. The elementary approach can work in some cases, but those are few and far between. Most of us realize very quickly that we must manage the data in the system in order to get the results we want.
For instance, if the data in the system changes because another user (or test case) changes it, our test fails. If our test case changes data in the system and verifies the change, re-running the test will fail. The same is true if we run the same test case in parallel: we'd hit a race condition, with test executions competing to be the first to access and change the data. One would fail, one would pass. So if the organization values consistent test results, the elementary approach won't work.
2. Refresh your data source
A common solution to this problem is to reset the data source that the application is using prior to test execution. I call this "the refresh data source approach."
In between test executions, test automation will reset the data source. That solves the problem of making sure you have the same data in the system each time tests run, provided you refresh the data source with a snapshot containing the data you want.
But this can be costly. In some systems, refreshing a data source can take hours, or even days. It may also be costly in terms of labor. After all, how many testers know how to reset an Oracle database to a previous state? The technical skills needed to implement this approach may be high.
As with the elementary approach, refreshing the data source works with some test suites, applications, and environments. The key to implementing it is understanding the team's constraints and aligning them with the goals for the tests. For instance, in the case of a shared system under test (SUT), how will refreshing the data source affect testers on your team? Management may not agree to having 10 testers sitting idle for a couple of hours a day because of a refresh strategy on a shared system, and that kind of delay won't aid today's continuous delivery initiatives.
3. The selfish data generation approach
So the next thought for many is: "What if we didn't refresh the database as often, and instead created unique data for each execution of a test case?" I call this "selfish data generation."
Whereas the refresh data strategy has a cleanup but no creation strategy, this approach has a creation but no cleanup strategy. Consider a test case that creates the data it needs to verify functionality, and where this data is unique. The problem of encountering a race condition on data goes away in this situation because each test has its own unique data to modify and verify functionality. Additionally, the problem of long-running times for refresh code is gone, and your testers don't become idle while those long refresh processes run.
A new problem created by this approach is that data builds up in the system quickly. How big a problem could that be? I hear developers say again and again that tests "will never create enough data that it will matter." And every time, within a matter of weeks, I end up back at the table with them, discussing the large amount of data that has built up in the system.
In healthy automated testing environments, tests run a lot. They run many times while they are being developed. When they're tied into continuous integration systems and run with every commit, the problem is amplified. When every small test case creates data in the system, the size of the data source explodes.
Selfish data generation is so named because the strategy cares only about the concerns of the tests, and nothing else. It doesn't consider other interests or needs. It doesn't consider what may happen when, over the course of a couple of months, it has created 500 million users. And it doesn't consider what that data growth does to query times across the application.
What is good about the selfish data generation approach is that it gets all of your tests running without race conditions causing false positives in test reports. It is also very good at finding issues in the SUT that arise from varying the data used as inputs.
These three strategies are the most basic patterns I’ve discovered. They should pique your interest, serve as a basis for developing a better understanding of test data management, and help you think through what you do with your own test environments. Mix these, and match them. Explore alternatives, such as refreshing specific data and generating other data. Explore whether mocking data sources can accelerate testing efforts.
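As one example of mixing the patterns, a creational strategy can be paired with a per-test cleanup strategy. Here's a hedged Python sketch using a context manager; as before, the data source and all names are illustrative stand-ins.

```python
import uuid
from contextlib import contextmanager

# Hypothetical stand-in for the application's data source.
system_db = {"users": {}}

@contextmanager
def managed_user(role):
    """Creational strategy on entry, cleanup strategy on exit."""
    username = f"user-{uuid.uuid4()}"
    system_db["users"][username] = {"role": role}  # create unique data
    try:
        yield username
    finally:
        del system_db["users"][username]  # clean up, even on failure

def test_admin_role():
    with managed_user("admin") as name:
        assert system_db["users"][name]["role"] == "admin"

# The test gets unique data (no races) and removes it afterward
# (no buildup), leaving the system in its original state.
test_admin_role()
```

The `try/finally` inside the context manager matters: the cleanup runs even when the test's assertion fails, which is what keeps repeated runs from polluting the data source.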
As systems become more intertwined, you'll need more solutions to push ahead with testing and test automation. But today you can make a commitment to actively managing test data so that your testing can be accurate, viable, and repeatable. You can find more information in my webinars, specifically the one entitled “Patterns in Managing Your Test Data.”
In January, I'll be speaking at the Automation Guild in depth about these patterns and demonstrating test automation code that implements them. I'll make simple reference implementations of these solutions available to attendees. I hope to see you then. In the meantime, if you have questions, please post them below.