
Privacy and test data management: Why discovery is now a requirement

Eric Popiel Cybersecurity Evangelist, CyberRes

Data privacy regulations, including the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), have refocused attention on test data management (TDM). These laws have put pressure on organizations to ensure that all personal and sensitive content has been identified and removed from test data. 

Traditionally, TDM was about creating a relationally intact, reliable subset of production data, or data very similar to it, that application and systems developers could use for different test use cases. The primary reason people would do this was to reduce the size of the production dataset in order to save on storage and improve performance.
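As a rough illustration of what "relationally intact" means here, the sketch below copies a chosen set of parent rows together with only the child rows that reference them, then checks that no child row was left without its parent. The two-table schema, the `copy_subset` and `orphaned_orders` helpers, and the use of SQLite are all assumptions made for the example; real subsetting tools follow many more relationships.

```python
import sqlite3

# Hypothetical two-table schema; real schemas have many more relationships.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL
);
"""

def copy_subset(source, target, customer_ids):
    """Copy the chosen customers plus only the orders that reference them."""
    ids = list(customer_ids)
    marks = ",".join("?" for _ in ids)
    customers = source.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", ids
    ).fetchall()
    orders = source.execute(
        f"SELECT id, customer_id, total FROM orders WHERE customer_id IN ({marks})",
        ids,
    ).fetchall()
    target.executescript(SCHEMA)
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()

def orphaned_orders(conn):
    """Count orders whose customer is missing; 0 means the subset is intact."""
    return conn.execute(
        "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
        "ON c.id = o.customer_id WHERE c.id IS NULL"
    ).fetchone()[0]
```

Note that a subset built this way is smaller than production but still carries whatever personal data the copied rows contained; subsetting alone does not protect anything.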

Content in test data that was considered personal and sensitive would typically be masked or removed in a way that did not degrade the quality of the tests. However, because there was no real oversight of this process, it was not unusual for multiple copies of production data to pervade the test process. My team came across a client that had 22 copies of production data for just one application.

Historically, these projects were very tactical and sat low on the IT project list. They were typically run entirely within IT, between database administrators and the testing/quality assurance functions, without any oversight from regulatory bodies or auditors.

The new privacy laws demand a modern approach to TDM. Here's why—and how to get started shifting your efforts.

Challenges with creating the right dataset

In addition to the perceived low value of investing in TDM, there were several other challenges. The first: the need to create a realistic dataset that represents actual production data and exhibits the same level of data integrity. The last thing you want is your development and testing communities debugging good code that fails because of poor data quality.

Second, given that constraint, how do you create a realistic “right-size” dataset? Third, how do you create and refresh test databases fast enough as development and release cycles keep accelerating year over year and more companies adopt agile and DevOps processes? Because these “non-strategic” projects have drawn little investment through the years, many of the processes are very manual, are difficult to repeat, and lack automation.

As a result, it is not unusual to hear of production databases, perhaps containing personal and sensitive data, being cloned and used as testing databases. Even where the “critical” production systems had some protection, many other systems, such as marketing, were largely ignored.

Why data discovery is key

Modern test data management is about discovering and protecting personal and sensitive content, so that you are compliant with regulatory mandates. It's about making sure you have analyzed every database and every dataset for personal and sensitive content.

The only way to do that is through data discovery. Once discovery is done, you need an automated process that identifies personal and sensitive content and protects it using a predefined set of rules, with the protection applied depending on how the content is classified.
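One way to picture such a rule set is as two small tables: one that maps a classification to a detection pattern, and one that maps the same classification to a protection action. The sketch below is only an illustration under that assumption; the names `DETECTORS`, `RULES`, `classify`, and `protect_row` are hypothetical, and the patterns are deliberately simple rather than production-grade.

```python
import re

# Hypothetical detectors: classification name -> pattern (illustrative, not exhaustive).
DETECTORS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "credit_card": re.compile(r"^\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}$"),
}

def mask_all_but_last4(value):
    """Replace every digit except the last four with 'X'; keep separators."""
    total = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total - 4 else "X")
        else:
            out.append(ch)
    return "".join(out)

def mask_email_local_part(value):
    """Keep the domain, replace the part before '@' with a fixed placeholder."""
    _, _, domain = value.partition("@")
    return "user@" + domain

# Hypothetical rules: classification name -> protection action.
RULES = {
    "ssn": mask_all_but_last4,
    "credit_card": mask_all_but_last4,
    "email": mask_email_local_part,
}

def classify(value):
    """Return the first matching classification, or None if nothing matches."""
    if not isinstance(value, str):
        return None
    for name, pattern in DETECTORS.items():
        if pattern.match(value):
            return name
    return None

def protect_row(row):
    """Apply the rule for each value's classification; pass other values through."""
    protected = {}
    for column, value in row.items():
        action = RULES.get(classify(value))
        protected[column] = action(value) if action else value
    return protected
```

Because the rules are keyed by classification rather than by column name, the same rule set can be run unchanged against any table once discovery has labeled its content.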

At the end of the day, the goal is to have a fully protected dataset that does not contain any personal or sensitive content. With TDM these days, the storage perspective is not as important as it once was. Discovery and protection of personal and sensitive data is the biggest piece.

Old challenges join the new

To take on these new challenges, TDM must address the old challenges as well. Automation is key in TDM strategies because release frequency has kept increasing. A survey of over 600 software development professionals that OverOps conducted last year found that 59% of organizations release new code anywhere from twice a week to multiple times a day. So speed, automation, and realistic, right-size test datasets are a must.

Importantly, the changing regulatory landscape means TDM is no longer just an IT-only function. Data privacy mandates have raised the stakes considerably for organizations that fail to apply due diligence standards to protecting personal and sensitive data.

So chief compliance officers, legal counsel, chief data officers, and other senior executives have all begun paying closer attention to data risks, in both production and non-production environments. To demonstrate compliance with organizational standards for data protection, TDM projects and processes cannot exist in their current form. They must evolve.

Integrating the identification and remediation of personal and sensitive data in test data subsets with automated data-centric protection is now a requirement for effective TDM. This ensures that data privacy compliance is built in.

Why changing your TDM approach is now a requirement

Just a few years ago, there were far fewer government regulations around data breaches and data privacy, and the “assume breach” mentality was not widespread. Today, there's greater awareness and concern over data loss in test and development environments.

As far back as 2018, for example, the FTC warned about the need for organizations to secure their non-production data following multiple breaches of Uber's cloud storage infrastructure that resulted in the loss of personal and sensitive data belonging to Uber drivers and customers. The problem had to do with developers and testers at Uber connecting to the company's production data in the cloud, using weak access controls.

The practice allowed attackers to exploit the company's software development environment and gain access to personal and sensitive data stored in the cloud. "This case demonstrates the importance of securing all software environments, not just production environments," the FTC warned at the time.

The use of format-preserving encryption, anonymization, masking, and other techniques to prevent personal and sensitive content from getting exposed in the development and testing stages is now more important than ever.
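To show what "format-preserving" means in practice, here is a minimal sketch of deterministic masking that keeps a value's length and separators while replacing its digits. It is a simplified stand-in, not format-preserving encryption: it is one-way, it is not a standardized scheme such as NIST FF1, and the function name `pseudonymize_digits` and the key handling are assumptions made for the example.

```python
import hashlib
import hmac

def pseudonymize_digits(value: str, key: bytes) -> str:
    """Replace each ASCII digit with a keyed pseudo-random digit.

    Length, separators, and letters are left untouched, so the result still
    passes format checks. The same input and key always give the same output,
    which keeps joins between masked tables consistent.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if "0" <= ch <= "9":
            # A byte modulo 10 is slightly biased; acceptable for an illustration.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)
```

A call such as `pseudonymize_digits("123-45-6789", key)` returns another string shaped like `NNN-NN-NNNN`, so application code that validates the format keeps working against the masked data. Where the original value must be recoverable, a real format-preserving encryption implementation would be needed instead.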

Compliance and legal officers are concerned about the process and want to be assured that it is followed in a proper manner. If you think it is in your best interest to encrypt personal and sensitive content in your production database, you could extend that argument to say you need to protect test data as well. You might have one petabyte of production data, but five times as much data in your test environment when you add it all up.

The current regulatory environment is not the only reason why you need to make test data a part of your overall privacy processes. The other aspect is that the scope of data that is considered personal and sensitive has also increased through the years. Knowing where that content is can be a huge challenge.

Most people, for instance, know that a customer database might contain Social Security numbers, credit card numbers, and other personal and sensitive content. However, as you look at upstream and downstream applications, databases, and data lakes, it becomes much harder to know what personal and sensitive content you might have in each of those environments. Without going field by field through each database, how do you even know what personal and sensitive content you might have in your test data?
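Going field by field is exactly the kind of work that can be automated. The sketch below, written against SQLite purely for convenience, walks every table and column, samples values, and reports which columns appear to hold sensitive content. The `scan_database` function, the `PATTERNS` table, and the sample size are assumptions for illustration; real discovery tools use far richer detection than two regular expressions.

```python
import re
import sqlite3

# Hypothetical patterns; they search inside free text as well as whole values.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def scan_database(conn, sample_size=100):
    """Return (table, column, label, hits) for every column with matches."""
    findings = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = conn.execute(
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?',
                (sample_size,),
            ).fetchall()
            for label, pattern in PATTERNS.items():
                hits = sum(1 for (value,) in rows if pattern.search(str(value)))
                if hits:
                    findings.append((table, column, label, hits))
    return findings
```

Because the scan looks at content rather than column names, it can flag a card number pasted into a free-text support ticket just as readily as one stored in an obviously named field, which is the situation downstream systems tend to create.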
