Format-preserving hashing: A better way to anonymize your sensitive data

Karen Martin, Freelance writer, Independent

Epidemiologists analyze healthcare data to learn how to detect, treat, and prevent disease. Insurance providers analyze the same data to detect insurance fraud. And financial institutions analyze transactions to detect financial fraud.

Generally, you must anonymize sensitive data before giving it to researchers, to protect data subjects' privacy. Most organizations do this by binning, encrypting, tokenizing, or hashing sensitive fields.

There's always a trade-off between the accuracy of the data given to researchers and the anonymity of the data subjects. If the dates on which patients were diagnosed with measles were binned by year, for example, with every date from January 1, 1970, through December 31, 1970, shortened to just 1970, it would be harder to associate a specific patient with a measles diagnosis.

This provides a level of anonymity but destroys some of the fine structure of the data. A researcher would know how many measles cases were diagnosed that year, but would not know if the cases were spread randomly through the year or clustered around certain dates.
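
A small Python sketch makes the loss concrete. The diagnosis dates are invented for illustration, and `bin_by_year` is a hypothetical helper rather than part of any particular tool.

```python
from collections import Counter
from datetime import date

# Hypothetical diagnosis dates, invented for this illustration.
diagnoses = [
    date(1970, 3, 2),
    date(1970, 3, 3),
    date(1970, 3, 3),
    date(1970, 11, 20),
]

def bin_by_year(d: date) -> int:
    """Replace a full date with its year only."""
    return d.year

# Fine structure: number of cases on each exact date.
per_day = Counter(diagnoses)

# After binning: number of cases in each year.
per_year = Counter(bin_by_year(d) for d in diagnoses)

print(len(per_day))  # three distinct dates before binning
print(per_year)      # a single 1970 bucket holding all four cases
```

After binning, the March cluster and the isolated November case are indistinguishable; only the yearly total survives.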

Format-preserving hashing (FPH) is a new approach to anonymizing sensitive data. It provides a flexible trade-off between protecting the anonymity of data subjects and preserving the value of the data for secondary uses.

Here's why FPH may be a better option than format-preserving encryption (FPE) in the healthcare industry, and in many others.


Saving the structure

You can preserve more structure by encrypting the dates in records. A researcher can see if there are large variations from day to day, but won't know the actual date of a specific diagnosis. Encrypting, tokenizing, and hashing all protect anonymity and preserve data structure, but hashing has the added advantage of potentially reducing your security compliance costs.
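
As a rough sketch of why a deterministic transformation keeps this structure, the Python below replaces each date with a pseudonym derived from it. SHA-256 is used here only as a stand-in for whichever encryption, tokenization, or hashing scheme is actually deployed; the dates and the `pseudonymize` helper are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical diagnosis dates in ISO format, invented for this illustration.
dates = ["1970-03-02", "1970-03-03", "1970-03-03", "1970-11-20"]

def pseudonymize(value: str) -> str:
    """Map a value to a deterministic pseudonym (stand-in scheme, an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

pseudonyms = [pseudonymize(d) for d in dates]

# Equal inputs map to equal pseudonyms, so per-day case counts survive
# even though the dates themselves are no longer visible.
original_counts = sorted(Counter(dates).values())
pseudonym_counts = sorted(Counter(pseudonyms).values())

print(original_counts)
print(pseudonym_counts)
```

The researcher can still see that one day had twice as many cases as the others, without learning which day it was.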

Some laws and regulations treat encrypted values the same as unencrypted values. If you want to get PCI DSS certification of your systems, for example, any systems that process encrypted values need to be evaluated for compliance by a qualified security assessor (QSA), while systems that handle only tokenized values are somewhat easier to certify.

The general philosophy seems to be that the harder it is to recover an original value from an anonymized value, the easier it is to perform a security audit.

Encryption and tokenization are both reversible. If hackers can get the key that was used in the encryption, or the credentials used to authenticate to the detokenization system, they can recover the original data. That is why it is vital to protect encryption keys or the authentication credentials used by the tokenization system, and why the security controls protecting that information may be subject to security audits.

With FPH, on the other hand, an attacker would have to reverse a hash function in order to recover an original value. By design, cryptographic hash functions are computationally infeasible to reverse. There is simply no secret information, either in the form of a cryptographic key or authentication credentials, that will let an attacker reverse FPH.

So, because there is no key or credential for an attacker to steal, the security properties of FPH are arguably better than those of alternatives such as encryption or tokenization.

Why preserve formats?

FPH also preserves the format of the data, making it easier to work with. Data format changes often have unexpected consequences, particularly in complex environments with one or more legacy components. If a given column in a database contains 16-digit credit card numbers, for example, anonymization techniques that result in a different 16-digit number are easier to use than ones that result in a value longer or shorter than 16 digits.
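
To illustrate the point, here is a minimal Python sketch of a format check that a legacy system might apply to such a column. The `fits_column` check is hypothetical, and the card number is a commonly used test value, not a real account.

```python
import hashlib
import re

# Format a hypothetical legacy column expects: exactly 16 ASCII digits.
CARD_FORMAT = re.compile(r"[0-9]{16}")

def fits_column(value: str) -> bool:
    """Return True if the value matches the 16-digit column format."""
    return CARD_FORMAT.fullmatch(value) is not None

card = "4111111111111111"  # test number, used for illustration only

# An ordinary hash produces a 64-character hexadecimal string.
sha_hex = hashlib.sha256(card.encode("ascii")).hexdigest()

print(fits_column(card))     # the original value fits the column
print(fits_column(sha_hex))  # the plain SHA-256 digest does not
```

Any component that validates, stores, or displays the field as 16 digits would reject or mangle the longer hexadecimal value.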

A great story about unexpected problems at a large bank illustrates the point. When the bank upgraded one of its applications (let's call it the "ABC" application), several of its systems mysteriously stopped working. Bank personnel eventually discovered that the problem was due to a tiny format change in the application's login message.

Before the upgrade, when a user or application successfully logged into the application, they were greeted with a "Welcome to Big Bank's ABC application" message. After the upgrade, the login message changed to "Welcome to Big Bank's ABC v2 application." 

Scripts logging into the ABC system did not recognize the new message and decided the login had failed. This trivial change—the addition of "v2" to the login message—locked many applications out of the critical ABC application, dramatically reducing the availability of many of the bank's systems. Millions of dollars were lost.
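
The bank's actual scripts are not described here, so the following Python is only a guess at the kind of check that fails this way: an exact comparison against the expected login message. The function name is hypothetical.

```python
# Banner the hypothetical script was written against.
EXPECTED_BANNER = "Welcome to Big Bank's ABC application"

def login_succeeded(banner: str) -> bool:
    """Brittle check: any change to the banner text reads as a failed login."""
    return banner.strip() == EXPECTED_BANNER

old_banner = "Welcome to Big Bank's ABC application"
new_banner = "Welcome to Big Bank's ABC v2 application"

print(login_succeeded(old_banner))  # matches the expected text
print(login_succeeded(new_banner))  # the added "v2" breaks the match
```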

Most encryption or hashing solutions change the format of the data, making them difficult to use in complex environments that cannot handle even trivial format changes. FPE, however, forces the format of an encrypted value to match that of the original value.

If you apply FPE to a 16-digit value, you get another 16-digit value, and legacy systems handle the encrypted values just as easily as they handle the unencrypted values. FPE is a well-established technique that has been in use for over a decade.

FPH uses a similar approach to FPE, but implements irreversible hashing in a way that preserves the format of the original data. 
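
The general idea can be sketched in a few lines of Python. This is not the construction used by any FPH product, which is not described here; it is a simplified illustration that hashes a digit string with SHA-256 and reduces the result to the same number of digits. The function name `fph_digits_sketch` is an assumption made for this example.

```python
import hashlib

def fph_digits_sketch(value: str) -> str:
    """Hash a string of ASCII digits to another digit string of equal length.

    Illustrative only: an unkeyed SHA-256 digest reduced modulo 10**n.
    """
    if not (value.isascii() and value.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    digest = hashlib.sha256(value.encode("ascii")).digest()
    # Interpret the digest as a large integer and keep n decimal digits.
    reduced = int.from_bytes(digest, "big") % (10 ** len(value))
    # Left-pad with zeros so the output length always matches the input.
    return str(reduced).zfill(len(value))

card = "4111111111111111"  # test number, used for illustration only
hashed = fph_digits_sketch(card)

print(len(hashed), hashed.isdigit())       # same length, still all digits
print(hashed == fph_digits_sketch(card))   # deterministic for equal inputs
```

The output fits a 16-digit column, and there is no key that would turn it back into the input. A simple reduction like this can map different inputs to the same output, which is one reason it should be read as a sketch rather than a production design.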


Do try this at home

FPH is an interesting new approach to anonymizing sensitive data. With the additional element of security provided by hashing, FPH should be able to anonymize data in a way that is acceptable to both researchers and regulators, and easier to implement in the complex legacy environments that are all too common these days.

Have you considered FPH? What's holding you back? Share your experience with FPH in the comments below.