Is your mobile app design putting your data at risk?

Every database administrator worries about leaks, but those supporting mobile apps have to worry just a bit more. Why? Because mobile clients are so thin these days that most programmers store all data in a central database—and that repository becomes a juicy one-stop-shopping target for attackers. Encryption is the primary tool for data protection, but you should also consider user experience when choosing an encryption strategy. The cost of not having a strategy can be enormous.

This nightmare scenario has become a big issue. When executives at medical insurance provider Anthem woke up one morning in February 2015, for example, they found that someone had slipped into the company's database and made off with the personal information of 78 million Anthem customers. The data had been stored in a database that was open to anyone who could get through the outer layers of the insurer's state-of-the-art perimeter security. Unfortunately, as Anthem found out, perimeter security isn't enough. The attacker found a way through it, and all those customer names, birthdays, and Social Security numbers were just sitting there for the taking.

Download 93-Page Report: HPE Cyber Risk Report 2016

Lawrence Lessig, a Harvard University professor of law who studies privacy, says that the matter is more than just hurt feelings and an inchoate sense of violation. “There are real costs to failed privacy,” he explains, and cleanup is expensive. One estimate puts the cost at $240 per customer.

Anthem isn't alone. Hundreds of other companies have discovered similar breaches, and an untold number still don’t know that attackers have broken into their systems. Could you be one of them?

Many companies are paralyzed by fear and don’t know any way to stop the hemorrhaging. Every day, the list of holes in operating systems and firewalls grows longer, and only the willfully naive can possibly believe that the industry is close to patching all of them. You need to take action. Here's what you can do.

Scramble data without inconveniencing users

If you can’t build a strong enough outer perimeter to guard your information, you should store the information in a way that’s not useful to anyone who breaks in. Many programmers assume that in order to do useful computation with the data, it must be stored in a format that anyone can easily read. But you can use clever mathematics to block even the most determined thief while still allowing company employees to serve customers.

Finding a solution to this data security issue is even more important today because mobile devices routinely store more of employees' data in the cloud. In order to feel more secure about mobile devices and people storing information remotely, you need to trust that the cloud can protect your data once it leaves the devices.

The first step is to recognize that it’s not essential to have all of your data available in order for users to make decisions based on that data. Nonetheless, many developers are pack rats, storing every bit of information in databases and logs, just in case it’s useful later. But keeping this kind of data around just creates a tempting target for criminals.

The trick is to scramble or encrypt the data, and only store the scrambled version, not the original. This makes the data unreadable but still useful to those who know how to query it in the right way.

You can choose from several techniques that will let your database do useful work without having any useful data inside of it. I covered some of these in Translucent Databases, a book I wrote back in 2009, and it still applies today. Since then, the field has only grown, as more researchers have contributed solutions related to using database information without relying on access to the real underlying data.

Consider the SHA-3 option

The simplest example may be looking for matches in a database. Many programmers know the technique of using a cryptographic hash function to scramble a password because the technique was used in Unix back in the 1970s. Approaches have grown more sophisticated and elaborate since then. Today, many experts use the latest SHA-3 standard, but the idea remains the same. The hash function serves as a mathematical blender that takes a string and turns it into a scrambled number in a way that, practically speaking, should be impossible to reverse. (I say “should be impossible” because no one has yet publicly described how to reverse it.)

Although the data is scrambled, it isn’t unusable. The hash function may block you from reading the data by reversing the hash function, but you can search for matches because if x=y, then SHA(x)=SHA(y). So, instead of storing the user’s name, password, Social Security number, or other sensitive value, the database can store the hashed version: SHA(name), SHA(password), etc. When it’s time to look for a particular user, instead of writing a query for the name, you can use SHA(name) in the query.
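Here is a minimal sketch of that lookup in Python, using the standard library's hashlib.sha3_256 and an in-memory SQLite table. The table name, column names, and sample user are hypothetical and chosen only for illustration.

```python
import hashlib
import sqlite3


def sha(value: str) -> str:
    """Return the SHA-3 (256-bit) digest of a string as a hex string."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


# Hypothetical schema: the table stores only SHA(name), never the name itself.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name_hash TEXT PRIMARY KEY, plan TEXT)")
db.execute("INSERT INTO users VALUES (?, ?)", (sha("Alice Example"), "premium"))


def find_plan(name: str):
    """Look a user up by hashing the name and matching on the stored hash."""
    row = db.execute(
        "SELECT plan FROM users WHERE name_hash = ?", (sha(name),)
    ).fetchone()
    return row[0] if row else None
```

Note that the match is exact: a different capitalization or a stray space produces a different hash, so an application built this way would probably want to normalize names before hashing them.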

This approach blunts many attacks that begin with the attacker gaining root-level access to the database, because the most secret core of the database doesn’t contain the values in the clear. A column of hashed Social Security numbers, for instance, can’t simply be read off. When someone types in a password, it is hashed first and then compared against the table of hashed passwords. If it matches, the user typed in the right password. But if the table is revealed, the security of the hash function prevents anyone from working backward to figure out the original password. The same technique works for Social Security numbers, names, or other sensitive data, with one caveat: a value drawn from a small, guessable set can still be found by hashing every candidate, a weakness covered in the tradeoffs below.
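The password check described above can be sketched as follows. The user ID and the in-memory table are hypothetical. The sketch follows the plain hash-and-compare idea from the text; real deployments generally also add a per-user salt and a deliberately slow function such as hashlib.scrypt, which is beyond what is shown here.

```python
import hashlib
import hmac


def hash_password(password: str) -> bytes:
    """Scramble a password with SHA-3 so the original is never stored."""
    return hashlib.sha3_256(password.encode("utf-8")).digest()


# Hypothetical table of hashed passwords; only the digests are kept.
password_table = {"user-1": hash_password("correct horse")}


def check_password(user_id: str, attempt: str) -> bool:
    """Hash the attempt and compare it against the stored hash."""
    stored = password_table.get(user_id)
    if stored is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(stored, hash_password(attempt))
```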

Don't scramble everything

You don't need to use this technique to hide everything. One common reason for storing large amounts of personal data is to help the marketing department provide customized services for the user. Sales transactions can also help the business reorder and plan for the future. This approach still works if you scramble only the sensitive columns and leave the others in the clear.

Imagine a table of product orders that includes fields for name, address, ZIP code, color, shape, and size. If the name and address data are scrambled, the database can’t help an attacker who grabs a copy of the database figure out who ordered what. But if the columns with the ZIP code, color, shape, and size are left unscrambled, the marketing department can still analyze the data by using complex algorithms to figure out which products and which colors sold well in which ZIP codes.

 

SHA(name)        ZIP code  SKU       Color  Size
7a9c8d9d938a...  12110     50001322  Blue   XL
7a9c8d9d938a...  12110     50001222  Blue   XL
411ac788d90d...  12110     50001222  Blue   L
009129310231...  21210     41123123  Green  XL
ace434123123...  21210     77734242  Green  S
11231341245a...  21210     50001222  Green  S

This database table hides customer names by hashing them but leaves plenty of information for data analysis. It's easy to see that blue is a popular color in ZIP code 12110. But, while no one can invert SHA(name) to discover who ordered blue clothes, an internal user's application can compute SHA(name) and present it to look up past orders. 
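A small Python sketch shows both uses side by side: analysis on the columns left in the clear, and lookup by recomputed hash. The rows below are made up for illustration (the hashes in the table above are truncated, so they are not reused), and the field and function names are assumptions.

```python
import hashlib
from collections import Counter


def sha(value: str) -> str:
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


# Hypothetical order rows: the name is hashed, the rest is left in the clear.
orders = [
    {"name_hash": sha("Alice Example"), "zip": "12110", "color": "Blue", "size": "XL"},
    {"name_hash": sha("Alice Example"), "zip": "12110", "color": "Blue", "size": "XL"},
    {"name_hash": sha("Carol Example"), "zip": "12110", "color": "Blue", "size": "L"},
    {"name_hash": sha("Dan Example"), "zip": "21210", "color": "Green", "size": "XL"},
]


def colors_by_zip(rows):
    """Marketing-style analysis that never touches a customer name."""
    return Counter((row["zip"], row["color"]) for row in rows)


def past_orders(rows, name):
    """An internal application recomputes SHA(name) to find a customer's orders."""
    target = sha(name)
    return [row for row in rows if row["name_hash"] == target]
```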

To scramble or not? Understanding the tradeoffs

There are several tradeoffs every developer should keep in mind when implementing a system like this. In general, more sophisticated scrambling algorithms can add more security at the cost of making the database less forgiving.

In the example of the table full of orders, scrambling the name by itself prevents someone from scanning the column and reading the names, but it doesn’t prevent an attacker from looking up someone they know. If you’re looking for your neighbor Bob, for instance, you can compute SHA(“Bob”) and find that row.
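This weakness is easy to demonstrate. In the sketch below, the leaked column and the candidate names are hypothetical; the point is only that anyone who can guess a name can test the guess against the hashes.

```python
import hashlib


def sha(value: str) -> str:
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


# Hypothetical leaked column of hashed names.
leaked_column = {sha("Bob"), sha("Alice"), sha("Carol")}


def names_present(leaked_hashes, candidate_names):
    """Hash each guess and report the ones that appear in the leaked column."""
    return [name for name in candidate_names if sha(name) in leaked_hashes]
```

Calling names_present(leaked_column, ["Bob", "Mallory"]) confirms that Bob is in the table without ever reversing the hash function.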

One solution is to add a password to the mix by hashing the name and a password together and storing the result: SHA(name+password). This value can’t be guessed by someone who knows only the name, which blocks attackers from looking for the records of a particular person. The system is now a bit more brittle, though, because the records will be lost if anyone forgets a password. That may be acceptable for low-value, ephemeral data such as blog posts and comments. For more serious domains, like medical records, you can use more elaborate mechanisms that will let you reset or recover the password.
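One possible way to build SHA(name+password) in Python is sketched below. The zero-byte separator between the two parts is an assumption added here so that different name and password pairs cannot run together into the same input; the function name and sample records are hypothetical.

```python
import hashlib


def keyed_id(name: str, password: str) -> str:
    """Compute SHA(name+password) as the row identifier."""
    # Assumed detail: a zero byte separates the parts, so ("ab", "c") and
    # ("a", "bc") do not produce the same input to the hash.
    material = name.encode("utf-8") + b"\x00" + password.encode("utf-8")
    return hashlib.sha3_256(material).hexdigest()


# Hypothetical store keyed by SHA(name+password).
records = {keyed_id("Bob", "s3cret"): ["comment 1", "comment 2"]}


def lookup(name: str, password: str):
    """Return the records only when both the name and the password match."""
    return records.get(keyed_id(name, password))
```

Here lookup("Bob", "s3cret") finds the records, while the name alone, or the name with the wrong password, finds nothing. That is also the brittleness described above: without the password, the row can no longer be located.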

As the Internet takes over more responsibilities and offers more lightweight services, you'll find more opportunities to use a basic architecture like this to protect personal information. It wouldn’t be appropriate to scramble the names and personal data of, say, the income tax database at the IRS, or your bank account, but there are plenty of websites whose data can vanish without too much trouble. If some social media posts or chat logs were lost because a password was forgotten, there wouldn’t be many repercussions. Indeed, sites like Snapchat promote the fact that their data is ephemeral.

More advanced options offer flexibility

Hashing the names or identity numbers of users is just the beginning. You can use several other techniques to encrypt information in different ways, offering protections against different types of attacks. Some of these techniques blur the information just enough to be safe: adding a bit of random data to a stored location, for example, keeps the rough position available but hides the exact one.
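As a rough illustration of the blurring idea, the sketch below adds a small random offset to a latitude and longitude before storage. The function name, the default offset of 0.01 degrees (about 1.1 km of latitude), and the uniform noise are all assumptions made for this example.

```python
import random


def blur_location(lat: float, lon: float, max_offset_deg: float = 0.01, rng=random):
    """Return the position shifted by a random offset in each direction.

    Only the blurred value should be stored; the exact position is discarded.
    """
    blurred_lat = lat + rng.uniform(-max_offset_deg, max_offset_deg)
    blurred_lon = lon + rng.uniform(-max_offset_deg, max_offset_deg)
    return blurred_lat, blurred_lon
```

Blurring once and storing the result matters: if the same exact point were blurred again on every request, an observer could average many answers and recover the original position.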

Others reveal one subset of the data to one group and another subset, protected with more complex encryption, to a different group. One approach requires two or three people to agree before unlocking the data. That process prevents any single person from fishing through the data, although it can’t stop the whole group from acting together. You can find many case study examples in my book.
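One simple way to require several people to agree, though not necessarily the mechanism used in the book, is to split a key into random shares that must all be combined with XOR. The function names below are hypothetical, and this sketch needs every share; a scheme that tolerates a missing party would need something more elaborate.

```python
import secrets
from typing import List


def split_key(key: bytes, parties: int) -> List[bytes]:
    """Split a key into shares; all of them are needed to rebuild it."""
    if parties < 2:
        raise ValueError("need at least two parties")
    # All but the last share are pure random bytes.
    shares = [secrets.token_bytes(len(key)) for _ in range(parties - 1)]
    last = key
    for share in shares:
        last = bytes(a ^ b for a, b in zip(last, share))
    # The last share is the key XORed with every random share.
    shares.append(last)
    return shares


def combine_shares(shares: List[bytes]) -> bytes:
    """XOR every share together to recover the original key."""
    key = bytes(len(shares[0]))
    for share in shares:
        key = bytes(a ^ b for a, b in zip(key, share))
    return key
```

Any incomplete set of shares looks like random bytes, so no single holder can unlock the data alone.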

Perimeter breaches are inevitable. Be prepared.

Many systems engineers assume that their data must be stored in a clearly readable form for easy access, and so they concentrate on building an elaborate perimeter that only lets in the right people. That's a false assumption, and the perimeter defense doesn't always work. Firewalls and supposedly secure versions of operating systems have failed time and again to block attackers. Your best solution is to use encryption from the beginning, before the data leaves the mobile device or the desktop. The central computer will never have access to the underlying information if you scramble it before it ever leaves the smartphone.
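To make the idea concrete, here is a hedged sketch of a client that scrambles sensitive fields before anything is sent. The field names, the "sha_" prefix, and the JSON payload format are assumptions for illustration, not a description of any particular app or server API.

```python
import hashlib
import json

# Hypothetical choice of which fields count as sensitive.
SENSITIVE_FIELDS = ("name", "address")


def scramble_for_upload(record: dict) -> str:
    """Build the payload on the device; sensitive values leave only as hashes."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hashlib.sha3_256(value.encode("utf-8")).hexdigest()
            out["sha_" + field] = digest
        else:
            out[field] = value
    return json.dumps(out, sort_keys=True)
```

Because the hashing happens before the upload, the central database only ever receives the scrambled values for those fields.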

“We’ve been trapped by an idiot tradeoff between privacy and security,” laments Lessig. “We’re now beginning to see the mathematical machines that will give us privacy with security. That will give people what they want, which is usually a good business model for commerce.”

Topics: Mobile, Security