How to secure data lakes: What you don't know can hurt you

An enterprise data lake is a great option for warehousing data from different sources for analytics or other purposes, but securing data lakes can be a big challenge.

Unlike purpose-built data stores and database management systems, in a data lake you dump data in its original format, often on the premise that you'll eventually use it somehow. The data typically is unmanaged and available to anyone across the enterprise.

Some organizations are using data lakes as an archive for their data, some as a landing zone for data from different sources, and some as a sandbox for data scientists to play in, said Gartner analyst Merv Adrian. A lot of these lakes are moving to the cloud because of the cheap storage available from vendors such as Amazon, Microsoft, Google, and Alibaba, he said.

You put raw data into a lake and then later do some indexing, aggregation, and analysis to see what value you can extract from it.

"The working idea is you start trying out ideas and building models with the data. If something works, you might put it into a DBMS or relational database management system. You don't leave it in the lake."
Merv Adrian

You cannot afford to overlook the security issues that arise when warehousing all this data, especially from a regulatory compliance standpoint. Experts dive into the best approaches for securing your data lake. 


Have a comprehensive data lake management strategy

To secure a data lake, you need to have a holistic understanding of the data usage, planned applications, governance requirements across those applications, and specific levels of security and access control stemming from those requirements, said Doug Henschen, principal analyst at Constellation Research.

For starters, your data lake might not necessarily even be a single Hadoop or Spark cluster anymore. Companies are maintaining and in some cases adding separate data lakes for different purposes, he said. Some, for instance, have separate lakes for ongoing production workloads and others for data science.

The manner in which organizations are deploying data lakes is also changing. Low-cost object storage options such as Amazon S3 and Microsoft's Azure Blob Storage are pushing many organizations to deploy their data lakes in the cloud. For enterprises, these services are an attractive, low-cost option for archival data they might want to store as part of a data lake.

"With this growing array of options, the challenge in data lake management is ensuring not just comprehensive management across all these stores, but also authorized access and data governance," Henschen said.

Organizations need to keep in mind that the comprehensive management and security required is often beyond the capabilities of a single platform.

"The security and governance capabilities of individual software distributions don’t always meet all the requirements for granular access control and emerging governance requirements."
Doug Henschen

Data lineage is one example. When considering data lake protections, it is important to know where data originated, how it was altered or enriched, and who touched it throughout the lifecycle of its use in decision making.
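One lightweight way to make lineage concrete is to record provenance metadata alongside every transformation. The sketch below is a minimal, illustrative in-memory lineage record; the field names and the `lineage_record` helper are assumptions, not any particular product's API.

```python
import hashlib
import json
from datetime import datetime, timezone


def lineage_record(dataset_id, source, operation, actor, payload):
    """Build a lineage entry: where the data came from, what was done,
    who did it, and a content hash to detect later tampering."""
    return {
        "dataset_id": dataset_id,
        "source": source,           # originating system
        "operation": operation,     # e.g. "ingest", "enrich", "mask"
        "actor": actor,             # user or service that touched the data
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "content_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
    }


record = lineage_record(
    dataset_id="sales_2018_q1",
    source="crm_export",
    operation="ingest",
    actor="etl_service",
    payload={"rows": 12000},
)
```

Appending records like this at every stage gives you an auditable trail of who touched the data and how it changed throughout its lifecycle.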

Understand the data lake pipeline

When you throw data into a data lake, realize that you don't have many of the protections available with an enterprise database or relational database management system, Gartner's Adrian said.

"The key is the new stuff doesn't have the benefits that we expected from the old stuff."
Merv Adrian

With traditional database management systems, the information security team might handle all the network security and access control protections but do little with the data once it enters the database management system. Data lake structures do not come with all of the governance capabilities and policies associated with a traditional database management system, from basic referential integrity to role-based access and separation of duties, Adrian said.

One way to approach data lake security is to think of it more as a sort of a pipeline with upstream, midstream, and downstream components, said Adrian. The threat vectors associated with each stage are somewhat different and need to be addressed differently.

Get up to speed on the upstream component

The upstream component is the point where you are ingesting the data going into the data lake. Go as far upstream as you possibly can and consider what threat remediation measures you can apply at those points. Consider basic things such as data classification, who owns the data, and who cares about it. Does it come with any compliance obligations and reporting requirements? Think GDPR (the EU's General Data Protection Regulation), for instance, he said. Consider what policies—such as encryption or role-based data masking—you may need to apply to the data to protect it at the ingestion point, he said.
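As a rough sketch of what classification and masking at the ingestion point can look like, the snippet below tags fields with a sensitivity label and redacts matches before the record lands in the lake. The regex patterns and field names are illustrative assumptions; a real deployment would draw rules from a data-classification catalog rather than hard-coded patterns.

```python
import re

# Illustrative classification rules only; real systems would use a
# managed data-classification catalog, not hard-coded regexes.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify_and_mask(record):
    """Tag each field with a sensitivity label and mask PII before
    the record is written to the data lake."""
    out, labels = {}, {}
    for field, value in record.items():
        label = "public"
        for kind, pattern in PII_PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                label = "pii"
                value = pattern.sub("***REDACTED***", value)
        out[field] = value
        labels[field] = label
    return out, labels


masked, labels = classify_and_mask(
    {"name": "A. Smith", "contact": "a.smith@example.com"}
)
```

The labels can then drive downstream policies such as encryption or role-based masking for anything tagged as PII.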

Make the midstream matter

The midstream component of the pipeline is where the actual processing takes place, Adrian said. One of the concerns here is an attack on the processing infrastructure. SQL injection attacks, for instance, are one threat. Remediation includes permissions management and workflow management to limit the ability of people to see, access, and interact with the data.
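The standard remediation for SQL injection is parameterized queries, which bind user input as data rather than letting it rewrite the query. A minimal demonstration using Python's built-in sqlite3 module (table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.execute("INSERT INTO events VALUES ('alice', 'login')")

user_input = "alice' OR '1'='1"  # a classic injection attempt

# The ? placeholder binds the input as a literal value, never as SQL,
# so the injection attempt simply matches no rows.
rows = conn.execute(
    "SELECT action FROM events WHERE user = ?", (user_input,)
).fetchall()

legit = conn.execute(
    "SELECT action FROM events WHERE user = ?", ("alice",)
).fetchall()
```

Had the query been built by string concatenation instead, the `OR '1'='1'` clause would have returned every row in the table.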

Consider the data output as well. When you are combining and manipulating data from multiple independent sources, the output can sometimes be unexpected. A person with the clearance to access one class of data might not be authorized to view the output when that data is combined with a brand-new data source. Having an understanding of what is coming out when you process the data is important, he said. Using measures such as format-preserving data encryption can be a big help when you do things such as data aggregation and analysis.
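True format-preserving encryption requires a dedicated FPE cipher, but the underlying idea of aggregating without exposing raw values can be sketched with a simpler stand-in: deterministic keyed tokenization. Equal inputs map to equal tokens, so datasets from independent sources can still be joined on the tokenized column. The key and card numbers below are illustrative only.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; use a managed key in practice


def tokenize(value: str) -> str:
    """Deterministic keyed token: equal inputs yield equal tokens, so
    joins and aggregations still work on the tokenized column without
    exposing the raw identifier."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


cards_a = ["4111-1111-1111-1111", "5500-0000-0000-0004"]
cards_b = ["5500-0000-0000-0004"]

tokens_a = {tokenize(c) for c in cards_a}
tokens_b = {tokenize(c) for c in cards_b}
overlap = tokens_a & tokens_b  # join on tokens, not raw card numbers
```

Note that this stand-in does not preserve the original format the way real FPE does; it only preserves joinability.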

Understand the journey on the downstream

The downstream component, where consumption of output occurs, is where measures for monitoring and remediation are important. Use controls such as user entity behavior analytics to make sure that people are using the data in a secure and appropriate manner, Adrian said.
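Commercial UEBA tools model behavior far more richly, but the core idea of baselining each user against their own history can be sketched in a few lines. The threshold and log format below are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_anomalies(access_log, threshold=3.0):
    """Flag users whose latest daily access count deviates sharply
    from their own history (a toy stand-in for UEBA baselining)."""
    by_user = defaultdict(list)
    for user, daily_count in access_log:
        by_user[user].append(daily_count)

    flagged = []
    for user, counts in by_user.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < 2:
            continue  # not enough history for a baseline
        mu, sigma = mean(history), pstdev(history)
        if sigma and (latest - mu) / sigma > threshold:
            flagged.append(user)
    return flagged


log = [("bob", 10), ("bob", 12), ("bob", 11), ("bob", 500),
       ("eve", 5), ("eve", 6), ("eve", 7)]
suspects = flag_anomalies(log)
```

A user who suddenly pulls hundreds of records after months of modest access is exactly the kind of signal this sort of baselining surfaces.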

John Felahi, vice president of product at Podium Data, said that understanding the data's full journey is key.

"Big data moving through a modern enterprise travels a long, complex journey from the moment it’s produced or acquired, through innumerable interim preparation or storage stops, to its final point for consumption by business users or analysts. Everywhere along this path security, governance, and enterprise-grade management practices are essential."
John Felahi

Get on top of security with stream-fed data lakes

The type of protections you need depends on whether your data lake is stream-fed or spring-fed, said Tim Negris, senior vice president of product management and strategy at predictive analytics platform vendor Rulex Analytics.

Stream-fed data lakes are filled from the top with business reference data for customers and suppliers, as well as data for purchases, sales, and other transactions flowing from business-facing applications. Data in these lakes can include credit card numbers, personally identifiable information, and business activity data that can be of high interest to thieves because of its inherent value. Often, when such data is thrown into a data lake, it is far more vulnerable to security threats than when it was in the business data systems from which it was copied.

Regulations such as GDPR also heavily affect this kind of data lake by restricting the retention time and speculative use of the data. Organizations will not be allowed to keep filling the lake with more and more consumer data just for the purpose of searching for actionable patterns in the data.
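Enforcing a retention limit can start with something as simple as checking each record's age against a policy window and purging or anonymizing what has expired. The 365-day period below is an illustrative assumption, not a GDPR-mandated figure.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # illustrative policy window


def expired(record_timestamp, now=None):
    """True if a consumer record has outlived its retention window
    and should be purged (or anonymized) from the lake."""
    now = now or datetime.now(timezone.utc)
    return now - record_timestamp > RETENTION


now = datetime(2018, 6, 1, tzinfo=timezone.utc)
old = datetime(2016, 1, 1, tzinfo=timezone.utc)
fresh = datetime(2018, 5, 1, tzinfo=timezone.utc)
```

Running a sweep like this on a schedule keeps the lake from accumulating consumer data past its permitted lifetime.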

"If a data lake violating this restriction were hacked and the owner prosecuted because of poor security, the additional violation in terms of use would add to the fines considerably."
Tim Negris

Security for stream-fed data lakes needs to be handled the same way you would handle security for enterprise database systems, Negris said. That means implementing controls such as data encryption, user authentication, and role-based access control and security. Plenty of tools are readily available for doing this for data lakes built on data management solutions from the major vendors, such as IBM and Oracle, he added.
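At its core, role-based access control is a mapping from roles to permissions with a check at every access point. The roles and permission strings below are illustrative; in practice the mapping would come from a central policy store such as Apache Ranger or LDAP groups rather than application code.

```python
# Illustrative role-to-permission mapping; a real system would pull
# this from a central policy store (e.g. Apache Ranger, LDAP groups).
ROLE_PERMISSIONS = {
    "analyst": {"read:aggregates"},
    "data_engineer": {"read:aggregates", "read:raw", "write:raw"},
    "auditor": {"read:aggregates", "read:audit_log"},
}


def can_access(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unlisted permissions fail."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

The deny-by-default shape matters: an unrecognized role gets an empty permission set rather than an error path that might be mishandled.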

It is a little harder to implement the same protections for data lakes built from open-source components. Many of the tools are standalone products that need to be integrated and synchronized for full-stack security, Negris said. "To the casual denizens of the data lake, this may seem like authoritarian overkill, but is the only way to harden and protect the data."

Spring-fed data lakes are filled from the bottom with event and signal data from operational applications such as ERP, IoT, and SCADA systems. The data itself is not the concern, because it has little value in and of itself and often doesn't have any meaning outside of a particular context, Negris said. The security threat is more about people intercepting the data flowing into the lake, or eavesdropping on it to learn what the data is, where it might be coming from, and any relationships that might exist within it. Threats of concern include terrorism, such as taking down a power grid; industrial espionage, such as stealing process trade secrets; and insider sabotage.

With spring-fed lakes, in addition to protecting the data itself, consider techniques from the world of physical asset security, such as raising an alarm when an edge device's IP address changes or when a new compute device or data-transfer process appears on the network, in order to catch analytical eavesdropping or data diversion.
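That kind of alarm can start as a simple check of observed devices against a known-good registry. The device IDs and IP addresses below are purely illustrative.

```python
# Known-good registry of edge devices; values are illustrative.
known_devices = {
    "sensor-014": "10.0.4.14",
    "plc-007": "10.0.7.7",
}


def check_device(device_id, observed_ip, registry):
    """Return an alert string when a device's IP address changes or an
    unknown device appears on the ingestion network; None when OK."""
    expected = registry.get(device_id)
    if expected is None:
        return f"ALERT: unknown device {device_id} at {observed_ip}"
    if observed_ip != expected:
        return f"ALERT: {device_id} moved from {expected} to {observed_ip}"
    return None


alerts = [a for a in (
    check_device("sensor-014", "10.0.4.14", known_devices),
    check_device("plc-007", "192.168.1.50", known_devices),
    check_device("rogue-box", "10.0.9.9", known_devices),
) if a]
```

Feeding such alerts into the same monitoring pipeline as the lake's access logs helps surface data diversion attempts at the ingestion edge.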

"This will go a long way to preventing the kinds of threats and risks common to data lakes associated with operational applications and IoT systems."
Tim Negris

Share your team's best practices for securing data lakes in the comments section below.
