How to use containers to take on hybrid-cloud data

For a multitude of reasons, some companies won't, or can't, move their applications and data wholesale to the cloud. For them, a hybrid cloud approach is the only way to go. Containers can help these companies manage data in a hybrid cloud setup in ways that other approaches cannot.

Containers have increasingly become part of the hybrid toolkit because they yield many of the same benefits as cloud, while providing deployment flexibility. If you use container-based data lakes as part of your hybrid mix, deployment can be much easier. 

Still, containers can be a challenge, for big data in particular. Here are some pointers about the data-lake approach, and benefits and downsides associated with it.


Why hybrid?

Moving workloads from traditional on-premises systems to the cloud is a big decision for most enterprises. Concerns center around control over systems, their costs, availability, end-to-end performance, and overall security. Even after these are addressed, it's not practical to move every application to the cloud overnight.

Therefore, many companies adopt a hybrid approach, with some apps and data in the public cloud, and others remaining on-premises. And even if your hybrid cloud is meant to be a temporary situation, there is one good reason to stick with it for the long haul: data.

Data in the hybrid cloud

Moving data to the public cloud can be extremely risky and expensive. The first reason is data gravity: the tendency for data to stay where it currently resides. Contributing factors include cost (you need to provision double the capacity while you move it), transfer speed, replication during the transfer, and concerns about data loss. This is especially true of big data, where huge amounts of data are at stake.

But perhaps the biggest risk is that there's so much value to data—to enterprises that collect and store it, and to their employees and customers—that many organizations just aren't willing to push it all to the cloud. The risks of data theft and loss are simply too great. 

Additionally, for many industries, regulatory requirements and local laws impose restrictions and requirements, and may even prohibit outright the migration of data to the cloud. These include Health Insurance Portability and Accountability Act (HIPAA) compliance rules for healthcare firms, PCI DSS rules for firms that handle payment card data, FISMA restrictions for federal agencies and contractors, SOX compliance for corporate financial data in the cloud, and so on.

Consider also that laws vary by country. In many cases, while this doesn't restrict use of cloud services, laws may require that certain types of data (e.g., human resources data) be stored on servers that reside within the country of origin. Pushing that data to the cloud, where it may be stored, replicated, or even backed up to cloud servers in a different country, may leave you in violation.

Data in on-premises lakes

The issues discussed above may lead many to create an on-premises data lake. This is different from just database storage, a data warehouse, or a single repository. A data lake includes structured and unstructured data derived from applications, devices, social media, user feedback, system reports, users, and so on.

Keeping a data lake on-premises lends itself readily to real-time analytics, data mining, and application integration, all of which can lead to improved operational efficiencies. But there are reasons to move some or all of your data to the cloud, if you can.

Cloudbursting

As great as on-premises data is, some businesses may choose a hybrid approach for their data storage. One reason is to implement a cloudbursting solution, in which sudden surges in data collection send data up to the cloud when on-premises capacity becomes constrained. Whether this is a temporary or permanent storage arrangement is a business decision.

Internet of Things applications are a prime example of those that could benefit: Varying amounts of data arrive from growing numbers of sensors and mobile devices, which can quickly consume on-premises storage.

Hybrid cloud data storage may also be driven by a cloud-based backup strategy for on-premises data, a disaster recovery plan, or cost-savings initiatives.

How containers can help

As laws and regulations change, management philosophies shift, and technology continues its advance, how do you hedge today's on-premises data storage? In most cases, the answer is a form of container-based architecture. Whether you use Docker, Mesos, Kubernetes, or even a virtualization-based approach, the benefits apply.

Containers often equate to agility, but they also increase portability. By building out services and data stores within containers, you can more easily move them all—or some of them at a time as part of your migration strategy—to the public cloud.
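That portability starts with how the image is built. As a minimal, hypothetical sketch (the file names are illustrative assumptions, not from this article), a data service can be packaged so the identical image runs on-premises or in any cloud that runs containers:

```dockerfile
# Hypothetical sketch: package a data service as a portable image.
# The tested schema is baked into the image; the data itself stays in a
# volume mounted at deploy time, so the image remains portable.
FROM postgres:16
# The official postgres image runs any SQL placed in this directory
# on first startup, initializing the database with a proven schema.
COPY schema.sql /docker-entrypoint-initdb.d/
```

Because nothing environment-specific is baked in, migrating the service means re-pointing the deployment at new infrastructure, not rebuilding the application.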

Containers also provide flexibility: you can maintain a consistent architecture across all your on-premises and cloud applications while still customizing rollouts for particular geographic regions.

So, for instance, you can embed an SQL-based relational database within a Docker container, using Kubernetes or another orchestration tool to attach storage in a decoupled manner. The container can then be deployed to production almost instantly, bringing with it a proven database schema and tested code.
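As a minimal sketch of that pattern (the names and sizes below are illustrative assumptions), a Kubernetes manifest can run a relational database in a container while claiming storage through a PersistentVolumeClaim, keeping the data decoupled from the container's lifecycle:

```yaml
# Hypothetical example: a database container with decoupled storage.
# The claim requests storage abstractly; where it is physically
# provisioned is decided by the cluster, not the application container.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data          # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi             # illustrative size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-db
spec:
  replicas: 1
  selector:
    matchLabels: { app: orders-db }
  template:
    metadata:
      labels: { app: orders-db }
    spec:
      containers:
      - name: postgres
        image: postgres:16      # proven schema and tested code ship in the image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: orders-db-data
```

Because the claim, not the pod, owns the storage, the same manifest deploys unchanged whether the volume is backed by an on-premises array or a cloud block store.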

This is where you have some choices to make in terms of architecture. You can:

  • Deploy both the container and the physical storage backing it on-premises.
  • Deploy the container on-premises, associated with cloud-based storage, or vice versa.
  • Deploy the container to the cloud via one of the container cloud service providers.
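In Kubernetes terms, these choices often come down to which storage class backs the claim. A hedged sketch (the class names are assumptions; every cluster defines its own):

```yaml
# Hypothetical sketch: the same claim, pointed at different backends.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: on-prem-san   # swap for a provider-backed class
                                  # (e.g., a cloud block-storage class) to
                                  # relocate the data without touching the app
  resources:
    requests:
      storage: 100Gi
```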

In this way, containers offer you the ultimate in data portability.

Another plus is that containers enable agility and future innovation for existing applications. The healthcare industry, for one, is looking to containers to speed innovation, agility, DevOps adoption, and cloud-like approaches to application development. Other container design patterns can help your IT strategy overall, for both applications and data.

Container challenges

No solution is perfect, and there are challenges to a big data cloud-container approach. These include:

  • Data volumes that outgrow a container's resource quota
  • The load on a container-management system that spins up new instances of a container with large volumes of data, and the cost of the physical storage required for each new instance
  • The effect containers can have on database replication and clustering implementations, such as the additional compute and communication overhead required per instance
  • Impacts on persistent storage versus in-memory storage and data caching, potentially consuming large amounts of physical memory within the container's host system
  • The effects of container-driven OS dependencies on storage choices; this can limit the host deployment options for your containers
  • A negative performance impact due to a higher dependency on network communications across distributed container deployments 
  • The possibility of exposing critical data to poor container security choices and implementations
  • The difficulty and time required to orchestrate the deployment of container storage dependencies across applications
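Some of these challenges can at least be made explicit in configuration. As a hedged sketch (the values are illustrative assumptions), Kubernetes lets you cap a data container's resources so that a workload outgrowing its quota fails visibly rather than degrading the host:

```yaml
# Hypothetical sketch: explicit resource caps for a data container.
apiVersion: v1
kind: Pod
metadata:
  name: analytics-db
spec:
  containers:
  - name: db
    image: postgres:16
    resources:
      requests:
        memory: "4Gi"              # in-memory caching consumes host RAM
        cpu: "2"
      limits:
        memory: "8Gi"              # hard cap; the container is killed beyond this
        cpu: "4"
        ephemeral-storage: "10Gi"  # cap on scratch space inside the container
```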

Fortunately, there are general architectural solutions to these issues, and cloud vendors offer strategies and products that combine containers with big data. These include federated solutions, hybrid data lakes, and other hybrid cloud and on-premises data offerings.

Use hybrid data for the right reasons

Regardless of industry, hybrid cloud architecture is here to stay. Regulated industries are especially driven to hybrid, and many are leveraging it to gain competitive advantage through innovation, digital transformation, and modernization.

The keys are managing your hybrid cloud properly: ensure consistency through a container-based approach, enforce security and governance, define clear service-level agreements, and make full use of DevOps.