Why you should share BI resources with your data science teams

Mike Perrow Technology Evangelist, Vertica
 

Business intelligence (BI) has evolved from the time the term was coined in 1865 to today's complex of data sources, databases, and reporting technologies that use visualization.

But the essential idea behind BI hasn't changed much in the ensuing 150 years or so: Learn what the data is telling you about your business activity, and either put course correction in place or pump more resources into what's working well, before the competition learns what you're doing.

This is what has come to be known as "data-driven," a term business folks like to use because it says to their customers, "We are fact-based"—i.e., we look at the truth, and help you make sound decisions.

The recent rise of big data technology—along with public clouds that make massive data storage more affordable than ever—has pushed data-driven techniques into the limelight of business strategy. The focus on data is evident practically everywhere you look, from business school curricula to the recommendations from the most frequently consulted business analysts.

But big data, or more specifically "data science," is not the same thing as BI. Running reports and using visualization to see trends in last week's transaction records is the province of BI. You're looking to see if your plans are on track or not, and you know what you’re looking for.

Data science, on the other hand, is about exploring much larger (usually) datasets to discover patterns in data that you hadn't previously anticipated.

All too frequently, businesses that have been doing some form of BI for 20 or more years don't have ample resources to support big data teams that want to use the same servers and cloud instances to run their projects. At least that's what BI teams often claim. But there's a new way around this.

The common profile of business analytics

BI teams are composed of data specialists, often an extension of the core IT organization. Like the rest of IT, they are charged with running a critical part of business operations. They rely on compute and data storage resources that they manage with an eye on 1) holding down bottom-line expenditure and 2) meeting service-level agreements (SLAs) that they've forged with various business units across the company.

In fact, it's not uncommon for BI teams to face a loss of compensation if they fail to meet SLAs. That's one reason BI teams take their work very seriously.

BI usually relies heavily on a data warehouse, which is a single location for regular uploads of operational data for reporting. These data stores can become massive, so when the public cloud became popular, many BI teams took advantage of what initially appeared to be more affordable storage.

They could store more data, use cold storage and hot storage capabilities priced differently depending on frequency of access, and continue running their regular BI analytics and meeting their SLAs. And while the cloud soon proved to be just about as costly as maintaining on-premises resources—cloud vendors tend to charge considerably more for compute compared to storage—the cloud does allow fast provisioning of resources on demand.

Today, a majority of businesses that use the cloud buy services from multiple vendors, and many BI teams use an evolved mix of cloud and on-premises sources for their data warehouse needs.

Data scientists have a slightly different agenda

As BI teams were busy securing their clouds and hybrid resources to support their regular operations, data scientists got to work on corralling massive datasets, something they had been dreaming about for more than a decade. 

Data scientists had been working out complex algorithms that could map useful trends, which could lead a business to cool new opportunities. All they needed were the vast quantities of data that would make machine learning worth the effort. The cloud and Apache Hadoop storage put that within reach.

Fast forward to the immediate present. No longer simply data repositories, many of today's data warehouses can support sophisticated data science projects. And businesses seem eager to do more with data; many are hiring data scientists to usher in a new era of big data analytics.

But now the BI teams are the gatekeepers to the storage and compute resources that might enable the new big data agenda. It's not that the BI teams want to keep data science from doing its job; rather, they can't compromise the agreements they've made with the business via those closely managed SLAs. They need to keep those routine reports flowing daily, and they can't spare the resources data science would need to proceed at a useful pace.

Isolating the workloads

Isolating the workloads between BI teams and big data teams may not exactly be the Holy Grail of a company's data platform, but it has been an elusive goal for all but the wealthiest businesses. Of course, if you can afford to give any team all the resources it needs—whether BI, big data, or anything in between—you don't have a problem.

For the rest of the world, the cost of additional on-premises storage and CPU power, or the cost of additional cloud resources, has been a barrier to a fuller embrace of big data potential.

Some organizations have attempted to isolate database workloads by using one of two approaches, each with distinct problems.

Manage multiple workloads in a single database

One approach is to build a big database and use schemas, tables, and resource pools for different projects. "Resource pools" work like lines of demarcation across your data resources. If you have 10 servers in a cluster, a resource pool allows you to allocate x amount of CPU and memory to one group, and allocate y amount to another group, etc.

This means you could run a single database and try to manage your multiple workloads from there. But in practice, the dividing lines (defined by shares, limits, etc.) are not always clear between one workload and another. One workload is usually dominant, which frequently compromises the performance of the others.
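To make the resource pool idea concrete, here is a minimal Python sketch of the allocation arithmetic for a 10-server cluster. It is a toy model, not any database's actual resource pool syntax: the `ResourcePool` class, the `allocate` function, the pool names, the 60/40 split, and the per-server sizes (16 cores, 64 GB) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    """A hypothetical resource pool: a named slice of cluster CPU and memory."""
    name: str
    cpu_percent: int     # share of total cluster cores, 0-100
    memory_percent: int  # share of total cluster memory, 0-100


def allocate(pools, total_cores, total_memory_gb):
    """Split cluster capacity among pools; reject plans that exceed 100%."""
    if sum(p.cpu_percent for p in pools) > 100:
        raise ValueError("CPU shares exceed 100% of the cluster")
    if sum(p.memory_percent for p in pools) > 100:
        raise ValueError("memory shares exceed 100% of the cluster")
    return {
        p.name: {
            "cores": total_cores * p.cpu_percent // 100,
            "memory_gb": total_memory_gb * p.memory_percent // 100,
        }
        for p in pools
    }


# Assumed cluster: 10 servers, each with 16 cores and 64 GB of memory.
SERVERS, CORES_PER_SERVER, MEMORY_GB_PER_SERVER = 10, 16, 64

plan = allocate(
    [ResourcePool("bi_reporting", 60, 60), ResourcePool("data_science", 40, 40)],
    total_cores=SERVERS * CORES_PER_SERVER,
    total_memory_gb=SERVERS * MEMORY_GB_PER_SERVER,
)
print(plan)
```

The split itself is simple arithmetic. What this sketch leaves out is contention for disk, network, and cache on the shared servers, which is where the dividing lines tend to blur in practice.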

Create multiple databases

Another method is to create multiple databases. But given that different workloads usually have to work with at least some of the same data, you end up with multiple copies of the data (replicas). Not only does your storage bill rise, but keeping those replicas in sync can also be a huge headache.

Yes, you get perfect workload isolation, so the resources being consumed by one group will never affect those being consumed by another. But you don't want different teams reporting against two different versions of the truth. Plus, there's the additional overhead of managing all those data replicas after projects are complete.
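The "two versions of the truth" problem can be shown with a small Python sketch. The order table, the amounts, and the `drifted_keys` helper are all hypothetical; the point is only that a copy taken at one moment stops matching the source as soon as new loads arrive.

```python
import copy

# Hypothetical warehouse table: order id -> order amount.
bi_database = {"order-1001": 250, "order-1002": 90}

# The data science team gets its own database, seeded with a full copy.
ds_database = copy.deepcopy(bi_database)

# Later, operational loads reach only the BI database.
bi_database["order-1002"] = 120   # a corrected amount
bi_database["order-1003"] = 310   # a new order


def drifted_keys(primary, replica):
    """Return the keys whose values differ or that exist on only one side."""
    all_keys = primary.keys() | replica.keys()
    return sorted(k for k in all_keys if primary.get(k) != replica.get(k))


print(drifted_keys(bi_database, ds_database))
print(sum(bi_database.values()), sum(ds_database.values()))
```

Until someone runs a sync job, the two teams would report different revenue totals from what is nominally the same data.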

Use subclusters to isolate workloads

A handful of big data vendors are offering methods to allow BI and data science to coexist without those problems. The concept is "subclustering," or "virtual data warehouses." It involves virtualizing the data that needs to be crunched, such that live, up-to-date information is accessible to any team that needs it for whatever purpose.

The advantage to teams working with the data is clear: Isolated groups of compute nodes can be attached to the same dataset. You don't have to continually duplicate the data to create isolated workloads (which leads to data quickly becoming out of sync). Plus, your BI team need not worry about a data science team potentially crashing one or more servers vital to daily BI operations.

With subclusters, every group gets its own compute capacity, and each group is walled off from the next, with the data it needs cached in virtual compute nodes. This means everyone has access to the same data, so there is no need for data replication, and thus no data sync issues.

For example, if you have one subcluster that's ingesting new data, with other subclusters ready to query the data within as separate workloads, there's no waiting. Separate teams can begin their operations when the new data is loaded.
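The ingest-and-query example above can be sketched as a toy Python model. This is not any vendor's API: `SharedStorage`, `Subcluster`, the node counts, and the table name are assumptions made for illustration, and the model ignores the per-node caching the real systems rely on. It only shows the two properties that matter here: one copy of the data, and separate compute budgets.

```python
class SharedStorage:
    """Single copy of the data that every subcluster reads from."""

    def __init__(self):
        self._tables = {}

    def append(self, table, rows):
        self._tables.setdefault(table, []).extend(rows)

    def scan(self, table):
        return list(self._tables.get(table, []))


class Subcluster:
    """An isolated group of compute nodes attached to shared storage."""

    def __init__(self, name, nodes, storage):
        self.name = name
        self.nodes = nodes          # compute capacity owned by this group only
        self.busy_nodes = 0
        self.storage = storage      # a reference, not a copy

    def load(self, table, rows):
        self.storage.append(table, rows)

    def run_query(self, table, nodes_needed):
        """Occupy this subcluster's own nodes and return the row count."""
        if self.busy_nodes + nodes_needed > self.nodes:
            raise RuntimeError(f"{self.name} is out of capacity")
        self.busy_nodes += nodes_needed
        return len(self.storage.scan(table))

    def free_nodes(self):
        return self.nodes - self.busy_nodes


storage = SharedStorage()
ingest = Subcluster("ingest", nodes=2, storage=storage)
bi = Subcluster("bi", nodes=3, storage=storage)
data_science = Subcluster("data_science", nodes=5, storage=storage)

ingest.load("transactions", [{"id": 1}, {"id": 2}, {"id": 3}])

# A heavy data science job saturates its own nodes...
ds_rows = data_science.run_query("transactions", nodes_needed=5)
# ...while BI still has its own capacity and sees the same rows.
bi_rows = bi.run_query("transactions", nodes_needed=1)
print(ds_rows, bi_rows, bi.free_nodes(), data_science.free_nodes())
```

Because each subcluster tracks only its own `busy_nodes`, the data science job cannot consume BI capacity, and because both hold a reference to the same storage object, neither team is working from a stale copy.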

Subclusters' advantages abound

Isolation of database workloads allows organizations to do more with their data, to try things they've never attempted before, without worrying that they'll potentially crash the production database being used by BI teams.

It allows BI to continue its focus on decision making based on a current business model, evaluating what the data is telling them in known categories. And it allows data science to discover the unknowns, to finally operationalize machine learning at a scale that puts the promise of big data within reach of many more organizations than ever before.
