Machine learning and data warehousing: What it is, why it matters

One of the many technologies included under the umbrella of artificial intelligence, machine learning is defined by Wikipedia as "a field of computer science that gives computers the ability to learn without being explicitly programmed."

The technology, which is a core part of the data analytics technologies that power the modern data warehouse, relies on algorithms that make predictions from data on their own, without being bound by strict, hand-written rules. When used successfully, machine learning can help with infrastructure scalability, cost savings, and agility.

For its part, artificial intelligence (AI) is the ability of machines to think like humans. It stems from the idea that "given enough data and compute power, machines will be able to think and learn using mathematical simulation of the human brain," said John Santaferraro, research director at Enterprise Management Associates (EMA).

"[Thinking like humans] includes concepts like self-learning, reasoning, deciding, correcting, communicating, and—most importantly—increasing in overall intelligence," Santaferraro said.

Most of the technology labeled as AI is actually machine learning, or the data-driven use of advanced algorithms to simulate small parts of human thinking and decision-making processes.

"Most of the smart technology in the market today learns based on the input of more, diverse data and the fine-tuning of advanced mathematical algorithms."
—John Santaferraro

Here's what your team needs to know about machine learning and why it matters for modern data warehousing.

Where machine learning fits

Machine learning is becoming popular in the modern data warehouse, which captures large amounts of data from multiple sources and devices and stores it on a single platform for easy retrieval and analysis. The reason the two pair well is simple: Machine learning generally works better the more data you throw at a problem.

Ideally, machine-learning and traditional data warehousing teams can work off the same organizational datasets, but they organize the data a bit differently in order to glean insights from it. Traditional data warehousing professionals typically work with more heavily modeled data, while machine-learning pros prefer less formal regulation.

It is that flexibility that allows machine-learning programs to make predictions and recommend actions based on them, without the need for human intervention.

How the algorithms work

In essence, machine-learning tools look for patterns in data: uncommon exceptions that might flag a concern, or common occurrences that suggest a desired result. By spotting these common examples or uncommon exceptions, a tool can learn from the data it encounters and check the assumptions it has made as more data arrives. It "learns" what to look for.

Machine-learning algorithms use math to determine when something appears right or doesn’t appear right, depending on how a query is written. The algorithm is “trained” in what to look for, based on historical patterns. It also learns what the data element might look like when it is wrong. An algorithm trained in this fashion is called a model.
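As a rough illustration of "training" on historical patterns, here is a minimal Python sketch. It is not any particular product's method. It assumes the simplest possible model (the mean and standard deviation of past values), and the order totals, function names, and the three-standard-deviation threshold are all hypothetical.

```python
from statistics import mean, stdev

def train(history):
    """Learn what 'normal' looks like from historical values."""
    return {"mean": mean(history), "stdev": stdev(history)}

def looks_wrong(model, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    return abs(value - model["mean"]) > threshold * model["stdev"]

# Hypothetical daily order totals used as the historical pattern.
history = [100, 102, 98, 101, 99, 103, 97, 100]
model = train(history)

print(looks_wrong(model, 101))  # False: within the usual range
print(looks_wrong(model, 250))  # True: far outside anything seen before
```

The trained `model` here is just two numbers, but the idea carries over: the algorithm summarizes past data, then uses that summary to judge new data.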

Why ML is foundational to working with big data

Several benefits are attributed to the machine-learning model, including its ability to scale up and handle vast amounts of data of many different types. That is fortunate, since there is an obvious up-front cost to adopting machine learning. As with any investment, however, the lifetime cost is significantly lower if you have an effective algorithm.

Machine-learning tools can also quickly adapt to new trends. An organization simply needs to feed data about the new trend to the algorithm, and the machine-learning environment should make the necessary adjustments automatically.

"Since the early days of data warehousing, the most common use cases have consistently been customer analytics. Under the heading of customer experience, most machine-learning algorithms are being used for the improvement and automation of sales, marketing, and customer service business processes."
—John Santaferraro

Supervised versus unsupervised training

For machine learning to make intelligent predictions or recommendations, it must analyze large sets of data. Training on data that has been labeled is called supervised learning. For example, if a machine-learning program is asked to distinguish between in-state residents and out-of-state residents to determine benefits eligibility, the program might learn from records labeled "New York" versus "not New York."
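To make the role of labels concrete, here is a toy Python sketch of learning from labeled records. It assumes a nearest-centroid classifier, chosen only because it is short, and hypothetical features (the approximate latitude and longitude of a mailing address). A real eligibility check would not be built this way; the point is only that every training record arrives with its answer attached.

```python
from math import dist

# Hypothetical labeled training data: (latitude, longitude) of a mailing
# address, plus the label a person assigned to it.
labeled = [
    ((40.7, -74.0), "New York"),
    ((42.7, -73.8), "New York"),
    ((43.0, -76.1), "New York"),
    ((34.1, -118.2), "not New York"),
    ((41.9, -87.6), "not New York"),
    ((29.8, -95.4), "not New York"),
]

def train(examples):
    """Compute one centroid (average point) per label."""
    groups = {}
    for point, label in examples:
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in groups.items()
    }

def predict(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda label: dist(point, model[label]))

model = train(labeled)
print(predict(model, (42.9, -78.9)))   # New York
print(predict(model, (39.7, -105.0)))  # not New York
```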

On the flip side, training on data that is not labeled is known as unsupervised learning. Algorithms attempting to analyze such data don't benefit from the labeling process, so they must find structure on their own and are typically less effective in producing results.

Clustering is one example of unsupervised training. This process is especially useful in retail, for example when an organization sells multiple similar products that differ in a single feature or color.

The machine-learning algorithm could be asked to search for all customers in a cluster—for example, all customers who purchased the product in blue could be targeted in an online campaign with products with similar features.
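A small Python sketch of clustering follows. It assumes plain k-means, one common clustering technique, and hypothetical customer features (orders per year and average order value). No labels are supplied; the algorithm groups customers purely by how similar their numbers are. The starting centroids are picked by hand here to keep the example deterministic.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the average of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (orders per year, average order value).
customers = [(2, 20), (3, 25), (2, 30), (20, 200), (22, 180), (25, 210)]

centroids, clusters = kmeans(customers, [customers[0], customers[3]])
for i, members in enumerate(clusters):
    print(f"cluster {i}: {members}")
```

With this data, the occasional low-spend buyers end up in one cluster and the frequent high-spend buyers in the other, and each group could then be targeted with its own campaign.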

Where ML technology is going

In recent years, the capability to handle vast datasets has begun to allow data scientists and analysts to deploy machine learning on a massive scale in data warehouses, whether the datasets are structured or semi-structured. This helps data pros perform predictive analytics and forecasting without having to move data out of the data warehouse to develop and train machine-learning models.

New tools, such as Google's BigQuery ML service, Teradata, and Vertica, now can help data analysts work with machine learning. That ability was previously limited to data scientists or those trained to work in programming languages such as R and Python.
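For a sense of what in-warehouse machine learning looks like to an analyst, the Python sketch below only builds and prints the SQL text; it does not connect to any service. It assumes BigQuery ML's `CREATE MODEL` statement and `ML.PREDICT` function. The dataset, table, and column names are hypothetical, and the other tools named above have their own syntax.

```python
def create_model_sql(model, source_table, label_column):
    """Build a BigQuery ML statement that trains a logistic regression
    model inside the warehouse, using `label_column` as the label."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'logistic_reg', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

def predict_sql(model, input_table):
    """Build a statement that scores new rows with the trained model."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{model}`, "
        f"(SELECT * FROM `{input_table}`))"
    )

# Hypothetical names, for illustration only.
train_stmt = create_model_sql("analytics.churn_model",
                              "analytics.customers", "churned")
score_stmt = predict_sql("analytics.churn_model", "analytics.new_customers")

print(train_stmt)
print(score_stmt)
```

Both statements are ordinary SQL run where the data already lives, which is why this style of tooling is within reach of analysts who do not program in R or Python.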

As machine learning becomes more mainstream, data will become more democratized. Teams with limited technical experience in data management will be able to run these services without having to fret over infrastructure limitations or the original source of the data.

The impact of recent advances with machine learning is that data professionals and business users will be able to better share data in real time. Organizations will be able to make better business decisions by knowing how to ask better questions of their datasets to begin with, and get results that help further business goals.
