Lessons from machine learning's front lines: It's all about the data

Luther Martin, Distinguished Technologist, Micro Focus

Machine learning is fundamentally about trying to find useful patterns in data. Most discussions about ML have focused largely on the algorithms used to do this, or how to interpret the output of those algorithms.

Those are both interesting topics, but what you should focus on instead is the data itself, and in particular the data you use to train ML algorithms.

Here are key lessons I learned through several ML projects that included finding patterns in scientific data, digital images, human speech, data breaches, and security incident data.

Data is messy

Real-world data is often incomplete, with lots of errors. Because of this, it can be useful to draw a parallel between secure coding practices and machine learning.

Most software security vulnerabilities are caused by things such as buffer overflows, SQL injection, cross-site scripting, and similar phenomena. The common element in all of these vulnerabilities is that they exist because data wasn't properly validated before it was used. Most secure programming practices can be summarized as: "Data exists only to break your code; program accordingly." 

Likewise, when you're developing ML algorithms, a similar defensive mentality is useful because data is frequently contaminated by a wide range of unexpected phenomena. Data exists only to thwart the development of your algorithms; act accordingly.
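The same defensive habit can be sketched in code. The following is a minimal illustration, not a complete validation scheme: the function names, the record layout (a dictionary with a "label" key), and the sample field names are all assumptions made for the example.

```python
import math

def validate_record(record, required_fields, valid_labels):
    """Return a list of problems found in one record (empty means usable)."""
    problems = []
    for field in required_fields:
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{field} is NaN")
    # A label outside the expected set is treated as bad data, not as a new class.
    if record.get("label") not in valid_labels:
        problems.append(f"unexpected label {record.get('label')!r}")
    return problems

def split_clean_and_rejected(records, required_fields, valid_labels):
    """Separate usable records from suspect ones, keeping the reasons."""
    clean, rejected = [], []
    for record in records:
        problems = validate_record(record, required_fields, valid_labels)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected

# Hypothetical sample data for illustration only.
records = [
    {"width": 1.0, "label": "cat"},
    {"width": None, "label": "cat"},
    {"width": float("nan"), "label": "dog"},
    {"width": 2.0, "label": "bird"},
]
clean, rejected = split_clean_and_rejected(records, ["width"], {"cat", "not_cat"})
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected records together with the reasons they failed, rather than silently dropping them, makes it easier to see later which kinds of contamination are most common.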

An example of this comes from someone I met at a conference a few years ago. This person had worked on the Human Genome Project (HGP), which determined the sequence of the roughly 3 billion base pairs that make up human DNA. Once the human genome was sequenced, trying to figure out exactly what's in it and what it means turned out to be a daunting problem in data analysis, and one perfectly suited for solving with ML.

This researcher told me about how a team of data scientists had almost found an interesting pattern in the human genome; what they expected differed from the HGP data in a single position. Puzzled, they asked the HGP team to take a closer look at the suspicious data. They found that the data was indeed incorrect, and the ML algorithms had predicted the error.
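The general idea of treating a disagreement between expectation and data as something to review can be sketched very simply. This is not the method those researchers used, which the anecdote does not describe; the function name and the short sample sequences are invented for illustration.

```python
def disagreement_positions(expected, observed):
    """Return the positions where the expected and observed sequences differ."""
    if len(expected) != len(observed):
        raise ValueError("sequences must be the same length")
    return [
        position
        for position, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]

# Hypothetical sequences: the pattern we expect versus what the data says.
suspect = disagreement_positions("GATTACA", "GATTGCA")
print("positions to re-check in the source data:", suspect)
```

A flagged position does not say which side is wrong. It only marks where the source data deserves a closer look.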

Data can be corrupted

Are there other errors in the human genome? Absolutely. The goal of the HGP was to have an error rate of no more than 1 in 10,000, or 0.01%. Across roughly 3 billion base pairs, that margin allows for as many as 300,000 errors in the HGP data, and these may not be identified until careful analysis of the data suggests that they're wrong.
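The arithmetic behind that estimate is short enough to check directly. This sketch uses only the two figures given in this article, the roughly 3 billion base pairs and the target error rate of 1 in 10,000; the function and constant names are made up for the example.

```python
def expected_errors(item_count, failures, attempts):
    """Expected number of errors at a rate of `failures` per `attempts` items."""
    return item_count * failures / attempts

BASE_PAIRS = 3_000_000_000  # approximate size of the human genome, per the article

hgp_errors = expected_errors(BASE_PAIRS, 1, 10_000)
print(f"errors allowed at the target rate: {hgp_errors:,.0f}")
```

This is an upper bound implied by the stated goal, not a measured count of errors in the actual HGP data.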

As Ronald Reagan was fond of saying back in the Cold War, "Trust, but verify." Interesting data patterns may be caused by interesting phenomena, but they can also be caused by bad data.

So be ready to spend way too much time understanding data problems that aren't caused by the interesting effects. If you're the one developing ML algorithms, it's useful to understand that your data will probably always be corrupted in some way.

If you are new to analyzing data, there's a class offered through the online platform Coursera, "Getting and Cleaning Data," that is a good introduction to the topic. When I worked on analyzing data breaches, I spent much more time cleaning up data to make it suitable for analysis than I spent on the actual analysis. Don't be surprised if you find yourself in a similar situation.

Human error compounds the problem

To train a system to recognize pictures of cats, you need pictures of cats and pictures of things that aren't cats. The training uses both kinds of examples to refine how well the system judges whether a picture is a cat or something else. As another example, to recognize a particular vowel sound, you need samples of sounds that both are and are not that particular sound.

Labeling is the process of deciding exactly which of the data is right and which is wrong; is a particular picture a cat or not? This can be hard and expensive because getting the labels correct can require lots of human expertise.

People make mistakes. They actually make lots of mistakes, and this makes labeling data much harder than it needs to be.

Different types of errors

In Appendix 6, “Human error rates,” in his book Reliability, Maintainability and Risk, David Smith divides tasks into four general types: simplest possible tasks, routine simple tasks, routine tasks with care needed, and complicated, non-routine tasks.

Noticing the presence of a busy intersection while driving is an example of a simplest possible task; turning off a light when you leave a room is an example of a routine simple task; making the correct selection from a vending machine is an example of a routine task with care needed. Yet more complicated tasks fall into the category of complicated, non-routine tasks.

According to Smith, a good rule of thumb is that people fail simplest possible tasks about 1 time in 100,000 attempts, giving a failure probability of about 0.001%. They fail routine simple tasks about 3 times in 1,000 attempts, giving a failure probability of about 0.3%. They fail routine tasks with care needed about 1 time in 100 attempts, giving a failure probability of about 1%. And they fail more complicated tasks at least 1 time in 10 attempts, giving a failure probability of at least 10%.
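Converting each of those rule-of-thumb rates to a percentage is a one-line calculation, sketched below. The rates are the ones quoted above; the dictionary and function names are assumptions made for the example.

```python
# Rule-of-thumb human error rates quoted above, as (failures, attempts).
HUMAN_ERROR_RATES = {
    "simplest possible task": (1, 100_000),
    "routine simple task": (3, 1_000),
    "routine task with care needed": (1, 100),
    "complicated, non-routine task": (1, 10),  # "at least" this often
}

def failure_percent(failures, attempts):
    """Convert 'failures per attempts' into a percentage."""
    return 100 * failures / attempts

for task, (failures, attempts) in HUMAN_ERROR_RATES.items():
    print(f"{task}: about {failure_percent(failures, attempts):g}%")
```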

When good labels go wrong

Labeling data to make it useful to ML algorithms is probably a routine task with care needed. That means that it's probably done incorrectly about 1% of the time—maybe even more.

So even if there were no problems at all in collecting the data (which, of course, never happens), when the data is labeled, it is probably labeled incorrectly a significant fraction of the time. This means that your ML algorithms are probably being trained on data that's not as accurate as it could be.

This suggests that it's critical to make your labeling process as simple as possible. You definitely don't want to make labeling data so complicated that it falls into the category of complicated, non-routine tasks, where the failure rate is at least 10%. At that point interesting patterns in your data may become impossible to find because of how much they are obfuscated by human error.
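To get a feel for what is at stake, the sketch below compares the expected number of wrong labels at the two rule-of-thumb rates discussed above, 1 in 100 and 1 in 10. The dataset size of 50,000 examples is a made-up figure for illustration, and the names are assumptions.

```python
def expected_mislabeled(example_count, failures, attempts):
    """Expected number of wrong labels at a rate of `failures` per `attempts`."""
    return example_count * failures / attempts

DATASET_SIZE = 50_000  # hypothetical number of labeled examples

# Labeling treated as a routine task with care needed (about 1 in 100).
careful = expected_mislabeled(DATASET_SIZE, 1, 100)
# Labeling that has become a complicated, non-routine task (at least 1 in 10).
complicated = expected_mislabeled(DATASET_SIZE, 1, 10)

print(f"routine with care: about {careful:,.0f} wrong labels")
print(f"complicated:       at least {complicated:,.0f} wrong labels")
```

Under these assumptions, letting the labeling task drift from one category to the other multiplies the number of wrong labels by ten without any change to the underlying data.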

Know what you're dealing with

Machine learning can provide lots of ways to find and use the many patterns that exist in data, but this can turn out to be harder than you might expect because of the nature of data.

All real-world data is inaccurate and incomplete, and when you label this data for use in machine learning it gets even less accurate. Expect to spend time working around these issues when you apply ML to solving your problems.