4 things IT Ops needs to know about machine learning

You have a system that collects logs from your infrastructure, web servers, and applications. That’s 1.4TB of logs every day—more than 500TB over the course of a year. While you’re able to create some graphs that show how your systems are performing, identifying unusual behaviors and anomalies across all of your data in real time is nearly impossible. And because your business is global, this is a 24/7 responsibility for your IT team, which somehow needs visibility into these operations at all times.

As datasets collected by organizations increase in size and complexity, it becomes impractical to manually spot infrastructure problems, intruders, or business issues. There are simply not enough human resources available to watch for every potentially interesting metric around the clock.

That’s why many IT Ops teams have deployed machine-learning tools to help identify anomalies and outliers in live data streams. With machine learning, they figure, they have checked the box and now have the ability to monitor and resolve operational issues, improve their cybersecurity efforts, detect fraud, and more.

Look deeper, though, and chances are you’ll find that they are frantically trying to keep a lid on issues. It’s incredibly challenging to shift from passive to active management, and it’s a daily struggle to overcome what I see as the most critical challenges in IT Ops.

Here are four key pain points to be aware of when looking to get the most out of machine learning.

Drowning in false positives

To detect threats, outages, and other anomalies, many organizations handcraft rules, or they rely on people staring at dashboards. But these options are expensive, have low accuracy, require a lot of staff, and can't scale to cover the many thousands of metrics that a typical organization already collects. Most importantly, false positives from rules generate so many noisy alerts that security analysts simply ignore them. How many alerts? I’ve talked with plenty of companies that estimate tens of thousands on a daily basis.

To solve this issue, organizations often hire data scientists, who extract data to third-party tools and write their own statistical models. But anyone who has tried these approaches will tell you that it is extremely difficult to make such models run in real time inside existing workflows such as logging and security. The variety of the data adds to the challenge: a statistical model may work well for one dataset but not for another.

You need to move from a state of constantly chasing down “boy who cried wolf” false positives to one where you’re managing and solving real problems that demand immediate attention.
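To make the contrast with handcrafted rules concrete, here is a minimal sketch in Python. It is not taken from any particular product: the host names, CPU figures, the 80 percent limit, and the z-score cutoff of 3 are all invented for illustration. It compares a fixed rule with a check against each host's own recent history.

```python
from statistics import mean, stdev

# Hypothetical hand-written rule: alert whenever CPU exceeds a fixed limit.
STATIC_LIMIT = 80.0


def static_rule_alerts(samples):
    """Return every (host, value) sample above the fixed limit."""
    return [(host, value) for host, value in samples if value > STATIC_LIMIT]


def baseline_alerts(samples, history, z_cutoff=3.0):
    """Return samples more than z_cutoff standard deviations
    away from that host's own historical mean."""
    alerts = []
    for host, value in samples:
        mu = mean(history[host])
        sigma = stdev(history[host])
        if sigma > 0 and abs(value - mu) / sigma > z_cutoff:
            alerts.append((host, value))
    return alerts


# Invented CPU percentages: the batch host always runs hot, the web host does not.
history = {
    "batch-01": [88, 91, 90, 89, 92],
    "web-01": [30, 32, 29, 31, 33],
}
samples = [("batch-01", 91), ("batch-01", 89), ("web-01", 70)]

print("static rule:", static_rule_alerts(samples))
print("baseline:   ", baseline_alerts(samples, history))
```

On this invented data the fixed rule fires twice on the batch host's ordinary load and misses the web host's jump entirely, while the per-host baseline flags only the web host. Real metrics are rarely this tidy, but the pattern is the one behind most alert fatigue.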

Analytic “black box” paradox

Stating that an IT system is behaving anomalously is not particularly useful unless there is an indication as to how and why. Complex predictors (such as deep neural networks) are often referred to as “black boxes,” since they tend to reveal little about their inner workings and, worse, can be very difficult to interpret.

This isn’t a question of trusting or doubting the recommendations made by black-box analytics. To do your job effectively, you need a sense of context. In order to be alerted to an anomaly, it’s important to be clear about what is, in fact, an anomaly. What’s the normal behavior of that system? Where has it deviated? What features significantly influenced the anomaly? Obtaining this level of detail may require placing some constraints on your analytics solution; the more complex the analytics, the harder it is to demonstrate the machine-learning process and the logic in the decision making. It’s give-and-take to be sure, so you need to consider the value of interpretability as a key feature of an effective solution.
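As a rough illustration of what an interpretable alert could carry, the sketch below answers the three questions above for a single reading: what is normal, where the reading deviated, and which features contributed most. It uses a deliberately simple per-feature z-score rather than a complex model, and the function name `explain_anomaly`, the feature names, and the numbers are all hypothetical.

```python
from statistics import mean, stdev


def explain_anomaly(history, current, z_cutoff=3.0):
    """List the features whose current value deviates from its
    historical mean by more than z_cutoff standard deviations,
    ordered from largest to smallest deviation."""
    report = []
    for feature, values in history.items():
        mu = mean(values)
        sigma = stdev(values)
        if sigma == 0:
            continue  # a constant feature gives no usable baseline
        z = (current[feature] - mu) / sigma
        if abs(z) > z_cutoff:
            report.append({
                "feature": feature,
                "observed": current[feature],
                "normal_mean": round(mu, 2),
                "z_score": round(z, 1),
            })
    report.sort(key=lambda row: abs(row["z_score"]), reverse=True)
    return report


# Invented readings for one server.
history = {
    "cpu_pct": [20, 22, 21, 19, 23],
    "error_rate": [1, 2, 1, 2, 1],
    "latency_ms": [100, 105, 98, 102, 101],
}
current = {"cpu_pct": 21, "error_rate": 40, "latency_ms": 130}

report = explain_anomaly(history, current)
for row in report:
    print(row)
```

The trade-off described above shows up here too: a model this simple is easy to explain but will miss anomalies that only appear in the interaction between features.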

Data scientist dependence

Many organizations have turned to data science teams, an expensive and scarce resource, to address all of these pains. While data science teams may alleviate some of the heavy load, they typically analyze historical data manually in offline batch mode, using custom solutions that apply only to the limited data they are looking at. Or they may use machine-learning tools that require significant expertise to define and regularly tune modeling parameters. What you need instead is an online, real-time approach to analysis that is robust to a wide variety of data characteristics that may not be known a priori.
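For a sense of what "online" means in practice, here is a minimal sketch that scores each value as it arrives and updates its baseline incrementally, using Welford's algorithm for a running mean and variance. It stands in for the general idea only; the class name, the cutoff of 3 standard deviations, and the warm-up of 10 observations are assumptions, and a production system would need to handle seasonality, drift, and much more.

```python
import math


class OnlineAnomalyDetector:
    """Keeps a running mean and variance (Welford's algorithm) and
    flags values that fall far from the baseline seen so far."""

    def __init__(self, z_cutoff=3.0, warmup=10):
        self.z_cutoff = z_cutoff  # assumed alerting threshold
        self.warmup = warmup      # observations required before alerting
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0             # running sum of squared deviations

    def update(self, x):
        """Score x against the current baseline, then fold x into it."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.z_cutoff:
                is_anomaly = True
        # Welford's incremental update: no stored history is needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly


# Invented requests-per-second readings followed by a spike.
detector = OnlineAnomalyDetector()
stream = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 500]
flags = [detector.update(value) for value in stream]
print(flags)
```

Because the detector stores only three numbers per metric, it needs no batch export and no per-dataset tuning beyond the cutoff. One simplification to note: flagged values are still folded into the baseline, which a real system would likely want to avoid.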

Focus on your needs, not the hype

Driven by futuristic applications for machine data such as autonomous vehicles and smart grids, machine learning has become one of the top buzzwords in the C-suite. But there’s a big gap between those concepts and the reality of what’s possible in the short term. Avoid falling into the trap of thinking too big and creating too high a barrier to entry for yourself. Forget the hype and keep it real.

Doing machine learning right requires it to be at the heart of your IT Ops processes. If you have an existing log analytics system, it should be a part of the system and not some bolt-on tool with algorithms. If you’re evaluating new systems to handle your data, you should examine how the machine learning works: Is it unsupervised machine learning? Does it come with an intuitive interface? Does it require data to be pulled out of one system to another? Can it spot anomalies in real time at the rate that your data is being ingested? Can it alert your key IT Ops personnel, and if so, will they be able to react in the manner that your business requires?

Companies are ingesting massive amounts of live data streams, and that on its own can feel like a major accomplishment. But data volume isn’t the endgame. The real question is: What now?

Make it a priority to gain visibility into those data streams; otherwise none of that data-ingesting work matters. Ask yourself what the use case is. The initial use case should not be something massive such as cybersecurity. Instead, to home in on the appropriate use case, ask yourself: “What’s the value of all that data to the business? What could we learn from it?”

Take Windows event logs, for example. A viable use case goes beyond simply answering, “What’s going on with these machines?” The real value comes with the ability to detect anomalous user behavior and to identify what’s happening when and where. Only with this deeper insight can you build, step by step, the foundation to manage operations better, identify cyberthreats, and reduce fraud. And over time, you will put yourself in the position to pursue bigger and bigger aspirations.

Share your team's experiences with machine learning in the comments below. What have you learned along the way?
