Essential guide to AIOps: Top tools and implementation tips

Christopher Null, Freelance writer, Null Media

AIOps platforms may be the hottest ticket in enterprise IT right now, with vendors offering all manner of AI-driven tools that promise to transform IT operations. You may be tempted to adopt AIOps, but you may also be overwhelmed by the number of choices and by the risk of wasting time and money.

To get a clearer picture, here's a look at today's available AIOps tools and their capabilities, how you should approach choosing the right one for your business, and what hurdles you may have to overcome for a successful implementation.


What's inside AIOps?

Though a wide range of AI capabilities—including everything from predictive analytics to natural-language processing—can be applied to IT operations, Cloud Academy trainer Guy Hummel said they're all generally leveraged to help IT workers in one of two ways. These tools are built either to automate mundane tasks so that IT operators can focus more on strategic work that adds business value, or to perform tasks that are beyond a human's capabilities.

Hummel used the example of migrating a server to a cloud environment, where you have to choose the right type of virtual machine for provisioning. AI can handle this sort of routine task quite easily, he said, eliminating that particular to-do from IT tick lists. Another common use case is to implement an AI-driven chatbot to handle routine requests and questions from users, which allows help-desk staff to focus on more complex requests.

In contrast, some tasks are difficult, if not impossible, for humans to handle effectively on their own. "With the rising number of cyberattacks on every organization, IT administrators don't have enough time to sift through and investigate all of the security alerts coming in to determine which ones are serious and which ones can be ignored," Hummel explained. "AI, on the other hand, can use past history, multiple data points, and malware alerts from external sources to detect malicious actions."

AIOps can also help get ahead of and resolve problems before they happen, said Andi Mann, chief technology advocate at Splunk. "AIOps systems use longitudinal operational data to effectively 'learn' good states versus bad states and detect early warning signals that humans miss," he said. "Even weak signals, hidden in a vast amount of obfuscating data, can be strong predictors of failure to AIOps systems."
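To make the idea of learning good states from operational history concrete, here is a minimal sketch, not a description of how Splunk or any other product works. It assumes a single metric sampled at regular intervals and flags any point that deviates sharply from the samples just before it. The function name `find_anomalies`, the window size, the threshold, and the sample latencies are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]   # recent history stands in for the "good state"
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        # z-score: how many standard deviations the new point is from the baseline
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical response times in milliseconds; the spike is at index 8.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 100, 180, 101]
print(find_anomalies(latency_ms))  # prints [8]
```

A real AIOps platform would combine many metrics and far longer histories, but the principle is the same: the baseline comes from the data rather than from a hand-set alert threshold.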

Michael Procopio, product marketing manager at Micro Focus, said that metric correlation helps find anomalies before they have a serious impact.

"AIOps is really good at sifting through lots of data to find where to focus. Millions of log records and metrics can be processed to find the few that matter, something that humans just can't handle."
Michael Procopio

Figure 1: This analytics dashboard shows anomaly detection in the upper right and log analytics in the lower right, where 2.9 million log messages were processed to find 20 significant ones.
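One simple way to picture this kind of log reduction is to group messages by their shape and surface only the shapes that rarely occur. The sketch below is an illustration under that assumption, not the method used by the dashboard in Figure 1. The helper names `template_of` and `rare_messages` and the sample log lines are hypothetical.

```python
import re
from collections import Counter

def template_of(message):
    """Mask variable parts (hex ids, numbers) so similar messages share a template."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    return re.sub(r"\d+", "<NUM>", masked)

def rare_messages(messages, max_count=1):
    """Return messages whose template occurs at most max_count times."""
    counts = Counter(template_of(m) for m in messages)
    return [m for m in messages if counts[template_of(m)] <= max_count]

# Hypothetical log lines: four routine health checks and one unusual message.
logs = [
    "GET /health 200 in 3 ms",
    "GET /health 200 in 4 ms",
    "GET /health 200 in 2 ms",
    "disk sda1 usage at 97 percent",
    "GET /health 200 in 5 ms",
]
print(rare_messages(logs))  # prints ['disk sda1 usage at 97 percent']
```

Rarity alone is a crude signal, since a rare message can be harmless and a frequent one can be serious, which is why commercial tools layer further analysis on top of this kind of grouping.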

Mann added that AIOps lends analytical predictability to many other areas of IT.

"Poor predictability of capacity is a very common cause of outages and slowdowns. With predictive analytics, IT can ensure the right configuration of physical and virtual resources to meet predicted load cycles." 
Andi Mann
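As a rough illustration of predictive capacity analytics, the following sketch fits a straight line to daily storage usage and estimates how many days remain before the trend reaches capacity. It assumes steady linear growth and at least two samples, which real workloads rarely guarantee. The function names and the usage figures are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def days_until_full(daily_usage_gb, capacity_gb):
    """Days after the last sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(daily_usage_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (len(daily_usage_gb) - 1))

# Hypothetical volume growing 10 GB per day toward a 600 GB limit.
usage = [500, 510, 520, 530, 540]
print(days_until_full(usage, 600))  # prints 6.0
```

Production systems would account for seasonality and load cycles rather than a single straight line, but even this simple trend turns a surprise outage into a dated warning.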

Similarly, Mann said, AIOps systems can predict load patterns and even correlate with business metrics to schedule routine maintenance such as patching, upgrades, new releases, and backups during low-impact time windows.
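Choosing a low-impact maintenance window can be sketched in a few lines once a load profile is available. The example below assumes a list of 24 average hourly load figures (for instance, requests per second) and finds the contiguous window with the lowest total load. The function `quietest_window` and the load values are hypothetical.

```python
def quietest_window(hourly_load, duration_hours):
    """Return (start_hour, total_load) for the contiguous window with the lowest load.

    Windows are allowed to wrap past the end of the list (for example, past midnight).
    """
    n = len(hourly_load)
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(hourly_load[(start + k) % n] for k in range(duration_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total

# Hypothetical average load per hour of day, starting at midnight.
load = [20, 12, 8, 5, 6, 15, 40, 70, 90, 95, 92, 88,
        85, 87, 90, 86, 80, 75, 65, 55, 45, 38, 30, 25]
print(quietest_window(load, 3))  # prints (2, 19): a three-hour window starting at 02:00
```

Feeding the function a predicted load profile instead of a historical average, or weighting each hour by a business metric such as revenue, would move it closer to the approach Mann describes.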

Today's systems

While few all-in-one AIOps solutions exist today, many vendors offer AIOps platforms that combine data ingestion and management, automated pattern discovery and prediction, and root-cause failure analysis.

"There are many open-source machine-learning tools, such as TensorFlow and MXNet, but most of them are general-purpose frameworks that require a great deal of expertise to use," Cloud Academy's Hummel said. "IT organizations are usually better off implementing a system that's designed specifically for IT operations and that hides the complexity of the underlying AI." 

Hummel added that most of these are commercial products, although some of them, such as Elastic Stack, also have open-source variants.

But before you start evaluating tools and vendors, you need to clearly identify specific, key problems that your company is looking to solve, said Phil Tee, CEO of Moogsoft, which recently published its 2018 AIOps Buyers Guide.

"It's an unfortunate truth that many vendors throw words around like 'machine learning' and 'AI' in marketing materials in order to appeal to a growing audience, but these brands might not yet have the capabilities to do such."
Phil Tee


AIOps first steps

Getting started with AIOps can be daunting, so it's best to take a bite-by-bite approach. The first step is identifying and understanding your IT operations data, Splunk's Mann said. At its core, AIOps is data-driven, so it requires access to all relevant operations data, including unstructured machine data such as logs, metrics, streaming data, API outputs, and device data.

In many use cases, structured business data is also required—namely databases, social sentiment, and other relational data. "The more relevant data AIOps systems have, the more accurate they will be," he explained.

Next, strive to understand how this data can help solve your biggest problems. "Review past failures and identify what data would show the root, or at least proximate, causes of your highest-priority problems," Mann said. "Using data analytics to filter through the vast noise of data, see how you can correlate outages or slowdowns with notable events in your environment to discover causes of this problem. Then start to train your AIOps systems on known problems and resolutions."
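A first pass at this kind of review can be as simple as lining up outage start times against a change or event log. The sketch below assumes both are available as timestamps and lists every event that occurred in the 30 minutes before each outage. The function name, the lookback period, and all of the events and outages shown are hypothetical.

```python
from datetime import datetime, timedelta

def events_before_outages(outages, events, lookback=timedelta(minutes=30)):
    """Map each outage start time to the names of events in the lookback window before it."""
    result = {}
    for outage_start in outages:
        window_start = outage_start - lookback
        result[outage_start] = [
            name for when, name in events if window_start <= when <= outage_start
        ]
    return result

# Hypothetical change log and outage history.
events = [
    (datetime(2018, 11, 5, 9, 40), "deploy checkout-service v2.3"),
    (datetime(2018, 11, 5, 13, 0), "certificate rotation"),
    (datetime(2018, 11, 6, 2, 15), "database index rebuild"),
]
outages = [datetime(2018, 11, 5, 9, 55), datetime(2018, 11, 6, 2, 30)]

for start, suspects in events_before_outages(outages, events).items():
    print(start, suspects)
```

Proximity in time is only a hint and not proof of cause, but a list of suspects like this gives a team labeled examples of known problems to start training on.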

Finally, use this insight to prepare machine learning and AI for real-time monitoring and automated response. "Once AIOps systems are primed with key indicators of known good/known bad states, you'll be able to monitor and alert on issues in real time, ignoring false positives," Mann said. "Over time—and perhaps with supervision—you'll be able to take automated actions to remediate or even prevent both known and unknown problems."

Mann noted that this creates a virtuous circle that can drive even more AI maturity, which can proactively fix errors, prevent downtime, and optimize efficiency.

"Take it step-by-step, starting in one area of your business, one system or application, one team or organization. Then repeat across other businesses, applications, and teams."
—Andi Mann

How to avoid potential AIOps pitfalls

A technology that can replace so much human labor is bound to foster some outsized assumptions. Management may believe it is acquiring a set-it-and-forget-it solution, while IT pros may fear that AI will take their jobs. The truth is that AIOps is still in its infancy, so all involved should set their expectations accordingly, Cloud Academy's Hummel said.

"AI is not a replacement for IT professionals. It’s a way to augment their abilities." 
Guy Hummel

IT staff need to remain involved to ensure that the recommendations and actions of their AI systems are correct, he said. "This is especially important in areas like threat detection. It's also important to feed sufficient data into these systems. If an AI system doesn't have enough data for learning, it will likely make poor predictions."

This last point is likely the biggest hurdle to a successful AIOps implementation. Too many IT leaders provide AIOps systems with a minimal data set and then either assume absolute accuracy in the system's findings or ignore them altogether, Splunk's Mann said. This leads to a variety of dangers, from disregarding real problems only because there is no data about them, to overcorrecting on false positives because of a lack of explanatory data.

Silos of data introduce additional issues, he added. "When multiple teams and AIOps systems work with their own datasets, it multiplies the lost opportunity," he said. For example, without combining operational and security data, triage teams can't correlate IT operations slowdowns and outages with security penetration vectors to identify that a production problem has its root cause in a security breach.

The key to avoiding these issues, Mann said, is to collect as much relevant data as possible, share it among multiple teams, and store it over time to build valid training datasets. Then use that data with AIOps systems in both supervised and unsupervised modes to gain a complete picture of your environment and help to alleviate the systemic bias that happens with limited access or limited data.

"While AIOps can be transformational to your business, it likely isn't a cure for every single issue affecting your IT or technical operations. So take the time to understand the implications of the technology on your business and have candid conversations around what needs to be fixed immediately."
—Andi Mann