AIOps is the oxygen for your data: 4 steps to get started

Josh Atwell, Senior Technology Advocate, Splunk

Your application portfolio is probably distributed between the platforms you own or rent and subscription services. The requirements of managing the health and availability of those services are complex, and may seem overwhelming. But emerging AIOps capabilities that intelligently augment and support your technology teams are changing that.

AIOps systems use artificial intelligence technologies, such as machine learning, to help monitor and manage IT systems and applications. There is a wide array of focus areas for these systems, but luckily one constant exists to help you successfully apply next-generation technologies to manage your application portfolio: data. 

Relying on your systems to deliver better analysis and decision-support insights in turn allows your teams to focus their energy more prescriptively and accurately. That requires integrating a lot of different types of data, from many different sources, into a learning platform that delivers analysis and insights.

The volume and velocity of data in most environments can be overwhelming, and the majority of IT Ops tools were not built to meet ever-increasing data demands. Here are four steps you can take right away to start realizing value from AIOps.

1. Identify and consolidate your data sources

Specific error log types and latency metrics are common initial data sources, because they are already understood by systems operators as indicators of system health. However, restricting systems to a single data type limits your insights, preventing a clear, holistic picture of system activity and overall service health.

Mature monitoring systems already have familiarity with these common data sources, but they often run into challenges monitoring the scale and complexity required today. In this evolving landscape, IT operators require insight into massive datasets over extended periods of time without loss of data fidelity or responsiveness.

As such, it's critical that AIOps systems manage real-time requirements as well as historical analysis, which means they need to be purpose-built.

Beyond the timeline, the system also requires access to all types of data, including unstructured machine data, structured metrics, and relational data. This combination provides the opportunity for tremendous insight into today's conditions, as well as predictions of future state.
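As a rough illustration of bringing unstructured machine data alongside structured metrics, the sketch below parses raw log lines into records and rolls them up into a per-minute error count. The log layout, the `LOG_PATTERN` expression, and the function names are assumptions made for this example; real log formats vary widely, and an AIOps platform would handle this ingestion for you.

```python
import re
from collections import Counter

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <service> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_log_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def errors_per_minute(lines):
    """Count ERROR lines per minute, keyed by 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for line in lines:
        record = parse_log_line(line)
        if record and record["level"] == "ERROR":
            # The first 16 characters of an ISO timestamp identify the minute.
            counts[record["ts"][:16]] += 1
    return dict(counts)


# Made-up sample data for illustration only.
sample_logs = [
    "2024-01-01T10:00:05Z ERROR inventory timeout talking to db",
    "2024-01-01T10:00:40Z INFO inventory request served",
    "2024-01-01T10:00:59Z ERROR inventory timeout talking to db",
    "2024-01-01T10:01:10Z ERROR checkout payment declined",
    "not a log line",
]
error_counts = errors_per_minute(sample_logs)
print(error_counts)
```

Once log events are reduced to a time-keyed count like this, they can sit next to latency or utilization metrics sampled on the same interval.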

Your first step should be to focus on established and well-understood data sources. It is important to identify data sources at each layer of the service, beginning with infrastructure, whether it is cloud or traditional, and moving up to application performance.

Then correlate those sources in the system based on their relationships. For example, relate network and storage latencies to overall transaction latency, which in turn correlates with overall transaction volume. Successfully identifying and correlating the right data will set you up for the next important step.
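One simple way to check how closely two data sources move together is a Pearson correlation coefficient. The sketch below is a minimal example, not what any particular AIOps product does; the metric names and sample values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up per-minute samples, aligned on the same timestamps.
network_latency_ms = [10, 12, 15, 20, 30]
storage_latency_ms = [5, 5, 6, 5, 6]
transaction_latency_ms = [110, 115, 122, 131, 152]

network_vs_transaction = pearson(network_latency_ms, transaction_latency_ms)
storage_vs_transaction = pearson(storage_latency_ms, transaction_latency_ms)
print(f"network vs transaction: {network_vs_transaction:.2f}")
print(f"storage vs transaction: {storage_vs_transaction:.2f}")
```

In this invented data set, network latency tracks transaction latency more closely than storage latency does. Correlation alone does not establish cause, so treat a high coefficient as a prompt for investigation rather than a conclusion.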

2. Define your baseline

Collecting and visualizing data is just the beginning for AIOps platforms. Once you are satisfied that the system is collecting from the right data sources, it's critical to define service baselines based on both target performance and current conditions.

Initial work happens in accessing and analyzing raw historical machine and metric data, such as the number of failed transactions or queries, to provide a base understanding. The AIOps system can then apply machine-learning algorithms against the available data to make visible unidentified trends and patterns, providing previously unavailable insights.
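A baseline can start out as something very simple. The sketch below derives a mean, a standard deviation, and an upper alert bound from historical counts of failed transactions. The `Baseline` type, the `build_baseline` helper, the multiplier `k`, and the sample numbers are all assumptions for illustration; a production platform would typically use more sophisticated models.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    center: float  # typical value (mean of the history)
    spread: float  # sample standard deviation of the history
    upper: float   # center + k * spread, a simple alert bound


def build_baseline(history, k=3.0):
    """Summarize historical samples into a simple statistical baseline."""
    if len(history) < 2:
        raise ValueError("need at least 2 historical samples")
    center = statistics.mean(history)
    spread = statistics.stdev(history)
    return Baseline(center=center, spread=spread, upper=center + k * spread)


# Made-up history: failed transactions per hour over the last 12 hours.
failed_per_hour = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5, 7, 6]
baseline = build_baseline(failed_per_hour)
print(baseline)
```

The target-performance side of the baseline, such as a service-level objective, would be set by the team rather than computed from history.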

For example, response latency in requests to your inventory service is a critical metric, but that latency has many possible sources, such as bandwidth, disk utilization, CPU utilization, and improperly formatted queries. Modern AIOps systems can simultaneously and quickly analyze all of these variables, pinpoint the primary source of request latency, and notify the appropriate responder to address the issue before it affects your customers.

Quality data from more sources, which were often previously siloed, will yield better results, since there is more information for the algorithms to work with. Because the results will likely include correlations too difficult for humans to isolate, it's important to be prescriptive about the data and invest time in reviewing the system recommendations.

This effort will not only provide valuable insights into system behavior, but will also allow the resulting automation to be more prescriptive and accurate.

3. Compare patterns in real time

So far, you have identified and collected a diverse set of correlated data sources and given the system an opportunity to define a baseline. Now you must analyze system behavior in real time to see how ongoing data ingestion and analysis compares with your defined baselines. 
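Comparing live data with a baseline can be sketched with a z-score, which expresses how many standard deviations a sample sits from the baseline mean. The threshold of 3.0, the function names, and the sample values below are assumptions chosen for the example.

```python
def zscore(value, baseline_mean, baseline_stdev):
    """Number of standard deviations between a sample and the baseline mean."""
    if baseline_stdev <= 0:
        raise ValueError("baseline_stdev must be positive")
    return (value - baseline_mean) / baseline_stdev


def flag_anomalies(samples, baseline_mean, baseline_stdev, threshold=3.0):
    """Return (index, value, z) for samples whose |z| exceeds the threshold."""
    flagged = []
    for index, value in enumerate(samples):
        z = zscore(value, baseline_mean, baseline_stdev)
        if abs(z) > threshold:
            flagged.append((index, value, z))
    return flagged


# Made-up live latency samples compared against an assumed baseline.
live_latency_ms = [118, 125, 121, 190, 123]
anomalies = flag_anomalies(live_latency_ms, baseline_mean=120.0, baseline_stdev=10.0)
print(anomalies)
```

Here only the 190 ms sample is flagged, because it sits seven standard deviations above the assumed baseline mean.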

It may take several weeks or months before the system moves beyond simply identifying seasonal patterns and begins to regularly identify and report truly anomalous behavior. It is important to take this time to re-evaluate your data sources and look for additional sources that can further improve your system visibility.
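To see why seasonal patterns matter, consider a baseline kept per hour of day rather than as one global number. The sketch below is one simple way to do that, using invented request-rate data; `hourly_profile` and `deviation_from_season` are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(samples):
    """Build {hour_of_day: typical value} from (hour_of_day, value) pairs."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in buckets.items()}


def deviation_from_season(hour, value, profile):
    """Ratio of an observed value to the typical value for that hour.

    Returns None when there is no usable history for the hour.
    """
    typical = profile.get(hour)
    if not typical:
        return None
    return value / typical


# Made-up history of requests per second at 09:00 and 03:00.
history = [(9, 100), (9, 110), (9, 90), (3, 10), (3, 12), (3, 8)]
profile = hourly_profile(history)

print(deviation_from_season(9, 105, profile))  # close to typical for 09:00
print(deviation_from_season(3, 50, profile))   # far above typical for 03:00
```

A single global threshold would treat 105 requests per second as more notable than 50, while the hourly profile shows that 50 requests per second at 03:00 is the unusual reading.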

If you have not already included business data sources, this is a great time to review key performance indicators (KPIs) and map correlations between system performance and business performance. What are the sales implications when latency increases? At what point do you anticipate customers abandoning their transactions or filing help tickets due to latency? How much is that scenario going to affect the business?
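One way to start answering these questions is to group customer sessions by latency and look at the abandonment rate in each group. The sketch below assumes you can obtain, per session, a latency figure and a flag for whether the transaction was abandoned; the bucket size, the function names, and the data are invented for illustration.

```python
def abandonment_by_latency(sessions, bucket_ms=500):
    """Abandonment rate per latency bucket.

    sessions: iterable of (latency_ms, abandoned) pairs.
    Returns {bucket_start_ms: abandoned_fraction}, ordered by bucket.
    """
    totals = {}
    abandoned = {}
    for latency_ms, was_abandoned in sessions:
        bucket = int(latency_ms // bucket_ms) * bucket_ms
        totals[bucket] = totals.get(bucket, 0) + 1
        abandoned[bucket] = abandoned.get(bucket, 0) + (1 if was_abandoned else 0)
    return {bucket: abandoned[bucket] / totals[bucket] for bucket in sorted(totals)}


def first_bucket_over(rates, max_rate):
    """Lowest latency bucket whose abandonment rate exceeds max_rate, or None."""
    for bucket in sorted(rates):
        if rates[bucket] > max_rate:
            return bucket
    return None


# Made-up sessions: (latency in ms, whether the customer abandoned).
sessions = [
    (120, False), (300, False), (450, True), (480, False),
    (700, False), (900, True),
    (1200, True), (1400, True), (1100, False), (1300, True),
]
rates = abandonment_by_latency(sessions)
print(rates)
print(first_bucket_over(rates, max_rate=0.4))
```

With real data, the bucket where abandonment crosses an acceptable rate gives the business a concrete latency figure to defend.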

Answering these questions and monitoring these things in real time is only possible through automated analysis of your data. And gaining these valuable insights for both current and future business service performance is where your investment in an AIOps solution really starts to pay off.

For example, a global manufacturing leader was able to reduce unplanned downtime by 64%, saving it over $300,000 an hour in costly outages by leveraging the predictive analytics and auto-remediation capabilities available in its AIOps platform.

4. Watch the system identify new patterns

Improving situational awareness in your environment is just the beginning. AIOps platforms can predict future issues by implementing automated and intelligent analysis to examine the intersections between seemingly disparate streams of data. 

Once you have established real-time monitoring of your key systems over a period of time, you should begin receiving warnings from the system about potential outages in time to prevent them from happening. AIOps systems can deliver such insights today if fed a diverse set of loosely coupled data sources. The AI system identifies new patterns when it recognizes previously unknown intersections between the data streams.
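A very basic form of prediction is to fit a straight line to recent samples and estimate when it will cross a threshold. The sketch below does this with an ordinary least-squares fit; it is a deliberately simple stand-in for the machine-learning models an AIOps platform would use, and the function names and disk-usage figures are assumptions.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least 2 samples to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(range(n), ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intervals_until(ys, threshold):
    """Estimated sampling intervals after the last sample until the trend
    reaches the threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    slope, intercept = fit_line(ys)
    current = intercept + slope * (len(ys) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Made-up hourly disk utilization readings, in percent.
disk_used_pct = [70, 72, 74, 76, 78]
print(intervals_until(disk_used_pct, threshold=90))
```

For this invented series, which grows by two points per hour, the estimate is six more hours before reaching 90%, which is the kind of lead time that lets a team act before customers are affected.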

These new patterns result in new sets of reporting, alerting, and automated responses to remediate predicted issues with enough time to prevent any customer impact. This greatly reduces the amount of time your operations and development teams spend responding to incidents, and it is all a result of better analysis of the data you already have in your environment.

Data is the lifeblood of AIOps 

Simply monitoring a small subset of data sources is no longer sufficient. Environment complexity has far exceeded human capacity to ingest, analyze, and respond to the data that describes the systems supporting your customers.

Success is now measured in how effectively your organization can centralize its data and gain insights from it through automated analysis to identify previously unknown patterns. While it's important to pick quality data, it's equally important to bring in all types of data to provide better insights.

That process is iterative, and modern tool sets can help you at every step of your journey. The best analysis is only possible with all of the best data. That data is already available to you, but you need AIOps capabilities to use it effectively.
