Get started with AIOps: A real-world example

Torrey Jones Principal Consultant, Greenlight Group
 

Effective AI-boosted IT operations (AIOps) requires four kinds of analytics. Two are fundamental, while the other two support AIOps more directly.

The fundamentals include descriptive and diagnostic analytics, which in turn support prescriptive and predictive analytics. The first two require essential technology, including reports, dashboards, and other tools that enable the analytics shown on the upper tier. Prescriptive and predictive analytics rely more on machine learning and are more directly tied into AIOps.

Here's what you need to know.

The four types of IT operations analytics

There are four key modes of analytics. Each has its own purpose within IT operations, and each requires a different level of team maturity. But what does that mean in terms of resolving day-to-day issues?

While some people believe that there is a clear progression in sophistication from "What happened?" (descriptive analytics) to "What actions will we take based on what we know?" (prescriptive analytics), these modes of analytics all work together at a practical level.

Figure 1: The four modes of IT operations analytics, and the kinds of activities that are possible when using commercially available tools. Source: Torrey Jones

To illustrate how this all works, here's a true-to-life scenario—anyone who has worked for any length of time in IT operations is likely to be familiar with these elements.

Nightmare scenario: At a critical moment, a major outage occurs 

You work for an online retailer of fidget spinners that is about to launch its biggest sale ever, in preparation for FidgetCon, when the website experiences a major outage. It occurs not during business hours, but at 4 p.m. on a Saturday. This is a Severity 1, Priority 1, critical incident, which causes the company to bleed money at a rate well above your pay grade.

The website is up, but it errors out trying to process payments for the newest, fanciest fidget spinner.

The virtual war room assembles

Most of the IT Ops team is off for the weekend, and many of them will need to be called, texted, and invited to a virtual war room to work on the problem. The major-incident manager establishes a conference bridge and starts bringing people into the virtual meeting.

  • The first call is to the web infrastructure team members, who say the problem is not with them, because the website is up and functional.
  • The next call goes to the application team members, from whom you learn that the payment processing system is erroring out when processing a payment.
  • So you call the payment processing team, which says that the connection to the third-party service it uses for fraud detection is timing out.
  • The payment processing team then calls the service, which says its systems are up and running.
  • So it's on to the network team. The general perception of the non-networking people in IT is that the network is always the problem. But in this case the network is fine; all network devices in the route to the third-party provider are up, and there are no alarms on those devices.

So... now what?

Next step: Diagnostic analysis to identify traffic failures 

It's time to bring analytics to bear on the problem. You already know what happened, so your effort focuses on diagnosing the problem. Once you find it, you can work on a fix.

The vast amount of data that needs to be examined in a case such as this is like a huge haystack, and you're looking for a needle. But what if you could collect all of the logs from all of the components, aggregate them, and put that information at the fingertips of your network operations center (NOC)? 

After reviewing the aggregated metrics from your application performance monitoring (APM) tools as well as application logs—which show horrendous response times for transactions to the third-party payment system—your NOC team determines that connectivity to the third-party provider is the problem. But the connection isn't down completely; some transactions are still going through.

The NOC can see metrics and logs from the payment processing system, but the logging has always been extremely verbose because of SOX and PCI compliance needs. Luckily, today's diagnostic analysis tools have built-in machine-learning algorithms that can reduce tens of thousands of log lines to fewer than 100 unique entries.
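To make the idea of collapsing verbose logs concrete, here is a minimal Python sketch. It is not the machine learning that commercial tools use; it swaps in a much cruder rule-based technique that masks variable-looking tokens (IP addresses, IDs, numbers) and counts the templates that remain. The log lines, the masking rules, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules, applied in order. Real tools learn templates
# statistically; these regexes are only a stand-in for illustration.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4 addresses
    (re.compile(r"\b[0-9a-fA-F]{8,}\b"), "<ID>"),           # long hex-like IDs
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),                # any remaining number
]


def to_template(line: str) -> str:
    """Replace variable tokens with placeholders so similar lines collapse."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def reduce_logs(lines):
    """Count how many raw lines fall under each template."""
    return Counter(to_template(line) for line in lines)


# Made-up sample lines in an invented format.
sample_lines = [
    "payment txn=1001 to 203.0.113.7 timed out after 30000ms",
    "payment txn=1002 to 203.0.113.7 timed out after 30000ms",
    "payment txn=1003 to 203.0.113.7 ok in 212ms",
    "payment txn=1004 to 203.0.113.7 timed out after 30001ms",
]

templates = reduce_logs(sample_lines)
for template, count in templates.most_common():
    print(count, template)
```

In this sample, four raw lines collapse into two templates, and the timeout template dominates. That is the kind of signal a NOC analyst needs to spot quickly.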

Next, the NOC explores the connectivity to the third-party provider to locate the problem. A traceroute shows that the traffic fails between your edge router and the third-party provider. A search of the network device logs, which have been centrally aggregated, shows that the IPsec tunnel to the third-party vendor keeps trying to establish itself. It succeeds 10% of the time, but the tunnel tends to get dropped shortly after a successful negotiation.
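Once tunnel events have been pulled out of the aggregated device logs, a finding like this can be quantified. The Python sketch below is illustrative only. It assumes the events have already been parsed into simple records, and the record fields, event labels, and timings are hypothetical rather than taken from any real device's log format.

```python
from dataclasses import dataclass


@dataclass
class TunnelEvent:
    ts: float   # seconds since some reference point
    kind: str   # hypothetical labels: "attempt", "established", "dropped"


def tunnel_stats(events):
    """Return (negotiation success rate, average seconds a tunnel stayed up)."""
    attempts = 0
    successes = 0
    lifetimes = []
    up_since = None
    for event in sorted(events, key=lambda e: e.ts):
        if event.kind == "attempt":
            attempts += 1
        elif event.kind == "established":
            successes += 1
            up_since = event.ts
        elif event.kind == "dropped" and up_since is not None:
            lifetimes.append(event.ts - up_since)
            up_since = None
    rate = successes / attempts if attempts else 0.0
    avg_lifetime = sum(lifetimes) / len(lifetimes) if lifetimes else None
    return rate, avg_lifetime


# Made-up data: ten negotiation attempts, one success, dropped 30 seconds later.
events = [TunnelEvent(ts=float(t), kind="attempt") for t in range(0, 100, 10)]
events.append(TunnelEvent(ts=91.0, kind="established"))
events.append(TunnelEvent(ts=121.0, kind="dropped"))

success_rate, avg_lifetime = tunnel_stats(events)
print(f"success rate: {success_rate:.0%}, average lifetime: {avg_lifetime} s")
```

A low success rate combined with short tunnel lifetimes points at the link to the provider rather than at your own application stack.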

One call to the third-party provider later, it turns out that it had a weather-related outage with adverse effects that "shouldn't have been technically possible." Lesson learned: Diagnostic analytics can help reduce the complexities encountered in a war room and cut your mean-time-to-resolution metrics.

Use descriptive analysis to collect and preprocess data to detect abnormalities

Imagine the power of a system that enables some preprocessing of all of this data—the metrics, the logs, and potentially other information—as independent data sources and as an aggregate across all data sources.

The data constantly changes, it's uncontrollable, and it may cause adverse effects on day-to-day IT operations. In this scenario, the third-party provider operates in a different region of the country. Had you been collecting and processing up-to-the-minute weather statistics from the geographic location of its data center, you would have known that there were tornado warnings and thunderstorms. That's useful information, and that's basic descriptive analytics.

The ability to collect this sort of data and preprocess it with machine learning allows you to detect abnormalities, which in turn allows you to raise internal awareness of a potentially critical situation with the payment processing system. The NOC could have used that early warning to start its diagnostic research in the area of the payment processing system (skipping the APM and application log analysis).
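As a rough illustration of abnormality detection, the Python sketch below flags points in a metric series that sit far from a trailing baseline, using a rolling z-score. This is a simple statistical stand-in, not the algorithm of any particular AIOps product. The latency values, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from
    the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: a z-score is undefined, so skip
        z = (values[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, values[i], z))
    return anomalies


# Made-up payment latencies in milliseconds; the last two are abnormal.
latencies_ms = [200, 210, 190, 205, 195, 202, 198, 2400, 2500]

flagged = rolling_zscore_anomalies(latencies_ms)
for index, value, z in flagged:
    print(f"sample {index}: {value} ms (z = {z:.1f})")
```

In this sample only the first spike is flagged, because that spike then inflates the baseline used for the next point. A production system would typically exclude flagged points from the baseline or use a more robust statistic such as the median.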

Prescriptive analysis: Time to get proactive, with precautions in place

Now it's 12 hours later, and you've apologized for missing your daughter's third birthday party. The critical incident is over, you're back online, and the team understands what went wrong. Monday rolls around, and it's time to do the post-mortem. 

The business analysts get involved, and you learn that the 12-hour outage cost $1 million in lost sales of the Fidget Spinner Elite—a FidgetCon exclusive. Yes, that hurts. If only someone on the business or marketing team had told IT that the company was having its biggest sales event ever, with a FidgetCon-exclusive product, so much pain could have been avoided. 

If the IT team had known that, your team could have scaled up the infrastructure, ensured that disaster recovery procedures were refreshed and reviewed with the critical teams, and put special action plans in place, just in case a perfect storm occurred. 

Had IT Ops known ahead of time, it could have been proactively prescriptive. In this case, that would mean engaging with the business team and understanding the business impact of any sort of outage during the biggest sales event the company ever had.

You could have looked at expected order estimates and estimated website traffic. You could have examined historical data for the production environment and prescribed a more robust web/application/payment/network environment, including at least one backup third-party provider in a different region of the country from your primary provider.
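The sizing part of that exercise is mostly arithmetic. The Python sketch below shows one simple way to turn an expected peak into an instance count with headroom. The traffic figures, per-instance capacity, and 30% headroom are invented for illustration; real values would come from your order estimates and historical production data.

```python
import math


def instances_needed(expected_peak_rps, per_instance_rps, headroom=0.3):
    """Estimate how many instances cover the expected peak plus a safety margin.

    expected_peak_rps: forecast peak requests per second (from business estimates)
    per_instance_rps:  sustained requests per second one instance handled historically
    headroom:          extra fraction of capacity to keep in reserve
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    required_rps = expected_peak_rps * (1 + headroom)
    return math.ceil(required_rps / per_instance_rps)


# Hypothetical numbers: 1,200 requests/s at peak, 150 requests/s per instance.
web_tier = instances_needed(expected_peak_rps=1200, per_instance_rps=150)
print(f"web tier instances: {web_tier}")
```

The same calculation can be repeated for each tier (web, application, payment, network) once the business team shares its order and traffic estimates.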

There are tools that can help with the specific steps your organization needs to take prescriptively to prevent or reduce problem occurrence in the future. But what's important to understand is the type of behavior that prescriptive analysis recommends.

Prescriptive and predictive analysis: A stress test for the next big event

Predictive analytics involves taking historical data and leveraging it to look into the future.

To achieve this, the connections between the business team and the IT team are critical; organizations overlook those connections at their peril. Ask Netflix how its business performs if AWS fails. Ask any of the world's largest airlines what happens when large-scale IT issues ground planes for hours or days. A perfect storm can happen anytime, but the damage can be minimized by leveraging today's technologies.

Fast-forward our example scenario by six months. The critical failure you experienced is still lodged in the back of your mind. You've worked with the business team to improve visibility into business-related actions that IT Ops needs to be ready to support. You've put in place a process and a timeline to allow you to be proactively prescriptive.

But what about that new system that just came online, the one that (you're told) isn't a big deal, at least not yet?

You're using the same third-party payment-processing vendor, and you've established a backup vendor if your primary vendor fails. This system supports your newest B2B distribution line of business, the first customer is going live soon, the production environment is built, and it's purring like the cat in the video you watched last week.

Here's the countdown to your next customer go-live event, starting at T-30 days.

T-30 days 

Time to stress-test. Okay, the system handled the stress test with absolute perfection; even the auto-scale-up feature of the compute resources occurred flawlessly. Nice job, team! You had the forethought to ensure that the production-grade monitoring was put in place before the stress test, and you were able to feed all of those metrics, logs, etc., into the analytic engine. 

T-7 days

You conduct an additional stress test. Great performance. The system is operating as designed. You've now had two successful stress tests that put the load on the environment at 10 times what it should see from the first B2B customers using the platform.

T-1 day

This is when predictive analytics start to come into play. 

Analytics detects an anomaly that, the last time it occurred, preceded a critical failure. But it is not an anomaly from IT infrastructure. This time, the anomaly is from weather data. Not again! Is this another perfect storm?

This time it's different, because a) you have historical data that you can use to predict outcomes based on real events, and b) you proactively prescribed the proper fix. In a controlled fashion, you were able to shift the processing load from the primary payment-processing vendor over to your backup vendor—for both the B2C-facing website and the new B2B platform.
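The decision described above can be sketched in Python as a comparison between the signals seen today and the signals that preceded past incidents. This is a simple set-overlap rule, not a trained predictive model. The signal names, the incident record, the overlap threshold, and the vendor labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    name: str
    precursors: frozenset  # signals observed shortly before the incident


def matching_incidents(current_signals, history, min_overlap=0.6):
    """Return past incidents whose precursor signals largely appear again now."""
    current = set(current_signals)
    matches = []
    for incident in history:
        if not incident.precursors:
            continue
        overlap = len(current & incident.precursors) / len(incident.precursors)
        if overlap >= min_overlap:
            matches.append((incident.name, overlap))
    return sorted(matches, key=lambda match: match[1], reverse=True)


def choose_vendor(current_signals, history, primary="primary", backup="backup"):
    """Route payments to the backup vendor when history suggests trouble."""
    return backup if matching_incidents(current_signals, history) else primary


# Hypothetical record of the earlier outage and the signals seen at T-1 day.
history = [
    Incident(
        name="FidgetCon payment outage",
        precursors=frozenset(
            {"tornado_warning_vendor_region", "thunderstorm_vendor_region"}
        ),
    )
]
signals_today = {"tornado_warning_vendor_region", "thunderstorm_vendor_region"}

print(choose_vendor(signals_today, history))
```

A rule like this only recommends the shift. Carrying it out in a controlled fashion still depends on the backup vendor and the failover process you put in place beforehand.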

You thought you felt good at T-7 days? Imagine how good you’ll feel seeing your daughter all smiles as she takes the bow off of that new bicycle you bought for her fourth birthday. No guarantees, but it's a lot more likely you won't miss another one because of work.

Getting started

Most organizations have the data to do all of this at their disposal, which means that predictive analytics won't be as difficult or as complex as some people make it out to be. It's just a matter of finding the right data and feeding it into an AI-based system that can crank on it with machine learning algorithms.

Many AI-based systems are available, particularly for making sense of your IT-related data and making reasonable predictions about when to update systems, when peak crunch times will occur, and the like.

The most important thing, however, is to just get started. 
