How IT Ops analytics can speed up troubleshooting

Robert L. Scheier Principal, Bob Scheier Associates

When thousands of users at a major information technology company lost network access, its IT operations staff spent days sifting through millions of rows of network performance data in search of the cause.

But then, to speed the troubleshooting, they fed that data to an operational analytics tool, which applied machine-learning algorithms to logs from the network devices that were generating incidents. Within 20 minutes, it had alerted the team to a misconfigured router that was the key to ending the outage.

This is the new world of IT troubleshooting, where smart machines and smart people work together to more quickly and efficiently solve IT operations management problems that slow performance and drive up costs. Having IT operations analytics in place—and using it effectively—is critical if your IT Ops team is to keep up with the pace of DevOps. Fortunately, it's not hard to get started.

In IT operations analytics, smart machines do the heavy analytics on gigabytes of operational and fault data, sifting the most valuable clues from irrelevant or misleading alerts to reveal the most significant information in millions of lines of log data. The smart people then focus on the most promising leads and eliminate the underlying problems.

IT Ops analytics speeds the resolution of common outages and identifies the causes of sudden spikes in the cost of IT infrastructure by using the four modes of operational analytics: descriptive, diagnostic, predictive, and prescriptive.

By using this fast problem-resolution technique, your IT operations staff can keep pace with the rest of the DevOps team by delivering and assuring the performance of applications quickly and flexibly to reduce costs and improve customer satisfaction.

Descriptive analytics

As the name implies, descriptive analytics makes sense of what has happened, or is happening, in your IT infrastructure. It uses dashboards, quick reports, data consolidation, search and data mining, scorecards, and key performance indicators to present data so you can more precisely understand the nature and extent of operational problems and how efficiently your infrastructure is being utilized.

For example, a system administrator might need to identify how many resources an application uses in an average month as a guide for capacity planning. Descriptive analytics might provide the answers in the form of monthly usage and activity trends, highest and lowest resource utilization, the amount of storage each virtual machine consumed, and the amount and type of cloud resources the application used.
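
As a rough illustration of that kind of summary, here is a minimal Python sketch that turns raw utilization samples into per-month average, peak, and low figures. The data, the metric (CPU percent), and the function name are assumptions made up for this example; a real tool would pull the samples from its own monitoring store.

```python
from statistics import mean

# Hypothetical CPU utilization samples (percent) for one application,
# grouped by month. The numbers are invented for illustration.
usage_by_month = {
    "2024-01": [42.0, 55.5, 61.0, 48.5],
    "2024-02": [50.0, 72.5, 66.0, 58.5],
}

def summarize(samples_by_month):
    """Return the average, peak, and lowest utilization for each month."""
    summary = {}
    for month, samples in samples_by_month.items():
        summary[month] = {
            "avg": round(mean(samples), 2),
            "peak": max(samples),
            "low": min(samples),
        }
    return summary

report = summarize(usage_by_month)
for month, stats in sorted(report.items()):
    print(month, stats)
```

The same pattern extends to storage consumed per virtual machine or cloud resources used: group the samples, then report the trend and the extremes.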

Diagnostic analytics

Diagnostic analytics goes the next step by helping to pinpoint the cause of an existing problem. It includes automated root-cause analysis, intelligent notification, collaborative investigation, and real-time analytics. Diagnostic analytics involves identifying unknown trends, automating the collection and configuration of information, and reducing the mean time to repair (MTTR).

Using operational analytics, one organization expected to reduce MTTR for a slowdown in a critical customer-facing online application. Previously, application, database, Unix, network, and storage experts took as long as 36 hours to identify the cause of a slowdown, then needed another two weeks to clear the backlog of orders not processed during the slowdown.

Based on a simulation of the same slowdown, the company found that using operational analytics to automatically analyze data from multiple sources and look for underlying patterns could identify the root cause in less than 30 minutes.
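
Real root-cause analysis tools correlate many data sources, but the basic idea of narrowing millions of log lines down to a few suspects can be sketched simply. The Python example below counts error events per device and ranks the devices; the log format, device names, and function name are all hypothetical, and this is a first-pass filter rather than the method any particular product uses.

```python
from collections import Counter

# Hypothetical, simplified log lines in the form:
# "<timestamp> <device> <severity> <message>"
log_lines = [
    "2024-03-01T10:00:01 router-7 ERROR route flap detected",
    "2024-03-01T10:00:02 switch-2 INFO link up",
    "2024-03-01T10:00:05 router-7 ERROR route flap detected",
    "2024-03-01T10:00:09 switch-4 ERROR port timeout",
    "2024-03-01T10:00:12 router-7 ERROR bgp session reset",
    "2024-03-01T10:00:15 switch-2 WARN high latency",
]

def rank_devices_by_errors(lines):
    """Count ERROR lines per device and return devices, noisiest first."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue  # skip malformed lines
        device, severity = parts[1], parts[2]
        if severity == "ERROR":
            counts[device] += 1
    return counts.most_common()

ranked = rank_devices_by_errors(log_lines)
print(ranked)  # [('router-7', 3), ('switch-4', 1)]
```

The device at the top of the list is only a lead for a human expert to investigate, not a confirmed root cause.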

Predictive analytics

The more information you have about what causes problems, the better equipped you are to prevent them. This is where predictive analytics comes into play, using machine learning to identify the IT operations data associated with past problems and leveraging those insights to suggest when another failure is likely.

Predictive analytics can be delivered in the form of a digital “smoke detector” that provides an early warning for problems, accurately predicts failures or system overloads, and identifies which combinations of hardware and software or sequences of events are most likely to cause specific types of problems. The most effective predictive analytics takes into account real-world context, often provided by human experts, such as the effect of outside events like weather on system outages.

A multinational data storage company used operational analytics to improve uptime for a complex “phone home” system that alerted it to problems with a customer’s storage hardware. The complexity of the system and the amount of data flowing through it had made it difficult to find and fix problems before they caused a system failure.

The company now uses machine learning to create dynamic baselines of all application metrics and log data, learning over time what “normal” system behavior looks like. It also uses machine learning to identify problematic trends, such as rising file system utilization in a key server, that could signal a coming failure. In this way, the company reduced staff costs of troubleshooting by more than 60% while increasing uptime and customer satisfaction.
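
The two ideas in that example, a baseline that adapts over time and a trend that points toward a failure, can be approximated with simple statistics. The Python sketch below is not the company's system, which the article does not describe in detail. It swaps in a trailing-window mean and standard deviation for the baseline and a least-squares line for the trend. The sample data, window size, and thresholds are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_anomalies(values, window=5, threshold=3.0):
    """Flag indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        spread = pstdev(history)
        if spread == 0:
            continue  # flat history: no spread to compare against
        if abs(values[i] - baseline) / spread > threshold:
            anomalies.append(i)
    return anomalies

def days_until_threshold(daily_percent, threshold):
    """Fit a least-squares line to daily utilization and estimate how many
    days remain until it reaches `threshold`. Returns None if not rising."""
    n = len(daily_percent)
    x_mean = (n - 1) / 2
    y_mean = mean(daily_percent)
    covariance = sum((x - x_mean) * (y - y_mean)
                     for x, y in enumerate(daily_percent))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None
    fitted_today = y_mean + slope * ((n - 1) - x_mean)
    return (threshold - fitted_today) / slope

# Hypothetical response times (ms) with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(find_anomalies(latency_ms))            # [7]

# Hypothetical file system utilization (percent), one sample per day.
fs_percent = [70.0, 72.0, 74.0, 76.0, 78.0]
print(days_until_threshold(fs_percent, 90))  # 6.0
```

Because the baseline is recomputed from recent history at every step, "normal" shifts as the system's behavior shifts, which is the point of a dynamic baseline as opposed to a fixed alert threshold.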

Prescriptive analytics

Once analytics has described your problem, identified the root cause, and predicted when it will recur, the next obvious question is: How do you prevent it from happening in the first place? Prescriptive analytics suggests specific steps that are likely to prevent or mitigate specific outages or cost spikes.

One example is self-learning algorithms that suggest the optimal configuration of virtual machines based on factors such as workload, performance, location, and power consumption. The same types of algorithms can also learn from past events to suggest best practices in remediating outages or security events.

One large managed service provider uses analytics to predict usage and failure patterns and take steps to avoid future problems with the hardware used by its customers. By identifying issues with, say, a particular disk drive vendor, drive models, or drives of a certain age, the provider can proactively replace hardware or sell services such as refreshing aging or problem hardware.
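
To make the drive example concrete, the following Python sketch groups a hardware inventory by an attribute, computes the failure rate for each group, and recommends the groups that exceed a cutoff for proactive replacement. The inventory records, vendor names, and the 25% cutoff are hypothetical; the article does not say how the provider actually scores its hardware.

```python
from collections import defaultdict

# Hypothetical drive inventory; vendor names and outcomes are invented.
drives = [
    {"vendor": "VendorA", "age_years": 5, "failed": True},
    {"vendor": "VendorA", "age_years": 5, "failed": True},
    {"vendor": "VendorA", "age_years": 2, "failed": False},
    {"vendor": "VendorA", "age_years": 1, "failed": False},
    {"vendor": "VendorB", "age_years": 5, "failed": False},
    {"vendor": "VendorB", "age_years": 2, "failed": False},
    {"vendor": "VendorB", "age_years": 2, "failed": False},
    {"vendor": "VendorB", "age_years": 1, "failed": False},
]

def failure_rates(records, key):
    """Return the fraction of failed drives for each value of `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["failed"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}

def groups_to_replace(rates, max_rate=0.25):
    """Recommend groups whose failure rate exceeds the assumed cutoff."""
    return sorted(group for group, rate in rates.items() if rate > max_rate)

by_vendor = failure_rates(drives, "vendor")
by_age = failure_rates(drives, "age_years")
print(groups_to_replace(by_vendor))  # ['VendorA']
print(groups_to_replace(by_age))     # [5]
```

The step from a rate to a recommendation is what makes this prescriptive, not merely descriptive: the output is an action list, not just a chart.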

Up and running with IT Ops analytics: A four-step plan

Implementing operational analytics can seem daunting, but it doesn’t have to be. Four effective initial steps are to do the following:

  • Arm yourself with a rough return-on-investment calculation, and then go to business leaders to get buy-in. Be sure to account for what could be significant investments in storage hardware, big-data databases and/or file systems, data collection technologies, and analytics tools. Size your analytics environment appropriately for growth, and determine how much historical data to preserve.
  • Factor in how you’ll staff for the capabilities you’ll need. If you want a general-purpose analytics platform that can be used for business as well as IT analytics, you’ll need to hire data scientists to create specific algorithms. If you just need analytics for typical IT challenges, however, you may be able to rely on out-of-the-box tools. In either case, you’ll need people on staff who are trained on how to interpret the data.
  • Manage your own expectations, and those of management, and build support with early wins. Some common first uses might include capturing historical data-flow behavior of netflow traffic to detect possible security breaches, proactively finding application performance anomalies that could signal problems, and eliminating monitoring “noise” to focus on the most significant data in root-cause analysis.
  • Measure and document the cost and time of problem-solving before you deploy IT Ops analytics. This makes it easier to prove the value of your initial analytics investment, and to build momentum for future progress.
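
The rough return-on-investment calculation from the first step, combined with the before-and-after measurements from the last, can be as simple as the Python sketch below. It borrows the 36-hour and 30-minute resolution times from the diagnostic example earlier in the article; the incident count, hourly cost, and investment figures are placeholders to be replaced with your own measurements.

```python
def annual_savings(incidents_per_year, mttr_before_hours,
                   mttr_after_hours, cost_per_hour):
    """Estimate yearly savings from shorter mean time to repair."""
    hours_saved = mttr_before_hours - mttr_after_hours
    return incidents_per_year * hours_saved * cost_per_hour

def simple_roi(savings, investment):
    """Return ROI as a fraction: net gain divided by investment."""
    return (savings - investment) / investment

# 36 hours and 0.5 hours come from the article's diagnostic example.
# The other inputs are hypothetical placeholders.
savings = annual_savings(
    incidents_per_year=10,
    mttr_before_hours=36,
    mttr_after_hours=0.5,
    cost_per_hour=1_000,
)
roi = simple_roi(savings, investment=200_000)
print(f"Estimated savings: ${savings:,.0f}")  # Estimated savings: $355,000
print(f"First-year ROI: {roi:.1%}")           # First-year ROI: 77.5%
```

This deliberately ignores secondary effects such as the two-week order backlog, so treat the result as a conservative starting point for the conversation with business leaders.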

Big data isn’t just for fine-tuning the operation of industrial equipment, performing genetic sequencing, or analyzing sales trends. Used right, it can help to ensure that your IT infrastructure is running properly and that when slowdowns or outages occur, you can fix them quickly.
