What IT Ops needs to know about anomaly detection: Better security and ops

IT operations teams are constantly bombarded by security threats, as well as by simple inefficiencies in day-to-day operations. Anomaly-detection tools can help mitigate both. Unfortunately, many IT operations organizations don't fully understand the benefits of IT operations analytics technology.

Here’s what IT operations management teams need to know.

IT operations analytics: A real-world demonstration of root cause analysis [video]

Anomaly detection in a nutshell

Eric Ogren, senior security analyst at 451 Research, describes anomaly detection as “security analytics.” It’s actually about more than just anomalies, because anomalies are not necessarily security events, he explains.  “It’s more of, ‘How do you determine if an outlier activity is a security issue?’” 

Within two years, Ogren predicts, analytics will drive most organizations’ security strategies as operations teams use insights gleaned from analytics to apply preventive measures. “It will be analytics first, and then more pinpoint, siloed-type approaches based on what the analytics tell you,” he says.

In the context of IT operations, anomalies are important because they are a leading indicator that something’s not normal. “An anomaly is something that happens that’s different than what you’ve seen before,” says Gary Brandt, technology evangelist at Hewlett Packard Enterprise. “That doesn’t necessarily mean that it’s a bad thing or a good thing—it’s just different. It’s abnormal. It’s an anomaly.”

Anomaly detection is the process of finding patterns in data that don’t conform to a model of normal behavior. Unless you’re a data scientist or practitioner familiar with tools that offer algorithms for pattern recognition, the principles behind anomaly detection may seem obscure and unapproachable. But the benefits are clear.

7 key benefits of anomaly detection

The benefits of anomaly detection include the ability to:

  • Monitor any data source, including user logs, devices, networks, and servers.
  • Rapidly identify zero-day attacks as well as unknown security threats.
  • Find unusual behaviors across data sources that are not identified when using traditional security methods.
  • Automatically generate alerts that identify key outliers.
  • Discover anomalies in event streams, such as web traffic, using historical data.
  • Analyze various data features, including information from users, hosts, and agents, as well as response times.
  • Identify rogue users, i.e., internal employees granted privileges that they shouldn’t have, by comparing their behaviors to a baseline of normal behaviors.

Anomaly detection lets you identify when a metric is behaving differently, taking into account such things as seasonal day-of-week and time-of-day patterns and trends.

For example, anomaly detection can help you detect unusually high CPU utilization levels at any moment in time.

“In a normal running state, an application or a piece of software has things that it’s doing,” says Torrey Jones, principal consultant at Greenlight Group. “Users may be interacting with it, it may be doing some back-end processing—it’s doing stuff.”

That normal processing over time creates seasonal patterns. So Monday at 8 a.m., when everybody gets to the office and logs into the system, there’s a bit of an uptick in CPU utilization, which dies off to a steady state as the day goes on, he says.

Then Tuesday morning comes along, people sign back in and look to see what they have to do for the day. CPU usage goes up, then settles at some normal state. But what if that company is in retail and is having a big sale?

“CPU utilization is going to spike through the roof because of the additional load being placed on the application,” Jones says. “In predictable situations such as on Black Friday, you can assume ahead of time that that’s going to occur.”

You can determine what your peak utilization rate is going to be, and what your steady normal is going to be, Jones says. “And you know what normal is because you collect these metrics over time, so you have three months’, six months’, or a year’s worth of data.”
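The seasonal baselining Jones describes can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the function names, the hour-of-week slotting, and the 3-sigma threshold are all assumptions made for the example. It groups historical samples by their slot in the week and flags readings that deviate sharply from that slot's history:

```python
from statistics import mean, stdev

def hour_of_week_baseline(samples):
    """Build a per-slot baseline from (hour_of_week, value) samples.

    hour_of_week runs 0..167 (Monday 00:00 = slot 0), so Monday 8 a.m.
    readings are only ever compared with other Monday 8 a.m. readings.
    Returns {slot: (mean, stdev)} for slots with at least two samples.
    """
    buckets = {}
    for slot, value in samples:
        buckets.setdefault(slot, []).append(value)
    return {slot: (mean(vals), stdev(vals))
            for slot, vals in buckets.items() if len(vals) >= 2}

def is_anomalous(baseline, slot, value, threshold=3.0):
    """Flag a reading whose z-score against its seasonal slot exceeds threshold."""
    if slot not in baseline:
        return False  # no history for this slot yet, so nothing to compare against
    mu, sigma = baseline[slot]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold
```

With three to twelve months of data, as Jones suggests, each slot accumulates enough samples for the mean and spread to be meaningful.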

That’s where analytics comes in, Brandt says: to help you learn about the behaviors of the application or system or database or a combination of those things from the historical data. Understanding the behaviors enables you to proactively find trends and see things that are going on before they become problems.

Anomaly detection methods

Machine learning can be used to learn the characteristics of a system from observed data, helping to speed up detection. Machine-learning algorithms not only learn from the data, but they’re also able to make predictions based on it. Machine learning for anomaly detection includes techniques that let you effectively detect and classify anomalies in large, complex data sets.
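The learn-and-predict idea can be sketched with a simple online model. This is an illustration only, not a production detector: the exponential-smoothing approach, the class name, and the `alpha` and `tolerance` values are all assumptions chosen for the example, not recommended settings.

```python
class EwmaDetector:
    """Learn a running estimate of a metric and flag large prediction errors."""

    def __init__(self, alpha=0.3, tolerance=20.0):
        self.alpha = alpha          # how quickly the model adapts to new data
        self.tolerance = tolerance  # allowed absolute deviation from the prediction
        self.estimate = None

    def observe(self, value):
        """Return True if value deviates from the learned estimate, then update."""
        if self.estimate is None:
            self.estimate = value   # first observation seeds the model
            return False
        anomalous = abs(value - self.estimate) > self.tolerance
        # Update with every observation so the estimate tracks gradual drift.
        self.estimate = self.alpha * value + (1 - self.alpha) * self.estimate
        return anomalous
```

The model both learns from the stream and predicts the next value; the anomaly signal is simply a prediction error that exceeds the tolerance.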

Other anomaly-detection methods include sequential hypothesis tests, such as cumulative sum charts and sequential probability ratio tests, for detecting changes in the distributions of real-time data and setting alert parameters.
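A one-sided cumulative sum (CUSUM) chart, one of the sequential tests mentioned above, can be sketched in a few lines. The slack value `k` and decision threshold `h` below are illustrative assumptions; in practice both are tuned to the metric being watched.

```python
def cusum_alerts(values, target, k=0.5, h=5.0):
    """One-sided CUSUM: accumulate deviations above a target level.

    k is the slack (drift allowance) and h the decision threshold, in the
    same units as the data. Returns the indices where the cumulative sum
    crosses h, signaling a sustained upward shift in the distribution.
    """
    s = 0.0
    alerts = []
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - k))  # small dips reset toward zero
        if s > h:
            alerts.append(i)
            s = 0.0  # reset the chart after signaling
    return alerts
```

Because small deviations accumulate, CUSUM catches gradual shifts that a single-point threshold would miss.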

“So what we’ve done with anomaly detection, what a lot of vendors have done, is we baseline,” Brandt says. “We have a lot of data coming in, especially like a time series or a stream of data, and we dynamically baseline things. Then by applying different algorithms like seasonality algorithms on top of that, we can determine what is normal behavior with some kind of a range or a band.”

If something falls outside the baseline, it’s an anomaly. While one anomaly may be acceptable and two anomalies might be questionable, three, four, five, or more anomalies are likely indications that something is going wrong, says Jones. “So what do we do with this information? That’s where IT operations teams come into play."
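The "one anomaly is noise, several in a row are a problem" rule of thumb that Jones describes reduces to a small piece of alerting logic. The threshold of three consecutive flags below is an assumption mirroring his example, not a universal setting:

```python
def should_escalate(flags, threshold=3):
    """Escalate only after `threshold` consecutive anomaly flags.

    flags is a sequence of booleans from any detector. Isolated flags
    are tolerated; only an unbroken run triggers escalation to a human.
    """
    run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0  # a normal reading resets the streak
        if run >= threshold:
            return True
    return False
```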

You can detect these things and notify somebody who looks at them, whether that means a person proactively going out and looking for abnormalities or the system identifying the abnormalities and alerting somebody. “So something gets noticed and a human gets engaged,” Jones says.

Tools for anomaly detection

IT operations managers researching anomaly-detection tools will find a wide range of products.

Specialized tools

On one side are platforms whose capabilities are made available to staff members trained in building very sophisticated types of analytics. “A data scientist or a statistician or somebody who understands mathematically the algorithms and what they’re trying to do uses that tool or that platform to build something very specific for a particular need. It’s kind of like giving them a box of Legos,” Brandt says.


Turnkey products

On the other end of the spectrum are more turnkey products that are purpose-built for different use cases in specific industries. The users are operators or individuals who aren’t responsible for designing algorithms.

“They are the beneficiaries of the information that the algorithms provide that helps them with their core tasks or duties,” Brandt says. “It’s more like giving them a Lego building rather than a box of Legos, with some levels of configuration. Then there are tools that offer kind of a mix.”

There are also a few different classifications of breach-detection technologies, says Josh Zelonis, senior analyst at Forrester Research.

Technologies deployed on endpoints include any type of anti-malware or antivirus technology, a segment of endpoint protection platforms, as well as endpoint detection and response technology, he says. “With each of those, you’re going to be getting actual technologies that are built into these tools that are really just information-gathering tools from the endpoint.”

Based on that, you can perform user behavior analysis or use machine learning to pull out anomalies. A network stack can provide this type of information as well, he adds. “This can be anything from monitoring DNS queries on your network to getting a lot deeper in the stack and unpacking network traffic to perform analysis on where machines are communicating. That is: What is this machine usually talking to? Why is this different? And that’s really the anomaly,” Zelonis says.
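The "what is this machine usually talking to?" question can be sketched as a simple baseline comparison. This is an illustration only; real network analytics would weigh frequency, ports, and timing rather than bare set membership, and the host names here are hypothetical.

```python
def new_peers(history, current):
    """Compare each host's current peer set against its learned baseline.

    history: iterable of (src, dst) pairs from the baseline window.
    current: iterable of (src, dst) pairs from the latest window.
    Returns {src: set_of_previously_unseen_dsts} — destinations a host
    has never been observed talking to before.
    """
    baseline = {}
    for src, dst in history:
        baseline.setdefault(src, set()).add(dst)
    suspicious = {}
    for src, dst in current:
        if dst not in baseline.get(src, set()):
            suspicious.setdefault(src, set()).add(dst)
    return suspicious
```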

Deception technology

The third class of tools, deception technology, has been making a comeback. “The key to the deception technology, or honeypot, is that you’re putting fake assets into the infrastructure, which is not only fake network devices, but you can even have deception in files and have fake files on the network,” Zelonis says. “There are backup vendors that will deploy fake Excel spreadsheets, and if one of those files changes, they assume it’s ransomware and start rolling the machine back.”
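A minimal sketch of the fake-file idea Zelonis describes: fingerprint the decoy files once, then treat any content change as a possible ransomware trigger. The function names and the SHA-256 choice are assumptions made for illustration, not any vendor's actual mechanism.

```python
import hashlib
from pathlib import Path

def fingerprint(paths):
    """Record a SHA-256 hash of each decoy file's contents."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}

def changed_decoys(baseline, paths):
    """Return decoy files whose contents no longer match the baseline.

    In the article's example, any hit here would prompt the backup system
    to assume ransomware and start rolling the machine back.
    """
    current = fingerprint(paths)
    return [p for p in paths if current[p] != baseline.get(p)]
```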

Too many cooks?

Before you implement any anomaly-detection tool, be sure it has the ability to collect and aggregate all structured and unstructured data points, Jones says. “There may be multiple pieces of data that, when combined, create the abnormality."

Today, many IT organizations have siloed point products that independently collect different types of data.

“You may have one solution that’s collecting server monitoring information, i.e., CPU, memory, disk utilization, for instance,” Jones says. “You may also have a separate, completely unrelated system collecting weather data, or maybe you’re just going to source it from the National Oceanic and Atmospheric Administration. And maybe you have a third system monitoring the performance from the user’s perspective, i.e., how long it takes for a specific application to display and operate for the user.”

That means you have three pillars, or silos, of data that are not connected. Therefore, at the analytics layer, you have to collect the data and have an analytics engine analyze the data. You might do that using a manual process with a statistical mechanism such as the R programming language, or by using a commercial product, Jones says. A lot of times that analysis step also involves data aggregation, meaning you take data from all three of these sources and put them together.

“We correlate the time for that data, meaning that we want to make sure that if we’re looking at server metric information at 9 a.m. Monday morning, we’re also looking at weather information Monday morning at 9 a.m.,” Jones says. “You have to correlate the data together by way of time and time series. So you slice each data point into time chunks, effectively correlating it to other data points collected in that same time period.”
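Slicing data points into common time chunks, as Jones describes, can be sketched as a small join on truncated timestamps. Hourly buckets and the source names are assumptions chosen for the example; the bucket size would be tuned to the data.

```python
from collections import defaultdict

def correlate_by_hour(**sources):
    """Join several metric streams on a shared hourly time bucket.

    Each source is a list of (timestamp, value) pairs with epoch-second
    timestamps. Returns {bucket: {source_name: [values]}}, so the
    analytics layer sees, say, server, weather, and user-experience
    data for the same window side by side.
    """
    joined = defaultdict(dict)
    for name, samples in sources.items():
        for ts, value in samples:
            bucket = ts - (ts % 3600)  # truncate the timestamp to the hour
            joined[bucket].setdefault(name, []).append(value)
    return dict(joined)
```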

Then you need a visualization layer to visualize the analysis. That might consist of a raw table of dates and times and values, or it could be a chart or line graph. “Collect the data, analyze the data, and visualize it. The analysis is where the algorithm comes in to determine what is normal and what is abnormal," says Jones.

Tailor anomaly detection to your needs

Anomaly detection lets you identify patterns in data that fall outside what’s normal for your systems, so you can detect fraud or network intrusion and take appropriate action. Anomaly-detection tools let you monitor any data source and rapidly identify unusual behaviors that you otherwise could not find using traditional security techniques.

So now you’re ready. But before you deploy anomaly detection, research the methods to determine which is best for your organization. Don’t rely on potentially outdated marketing materials and product briefs, analysts say. 

Rather, ask vendors how their tools work to identify anomalies and how that fits with your company’s needs. Then ask for a one-on-one product demonstration. In this way, you can determine firsthand whether any given product will contribute to the success of your organization.
