APM is becoming a big data challenge: 4 ways to deal with it

Jason Bloomberg, President, Intellyx

Remember Parkinson’s Law? It states that the amount of work you have will expand to fill the available time. In this article, we’ll discuss the big data corollary to this law: The amount of data you collect will expand to consume your ability to store and process it. In other words, if it’s possible to collect more data, somebody will. That’s just what is happening in the application performance management (APM) space and, more broadly, in IT operations management (ITOM) overall.

From log files to API-based telemetry, the quantity of data that such tools must now process is exploding. You might ask if such quantities of data qualify as big data. Just how big is big data anyway?

There is no official definition of big data, but a common one is data sets that are too large for traditional tools to store, process, or analyze. Big data, therefore, is a moving target. As tools mature, the threshold for big data continues to grow, continually pushing the boundaries of our ability to deal with increasingly large data sets.

Application performance management: Pushing into big data territory

APM has become a big data challenge, in part because in today’s digital world, applications never stand alone. Application performance depends on the performance of software infrastructure, the virtualization layer, the underlying hardware, and even the network itself. All of these supporting players generate increasingly massive quantities of data that bear on the performance of the end application. First-generation APM tools simply aren’t up to handling this volume. Managing today’s environments has become a big data problem, so it requires big data analytics techniques.

Unlike older tools, modern APM tools are always pushing the big data threshold. Generating several terabytes of ops data a day is increasingly common. Simply storing such quantities of data becomes an expensive challenge, and storage is only one piece of the big data puzzle.

Storing data for analysis falls short when the source data is derived from real-time streams. As a result, applications that operate in real time compound the challenge. APM must deal with both real-time behavior and increasingly large and complex archives of event data, which are big data challenges in and of themselves.

Case in point: A large global credit card transaction processor had an earlier-generation APM tool in place that monitored up to 2 billion events per day—far more capacity than this company required at the time. Today, however, it has to process over 30 billion events per day, and the number keeps growing.

The structure of the data also impacts the ability to analyze it. As with so many modern big data problems, such data comes in different types with different levels of structure. Some of the raw data exists as unstructured log files, while in other cases, the information the APM tool requires is the result of structured queries or various types of API calls.

Looking for the problem

Regardless of the quantities of information APM tools must now process, at their core they must answer two fundamental questions: Is there a problem? And if so, how do we fix it?

What are we dealing with?

The first question centers on the challenge of anomaly detection. To identify a problem, the APM tool must uncover anomalies in the behavior of applications and their underlying infrastructure. Traditionally, anomaly detection tools monitored infrastructure-centric data sources (log files, CPU and memory metrics, etc.), looking for spikes that might indicate a problem. When such a spike occurred, the tool would send an alert, typically to a hapless admin who had to decide the appropriate course of action.
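The traditional spike check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_spikes`, the five-sample window, the three-standard-deviation threshold, and the CPU readings are all hypothetical and do not come from any particular APM product.

```python
from statistics import mean, stdev

def find_spikes(samples, window=5, threshold=3.0):
    """Return the indices of samples that sit more than `threshold`
    standard deviations above the mean of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to judge against; skip it.
            continue
        if (samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical CPU utilization readings (percent), one per minute.
cpu_percent = [41, 43, 40, 42, 44, 41, 43, 97, 42, 40]
print(find_spikes(cpu_percent))  # prints [7]
```

Every index returned here would become an alert. Run a check like this across thousands of metrics and the volume of alerts grows accordingly.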

Over time, the sheer quantity of such alerts made discerning the important information from the noise virtually impossible. This problem of alert storms was already a difficult challenge a decade ago; today, the problem is many times worse. Next-generation, big data-centric APM tooling is absolutely essential for dealing with it.

Anomalies, however, can be more difficult to separate from the noise than expected, as one large retailer found out the hard way. Its APM tool was able to recognize normal daily and weekly traffic pattern variations with no issues. However, as Black Friday approached, traffic patterns soon departed from what the tool thought was normal behavior. Given the particular importance of uptime during Black Friday for this retailer, differentiating expected traffic spikes from anomalies that indicate a real problem is especially critical for its business.

How do we fix the problem?

The second question remains: How do APM tools address the problem? Simply identifying anomalies is only a step in the right direction. Anomalies are no more than symptoms of an underlying issue. We must also fix the root cause, which means we need to know what it is. In other words, we must perform a root-cause analysis.

Ops teams have struggled with root-cause analyses for years, but today, analyses that involve many terabytes of streaming data present unfamiliar challenges to the IT Ops organization, especially in real-time environments. Analyzing stored information alone is not sufficient, and analyzing only real-time streams doesn’t identify root causes, either. Today, it’s important to combine real-time analyses with analyses of historical data, including data that is only seconds old, to effectively uncover root causes.

Furthermore, most big data analysis seeks to identify correlations within the data. To determine causes of anomalies, however, data correlations aren’t good enough. We must separate causes from symptoms over time in order to say when systems are behaving differently and why.

As a result, APM tools must recognize time series data, which correlates events with time stamps. Only by tracing sequences of anomaly events back in time is it possible to find the root cause.
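As a rough illustration of tracing anomaly events back in time, the sketch below walks from a symptom to earlier anomalies on the components it depends on. Everything here is an assumption made for the example: the `Anomaly` record, the `DEPENDS_ON` map, the component names, and the 60-second gap are hypothetical, and real tools would derive dependencies and timing from their own data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anomaly:
    timestamp: float  # seconds since some reference point
    component: str

# Hypothetical dependency map: component -> components it relies on.
DEPENDS_ON = {
    "checkout-api": ["payment-service"],
    "payment-service": ["db-primary"],
    "db-primary": [],
}

def trace_back(symptom, anomalies, depends_on, max_gap=60.0):
    """Follow anomalies backwards in time through the dependency map.

    At each step, look for anomalies on a direct dependency that occurred
    strictly before the current one and no more than `max_gap` seconds
    earlier, then move to the earliest of them.
    """
    chain = [symptom]
    current = symptom
    while True:
        upstream = depends_on.get(current.component, [])
        candidates = [
            a for a in anomalies
            if a.component in upstream
            and 0 < current.timestamp - a.timestamp <= max_gap
        ]
        if not candidates:
            return chain
        current = min(candidates, key=lambda a: a.timestamp)
        chain.append(current)

events = [
    Anomaly(50.0, "cache"),            # unrelated, not a dependency
    Anomaly(100.0, "db-primary"),
    Anomaly(112.0, "payment-service"),
    Anomaly(118.0, "checkout-api"),    # the symptom users noticed
]
chain = trace_back(events[3], events, DEPENDS_ON)
print(" <- ".join(a.component for a in chain))
# prints checkout-api <- payment-service <- db-primary
```

The last entry in the chain is only a candidate root cause. Coming first in time does not prove causation, which is why the time stamps are combined here with knowledge of how the components depend on one another.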

APM in modern enterprise environments

Today’s enterprise IT environments are heterogeneous and hybrid, with diverse elements on premises as well as in various cloud environments. As a result, IT Ops must correlate data across silos and various infrastructure components, a task that traditional enterprise tooling cannot handle.

Traditional tools simply aren’t able to correlate data across different monitoring silos, different data structures, different time periods, or the dozens of monitoring tools found in a typical enterprise. Simply integrating this mishmash of tooling will never address such challenges.

Furthermore, anomaly detection does not consist entirely of identifying inadvertent problems. Hackers cause anomalies as well. Not only is it critical to catch attacks immediately, but any IT Ops monitoring tool must be able to uncover the activities of hackers who are trying to hide their tracks. Applying patches to vulnerabilities is necessary but not sufficient, because there is always the possibility a patch is ineffective. Don’t assume you’ve patched every vulnerability. Ops teams must continue to verify that previously resolved threat vectors remain resolved on an ongoing basis. And that is yet another big data challenge.

4 suggestions for APM and big data management

Modern digital businesses are both software-driven and customer-focused. For that reason, any digital organization must connect the dots between the customer experience and the performance of the underlying technology. Efficient, real-time anomaly detection and root-cause analysis are both important enablers of this connection.

To be successful with these modern APM strategies, consider the following suggestions:

Expect the quantity of monitored information to continue to grow

  • Don’t invest in any tool that doesn’t have a modern scalable architecture.
  • Put in place a plan to retire existing tools that aren’t up to the challenge.

Avoid "perimeter complacency"

  • Remember that problems could be anywhere: on premises, in the cloud, or with a third-party plug-in or API.
  • You can’t simply monitor your own systems and applications and call it a day.

Use machine learning that builds and continually improves its own models 

Anomaly detection depends upon understanding what normal application behavior is, but in today’s dynamic, digital environments, "normal" can be difficult to pin down.

  • You need to rely on machine learning that builds and continually improves its own models of normal behavior.
  • Be sure to select tools with cutting-edge machine learning capabilities.
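To make the idea of a model that keeps updating its own notion of normal more concrete, here is a minimal sketch that uses an exponentially weighted mean and variance. It is a simple statistical stand-in, not the machine learning any specific APM tool uses, and the class name `AdaptiveBaseline`, its parameters, and the traffic numbers are hypothetical.

```python
import math

class AdaptiveBaseline:
    """Running model of normal behavior that is updated with every sample."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # samples to see before raising any flags
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def observe(self, value):
        """Return True if `value` looks anomalous, then fold it into the model."""
        anomalous = False
        if self.count == 0:
            self.mean = value
        else:
            deviation = value - self.mean
            std = math.sqrt(self.var)
            if self.count >= self.warmup and std > 0:
                anomalous = abs(deviation) > self.threshold * std
            # Exponentially weighted update of the mean and the variance.
            step = self.alpha * deviation
            self.mean += step
            self.var = (1 - self.alpha) * (self.var + deviation * step)
        self.count += 1
        return anomalous

# Hypothetical requests per second: steady traffic, then a sudden jump.
baseline = AdaptiveBaseline()
traffic = [100, 102] * 10 + [500]
flags = [baseline.observe(v) for v in traffic]
print(flags.index(True))  # prints 20, the position of the jump
```

Because the sketch folds every sample into the model, including the flagged ones, a sustained shift in traffic is eventually absorbed as the new normal. That is a design choice with a cost: a single large spike also inflates the variance for a while and can mask smaller anomalies that follow it.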

Keep scalability in mind

Remember that the big data challenge for APM extends well beyond chasing down problems in today’s (as in "this week's") digital applications. As the Internet of Things (IoT) grows, real-time anomaly detection at scale will become increasingly important.

  • After all, IoT data feeds originate at vast numbers of sensors and controls, which raises the noise level within the data even as the quantity of data explodes.
  • Even today’s next-generation, big data-centric APM tools will need to continue to evolve to keep up.

Regardless of whether your APM challenges are IoT-related or focused on today’s digital applications, big data analytics are increasingly essential tools in your toolbox.