How metrics-based monitoring can help prevent your next outage

Dave Cliffe, Product Manager, PagerDuty

Every business monitors operational metrics to ensure its infrastructure runs smoothly. Unexpected changes and anomalies in these metrics trigger an investigation and (sometimes) an urgent response—or at least, they should. For example, Amazon sounds the alarm when there is a discernible drop in orders per second. Netflix monitors video stream starts per second and alerts the appropriate team if something seems amiss.

There is no one-size-fits-all strategy to determine what kind of outage or metrics change deserves an urgent response: The service you provide to your customers likely has very different key performance indicators than orders or streams. But every operationally mature organization must implement alerts driven by business metrics because they are what matter most.

Hitting the big red button

Operational metrics are great for identifying precursors to potential business-impacting outages. Unfortunately, it's often difficult to understand what those metrics tell you. For instance, a CPU pegged at 100 percent might mean an overworked server (bad) or optimal resource usage (good). Many seasoned network operations center (NOC) personnel operate with a business metrics sixth sense—they know which combination of operational metrics is an indicator of something bad happening—but that's a difficult skill to pass along to the rest of the on-call team.

This sixth sense is a reminder that humans are often responsible for making outage-related decisions. Whether in an NOC or on a distributed on-call team, humans decide when to investigate and triage. They decide when to launch an urgent, coordinated response. These decisions should be data driven, but relying on operational metrics without business metrics results in hitting the big red button either too often or not soon enough. Understanding that plummeting business metrics are an indicator of real customer impact will greatly improve the team's ability to evaluate the urgency of a response and find the signal in the noise.

Get ahead by monitoring metrics in real time

How do you incorporate business metrics into your triage decisions? First, you monitor business metrics in real time. CFOs, business analysts, and product managers already look at this data on a regular basis, maybe even daily. The key is to implement a system that operationalizes that data. An e-commerce company, for example, relies on shopping cart metrics. On a typical day, customers fill their carts with thousands of items. What happens if all customer shopping carts suddenly show up as empty? Here's a hint: Something is wrong and requires immediate attention.
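As a rough illustration of what "operationalizing" a business metric could look like, the sketch below compares each new sample of a metric (say, items added to carts per minute) against a rolling average of recent normal samples and flags a sudden drop. The class name, the window size, the 50 percent threshold, and the sample values are all assumptions made for the example; a real system would also need to account for seasonality and expected quiet periods.

```python
from collections import deque
from statistics import mean


class DropDetector:
    """Flags a sample that falls far below the recent rolling average.

    Hypothetical helper for illustration; not a PagerDuty API.
    """

    def __init__(self, window=5, max_drop=0.5):
        self.history = deque(maxlen=window)  # recent "normal" samples
        self.max_drop = max_drop  # fraction of the baseline we tolerate losing

    def observe(self, value):
        """Return True if `value` is an anomalous drop versus the baseline."""
        if len(self.history) < self.history.maxlen:
            self.history.append(value)  # still warming up, no baseline yet
            return False
        baseline = mean(self.history)
        is_drop = baseline > 0 and value < baseline * (1 - self.max_drop)
        if not is_drop:
            # Only normal samples update the baseline, so an ongoing
            # outage does not become the new "normal".
            self.history.append(value)
        return is_drop


# Made-up samples: cart items per minute, ending with every cart empty.
detector = DropDetector(window=5, max_drop=0.5)
samples = [1000, 1040, 980, 1010, 995, 1020, 0]
flags = [detector.observe(v) for v in samples]
```

With these made-up numbers, only the final sample is flagged. The first five samples establish the baseline, and the sixth falls within the tolerated range.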

There's a logical question that follows: Who is responsible for fixing the problem? Several teams may contribute to a particular business goal, so any one of them may be the appropriate responder. Defining an automated spray-and-pray response for severe cases, one that alerts every team that might be involved, can be incredibly valuable and can cut mean time to resolution significantly. But it comes with a price: business metrics monitoring must be rooted in reliability. Triggering this type of large-scale immediate response for every data inconsistency is a slippery slope toward alert fatigue, not just for individuals but for the organization as a whole.
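One way to keep a broad response from firing on every data inconsistency is to require that a drop be both sustained and severe before paging everyone. The function below is a minimal sketch of that idea. The tier names, the 80 percent severity threshold, and the three-sample minimum are hypothetical values chosen for illustration, not recommendations.

```python
def choose_response(drop_fraction, consecutive_bad_samples,
                    severe_drop=0.8, min_sustained=3):
    """Pick a response tier for a business-metric drop.

    drop_fraction: share of the baseline lost (0.0 to 1.0).
    consecutive_bad_samples: how many samples in a row looked anomalous.
    All thresholds are illustrative assumptions.
    """
    if consecutive_bad_samples < min_sustained:
        return "watch"             # possibly a blip or a data inconsistency
    if drop_fraction >= severe_drop:
        return "page_all_teams"    # broad, coordinated response
    return "notify_owning_team"    # sustained, but limited impact


# A single bad sample, however dramatic, does not page everyone.
first_blip = choose_response(drop_fraction=1.0, consecutive_bad_samples=1)
# The same drop, sustained over three samples, does.
sustained = choose_response(drop_fraction=1.0, consecutive_bad_samples=3)
```

The trade-off is a short delay before the large-scale response fires, in exchange for fewer false alarms across the organization.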

Going customer-first

Reducing downtime requires shifting perspective to a business metrics-minded, customer-first approach to monitoring. Operationally mature companies must define metrics that reflect business priorities, monitor those metrics in real time, detect anomalies, and trigger the appropriate response.

Again, depending on the situation, a CPU running at 100 percent could be a terrible thing (a precursor to an outage) or a great thing (maximum resource usage). Organizations won't know which unless they adopt a system driven by business metrics. Only when organizations understand and monitor their most important priorities can they reliably serve their customers.

