Building an Alerting System That Really Works

Building a high-quality alerting system often feels like a dark art. It is hard to set proper thresholds, and even harder to define when an alert should be triggered. The result is alerts that are raised too early or too late, and colleagues who lose faith in the system. With a structured approach, building an alerting system becomes much easier, and the alerts become more predictable and precise.

Measure Selection

First you have to select proper measures to alert on. This selection is key as all other steps depend on using meaningful measures. While there seems to be an infinite number of different measures, you can categorize them into three main categories:

Saturation measures indicate how much of a resource is used. Examples are CPU usage or resource pool consumption.
Occurrence measures indicate whether a condition was met or not. A good example is errors. These measures are often presented as a rate, like failed transactions per second.
Continuous measures do not have a single value at any given point in time, but instead a large number of different values. A typical example is response times. Irrespective of how small you make the sample, you will always have a large number of values and never just one single representative value.

These different categories are not equally suited for triggering alerts. Saturation measures are often bad candidates for triggering alerts, mostly because using all your resources is not a bad thing per se. It only becomes a problem when you want to use more than you actually have. So instead of using the saturation measure directly, you are better off monitoring the effect of a resource being saturated.

In a real-world example, this means that connection acquisition time is a better indicator of a problem than connection pool usage. An even more extreme example is CPU consumption. Instead of measuring CPU consumption, it makes more sense to measure whether all processes get as much CPU as they need (load average). In the worst case your CPU usage will be below the threshold while your load average indicates a severe problem.
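To make the load average idea concrete, here is a minimal Python sketch. The function names and the rule of thumb (load average above the number of cores means processes are waiting for CPU) are assumptions for illustration, not something this article prescribes, and os.getloadavg is only available on Unix-like systems.

```python
import os


def cpu_starved(load_average: float, cpu_count: int, factor: float = 1.0) -> bool:
    """Return True when more processes want CPU time than the cores can serve.

    Hypothetical rule of thumb: a load average above factor * cpu_count
    means runnable processes are queuing for CPU.
    """
    return load_average > factor * cpu_count


def check_this_host() -> bool:
    """Read the 1-minute load average of the local machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return cpu_starved(one_minute, os.cpu_count() or 1)
```

Note that this check never looks at CPU utilization at all. It alerts on the effect (processes waiting) rather than on the saturation measure itself.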

Defining Normal Behavior

Once the proper measures are selected you have to choose the values that define what normal behavior is.

For rate measures this is pretty simple. You take the rate values over a defined time range and calculate an aggregate over the measurements. In the simplest case this is the average.
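A minimal sketch of that calculation in Python follows. The input shape (one pair of failed and total counts per interval) and the sample numbers are made up for illustration.

```python
from statistics import mean


def baseline_error_rate(samples):
    """Average error rate over a reference period.

    samples: iterable of (failed, total) counts, one pair per interval.
    Intervals without any traffic are skipped to avoid dividing by zero.
    """
    rates = [failed / total for failed, total in samples if total > 0]
    return mean(rates)


# Hypothetical reference data: three one-second intervals.
history = [(290, 10_000), (310, 10_000), (300, 10_000)]
baseline = baseline_error_rate(history)
print(f"baseline error rate: {baseline:.3f}")
```

Averaging the per-interval rates treats every interval equally. If traffic varies a lot between intervals, dividing total failures by total transactions may be the better aggregate.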

Defining normal behavior for continuous measures is much harder because there is not one normal. There are a lot of different values which, as a whole, are considered normal. So you want to define a reference value or, even better, a reference range, something like “my response times are between 500 and 600 milliseconds.”

The most common approach is to use the mean as the reference value and n times the standard deviation to define the range. Let’s assume your response times follow a normal distribution with a mean of 500 ms and a standard deviation of 50 ms. This means that, as shown below, about 68 percent of all requests are expected to be between 450 and 550 ms. There is still a 32 percent likelihood for a value to be outside of this range.
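The mean plus or minus n standard deviations range can be sketched in Python as follows. The function names are hypothetical, and normal_coverage only holds if the data really is normally distributed.

```python
import math
from statistics import mean, stdev


def reference_range(values, n=1.0):
    """Return (mean - n * stdev, mean + n * stdev) for the observed values."""
    m = mean(values)
    s = stdev(values)
    return m - n * s, m + n * s


def normal_coverage(n):
    """Fraction of a normal distribution within n standard deviations of the mean."""
    return math.erf(n / math.sqrt(2))


low, high = reference_range([450, 500, 550], n=1.0)
print(low, high)
print(f"{normal_coverage(1):.1%} within 1 sigma, {normal_coverage(2):.1%} within 2 sigma")
```

One standard deviation covers roughly 68 percent of the values and two cover roughly 95 percent, which is why the mean plus two standard deviations is a popular alerting threshold.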

This model works fine as long as the data really follows a normal distribution. However, if 60 percent of all response times range between 500 and 650 ms, your definition of normal is simply wrong. This implies that all your conclusions about abnormal behavior will be wrong as well. The effects can be severe: alerts are triggered far too late in high-load scenarios when things really go wrong, or you wake somebody up in the middle of the night when everything is fine.

In many cases, response times are not normally distributed. You can easily find this inconsistency by comparing the definition of normal against real-world data. Simply chart the actual value distribution, or look for obvious flaws such as a lower bound that predicts negative response times. In this case you will have to define the range of values manually without relying on a predefined probability model.
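Both checks can be sketched in a few lines of Python. Using percentiles as the manually defined range is an assumption made for this example, one possible way to avoid a probability model, and the sample data is invented.

```python
from statistics import mean, stdev, quantiles


def model_predicts_negative(values, n=2.0):
    """Obvious-flaw check: does mean - n * stdev fall below zero?"""
    return mean(values) - n * stdev(values) < 0


def percentile_range(values, lower=5, upper=95):
    """Model-free reference range taken directly from the observed data."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[lower - 1], cuts[upper - 1]


# Hypothetical skewed sample: most requests are fast, a few are very slow.
response_times_ms = [100] * 90 + [2000] * 10
print(model_predicts_negative(response_times_ms))
print(percentile_range(response_times_ms))
```

For this sample the normal model would expect negative response times, which is impossible, while the percentile range stays within values that were actually observed.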

Expectation Testing: Deciding When Something Is Not Normal

Once you have defined what is normal, you need to find out when you are seeing abnormal behavior. We defined normal behavior as the value range we expect based on reference data. In other words, you have to find out whether the actually measured values are what you expect them to be.

For the sake of simplicity, let us look at an error rate example. Response times conceptually work the same but are mathematically more complex.

Let’s assume you have a normal error rate of 3 percent. This is based on monitoring data with about 10,000 transactions per second. Now, in the middle of the night, you have 5 errors within 100 transactions. Should you alert or not? You have a 5 percent error rate, which is higher than expected. On the other hand, your system normally has 300 transactions failing per second and now it is only 5. This does not look like a straightforward decision.

If you apply some statistical knowledge, however, it is not that hard. What we are dealing with in the case of error rates is called a binomial distribution. Luckily, the likelihood of any number of errors in any number of requests is very simple to calculate, as shown below.

In this case, the probability of seeing 5 or more errors in a sample of 100 calls is about 18 percent. You would, however, only trigger an alert if the probability was below 5 percent, so here you would not alert. This is similar to triggering an alert on a value exceeding the mean plus two times the standard deviation.
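The calculation can be reproduced with a short Python sketch. The function names and the default 5 percent significance level are assumptions taken from the example above. The probability of exactly i errors in n transactions with a baseline error rate p is C(n, i) * p^i * (1 - p)^(n - i), and the tail probability sums this over all counts of k or more.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more errors in n calls."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def should_alert(errors, transactions, baseline_rate, significance=0.05):
    """Alert only if the observed error count is unlikely under the baseline rate."""
    return prob_at_least(errors, transactions, baseline_rate) < significance


p_value = prob_at_least(5, 100, 0.03)
print(f"P(5 or more errors in 100 calls) = {p_value:.3f}")
print(should_alert(5, 100, 0.03))
```

With the same baseline, 8 errors in 100 transactions would fall below the 5 percent level and trigger an alert, so the threshold adapts to the sample size instead of being a fixed percentage.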

Let’s recap the steps above. First you need to pick the values you want to alert on. The ideal candidates are rate and occurrence measures. Next you define the expected range of these values. Finally you test whether a certain measurement was within the expected range or not. In the end, good alerting is less a dark art than applying some solid mathematics to a problem. I’ll be covering this in much greater detail in my talk at Velocity New York in October.

This is one of a series of posts related to the upcoming Velocity conference in New York City (Oct 14-16). We hope to see you there.

Velocity 2013 Speaker Series
