Lately I’ve taken to exploring some of the aggregated event statistics that we have on file at Flight Data Services, and as a side project I was using my knowledge of statistics to develop a fairly straightforward anomaly detection algorithm. My first approach was to compute the daily average of a particular key point value (in this case, Acceleration Normal at Touchdown), and then compute the mean and a confidence interval at an arbitrarily chosen level (say, 99.5%). Data that fell outside of this interval would then be marked as an anomaly and would warrant further investigation. This is a pretty rudimentary statistical approach to anomaly detection, but one that’s commonly applied in this context. The only problem is that it’s wrong.
What does a confidence interval actually measure?
In this case, I’m going to be talking about the Gaussian distribution, but this applies to other distributions too. In basic inferential statistics, we try to use data from a sample of a population to draw conclusions about the population as a whole. That is, we are implicitly assuming that the population’s data is generated by some underlying probability distribution. The problem is, the parameters of this distribution (i.e. in the Gaussian case, the mean and standard deviation) are unknown: we can only estimate those parameters by calculating statistics from samples of the population.
Estimates of these parameters (i.e. the sample mean and std. dev) are easy to calculate from sample data, but the samples are not necessarily representative of the underlying population, so these estimates have an inherent amount of uncertainty in them. One other thing that’s useful to note is that, because of a neat statistical property known as the law of large numbers, we can be more confident that a large sample better represents the underlying population than a small one. So bigger is better, at least in statistics.
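To make the "bigger is better" point concrete, here is a minimal sketch using only the Python standard library. The mean of 1.3 and standard deviation of 0.2 are invented stand-ins for a touchdown acceleration, and `spread_of_sample_means` is a hypothetical helper name; the only point is that the sample mean wobbles less as the sample grows.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=500, mu=1.3, sigma=0.2, seed=0):
    """Standard deviation of the sample mean across many repeated samples.

    mu and sigma are made-up population parameters, used purely for illustration.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# The spread of the estimate shrinks roughly like sigma / sqrt(sample_size).
for n in (5, 50, 500):
    print(n, round(spread_of_sample_means(n), 4))
```

Each tenfold increase in sample size should cut the spread of the estimated mean by roughly a factor of three (the square root of ten).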
A confidence interval, then, is a mathematical concept that expresses the uncertainty in a particular estimate of a parameter of the population’s distribution. Phrased another way, the upper 95% confidence bound is the point at which 97.5% of the “weight” of the estimate’s probability density function lies below it (and of course, the opposite is true for the lower bound). If you haven’t seen this before, you might be thinking “why 97.5%”? Well, that’s because a confidence interval is two-sided, so $100\% - 95\% = 5\%$, and $5\% / 2 = 2.5\%$ is left in each tail.
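The tail arithmetic above can be checked with `statistics.NormalDist` from the Python standard library. This is only a sketch for the standard Gaussian case, and `two_sided_z` is a hypothetical helper name of my own.

```python
from statistics import NormalDist


def two_sided_z(confidence):
    """z such that the central `confidence` mass of a standard normal lies in [-z, z]."""
    tail = (1.0 - confidence) / 2.0  # e.g. (1 - 0.95) / 2 = 0.025 in each tail
    return NormalDist().inv_cdf(1.0 - tail)


z95 = two_sided_z(0.95)
print(round(z95, 2))  # 1.96, the familiar two-sided 95% critical value
```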
Of course, then
What was I doing wrong?
In short, I was applying confidence intervals incorrectly, because a confidence interval expresses the range of plausible values of a population mean as estimated from a sample. It’s a subtle detail (and one that is often missed, by me included!), but an important one. Strictly speaking, a 95% confidence interval means that if you sampled a population 100 times and computed an interval from each sample, then about 95 of those 100 intervals would contain the true population mean. (The Bayesian-sounding reading, that you can be 95% sure the population mean lies within one particular interval, really describes a credible interval, although it is a common informal shorthand.) Because a confidence interval is about the expected range of values of an average, its width shrinks in proportion to $1/\sqrt{n}$ as the sample size $n$ grows, so basing an anomaly detection algorithm on this kind of interval will lead to lots of false positives. If you think of each new flight as a new sample of one (as opposed to an average of several), you should hopefully be able to see why a confidence interval is too small: a single data point varies far more than an average of many.
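To see how badly this goes wrong, here is a small simulation of the naive detector, using only the standard library. The data are simulated Gaussian values (mean 1.3, standard deviation 0.2, both invented), the interval uses a normal approximation rather than a t-distribution, and `mean_confidence_interval` is a hypothetical helper, not the code I originally ran.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(data, confidence=0.995):
    """Normal-approximation confidence interval for the mean of `data`."""
    n = len(data)
    centre = fmean(data)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * stdev(data) / math.sqrt(n)  # shrinks like 1 / sqrt(n)
    return centre - half_width, centre + half_width


rng = random.Random(42)
values = [rng.gauss(1.3, 0.2) for _ in range(1200)]  # perfectly ordinary data

low, high = mean_confidence_interval(values)
flagged = sum(1 for v in values if not low <= v <= high)
print(f"{flagged / len(values):.0%} of ordinary points flagged as anomalies")
```

With 1,200 points the interval is only a small fraction of a standard deviation wide, so the vast majority of perfectly normal values land outside it.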
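$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$
– The definition of the univariate Gaussian distribution.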
When to use it?
You should use a confidence interval if you’re attempting to express your confidence in a possible range for the expected value (the mean) of a particular random variable. For example, if I wanted to estimate the average speed of cars driving past the office during the home-time rush every day (hint: it’s not fast), then I would record a large number of measurements and take the average; let’s say, 10 kilometers per hour. A confidence interval (in this case, let’s say a 95% confidence interval of ±2 km/h) is simply a measure of my uncertainty about that average speed. That is, I can be reasonably confident that the true average speed lies between 8 km/h and 12 km/h; it says nothing about how fast any individual car is going.
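As a rough worked version of the car example, using the usual large-sample formula for a confidence interval on a mean: the sample standard deviation $s = 6.1$ km/h and sample size $n = 36$ are numbers I have invented so that the arithmetic reproduces the ±2 km/h above, and the normal critical value 1.96 is used instead of a t-value for simplicity.

$$ \bar{x} \pm z_{0.975} \frac{s}{\sqrt{n}} = 10 \pm 1.96 \times \frac{6.1}{\sqrt{36}} \approx 10 \pm 2.0 \ \text{km/h} $$

Note the $\sqrt{n}$ in the denominator: quadrupling the number of cars measured would halve the width of the interval, even though individual cars would be just as variable as before.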
What about the tolerance interval?
After a little bit of reading up on other anomaly detection techniques, I stumbled upon a couple of online resources that discussed some lesser-known intervals that have a slightly stronger statistical justification for building anomaly detection thresholds — and so produce much more sensible results. In this case I’m going to be talking about the tolerance interval — an interval that works as a sort of “set-and-forget” quantity that is estimated once and then can be re-used for future observations (this differs from something like the prediction interval which should technically be re-computed each time).
Instead of estimating the range of possible values that contains the mean of a population, tolerance intervals attempt to estimate the range of values that a given proportion of the entire population resides in. This is a conceptual departure from the usual confidence interval, because one must provide a proportion of the population to cover (the tolerance threshold) as well as a confidence level. That is, a tolerance interval answers the question “what range will contain 99% of the population, with 95% confidence?”. That essentially means that, if the interval were re-estimated from many different samples, 95% of those intervals would contain at least 99% of the population.
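Written out a little more formally, a tolerance interval is built to satisfy the statement below. The notation is my own shorthand rather than anything standard to this post: $L$ and $U$ are the lower and upper bounds computed from a sample, $X$ is a new observation from the same population, $p$ is the proportion to cover and $\gamma$ is the confidence level.

$$ \Pr\Big( \Pr\big( L \le X \le U \big) \ge p \Big) \ge \gamma, \qquad p = 0.99,\ \gamma = 0.95 $$

The inner probability is over new observations, and the outer probability is over the random sample used to compute $L$ and $U$. That nesting is why two numbers are needed.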
How is it defined?
This is where things start to get fun. As mentioned above, we need to provide two parameters to calculate the tolerance interval, and that’s because we now need to consider two probability distributions: the Gaussian (as before), and the Chi-squared distribution as well. In short, the Chi-squared distribution with $n$ degrees of freedom is the probability distribution of the sum of the squares of $n$ independent standard Gaussian variables; here it describes the uncertainty in the sample variance, and $n$ is usually taken as one less than the number of data points in the sample. The tolerance interval $t_i$ can be approximated as follows, where $n$ is the degrees of freedom, $z_{(1-p)/2}$ is the critical value of the Gaussian distribution, taken at the value of the proportion $p$ of the data within the interval $t_i$, and $\gamma$ is the confidence level:
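One widely used approximation that matches the symbols described above is Howe’s formula for the two-sided tolerance factor, as given in the NIST/SEMATECH e-Handbook of Statistical Methods. I am assuming this is the form intended here. The extra symbols are my own additions: $N$ is the number of data points (so $n = N - 1$), $\bar{x}$ and $s$ are the sample mean and standard deviation, and $\chi^{2}_{1-\gamma,\,n}$ is the value that a Chi-squared variable with $n$ degrees of freedom exceeds with probability $\gamma$ (that is, its lower $1-\gamma$ quantile).

$$ k = \sqrt{ \frac{ n \left( 1 + \frac{1}{N} \right) z_{(1-p)/2}^{2} }{ \chi^{2}_{1-\gamma,\, n} } }, \qquad t_i = \bar{x} \pm k\, s $$

As $N$ grows, the Chi-squared quantile approaches $n$ and $k$ approaches the plain Gaussian critical value, so the correction matters most for small samples.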
In Python, an approximation to the tolerance interval can be computed using just a few lines of code:
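Here is a minimal sketch of such a computation, based on Howe’s approximation for the two-sided tolerance factor. It is an illustration under stated assumptions, not necessarily the code originally used here. To stay within the standard library it approximates the Chi-squared quantile with the Wilson-Hilferty formula (my substitution); if SciPy is available, `scipy.stats.chi2.ppf` gives the exact quantile instead. All function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev


def chi2_quantile(prob, dof):
    """Approximate Chi-squared quantile via Wilson-Hilferty.

    An assumption made to avoid third-party dependencies; it is reasonable
    for moderate-to-large degrees of freedom.
    """
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


def tolerance_factor(n_samples, proportion=0.99, confidence=0.95):
    """Howe's approximate two-sided tolerance factor k."""
    dof = n_samples - 1
    z = NormalDist().inv_cdf((1.0 + proportion) / 2.0)  # magnitude of z_{(1-p)/2}
    chi2 = chi2_quantile(1.0 - confidence, dof)          # lower (1 - gamma) quantile
    return math.sqrt(dof * (1.0 + 1.0 / n_samples) * z * z / chi2)


def tolerance_interval(data, proportion=0.99, confidence=0.95):
    """Bounds intended to cover `proportion` of the population at `confidence`."""
    k = tolerance_factor(len(data), proportion, confidence)
    centre, spread = fmean(data), stdev(data)
    return centre - k * spread, centre + k * spread


# Published worked example: N = 43, p = 0.90, gamma = 0.99 gives k of about 2.217.
print(round(tolerance_factor(43, proportion=0.90, confidence=0.99), 2))
```

Once computed from a reference sample, the two bounds can be stored and compared against each new observation, which is what makes this a "set-and-forget" threshold.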
And of course, applying this tolerance interval to 1,200 randomly-generated numbers yields the following bounds (note how few outliers there are!).
This figure was generated quite easily using the following Python code:
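A sketch of how a figure like this could be produced is below. It is a reconstruction under assumptions rather than the original script: the 1,200 numbers are drawn from a standard Gaussian (the distribution used originally is not stated), the bounds use Howe’s approximation with a Wilson-Hilferty Chi-squared quantile so that only the standard library is needed for the numbers, and the hypothetical `plot_with_bounds` helper imports matplotlib lazily so the rest runs without it.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def tolerance_bounds(data, proportion=0.99, confidence=0.95):
    """Howe-style bounds; Chi-squared quantile via Wilson-Hilferty (an assumption)."""
    n = len(data)
    dof = n - 1
    z = NormalDist().inv_cdf((1.0 + proportion) / 2.0)
    c = 2.0 / (9.0 * dof)
    chi2 = dof * (1.0 - c + NormalDist().inv_cdf(1.0 - confidence) * math.sqrt(c)) ** 3
    k = math.sqrt(dof * (1.0 + 1.0 / n) * z * z / chi2)
    centre, spread = fmean(data), stdev(data)
    return centre - k * spread, centre + k * spread


def plot_with_bounds(data, low, high, path="tolerance.png"):
    """Scatter the data and draw both bounds; requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    colours = ["red" if not low <= v <= high else "steelblue" for v in data]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(range(len(data)), data, s=6, c=colours)
    ax.axhline(low, linestyle="--", color="grey")
    ax.axhline(high, linestyle="--", color="grey")
    ax.set_xlabel("observation")
    ax.set_ylabel("value")
    fig.savefig(path)
    plt.close(fig)


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(1200)]  # assumed: standard Gaussian
low, high = tolerance_bounds(data)
outliers = [v for v in data if not low <= v <= high]
print(f"bounds: [{low:.2f}, {high:.2f}], outliers: {len(outliers)} of {len(data)}")
# plot_with_bounds(data, low, high)  # uncomment if matplotlib is installed
```

Only a small handful of the 1,200 points should fall outside the bounds, in contrast to the confidence-interval approach, which flags almost everything.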
Tolerance intervals are a very powerful statistical tool for creating “set-and-forget” thresholds for performing rudimentary anomaly detection on certain types of numerical data. There are some great resources on these statistical concepts (and I’ve only scratched the surface), so if my explanation falls a little bit short, I’d check those out too.