Anomaly Detection

December 07, 2017

Anomaly detection refers to the technique of identifying unusual patterns and finding outliers in a set of observations. Outliers are data points that differ considerably from the remainder of the dataset. They are usually extreme values that diverge from the normal or expected behavior.

Historically, statistics was applied to find and remove outliers, for example from the tails of a Gaussian distribution. The idea was that outliers which result from errors (noise, human error, etc.) may give rise to misleading interpretations. In addition, by filtering them out, modern supervised learning algorithms can gain in accuracy. On the other hand, anomalies are nowadays also an object of interest in themselves, as is the case for “rare events” in physics, medicine, business or cybersecurity.

Datasets vary in their nature, but the most typical ones are time series and spatial data. There are three main types of outliers:

Point outliers are single occurrences that are anomalous with respect to the complete dataset. For example, the stock values marked as red stars in the plot of the next section.
Contextual outliers are anomalous only under specific circumstances, for example a temperature reading that is ordinary in summer but unusual in winter.
Finally, collective outliers are a collection of occurrences which are not anomalous by themselves, but only as a subgroup or subsequence of the entire dataset. The next figure shows an example in red, for a plotted electrocardiographic signal.

Example of a collective outlier. The red portion of the signal is an outlier because the signal stays at that value for a significantly longer duration than normal.

Methodology

Anomalies can be studied in a binary classification scenario, where training data points are labeled as “normal” or “anomalous”, or in a more general supervised learning framework, where labels are of any kind and the model predictions are compared to the actual classes to check for agreement.

Anomalies can also be studied in an unsupervised learning setting, where a score function is applied over the whole dataset. Outliers are determined by setting a threshold on the scores or by taking the n samples with the largest values. All of these frameworks can be approached by statistical or machine learning methods, as listed below.
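The two selection rules for the unsupervised setting (a threshold on the scores, or the n largest scores) can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names and the score values are hypothetical, and the scores are assumed to have been computed already by some score function, with larger meaning more unusual.

```python
def flag_by_threshold(scores, threshold):
    """Return the indices of samples whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def flag_top_n(scores, n):
    """Return the indices of the n samples with the largest scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores, one per sample (higher = more unusual).
scores = [0.1, 0.3, 2.7, 0.2, 1.9, 0.4]

print(flag_by_threshold(scores, threshold=1.5))
print(flag_top_n(scores, n=2))
```

A threshold suits cases where the scale of the scores is meaningful, while the top-n rule is convenient when only a fixed number of samples can be inspected.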

1: Statistical Techniques

Extreme value analysis (EVA), such as the z-score: studies the extreme deviations from the data mean, that is, the tails of the underlying distribution. Data points beyond a threshold, which is set in terms of the standard deviation (e.g. 2-3 standard deviations), are flagged as outliers. It is attractive for its simplicity, but can only be used if a distribution is assumed. It gives poor results in high dimensions and, being static, does not adapt to changes over time in a time series.
Moving average (MA): EVA version for time series. The mean is computed across the data points with the aid of a rolling window. It works by smoothing short-term fluctuations while highlighting long-term ones. It can be extended by assigning weights to the data points. For example, in exponential smoothing the weights decrease exponentially over time. EVA and MA are compared in the figures below, where the latter yields a better anomaly selection. However, the disadvantage is that MA ignores seasonal patterns.
STL decomposition: splits an original signal into three components: seasonal, trend and residual. The last one, which encodes the irregularities, is used to find the outliers. It is simple and works well for seasonal time series. The drawback is that signals with dramatic changes are not well analyzed: anomalies should be treated separately, before and after the changes.
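As a rough sketch of the first two techniques, the snippet below flags points with a global z-score and with a trailing moving average, using only the Python standard library. The function names, the toy series and the default parameters are assumptions made for illustration. Measuring the deviation in units of the trailing window's standard deviation is one common choice, not something the text above prescribes.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Flag points lying more than `threshold` standard deviations from the global mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant series: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def moving_average_outliers(values, window=5, threshold=3.0):
    """Flag points that deviate from the mean of the preceding `window` points.

    The deviation is measured in units of the standard deviation of that
    same trailing window, so the reference level follows the series.
    """
    flagged = []
    for i in range(window, len(values)):
        past = values[i - window:i]
        mu = mean(past)
        sigma = pstdev(past)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical toy series: a steady upward trend with one spike at index 20.
series = [i + (0.5 if i % 2 == 0 else -0.5) for i in range(30)]
series[20] += 15.0

print(zscore_outliers(series, threshold=3.0))
print(moving_average_outliers(series, window=5, threshold=3.0))
```

On this toy series the trend inflates the global standard deviation, so the z-score at a threshold of 3 misses the spike, while the trailing window flags index 20. This is the kind of difference that makes the moving average the better fit for time series.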

2: Machine Learning Techniques

Proximity and density-based approaches: work under the hypothesis that normal points lie in dense neighborhoods, whereas outliers occur far away. Proximity methods are based on the K-Nearest Neighbor (KNN) algorithm. In unsupervised learning, a score function is computed in terms of a distance metric such as the Euclidean or the Manhattan one. In supervised learning, KNN can be used to classify already labeled data, and data points whose predictions differ from the actual class are flagged. On the other hand, density methods such as the Local Outlier Factor (LOF), which make use of the reachability distance, detect points with a lower density than their neighbors.
Clustering-based approaches: are completely unsupervised, and assume that similar data points belong to the same cluster. K-Means Clustering is one of the best known techniques. All data points are assigned to clusters, and those whose distance from their corresponding centroid exceeds a threshold are flagged as outliers. Results are poor for datasets with too many anomalies. An extra assumption is therefore that normal points considerably outnumber outliers.
Projection methods: project the dataset to a lower dimensional space, where anomalies can be detected by hand or by applying one of the previously mentioned techniques. An example is to use Principal Component Analysis to work in a two-dimensional space.
For datasets with samples labeled as “normal” or “anomalous”, any supervised learning algorithm (Support Vector Machine, Neural Networks, Decision Trees, etc.) can be applied for a binary classification of unknown data points. However, the fact that classes tend to be unbalanced (more normal than anomalous samples) is a disadvantage. In addition, future anomalies might be different from current ones and hence difficult to predict.
Isolation forests: are a variant of random forests that compute a score for each data point in terms of its position in the trees. They can be easily tuned and are independent of the data structure, but they are more computationally costly than simpler methods.
Long Short-Term Memory (LSTM) networks: are a type of recurrent neural network that takes into account temporal sequences as a whole.
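To make the proximity idea from the list above concrete, here is a minimal brute-force sketch of an unsupervised KNN score using the Euclidean distance. The function name, the sample points and the choice of averaging the k nearest distances are assumptions for illustration. Other variants use the distance to the k-th neighbor only, and a real application would normally rely on a library with an efficient neighbor search.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def knn_scores(points, k=3):
    """Score each point by its mean Euclidean distance to its k nearest neighbors.

    Brute force: every pairwise distance is computed, so the cost grows
    quadratically with the number of points. Requires k < len(points).
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical 2-D data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]

scores = knn_scores(points, k=3)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(most_anomalous, scores[most_anomalous])
```

Points inside the cluster get small scores because their neighbors are close, whereas the isolated point gets a large one. The scores can then be thresholded, or the n largest taken, as described in the Methodology section.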

Applications

Fraud detection: credit card, insurance
Intrusion detection, cybersecurity: hacks in network traffic
Industrial damage: malfunctions, such as in aircraft engine combustion
Manufacturing: aircraft engines, etc.
Image processing, video surveillance
Medical diagnosis: monitoring health issues such as a malignant tumor in an MR scan
Natural sciences: discovering strange and unusual activities in astronomy, physics, etc.

Examples

Moving average for sunspots in natural science: implementation in Python can be found here.
LSTM for intrusion detection in computer network systems.
Identifying pickpocket suspects. The detection system is built in two steps: unsupervised detection and supervised classification.
Twitter developed an open source package in R for detecting anomalies in seasonal time series by decomposing the signal and extracting the trend.