Anomaly detection is the task of identifying unusual patterns and finding outliers in a set of observations. Outliers are data points that differ considerably from the rest of the dataset. Usually, they are extreme values that diverge from the normal or expected behavior.
Historically, statistics was used to find and remove outliers, for example those in the tails of a Gaussian distribution. The idea was that outliers resulting from errors (noise, human mistakes, etc.) may give rise to misleading interpretations. In addition, filtering them out can improve the accuracy of modern supervised learning algorithms. On the other hand, anomalies are nowadays often the object of interest themselves, as is the case with “rare events” in physics, medicine, business, or cybersecurity.
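As a rough illustration of the classical "remove the tails" idea, the sketch below flags values whose z-score exceeds a cutoff. The function name `remove_gaussian_tail_outliers`, the cutoff of 3 standard deviations, and the sample readings are all assumptions made for this example, and the approach only makes sense if the data is approximately Gaussian.

```python
import statistics


def remove_gaussian_tail_outliers(values, z_threshold=3.0):
    """Split values into (kept, outliers) using a z-score cutoff.

    Hypothetical helper: assumes the data is roughly Gaussian, so that
    points far out in the tails can be treated as outliers.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values), []

    kept, outliers = [], []
    for v in values:
        z = (v - mean) / std
        if abs(z) > z_threshold:
            outliers.append(v)
        else:
            kept.append(v)
    return kept, outliers


# Made-up sensor readings with one obviously erroneous value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7,
            10.1, 9.9, 10.0, 10.2, 9.8, 25.0]
kept, outliers = remove_gaussian_tail_outliers(readings)
print(outliers)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in small samples a fixed cutoff can miss outliers; robust variants based on the median are a common alternative.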
Datasets vary in their nature, but the most typical ones are time series and spatial data. There are three main types of outliers:
- Point outliers are single occurrences that are anomalous with respect to the complete dataset. An example is the stock values marked as red stars in the plot in the next section.
- Contextual outliers are anomalous only under specific circumstances. For example, a temperature that is ordinary in summer may be anomalous in winter.
- Finally, collective outliers are collections of occurrences that are not anomalous individually, but are anomalous as a subgroup or subsequence of the entire dataset. The next figure shows an example in red for an electrocardiographic signal.
Example of a collective outlier. The red portion of the signal is an outlier because the signal stays at that value for significantly longer than normal.
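To make the collective case more concrete, here is a minimal sketch that looks for stretches where a signal stays near one value for longer than expected. The function name `find_long_flat_runs`, the `tolerance` and `max_normal_run` parameters, and the toy signal are assumptions for illustration only; real electrocardiographic data would need a more careful method.

```python
def find_long_flat_runs(signal, tolerance=0.05, max_normal_run=5):
    """Return (start, end) index pairs, end exclusive, of runs where the
    signal stays within `tolerance` of the run's first value for more
    than `max_normal_run` consecutive samples.

    Hypothetical helper: both thresholds are illustrative assumptions.
    """
    runs = []
    start = 0
    for i in range(1, len(signal) + 1):
        # Close the current run at the end of the signal, or as soon as
        # the value drifts away from the value that started the run.
        if i == len(signal) or abs(signal[i] - signal[start]) > tolerance:
            if i - start > max_normal_run:
                runs.append((start, i))
            start = i
    return runs


# Toy oscillating signal with a flat stretch at 0.5. Each 0.5 sample lies
# inside the normal range [-1, 1], so none of them is a point outlier.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0,
          0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
          0.0, -1.0, 0.0, 1.0]
flat_runs = find_long_flat_runs(signal)
print(flat_runs)
```

Only the duration of the flat stretch makes it anomalous, which is what distinguishes a collective outlier from a point outlier.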