Predicting the improbable, Part 3: Anomaly detection

Dario Martinez

2018-02-07 10:33:46
Reading Time: 4 minutes

In the other part of this series, we presented and described state of art of the algorithms used to balance datasets. The imbalanced data paradigm is well-researched and several go-to solutions in Python are available for consumption.

A shift in perspective

But what if, instead of trying to balance the dataset, the imbalance problem is tackled from a different perspective? Since the usual problem with imbalanced datasets is that there is very low occurrence in some classes, one solution is to present the detection of rare events. Due to its nature of malfunctioning, these events are usually seen as outliers from the distribution, and well-known algorithms can detect them:

  • Elliptic envelope methodology: A common approach is to assume that the data come from known distributions.  We generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.
  • Isolation forest: A random forest is one go-to algorithm for fitting data in high dimensional datasets. We can suggest the use of decision trees to anomaly detection because they are information theoretic models and outliers increase the minimum code length to describe a data set. Following this idea, it is possible to ‘isolate’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. When a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies. Take a look at Scikit-Learn implementation.
  • Local outlier factor (LOF): LOF is a metric that reflects the degree of abnormality of the observations and is used to define proximity-based models. It measures the local density deviation of a given data point with respect to its neighbours. The idea is to detect the samples that have a substantially lower density than their neighbours. This density is calculated by ratio involving the k-nearest neighbours algorithm. There is also an implementation in Sciki-Learn.
  • High dimensional outlier detection: Due to the expected sparsity of the datasets and the well-known curse of dimensionality, the presented outlier detection algorithms might not be enough. In these kinds of scenarios, the distance to the nearest neighbour approaches the distance to the farthest neighbour, and contrast in distances to different data points becomes nonexistent. This basically means that using methods such as LOF, which are based on nearest neighbourhood, for high dimensional data sets will lead to outlier scores which are close to each other.
  • High Contrast Subspaces for Density-Based Outlier Ranking (HiCS): The HiCS method basically uses a 3-step methodology to deal with curse of dimensionality in the outlier detection problem. First, it finds high contrast subspaces using the comparison of the marginal and conditional probability density functions. Then, it calculates an outlier score for each point based on each of high contrast subspaces. Finally, it calculates the average of scores generated from the previous step.

Anomalies thought time: change-point detection

A huge amount of real-world data is based on a time series. In this context, change-point detection methods are crucial. They aim to identify variations on the probability distribution of a time series. These methodologies also concern anomaly detection in time series as a particular case of a extreme change-point detection. We will usually handle time series problems with “offline” algorithms:

  • Change detectors based on z-scores: Z-Score measures the distance from the population mean in units of the standard error. It can be used to identify when a sample is deviating from the expected statistic of the population. One common approach is to maintain two sets of statistics (e.g. mean and std) that describe the signal and compare them using z-score.
  • Cumulative sum control chart (CUSUM): It iteratively calculates the weighted cumulative sum. When this sum exceeds a certain threshold value, a change value has been found. It is a valuable “online” methodology with plenty of use in streaming change detection. There are plenty of available implementations of this algorithm.
  • Bayesian detection: This method is based on the prior train of thought on how probable is it to have two successive change-points with certain distance. Then, it models the likelihood of the data in the sequence given that there is no change-point in the sequence. This methodology works well but can be slow depending on the length of the time series. There are plenty of implementations in both Python and R.

Conclusion

We make a 180º change of perspective here and try to tackle the imbalance classes prediction as a outlier detection problem. From isolating the rarities using decision forests to dealing with the curse of dimensionality, we have covered the most used algorithms for outliers detection. Finally, we have presented some standard methodologies for dealing with anomalies in time series.

Although this is well-researched topic, by no means is it closed. Cutting-edge performance algorithms are popping out every year. Stay tuned, as without grasping the future of this topic, it is entirely possible that in the coming years there won’t be such thing as an “imbalanced data problem”.

Predicting a highly probable event with sufficient amount of data is not a challenge for a sufficiently skilled Data Scientist. However, the reverse of this is challenging, especially concerned with algorithm performance…

A common approach is to sample the datasets to modify data distributions to create a “new” balanced dataset. Data re-sampling is commonly used in data science to validate machine learning (ML) models…

Author: Dario Martinez

© datascience.aero