Predicting the Improbable, Part 2: Handling imbalanced data

Dario Martinez

2018-01-24 15:52:41
Reading Time: 4 minutes

For Part 1 of this series we presented the imbalanced data challenge and the serious repercussions of not addressing it. Hopefully, the existence of imbalanced datasets is well known and documented and there are algorithms and implementations to address it. For the second part of this series, we recommend a few direct approaches using previously-validated implementations.

Sampling methods

A common approach is to sample the datasets to modify data distributions to create a “new” balanced dataset. Data re-sampling is commonly used in data science to validate machine learning (ML) models. If you have ever performed cross-validation when building a model, then you have performed data re-sampling (although people rarely refer to it as the formal data re-sampling method). In order to do it, there are two particular methodologies: oversampling and undersampling. The first method aims to expand the minority class and can cause overfitting due to creating multiple similar instances. The second aims to reduce the majority class, which in turn might result in loss of relevant information.

There are multiple approaches and implementations of sampling methods. Imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with (and based on) scikit-learn and is part of scikit-learn-contrib projects. Within the proposed algorithms in the package, some methods could be particularly useful for dealing with the expected imbalance in the user-case datasets:

  • Informed undersampling: Algorithms like NearMiss-(1 & 2 & 3) perform undersampling by using a k-nearest neighbour classifier. On the other hand, one-sided selection (OSS) performs an heuristic that previously cleans the dataset by using Tomek links to remove noisy samples. Then a 1-nearest neighbour classifier is applied to all samples.
  • Synthetic Sampling with Data Generation (artificial oversampling): This technique aims to create new artificial samples of the minority class to balance the dataset. The commonly used algorithm is synthetic minority oversampling technique (SMOTE) which uses an unsupervised algorithm that creates new samples based on the spatial distance between samples. A more sophisticated approach is the Adaptive Synthetic Sampling (ADASYN). This algorithm also generates new samples by interpolation, however it focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbours classifier.
  • Combination of over- and undersampling: A well-rounded approach is to perform oversampling to generate synthetic samples of the minor class and then perform undersampling to clean the resulted space from noisy samples. The oversampling step is done using SMOTE and the cleaning can be done using either Tomek’s link (SMOTETomek) or edited nearest-neighbours (SMOTEENN).
  • Ensemble methods: EasyEnsemble is a unsupervised methodology that use random subsets of the majority class. It randomly undersamples the original set to create an ensemble dataset. Another well rounded approach is BalanceCascade, a supervised algorithm that iteratively creates balance and pull out redundant samples in majority class to form a final classifier. This differs from the previous method by using a classifier to ensure that the misclassified samples can be selected again for the next subset.

Cost-Sensitive methods

Instead of modifying the dataset, you can consider the cost of misclassification to minimize the overall cost. Using this strategy you can train certain ML algorithms as cost-sensitive classifiers. One common approach are cost-sensitive decision trees. They are classic decision trees with thresholds adjusted to yield the most dominant point of the ROC curve. Some solid Python implementations of this algorithm can be found as part of the COSTCLA project of the University of Luxembourg.

Classic ML methodologies are not the only approach to adapt cost sensitive methods.  We can also apply this strategy to deep learning using cost-sensitive ANNs. One can follow certain steps to achieve this, such as: modify the probability estimate of the outputs in the testing stage, bias the ANNs during training to focus on the expensive class, rise the learning rate for costly classes, and lower the low-cost samples or replace the error function for an expected cost minimization function.

Kernel and Active Learning methods

There are other ML-based approaches. For instance, kernel-based SVMs are a classical ML learning algorithm that can be very useful in addressing certain data (check scikit-learn implementation). More sophisticated approaches propose the integration of kernel-based methods to sampling methods, e.g. SMOTE with different costs (SDC), ensembles of under-/over-sampled SVM, etc. There is no straightforward implementation of these methodologies and most of the work is theoretical with very low validation. In recent years, Active Learning has gained popularity, as it is a semi-supervised ML methodology in which the algorithm queries the user (or an information source) to obtain new data points. Although, this is an interesting strategy it has only been applied for undersampling and oversampling datasets related to dense word disambiguation (WSD).


There are multiple algorithms with various approaches for addressing imbalanced datasets; from adjusting your dataset using sampling, to using specific ML learning techniques. The most effective algorithm will vary depending on the characteristics of the imbalanced data set.

But what if we change the problem statement? Instead of trying to balance the dataset, we predict the low occurrence classes as anomalies. Find out more in the third and last part of this blog series.

Predicting a highly probable event with sufficient amount of data is not a challenge for a sufficiently skilled Data Scientist. However, the reverse of this is challenging, especially concerned with algorithm performance…

But what if, instead of trying to balance the dataset, the imbalance problem is tackled from a different perspective? Since the usual problem with imbalanced datasets is that there is very low occurrence in some classes…

Author: Dario Martinez