“How Do You Handle Imbalanced Data?” A Question That Deserves a Better Answer
Recently, I came across a question asking how to handle the imbalanced data problem.
Imbalance Is Not the Real Problem
First of all, imbalanced data is not the problem. In most real-world cases, especially fraud and AML, imbalance is simply the natural tendency of the data distribution. You can’t expect fraud to make up 20 or 30% of the data; by its very nature, it never will.
Once we accept imbalance as a reality, the next question becomes how to deal with it during model training.
Common Approaches
The most commonly discussed options in the literature are under-sampling and over-sampling.
Over-sampling the fraud class involves generating additional fraudulent samples. While this can produce good performance during training, the generated samples may not align with the natural tendency of fraud.
As a result, performance often deteriorates when the model is applied to out-of-time validation data.
On the other hand, under-sampling addresses the imbalance by reducing the number of non-fraud cases. However, this comes at the cost of discarding good negative samples from the training phase, which can weaken the model’s understanding of normal behaviour.
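As a rough illustration, here is a minimal Python sketch of both approaches on toy data. The function names are my own, and the naive row duplication used for over-sampling stands in for more sophisticated generators such as SMOTE:

```python
import random

def undersample(majority, minority, ratio=1.0, seed=42):
    # Keep at most ratio * len(minority) randomly chosen majority rows.
    rng = random.Random(seed)
    k = min(len(majority), int(len(minority) * ratio))
    return rng.sample(majority, k) + minority

def oversample(majority, minority, ratio=1.0, seed=42):
    # Duplicate minority rows (with replacement) up to ratio * len(majority).
    rng = random.Random(seed)
    target = max(len(minority), int(len(majority) * ratio))
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return majority + minority + extra

majority = [("legit", i) for i in range(95)]  # 95 non-fraud rows
minority = [("fraud", i) for i in range(5)]   # 5 fraud rows

down = undersample(majority, minority)  # 5 legit + 5 fraud = 10 rows
up = oversample(majority, minority)     # 95 legit + 95 fraud = 190 rows
```

The sketch makes the cost of each choice visible: under-sampling throws away 90 perfectly good negative rows, while over-sampling trains on the same 5 fraud rows repeated about 19 times each.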
A More Stable Alternative
Given the limitations of both sampling approaches, we need alternatives that respect the original data distribution.
One such option is class weighting during model training. By assigning a higher weight to the class of interest, fraud, the model is penalised more heavily when it fails to learn fraud patterns.
Logistic regression and tree-based models provide direct options to set class weights, making this approach both simple and effective.
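For illustration, the common “balanced” weighting heuristic (the same one scikit-learn applies when you pass class_weight="balanced") can be sketched in a few lines; the helper name is my own:

```python
from collections import Counter

def balanced_class_weights(labels):
    # Heuristic used by scikit-learn's class_weight="balanced":
    #   weight_c = n_samples / (n_classes * count_c)
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

labels = [0] * 990 + [1] * 10  # 1% fraud rate
w = balanced_class_weights(labels)
# w[1] / w[0] is about 99: each fraud row carries roughly 99x the
# loss contribution of a non-fraud row
```

In practice you would simply pass class_weight="balanced" (or an explicit dict such as {0: 1, 1: 99}) to scikit-learn’s LogisticRegression or RandomForestClassifier, rather than computing the weights by hand.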
Why Evaluation Metrics Matter Even More
Handling imbalance does not end at training; evaluation plays an equally important role. Because the data is imbalanced, the popular ROC AUC metric does not give the full picture.
ROC AUC plots the true positive rate against the false positive rate, where FPR = FP / (FP + TN). In imbalanced data, true negatives dominate the denominator, which means false positives can appear less significant than they actually are.
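A small worked example makes this dilution concrete. The counts below are hypothetical, chosen only to illustrate the effect:

```python
# Hypothetical counts at one threshold on 100,000 transactions
# (100 fraud cases, 99,900 legitimate ones)
tp, fn = 80, 20       # 80 of 100 fraud cases caught
fp, tn = 400, 99_500  # 400 false alerts among 99,900 legit rows

fpr = fp / (fp + tn)        # ~0.004: looks excellent on a ROC curve
precision = tp / (tp + fp)  # ~0.167: 5 out of every 6 alerts are false
recall = tp / (tp + fn)     # 0.8
```

An FPR of 0.4% suggests a near-perfect model, yet the investigation team would find that most alerts they review are false, exactly the gap precision exposes.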
Because of this limitation, it is better to rely on the precision–recall curve. Precision tells us, out of predicted frauds, how many are actual frauds, while recall tells us, out of all frauds, how many were captured by the model.
Precision–Recall Curve: A Better Lens for Fraud
The precision–recall curve becomes important because it focuses only on the fraud class and ignores true negatives, which are anyway dominant in imbalanced data.
Unlike ROC AUC, the precision–recall curve clearly shows the trade-off between precision and recall across different thresholds. This helps in understanding how model behaviour changes when we move from capturing more fraud to reducing false alerts.
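To make that trade-off concrete, here is a toy sketch computing precision and recall at a few thresholds. In real work you would use sklearn.metrics.precision_recall_curve; the scores and labels below are made up:

```python
def pr_points(scores, labels, thresholds):
    # Precision and recall at each threshold; a toy stand-in for
    # sklearn.metrics.precision_recall_curve.
    out = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        out.append((t, prec, rec))
    return out

scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]
points = pr_points(scores, labels, [0.8, 0.5, 0.15])
# (0.80, 1.00, 0.50): strict threshold, no false alerts, half the fraud missed
# (0.50, 0.75, 0.75): middle ground
# (0.15, 0.57, 1.00): all fraud caught, but 3 of 7 alerts are false
```

Reading the three points top to bottom is exactly the business conversation: each step down the threshold buys more captured fraud at the price of more false alerts.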
Risk Appetite and Investigation Cost
In fraud and AML (anti-money laundering) use cases, this directly supports threshold selection based on investigation capacity and the bank’s risk appetite. Low precision translates directly into higher investigation cost, which is why choosing the right evaluation metric is as important as choosing the right modelling approach.
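One simple way to operationalise this is to let investigation capacity pick the threshold: rank the day’s scores and alert only on as many cases as the team can review. A hypothetical sketch, with function and variable names of my own:

```python
def capacity_threshold(scores, daily_capacity):
    # Return the score of the last alert the team can afford to review,
    # so that alerting on score >= threshold yields at most
    # daily_capacity alerts (ignoring ties, for simplicity).
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= daily_capacity:
        return ranked[-1]
    return ranked[daily_capacity - 1]

day_scores = [0.91, 0.42, 0.77, 0.08, 0.66, 0.95, 0.30]
t = capacity_threshold(day_scores, daily_capacity=3)
alerts = [s for s in day_scores if s >= t]  # exactly 3 alerts here
```

The same idea extends to risk appetite: instead of a fixed review budget, the business can set a minimum acceptable precision and walk down the precision–recall curve until that floor is hit.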


