Sampling Techniques

SMOTE

SMOTE stands for Synthetic Minority Over-sampling Technique.

Apply SMOTE after the train-test split, and only to the training data. Applying it before the split leaks synthetic samples into the test set, which gives the model an unfair advantage and invalidates your performance metrics. The goal is to balance the training set so the model can learn minority-class patterns, while keeping the test set pristine so it measures true generalization to unseen, imbalanced data. This is the standard advice from Kaggle and Stack Overflow users.

Why apply SMOTE after the split (to training data only)?

  • Prevents Data Leakage: Applying SMOTE before splitting means synthetic samples from the training set can end up in the test set, giving the model an unfair advantage and inflating performance scores.
  • Realistic Evaluation: The test set should mirror real-world, imbalanced data to accurately assess how your model will perform in production.
  • Accurate Generalization: By training on balanced data but testing on imbalanced data, you get a true measure of the model’s ability to generalize and perform on unseen examples, not just on “cheated” synthetic data. 

Correct Workflow

  1. Split Data: Divide your entire dataset into training and testing sets first (e.g., 80/20 split).
  2. Apply SMOTE: Use SMOTE exclusively on the training set to generate synthetic minority samples.
  3. Train Model: Train your machine learning model using the now-balanced training data.
  4. Evaluate: Test the model on the original, untouched test set.
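A minimal sketch of this workflow with scikit-learn and imbalanced-learn (the dataset below is a synthetic stand-in; swap in your own X and y):

from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data as a stand-in for your real features/labels
X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=42)

# 1. Split first (stratify keeps the original class ratio in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Oversample the training set only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_train_res))

# 3. Train on the balanced training data
model = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)

# 4. Evaluate on the original, untouched (still imbalanced) test set
print(classification_report(y_test, model.predict(X_test)))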

For more details, refer to:

  • Official API documentation1

RandomUnderSampler

TBA

SMOTETomek

TBA

What's the benefit of cleaning up the majority samples that are nearest neighbours to minority samples? Aren't those exactly the hard examples, with a high likelihood of false positives, that the model should ideally learn?

The argument FOR removing Tomek links:

  • These ambiguous boundary samples create “noise” that confuses the classifier
  • Removing them creates a cleaner decision boundary
  • Theory: cleaner separation → better generalization

Your counterargument (which is valid):

  • Those are exactly the hard cases the model needs to learn
  • Removing them artificially simplifies the problem
  • In production, you’ll encounter these boundary cases — the model should handle them
  • For AML specifically: a transaction that “looks like fraud but isn’t” is valuable training signal

The reality:

Scenario                        | Tomek removal helps | Tomek removal hurts
--------------------------------|---------------------|--------------------
Noisy labels (mislabeled data)  | ✓ Cleans errors     |
Clean labels, hard examples     |                     | ✓ Loses information
Simple decision boundary        |                     |
Complex/overlapping classes     |                     | ✓ Oversimplifies

For AML detection: Your intuition is correct — AML has genuinely overlapping classes (some fraud looks legitimate by design). Removing Tomek links may:

  1. Remove the most informative examples
  2. Make the model overconfident on easy cases
  3. Increase false negatives in production

Better alternatives for hard examples:

  • Adjust class weights or threshold instead (see the sketch after this list)
  • Use focal loss (down-weights easy examples)
  • Keep the hard examples, let the model learn the uncertainty
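As a sketch of the first alternative: class weighting (here via XGBoost's scale_pos_weight) combined with a tuned decision threshold, on a synthetic stand-in dataset. The 0.3 threshold is purely illustrative:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Class weighting: up-weight the minority class by the imbalance ratio
ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = xgb.XGBClassifier(scale_pos_weight=ratio, n_estimators=200)
clf.fit(X_train, y_train)

# Threshold adjustment: trade precision vs. recall without deleting any samples
proba = clf.predict_proba(X_test)[:, 1]
preds = (proba >= 0.3).astype(int)  # illustrative; tune the threshold on a validation set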

Sampling Techniques Comparison

Technique          | What it does                                                                                   | Result
-------------------|------------------------------------------------------------------------------------------------|-------------------------------------------------
SMOTE              | Creates synthetic minority samples by interpolating between existing minority points            | Increases minority class size to match majority
RandomUnderSampler | Randomly removes majority class samples                                                          | Decreases majority class size
SMOTETomek         | SMOTE + removes “Tomek links” (majority samples that are nearest neighbors to minority samples) | Cleans decision boundary after oversampling

Visual example (starting with 1000 legit, 10 fraud):

SMOTE:           1000 legit, 1000 fraud (synthetic)
Undersampling:   20 legit,   10 fraud   (with 0.5 ratio)
SMOTETomek:      ~950 legit, 1000 fraud (SMOTE + boundary cleanup)
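A sketch that reproduces roughly those counts with imbalanced-learn (the data is a synthetic stand-in, so the exact numbers will differ slightly):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# ~1000 legit (class 0) vs ~10 fraud (class 1)
X, y = make_classification(n_samples=1010, weights=[0.99], flip_y=0, random_state=42)
print("original:     ", Counter(y))

X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:        ", Counter(y_sm))

X_us, y_us = RandomUnderSampler(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
print("undersampling:", Counter(y_us))

X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
print("SMOTETomek:   ", Counter(y_st))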

Loss-based Techniques

Class weights vs. Focal Loss

What’s the difference between class weights and focal loss? Both address class imbalance, but they work differently:

Aspect            | Class Weights                                 | Focal Loss
------------------|-----------------------------------------------|---------------------------------------------------------
What it does      | Multiplies loss by a fixed weight per class   | Down-weights easy examples dynamically
Focus             | Class membership                              | Prediction confidence
Static vs dynamic | Static (same weight for all minority samples) | Dynamic (adapts per sample based on model’s confidence)

Standard cross-entropy loss (for positive class):

loss = -log(p)

When p = 0.95 (confident correct): -log(0.95) = 0.05  (low loss)
When p = 0.10 (confident wrong):   -log(0.10) = 2.30  (high loss)

Focal loss:

loss = -(1-p)^γ * log(p)

Class Weights:

# Every fraud sample gets same 100x weight, regardless of difficulty
loss = -100 * y * log(p)  # fraud
loss = -1 * (1-y) * log(1-p)  # legit

Focal Loss:

# Hard examples (low confidence) get higher weight
# Easy examples (high confidence) get down-weighted
loss = -α * (1-p)^γ * y * log(p)
 
# γ (gamma) controls focus on hard examples
# γ=0: same as regular cross-entropy
# γ=2: common default, strongly down-weights easy cases

Example with γ=2:

Model confidence   | (1-p)^γ weight | -log(p) | Focal loss | Effect
-------------------|----------------|---------|------------|---------------
p=0.95 (easy)      | 0.0025         | 0.05    | 0.000125   | Almost ignored
p=0.50 (uncertain) | 0.25           | 0.69    | 0.173      | Moderate
p=0.10 (hard)      | 0.81           | 2.30    | 1.86       | High focus
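These numbers are easy to check by hand; a quick sketch (α is omitted, matching the table; the tiny differences are rounding):

import numpy as np

gamma = 2.0
for p in (0.95, 0.50, 0.10):
    weight = (1 - p) ** gamma   # focusing term (1-p)^γ
    ce = -np.log(p)             # standard cross-entropy for a positive example
    print(f"p={p:.2f}  weight={weight:.4f}  ce={ce:.2f}  focal={weight * ce:.6f}")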

When to use which:

Scenario                             | Better choice
-------------------------------------|-------------------------------------
Simple imbalance, similar difficulty | Class weights
Many easy negatives drowning signal  | Focal loss
Object detection (RetinaNet origin)  | Focal loss
Fraud detection with hard cases      | Focal loss (learns boundary better)

For AML: Focal loss could be better because:

  1. Most legit transactions are “obviously legit” → down-weight them
  2. Hard boundary cases get more attention
  3. Complements your earlier point about keeping hard examples

Neither XGBoost nor LightGBM ships a built-in focal loss, but both accept custom objectives so you can implement it yourself, and it plugs in naturally with neural networks.
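For the neural-network route, binary focal loss is only a few lines on top of binary cross-entropy. A minimal PyTorch sketch (α=0.25 and γ=2 are the common RetinaNet defaults, not values tuned for AML):

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss = alpha_t * (1 - p_t)^gamma * cross-entropy."""
    # per-sample cross-entropy, i.e. -log(p_t)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# usage: logits from your model, targets as floats in {0., 1.}
logits = torch.tensor([2.0, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets))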

Footnotes

  1. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html