The F1-score is the harmonic mean of precision and recall, and gives a more balanced picture:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Where:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
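A minimal sketch of these metrics with scikit-learn (the tiny label arrays are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # 1 = fraud, 0 = legitimate
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # model predictions

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```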
📌 In fraud detection:
- We want high recall → catch as many frauds as possible (reduce false negatives)
- We also want high precision → avoid too many false alarms (reduce false positives)
A high F1-score means your model is doing well on both fronts.
You may even consider using an Fβ-score to prioritize either recall or precision depending on your business need:
- F2-score if catching more fraud is more important than avoiding false positives.
- F0.5-score if false positives are costlier than missing some fraud.
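A quick sketch with scikit-learn's `fbeta_score` (reusing the made-up labels from above; `beta > 1` favours recall, `beta < 1` favours precision):

```python
from sklearn.metrics import fbeta_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

print("F2:  ", fbeta_score(y_true, y_pred, beta=2))    # weights recall higher
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))  # weights precision higher
```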
If 99.9% of transactions are legitimate and the model always predicts "Not Fraud", then:
- Accuracy = 99.9%
- True Positives (fraud detected) = 0
- False Negatives (fraud missed) = All actual frauds
📌 So the model looks good on paper (99.9% accurate) but is completely useless in practice: it catches no fraud.
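A small sketch demonstrating this accuracy paradox on synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% fraud
y_pred = np.zeros_like(y_true)                      # always predicts "Not Fraud"

print("Accuracy:", accuracy_score(y_true, y_pred))  # ~0.999, looks great
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0, catches no fraud
```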
| Metric | Good for imbalanced data? | What it tells you |
|---|---|---|
| Accuracy | ❌ No | Can be misleading if classes are imbalanced |
| Precision | ✅ Yes | How many predicted frauds were actually frauds |
| Recall | ✅ Yes | How many actual frauds you successfully detected |
| F1-score | ✅✅ Best choice | Balances precision and recall |
For fraud detection, which is a highly imbalanced binary classification problem, you need a model that:
- Handles class imbalance well.
- Can capture complex patterns.
- Can be tuned for precision-recall trade-offs.
Here are some recommended models, categorized by complexity:
| Model | Notes |
|---|---|
| Logistic Regression | Simple, interpretable, good baseline. Add class weights or use SMOTE. |
| Decision Tree | Captures non-linear patterns, but can overfit. |
Use with:
- `class_weight='balanced'`
- feature scaling (for Logistic Regression)
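A minimal baseline sketch along these lines (the `make_classification` data is a synthetic stand-in for a real fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: ~1% positive (fraud) class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = make_pipeline(
    StandardScaler(),  # feature scaling for Logistic Regression
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), digits=3))
```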
| Model | Why it's good for fraud detection |
|---|---|
| Random Forest | Robust, handles imbalance with class weights. |
| XGBoost | Handles imbalance via `scale_pos_weight`; high performance. |
| LightGBM | Fast, efficient, supports the `is_unbalance=True` flag. |
| CatBoost | Works well with categorical features and imbalance. |
✅ These are often top performers in Kaggle competitions and real-world systems.
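A sketch of the `scale_pos_weight` idea with XGBoost (assumes the `xgboost` package is installed; synthetic data again stands in for real transactions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Weight the rare positive class by the legit/fraud ratio of the training set.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
clf.fit(X_train, y_train)
```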
If you have very few fraud samples, try:
| Model | Notes |
|---|---|
| Isolation Forest | Unsupervised, good for detecting rare patterns |
| One-Class SVM | Works when you only have "normal" data to learn from |
| Autoencoders (Deep Learning) | Learn normal patterns, flag large reconstruction errors as frauds |
📌 Use these when you don't have labels for frauds or they are very sparse.
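A minimal unsupervised sketch with Isolation Forest (synthetic points; in practice `X` would be unlabeled transaction features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(10_000, 4))  # "legitimate" behaviour
outliers = rng.normal(6, 1, size=(10, 4))    # rare, very different points
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.001, random_state=0).fit(X)
flags = iso.predict(X)                       # -1 = anomaly, 1 = normal
print("Flagged as anomalous:", int((flags == -1).sum()))
```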
| Model | Notes |
|---|---|
| Graph Neural Networks | If fraud involves networks (users, devices, accounts) |
| Hybrid Models (Ensemble + Deep Learning) | Combine decision trees and autoencoders |
- Resampling: SMOTE, ADASYN, or undersampling the majority class.
- Evaluation: Use F1-score and Precision-Recall AUC, not accuracy.
- Threshold tuning: You can tune the classification threshold to optimize F1 or minimize business cost (see the sketch after this list).
- Explainability: Use SHAP/LIME for model interpretability, especially important in finance.
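A sketch of threshold tuning to maximize F1 (assumes a fitted classifier `clf` with `predict_proba`, and held-out validation data `X_val`, `y_val`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

proba = clf.predict_proba(X_val)[:, 1]  # predicted fraud probabilities
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

best = int(np.argmax(f1[:-1]))          # last P/R pair has no threshold
print(f"Best threshold: {thresholds[best]:.3f}, F1: {f1[best]:.3f}")
y_pred = (proba >= thresholds[best]).astype(int)
```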
# Try this pipeline:
- Preprocessing: Scale features + encode categoricals
- Use: LightGBM or XGBoost
- Set: `scale_pos_weight` = (legit / fraud) ratio
- Evaluate: precision, recall, F1, PR-AUC (see the sketch below)
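A minimal end-to-end sketch of this pipeline (requires the `lightgbm` package; the synthetic data is numeric, so the scaling/encoding step is omitted here, and tree models don't need scaling anyway):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real fraud dataset: ~0.5% fraud.
X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.995], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

ratio = (y_train == 0).sum() / (y_train == 1).sum()  # legit / fraud
clf = LGBMClassifier(scale_pos_weight=ratio)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("F1:       ", f1_score(y_test, pred))
print("PR-AUC:   ", average_precision_score(y_test, proba))
```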
Example: https://www.kaggle.com/code/mayuringle8890/fraud-detection-notebook/
In a real fraud detection system, you might also:
- Use Precision-Recall curves
- Optimize based on business cost (e.g., cost of a false positive vs false negative)
- Use a confusion matrix to interpret model performance (see the sketch below)
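For example, with a fitted classifier `clf` and a test split as in the pipeline sketch above (needs `matplotlib` for the plot):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay, confusion_matrix

PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)  # precision-recall curve
print(confusion_matrix(y_test, clf.predict(X_test)))        # [[TN, FP], [FN, TP]]
plt.show()
```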