XGBoost is a gradient boosting library that performs well on the highly imbalanced datasets often seen in real-world problems such as fraud detection and medical diagnosis. It offers several settings for imbalanced data that contribute to consistent and reliable model performance.
Key Features for Imbalanced Datasets
Weighted Loss Function
XGBoost lets you weight positive examples relative to negative ones in its loss function through the scale_pos_weight parameter, counteracting class imbalance. A common starting point is the ratio of negative to positive examples.
# Weight each positive example 3x in the loss (e.g., for a roughly 3:1 negative-to-positive ratio)
from xgboost import XGBClassifier

xgb_model = XGBClassifier(scale_pos_weight=3)
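In practice the weight is usually derived from the training labels rather than hard-coded. A minimal sketch, assuming binary labels in NumPy arrays X_train and y_train (the names are placeholders):

import numpy as np
from xgboost import XGBClassifier

# Ratio of negative to positive examples, the starting point the
# XGBoost documentation suggests for scale_pos_weight
neg, pos = np.sum(y_train == 0), np.sum(y_train == 1)
xgb_model = XGBClassifier(scale_pos_weight=neg / pos)
xgb_model.fit(X_train, y_train)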
Stratified Cross-Validation
XGBoost does not rebalance classes at each boosting round, but its cross-validation utility supports stratified sampling: xgb.cv accepts stratified=True, which preserves the class ratio in every fold so that evaluation on skewed data is not distorted by which examples land in each split.
# Booster parameters; balance_ratio is the negatives-to-positives ratio from above
param = {
    'tree_method': 'hist',  # histogram-based split finding, for speed
    'max_bin': 256,         # the default, stated explicitly
    'scale_pos_weight': balance_ratio,
}
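A minimal sketch of stratified cross-validation with these parameters, assuming the X_train and y_train arrays from the earlier example:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
param['objective'] = 'binary:logistic'
cv_results = xgb.cv(
    param, dtrain,
    num_boost_round=200,
    nfold=5,
    stratified=True,          # keep the class ratio identical in every fold
    metrics='aucpr',
    early_stopping_rounds=20,
)
print(cv_results.tail())      # per-round train/test aucpr means and stds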
Focused Evaluation Metrics
Accuracy is an unreliable metric on imbalanced data: a model that always predicts the majority class can still score highly. Prefer measures that focus on the minority class, such as precision, recall, and the F1 score (computed with scikit-learn, for example), or XGBoost's built-in area under the precision-recall curve ('aucpr').
For metric evaluation during training, set:
# Area under the precision-recall curve; the default for binary:logistic is 'logloss'
xgb_model = XGBClassifier(eval_metric='aucpr')
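Threshold-dependent metrics are computed after training; a short sketch assuming a fitted model and a held-out X_test, y_test:

from sklearn.metrics import classification_report

y_pred = xgb_model.predict(X_test)
# Reports precision, recall, and F1 for each class separately
print(classification_report(y_test, y_pred))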
Hyperparameter Optimization
Beyond the settings above, a few hyperparameters have an outsized effect on imbalanced datasets (a combined sketch follows this list):
max_delta_step: caps how much each leaf's output can change in one update; the XGBoost documentation suggests values of 1-10 to stabilize the logistic loss update when classes are extremely imbalanced.
gamma (minimum loss reduction required to make a further partition): controls how conservatively trees grow; a lower gamma permits splits that isolate small pockets of minority-class examples, at the cost of greater overfitting risk.
subsample (row sampling rate for each boosting round): subsampling is uniform, so it does not change the class ratio, but the added randomness can reduce overfitting to the majority class.
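A sketch combining these knobs; the values are illustrative starting points, not tuned results:

from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    max_delta_step=1,     # stabilize logistic updates under extreme imbalance
    gamma=0.5,            # require a modest loss reduction before splitting
    subsample=0.8,        # draw 80% of rows for each boosting round
    scale_pos_weight=neg / pos,   # class weighting, as in the earlier example
)
xgb_model.fit(X_train, y_train)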
GPU Training and Model Serialization
Recent XGBoost releases do not add a dedicated imbalanced-data toolkit, but two features pair well with the techniques above:
tree_method='gpu_hist' together with scale_pos_weight: GPU-accelerated histogram training for the large datasets where imbalance is most common (requires a CUDA-enabled build).
JSON model serialization: saving to a .json file stores the model in the text format that recent releases recommend over the legacy binary format.
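A sketch of GPU training plus JSON serialization, assuming a CUDA-enabled build (on XGBoost 2.x the equivalent is tree_method='hist' with device='cuda'):

from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    tree_method='gpu_hist',       # GPU-accelerated histogram algorithm
    scale_pos_weight=neg / pos,   # class weighting, as before
)
xgb_model.fit(X_train, y_train)

# The .json extension selects the JSON serialization format
xgb_model.save_model('model.json')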