Data Science

🚀 Handling Imbalanced Data With Reweighting

Hey there! Ready to dive into handling imbalanced data with reweighting? This guide walks through the core techniques step by step, with runnable examples for beginners and practitioners alike.

SuperML Team

🚀 Understanding Data Imbalance - Made Simple!

Data imbalance occurs when class distributions are significantly skewed, commonly seen in fraud detection, medical diagnosis, and recommendation systems. Initial assessment involves calculating class ratios and visualizing distributions to determine appropriate handling strategies.

Let’s make this super clear! Here’s how we can tackle this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=10000, n_features=2,
    n_redundant=0, n_clusters_per_class=1,
    weights=[0.95, 0.05], random_state=42
)

# Calculate imbalance ratio
unique, counts = np.unique(y, return_counts=True)
ratio = dict(zip(unique, counts))
imbalance_ratio = min(counts) / max(counts)

print(f"Class distribution: {ratio}")
print(f"Imbalance ratio: {imbalance_ratio:.3f}")

# Visualize class distribution
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0][:, 0], X[y==0][:, 1], label='Majority Class')
plt.scatter(X[y==1][:, 0], X[y==1][:, 1], label='Minority Class')
plt.legend()
plt.title('Imbalanced Dataset Visualization')
plt.show()

🚀 Basic Reweighting Strategy - Made Simple!

Reweighting assigns different weights to classes during model training so that each class has a balanced influence on the loss. This approach is effective when the imbalance is moderate and computational resources are limited.

Let’s make this super clear! Here’s how we can tackle this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Calculate class weights
n_samples = len(y)
n_classes = len(unique)
class_weights = {
    0: (n_samples / (n_classes * counts[0])),
    1: (n_samples / (n_classes * counts[1]))
}

# Train weighted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(class_weight=class_weights)
model.fit(X_train, y_train)

print("Class weights:", class_weights)
print("Model accuracy:", model.score(X_test, y_test))
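As a cross-check, scikit-learn can compute the same "balanced" weights for you via `compute_class_weight`; a minimal, self-contained sketch (the 95:5 toy array below is illustrative, not the dataset above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 95:5 split, mirroring the imbalance above
y_toy = np.array([0] * 95 + [1] * 5)

# "balanced" implements weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)

w0, w1 = float(weights[0]), float(weights[1])
print(round(w0, 3), w1)   # 0.526 10.0
```

Passing `class_weight='balanced'` directly to `LogisticRegression` applies the same formula without the manual dictionary.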

🚀 Implementing Random Undersampling - Made Simple!

Random undersampling reduces majority-class samples to match the minority-class size. This improves training efficiency but risks losing important information carried by the discarded samples.

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

from imblearn.under_sampling import RandomUnderSampler

# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Verify new class distribution
new_unique, new_counts = np.unique(y_resampled, return_counts=True)
print("Original distribution:", dict(zip(unique, counts)))
print("Resampled distribution:", dict(zip(new_unique, new_counts)))

# Train model with balanced data
model_balanced = LogisticRegression()
model_balanced.fit(X_resampled, y_resampled)
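Under the hood, random undersampling is just index selection; a numpy-only sketch of the same idea (no `imblearn` needed):

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 95 + [1] * 5)          # 95:5 imbalance
X = np.arange(len(y)).reshape(-1, 1)      # stand-in features

# Draw as many majority samples as there are minority samples
minority_n = np.bincount(y).min()
keep = np.concatenate([
    rng.choice(np.flatnonzero(y == 0), size=minority_n, replace=False),
    np.flatnonzero(y == 1),
])

X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))                  # [5 5]
```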

🚀 TikTok-Style Negative Sampling - Made Simple!

The TikTok approach prioritizes negative samples that are more likely to be confused with positive samples, improving model discrimination ability at decision boundaries.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

def tiktok_negative_sampling(X, y, pilot_model, sampling_rate=0.5):
    # Score every sample with the pilot model
    probs = pilot_model.predict_proba(X)[:, 1]
    
    # Sampling probability is proportional to the pilot score,
    # normalized by the average score over the negative class
    avg_prob = np.mean(probs[y == 0])
    sampling_probs = np.clip(sampling_rate * probs / avg_prob, 1e-6, 1.0)
    
    # Positives are always kept, so their sampling probability is 1
    sampling_probs[y == 1] = 1.0
    
    # Keep each negative sample with its sampling probability
    neg_mask = y == 0
    keep_mask = np.random.random(neg_mask.sum()) < sampling_probs[neg_mask]
    
    # Combine surviving negatives with all positive samples
    final_mask = ~neg_mask
    final_mask[neg_mask] = keep_mask
    
    return X[final_mask], y[final_mask], sampling_probs[final_mask]
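The heart of the scheme is the score-proportional keep probability; a numpy-only sketch of that one step, with hypothetical pilot scores standing in for `pilot_model.predict_proba`:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([0.9, 0.5, 0.1, 0.05])   # pilot scores for four negatives
sampling_rate = 0.5

# Keep probability is proportional to the pilot score, capped at 1
probs = np.clip(sampling_rate * scores / scores.mean(), 0.0, 1.0)
keep = rng.random(len(scores)) < probs

print(np.round(probs, 3))   # hard negatives get the highest keep probability
```

Negatives the pilot model scores highly (the ones most easily confused with positives) survive sampling far more often than easy negatives.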

🚀 Implementing Pilot Model Training - Made Simple!

The pilot model serves as an initial classifier trained on balanced data to estimate sample importance. Its predictions guide the intelligent selection of negative samples for final model training.

This next part is really neat! Here’s how we can tackle this:

def train_pilot_model(X, y):
    # Step 1: Create balanced dataset for pilot model
    rus = RandomUnderSampler(random_state=42)
    X_pilot, y_pilot = rus.fit_resample(X, y)
    
    # Step 2: Train pilot model
    pilot_model = LogisticRegression(max_iter=1000)
    pilot_model.fit(X_pilot, y_pilot)
    
    # Step 3: Evaluate pilot model
    pilot_score = pilot_model.score(X, y)
    print(f"Pilot model accuracy: {pilot_score:.3f}")
    
    return pilot_model
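One caveat about the accuracy printed above: on a 95:5 split, a model that always predicts the majority class already scores 95%. `balanced_accuracy_score` (the mean of per-class recalls) exposes this; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)     # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.95
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```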

🚀 Probability Correction Implementation - Made Simple!

This section shows how to implement probability correction to account for sampling bias, ensuring unbiased probability estimates despite non-uniform sampling.

Let’s make this super clear! Here’s how we can tackle this:

def correct_probabilities(raw_probs, sampling_probs):
    """
    Correct model probabilities for sampling bias (cf. TikTok's paper).
    If negatives are kept with probability q, training inflates the
    odds by 1/q, so the true log-odds are the raw log-odds plus log(q).
    """
    # Clip to avoid division by zero at probabilities of exactly 0 or 1
    raw_probs = np.clip(raw_probs, 1e-7, 1 - 1e-7)
    
    # Convert probabilities to log-odds
    log_odds = np.log(raw_probs / (1 - raw_probs))
    
    # Undo the odds inflation introduced by sampling
    corrected_log_odds = log_odds + np.log(sampling_probs)
    
    # Convert back to probabilities
    corrected_probs = 1 / (1 + np.exp(-corrected_log_odds))
    
    return corrected_probs
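As a quick numeric check: if negatives are kept with probability q, training inflates the odds by 1/q, so a raw score p maps back to q·p / (1 − p + q·p). A self-contained sketch with toy numbers:

```python
import numpy as np

def correct(raw_p, q):
    # Undo the 1/q odds inflation introduced by negative sampling
    log_odds = np.log(raw_p / (1 - raw_p)) + np.log(q)
    return 1 / (1 + np.exp(-log_odds))

p, q = 0.5, 0.1
corrected = correct(p, q)
closed_form = q * p / (1 - p + q * p)

print(round(float(corrected), 4))    # 0.0909
print(round(closed_form, 4))         # 0.0909
```

A raw score of 0.5 on a model trained with only 10% of negatives corresponds to a true probability of about 9%, matching the closed form.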

🚀 End-to-End Implementation - Made Simple!

A complete implementation combining pilot model training, intelligent negative sampling, and probability correction for real-world recommendation systems.

Let me walk you through this step by step! Here’s how we can tackle this:

class IntelligentNegativeSampler:
    def __init__(self, sampling_rate=0.5):
        self.sampling_rate = sampling_rate
        self.pilot_model = None
        self.final_model = None
        
    def fit(self, X, y):
        # Train pilot model
        self.pilot_model = train_pilot_model(X, y)
        
        # Average pilot score over negatives, reused at prediction time
        pilot_probs = self.pilot_model.predict_proba(X)[:, 1]
        self.avg_neg_prob = np.mean(pilot_probs[y == 0])
        
        # Perform intelligent negative sampling
        X_sampled, y_sampled, self.sampling_probs = tiktok_negative_sampling(
            X, y, self.pilot_model, self.sampling_rate
        )
        
        # Train final model
        self.final_model = LogisticRegression(max_iter=1000)
        self.final_model.fit(X_sampled, y_sampled)
        return self
        
    def predict_proba(self, X):
        raw_probs = self.final_model.predict_proba(X)[:, 1]
        # Recompute per-sample sampling probabilities for the new data;
        # the training-time probabilities belong to different samples
        pilot_probs = self.pilot_model.predict_proba(X)[:, 1]
        sampling_probs = np.clip(
            self.sampling_rate * pilot_probs / self.avg_neg_prob, 1e-6, 1.0
        )
        return correct_probabilities(raw_probs, sampling_probs)

🚀 Real-World Example - E-commerce Recommendations - Made Simple!

Implementation of intelligent negative sampling for an e-commerce recommendation system, demonstrating preprocessing, training, and evaluation.

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

def ecommerce_recommendation_example():
    # Generate synthetic e-commerce data
    n_samples = 100000
    n_features = 10
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        weights=[0.98, 0.02],  # Typical click-through rate
        random_state=42
    )
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Train model with intelligent sampling
    sampler = IntelligentNegativeSampler(sampling_rate=0.3)
    sampler.fit(X_train, y_train)
    
    # Evaluate
    y_pred_proba = sampler.predict_proba(X_test)
    return y_test, y_pred_proba

🚀 Results Evaluation and Metrics - Made Simple!

Complete evaluation metrics for imbalanced classification problems, including precision-recall curves, ROC curves, and custom metrics designed for recommendation systems.

Let me walk you through this step by step! Here’s how we can tackle this:

from sklearn.metrics import precision_recall_curve, roc_curve, auc

def evaluate_model_performance(y_true, y_pred_proba):
    # Calculate precision-recall curve
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_pred_proba)
    pr_auc = auc(recall, precision)
    
    # Calculate ROC curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    # Plot curves
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Precision-Recall curve
    ax1.plot(recall, precision, label=f'PR-AUC: {pr_auc:.3f}')
    ax1.set_title('Precision-Recall Curve')
    ax1.set_xlabel('Recall')
    ax1.set_ylabel('Precision')
    ax1.legend()
    
    # ROC curve
    ax2.plot(fpr, tpr, label=f'ROC-AUC: {roc_auc:.3f}')
    ax2.plot([0, 1], [0, 1], 'k--')
    ax2.set_title('ROC Curve')
    ax2.set_xlabel('False Positive Rate')
    ax2.set_ylabel('True Positive Rate')
    ax2.legend()
    
    return pr_auc, roc_auc
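For a quick scalar summary without plotting, scikit-learn's `average_precision_score` and `roc_auc_score` give the same two numbers directly; a minimal sketch on synthetic scores (the distributions below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.05).astype(int)        # ~5% positives
scores = np.where(y_true == 1,
                  rng.normal(0.7, 0.2, 2000),          # positives score higher
                  rng.normal(0.3, 0.2, 2000))

ap = average_precision_score(y_true, scores)
roc = roc_auc_score(y_true, scores)
print(f"AP: {ap:.3f}  ROC-AUC: {roc:.3f}")
```

With a 5% positive rate, average precision (the PR-AUC analogue) is usually the more informative of the two: ROC-AUC can look excellent even when precision is poor.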

🚀 Online Learning Implementation - Made Simple!

Implementing online learning with intelligent negative sampling for continuous model updates in production recommender systems.

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

class OnlineIntelligentSampler:
    def __init__(self, window_size=10000):
        self.window_size = window_size
        self.X_buffer = []
        self.y_buffer = []
        self.model = None
        
    def update(self, X_new, y_new):
        # Add new samples to buffer
        self.X_buffer.extend(X_new)
        self.y_buffer.extend(y_new)
        
        # Maintain fixed window size
        if len(self.X_buffer) > self.window_size:
            self.X_buffer = self.X_buffer[-self.window_size:]
            self.y_buffer = self.y_buffer[-self.window_size:]
        
        # Retrain model if enough data
        if len(self.X_buffer) >= 1000:
            X = np.array(self.X_buffer)
            y = np.array(self.y_buffer)
            
            # Apply intelligent sampling
            sampler = IntelligentNegativeSampler()
            sampler.fit(X, y)
            self.model = sampler
            
    def predict_proba(self, X):
        if self.model is None:
            raise ValueError("Model not trained yet")
        return self.model.predict_proba(X)
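The fixed-size buffer logic above can also be delegated to `collections.deque(maxlen=...)`, which discards the oldest samples automatically; a minimal sketch:

```python
from collections import deque

buffer = deque(maxlen=5)      # sliding window of the 5 most recent items
for i in range(8):
    buffer.append(i)          # items 0-2 fall out as 5-7 arrive

print(list(buffer))           # [3, 4, 5, 6, 7]
```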

🚀 Handling Extreme Imbalance - Made Simple!

Techniques for handling extreme imbalance ratios (worse than 1:1000) by combining intelligent sampling with ensemble methods.

Let’s make this super clear! Here’s how we can tackle this:

from imblearn.ensemble import BalancedRandomForestClassifier

class ExtremeImbalanceHandler:
    def __init__(self, imbalance_threshold=0.001):
        self.imbalance_threshold = imbalance_threshold
        
    def fit(self, X, y):
        # Calculate imbalance ratio
        unique, counts = np.unique(y, return_counts=True)
        imbalance_ratio = min(counts) / max(counts)
        
        if imbalance_ratio < self.imbalance_threshold:
            # Use ensemble of sampling methods
            self.model = BalancedRandomForestClassifier(
                n_estimators=100,
                sampling_strategy='auto',
                replacement=True
            )
        else:
            # Use standard intelligent sampling
            self.model = IntelligentNegativeSampler()
            
        self.model.fit(X, y)
        return self

🚀 Performance Monitoring System - Made Simple!

Implementation of a monitoring system to track model performance and sampling efficiency in production environments, detecting concept drift and sampling bias issues.

Let me walk you through this step by step! Here’s how we can tackle this:

class PerformanceMonitor:
    def __init__(self, window_size=1000):
        self.metrics_history = {
            'auc_scores': [],
            'sampling_efficiency': [],
            'class_distribution': []
        }
        self.window_size = window_size
        
    def update_metrics(self, y_true, y_pred, sampling_probs):
        from sklearn.metrics import roc_auc_score
        
        # Calculate AUC score
        auc = roc_auc_score(y_true, y_pred)
        self.metrics_history['auc_scores'].append(auc)
        
        # Calculate sampling efficiency
        efficiency = np.mean(sampling_probs)
        self.metrics_history['sampling_efficiency'].append(efficiency)
        
        # Track class distribution
        class_dist = np.mean(y_true)
        self.metrics_history['class_distribution'].append(class_dist)
        
        # Detect anomalies
        if len(self.metrics_history['auc_scores']) >= self.window_size:
            self._check_for_anomalies()
            
    def _check_for_anomalies(self):
        recent_auc = np.mean(self.metrics_history['auc_scores'][-10:])
        overall_auc = np.mean(self.metrics_history['auc_scores'])
        
        if recent_auc < overall_auc * 0.9:  # 10% degradation
            print("Warning: Performance degradation detected")
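The degradation rule reduces to comparing a short rolling mean against the long-run mean; a numpy-only sketch of the same check on a made-up AUC history:

```python
import numpy as np

auc_history = [0.92] * 90 + [0.75] * 10    # performance drops near the end

recent = np.mean(auc_history[-10:])        # 0.75
overall = np.mean(auc_history)             # 0.903
degraded = recent < overall * 0.9          # 0.75 < 0.813

print(bool(degraded))                      # True
```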

🚀 Real-world Example - Social Media Engagement - Made Simple!

Implementation of intelligent negative sampling for social media engagement prediction, including feature engineering and temporal aspects.

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

class SocialMediaEngagementPredictor:
    def __init__(self):
        self.sampler = IntelligentNegativeSampler()
        self.monitor = PerformanceMonitor()
        
    def preprocess_features(self, data):
        features = {
            'user_activity': data['user_posts'] / np.mean(data['user_posts']),
            'content_length': np.log1p(data['content_length']),
            'time_of_day': np.cos(2 * np.pi * data['hour'] / 24),
            'day_of_week': np.sin(2 * np.pi * data['day'] / 7),
            'topic_relevance': data['topic_score']
        }
        return pd.DataFrame(features)
    
    def fit(self, X_raw, y):
        # Preprocess features
        X = self.preprocess_features(X_raw)
        
        # Train model with intelligent sampling
        self.sampler.fit(X.values, y)
        
        # Initial monitoring
        y_pred = self.sampler.predict_proba(X.values)
        self.monitor.update_metrics(y, y_pred, self.sampler.sampling_probs)
        
    def predict_proba(self, X_raw):
        X = self.preprocess_features(X_raw)
        return self.sampler.predict_proba(X.values)
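One subtlety in `preprocess_features`: a single `cos` (or `sin`) maps two different hours to the same value, so encoding each cyclic feature as a sin/cos pair keeps them distinct. A minimal sketch:

```python
import numpy as np

hours = np.array([6, 18])                   # cos alone cannot tell these apart
angle = 2 * np.pi * hours / 24

cos_only = np.cos(angle)                    # both are (numerically) zero
pair = np.stack([np.sin(angle), np.cos(angle)], axis=1)

print(np.allclose(cos_only[0], cos_only[1]))   # True: collision
print(np.allclose(pair[0], pair[1]))           # False: the pair separates them
```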

🚀 Additional Resources - Made Simple!

  • “Efficient Extreme Multi-label Classification using Negative Sampling” https://arxiv.org/abs/2006.11015
  • “Self-paced Negative Sampling for Training Deep Neural Networks” https://arxiv.org/abs/1911.12137
  • “Learning from Imbalanced Data: Open Challenges and Future Directions” https://arxiv.org/abs/1901.04533
  • “Adaptive Sampling Strategies for Recommendation Systems” Search on Google Scholar with keywords: adaptive sampling recommendation systems
  • “Deep Learning with Imbalanced Data: Strategies and Applications” Search on Google Scholar with keywords: deep learning imbalanced data strategies

🎊 Awesome Work!

You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.

What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.

Keep coding, keep learning, and keep being awesome! 🚀
