
🚀 Comprehensive Oversampling Techniques for Imbalanced Learning: The Guide That Changes Everything!

Hey there! Ready to dive into Oversampling Techniques For Imbalanced Learning? This friendly guide will walk you through everything step-by-step with easy-to-follow examples. Perfect for beginners and pros alike!

SuperML Team

🚀 Understanding Data Imbalance - Made Simple!

💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!

Data imbalance occurs when class distributions in a dataset are significantly skewed. This fundamental challenge in machine learning can severely impact model performance, as algorithms tend to be biased towards the majority class, leading to poor predictive accuracy for minority classes.

Ready for some cool stuff? Here’s how we can tackle this:

# Example of imbalanced dataset creation and visualization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=1000,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% majority, 10% minority
    random_state=42
)

# Display class distribution
unique, counts = np.unique(y, return_counts=True)
plt.bar(['Majority', 'Minority'], counts)
plt.title('Class Distribution')
plt.ylabel('Number of Samples')
plt.show()

print(f"Class distribution:\nMajority: {counts[0]}\nMinority: {counts[1]}")

🚀 Random Oversampling Implementation - Made Simple!

🎉 You’re doing great! This concept might seem tricky at first, but you’ve got this!

Random oversampling duplicates minority class instances at random until the class distribution is balanced. While simple, this method risks overfitting because it creates exact copies of existing samples without introducing any new information.

Here’s where it gets exciting! Here’s how we can tackle this:

import numpy as np
from sklearn.model_selection import train_test_split

def random_oversample(X, y):
    # Get indices of each class
    majority_indices = np.where(y == 0)[0]
    minority_indices = np.where(y == 1)[0]
    
    # Calculate number of samples to generate
    n_samples = len(majority_indices) - len(minority_indices)
    
    # Randomly sample with replacement from minority class
    minority_indices_resampled = np.random.choice(
        minority_indices,
        size=n_samples,
        replace=True
    )
    
    # Combine indices and sort them
    all_indices = np.concatenate([
        majority_indices,
        minority_indices,
        minority_indices_resampled
    ])
    
    return X[all_indices], y[all_indices]

# Example usage
X_resampled, y_resampled = random_oversample(X, y)
print(f"Original distribution: {np.bincount(y)}")
print(f"Resampled distribution: {np.bincount(y_resampled)}")

🚀 SMOTE Algorithm Core Concepts - Made Simple!

Cool fact: Many professional data scientists use this exact approach in their daily work!

Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic samples by interpolating between minority class instances. It uses k-nearest neighbors to create new samples along the feature space lines connecting minority samples.

Let me walk you through this step by step! Here’s the core formula:

$$x_{new} = x_i + \lambda \cdot (x_{knn} - x_i)$$

Where:
$x_i$ is the selected minority instance,
$x_{knn}$ is one of its k-nearest neighbors, and
$\lambda$ is a random number between 0 and 1.
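To make the interpolation concrete, here’s a tiny worked example with made-up numbers (purely for illustration, not part of the pipeline above): with $x_i = (1, 2)$, neighbor $x_{knn} = (3, 6)$, and $\lambda = 0.5$, the synthetic sample lands at the midpoint $(2, 4)$:

# Worked example of the SMOTE interpolation formula
# (values chosen purely for illustration)
import numpy as np

x_i = np.array([1.0, 2.0])    # minority instance
x_knn = np.array([3.0, 6.0])  # one of its k-nearest neighbors
lam = 0.5                     # random lambda drawn from [0, 1]

x_new = x_i + lam * (x_knn - x_i)
print(x_new)  # [2. 4.]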

🚀 SMOTE Implementation from Scratch - Made Simple!

🔥 Level up: Once you master this, you’ll be solving problems like a pro!

Here’s where it gets exciting! Here’s how we can tackle this:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_minority, n_samples, k_neighbors=5):
    # Initialize nearest neighbors
    neigh = NearestNeighbors(n_neighbors=k_neighbors + 1)
    neigh.fit(X_minority)
    
    # Find k-nearest neighbors for all minority samples
    distances, indices = neigh.kneighbors(X_minority)
    
    # Generate synthetic samples
    synthetic_samples = []
    for i in range(len(X_minority)):
        # Get k-nearest neighbors (excluding the sample itself)
        nn_indices = indices[i, 1:]
        
        # Generate n_samples/len(X_minority) samples for each minority instance
        n_synthetic = int(n_samples / len(X_minority))
        
        for _ in range(n_synthetic):
            # Select random neighbor
            nn_idx = np.random.choice(nn_indices)
            
            # Calculate synthetic sample
            diff = X_minority[nn_idx] - X_minority[i]
            synthetic = X_minority[i] + np.random.random() * diff
            synthetic_samples.append(synthetic)
    
    # Guard against the edge case where no synthetic samples were generated
    if not synthetic_samples:
        return X_minority
    
    return np.vstack([X_minority, np.array(synthetic_samples)])

# Example usage
X_minority = X[y == 1]
n_samples_needed = sum(y == 0) - sum(y == 1)
X_synthetic = smote_oversample(X_minority, n_samples_needed)
print(f"Original minority samples: {len(X_minority)}")
print(f"Synthetic samples generated: {len(X_synthetic) - len(X_minority)}")

🚀 ADASYN Algorithm Principles - Made Simple!

ADASYN (Adaptive Synthetic) improves upon SMOTE by focusing on harder-to-learn examples. It uses a density distribution to determine the number of synthetic samples needed for each minority instance, giving more weight to instances that are harder to learn.

This next part is really neat! Here are the key formulas:

First, ADASYN assigns each minority instance a density ratio that measures how hard it is to learn:

$$r_i = \frac{\Delta_i}{Z}$$

Where $\Delta_i$ is the number of majority samples among the k-nearest neighbors of minority instance $x_i$, and $Z$ is a normalization constant chosen so the $r_i$ sum to 1.

The number of synthetic samples generated for each instance is then:

$$g_i = r_i \cdot G$$

Where $G$ is the total number of synthetic samples needed.
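To see the weighting in action, here’s a toy example (numbers made up purely for illustration): three minority instances whose k=5 neighborhoods contain 4, 2, and 0 majority samples get hardness ratios 0.8, 0.4, and 0.0, so the hardest instance receives the most synthetic samples:

# Toy ADASYN weighting example (values chosen purely for illustration)
import numpy as np

delta = np.array([4, 2, 0]) / 5  # fraction of majority samples in each neighborhood
r = delta / delta.sum()          # normalized density distribution (sums to 1)
G = 9                            # total synthetic samples needed

g = np.round(r * G).astype(int)
print(r)  # approximately [0.667 0.333 0.]
print(g)  # [6 3 0]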

🚀 ADASYN Implementation from Scratch - Made Simple!

ADASYN generates synthetic samples by focusing on minority instances that are harder to learn. This example includes the density distribution calculation and adaptive synthetic sample generation based on the difficulty level of each minority instance.

Let’s break this down together! Here’s how we can tackle this:

def adasyn_oversample(X, y, k_neighbors=5, beta=1.0):
    # Identify minority and majority classes
    minority_class = 1
    X_minority = X[y == minority_class]
    
    # Calculate number of samples to generate
    minority_count = sum(y == minority_class)
    majority_count = sum(y == 1 - minority_class)
    G = (majority_count - minority_count) * beta
    
    # Initialize nearest neighbors
    neigh = NearestNeighbors(n_neighbors=k_neighbors + 1)
    neigh.fit(X)
    
    # Calculate r_i (density ratio) for each minority instance
    r_i = []
    for x_i in X_minority:
        indices = neigh.kneighbors([x_i], return_distance=False)[0][1:]
        delta_i = sum(y[indices] != minority_class) / k_neighbors
        r_i.append(delta_i)
    
    # Normalize r_i into a density distribution (convert to an array
    # first; dividing a Python list by a scalar raises a TypeError)
    r_i = np.array(r_i)
    if r_i.sum() == 0:
        r_i = np.ones(len(r_i)) / len(r_i)
    else:
        r_i = r_i / r_i.sum()
    
    # Calculate g_i (number of synthetic samples for each minority instance)
    g_i = np.round(r_i * G).astype(int)
    
    # Fit nearest neighbors on the minority class once, outside the loop
    minority_indices = np.where(y == minority_class)[0]
    neigh_minority = NearestNeighbors(n_neighbors=k_neighbors + 1)
    neigh_minority.fit(X[minority_indices])
    
    # Generate synthetic samples
    synthetic_samples = []
    for i, x_i in enumerate(X_minority):
        if g_i[i] == 0:
            continue
            
        # Find k-nearest minority neighbors (excluding the sample itself)
        neighbors = neigh_minority.kneighbors([x_i], return_distance=False)[0][1:]
        
        # Generate g_i synthetic samples
        for _ in range(g_i[i]):
            nn_idx = np.random.choice(neighbors)
            x_nn = X[minority_indices[nn_idx]]
            
            # Generate synthetic sample
            lambda_val = np.random.random()
            synthetic = x_i + lambda_val * (x_nn - x_i)
            synthetic_samples.append(synthetic)
    
    # Guard against the edge case where no synthetic samples were generated
    if not synthetic_samples:
        return X_minority
    
    return np.vstack([X_minority, np.array(synthetic_samples)])

# Example usage
X_resampled = adasyn_oversample(X, y)
print(f"Original minority samples: {sum(y == 1)}")
print(f"Synthetic samples generated: {len(X_resampled) - sum(y == 1)}")

🚀 Real-world Application - Credit Card Fraud Detection - Made Simple!

Credit card fraud detection presents a classic imbalanced learning problem where fraudulent transactions represent a tiny fraction of all transactions. This example shows you the application of oversampling techniques in a real-world scenario.

Ready for some cool stuff? Here’s how we can tackle this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import pandas as pd

# Load and preprocess credit card data
def prepare_credit_card_data():
    # Synthetic credit card dataset for demonstration
    np.random.seed(42)
    n_samples = 10000
    
    # Generate legitimate transactions (99.7%)
    X_legitimate = np.random.normal(loc=0, scale=1, size=(int(n_samples * 0.997), 2))
    
    # Generate fraudulent transactions (0.3%)
    X_fraudulent = np.random.normal(loc=2, scale=1, size=(int(n_samples * 0.003), 2))
    
    X = np.vstack([X_legitimate, X_fraudulent])
    y = np.hstack([np.zeros(len(X_legitimate)), np.ones(len(X_fraudulent))])
    
    return X, y

# Prepare data
X, y = prepare_credit_card_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models with different sampling techniques
def evaluate_model(X_train_resampled, y_train_resampled, X_test, y_test, method_name):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train_resampled, y_train_resampled)
    y_pred = clf.predict(X_test)
    
    print(f"\nResults for {method_name}:")
    print(classification_report(y_test, y_pred))
    return clf

# Original (imbalanced) data
clf_original = evaluate_model(X_train, y_train, X_test, y_test, "Original Data")

# SMOTE: oversample the minority class, then recombine with the majority class
n_needed = int(sum(y_train == 0) - sum(y_train == 1))
X_minority_smote = smote_oversample(X_train[y_train == 1], n_needed)
X_train_smote = np.vstack([X_train[y_train == 0], X_minority_smote])
y_train_smote = np.hstack([np.zeros(int(sum(y_train == 0))), np.ones(len(X_minority_smote))])
clf_smote = evaluate_model(X_train_smote, y_train_smote, X_test, y_test, "SMOTE")

# ADASYN: also returns minority plus synthetic samples, so recombine the same way
X_minority_adasyn = adasyn_oversample(X_train, y_train)
X_train_adasyn = np.vstack([X_train[y_train == 0], X_minority_adasyn])
y_train_adasyn = np.hstack([np.zeros(int(sum(y_train == 0))), np.ones(len(X_minority_adasyn))])
clf_adasyn = evaluate_model(X_train_adasyn, y_train_adasyn, X_test, y_test, "ADASYN")
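Before moving on, it’s worth comparing against scikit-learn’s built-in class weighting, which reweights the training loss instead of duplicating data. A quick baseline sketch using the same classifier:

# Baseline alternative: cost-sensitive weighting instead of resampling
clf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
clf_weighted.fit(X_train, y_train)
print("\nResults for class_weight='balanced':")
print(classification_report(y_test, clf_weighted.predict(X_test)))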

🚀 Performance Visualization and Analysis - Made Simple!

This slide compares the different oversampling techniques side by side using ROC and precision-recall curves, two views that are far more informative than raw accuracy on imbalanced data.

Don’t worry, this is easier than it looks! Here’s how we can tackle this:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve

def plot_performance_comparison(models_dict, X_test, y_test):
    plt.figure(figsize=(15, 5))
    
    # ROC Curve
    plt.subplot(1, 2, 1)
    for name, model in models_dict.items():
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        roc_auc = auc(fpr, tpr)
        
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
    
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves')
    plt.legend()
    
    # Precision-Recall Curve
    plt.subplot(1, 2, 2)
    for name, model in models_dict.items():
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
        pr_auc = auc(recall, precision)
        
        plt.plot(recall, precision, label=f'{name} (AUC = {pr_auc:.2f})')
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curves')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

# Create models dictionary
models_dict = {
    'Original': clf_original,
    'SMOTE': clf_smote,
    'ADASYN': clf_adasyn
}

plot_performance_comparison(models_dict, X_test, y_test)

🚀 Borderline-SMOTE Enhancement - Made Simple!

Borderline-SMOTE enhances traditional SMOTE by focusing on minority instances near the decision boundary. The approach identifies borderline samples by counting how many of each minority sample’s nearest neighbors belong to the majority class, and creates synthetic samples only for these critical instances.

This next part is really neat! Here’s how we can tackle this:

def borderline_smote(X, y, k_neighbors=5, m_neighbors=10):
    minority_class = 1
    X_minority = X[y == minority_class]
    X_majority = X[y != minority_class]
    
    # Fit nearest neighbors on the full dataset to inspect each
    # minority sample's neighborhood composition
    neigh_m = NearestNeighbors(n_neighbors=m_neighbors + 1)
    neigh_m.fit(X)
    
    # Find borderline (DANGER) samples: at least half, but not all,
    # of the m nearest neighbors belong to the majority class
    borderline_samples = []
    for x_i in X_minority:
        indices = neigh_m.kneighbors([x_i], return_distance=False)[0][1:]
        majority_count = np.sum(y[indices] != minority_class)
        
        if m_neighbors / 2 <= majority_count < m_neighbors:
            borderline_samples.append(x_i)
    
    borderline_samples = np.array(borderline_samples)
    
    # Apply SMOTE only to borderline samples
    if len(borderline_samples) > 0:
        synthetic_samples = smote_oversample(
            borderline_samples, 
            n_samples=len(X_majority) - len(X_minority),
            k_neighbors=min(k_neighbors, len(borderline_samples)-1)
        )
        return synthetic_samples
    
    return X_minority

# Example usage
X_borderline = borderline_smote(X, y)
print(f"Original minority samples: {sum(y == 1)}")
print(f"Borderline-SMOTE samples: {len(X_borderline)}")

🚀 Ensemble Oversampling Strategy - Made Simple!

Combining multiple oversampling techniques can provide more reliable synthetic samples. This example creates an ensemble approach that uses the strengths of different oversampling methods while mitigating their individual weaknesses.

Let’s break this down together! Here’s how we can tackle this:

def ensemble_oversample(X, y, methods=['random', 'smote', 'adasyn'], weights=[0.3, 0.4, 0.3]):
    assert len(methods) == len(weights) and abs(sum(weights) - 1) < 1e-9
    
    synthetic_samples = []
    minority_class = 1
    n_samples_needed = sum(y == 0) - sum(y == 1)
    
    for method, weight in zip(methods, weights):
        n_method_samples = int(n_samples_needed * weight)
        
        if method == 'random':
            # Duplicate minority samples at random with replacement
            # (random_oversample expects the full X and y, so sample directly here)
            minority_samples = X[y == minority_class]
            idx = np.random.choice(len(minority_samples), size=n_method_samples, replace=True)
            X_synthetic = minority_samples[idx]
        elif method == 'smote':
            X_synthetic = smote_oversample(X[y == minority_class], 
                                         n_samples=n_method_samples)
        elif method == 'adasyn':
            X_synthetic = adasyn_oversample(X, y, 
                                          beta=n_method_samples/sum(y == minority_class))
            
        synthetic_samples.append(X_synthetic)
    
    # Combine all synthetic samples
    X_combined = np.vstack(synthetic_samples)
    
    return X_combined

# Example usage with evaluation
X_ensemble = ensemble_oversample(X, y)
y_ensemble = np.ones(len(X_ensemble))

# Train and evaluate
clf_ensemble = RandomForestClassifier(random_state=42)
clf_ensemble.fit(np.vstack([X[y == 0], X_ensemble]), 
                np.hstack([np.zeros(sum(y == 0)), y_ensemble]))

print("Ensemble Oversampling Results:")
y_pred = clf_ensemble.predict(X_test)
print(classification_report(y_test, y_pred))
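Another ensemble-flavored strategy is to chain oversampling with data cleaning. imbalanced-learn ships two ready-made combinations; here’s a minimal sketch, assuming the package is installed:

# Combined over- and under-sampling (assumes imbalanced-learn is installed)
from imblearn.combine import SMOTEENN, SMOTETomek

X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)    # SMOTE + Edited Nearest Neighbours cleaning
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)  # SMOTE + Tomek-link removal
print(f"SMOTEENN classes: {np.unique(y_se, return_counts=True)[1]}")
print(f"SMOTETomek classes: {np.unique(y_st, return_counts=True)[1]}")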

🚀 Oversampling Validation Strategy - Made Simple!

Proper validation is crucial when working with oversampled data to prevent data leakage. This example shows you the correct way to validate models trained on oversampled data using stratified cross-validation.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def validate_oversampling(X, y, oversample_func, n_splits=5):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        # Split data
        X_train_fold = X[train_idx]
        y_train_fold = y[train_idx]
        X_val_fold = X[val_idx]
        y_val_fold = y[val_idx]
        
        # Apply oversampling only to training data
        X_train_resampled = oversample_func(X_train_fold, y_train_fold)
        y_train_resampled = np.ones(len(X_train_resampled))
        
        # Train model
        clf = RandomForestClassifier(random_state=42)
        clf.fit(np.vstack([X_train_fold[y_train_fold == 0], X_train_resampled]),
               np.hstack([np.zeros(sum(y_train_fold == 0)), y_train_resampled]))
        
        # Evaluate
        y_pred = clf.predict(X_val_fold)
        score = f1_score(y_val_fold, y_pred)
        scores.append(score)
        
        print(f"Fold {fold+1} F1-Score: {score:.3f}")
    
    print(f"\nMean F1-Score: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
    return scores

# Example usage for different methods
methods = {
    'SMOTE': lambda X, y: smote_oversample(X[y == 1], sum(y == 0) - sum(y == 1)),
    'ADASYN': lambda X, y: adasyn_oversample(X, y),
    'Borderline-SMOTE': lambda X, y: borderline_smote(X, y),
    'Ensemble': lambda X, y: ensemble_oversample(X, y)
}

for name, method in methods.items():
    print(f"\nValidating {name}:")
    validate_oversampling(X, y, method)
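The same leak-free discipline comes for free with an imbalanced-learn Pipeline, which applies resampling only inside each training fold. A minimal sketch, assuming the package is installed:

# Leak-free cross-validation with an imbalanced-learn Pipeline
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(random_state=42))
])
scores = cross_val_score(pipe, X, y, scoring='f1', cv=5)
print(f"F1: {scores.mean():.3f} (+/- {scores.std():.3f})")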

🚀 Handling Mixed Data Types in Oversampling - Made Simple!

Oversampling becomes more complex when dealing with datasets containing both numerical and categorical features. This example introduces a smart approach to handle mixed data types during synthetic sample generation.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

from sklearn.preprocessing import LabelEncoder

def mixed_data_oversample(X, y, categorical_features):
    # Boolean mask selecting the numerical columns
    num_mask = ~np.isin(np.arange(X.shape[1]), categorical_features)
    X_cat = X[:, categorical_features]
    
    # Encode categorical features as integers for the neighbor search
    label_encoders = [LabelEncoder() for _ in range(len(categorical_features))]
    X_cat_encoded = np.zeros(X_cat.shape)
    for i, le in enumerate(label_encoders):
        X_cat_encoded[:, i] = le.fit_transform(X_cat[:, i])
    
    def generate_synthetic_sample(x1, x2, lambda_val):
        # Interpolate numerical features (cast from the mixed string array)
        num_features = x1[num_mask].astype(float)
        num_neighbor = x2[num_mask].astype(float)
        synthetic_num = num_features + lambda_val * (num_neighbor - num_features)
        
        # For categorical features, randomly inherit the category from either parent
        synthetic_cat = []
        for idx in categorical_features:
            synthetic_cat.append(x1[idx] if np.random.random() > 0.5 else x2[idx])
        
        # Combine features in an object array so mixed types can coexist
        synthetic = np.empty(len(x1), dtype=object)
        synthetic[num_mask] = synthetic_num
        synthetic[categorical_features] = synthetic_cat
        
        return synthetic
    
    def mixed_smote(X_minority, n_samples, k_neighbors=5):
        synthetic_samples = []
        minority_cat_encoded = X_cat_encoded[y == 1]
        neigh = NearestNeighbors(n_neighbors=k_neighbors + 1)
        neigh.fit(minority_cat_encoded)
        
        for i in range(len(X_minority)):
            # Neighbors in the encoded categorical space (excluding self)
            nn_indices = neigh.kneighbors([minority_cat_encoded[i]],
                                          return_distance=False)[0][1:]
            
            for _ in range(int(n_samples / len(X_minority))):
                nn_idx = np.random.choice(nn_indices)
                lambda_val = np.random.random()
                
                synthetic = generate_synthetic_sample(
                    X_minority[i],
                    X_minority[nn_idx],
                    lambda_val
                )
                synthetic_samples.append(synthetic)
        
        return np.array(synthetic_samples)
    
    # Generate synthetic samples
    X_minority = X[y == 1]
    n_samples = int(sum(y == 0) - sum(y == 1))
    synthetic_samples = mixed_smote(X_minority, n_samples)
    
    return synthetic_samples

# Example usage with mixed data
def create_mixed_dataset():
    # Create synthetic dataset with mixed features
    np.random.seed(42)
    n_samples = 1000
    
    # Numerical features
    X_num = np.random.normal(size=(n_samples, 2))
    
    # Categorical features
    categories = ['A', 'B', 'C']
    X_cat = np.random.choice(categories, size=(n_samples, 2))
    
    # Combine features
    X = np.hstack([X_num, X_cat])
    
    # Create imbalanced labels
    y = np.zeros(n_samples)
    y[:100] = 1  # 10% minority class
    
    return X, y, [2, 3]  # indices of categorical features

# Test the implementation
X_mixed, y_mixed, categorical_features = create_mixed_dataset()
X_synthetic = mixed_data_oversample(X_mixed, y_mixed, categorical_features)
print(f"Original minority samples: {sum(y_mixed == 1)}")
print(f"Synthetic samples generated: {len(X_synthetic)}")

🚀 Cost-Sensitive Learning Integration - Made Simple!

This cool implementation combines oversampling techniques with cost-sensitive learning to create a more reliable approach to imbalanced learning. The method adjusts both sample weights and synthetic sample generation based on misclassification costs.

Ready for some cool stuff? Here’s how we can tackle this:

class CostSensitiveOversampler:
    def __init__(self, cost_matrix=None, oversampling_method='smote'):
        self.cost_matrix = cost_matrix if cost_matrix is not None else {
            (0, 1): 1.0,  # Cost of misclassifying negative as positive (false positive)
            (1, 0): 5.0   # Cost of misclassifying positive as negative (false negative)
        }
        self.oversampling_method = oversampling_method
    
    def compute_sample_weights(self, y):
        weights = np.ones(len(y))
        for i, yi in enumerate(y):
            weights[i] = sum(self.cost_matrix.get((yi, j), 0) 
                           for j in set(y) if j != yi)
        return weights
    
    def fit_resample(self, X, y):
        # Calculate initial sample weights
        sample_weights = self.compute_sample_weights(y)
        
        # Determine number of synthetic samples based on costs
        minority_cost = self.cost_matrix.get((1, 0), 1.0)
        majority_cost = self.cost_matrix.get((0, 1), 1.0)
        cost_ratio = minority_cost / majority_cost
        
        n_synthetic = int((sum(y == 0) * cost_ratio) - sum(y == 1))
        
        # Generate synthetic samples
        if self.oversampling_method == 'smote':
            X_synthetic = smote_oversample(X[y == 1], n_synthetic)
        elif self.oversampling_method == 'adasyn':
            X_synthetic = adasyn_oversample(X, y, beta=cost_ratio)
        else:
            raise ValueError("Unsupported oversampling method")
        
        # Combine original and synthetic samples
        X_resampled = np.vstack([X, X_synthetic])
        y_resampled = np.hstack([y, np.ones(len(X_synthetic))])
        
        # Update sample weights for combined dataset
        weights_resampled = np.hstack([
            sample_weights,
            np.ones(len(X_synthetic)) * self.cost_matrix.get((1, 0), 1.0)
        ])
        
        return X_resampled, y_resampled, weights_resampled

# Example usage
cost_matrix = {
    (0, 1): 1.0,   # Cost of misclassifying negative as positive
    (1, 0): 10.0   # Cost of misclassifying positive as negative
}

oversample = CostSensitiveOversampler(cost_matrix=cost_matrix)
X_resampled, y_resampled, sample_weights = oversample.fit_resample(X, y)

# Train cost-sensitive model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled, sample_weight=sample_weights)

# Evaluate
y_pred = clf.predict(X_test)
print("Cost-Sensitive Results:")
print(classification_report(y_test, y_pred))
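If you only need the cost-sensitivity half of this recipe, scikit-learn exposes it directly through class_weight, with no resampling at all. A minimal sketch encoding the same 10:1 cost ratio:

# Pure cost-sensitive baseline: misclassification costs as class weights
clf_cost = RandomForestClassifier(class_weight={0: 1.0, 1: 10.0}, random_state=42)
clf_cost.fit(X_train, y_train)
print("Class-weighted results:")
print(classification_report(y_test, clf_cost.predict(X_test)))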

🚀 Dynamic Oversampling Rate Adjustment - Made Simple!

This example introduces an adaptive approach that dynamically adjusts oversampling rates based on model performance feedback during training. The method monitors validation metrics to optimize the synthetic sample generation process.

Ready for some cool stuff? Here’s how we can tackle this:

class DynamicOversamplingAdjuster:
    def __init__(self, base_rate=1.0, adjustment_factor=0.1, 
                 min_rate=0.5, max_rate=2.0):
        self.base_rate = base_rate
        self.adjustment_factor = adjustment_factor
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.performance_history = []
        
    def adjust_rate(self, current_performance):
        if len(self.performance_history) > 0:
            prev_performance = self.performance_history[-1]
            
            # Adjust rate based on performance trend
            if current_performance > prev_performance:
                self.base_rate *= (1 + self.adjustment_factor)
            else:
                self.base_rate *= (1 - self.adjustment_factor)
                
            # Ensure rate stays within bounds
            self.base_rate = np.clip(self.base_rate, 
                                   self.min_rate, 
                                   self.max_rate)
        
        self.performance_history.append(current_performance)
        return self.base_rate

    def generate_samples(self, X, y, method='smote'):
        minority_count = sum(y == 1)
        majority_count = sum(y == 0)
        
        # Calculate dynamic number of samples
        n_samples = int((majority_count - minority_count) * self.base_rate)
        
        if method == 'smote':
            return smote_oversample(X[y == 1], n_samples)
        elif method == 'adasyn':
            return adasyn_oversample(X, y, beta=self.base_rate)
        else:
            raise ValueError("Unsupported method")

def train_with_dynamic_oversampling(X, y, n_epochs=5):
    # Split data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    
    # Initialize adjuster and model
    adjuster = DynamicOversamplingAdjuster()
    clf = RandomForestClassifier(random_state=42)
    
    # Training loop
    for epoch in range(n_epochs):
        # Generate samples with current rate
        X_synthetic = adjuster.generate_samples(X_train, y_train)
        
        # Combine data
        X_combined = np.vstack([X_train, X_synthetic])
        y_combined = np.hstack([y_train, np.ones(len(X_synthetic))])
        
        # Train model
        clf.fit(X_combined, y_combined)
        
        # Evaluate performance
        y_val_pred = clf.predict(X_val)
        current_f1 = f1_score(y_val, y_val_pred)
        
        # Adjust sampling rate
        new_rate = adjuster.adjust_rate(current_f1)
        
        print(f"Epoch {epoch+1}:")
        print(f"F1-Score: {current_f1:.3f}")
        print(f"Sampling Rate: {new_rate:.2f}")
        print(f"Synthetic Samples: {len(X_synthetic)}\n")
    
    return clf, adjuster.performance_history

# Example usage
clf_dynamic, history = train_with_dynamic_oversampling(X, y)

🚀 Additional Resources - Made Simple!

For deeper insights, the original papers are the best next stop: SMOTE (Chawla et al., 2002), Borderline-SMOTE (Han et al., 2005), and ADASYN (He et al., 2008). All three are easy to find with a quick search and are surprisingly readable.

🎊 Awesome Work!

You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.

What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.

Keep coding, keep learning, and keep being awesome! 🚀
