Data Science

🚀 Ultimate EDA Cheat Codes: Secrets That Will Revolutionize Your Workflow!

Hey there! Ready to dive into these ultimate EDA cheat codes? This friendly guide will walk you through everything step-by-step with easy-to-follow examples. Perfect for beginners and pros alike!

SuperML Team

🚀 Data Loading and Initial Exploration - Made Simple!

💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!

The foundation of any EDA begins with properly loading data and performing initial checks. This involves reading various file formats, checking basic statistics, and understanding the dataset's structure using the pandas and numpy libraries.

Let’s break this down together! Here’s how we can tackle this:

import pandas as pd
import numpy as np
from datetime import datetime

# Load dataset with automatic date parsing
df = pd.read_csv('sales_data.csv', parse_dates=['transaction_date'])

# Display complete dataset information
print("Dataset Info:")
print(df.info())

# Generate descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe(include='all'))

# Check missing values
missing_values = df.isnull().sum()
print("\nMissing Values:")
print(missing_values[missing_values > 0])

# Display first few rows with custom formatting
pd.set_option('display.max_columns', None)
print("\nSample Data:")
print(df.head())
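
By the way, the same workflow works for other file formats too. Here's a quick sketch; the file names below are just placeholders, so swap in your own:

# These loaders follow the same pattern (file names are placeholders)
df_excel = pd.read_excel('sales_data.xlsx', parse_dates=['transaction_date'])
df_json = pd.read_json('sales_data.json')
df_parquet = pd.read_parquet('sales_data.parquet')

# The same quick checks apply regardless of the source format
print(f"Rows: {df_excel.shape[0]}, Columns: {df_excel.shape[1]}")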

🚀 Data Type Analysis - Made Simple!

🎉 You're doing great! This concept might seem tricky at first, but you've got this!

Understanding data types and their distributions is super important for proper preprocessing. This analysis identifies potential data quality issues and helps determine appropriate transformation strategies for different variable types.

Let’s break this down together! Here’s how we can tackle this:

def analyze_data_types(df):
    # Dictionary to store analysis results
    analysis = {
        'numerical_cols': [],
        'categorical_cols': [],
        'datetime_cols': [],
        'unique_counts': {},
        'memory_usage': {}
    }
    
    for column in df.columns:
        # Determine data type category
        if pd.api.types.is_numeric_dtype(df[column]):
            analysis['numerical_cols'].append(column)
        elif pd.api.types.is_datetime64_any_dtype(df[column]):
            analysis['datetime_cols'].append(column)
        else:
            analysis['categorical_cols'].append(column)
        
        # Count unique values
        analysis['unique_counts'][column] = df[column].nunique()
        
        # Calculate memory usage
        analysis['memory_usage'][column] = df[column].memory_usage(deep=True) / 1024**2  # MB
    
    return analysis

# Example usage
data_analysis = analyze_data_types(df)
for category, columns in data_analysis.items():
    if isinstance(columns, list):
        print(f"\n{category.upper()}:")
        for col in columns:
            print(f"{col}: {data_analysis['unique_counts'][col]} unique values, "
                  f"{data_analysis['memory_usage'][col]:.2f} MB")

🚀 Statistical Distribution Analysis - Made Simple!

Cool fact: Many professional data scientists use this exact approach in their daily work!

A complete analysis of variable distributions helps identify patterns, skewness, and potential outliers. This example combines visual and statistical methods to provide insights into data characteristics.

Ready for some cool stuff? Here’s how we can tackle this:

import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_distributions(df, numerical_cols):
    sns.set_theme()  # the old plt.style.use('seaborn') string was removed in recent matplotlib
    
    for col in numerical_cols:
        # Calculate statistical measures
        skewness = stats.skew(df[col].dropna())
        kurtosis = stats.kurtosis(df[col].dropna())
        
        # Create side-by-side histogram and Q-Q plot
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
        
        # Histogram with KDE
        sns.histplot(data=df, x=col, kde=True, ax=ax1)
        
        # Q-Q plot on its own axes (plotting onto plt would overlay the histogram)
        stats.probplot(df[col].dropna(), dist="norm", plot=ax2)
        
        fig.suptitle(f'Distribution Analysis: {col}\n'
                     f'Skewness: {skewness:.2f}, Kurtosis: {kurtosis:.2f}')
        plt.tight_layout()
        plt.show()
        
        # Print statistical tests
        normality_test = stats.normaltest(df[col].dropna())
        print(f"\nNormality Test for {col}:")
        print(f"p-value: {normality_test.pvalue:.4f}")

# Example usage
numerical_columns = df.select_dtypes(include=[np.number]).columns
analyze_distributions(df, numerical_columns)
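
If the skewness numbers come back high, a log transform is a classic remedy. This is just a sketch; it assumes your skewed columns are non-negative, so check that before applying it for real:

# Sketch: log1p often tames heavy right skew (non-negative data only)
for col in numerical_columns:
    data = df[col].dropna()
    if stats.skew(data) > 1 and (data >= 0).all():
        transformed = np.log1p(data)
        print(f"{col}: skew {stats.skew(data):.2f} -> {stats.skew(transformed):.2f}")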

🚀 Outlier Detection - Made Simple!

🔥 Level up: Once you master this, you'll be solving problems like a pro!

This example combines multiple outlier detection methods, including statistical, distance-based, and machine learning approaches, to reliably identify anomalous data points.

Let’s make this super clear! Here’s how we can tackle this:

from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import scipy.stats as stats  # used for the z-score method below

def detect_outliers(df, columns, methods=('zscore', 'iqr', 'isolation_forest')):
    outliers_dict = {}
    
    for col in columns:
        outliers_dict[col] = {}
        data = df[col].dropna()
        
        if 'zscore' in methods:
            # Z-score method
            z_scores = np.abs(stats.zscore(data))
            outliers_dict[col]['zscore'] = data[z_scores > 3]
        
        if 'iqr' in methods:
            # IQR method
            Q1 = data.quantile(0.25)
            Q3 = data.quantile(0.75)
            IQR = Q3 - Q1
            outliers_dict[col]['iqr'] = data[(data < (Q1 - 1.5 * IQR)) | 
                                           (data > (Q3 + 1.5 * IQR))]
        
        if 'isolation_forest' in methods:
            # Isolation Forest method
            iso_forest = IsolationForest(contamination=0.1, random_state=42)
            yhat = iso_forest.fit_predict(data.values.reshape(-1, 1))
            outliers_dict[col]['isolation_forest'] = data[yhat == -1]
    
    return outliers_dict

# Example usage
numerical_cols = df.select_dtypes(include=[np.number]).columns
outliers = detect_outliers(df, numerical_cols)

# Print summary
for col in outliers:
    print(f"\nOutliers in {col}:")
    for method, values in outliers[col].items():
        print(f"{method}: {len(values)} outliers detected")

🚀 Correlation Analysis and Feature Relationships - Made Simple!

This correlation analysis extends beyond simple Pearson correlations to include non-linear relationships and mutual information metrics, providing deeper insights into feature dependencies and potential predictive power.

Let me walk you through this step by step! Here’s how we can tackle this:

import scipy.stats as stats
from sklearn.metrics import mutual_info_score

def advanced_correlation_analysis(df, numerical_cols):
    # Initialize correlation matrices
    n = len(numerical_cols)
    pearson_corr = np.zeros((n, n))
    spearman_corr = np.zeros((n, n))
    mutual_info = np.zeros((n, n))
    
    for i, col1 in enumerate(numerical_cols):
        for j, col2 in enumerate(numerical_cols):
            # Drop rows where either column is missing before correlating
            pair = df[[col1, col2]].dropna()
            pearson_corr[i,j] = stats.pearsonr(pair[col1], pair[col2])[0]
            spearman_corr[i,j] = stats.spearmanr(pair[col1], pair[col2])[0]
            
            # Bin data into deciles for mutual information
            x_binned = pd.qcut(pair[col1], q=10, labels=False, duplicates='drop')
            y_binned = pd.qcut(pair[col2], q=10, labels=False, duplicates='drop')
            mutual_info[i,j] = mutual_info_score(x_binned, y_binned)
    
    # Create heatmap visualizations
    fig, axes = plt.subplots(1, 3, figsize=(20, 6))
    
    sns.heatmap(pearson_corr, annot=True, cmap='RdBu', center=0, 
                xticklabels=numerical_cols, yticklabels=numerical_cols, ax=axes[0])
    axes[0].set_title('Pearson Correlation')
    
    sns.heatmap(spearman_corr, annot=True, cmap='RdBu', center=0,
                xticklabels=numerical_cols, yticklabels=numerical_cols, ax=axes[1])
    axes[1].set_title('Spearman Correlation')
    
    sns.heatmap(mutual_info, annot=True, cmap='viridis',
                xticklabels=numerical_cols, yticklabels=numerical_cols, ax=axes[2])
    axes[2].set_title('Mutual Information')
    
    plt.tight_layout()
    plt.show()

# Example usage
numerical_cols = df.select_dtypes(include=[np.number]).columns
advanced_correlation_analysis(df, numerical_cols)
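
A handy follow-up: once you spot near-duplicate features, you'll often want to drop one of each pair before modeling. Here's a simple sketch; the 0.9 threshold is just a common rule of thumb, not a magic number:

# Flag feature pairs with very high absolute Pearson correlation
corr_matrix = df[numerical_cols].corr().abs()
threshold = 0.9  # rule-of-thumb cutoff; tune for your data
for i, col1 in enumerate(numerical_cols):
    for col2 in numerical_cols[i+1:]:
        if corr_matrix.loc[col1, col2] > threshold:
            print(f"{col1} <-> {col2}: {corr_matrix.loc[col1, col2]:.2f}")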

🚀 Time Series Decomposition and Analysis - Made Simple!

Time series decomposition is super important for understanding temporal patterns in data. This example breaks down time series into trend, seasonal, and residual components while providing statistical measures of each component.

Here’s a handy trick you’ll love! Here’s how we can tackle this:

from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

def analyze_time_series(df, date_column, value_column, freq='D'):
    # Set datetime index
    df_ts = df.set_index(date_column)
    
    # Perform decomposition
    decomposition = seasonal_decompose(df_ts[value_column], 
                                     period=30 if freq=='D' else 12,
                                     extrapolate_trend='freq')
    
    # Plot components
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 12))
    
    decomposition.observed.plot(ax=ax1)
    ax1.set_title('Original Time Series')
    
    decomposition.trend.plot(ax=ax2)
    ax2.set_title('Trend')
    
    decomposition.seasonal.plot(ax=ax3)
    ax3.set_title('Seasonal')
    
    decomposition.resid.plot(ax=ax4)
    ax4.set_title('Residual')
    
    plt.tight_layout()
    plt.show()
    
    # Perform additional statistical tests
    # Augmented Dickey-Fuller test for stationarity
    adf_test = sm.tsa.stattools.adfuller(df_ts[value_column].dropna())
    
    # Calculate autocorrelation
    acf = sm.tsa.stattools.acf(df_ts[value_column].dropna(), nlags=40)
    
    print(f"ADF Test Statistic: {adf_test[0]}")
    print(f"p-value: {adf_test[1]}")
    print("\nAutocorrelation (first 5 lags):")
    print(acf[:5])

# Example usage
analyze_time_series(df, 'date_column', 'value_column', freq='D')
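
If the ADF test says your series isn't stationary (p-value above 0.05), first-order differencing is the usual next move. Here's a hedged sketch using the same placeholder column names as above:

series = df.set_index('date_column')['value_column'].dropna()
if sm.tsa.stattools.adfuller(series)[1] > 0.05:
    # Differencing removes a trend; re-run the test afterwards to confirm
    differenced = series.diff().dropna()
    print(f"ADF p-value after differencing: "
          f"{sm.tsa.stattools.adfuller(differenced)[1]:.4f}")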

🚀 Dimensionality Reduction and Feature Importance - Made Simple!

This implementation pairs PCA-based dimensionality reduction with feature importance analysis to provide complete insights into data structure and variable relationships.

Let me walk you through this step by step! Here’s how we can tackle this:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

def analyze_feature_importance(df, target_col=None):
    # Prepare data
    X = df.select_dtypes(include=[np.number]).drop(columns=[target_col] if target_col else [])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # PCA
    pca = PCA()
    pca_result = pca.fit_transform(X_scaled)
    
    # Calculate explained variance
    explained_variance = np.cumsum(pca.explained_variance_ratio_)
    
    # Plot explained variance
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'bo-')
    plt.axhline(y=0.95, color='r', linestyle='--')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('PCA Analysis')
    plt.show()
    
    # Feature importance using Random Forest
    if target_col is not None:
        rf = RandomForestRegressor(n_estimators=100, random_state=42)
        rf.fit(X, df[target_col])
        
        # Plot feature importance
        importances = pd.Series(rf.feature_importances_, index=X.columns)
        importances.sort_values(ascending=True).plot(kind='barh')
        plt.title('Feature Importance')
        plt.show()
    
    return {
        'pca_components': pca.components_,
        'explained_variance': explained_variance,
        'feature_importance': rf.feature_importances_ if target_col else None
    }

# Example usage
results = analyze_feature_importance(df, target_col='target_variable')
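
To actually use the PCA results, a common next step is keeping just enough components to cover 95% of the variance. A small sketch, reusing the 'target_variable' placeholder from above:

# Find the first component count whose cumulative variance reaches 95%
n_components = int(np.argmax(results['explained_variance'] >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components}")

X_numeric = df.select_dtypes(include=[np.number]).drop(columns=['target_variable'])
X_reduced = PCA(n_components=n_components).fit_transform(
    StandardScaler().fit_transform(X_numeric))
print(f"Reduced shape: {X_reduced.shape}")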

🚀 Pattern Mining and Association Rules - Made Simple!

Pattern mining techniques help discover hidden relationships and frequent patterns in categorical data. This example includes support for both categorical and discretized numerical variables.

Let’s break this down together! Here’s how we can tackle this:

from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

def mine_patterns(df, categorical_cols, min_support=0.01, min_confidence=0.5):
    # Prepare transactions
    transactions = df[categorical_cols].values.tolist()
    
    # Transform data
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
    
    # Find frequent itemsets
    frequent_itemsets = apriori(df_encoded, 
                               min_support=min_support, 
                               use_colnames=True)
    
    # Generate rules
    rules = association_rules(frequent_itemsets, 
                            metric="confidence", 
                            min_threshold=min_confidence)
    
    # Calculate additional metrics
    rules['lift'] = rules['lift'].round(4)
    rules['conviction'] = np.where(rules['confidence'] == 1, np.inf,
                                 (1 - rules['consequent support']) / 
                                 (1 - rules['confidence']))
    
    # Sort rules by lift ratio
    rules = rules.sort_values('lift', ascending=False)
    
    print("Top 10 Association Rules:")
    print(rules.head(10)[['antecedents', 'consequents', 
                         'support', 'confidence', 'lift']])
    
    return rules, frequent_itemsets

# Example usage
categorical_columns = ['category1', 'category2', 'category3']
rules, itemsets = mine_patterns(df, categorical_columns)
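
The description mentions discretized numerical variables, so here's one way to feed those into the same pipeline. This is a sketch with a made-up 'amount' column; the bin count and labels are up to you:

# Bin a numerical column into labeled categories so it can join the mining
df['amount_binned'] = pd.cut(df['amount'], bins=3,
                             labels=['amount_low', 'amount_mid', 'amount_high'])
rules, itemsets = mine_patterns(df, categorical_columns + ['amount_binned'])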

🚀 Multivariate Anomaly Detection - Made Simple!

This example combines multiple statistical and machine learning approaches to detect multivariate anomalies in high-dimensional datasets using robust estimation techniques.

Let me walk you through this step by step! Here’s how we can tackle this:

from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

def detect_multivariate_anomalies(df, numerical_cols):
    # Prepare data
    X = df[numerical_cols]
    X_scaled = StandardScaler().fit_transform(X)
    
    # Initialize detectors
    detectors = {
        'Robust Covariance': EllipticEnvelope(contamination=0.1, 
                                              random_state=42),
        'One-Class SVM': OneClassSVM(kernel='rbf', nu=0.1),
        'Local Outlier Factor': LocalOutlierFactor(n_neighbors=20, 
                                                  contamination=0.1)
    }
    
    results = {}
    for name, detector in detectors.items():
        # Fit and predict
        if name == 'Local Outlier Factor':
            y_pred = detector.fit_predict(X_scaled)
        else:
            y_pred = detector.fit(X_scaled).predict(X_scaled)
        
        # Store results
        results[name] = y_pred == -1  # True for outliers
        
    # Compare results
    agreement_matrix = np.zeros((len(detectors), len(detectors)))
    for i, (name1, res1) in enumerate(results.items()):
        for j, (name2, res2) in enumerate(results.items()):
            agreement = np.mean(res1 == res2)
            agreement_matrix[i, j] = agreement
    
    # Visualize agreement
    plt.figure(figsize=(10, 8))
    sns.heatmap(agreement_matrix, 
                annot=True, 
                xticklabels=detectors.keys(),
                yticklabels=detectors.keys())
    plt.title('Detector Agreement Matrix')
    plt.show()
    
    return results

# Example usage
numerical_cols = df.select_dtypes(include=[np.number]).columns
anomaly_results = detect_multivariate_anomalies(df, numerical_cols)
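
Just like with the univariate outliers earlier, a majority vote across detectors tends to be more trustworthy than any single one. A quick sketch:

# Combine the three detectors with a simple majority vote
votes = np.zeros(len(df), dtype=int)
for name, is_outlier in anomaly_results.items():
    votes += is_outlier.astype(int)

consensus = votes >= 2  # flagged by at least two of the three detectors
print(f"Consensus anomalies: {consensus.sum()} of {len(df)} rows")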

🚀 Dynamic Feature Engineering - Made Simple!

A feature engineering pipeline that automatically generates and evaluates new features based on domain knowledge and the statistical properties of the data.

Let me walk you through this step by step! Here’s how we can tackle this:

from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import PolynomialFeatures

class DynamicFeatureEngineer:
    def __init__(self, df, target_col=None):
        self.df = df
        self.target_col = target_col
        self.numerical_cols = df.select_dtypes(include=[np.number]).columns
        self.categorical_cols = df.select_dtypes(include=['object']).columns
        
    def generate_statistical_features(self):
        stats_features = pd.DataFrame()
        
        # Rolling statistics
        for col in self.numerical_cols:
            if col != self.target_col:
                stats_features[f'{col}_rolling_mean'] = self.df[col].rolling(window=3).mean()
                stats_features[f'{col}_rolling_std'] = self.df[col].rolling(window=3).std()
                
        return stats_features
    
    def generate_polynomial_features(self, degree=2):
        # Exclude the target so it never leaks into generated features,
        # and fill missing values since PolynomialFeatures rejects NaNs
        cols = [c for c in self.numerical_cols if c != self.target_col]
        X = self.df[cols].fillna(self.df[cols].median())
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        poly_features = poly.fit_transform(X)
        
        feature_names = poly.get_feature_names_out(cols)
        return pd.DataFrame(poly_features, columns=feature_names, index=self.df.index)
    
    def evaluate_features(self, features_df):
        if self.target_col is None:
            return None
        
        # Drop rows with NaNs (rolling windows create them) and align the target
        valid = features_df.dropna()
        mi_scores = mutual_info_regression(valid, self.df.loc[valid.index, self.target_col])
        feature_importance = pd.DataFrame({
            'feature': valid.columns,
            'importance': mi_scores
        }).sort_values('importance', ascending=False)
        
        return feature_importance
    
    def fit_transform(self):
        # Generate features
        statistical_features = self.generate_statistical_features()
        polynomial_features = self.generate_polynomial_features()
        
        # Combine features
        all_features = pd.concat([
            statistical_features,
            polynomial_features
        ], axis=1)
        
        # Evaluate features
        importance = self.evaluate_features(all_features)
        
        if importance is not None:
            print("Top 10 Generated Features:")
            print(importance.head(10))
        
        return all_features

# Example usage
engineer = DynamicFeatureEngineer(df, target_col='target')
new_features = engineer.fit_transform()
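
Domain knowledge features are worth adding by hand alongside the automatic ones. For example, if your data has the transaction_date column from the start of this guide, date parts are cheap and often powerful. A sketch:

# Hand-crafted date-part features (assumes the transaction_date column
# parsed at the start of this guide)
df['weekday'] = df['transaction_date'].dt.dayofweek
df['month'] = df['transaction_date'].dt.month
df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int)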

🚀 Temporal Pattern Analysis - Made Simple!

Time-based pattern detection that identifies cyclical patterns, seasonality effects, and temporal anomalies using spectral analysis and wavelet transformations.

This next part is really neat! Here’s how we can tackle this:

import pywt
from scipy import signal
from datetime import timedelta

def analyze_temporal_patterns(df, date_column, value_column):
    # Prepare time series data
    ts = df.set_index(date_column)[value_column]
    
    # Spectral Analysis
    frequencies, power_spectrum = signal.welch(ts.dropna(), 
                                             fs=1.0, 
                                             window='hann',
                                             nperseg=len(ts)//10)
    
    # Wavelet Analysis (use a separate name so the welch frequencies survive)
    wavelet = 'morl'
    scales = np.arange(1, 128)
    coefficients, wavelet_freqs = pywt.cwt(ts.dropna().values, scales, wavelet)
    
    # Plot results
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15, 12))
    
    # Original Time Series
    ax1.plot(ts.index, ts.values)
    ax1.set_title('Original Time Series')
    
    # Power Spectrum
    ax2.plot(frequencies, power_spectrum)
    ax2.set_title('Power Spectrum')
    ax2.set_xlabel('Frequency')
    ax2.set_ylabel('Power')
    
    # Wavelet Transform
    im = ax3.imshow(abs(coefficients), aspect='auto', cmap='viridis')
    ax3.set_title('Wavelet Transform')
    ax3.set_ylabel('Scale')
    plt.colorbar(im, ax=ax3)
    
    plt.tight_layout()
    
    # Find dominant periods
    peak_frequencies = frequencies[signal.find_peaks(power_spectrum)[0]]
    dominant_periods = 1/peak_frequencies
    
    print("\nDominant Periods Detected:")
    for period in dominant_periods:
        print(f"Period: {timedelta(days=period)}")
    
    return {
        'power_spectrum': (frequencies, power_spectrum),
        'wavelet_coefficients': coefficients,
        'dominant_periods': dominant_periods
    }

# Example usage
temporal_analysis = analyze_temporal_patterns(df, 'date_column', 'value_column')
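
A quick cross-check on the detected periods: autocorrelation plots should show spikes at the same lags. Here's a short sketch with the same placeholder column names:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = df.set_index('date_column')['value_column'].dropna()
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(series, lags=40, ax=ax1)   # spikes here should match dominant periods
plot_pacf(series, lags=40, ax=ax2)
plt.tight_layout()
plt.show()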

🚀 Multi-dimensional Clustering Analysis - Made Simple!

Implementation of clustering techniques that automatically determine the optimal number of clusters and handle high-dimensional data using validation metrics.

Let’s make this super clear! Here’s how we can tackle this:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from scipy.cluster.hierarchy import dendrogram, linkage

class AdvancedClustering:
    def __init__(self, data):
        self.data = StandardScaler().fit_transform(data)
        self.n_samples = data.shape[0]
        
    def determine_optimal_clusters(self, max_clusters=10):
        scores = {
            'silhouette': [],
            'calinski': []
        }
        
        for n_clusters in range(2, max_clusters + 1):
            kmeans = KMeans(n_clusters=n_clusters, random_state=42)
            labels = kmeans.fit_predict(self.data)
            
            scores['silhouette'].append(silhouette_score(self.data, labels))
            scores['calinski'].append(calinski_harabasz_score(self.data, labels))
        
        return scores
    
    def perform_hierarchical_clustering(self):
        # Generate linkage matrix
        linkage_matrix = linkage(self.data, method='ward')
        
        # Plot dendrogram
        plt.figure(figsize=(12, 8))
        dendrogram(linkage_matrix)
        plt.title('Hierarchical Clustering Dendrogram')
        plt.xlabel('Sample Index')
        plt.ylabel('Distance')
        plt.show()
        
        return linkage_matrix
    
    def perform_density_clustering(self):
        # Estimate epsilon using nearest neighbors
        from sklearn.neighbors import NearestNeighbors
        
        nbrs = NearestNeighbors(n_neighbors=2).fit(self.data)
        distances, indices = nbrs.kneighbors(self.data)
        distances = np.sort(distances, axis=0)
        distances = distances[:,1]
        
        # Plot k-distance graph
        plt.figure(figsize=(10, 6))
        plt.plot(distances)
        plt.title('K-distance Graph')
        plt.xlabel('Points')
        plt.ylabel('Distance')
        plt.show()
        
        # Perform DBSCAN with estimated parameters
        epsilon = np.percentile(distances, 90)
        min_samples = int(np.log(self.n_samples))
        
        dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
        labels = dbscan.fit_predict(self.data)
        
        return labels, epsilon, min_samples

# Example usage
clustering = AdvancedClustering(df[numerical_cols])
optimal_scores = clustering.determine_optimal_clusters()
linkage_matrix = clustering.perform_hierarchical_clustering()
dbscan_labels, eps, min_samples = clustering.perform_density_clustering()
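
To pick the number of clusters from those scores, just plot them and take the silhouette peak. A minimal sketch:

# Silhouette scores were computed for k = 2 .. max_clusters
ks = list(range(2, len(optimal_scores['silhouette']) + 2))
plt.figure(figsize=(10, 5))
plt.plot(ks, optimal_scores['silhouette'], 'bo-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Choosing k')
plt.show()

best_k = ks[int(np.argmax(optimal_scores['silhouette']))]
print(f"Best k by silhouette score: {best_k}")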

🚀 Interactive Data Profiling - Made Simple!

This example provides a complete data profiling system that automatically generates statistical summaries, identifies potential data quality issues, and performs distribution analysis for all variable types.

This next part is really neat! Here’s how we can tackle this:

from scipy import stats
import warnings
warnings.filterwarnings('ignore')

class DataProfiler:
    def __init__(self, df):
        self.df = df
        self.profile = {}
        
    def analyze_column(self, column):
        data = self.df[column]
        column_type = data.dtype
        
        analysis = {
            'type': str(column_type),
            'missing': data.isnull().sum(),
            'missing_pct': (data.isnull().sum() / len(data)) * 100,
            'unique_values': data.nunique(),
            'memory_usage': data.memory_usage(deep=True) / 1024**2  # MB
        }
        
        if np.issubdtype(column_type, np.number):
            clean = data.dropna()
            analysis.update({
                'mean': data.mean(),
                'median': data.median(),
                'std': data.std(),
                'skew': stats.skew(clean),
                'kurtosis': stats.kurtosis(clean),
                # Compare z-scores against the NaN-free series to avoid a length mismatch
                'outliers_zscore': int((np.abs(stats.zscore(clean)) > 3).sum())
            })
            
        elif column_type == 'object' or column_type == 'category':
            value_counts = data.value_counts()
            analysis.update({
                'mode': data.mode()[0],
                'entropy': stats.entropy(value_counts.values),
                'top_values': value_counts.head().to_dict()
            })
            
        elif np.issubdtype(column_type, np.datetime64):
            analysis.update({
                'min_date': data.min(),
                'max_date': data.max(),
                'date_range': (data.max() - data.min()).days
            })
            
        return analysis
    
    def generate_profile(self):
        for column in self.df.columns:
            self.profile[column] = self.analyze_column(column)
            
        # Generate correlation matrix for numerical columns
        numerical_cols = self.df.select_dtypes(include=[np.number]).columns
        if len(numerical_cols) > 1:
            self.profile['correlations'] = self.df[numerical_cols].corr()
            
        return self.profile
    
    def print_summary(self):
        print("\nData Profile Summary:")
        print(f"Total Rows: {len(self.df)}")
        print(f"Total Columns: {len(self.df.columns)}")
        print("\nColumn Details:")
        
        for column, analysis in self.profile.items():
            if column != 'correlations':
                print(f"\n{column}:")
                for metric, value in analysis.items():
                    print(f"  {metric}: {value}")

# Example usage
profiler = DataProfiler(df)
profile = profiler.generate_profile()
profiler.print_summary()
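
The profile really shines when you turn it into action items. Here's a sketch that flags columns worth a second look; the thresholds are arbitrary starting points, not rules:

for column, analysis in profile.items():
    if column == 'correlations':
        continue
    if analysis['missing_pct'] > 30:  # arbitrary cutoff
        print(f"{column}: {analysis['missing_pct']:.1f}% missing, consider dropping or imputing")
    if analysis['unique_values'] == 1:
        print(f"{column}: constant column, safe to drop")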

🚀 Automated Insight Generation - Made Simple!

This implementation automatically discovers and ranks interesting patterns, relationships, and anomalies in the dataset using statistical tests and machine learning techniques.

Let me walk you through this step by step! Here’s how we can tackle this:

class InsightGenerator:
    def __init__(self, df):
        self.df = df
        self.insights = []
        
    def test_correlation_significance(self, col1, col2):
        # Drop rows with missing values so pearsonr does not choke on NaNs
        pair = self.df[[col1, col2]].dropna()
        corr, p_value = stats.pearsonr(pair[col1], pair[col2])
        return corr, p_value
    
    def find_distribution_changes(self, column, window_size=30):
        # Detect significant changes using Kolmogorov-Smirnov test
        changes = []
        for i in range(window_size, len(self.df), window_size):
            window1 = self.df[column].iloc[i-window_size:i]
            window2 = self.df[column].iloc[i:i+window_size]
            if len(window2) >= window_size:
                ks_stat, p_value = stats.ks_2samp(window1, window2)
                if p_value < 0.05:
                    changes.append(i)
        
        return changes
    
    def discover_insights(self):
        numerical_cols = self.df.select_dtypes(include=[np.number]).columns
        
        # Correlation insights
        for i, col1 in enumerate(numerical_cols):
            for col2 in numerical_cols[i+1:]:
                corr, p_value = self.test_correlation_significance(col1, col2)
                if abs(corr) > 0.7 and p_value < 0.05:
                    self.insights.append({
                        'type': 'correlation',
                        'description': f'Strong correlation ({corr:.2f}) between {col1} and {col2}',
                        'strength': abs(corr)
                    })
        
        # Distribution changes
        for col in numerical_cols:
            changes = self.find_distribution_changes(col)
            if changes:
                self.insights.append({
                    'type': 'distribution_change',
                    'description': f'Distribution changes detected in {col} at positions: {changes}',
                    'strength': len(changes)
                })
        
        # Sort insights by strength
        self.insights.sort(key=lambda x: x['strength'], reverse=True)
        return self.insights
    
    def print_insights(self, top_n=10):
        print(f"\nTop {top_n} Insights Discovered:")
        for i, insight in enumerate(self.insights[:top_n], 1):
            print(f"\n{i}. {insight['description']}")
            print(f"   Type: {insight['type']}")
            print(f"   Strength: {insight['strength']:.2f}")

# Example usage
insight_generator = InsightGenerator(df)
insights = insight_generator.discover_insights()
insight_generator.print_insights()

🚀 Wrapping Up - Made Simple!

🎊 Awesome Work!

You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.

What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.

Keep coding, keep learning, and keep being awesome! 🚀
