🚀 Mastering Feature Discretization for Non-Linear Modeling
Hey there! Ready to dive into feature discretization for non-linear modeling? This friendly guide will walk you through everything step-by-step with easy-to-follow examples. Perfect for beginners and pros alike!
🚀 Understanding Feature Discretization - Made Simple!
💡 Pro tip: This is one of those techniques that will make you look like a data science wizard!
Feature discretization transforms continuous variables into categorical ones through binning. This process lets linear models capture non-linear patterns by creating discrete boundaries that approximate complex relationships. The transformation typically uses techniques like equal-width or equal-frequency binning.
Let’s make this super clear! Here’s how we can tackle this:
import numpy as np
import pandas as pd
# Sample continuous data
ages = np.random.normal(40, 15, 1000)
df = pd.DataFrame({'age': ages})
# Equal-width binning
df['age_binned'] = pd.cut(
    df['age'],
    bins=[0, 25, 50, 75, 100],
    labels=['Youth', 'Adult', 'Middle', 'Senior']
)
# One-hot encoding
age_dummies = pd.get_dummies(df['age_binned'], prefix='age')
print(df.head())
print("\nOne-hot encoded features:\n", age_dummies.head())
🚀 Implementing Equal-Width Binning from Scratch - Made Simple!
🎉 You’re doing great! This concept might seem tricky at first, but you’ve got this!
Equal-width binning divides the range of values into k intervals of equal size. This example shows how to create bins manually without using pandas, giving you more control over the discretization process and a better understanding of the underlying mechanics.
Here’s a handy trick you’ll love! Here’s how we can tackle this:
import numpy as np

def equal_width_binning(data, num_bins):
    min_val, max_val = min(data), max(data)
    bin_width = (max_val - min_val) / num_bins

    bins = []
    for i in range(num_bins):
        lower = min_val + i * bin_width
        upper = min_val + (i + 1) * bin_width
        bins.append((lower, upper))

    # Assign data points to bins
    binned_data = []
    for value in data:
        for idx, (lower, upper) in enumerate(bins):
            if lower <= value <= upper:
                binned_data.append(idx)
                break
        else:
            # Floating-point rounding can leave the maximum just outside
            # the last interval; assign it to the final bin
            binned_data.append(num_bins - 1)

    return np.array(binned_data), bins

# Example usage
data = np.random.normal(50, 15, 1000)
binned_values, bin_edges = equal_width_binning(data, 5)
print(f"Original value: {data[0]:.2f}")
print(f"Binned value: {binned_values[0]}")
🚀 Equal-Frequency Binning Implementation - Made Simple!
✨ Cool fact: Many professional data scientists use this exact approach in their daily work!
Equal-frequency binning ensures that each bin contains approximately the same number of samples. This approach is particularly useful when dealing with skewed distributions, where equal-width binning might create bins with very different populations.
Let’s make this super clear! Here’s how we can tackle this:
import numpy as np

def equal_frequency_binning(data, num_bins):
    n = len(data)
    samples_per_bin = n // num_bins
    sorted_data = np.sort(data)

    bins = []
    binned_data = np.zeros(n)

    for i in range(num_bins):
        start_idx = i * samples_per_bin
        end_idx = (i + 1) * samples_per_bin if i < num_bins - 1 else n
        bin_values = sorted_data[start_idx:end_idx]
        bins.append((min(bin_values), max(bin_values)))

    # Assign original data to bins
    for i, value in enumerate(data):
        for bin_idx, (lower, upper) in enumerate(bins):
            if lower <= value <= upper:
                binned_data[i] = bin_idx
                break

    return binned_data, bins

# Example usage
data = np.random.exponential(scale=2.0, size=1000)
binned_values, bin_edges = equal_frequency_binning(data, 5)
print(f"Bin edges: {bin_edges}")
print(f"Sample distribution: {np.bincount(binned_values.astype(int))}")
🚀 Adaptive Binning Strategy - Made Simple!
🔥 Level up: Once you master this, you’ll be solving problems like a pro!
Adaptive binning adjusts the number of bins based on the data distribution, using statistical measures like variance or entropy. This approach optimizes information retention while reducing dimensionality, and it is particularly effective for features with non-uniform distributions.
Let’s make this super clear! Here’s how we can tackle this:
import numpy as np

def adaptive_binning(data, min_bins=3, max_bins=10, threshold=0.05):
    total_variance = np.var(data)
    best_bins = max_bins

    for n_bins in range(min_bins, max_bins + 1):
        # Try binning with n_bins and replace each point by its bin center
        _, bin_edges = np.histogram(data, bins=n_bins)
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
        digitized = np.digitize(data, bin_edges)
        binned_data = bin_centers[np.clip(digitized - 1, 0, len(bin_centers) - 1)]

        # Fraction of the original variance lost by the discretization
        binned_variance = np.var(binned_data)
        variance_loss = (total_variance - binned_variance) / total_variance

        # Keep the smallest number of bins whose information loss is acceptable
        if variance_loss < threshold:
            best_bins = n_bins
            break

    return best_bins

# Example usage
np.random.seed(42)
data = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(4, 0.5, 200)
])

optimal_bins = adaptive_binning(data)
print(f"Optimal number of bins: {optimal_bins}")

# Visualize results
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(data, bins=optimal_bins, density=True, alpha=0.7)
plt.title(f"Adaptive Binning Result (n_bins={optimal_bins})")
plt.show()
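If you’d rather not hand-roll this, scikit-learn’s KBinsDiscretizer offers a similar idea out of the box: strategy='kmeans' places bin edges adaptively around clusters of values. A minimal sketch reusing the data array from above (the n_bins=5 setting is just an illustrative guess):

from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
data_binned = kbd.fit_transform(data.reshape(-1, 1)).astype(int).ravel()

print("K-means bin edges:", np.round(kbd.bin_edges_[0], 3))
print("Samples per bin:", np.bincount(data_binned))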
🚀 Feature Discretization for Linear Regression - Made Simple!
This example shows how feature discretization can improve linear regression performance on non-linear data by creating piecewise-constant approximations of complex relationships.
Let’s break this down together! Here’s how we can tackle this:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

# Generate non-linear data
X = np.linspace(0, 10, 1000).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(1000)

def create_discretized_features(X, num_bins):
    bins = np.linspace(X.min(), X.max(), num_bins + 1)
    X_binned = np.digitize(X.ravel(), bins)
    X_encoded = np.zeros((len(X), num_bins))
    for i in range(len(X)):
        # Clip so the maximum value still lands in the last bin
        bin_idx = min(X_binned[i] - 1, num_bins - 1)
        X_encoded[i, bin_idx] = 1
    return X_encoded

# Compare regular vs discretized linear regression
X_disc = create_discretized_features(X, num_bins=20)

# Fit models
reg_model = LinearRegression().fit(X, y)
disc_model = LinearRegression().fit(X_disc, y)

# Predictions
y_pred_reg = reg_model.predict(X)
y_pred_disc = disc_model.predict(X_disc)

# Calculate R2 scores
r2_regular = r2_score(y, y_pred_reg)
r2_discretized = r2_score(y, y_pred_disc)

print(f"R2 Score (Regular): {r2_regular:.4f}")
print(f"R2 Score (Discretized): {r2_discretized:.4f}")
🚀 Entropy-Based Discretization - Made Simple!
Entropy-based discretization uses information gain to determine the best bin boundaries. This method is particularly effective for classification tasks because it creates bins that maximize class separation.
This next part is really neat! Here’s how we can tackle this:
import numpy as np
from scipy.stats import entropy

def calculate_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return entropy(probabilities, base=2)

def entropy_based_split(X, y, threshold):
    initial_entropy = calculate_entropy(y)
    left_mask = X <= threshold
    right_mask = ~left_mask

    if len(y[left_mask]) > 0 and len(y[right_mask]) > 0:
        left_entropy = calculate_entropy(y[left_mask])
        right_entropy = calculate_entropy(y[right_mask])

        # Calculate weighted average entropy
        weighted_entropy = (
            (len(y[left_mask]) * left_entropy +
             len(y[right_mask]) * right_entropy) / len(y)
        )
        information_gain = initial_entropy - weighted_entropy
        return information_gain
    return 0

def entropy_based_discretization(X, y, min_samples=50):
    # Sort by feature value so index ranges correspond to value ranges
    order = np.argsort(X)
    X, y = X[order], y[order]
    boundaries = []

    def recursive_split(X, y, start, end):
        if end - start < min_samples:
            return

        possible_thresholds = np.percentile(
            X[start:end],
            q=range(10, 100, 10)
        )

        best_gain = 0
        best_threshold = None
        for threshold in possible_thresholds:
            gain = entropy_based_split(X[start:end], y[start:end], threshold)
            if gain > best_gain:
                best_gain = gain
                best_threshold = threshold

        if best_threshold is not None:
            boundaries.append(best_threshold)
            split_point = start + np.sum(X[start:end] <= best_threshold)
            recursive_split(X, y, start, split_point)
            recursive_split(X, y, split_point, end)

    recursive_split(X, y, 0, len(X))
    return np.sort(boundaries)

# Example usage
np.random.seed(42)
X = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(4, 0.5, 200)
])
y = (X > 2).astype(int)

boundaries = entropy_based_discretization(X, y)
print(f"Optimal boundaries: {boundaries}")
🚀 Feature Discretization for Time Series Data - Made Simple!
Time series data often benefits from discretization when dealing with cyclical patterns or seasonal effects. This example shows how to discretize temporal features while preserving their sequential nature and periodic patterns.
This next part is really neat! Here’s how we can tackle this:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

class TimeSeriesDiscretizer:
    def __init__(self, granularity='hour'):
        self.granularity = granularity
        self.mappings = {
            'hour': (24, lambda x: x.hour),
            'dayofweek': (7, lambda x: x.dayofweek),
            'month': (12, lambda x: x.month - 1)
        }

    def fit_transform(self, timestamps):
        n_bins, extractor = self.mappings[self.granularity]

        # Convert timestamps to datetime if needed
        if isinstance(timestamps[0], str):
            timestamps = pd.to_datetime(timestamps)

        # Extract temporal feature and create cyclic encoding
        feature_values = extractor(timestamps)
        transformed = self._create_cyclic_features(feature_values, n_bins)
        return transformed

    def _create_cyclic_features(self, values, n_bins):
        # Create sine and cosine transformations
        sin_values = np.sin(2 * np.pi * values / n_bins)
        cos_values = np.cos(2 * np.pi * values / n_bins)
        return np.column_stack([sin_values, cos_values])

# Example usage
np.random.seed(42)
dates = pd.date_range(start='2024-01-01', periods=1000, freq='H')
values = np.random.normal(0, 1, 1000)

# Add synthetic pattern
values += np.sin(2 * np.pi * pd.to_datetime(dates).hour / 24) * 2

# Create discretizer
discretizer = TimeSeriesDiscretizer(granularity='hour')
transformed = discretizer.fit_transform(dates)

# Demonstrate effectiveness with simple linear regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Compare regular hours vs transformed features
X_regular = pd.to_datetime(dates).hour.values.reshape(-1, 1)
X_transformed = transformed

model_regular = LinearRegression().fit(X_regular, values)
model_transformed = LinearRegression().fit(X_transformed, values)

print(f"R2 Score (Regular): {r2_score(values, model_regular.predict(X_regular)):.4f}")
print(f"R2 Score (Transformed): {r2_score(values, model_transformed.predict(X_transformed)):.4f}")
🚀 MDL-Based Discretization - Made Simple!
The Minimum Description Length (MDL) principle provides a theoretically grounded approach to discretization by finding the best trade-off between model complexity and data compression efficiency.
Ready for some cool stuff? Here’s how we can tackle this:
import numpy as np
from math import log2
from scipy.stats import entropy

class MDLDiscretizer:
    def __init__(self, min_bins=2, max_bins=20):
        self.min_bins = min_bins
        self.max_bins = max_bins

    def _calculate_mdl_cost(self, data, boundaries):
        n = len(data)
        k = len(boundaries) + 1  # number of bins

        # Model complexity cost
        model_cost = k * log2(n)

        # Data encoding cost
        bins = np.digitize(data, boundaries)
        bin_counts = np.bincount(bins)

        data_cost = 0
        for count in bin_counts:
            if count > 0:
                p = count / n
                data_cost -= count * log2(p)

        return model_cost + data_cost

    def fit_transform(self, data):
        data = np.asarray(data)
        best_boundaries = None
        min_cost = np.inf

        for n_bins in range(self.min_bins, self.max_bins + 1):
            # Generate candidate boundaries
            percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
            boundaries = np.percentile(data, percentiles)

            # Calculate MDL cost
            cost = self._calculate_mdl_cost(data, boundaries)
            if cost < min_cost:
                min_cost = cost
                best_boundaries = boundaries

        # Transform data using best boundaries
        return np.digitize(data, best_boundaries)

# Example usage
np.random.seed(42)

# Generate mixture of gaussians
data = np.concatenate([
    np.random.normal(0, 1, 500),
    np.random.normal(4, 0.5, 300),
    np.random.normal(8, 0.8, 200)
])

discretizer = MDLDiscretizer()
discretized = discretizer.fit_transform(data)

# Calculate information content
original_entropy = entropy(np.histogram(data, bins='auto')[0])
discretized_entropy = entropy(np.bincount(discretized))

print(f"Original entropy: {original_entropy:.4f}")
print(f"Discretized entropy: {discretized_entropy:.4f}")
print(f"Number of unique bins: {len(np.unique(discretized))}")
🚀 Multivariate Feature Discretization - Made Simple!
Multivariate discretization considers relationships between features during the binning process. This approach is particularly useful when features have strong correlations or interact in complex ways to influence the target variable.
Let me walk you through this step by step! Here’s how we can tackle this:
import numpy as np
from sklearn.mixture import GaussianMixture

class MultivariateDiscretizer:
    def __init__(self, n_components=3, random_state=42):
        self.n_components = n_components
        self.gmm = GaussianMixture(
            n_components=n_components,
            random_state=random_state
        )

    def fit_transform(self, X):
        # Fit GMM to identify natural clusters in multivariate space
        self.gmm.fit(X)

        # Get cluster assignments
        clusters = self.gmm.predict(X)

        # Calculate cluster probabilities for each point
        probs = self.gmm.predict_proba(X)

        # Create discretized features based on cluster memberships
        discretized = np.zeros((X.shape[0], self.n_components))
        discretized[np.arange(len(clusters)), clusters] = 1

        return discretized, probs

    def transform_single_feature(self, X, feature_idx):
        # Project components onto a specific feature dimension
        means = self.gmm.means_[:, feature_idx]
        sorted_indices = np.argsort(means)

        # Assign points to nearest component
        clusters = self.gmm.predict(X)

        # Remap clusters based on feature ordering
        mapping = {old: new for new, old in enumerate(sorted_indices)}
        remapped = np.array([mapping[c] for c in clusters])

        return remapped

# Example usage with correlated features
np.random.seed(42)
n_samples = 1000

# Generate correlated features
X1 = np.random.normal(0, 1, n_samples)
X2 = X1 * 0.7 + np.random.normal(0, 0.5, n_samples)
X = np.column_stack([X1, X2])

# Apply multivariate discretization
discretizer = MultivariateDiscretizer(n_components=4)
X_disc, probs = discretizer.fit_transform(X)

# Get feature-specific discretization
X1_disc = discretizer.transform_single_feature(X, 0)
X2_disc = discretizer.transform_single_feature(X, 1)

print("Original shape:", X.shape)
print("Discretized shape:", X_disc.shape)
print("\nFeature 1 unique values:", np.unique(X1_disc))
print("Feature 2 unique values:", np.unique(X2_disc))
🚀 ChiMerge Discretization Algorithm - Made Simple!
ChiMerge is a bottom-up discretization method that uses chi-square statistics to determine when to merge adjacent intervals. It’s particularly effective for classification tasks as it maintains class discrimination.
Don’t worry, this is easier than it looks! Here’s how we can tackle this:
import numpy as np
from scipy.stats import chi2_contingency

class ChiMergeDiscretizer:
    def __init__(self, max_intervals=6, significance_level=0.05):
        self.max_intervals = max_intervals
        self.significance_level = significance_level
        self.boundaries = None

    def _calculate_chi_square(self, y1, y2):
        # Build a 2 x n_classes contingency table for two adjacent intervals
        classes = np.unique(np.concatenate([y1, y2]))
        cont_table = np.zeros((2, len(classes)))
        for i, yi in enumerate([y1, y2]):
            for j, c in enumerate(classes):
                cont_table[i, j] = np.sum(yi == c)

        # Degenerate tables (empty interval or a single class) carry no evidence
        if len(classes) < 2 or (cont_table.sum(axis=1) == 0).any():
            return 0.0

        chi2, _, _, _ = chi2_contingency(cont_table)
        return chi2

    def fit_transform(self, X, y):
        sort_idx = np.argsort(X)
        X_sorted, y_sorted = X[sort_idx], y[sort_idx]

        # Initialize boundaries at midpoints between unique values
        unique_vals = np.unique(X_sorted)
        boundaries = unique_vals[:-1] + np.diff(unique_vals) / 2

        while len(boundaries) > self.max_intervals - 1:
            # Interval j spans [edges[j], edges[j+1])
            edges = np.concatenate(([-np.inf], boundaries, [np.inf]))
            chi_squares = []
            for j in range(len(edges) - 2):
                mask1 = (X_sorted >= edges[j]) & (X_sorted < edges[j + 1])
                mask2 = (X_sorted >= edges[j + 1]) & (X_sorted < edges[j + 2])
                chi_squares.append(
                    self._calculate_chi_square(y_sorted[mask1], y_sorted[mask2])
                )

            # Merge the adjacent pair with the lowest chi-square by removing
            # the boundary that separates them
            min_chi2_idx = np.argmin(chi_squares)
            boundaries = np.delete(boundaries, min_chi2_idx)

        self.boundaries = boundaries
        # Digitize the original (unsorted) data so output aligns with input order
        return np.digitize(X, self.boundaries)

# Example usage
np.random.seed(42)

# Generate synthetic classification data
X = np.concatenate([
    np.random.normal(0, 1, 300),
    np.random.normal(3, 1, 300),
    np.random.normal(6, 1, 400)
])
y = np.concatenate([
    np.zeros(300),
    np.ones(300),
    2 * np.ones(400)
])

discretizer = ChiMergeDiscretizer(max_intervals=5)
X_disc = discretizer.fit_transform(X, y)

print(f"Number of boundaries: {len(discretizer.boundaries)}")
print(f"Boundaries: {discretizer.boundaries}")
print(f"Unique discretized values: {np.unique(X_disc)}")
🚀 Real-World Application - Customer Segmentation - Made Simple!
This example applies feature discretization to a customer segmentation scenario, where continuous features like age, income, and purchase frequency are transformed into meaningful customer segments.
Let’s break this down together! Here’s how we can tackle this:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

class CustomerSegmentationDiscretizer:
    def __init__(self):
        self.age_bins = [0, 25, 35, 50, 65, 100]
        self.income_bins = [0, 30000, 60000, 100000, 150000, np.inf]
        self.frequency_bins = [0, 2, 5, 10, 20, np.inf]

    def transform(self, df):
        df_transformed = pd.DataFrame()

        # Age discretization
        df_transformed['age_segment'] = pd.cut(
            df['age'],
            bins=self.age_bins,
            labels=['Gen-Z', 'Young Adult', 'Adult', 'Middle-Age', 'Senior']
        )

        # Income discretization
        df_transformed['income_segment'] = pd.cut(
            df['annual_income'],
            bins=self.income_bins,
            labels=['Low', 'Lower-Mid', 'Middle', 'Upper-Mid', 'High']
        )

        # Purchase frequency discretization
        df_transformed['frequency_segment'] = pd.cut(
            df['purchase_frequency'],
            bins=self.frequency_bins,
            labels=['Rare', 'Low', 'Medium', 'High', 'VIP']
        )

        # One-hot encode all segments
        return pd.get_dummies(df_transformed, prefix_sep='_')

# Generate synthetic customer data
np.random.seed(42)
n_customers = 1000

customer_data = pd.DataFrame({
    'age': np.random.normal(40, 15, n_customers),
    'annual_income': np.random.lognormal(10.5, 0.5, n_customers),
    'purchase_frequency': np.random.gamma(2, 3, n_customers),
    'customer_value': np.zeros(n_customers)  # Target variable
})

# Create target variable based on complex rules
customer_data['customer_value'] = (
    (customer_data['age'] > 35) &
    (customer_data['annual_income'] > 60000) &
    (customer_data['purchase_frequency'] > 5)
).astype(int)

# Apply discretization
discretizer = CustomerSegmentationDiscretizer()
X_transformed = discretizer.transform(customer_data)

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_transformed, customer_data['customer_value'])

# Analyze feature importance
feature_importance = pd.DataFrame({
    'feature': X_transformed.columns,
    'importance': np.abs(model.coef_[0])
})

print("Top 5 most important segments:")
print(feature_importance.sort_values('importance', ascending=False).head())
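Once trained, the same discretizer and model can score an unseen customer; just make sure the one-hot columns line up with the training matrix. A small usage sketch reusing the objects above (the example customer values are made up):

import pandas as pd

new_customer = pd.DataFrame({
    'age': [42],
    'annual_income': [85000],
    'purchase_frequency': [8]
})

# Discretize, then align columns with the training matrix
new_X = discretizer.transform(new_customer).reindex(
    columns=X_transformed.columns, fill_value=0
)
print("Predicted customer_value:", model.predict(new_X)[0])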
🚀 Real-World Application - Geospatial Analysis - Made Simple!
This example shows how to discretize geographical coordinates into meaningful regions while preserving spatial relationships, particularly useful for location-based analysis and modeling.
Let me walk you through this step by step! Here’s how we can tackle this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

class GeospatialDiscretizer:
    def __init__(self, eps=0.1, min_samples=5):
        self.eps = eps
        self.min_samples = min_samples
        self.cluster_centers_ = None
        self.cluster_boundaries_ = None

    def fit_transform(self, coordinates):
        # Apply DBSCAN clustering
        clustering = DBSCAN(
            eps=self.eps,
            min_samples=self.min_samples,
            metric='haversine'
        ).fit(np.radians(coordinates))

        labels = clustering.labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

        # Calculate cluster centers and boundaries
        self.cluster_centers_ = []
        self.cluster_boundaries_ = []

        for i in range(n_clusters):
            mask = labels == i
            cluster_points = coordinates[mask]

            # Calculate centroid
            center = np.mean(cluster_points, axis=0)
            self.cluster_centers_.append(center)

            # Calculate convex hull for boundary
            if len(cluster_points) >= 3:
                hull = ConvexHull(cluster_points)
                boundary = cluster_points[hull.vertices]
            else:
                boundary = cluster_points
            self.cluster_boundaries_.append(boundary)

        return labels

# Generate synthetic location data
np.random.seed(42)
n_points = 1000

# Create multiple clusters of points
centers = [
    (40.7128, -74.0060),   # New York
    (34.0522, -118.2437),  # Los Angeles
    (41.8781, -87.6298)    # Chicago
]

locations = []
for center in centers:
    cluster = np.random.normal(
        loc=center,
        scale=[0.1, 0.1],
        size=(n_points // len(centers), 2)
    )
    locations.append(cluster)

locations = np.vstack(locations)

# Create DataFrame with location data
location_data = pd.DataFrame(
    locations,
    columns=['latitude', 'longitude']
)

# Apply spatial discretization
discretizer = GeospatialDiscretizer(eps=0.1, min_samples=5)
location_data['region'] = discretizer.fit_transform(
    location_data[['latitude', 'longitude']].values
)

# Print statistics
print("Number of regions:", len(np.unique(location_data['region'])))
print("\nPoints per region:")
print(location_data['region'].value_counts())
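To place a brand-new coordinate into one of the discovered regions, a simple option is to pick the nearest stored cluster center (plain Euclidean distance on latitude/longitude is a rough approximation that is fine at city scale). A quick sketch reusing the fitted discretizer from above (the sample point is made up):

import numpy as np

new_point = np.array([40.75, -73.99])  # hypothetical point near New York
centers_arr = np.array(discretizer.cluster_centers_)

distances = np.linalg.norm(centers_arr - new_point, axis=1)
print("Assigned region:", int(np.argmin(distances)))
print("Region center:", np.round(centers_arr[np.argmin(distances)], 4))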
🚀 Feature Discretization with Dynamic Programming - Made Simple!
This example uses dynamic programming to find the best binning strategy that minimizes information loss while maintaining interpretability. The algorithm considers both the global and local structure of the data.
This next part is really neat! Here’s how we can tackle this:
import numpy as np
from scipy.stats import entropy
import pandas as pd

class DPDiscretizer:
    def __init__(self, max_bins=10, min_samples=30):
        self.max_bins = max_bins
        self.min_samples = min_samples
        self.boundaries = None

    def _calculate_bin_cost(self, values, y_values):
        if len(values) < self.min_samples:
            return np.inf

        # Calculate class distribution in bin
        class_counts = np.bincount(y_values.astype(int))
        probabilities = class_counts / len(y_values)

        # Calculate entropy of the bin
        bin_entropy = entropy(probabilities)

        # Add penalty for bin size
        size_penalty = -np.log(len(values) / self.min_samples)

        return bin_entropy + size_penalty

    def _find_optimal_bins(self, x, y):
        n = len(x)

        # Initialize dynamic programming tables
        dp = np.full((n + 1, self.max_bins + 1), np.inf)
        split_points = np.zeros((n + 1, self.max_bins + 1), dtype=int)

        # Base case: zero cost for empty sequence
        dp[0, 0] = 0

        # Fill dynamic programming table
        for i in range(1, n + 1):
            for k in range(1, min(i + 1, self.max_bins + 1)):
                for j in range(i):
                    cost = (dp[j, k - 1] +
                            self._calculate_bin_cost(x[j:i], y[j:i]))
                    if cost < dp[i, k]:
                        dp[i, k] = cost
                        split_points[i, k] = j

        # Reconstruct optimal boundaries (k bins need k - 1 boundaries)
        boundaries = []
        pos = n
        k = self.max_bins
        while k > 1 and pos > 0:
            boundaries.append(x[split_points[pos, k]])
            pos = split_points[pos, k]
            k -= 1

        return sorted(boundaries)

    def fit_transform(self, X, y):
        # Sort data by feature value
        sort_idx = np.argsort(X)
        X_sorted = X[sort_idx]
        y_sorted = y[sort_idx]

        # Find optimal bin boundaries
        self.boundaries = self._find_optimal_bins(X_sorted, y_sorted)

        # Transform data using boundaries
        return np.digitize(X, self.boundaries)

# Example usage with complex non-linear pattern
np.random.seed(42)

# Generate synthetic data with multiple class regions
def generate_complex_data(n_samples=1000):
    X = np.random.uniform(0, 10, n_samples)
    y = np.zeros(n_samples)

    # Create complex class boundaries
    y[(X > 2) & (X < 4)] = 1
    y[(X > 6) & (X < 7)] = 2
    y[X > 8] = 3

    # Add noise
    noise_idx = np.random.choice(
        n_samples,
        size=int(0.1 * n_samples),
        replace=False
    )
    y[noise_idx] = np.random.randint(0, 4, len(noise_idx))

    return X, y

X, y = generate_complex_data()

# Apply discretization
discretizer = DPDiscretizer(max_bins=8)
X_disc = discretizer.fit_transform(X, y)

# Analyze results
results = pd.DataFrame({
    'original_value': X,
    'discretized_value': X_disc,
    'class': y
})

print("Discovered boundaries:", discretizer.boundaries)
print("\nClass distribution in each bin:")
for bin_idx in range(len(discretizer.boundaries) + 1):
    mask = X_disc == bin_idx
    class_dist = np.bincount(y[mask].astype(int))
    print(f"\nBin {bin_idx}:")
    print(f"Samples: {sum(mask)}")
    print(f"Class distribution: {class_dist}")
🚀 Additional Resources - Made Simple!
- “Optimal Discretization Using Information Theory” - https://arxiv.org/abs/1401.1914
- “MDL-Based Discretization Methods for Classification” - https://arxiv.org/abs/1509.00922
- “A Survey of Discretization Techniques: Taxonomy and Empirical Analysis” - https://arxiv.org/abs/1405.4534
- “Feature Engineering and Selection: A Practical Approach” - Search on Google for “Max Kuhn Feature Engineering Book”
- “Statistical and Machine Learning Methods for Discretization” - Search for “Dougherty et al. Discretization Survey”
- “Dynamic Programming Algorithms for Feature Discretization” - Search for “Fayyad & Irani MDL Discretization”
Note: Some recommended search terms are provided since direct URLs may not be available for all resources.
🎊 Awesome Work!
You’ve just learned some really powerful techniques! Don’t worry if everything doesn’t click immediately - that’s totally normal. The best way to master these concepts is to practice with your own data.
What’s next? Try implementing these examples with your own datasets. Start small, experiment, and most importantly, have fun with it! Remember, every data science expert started exactly where you are right now.
Keep coding, keep learning, and keep being awesome! 🚀