Machine Learning Classification Metrics with Python
A practical guide to evaluating classification models in Python using confusion matrices, accuracy, precision, recall, F1, ROC-AUC, PR curves, calibration, cross-validation, and cost-sensitive metrics.
Understanding Classification Metrics Fundamentals
Classification metrics quantify how a model performs against the business objective and risk profile of the use case. The confusion matrix is the starting point because it breaks predictions down into true positives, true negatives, false positives, and false negatives.
In production ML systems, metric selection should reflect the cost of different errors. A fraud model, medical classifier, marketing propensity model, and content moderation model should not be judged by the same metric alone.
The following example creates a confusion matrix and extracts the basic classification outcomes.
import numpy as np
from sklearn.metrics import confusion_matrix

def create_confusion_matrix(y_true, y_pred):
    """
    Creates and returns a confusion matrix with basic metrics
    Parameters:
    y_true: array-like of shape (n_samples,) Ground truth labels
    y_pred: array-like of shape (n_samples,) Predicted labels
    """
    cm = confusion_matrix(y_true, y_pred)
    # For binary labels, scikit-learn orders the cells as TN, FP, FN, TP
    tn, fp, fn, tp = cm.ravel()
    # Basic metrics formula in comments
    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    # Precision = TP / (TP + FP)
    # Recall = TP / (TP + FN)
    print(f"Confusion Matrix:\n{cm}")
    return cm

# Example usage
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]
result = create_confusion_matrix(y_true, y_pred)
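As a small cross-check of the example above, the sketch below unpacks the four cells explicitly. It also passes `labels=[0, 1]`, an optional `confusion_matrix` argument that the original example does not use, so the matrix stays 2x2 even when a batch happens to contain only one class. The variable name `single_class_cm` is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# Fixing the label order guarantees ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

# Without labels, a single-class batch would give a 1x1 matrix
# and the four-way unpacking above would fail
single_class_cm = confusion_matrix([0, 0, 0], [0, 0, 0], labels=[0, 1])
print(single_class_cm.shape)  # (2, 2)
```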
Accuracy and Precision Metrics
Accuracy measures overall correctness, while precision measures how reliable positive predictions are. Accuracy can be useful on balanced datasets, but it can be misleading when classes are imbalanced or when false positives and false negatives have very different business costs.
Precision is especially important when acting on a positive prediction is expensive, risky, or customer-facing.
The following example calculates accuracy and precision from the confusion matrix.
def calculate_basic_metrics(y_true, y_pred):
    r"""
    Calculate accuracy and precision metrics
    Mathematical formulas:
    $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
    $$Precision = \frac{TP}{TP + FP}$$
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision is undefined when nothing is predicted positive; report 0 in that case
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return {
        'accuracy': accuracy,
        'precision': precision
    }

# Example usage
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]
metrics = calculate_basic_metrics(y_true, y_pred)
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"Precision: {metrics['precision']:.3f}")
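To illustrate why accuracy can mislead on imbalanced data, here is a minimal sketch with a made-up 95/5 class split and a degenerate "model" that always predicts the majority class. The data and variable names are hypothetical; the scikit-learn metric functions are used in place of the hand-written formulas.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced sample: 95 negatives, 5 positives
y_true_imb = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred_imb = [0] * 100

acc = accuracy_score(y_true_imb, y_pred_imb)
# zero_division=0 reports 0 instead of warning when no positives are predicted
prec = precision_score(y_true_imb, y_pred_imb, zero_division=0)
rec = recall_score(y_true_imb, y_pred_imb)

print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}")
# Accuracy: 0.95, Precision: 0.00, Recall: 0.00
```

The classifier looks strong on accuracy while finding none of the positive cases.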
Recall and F1-Score Implementation
Recall measures how many actual positive cases the model captures. F1-score combines precision and recall into a single harmonic-mean metric, so it is high only when both are high. Recall matters most when missing a positive case is costly, such as fraud detection, medical screening, safety alerts, or compliance monitoring.
The following example calculates recall and F1-score.
def calculate_advanced_metrics(y_true, y_pred):
    r"""
    Calculate recall and F1-score
    Mathematical formulas:
    $$Recall = \frac{TP}{TP + FN}$$
    $$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    # Guard each ratio against an empty denominator
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    f1 = (2 * (precision * recall) / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {
        'recall': recall,
        'f1_score': f1
    }

# Example usage (reuses y_true and y_pred from the previous example)
metrics = calculate_advanced_metrics(y_true, y_pred)
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1-Score: {metrics['f1_score']:.3f}")
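The harmonic-mean formula can be checked against scikit-learn's built-in `f1_score`. The sketch below also shows `fbeta_score`, which is not used in the article but generalizes F1 by weighting recall more heavily when `beta > 1`. The choice `beta=2` is only illustrative.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

p = precision_score(y_true, y_pred)  # 3 / (3 + 1)
r = recall_score(y_true, y_pred)     # 3 / (3 + 1)

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
f1_library = f1_score(y_true, y_pred)

# F-beta with beta=2 weights recall more heavily than precision
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f"F1 manual: {f1_manual:.3f}, F1 sklearn: {f1_library:.3f}, F2: {f2:.3f}")
```

Because precision and recall are equal here (both 0.75), F1 and F2 coincide; they diverge once the two differ.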
ROC Curve Implementation
The Receiver Operating Characteristic curve shows the relationship between true positive rate and false positive rate across decision thresholds. ROC-AUC is useful for comparing ranking quality, but it should be interpreted carefully on highly imbalanced datasets, where the large number of negatives can keep the false positive rate low even when false positives outnumber true positives.
The following example plots an ROC curve from predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_curve(y_true, y_prob):
    r"""
    Plot ROC curve from probability predictions
    Mathematical formula:
    $$TPR = \frac{TP}{TP + FN}$$
    $$FPR = \frac{FP}{FP + TN}$$
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2,
             label=f'ROC curve (AUC = {roc_auc:.2f})')
    # Diagonal reference line for a random classifier
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

# Example usage
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.9, 0.2, 0.7, 0.8, 0.1, 0.9, 0.3]
plot_roc_curve(y_true, y_prob)
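One way to read ROC-AUC as a ranking metric: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks that interpretation against `roc_auc_score` on a small made-up score vector (`y_score` is hypothetical and chosen so that one pair is mis-ranked).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
# Hypothetical scores: the positive scored 0.35 ranks below the negative scored 0.4
y_score = np.array([0.1, 0.9, 0.4, 0.35, 0.8, 0.1, 0.9, 0.3])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise AUC: {pairwise_auc:.4f}")                      # 0.9375
print(f"roc_auc_score: {roc_auc_score(y_true, y_score):.4f}")  # 0.9375
```

Fifteen of the sixteen pairs are ordered correctly, which gives 15/16 = 0.9375.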
Precision-Recall Curve
The Precision-Recall curve is often more informative than ROC-AUC when the positive class is rare. It shows the trade-off between precision and recall across thresholds and helps teams choose operating points for imbalanced classification problems.
The following example plots a precision-recall curve and calculates average precision.
from sklearn.metrics import precision_recall_curve, average_precision_score

def plot_precision_recall_curve(y_true, y_prob):
    r"""
    Plot Precision-Recall curve
    Mathematical formula:
    $$AP = \sum_n (R_n - R_{n-1}) P_n$$
    Where AP is Average Precision, R is Recall, P is Precision
    """
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    avg_precision = average_precision_score(y_true, y_prob)
    plt.figure()
    plt.plot(recall, precision, color='blue', lw=2,
             label=f'Precision-Recall curve (AP = {avg_precision:.2f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend(loc="lower left")
    plt.show()

# Example usage
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.9, 0.2, 0.7, 0.8, 0.1, 0.9, 0.3]
plot_precision_recall_curve(y_true, y_prob)
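The section mentions choosing an operating point from the curve. A minimal sketch of one possible rule, picking the threshold that maximizes F1, is shown below. Maximizing F1 is an assumption for illustration; a team might equally pick the lowest threshold that meets a precision target. The score vector is hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
# Hypothetical scores with some overlap between the classes
y_score = np.array([0.1, 0.9, 0.4, 0.35, 0.8, 0.1, 0.9, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The final (precision=1, recall=0) point has no threshold, so drop it
p, r = precision[:-1], recall[:-1]
denom = p + r
# F1 at each threshold, defined as 0 where precision + recall is 0
f1 = np.where(denom > 0, 2 * p * r / np.maximum(denom, 1e-12), 0.0)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"Best threshold: {best_threshold:.2f}, F1: {f1[best]:.3f}")

# Predictions at that operating point use score >= threshold
y_pred_best = (y_score >= best_threshold).astype(int)
```

On this toy data the selected threshold is 0.35, where precision is 0.8 and recall is 1.0.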
Multi-Class Classification Metrics
Multi-class classification requires metrics that summarize performance across multiple classes. Macro averaging treats each class equally, while micro averaging aggregates outcomes globally. The right choice depends on whether minority classes should carry equal importance.
The following example calculates per-class precision and recall with macro averages.
def calculate_multiclass_metrics(y_true, y_pred, num_classes):
    r"""
    Calculate metrics for multi-class classification
    Mathematical formulas:
    $$Macro-Precision = \frac{1}{n}\sum_{i=1}^{n} Precision_i$$
    $$Micro-Precision = \frac{TP_{total}}{TP_{total} + FP_{total}}$$
    """
    # Convert to arrays so the element-wise comparisons below work on plain lists
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Initialize arrays for per-class metrics
    precisions = np.zeros(num_classes)
    recalls = np.zeros(num_classes)
    # Calculate per-class metrics (one-vs-rest)
    for class_idx in range(num_classes):
        true_class = (y_true == class_idx)
        pred_class = (y_pred == class_idx)
        tp = np.sum(true_class & pred_class)
        fp = np.sum(~true_class & pred_class)
        fn = np.sum(true_class & ~pred_class)
        precisions[class_idx] = tp / (tp + fp) if (tp + fp) > 0 else 0
        recalls[class_idx] = tp / (tp + fn) if (tp + fn) > 0 else 0
    macro_precision = np.mean(precisions)
    macro_recall = np.mean(recalls)
    return {
        'macro_precision': macro_precision,
        'macro_recall': macro_recall,
        'per_class_precision': precisions,
        'per_class_recall': recalls
    }

# Example usage
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2, 2, 1, 2]
metrics = calculate_multiclass_metrics(y_true, y_pred, num_classes=3)
print(f"Macro Precision: {metrics['macro_precision']:.3f}")
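To make the macro versus micro distinction concrete, the sketch below computes both on the same labels using scikit-learn's `precision_score` with its `average` parameter. It relies on the library rather than the hand-written function, so it can serve as a cross-check.

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2, 2, 1, 2]

# Macro: average the per-class precisions (1.0, 0.75, 0.667) with equal weight
macro_p = precision_score(y_true, y_pred, average="macro")
# Micro: pool TP and FP over all classes before dividing
micro_p = precision_score(y_true, y_pred, average="micro")

print(f"Macro precision: {macro_p:.3f}")  # 0.806
print(f"Micro precision: {micro_p:.3f}")  # 0.778

# In single-label multi-class problems, micro precision equals accuracy
acc = accuracy_score(y_true, y_pred)
```

The two averages differ because macro averaging gives the perfectly predicted class 0 the same weight as the noisier classes.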
Cohen’s Kappa Score Implementation
Cohen’s Kappa measures agreement while accounting for agreement expected by chance. It can be useful when simple accuracy overstates model quality, especially in classification tasks with skewed class distributions.
The following example calculates Cohen’s Kappa from a confusion matrix.
def cohen_kappa_score(y_true, y_pred):
    r"""
    Calculate Cohen's Kappa Score
    Mathematical formula:
    $$\kappa = \frac{p_o - p_e}{1 - p_e}$$
    Where p_o is observed agreement and p_e is expected agreement
    """
    cm = confusion_matrix(y_true, y_pred)
    n_classes = cm.shape[0]
    sum_0 = cm.sum(axis=0)
    sum_1 = cm.sum(axis=1)
    # Expected cell counts if predictions and labels were independent
    expected = np.outer(sum_0, sum_1) / np.sum(sum_0)
    # Disagreement weights: 0 on the diagonal, 1 everywhere else
    w_mat = np.ones([n_classes, n_classes], dtype=int)
    w_mat.flat[::n_classes + 1] = 0
    k = np.sum(cm * w_mat)        # observed disagreements
    e = np.sum(expected * w_mat)  # disagreements expected by chance
    kappa = 1 - k / e if e != 0 else 1
    return kappa

# Example usage
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2, 2, 1, 2]
kappa = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's Kappa Score: {kappa:.3f}")
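The disagreement-weight formulation above is algebraically equivalent to the textbook formula in the docstring. As a sketch, the block below computes p_o and p_e directly and compares the result with scikit-learn's own `cohen_kappa_score`, imported here under the alias `sk_cohen_kappa` to avoid clashing with the function defined above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score as sk_cohen_kappa

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()

# Observed agreement: share of samples on the diagonal
p_o = np.trace(cm) / n
# Expected agreement: product of row and column marginals, summed over classes
p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2

kappa_direct = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa_direct:.3f}")
# p_o = 0.778, p_e = 0.333, kappa = 0.667
```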
Balanced Accuracy and Matthews Correlation Coefficient
Balanced accuracy and Matthews Correlation Coefficient are useful for imbalanced datasets. Balanced accuracy averages sensitivity and specificity, while MCC uses all four confusion matrix outcomes and is often a stronger single-number summary for binary classification.
The following example calculates balanced accuracy and MCC.
def advanced_imbalanced_metrics(y_true, y_pred):
    r"""
    Calculate balanced accuracy and Matthews Correlation Coefficient
    Mathematical formulas:
    $$Balanced Accuracy = \frac{1}{2}(\frac{TP}{TP + FN} + \frac{TN}{TN + FP})$$
    $$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
    """
    cm = confusion_matrix(y_true, y_pred)
    # Cast to float so the four-way product in the MCC denominator
    # cannot overflow 64-bit integers on large datasets
    tn, fp, fn, tp = (float(v) for v in cm.ravel())
    # Balanced Accuracy
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    balanced_acc = (sensitivity + specificity) / 2
    # Matthews Correlation Coefficient
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = numerator / denominator if denominator != 0 else 0
    return {
        'balanced_accuracy': balanced_acc,
        'mcc': mcc
    }

# Example usage
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
metrics = advanced_imbalanced_metrics(y_true, y_pred)
print(f"Balanced Accuracy: {metrics['balanced_accuracy']:.3f}")
print(f"Matthews Correlation Coefficient: {metrics['mcc']:.3f}")
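Both metrics are also available in scikit-learn as `balanced_accuracy_score` and `matthews_corrcoef`. The short sketch below runs them on the same labels as a sanity check for the hand-written formulas.

```python
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Here TN=3, FP=1, FN=1, TP=1
# Balanced accuracy = (1/2 + 3/4) / 2
bal_acc = balanced_accuracy_score(y_true, y_pred)
# MCC = (1*3 - 1*1) / sqrt(2 * 2 * 4 * 4)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Balanced Accuracy: {bal_acc:.3f}")  # 0.625
print(f"MCC: {mcc:.3f}")                    # 0.250
```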
Cross-Validation for Classification Metrics
Cross-validation estimates model performance across multiple train-validation splits. Stratified k-fold cross-validation is preferred for classification because it preserves class distribution across folds.
The following example evaluates a classifier across stratified folds with multiple metrics.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

def cross_validate_classifier(model, X, y, n_splits=5):
    r"""
    Perform stratified k-fold cross-validation with multiple metrics
    Mathematical formula for the sample standard deviation across folds:
    $$s = \sqrt{\frac{\sum(x - \bar{x})^2}{n-1}}$$
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    metrics = {
        'accuracy': [],
        'precision': [],
        'recall': [],
        'f1': []
    }
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        # Clone model for fresh instance
        model_clone = clone(model)
        model_clone.fit(X_train, y_train)
        y_pred = model_clone.predict(X_val)
        # Calculate all four metrics for this fold
        fold_metrics = {
            'accuracy': accuracy_score(y_val, y_pred),
            'precision': precision_score(y_val, y_pred, zero_division=0),
            'recall': recall_score(y_val, y_pred, zero_division=0),
            'f1': f1_score(y_val, y_pred, zero_division=0)
        }
        for metric in metrics:
            metrics[metric].append(fold_metrics[metric])
    # Calculate mean and std for each metric
    results = {}
    for metric in metrics:
        results[f'{metric}_mean'] = np.mean(metrics[metric])
        # ddof=1 matches the n-1 denominator in the docstring formula
        results[f'{metric}_std'] = np.std(metrics[metric], ddof=1)
    return results

# Example usage
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Generate sample dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
results = cross_validate_classifier(model, X, y)
for metric, value in results.items():
    print(f"{metric}: {value:.3f}")
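For comparison, scikit-learn's `cross_validate` can produce the same per-fold metrics without a manual loop. The sketch below is one possible equivalent, not a replacement for the function above; the dataset size and the names `X_demo` and `y_demo` are arbitrary choices to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, only for demonstration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metric_names = ["accuracy", "precision", "recall", "f1"]

# Returns a dict with one array of per-fold scores per metric, keyed "test_<name>"
scores = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring=metric_names,
)

for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std(ddof=1):.3f}")
```

Passing the `StratifiedKFold` object explicitly keeps shuffling and the random seed under your control.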
Calibration Metrics and Reliability Diagram
Calibration measures whether predicted probabilities match observed outcome frequencies. This matters when probabilities drive business decisions, thresholds, pricing, triage, or risk scoring.
The following example plots a reliability diagram and calculates the Brier score.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

def analyze_calibration(y_true, y_prob, n_bins=10):
    r"""
    Analyze classifier calibration and plot reliability diagram
    Mathematical formula for Brier Score:
    $$BS = \frac{1}{N}\sum_{i=1}^{N}(f_i - o_i)^2$$
    Where f_i are forecasted probabilities and o_i are actual outcomes
    """
    # Convert to arrays so the Brier score arithmetic also works on plain lists
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Calculate calibration curve
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    # Calculate Brier score
    brier_score = np.mean((y_prob - y_true) ** 2)
    # Plot reliability diagram
    plt.figure(figsize=(8, 8))
    plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')
    plt.plot(prob_pred, prob_true, 's-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('True probability')
    plt.title(f'Reliability Diagram (Brier Score: {brier_score:.3f})')
    plt.legend()
    plt.grid(True)
    plt.show()
    return {
        'brier_score': brier_score,
        'calibration_curve': {
            'prob_true': prob_true,
            'prob_pred': prob_pred
        }
    }

# Example usage
np.random.seed(42)
# Generate sample predictions
y_true = np.random.binomial(1, 0.3, 1000)
y_prob = np.clip(np.random.normal(y_true, 0.2), 0, 1)
results = analyze_calibration(y_true, y_prob)
print(f"Brier Score: {results['brier_score']:.3f}")
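To show what "probabilities match observed frequencies" looks like numerically, the sketch below simulates a calibrated forecaster by construction: each outcome is drawn with exactly the predicted probability. The simulation setup and names such as `p_hat` and `y_sim` are assumptions for illustration. It also cross-checks the manual Brier score against scikit-learn's `brier_score_loss`.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Simulated forecasts, with each outcome drawn at exactly the forecast probability
p_hat = rng.uniform(0, 1, size=20000)
y_sim = rng.binomial(1, p_hat)

# Manual Brier score versus the library implementation
brier_manual = np.mean((p_hat - y_sim) ** 2)
brier_sklearn = brier_score_loss(y_sim, p_hat)

# For a calibrated model, the observed frequency tracks the mean prediction per bin
prob_true, prob_pred = calibration_curve(y_sim, p_hat, n_bins=10)
max_gap = np.max(np.abs(prob_true - prob_pred))

print(f"Brier (manual): {brier_manual:.4f}, Brier (sklearn): {brier_sklearn:.4f}")
print(f"Largest gap between observed and predicted frequency: {max_gap:.3f}")
```

A poorly calibrated model would show a much larger gap in at least some bins, even if its ranking quality were unchanged.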
Custom Scoring Function Implementation
Custom scoring functions are useful when generic metrics do not reflect business cost or operational risk. A custom metric can weight false positives and false negatives differently based on the use case.
The following example creates a weighted custom scorer for scikit-learn cross-validation.
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def custom_metric(y_true, y_pred, weight_fp=2.0, weight_fn=1.0):
    r"""
    Create custom weighted metric for domain-specific needs
    Mathematical formula:
    $$Score = \frac{TP}{TP + weight_{fp}FP + weight_{fn}FN}$$
    """
    # labels=[0, 1] keeps the matrix 2x2 even if a fold contains a single class
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    denominator = tp + (weight_fp * fp) + (weight_fn * fn)
    score = tp / denominator if denominator > 0 else 0
    return score

# Create scorer object; extra keyword arguments are forwarded to custom_metric
custom_scorer = make_scorer(custom_metric,
                            weight_fp=2.0,
                            weight_fn=1.0,
                            greater_is_better=True)

# Example usage with cross-validation
def evaluate_with_custom_metric(X, y, model, cv=5):
    scores = cross_val_score(model,
                             X,
                             y,
                             cv=cv,
                             scoring=custom_scorer)
    return {
        'mean_score': scores.mean(),
        'std_score': scores.std(),
        'all_scores': scores
    }

# Example usage
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = DecisionTreeClassifier(random_state=42)
results = evaluate_with_custom_metric(X, y, model)
print(f"Custom Metric - Mean: {results['mean_score']:.3f} ± {results['std_score']:.3f}")
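When the business quantity is a cost rather than a score, `make_scorer` accepts `greater_is_better=False`, which negates the value so that model selection still maximizes. The sketch below illustrates this with a hypothetical `total_error_cost` function and illustrative costs of 10 per false positive and 100 per false negative; neither the function nor the toy data comes from the article.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, make_scorer

def total_error_cost(y_true, y_pred, cost_fp=10, cost_fn=100):
    """Hypothetical cost function: lower is better."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * cost_fp + fn * cost_fn

# greater_is_better=False makes the scorer return the negated cost
cost_scorer = make_scorer(total_error_cost, greater_is_better=False)

# Toy data: 8 negatives and 2 positives; the baseline always predicts 0
X_toy = np.zeros((10, 1))
y_toy = np.array([0] * 8 + [1] * 2)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Two missed positives at 100 each give a cost of 200, reported as -200
score = cost_scorer(baseline, X_toy, y_toy)
print(score)  # -200
```

The same scorer object can be passed as `scoring=` to `cross_val_score` or a grid search.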
Use Case: Credit Card Fraud Detection
Fraud detection is a high-value classification use case where class imbalance and cost-sensitive errors dominate metric selection. Missing fraud is usually much more expensive than a false alarm, but excessive false positives can still damage customer experience and operations.
The following example evaluates fraud detection with precision, recall, cost, and confusion matrix outputs.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

def evaluate_fraud_detection(X, y):
    r"""
    Complete evaluation for fraud detection
    Cost matrix formula:
    $$Cost = FN \times cost_{fn} + FP \times cost_{fp}$$
    where cost_fn = 100 (missed fraud)
    and cost_fp = 10 (false alarm)
    """
    # Split with stratification due to imbalance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # Fit the scaler on the training split only so test statistics do not leak
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Train model
    model = RandomForestClassifier(class_weight='balanced', random_state=42)
    model.fit(X_train, y_train)
    # Get predictions and probabilities
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    # Calculate complete metrics
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    # Cost-sensitive evaluation
    cost_fn = 100  # Cost of missing fraud
    cost_fp = 10   # Cost of false alarm
    total_cost = (fn * cost_fn) + (fp * cost_fp)
    metrics = {
        'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
        'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
        # Normalized against a worst case where every test sample incurs cost_fn
        'cost_savings': 1 - (total_cost / (len(y_test) * cost_fn)),
        'confusion_matrix': cm,
        'total_cost': total_cost
    }
    return metrics, y_prob, y_test

# Example usage
# Generate imbalanced dataset (random features, so the model has no real signal to learn)
np.random.seed(42)
n_samples = 10000
fraud_ratio = 0.02
X = np.random.randn(n_samples, 10)
y = np.random.choice([0, 1], size=n_samples, p=[1-fraud_ratio, fraud_ratio])
metrics, y_prob, y_test = evaluate_fraud_detection(X, y)
for key, value in metrics.items():
    # NumPy scalars such as total_cost are not instances of Python int
    if isinstance(value, (int, float, np.integer, np.floating)):
        print(f"{key}: {value:.3f}")
    elif isinstance(value, np.ndarray):
        print(f"{key}:\n{value}")
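The function above returns `y_prob` but scores only the default 0.5 cutoff. One way to use the probabilities is to sweep the decision threshold and keep the one with the lowest total cost. The sketch below does this on simulated scores (the simulation and the names `y_sim` and `prob_sim` are assumptions), using the same illustrative costs of 100 and 10.

```python
import numpy as np

COST_FN, COST_FP = 100, 10  # same illustrative costs as the example above

def total_cost(y_true, y_prob, threshold):
    """Total misclassification cost when predicting positive at y_prob >= threshold."""
    y_pred = y_prob >= threshold
    fn = np.sum((y_true == 1) & ~y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    return int(fn * COST_FN + fp * COST_FP)

# Simulated labels (about 5% positive) and noisy scores that are higher for positives
rng = np.random.default_rng(0)
y_sim = rng.binomial(1, 0.05, size=5000)
prob_sim = np.clip(0.1 + 0.5 * y_sim + rng.normal(0, 0.2, size=5000), 0, 1)

# Sweep thresholds 0.01, 0.02, ..., 0.99 and keep the cheapest
thresholds = np.arange(1, 100) / 100
costs = [total_cost(y_sim, prob_sim, t) for t in thresholds]
best_idx = int(np.argmin(costs))
best_threshold, best_cost = thresholds[best_idx], costs[best_idx]

default_cost = total_cost(y_sim, prob_sim, 0.5)
print(f"Cost at 0.5: {default_cost}, best threshold: {best_threshold:.2f}, cost: {best_cost}")
```

For well-calibrated probabilities, the cost-minimizing cutoff is cost_fp / (cost_fp + cost_fn), roughly 0.09 with these costs; with uncalibrated scores like the simulated ones, an empirical sweep on held-out data is the safer route.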
Use Case: Medical Diagnosis Classification
Medical diagnosis classification is a high-risk evaluation scenario where false negatives can have serious consequences. Sensitivity, specificity, negative predictive value, positive predictive value, and likelihood ratios should be reviewed together rather than relying on accuracy alone.
The following example evaluates a multi-label medical classifier across multiple disease categories.
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

def evaluate_medical_classifier(X, y, disease_names):
    r"""
    Complete evaluation for medical diagnosis
    Mathematical formulas:
    $$NPV = \frac{TN}{TN + FN}$$
    $$LR+ = \frac{TPR}{FPR}$$
    """
    # StratifiedKFold does not accept multi-label indicator targets, so use plain KFold
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    metrics_per_disease = {disease: {
        'sensitivity': [],
        'specificity': [],
        'npv': [],  # Negative Predictive Value
        'ppv': [],  # Positive Predictive Value
        'likelihood_ratio_positive': []
    } for disease in disease_names}
    for train_idx, test_idx in cv.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # Train multi-label classifier
        model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Calculate metrics for each disease
        for i, disease in enumerate(disease_names):
            # labels=[0, 1] keeps the matrix 2x2 even if a fold lacks one class
            tn, fp, fn, tp = confusion_matrix(
                y_test[:, i], y_pred[:, i], labels=[0, 1]
            ).ravel()
            sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
            specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
            npv = tn / (tn + fn) if (tn + fn) > 0 else 0
            ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
            # LR+ is infinite when there are no false positives
            lr_positive = sensitivity / (1 - specificity) if (1 - specificity) > 0 else float('inf')
            metrics_per_disease[disease]['sensitivity'].append(sensitivity)
            metrics_per_disease[disease]['specificity'].append(specificity)
            metrics_per_disease[disease]['npv'].append(npv)
            metrics_per_disease[disease]['ppv'].append(ppv)
            metrics_per_disease[disease]['likelihood_ratio_positive'].append(lr_positive)
    # Calculate mean metrics across folds
    final_metrics = {disease: {
        metric: np.mean(values) for metric, values in disease_metrics.items()
    } for disease, disease_metrics in metrics_per_disease.items()}
    return final_metrics

# Example usage
n_samples = 1000
n_diseases = 3
disease_names = [f'Disease_{i}' for i in range(n_diseases)]
# Generate multi-label dataset (random labels, for demonstration only)
X = np.random.randn(n_samples, 10)
y = np.random.randint(2, size=(n_samples, n_diseases))
results = evaluate_medical_classifier(X, y, disease_names)
for disease, metrics in results.items():
    print(f"\n{disease}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.3f}")
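One reason to review these metrics together is that sensitivity, specificity, and likelihood ratios are properties of the test, while PPV and NPV also depend on how common the condition is. The sketch below works this through with hypothetical values (sensitivity 0.90, specificity 0.95, and two assumed prevalence levels); `predictive_values` is an illustrative helper, not part of the article's code.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and disease prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical test characteristics
sens, spec = 0.90, 0.95

# Likelihood ratios do not depend on prevalence
lr_pos = sens / (1 - spec)  # about 18
lr_neg = (1 - sens) / spec  # about 0.105

# PPV changes sharply with prevalence for the same test
ppv_rare, npv_rare = predictive_values(sens, spec, prevalence=0.01)
ppv_common, npv_common = predictive_values(sens, spec, prevalence=0.20)

print(f"LR+: {lr_pos:.1f}, LR-: {lr_neg:.3f}")
print(f"Prevalence 1%:  PPV = {ppv_rare:.3f}, NPV = {npv_rare:.4f}")
print(f"Prevalence 20%: PPV = {ppv_common:.3f}, NPV = {npv_common:.4f}")
```

With these assumed numbers, the same test yields a PPV of about 0.15 at 1% prevalence and about 0.82 at 20% prevalence.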
Closing Thoughts
Classification metrics are not interchangeable. The right metric depends on class imbalance, prediction threshold, business cost, operational workflow, and risk tolerance. Accuracy may be acceptable for balanced, low-risk problems, but precision, recall, F1, PR-AUC, calibration, MCC, and cost-sensitive metrics are often more useful in production systems.
The practical rule is simple: choose metrics based on the decision the model supports. If a false negative is expensive, optimize for recall and sensitivity. If a false positive is expensive, optimize for precision and specificity. If probabilities drive downstream decisions, evaluate calibration. A reliable classification system aligns metrics with business risk before deployment.