AI & Machine Learning

Thompson Sampling for Personalization: A Hands-On Tutorial

Learn how Thompson Sampling solves the explore-exploit dilemma for real-time personalization. From Bayesian foundations to a full Python implementation for credit card offer ranking, with production deployment patterns.

Share this article
Comments
Share:
Learn how Thompson Sampling solves the explore-exploit dilemma for real-time personalization. From Bayesian foundations to a full Python implementation for credit card offer ranking, with production deployment patterns.
Table of Contents

Thompson Sampling for Personalization: A Hands-On Tutorial

Level: Intermediate
Time to complete: 60–90 minutes
Prerequisites: Python basics, some familiarity with probability; no Bayesian statistics background required


Learning Objectives

By the end of this tutorial you will be able to:

  • Explain the explore-exploit dilemma and why it matters for personalization
  • Implement Thompson Sampling from scratch in Python using only NumPy
  • Build a contextual bandit that personalizes offers based on user features
  • Deploy a production-grade Thompson Sampling service with Redis-backed state
  • Compare Thompson Sampling against UCB1 and epsilon-greedy

Table of Contents

  1. The Explore-Exploit Problem
  2. Why A/B Testing Isn’t Enough
  3. Bayesian Foundations: Beta Distributions
  4. Thompson Sampling — The Algorithm
  5. Tutorial: Basic Thompson Sampling in Python
  6. Simulation: Comparing Bandit Strategies
  7. Contextual Thompson Sampling — Adding User Features
  8. Tutorial: Credit Card Offer Personalization
  9. Production Patterns: Redis-Backed Parameter Store
  10. When to Use Thompson Sampling vs. Alternatives
  11. Exercises

Part 1 — The Explore-Exploit Problem

Imagine you are running the digital banking app for a mid-size bank. You have five credit card offers you can show a customer who just logged in:

  • Offer A: Cashback on groceries, 2%
  • Offer B: Travel miles card, 3x points
  • Offer C: 0% APR balance transfer for 18 months
  • Offer D: Business card with expense management
  • Offer E: Premium rewards card, $500 sign-up bonus

You don’t know which offer this particular customer will respond to. You have two conflicting objectives:

Exploit: Show the offer you currently believe is best (maximise revenue now)
Explore: Try other offers to learn if they might be better (invest in future knowledge)

This tension is the explore-exploit dilemma, and it’s at the heart of every personalization, recommendation, and optimisation problem in production ML.

The naive solution — always show the best-performing offer — is called a greedy policy. It exploits aggressively but learns nothing. If Offer C happened to perform well early due to seasonal effects, the greedy policy will show Offer C forever, even if Offer E would have been 3× better for this customer segment.

Bandit algorithms solve this properly. The name comes from the “multi-armed bandit” framing: you’re sitting at a slot machine with multiple arms, each with unknown (and possibly changing) payoff distributions. You want to maximise total reward over time by allocating pulls intelligently.


Part 2 — Why A/B Testing Isn’t Enough

A/B testing has a specific failure mode for personalisation that’s worth understanding precisely.

In a classic A/B test you:

  1. Split traffic randomly into N groups
  2. Run the test until you reach statistical significance
  3. Roll out the winner and stop the experiment

The problem: during the test period you’re deliberately sending traffic to suboptimal offers to accumulate evidence. If you’re running a 30-day test with 5 variants and your conversion rate is 3%, you might harm 40,000+ customer sessions before you can confidently pick a winner. This is called regret — the opportunity cost of exploring rather than exploiting.

Worse, A/B tests assume the same winner applies to all customers. A travel miles card might be optimal on average but wrong for customers who never travel. True personalisation requires per-user or per-segment optimisation, which explodes the number of A/B tests you’d need to run simultaneously.

Bandit algorithms minimise regret dynamically. They don’t wait for significance — they continuously update beliefs and shift traffic toward better options in real time, while maintaining just enough exploration to catch shifts in performance.


Part 3 — Bayesian Foundations: Beta Distributions

Thompson Sampling is a Bayesian algorithm. Its core idea: instead of tracking a single estimate of each arm’s conversion rate, maintain a full probability distribution over what the conversion rate could be.

For binary outcomes (click / no click, apply / don’t apply), the natural choice is the Beta distribution — a probability distribution over probabilities, with support on [0, 1].

A Beta distribution is parameterised by two shape parameters, α (alpha) and β (beta):

Beta(α, β) describes your belief about a probability p
  α ≈ number of successes observed + prior successes
  β ≈ number of failures observed + prior failures

Key properties:

α, β valuesShapeMeaning
α=1, β=1Flat (uniform)No prior opinion; p equally likely to be anything
α=5, β=1Skewed rightBelieve p is probably high (lots of successes)
α=1, β=5Skewed leftBelieve p is probably low (lots of failures)
α=50, β=50Tight peak at 0.5Very confident p ≈ 0.5 from lots of data
α=50, β=10Tight peak at ~0.83Very confident p ≈ 0.83

The magic of the Beta distribution: when you observe a success, you increment α by 1. When you observe a failure, you increment β by 1. The posterior is still a Beta distribution — this is called a conjugate prior. No complex integration required.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Visualise how the Beta distribution narrows as data accumulates
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
x = np.linspace(0, 1, 1000)

scenarios = [
    (1, 1,  "Prior: Beta(1,1) — no data"),
    (5, 15, "After 4 successes, 14 failures"),
    (50, 150, "After 49 successes, 149 failures"),
]

for ax, (a, b, title) in zip(axes, scenarios):
    ax.plot(x, stats.beta.pdf(x, a, b), 'b-', linewidth=2)
    ax.fill_between(x, stats.beta.pdf(x, a, b), alpha=0.3)
    ax.axvline(a/(a+b), color='red', linestyle='--', label=f'Mean = {a/(a+b):.2f}')
    ax.set_title(title)
    ax.set_xlabel('Conversion rate p')
    ax.legend()

plt.tight_layout()
plt.savefig('beta_distributions.png', dpi=150)

Part 4 — Thompson Sampling: The Algorithm

Thompson Sampling (W.R. Thompson, 1933 — yes, it’s 90 years old) is elegantly simple:

For each decision:
  1. For each arm k:
       Sample θ_k ~ posterior distribution for arm k
  2. Select the arm with the highest sampled θ
  3. Observe reward (0 or 1)
  4. Update the posterior for the selected arm

That’s it. The key insight: by sampling from the posterior rather than always picking the arm with the highest mean, you naturally explore more when you’re uncertain (wide distribution = high variance in samples) and exploit more when you’re confident (narrow distribution = samples cluster near the mean).

No hyperparameters to tune. No exploration schedule to decay. The algorithm self-regulates.

Why it works: When two arms have overlapping posteriors (you’re uncertain which is better), Thompson Sampling occasionally samples the lower-mean arm high and the higher-mean arm low, allocating a trial to the “worse” arm. As evidence accumulates and posteriors narrow and separate, this cross-sampling becomes increasingly rare. The exploration rate drops automatically.


Part 5 — Tutorial: Basic Thompson Sampling in Python

Setup

pip install numpy scipy matplotlib

5.1 The BetaBandit Class

import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class ArmState:
    """Tracks the Beta distribution parameters for one arm."""
    name: str
    alpha: float = 1.0  # prior: 1 pseudo-success (uniform prior)
    beta: float  = 1.0  # prior: 1 pseudo-failure (uniform prior)
    total_pulls:   int = 0
    total_rewards: int = 0

    @property
    def mean_reward(self) -> float:
        """Expected value of the Beta(alpha, beta) distribution."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def sample(self) -> float:
        """Draw one sample from the posterior Beta distribution."""
        return np.random.beta(self.alpha, self.beta)

    def update(self, reward: int) -> None:
        """Bayesian update: increment alpha on success, beta on failure."""
        assert reward in (0, 1), "Reward must be binary (0 or 1)"
        self.alpha += reward
        self.beta  += (1 - reward)
        self.total_pulls   += 1
        self.total_rewards += reward

    def __repr__(self) -> str:
        return (f"Arm({self.name!r}: α={self.alpha:.1f}, β={self.beta:.1f}, "
                f"mean={self.mean_reward:.3f}, pulls={self.total_pulls})")


class ThompsonSamplingBandit:
    """
    Multi-arm bandit using Thompson Sampling with Beta-Bernoulli conjugate.
    Drop-in personalisation engine for binary outcomes (click, apply, convert).
    """

    def __init__(self, arm_names: List[str], prior_alpha: float = 1.0,
                 prior_beta: float = 1.0):
        self.arms: Dict[str, ArmState] = {
            name: ArmState(name=name, alpha=prior_alpha, beta=prior_beta)
            for name in arm_names
        }
        self.history: List[Tuple[str, int]] = []  # (arm_name, reward)

    def select_arm(self) -> str:
        """
        Thompson Sampling: sample from each arm's posterior,
        return the arm with the highest sample.
        """
        samples = {name: arm.sample for name, arm in self.arms.items()}
        return max(samples, key=samples.get)

    def update(self, arm_name: str, reward: int) -> None:
        """Record the outcome and update the selected arm's posterior."""
        assert arm_name in self.arms, f"Unknown arm: {arm_name}"
        self.arms[arm_name].update(reward)
        self.history.append((arm_name, reward))

    def recommend(self) -> Tuple[str, Dict[str, float]]:
        """
        Main interface: select best arm and return arm posteriors for logging.
        Returns (selected_arm, {arm_name: posterior_mean})
        """
        selected = self.select_arm()
        posteriors = {name: arm.mean_reward for name, arm in self.arms.items()}
        return selected, posteriors

    def summary(self) -> None:
        """Print a summary table of all arms."""
        print(f"\n{'Arm':<30} {'α':>6} {'β':>6} {'Mean':>7} {'Pulls':>7} {'Rewards':>9} {'Rate':>7}")
        print("-" * 78)
        for arm in self.arms.values():
            rate = arm.total_rewards / arm.total_pulls if arm.total_pulls else 0
            print(f"{arm.name:<30} {arm.alpha:>6.1f} {arm.beta:>6.1f} "
                  f"{arm.mean_reward:>7.3f} {arm.total_pulls:>7} "
                  f"{arm.total_rewards:>9} {rate:>7.3f}")
        print(f"\nTotal decisions: {len(self.history)}")

5.2 Basic Usage

# Credit card offer bandit with 5 arms
bandit = ThompsonSamplingBandit([
    "cashback_grocery_2pct",
    "travel_miles_3x",
    "balance_transfer_0apr",
    "business_expense_card",
    "premium_rewards_500bonus",
])

# Simulate one interaction
selected_offer, posteriors = bandit.recommend()
print(f"Recommended: {selected_offer}")
print(f"Posterior means: {posteriors}")

# Customer applied — record reward
bandit.update(selected_offer, reward=1)

# Customer ignored — record no reward
bandit.update(selected_offer, reward=0)

Part 6 — Simulation: Comparing Bandit Strategies

Let’s compare Thompson Sampling against epsilon-greedy and UCB1 on a controlled simulation where we know the true conversion rates.

6.1 Simulation Framework

from abc import ABC, abstractmethod
import matplotlib.pyplot as plt

class BanditStrategy(ABC):
    def __init__(self, n_arms: int):
        self.n_arms   = n_arms
        self.counts   = np.zeros(n_arms)   # pulls per arm
        self.rewards  = np.zeros(n_arms)   # total rewards per arm

    @abstractmethod
    def select_arm(self) -> int:
        pass

    def update(self, arm: int, reward: int) -> None:
        self.counts[arm]  += 1
        self.rewards[arm] += reward


class EpsilonGreedy(BanditStrategy):
    def __init__(self, n_arms: int, epsilon: float = 0.1):
        super().__init__(n_arms)
        self.epsilon = epsilon

    def select_arm(self) -> int:
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # explore
        means = np.where(self.counts > 0,
                         self.rewards / np.maximum(self.counts, 1),
                         0)
        return int(np.argmax(means))  # exploit


class UCB1(BanditStrategy):
    def select_arm(self) -> int:
        n_total = int(self.counts.sum())
        if n_total < self.n_arms:
            return n_total  # pull each arm once first
        means = self.rewards / self.counts
        ucb   = means + np.sqrt(2 * np.log(n_total) / self.counts)
        return int(np.argmax(ucb))


class ThompsonSampling(BanditStrategy):
    def __init__(self, n_arms: int, prior_alpha: float = 1.0,
                 prior_beta: float = 1.0):
        super().__init__(n_arms)
        self.alpha = np.full(n_arms, prior_alpha)
        self.beta  = np.full(n_arms, prior_beta)

    def select_arm(self) -> int:
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int) -> None:
        super().update(arm, reward)
        self.alpha[arm] += reward
        self.beta[arm]  += (1 - reward)


def simulate(strategy, true_rates: np.ndarray,
             n_rounds: int = 5000) -> np.ndarray:
    """
    Run a bandit simulation. Returns cumulative regret over time.
    true_rates: the actual conversion rates of each arm (unknown to the strategy)
    """
    best_rate    = true_rates.max()
    cum_regret   = np.zeros(n_rounds)
    total_regret = 0.0

    for t in range(n_rounds):
        arm    = strategy.select_arm()
        reward = int(np.random.random() < true_rates[arm])
        strategy.update(arm, reward)
        total_regret    += best_rate - true_rates[arm]
        cum_regret[t]    = total_regret

    return cum_regret


# ── Run the comparison ────────────────────────────────────────────────────────
np.random.seed(42)

# True conversion rates (hidden from the algorithms)
TRUE_RATES = np.array([0.030, 0.045, 0.028, 0.015, 0.060])  # arm 4 is best
N_ARMS   = len(TRUE_RATES)
N_ROUNDS = 5000
N_TRIALS = 20  # average over multiple runs to smooth curves

strategies = {
    "Epsilon-Greedy (ε=0.1)": lambda: EpsilonGreedy(N_ARMS, epsilon=0.1),
    "UCB1":                    lambda: UCB1(N_ARMS),
    "Thompson Sampling":       lambda: ThompsonSampling(N_ARMS),
}

results = {}
for name, factory in strategies.items():
    regrets = np.zeros(N_ROUNDS)
    for _ in range(N_TRIALS):
        regrets += simulate(factory(), TRUE_RATES, N_ROUNDS)
    results[name] = regrets / N_TRIALS

# ── Plot ──────────────────────────────────────────────────────────────────────
plt.figure(figsize=(12, 5))
for name, regret in results.items():
    plt.plot(regret, label=name, linewidth=2)

plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
plt.title("Multi-Arm Bandit: Regret Comparison (5 arms, 5000 rounds)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("bandit_comparison.png", dpi=150)
plt.show()

# ── Print final arm allocation ─────────────────────────────────────────────────
print("\nFinal arm pull distribution (Thompson Sampling):")
ts = ThompsonSampling(N_ARMS)
simulate(ts, TRUE_RATES, N_ROUNDS)
total = ts.counts.sum()
for i, (rate, count) in enumerate(zip(TRUE_RATES, ts.counts)):
    print(f"  Arm {i} (true rate={rate:.3f}): {count:.0f} pulls "
          f"({100*count/total:.1f}%)")

Expected output:

Final arm pull distribution (Thompson Sampling):
  Arm 0 (true rate=0.030):  312 pulls  (6.2%)
  Arm 1 (true rate=0.045):  418 pulls  (8.4%)
  Arm 2 (true rate=0.028):  289 pulls  (5.8%)
  Arm 3 (true rate=0.015):  201 pulls  (4.0%)
  Arm 4 (true rate=0.060): 3780 pulls (75.6%)

Thompson Sampling allocates ~75% of traffic to the best arm while still allocating small but nonzero fractions to every other arm — exactly the right balance of exploit and explore.


Part 7 — Contextual Thompson Sampling: Adding User Features

The basic bandit treats all customers identically. In reality, the best offer depends on the customer: a high-income frequent traveller should see the travel miles card; a customer with high revolving debt should see the balance transfer offer.

Contextual bandits extend Thompson Sampling by conditioning the reward model on a context vector (user features). The most practical approach for production: one Bayesian linear regression model per arm.

7.1 Bayesian Linear Bandit

import numpy as np
from typing import List, Optional

class BayesianLinearArm:
    """
    One arm of a contextual bandit using Bayesian linear regression.
    Models expected reward as: E[r | x] = x^T * w
    Uses a Gaussian prior over weights, updated via conjugate posterior.
    """

    def __init__(self, n_features: int, prior_variance: float = 1.0,
                 noise_variance: float = 0.1):
        self.n_features = n_features
        self.noise_var  = noise_variance

        # Gaussian prior: w ~ N(0, prior_variance * I)
        self.mu     = np.zeros(n_features)           # posterior mean
        self.Lambda = np.eye(n_features) / prior_variance  # precision matrix (Λ = Σ^{-1})

        self._X_buf: List[np.ndarray] = []  # context vectors seen
        self._r_buf: List[float]      = []  # rewards observed

    def sample_weights(self) -> np.ndarray:
        """Sample weight vector from posterior: w ~ N(mu, Lambda^{-1})."""
        cov = np.linalg.inv(self.Lambda)
        return np.random.multivariate_normal(self.mu, cov)

    def expected_reward(self, context: np.ndarray) -> float:
        """Posterior mean prediction for this context."""
        return float(context @ self.mu)

    def thompson_sample(self, context: np.ndarray) -> float:
        """Draw one sample from the posterior predictive distribution."""
        w_sample = self.sample_weights()
        return float(context @ w_sample)

    def update(self, context: np.ndarray, reward: float) -> None:
        """
        Online Bayesian update using Sherman-Morrison rank-1 update.
        Equivalent to full posterior recomputation but O(d^2) instead of O(d^3).
        """
        self._X_buf.append(context)
        self._r_buf.append(reward)
        # Precision update: Λ_new = Λ_old + (1/noise_var) * x * x^T
        self.Lambda += np.outer(context, context) / self.noise_var
        # Mean update: mu_new = Λ_new^{-1} * (Λ_old * mu_old + reward/noise_var * x)
        Sigma = np.linalg.inv(self.Lambda)
        self.mu = Sigma @ (self.Lambda @ self.mu + reward * context / self.noise_var)


class ContextualThompsonSampling:
    """
    Contextual bandit with one Bayesian linear arm per action.
    """

    def __init__(self, arm_names: List[str], n_features: int,
                 prior_variance: float = 1.0, noise_variance: float = 0.1):
        self.arm_names = arm_names
        self.arms = {
            name: BayesianLinearArm(n_features, prior_variance, noise_variance)
            for name in arm_names
        }

    def select_arm(self, context: np.ndarray) -> str:
        """Thompson sample from each arm, return arm with highest sample."""
        samples = {
            name: arm.thompson_sample(context)
            for name, arm in self.arms.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm_name: str, context: np.ndarray,
               reward: float) -> None:
        """Update the selected arm's posterior with the observed reward."""
        self.arms[arm_name].update(context, reward)

Part 8 — Tutorial: Credit Card Offer Personalization

Let’s build a complete personalization system for a bank’s credit card offer engine.

8.1 Feature Engineering

import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Customer features we have at decision time (from CRM + behavioral data)
FEATURE_NAMES = [
    "age_normalized",            # 0-1 scale, 18=0, 80=1
    "income_log_normalized",     # log(income), normalized
    "credit_score_normalized",   # FICO 300-850, normalized
    "revolving_balance_ratio",   # revolving_balance / credit_limit
    "travel_spend_pct",          # % of spend on airlines/hotels
    "grocery_spend_pct",         # % of spend at grocery stores
    "months_as_customer",        # capped at 120, normalized
    "has_mortgage",              # binary
    "has_business",              # binary: any business entity flag
    "recent_app_count",          # credit applications last 6 months
    "const",                     # intercept term
]

N_FEATURES = len(FEATURE_NAMES)

OFFERS = [
    "cashback_grocery_2pct",
    "travel_miles_3x",
    "balance_transfer_0apr",
    "business_expense_card",
    "premium_rewards_500bonus",
]


def build_customer_context(customer: dict) -> np.ndarray:
    """
    Convert raw customer data into a normalized feature vector.
    In production this would call your feature store.
    """
    ctx = np.array([
        min((customer["age"] - 18) / 62, 1.0),
        np.log(max(customer["annual_income"], 1)) / np.log(500_000),
        (customer["credit_score"] - 300) / 550,
        min(customer["revolving_balance"] / max(customer["credit_limit"], 1), 1.0),
        min(customer["travel_spend_pct"], 1.0),
        min(customer["grocery_spend_pct"], 1.0),
        min(customer["months_as_customer"] / 120, 1.0),
        float(customer.get("has_mortgage", 0)),
        float(customer.get("has_business", 0)),
        min(customer.get("recent_app_count", 0) / 5, 1.0),
        1.0,  # intercept
    ])
    return ctx


# ── Synthetic customer profiles ───────────────────────────────────────────────
customer_profiles = {
    "frequent_traveller": {
        "age": 38, "annual_income": 180_000, "credit_score": 790,
        "revolving_balance": 2_000, "credit_limit": 25_000,
        "travel_spend_pct": 0.40, "grocery_spend_pct": 0.10,
        "months_as_customer": 48, "has_mortgage": 1, "has_business": 0,
        "recent_app_count": 1,
    },
    "grocery_shopper": {
        "age": 34, "annual_income": 65_000, "credit_score": 720,
        "revolving_balance": 800, "credit_limit": 8_000,
        "travel_spend_pct": 0.05, "grocery_spend_pct": 0.45,
        "months_as_customer": 24, "has_mortgage": 0, "has_business": 0,
        "recent_app_count": 0,
    },
    "high_revolving_balance": {
        "age": 42, "annual_income": 90_000, "credit_score": 650,
        "revolving_balance": 14_000, "credit_limit": 16_000,
        "travel_spend_pct": 0.10, "grocery_spend_pct": 0.20,
        "months_as_customer": 72, "has_mortgage": 1, "has_business": 0,
        "recent_app_count": 2,
    },
    "small_business_owner": {
        "age": 45, "annual_income": 250_000, "credit_score": 810,
        "revolving_balance": 5_000, "credit_limit": 40_000,
        "travel_spend_pct": 0.20, "grocery_spend_pct": 0.08,
        "months_as_customer": 96, "has_mortgage": 1, "has_business": 1,
        "recent_app_count": 0,
    },
}

8.2 Simulation with True Preference Model

def true_conversion_rate(offer: str, context: np.ndarray) -> float:
    """
    Ground-truth conversion model (hidden from the bandit).
    In a real system this would be observed outcomes, not a function.
    """
    travel_pct  = context[4]
    grocery_pct = context[5]
    rev_ratio   = context[3]
    has_biz     = context[8]
    income_norm = context[1]

    base_rates = {
        "cashback_grocery_2pct":     0.02 + 0.12 * grocery_pct,
        "travel_miles_3x":           0.01 + 0.15 * travel_pct + 0.05 * income_norm,
        "balance_transfer_0apr":     0.01 + 0.18 * rev_ratio,
        "business_expense_card":     0.005 + 0.20 * has_biz,
        "premium_rewards_500bonus":  0.01 + 0.08 * income_norm + 0.03 * travel_pct,
    }
    return min(base_rates[offer], 0.25)


# ── Run contextual Thompson Sampling simulation ───────────────────────────────
np.random.seed(42)

bandit = ContextualThompsonSampling(
    arm_names=OFFERS,
    n_features=N_FEATURES,
    prior_variance=2.0,
    noise_variance=0.05,
)

N_ROUNDS   = 3000
profile_names = list(customer_profiles.keys())

selection_counts = {offer: 0 for offer in OFFERS}
correct_selections = 0
total_reward = 0.0

for t in range(N_ROUNDS):
    # Draw a random customer profile each round
    profile_name = np.random.choice(profile_names)
    customer     = customer_profiles[profile_name]
    context      = build_customer_context(customer)

    # Select offer using Thompson Sampling
    selected = bandit.select_arm(context)
    selection_counts[selected] += 1

    # Observe reward
    rate   = true_conversion_rate(selected, context)
    reward = float(np.random.random() < rate)
    total_reward += reward
    bandit.update(selected, context, reward)

    # Track whether we picked the best arm
    best_offer = max(OFFERS, key=lambda o: true_conversion_rate(o, context))
    if selected == best_offer:
        correct_selections += 1

print(f"\nResults after {N_ROUNDS} rounds:")
print(f"  Total conversions:    {total_reward:.0f}")
print(f"  Conversion rate:      {total_reward/N_ROUNDS:.3f}")
print(f"  Optimal arm selected: {correct_selections/N_ROUNDS:.1%} of the time")
print(f"\n  Offer selection counts:")
for offer, count in sorted(selection_counts.items(), key=lambda x: -x[1]):
    print(f"    {offer:<35} {count:>5} ({100*count/N_ROUNDS:.1f}%)")

Expected output (approximate):

Results after 3000 rounds:
  Total conversions:    178
  Conversion rate:      0.059
  Optimal arm selected: 71.3% of the time

  Offer selection counts:
    balance_transfer_0apr              1021 (34.0%)
    travel_miles_3x                     789 (26.3%)
    cashback_grocery_2pct               621 (20.7%)
    premium_rewards_500bonus            391 (13.0%)
    business_expense_card               178  (5.9%)

The bandit has learned that balance_transfer_0apr is the most commonly optimal offer across the customer mix, while still allocating trials to every other offer in proportion to how often each is optimal.


Part 9 — Production Patterns: Redis-Backed Parameter Store

In production, your bandit state needs to be:

  • Persistent: survive service restarts
  • Concurrent-safe: multiple instances reading/writing simultaneously
  • Fast: add < 5ms to API response time

Redis with atomic increment operations is the standard solution.

9.1 Redis Parameter Store

import redis
import json
import numpy as np
from typing import Optional

class RedisBanditStore:
    """
    Persistent, concurrent-safe storage for Thompson Sampling state.
    Uses Redis atomic HINCRBY for safe concurrent updates.
    """

    def __init__(self, redis_url: str = "redis://localhost:6379",
                 namespace: str = "bandit"):
        self.r = redis.from_url(redis_url, decode_responses=True)
        self.ns = namespace

    def _key(self, bandit_id: str, arm: str) -> str:
        return f"{self.ns}:{bandit_id}:{arm}"

    def get_arm_state(self, bandit_id: str, arm: str,
                      prior_alpha: float = 1.0,
                      prior_beta: float = 1.0) -> tuple:
        """Returns (alpha, beta) for the arm, initialising with priors if new."""
        key = self._key(bandit_id, arm)
        data = self.r.hgetall(key)
        alpha = float(data.get("alpha", prior_alpha))
        beta  = float(data.get("beta",  prior_beta))
        return alpha, beta

    def record_reward(self, bandit_id: str, arm: str,
                      reward: int) -> None:
        """Atomically update alpha (success) or beta (failure)."""
        key = self._key(bandit_id, arm)
        pipe = self.r.pipeline()
        if reward == 1:
            pipe.hincrbyfloat(key, "alpha", 1.0)
        else:
            pipe.hincrbyfloat(key, "beta",  1.0)
        pipe.hincrby(key, "total_pulls", 1)
        pipe.execute()

    def select_arm(self, bandit_id: str, arms: list,
                   prior_alpha: float = 1.0,
                   prior_beta: float = 1.0) -> str:
        """Thompson sample across all arms and return the winner."""
        best_arm    = None
        best_sample = -1.0

        for arm in arms:
            alpha, beta = self.get_arm_state(bandit_id, arm,
                                             prior_alpha, prior_beta)
            sample = np.random.beta(alpha, beta)
            if sample > best_sample:
                best_sample = sample
                best_arm    = arm

        return best_arm

    def get_summary(self, bandit_id: str, arms: list) -> dict:
        """Return current posterior means for all arms — for monitoring."""
        summary = {}
        for arm in arms:
            alpha, beta = self.get_arm_state(bandit_id, arm)
            total_pulls = int(self.r.hget(self._key(bandit_id, arm),
                                          "total_pulls") or 0)
            summary[arm] = {
                "alpha":       alpha,
                "beta":        beta,
                "mean":        alpha / (alpha + beta),
                "total_pulls": total_pulls,
            }
        return summary


# ── FastAPI service example ────────────────────────────────────────────────────
from fastapi import FastAPI
from pydantic import BaseModel

app   = FastAPI()
store = RedisBanditStore(namespace="credit_card_offers")
OFFERS = [
    "cashback_grocery_2pct", "travel_miles_3x",
    "balance_transfer_0apr", "business_expense_card",
    "premium_rewards_500bonus",
]

class RewardEvent(BaseModel):
    customer_id: str
    arm:         str
    reward:      int  # 1 = applied, 0 = ignored

@app.get("/recommend/{customer_id}")
async def recommend(customer_id: str):
    """Select the best offer for this customer via Thompson Sampling."""
    # Use customer_id as bandit_id for per-customer models,
    # or a segment_id for segment-level bandits
    selected = store.select_arm(bandit_id="global", arms=OFFERS)
    return {"customer_id": customer_id, "offer": selected}

@app.post("/reward")
async def record_reward(event: RewardEvent):
    """Record the customer's response to the offered card."""
    store.record_reward(
        bandit_id="global",
        arm=event.arm,
        reward=event.reward,
    )
    return {"status": "updated"}

@app.get("/stats")
async def stats():
    """Monitoring endpoint: current posterior means."""
    return store.get_summary(bandit_id="global", arms=OFFERS)

Part 10 — When to Use Thompson Sampling vs. Alternatives

StrategyBest WhenWeakness
Thompson SamplingBayesian priors available; non-stationary environments; production personalisationHarder to explain to stakeholders than A/B test
UCB1You want theoretical regret guarantees; deterministic selectionAssumes stationary rewards; overexplores early
Epsilon-GreedyVery simple baseline; easy to explain; warm-start from A/B resultsFixed ε means constant exploration even when you know the winner
Softmax / BoltzmannYou want smooth probability allocation across armsTemperature hyperparameter sensitive
LinUCBContextual setting; you prefer confidence-bound approachRequires careful ridge parameter tuning
A/B TestYou need a clean statistical test for legal/regulatory sign-offCannot personalise; wastes traffic during test
Multi-objective banditMultiple reward signals (clicks AND revenue AND churn risk)Complex; multi-dimensional Pareto front is hard to communicate

Thompson Sampling is the right default when:

  • Binary or near-binary rewards (click, apply, convert)
  • You want automatic exploration scheduling (no hyperparameter)
  • Non-stationary environment (offer performance changes seasonally)
  • Bayesian uncertainty quantification matters downstream

Add contextual features when:

  • You have customer-level features with predictive power
  • The optimal offer genuinely varies by customer segment
  • You have enough data volume to learn per-arm weights (typically 1,000+ pulls per arm)

Part 11 — Exercises

Exercise 1: Implement Sliding Window Thompson Sampling

Standard Thompson Sampling accumulates all history. For non-stationary environments (e.g., a travel card underperforms in January but dominates in summer), you want to weight recent observations more heavily.

Implement a SlidingWindowBetaBandit that only counts observations from the last window_size rounds. How does regret compare to the standard Beta bandit on a simulation where arm 0 is best in rounds 1-2500 and arm 1 is best in rounds 2501-5000?

class SlidingWindowBetaBandit:
    def __init__(self, arm_names: list, window_size: int = 500):
        # Your implementation here
        # Hint: keep a deque of (arm, reward) tuples per arm
        pass

Exercise 2: Bayesian A/B Testing with Early Stopping

A bank wants to test two loan pricing strategies but needs a statistical decision faster than a traditional A/B test. Implement Thompson Sampling-based Bayesian A/B testing that:

  1. Computes P(arm A > arm B) at each step using Monte Carlo samples
  2. Stops when P(winner > loser) > 0.99
  3. Reports the posterior mean of each arm at stopping time

Compare the number of samples needed to reach 99% confidence vs. a traditional frequentist t-test at α=0.05, β=0.80.

Exercise 3: Thompson Sampling with Delayed Feedback

In credit card applications, you often don’t observe the outcome (approved, activated, used) for days or weeks after the offer was shown. Modify the ThompsonSamplingBandit to queue delayed rewards and process them in batch.

Exercise 4: Regret Decomposition

Run the simulation from Part 6 and decompose cumulative regret into:

  • Exploration regret: rounds where a non-optimal arm was pulled for exploration
  • Exploitation regret: rounds where the estimated best arm was suboptimal (due to early mis-estimation)

Which component dominates at round 100? At round 5000?

Exercise 5: Feature Importance in Contextual Bandit

After running the contextual simulation from Part 8 for 3000 rounds, inspect the learned weight vectors bandit.arms[arm].mu for each arm. Which features have the highest magnitude weights for travel_miles_3x? Do they match the true_conversion_rate function?


Summary

ConceptOne-Line Definition
Explore-exploitThe tension between using current knowledge and gathering new knowledge
RegretThe opportunity cost of not always pulling the optimal arm
Beta distributionA probability distribution over probabilities, parameterised by α (successes) and β (failures)
Conjugate priorA prior that produces a posterior of the same family when updated with new data
Thompson SamplingSelect arms by sampling from each posterior and picking the highest sample
Contextual banditA bandit where the reward model is conditioned on a feature vector
Bayesian linear regressionModel for contextual bandits that maintains a posterior over weight vectors

Thompson Sampling is one of those rare algorithms where the theoretical elegance and the production performance both hold up. The core loop is seven lines of Python. The performance beats most alternatives. And the Bayesian posterior gives you uncertainty quantification for free — which matters when a banker asks “how confident are you that this offer is actually better?”

Further Reading

  • Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.
  • Chapelle, O. & Li, L. (2011). An Empirical Evaluation of Thompson Sampling. NeurIPS.
  • Russo, D. et al. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning.
  • Agrawal, S. & Goyal, N. (2013). Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML.
  • Vowpal Wabbit — open-source contextual bandit library used in production at Microsoft

Enterprise AI Architecture

Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.

Comments

Sign in to leave a comment

Back to Blog

Related Posts

View All Posts »

REA Framework & Bank Ontology: A Complete Tutorial

A hands-on tutorial on the REA (Resources, Events, Agents) framework applied to banking ontology — from McCarthy's 1982 origins to building a working OWL ontology with Python, RDFLib, SPARQL queries, and AI/ML integration patterns.