Thompson Sampling for Personalization: A Hands-On Tutorial
Learn how Thompson Sampling solves the explore-exploit dilemma for real-time personalization. From Bayesian foundations to a full Python implementation for credit card offer ranking, with production deployment patterns.
Table of Contents
Thompson Sampling for Personalization: A Hands-On Tutorial
Level: Intermediate
Time to complete: 60–90 minutes
Prerequisites: Python basics, some familiarity with probability; no Bayesian statistics background required
Learning Objectives
By the end of this tutorial you will be able to:
- Explain the explore-exploit dilemma and why it matters for personalization
- Implement Thompson Sampling from scratch in Python using only NumPy
- Build a contextual bandit that personalizes offers based on user features
- Deploy a production-grade Thompson Sampling service with Redis-backed state
- Compare Thompson Sampling against UCB1 and epsilon-greedy
Table of Contents
- The Explore-Exploit Problem
- Why A/B Testing Isn’t Enough
- Bayesian Foundations: Beta Distributions
- Thompson Sampling — The Algorithm
- Tutorial: Basic Thompson Sampling in Python
- Simulation: Comparing Bandit Strategies
- Contextual Thompson Sampling — Adding User Features
- Tutorial: Credit Card Offer Personalization
- Production Patterns: Redis-Backed Parameter Store
- When to Use Thompson Sampling vs. Alternatives
- Exercises
Part 1 — The Explore-Exploit Problem
Imagine you are running the digital banking app for a mid-size bank. You have five credit card offers you can show a customer who just logged in:
- Offer A: Cashback on groceries, 2%
- Offer B: Travel miles card, 3x points
- Offer C: 0% APR balance transfer for 18 months
- Offer D: Business card with expense management
- Offer E: Premium rewards card, $500 sign-up bonus
You don’t know which offer this particular customer will respond to. You have two conflicting objectives:
Exploit: Show the offer you currently believe is best (maximise revenue now)
Explore: Try other offers to learn if they might be better (invest in future knowledge)
This tension is the explore-exploit dilemma, and it’s at the heart of every personalization, recommendation, and optimisation problem in production ML.
The naive solution — always show the best-performing offer — is called a greedy policy. It exploits aggressively but learns nothing. If Offer C happened to perform well early due to seasonal effects, the greedy policy will show Offer C forever, even if Offer E would have been 3× better for this customer segment.
Bandit algorithms solve this properly. The name comes from the “multi-armed bandit” framing: you’re sitting at a slot machine with multiple arms, each with unknown (and possibly changing) payoff distributions. You want to maximise total reward over time by allocating pulls intelligently.
Part 2 — Why A/B Testing Isn’t Enough
A/B testing has a specific failure mode for personalisation that’s worth understanding precisely.
In a classic A/B test you:
- Split traffic randomly into N groups
- Run the test until you reach statistical significance
- Roll out the winner and stop the experiment
The problem: during the test period you’re deliberately sending traffic to suboptimal offers to accumulate evidence. If you’re running a 30-day test with 5 variants and your conversion rate is 3%, you might harm 40,000+ customer sessions before you can confidently pick a winner. This is called regret — the opportunity cost of exploring rather than exploiting.
Worse, A/B tests assume the same winner applies to all customers. A travel miles card might be optimal on average but wrong for customers who never travel. True personalisation requires per-user or per-segment optimisation, which explodes the number of A/B tests you’d need to run simultaneously.
Bandit algorithms minimise regret dynamically. They don’t wait for significance — they continuously update beliefs and shift traffic toward better options in real time, while maintaining just enough exploration to catch shifts in performance.
Part 3 — Bayesian Foundations: Beta Distributions
Thompson Sampling is a Bayesian algorithm. Its core idea: instead of tracking a single estimate of each arm’s conversion rate, maintain a full probability distribution over what the conversion rate could be.
For binary outcomes (click / no click, apply / don’t apply), the natural choice is the Beta distribution — a probability distribution over probabilities, with support on [0, 1].
A Beta distribution is parameterised by two shape parameters, α (alpha) and β (beta):
Beta(α, β) describes your belief about a probability p
α ≈ number of successes observed + prior successes
β ≈ number of failures observed + prior failures
Key properties:
| α, β values | Shape | Meaning |
|---|---|---|
| α=1, β=1 | Flat (uniform) | No prior opinion; p equally likely to be anything |
| α=5, β=1 | Skewed right | Believe p is probably high (lots of successes) |
| α=1, β=5 | Skewed left | Believe p is probably low (lots of failures) |
| α=50, β=50 | Tight peak at 0.5 | Very confident p ≈ 0.5 from lots of data |
| α=50, β=10 | Tight peak at ~0.83 | Very confident p ≈ 0.83 |
The magic of the Beta distribution: when you observe a success, you increment α by 1. When you observe a failure, you increment β by 1. The posterior is still a Beta distribution — this is called a conjugate prior. No complex integration required.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Visualise how the Beta distribution narrows as data accumulates
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
x = np.linspace(0, 1, 1000)
scenarios = [
(1, 1, "Prior: Beta(1,1) — no data"),
(5, 15, "After 4 successes, 14 failures"),
(50, 150, "After 49 successes, 149 failures"),
]
for ax, (a, b, title) in zip(axes, scenarios):
ax.plot(x, stats.beta.pdf(x, a, b), 'b-', linewidth=2)
ax.fill_between(x, stats.beta.pdf(x, a, b), alpha=0.3)
ax.axvline(a/(a+b), color='red', linestyle='--', label=f'Mean = {a/(a+b):.2f}')
ax.set_title(title)
ax.set_xlabel('Conversion rate p')
ax.legend()
plt.tight_layout()
plt.savefig('beta_distributions.png', dpi=150)
Part 4 — Thompson Sampling: The Algorithm
Thompson Sampling (W.R. Thompson, 1933 — yes, it’s 90 years old) is elegantly simple:
For each decision:
1. For each arm k:
Sample θ_k ~ posterior distribution for arm k
2. Select the arm with the highest sampled θ
3. Observe reward (0 or 1)
4. Update the posterior for the selected arm
That’s it. The key insight: by sampling from the posterior rather than always picking the arm with the highest mean, you naturally explore more when you’re uncertain (wide distribution = high variance in samples) and exploit more when you’re confident (narrow distribution = samples cluster near the mean).
No hyperparameters to tune. No exploration schedule to decay. The algorithm self-regulates.
Why it works: When two arms have overlapping posteriors (you’re uncertain which is better), Thompson Sampling occasionally samples the lower-mean arm high and the higher-mean arm low, allocating a trial to the “worse” arm. As evidence accumulates and posteriors narrow and separate, this cross-sampling becomes increasingly rare. The exploration rate drops automatically.
Part 5 — Tutorial: Basic Thompson Sampling in Python
Setup
pip install numpy scipy matplotlib
5.1 The BetaBandit Class
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
@dataclass
class ArmState:
"""Tracks the Beta distribution parameters for one arm."""
name: str
alpha: float = 1.0 # prior: 1 pseudo-success (uniform prior)
beta: float = 1.0 # prior: 1 pseudo-failure (uniform prior)
total_pulls: int = 0
total_rewards: int = 0
@property
def mean_reward(self) -> float:
"""Expected value of the Beta(alpha, beta) distribution."""
return self.alpha / (self.alpha + self.beta)
@property
def sample(self) -> float:
"""Draw one sample from the posterior Beta distribution."""
return np.random.beta(self.alpha, self.beta)
def update(self, reward: int) -> None:
"""Bayesian update: increment alpha on success, beta on failure."""
assert reward in (0, 1), "Reward must be binary (0 or 1)"
self.alpha += reward
self.beta += (1 - reward)
self.total_pulls += 1
self.total_rewards += reward
def __repr__(self) -> str:
return (f"Arm({self.name!r}: α={self.alpha:.1f}, β={self.beta:.1f}, "
f"mean={self.mean_reward:.3f}, pulls={self.total_pulls})")
class ThompsonSamplingBandit:
"""
Multi-arm bandit using Thompson Sampling with Beta-Bernoulli conjugate.
Drop-in personalisation engine for binary outcomes (click, apply, convert).
"""
def __init__(self, arm_names: List[str], prior_alpha: float = 1.0,
prior_beta: float = 1.0):
self.arms: Dict[str, ArmState] = {
name: ArmState(name=name, alpha=prior_alpha, beta=prior_beta)
for name in arm_names
}
self.history: List[Tuple[str, int]] = [] # (arm_name, reward)
def select_arm(self) -> str:
"""
Thompson Sampling: sample from each arm's posterior,
return the arm with the highest sample.
"""
samples = {name: arm.sample for name, arm in self.arms.items()}
return max(samples, key=samples.get)
def update(self, arm_name: str, reward: int) -> None:
"""Record the outcome and update the selected arm's posterior."""
assert arm_name in self.arms, f"Unknown arm: {arm_name}"
self.arms[arm_name].update(reward)
self.history.append((arm_name, reward))
def recommend(self) -> Tuple[str, Dict[str, float]]:
"""
Main interface: select best arm and return arm posteriors for logging.
Returns (selected_arm, {arm_name: posterior_mean})
"""
selected = self.select_arm()
posteriors = {name: arm.mean_reward for name, arm in self.arms.items()}
return selected, posteriors
def summary(self) -> None:
"""Print a summary table of all arms."""
print(f"\n{'Arm':<30} {'α':>6} {'β':>6} {'Mean':>7} {'Pulls':>7} {'Rewards':>9} {'Rate':>7}")
print("-" * 78)
for arm in self.arms.values():
rate = arm.total_rewards / arm.total_pulls if arm.total_pulls else 0
print(f"{arm.name:<30} {arm.alpha:>6.1f} {arm.beta:>6.1f} "
f"{arm.mean_reward:>7.3f} {arm.total_pulls:>7} "
f"{arm.total_rewards:>9} {rate:>7.3f}")
print(f"\nTotal decisions: {len(self.history)}")
5.2 Basic Usage
# Credit card offer bandit with 5 arms
bandit = ThompsonSamplingBandit([
"cashback_grocery_2pct",
"travel_miles_3x",
"balance_transfer_0apr",
"business_expense_card",
"premium_rewards_500bonus",
])
# Simulate one interaction
selected_offer, posteriors = bandit.recommend()
print(f"Recommended: {selected_offer}")
print(f"Posterior means: {posteriors}")
# Customer applied — record reward
bandit.update(selected_offer, reward=1)
# Customer ignored — record no reward
bandit.update(selected_offer, reward=0)
Part 6 — Simulation: Comparing Bandit Strategies
Let’s compare Thompson Sampling against epsilon-greedy and UCB1 on a controlled simulation where we know the true conversion rates.
6.1 Simulation Framework
from abc import ABC, abstractmethod
import matplotlib.pyplot as plt
class BanditStrategy(ABC):
def __init__(self, n_arms: int):
self.n_arms = n_arms
self.counts = np.zeros(n_arms) # pulls per arm
self.rewards = np.zeros(n_arms) # total rewards per arm
@abstractmethod
def select_arm(self) -> int:
pass
def update(self, arm: int, reward: int) -> None:
self.counts[arm] += 1
self.rewards[arm] += reward
class EpsilonGreedy(BanditStrategy):
def __init__(self, n_arms: int, epsilon: float = 0.1):
super().__init__(n_arms)
self.epsilon = epsilon
def select_arm(self) -> int:
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms) # explore
means = np.where(self.counts > 0,
self.rewards / np.maximum(self.counts, 1),
0)
return int(np.argmax(means)) # exploit
class UCB1(BanditStrategy):
def select_arm(self) -> int:
n_total = int(self.counts.sum())
if n_total < self.n_arms:
return n_total # pull each arm once first
means = self.rewards / self.counts
ucb = means + np.sqrt(2 * np.log(n_total) / self.counts)
return int(np.argmax(ucb))
class ThompsonSampling(BanditStrategy):
def __init__(self, n_arms: int, prior_alpha: float = 1.0,
prior_beta: float = 1.0):
super().__init__(n_arms)
self.alpha = np.full(n_arms, prior_alpha)
self.beta = np.full(n_arms, prior_beta)
def select_arm(self) -> int:
samples = np.random.beta(self.alpha, self.beta)
return int(np.argmax(samples))
def update(self, arm: int, reward: int) -> None:
super().update(arm, reward)
self.alpha[arm] += reward
self.beta[arm] += (1 - reward)
def simulate(strategy, true_rates: np.ndarray,
n_rounds: int = 5000) -> np.ndarray:
"""
Run a bandit simulation. Returns cumulative regret over time.
true_rates: the actual conversion rates of each arm (unknown to the strategy)
"""
best_rate = true_rates.max()
cum_regret = np.zeros(n_rounds)
total_regret = 0.0
for t in range(n_rounds):
arm = strategy.select_arm()
reward = int(np.random.random() < true_rates[arm])
strategy.update(arm, reward)
total_regret += best_rate - true_rates[arm]
cum_regret[t] = total_regret
return cum_regret
# ── Run the comparison ────────────────────────────────────────────────────────
np.random.seed(42)
# True conversion rates (hidden from the algorithms)
TRUE_RATES = np.array([0.030, 0.045, 0.028, 0.015, 0.060]) # arm 4 is best
N_ARMS = len(TRUE_RATES)
N_ROUNDS = 5000
N_TRIALS = 20 # average over multiple runs to smooth curves
strategies = {
"Epsilon-Greedy (ε=0.1)": lambda: EpsilonGreedy(N_ARMS, epsilon=0.1),
"UCB1": lambda: UCB1(N_ARMS),
"Thompson Sampling": lambda: ThompsonSampling(N_ARMS),
}
results = {}
for name, factory in strategies.items():
regrets = np.zeros(N_ROUNDS)
for _ in range(N_TRIALS):
regrets += simulate(factory(), TRUE_RATES, N_ROUNDS)
results[name] = regrets / N_TRIALS
# ── Plot ──────────────────────────────────────────────────────────────────────
plt.figure(figsize=(12, 5))
for name, regret in results.items():
plt.plot(regret, label=name, linewidth=2)
plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
plt.title("Multi-Arm Bandit: Regret Comparison (5 arms, 5000 rounds)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("bandit_comparison.png", dpi=150)
plt.show()
# ── Print final arm allocation ─────────────────────────────────────────────────
print("\nFinal arm pull distribution (Thompson Sampling):")
ts = ThompsonSampling(N_ARMS)
simulate(ts, TRUE_RATES, N_ROUNDS)
total = ts.counts.sum()
for i, (rate, count) in enumerate(zip(TRUE_RATES, ts.counts)):
print(f" Arm {i} (true rate={rate:.3f}): {count:.0f} pulls "
f"({100*count/total:.1f}%)")
Expected output:
Final arm pull distribution (Thompson Sampling):
Arm 0 (true rate=0.030): 312 pulls (6.2%)
Arm 1 (true rate=0.045): 418 pulls (8.4%)
Arm 2 (true rate=0.028): 289 pulls (5.8%)
Arm 3 (true rate=0.015): 201 pulls (4.0%)
Arm 4 (true rate=0.060): 3780 pulls (75.6%)
Thompson Sampling allocates ~75% of traffic to the best arm while still allocating small but nonzero fractions to every other arm — exactly the right balance of exploit and explore.
Part 7 — Contextual Thompson Sampling: Adding User Features
The basic bandit treats all customers identically. In reality, the best offer depends on the customer: a high-income frequent traveller should see the travel miles card; a customer with high revolving debt should see the balance transfer offer.
Contextual bandits extend Thompson Sampling by conditioning the reward model on a context vector (user features). The most practical approach for production: one Bayesian linear regression model per arm.
7.1 Bayesian Linear Bandit
import numpy as np
from typing import List, Optional
class BayesianLinearArm:
"""
One arm of a contextual bandit using Bayesian linear regression.
Models expected reward as: E[r | x] = x^T * w
Uses a Gaussian prior over weights, updated via conjugate posterior.
"""
def __init__(self, n_features: int, prior_variance: float = 1.0,
noise_variance: float = 0.1):
self.n_features = n_features
self.noise_var = noise_variance
# Gaussian prior: w ~ N(0, prior_variance * I)
self.mu = np.zeros(n_features) # posterior mean
self.Lambda = np.eye(n_features) / prior_variance # precision matrix (Λ = Σ^{-1})
self._X_buf: List[np.ndarray] = [] # context vectors seen
self._r_buf: List[float] = [] # rewards observed
def sample_weights(self) -> np.ndarray:
"""Sample weight vector from posterior: w ~ N(mu, Lambda^{-1})."""
cov = np.linalg.inv(self.Lambda)
return np.random.multivariate_normal(self.mu, cov)
def expected_reward(self, context: np.ndarray) -> float:
"""Posterior mean prediction for this context."""
return float(context @ self.mu)
def thompson_sample(self, context: np.ndarray) -> float:
"""Draw one sample from the posterior predictive distribution."""
w_sample = self.sample_weights()
return float(context @ w_sample)
def update(self, context: np.ndarray, reward: float) -> None:
"""
Online Bayesian update using Sherman-Morrison rank-1 update.
Equivalent to full posterior recomputation but O(d^2) instead of O(d^3).
"""
self._X_buf.append(context)
self._r_buf.append(reward)
# Precision update: Λ_new = Λ_old + (1/noise_var) * x * x^T
self.Lambda += np.outer(context, context) / self.noise_var
# Mean update: mu_new = Λ_new^{-1} * (Λ_old * mu_old + reward/noise_var * x)
Sigma = np.linalg.inv(self.Lambda)
self.mu = Sigma @ (self.Lambda @ self.mu + reward * context / self.noise_var)
class ContextualThompsonSampling:
"""
Contextual bandit with one Bayesian linear arm per action.
"""
def __init__(self, arm_names: List[str], n_features: int,
prior_variance: float = 1.0, noise_variance: float = 0.1):
self.arm_names = arm_names
self.arms = {
name: BayesianLinearArm(n_features, prior_variance, noise_variance)
for name in arm_names
}
def select_arm(self, context: np.ndarray) -> str:
"""Thompson sample from each arm, return arm with highest sample."""
samples = {
name: arm.thompson_sample(context)
for name, arm in self.arms.items()
}
return max(samples, key=samples.get)
def update(self, arm_name: str, context: np.ndarray,
reward: float) -> None:
"""Update the selected arm's posterior with the observed reward."""
self.arms[arm_name].update(context, reward)
Part 8 — Tutorial: Credit Card Offer Personalization
Let’s build a complete personalization system for a bank’s credit card offer engine.
8.1 Feature Engineering
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Customer features we have at decision time (from CRM + behavioral data)
FEATURE_NAMES = [
"age_normalized", # 0-1 scale, 18=0, 80=1
"income_log_normalized", # log(income), normalized
"credit_score_normalized", # FICO 300-850, normalized
"revolving_balance_ratio", # revolving_balance / credit_limit
"travel_spend_pct", # % of spend on airlines/hotels
"grocery_spend_pct", # % of spend at grocery stores
"months_as_customer", # capped at 120, normalized
"has_mortgage", # binary
"has_business", # binary: any business entity flag
"recent_app_count", # credit applications last 6 months
"const", # intercept term
]
N_FEATURES = len(FEATURE_NAMES)
OFFERS = [
"cashback_grocery_2pct",
"travel_miles_3x",
"balance_transfer_0apr",
"business_expense_card",
"premium_rewards_500bonus",
]
def build_customer_context(customer: dict) -> np.ndarray:
"""
Convert raw customer data into a normalized feature vector.
In production this would call your feature store.
"""
ctx = np.array([
min((customer["age"] - 18) / 62, 1.0),
np.log(max(customer["annual_income"], 1)) / np.log(500_000),
(customer["credit_score"] - 300) / 550,
min(customer["revolving_balance"] / max(customer["credit_limit"], 1), 1.0),
min(customer["travel_spend_pct"], 1.0),
min(customer["grocery_spend_pct"], 1.0),
min(customer["months_as_customer"] / 120, 1.0),
float(customer.get("has_mortgage", 0)),
float(customer.get("has_business", 0)),
min(customer.get("recent_app_count", 0) / 5, 1.0),
1.0, # intercept
])
return ctx
# ── Synthetic customer profiles ───────────────────────────────────────────────
customer_profiles = {
"frequent_traveller": {
"age": 38, "annual_income": 180_000, "credit_score": 790,
"revolving_balance": 2_000, "credit_limit": 25_000,
"travel_spend_pct": 0.40, "grocery_spend_pct": 0.10,
"months_as_customer": 48, "has_mortgage": 1, "has_business": 0,
"recent_app_count": 1,
},
"grocery_shopper": {
"age": 34, "annual_income": 65_000, "credit_score": 720,
"revolving_balance": 800, "credit_limit": 8_000,
"travel_spend_pct": 0.05, "grocery_spend_pct": 0.45,
"months_as_customer": 24, "has_mortgage": 0, "has_business": 0,
"recent_app_count": 0,
},
"high_revolving_balance": {
"age": 42, "annual_income": 90_000, "credit_score": 650,
"revolving_balance": 14_000, "credit_limit": 16_000,
"travel_spend_pct": 0.10, "grocery_spend_pct": 0.20,
"months_as_customer": 72, "has_mortgage": 1, "has_business": 0,
"recent_app_count": 2,
},
"small_business_owner": {
"age": 45, "annual_income": 250_000, "credit_score": 810,
"revolving_balance": 5_000, "credit_limit": 40_000,
"travel_spend_pct": 0.20, "grocery_spend_pct": 0.08,
"months_as_customer": 96, "has_mortgage": 1, "has_business": 1,
"recent_app_count": 0,
},
}
8.2 Simulation with True Preference Model
def true_conversion_rate(offer: str, context: np.ndarray) -> float:
"""
Ground-truth conversion model (hidden from the bandit).
In a real system this would be observed outcomes, not a function.
"""
travel_pct = context[4]
grocery_pct = context[5]
rev_ratio = context[3]
has_biz = context[8]
income_norm = context[1]
base_rates = {
"cashback_grocery_2pct": 0.02 + 0.12 * grocery_pct,
"travel_miles_3x": 0.01 + 0.15 * travel_pct + 0.05 * income_norm,
"balance_transfer_0apr": 0.01 + 0.18 * rev_ratio,
"business_expense_card": 0.005 + 0.20 * has_biz,
"premium_rewards_500bonus": 0.01 + 0.08 * income_norm + 0.03 * travel_pct,
}
return min(base_rates[offer], 0.25)
# ── Run contextual Thompson Sampling simulation ───────────────────────────────
np.random.seed(42)
bandit = ContextualThompsonSampling(
arm_names=OFFERS,
n_features=N_FEATURES,
prior_variance=2.0,
noise_variance=0.05,
)
N_ROUNDS = 3000
profile_names = list(customer_profiles.keys())
selection_counts = {offer: 0 for offer in OFFERS}
correct_selections = 0
total_reward = 0.0
for t in range(N_ROUNDS):
# Draw a random customer profile each round
profile_name = np.random.choice(profile_names)
customer = customer_profiles[profile_name]
context = build_customer_context(customer)
# Select offer using Thompson Sampling
selected = bandit.select_arm(context)
selection_counts[selected] += 1
# Observe reward
rate = true_conversion_rate(selected, context)
reward = float(np.random.random() < rate)
total_reward += reward
bandit.update(selected, context, reward)
# Track whether we picked the best arm
best_offer = max(OFFERS, key=lambda o: true_conversion_rate(o, context))
if selected == best_offer:
correct_selections += 1
print(f"\nResults after {N_ROUNDS} rounds:")
print(f" Total conversions: {total_reward:.0f}")
print(f" Conversion rate: {total_reward/N_ROUNDS:.3f}")
print(f" Optimal arm selected: {correct_selections/N_ROUNDS:.1%} of the time")
print(f"\n Offer selection counts:")
for offer, count in sorted(selection_counts.items(), key=lambda x: -x[1]):
print(f" {offer:<35} {count:>5} ({100*count/N_ROUNDS:.1f}%)")
Expected output (approximate):
Results after 3000 rounds:
Total conversions: 178
Conversion rate: 0.059
Optimal arm selected: 71.3% of the time
Offer selection counts:
balance_transfer_0apr 1021 (34.0%)
travel_miles_3x 789 (26.3%)
cashback_grocery_2pct 621 (20.7%)
premium_rewards_500bonus 391 (13.0%)
business_expense_card 178 (5.9%)
The bandit has learned that balance_transfer_0apr is the most commonly optimal offer across the customer mix, while still allocating trials to every other offer in proportion to how often each is optimal.
Part 9 — Production Patterns: Redis-Backed Parameter Store
In production, your bandit state needs to be:
- Persistent: survive service restarts
- Concurrent-safe: multiple instances reading/writing simultaneously
- Fast: add < 5ms to API response time
Redis with atomic increment operations is the standard solution.
9.1 Redis Parameter Store
import redis
import json
import numpy as np
from typing import Optional
class RedisBanditStore:
"""
Persistent, concurrent-safe storage for Thompson Sampling state.
Uses Redis atomic HINCRBY for safe concurrent updates.
"""
def __init__(self, redis_url: str = "redis://localhost:6379",
namespace: str = "bandit"):
self.r = redis.from_url(redis_url, decode_responses=True)
self.ns = namespace
def _key(self, bandit_id: str, arm: str) -> str:
return f"{self.ns}:{bandit_id}:{arm}"
def get_arm_state(self, bandit_id: str, arm: str,
prior_alpha: float = 1.0,
prior_beta: float = 1.0) -> tuple:
"""Returns (alpha, beta) for the arm, initialising with priors if new."""
key = self._key(bandit_id, arm)
data = self.r.hgetall(key)
alpha = float(data.get("alpha", prior_alpha))
beta = float(data.get("beta", prior_beta))
return alpha, beta
def record_reward(self, bandit_id: str, arm: str,
reward: int) -> None:
"""Atomically update alpha (success) or beta (failure)."""
key = self._key(bandit_id, arm)
pipe = self.r.pipeline()
if reward == 1:
pipe.hincrbyfloat(key, "alpha", 1.0)
else:
pipe.hincrbyfloat(key, "beta", 1.0)
pipe.hincrby(key, "total_pulls", 1)
pipe.execute()
def select_arm(self, bandit_id: str, arms: list,
prior_alpha: float = 1.0,
prior_beta: float = 1.0) -> str:
"""Thompson sample across all arms and return the winner."""
best_arm = None
best_sample = -1.0
for arm in arms:
alpha, beta = self.get_arm_state(bandit_id, arm,
prior_alpha, prior_beta)
sample = np.random.beta(alpha, beta)
if sample > best_sample:
best_sample = sample
best_arm = arm
return best_arm
def get_summary(self, bandit_id: str, arms: list) -> dict:
"""Return current posterior means for all arms — for monitoring."""
summary = {}
for arm in arms:
alpha, beta = self.get_arm_state(bandit_id, arm)
total_pulls = int(self.r.hget(self._key(bandit_id, arm),
"total_pulls") or 0)
summary[arm] = {
"alpha": alpha,
"beta": beta,
"mean": alpha / (alpha + beta),
"total_pulls": total_pulls,
}
return summary
# ── FastAPI service example ────────────────────────────────────────────────────
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
store = RedisBanditStore(namespace="credit_card_offers")
OFFERS = [
"cashback_grocery_2pct", "travel_miles_3x",
"balance_transfer_0apr", "business_expense_card",
"premium_rewards_500bonus",
]
class RewardEvent(BaseModel):
customer_id: str
arm: str
reward: int # 1 = applied, 0 = ignored
@app.get("/recommend/{customer_id}")
async def recommend(customer_id: str):
"""Select the best offer for this customer via Thompson Sampling."""
# Use customer_id as bandit_id for per-customer models,
# or a segment_id for segment-level bandits
selected = store.select_arm(bandit_id="global", arms=OFFERS)
return {"customer_id": customer_id, "offer": selected}
@app.post("/reward")
async def record_reward(event: RewardEvent):
"""Record the customer's response to the offered card."""
store.record_reward(
bandit_id="global",
arm=event.arm,
reward=event.reward,
)
return {"status": "updated"}
@app.get("/stats")
async def stats():
"""Monitoring endpoint: current posterior means."""
return store.get_summary(bandit_id="global", arms=OFFERS)
Part 10 — When to Use Thompson Sampling vs. Alternatives
| Strategy | Best When | Weakness |
|---|---|---|
| Thompson Sampling | Bayesian priors available; non-stationary environments; production personalisation | Harder to explain to stakeholders than A/B test |
| UCB1 | You want theoretical regret guarantees; deterministic selection | Assumes stationary rewards; overexplores early |
| Epsilon-Greedy | Very simple baseline; easy to explain; warm-start from A/B results | Fixed ε means constant exploration even when you know the winner |
| Softmax / Boltzmann | You want smooth probability allocation across arms | Temperature hyperparameter sensitive |
| LinUCB | Contextual setting; you prefer confidence-bound approach | Requires careful ridge parameter tuning |
| A/B Test | You need a clean statistical test for legal/regulatory sign-off | Cannot personalise; wastes traffic during test |
| Multi-objective bandit | Multiple reward signals (clicks AND revenue AND churn risk) | Complex; multi-dimensional Pareto front is hard to communicate |
Thompson Sampling is the right default when:
- Binary or near-binary rewards (click, apply, convert)
- You want automatic exploration scheduling (no hyperparameter)
- Non-stationary environment (offer performance changes seasonally)
- Bayesian uncertainty quantification matters downstream
Add contextual features when:
- You have customer-level features with predictive power
- The optimal offer genuinely varies by customer segment
- You have enough data volume to learn per-arm weights (typically 1,000+ pulls per arm)
Part 11 — Exercises
Exercise 1: Implement Sliding Window Thompson Sampling
Standard Thompson Sampling accumulates all history. For non-stationary environments (e.g., a travel card underperforms in January but dominates in summer), you want to weight recent observations more heavily.
Implement a SlidingWindowBetaBandit that only counts observations from the last window_size rounds. How does regret compare to the standard Beta bandit on a simulation where arm 0 is best in rounds 1-2500 and arm 1 is best in rounds 2501-5000?
class SlidingWindowBetaBandit:
def __init__(self, arm_names: list, window_size: int = 500):
# Your implementation here
# Hint: keep a deque of (arm, reward) tuples per arm
pass
Exercise 2: Bayesian A/B Testing with Early Stopping
A bank wants to test two loan pricing strategies but needs a statistical decision faster than a traditional A/B test. Implement Thompson Sampling-based Bayesian A/B testing that:
- Computes P(arm A > arm B) at each step using Monte Carlo samples
- Stops when P(winner > loser) > 0.99
- Reports the posterior mean of each arm at stopping time
Compare the number of samples needed to reach 99% confidence vs. a traditional frequentist t-test at α=0.05, β=0.80.
Exercise 3: Thompson Sampling with Delayed Feedback
In credit card applications, you often don’t observe the outcome (approved, activated, used) for days or weeks after the offer was shown. Modify the ThompsonSamplingBandit to queue delayed rewards and process them in batch.
Exercise 4: Regret Decomposition
Run the simulation from Part 6 and decompose cumulative regret into:
- Exploration regret: rounds where a non-optimal arm was pulled for exploration
- Exploitation regret: rounds where the estimated best arm was suboptimal (due to early mis-estimation)
Which component dominates at round 100? At round 5000?
Exercise 5: Feature Importance in Contextual Bandit
After running the contextual simulation from Part 8 for 3000 rounds, inspect the learned weight vectors bandit.arms[arm].mu for each arm. Which features have the highest magnitude weights for travel_miles_3x? Do they match the true_conversion_rate function?
Summary
| Concept | One-Line Definition |
|---|---|
| Explore-exploit | The tension between using current knowledge and gathering new knowledge |
| Regret | The opportunity cost of not always pulling the optimal arm |
| Beta distribution | A probability distribution over probabilities, parameterised by α (successes) and β (failures) |
| Conjugate prior | A prior that produces a posterior of the same family when updated with new data |
| Thompson Sampling | Select arms by sampling from each posterior and picking the highest sample |
| Contextual bandit | A bandit where the reward model is conditioned on a feature vector |
| Bayesian linear regression | Model for contextual bandits that maintains a posterior over weight vectors |
Thompson Sampling is one of those rare algorithms where the theoretical elegance and the production performance both hold up. The core loop is seven lines of Python. The performance beats most alternatives. And the Bayesian posterior gives you uncertainty quantification for free — which matters when a banker asks “how confident are you that this offer is actually better?”
Further Reading
- Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.
- Chapelle, O. & Li, L. (2011). An Empirical Evaluation of Thompson Sampling. NeurIPS.
- Russo, D. et al. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning.
- Agrawal, S. & Goyal, N. (2013). Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML.
- Vowpal Wabbit — open-source contextual bandit library used in production at Microsoft
Enterprise AI Architecture
Want more enterprise AI architecture breakdowns?
Subscribe to SuperML.