Winning with Multi-Armed Bandits: Smarter Experimentation in Databricks
Posted on August 18th, 2025
Running experiments often feels like gambling. Should you put more volume behind variant A, or give variant B another chance? Traditional A/B testing splits traffic and waits - but what if you could continuously adapt, maximizing gains as you learn? Enter Multi-Armed Bandits: an elegant blend of probability, statistics, and decision-making that turns experiments into dynamic optimization engines.
Just like choosing the right slot machine at a casino, Multi-Armed Bandits help you decide which option deserves your next coin - except here, the coin is traffic, impressions, or user attention. Let’s explore how they work, why they beat static testing, and how we’ve applied them in Databricks.
The Problem with Static A/B Testing
Traditional A/B tests split traffic between variants and wait until results reach statistical significance. The split doesn’t need to be 50/50 - you can use 90/10 or other ratios - but the smaller the allocation to a variant, the longer it takes to gather enough evidence.
At our company, we rely heavily on A/B tests for big, structural changes in user experience. Think of a new onboarding flow, a redesigned payment page, or a navigation change in the app. These are decisions that, once deployed, become part of the permanent experience. In those cases, we want the rigor and certainty of a traditional A/B test.
But when it comes to fast-moving experimentation - like testing email subject lines, CRM templates, or new value propositions - relying only on static A/B testing would slow us down dramatically. That’s because traditional A/B testing has some built-in drawbacks:
- Lost opportunity: weaker variants keep receiving traffic until the test ends.
- Slow learning: reaching significance can take days or weeks, especially with skewed splits.
- No adaptivity: once the split is set, the system doesn’t adjust based on performance.
This is where Multi-Armed Bandits shine: they continuously balance exploration (testing uncertain variants) with exploitation (backing strong performers), adapting allocations in real time as results come in.
The Multi-Armed Bandit Intuition
Picture yourself walking into a casino with rows of slot machines. Each machine (or "arm") has an unknown payout probability - some generous, some stingy. You have a limited number of coins in your pocket, and every coin you spend is a chance to learn which machine is the most profitable.
Here’s the dilemma: If you only play one machine from the start, you might miss out on a better one just a few seats away. If you spread your coins evenly across all machines, you’ll waste a lot of plays on the poor performers.
The optimal strategy lies somewhere in between: try enough machines to explore the landscape, but also double down and exploit the ones that look promising.
This is the essence of the Multi-Armed Bandit problem. Each "trial" you run - whether it’s pulling a slot lever in the casino or sending a subject line in an email campaign - gives you more information about the true payout. Over time, the bandit strategy adjusts, directing more and more of your limited coins toward the machines that deliver the best returns.
In marketing or product terms, each "arm" is a variant - an email subject line, a landing page design, a recommendation strategy. Each "coin" is a unit of user attention - a send, an impression, or a visit. The framework continuously reallocates traffic, rewarding strong performers without completely ignoring uncertain ones, so you discover winners faster while minimizing regret.
A Bayesian Approach with Beta Distributions
When experimenting, each user interaction is basically a trial with two outcomes: success or failure (clicked or not, converted or not). This naturally fits the binomial distribution, which describes the probability of observing a certain number of successes across a fixed number of trials.
For example:
- If we send 1,000 emails and get 30 clicks, we can think of this as 1,000 Bernoulli trials (each user either clicks = 1 or doesn’t = 0).
- The binomial distribution then tells us the probability of seeing exactly 30 successes (or 31, or 29) given some underlying "true" click-through rate (CTR) - as the short sketch below shows.
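As a quick sanity check, here is what that calculation looks like with SciPy - a minimal sketch where the 3% "true" CTR is just an assumed value:

```python
# Minimal sketch: the binomial probability of seeing k clicks out of 1,000 sends,
# assuming a hypothetical "true" CTR of 3%.
from scipy.stats import binom

true_ctr = 0.03   # assumed for illustration
sends = 1_000

for clicks in (29, 30, 31):
    # P(exactly `clicks` successes | n=sends, p=true_ctr)
    print(clicks, binom.pmf(clicks, sends, true_ctr))
```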
The challenge: we don’t know the true CTR in advance. This is where Bayesian statistics come in.
Enter the Beta Distribution
The Beta distribution is a natural partner to the binomial because it serves as its conjugate prior. In plain English: it lets us update our beliefs smoothly as new data comes in.
For each variant i, we model the CTR as:
$$CTR_i \sim \text{Beta}(\alpha, \beta),$$
where $\alpha$ represents the number of observed successes (clicks), $\beta$ represents the number of observed failures (non-clicks) - each on top of any pseudo-counts from the prior - and together they define a probability distribution over what the true CTR could be.
Intuitively:
- A variant with little data has a wide, uncertain curve - maybe it’s great, maybe it’s terrible.
- A variant with lots of data has a narrow, sharp curve - we’re much more confident about its true CTR.
Example
Let’s compare three variants, all with a ~3% CTR but different sample sizes:
Sends | Clicks | CTR | 95% Confidence Interval |
---|---|---|---|
1,000 | 30 | 2.955% | [2.11%, 4.25%] |
10,000 | 300 | 2.995% | [2.68%, 3.35%] |
100,000 | 3,000 | 2.999% | [2.90%, 3.11%] |
Notice how the mean CTR stays the same, but the confidence interval shrinks as data accumulates. This is the key idea: not all 3% CTRs are equal - the amount of evidence behind them changes how much we should trust them.
We can visualize this by plotting the Beta distributions for each case. With fewer samples, the curve is wide and uncertain. With more samples, it becomes sharp and concentrated around 3%.
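A minimal sketch of that visualization, using SciPy and Matplotlib and assuming a Beta(1,1) prior (the printed intervals are Bayesian credible intervals, so they may differ slightly from the confidence intervals in the table):

```python
# Sketch: posterior Beta curves for the three variants above (Beta(1,1) prior assumed).
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

variants = [(1_000, 30), (10_000, 300), (100_000, 3_000)]   # (sends, clicks)
x = np.linspace(0.015, 0.045, 500)

for sends, clicks in variants:
    a, b = 1 + clicks, 1 + (sends - clicks)        # posterior parameters
    lo, hi = beta.ppf([0.025, 0.975], a, b)        # central 95% credible interval
    plt.plot(x, beta.pdf(x, a, b), label=f"{sends:,} sends: [{lo:.2%}, {hi:.2%}]")

plt.xlabel("CTR")
plt.ylabel("Posterior density")
plt.legend()
plt.show()
```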
Thompson Sampling in Action
We now have, for each variant, a posterior distribution over its true CTR. Think of each Beta curve as our updated belief after observing successes and failures, i.e., updating the prior with $\alpha \leftarrow \alpha + \text{clicks}$ and $\beta \leftarrow \beta + (\text{sends} - \text{clicks})$.
Thompson Sampling converts that uncertainty into action:
- For each variant, draw one random sample from its posterior (its Beta curve).
- Select the variant with the highest sampled value for that "trial".
- Repeat this many times; the share of "wins" per variant estimates the probability it is the best.
Those win probabilities are natural traffic weights for your next batch. Variants with wider (more uncertain) posteriors still win some simulated worlds (exploration), while consistently strong variants win often and get more traffic (exploitation).
Example allocation from simulated win probabilities
Scenario: three variants at ~3% CTR but with different sample sizes
(1,000 sends / 30 clicks), (10,000 sends / 300 clicks), (100,000 sends / 3,000 clicks).
- Posterior parameters (assuming a Beta(1,1) prior): α = [31, 301, 3001], β = [971, 9701, 97001].
- Estimated probability each variant is best (via Thompson simulation): [0.491, 0.282, 0.228].
- Suggested allocation for the next batch: [49.1%, 28.2%, 22.8%].
Note that even with similar mean CTRs, the smallest sample has a wider posterior and sometimes draws high, so it can win many simulated worlds. That is exploration. As evidence grows, posteriors tighten and the consistently good variant starts dominating the allocation. That is exploitation.
In production, you can add guardrails like minimum allocation floors and max ramp rates while preserving the adaptive behavior.
Behind the scenes, the draws come from the updated Beta posteriors (random variates from Beta), and win counts across simulations approximate each variant’s "best arm" probability.
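For reference, that simulation fits in a few lines of NumPy - a sketch using the posterior parameters above; exact win probabilities will vary a bit from run to run:

```python
# Sketch: estimate each variant's probability of being best via Thompson simulation.
import numpy as np

rng = np.random.default_rng()

alpha = np.array([31, 301, 3001])      # Beta(1,1) prior + clicks
beta = np.array([971, 9701, 97001])    # Beta(1,1) prior + non-clicks
n_sims = 100_000

# One sampled CTR per variant per simulated world: shape (n_sims, 3)
draws = rng.beta(alpha, beta, size=(n_sims, len(alpha)))

# How often each variant has the highest sampled CTR
wins = np.bincount(draws.argmax(axis=1), minlength=len(alpha))
print(wins / n_sims)   # should land near the [0.491, 0.282, 0.228] reported above
```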
Architecture: Multi-Armed Bandits in Databricks + Engage (AWS)
Engage is our internally built CRM tool for sending messages to users. It mainly uses AWS SES to send emails to customers, and the resulting events are tracked in real time via Kinesis and Databricks, piped into Delta tables. Users can configure A/B tests in the tool with N variants and publish them - the percentage of volume allocated to each variant is then updated using the framework above.
At a high level, we separate two loops: a real-time data loop that keeps CRM metrics fresh (≤ ~1 minute), and a decision loop that periodically updates variant weights using Bandits. The decision loop reads aggregated metrics from Delta (Gold), computes new weights via Thompson Sampling, and publishes them back to Engage, where users are deterministically assigned to variants.
Frame A - Real-time CRM Data Flow (≤ ~1 min latency)
- Engage triggers sends with experiment_id and variant_id.
- Emails go out via Amazon SES; tracking events (send, delivery, bounce, open, click) arrive via SNS, are piped into Amazon Kinesis, and are consumed by a Databricks streaming job.
- The streaming job parses and enriches events (user_id, experiment_id, variant_id, timestamps) and writes to tables in the Medallion Architecture.
- We maintain a Delta Gold table at the variant level: sends, opens, clicks, unsubscribes, and other metrics by experiment_id × variant_id (minute-level freshness).
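A simplified sketch of the ingestion step of that streaming job (the stream name, schema, checkpoint path, and table name are illustrative, and it assumes a Databricks context where `spark` and the Kinesis connector are available):

```python
# Sketch: Frame A ingestion - Kinesis events into a Bronze Delta table.
from pyspark.sql import functions as F, types as T

event_schema = T.StructType([
    T.StructField("user_id", T.StringType()),
    T.StructField("experiment_id", T.StringType()),
    T.StructField("variant_id", T.StringType()),
    T.StructField("event_type", T.StringType()),   # send, delivery, bounce, open, click
    T.StructField("event_ts", T.TimestampType()),
])

raw = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "engage-ses-events")   # hypothetical stream name
    .option("region", "us-east-1")
    .load()
)

# Parse the raw Kinesis payload into typed columns
events = (
    raw.select(F.from_json(F.col("data").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/engage_events")   # illustrative path
    .toTable("crm.bronze_engage_events"))
```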
Frame B - Bandit Allocation Loop (every X minutes)
- Scheduler invokes the Bandit job for each active experiment.
- The job reads the Gold table and updates posteriors (alpha, beta) per variant, using the configured conversion metric (click, open, or on-site conversion).
- The job runs Thompson Sampling to estimate the probability of being best for each variant and derives new weights.
- The job writes weights and metadata to a Delta allocations table and publishes the current weights to Engage’s AB Test Config (DynamoDB).
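Put together, the allocation job for one experiment boils down to something like the sketch below (table names, the DynamoDB schema, and the Beta(1,1) prior are assumptions for illustration; it again assumes a Databricks context where `spark` is available):

```python
# Sketch: Frame B - compute Thompson Sampling weights and publish them to Engage.
import boto3
import numpy as np

def compute_weights(experiment_id: str, n_sims: int = 100_000) -> dict:
    # Per-variant aggregates from the Gold table (experiment_id x variant_id)
    gold = (
        spark.table("crm.gold_experiment_metrics")        # hypothetical table name
        .where(f"experiment_id = '{experiment_id}'")
        .select("variant_id", "sends", "clicks")
        .toPandas()
    )

    # Posterior parameters with a Beta(1,1) prior
    alpha = 1 + gold["clicks"].to_numpy()
    beta = 1 + (gold["sends"] - gold["clicks"]).to_numpy()

    # Probability each variant is best, estimated by simulation
    rng = np.random.default_rng()
    draws = rng.beta(alpha, beta, size=(n_sims, len(alpha)))
    wins = np.bincount(draws.argmax(axis=1), minlength=len(alpha))
    return dict(zip(gold["variant_id"], wins / n_sims))

def publish_weights(experiment_id: str, weights: dict) -> None:
    # Push the new allocation to Engage's A/B test config store (DynamoDB);
    # the Delta allocations-table write is omitted for brevity.
    table = boto3.resource("dynamodb").Table("engage-ab-test-config")   # hypothetical
    table.update_item(
        Key={"experiment_id": experiment_id},
        UpdateExpression="SET weights = :w",
        ExpressionAttributeValues={":w": {k: str(v) for k, v in weights.items()}},
    )
```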
Trade-offs, Guardrails, and When to Use What
Bandits aren’t magic; they’re a policy choice. They excel when speed of learning and continuous adaptation matter - subject lines, templates, fast-iterating offers. They’re less compelling for high-stakes, slow-moving decisions (e.g., a checkout redesign), where a classical A/B provides the defensibility you want.
Two practical notes keep bandits honest. First, priors: if you know a channel’s historical CTR, use a data-driven prior to avoid over-exploring on cold starts. Second, guardrails: enforce a small minimum allocation per arm, cap ramp rates between cycles, and define stop rules (e.g., auto-graduate a winner after a sustained high probability of being the best). If the scheduler is late, fall back to last known weights or a conservative control-heavy split. These basics prevent the “unlucky early draw” problem and keep production stable.
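As a rough illustration, that guardrail layer can be very small - a sketch with made-up floor and ramp values:

```python
# Sketch: clamp Thompson weights with a per-arm floor and a max ramp per cycle.
import numpy as np

def apply_guardrails(new_w, prev_w, floor=0.05, max_step=0.20):
    new_w, prev_w = np.asarray(new_w, float), np.asarray(prev_w, float)
    # Cap how far each weight can move in a single update cycle
    w = np.clip(new_w, prev_w - max_step, prev_w + max_step)
    # Enforce a minimum allocation per arm, then renormalize to sum to 1
    w = np.maximum(w, floor)
    return w / w.sum()

print(apply_guardrails([0.80, 0.15, 0.05], prev_w=[0.50, 0.30, 0.20]))
# -> roughly [0.78, 0.17, 0.06]: the jump to 80% is tempered and no arm starves
```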
What We Learned (and What’s Next)
After adopting this loop, two things became obvious. First, time-to-winner dropped - not because we got lucky, but because the policy stopped wasting volume on poor performers. Second, confidence became visible: plotting posteriors and intervals makes it clear when a “3% CTR” is backed by 1k versus 100k sends, so product and CRM can make informed decisions.
A natural next iteration is a composite reward, not a single metric. CRM needs to balance reputation with providers, on-site conversion, and disengagement. A scalar reward like
$$ \text{reward} = w_\text{open}\cdot \text{open} + w_\text{click}\cdot \text{click} + w_\text{conv}\cdot \text{conversion} \; - \; w_\text{unsub}\cdot \text{unsubscribe} $$
lets you capture those trade-offs explicitly. Optimizing only CTR can surface variants that also spike unsubscribes - we’ve seen that pattern, and we don’t want to promote it. In practice, the weights should be normalized and carefully managed. We’ll dive into composite rewards and constrained bandits in a future post.
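For illustration, the composite reward is just a weighted sum per variant - the weights below are hypothetical, not values we actually use:

```python
# Sketch: a composite per-send reward blending engagement signals (hypothetical weights).
def composite_reward(opens, clicks, conversions, unsubscribes, sends,
                     w_open=0.2, w_click=0.5, w_conv=1.0, w_unsub=2.0):
    return (w_open * opens + w_click * clicks + w_conv * conversions
            - w_unsub * unsubscribes) / sends

# Same CTR, very different unsubscribe behavior
print(composite_reward(opens=250, clicks=30, conversions=5, unsubscribes=1, sends=1_000))   # 0.068
print(composite_reward(opens=250, clicks=30, conversions=5, unsubscribes=15, sends=1_000))  # 0.040
```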
The next frontier is Contextual Bandits - letting features (channel, device, segment, time of day) influence the policy so the “best arm” depends on context. It’s the same Bayesian backbone with a smarter routing layer. Start simple, instrument well, and graduate when your data shows that the extra complexity pays for itself.
Final Thoughts
If anomaly detection is about spotting trouble fast, Multi-Armed Bandits are about finding upside faster. They turn experimentation from a fixed split into a living system that learns where your attention pays off. In practice, the math is modest, the operational footprint is small, and the payoff - fewer wasted sends, quicker convergence, clearer confidence - compounds over time.
Simple tools, thoughtfully combined, still win.