What’s the better method for A/B testing: Bayesian or Frequentist?
Some swear by the Frequentist approach, others say Bayesian is best, and still others say that it’s a waste of time to think too much about this at all. In order to help you make your own assessment, we’ll go over everything you need to know about these two techniques below.
In case you’re unfamiliar with the term, an A/B test is a user experience optimization technique that involves publishing two versions of something on your store (this could be a page, page template, or even an entire theme) to see which one performs best.
These experiments take the guesswork out of ecommerce, allowing you to objectively determine what it takes to reduce your bounce rate, increase your conversion rate, make more sales, and reach whatever other goals you may have for your store.
Now, on to the big Bayesian vs. Frequentist A/B testing debate.
Let’s start with the Frequentist method. The key to this approach is that it only uses the data collected in your current experiment to make conclusions. From start to finish, a Frequentist A/B test looks something like this: you form a hypothesis, split your traffic between the two versions, collect data until you reach a predetermined sample size, and then check whether the difference between the versions is statistically significant.
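To make that concrete, here’s a minimal sketch of a Frequentist comparison in Python: a two-proportion z-test on hypothetical conversion counts (the visitor and conversion numbers are invented purely for illustration).

```python
# A minimal sketch of a Frequentist A/B test: a two-proportion z-test on
# hypothetical conversion counts (all numbers here are made up for illustration).
from math import sqrt, erfc

# Observed data from the experiment only; no prior knowledge is used.
visitors_a, conversions_a = 5_000, 250   # control
visitors_b, conversions_b = 5_000, 300   # variant

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis (no difference between A and B).
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z = (rate_b - rate_a) / standard_error
# Two-sided p-value from the normal approximation.
p_value = erfc(abs(z) / sqrt(2))

print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.4f}")
print("Significant at p < 0.05" if p_value < 0.05 else "Not significant")
```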
Bayesian tests follow a similar process — except, instead of starting from scratch, you incorporate past knowledge into your experiment. These previous insights are known in statistics as priors. The point of running a Bayesian test is to update your priors with new information, which may end up reinforcing your priors or shifting them one way or the other.
The advantage of Bayesian testing is that it’s more flexible, allowing you to account for context outside of your experiment. But this flexibility can also be a curse. Priors may be biased, based on hunches or gut feelings rather than hard facts (this is almost necessarily the case if something is being tested for the first time).
For example, in an A/B test, a Bayesian might start with the prior belief that a 10% improvement in conversion rate would be reasonable while a 100% increase would be extremely unlikely — but is that really the case? Maybe the changes you’re making really could double performance, even if they seem small on the surface. If your priors are inaccurate in a Bayesian test, it can throw off the whole experiment.
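For comparison, here’s a sketch of the same kind of test done the Bayesian way, using a Beta-Binomial model. The Beta(5, 95) prior (roughly “conversion rates are usually around 5%”) and the conversion counts are hypothetical choices for illustration, not recommendations.

```python
# A minimal Bayesian sketch of the same comparison, using a Beta-Binomial model.
# The Beta(5, 95) prior encodes a belief that ~5% conversion is typical; the prior
# parameters and the counts below are hypothetical.
import random

prior_alpha, prior_beta = 5, 95          # prior belief: roughly 5% conversion rate

visitors_a, conversions_a = 5_000, 250   # control
visitors_b, conversions_b = 5_000, 300   # variant

# Conjugate update: posterior = Beta(prior_alpha + conversions, prior_beta + non-conversions).
post_a = (prior_alpha + conversions_a, prior_beta + visitors_a - conversions_a)
post_b = (prior_alpha + conversions_b, prior_beta + visitors_b - conversions_b)

# Monte Carlo estimate of the probability that B's true rate beats A's.
random.seed(42)
samples = 100_000
b_wins = sum(
    random.betavariate(*post_b) > random.betavariate(*post_a)
    for _ in range(samples)
)
print(f"P(B beats A) = {b_wins / samples:.1%}")
```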
Here is an overview of the key concepts and differences between the two methods (a short code sketch contrasting the two kinds of intervals follows the table):
Feature / Concept | Frequentist | Bayesian |
---|---|---|
Definition of Probability | Long-run frequency of events in repeated trials | Degree of belief (subjective probability) |
Use of Prior Knowledge | ❌ Not incorporated | ✅ Explicitly incorporated via priors |
Output Type | Point estimate (e.g., “B is 3% better than A”) + p-value | Probability distribution (e.g., “There’s an 85% chance B > A”) |
Interpretation of Results | Binary decision (significant or not) | Probability of different outcomes (e.g., improvement > X%) |
Main Metric | p-value (usually tested at p < 0.05) | Posterior probability, credible intervals |
Confidence vs. Credibility | Confidence interval (CI): range likely to contain true value | Credible interval: range where true value lies with X% certainty |
Sample Size Requirements | Larger samples typically required to achieve statistical power | Often handles small samples better (if good priors exist) |
Stopping Rules | Fixed sample size required before evaluating results | Flexible stopping — can check results as data accumulates |
False Positive Control | Controlled via alpha (e.g., 5% significance threshold) | Controlled via prior and model design |
Assumption Robustness | Assumes random sampling, independence, often normality | Can incorporate uncertainty, robustness depends on model/prior |
Computation | Simpler math; easy closed-form solutions | May require MCMC, simulation, or numerical integration (complex) |
Common Statistical Methods | T-tests, Z-tests, Chi-squared, ANOVA | Bayesian inference engines, e.g., PyMC, Stan, or adaptive testing tools |
Tool Examples | Adobe Target, Optimizely (default), Google Optimize (legacy) | VWO (Bayesian), Google Experiments (Bayesian Engine in GA4), Convertize |
Best For | ✅ Simple, repeatable, large-sample experiments | ✅ Data-scarce environments, early testing, decision making under uncertainty |
Worst For | ❌ Small datasets, early testing | ❌ Cases where prior is unknowable or misleading |
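The table above distinguishes confidence intervals from credible intervals, and the difference is easiest to see on the same data. Here’s a minimal sketch that computes both for hypothetical conversion counts; the flat Beta(1, 1) prior is an arbitrary choice made for illustration.

```python
# A sketch contrasting the two interval types: a Frequentist 95% confidence interval
# (normal approximation) versus a Bayesian 95% credible interval (quantiles of a Beta
# posterior, estimated here by sampling). All numbers are hypothetical.
import random
from math import sqrt

visitors, conversions = 2_000, 110
rate = conversions / visitors

# Frequentist: 95% confidence interval for the conversion rate (Wald interval).
margin = 1.96 * sqrt(rate * (1 - rate) / visitors)
print(f"95% confidence interval: [{rate - margin:.3f}, {rate + margin:.3f}]")

# Bayesian: 95% credible interval from a Beta posterior with a flat Beta(1, 1) prior.
random.seed(0)
posterior_samples = sorted(
    random.betavariate(1 + conversions, 1 + visitors - conversions)
    for _ in range(100_000)
)
lo = posterior_samples[int(0.025 * len(posterior_samples))]
hi = posterior_samples[int(0.975 * len(posterior_samples))]
print(f"95% credible interval:   [{lo:.3f}, {hi:.3f}]")
```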
Another key difference is how the very concept of probability is defined by each of these methods.
For Frequentists, probability is the long-run frequency of an event (hence the name). Most Frequentist tests ultimately produce what is known as a point estimate. Frequentists hold that there is an objectively “true” number for whatever they’re testing in an experiment, and the point estimate is a single, fixed value that serves as the best guess of that number given the data in the sample.
Bayesians believe probability is a measure of strength of belief. In other words, it’s somewhat subjective. The results of a Bayesian test are typically expressed as a distribution of possible outcomes (like in a bell curve) rather than a fixed number. As for the name of this type of experiment, it’s derived from Thomas Bayes, a mathematician and theologian who came up with the theorem behind this idea back in the 18th century.
In addition to A/B testing, both Bayesian and Frequentist methods have many other real-world uses. Indeed, these concepts have contributed to scientific breakthroughs, helped build new products, and much more.
For example, Bayesian methods have been used to develop spam filters since Microsoft started doing so in the 1990s.
The priors in this scenario are the words and phrases that the email provider believes may indicate that a given message is spam — stuff like “save”, “sale”, “act now”, etc. These priors will be updated as the user starts to receive mail and the provider is able to analyze what kinds of words are used in their legitimate messages. This often makes the filter less sensitive, resulting in fewer false positives (messages that end up in the spam folder but are actually legitimate) and more accurate spam filtering overall.
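Here’s a toy sketch of the updating at work in such a filter: Bayes’ rule applied to a single trigger word. All of the probabilities below are invented purely for illustration.

```python
# A toy sketch of the Bayesian updating behind a spam filter: Bayes' rule applied to
# a single trigger word. The probabilities are hypothetical.
def posterior_spam(p_word_given_spam, p_word_given_ham, prior_spam):
    """P(spam | word) via Bayes' rule."""
    numerator = p_word_given_spam * prior_spam
    evidence = numerator + p_word_given_ham * (1 - prior_spam)
    return numerator / evidence

# Prior belief: half of incoming mail is spam, and "act now" is a strong spam signal.
print(posterior_spam(p_word_given_spam=0.60, p_word_given_ham=0.05, prior_spam=0.50))

# After watching this user's mail, the filter learns that "act now" also shows up in
# their legitimate messages, so the same word now moves the needle far less.
print(posterior_spam(p_word_given_spam=0.60, p_word_given_ham=0.40, prior_spam=0.50))
```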
Also, there’s an excellent New York Times piece on the subject of Bayesian statistics, “The Odds, Continually Updated”, that recounts a rather heroic example of the Bayesian method in action.
In July 2013, the fisherman John Aldridge fell out of his boat and was lost at sea, able to stay afloat only by using his boots as improvised pontoons. The Coast Guard deployed Bayesian methods in their search and rescue operation, first using their priors of where survivors might be found in the area and then updating these priors with new information as it came in, such as how strong the currents were and where rescue helicopters had already checked that day. Within 12 hours, they found Aldridge — sunburnt, hypothermic, but alive.
Not every problem lends itself well to the Bayesian approach. Sometimes, coming up with priors is basically just guesswork. In that same New York Times piece, the physicist Kyle Cranmer described this process of trying to assign numbers to subjective beliefs as “like fingernails on a chalkboard”. Cranmer helped develop the Frequentist technique that was used to discover the Higgs boson particle at the Large Hadron Collider, perhaps the most significant scientific advancement of the last 50 years.
Many researchers in science and medicine prefer the Frequentist method, which is known to work particularly well with large datasets and repeatable experiments.
It’s worth noting that the Bayesian method, due to its use of priors, tends to be better at dealing with the volatility of small sample sizes.
To demonstrate, let’s imagine that you want to test for the probability of a coin landing on heads.
A Bayesian approach would start with the prior that the odds of this event are known to be 50/50. The coin would have to be weighted or otherwise tampered with to throw this assumption off. If there is no such tampering, then the Bayesian test results would be accurate right from the jump, as depicted in this helpful chart from LYST’s engineering blog:
But a Frequentist experiment would need to figure out the probability of a coin landing on heads all on its own. A coin landing on heads four out of the first five times it’s flipped wouldn’t be too crazy — at that point in a Frequentist test, though, it would appear as if a coin is much more likely to land on heads in general. You may need up to a hundred or so flips for the true nature of reality to come into focus:
As you can see in these charts, this difference between the two approaches all but disappears over the duration of the experiments. It’s really only a concern when you’re dealing with very small sample sizes.
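If you’d like to reproduce the effect those charts describe, here’s a quick simulation sketch: a fair coin, the raw Frequentist proportion of heads, and a Bayesian estimate whose Beta(50, 50) prior encodes a strong belief that the coin is close to fair. The prior strength is an illustrative choice, not a recommendation.

```python
# Simulate flipping a fair coin and compare two running estimates of P(heads):
# the Frequentist estimate (raw proportion of heads so far) and the Bayesian
# posterior mean under a Beta(50, 50) prior centered on a fair coin.
import random

random.seed(7)
heads = 0
checkpoints = {5, 20, 100}

for flip in range(1, 101):
    heads += random.random() < 0.5               # simulate a fair coin flip
    if flip in checkpoints:
        frequentist = heads / flip
        bayesian = (50 + heads) / (100 + flip)   # posterior mean of Beta(50 + h, 50 + t)
        print(f"after {flip:3d} flips: frequentist {frequentist:.2f}, bayesian {bayesian:.2f}")
```

With only a handful of flips, the raw proportion can swing wildly while the Bayesian estimate stays pinned near 0.5; by around a hundred flips the two have largely converged, which is the same pattern described above.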
Each A/B testing tool addresses the Bayesian vs. Frequentist question in its own way.
For example, VWO takes a Bayesian approach, while Adobe Target uses the Frequentist method. And then there are tools, such as Convertize, that use principles from both the Bayesian and Frequentist schools of thought in their statistical engines.
As for Shogun, we use a Chi-Squared algorithm for our A/B tests, which involves creating a table of observed and expected frequencies and then comparing these two rows of data to see whether the null hypothesis can be rejected (all of this happens on the back end, of course — you won’t need to deal with any of the math yourself).
The Chi-Squared method is especially well-suited for tests that involve categorical outcomes (data that can be sorted into distinct groups), such as the conversions or sales that tend to be the primary metric in ecommerce experiments. Another advantage of this approach is that it doesn’t require you to make assumptions about the shape of the underlying distribution of your data, which reduces the risk of results being skewed by a mis-specified model.
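As an illustration of how this kind of test works (a generic sketch, not Shogun’s actual implementation), here’s a Chi-Squared comparison of hypothetical conversion counts using SciPy.

```python
# A sketch of a Chi-Squared A/B comparison: observed conversion counts for A and B
# go into a 2x2 contingency table, and the test asks whether the difference could
# plausibly be due to chance. The counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [
    [250, 4_750],   # A: conversions, non-conversions
    [300, 4_700],   # B: conversions, non-conversions
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("Reject the null hypothesis" if p_value < 0.05 else "Cannot reject the null hypothesis")
```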
While the statistics behind Shogun A/B Testing may be complicated, actually using this tool is remarkably easy. After you launch an experiment, Shogun will tell you exactly when you’ve collected enough data to be at least 95% confident in the results, so there’s no ambiguity about whether you should keep collecting data or go ahead and declare a winner.
If you’d like to learn more about this subject, check out our guide to the most important A/B testing best practices.