July 29, 2025

The Big Differences Between Bayesian vs. Frequentist A/B Testing


Learn the difference between Frequentist and Bayesian A/B testing and how it relates to the testing tools on the market.


Adam Ritchie
Ecommerce Contributor


 What’s the better method for A/B testing: Bayesian or Frequentist?

Some swear by the Frequentist approach, others say Bayesian is best, and still others say that it’s a waste of time to think too much about this at all. In order to help you make your own assessment, we’ll go over everything you need to know about these two techniques below.

A Powerful A/B Testing App for Shopify: Shogun A/B Testing offers all the features you need to conduct your own experiments and gain ground-breaking insights about your storefront. Get started now.

Bayesian vs. Frequentist A/B Testing

In case you’re unfamiliar with the term, an A/B test is a user experience optimization technique that involves publishing two versions of something on your store (this could be a page, page template, or even an entire theme) to see which one performs best. 

These experiments take the guesswork out of ecommerce, allowing you to objectively determine what it takes to reduce your bounce rate, increase your conversion rate, make more sales, and reach whatever other goals that you may have for your store.

Now, on to the big Bayesian vs. Frequentist A/B testing debate. 

Let’s start with the Frequentist method. The key to this approach is that it only uses the data collected in your current experiment to make conclusions. From start to finish, conducting a Frequentist A/B test would look something like this:

  1. Define a null hypothesis and an alternative hypothesis: The null hypothesis is that there is no difference in the targeted event rate between Version A and Version B in your A/B test. In the alternative hypothesis, there is some difference. For example, let’s say you think changing the featured image on a product page may increase its conversion rate. The null hypothesis would be that by the end of your test both the original version of the page and the new variant will still have about the same conversion rate, while the alternative hypothesis is that there will be some improvement in the version with the updated image. 
  2. Publish your test: Typically, Version A of an A/B test will be the original version of the page/template/theme that you’re testing — this is the control in the experiment. Version B is the test variant with the changes you want to try out. Once you’re done designing the test variant, the next step is to publish it and then split up the traffic that would ordinarily just go to the original version, randomly assigning some visitors to Version A and others to Version B.
  3. Collect data: Before concluding your experiment, you should wait until each variant in your test has received enough visitors to produce statistically significant results. Once you’ve reached this point, it becomes highly unlikely that a performance gap as large as the one you observe between Version A and Version B is due to random chance alone. 
  4. Determine if the results are significant: Specifically, most Frequentist A/B tests declare the results significant when the probability of seeing a difference at least as large as the one observed purely by chance, assuming the null hypothesis is true, is less than 5% (in statistics, this probability is known as the p-value). In practical terms, that means you can be at least 95% confident that the behavior displayed in your sample (i.e., the visitors who happened to participate in your A/B test) reflects the entire population that you’re trying to test for (i.e., the thousands or millions of people who visit your store). A worked sketch of this significance check follows this list. 
  5. Act on your results: If you have statistically significant results showing that Version B did indeed improve whatever metric you’re measuring, then you can reject the null hypothesis and confidently implement the changes you were testing. Conversely, if the test shows no significant improvement, or a significant negative effect, then you should stick with the original version. 
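To make the significance check in step 4 concrete, here is a minimal sketch in Python. The visitor and conversion counts are hypothetical, and the two-proportion z-test shown is just one common way to run this calculation, not necessarily what your testing tool uses under the hood:

```python
# A minimal sketch of the Frequentist workflow above, using hypothetical
# visitor and conversion counts and only the standard library.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for H0: conversion rate of A == rate of B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical results: 10,000 visitors per variant
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=555, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
if p < 0.05:
    print("Statistically significant: reject the null hypothesis")
else:
    print("Not significant: keep collecting data or keep the original")
```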

Bayesian tests follow a similar process — except, instead of starting from scratch, you incorporate past knowledge into your experiment. These previous insights are known in statistics as priors. The point of running a Bayesian test is to update your priors with new information, which may end up reinforcing your priors or shifting them one way or the other.

The advantage of Bayesian testing is that it’s more flexible, allowing you to account for context outside of your experiment. But this flexibility can also be a curse. Priors may be biased, based on hunches or gut feelings rather than hard facts (this is almost necessarily the case if something is being tested for the first time). 

For example, in an A/B test, a Bayesian might start with the prior belief that a 10% improvement in conversion rate would be reasonable while a 100% increase would be extremely unlikely — but is that really the case? Maybe the changes you’re making really could double performance, even if they seem small on the surface. If your priors are inaccurate in a Bayesian test, it can throw off the whole experiment. 
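Here is a rough sketch of what that looks like in practice, using a simple Beta-Binomial model and the same hypothetical counts as the Frequentist sketch above. The Beta(5, 95) prior (roughly, "we believe the conversion rate is around 5%") is purely illustrative, not a recommendation:

```python
# A minimal sketch of a Bayesian A/B comparison with a Beta-Binomial model.
# The prior parameters and the conversion counts are hypothetical.
import random

def posterior_samples(conversions, visitors, prior_a=5, prior_b=95, n=100_000):
    """Draw samples from the Beta posterior over the true conversion rate."""
    a = prior_a + conversions                 # prior successes + observed successes
    b = prior_b + (visitors - conversions)    # prior failures + observed failures
    return [random.betavariate(a, b) for _ in range(n)]

random.seed(42)
samples_a = posterior_samples(conversions=480, visitors=10_000)
samples_b = posterior_samples(conversions=555, visitors=10_000)

# Posterior probability that B beats A, plus the expected relative lift
prob_b_better = sum(b > a for a, b in zip(samples_a, samples_b)) / len(samples_a)
avg_lift = sum(b / a - 1 for a, b in zip(samples_a, samples_b)) / len(samples_a)
print(f"P(B > A) = {prob_b_better:.1%}, expected lift = {avg_lift:.1%}")
```

Notice that the output is a probability ("how likely is B better, and by how much") rather than a yes/no significance verdict, which is exactly the difference summarized in the table below.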

Technical Comparison Table Overview

Here is an overview of the key concepts and differences between the two methods:

| Feature / Concept | Frequentist | Bayesian |
| --- | --- | --- |
| Definition of Probability | Long-run frequency of events in repeated trials | Degree of belief (subjective probability) |
| Use of Prior Knowledge | ❌ Not incorporated | ✅ Explicitly incorporated via priors |
| Output Type | Point estimate (e.g., “B is 3% better than A”) + p-value | Probability distribution (e.g., “There’s an 85% chance B > A”) |
| Interpretation of Results | Binary decision (significant or not) | Probability of different outcomes (e.g., improvement > X%) |
| Main Metric | p-value (usually tested at p < 0.05) | Posterior probability, credible intervals |
| Confidence vs. Credibility | Confidence interval (CI): range likely to contain the true value | Credible interval: range where the true value lies with X% certainty |
| Sample Size Requirements | Larger samples typically required to achieve statistical power | Often handles small samples better (if good priors exist) |
| Stopping Rules | Fixed sample size required before evaluating results | Flexible stopping; can check results as data accumulates |
| False Positive Control | Controlled via alpha (e.g., 5% significance threshold) | Controlled via prior and model design |
| Assumption Robustness | Assumes random sampling, independence, often normality | Can incorporate uncertainty; robustness depends on model/prior |
| Computation | Simpler math; easy closed-form solutions | May require MCMC, simulation, or numerical integration (complex) |
| Common Tools/Examples | T-tests, Z-tests, Chi-squared, ANOVA | Bayesian inference engines, e.g., PyMC, Stan, or adaptive testing tools |
| Tool Examples | Adobe Target, Optimizely (default), Google Optimize (legacy) | VWO (Bayesian), Google Experiments (Bayesian Engine in GA4), Convertize |
| Best For | ✅ Simple, repeatable, large-sample experiments | ✅ Data-scarce environments, early testing, decision making under uncertainty |
| Worst For | ❌ Small datasets, early testing | ❌ Cases where the prior is unknowable or misleading |

How Probability Is Defined by Each Method

Another key difference is how the very concept of probability is defined by each of these methods. 

For Frequentists, probability is the expected frequency of an event (hence the name). Most Frequentist tests ultimately produce what is known as a point estimate — Frequentists think there is an objectively “true” number for whatever they’re testing for in an experiment, and the point estimate is a single, fixed value that serves as the best guess of what this number is given the data in the sample. 

Bayesians believe probability is a measure of strength of belief. In other words, it’s somewhat subjective. The results of a Bayesian test are typically expressed as a distribution of possible outcomes (like in a bell curve) rather than a fixed number. As for the name of this type of experiment, it’s derived from Thomas Bayes, a mathematician and theologian who came up with the theorem behind this idea back in the 18th century.  
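To see the difference in output, here is a short sketch using a hypothetical variant with 555 conversions out of 10,000 visitors. The Frequentist side reports a single point estimate plus a confidence interval; the Bayesian side summarizes a whole posterior distribution with a credible interval (the flat Beta(1, 1) prior here is just an assumption for illustration):

```python
# Point estimate + confidence interval vs. posterior + credible interval,
# for one hypothetical variant with 555 conversions out of 10,000 visitors.
import random
from math import sqrt

conv, n = 555, 10_000
rate = conv / n

# Frequentist: point estimate +/- 1.96 standard errors (Wald interval)
se = sqrt(rate * (1 - rate) / n)
print(f"Point estimate {rate:.4f}, 95% CI ({rate - 1.96*se:.4f}, {rate + 1.96*se:.4f})")

# Bayesian: 95% credible interval from a Beta(1, 1) (flat-prior) posterior
random.seed(0)
samples = sorted(random.betavariate(1 + conv, 1 + n - conv) for _ in range(20_000))
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(f"Posterior mean {sum(samples)/len(samples):.4f}, "
      f"95% credible interval ({lo:.4f}, {hi:.4f})")
```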


Real World Examples of Each Application

In addition to A/B testing, both Bayesian and Frequentist methods have many other real-world uses. Indeed, they have helped produce scientific breakthroughs, build new products, and much more.

For example, Bayesian methods have been used to develop spam filters since Microsoft started doing so in the 1990s. 

The priors in this scenario are the words and phrases that the email provider believes may indicate that a given message is spam — stuff like “save”, “sale”, “act now”, etc. These priors are updated as the user starts to receive mail and the provider can analyze which words appear in their legitimate messages. Over time, this tuning reduces false positives (legitimate messages that end up in the spam folder) and makes spam filtering more accurate overall. 
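For illustration only, here is a toy word-counting filter in the spirit of naive Bayes. It is not how any real email provider's filter works, but it shows how labeled messages update the "spammy word" beliefs described above:

```python
# A toy sketch of the idea behind Bayesian spam filtering. Word counts from
# mail the user has already labeled update the belief that a word signals spam.
from collections import Counter

spam_words = Counter()   # how often each word appears in known spam
ham_words = Counter()    # how often each word appears in legitimate mail
spam_msgs, ham_msgs = 0, 0

def learn(message, is_spam):
    """Update word counts (the 'priors' get refined as mail arrives)."""
    global spam_msgs, ham_msgs
    target = spam_words if is_spam else ham_words
    target.update(message.lower().split())
    if is_spam:
        spam_msgs += 1
    else:
        ham_msgs += 1

def spam_probability(message):
    """Naive-Bayes-style score: P(spam | words), with add-one smoothing."""
    p_spam = (spam_msgs + 1) / (spam_msgs + ham_msgs + 2)   # prior
    p_ham = 1 - p_spam
    for word in message.lower().split():
        p_spam *= (spam_words[word] + 1) / (sum(spam_words.values()) + 2)
        p_ham *= (ham_words[word] + 1) / (sum(ham_words.values()) + 2)
    return p_spam / (p_spam + p_ham)

learn("act now huge sale save big", is_spam=True)
learn("meeting notes attached for review", is_spam=False)
print(f"{spam_probability('huge sale save now'):.0%} likely to be spam")
```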

Also, there’s an excellent New York Times piece on the subject of Bayesian statistics, “The Odds, Continually Updated”, that recounts a rather heroic example of the Bayesian method in action. 

In July 2013, the fisherman John Aldridge fell out of his boat and was lost at sea, able to stay afloat only by using his boots as improvised pontoons. The Coast Guard deployed Bayesian methods in their search and rescue operation, first using their priors about where survivors might be found in the area and then updating these priors with new information as it came in, such as how strong the currents were and where rescue helicopters had already checked that day. Within 12 hours, they found Aldridge: sunburnt, hypothermic, but alive. 

Not every problem lends itself well to the Bayesian approach. Sometimes, coming up with priors is basically just guesswork. In that same New York Times piece, the physicist Kyle Cranmer described this process of trying to assign numbers to subjective beliefs as “like fingernails on a chalkboard”. Cranmer helped develop the Frequentist technique that was used to discover the Higgs boson particle at the Large Hadron Collider, perhaps the most significant scientific advancement of the last 50 years. 

Many researchers in science and medicine prefer the Frequentist method, which is known to work particularly well with large datasets and repeatable experiments.  

Impact of Sample Size

It’s worth noting that the Bayesian method, due to its use of priors, tends to be better at dealing with the volatility of small sample sizes. 

To demonstrate, let’s imagine that you want to test for the probability of a coin landing on heads.

A Bayesian approach would start with the prior belief that the odds of this event are 50/50. The coin would have to be weighted or otherwise tampered with to throw this assumption off. If there is no such tampering, then the Bayesian test results would be accurate right from the jump, as depicted in this helpful chart from LYST’s engineering blog:

Even with a small sample size, there tends to be little volatility in Bayesian test results.

But a Frequentist experiment would need to figure out the probability of a coin landing on heads all on its own. A coin landing on heads four out of the first five times it’s flipped wouldn’t be too unusual, yet at that point in a Frequentist test it would appear as if the coin is much more likely to land on heads in general. You may need a hundred or so flips for the true nature of reality to come into focus:

If there’s only a small sample size, Frequentist test results can be very volatile.

As you can see in these charts, this difference between the two approaches all but disappears over the duration of the experiments. It’s really only a concern when you’re dealing with very small sample sizes.
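If you want to reproduce the effect in those charts yourself, here is a small simulation. The Beta(50, 50) prior stands in for a strong belief that the coin is fair; the exact numbers are chosen only for illustration:

```python
# A small simulation of the coin-flip comparison above. The Frequentist
# estimate is simply heads / flips; the Bayesian estimate starts from a strong
# 50/50 prior (Beta(50, 50)) and updates it with each flip.
import random

random.seed(7)
heads = 0
prior_heads, prior_tails = 50, 50   # strong prior belief that the coin is fair

for flip in range(1, 101):
    heads += random.random() < 0.5          # simulate a fair coin
    frequentist = heads / flip
    bayesian = (prior_heads + heads) / (prior_heads + prior_tails + flip)
    if flip in (5, 10, 25, 100):
        print(f"after {flip:>3} flips: frequentist {frequentist:.2f}, "
              f"bayesian {bayesian:.2f}")
```

Early on, the Frequentist estimate can swing wildly while the Bayesian one barely moves; by 100 flips they converge on roughly the same answer.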

How Do Most A/B Testing Tools Handle This?

Each A/B testing tool addresses the Bayesian vs. Frequentist question in its own way.

For example, VWO takes a Bayesian approach, while Adobe Target uses the Frequentist method. And then there are tools, such as Convertize, that use principles from both the Bayesian and Frequentist schools of thought in their statistical engines. 

As for Shogun, we use a Chi-Squared algorithm for our A/B tests, which involves building a table of observed and expected frequencies and then comparing these two rows of data to see whether the null hypothesis should be rejected (all of this happens on the back end, of course — you won’t need to deal with any of the math yourself).

The Chi-Squared method is especially well-suited for tests that involve categorical outcomes (data that can be sorted into distinct groups), such as the conversions or sales that tend to be the primary metric in ecommerce experiments. Another advantage of this approach is that it makes relatively few assumptions about the underlying distribution of the data, which helps you avoid baking bias into the analysis.
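For the curious, here is a rough sketch of a chi-squared test on a 2x2 table of conversions and non-conversions. It illustrates the general technique with hypothetical counts and scipy, not Shogun's actual back-end implementation:

```python
# A chi-squared test of independence on a 2x2 contingency table of
# conversions vs. non-conversions. Counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [
    [480, 10_000 - 480],   # Version A: conversions, non-conversions
    [555, 10_000 - 555],   # Version B: conversions, non-conversions
]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.4f}")
print("Expected counts under the null hypothesis:", expected.round(1))
if p_value < 0.05:
    print("Significant at the 95% level: declare a winner")
```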

While the statistics behind Shogun A/B Testing may be complicated, actually using the tool is remarkably easy. After you launch an experiment, Shogun will tell you exactly when you’ve collected enough data to be at least 95% confident in the results, so there’s no ambiguity about whether you should keep collecting data or go ahead and declare a winner. 

Shogun makes it easy to interpret your test results.

If you’d like to learn more about this subject, check out our guide to the most important A/B testing best practices.
