A/B Tests & Statistical Validity
Updated: Aug 23, 2021
Running statistically valid A/B tests is a particularly tricky and intensive aspect of experimentation. If you have done some light reading or watched some YouTube videos on this topic, you might be led to believe that you need only use A/B testing software to split your website/app traffic between the control and the variant, wait a couple of days or weeks, get the results and declare a winner.
One would think that after declaring a winner, you would start to observe an incredible lift in your conversion rate. But alas! Nothing changes or, worse, there is a considerable drop in your revenue and you start to lose money.
If you are a business leader making decisions from A/B tests, especially if you are not running those tests, you NEED to understand what that data is telling you.
An A/B test cannot tell you categorically what to do; it is simply a statement about the probability of one thing happening. If you don’t understand this and how to use it for decisions, you could be making a lot of mistakes.
In this article, I will go over the common terms and concepts you need to understand to run a statistically valid A/B test.
What is my risk appetite?
As more and more companies rely on data to make critical business decisions, statistical significance is increasingly misunderstood and misused in organizations.
With the data you collect from the activity of users of your website, you can compare the efficacy of the two designs, A and B. When a finding is significant, it simply means you can feel confident that it’s real, not that you just got lucky (or unlucky) in choosing the sample. The fundamental question is how likely it is that the observed difference between the two samples arose by chance. You can learn more about statistical significance here.
When you run an experiment, you want to know if your results are “significant.” In the real world, though, the effect on the business is not always the same thing as confidence that a result isn’t due purely to chance (i.e. statistical significance). How much of that remaining uncertainty you are willing to accept is your risk appetite.
The key takeaway is that, as a business, the level of statistical significance you require from a test is an indicator of how much risk you are willing to accept if you push the winner to be implemented. You need to understand that not all experiments have significant results. Not having significant results does NOT mean the test ‘failed’. There is no failure in experimentation, unless you didn’t learn anything.
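To make the link between significance and risk appetite concrete, here is a minimal sketch of a two-sided, two-proportion z-test, one common frequentist way to compare two conversion rates. The visitor and conversion counts are hypothetical, and the function name is only for illustration; your testing tool may use a different method.

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)

alpha = 0.05  # risk appetite: the false-positive rate you are willing to accept
print(f"z = {z:.2f}, p-value = {p_value:.3f}, significant: {p_value < alpha}")
```

With these made-up numbers the p-value lands just under 0.06, so the same result is "not significant" for a team that accepts a 5% false-positive risk and "significant" for a team that accepts 10%. The data did not change; the risk appetite did.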
Quite simply, validity threats are the factors that threaten the validity of your A/B test results.
There are two types of errors that occur in statistics, Type I and Type II. Type I errors occur when you find a difference or correlation when one doesn’t exist (false positive). Type II errors occur when you find no difference or correlation when one does exist (false negative).
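A small simulation can make the two error types tangible. The sketch below assumes a two-proportion z-test, a 5% significance threshold, and made-up conversion rates and sample sizes. It first runs A/A tests (no real difference) to estimate the Type I rate, then runs tests with a small real difference to estimate the Type II rate.

```python
import random
from math import sqrt, erfc

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:  # no conversions (or all conversions) in both arms
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))

def share_significant(rate_a, rate_b, n_per_arm, runs, alpha, rng):
    """Fraction of simulated tests that come out statistically significant."""
    hits = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < rate_a for _ in range(n_per_arm))
        conv_b = sum(rng.random() < rate_b for _ in range(n_per_arm))
        if p_value(conv_a, n_per_arm, conv_b, n_per_arm) < alpha:
            hits += 1
    return hits / runs

rng = random.Random(42)  # fixed seed so the run is repeatable
alpha = 0.05

# Type I: both arms truly convert at 10%, yet some tests still "find" a winner.
type_1_rate = share_significant(0.10, 0.10, 1000, 500, alpha, rng)

# Type II: B truly converts at 11%, yet many tests fail to detect it.
type_2_rate = 1 - share_significant(0.10, 0.11, 1000, 500, alpha, rng)

print(f"False positive rate: {type_1_rate:.1%}")
print(f"False negative rate: {type_2_rate:.1%}")
```

The false positive rate should hover around the chosen threshold, while the false negative rate depends heavily on sample size: with only 1,000 visitors per arm, a one-point real difference is missed most of the time.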
Here are 3 common validity threats you can face in A/B testing:
Flicker Effect: This happens when users see the original page for a tiny fraction of a second before the variation loads. It can be caused by test implementation issues or by technical issues, such as a slow overall website load speed or a testing tool that is loaded via Google Tag Manager instead of directly on the page, so that you don’t control the load order.
Selection Bias: This is a very common problem, and it is caused by wrongly assuming that a portion of the traffic represents the totality of the traffic.
Example: Your store brand has subdomains for the primary regions it operates in. You run a test in one region and your results are statistically significant. Pushing the fix to all subdomains would be a bad move, because the results of the test only hold for the specific region you ran the test in.
Instrumentation Effect: This is another common problem, and it is caused by incorrect code implementation. Make sure to test your campaigns before they go live by checking post-click landing pages and ads on different browsers and devices.
There is always risk in A/B testing, so before you test, go through these steps:
1. With the help of your entire team, inventory all of the threats.
2. Make your entire team aware of the test so that they do not create additional threats.
3. If the list of threats is too long, postpone the test.
The simple answer is that you can’t completely eliminate validity threats. They do and will always exist. The goal is not to eliminate them completely but to manage and minimize them.
The Minimum Detectable Effect (MDE)
In frequentist A/B testing, knowing the minimum detectable effect is a highly inconvenient and yet necessary fact of life. A properly measured A/B test only makes sense as a test for a specific uplift or more. The Minimum Detectable Effect is the minimum uplift in conversion that would be meaningful to you as a business.
For example, say a test will cost $4,000 to develop and push to production, and you want to know what uplift will make you at least $30,000. Your MDE is the uplift in conversion rate that will give you at least that amount.
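The arithmetic behind that example can be sketched as follows. Only the $4,000 cost and the $30,000 target come from the example above; the visitor count, baseline conversion rate and average order value are hypothetical inputs you would replace with your own figures.

```python
def required_relative_uplift(target_extra_revenue, visitors, baseline_rate, avg_order_value):
    """Relative uplift in conversion rate needed to earn the target extra revenue."""
    baseline_revenue = visitors * baseline_rate * avg_order_value
    return target_extra_revenue / baseline_revenue

# Figures from the example in the text.
test_cost = 4_000
target_extra_revenue = 30_000

# Hypothetical business inputs over the period you care about.
visitors = 200_000        # assumed traffic over the payback period
baseline_rate = 0.03      # assumed current conversion rate (3%)
avg_order_value = 70      # assumed revenue per conversion, in dollars

mde = required_relative_uplift(target_extra_revenue, visitors, baseline_rate, avg_order_value)
target_rate = baseline_rate * (1 + mde)
net_gain = target_extra_revenue - test_cost

print(f"MDE (relative uplift): {mde:.1%}")
print(f"Conversion rate needed: {target_rate:.2%}")
print(f"Net gain after test cost: ${net_gain:,}")
```

Under these assumed inputs the MDE comes out at roughly 7%: anything smaller would not reach the $30,000 target over the chosen period.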
Say you plan the test around an MDE of 7%, and the observed uplift after the test is 3.5%. Is that uplift real? It might be, but it equally might not be; your stats simply can’t tell you.
Because the test was only sized to reliably detect an uplift of 7% or more, the MDE can be 7%, the observed lift 3.5%, and the result inconclusive, all at the same time.
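To see why a 3.5% lift can be inconclusive in a test planned around a 7% MDE, it helps to look at the sample size each one requires. The sketch below uses the standard normal-approximation formula for comparing two proportions, with an assumed 3% baseline conversion rate, a 5% significance level and 80% power. All three are illustrative choices, not values from this article.

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative uplift (two-sided z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

baseline_rate = 0.03  # assumed baseline conversion rate

n_for_7 = sample_size_per_arm(baseline_rate, 0.07)
n_for_3_5 = sample_size_per_arm(baseline_rate, 0.035)

print(f"Per-arm sample for a 7% MDE:   {n_for_7:,}")
print(f"Per-arm sample for a 3.5% MDE: {n_for_3_5:,}")
print(f"Ratio: {n_for_3_5 / n_for_7:.1f}x")
```

Halving the uplift you want to detect roughly quadruples the traffic you need. A test that stops at the sample size planned for 7% simply has not collected enough data to say whether a 3.5% lift is real.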
Don’t forget to watch out for the validity threats above, and let everyone on your team (and your client’s team) know that you’re testing. The more of your organization you inform, the less likely it is that somebody alters an aspect of the test and renders it statistically invalid.