Consequences of Ending Your Test Too Soon
In this post we will talk about the consequences of ending your A/B test too soon. The idea came to me when I saw some A/B and MVT test case studies in which the testing tools declared a winner after only a few days of running. Then I got ‘lucky’ and came across a very good example of my own: one of the tests we were running was declared a winner the day after launch. Mind you, this was a site with a high volume of traffic.
Let’s have a look at this actual example. The day after we launched the test, our testing tool declared a winner. According to the tool, we had improved our conversion rate by a respectable 87.25% at a 100% confidence level. Great! Well, not really. What’s the issue?
Technically, if you entered the data (conversions and visits) into any statistical tool, it would show that the result was statistically significant. So it seems there is no issue here. The issue, however, is that the test didn’t run for long enough.
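To make the "any statistical tool" point concrete, here is a minimal sketch of the kind of check such a tool might run: a two-proportion z-test on conversions and visits. The counts below are hypothetical, chosen only for illustration; they are not the numbers from our test, and your testing tool may use a different method.

```python
import math

def two_proportion_z_test(conv_a, visits_a, conv_b, visits_b):
    """Return (relative_lift, z, two_sided_p) for variation B against control A."""
    rate_a = conv_a / visits_a
    rate_b = conv_b / visits_b
    # Pooled conversion rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (visits_a + visits_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    relative_lift = (rate_b - rate_a) / rate_a
    return relative_lift, z, p_value

# Hypothetical one-day counts (assumption, not data from this post).
lift, z, p = two_proportion_z_test(conv_a=40, visits_a=2000, conv_b=75, visits_b=2000)
print(f"lift={lift:.2%}  z={z:.2f}  p={p:.4f}")
```

A single day of data like this can easily clear a 95% confidence threshold. The check only tells you the difference is unlikely to be noise within that day's traffic; it says nothing about whether the traffic was representative.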
If we had stopped the test then and patted each other on the back about how great we were, we would probably have made a very big mistake. The reason is simple: we hadn’t tested our variation on Friday or Monday traffic, or on weekend traffic. But because we didn’t stop the test (we knew it was too early), our actual result looked very different.
The actual test result after 4 weeks of running was a 10.49% improvement at a 99% confidence level. The initial ‘winning’ result overstated the actual improvement by 731.74% (87.25% versus 10.49%, measured relative to the final figure). How is this possible? The reason is that every day you receive different traffic to your website, and each day’s traffic behaves differently too.
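The 731.74% figure appears to be the gap between the two lifts measured relative to the final one. Here is a quick check of that arithmetic, assuming that reading; the variable names are mine:

```python
initial_lift = 87.25  # % improvement reported the day after launch
final_lift = 10.49    # % improvement after 4 weeks

# Gap between the two lifts, relative to the final (4-week) lift.
overstatement = (initial_lift - final_lift) / final_lift * 100
print(round(overstatement, 2))  # 731.74

# The same gap, relative to the initial lift instead.
shrinkage = (final_lift - initial_lift) / initial_lift * 100
print(round(shrinkage, 2))  # -87.98
```

Either way you measure it, almost all of the day-one improvement disappeared once the test had seen a full month of traffic.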
Now, back to the consequences had we stopped this test then. Let’s say you were running this test in checkout, and on the following day you said to your boss something like “hey boss, we just increased our site revenue by 87.25%”. If I were your boss, you would make me extremely happy, and I would probably increase your salary too. So we start celebrating, but at the end of the month, instead of having 87% more money in our bank account, we see the same money we had last month.
To avoid this type of blunder, always be patient and run your tests for a minimum of 2 weeks, with a recommended maximum of 6 weeks, and a confidence level of no less than 95%. Also, once your testing tool declares a winning variation, don’t stop your test immediately. Run it for another week to see if the result is solid. A solid winning variation should, during this ‘control’ week, hold its winning status. If it doesn’t, then you haven’t found your winning version.
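The rule above can be sketched as a small decision helper. This is only an illustration under my own assumptions: the function name, its parameters, and the choice to treat a test that passes 6 weeks without a confirmed winner as inconclusive are not part of any testing tool.

```python
MIN_DAYS = 14          # run for at least 2 weeks
MAX_DAYS = 42          # recommended maximum of 6 weeks
MIN_CONFIDENCE = 0.95  # confidence level no less than 95%
CONTROL_DAYS = 7       # extra 'control' week after a winner is declared

def stopping_decision(days_running, confidence, days_holding_win):
    """Suggest whether to stop a test.

    days_holding_win is a hypothetical input: the number of consecutive days
    the variation has kept its winning status since the tool declared it.
    """
    if days_running < MIN_DAYS:
        return "keep running"
    if confidence >= MIN_CONFIDENCE and days_holding_win >= CONTROL_DAYS:
        return "stop: winner confirmed"
    if days_running >= MAX_DAYS:
        # Assumption: past 6 weeks with no confirmed winner, call it inconclusive.
        return "stop: inconclusive"
    return "keep running"

print(stopping_decision(days_running=1, confidence=1.0, days_holding_win=1))
print(stopping_decision(days_running=28, confidence=0.99, days_holding_win=7))
```

Under these assumptions, the day-one ‘winner’ from our example comes back as "keep running", while a 4-week result that has held its lead for a week comes back as confirmed.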
If you test like this, you will keep bringing sustainable, solid improvements to the site and results you can rely on.
Questions or comments?
For queries regarding conversion optimization of your site, or for more information on this article, please contact Jan Petrovic, founder of proimpact7.com and master certified in conversion optimization and web analytics.
jan@proimpact7.com
Great case, Jan.
I think the same should be said for tests that are performing poorly. Last year one of my clients was testing a new call to action and was watching the stats on the test like a hawk. In the first day or two we saw a ~75% decrease in leads for a treatment. Everyone wanted to stop the treatment ASAP, but I managed to convince them to keep it up for a little longer. Sure, we may have lost a few leads, but at least we’re absolutely sure that particular call to action doesn’t work in that situation.
If we had stopped it prematurely we would never have been sure how that call to action impacted the site.
A few other rules of thumb I like to use when deciding on tests: avoid generalizing test results from unusual periods like Christmas, get at least 100 conversions per variation, and use a consistent data source for tests (i.e. rather than GWO or VWO, report on all tests from GA data; VWO gives me weird data).
Hi Robert, absolutely. I recall seeing a third-party test like that too, where the eventual winning variation was losing from the very beginning. So as you said, stopping anything prematurely could lead to wrong conclusions.
I just wanted to share my view on it. I think Robert is right: cancelling a test prematurely would contaminate the entire end data. The point of A/B testing is to have an equal amount of data on both versions to sum up at the end to see a difference. The volume of data could of course vary, but the more the better, I’d say. I rarely run anything below 300 visitors, as human behaviour factors in. But Robert, could you follow up on VWO giving you weird data? I’m an avid user of it myself and I haven’t noticed anything weird about it. Let me know if you have the time!