I agree with other critiques here that this A/B testing calculator does little to add to the conversation, and someone who uses it would be misled in how to interpret the results.<p>The procedure my team uses is:<p>1) hypothesize an expected conversion rate and decide whether to use a one-sided test (if we only care about a change in one direction) or a two-sided test (if a change in either direction matters)<p>2) run those numbers through this power/sample size calculator to determine the number of visitors we need before we can analyze the experiment: <a href="http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html" rel="nofollow">http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html</a><p>3) wait for traffic<p>4) after enough visitors have come through the funnel, pass the resulting conversion numbers through ABBA <a href="http://www.thumbtack.com/labs/abba/" rel="nofollow">http://www.thumbtack.com/labs/abba/</a> [1] to see confidence intervals on our results<p>For further reading, I highly recommend:<p><a href="http://visualwebsiteoptimizer.com/split-testing-blog/how-to-calculate-ab-test-sample-size/" rel="nofollow">http://visualwebsiteoptimizer.com/split-testing-blog/how-to-...</a><p>[1] disclaimer: my colleague wrote ABBA
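For anyone who wants to reproduce step 2 without the calculator, here's a minimal sketch in Python using statsmodels; the 5% baseline and 6% target conversion rates are made-up placeholders, not numbers from a real experiment:

    # Sample size per variant for a two-proportion test (sketch, assumed rates).
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline, target = 0.05, 0.06                      # hypothesized conversion rates
    effect = proportion_effectsize(baseline, target)   # Cohen's h

    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect,
        alpha=0.05,               # significance level
        power=0.8,                # 1 - Type II error rate
        ratio=1.0,                # equal traffic split
        alternative='two-sided',  # or 'larger' for a one-sided test
    )
    print(round(n_per_variant))   # visitors needed in each arm before analyzing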
The confidence region appears symmetric around the mean, which indicates they are using the normal approximation.<p>Exact confidence regions can be found if you use the Binomial distribution instead.<p>If m conversions out of n visitors is the observed frequency, the question to ask is: for which value of p is P(M >= m) = 0.025, and for which value of p is P(M <= m) = 0.025, where M ~ Binomial(n, p)? Those two values of p are the endpoints of the exact (Clopper-Pearson) 95% interval.<p>It can be solved easily too.
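For the record, a minimal sketch of that exact (Clopper-Pearson) interval in Python, using the equivalent Beta-quantile form; the 30-out-of-1000 counts are just an example:

    # Exact binomial confidence interval via Beta quantiles (sketch).
    from scipy.stats import beta

    def clopper_pearson(m, n, alpha=0.05):
        """Exact 1 - alpha confidence interval for a binomial proportion."""
        lower = beta.ppf(alpha / 2, m, n - m + 1) if m > 0 else 0.0
        upper = beta.ppf(1 - alpha / 2, m + 1, n - m) if m < n else 1.0
        return lower, upper

    print(clopper_pearson(30, 1000))  # e.g. 30 conversions out of 1000 visitors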
Pretty animation, but I have some reservations about the maths:<p>- Normal approximation, as already noticed, ain't no good. Use the Wilson score instead.<p>- No power calculation? Type II errors are far more important IMHO in typical web applications because switching costs are small.<p>- Non-overlapping 95% confidence intervals are not equivalent to p = 0.05; the implied p-value is actually much lower than that. Non-overlapping ~83% CIs correspond more closely to p = 0.05. (Standard errors add in quadrature.)<p>There is a tension between making something simple for the lay person and providing knobs for the expert to twiddle. I can see the case for removing the knobs, but the choices should at least be documented.<p>[It's late here so this post is a bit slim on details. If you're interested, sign up to <a href="http://bandits.mynaweb.com/" rel="nofollow">http://bandits.mynaweb.com/</a> as the next section covers confidence intervals.]
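For concreteness, a minimal sketch of the Wilson score interval mentioned above (the counts are placeholders):

    # Wilson score interval for a binomial proportion (sketch); z=1.96 gives ~95%.
    import math

    def wilson_interval(conversions, trials, z=1.96):
        p_hat = conversions / trials
        denom = 1 + z**2 / trials
        centre = p_hat + z**2 / (2 * trials)
        margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
        return (centre - margin) / denom, (centre + margin) / denom

    print(wilson_interval(30, 1000))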
Try out <a href="http://www.evanmiller.org/ab-testing/chi-squared.html" rel="nofollow">http://www.evanmiller.org/ab-testing/chi-squared.html</a>. That's what we use at SimplePrints all the time. Great library of functions.
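If you'd rather script it than use the calculator, a rough equivalent of a chi-squared test on A/B counts in Python (the counts below are invented):

    # 2x2 chi-squared test on conversion counts (sketch).
    from scipy.stats import chi2_contingency

    # rows: variant A, variant B; columns: converted, did not convert
    table = [[120, 880],
             [150, 850]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)  # a small p-value suggests the variants really differ

Note that scipy applies Yates' continuity correction to 2x2 tables by default, so the result may differ slightly from the calculator's.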
This methodology is flawed, because in practice the conversion rate changes over time. Different effects cause this temporal dependence, including the natural effect on the product of certain techniques (for example, testing a very loud and hard-to-escape upsell will cause some people to accept the upsell and never see it again, and others to feel pissed off). Other causes of temporal dependence are a different mix of traffic, in terms of geography and demographics, at different times of day and week.<p>Even using proper Wilson confidence intervals with good methodology and tens of millions of impressions per group, we would see day-to-day variations in rate fall outside the previous day's confidence interval way more frequently than one would expect (a 95% confidence interval should be exceeded once every few weeks, not every couple of days).<p>The proper methodology is to estimate by bootstrapping over a good selection of the dangerous variables, including time (see the sketch below).
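To make the bootstrapping suggestion concrete, a rough sketch that resamples whole days rather than individual visitors; the DataFrame columns 'day', 'variant', 'converted' are assumptions about how the logs are shaped:

    # Day-level block bootstrap of the conversion-rate lift (sketch).
    import numpy as np
    import pandas as pd

    def bootstrap_lift(df, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        days = df['day'].unique()
        lifts = []
        for _ in range(n_boot):
            sampled_days = rng.choice(days, size=len(days), replace=True)
            sample = pd.concat([df[df['day'] == d] for d in sampled_days])
            rates = sample.groupby('variant')['converted'].mean()
            lifts.append(rates['B'] - rates['A'])
        return np.percentile(lifts, [2.5, 97.5])  # 95% interval for the lift

Resampling by day keeps each day's traffic mix intact, which is the point: the interval widens to reflect the day-to-day variation instead of assuming independent visitors.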
Sorry to be another typical HN nitpicker, but one problem I see with many of these approaches is that the significance level (95% in this case) is picked out of thin air. The reality is that even with a single data point, you have information. The information may not be reliable, but it's information nonetheless. The only reason that people don't redesign based on unreliable information is that redesigns have costs: costs for the developer and costs for the users. Given that different sites have different cost functions, they should also have different significance thresholds.<p>One size does not fit all.
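As a toy illustration of "even with a single data point, you have information": with flat Beta(1, 1) priors you can put a probability on B beating A from however little data you have, and then weigh that against your own switching costs. The counts below are invented:

    # Probability that B beats A under flat Beta priors (sketch, invented counts).
    import numpy as np

    rng = np.random.default_rng(0)
    a_conv, a_total = 1, 10      # tiny sample for A
    b_conv, b_total = 3, 10      # tiny sample for B

    samples_a = rng.beta(1 + a_conv, 1 + a_total - a_conv, 100_000)
    samples_b = rng.beta(1 + b_conv, 1 + b_total - b_conv, 100_000)
    print((samples_b > samples_a).mean())  # P(B is better), given the data
    # Whether that probability clears your bar depends on the cost of redesigning.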
It's great!<p>Although most real-life cases that I'm familiar with also involve an average basket size, even repeat purchases. Explaining that the anecdotal big purchase on version A really is anecdotal, and that you need to consider the distribution, the escalation slope, the rhythm of purchases… all of that is very difficult, especially with simple tools around like this one that make it sound like such tests are simple and can work at any level of audience. Getting a decent separation of expected LTV on a four-pronged A/B/C/D test, especially when your conversion rate is around 1% and your re-purchase rate well under 20%… that's a challenge that requires millions of users for months.
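To put a rough number on that, a crude simulation of revenue per visitor with ~1% conversion and a skewed basket distribution; all parameters are invented, and the point is just how rarely a genuine lift reaches significance at a given scale:

    # Monte Carlo power estimate for a revenue-per-visitor test (sketch).
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)

    def revenue(n, conv_rate, mean_basket):
        converted = rng.random(n) < conv_rate
        baskets = rng.lognormal(mean=np.log(mean_basket), sigma=1.0, size=n)
        return converted * baskets

    hits = 0
    for _ in range(200):
        a = revenue(50_000, 0.010, 40.0)
        b = revenue(50_000, 0.011, 40.0)   # a genuine 10% lift in conversion
        hits += ttest_ind(a, b, equal_var=False).pvalue < 0.05
    print(hits / 200)  # fraction of experiments that would detect the lift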
This is a better tool, with more information provided (in an Excel spreadsheet):<p><a href="http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-significance-calculator-spreadsheet-in-excel/" rel="nofollow">http://visualwebsiteoptimizer.com/split-testing-blog/ab-test...</a>
OK, I give up even pretending to understand statistics anymore.<p>I'm going to pick up a neglected "Think Stats" from O'Reilly, and would appreciate anyone's feedback on stats MOOCs on Coursera or similar.<p>(I'm finding long division difficult these days.)
I wish these things would let you change the confidence level. 95% is really only used in an academic context. If you think about the decisions you make in a typical business context, you're rarely working with more than 80% confidence or so.
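The knob itself is trivial to expose; the confidence level only changes the z multiplier on the interval. A quick sketch (counts and rates below are placeholders):

    # Normal-approximation interval at an arbitrary confidence level (sketch).
    import math
    from scipy.stats import norm

    def normal_interval(conversions, trials, confidence):
        p = conversions / trials
        z = norm.ppf(1 - (1 - confidence) / 2)
        margin = z * math.sqrt(p * (1 - p) / trials)
        return p - margin, p + margin

    print(normal_interval(50, 1000, 0.95))  # wider
    print(normal_interval(50, 1000, 0.80))  # narrower: easier to call a winner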