I agree with other critiques here that this A/B testing calculator does little to add to the conversation, and someone who uses it would be misled in how to interpret the results.<p>The procedure my team uses is:<p>1) hypothesize an expected conversion rate and decide whether to use a one-sided test (if we only care about a change in one direction) or a two-sided test (if a change in either direction matters)<p>2) run those numbers through this power/sample size calculator to determine the number of visitors we need before we can analyze the experiment: <a href="http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html" rel="nofollow">http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html</a><p>3) wait for traffic<p>4) after enough visitors have come through the funnel, pass the resulting conversion numbers through ABBA <a href="http://www.thumbtack.com/labs/abba/" rel="nofollow">http://www.thumbtack.com/labs/abba/</a> [1] to see confidence intervals on our results<p>For further reading, I highly recommend:<p><a href="http://visualwebsiteoptimizer.com/split-testing-blog/how-to-calculate-ab-test-sample-size/" rel="nofollow">http://visualwebsiteoptimizer.com/split-testing-blog/how-to-...</a><p>[1] disclaimer: my colleague wrote ABBA
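For anyone who wants to reproduce step 2 without the calculator, here's a minimal sketch in Python using statsmodels; the 5% baseline and 6% target conversion rates are made-up placeholders, not numbers from a real experiment:

    # Sample size per variant for a two-proportion test (sketch, assumed rates).
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline, target = 0.05, 0.06                      # hypothesized conversion rates
    effect = proportion_effectsize(baseline, target)   # Cohen's h

    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect,
        alpha=0.05,               # significance level
        power=0.8,                # 1 - Type II error rate
        ratio=1.0,                # equal traffic split
        alternative='two-sided',  # or 'larger' for a one-sided test
    )
    print(round(n_per_variant))   # visitors needed in each arm before analyzing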
The confidence region appears symmetric around the mean, which indicates they are using the normal approximation.<p>Exact confidence regions can be found if you use the Binomial distribution instead.<p>If m conversions out of n visitors is the observed frequency, the question to ask is: for which value of p is P(M >= m) = 0.025, and for which value of p is P(M <= m) = 0.025, where M ~ Binomial(n, p)? Those two values of p are the endpoints of the exact (Clopper-Pearson) 95% interval.<p>It can be solved easily too.
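For the record, a minimal sketch of that exact (Clopper-Pearson) interval in Python, using the equivalent Beta-quantile form; the 30-out-of-1000 counts are just an example:

    # Exact binomial confidence interval via Beta quantiles (sketch).
    from scipy.stats import beta

    def clopper_pearson(m, n, alpha=0.05):
        """Exact 1 - alpha confidence interval for a binomial proportion."""
        lower = beta.ppf(alpha / 2, m, n - m + 1) if m > 0 else 0.0
        upper = beta.ppf(1 - alpha / 2, m + 1, n - m) if m < n else 1.0
        return lower, upper

    print(clopper_pearson(30, 1000))  # e.g. 30 conversions out of 1000 visitors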
Pretty animation, but I have some reservations about the maths:<p>- Normal approximation, as already noticed, ain't no good. Use the Wilson score instead.<p>- No power calculation? Type II errors are far more important IMHO in typical web applications because switching costs are small.<p>- Non-overlapping 95% confidence intervals are not equivalent to p = 0.05; the implied p-value is actually much lower than that. Non-overlapping ~83% CIs correspond more closely to p = 0.05. (Standard errors add in quadrature.)<p>There is a tension between making something simple for the lay person and providing knobs for the expert to twiddle. I can see the case for removing the knobs, but the choices should at least be documented.<p>[It's late here so this post is a bit slim on details. If you're interested, sign up to <a href="http://bandits.mynaweb.com/" rel="nofollow">http://bandits.mynaweb.com/</a> as the next section covers confidence intervals.]
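For concreteness, a minimal sketch of the Wilson score interval mentioned above (the counts are placeholders):

    # Wilson score interval for a binomial proportion (sketch); z=1.96 gives ~95%.
    import math

    def wilson_interval(conversions, trials, z=1.96):
        p_hat = conversions / trials
        denom = 1 + z**2 / trials
        centre = p_hat + z**2 / (2 * trials)
        margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
        return (centre - margin) / denom, (centre + margin) / denom

    print(wilson_interval(30, 1000))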
Try out <a href="http://www.evanmiller.org/ab-testing/chi-squared.html" rel="nofollow">http://www.evanmiller.org/ab-testing/chi-squared.html</a>. That's what we use at SimplePrints all the time. Great library of functions.
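If you'd rather script it than use the calculator, a rough equivalent of a chi-squared test on A/B counts in Python (the counts below are invented):

    # 2x2 chi-squared test on conversion counts (sketch).
    from scipy.stats import chi2_contingency

    # rows: variant A, variant B; columns: converted, did not convert
    table = [[120, 880],
             [150, 850]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)  # a small p-value suggests the variants really differ

Note that scipy applies Yates' continuity correction to 2x2 tables by default, so the result may differ slightly from the calculator's.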
This methodology is flawed, because in practice the conversion rate changes over time. Different effects cause this temporal dependence, including the natural effect on the product of certain techniques (for example, testing a very loud and hard-to-escape upsell will cause some people to accept the upsell and never see it again, and others to feel pissed off). Other causes of temporal dependence are a different mix of traffic, in terms of geography and demographics, at different times of day and week.<p>Even using proper Wilson confidence intervals with good methodology and tens of millions of impressions per group, we would see day-to-day variations in rate fall outside the previous day's confidence interval way more frequently than one would expect (a 95% confidence interval should be exceeded once every few weeks, not every couple of days).<p>The proper methodology is to estimate by bootstrapping over a good selection of the dangerous variables, including time (see the sketch below).
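To make the bootstrapping suggestion concrete, a rough sketch that resamples whole days rather than individual visitors; the DataFrame columns 'day', 'variant', 'converted' are assumptions about how the logs are shaped:

    # Day-level block bootstrap of the conversion-rate lift (sketch).
    import numpy as np
    import pandas as pd

    def bootstrap_lift(df, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        days = df['day'].unique()
        lifts = []
        for _ in range(n_boot):
            sampled_days = rng.choice(days, size=len(days), replace=True)
            sample = pd.concat([df[df['day'] == d] for d in sampled_days])
            rates = sample.groupby('variant')['converted'].mean()
            lifts.append(rates['B'] - rates['A'])
        return np.percentile(lifts, [2.5, 97.5])  # 95% interval for the lift

Resampling by day keeps each day's traffic mix intact, which is the point: the interval widens to reflect the day-to-day variation instead of assuming independent visitors.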
Sorry to be another typical HN nitpicker, but one problem I see with many of these approaches is that the significance level (95% in this case) is picked out of thin air. The reality is that even with a single data point, you have information. The information may not be reliable, but it's information nonetheless. The only reason that people don't redesign based on unreliable information is that redesigns have costs: costs for the developer and costs for the users. Given that different sites have different cost functions, they should also have different significance thresholds.<p>One size does not fit all.
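As a toy illustration of "even with a single data point, you have information": with flat Beta(1, 1) priors you can put a probability on B beating A from however little data you have, and then weigh that against your own switching costs. The counts below are invented:

    # Probability that B beats A under flat Beta priors (sketch, invented counts).
    import numpy as np

    rng = np.random.default_rng(0)
    a_conv, a_total = 1, 10      # tiny sample for A
    b_conv, b_total = 3, 10      # tiny sample for B

    samples_a = rng.beta(1 + a_conv, 1 + a_total - a_conv, 100_000)
    samples_b = rng.beta(1 + b_conv, 1 + b_total - b_conv, 100_000)
    print((samples_b > samples_a).mean())  # P(B is better), given the data
    # Whether that probability clears your bar depends on the cost of redesigning.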
It's great!<p>Although most real-life cases that I'm familiar with also involve an average basket size, even repeat purchases. Explaining that the anecdotal big purchase on version A really is anecdotal, and that you need to consider the distribution, the escalation slope, the rhythm of purchases… all of that is very difficult, especially with simple tools around like this one that make it sound like such tests are simple and can work at any level of audience. Getting a decent separation of expected LTV on a four-pronged A/B/C/D test, especially when your conversion rate is around 1% and your re-purchase rate well under 20%… that's a challenge that requires millions of users for months.
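To put a rough number on that, a crude simulation of revenue per visitor with ~1% conversion and a skewed basket distribution; all parameters are invented, and the point is just how rarely a genuine lift reaches significance at a given scale:

    # Monte Carlo power estimate for a revenue-per-visitor test (sketch).
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)

    def revenue(n, conv_rate, mean_basket):
        converted = rng.random(n) < conv_rate
        baskets = rng.lognormal(mean=np.log(mean_basket), sigma=1.0, size=n)
        return converted * baskets

    hits = 0
    for _ in range(200):
        a = revenue(50_000, 0.010, 40.0)
        b = revenue(50_000, 0.011, 40.0)   # a genuine 10% lift in conversion
        hits += ttest_ind(a, b, equal_var=False).pvalue < 0.05
    print(hits / 200)  # fraction of experiments that would detect the lift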
This is a better tool, with more information provided (in an Excel spreadsheet):<p><a href="http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-significance-calculator-spreadsheet-in-excel/" rel="nofollow">http://visualwebsiteoptimizer.com/split-testing-blog/ab-test...</a>
OK, I give up even pretending to understand statistics anymore.<p>I'm going to pick up a neglected "Think Stats" from O'Reilly, and would appreciate anyone's feedback on stats MOOCs on Coursera or similar.<p>(I'm finding long division difficult these days.)
I wish these things would let you change the confidence level. 95% is really only used in an academic context. If you think about the decisions you make in a typical business context, you're rarely working with more than 80% confidence or so.
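The knob itself is trivial to expose; the confidence level only changes the z multiplier on the interval. A quick sketch (counts and rates below are placeholders):

    # Normal-approximation interval at an arbitrary confidence level (sketch).
    import math
    from scipy.stats import norm

    def normal_interval(conversions, trials, confidence):
        p = conversions / trials
        z = norm.ppf(1 - (1 - confidence) / 2)
        margin = z * math.sqrt(p * (1 - p) / trials)
        return p - margin, p + margin

    print(normal_interval(50, 1000, 0.95))  # wider
    print(normal_interval(50, 1000, 0.80))  # narrower: easier to call a winner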