tl;dr: Don't bother with confidence intervals. Use a G-test instead.<p>Calculate it here: <a href="http://elem.com/~btilly/effective-ab-testing/g-test-calculator.html" rel="nofollow">http://elem.com/~btilly/effective-ab-testing/g-test-calculat...</a><p>Read more here: <a href="http://en.wikipedia.org/wiki/G-test" rel="nofollow">http://en.wikipedia.org/wiki/G-test</a><p>And plain English here: <a href="http://en.wikipedia.org/wiki/Likelihood_ratio_test" rel="nofollow">http://en.wikipedia.org/wiki/Likelihood_ratio_test</a>
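For anyone who wants to see what the calculator is doing under the hood, here is a minimal sketch of a G-test on a 2x2 A/B table, using only the Python standard library. The function name and signature are my own; the math is the standard likelihood-ratio statistic G = 2 Σ O·ln(O/E), compared against a chi-square distribution with 1 degree of freedom.

```python
from math import log, erfc, sqrt

def g_test_2x2(a_conv, a_total, b_conv, b_total):
    """G-test of independence on a 2x2 table:
    rows = variants A and B, columns = converted / did not convert."""
    obs = [[a_conv, a_total - a_conv],
           [b_conv, b_total - b_conv]]
    row_totals = [sum(r) for r in obs]
    col_totals = [sum(c) for c in zip(*obs)]
    n = sum(row_totals)
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if obs[i][j] > 0:  # 0 * log(0) is taken as 0
                g += 2 * obs[i][j] * log(obs[i][j] / expected)
    # A 2x2 table has 1 degree of freedom; for chi-square with 1 df,
    # the p-value is the tail probability P(Z^2 > g) = erfc(sqrt(g/2)).
    p_value = erfc(sqrt(g / 2))
    return g, p_value

# e.g. 120/1000 conversions on A vs 160/1000 on B
g, p = g_test_2x2(120, 1000, 160, 1000)
```

If the two proportions are identical, G is exactly 0 and the p-value is 1; the bigger the discrepancy between observed and expected counts, the bigger G gets.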
<p><pre><code> When A/B testing, you need to always remember three things:
The smaller your change is, the more data you need to be sure
that the conclusion you have reached is statistically significant.
</code></pre>
Is that a mathematically provable result? It seems hard to conceptualize what a 'small' or 'big' change is. I would have expected another argument along the lines of "If you make more than one change at a time, you are not going to be able to know which one of your changes caused the result".
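It does fall out of standard power analysis: the required sample size grows roughly as 1/delta^2, where delta is the size of the effect you want to detect. A rough sketch (the normal-approximation formula for comparing two proportions at ~5% significance and ~80% power; the function name and defaults are my own, not from the article):

```python
from math import ceil

def n_per_arm(p_base, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect an absolute
    lift `delta` over base conversion rate `p_base`.
    Normal-approximation two-proportion formula: n ~ (z_a + z_b)^2 * var / delta^2."""
    p_bar = p_base + delta / 2          # average rate across the two arms
    var = 2 * p_bar * (1 - p_bar)       # pooled variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * var / delta ** 2)

# Halving the detectable lift roughly quadruples the required traffic:
# n_per_arm(0.05, 0.01) vs n_per_arm(0.05, 0.005)
```

So "small change" here means small expected lift in the metric, and the 1/delta^2 term is why halving the lift you are hunting for roughly quadruples the data you need.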
I think the big issue people see in A/B testing comes down to a fairly tricky reason: the underlying distribution of the data. The usual ways of estimating how big your sample size needs to be have one huge giraffe of a problem hiding in them: they assume the underlying distribution is normal.<p>The correct way to estimate your sample size is to use the cumulative distribution function of your underlying distribution. See a brief explanation from Wikipedia here: <a href="http://en.wikipedia.org/wiki/Sample_size_determination#By_cumulative_distribution_function" rel="nofollow">http://en.wikipedia.org/wiki/Sample_size_determination#By_cu...</a><p>Now what's the problem with A/B testing? Most of the stuff we test A/B for is incredibly non-normal. Often 99% of visits do not convert. We're looking at extremely skewed data here. Generally, the more skewed the distribution, the more samples we need.<p>For a very basic understanding of why: consider a very simple distribution where 99.99% of the time you get $0 and 0.01% of the time you get $29 - fairly similar to what we A/B test. Do you think a sample of 1000 or 10000 is going to be anywhere near enough here? Of course not.
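A quick way to see why those sample sizes are hopeless for the $0/$29 example: at a 0.01% event rate, a sample of 1000 visits is very likely to contain no $29 events at all, in which case it tells you nothing about the payoff. A two-line sketch of the arithmetic (my own illustration, not from the comment above):

```python
def p_no_events(n, p_event=0.0001):
    """Chance a sample of n visits contains zero $29 conversions
    when each visit converts with probability p_event (0.01%)."""
    return (1 - p_event) ** n

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: P(zero conversions in the whole sample) = {p_no_events(n):.1%}")
```

With n=1000 the sample is empty of conversions about 90% of the time, and even at n=10000 it is empty over a third of the time, so any conversion-value estimate from samples that size is mostly noise.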