At my first webdev internship, my only job was to report to the "Head of Analytics" (a young liberal arts guy). All I did all day was make the tweaks he told me to do. It was stuff like "make this button red, green or blue", or "try these three different phrasings".

We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.

I hated it, and the website looked horrible because everything was competing with everything else and just used whatever random color won.
I love the concept of A/A testing here, illustrating that you get apparent results even when you compare something to itself.

I can't imagine how A/B tests are a productive use of time for any site with less than a million users.

There are so many more useful things you could be doing to create value. If you're running a startup you're better off having some confidence in your own decisions.
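A minimal sketch of that A/A point, assuming SciPy is available (the traffic and conversion numbers are made up): even when both arms share the identical true conversion rate, roughly 5% of tests still clear a p < 0.05 bar by chance.

```python
# Simulate many A/A tests: both arms have the SAME true conversion rate,
# yet a naive significance test still declares a "winner" ~5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_arm, true_rate = 1000, 5000, 0.03

false_positives = 0
for _ in range(n_tests):
    a = rng.binomial(n_per_arm, true_rate)
    b = rng.binomial(n_per_arm, true_rate)
    table = [[a, n_per_arm - a], [b, n_per_arm - b]]
    _, p, _, _ = stats.chi2_contingency(table)   # chi-squared test on the 2x2 table
    if p < 0.05:
        false_positives += 1

print(f"{false_positives / n_tests:.1%} of A/A tests look 'significant'")  # ~5%
```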
I do this professionally as my sole job. This is one of the very few papers I've read that seem completely legit to me. I especially love their point on necessary sample sizes to get to a 90% power.
This article's title echoes a paper which continues to influence the medical research and bioinformatics community, "Why Most Published Research Findings Are False" by JPA Ioannidis:

http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124

While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc., which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little further and comes up with a number of corollaries that apply equally well to A/B testing.

After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.
I strongly recommend using Evan Miller's free A/B testing tools to avoid those issues!

Use them to check whether a conversion rate is significantly different, whether the mean values of two groups differ significantly, and to calculate the sample size you need:

http://www.evanmiller.org/ab-testing/
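For a rough idea of what such a sample-size calculation involves (this is the standard normal-approximation formula for a two-proportion test, not Evan Miller's actual code, and the baseline rate and lift are made-up examples):

```python
# Per-arm sample size needed to detect a given relative lift in conversion rate
from scipy.stats import norm

def sample_size_per_arm(p_baseline, relative_lift, alpha=0.05, power=0.8):
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# A 3% baseline conversion rate and a 10% relative lift needs ~50k visitors per arm
print(sample_size_per_arm(0.03, 0.10))
```

Numbers like that are why low-traffic sites struggle to run meaningful tests.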
Putting aside bandits and all that, it seems like the first step should be to set up a hierarchical prior which performs shrinkage. Multiple comparisons and stopping issues are largely due to using frequentist tests rather than a simple probabilistic model and inference that conditions on the observed data.

Gelman et al, "Why we (usually) don't have to worry about multiple comparisons": http://arxiv.org/abs/0907.2478
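For concreteness, here is a crude empirical-Bayes sketch of the shrinkage idea (invented per-variant numbers, and a much simpler model than the full hierarchical one Gelman et al describe): fit a shared prior from all variants, then pull each variant's estimate toward it.

```python
# Empirical-Bayes shrinkage of per-variant conversion rates toward a shared prior
import numpy as np

conversions = np.array([30, 45, 28, 33, 52])        # hypothetical conversions per variant
visitors    = np.array([1000, 1000, 1000, 1000, 1000])

# Fit a shared Beta(a, b) prior to the raw rates by moment matching
rates = conversions / visitors
m, v = rates.mean(), rates.var()
strength = m * (1 - m) / v - 1                      # prior "pseudo-sample size"
a, b = m * strength, (1 - m) * strength

# Posterior mean per variant: raw estimate shrunk toward the pooled mean
shrunk = (conversions + a) / (visitors + a + b)
for raw, post in zip(rates, shrunk):
    print(f"raw {raw:.3f} -> shrunk {post:.3f}")
```

Extreme-looking variants get pulled back hardest, which is exactly the behaviour that blunts the multiple-comparisons problem.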
> We know that, occasionally, a test will generate a false positive due to random chance - we can’t avoid that. By convention we normally fix this probability at 5%. You might have heard this called the significance probability or p-value.
>
> If we use a p-value cutoff of 5% we also expect to see 5 false positives.
Am I reading this incorrectly, or is the author describing p-values incorrectly?

A p-value is the chance that a result at least as strong as the observed one would occur if the null hypothesis were true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.
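One way to see what the 5% threshold does and doesn't promise (a hypothetical simulation, not anything from the article): when the null hypothesis really is true, p-values are roughly uniform, so a cutoff of alpha flags about alpha of those tests and no more.

```python
# Under a true null, p-values are ~uniform: P(p < cutoff | null) ≈ cutoff
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=200)   # both groups drawn from the
    b = rng.normal(0.0, 1.0, size=200)   # same distribution (null is true)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
for cutoff in (0.01, 0.05, 0.10):
    print(f"P(p < {cutoff}) ≈ {np.mean(p_values < cutoff):.3f}")  # ≈ cutoff
```

It says nothing about how often a flagged result is a false positive overall; that depends on how many of your hypotheses were true to begin with.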
The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly related to 'multiple comparisons' and 'multiple seeing of data') that lead to results seeming better than they actually are. That said, the current results are not wrong: they are directionally correct, and with most A/B tests, even if a reported 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (versus not doing anything).

Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to make their decisions. You will see us working towards taking care of such issues.
Good article in general; I have a small question.

> Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects. If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests.

If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0 = h_test. Taking power into account, shouldn't we see 15 * 0.8 = 12 winning results? I.e. wouldn't one of the false positives also have not-enough-power?
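For reference, here is the arithmetic the quoted passage appears to be doing, spelled out (power only applies to the 10 tests with a real effect, and the 5% false-positive rate only to the tests where the null is true, so the two counts aren't multiplied together; the article's "5" looks like 5% applied to all 100 tests rather than just the 90 nulls):

```python
# Expected counts behind the "8 + 5 = 13 winners" figure (the passage's own numbers)
n_tests = 100
true_fraction = 0.10      # 10 of 100 variants have a real effect
power = 0.80              # P(detect | real effect)
alpha = 0.05              # P(false positive | no real effect)

true_effects = n_tests * true_fraction       # 10
detected = true_effects * power              # 8 true positives
nulls = n_tests - true_effects               # 90
false_pos = nulls * alpha                    # 4.5, which the article rounds up to ~5

print(detected + false_pos)                  # ≈ 12.5, i.e. roughly the 13 quoted
```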
The "regression to the mean" and "novelty" effect is getting at two different things (both true, both important).<p>1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) increases as the sample size shrinks.<p>2. The <i>much bigger problem</i> I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
Great read.

What bothers me about A/B tests is when people say, e.g., "there was a 7% improvement" without telling us the sample size or error margin. I'd rather hear: on a sample size of 1,000 unique visits, the improvement was 7% +/- 4%.
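Here's roughly what that error margin looks like in practice (a normal-approximation interval for a difference in conversion rates, stated in absolute percentage points; the counts are invented):

```python
# 95% confidence interval for the difference between two conversion rates
from math import sqrt

n_a, conv_a = 1000, 50       # control: 5.0% conversion
n_b, conv_b = 1000, 57       # variant: 5.7% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
margin = 1.96 * se           # 95% confidence

print(f"lift: {diff:.1%} +/- {margin:.1%}")   # 0.7% +/- 2.0%, an interval spanning zero
```

With samples that size the interval easily swallows the observed lift, which is exactly why quoting the improvement alone is misleading.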
I really liked this; it's condescending, but in a good-natured sort of way. It's as if the author was trying to explain really basic statistics to a marketer, then realized that the marketer had NO idea what he was talking about.

So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."

I thought that was hilarious.
Martin gave this paper as a talk at our PyData London conference this weekend (thanks Martin!); videos will be linked once we have them. He shares hard-won lessons and good advice. Here's my write-up: http://ianozsvald.com/2014/02/24/pydatalondon-2014/
Related... someone should write a good article about estimating customer acquisition costs (CAC, or ROI if you prefer) based on the conversion rates of ads.

It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence intervals, and power calculations -- something that could be solved with a quick Wikipedia search [1].

[1] https://en.wikipedia.org/wiki/Sample_size_determination
Regarding the final bullet point about doing a second validation test: the sample size should be bigger, right? Because winners tend to coincide with positive random noise, you should choose a larger sample for the validation run and expect to see a smaller effect.
Visibility on this is set to "Private". Is it really supposed to be linked publicly on HN? I was about to tweet a link to it and then I felt dirty, like maybe the author wanted to send the link to just a select group.
Coming from a poker background, where sample size trumps everything, I've LOL'ed at every person that has ever whipped out an A/B test on me.
Compare and contrast this whitepaper with the advice from arguably one of the most common optimization apps out there:

https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test
In my experience it can't be overstated how important it is to wait until you have a large sample size before deciding whether a variation is the winner. Nearly all of the A/B tests I run start out looking like a variation is the clear, landslide winner (sometimes showing 100%+ improvement over the original) only to eventually regress toward the mean. I can't get a clear idea of the winner until I've shown the variation(s) to tens of thousands of visitors and received a few thousand conversions.

I've also learned that it's important to perform tests only on new visitors when possible. That means tests need to run longer to reach the appropriate sample size. If you're testing over a few hundred conversions and including both new and returning visitors, you're probably getting skewed results.

Again, that's just my experience so far; YMMV. One thing to consider is that a test's variations may be too subtle to have a significant, positive impact on conversion.