At my first webdev internship, my only job was to report to the "Head of Analytics" (a young liberal arts guy). All I did all day was make the tweaks he told me to do. It was stuff like "make this button red, green or blue", or "try these three different phrasings".

We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.

I hated it, and the website looked horrible because everything was competing with everything else and just used whatever random color won.
I love the concept of A/A testing here, illustrating that you get apparent results even when you compare something to itself.

I can't imagine how A/B tests are a productive use of time for any site with less than a million users.

There are so many more useful things you could be doing to create value. If you're running a startup you're better off having some confidence in your own decisions.
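A minimal sketch of that A/A point, assuming SciPy is available (the traffic and conversion numbers are made up): even when both arms share the identical true conversion rate, roughly 5% of tests still clear a p < 0.05 bar by chance.

```python
# Simulate many A/A tests: both arms have the SAME true conversion rate,
# yet a naive significance test still declares a "winner" ~5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_arm, true_rate = 1000, 5000, 0.03

false_positives = 0
for _ in range(n_tests):
    a = rng.binomial(n_per_arm, true_rate)
    b = rng.binomial(n_per_arm, true_rate)
    table = [[a, n_per_arm - a], [b, n_per_arm - b]]
    _, p, _, _ = stats.chi2_contingency(table)   # chi-squared test on the 2x2 table
    if p < 0.05:
        false_positives += 1

print(f"{false_positives / n_tests:.1%} of A/A tests look 'significant'")  # ~5%
```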
I do this professionally as my sole job. This is one of the very few papers I've read that seem completely legit to me. I especially love their point on necessary sample sizes to get to a 90% power.
This article's title echoes a paper which continues to influence the medical research and bioinformatics community, "Why Most Published Research Findings Are False" by JPA Ioannidis:

http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124

While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc., which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little further and comes up with a number of corollaries that apply equally well to A/B testing.

After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.
I strongly recommend using Evan Miller's free A/B testing tools to avoid those issues!

Use them to check whether a conversion rate is significantly different, whether the mean values of two groups differ significantly, and to calculate the sample size you need:

http://www.evanmiller.org/ab-testing/
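For a rough idea of what such a sample-size calculation involves (this is the standard normal-approximation formula for a two-proportion test, not Evan Miller's actual code, and the baseline rate and lift are made-up examples):

```python
# Per-arm sample size needed to detect a given relative lift in conversion rate
from scipy.stats import norm

def sample_size_per_arm(p_baseline, relative_lift, alpha=0.05, power=0.8):
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# A 3% baseline conversion rate and a 10% relative lift needs ~50k visitors per arm
print(sample_size_per_arm(0.03, 0.10))
```

Numbers like that are why low-traffic sites struggle to run meaningful tests.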
Putting aside bandits and all that, it seems like the first step should be to set up a hierarchical prior which performs shrinkage. Multiple comparisons and stopping issues are largely due to using frequentist tests rather than a simple probabilistic model and inference that conditions on the observed data.

Gelman et al, "Why we (usually) don't have to worry about multiple comparisons": http://arxiv.org/abs/0907.2478
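For concreteness, here is a crude empirical-Bayes sketch of the shrinkage idea (invented per-variant numbers, and a much simpler model than the full hierarchical one Gelman et al describe): fit a shared prior from all variants, then pull each variant's estimate toward it.

```python
# Empirical-Bayes shrinkage of per-variant conversion rates toward a shared prior
import numpy as np

conversions = np.array([30, 45, 28, 33, 52])        # hypothetical conversions per variant
visitors    = np.array([1000, 1000, 1000, 1000, 1000])

# Fit a shared Beta(a, b) prior to the raw rates by moment matching
rates = conversions / visitors
m, v = rates.mean(), rates.var()
strength = m * (1 - m) / v - 1                      # prior "pseudo-sample size"
a, b = m * strength, (1 - m) * strength

# Posterior mean per variant: raw estimate shrunk toward the pooled mean
shrunk = (conversions + a) / (visitors + a + b)
for raw, post in zip(rates, shrunk):
    print(f"raw {raw:.3f} -> shrunk {post:.3f}")
```

Extreme-looking variants get pulled back hardest, which is exactly the behaviour that blunts the multiple-comparisons problem.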
> We know that, occasionally, a test will generate a false positive due to random chance - we can’t avoid that. By convention we normally fix this probability at 5%. You might have heard this called the significance probability or p-value.
>
> If we use a p-value cutoff of 5% we also expect to see 5 false positives.
Am I reading this incorrectly, or is the author describing p-values incorrectly?

A p-value is the chance that a result at least as strong as the observed one would occur if the null hypothesis were true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.
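One way to see what the 5% threshold does and doesn't promise (a hypothetical simulation, not anything from the article): when the null hypothesis really is true, p-values are roughly uniform, so a cutoff of alpha flags about alpha of those tests and no more.

```python
# Under a true null, p-values are ~uniform: P(p < cutoff | null) ≈ cutoff
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=200)   # both groups drawn from the
    b = rng.normal(0.0, 1.0, size=200)   # same distribution (null is true)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
for cutoff in (0.01, 0.05, 0.10):
    print(f"P(p < {cutoff}) ≈ {np.mean(p_values < cutoff):.3f}")  # ≈ cutoff
```

It says nothing about how often a flagged result is a false positive overall; that depends on how many of your hypotheses were true to begin with.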
The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly related to 'multiple comparisons' and 'multiple seeing of data') that lead to results seeming better than they actually are. That said, the current results are not wrong: they are directionally correct, and with most A/B tests, even if a reported 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (versus not doing anything).

Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to make their decisions. You will see us working towards taking care of such issues.
Good article in general; I have a small question.

> Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects. If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests.

If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0 = h_test. Taking power into account, shouldn't we see 15 * 0.8 = 12 winning results? I.e. wouldn't one of the false positives also have not-enough-power?
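For reference, here is the arithmetic the quoted passage appears to be doing, spelled out (power only applies to the 10 tests with a real effect, and the 5% false-positive rate only to the tests where the null is true, so the two counts aren't multiplied together; the article's "5" looks like 5% applied to all 100 tests rather than just the 90 nulls):

```python
# Expected counts behind the "8 + 5 = 13 winners" figure (the passage's own numbers)
n_tests = 100
true_fraction = 0.10      # 10 of 100 variants have a real effect
power = 0.80              # P(detect | real effect)
alpha = 0.05              # P(false positive | no real effect)

true_effects = n_tests * true_fraction       # 10
detected = true_effects * power              # 8 true positives
nulls = n_tests - true_effects               # 90
false_pos = nulls * alpha                    # 4.5, which the article rounds up to ~5

print(detected + false_pos)                  # ≈ 12.5, i.e. roughly the 13 quoted
```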
The "regression to the mean" and "novelty" effect is getting at two different things (both true, both important).<p>1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) increases as the sample size shrinks.<p>2. The <i>much bigger problem</i> I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
Great read.

What bothers me about A/B tests is when people say, e.g., "there was a 7% improvement" without telling us the sample size or error margin. I'd rather hear: on a sample size of 1,000 unique visits, the improvement was 7% +/- 4%.
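Here's roughly what that error margin looks like in practice (a normal-approximation interval for a difference in conversion rates, stated in absolute percentage points; the counts are invented):

```python
# 95% confidence interval for the difference between two conversion rates
from math import sqrt

n_a, conv_a = 1000, 50       # control: 5.0% conversion
n_b, conv_b = 1000, 57       # variant: 5.7% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
margin = 1.96 * se           # 95% confidence

print(f"lift: {diff:.1%} +/- {margin:.1%}")   # 0.7% +/- 2.0%, an interval spanning zero
```

With samples that size the interval easily swallows the observed lift, which is exactly why quoting the improvement alone is misleading.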
I really liked this; it's condescending, but in a good-natured sort of way. It's as if the author was trying to explain really basic statistics to a marketer, then realized that the marketer had NO idea what he was talking about.

So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."

I thought that was hilarious.
Martin gave this paper as a talk at our PyData London conference this weekend (thanks Martin!); videos will be linked once we have them. He shares hard-won lessons and good advice. Here's my write-up: http://ianozsvald.com/2014/02/24/pydatalondon-2014/
Related... someone should write a good article about estimating customer acquisition costs (CAC, or ROI if you prefer) based on the conversion rates of ads.

It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence intervals, and power calculations -- something that could be solved with a quick Wikipedia search [1].

[1] https://en.wikipedia.org/wiki/Sample_size_determination
Regarding the final bullet point about doing a second validation test: the sample size should be bigger, right? Because winners tend to coincide with positive random noise, you should choose a larger sample for the validation run and expect to see a smaller effect.
Visibility on this is set to "Private". Is it really supposed to be linked publicly on HN? I was about to tweet a link to it and then I felt dirty, like maybe the author wanted to send the link to just a select group.
Coming from a poker background, where sample size trumps everything, I've LOL'ed at every person that has ever whipped out an A/B test on me.
Compare and contrast this whitepaper with the advice from arguably one of the most common optimization apps out there:

https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test
In my experience it can't be overstated how important it is to wait until you have a large sample size before deciding whether a variation is the winner. Nearly all of the A/B tests I run start out looking like a variation is the clear, landslide winner (sometimes showing 100%+ improvement over the original) only to eventually regress toward the mean. I can't get a clear idea of the winner until I've shown the variation(s) to tens of thousands of visitors and received a few thousand conversions.

I've also learned that it's important to perform tests only on new visitors when possible. That means tests need to run longer to reach the appropriate sample size. If you're testing over a few hundred conversions and including both new and returning visitors, you're probably getting skewed results.

Again, that's just my experience so far; YMMV. One thing to consider is that a test's variations may be too subtle to have a significant, positive impact on conversion.