How Not To Run An A/B Test

124 points by TimothyFitz, about 15 years ago

11 comments

btilly, about 15 years ago
This is an important thing to be aware of, but I wouldn't take the numbers strictly at face value. Repeated peeks at the same running experiment are *not* independent of each other. Furthermore, once the underlying difference between A and B starts asserting itself statistically, it doesn't stop. And finally, a chance fluctuation in the opposite direction from an underlying difference has to be much larger to reach statistical significance than one in the same direction. These are massive complications that make the statistics very hard to calculate.

I addressed this in my 2008 tutorial on A/B testing at OSCON. I ran Monte Carlo simulations of an A/B test while continuously following the results, with various sets of parameters, running the test to different confidence levels. In that model I peeked at every single data point. You can find the results starting at http://elem.com/~btilly/effective-ab-testing/#slide59 (see http://meyerweb.com/eric/tools/s5/features.html#controlchart for the keyboard shortcuts to navigate the slides).

My advice? Wait until you have at least a certain minimum sample size before deciding. Only decide there with high certainty. And then, the longer the experiment runs, the lower the confidence you should be willing to accept. This procedure lets you stop most tests relatively fast while still avoiding significant mistakes.
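
A minimal sketch of the peeking effect described above (hypothetical Python, not the OSCON simulation code; it assumes an A/A test with no real difference and checks for significance after every batch of visitors):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_trials=1000, n_visitors=5000,
                                batch=250, alpha=0.05, rate=0.10):
    """Fraction of A/A tests (identical true conversion rates) that hit
    'significance' at least once when checked after every batch."""
    false_positives = 0
    for _ in range(n_trials):
        a = rng.random(n_visitors) < rate   # conversions for variant A
        b = rng.random(n_visitors) < rate   # conversions for variant B
        for n in range(batch, n_visitors + 1, batch):
            table = [[a[:n].sum(), n - a[:n].sum()],
                     [b[:n].sum(), n - b[:n].sum()]]
            if stats.chi2_contingency(table)[1] < alpha:
                false_positives += 1
                break
    return false_positives / n_trials

# With no true difference, repeatedly peeking pushes the error rate
# well above the nominal 5%.
print(peeking_false_positive_rate())
```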
nkurz, about 15 years ago
It's a good article, and a good intro to the pitfalls of statistical interpretation, but I think it reaches the wrong conclusion. Yes, when one has a very limited data set, needs to draw a conclusion in a hurry, and has full confidence that there are no confounding variables in one's experiment, then paying very close attention to small differences in p-values can make sense. But how often is this the case when testing a new logo or signup page?

I'm less mathematically sophisticated than the author, and would choose a simpler approach: ignore weak results. If one determines that there is a 95% chance that 51% of people prefer Logo A, either stick with what you have, go with the one you like, or keep searching for a better logo. If you can't see the effect in the raw data without rigorous mathematical analysis, it's probably not a change worth spending much time on.

Instead of adjusting your significance test for each 'peek', simply ignore anything less than 99.9% 'significant'. And while you are at it, ignore anything that's less than a 10% improvement, on the assumption that structural errors in your testing are likely to overwhelm any effects smaller than this. Drug trials and the front page of Google aside, if the effect is so small that it flips into and out of 'significance' each time you peek, it's probably not the answer you want.
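
A sketch of the "ignore weak results" filter proposed above, assuming each test result carries its confidence level and observed lift (the field names and numbers are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    confidence: float  # e.g. 0.999 means 99.9% significant
    lift: float        # relative improvement of B over A, e.g. 0.12 = 12%

def worth_acting_on(result: TestResult,
                    min_confidence: float = 0.999,
                    min_lift: float = 0.10) -> bool:
    """Act only on large, highly significant effects; ignore the rest."""
    return result.confidence >= min_confidence and abs(result.lift) >= min_lift

results = [TestResult("new logo", 0.96, 0.02),
           TestResult("shorter signup", 0.9995, 0.18)]
print([r.name for r in results if worth_acting_on(r)])  # ['shorter signup']
```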
patio11, about 15 years ago
This is an important enough usage note that I'll probably mention it in my software's documentation. I personally largely ignore this issue and think I'm probably safe doing so with my usual testing workflow, but it *is* an easy thing to burn yourself on if you sit and watch your dashboard all day.
jacquesm, about 15 years ago
When I do stuff like this, I purposefully ignore the results for the time I've set for the experiment. It's very easy to fall prey to thinking you have a result that will not change over the longer term. Things like daily and weekly cycles, for instance, can really throw off your analysis.

The only danger is having a 'hidden variable' influence your results, where averaging over the longer term masks that influence. For example, if you are not geo-targeting your content, you could conclude after a long run of testing that a certain page performs better than another, only to throw away the averaged-out effect of having the different pages up at different times of the day, with one of them performing significantly better for one audience and vice versa.

So you should keep all your data in order to figure out whether such masking is happening and giving you results that are good but could be even better.
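
A toy illustration of the masking described here, with made-up numbers; the point is only that a near-tie in the pooled data can hide a clear per-segment split:

```python
# Conversions / visitors per segment (hypothetical data).
data = {
    "EU": {"A": (120, 1000), "B": (150, 1000)},  # B wins in Europe
    "US": {"A": (180, 1000), "B": (155, 1000)},  # A wins in the US
}

def rate(conversions, visitors):
    return conversions / visitors

for page in ("A", "B"):
    conv = sum(data[seg][page][0] for seg in data)
    total = sum(data[seg][page][1] for seg in data)
    print(f"overall {page}: {rate(conv, total):.2%}")

for seg in data:
    for page in ("A", "B"):
        print(f"{seg} {page}: {rate(*data[seg][page]):.2%}")

# Overall, A and B look nearly identical (15.00% vs 15.25%), but each
# segment has a clear winner that the pooled number alone would hide.
```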
paraschopra, about 15 years ago
This is an interesting issue, and I have seen users of my app (Visual Website Optimizer) complaining that their results were statistically significant a day before but now they aren't. Justifiably, they expect significance to freeze in time once it has been achieved. However, as you say, significance is also a random function and not necessarily monotonically increasing or decreasing.

The constraint here is not the math or the technology; rather, it is users' needs. They want data, reporting, and significance calculation to be done in real time. And even though we have a test duration calculator, I haven't seen any user actually making use of it. Plus, many users will not even wait for statistical significance to be achieved.

In VWO, we would love to wait until the end of the experiment to calculate significance, but I'm sure the users wouldn't like that at all.
ryanjmo, about 15 years ago
While I understand the merit of what this article is saying, I really want to caution against always requiring a strict high confidence level when making decisions in a start-up. Requiring a strict confidence level does make sense for a company like Zynga, which has a nearly limitless supply of users to run tests on, but for a start-up the value of being able to make a decision quickly often outweighs the value of being '95% confident'. Let's not forget all the time wasted worrying about the details of this math.

In my opinion, peek early and often, and when your gut tells you something is true, it probably is.

It is actually a mathematical fact that if at any point in your A/B test A is bigger than B, then based on that data there is at least a 50% probability that asymptotically A is bigger than B.
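
One way to make that last claim concrete is the standard Beta-Binomial calculation of the probability that A's true rate exceeds B's given the counts seen so far (a sketch with made-up counts, not anything from the comment itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, samples=200_000):
    """Monte Carlo estimate of P(true rate of A > true rate of B)
    under independent uniform Beta(1, 1) priors."""
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (a > b).mean()

# If A is even slightly ahead on the observed data, this probability is
# above 0.5 -- though with small samples it is only barely above 0.5.
print(prob_a_beats_b(12, 100, 10, 100))
```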
carbocation, about 15 years ago
The first calculation that the author sets up is a power calculation, which is a strong start. Based on your expectations about the effect size of the treatment (in this case, the difference between A and B) and your desired probability of correctly identifying a difference (the power, which is one minus beta), you can figure out how large a sample you need to see an effect.

If you're going to take several peeks as you run your trial and you want to be particularly rigorous, consider alpha spending functions. In medicine, alpha spending functions are often used to take early looks at trial results. 'Alpha' is what you use to determine which p-values you will consider significant. To oversimplify a bit, early peeks (before you've got your full sample size) use very extreme alphas. If your trial ultimately uses an alpha of 0.05, a prespecified early look may use an alpha of 0.001. (There are ways of calculating meaningful alpha values; these are just examples drawn from a hat.)

By setting useful alphas and betas, you can benefit from true, potent treatment effects (if present) earlier than you might otherwise, without too much risk of identifying spurious associations.
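
A sketch of the two ideas in this comment, using statsmodels for the sample-size (power) calculation and an O'Brien-Fleming-style boundary as one example of an alpha spending schedule; the conversion rates, power, and look times are illustrative assumptions, not the comment's own numbers:

```python
from scipy import stats
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Power calculation: sample size per arm to detect a lift from a 10%
# to an 11% conversion rate with alpha = 0.05 and 80% power.
effect = proportion_effectsize(0.11, 0.10)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.80, ratio=1.0)
print(f"needed per arm: {n_per_arm:.0f}")

# O'Brien-Fleming-style alpha spending: how much of the overall
# alpha = 0.05 may be 'spent' at each interim look. Early looks get
# only a tiny slice, so an early 'significant' result must be extreme.
alpha, looks = 0.05, [0.25, 0.50, 0.75, 1.00]   # information fractions
z = stats.norm.ppf(1 - alpha / 2)
cumulative = [2 * (1 - stats.norm.cdf(z / (t ** 0.5))) for t in looks]
spent_per_look = [cumulative[0]] + [c - p for p, c in zip(cumulative, cumulative[1:])]
print([round(s, 5) for s in spent_per_look])
```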
shalmanese, about 15 years ago
Why are we still using p < 0.05 for web A/B testing? p < 0.05 made sense when each individual data point cost real money to generate: grad students interviewing participants or geologists making individual measurements. p < 0.05 was a good tradeoff between certainty and cost.

Now, in the world of the web, where measurement has an upfront cost but zero incremental cost, why not move to p < 0.001 or p < 0.0001? Sure, you need to increase the amount of data you're gathering by a factor of 2 or 3, but that's so much easier than delving into the epistemological complexities of p < 0.05.
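
To put a rough number on that cost, here is a sketch of the required sample size per arm at different significance thresholds for the same effect and power (statsmodels again; the 10% vs 11% conversion rates are made-up):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.11, 0.10)   # detect a 10% -> 11% conversion rate
for alpha in (0.05, 0.001, 0.0001):
    n = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                     power=0.80, ratio=1.0)
    print(f"p < {alpha}: about {n:,.0f} visitors per arm")

# Tightening alpha from 0.05 to 0.001 roughly doubles the sample size,
# which is cheap when each extra visitor costs nothing to measure.
```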
harisenbon, about 15 years ago
While interesting, I think this is more of a mathematical proof of something that anyone doing any sort of testing should remember: don't stop before the test is complete just because you've gotten *an* answer.

I generally leave my A/B tests up well after I've gotten a significance report, mostly because I'm lazy, but also because I know that given enough time and enough entries, the significance reports can change.

Especially in the multivariate tests that Evan wrote about, just because you get one result as significant doesn't preclude other possibilities from also being significant.
marciovm123, about 15 years ago
I had a statistics course at MIT where the professor would bring up one of the solutions to this problem at least once a week for the entire semester: http://en.wikipedia.org/wiki/Bonferroni_correction
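
A minimal sketch of a Bonferroni correction applied to repeated looks at a test (the p-values are hypothetical); the idea is simply to divide the overall alpha by the number of looks or comparisons:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return which comparisons remain significant after dividing the
    overall alpha by the number of comparisons (or planned peeks)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five looks at a running test: only the last survives the correction.
print(bonferroni_significant([0.04, 0.03, 0.06, 0.02, 0.004]))
# [False, False, False, False, True]
```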
khafra, about 15 years ago
Interesting that frequentist A/B software packages let you essentially break the test without telling you. Are there Bayesian A/B testers that give you a likelihood ratio instead?
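
A sketch of the kind of likelihood-ratio output asked about here: a Bayes factor comparing "A and B share one conversion rate" against "A and B have different rates" under Beta-Binomial models (uniform priors and made-up counts are assumptions of this sketch):

```python
import math
from scipy.special import betaln

def log_bayes_factor(conv_a, n_a, conv_b, n_b):
    """log BF10 for 'different rates' vs 'same rate' with uniform
    Beta(1, 1) priors. The binomial coefficients are identical in both
    marginal likelihoods, so they cancel and are omitted."""
    log_m1 = (betaln(1 + conv_a, 1 + n_a - conv_a)
              + betaln(1 + conv_b, 1 + n_b - conv_b))
    log_m0 = betaln(1 + conv_a + conv_b, 1 + (n_a - conv_a) + (n_b - conv_b))
    return log_m1 - log_m0

# BF10 > 1 favours a real difference; BF10 < 1 favours 'no difference'.
print(math.exp(log_bayes_factor(120, 1000, 150, 1000)))
```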