There's a related notion of an "adaptive" statistical design, where the allocation of users to each test group varies based on the groups' performance so far. For example, if after the first 100 users group A seems to be doing slightly better than group B, you favor it by allocating more users to that group. You can compute this allocation so as to maximize the number of successes; in particular, it will eventually converge to always picking the better approach, assuming there is a real difference. This also means you don't really need to "stop" the experiment to make everyone use the better version (although you may want to for other reasons).<p>Here is one paper: <a href="http://web.eecs.umich.edu/~qstout/pap/SciProg00.pdf" rel="nofollow">http://web.eecs.umich.edu/~qstout/pap/SciProg00.pdf</a>
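To make the idea concrete, here is a minimal sketch of one such adaptive scheme (Thompson sampling); it illustrates the general approach rather than the specific design in the linked paper, and the 5%/6% conversion rates are made up.

```python
import random

def thompson_allocate(n_users, p_a=0.05, p_b=0.06):
    """Adaptively allocate users to A or B, favoring whichever looks better so far."""
    stats = {"A": [1, 1], "B": [1, 1]}      # [successes, failures], i.e. a Beta(1, 1) prior
    true_rate = {"A": p_a, "B": p_b}        # unknown in practice; assumed here for simulation
    for _ in range(n_users):
        # Draw a plausible conversion rate for each group from its current posterior
        draws = {g: random.betavariate(s, f) for g, (s, f) in stats.items()}
        group = max(draws, key=draws.get)   # send this user to the group that looks better
        converted = random.random() < true_rate[group]
        stats[group][0 if converted else 1] += 1
    return stats

print(thompson_allocate(10_000))  # allocation drifts toward the truly better group
```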
<i>If you aim to make inferences about which ideas work best, you should pick a sample size prior to the experiment and run the experiment until the sample size is reached.</i><p>That's not a very Bayesian thing to say. It doesn't matter what sample size you decided on at the beginning: a Bayesian method should yield reasonable results at every step of the experiment, and it lets you keep testing until you feel comfortable with the posterior probability distributions.<p>If 10 customers have converted so far and 30 haven't, then you would expect the conversion rate to be somewhere between 10% and 40%, as suggested by this graph of the Beta(10, 30) distribution:<p><a href="http://www.wolframalpha.com/input/?i=plot+BetaDistribution+10+30" rel="nofollow">http://www.wolframalpha.com/input/?i=plot+BetaDistribution+1...</a><p>You then do the same with method B, and stop testing once the overlap between the two probability distributions looks small enough.<p>Anscombe's rule is interesting, but it seems rather critically dependent on the number of future customers, which is hard to estimate. The advantage of the visual approach outlined above is that it's more intuitive, and people can use their best judgment to decide whether to keep testing or not.<p><i>Disclaimer</i>: I am not an A/B tester.
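For what it's worth, here is a rough Monte Carlo sketch of that comparison; it follows the comment's Beta(conversions, non-conversions) convention, and the counts for method B are invented.

```python
import random

def prob_b_beats_a(conv_a, fail_a, conv_b, fail_b, draws=100_000):
    """Estimate P(rate_B > rate_A) by sampling from both Beta posteriors."""
    wins = 0
    for _ in range(draws):
        a = random.betavariate(conv_a, fail_a)
        b = random.betavariate(conv_b, fail_b)
        wins += b > a
    return wins / draws

# A: 10 conversions, 30 non-conversions; B (hypothetical): 14 conversions, 26 non-conversions
print(prob_b_beats_a(10, 30, 14, 26))  # stop once this is comfortably near 0 or 1
```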
This is a powerful approach when you can quantify your regret. For many startups, however, it's important to understand the tradeoffs involved in moving one metric up or down. To take Zynga as an example, they care about virality at least as much as engagement (or perhaps more so). Adding or removing a friendspam dialog is likely to trade some virality for user experience. What percentages make or break the decision? Sometimes this is a qualitative call.<p>In environments where you need to look at the impact of your experiments across multiple variables, and make a subjective call about the tradeoffs, it's really important to have statistical confidence in the movement of each variable you're evaluating. This is a key strength of the traditional A/B testing approach.
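As a hedged sketch of what "statistical confidence in each variable" might look like in practice, the following runs a two-proportion z-test per metric; the counts and metric names are invented, and 1.96 is the usual two-sided 95% cutoff.

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Check each metric separately, then make the tradeoff call by hand
metrics = {"virality": (300, 5000, 350, 5000),      # e.g. invited a friend
           "engagement": (1200, 5000, 1150, 5000)}  # e.g. returned the next day
for name, counts in metrics.items():
    z = two_proportion_z(*counts)
    print(name, round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
```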
May I ask how this is Bayesian in any way? I understand that using the term Bayesian is good for directing clicks to a site, but this looks like good old-fashioned frequentist math. None of the hallmarks of a Bayesian approach are here: maintaining a distribution over hypotheses, specifying an explicit prior, computing posterior distributions.<p>I have some experience with the medical trial literature, specifically with bandit algorithms and with using cumulative regret versus other statistical measures like PAC frameworks. Regret is most certainly not a Bayesian idea; rather, you are explicitly modeling the cost of each action (serving the A or B variant to a user) instead of assuming all costs are equal.<p>Yes, this is a better approach because it explicitly models the costs associated with the exploration/exploitation dilemma. But it is not Bayesian.
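For clarity, cumulative regret is just the expected reward given up by not always playing the best arm. A minimal sketch, with made-up conversion rates and a deliberately naive random policy:

```python
import random

def cumulative_regret(n_users, p=(0.05, 0.06)):
    """Expected conversions lost versus always showing the (unknown) best variant."""
    best = max(p)
    regret = 0.0
    for _ in range(n_users):
        arm = random.randrange(len(p))   # stand-in policy: a 50/50 A/B split
        regret += best - p[arm]          # expected loss on this one user
    return regret

print(cumulative_regret(10_000))  # ~50 conversions forgone by the 50/50 split here
```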
This gets into the nitty gritty of running trials (A/B, split testing). If things like this get baked into libraries, they have a chance of pushing the state of the art forward.<p>Very worthy of an HN post.<p>EDIT: Actually, check out their entire blog. It's worth your time.
This description of content optimization using bandit algorithms sounds like an even better approach: <a href="http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/" rel="nofollow">http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-m...</a><p>That company has already built a web app and service, Myna, that optimizes content using that approach: <a href="http://www.mynaweb.com/" rel="nofollow">http://www.mynaweb.com/</a>. A simulated experiment showed their approach to be better than A/B testing: <a href="http://www.mynaweb.com/blog/2011/09/13/myna-vs-ab.html" rel="nofollow">http://www.mynaweb.com/blog/2011/09/13/myna-vs-ab.html</a>. Myna's website doesn't say, though, whether it is currently free, or what its pricing will be when it goes out of beta.
It's worth noting that the zoomed-in graph (the 4th image), while it correctly shows that using significance as a stopping rule could cause problems, also clearly shows that the classical test is far more powerful for n < 2000, i.e. it detects a real difference with greater sensitivity.<p>So while Anscombe's rule looks good for massive numbers of users, smaller tests with predefined stopping rules can be more useful if you only have a few thousand observations.
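To make the power point concrete, here is a rough simulation of how often a fixed-sample two-proportion test reaches significance at a given n; the 5% vs. 6% rates and the n = 2000 per group are assumptions for illustration, not figures from the article.

```python
import random
from math import sqrt

def power(n, p_a=0.05, p_b=0.06, trials=2000, z_crit=1.96):
    """Fraction of simulated experiments where the classical test declares significance."""
    hits = 0
    for _ in range(trials):
        conv_a = sum(random.random() < p_a for _ in range(n))
        conv_b = sum(random.random() < p_b for _ in range(n))
        pa, pb = conv_a / n, conv_b / n
        pool = (conv_a + conv_b) / (2 * n)
        se = sqrt(pool * (1 - pool) * 2 / n)
        hits += abs(pb - pa) / se > z_crit
    return hits / trials

print(power(2000))   # estimated power with a few thousand observations per group
```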
"k is the expected number of future users who will be exposed to a result"<p>Does this mean that this approach does not make much sense if your estimate of k is totally wrong?<p>How do you estimate k?
There is nothing wrong with studying this approach and trying it out to see whether the interpretations are more helpful. However, insufficient data and insufficient technique are common when studying extremely complex systems; this Bayesian approach makes assumptions that may not be correct.