
Stop A/B Testing and Make Out like a Bandit

131 points by fezzl almost 14 years ago

19 comments

paraschopra almost 14 years ago

This is interesting (and I've yet to study the algorithm in detail), but I think you are comparing two dissimilar problems: one is hypothesis testing, the other is continuous optimization.

The reason A/B testing has many parameters is that at the end of data collection it employs a hypothesis test to determine whether the variation performed better than the control (with "better" interpreted according to the parameters you defined). The confidence level and other parameters let you do a proper risk-reward analysis and decide whether to go ahead with the variation. Moreover, a minimum sample size (visitors/conversions) ensures that local (initial) fluctuations do not unnecessarily bias the test results. In fact, in an A/B test the first 100 visitors are virtually identical to the last 100 visitors in their impact on the final results.

However, I'm guessing that in bandit algorithms (whose primary use is continuous optimization, not hypothesis testing) local fluctuations can adversely affect the final results. That may be fine for continuous optimization problems, but not for projects where you need to determine, at a defined confidence level, whether one version works better than the other.

Different needs, different algorithms.
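For concreteness, the fixed-horizon test this comment has in mind might look like the following sketch (a standard two-proportion z-test; the conversion counts and sample sizes are invented):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Fixed-horizon A/B test: two-sided z-test on two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Decide exactly once, after the pre-committed sample size is reached.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=152, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject H0 at the 5% level if p < 0.05
```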
ojilles almost 14 years ago

The article doesn't do a very good job of explaining bandit algorithms. The closest s/he comes is this, but it really doesn't enlighten me:

"So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and a reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem."

Edit: Anyone have better pointers? (Other than the UU article referenced in the post)
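The quoted setup maps directly onto something like this epsilon-greedy sketch (a minimal illustration of one bandit strategy, not necessarily the article's algorithm; the arm names, click rates, and 10% exploration rate are invented):

```python
import random

class EpsilonGreedy:
    """Minimal epsilon-greedy bandit: explore with prob. epsilon, else exploit."""
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}    # times each arm was shown
        self.values = {arm: 0.0 for arm in arms}  # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit best so far

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

bandit = EpsilonGreedy(["buy now", "add to cart"])
for _ in range(10_000):
    arm = bandit.choose()
    # Reward 1 on a click, 0 otherwise, as in the quoted passage.
    reward = 1 if random.random() < (0.05 if arm == "buy now" else 0.04) else 0
    bandit.update(arm, reward)
print(bandit.values)  # running means converge toward the true click rates
```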
coffeemug almost 14 years ago

I think the big issue here is that the bandit problem assumes independence between machines, while that is almost certainly an incorrect assumption when analyzing user behavior. For example, I might be able to increase conversion by changing the button text to "buy now" and changing the background color to black, but not by either change independently. Conversely, changing the button text to "buy now" might hurt conversion if the background color is black, but improve conversion if the background color is white.

Essentially, that means that if I make N changes at the same time and conversion changes, it's not possible to tell which combination of the changes affected conversion, and how (at least not without a new series of hypothesis tests to establish controls). Perhaps only one change made the difference and the rest were irrelevant, perhaps multiple changes did, and so on.

The bandit problem is much simpler because it guarantees that none of the variables depend on each other. Without such a guarantee, we're effectively stuck searching through an exponentially large space of possibilities, and using NHST to tell the difference.
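One standard workaround (a sketch, assuming you can afford the traffic) is to treat every combination of factors as its own arm: interactions between factors then become visible to the bandit, at the cost of an arm count that grows exponentially with the number of factors. The factor values below are invented:

```python
from itertools import product

texts = ["buy now", "purchase"]
colors = ["black", "white"]

# Each (text, color) combination becomes one independent arm, so the
# interaction the comment describes is captured. The cost: 2 factors of
# size 2 -> 4 arms, but 10 binary factors -> 1024 arms.
arms = list(product(texts, colors))
print(arms)
# [('buy now', 'black'), ('buy now', 'white'),
#  ('purchase', 'black'), ('purchase', 'white')]
# These tuples can then be fed to a bandit such as the EpsilonGreedy
# sketch earlier in the thread.
```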
StavrosK almost 14 years ago

This is actually revolutionary, if it works properly. I had never considered the problem as an exploration-vs-exploitation issue, which it clearly is.

Imagine throwing a few alternatives at the problem, in any way you like, and having an algorithm select the best ones automatically and optimally, without you needing to hand-tune anything.

You wouldn't even need to select the best-performing variation, as the algorithm would converge to it in the end. You could also throw in new ones to be tested at any time, or have new ones produced automatically, e.g. for new items in e-stores (we have these new featured items; select the best one to display on the front page).

I'm sure there's a catch, but I don't remember the algorithm very well (and, from what I remember, it looks like there isn't one), and I don't have time to read the paper thoroughly now.
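Adding variations mid-run really is straightforward in, for example, Thompson sampling (a sketch of one common bandit algorithm, not necessarily the one in the article; the variant names and click rates are invented):

```python
import random

# Thompson sampling with Beta(1, 1) priors: sample a plausible click rate
# for each arm and show the arm with the highest sample.
arms = {"variant A": [1, 1], "variant B": [1, 1]}  # [alpha, beta] per arm
true_rates = {"variant A": 0.04, "variant B": 0.05}

for t in range(20_000):
    if t == 5_000:  # a new variation can simply be added mid-run
        arms["variant C"] = [1, 1]
        true_rates["variant C"] = 0.07
    arm = max(arms, key=lambda a: random.betavariate(*arms[a]))
    if random.random() < true_rates[arm]:
        arms[arm][0] += 1   # click
    else:
        arms[arm][1] += 1   # no click

print({a: round(s / (s + f), 3) for a, (s, f) in arms.items()})
# Traffic concentrates on "variant C" soon after it appears.
```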
snippyhollow almost 14 years ago

Relevant: http://explo.cs.ucl.ac.uk/ (International Conference on Machine Learning 2011, workshop on exactly that.)
d2 almost 14 years ago

From the Wikipedia entry:

"Originally [the bandit problem was] considered by Allied scientists in World War II, it proved so intractable that it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it."
noelwelsh almost 14 years ago
Author here. We have a beta implementation of the idea available. Drop me a line (noel at untyped dot com) if you're interested in trying it out.
asharp almost 14 years ago

Just as a general note on repeated significance-testing errors: wouldn't it be possible to run a standard A/B test over a small but not insignificant number of iterations, and then repeat that a number of times?

You could then use Bayes to form a final estimate of, say, H1. As each high-level iteration is fairly small, feedback can be provided to the user, although it couldn't be acted on.

Speaking of which, if we have an expected number of false positives for any given number of test scores, couldn't you treat the number of positives actually generated as a random variable and then test whether it differs from the expected number of false positives?

It seems this type of error relies on the fact that a single false positive stops the testing, rather than continuing on and allowing regression to the mean. Preventing early stopping should then eliminate, or at least reduce, this type of error.
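The error being discussed is the classic "peeking" problem, which a quick simulation can reproduce (a sketch with made-up sample sizes; both arms have the same true rate, so every rejection is a false positive):

```python
import random

def peeking_trial(n=10_000, check_every=200, z_crit=1.96):
    """One A/A test that declares a winner the moment any interim look
    crosses the 5% significance threshold."""
    a = b = 0
    for i in range(1, n + 1):
        a += random.random() < 0.05   # both arms share the same
        b += random.random() < 0.05   # true 5% conversion rate
        if i % check_every == 0:
            pool = (a + b) / (2 * i)
            se = (pool * (1 - pool) * 2 / i) ** 0.5
            if se > 0 and abs(a - b) / i / se > z_crit:
                return True           # false positive: stopped early
    return False

false_pos = sum(peeking_trial() for _ in range(500)) / 500
print(f"false positive rate with peeking: {false_pos:.0%}")  # well above 5%
```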
StavrosK almost 14 years ago

historious cache, because the site is intermittently going down for me: http://cache.historious.net/cached/1369875/
asharp almost 14 years ago

Interesting.

Some of the claims made seem strange. Adding in additional choices is fine, dealing with multiple choices is fine, modifying each page as you give it to the user is fine: you're just adding assumptions which, when wrong, would completely ruin your test. Similarly, those failures would completely ruin a bandit algorithm, because it relies on a much larger set of assumptions than a standard A/B test.

One quick example: you lose temporal independence while testing X and Y. For the first 15 rounds, X's metric is 100 and Y's is -100; after that, they are reversed. With an epsilon-first bandit algorithm and N=15, the algorithm will simply choose X forever.

That said, these are a very interesting set of algorithms, and it'd be interesting to see how brittle they are in practice.
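The failure mode described is easy to reproduce (a sketch of an epsilon-first strategy: explore uniformly for the first N rounds, then exploit the leader forever; the payoffs follow the comment's example):

```python
N = 15  # exploration phase length, as in the comment's example

def reward(arm, t):
    """Non-stationary payoffs: X pays 100 and Y pays -100 for the first
    15 rounds, then the two arms swap forever."""
    good = "X" if t < 15 else "Y"
    return 100 if arm == good else -100

totals = {"X": 0.0, "Y": 0.0}
counts = {"X": 0, "Y": 0}
history = []
for t in range(1_000):
    if t < N:
        arm = "XY"[t % 2]  # explore: alternate between the two arms
    else:
        arm = max(totals, key=lambda a: totals[a] / counts[a])  # exploit
    counts[arm] += 1
    totals[arm] += reward(arm, t)
    history.append(arm)

# X's running mean decays toward -100 but never drops below Y's -100,
# so the stale leader is exploited forever.
print(set(history[N:]))  # {'X'}
```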
wccrawford almost 14 years ago

It sounds exactly like A/B testing, but using a specific algorithm to determine the winner.

It talks about comparing the current situation to the best situation, but in most A/B testing A would be the current "best" and B would be the challenger. Same thing.

It also talks about a reward for certain button presses, but that isn't actually what you want to optimize. You want to optimize revenue, so it's possible this could send you down the wrong path.

And if it's saying you should pit the current site against the best the site has done historically, that's ridiculous. You couldn't possibly control for all of the factors involved. That's why A/B testing is special: all the other factors are guaranteed to be as identical as possible.
wglb almost 14 years ago

So we have Patrick's described experiences of how A/B testing produces measurable results for him, with only a tiny bit of theory.

Now we have bandit theory ready for market, saying that A/B is not optimal.

I know you think I'm about to say "premature optimization". But instead I'll just ask: what results does the group at Untyped have to show versus Patrick's freely available, simple-to-implement, proven results?
jamescoops almost 14 years ago

Is this the same as multivariate testing?
tintin almost 14 years ago

Maybe a little off-topic, but are there places where you can find tips on best-practice content? To be clear: I once read that a button labeled "read more" is not very good because people don't like to read, but when you name it something like "more about this subject" people get greedy.
michaelfairley almost 14 years ago

The multi-armed bandit solution has seen most of its real-world use in medical trials. Randomly assigning patients to treatments when you have more information available is highly unethical, as it can literally result in extra loss of life.
jasonkolb almost 14 years ago

There is a standard way to do this called Taguchi testing that has been around in the manufacturing world for years. I have a Java API that does it that I've been thinking about open-sourcing; ping me if you have any interest in it.
d2 almost 14 years ago

http://www.math.ucla.edu/~tom/Stopping/Contents.html

See chapter 7.
fedd almost 14 years ago

It says "Error establishing a database connection" when I click the link. It's true, and even a bit on topic ;)

Edit: hallelujah, the DB is up now, after 30 minutes. Can somebody upvote me back now? ;)
hammerbrostime almost 14 years ago
Anything simple out there to make it available to the masses?