
How Optimizely Almost Got Me Fired (2014)

163 points by yanowitz over 9 years ago

24 comments

iamleppert over 9 years ago

A/B testing is high on hype and promise but low on actual results if you follow it through to actual metrics. I've done various forms of A/B testing throughout most of my career and found them to be consistent with the OP's results.

A much better approach is to install significant instrumentation and actually talk to users about what's wrong with your sign-up form.

That, or actually build a product that users want instead of chasing after pointless metrics. I mean, really, you think changing the color of text or a call-out is going to make up for huge deficiencies in your product or make people buy it? The entire premise seems illogical and just doesn't work. The only time I've seen A/B tests truly help was when they accidentally fixed some cross-browser issue or moved a button within reach of a user.

Most of the A/B website optimization industry is an elaborate scam, put off on people who don't know any better and are looking for a magic bullet.
feral over 9 years ago

[I'm a PM @ Optimizely]

We were asked about this article before, on our community forums, and one of our statisticians, David, wrote a detailed reply to this article's concerns about one- vs. two-tailed testing, which might be of interest [3rd from the top]:

https://community.optimizely.com/t5/Strategy-Culture/Let-s-talk-about-Single-Tailed-vs-Double-Tailed/m-p/4220

Additionally, since then, as other commenters have mentioned, we've completely overhauled how we do our A/B-testing calculations, which, theoretically and empirically, now have an accurate false-positive rate even when monitored continuously. Details:

https://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/
yummyfajitas over 9 years ago

Disclaimer: I'm the director of data science at VWO, an Optimizely competitor.

In my view, the issue is not one-tail vs. two-tail tests, or sequential vs. one-look tests at all. The issue is a failure to quantify uncertainty.

Optimizely (last time I looked), our old reports, and most other tools all give you improvement as a single number. Unfortunately that's BS. It's simply a lie to say "Variation is 18% better than Control" unless you had Facebook levels of traffic. An honest statement will quantify the uncertainty: "Variation is between -4.5% and +36.4% better than Control."

When phrased this way, it's hardly surprising that deploying this variation failed to achieve an 18% lift - 18% is just one possible value in a wide range of possible values.

The big problem with this is that customers (particularly agencies who are selling A/B test results to clients) hate it. If we were VC funded, we might even have someone pushing us to tell customers the lie they want rather than the truth they need.

Note that to provide uncertainty bounds like this, one needs to use a Bayesian method (only us, AB Tasty, and Qubit do this, unless I forgot about someone).

(Frequentist methods can provide confidence intervals, but these are NOT the same thing. Unfortunately p-values and confidence intervals are completely unsuitable for reporting to non-statisticians; they are completely misinterpreted by almost 100% of laypeople. http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpretations%20of%20Significance.pdf http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf )
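A minimal sketch of the kind of reporting this comment argues for: an uncertainty interval on relative lift from Beta posteriors via Monte Carlo. The counts, priors, and function name below are made up for illustration; this is not VWO's actual model.

```python
# Sketch: report a credible interval for relative lift instead of a point estimate.
# Independent Beta(1, 1) priors on each conversion rate; numbers are made up.
import numpy as np

def lift_interval(conv_a, n_a, conv_b, n_b, draws=200_000, seed=0):
    rng = np.random.default_rng(seed)
    # Posterior samples of each variation's conversion rate.
    p_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    p_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    lift = (p_b - p_a) / p_a          # relative improvement of B over A
    lo, hi = np.percentile(lift, [2.5, 97.5])
    return lo, hi, (lift > 0).mean()  # interval plus P(B beats A)

lo, hi, p_better = lift_interval(conv_a=600, n_a=4000, conv_b=660, n_b=4000)
print(f"Lift is between {lo:+.1%} and {hi:+.1%} (95% credible); P(B > A) = {p_better:.0%}")
```

With realistic traffic levels the interval is usually wide, which is exactly the point being made above.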
scuba_man_spiff over 9 years ago

One thing I noticed that I haven't seen commented on yet:

The solution mentioned of running a two-tailed test would not have solved the problem of a false result the author demonstrated through conducting an A/A test.

According to the image in the article (http://blog.sumall.com/wp-content/uploads/2014/06/optimizely-test.png), the A/A test had:

A1: Population: 3920, Conversions: 721
A2: Population: 3999, Conversions: 623

Z-score: 3.3
Two-tailed test significance: 99.92%

Looks like the one-tail vs. two-tail test doesn't make a huge difference in this case.

So, maybe a larger sample size would have seen a reversion to the mean, but given the size and high significance that would be unlikely (interesting exercise to try different assumptions to calculate how unlikely, with the most overly generous obviously just being the stated significance).

Yes, the test was only conducted over one day, but if it was the exact same thing being served for both, that shouldn't matter.

If there was a reversion to the mean due to an early spike, we would expect to see the % difference between the two cells narrow as the test kept running. You can see in the chart that the % difference (relative gap between the lines) stays about the same after 8pm on the 9th.

So if it's not the one-tailed test at fault, and it's not the short duration of the test at fault, what is?

Don't know.

I have seen in the past that setup problems are incredibly easy to make with A/B testing tools when implementing the tool on your site. I've seen in other tools things like automated traffic from Akamai only going to the default control, or subsets of traffic such as returning visitors excluded from some cells but not others.

Based on those results, I'd be suspicious of something in the tool setup being amiss.
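Those figures are easy to check with a pooled two-proportion z-test; a quick sketch using only the counts quoted above:

```python
# Reproducing the z-score from the A/A numbers quoted above (pooled two-proportion test).
from math import sqrt, erf

x1, n1 = 721, 3920   # variation A1: conversions, visitors
x2, n2 = 623, 3999   # variation A2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

p_two_tailed = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.4f}")   # ~3.3 and ~0.0008
```

So the two-tailed p-value really is around 0.08%, consistent with the 99.92% significance shown in the screenshot.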
closed over 9 years ago

> This usually happens because someone runs a one-tailed test that ends up being overpowered.

It always pains me a little when people doing research describe statistical power as a type of curse. Overpowered? Should we reduce it? The risk isn't having too much power; the risk is that someone will incorrectly interpret their Null Hypothesis Significance Test (NHST). They need to shift their focus to measuring something (and quantifying the uncertainty of their measurements), rather than thinking of "how likely was this result given a null hypothesis", whether that hypothesis is:

something is not greater than 0 (one-tail), or

something is not 0 (two-tail).

> You'll often see statistical power conveyed as P90 or 90%. In other words, if there's a 90% chance A is better than B, there's a 10% chance B is better than A and you'll actually get worse results.

This isn't necessarily true. A could be the same as B. Also, these tests are being done from the frequentist perspective, so saying "there's X chance B is better than A" is inappropriate, unless you're talking about the conclusions of your significance test (e.g. a 90% chance you correctly detect a difference between them - a difference you assume is fixed to some true underlying value). Overall, being aware that a one-tail test is taking the position that nothing can happen in the other direction is useful, but a good next step is understanding what NHST can and cannot say.

This isn't even a frequentist vs. Bayesian problem, since you could create situations where a person felt a study was overpowered in either framework.
jacalata over 9 years ago

Don't just run tests longer - run tests for a pre-defined amount of time instead of "until you see a result you like".
elliptic over 9 years ago

I agree with the main point of the article, but I'm somewhat disturbed by the statistical errors and misconceptions.

> Few websites actually get enough traffic for their audiences to even out into a nice pretty bell curve. If you get less than a million visitors a month your audience won't be identically distributed and, even then, it can be unlikely.

What is the author trying to say here? Has he thought hard about what it means for "an audience" to be identically distributed?

> Likewise, the things that matter to you on your website, like order values, are not normally distributed

Why do they need to be?

> Statistical power is simply the likelihood that the difference you've detected during your experiment actually reflects a difference in the real world.

Simply googling the term would reveal this is incorrect.
hb42 over 9 years ago

> In most organizations, if someone wants to make a change to the website, they'll want data to support that change.

So true and sad. In all the so-called data-driven groups I have worked for, the tyranny of data makes metrics and numbers the justification for or counter to anything, however they have been put together.

> The sad truth is that most people aren't being rigorous about their A/B testing and, in fact, one could argue that they're not A/B testing at all, they're just confirming their own hypotheses.

The sad truth is that most people aren't being rigorous about anything.
RA_Fisher over 9 years ago

Here's a great article about how Optimizely gets it wrong: http://dataorigami.net/blogs/napkin-folding/17543303-the-binary-problem-and-the-continuous-problem-in-a-b-testing

There are _many_ offenders. I've yet to see a commercial tool that gets it right.

Tragically, the revamp by Optimizely neglects the straightforward Bayesian solution and uses a more fragile and complex sequential technique.
kylerush over 9 years ago

Optimizely rolled out a huge update to the way it handles statistics, called Stats Engine, last year. That update resolves the issues discussed in this article. You can read more about Stats Engine here: https://www.optimizely.com/statistics/
cwyers over 9 years ago

> some testing vendors use one-tailed tests (Visual Website Optimizer, Optimizely)

> Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.

People pay brisk money for this?
stdbrouw over 9 years ago

After reading the blog post and reading through the comments, it looks like people are drawing the wrong conclusion from this. The problem is not that A/B testing is overrated, doesn't work, is bullshit, etc., but that Optimizely used to do it wrong.
erikbern over 9 years ago

> Statistical power is simply the likelihood that the difference you've detected during your experiment actually reflects a difference in the real world.

This seems incorrect to me. Isn't statistical power the likelihood that the null hypothesis would generate an outcome at least as extreme as what you observed?

I'm guessing the issue has a lot more to do with peeking at the outcome and not correcting for it (and similarly running many tests):

http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
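A rough simulation of the peeking problem mentioned here (my own sketch with made-up traffic numbers, not taken from the linked paper): an A/A test with no real difference, re-checked after every batch of visitors, declares a "significant" winner far more often than the nominal 5%.

```python
# Sketch: how often does an A/A test (no real difference) reach |z| > 1.96
# at *some* interim look, if we peek after every batch of visitors?
import random
from math import sqrt

def aa_test_with_peeking(p=0.15, batch=200, looks=50, seed=None):
    rng = random.Random(seed)
    xa = xb = na = nb = 0
    for _ in range(looks):
        xa += sum(rng.random() < p for _ in range(batch)); na += batch
        xb += sum(rng.random() < p for _ in range(batch)); nb += batch
        pool = (xa + xb) / (na + nb)
        se = sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
        if se > 0 and abs(xa / na - xb / nb) / se > 1.96:
            return True      # declared "significant" at some peek
    return False

trials = 1000
false_positives = sum(aa_test_with_peeking(seed=i) for i in range(trials))
print(f"False-positive rate with peeking: {false_positives / trials:.1%}")  # well above 5%
```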
nartz over 9 years ago

In my experience, if you are seeing huge effects like a 60% difference in conversion, you probably did something wrong (too small a sample, didn't wait long enough, etc.) - I've never seen something this large from simply moving things around, changing colors, changing messages, etc.

Also, in general, the more drastic the changes are, the more of an effect you *could* have (up to some percentage). A small change would be changing a message or color - don't expect conversion to change by much. A large change would be changing from a Flash site to an HTML one with a full redesign that loads twice as fast...
forrestthewoods over 9 years ago

I'm increasingly convinced all statistical analysis performed by non-PhDs is no better than a coin flip. Maybe even worse.

My favorite example is still the quite popular "Page Weight Matters" post. I wonder how close they were to abandoning a 90% reduction in size. I wonder how many improvements the world at large has thrown away due to faulty analysis.

http://blog.chriszacharias.com/page-weight-matters
jamiequint over 9 years ago

This is a huge problem with paid marketing as well. Many folks will look at a conversion rate completely ignorant of sample size and allocate thousands of dollars in budget to something which they have no idea performs better or worse.

The real problem (as you allude to in the article) is that the demand for accurate tools is not really there. Vendors don't build in accurate stats because only a tiny portion of their client base understands/demands them.
anarchitect over 9 years ago

There is much more to running experiments properly than it seems. While I'm not an expert on the statistics side, there are a few things I've learned over the years which come to mind...

1) Run the experiment in whole business cycles (for us, 1 week = one cycle), based on a sample size you've calculated upfront (I use http://www.evanmiller.org/ab-testing/sample-size.html). Accept that some changes are just not testable in any sensible amount of time (I wonder what the effect of changing a font will have on e-commerce conversion rate).

2) Use more than one set of metrics for analysis to discover unexpected effects. We use the Optimizely results screen for general steer, but do final analysis in either Google Analytics or our own databases. Sometimes tests can positively affect the primary metric but negatively affect another.

3) Get qualitative feedback either before or during the test. We use a combination of user testing (remote or moderated) and session recording (we use Hotjar, and send tags so we can view sessions in that experiment).
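For context, here is a sketch of the arithmetic behind sample-size calculators like the one linked in point 1, using the standard two-proportion power approximation; the function name and example numbers are made up, so treat it as an approximation rather than a substitute for the calculator.

```python
# Approximate visitors needed per variation for a two-sided two-proportion test
# (standard power calculation; an approximation, not the calculator's exact output).
from math import sqrt
from statistics import NormalDist

def sample_size_per_variation(base_rate, relative_lift, alpha=0.05, power=0.8):
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for a 5% two-sided test
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    num = z_a * sqrt(2 * ((p1 + p2) / 2) * (1 - (p1 + p2) / 2)) \
        + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return int(num ** 2 / (p2 - p1) ** 2) + 1

# Detecting a 10% relative lift on a 3% baseline takes a lot of traffic:
print(sample_size_per_variation(0.03, 0.10))   # roughly 50,000+ visitors per variation
```

The output makes the "some changes are just not testable in any sensible amount of time" point concrete: small lifts on low baseline rates need tens of thousands of visitors per variation.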
filleokus over 9 years ago

Interesting read, might be worth adding (2014) to the title though.
gnicholas over 9 years ago

Almost didn't click on this because the title seems (and is) clickbait-y, but it was actually a very useful read for me.

As a founder, I'm constantly hearing about A/B testing and how great these tools are. I'm not enough of a statistician to know whether everything in this article is true/valid (and would welcome a rebuttal), but the part about regression to the mean really hits home. Encouraging users to cut off testing too early means that you make them feel good ("Look, we had this huge difference!"), when in reality the difference is smaller/negligible.

I'll still do some A/B testing, but given our engineering/time constraints - and my inability to accurately vet the claims/conclusions of the testing software - I won't spend too much time on this.
IndianAstronaut over 9 years ago

A/B testing is too simplistic. Even on my marketing team we have designed more complex metrics to look at a factor's impact on multiple outcomes. The testing is still a straightforward chi-square, but with a bit more depth.
TeMPOraL over 9 years ago

Wow, did not see that coming. This article actually confirms the cynical hypothesis I entertain - that most of the "data-driven" marketing and analytics is basically marketers bullshitting each other, their bosses, their customers and themselves, because nobody knows much statistics and everyone wants to believe that if they're spending money and doing something, it must be bringing results.

Some quotes from the article supporting the cynical worldview:

--

"Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off. But most tests should run longer and in many cases it's likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more."

This basically stops short of implying that Optimizely is doing this totally on purpose.

--

"In most organizations, if someone wants to make a change to the website, they'll want data to support that change. Instead of going into their experiments being open to the unexpected, open to being wrong, open to being surprised, they're actively rooting for one of the variations. Illusory results don't matter as long as they have fodder for the next meeting with their boss. And since most organizations aren't tracking the results of their winning A/B tests against the bottom line, no one notices."

In other words, everybody is bullshitting everybody, but it doesn't matter as long as everyone plays along and money keeps flowing.

--

"Over the years, I've spoken to a lot of marketers about A/B testing and conversion optimization, and, if one thing has become clear, it's how unconcerned with statistics most marketers are. Remarkably few marketers understand statistics, sample size, or what it takes to run a valid A/B test."

"Companies that provide conversion testing know this. Many of those vendors are more than happy to provide an interface with a simple mechanic that tells the user if a test has been won or lost, and some numeric value indicating by how much. These aren't unbiased experiments; they're a way of providing a fast report with great looking results that are ideal for a PowerPoint presentation. *Most conversion testing is a marketing toy, essentially*." (emphasis mine)

*Thank you* for admitting it publicly.

--

Like whales, whose cancers grow so big that the tumors catch their own cancers and die [0], it seems that the marketing industry, a well-known paragon of honesty and teacher of truth, is actually being held down by its own utility makers applying their honourable strategies within their own industry.

I know it's not a very appropriate thing to do, but I *really* want to laugh out loud at this. Karma is a bitch. :)

[0] - http://www.nature.com/news/2007/070730/full/news070730-3.html
jmount over 9 years ago

(As others have mentioned) Optimizely's newer engine uses ideas like Wald's sequential analysis. Here is my article on the topic: http://www.win-vector.com/blog/2015/12/walds-sequential-analysis-technique/
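For readers unfamiliar with the idea, a bare-bones Wald SPRT for a conversion rate looks roughly like the sketch below. It is illustrative only, with made-up rates and thresholds; Optimizely's actual Stats Engine is considerably more sophisticated.

```python
# Bare-bones Wald SPRT for a conversion rate: test H0: p = p0 against H1: p = p1.
# You may look after every visitor; the stopping bounds control the error rates.
import random
from math import log

def sprt(observations, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    upper = log((1 - beta) / alpha)   # cross this: accept H1 (there is a lift)
    lower = log(beta / (1 - alpha))   # cross this: accept H0 (no lift)
    llr = 0.0                         # running log-likelihood ratio
    for i, converted in enumerate(observations, start=1):
        llr += log(p1 / p0) if converted else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1 (lift)", i
        if llr <= lower:
            return "accept H0 (no lift)", i
    return "keep sampling", len(observations)

random.seed(1)
stream = [random.random() < 0.12 for _ in range(20_000)]
print(sprt(stream))   # typically stops well before exhausting the stream
```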
hyperpallium over 9 years ago

(tangent) When I first heard of A/B testing, I thought of combining it with genetic algorithms to evolve the entire site. Just run it till the money rolls in.

Unfortunately, if it did work, it would probably be through something misleading or scammy. Therefore, you need some kind of automatic legality checking... which would be hard.
jbpetersen over 9 years ago

Is anybody out there taking an approach of gradually driving more traffic to whichever option is winning out, and never running 100% with anything specific?
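What this describes is essentially a multi-armed bandit. A minimal Thompson-sampling sketch of the idea is below; the conversion rates and traffic numbers are made up for illustration.

```python
# Sketch of the idea above as a Bernoulli Thompson-sampling bandit:
# traffic drifts toward the better variation instead of a fixed 50/50 split.
import random

true_rates = {"A": 0.10, "B": 0.12}            # unknown in practice; made up here
wins = {k: 1 for k in true_rates}              # Beta(1, 1) priors
losses = {k: 1 for k in true_rates}
shown = {k: 0 for k in true_rates}

for _ in range(50_000):
    # Sample a plausible conversion rate for each arm; show the arm with the best sample.
    arm = max(true_rates, key=lambda k: random.betavariate(wins[k], losses[k]))
    shown[arm] += 1
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print(shown)   # most traffic ends up on B, but A is never fully shut off
```

The trade-off is that bandits optimize traffic allocation rather than producing a clean fixed-horizon significance test, which is why some teams use them for ongoing optimization and reserve classical A/B tests for decisions they need to defend statistically.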