The red flag here for me was that Optimizely encourages you to stop the test as soon as it "reaches significance." You shouldn't do that. What you should do is precalculate a sample size based on the statistical power you need, which involves determining your tolerance for the probability of making an error and the minimum effect size you need to detect. Then you run the test to completion and crunch the numbers afterward. This helps prevent the scenario where your page tests 18% better than itself by minimizing the probability that your "results" are just a consequence of a streak of positive results in one branch of the test.<p>I was also disturbed that the effect size wasn't taken into account in the sample size selection. You need to know this before you do any type of statistical test. Otherwise, you are likely to get "positive" results that just don't mean anything.<p>OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in a website A/B test, all I'm really concerned about is whether my new page is better than the old page. A one-tailed test tells you that. It might be interesting to run two-tailed tests just so you can get an idea of what not to do, but for this use I think a one-tailed test is fine. It's not like you're testing drugs, where finding any effect, either positive or negative, can be valuable.<p>I should also note that I only really know enough about statistics to not shoot myself in the foot in a big, obvious way. You should get a real stats person to work on this stuff if your livelihood depends on it.
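For anyone who wants to do the pre-test calculation described above, here is a minimal sketch using the standard two-proportion sample-size formula; the baseline rate, minimum detectable effect, alpha, and power are hypothetical inputs, not numbers from the article.

```python
# Hypothetical pre-test sample-size calculation for a two-proportion A/B test.
# Baseline rate, minimum detectable effect, alpha, and power are made-up inputs.
from scipy.stats import norm

def sample_size_per_branch(p_baseline, min_detectable_lift, alpha=0.05, power=0.80):
    """Visitors needed in EACH branch to detect the given absolute lift."""
    p2 = p_baseline + min_detectable_lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # power requirement
    variance = p_baseline * (1 - p_baseline) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p_baseline - p2) ** 2) + 1

# e.g. 3% baseline conversion, want to detect an absolute lift of 0.5 points
print(sample_size_per_branch(0.03, 0.005))  # roughly 19,700 visitors per branch
```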
Note on SumAll<p>Anyone who uses SumAll should be wary of their service. We tried them out and then found out that they had used our social media accounts to spam our followers and users with their advertising. We contacted them asking for answers and never heard back. Our suggestion: avoid SumAll.
This article comes off as a bit boastful and somewhat of an advertisement for the company...<p>"What threw a wrench into the works was that SumAll isn’t your typical company. We’re a group of incredibly technical people, with many data analysts and statisticians on staff. We have to be, as our company specializes in aggregating and analyzing business data. Flashy, impressive numbers aren’t enough to convince us that the lifts we were seeing were real unless we examined them under the cold, hard light of our key business metrics."<p>I was expecting some admission of how their business is actually different/unusual, not just "incredibly technical". Secondly, I was expecting to hear that these "technical" people monkeyed with the A/B testing (or simply over-thought it) which got them into trouble... but no, just a statement about how "flashy" numbers don't appeal to them.<p>I think the article would be much better without some of that background.
>We decided to test two identical versions of our homepage against each other... we saw that the new variation, which was identical to the first, saw an 18.1% improvement. Even more troubling was that there was a “100%” probability of this result being accurate.<p>Wow. Cool explanation of one-tailed and two-tailed tests. Somehow I have never run across that. Here's a link with more detail (I think it's the one the article intended to link, but a different one was used): <a href="http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm" rel="nofollow">http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests...</a>
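To make the one-tailed/two-tailed distinction concrete, here is a minimal sketch using a two-proportion z-test on made-up visitor and conversion counts (not the article's data); the same observed lift clears the usual 0.05 bar one-tailed but not two-tailed.

```python
# One-tailed vs. two-tailed p-values for the same (made-up) A/B counts.
from math import sqrt
from scipy.stats import norm

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
p_one_tailed = 1 - norm.cdf(z)              # "is B better than A?"
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))   # "are A and B different at all?"

# With these counts: one-tailed ~0.04 (looks "significant"), two-tailed ~0.07 (does not).
print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
```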
Oh great, another misuse of A/B testing<p>Here's the thing: stop A/B testing every little thing (and/or "just because") and you'll get more significant results.<p>Do you think the true success of something is due to A/B testing? A/B testing is optimizing, not architecting.
It seems like I see these articles pop up on a regular basis over at Inbound or GrowthHackers.<p>I think the problem is two-sided: part of it is on the tester and part of it is on the tools. The tools' "statistically significant" winners MUST be taken with a grain of salt.<p>On the user side, you simply cannot trust the tools. To avoid these pitfalls, I'd recommend a few key things. One, know your conversion rates. If you're new to a site and don't know its patterns, run A/A tests, run small A/B tests, dig into your analytics. Before you run a serious A/B test, you'd better know historical conversion rates and recent conversion rates. If you know your variances, even better, but you can probably get a heuristic sense of your rate fluctuations just by looking at analytics and doing an A/A test. Two, keep running your tests long after you get a "winning" result. Three, have the traffic. If you don't have enough traffic, your ability to run A/B tests is greatly reduced and you become more prone to making mistakes, because you're probably an ambitious person and want to keep making improvements! The nice thing here is that if you don't have enough traffic to run tests, you're probably better off doing other stuff anyway.<p>On the tools side (and I speak from using VWO, not Optimizely, so things could be different): VWO tags are on all my pages. VWO knows what my goals are. Even if I'm not running active tests on pages, why can't they collect data anyway and get a better idea of what my typical conversion rates are? That way, that data can be included and considered before they tell me I have a "winner". Maybe this is nitpicky, but I keep seeing people who are actively involved in A/B testing write articles like this, and I have to think the tools could do a better job of not steering intermediate-level users down the wrong path, let alone novice users.
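To illustrate why "keep running your tests long after you get a winning result" matters, here is a small simulation with made-up traffic and a made-up baseline rate: checking an A/A test daily and stopping at the first "significant" z-score produces false winners far more often than the nominal 5%.

```python
# Simulated A/A tests: stopping the first time a daily check looks "significant"
# inflates the false-positive rate well above the nominal 5%.
# Traffic level and baseline rate are made-up illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)

def aa_test_with_peeking(daily_visitors=1_000, base_rate=0.03, days=30, z_crit=1.96):
    a_conv = a_n = b_conv = b_n = 0
    for _ in range(days):
        a_conv += rng.binomial(daily_visitors, base_rate)
        b_conv += rng.binomial(daily_visitors, base_rate)
        a_n += daily_visitors
        b_n += daily_visitors
        p_pool = (a_conv + b_conv) / (a_n + b_n)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
        if se > 0 and abs(a_conv / a_n - b_conv / b_n) / se > z_crit:
            return True   # we "found a winner" between two identical pages
    return False

trials = 2_000
false_winners = sum(aa_test_with_peeking() for _ in range(trials))
print(f"declared a false winner in {false_winners / trials:.0%} of A/A tests")
```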
What he did in that article is more commonly known as an "A/A test"<p>Optimizely actually has a decent article on it: <a href="https://help.optimizely.com/hc/en-us/articles/200040355-Run-and-interpret-an-A-A-test" rel="nofollow">https://help.optimizely.com/hc/en-us/articles/200040355-Run-...</a>
I just checked in one possible R calculation of two-sided significance under a binomial model, under the simple null hypothesis that A and B have the same common rate (and that that rate is exactly what was observed, a simplifying assumption), here <a href="http://winvector.github.io/rateTest/rateTestExample.html" rel="nofollow">http://winvector.github.io/rateTest/rateTestExample.html</a> . The long and short of it is that you get slightly different significances depending on what model you assume, but in all cases you should consider it easy to calculate an exact significance subject to your assumptions. In this case it says differences this large would only be seen about 1.8% to 2% of the time (a two-sided test). So the result isn't that likely under the null hypothesis (and then you make a leap of faith that maybe the rates are different). I've written a lot about these topics at the Win-Vector blog <a href="http://www.win-vector.com/blog/2014/05/a-clear-picture-of-power-and-significance-in-ab-tests/" rel="nofollow">http://www.win-vector.com/blog/2014/05/a-clear-picture-of-po...</a> .<p>They said they ran an A/A test (a very good idea), but the numbers seem slightly implausible under the assumption that the two variations are identical (which, again, doesn't immediately imply the two variations are in fact different).<p>The important thing to remember is that your exact significances/probabilities are a function of the unknown true rates, your data, and your modeling assumptions. The usual advice is to control the undesirable dependence on modeling assumptions by using only "brand name" tests. I actually prefer using ad-hoc tests, but discussing what is assumed in them (one-sided/two-sided, pooled data for the null, and so on). You definitely can't assume away a thumb on the scale.<p>Also, this calculation does not compensate for any multiple-trial or early-stopping effects. It (rightly or wrongly) assumes this is the only experiment run and that it was stopped without looking at the rates.<p>This may look like a lot of code, but the code doesn't change over different data.
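For readers who don't want to open the R notebook, here is a rough Python sketch of the same style of calculation described above (a Monte Carlo version, not the author's code); the visitor and conversion counts are placeholders, not the numbers from the linked example.

```python
# Monte Carlo version of the calculation described above: under the simple null
# that A and B share one common rate (fixed at the pooled observed rate), how
# often is the observed rate gap equaled or exceeded in either direction?
# The counts below are placeholders, not the numbers from the linked notebook.
import numpy as np

rng = np.random.default_rng(1)

n_a, conv_a = 5_000, 150      # hypothetical branch A: visitors, conversions
n_b, conv_b = 5_000, 180      # hypothetical branch B: visitors, conversions

observed_gap = abs(conv_a / n_a - conv_b / n_b)
pooled_rate = (conv_a + conv_b) / (n_a + n_b)   # simple null: both branches at this rate

sims = 100_000
sim_a = rng.binomial(n_a, pooled_rate, sims) / n_a
sim_b = rng.binomial(n_b, pooled_rate, sims) / n_b
p_two_sided = np.mean(np.abs(sim_a - sim_b) >= observed_gap)

print(f"two-sided significance ~ {p_two_sided:.3f}")
```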
I would be curious to know what percentage of teams with statisticians / data people actually use tools like Optimizely. A lot of people seem to be building their own frameworks that use a lot of different algorithms (two-armed bandits, etc.). From my understanding, Optimizely is really aimed at marketers without much statistical knowledge.<p>Of course, if you're a startup, building an A/B testing tool is your last priority, so you would use an existing solution.<p>Are there more advanced 'out-of-the-box' tools for testing out there besides the usual suspects, i.e. Optimizely, Monetate, VWO, etc.?
This title used to read "How Optimizely (Almost) Got Me Fired", which is the actual title of the article.<p>It seems a mod (?) changed it to "Winning A/B results were not translating into improved user acquisition".<p>I've seen a descriptive title left by a submitter get changed back to the less descriptive original by a mod. But I'm curious why a mod would editorialize certain titles, changing them away from their originals, yet undo the editorializing of others, changing them back to the less descriptive originals.
> The kicker with one-tailed tests is that they only measure – to continue with the example above – whether the new drug is better than the old one. They don’t measure whether the new drug is the same as the old drug, or if the old drug is actually better than the new one. <i>They only look for indications that the new drug is better...</i><p>I don't understand this paragraph. They only look for indications that the drug is better... than what?
Do any of these tools show you a distribution of the variable you're trying to optimize? I am just thinking that some product features might be polarizing, but if you measure the mean, it might give you different results than expected. I am thinking that's where the two-tailed test comes in.
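A toy example of that concern, with entirely made-up per-user scores: two samples with essentially the same mean, one lukewarm and one polarized, which a mean-only comparison cannot distinguish.

```python
# Two made-up per-user metric samples with (nearly) the same mean but very
# different shapes: a mean comparison alone can't tell them apart.
import numpy as np

rng = np.random.default_rng(2)

lukewarm = rng.normal(loc=5.0, scale=0.5, size=10_000)      # everyone in the middle
polarized = np.concatenate([
    rng.normal(loc=1.0, scale=0.5, size=5_000),             # haters
    rng.normal(loc=9.0, scale=0.5, size=5_000),             # lovers
])

for name, sample in [("lukewarm", lukewarm), ("polarized", polarized)]:
    print(f"{name:9s} mean={sample.mean():.2f} std={sample.std():.2f} "
          f"p10={np.percentile(sample, 10):.2f} p90={np.percentile(sample, 90):.2f}")
```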
Perhaps the most troubling element is that Optimizely seems comfortable claiming 100% certainty in anything. That requires (in Bayesian terminology) infinite evidence, or equivalently (in frequentist terminology), given finite data, an infinite gap between the mean performances.
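A quick illustration of that point, as a sketch with uniform Beta priors and made-up counts (none of these numbers come from the article): even a variant that looks clearly better only earns a posterior probability close to, but short of, 100%.

```python
# With finite data, a Bayesian estimate of P(rate_B > rate_A) approaches but
# never equals 100%. Counts and priors here are illustrative, not the article's.
import numpy as np

rng = np.random.default_rng(3)

n_a, conv_a = 10_000, 300     # hypothetical branch A
n_b, conv_b = 10_000, 360     # hypothetical branch B (looks clearly better)

# Beta(1, 1) uniform priors updated with the observed successes/failures.
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)

prob_b_beats_a = np.mean(samples_b > samples_a)
print(f"P(B > A) ~ {prob_b_beats_a:.4f}  (high, but not 1.0)")
```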
This is all well and good, but if your goal is to see which of X new versions of a page works best and you are rigorous in creating variants, Optimizely is a great tool for figuring out the best-converting variant.
In my experience, Optimizely does everything they can to mislead their users into overestimating their gains.<p>Optimizely is best suited to creating exciting graphs and numbers that will impress the management, which I guess is a more lucrative business than providing real insight.
The headline, and particularly its disparaging of Optimizely, isn't really what this article is about. Might I suggest "The dangers of naive A/B testing", "Buyer beware: A/B methodologies dissected", or "Don't Blindly Trust A/B Test Results"?