> Stop saying: “We’ve reached 95% statistical significance.”

> And start saying: “There’s a 5% chance that these results are total bullshit.”

Argh, no, no, no and no!

95% significance is NOT 95% probability! When you select a confidence level of 95%, the probability that your results are nonsense is ZERO or ONE. There is no probability statement associated with it. Just because something is unknown does not mean you can make a probability statement about it, and the mathematics around statistical testing all depend on the assumption that the parameter being tested is not random, merely unknown...

Rather, 95% statistical significance means: we got this number from a procedure that 95% of the time produces the right thing, but we have no idea whether this particular number we got is correct or not.

UNLESS!

Unless you're doing Bayesian stats. But in that case your procedure looks completely different and produces very different probability intervals instead of confidence intervals, and you don't talk about statistical significance at all, but about raw probabilities.
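A minimal simulation of that coverage idea (not from the parent comment; the known-variance normal model and all numbers are invented just to keep it short): across many repeated experiments, about 95% of the intervals contain the true value, but any single interval either does or doesn't.

    # Sketch: "95%" is a property of the procedure, not of any single interval.
    # Assumes a known-variance normal mean, purely for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, sigma, n, trials = 0.5, 1.0, 100, 10_000
    z = 1.96  # two-sided 95% critical value

    covered = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sigma, n)
        half_width = z * sigma / np.sqrt(n)
        lo, hi = sample.mean() - half_width, sample.mean() + half_width
        covered += (lo <= true_mean <= hi)  # for any one interval this is 0 or 1

    print(covered / trials)  # ~0.95 across many repetitions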
No.

In frequentist thinking, p = 0.05 means that if there were in reality no difference between your A and B and you repeated the experiment many times, 5% of the observed differences would be equal to or greater than the difference you just measured.

No probabilistic statement about the results being correct or incorrect can be made from a null-hypothesis significance test.
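A sketch of that repeated-sampling statement, assuming a two-proportion z-test with identical true conversion rates (all parameters below are made up): when there is no real difference, roughly 5% of repeated experiments produce a difference extreme enough to give p <= 0.05.

    # Sketch: with no true difference, ~5% of repeated experiments yield p <= 0.05.
    # Two-proportion z-test; sample sizes and rates are hypothetical.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    p_true, n, trials = 0.10, 5_000, 10_000

    false_positives = 0
    for _ in range(trials):
        a = rng.binomial(n, p_true)
        b = rng.binomial(n, p_true)
        p_pool = (a + b) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        z = (b / n - a / n) / se
        p_value = 2 * norm.sf(abs(z))
        false_positives += (p_value <= 0.05)

    print(false_positives / trials)  # ~0.05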
I've long argued that the biggest problem with orthodox NHST for A/B testing is that you don't actually care about 'significance of effect' as much as you do 'magnitude of effect'. Furthermore, p-values tell you nothing about the range of possible improvements (or lack thereof) you're facing. Maybe you are willing to risk potential losses for potentially huge gains, or maybe you can't afford to lose a single customer and would rather exchange time for certainty.

My favored approach, which I've outlined here[0], treats the problem as one of Bayesian parameter estimation. Benefits include:

1. Output is a range of possible improvements, so you can reason about risk/reward when calling a test early.

2. Allows the use of prior information to prevent very early stopping, and provides better estimates early on.

3. Every piece of the testing setup is, imho, easy to understand (ignore this benefit if you can comfortably derive Student's t-distribution from first principles).

[0] https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-testing
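For the curious, a minimal sketch of the Beta-Binomial flavour of Bayesian parameter estimation this comment describes (the prior and the conversion counts are hypothetical, and this is not necessarily the exact model in the linked post): the output is a distribution over relative improvement rather than a single p-value.

    # Sketch: Bayesian A/B as parameter estimation with a Beta-Binomial model.
    # Counts and the weak prior below are hypothetical, for illustration only.
    import numpy as np

    rng = np.random.default_rng(2)
    prior_a, prior_b = 3, 27          # weak prior centred near a 10% conversion rate
    conv_a, n_a = 120, 1_000          # variant A: conversions / visitors
    conv_b, n_b = 140, 1_000          # variant B

    post_a = rng.beta(prior_a + conv_a, prior_b + n_a - conv_a, 100_000)
    post_b = rng.beta(prior_a + conv_b, prior_b + n_b - conv_b, 100_000)
    lift = (post_b - post_a) / post_a  # relative improvement of B over A

    print("P(B beats A):", (lift > 0).mean())
    print("95% credible interval for lift:", np.percentile(lift, [2.5, 97.5]))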
Lots of nit-picking here. In plain English, confidence intervals are about the chance that your results are bogus. You flipped 100 coins, all of them came up heads, you conclude 100% of coin tosses come up heads. By chance, you got a very unlikely sample that differed substantially from the population. You could also conclude your A/B test is a success when it was just randomly atypical.
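A quick back-of-the-envelope version of that "randomly atypical sample" point, with an arbitrary sample size and cutoff: a fair coin will occasionally look convincingly biased by chance alone.

    # Sketch: how often does a fair coin, by chance alone, look clearly biased?
    # Sample size and the "looks biased" cutoff are arbitrary, for illustration.
    from scipy.stats import binom

    n = 100
    # Probability of 60+ heads or 60+ tails from a fair coin:
    p_extreme = binom.sf(59, n, 0.5) + binom.cdf(40, n, 0.5)
    print(p_extreme)  # ~0.057 -- rare, but it will happen across many tests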
You have to wonder: what else from his junior year in college did Mr. Avshalomov get completely wrong?

How many of the recent YC graduates fail at basic numeracy? Does node.js mean you don't have to understand data structures and algorithms to successfully "preneur" too?

I mean, in finance this won't do. Or in consulting. So there's adverse selection to worry about, too.
I'm not a statistician, but lately I've been wondering:

When we're A/B testing code, the code is already written. If there's a 5%, or even 15%, chance of it being bullshit, who cares? The effort is usually exactly the same whether I switch or not.

It's my understanding that 95%, 99%, etc., were established for things that require extra change. We don't want to spend extra time developing and marketing a new drug if it isn't effective. We don't want to tell people to do A instead of B if we aren't sure A is really better than B.

But in software I've already spent all the time I need to implement the variation on the feature. So given that, why do I need 95%?

I would appreciate it if someone with more knowledge could answer this question.

Edit to add: I see a lot of answers about the cost to keep the code around. What about A/B tests that don't require extra code, just different code? Most of our A/B tests fall into this category.
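One way to read the question is as an expected-value calculation. A hedged sketch with entirely hypothetical numbers: when the switch cost is already sunk and the downside of shipping a neutral change is tiny, the evidence bar can arguably be much lower than for, say, approving a drug.

    # Sketch of the commenter's intuition as an expected-value calculation.
    # All numbers are hypothetical; the point is the asymmetry of costs.
    def expected_gain(p_real, lift_if_real, loss_if_not, switch_cost):
        """Expected value of shipping variant B over keeping A."""
        return p_real * lift_if_real - (1 - p_real) * loss_if_not - switch_cost

    # Drug-trial-like setting: acting on a false positive is very expensive.
    print(expected_gain(p_real=0.80, lift_if_real=1.0, loss_if_not=10.0, switch_cost=5.0))  # negative

    # Typical A/B setting: code already written, downside of a neutral change ~ 0.
    print(expected_gain(p_real=0.80, lift_if_real=1.0, loss_if_not=0.1, switch_cost=0.0))   # positive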
Is it just me, or does this sentence make no mathematical sense at all?

"If you’re running squeaky clean A/B tests at 95% statistical significance and you run 20 tests this year, odds are one of the results you report (and act on) is going to be straight up wrong."
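For what it's worth, the arithmetic behind the quoted sentence only works out if you read "wrong" as "a false positive" and assume the worst case where none of the 20 tested changes has any real effect (which the quote doesn't spell out):

    # Quick check of the quoted claim, assuming the worst case where all 20
    # tested changes truly have no effect (an assumption the quote omits).
    alpha, k = 0.05, 20
    print("Expected false positives:", alpha * k)      # 1.0
    print("P(at least one):", 1 - (1 - alpha) ** k)    # ~0.64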
"We’re taking techniques that were designed for static sample sizes and applying them to continuous datasets" - Wait, seriously? Do A/B testers not use the <i>very</i> well developed techniques that exist for time-series data?