It is great to understand this stuff. But people don't. And I am personally of the belief that p-values are popular exactly because they are so easy to misunderstand as the answer to the question we actually want to ask (what is the probability that we are right?).

That said, if you need to use p-values in A/B testing, you might want to read http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html for a procedure that gives always-valid p-values, and then http://elem.com/~btilly/ab-testing-multiple-looks/part2-limited-data.html for a more practical alternative and its caveats. (I still intend to return to the series, but not for a while yet.)
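To make the "always valid" idea concrete, here is a rough single-arm sketch, not the procedure from the linked posts: it tests one conversion rate against a known baseline using a likelihood-ratio martingale with an assumed point alternative (the baseline rate, the alternative rate, and the visitor count are all made-up numbers).

    import numpy as np

    def always_valid_p(conversions, p0=0.05, p1=0.07):
        """conversions: 0/1 (or boolean) outcomes in the order observed.
        p0: null conversion rate; p1: assumed point alternative, p1 > p0."""
        log_lr = 0.0       # running log likelihood ratio
        max_log_lr = 0.0   # running maximum, so the p-value only shrinks
        p_values = []
        for x in conversions:
            log_lr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
            max_log_lr = max(max_log_lr, log_lr)
            # Under the null the likelihood ratio is a nonnegative martingale,
            # so by Ville's inequality P(it ever exceeds 1/a) <= a, which makes
            # 1 / max(LR) a valid p-value at every look.
            p_values.append(min(1.0, np.exp(-max_log_lr)))
        return p_values

    # Example: 5,000 visitors whose true conversion rate matches the null.
    rng = np.random.default_rng(0)
    p_seq = always_valid_p(rng.random(5000) < 0.05)
    # By construction this drops below 0.05 in at most 5% of null runs,
    # no matter how many times you look at the data.
    print(min(p_seq))

The linked articles handle the real two-arm case; the point here is just the anytime-validity property, i.e. you can check the p-value after every observation without inflating the false positive rate.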
My biggest pet peeve is the interpretation of p-values absent the context of effect sizes. If you have a huge sample size you're quite frequently going to find statistically significant differences between groups, but often those differences aren't that meaningful.

If you assessed differences in average human height between two cities based on a million datapoints, you're pretty darn likely to find a difference that's statistically significant but not really important or meaningful (like a 0.1mm difference in average height).

The above is why the American Psychological Association says "reporting and interpreting effect sizes in the context of [p-values] is essential to good research".[1] It's also why Statwing always reports effect sizes along with p-values for hypothesis tests.[2]

(For clarity: effect size is best presented in readily interpretable, concrete terms like height or whatever unit you're using. If that isn't possible, say because you're comparing ratings on a 1-to-7 scale, or if you want to compare effect sizes across different types of analyses, there are standardized effect-size metrics.)

[1] http://people.cehd.tamu.edu/~bthompson/apaeffec.htm

[2] https://www.statwing.com/demo
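A quick sketch of the height example above. All numbers are assumed (1700mm mean, 70mm standard deviation, and a half-millimetre true gap so the difference reliably reaches significance at a million datapoints per city):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    city_a = rng.normal(loc=1700.0, scale=70.0, size=1_000_000)  # heights in mm
    city_b = rng.normal(loc=1700.5, scale=70.0, size=1_000_000)  # 0.5mm taller on average

    # Statistically significant...
    t, p = stats.ttest_ind(city_a, city_b)

    # ...but a negligible effect size (Cohen's d, pooled standard deviation).
    cohens_d = (city_b.mean() - city_a.mean()) / np.sqrt(
        (city_a.var(ddof=1) + city_b.var(ddof=1)) / 2)

    print(f"p-value: {p:.2g}")          # far below 0.05 at this sample size
    print(f"Cohen's d: {cohens_d:.4f}") # ~0.007: nobody would notice this gap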
Great topic. P-values are so easy to misinterpret, in fact, that I think the article makes the very error that it warns against:

> "If it’s under 5%, p < 0.05, we can be reasonably certain that our result probably implies a stacked coin."

By itself, a p-value is NOT enough to imply that the null hypothesis is false. In fact, if I flipped a regular-looking coin and saw 7 heads, I'd still be very confident that the coin is fair, because weighted coins are so rare. Later, the article correctly warns:

> P-value misconception #5: "1 − (p-value) is not the probability of the alternative hypothesis being true (see (1))."

P.S. I think the weasel words in the first quoted sentence, "reasonably certain" and "probably implies," show that the author is at least subconsciously aware of this logical error. :)
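A back-of-the-envelope Bayes calculation makes this concrete. The base rate of weighted coins and their bias are made-up numbers, and assume 7 heads in 7 flips:

    # Assumed: 1 coin in 1,000 is weighted, and a weighted coin always lands heads.
    p_weighted_prior = 0.001
    p_data_given_fair = 0.5 ** 7      # 7 heads from a fair coin ~= 0.0078 (the p-value)
    p_data_given_weighted = 1.0

    # Bayes' rule: P(weighted | 7 heads)
    posterior = (p_weighted_prior * p_data_given_weighted) / (
        p_weighted_prior * p_data_given_weighted
        + (1 - p_weighted_prior) * p_data_given_fair)

    print(f"p-value: {p_data_given_fair:.4f}")        # ~0.008, "significant"
    print(f"P(weighted | 7 heads): {posterior:.2f}")  # ~0.11, the coin is probably still fair

So even with p ≈ 0.008, under these assumptions the coin is fair almost 9 times out of 10, which is exactly the gap between a p-value and "the probability that we are right."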
A really good new site about "p-hacking" and how to detect it,

http://www.p-curve.com/

is by Uri Simonsohn, a professor of psychology with a better-than-average understanding of statistics, and colleagues who are concerned with making scientific papers more reliable. You can use the p-curve software on that site for your own investigations into p-values found in published research.

Many of the issues brought up by the blog post kindly submitted here, and by the comments submitted before this one, become much clearer after reading Simonsohn's various articles

http://opim.wharton.upenn.edu/~uws/

about p-values and what they mean, and other aspects of interpreting published scientific research.
The common misconceptions of p-values make them appear more relevant than they really are.

I'm skeptical they would be used nearly as much if they were properly understood.
This line is silly:

> *you can just pretend you went into your experiment with different halting conditions and, voila!, your results become significant.*

You can misrepresent your results regardless of the underlying statistic. But it's no easier to lie about p-values than to lie about any other statistical procedure.

Anyway, the post seems to be more about hypothesis testing than p-values per se.
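For context on why halting conditions matter at all, here is a rough simulation of the repeated-looks problem the article is alluding to (batch size, number of looks, and experiment count are arbitrary): peeking at a null A/B test after every batch and stopping at the first p < 0.05 inflates the false positive rate well beyond the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_experiments, n_looks, batch = 1000, 20, 100
    false_positives = 0

    for _ in range(n_experiments):
        # Both arms are identical, so any "significant" result is a false positive.
        a = rng.normal(size=n_looks * batch)
        b = rng.normal(size=n_looks * batch)
        for look in range(1, n_looks + 1):
            # Peek after each batch and stop as soon as the test looks significant.
            if stats.ttest_ind(a[:look * batch], b[:look * batch]).pvalue < 0.05:
                false_positives += 1
                break

    # With 20 looks this comes out well above 5% -- roughly 20-25% in runs like this.
    print(false_positives / n_experiments)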