Here's a simpler thought experiment that gets across why p(null | significant effect) ≠ p(significant effect | null), and why p-values are flawed in the way the post describes.

Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypotheses that are true. So in this hypothetical society, the null hypothesis in every scientific experiment ever done is true. But statistically, using a p-value threshold of 0.05, they'll still reject the null in 5% of experiments, and those experiments will end up being published in the scientific literature. This society's scientific literature therefore contains only false results - literally all published scientific results are false.

Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.
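To make this concrete, here's a quick simulation sketch of that society (assumed setup: two-sample t-tests with 30 subjects per group and a 0.05 threshold; the numbers are illustrative, not from the post):

```python
# Rough simulation of the thought experiment: the null is true in every experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
published = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:        # "significant", so it gets published
        published += 1

print(f"fraction published: {published / n_experiments:.3f}")  # ~0.05
# Every one of those published results is false.
```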
One of the best articles covering this issue is Meehl [1][2]. You can find discussion in various places, like Gelman [3] and Reinhart [4].

[1] Meehl, Paul E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.

[2] http://meehl.umn.edu/files/144whysummariespdf

[3] http://andrewgelman.com/2015/03/23/paul-meehl-continues-boss/

[4] https://www.refsmmat.com/notebooks/meehl.html
'The fundamental problem is that p values don't mean what we "need" them to mean, that is p(null | significant effect).'

From Bayes' theorem, this more useful probability is given by p * x, where x = p(null) / p(significant effect). Maybe we could just lower the accepted threshold for statistical significance by several orders of magnitude, so that for statistically significant p, p * x is still small even for conservative (i.e. big) estimates of x (e.g. maybe a Fermi approximation of the total number of experiments ever performed in the field in question). This doesn't necessarily imply impractically big sample sizes, although obviously that depends on the specifics (I believe the p value for a given value of the t-statistic decays exponentially with sample size).
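A back-of-the-envelope version of that argument, with a made-up Fermi estimate for x (the numbers here are purely illustrative):

```python
# Suppose a Fermi estimate gives x = p(null) / p(significant effect) ~ 1e4.
x = 1e4

# With the usual threshold, the bound p * x on p(null | significant) is useless:
alpha_conventional = 0.05
print(alpha_conventional * x)   # 500 -- vacuous, tells us nothing

# Lowering the threshold by several orders of magnitude makes p * x small again:
alpha_strict = 1e-6
print(alpha_strict * x)         # 0.01 -- now a significant result is informative
```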
Here is a better way to think about this.

The proper role of data is to update our existing beliefs about the world, not to specify what our beliefs should be.

The question that we really want to answer is, "What is the probability that X is true?" What p-values do is replace that with the seemingly similar but very different, "What is the probability that I'd have the evidence I have against X by chance alone, were X true?" Bayes factors try to capture how much our belief should shift.

The conclusion at the end is that replication is better than either approach. I agree. We know that there are a lot of ways to hack p-values. Bayes factors haven't caught on because they don't match how people want to think. However, if we keep consistent research standards and replicate routinely, the replication rate gives us a sense of how much confidence we should have in a new result that we hear about.

(Spoiler: a lot less confidence than most breathless science reporting would have you believe.)
My favorite probability theory problem is related to this article.

You have a test for a disease that is 99% accurate, meaning that 99% of the time the test gives a correct result. It is known that 1% of the population has the disease, and you test positive. What is the probability that you have the disease?

The answer is not at all the one most people first come up with when given this problem. It is also why getting a second test is always a good idea after testing positive for a disease.

EDIT: I updated the statement of the problem to be one that can be answered!
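Working the numbers (assuming "99% accurate" means both 99% sensitivity and 99% specificity):

```python
# Bayes' theorem for the disease-test problem.
sensitivity = 0.99   # p(positive | disease)
specificity = 0.99   # p(negative | no disease)
prevalence  = 0.01   # p(disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive)   # 0.5 -- a coin flip, not 99%
```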
The core issue is that p-values are cheaper to get than replicating the study, but replicating the study is the only reliable way to see whether the result is true. Sometimes the expensive, time-consuming way is the only good way.
I'm not trying to be facetious, but isn't this something you learn in junior-level stats? I had this drilled into me in both undergrad math courses and grad machine learning courses; I'm confused to see it warrant an article.
Andrew Gelman's blog provides regular insightful commentary on this issue; I highly recommend it:

http://andrewgelman.com/

The post that turned me on to all of this is:

http://andrewgelman.com/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/
The article says:

> Note: this has nothing to do with p-hacking (which is a huge but separate issue).

I disagree. p-hacking is when one experimenter checks many statistical tests to find one that is significant. The effect the author is discussing is that many experimenters do many experiments and the significant ones get published. One is more unethical (or maybe just incompetent) than the other, but they're essentially the same phenomenon.
I'm honestly more tired of essays about p-values than of p-values themselves.

It's true that, like all metrics, if it becomes a target then it may be abused (Goodhart's Law).

However, if you abolished p-values, people would start hacking or misunderstanding priors, or confidence limits, or odds ratios.

It's an easy, dumb stat that most anyone can do in Excel and most everyone recognises. The emphasis should be that it remains a quick shorthand for casual use, but that more complex studies have more sophisticated models and probabilistic reasoning.

But the emphasis on p-values is bizarre. As best illustrated by JT Leek, the pipeline of data research has multiple points of failure that may lead to false findings or irreproducible research. But we talk very little about them, whilst essays about p-values come out every week...

https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412#/pipe
This was a really interesting article. I've worked with researchers who try to defend a small but statistically significant finding that just doesn't seem likely to be real, and this provides a statistical explanation for my skepticism. The p-value mentality is deeply ingrained in a lot of researchers, though.

The challenge for journal editors seems very real. There's another group that deals with interpreting the validity of significant findings for a living, though: biotech VCs. A lot of times, trying to reproduce the work is their best way of addressing this, and often the first thing a startup does is attempt to replicate the academic results. For some other heuristics VCs use to assess "reproducibility risk", see here:

https://lifescivc.com/2012/09/scientific-reproducibility-begleys-six-rules/
Two solutions:

a) Stop doing experiments that just look for correlation without any attempt to get at mechanism. Of course, sometimes you can't avoid this, and then:

b) Use lower p-value thresholds. Don't waste thousands or millions (or more) of dollars following up 5% results.
> Many researchers are now arguing that we should, more generally, move away from using statistics to make all-or-none decisions and instead use them for "estimation". In other words, instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data.

I couldn't agree more with this statement, and even more so in a business setting than in research. It's just so easy to get caught up in statistical significance and lose perspective on practical significance. I've found confidence intervals the most informative and easiest to understand.
When I was first taught statistics, I was told that the researcher had to justify a plausible hypothesis first - and then do a hypothesis test/p-value to prove their theory.

If the scientist's intuitive understanding and the p-value test result align, then this is a credible result.

On the other hand, the trend now is to conduct every possible test whether or not there is any justification for doing so (corrected for multiple testing, no p-hacking, yes, sure).

For example, in tech, we might test every shade of blue. Some of those blues are gonna come up as p-value hits - but since we had no good reason to do this test, this was probably just random noise.

Similarly, in genetics, we're gonna test every single gene against everything - just to see what happens (yes, yes, do a Bonferroni correction on each set of tests). Hmm, recent results in genetics don't seem to be very robust or repeatable, for some reason.

The likelihood of a truthful link in these tests is incredibly low. When we have no particular reason to believe there is a truthful link, and are just blind testing, the false positive rate is very high (as described in the article), and probably even higher than the article speculates - almost all hits are gonna be false positives.

Maybe p-values just don't work well with modern-day data. Or maybe Big Data just doesn't contain the information about mysterious, unexplored, and innovative correlations that we hope it does.
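Here's a quick sketch of the "every shade of blue" scenario (assumed setup: 200 A/B tests with no true effect anywhere and 500 users per arm - illustrative numbers only):

```python
# Many tests, zero real effects: some "hits" appear anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_arm = 200, 500

p_values = []
for _ in range(n_tests):
    control = rng.normal(0.0, 1.0, n_per_arm)   # no true difference anywhere
    variant = rng.normal(0.0, 1.0, n_per_arm)
    p_values.append(stats.ttest_ind(control, variant).pvalue)
p_values = np.array(p_values)

print("hits at p < 0.05:", int((p_values < 0.05).sum()))                # ~10, all noise
print("hits after Bonferroni:", int((p_values < 0.05 / n_tests).sum())) # usually 0
```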
> instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data. However, at the end of the day, editors need to make an all-or-none decision about whether to publish a paper

Yet another way in which the traditional publishing structure actively harms science.
Here's a follow-up to the original blog post: https://lucklab.ucdavis.edu/blog/2018/4/28/why-ive-lost-faith-in-p-values-part-2
I remember reading http://andrewgelman.com/2016/11/13/more-on-my-paper-with-john-carlin-on-type-m-and-type-s-errors/ with its graph "This is what power = 0.06 looks like". So I got the point that you have to have sufficient statistical power. A useful rule of thumb is that you need a power of at least 0.8. You need to have some idea how big the effect is likely to be - perhaps from previous exploratory research, from claims of other researchers, or from reasoning "well, if this is happening the way we think it is, there has to be an effect greater than x waiting to be discovered." Then you work out how big a sample size you need. Then you roll up your sleeves and get down to work.

But the reason for using p values rather than Bayesian inference is that it gets you out of the tricky problem of coming up with a prior. You only need to think about the null hypothesis and ask yourself whether the probability of the data, given the null hypothesis, is less than 0.05.

So there is a bit of a contradiction. p values don't really work unless you ensure that you have sufficient power. To do that you need a plausible effect size to feed into your power calculation. And that is implicitly a rough, approximate prior: 50:50 between the null and that effect. You could just do a Bayesian update, stating how much you shifted from 50:50.

Basically, if you don't already know enough to have an arguable prior to get a Bayesian approach started, you don't know enough to do a power calculation, so you shouldn't be using p-values either.

I went looking on andrewgelman.com for a reference for wanting power = 0.8 and found a more recent post:

http://andrewgelman.com/2017/12/04/80-power-lie/

Oh shit! The situation is much worse than I realised :-(
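For reference, this is the kind of back-of-the-envelope power calculation I mean, using the usual normal approximation for a two-sample comparison (two-sided alpha = 0.05, target power = 0.8; the effect sizes are just examples):

```python
# Approximate sample size per group for a two-sample test of standardized
# effect size d, via the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_power = norm.ppf(power)           # quantile corresponding to the target power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

print(round(n_per_group(0.5)))   # ~63 per group for a "medium" effect
print(round(n_per_group(0.2)))   # ~393 per group for a "small" effect
```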
James Abdey wrote his Ph.D. thesis on this subject several years ago and proposed an alternative method for making decisions based on statistical evidence: http://etheses.lse.ac.uk/31/
This is an old thread already and I don't know if I'm getting my voice heard. But at any rate: hypothesis testing (slightly different philosophically from p-values, but anyway) is bogus because conjectures-and-refutations falsificationism is bogus. That's not how good science has ever happened, only how bogus research programs have dressed themselves up as science.

The core of science is "the unity of science". Signal-to-noise measurements tell you very little outside a general coherentist/holistic verificationist framework.
This is especially troubling when combined with confirmation bias. The whole point of data is that it anchors us to reality. Data should be the check that prevents us from believing something simply because we want it to be true. But if we only test theories we already suspect are true, we are already biasing the kinds of false positives we will get.
p-values are a lot like the weather - everyone complains about them, nobody does anything about them. Specifically, what tends to be missing from these conversations is a good alternative - the author seems to be asking for false discovery rates/q-values. Or maybe effect sizes? The reality is that one size doesn't fit all, and the most useful statistic depends on the context. Oh, and the target: good luck submitting your work to a biology journal without p-values. I'm sure the editor will briefly marvel at your courage in taking a stand as she rejects it without review.

While we're on the subject: there's a tendency to appeal to larger sample sizes, as the author also mentions. Worth remembering that for some of us data isn't a thing you download from the interweb, it's something you generate - and it costs money and time to do so. (And for human subjects research, the stakes are even higher...)
I don't think there's any cause to abandon p-values and NHST if you're running experiments with high power and intelligent, deliberate priors.

With power = 0.8 and p(h1) = 0.6, p(h0 | p < 0.05) = 0.04. Even with power = 0.8 and p(h1) = 0.2, p(h0 | p < 0.05) is only 0.2.
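To spell out where those numbers come from (straight Bayes' theorem, taking p(significant | h0) to be the 0.05 threshold):

```python
# Posterior probability that the null is true given a significant result.
def p_null_given_sig(power, p_h1, alpha=0.05):
    p_h0 = 1 - p_h1
    p_sig = alpha * p_h0 + power * p_h1   # total probability of a significant result
    return alpha * p_h0 / p_sig

print(p_null_given_sig(power=0.8, p_h1=0.6))   # 0.04
print(p_null_given_sig(power=0.8, p_h1=0.2))   # 0.2
```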
Does anyone have that article about how a pro- and an anti-parapsychology researcher both designed a study, analyzed the data, and got conflicting results?
(There was a joke about how it was the only paper published that explained a discrepancy by saying the other side cheated)
Is there a book (written in plain language) that goes into the history of academic journals and details the current state of the "replication crisis", "data dredging", etc.?
Despite using statistics daily, I still feel utterly uncomfortable about its philosophical grounding. Are there any resources HN can suggest to soothe the heart of a sceptic?
p-values that are not in the ranges physics uses are ridiculous.

It's a shame everyone started copying physics but settled on far looser acceptance/rejection thresholds.

I was a little bit disappointed when I realized that a bunch of valid modern science is just proper experiment design and number crunching. If it's not physics, there are no models of why things work; there's just a p-value on the correlation or some other comparison function.

Medicine has turned into a field where you can't know a thing.

http://www.cochrane.org/CD005427/BACK_combined-chiropractic-interventions-for-low-back-pain

I love reading reports like the above:

> There is currently no evidence that supports or refutes that these interventions (chiropractic intervention) provide a clinically meaningful difference for pain or disability in people with [lower back pain] when compared to other interventions.

p-values really do not help that much.
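For a sense of scale, these are the p-values implied by the sigma thresholds physics uses (one-sided tail of a standard normal; 5 sigma is the usual discovery standard):

```python
# p-values corresponding to sigma thresholds (one-sided normal tail).
from scipy.stats import norm

for sigma in (2, 3, 5):
    print(f"{sigma} sigma ~ p = {norm.sf(sigma):.1e}")
# 2 sigma ~ 2.3e-02, 3 sigma ~ 1.3e-03, 5 sigma ~ 2.9e-07
```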
Sometimes dumbing down a concept can totally screw up a person's learning curve. In the early days, a lot of Java tutorials (in Indian engineering books) said that the reason Java has interfaces is that otherwise it would not be possible to inherit from multiple classes. While it is true that you can implement multiple interfaces, the whole point of an interface is to define an "interface" without forcing an implementation. It has nothing to do with the "limitation" of single inheritance.

Coming back to p values. A simple Google search will find you many articles that say:

> A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

The whole idea of p-values is to warn a scientist that they should look for statistical significance. The behaviour of a hypothesis over infinite trials is what matters, and hence more data => better reliance. But "more", "better" etc. are subjective ideas, and in many cases where everything else is considered normal, <0.05 might be good - but not always. There are far too many factors, such as wrong sampling methods and things you cannot measure vs. things you can, that truly affect this number.

I think the author nails it when he writes "Replication is the best statistic."

Always think of these tests from a biological-evolution perspective: do you think this hypothesis would survive the test of time, where it has to frequently face the real world?
I never understood why people took p-values seriously. They never seemed to mean anything of use.

Whenever I brought it up around other academics, no one seemed to want to comment on it. Maybe they were afraid to admit they didn't understand a topic that's apparently important to publishing? Anyone can follow the formula to compute a p-value, but there's no requirement to understand its meaning.

I'd love to find a use for them, but I still haven't.
tl;dr: The author got a PhD in 1993 [1] and is just now figuring out that p-values are not false positive rates.

[1] http://mindbrain.ucdavis.edu/people/sjluck

I was lucky and figured it out before getting a degree. It's got to be hard for people in this position to look back on their previous work, where the most fundamental aspect of interpreting the results was incorrect.

He gets it right that statistics are good for estimation, but there is a part two. You need to come up with a theory that makes a prediction to compare against these estimates, and then test *that*. I.e., your prediction about the distribution of the results is the "null hypothesis". I think p-values are probably OK for that.
Sounds like someone who never understood statistics, still doesn't, and doesn't want to.

A particularly glaring issue is this offhand comment:

> this is a statement about what happens when the null hypothesis is actually true. In real research, we don't know whether the null hypothesis is actually true. If we knew that, we wouldn't need any statistics! In real research, we have a p value, and we want to know whether we should accept or reject the null hypothesis.

That isn't a question that *any* statistical approach will help you with. There's a reason we talk in terms of "rejecting" or "failing to reject" a hypothesis. We don't do statistical tests to accept hypotheses, only to reject them.

The concept of accepting one hypothesis based on a comparison between it and one other hypothesis is ludicrous on its face, suffering exactly the problems associated with Pascal's wager.