Recently, a group of distinguished psychologists posted a preprint on PsyArxiv arguing for a re-definition of the traditional signifance threshold, lowering it from P < 0.05 to P < 0.005. They are concerned about reproducibility, and they argue that “changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.” That’s all true, but I’m afraid it won’t necessarily “improve the reproducibility of scientific research in many fields.”
Why? Let’s take a little trip down memory lane.
Almost a year ago I pointed out that we need to “Be wary of results from studies with small sample sizes, even if the effects are statistically significant.” I illustrated why with the following figure produced using R code available at Github: https://github.com/kholsinger/noisy-data
What that figure shows is the distribution of P-values that pass the P < 0.05 significance threshold when the true difference between two populations is mu standard deviations (with the same standard deviation in both populations) and with equal sample sizes of n. The results are from 1000 random replications. As you can see when the sample size is small, there’s a good chance that a significant result will have the wrong sign, i.e., the observed difference will be negative rather than positive, even if the between-population diffference is 0.2 standard deviations. When the between-population difference is 0.05, you’re almost as likely to say the difference is in the wrong direction as to get it right.
Does changing the threshold to P < 0.005 help. Well, I changed the number of replications to 10,000, reduced the threshold to P < 0.005, and here’s what I got.
Do you see a difference? I don’t. I haven’t run a formal statistical test to compare the distributions, but I’m pretty sure they’d be indistinguishable.
In short, reducing the significance threshold to P < 0.005 will result in fewer investigators reporting statistically significant results. But unless the studies they do also have reasonably large sample sizes relative to the expected magnitude of any effect and the amount of variability within classes, they won’t be any more likely to know the direction of the effect than with the traditional threshold of P < 0.05.
The solution to the reproducibility crisis is better data, not better statistics.