Last week I pointed out a new paper by Denes Szucs and John Ioannidis, *When null hypothesis significance testing is unsuitable for research: a reassessment*.^{1} I mentioned that P-values from small, noisy studies are likely to be misleading. Last April, Raghu Parthasarathy at The Eighteenth Elephant had a long post on a more fundamental problem with P-values: they encourage binary thinking. Why is this a problem?

- “Binary statements can’t be sensibly combined” when measurements have noise.
- “It is almost never necessary to combine boolean statements.”
- “Everything always has an effect.”

Those brief statements probably won’t make any sense,^{2} so head over to The Eighteenth Elephant to get the full explanation. The post is a bit long, but it’s easy to read, and well worth your time.

Andrew Gelman recently linked to Parthasarathy’s post and adds one more observation on how P-values are problematic: they are “interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null.” In other words, P-values are derived assuming the null hypothesis is true. They tell us what the chances of getting the data we got are if the null hypothesis were true. Since we typically don’t believe the null hypothesis is true, the P-value doesn’t correspond to anything meaningful.

To take Gelman’s example, suppose we had an experiment with a control, treatment A, and treatment B. Our data suggest that treatment A is not different from control (P=0.13) but that treatment B is different from the control (P=0.003). That’s pretty clear evidence that treatment A and treatment B are different, right? Wrong.

P=0.13 corresponds to a treatment-control difference of 1.5 standard deviations; P=0.003, to a treatment-control difference of 3.0 standard deviations, a difference of 1.5 standard deviations, which corresponds to a P-value of 0.13. Why the apparent contradiction? Because if we want to say that treatment A and treatment B, we need to compare them directly to each other. When we do so, we realize that we don’t have any evidence that the treatments are different from one another.

As Parthasarthy points out in a similar example, a better interpretation is that we have evidence for the ordering (control < treatment A < treatment B). Null hypothesis significance testing could easily mislead us into thinking that what we have instead is (control = treatment A < treatment B). The problem arises, at least in part, because no matter how often we remind ourselves that it’s wrong to do so, we act as if a failure to reject the null hypothesis is evidence for the null hypothesis. Parthasarthy describes nicely how we should be approaching these problems:

It’s absurd to think that anything exists in isolation, or that any treatment really has “zero” effect, certainly not in the messy world of living things. Our task, always, is to quantify the size of an effect, or the value of a parameter, whether this is the resistivity of a metal or the toxicity of a drug.

We should be focusing on estimating the magnitude of effects and the uncertainty associated with those estimates, not testing null hypotheses.

(more…)