Uncommon Ground

Statistics

Against null hypothesis testing – the elephants and Andrew Gelman edition

Last week I pointed out a new paper by Denes Szucs and John Ioannidis, When null hypothesis significance testing is unsuitable for research: a reassessment. I mentioned that P-values from small, noisy studies are likely to be misleading. Last April, Raghu Parthasarathy at The Eighteenth Elephant had a long post on a more fundamental problem with P-values: they encourage binary thinking. Why is this a problem?

  1. “Binary statements can’t be sensibly combined” when measurements have noise.
  2. “It is almost never necessary to combine boolean statements.”
  3. “Everything always has an effect.”

Those brief statements probably won’t make any sense, so head over to The Eighteenth Elephant to get the full explanation. The post is a bit long, but it’s easy to read, and well worth your time.

Andrew Gelman recently linked to Parthasarathy’s post and adds one more observation on how P-values are problematic: they are “interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null.” In other words, P-values are derived assuming the null hypothesis is true. They tell us how likely we would be to get data at least as extreme as the data we got if the null hypothesis were true. Since we typically don’t believe the null hypothesis is true, the P-value doesn’t correspond to anything meaningful.

To take Gelman’s example, suppose we had an experiment with a control, treatment A, and treatment B. Our data suggest that treatment A is not different from control (P=0.13) but that treatment B is different from the control (P=0.003). That’s pretty clear evidence that treatment A and treatment B are different, right? Wrong.

P=0.13 corresponds to a treatment-control difference of 1.5 standard deviations; P=0.003, to a treatment-control difference of 3.0 standard deviations. The two treatments therefore differ from each other by 1.5 standard deviations, which corresponds to a P-value of 0.13. Why the apparent contradiction? Because if we want to say that treatment A and treatment B are different, we need to compare them directly to each other. When we do so, we realize that we don’t have any evidence that the treatments are different from one another.
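The arithmetic behind those numbers can be checked with a few lines of Python. This is only a sketch: it treats each difference as a standard normal test statistic, and it assumes the A-versus-B difference sits on the same standard-error scale as the treatment-control differences (which holds, for example, when the three groups are the same size and share a common variance). The function name `two_sided_p` is mine, not Gelman's.

```python
import math


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_a = 1.5  # treatment A vs. control, in standard-error units
z_b = 3.0  # treatment B vs. control, in standard-error units

p_a = two_sided_p(z_a)
p_b = two_sided_p(z_b)
p_ab = two_sided_p(z_b - z_a)  # direct comparison of A and B (assumed same scale)

print(f"A vs. control: P = {p_a:.3f}")
print(f"B vs. control: P = {p_b:.3f}")
print(f"A vs. B:       P = {p_ab:.3f}")
```

The direct comparison of A with B produces the same P-value as the "non-significant" comparison of A with the control, which is the point of the example.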

As Parthasarathy points out in a similar example, a better interpretation is that we have evidence for the ordering (control < treatment A < treatment B). Null hypothesis significance testing could easily mislead us into thinking that what we have instead is (control = treatment A < treatment B). The problem arises, at least in part, because no matter how often we remind ourselves that it’s wrong to do so, we act as if a failure to reject the null hypothesis is evidence for the null hypothesis. Parthasarathy describes nicely how we should be approaching these problems:

It’s absurd to think that anything exists in isolation, or that any treatment really has “zero” effect, certainly not in the messy world of living things. Our task, always, is to quantify the size of an effect, or the value of a parameter, whether this is the resistivity of a metal or the toxicity of a drug.

We should be focusing on estimating the magnitude of effects and the uncertainty associated with those estimates, not testing null hypotheses.


Against null hypothesis significance testing

Several months ago I pointed out that P-values from small, noisy experiments are likely to be misleading. Given our training, we think that if a result is significant with a small sample, it must be a really big effect. But unless we have good reason to believe that there is very little noise in the results (a reason other than the small amount of variation observed in our sample), we could easily be misled. Not only will we overestimate how big the effect is, but we are almost as likely to say that the effect is positive when it’s really negative as we are to get the sign right. Look back at this post from August to see for yourself (and download the R code if you want to explore further). As Gelman and Carlin point out,

There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions.

I bring this all up again because I recently learned of a new paper by Denes Szucs and John Ioannidis, When null hypothesis significance testing is unsuitable for research: a reassessment. They summarize their advice on null hypothesis significance testing (NHST) in the abstract:

Whenever researchers use NHST they should justify its use, and publish pre-study power calculations and effect sizes, including negative findings. Studies should optimally be pre-registered and raw data published.

They go on to point out that part of the problem is the way that scientists are trained:

[M]ost scientists…are still near exclusively educated in NHST, they tend to misunderstand and abuse NHST and the method is near fully dominant in scientific papers.

The whole paper is worth reading, and reading carefully. If you use statistics in your research, please read it and remember its lessons the next time you’re analyzing your data.

 


Designing exploratory studies: measurement (part 2)

Remember Gelman’s preliminary principles for designing exploratory studies:

  1. Validity and reliability of measurements.
  2. Measuring lots of different things.
  3. Connections between quantitative and qualitative data.
  4. Collect or construct continuous measurements where possible.

I already wrote about validity and reliability. I admitted to not knowing enough yet to provide advice on assessing the reliability of measurements ecologists and evolutionists make (except in the very limited sense of whether or not repeated measurements of the same characteristic give similar results). For the time being that means I’ll focus on

  • Remembering that I’m measuring an indicator of something else that is the thing that really matters, not the thing that really matters itself.
  • Being as sure as I can that what I’m measuring is a valid and reliable indicator of that thing, even though the best I can do with that right now is a sort of heuristic connection between a vague notion of what I really think matters, underlying theory, and expectations derived from earlier work.

It’s that second part where “measuring lots of different things” comes in. Let’s go back to LMA and MAP. I’m interested in LMA because it’s an important component of the leaf economics spectrum. There are reasons to expect that tough leaves (those in which LMA is high) will not only be more resistant to herbivory from generalist herbivores, but that they will have lower rates of photosynthesis. Plants are investing more in those leaves. So high LMA is, in some vague sense, an indicator of the extent to which resource conservation is more important to plants than rapid acquisition of resources. So in designing an exploratory study, I should think about other traits plants have that could be indicators of resource conservation vs. rapid resource acquisition and measure as many of them as I can. A few that occur to me are leaf area, photosynthetic rate, leaf nitrogen content, leaf C/N ratio, tissue density, leaf longevity, and leaf thickness.

If I measure all of these (or at least several of them) and think of them as indicators of variation on the underlying “thing I really care about”, I can then imagine treating that underlying “thing I really care about” as a latent variable. One way, but almost certainly not the only way, I could assess the relationship between that latent variable and MAP would be to perform a factor analysis on the trait dataset, identify a single latent factor, and use that factor as the dependent variable whose variation I study in relation to MAP. Of course, MAP is only one way in which we might assess water availability in the environment. Others that might be especially relevant for perennials with long-lived leaves (like Protea) in the Cape Floristic Region include rainfall seasonality, maximum number of days between days with “significant” rainfall in the summer, total summer rainfall, estimated potential evapotranspiration (PET) for the year, and estimated PET for the summer. A standard way to relate the “resource conservation” factor to the “water availability” factor would be a canonical correspondence analysis.

Although I am not advocating that we all start doing canonical correspondence analyses as our method of choice in designing exploratory studies, this way of thinking about exploratory studies does help me clarify (a bit) what it is that I’m really looking for. I still have work to do on getting it right, but it feels as if I’m heading towards something analogous to exploratory factor analysis (to identify factors that are valid, in the sense that they are interpretable and related in a meaningful way to existing theoretical constructs and understanding) and confirmatory factor analysis (to confirm that the exploration has revealed factors that can be reliably measured).
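To make the "several indicators, one latent variable" idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the trait names, loadings, rainfall range, and sample size are invented, and the data are simulated. It also swaps in a simpler technique than the one described above: the first principal component of the trait correlation matrix (found by power iteration) stands in for a single-factor factor analysis, and a plain correlation with MAP stands in for the more elaborate analyses.

```python
import math
import random


def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
    return [(v - m) / s for v in col]


def correlation(a, b):
    """Correlation of two columns that have already been standardized."""
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)


def first_component(columns, iterations=200):
    """Leading eigenvector of the correlation matrix, by power iteration."""
    k = len(columns)
    corr = [[correlation(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


rng = random.Random(42)
n_sites = 200
# Hypothetical mean annual precipitation (mm) at each site.
map_mm = [rng.uniform(200, 1500) for _ in range(n_sites)]
# Hypothetical latent "resource conservation" axis: higher where it is drier.
latent = [-(m - 850) / 400 + rng.gauss(0, 1) for m in map_mm]
# Hypothetical loadings of each measured trait on the latent axis.
loadings = {"LMA": 0.8, "leaf_thickness": 0.7,
            "leaf_longevity": 0.6, "leaf_N": -0.7}
traits = {name: [load * f + rng.gauss(0, 0.6) for f in latent]
          for name, load in loadings.items()}

cols = [standardize(traits[name]) for name in loadings]
weights = first_component(cols)
scores = [sum(w * c[i] for w, c in zip(weights, cols)) for i in range(n_sites)]
r = correlation(standardize(scores), standardize(map_mm))

for name, w in zip(loadings, weights):
    print(f"{name:15s} weight {w:+.2f}")
print(f"correlation of composite score with MAP: {r:+.2f}")
```

The composite score uses all of the indicators at once, so it should track the simulated latent axis more closely than any single trait does. A real analysis would need an actual factor model and real measurements.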

Stay tuned. It is likely to be a while before I have more thoughts to share, but as they develop, they’ll appear here, and if you follow along, you’ll be the first to hear about them.

Designing exploratory studies: measurement

On Wednesday I argued that we need carefully done exploratory studies to discover phenomena as much as we need carefully done experimental studies to test explanations for the phenomena that have been discovered. Andrew Gelman suggests four preliminary principles:

  1. Validity and reliability of measurements.
  2. Measuring lots of different things.
  3. Connections between quantitative and qualitative data.
  4. Collect or construct continuous measurements where possible.

Today I’m going to focus on #1, the validity and reliability of measurements.

If there happen to be any social scientists reading this, it’s likely to come as a shock to you to learn that most ecologists and evolutionary biologists haven’t thought too carefully about the problem of measurement, or at least that’s been my experience. My ecologist and evolutionary biologist friends are probably scratching their heads. “What the heck does Holsinger mean by ‘the problem of measurement’?” I’m sure I’m going to butcher this, because what little I think I know I picked up informally second hand, but here’s how I understand it.


Designing exploratory studies: preliminary thoughts

Andrew Gelman has a long, interesting, and important post about designing exploratory studies. It was inspired by the following comment from Ed Hagen following a blog post about a paper in Psychological Science.

Exploratory studies need to become a “thing”. Right now, they play almost no formal role in social science, yet they are essential to good social science. That means we need to put as much effort in developing standards, procedures, and techniques for exploratory studies as we have for confirmatory studies. And we need academic norms that reward good exploratory studies so there is less incentive to disguise them as confirmatory.

I think Ed’s suggestion is too narrow. Exploratory studies are essential to good science, not just good social science.

We often (or at least I often) have only a vague idea about how features I’m interested in relate to one another. Take leaf mass per area (LMA) and mean annual temperature or mean annual precipitation, for example. In a worldwide dataset compiled by IJ Wright and colleagues, tougher leaves (higher values of LMA) are associated with warmer temperatures and less rainfall.

LMA, mean annual temperature, and log(mean annual rainfall) in 2370 species at 163 sites (from Wright et al. 2004)

We expected similar relationships in our analysis of Protea and Pelargonium, but we weren’t trying to test those expectations. We were trying to determine what those relationships were. We were, in other words, exploring our data, and guess what we found. Tougher leaves are associated with less rainfall in both genera and with warmer temperatures in Protea. They were, however, associated with cooler temperatures in Pelargonium, exactly the opposite of what we expected. One reason for the difference might be that Pelargonium leaves are drought deciduous, so they avoid the summer drought characteristic of the regions from which our samples were collected. That is, of course, a post hoc explanation and has to be interpreted cautiously as a hypothesis to be tested, not as an established causal explanation. But that is precisely the point. We needed to investigate the phenomena to identify a pattern. Only then could we generate a hypothesis worth testing.

I find that I am usually more interested in discovering what the phenomena are than in tying down the mechanistic explanations for them. The problem, as Ed Hagen suggests, is that studies that explicitly label themselves as exploratory play little role in science. They tend to be seen as “fishing expeditions,” not serious science. The key, as Hagen suggests, is that to be useful, exploratory studies have to be done as carefully as explicit, hypothesis-testing confirmatory studies. A corollary he didn’t mention is that science will be well-served if we observe the distinction between exploratory and confirmatory studies.


How to think about replication

Caroline Tucker reviewed a paper by Nathan Lemoine and colleagues in late September and reminded us that inferring anything from small, noisy samples is problematic. Ramin Skibba now describes a good example of the problems that can arise.

In 1988, Fritz Strack and his colleagues reported that people found cartoons funnier if they held a pen in their teeth than if they held it between their lips. Why? Because holding a pen between your teeth causes you to smile, while holding one between your lips causes you to pout. This report spawned a series of studies on the “facial feedback hypothesis”, the hypothesis that facial expressions influence our emotional states. It seems plausible enough, and I know that I’ve read advice along this line in various places even though I’d never heard of the “facial feedback hypothesis” until I read Skibba’s article.

Unfortunately, the hypothesis wasn’t supported in what sounds like a pretty definitive study: 17 experiments and 1900 experimental subjects. Sixteen of the studies had large enough samples to be confident that the failure to detect an effect wasn’t a result of small sample size. Strack disagrees. He argues that (a) using a video camera to record participants may have made them self-conscious and suppressed their responses and (b) the cartoons were too old or too unfamiliar to participants to evoke an appropriate response.

Let’s take Strack at his word. Let’s assume he’s right on both counts. How important do you think the facial feedback to emotions is if being recorded by a video camera or being shown the wrong cartoons causes it to disappear (or at least to be undetectable)? I don’t doubt that Strack detected the effect in his sample, but the attempt to replicate his results suggests that the effect is either very sensitive to context or very weak.

I haven’t gone back to Strack’s paper to check on the original sample size, but the problem here is precisely what you’d expect to encounter if the original conclusions were based on a study in which the sample size was small relative to the signal-to-noise ratio. To reliably detect a small effect that varies across contexts requires either (a) very large sample size (if you want your conclusions to apply to the entire population) or (b) very careful specification of the precise subpopulation to which your conclusions apply (and extreme caution in attempting to generalize beyond it).


Noise miners

I’ve pointed out the problems with small, noisy samples using simulations (here, here, here, and here). But I’ve also learned that stories are far more persuasive than facts, and I’ve learned that I’m not good at telling stories. Fortunately, there are some people who tell stories very well, and John Schmidt is one of them. Here’s how his recent story, Noise Miners, starts.

What most people don’t understand about noise is how hard it is to find the good stuff.

You can get noise anywhere; most noise is just sitting on the ground, waiting for you to pick it up. Coincidences — “coinkidinks”, as collectors sometimes call them — can be had by the dozen just outside your front door. As I arrived in this small university town, home to one of the largest noise mines in the country, I planned to see how the high-quality noise was dug, and to learn about the often-forgotten people who dig it for us.

Follow the link and read the whole thing if that piques your interest.

Noisy data and small samples are a bad combination

I first mentioned the problems associated with small samples and noisy data in late August. That post demonstrated that you’d get the sign wrong almost half of the time with a small sample, even though a t-test would tell you that the result is statistically significant. The next two posts on the topic (September 9th and 19th) pointed out that being Bayesian won’t save you, even if you use fairly informative priors.
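For readers who would rather see the sign problem than take it on faith, here is a small Python simulation in the same spirit. It is not the R code from the August post: the true difference (0.05), the standard deviation (1.0), the group size (10), and the number of replicates are assumptions chosen for illustration, and the hard-coded critical value applies only to a pooled two-sample t-test with 18 degrees of freedom.

```python
import math
import random

N = 10          # observations per group (assumed small sample)
T_CRIT = 2.101  # two-sided 5% critical value of Student's t with 2*N - 2 = 18 df


def mean_and_var(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def simulate(true_diff=0.05, sd=1.0, reps=20000, seed=1):
    """Count significant t-tests and how many of those have the wrong sign.

    Assumes true_diff is positive, so a negative estimate is a sign error.
    """
    rng = random.Random(seed)
    significant = 0
    wrong_sign = 0
    total_abs_estimate = 0.0
    for _ in range(reps):
        control = [rng.gauss(0.0, sd) for _ in range(N)]
        treated = [rng.gauss(true_diff, sd) for _ in range(N)]
        m_c, v_c = mean_and_var(control)
        m_t, v_t = mean_and_var(treated)
        diff = m_t - m_c
        pooled_var = (v_c + v_t) / 2.0  # equal group sizes
        t = diff / math.sqrt(pooled_var * 2.0 / N)
        if abs(t) > T_CRIT:
            significant += 1
            total_abs_estimate += abs(diff)
            if diff < 0:
                wrong_sign += 1
    # Average size of a significant estimate relative to the true difference.
    exaggeration = total_abs_estimate / significant / true_diff
    return significant, wrong_sign, exaggeration


significant, wrong_sign, exaggeration = simulate()
print(f"significant results: {significant} of 20000")
print(f"wrong sign among significant results: {wrong_sign / significant:.2f}")
print(f"average exaggeration of the effect: {exaggeration:.1f}x")
```

The last number is the magnitude error: an estimate has to be many times larger than the true difference before a sample this small will call it significant.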

It turns out that I’m not alone in pointing out these problems. Caroline Tucker discusses a new paper in Ecology by Nathan Lemoine and colleagues that points out the same difficulties. She sums the problem up nicely.

It’s a catch-22 for small effect sizes: if your result is correct, it very well may not be significant; if you have a significant result, you may be overestimating the effect size.

There is no easy solution. Lemoine and his colleagues focus on errors of magnitude, where I’ve been focusing on errors in sign, but the bottom line is the same:

Be wary of results from studies with small sample sizes, even if the effects are statistically significant.

Lemoine, N.P., A. Hoffman, A.J. Felton, L. Baur, F. Chaves, J. Gray, Q. Yu, and M.D. Smith. 2016. Underappreciated problems of low replication in ecological field studies. Ecology doi: 10.1002/ecy.1506

Even an informative prior doesn’t help much

Two weeks ago I pointed out that you should

Be wary of results from studies with small sample sizes, even if the effects are statistically significant.

Last week I pointed out that being Bayesian won’t save you. If you were paying close attention, you may have thought to yourself

Holsinger’s characterization of Bayesian inference isn’t completely fair. The mean effect sizes he simulated were only 0.05, 0.10, and 0.20, but he used a prior with a standard deviation of 1.0 in his analyses. Any Bayesian in her right mind wouldn’t use a prior that broad, because she’d have a clue going into the experiment that the effect size was relatively small. She’d pick a prior that more accurately reflects prior knowledge of the likely results.

It’s a fair criticism, so to see how much difference more informative priors make, I re-did the simulations with a Gaussian prior on each mean with a prior mean of 0.0 (as before) and a standard deviation of 2 times the effect size used in the simulation. Here are the results:

Mean   Sample size   Power      Wrong sign
0.05   10            0/1000     na
0.05   50            2/1000     0/2
0.05   100           7/1000     2/7
0.10   10            1/1000     0/1
0.10   50            0/1000     na
0.10   100           0/1000     na
0.20   10            22/1000    2/22
0.20   50            128/1000   0/158
0.20   100           265/1000   0/292

With a more informative prior, you’re not likely to say that an effect is positive when it’s actually negative. There are, however, a couple of things worth noticing when you compare this table to the last one.

  1. The more informative prior doesn’t help much, if at all, with a sample size of 10. The N(0,1) prior got the sign wrong in 7 out of 62 cases where the 95% credible interval on the posterior mean difference did not include 0. The N(0,0.4) prior made the same mistake in 2 out of 22 cases. So it didn’t make as many mistakes as the less informative prior, but it made almost the same proportion. In other words, you’d be almost as likely to make a sign error with the more informative prior as you are with the less informative prior.
  2. Even with a sample size of 100, you wouldn’t be “confident” that there is a difference very often (only 7 times out of 1000) when the “true” difference is small, 0.05, but you’d make a sign error nearly a third of the time (2 out of 7 cases).

So what does all of this mean? When designing and interpreting an experiment you need to have some idea of how big the between-group differences you might reasonably expect to see are relative to the within-group variation. If the between-group differences are “small”, you’re going to need a “large” sample size to be confident about your inferences. If you haven’t collected your data yet, the message is to plan for “large” samples within each group. If you have collected your data and your sample size is small, be very careful about interpreting the sign of any observed differences – even if they are “statistically significant.”

What’s a “small” difference, and what’s a “large” sample? You can play with the R/Stan code in Github to explore the effects: https://github.com/kholsinger/noisy-data. You can also read Gelman and Carlin (Perspectives on Psychological Science 9:641; 2014 http://dx.doi.org/10.1177/1745691614551642) for more rigorous advice.
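As a rough first answer, the textbook normal-approximation formula for comparing two means gives a feel for the numbers. The Python sketch below is not taken from the repository or from Gelman and Carlin; it assumes a within-group standard deviation of 1.0, a two-sided 5% test, and a target power of 80%, all of which are illustrative choices.

```python
import math
from statistics import NormalDist


def n_per_group(diff, sd=1.0, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a difference
    `diff` between two means (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sd / diff) ** 2)


for diff in (0.05, 0.10, 0.20):
    print(f"difference {diff:.2f}: about {n_per_group(diff)} per group")
```

Under these assumptions even the largest of the three differences calls for several hundred observations per group, far more than the sample sizes of 10, 50, and 100 used in the simulations.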

Being Bayesian won’t save you

Last week I pointed out that you should

Be wary of results from studies with small sample sizes, even if the effects are statistically significant.

Now you may be thinking to yourself: “I’m a Bayesian, and I use somewhat informative priors. This doesn’t apply to me.” Well, I’m afraid you’re wrong. Here are results from analysis of data simulated according to the same conditions I used last week in exploring P-values. The prior on each mean is N(0, 1), and the prior on each standard deviation is half-N(0, 1).

Mean   Sample size   Power      Wrong sign
0.05   10            39/1000    18/39
0.05   50            59/1000    12/59
0.05   100           47/1000    5/47
0.10   10            34/1000    8/34
0.10   50            81/1000    10/81
0.10   100           115/1000   6/115
0.20   10            62/1000    7/62
0.20   50            158/1000   2/158
0.20   100           292/1000   0/292

Here “Power” refers to the number of times (out of 1000 replicates) the symmetric 95% credible intervals do not overlap 0, which is when we’d normally conclude we have evidence that the means of the two populations are different. Notice that when the effect and sample size are small (0.05 and 10, respectively), we would infer the wrong sign for the difference almost half of the time (18/39). We’re less likely to make a sign error when the effect is larger (7/62 for an effect of 0.20) or when the sample size is large (5/47 for a sample size of 100). But the bottom line remains the same:

Be wary of results from studies with small sample sizes, even if the effects are statistically significant.
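A stripped-down version of this experiment can be run without Stan. The Python sketch below is a simplification, not the analysis behind the table: it assumes the within-group standard deviation is known (1.0) rather than giving it a half-normal prior, which lets the posterior for each mean be computed in closed form from the N(0, prior_sd) prior. The function name and default values are mine.

```python
import math
import random


def simulate_posterior(true_diff=0.05, sd=1.0, n=10, prior_sd=1.0,
                       reps=20000, seed=2):
    """Count replicates whose 95% credible interval for the difference in
    means excludes 0, and how many of those have the wrong sign.

    Assumes a known sd and a positive true_diff.
    """
    rng = random.Random(seed)
    # Conjugate update for a normal mean with a N(0, prior_sd^2) prior.
    precision = 1.0 / prior_sd ** 2 + n / sd ** 2
    shrink = (n / sd ** 2) / precision       # weight on the sample mean
    post_sd_diff = math.sqrt(2.0 / precision)  # posterior sd of the difference
    excluded = 0
    wrong_sign = 0
    for _ in range(reps):
        mean_c = sum(rng.gauss(0.0, sd) for _ in range(n)) / n
        mean_t = sum(rng.gauss(true_diff, sd) for _ in range(n)) / n
        post_diff = shrink * (mean_t - mean_c)
        if abs(post_diff) > 1.96 * post_sd_diff:
            excluded += 1
            if post_diff < 0:
                wrong_sign += 1
    return excluded, wrong_sign


excluded, wrong_sign = simulate_posterior()
print(f"interval excludes 0: {excluded} of 20000")
print(f"wrong sign among those: {wrong_sign / excluded:.2f}")
```

Calling `simulate_posterior(prior_sd=0.1)` shows what a much tighter prior does in this simplified setting: the posterior is pulled so strongly toward zero that the interval almost never excludes it.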

The figure below summarizes results from the simulation, and you’ll find the code in the same Github repository as the P-value code I mentioned last week: https://github.com/kholsinger/noisy-data. Remember that Gelman and Carlin (Perspectives on Psychological Science 9:641; 2014 http://dx.doi.org/10.1177/1745691614551642) also have advice on how to tell whether your data are too noisy for your sample to give confidence in your inferences.

[Figure: summary of the Bayesian simulation results]