Randomized controlled experiments are generally regarded as the gold standard for identifying a causal factor.1 Let’s describe a really simple one first. Then we’ll explore why they’re regarded as the gold standard.
Picking up with the example I used last time, let’s suppose we’re trying to test the hypothesis that applying nitrogen fertilizer increases the yield of corn.2 As I pointed out, in setting up our experiment, we’d seek to control for every variable that could influence corn yield so that we can isolate the effect of nitrogen. In the simplest possible case, we’d have two adjacent plots in a field that have been plowed and tilled thoroughly so that the soil in the two plots is completely mixed and indistinguishable in every way – same content of nitrogen, phosphorous, other macronutrients, other micronutrients; same soil texture; same percent of (the same kind of) soil organic matter; same composition of clay, silt, and sand; everything.3 We’d also have plants that were genetically uniform (or as genetically uniform as we can make them), either highly inbred lines or an F1 cross produced between two highly inbred lines. We’d make sure the field was level, maybe using high-tech laser leveling devices, and we’d make sure that every plant in the entire field received the same amount of water. Since we know that the microclimate at the perimeter of the field is different from in the middle of the field, we’d make the field big enough that we could focus our measurements on a part of the field isolated from these edge effects. Then we’d randomly choose one side of the field to be the “low N” treatment and the other to be the “high N” treatment.4 After allowing the plants to grow for an appropriate amount of time, we’d harvest them, dry them, and weigh them.
Our hypothesis has the form
If N is applied to a corn field, then the yield will be greater than if it had not been applied.
Notice that we can’t both apply N and not apply N to the same set of plants. We have to compare what happens when we apply N to one set of plants and don’t apply it to another. If we find that the “high N” plants have a greater yield than the “low N” plants, we infer that the “low N” plants would also have had a greater yield if we had applied N to them (which we didn’t). Why is that justified? Because everything about the two treatments is identical, by design, except for the amount of N applied. If there’s a difference in yield, it can only be attributed to something that differs between the treatments, and the only thing that differs is the amount of N applied.
I can hear you thinking, “Couldn’t the difference just be due to chance?” Well, yes it could. If we do a statistical test and demonstrate that the yields are statistically distinguishable, that increases our confidence that the difference in yield is real, but nothing can ever make the conclusion logically certain in the way we can be logically certain that 2+2=4.5 To my mind there are two things that make us accept the outcome of this experiment as evidence that applying N increases corn yield:
- It’s not just this experiment. If the same experiment is repeated in different places with different soil types, different corn genotypes, and different weather patterns, we get the same result. We can never be certain, but the consistency of that result increases our confidence that the association isn’t just a fluke.
- What we understand about plant growth and physiology leads us to expect that providing nitrogen in fertilizer should enhance plant growth. In other words, this particular hypothesis is part of a larger theoretical framework about plant physiology and development. That framework provides a coherent and repeatable set of predictions across a wide empirical domain.
Put those two together, and we have good reason for thinking that the observed association between N fertilizer and corn yield is actually a causal association.
In experiments where we can’t completely control all relevant variables except the one that we’re interested in, we rely on randomization. Suppose, for example, we couldn’t produce genetically uniform corn. Then we’d randomize the assignment of individuals to the “high” and “low” treatments. The results aren’t quite as solid as if we’d had complete uniformity. It’s always possible that by some statistical fluke a factor we aren’t measuring ends up overrepresented in one treatment and underrepresented in the other, but if we’ve randomized well and we have a reasonably large sample, the chances are small. So our inference isn’t quite as firm, but it’s still pretty goo.
We’ll explore the “reasonably large sample question” in the next installment.
- See, for example, Rubin (Annals of Applied Statistics 2:808-840; 2008. https://projecteuclid.org/euclid.aoas/1223908042) ↩
- If you know me or my work, you know that I’m not at all crazy about the null hypothesis testing approach to investigating ecology. We’ll get to that later, but let’s start with a simple case. Even those of us who don’t like null hypothesis testing as a general approach recognize that it has value. We’ll focus on one way in which it has value here. ↩
- If we were really fastidious we might even set up the experiment in a large growth chamber in which we mixed the soil together and distributed it evenly ourselves. ↩
- If we were really paranoid about controlling for all possible factors, we’d even randomly assign a nitrogen fertilizer level (high or low) to every different plant in the field, and we’d probably do the whole experiment in a very large growth chamber where we could mix the soil ourselves and ensure that light, humidity, and temperature were as uniform as possible across all individuals in the experiment. ↩
- If you don’t see why, Google “problem of induction” and you’ll get some idea. If that doesn’t satisfy you, ask, and I’ll see what I can do to provide an explanation. ↩