Uncommon Ground

Statistics

Causal inference in ecology – no post this week

Last week was finals week at UConn, which means Commencement weekend began on Saturday. I represented the Provost’s Office at the PharmD ceremony at 9:00am Saturday morning, and our Master’s Commencement Ceremony was held at 1:30pm. I had yesterday off, but because of all the things that accumulated last week, I didn’t have time to write the next installment of this series. It will return next Monday, barring unforeseen complications.

In the meantime, this page contains links to the posts that have appeared so far. It also contains links to some related posts that you might find interesting. You may also have noticed the “Causal inference in ecology” label at the top of this page. That’s a link to the same page of posts in case you want to find it again.

Causal inference in ecology – Randomization and sample size

Causal inference in ecology – links to the series

Last week I explored the logic behind controlled experiments and why they are typically regarded as the gold standard for identifying and measuring causal effects.1 Let me tie that post and the preceding one on counterfactuals together before we proceed with the next idea. To make things as concrete as possible, let’s return to our hypothetical example of determining whether applying nitrogen fertilizer increases the yield of corn. We do so by

  • Randomly assigning individual corn plants to different plots within a field.
  • Applying nitrogen fertilizer to some plots, the treatment plots, and not to others, the control plots.
  • Determining whether the yield in treatment plots exceeds that in the control plots.

Where do counterfactuals come in? If the yield of treatment plots exceeds that of control plots, aren’t we done? Well, not quite. You see, the plants that were in the treatment plots are different individuals from those that are in the control plots. To infer that nitrogen fertilizer increases yield, we have to extrapolate the results from the treatment plots to the control plots. We have to be willing to conclude that the yield in the control plots would have been greater if we had applied nitrogen fertilizer there. That’s the counterfactual. We are asserting what would have happened if the facts had been different. In practice, we don’t usually worry about this step in the logic, because we presume that our random assignment of corn plants to different plots means that the plants in the two plots are essentially equivalent. As I pointed out last time, that inference depends on having done the randomization well and having a reasonably large sample.

Let’s assume that we’ve done the randomization well, say by using a pseudorandom number generator in our computer to assign individual plants to the different plots. But let’s also assume that there is genetic variation among our corn plants that influences yield. To make things really simple, let’s assume that there’s a single locus with two alleles associated with yield differences, that high yield is dominant to low yield, and that the two alleles are in equal frequency, so that 75% of the individuals are high yield and 25% are low yield. Let’s further assume that high yield plants produce 1kg of corn on average (sd=0.1kg) and that low yield plants produce 0.5kg of corn on average (sd=0.1kg).2 Let’s also assume that applying nitrogen fertilizer has absolutely no effect on yield. Then a simple simulation in R produces the following results:3

Sample size:  5 
          lo:  133 
          hi:  140 
     neither:  9727 
Sample size:  10 
          lo:  201 
          hi:  175 
     neither:  9624 
Sample size:  20 
          lo:  255 
          hi:  217 
     neither:  9528 

What you can see from these results is that I was only half right. You need to do the randomization well,4 but your sample size doesn’t need to be all that big to ensure that you get reasonable results. Keep in mind that “reasonable results” here means that (a) you reject the null hypothesis of no difference in yield about 5% of the time and (b) you reject it either way at about the same frequency.5 There are, however, other reasons that you want to have reasonable sample sizes. Refer to the posts linked to on the Causal inference in ecology page for more information about that.
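In case it helps to see where numbers like those come from, here is a minimal sketch of the kind of simulation described above. It is not the exact code behind the results (that’s on the next page), but it uses the same assumptions: a 75:25 mix of high- and low-yield plants, no fertilizer effect at all, and a two-sided t-test at the 5% level.

set.seed(1234)

simulate_once <- function(n) {
  # each plant is high yield (mean 1 kg) with probability 0.75, low yield (0.5 kg) otherwise
  yield_means <- function(k) ifelse(runif(k) < 0.75, 1.0, 0.5)
  # fertilizer has no effect, so treatment and control plants come from the same process
  control   <- rnorm(n, mean = yield_means(n), sd = 0.1)
  treatment <- rnorm(n, mean = yield_means(n), sd = 0.1)
  p <- t.test(treatment, control)$p.value
  if (p > 0.05) {
    "neither"
  } else if (mean(treatment) > mean(control)) {
    "hi"
  } else {
    "lo"
  }
}

for (n in c(5, 10, 20)) {
  outcome <- replicate(10000, simulate_once(n))
  cat("Sample size: ", n, "\n",
      "         lo: ", sum(outcome == "lo"), "\n",
      "         hi: ", sum(outcome == "hi"), "\n",
      "    neither: ", sum(outcome == "neither"), "\n", sep = "")
}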

With counterfactuals, controlled experiments, and randomization out of the way, our next stop will be the challenge of falsification.

  1. I didn’t discuss the “and measuring” part last week, only the “identifying” part. We’ll return to measuring causal effects later in this series after we’ve explored issues associated with identifying causal effects (or exhausted ourselves trying).
  2. That corresponds to an effect size of 0.2 standard deviations.
  3. Click through to the next page to see the R code.
  4. OK, you can’t see that you need to do the randomization well, but I did it well and it worked, so why not do it well and be safe?
  5. Since I used a two-sided t-test with a 5% significance threshold, this is just what you should expect.


Causal inference in ecology – Controlled experiments

Causal inference in ecology – links to the series

Randomized controlled experiments are generally regarded as the gold standard for identifying a causal factor.1 Let’s describe a really simple one first. Then we’ll explore why they’re regarded as the gold standard.

Picking up with the example I used last time, let’s suppose we’re trying to test the hypothesis that applying nitrogen fertilizer increases the yield of corn.2 As I pointed out, in setting up our experiment, we’d seek to control for every variable that could influence corn yield so that we can isolate the effect of nitrogen. In the simplest possible case, we’d have two adjacent plots in a field that have been plowed and tilled thoroughly so that the soil in the two plots is completely mixed and indistinguishable in every way – same content of nitrogen, phosphorus, other macronutrients, other micronutrients; same soil texture; same percent of (the same kind of) soil organic matter; same composition of clay, silt, and sand; everything.3 We’d also have plants that were genetically uniform (or as genetically uniform as we can make them), either highly inbred lines or an F1 cross produced from two highly inbred lines. We’d make sure the field was level, maybe using high-tech laser leveling devices, and we’d make sure that every plant in the entire field received the same amount of water. Since we know that the microclimate at the perimeter of the field is different from that in the middle of the field, we’d make the field big enough that we could focus our measurements on a part of the field isolated from these edge effects. Then we’d randomly choose one side of the field to be the “low N” treatment and the other to be the “high N” treatment.4 After allowing the plants to grow for an appropriate amount of time, we’d harvest them, dry them, and weigh them.

Our hypothesis has the form

If N is applied to a corn field, then the yield will be greater than if it had not been applied.

Notice that we can’t both apply N and not apply N to the same set of plants. We have to compare what happens when we apply N to one set of plants and don’t apply it to another. If we find that the “high N” plants have a greater yield than the “low N” plants, we infer that the “low N” plants would also have had a greater yield if we had applied N to them (which we didn’t). Why is that justified? Because everything about the two treatments is identical, by design, except for the amount of N applied. If there’s a difference in yield, it can only be attributed to something that differs between the treatments, and the only thing that differs is the amount of N applied.

I can hear you thinking, “Couldn’t the difference just be due to chance?” Well, yes it could. If we do a statistical test and demonstrate that the yields are statistically distinguishable, that increases our confidence that the difference in yield is real, but nothing can ever make the conclusion logically certain5 in the way we can be logically certain that 2+2=4. To my mind there are two things that make us accept the outcome of this experiment as evidence that applying N increases corn yield:

  1. It’s not just this experiment. If the same experiment is repeated in different places with different soil types, different corn genotypes, and different weather patterns, we get the same result. We can never be certain, but the consistency of that result increases our confidence that the association isn’t just a fluke.
  2. What we understand about plant growth and physiology leads us to expect that providing nitrogen in fertilizer should enhance plant growth. In other words, this particular hypothesis is part of a larger theoretical framework about plant physiology and development. That framework provides a coherent and repeatable set of predictions across a wide empirical domain.

Put those two together, and we have good reason for thinking that the observed association between N fertilizer and corn yield is actually a causal association.

In experiments where we can’t completely control all relevant variables except the one that we’re interested in, we rely on randomization. Suppose, for example, we couldn’t produce genetically uniform corn. Then we’d randomize the assignment of individuals to the “high” and “low” treatments. The results aren’t quite as solid as if we’d had complete uniformity. It’s always possible that by some statistical fluke a factor we aren’t measuring ends up overrepresented in one treatment and underrepresented in the other, but if we’ve randomized well and we have a reasonably large sample, the chances are small. So our inference isn’t quite as firm, but it’s still pretty good.
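To see how such a fluke can happen, here is a tiny illustration with made-up numbers (mine, not part of the experiment described above): randomize plants to treatments and then tabulate an unmeasured genotype.

set.seed(42)
n <- 10                                            # plants per treatment
genotype  <- rbinom(2 * n, 1, 0.75)                # unmeasured factor: 1 = high-yield genotype
treatment <- sample(rep(c("high N", "low N"), n))  # random assignment to treatments
table(treatment, genotype)                         # how balanced is the unmeasured factor?

Run it a few times with different seeds and you will occasionally see a noticeably lopsided table; with a hundred plants per treatment the imbalance all but disappears.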

We’ll explore the “reasonably large sample” question in the next installment.

  1. See, for example, Rubin (2008, Annals of Applied Statistics 2:808–840; https://projecteuclid.org/euclid.aoas/1223908042).
  2. If you know me or my work, you know that I’m not at all crazy about the null hypothesis testing approach to investigating ecology. We’ll get to that later, but let’s start with a simple case. Even those of us who don’t like null hypothesis testing as a general approach recognize that it has value. We’ll focus on one way in which it has value here.
  3. If we were really fastidious we might even set up the experiment in a large growth chamber in which we mixed the soil together and distributed it evenly ourselves.
  4. If we were really paranoid about controlling for all possible factors, we’d even randomly assign a nitrogen fertilizer level (high or low) to every different plant in the field, and we’d probably do the whole experiment in a very large growth chamber where we could mix the soil ourselves and ensure that light, humidity, and temperature were as uniform as possible across all individuals in the experiment.
  5. If you don’t see why, Google “problem of induction” and you’ll get some idea. If that doesn’t satisfy you, ask, and I’ll see what I can do to provide an explanation.

Causal inference in ecology – Counterfactuals

Causal inference in ecology – links to the series

Let’s start with a few preliminaries.1

  • A causal factor (“cause” for short) is something that is predictably related to a particular outcome. For example, fertilizing crops generally increases their yield, so fertilizer is a causal factor related to yield. The way I think about it, a causal factor need not always lead to the outcome. It’s enough if it merely increases the probability of the outcome. For example, smoking doesn’t always lead to lung cancer among those who smoke, but it does increase the probability that you will suffer from lung cancer if you smoke.
  • Causes precede effects.2 That’s one reason why teleology is problematic. A teleological explanation explains the current state of things as a result of, i.e., as caused by, something in the future, namely a purpose.3
  • Effects may have multiple causes. The world, or at least the world of biology, is a complicated place. Regardless of what phenomenon you’re studying, there are likely to be several (or many) causal factors that influence it.

The last point is one of the most important ones for purposes of this series. When we are investigating a phenomenon,4 we’re trying to discern which of several plausible causal factors plays a role and, possibly, the relative “importance” of those causal factors.5

To make this concrete, let’s suppose that we’re trying to determine whether application of nitrogen fertilizer increases the yield of corn. That means we have to determine whether adding nitrogen, and adding nitrogen alone, increases corn yield. Why the emphasis on “adding nitrogen alone”? Suppose that we added nitrogen to a corn field by adding manure. Then increases in the amount of applied nitrogen are associated with increases in the amounts of a host of other substances. If yields increased, we’d know that adding manure increases yield, but not whether it’s because of the nitrogen in manure or something else. Why does this matter?

From very early on in our education we’re taught that “correlation is not the same as causation.” We want to distinguish cases where A causes B from cases where A is merely correlated with B. Yet, as David Hume pointed out long ago, experience6 alone can only show us that A and B actually occur together, not that they must occur together (link). One way of distinguishing cause from correlation is that causes support counterfactual statements. They provide us with a reason to believe statements like “If we had applied nitrogen to the field, the corn yield would have increased” even if we never applied nitrogen to the field at all. The only reason I can see that we could believe such a statement is if we had already determined that adding nitrogen and adding nitrogen alone increases corn yield.7
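One way to make the counterfactual idea concrete is to borrow the potential outcomes view from the Rubin causal model mentioned in the introduction to this series: imagine every plot has two yields, one with nitrogen and one without, and we only ever observe one of them. A tiny made-up sketch (the numbers here are pure illustration):

set.seed(7)
n <- 8
y0 <- rnorm(n, mean = 5, sd = 0.5)   # yield each plot would have without N
y1 <- y0 + 1                         # yield each plot would have with N (true effect = 1)
nitrogen <- sample(rep(c(0, 1), each = n / 2))   # which plots actually receive N
observed <- ifelse(nitrogen == 1, y1, y0)        # only one potential outcome per plot is ever seen
mean(y1 - y0)                                    # the true average causal effect (unobservable in practice)
mean(observed[nitrogen == 1]) - mean(observed[nitrogen == 0])   # what randomization lets us estimate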

How do we determine that? Randomized controlled experiments are the most widely known approach, and they are typically regarded as the gold standard against which all other means of inference are compared. That’s where we’ll pick up in the next installment.

  1. As I warned in the introduction to the series, I am not an expert in causal inference. The terminology I use is likely both to be imprecise and to be somewhat different from the terminology experts use.
  2. Philosophers have argued about whether backward causation is possible, but I’m going to ignore that possibility.
  3. Biologists sometimes use teleological language to explain adaptation, e.g., land animals evolved legs to provide mobility. It is, however, relatively easy (if a bit long-winded) to eliminate the teleological language, because natural selection shows how adaptations arise from differential reproduction and survival (link).
  4. Or at least this is how it is when I’m investigating a phenomenon.
  5. I’ll come back to the idea of identifying the relative importance of causal factors in a future post.
  6. Or experiment.
  7. If there are any philosophers reading this, you’ll recognize that this account is horribly sketchy and amounts to little more than proof by vigorous assertion. If you’re so inclined, I invite you to flesh out more complete explanations for readers who are interested.

Causal inference in ecology – Introduction to the series

If you’ve been following posts here since the first of the year, you know that I’ve been writing about how I keep myself organized. Today I’m starting a completely different series in which I begin to collect my thoughts on how we can make judgments about the cause (or causes) of ecological phenomena1 and the circumstances under which judgments are possible. Before I start, I need to offer a few disclaimers.

  • Any evolutionary biologist or ecologist who knows me and my work knows that it’s not uncommon for my ideas to represent a minority opinion. (Think pollen discounting for those of you who know my work on the evolution of plant mating systems.) I make no claim that anything I write here is broadly representative of what my fellow evolutionary biologists and ecologists think, only that it’s what I think. Please challenge me on anything you think I’ve got wrong, because I’m sure there will be things I get wrong, and the easiest way for me to discover those errors is for someone else to point them out.
  • I had a minor in Philosophy as an undergraduate, and there is an enormous literature on causality in the philosophy of science. I’ll be using a very crude understanding of “cause.” I don’t think it is wildly misleading, but I’m certain it wouldn’t stand up to serious scrutiny.2
  • I’ll be thinking about causal inference in the specific context of trying to infer causes from observational data using statistics, rather than from controlled experiments.3 I’ll be using an approach developed in the 1970s by Donald Rubin, the Rubin Causal Model.4
  • There is a very large literature on causal inference in the social sciences. I’ll be drawing heavily on Imbens and Rubin, Causal Inference for Statistics, Social and Biomedical Sciences: An Introduction,5 but there’s an enormous amount of material there that I won’t attempt to cover. I am also pretty new to the concepts associated with the Rubin causal model, so it’s entirely possible that I’ll misrepresent or misinterpret a point that the real experts got right. In other words, if something I say doesn’t make any sense, it’s more likely I got it wrong than that Imbens and Rubin got it wrong.

Although I will be thinking about causal inference in the context of observational data and statistics, I don’t plan to write much (if at all) about the problems with P-values, Bayes factors, credible/confidence intervals overlapping 0 (or not), and the like. If you’d like to know the concerns I have about them, here are links to old posts on those issues.

  1. I’m calling the post “Causal inference in ecology” only because “Causal inference in ecology, evolutionary biology, and population genetics” would be too long.
  2. There’s a good chance that a moderately competent undergraduate Philosophy major would find it woefully inadequate.
  3. To be more precise, we don’t infer causes from controlled experiments. Rather, we have pre-existing hypotheses about possible causes, and we use controlled experiments to test those hypotheses.
  4. In my relatively limited reading on the subject, I’ve most often seen it referred to as the Rubin causal model, but it is sometimes referred to as the Neyman causal model.
  5. Reminder: If you click on that link, it will take you to Amazon.com. I use that link simply because it’s convenient. You can buy the book, if you’re so inclined, from many other outlets. I am not an Amazon affiliate, and I will not receive any compensation if you decide to buy the book regardless of whether you buy it at Amazon or elsewhere. By the way, Chapter 23 in Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models has an excellent overview of the Rubin causal model.

Exploring mixed models in Stan

I am about to begin developing a moderately complex mixed model in Stan to analyze relationships among anatomical/morphological traits (e.g., leaf thickness, LMA, wood density), physiological performance (e.g., Amax, stem hydraulic conductance), and indices of fitness (e.g., height, growth rate, number of seedheads). One complication is that the observations are from several different species of Protea at several different sites.1 We’re going to treat sites as nested within species.

Before I start building the whole model, I wanted to make sure that I can do a simple mixed linear regression with a random site effect nested within a random species effect. In stan_lmer() notation that becomes:

stan_lmer(Amax ~ LMA + (1|Species/Site))

I ran a version of my code with several covariates in addition to LMA using hand-coded Stan and compared the results to those from stan_lmer(). Estimates for the overall intercept and the regression coefficients associated with each covariate were very similar. The estimates of both the standard deviations and the individual random effects at the species and site-within-species levels were rather different – especially at the species level. This was troubling, so I set up a simple simulation to see if I could figure out what was going on. The R code, Stan code, and simulation results are available on GitHub: https://kholsinger.github.io/mixed-models/.

The model used for simulation is very simple:

\(
\begin{aligned}
y_k &\sim \mbox{N}(\mu_k, \sigma) \\
\mu_k &= \beta_{0,\mbox{species|site}} + \beta_1 x_k \\
\beta_{0,\mbox{species|site}} &\sim \mbox{N}(\beta_{0,\mbox{species}}, \sigma_{\mbox{species|site}}) \\
\beta_{0,\mbox{species}} &\sim \mbox{N}(\beta_0, \sigma_{\mbox{species}})
\end{aligned}
\)

Happily, the Stan code I wrote does well in recovering the simulation parameters.2 Surprisingly, it does better at recovering the random effect parameters than stan_lmer(). I haven’t completely sorted things out yet, but the difference is likely to be a result of different prior specifications for the random effects. My simulation code3 uses independent Cauchy(0,5) priors for all of the standard deviation parameters. stan_lmer() uses a covariance structure for all parameters that vary by group.4 If the difference in prior specifications is really responsible, it means that the differences between my approach and the approach used in stan_lmer() will vanish as the number of groups grows.

Since we’re only interested in the analog of \(\beta_1\) for the analyses we’ll be doing, the difference in random effect estimates doesn’t bother me, especially since my approach seems to recover them better given the random effect structure we’re working with. This is, however, a good reminder that if you’re working with mixed models and you’re interested in the group-level parameters, you’re going to need a large number of groups, not just a large number of individuals, to get reliable estimates of the group parameters.
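If you want to experiment with the general idea, here is a minimal sketch (not the code in the repository linked above, and with made-up parameter values) that simulates data with a site-within-species random intercept and fits it with stan_lmer():

library(rstanarm)

set.seed(101)
n_species <- 8; n_sites_per_species <- 4; n_per_site <- 20
beta_0 <- 1.0; beta_1 <- 0.5
sigma_species <- 0.5; sigma_site <- 0.25; sigma <- 0.2

n_sites <- n_species * n_sites_per_species
species_of_site <- rep(seq_len(n_species), each = n_sites_per_species)

# species-level intercepts, then site-level intercepts nested within species
b_species <- rnorm(n_species, beta_0, sigma_species)
b_site    <- rnorm(n_sites, b_species[species_of_site], sigma_site)

site    <- rep(seq_len(n_sites), each = n_per_site)
species <- species_of_site[site]
x <- rnorm(length(site))
y <- rnorm(length(site), b_site[site] + beta_1 * x, sigma)

dat <- data.frame(y = y, x = x,
                  Species = factor(species), Site = factor(site))
fit <- stan_lmer(y ~ x + (1 | Species/Site), data = dat, refresh = 0)
print(fit, digits = 2)

With only eight species, expect the species-level standard deviation to be estimated much less precisely than the slope, which is exactly the point about needing many groups.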


Keep your data tidy

If you’ve spent any time using R, you probably know the name Hadley Wickham. He’s chief scientist at RStudio, the author of 4 books on R, and the author of several indispensable R packages, including ggplot2, dplyr, and devtools. I was reminded recently that several years ago, he wrote a very useful paper for the Journal of Statistical Software, “Tidy data” (August 2014, Volume 59, Issue 10, https://www.jstatsoft.org/article/view/v059i10).

If you are familiar with Hadley’s contributions to R, you won’t be surprised that tidy data has a simple, clean – tidy – set of requirements:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

That sounds simple, but it requires that many of us rethink the way we structure our data: no more column headers as values, no more storing of multiple variables in one column, no more storing some variables in rows and others in columns. Fortunately, Hadley is also the author of tidyr. I haven’t used it yet, but given how bad I am at starting with tidy data, I suspect I’ll be using it a lot in the future.
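As a small, made-up example of the first problem (column headers that are really values of a variable), tidyr makes the fix a one-liner:

library(tidyr)

# yields recorded with the treatment level stored in the column names
untidy <- data.frame(plant  = c("A", "B", "C"),
                     low_N  = c(0.8, 0.9, 0.7),
                     high_N = c(1.1, 1.2, 1.0))

# gather the two yield columns into a treatment column and a yield column
tidy <- gather(untidy, key = "treatment", value = "yield", low_N, high_N)
tidy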

I’m sorry. P < 0.005 won't save you.

Recently, a group of distinguished psychologists posted a preprint on PsyArXiv arguing for a re-definition of the traditional significance threshold, lowering it from P < 0.05 to P < 0.005. They are concerned about reproducibility, and they argue that “changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.” That’s all true, but I’m afraid it won’t necessarily “improve the reproducibility of scientific research in many fields.”

Why? Let’s take a little trip down memory lane.

Almost a year ago I pointed out that we need to “Be wary of results from studies with small sample sizes, even if the effects are statistically significant.” I illustrated why with the following figure, produced using R code available on GitHub: https://github.com/kholsinger/noisy-data

What that figure shows is the distribution of P-values that pass the P < 0.05 significance threshold when the true difference between two populations is mu standard deviations (with the same standard deviation in both populations) and with equal sample sizes of n. The results are from 1000 random replications. As you can see, when the sample size is small, there’s a good chance that a significant result will have the wrong sign, i.e., the observed difference will be negative rather than positive, even if the between-population difference is 0.2 standard deviations. When the between-population difference is 0.05 standard deviations, you’re almost as likely to say the difference is in the wrong direction as to get it right.
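If you don’t want to dig through the repository, here is a stripped-down version of the idea (my sketch, not the repository code): simulate two groups, keep only the “significant” comparisons, and count how often the estimated difference has the wrong sign.

set.seed(314)
sign_errors <- function(n, mu, alpha = 0.05, n_reps = 1000) {
  wrong <- 0
  signif <- 0
  for (i in seq_len(n_reps)) {
    x <- rnorm(n, mean = 0)    # population 1
    y <- rnorm(n, mean = mu)   # population 2, true difference of mu standard deviations
    if (t.test(y, x)$p.value < alpha) {
      signif <- signif + 1
      if (mean(y) < mean(x)) wrong <- wrong + 1   # significant, but in the wrong direction
    }
  }
  c(significant = signif, wrong_sign = wrong)
}

sign_errors(n = 10, mu = 0.2)                  # traditional threshold
sign_errors(n = 10, mu = 0.2, alpha = 0.005)   # stricter threshold, same problem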

Does changing the threshold to P < 0.005 help? Well, I changed the number of replications to 10,000, reduced the threshold to P < 0.005, and here’s what I got.

Do you see a difference? I don’t. I haven’t run a formal statistical test to compare the distributions, but I’m pretty sure they’d be indistinguishable.

In short, reducing the significance threshold to P < 0.005 will result in fewer investigators reporting statistically significant results. But unless the studies they do also have reasonably large sample sizes relative to the expected magnitude of any effect and the amount of variability within classes, they won’t be any more likely to know the direction of the effect than with the traditional threshold of P < 0.05.

The solution to the reproducibility crisis is better data, not better statistics.


Not every credible interval is credible

Lauren Kennedy and co-authors (citation below) worry about the effect of “contamination” on estimates of credible intervals.1 The effect arises because we often assume that values are drawn from a normal distribution, even though there are “outliers” in the data, i.e., observations drawn from a different distribution that “contaminate” our observations. Not surprisingly, they find that a model including contamination does a “better job” of estimating the mean and credible intervals than one that assumes a simple normal distribution.2

They consider the following data as an example:
-2, -1, 0, 1, 2, 15
They used the following model for the data (writing in JAGS notation):

x[i] ~ dnorm(mu, tau)
tau ~ dgamma(0.0001, 0.0001)
mu ~ dnorm(0, 100)

That prior on tau should be a red flag. Gelman (citation below) pointed out a long time ago that such a prior is a long way from being vague or non-informative. It puts a tremendous amount of weight on very small values of tau, meaning a very high weight on large values of the variance. Similarly, the N(0, 100) prior on mu may seem like a “vague” choice, but it puts more than 80% of the prior probability on values of mu less than -20 or greater than 20, substantially more extreme than any that were observed.
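You can check how extreme those “vague” choices are with a couple of lines of R (treating the 100 in the normal prior as a standard deviation, which is how the argument above reads it):

2 * pnorm(-20, mean = 0, sd = 100)        # P(|mu| > 20) under a N(0, 100) prior: about 0.84
2 * pnorm(-20, mean = 0, sd = 1)          # the same probability under N(0, 1): essentially zero
pgamma(1e-6, shape = 1e-4, rate = 1e-4)   # Gamma(0.0001, 0.0001) puts about 99.8% of its mass below 1e-6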

Before we begin an analysis we typically have some idea what “reasonable” values are for the variable we’re measuring. For example, if we are measuring the height of adult men, we would be very surprised to find anyone in our sample with a height greater than 3m or less than 0.5m. It wouldn’t make sense to use a prior for the mean that put appreciable probability on outcomes more extreme.

In this case the data are made up, so there isn’t any prior knowledge to work from, but the authors say that “[i]t is immediately obvious that the sixth data point is an outlier” (emphasis in the original). Let’s take them at their word. A reasonable choice of prior might then be N(0,1), since all of the values (except for the “outlier”) lie within two standard deviations of the mean.3 Similarly, a reasonable choice for the prior on sigma (sqrt(1/tau)) might be a half-normal with mean 0 and standard deviation 2, which will allow for standard deviations both smaller and larger than observed in the data.
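The model with those priors is only a few lines of Stan. Here is a rough sketch of the idea, coded inline from R with rstan (a minimal version; the actual test.R and test.stan may differ in the details):

library(rstan)

model_code <- "
data {
  int<lower=1> N;
  vector[N] x;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);        // weakly informative, based on the scale of the data
  sigma ~ normal(0, 2);     // half-normal because of the <lower=0> constraint
  x ~ normal(mu, sigma);
}
"

dat <- list(N = 6, x = c(-2, -1, 0, 1, 2, 15))
fit <- stan(model_code = model_code, data = dat,
            chains = 4, iter = 2000, refresh = 0)
print(fit, digits_summary = 3)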

I put that all together in a little R/Stan program (test.R, test.stan). When I run it, these are the results I get:

         mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
mu      0.555   0.016 0.899  -1.250  -0.037   0.558   1.156   2.297  3281 0.999
sigma   4.775   0.014 0.841   3.410   4.156   4.715   5.279   6.618  3466 1.000
lp__  -16.609   0.021 0.970 -19.229 -17.013 -16.314 -15.903 -15.663  2086 1.001

Let’s compare those results to what Kennedy and colleagues report:

Analysis                                 Posterior mean   95% credible interval
Stan + "reasonable priors"                    0.56            (-1.25, 2.30)
Kennedy et al. - Normal                       2.49            (-4.25, 9.08)
Kennedy et al. - Contaminated normal          0.47            (-2.49, 4.88)

So if you use “reasonable” priors, you get a posterior mean from a model without contamination that isn’t very different from what you get from the more complicated contaminated normal model, and the credible intervals are actually narrower. If you really think a priori that 15 is an unreasonable observation, which estimate (point estimate and credible interval) would you prefer? I’d go for the model assuming a normal distribution with reasonable priors.

It all comes down to this. Your choice of priors matters. There is no such thing as an uninformative prior. If you think you are playing it safe by using very vague or flat priors, think carefully about what you’re doing. There’s a good chance that you’re actually putting a lot of prior weight on values that are unreasonable.4 You will almost always have some idea about what observations are reasonable or possible. Use that information to set weakly informative priors. See the discussion at https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations for more detailed advice.


Against null hypothesis testing – the elephants and Andrew Gelman edition

Last week I pointed out a new paper by Denes Szucs and John Ioannidis, When null hypothesis significance testing is unsuitable for research: a reassessment.1 I mentioned that P-values from small, noisy studies are likely to be misleading. Last April, Raghu Parthasarathy at The Eighteenth Elephant had a long post on a more fundamental problem with P-values: they encourage binary thinking. Why is this a problem?

  1. “Binary statements can’t be sensibly combined” when measurements have noise.
  2. “It is almost never necessary to combine boolean statements.”
  3. “Everything always has an effect.”

Those brief statements probably won’t make any sense,2 so head over to The Eighteenth Elephant to get the full explanation. The post is a bit long, but it’s easy to read, and well worth your time.

Andrew Gelman recently linked to Parthasarathy’s post and adds one more observation on how P-values are problematic: they are “interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null.” In other words, P-values are derived assuming the null hypothesis is true. They tell us what the chances of getting the data we got are if the null hypothesis were true. Since we typically don’t believe the null hypothesis is true, the P-value doesn’t correspond to anything meaningful.

To take Gelman’s example, suppose we had an experiment with a control, treatment A, and treatment B. Our data suggest that treatment A is not different from control (P=0.13) but that treatment B is different from the control (P=0.003). That’s pretty clear evidence that treatment A and treatment B are different, right? Wrong.

P=0.13 corresponds to a treatment-control difference of 1.5 standard errors; P=0.003, to a treatment-control difference of 3.0 standard errors. The difference between the two treatments is therefore only about 1.5 standard errors, which by itself corresponds to a P-value of about 0.13. Why the apparent contradiction? Because if we want to say that treatment A and treatment B are different, we need to compare them directly to each other. When we do so, we realize that we don’t have any evidence that the treatments are different from one another.
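A couple of lines of R make the arithmetic concrete (my illustration, using a normal approximation and remembering that the standard error of a difference between two independent estimates is larger than the standard error of either one):

2 * pnorm(-1.5)              # treatment A vs. control: P of about 0.13
2 * pnorm(-3.0)              # treatment B vs. control: P of about 0.003
2 * pnorm(-1.5 / sqrt(2))    # A vs. B compared directly: P of about 0.29, nowhere near significant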

As Parthasarathy points out in a similar example, a better interpretation is that we have evidence for the ordering (control < treatment A < treatment B). Null hypothesis significance testing could easily mislead us into thinking that what we have instead is (control = treatment A < treatment B). The problem arises, at least in part, because no matter how often we remind ourselves that it’s wrong to do so, we act as if a failure to reject the null hypothesis is evidence for the null hypothesis. Parthasarathy describes nicely how we should be approaching these problems:

It’s absurd to think that anything exists in isolation, or that any treatment really has “zero” effect, certainly not in the messy world of living things. Our task, always, is to quantify the size of an effect, or the value of a parameter, whether this is the resistivity of a metal or the toxicity of a drug.

We should be focusing on estimating the magnitude of effects and the uncertainty associated with those estimates, not testing null hypotheses.
