Uncommon Ground

Not every credible interval is credible

Lauren Kennedy and co-authors (citation below) worry about the effect of “contamination” on estimates of credible intervals.1 The effect arises because we often assume that values are drawn from a normal distribution, even though there are “outliers” in the data, i.e., observations drawn from a different distribution that “contaminate” our observations. Not surprisingly, they find that a model including contamination does a “better job” of estimating the mean and credible intervals than one that assumes a simple normal distribution.2

They consider the following data as an example:
-2, -1, 0, 1, 2, 15
They used the following model for the data (writing in JAGS notation):

x[i] ~ dnorm(mu, tau)
tau ~ dgamma(0.0001, 0.0001)
mu ~ dnorm(0, 100)

That prior on tau should be a red flag. Gelman (citation below) pointed out a long time ago that such a prior is a long way from being vague or non-informative. It puts a tremendous amount of weight on very small values of tau, meaning a very high weight on large values of the variance. Similarly, the N(0, 100) prior on mu may seem like a “vague” choice, but it puts more than 80% of the prior probability on outcomes with x < -20 or x > 20, substantially more extreme than any that were observed.
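
Both claims are easy to check numerically. Here is a quick sketch in R (reading the 100 in the prior on mu as a standard deviation, which is the interpretation behind the 80% figure above):

## Prior mass that gamma(0.0001, 0.0001) puts on tau < 0.01,
## i.e., on variances greater than 100 (standard deviations greater than 10)
pgamma(0.01, shape = 1e-4, rate = 1e-4)
## ~0.999

## Prior mass that a N(0, 100) prior on mu puts outside (-20, 20)
2 * pnorm(-20, mean = 0, sd = 100)
## ~0.84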

Before we begin an analysis, we typically have some idea of what “reasonable” values are for the variable we’re measuring. For example, if we are measuring the height of adult men, we would be very surprised to find anyone in our sample with a height greater than 3m or less than 0.5m. It wouldn’t make sense to use a prior for the mean that put appreciable probability on outcomes more extreme than that.
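
As a purely hypothetical illustration (the numbers below are invented for the example, not taken from any source), checking a candidate prior against those bounds takes one line of R:

## Hypothetical N(1.75, 0.2) prior (in metres) on the mean height of adult men:
## prior mass outside the plausible range (0.5, 3) is negligible
pnorm(0.5, mean = 1.75, sd = 0.2) + pnorm(3, mean = 1.75, sd = 0.2, lower.tail = FALSE)
## ~4e-10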

In this case the data are made up, so there isn’t any prior knowledge to work from, but the authors say that “[i]t is immediately obvious that the sixth data point is an outlier” (emphasis in the original). Let’s take them at their word. A reasonable choice of prior on mu might then be N(0, 1), since all of the values (except for the “outlier”) lie within two standard deviations of the mean.3 Similarly, a reasonable choice for the prior on sigma (sqrt(1/tau)) might be a half-normal with mean 0 and standard deviation 2, which allows for standard deviations both smaller and larger than those observed in the data.
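
Something along those lines might look roughly like the following (a sketch only; the actual test.R and test.stan mentioned below may differ in their details):

library(rstan)

## Stan model: normal likelihood, N(0, 1) prior on mu, and a
## half-normal(0, 2) prior on sigma (via the lower bound on sigma)
stan_code <- "
data {
  int<lower=1> N;
  vector[N] x;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 2);
  x ~ normal(mu, sigma);
}
"

x <- c(-2, -1, 0, 1, 2, 15)
fit <- stan(model_code = stan_code, data = list(N = length(x), x = x))
print(fit)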

I put that all together in a little R/Stan program (test.R, test.stan). When I run it, these are the results I get:

         mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
mu      0.555   0.016 0.899  -1.250  -0.037   0.558   1.156   2.297  3281 0.999
sigma   4.775   0.014 0.841   3.410   4.156   4.715   5.279   6.618  3466 1.000
lp__  -16.609   0.021 0.970 -19.229 -17.013 -16.314 -15.903 -15.663  2086 1.001

Let’s compare those results to what Kennedy and colleagues report:

Analysis                               Posterior mean   95% credible interval
Stan + "reasonable priors"                       0.56   (-1.25, 2.30)
Kennedy et al. - Normal                          2.49   (-4.25, 9.08)
Kennedy et al. - Contaminated normal             0.47   (-2.49, 4.88)

So if you use “reasonable” priors, you get a posterior mean from a model without contamination that isn’t very different from what you get from the more complicated contaminated normal model, and the credible intervals are actually narrower. If you really think a priori that 15 is an unreasonable observation, which estimate (point estimate and credible interval) would you prefer? I’d go for the model assuming a normal distribution with reasonable priors.

It all comes down to this. Your choice of priors matters. There is no such thing as an uninformative prior. If you think you are playing it safe by using very vague or flat priors, think carefully about what you’re doing. There’s a good chance that you’re actually putting a lot of prior weight on values that are unreasonable.4 You will almost always have some idea about what observations are reasonable or possible. Use that information to set weakly informative priors. See the discussion at https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations for more detailed advice.


Gelman, A. 2006. Prior distributions for variance parameters in hierarchical models (Comment on article by Browne and Draper). Bayesian Analysis 1:515-534. https://projecteuclid.org/euclid.ba/1340371048

Kennedy, L.A., D.J. Navarro, A. Perfors, and N. Briggs. 2017. Not every credible interval is credible. Behavior Research Methods doi: 10.3758/s13428-017-0854-1

1. They note that the problem isn’t unique to Bayesian credible intervals. The same problems apply to classical confidence intervals.

2. If you want to know what the authors mean by “better”, read the paper. That’s not the focus of this post.

3. If you’re following closely, you’re likely to be bothered that I’m using the data to set the prior. You’re right to be bothered, because you should use prior knowledge, not the data, to set your prior. In this case I have no choice, since there isn’t any prior knowledge to draw on.

4. There are cases where a flat prior is completely reasonable. For example, if you’re a population geneticist (like me) estimating allele frequencies in populations, it makes sense to presume that any value between 0 and 1 is possible, so a flat uniform (or flat Dirichlet) prior is a reasonable choice. Of course, we might be better off with a Beta(2,2) prior (for a single allele frequency) or a Dirichlet with all parameters equal to 2, because the fact that we’re estimating allele frequencies at all from a finite sample means that it’s unlikely the allele frequency is very close to either 0 or 1.
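
A one-line check of that last point in R: the Beta(2, 2) density vanishes at the boundaries, while the flat Beta(1, 1) does not.

dbeta(c(0, 0.5, 1), shape1 = 2, shape2 = 2)   ## 0.0 1.5 0.0
dbeta(c(0, 0.5, 1), shape1 = 1, shape2 = 1)   ## 1 1 1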
