Two weeks ago I pointed out that you should

> Be wary of results from studies with small sample sizes, even if the effects are statistically significant.

Last week I pointed out that being Bayesian won’t save you. If you were paying close attention, you may have thought to yourself

> Holsinger’s characterization of Bayesian inference isn’t completely fair. The mean effect sizes he simulated were only 0.05, 0.10, and 0.20, but he used a prior with a standard deviation of 1.0 in his analyses. Any Bayesian in her right mind wouldn’t use a prior that broad, because she’d have a clue going into the experiment that the effect size was relatively small. She’d pick a prior that more accurately reflects prior knowledge of the likely results.

It’s a fair criticism, so to see how much difference more informative priors make, I redid the simulations using a Gaussian prior on each mean, with a prior mean of 0.0 (as before) and a prior standard deviation equal to 2 times the effect size used in the simulation. Here are the results:

| Mean | Sample size | Power | Wrong sign |
|---|---|---|---|
| 0.05 | 10 | 0/1000 | na |
| 0.05 | 50 | 2/1000 | 0/2 |
| 0.05 | 100 | 7/1000 | 2/7 |
| 0.10 | 10 | 1/1000 | 0/1 |
| 0.10 | 50 | 0/1000 | na |
| 0.10 | 100 | 0/1000 | na |
| 0.20 | 10 | 22/1000 | 2/22 |
| 0.20 | 50 | 128/1000 | 0/158 |
| 0.20 | 100 | 265/1000 | 0/292 |

With a more informative prior, you’re not likely to say that an effect is positive when it’s actually negative. There are, however, a couple of things worth noticing when you compare this table to the last one.

- The more informative prior doesn’t help much, if at all, with a sample size of 10. The N(0,1) prior got the sign wrong in 7 out of 62 cases where the 95% credible interval on the posterior mean difference did not include 0. The N(0,0.4) prior made the same mistake in 2 out of 22 cases. So it didn’t make as *many* mistakes as the less informative prior, but it made almost the same *proportion*. In other words, you’d be almost as likely to make a sign error with the more informative prior as with the less informative one.
- Even with a sample size of 100, you wouldn’t be “confident” that there is a difference very often (only 7 times out of 1000) when the “true” difference is small, 0.05, and you’d make a sign error nearly a third of the time (2 out of 7 cases).
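The flavor of these simulations can be sketched in a few lines of Python. This is not the R/Stan code linked below: it treats the within-group standard deviation as known (set to 1.0 here, an assumption of this sketch) so that the posterior on each group mean is available in closed form, rather than fit by MCMC. With `effect = 0.2` and `n = 10` the prior is N(0, 0.4), matching the more informative prior discussed above.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(effect, n, n_reps=1000, sigma=1.0):
    """Two groups with true means 0 and `effect`, within-group sd `sigma`
    (treated as known), and a conjugate N(0, (2*effect)^2) prior on each
    group mean. Returns (number of replicates where the 95% credible
    interval on the difference excludes 0, number of those with the
    wrong sign)."""
    tau2 = (2.0 * effect) ** 2                     # prior variance
    significant = wrong_sign = 0
    for _ in range(n_reps):
        y1 = rng.normal(0.0, sigma, n)
        y2 = rng.normal(effect, sigma, n)
        # Conjugate normal posterior for each group mean
        post_var = 1.0 / (n / sigma**2 + 1.0 / tau2)
        m1 = post_var * y1.sum() / sigma**2
        m2 = post_var * y2.sum() / sigma**2
        diff_mean = m2 - m1
        diff_sd = np.sqrt(2.0 * post_var)          # independent posteriors
        lo, hi = diff_mean - 1.96 * diff_sd, diff_mean + 1.96 * diff_sd
        if lo > 0 or hi < 0:                       # interval excludes 0
            significant += 1
            if hi < 0:                             # truth is positive
                wrong_sign += 1
    return significant, wrong_sign

sig, wrong = simulate(effect=0.2, n=10)
print(f"significant: {sig}/1000, wrong sign: {wrong}/{sig}")
```

Because the posterior is analytic rather than sampled, the counts won’t match the table exactly, but the qualitative pattern (few “significant” results at n = 10, and sign errors among them) is the same.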

So what does all of this mean? When designing and interpreting an experiment you need to have some idea of how big the between-group differences you might reasonably expect to see are relative to the within-group variation. If the between-group differences are “small”, you’re going to need a “large” sample size to be confident about your inferences. If you haven’t collected your data yet, the message is to plan for “large” samples within each group. If you have collected your data and your sample size is small, be very careful about interpreting the sign of any observed differences – even if they are “statistically significant.”
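To put a rough number on “large,” here is a standard frequentist normal-approximation sample-size formula for a two-sample comparison (not from the post, and a different framework from the Bayesian simulations, but it makes the same point about small standardized effects):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a standardized
    effect d (between-group difference / within-group sd) in a
    two-sample test at significance level alpha with the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_a + z_b) / d) ** 2)

for d in (0.05, 0.10, 0.20):
    print(d, n_per_group(d))
```

For an effect of 0.05 standard deviations this works out to thousands of observations per group, which is why the 0.05 row of the table above shows essentially no power even at n = 100.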

What’s a “small” difference, and what’s a “large” sample? You can play with the R/Stan code on GitHub to explore the effects: https://github.com/kholsinger/noisy-data. You can also read Gelman and Carlin (Perspectives on Psychological Science 9:641; 2014, http://dx.doi.org/10.1177/1745691614551642) for more rigorous advice.