Reproducibility is hard
Last year, the Open Science Collaboration published a very important article: Estimating the reproducibility of psychological science. Here’s a key part of the abstract:
We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Since then, reproducibility has gained even more attention than it had before. My students and I have been taking baby steps towards good practice: using GitHub to share code and data (and their versions), using scripts (mostly in R) to manipulate and transform data, and making the code and data freely available as early in the writing process as we can. But there are some important things we don't do as well as we could. I've never tried using Docker to ensure that all the software versions we use for analysis in a paper are preserved, and I'm as bad at writing documentation for what I'm doing as I ever was (though I try to write my code as clearly as possible, so it's not too hard to figure out what I was doing).
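To make the "scripts to manipulate and transform data" idea concrete, here's a minimal sketch of the kind of self-contained R script I have in mind. The file names, column name, and transformation are invented for illustration; the point is that the whole path from raw data to cleaned data is recorded in code, and the script finishes by writing down exactly which software versions did the work.

```r
# Hypothetical example: a small, self-contained data-cleaning script.
# Reads the raw data, transforms it, writes the cleaned version, and
# records the R and package versions so the step can be re-run later.

raw <- read.csv("data/measurements_raw.csv")   # hypothetical input file

# Example transformation: drop incomplete rows and add a log-scaled column
clean <- raw[complete.cases(raw), ]
clean$log_value <- log(clean$value)            # "value" is a made-up column

write.csv(clean, "data/measurements_clean.csv", row.names = FALSE)

# Record the exact R version and loaded packages alongside the output,
# so a future reader knows what software produced these files.
writeLines(capture.output(sessionInfo()), "session_info.txt")
```

It's not Docker-level preservation of the computing environment, but keeping a script like this (and its session record) under version control alongside the data already goes a long way.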
I need to do better, but Lorena Barba (@LorenaABarba) had an article in the "Working Life" section of Science that made me feel a bit better about how far I have to go. Three years ago she posted a manifesto on reproducibility. In her Science piece, she describes how hard it's been to live up to that pledge. But she concludes with some words to live by:
About 150 years ago, Louis Pasteur demonstrated how experiments can be conducted reproducibly—and the value of doing so. His research had many skeptics at first, but they were persuaded by his claims after they reproduced his results, using the methods he had recorded in keen detail. In computational science, we are still learning to be in his league. My students and I continuously discuss and perfect our standards, and we share our reproducibility practices with our community in the hopes that others will adopt similar ideals. Yes, conducting our research to these standards takes time and effort—and maybe our papers are slower to be published. But they’re less likely to be wrong.
Barba, L.A. 2016. The hard road to reproducibility. Science 354:142. doi:10.1126/science.354.6308.142
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716. doi:10.1126/science.aac4716