Uncommon Ground

Open science

Reproducibility is hard

Last year, the Open Science Collaboration published a very important article, “Estimating the reproducibility of psychological science.” Here’s a key part of the abstract:

We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Since then, reproducibility has gained even more attention than it had before. My students and I have been taking baby steps towards good practice: using GitHub to share code and data (and versions), using scripts (mostly in R) to manipulate and transform data, and making the code and data freely available as early in the writing process as we can. But there are some important things we don’t do as well as we could. I’ve never tried using Docker to ensure that all versions of the software we use for analysis in a paper are preserved, and I’m as bad at writing documentation for what I’m doing as I ever was (though I try to write my code as clearly as possible, so it’s not too hard to figure out what I was doing).
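Even without Docker, there’s a cheap first step towards preserving software versions: have each analysis script record the R and package versions it ran with, and commit that record along with the code and data. A minimal sketch in base R (the output file name here is just an illustration, not something we actually use):

    # At the end of an analysis script: write the R version and the
    # versions of all loaded packages to a text file that can be
    # committed to the repository alongside the code and data.
    writeLines(capture.output(sessionInfo()), "session-info.txt")

It doesn’t rebuild the computing environment the way Docker would, but it at least documents what that environment was.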

I need to do better, but Lorena Barba (@LorenaABarba) had an article in the “Working Life” section of Science that made me feel a bit better about how far I have to go. Three years ago she posted a manifesto on reproducibility. In her Science piece, she describes how hard it’s been to live up to that pledge. But she concludes with some words to live by:

About 150 years ago, Louis Pasteur demonstrated how experiments can be conducted reproducibly—and the value of doing so. His research had many skeptics at first, but they were persuaded by his claims after they reproduced his results, using the methods he had recorded in keen detail. In computational science, we are still learning to be in his league. My students and I continuously discuss and perfect our standards, and we share our reproducibility practices with our community in the hopes that others will adopt similar ideals. Yes, conducting our research to these standards takes time and effort—and maybe our papers are slower to be published. But they’re less likely to be wrong.


Barba, L.A. 2016. The hard road to reproducibility. Science 354:142. doi:10.1126/science.354.6308.142
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716. doi:10.1126/science.aac4716

Twitter, blogs, and scientific critiques

Susan Fiske is a very well-known and well-respected social psychologist. This is the opening paragraph of her Wikipedia biography:

Susan Tufts Fiske (born August 19, 1952) is Eugene Higgins Professor of Psychology and Public Affairs at the Princeton University Department of Psychology. She is a social psychologist known for her work on social cognition, stereotypes, and prejudice. Fiske leads the Intergroup Relations, Social Cognition, and Social Neuroscience Lab at Princeton University. A recent quantitative analysis identifies her as the 22nd most eminent researcher in the modern era of psychology (12th among living researchers, 2nd among women). Her notable theoretical contributions include the development of the stereotype content model, ambivalent sexism theory, power as control theory, and the continuum model of impression formation.

She was elected to the National Academy of Sciences in 2013, and she is a past President of the Association for Psychological Science. You may have heard that the current APS President, Susan Goldin-Meadow, invited Fiske to share her thoughts on “the impact that the new media are having…on our science [and] on our scientists.” The draft column provoked heated responses from, among others, Andrew Gelman, Sam Schwarzkopf, and Neuroskeptic. Fiske favors judging through

monitored channels, most often in private with a chance to improve (peer review), or at least in moderated exchanges (curated comments and rebuttals).

Gelman, Schwarzkopf, and Neuroskeptic prefer open forums. As Gelman puts it,

We learn from our mistakes, but only if we recognize that they are mistakes. Debugging is a collaborative process. If you approve some code and I find a bug in it, I’m not an adversary, I’m a collaborator. If you try to paint me as an “adversary” in order to avoid having to correct the bug, that’s your problem.

There’s a response to the responses on the APS site. It reads, in part,

APS encourages its members to air differing viewpoints on issues of importance to the field of psychological science, and the Observer provides a forum for those deliberations, Goldin-Meadow notes.

“Susan Fiske is a distinguished leader in the field and I invited her to share her opinion for an upcoming edition of the magazine,” she says. “It’s unfortunate that many on social media view her remarks as an attack on open science, when her goal is simply to remind us that scientists sometimes use social media in destructive ways. APS fully expects and welcomes discussion around the issues she raises.”

Of course scientists sometimes use social media in destructive ways. We’re human after all, and we sometimes make mistakes. But we also sometimes – I would argue more often – use social media in constructive ways. It was a blog post by Rosie Redfield that started unraveling the fantasy of arsenic life (in which NASA-sponsored scientists claimed that arsenic could substitute for phosphorus in the DNA of an unusual bacterium). Arguably we wouldn’t be talking about replication in science at all, or at least we wouldn’t be talking about it nearly as much, if it weren’t for blogs that published some vigorous critiques of widely reported scientific results that turned out to be much more weakly supported than they initially appeared.

Put me in the Gelman, Schwarzkopf, Neuroskeptic camp. It behooves us to behave respectably if we use social media to critique a study. All of us are human. All of us make mistakes. Making a mistake isn’t something to be ashamed of. It’s to be expected if you’re pushing forward at the edges of knowledge. As Schwarzkopf put it:

I can’t speak for others, but if someone applied for a job with me and openly discussed the fact that a result of theirs failed to replicate and/or that they had to revise their theories, this would work strongly in their favor compared to the candidate with overbrimming confidence who only published Impact Factor > 30 papers, none of which have been challenged.

P.S. I notice that Goldin-Meadow’s column in the September issue of APS Observer is titled “Why preregistration makes me nervous.”