How to organize data in spreadsheets
I recently discovered an article by Karl Broman and Kara Woo in The American Statistician entitled “Data organization in spreadsheets” (https://doi.org/10.1080/00031305.2017.1375989). It is the first article in the April 2018 special issue on data science. Why, you might ask, would a journal published by the American Statistical Association devote the first paper in a special issue on data science to spreadsheets instead of something more statistical. Well, among other things it turns out that the risks of using spreadsheets poorly are so great that there’s a European Spreadsheet Risks Interest Group that keeps track of “horror stories” (http://www.eusprig.org/horror-stories.htm). For example, Wisconsin initially estimated that the cost of a recount in the 2016 Presidential election would be $3.5M. After correcting a spreadsheet error, the cost climbed to $3.9M (https://www.wrn.com/2016/11/wisconsin-presidential-recount-will-cost-3-5-million/).
My favorite example, though, dates from 2013. Thomas Herndon, then a third-year doctoral student at UMass Amherst showed that a spreadsheet error in a very influential paper published by two eminent economists, Carmen Reinhart and Kenneth Rogoff, magnified the apparent effect of debt on economic growth (https://www.chronicle.com/article/UMass-Graduate-Student-Talks/138763). That paper was widely cited by economists arguing against economic stimulus in response to the financial crisis of 2008-2009.
That being said, Broman and Woo correctly point out that
Amid this debate, spreadsheets have continued to play a significant role in researchers’ workflows, and it is clear that they are a valuable tool that researchers are unlikely to abandon completely.
So since you’re not going to stop using spreadsheets (and I won’t either), you should at least use them well. If you don’t have time to read the whole article, here are twelve points you should remember:
- Be consistent – “Whatever you do, do it consistently.”
- Choose good names for things – “It is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it.”
- Write dates as YYYY-MM-DD. https://imgs.xkcd.com/comics/iso_8601.png
- No empty cells – Fill in all cells. Use some common code for missing data.1
- Put just one thing in a cell – “The cells in your spreadsheet should each contain one piece of data. Do not put more than one thing in a cell.”
- Make it a rectangle – “The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables.”2
- Create a data dictionary – “It is helpful to have a separate file that explains what all of the variables are.”
- No calculations in raw data files – “Your primary data file should contain just the data and nothing else: no calculations, no graphs.”
- Do not use font color or highlighting as data – “Analysis programs can much more readily handle data that are stored in a column than data encoded in cell highlighting, font, etc. (and in fact this markup will be lost completely in many programs).”
- Make backups – “Make regular backups of your data. In multiple locations. And consider using a formal version control system, like git, though it is not ideal for data files. If you want to get a bit fancy, maybe look at dat (https://datproject.org/).”
- Use data validation to avoid errors
- Save the data in plain text files
- R likes “NA”, but it’s easy to use “.” or something else. Just use “na.strings” when you use read.csv or “na” when you use readcsv. ↩
- If you’re a ggplot user you’ll recognize that this is wide format, while ggplot typically needs long format data. I suggest storing your data in wide format and using ddply() to reformat for plotting. ↩