If you’ve spent any time using R, you probably know the name Hadley Wickham. He’s chief scientist at RStudio, the author of four books on R, and the author of several indispensable R packages, including ggplot2, dplyr, and devtools. I was reminded recently that several years ago he wrote a very useful paper for the Journal of Statistical Software, “Tidy data” (August 2014, Volume 59, Issue 10, https://www.jstatsoft.org/article/view/v059i10).
If you are familiar with Hadley’s contributions to R, you won’t be surprised that tidy data has a simple, clean – tidy – set of requirements:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
That sounds simple, but it requires many of us to rethink the way we structure our data: no more column headers that are really values, no more storing multiple variables in one column, no more storing some variables in rows and others in columns. Fortunately, Hadley is also the author of tidyr, an R package for reshaping messy data into tidy form. I haven’t used it yet, but given how bad I am at starting with tidy data, I suspect I’ll be using it a lot in the future.
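To make the first of those problems concrete, here is a minimal sketch in plain Python rather than tidyr. The table, its column names, and the `to_tidy` helper are all made up for illustration. It only shows the change of shape involved when column headers are really values, and is not how tidyr itself is implemented or called.

```python
# Hypothetical "messy" table: the headers "2019" and "2020" are values of a
# year variable, not variable names.
messy = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column of each row into its own output row.

    The old column header becomes a value in `variable_name`, and the old
    cell becomes a value in `value_name`.
    """
    tidy = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue  # the id column is carried along, not reshaped
            tidy.append({
                id_column: row[id_column],
                variable_name: header,
                value_name: value,
            })
    return tidy


# Illustrative names only: "year" and "cases" are assumptions about what the
# made-up table measures.
tidy = to_tidy(messy, "country", "year", "cases")
for record in tidy:
    print(record)
```

After the reshape, each variable (country, year, cases) has its own column and each row is a single observation: one country in one year.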