{"id":751,"date":"2019-05-22T08:00:00","date_gmt":"2019-05-22T12:00:00","guid":{"rendered":"http:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/?p=751"},"modified":"2019-05-19T16:00:40","modified_gmt":"2019-05-19T20:00:40","slug":"how-to-organize-data-in-spreadsheets","status":"publish","type":"post","link":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/blog\/2019\/05\/22\/how-to-organize-data-in-spreadsheets\/","title":{"rendered":"How to organize data in spreadsheets"},"content":{"rendered":"\r\n<p class=\"wp-block-paragraph\">I recently discovered an article by Karl Broman and Kara Woo in <em>The American Statistician<\/em> entitled \u201cData organization in spreadsheets\u201d (<a href=\"https:\/\/doi.org\/10.1080\/00031305.2017.1375989\">https:\/\/doi.org\/10.1080\/00031305.2017.1375989<\/a>). It is the first article in the April 2018 special issue on data science. Why, you might ask, would a journal published by the American Statistical Association devote the first paper in a special issue on data science to spreadsheets instead of something more statistical. Well, among other things it turns out that the risks of using spreadsheets poorly are so great that there\u2019s a European Spreadsheet Risks Interest Group that keeps track of \u201chorror stories\u201d (<a href=\"http:\/\/www.eusprig.org\/horror-stories.htm\">http:\/\/www.eusprig.org\/horror-stories.htm<\/a>). For example, Wisconsin initially estimated that the cost of a recount in the 2016 Presidential election would be $3.5M. After correcting a spreadsheet error, the cost climbed to $3.9M (<a href=\"https:\/\/www.wrn.com\/2016\/11\/wisconsin-presidential-recount-will-cost-3-5-million\/\">https:\/\/www.wrn.com\/2016\/11\/wisconsin-presidential-recount-will-cost-3-5-million\/<\/a>).<\/p>\r\n\r\n\r\n\r\n<p class=\"wp-block-paragraph\">My favorite example, though, dates from 2013. Thomas Herndon, then a third-year doctoral student at UMass Amherst showed that a spreadsheet error in a very influential paper published by two eminent economists, Carmen Reinhart and Kenneth Rogoff, magnified the apparent effect of debt on economic growth (<a href=\"https:\/\/www.chronicle.com\/article\/UMass-Graduate-Student-Talks\/138763\">https:\/\/www.chronicle.com\/article\/UMass-Graduate-Student-Talks\/138763<\/a>). That paper was widely cited by economists arguing against economic stimulus in response to the financial crisis of 2008-2009.<\/p>\r\n\r\n\r\n\r\n<p class=\"wp-block-paragraph\">That being said, Broman and Woo correctly point out that<\/p>\r\n\r\n\r\n\r\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\r\n<p>Amid this debate, spreadsheets have continued to play a significant role in researchers\u2019 workflows, and it is clear that they are a valuable tool that researchers are unlikely to abandon completely.<\/p>\r\n<\/blockquote>\r\n\r\n\r\n\r\n<p class=\"wp-block-paragraph\">So since you\u2019re not going to stop using spreadsheets (and I won\u2019t either), you should at least use them well. If you don\u2019t have time to read the whole article, here are twelve points you should remember:<\/p>\r\n\r\n\r\n\r\n<ol class=\"wp-block-list\">\r\n<li>Be consistent &#8211; \u201cWhatever you do, do it consistently.\u201d<\/li>\r\n<li>Choose good names for things &#8211; \u201cIt is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it.\u201d<\/li>\r\n<li>Write dates as YYYY-MM-DD. <a href=\"https:\/\/imgs.xkcd.com\/comics\/iso_8601.png\">https:\/\/imgs.xkcd.com\/comics\/iso_8601.png<\/a><\/li>\r\n<li>No empty cells &#8211; Fill in all cells. Use some common code for missing data.<sup><a id=\"ffn1\" class=\"footnote\" href=\"#fn1\">1<\/a><\/sup><\/li>\r\n<li>Put just one thing in a cell &#8211; \u201cThe cells in your spreadsheet should each contain one piece of data. Do not put more than one thing in a cell.\u201d<\/li>\r\n<li>Make it a rectangle &#8211; \u201cThe best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables.\u201d<sup><a id=\"ffn2\" class=\"footnote\" href=\"#fn2\">2<\/a><\/sup><\/li>\r\n<li>Create a data dictionary &#8211; \u201cIt is helpful to have a separate file that explains what all of the variables are.\u201d<\/li>\r\n<li>No calculations in raw data files &#8211; \u201cYour primary data file should contain just the data and nothing else: no calculations, no graphs.\u201d<\/li>\r\n<li>Do not use font color or highlighting as data &#8211; \u201cAnalysis programs can much more readily handle data that are stored in a column than data encoded in cell highlighting, font, etc. (and in fact this markup will be lost completely in many programs).\u201d<\/li>\r\n<li>Make backups &#8211; \u201cMake regular backups of your data. In multiple locations. And consider using a formal version control system, like git, though it is not ideal for data files. If you want to get a bit fancy, maybe look at dat (<a href=\"https:\/\/datproject.org\">https:\/\/datproject.org\/<\/a>).\u201d<\/li>\r\n<li>Use data validation to avoid errors<\/li>\r\n<li>Save the data in plain text files<\/li>\r\n<\/ol>\r\n\r\n\r\n\r\n<ol class=\"wp-block-list\">\r\n<li id=\"fn1\">R likes \u201cNA\u201d, but it\u2019s easy to use \u201c.\u201d or something else. Just use \u201cna.strings\u201d when you use read.csv or \u201cna\u201d when you use read<em>csv.<\/em> <a href=\"#ffn1\">&#x21a9;<\/a><\/li>\r\n<li id=\"fn2\">If you\u2019re a ggplot user you\u2019ll recognize that this is wide format, while ggplot typically needs long format data. I suggest storing your data in wide format and using ddply() to reformat for plotting. <a href=\"#ffn2\">&#x21a9;<\/a><\/li>\r\n<\/ol>\r\n","protected":false},"excerpt":{"rendered":"<p>I recently discovered an article by Karl Broman and Kara Woo in The American Statistician entitled \u201cData organization in spreadsheets\u201d (https:\/\/doi.org\/10.1080\/00031305.2017.1375989). It is the first article in the April 2018&#8230; <a class=\"read-more-button\" href=\"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/blog\/2019\/05\/22\/how-to-organize-data-in-spreadsheets\/\">Read more &gt;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[],"class_list":["post-751","post","type-post","status-publish","format-standard","hentry","category-statistics"],"_links":{"self":[{"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/posts\/751","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/comments?post=751"}],"version-history":[{"count":3,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/posts\/751\/revisions"}],"predecessor-version":[{"id":754,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/posts\/751\/revisions\/754"}],"wp:attachment":[{"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/media?parent=751"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/categories?post=751"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/darwin.eeb.uconn.edu\/uncommon-ground\/wp-json\/wp\/v2\/tags?post=751"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}