I referred in passing to rstanarm and Bayesian linear regression in the R notebook on reducing the number of covariates. We’ll encounter Bayesian approaches again soon, and I just happened to run across a nice, simple introduction to Bayesian linear regression. It uses Python, with which I am only glancingly familiar, but you don’t need to run Python to read the discussion and understand what’s going on. If you’re unfamiliar with Bayesian inference or if you’d just like to check your understanding, take a look at this Introduction to Bayesian Linear Regression.
In the R notebook on principal component regression I noted that interpreting principal components can be a challenge. When I wrote that, I hadn’t seen a nice paper by Chong et al.1 The method they describe is presented specifically in the context of interpreting selection gradients after a principal components regression, but the idea is general. Once you’ve done the regression on principal components, transform the regression coefficients back to the original scale. Doing this does require, however, fitting all of the principal components, not just the first few.
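To make that back-transformation concrete, here is a minimal R sketch on simulated data (my own illustration, not code from the paper): fit all of the principal components, regress the response on their scores, and rotate the coefficients back to the original trait scale using the rotation (eigenvector) matrix. With every component retained, the back-transformed coefficients match the ordinary multiple regression exactly.

```r
## Minimal sketch (not Chong et al.'s code): back-transforming coefficients
## from a regression on principal component scores to the original traits.
set.seed(1234)
n <- 200
X <- matrix(rnorm(n * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.2)    # make two traits highly correlated
y <- 0.5 * X[, 1] + 0.3 * X[, 3] + rnorm(n)

pca    <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x                          # all of the PC scores, not just the first few
fit    <- lm(y ~ scores)
gamma  <- coef(fit)[-1]                  # coefficients on the PC scale

## Rotate back to the original trait scale: beta = V %*% gamma
beta <- pca$rotation %*% gamma
cbind(beta, coef(lm(y ~ X))[-1])         # identical to the ordinary multiple regression
```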
Here’s the citation and abstract:
Chong, V. K., H. F. Fung, and J. R. Stinchcombe. 2018. A note on measuring natural selection on principal component scores. Evolution Letters 2(4):272–280. doi: 10.1002/evl3.63
Measuring natural selection through the use of multiple regression has transformed our understanding of selection, although the methods used remain sensitive to the effects of multicollinearity due to highly correlated traits. While measuring selection on principal component (PC) scores is an apparent solution to this challenge, this approach has been heavily criticized due to difficulties in interpretation and relating PC axes back to the original traits. We describe and illustrate how to transform selection gradients for PC scores back into selection gradients for the original traits, addressing issues of multicollinearity and biological interpretation. In addition to reducing multicollinearity, we suggest that this method may have promise for measuring selection on high‐dimensional data such as volatiles or gene expression traits. We demonstrate this approach with empirical data and examples from the literature, highlighting how selection estimates for PC scores can be interpreted while reducing the consequences of multicollinearity.
- John Stinchcombe pointed me to the paper. Thanks, John. ↩
In the last installment of this series we explored a couple of simple strategies to reduce the number of covariates in a multiple regression,1 namely retaining only covariates that have a “real” relationship with the response variable2 and selecting one covariate from each cluster of (relatively) uncorrelated covariates.3 Unfortunately, we found that neither approach worked very well in our toy example.4
One of the reasons that the second approach (picking “weakly” correlated covariates) may not have worked very well is that in our toy example we know that both x1 and x3 contribute positively to y, but our analysis included only x1. Another approach that is sometimes used when there’s a lot of association among covariates is to first perform a principal components analysis and then to regress the response variable on the scores from the first few principal components. The newest R notebook in this series explores principal component regression.
Spoiler alert: It doesn’t help the point estimates much either, but the uncertainty around those point estimates is so large that we can’t legitimately say they’re different from one another.
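If you’d like a feel for the mechanics before opening the notebook, here is a minimal sketch of principal component regression on simulated data (the variable names and the choice of two components are mine, purely for illustration): compute the principal components of the covariates, then regress the response on the scores of the first few.

```r
## Minimal illustration of principal component regression on simulated data.
set.seed(4321)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)    # correlated with x1
x3 <- rnorm(n)
x4 <- x3 + rnorm(n, sd = 0.2)    # correlated with x3
y  <- 0.7 * x1 + 0.7 * x3 + rnorm(n)

covariates <- cbind(x1, x2, x3, x4)
pca    <- prcomp(covariates, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x)

## Regress the response on the scores of the first two components only.
pcr_fit <- lm(y ~ PC1 + PC2, data = scores)
summary(pcr_fit)
```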
- If you’ve forgotten why we might want to reduce the number of covariates, look back at this post. ↩
- The paradox lurking here is that if we knew which covariates these were, we probably wouldn’t have measured the others (or at least we wouldn’t have included them in the regression analysis). ↩
- There isn’t a good criterion to determine how weak the correlation needs to be to regard clusters as “relatively” uncorrelated. ↩
- If you’re reading footnotes, you’ll realize that the situation isn’t quite as dire as it appears from looking only at point estimates. Using rstanarm for a Bayesian analysis shows that the credible intervals are very broad and overlapping. We don’t have good evidence that the point estimates are different from one another. (A minimal sketch of such an analysis appears after these footnotes.) ↩
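As a concrete version of the check described in the last footnote, here is a minimal sketch of a Bayesian fit with rstanarm on simulated data (my own toy example, not the notebook’s data set). stan_glm() fits the regression with rstanarm’s default weakly informative priors, and posterior_interval() returns the credible intervals.

```r
library(rstanarm)

## Simulated data with two pairs of highly correlated covariates.
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd = 0.2)
dat <- data.frame(y = 0.7 * x1 + 0.7 * x3 + rnorm(n), x1, x2, x3, x4)

## Bayesian linear regression (refresh = 0 just suppresses sampler output).
fit_bayes <- stan_glm(y ~ x1 + x2 + x3 + x4, data = dat,
                      family = gaussian(), refresh = 0)

## 95% credible intervals: with highly correlated covariates these are
## typically broad and overlapping.
posterior_interval(fit_bayes, prob = 0.95)
```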
If you’ve been following this series, you now know that multiple regression can be very useful but that its usefulness depends on overcoming several challenges. One of those challenges is that if we use all of the covariates available to us and some of them are highly correlated with one another, our assessment of which covariates have an association with the response variable may be misleading and any prediction we make about new observations may be very unreliable. That leads us to the problem of variable selection. Rather than using all of the covariates we have available, maybe we’d be better off if we used only a few.
In this R notebook, I explore a couple of approaches to variable selection:
- Restricting the covariates to those we know have an association with the response variable.1
- Identifying clusters of covariates that are highly associated with one another and (relatively) unassociated with those in other clusters, then picking one covariate from each cluster for the analysis.2 (A minimal sketch of the clustering step follows this list.)
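Here is one way the clustering step in the second approach might look (my own sketch on simulated data; the notebook may do it differently): hierarchically cluster the covariates using one minus the absolute correlation as a distance, cut the tree into clusters, and keep one representative per cluster.

```r
## Sketch: cluster covariates by correlation and pick one per cluster.
set.seed(101)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.2); x3 <- x1 + rnorm(n, sd = 0.2)
x4 <- rnorm(n); x5 <- x4 + rnorm(n, sd = 0.2); x6 <- x4 + rnorm(n, sd = 0.2)
covariates <- cbind(x1, x2, x3, x4, x5, x6)

## Distance = 1 - |correlation|, so highly correlated covariates are "close".
d    <- as.dist(1 - abs(cor(covariates)))
tree <- hclust(d, method = "average")
clusters <- cutree(tree, k = 2)            # ask for two clusters
clusters

## Keep the first covariate in each cluster as its representative.
tapply(colnames(covariates), clusters, function(x) x[1])
```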
As you’ll see for the sample data set we’ve been exploring in which there are two clusters of covariates having strong associations within clusters and weak to non-existent associations between clusters, neither of these approaches serves us particularly well. The next installment will explore another commonly used approach – principal components regression.
- There’s at least one obvious problem with this approach that I don’t discuss in the notebook. In the work I’ve been involved with, we rarely know ahead of time which covariates, if any, have “real” relationships with the response variable. Most often we’ve measured covariates because we anticipate that they have some relationship to what we’re interested in and we’re trying to figure out which one(s) are most important. ↩
- This approach has some practical problems that I don’t discuss in the notebook. How strong do associations have to be to be “highly associated”? How weak do they have to be to be “(relatively) unassociated”? What do we do if there isn’t a clear cutoff between “highly associated” and “(relatively) unassociated”? ↩
Yes, you read that right. It’s the Marist Mindset List, not the Beloit Mindset List. It’s the same Mindset List as before, but it now has a new home. If you’ve never heard of the Mindset List before, here’s the full press release. The short version is
The Marist Mindset List is created by Ron Nief, Director Emeritus of Public Affairs at Beloit College, along with educators McBride and Westerberg, Shaffer, and Zurhellen. Additional items on the list, as well as commentaries and guides, can be found at www.marist.edu/mindset-list and www.themindsetlist.com.
As always, I enjoy looking over the list, even though it makes me feel really old. Here are a few of the items I found particularly striking this year.
- Like Pearl Harbor for their grandparents, and the Kennedy assassination for their parents, 9/11 is an historical event.
- The primary use of a phone has always been to take pictures.
- The nation’s mantra has always been: “If you see something, say something.”
- They are as non-judgmental about sexual orientation as their parents were about smoking pot.
- Apple iPods have always been nostalgic.
You can find the full list at www.marist.edu/mindset-list. Enjoy!
We saw in the first installment in this series that multiple regression may allow us to distinguish “real” from “spurious” associations among variables. Since it worked so effectively in the example we studied, you might wonder why you would ever want to reduce the number of covariates in a multiple regression.
Why not simply throw in everything you’ve measured and let the multiple regression sort things out for you? There are at least a couple of reasons:
- When you have covariates that are highly correlated, the associations that are strongly supported may not be the ones that are “real”. In other words, if you’re using multiple regression in an attempt to identify the “important” covariates, you may identify the wrong ones.
- When you have covariates that are highly correlated, any attempt to extrapolate predictions beyond the range of covariates that you’ve measured may be misleading. This is especially true if you fit a linear regression and the true relationship is curvilinear.1
- The R notebook linked here doesn’t explore the problem of extrapolation when the true relationship is curvilinear, but if you’ve been following along and you have a reasonable amount of facility with R, you shouldn’t find it hard to explore that on your own. ↩
- The R-squared in our initial example was greater than 0.99. That’s why multiple regression worked so well. The example you’ll see here has an R-squared of “only” 0.42 (adjusted 0.36). The “only” is in quotes because in many analyses in ecology and evolution, an R-squared that large would seem pretty good. ↩
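The first footnote above suggests exploring extrapolation under a curvilinear relationship on your own. Here is a minimal sketch of what that exploration might look like (my own toy example, not part of the linked notebook): fit a straight line to data generated from a curved relationship, then compare predictions beyond the observed range with the truth.

```r
## Toy example: extrapolating a linear fit when the true relationship is curved.
set.seed(2018)
x <- runif(100, 0, 2)                       # observed range: 0 to 2
y <- 1 + 2 * x - 0.8 * x^2 + rnorm(100, sd = 0.3)

fit_linear <- lm(y ~ x)

## Predict well beyond the observed range and compare with the true curve.
x_new  <- data.frame(x = seq(0, 5, by = 0.5))
y_hat  <- predict(fit_linear, newdata = x_new)
y_true <- 1 + 2 * x_new$x - 0.8 * x_new$x^2
cbind(x_new, y_hat, y_true)                 # predictions diverge badly past x = 2
```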
Not long after making my initial post in this series on variable selection in multiple regression, I received the following question on Twitter:
The short answer is that lm() isn’t doing anything special with the covariates. It’s simply minimizing the squared deviation between predictions and observations. The longer version is that it’s able to “recognize” the “real” relationships in the example because it’s doing something analogous to a controlled experiment. It is (statistically) holding other covariates constant and asking what the effect of varying just one of them is. The trick is that it’s doing this for all of the covariates simultaneously.
I illustrate this in a new R notebook by imagining a regression analysis in which we look for an association between, say, x9 and the residuals left after regressing y on all of the other covariates.
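Here is a minimal sketch of that idea on simulated data (my own illustration, not the notebook’s code). It uses the Frisch-Waugh-Lovell result, which is one way to make “statistically holding the other covariates constant” concrete: residualize both the response and one covariate on all of the other covariates, regress one set of residuals on the other, and you recover exactly the coefficient that covariate gets in the full multiple regression.

```r
## Frisch-Waugh-Lovell illustration on simulated data.
set.seed(99)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x9 <- 0.5 * x1 + rnorm(n)                  # x9 is correlated with x1
y  <- 1 + 2 * x1 - x2 + 0.5 * x9 + rnorm(n)

## Coefficient for x9 in the full multiple regression.
full_fit <- lm(y ~ x1 + x2 + x9)
coef(full_fit)["x9"]

## Residualize y and x9 on the other covariates, then regress residual on residual.
resid_y  <- resid(lm(y  ~ x1 + x2))
resid_x9 <- resid(lm(x9 ~ x1 + x2))
coef(lm(resid_y ~ resid_x9))["resid_x9"]   # same value as above
```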
I was talking with one of my graduate students a few days ago about variable selection in multiple regression. She was looking for a published “cheat sheet.” I told her I didn’t know of any. “Why don’t you write one?” “The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.” That was the end of it, for then.
From the title you can tell that I decided I needed to get my own thoughts in order about variable selection. If you know me, you also know that I find one of the best ways to get my thoughts straight is to write them down. So that’s what I’m starting now.
Expect to see a new entry every week or so. I’ll be posting the details in R notebooks so that you can download the code, run it yourself, and play around with it if you’re so inclined.1 As I develop notebooks, I’ll maintain a static page with links to them. Unlike the page on causal inference in ecology, which links to blog posts, these will link directly to HTML versions of R notebooks that discuss the aspect of the issue I’m working through that week, along with the R code that facilitated my thinking. All of the source code will be available in a GitHub repository, but you’ll also be able to download the .Rmd file when you have the HTML version open simply by clicking on the “Code” button at the top right of the page and selecting “Download Rmd” from the dropdown.
If you’re still interested after all of that, here’s a link to the first installment:
Last Friday I attended a very interesting symposium entitled Presenting science to the public in a post-truth era, jointly sponsored by the Science of Learning & Art of Communication1 and the University of Connecticut Humanities Institute, more specifically its project on Humility & Conviction in Public Life.2 The speakers – Åsa Wikforss (Stockholm University), Tali Sharot (University College London), and Michael Lynch (UConn) – argued that the primary function3 of posts on social media is to express emotion, not to impart information; that we are more likely to accept new evidence that confirms what we already believe than new evidence that contradicts it; and that knowledge resistance often arises because we resist the consequences that would follow from believing the evidence presented to us.
I can’t claim expertise in the factors influencing whether people accept or reject the evidence for climate change, but Merchants of Doubt makes a compelling case that the resistance among some prominent doubters arises because they believe that accepting the evidence that climate change is happening and that humans are primarily responsible will require massive changes in our economic system and, quite possibly, severe limits on individual liberty. In other words, the case Oreskes and Conway make in Merchants of Doubt is consistent with a form of knowledge resistance in which the evidence for human-caused climate change is resisted because of the consequences that accepting that evidence would have. It also illustrates a point I do my best to drive home when I teach my course in conservation biology.
As scientists, we discover empirical facts about the world, e.g., CO2 emissions have increased the atmospheric concentration of CO2 far above pre-industrial levels and much of the associated increase in global average temperature is a result of those emissions. Too often, though, we proceed immediately from discovering those empirical facts to concluding that particular policy choices are necessary. We think, for example, that because CO2 emissions are causing changes in global climate we must therefore reduce or eliminate CO2 emissions. There is, however, a step in the logic that’s missing.
To conclude that we must reduce or eliminate CO2 emissions we must first decide that the climate changes associated with increasing CO2 emissions are bad things that we should avoid. It may seem obvious that they are. After all, how could flooding of major metropolitan areas and the elimination of low-lying Pacific Island nations be a good thing? They aren’t. But avoiding them isn’t free. It involves choices. We can spend some amount of money now to avoid those consequences, we can spend money later when the threats are more imminent, or we can let the people who live in those places move out of the way when the time comes. I’m sure you can think of some other choices, too. Even if those three are the only choices, the empirical data alone don’t tell us which one to pick. The choice depends on what kind of world we want to live in. It is a choice based on moral or ethical values. The empirical evidence must inform our choice among the alternatives, but it isn’t sufficient to determine the choice.
Perhaps the biggest challenge we face in developing a response to climate change is that emotions are so deeply engaged on both sides of the debate that we cannot agree on the empirical facts. A debate that should be played out in the realm of “What kind of world do we want to live in? What values are most important?” is instead played out in the realm of tribal loyalty.
The limits to knowledge that Wikforss, Sharot, and Lynch identified represent real, important barriers to progress. But overcoming knowledge resistance, in particular, seems more likely if we remember that translating knowledge to action requires applying our values. When we are communicating science, that means either stopping at the point where empirical evidence ends and application of values begins, or making it clear that science ends with the empirical evidence and that our recommendation for action derives from our values.4
- A training grant funded through the National Science Foundation Research Traineeship (NRT) Program ↩
- Funded by the John Templeton Foundation (story in UConn Today). ↩
- Note: Lynch used the phrase “primary function” in a technical, philosophical sense inspired by Ruth Millikan’s idea of a “proper function,” but the plain sense of the phrase conveys its basic meaning. ↩
- In the real world it may sometimes, perhaps even often, be difficult to make a clean distinction between the realm of empirical research and the realm of ethical values. Distinguishing between them to the extent possible is still valuable, and it is even more valuable to be honest about the ways in which your personal values influence any actions you recommend. ↩
I recently discovered an article by Karl Broman and Kara Woo in The American Statistician entitled “Data organization in spreadsheets” (https://doi.org/10.1080/00031305.2017.1375989). It is the first article in the April 2018 special issue on data science. Why, you might ask, would a journal published by the American Statistical Association devote the first paper in a special issue on data science to spreadsheets instead of something more statistical? Well, among other things, it turns out that the risks of using spreadsheets poorly are so great that there’s a European Spreadsheet Risks Interest Group that keeps track of “horror stories” (http://www.eusprig.org/horror-stories.htm). For example, Wisconsin initially estimated that the cost of a recount in the 2016 Presidential election would be $3.5M. After correcting a spreadsheet error, the cost climbed to $3.9M (https://www.wrn.com/2016/11/wisconsin-presidential-recount-will-cost-3-5-million/).
My favorite example, though, dates from 2013. Thomas Herndon, then a third-year doctoral student at UMass Amherst showed that a spreadsheet error in a very influential paper published by two eminent economists, Carmen Reinhart and Kenneth Rogoff, magnified the apparent effect of debt on economic growth (https://www.chronicle.com/article/UMass-Graduate-Student-Talks/138763). That paper was widely cited by economists arguing against economic stimulus in response to the financial crisis of 2008-2009.
That being said, Broman and Woo correctly point out that
Amid this debate, spreadsheets have continued to play a significant role in researchers’ workflows, and it is clear that they are a valuable tool that researchers are unlikely to abandon completely.
So since you’re not going to stop using spreadsheets (and I won’t either), you should at least use them well. If you don’t have time to read the whole article, here are twelve points you should remember:
- Be consistent – “Whatever you do, do it consistently.”
- Choose good names for things – “It is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it.”
- Write dates as YYYY-MM-DD (see xkcd on date formats: https://imgs.xkcd.com/comics/iso_8601.png).
- No empty cells – Fill in all cells. Use some common code for missing data.1
- Put just one thing in a cell – “The cells in your spreadsheet should each contain one piece of data. Do not put more than one thing in a cell.”
- Make it a rectangle – “The best layout for your data within a spreadsheet is as a single big rectangle with rows corresponding to subjects and columns corresponding to variables.”2
- Create a data dictionary – “It is helpful to have a separate file that explains what all of the variables are.”
- No calculations in raw data files – “Your primary data file should contain just the data and nothing else: no calculations, no graphs.”
- Do not use font color or highlighting as data – “Analysis programs can much more readily handle data that are stored in a column than data encoded in cell highlighting, font, etc. (and in fact this markup will be lost completely in many programs).”
- Make backups – “Make regular backups of your data. In multiple locations. And consider using a formal version control system, like git, though it is not ideal for data files. If you want to get a bit fancy, maybe look at dat (https://datproject.org/).”
- Use data validation to avoid errors
- Save the data in plain text files
- R likes “NA”, but it’s easy to use “.” or something else. Just use the na.strings argument when you use read.csv() or the na argument when you use readr’s read_csv() (illustrated in the sketch after these footnotes). ↩
- If you’re a ggplot user you’ll recognize that this is wide format, while ggplot typically needs long format data. I suggest storing your data in wide format and reshaping it for plotting, for example with pivot_longer() or gather() from tidyr, as in the sketch after these footnotes. ↩
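As promised in the footnotes, here is a minimal sketch of both points (the file name and trait columns are hypothetical): read a CSV whose missing values were entered as “.”, then reshape the wide rectangle to long format for ggplot.

```r
library(readr)
library(tidyr)
library(ggplot2)

## Read a hypothetical CSV in which missing values were entered as ".".
## Base R equivalent: read.csv("measurements.csv", na.strings = ".")
dat <- read_csv("measurements.csv", na = ".")

## Reshape from wide (one column per trait) to long format for plotting.
## Assumes hypothetical trait columns named height, mass, and wing_length.
dat_long <- pivot_longer(dat,
                         cols = c(height, mass, wing_length),
                         names_to = "trait",
                         values_to = "value")

ggplot(dat_long, aes(x = trait, y = value)) +
  geom_boxplot()
```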