Uncommon Ground

Monthly Archive: August 2019

Trying out a couple of simple strategies for reducing the number of covariates

Variable selection in multiple regression

If you’ve been following this series, you now know that multiple regression can be very useful, but that its usefulness depends on overcoming several challenges. One of those challenges is that if we use all of the covariates available to us and some of them are highly correlated with one another, our assessment of which covariates are associated with the response variable may be misleading, and any predictions we make about new observations may be very unreliable. That leads us to the problem of variable selection: rather than using all of the covariates we have available, maybe we’d be better off using only a few.

In this R notebook, I explore a couple of approaches to variable selection:

  1. Restricting the covariates to those we know have an association with the response variable.1
  2. Identifying clusters of covariates that are highly associated with one another and (relatively) unassociated with those in other clusters, then picking one covariate from each cluster for the analysis.2

As you’ll see, for the sample data set we’ve been exploring, in which there are two clusters of covariates with strong associations within clusters and weak to non-existent associations between clusters, neither of these approaches serves us particularly well. The next installment will explore another commonly used approach, principal components regression.
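The notebooks in this series are in R, but the clustering idea in strategy 2 can be sketched in a few lines of Python with NumPy. Everything here is made up for illustration: the two latent factors, the six covariates, and the 0.7 correlation cutoff are all arbitrary choices, and the greedy grouping below is just one simple way to form clusters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulate two clusters of covariates: columns 0-2 share one latent
# factor, columns 3-5 share another, so within-cluster correlations
# are high and between-cluster correlations are near zero.
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + 0.3 * rng.normal(size=n) for _ in range(3)] +
                    [f2 + 0.3 * rng.normal(size=n) for _ in range(3)])

# Greedy clustering: two covariates share a cluster if |cor| > 0.7
# with every covariate already in that cluster.
cor = np.corrcoef(X, rowvar=False)
clusters = []
for j in range(X.shape[1]):
    for c in clusters:
        if all(abs(cor[j, k]) > 0.7 for k in c):
            c.append(j)
            break
    else:
        clusters.append([j])

# Strategy 2: keep one covariate from each cluster.
chosen = [c[0] for c in clusters]
print(clusters, chosen)
```

With this setup the grouping recovers the two simulated clusters, but notice that the 0.7 cutoff is doing all of the work, which is exactly the practical problem raised in footnote 2.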

  1. There’s at least one obvious problem with this approach that I don’t discuss in the notebook. In the work I’ve been involved with, we rarely know ahead of time which covariates, if any, have “real” relationships with the response variable. Most often we’ve measured covariates because we anticipate that they have some relationship to what we’re interested in and we’re trying to figure out which one(s) are most important.
  2. This approach has some practical problems that I don’t discuss in the notebook. How strong do associations have to be to count as “highly associated”? How weak do they have to be to count as “(relatively) unassociated”? What do we do if there isn’t a clear cutoff between “highly associated” and “(relatively) unassociated”?

The Marist Mindset List for the Class of 2023

Yes, you read that right. It’s the Marist Mindset List, not the Beloit Mindset List. It’s the same Mindset List as before, but it now has a new home. If you’ve never heard of the Mindset List before, here’s the full press release. The short version is:

The Marist Mindset List is created by Ron Nief, Director Emeritus of Public Affairs at Beloit College, along with educators McBride and Westerberg, Shaffer, and Zurhellen. Additional items on the list, as well as commentaries and guides, can be found at www.marist.edu/mindset-list and www.themindsetlist.com.

As always, I enjoy looking over the list, even though it makes me feel really old. Here are a few of the items I found particularly striking this year.

  • Like Pearl Harbor for their grandparents, and the Kennedy assassination for their parents, 9/11 is an historical event.
  • The primary use of a phone has always been to take pictures.
  • The nation’s mantra has always been: “If you see something, say something.”
  • They are as non-judgmental about sexual orientation as their parents were about smoking pot.
  • Apple iPods have always been nostalgic.

You can find the full list at www.marist.edu/mindset-list. Enjoy!

Challenges of multiple regression (or why we might want to select variables)

Variable selection in multiple regression

We saw in the first installment in this series that multiple regression may allow us to distinguish “real” from “spurious” associations among variables. Since it worked so effectively in the example we studied, you might wonder why you would ever want to reduce the number of covariates in a multiple regression.

Why not simply throw in everything you’ve measured and let the multiple regression sort things out for you? There are at least a couple of reasons:

  1. When you have covariates that are highly correlated, the associations that are strongly supported may not be the ones that are “real”. In other words, if you’re using multiple regression in an attempt to identify the “important” covariates, you may identify the wrong ones.
  2. When you have covariates that are highly correlated, any attempt to extrapolate predictions beyond the range of covariates that you’ve measured may be misleading. This is especially true if you fit a linear regression and the true relationship is curvilinear.1

This R notebook explores both of these points using the same set of deterministic relationships we’ve used before to generate the data, but increasing the residual variance.2
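The notebook itself is in R; as a language-agnostic illustration of point 1, here is a small Python simulation (NumPy only; the sample size, the 0.98 correlation, and the number of replicates are arbitrary choices of mine). It shows how strongly correlated covariates inflate the uncertainty of the fitted coefficients, which is what makes it easy for the “wrong” covariate to end up strongly supported in any one data set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50

def slope_sd(rho, reps=500):
    """SD of the fitted x1 coefficient across simulated data sets in
    which y depends only on x1 and cor(x1, x2) is approximately rho."""
    ests = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 2.0 * x1 + rng.normal(size=n)          # x2 has no real effect
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        ests.append(beta[1])
    return float(np.std(ests))

sd_lo = slope_sd(0.0)    # uncorrelated covariates
sd_hi = slope_sd(0.98)   # nearly collinear covariates
print(sd_lo, sd_hi)      # sd_hi is several times larger than sd_lo
```

The inflation factor here is roughly 1/sqrt(1 - rho^2), about five-fold at rho = 0.98, so estimates that wander that far from their true values can easily make the wrong covariate look important.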

  1. The R notebook linked here doesn’t explore the problem of extrapolation when the true relationship is curvilinear, but if you’ve been following along and you have a reasonable amount of facility with R, you shouldn’t find it hard to explore that on your own.
  2. The R-squared in our initial example was greater than 0.99. That’s why multiple regression worked so well. The example you’ll see here has an R-squared of “only” 0.42 (adjusted: 0.36). The “only” is in quotes because in many analyses in ecology and evolution, an R-squared that large would seem pretty good.

What is multiple regression doing?

Not long after making my initial post in this series on variable selection in multiple regression, I received the following question on Twitter:

The short answer is that lm() isn’t doing anything special with the covariates. It’s simply minimizing the squared deviations between predictions and observations. The longer version is that it’s able to “recognize” the “real” relationships in the example because it’s doing something analogous to a controlled experiment: it is (statistically) holding the other covariates constant and asking what the effect is of varying just one of them. The trick is that it’s doing this for all of the covariates simultaneously.

I illustrate this in a new R notebook by imagining a regression analysis in which we look for an association between, say, x9 and the residuals left after regressing y on x1.
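The “holding other covariates constant” intuition has an exact algebraic form, the Frisch–Waugh–Lovell result: the multiple regression coefficient on a covariate equals the slope you get after partialling the other covariates out of *both* the response and that covariate (partialling them out of the response alone, as in the y-on-x1 residuals above, gives only an approximation). A Python sketch with two made-up covariates, x1 and x2:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Made-up example: x2 is correlated with x1, and y depends on both.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.5 * x1 - 2.0 * x2 + rng.normal(size=n)

def resid(v, x):
    """Residuals from regressing v on x (with an intercept)."""
    X = np.column_stack([np.ones(len(v)), x])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

# Coefficient on x2 from the full multiple regression.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# "Hold x1 constant": partial x1 out of BOTH y and x2, then regress
# one set of residuals on the other.
ry, rx2 = resid(y, x1), resid(x2, x1)
b_partial = float(rx2 @ ry / (rx2 @ rx2))

print(b_full[2], b_partial)  # identical, by Frisch-Waugh-Lovell
```

The two numbers agree exactly (up to floating-point error), which is one way to see that multiple regression really is asking the “vary one covariate with the others held fixed” question for every covariate at once.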

What is multiple regression doing?

Collecting my thoughts about variable selection in multiple regression

I was talking with one of my graduate students a few days ago about variable selection in multiple regression. She was looking for a published “cheat sheet.” I told her I didn’t know of any. “Why don’t you write one?” she asked. “The world’s too complicated for that,” I said. “There will always be judgment involved. There will never be a simple recipe to follow.” That was the end of it, for then.

From the title you can tell that I decided I needed to get my own thoughts in order about variable selection. If you know me, you also know that I find one of the best ways to get my thoughts straight is to write them down. So that’s what I’m starting now.

Expect to see a new entry every week or so. I’ll be posting the details in R notebooks so that you can download the code, run it yourself, and play around with it if you’re so inclined.1 As I develop notebooks, I’ll also maintain a static page with links to them. Unlike the page on causal inference in ecology, which links to blog posts, these links will point directly to HTML versions of R notebooks that discuss the aspect of the issue I’m working through that week, along with the R code that facilitated my thinking. All of the source code will be available in a GitHub repository, but you’ll also be able to download the .Rmd file when you have the HTML version open simply by clicking the “Code” button at the top right of the page and selecting “Download Rmd” from the dropdown.

If you’re still interested after all of that, here’s a link to the first installment:

Why multiple regression is needed

  1. You’ll get the most out of R notebooks if you work with them through RStudio. Fortunately, the open source version is likely to serve your needs, so all it will cost you is a little bit of disk space.