Variable selection in multiple regression
As I said a month and a half ago, this series started because I was talking with one of my graduate students about variable selection in multiple regression. She was looking for a published “cheat sheet.” I told her I didn’t know of any. “Why don’t you write one?” she asked. “The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.”
If you’ve been following along, it won’t surprise you to learn that I’m not going to conclude with a simple recipe.
Although there will never be a simple recipe, I can tell you what I’m going to do. You’ll want to look at a new R notebook that explores what happens when associations among covariates aren’t as strong as those we’ve been assuming so far.
- For any analysis where I can use `stan_glm()` or `stan_glmer()`, I’ll use horseshoe priors to “shrink” the regression coefficients for unimportant covariates towards zero (see the first sketch after this list). For any analysis where I can’t use `stan_glm()` or `stan_glmer()`, I’ll probably be using Stan directly, and I’ll hardcode the horseshoe priors myself.
- If I feel the need to use a relatively objective method to identify some subset of covariates that are “important”,1 I’ll use projection predictive variable selection as implemented in `projpred` to identify the most important covariates.2 (The second sketch after this list shows what that looks like.)
- For reasons outlined in the Conclusions section of the R notebook I mentioned above, I will be very cautious about interpreting associations between covariates and response variables as anything other than a statistical association. Only if an association I find has been found repeatedly in other data sets and also has a good “first principles” explanation will I begin to interpret it as a causal association. Otherwise, I’ll interpret it as an intriguing pattern worthy of further study and exploration. If you want more details on how hard it is to infer causal relationships from these kinds of analyses, see my blog series on causal inference in ecology.
- I can think of a couple of reasons I might want to select a subset of covariates. (1) I might not have much data relative to the number of covariates in my model. Thanks to the priors, I’ll be able to fit the model without it blowing up, but the parameter estimates are likely to be very poorly defined. Reducing the number of parameters may help me isolate an “interesting” relationship, and so long as I remember that all I may have uncovered is a statistical association, that pattern might still be worth investigating. (2) I might want a relatively objective way to simplify a complicated model so that it’s easier to understand, and there may not be an obvious break, graphically or numerically, between the covariates that are important and those that aren’t. Either way, I will be extremely cautious about interpreting associations identified through projection predictive variable selection as “real” and the ones not identified as “spurious.” In fact, I probably won’t do that at all, and I’ll probably present results from analysis of the full model in addition to the reduced model, even if the full-model results only go in online supplemental material. ↩
- Since I’m new to using `projpred`, I don’t know whether I’ll be able to use it with my own Stan code. If not, that’s yet another reason for me to learn `brms`, which can handle a bunch of models that `rstanarm` can’t, possibly including many of the ones I’d otherwise be hardcoding in Stan. ↩