Variable selection in multiple regression

We saw in the first installment in this series that multiple regression may allow us to distinguish “real” from “spurious” associations among variables. Since it worked so effectively in the example we studied, you might wonder why you would ever want to reduce the number of covariates in a multiple regression.

Why not simply throw in everything you’ve measured and let the multiple regression sort things out for you? There are at least a couple of reasons:

- When you have covariates that are highly correlated, the associations that are strongly supported may not be the ones that are “real”. In other words, if you’re using multiple regression in an attempt to identify the “important” covariates, you may identify the wrong ones.
- When you have covariates that are highly correlated, any attempt to extrapolate predictions beyond the range of covariates that you’ve measured may be misleading. This is especially true if you fit a linear regression and the true relationship is curvilinear.
^{1}

This R notebook explores both of these points using the same set of deterministic relationships we’ve used before to generate the data, but increasing the residual variance.^{2}

- The R notebook linked here doesn’t explore the problem of extrapolation when the true relationship is curvilinear, but if you’ve been following along and you have a reasonable amount of facility with R, you shouldn’t find it hard to explore that on your own. ↩
- The R-squared in our initial example was greater than 0.99. That’s why multiple regression worked so well. The example you’ll see here has an R-squared of “only” 0.42 (adjusted 0.36). The “only” is in quotes because in many analyses in ecology an evolution, an R-squared that large would seem pretty good. ↩