Variable selection in multiple regression

If you’ve been following this series, you now know that multiple regression can be very useful but that its usefulness depends on overcoming several challenges. One of those challenges is that if we use all of the covariates available to us and some of them are highly correlated with one another, our assessment of which covariates have an association with the response variable may be misleading and any prediction we make about new observations may be very unreliable. That leads us to the problem of variable selection. Rather than using all of the covariates we have available, maybe we’d be better off if we used only a few.

In this R notebook, I explore a couple of approaches to variable selection:

- Restricting the covariates to those we know have an association with the response variable.^{1}
- Identifying clusters of covariates that are highly associated with one another, (relatively) unassociated with those in other clusters, and picking one covariate from each cluster for the analysis.^{2}
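To make the second approach concrete, here is a minimal sketch in Python (not the R code from the notebook). It assumes that "highly associated" means an absolute Pearson correlation of at least 0.7, which is an arbitrary cutoff chosen for illustration, and it groups covariates by linking any pair that meets that cutoff. The simulated data, the covariate names `x1` to `x5`, and the helper names `pearson` and `correlation_clusters` are all hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_clusters(covariates, threshold=0.7):
    """Group covariates into clusters by linking every pair with |r| >= threshold.

    Clusters are the connected components of that link graph, so a covariate
    joins a cluster if it is strongly correlated with at least one member.
    The threshold is an assumption, not a recommended value.
    """
    names = list(covariates)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(covariates[a], covariates[b])) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


# Hypothetical data: two latent variables, each generating a cluster of covariates.
rng = random.Random(1)
n_obs = 200
latent_1 = [rng.gauss(0, 1) for _ in range(n_obs)]
latent_2 = [rng.gauss(0, 1) for _ in range(n_obs)]


def noisy(latent):
    """Return the latent values plus a small amount of independent noise."""
    return [value + rng.gauss(0, 0.3) for value in latent]


covariates = {
    "x1": noisy(latent_1),
    "x2": noisy(latent_1),
    "x3": noisy(latent_1),
    "x4": noisy(latent_2),
    "x5": noisy(latent_2),
}

clusters = correlation_clusters(covariates, threshold=0.7)
selected = [cluster[0] for cluster in clusters]  # keep the first covariate in each cluster

print(clusters)
print(selected)
```

Keeping the first covariate in each cluster is itself an arbitrary choice; any member could stand in for its cluster, and the fitted regression will differ somewhat depending on which one is kept.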

As you’ll see, for the sample data set we’ve been exploring, which has two clusters of covariates with strong associations within clusters and weak to non-existent associations between them, neither of these approaches serves us particularly well. The next installment will explore another commonly used approach: principal components regression.

^{1} There’s at least one obvious problem with this approach that I don’t discuss in the notebook. In the work I’ve been involved with, we rarely know ahead of time which covariates, if any, have “real” relationships with the response variable. Most often we’ve measured covariates because we anticipate that they have some relationship to what we’re interested in, and we’re trying to figure out which one(s) are most important.

^{2} This approach has some practical problems that I don’t discuss in the notebook. How strong do associations have to be to be “highly associated”? How weak do they have to be to be “(relatively) unassociated”? What do we do if there isn’t a clear cutoff between “highly associated” and “(relatively) unassociated”?