Variable selection in multiple regression

In the last installment of this series we explored a couple of simple strategies to reduce the number of covariates in a multiple regression:^{1} retaining only covariates that have a “real” relationship with the response variable,^{2} and selecting one covariate from each cluster of (relatively) uncorrelated covariates.^{3} Unfortunately, we found that neither approach worked very well in our toy example.^{4}

One of the reasons that the second approach (picking “weakly” correlated covariates) may not have worked very well is that in our toy example we know that both `x1` and `x3` contribute positively to `y`, but our analysis included only `x1`. Another approach that is sometimes used when there’s a lot of association among covariates is first to perform a principal components analysis and then to regress the response variable on the scores from the first few principal components. The newest R notebook in this series explores principal component regression.

Spoiler alert: it doesn’t improve the point estimates much either, although the uncertainty around those estimates is so large that we can’t legitimately say they differ from one another.

- If you’ve forgotten why we might want to reduce the number of covariates, look back at this post. ↩
- The paradox lurking here is that if we knew which covariates these were, we probably wouldn’t have measured the others (or at least we wouldn’t have included them in the regression analysis). ↩
- There isn’t a good criterion to determine how weak the correlation needs to be to regard clusters as “relatively” uncorrelated. ↩
- If you’re reading the footnotes, you’ll realize that the situation isn’t quite as dire as it appears from looking only at point estimates. Using `rstanarm` for a Bayesian analysis shows that the credible intervals are very broad and overlapping. We don’t have good evidence that the point estimates are different from one another. ↩
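For readers who want to try the Bayesian check mentioned in that last footnote, here is a minimal, hypothetical sketch using the `rstanarm` package. The data are simulated for the sake of a self-contained example; the variable names and the 95% interval choice are assumptions, not the notebook’s actual code.

```r
# A minimal sketch (not the notebook's code) of inspecting credible
# intervals with rstanarm. Data are simulated; names are illustrative.
library(rstanarm)

set.seed(42)
n <- 100
z <- rnorm(n)  # shared latent driver to induce correlation
dat <- data.frame(x1 = z + rnorm(n, sd = 0.3),
                  x3 = z + rnorm(n, sd = 0.3))
dat$y <- dat$x1 + dat$x3 + rnorm(n)

fit <- stan_glm(y ~ x1 + x3, data = dat, refresh = 0)

# Broad, overlapping credible intervals would mean we can't claim
# the coefficients differ from one another.
posterior_interval(fit, prob = 0.95)
```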