Not long after making my initial post in this series on variable selection in multiple regression, I received the following question on Twitter:

The short answer is that lm() isn't doing anything special with the covariates. It's simply minimizing the sum of squared deviations between predictions and observations. The longer version is that it's able to "recognize" the "real" relationships in the example because it's doing something analogous to a controlled experiment: it (statistically) holds the other covariates constant and asks what the effect of varying just one of them is. The trick is that it does this for all of the covariates simultaneously.
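To make that concrete, here is a minimal sketch (my own simulated example, not code from the original notebook): two correlated covariates with known coefficients, where a multiple regression recovers both effects at once while a single-covariate regression does not.

```r
# Simulate correlated covariates with known "real" effects (2 and -3)
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x9 <- 0.5 * x1 + rnorm(n)         # x9 is correlated with x1
y  <- 2 * x1 - 3 * x9 + rnorm(n)

coef(lm(y ~ x1 + x9))             # estimates close to 2 and -3
coef(lm(y ~ x1))                  # x1 alone absorbs part of x9's effect
```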
I illustrate this in a new R notebook by imagining a regression analysis in which we look for an association between, say, x9 and the residuals left after regressing y on x1.
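For readers who want to try the idea before opening the notebook, here is a rough sketch of that setup (again my own assumed simulation, not the author's code): regress y on x1 alone, keep the residuals, and then ask whether x9 is associated with the variation in y that x1 leaves unexplained.

```r
# Continuing the simulated data from the sketch above
res_y <- resid(lm(y ~ x1))   # part of y not explained by x1
coef(lm(res_y ~ x9))         # x9 clearly predicts the leftover variation
# (For an exact match to the multiple-regression coefficient, x9 would also
# need to be residualized on x1, as in the Frisch-Waugh-Lovell theorem.)
```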