If you’ve read the two previous installments in this series, you may wonder why I didn’t stop with the last one. What else is there to worry about? We saw that multiple regression may allow us to identify “real” associations and distinguish them from “spurious” ones. What more could we want? Well, to answer that question, let’s start with another experiment.
First, I’m going to modify the code I ran before to introduce a little more random error. In the example we explored last time, the residual error was so small that \(R^2 > 0.99\). Here I’ll increase resid from 0.2 to 2.0 so that the resulting \(R^2\) is only about 0.4.
library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to the same
## correlation; covariance matrix calculated from individual variances and a
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE,
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)
## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
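## aside (not part of the original analysis): the note above warns that the
## pattern must yield a positive definite covariance matrix. A quick way to
## check a candidate pattern is to confirm that every eigenvalue of the
## resulting covariance matrix is positive.
##
is_pos_def <- function(Sigma, tol = 1e-8) {
  all(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values > tol)
}
is_pos_def(construct_Sigma(Rho, sigma, 0.8))  ## TRUE for the pattern above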
## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)
n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(2.0, n_samp)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat) <- column_names
covariates <- paste("x", seq(1, length(beta)), sep = "")
lm_for_pred_1 <- lm(as.formula(paste("y ~ ", paste(covariates, collapse = " + "))), data = dat)
summary(lm_for_pred_1)
Call:
lm(formula = as.formula(paste("y ~ ", paste(covariates, collapse = " + "))),
data = dat)
Residuals:
Min 1Q Median 3Q Max
-5.5591 -1.2791 0.0769 1.0579 4.7125
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.89283 0.21580 4.137 7.89e-05 ***
x1 0.93538 0.45457 2.058 0.04251 *
x2 -0.37673 0.45067 -0.836 0.40541
x3 1.06996 0.39591 2.702 0.00823 **
x4 -0.09839 0.44561 -0.221 0.82575
x5 -0.14849 0.39651 -0.375 0.70891
x6 -0.21750 0.43934 -0.495 0.62177
x7 -0.44814 0.45517 -0.985 0.32748
x8 0.04633 0.49939 0.093 0.92629
x9 0.49035 0.46447 1.056 0.29392
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.112 on 90 degrees of freedom
Multiple R-squared: 0.4152, Adjusted R-squared: 0.3567
F-statistic: 7.099 on 9 and 90 DF, p-value: 1.051e-07
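As a quick aside (not part of the original analysis), you can recover the multiple \(R^2\) reported above by hand, as one minus the residual sum of squares divided by the total sum of squares, or pull it straight out of the summary object:
## multiple R-squared by hand: 1 - SS_residual / SS_total
1 - sum(residuals(lm_for_pred_1)^2) / sum((dat$y - mean(dat$y))^2)
## or directly from the summary object
summary(lm_for_pred_1)$r.squared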
Notice that now x2 appears not to be associated with y and that the evidence for an association between x1 and y is much weaker than it was before. If we restrict the model to the covariates we know have an association (because we know the model that generated the data), here’s what we get:
lm_data_set_1 <- lm(y ~ x1 + x2 + x3, data = dat)
summary(lm_data_set_1)
Call:
lm(formula = y ~ x1 + x2 + x3, data = dat)
Residuals:
Min 1Q Median 3Q Max
-5.5057 -1.3596 0.0372 1.0451 4.9972
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8951 0.2076 4.311 3.93e-05 ***
x1 0.9918 0.3274 3.029 0.003150 **
x2 -0.5504 0.2232 -2.466 0.015419 *
x3 1.0247 0.2942 3.483 0.000748 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.07 on 96 degrees of freedom
Multiple R-squared: 0.4006, Adjusted R-squared: 0.3819
F-statistic: 21.39 on 3 and 96 DF, p-value: 1.079e-10
So, if we restrict the covariates to the ones we know are “right”, we now have pretty good evidence that all three are associated, although the magnitude of the association between x2 and y is still smaller than it “ought” to be. That’s one reason why we might want to restrict the number of covariates. If we include “too many” covariates, the extra uncertainty associated with each individual association may cause us to miss “real” associations that are there.
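As another aside (again, not part of the original analysis), if you’d rather pull the p-values out of the two fits programmatically than read them off the printed summaries, coef(summary(...)) returns the coefficient table as a matrix:
## p-values for the full and reduced models fit to the first data set
round(coef(summary(lm_for_pred_1))[, "Pr(>|t|)"], 4)
round(coef(summary(lm_data_set_1))[, "Pr(>|t|)"], 4)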
Here’s another one: Let’s generate a new data set using the same parameters. Since the errors are random, this data set will differ from the one we had before. We’ll run both linear regressions and compare the results.
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(2.0, n_samp)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat) <- column_names
covariates <- paste("x", seq(1, length(beta)), sep = "")
lm_for_pred_2 <- lm(as.formula(paste("y ~ ", paste(covariates, collapse = " + "))), data = dat)
summary(lm_for_pred_2)
Call:
lm(formula = as.formula(paste("y ~ ", paste(covariates, collapse = " + "))),
data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.4981 -1.3523 0.0183 1.1068 4.0849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01074 0.20353 4.966 3.23e-06 ***
x1 0.67594 0.39921 1.693 0.093872 .
x2 -0.76051 0.43765 -1.738 0.085684 .
x3 1.46075 0.38765 3.768 0.000293 ***
x4 0.24703 0.38449 0.642 0.522183
x5 0.03430 0.39888 0.086 0.931674
x6 -0.85573 0.40490 -2.113 0.037330 *
x7 -0.02932 0.39308 -0.075 0.940701
x8 0.23705 0.40573 0.584 0.560505
x9 -0.31367 0.40115 -0.782 0.436305
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.937 on 90 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.465
F-statistic: 10.56 on 9 and 90 DF, p-value: 5.33e-11
lm_data_set_2 <- lm(y ~ x1 + x2 + x3, data = dat)
summary(lm_data_set_2)
Call:
lm(formula = y ~ x1 + x2 + x3, data = dat)
Residuals:
Min 1Q Median 3Q Max
-4.5624 -1.3991 -0.0468 1.3177 4.1804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0206 0.2001 5.099 1.71e-06 ***
x1 0.6424 0.3398 1.891 0.061704 .
x2 -1.0993 0.1900 -5.787 8.99e-08 ***
x3 1.1957 0.3199 3.737 0.000316 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.93 on 96 degrees of freedom
Multiple R-squared: 0.4848, Adjusted R-squared: 0.4687
F-statistic: 30.11 on 3 and 96 DF, p-value: 8.283e-14
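Before digging into the details, it may help to line the full-model estimates from the two data sets up next to one another (a small convenience, not part of the original analysis):
## side-by-side comparison of the full-model coefficient estimates
cbind(data_set_1 = coef(lm_for_pred_1), data_set_2 = coef(lm_for_pred_2))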
Now we have a situation where in the full analysis there appears to be an association between x6 and y, and where in the reduced analysis the evidence for an association between x1 and y is really marginal. Furthermore, in the full analysis the evidence for the associations between x1, x2, and y is marginal.1 At least when we restrict our analysis to the case where we’re examining only “real” associations, we’re not too wrong.
In addition to the analysis of the second data set getting things “wrong”, remember that the only difference between the first data set and the second is that the random error differs in the two cases. Neither the identity nor the magnitude of the “real” relationships has changed. That’s a problem not only if you hope to interpret observed associations as causal relationships,2 but also if you want to extrapolate relationships beyond the domain of your regression.
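One way to convince yourself that this instability is just a consequence of sampling error is to repeat the whole simulation many times and tally how often each covariate comes out “significant” in the full model. Here’s a minimal sketch of that idea (not part of the original analysis; the number of replicates, n_rep, is arbitrary):
## repeat the simulation and count how often each covariate has p < 0.05
## in the full nine-covariate regression
n_rep <- 200
sig_count <- numeric(length(beta))
for (i in 1:n_rep) {
  x_sim <- rmvnorm(n_samp, mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
  y_sim <- rnorm(nrow(x_sim), mean = beta0 + x_sim %*% beta, sd = 2.0)
  sim <- data.frame(y = y_sim, x_sim)
  colnames(sim) <- c("y", paste("x", seq(1, length(beta)), sep = ""))
  fit <- lm(y ~ ., data = sim)
  p_vals <- coef(summary(fit))[-1, "Pr(>|t|)"]
  sig_count <- sig_count + (p_vals < 0.05)
}
names(sig_count) <- paste("x", seq(1, length(beta)), sep = "")
## proportion of replicates in which each covariate appears "significant"
sig_count / n_rep
You should find that only x1, x2, and x3 are flagged much more often than the nominal 5 percent rate, and even they are missed in a fair fraction of replicates.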
Let’s consider one new data point, and compare the predictions of these two models with the true answer:
new_data <- data.frame(x1 = 4.0, x2 = 4.0, x3 = 4.0,
                       x4 = 4.0, x5 = 4.0, x6 = 4.0,
                       x7 = 4.0, x8 = 4.0, x9 = 4.0)
pred_from_1 <- predict.lm(lm_for_pred_1, new_data)
pred_from_2 <- predict.lm(lm_for_pred_2, new_data)
cat("Prediction from data set 1: ", pred_from_1, "\n",
"Prediction from data set 2: ", pred_from_2, "\n",
"True answer: ", beta0 + as.matrix(new_data) %*% beta, "\n", sep = "")
Prediction from data set 1: 5.903918
Prediction from data set 2: 3.794066
True answer: 5
Remember that the residual standard deviation we specified was 2.0. That means that the two predictions are about one standard deviation apart from one another, and each is about half a standard deviation from the true value. Given that we’re extrapolating well beyond the observed range of any of the covariates (the largest observed covariate value, about 2.9, belongs to x2), that doesn’t seem awful, but with a larger number of covariates, more of which had real associations, the predictions could be substantially more unstable than these.
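If you want to see just how uncertain those extrapolated predictions are, predict.lm() will also report interval estimates (an aside, not part of the original analysis):
## 95% prediction intervals for the new data point from each fitted model
predict.lm(lm_for_pred_1, new_data, interval = "prediction")
predict.lm(lm_for_pred_2, new_data, interval = "prediction")
The intervals are wide, and they get wider the farther the new point sits from the center of the observed covariates.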
In any case, I think these simple examples show that there are some good reasons why we might want to reduce the number of covariates in a regression. The question is how, and for the answer to that, I’m afraid you’ll have to come back another time.
Lest you think that I cooked this example somehow, I guarantee you that I didn’t know it was going to work out this way until I ran the code. The results of the second simulation made my point even better than I imagined they would.↩
See my other blog series on [causal inference in ecology](http://darwin.eeb.uconn.edu/uncommon-ground/causal-inference-in-ecology/) for some of the challenges facing causal inference in an observational context in ecology.↩