In the R notebook illustrating the challenges of multiple regression, I showed why we might want to reduce the number of covariates in our regression. In this notebook I’m going to illustrate a couple of different approaches:

- Retaining only the covariates that have a “real” relationship with the response variable.
- Selecting covariates from (relatively) uncorrelated clusters of covariates (sketched briefly below).
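
The second idea is easier to picture with a concrete sketch. Assuming a data frame `dat` holding covariates `x1`-`x9` (a placeholder for data like what we generate below), one possibility is hierarchical clustering on a correlation-based distance, keeping one covariate per cluster; the cutoff height of 0.5 here is a hypothetical choice for illustration, not a rule:

```
## group covariates so that highly correlated ones land in the same cluster;
## `dat` and the cutoff height of 0.5 are placeholders for illustration
dist_mat <- as.dist(1 - abs(cor(dat[, paste0("x", 1:9)])))
clusters <- cutree(hclust(dist_mat), h = 0.5)
clusters
```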

In the example we’ve been playing with so far, we know that only `x1`, `x2`, and `x3` have a “real” relationship with `y`, because the process we used to generate the data has 0s for the coefficients relating `x4`-`x9` to `y`. Let’s regenerate the data from our last example, with \(R^2 \approx 0.4\), restrict our analysis to only `x1`, `x2`, and `x3`, and see whether we encounter the instability in parameter estimates and predictions we saw last time.

```
library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)
```

```
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to the same
## correlation, covariance matrix calculated from individual variances and a
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE,
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)
## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho   - the pattern of associations
## sigma - the vector of standard deviations
## rho   - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)
n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(2.0, n_samp)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names
```
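
As the comment in the code above warns, the pattern in `Rho` has to produce a positive definite covariance matrix. If you experiment with a different pattern, one quick sanity check is to look at the eigenvalues of the resulting matrix, which should all be strictly positive:

```
## all eigenvalues of a positive definite matrix are strictly positive
eigen(construct_Sigma(Rho, sigma, 0.8), only.values = TRUE)$values
```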

Now that the data are generated, let’s run the analyses:

```
lm_for_pred_1 <- lm(y ~ x1 + x2 + x3, data = dat_1)
lm_for_pred_2 <- lm(y ~ x1 + x2 + x3, data = dat_2)
summary(lm_for_pred_1)
```

```
Call:
lm(formula = y ~ x1 + x2 + x3, data = dat_1)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5057 -1.3596  0.0372  1.0451  4.9972

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.8951     0.2076   4.311 3.93e-05 ***
x1            0.9918     0.3274   3.029 0.003150 **
x2           -0.5504     0.2232  -2.466 0.015419 *
x3            1.0247     0.2942   3.483 0.000748 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.07 on 96 degrees of freedom
Multiple R-squared:  0.4006,    Adjusted R-squared:  0.3819
F-statistic: 21.39 on 3 and 96 DF,  p-value: 1.079e-10
```

```
summary(lm_for_pred_2)
```

```
Call:
lm(formula = y ~ x1 + x2 + x3, data = dat_2)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5624 -1.3991 -0.0468  1.3177  4.1804

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0206     0.2001   5.099 1.71e-06 ***
x1            0.6424     0.3398   1.891 0.061704 .
x2           -1.0993     0.1900  -5.787 8.99e-08 ***
x3            1.1957     0.3199   3.737 0.000316 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.93 on 96 degrees of freedom
Multiple R-squared:  0.4848,    Adjusted R-squared:  0.4687
F-statistic: 30.11 on 3 and 96 DF,  p-value: 8.283e-14
```
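
Before walking through the summaries, it can help to line the two sets of estimates up against the coefficients we used to generate the data. This little comparison isn’t part of the original analysis, just a convenience:

```
## side-by-side comparison of the estimates from both fits with the truth
round(cbind(truth = c(beta0, beta[1:3]),
            fit_1 = coef(lm_for_pred_1),
            fit_2 = coef(lm_for_pred_2)),
      3)
```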

Well, the first thing we discover is that the parameter estimates are not as stable as we might hope. They’re all in the right direction (`x1` and `x3` have positive coefficients, `x2` has a negative coefficient), but the magnitude of `x2` is a long way off in the first data set and the magnitude of `x1` is a long way off in the second data set. If we compare predictions, here’s what we get:

```
new_data <- data.frame(x1 = 4.0, x2 = 4.0, x3 = 4.0,
                       x4 = 4.0, x5 = 4.0, x6 = 4.0,
                       x7 = 4.0, x8 = 4.0, x9 = 4.0)
pred_from_1 <- predict.lm(lm_for_pred_1, new_data)
pred_from_2 <- predict.lm(lm_for_pred_2, new_data)
cat("Prediction from data set 1: ", pred_from_1, "\n",
    "Prediction from data set 2: ", pred_from_2, "\n",
    "True answer: ", beta0 + as.matrix(new_data) %*% beta, "\n", sep = "")
```

```
Prediction from data set 1: 6.7596
Prediction from data set 2: 3.975488
True answer: 5
```
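
Part of what’s going on is that each point prediction comes with a lot of uncertainty. One way to see this, not shown above, is to ask `predict()` for prediction intervals via its standard `interval` argument:

```
## 95% prediction intervals for the new observation from each fit
predict(lm_for_pred_1, new_data, interval = "prediction")
predict(lm_for_pred_2, new_data, interval = "prediction")
```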

Those point predictions don’t look very good either. This isn’t good news, because we know that `x1`, `x2`, and `x3` have an association with `y`, and even though we excluded the irrelevant covariates, our parameter estimates aren’t very stable and our predictions can be pretty variable. Let’s try another approach.