If you came to this page directly, please take a moment to read this blog post for context. I’ll wait until you’re back.

OK. Now you’ve read the blog post, and you know why we’re here. I’m going to illustrate how multiple regression separates the effects that are “really there” from those that are only there because of statistical associations among the covariates. We’ll use exactly the same data as we did last time, but we’ll only analyse the strong association scenario.^{1} Everything in this block of R code just sets up the data set.

```
library(tidyverse)
library(reshape2)
library(corrplot)
library(ggplot2)
library(cowplot)
library(WVPlots)
library(mvtnorm)
```

```
## I always like to clean out my workspace before running a script to make sure
## that I'm starting R in the same state. This helps to ensure that I can
## reproduce my results later
rm(list = ls())
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix; all non-zero entries are set to the same
## correlation, and the covariance matrix is calculated from the individual
## variances and a single association parameter governing the non-zero
## correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE,
              data = c(1, 0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1, 0, 1, 0,
                       1, 0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1, 0, 1, 0,
                       1, 0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1, 0, 1, 0,
                       1, 0, 1, 0, 1, 0, 1, 0, 1,
                       0, 1, 0, 1, 0, 1, 0, 1, 0,
                       1, 0, 1, 0, 1, 0, 1, 0, 1))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)
## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i, i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)
n_samp <- 100
cov_ind <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.0))
cov_wk <- rmvnorm(n_samp,
                  mean = rep(0, nrow(Rho)),
                  sigma = construct_Sigma(Rho, sigma, 0.2))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(0.2, n_samp)
y_ind <- rnorm(nrow(cov_ind), mean = beta0 + cov_ind %*% beta, sd = resid)
y_wk <- rnorm(nrow(cov_wk), mean = beta0 + cov_wk %*% beta, sd = resid)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_str <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_str) <- column_names
```
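The warning in the comments about positive definiteness is easy to check directly: a covariance matrix is positive definite exactly when all of its eigenvalues are strictly positive. As a sketch (the helper names here are mine, not part of the original script), a small 3×3 version of the same odd/even pattern shows both a valid and a degenerate case:

```r
## Hypothetical helper: check positive definiteness via eigenvalues
is_pos_def <- function(S, tol = 1e-10) {
  all(eigen(S, symmetric = TRUE, only.values = TRUE)$values > tol)
}

## a 3x3 instance of the same odd/even pattern used above
Rho_small <- matrix(c(1, 0, 1,
                      0, 1, 0,
                      1, 0, 1), nrow = 3, byrow = TRUE)
sigma_small <- rep(1, 3)
## same construction as construct_Sigma() above, repeated here so this
## snippet runs on its own
construct_Sigma_small <- function(Rho, sigma, rho) {
  Rho <- Rho * rho
  diag(Rho) <- 1.0
  diag(sigma) %*% Rho %*% diag(sigma)
}

is_pos_def(construct_Sigma_small(Rho_small, sigma_small, 0.8))  # TRUE
is_pos_def(construct_Sigma_small(Rho_small, sigma_small, 1.0))  # FALSE: rho = 1 makes the matrix singular
```

With `rho = 1` the two perfectly correlated covariates produce a zero eigenvalue, which is why `rmvnorm` would reject it.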

First, let’s double check the multiple regression analysis to make sure that the data we’re using really are the same.

```
covariates <- paste("x", seq(1, length(beta)), sep = "")
summary(lm(as.formula(paste("y ~ ", paste(covariates, collapse = " + "))), data = dat_str))
```

```
Call:
lm(formula = as.formula(paste("y ~ ", paste(covariates, collapse = " + "))),
    data = dat_str)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47277 -0.13864  0.01532  0.13314  0.50245

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.009106   0.021042  47.957   <2e-16 ***
x1           0.981123   0.044866  21.868   <2e-16 ***
x2          -1.000192   0.037447 -26.710   <2e-16 ***
x3           0.975234   0.039012  24.998   <2e-16 ***
x4           0.022840   0.045962   0.497    0.620
x5           0.036022   0.043129   0.835    0.406
x6          -0.012031   0.041184  -0.292    0.771
x7           0.002427   0.040711   0.060    0.953
x8           0.002452   0.042054   0.058    0.954
x9          -0.006829   0.037163  -0.184    0.855
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2003 on 90 degrees of freedom
Multiple R-squared:  0.9913,	Adjusted R-squared:  0.9905
F-statistic:  1144 on 9 and 90 DF,  p-value: < 2.2e-16
```

If you compare those results with the ones we obtained before,^{1} you’ll see that they’re identical to 6 decimal places. That makes it pretty likely that they’re the same data sets.

We’ll focus this example on `x1`, `x3`, and `x9`, where we know that the “real” associations are between `x1`, `x3`, and `y`, and that the “apparent” association between `x9` and `y` arises only because `x9` is strongly associated with `x1` and `x3`. Let’s start by looking at the bivariate association between `x1` and `y`, including a plot of the residuals.

```
lm_str <- lm(y ~ x1, data = dat_str)
residual <- residuals(lm_str)
for.plot <- data.frame(x = dat_str$x1, y = residual)
p <- ggplot(for.plot, aes(x = x, y = y)) +
  geom_point(fill = "tomato", color = "tomato") +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)
```

As you can see, the residuals seem pretty randomly distributed, which is what we hope to see in a residual plot. But let’s try something: regressing the residuals on `x3`. Think of this as looking at the influence of `x3` on `y` once we’ve removed the influence of `x1`.

```
res_x3 <- lm(residual ~ x3, data = dat_str)
print(summary(res_x3))
```

```
Call:
lm(formula = residual ~ x3, data = dat_str)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8942 -0.8188  0.1138  0.8285  2.3859

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.349e-05  1.127e-01  -0.001   0.9995
x3           3.419e-01  1.182e-01   2.893   0.0047 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.127 on 98 degrees of freedom
Multiple R-squared:  0.0787,	Adjusted R-squared:  0.0693
F-statistic: 8.371 on 1 and 98 DF,  p-value: 0.004697
```
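As a side note (this sketch is mine, not part of the original analysis), you don’t have to read the estimate and p-value off the printed summary; the coefficient table of a fitted model can be indexed directly:

```r
## Sketch: pulling a slope's estimate and p-value out of a summary.lm
## object programmatically, using a small self-contained toy fit
set.seed(1)
d   <- data.frame(x = rnorm(50))
d$y <- 0.3 * d$x + rnorm(50)
fit  <- lm(y ~ x, data = d)
ctab <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab["x", "Estimate"]
ctab["x", "Pr(>|t|)"]
```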

Look at that. We have good evidence that the residuals from regressing `y` on `x1` show a relationship with `x3`: the larger `x3`, the larger the residual. We can see this both by coloring the residual plot above by `x3` and by plotting the residuals against `x3` directly.

```
for.plot <- data.frame(x = dat_str$x1, y = residual, x3 = dat_str$x3)
p <- ggplot(for.plot, aes(x = x, y = y, fill = x3, color = x3)) +
  geom_point() +
  scale_fill_gradient2() +
  scale_color_gradient2() +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)
```

```
p <- ggplot(for.plot, aes(x = x3, y = y)) +
  geom_point(color = "tomato", fill = "tomato") +
  geom_smooth(method = "lm") +
  xlab("Observed (x3)") +
  ylab("Residual (from x1)")
print(p)
```
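One caveat worth flagging (my aside, not part of the original post): the slope from regressing these residuals on `x3` is not the same number as the `x3` coefficient in the full multiple regression, because `x3` is itself correlated with `x1`. The Frisch-Waugh-Lovell theorem says that if we residualize *both* `y` and `x3` on `x1`, the simple regression of one set of residuals on the other recovers the multiple-regression coefficient exactly. A self-contained toy demonstration:

```r
## Frisch-Waugh-Lovell in miniature: two correlated covariates, one response
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x3 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x3 correlated with x1
y  <- 1 + 1.0 * x1 + 1.0 * x3 + rnorm(n, sd = 0.2)

b_full <- coef(lm(y ~ x1 + x3))["x3"] # multiple-regression coefficient

r_y  <- residuals(lm(y ~ x1))         # y with x1's influence removed
r_x3 <- residuals(lm(x3 ~ x1))        # x3 with x1's influence removed
b_fwl <- coef(lm(r_y ~ r_x3))["r_x3"]

all.equal(unname(b_full), unname(b_fwl))  # TRUE: the two agree exactly
```

Regressing the `y` residuals on raw `x3`, as we did above, understates the coefficient, but it is still the right tool for the question we’re asking here: does `x3` carry information about `y` beyond what `x1` explains?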

Now let’s see what happens if we do the same thing with `x9`.

```
res_x9 <- lm(residual ~ x9, data = dat_str)
print(summary(res_x9))
```

```
Call:
lm(formula = residual ~ x9, data = dat_str)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5085 -0.8446  0.0678  0.7904  2.7547

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01417    0.11669  -0.121    0.904
x9           0.15777    0.11294   1.397    0.166

Residual standard error: 1.163 on 98 degrees of freedom
Multiple R-squared:  0.01952,	Adjusted R-squared:  0.00952
F-statistic: 1.952 on 1 and 98 DF,  p-value: 0.1656
```

```
for.plot <- data.frame(x = dat_str$x1, y = residual, x9 = dat_str$x9)
p <- ggplot(for.plot, aes(x = x, y = y, fill = x9, color = x9)) +
  geom_point() +
  scale_fill_gradient2() +
  scale_color_gradient2() +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)
```