If you came to this page directly, please take a moment to look at this blog post for context. I’ll wait until you’re back.

OK. Now you’ve read the blog post, and you know why we’re here. I’m going to illustrate how multiple regression separates the effects that are “really there” from those that appear only because of statistical associations among the covariates. We’ll use exactly the same data as we did last time, but we’ll only analyse the strong association scenario.[^1] Everything in this block of R code just sets up the data set.

library(tidyverse)
library(reshape2)
library(corrplot)
library(ggplot2)
library(cowplot)
library(WVPlots)
library(mvtnorm)
## I always like to clean out my workspace before running a script to make sure
## that I'm starting R in the same state. This helps to ensure that I can 
## reproduce my results later
rm(list = ls())

## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to the same
## correlation; the covariance matrix is calculated from individual variances
## and a single association parameter governing the non-zero correlation
## coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
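The matrix multiplication in construct_Sigma() is the standard identity Sigma = D Rho D, where D = diag(sigma): each correlation rho_ij gets scaled by sigma_i*sigma_j to become a covariance. Here’s a tiny self-contained check (the 2x2 numbers are made up purely for illustration, and I’ve used _demo names so they don’t clobber the Rho and sigma defined above):

```r
## scale a 2x2 correlation matrix into a covariance matrix: entry (i, j)
## of Sigma_demo is rho_ij * sd_i * sd_j
sd_demo <- c(1, 3)
Rho_demo <- matrix(c(1.0, 0.5,
                     0.5, 1.0), nrow = 2)
Sigma_demo <- diag(sd_demo) %*% Rho_demo %*% diag(sd_demo)
## the diagonal holds the variances (1 and 9); the off-diagonal entries
## are 0.5 * 1 * 3 = 1.5
Sigma_demo
## rmvnorm() will only accept a covariance matrix that is positive
## definite, i.e., one whose eigenvalues are all positive
all(eigen(Sigma_demo)$values > 0)
## [1] TRUE
```

This eigenvalue check is also a quick way to find out whether a new pattern for Rho will work before rmvnorm() complains.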


## set the random number seed manually so that every run of the code will 
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_ind <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.0))
cov_wk <- rmvnorm(n_samp,
                  mean = rep(0, nrow(Rho)),
                  sigma = construct_Sigma(Rho, sigma, 0.2))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

resid <- rep(0.2, n_samp)
y_ind <- rnorm(nrow(cov_ind), mean = beta0 + cov_ind %*% beta, sd = resid)
y_wk <- rnorm(nrow(cov_wk), mean = beta0 + cov_wk %*% beta, sd = resid)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)

dat_str <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_str) <- column_names

First, let’s double-check the multiple regression analysis to make sure that the data we’re using really are the same.

covariates <- paste("x", seq(1, length(beta)), sep = "")
summary(lm(as.formula(paste("y ~ ", paste(covariates, collapse = " + "))), data = dat_str))

Call:
lm(formula = as.formula(paste("y ~ ", paste(covariates, collapse = " + "))), 
    data = dat_str)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47277 -0.13864  0.01532  0.13314  0.50245 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.009106   0.021042  47.957   <2e-16 ***
x1           0.981123   0.044866  21.868   <2e-16 ***
x2          -1.000192   0.037447 -26.710   <2e-16 ***
x3           0.975234   0.039012  24.998   <2e-16 ***
x4           0.022840   0.045962   0.497    0.620    
x5           0.036022   0.043129   0.835    0.406    
x6          -0.012031   0.041184  -0.292    0.771    
x7           0.002427   0.040711   0.060    0.953    
x8           0.002452   0.042054   0.058    0.954    
x9          -0.006829   0.037163  -0.184    0.855    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2003 on 90 degrees of freedom
Multiple R-squared:  0.9913,    Adjusted R-squared:  0.9905 
F-statistic:  1144 on 9 and 90 DF,  p-value: < 2.2e-16

If you compare those results with the ones we obtained before,[^1] you’ll see that they’re identical to six decimal places. That makes it pretty likely that they’re the same data sets.

We’ll focus this example on x1, x3, and x9, where we know that the “real” associations are between x1, x3, and y and that the “apparent” association between x9 and y arises only because x9 is strongly associated with x1 and x3. Let’s start by looking at the bivariate association between x1 and y, including a plot of the residuals.

lm_str <- lm(y ~ x1, data = dat_str)
residual <- residuals(lm_str)
for.plot <- data.frame(x = dat_str$x1, y = residual)
p <- ggplot(for.plot, aes(x = x, y = y)) +
  geom_point(fill = "tomato", color = "tomato") +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)

As you can see, the residuals seem pretty randomly distributed, which is what we hope to see in a residual plot. But now let’s try regressing those residuals on x3. Think of this as looking at the influence of x3 on y once we’ve removed the influence of x1.

res_x3 <- lm(residual ~ x3, data = dat_str)
print(summary(res_x3))

Call:
lm(formula = residual ~ x3, data = dat_str)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8942 -0.8188  0.1138  0.8285  2.3859 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -7.349e-05  1.127e-01  -0.001   0.9995   
x3           3.419e-01  1.182e-01   2.893   0.0047 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.127 on 98 degrees of freedom
Multiple R-squared:  0.0787,    Adjusted R-squared:  0.0693 
F-statistic: 8.371 on 1 and 98 DF,  p-value: 0.004697

Look at that. We have good evidence that the residuals from regressing y on x1 show a relationship with x3: the larger x3 is, the larger the residual. We can see this both by coloring the residuals in the plot above by x3 and by plotting the residuals against x3 directly.

for.plot <- data.frame(x = dat_str$x1, y = residual, x3 = dat_str$x3)
p <- ggplot(for.plot, aes(x = x, y = y, fill = x3, color = x3)) +
  geom_point() +
  scale_fill_gradient2() +
  scale_color_gradient2() +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)

p <- ggplot(for.plot, aes(x = x3, y = residual)) +
  geom_point(color = "tomato", fill = "tomato") +
  geom_smooth(method = "lm") +
  xlab("Observed (x3)") +
  ylab("Residual (from x1)")
print(p)

Now let’s see what happens if we do the same thing with x9.

res_x9 <- lm(residual ~ x9, data = dat_str)
print(summary(res_x9))

Call:
lm(formula = residual ~ x9, data = dat_str)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5085 -0.8446  0.0678  0.7904  2.7547 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01417    0.11669  -0.121    0.904
x9           0.15777    0.11294   1.397    0.166

Residual standard error: 1.163 on 98 degrees of freedom
Multiple R-squared:  0.01952,   Adjusted R-squared:  0.00952 
F-statistic: 1.952 on 1 and 98 DF,  p-value: 0.1656

This time there’s no convincing evidence of an association. Once the influence of x1 is removed, the apparent association between x9 and y largely disappears. Again, we can color the residuals by x9 to see this.

for.plot <- data.frame(x = dat_str$x1, y = residual, x9 = dat_str$x9)
p <- ggplot(for.plot, aes(x = x, y = y, fill = x9, color = x9)) +
  geom_point() +
  scale_fill_gradient2() +
  scale_color_gradient2() +
  geom_hline(yintercept = 0.0) +
  xlab("Observed (x1)") +
  ylab("Residual")
print(p)
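As a closing aside (my addition, not something from the original analysis): the reason this residual trick works is a classical result, the Frisch–Waugh–Lovell theorem. If you residualize both y and a covariate on the other covariates, the slope between the two sets of residuals is exactly the coefficient that covariate gets in the full multiple regression. Notice that above we regressed the residuals on the raw covariate, which tells us whether any association remains but gives an attenuated slope, because the raw covariate still carries some x1 with it. Here’s a small self-contained sketch with simulated data (names and numbers made up for illustration):

```r
## Frisch-Waugh-Lovell in miniature: the slope from regressing
## (residuals of y given x1) on (residuals of x3 given x1) equals the
## coefficient of x3 in the full regression y ~ x1 + x3
set.seed(42)
n_demo <- 100
x1_demo <- rnorm(n_demo)
x3_demo <- 0.8 * x1_demo + rnorm(n_demo, sd = 0.6)  # correlated with x1_demo
y_demo  <- 1 + x1_demo + x3_demo + rnorm(n_demo, sd = 0.2)

full_fit <- lm(y_demo ~ x1_demo + x3_demo)
ry <- residuals(lm(y_demo ~ x1_demo))   # y with the influence of x1 removed
rx <- residuals(lm(x3_demo ~ x1_demo))  # x3 with the influence of x1 removed
partial_fit <- lm(ry ~ rx)

## the two estimates of the x3 coefficient are identical
coef(full_fit)["x3_demo"]
coef(partial_fit)["rx"]
```

This is the same logic that multiple regression applies to every covariate at once, which is how it separates the associations that are “really there” from the ones that merely ride along.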