Introduction

Before we begin exploring some ideas about variable selection in multiple regression, we need to review

It will take us a while to get through the first of these points, so that’s all we’ll cover in this notebook. The next notebook will take up some of the problems that arise when covariates aren’t independent of one another.

To do that we’ll create three toy data sets in which there are 9 covariates, only 3 of which are associated with the response variable. The three data sets will have the same regression coefficients, i.e., the same intercept, the same non-zero value for the 3 covariates that have an association, and 0 for the remaining 6 covariates. They’ll differ in the degree of collinearity among the covariates, ranging from complete independence of all covariates to relativelly modest collinearity to relatively strong collinearity.

To get started, we’ll load all of the libraries we’re going to use in this session and set up sample data sets. Note: I’m going to be generating the covariate data from a multivariate normal, but that’s only for convenience. Linear regression has some important assumptions. (Multi)normality of covariates is at the bottom of the list.

library(tidyverse)
library(reshape2)
library(corrplot)
library(ggplot2)
library(cowplot)
library(WVPlots)
library(mvtnorm)
## I always like to clean out my workspace before running a script to make sure
## that I'm starting R in the same state. This helps to ensure that I can 
## reproduce my results later
rm(list = ls())

## intetcept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to saem
## correlation, covariance matrix caldulated from individual variances and a 
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = , byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matris
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}

Generating the data

OK, with those preliminaries out of the way. Let’s generate data under our three scenarios:

We’ll use the same residual standard deviation, resid = 0.2, for all of them. cov_ refers to the covariates. y_ is the corresponding response

## set the random number seed manually so that every run of the code will 
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_ind <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.0))
cov_wk <- rmvnorm(n_samp,
                  mean = rep(0, nrow(Rho)),
                  sigma = construct_Sigma(Rho, sigma, 0.2))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

resid <- rep(0.2, n_samp)
y_ind <- rnorm(nrow(cov_ind), mean = beta0 + cov_ind %*% beta, sd = resid)
y_wk <- rnorm(nrow(cov_wk), mean = beta0 + cov_wk %*% beta, sd = resid)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)

Now I’ll collect the data into a data frame so that we can look at the covariates and explore how they relate to the response variable.

dat_ind <- data.frame(y_ind, cov_ind, rep("Independent", length(y_ind)))
dat_wk <- data.frame(y_wk, cov_wk, rep("Weak", length(y_wk)))
dat_str <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_ind) <- column_names
colnames(dat_wk) <- column_names
colnames(dat_str) <- column_names

dat <- rbind(dat_ind, dat_wk, dat_str)

Covariate associations

First, let’s just look at the association of the covariates under the three scenarios.

p <- PairPlot(dat_ind,
              meas_vars = column_names[grep("x", column_names)],
              title = "Independent")
print(p)

p <- PairPlot(dat_wk,
              meas_vars = column_names[grep("x", column_names)],
              title = "Weak association")
print(p)