## Introduction

Before we begin exploring some ideas about variable selection in multiple regression, we need to review

• Why multiple regression may be needed and
• Some of the problems that arise when covariates aren't independent of one another, i.e., when they are collinear.

It will take us a while to get through the first of these points, so that's all we'll cover in this notebook. The next notebook will take up some of the problems that arise when covariates aren't independent of one another.

To do that we'll create three toy data sets in which there are 9 covariates, only 3 of which are associated with the response variable. The three data sets will have the same regression coefficients, i.e., the same intercept, the same non-zero value for the 3 covariates that have an association, and 0 for the remaining 6 covariates. They'll differ in the degree of collinearity among the covariates, ranging from complete independence of all covariates to relatively modest collinearity to relatively strong collinearity.

To get started, we'll load all of the libraries we're going to use in this session and set up sample data sets. Note: I'm going to be generating the covariate data from a multivariate normal, but that's only for convenience. Linear regression has some important assumptions. (Multi)normality of covariates is at the bottom of the list.

```r
library(tidyverse)
library(reshape2)
library(corrplot)
library(ggplot2)
library(cowplot)
library(WVPlots)
library(mvtnorm)
```
```r
## I always like to clean out my workspace before running a script to make sure
## that I'm starting R in the same state. This helps to ensure that I can
## reproduce my results later
rm(list = ls())

## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of the correlation matrix; all non-zero entries are set to the same
## correlation, and the covariance matrix is calculated from the individual
## variances and a single association parameter governing the non-zero
## correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE,
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho   - the pattern of associations
## sigma - the vector of standard deviations
## rho   - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho * rho
  for (i in 1:ncol(Rho)) {
    Rho[i, i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
```
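If you do experiment with the pattern in `Rho`, a quick way to find out whether the resulting covariance matrix is positive definite, before `rmvnorm()` complains, is to check that all of its eigenvalues are strictly positive. This little check is my addition, not part of the original script:

```r
## added sanity check (not part of the analysis itself): a symmetric matrix is
## positive definite exactly when all of its eigenvalues are positive
Sigma_check <- construct_Sigma(Rho, sigma, 0.8)
all(eigen(Sigma_check, only.values = TRUE)$values > 0)
```

If this returns `FALSE`, the association parameter or the pattern needs to change before the data-generation step will run.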

## Generating the data

OK, with those preliminaries out of the way, let's generate data under our three scenarios:

• no association: `ind` (`rho = 0`)
• weak association: `wk` (`rho = 0.2`)
• strong association: `str` (`rho = 0.8`)

We'll use the same residual standard deviation, `resid = 0.2`, for all of them. `cov_` refers to the covariates, and `y_` is the corresponding response.

```r
## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_ind <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.0))
cov_wk <- rmvnorm(n_samp,
                  mean = rep(0, nrow(Rho)),
                  sigma = construct_Sigma(Rho, sigma, 0.2))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

resid <- rep(0.2, n_samp)
y_ind <- rnorm(nrow(cov_ind), mean = beta0 + cov_ind %*% beta, sd = resid)
y_wk  <- rnorm(nrow(cov_wk),  mean = beta0 + cov_wk  %*% beta, sd = resid)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
```
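As a quick sanity check on the simulation (my addition, not part of the original notes): with independent covariates and such a small residual standard deviation, an ordinary `lm()` fit should recover estimates close to the true coefficients:

```r
## added sanity check: OLS estimates on the independent-covariate data should
## be close to the true values, i.e., intercept 1 and slopes (1, -1, 1, 0, ..., 0)
fit_ind <- lm(y_ind ~ cov_ind)
round(coef(fit_ind), 2)
```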

Now I'll collect the data into a `data frame` so that we can look at the covariates and explore how they relate to the response variable.

```r
dat_ind <- data.frame(y_ind, cov_ind, rep("Independent", length(y_ind)))
dat_wk  <- data.frame(y_wk,  cov_wk,  rep("Weak", length(y_wk)))
dat_str <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_ind) <- column_names
colnames(dat_wk)  <- column_names
colnames(dat_str) <- column_names

dat <- rbind(dat_ind, dat_wk, dat_str)
```
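A quick look at the combined data frame (again, an added check) confirms its shape: one column each for the response, the nine covariates, and the scenario label, with 100 rows per scenario:

```r
## added check: dat should have 300 rows and 11 columns (y, x1-x9, Scenario)
str(dat)
table(dat$Scenario)
```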

## Covariate associations

First, let's just look at the association of the covariates under the three scenarios.

```r
p <- PairPlot(dat_ind,
              meas_vars = column_names[grep("x", column_names)],
              title = "Independent")
print(p)
```

```r
p <- PairPlot(dat_wk,
              meas_vars = column_names[grep("x", column_names)],
              title = "Weak association")
print(p)
```