If you’ve studied multiple regression before, you’ve probably heard the term “collinearity”. Collinearity means that two or more covariates provide essentially the same information about the response variable. When covariates are highly collinear, “regression estimates are unstable and have high standard errors.”1 The Variance Inflation Factor (VIF) is commonly used to assess how much of a problem we have with collinearity. A VIF greater than 4 means that we should look carefully at our covariates (without telling us what we should do once we’ve looked at them). A VIF greater than 10 means we have serious problems that we should fix (again without giving us any advice about how to fix them). Let’s take a look at the VIF estimates for the sample data we’ve been using and see what we find.
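Before diving in, it may help to see where those VIF numbers come from. The VIF for covariate j is 1/(1 - R²_j), where R²_j comes from regressing x_j on all the other covariates. Here's a minimal sketch on made-up data (the variable names and the near-duplicate x3 are just for illustration):

```r
## VIF by hand: x3 is nearly a copy of x1, so regressing x3 on the
## other covariates gives a high R^2 and hence a large VIF
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + rnorm(100, sd = 0.1)
r2 <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2)
vif_x3  ## large, because x3 duplicates x1
```

vif() from the car package computes the same quantity for every covariate in a fitted model at once.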

## Regenerating the data

First we have to regenerate the data.

library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)

rm(list = ls())
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to the same
## correlation; the covariance matrix is calculated from individual variances
## and a single association parameter governing the non-zero correlation
## coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE,
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
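The positive-definiteness warning above can be checked directly: a symmetric matrix is positive definite exactly when all of its eigenvalues are positive. A quick sketch, using the Rho, sigma, and construct_Sigma() defined above:

```r
## a valid covariance matrix must have all eigenvalues > 0;
## if you change the pattern in Rho, this check will tell you whether
## rmvnorm() will accept the resulting Sigma
Sigma <- construct_Sigma(Rho, sigma, 0.8)
all(eigen(Sigma, only.values = TRUE)$values > 0)  ## TRUE for this pattern
```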

## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

## residual standard deviation
##
resid <- rep(2.0, n_samp)

y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names

## Examining VIF estimates

Now that we have the data, we’ll use vif() from the car package to look at the VIFs from analysis of each of the two data sets.

library(car)

lm_for_pred_1 <- lm(y ~ x1 + x2 + x3, data = dat_1)
lm_for_pred_2 <- lm(y ~ x1 + x2 + x3, data = dat_2)
cat("From data set 1\n")
From data set 1
vif(lm_for_pred_1)
      x1       x2       x3
1.831894 1.030354 1.818362 
cat("\nFrom data set 2\n")

From data set 2
vif(lm_for_pred_2)
      x1       x2       x3
2.279091 1.007431 2.287378 

The VIF coefficients reported in each case don’t look too bad. The worst of them, x3, is less than 2.3, so we wouldn’t normally think that collinearity is a problem. If we do the same analysis including only x1 and x2, there’s even less indication of a problem, as you can see.

lm_for_pred_1 <- lm(y ~ x1 + x2, data = dat_1)
lm_for_pred_2 <- lm(y ~ x1 + x2, data = dat_2)
cat("From data set 1\n")
From data set 1
vif(lm_for_pred_1)
      x1       x2
1.028449 1.028449 
cat("\nFrom data set 2\n")

From data set 2
vif(lm_for_pred_2)
      x1       x2
1.003759 1.003759 

In one sense, I should probably stop here. If you were paying attention to footnotes in the last notebook, you will have noticed that footnote 4 said this:

To be fair, part of the problem here comes from looking only at the point estimates. Look at the results from rstanarm() again. You’ll see that the credible intervals for x1 and x2 as estimated from data set 1 overlap broadly with those estimated from data set 2. The estimates aren’t as different from one another as they initially appear. Neither are the posterior predictions. (Verifying that is left as an exercise for those who are interested.)

The problem might be that we - I - have been paying too much attention to point estimates of regression coefficients and to point predictions from fitted regressions rather than paying attention to uncertainty (as any proper Bayesian should). But I promised you an exploration of principal components regression in the title of this notebook, so let’s try it with these data sets and see what happens.

## Principal components regression

As the name suggests, the first step in a principal components regression is a principal components analysis, specifically a principal components analysis of the covariates.2

pca_1 <- princomp(dat_1[, 2:10], cor = TRUE)
pca_2 <- princomp(dat_2[, 2:10], cor = TRUE)

plot(pca_1)

plot(pca_2)

In both cases the screeplot suggests that the first two principal components account for essentially all of the variance. That shouldn’t be too surprising, since the correlation matrix is so highly structured. Alternate rows are (approximately) equal to one another in the correlation matrix used to generate the data. It also shouldn’t be too surprising that the estimated loadings on the first two principal components are fairly similar in the two data sets.

cat("Data set 1\n")
Data set 1
round(pca_1$loadings[, 1:2], 3)
   Comp.1 Comp.2
x1  0.376  0.223
x2  0.256 -0.431
x3  0.374  0.227
x4  0.249 -0.434
x5  0.368  0.266
x6  0.252 -0.430
x7  0.377  0.236
x8  0.291 -0.407
x9  0.406  0.220
cat("\nData set 2\n")

Data set 2
round(pca_2$loadings[, 1:2], 3)
   Comp.1 Comp.2
x1  0.404  0.179
x2  0.205 -0.465
x3  0.411  0.174
x4  0.224 -0.446
x5  0.391  0.216
x6  0.237 -0.438
x7  0.384  0.237
x8  0.243 -0.429
x9  0.400  0.209
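The screeplot claim can be quantified directly. With princomp(), the component variances are the squares of the sdev element, so the proportion of variance captured by each component is easy to compute from the pca_1 object fitted above:

```r
## proportion of total variance explained by each component, and the
## running total; the first two components should dominate
prop_var <- pca_1$sdev^2 / sum(pca_1$sdev^2)
round(cumsum(prop_var), 3)
## summary(pca_1) reports the same quantities in tabular form
```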

### Regression on the principal components

Since the first two principal components account for so much of the variance, let’s regress y on the scores associated with the first two principal components.


dat_1_new <- data.frame(y = dat_1$y, x1 = pca_1$scores[, 1],
                        x2 = pca_1$scores[, 2])
dat_2_new <- data.frame(y = dat_2$y, x1 = pca_2$scores[, 1],
                        x2 = pca_2$scores[, 2])

cat("Data set 1\n")
Data set 1
lm_for_pred_1 <- lm(y ~ x1 + x2, data = dat_1_new)
summary(lm_for_pred_1)

Call:
lm(formula = y ~ x1 + x2, data = dat_1_new)

Residuals:
Min     1Q Median     3Q    Max
-5.697 -1.300 -0.061  1.200  5.026

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.8984     0.2113   4.251 4.89e-05 ***
x1            0.5329     0.1004   5.306 7.08e-07 ***
x2            0.6533     0.1223   5.342 6.07e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.113 on 97 degrees of freedom
Multiple R-squared:  0.3689,    Adjusted R-squared:  0.3559
F-statistic: 28.35 on 2 and 97 DF,  p-value: 2.022e-10
cat("\nData set 2\n")

Data set 2
lm_for_pred_2 <- lm(y ~ x1 + x2, data = dat_2_new)
summary(lm_for_pred_2)

Call:
lm(formula = y ~ x1 + x2, data = dat_2_new)

Residuals:
Min      1Q  Median      3Q     Max
-4.5704 -1.4218 -0.1672  1.3223  4.1186

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.3290     0.2037   6.524 3.11e-09 ***
x1            0.4170     0.1003   4.156 6.98e-05 ***
x2            0.8141     0.1118   7.283 8.68e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.037 on 97 degrees of freedom
Multiple R-squared:  0.4202,    Adjusted R-squared:  0.4083
F-statistic: 35.16 on 2 and 97 DF,  p-value: 3.292e-12

Although the point estimates aren’t exactly equal (you shouldn’t expect them to be), they’re reasonably close. Moreover, if we run a Bayesian version of the analysis, the credible intervals overlap very broadly.

library(rstanarm)

options(mc.cores = parallel::detectCores())

stan_lm_for_pred_1 <- stan_glm(y ~ x1 + x2, data = dat_1_new, family = gaussian(), refresh = 0)
stan_lm_for_pred_2 <- stan_glm(y ~ x1 + x2, data = dat_2_new, family = gaussian(), refresh = 0)
summary(stan_lm_for_pred_1, digits = 3)

Model Info:

function:     stan_glm
family:       gaussian [identity]
formula:      y ~ x1 + x2
algorithm:    sampling
priors:       see help('prior_summary')
sample:       4000 (posterior sample size)
observations: 100
predictors:   3

Estimates:
mean     sd       2.5%     25%      50%      75%      97.5%
(Intercept)      0.899    0.218    0.485    0.753    0.897    1.041    1.324
x1               0.532    0.101    0.333    0.463    0.532    0.603    0.728
x2               0.651    0.120    0.414    0.572    0.653    0.730    0.890
sigma            2.131    0.150    1.855    2.029    2.121    2.224    2.448
mean_PPD         0.895    0.305    0.308    0.689    0.900    1.098    1.506
log-posterior -224.273    1.405 -227.823 -224.942 -223.947 -223.241 -222.484

Diagnostics:
mcse  Rhat  n_eff
(Intercept)   0.003 1.000 4381
x1            0.001 1.000 4955
x2            0.002 1.000 4826
sigma         0.002 1.000 4563
mean_PPD      0.005 1.000 4205
log-posterior 0.032 1.001 1872

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(stan_lm_for_pred_2, digits = 3)

Model Info:

function:     stan_glm
family:       gaussian [identity]
formula:      y ~ x1 + x2
algorithm:    sampling
priors:       see help('prior_summary')
sample:       4000 (posterior sample size)
observations: 100
predictors:   3

Estimates:
mean     sd       2.5%     25%      50%      75%      97.5%
(Intercept)      1.326    0.204    0.925    1.196    1.326    1.456    1.731
x1               0.414    0.100    0.216    0.349    0.414    0.479    0.612
x2               0.812    0.114    0.590    0.736    0.811    0.887    1.039
sigma            2.058    0.148    1.792    1.954    2.050    2.150    2.372
mean_PPD         1.332    0.291    0.764    1.131    1.329    1.528    1.914
log-posterior -220.636    1.437 -224.331 -221.367 -220.317 -219.550 -218.869

Diagnostics:
mcse  Rhat  n_eff
(Intercept)   0.003 1.000 4720
x1            0.001 1.000 4435
x2            0.002 0.999 4683
sigma         0.002 1.000 4576
mean_PPD      0.005 1.000 4192
log-posterior 0.033 1.002 1925

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

### Prediction from principal components

Predicting y from new data is a bit more complicated than in the past. We have to calculate scores on the first two principal components from the new data and feed those scores to predict().

new_data <- data.frame(x1 = 4.0, x2 = 4.0, x3 = 4.0,
                       x4 = 4.0, x5 = 4.0, x6 = 4.0,
                       x7 = 4.0, x8 = 4.0, x9 = 4.0)

pred_1 <- as.matrix(new_data) %*% pca_1$loadings[, 1:2]
pred_2 <- as.matrix(new_data) %*% pca_2$loadings[, 1:2]

dat_pred_1 <- data.frame(x1 = pred_1[1],
                         x2 = pred_1[2])
dat_pred_2 <- data.frame(x1 = pred_2[1],
                         x2 = pred_2[2])

pred_from_1 <- predict.lm(lm_for_pred_1, dat_pred_1)
pred_from_2 <- predict.lm(lm_for_pred_2, dat_pred_2)

cat("Prediction from data set 1: ", pred_from_1, "\n",
    "Prediction from data set 2: ", pred_from_2, "\n",
    "True answer:                ", beta0 + as.matrix(new_data) %*% beta, "\n",
    sep = "")
Prediction from data set 1: 5.803723
Prediction from data set 2: 3.68063
True answer:                5

Unfortunately, our predictions are about as unstable as they were before. Given that the estimated regression coefficients aren’t too different from one another, this suggests that the challenges of prediction may have more to do with extrapolating beyond the bounds of the observed data than with uncertainty about regression coefficients. And remember, the underlying process generating these data is linear. Imagine how bad our extrapolations could be if the underlying process were non-linear but well approximated by a linear regression in the observed range of the data, and our extrapolation went beyond the observed data.
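One caveat worth flagging about the matrix multiplication above: princomp() centers (and, with cor = TRUE, scales) the covariates before computing scores, while multiplying the raw new covariate values by the loadings skips that step. predict() on the princomp object applies the stored centering and scaling automatically. A sketch of that alternative, using the pca_1 and lm_for_pred_1 objects from above (the name dat_pred_1_alt is just for this illustration):

```r
## predict() on a princomp object centers/scales newdata with the values
## stored at fitting time before projecting onto the loadings, so the
## scores are on the same footing as those used to fit the regression
scores_1 <- predict(pca_1, newdata = new_data)[, 1:2]
dat_pred_1_alt <- data.frame(x1 = scores_1[1], x2 = scores_1[2])
predict.lm(lm_for_pred_1, dat_pred_1_alt)
```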

## Conclusions

Principal components regression does seem to stabilize estimates of regression coefficients a bit, particularly when you realize the amount of uncertainty associated with those estimates. When you take uncertainty into account, you really don’t have any evidence that the estimates you’re getting are different. That’s good news.

The bad news is that our predictions don’t seem to be any more stable. To be fair, I haven’t tried to assess uncertainty in the predictions. There’s a good chance that if I did, the predictions wouldn’t look all that different. If you want to check out my intuition, you should be able to use posterior_predict() on the objects from the stan_glm() analyses with the same data frames used here to get posterior prediction intervals.

Oh, what the heck. I’m curious to see what happens when we run posterior_predict() on the stan_glm() results, so here goes:

predict_1 <- posterior_predict(stan_lm_for_pred_1, dat_pred_1, type = "response")
predict_2 <- posterior_predict(stan_lm_for_pred_2, dat_pred_2, type = "response")

summarize_posterior <- function(x, credible = 0.95, digits = 3) {
  lo_p <- (1.0 - credible)/2.0
  hi_p <- credible + lo_p
  ci <- quantile(x, c(lo_p, hi_p))
  cat(round(mean(x), digits), " (", round(ci[1], digits), ",",
      round(ci[2], digits), ")\n", sep = "")
}

cat("Data set 1\n")
Data set 1
summarize_posterior(predict_1)
5.811 (1.013,10.653)
cat("\nData set 2\n")

Data set 2
summarize_posterior(predict_2)
3.677 (-1.051,8.327)

Wow! Those are pretty broad intervals, and you can see that they overlap broadly. So I was right. Although the point estimates look different from one another, there clearly isn’t a difference between the posterior predictions that you could defend as meaningful.

Actually, that’s not all of the bad news. The other bad news is that interpreting a principal components regression is less straightforward than interpreting a regression directly on the covariates. Granted, the principal components are (by definition) statistically unrelated to one another, but we’re now faced with interpreting what they mean. In this case we can say that the first principal component reflects (roughly) the sum of all of the covariates and that the second principal component reflects (roughly) the difference between the sum of odd and even covariates. In nearly every principal component analysis of biological data I’ve done, deciding how to provide a verbal interpretation of the principal components has been a challenge, not to mention that a screeplot rarely shows such a clean break between components that matter and those that don’t.
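That verbal interpretation is easy to check against the scores, at least roughly. A sketch using the dat_1 and pca_1 objects from above: correlate the first component’s scores with the row sums of the covariates, and the second component’s scores with the difference between the odd and even covariates (the sign of a principal component is arbitrary, so only the magnitude of the correlation matters):

```r
## rough check of the verbal interpretation of the first two components
X <- as.matrix(dat_1[, 2:10])
cor(pca_1$scores[, 1], rowSums(X))  ## close to 1 in magnitude
odd_minus_even <- rowSums(X[, c(1, 3, 5, 7, 9)]) - rowSums(X[, c(2, 4, 6, 8)])
cor(pca_1$scores[, 2], odd_minus_even)  ## also large in magnitude
```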

So are there other alternatives for reducing the number of covariates? Would I be asking this question if there weren’t? We’ll take a look at the LASSO in the next notebook, and in the one following that I’ll force you to think Bayesian and explore hierarchical shrinkage priors, specifically the regularized horseshoe prior that’s available in rstanarm. That will probably be the end of the series, except that by popular request3 I may try to formulate a checklist that summarizes what we’ve learned in a concluding post.

1. I’m drawing heavily on Collinearity Diagnostics, Model Fit & Variable Contribution for this discussion.

2. If you’re not familiar with principal components analysis, Wikipedia has a nice overview: https://en.wikipedia.org/wiki/Principal_component_analysis.

3. Well, I’ve only had one request, but it’s probably worth formulating a checklist anyway, if I can come up with one that I don’t feel embarrassed about.