The problem of identifying covariates that should be included in a regression has been studied for a long time. More than two decades ago, Rob Tibshirani introduced the Lasso to let the data tell us which covariates to include. If you want the details, including how the Lasso has been generalized, refer to the Wikipedia page. For our purposes, it’s enough for you to know that for linear regression fit by least squares, the Lasso simply includes a constraint on the regression coefficients.

What is the Lasso?

To make that constraint clear, let’s write out explicitly what the least squares criterion is:

\[ \mbox{min}_{\beta}\left\{\frac{1}{N}\sum_i\left(y_i - (\beta_0 + \sum_j\beta_jx_{ij})\right)^2\right\} \]

In words this simply means that we find the set of \(\beta\)s that minimize the difference between our predictions and the observed data. The Lasso does that too, but it adds a constraint on the set of \(\beta\)s, namely

\[ \sum_j |\beta_j| < \lambda \quad , \]

where \(\lambda\) is a parameter that determines the maximum influence of all covariates on the response variable when all of the individual influences are combined. To incorporate the constraint into the algorithm we use this criterion:

\[ \mbox{min}_{\beta}\left\{\frac{1}{N}\sum_i\left(y_i - (\beta_0 + \sum_j\beta_jx_{ij})\right)^2 + \lambda\sum_j|\beta_j|\right\} \]

In a simple least squares regression we don’t care how big any of the \(\beta\)s are individually and we don’t care about how big their overall magnitude is. In using the Lasso we are making the assumption that not only are we unwilling to accept the idea that any individual \(\beta\) is large, but we are also unwilling for any combination of them to be large either. The effect of this, as we’ll see is not only to limit the magnitude of individual regression coefficients, but also to keep some of them close to zero.

Trying out the Lasso

As usual, our first step is to regenerate the data we’ve been playing with.


rm(list = ls())
## intetcept
beta0 <- 1.0
## regression coefficients
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to saem
## correlation, covariance matrix caldulated from individual variances and a 
## single association parameter governing the non-zero correlation coefficients
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
Rho <- matrix(nrow = 9, ncol = , byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
## vector of standard deviations for covariates
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matris
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  ## notice the use of matrix multiplication
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)

## set the random number seed manually so that every run of the code will 
## produce the same numbers

n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

resid <- rep(2.0, n_samp)

y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names

## saving results in scale allows me to use them later for prediction with
## new data
scale_1 <- lapply(dat_1[, 1:10], scale)
scale_2 <- lapply(dat_2[, 1:10], scale)

## when assigning the same scaling to a data frame, the scaling attributes
## are lost
dat_1[, 1:10] <- lapply(dat_1[, 1:10], scale)
dat_2[, 1:10] <- lapply(dat_2[, 1:10], scale)

OK. Now that we have the data, let’s try the lasso and see what we get.1

Fitting the lasso

The first thing we have to do is to find a value for \(\lambda\). A good approach is to cv.glmnet(), which will split the data set into test and training data set and use cross-validation to identify the best value.2 Since we’re going to do this for both data sets, we start by writing a function that will do it all for us. In this function, I holdout some of the data, by default 20 percent, to illustrate how well the model performs on within sample predictions.


lasso <- function(dat, title, holdout = 0.2) {
  x_vars <- model.matrix(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, 
                         data = dat)[, -1]
  y_var <- dat$y
  ## select the training set (a random sample from 50% of the rows)
  train <- sample(1:nrow(x_vars), nrow(x_vars)*(1.0 - holdout))
  ## construct the text set
  test <- setdiff(1:nrow(x_vars), train)
  ## use a series of different lambda's that cv.glmnet() picks to identify
  ## the best one
  cv_output <- cv.glmnet(x_vars[train, ], y_var[train])
  ## now predict values for the test set and the training set so that we can
  ## compare observations with predictions for both of them
  predict_train <- predict(cv_output, x_vars[train, ], s = "lambda.min")
  predict_train_mat <- cbind(y_var[train], predict_train, "Training")
  if (holdout > 0) {
    predict_test <- predict(cv_output, x_vars[test, ], s = "lambda.min")
    predict_test_mat <- cbind(y_var[test], predict_test, "Test")
    predict <- rbind(predict_train_mat, predict_test_mat)
  } else {
    predict <- predict_train_mat
  colnames(predict) <- c("Observed", "Predicted", "Data set")
  predict <-
  predict$Observed <- as.numeric(as.character(predict$Observed))
  predict$Predicted <- as.numeric(as.character(predict$Predicted))
  if (holdout > 0) {
    p <- ggplot(predict, aes(x = Observed, y = Predicted, 
                             color = `Data set`, fill = `Data set`))
  } else {
    p <- ggplot(predict, aes(x = Observed, y = Predicted))
  p <- p + geom_point() +
    geom_abline(slope = 1.0, intercept = 0.0) + 
    geom_smooth(aes(group = 1), method = "lm")


## set seed to allow results to be reproduced

lasso_1 <- lasso(dat_1, "Data set 1")

lasso_2 <- lasso(dat_2, "Data set 2")

You should notice a couple of things so far:

  1. The training set does a good job of predicting the test data.3
  2. The predicted values are, on average, less than the observed values.4

Comparing coefficient estimates

Since we’ve fit a linear regression as part of fitting the lasso, we can also get a report on the regression coefficients identified as part of the regression, but first, let’s refit the lasso to the entire data set.

lasso_1 <- lasso(dat_1, "Data set 1", holdout = 0.0)