# Statistics

## Causal inference in ecology – An update

Causal inference in ecology – links to the series

A couple of years ago I wrote a series of posts on causal inference in ecology. In it I explored the Rubin causal model and concluded that

the Rubin causal model isn’t likely to help me make causal inferences with the kinds of observational data I collect.

I haven’t changed my mind about that, but I do have an update.

I’ve been reading Regression and other stories, by Andrew Gelman, Jennifer Hill, and Aki Vehtari, which I highly recommend reading if you use regression for any purpose in your research. I just finished Chapter 21, “Additional topics in causal inference”, and the last section, 21.5 “Causes of effects and effects of causes”, is particularly relevant to my earlier conclusion. Not surprisingly, Gelman, Hill, and Vehtari (GHV) have a better way of explaining the role that regression can play in generating hypotheses than I did. You’ll need to read the chapters on causal inference (or be familiar with the Rubin causal model) to fully appreciate their insight, but here it is in a nutshell.

We can make inferences about the effect of a cause when we (a) identify an intervention (a cause) that may have an effect and (b) randomize the intervention across experimental units (or do something that mimics random assignment by balancing on potential pre-observation confounders or by using an instrumental variable, a regression discontinuity, or difference-in-differences approach). Thought about in this way, the purpose of statistical analysis is to estimate the magnitude of an effect.

The regression analyses I typically do can be cast as an attempt to make inferences about the cause of an effect.1 Here’s where GHV have a better way of thinking about it than I did. Let’s suppose that I’m interested in environmental features that influence stomatal density, the example I discussed on 11 June 2018. I illustrate there that three principal components describing aspects of the environment show strong associations with stomatal density. GHV remind us that some other variable (or set of variables) could cause the observed differences in stomatal density and that once we’ve taken that variable into account, none of the PCs would show an association with stomatal density.2 More importantly, they point out that the association suggests causal hypotheses that could account for the association. To the extent that its important to us to dissect those causes, we can then do new experiments or make new observations (using Rubin’s causal model as a framework if we’re going to make causal inferences from an observational study) structured to estimate the effects those hypotheses suggest.

1. I wrote “can be cast as an attempt”, because I do my damndest to make it clear that I’m only asserting that certain variables have stronger associations with the outcome I’m studying than others, not that those variables cause the outcome.
2. Fortunately (for me), that’s consistent with what I wrote two years ago.

## An R-Stan implementation of Bayesian inference for Wright’s F-statistics

Some of you know that many years ago Paul Lewis, Dipak Dey, and I wrote a paper describing a Bayesian approach to inferring population structure from dominant markers.1 You may also know that Paul Lewis and I wrote a Windoze program in C++, Hickory, that implemented the approach. We later extended Hickory for analysis of co-dominant markers. Later still, Feng Guo, Dipak, and I wrote another paper describing a Bayesian approach to (a) estimating population- and locus-specific effects on FST and (b) identifying loci where the posterior distribution of FST is markedly different from the overall estimate.2 If you know that (and maybe even if you don’t know all of that), you also know that Paul and I stopped maintaining Hickory a number of years ago. I moved from Windoze to Mac, and the library we were using to support the graphical user interface became too complicated for me to keep up with.

I’ve had a few requests from people who were interested in using Hickory, but I just haven’t had the time to find a way to support them – until now.

Over the past several years, I’ve been using Stan for many different statistical analyses. When I received another request for Hickory a couple of weeks ago, I realized that I could pretty easily develop a new version of Hickory in R/Stan. This approach has several advantages over the standalone C++ code in the original Hickory.

1. I don’t have to worry about writing the MCMC sampler myself. I use the very sophisticated Hamiltonian Monte Carlo in Stan. I not only avoid me the bother of writing my own sampler, I have much greater confidence that the sampler is performing correctly. It’s written and maintained by experts, and the convergence diagnostics are far more sophisticated than for Metropolis-Hastings.
2. It should be readily portable to any platform on which R is supported. The only requirement, for now, is that you have a C++ compiler installed. If you’re running a Mac, you may need to download Xcode. If you’re running Linux, you should be all set. If your running Windows, you can download Rtools from CRAN. I intend to submit the R package I’ve written to CRAN once I’ve tested it more thoroughly and provided some extensions to the crude functionality currently available. Once it’s on CRAN, you won’t even need a C++ compiler.
3. I can develop an interface to adegenet and other R packages used for analysis and manipulation of genetic data so that Hickory can use data in many different formats supported by other packages.

A very early release of Hickory is available at GitHub. You should find all of the information you need to install and use it there. Let me know if you run into problems. I’ll do my best to walk you through them (and probably correct some errors I’ve made or at least improve the meager documentation in the process).

1. Holsinger, K. E., Lewis, P. O., and Dey, D. K. 2002. A Bayesian approach to inferring population structure from dominant markers. Molecular Ecology 11:1157–1164.
2. Guo, F., Dey, D. K., and Holsinger, K. E. 2009. A Bayesian hierarchical model for analysis of SNP diversity in multilocus, multipopulation samples. Journal of the American Statistical Association 104:142–154. http://doi.org/10.1198/jasa.2009.0010

## Some parting thoughts on variable selection in multiple regression

Variable selection in multiple regression

As I said a month and a half ago, this series started because

I was talking with one of my graduate students a few days ago about variable selection in multiple regression. She was looking for a published “cheat sheet.” I told her I didn’t know of any. “Why don’t you write one?” “The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.”

If you’ve been following along, it won’t surprise you to learn that I’m not going to conclude with a simple recipe.

“The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.”

Although there will never be a simple recipe, I can tell you what I’m going to do. You’ll want to look at a new R notebook that explores what happens when associations among covariates aren’t as strong as those we’ve been assuming so far.

• For any analysis where I can use stan_glm() or stan_glmer() I’ll use horseshoe priors to “shrink” the regression coefficients for unimportant covariates towards zero. For any analysis where I can’t use stan_glm() or stan_glmer(), I’ll probably be using Stan directly and I’ll hardcode the horseshoe priors myself.
• If I feel the need to use a relatively objective method to identify some subset of covariates that are “important”,1 I’ll use projection predictive variable selection as implemented in projpred to identify the most important covariates.2
• For reasons outlined in the Conclusions section of the R notebook I mentioned above, I will be very cautious about interpreting associations between covariates and response variables as anything other than a statistical association. Only if an association I find has been found repeatedly in other data sets and also has a good “first principles” explanation will I begin to interpret it as a causal association. Otherwise, I’ll interpret it as an intriguing pattern worthy of further study and exploration. If you want more details on how hard it is to infer causal relationships from these kinds of analyses, look at my blog series on causal inference in ecology.
1. I can think of a couple of reasons that I might want to select a subset of covariates. (1) I might not have a lot of data to fit to my model. Because of the priors, I’ll be able to fit it without the model blowing up, but the parameter estimates are likely to be very poorly defined. Reducing the number of parameters may help me isolate an “interesting” relationship. So long as I remember that all I may uncover is a statistical association, that pattern might still be worth investigating. (2) I might want a relatively objective way to simplify a complicated model so that it’s easy to understand, and there may not be an obvious break graphically or numerically between those that are important and those that aren’t. Regardless of whether it’s for reason #1 or reason #2, I will be extremely cautious about interpreting any associations identified through projection predictive variable selection as “real” and the ones not identified as “spurious.” In fact, I probably won’t do it at all, and I’ll probably present results from analysis of the full model in addition to the reduced model, even if the full model results only go in online supplemental material.
2. Since I’m new to using projpred, I don’t know whether I’ll be able to use it with my own Stan code. If not, it’s yet another reason for me to learn brms, which can handle a bunch of models that rstanarm can’t – possibly many of them that I’d be hardcoding in Stan otherwise.

## Using projection predictiion for variable selection in a Bayesian regression

Variable selection in multiple regression

Horseshoe priors are very easy to use if you’re using rstanarm. You should consider using them in any analysis where you use stan_glm() or stan_glmer(). If you were paying attention, though, I did a bit (OK, more than a bit) of handwaving in deciding which covariates were “important”. In this example, it was pretty easy, because there were some covariates with posterior distributions well away from zero and others with posterior distributions (close to) centered on zero and the difference between the two sets of coefficients was easy to see. That won’t always be the case. In fact, it probably won’t usually be the case. So we’d like to have some way of more “objectively” identifying which covariates are important and which aren’t.

Thats where projection predictive variable selection comes in. It’s an approach that uses a statistically meaningful criterion to guide your choice of variables that are “important” in the sense that including those variables (and only those variables) is sufficient to give predictions roughly equivalent to including all of them. Again, if you’re using rstanarm, it’s very easy to take advantage of the approach thanks to the projpred package available on CRAN.

In case you missed the link to projection predictive variable selection, here it is again: http://darwin.eeb.uconn.edu/pages/variable-selection/projection-predictive-variable-selection.nb.html.

## A Bayesian approach to variable selection using horseshoe priors

Variable selection in multiple regression

The Lasso has been very widely used, particularly in high-dimensional problems where the number of observations is less than the number of covariates.1 In fact, when I checked Google Scholar on Saturday, it had been cited nearly 30,000 times.2 Bayesians didn’t want to be left out, so Trevor Park and George Casella developed the Bayesian Lasso.3 The Bayesian Lasso overcomes what to my mind is one of the great disadvantages of the original Lasso, the difficulty of providing an assessment of how reliable the regression coefficients are. Like any other Bayesian method that uses MCMC methods, it’s just as easy to get credible intervals on parameters as it is to get posterior means. The Bayesian Lasso also estimates $$\lambda$$ as part of the procedure rather than relying on cross-validation. The R package monovm provides an implementation of the Bayesian Lasso in addition to other shrinkage regression methods.

I haven’t explored monovm, but if you’re interested in the Bayesian Lasso, you might want to check it out. Instead of exploring the Bayesian Lasso, the R notebook I’ve put together here explores the use of “horseshoe priors” in rstanarm. The basic idea is the same. We’d like to “shrink” some parameter estimates towards zero, and we’d like to have the data tell us which estimates to shrink. The nice thing about “horseshoe priors” in rstanarm is that if you know how to set up a regression in stan_glm() or stan_glmer() you can use a horseshoe prior very easily in your analysis simply by changing the prior parameter in your call to one of those functions.

1. This is often referred to as an $$n \ll p$$ problem. I’m not going to address that problem here, but if you deal with genomic data, you’ll want to familiarize yourself with the problem and the approaches typically used for addressing it.
2. 29,558 times to be exact.
3. Park, T. and G. Casella. 2008. The Bayesian Lasso. Jornal of the American Statistical Association. 103:681-686. doi: 10.1198/016214508000000337

## Using the Lasso for variable selection

Variable selection in multiple regression

If you’ve been following along, you’ve now seen some fairly simple approaches for reducing the number of covariates in a linear regression. It shouldn’t come as a shock that statisticians have been worried about the problem for a long time or that they’ve come up with some pretty sophisticated approaches to the problem.1 The first one we’ll explore is the Lasso (least asolute shrinkage and selection operator), which Rob Tibshirani introduced the Lasso to statistics and machine learning more than 20 years ago.2 You’ll find more details in the R notebook illustrating using the Lasso to select covariates, but here are the basic ideas.

The “shrinkage” part of the name refers to the idea that we don’t expect all of the covariates we’re including in the model to be important. And if a covariate isn’t important, we want the magnitude of the regression coefficient associated with that component to be zero (or nearly zero). In other words, we want the estimate to be “shrunk” towards zero rather than taking the value it would if we included it in the full multiple regression.

The “selection” part of the name refers to the idea that we don’t know ahead of time which of the covariates are important (and shouldn’t be shrunk towards 0) and which are important (and should be shrunk towards 0). We want the data to tell us which covariates are important and which aren’t, i.e., we want the data to “select” important covariates.

The Lasso accomplishes this by adding a penalty to the typical least squares estimates. Instead of simply minimizing the sum of squared deviations from the regression line, we do so subject to a constraint that the total magnitude of all regression coefficients is less than some value. We’ll use glmnet() to fit the Lasso. If you explore the accompanying documentation, you’ll see that the Lasso is just one method along a continuum of constrained optimization approaches. I’ll let you explore those on your own if you’re interested.

1. I’m not going to discuss forward, backward, or all subsets approaches to selecting variables. They don’t seem to be used much anymore (for good reason). If you’re interested in them, take a look at the Wikipedia page on stepwise regression.
2. Wikipedia points out that it was originally introduced 10 years earlier in geophysics, but Tibshirani discovered it independently, and it was his discovery that led to its wide use in statistics and machine learning.

## A simple introduction to Bayesian linear regression

I referred in passing to rstanarm and Bayesian linear regression in the R notebook on reducing the number of covariates. We’ll encounter Bayesian approaches again soon, and I just happened to run across a nice, simple introduction to Bayesian linear regression. It uses Python, with which I am only glancingly familiar, but you don’t need to run Python to read the discussion and understand what’s going on. If you’re unfamiliar with Bayesian inference or if you’d just like to check your understanding, take a look at this Introduction to Bayesian Linear Regression.

## An update on principal components regression

Variable selection in multiple regression

In the R notebook on principal component regression I noted that interpreting principal components can be a challenge. When I wrote that, I hadn’t seen a nice paper by Chong et al.1 The method they describe is presented specifically in the context of interpreting selection gradients after a principal components regression, but the idea is general. Once you’ve done the regression on principal components, transform the regression coefficients back to the original scale. Doing this does require, however, fitting all of the principal components, not just the first few.

Here’s the citation and abstract:

Chong, V. K., H. F. Fung, and J. R. Stinchcombe. 2018. A note on measuring natural selection on principal component scores. Evolution Letters. 2-4: 272–280 dos: 10.1002/evl3.63

Measuring natural selection through the use of multiple regression has transformed our understanding of selection, although the methods used remain sensitive to the effects of multicollinearity due to highly correlated traits. While measuring selection on principal component (PC) scores is an apparent solution to this challenge, this approach has been heavily criticized due to difficulties in interpretation and relating PC axes back to the original traits. We describe and illustrate how to transform selection gradients for PC scores back into selection gradients for the original traits, addressing issues of multicollinearity and biological interpretation. In addition to reducing multicollinearity, we suggest that this method may have promise for measuring selection on high‐dimensional data such as volatiles or gene expression traits. We demonstrate this approach with empirical data and examples from the literature, highlighting how selection estimates for PC scores can be interpreted while reducing the consequences of multicollinearity.

1. John Stinchcombe pointed me to the paper. Thanks John.

## Principal components regression

Variable selection in multiple regression

In the last installment of this series we explored a couple of simple strategies to reduce the number of covariates in a multiple regression.1, namely retaining only covariates that have a “real” relationship with the response variable2 and selecting one covariate from each cluster of (relatively) uncorrelated covariates.3 Unfortunately, we found that neither approach worked very well in our toy example.4.

One of the reasons that the second approach (picking “weakly” correlated covariates) may not have worked very well is that in our toy example we know that both x1 and x3 contribute positively to y, but our analysis included only x1. Another approach that is sometimes used when there’s a lot of association among covariates is to first perform a principal components analysis and then to regress the response variable on the scores from the first few principal components. The newest R notebook in this series explores principal component regression.

Spoiler alert: It doesn’t help the point estimates much either, but the uncertainty around those point estimates is so large that we can’t legitimately say they’re different from one another.

1. If you’ve forgotten why we might want to reduce the number of covariates, look back at this post.
2. The paradox lurking here is that if we knew which covariates these were, we probably wouldn’t have measured the others (or at least we wouldn’t have included them in the regression analysis).
3. There isn’t a good criterion to determine how weak the correlation needs to be to regard clusters as “relatively” uncorrelated.
4. If you’re reading footnotes, you’ll realize that the situation isn’t quite as dire as it appears from looking only at point estimates. Using rstanarm() for a Bayesian analysis shows that the credible intervals are very broad and overlapping. We don’t have good evidence that the point estimates are different from one another.

## Trying out a couple of simple strategies for reducing the number of covariates

Variable selection in multiple regression

If you’ve been following this series, you now know that multiple regression can be very useful but that its usefulness depends on overcoming several challenges. One of those challenges is that if we use all of the covariates available to us and some of them are highly correlated with one another, our assessment of which covariates have an association with the response variable may be misleading and any prediction we make about new observations may be very unreliable. That leads us to the problem of variable selection. Rather than using all of the covariates we have available, maybe we’d be better off if we used only a few.

In this R notebook, I explore a couple of approaches to variable selection:

1. Restricting the covariates to those we know have an association with the response variable.1
2. Identifying clusters of covariates that are highly associated with one another, (relatively) unassociated with those in other clusters, and picking one covariate from each cluster for the analysis.2

As you’ll see for the sample data set we’ve been exploring in which there are two clusters of covariates having strong associations within clusters and weak to non-existent associations between clusters, neither of these approaches serves us particularly well. The next installment will explore another commonly used approach – principal components regression.

1. There’s at least one obvious problem with this approach that I don’t discuss in the notebook. In the work I’ve been involved with, we rarely know ahead of time which covariates, if any, have “real” relationships with the response variable. Most often we’ve measured covariates because we anticipate that they have some relationship to what we’re interested in and we’re trying to figure out which one(s) are most important.
2. This approach has some practical problems that I don’t discuss in the notebook. How strong do associations have to be to be “highly associated”? How weak do they have to be to be “(relatively) unassociated”? What do we do if there isn’t a clear cutoff between “highly associated” and “(relatively) associated”?