## Exploring mixed models in Stan

I am about to begin developing a moderately complex mixed model in Stan to analyze realtinoships among anatomical/morphological traits (e.g., leaf thickness, LMA, wood density), physiological performance (e.g., *A*_{max}, stem hydraulic conductance), and indices of fitness (e.g., height, growth rate, number of seedheads). One complication is that the observations are from several different species of *Protea* at several different sites.^{1} We’re going to treat sites as nested within species.

Before I start building the whole model, I wanted to make sure that I can do a simple mixed linear regression with a random site effect nested within a random species effect. In `stan_lmer()` notation that becomes:

stan_lmer(Amax ~ LMA + (1|Species/Site))

I ran a version of my code with several covariates in addition to LMA using hand-coded stan and compared the results to those from `stan_lmer()`. Estimates for the overall intercept and the regression coefficients associated with each covariate were very similar. The estimates of both standard deviations and individual random effects at the species and site within species level were rather different – especially at the species level. This was troubling, so I set up a simple simulation to see if I could figure out what was going on. The R code, Stan code, and simulation results are available in Github: https://kholsinger.github.io/mixed-models/.

The model used for simulation is very simple:

\(

\begin{equation}

y_k \sim \mbox{N}(\mu_k, \sigma) \\

\mu_k = \beta_0(species|site) + \beta_1x \\

\beta_0(species|site) \sim \mbox{N}(\beta_0(species), \sigma_{species|site}) \\

\beta_0(species) \sim \mbox{N}(\beta_0, \sigma_{species})

\end{equation}

\)

Happily, the Stan code I wrote does well in recovering the simulation parameters.^{2} Surprisingly, it does better on recovering the random effect parameters than `stan_lmer()`. I haven’t completely sorted things out yet, but the difference is likely to be a result of different prior specifications for the random effects. My simulation code^{3} uses independent Cauchy(0,5) priors for the standard deviation of all variance parameters. `stan_lmer()` uses a covariance structure for all parameters that vary by group.^{4} If the difference in prior specifications is really responsible, it means that the differences between my approach and the approach used in `stan_lmer()` will vanish as the number of groups grows.

Since we’re only interested in the analog of \(\beta_1\) for the analyses we’ll be doing, the difference in random effect estimates doesn’t bother me, especially since my approach seems to recover them better given the random effect structure we’re working with. This is, however, a good reminder that if you’re working with mixed models and you’re interested in the group-level parameters, you’re going to need a large number of *groups*, not just a large number of individuals, to get reliable estimates of the group parameters.