Introduction

We will simulate some data to illustrate the relationship between heritability estimates derived from a regression of offspring phenotype on mid-parent value and those derived from analysis of half-sibs. To do the simulation, we’ll assume that there are a large number of loci that affect the trait and that each locus has a small effect. With this assumption, we can approximate the genotype of each individual as a random variable drawn from a normal distribution. The R code below implements the folloowing simple algorithm:

  1. Select a genotype for the sire at random from a normal distribution with mean \(mu\) and variance \(\sigma^2_g\). Call that genotype \(x_p\).

  2. Select a genotype for the dam at random from a normal distribution with the same mean and variance. Call that genotype \(x_m\).

  3. Calculate the mid-parent genotypic value as \(x_{mp} = \frac{x_m + x_p}{2}\).

  4. Set the offspring genotype to \(x_{mp}\).

  5. Repeat steps #2 and #3 n_dams times.

  6. Repeat steps #1-#5 n_sires times.

  7. Set phenotypes for the maternal parent, the paternal parent, and the offspring by drawing them from normal distributions with the appropriate means and varience \(\sigma^2_e\)

  8. Record the paternal ID, the genotype of each parent, the mid-parent genotypic value, the phenotype of each parent, the mid-parent phenotypic value, and the offspring phenotype.

NOTE: I am using set.seed(1234) to ensure that if you run this code on your own, you’ll get the same results as I do. If you erase or comment out that line and rund the code, your results will differ slightly from mine.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.4     ✓ dplyr   1.0.7
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   2.0.1     ✓ forcats 0.5.1
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
rm(list = ls())

set.seed(1234)

n_sires <- 100
n_dams <- 10
mu <- 50
sigma_2_g <- 25
sigma_2_e <- 9

dat <- tibble(pat_ID = NA,
              ind_ID = NA,
              mat_g = NA,
              pat_g = NA,
              mid_g = NA,
              mat_p = NA,
              pat_p = NA,
              mid_p = NA,
              off_p = NA)
ct <- 0
for (i in 1:n_sires) {
  patg <- rnorm(1, mu, sqrt(sigma_2_g))
  patp <- rnorm(1, patg, sqrt(sigma_2_e))
  for (j in 1:n_dams) {
    ct <- ct + 1
    matg <- rnorm(1, mu, sqrt(sigma_2_g))
    matp <- rnorm(1, matg, sqrt(sigma_2_e))
    midg <- (matg + patg)/2
    offp <- rnorm(1, midg, sqrt(sigma_2_e))
    dat <- add_row(dat,
                   pat_ID = i,
                   ind_ID = ct,
                   mat_g = matg,
                   pat_g = patg,
                   mid_g = midg,
                   mat_p = matp,
                   pat_p = patp,
                   mid_p = (matp + patp)/2,
                   off_p = offp)
  }
}
dat <- dat %>%
  filter(!is.na(pat_ID)) %>%
  mutate(pat_ID = factor(pat_ID),
         ind_ID = factor(ind_ID))
pat_g_var <- var(unique(dat[, c("pat_ID", "pat_g")])$pat_g)
pat_p_var <- var(unique(dat[, c("pat_ID", "pat_p")])$pat_p)
variances <- tibble(Parent = c("Sire", "Dam"),
                    Genotypic = round(c(pat_g_var, var(dat$mat_g)), 3),
                    Environmental = round(c(pat_p_var - pat_g_var,
                                            var(dat$mat_p) -
                                              var(dat$mat_g)), 3),
                    Phenotypic = round(c(pat_p_var, var(dat$mat_p)),
                                      3))
knitr::kable(variances)
Parent Genotypic Environmental Phenotypic
Sire 26.680 5.934 32.614
Dam 23.884 9.868 33.752

As you can see the simulated genotypic and environmental variances are pretty close to what we specified, i.e., \(\sigma^2_g = 25\) and \(\sigma^2_e = 9\).1 You can also see that the phenotypic variance is pretty close to \(\sigma^2_g + \sigma^2_e\), as we expect.

Offspring-midparent regression

If you’re familiar with regression in R, you know about lm() and glm(). I’m going to use a similar function from rstanarm, namely stan_glm() to run a Bayesian linear regression. In addition to providing an estimate of the slope of the regression of offspring on mid-parent value, it provides a sense of how precise that estimate is in the form of 90 percent credible intervals.2

library(rstanarm)
Loading required package: Rcpp
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
This is rstanarm version 2.21.1
- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
  options(mc.cores = parallel::detectCores())
## This line allows rstanarm to run several chains at the same time
## instead of running them sequentially
options(mc.cores = parallel::detectCores())

off_r <- stan_glm(off_p ~ mid_p,
                  data = dat,
                  refresh = 0)
summary(off_r, digits = 3)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      off_p ~ mid_p
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 predictors:   2

Estimates:
              mean   sd     10%    50%    90% 
(Intercept) 12.970  1.323 11.273 12.986 14.608
mid_p        0.740  0.027  0.708  0.740  0.775
sigma        3.425  0.076  3.328  3.425  3.524

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 49.663  0.149 49.474 49.662 49.853

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse  Rhat  n_eff
(Intercept)   0.020 1.000 4331 
mid_p         0.000 1.000 4335 
sigma         0.001 1.000 3749 
mean_PPD      0.002 1.000 3919 
log-posterior 0.028 1.002 1827 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

We see that the slope of the regression, the mid_p line, is 0.740 (0.708, 0.775), while we expect to see \(\frac{\sigma^2_g}{\sigma^2_g + \sigma^2_p} = \frac{25}{34} = 0.735\). That’s not bad at all. Now let’s look at the half sib analysis.

Half-sib analysis

Besides the fact that stan_glm() provides both point estimates and estimates of uncertainty it also sets us up nicely for using stan_glmer(), which sets us up for doing the same thing with variance components. That’s what we’ll be doing here. We are fitting a model in which each offspring’s phenotype is drawn from a normal distribution with a mean that depends on pat_ID.3

off_h <- stan_glmer(off_p ~ (1|pat_ID),
                    data = dat,
                    refresh = 0)
summary(off_h, digits = 3, 
        pars = c("sigma", "Sigma[pat_ID:(Intercept),(Intercept)]"))

Model Info:
 function:     stan_glmer
 family:       gaussian [identity]
 formula:      off_p ~ (1 | pat_ID)
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 groups:       pat_ID (100)

Estimates:
                                        mean   sd    10%   50%   90%
sigma                                 3.891  0.090 3.775 3.891 4.006
Sigma[pat_ID:(Intercept),(Intercept)] 5.531  1.027 4.310 5.429 6.907

MCMC diagnostics
                                      mcse  Rhat  n_eff
sigma                                 0.001 1.000 4085 
Sigma[pat_ID:(Intercept),(Intercept)] 0.030 1.001 1142 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

sigma is the standard deviation within a sibship, \(\sigma_w\), and Sigma[pat_ID:(Intercept),(Intercept)] is the variance among half-sib families, \(\sigma^2_{hs}\). The variance among half-sib families is the variance among mothers, which is the genetic variance. There are, however, 10 offspring in each half-sib family. Thus, we can estimate the broad-sense heritability with this little function.4

heritability <- function(sigma_w, sigma_hs, n_off) {
  h_2 = sigma_hs*n_off/(sigma_hs*n_off + sigma_w^2)
  return(h_2)
}

off_h_mat <- as.data.frame(off_h)
h_2 <- heritability(off_h_mat$sigma, 
                    off_h_mat$`Sigma[pat_ID:(Intercept),(Intercept)]`,
                    n_dams)
off_h_mat$h_2 <- h_2

library(ggplot2)
p <- ggplot(off_h_mat, aes(x = h_2)) +
  geom_histogram(bins = 100, alpha = 0.4) +
  geom_vline(xintercept = mean(off_h_mat$h_2), 
             color = "red",
             linetype = "dashed") +
  xlab("Heritability") +
  theme_bw()
p

As you can see, in this simulation the half-sib estimate is pretty close to what we expect to see, 0.781.5

Lab 12

As part of her dissertation, Nora Mitchell measured several traits in two closely related species of Protea

Protea punctata is a large, upright shrub.