# Introduction

We will simulate some data to illustrate the relationship between heritability estimates derived from a regression of offspring phenotype on mid-parent value and those derived from analysis of half-sibs. To do the simulation, we'll assume that there are a large number of loci that affect the trait and that each locus has a small effect. With this assumption, we can approximate the genotype of each individual as a random variable drawn from a normal distribution. The R code below implements the following simple algorithm:

1. Select a genotype for the sire at random from a normal distribution with mean $$\mu$$ and variance $$\sigma^2_g$$. Call that genotype $$x_p$$.

2. Select a genotype for the dam at random from a normal distribution with the same mean and variance. Call that genotype $$x_m$$.

3. Calculate the mid-parent genotypic value as $$x_{mp} = \frac{x_m + x_p}{2}$$.

4. Set the offspring genotype to $$x_{mp}$$.

5. Repeat steps #2-#4 n_dams times.

6. Repeat steps #1-#5 n_sires times.

7. Set phenotypes for the maternal parent, the paternal parent, and the offspring by drawing them from normal distributions with mean equal to the corresponding genotypic value and variance $$\sigma^2_e$$.

8. Record the paternal ID, the genotype of each parent, the mid-parent genotypic value, the phenotype of each parent, the mid-parent phenotypic value, and the offspring phenotype.

NOTE: I am using set.seed(1234) to ensure that if you run this code on your own, you'll get the same results as I do. If you erase or comment out that line and run the code, your results will differ slightly from mine.

```r
library(tidyverse)

rm(list = ls())

set.seed(1234)

n_sires <- 100
n_dams <- 10
mu <- 50
sigma_2_g <- 25
sigma_2_e <- 9

## initialize the tibble with a row of NAs (dropped after the loop)
dat <- tibble(pat_ID = NA,
              ind_ID = NA,
              mat_g = NA,
              pat_g = NA,
              mid_g = NA,
              mat_p = NA,
              pat_p = NA,
              mid_p = NA,
              off_p = NA)
ct <- 0
for (i in 1:n_sires) {
  patg <- rnorm(1, mu, sqrt(sigma_2_g))
  patp <- rnorm(1, patg, sqrt(sigma_2_e))
  for (j in 1:n_dams) {
    ct <- ct + 1
    matg <- rnorm(1, mu, sqrt(sigma_2_g))
    matp <- rnorm(1, matg, sqrt(sigma_2_e))
    midg <- (matg + patg)/2
    offp <- rnorm(1, midg, sqrt(sigma_2_e))
    dat <- add_row(dat,
                   pat_ID = i,
                   ind_ID = ct,
                   mat_g = matg,
                   pat_g = patg,
                   mid_g = midg,
                   mat_p = matp,
                   pat_p = patp,
                   mid_p = (matp + patp)/2,
                   off_p = offp)
  }
}
dat <- dat %>%
  filter(!is.na(pat_ID)) %>%
  mutate(pat_ID = factor(pat_ID),
         ind_ID = factor(ind_ID))
```
```r
pat_g_var <- var(unique(dat[, c("pat_ID", "pat_g")])$pat_g)
pat_p_var <- var(unique(dat[, c("pat_ID", "pat_p")])$pat_p)
variances <- tibble(Parent = c("Sire", "Dam"),
                    Genotypic = round(c(pat_g_var, var(dat$mat_g)), 3),
                    Environmental = round(c(pat_p_var - pat_g_var,
                                            var(dat$mat_p) - var(dat$mat_g)), 3),
                    Phenotypic = round(c(pat_p_var, var(dat$mat_p)), 3))
knitr::kable(variances)
```
| Parent | Genotypic | Environmental | Phenotypic |
|--------|----------:|--------------:|-----------:|
| Sire   |    26.680 |         5.934 |     32.614 |
| Dam    |    23.884 |         9.868 |     33.752 |

As you can see, the simulated genotypic and environmental variances are pretty close to what we specified, i.e., $$\sigma^2_g = 25$$ and $$\sigma^2_e = 9$$. You can also see that the phenotypic variance is pretty close to $$\sigma^2_g + \sigma^2_e$$, as we expect.
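Why the variances should add: the genotypic value and the environmental deviation are drawn independently, so the phenotypic variance is the sum of the two components. A quick standalone check (not part of the lab code; it reuses the same $$\sigma^2_g = 25$$ and $$\sigma^2_e = 9$$):

```r
## For independent draws, Var(phenotype) = Var(genotype) + Var(environment)
set.seed(42)
g <- rnorm(1e5, 50, sqrt(25))  # genotypic values, variance 25
p <- rnorm(1e5, g, sqrt(9))    # phenotype = genotype plus environmental noise
var(p)                         # close to 25 + 9 = 34
```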

## Offspring-midparent regression

If you're familiar with regression in R, you know about lm() and glm(). I'm going to use a similar function from rstanarm, namely stan_glm(), to run a Bayesian linear regression. In addition to providing an estimate of the slope of the regression of offspring on mid-parent value, it conveys how precise that estimate is in the form of 90 percent credible intervals.
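If you want a frequentist point of comparison, the same regression is a one-liner with lm(), and its slope estimate should land close to the posterior mean that stan_glm() reports. This sketch assumes the dat tibble built in the simulation above:

```r
## Frequentist counterpart of the Bayesian regression (assumes `dat` exists)
off_lm <- lm(off_p ~ mid_p, data = dat)
coef(off_lm)["mid_p"]     # point estimate of the slope
confint(off_lm, "mid_p")  # 95 percent confidence interval for the slope
```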

```r
library(rstanarm)

## This line allows rstanarm to run several chains at the same time
## instead of running them sequentially
options(mc.cores = parallel::detectCores())
```

```r
off_r <- stan_glm(off_p ~ mid_p,
                  data = dat,
                  refresh = 0)
summary(off_r, digits = 3)
```

```
Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      off_p ~ mid_p
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 predictors:   2

Estimates:
              mean   sd     10%    50%    90%
(Intercept) 12.970  1.323 11.273 12.986 14.608
mid_p        0.740  0.027  0.708  0.740  0.775
sigma        3.425  0.076  3.328  3.425  3.524

Fit Diagnostics:
           mean   sd     10%    50%    90%
mean_PPD 49.663  0.149 49.474 49.662 49.853

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse  Rhat  n_eff
(Intercept)   0.020 1.000 4331
mid_p         0.000 1.000 4335
sigma         0.001 1.000 3749
mean_PPD      0.002 1.000 3919
log-posterior 0.028 1.002 1827

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
```

We see that the slope of the regression, the mid_p line, is 0.740 (0.708, 0.775), while we expect to see $$\frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e} = \frac{25}{34} \approx 0.735$$. That's not bad at all. Now let's look at the half-sib analysis.
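The expected slope is just arithmetic on the parameters set at the top of the simulation:

```r
sigma_2_g <- 25
sigma_2_e <- 9
round(sigma_2_g / (sigma_2_g + sigma_2_e), 3)  # 0.735, the expected offspring-midparent slope
```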

## Half-sib analysis

Besides providing both point estimates and estimates of uncertainty, stan_glm() also sets us up nicely for stan_glmer(), which lets us do the same thing with variance components. That's what we'll be doing here. We are fitting a model in which each offspring's phenotype is drawn from a normal distribution with a mean that depends on pat_ID.
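Written out (notation mine; $$\sigma^2_{hs}$$ and $$\sigma_w$$ correspond to the variance components reported in the summary below), this is a standard random-intercept model for the phenotype of offspring $$j$$ of sire $$i$$:

$$y_{ij} = \mu + s_i + \epsilon_{ij}, \qquad s_i \sim N(0, \sigma^2_{hs}), \qquad \epsilon_{ij} \sim N(0, \sigma^2_w)$$

so each sire contributes a family-level deviation $$s_i$$ shared by all of his offspring.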

```r
off_h <- stan_glmer(off_p ~ (1|pat_ID),
                    data = dat,
                    refresh = 0)
summary(off_h, digits = 3,
        pars = c("sigma", "Sigma[pat_ID:(Intercept),(Intercept)]"))
```

```
Model Info:
 function:     stan_glmer
 family:       gaussian [identity]
 formula:      off_p ~ (1 | pat_ID)
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1000
 groups:       pat_ID (100)

Estimates:
                                      mean   sd    10%   50%   90%
sigma                                 3.891  0.090 3.775 3.891 4.006
Sigma[pat_ID:(Intercept),(Intercept)] 5.531  1.027 4.310 5.429 6.907

MCMC diagnostics
                                      mcse  Rhat  n_eff
sigma                                 0.001 1.000 4085
Sigma[pat_ID:(Intercept),(Intercept)] 0.030 1.001 1142

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
```

sigma is the standard deviation within a sibship, $$\sigma_w$$, and Sigma[pat_ID:(Intercept),(Intercept)] is the variance among half-sib families, $$\sigma^2_{hs}$$. The variance among half-sib families reflects the genetic variance among sires, since offspring within a family share only their father. There are, however, 10 offspring in each half-sib family. Thus, we can estimate the broad-sense heritability with this little function.

```r
heritability <- function(sigma_w, sigma_hs, n_off) {
  h_2 <- sigma_hs * n_off / (sigma_hs * n_off + sigma_w^2)
  return(h_2)
}
```
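Plugging the posterior means from the summary above into the function gives a quick point check (the full posterior, used for the histogram below, is what we actually report). This assumes heritability() as just defined:

```r
## point check with the posterior means sigma = 3.891 and Sigma = 5.531
round(heritability(sigma_w = 3.891, sigma_hs = 5.531, n_off = 10), 3)  # ~0.785
```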

```r
off_h_mat <- as.data.frame(off_h)
h_2 <- heritability(off_h_mat$sigma,
                    off_h_mat[["Sigma[pat_ID:(Intercept),(Intercept)]"]],
                    n_dams)
off_h_mat$h_2 <- h_2
library(ggplot2)
p <- ggplot(off_h_mat, aes(x = h_2)) +
  geom_histogram(bins = 100, alpha = 0.4) +
  geom_vline(xintercept = mean(off_h_mat$h_2),
             color = "red",
             linetype = "dashed") +
  xlab("Heritability") +
  theme_bw()
p
```

As you can see, in this simulation the half-sib estimate, 0.781, is pretty close to what we expect to see.

# Lab 12

As part of her dissertation, Nora Mitchell measured several traits in two closely related species of Protea.

Protea punctata is a large, upright shrub.