We will simulate some data to illustrate the relationship between
heritability estimates derived from a regression of offspring phenotype
on mid-parent value and those derived from analysis of half-sibs. To do
the simulation, we’ll assume that there are a large number of loci that
affect the trait and that each locus has a small effect. With this
assumption, we can approximate the genotype of each individual as a
random variable drawn from a normal distribution. The R
code below implements the folloowing algorithm:
Select a genotype for the sire at random from a normal distribution with mean \(mu\) and variance \(\sigma^2_g\). Call that genotype \(x_p\).
Select a genotype for the dam at random from a normal distribution with the same mean and variance. Call that genotype \(x_m\).
Calculate the mid-parent genotypic value as \(x_{mp} = \frac{x_m + x_p}{2}\).
Set the offspring genotype to \(x_{mp}\).
Repeat steps #2 and #3 n_dams
times.
Repeat steps #1-#5 n_sires
times.
Set phenotypes for the maternal parent, the paternal parent, and the offspring by drawing them from normal distributions with the appropriate means and varience \(\sigma^2_e\)
Record the paternal ID, the genotype of each parent, the mid-parent genotypic value, the phenotype of each parent, the mid-parent phenotypic value, and the offspring phenotype.
NOTE: I am using set.seed(1234)
to ensure that if you
run this code on your own, you’ll get the same results as I do. If you
erase or comment out that line and rund the code, your results will
differ slightly from mine.
options(tidyverse.quiet = TRUE)
library(tidyverse)
rm(list = ls())
set.seed(1234)
n_sires <- 100
n_dams <- 10
mu <- 50
sigma_2_g <- 25
sigma_2_e <- 9
dat <- tibble(pat_ID = NA,
ind_ID = NA,
mat_g = NA,
pat_g = NA,
mid_g = NA,
mat_p = NA,
pat_p = NA,
mid_p = NA,
off_p = NA)
ct <- 0
for (i in 1:n_sires) {
patg <- rnorm(1, mu, sqrt(sigma_2_g))
patp <- rnorm(1, patg, sqrt(sigma_2_e))
for (j in 1:n_dams) {
ct <- ct + 1
matg <- rnorm(1, mu, sqrt(sigma_2_g))
matp <- rnorm(1, matg, sqrt(sigma_2_e))
midg <- (matg + patg)/2
offp <- rnorm(1, midg, sqrt(sigma_2_e))
dat <- add_row(dat,
pat_ID = i,
ind_ID = ct,
mat_g = matg,
pat_g = patg,
mid_g = midg,
mat_p = matp,
pat_p = patp,
mid_p = (matp + patp)/2,
off_p = offp)
}
}
dat <- dat %>%
filter(!is.na(pat_ID)) %>%
mutate(pat_ID = factor(pat_ID),
ind_ID = factor(ind_ID))
pat_g_var <- var(unique(dat[, c("pat_ID", "pat_g")])$pat_g)
pat_p_var <- var(unique(dat[, c("pat_ID", "pat_p")])$pat_p)
variances <- tibble(Parent = c("Sire", "Dam"),
Genotypic = round(c(pat_g_var, var(dat$mat_g)), 3),
Environmental = round(c(pat_p_var - pat_g_var,
var(dat$mat_p) -
var(dat$mat_g)), 3),
Phenotypic = round(c(pat_p_var, var(dat$mat_p)),
3))
knitr::kable(variances)
Parent | Genotypic | Environmental | Phenotypic |
---|---|---|---|
Sire | 26.680 | 5.934 | 32.614 |
Dam | 23.884 | 9.868 | 33.752 |
As you can see the simulated genotypic and environmental variances are reasonably close to what we specified, i.e., \(\sigma^2_g = 25\) and \(\sigma^2_e = 9\).1 You can also see that the phenotypic variance is pretty close to \(\sigma^2_g + \sigma^2_e\), as we expect.
If you’re familiar with regression in R
, you know about
lm()
and glm()
. I’m going to use a similar
function from rstanarm
, namely stan_glm()
to
run a Bayesian linear regression. In addition to providing an estimate
of the slope of the regression of offspring on mid-parent value, it
provides a sense of how precise that estimate is in the form of 90
percent credible intervals.2
library(rstanarm)
## This line allows rstanarm to run several chains at the same time
## instead of running them sequentially
options(mc.cores = parallel::detectCores())
off_r <- stan_glm(off_p ~ mid_p,
data = dat,
refresh = 0)
summary(off_r, digits = 3)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: off_p ~ mid_p
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1000
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 12.970 1.323 11.273 12.986 14.608
mid_p 0.740 0.027 0.708 0.740 0.775
sigma 3.425 0.076 3.328 3.425 3.524
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 49.663 0.149 49.474 49.662 49.853
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.020 1.000 4331
mid_p 0.000 1.000 4335
sigma 0.001 1.000 3749
mean_PPD 0.002 1.000 3919
log-posterior 0.028 1.002 1827
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We see that the slope of the regression, the mid_p
line,
is 0.740 (0.708, 0.775), while we expect to see \(\frac{\sigma^2_g}{\sigma^2_g + \sigma^2_p} =
\frac{25}{34} = 0.735\). That’s not bad at all. Now let’s look at
the half sib analysis.
Besides the fact that stan_glm()
provides both point
estimates and estimates of uncertainty it also sets us up nicely for
using stan_glmer()
, which sets us up for doing the same
thing with variance components. That’s what we’ll be doing here. We are
fitting a model in which each offspring’s phenotype is drawn from a
normal distribution with a mean that depends on pat_ID
.3
off_h <- stan_glmer(off_p ~ (1|pat_ID),
data = dat,
refresh = 0)
summary(off_h, digits = 3,
pars = c("sigma", "Sigma[pat_ID:(Intercept),(Intercept)]"))
Model Info:
function: stan_glmer
family: gaussian [identity]
formula: off_p ~ (1 | pat_ID)
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1000
groups: pat_ID (100)
Estimates:
mean sd 10% 50% 90%
sigma 3.891 0.090 3.775 3.891 4.006
Sigma[pat_ID:(Intercept),(Intercept)] 5.531 1.027 4.310 5.429 6.907
MCMC diagnostics
mcse Rhat n_eff
sigma 0.001 1.000 4085
Sigma[pat_ID:(Intercept),(Intercept)] 0.030 1.001 1142
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
sigma
is the standard deviation within a sibship, \(\sigma_w\), and
Sigma[pat_ID:(Intercept),(Intercept)]
is the variance among
half-sib families, \(\sigma^2_{hs}\).
The variance among half-sib families is the variance among mothers,
which is the genetic variance. There are, however, 10 offspring in each
half-sib family. Thus, we can estimate the broad-sense heritability with
this little function.4
heritability <- function(sigma_w, sigma_hs, n_off) {
h_2 = sigma_hs*n_off/(sigma_hs*n_off + sigma_w^2)
return(h_2)
}
off_h_mat <- as.data.frame(off_h)
h_2 <- heritability(off_h_mat$sigma,
off_h_mat$`Sigma[pat_ID:(Intercept),(Intercept)]`,
n_dams)
off_h_mat$h_2 <- h_2
library(ggplot2)
p <- ggplot(off_h_mat, aes(x = h_2)) +
geom_histogram(bins = 100, alpha = 0.4) +
geom_vline(xintercept = mean(off_h_mat$h_2),
color = "red",
linetype = "dashed") +
xlab("Heritability") +
theme_bw()
p
As you can see, in this simulation the half-sib estimate is pretty close to what we expect to see, 0.781.5
As part of her dissertation, Nora Mitchell measured several traits in two closely related species of Protea
Protea punctata is a large, upright shrub.