We will simulate some data to illustrate the relationship between heritability estimates derived from a regression of offspring phenotype on mid-parent value and those derived from an analysis of half-sibs. To do the simulation, we’ll assume that a large number of loci affect the trait and that each locus has a small effect. With this assumption, we can approximate the genotype of each individual as a random variable drawn from a normal distribution. The R code below implements the following simple algorithm:
1. Select a genotype for the sire at random from a normal distribution with mean \(\mu\) and variance \(\sigma^2_g\). Call that genotype \(x_p\).
2. Select a genotype for the dam at random from a normal distribution with the same mean and variance. Call that genotype \(x_m\).
3. Calculate the mid-parent genotypic value as \(x_{mp} = \frac{x_m + x_p}{2}\).
4. Set the offspring genotype to \(x_{mp}\).
5. Repeat steps 2–4 `n_dams` times.
6. Repeat steps 1–5 `n_sires` times.
7. Set phenotypes for the maternal parent, the paternal parent, and the offspring by drawing them from normal distributions with the appropriate means and variance \(\sigma^2_e\).
8. Record the paternal ID, the genotype of each parent, the mid-parent genotypic value, the phenotype of each parent, the mid-parent phenotypic value, and the offspring phenotype.
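The normal approximation invoked above can be checked directly before we simulate. Here is a quick base-R sketch (the locus count and allele frequency are arbitrary choices for illustration) that sums many small additive locus effects and compares the resulting mean and variance to their binomial expectations:

```r
## Sum many small additive locus effects and check that the result behaves
## like a normal distribution with the expected mean and variance.
## The locus number and allele frequency here are made up for illustration.
set.seed(1234)
n_ind  <- 5000
n_loci <- 1000
p      <- 0.5   # allele frequency, the same at every locus
## allele counts (0, 1, or 2 copies) for every individual at every locus
counts <- matrix(rbinom(n_ind * n_loci, size = 2, prob = p),
                 nrow = n_ind)
genotype <- rowSums(counts)   # additive genotypic value
## binomial expectations: mean = 2 * n_loci * p = 1000,
## variance = 2 * n_loci * p * (1 - p) = 500
c(mean = mean(genotype), var = var(genotype))
```

A histogram of `genotype` looks essentially normal, which is what justifies drawing genotypes from a normal distribution in the simulation below.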
NOTE: I am using `set.seed(1234)` to ensure that if you run this code on your own, you’ll get the same results as I do. If you erase or comment out that line and run the code, your results will differ slightly from mine.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.4 ✓ dplyr 1.0.7
✓ tidyr 1.1.3 ✓ stringr 1.4.0
✓ readr 2.0.1 ✓ forcats 0.5.1
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
rm(list = ls())
set.seed(1234)
n_sires <- 100
n_dams <- 10
mu <- 50
sigma_2_g <- 25
sigma_2_e <- 9
dat <- tibble(pat_ID = NA,
              ind_ID = NA,
              mat_g = NA,
              pat_g = NA,
              mid_g = NA,
              mat_p = NA,
              pat_p = NA,
              mid_p = NA,
              off_p = NA)
ct <- 0
for (i in 1:n_sires) {
  ## one genotype and one phenotype per sire
  patg <- rnorm(1, mu, sqrt(sigma_2_g))
  patp <- rnorm(1, patg, sqrt(sigma_2_e))
  for (j in 1:n_dams) {
    ct <- ct + 1
    ## each dam mates once with this sire and produces one offspring
    matg <- rnorm(1, mu, sqrt(sigma_2_g))
    matp <- rnorm(1, matg, sqrt(sigma_2_e))
    midg <- (matg + patg)/2
    offp <- rnorm(1, midg, sqrt(sigma_2_e))
    dat <- add_row(dat,
                   pat_ID = i,
                   ind_ID = ct,
                   mat_g = matg,
                   pat_g = patg,
                   mid_g = midg,
                   mat_p = matp,
                   pat_p = patp,
                   mid_p = (matp + patp)/2,
                   off_p = offp)
  }
}
dat <- dat %>%
filter(!is.na(pat_ID)) %>%
mutate(pat_ID = factor(pat_ID),
ind_ID = factor(ind_ID))
pat_g_var <- var(unique(dat[, c("pat_ID", "pat_g")])$pat_g)
pat_p_var <- var(unique(dat[, c("pat_ID", "pat_p")])$pat_p)
variances <- tibble(Parent = c("Sire", "Dam"),
                    Genotypic = round(c(pat_g_var, var(dat$mat_g)), 3),
                    Environmental = round(c(pat_p_var - pat_g_var,
                                            var(dat$mat_p) - var(dat$mat_g)), 3),
                    Phenotypic = round(c(pat_p_var, var(dat$mat_p)), 3))
knitr::kable(variances)
Parent | Genotypic | Environmental | Phenotypic |
---|---|---|---|
Sire | 26.680 | 5.934 | 32.614 |
Dam | 23.884 | 9.868 | 33.752 |
As you can see the simulated genotypic and environmental variances are pretty close to what we specified, i.e., \(\sigma^2_g = 25\) and \(\sigma^2_e = 9\).1 You can also see that the phenotypic variance is pretty close to \(\sigma^2_g + \sigma^2_e\), as we expect.
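That additivity is just the rule that variances of independent effects add: \(\mathrm{Var}(g + e) = \mathrm{Var}(g) + \mathrm{Var}(e)\) when \(g\) and \(e\) are independent. A quick self-contained check with the same parameter values (separate from `dat`):

```r
## When genotype and environment are independent,
## Var(phenotype) = Var(genotype) + Var(environment).
set.seed(1234)
g <- rnorm(100000, 50, sqrt(25))  # genotypic values, sigma^2_g = 25
e <- rnorm(100000, 0, sqrt(9))    # environmental deviations, sigma^2_e = 9
p <- g + e                        # phenotypes
round(c(var_g = var(g), var_e = var(e), var_p = var(p)), 2)
## var_p should come out close to var_g + var_e = 34
```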
If you’re familiar with regression in R, you know about `lm()` and `glm()`. I’m going to use a similar function from `rstanarm`, namely `stan_glm()`, to run a Bayesian linear regression. In addition to providing an estimate of the slope of the regression of offspring on mid-parent value, it provides a sense of how precise that estimate is in the form of 90 percent credible intervals.2
library(rstanarm)
Loading required package: Rcpp
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
This is rstanarm version 2.21.1
- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores())
## This line allows rstanarm to run several chains at the same time
## instead of running them sequentially
options(mc.cores = parallel::detectCores())
off_r <- stan_glm(off_p ~ mid_p,
                  data = dat,
                  refresh = 0)
summary(off_r, digits = 3)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: off_p ~ mid_p
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1000
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 12.970 1.323 11.273 12.986 14.608
mid_p 0.740 0.027 0.708 0.740 0.775
sigma 3.425 0.076 3.328 3.425 3.524
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 49.663 0.149 49.474 49.662 49.853
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.020 1.000 4331
mid_p 0.000 1.000 4335
sigma 0.001 1.000 3749
mean_PPD 0.002 1.000 3919
log-posterior 0.028 1.002 1827
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We see that the slope of the regression, the `mid_p` line, is 0.740 (0.708, 0.775), while we expect to see \(\frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e} = \frac{25}{34} = 0.735\). That’s not bad at all. Now let’s look at the half-sib analysis.
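If you’d rather have a quick frequentist check, `lm()` recovers essentially the same slope. Here is a self-contained sketch (my own variable names, one offspring per mid-parent pair, the same parameter values as above) that re-simulates the data and regresses offspring phenotype on mid-parent phenotype:

```r
## The slope of offspring phenotype on mid-parent phenotype estimates
## h^2 = sigma^2_g / (sigma^2_g + sigma^2_e) = 25/34, about 0.735.
set.seed(1234)
n <- 1000
mu <- 50; sigma_2_g <- 25; sigma_2_e <- 9
pat_g <- rnorm(n, mu, sqrt(sigma_2_g))
mat_g <- rnorm(n, mu, sqrt(sigma_2_g))
mid_g <- (pat_g + mat_g) / 2
## mid-parent phenotype: average of the two parents' phenotypes
mid_p <- (rnorm(n, pat_g, sqrt(sigma_2_e)) +
          rnorm(n, mat_g, sqrt(sigma_2_e))) / 2
off_p <- rnorm(n, mid_g, sqrt(sigma_2_e))
fit <- lm(off_p ~ mid_p)
coef(fit)["mid_p"]   # should land near 0.735
```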
Besides providing both point estimates and estimates of uncertainty, `stan_glm()` sets us up nicely for using `stan_glmer()`, which lets us do the same thing with variance components. That’s what we’ll be doing here. We fit a model in which each offspring’s phenotype is drawn from a normal distribution with a mean that depends on `pat_ID`.3
off_h <- stan_glmer(off_p ~ (1|pat_ID),
                    data = dat,
                    refresh = 0)
summary(off_h, digits = 3,
        pars = c("sigma", "Sigma[pat_ID:(Intercept),(Intercept)]"))
Model Info:
function: stan_glmer
family: gaussian [identity]
formula: off_p ~ (1 | pat_ID)
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1000
groups: pat_ID (100)
Estimates:
mean sd 10% 50% 90%
sigma 3.891 0.090 3.775 3.891 4.006
Sigma[pat_ID:(Intercept),(Intercept)] 5.531 1.027 4.310 5.429 6.907
MCMC diagnostics
mcse Rhat n_eff
sigma 0.001 1.000 4085
Sigma[pat_ID:(Intercept),(Intercept)] 0.030 1.001 1142
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
`sigma` is the standard deviation within a sibship, \(\sigma_w\), and `Sigma[pat_ID:(Intercept),(Intercept)]` is the variance among half-sib families, \(\sigma^2_{hs}\). Because the families are defined by `pat_ID`, the variance among half-sib families is the genetic variance among sires. There are, however, 10 offspring in each half-sib family. Thus, we can estimate the broad-sense heritability with this little function.4
heritability <- function(sigma_w, sigma_hs, n_off) {
  h_2 <- sigma_hs * n_off / (sigma_hs * n_off + sigma_w^2)
  return(h_2)
}
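As a quick sanity check, plugging the posterior means from the summary above (`sigma` ≈ 3.891, among-family variance ≈ 5.531, and 10 offspring per family) into the function reproduces the point estimate. The function definition is repeated here so the chunk runs on its own:

```r
## Half-sib heritability computed from the posterior means reported above.
heritability <- function(sigma_w, sigma_hs, n_off) {
  ## sigma_w:  within-family standard deviation (sigma in the summary)
  ## sigma_hs: among-family variance (Sigma[pat_ID:(Intercept),(Intercept)])
  ## n_off:    number of offspring per half-sib family
  sigma_hs * n_off / (sigma_hs * n_off + sigma_w^2)
}
round(heritability(sigma_w = 3.891, sigma_hs = 5.531, n_off = 10), 3)
## close to the posterior mean of the per-draw estimates computed below
```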
off_h_mat <- as.data.frame(off_h)
h_2 <- heritability(off_h_mat$sigma,
                    off_h_mat$`Sigma[pat_ID:(Intercept),(Intercept)]`,
                    n_dams)
off_h_mat$h_2 <- h_2
library(ggplot2)
p <- ggplot(off_h_mat, aes(x = h_2)) +
  geom_histogram(bins = 100, alpha = 0.4) +
  geom_vline(xintercept = mean(off_h_mat$h_2),
             color = "red",
             linetype = "dashed") +
  xlab("Heritability") +
  theme_bw()
p
As you can see, in this simulation the half-sib estimate, 0.781, is pretty close to what we expect to see.5
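The histogram gives a point estimate; if you also want a 90 percent credible interval for \(h^2\), `quantile()` applied to the posterior draws provides one. A sketch with simulated draws standing in for `off_h_mat$h_2` so that it runs on its own (the real interval comes from your own posterior):

```r
## A 90% credible interval is just the 5th and 95th percentiles of the
## posterior draws. Fake draws are used here so the chunk is self-contained;
## substitute off_h_mat$h_2 in the real analysis.
set.seed(1234)
h_2_draws <- rnorm(4000, mean = 0.78, sd = 0.05)   # stand-in posterior draws
quantile(h_2_draws, probs = c(0.05, 0.5, 0.95))
```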
As part of her dissertation, Nora Mitchell measured several traits in two closely related species of Protea.
Protea punctata is a large, upright shrub.