This lab exercise will be different from others we’ve had in the course in several ways:
GWAS-results.csv
file that is produced, to Nick by the end
of the day on Friday.You’ll be using a new R
package for this lab exercise.
Installing it is slightly more complicated than usual, because you first
have to install a C/C++ compiler and a Fortran compiler in a place where
R
can find it. Fortunately there’s a really good set of
instructions at https://learnb4ss.github.io/learnB4SS/articles/install-brms.html.
If you follow those instructions, you should be all set.1 We recommend that you
install brms
before lab on
Tuesday so that you can try the example analysis together in lab and
talk about how to interpret the results.
Note: We’ve assigned two people to loci 141:176 for all of the traits. There shouldn’t be noticeable differences in the results, but this gives us a way of checking a few of the results to make sure that they are consistent.
Loci | Mass | PD | TDT |
---|---|---|---|
1:35 | Isabelle | Rowan | Akriti |
36:71 | Emma | Agustin | Matthew |
71:106 | Nora | Kaitlyn | Gaurav |
107:142 | Henry | Cynthia | Nick |
143:178 | Sophia | Fahad | Claire |
Alyssa | Maya | Kyle | |
179:218 | Michelle | Michelle | Sanjay |
McDonald | Neitzey | ||
brms
Friedline et al.2 investigated the role of natural selection in the rapid spread of gypsy moth (Lymantria dispar) across Eastern North America. Their analysis focused on analysis of polymorphisms at 91,468 SNPs identified through double digest restriction-site associated DNA sequencing (ddRADseq). As part of their work, they also collected data on pupal mass (Mass), pupal development time (PD), and total development time (TDT). This allows us to perform a genome-wide association analysis that assesses the relationship, if any, between individual SNP loci and each of the phenotypes. You’ll find a smaller version of the data set on the course web site:
The data set is a subset of the data included in the paper. Specifically, I filtered the data to include only (a) loci that were scored in more than 100 individuals, (b) individuals that had more than 4000 of the remaing SNP loci scored, and (c) loci that were scored in all of the remaining individuals. The resulting data set includes 141 individuals scored at 218 loci. Using these data, answer the following questions:
What loci (if any) are associated with each of the three traits (pupal mass, pupal development time, and total development time)?
IMPORTANT NOTE: In the output, the loci are ordered from the strongest estimated effect (mean) to the weakest, but loci showing a weaker estimated effect may have stronger evidence for them, i.e., the credible interval may not overlap zero. Because of the small sample size, I recommend considering both the 80% and the 95% credible intervals whenn answering these questions.
In the example below you’ll see that X4293 has the largest estimated effect, 0.01457 and that that the 95% credible interval on this effect does not overlap zero.3 This suggests that genotypic differences at X4293 have an effect on pupal mass. Notice that the 80% credible intervals for X2840, X4680, X2841, and X5330 do not overlap zero, providing suggestive evidence that differences at these loci may also affect pupal mass. Estimated effects at remaining loci are smaller and the 80% credible intervals overlap zero indicating that we have very weak evidence that differences at these loci have an effect on pupal mass.4
Is there evidence that any of the SNP loci are associated with variation in more than one of the traits?
Given your answer to #2, would you expect to see a change in pupal development time, total development time, or both if natural selection led to a change in pupal mass? Why or why not?
After reading in the data from gypsymoth.csv
, we extract
the genotype data,5 and use popkin
to estimate the
relatedness of all of the samples from the genotype data. Then we run
the analysis, in this case using Mass as the trait and looking only at
loci 1:10 and sending the relatedness matrix to the function that runs
the analysis. You’ll need to change the second to last line in this
block of code to pick the trait and the range of loci that you’ve been
assigned.
You’ll find GWAS-results.csv
in the directory where you
ran the analysis. The numbers you see in it will match those in the
table that’s displayed after the analysis.6
options(tidyverse.quiet = TRUE)
library(tidyverse)
library(rstan)
library(brms)
library(popkin)
rm(list = ls())
options(mc.cores = parallel::detectCores())
analyze_trait <- function(dat, A, trait, loci, n_samples = 2000, n_chains = 4) {
n_loci <- length(loci)
p_raw <- matrix(nrow = n_loci, ncol = n_chains*n_samples/2)
mean_p <- numeric(n_loci)
mu <- array(dim = c(n_loci, n_samples, nrow(dat)))
for (i in 1:n_loci) {
locus <- colnames(dat)[loci[i] + 5]
cat("Checking locus ", loci[i], " (", locus, ")\n", sep = "")
fit <- brm(paste(trait, locus, sep = " ~ "),
data = dat,
data2 = list(A = A),
family = gaussian(),
iter = n_samples,
refresh = 0)
x <- as.data.frame(fit)
brm_locus <- paste("b_", locus, sep ="")
p_raw[i,] <- x[[brm_locus]]
mean_p[i] <- mean(x[[brm_locus]])
}
loci <- data.frame(locus = colnames(dat)[(1:n_loci) + 5],
p_mean = mean_p,
index = 1:n_loci)
loci <- loci[order(abs(loci$p_mean), decreasing = TRUE), ]
p <- matrix(nrow = n_loci, ncol = n_chains*n_samples/2)
for (i in 1:n_loci) {
p[i, ] <- p_raw[loci$index[i],]
}
rownames(p) <- loci$locus
return(list(loci = loci, p = p))
}
summarize_analysis <- function(results) {
n_report <- nrow(results$loci)
cat("\n\n",
" Mean: ( 2.5%, 10%, 50%, 90%, 97.5%)\n")
dat <- tibble(Locus = character(n_report),
Mean = numeric(n_report),
`2.5%` = numeric(n_report),
`10%` = numeric(n_report),
`50%` = numeric(n_report),
`90%` = numeric(n_report),
`97.5%` = numeric(n_report))
for (i in 1:n_report) {
ci <- quantile(results$p[results$loci$locus[i], ],
c(0.025, 0.1, 0.5, 0.9, 0.975))
output <- sprintf("%6s: %8.5f (%8.5f, %8.5f, %8.5f, %8.5f, %8.5f)\n",
results$loci$locus[i],
results$loci$p_mean[i],
ci[1], ci[2], ci[3], ci[4], ci[5])
cat(output)
dat$Locus[i] <- results$loci$locus[i]
dat$Mean[i] <- results$loci$p_mean[i]
dat$`2.5%`[i] <- ci[1]
dat$`10%`[i] <- ci[2]
dat$`50%`[i] <- ci[3]
dat$`90%`[i] <- ci[4]
dat$`97.5%`[i] <- ci[5]
}
write_csv(dat, "GWAS-results.csv")
}
## read the data
##
dat <- read_csv("gypsymoth.csv",
show_col_types = FALSE)
## get the relatedness matrix
##
genos <- dat %>% select(starts_with("X"))
A <- popkin(t(as.matrix(genos)),
subpops = dat$pops)
## identify the rows with the sample names
##
rownames(A) <- dat$sample
## Change 1:10 to match the range of loci you have been assigned and "Mass" to the
## trait you have been assigned
##
results <- analyze_trait(dat, A, "Mass", 1:10)
Checking locus 1 (X2517)
Checking locus 2 (X2840)
Checking locus 3 (X2841)
Checking locus 4 (X3915)
Checking locus 5 (X4293)
Checking locus 6 (X4612)
Checking locus 7 (X4680)
Checking locus 8 (X4753)
Checking locus 9 (X5330)
Checking locus 10 (X6607)
summarize_analysis(results)
Mean: ( 2.5%, 10%, 50%, 90%, 97.5%)
X4293: 0.01457 ( 0.00214, 0.00634, 0.01448, 0.02285, 0.02717)
X2840: -0.01185 (-0.02509, -0.02049, -0.01192, -0.00296, 0.00161)
X4680: 0.01142 (-0.00161, 0.00298, 0.01166, 0.01950, 0.02412)
X2841: -0.01121 (-0.02437, -0.02011, -0.01117, -0.00223, 0.00274)
X5330: 0.00912 (-0.00321, 0.00119, 0.00916, 0.01702, 0.02135)
X3915: -0.00418 (-0.01686, -0.01267, -0.00416, 0.00435, 0.00910)
X4753: -0.00412 (-0.01670, -0.01247, -0.00401, 0.00413, 0.00858)
X4612: 0.00133 (-0.01521, -0.00994, 0.00109, 0.01268, 0.01868)
X2517: -0.00107 (-0.01492, -0.01003, -0.00112, 0.00813, 0.01327)
X6607: 0.00019 (-0.01459, -0.00932, 0.00016, 0.00978, 0.01486)
If you’re interested in seeing how I prepared the data from the version on Dryad. Here’s what I did.
First, I downloaded the data from Dryad: https://doi.org/10.5061/dryad.8ts2867.
Then I renamed the phenotype filename so that it begins with
gm
. The rest is pretty straightforward.
library(tidyverse)
library(popkin)
rm(list = ls())
genos <- read_tsv("gm_genotypes_012_final_5122017.txt",
col_names = FALSE,
na = "-1",
show_col_types = FALSE)
phenos <- read_csv("gm_phenotypes.csv",
show_col_types = FALSE)
## keep only loci scored in more than 100 individuals
##
n_scored <- apply(genos, 2, sum, na.rm = TRUE)
genos <- genos[, n_scored > 100]
## keep only individuals scored at more than 4000 loci
##
n_scored <- apply(genos, 1, sum, na.rm = TRUE)
genos <- genos[n_scored > 4000, ]
phenos <- phenos[n_scored > 4000, ]
## and exclude any remaining loci where one or more individuals isn't scored
##
genos <- genos[, !is.na(apply(genos, 2, sum))]
## strip off the individual ids to identify populations
##
pops <- gsub("_.*", "", phenos$sample)
## bind the poopulation, phenotype, and genotype information together and save it in a CSV file
##
dat <- cbind(pops, phenos, genos)
write_csv(dat,
file = "gypsymoth.csv")
You’ll find instructions for macOS, Windows, and Linux at that link.↩︎
Friedline, C.J., T.M. Faske, B.M. Lind, E.M. Hobson, D. Parry, R.J. Dyer, D.M. Johnson, L.M. Thompson, K.L. Grayson, and A.J. Eckert. 2019. Evolutionary genomics of gypsy moth populations sampled along a latitudinal gradient. Molecular Ecology 28:2206-2223 doi: https://doi.org/10.1111/mec.15069↩︎
You’ll get slightly slightly different numbers than the
ones below, because brm
uses Monte-Carlo sampling to
estimate the posterior distribution. The differences shouldn’t affect
the interpretation.↩︎
It’s important to be very careful in your conclusion here. We do not have evidence that differences at these loci have no effect. With a larger sample of individuals we might have detected an effect at any of these loci.↩︎
All of the columns with genotype information begin with “X”, and none of the other columns do.↩︎
The results that are displayed are rounded to 5 digits.
The numbers in the GWAS-results.csv
include all digits
found in the results.↩︎