--- title: "Exploring the interaction of drift, mutation, and migration" output: html_notebook: toc: yes toc_float: true --- ## A note about these simulation exercises As I've tried to make clear in lecture, drift is complicated. As I'm sure you've noticed, it can also be confusing. The purpose of the simulations last week, this week, and next week is to let you explore the complications a bit more by presenting more realistic models of the drift process and comparing them to the analytical expectations we've derived (or I've presented in class).^[Notice that I said ***more*** realistic. There are still important ways in which these models are not entirely realistic. As the statistician, George Box, once put it "All models are wrong, but some are useful."] ## Background Although it is very simple, Sewall Wright's infinite island model of migration is a useful way to explore how drift, migration, and mutation interact to influence the genetic structure of populations that are divided into subpopulations. In this model we assume that 1. Every allele introduced into a population differs from any of the resident alleles. 2. Every population receives the same fraction, $m$, of migrants.^[Notice that $m$ is the *backwards* migration rate, i.e., the fraction of a population composed of migrants that come from another population, not the fraction of residents that migrate to another population.] Given those assumptions and assuming an infinite alleles model of mutation, we showed that^[Actually, we didn't show what happens with *both* migration and mutation, but from what we saw you can probably guess where this came from.] $$ F_{ST} = \frac{1}{4N_e(m + \mu) + 1} \quad . $$ A more realistic model of migration is the ***finite*** island model of migration.^[Latter, B.D.H. 1973. The island model of population differentiation: a general solution. *Genetics* 73:147-157 doi: [10.1093/genetics/73.1.147](https://doi.org/10.1093/genetics/73.1.147)] It's still very simple, but it adds one thing that makes it a bit more realistic: > It assumes that there are a finite number of populations exchanging migrants with one another. Every population still receives the same fraction of migrants, $m$, but we now also specify that the fraction of migrants coming from any one population into this one is $m/(k - 1)$, where $k$ is the number of populations. The formula for $F_{ST}$ under the finite island model is a lot more complicated than the one above. If you're interested, you can look in the `R` code below.^[The formula for $F_{ST}$ comes from Fu et al., _Theoretical Population Biology_ 63:231-243; 2003.] Let's compare the predictions for a few values of $m$ and $k$ with $N_e = 100$ and $\mu = 10^{-3}$. ### Comparing the predictions ```{r} rm(list = ls()) infinite_island <- function(n_e, m, mu) { return(1/(4*n_e*(m + mu) + 1)) } lambda <- function(mu, m, k) { m_prime <- m*k/(k - 1) l <- (1/2)*((1 - mu)^(-2)*(1- m_prime)^(-2) - 1) return(l) } gamma <- function(mu, m, k) { m_prime <- m*k/(k - 1) g <- (1 + m_prime*(2 - m_prime)/(k*mu*(2 - mu)*(1 - m_prime)^2))^(-1) return(g) } finite_island <- function(n_e, m, mu, n_islands) { l <- lambda(mu, m, n_islands) g <- gamma(mu, m, n_islands) h_s <- 4*n_e*l*g/(4*n_e*l*g + 1) h_b <- h_s*(1 + 4*n_e*l)/(4*n_e*l) h_t <- ((n_islands - 1)/n_islands)*h_b + (1/n_islands)*h_s f_st <- (h_t - h_s)/h_t return(f_st) } n_e <- 100 mu <- 10^(-4) k_vals <- rep(c(5, 25, 250, 2500), 2) m_vals <- c(rep(0.01, 4), rep(0.05, 4)) dat <- data.frame(k = k_vals, m = m_vals, Infinite = rep(NA, length(k_vals)), Finite = rep(NA, length(m_vals)), Ratio = rep(NA, length(k_vals))) for (i in 1:nrow(dat)) { dat$Infinite[i] <- infinite_island(n_e, dat$m[i], mu) dat$Finite[i] <- finite_island(n_e, dat$m[i], mu, dat$k[i]) dat$Ratio[i] <- dat$Infinite[i]/dat$Finite[i] } round(dat, 3) ``` ### Comparing predictions to simulations From the table above, it's apparent that predictions of the infinite island and finite island models are very different when the number of islands is small. Let's run a simulation of the finite island model and see how well the simulation results match up to each prediction. ```{r} library(tidyverse) library(hierfstat) initialize <- function(k, n_loci) { p <- array(dim = c(n_loci, k, 2)) for (j in 1:n_loci) { for (i in 1:k) { p[j, i, 1] <- runif(n = 1, min = 0, max = 1) p[j, i, 2] <- 1 - p[j, i, 1] } } return(p) } make_symmetric_matrix <- function(m, k) { M <- diag(x = 1 - m, nrow = k) for (i in 1:(k - 1)) { for (j in (i+1):k) { M[i, j] <- m/(k-1) M[j, i] <- M[i, j] } } return(M) } sample_two_alleles <- function(p) { allele_1 <- ((runif(1) < p) + 1)*10 allele_2 <- (runif(1) < p) + 1 return(allele_1 + allele_2) } make_sample <- function(p, n_sample) { n_pops <- dim(p)[2] n_loci <- dim(p)[1] names <- paste("Locus", seq(1:n_loci), sep = "") pops <- numeric(n_pops*n_sample) ct <- 1 for (k in 1:n_pops) { population <- paste("Population", k, sep = "") for (i in 1:n_sample) { pops[ct] <- population ct <- ct + 1 } } locus_data <- matrix(nrow = n_pops*n_sample, ncol = n_loci) for (i in 1:n_loci) { ct <- 1 for (k in 1:n_pops) { for (n in 1:n_sample) { locus_data[ct, i] <- sample_two_alleles(p[i, k, 1]) ct <- ct + 1 } } } df <- data.frame(locus_data) colnames(df) <- paste("Locus", seq(1:n_loci), sep = "") df$Population <- pops df <- relocate(df, Population) return(df) } simulate <- function(n_e, m, mu, n_islands, n_loci, n_gen, n_sample) { p <- initialize(n_islands, n_loci) M <- make_symmetric_matrix(m, n_islands) V <- make_symmetric_matrix(mu, 2) g_st <- numeric(n_loci) p_star <- array(dim = c(n_loci, n_islands, 2)) for (i in 1:n_gen) { for (k in 1:n_loci) { p_star[k, , ] <- M %*% p[k, , ] %*% V for (j in 1:n_islands) { p[k, j, ] <- rmultinom(1, 2*n_e, p_star[k, j, ])/(2*n_e) } mu_p <- mean(p[k, , 1]) g_st[k] <- ((n_islands - 1)/n_islands)*var(p[k, , 1])/(mu_p*(1 - mu_p)) } } df <- make_sample(p, n_sample) return(wc(df)$FST) } run_simulation <- function(n_e, m, mu, n_islands, n_loci, n_gen, n_sample, n_repetitions) { f_st <- numeric(n_repetitions) for (i in 1:n_repetitions) { if (n_repetitions > 50) { cat(".", sep="") if ((i %% 50) == 0) { cat(i, "\n", sep ="") } } f_st[i] <-simulate(n_e, m, mu, n_islands, n_loci, n_gen, n_sample) } return(f_st) } plot_simulation <- function(f_st, n_e, m, mu, n_islands) { df <- data.frame(F_st = f_st) p <- ggplot(df, aes(x = F_st)) + geom_histogram(bins = 20, alpha = 0.4) + geom_vline(xintercept = infinite_island(n_e, m, mu), linetype = "dashed", color = "red") + geom_vline(xintercept = finite_island(n_e, m, mu, n_islands), linetype = "dashed", color = "blue") + ggtitle(paste("N_e = ", n_e, ", m = ", m, ", mu = ", mu, ", n_islands = ", n_islands, sep = "")) + theme_bw() return(p) } f_st <- run_simulation(100, 0.05, 1e-4, 5, 10, 1000, 25, 1000) plot_simulation(f_st, 100, 0.05, 1e-4, 5) ``` The dashed red line is the prediction from the infinite island model. The dashed blue line is the prediction from the finite island model. ## Exercise 6 The last plot shows that neither the infinite island nor the finite island prediction captures the results of this particular simulation very well, although the finite island prediction is reasonably close to the average outcome from the simulations. Even so, it only predicts on average and there's a lot of variation from one replicate to the next. This outcome reflects the "evolutionary" or "genetic" sampling we discussed when comparing Nei's $G_{ST}$ with Weir and Cockerham's $\theta$. This week's lab exercise will explore that variability a bit further and explore conditions that affect the amount of variability and how useful the infinite island approximation is. To do so, you'll want to use the following code.^[Notice that in order to run this code, you'll also need to run the code above so that the relevant functions are loaded in `R`'s memory.] You'll be changing * `n_loci` - The number of loci included in the genetic sample and * `n_gen` - The number of generations for which to run the simulation. In addition to `n_loci` and `n_gen` there are two more parameters * `n_sample` - The number of individuals sampled in each population (and used in making an estimate of $F_{ST}$) and * `n_repetitions` - The number of times that results from a particular combination of `N_e`, `m`, `mu`, and `n_islands` will be simulated for a particular choice of `n_loci`, `n_gen`, and `n_sample`. For each choice of these four parameters, the simulation will run through all combinations of `N_e = (25, 200, 250, 500)`, `m = (0.05, 0.01, 0.005, 0.001)`, `mu = (0.001, 0.0001)`, and `n_islands = (10, 25, 50)`. ```{r} ## These are the parameters to change ## n_loci <- 10 n_gen <- 1000 ## Leave these parameters unchanged ## n_sample <- 25 n_repetitions <- 10 df <- data.frame(N_e = NA, m = NA, mu = NA, n_pops = NA, Observed = NA, Infinite = NA, Finite = NA) for (n_e in c(25, 100, 250, 500)) { for (m in c(0.05, 0.01, 0.005, 0.001)) { for (mu in c(0.001, 1e-4)) { for (n_pops in c(10, 25, 50)) { cat("N_e = ", n_e, ", m = ", m, ", mu = ", mu, "n_pops = ", n_pops, "\n" ) f_st <- run_simulation(n_e, m, mu, n_pops, n_loci, n_gen, n_sample, n_repetitions) df <- add_row(df, N_e = rep(n_e, n_repetitions), m = rep(m, n_repetitions), mu = rep(mu, n_repetitions), n_pops = rep(n_pops, n_repetitions), Observed = f_st, Infinite = infinite_island(n_e, m, mu), Finite = finite_island(n_e, m, mu, n_islands)) } } } } df <- filter(df, !is.na(N_e)) ``` Running one set of simulations for `n_loci` = 10 and `n_gen` = 1000 takes close to half an hour on my MacBook. If you increase the number of loci or the number of generations by a factor of two, it will take about twice as long for the simulations to finish. You'll get a printout periodically that lets you know what the current set of parameters being simulated is, and you'll see the set of parameters that's already been completed. Because running the simulation for one set of parameters takes so long, you can't possibly explore all of the combinations of those parameters you might be interested in, but I'd like you to explore at least four combinations of `n_loci` and `n_gen`. Use the results from those simulations to address the following questions:^[Keep in mind that you'll only be able to answer these questions for the set of simulations that you complete. Nick will be grading your responses on the simulation results that you show, not on what an exhaustive examination of all possibilities might be.] 1. Are there conditions under which the predictions of the infinite island and finite island model are similar enough that we don't need to worry about the added mathematical complexity of the finite island model? I suggest using the following combination of parameters: - `n_loci = 5`, `n_gen = 1000` - `n_loci = 15`, `n_gen = 1000` - `n_loci = 5`, `n_gen = 100` - `n_loci = 25`, `n_gen = 100` 2. How close is the relationship between observed estimates of $F_{ST}$ from the simulations and those predicted from the infinite island model? from the finite island model? 3. Under what conditions, if any, is the observed variation in $F_{ST}$ relatively small? ### A few hints The easiest way to run different combinations of parameters is to copy the code block above and replace `n_loci` and `n_gen` with new choices. To keep track of what you're doing, I recommend doing that for one, examining the results as described below, and moving on to a new set of parameter (with a new code block) after that. You can plot the observed $F_{ST}$ versus the $F_{ST}$ predicted from either model using the code below.^[The diagonal red dashed line is the 1:1 line. Every point should fall on this line if observed exactly matches the prediction.] In addition to showing how well the simulations match the predictions, you can get a feel for how much variability there is among the simulations by seeing how scattered the points are. ```{r} ## for the infinite island model ## p <- ggplot(df, aes(x = Infinite, y = Observed)) + geom_point() + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + theme_bw() + ggtitle("Infinite island model") p ## for the finite island model ## p <- ggplot(df, aes(x = Finite, y = Observed)) + geom_point() + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + theme_bw() + ggtitle("Finite island model") p ``` If you'd like a more quantitative way to compare the models,^[And I'd encourage you to take a more quantitative approach] a simple way is to run linear regressions of `Observed` on `Infinite` and `Expected` like this: ```{r} summary(lm(Observed ~ Infinite, data = df)) summary(lm(Observed ~ Finite, data = df)) ``` If predictions were perfect the `(Intercept)` term in the regression would be 0, and the other term (`Infinite` in the first regression and `Finite` in the second) would be 1. Given the variability of the simulations, you won't see numbers that are exactly 0 or 1. You can judge whether they are "close" by noticing whether the number in the `Estimate` column is more than two standard errors away from what you predict. The `residual standard error` below the table measures how much error there is around the regression, i.e., it gives you a way to compare quantitatively how much variation there is in $F_{ST}$. ## What follows isn't part of the lab exercise, but you might find it interesting to explore if you have time Wright's island model, whether the infinite island model or the finite island model, is very unrealistic in one important respect. Populations close to one another are no more likely to exchange migrants than those that are far apart. Another model that Wright introduced is more realistic, the one-dimensional stepping stone model. In this model we assume that only adjacent populations can exchange migrants. It's still pretty unrealistic, but it may not be a bad approximation for a set of populations that are distributed along a stream, a narrow beach, or other habitat that is much narrower than it is long. Here's the code you'll need to simulate results from a one-dimensional stepping stone model.^[Note: It draws on some of the simulation code above, so you'll need to run the code that comes before Exercise #6 in order to run it.] Play around with different values of `n_pops`. Compare results from the finite island model and the stepping stone and see what you find. WARNING: This takes a long time to run. You might decide to stop after trying it once with the default parameters. ```{r} make_M_1d_stepping_stone <- function(m, k) { M <- diag(1 - m, nrow = k) M[1, 2] <- m M[k, k - 1] <- m for (i in 2:(k-1)) { M[i, i - 1] <- m/2 M[i, i + 1] <- m/2 } return(M) } simulate <- function(n_e, m, mu, n_pops, n_loci, n_gen, n_sample) { ## the finite island model ## p <- initialize(n_pops, n_loci) M <- make_symmetric_matrix(m, n_pops) V <- make_symmetric_matrix(mu, 2) g_st <- numeric(n_loci) p_star <- array(dim = c(n_loci, n_pops, 2)) for (i in 1:n_gen) { for (k in 1:n_loci) { p_star[k, , ] <- M %*% p[k, , ] %*% V for (j in 1:n_pops) { p[k, j, ] <- rmultinom(1, 2*n_e, p_star[k, j, ])/(2*n_e) } mu_p <- mean(p[k, , 1]) } } df <- make_sample(p, n_sample) fst_island <- wc(df)$FST ## the stepping stone model ## p <- initialize(n_pops, n_loci) M <- make_M_1d_stepping_stone(m, n_pops) V <- make_symmetric_matrix(mu, 2) g_st <- numeric(n_loci) p_star <- array(dim = c(n_loci, n_pops, 2)) for (i in 1:n_gen) { for (k in 1:n_loci) { p_star[k, , ] <- M %*% p[k, , ] %*% V for (j in 1:n_pops) { p[k, j, ] <- rmultinom(1, 2*n_e, p_star[k, j, ])/(2*n_e) } mu_p <- mean(p[k, , 1]) } } df <- make_sample(p, n_sample) fst_step <- wc(df)$FST return(list(fst_step = fst_step, fst_island = fst_island)) } run_simulation <- function(n_e, m, mu, n_pops, n_loci, n_gen, n_sample, n_repetitions) { df <- tibble(Island = NA, Step = NA) for (i in 1:n_repetitions) { if (n_repetitions > 50) { cat(".", sep="") if ((i %% 50) == 0) { cat(i, "\n", sep ="") } } f_st <-simulate(n_e, m, mu, n_pops, n_loci, n_gen, n_sample) df <- add_row(df, Island = f_st$fst_island, Step = f_st$fst_step) } df <- filter(df, !is.na(N_e)) return(df) } result <- run_simulation(100, 0.01, 0.0001, 100, 100, 1000, 25, 100) cat("Finite island: ", round(mean(result$Island), 3), ", ", round(sd(result$Island), 3), "\n", "Stepping stone: ", round(mean(result$Step), 3), ", ", round(sd(result$Step), 3), sep = "") ```