--- title: "Exploring the coalescent" output: html_notebook: toc: yes toc_float: true --- We use `ms()` from `phyclust` to simulate samples from the coalescent and various functions from `ape` to display the coalescent tree and calculate statistics from it. Let's start with a really simple example of constructing a coalescent sample of size 10. The `nsam` argument is the number of samples, and the `opts = "-T"` argument tell `ms()` to return a coalescent tree. We then convert it, using `read.tree()` to a format that we can `plot()` ## The coalescent in `R` ```{r} library(ape) library(phyclust) rm(list = ls()) ## set random number seed so that same tree is produced every time this code ## is run ## set.seed(1234) ms_out <- ms(nsam = 10, opts = "-T") ms_tree_1 <- read.tree(text = ms_out) plot(ms_tree_1) ``` You'll notice that the tip labels run from `s1` through `s10`. Each corresponds to a different allele. That was pretty easy, but we can do a bit more. We can ask how "deep" the tree is, i.e., how far back in time we have to go to find the most recent common ancestor of all the alleles in our sample. ### Branching times ```{r} branching.times(ms_tree_1)[1] ``` You're probably wondering about the `[1]` after `branching.times()`. Well, `branching.times()` returns a vector of *all* of the branching times in the tree. ```{r} branching.times(ms_tree_1) ``` The numbers 11-19 are the node numbers in the tree above. ```{r} plot(ms_tree_1) nodelabels() ``` As you can see, node 11 (which is the first one in the list), is the deepest node, i.e., the most recent common ancestor of all of the alleles in our sample. ### Scaling of branching times The other thing you may be wondering is how to interpret the number `r branching.times(ms_tree_1)[1]` that corresponds to the time of the most recent common ancestor.^[Remember that when using the coalescent, time runs backwards. So the time of the most recent common ancestor is a time in the past.] To explain that I have to point out something that you may have missed when we called `ms()`: We didn't specify a population size. How can we get away with that, since I told you that drift *always* depends on the population size? Well, remember what you know^[or what I told you] about the coalesent process. The expected time to coalescence of all alleles is $4N_e$. `ms()` reports branching time in "coalescent units" where 1 unit equals $4N_e$. So if `branching.times(ms_tree_1)[1]` is `r branching.times(ms_tree_1)[1]` and the effective size of our population is 25, then the time (number of generations before the present) to the most recent common ancestor is `r 4*25*branching.times(ms_tree_1)[1]`. Notice that the time isn't an integer the way you'd expect it to be. That's because of an approximation that is generally used in implementing the coalescent in software. If you want to turn it into an integer, you can do this: ```{r} round(4*25*branching.times(ms_tree_1)[1]) ``` Whether you round it or not, you'll see that the most recent common ancestor of the alleles in our sample didn't occur exactly 100 generations ago as we'd expect. If, however, we repeat that simulation a bunch of times, you can see that we come pretty close - on average. ```{r} library(ggplot2) N_e = 25 c_time <- numeric(0) for (i in 1:1000) { ms_out <- ms(nsam = N_e, opts = "-T") ms_tree <- read.tree(text = ms_out) c_time[i] <- 4*N_e*branching.times(ms_tree)[1] } p <- ggplot(data.frame(t = c_time), aes(x = t)) + geom_histogram(bins = 50, alpha = 0.6) + geom_vline(xintercept = 4*N_e, linetype = "dashed", color = "red") + theme_bw() p ``` We expect a coalescence time of `r 4*N_e` and our observed average coalescence time is `r mean(c_time)`. Not too bad. Notice that the standard deviation of the coalescence time is `r sd(c_time)`. That's a lot of variability in coalescence time, which re-emphasizes a point I've made many times: There is a ***lot*** of variability associated with drift. While the expectations we focus on provide a useful point of comparison, they can delude us into thinking that we know more than we actually do. ### The coalescent with migration What happens when we have a subdivided population so that we have several populations exchanging genes? Here's how we simulate a finite island model with three populations, local population sizes of 10 in each population, and $N_em = 2$.^[Note: You'll need to install `RColorBrewer` if you want to run this code.] ```{r} options(tidyverse.quiet = TRUE) library(tidyverse) library(RColorBrewer) get_pop <- function(tree, sample_sizes, k) { idx <- as.numeric(gsub("^s(.*)$", "\\1", tree$tip.label)) cum <- 0 for (i in 1:length(sample_sizes)) { if (idx[k] < cum + sample_sizes[i] + 1) { retval <- i break } else { cum <- cum + sample_sizes[i] } } return(i) } tipcolors <- function(tree, sample_sizes) { my_colors <- brewer.pal(length(sample_sizes), "Set1") colors <- numeric(length(sample_sizes)) for (i in 1:sum(sample_sizes)) { colors[i] <- my_colors[get_pop(tree, sample_sizes, i)] } return(colors) } ## Note: sample sizes should be multiplied by 2 before calling ## island for diploid samples ## island <- function(sample_sizes, ne_m) { opt_string <- paste("-T -I ", length(sample_sizes), " ", sep = "") for (i in 1:length(sample_sizes)) { opt_string <- paste(opt_string, sample_sizes[i], " ", sep = "") } opt_string <- paste(opt_string, ne_m, "\n", sep = "") ms_out <- ms(nsam = sum(sample_sizes), nreps = 1, opts = opt_string) return(ms_out) } plot_island <- function(sample_sizes, ne_m) { ms_out <- island(2*sample_sizes, ne_m) ms_tree <- read.tree(text = ms_out) plot(ms_tree, show.tip.label = FALSE) tiplabels(pch = 19, cex = 0.75, col = tipcolors(ms_tree, 2*sample_sizes)) return(branching.times(ms_tree)[1]) } time <- plot_island(sample_sizes = c(5, 5, 5), ne_m = 10) ``` The colors correspond to the population from which each sample was collected. `plot_island()` returns the scaled time to the most recent common ancestor, `r time`. We are implicitly assuming that the effective population size is the same in each local population.^[There is a way to relax this assumption, but we won't take advantage of it here.] The scaling is in terms of the product of the local effective population size and the number of populations, i.e., 1 coalescent unit = $4N_ek$, where $k$ is the number of populations. Notice how different the result is if I set $N_em = 0.5$ instead of $N_em = 2$. ```{r} time <- plot_island(sample_sizes = c(5, 5, 5), ne_m = 0.1) ``` The alleles are now nicely arranged into populations. Does it surprise you that the populations in which alleles occur form monophyletic groups when $N_em \ll 1$ and that they don't when $N_em \gg 1$.^[Note: Read $\ll$ as "much less than" and $\gg$ as "much greater than."] ## Lab 7 You will be performing another set of simulations this week to compare coalescence times in two scenarios: 1. A finite island model of migration. 2. A one-dimensional stepping stone model of migration. You saw both the finite island model and the one-dimensional stepping stone model last week. As a reminder, in a one-dimensional stepping stone model the populations are arranged in a lineary array and only those populations that are adjacent to one another can exchange migrants. If $m$ is the migration rate, i.e., the fraction of a population composed of migrants then the fraction of population $i$ derived from population $i-1$ is $\frac{m}{2}$ and the fraction of population $i$ derived from population $i+1$ is $\frac{m}{2}$.^[Note: For the first population in the array, i.e., population 1, the fraction of the population derived from population 2 is $m$. Similarly, if there are $k$ populations in the arry, the fraction of population $k$ that is derived from population $k-1$ is $m$.] Use the functions below to explore the difference in coalescence times for finite island models and one-dimensional stepping stone models of migration, and answer the following questions: 1. How does the number of populations included affect the avearge time to coalescence of all alleles? Try at least one case in which the number of populations is relatively small, i.e., 10 or fewer, and one in which it is moderately large, i.e., 25 or more. 2. How does the migration rate affect the average time to coalescence of all alleles? Does it depend on whether migration is better approximated by a finite island model or a one-dimensional stepping stone model? 3. Are there consistent differences between the average time to coalescence in a finite island model and a one-dimensional stepping stone model? If there are, what might explain the differences that you see? ### Hints `run_simulations()` will automatically keep the number of populations in both models the same to ensure that you can always make appropriate comparisons of the models. It will also set the local effective population size to 25, and it sets the number of replications to 1000 by default. You might want to run the function once with `n_reps = 100` to get a feel for how long it will take the simulation to run on your computer before you begin the full-scale simulations `run_simulations()` returns a data frame^[Well, actually a tibble] with two columns: `Model` and `Time`. If you're interested, you can use that to examine the median time to coalescence (rather than the mean) and the variance (or standard deviation) in coalescence times. You'll notice that `run_simulation()` automatically prints out the mean coalesence times for you. In answering question #3 keep in mind that your answer might depend on how many populations there are.^[Why else would I have asked you to look both at one example in which tne number of populations is 10 or fewer and another in which the number of populations is 25 or more?] ```{r} ## one-dimensional stepping stone ## one_d_stepping_stone <- function(n_e, m, n_pops) { stopifnot((m > 0) && (m < 1)) m_matrix <- diag(1 - m, nrow = n_pops) m_matrix[1, 2] <- m m_matrix[n_pops, n_pops - 1] <- m for (i in 2:(n_pops - 1)) { m_matrix[i, i - 1] <- m/2 m_matrix[i, i + 1] <- m/2 } opt_string <- paste("-T -I ", n_pops, " ", sep = "") for (i in 1:n_pops) { opt_string <- paste(opt_string, 2*n_e, " ", sep = "") } opt_string <- paste(opt_string, "-ma", sep = "") ## transpose m_matrix before converting to vector to get vector from matrix ## by row instead of by column ## m_vector <- 4*n_e*c(t(m_matrix)) for ( i in 1:length(m_vector)) { opt_string <- paste(opt_string, " ", m_vector[i], sep = "") } ms_out <- ms(nsam = 2*n_e*n_pops, nreps = 1, opts = opt_string) return(ms_out) } run_simulation <- function(n_pops, m, n_reps = 1000) { N_e = 25 pop_sizes <- rep(N_e, n_pops) df <- tibble(Model = NA, Time = NA) for (i in 1:n_reps) { ms_out <- island(pop_sizes, N_e*m) ms_tree <- read.tree(text = ms_out) df <- add_row(df, Model = "Island", Time = (4*N_e*n_pops)*branching.times(ms_tree)[1]) ms_out <- one_d_stepping_stone(N_e, m, n_pops) ms_tree <- read.tree(text = ms_out) df <- add_row(df, Model = "1-d", Time = (4*N_e*n_pops)*branching.times(ms_tree)[1]) } df <- filter(df, !is.na(Model)) p <- ggplot(df, aes(x = Time, fill = Model)) + geom_histogram(position = "identity", bins = n_reps/10, alpha = 0.4) + ggtitle(paste("n_pops = ", n_pops, ", m = ", m)) + theme_bw() print(p) sum <- df %>% group_by(Model) %>% summarize(Mean_Time = mean(Time)) print(sum) return(df) } ```