--- title: "Comparing observed coalescence times to expected coalescence times" output: html_notebook: toc: yes toc_float: true --- ## Overview Two weeks ago we explored properties of the coalescent, focusing on how population structure affected the time to coalescence. This week we're going to use `ms()` again, but instead of exploring properties of the coalescent in isolation, we'll compare expected times to coalescence with "observed" times to coalescence. We can't really observe coalescence times, but we can estimate them from nucleotide sequence data. If the sequences are evolving neutrally, the coalescence times we estimate should match those we expect pretty closely. ### An example from *Drosophila* I downloaded 24 sequences of alcohol dehydrogenase (ADH) from *Drosophila melanogaster* from Genbank, and I aligned them using [Muscle](https://www.ebi.ac.uk/Tools/services/web_muscle/toolform.ebi). I eliminated duplicate sequences, leaving 9 sequences in the data set. I converted the ClustalW alignments to NEXUS format, and analyzed their relationships in Mr. Bayes, assuming a strict molecular clock.^[`prset = brlenspr=clock:uniform` with a GTR+I+$\Gamma$ substitution model for any of you who care.] Here's what the results look like. ```{r} library(ape) library(phyclust) library(ggplot2) adh_bayes <- read.nexus("Drosophila-ADH.clock.con.tre") plot(adh_bayes) nodelabels() ``` `branching.times()` gives us the time each branching event occurred, where the integers in the list below correspond to the numbers in the tree above. ```{r} branching.times(adh_bayes) ``` Notice that the times aren't in order. The same is true for the output of `ms()`. ```{r} ms_out <- ms(nsam = 10, opts = "-T") ms_tree <- read.tree(text = ms_out) branching.times(ms_tree) ``` This didn't affect us earlier, because we were only looking at the deepest node, and the deepest node is always the first one in the list. But to compare branching times across the entire tree, we have to sort them. In addition the deepest node is scaled in coalescent units in the output from `ms()` and in number of substitutions in the estimates. To make them comparable, we also have to "normalize" them by expressing them as a fraction of the time to the deepest node. The following function produces a series of coalescent samples and compares the coalescent times at each node that are observed with those that are expected. It returns the result as a data frame.^[If you call it without specifying the number of replications, it will default to 1000. One thousand replications should be plenty, and it runs pretty fast (4-5 seconds on my laptop), but feel free to play around with different choices if you're so inclined.] ```{r} compare_coalescent <- function(tree, n_reps = 1000) { n_taxa <- length(tree$tip.label) coal_times <- matrix(nrow = n_reps, ncol = n_taxa - 1) for (i in 1:n_reps) { ms_out <- ms(nsam = n_taxa, opts = "-T") ms_tree <- read.tree(text = ms_out) coal_times[i, ] <- branching.times(ms_tree) coal_times[i, ] <- sort(coal_times[i, ], decreasing = TRUE) for (j in 1:(n_taxa - 1)) { coal_times[i, j] <- coal_times[i, j]/coal_times[i, 1] } } max_time <- max(branching.times(tree)) observed <- sort(branching.times(tree), decreasing = TRUE) observed <- observed/max_time result <- data.frame(Observed = observed, Expected = apply(coal_times, 2, mean), Lower_2.5 = apply(coal_times, 2, quantile, 0.025), Upper_97.5 = apply(coal_times, 2, quantile, 0.975)) return(result) } ``` Now we can call it with our observed tree and examine the results. First, we'll just look at the mean and 95 percent credible interval for each node. ```{r} result <- compare_coalescent(adh_bayes) round(result, 3) ``` The Observed doesn't look markedly different from the Expected, but it will probably be easier to see if we plot them against one another.^[The dashed red line is the 1:1 line on which all of the points would lie if Observed exactly matched Expected.] ```{r} p <- ggplot(result, aes(x = Expected, y = Observed)) + geom_point() + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + theme_bw() p ``` That doesn't look to bad, but it's only showing the *mean* of the simulations. It's not showing anything about the range of plausible values. Let's add that to the plot and see what it looks like. ```{r} p <- ggplot(result, aes(x = Expected, y = Observed)) + geom_point() + geom_errorbar(aes(xmin = Lower_2.5, xmax = Upper_97.5, width = 0.02)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + theme_bw() p ``` So it appears for these data there's a reasonably good fit between Observed and Expected. There certainly isn't a consistent pattern of mismatch. ## Lab 9 I've mentioned self-incompatibility alleles several times in lecture. In this week's exercise we're going to explore sequence variation in a sample of them. More precisely, I'm going to provide you with a tree describing relationships among self-incompatibility alleles in *Solanum*, the genus that includes potatoes, tomatoes, and eggplants. The data are derived from Richman and Kohn.^[Richman, A.D., and J.R. Kohn. 2000. Evolutionary genetics of self-incompatibility in the Solanaceae. *Plant Molecular Biology* 42:169-179. doi: [10.1023/A:1006336206637](https://dx.doi.org/10.1023/A:1006336206637)] The original data set included 67 nucleotide sequences. After aligning the sequences with Muscle and eliminating a few duplicates, 63 remain. I estimated a phylogenetic tree of relationships with Mr. Bayes assuming a strict molecular clock,^[I used the same parameters as with the ADH example above.] and you'll find the resulting consensus tree at [http://darwin.eeb.uconn.edu/eeb348-resources/sRNase.clock.con.tre](http://darwin.eeb.uconn.edu/eeb348-resources/sRNase.clock.con.tre).^[Note: You can either download the file to your hard drive, or you can read it directly with `read.nexus()`, putting in the full URL (in quotes) instead of the filename.] Using this tree, explore the relationship between observed coalescence times, i.e., the estimated branching times in the output from Mr. Bayes, with those expected, i.e., the branching times simulated by `ms()`. The `compare_coalescent()` function above and the code following it that shows how to look at the numbers and how to visualize the results should make that pretty easy. Once you've done that, answer these two questions: 1. What systematic differences between observed and expected coalescence times do you see? 2. Given what I've said about self-incompatibility in lecture and what Richman and Kohn say about it in their paper, how would you explain the discrepancies that you see.