Two weeks ago we explored properties of the coalescent, focusing on
how population structure affected the time to coalescence. This week
we’re going to use ms()
again, but instead of exploring
properties of the coalescent in isolation, we’ll compare expected times
to coalescence with “observed” times to coalescence. We can’t really
observe coalescence times, but we can estimate them from nucleotide
sequence data. If the sequences are evolving neutrally, the coalescence
times we estimate should match those we expect pretty closely.
I downloaded 24 sequences of alcohol dehydrogenase (ADH) from Drosophila melanogaster from Genbank, and I aligned them using Muscle. I eliminated duplicate sequences, leaving 9 sequences in the data set. I converted the ClustalW alignments to NEXUS format, and analyzed their relationships in Mr. Bayes, assuming a strict molecular clock.1 Here’s what the results look like.
library(ape)
library(phyclust)
library(ggplot2)
adh_bayes <- read.nexus("Drosophila-ADH.clock.con.tre")
plot(adh_bayes)
nodelabels()
branching.times()
gives us the time each branching event
occurred, where the integers in the list below correspond to the numbers
in the tree above.
branching.times(adh_bayes)
10 11 12 13 14 15 16 17
0.018170630 0.001028510 0.005755670 0.003281008 0.001262989 0.000273063 0.004166878 0.002542797
Notice that the times aren’t in order. The same is true for the
output of ms()
.
ms_out <- ms(nsam = 10, opts = "-T")
ms_tree <- read.tree(text = ms_out)
branching.times(ms_tree)
11 12 13 14 15 16 17 18 19
0.84912309 0.26771757 0.11455639 0.11323509 0.01755591 0.13206823 0.10729787 0.04839707 0.04128653
This didn’t affect us earlier, because we were only looking at the
deepest node, and the deepest node is always the first one in the list.
But to compare branching times across the entire tree, we have to sort
them. In addition the deepest node is scaled in coalescent units in the
output from ms()
and in number of substitutions in the
estimates. To make them comparable, we also have to “normalize” them by
expressing them as a fraction of the time to the deepest node. The
following function produces a series of coalescent samples and compares
the coalescent times at each node that are observed with those that are
expected. It returns the result as a data frame.2
compare_coalescent <- function(tree, n_reps = 1000) {
n_taxa <- length(tree$tip.label)
coal_times <- matrix(nrow = n_reps, ncol = n_taxa - 1)
for (i in 1:n_reps) {
ms_out <- ms(nsam = n_taxa, opts = "-T")
ms_tree <- read.tree(text = ms_out)
coal_times[i, ] <- branching.times(ms_tree)
coal_times[i, ] <- sort(coal_times[i, ], decreasing = TRUE)
for (j in 1:(n_taxa - 1)) {
coal_times[i, j] <- coal_times[i, j]/coal_times[i, 1]
}
}
max_time <- max(branching.times(tree))
observed <- sort(branching.times(tree), decreasing = TRUE)
observed <- observed/max_time
result <- data.frame(Observed = observed,
Expected = apply(coal_times, 2, mean),
Lower_2.5 = apply(coal_times, 2, quantile, 0.025),
Upper_97.5 = apply(coal_times, 2, quantile, 0.975))
return(result)
}
Now we can call it with our observed tree and examine the results. First, we’ll just look at the mean and 95 percent credible interval for each node.
result <- compare_coalescent(adh_bayes)
round(result, 3)
The Observed doesn’t look markedly different from the Expected, but it will probably be easier to see if we plot them against one another.3
p <- ggplot(result, aes(x = Expected, y = Observed)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed",
color = "red") +
theme_bw()
p
That doesn’t look to bad, but it’s only showing the mean of the simulations. It’s not showing anything about the range of plausible values. Let’s add that to the plot and see what it looks like.
p <- ggplot(result, aes(x = Expected, y = Observed)) +
geom_point() +
geom_errorbar(aes(xmin = Lower_2.5, xmax = Upper_97.5, width = 0.02)) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed",
color = "red") +
theme_bw()
p
So it appears for these data there’s a reasonably good fit between Observed and Expected. There certainly isn’t a consistent pattern of mismatch.
I’ve mentioned self-incompatibility alleles several times in lecture. In this week’s exercise we’re going to explore sequence variation in a sample of them. More precisely, I’m going to provide you with a tree describing relationships among self-incompatibility alleles in Solanum, the genus that includes potatoes, tomatoes, and eggplants. The data are derived from Richman and Kohn.4 The original data set included 67 nucleotide sequences. After aligning the sequences with Muscle and eliminating a few duplicates, 63 remain. I estimated a phylogenetic tree of relationships with Mr. Bayes assuming a strict molecular clock,5 and you’ll find the resulting consensus tree at http://darwin.eeb.uconn.edu/eeb348-resources/sRNase.clock.con.tre.6
Using this tree, explore the relationship between observed
coalescence times, i.e., the estimated branching times in the output
from Mr. Bayes, with those expected, i.e., the branching times simulated
by ms()
. The compare_coalescent()
function
above and the code following it that shows how to look at the numbers
and how to visualize the results should make that pretty easy. Once
you’ve done that, answer these two questions:
What systematic differences between observed and expected coalescence times do you see?
Given what I’ve said about self-incompatibility in lecture and what Richman and Kohn say about it in their paper, how would you explain the discrepancies that you see.
prset = brlenspr=clock:uniform
with a
GTR+I+\(\Gamma\) substitution model for
any of you who care.↩︎
If you call it without specifying the number of replications, it will default to 1000. One thousand replications should be plenty, and it runs pretty fast (4-5 seconds on my laptop), but feel free to play around with different choices if you’re so inclined.↩︎
The dashed red line is the 1:1 line on which all of the points would lie if Observed exactly matched Expected.↩︎
Richman, A.D., and J.R. Kohn. 2000. Evolutionary genetics of self-incompatibility in the Solanaceae. Plant Molecular Biology 42:169-179. doi: 10.1023/A:1006336206637↩︎
I used the same parameters as with the ADH example above.↩︎
Note: You can either download the file to your hard
drive, or you can read it directly with read.nexus()
,
putting in the full URL (in quotes) instead of the filename.↩︎