We use ms() from phyclust to simulate samples from the coalescent and various functions from ape to display the coalescent tree and calculate statistics from it. Let’s start with a really simple example of constructing a coalescent sample of size 10. The nsam argument is the number of samples, and the opts = "-T" argument tell ms() to return a coalescent tree. We then convert it, using read.tree() to a format that we can plot()

The coalescent in R


rm(list = ls())

ms_out <- ms(nsam = 10, opts = "-T")
ms_tree_1 <- read.tree(text = ms_out)

You’ll notice that the tip labels run from s1 through s25. Each corresponds to a different allele. That was pretty easy, but we can do a bit more. We can ask how “deep” the tree is, i.e., how far back in time we have to go to find the most recent common ancestor of all the alleles in our sample.

Branching times


You’re probably wondering about the [1] after branching.times(). Well, branching.times() returns a vector of all of the branching times in the tree.

        11         12         13         14         15         16 
0.76309287 0.08241319 0.03451869 0.31600639 0.12675099 0.12256396 
        17         18         19 
0.10605252 0.02017499 0.02702544 

The numbers 11-19 are the node numbers in the tree above.


As you can see, node 11 (which is the first one in the list), is the deepest node, i.e., the most recent common ancestor of all of the alleles in our sample.

Scaling of branching times

The other thing you may be wondering is how to interpret the number 0.7630929 that corresponds to the time of the most recent common ancestor.1 To explain that I have to point out something that you may have missed when we called ms(): We didn’t specify a population size. How can we get away with that, since I told you that drift always depends on the population size?

Well, remember what you know about the coalesenct process. The expected time to coalescence of all alleles is \(4N_e\). ms() reports branching time in “coalescent units” where 1 unit equals \(4N_e\). So if branching.times(ms_tree_1)[1] is 0.7630929 and the effective size of our population is 25, then the time (number of generations before the present) to the most recent common ancestor is 76.3092874. Notice that the time isn’t an integer the way you’d expect it to be. That’s because of an approximation that is generally used in implementing the coalescent in software. If you want to turn it into an integer, you can do this:


Whether you round it or not, you’ll see that the most recent common ancestor of the alleles in our sample didn’t occur exactly 100 generations ago as we’d expect. If, however, we repeat that simulation a bunch of times, you can see that we come pretty close - on average.


N_e = 25

c_time <- numeric(0)
for (i in 1:1000) {
  ms_out <- ms(nsam = N_e, opts = "-T")
  ms_tree <- read.tree(text = ms_out)
  c_time[i] <- 4*N_e*branching.times(ms_tree)[1]
p <- ggplot(data.frame(t = c_time), aes(x = t)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  geom_vline(xintercept = 4*N_e, linetype = "dashed", color = "red") +