---
title: "Illustrating nucleotide sequence divergence under the Jukes-Cantor model"
output: html_notebook
---

We start by loading the necessary libraries and defining the function to calculate the probability that two nucleotides are identical after $t$ generations under the Jukes-Cantor model.

```{r}
options(tidyverse.quiet = TRUE)
library(tidyverse)
library(ggplot2)

jc <- function(lambda, t) {
 return(1 - (3/4)*(1 - exp(-8*lambda*t/3)))
}

jc_dist <- function(q) {
  return((-3/4)*log(1 - (4/3)*(1 - q)))
}
```

Now we'll look at how the probability of identity decays over 10,000 generations when $\lambda = 0.0001$, 0.001, and 0.01.

```{r}
t <- seq(0, 10000, by = 1)
dat <- tibble(lambda = NA, t = NA, q = NA)
for (lambda in c(0.0001, 0.001, 0.01)) {
  q <- jc(lambda, t)
  tmp <- tibble(lambda = as.character(lambda),
                t = t,
                q = q)
  dat <- rbind(dat, tmp)
}
dat <- dat %>% filter(!is.na(lambda))

p <- ggplot(dat, aes(x = t, y = q, 
                     fill = lambda, 
                     color = lambda)) +
  geom_line() +
  theme_bw() +
  ylim(0, 1) +
  xlab("Generation") +
  ylab("Probability nucleotides are identical") +
  scale_color_brewer(type = "seq",
                     palette = "YlGnBu")
p
```

Notice that the probability of identity decreases to 0.25, but no further. Let's look at the complementary plot of evolutionary distance as a function of percent sequence similarity.

```{r}
q <- seq(1.0, 0.251, by = -0.001)
d <- jc_dist(q)
dat <- tibble(q = q, d = d)
p <- ggplot(dat, aes(x = q, y = d)) +
  geom_line() +
  theme_bw() +
  xlim(0.25, 1) +
  xlab("Percent sequence similarity") +
  ylab("Jukes-Cantor distance")
p
```