Underlying everything else we're going to discuss in this last part of the course is the idea that we should be able to describe the degree of difference between nucleotide sequences, proteins, or anything else as a result of some underlying evolutionary processes. To illustrate the principle, let's start with nucleotide sequences and develop a fairly simple model that describes how they become different over time.
Let $q_t$ be the probability that two homologous nucleotides are identical after having evolved independently for $t$ generations since the gene in which they were found was replicated in their common ancestor. Let $\lambda$ be the probability of a substitution occurring at this nucleotide position in either of the two genes during a small time interval, $\Delta t$. Then

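$$
\begin{aligned}
q_t &= 1 - \frac{3}{4}\left(1 - e^{-4d/3}\right) \\
d &= -\frac{3}{4}\ln\left[1 - \frac{4}{3}(1 - q_t)\right]
\end{aligned}
$$

where $d$ is the expected number of nucleotide substitutions per site separating the two sequences. The second expression simply solves the first for $d$, so it lets us estimate $d$ from the observed proportion of identical sites.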
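To make the two expressions above concrete, here is a minimal Python sketch that converts between $d$ and $q_t$ and estimates $d$ from a pair of aligned sequences. The function names are illustrative, not from any particular library, and the sketch assumes the sequences are already aligned, of equal length, and free of gaps.

```python
import math


def identity_from_distance(d):
    """Probability that two homologous nucleotides are identical, given d."""
    return 1.0 - 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def distance_from_identity(q):
    """Expected substitutions per site, given the proportion of identical sites q."""
    arg = 1.0 - (4.0 / 3.0) * (1.0 - q)
    if arg <= 0.0:
        # The logarithm is undefined once identity falls to 1/4 or below,
        # which is the level expected for unrelated (saturated) sequences.
        raise ValueError("distance is undefined when q <= 0.25")
    return -0.75 * math.log(arg)


def proportion_identical(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


if __name__ == "__main__":
    q = proportion_identical("ACGTACGTAC", "ACGTACGTTT")
    print("proportion identical:", q)
    print("estimated d:", distance_from_identity(q))
```

Note that the estimated distance is always at least as large as the raw proportion of differing sites, because the formula allows for multiple substitutions having occurred at the same position.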
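The simple model above makes two assumptions: that substitutions are equally likely at all positions, and that all types of substitution are equally likely. Let's examine the second of those assumptions first. Observed differences between nucleotide sequences show that some types of substitutions, i.e., transitions ($A \leftrightarrow G$, $C \leftrightarrow T$), occur much more frequently than others, i.e., transversions ($A \leftrightarrow C$, $A \leftrightarrow T$, $C \leftrightarrow G$, $G \leftrightarrow T$). There are a variety of different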
substitution models corresponding to different assumed patterns of
mutation: Kimura 2-parameter (K2P), Felsenstein 1984 (F84),
Hasegawa-Kishino-Yano 1985 (HKY85), Tamura and Nei (TrN), and
generalized time-reversible (GTR). The GTR is, as its name suggests,
the most general time-reversible model. It allows substitution
rates to differ between each pair of nucleotides. That's why it's
general. It requires, however, that the substitution rate be the same
in both directions. That's why it's time reversible. While it would be
possible to construct a model in which the substitution rate differs
depending on the direction of substitution, it leads to something of a
paradox: with non-reversible substitution models the distance between two sequences $A$ and $B$ depends on whether we measure the distance from $A$ to $B$ or from $B$ to $A$.
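The distinction between transitions and transversions that motivates these models can be checked directly on a pair of aligned sequences. The sketch below is only an illustration: the function names are made up for this example, and it assumes aligned, equal-length sequences containing only A, C, G, and T. It relies on the fact that transitions exchange two purines (A, G) or two pyrimidines (C, T), while transversions exchange a purine for a pyrimidine.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_site(x, y):
    """Label one aligned pair of nucleotides."""
    pair = {x, y}
    if not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("unexpected character in pair: %r, %r" % (x, y))
    if x == y:
        return "identical"
    # Both purines or both pyrimidines: a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    # One purine and one pyrimidine: a transversion.
    return "transversion"


def count_substitution_types(seq_a, seq_b):
    """Count identical sites, transitions, and transversions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be of equal length")
    counts = {"identical": 0, "transition": 0, "transversion": 0}
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        counts[classify_site(x, y)] += 1
    return counts


if __name__ == "__main__":
    print(count_substitution_types("AACGT", "AGCTT"))
```

Although there are twice as many kinds of transversion as kinds of transition, counts like these on real data typically show more transitions, which is the observation the models listed above are designed to accommodate.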
There are two ways in which the rate of nucleotide substitution can be allowed to vary from position to position, the phenomenon of among-site rate variation. First, we expect the rate of substitution to depend on codon position in protein-coding genes. The sequence can be divided into first, second, and third codon positions and rates calculated separately for each of those positions. Second, we can assume a priori that there is a distribution of different rates possible and that this distribution is described by one of the standard distributions from probability theory. We then imagine that the substitution rate at any given site is determined by a random draw from the given probability distribution. The gamma distribution is widely used to describe the pattern of among-site rate variation, because it can approximate a wide variety of different distributions (Figure 1).
The mean substitution rate in each curve in Figure 1 is 0.1. The curves differ only in the value of a parameter, $\alpha$, called the "shape parameter." The shape parameter gives a nice numerical description of how much rate variation there is, except that it's backwards. The larger the parameter, the less among-site rate variation there is.
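A small simulation shows why the shape parameter runs "backwards." The sketch below draws per-site rates from gamma distributions that share a mean of 0.1 but differ in shape. The function names, the number of sites, and the particular shape values are assumptions chosen for illustration. For a gamma distribution with the mean held fixed, the coefficient of variation (standard deviation divided by mean) is $1/\sqrt{\alpha}$, so larger shape values give rates that cluster more tightly around the mean.

```python
import math
import random
import statistics


def gamma_rates(shape, mean=0.1, n_sites=10000, seed=1):
    """Draw one substitution rate per site from a gamma distribution.

    random.gammavariate takes a shape and a scale; the mean is
    shape * scale, so the scale is set to mean / shape.
    """
    rng = random.Random(seed)
    scale = mean / shape
    return [rng.gammavariate(shape, scale) for _ in range(n_sites)]


def expected_cv(shape):
    """Theoretical coefficient of variation of a gamma distribution."""
    return 1.0 / math.sqrt(shape)


def sample_cv(rates):
    """Observed coefficient of variation of a list of rates."""
    return statistics.pstdev(rates) / statistics.mean(rates)


if __name__ == "__main__":
    for shape in (0.5, 1.0, 5.0, 50.0):
        rates = gamma_rates(shape)
        print("shape", shape,
              "mean", round(statistics.mean(rates), 4),
              "sample CV", round(sample_cv(rates), 3),
              "expected CV", round(expected_cv(shape), 3))
```

Every run keeps the mean rate near 0.1, while the spread of rates among sites shrinks as the shape parameter grows.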