
Tajima's $D$

So far we've been comparing rates of synonymous and non-synonymous substitution to detect the effects of natural selection on molecular polymorphisms. Tajima [4] proposed a method that builds on the foundation of the neutral theory of molecular evolution in a different way. I've already mentioned the infinite alleles model of mutation several times. When thinking about DNA sequences, a closely related approximation is to imagine that every time a mutation occurs, it occurs at a different site. If we do that, we have an infinite sites model of mutation.

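When dealing with nucleotide sequences in a population context there are two statistics of potential interest: the number of segregating sites, $k$, that is, the number of nucleotide positions that are polymorphic in the sample, and the nucleotide diversity, $\pi$, the average heterozygosity at nucleotide sites, which reflects the differences $\delta_{ij}$ between haplotypes $i$ and $j$.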

The quantity $4N_e\mu$ comes up a lot in mathematical analyses of molecular evolution. Population geneticists, being a lazy bunch, get tired of writing that down all the time, so they invented the parameter $\theta = 4N_e\mu$ to save themselves a little time. Under the infinite-sites model of DNA sequence evolution, it can be shown that

\begin{eqnarray*}
\mbox{E}(\pi) &=& \theta \\
\mbox{E}(k) &=& \theta\sum_{i=1}^{n-1} \frac{1}{i} \quad ,
\end{eqnarray*}

where $n$ is the number of haplotypes in your sample. This suggests that there are two ways to estimate $\theta$, namely

\begin{eqnarray*}
\hat \theta_\pi &=& \hat \pi \\
\hat \theta_k &=& \frac{k}{\sum_{i=1}^{n-1}\frac{1}{i}} \quad ,
\end{eqnarray*}

where $\hat\pi$ is the average heterozygosity at nucleotide sites in our sample and $k$ is the observed number of segregating sites in our sample. If the nucleotide sequence variation among our haplotypes is neutral and the population from which we sampled is in equilibrium with respect to drift and mutation, then $\hat\theta_\pi$ and $\hat\theta_k$ should be statistically indistinguishable from one another. Thus,

\begin{displaymath}
\hat D = \hat\theta_\pi - \hat\theta_k
\end{displaymath}

can be used as a test statistic to assess whether the data are consistent with the population being at a mutation-drift equilibrium. Consider the value of $\hat D$ under the following five scenarios:

Neutral variation
If the variation is neutral and the population is at a drift-mutation equilibrium, then $\hat D$ will be statistically indistinguishable from zero.

Population bottleneck
If the population has recently undergone a bottleneck, then $\pi$ will be little affected unless the bottleneck was prolonged and severe. $k$, however, may be substantially reduced. Thus, $\hat D$ should be greater than zero.

Purifying selection
If there is purifying selection, mutations will occur and accumulate at silent sites, but they aren't likely ever to become very common. Thus, there are likely to be lots of segregating sites, but not much heterozygosity, meaning that $\hat\theta_k$ will be large, $\hat\theta_\pi$ will be small, and $\hat D$ will be negative.

Overdominant selection
Overdominance will allow alleles belonging to the different classes to become quite divergent from one another. $\delta_{ij}$ within each class will be small, but $\delta_{ij}$ between classes will be large and both classes will be at intermediate frequency, leading to large values of $\hat\theta_\pi$. There won't be a similar tendency for the number of segregating sites to increase, so $\hat\theta_k$ will be relatively unaffected. As a result, $\hat D$ will be positive.

Population expansion
If the population has recently begun to expand, mutations that occur are unlikely to be lost, increasing $\hat\theta_k$, but it will take a long time before they contribute to heterozygosity, $\hat\theta_\pi$. Thus, $\hat D$ will be negative, just as it is under purifying selection.
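
To make the two estimators concrete, here is a minimal Python sketch that computes $\hat\theta_\pi$, $\hat\theta_k$, and their difference $\hat D$ from a small sample of aligned haplotypes. The function names and the toy sequences are made up for illustration. I'm assuming the sequences are already aligned and of equal length, and I'm computing $\hat\pi$ as the average number of pairwise differences over the whole sequence so that it is on the same scale as $k$. Note that this is the raw difference defined above; Tajima's published statistic also divides by an estimate of its standard deviation, which this sketch does not do.

```python
from itertools import combinations


def segregating_sites(haplotypes):
    """Count alignment columns in which more than one nucleotide appears (k)."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of sampled sequences."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def theta_estimates(haplotypes):
    """Return (theta_pi, theta_k, D) with D = theta_pi - theta_k (unscaled)."""
    n = len(haplotypes)
    # Harmonic sum 1 + 1/2 + ... + 1/(n-1) from the expectation of k.
    a_n = sum(1.0 / i for i in range(1, n))
    theta_pi = mean_pairwise_differences(haplotypes)
    theta_k = segregating_sites(haplotypes) / a_n
    return theta_pi, theta_k, theta_pi - theta_k


# Hypothetical sample of four aligned haplotypes (illustration only).
sample = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]
theta_pi, theta_k, d_hat = theta_estimates(sample)
print(segregating_sites(sample), theta_pi, theta_k, d_hat)
```

In this toy sample there are three segregating sites and the six pairwise comparisons differ at ten sites in total, so $\hat\theta_\pi = 10/6$ and $\hat\theta_k = 3/(1 + 1/2 + 1/3)$, which are nearly equal. Whether a difference like this is distinguishable from zero is a separate statistical question that the sketch doesn't address.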

In short, $\hat D$ provides a different avenue for insight into the evolutionary history of a particular nucleotide sequence. But interpreting it can be a little tricky.

$\hat D = 0$:
We have no evidence for changes in population size or for any particular pattern of selection at the locus.

$\hat D < 0$:
The population size may be increasing or we may have evidence for purifying selection at this locus.

$\hat D > 0$:
The population may have suffered a recent bottleneck (or be decreasing) or we may have evidence for overdominant selection at this locus.

If we have data available for more than one locus, we may be able to distinguish changes in population size from selection at any particular locus. After all, all loci will experience the same demographic effects, but we might expect selection to act differently at different loci, especially if we choose to analyze loci with different physiological function.
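
One crude way to act on that idea is sketched below in Python. It treats the median $\hat D$ across loci as a rough stand-in for the shared demographic signal and reports how far each locus departs from it. This is only an illustrative heuristic, not a formal test, and the locus names and $\hat D$ values are hypothetical numbers I made up for the example.

```python
from statistics import median


def deviations_from_shared_signal(d_by_locus):
    """Return the median D across loci and each locus's departure from it.

    Heuristic assumption: demography shifts D at every locus in the same
    direction, so the median approximates the shared demographic effect.
    """
    shared = median(d_by_locus.values())
    return shared, {locus: d - shared for locus, d in d_by_locus.items()}


# Hypothetical per-locus values of D (made up for illustration).
d_hat_by_locus = {
    "locus_A": -1.2,
    "locus_B": -0.9,
    "locus_C": -1.1,
    "locus_D": 2.4,
}

shared, deviations = deviations_from_shared_signal(d_hat_by_locus)
most_unusual = max(deviations, key=lambda locus: abs(deviations[locus]))
print(shared, most_unusual)
```

Here most loci have negative $\hat D$, which is what we'd expect if the population were expanding, while the one locus with a strongly positive value stands out as a candidate for locus-specific selection. A real analysis would need to ask whether that departure is larger than expected by chance.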


Kent Holsinger 2006-11-15