At this point, we’ve refined the neutral theory quite a bit. Our understanding of how molecules evolve now recognizes that some substitutions are more likely than others, but we’re still proceeding under the assumption that most nucleotide substitutions are neutral or detrimental. So far we’ve argued that variation like what Hubby and Lewontin found is not likely to be maintained by natural selection. But we have strong evidence that heterozygotes for the sickle-cell allele are more fit than either homozygote in human populations where malaria is prevalent. That’s an example where selection is acting to maintain a polymorphism, not to eliminate it. Are there other examples? How could we detect them?
In the 1970s a variety of studies suggested that a polymorphism in the locus coding for alcohol dehydrogenase in Drosophila melanogaster might not only be subject to selection but that selection may be acting to maintain the polymorphism. As DNA sequencing became more practical at about the same time,1 population geneticists began to realize that comparative analyses of DNA sequences at protein-coding loci could provide a powerful tool for unraveling the action of natural selection. Synonymous sites within a protein-coding sequence provide a powerful standard of comparison. Regardless of
the demographic history of the population from which the sequences were collected,
the length of time that populations have been evolving under the sample conditions and whether it has been long enough for the population to have reached a drift-migration-mutation-selection equilibrium, or
the actual magnitude of the mutation rate, the migration rate, or the selection coefficients
the synonymous positions within the sequence provide an internal control on the amount and pattern of differentiation that should be expected when substitutions are neutral.2 Thus, if we see different patterns of nucleotide substitution at synonymous and non-synonymous sites, we can infer that selection is having an effect on amino acid substitutions.
Kreitman took advantage of these ideas to provide additional insight into whether natural selection was likely to be involved in maintaining the polymorphism at Adh in Drosophila melanogaster. He cloned and sequenced 11 alleles at this locus, each a little less than 2.4kb in length.3 If we restrict our attention to the coding region, a total of 765bp, there were 6 distinct sequences that differed from one another at between 1 and 13 sites. Given the observed level of polymorphism within the gene, there should be 9 or 10 amino acid differences observed as well, but only one of the nucleotide differences results in an amino acid difference, the amino acid difference associated with the already recognized electrophoretic polymorphism. Thus, there is significantly less amino acid diversity than expected if nucleotide substitutions were neutral, consistent with my assertion that most mutations are deleterious and that natural selection will tend to eliminate them. In other words, another example of the “sledgehammer principle.”
Does this settle the question? Is the Adh polymorphism another example of allelic variants being neutral or selected against? Would I be asking these questions if the answer were “Yes”?
A few years after Kreitman’s paper appeared, Kreitman and Aguadé published an analysis in which they looked at levels of nucleotide diversity in the Adh region, as revealed through analysis of RFLPs, in D. melanogaster and the closely related D. simulans. Why the comparative approach? Well, Kreitman and Aguadé remembered that the neutral theory of molecular evolution makes two predictions that are related to the underlying mutation rate:
If mutations are neutral, the substitution rate is equal to the mutation rate.
If mutations are neutral, the diversity within populations should be about \(4N_e\mu/(4N_e\mu + 1)\).
Thus, if variation at the Adh locus in D. melanogaster is selectively neutral, the amount of divergence between D. melanogaster and D. simulans should be related to the amount of diversity within each. What they found instead is summarized in Table 1.
The expected level of diversity in each part of the Adh locus is calculated assuming that the probability of polymorphism is independent of what position in the locus we are examining.4 Specifically, Kreitman and Aguadé calculated the expected polymorphism as follows:
They calculated the number of “site equivalents” in each region of the locus. A site equivalent is the actual length of the region (in number of nucleotides) times the fraction of changes within that sequence that would lead to gain or loss of a restriction site.5 There were 414 site equivalents in the 5’ flanking region, 411 site equivalents in the Adh locus, and 129 site equivalents in the 3’ flanking region.
They calculated the fraction of site equivalents that were polymorphic across the entire locus: \[\frac{25}{414+411+129} \approx 0.026 \quad .\]
They calculated the expected number of polymorphic sites within a region as the product of the number of site equivalents and the fraction of polymorphic site equivalents.
They used the same approach to calculate the expected divergence between D. melanogaster and D. simulans with one important exception. They directly compared the nucleotide sequence of one Adh allele from D. melanogaster with one Adh allele from D. simulans.6 As a result, they didn’t have to use the site equivalent correction. They could directly use the number of nucleotides in each region of the gene.
| | 5’ flanking | Adh locus | 3’ flanking |
|---|---|---|---|
| Diversity\(^1\) | | | |
| Observed | 9 | 14 | 2 |
| Expected | 10.8 | 10.8 | 3.4 |
| Divergence\(^2\) | | | |
| Observed | 86 | 48 | 31 |
| Expected | 55 | 76.9 | 33.1 |

\(^1\)Number of polymorphic sites within D. melanogaster

\(^2\)Number of nucleotide differences between D. melanogaster and D. simulans
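The expected diversity row can be checked directly from the numbers given in the text. The following is a minimal sketch of that arithmetic only; the function and variable names are mine, and the expected divergence is not reproduced because the region lengths in nucleotides are not given here.

```python
# Site equivalents per region, as stated in the text.
site_equivalents = {"5' flanking": 414, "Adh locus": 411, "3' flanking": 129}
# Observed numbers of polymorphic sites per region, from Table 1.
observed = {"5' flanking": 9, "Adh locus": 14, "3' flanking": 2}


def expected_polymorphism(site_equivalents, observed):
    """Expected polymorphic sites per region if polymorphism is spread
    evenly over all site equivalents."""
    total_polymorphic = sum(observed.values())
    total_sites = sum(site_equivalents.values())
    fraction = total_polymorphic / total_sites  # 25 / 954, about 0.026
    return {region: n * fraction for region, n in site_equivalents.items()}


expected = expected_polymorphism(site_equivalents, observed)
for region, value in expected.items():
    print(f"{region}: observed {observed[region]}, expected {value:.1f}")
```

By construction the expected values sum to the observed total of 25, so the comparison is only about how polymorphism is distributed among the three regions.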
Notice that there is substantially less divergence between D. melanogaster and D. simulans at the Adh locus than would be expected, based on the average level of divergence across the entire region (48 observed differences versus 76.9 expected). That’s consistent with the earlier observation that most amino acid substitutions are selected against. On the other hand, there is more nucleotide diversity within D. melanogaster than would be expected based on the levels of diversity seen across the entire region (14 polymorphic sites versus 10.8 expected). What gives?
Time for a trip down memory lane. Remember something called “coalescent theory”? It told us that for a sample of neutral genes from a population, the expected time back to a common ancestor for all of them is about \(4N_e\) generations for a nuclear gene in a diploid population. That means there have been about \(4N_e\) generations for mutations to occur. Suppose, however, that the electrophoretic polymorphism were being maintained by natural selection. Then we might well expect that it would be maintained for a lot longer than \(4N_e\) generations. If so, there would be a lot more time for diversity to accumulate. Thus, the excess diversity could be accounted for if there is balancing selection at Adh.
Kreitman and Hudson extended this approach by looking more carefully within the region to see where they could find differences between observed and expected levels of nucleotide sequence diversity. They used a “sliding window” of 100 silent base pairs in their calculations. By “sliding window” what they mean is that first they calculate statistics for bases 1-100, then for bases 2-101, then for bases 3-102, and so on until they hit the end of the sequence (Figure 1).
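The sliding-window bookkeeping itself is simple. Here is a minimal sketch, assuming the silent sites have already been reduced to a 0/1 vector marking which ones are polymorphic. The function name and the toy data are hypothetical, and Kreitman and Hudson's actual statistics (observed versus expected diversity in each window) involve more than this count.

```python
def sliding_window_counts(polymorphic, window=100):
    """Number of polymorphic sites in each window of `window` consecutive
    sites: sites 1..window, then 2..window+1, and so on to the end."""
    if window > len(polymorphic):
        return []
    current = sum(polymorphic[:window])
    counts = [current]
    for start in range(1, len(polymorphic) - window + 1):
        # Slide by one site: add the site entering, drop the site leaving.
        current += polymorphic[start + window - 1] - polymorphic[start - 1]
        counts.append(current)
    return counts


# Hypothetical indicator vector: 1 = polymorphic silent site, 0 = monomorphic.
toy_sites = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
window_counts = sliding_window_counts(toy_sites, window=4)
print(window_counts)
```

Because adjacent windows share all but one site, neighboring values are strongly correlated, which is why the resulting curve looks smooth.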
To me there are two particularly striking things about this figure. First, the position of the single nucleotide substitution responsible for the electrophoretic polymorphism is clearly evident. Second, the excess of polymorphism extends for only 200-300 nucleotides in each direction. That means that the rate of recombination within the gene is high enough to randomize the nucleotide sequence variation farther away.7
I’ve already mentioned the HapMap project, a collection of genotype data at roughly 3.2M SNPs in the human genome. The data in phase II of the project were collected from four populations:
Yoruba (Ibadan, Nigeria)
Japanese (Tokyo, Japan)
Han Chinese (Beijing, China)
Utah residents with ancestry from northern and western Europe (Utah, USA)
We expect genetic drift to result in allele frequency differences among populations, and we can summarize the extent of that differentiation at each locus with \(F_{ST}\). If all HapMap SNPs are selectively neutral,8 then all loci should have the same \(F_{ST}\) within the bounds of statistical sampling error and the evolutionary sampling due to genetic drift. A scan of human chromosome 7 reveals both a lot of variation in individual-locus estimates of \(F_{ST}\) and a number of loci where there is substantially more differentiation among populations than is expected by chance (Figure 2). At very fine genomic scales we can detect even more outliers (Figure 3), suggesting that human populations have been subject to divergent selection pressures at many different loci.
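To make the idea of a locus-by-locus scan concrete, here is a minimal sketch that computes \(F_{ST}\) at a single biallelic locus as the variance in allele frequency among populations divided by \(\bar p(1 - \bar p)\). This simple estimator ignores sampling error and weights populations equally; it is an illustration, not the estimator used in the HapMap analyses, and the allele frequencies are invented.

```python
def fst_from_frequencies(freqs):
    """Wright's F_ST at one biallelic locus from population allele
    frequencies, treating the frequencies as known without error."""
    n_pops = len(freqs)
    p_bar = sum(freqs) / n_pops
    if p_bar == 0.0 or p_bar == 1.0:
        return float("nan")  # locus is monomorphic overall; F_ST undefined
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n_pops
    return var_p / (p_bar * (1.0 - p_bar))


# Hypothetical allele frequencies in four populations at three loci.
loci = {
    "snp_a": [0.50, 0.52, 0.48, 0.50],  # little differentiation
    "snp_b": [0.10, 0.15, 0.12, 0.11],  # little differentiation
    "snp_c": [0.05, 0.90, 0.85, 0.10],  # candidate outlier
}
fst_by_locus = {name: fst_from_frequencies(f) for name, f in loci.items()}
for name, value in fst_by_locus.items():
    print(f"{name}: F_ST = {value:.3f}")
```

A genome scan applies a calculation like this to every SNP and then asks which loci fall far outside the bulk of the distribution.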
So far we’ve been comparing rates of synonymous and non-synonymous substitution to detect the effects of natural selection on molecular polymorphisms. Tajima proposed a method that builds on the foundation of the neutral theory of molecular evolution in a different way. I’ve already mentioned the infinite alleles model of mutation several times. When thinking about DNA sequences a closely related approximation is to imagine that every time a mutation occurs, it occurs at a different site.9 If we do that, we have an infinite sites model of mutation.
When dealing with nucleotide sequences in a population context there are two statistics of potential interest:
The number of nucleotide positions at which a polymorphism is found or, equivalently, the number of segregating sites, \(k\).
The average number of nucleotide differences between two sequences, \(\pi\), where \(\pi\) is estimated as \[\pi = \sum x_ix_j\delta_{ij} \quad .\] In this expression, \(x_i\) is the frequency of the \(i\)th haplotype and \(\delta_{ij}\) is the number of nucleotide sequence differences between haplotypes \(i\) and \(j\).10
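Both statistics are easy to compute from a set of aligned sequences. The sketch below assumes equal-length, gap-free sequences and reads the sum in the expression for \(\pi\) as running over all ordered pairs of haplotypes; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def segregating_sites(seqs):
    """k: number of alignment columns with more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def pi_from_haplotypes(seqs):
    """pi = sum_ij x_i x_j delta_ij over all ordered pairs of haplotypes."""
    n = len(seqs)
    counts = Counter(seqs)  # distinct haplotypes and their sample counts
    total = 0.0
    for hap_i, count_i in counts.items():
        for hap_j, count_j in counts.items():
            delta = sum(a != b for a, b in zip(hap_i, hap_j))
            total += (count_i / n) * (count_j / n) * delta
    return total


# Hypothetical sample of four aligned sequences.
sample = ["ACGT", "ACGT", "ACGA", "TCGA"]
k = segregating_sites(sample)
pi = pi_from_haplotypes(sample)
# Multiplying by n/(n-1) gives the mean difference over distinct pairs.
pi_distinct_pairs = pi * len(sample) / (len(sample) - 1)
print(k, pi, pi_distinct_pairs)
```

Written this way, \(\pi\) includes comparisons of a sequence with itself, which contribute zero. Rescaling by \(n/(n-1)\) removes them and gives the average over distinct pairs of sampled sequences.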
The quantity \(4N_e\mu\) comes up a lot in mathematical analyses of molecular evolution. Population geneticists, being a lazy bunch, get tired of writing that down all the time, so they invented the parameter \(\theta = 4N_e\mu\) to save themselves a little time.11 Under the infinite-sites model of DNA sequence evolution, it can be shown that \[\begin{aligned} \mbox{E}(\pi) &= \theta \\ \mbox{E}(k) &= \theta\sum_{i=1}^{n-1} \frac{1}{i} \quad , \end{aligned}\] where \(n\) is the number of haplotypes in your sample.12 This suggests that there are two ways to estimate \(\theta\), namely \[\begin{aligned} \hat \theta_\pi &= \hat \pi \\ \hat \theta_k &= \frac{k}{\sum_{i=1}^{n-1}\frac{1}{i}} \quad , \end{aligned}\] where \(\hat\pi\) is the average heterozygosity at nucleotide sites in our sample and \(k\) is the observed number of segregating sites in our sample.13 If the nucleotide sequence variation among our haplotypes is neutral and the population from which we sampled is in equilibrium with respect to drift and mutation, then \(\hat\theta_\pi\) and \(\hat\theta_k\) should be statistically indistinguishable from one another. In other words, \[\hat D = \frac{\hat\theta_\pi - \hat\theta_k}{\sqrt{\mbox{Var}(\hat\theta_\pi - \hat\theta_k)}}\] should be indistinguishable from zero.14 If it is either negative or positive, we can infer that there’s some departure from the assumptions of neutrality and/or equilibrium. Thus, \(\hat D\) can be used as a test statistic to assess whether the data are consistent with the population being at a neutral mutation-drift equilibrium. Consider the value of \(\hat D\) under the following scenarios:
If the variation is neutral and the population is at a drift-mutation equilibrium, then \(\hat D\) will be statistically indistinguishable from zero.
Overdominance will allow alleles belonging to the different classes to become quite divergent from one another. \(\delta_{ij}\) within each class will be small, but \(\delta_{ij}\) between classes will be large and both classes will be in intermediate frequency, leading to large values of \(\theta_\pi\). There won’t be a similar tendency for the number of segregating sites to increase, so \(\theta_k\) will be relatively unaffected. As a result, \(\hat D\) will be positive.
If the population has recently undergone a bottleneck, then \(\pi\) will be little affected unless the bottleneck was prolonged and severe.15 \(k\), however, may be substantially reduced. Thus, \(\hat D\) should be positive.
If there is purifying selection, mutations will occur and accumulate at silent sites, but they aren’t likely ever to become very common. Thus, there are likely to be lots of segregating sites, but not much heterozygosity, meaning that \(\hat\theta_k\) will be large, \(\hat\theta_\pi\) will be small, and \(\hat D\) will be negative.
Similarly, if the population has recently begun to expand, mutations that occur are unlikely to be lost, increasing \(\hat\theta_k\), but it will take a long time before they contribute to heterozygosity, \(\hat\theta_\pi\). Thus, \(\hat D\) will be negative.
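For readers who want to compute \(\hat D\) themselves, here is a minimal sketch. It scales \(\hat\theta_\pi - \hat\theta_k\) by an estimate of its standard deviation, using the constants from Tajima's 1989 paper, which are not derived in these notes. It assumes \(\pi\) is supplied as the mean number of differences over distinct pairs of sequences; the function name and the input values are hypothetical.

```python
import math


def tajimas_d(n, k, pi):
    """Tajima's D from sample size n, number of segregating sites k,
    and mean pairwise nucleotide differences pi."""
    if n < 4 or k == 0:
        # The variance estimate is zero for n < 4, and D is undefined
        # when there is no variation in the sample.
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_k = k / a1  # estimate of theta from segregating sites
    variance = e1 * k + e2 * k * (k - 1)  # estimated Var(pi - theta_k)
    return (pi - theta_k) / math.sqrt(variance)


# Hypothetical sample: 10 sequences, 16 segregating sites, pi = 3.89.
d_hat = tajimas_d(n=10, k=16, pi=3.888889)
print(f"Tajima's D = {d_hat:.3f}")
```

With these inputs \(\hat\theta_k\) exceeds \(\hat\theta_\pi\), so \(\hat D\) is negative, the pattern described above for purifying selection or recent population expansion.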
In short, \(\hat D\) provides a different avenue for insight into the evolutionary history of a particular nucleotide sequence. But interpreting it can be a little tricky.
\(\hat D \approx 0\): We have no evidence for changes in population size or for any particular pattern of selection at the locus.16
\(\hat D < 0\): The population size may be increasing or we may have evidence for purifying selection at this locus.
\(\hat D > 0\): The population may have suffered a recent bottleneck (or be decreasing) or we may have evidence for overdominant selection at this locus.
If we have data available for more than one locus, we may be able to distinguish changes in population size from selection at any particular locus. After all, all loci will experience the same demographic effects, but we might expect selection to act differently at different loci, especially if we choose to analyze loci with different physiological function.
A quick search in Google Scholar reveals that the paper in which Tajima described this approach has been cited over 15,000 times.17 Clearly it has been widely used for interpreting patterns of nucleotide sequence variation. Although it is a very useful statistic, Zeng et al. point out that there are important aspects of the data that Tajima’s \(D\) does not consider. As a result, it may be less powerful, i.e., less able to detect departures from neutrality, than some alternatives.
We’ve seen that both drift and natural selection can lead to allele frequency changes in a population. Is there any way to tell how much of the allele frequency change in a population is a result of natural selection?18 Well, to do so we need a set of allele frequencies measured at many loci at several different time steps. With that we can define \[\Delta p_{t} = p_{t+1} - p_{t} \quad ,\] where \(p_{t}\) is the frequency of one allele at a locus at time \(t\)19 and \[\mbox{Var}(\Delta p_{t})\] is the variance of \(\Delta p_t\) across loci. Buffalo and Coop point out that, because the total change \(p_t - p_0\) is the sum of the changes at each time step, \[\mbox{Var}(p_t - p_0) = \sum_{i=0}^{t-1}\mbox{Var}(\Delta p_i) + \sum_{i \ne j}\mbox{Cov}(\Delta p_i,\Delta p_j) \quad .\] If allele frequency changes are entirely neutral, then there won’t be any correlation between them at different time steps.20 If we have a way to estimate \(\mbox{Cov}(\Delta p_i, \Delta p_j)\) and it turns out to be different from zero, then we have evidence of selection.21 In a later paper Buffalo and Coop estimate that between 17 and 37 percent of the allele frequency change seen in a 10-generation, high-temperature selection experiment with ten replicate populations is due to natural selection.
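The variance decomposition follows from the fact that the total change \(p_t - p_0\) is the sum of the per-step changes \(\Delta p_i\). The sketch below checks that identity numerically on a small invented data set. The function names and allele frequencies are hypothetical, and a real analysis would also need to correct the variances and covariances for sampling noise, which this sketch ignores.

```python
def sample_cov(xs, ys):
    """Sample covariance across loci (the variance when xs is ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def temporal_decomposition(freqs):
    """Split Var(p_t - p_0) into per-step variances and covariances.
    `freqs` has one row per locus and one column per time point."""
    n_steps = len(freqs[0]) - 1
    # deltas[i] holds the change from time i to time i + 1 at every locus.
    deltas = [[row[i + 1] - row[i] for row in freqs] for i in range(n_steps)]
    total_change = [row[-1] - row[0] for row in freqs]
    var_total = sample_cov(total_change, total_change)
    var_sum = sum(sample_cov(d, d) for d in deltas)
    cov_sum = sum(
        sample_cov(deltas[i], deltas[j])
        for i in range(n_steps)
        for j in range(n_steps)
        if i != j
    )
    return var_total, var_sum, cov_sum


# Hypothetical allele frequencies at five loci over four time points.
toy_freqs = [
    [0.50, 0.52, 0.55, 0.59],
    [0.30, 0.28, 0.29, 0.27],
    [0.70, 0.73, 0.71, 0.74],
    [0.20, 0.22, 0.25, 0.24],
    [0.60, 0.57, 0.58, 0.55],
]
var_total, var_sum, cov_sum = temporal_decomposition(toy_freqs)
print(var_total, var_sum, cov_sum, cov_sum / var_total)
```

The ratio of the covariance term to the total variance is the kind of quantity that summarizes how much of the change is attributable to consistent directional pressure rather than to independent changes at each step.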
These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.