Detecting selection on nucleotide polymorphisms

Introduction

At this point, we’ve refined the neutral theory quite a bit. Our understanding of how molecules evolve now recognizes that some substitutions are more likely than others, but we’re still proceeding under the assumption that most nucleotide substitutions are neutral or detrimental. So far we’ve argued that variation like what Hubby and Lewontin  found is not likely to be maintained by natural selection. But we have strong evidence that heterozygotes for the sickle-cell allele are more fit than either homozygote in human populations where malaria is prevalent. That’s an example where selection is acting to maintain a polymorphism, not to eliminate it. Are there other examples? How could we detect them?

In the 1970s a variety of studies suggested that a polymorphism in the locus coding for alcohol dehydrogenase in Drosophila melanogaster might not only be subject to selection but that selection might be acting to maintain the polymorphism. As DNA sequencing became more practical at about the same time, population geneticists began to realize that comparative analyses of DNA sequences at protein-coding loci could provide a powerful tool for unraveling the action of natural selection. Synonymous sites within a protein-coding sequence provide a powerful standard of comparison. Regardless of

the synonymous positions within the sequence provide an internal control on the amount and pattern of differentiation that should be expected when substitutions are neutral. Thus, if we see different patterns of nucleotide substitution at synonymous and non-synonymous sites, we can infer that selection is having an effect on amino acid substitutions.

Nucleotide sequence variation at the Adh locus in Drosophila melanogaster

Kreitman took advantage of these ideas to provide additional insight into whether natural selection was likely to be involved in maintaining the polymorphism at Adh in Drosophila melanogaster. He cloned and sequenced 11 alleles at this locus, each a little less than 2.4 kb in length. If we restrict our attention to the coding region, a total of 765 bp, there were 6 distinct sequences that differed from one another at between 1 and 13 sites. Given the observed level of polymorphism within the gene, there should be 9 or 10 amino acid differences observed as well, but only one of the nucleotide differences results in an amino acid difference, the amino acid difference associated with the already recognized electrophoretic polymorphism. Thus, there is significantly less amino acid diversity than expected if nucleotide substitutions were neutral, consistent with my assertion that most mutations are deleterious and that natural selection will tend to eliminate them. In other words, another example of the “sledgehammer principle.”

Does this settle the question? Is the Adh polymorphism another example of allelic variants being neutral or selected against? Would I be asking these questions if the answer were “Yes”?

Kreitman and Aguadé

A few years after Kreitman’s study appeared, Kreitman and Aguadé published an analysis in which they looked at levels of nucleotide diversity in the Adh region, as revealed through analysis of RFLPs, in D. melanogaster and the closely related D. simulans. Why the comparative approach? Well, Kreitman and Aguadé remembered that the neutral theory of molecular evolution makes two predictions that are related to the underlying mutation rate:

Thus, if variation at the Adh locus in D. melanogaster is selectively neutral, the amount of divergence between D. melanogaster and D. simulans should be related to the amount of diversity within each. What they found instead is summarized in Table 1.

The expected level of diversity in each part of the Adh locus is calculated assuming that the probability of polymorphism is independent of what position in the locus we are examining. Specifically, Kreitman and Aguadé calculated the expected polymorphism as follows:

They used the same approach to calculate the expected divergence between D. melanogaster and D. simulans with one important exception. They directly compared the nucleotide sequence of one Adh allele from D. melanogaster with one Adh allele from D. simulans. As a result, they didn’t have to use the site equivalent correction. They could directly use the number of nucleotides in each region of the gene.

Table 1. Diversity and divergence in the Adh region of Drosophila (from ).

                   5’ flanking   Adh locus   3’ flanking
Diversity\(^1\)
  Observed              9           14            2
  Expected             10.8         10.8          3.4
Divergence\(^2\)
  Observed             86           48           31
  Expected             55           76.9         33.1

\(^1\)Number of polymorphic sites within D. melanogaster
\(^2\)Number of nucleotide differences between D. melanogaster and D. simulans
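One rough way to put a number on how far the observed counts in Table 1 sit from the expected counts is a Pearson chi-square statistic. The sketch below is only an illustration of that idea. It is not necessarily the test Kreitman and Aguadé used, and the function name `chi_square` is my own.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Counts from Table 1, ordered 5' flanking, Adh locus, 3' flanking.
diversity_obs = [9, 14, 2]
diversity_exp = [10.8, 10.8, 3.4]
divergence_obs = [86, 48, 31]
divergence_exp = [55, 76.9, 33.1]

diversity_stat = chi_square(diversity_obs, diversity_exp)
divergence_stat = chi_square(divergence_obs, divergence_exp)

# A larger statistic means a worse fit to the expected counts.
print(f"diversity:  {diversity_stat:.2f}")
print(f"divergence: {divergence_stat:.2f}")
```

The statistic only summarizes the size of the mismatch. The direction of each deviation still has to be read from the table itself.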

Notice that there is substantially less divergence between D. melanogaster and D. simulans at the Adh locus than would be expected, based on the average level of divergence across the entire region. That’s consistent with the earlier observation that most amino acid substitutions are selected against. On the other hand, there is more nucleotide diversity within D. melanogaster than would be expected based on the levels of diversity seen across the entire region. What gives?

Time for a trip down memory lane. Remember something called “coalescent theory?” It told us that for a sample of neutral genes from a population, the expected time back to a common ancestor for all of them is about \(4N_e\) generations for a nuclear gene in a diploid population. That means there have been about \(4N_e\) generations for mutations to occur. Suppose, however, that the electrophoretic polymorphism were being maintained by natural selection. Then we might well expect that it would be maintained for a lot longer than \(4N_e\) generations. If so, there would be a lot more time for diversity to accumulate. Thus, the excess diversity could be accounted for if there is balancing selection at Adh.

Kreitman and Hudson

Kreitman and Hudson extended this approach by looking more carefully within the region to see where they could find differences between observed and expected levels of nucleotide sequence diversity. They used a “sliding window” of 100 silent base pairs in their calculations. By “sliding window” what they mean is that first they calculate statistics for bases 1-100, then for bases 2-101, then for bases 3-102, and so on until they hit the end of the sequence (Figure 1).
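The sliding-window idea is easy to sketch in code. The version below is a minimal illustration, not Kreitman and Hudson’s actual procedure: it assumes aligned haplotypes of equal length given as strings, it uses every site in the window rather than only silent sites, and the function name `sliding_window_diversity` is hypothetical.

```python
def sliding_window_diversity(haplotypes, window=100):
    """Average pairwise differences per site in each window of `window` bases.

    Windows start at positions 0, 1, 2, ... (bases 1-100, 2-101, ... in
    1-based terms) and stop when the window reaches the end of the sequence.
    """
    n = len(haplotypes)
    length = len(haplotypes[0])
    if n < 2 or length < window:
        raise ValueError("need at least 2 haplotypes and length >= window")
    n_pairs = n * (n - 1) // 2

    # Number of haplotype pairs that differ at each site.
    per_site = []
    for pos in range(length):
        counts = {}
        for h in haplotypes:
            counts[h[pos]] = counts.get(h[pos], 0) + 1
        identical_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        per_site.append(n_pairs - identical_pairs)

    # Slide the window one base at a time, updating a running total.
    running = sum(per_site[:window])
    values = [running / (n_pairs * window)]
    for start in range(1, length - window + 1):
        running += per_site[start + window - 1] - per_site[start - 1]
        values.append(running / (n_pairs * window))
    return values
```

For example, `sliding_window_diversity(["AAAA", "AAAT", "AATT"], window=2)` returns one value for each of the three windows, with larger values toward the variable end of the alignment.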

Figure 1. Sliding window analysis of nucleotide diversity in the Adh-Adh-dup region of Drosophila melanogaster. The arrow marks the position of the single nucleotide substitution that distinguishes Adh-F from Adh-S (from )

To me there are two particularly striking things about this figure. First, the position of the single nucleotide substitution responsible for the electrophoretic polymorphism is clearly evident. Second, the excess of polymorphism extends for only 200-300 nucleotides in each direction. That means that the rate of recombination within the gene is high enough to randomize the nucleotide sequence variation farther away.

Detecting selection in the human genome

I’ve already mentioned the HapMap project, a collection of genotype data at roughly 3.2M SNPs in the human genome. The data in phase II of the project were collected from four populations:

We expect genetic drift to result in allele frequency differences among populations, and we can summarize the extent of that differentiation at each locus with \(F_{ST}\). If all HapMap SNPs are selectively neutral, then all loci should have the same \(F_{ST}\) within the bounds of statistical sampling error and the evolutionary sampling due to genetic drift. A scan of human chromosome 7 reveals both a lot of variation in individual-locus estimates of \(F_{ST}\) and a number of loci where there is substantially more differentiation among populations than is expected by chance (Figure 2). At very fine genomic scales we can detect even more outliers (Figure 3), suggesting that human populations have been subject to divergent selection pressures at many different loci.
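To make the idea of a single-locus \(F_{ST}\) concrete, here is a minimal sketch for one biallelic SNP. It assumes we already know the allele frequency in each population, weights the populations equally, and ignores the sample-size corrections that a real analysis would apply. The function name `fst` and the example frequencies are made up for illustration.

```python
def fst(freqs):
    """F_ST for one biallelic locus from per-population allele frequencies.

    Uses F_ST = (H_T - H_S) / H_T, where H_S is the mean expected
    heterozygosity within populations and H_T is the expected
    heterozygosity at the mean allele frequency.
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return float("nan")  # locus is monomorphic across all populations
    h_s = sum(2 * p * (1 - p) for p in freqs) / k
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies at three SNPs in four populations.
snps = {
    "snp_a": [0.50, 0.52, 0.48, 0.51],  # little differentiation
    "snp_b": [0.10, 0.35, 0.60, 0.90],  # strong differentiation
    "snp_c": [0.20, 0.25, 0.30, 0.22],
}
for name, freqs in snps.items():
    print(name, round(fst(freqs), 3))
```

Loci like the hypothetical `snp_b`, with an \(F_{ST}\) far above the rest, are the kind of outliers a genome scan looks for. Deciding how extreme is extreme enough requires a null distribution, which this sketch does not attempt.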

Figure 2. Single-locus estimates of \(F_{ST}\) along chromosome 7 in the HapMap data set. Blue dots denote outliers. Adjacent SNPs in this sample are separated, on average, by about 52 kb. (from )

Figure 3. Single-locus estimates of \(F_{ST}\) along a portion of chromosome 7 in the HapMap data set. Black dots denote outliers. Solid bars refer to previously identified genes. Adjacent SNPs in this sample are separated, on average, by about 1 kb. (from )

Tajima’s \(D\)

So far we’ve been comparing rates of synonymous and non-synonymous substitution to detect the effects of natural selection on molecular polymorphisms. Tajima proposed a method that builds on the foundation of the neutral theory of molecular evolution in a different way. I’ve already mentioned the infinite alleles model of mutation several times. When thinking about DNA sequences a closely related approximation is to imagine that every time a mutation occurs, it occurs at a different site. If we do that, we have an infinite sites model of mutation.

When dealing with nucleotide sequences in a population context there are two statistics of potential interest:

The quantity \(4N_e\mu\) comes up a lot in mathematical analyses of molecular evolution. Population geneticists, being a lazy bunch, get tired of writing that down all the time, so they invented the parameter \(\theta = 4N_e\mu\) to save themselves a little time. Under the infinite-sites model of DNA sequence evolution, it can be shown that \[\begin{aligned} \mbox{E}(\pi) &= \theta \\ \mbox{E}(k) &= \theta\sum_{i=1}^{n-1} \frac{1}{i} \quad , \end{aligned}\] where \(n\) is the number of haplotypes in your sample. This suggests that there are two ways to estimate \(\theta\), namely \[\begin{aligned} \hat \theta_\pi &= \hat \pi \\ \hat \theta_k &= \frac{k}{\sum_{i=1}^{n-1}\frac{1}{i}} \quad , \end{aligned}\] where \(\hat\pi\) is the average heterozygosity at nucleotide sites in our sample and \(k\) is the observed number of segregating sites in our sample. If the nucleotide sequence variation among our haplotypes is neutral and the population from which we sampled is in equilibrium with respect to drift and mutation, then \(\hat\theta_\pi\) and \(\hat\theta_k\) should be statistically indistinguishable from one another. In other words, \[\hat D = \frac{\hat\theta_\pi - \hat\theta_k}{\sqrt{\mbox{Var}(\hat\theta_\pi - \hat\theta_k)}}\] should be indistinguishable from zero. If it is either negative or positive, we can infer that there’s some departure from the assumptions of neutrality and/or equilibrium. Thus, \(\hat D\) can be used as a test statistic to assess whether the data are consistent with the population being at a neutral mutation-drift equilibrium. Consider the value of \(D\) under the following scenarios:

Neutral variation

If the variation is neutral and the population is at a drift-mutation equilibrium, then \(\hat D\) will be statistically indistinguishable from zero.

Overdominant selection

Overdominance will allow alleles belonging to the different classes to become quite divergent from one another. \(\delta_{ij}\) within each class will be small, but \(\delta_{ij}\) between classes will be large and both classes will be at intermediate frequency, leading to large values of \(\theta_\pi\). There won’t be a similar tendency for the number of segregating sites to increase, so \(\theta_k\) will be relatively unaffected. As a result, \(\hat D\) will be positive.

Population bottleneck

If the population has recently undergone a bottleneck, then \(\pi\) will be little affected unless the bottleneck was prolonged and severe. \(k\), however, may be substantially reduced. Thus, \(\hat D\) should be positive.

Purifying selection

If there is purifying selection, mutations will occur and accumulate at silent sites, but they aren’t likely ever to become very common. Thus, there are likely to be lots of segregating sites, but not much heterozygosity, meaning that \(\hat\theta_k\) will be large, \(\hat\theta_\pi\) will be small, and \(\hat D\) will be negative.

Population expansion

Similarly, if the population has recently begun to expand, mutations that occur are unlikely to be lost, increasing \(\hat\theta_k\), but it will take a long time before they contribute to heterozygosity, \(\hat\theta_\pi\). Thus, \(\hat D\) will be negative.

In short, \(\hat D\) provides a different avenue for insight into the evolutionary history of a particular nucleotide sequence. But interpreting it can be a little tricky.

\(\hat D = 0\):

We have no evidence for changes in population size or for any particular pattern of selection at the locus.

\(\hat D < 0\):

The population size may be increasing or we may have evidence for purifying selection at this locus.

\(\hat D > 0\):

The population may have suffered a recent bottleneck (or be decreasing) or we may have evidence for overdominant selection at this locus.

If we have data available for more than one locus, we may be able to distinguish changes in population size from selection at any particular locus. After all, all loci will experience the same demographic effects, but we might expect selection to act differently at different loci, especially if we choose to analyze loci with different physiological function.

A quick search in Google Scholar reveals that the paper in which Tajima described this approach has been cited over 15,000 times. Clearly it has been widely used for interpreting patterns of nucleotide sequence variation. Although it is a very useful statistic, Zeng et al. point out that there are important aspects of the data that Tajima’s \(D\) does not consider. As a result, it may be less powerful, i.e., less able to detect departures from neutrality, than some alternatives.
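For anyone who wants to see the arithmetic behind \(\hat D\), here is a minimal sketch. It assumes aligned haplotypes of equal length given as strings, it measures both \(\hat\theta_\pi\) and \(\hat\theta_k\) per sequence rather than per site, and it uses the standard constants from Tajima’s variance estimate for the denominator. The function name `tajimas_d` is my own, and the example data are made up.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of aligned, equal-length haplotype strings."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 haplotypes for a positive variance")
    length = len(haplotypes[0])

    # k: number of segregating (polymorphic) sites in the sample.
    k = sum(1 for pos in range(length)
            if len({h[pos] for h in haplotypes}) > 1)
    if k == 0:
        return None  # D is undefined without any segregating sites

    # pi: average number of differences between pairs of haplotypes.
    n_pairs = n * (n - 1) // 2
    total_diffs = sum(sum(a != b for a, b in zip(h1, h2))
                      for h1, h2 in combinations(haplotypes, 2))
    pi = total_diffs / n_pairs

    # Constants for the estimated variance of (theta_pi - theta_k).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    theta_pi = pi
    theta_k = k / a1
    return (theta_pi - theta_k) / sqrt(e1 * k + e2 * k * (k - 1))


# Two divergent classes at intermediate frequency (overdominance-like).
print(tajimas_d(["AAAA", "AAAA", "TTTT", "TTTT"]))
# Every variant is a singleton (expansion- or purifying-selection-like).
print(tajimas_d(["TAAA", "ATAA", "AATA", "AAAT"]))
```

The first toy sample gives a positive value and the second a negative one, matching the scenarios described above. Samples this small say nothing about statistical significance.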

How much genetic change is due to selection?

We’ve seen that both drift and natural selection can lead to allele frequency changes in a population. Is there any way to tell how much of the allele frequency change in a population is a result of natural selection? Well, to do so we need a set of allele frequencies measured at many loci at several different time steps. With that we can define \[\Delta p_{t} = p_{t+1} - p_{t} \quad ,\] where \(p_{t}\) is the frequency of one allele at time \(t\), and \[\mbox{Var}(\Delta p_{t})\] is the variance of \(\Delta p_t\) across loci. Because the total change over \(t\) time steps is the sum of the single-step changes, \(p_t - p_0 = \sum_{i=0}^{t-1}\Delta p_i\), Buffalo and Coop point out that \[\mbox{Var}(p_t - p_0) = \sum_{i=0}^{t-1}\mbox{Var}(\Delta p_i) + \sum_{i \ne j}\mbox{Cov}(\Delta p_i,\Delta p_j) \quad .\] If allele frequency changes are entirely neutral, then there won’t be any correlation between them at different time steps. If we have a way to estimate \(\mbox{Cov}(\Delta p_i, \Delta p_j)\) and it turns out to be different from zero, then we have evidence of selection. In a later paper Buffalo and Coop estimate that between 17 and 37 percent of the allele frequency change seen in a 10-generation, high-temperature selection experiment with ten replicate populations is due to natural selection.
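The variance decomposition is easy to check numerically. The sketch below assumes allele frequencies stored as one list per locus with one entry per time point, and it computes plain sample variances and covariances across loci. It leaves out the sampling-noise corrections that a real analysis like Buffalo and Coop’s needs, and the function names and example frequencies are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def change_covariances(freqs):
    """Matrix C with C[i][j] = Cov(delta p_i, delta p_j) across loci.

    freqs[l][t] is the allele frequency at locus l and time point t.
    """
    n_steps = len(freqs[0]) - 1
    deltas = [[row[t + 1] - row[t] for row in freqs] for t in range(n_steps)]
    return [[covariance(deltas[i], deltas[j]) for j in range(n_steps)]
            for i in range(n_steps)]


def covariance_share(cov):
    """Fraction of Var(p_t - p_0) contributed by the off-diagonal covariances."""
    total = sum(sum(row) for row in cov)
    diagonal = sum(cov[i][i] for i in range(len(cov)))
    return (total - diagonal) / total


# Made-up frequencies at four loci over three time points.
freqs = [
    [0.10, 0.15, 0.22],
    [0.50, 0.48, 0.47],
    [0.30, 0.36, 0.41],
    [0.80, 0.78, 0.79],
]
cov = change_covariances(freqs)
total_change = [row[-1] - row[0] for row in freqs]

# Diagonal plus off-diagonal terms add up to Var(p_t - p_0).
print(sum(sum(row) for row in cov), covariance(total_change, total_change))
print(covariance_share(cov))
```

The diagonal of `cov` holds the \(\mbox{Var}(\Delta p_i)\) terms and the off-diagonal entries hold the \(\mbox{Cov}(\Delta p_i, \Delta p_j)\) terms, so a covariance share well away from zero is the kind of signal the text describes.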

Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.