Population genomics

Introduction

In the past decade, the development of high-throughput methods for genomic sequencing (next-generation sequencing: NGS) have revolutionized how many geneticists collect data. It is now possible to produce so much data so rapidly that simply storing and processing the data poses great challenges . The Nekrutenko and Taylor review  doesn’t even discuss the new challenges that face population geneticists and evolutionary biologists as they start to take advantage of those tools, nor did it discuss the promise these data hold for providing new insight into long-standing questions, but the challenges and the promise are at least as great as those they do describe.

To some extent the most important opportunity provided by NGS sequencing is simply that we now have a lot more data to answer the same questions. For example, using a technique like RAD sequencing  or genotyping-by-sequencing (GBS: ), it is now possible to identify thousands of polymorphic SNP markers in non-model organisms, even if you don’t have a reference genome available. As we’ve seen several times this semester, the variance associated with drift is enormous. Many SNPs identified through RAD-Seq or GBS are likely to be independently inherited. Thus, the amount and pattern of variation at each locus will represent an independent sample from the underlying evolutionary process. As a result, we should be able to get much better estimates of fundamental parameters like \(\theta=4N_e\mu\), \(M=4N_em\), and \(R=4N_er\) and to have much greater power to discriminate among different evolutionary scenarios. Willing et al. , for example, present simulations suggesting that accurate estimates of \(F_{ST}\) are possible with sample sizes as small as 4–6 individuals per population, so long as the number of markers used for inference is greater than 1000.

A quick overview of NGS methods

I won’t review the chemistry used for next-generation sequencing. It changes very rapidly, and I can’t keep up with it. Suffice it to say that 454 Life Sciences, Illumina, PacBio, and probably other companies I don’t know about each have different approaches to very high throughput DNA sequencing. What they all have in common is that the whole genome is broken into small fragments and sequnced and that a single run through the machine produces an enormous amount of data, 134-6000 Gb and up to 20 billion readsfrom a NovaSeq 600 for example (https://www.illumina.com/systems/sequencing-platforms/comparison-tool.html; accessed 30 December 2018).1

RAD sequencing

Baird et al.  introduced RAD a little over a decade ago. One of its great attraction for evolutionary geneticists is that RAD-seq can be used in any organism from which you can extract DNA and the laboratory manipulations are relatively straightforward.

The resulting library consists of sequences within a relatively small distance from restriction sites.

Genotyping-by-sequencing

Genotyping-by-sequencing (GBS) is a similar approach.

Once an investigator has her sequenced fragments back, she can either map the fragments back to a reference genome or she can assemble the fragments into orthologous sequences de novo. I’m not going to discuss either of those processes, but you can imagine that there’s a lot of bioinformatic processing going on. What I want to focus on is what you do with the data and how you interpret it.

Next-generation phylogeography

The American pitcher plant mosquito Wyeomyia smithii has been extensively studied for many years. It’s a model organism for ecology, but its genome has not been sequenced. An analysis of COI from 20 populatins and two outgroups produced the set of relationships you see in Figure 1 . As you can see, this analysis allows us to distinguish a northern group of populations from a southern group of populations, but it doesn’t provide us any reliable insight into finer scale relationships.

Maximum-likelihood phylogenetic tree depicting relationshps among populations of W. smithii relative to the outgroups W. vanduzeei and W. mitchelli (from ).

Using the same set of samples, the authors used RAD sequencing to identify 3741 SNPs. That’s more than 20 times the number of variable sites found in COI, 167. Not surprisingly, the large number of additional sites allowed the authors to produce a much more highly resolved phylogeny (Figure 2). With this phylogeny it’s easy to see that southern populations are divided into two distinct groups, those from North Carolina and those from the Gulf Coast. Similarly, the northern group of populations is subdivided into those from the Appalachians in North Carolina, those from the mid-Atlantic coast, and those from further north. The glacial history of North America means that both the mid-Atlantic populations and the populations farther north must have been derived from one or more southern populations after the height of the last glaciation. Given the phylogenetic relationships recovered here, it seems clear that they are most closely related to populations in the Appalachians of North Carolina.

A. Geographical distribution of samples included in the analysis. B. Phylogenetic relationship of samples included in the analysis.

That’s the promise of NGS for population genetics. What are the challenges? Funny you should ask.

Estimates of nucleotide diversity2

Beyond the simple challenge of dealing with all of the short DNA fragments that emerge from high-throughput sequencing, there are at least two challenges that don’t arise with data obtained in more traditional ways.

  1. Most studies involve “shotgun” sequencing of entire genomes. In large diploid genomes, this leads to variable coverage. At sites where coverage is low, there’s a good chance that all of the reads will be derived from only one of the two chromosomes present, and a heterozygous individual will be scored as homozygous. “Well,” you might say, “let’s just throw away all of the sites that don’t have at least 8\(\times\) coverage.”3 That would work, but you would also be throwing out a lot of potentially valuable information.4 It seems better to develop an approach that lets us use all of the data we collect.

  2. Sequencing errors are more common with high-throughput methods than with traditional methods, and since so much data is produced, it’s not feasible to go back and resequence apparent polymorphisms to see if they reflect sequencing error rather than real differences. Quality scores can be used, but they only reflect the quality of the reads from the sequencing reaction, not errors that might be introduced during sample preparation. Again, we might focus on high-coverage sites and ignore “polymorphisms” associated with single reads, but we’d be throwing away a lot of information.

A better approach than setting arbitrary thresholds and throwing away data is to develop an explicit model of how errors can arise during sequencing and to use that model to interpret the data we’ve collected. That’s precisely the approach that Lynch  adopts. Here’s how it works assuming that we have a sample from a single, diploid individual:

Notice that this genome-wide estimate of nucleotide diversity is obtained from a sample derived from a single diploid individual. Lynch  develops similar methods for estimating gametic disequilibrium as a function of genetic distance for a sample from a single diploid individual. He also extends that method to samples from a pair of individuals, and he describes how to estimate mutation rates by comparing sequences derived from individuals in mutation accumulation lines with consensus sequences.8

Haubold et al.  describe a program implementing these methods. Recall that under the infinite sites model of mutation \(\pi = 4N_e\mu\). They analyzed data sets from the sea squirt Ciona intestinalis and the water flea Daphnia pulex (Table 1). Notice that the sequencing error rate in D. pulex is indistinguishable from the nucleotide diversity.

Estimates of nucleotide diversity and sequencing error rate in Cionia intestinalis and Daphnia pulex (results from ).
Taxon \(4N_e\mu\) \(4N_e\mu\) (low coverage) \(\epsilon\)
Cionia intestinalis 0.0111 0.012 0.00113
Daphnia pulex 0.0011 0.0012 0.00121

Next-generation AMOVA9

What we’ve discussed so far gets us estimates of some population parameters (\(4N_e\mu\), \(4N_er\)), but they’re derived from the sequences in a single diploid individual. That’s not much of a population sample, and it certainly doesn’t tell us anything about how different populations are from one another. Gompert and Buerkle  describe an approach to estimate statistics very similar to \(\Phi_{ST}\) from AMOVA. Since they take a Bayesian approach to developing their estimates, they refer to approach as BAMOVA, Bayesian models for analysis of molecular variance. They propose several related models.

Then, as we did way back when we used a Bayesian appraach to estimate \(F_{ST}\) , we put a prior on the \(p_i\) and the parameters of this prior are defined in terms of \(\Phi_{ST}\) (among other things).12 They also propose a method for detecting SNP loci13 that have unusually large or small values of \(\Phi_{ST}\).

BAMOVA example

Gompert and Buerkle  used data derived from two different human population data sets:

In analysis of the first data set, they estimated \(\Phi_{ST}=0.08\). Three loci were identified as having unusually high values of \(\Phi_{ST}\).

In analysis of the 33-population data set, they found similar values of \(\Phi_{ST}\) on each chromosome, ranging from 0.083 (0.075, 0.091) on chromosome 22 to 0.11 (0.10, 0.12) on chromosome 16. \(\Phi_{ST}\) for the X chromosome was marginally higher: 0.14 (0.13,0.15). They detected 569 outlier loci, 518 were high outliers and 51 were low outliers. Several of the loci they detected as outliers had been previously identified as targets of selection. The loci they identified as candidates for balancing selection have not been suggested before as targets of such selection.

Estimating population structure

In addition to \(F_{ST}\) we saw that a principal components analysis of genetic data might sometimes be useful. Fumagalli et al.  develop a method for PCA that, like Lynch’s  method for estimating nucleotide diversity, uses all of the information available in NGS data rather than imposing an artificial threshold for calling genotypes. They calculate the pairwise entries of the covariance matrix by integrating across the genotype probabiity at each site as part of the calculation and weighting the contribution of each site to the analysis by the probability that it is variable.14 As shown in Figure 3 this approach to PCA recovers the structure much better than approaches that simply call genotypes at each locus, whether or not outliers are excluded. The authors also describe approaches to estimating \(F_{ST}\) that take account of the characteristics of NGS data. Their software (ANGSD: http://popgen.dk/wiki/index.php/ANGSD) implements these and other useful statistical analysis tools for next-generation sequencing data, including Tajima’s D. They also provide NgsAdmix for Structure-like analyses of NGS data (http://www.popgen.dk/software/index.php/NgsAdmix).

The “true genotypes” PCA is based on the actual, simulated genotypes (20 individuals in each population, 10,000 sites in the sample with 10% variable; \(F_{ST}\) between the purple population and either the red or the green population was 0.4 and between the green and red populations was 0.15; and coverage was simulated at \(2\times\) (from ).

Genetic structure of human populations in Great Britain

As we’ve seen several times in this course, the amount of genetic data available on humans is vastly greater than what is available for any other organism. As a result, it’s possible to use these data to gain unusually deep insight into the recent history of many human populations. Today’s example comes from Great Britain, courtesy of a very large consortium 

Data

Results

Very little evidence of population structure within British sample

Individual assignment analysis of genotypes used fineSTRUCTURE, which uses the same principle as STRUCTURE but models the correlations among SNPs resulting from gametic disequilibrium, rather than treating each locus as being independently inherited. The analysis is on haplotypes rather than on alleles. In addition, it clusters populations hierarchically (Figure 4

fineSTRUCTURE analysis of genotypes from Great Britain (from ).

Analysis of the European data identifies 52 groups. The authors used Chromopainter to construct each of the haplotypes detected in their sample of 2039 individuals from the UK as a mosaic of haplotypes derived from those found in their sample of 6209 individuals from continental Europe. Since they know (a) the UK cluster to which each UK individual belongs and (b) the European group from which each individual contributing to the UK mosaic belongs they can estimate (c) the proportion of ancestry for each UK cluster derived from each European group. The results are shown in Figure 5.

European ancestry of the 17 clusters identified in the UK (from ).

Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.


  1. In NGS applications for phylogeny, a strategy of targeted enrichment is often used. In this approach, pre-identified parts of the genome are “baited” using primers and those parts of the genome are enriched through PCR before the sequencing library is constructed . By the way, when I taught this course two years ago the Illumina HiSeq X Ten produced the most data from a single run, up to 1800 Gb and 3-6 billion reads. That means the volume of sequence you can produce in a single run has tripled in two years.↩︎

  2. This section draws heavily on ↩︎

  3. If both chromosomes have an equal probability of being sequenced, the probability that one of them is missed with 8\(\times\) coverage is \((1/2)^8 = 1/256\).↩︎

  4. It’s valuable information, providing you know how to deal with in properly.↩︎

  5. It wouldn’t be hard, conceptually, to allow different nucleotides to have different error rates, e.g., \(\epsilon_A\), \(\epsilon_C\), \(\epsilon_G\), \(\epsilon_T\), but the notation would get really complicated, so we won’t bother trying to show how differential error rates can be accommodated.↩︎

  6. This expression looks a little different from the one in , but I’m pretty sure it’s equivalent.↩︎

  7. Ask me for details if you’re interested.↩︎

  8. Mutation accumulation lines are lines propagated through several (sometimes up to hundreds) of generations in which population sizes are repeatedly reduced to one or a few individuals, allowing drift to dominate the dynamics and alleles to “accumulate” with little regard to their fitness effects.↩︎

  9. This section depends heavily on ↩︎

  10. Or that they’ve already been corrected. We don’t care how they might have been corrected. We care only that we can assume that the reads we get from a sequencing run faithfully reflect the sequences present on each of the chromosomes.↩︎

  11. The actual model they use is a bit more complicated than this, but the principles are the same.↩︎

  12. Again, the actual model is a bit more complicated than what I’m describing here, but the principle is the same.↩︎

  13. Or sets of SNP loci that are parts of a single contig.↩︎

  14. See  for details.↩︎