Population genomics

Introduction

In the past 15 years, the development of high-throughput methods for genomic sequencing has revolutionized how geneticists collect data. It is now possible to produce so much data so rapidly that simply storing and processing the data poses great challenges. Nekrutenko and Taylor don’t even discuss the new challenges that face population geneticists and evolutionary biologists as they start to take advantage of those tools, nor do they discuss the promise these data hold for providing new insight into long-standing questions, but the challenges and the promise are at least as great as those they do describe.

To some extent the most important opportunity provided by high-throughput sequencing is simply that we now have a lot more data to answer the same questions. For example, using a technique like RAD sequencing or genotyping-by-sequencing (GBS: ), it is now possible to identify thousands of polymorphic SNP markers in non-model organisms, even if you don’t have a reference genome available. And as the cost of sequencing continues to decline, low-coverage whole genome sequencing is becoming more widely used and providing even more detailed genomic data in organisms with a reference genome available. As we’ve seen several times this semester, the variance associated with drift is enormous. Many SNPs identified through RAD-seq or GBS are likely to be independently inherited. Thus, the amount and pattern of variation at each locus will represent an independent sample from the underlying evolutionary process. As a result, we should be able to get much better estimates of fundamental parameters like \(\theta=4N_e\mu\), \(M=4N_em\), and \(R=4N_er\) and to have much greater power to discriminate among different evolutionary scenarios. By averaging estimates across thousands of loci, our estimates of \(\theta = 4N_e\mu\) and \(M = 4N_em\), for example, are likely to be much more precise, and because we have a (mostly) neutral background from which to make those estimates, we may be able to identify genetic markers with “unusual” patterns reflecting a unique history of selection. Willing et al. , for example, present simulations suggesting that accurate estimates of \(F_{ST}\) are possible with sample sizes as small as 4–6 individuals per population, so long as the number of markers used for inference is greater than 1000.
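To see why thousands of independently inherited markers buy so much precision, here is a small simulation sketch in the spirit of the Willing et al. result, though it does not reproduce their simulations. It assumes a Balding–Nichols-style model of differentiation and a Hudson-style \(F_{ST}\) estimator; the sample sizes, number of loci, and function names are illustrative choices of mine, not taken from any published analysis.

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate_counts(n_loci, n_diploids, fst):
        """Simulate unlinked SNPs in two populations under a Balding-Nichols
        model; return Hudson-style per-locus numerators and denominators."""
        p_anc = rng.uniform(0.1, 0.9, n_loci)          # ancestral frequencies
        a = p_anc * (1 - fst) / fst
        b = (1 - p_anc) * (1 - fst) / fst
        p1, p2 = rng.beta(a, b), rng.beta(a, b)        # population frequencies
        n = 2 * n_diploids                             # alleles sampled per population
        p1_hat = rng.binomial(n, p1) / n               # sample frequencies
        p2_hat = rng.binomial(n, p2) / n
        num = ((p1_hat - p2_hat) ** 2
               - p1_hat * (1 - p1_hat) / (n - 1)
               - p2_hat * (1 - p2_hat) / (n - 1))
        den = p1_hat * (1 - p2_hat) + p2_hat * (1 - p1_hat)
        return num, den

    num, den = simulate_counts(n_loci=5000, n_diploids=5, fst=0.10)
    single_locus = num / np.where(den > 0, den, np.nan)   # very noisy
    multilocus = num.sum() / den.sum()                    # "ratio of averages"
    print("sd of single-locus estimates:", round(float(np.nanstd(single_locus)), 3))
    print("multilocus estimate:", round(float(multilocus), 3))

With only five diploid individuals per population, individual loci scatter widely, but the multilocus ratio-of-averages estimate is typically close to the simulated value of 0.10.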

A quick overview of high-throughput sequencing methods

I won’t review the chemistry used for high-throughput sequencing. It changes very rapidly, and I can’t keep up with it. Suffice it to say that 454 Life Sciences, Illumina, PacBio, and other companies I don’t know about each have different approaches to very high throughput DNA sequencing. In particular, there are several reduced representation sequencing methods that are widely used in organisms without reference genomes. What they all have in common is that the genome is broken into small fragments, the fragments are sequenced, and SNPs are called without aligning the reads to a reference. Whether using a reduced representation method or low-coverage whole genome sequencing, a tremendous amount of data is available, up to 380Gb and up to 1.2 billion reads from a single run on an Illumina NextSeq 2000, for example (https://www.illumina.com/systems/sequencing-platforms.html ; accessed 5 November 2023).

RAD sequencing

Baird et al. introduced RAD sequencing about 15 years ago. One of its great attractions for evolutionary geneticists is that RAD-seq can be used in any organism from which you can extract DNA, and the laboratory manipulations are relatively straightforward.

In outline, genomic DNA is digested with a restriction enzyme, barcoded adapters are ligated onto the cut fragments, and the fragments are then sheared, size-selected, and sequenced. The resulting library consists of sequences within a relatively small distance from restriction sites.

Genotyping-by-sequencing

Genotyping-by-sequencing (GBS) is a similar approach. It also uses restriction enzymes to sample a reduced, reproducible portion of the genome, but with a simpler and less expensive library preparation.

Once an investigator has her sequenced fragments back, she can either map the fragments to a reference genome or assemble them into orthologous sequences de novo. I’m not going to discuss either of those processes, but you can imagine that there’s a lot of bioinformatic processing going on. What I want to focus on is what you do with the data and how you interpret them.

High-resolution phylogeography

The American pitcher plant mosquito Wyeomyia smithii has been studied extensively for many years. It’s a model organism for ecology, but its genome has not been sequenced. An analysis of COI from 20 populations and two outgroups produced the set of relationships you see in Figure 1. As you can see, this analysis allows us to distinguish a northern group of populations from a southern group of populations, but it doesn’t provide any reliable insight into finer-scale relationships.

Maximum-likelihood phylogenetic tree depicting relationships among populations of W. smithii relative to the outgroups W. vanduzeei and W. mitchelli (from ).

From the same set of samples, the authors used RAD sequencing to identify 3741 SNPs. That’s more than 20 times the number of variable sites found in COI. Not surprisingly, the large number of additional sites allowed the authors to produce a much more highly resolved phylogeny (Figure 2). With this phylogeny it’s easy to see that southern populations are divided into two distinct groups, those from North Carolina and those from the Gulf Coast. Similarly, the northern group of populations is subdivided into those from the Appalachians in North Carolina, those from the mid-Atlantic coast, and those from further north. The glacial history of North America means that both the mid-Atlantic populations and the populations farther north must have been derived from one or more southern populations after the height of the last glaciation. Given the phylogenetic relationships recovered here, it seems clear that they are most closely related to populations in the Appalachians of North Carolina.

A. Geographical distribution of samples included in the analysis. B. Phylogenetic relationship of samples included in the analysis.

That’s the promise of high-throughput sequencing for population genetics. What are the challenges? Funny you should ask.

Estimates of nucleotide diversity

Beyond the simple challenge of dealing with all of the short DNA fragments that emerge from high-throughput sequencing, there are at least two challenges that don’t arise with data obtained in more traditional ways.

  1. Most studies involve “shotgun” sequencing of entire genomes. In large diploid genomes, this leads to variable coverage. At sites where coverage is low, there’s a good chance that all of the reads will be derived from only one of the two chromosomes present, and a heterozygous individual will be scored as homozygous. “Well,” you might say, “let’s just throw away all of the sites that don’t have at least 8\(\times\) coverage.” That would work, but you would also be throwing out a lot of potentially valuable information. It seems better to develop an approach that lets us use all of the data we collect.

  2. Sequencing errors are more common with high-throughput methods than with traditional methods, and since so much data is produced, it’s not feasible to go back and resequence apparent polymorphisms to see if they reflect sequencing error rather than real differences. Quality scores can be used, but they only reflect the quality of the reads from the sequencing reaction, not errors that might be introduced during sample preparation. Again, we might focus on high-coverage sites and ignore “polymorphisms” associated with single reads, but we’d be throwing away a lot of information.

A better approach than setting arbitrary thresholds and throwing away data is to develop an explicit model of how errors can arise during sequencing and to use that model to interpret the data we’ve collected. That’s precisely the approach that Lynch adopts. Here’s the basic idea, assuming that we have a sample from a single diploid individual:
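At each site, the reads are modeled as a mixture: with probability \(\pi\) the individual is heterozygous at that site, otherwise homozygous, and reads disagree with the underlying base only through a per-read error rate \(\epsilon\). Multiplying these per-site likelihoods across the genome and maximizing jointly over \(\pi\) and \(\epsilon\) gives estimates of nucleotide diversity and the error rate at the same time. The code below is a deliberately simplified sketch of that logic, not Lynch’s actual likelihood; the toy read counts and function names are invented for illustration.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import binom

    def neg_log_lik(params, n_reads, k_major):
        """Negative log-likelihood of read counts at each site.
        params = (pi, eps): heterozygosity and per-read error rate."""
        pi, eps = params
        # homozygous site: a read mismatches the true base only by error
        lik_hom = binom.pmf(k_major, n_reads, 1.0 - eps)
        # heterozygous site: each read drawn from one of the two alleles with
        # probability 1/2 (errors at het sites are ignored in this sketch)
        lik_het = binom.pmf(k_major, n_reads, 0.5)
        lik = (1.0 - pi) * lik_hom + pi * lik_het
        return -np.sum(np.log(lik + 1e-300))

    # toy data: total coverage and reads matching the commonest base per site
    n_reads = np.array([4, 7, 3, 10, 6, 8, 5, 9, 12, 6])
    k_major = np.array([4, 6, 3,  5, 6, 8, 5, 9, 11, 3])

    fit = minimize(neg_log_lik, x0=(0.01, 0.01), args=(n_reads, k_major),
                   bounds=[(1e-6, 0.5), (1e-6, 0.2)], method="L-BFGS-B")
    pi_hat, eps_hat = fit.x
    print(f"estimated heterozygosity {pi_hat:.4f}, error rate {eps_hat:.4f}")

Summing the log-likelihood over millions of sites and maximizing over \(\pi\) and \(\epsilon\) is what lets low-coverage sites contribute information instead of being discarded.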

Notice that this genome-wide estimate of nucleotide diversity is obtained from a sample derived from a single diploid individual. Lynch develops similar methods for estimating gametic disequilibrium as a function of genetic distance, again from a single diploid individual. He also extends that method to samples from a pair of individuals, and he describes how to estimate mutation rates by comparing sequences derived from individuals in mutation accumulation lines with consensus sequences.

Haubold et al.  describe a program implementing these methods. Recall that under the infinite sites model of mutation \(\pi = 4N_e\mu\). They analyzed data sets from the sea squirt Ciona intestinalis and the water flea Daphnia pulex (Table 1). Notice that the sequencing error rate in D. pulex is indistinguishable from the nucleotide diversity.

Estimates of nucleotide diversity and sequencing error rate in Ciona intestinalis and Daphnia pulex (results from ).

Taxon                  \(4N_e\mu\)   \(4N_e\mu\) (low coverage)   \(\epsilon\)
Ciona intestinalis     0.0111        0.012                        0.00113
Daphnia pulex          0.0011        0.0012                       0.00121

AMOVA from high-throughput sequencing

What we’ve discussed so far gets us estimates of some population parameters (\(4N_e\mu\), \(4N_er\)), but they’re derived from the sequences in a single diploid individual. That’s not much of a population sample, and it certainly doesn’t tell us anything about how different populations are from one another. Gompert and Buerkle  describe an approach to estimate statistics very similar to \(\Phi_{ST}\) from AMOVA. Since they take a Bayesian approach to developing their estimates, they refer to the approach as BAMOVA, Bayesian models for analysis of molecular variance. They propose several related models.

Then, we put a prior on the population allele frequencies \(p_i\), and the parameters of this prior are defined in terms of \(\Phi_{ST}\) (among other things). They also propose a method for detecting SNP loci that have unusually large or small values of \(\Phi_{ST}\).
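To make the idea concrete, here is a hedged sketch of one standard way such a prior can be parameterized, a Balding–Nichols-style \(F\)-model in which the variance of allele frequencies among populations equals \(\Phi_{ST}\,\pi(1-\pi)\). This is for illustration only; it is not claimed to be Gompert and Buerkle’s exact parameterization, and the function and variable names are made up.

    import numpy as np

    rng = np.random.default_rng(42)

    def draw_population_freqs(pi_anc, phi_st, n_pops):
        """Draw per-population allele frequencies at one locus from a Beta
        prior whose among-population variance is phi_st * pi_anc * (1 - pi_anc)."""
        a = pi_anc * (1.0 - phi_st) / phi_st
        b = (1.0 - pi_anc) * (1.0 - phi_st) / phi_st
        return rng.beta(a, b, size=n_pops)

    p = draw_population_freqs(pi_anc=0.3, phi_st=0.08, n_pops=4)
    # observed counts of the reference allele in population i could then be
    # modeled as Binomial(coverage, p[i]) draws
    print(p)

Given read data, posterior inference runs in the other direction: the allele frequencies and \(\Phi_{ST}\) are estimated jointly rather than drawn from fixed values.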

BAMOVA example

Gompert and Buerkle  used data derived from two different human population data sets:

In analysis of the first data set, they estimated \(\Phi_{ST}=0.08\). Three loci were identified as having unusually high values of \(\Phi_{ST}\).

In the analysis of the 33-population data set, they found similar values of \(\Phi_{ST}\) on each chromosome, ranging from 0.083 (0.075, 0.091) on chromosome 22 to 0.11 (0.10, 0.12) on chromosome 16. \(\Phi_{ST}\) for the X chromosome was marginally higher: 0.14 (0.13, 0.15). They detected 569 outlier loci: 518 were high outliers and 51 were low outliers. Several of the loci they detected as outliers had previously been identified as targets of selection. The loci they identified as candidates for balancing selection, however, had not previously been suggested as targets of such selection.

Estimating population structure

In addition to \(F_{ST}\), we saw that a principal components analysis of genetic data can sometimes be useful. Fumagalli et al.  develop a method for PCA that, like Lynch’s method for estimating nucleotide diversity, uses all of the information available in high-throughput sequencing data rather than imposing an artificial threshold for calling genotypes. They calculate the entries of the covariance matrix by integrating over the genotype probabilities at each site and weighting the contribution of each site by the probability that it is variable. As shown in Figure 3, this approach to PCA recovers the structure much better than approaches that simply call genotypes at each locus, whether or not outliers are excluded. The authors also describe approaches to estimating \(F_{ST}\) that take account of the characteristics of high-throughput sequencing data. Their software (ANGSD: http://www.popgen.dk/angsd/index.php/ANGSD) implements these and other useful statistical tools for the analysis of next-generation sequencing data, including Tajima’s D. They also provide NgsAdmix for Structure-like analyses of NGS data (http://www.popgen.dk/software/index.php/NgsAdmix).

The “true genotypes” PCA is based on the actual, simulated genotypes (20 individuals in each population; 10,000 sites in the sample with 10% variable; \(F_{ST}\) between the purple population and either the red or the green population was 0.4 and between the green and red populations was 0.15; coverage was simulated at \(2\times\)) (from ).
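Returning to the covariance calculation: the following is a rough sketch of the flavor of such a computation, not Fumagalli et al.’s estimator or the ANGSD implementation. Each individual’s genotype at each site enters as an expected value under its genotype probabilities, and each site’s contribution to the covariance matrix is weighted by the probability that the site is variable. The function name and array layout are invented for this example.

    import numpy as np

    def pca_from_genotype_probs(gp, p_var):
        """Sketch of a PCA that uses genotype probabilities directly.

        gp    : array (n_ind, n_sites, 3) of probabilities for genotypes
                0, 1, 2 at each site (e.g. derived from genotype likelihoods)
        p_var : array (n_sites,) giving the probability each site is variable
        """
        e_geno = gp @ np.array([0.0, 1.0, 2.0])   # expected genotype per individual/site
        freq = e_geno.mean(axis=0) / 2.0          # allele frequency estimate per site
        z = (e_geno - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq) + 1e-12)
        cov = (z * p_var) @ z.T / p_var.sum()     # sites weighted by P(variable)
        evals, evecs = np.linalg.eigh(cov)
        return evecs[:, ::-1], evals[::-1]        # leading principal components first

    # toy usage: 6 individuals, 100 sites, arbitrary genotype probabilities
    rng = np.random.default_rng(0)
    gp = rng.dirichlet([1.0, 1.0, 1.0], size=(6, 100))
    pcs, evals = pca_from_genotype_probs(gp, p_var=np.full(100, 0.5))

In practice the genotype probabilities would come from genotype likelihoods computed from the reads, so low-coverage individuals simply contribute flatter probabilities rather than being dropped.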

Genetic structure of human populations in Great Britain

As we’ve seen several times in this course, the amount of genetic data available on humans is vastly greater than what is available for any other organism. As a result, it’s possible to use these data to gain unusually deep insight into the recent history of many human populations. Today’s example comes from Great Britain, courtesy of a very large consortium.

Data

Results

Very little evidence of population structure within British sample

Individual assignment analysis of genotypes used fineSTRUCTURE, which relies on the same principle as STRUCTURE but models the correlations among SNPs resulting from gametic disequilibrium rather than treating each locus as independently inherited. The analysis is carried out on haplotypes rather than on individual alleles. In addition, it clusters populations hierarchically (Figure 4).

fineSTRUCTURE analysis of genotypes from Great Britain (from ).

Analysis of the European data identifies 52 groups. The authors used Chromopainter to represent each of the haplotypes detected in their sample of 2039 individuals from the UK as a mosaic of haplotypes derived from those found in their sample of 6209 individuals from continental Europe. Since they know (a) the UK cluster to which each UK individual belongs and (b) the European group to which each individual contributing to the UK mosaic belongs, they can estimate (c) the proportion of ancestry for each UK cluster derived from each European group. The results are shown in Figure 5.

European ancestry of the 17 clusters identified in the UK (from ).
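As a toy illustration of step (c) above, here is a hedged sketch that simply averages, within each UK cluster, the fraction of each individual’s mosaic contributed by each European group. The paper’s actual estimation procedure is more sophisticated than simple averaging, so take this only as a way to see how (a) and (b) combine; the variable names are invented.

    import numpy as np

    def cluster_ancestry(donor_fraction, uk_cluster):
        """donor_fraction[i, j]: fraction of UK individual i's mosaic painted by
        haplotypes from European group j (each row sums to 1).
        uk_cluster[i]: cluster label assigned to individual i by fineSTRUCTURE."""
        uk_cluster = np.asarray(uk_cluster)
        return {c: donor_fraction[uk_cluster == c].mean(axis=0)
                for c in np.unique(uk_cluster)}

    # toy example: 4 UK individuals, 3 European donor groups, 2 UK clusters
    donor_fraction = np.array([[0.6, 0.3, 0.1],
                               [0.5, 0.4, 0.1],
                               [0.2, 0.2, 0.6],
                               [0.1, 0.3, 0.6]])
    print(cluster_ancestry(donor_fraction, ["north", "north", "south", "south"]))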

Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit the Creative Commons web site or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.