Uncommon Ground

Population genetics

Lecture notes in population genetics – final version from Spring 2017

I’ve finally had time to clean up and post the final version of the lecture notes from my graduate course in population genetics last spring. The individual lectures have been available since I revised them for class, meaning that the last set of them was available in late April. You will find links to the individual lecture notes at http://darwin.eeb.uconn.edu/uncommon-ground/eeb348/notes/. If you’re interested in a particular topic in population genetics and I have a lecture that covers the topic, that’s probably where you’ll want to go.

If you want a single-volume reference to population genetics (including some old notes that I no longer maintain), you’ll find a PDF (5.89MB, 322 pages) at Figshare (doi: 10.6084/m9.figshare.100687.v2). If you want to print the PDF, I recommend that you print it on a double-sided printer. You can then put the pages in a binder and flip through them as if it were a bound book.

If you use LaTeX (and you’re a glutton for punishment), the LaTeX source and EPS files (for figures) are available in a GitHub repository (https://kholsinger.github.io/Lecture-Notes-in-Population-Genetics/).

These notes are released under a Creative Commons Attribution-ShareAlike license (http://creativecommons.org/licenses/by-sa/4.0/). I hope you find them useful. If you find errors in them, please let me know.

Causes of genetic differentiation in Protea repens

American Journal of Botany Volume 104, Number 5. May 2017.

Protea repens is the most widespread member of the genus. It was one of the focal species in our recently completed Dimensions of Biodiversity project. Part of the project involved genotyping-by-sequencing analyses of 663 individuals from 19 populations spanning most of the geographical range of the species. We summarize results of those analyses in a paper that just appeared in advance of the May issue (cover photo featured above) of the American Journal of Botany. Here’s the abstract. You’ll find the citation and a link at the bottom.

PREMISE OF THE STUDY: The Cape Floristic Region (CFR) of South Africa is renowned for its botanical diversity, but the evolutionary origins of this diversity remain controversial. Both neutral and adaptive processes have been implicated in driving diversification, but population-level studies of plants in the CFR are rare. Here, we investigate the limits to gene flow and potential environmental drivers of selection in Protea repens L. (Proteaceae L.), a widespread CFR species.
METHODS: We sampled 19 populations across the range of P. repens and used genotyping by sequencing to identify 2066 polymorphic loci in 663 individuals. We used a Bayesian FST outlier analysis to identify single-nucleotide polymorphisms (SNPs) marking genomic regions that may be under selection; we used those SNPs to identify potential drivers of selection and excluded them from analyses of gene flow and genetic structure.
RESULTS: A pattern of isolation by distance suggested limited gene flow between nearby populations. The populations of P. repens fell naturally into two or three groupings, which corresponded to an east-west split. Differences in rainfall seasonality contributed to diversification in highly divergent loci, as do barriers to gene flow that have been identified in other species.
CONCLUSIONS: The strong pattern of isolation by distance is in contrast to the findings in the only other widespread species in the CFR that has been similarly studied, while the effects of rainfall seasonality are consistent with well-known patterns. Assessing the generality of these results will require investigations of other CFR species.

Prunier, R., M. Akman, C.T. Kremer, N. Aitken, A. Chuah, J. Borevitz, and K.E. Holsinger. 2017. Isolation by distance and isolation by environment contribute to population differentiation in Protea repens (Proteaceae L.), a widespread South African species. American Journal of Botany doi: 10.3732/ajb.1600232
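
For readers who want to see what a test for isolation by distance can look like, here is a minimal sketch of a Mantel test in Python: it correlates a matrix of pairwise geographic distances with a matrix of pairwise genetic distances and gets a p-value by permuting population labels. This is a generic illustration, not the analysis used in the paper. The function names (`pearson`, `mantel`) and the toy matrices are hypothetical, and the toy numbers are made up for the example.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(dist_a, dist_b, n_perm=999, seed=1):
    """One-sided Mantel test between two symmetric square distance matrices.

    Returns the observed correlation and a permutation p-value obtained by
    shuffling the population labels of dist_a.
    """
    n = len(dist_a)
    pairs = list(combinations(range(n), 2))

    def corr(order):
        x = [dist_a[order[i]][order[j]] for i, j in pairs]
        y = [dist_b[i][j] for i, j in pairs]
        return pearson(x, y)

    observed = corr(list(range(n)))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        if corr(order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (made up): four populations along a line, with genetic distance
# exactly proportional to geographic distance.
positions = [0.0, 1.0, 3.0, 7.0]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.01 * d for d in row] for row in geo]

r, p = mantel(geo, gen)
print(r, p)
```

With real data you would replace `geo` with distances between sampling sites and `gen` with a pairwise measure of genetic differentiation. Using FST/(1 - FST) is one common choice, but that is an assumption here, not something stated in the abstract.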

Don’t overinterpret STRUCTURE plots

[Screenshot: Daniel Falush’s tweet about the preprint]
Several weeks ago[1] Daniel Falush (@DanielFalush) posted a preprint on bioRxiv, “A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots”. I finally had a chance to read it this weekend. Here’s the abstract:

Genetic clustering algorithms, implemented in popular programs such as STRUCTURE and ADMIXTURE, have been used extensively in the characterisation of individuals and populations based on genetic data. A successful example is reconstruction of the genetic history of African Americans who are a product of recent admixture between highly differentiated populations. Histories can also be reconstructed using the same procedure for groups which do not have admixture in their recent history, where recent genetic drift is strong or that deviate in other ways from the underlying inference model. Unfortunately, such histories can be misleading. We have implemented an approach (available at www.paintmychromsomes.com) to assessing the goodness of fit of the model using the ancestry ‘palettes’ estimated by CHROMOPAINTER and apply it to both simulated and real examples. Combining these complementary analyses with additional methods that are designed to test specific hypothesis allows a richer and more robust analysis of recent demographic history based on genetic data.

A key observation Falush and his co-authors make is that different demographic scenarios can lead to the same STRUCTURE diagram. They illustrate three different scenarios. In all of them, they simulate data from 12 populations but sample from only four of them. In all of the scenarios, population P4 has been isolated from the other three populations in the sample for a long time. It’s the relationship between P1, P2, and P3 that differs among the scenarios.

  • Recent admixture: P1 and P3 have also been distinct for some time, and P2 is a recent admixture of P1, P3, and P4.
  • Ghost admixture: P1 and P3 diverged some time ago, and P2 is a recent admixture of P1 and a “ghost” population more closely related to P3 than to P1.
  • Recent bottleneck: P1 is sister to P2 but underwent a strong recent bottleneck.
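
To make the first scenario concrete, here is a minimal Python sketch of what recent admixture implies for allele frequencies: the expected frequency in the admixed population is a weighted average of the source frequencies. It is not the simulation Falush and his co-authors ran. The function name `admixed_frequencies`, the allele frequencies, and the admixture proportions are all hypothetical values chosen for illustration.

```python
def admixed_frequencies(source_freqs, proportions):
    """Expected allele frequencies in a population formed by mixing sources.

    source_freqs: one list of per-locus allele frequencies per source.
    proportions: admixture proportions, one per source, summing to 1.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("admixture proportions must sum to 1")
    n_loci = len(source_freqs[0])
    return [
        sum(w * freqs[locus] for w, freqs in zip(proportions, source_freqs))
        for locus in range(n_loci)
    ]


# Hypothetical allele frequencies at three loci (illustrative values only).
p1 = [0.9, 0.2, 0.5]
p3 = [0.1, 0.8, 0.5]
p4 = [0.5, 0.5, 0.0]

# Hypothetical admixture proportions for P2 drawn from P1, P3, and P4.
p2 = admixed_frequencies([p1, p3, p4], [0.5, 0.4, 0.1])
print(p2)
```

Each frequency in `p2` falls between the smallest and largest source frequency at that locus, which is why a clustering program tends to paint admixed individuals as a blend of the source clusters. The other two scenarios can produce similar intermediate frequencies by different routes, which is the point of the comparison.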

[Figure: STRUCTURE bar plots estimated from data simulated under each of the three scenarios, from Falush et al. 2016]

As you can see, the STRUCTURE diagrams estimated from data simulated in each scenario are indistinguishable. They also show that if you have additional data available, specifically if you are lucky enough to be working in an organism with a lot of SNPs that are mapped, then you can combine estimates from CHROMOPAINTER with those from STRUCTURE to distinguish the recent admixture scenario from the other two – assuming that you’ve picked a reasonable number for K, the number of subpopulations.[2]

The authors also refer to Puechmaille’s recent work demonstrating that estimates of genetic structure are greatly affected by sample size. Bottom line: Read both this paper and Puechmaille’s if you use STRUCTURE, tread cautiously when interpreting results, and don’t expend too much effort trying to estimate the “right” K.

[1] OK, as you can see from the tweet, it was almost a month ago.

[2] The paper contains a brief remark about how hard it is to estimate K: “Unless the demographic history of the sample is particularly simple, the value of K inferred according to any statistically sensible criterion is likely to be smaller than the number of distinct drift events that have significantly impacted the sample. What the algorithm often does is in practice use variation in admixture proportions between individuals to approximately mimic the effect of more than K distinct drift events without estimating ancestral populations corresponding to each one.”

Falush, D., L. van Dorp, and D. Lawson. 2016. A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. bioRxiv doi: 10.1101/066431
Lawson, D.J., G. Hellenthal, S. Myers, and D. Falush. 2012. Inference of population structure using dense haplotype data. PLoS Genetics 8:e1002453. doi: 10.1371/journal.pgen.1002453
Puechmaille, S.J. 2016. The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: subsampling and new estimators alleviate the problem. Molecular Ecology Resources 16:608-627. doi: 10.1111/1755-0998.12512