For Lab 3 we used data from Camak et al. (2021)1 to explore population structure using LEA. This week we’re going to run a couple of additional analyses to explore these data further, and we’re going to use the results from these analyses to re-examine some of the conclusions Camak et al. reached. I’ll be asking you not only to produce numerical/graphical results from these analyses, but also to interpret what they mean biologically.

But first I have to introduce the methods, and we’ll use the Protea repens data you’ve already seen to do that.

PCA

First, we’ll run a principal components analysis of the data. To do this we first have to get the data into R using read.structure() from adegenet.2

library(adegenet)
repens <- read.structure("http://darwin.eeb.uconn.edu/eeb348-resources/repens-outliers.stru")
summary(repens)

// Number of individuals: 662
// Group sizes: 12 52 8 36 46 22 41 72 24 39 47 22 12 28 36 42 58 30 35
// Number of alleles per locus: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
// Number of alleles per group: 247 334 233 325 321 303 316 326 299 295 326 305 292 308 319 328 332 319 323
// Percentage of missing data: 48.99 %
// Observed heterozygosity: 0.13 0.38 0.16 0.28 0.07 0.06 0.15 0.25 0.3 0.02 0.15 0.12 0.13 0.46 0.12 0.23 0.22 0.14 0.34 0.18 0.24 0.16 0.15 0.21 0.25 0.32 0.09 0.18 0.22 0.24 0.18 0.06 0.03 0.08 0.3 0.08 0.42 0.24 0.21 0.14 0.11 0.18 0.42 0.36 0.17 0.44 0.33 0.31 0.16 0.2 0.17 0.34 0.18 0.33 0.17 0.33 0.12 0.14 0.15 0.06 0.24 0.1 0.19 0.2 0.27 0.2 0.12 0.07 0.16 0.23 0.09 0.37 0.16 0.14 0.25 0.15 0.2 0.34 0.34 0.28 0.16 0.26 0.27 0.28 0.23 0.19 0.07 0.09 0.1 0.07 0.13 0.2 0.15 0.36 0.25 0.2 0.1 0.14 0.1 0.37 0.17 0.09 0.49 0.42 0.09 0.45 0.19 0.13 0.15 0.17 0.06 0.31 0.19 0.13 0.1 0.13 0.21 0.34 0.07 0.26 0.16 0.06 0.16 0.06 0.07 0.07 0.23 0.39 0.18 0.15 0.34 0.15 0.11 0.21 0.04 0.38 0.18 0.12 0.27 0.47 0.1 0.13 0.29 0.23 0.24 0.07 0.36 0.1 0.08 0.05 0.2 0.45 0.21 0.25 0.26 0.17 0.22 0.15 0.27 0.12 0.17 0.26 0.22 0.14 0.19 0.2 0.29 0.28 0.18 0.15 0.32 0.04 0.09
// Expected heterozygosity: 0.5 0.46 0.32 0.36 0.2 0.27 0.37 0.34 0.43 0.29 0.3 0.19 0.24 0.48 0.42 0.32 0.33 0.44 0.41 0.47 0.33 0.24 0.47 0.3 0.5 0.32 0.16 0.3 0.38 0.48 0.45 0.22 0.19 0.43 0.35 0.31 0.49 0.46 0.45 0.25 0.16 0.29 0.5 0.49 0.22 0.5 0.5 0.42 0.27 0.49 0.26 0.5 0.35 0.49 0.32 0.46 0.15 0.35 0.24 0.19 0.4 0.5 0.24 0.36 0.5 0.34 0.23 0.45 0.38 0.3 0.16 0.49 0.42 0.18 0.32 0.27 0.48 0.49 0.49 0.47 0.33 0.48 0.35 0.47 0.31 0.44 0.49 0.32 0.28 0.3 0.16 0.28 0.19 0.41 0.49 0.26 0.14 0.32 0.37 0.49 0.48 0.16 0.45 0.45 0.34 0.49 0.24 0.47 0.48 0.3 0.17 0.46 0.36 0.18 0.18 0.18 0.36 0.5 0.37 0.32 0.41 0.15 0.23 0.11 0.19 0.18 0.36 0.42 0.31 0.19 0.41 0.47 0.37 0.25 0.3 0.45 0.49 0.41 0.45 0.5 0.28 0.38 0.46 0.32 0.5 0.49 0.49 0.13 0.19 0.42 0.25 0.48 0.49 0.28 0.41 0.26 0.36 0.19 0.34 0.22 0.39 0.47 0.28 0.18 0.43 0.3 0.3 0.49 0.41 0.22 0.43 0.12 0.34

If we summarize the data, we’ll see that nearly fifty percent of the data is missing, i.e., fifty percent of the individual/locus combinations are empty because we couldn’t score that individual at that locus.3 This amount of missing data isn’t uncommon with genotype-by-sequencing (GBS) data when you don’t have a reference genome, but it is a problem for analysis by principal components. We fill the empty cells by replacing them with the mean allele frequency at the locus in question. This gives us complete data, but it understates the degree to which groupings in the data are different from one another. We also need a data frame that contains the population labels. Here’s how we do those two things:4

repens_mod <- tab(repens, freq = TRUE, NA.method = "mean")
repens_df <- data.frame(Population = pop(repens))

Now we run the principal components analysis (PCA) and plot the results. IMPORTANT NOTE: If you leave out the “u” in “colour” you won’t get the results that you want.

library(ggfortify)

repens_pca <- prcomp(repens_mod)
## must be spelled "colour" with a "u"
##
autoplot(repens_pca, data = repens_df, colour = "Population",
         frame.type = "norm") + 
  theme_bw()