--- title: "$F$-statistics with `hierfstat`" output: html_notebook --- What you see here is an `R` notebook. Unless you've been looking ahead, you will have seen this in class on Tuesday when I introduce the lab exercise. Nick will talk about it more in lab. Let's start with a simple example that illustrates how `R` notebooks work. As you'll see, they allow us to intersperse text (like this) with `R` code that we execute (as you'll see below). We'll use the data from *Isotoma petraea* that we discussed in class to illustrate `hierfstat`. ## Converting the *Isotoma* data to a different format `hierfstat` needs the data in a different format from what is available on the website. This piece of code reads in the data directly from the website and converts it to the format that `hierfstat` uses. Other than the cute trick of reading the CSV file directly from the server rather than downloading it first, you can ignore all of the other coding.[^If you're interested, you're welcome to read it and see if you can figure out what it does. If you try and get stuck, feel free to ask me. I'm not asking you to spend any time on understanding this, because you are very unlikely to encounter genetic data in a format like the *Isotoma petraea* data any where else.] ```{r} ## The options line turns off the messages that tidyverse usually prints when ## it loads. ## options(tidyverse.quiet = TRUE) library("tidyverse") ## The following line clears out any objects in memory. I make it a habit to do this so that I can make sure ## that things left over from the last time I ran R don't confuse things. ## rm(list = ls()) dat <- read_csv("http://darwin.eeb.uconn.edu/eeb348-resources/isotoma.csv", col_types = cols(pop = col_character(), .default = col_integer())) ## If you know R, you'll recognize that I could have expressed this much more ## succinctly, but for anyone who cares to follow the logic, this version should ## be easier to follow. ## for (i in 1:nrow(dat)) { for (j in 2:ncol(dat)) { if (!is.na(dat[i, j])) { if (dat[i, j] == 0) { dat[i, j] <- 11 } else if (dat[i, j] == 1) { dat[i, j] <- 12 } else if (dat[i, j] == 2) { dat[i, j] <- 22 } } } } ## This line converts the population abbreviations to numbers ## dat$Location <- as.numeric(as.factor(dat$pop)) ## The first line puts Location in the first column. The second drops the pop column. ## dat <- relocate(dat, Location) %>% select(-pop) dat <- as.data.frame(dat) ``` ## Using `hierfstat` We use `wc()` to estimate Weir and Cockerham's $F$-statistics.^[You can probably guess why the function is called `wc()`.] ```{r} library(hierfstat) dat_fst <- wc(dat, diploid = TRUE) dat_fst ``` That's all there is to it. You don't need to include the `diploid = TRUE`, but it's good form to do so. If you happened to try this with the original file you downloaded, you'd get an error because the data aren't in the right format. The numbers above match what I mentioned in lecture (with a small rounding error in the 4th decimal place for $F_{IS}$) You can get a sense of how reliable those point estimates are by bootstrapping the estimates using `boot.vc()`. The first argument to `boot.vc()` is the list of populations from which each individual was collected (as a number, not as characters). The second argument to `boot.vc()` is the matrix of genotypes at all of the loci. Our data frame `dat` contains the population numbers in the first column, and the genotype information in the remaining column. We can select the first column as `dat[, 1]`. We can select all of the columns *except* the first column as `dat[, -1]`. ```{r} boot.vc(dat[, 1], dat[, -1], diploid = TRUE)$ci ``` Bootstrapping didn't work in this case, because it requires a minimum of 5 loci.[^If you've never encountered bootstrapping before, and you want to know how it works, feel free to ask. If you already know about bootstrapping, it might help you to know that we are bootstrapping loci rather than individuals.] ## Lab exercise We are going to use a subset of the data that Rachel Prunier and collaborators (including me) used to analyze the genetic structure of *Protea repens* [https://doi.org/10.3732/ajb.1600232](https://doi.org/10.3732/ajb.1600232). You'll find the data on the course website at [http://darwin.eeb.uconn.edu/eeb348-resources/repens-outliers.csv](http://darwin.eeb.uconn.edu/eeb348-resources/repens-outliers.csv) and at [http://darwin.eeb.uconn.edu/eeb348-resources/repens-outliers.stru](http://darwin.eeb.uconn.edu/eeb348-resources/repens-outliers.stru).^[I'll explain what the `.stru` extension refers to next week.] 1. Download the data in the `.stru` file. Then work through the following steps to analyze the data with `wc()`.^[You can also use the trick of simply using the URL above to read the data directly from the course website if you'd prefer.] 2. Convert the data to a form that `hierfstat` can use by running `read.structure()` from `adegenet`.^[You'll need to use `install.packages()` to install `adegenet`]. Note: There are 662 genotypes (individuals), 173 markers (loci), column 1 contains the labels for genotypes (individuals), column 2 contains the population factor (the population/locality from which a particular sample was collected), there are no optional columns, row 1 contains the marker (locus) names, and genotypes are coded in two rows. If everything worked properly, you'll see the following message:
Converting data from a STRUCTURE .stru file to a genind object...3. Now use `wc()` to produce estimates of $F_{IS}$ and $F_{ST}$. 4. Now convert the `genind` object that you used with `wc()` to a `hierfstat` data frame (using `genind2hierfstat()`), and use that dataframe to construct 95 percent confidence intervals for $F_{IS}$ and $F_{ST}$
boot.vc(repens_df[, 1], repens_df[, -1], diploid = TRUE)$ci`repens_df` is the name of the data frame I used. The first argument selects only the `pop` column (the one with the populations), and the second excludes the `pop` column, leaving a data frame with only the genotype data. 5. Is there evidence of inbreeding within populations? 6. Is there evidence of genetic differentiation among populations? Feel free simply to embed the `R` code in a copy of this notebook if you're so inclined. ***but*** if you do, be sure to state explicitly how to interpret the output.[^Nick and I know how to do it, of course, but I want to make sure you know how to do it too.]