Introduction
Last week we compared the coalescence times we expect to see if sequences are neutral with those we estimated from nucleotide sequence data. As we discussed in class, under an infinite sites model of evolution we can derive estimates of θ = 4Nₑμ from either the number of segregating sites, k, or from the nucleotide diversity, π:
$$\hat\theta_\pi = \hat\pi \qquad \hat\theta_k = \frac{k}{\sum_{i=1}^{n-1} \frac{1}{i}},$$
where n is the number of sequences in our sample. Since $\hat\theta_\pi$ and $\hat\theta_k$ are both estimates of θ, they should be equal, apart from sampling error, if sequences are evolving neutrally and the population is at a drift-mutation equilibrium. Tajima’s D is defined as
$$\hat D = \frac{\hat\theta_\pi - \hat\theta_k}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta_\pi - \hat\theta_k)}}.$$
If D differs significantly from 0, then either the sequence variation is not effectively neutral or the population is not at a drift-mutation equilibrium.
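To make the definition concrete, here is a minimal sketch in base R that computes the two estimates of θ and Tajima’s D by hand. The toy alignment is made up for illustration, and the function name `tajima_d` is hypothetical. The variance constants are the ones given by Tajima (1989), which the text above does not spell out, so treat them as an assumption of this sketch. Both estimates are measured per sequence rather than per site so that they are directly comparable.

```r
# Toy alignment: 5 sequences, 8 sites (made-up data, for illustration only).
seqs <- c("ATGCATGC",
          "ATGCATGA",
          "ATGAATGC",
          "TTGCATGC",
          "ATGCATGC")
aln <- do.call(rbind, strsplit(seqs, ""))

# Hypothetical helper: Tajima's D from
#   n    = number of sequences,
#   S    = number of segregating sites (k in the text),
#   khat = mean number of pairwise differences (pi summed over sites).
# The constants a1 ... e2 follow Tajima (1989).
tajima_d <- function(n, S, khat) {
  i  <- 1:(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  theta_k <- S / a1                    # estimate based on segregating sites
  v <- e1 * S + e2 * S * (S - 1)       # estimated variance of the difference
  list(theta_pi = khat, theta_k = theta_k, D = (khat - theta_k) / sqrt(v))
}

n <- nrow(aln)
# A site is segregating if more than one nucleotide appears in its column.
S <- sum(apply(aln, 2, function(site) length(unique(site)) > 1))
# Mean number of differences over all pairs of sequences.
pairs <- combn(n, 2)
khat <- mean(apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ])))

toy <- tajima_d(n, S, khat)
toy
```

In this toy alignment every variable site is a singleton, so the estimate based on pairwise differences falls below the one based on segregating sites and D comes out negative.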
Estimating Tajima’s D
It is quite straightforward to estimate Tajima’s D and to assess whether it is different from 0 in R. Here’s how we’d do it with an sRNase data set from Solanum peruvianum.
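# pegas provides tajima.test(); loading it also attaches ape, which provides read.dna()
library(adegenet)
library(pegas)
# Start from a clean workspace
rm(list = ls())
# Read the Clustal-format alignment of sRNase sequences
solanum <- read.dna("solanum-peruvianum.clw.clwstrict", format = "clustal")
# Estimate Tajima's D and test whether it differs from 0
solanum_taj <- tajima.test(solanum)
solanum_taj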
The output is pretty easy to interpret. D is our estimate of Tajima’s D. Pval.normal is the probability of getting a value of D at least as extreme as the one observed if the sequence variation were neutral and the population were at a drift-mutation equilibrium, assuming that D follows a normal distribution with mean zero and variance one. Pval.beta is the corresponding probability assuming that D follows a beta distribution, which was Tajima’s original proposal.
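As a rough check on the normal approximation, the sketch below computes a P-value for a given D in base R. It assumes the test is two-sided against a standard normal distribution, which is our understanding of how the normal P-value is obtained rather than something stated above. The function name `p_normal` is hypothetical.

```r
# Hypothetical helper: two-sided P-value for an observed D, assuming D is
# approximately standard normal under neutrality and drift-mutation equilibrium.
p_normal <- function(D) {
  2 * pnorm(-abs(D))
}

p_normal(0)      # D exactly 0: no evidence against neutrality
p_normal(1.96)   # close to the conventional 0.05 threshold
p_normal(-1.96)  # same P-value: the test is symmetric in the sign of D
```

Because the test is two-sided, a large P-value only says that D is not unusually far from 0 in either direction. The sign of D still has to be read from the estimate itself.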
Here we have a positive D, suggesting that there might be diversifying selection, but the support for that suggestion is very weak. There’s about a 62 percent chance we’d get a value of D as large as what we observe here even if there were no selection acting on these sequences and the population were at a drift-mutation equilibrium.
Lab 10
For this lab exercise you’ll be exploring genetic variation at four loci in loblolly pine. The data are derived from Gonzalez-Martinez and colleagues. There are two data sets for each of the loci included in the analysis:
- ccoaomt-1 : caffeoyl-CoA-O-methyltransferase 1
- cpk3 : calcium-dependent protein kinase
- erd3 : early response to drought 3
- pp2c : protein phosphatase 2C-like protein
One data set for each locus, the Pinus-taeda-<gene name>-coding.fasta data set, includes only the coding portion of the nucleotide sequence as downloaded from GenBank. The other data set, the Pinus-taeda-<gene name>.fasta data set, includes the complete nucleotide sequence as downloaded from GenBank. Each data set contains 32 sequences that were aligned using MUSCLE. The following table shows the number of nucleotides included in each data set.
NOTE: You’ll need to specify format = "fasta" when you call read.dna().
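One way to organize the analysis is to loop over the loci rather than repeat the same calls eight times. The sketch below assumes that <gene name> in each file name is exactly the locus abbreviation from the list above and that the files sit in the working directory. The helpers `fasta_name` and `run_tajima` are hypothetical names introduced here, not part of pegas or ape.

```r
loci <- c("ccoaomt-1", "cpk3", "erd3", "pp2c")

# Hypothetical helper: build the file name for a locus.
# Assumes <gene name> is the locus abbreviation exactly as listed.
fasta_name <- function(locus, coding = FALSE) {
  paste0("Pinus-taeda-", locus, if (coding) "-coding" else "", ".fasta")
}

# Hypothetical helper: run tajima.test() on each file that exists and
# return a list of results named by file. Missing files yield NULL.
run_tajima <- function(files) {
  results <- lapply(files, function(f) {
    if (!file.exists(f)) return(NULL)
    pegas::tajima.test(ape::read.dna(f, format = "fasta"))
  })
  names(results) <- files
  results
}

# Only attempt the analysis if pegas (and therefore ape) is installed.
if (requireNamespace("pegas", quietly = TRUE)) {
  complete <- run_tajima(fasta_name(loci))
  coding   <- run_tajima(fasta_name(loci, coding = TRUE))
}
```

Keeping the complete and coding results in separate lists makes it easy to compare D and its P-values locus by locus.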
Using these data, answer the following questions:
Is there evidence for selection, a recent population expansion, or a recent population bottleneck at any locus when the complete sequence is considered?
Is there evidence for selection, a recent population expansion, or a recent population bottleneck at any locus when only the coding sequence is considered?
Assume that there has not been a recent population expansion, a recent bottleneck, or a recent change in range size. What kind of selection might account for the patterns revealed in your answers? Are the patterns of selection you detect consistent with these loci being adaptively important in drought responses?
