Analyzing the genetic structure of populations

Introduction

So far we’ve focused on inbreeding as one important way that populations may fail to mate at random, but there’s another way in which virtually all populations and species fail to mate at random. Individuals tend to mate with those that are nearby. Even within a fairly small area, phenomena like nearest neighbor pollination in flowering plants or home-site fidelity in animals can cause mates to be selected in a geographically non-random way. What are the population genetic consequences of this form of non-random mating?

Well, if you think about it a little, you can probably figure it out. Since individuals that occur close to one another tend to be more genetically similar than those that occur far apart, the impacts of local mating will mimic those of inbreeding within a single, well-mixed population.

A numerical example

For example, suppose we have two subpopulations of green lacewings, one of which occurs in forests, the other of which occurs in adjacent meadows. Suppose further that within each subpopulation mating occurs completely at random, but that there is no mating between forest and meadow individuals. Suppose we’ve determined allele frequencies in each population at a locus coding for phosphoglucoisomerase (\(PGI\)), which conveniently has only two alleles. The frequency of \(A_1\) in the forest is 0.4 and in the meadow is 0.7. We can easily calculate the expected genotype frequencies within each population, namely

\(A_1A_1\) \(A_1A_2\) \(A_2A_2\)
Forest 0.16 0.48 0.36
Meadow 0.49 0.42 0.09

Suppose, however, we were to consider a combined population consisting of 100 individuals from the forest subpopulation and 100 individuals from the meadow subpopulation. Then we’d get the following:

\(A_1A_1\) \(A_1A_2\) \(A_2A_2\)
From forest 16 48 36
From meadow 49 42 9
Total 65 90 45

So the frequency of \(A_1\) is \((2(65) + 90)/(2(65 + 90 + 45)) = 0.55\). Notice that this is just the average allele frequency in the two subpopulations, i.e., \((0.4 + 0.7)/2\). Since each subpopulation has genotypes in Hardy-Weinberg proportions, you might expect the combined population to have genotypes in Hardy-Weinberg proportions, but if you did you’d be wrong. Just look.

\(A_1A_1\) \(A_1A_2\) \(A_2A_2\)
Expected (from \(p=0.55\)) (0.3025)200 = 60.5 (0.4950)200 = 99.0 (0.2025)200 = 40.5
Observed (from table above) 65 90 45

The expected and observed don’t match, even though there is random mating within both subpopulations. They don’t match because there isn’t random mating in the combined population, only within each subpopulation. Forest lacewings choose mates at random from other forest lacewings, but they never mate with a meadow lacewing (and vice versa). Our sample includes two populations that don’t mix. As a result, heterozygotes in our combined sample are less frequent (\(0.45\) vs \(0.495\)) than we’d expect if the population were well mixed with an allele frequency of \(0.55\). This is an example of what’s known as the Wahlund effect.
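If you’d like to check the arithmetic, here’s a short Python sketch; the genotype counts come straight from the tables above.

```python
# Wahlund effect: pooling two Hardy-Weinberg subpopulations
# produces a heterozygote deficit in the combined sample.
p_forest, p_meadow = 0.4, 0.7
n = 100  # individuals sampled from each subpopulation

def hw_counts(p, n):
    """Genotype counts expected under Hardy-Weinberg: A1A1, A1A2, A2A2."""
    return p**2 * n, 2 * p * (1 - p) * n, (1 - p)**2 * n

forest = hw_counts(p_forest, n)
meadow = hw_counts(p_meadow, n)
combined = [f + m for f, m in zip(forest, meadow)]  # approximately [65, 90, 45]

# Allele frequency and Hardy-Weinberg expectation in the pooled sample
p_bar = (2 * combined[0] + combined[1]) / (2 * sum(combined))  # 0.55
expected_het = 2 * p_bar * (1 - p_bar) * 2 * n                 # 99.0

print([round(c) for c in combined], p_bar, expected_het)
# heterozygotes: 90 observed vs 99 expected
```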

The algebraic development

Even though you’ve only known me for a couple of weeks now, you should know me well enough to know that I’m not going to be satisfied with a numerical example. You should know that I now feel the need to do some algebra to describe this situation a little more generally.

Suppose we know allele frequencies in \(k\) subpopulations. Let \(p_i\) be the frequency of \(A_1\) in the \(i\)th subpopulation. Then if we assume that all subpopulations contribute equally to the combined population, we can calculate expected and observed genotype frequencies the way we did above:

\(A_1A_1\) \(A_1A_2\) \(A_2A_2\)
Expected \(\bar p^2\) \(2\bar p\bar q\) \(\bar q^2\)
Observed \(\frac{1}{k}\sum p_i^2\) \(\frac{1}{k}\sum 2p_iq_i\) \(\frac{1}{k}\sum q_i^2\)

where \(\bar p = \sum p_i/k\) and \(\bar q = 1 - \bar p\) are the average allele frequencies in the combined sample. Now \[\begin{aligned} \frac{1}{k}\sum p_i^2 &=& \frac{1}{k}\sum (p_i - \bar p + \bar p)^2 \\ &=& \frac{1}{k}\sum \left((p_i - \bar p)^2 + 2\bar p(p_i - \bar p) + \bar p^2\right) \\ &=& \frac{1}{k}\sum (p_i - \bar p)^2 + \bar p^2 \\ &=& \hbox{Var}(p) + \bar p^2 \label{eq:p2} \end{aligned}\] Similarly, \[\begin{aligned} \frac{1}{k}\sum 2p_iq_i &=& 2\bar p\bar q - 2\hbox{Var}(p) \label{eq:2pq} \\ \frac{1}{k}\sum q_i^2 &=& \bar q^2 + \hbox{Var}(p) \label{eq:q2} \end{aligned}\]
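These identities hold for any set of subpopulation allele frequencies. Here’s a quick numerical check in Python; the frequencies are made up purely for illustration.

```python
# Check: (1/k) * sum(p_i^2) == Var(p) + p_bar^2 and
#        (1/k) * sum(2 p_i q_i) == 2 p_bar q_bar - 2 Var(p),
# where Var(p) is the variance of allele frequencies across subpopulations.
p = [0.1, 0.25, 0.4, 0.7, 0.95]  # arbitrary subpopulation frequencies
k = len(p)

p_bar = sum(p) / k
var_p = sum((pi - p_bar) ** 2 for pi in p) / k

mean_p2 = sum(pi ** 2 for pi in p) / k
mean_2pq = sum(2 * pi * (1 - pi) for pi in p) / k

assert abs(mean_p2 - (var_p + p_bar ** 2)) < 1e-12
assert abs(mean_2pq - (2 * p_bar * (1 - p_bar) - 2 * var_p)) < 1e-12
```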

Since \(\hbox{Var}(p) \ge 0\) by definition, with equality holding only when all subpopulations have the same allele frequency, we can conclude that whenever subpopulations differ in allele frequency, heterozygotes will be less frequent, and both homozygotes more frequent, in the combined population than we would expect under Hardy-Weinberg.

To return to our earlier numerical example:

\[\begin{aligned} \hbox{Var}(p) &=& \left((0.4 - 0.55)^2 + (0.7 - 0.55)^2\right)/2 \\ &=& 0.0225 \end{aligned}\]

Genotype Expected Observed
\(A_1A_1\) 0.3025 0.3025 + 0.0225 = 0.3250
\(A_1A_2\) 0.4950 0.4950 - 2(0.0225) = 0.4500
\(A_2A_2\) 0.2025 0.2025 + 0.0225 = 0.2250

Wright’s \(F\)-statistics

One limitation of the way I’ve described things so far is that \(\hbox{Var}(p)\) doesn’t provide a convenient way to compare population structure from different samples. \(\hbox{Var}(p)\) can be much larger if both alleles are about equally common in the whole sample than if one occurs at a mean frequency of 0.99 and the other at a frequency of 0.01. Moreover, if you stare at equations ([eq:p2])–([eq:q2]) for a while, you begin to realize that they look a lot like some equations we’ve already encountered. Namely, if we were to define \(F_{st}\) as \(\mbox{Var}(p)/\bar p\bar q\), then we could rewrite equations ([eq:p2])–([eq:q2]) as \[\begin{aligned} \frac{1}{k}\sum p_i^2 &=& \bar p^2 + F_{st}\bar p \bar q \label{eq:p2-f} \\ \frac{1}{k}\sum 2p_iq_i &=& 2\bar p\bar q(1 - F_{st}) \label{eq:2pq-f} \\ \frac{1}{k}\sum q_i^2 &=& \bar q^2 + F_{st}\bar p \bar q \label{eq:q2-f} \end{aligned}\] And it’s not even completely artificial to define \(F_{st}\) the way I did. After all, the effect of geographic structure is to cause matings to occur among genetically similar individuals. It’s rather like inbreeding. Moreover, the extent to which this local mating matters depends on the extent to which populations differ from one another. It turns out that \(\bar p\bar q\) is the maximum allele frequency variance possible, given the observed mean frequency. So one way of thinking about \(F_{st}\) is that it measures the amount of allele frequency variance in a sample relative to the maximum possible.
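For the forest/meadow example above, a couple of lines of Python give \(F_{st}\) directly from this definition.

```python
# F_st for the forest/meadow example: Var(p) relative to its maximum
# possible value, p_bar * q_bar, given the mean allele frequency.
p = [0.4, 0.7]
k = len(p)
p_bar = sum(p) / k                              # 0.55
var_p = sum((pi - p_bar) ** 2 for pi in p) / k  # 0.0225
f_st = var_p / (p_bar * (1 - p_bar))
print(round(f_st, 4))  # 0.0909
```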

There may, of course, be inbreeding within populations, too, but it’s easy to incorporate into this framework. Let \(H_i\) be the actual heterozygosity in individuals within subpopulations, \(H_s\) be the expected heterozygosity within subpopulations assuming Hardy-Weinberg within populations, and \(H_t\) be the expected heterozygosity in the combined population assuming Hardy-Weinberg over the whole sample. Then, thinking of \(f\) as a measure of departure from Hardy-Weinberg and assuming that all populations depart from Hardy-Weinberg to the same degree, i.e., that they all have the same \(f\), we can define \[F_{it} = 1 - \frac{H_i}{H_t} \quad .\] \(F_{it}\) is the overall departure from Hardy-Weinberg in the entire sample. Let’s fiddle with \(F_{it}\) a bit. \[\begin{aligned} 1 - F_{it} &=& \frac{H_i}{H_t} \\ &=& \left(\frac{H_i}{H_s}\right)\left(\frac{H_s}{H_t}\right) \\ &=& (1 - F_{is})(1 - F_{st}) \quad , \end{aligned}\] where \(F_{is}\) is the inbreeding coefficient within populations, i.e., \(f\), and \(F_{st}\) has the same definition as before. \(H_t\) is often referred to as the genetic diversity in a population. So another way of thinking about \(F_{st} = (H_t - H_s)/H_t\) is that it’s the proportion of the diversity in the sample that’s due to allele frequency differences among populations.
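Because the relationship among the three statistics is an algebraic identity, it holds for any heterozygosities; the values below are hypothetical, chosen only to illustrate it.

```python
# (1 - F_it) = (1 - F_is)(1 - F_st) follows directly from the definitions.
H_i, H_s, H_t = 0.10, 0.15, 0.20  # hypothetical heterozygosities, H_i <= H_s <= H_t

F_is = 1 - H_i / H_s  # departure from Hardy-Weinberg within subpopulations
F_st = 1 - H_s / H_t  # differentiation among subpopulations
F_it = 1 - H_i / H_t  # overall departure from Hardy-Weinberg

assert abs((1 - F_it) - (1 - F_is) * (1 - F_st)) < 1e-12
```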

Estimating \(F\)-statistics

We’ve now seen the principles underlying Wright’s \(F\)-statistics. I should point out that Gustave Malécot developed very similar ideas at about the same time as Wright, but since Wright’s notation stuck, population geneticists generally refer to statistics like those we’ve discussed as Wright’s \(F\)-statistics.

Neither Wright nor Malécot worried too much about the problem of estimating \(F\)-statistics from data. Both realized that any inferences about population structure are based on a sample and that the characteristics of the sample may differ from those of the population from which it was drawn, but neither developed an explicit way of dealing with those differences. Wright developed some very ad hoc approaches in his book, but they have been forgotten, which is good, because they aren’t satisfactory and shouldn’t be used. There are now three reasonable approaches available:

  1. Nei’s \(G\)-statistics,

  2. Weir and Cockerham’s \(\theta\)-statistics, and

  3. A Bayesian analog of \(\theta\).

An example from Isotoma petraea

To make the differences in implementation and calculation clear, I’m going to use data from 12 populations of Isotoma petraea in southwestern Australia surveyed for genotype at \(GOT-1\) as an example throughout these discussions (Table 1).

Genotype counts at the \(GOT-1\) locus in Isotoma petraea.
Genotype
Population \(A_{1}A_{1}\) \(A_{1}A_{2}\) \(A_{2}A_{2}\) \(\hat p\)
Yackeyackine Soak 29 0 0 1.0000
Gnarlbine Rock 14 3 3 0.7750
Boorabbin 15 2 3 0.8000
Bullabulling 9 0 0 1.0000
Mt. Caudan 9 0 0 1.0000
Victoria Rock 23 5 2 0.8500
Yellowdine 23 3 4 0.8167
Wargangering 29 3 1 0.9242
Wagga Rock 5 0 0 1.0000
“Iron Knob Major” 1 0 0 1.0000
Rainy Rocks 0 1 0 0.5000
“Rainy Rocks Major” 1 0 0 1.0000

Let’s ignore the sampling problem for a moment and calculate the \(F\)-statistics as if we had observed the population allele frequencies without error. They’ll serve as our baseline for comparison. \[\begin{aligned} \bar p &=& 0.8888 \\ \hbox{Var}(p) &=& 0.02118 \\ F_{st} &=& 0.2143 \\ \hbox{Individual heterozygosity} &=& (0.0000 + 0.1500 + 0.1000 + 0.0000 + 0.0000 + 0.1667 + 0.1000 \\ && + 0.0909 + 0.0000 + 0.0000 + 1.0000 + 0.0000)/12 \\ &=& 0.1340 \\ \hbox{Expected heterozygosity} &=& 2(0.8888)(1 - 0.8888) \\ &=& 0.1976 \\ F_{it} &=& 1 - \frac{\hbox{Individual heterozygosity}}{\hbox{Expected heterozygosity}} \\ &=& 1 - \frac{0.1340}{0.1976} \\ &=& 0.3221 \\ 1 - F_{it} &=& (1 - F_{is})(1 - F_{st}) \\ F_{is} &=& \frac{F_{it} - F_{st}}{1 - F_{st}} \\ &=& \frac{0.3221 - 0.2143}{1 - 0.2143} \\ &=& 0.1372 \end{aligned}\]
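All of these direct calculations can be reproduced from the genotype counts in Table 1 with a short Python script.

```python
# Direct F-statistics for the Isotoma petraea GOT-1 data (Table 1),
# ignoring sampling error. Counts are (A1A1, A1A2, A2A2) per population.
counts = [(29, 0, 0), (14, 3, 3), (15, 2, 3), (9, 0, 0), (9, 0, 0),
          (23, 5, 2), (23, 3, 4), (29, 3, 1), (5, 0, 0), (1, 0, 0),
          (0, 1, 0), (1, 0, 0)]
k = len(counts)

p = [(2 * aa + ab) / (2 * (aa + ab, bb)[0] + 2 * ab + 2 * bb) if False else
     (2 * aa + ab) / (2 * (aa + ab + bb)) for aa, ab, bb in counts]
het = [ab / (aa + ab + bb) for aa, ab, bb in counts]  # observed heterozygosity

p_bar = sum(p) / k
var_p = sum((pi - p_bar) ** 2 for pi in p) / k

F_st = var_p / (p_bar * (1 - p_bar))
H_i = sum(het) / k             # individual (observed) heterozygosity
H_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled sample
F_it = 1 - H_i / H_t
F_is = (F_it - F_st) / (1 - F_st)

print(round(p_bar, 4), round(var_p, 5),
      round(F_st, 4), round(F_is, 4), round(F_it, 4))
# 0.8888 0.02118 0.2143 0.1372 0.3221
```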

Summary

Correlation of gametes due to inbreeding within subpopulations (\(F_{is}\)): 0.1372
Correlation of gametes within subpopulations (\(F_{st}\)): 0.2143
Correlation of gametes in sample (\(F_{it}\)): 0.3221

Why do I refer to them as the “correlation of gametes \(\dots\)”? There are two reasons:

  1. That’s the way Wright always referred to and interpreted them.

  2. We can define indicator variables \(x_{ijk} = 1\) if the \(i\)th allele in the \(j\)th individual of population \(k\) is \(A_1\) and \(x_{ijk} = 0\) if that allele is not \(A_1\). This may seem like a strange thing to do, but the Weir and Cockerham approach to \(F\)-statistics described below uses just such an approach. If we do this, then the definitions for \(F_{is}\), \(F_{st}\), and \(F_{it}\) follow directly.

Notice that \(F_{is}\) could be negative, i.e., there could be an excess of heterozygotes within populations (\(F_{is} < 0\)). Notice also that we’re implicitly assuming that the extent of departure from Hardy-Weinberg proportions is the same in all populations. Equivalently, we can regard \(F_{is}\) as the average departure from Hardy-Weinberg proportions across all populations.

Statistical expectation and unbiased estimates

So far I’ve assumed that we know the allele frequencies without error, but of course that’s never the case unless we’ve created experimental populations. We are always taking a sample from a population and estimating allele frequencies from our sample. Similarly, we are estimating \(F_{ST}\), and our estimate of \(F_{ST}\) needs to take account of the imprecision in the allele frequency estimates on which it was based. To understand one approach to dealing with this uncertainty I need to introduce two new concepts: statistical expectation and unbiased estimates.

The concept of statistical expectation is actually quite an easy one. It is an arithmetic average, just one calculated from probabilities instead of from samples. So, for example, let \(\mbox{P}(k|p,n)\) be the probability that we find \(k\) \(A_1\) alleles in a sample of \(n\) alleles given that the allele frequency in the population is \(p\). Then the expected number of \(A_1\) alleles in our sample is just \[\begin{aligned} \mbox{E}(k) &=& \sum_{k=0}^n k \mbox{P}(k|p,n) \\ &=& n p \quad , \\ \end{aligned}\] where \(n\) is the total number of alleles in our sample.

Now consider the expected value of our sample estimate of the population allele frequency, \(\hat p = k/n\), where \(k\) now refers to the number of \(A_1\) alleles we actually found. \[\begin{aligned} \mbox{E}(\hat p) &=& \sum_{k=0}^n (k/n) \mbox{P}(k|p,n) \\ &=& (1/n)\left(\sum_{k=0}^n k \mbox{P}(k|p,n)\right) \\ &=& (1/n)(n p) \\ &=& p \quad . \\ \end{aligned}\] Because \(\mbox{E}(\hat p) = p\), \(\hat p\) is said to be an unbiased estimate of \(p\). When an estimate is unbiased, it means that if we were to repeat the sampling experiment an infinite number of times and take the average of the estimates, that average would equal the (unknown) parameter value.
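Since \(k\) is binomially distributed, the claim \(\mbox{E}(\hat p) = p\) can be checked exactly by summing over the whole binomial distribution; here's a small Python sketch.

```python
from math import comb

# Exact check that E(p_hat) = p when k ~ Binomial(n, p): the expectation
# is computed by summing k/n against the full binomial pmf.
n, p = 20, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
E_p_hat = sum((k / n) * pmf[k] for k in range(n + 1))
assert abs(E_p_hat - p) < 1e-12
```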

What about estimating the frequency of heterozygotes within a population? The obvious estimator is \(\tilde H = 2\hat p (1 - \hat p)\). Well, \[\begin{aligned} \mbox{E}(\tilde H) &=& \mbox{E}\left(2\hat p (1 - \hat p)\right) \\ &=& 2\left(\mbox{E}(\hat p) - \mbox{E}({\hat p}^2)\right) \\ &=& \mbox{TAMO} \\ &=& ((n-1)/n)2p(1-p) \quad . \\ \end{aligned}\] Because \(\mbox{E}(\tilde H) \ne 2p(1-p)\), \(\tilde H\) is a biased estimate of \(2p(1-p)\). If, however, we set \(\hat H = (n/(n-1))\tilde H\), then \(\hat H\) is an unbiased estimator of \(2p(1-p)\).

If you’ve ever wondered why you typically divide the sum of squared deviations about the mean by \(n-1\) instead of \(n\) when estimating the variance of a sample, this is why. Dividing by \(n\) gives you a (slightly) biased estimator.

The gory details

Starting where we left off above: \[\begin{aligned} \mbox{E}(\tilde H) &=& 2\left(\mbox{E}(\hat p) - \mbox{E}({\hat p}^2)\right) \\ &=& 2\left(p - \mbox{E}\left((k/n)^2\right)\right) \quad , \end{aligned}\] where \(k\) is the number of \(A_1\) alleles in our sample and \(n\) is the sample size. \[\begin{aligned} \mbox{E}\left((k/n)^2\right) &=& \sum (k/n)^2 \mbox{P}(k|p,n) \\ &=& (1/n)^2 \sum k^2 \mbox{P}(k|p,n) \\ &=& (1/n)^2 \left(\mbox{Var}(k) + \left(\mbox{E}(k)\right)^2\right) \\ &=& (1/n)^2 \left(np(1-p) + n^2p^2\right) \\ &=& p(1-p)/n + p^2 \quad . \end{aligned}\] Substituting this back into the equation above yields the following: \[\begin{aligned} \mbox{E}(\tilde H) &=& 2\left(p - \left(p(1-p)/n + p^2\right)\right) \\ &=& 2\left(p(1-p) - p(1-p)/n\right) \\ &=& \left(1 - 1/n\right)2p(1-p) \\ &=& ((n-1)/n)2p(1-p) \quad . \\ \end{aligned}\]
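Both identities in this derivation can be verified exactly by summing over the binomial distribution; a short Python check:

```python
from math import comb

# Exact checks of E((k/n)^2) = p(1-p)/n + p^2 and of
# E(H_tilde) = ((n-1)/n) * 2p(1-p), summing over the binomial pmf.
n, p = 20, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

E_phat2 = sum((k / n) ** 2 * pmf[k] for k in range(n + 1))
E_H = sum(2 * (k / n) * (1 - k / n) * pmf[k] for k in range(n + 1))

assert abs(E_phat2 - (p * (1 - p) / n + p**2)) < 1e-12
assert abs(E_H - ((n - 1) / n) * 2 * p * (1 - p)) < 1e-12
```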

Corrections for sampling error

There are two sources of allele frequency difference among subpopulations in our sample: (1) real differences in the allele frequencies among our sampled subpopulations and (2) differences that arise because allele frequencies in our samples differ from those in the subpopulations from which they were taken.

Nei’s \(G_{st}\)

Nei and Chesser described one approach to accounting for sampling error. So far as I’ve been able to determine, there aren’t any currently supported programs that calculate the bias-corrected versions of \(G_{st}\). I calculated the results in Table 2 by hand.

The calculations are tedious, which is why you’ll want to find some way of automating them if you need to do them. \[\begin{aligned} H_{i} &=& 1 - {1 \over N} \sum_{k=1}^{N} \sum_{i=1}^{m} {X_{kii}} \\ H_{s} &=& {\tilde n \over {\tilde n - 1}} \left[1 - \sum_{i=1}^{m} {\overline{x_{i}^{2}}} - {H_{i} \over {2 \tilde n}}\right] \\ H_{t} &=& 1 - \sum_{i=1}^{m} {\bar x_{i}}^{2} + {H_{s} \over {\tilde n}} - {H_{i} \over {2 \tilde n N}} \end{aligned}\] where we have \(N\) subpopulations and \(m\) alleles, \({\overline{x_{i}^{2}}} = \sum_{k=1}^{N} {x_{ki}^{2}}/N\) is the mean squared frequency of allele \(A_i\) across subpopulations, \({\bar x_{i}} = \sum_{k=1}^{N} x_{ki}/N\) is its mean frequency, \(\tilde n\) is the harmonic mean of the population sample sizes, i.e., \(\tilde n = \frac{1}{\frac{1}{N} \sum_{k=1}^{N} \frac{1}{n_k}}\), \(X_{kii}\) is the frequency of genotype \(A_{i}A_{i}\) in population \(k\), \(x_{ki}\) is the frequency of allele \(A_{i}\) in population \(k\), and \(n_k\) is the sample size from population \(k\). Recall that \[\begin{aligned} F_{is} &=& 1 - {H_{i} \over H_{s}} \\ F_{st} &=& 1 - {H_{s} \over H_{t}} \\ F_{it} &=& 1 - {H_{i} \over H_{t}} \quad . \end{aligned}\]
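One small point worth emphasizing: \(\tilde n\) is a harmonic mean, which is dominated by the smallest samples. For the Isotoma petraea sample sizes in Table 1:

```python
from statistics import harmonic_mean

# Harmonic mean sample size for the Isotoma petraea samples (Table 1).
# The three samples of size 1 pull the harmonic mean far below the
# arithmetic mean (188/12, about 15.7).
n = [29, 20, 20, 9, 9, 30, 30, 33, 5, 1, 1, 1]
n_tilde = harmonic_mean(n)
print(round(n_tilde, 2))  # about 3.28
```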

Weir and Cockerham’s \(\theta\)

Weir and Cockerham describe the fundamental ideas behind this approach. Weir and Hill bring things up to date. Holsinger and Weir provide a less technical overview. Most, if not all, packages available now that estimate \(F_{ST}\) provide estimates of \(\theta\). The most important difference between \(\theta\) and \(G_{st}\), and the reason why \(G_{st}\) has fallen into disuse, is that \(G_{st}\) ignores an important source of sampling error that \(\theta\) incorporates.

In many applications, especially in evolutionary biology, the subpopulations included in our sample are not an exhaustive sample of all populations. Moreover, even if we have sampled from every population there is now, we may not have sampled from every population there ever was. And even if we’ve sampled from every population there ever was, we know that there are random elements in any evolutionary process. Thus, if we could run the clock back and start it over again, the genetic composition of the populations we have might be rather different from that of the populations we sampled. In other words, our populations are, in many cases, best regarded as a random sample from a much larger set of populations that could have been sampled.

Even more gory details

Let \(x_{mn,i}\) be an indicator variable such that \(x_{mn,i} = 1\) if allele \(m\) from individual \(n\) is of type \(i\) and is 0 otherwise. Clearly, the sample frequency \(\hat p_i = \frac{1}{2N}\sum_{m=1}^2\sum_{n=1}^Nx_{mn,i}\), and \(E(\hat p_i) = p_i\), \(i=1\dots A\). Assuming that alleles are sampled independently from the population, \[\begin{aligned} E(x^2_{mn,i}) &=& p_i \\ E(x_{mn,i}x_{mn',i}) = E(x_{mn,i}x_{m'n',i}) &=& p_i^2 + \sigma_{x_{mn,i}x_{m'n',i}} \\ &=& p_i^2 + p_i(1-p_i)\theta \end{aligned}\] where \(\sigma_{x_{mn,i}x_{m'n',i}}\) is the intraclass covariance for the indicator variables and \[\theta = \frac{\sigma^2_{p_i}}{p_i(1-p_i)} \label{eq:theta-def}\] is the scaled among-population variance in allele frequency in the set of populations from which this population was sampled. Using ([eq:theta-def]) we find after some algebra \[\sigma^2_{\hat p_i} = p_i(1-p_i)\theta + \frac{p_i(1-p_i)(1-\theta)}{2N} \quad .\] The hat on \(\sigma^2_{\hat p_i}\) indicates that this is the variance of the sample allele frequencies among populations. A natural estimate for \(\theta\) emerges from the method of moments when an analysis of variance is applied to indicator variables derived from samples representing more than one population.
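A simulation makes the variance decomposition concrete. The sketch below (my construction, not Weir and Cockerham's) draws population frequencies from a beta distribution with variance \(p(1-p)\theta\), samples \(2N\) alleles from each, and checks the formula for \(\sigma^2_{\hat p_i}\).

```python
import random

# Simulation check of Var(p_hat) = p(1-p)*theta + p(1-p)*(1-theta)/(2N).
# Population frequencies come from a beta distribution with mean p and
# variance p(1-p)*theta; 2N alleles are then sampled binomially from each.
random.seed(1)
p, theta, N = 0.5, 0.2, 10
a = p * (1 - theta) / theta        # beta parameters giving mean p and
b = (1 - p) * (1 - theta) / theta  # variance p(1-p)*theta

reps = 100_000
p_hats = []
for _ in range(reps):
    p_i = random.betavariate(a, b)                  # population frequency
    k = sum(random.random() < p_i for _ in range(2 * N))  # sampled alleles
    p_hats.append(k / (2 * N))

mean = sum(p_hats) / reps
var = sum((x - mean) ** 2 for x in p_hats) / reps
expected = p * (1 - p) * theta + p * (1 - p) * (1 - theta) / (2 * N)
print(round(var, 4), round(expected, 4))  # both close to 0.06
```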

Applying \(G_{st}\) and \(\theta\)

If we return to the data that motivated this discussion, the results in Table 2 show what we get from analyses of the \(GOT-1\) data from Isotoma petraea (Table 1).

Comparison of Wright’s \(F\)-statistics when ignoring sampling effects with Nei’s \(G_{ST}\) and Weir and Cockerham’s \(\theta\).
Method \(F_{is}\) \(F_{st}\) \(F_{it}\)
Direct 0.1372 0.2143 0.3221
Nei 0.3092 0.2395 0.4746
Weir & Cockerham 0.5398 0.0387 0.5577

But first a note on how you’ll see statistics like this reported in the literature. It can get a little confusing, because of the different symbols that are used. Sometimes you’ll see \(F_{is}\), \(F_{st}\), and \(F_{it}\). Sometimes you’ll see \(f\), \(\theta\), and \(F\). And it will seem as if they’re referring to similar things. That’s because they are. They’re really just different symbols for the same thing (see Table 3).

Equivalent notations often encountered in descriptions of population genetic structure.
Notation
Wright Weir & Cockerham
\(F_{it}\) \(F\)
\(F_{is}\) \(f\)
\(F_{st}\) \(\theta\)

Strictly speaking the symbols in Table 3 are the parameters, i.e., values in the population that we try to estimate. We should put hats over any values estimated from data to indicate that they are estimates of the parameters, not the parameters themselves. But we’re usually a bit sloppy, and everyone knows that we’re presenting estimates, so we usually leave off the hats.

An example from Wright

Hierarchical analysis of variation in the frequency of the Standard chromosome arrangement of Drosophila pseudoobscura in the western United States. Wright uses his rather peculiar method of accounting for sampling error. I haven’t gone back to the original data and used a more modern method of analysis.

66 populations (demes) were studied. Demes are grouped into eight regions, and the regions are grouped into four primary subdivisions.

Results

Correlation of gametes within individuals relative to regions (\(F_{IR}\)): 0.0444
Correlation of gametes within regions relative to subdivisions (\(F_{RS}\)): 0.0373
Correlation of gametes within subdivisions relative to total (\(F_{ST}\)): 0.1478
Correlation of gametes in sample (\(F_{IT}\)): 0.2160

\[1 - F_{IT} = (1 - F_{IR})(1 - F_{RS})(1 - F_{ST})\]
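It’s worth checking that the estimates reported above satisfy this relationship, up to rounding error:

```python
# Check that the hierarchical F-statistics multiply as claimed:
# 1 - F_IT = (1 - F_IR)(1 - F_RS)(1 - F_ST).
F_IR, F_RS, F_ST, F_IT = 0.0444, 0.0373, 0.1478, 0.2160
product = (1 - F_IR) * (1 - F_RS) * (1 - F_ST)
assert abs((1 - F_IT) - product) < 1e-3  # agrees to rounding error
```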

Interpretation

There is relatively little inbreeding within regions (\(F_{IR}\) = 0.04) and relatively little genetic differentiation among regions within subdivisions (\(F_{RS} = 0.04\)). There is, however, substantial genetic differentiation among the subdivisions (\(F_{ST} = 0.15\)).

Thus, an explanation for the chromosomal diversity that predicted great local differentiation and little or no differentiation at a large scale would be inconsistent with these observations.

Reich’s \(f\)-statistics

No, that heading isn’t a typo. I’ve described Wright’s \(F\)-statistics to you, but there are some other \(f\)-statistics you may encounter and should know about. The \(f\)-statistics I’ll describe briefly now were introduced by Reich and colleagues in the context of estimating the population history and admixture of human populations on the Indian subcontinent. If you read the paper, you won’t find the definitions in the main text. They’re in Supplement 1.

Reich’s \(f\)-statistics are defined only for markers that have two alleles, like SNP loci. I’m going to use notation that’s a bit different from Reich et al.’s, but it will match more closely what we’ve been using here, and it should be easier to follow. Let \(n_k\) be the number of 1 alleles in the sample from population \(k\) and let \(m_k\) be the number of 0 alleles in the sample from population \(k\). \(\hat p_k = n_k/(n_k + m_k)\) is both an unbiased and a maximum-likelihood estimate of \(p_k\). Let’s define \[f_4(1,2,3,4) = (p_1 - p_2)(p_3 - p_4) \quad .\] If you stare at that a bit, you may recognize that \(f_4\) looks like a correlation coefficient. In fact, as Reich et al. point out, if the true population phylogeny looks like the one in Figure 1, then the expected value of \(f_4\) is 0. Looking at the figure you can see that populations 1 and 2 have a common history that is independent of populations 3 and 4 (and vice versa). As a result, if you calculate \(f_4\) statistics from a large number of loci, you can see whether the relationship among four populations is consistent with the phylogeny in Figure 1.

Hypothetical phylogeny for four populations. If you’re wondering why it’s upside down, I warned you. Population geneticists look at the world backward from other people. It’s conventional to draw phylogenies this way in population genetics, probably because, being geneticists, population geneticists tend to think of them as pedigrees.
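To see why \(\mbox{E}(f_4) = 0\) under the phylogeny in Figure 1, here’s a crude simulation sketch in which drift on each branch is modeled as an independent Gaussian perturbation of the allele frequency (an illustrative assumption, not Reich et al.’s model):

```python
import random

# Under the Figure 1 topology, drift on the branch leading to (1, 2) is
# independent of drift on the branch leading to (3, 4), so E(f4) = 0.
# Drift is crudely modeled as independent Gaussian perturbations; values
# may stray slightly outside [0, 1], which is harmless for this sketch.
random.seed(2)

def f4(p1, p2, p3, p4):
    return (p1 - p2) * (p3 - p4)

loci = 50_000
total = 0.0
for _ in range(loci):
    p_anc = random.uniform(0.1, 0.9)        # ancestral frequency
    left = p_anc + random.gauss(0, 0.05)    # internal branch to (1, 2)
    right = p_anc + random.gauss(0, 0.05)   # internal branch to (3, 4)
    p1 = left + random.gauss(0, 0.05)
    p2 = left + random.gauss(0, 0.05)
    p3 = right + random.gauss(0, 0.05)
    p4 = right + random.gauss(0, 0.05)
    total += f4(p1, p2, p3, p4)

mean_f4 = total / loci
print(round(mean_f4, 4))  # close to 0
```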

We can also define \[f_3(1,2,3) = (p_1 - p_2)(p_1 - p_3) - \frac{p_1(1 - p_1)}{n_1} \quad ,\] where \(n_1\) is the sample size in population 1. The extra term, \(\frac{p_1(1 - p_1)}{n_1}\), makes \(f_3\) an unbiased estimator. \(f_3\) measures how much common history \(p_2\) and \(p_3\) share that is independent of \(p_1\). If the true population phylogeny looks like the one in Figure 2, then \[\begin{aligned} \mbox{E}\left(f_3(1,2,3)\right) &>& \mbox{E}\left(f_3(2,1,3)\right) \\ \mbox{E}\left(f_3(1,2,3)\right) &>& \mbox{E}\left(f_3(3,1,2)\right) \quad . \\ \end{aligned}\] Finally, we can define \[f_2(1,2) = (p_1 - p_2)^2 - \frac{p_1(1 - p_1)}{n_1} - \frac{p_2(1 - p_2)}{n_2} \quad ,\] which gives an unbiased estimate of squared allele frequency differences between populations, analogous to pairwise \(F\)-statistics.

Hypothetical phylogeny for three populations.

Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.