A little over 20 years ago a different approach to the analysis of genetic structure began to emerge: analysis of individual assignment. Although the implementation details get a little hairy,3 the basic idea is fairly simple. I’ll give an outline of the math in a moment, but let’s do this in words first. Suppose we have genetic data on a series of individuals at several to many (or very many) loci. If two individuals are part of the same population, we expect them to be more similar to one another than they are to individuals in other populations. So if we “cluster” individuals that are “genetically similar” to one another, those clusters should correspond to populations. Rather than pre-defining the populations, we will have allowed the data to tell us what the populations are.4 We haven’t even required a priori that individuals be grouped according to their geographic proximity. Instead, we can examine the clusters we find and see if they make any sense geographically.
Now for an outline of the math. Label the genetic data we have for individual i as x_i, the (unknown) population from which that individual came as z_i, and the (unknown) allele frequencies in population k as p_k.
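In outline, the Bayesian clustering idea behind Structure looks like this (a schematic of the no-admixture version, not the full model): writing X for the genotype data, Z for the unknown cluster assignments, and P for the unknown cluster allele frequencies, Structure samples from the posterior

```latex
\Pr(Z, P \mid X) \;\propto\; \Pr(X \mid Z, P)\,\Pr(Z)\,\Pr(P)
```

using Markov chain Monte Carlo. The probability that individual i belongs to cluster k is then estimated as the frequency with which z_i = k across the MCMC samples.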
To see a simple example of how Structure can be used, we'll use it to assess whether cultivated genotypes of Japanese barberry, Berberis thunbergii, contribute to ongoing invasions in Connecticut and Massachusetts. The first problem is to determine what value of K, the number of genotypic clusters, the data support.
| K | Mean L(K) |
|---|-----------|
| 2 | -2553.2 |
| 3 | -2331.9 |
| 4 | -2402.9 |
| 5 | -2476.3 |
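As a minimal illustration, here is the simplest heuristic for choosing K, picking the value with the highest mean L(K) from the table above (in practice you'd also look at statistics like Evanno's ΔK rather than relying on mean L(K) alone):

```python
# Mean log-likelihoods from the table above, keyed by K.
mean_L = {2: -2553.2, 3: -2331.9, 4: -2402.9, 5: -2476.3}

# Pick the K with the highest (least negative) mean L(K).
best_K = max(mean_L, key=mean_L.get)
print(best_K)  # -> 3
```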
Having determined that the data best support K = 3, we can examine how individuals are assigned to those three clusters (Figure 1).
Figure 1 may not look terribly
informative, but actually it is. Look at the labels beneath the figure.
You’ll see that with the exception of individual 17 from Beaver Brook
Park, all of the individuals that are solid blue are members of the
cultivated Berberis thunbergii var.
atropurpurea. The solid red bar corresponds to
Berberis vulgaris ’Atropurpurea’, a different
modern cultivar.10 You’ll notice that individuals 1,
2, 18, and 19 from Beaver Brook Park and individual 1 from Bluff Point
State Park fall into the same genotypic cluster as this cultivar.
A much more interesting application of Structure appeared shortly after Structure was introduced. The Human Genome Diversity Cell Line Panel (HGDP-CEPH) consisted at the time of data from 1056 individuals in 52 geographic populations. Each individual was genotyped at 377 autosomal loci. If those populations are grouped into 5 broad geographical regions (Africa, [Europe, the Middle East, and Central/South Asia], East Asia, Oceania, and the Americas), we find that about 93% of genetic variation is found within local populations and only about 4% is a result of allele frequency differences between regions.
You might wonder why Europe, the Middle East, and Central/South Asia were grouped together for that analysis. The reason becomes clearer when you look at a Structure analysis of the data (Figure 2).

Figure 2: Structure analysis of microsatellite diversity in the Human Genome Diversity Cell Line Panel (from ).

Structure
has a lot of nice features, but you'll discover a couple of things about it if you begin to use it seriously: (1) It often isn't obvious what the "right" K is.11 (2) It requires a lot of computational resources, especially with datasets that include a few thousand SNPs, as is becoming increasingly common. An alternative is to use principal component analysis directly on genotypes. There are technical details associated with estimating the principal components and interpreting them that we won't discuss,12 but the results can be pretty striking. Figure 3 shows the results of a PCA on data derived from 3192 Europeans at 500,568 SNP loci. The correspondence between the position of individuals in PCA space and geographical space is remarkable.
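To make the idea concrete, here is a hedged sketch of PCA applied directly to genotypes. The data are simulated (not the European SNP data just described): two made-up populations with different allele frequencies, genotypes coded 0/1/2 as counts of one allele. Mean-centering is the minimal preprocessing; methods such as EIGENSTRAT also scale each SNP by a binomial standard deviation.

```python
import numpy as np

# Simulate a genotype matrix: rows = individuals, columns = SNPs,
# entries = count of one allele (0, 1, or 2).
rng = np.random.default_rng(0)
n_ind, n_snp = 60, 200
# Two hypothetical populations with independently drawn allele frequencies.
freqs = np.vstack([rng.uniform(0.1, 0.9, n_snp),
                   rng.uniform(0.1, 0.9, n_snp)])
pop = np.repeat([0, 1], n_ind // 2)          # population label per individual
geno = rng.binomial(2, freqs[pop])           # (n_ind, n_snp) genotype matrix

# Center each SNP, then compute principal components via SVD.
centered = geno - geno.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U * s                                  # individual coordinates in PCA space

# If the populations really differ, PC1 should separate them.
pc1 = pcs[:, 0]
print(pc1[pop == 0].mean(), pc1[pop == 1].mean())
```

Plotting the first two columns of `pcs` against each other is the analogue of Figure 3: individuals from the same population cluster together without any labels being supplied.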
Jombart et al. describe a related method known as discriminant analysis of principal components. They also provide an implementation in the R package adegenet (the dapc() function). I prefer Structure because its approach to individual assignment is based directly on population genetic principles, and because computers have become fast enough (especially when you have a computational cluster available) that I worry less about how long it takes to run an analysis on large datasets.13 That being said, Gopalan et al. released teraStructure about five years ago, which can analyze data sets consisting of 10,000 individuals scored at a million SNPs in less than 10 hours. I haven't tried it myself, because I haven't had a large data set to try it on, but you should keep it in mind if you collect SNP data on a large number of loci. Here are a couple more alternatives to consider that I haven't investigated yet:
- sNMF estimates individual admixture coefficients. It is reportedly 10-30 times faster than the likelihood-based ADMIXTURE, which is itself faster than Structure. sNMF is part of the R package LEA.
- Meisner and Albrechtsen present both a principal components method and an admixture method that accounts for the sequencing errors inherent in low-coverage next-generation DNA sequencing data.
These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.