Before we can determine whether genotypes in a population are in
Hardy-Weinberg proportions, we need to be able to estimate the
frequency of both genotypes and alleles. This is easy when you can
identify all of the alleles within genotypes, but suppose that we're
trying to estimate allele frequencies in the ABO blood group system in
humans. Then we have a situation that looks like this:
| Phenotype | A | AB | B | O |
| Genotype(s) | aa, ao | ab | bb, bo | oo |
| No. in sample | $N_A$ | $N_{AB}$ | $N_B$ | $N_O$ |
Now we can't directly count the number of $a$, $b$, and $o$
alleles. What do we do? Well, about 50 years ago, some statisticians
came up with a sneaky approach called the EM algorithm. It uses a
trick you'll see repeatedly through this course. When we don't know
something we want to know, we pretend that we know it and do some
calculations with it. If we're lucky, we can fiddle with our
calculations a bit to relate the thing that we pretended to know to
something we actually do know so we can figure out what we wanted to
know. Make sense? Probably not. But let's try an example.
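If we knew $p_a$, $p_b$, and $p_o$, we could figure out how many
individuals with the A phenotype have the $aa$ genotype and how many
have the $ao$ genotype, namely

$$N_{aa} = N_A \left(\frac{p_a^2}{p_a^2 + 2 p_a p_o}\right), \qquad N_{ao} = N_A \left(\frac{2 p_a p_o}{p_a^2 + 2 p_a p_o}\right).$$

Obviously we could do the same thing for the B phenotype:

$$N_{bb} = N_B \left(\frac{p_b^2}{p_b^2 + 2 p_b p_o}\right), \qquad N_{bo} = N_B \left(\frac{2 p_b p_o}{p_b^2 + 2 p_b p_o}\right).$$

Notice that $N_{ab} = N_{AB}$ and $N_{oo} = N_O$ (lowercase
subscripts refer to genotypes, uppercase to phenotypes). If we knew
all this, then we could calculate $p_a$, $p_b$, and $p_o$ from

$$p_a = \frac{2N_{aa} + N_{ao} + N_{ab}}{2N}, \qquad p_b = \frac{2N_{bb} + N_{bo} + N_{ab}}{2N}, \qquad p_o = \frac{2N_{oo} + N_{ao} + N_{bo}}{2N},$$

where $N$ is the total sample size.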
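The two calculations just described can be written as a pair of small functions. This is only a sketch in Python: the function and argument names (`expected_genotype_counts`, `allele_frequencies`, `n_A`, and so on) are invented for illustration, and the code assumes the allele frequencies passed in are strictly positive.

```python
def expected_genotype_counts(n_A, n_B, p_a, p_b, p_o):
    """Split the A and B phenotype counts into expected genotype counts.

    Assumes Hardy-Weinberg proportions and p_a, p_b > 0 (hypothetical
    helper, not part of the original notes).
    """
    denom_a = p_a**2 + 2 * p_a * p_o   # frequency of the A phenotype
    denom_b = p_b**2 + 2 * p_b * p_o   # frequency of the B phenotype
    n_aa = n_A * p_a**2 / denom_a
    n_ao = n_A * 2 * p_a * p_o / denom_a
    n_bb = n_B * p_b**2 / denom_b
    n_bo = n_B * 2 * p_b * p_o / denom_b
    return n_aa, n_ao, n_bb, n_bo


def allele_frequencies(n_aa, n_ao, n_bb, n_bo, n_ab, n_oo):
    """Count alleles: each individual carries two, homozygotes count twice."""
    n = n_aa + n_ao + n_bb + n_bo + n_ab + n_oo   # total sample size N
    p_a = (2 * n_aa + n_ao + n_ab) / (2 * n)
    p_b = (2 * n_bb + n_bo + n_ab) / (2 * n)
    p_o = (2 * n_oo + n_ao + n_bo) / (2 * n)
    return p_a, p_b, p_o
```

The genotype counts returned by the first function need not be whole numbers. They are expected counts, which is why they can be fed straight into the second function.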
Surprisingly enough we can actually estimate the allele frequencies by
using this trick. Just take a guess at the allele frequencies. Any
guess will do. Then calculate $N_{aa}$, $N_{ao}$, $N_{bb}$, $N_{bo}$,
$N_{ab}$, and $N_{oo}$ as described in the preceding
paragraph. That's the
Expectation part of the EM algorithm. Now take the values for
$N_{aa}$, $N_{ao}$, $N_{bb}$, $N_{bo}$, $N_{ab}$, and $N_{oo}$ that
you've calculated and use them to calculate new values for the allele
frequencies. That's the Maximization part of the EM
algorithm. Chances are your new values for $p_a$, $p_b$, and $p_o$
won't match your initial guesses, but if you take these new values and
start the process over and repeat the whole sequence several times,
eventually the allele frequencies you get out at the end match those
you started with. These are maximum-likelihood estimates of the allele
frequencies.
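The whole guess-and-repeat cycle might be sketched in Python as follows. The function name `em_abo`, the stopping tolerance `tol`, and the iteration cap `max_iter` are assumptions made for this illustration; the notes only say to repeat until the frequencies stop changing. The starting guess must have all three frequencies greater than zero, or the divisions fail.

```python
def em_abo(n_A, n_AB, n_B, n_O, guess=(1/3, 1/3, 1/3), tol=1e-9, max_iter=1000):
    """Iterate the Expectation and Maximization steps for ABO phenotype counts.

    Returns the final (p_a, p_b, p_o) and the number of iterations used.
    Hypothetical sketch: tol and max_iter are illustrative choices.
    """
    p_a, p_b, p_o = guess
    n = n_A + n_AB + n_B + n_O          # total sample size N
    for iteration in range(1, max_iter + 1):
        # Expectation: split the A and B phenotypes into expected genotypes.
        n_aa = n_A * p_a**2 / (p_a**2 + 2 * p_a * p_o)
        n_ao = n_A - n_aa
        n_bb = n_B * p_b**2 / (p_b**2 + 2 * p_b * p_o)
        n_bo = n_B - n_bb
        # Maximization: recount alleles from the expected genotype counts.
        new = ((2 * n_aa + n_ao + n_AB) / (2 * n),
               (2 * n_bb + n_bo + n_AB) / (2 * n),
               (2 * n_O + n_ao + n_bo) / (2 * n))
        # Largest change in any allele frequency during this round.
        change = max(abs(x - y) for x, y in zip(new, (p_a, p_b, p_o)))
        p_a, p_b, p_o = new
        if change < tol:
            break
    return (p_a, p_b, p_o), iteration
```

Stopping when the largest change falls below a small tolerance is one common way to decide that the frequencies you get out match those you started with. A looser tolerance stops sooner and gives fewer reliable decimal places.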
Consider the following example:
| Phenotype | A | AB | B | O |
| No. in sample | 25 | 50 | 25 | 15 |
We'll start with the guess that $p_a = 0.33$, $p_b = 0.33$, and
$p_o = 0.34$. With that assumption we would calculate that
$25\left(0.33^2/(0.33^2 + 2(0.33)(0.34))\right) = 8.168$
of the A phenotypes in the sample have
genotype $aa$, and the remaining 16.832 have genotype $ao$. Similarly,
we can calculate that 8.168 of the B phenotypes in the
sample have genotype $bb$, and the remaining 16.832 have genotype
$bo$. Now that we have a guess about how many individuals of each
genotype we have, we can calculate a new guess for the allele
frequencies, namely $p_a = 0.362$, $p_b = 0.362$, and
$p_o = 0.277$. By the time we've repeated this process four more times, the
allele frequencies aren't changing anymore. So the maximum likelihood
estimate of the allele frequencies is $p_a = 0.372$, $p_b = 0.372$,
and $p_o = 0.256$.
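As a rough check on where the iteration ends up, we can solve for the stopping point directly in this particular example. The sketch below assumes that, because the A and B counts are equal (25 each), the solution has $p_a = p_b = p$ and therefore $p_o = 1 - 2p$. That symmetry is an assumption made for the check, not something the EM algorithm requires. With $N_A = N_B = 25$, $N_{AB} = 50$, $N_O = 15$, and $2N = 230$, setting the new value of $p_a$ equal to the old one gives

$$
\begin{aligned}
2N_{aa} + N_{ao} &= N_A\,\frac{2p^2 + 2p(1-2p)}{p^2 + 2p(1-2p)} = \frac{50(1-p)}{2-3p}, \\
p &= \frac{1}{230}\left(\frac{50(1-p)}{2-3p} + 50\right)
  \;\Longrightarrow\; 23p^2 - 22p + 5 = 0, \\
p &= \frac{11 - \sqrt{6}}{23} \approx 0.372, \qquad p_o = 1 - 2p \approx 0.256.
\end{aligned}
$$

The other root, $(11 + \sqrt{6})/23 \approx 0.585$, would make $p_o$ negative, so it can't be a set of allele frequencies.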
Kent Holsinger
2008-08-13