next up previous
Next: Returning to the ABO Up: An introduction to Bayesian Previous: An introduction to Bayesian

Estimating allele frequencies with two alleles

Let's suppose we've collected data from a population of Desmodium cuspidatum15 and have found 7 alleles coding for the fast allele at a enzyme locus encoding glucose-phosphate isomerase in a sample of 20 alleles. We want to estimate the frequency of the fast allele. The maximum-likelihood estimate is $7/20 =
0.35$, which we got by finding the value of $p$ that maximizes

\begin{eqnarray*}
\mbox{P}(p\vert N,k) &=& {N \choose k} p^k (1-p)^{N-k} \quad ,
\end{eqnarray*}

where $N=20$ and $k=7$. A Bayesian uses the same likelihood, but has to specify a prior distribution for $p$. If we didn't know anything about the allele frequency at this locus in D. cuspidatum before starting the study, it makes sense to express that ignorance by choosing $\mbox{P}(p)$ to be a uniform random variable on the interval $[0,1]$. That means we regarded all values of $p$ as equally likely prior to collecting the data.16

Until a little over fifteen years ago it was necessary to do a bunch of complicated calculus to combine the prior with the likelihood to get a posterior. Since the early 1990s statisticians have used a simulation approach, Monte Carlo Markov Chain sampling, to construct numerical samples from the posterior. For the problems encountered in this course, we'll mostly be using the freely available software package WinBUGS to implement Bayesian analyses. For the problem we just encountered, here's the code that's needed to get our results:17

model {

   # likelihood
   k ~ dbin(p, N)

   # prior
   p ~ dunif(0,1)

}

list(k = 7, n = 20)
Running this in WinBUGS produces the results in Figure 1.

Figure 1: Results of a WinBUGS analysis with the made up allele count data from Desmodium cuspidatum.
\resizebox{\textwidth}{!}{\includegraphics{binomial-results.eps}}

The column headings in Figure 1 should be fairly self-explanatory, except for the one labeled MC error.18 mean is the posterior mean. It's our best guess of the value for the frequency of the fast allele. s.d. is the posterior standard deviation. It's our best guess of the uncertainty associated with our estimate of the frequency of the fast allele. The 2.5%, 50%, and 97.5% columns are the percentiles of the posterior distribution. The [2.5%, 97.5%] interval is the 95% credible interval, which is analogous to the 95% confidence interval in classical statistics, except that we can say that there's a 95% chance that the frequency of the fast allele lies within this interval.19 Since the results are from a simulation, different runs will produce slightly different results. In this case, we have a posterior mean of about 0.36 (as opposed to the maximum-likelihood estimate of 0.35), and there is a 95% chance that $p$ lies in the interval [0.18, 0.57].20


next up previous
Next: Returning to the ABO Up: An introduction to Bayesian Previous: An introduction to Bayesian
Kent Holsinger 2008-08-13