
Statistical parsimony

Templeton et al. [5] lay out the theory and procedures involved in statistical parsimony in great detail. The details get a little complicated, and we'll get to those complications soon enough, but in outline the process is pretty simple.

So why use parsimony? Within species the time for substitutions to occur is relatively short. As a result, it may be reasonable to assume that we don't have to worry about multiple substitutions having occurred at the same site, at least between those haplotypes that are the most closely related. To ``identify the limits of parsimony'' we first estimate $\theta=4N_e\mu$ from our data. Then we plug that estimate into a formula that gives the probability that a difference between two randomly drawn haplotypes in our sample is the result of more than one substitution. If that probability is small, say less than 5%, we can connect all of the haplotypes into a parsimonious network.
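The text does not say which estimator of $\theta$ is used, so the following is only a minimal sketch of one common choice, Watterson's estimator, $\hat\theta_W = S / \sum_{i=1}^{n-1} 1/i$, where $S$ is the number of segregating sites among $n$ sampled sequences. Whether Templeton et al. [5] use this estimator is an assumption here, and the function names and the sample data are hypothetical.

```python
def segregating_sites(sequences):
    """Count alignment columns at which the sampled sequences are not all identical."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)


def watterson_theta(sequences):
    """Watterson's estimator of theta for the whole locus (not per site).

    Assumes the sequences are aligned, equal in length, and that the list
    holds every sampled sequence, duplicates included.
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic number with n - 1 terms
    return segregating_sites(sequences) / a_n


# Made-up sample of four aligned sequences, for illustration only.
sample = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
theta_hat = watterson_theta(sample)
```

The estimate is per locus. Dividing by the sequence length would give a per-site value, if that is what the probability formula in [5] expects.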

More likely than not, we won't be able to connect all of the haplotypes parsimoniously, but there's still a decent chance that we'll be able to identify subsets of the haplotypes for which the assumption of parsimonious change is reasonable. Templeton et al. [5] suggest the following procedure to construct a haplotype network:

Step 1:
Estimate $P_1$, the probability that haplotype pairs differing by a single change are the result of a single substitution. If $P_1 > 0.95$, as is likely, connect all pairs of haplotypes that differ by a single change. There may be ambiguities in the reconstruction, including loops. Keep these in the network.

Step 2:
Identify the products of recombination by inspecting the 1-step network to determine whether postulating recombination between a pair of sequences can remove an ambiguity identified in Step 1.

Step 3:
Increase $j$ by one (starting from $j = 1$ in Step 1) and estimate $P_j$, the probability that haplotype pairs differing by $j$ changes are connected parsimoniously. If $P_j > 0.95$, join $(j-1)$-step networks into a $j$-step network by connecting the two haplotypes that differ by $j$ steps. Repeat until either all haplotypes are included in a single network or you are left with two or more non-overlapping networks.

Step 4:
If you have two or more networks left to connect, estimate the smallest number of non-parsimonious changes that will occur with greater than 95% probability, and connect the networks.
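Steps 1 and 3 above can be sketched in code. This is a simplified illustration, not the procedure of Templeton et al. [5] in full: the values of $P_j$ are supplied by the caller through a hypothetical `p_of_j` function rather than computed from $\theta$, inferred intermediate haplotypes are not added to the network, and Steps 2 and 4 (recombination and non-parsimonious connections) are left out. All names and the example data are made up.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def build_networks(haplotypes, p_of_j, threshold=0.95):
    """Join distinct haplotypes at j = 1, 2, ... while p_of_j(j) > threshold.

    Returns (edges, networks). Each edge is (i, k, j): haplotypes i and k
    were connected at step j. Networks are sorted lists of haplotype indices.
    """
    n = len(haplotypes)
    parent = list(range(n))  # union-find forest tracking network membership

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = [(hamming(haplotypes[i], haplotypes[k]), i, k)
             for i, k in combinations(range(n), 2)]
    max_steps = max((d for d, _, _ in pairs), default=0)
    edges = []
    for j in range(1, max_steps + 1):
        if p_of_j(j) <= threshold:
            break  # the limit of parsimony has been reached
        for d, i, k in pairs:
            # Step 1 keeps every 1-step connection, loops included.
            # For j > 1, only haplotypes in different networks are joined.
            if d == j and (j == 1 or find(i) != find(k)):
                edges.append((i, k, j))
                parent[find(i)] = find(k)
        if len({find(i) for i in range(n)}) == 1:
            break  # everything is already in a single network
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return edges, sorted(groups.values())


# Made-up haplotypes and made-up P_j values, for illustration only.
haps = ["AAAAAA", "AAAAAT", "AAAATT", "CCAAAA", "GGGGGA"]
p_table = {1: 0.99, 2: 0.96, 3: 0.90}


def p_of_j(j):
    return p_table.get(j, 0.0)


edges, networks = build_networks(haps, p_of_j)
# edges    -> [(0, 1, 1), (1, 2, 1), (0, 3, 2)]
# networks -> [[0, 1, 2, 3], [4]]
```

In this toy example the procedure stops at $j = 3$, where the supplied $P_3$ falls below 0.95, leaving two networks that Step 4 would then have to connect.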

Refer to Templeton et al. [5] for details of the calculations. Figure 2 provides an example of the resulting analysis.

Figure 2: Statistical parsimony network for the Amy locus of Drosophila melanogaster.


Kent Holsinger 2006-12-06