Chi-squared Tests and Fisher's Exact Tests in R
R is an open source statistical package with versions available
for Linux, Mac OS X, and Windows. In addition to being available
without charge, it is very powerful and very flexible. It's
been my statistical package of choice for the past 4-5 years. Running
a chi-squared test or Fisher's exact test, as you need to do for the
first question in Problem #1, is very straightforward - once you
get the data in. In the example that follows, we'll imagine that
you've determined the number of A1 and A2 alleles contributed by
fathers to offspring of each maternal genotype as follows:
                  | Paternal gamete
Maternal genotype |  A1 |  A2
A1A1              |  12 |  17
A1A2              |   4 |  25
A2A2              |  15 |   4
Let me show you the code first. Then I'll explain it.
> alleles <- matrix(c(12, 4, 15, 17, 25, 4), nr=3,
+ dimnames=list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))
> alleles
     A1 A2
A1A1 12 17
A1A2  4 25
A2A2 15  4
> chisq.test(alleles)
Pearson's Chi-squared test
data: alleles
X-squared = 20.2851, df = 2, p-value = 3.937e-05
> fisher.test(alleles)
Fisher's Exact Test for Count Data
data: alleles
p-value = 2.957e-05
alternative hypothesis: two.sided
The first thing to know about R is that it's command-line
oriented. If you've ever used Linux, the DOS box in Windows, or the
terminal in Mac OS X, you'll be familiar with the idea of a
``prompt''. That's the > that you see at the start of some lines. It
indicates that R is waiting for you to type something. Sometimes
you'll have a long command to type, as I do on the first line. If you
haven't finished your command yet when you hit return, you'll get a
different prompt, the + that you see on the second line. This
reminds you that you're in the middle of typing a command. So what
does that first line do?
- alleles - This is the name of an object that the first
command creates in which to store the data. I do this so that I can
easily refer to the data later. You can give the object pretty much
any name you want, as long as it doesn't start with a number. You
could call it Fred, if you wanted to, or x, if you don't
want to type so much.
- <- - This is the ``operator'' that assigns the result
of the command we put on the right side to the object, alleles, on the left side.
- matrix - This tells R that we're going to
construct a matrix with our data. The stuff in between the pair of
parentheses that start on this line and end on the next line tells
R how to construct the matrix.
- c(12, 4, 15, 17, 25, 4), - As you can probably guess,
this is the data. The c() combines the numbers into a single vector,
which you can think of as a column of data (each element separated
by a comma). Notice that we list the data by going
down the first column of the table to the bottom and beginning again
at the top of the second column. The comma after the closing
parenthesis lets R know that the next ``argument'' is coming.
- nr=3, - This argument tells R that there are
three rows in our data (nr is an abbreviation of the full argument
name, nrow). This means that the column of data you just
created has to have a multiple of three elements in it. In our case
we have six, so we're fine. The comma tells R that the next
argument is coming. Notice that I hit the return key
after the comma so we now go to the next line, which has a prompt of
+ because the command isn't finished. Instead of the comma,
you could put a parenthesis here, ). When you hit return,
you'd get a > instead of a +, indicating that the
command was done.
- The second line, the one that starts with dimnames, is optional,
but I rather like it. It allows me to
check to make sure I've entered the data the way that I thought I
did. dimnames means that I'm going to give names to the rows
and columns. There's a list with two elements, a column of
maternal genotypes, c("A1A1", "A1A2", "A2A2"), and a column of
paternal alleles, c("A1", "A2"). There are three parentheses
at the end of the line: one closes c(, one closes list(, and one
closes matrix(. When you hit return you
get the primary prompt, >.
- Now I just type alleles at the prompt and I see the table
of data displayed before me. If you'd left off the dimnames,
i.e., if you'd entered only
> alleles <- matrix(c(12, 4, 15, 17, 25, 4), nr=3)
you'd get a table that looks like this
> alleles
     [,1] [,2]
[1,]   12   17
[2,]    4   25
[3,]   15    4
As you can see, including dimnames makes it easier to see that
the data you've entered matches what you actually wanted to enter.
- That's the hard part. To run a chi-squared
test, just type chisq.test(alleles) and hit return. You'll get the following
result.
Pearson's Chi-squared test
data: alleles
X-squared = 20.2851, df = 2, p-value = 3.937e-05
You can run a Fisher's exact test by typing fisher.test(alleles).
Fisher's Exact Test for Count Data
data: alleles
p-value = 2.957e-05
alternative hypothesis: two.sided
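Going back to the data-entry step for a moment, here is a small sketch of the column-by-column rule described above. It is only an illustration, not part of the original session: the object names by_col and by_row are made up, nrow is the full spelling of nr, and byrow is an argument of matrix() that lets you list the counts one row at a time instead.

```r
# Row and column labels, reused for both versions of the table.
labels <- list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2"))

# Default behaviour: matrix() fills the first column top to bottom,
# then starts again at the top of the second column.
by_col <- matrix(c(12, 4, 15, 17, 25, 4), nrow = 3, dimnames = labels)

# With byrow = TRUE the same counts are listed one row at a time.
by_row <- matrix(c(12, 17, 4, 25, 15, 4), nrow = 3, byrow = TRUE,
                 dimnames = labels)

# Check that both ways of typing the data describe the same table.
same_table <- identical(by_col, by_row)
print(same_table)

# dim() reports the number of rows, then the number of columns.
print(dim(by_col))
```

If you find it easier to read the counts across the table rather than down it, the byrow form may save you a few data-entry mistakes. Either way, printing the object is still the simplest check.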
In most cases the P-values won't be too different, but if they are
different, the exact test is right. Since it's just as easy to run a
Fisher's exact test in R as a chi-squared test, I prefer the
Fisher's exact test. The chi-squared test is based on an approximation and was
useful before the advent of powerful desktop computers, but the only
reason to use it now is if you don't have a computer handy and need to
do the calculations by hand.
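One way to get a feel for how good the approximation is likely to be is to look at the expected counts that chisq.test() computes along the way. The sketch below is an addition to the original notes. It assumes the common rule of thumb that the approximation is reasonable when the expected counts are all around 5 or more, and the name result is just a convenient label.

```r
# Re-create the table so this example stands on its own.
alleles <- matrix(c(12, 4, 15, 17, 25, 4), nrow = 3,
                  dimnames = list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))

# chisq.test() returns an object; store it instead of just printing it.
result <- chisq.test(alleles)

# Expected counts under independence: (row total * column total) / grand total.
print(result$expected)

# Rule-of-thumb check (an assumption, not a hard rule): is the smallest
# expected count at least 5?
smallest <- min(result$expected)
print(smallest >= 5)
```

When the smallest expected count is well below 5, R also prints a warning that the chi-squared approximation may be incorrect, and that is a good cue to rely on the exact test.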
In this case the chi-squared test gives us a P-value of 3.937e-05
(about 0.00004) and Fisher's exact test gives us a P-value of
2.957e-05 (about 0.00003). So regardless of which test we choose, we
have very strong evidence that there are significant differences in
the proportions of A1 and A2 sperm fertilizing eggs of the three
different maternal genotypes.
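If you want the P-values as numbers rather than reading them off the printed output, or want to see where the differences among maternal genotypes lie, something like the following sketch should work. It is not part of the original session. The names chisq_p, fisher_p, and paternal_props are made up for the example, while $p.value and prop.table() are standard R.

```r
# Re-create the table so this example stands on its own.
alleles <- matrix(c(12, 4, 15, 17, 25, 4), nrow = 3,
                  dimnames = list(c("A1A1", "A1A2", "A2A2"), c("A1", "A2")))

# Both test functions return a list-like object with a p.value element.
chisq_p  <- chisq.test(alleles)$p.value
fisher_p <- fisher.test(alleles)$p.value
print(c(chi_squared = chisq_p, fisher = fisher_p))

# Proportion of A1 and A2 paternal gametes within each maternal genotype
# (margin = 1 means each row is divided by its own total).
paternal_props <- prop.table(alleles, margin = 1)
print(round(paternal_props, 2))
```

The table of row proportions shows which maternal genotypes are receiving an excess of which paternal allele. The tests only tell you that the rows differ, not how.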
Kent Holsinger
2006-09-02