Recently in Statistics Category

Get ahead of the curve

More and more people I know are using R. Fewer and fewer are using SAS. Of course, part of the reason for that may be that I've been singing the praises of R for quite a while now. My recent students have done all of the statistical analyses for their dissertations in R. It's extraordinarily flexible, it has packages for just about any analysis you can imagine (and you can write a new one if it doesn't have what you want), it's open source, it's freely available, and it works almost the same on Windoze, Mac OS X, and Linux.

Why do I mention all of this?

Over at r4stats.com there's an interesting post arguing that 2015 may be the year in which use of R in academic research exceeds that of SAS and SPSS. Some of the analysis is, as you might expect, based on statistical extrapolation of citation trends (data harvested from Google Scholar). But the most interesting part of the post is the "Colbert forecast" based on "truthiness", i.e., gut instinct.

This growth will be driven by:
  • The continued rapid growth in add-on packages (Figure 10)
  • The attraction of R's powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).

What will slow R's growth is its lack of a graphical user interface that:

  • Is powerful
  • Is easy to use
  • Provides journal style output in word processor format
  • Is standard, i.e. widely accepted as The One to Use
  • Is open source

Wow!

I just ran across a fabulous site for anyone who's interested in visualizing complex data. chartsnthings is

A (personal) blog of data sketches from the New York Times Graphics Department. Maintained by @KevinQ.
It describes the thought processes that lie behind some of the data graphics that appear in the Times, including preliminary sketches like the one below, intermediate stages with mentions of the software used and the data sources, and the final product. You won't learn how to use R, Adobe Illustrator, or any other package to do what they do at the Times, but you'll get an outline of what you need to learn to do what they do at the Times.

[Image: a preliminary sketch from chartsnthings]

For someone like me, who is graphically challenged, it promises to be a great way to pick up some rudimentary skills in visualizing data.

Hat tip: Andrew Gelman.

Type III error

If you've taken statistics, you may remember that Type I error is rejecting a true null hypothesis and that Type II error is failing to reject a false null hypothesis.

So what's Type III error? Providing the right answer to the wrong question. Schwartz and Carpenter provide examples showing the consequences in studies of homelessness, obesity, and infant mortality. They focus on the fact (often forgotten) that the cause of differences among individuals within a group may be different from the cause of differences between groups. Here's their conclusion on homelessness.

The causes of interindividual differences within the population may be interesting in and of themselves, but from the public health perspective of trying to decrease the amount of homelessness at any time, the interindividual differences are largely irrelevant. The identification of individual risk factors may benefit certain people by decreasing the probability that they will become homeless. The success of an individual-level intervention is based on the premise that the reduction of a specific risk factor will enhance the ability of the individual to compete for the limited housing resource, all things being equal. From the standpoint of public health, however, explaining interindividual differences in homelessness does not adequately address the goal of decreasing the incidence of homelessness.
Why? Well, it would be better to write

Explaining interindividual differences in homelessness will not necessarily lead to decreases in homelessness. If there are more people who need homes than there are homes, interventions based on understanding interindividual differences will change who gets a house, but not how many people get a house.

Accuracy vs. precision

Three statisticians go deer hunting. They see a large buck. The first one fires and misses one meter to the left. The second one fires and misses one meter to the right. The third one exclaims, "We got him!"

The shots were accurate, not precise.
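
For anyone who wants the punch line in numbers, here is a minimal sketch in Python. The two misses come from the joke; the second set of shots is a made-up contrast, and the function name `summarize` is just for illustration. Accuracy is treated as the average miss (bias) and precision as the scatter around that average.

```python
import statistics

# Horizontal misses in meters (negative = left of the deer).
# The first list comes from the joke; the second is a hypothetical contrast.
statisticians = [-1.0, 1.0]       # one meter left, one meter right
tight_but_off = [0.9, 1.0, 1.1]   # made up: tightly grouped, all to the right


def summarize(shots):
    """Return (bias, spread): the mean miss and the standard deviation of the misses."""
    bias = statistics.mean(shots)      # accuracy: how far the average shot is from the target
    spread = statistics.pstdev(shots)  # precision: how scattered the shots are
    return bias, spread


for name, shots in [("statisticians", statisticians), ("tight_but_off", tight_but_off)]:
    bias, spread = summarize(shots)
    print(f"{name}: bias = {bias:+.2f} m, spread = {spread:.2f} m")
```

The statisticians' shots average out to zero bias with a spread of one meter: accurate on average, not precise. The made-up second group is the reverse.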


A gentle introduction to statistics

If you're reading this blog, you're probably already familiar with statistics. You probably regard linear regressions and ANOVAs as simple, even if you don't know all of the mathematical details behind them. If that describes you, you may not find Nathan Green's new series on statistics at The Guardian all that useful, but if you have friends or relatives who don't understand the difference between the mean of a sample and the mean of the population from which it was drawn or why the difference is important, you might want to encourage them to follow his series.

The first one outlines the challenges associated with sampling. The second one describes different measures of central tendency (mean, median, and mode).


Using Google Docs with R

It's easy to use read.csv() or read.table() to read data that's stored on your hard drive. But suppose you're collaborating with someone on a different continent and sharing data that either of you may update periodically as you make new observations. If you're working with your data in a spreadsheet, Google Docs is your friend. A post at Revolutions explains how to use a spreadsheet on Google as a data source for an R script.

Statistical significance isn't significant

Just about any time scientists collect data, they use statistics to interpret them. We say things like "There is a statistically significant relationship between blood cholesterol levels and the risk of heart attacks." The phrase "statistically significant" is hard to understand, but it's vitally important.

The data we collect are just a sample of all observations that could have been made. I care that data from the Framingham heart study show that relationship because I suspect that it applies to me. I care, in other words, because I am extrapolating from a sample to a much larger population.

That's where statistical significance comes in.

The principles of statistics tell us how likely it is that a difference we observe is just a fluke. We typically call a result statistically significant when the chance of getting what we saw in the absence of a real difference is less than 5 percent.

What we sometimes do is a bit more complicated. Let's suppose in a clinical trial we find that drug A lowers cholesterol by 8 percent and that the lowering is statistically significant. We'd be justified in saying not only that drug A lowered cholesterol in our sample but in saying that our evidence suggests it would do so in the whole population. Suppose that drug B lowered cholesterol by 5 percent in our sample but that the lowering is not statistically significant. We'd have to say that we have no evidence that it would lower cholesterol in the whole population -- not that it won't lower cholesterol in the whole population, only that we don't have evidence that it will.

That last point is really important. It's very tempting to conclude from these observations that drug A lowers cholesterol and drug B doesn't -- tempting but wrong. If we don't have evidence that a 5 percent lowering of cholesterol is different from no lowering at all, we probably don't have evidence that 8 percent lowering is different from 5 percent lowering since that difference is only 3 percent. To know for sure, we'd need to do a statistical test that directly compared the 5 percent lowering with the 8 percent lowering.
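
To make the arithmetic concrete, here is a rough sketch in Python. The 8 percent and 5 percent reductions are the ones in the example above; the standard errors (3 percentage points for each drug) are hypothetical, chosen only so that drug A comes out significant and drug B doesn't. It also assumes the two estimates are independent and approximately normal, so a simple z-test is good enough for illustration.

```python
import math


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a normal (z) test of an estimate against zero."""
    z = abs(estimate) / std_error
    return math.erfc(z / math.sqrt(2))  # P(|Z| > z) for a standard normal Z


# Percent reductions are from the text; the standard errors are hypothetical.
drug_a, se_a = 8.0, 3.0
drug_b, se_b = 5.0, 3.0

p_a = two_sided_p(drug_a, se_a)  # drug A versus no lowering
p_b = two_sided_p(drug_b, se_b)  # drug B versus no lowering

# The direct comparison: for independent estimates, the standard error
# of the difference is sqrt(se_a^2 + se_b^2).
diff = drug_a - drug_b
se_diff = math.sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"drug A vs. nothing: p = {p_a:.3f}")
print(f"drug B vs. nothing: p = {p_b:.3f}")
print(f"drug A vs. drug B:  p = {p_diff:.3f}")
```

With these made-up standard errors, drug A falls below the 5 percent threshold, drug B doesn't, and the direct comparison of A with B is nowhere near it.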

Unfortunately, a lot of smart scientists don't seem to understand this. Sander Nieuwenhuis and his colleagues recently reviewed 513 papers on neuroscience published in 5 of the top international journals. Of the 157 articles they found that made a comparison like the one I just described, 78 did the correct direct comparison and 79 did the incorrect "This one is significant, that one isn't" comparison.

Those of you who care a lot about these things (and can handle some probability theory) will want to read a paper Andrew Gelman and Hal Stern wrote several years ago: The difference between "significant" and "not significant" is not itself statistically significant.

One more thing: In writing about this paper in The Guardian, Ben Goldacre concludes with this thought:

But the darkest thought of all is this: analysing a "difference in differences" properly is much less likely to give you a statistically significant result, and so it's much less likely to produce the kind of positive finding you need to look good on your CV, get claps at conferences, and feel good in your belly. Seriously: I hope this is all just incompetence.
Ben, I'm pretty sure you can rest easy. I've met a lot of honest, hard-working scientists who don't understand this point. I haven't met any who made this mistake on purpose. They make this mistake because the way most of us are taught statistics encourages this kind of conceptual error. I would have made it myself until 10-15 years ago (longer than I care to remember after receiving my Ph.D.).

Nieuwenhuis, S., Forstmann, B., & Wagenmakers, E. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14(9), 1105-1107. DOI: 10.1038/nn.2886


Styles of statistical inference

Remember a while back when I went on too long about Bayesian inference? Well, if you'd like a gentle introduction to different approaches to statistical inference, I recently ran across a page entitled "Statistics for experimental biologists" that gives a fairly decent overview of the major approaches you'll encounter as a practicing scientist. I have some quibbles with it, but its brief summary of some problems with the traditional hypothesis-testing approach is worth repeating:

  • The p-value doesn't tell scientists what they want (it is the probability of the data given that H0 is true, and scientists would like the probability of H0 or H1 given the data)
  • H0 is often known to be false
  • P-values are widely misunderstood
  • Leads to binary yes/no thinking
  • Prior information is never taken into account (Bayesian argument)
  • A small p-value could reflect a very large sample size rather than a meaningful difference
  • Leads to publication bias, because significant results (i.e. p < 0.05) are more likely to be published
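
The point about sample size is easy to check with a few lines of Python. This is only a sketch under assumed numbers: the effect (1 percent of a standard deviation) and the sample sizes are invented for illustration, and it uses a one-sample z-test with known standard deviation to keep the arithmetic simple.

```python
import math


def z_test_p(mean_diff, sd, n):
    """Two-sided p-value for a one-sample z-test of mean_diff against zero."""
    z = abs(mean_diff) / (sd / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))  # P(|Z| > z) for a standard normal Z


# Hypothetical numbers: the same tiny effect observed at three sample sizes.
tiny_effect, sd = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {z_test_p(tiny_effect, sd, n):.3g}")
```

The observed difference never changes, but the p-value drops from nowhere near 0.05 with 100 observations to astronomically small with a million.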

The theory that would not die

And you thought that Bayesian statistics were just for inferring phylogenies, species distribution, population genetics parameters, and various kinds of equally arcane social science type stuff. Well, think again.

The review in the New Scientist confuses Bayes' Theorem and Bayesian inference in a way that I presume the book doesn't, but it concludes:

[T]o have crafted a page-turner out of the history of statistics is an impressive feat.
Indeed. I can imagine it being a page-turner for a stats geek like me, but David Robson doesn't sound like a stats geek. If Sharon Bertsch McGrayne really turned a history of Bayesian inference into something that more normal people find interesting, she's accomplished a task many of us would envy.

If you're wondering what the review (I hope not the book) got wrong about Bayes' Theorem versus Bayesian inference, read on.

Integrating R and C++

As I've mentioned before, I use R a lot for statistical analyses. I also do some programming in C++ when I develop software (like Hickory) that I think other people might like to use. I just learned about an approach to integrating C++ and R that I expect to use fairly extensively. It's the Rcpp package: It provides a C++ library that facilitates the integration of R and C++.

C++ code can be assigned to an object in R and called as a function, receiving arguments from R and returning results to it. I'm not sure what magic happens under the hood, but it sounds very promising and powerful.

There's an example of its use over at Life in Code. Pretty cool, if you're a code geek like me.

This blog is licensed under a Creative Commons License.

About this Archive

This page is an archive of recent entries in the Statistics category.
