# Author Archive: kent

## Against null hypothesis testing – the elephants and Andrew Gelman edition

Last week I pointed out a new paper by Denes Szucs and John Ioannidis, When null hypothesis significance testing is unsuitable for research: a reassessment. I mentioned that P-values from small, noisy studies are likely to be misleading. Last April, Raghu Parthasarathy at The Eighteenth Elephant had a long post on a more fundamental problem with P-values: they encourage binary thinking. Why is this a problem?

1. “Binary statements can’t be sensibly combined” when measurements have noise.
2. “It is almost never necessary to combine boolean statements.”
3. “Everything always has an effect.”

Those brief statements probably won’t make any sense, so head over to The Eighteenth Elephant to get the full explanation. The post is a bit long, but it’s easy to read, and well worth your time.

Andrew Gelman recently linked to Parthasarathy’s post and adds one more observation on how P-values are problematic: they are “interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null.” In other words, P-values are derived assuming the null hypothesis is true. They tell us the probability of getting data at least as extreme as ours if the null hypothesis were true. Since we typically don’t believe the null hypothesis is true, the P-value doesn’t correspond to anything meaningful.

To take Gelman’s example, suppose we had an experiment with a control, treatment A, and treatment B. Our data suggest that treatment A is not different from control (P=0.13) but that treatment B is different from the control (P=0.003). That’s pretty clear evidence that treatment A and treatment B are different, right? Wrong.

P=0.13 corresponds to a treatment-control difference of 1.5 standard deviations; P=0.003, to a treatment-control difference of 3.0 standard deviations. The difference between treatment A and treatment B is therefore 1.5 standard deviations, which corresponds to a P-value of 0.13. Why the apparent contradiction? Because if we want to say that treatment A and treatment B are different, we need to compare them directly to each other. When we do so, we realize that we don’t have any evidence that the treatments are different from one another.
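The arithmetic behind those P-values can be checked with a short Python sketch. It assumes, as the example implicitly does, that each difference expressed in standard deviations can be treated as a z-score under a standard normal null; the function name `two_sided_p` is just an illustrative choice, not something from Gelman's post.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided P-value for a z-score under a standard normal null."""
    return erfc(abs(z) / sqrt(2))


# Differences from control, in standard deviations (values from the example).
z_a = 1.5
z_b = 3.0

p_a = two_sided_p(z_a)             # treatment A vs. control
p_b = two_sided_p(z_b)             # treatment B vs. control
p_a_vs_b = two_sided_p(z_b - z_a)  # A vs. B, using the same standard deviation

# Assumption (not in the example): if the two treatment estimates were
# independent with equal standard errors, the standard error of their
# difference would be sqrt(2) times larger.
p_a_vs_b_independent = two_sided_p((z_b - z_a) / sqrt(2))

print(f"A vs. control: P = {p_a:.3f}")
print(f"B vs. control: P = {p_b:.3f}")
print(f"A vs. B:       P = {p_a_vs_b:.3f}")
print(f"A vs. B (independent estimates): P = {p_a_vs_b_independent:.3f}")
```

The direct comparison of A and B gives the same P-value as the comparison of A with the control, and under the independence assumption the evidence for a difference between the two treatments is weaker still.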

As Parthasarathy points out in a similar example, a better interpretation is that we have evidence for the ordering (control < treatment A < treatment B). Null hypothesis significance testing could easily mislead us into thinking that what we have instead is (control = treatment A < treatment B). The problem arises, at least in part, because no matter how often we remind ourselves that it’s wrong to do so, we act as if a failure to reject the null hypothesis is evidence for the null hypothesis. Parthasarathy describes nicely how we should be approaching these problems:

It’s absurd to think that anything exists in isolation, or that any treatment really has “zero” effect, certainly not in the messy world of living things. Our task, always, is to quantify the size of an effect, or the value of a parameter, whether this is the resistivity of a metal or the toxicity of a drug.

We should be focusing on estimating the magnitude of effects and the uncertainty associated with those estimates, not testing null hypotheses.

## The influence of climate on tree growth

Northern Hemisphere temperature changes estimated from various proxy records shown in blue (Mann et al. 1999). Instrumental data shown in red. Note the large uncertainty (grey area) as you go further back in time.

Ecologists and paleoecologists have used the width of tree rings for years as a way of inferring past climates. In fact, tree ring data were an important component of the proxy data Mann et al. (1998) used when they constructed their famous hockey stick representing global surface temperatures over the last millennium. I don’t have anything as earth shattering as a hockey stick to share with you, but I am pleased to report that a paper on which I am a co-author demonstrates how to combine tree ring and growth increment data (with other data) to predict growth of forest trees. Here’s the abstract and a link to the paper on bioRxiv.

https://doi.org/10.1101/097535

# Fusing tree-ring and forest inventory data to infer influences on tree growth

Better understanding and prediction of tree growth is important because of the many ecosystem services provided by forests and the uncertainty surrounding how forests will respond to anthropogenic climate change. With the ultimate goal of improving models of forest dynamics, here we construct a statistical model that combines complementary data sources: tree-ring and forest inventory data. A Bayesian hierarchical model is used to gain inference on the effects of many factors on tree growth (individual tree size, climate, biophysical conditions, stand-level competitive environment, tree-level canopy status, and forest management treatments) using both diameter at breast height (DBH) and tree-ring data. The model consists of two multiple regression models, one each for the two data sources, linked via a constant of proportionality between coefficients that are found in parallel in the two regressions. The model was applied to a dataset developed at a single, well-studied site in the Jemez Mountains of north-central New Mexico, U. S. A. Inferences from the model included positive effects of seasonal precipitation, wetness index, and height ratio, and negative effects of seasonal temperature, southerly aspect and radiation, and plot basal area. Climatic effects inferred by the model compared well to results from a dendroclimatic analysis. Combining the two data sources did not lead to higher predictive accuracy (using the leave-one-out information criterion, LOOIC), either when there was a large number of increment cores (129) or under a reduced data scenario of 15 increment cores. However, there was a clear advantage, in terms of parameter estimates, to the use of both data sources under the reduced data scenario: DBH remeasurement data for ~500 trees substantially reduced uncertainty about non-climate fixed effects on radial increments. 
We discuss the kinds of research questions that might be addressed when the high-resolution information on climate effects contained in tree rings are combined with the rich metadata on tree- and stand-level conditions found in forest inventories, including carbon accounting and projection of tree growth and forest dynamics under future climate scenarios.

## Against null hypothesis significance testing

Several months ago I pointed out that P-values from small, noisy experiments are likely to be misleading. Given our training, we think that if a result is significant with a small sample, it must be a really big effect. But unless we have good reason to believe that there is very little noise in the results (a reason other than the small amount of variation observed in our sample), we could easily be misled. Not only will we overestimate how big the effect is, but we are almost as likely to say that the effect is positive when it’s really negative as we are to get the sign right. Look back at this post from August to see for yourself (and download the R code if you want to explore further). As Gelman and Carlin point out,

There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions.
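A minimal Python sketch of the sign and magnitude problem described above (the R code mentioned in the post is not reproduced here). The function name `design_analysis`, the true effect of 0.1, and the standard error of 1.0 are hypothetical choices for illustration, and the estimate is assumed to be normally distributed around the true effect.

```python
import random
from statistics import NormalDist


def design_analysis(true_effect, se, alpha=0.05, n_sims=20_000, seed=1):
    """Power, wrong-sign rate, and exaggeration among significant results.

    Assumes true_effect > 0 and a normally distributed estimate with
    known standard error `se`.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # Probability of a significant result in each direction.
    p_hi = 1 - std_normal.cdf(z_crit - true_effect / se)
    p_lo = std_normal.cdf(-z_crit - true_effect / se)
    power = p_hi + p_lo

    # Among significant results, the fraction with the wrong (negative) sign.
    wrong_sign_rate = p_lo / power

    # Simulate replicate estimates and keep the significant ones to see
    # how much they overstate the magnitude of the true effect.
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(est) for est in estimates if abs(est) > z_crit * se]
    exaggeration = (sum(significant) / len(significant)) / true_effect

    return power, wrong_sign_rate, exaggeration


power, wrong_sign_rate, exaggeration = design_analysis(true_effect=0.1, se=1.0)
print(f"power: {power:.3f}")
print(f"wrong-sign rate among significant results: {wrong_sign_rate:.3f}")
print(f"average exaggeration of the effect size: {exaggeration:.1f}x")
```

In this noisy setting only about 5% of replicates reach significance, more than a third of those have the wrong sign, and the significant estimates overstate the true effect by more than an order of magnitude.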

I bring this all up again because I recently learned of a new paper by Denes Szucs and John Ioannidis, When null hypothesis significance testing is unsuitable for research: a reassessment. They summarize their advice on null hypothesis significance testing (NHST) in the abstract:

Whenever researchers use NHST they should justify its use, and publish pre-study power calculations and effect sizes, including negative findings. Studies should optimally be pre-registered and raw data published.

They go on to point out that part of the problem is the way that scientists are trained:

[M]ost scientists…are still near exclusively educated in NHST, they tend to misunderstand and abuse NHST and the method is near fully dominant in scientific papers.

## Good news – China will ban ivory trade

Last July, the United States banned nearly all commercial trade in ivory. Last Friday, China announced that it will “end the processing and selling of ivory and ivory products by the end of March as it phases out the legal trade” (The New York Times).

There are some professionals who believe that legal trade in ivory promotes conservation. (See this article from The Guardian for some of the give and take.) The arguments are two-fold (from The Guardian):

1. The ivory ban has made prices high and poaching lucrative. (Enrico Di Minin and Douglas MacMillan)
2. Lifting Africans from poverty is the only way to save elephants. (Rowan Martin)

I haven’t studied the issue carefully, but I am not persuaded by their arguments. For one thing, Nitin Sekar and Solomon Hsiang point out in The Guardian that the limited legal trade in ivory established in 2008 seems to have increased the amount of poaching.

Rates of ivory poaching from 2004-2012

To be fair, with only 5 points from before the 2008 announcement of legal ivory sales and 6 points after, you’d be hard-pressed to demonstrate that a statistical model favoring a switch in poaching rates in 2008 is better than one where rates are simply increasing over time, but either way, the limited trade in ivory introduced in 2008 did not decrease the rate of poaching.

Point 2 is undeniably true. Lifting Africans from poverty is the only way to make lasting progress on any conservation problem in Africa. But that observation argues for promoting policies that directly reduce poverty, like increasing sanitation, enhancing access to health care, and strengthening education. Elephant poaching has increased since 2008, and prices of ivory are high. Is there any evidence that incomes of Africans have improved as a result? If there is, Rowan Martin doesn’t provide it.

That’s why I regard it as good news that China is shutting down its domestic ivory trade. A ban on the legal trade of ivory won’t shut down the black market, any more than a ban on cocaine in the US shut down the cocaine market here. But a ban on the legal trade of ivory will make it more difficult for black marketeers to hide. With strong enforcement, a ban will reduce the incentives to trade in ivory and the incentives for poachers.

## Catching up over the holidays

It’s been a busy few weeks since my last post (November 30). I was in Bonn when that post appeared for a meeting at the Crop Trust focused on developing metrics to assess whether global genebanks have the right types and amounts of diversity to meet the UN’s Sustainable Development Goals, specifically to meet this target under Goal 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture:

By 2020, maintain the genetic diversity of seeds, cultivated plants and farmed and domesticated animals and their related wild species, including through soundly managed and diversified seed and plant banks at the national, regional and international levels, and promote access to and fair and equitable sharing of benefits arising from the utilization of genetic resources and associated traditional knowledge, as internationally agreed.

Almost immediately after I returned, I left for Washington, DC and the annual meetings of the Council of Graduate Schools. When I returned, many items requiring attention had accumulated, and some unusually challenging end-of-the-semester student issues emerged. That’s a long way of saying it’s been nearly a month since my last post. I hope to begin posting regularly again. We’ll see if I manage to do it.

I will be busy this spring, too. In addition to my duties as Vice Provost for Graduate Education and Dean of The Graduate School, I’ll be teaching my graduate course in population genetics, EEB 5348. I just started rebuilding the course website after losing it due to a server meltdown last summer. I’ll have notes for individual lectures on-line again in the next week or two. In the meantime, a compiled version of all of the notes is available on Figshare. If you’re a LaTeX geek, a more current version of the notes (in LaTeX, with EPS graphics) is available on Github.

## Designing exploratory studies: measurement (part 2)

Remember Gelman’s preliminary principles for designing exploratory studies:

1. Validity and reliability of measurements.
2. Measuring lots of different things.
3. Connections between quantitative and qualitative data.
4. Collect or construct continuous measurements where possible.

I already wrote about validity and reliability. I admitted to not knowing enough yet to provide advice on assessing the reliability of measurements ecologists and evolutionists make (except in the very limited sense of whether or not repeated measurements of the same characteristic give similar results). For the time being that means I’ll focus on

• Remembering that I’m measuring an indicator of something else that is the thing that really matters, not the thing that really matters itself.
• Being as sure as I can that what I’m measuring is a valid and reliable indicator of that thing, even though the best I can do with that right now is a sort of heuristic connection between a vague notion of what I really think matters, underlying theory, and expectations derived from earlier work.

It’s that second part where “measuring lots of different things” comes in. Let’s go back to LMA and MAP. I’m interested in LMA because it’s an important component of the leaf economics spectrum. There are reasons to expect that tough leaves (those in which LMA is high) will not only be more resistant to herbivory from generalist herbivores, but that they will have lower rates of photosynthesis. Plants are investing more in those leaves. So high LMA is, in some vague sense, an indicator of the extent to which resource conservation is more important to plants than rapid acquisition of resources. So in designing an exploratory study, I should think about other traits plants have that could be indicators of resource conservation vs. rapid resource acquisition and measure as many of them as I can. A few that occur to me are leaf area, photosynthetic rate, leaf nitrogen content, leaf C/N ratio, tissue density, leaf longevity, and leaf thickness.

If I measure all of these (or at least several of them) and think of them as indicators of variation in the underlying “thing I really care about”, I can then imagine treating that underlying “thing I really care about” as a latent variable. One way, but almost certainly not the only way, I could assess the relationship between that latent variable and MAP would be to perform a factor analysis on the trait dataset, identify a single latent factor, and use that factor as the dependent variable whose variation I study in relation to MAP. Of course, MAP is only one way in which we might assess water availability in the environment. Others that might be especially relevant for perennials with long-lived leaves (like Protea) in the Cape Floristic Region include rainfall seasonality, maximum number of days between days with “significant” rainfall in the summer, total summer rainfall, estimated potential evapotranspiration for the year, and estimated PET for the summer. A standard way to relate the “resource conservation” factor to the “water availability” factor would be a canonical correspondence analysis.
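To make the latent-variable idea concrete, here is a rough Python sketch on simulated data. It is not a proper factor analysis: it uses the first principal component of the trait correlation matrix (found by power iteration) as a simple stand-in for a single latent factor. The trait names, loadings, sample size, and strength of the MAP effect are all invented for illustration.

```python
import random
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def leading_loadings(columns, n_iter=200):
    """Leading eigenvector of the correlation matrix, by power iteration.

    `columns` must already be standardized, so that the mean of products
    of two columns is their correlation.
    """
    k, n = len(columns), len(columns[0])
    corr = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]
    w = [1.0] + [0.0] * (k - 1)
    for _ in range(n_iter):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w


# Simulated data: a latent "resource conservation" axis that declines with
# standardized mean annual precipitation (MAP). All numbers are hypothetical.
rng = random.Random(42)
n_sites = 300
map_z = standardize([rng.gauss(0, 1) for _ in range(n_sites)])
latent = [-0.7 * m + rng.gauss(0, 0.7) for m in map_z]

# Each trait is a noisy indicator of the latent axis.
true_loadings = {"LMA": 0.9, "leaf_thickness": 0.8,
                 "leaf_N": -0.7, "photosynthetic_rate": -0.8}
traits = {name: standardize([b * x + rng.gauss(0, 0.5) for x in latent])
          for name, b in true_loadings.items()}

columns = list(traits.values())
loadings = leading_loadings(columns)
if loadings[0] < 0:  # orient the axis so that high LMA means a high score
    loadings = [-w for w in loadings]

# Site scores on the estimated axis, and their correlation with MAP.
scores = [sum(w * col[i] for w, col in zip(loadings, columns))
          for i in range(n_sites)]
score_z = standardize(scores)
r_score_map = sum(s * m for s, m in zip(score_z, map_z)) / n_sites

for name, w in zip(traits, loadings):
    print(f"{name:>20}: loading {w:+.2f}")
print(f"correlation of latent score with MAP: {r_score_map:+.2f}")
```

The point of the sketch is only that several noisy indicators can be combined into one score whose relationship with MAP is then studied; a real analysis would use a fitted factor model and real trait data.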

Although I am not advocating that we all start doing canonical correspondence analyses as our method of choice in designing exploratory studies, this way of thinking about exploratory studies does help me clarify (a bit) what it is that I’m really looking for. I still have work to do on getting it right, but it feels as if I’m heading towards something analogous to exploratory factor analysis (to identify factors that are valid, in the sense that they are interpretable and related in a meaningful way to existing theoretical constructs and understanding) and confirmatory factor analysis (to confirm that the exploration has revealed factors that can be reliably measured).

Stay tuned. It is likely to be a while before I have more thoughts to share, but as they develop, they’ll appear here, and if you follow along, you’ll be the first to hear about them.

## On the importance of openness in scholarship

I was recently looking something up in Evernote, and I ran across a post by Eric Rauchway on Crooked Timber from July 2013. The post concerns the American Historical Association’s proposed recommendation for an embargo on dissertations. The AHA adopted a Statement on Policies Regarding the Option to Embargo Completed PhD Dissertations on 19 July 2013. It begins

The American Historical Association strongly encourages graduate programs and university libraries to adopt a policy that allows the embargoing of completed history PhD dissertations in digital form for as many as six years.

There was a lot of debate about the wisdom of dissertation embargoes before and after the statement was announced. Rauchway finished his post with a comment that all of us should think about.

When we find ourselves trying to make scholarship less readily available – however good our intentions – we should probably ask ourselves if we can solve our problems some other way.

## Designing exploratory studies: measurement

On Wednesday I argued that we need carefully done exploratory studies to discover phenomena as much as we need carefully done experimental studies to test explanations for the phenomena that have been discovered. Andrew Gelman suggests four preliminary principles:

1. Validity and reliability of measurements.
2. Measuring lots of different things.
3. Connections between quantitative and qualitative data.
4. Collect or construct continuous measurements where possible.

Today I’m going to focus on #1, the validity and reliability of measurements.

If there happen to be any social scientists reading this, it’s likely to come as a shock to you to learn that most ecologists and evolutionary biologists haven’t thought too carefully about the problem of measurement, or at least that’s been my experience. My ecologist and evolutionary biologist friends are probably scratching their heads. “What the heck does Holsinger mean by ‘the problem of measurement’?” I’m sure I’m going to butcher this, because what little I think I know I picked up informally second hand, but here’s how I understand it.

## Designing exploratory studies: preliminary thoughts

Andrew Gelman has a long, interesting, and important post about designing exploratory studies. It was inspired by the following comment from Ed Hagen following a blog post about a paper in Psychological Science.

Exploratory studies need to become a “thing”. Right now, they play almost no formal role in social science, yet they are essential to good social science. That means we need to put as much effort in developing standards, procedures, and techniques for exploratory studies as we have for confirmatory studies. And we need academic norms that reward good exploratory studies so there is less incentive to disguise them as confirmatory.

I think Ed’s suggestion is too narrow. Exploratory studies are essential to good science, not just good social science.

We often (or at least I often) have only a vague idea about how features I’m interested in relate to one another. Take leaf mass per area (LMA) and mean annual temperature or mean annual precipitation, for example. In a worldwide dataset compiled by IJ Wright and colleagues, tougher leaves (higher values of LMA) are associated with warmer temperatures and less rainfall.

LMA, mean annual temperature, and log(mean annual rainfall) in 2370 species at 163 sites (from Wright et al. 2004)

We expected similar relationships in our analysis of Protea and Pelargonium, but we weren’t trying to test those expectations. We were trying to determine what those relationships were. We were, in other words, exploring our data, and guess what we found? Tougher leaves are associated with less rainfall in both genera and with warmer temperatures in Protea. They were, however, associated with cooler temperatures in Pelargonium, exactly the opposite of what we expected. One reason for the difference might be that Pelargonium leaves are drought deciduous, so they avoid the summer drought characteristic of the regions from which our samples were collected. That is, of course, a post hoc explanation and has to be interpreted cautiously as a hypothesis to be tested, not as an established causal explanation. But that is precisely the point. We needed to investigate the phenomena to identify a pattern. Only then could we generate a hypothesis worth testing.

I find that I am usually more interested in discovering what the phenomena are than in tying down the mechanistic explanations for them. The problem, as Ed Hagen suggests, is that studies that explicitly label themselves as exploratory play little role in science. They tend to be seen as “fishing expeditions,” not serious science. The key, as Hagen suggests, is that to be useful, exploratory studies have to be done as carefully as explicit, hypothesis-testing confirmatory studies. A corollary he didn’t mention is that science will be well-served if we observe the distinction between exploratory and confirmatory studies.

## I am flattered

@MartineBotany posted this tweet earlier today.

All I can say is “Thank you for the kind words. I have been fortunate to be associated with many talented young scientists. They make me look much more talented than I really am.”