Uncommon Ground

Author Archive: kent

An R-Stan implementation of Bayesian inference for Wright’s F-statistics

Some of you know that many years ago Paul Lewis, Dipak Dey, and I wrote a paper describing a Bayesian approach to inferring population structure from dominant markers.1 You may also know that Paul Lewis and I wrote a Windoze program in C++, Hickory, that implemented the approach. We later extended Hickory for analysis of co-dominant markers. Later still, Feng Guo, Dipak, and I wrote another paper describing a Bayesian approach to (a) estimating population- and locus-specific effects on FST and (b) identifying loci where the posterior distribution of FST is markedly different from the overall estimate.2 If you know that (and maybe even if you don’t know all of that), you also know that Paul and I stopped maintaining Hickory a number of years ago. I moved from Windoze to Mac, and the library we were using to support the graphical user interface became too complicated for me to keep up with.

I’ve had a few requests from people who were interested in using Hickory, but I just haven’t had the time to find a way to support them – until now.

Over the past several years, I’ve been using Stan for many different statistical analyses. When I received another request for Hickory a couple of weeks ago, I realized that I could pretty easily develop a new version of Hickory in R/Stan. This approach has several advantages over the standalone C++ code in the original Hickory.

  1. I don’t have to worry about writing the MCMC sampler myself. I use the very sophisticated Hamiltonian Monte Carlo in Stan. I not only avoid me the bother of writing my own sampler, I have much greater confidence that the sampler is performing correctly. It’s written and maintained by experts, and the convergence diagnostics are far more sophisticated than for Metropolis-Hastings.
  2. It should be readily portable to any platform on which R is supported. The only requirement, for now, is that you have a C++ compiler installed. If you’re running a Mac, you may need to download Xcode. If you’re running Linux, you should be all set. If your running Windows, you can download Rtools from CRAN. I intend to submit the R package I’ve written to CRAN once I’ve tested it more thoroughly and provided some extensions to the crude functionality currently available. Once it’s on CRAN, you won’t even need a C++ compiler.
  3. I can develop an interface to adegenet and other R packages used for analysis and manipulation of genetic data so that Hickory can use data in many different formats supported by other packages.

A very early release of Hickory is available at GitHub. You should find all of the information you need to install and use it there. Let me know if you run into problems. I’ll do my best to walk you through them (and probably correct some errors I’ve made or at least improve the meager documentation in the process).

  1. Holsinger, K. E., Lewis, P. O., and Dey, D. K. 2002. A Bayesian approach to inferring population structure from dominant markers. Molecular Ecology 11:1157–1164.
  2. Guo, F., Dey, D. K., and Holsinger, K. E. 2009. A Bayesian hierarchical model for analysis of SNP diversity in multilocus, multipopulation samples. Journal of the American Statistical Association 104:142–154. http://doi.org/10.1198/jasa.2009.0010

Making accessible HTML from LaTeX sources — an additional experiment

Last week I reported on my initial experiments using Pandoc and LaTeXML to convert LaTeX to HTML. Here are links to the PDF produced with pdfLaTeX and the HTML:

If you’re like me, you’ll prefer the LaTeXML version to the Pandoc version, but as I pointed out the LaTeXML version includes CSS to customize the styling and the Pandoc version doesn’t. I did a quick Google search, figured out how to add CSS (and a table of contents) to the HTML output from Pandoc, and found a very nice CSS style to use (from Pascal Hertlief on Github). It’s possible that I’ll fiddle with Pascal’s CSS a bit, but there’s a good chance I won’t change it at all. It makes the HTML look really, really nice:

What I haven’t tried yet is converting LaTeX source that includes PDF figures. Let’s try that now and see how it works.

It took a while to get ImageMagick installed, to write a short Perl script to change all of the references to EPS files into references to PNG files and convert the EPSs to PNGs, but I really like the results. But this gets two of my three “to-dos” out of the way.

 

  • Check CSS styling for Pandoc.
  • Show the results to an accessibility expert at UConn and get some feedback on the different approaches.
  • See what happens with figures when they’re included in a LaTeX document.

Now I just (just?) need to check with an accessibility expert to confirm that the HTML is accessible. If it is, I’m all set.

By the way, if you’re interested in seeing the Perl script, let me know. It will be posted in the Github archive where I post the LaTeX source for my notes later this fall, but I’d be happy to send you a copy now if you drop me a line.

Making accessible HTML from LaTeX sources – some initial impressions

Some of you know that I’ve been making notes from my graduate course in Population Genetics available online for nearly 20 years (http://darwin.eeb.uconn.edu/uncommon-ground/eeb348/notes/). What a smaller number of you know is that I use LaTeX to write my notes and pdfLaTeX to produce PDFs from the LaTeX source. So far as I can tell (using ANDI), the PDFs produced in this way provide some elements that aid accessibility, but I am exploring options to produce HTML from the same source that might produce documents that are accessible to more readers. For my first experiment, I used the LaTeX file from 2019 that produced notes on resemblance among relatives. Here are links to three versions of the notes:

Both approaches to producing HTML are straightforward.

For Pandoc:

pandoc --standalone --mathjax -o quant-resemblance-pandoc.html quant-resemblance.tex

For LaTeXML:

latexml --includestyles --dest=quant-resemblance.xml quant-resemblance.tex
latexmlpost --dest=quant-resemblance-latexml.html quant-resemblance.xml

With the default options, I like the look of the LaTeXML version better, but it also includes CSS customizations and the Pandoc version doesn’t. It’s probably possible to include customized CSS with Pandoc, but I haven’t had a chance to investigate that yet. I also haven’t had a chance to consult anyone who knows how to judge accessibility of documents. When I’ve had a chance to do that. I’ll return with a report. (Don’t hold your breath. I am a dean, so I don’t have a lot of time on my hands.)

Here’s my to-do list, so that I don’t forget:

  • Check CSS styling for Pandoc.
  • Show the results to an accessibility expert at UConn and get some feedback on the different approaches.
  • See what happens with figures when they’re included in a LaTeX document.

If you have additional questions, let me know, and I’ll add them to the list.

Congratulations to recipients of graduate degrees from @UConn

On Saturday, the University of Connecticut celebrated a virtual Commencement. As part of the Commencement celebration, The Gradaute School published a Commencement page containing video messages from me and  from Stephany Santos, a PhD candidate in Biomedical Engineering. There’s also a complete list of master’s and doctoral degree recipients from August 2019, December 2019, and May 2020, including 9 PhD recipients from my home department, Ecology & Evolutionary Biology.

  • Annette Evans
  • Kaitlin Ann Gallagher
  • Christopher Nadeau
  • Kristen Nolting
  • Nasim Rahmatpour
  • Anna Rose Sjodin
  • Lauren Stanley
  • Katherine Taylor
  • Tanisha Marie Williams

I highlighted Kristen and Tanisha, because they happen to be my students. Tanisha is the Burpee Postdoctoral Fellow at Bucknell University (working with another UConn EEB alum, Chris Martine), and Kristen will begin as a postdoctoral research associate at the University of Georgia later this month (working with Lisa Donovan and John Burke).

On the off chance that you’re interested in seeing what I had to say and you don’t want to click through the Commencement page link, I’ve embedded my message below.


View Video with Transcript

Some thoughts about my career and about biology over the last several decades

It’s a little frightening to me when it dawns on me that the world of biology that is existed when I started graduate school is as far away in time as the New Synthesis was then. Introns had been discovered only a few years before. Allozymes were the rage, and Sanger sequencing was “the hot new thing.” Suffice it to say that a lot has changed.

Some of you know that I had the privilege of serving as President of the American Institute of Biological Sciences in 2006. As a result of that, I also have the privilege of being featured this month in BioScience. The piece is part of the In Their Own Words series. Here’s the abstract and a link:

In Their Own Words chronicles the stories of scientists who have made great contributions to their fields, particularly within the biological sciences. These short oral histories provide our readers a way to learn from and share their experiences. Each month, we will publish in the pages of BioScience and in our podcast, BioScience Talks (http://bioscienceaibs.libsyn.com), the results of these conversations. This second oral history is with Dr. Kent Holsinger, board of trustees distinguished professor of biology in the Department of Ecology and Evolutionary Biology at the University of Connecticut. He previously served as president of the American Institute of Biological Sciences.

https://doi.org/10.1093/biosci/biz136

If for some reason you’d like to hear the interview, there’s also a podcast link: http://bioscienceaibs.libsyn.com. I haven’t listened to the podcast, but the article came out reasonably well.

Yet another (possibly) predatory publisher to add to the list – Asteroid Publishers

Why do I think Asteroid Publishers might be predatory? Because I received this e-mail from them earlier today:

Dear Dr. Kent E. Holsinger,

Greetings for the day

We are aware of your reputation for quality of research and trustworthiness in this field. With a great pleasure we would like to invite you to join the Editorial Board of our “Journal of Plastic and Aesthetic Surgery”. We are very glad to have an eminent like you to guide the journal in a well-organized way.

Hoping that you will accept our invitation.If you are interested, kindly send your updated C.V, Biography, Research Interests, Recent Photograph and working details for our records.

We hope you are always being there to support us.

Looking forward to have your long lasting scientific relationship.

Best Regards,

<name removed>

Journal Coordinator

Journal of Plastic and Aesthetic Surgery

I have a strong background in evolutionary genetics, but the closest I come to knowing anything about plastic surgery is that my father was a family practitioner until he retired and he occasionally performed minor cosmetic surgery. Any journal publisher who thinks I have a reputation “for quality of research and trustworthiness” in the field of plastic and aesthetic surgery doesn’t deserve to be trusted and should probably be shunned.

Some parting thoughts on variable selection in multiple regression

Variable selection in multiple regression

As I said a month and a half ago, this series started because

I was talking with one of my graduate students a few days ago about variable selection in multiple regression. She was looking for a published “cheat sheet.” I told her I didn’t know of any. “Why don’t you write one?” “The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.”

If you’ve been following along, it won’t surprise you to learn that I’m not going to conclude with a simple recipe.

“The world’s too complicated for that. There will always be judgment involved. There will never be a simple recipe to follow.”

Although there will never be a simple recipe, I can tell you what I’m going to do. You’ll want to look at a new R notebook that explores what happens when associations among covariates aren’t as strong as those we’ve been assuming so far.

  • For any analysis where I can use stan_glm() or stan_glmer() I’ll use horseshoe priors to “shrink” the regression coefficients for unimportant covariates towards zero. For any analysis where I can’t use stan_glm() or stan_glmer(), I’ll probably be using Stan directly and I’ll hardcode the horseshoe priors myself.
  • If I feel the need to use a relatively objective method to identify some subset of covariates that are “important”,1 I’ll use projection predictive variable selection as implemented in projpred to identify the most important covariates.2
  • For reasons outlined in the Conclusions section of the R notebook I mentioned above, I will be very cautious about interpreting associations between covariates and response variables as anything other than a statistical association. Only if an association I find has been found repeatedly in other data sets and also has a good “first principles” explanation will I begin to interpret it as a causal association. Otherwise, I’ll interpret it as an intriguing pattern worthy of further study and exploration. If you want more details on how hard it is to infer causal relationships from these kinds of analyses, look at my blog series on causal inference in ecology.
  1. I can think of a couple of reasons that I might want to select a subset of covariates. (1) I might not have a lot of data to fit to my model. Because of the priors, I’ll be able to fit it without the model blowing up, but the parameter estimates are likely to be very poorly defined. Reducing the number of parameters may help me isolate an “interesting” relationship. So long as I remember that all I may uncover is a statistical association, that pattern might still be worth investigating. (2) I might want a relatively objective way to simplify a complicated model so that it’s easy to understand, and there may not be an obvious break graphically or numerically between those that are important and those that aren’t. Regardless of whether it’s for reason #1 or reason #2, I will be extremely cautious about interpreting any associations identified through projection predictive variable selection as “real” and the ones not identified as “spurious.” In fact, I probably won’t do it at all, and I’ll probably present results from analysis of the full model in addition to the reduced model, even if the full model results only go in online supplemental material.
  2. Since I’m new to using projpred, I don’t know whether I’ll be able to use it with my own Stan code. If not, it’s yet another reason for me to learn brms, which can handle a bunch of models that rstanarm can’t – possibly many of them that I’d be hardcoding in Stan otherwise.

Using projection predictiion for variable selection in a Bayesian regression

Variable selection in multiple regression

Horseshoe priors are very easy to use if you’re using rstanarm. You should consider using them in any analysis where you use stan_glm() or stan_glmer(). If you were paying attention, though, I did a bit (OK, more than a bit) of handwaving in deciding which covariates were “important”. In this example, it was pretty easy, because there were some covariates with posterior distributions well away from zero and others with posterior distributions (close to) centered on zero and the difference between the two sets of coefficients was easy to see. That won’t always be the case. In fact, it probably won’t usually be the case. So we’d like to have some way of more “objectively” identifying which covariates are important and which aren’t.

Thats where projection predictive variable selection comes in. It’s an approach that uses a statistically meaningful criterion to guide your choice of variables that are “important” in the sense that including those variables (and only those variables) is sufficient to give predictions roughly equivalent to including all of them. Again, if you’re using rstanarm, it’s very easy to take advantage of the approach thanks to the projpred package available on CRAN.

In case you missed the link to projection predictive variable selection, here it is again: http://darwin.eeb.uconn.edu/pages/variable-selection/projection-predictive-variable-selection.nb.html.

A Bayesian approach to variable selection using horseshoe priors

Variable selection in multiple regression

The Lasso has been very widely used, particularly in high-dimensional problems where the number of observations is less than the number of covariates.1 In fact, when I checked Google Scholar on Saturday, it had been cited nearly 30,000 times.2 Bayesians didn’t want to be left out, so Trevor Park and George Casella developed the Bayesian Lasso.3 The Bayesian Lasso overcomes what to my mind is one of the great disadvantages of the original Lasso, the difficulty of providing an assessment of how reliable the regression coefficients are. Like any other Bayesian method that uses MCMC methods, it’s just as easy to get credible intervals on parameters as it is to get posterior means. The Bayesian Lasso also estimates \(\lambda\) as part of the procedure rather than relying on cross-validation. The R package monovm provides an implementation of the Bayesian Lasso in addition to other shrinkage regression methods.

I haven’t explored monovm, but if you’re interested in the Bayesian Lasso, you might want to check it out. Instead of exploring the Bayesian Lasso, the R notebook I’ve put together here explores the use of “horseshoe priors” in rstanarm. The basic idea is the same. We’d like to “shrink” some parameter estimates towards zero, and we’d like to have the data tell us which estimates to shrink. The nice thing about “horseshoe priors” in rstanarm is that if you know how to set up a regression in stan_glm() or stan_glmer() you can use a horseshoe prior very easily in your analysis simply by changing the prior parameter in your call to one of those functions.

  1. This is often referred to as an \(n \ll p\) problem. I’m not going to address that problem here, but if you deal with genomic data, you’ll want to familiarize yourself with the problem and the approaches typically used for addressing it.
  2. 29,558 times to be exact.
  3. Park, T. and G. Casella. 2008. The Bayesian Lasso. Jornal of the American Statistical Association. 103:681-686. doi: 10.1198/016214508000000337

Using the Lasso for variable selection

Variable selection in multiple regression

If you’ve been following along, you’ve now seen some fairly simple approaches for reducing the number of covariates in a linear regression. It shouldn’t come as a shock that statisticians have been worried about the problem for a long time or that they’ve come up with some pretty sophisticated approaches to the problem.1 The first one we’ll explore is the Lasso (least asolute shrinkage and selection operator), which Rob Tibshirani introduced the Lasso to statistics and machine learning more than 20 years ago.2 You’ll find more details in the R notebook illustrating using the Lasso to select covariates, but here are the basic ideas.

The “shrinkage” part of the name refers to the idea that we don’t expect all of the covariates we’re including in the model to be important. And if a covariate isn’t important, we want the magnitude of the regression coefficient associated with that component to be zero (or nearly zero). In other words, we want the estimate to be “shrunk” towards zero rather than taking the value it would if we included it in the full multiple regression.

The “selection” part of the name refers to the idea that we don’t know ahead of time which of the covariates are important (and shouldn’t be shrunk towards 0) and which are important (and should be shrunk towards 0). We want the data to tell us which covariates are important and which aren’t, i.e., we want the data to “select” important covariates.

The Lasso accomplishes this by adding a penalty to the typical least squares estimates. Instead of simply minimizing the sum of squared deviations from the regression line, we do so subject to a constraint that the total magnitude of all regression coefficients is less than some value. We’ll use glmnet() to fit the Lasso. If you explore the accompanying documentation, you’ll see that the Lasso is just one method along a continuum of constrained optimization approaches. I’ll let you explore those on your own if you’re interested.

  1. I’m not going to discuss forward, backward, or all subsets approaches to selecting variables. They don’t seem to be used much anymore (for good reason). If you’re interested in them, take a look at the Wikipedia page on stepwise regression.
  2. Wikipedia points out that it was originally introduced 10 years earlier in geophysics, but Tibshirani discovered it independently, and it was his discovery that led to its wide use in statistics and machine learning.