# Introduction

Let’s review the basic approach we use in genome-wide association mapping.

• We measure both the phenotype, $$y_i$$, of individual $$i$$ and its genotype at a large number of loci, where $$x_{ij}$$ is the individual’s genotype at locus $$j$$.

• We regress phenotype on genotype one locus at a time, using a random effect to correct for phenotypic similarities that reflect relatedness rather than similarity in genotype. $y_i^{(k)} = x_{ij}\beta_j + \phi^{(k)} + \epsilon_i \quad .$

Keep in mind this is a highly idealized schematic of how GWAS analyses are actually done.1 If you want to do GWAS for real, you should take a look at GEMMA (http://www.xzlab.org/software.html) or TASSEL (https://www.maizegenetics.net/tassel). One important way in which what I’ve presented is a simplification is that in a real GWAS analysis, you’d estimate the effects of every locus simultaneously, which raises an interesting problem.

In a typical GWAS analysis2, you will have measured the phenotype of a few thousand individuals, but you will have genotyped those individuals at several hundred thousand loci. Lango Allen et al. , for example, report results from a large analysis of height variation in humans, 183,727 individuals genotyped at 2,834,208 loci. What’s the problem here?

There are more predictors (loci) than observations (individual phenotypes). If you remember some basic algebra, you’ll remember that you can’t solve a set of linear equations unless you have the same number of equations as unknowns. For example, you can’t solve a set of three equations that has five unknowns. There’s a similar phenomenon in statistics when we’re fitting a linear regression. In statistics we don’t “solve” an equation. We find the best fit in a regression, and we can do so in a reasonable way so long as the number of observations exceeds the number of variables included in our regression. To put a little mathematical notation to it, if $$n$$ is the number of observations and $$p$$ is the number of regression parameters we hope to estimate, life is good (meaning that we can estimate the regression parameters) so long as $$n > p$$.3 The typical situation we encounter in GWAS is that $$n < p$$, which means we have to be really sneaky. Essentially what we do is that we find a way for the data to tell us that a lot of the parameters don’t matter and we fit a regression only to the ones that do, and we set things up so that the remaining number of parameters is less than $$n$$. If that all sounds a little hoky, trust me it isn’t. There are good ways to do it and good statistical justification for doing it4, but the mathematics behind it gets pretty hairy, which is why you want to use GEMMA or TASSEL for a real GWAS. We’ll ignore this part of the challenge associated with GWAS and focus on another one: complex traits often are influenced by a very large nubmer of loci. That is, after all, why we started studying quantitative genetics in the first place.

# Genetics of complex traits

Let’s return to that Lango Allen et al.  GWAS on height in humans. They identified at least 180 loci associated with differences in height. Moreover, many of the variants are closely associated with genes that are part of previously identified pathways, e.g., Hedgehog signaling,5 or that were previously identified as being involved in skeletal growth defects. A more recent study by Wood et al.  synthesized results from 79 studies involving 253,288 individuals and identified 697 variants that were clustered into 423 loci affecting differences in height.6 Think about what that means. If you know my genotype at only one of those 697 variants, you know next to nothing about how tall I am. But what if you knew my genotype for all of those variants? Then you should be able to do better.

The basic idea is fairly simple. When you do a full GWAS and estimate the effects at every locus simultaneously, you are essentially performing a multiple regression of phenotype on all of the loci you’ve scored simultaenously instead of looking at them one at a time. In equation-speak, $y_i^{(k)} = \sum_j x_{ij}\beta_j + \phi^{(k)} + \epsilon_i \quad .$ Now think a bit more about what that equation means. The $$\phi^{(k)}$$ and $$\epsilon_i$$ terms represent random variation, in the first case variation that is correlated among individuals depending on how closely related they are and in the second case variation that is purely random. The term $$\sum_j x_{ij}\beta_j$$ reflects systematic effects associated with the genotype of individual $$i$$. In other words, if we know individual $$i$$’s genotype, i.e., if we know $$x_{ij}$$ we can predict what phenotype it will have, namely $$\mu_i = \sum_j x_{ij}\beta_j$$. Although we know there will be uncertainty associated with this prediction, $$\mu_i$$ is our best guess of the phenotype for that individual, i.e., our genomic prediction or polygenic score. In the case of height in human beings, it turns out that the loci identified in Wood et al.  account for about 16 percent of variation in height.7 If we don’t have too many groups, we could refine our estimate a bit further by adding in the group-specific estimate, $$\phi^{(k)}$$. Of course when we do so, our prediction is no longer a genomic predictiion, per se. It’s a genomic prediction enhanced by non-genetic group information.

## A toy example

To make all of this more concrete, we’ll explore a toy example using the highly simplified one locus at a time approach to GWAS with a highly simplified example of the multiple regression approach to GWAS. You’ll find an R notebook that implements all of these analyses at http://darwin.eeb.uconn.edu/eeb348-notes/Exploring-genomic-prediction.nb.html. I encourage you to download the notebook as you follow along. You will find it especially useful if you try some different scenarios by changing nloci and effect when you generate the data that you later analyze locus by locus or with genomic prediction. Here’s what the code as written does:

• Generate a random dataset with 100 individuals and 20 loci, 5 of which influence the phenotype. The effect of one “1” allele at locus 1 is 1, at locus 2 -1, at locus 3 0.5, at locus 4 -0.5, and at locus 5 0.25. The standard deviation of the phenotype around the predicted mean is 0.2.

• Run the locus-by-locus regression for each locus and store the results (mean and 95% credible interval) in results.  results is sorted in by the magnitude of the posterior mean, so that loci with the largest estimated effect occur at the top and loci with the smallest effect occur at the bottom.

• Run the multiple regression and store the results in  results_gp.

If you look at the code, you’ll see that I use stan_lm() rather than using stan_lmer(). That’s because I simulate the data without family structure, so there’s no need to include the family random effect.

Table 1 shows results of the locus by locus analysis.

Sample results for locus by locus analysis of genetic associations using genomic-prediction.R
mean 2.5% 97.5%
locus_2 -0.939 -1.185 -0.695
locus_1 0.747 0.491 0.951
locus_3 0.395 0.097 0.696
locus_15 0.312 0.031 0.587
locus_4 -0.206 -0.497 0.086
locus_11 -0.182 -0.494 0.136
locus_7 -0.149 -0.445 0.136
locus_12 -0.123 -0.439 0.176
locus_6 -0.102 -0.362 0.169
locus_19 -0.086 -0.379 0.214
locus_13 -0.073 -0.369 0.229
locus_17 -0.072 -0.397 0.264
locus_14 0.068 -0.230 0.359
locus_10 -0.065 -0.334 0.208
locus_20 -0.053 -0.334 0.231
locus_18 0.040 -0.261 0.337
locus_8 0.017 -0.257 0.309
locus_9 -0.010 -0.310 0.302
locus_16 -0.006 -0.275 0.274
locus_5 -0.005 -0.315 0.28

For this simulated data set 4 of the 5 loci with the largest estimated effect are the 5 loci for which I specified an effect, one of them (locus 15) did not have a specified effect and locus 5, which had a specified effect, has the lowest estimated effect of all.

What about the multiple regression approach? First, take a look at the estimated effects (Table 2). Not only does this approach pick out the right loci, the first five, none of the other loci have particularly large estimated effects. The largest,  locus_17 is only about 0.07, about the same as in the locus by locus analysis. It would take much more extensive simulation to demonstrate the advantage empirically, but it is clear from first principles that multiple regression analyses will be more reliable than locus by locus analyses because a multiple regression analysis takes account of random associations among loci.

Results from multiple regression analysis of simulated data.
mean 2.5% 97.5%
locus_1 0.979 0.840 1.116
locus_2 -0.882 -1.016 -0.742
locus_3 0.614 0.465 0.762
locus_4 -0.514 -0.656 -0.373
locus_5 0.246 0.086 0.389
locus_17 -0.070 -0.217 0.028
locus_7 -0.058 -0.198 0.031
locus_18 0.053 -0.037 0.196
locus_6 0.052 -0.030 0.176
locus_8 0.042 -0.037 0.159
locus_16 0.039 -0.038 0.156
locus_10 -0.034 -0.147 0.044
locus_9 0.013 -0.076 0.126
locus_11 -0.010 -0.119 0.082
locus_15 0.006 -0.080 0.107
locus_12 0.005 -0.086 0.106
locus_19 -0.005 -0.104 0.083
locus_20 0.003 -0.082 0.095
locus_13 -0.001 -0.093 0.086
locus_14 0.001 -0.091 0.093

# Comparing the results

Let’s see what other differences we find when we compare the two approaches more directly. First, let’s look at the estimated allelic effects themselves (Figure 1). As you can see, they are broadly similar, but if you look closely, they are most similar when the estimated allelic effects are small.

More interesting than whether the estimated allelic effects are similar is whether the predicted phenotypes are similar to the observed phenotypes (Figure reffig:gwas-obs-vs-predicted). As you can see, in this simple simulated data set both approaches work reasonably well, even though the estimated allelic effects are rather different. In fact, the estimated mean squared error of the locus-by-locus prediction is actually smaller than for the genomic prediction (6.02 vs. 8.02).

# Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

1. Remember, also, that in analyses of human disease, a case-control approach is often used rather than the regression approach I’ve been focusing on.↩︎

2. In humans at least↩︎

3. And the more that $$n$$ exceeds $$p$$ the better, the more accurate our estimates of the regression parameters will be.↩︎

4. And biological justification for doing it in GWAS.↩︎

5. “The Hedgehog signaling pathway is a signaling pathway that transmits information to embryonic cells required for proper cell differentiation.” https://en.wikipedia.org/wiki/Hedgehog_signaling_pathway, accessed 14 August 2021.↩︎

6. It’s worth noting that even this is likely to be an underestimate of the number of loci associated with height variation in humans because all of the individuals included in the analysis were of European ancestry.↩︎

7. In Europe the heritability of height at age 20 is about 80 percent .↩︎