Patterns of nucleotide and amino acid substitution

Introduction

So, I’ve just suggested that the neutral theory of molecular evolution explains quite a bit, but it also ignores quite a bit. The derivations we did assumed that all substitutions are equally likely to occur, because they are effectively neutral. That isn’t plausible. We need look no further than sickle cell anemia to see an example of a protein polymorphism in which a single nucleotide substitution, and the single amino acid difference it causes, has a very large effect on fitness. Even reasoning from first principles, we can see that it doesn’t make much sense to think that all nucleotide substitutions are created equal. Just as it’s unlikely that you’ll improve the performance of your car if you pick up a sledgehammer, open its hood, close your eyes, and hit something inside, so it’s unlikely that picking a random amino acid in a protein and substituting it with a different one will improve the function of the protein.1

The genetic code

Of course, not all nucleotide sequence substitutions lead to amino acid substitutions in protein-coding genes. There is redundancy in the genetic code. Table 1 is a list of the codons in the universal genetic code.2 Notice that there are only two amino acids, methionine and tryptophan, that have a single codon. All the rest have at least two. Serine, arginine, and leucine have six.

Table 1. The universal genetic code.
Codon  Amino acid   Codon  Amino acid   Codon  Amino acid   Codon  Amino acid
UUU    Phe          UCU    Ser          UAU    Tyr          UGU    Cys
UUC    Phe          UCC    Ser          UAC    Tyr          UGC    Cys
UUA    Leu          UCA    Ser          UAA    Stop         UGA    Stop
UUG    Leu          UCG    Ser          UAG    Stop         UGG    Trp
CUU    Leu          CCU    Pro          CAU    His          CGU    Arg
CUC    Leu          CCC    Pro          CAC    His          CGC    Arg
CUA    Leu          CCA    Pro          CAA    Gln          CGA    Arg
CUG    Leu          CCG    Pro          CAG    Gln          CGG    Arg
AUU    Ile          ACU    Thr          AAU    Asn          AGU    Ser
AUC    Ile          ACC    Thr          AAC    Asn          AGC    Ser
AUA    Ile          ACA    Thr          AAA    Lys          AGA    Arg
AUG    Met          ACG    Thr          AAG    Lys          AGG    Arg
GUU    Val          GCU    Ala          GAU    Asp          GGU    Gly
GUC    Val          GCC    Ala          GAC    Asp          GGC    Gly
GUA    Val          GCA    Ala          GAA    Glu          GGA    Gly
GUG    Val          GCG    Ala          GAG    Glu          GGG    Gly
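
The counts quoted above can be checked directly from the table. The following is a minimal sketch, assuming the standard genetic code written with one-letter amino acid symbols and "*" for stop codons; the names `CODON_TABLE` and `codons_per_amino_acid` are illustrative and do not come from these notes.

```python
from collections import Counter
from itertools import product

# Codons are enumerated with the bases in U, C, A, G order at each position
# (third position varying fastest). One-letter amino acid codes; "*" = stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (stop codons excluded).
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

single_codon = sorted(aa for aa, n in codons_per_amino_acid.items() if n == 1)
six_codons = sorted(aa for aa, n in codons_per_amino_acid.items() if n == 6)

print("Amino acids with one codon: ", single_codon)
print("Amino acids with six codons:", six_codons)
```

Here M and W are methionine and tryptophan, and L, R, and S are leucine, arginine, and serine.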

Moreover, most of the redundancy is in the third position, where we can distinguish 2-fold from 4-fold redundant sites (Table 2). 2-fold redundant sites are those at which either of two nucleotides can be present in a codon for a single amino acid. 4-fold redundant sites are those at which any of the four nucleotides can be present in a codon for a single amino acid. In some cases, there is redundancy in the first codon position, e.g., both AGA and CGA are codons for arginine. Thus, many nucleotide substitutions at third positions do not lead to amino acid substitutions, and some nucleotide substitutions at first positions do not lead to amino acid substitutions. But every nucleotide substitution at the second position of a codon that specifies an amino acid either changes the amino acid or creates a stop codon. Nucleotide substitutions that do not lead to amino acid substitutions are referred to as synonymous substitutions, because the codons involved are synonymous, i.e., code for the same amino acid. Nucleotide substitutions that do lead to amino acid substitutions are non-synonymous substitutions.

Table 2. Examples of 4-fold and 2-fold redundancy in the 3rd position of the universal genetic code.
Codon  Amino acid   Redundancy
CCU    Pro          4-fold
CCC    Pro          4-fold
CCA    Pro          4-fold
CCG    Pro          4-fold
AAU    Asn          2-fold
AAC    Asn          2-fold
AAA    Lys          2-fold
AAG    Lys          2-fold
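
The position-by-position pattern described above can be checked by enumerating every single-nucleotide change to every amino-acid-coding codon. The sketch below is illustrative only: it assumes the standard genetic code, weights all codons and all possible changes equally (real sequences do not), and treats a change to a stop codon as not synonymous. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, bases in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon_a, codon_b):
    """True if both codons specify the same amino acid (or both are stops)."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


def synonymous_counts_by_position():
    """Count synonymous single-nucleotide changes at codon positions 1, 2, 3,
    starting from each of the 61 amino-acid-coding codons."""
    counts = [0, 0, 0]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # start only from codons that specify an amino acid
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                if is_synonymous(codon, mutant):
                    counts[position] += 1
    return counts


syn_counts = synonymous_counts_by_position()
# Each of the 61 sense codons has 3 possible changes at each position.
changes_per_position = 3 * sum(aa != "*" for aa in CODON_TABLE.values())
for position, n_syn in enumerate(syn_counts, start=1):
    print(f"position {position}: {n_syn} of {changes_per_position} changes are synonymous")
```

Under this equal-weight count, no second-position change is synonymous, a few first-position changes are (the leucine and arginine cases), and roughly two-thirds of third-position changes are.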

Rates of synonymous and non-synonymous substitution

By using a modification of the simple Jukes-Cantor model we encountered before, we can estimate both the number of synonymous substitutions and the number of non-synonymous substitutions that have occurred since two sequences diverged from a common ancestor. If we combine an estimate of the number of substitutions with an estimate of the time of divergence, we can estimate the rates of synonymous and non-synonymous substitution (number/time). Table 3 shows some representative estimates for the rates of synonymous and non-synonymous substitution in different genes studied in mammals.
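
As a rough illustration of the number/time calculation, the sketch below applies the plain Jukes-Cantor correction to the proportion of differing sites within one class of sites (synonymous or non-synonymous). The specific modification used in these notes is not spelled out here, so this is a stand-in rather than that method. It also assumes the common convention of dividing by twice the divergence time, because substitutions accumulate along both lineages. The function names and the input values are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor estimate of substitutions per site from the observed
    proportion p of sites that differ between two sequences."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 under Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def substitution_rate(distance, divergence_time_years):
    """Substitutions per site per year, assuming the distance accumulated
    along both lineages since the common ancestor (hence the factor of 2)."""
    return distance / (2 * divergence_time_years)


# Hypothetical inputs, for illustration only (not taken from Table 3).
p_synonymous = 0.30          # proportion of synonymous sites that differ
divergence_time = 80e6       # years since the two sequences diverged

k_synonymous = jukes_cantor_distance(p_synonymous)
rate = substitution_rate(k_synonymous, divergence_time)
print(f"estimated substitutions per site: {k_synonymous:.3f}")
print(f"rate per site per 10^9 years:     {rate * 1e9:.2f}")
```

The correction matters more as sequences diverge: the estimated number of substitutions per site exceeds the observed proportion of differences because some sites have changed more than once.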

Table 3. Representative rates of synonymous and non-synonymous substitution in mammalian genes (from ). Rates are expressed as the number of substitutions per site per \(10^9\) years.
Locus                     Non-synonymous rate   Synonymous rate
Histones
  H4                      0.00                  3.94
  H2                      0.00                  4.52
Ribosomal proteins
  S17                     0.06                  2.69
  S14                     0.02                  2.16
Hemoglobins & myoglobin
  \(\alpha\)-globin       0.56                  4.38
  \(\beta\)-globin        0.78                  2.58
  Myoglobin               0.57                  4.10
Interferons
  \(\gamma\)              3.06                  5.50
  \(\alpha\)1             1.47                  3.24
  \(\beta\)1              2.38                  5.33

Two very important observations emerge after you’ve looked at this table for a while. The first won’t come as any shock. The rate of non-synonymous substitution is generally lower than the rate of synonymous substitution. This is a result of the “sledgehammer principle” I mentioned earlier. Mutations that change the amino acid sequence of a protein are more likely to reduce that protein’s functionality than to increase it. As a result, they are likely to lower the fitness of individuals carrying them, and they will have a lower probability of being fixed than those mutations that do not change the amino acid sequence.3

The second observation is more subtle. Rates of non-synonymous substitution vary by more than two orders of magnitude, from 0.02 substitutions per nucleotide per billion years in ribosomal protein S14 to 3.06 substitutions per nucleotide per billion years in \(\gamma\)-interferon (and the histone rates are too low to register at two decimal places), while rates of synonymous substitution vary only by a factor of about two and a half (2.16 in ribosomal protein S14 to 5.50 in \(\gamma\)-interferon). If synonymous substitutions are neutral, as they probably are to a first approximation,4 then the rate of synonymous substitution should equal the mutation rate. Thus, if the mutation rate is roughly the same across loci, the rate of synonymous substitution should be approximately the same at every locus, which is roughly what we observe. But proteins differ in the degree to which their physiological function affects the performance and fitness of the organisms that carry them. Some, like histones and ribosomal proteins, are intimately involved with the structure of chromatin or translation of messenger RNA into protein. It’s easy to imagine that just about any change in the amino acid sequence of such proteins will have a detrimental effect on their function. Others, like interferons, are involved in responses to viral or bacterial pathogens. It’s easy to imagine not only that the selection on these proteins might be less intense, but that some amino acid substitutions might actually be favored by natural selection because they enhance resistance to certain strains of pathogens. Thus, the probability that a non-synonymous substitution will be fixed is likely to vary substantially among genes, just as we observe.
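
Both observations can be checked with a little arithmetic on the values in Table 3. The sketch below simply transcribes the table; the variable names are illustrative, and the histone loci are left out of the fold-range calculation for non-synonymous rates because their tabulated rate is 0.00.

```python
# (non-synonymous rate, synonymous rate), substitutions per site per 10^9
# years, transcribed from Table 3.
RATES = {
    "histone H4": (0.00, 3.94),
    "histone H2": (0.00, 4.52),
    "ribosomal protein S17": (0.06, 2.69),
    "ribosomal protein S14": (0.02, 2.16),
    "alpha-globin": (0.56, 4.38),
    "beta-globin": (0.78, 2.58),
    "myoglobin": (0.57, 4.10),
    "gamma-interferon": (3.06, 5.50),
    "alpha1-interferon": (1.47, 3.24),
    "beta1-interferon": (2.38, 5.33),
}

# First observation: non-synonymous rate below synonymous rate at each locus.
all_lower = all(nonsyn < syn for nonsyn, syn in RATES.values())

# Second observation: fold variation among loci for each class of substitution.
nonzero_nonsyn = [nonsyn for nonsyn, _ in RATES.values() if nonsyn > 0]
syn_rates = [syn for _, syn in RATES.values()]
nonsyn_fold = max(nonzero_nonsyn) / min(nonzero_nonsyn)
syn_fold = max(syn_rates) / min(syn_rates)

print(f"non-synonymous < synonymous at every locus: {all_lower}")
print(f"fold variation, non-synonymous (nonzero loci): {nonsyn_fold:.0f}")
print(f"fold variation, synonymous: {syn_fold:.2f}")

# Ratio of non-synonymous to synonymous rate for each locus.
for locus, (nonsyn, syn) in RATES.items():
    print(f"{locus:24s} {nonsyn / syn:.3f}")
```

The per-locus ratio of non-synonymous to synonymous rate gives a quick sense of how strongly amino acid changes are filtered out at each gene: it is near zero for histones and ribosomal proteins and much higher for the interferons.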

Revising the neutral theory

So we’ve now produced empirical evidence that many mutations are not neutral. Does this mean that we throw the neutral theory of molecular evolution away? Hardly. We need only modify it a little to accommodate these new observations.

As we’ll see, even these revisions aren’t entirely sufficient, but what we do from here on out is more to provide refinements and clarifications than to undertake wholesale revisions.

Creative Commons License

These notes are licensed under the Creative Commons Attribution License. To view a copy of this license, visit or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.