| Code | Base |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine (Uracil in RNA) |
| M | A or C |
| R | A or G |
| W | A or T |
| S | C or G |
| Y | C or T |
| K | G or T |
| V | A or C or G |
| H | A or C or T |
| D | A or G or T |
| B | C or G or T |
| N | A or C or G or T |
Cornish-Bowden, A. Nucl. Acids Res. 13, 3021-3030 (1985).
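For readers who think in code, the ambiguity codes in the table above amount to a lookup from one letter to a set of bases. What follows is only a minimal sketch: the names `IUPAC` and `expand` are my own inventions for illustration, not part of any standard library.

```python
from itertools import product

# Ambiguity codes from the table above, each mapped to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence: str) -> list[str]:
    """Return every concrete DNA sequence an ambiguous sequence could denote."""
    choices = [IUPAC[code] for code in sequence.upper()]
    return ["".join(bases) for bases in product(*choices)]


print(expand("GAY"))       # ['GAC', 'GAT']
print(len(expand("NNN")))  # 64
```

Note that the number of expansions multiplies with every ambiguous position, so this brute-force approach is only sensible for short sequences.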
| Amino acid | Abbreviation | Code |
|---|---|---|
| Alanine | Ala | A |
| Arginine | Arg | R |
| Asparagine | Asn | N |
| Aspartic acid | Asp | D |
| Cysteine | Cys | C |
| Glutamine | Gln | Q |
| Glutamic acid | Glu | E |
| Glycine | Gly | G |
| Histidine | His | H |
| Isoleucine | Ile | I |
| Leucine | Leu | L |
| Lysine | Lys | K |
| Methionine | Met | M |
| Phenylalanine | Phe | F |
| Proline | Pro | P |
| Serine | Ser | S |
| Threonine | Thr | T |
| Tryptophan | Trp | W |
| Tyrosine | Tyr | Y |
| Valine | Val | V |
| Asparagine or Aspartic acid | Asx | B |
| Glutamine or Glutamic acid | Glx | Z |
| Any amino acid | Xaa | X |
In addition, "TERM" is prescribed as the abbreviation for
any of the termination codons.
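The amino acid table is likewise just a dictionary. The sketch below, with the hypothetical names `THREE_TO_ONE` and `to_one_letter`, converts a list of three-letter abbreviations into the compact one-letter form.

```python
# Three-letter abbreviations and one-letter codes, copied from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Asx": "B", "Glx": "Z", "Xaa": "X",
}


def to_one_letter(residues: list[str]) -> str:
    """Join a list of three-letter abbreviations into a one-letter string."""
    return "".join(THREE_TO_ONE[residue] for residue in residues)


print(to_one_letter(["Ala", "Thr", "Ser"]))  # ATS
```

Several one-letter amino acid codes (A, C, G, T and others) reuse letters from the nucleotide table, so a program has to know from context which alphabet a string is written in.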
| 1st | 2nd: U | 2nd: C | 2nd: A | 2nd: G | 3rd |
|---|---|---|---|---|---|
| U | Phe | Ser | Tyr | Cys | U |
| U | Phe | Ser | Tyr | Cys | C |
| U | Leu | Ser | Stop | Stop | A |
| U | Leu | Ser | Stop | Trp | G |
| C | Leu | Pro | His | Arg | U |
| C | Leu | Pro | His | Arg | C |
| C | Leu | Pro | Gln | Arg | A |
| C | Leu | Pro | Gln | Arg | G |
| A | Ile | Thr | Asn | Ser | U |
| A | Ile | Thr | Asn | Ser | C |
| A | Ile | Thr | Lys | Arg | A |
| A | Met | Thr | Lys | Arg | G |
| G | Val | Ala | Asp | Gly | U |
| G | Val | Ala | Asp | Gly | C |
| G | Val | Ala | Glu | Gly | A |
| G | Val | Ala | Glu | Gly | G |
The table above gives the genetic code used by most organisms to translate nucleotide triplets (codons) in RNA copies of DNA into the sequences of amino acids which form a polypeptide (protein). The mitochondria (energy-producing components of the cell) have their own DNA which uses a slightly different code, as explained below.
The code table is read in the following fashion. Let's assume a gene specifies the last three amino acids to be assembled by the ribosome to be:
Alanine (Ala)
Threonine (Thr)
Serine (Ser)
The DNA which codes for this might then be:
GCAACTTCGTAA
Let's rewrite this for clarity with spaces between each three letters:
GCA ACT TCG TAA
Now when this DNA is transcribed into the RNA molecule actually processed by the ribosome, the RNA will contain a uracil ("U") base at positions where the DNA contains thymine ("T"). So, the next step is to rewrite the sequence as it will appear in RNA, which is:
GCA ACU UCG UAA
Now find the first letter of each triplet in the leftmost column of the genetic code table, find the column corresponding to the second letter, and the row specified by the third letter in the rightmost column. The item in the table then gives the abbreviation for the amino acid which will be added to the protein. You might want to verify the correctness of the sequence by looking up the first three triplets in the table. The last triplet, UAA, does not specify an amino acid; it is a stop codon, which causes the ribosome to cease assembly of the protein and release it so it can complete folding into its functional three-dimensional shape.
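The table lookup just described is easy to mechanise. Here is a minimal sketch which builds the standard code table in the same U, C, A, G order as the printed table, then transcribes and translates the example sequence. The function names `transcribe` and `translate` are illustrative choices of mine, and the sketch assumes the sequence is already aligned on a triplet boundary.

```python
from itertools import product

BASES = "UCAG"  # same order as the rows and columns of the printed table
AMINO = (
    "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
    "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
    "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
    "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly"
).split()

# product() varies the third letter fastest, matching the order of AMINO.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna: str) -> str:
    """Rewrite a DNA coding sequence as RNA: every T becomes U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> list[str]:
    """Look up successive triplets until a stop codon or the end is reached."""
    residues = []
    for i in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]
        if residue == "Stop":
            break
        residues.append(residue)
    return residues


rna = transcribe("GCAACTTCGTAA")
print(rna)             # GCAACUUCGUAA
print(translate(rna))  # ['Ala', 'Thr', 'Ser']
```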
Looking at the genetic code through the eyes of a computer programmer, the first thing one observes is that there is a great deal of redundancy in it. Only 20 amino acids are used in proteins, yet there are 64 possible triplets of the four nucleotides in DNA or RNA. Leucine (Leu), for example, can be specified by any of the following: UUA, UUG, CUU, CUC, CUA, CUG. Similarly, there are three different stop codons, all of which serve equally well to terminate assembly of a protein.
Conversely, there is a certain economy in the code. Life seems to require 20 different amino acids in order to function and these, plus a stop code, mean the blueprint for a protein must encode 21 different possibilities for each position. Given that a protein is specified by a series of nucleotides chosen from a set of four (or, as a programmer would put it, "each nucleotide encodes two bits"), a minimum of three nucleotides (six bits of information) is required: two could encode only 16 possibilities, and that isn't enough. Viewed this way, the genetic code is seen to be as simple as it possibly can be given the chemical constraints (20 amino acids plus a stop code required, genetic information encoded as a sequence of nucleotides chosen from a set of four).
The fact that several triplets encode the same amino acid (as opposed to all being stop codes or some form of error indication) will remind an electrical engineer of "don't care" conditions in the design of a digital circuit--often the cost and complexity of a circuit can be reduced by allowing multiple sets of inputs to produce the same output, even though only one of the inputs may actually be used in operation. Similarly, early (and some not-so-early) computers often had several different binary instruction codes which performed the same operation, purely because that "fell out" of the circuit design.
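The arithmetic behind both the economy and the redundancy can be checked in a few lines. This is a sketch under the same assumptions as the standard code table; `min_codon_length` is a name I made up for the occasion.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO = (
    "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
    "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
    "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
    "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly"
).split()
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def min_codon_length(symbols: int, alphabet: int = 4) -> int:
    """Smallest n for which alphabet**n distinct codons cover all the symbols."""
    length = 1
    while alphabet ** length < symbols:
        length += 1
    return length


# 20 amino acids plus a stop signal: 4**2 = 16 is too few, 4**3 = 64 suffices.
print(min_codon_length(21))  # 3

# Redundancy: 64 codons share only 21 distinct meanings.
codons_per_meaning = Counter(CODON_TABLE.values())
print(len(codons_per_meaning))                                # 21
print(codons_per_meaning["Leu"], codons_per_meaning["Stop"])  # 6 3
print([c for c, a in CODON_TABLE.items() if a == "Leu"])
# ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG']
```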
An aspect of the genetic code which will puzzle a programmer is the lack of a "Start" code. Clearly the ribosome needs to know precisely where to begin. We were able to separate the sequence GCAACTTCGTAA into the triplets GCA ACT TCG TAA only because we'd been told where to start. But in the actual DNA molecule, thousands to millions of genes are all run together in a huge string of letters, intermixed in many cases with sequences of "junk DNA" which appear to have no biological function. If the ribosome were to start reading the sequence of nucleotides one base to the left or right of the proper place, an entirely different set of triplets would be found and a protein assembled which bears no resemblance whatsoever to the one intended. "Sheesh, there are three interchangeable codes to tell the ribosome where to stop," the programmer exclaims. "Why not use one of them to tell it where to start?"
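To see how badly things go wrong when the starting point slips by a single base, here is a small sketch that splits the example sequence into triplets at each of the three possible offsets. The helper name `triplets` is hypothetical.

```python
def triplets(sequence: str, offset: int = 0) -> list[str]:
    """Split a sequence into complete triplets, starting at the given offset."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


dna = "GCAACTTCGTAA"
for offset in range(3):
    print(offset, triplets(dna, offset))
# 0 ['GCA', 'ACT', 'TCG', 'TAA']
# 1 ['CAA', 'CTT', 'CGT']
# 2 ['AAC', 'TTC', 'GTA']
```

None of the triplets at offsets 1 and 2 matches any of the intended ones, and the stop codon TAA has vanished from both shifted readings.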
Ribosomes (the beads) on strands of messenger RNA, assembling proteins (the lumpy chains on either side). This is a 140,000 magnification electron micrograph; the ribosomes are about 20 nanometres across.
The ribosome does, in fact, start copying at the precise point intended and therefore interpret the triplets correctly. How it works is one of those remarkable hacks you find everywhere in biology which resemble those one encounters when examining the work of programmers who value expedience over elegance. There is, indeed, a codon which means "Start"; it is the triplet AUG. "But wait!", you exclaim, "According to the table AUG says to add a methionine residue to the protein. Certainly every protein doesn't start with methionine, does it?" No, it doesn't. The hack is even more grotesque. First of all, not any old AUG will do as a start signal. To perform this function, it must be preceded by a specific purine-rich (primarily adenine and guanine) sequence of nucleotides, to which the ribosome attaches when it first encounters the strand of messenger RNA (mRNA) it is about to start processing. Assembly of the protein then begins with the first AUG sequence which follows the attach point, at which point the ribosome is aligned to correctly parse the linear set of bases into triplets. But since AUG tells the ribosome to add a methionine to the chain, it promptly does so, and hence every protein does begin to be assembled with methionine at the start of the chain. So why doesn't every protein, then, begin with methionine? Because the initial methionine is usually chopped off by another process before assembly of the protein is complete.
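A programmer might caricature the start-finding hack as follows. This is a toy model, not a description of the real molecular machinery. The attachment motif `AGGAGG` is only an illustrative stand-in for "a specific purine-rich sequence", and the example mRNA and the names `find_start` and `coding_triplets` are made up.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_start(mrna: str, binding_site: str = "AGGAGG") -> int:
    """Index of the first AUG after the attachment site, or -1 if there is none.

    binding_site is an assumed, illustrative purine-rich motif.
    """
    site = mrna.find(binding_site)
    if site < 0:
        return -1
    return mrna.find("AUG", site + len(binding_site))


def coding_triplets(mrna: str, binding_site: str = "AGGAGG") -> list[str]:
    """Triplets from the start codon up to, but not including, the stop codon."""
    start = find_start(mrna, binding_site)
    if start < 0:
        return []
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


# The AUG at index 2 precedes the attachment site, so it is ignored.
mrna = "CCAUGCAGGAGGUUACAUGGCAACUUCGUAAGG"
codons = coding_triplets(mrna)
print(find_start(mrna))  # 16
print(codons)            # ['AUG', 'GCA', 'ACU', 'UCG']
print(codons[1:])        # initial Met removed: ['GCA', 'ACU', 'UCG']
```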
Almost everywhere you look into the innards of living things at the molecular level, you find this mix of fantastic complexity and brutal hacks, and so far I've only scratched the surface of this one tiny corner of the life of a single cell. For example, I didn't mention that AUG is not, in fact, the only start signal. In the intestinal bacterium E. coli, about one out of every 30 protein coding regions begins with GUG, which normally codes for valine; when it serves as a start signal, however, it is still read as methionine, which may be lopped off from the final protein just as before. There's more. Occasionally UUG and CUG are used as start signals as well, although much more rarely.
Nor have I mentioned so far that in eukaryotic organisms (those whose cells have nuclei, including higher organisms such as yeast and Homo sapiens), the DNA sequences coding for proteins (exons) are frequently interrupted by long strings of DNA (introns) which appear to have no function whatsoever--certainly they contribute in no way to the protein assembled by the ribosome. In the process of translating the DNA into protein, the DNA is first transcribed, introns (junk) and all, into an initial RNA sequence. Then specific enzymes in the cell nucleus scan the RNA, chop out and discard the introns, and re-splice the exons into a final piece of messenger RNA which is later used by a ribosome to direct the assembly of a protein; the ribosome doesn't need signals to skip the junk because it's already been removed before the ribosome starts on the RNA. Long intron sequences are commonplace in higher organisms--every hemoglobin molecule that makes your blood red was assembled from a DNA description frequently interrupted by lengthy, incoherent, blithering introns.
"What a way to run a biosphere!", a programmer mutters. In programming, and in almost any field of engineering where one wishes to store, retrieve, and copy large amounts of data with high reliability, great effort is expended in encoding the information as efficiently as possible. If you're trying to store the telephone directories for all the cities in a large country on a CD-ROM, for example, you wouldn't interrupt the list of names and numbers at random points here and there to include long extracts of Finnegans Wake, random audio clips from Alban Berg's Greatest Hits, or scenes from Plan 9 from Outer Space, not to mention total gibberish resembling E-mail with three or more exclamation marks in the "Subject" line, then create a whole separate mechanism to detect and discard the irrelevant sequences lest they confuse the program the customer uses to search the directory. Yet this seems precisely what biology has done in encoding the information from which programmers, human beings, and a multitude of other organisms are assembled.
And it's not because biology is inherently sloppy and inefficient, either. Only Archaea and Eukaryotes have all these junk introns--the other domain of life, Bacteria, doesn't have them, and bacterial genomes occasionally contain compression tricks (for example, sequences which are part of two distinct protein descriptions, depending on where you start reading) which can take even the most hardened machine language programmer's breath away. It was while pondering why apparently higher forms of life carried around such a huge amount of seemingly useless information in their own genomes that the idea popped into my mind which led to the science fiction story which, almost a decade later, in the same year in which the story is set, motivated the development of these pages which may close the circle and result in the discovery alluded to in the story.
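The intron-removal step that eukaryotes perform, and bacteria get by without, can be sketched as plain string surgery. This is a deliberately simplified model. Here the exon boundaries are handed to the function as hypothetical coordinates, whereas the cell's own machinery has to find them from signals in the RNA itself. The sequence is made up, with the intron written in lower case purely for readability.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices (start, end) of a transcript, discarding the rest."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exon, intron (lower case), exon.
pre_mrna = "AUGGCA" + "guuuccuuag" + "ACUUCGUAA"
exons = [(0, 6), (16, 25)]  # assumed coordinates, end-exclusive

print(splice(pre_mrna, exons))  # AUGGCAACUUCGUAA
```

What the ribosome eventually sees is only the spliced result, which is why it needs no signal of its own for skipping the junk.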
At first glance, the genome looks very much like a machine language computer program. It is written in a highly structured language designed around the constraints of the hardware which executes it. The program is encoded in a very different form compared to the output it produces. And the small set of distinct machine language instructions can be strung together in arbitrarily long sequences to create a program of arbitrary complexity. And yet, and yet...computer programs aren't randomly interrupted by irritating, irrelevant, and irrational sequences with no information content whatsoever--how odd. But television programs are--by commercials. And, after all, programs are programs.
At the outset, the idea that the "junk DNA" in every one of our cells consists, in part, of molecular commercials may seem absurd. But, is it any more absurd than the notion that every time a cell in your body divides, 90% of the genetic material which is faithfully copied is total gibberish, with no function or information content whatsoever?
Isn't it worth investing a little time exploring what information all that
supposedly "junk" we and other higher organisms are carrying around
in our every cell might encode? To begin your own exploration,
after finishing this document, pay a visit to the interactive
Genome Browser.
| 1st | 2nd: U | 2nd: C | 2nd: A | 2nd: G | 3rd |
|---|---|---|---|---|---|
| U | Phe | Ser | Tyr | Cys | U |
| U | Phe | Ser | Tyr | Cys | C |
| U | Leu | Ser | Stop | Trp | A |
| U | Leu | Ser | Stop | Trp | G |
| C | Leu | Pro | His | Arg | U |
| C | Leu | Pro | His | Arg | C |
| C | Leu | Pro | Gln | Arg | A |
| C | Leu | Pro | Gln | Arg | G |
| A | Ile | Thr | Asn | Ser | U |
| A | Ile | Thr | Asn | Ser | C |
| A | Met | Thr | Lys | Stop | A |
| A | Met | Thr | Lys | Stop | G |
| G | Val | Ala | Asp | Gly | U |
| G | Val | Ala | Asp | Gly | C |
| G | Val | Ala | Glu | Gly | A |
| G | Val | Ala | Glu | Gly | G |
Mitochondria are believed to have initially been free-living organisms which first invaded their present-day hosts as parasites. Over time, the relationship between the mitochondria and host cell became symbiotic--mitochondria became the energy-producing factories of the cell, while many of the proteins required to assemble the mitochondria came to be produced by the host cell rather than its endosymbiont. Nonetheless, a mitochondrion retains its own DNA and protein expression mechanism, and requires them to function.
There is no better evidence of this somewhat separate origin than the genetic code. Translation of mitochondrial DNA into proteins uses a slightly different genetic code, shown in the table above: UGA specifies tryptophan rather than Stop, AUA specifies methionine rather than isoleucine, and AGA and AGG are stop codons rather than codes for arginine.
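In programming terms the mitochondrial code is the standard table with a four-entry patch applied. The sketch below expresses it that way and shows the same RNA being read differently under each table. The names `STANDARD`, `MITOCHONDRIAL`, and `translate` are illustrative, and the test sequence is made up.

```python
from itertools import product

BASES = "UCAG"
AMINO = (
    "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr Stop Stop Cys Cys Stop Trp "
    "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
    "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
    "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly"
).split()
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The four entries in which the mitochondrial table departs from the standard one.
MITOCHONDRIAL = {**STANDARD, "UGA": "Trp", "AUA": "Met", "AGA": "Stop", "AGG": "Stop"}


def translate(rna: str, table: dict[str, str]) -> list[str]:
    """Translate successive triplets with the given table, halting at Stop."""
    residues = []
    for i in range(0, len(rna) - 2, 3):
        residue = table[rna[i:i + 3]]
        if residue == "Stop":
            break
        residues.append(residue)
    return residues


differences = {
    codon: (STANDARD[codon], MITOCHONDRIAL[codon])
    for codon in STANDARD
    if STANDARD[codon] != MITOCHONDRIAL[codon]
}
print(differences)
# {'UGA': ('Stop', 'Trp'), 'AUA': ('Ile', 'Met'), 'AGA': ('Arg', 'Stop'), 'AGG': ('Arg', 'Stop')}

rna = "AUAUGAAGA"
print(translate(rna, STANDARD))       # ['Ile']
print(translate(rna, MITOCHONDRIAL))  # ['Met', 'Trp']
```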
Mitochondria are not the only dissenters from the
consensus genetic code. The mitochondria of yeast, mold,
invertebrates, echinoderms, ascidians, flatworms, and
blepharisma, and the nuclear DNA of ciliates, euplotids, and bacteria
differ in minor ways from the codes shown here and it is probable
that additional variations will be found as the genomes of
more and more organisms are sequenced. Still, what is
striking is not the slight differences from one organism to
another, but the underlying unity--and thus the virtual certainty
that all are descendants of a common ancestor.