Storing Data in DNA

by John Walker

 

Introduction

In my science fiction story, We'll Return, After this Message, two programmers discover something very interesting in the genome of an organism. If you haven't read the story yet, you might want to read it now and return to this document when you're done.

In this page we'll explore ways information beyond the conventional genetic code could be coded into the genome of an organism, tools one can use to search for such information, and results of some experiments with genomes of various organisms. The symbols used in genome sequences are explained in the page The Genetic Code.

DNA Data Storage: Bases as Bits

It's intriguing to contemplate the fact that, aside from the encoding of protein sequences according to the genetic code, DNA can be used as a general-purpose data storage medium. If you take a sequence of binary data representing anything at all--text, sampled sound, an image--code the bits into nucleotides (somewhat cleverly, so as not to accidentally trigger transcription by a ribosome), and splice the segment of DNA into the middle of an intron, the machinery of the cell will faithfully replicate it every time the cell divides. Given the size of mammalian genomes and the percentage of them which are apparent junk, you could be walking around with a complete copy of the works of Shakespeare in every cell and never know it.

DNA copying is not, of course, absolutely perfect: if it were, mutations wouldn't occur and we'd still be single-celled organisms, if that. And, since mutations in junk DNA which does not direct the assembly of proteins rarely affect the fitness of the organism, errors accumulate over time. In fact, tracking the drift in sequences of non-coding DNA is an excellent way to determine when various species diverged from a common ancestor and estimate the time since the divergence. Still, DNA replication is a very reliable process; the error rate ranges from one error per 10^7 to one per 10^11 nucleotides copied, depending on the chemical structure of the DNA in the region being copied. The overall error rate is about one error per 10^9 bases. These are error rates with which communication engineers and computer scientists are entirely comfortable: error-correcting codes permit reliable transmission of messages in the presence of error rates many orders of magnitude greater than these. The raw error rate (before the error correction code is applied) on a compact disc, by comparison, is generally between one error in 10^5 and one in 10^6 bits--between two and five orders of magnitude worse than that of DNA replication. In fact, even after error correction, an audio compact disc has an error rate of one error per 10^10 to 10^11 bits--comparable to the low end of the raw error rate in DNA. (CD-ROMs include additional error correction information to reduce the uncorrectable error rate to one error per 10^16 to 10^17 bits.) The same techniques could, of course, be applied to information encoded in DNA, allowing the uncorrected error rate to be reduced to as low a value as required, at the cost of increasing the length of the sequence due to incorporation of the additional error correcting code bits.

When CD-ROMs began to be used on personal computers in the 1980's, they seemed an almost utopian data distribution medium. One CD-ROM could store as much data as a suitcase full of floppy discs, could be mass produced by the factories already cranking out audio CDs by the hundreds of millions at a cost so low they could be bound into magazines, and were largely immune to the magnetic, dust, and other perils that afflict other computer media. Of course, you couldn't write your own CDs--almost a decade would pass before that became commonplace--but as a way to store and distribute permanent data, the CD-ROM was ideal. It's interesting, then, to compare the efficiency of data storage on CD-ROM with the alternative of encoding it in DNA: to see how biological wetware stacks up next to technological hardware.

In a bulk data storage medium, the efficiency ultimately depends on the cost of the material used to store the data and the cost required to produce each copy of a master. Since most computer media are largely made of plastic, organic compounds containing much the same mix of common molecules as those of biology, we can adopt a metric of "bits per gram" when comparing various media--how much stuff we have to buy to store each bit on the medium; the greater the number of bits per gram, the cheaper we can expect the storage medium to be. The ubiquitous 1.44 megabyte floppy disc, for example, weighs about 22 grams and stores 1,457,664 bytes (more or less--the precise number depends on how you format the disc), or 11,661,312 bits, there being 8 bits per byte. Dividing by 22 grams, we find the floppy disc stores about 530,000 bits per gram--not bad, at first glance.

Next, let's work the numbers for a CD-ROM. On a maximum length CD-ROM we have a total of 681,984,000 bytes, or 5,455,872,000 bits. A CD-ROM weighs about 15 grams, so the CD-ROM stores 363,724,800 bits per gram--more than 680 times the efficiency of the floppy--no wonder CD-ROMs have become the distribution medium of choice for Microsoft bloatware!

Now let's take a look at DNA as a storage medium. Let's assume we store one bit of the data stream per complementary base pair in the DNA molecule--we could theoretically store two bits per base, but then we'd run the risk of generating a sequence which triggered transcription by a ribosome or caused the DNA in the region to have undesirable chemical properties. If we store one bit per base pair, we're free to choose a sequence which is benign in these regards. DNA is a double helix in which the backbone of each strand consists of a chain of phosphate groups and the sugar deoxyribose, with a base pair of adenine-thymine (A-T) or cytosine-guanine (C-G) attached to each link of the backbone. Assuming a roughly equal distribution of A-T and C-G base pairs (and their complements), we find the average base pair weighs about 366 atomic mass units. Adding the atoms in the backbone on both sides of the helix, we end up with a total of 782 atomic mass units per base pair, and hence per bit, since we're encoding one in each. How does this compare to the compact disc? Rather well, actually. Converting the mass in atomic mass units into grams and dividing into one gram, we calculate that DNA, at one bit per base pair, can store 7.7×10^20 bits per gram--about two million million times more efficiently than the CD-ROM.
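
The arithmetic behind the three bits-per-gram figures can be checked with a short Python sketch. The capacities and masses are the ones quoted in the text; the value used for the atomic mass unit (about 1.66054×10^-24 grams) is the standard one, and the variable names are merely illustrative.

```python
# Bits per gram for the three media, using the figures quoted in the text.
AMU_IN_GRAMS = 1.66054e-24              # one atomic mass unit, in grams

floppy_bits = 1_457_664 * 8             # formatted 1.44 megabyte floppy
cdrom_bits = 681_984_000 * 8            # maximum length CD-ROM

floppy_density = floppy_bits / 22       # the floppy weighs about 22 grams
cdrom_density = cdrom_bits / 15         # the CD-ROM weighs about 15 grams
dna_density = 1 / (782 * AMU_IN_GRAMS)  # one bit per 782 amu base pair

print(f"floppy: {floppy_density:.3g} bits/g")
print(f"CD-ROM: {cdrom_density:.3g} bits/g")
print(f"DNA:    {dna_density:.3g} bits/g")
print(f"CD-ROM / floppy: {cdrom_density / floppy_density:.0f}")
print(f"DNA / CD-ROM:    {dna_density / cdrom_density:.2g}")
```

The last ratio comes out a little above 2×10^12, which is where the "12 orders of magnitude" figure comes from.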

As I write this document, the Digital Video Disc (DVD) has appeared on the market and many industry pundits have hailed its computer storage incarnation as the successor to the CD-ROM Disc. Maybe so--the increase in storage capacity is certainly impressive. But it's small change compared to the 12 orders of magnitude improvement obtainable by storing data in DNA. Of course, no matter how efficient a storage medium may be in terms of atoms per bit, it's only practical if the time and expense of making copies is also low. It's theoretically possible, for example, to store information with one atom per bit by pushing around atoms with a scanning probe microscope. But, with present technology, it would take years to copy a moderate length file, so the efficiency of the storage medium is illusory. But DNA couldn't be easier to replicate. Just splice the sequence encoding the data you want to transmit into the genome of a yeast cell, drop the cell into a vat of sugar water, and in a couple of hours you have as many copies as you want--when you need more, dump in more sugar. Each of the billions of daughter cells budded from the original will contain your data, along with an error correcting code which allows repairing any errors introduced in the process of replication.

Music Plants

The Sony Walkman, using the primitive Philips Compact Cassette, made music something one could enjoy anywhere, anytime. The Discman, embodying the successor technology, Compact Disc, improved the quality of the sound (although not all that much, as perceived with the headphones typically used with such gadgets). What might DNA data storage do for such a technology? Music plants! Leafman!

Genomes of some higher plants are huge--tens to hundreds of billions of bases. "Consider the lilies how they grow: they toil not, they spin not..." So why the heck does it take a genome thirty times the size of yours and mine to make a trumpet lily plant? Most people believe it's simply because there's a colossal amount of junk DNA in the plants (and amphibians) with these enormous genomes. If these organisms have no problem carrying around all that excess baggage in the nuclei of their every cell, there's no reason we can't add a little more of our own devising.

Imagine a plant cell into which we've encoded, as non-protein-coding introns, the 5 billion bits representing a musical recording, as presently encoded on a clunky plastic CD. Using straightforward cell culture techniques, we get the cell dividing and differentiating, until it grows into a mature plant and produces seeds which germinate and grow into other plants. Since every cell of these plants is descended from the one into which we spliced, say, a classic recording of The Goldberg Variations, each and every cell nucleus will contain a copy of that recording. To listen to the music, take a tiny clipping from the plant (after all, you only need one cell), put it into your Leafman, and press Play. The Leafman reads messenger RNA copied from the DNA containing the music in real time by an engineered ribosome-like molecule whose instantaneous conformation as it moves along the mRNA strand is sensed by a laser.

The record business would become entirely green. No longer would it consume vast quantities of petroleum-based plastics; instead music farms would raise and harvest Bach bushes, Beethoven beets, Liszt lettuce, Mahler melons, Phil Collins philodendrons, and Gangsta Rap kudzu vines. The CD-ROM will be supplanted as the distribution medium for Microsoft products by digitally encoded digitalis plants: "Place the petal from the deadly nightshade in the reader and type F:\SETUP".

One Bit per Base Encoding

Encoding  A  C  G  T      Encoding  A  C  G  T
    0     0  0  0  0          8     0  0  0  1
    1     1  0  0  0          9     1  0  0  1
    2     0  1  0  0         10     0  1  0  1
    3     1  1  0  0         11     1  1  0  1
    4     0  0  1  0         12     0  0  1  1
    5     1  0  1  0         13     1  0  1  1
    6     0  1  1  0         14     0  1  1  1
    7     1  1  1  0         15     1  1  1  1

If each base in the sequence represents one bit of encoded value, there are 16 possible encodings, representing the truth tables for all Boolean functions of two variables. Encodings 0 ("always zero") and 15 ("always one") are devoid of information and may be ignored. The remaining 14 encodings represent all possible ways each base can specify one bit of information independent of context.
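
As a minimal sketch, the sixteen one bit encodings can be generated by reading the encoding number as a four bit value, with A taking the low-order bit and T the high-order bit. This bit assignment is inferred from the table; the function name `one_bit_encoding` is hypothetical.

```python
BASES = "ACGT"

def one_bit_encoding(n):
    """Return {base: bit} for encoding n (0..15) of the one bit table."""
    # Bit i of n is the value assigned to BASES[i]: A is the low bit, T the high.
    return {base: (n >> i) & 1 for i, base in enumerate(BASES)}

# Encodings 0 and 15 map every base to the same bit and carry no information.
useful = [n for n in range(16) if n not in (0, 15)]

for n in range(16):
    table = one_bit_encoding(n)
    print(n, *(table[b] for b in BASES))
```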

Why encode only one bit per base? The three dimensional structure of the DNA molecule, often characterised as a regular double helix, in fact depends upon the sequence of bases making it up. Encoding only one bit per base, thereby sacrificing 50% of the potential information density, creates redundancy which permits encoding of arbitrary information within the chemical constraints of the DNA molecule. It's worth keeping in mind that the genetic code exhibits better than three-to-one redundancy (64 possible codons, 20 amino acids plus a stop code encoded), quite probably due to the same constraints. Also, encoding one bit per base lets us choose a sequence of bases optimised for a low error rate in replication.

Two Bit per Base Encoding

Encoding  A   C   G   T      Encoding  A   C   G   T
    0     00  01  10  11        12     10  00  01  11
    1     00  01  11  10        13     10  00  11  01
    2     00  10  01  11        14     10  01  00  11
    3     00  10  11  01        15     10  01  11  00
    4     00  11  01  10        16     10  11  00  01
    5     00  11  10  01        17     10  11  01  00
    6     01  00  10  11        18     11  00  01  10
    7     01  00  11  10        19     11  00  10  01
    8     01  10  00  11        20     11  01  00  10
    9     01  10  11  00        21     11  01  10  00
   10     01  11  00  10        22     11  10  00  01
   11     01  11  10  00        23     11  10  01  00

Encoding two bits per base pair achieves the maximum information density attainable. As discussed above, however, it sacrifices the freedom to choose among multiple redundant encodings to obtain desired chemical properties and reliable replication. Still, it's worth looking at how one can encode two bits per base, because there may be circumstances in which "double density DNA" encoding is feasible.

There are twenty-four possible ways to encode two bits of information in each base pair. Encoding 0 assigns values from binary 00 through 11 to the bases in alphabetical (arbitrary) order; the other encodings represent all permutations of the list of four two bit binary values. Since the number of permutations of n items is n!, there are a total of 4! or 24 two bit encodings.
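
The 24 permutations can be enumerated directly. The sketch below relies on Python's `itertools.permutations`, which yields results in lexicographic order when its input is sorted, so encoding 0 assigns 00 through 11 to A, C, G, and T in that order. The name `TWO_BIT_ENCODINGS` is only illustrative.

```python
from itertools import permutations

BASES = "ACGT"
VALUES = ("00", "01", "10", "11")

# Each permutation of the four two bit values, assigned to A, C, G, T in turn,
# is one encoding; the list index serves as the encoding number.
TWO_BIT_ENCODINGS = [dict(zip(BASES, p)) for p in permutations(VALUES)]

for n, table in enumerate(TWO_BIT_ENCODINGS):
    print(n, *(table[b] for b in BASES))
```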

Hide and Seek: An Image in the Genome

Swiss flag, colour image

To show how information of any kind can be encoded in DNA, let's work through a concrete example in excruciating detail. The information we'll encode is an image of the Swiss flag, shown above--since pharmaceuticals are a major industry in Switzerland, perhaps when custom-DNA organisms reach the market, they may have this image embedded in them, just as the watchmaking industry labels each of its products "Swiss Made". (Unlike the flags of many countries, the Swiss flag is square, not rectangular. This is a small country, and nothing must go to waste.)

Examining the image we're going to encode reveals two key properties: it contains only two colours, red and white, and the regularities in the image allow it to be compressed to an image only 18 pixels square. Since we're interested in the pattern, not the colour (which would take a great deal of additional information to describe to a reader of the message with whom we'd never previously communicated and who spoke a different language), we can express the pattern as a tiny 18×18 pixel black and white image like this:

Swiss flag, small black and white image

From Image to Bits

Having reduced the initial image to an 18×18 monochrome image, we can rewrite it as the following string of binary digits, with zeroes representing the black pixels and ones the white (this choice is arbitrary--the person who decodes the message may choose the opposite encoding, but will still discover the pattern).

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Of course, I've nicely formatted the bits into the rows and columns of the image, but as a bit stream they would have no such obvious structure. Instead, they would look something like:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
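
The step from grid to bit stream is a plain row-by-row flattening. As a sketch, the 18×18 pattern can be rebuilt from the positions of the two bars of the cross (row and column ranges read off the grid above, counting from zero) and then flattened; the function name `flag_pixel` is hypothetical.

```python
SIZE = 18

def flag_pixel(row, col):
    """1 for a white (cross) pixel, 0 for a black (background) pixel."""
    vertical = 2 <= row <= 15 and 7 <= col <= 10
    horizontal = 7 <= row <= 10 and 2 <= col <= 15
    return 1 if vertical or horizontal else 0

rows = [[flag_pixel(r, c) for c in range(SIZE)] for r in range(SIZE)]

# Row-major flattening: concatenate the rows, discarding the line structure.
bit_stream = [bit for row in rows for bit in row]

print(" ".join(map(str, bit_stream)))
```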

It would be up to the reader to discern the regular structure and reassemble it into rows and columns of the proper length. Researchers in the Search for Extraterrestrial Intelligence (SETI) believe an actual message in the form of an image would be coded as a rectangle, the length of each side of which was a prime number. The message transmitted in 1974 by the Arecibo radio telescope toward the globular cluster M13 in Hercules consisted of 1679 bits, the product of the prime numbers 23 and 73 which were, respectively, the number of columns and rows in the image. This message is described in greater detail in the document Self-decoding Messages. For simplicity, we'll dispense with prime number edge lengths and encode the 18×18 message as-is; the simplicity and symmetry of this message would, in practice, make the proper alignment easy to discover, as would the observation that the number of bits in the message was a square.
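
To see why prime edge lengths help, one can list every rectangle a message of a given length could fill. The sketch below (the function name `candidate_shapes` is hypothetical) shows that 1679 bits admit only the 23×73 arrangement and its transpose, while the 324 bit flag admits many more shapes, 18×18 among them.

```python
def candidate_shapes(n_bits):
    """All (columns, rows) pairs with columns * rows == n_bits, excluding 1-wide strips."""
    return [(w, n_bits // w) for w in range(2, n_bits) if n_bits % w == 0]

print(candidate_shapes(1679))   # the Arecibo message
print(candidate_shapes(324))    # the 18x18 flag
```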

From Bits to DNA...

Now we're ready to encode the bits representing the message as a sequence of nucleotides in a strand of DNA. First, we choose one of the forms of encoding described above. For this example, I'll encode one bit per base and use Encoding 6 as given in the table above. Encoding 6 prescribes that A or T nucleotides encode zeroes, and C or G encode ones. The choice of which to use for a given one or zero is arbitrary. Since complementary pairs are always A-T or C-G, we can think of Encoding 6 as coding the data in the choice of complementary pair, while ignoring which member of the pair is on which strand of the double helix. Chemically, this is an attractive way to encode information since it allows us to avoid a purine-rich sequence which could trigger transcription; to do so we simply make sure to include sufficient pyrimidines (thymine or cytosine) to be safe; since the information is encoded in the choice of complementary pair, their order is irrelevant to the coding and can be flipped where necessary to meet this constraint.

The result of our encoding, then, would be something like the following. I used the genome of an actual organism as a template to choose the sense of the base pairs, and the ones and zeroes to choose which pair appears at each position in the sequence.

TTTTATATATATAAATATAAAATATAAAAATTATTAAATATAAGGCCTTTAAAAATATATACCG
CAAATAATATTTATAGCCCTTTTTTTTATTTTTCCGCAATAAATAATTATAGCCCAATATTTTT
CGCGCGGGCGGCGCTATTGCGGCCGGCGCCCGTAAACGGCGCCCGCGCCCATTTGCCCGGGCCC
CGCCTTTTTAATTGGGCTTTTTAAATTAATTCGGCATTATTTAATATAACCGGAATAATTTTTA
TAAGGCCTTATTAAAATTTAACCGGTATTAATTAAAATTTAATTAATTTAAATTTAAATTTAAT
ATTA
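
A minimal sketch of this step under Encoding 6 follows. The text does not spell out exactly how the template genome selects the sense of each pair, so the rule used here (a purine in the template selects the purine of the pair, a pyrimidine selects the pyrimidine) is an assumption, as are the function names and the tiny template fragment.

```python
def encode_bits(bits, template):
    """Encode bits as bases under Encoding 6: 0 -> A or T, 1 -> C or G.

    The template base at each position only decides the sense: a purine
    (A or G) selects the purine of the pair, a pyrimidine (C or T) the
    pyrimidine. This selection rule is an assumption, not the author's.
    """
    if len(template) < len(bits):
        raise ValueError("template must be at least as long as the bit stream")
    out = []
    for bit, t in zip(bits, template):
        purine = t in "AG"
        if bit == 0:
            out.append("A" if purine else "T")
        else:
            out.append("G" if purine else "C")
    return "".join(out)

def decode_bases(sequence):
    """Recover the bits: A or T -> 0, C or G -> 1."""
    return [1 if base in "CG" else 0 for base in sequence]

bits = [0, 0, 1, 1, 0, 1]
template = "TATAGC"          # hypothetical template fragment
dna = encode_bits(bits, template)
print(dna, decode_bases(dna))
```

Whatever rule picks the sense, decoding is unaffected: only the choice between an A-T pair and a C-G pair carries the data.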

...Buried in a Genome

As you can see, the result of letting an existing organism choose whether the purine or pyrimidine appears has broken up much of the symmetry in the patterns of ones and zeroes. Further, when this sequence is spliced into the middle of the genome of an actual organism, there won't be any obvious marker to indicate where it starts or how long it is. Instead, you'll encounter something like the following, in the middle of a multi-million base genome.

TATATACATCATATGTACATTTATCGAGCCAATCGAGGGCAGCAGTTTAACATCAAGCCGGATT
TGCTCACGCTACTTTGACCCCTTTTCGTTTCGACGGAGAGAAGAAACCGGTGTTTTCCTATCCT
TGCCTATTCTTTCCTCCTTACGGGGTCCTAGCCTGTTTCTCTTGATATGATAATAGGTGGAAAC
GTAGAAAAAAAAATCGACATATAAAAGTGGGGCAGATACTTCGTGTGACAATGGCCAATTCAAG
CCCTTTGGGCAGATGTTGCCCTTCTTCTTTCTTAAAAAGTCTTAGTACGATTGACCAAGTCAGA
AAAAAAAAAAAAAAGGAACTAAAAAAAGTTTTAATTAATTATGAGAGCTTTGGCATATTTCAAG
AAGGGTGATATTCACTTCACTAATGATATCCCTAGGCCAGAAATCCAAACCGACGATGAGGTTA
TTTTTTATATATATAAATATAAAATATAAAAATTATTAAATATAAGGCCTTTAAAAATATATAC
CGCAAATAATATTTATAGCCCTTTTTTTTATTTTTCCGCAATAAATAATTATAGCCCAATATTT
TTCGCGCGGGCGGCGCTATTGCGGCCGGCGCCCGTAAACGGCGCCCGCGCCCATTTGCCCGGGC
CCCGCCTTTTTAATTGGGCTTTTTAAATTAATTCGGCATTATTTAATATAACCGGAATAATTTT
TATAAGGCCTTATTAAAATTTAACCGGTATTAATTAAAATTTAATTAATTTAAATTTAAATTTA
ATATTAATCGACGTCTCTTGGTGTGGGATTTGTGGCTCGGATCTTCACGAGTACTTGGATGGTC
CAATCTTCATGCCTAAAGATGGAGAGTGCCATAAATTATCCAACGCTGCTTTACCTCTGGCAAT
GGGCCATGAGATGTCAGGAATTGTTTCCAAGGTTGGTCCTAAAGTGACAAAGGTGAAGGTTGGC
GACCACGTGGTCGTTGATGCTGCCAGCAGTTGTGCGGACCTGCATTGCTGGCCACACTCCAAAT
TTTACAATTCCAAACCATGTGATGCTTGTCAGAGGGGCAGTGAAAATCTATGTACCCACGCCGG
TTTTGTAGGACTAGGTGTGATCAGTGGTGGCTTTGCTGAACAAGTCGTAGTCTCTCAACATCAC
ATTATCCCGGTTCCAAAGGAAATTCCTCTAGATGTGGCTGCTTTAGTTGAGCCTCTTTCTGTCA
CCTGGCATGCTGTTAAGATTTCTGGTTTCAAAAAAGGCAGTTCAGCCTTGGTTCTTGGTGCAGG
TCCCATTGGGTTGTGTACCATTTTGGTACTTAAGGGAATGGGGG

Searching for a Message

It's up to you, by wits or patience or both, to discover if an encoded message exists and, if so, how it was encoded and where it starts and ends. If, for example, the 1679 bit Arecibo SETI message were encoded somewhere in the human genome as one or two bits per nucleotide, to find it you'd have to try all 24 two bit and 14 non-void one bit encodings, scanning frames of 1679 bits starting at every position in the approximately 3 billion nucleotide human genome--a rather daunting prospect, but entirely feasible with a fast computer and some patience. But if, in addition, you don't know anything about the size and shape of the message or even whether one exists at all, the complexity of the problem is compounded enormously. Now there's no alternative other than developing computer tools, similar to those used by researchers seeking evidence of radio transmissions from intelligent extraterrestrials, which scan genomes automatically, using filters which identify regions with characteristics which resemble those of various types of encoded messages, flagging them for later, more detailed examination.
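
For the easier case, where the message bits are known in advance, the exhaustive search can be sketched as follows: decode the genome under each of the 14 one bit and 24 two bit encodings and look for the message at every base position. This is only an illustration on a toy scale; the function names are hypothetical, and decoding a whole 3 billion base genome into a Python string this way would be far too memory-hungry in practice.

```python
from itertools import permutations

BASES = "ACGT"

def all_encodings():
    """Yield (label, {base: bit string}) for the 14 + 24 candidate encodings."""
    for n in range(1, 15):                       # skip the void encodings 0 and 15
        yield f"1-bit #{n}", {b: str((n >> i) & 1) for i, b in enumerate(BASES)}
    for n, p in enumerate(permutations(("00", "01", "10", "11"))):
        yield f"2-bit #{n}", dict(zip(BASES, p))

def find_message(genome, message_bits):
    """Return (label, base offset) for every encoding and position that matches."""
    hits = []
    for label, table in all_encodings():
        width = len(table["A"])                  # bits per base: 1 or 2
        decoded = "".join(table[b] for b in genome)
        pos = decoded.find(message_bits)
        while pos != -1:
            if pos % width == 0:                 # must start on a base boundary
                hits.append((label, pos // width))
            pos = decoded.find(message_bits, pos + 1)
    return hits

genome = "GATTACA" + "ATCGGCTA" + "TTGACA"       # toy genome with a planted message
print(find_message(genome, "00111100"))
```

A short message such as this one will also turn up by coincidence under other encodings, which is a small-scale version of the false positive problem described above.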

Let's assume our encoded Swiss Flag message has been spliced somewhere into the 12 million base pair genome of Saccharomyces cerevisiae, baker's yeast--the first eukaryote to be completely sequenced. Automated examination of the genome has identified a tediously long list of sequences with somewhat unusual statistics under various kinds of encoding compared to the surrounding material, and we're now patiently checking them out with the Genome Browser. Note: the images which follow are pictures of the Genome Browser in operation, not Java applets--the buttons and text entry fields in these images are not "live". To run the Genome Browser on your own computer, please visit the Browser's home page.

We start the investigation of the flagged area by loading the organism's genome into the Browser and entering the region in which the unusual sequence was detected in the "Start" box. What appears, using the default settings for the other fields, is the usual TV-snow random-like pattern of bits.

First view of Genome Browser: Start 32000, Width 128, Variant 1

Now we use the "+" button to the right of the "Variant" box to step through the available variants to see if something interesting appears. We try variant 2.

Second view of Genome Browser: Variant 2

Nope. How about 3?

Third view of Genome Browser: Variant 3

Nothing there either. Having already spent many, many hours on futile searches, we plod onward through the variants. Upon reaching variant 6, the following appears.

Fourth view of Genome Browser: Variant 6

The pattern around the middle of the browser window is definitely interesting. It looks like black dots and white dashes of absolutely regular length, coherent over a substantial distance. Now we can try to home in on any possible two-dimensional structure, the certain hallmark of an encoded image. Using the "-" button to the left of the "Width" box, we display the region with successively smaller row lengths, using the scroll bar as needed to keep the apparently structured region within the window (as you narrow the row length, material in the window appears to move down since fewer bits now appear in the window). Upon reaching a row length of 90, the region seems to be transforming itself into a series of little 1950's-style flying saucers--interesting!

Fifth view of Genome Browser: Width 90

But still we continue to narrower rows. On arriving at 72 columns per row, we now have four well-defined saucers flying in formation!

Sixth view of Genome Browser: Width 72

Finally, after watching the pattern become better and better defined (and continuing to use the scroll bar to keep it on screen), we arrive at a row width of 20...

Seventh view of Genome Browser: Width 20

...19...

Eighth view of Genome Browser: Width 19

...and finally 18.

Ninth view of Genome Browser: Width 18

And there, at last, is the image we spliced into the genome. Note that we need both the correct row length and encoding to make sense of it--if we change the encoding variant to 10, for example, we see only gibberish without the slightest clue an image is hidden in a different encoding.

Tenth view of Genome Browser: Width 18, Variant 10--nothing but gibberish
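
The effect of the "Start", "Width", and "Variant" settings can be imitated in a few lines. The sketch below is not the Genome Browser's actual code: it assumes the variant number is the one bit encoding number from the table earlier in this document, and the function name `render` and the choice of '#' and '.' as display characters are arbitrary.

```python
def render(genome, start, width, variant, rows=24):
    """Return text rows of the decoded bits, '#' for a one and '.' for a zero."""
    # Assumed convention: bit i of the variant number is the bit for "ACGT"[i].
    bit_of = {b: (variant >> i) & 1 for i, b in enumerate("ACGT")}
    lines = []
    for r in range(rows):
        chunk = genome[start + r * width : start + (r + 1) * width]
        if not chunk:
            break                                # ran off the end of the genome
        lines.append("".join("#" if bit_of[b] else "." for b in chunk))
    return "\n".join(lines)

print(render("ATCGATCG", 0, 4, 6))
print(render("ATCGATCG", 0, 4, 10))
```

Stepping `width` and `variant` through their ranges and looking at the output is the manual search described above; the same bases give a different picture under every variant.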
