Center for Strategic Assessment and Forecasts

Autonomous non-profit organization

A breakthrough in information storage: scientists have encoded "The Wizard of Oz" in DNA
Posted by: Administrator. Publication date: 11-08-2020

Synthetic DNA as a high-density storage medium has interested digital futurists for many years. The entire Internet could be encoded in DNA sequences that would fit inside a shoebox, and the DNA molecule is so stable that it can be stored for tens or even hundreds of thousands of years. For example, in 2013 scientists sequenced the entire genome of a fossil horse more than 700,000 years old.

The whole difficulty lies in translating a huge number of bytes, the standard unit of data designed for linear, sequential storage such as RAM and hard drives, into the coiled nanoscale structure of DNA. Translating from one data format to the other is far from easy.

The first such set of algorithms for encoding and decoding DNA has been developed by William Press's group at the University of Texas at Austin. It could give impetus to new areas of long-term, high-density data storage. In significance, their work is comparable to the development of the BB84 protocol in 1984, which launched quantum cryptography, and one day it may become the basis for an entire genomic data storage sector, in which we will most likely be handling volumes of petabytes per gram of material.
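To put "petabytes per gram" in perspective, here is a rough back-of-the-envelope estimate of the theoretical ceiling. It is only a sketch under stated assumptions: the average nucleotide mass of about 330 g/mol for single-stranded DNA and the ideal figure of two bits per nucleotide are assumptions introduced here, not numbers from the researchers.

```python
# Rough upper bound on DNA storage density (illustrative assumptions only).
AVOGADRO = 6.02214076e23           # molecules per mole (exact SI value)
NUCLEOTIDE_MASS_G_PER_MOL = 330.0  # assumed average for single-stranded DNA
BITS_PER_NUCLEOTIDE = 2.0          # ideal case: four bases carry two bits


def max_bytes_per_gram(bits_per_nucleotide: float = BITS_PER_NUCLEOTIDE) -> float:
    """Return the theoretical number of bytes one gram of DNA could hold."""
    nucleotides_per_gram = AVOGADRO / NUCLEOTIDE_MASS_G_PER_MOL
    return nucleotides_per_gram * bits_per_nucleotide / 8


petabytes = max_bytes_per_gram() / 1e15
print(f"{petabytes:,.0f} PB per gram (theoretical upper bound)")
```

Any practical system will land well below this ceiling, because error correction, addressing and multiple physical copies of each strand all consume capacity.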

Stephen Jones, a postdoc in Press's group and a co-author of the paper describing their research in the journal Proceedings of the National Academy of Sciences, says the best place to start is to understand where errors normally occur in data storage. In traditional devices such as hard drives and flash memory, bit flips and erasures are the worst enemies of zeros and ones.

"Over decades of work with bits, we have developed great algorithms for finding and correcting these types of errors," Jones said. "But DNA is fundamentally different."

To make a workable DNA data storage standard, you instead need to worry about substitutions, insertions and deletions. The first type of error is similar to a bit flip, the equivalent of a 0 becoming a 1 or a 1 becoming a 0, and it is easily detected and corrected using the good old Reed-Solomon code.

The Reed-Solomon code is good at recovering lost pixels and correcting pixels with false colors.

But in this case the problem is that DNA has four nucleotides, adenine (A), guanine (G), cytosine (C) and thymine (T), and each of them can be substituted for any other, which greatly increases the number of potential errors and complicates tracking and correcting them. The remaining two classes of errors are, as their names suggest, cases in which base pairs are inserted into or removed from the strand.

And the most fundamental and annoying point is that DNA offers no reliable natural way to know whether a strand of nucleotides that has been read contains any substitution, insertion or deletion errors. There is no such thing as a readable, countable "memory register" in DNA. Each base pair is simply the next nucleotide in a long sequence, and all together they form just another strand of DNA.
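The difference between the three error classes can be illustrated with a small sketch. It assumes a naive mapping of two bits per nucleotide, chosen here purely for illustration and not the scheme the researchers use. A substitution damages a single byte, while one deletion or insertion shifts every nucleotide after it, and nothing in the strand itself reveals where the shift began.

```python
# Naive mapping, assumed for illustration only: two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a nucleotide string back into bytes, dropping any incomplete byte."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def count_wrong_bytes(original: bytes, recovered: bytes) -> int:
    """Count positions that differ, treating missing or extra bytes as wrong."""
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


message = b"The Wizard of Oz"
strand = encode(message)

# Damage the strand at position 5 in three different ways.
substituted = strand[:5] + ("A" if strand[5] != "A" else "C") + strand[6:]
deleted = strand[:5] + strand[6:]
inserted = strand[:5] + "G" + strand[5:]

for name, damaged in [("substitution", substituted),
                      ("deletion", deleted),
                      ("insertion", inserted)]:
    print(name, count_wrong_bytes(message, decode(damaged)), "wrong bytes")
```

With this naive mapping, a code designed only for flipped symbols has little to work with after an insertion or deletion, which is why a code built for DNA has to treat all three error types as first-class cases.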

The relative nature of data storage in DNA is, in fact, the key to the HEDGES protocol of Press, Jones and their co-authors (the name stands for Hash Encoded, Decoded by Greedy Exhaustive Search). No single isolated nucleotide in their protocol contains useful data. Rather, it is groups of nucleotide sequences that together provide a reliable data storage system, one with which the scientists expect to achieve high storage density in DNA that remains intact for a very long time.

The group used L. Frank Baum's book "The Wizard of Oz", translated into Esperanto, a simple constructed language, as the sample data to store. Synthetic DNA these days, according to Jones, usually comes in strands of a hundred or so base pairs. This was the basis of their "hard disk".

The principle of encoding and decoding in the HEDGES protocol.

So they built the HEDGES protocol, which can split incoming information into thousands or millions of short sequences of around a hundred nucleotides, each of which contains the data needed to rebuild the source text even in the presence of an unknown number of substitution, insertion and deletion errors, which were added deliberately to make the experiment rigorous.
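The splitting step can be pictured with a minimal sketch. The field sizes below (`ADDRESS_BYTES`, `PAYLOAD_BYTES`) and the helper names are hypothetical, picked so that one packet comes out near a hundred nucleotides when each nucleotide carries one bit. The actual packet layout used by the researchers is not described in this article.

```python
ADDRESS_BYTES = 2   # hypothetical: index field at the start of every strand
PAYLOAD_BYTES = 10  # hypothetical: message bytes carried by each strand


def split_into_packets(data: bytes) -> list[bytes]:
    """Cut data into fixed-size payloads, each prefixed with its index."""
    packets = []
    for index, start in enumerate(range(0, len(data), PAYLOAD_BYTES)):
        # Pad the last payload with zero bytes; the true length must be
        # stored separately so the padding can be removed later.
        payload = data[start:start + PAYLOAD_BYTES].ljust(PAYLOAD_BYTES, b"\x00")
        packets.append(index.to_bytes(ADDRESS_BYTES, "big") + payload)
    return packets


def nucleotides_per_strand(bits_per_nucleotide: int) -> int:
    """Length of one strand for a given information rate (an assumption)."""
    return (ADDRESS_BYTES + PAYLOAD_BYTES) * 8 // bits_per_nucleotide


text = b"The Wizard of Oz " * 4
packets = split_into_packets(text)
print(len(packets), "packets of", nucleotides_per_strand(1), "nucleotides each")
```

Because every packet carries its own index, the strands can be stored and read in any order, which matters in a medium that has no physical notion of "first" or "next".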

Thus, encoding "The Wizard of Oz" in DNA involved passing the data through an "outer" and an "inner" layer of coding. Think of these steps as two separate algorithms within a complex cryptographic standard.

The outer layer spread the source data diagonally, so that any particular DNA strand contained fragments from many parts of the message. The inner layer, the HEDGES protocol itself, then converted each bit into A, G, C or T according to an algorithm that depends on the value of that bit, its position in the data stream, and the bits immediately preceding it.
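Both layers can be sketched in a few lines. This is an illustration of the two ideas, not the published algorithm. The diagonal layout, the SHA-256 hash, the `HISTORY_BITS` window and all function names are assumptions made for the example, and the error-correcting search that gives HEDGES its name is left out entirely.

```python
import hashlib

BASES = "ACGT"
HISTORY_BITS = 8  # hypothetical: how many preceding bits feed the hash


def diagonal_interleave(blocks: list[bytes]) -> list[bytes]:
    """Outer layer sketch: strand s takes column c from block (s + c) mod n."""
    n, width = len(blocks), len(blocks[0])
    return [bytes(blocks[(s + c) % n][c] for c in range(width)) for s in range(n)]


def diagonal_deinterleave(strands: list[bytes]) -> list[bytes]:
    """Undo diagonal_interleave."""
    n, width = len(strands), len(strands[0])
    return [bytes(strands[(b - c) % n][c] for c in range(width)) for b in range(n)]


def keystream(position: int, history: tuple[int, ...]) -> int:
    """Hash the position and the preceding bits to a value from 0 to 3."""
    material = f"{position}:{''.join(map(str, history))}".encode()
    return hashlib.sha256(material).digest()[0] % 4


def encode_bits(bits: list[int]) -> str:
    """Inner layer sketch: one nucleotide per bit, chosen by bit, position, history."""
    strand = []
    for position, bit in enumerate(bits):
        history = tuple(bits[max(0, position - HISTORY_BITS):position])
        strand.append(BASES[(keystream(position, history) + bit) % 4])
    return "".join(strand)


def decode_bits(strand: str) -> list[int]:
    """Decode an error-free strand; raise if a nucleotide fits neither bit value."""
    bits: list[int] = []
    for position, base in enumerate(strand):
        history = tuple(bits[max(0, position - HISTORY_BITS):position])
        offset = (BASES.index(base) - keystream(position, history)) % 4
        if offset not in (0, 1):
            raise ValueError(f"inconsistent nucleotide at position {position}")
        bits.append(offset)
    return bits


blocks = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
print(diagonal_interleave(blocks))

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
strand = encode_bits(bits)
print(strand, decode_bits(strand) == bits)
```

Because each nucleotide depends on the bits before it, a wrong guess during decoding quickly produces nucleotides that fit neither bit value. A decoder can use that signal to back up and try a substitution, insertion or deletion hypothesis instead, which is the part this sketch omits.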

Then, once the book is fully translated into the language of nucleotides, it is ready to be written into strands of synthetic DNA. After encoding, the strands were placed in storage, where, according to Jones, his job was to artificially age the genetic information by subjecting the samples to heat and cold and trying to induce mutations in the DNA biochemically.

"I messed up the DNA," he said. "And then we looked at whether we would still be able to read the book." The answer was yes. It showed how robust DNA is. "We had to work very hard to try to spoil the DNA," says Jones. "Of course, this is easier to do if you have 10,000 years in which you can bury the data in the earth or in outer space or something like that. But of course, we had to speed up the process."

The chemical structure of DNA.

Decoding the data from their DNA store involved first sequencing the "genome" of "The Wizard of Oz" and then translating that genetic data back into bits. After that, all that remains is to work out which bits are the "address" and use them to assemble the remaining information bits back into a single data file.
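The final reassembly step might look roughly like the sketch below. It assumes that each decoded read starts with a fixed-width address field. The field width, the four-byte payloads and the helper names are hypothetical and stand in for whatever layout the researchers actually use.

```python
ADDRESS_BYTES = 2  # hypothetical width of the address field


def make_packet(address: int, payload: bytes) -> bytes:
    """Build a packet as address field followed by payload."""
    return address.to_bytes(ADDRESS_BYTES, "big") + payload


def reassemble(reads: list[bytes], packet_count: int, message_length: int) -> bytes:
    """Order decoded reads by address and join their payloads into one file."""
    payload_by_address: dict[int, bytes] = {}
    for read in reads:
        address = int.from_bytes(read[:ADDRESS_BYTES], "big")
        # Sequencing can return several copies of a strand; keep the first seen.
        payload_by_address.setdefault(address, read[ADDRESS_BYTES:])
    missing = [a for a in range(packet_count) if a not in payload_by_address]
    if missing:
        raise ValueError(f"no read recovered for addresses {missing}")
    ordered = b"".join(payload_by_address[a] for a in range(packet_count))
    return ordered[:message_length]


message = b"The Wizard of Oz"
packets = [make_packet(i, message[i * 4:(i + 1) * 4]) for i in range(4)]

# Reads arrive in arbitrary order and may contain duplicates.
reads = [packets[2], packets[0], packets[3], packets[1], packets[2]]
print(reassemble(reads, packet_count=4, message_length=len(message)))
```

The address field is what lets an unordered pool of molecules behave like a file: order is restored in software after sequencing rather than being preserved physically.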

John Hawkins, a co-author of the new research, added that one of the most attractive features of their new protocol is its resilience to changes in technology and data formats over the centuries.

"Reading DNA will never become obsolete," he said. "Storing data is only half the problem. You still have to be able to read it in the future. And DNA is uniquely future-proof on this front, because we are made of it. As long as people are made of DNA, we will always need devices that can read it."

Morozov Egor


Tags: science
