The mathematical structure of DNA : Revista Pesquisa Fapesp

Scientific articles by a group of Brazilian researchers from the University of Campinas (Unicamp) and the University of São Paulo (USP) show that genetic sequences can have the same mathematical structure as the error-correcting codes (ECCs) used in both broadcast and digital recording systems. ECCs are sets of instructions built into the software installed in computer chips, telecommunications equipment, televisions and smartphones to correct defects in digital information in such processes as telephone conversations or the storage of data on a computer’s hard disk.

The same mathematical logic, say the researchers, is found in the formation of DNA, the deoxyribonucleic acid that carries the genes and all the instructions for the development and survival of living beings. In the study, they compared the algebraic equations of error-correcting codes with certain DNA sequences, assigning a numerical value to each of the nucleotides that make up the genome: thymine (T), guanine (G), cytosine (C) and adenine (A). In doing so, they found patterns that link each nucleotide to a number. Depending on the type of sequence, for example, A is represented by 0, C by 2, G by 1 and T by 3. In digital language, which consists of bits, information is encoded as 0s and 1s. “We have shown that DNA has sequences that follow the same mathematical structures and rules as digital communication,” says Márcio de Castro Silva Filho, from the Genetics Department of the Luiz de Queiroz School of Agriculture (ESALQ) at USP. “The DNA sequence is not random; it follows a pattern,” he says.
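As a rough illustration of the labeling described above, the following Python sketch converts a DNA string into the numbers 0 to 3 and then into pairs of bits. Only the mapping A=0, C=2, G=1, T=3 comes from the text, which notes that it depends on the type of sequence. The function names and the two-bits-per-label encoding are assumptions made for this example and are not the researchers' tool.

```python
# Labeling quoted in the article (one of several possible labelings).
NUCLEOTIDE_TO_LABEL = {"A": 0, "C": 2, "G": 1, "T": 3}
LABEL_TO_NUCLEOTIDE = {label: base for base, label in NUCLEOTIDE_TO_LABEL.items()}


def dna_to_labels(sequence):
    """Turn a DNA string such as 'TGA' into a list of numbers 0..3."""
    return [NUCLEOTIDE_TO_LABEL[base] for base in sequence.upper()]


def labels_to_dna(labels):
    """Inverse of dna_to_labels."""
    return "".join(LABEL_TO_NUCLEOTIDE[label] for label in labels)


def labels_to_bits(labels):
    """Assumed encoding: write each label as two bits (0 -> '00', 3 -> '11')."""
    return "".join(format(label, "02b") for label in labels)


labels = dna_to_labels("TGA")
print(labels)                  # [3, 1, 0]
print(labels_to_bits(labels))  # 110100
print(labels_to_dna(labels))   # TGA
```

Four symbols need exactly two bits each, which is why a four-letter alphabet fits naturally into the arithmetic used by digital codes.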

The group’s most recent study was published in the journal Scientific Reports, from the publishers of Nature, in July 2015. The introduction states that biological and digital communication systems have similar procedures for transmitting information from one point to another. According to the researchers, the information contained in DNA is copied (transcribed) into RNA, which follows this mathematical logic in directing the ordering of amino acids in the proteins required for cell function. In the study, the researchers presented a computational tool to better understand the evolutionary path of the genetic code by analyzing, for example, Arabidopsis thaliana, a plant widely used as a model organism in genetic studies, and the arrangement of nucleotides in groups of three letters called codons. In rare cases a biological codon – TGA, for example – differed from the result produced by the ECC.

Figure (Sandro Castelli): The letters and numbers in red indicate mutations in the genetic sequence.

In presenting the problem at the Brazilian Conference of Genetics in 2011, Silva Filho fielded a question from biologist Everaldo Barros of the Catholic University of Brasília that helped him find a way forward. Barros wanted to know whether the alteration in a DNA codon of the sweet potato (Ipomoea batatas) pointed to an ancestral code. Silva Filho and electronics engineer Reginaldo Palazzo Júnior, of the School of Electrical Engineering and Computer Sciences (FEEC) at Unicamp, the group’s other coordinator, set out to find an answer. Working together with doctoral candidates Luzinete Cristina Bonani Faria and Andréa Santos Leite da Rocha, they showed that the difference between the sequence generated by the error-correcting code and the biological sequence is a mutation. The present-day codon does not match the mathematical equations, which instead reproduce a primordial form of the sweet potato sequence, one found in older organisms such as prymnesiophyte algae and in ancestral mitochondrial variants of the genetic code. Mitochondria are cell organelles that show traces of more remote genetic material. Therefore, only the oldest DNA is part of the equation.

“The gene sequence that encodes the delta subunit of the F1-ATPase protein of the sweet potato presents the TGG codon, which encodes the amino acid tryptophan. However, the sequence generated by the mathematical code for the tryptophan codon was TGA, which would introduce a stop in the protein synthesis, impairing its function. Initially, the alteration generated by the mathematical code would be incorrect,” says Silva Filho. “When we determined that the amino acid tryptophan is ancestral and encoded by the TGA codon, everything came together and we were then able to understand that a mutation had occurred,” says Palazzo Júnior. This type of mutation had already been recognized through biochemical methods, but had never before been identified through a mathematical process.
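The discrepancy described above comes down to a single letter, and a short Python sketch can make that concrete. The two lookup tables below are deliberately partial and hypothetical in name: they cover only the two codons mentioned in the text. The standard genetic code reads TGA as a stop signal, while several mitochondrial variants of the code read it as tryptophan. This is an illustration, not the group's analysis method.

```python
# Partial codon tables, restricted to the two codons discussed in the text.
STANDARD_CODE = {"TGG": "Trp", "TGA": "Stop"}
# In several mitochondrial variants of the genetic code, TGA encodes tryptophan.
VARIANT_CODE = {"TGG": "Trp", "TGA": "Trp"}


def differing_positions(seq_a, seq_b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


biological = "TGG"  # codon found in the sweet potato gene
generated = "TGA"   # codon produced by the mathematical code

print(differing_positions(biological, generated))          # [2]
print(STANDARD_CODE[generated], VARIANT_CODE[generated])   # Stop Trp
```

Read with the standard table, the generated codon looks like an error. Read with a variant table, both codons give the same amino acid, which is consistent with the researchers' reading of TGA as the ancestral form.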

The researchers are now working on a phylogenetic study to learn more about the evolution of species from the mathematical and biological standpoints. They are analyzing genetic sequences to determine whether the mutations found give individuals characteristics that are important for the functioning of the species. Current studies are being carried out on plant and animal genomes to confirm whether the mathematical model is in fact closely related to the biological model.

The discovery has led the group to file for an international patent on the utility model of the system they developed, already patented in the United States. “This mathematical structure may be important in the field of protein engineering for developing genetically modified organisms, new drugs, vaccines and altering the DNA sequence in future gene therapy systems, or even producing and discovering new proteins from the mathematical code,” explains Silva Filho, an agronomist who holds master’s and doctoral degrees in genetics and molecular biology and specializes in protein transport.

It would also be possible, in a treatment for diabetes, for instance, to study the genes linked to the disease through a mathematical structure and correct the genes to eliminate the problem. Silva Filho predicts that the pharmaceutical industry will benefit greatly from this new way of envisioning DNA because use of the mathematical code will facilitate both the understanding of the disease and the formulation of drugs that are capable of more specifically targeting it.

Alterations in sequence
Mathematicians and computer scientists know the code used by the Brazilian researchers by the letters BCH, the initials of the Indian-born mathematicians Raj Chandra Bose and Dwijendra Kumar Ray-Chaudhuri and the French mathematician Alexis Hocquenghem, who invented the code in 1959 and 1960. BCH is only one of several existing error-correcting codes. By using this code, biologists, biochemists and pharmacists, perhaps in collaboration with mathematicians, could conduct preliminary analyses on computer-generated sequences to test the alteration of amino acids, proteins and mutations and then go to the laboratory to determine whether the results are correct. “The existence of a mathematical structure in DNA sequences implies an enormous albeit feasible computational complexity in carrying out analyses and predicting mutations,” says Palazzo Júnior, who is an electronics engineer and works in the fields of information and coding theory. Today, the alteration process used to produce a genetically modified organism or a medication is carried out through extensive laboratory tests. The function of the mathematical code in biotechnological processes will be to minimize the occurrence of errors in the cell nucleus after genetic transcription of DNA to RNA, the ribonucleic acid that directs protein synthesis in ribosomes.
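To give a feel for what it means for a sequence to "follow" such a code, here is a minimal Python sketch of a toy cyclic code over the four-element field GF(4). BCH codes belong to the family of cyclic codes, in which a word is valid when its polynomial is divisible by a fixed generator polynomial. Everything specific here is an assumption chosen for illustration: the length-3 code, the generator g(x) = (x + 1)(x + α), and the use of labels 2 and 3 for α and α². It is not the code or the parameters used by the researchers, and no link between these labels and nucleotides is implied.

```python
# GF(4) = {0, 1, a, a^2} encoded as 0, 1, 2, 3, with a^2 = a + 1.
# Addition is bitwise XOR; multiplication uses the table below.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]
GF4_INV = {1: 1, 2: 3, 3: 2}  # multiplicative inverses of the nonzero elements

# Toy generator polynomial g(x) = (x + 1)(x + a) = x^2 + a^2 x + a,
# written with the highest-degree coefficient first. It divides x^3 + 1.
GENERATOR = [1, 3, 2]


def poly_mod(dividend, divisor):
    """Remainder of polynomial division over GF(4), highest degree first."""
    rem = list(dividend)
    steps = max(0, len(rem) - len(divisor) + 1)
    lead_inv = GF4_INV[divisor[0]]  # divisor must have a nonzero leading term
    for i in range(steps):
        factor = GF4_MUL[rem[i]][lead_inv]
        if factor:
            for j, coeff in enumerate(divisor):
                rem[i + j] ^= GF4_MUL[factor][coeff]
    return rem[steps:]


def is_codeword(word, generator=GENERATOR):
    """A word belongs to the cyclic code when g(x) divides it exactly."""
    return not any(poly_mod(word, generator))


print(is_codeword([1, 3, 2]))  # True: the generator itself
print(is_codeword([2, 1, 3]))  # True: the generator multiplied by a
print(is_codeword([1, 3, 3]))  # False: last symbol changed
print(poly_mod([1, 3, 3], GENERATOR))  # [0, 1]
```

A nonzero remainder signals that the word does not satisfy the code's equations, which is the kind of mismatch that, in the study, drew attention to an individual codon.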

The potential association between error-correcting codes and DNA sequences is not entirely new. One of the first scholars on the subject was Professor Hubert Yockey, who has been working in the field since the 1980s at the University of California, Berkeley. Another researcher in the field is Gérard Battail, a retired professor from France’s National Superior School of Telecommunications who has published several articles proposing a relationship between error-correcting codes and genomes. These scientists have described the process and proposed hypotheses but have not yet presented actual mathematical relationships with DNA. The Brazilians have been able to establish this relationship in protein-producing genetic sequences. “By understanding the mathematical structure of the protein-encoding gene, we can alter the order of the bases as well as correct any mutations or errors that could appear for it to revert to its original protein condition,” says Silva Filho.

The initial study came about in 2008, when Palazzo Júnior challenged the two doctoral candidates mentioned above to model the transmission of information, in this case proteins, between the cell nucleus and the mitochondria. To do this, Faria and Leite da Rocha sought out Márcio de Castro Silva Filho at ESALQ. They established a dialogue, and the two began testing some of the mathematical models of communications systems in order to find the one best suited to the biological model. After several months, they revealed their findings to Silva Filho. At first, he thought that the match between the sequences generated by the ECC and the biological model with regard to the amino acids was just a coincidence. As the research progressed, more DNA sequences were obtained from different living things and the results held, regardless of the species. Assisting in the discovery were doctoral candidate João Henrique Kleinschmidt, a computer engineer and now a professor at the Federal University of ABC (UFABC), and more recently, biologist Larissa Spoladore, a doctoral candidate at ESALQ, and biologist Marcelo Brandão, a professor at Unicamp.

In 2009, Silva Filho, Palazzo Júnior, Faria and Leite da Rocha submitted an article to the journal Electronics Letters, which was published in the February 2010 issue (see Pesquisa FAPESP Issue nº 178). “Now, with the publication in Scientific Reports, we think the global biological sciences community will become more interested,” says Silva Filho. “As far as we know from the literature available, no other group is conducting research on this, although there might be someone in the pharmaceutical industry developing something like this privately.”

“As in the case of many other scientific discoveries, there is probably a long road ahead before this is accepted and used. Clearly, they have made a huge leap and shifted the paradigm,” says biologist Rogério Margis, a professor in the Biotechnology Center at the Federal University of Rio Grande do Sul (UFRGS). “New challenges will likely appear with the discovery of this pattern, which transcends the linear sequence of the bases and adds another layer of complexity and code pattern to the DNA molecule. Expanding this type of analysis will require extensive computational infrastructure,” Margis notes. “Up to now, the studies have not had the impact and repercussions the researchers had expected them to have within the scientific community. One problem is that the study, while unique, encompasses separate fields such as biology and mathematics that do not typically work together,” he says.

“I’ve presented the studies at events abroad, but I think there is a certain level of distrust for a number of reasons. The subject is extremely complex, few people are able to go back and forth between the fields of genetics and error-correcting codes, the group is made up of Brazilians and the 2010 study was published in a journal in the field of electrical engineering,” Silva Filho explains. He thinks that increased interest in the studies has to come from people involved in molecular biology and biotechnology. On the mathematics side, interest would have to come from groups involved in information theory and communication. But this will only take place if multidisciplinary integration occurs as it did in the initial discovery.

Projects
1. Mathematical code to generate and decode DNA sequence and proteins: its use in the identification of ligands and receptors (nº 2008/04992-0); Grant Mechanism Program of Support of Intellectual Property Rights (PAPI); Principal Investigator Márcio de Castro Silva Filho (USP); Investment R$ 13,200.00 and US$ 20,000.00.
2. Herbivory and intracellular transport of proteins (nº 2008/52067-3); Grant Mechanism Thematic Project; Principal Investigator Márcio de Castro Silva Filho (USP); Investment R$ 1,392,217.77 and US$ 169,187.06.
3. System biology techniques applied to the agriculture: transcriptomes and interactomes analyses (nº 2011/00417-3); Grant Mechanism Young Investigators in Emerging Institutions grant; Principal Investigator Marcelo Mendes Brandão (Unicamp); Investment R$ 199,169.39 and US$ 3,846.15.

Scientific articles
BRANDÃO, M. M., et al. Ancient DNA sequence revealed by error-correcting codes. Scientific Reports. V. 5, No. 12051. July 2015.
FARIA, L. C. B., et al. Transmission of intra-cellular genetic information: A system proposal. Journal of Theoretical Biology. V. 358, p. 208-31. Oct. 2014.
FARIA, L. C. B., et al. Is a Genome a Codeword of an Error-Correcting Code? PLOS ONE. V. 7, No. 5, e36644. May 2012.
FARIA, L. C. B., et al. DNA sequences generated by BCH codes over GF(4). Electronics Letters. V. 46, No. 3, p. 202-3. Feb. 2010.
