Thanks to the equipment now available, it has become increasingly fast and easy to read the full DNA of any living being. But even this technological progress has not made comparing genomes trivial when it comes to evaluating similarities between species and building a tree of life. “With the current methods, computers take a long time to compare the full genetic material of a set of more than twenty species,” says mathematician João Meidanis, from the Institute of Informatics (Instituto de Informática) at the State University of Campinas (Unicamp). Dissatisfied with the standard ways of searching for similarities that give the data some meaning, he and his student, Pedro Feijão, developed a new method of comparing genomes. In September, this new method will be subject to intense scrutiny, when it is presented to colleagues from all over the world attending the Workshop on Algorithms in Bioinformatics, in the United States.
The slow part of data analysis is not obtaining the sequences but comparing them, because each genome is represented by up to billions of letters in a single file (approximately 3 billion, in the case of human beings). The methods used to compare species rely on mathematical models of natural mutations. These mutations gradually substitute letters or cut the long chain, which then mends itself at another point. Simulating them is an exhausting task even for the most powerful computers.
The formula proposed by the two mathematicians simulates a situation in which the genome breaks at a single point and then joins randomly again. If this happens repeatedly, the genetic sequence is gradually shuffled. This is the origin of the term single-cut-or-join (SCJ), which is the name of the method. The process simulates the most common type of genetic rearrangement, in which a stretch of the DNA is inverted. If the comma in the previous sentence were the breaking point, the sentence could be rearranged as “tnemegnarraer citeneg fo epyt nommoc tsom eht setalumis ssecorp ehT, in which a stretch of the DNA is inverted” or “in which a stretch of the DNA is inverted, The process simulates the most common type of genetic rearrangement,” among other possibilities. “This is one of the most common forms of alterations in the genome,” explains Meidanis, “because the stretches remain intact and the genetic properties are maintained.” The program he developed makes a series of random cuts in the selected genome and determines the similarity with another genome according to the number of cuts necessary for the first genome to become identical to the second one. By comparing the genetic material of various species (with this method it is possible to compare up to 100 genomes in a few days), the program devised by Meidanis and Feijão produces a phylogenetic tree that shows the kinship between the living beings being compared.
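The idea of counting cuts and joins can be illustrated with a minimal Python sketch. It is not the researchers' program. It assumes one common formalization in which a genome is a set of adjacencies between the two ends (here labeled tail 't' and head 'h') of consecutive genes, and the distance is the number of adjacencies that must be cut plus the number that must be joined. The function names, the signed-integer encoding of genes and the toy genomes are all hypothetical choices made for this illustration.

```python
def adjacencies(chromosome, circular=False):
    """Return the set of adjacencies of one chromosome.

    Assumed encoding: genes are signed integers, so 3 is gene 3 read
    forward and -3 is the same gene inverted. Each gene has two ends,
    a tail ('t') and a head ('h').
    """
    ends = []
    for gene in chromosome:
        name = abs(gene)
        if gene > 0:
            ends.append(((name, "t"), (name, "h")))  # read tail -> head
        else:
            ends.append(((name, "h"), (name, "t")))  # inverted: head -> tail
    adj = set()
    # An adjacency links the right end of one gene to the left end of the next.
    for left, right in zip(ends, ends[1:]):
        adj.add(frozenset((left[1], right[0])))
    if circular and ends:
        adj.add(frozenset((ends[-1][1], ends[0][0])))
    return adj


def genome_adjacencies(genome, circular=False):
    """Union of the adjacencies of every chromosome in the genome."""
    result = set()
    for chromosome in genome:
        result |= adjacencies(chromosome, circular)
    return result


def scj_distance(genome_a, genome_b, circular=False):
    """Cuts needed to remove adjacencies found only in genome_a,
    plus joins needed to create adjacencies found only in genome_b."""
    a = genome_adjacencies(genome_a, circular)
    b = genome_adjacencies(genome_b, circular)
    return len(a - b) + len(b - a)


# Toy example: the second genome is the first with the stretch (2, 3) inverted.
original = [[1, 2, 3, 4]]
inverted = [[1, -3, -2, 4]]
print(scj_distance(original, inverted))
```

In this toy case the inversion breaks two adjacencies and creates two new ones, while the adjacency inside the inverted stretch survives, which mirrors the remark that the stretches themselves remain intact. Because the computation reduces to set differences, it stays cheap even when many pairs of genomes have to be compared.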
Debate
The paper was far from being unanimously accepted by the scientific committee that analyzed the papers submitted to the conference. “Two committee members felt that we were not presenting anything useful and three had no opinion,” says Meidanis. Rather than discouraging the researchers, the committee’s reaction spurred them on. To begin with, the paper was analyzed by five reviewers instead of the usual three. “It seems that they found it difficult to decide, but nonetheless maybe they accepted our paper because it was something new that could lead to a major debate,” he ponders.
They are prepared for the discussion. They have redone all the calculations to show that their proposal is mathematically distinct from the methods currently in use: the breakpoint method, used since the beginning of the 20th century, when population genetics first appeared, and double-cut-or-join, which is more recent. The older method is conceptually very similar to the model proposed by the researchers; the difference lies in the mathematical formalization. The more recent method considers that the genome is broken into three parts that rejoin randomly, an unnecessary complexity in Meidanis’s opinion. He believes the simplicity of his model makes the problems easier to solve and might even bring the solution closer to reality, and he cites physicist Albert Einstein: “He said that everything should be considered in the simplest possible way, but no simpler than this.”
The next step entails discussions and collaborations with geneticists, which will be essential to evaluate whether the mathematical model simplifies more than nature does. In the meantime, Meidanis and Feijão have tested the model with a data set that the bioinformatics community uses to test new methods. By comparing the shape of the trees obtained by the program, they verified that their method achieves results similar to those obtained by the other methods, but in a much shorter time.
Even before the discussion this month and the formal publication of the paper, a German research group has already shown interest in receiving the final version. This is another indication, for the professor from Unicamp, that his is an innovative proposal.
Scientific article
FEIJÃO, P. and MEIDANIS, J. SCJ: a variant of breakpoint distance for which sorting, genome median and genome halving problems are easy. 9th Workshop on Algorithms in Bioinformatics. 2009.