Bits, bytes and genes : Revista Pesquisa Fapesp

ESTweb, ZERG and Sabiá: programs born of sequencing projects (image: Eduardo Cesar)

In September, two teams of Brazilian researchers published articles in international scientific journals about the genomes (sets of genes) of two organisms: the parasite Schistosoma mansoni, which causes schistosomiasis in Brazil, and the bacterium Chromobacterium violaceum, which is abundant in the Negro river and has potential for biotechnological use. Although they worked independently, with different organisms and methodologies, both teams developed computer programs that organized the data published in these articles and made it easier to obtain.

Two programs came out of the Bioinformatics Laboratory of the Chemistry Institute of the University of São Paulo (IQ-USP), which took part in the project on the worm that causes schistosomiasis. Called ESTweb and ZERG, they are now available for free download online. A third tool, Sabiá, was conceived at the National Scientific Computing Laboratory (LNCC), in Petrópolis, which served as the bioinformatics hub of the project that studied the genes of C. violaceum. For the moment, use of the system is restricted to the 25 laboratories of the national network that sequenced the bacterium’s genome, but it should shortly be opened up to all those interested.

Clean sequences
Each of these programs carries out very specific tasks and serves a particular purpose. ESTweb, the subject of an article published in the journal Bioinformatics on August 12, 2003, receives and processes fragments of active genes generated from an organism’s tissues and stores them in a database. These pieces of genes are called ESTs, an acronym for expressed sequence tags, which inspired the program’s name. ESTweb removes from the gene fragments all the elements that are unnecessary for analyzing the sequence, yielding cleaner ESTs. “The program generates, in real time, graphs that show the quality and degree of redundancy of the sequences produced by the laboratories”, comments Sergio Verjovski-Almeida, from USP’s Chemistry Institute, who coordinates the FAPESP-financed project, which has identified 92% of S. mansoni’s expressed genes.
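The article does not say which elements ESTweb strips out or how it measures redundancy. As a rough illustration only, the Python sketch below applies two cleaning steps that are common in EST processing (removing a known cloning-vector prefix and trimming a trailing poly-A tail) and computes a simple redundancy figure as the share of exact duplicates. The function names, the vector sequence and the thresholds are all hypothetical and are not taken from ESTweb.

```python
def strip_vector_prefix(seq: str, vector: str) -> str:
    """Drop a known vector sequence if the read starts with it (hypothetical step)."""
    if vector and seq.startswith(vector):
        return seq[len(vector):]
    return seq


def trim_poly_a(seq: str, min_run: int = 8) -> str:
    """Remove a trailing run of A's, but only if it is at least min_run long."""
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq


def clean_est(seq: str, vector: str = "", min_run: int = 8) -> str:
    """Upper-case the read, then strip the vector prefix and the poly-A tail."""
    seq = seq.upper()
    return trim_poly_a(strip_vector_prefix(seq, vector.upper()), min_run)


def redundancy(seqs: list[str]) -> float:
    """Fraction of reads that exactly duplicate another read in the list."""
    if not seqs:
        return 0.0
    return 1 - len(set(seqs)) / len(seqs)


if __name__ == "__main__":
    raw = "GGATCC" + "ATGCCGT" + "A" * 10   # made-up read: vector + insert + tail
    print(clean_est(raw, vector="GGATCC"))
    print(redundancy(["AC", "AC", "GT", "TT"]))
```

A production pipeline would also have to deal with base-quality scores and inexact vector matches, which this sketch ignores.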

The second creation of the São Paulo team is a tool of a more analytic nature. “ZERG interprets the output from BLAST”, says biologist Eduardo Reis, one of the inventors of the software, resorting to the jargon of bioinformatics. BLAST is a program in the public domain, popular among molecular biologists and other professionals who work with genes and proteins. Its function is to compare any EST with the genetic sequences deposited in public databases. In this way, researchers discover whether their ESTs are the same as, or similar to, others already known, and in many cases manage to associate these sequences with genes of defined function. Although it is very useful, BLAST has one small problem: in major undertakings, like the S. mansoni project, it generates a report a mile long that is difficult to understand and contains much data that needs checking.

Untangling this report is not a task for human beings, but for another piece of software. “There are commercial programs that read the response from BLAST, but not with the same precision and speed as ZERG”, says programmer Apuã Paquola, from the IQ-USP Bioinformatics Laboratory. In an article published in Bioinformatics on May 22, 2003, the authors of ZERG, whose name was borrowed from a computer game, showed that their invention is up to 250 times quicker than its rivals.
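To give a sense of what reading a BLAST report by program involves, here is a minimal Python sketch. It is not ZERG and does not reproduce its method. It assumes BLAST's tab-separated output with the 12 default columns (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score), and it keeps, for each EST, only the best-scoring hit below an e-value cutoff. The sequence names and the cutoff are invented for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str       # the EST that was searched
    subject: str     # the database sequence it matched
    identity: float  # percent identity of the alignment
    evalue: float    # expected number of chance matches this good
    bitscore: float  # normalized alignment score


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data line of 12-column tab-separated BLAST output."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column record
        yield Hit(fields[0], fields[1], float(fields[2]),
                  float(fields[10]), float(fields[11]))


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5) -> dict[str, Hit]:
    """Keep the highest-scoring hit per query, ignoring weak matches."""
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


if __name__ == "__main__":
    report = [  # invented records, for illustration only
        "est1\tgeneA\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-50\t350",
        "est1\tgeneB\t80.0\t150\t30\t0\t1\t150\t5\t154\t1e-10\t120",
        "est2\tgeneC\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t30",
    ]
    for query, hit in best_hits(parse_tabular(report)).items():
        print(query, "->", hit.subject, hit.evalue)
```

The example reads one line at a time, so memory use stays flat however long the report is. That is one reason streaming parsers suit projects of this scale.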

In spite of sounding very Brazilian, the third program’s name is an acronym for an expression in English: Sabiá stands for System for Automated Bacterial Integrated Annotation. The program, which is used to assemble and annotate bacterial genomes only, was conceived at the LNCC and used for the first time in the sequencing of C. violaceum, a project financed by the National Council for Scientific and Technological Development (CNPq).

Some notion of certain basic procedures in genomics is needed to get an idea of what the program does. Sequencing a genome means determining the order of its nitrogenous base pairs, the basic chemical units that form the molecule of deoxyribonucleic acid (DNA), which are usually represented by the letters A (adenine), C (cytosine), G (guanine) and T (thymine). As an organism’s genome can be too large to be sequenced in a single go (the genome of C. violaceum, for example, has 4.7 million base pairs), the researchers have to break it up into small pieces. As with a jigsaw puzzle, assembly consists of correctly joining together these smaller parts once they have been sequenced. “During the assembly process, Sabiá points out the regions of the genome where the data generated by the sequencing is of a good quality or of a bad quality”, says Ana Tereza Vasconcelos, from the LNCC, the coordinator of the C. violaceum project and one of the authors of the software.
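The jigsaw idea can be shown with a toy example in Python. The sketch below is a simple greedy approach, not Sabiá's algorithm, which the article does not describe. It repeatedly joins the two fragments whose ends overlap the most until no overlap of at least `min_len` letters remains. The fragments and the `min_len` value are invented, and the sketch assumes error-free reads, all from the same strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of fragments until none overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; the result stays in several pieces
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    pieces = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]  # made-up reads
    print(greedy_assemble(pieces))
```

Real assemblers must also cope with sequencing errors, repeated regions and reads from both DNA strands. That is where per-region quality information of the kind Vasconcelos describes becomes important.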

Monkeys and men
Once the jigsaw puzzle of the assembly is complete, the tool invented by the team of researchers that worked with the genetic material of C. violaceum starts the annotation of the genes. In general terms, this task is equivalent to discovering which proteins are produced from the chemical recipes contained in the genes of a genome. This is how one arrives at the function (or functions) of a gene. A major part of the annotation data is derived from comparisons. With the assistance of programs such as Sabiá or others, free or paid, the scientists compare the genetic material recently identified in an organism with already known sequences of defined function, which are on file in public databases.

If in monkeys a given sequence leads to the production of some protein, which we shall call X, it is probable that a similar sequence, if present in men, will also lead to the synthesis of this same protein X. Things, of course, are not as simple as all that, but this is the spirit of annotation. “Sabiá works in a computing environment that makes it possible to crosscheck information from eight public databases”, says Ana Tereza. “You can even compare entire genomes.” To increase its range, Sabiá, which is to be the subject of a scientific article this year, will be improved. The idea is to produce a version of the system that also serves for the assembly and annotation of the genomes of organisms other than bacteria.
