molecular biology

Filling in the gaps

Brazilians create a genomic research strategy and complete the sequencing of 211 human genes

In February 2001, when it was presented to the general public, the human genome was compared to a landscape of extensive deserts dotted with a few scattered cities. The deserts represented the long stretches of DNA, technically called introns, which apparently do nothing – they do not lead to the production of the proteins that make up human beings. The cities were the functional stretches of DNA, called exons. But there was still a lot of fog and, at first, it was impossible to tell the cities from the desert, or to say how many cities there were and where they were located. The uncertainty was so great that estimates of the number of genes ranged from 35,000 to 120,000.

In this international race for an exact count of the genes, together with their precise location, size and structure, research groups in the United States, Japan and Germany have filled rooms with dozens of DNA sequencing machines that work day and night. Even without so much equipment, the researchers from the São Paulo universities and institutes did not lose heart.

They adopted an ambitious strategy of their own – an exhaustive analysis of the public genome data combined with laboratory tests – and, four years later, they have managed to complete the sequencing of 211 genes of which only fragments had previously been known, as well as showing where those genes lie on the genome: the cities gained exact locations in the middle of the desert. The almost one hundred researchers from 31 São Paulo university laboratories and the Ludwig Cancer Research Institute also discovered around 40 new genes that had not yet been described by any other group. The results of this work, coordinated by Anamaria Camargo, from the Ludwig, and by Mari Cleide Sogayar, from the Chemical Institute of the University of São Paulo (USP), were published online at the end of last month and will come out in print on the 1st of this month in the journal Genome Research.

“This is the science that results from a partnership between FAPESP and the Ludwig Institute”, commented José Fernando Perez, FAPESP’s scientific director. For two years, from 1999 to 2000, the Foundation and the São Paulo branch of the Ludwig Institute ran the Human Cancer Genome Project, to which each institution allocated today’s equivalent of R$ 30 million. The joint effort produced approximately 1.2 million sequences of genes associated with various types of cancer – they were central stretches of genes, characterized by means of a methodology created in the country, named Orestes, an acronym for Open Reading Frame Expressed Sequence Tags. In a complementary effort, other research groups had sequenced the ends of genes using another technique, ESTs (Expressed Sequence Tags).

Through the Cancer Transcriptome Project, which began at the end of 2000 with investments of around R$ 4 million from FAPESP and R$ 1.5 million from the Ludwig Institute, the researchers set out to bring together the two sets of sequences, those from the middle and those from the ends of genes. Both consisted only of exons, the active parts of genes, but they were not always enough to complete the genes – many gaps remained.

“In principle, any research group could have carried out this work, since all of the data were public”, affirmed Sandro José de Souza, the coordinator of Ludwig’s bioinformatics group. “Our advantage was to bring together groups with different specialties.” The enterprise officially began in 2001 and mobilized five bioinformatics teams – those of Ludwig, the University of Ribeirão Preto (Unaerp), the Federal University of São Paulo (Unifesp), and the Medical Faculty of Ribeirão Preto and the Heart Institute, both from the University of São Paulo (USP).

Gene candidates
The bioinformaticians, as they are called, superimposed the sequences from the core and from the ends of the genes on the information arriving from the international projects to sequence the human genome – long lists of nucleotides, the units of DNA, in which nobody had the slightest idea where the introns and exons were located. “We focused our attention on the incomplete genes, bringing together the Orestes sequences and the other ESTs”, says Souza. With the help of computer programs, this work produced lists of gene candidates, which were tested experimentally by the teams from the 31 laboratories of the Ludwig Institute, USP, Unifesp, the Paulista State University (Unesp), the State University of Campinas (Unicamp) and the University of the Paraíba Valley (Univap).
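The idea of superimposing fragments on the genome to find gene candidates can be illustrated with a toy sketch in Python. It is not the project's actual software: the `Fragment` record, the `gap_candidates` function, the coordinates and the 10,000-nucleotide `max_gap` threshold are all hypothetical, chosen only to show how neighbouring fragments separated by an unsequenced gap might be paired up for laboratory testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A sequenced fragment already mapped to the genome (hypothetical record)."""
    name: str
    source: str   # "ORESTES" (middle of a gene) or "EST" (end of a gene)
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive


def gap_candidates(fragments, max_gap=10_000):
    """Pair neighbouring, non-overlapping fragments on the same chromosome.

    Two fragments separated by a small gap may belong to the same gene,
    so each pair is a candidate for experimental testing.
    The max_gap threshold is an illustrative assumption.
    """
    by_chrom = {}
    for frag in fragments:
        by_chrom.setdefault(frag.chrom, []).append(frag)

    pairs = []
    for frags in by_chrom.values():
        frags.sort(key=lambda f: f.start)
        # Compare each fragment with its immediate right-hand neighbour.
        for left, right in zip(frags, frags[1:]):
            gap = right.start - left.end
            if 0 < gap <= max_gap:
                pairs.append((left.name, right.name, gap))
    return pairs


# Made-up coordinates, for illustration only.
fragments = [
    Fragment("est_5prime", "EST", "chr1", 1_000, 1_400),
    Fragment("orestes_mid", "ORESTES", "chr1", 3_000, 3_500),
    Fragment("est_3prime", "EST", "chr1", 9_000, 9_300),
    Fragment("far_away", "EST", "chr1", 500_000, 500_400),
    Fragment("other_chrom", "ORESTES", "chr2", 100, 600),
]

for left, right, gap in gap_candidates(fragments):
    print(f"{left} + {right}: gap of {gap} nucleotides to close")
```

In this made-up example the two EST ends and the Orestes middle are chained into two candidate pairs, while the distant fragment and the one on another chromosome are left out. A real pipeline would also have to consider strand, splice sites and alignment quality.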

The first results, demonstrating the viability of the technique, came out in October 2001 in the Proceedings of the National Academy of Sciences (PNAS) and, in a two-page commentary, earned the recognition of two world authorities on the human genome, Robert Strausberg and Gregory Riggins, both from the National Cancer Institute (NCI) of the United States.

Finding the right path made things only a little easier. Luciana Oliveira Cruz, a researcher in Mari Cleide’s team, worked hard for months cultivating twenty human cell lines – from the uterus, testicles and liver, among others – preparing complementary DNA (cDNA) samples, which correspond to the active genes in each tissue, and distributing them to the laboratories that tested the 488 previously selected gene candidates to see whether they were truly genes.

Each possible gene was submitted to a polymerase chain reaction, a technique known by the initials PCR, using two specific primers. Primers are short synthetic nucleotide sequences, designed in this case from two known stretches of DNA (Orestes or ESTs); they mark out the ends of the DNA fragment to be copied thousands of times. “We knew that two stretches previously described as individual sequences belonged to a single gene when, starting from the primers, we obtained a copy of the cDNA, demonstrating that we were dealing with pieces of a single molecule”, stated Luciana.
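The logic of the primer test can be mimicked on a computer, as a rough sketch and not as a description of the laboratory protocol. The Python code below assumes exact primer matches and uses invented sequences; the function names `reverse_complement` and `amplify` are hypothetical. It returns a product only when both primer sites lie on the same cDNA molecule, which is the condition Luciana describes.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of a DNA sequence, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def amplify(template, forward, reverse):
    """Return the stretch of template bounded by the two primers, or None.

    Simplifying assumption: primers must match exactly. The forward primer
    is searched for on the given strand; the reverse primer anneals to that
    strand, so its site appears there as the primer's reverse complement.
    """
    start = template.find(forward)
    if start == -1:
        return None
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]


# Invented sequences: two known fragments and the unknown stretch between them.
frag_a = "ATGGCTAGC"       # stands in for one known fragment
middle = "TTTTGGGGCCCC"    # the gap that had not been sequenced
frag_b = "GATTACAGA"       # stands in for the other known fragment

cdna = "CC" + frag_a + middle + frag_b + "TT"
forward = frag_a
reverse = reverse_complement(frag_b)

print(amplify(cdna, forward, reverse))           # product spans both fragments
print(amplify("CC" + frag_a + "TTTTGGGG", forward, reverse))  # None: site for second primer absent
```

When a product is obtained, sequencing it fills in the stretch between the two known fragments; when none is obtained, the two fragments are not shown to belong to the same transcript in that sample.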

This strategy – aligning stretches of DNA and testing them with primers – was called the Transcript Finishing Initiative (TFI) and determined what was an intron and what was an exon in the incomplete genes. With an efficiency of 43% (211 of the 488 candidates tested), the system revealed 211 new genes, many of which were also described by other research groups, using other techniques, while the project was under way. That left around 40 previously undescribed genes, presented in the article in Genome Research.

The Transcriptome Project was completed at the end of last year, though thousands of gaps in the human genome remain to be filled. There is still no consensus on the total number of genes, but around 25,000 complete, or almost complete, genes have already been described.

The Project
Characteristics of Complete Human Genes – An Extension of the Human Cancer Genome
Special Project
Anamaria Aranha Camargo – Ludwig Cancer Research Institute and Mari Cleide Sogayar – Chemical Institute of USP
R$ 540,000.00 and R$ 550,000.00