Imprimir Republish


The era of whole genome sequencing

New techniques are changing the way genetic material is understood and promising to transform our understanding of diseases, human genetic diversity, food production, and evolutionary processes

Most genes are common between organisms; about 0.01% of the genome differentiates one human from another

Felipe Mayerle

Imagine if you were on a first date and wanted to check if your dinner partner really does have Italian ancestry as they claim. What if all you had to do to find out was to rub a cotton swab around the rim of their glass while they were distracted, put the DNA sample in a pocket sequencer, connect it to your cell phone, and have an answer within minutes. If you think that sounds like science fiction, you may be in for a surprise. Miniature devices that read genetic material in real time already exist and can be used to identify disease-causing pathogens. The gadget is part of the genome revolution that is allowing complete DNA sequencing to be performed in shorter and shorter time frames, with fewer errors and at ever-decreasing costs.

Anyone who was an adult 20 years ago probably remembers the genome sequencing era, when the complete sequences of a series of organisms were released for the first time, with an emphasis on the Human Genome Project (HGP), which announced an almost complete human sequence in 2001. Major initiatives — some in Brazil funded by FAPESP (see Pesquisa FAPESP issue nº 174) — revealed the genetic material of organisms such as the bacteria Xylella fastidiosa, which damages orange groves, and various forms of cancer, offering a boost to genetics laboratories and developing potential in bioinformatics.

Since then, the data has been shared with the scientific community and research has made it possible to identify a multitude of genes linked to health problems, but with an as-yet limited impact on patients’ lives (see Pesquisa FAPESP issue  284), especially for complex diseases, such as diabetes and cancer, which involve the action of many DNA segments responsible for the production of proteins.

“The Human Genome Project inaugurated modern genetics,” says bioinformatician Pedro Galante, head of the bioinformatics laboratory at the Teaching and Research Institute of Sírio-Libanês Hospital (IEP-HSL) in São Paulo. “We discovered many genes associated with a vast range of traits.” Carlos Menck, a molecular biologist from the Institute of Biomedical Sciences at the University of São Paulo (ICB-USP), reports that the use of next-generation sequencing around 10 years ago made a huge difference in the diagnosis of the disease Xeroderma pigmentosum, patients with which cannot be exposed to sunlight (see Pesquisa FAPESP issue  323 and sidebar). “We will soon be publishing an article with the mutations detected in 150 patients,” he says. He also sees the approaching possibility that this knowledge will give rise to treatments based on gene editing or regulation using tools such as CRISPR-Cas9 and RNA interference.

► New oncology
(click to learn more)

Pedro Galante from IEP-HSL points out that oncology is at a much higher level than it was just a few years ago and that more advances are coming quickly. It is now possible to direct aggressive — and extremely expensive — treatments to the patients who are most likely to benefit based on the mutation profile of the tumors and the patient’s immune system, among other factors. “This is thanks to discoveries made over the last five years.”

One example, published in the scientific journal Cancers this year, was the discovery of 19 highly active olfactory receptor genes, normally involved in the detection of odors, inside acute myeloid leukemia cells — a type of cancerous cell for which there is no good treatment. It was a surprising finding that could improve early diagnosis of the disease and possibly reveal potential therapeutic targets.

Skin tumors among xeroderma pigmentosa patients have greater insertion of transposable elements, known as retrotransposons, according to an article by a group led by Carlos Menck from ICB-USP, published in the journal Carcinogenesis in May. “We were the first to show this, thanks to a bioinformatics tool developed by Pedro Galante,” says Menck. The study indicates that these insertions cause instability in the genome and may be the cause of the tumors, as well as identifying an enzyme with an important role in maintaining stability. “We are now investigating the mechanism of action.”

The way we look at the genome is changing radically, focusing not only on a very small percentage of genes that were previously considered essential but also on the complete DNA sequence, and comparing a large number of individuals and even species. It is an exciting time. In May, Galante attended The Biology of Genomes conference at the Cold Spring Harbor Laboratory in the USA, the most renowned in the field. He recalls that in the closing lecture, American geneticist Evan Eichler of the University of Washington stated that he had been waiting years for the field to arrive at the moment in which it now finds itself. There is great anticipation of the major advances that genomic research could provide, especially in medicine.

Léo Ramos Chaves/Revista Pesquisa FAPESPSequencing machines: A flow cell, into which samples are inserted in the Illumina NovaSeq (below); with the Sanger method, DNA passes through thin threads (right)Léo Ramos Chaves/Revista Pesquisa FAPESP

One of the limitations of the HGP was that despite having collected DNA from around 20 volunteers, roughly 70% of the sequenced genome came from just one person, a North American of European and African descent from the city of Buffalo in the state of New York. This means that genetic variants that differ between one population and another may not be represented in the sequenced genome that is still used as a reference to this day. This is the genome that researchers compare against their results to identify the function and variation of genes detected in different situations.

Now, the Human Pangenome Reference Consortium (HPRC) intends to resolve this issue by completely sequencing at least 350 people from all over the planet, to ensure scientists have access to more than just one reference. “It would be much better if we had genome references from the Brazilian population to compare to what we see in our patients, because common variants here may not be present in the North American population or vice versa,” explains Galante, who is applying to join the consortium in the hope of contributing through his bioinformatics knowledge.

The plan is to carry out telomere-to-telomere (T2T) sequencing, which uses the ends of the chromosomes, whose highly repetitive sequences make them difficult to disentangle using previous methods. To do this, the strategy has been to combine techniques. Devices made by British company Oxford Nanopore and American company Pacific Biosciences can read very long stretches of DNA, up to millions of bases long. They are very efficient for genome assembly. “Long reads allow us to determine the genome’s outline,” explains geneticist Glória Franco of the Federal University of Minas Gerais (UFMG). Not surprisingly, long reading was highlighted in an editorial published in the journal Nature in January this year as the method of the year for 2022. Because the error rate is higher, geneticists refine the sequences by using shorter readings made by machines from the Californian company Illumina, which allow them to look at smaller sections with greater precision.

► Autism
(click to learn more)

Genome studies related to autism have highlighted gene variants associated with the disorder and developmental delays, according to a study published in the journal Nature Genetics in 2022 in which Maria Rita Passos-Bueno, from USP’s Center for Human Genome Studies, participated. The approach shows promise for better understanding the neurobiological basis of autism spectrum disorder (ASD).

In the quest to understand hereditary factors and the effects of parental age on children born with ASD, an analysis that considers three generations of each family rather than broader population studies seems promising. An article written by CEGH-USP scientists on an analysis of 33 families, published in the journal Genetics this year, praised the approach — although the sample size was too small to draw any major conclusions.

Until recently, genomes were sequenced in small fragments that needed to be assembled like a puzzle that may have overlapping parts (see infographic below). Repetitive genome sections, which are highly frequent, are especially difficult to include in the assembly and used to be left out — in the representation of the nitrogenous bases that make up DNA, such as A, T, C, and G, there could be a long string of ATATATATATAT, for example.

Alexandre Affonso / Revista Pesquisa FAPESP

The initiatives go far beyond human genomes, as noted by the journal Science in a January article highlighting the topics it expected to make the news this year. The Earth Biogenome Project (EBP), a consortium of researchers from several countries launched in 2022, is attempting to sequence all species — including the large proportion still unknown to science, especially unicellular organism and small invertebrates — over the next 10 years. The objective is to “know, use, and conserve biodiversity.” This umbrella also covers other projects, such as the sequencing of all mammals (Zoonomia, see Pesquisa FAPESP issue nº 328) and primates (see Pesquisa FAPESP issue  329).

Marcela Uliano-Silva, a Brazilian bioinformatician from the Wellcome Sanger Institute in the UK, is part of the Darwin Tree of Life project, which is sequencing the genome of all organisms in the UK. She explains why such a broad perspective is needed: “Everything in biology is investigated on the basis of comparison.” There are around 70,000 species, of which one thousand are almost sequenced already. “We have a total of around 2,700 genomes at some stage in the sequencing process,” she estimates. As the task progresses, the data is made immediately available and can be analyzed by anyone interested. “The first large sequencing block was of Lepidoptera, the family of butterflies and moths,” says the bioinformatician.

► Diabetes and metformin
(click to learn more)

The most commonly used oral treatment for controlling blood glucose levels in people with type 2 diabetes is metformin, but its mode of action is still a mystery and only protein-coding genes have been studied in this context. A team led by Glória Franco at UFMG looked for long, noncoding RNA molecules (lncRNA), as described in a 2022 article published in the scientific journal Non-coding RNA. The study’s still exploratory conclusion that the drug regulates critical forms of lncRNA that can affect the response to treatment, as well as cell proliferation and cell energy metabolism, could contribute to future investigations into how it works.

Franco participated in the first genome projects with DNA from the parasite Schistosoma mansoni, which causes schistosomiasis, and is now interested in the relationship between lncRNA and cancer and drug responses.

In search of diversity beyond genetics, Uliano-Silva is also a member of the EBP’s Justice, Equity, Diversity, and Inclusion commission, which is seeking to prevent “scientific neocolonization” as a result of the dominance of the wealthiest countries in large international consortiums. As part of this role, she has been contacted by Brazilian researchers interested in participating in this type of project and suggests that similar initiatives should be organized in Brazil.

One early-stage project in the country plans to sequence all Brazilian tetrapods (vertebrates with four limbs), an initiative started by primatologist Jean Boubli of the University of Salford, UK, in partnership with a group from the Center for Human Genome Studies at USP (CEGH-USP). “We use an Illumina NovaSeq 6000 sequencer,” explains geneticist Maria Rita Passos-Bueno, referring to one of the most modern devices. Although CEGH’s focus is on rare genetic diseases, she is excited about the new venture. “We are organizing a collection from the USP Zoology Museum, where they will check the quality of the samples before sending them to us.” An initial test successfully sequenced 70 bird samples. “I want to use data from nonhuman primates to understand the human genome,” she says, returning to her main interest. “If we find a variant in humans that is common in other animals, there is a good chance that it will not be behind the problems,” she explains.

Ana Cotta / Wikimedia Commons | Frank Fox / Wikimedia Commons | Eduardo Cesar / Revista Pesquisa FAPESP | Michael Wolf / Wikimedia Commons | Zeynel Cebeci / Wikimedia Commons | rufus46 / Wikimedia CommonsThe ocelot, the maned wolf, protozoan, rice, tomato, and a butterfly: very different organisms can have similar genes and transposable elements contribute to variationsAna Cotta / Wikimedia Commons | Frank Fox / Wikimedia Commons | Eduardo Cesar / Revista Pesquisa FAPESP | Michael Wolf / Wikimedia Commons | Zeynel Cebeci / Wikimedia Commons | rufus46 / Wikimedia Commons

► Agrigenomics
(click to learn more)

Pangenomics has already influenced humankind’s most widely eaten crops. It is well-known that some plants naturally undergo duplications of their entire genome, resulting in them having several copies of DNA in their cells. This process increases the chances of transposable elements — and therefore new responses to environmental conditions — being generated, according to an article published in the journal Genome Biology in 2021 by Brazilian biologist Rafael Della Coletta, a PhD student at the University of Minnesota, USA.

Transposons are responsible for changes such as repressing the functioning of the gene that causes maize to flower when the days are longer, promoting aluminum tolerance in rice, and generating the oval shape of Roma tomatoes. The article explains that identifying the causal relationship between changes in phenotype (observable characteristics of an organism) and genetic novelties, which can be achieved by gene editing based on agricultural production needs, would represent a new era in the domestication of edible plants.

Obtaining an infinite number of complete genomes is not even the most important step forward in modern genomics. The crucial advance was the realization that focusing on genes that produce proteins restricts attention to just 1.2% of human DNA. When it comes to these genes, people are 99.9% identical. The majority of the genome consists of sequences that control and orchestrate gene activity and are the real reason for differences between organisms, largely through the action of RNA molecules with regulatory functions. This is one of the concepts of the field known as “evo-devo” (evolutionary developmental biology, see Pesquisa FAPESP issue  152): throughout development, adjustments in the construction of the same structures can generate very different results, such as a person’s hand and a bat’s wing. “The hardware is the same, what changes is the software,” explains molecular geneticist Paulo Amaral, from the Insper Institute of Education and Research. “The sea sponge, which we use as a luffa, has very similar genes to ours.”

Alexandre Affonso / Revista Pesquisa FAPESP

The discovery that small RNA molecules turn off genes (see Pesquisa FAPESP issue nº 133) won Americans Andrew Fire and Craig Mello the Nobel Prize in 2006. With the new techniques that have emerged since then, its function is increasingly under the spotlight. “For the first time, we are producing complete genomes, including the noncoding parts and the RNA they produce,” says Amaral.

This year, in partnership with Australian geneticist John Mattick of the University of New South Wales in Australia, he published the book RNA – The epicenter of genetic information (CRC Press). The basis for the book, which was later expanded and updated, was a review that resulted from three months holed up in the library during his PhD, defended in 2011 at the University of Queensland, also in Australia, at what was at the time Mattick’s laboratory. It provides an overview of the history of molecular biology since the nineteenth century, focusing on the dilemma that attracted Amaral even as an undergraduate: if it was considered that the main function of RNA was to translate genes into proteins, then why does most of the human genome represent instructions for RNA molecules that do not make proteins? Compelling examples include the Xist gene, which produces long RNAs responsible for silencing one of the two X chromosomes in female mammalian cells and is involved in diseases such as cancer. Together with a group led by immunologist Helder Nakaya of Hospital Israelita Albert Einstein in São Paulo, Amaral has been investigating the action of noncoding RNA in immunological responses to vaccines and in cardiovascular and neurological diseases. In the end, the book elevates RNA to the position of a “computational engine of the cell, development, cognition, and evolution.” “We are already writing updates for a second edition,” he reveals. Scientific discoveries are advancing faster than the editorial process.

Transposable elements, also known as transposons — which tend to contain repeated sequences — are also gaining increasing prominence (see Pesquisa FAPESP issue nº 246), having initially been described in the 1940s by American geneticist Barbara McClintock as responsible for the color variation in corn kernels. These replicated sections can jump and insert themselves into another part of the DNA strand, quickly influencing neighboring genes, causing changes and generating new features.

Alexandre Affonso / Revista Pesquisa FAPESP

A recent article by Galante’s group that was published on the bioRxiv repository in February investigated the origin of regulatory RNAs using transposable elements in primates, known as retrocopies. Five of the 17 studied molecules are the same in all primates, while two are specific to the human genome and may be involved in essential biological processes, such as metabolism, communication between cells, and the development of various types of cancer. “The line that separates transposable elements from genes is getting thinner and thinner,” says botanist Marie-Anne van Sluys, who believes the topic does not receive as much attention as it deserves, although she has focused on it for decades and is currently leading a study on genes, genomes, and transposition elements in sugarcane and their association in interaction with pathogens. A study by her group published on bioRxiv in 2020 identified a transposable element capable of modulating development and gene expression in tobacco plants in response to stress. “In mammals, it is already clear that they stand out as drivers of diversification.”

The results also include epigenetic markers (such as methylation patterns) that regulate DNA functions without altering the sequence and reflect how genes interact with the environment (see infographic above). They could be the cause of certain types of tumors and detecting them in DNA circulating in the bloodstream has emerged as a promising and practical diagnostic tool in recent years, known as a liquid biopsy (see Pesquisa FAPESP issue nº 253).

► Brazilian genomics
(click to learn more)

Brazilian initiatives funded by FAPESP have kept the country in line with advances being made abroad. The sequencing of the bacteria Xyllella fastidiosa, which started in 1997, involved 35 laboratories and 191 researchers from institutions in São Paulo as part of the Network for Nucleotide Sequencing and Analysis (ONSA). It was the first sequencing of an organism that causes disease in plants and was thus of commercial interest, and it made the cover of Nature on July 13, 2000.

At the beginning of the century, the sugarcane genome, cancer genome, and the genome of the bacteria Xanthomonas citri, which causes citrus canker, were also completed.

The results, which were fundamental to the establishment of contemporary scientific culture, involved the creation of dozens of laboratories, coordination between research groups, the establishment of shared databases, and the development of bioinformatics. It also inspired the formation of new research groups nationwide.

“The formation of the ONSA network was a daring and risky initiative that ended up turning out really well and generating great results. Building a methodology capable of delivering a sequence was extremely important and helped qualify a lot of people in this area,” said geneticist Marcio de Castro, scientific director of FAPESP, in a recent interview (see Pesquisa FAPESP issue nº 328).

The urgency of bioinformatics
Highly specialized professionals called bioinformaticians are needed to process and interpret such immense volumes of data. Marcela Uliano-Silva is keenly aware of this need and has just developed an open-access program for assembling mitochondrial genomes, something that did not yet exist, according to an article published in the scientific journal BMC Bioinformatics in July. Mitochondrial genomes are a special type of DNA — they result from an ancient symbiosis with bacteria, are circular, and are usually transmitted from mothers to children. She warns that it is essential for biologists to learn programming because software development is so crucial to analyzing and visualizing data.

Steve Wilson / Wikimedia Commons | Ivar Leidus / Wikimedia Commons | Oceancetaceen / Wikimedia Commons | Andrew Mercer / Wikimedia Commons | H. Zell / Wikimedia Commons Tamarin monkeys, bees, porpoises, bats, marsupials, sea sponges: regulation during development is behind some of the differencesSteve Wilson / Wikimedia Commons | Ivar Leidus / Wikimedia Commons | Oceancetaceen / Wikimedia Commons | Andrew Mercer / Wikimedia Commons | H. Zell / Wikimedia Commons 

Galante adds that the training needed is not usually available in Brazilian undergraduate courses. Becoming a bioinformatician, he says, often requires going back to school to learn statistics and mathematics, something that most people with a background in biological sciences are not eager to do. “The statistics and computing demands of analyzing genomic data are increasingly complex,” he stresses. Demand is also rising in the private sector, and Galante points out that the PhD students working in his laboratory who master this area are almost guaranteed employment in clinical analysis and oncology labs or companies wishing to develop computational technologies. “In five short years, an entire ecosystem of industries that require bioinformaticians was formed in the country.”

Technological advances make sense, and reach their full potential, through sound scientific questions. But this is a two-way relationship, highlights Carlos Menck. “When we are looking for a way to solve a problem, sometimes we learn something that changes the scientific question.” Many researchers in health-related fields who spend most of their time in the lab doing basic research, as is the case for Menck, have a clear objective: to solve patients’ problems. While searching for solutions to practical difficulties regarding diagnoses or how to deliver the drug to the intended destination, for example, they uncover molecular modes of action, and as they investigate how the DNA works, they may find unexpected applied solutions.

HUG-CELL – Human Genome and Stem Cell Research Center (nº 13/08028-1); Grant Mechanism Research, Innovation, and Dissemination Center (RIDC); Principal Investigator Mayana Zatz (USP); Investment R$55,474,011.98.
2. The role of DNA damage and mitochondrial function in vascular, immune system, and neurological aging (DNA MoVINg) (nº 19/19435-3); Grant Mechanism Thematic Project; Principal Investigator Carlos Frederico Martins Menck (USP); Investment R$8,100,714.10.
3. Contribution of genes, genomes, and transposable elements in the interaction between plants and microorganisms: sugarcane case study (nº 16/17545-8); Grant Mechanism Thematic Project; Program BIOEN; Principal Investigator Marie-Anne van Sluys (USP); Investment R$6,106,313.34.
4. Retroelements: A driving force in the creation of genetic novelties in the genomes of humans and mice (nº 18/15579-8); Grant Mechanism Young Investigator Award; Principal Investigator Pedro Alexandre Favoretto Galante (SBSHSL); Investment R$2,080,277.52.
5. Olfactory receptors: Mechanisms of gene expression and signal transduction (nº 16/24471-0); Grant Mechanism Thematic Project; Principal Investigator Bettina Malnic (USP); Investment R$1,486,107.99.
6. Impact of RNA-binding proteins on splicing dysregulation in glioblastoma (nº 17/19541-2); Grant Mechanism Postdoctoral fellowship; Supervisor Pedro Alexandre Favoretto Galante (SBSHSL); Beneficiary Gabriela Der Agopian Guardia; Investment R$321,292.29.

Scientific articles
MATTICK, J. & AMARAL, P. RNA – The epicenter of genetic information. Boca Raton: CRC Press, 2023.
GUARDIA, G. D. A. et al. Acute myeloid leukemia expresses a specific group of olfactory receptors. Cancers. vol. 15 no. 12 3073. june 6, 2023.
WRIGHT, C. J. et al. Chromosome evolution in Lepidoptera. bioRxiv. may 14, 2023.
CORRADI, C. et al.Mutational signatures and increased retrotransposon insertions in xeroderma pigmentosum variant skin tumors. Carcinogenesis. bgad030. may 17, 2023.
MERCURI, R. L. et al.Retro-miRs: Novel and functional miRNAs originated from mRNA retrotransposition. bioRxiv. feb. 25, 2023.
CONCEIÇÃO, I. M. C. A. da, et al.Metformin treatment modulates long non-coding RNA 2 isoforms expression in human cells. Non-coding RNA. vol. 8, no. 5, 68. oct. 12, 2022.
MULHAIR, P. O. et al.Diversity, duplication, and genomic organization of homeobox genes in Lepidoptera. Genome Research. vol. 33 pp. 32–44. 2023.
ULIANO-SILVA, M. et al. MitoHiFi: A python pipeline for mitochondrial genome assembly from PacBio High Fidelity reads. BMC Bioinformatics. vol. 24, 288. july 18, 2023.
DELLA COLETTA, R. et al.How the pan-genome is changing crop genomics and improvement. Genome Biology. vol. 22, no. 3. jan. 4, 2021.
FU, J. M. et al.Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nature Genetics. vol. 54, pp. 1320–31. aug. 18, 2022.
COSTA, C. I. S. et al. Three generation families: Analysis of de novo variants in autism. European Journal of Human Genetics. june 6, 2023.
HERNANDES-ROSA, J. et al. Evidence-based gene expression modulation correlates with transposable element knock-down. bioRxiv. aug. 15, 2020.