Good practices

Software detects errors in cancer genetics articles

Online tool available for testing identifies mistakes in gene sequences

Image: Sandro Castelli

A preliminary version of Seek & Blastn, designed to detect flawed or fraudulent DNA sequences in scientific articles, is available online for testing. Developed by Australian oncologist Jennifer Byrne and French computer scientist Cyril Labbé, the program uses Blastn (Nucleotide Basic Local Alignment Search Tool) to compare human gene sequences published in papers with those stored in public nucleotide databases. “The software tries to find mismatches between the claimed status of a sequence, what the paper says it does, and what the sequence actually is,” Byrne told the journal Nature. According to Labbé, the online version still needs to be fine-tuned. The program struggles to recognize nucleotide sequences in PDF files, for example.
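The kind of check Byrne describes can be illustrated with a minimal sketch. It is not the Seek & Blastn implementation: the real tool relies on Blastn alignments against public databases, whereas this toy version uses exact substring matching against two made-up reference sequences. The gene names, the sequences, and the function names are all hypothetical.

```python
# Toy reference "database": gene name -> reference sequence.
# Names and sequences are invented for illustration only.
REFERENCES = {
    "GENE_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTA",
    "GENE_B": "ATGCCCTTTAAAGGGCGCGATATCGTACCAGTTGA",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def genes_matching(seq: str) -> list[str]:
    """List reference genes that contain seq on either strand (exact match only)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    return [name for name, ref in REFERENCES.items() if seq in ref or rc in ref]


def check_claim(seq: str, claimed_gene: str) -> str:
    """Compare what a paper claims a sequence targets with what it actually matches."""
    hits = genes_matching(seq)
    if claimed_gene in hits:
        return "consistent"
    if hits:
        return "mismatch: matches " + ", ".join(hits)
    return "no match found"


if __name__ == "__main__":
    # A sequence taken from GENE_A but reported as targeting GENE_B is flagged.
    print(check_claim("TACGTTAGCCTAGG", "GENE_A"))
    print(check_claim("TACGTTAGCCTAGG", "GENE_B"))
```

A real screening tool would also have to tolerate partial matches and extract the sequences from the article text itself, which is the step Labbé says still needs work.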

“But the software can provide significant support and reduce the need for manual analysis by specialists,” Labbé said at an international congress on peer review held in Chicago, USA, in September 2017. Seek & Blastn can be tested at

The two researchers have found errors in more than 60 papers on cancer genetics. Some of them are small and accidental, but according to Byrne, the majority of the discrepancies are enough to invalidate results and conclusions. “It’s a very serious issue. The use of defective data can have implications on clinical research and the search for cancer treatments,” she said. “Scientists need a better understanding of these types of errors, so that they can avoid wasting time and money by inadvertently following up incorrect results.”

Byrne’s interest in the subject arose in 2015, when she noticed a problem with five articles on cancer genetics. The papers described a similar type of experiment, in which the gene TPD52L2 was deactivated by targeting a short sequence of nucleotides and the effects on tumor cell development were examined. Byrne, who is head of the Children’s Cancer Research Unit at the Kids Research Institute and a professor of molecular oncology at the University of Sydney, was very familiar with the gene: she led the research group that first identified it in 1998. The gene is linked to the onset of certain types of breast cancer and leukemia, but its function is still not well understood.

The oncologist soon discovered that the nucleotide sequence described in the five papers did not correspond to the actual sequence. “It was highly unlikely or impossible that they could have obtained the results they obtained,” the researcher told the Sydney Morning Herald newspaper. She reported her findings and her concerns to the editors of the journals that had published the articles, and four of the five were retracted. The authors admitted that they had not conducted the experiment, but had acquired their data from a commercial biotechnology company, without disclosing this partnership.

Australian oncologist Jennifer Byrne has created a program that automatically finds errors in scientific papers (Image: The University of Sydney)

Byrne suspected the episode was not an isolated case and began searching the PubMed database for other papers. She found similar problems in another 43 papers. All of them dealt with gene silencing, and their titles, data, and images had so much in common that she suspected these authors had also obtained second-hand data. She then got in touch with Cyril Labbé, from the University of Grenoble, France, who had previously created a program that identifies fraudulent, nonsensical, computer-generated articles published in conference proceedings, often without having been properly read beforehand. Together they found errors in the nucleotide sequences of 30 articles, all written by Chinese authors. Without identifying the authors, the pair described the problem in an article published in Scientometrics in 2016.

For now, Seek & Blastn only compares human gene sequences, but the two researchers plan to extend the analysis to include laboratory animals. They have made the preliminary version available online so that other scientists can test the program and help refine it. The intention is to later offer the software to journal editors so that they can use it to analyze manuscripts submitted for publication.

Several tools are already available for automatically checking the robustness and veracity of large volumes of research data. The best known and most widely used are those that search texts for evidence of plagiarism, but there are other examples. Researchers from Tilburg University in the Netherlands wrote a program capable of detecting statistical errors, causing controversy by analyzing 50,000 psychology articles and publicly sharing the results (see Pesquisa FAPESP, issue No. 253).

The Office of Research Integrity (ORI), which monitors research supported by the US Department of Health and Human Services, recommends a set of software tools that detect image manipulation or duplication. For David Allison, a statistician at Indiana University Bloomington, these tools are useful for promoting good practices and encouraging researchers to prevent errors. “They can also help measure error rates in specific journals or fields of knowledge,” Allison told Nature.