DATA MANAGEMENT 

Unbalanced Dissemination

Reuse of scientific data is still uncommon, and varies depending on the field of knowledge

Image: Marcelo Cipis

The reuse of research data is becoming more common, but it is still far from an established scientific practice. Studies based on data from previous experiments by other researchers occur more frequently in the exact and biological sciences, while there is resistance in the social sciences. In general, researchers working with data obtained by computer modeling or remote sensors feel more comfortable reusing data from third parties. This is one of the conclusions of an article published in the journal PLOS ONE by data scientist Renata Curty, from Londrina State University (UEL), Paraná.

Based on the responses given by 595 researchers from a range of disciplines and countries, she and her colleagues evaluated the frequency with which data is reused and the perceived factors that encourage or discourage the practice. Interestingly, the analysis was itself based on reused data. The original data came from over 1,000 questionnaires answered by researchers between October 2013 and March 2014 as part of the Data Observation Network for Earth (DataONE), a project run by the US National Science Foundation (NSF).

Curty has been studying perceptions of data reuse since her doctorate at Syracuse University, USA. At the time, she found that social science researchers were concerned about the potentially harmful consequences of data reuse. “Many are worried about ethical issues or violating participant confidentiality,” she explains. There are also concerns about the risk of misinterpreting or misrepresenting the original data. Opinions on data reuse are also strongly influenced by the researcher’s field, and social sciences traditionally encourage the production of new knowledge. “Studies that reuse data are considered less authentic and of lesser impact,” she says.

Image: Daniel Eisenstein / Sloan Digital Sky Survey. Reusing data on galaxy clusters is helping astronomers learn more about celestial objects.

The reasons to reuse data are numerous, including a growing concern about the reproducibility of research (see Good Practices section) and the importance of making primary data available so that other people can verify the accuracy and relevance of the results. Since 2013, Brazilian Political Science Review, a journal published by the Brazilian Political Science Association, has required authors of articles based on quantitative studies to make all of their data—as well as their codebooks, which describe the variables used to obtain the data—available on the journal’s website. “The aim is to enable other researchers to replicate the procedures that led to the conclusions of the research,” says the journal’s editor, Marta Arretche, from the Department of Political Science at the University of São Paulo’s School of Philosophy, Languages and Literature, and Human Sciences (FFLCH-USP).

She points out that science can only be replicated if the data and tools used in experiments, simulations, and analyses are made openly and freely available. It is crucial, however, that this mass of information be accompanied by explanations regarding its origin. “Without well-documented data it is not possible to reproduce the original experiment or reuse the data in another study,” she adds.

Since 2014, the PLOS group of journals has only accepted articles whose raw data is available in public repositories (see Pesquisa FAPESP, issue no. 218). In genetics and bioinformatics journals, where research often generates a large amount of data on DNA and protein sequences, this recommendation has long been a requirement. This rule, in fact, allowed geneticists Lygia da Veiga Pereira and Maria Vibranovski, from the Institute of Biosciences at USP, to explain the process by which one of the X chromosomes is switched off in female embryos. They analyzed data provided by Chinese researchers in 2013 and found that the XIST gene, responsible for initiating the inactivation process, was expressed in female embryos starting from the eight-cell stage (see Pesquisa FAPESP, issue no. 260). “The Chinese team had done the whole laboratory part. They acquired the human embryos, separated the cells, extracted and sequenced the RNA, but they were not looking at inactivation of the X chromosome,” said Pereira, whose findings were published in Scientific Reports in September 2017.

A group led by parasitologist Marcelo Ferreira and biologist Priscila Rodrigues, both from the USP Institute of Biomedical Sciences, also reused scientific data to good effect while studying the global dispersion patterns of the parasites that cause malaria. “We used samples of protozoan genetic material made available by GenBank, a database of information on DNA sequences and amino acids provided by the US National Center for Biotechnology Information,” Rodrigues says. At least two articles have been produced in the last three years based on the reuse of this data: one in 2016, published in Nature Genetics, highlighting how the Plasmodium vivax parasite underwent mutations after arriving in the Americas that make it distinguishable from African and Asian strains, and another in January 2018, published in Scientific Reports, presenting new evidence on how human migration helped transport these parasites across the American continent.

Reusing data is also encouraged by funding agencies, which are interested not only in improving reproducibility but also in optimizing the application of public resources in the projects they fund. “Data sharing can help scientists save time and resources, as well as avoiding the duplication of research,” says Claudia Bauzer Medeiros, an electrical engineer from the Institute of Computing at the University of Campinas (UNICAMP) and coordinator of the FAPESP eScience program. “International studies show that the practice increases the number of partnerships, accelerates scientific discoveries, and makes the knowledge produced more visible,” she says.

The idea that publicly funded research should share its results without restriction, even the primary data collected, is also related to the concept of open science, which promotes free access to data and collaborative knowledge development, notes Claudia Domingues Vargas, from the Biophysics Institute at the Federal University of Rio de Janeiro (UFRJ). She is one of the researchers involved in the Neuroscience Experiments System (NES), which provides open access to primary data from neuroscience studies.

The platform was designed by neuromathematics research center NeuroMat, one of the Research, Innovation, and Dissemination Centers (RIDC) supported by FAPESP, which involves Brazilian and foreign researchers from the fields of mathematics, computer science, statistics, neuroscience, biology, physics, and communications. “The NES is a public repository that provides open access to a wide range of neurophysiological, clinical, and experimental data, as well as the software used in its analysis, processing, and generation,” explains Claudia Vargas, one of the NeuroMat researchers.

The sharing of scientific data is advancing at different rates in different fields of knowledge. In astronomy it is commonplace, as observed by physicist Marcelle Soares-Santos, a professor at Brandeis University and a researcher at the Fermi National Accelerator Laboratory, one of the most important particle physics research centers in the world. “I benefited a lot from this practice while studying for my doctorate,” she says. At the time, she was developing algorithms to find galaxy clusters using data on 500 million celestial objects from the Sloan Digital Sky Survey. Soares-Santos explains that primary data in astronomy is rich and is often not fully explored. “Many questions in astronomy can only be studied through the analysis of scientific data obtained by other research groups.”

Paradox
The study published in PLOS ONE highlights a curious fact related to the perceptions of data reuse: researchers more concerned with the credibility of the data they intend to use are actually more willing to reuse data produced by third parties. Those who almost never reuse data have more difficulty understanding the benefits of the practice and evaluating the quality of data available.

In a 2014 study titled How and why researchers share data (and why they do not), publisher John Wiley & Sons surveyed almost 3,000 researchers from various fields and countries, and found that Germans are the most willing to share data, with the aim of increasing the visibility and ensuring the transparency of their research. The Chinese are the least likely to share their research data, possibly because it is not a requirement for funding in the country. Brazilians complained about the extra work needed to organize the data, the costs of hosting it, and the difficulties finding suitable repositories.

In studies on the reuse of scientific data, researchers often claim that they are afraid to make their data available because they still want to use it themselves in future studies, or they fear they will not receive credit as the source. These and other concerns were highlighted in a report by Elsevier titled Open Data: The research perspective. But the same study found that 73% of respondents believed access to third-party scientific data could benefit their own research, and that 64% were willing to share data with other researchers.

The main challenge, according to Claudia Bauzer Medeiros, is to show researchers the benefits of reusing scientific data while at the same time combating cases of misappropriation. Another effective strategy, says Medeiros, is creating courses to teach researchers and students how to prepare data and experiments for sharing. “This type of training is already standard in several countries around the world, and in some institutions it has become a requirement,” she adds.

Renata Curty argues that we need to invest more in data quality verification systems and in rewarding researchers who adopt the practice. There are several such initiatives already in place, including the Global Biodiversity Information Facility (GBIF), which stores almost 850 million species occurrence records, 6 million of which are from Brazil (see Pesquisa FAPESP, issue no. 263). By registering their primary data on the GBIF, researchers can generate a data paper, a document describing the dataset, which can be published on open access online platforms. According to Curty, there are publications dedicated to sharing these data papers, such as the Biodiversity Data Journal and Data in Brief, published by Elsevier, and Scientific Data, from the Nature group.

Scientific article
CURTY, R. G. et al. Attitudes and norms affecting scientists’ data reuse. PLOS ONE. Vol. 12, no. 12, pp. 1–22. Jan. 2018.