A strategy for research data : Revista Pesquisa Fapesp

Managing and storing large volumes of research data are challenges faced by scientists in every field. In the last decade, research funding agencies such as the National Science Foundation (NSF) in the US and the Economic and Social Research Council in the UK have increasingly required grant applicants to submit data management plans outlining how research data will be managed, preserved, and made available through public repositories. Their aim is to ensure that information is shared, that research data is reusable, and that experiments are reproducible, facilitating further scientific discoveries and optimizing returns on funding investment.

Although data management planning is not currently mandatory in Brazil, last October FAPESP took a step in this direction and announced that grants for “thematic projects”—projects lasting five years and characterized by ambitious objectives—would be required to include a data management plan as a supplement. The requirement would be gradually extended to other grant mechanisms later in the year. “This is among the first initiatives in Brazil to establish policies and guidelines for managing scientific data,” says Claudia Bauzer Medeiros, a professor at the Computing Department at the University of Campinas (UNICAMP) and head of FAPESP’s eScience program.

The Foundation’s Code of Good Practice, launched in 2011, already required researchers to submit records from their research. “They will now be required to specify how their data will be managed, from collection to storage, and how and when the data will be made available,” she says. UNICAMP was the first university in Brazil to post a data management plan template on the DMPTool website (dmptool.org). The initiative, led by Benilton de Sá Carvalho of the Institute of Mathematics, Statistics, and Scientific Computing (IMECC), allows researchers from his university to easily create their plans online and make them available worldwide. More than 200 research institutions in different countries have officially adopted DMPTool to create and share data-management plans. Currently, only three Brazilian universities are on the platform: UNICAMP, the University of São Paulo (USP), and the Federal University of ABC (UFABC).

Making experiment or field data widely available can lead to collaborations and accelerate scientific breakthroughs by increasing the visibility of research outputs. In 2016, an international consortium involving more than 30 organizations including the Oswaldo Cruz Foundation, the Chinese Academy of Sciences, and the National Institutes of Health (NIH) in the US encouraged researchers to share the data that they had collected during the recent Zika virus outbreak. As a result, in a matter of months they were able to publish research showing the link between Zika and microcephaly. In the field of biodiversity, storing research data in public repositories makes millions of records on plant and animal species widely accessible, facilitating further research. The speciesLink network, a digital biodiversity database developed in Brazil, allows researchers to find information on the occurrence and distribution of species of microorganisms, algae, fungi, plants, and animals. The platform has compiled records from 470 collections in Brazil and other countries. These collections contain roughly 9 million records on 125,000 species, including records on 2,756 threatened species.

Researchers are being required to specify how their data will be managed, from collection to preservation

However, data management planning involves more than simply listing data on an online database. According to the Digital Curation Centre, a UK center specializing in digital curation, a data management plan should include information on how and why data have been created and stored. This means that information must be provided on how metadata—or data describing other data—will be organized. “Metadata are descriptions of datasets, detailing how, when, and where they were produced, how they can be reused, and who created them,” explains information scientist Márcia Teixeira Cavalcanti, a professor at Universidade Santa Úrsula, Rio de Janeiro and a member of the Information, Heritage, and Society research group at the Brazilian Institute of Information in Science and Technology (IBICT). “It’s about identifying and standardizing scientific data so they can be easily accessed in repository searches and reused in other research,” she says.
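As a rough illustration of what such a description can look like in practice, the sketch below builds one metadata record covering the points Cavalcanti lists (how, when, and where a dataset was produced, how it can be reused, and who created it) and serializes it to JSON. The class name, field names, and sample values are hypothetical and do not follow any particular repository's metadata standard.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    creators: List[str]   # who created the dataset
    method: str           # how the data were produced
    collected_on: str     # when (ISO 8601 date string)
    location: str         # where
    reuse_terms: str      # how the data can be reused (e.g., a license)


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record so it could be stored alongside the dataset."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


# Invented sample values, for illustration only.
example = DatasetMetadata(
    title="Example field survey",
    creators=["A. Researcher"],
    method="Manual specimen count along fixed transects",
    collected_on="2016-05-14",
    location="Campinas, SP, Brazil",
    reuse_terms="CC BY 4.0",
)

print(to_json(example))
```

A real plan would more likely adopt an established metadata schema chosen for the field and the target repository; the point here is only that each descriptive element becomes a named, searchable field.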

In 2016, Cavalcanti was involved in curating data on the CarpeDIEN platform of Brazil’s Nuclear Energy Institute (IEN), which performs research in fields such as radiopharmaceuticals and artificial intelligence. “It took time to develop the right metadata models for the kind of information we were dealing with,” she says. According to Cavalcanti, the curation process should begin before any data is produced. “In a data management plan, it may also be important to specify the software or equipment that will be used to generate information such as images or algorithms.” Claudia Medeiros agrees that this type of information can be essential. “Often having access to the data is not enough to reproduce an experiment. You also need to have the same computer programs or operating system to recreate the same conditions as in the original study,” she says.

Publicly funded researchers cannot exempt themselves from sharing information, says Câmara

During her time at IEN, Márcia Cavalcanti conducted a survey on data repositories in Europe, which she published last year in a journal of the Institute for Humanities and Information Sciences of the Federal University of Rio Grande (FURG) in Rio Grande do Sul. The survey covered 33 countries and found that only nine supported open-access research repositories in 2016. Her findings show that data-sharing is still incipient in many European countries. Horizon 2020, the largest research funding scheme in the European Union established in 2007, issued a step-by-step guide on data-management plans in 2016 before making them mandatory for all grant applications in 2017. One important aspect of the guide is the attention it draws to conditions under which sharing raw data can raise ethical issues, as in clinical trials that use personal data and must protect patient privacy.

“Barring these exceptions, there are really no arguments to justify publicly funded researchers in refusing to furnish their data,” says Gilberto Câmara, a researcher at the Brazilian National Institute for Space Research (INPE) and a coordinator of the FAPESP Research Program on Global Climate Change. According to Câmara, many researchers hold off on archiving experiment data until their research has been published in a journal, arguing that their data could be appropriated by others and published without credit to those who collected them. “That’s a poor excuse,” he says. Câmara explains that information can be safely archived before publishing a paper, as all data are assigned an identification code known as a Digital Object Identifier (DOI), so they are traceable. “The fact is that, unfortunately, many researchers don’t want others to publish research before they, who collected the data, have published their work,” says Câmara.

“All the data from my research are archived in open databases as they are collected,” says the researcher, who publishes data from satellite-image analyses on Pangaea, a platform for georeferenced data. Recently, information he stored in this digital repository was used by researchers from Restore+, an international consortium for land-use research based in Germany. Câmara welcomes FAPESP’s initiative to require researchers to develop data-management plans. “This can help to address bad habits in the scientific community by promoting good practices in data management,” he says. “There are researchers who feel they own the data and will only share it with their colleagues if they get something in return, such as coauthorship of the paper. This, unfortunately, is all too frequent,” he says.
