A strategy for research data

Researchers are being encouraged to better manage and share the data they produce

Managing and storing large volumes of research data is a challenge faced by scientists in every field. In the last decade, research funding agencies such as the National Science Foundation (NSF) in the US and the Economic and Social Research Council in the UK have increasingly required grant applicants to submit data management plans outlining how research data will be managed, preserved, and made available in public repositories. Their aim is to ensure that information is shared, research data is reusable, and experiments are reproducible, facilitating further scientific discoveries and optimizing returns on funding investment.

Although data management planning is not currently mandatory in Brazil, last October FAPESP took a step in this direction and announced that proposals for “thematic projects” (five-year projects with ambitious objectives) will be required to include a data management plan as a supplement. The requirement will be gradually extended to other grant mechanisms later in the year. “This is among the first initiatives in Brazil to establish policies and guidelines for managing scientific data,” says Claudia Bauzer Medeiros, a professor at the Computing Department at the University of Campinas (UNICAMP) and head of FAPESP’s eScience program.

The Foundation’s Code of Good Practice, launched in 2011, already required researchers to submit records from their research. “They will now be required to specify how their data will be managed—from collection to storage—and how and when the data will be made available,” she says. UNICAMP was the first university in Brazil to create a data-management plan template on the DMPTool website. The initiative, led by Benilton de Sá Carvalho of the Institute of Mathematics, Statistics, and Scientific Computing (IMECC), allows researchers from his university to easily create their plans online and make them available worldwide. More than 200 research institutions in different countries have officially adopted DMPTool for creating and sharing data-management plans. Currently only three Brazilian universities are on the platform: UNICAMP, the University of São Paulo (USP) and the Federal University of ABC (UFABC).

Making experiment or field data widely available can lead to collaborations and accelerate scientific breakthroughs by increasing the visibility of research outputs. In 2016, an international consortium involving more than 30 organizations, including the Oswaldo Cruz Foundation, the Chinese Academy of Sciences, and the National Institutes of Health (NIH), in the US, encouraged researchers to share the data they collected during the recent Zika virus outbreak. As a result, in a matter of months they were able to publish research showing the link between Zika and microcephaly. In the field of biodiversity, storing research data in public repositories makes millions of records on plant and animal species widely accessible, facilitating further research. The speciesLink network, one of the digital biodiversity databases developed in Brazil, allows researchers to find information about the occurrence and distribution of species of microorganisms, algae, fungi, plants, and animals. The platform has compiled records from 470 collections in Brazil and other countries. These collections contain about 9 million records on 125,000 species, including 2,756 threatened species.

But data management planning is about more than just placing data in an online database. According to the Digital Curation Centre, a UK center of expertise in digital curation, a data management plan should contain information on how and why the data has been created and stored. This means that information needs to be provided on how metadata (data describing other data) will be organized. “Metadata are descriptions of datasets, detailing how, when, and where they were produced, how they can be reused, and who created them,” explains information scientist Márcia Teixeira Cavalcanti, a professor at Universidade Santa Úrsula, Rio de Janeiro, and a member of the Information, Heritage, and Society research group at the Brazilian Institute of Information in Science and Technology (IBICT). “It’s about identifying and standardizing scientific data so they can be easily accessed in repository searches and reused in other research,” she says.
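
To make this concrete, the kind of record Cavalcanti describes can be sketched in a few lines of Python. This is only an illustration: the field names and example values below are hypothetical and do not follow any particular repository's metadata standard.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record (illustrative field names only)."""
    title: str
    creators: List[str]       # who created the dataset
    date_produced: str        # when it was produced (ISO 8601 date)
    place_produced: str       # where it was produced
    method: str               # how it was produced
    reuse_terms: str          # how it can be reused
    keywords: List[str] = field(default_factory=list)  # aids repository searches


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record to JSON so it can be stored alongside the data."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False, sort_keys=True)


# Invented example values, for illustration only.
example = DatasetMetadata(
    title="Example field observations",
    creators=["A. Researcher"],
    date_produced="2017-10-01",
    place_produced="Campinas, Brazil",
    method="Manual field survey",
    reuse_terms="Reuse permitted with attribution",
    keywords=["biodiversity", "occurrence"],
)

if __name__ == "__main__":
    print(to_json(example))
```

Keeping the description in a structured, machine-readable form is what allows repositories to index datasets consistently; a real plan would adopt the schema required by the chosen repository.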

Researchers are being required to specify how their data will be managed, from collection to preservation

In 2016, Cavalcanti was involved in data curation on the CarpeDIEN platform at Brazil’s Nuclear Energy Institute (IEN), which performs research in fields such as radiopharmaceuticals and artificial intelligence. “It took time to develop the right metadata models for the kind of information we were dealing with,” she says. According to Cavalcanti, the curation process should begin before any data is produced. “In a data management plan, it may also be important to specify the software or equipment that will be used to generate information such as images or algorithms.” Claudia Medeiros agrees that this type of information can be essential. “Often having access to the data is not enough to reproduce an experiment. You also need to have the same computer programs or operating system to recreate the same conditions as in the original study,” she says.
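
One way to record the computing conditions Medeiros mentions is to save a short description of the software environment alongside the data. The sketch below uses only the Python standard library; the function name `describe_environment` and the choice of fields are assumptions made for illustration, not a prescribed format.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the OS, Python version, and selected packages.

    `packages` is a list of distribution names chosen by the researcher;
    packages that are not installed are recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package names here are examples; list whatever the analysis uses.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Archiving this output with a dataset gives other researchers a starting point for recreating the original conditions, although it does not by itself guarantee that an experiment can be reproduced.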

During her time at IEN, Márcia Cavalcanti conducted a survey on data repositories in Europe, which she published last year in a journal of the Institute for Humanities and Information Sciences at the Federal University of Rio Grande (FURG), in Rio Grande do Sul. The survey covered 33 countries and found that only nine had open-access research repositories in 2016. Her findings show that data-sharing is still incipient in many European countries. Horizon 2020, the largest research funding scheme in the European Union, launched in 2014, issued a step-by-step guide on data-management plans in 2016, before they became mandatory for all grant applications in 2017. One important aspect of the guide is the attention it draws to situations in which sharing raw data can create ethical issues, such as in clinical trials that use personal data and need to protect patient privacy.

Publicly funded researchers cannot exempt themselves from sharing information, says Câmara

“Barring these exceptions, there are really no arguments to justify publicly funded researchers refusing to share their data,” says Gilberto Câmara, a researcher at the Brazilian National Institute for Space Research (INPE) and a coordinator of the FAPESP Research Program on Global Climate Change. According to Câmara, many researchers hold off archiving experiment data until their research has been published in a journal, arguing that their data could be appropriated by others and published without credit. “That’s a poor excuse,” he says. Câmara explains that information can be safely archived before a paper is published because all data are assigned an identification code known as a Digital Object Identifier (DOI), which makes them traceable. “The fact is that, unfortunately, many researchers don’t want others to publish research before they themselves, having collected the data, have published their own work,” says Câmara.

“All the data from my research are archived in open databases as they are collected,” says the researcher, who publishes data from satellite-image analysis on Pangaea, a platform for georeferenced data. Recently, information he stored in this digital repository was used by researchers from Restore+, an international consortium for land-use research based in Germany. Câmara welcomes FAPESP’s initiative to require researchers to develop data-management plans. “This can help to address bad habits in the scientific community by promoting good practices in data management,” he says. “There are researchers who feel they own the data and will only share it with their colleagues if they get something in return, such as coauthorship of the paper. This, unfortunately, is all too frequent,” he says.