The crisis over the novel coronavirus pandemic is changing the way researchers communicate and work together to create and disseminate knowledge, making the process faster and more transparent. In the race to develop new drugs and vaccines, many scientists are now sharing their research data—the body of primary information they use as a basis for their research conclusions—in real time. This comes as part of an effort by governments, industry, international organizations, funding agencies, and the scientific community to advance open science practices in fighting the pandemic, including free access to information and collaborative research. The Organization for Economic Cooperation and Development (OECD), for example, stressed the importance of this strategy in tackling COVID-19 in a policy brief published in May: “In global emergencies like the COVID-19 pandemic, open science policies can remove obstacles to the free flow of research data and ideas, and thus accelerate the pace of research critical to combating the disease.”
Many initiatives have emerged to promote sharing of research data about the novel coronavirus. Among these initiatives is Nextstrain, an open-source database of SARS-CoV-2 genetic sequences created by researchers at the University of Basel, in Switzerland, and the Fred Hutchinson Cancer Research Center, in Seattle. The database allows researchers to map out patterns of viral epidemic spread and analyze information from multiple sources on the virus’s genetic changes. “Researchers can share and compare data, and identify where in the world the coronavirus is mutating,” explains Trevor Bedford, who co-created the platform. The project has revealed connections between COVID-19 cases in Iran and strains identified in Australia, as well as a patient in Taiwan who was infected with a strain from the Netherlands. Researchers have also determined that the SARS-CoV-2 strain that spread to Italy later found its way to Latin America and Africa, and that Asian countries have been reinfected with strains that they had previously exported to Europe.
Nextstrain provides access to the genomes of 1,787 SARS-CoV-2 strains circulating in Latin America
Bedford believes the platform could also have been useful in recent epidemics, such as the Zika virus epidemic between April 2015 and November 2016. “Brazil’s Northeast was the hardest-hit region. If we had had a real-time tool then for mapping how and at what rate the Zika virus was spreading around the world, perhaps we could have predicted that this region would be at greatest risk. This would have given public authorities a chance to limit the spread of the disease.”
The urgent need for data on the novel coronavirus led the European Commission to launch the COVID-19 Data Portal in collaboration with partner institutions. The platform allows researchers to share, access, and analyze different types of data about the new coronavirus, including virus-specific proteins and genes. This information is being used to develop artificial intelligence systems that can identify key areas of COVID-19 research around the world, overlaps in research efforts, and promising approaches that are worth exploring. The portal also aggregates information hosted in other European repositories, such as the UK Node of ELIXIR, a platform that brings together Europe’s major life-science data archives and has recently launched an exclusive section for SARS-CoV-2 data—providing information about virus-specific genes, the cell lines most useful for studying the virus’s mechanisms of action, and proteins that interact with the pathogen.
The data-sharing effort has also found its way to Brazil. One example is the COVID-19 Data Sharing/BR platform, launched in June. Created through a FAPESP collaboration involving the University of São Paulo (USP), the Fleury Group, the Albert Einstein Jewish Hospital, and Sírio-Libanês Hospital, the repository aggregates laboratory and demographic data on around 180,000 individuals who have tested either positive or negative for COVID-19, as well as data on 6,500 case outcomes—either recovery or death—and almost 5 million results from clinical exams and laboratory tests. “We expect this information will be useful for improving diagnoses, for research into factors affecting the progression of the disease in Brazil, and for investigations into drug and vaccine candidates,” said FAPESP’s scientific director, neuroscientist Luiz Eugênio Mello, during the unveiling of the initiative.
The new repository is using computing infrastructure launched by the USP Office for Information Technology in December 2019 to connect research data repositories at different scientific institutions in São Paulo (see Pesquisa FAPESP issue no. 287). “Having the infrastructure readily available helped us to fast-track implementation of the COVID-19 platform,” says physicist Sylvio Canuto, associate dean for research at USP.
There has been a longstanding drive to increase sharing of research data, for a variety of reasons. One is improving reproducibility in research, and making primary research data widely available so that other scientists can verify the accuracy and relevance of published results. With the pandemic, this became even more urgent. “Data-sharing can optimize research efforts and catalyze new collaborations, accelerating the pace of discovery,” explains Claudia Bauzer Medeiros, a professor of databases at the Institute of Computer Science at the University of Campinas (UNICAMP), and a manager of the eScience and Data Science programs at FAPESP. “It also allows researchers to conduct studies that combine data from multiple sources.”
Medeiros is a Council member at the Research Data Alliance, an organization created in 2013 to develop and adopt infrastructure that promotes data sharing. In March, she and 136 other members joined in an effort to develop recommendations to accelerate COVID-19 research (see sidebar).
“The pandemic has underlined the importance of swift and open dissemination of shared scientific data,” says British biochemist Richard Sever, one of the founders of bioRxiv, an open access preprint repository for the biological sciences. “This has helped greatly to advance knowledge about the virus.” The importance of current efforts is underscored by a comparison with previous pandemics. “It took almost five months to completely sequence the genome of SARS-CoV-1, which caused an epidemic in Asia between 2002 and 2003,” says electrical engineer Daniel Villela, a researcher in the Scientific Computing Program at Oswaldo Cruz Foundation (FIOCRUZ). “Now, the rapid flow of COVID-19 information within a few days after collecting samples from the first infected individuals has allowed the full genome of SARS-CoV-2 to be sequenced in just one month.”
More than 2,800 clinical trials of COVID-19 treatments are available in the Cochrane COVID-19 Study Register
Despite the progress made during the pandemic, several obstacles remain. Building an environment that facilitates the flow of information requires not only that researchers be willing to share their data, but also that governments commit to collecting and making information available transparently. Since April, Open Knowledge Brazil, an organization that promotes transparency around public information, has assessed the availability and quality of COVID-19 epidemiological and health infrastructure data provided by Brazil’s Federal, state, and local governments. This assessment has informed a COVID-19 Transparency Index for states and the Federal Government. The index is updated every 15 days and is based on three measures of information quality: content, format, and granularity, or the level of detail in disclosures. “We found that only five states publish detailed databases, including suspected cases, for example,” explains Fernanda Campagnucci, executive director at Open Knowledge Brazil. “On the Federal Government side, there is a lack of integration to allow detailed information to be provided about the pandemic. This information is essential in estimating the dynamics of virus spread.”
Despite global efforts, many researchers are still reluctant to adopt collaborative research practices. Some fear their original research information could be misused. Others retain their data for use in new studies or fear that they will not be credited for the data they provide. This creates concerns that data sharing will diminish after the pandemic.
Since October 2017, FAPESP and other funding institutions in Australia, the US, and Europe have required researchers to supplement their grant applications with a data management plan describing how their data will be collected and where it will be available. “Having a data-sharing strategy will become an increasingly important item in assessments of project proposals submitted to FAPESP,” says Luiz Eugênio Mello, the Foundation’s scientific director.
Brazilian data scientist Renata Curty, who manages and curates research data at the University of California in Santa Barbara, says research-funding agencies can help to shape new practices in data sharing. “However,” she notes, “this would require investment in assessing data management plans and in new ways to assess whether data are being shared and are of high quality.” It is equally important that research data be accompanied by metadata, which provide a detailed description of the data produced in a given study, specifying how, by whom, when, and where they were produced, and how they can be reused. This helps other researchers to properly interpret and potentially repurpose the new research data.
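To make the idea of metadata concrete, the sketch below shows one way a dataset description could be recorded in Python. It is only an illustration: the `DatasetMetadata` class, its field names, and the sample values are hypothetical and are not taken from any of the repositories or guidelines mentioned in this article. The fields simply mirror the questions listed above (how, by whom, when, where, and how the data can be reused).

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical minimal description attached to a shared dataset."""
    title: str
    how_produced: str    # method or instrument used to generate the data
    produced_by: str     # person or institution responsible
    produced_when: str   # collection period
    produced_where: str  # place of collection
    reuse_terms: str     # conditions under which others may reuse the data

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty or blank."""
        return [name for name, value in asdict(self).items() if not value.strip()]


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record so it can be stored alongside the data files."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


# Placeholder values for illustration only; they do not describe a real dataset.
record = DatasetMetadata(
    title="Example laboratory test results",
    how_produced="Example assay protocol",
    produced_by="Example Hospital Laboratory",
    produced_when="2020-03 to 2020-06",
    produced_where="Example City",
    reuse_terms="Example license; cite the source",
)

print(to_json(record))
print("Missing fields:", record.missing_fields())
```

A check like `missing_fields` is one simple way a repository could flag incomplete descriptions before a dataset is published. Real repositories generally rely on established metadata standards rather than an ad hoc structure like this one.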
Claudia Bauzer Medeiros believes that for a culture of data sharing to gain traction after the pandemic, mechanisms would need to be implemented to reward researchers who share their data. One strategy would be to create metrics for citations of shared information. “It is equally important that these metrics are used in grant evaluation systems to recognize and reward the efforts of researchers who provide their data.” An environment with free access to information and collaborative research is also dependent on steady funding. “Between 20% and 30% of research projects that share their primary data are discontinued within two to three years for lack of funding,” says Medeiros.
Toward the end of June, the Research Data Alliance (RDA) published detailed guidelines to encourage researchers to share and reuse data in the context of the pandemic and in future public-health emergencies. The guidelines address the use of data from clinical, epidemiological, social sciences, and omics studies (research in fields whose names end in the suffix -omics, such as genomics, transcriptomics, proteomics, and metabolomics), as well as the development of strategies to encourage information sharing.
The report was produced as a collaborative effort involving researchers from different countries, among them Claudia Bauzer Medeiros, of the UNICAMP Computing Department. “In mid-March, at the request of the European Commission, RDA invited its 10,000 members to submit proposed guidelines to inform the development of data-sharing strategies,” says Medeiros. Of these, 130 joined the project, forming several writing groups. “We met two to three times a week over the Internet to discuss and collaboratively develop the final guidelines document.”
The report calls on governments, funding agencies, and scientific institutions around the world to work together to develop policies and promote investment in optimizing the flow of data between local and international entities. “The guidelines stress the need for data, software and models to be findable, accessible, interoperable, and reusable,” explains Medeiros. “This requires researchers to develop highly detailed data management plans, with information on how their data were generated and how they can be reused.”