Transparent Science

Researchers are increasingly being asked to store their raw study data in public repositories

Certain changes occur so gradually that we only become aware of their extent near the end of the process. One moment that seemed to crystallize such a change took place in March, when seven PLoS (Public Library of Science) journals decided that new articles would be accepted only if the authors made their research data available in public repositories. In other words, authors must share the mass of primary information which, when analyzed and interpreted, forms the basis of the paper’s conclusions. The new PLoS rule, which fits in with a broad mobilization of research-sponsoring agencies, scientists and universities to lend more transparency to the publication of research results, is nothing new in and of itself. Most journals already recommend that authors make their data available, and this recommendation has long been a requirement of journals in the fields of genetics and bioinformatics, whose studies generate huge volumes of information on DNA sequences and proteins. In 2013, the U.S. Office of Science and Technology Policy, a government agency, sent a memorandum to major funding agencies establishing a policy of open access to the results of publicly funded research, which included providing primary data to repositories unless restricted by considerations of confidentiality or personal privacy. But the Office did not impose any deadlines for making this happen.

The PLoS decision seems to represent a tipping point in this direction. “Our point of view is simple. Ensuring access to underlying data must be an intrinsic part of the scientific publication process,” explained Theodora Bloom, the editor of PLoS Biology, PLoS Computational Biology and PLoS Genetics. Last year alone PLoS journals published over 30,000 articles. These journals have been produced since 2000 by a nonprofit institution and follow an innovative model. Articles are only published online and with open access, which means they can be consulted, at no charge, by anyone with Internet access. Thanks to a group of first-rate reviewers, they have achieved an impact factor comparable to traditional publications. PLoS Medicine, for example, had an impact factor of 15.2 in 2012. This means that, on average, each of its articles published between 2010 and 2011 was cited 15.2 times in indexed journals in 2012. Its competitor Nature Medicine, from the Nature group, had an impact factor of 24.3 in the same period. “Since PLoS is an international standard, its decision is likely to help promote the idea of depositing research data and will create an additional demand for repositories and also for models to finance this demand,” says Abel Packer, coordinator of SciELO Brazil (Scientific Electronic Library Online), a special FAPESP program established in 1998 that brings together nearly 300 Brazilian scientific publications with open access. 
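To make the arithmetic behind that figure concrete, here is a minimal sketch in Python of the two-year impact factor as described above: the citations received in a given year by articles from the two preceding years, divided by the number of those articles. The function name and the counts are hypothetical, chosen only so that the ratio comes out at 15.2; they are not the actual PLoS Medicine figures.

```python
def impact_factor(citations_in_year: int, articles_prev_two_years: int) -> float:
    """Average number of citations per article, as in the two-year impact factor.

    citations_in_year: citations received in the reference year (e.g. 2012)
        by articles published in the two preceding years (e.g. 2010 and 2011).
    articles_prev_two_years: number of articles published in those two years.
    """
    if articles_prev_two_years <= 0:
        raise ValueError("the journal must have published at least one article")
    return citations_in_year / articles_prev_two_years


# Hypothetical counts, NOT real data: 3,800 citations to 250 articles.
example = impact_factor(citations_in_year=3800, articles_prev_two_years=250)
print(f"Impact factor: {example:.1f}")
```

The same ratio can be reached with many different pairs of counts, which is why an impact factor alone says nothing about how many articles a journal publishes.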

Database of the European Organization for Nuclear Research (CERN), in Geneva


The new PLoS rules raised questions and caused an uproar among some. Ten days after implementation, the editors of PLoS apologized for any ambiguity and explained that nothing had changed regarding the nature of the data to be described in the articles. The only new object of concern was how to identify the databank or repository where the primary data could be found (the researcher’s own files were not an option), if the article’s reviewers or other researchers interested in the subject needed to evaluate the data. PLoS defines primary data as data published in the article in the form of tables and statistical analyses that are indispensable to the paper’s conclusions, the idea being that other researchers must be able to independently replicate the findings. Data protected for reasons of security or privacy are not included in the requirement.

The changes sparked reactions from those who see the rules as a new burden on researchers. David Crotty, a geneticist and editor of the Oxford University Press program for publication of scientific journals, wrote in a blog post on The Scholarly Kitchen website that the change could reduce the number of articles submitted to the PLoS journals. “If publishing an article in a PLoS journal requires you to do additional weeks of work to organize your data in a reusable, or at least recognizable fashion, not to mention the cost of hosting the data and the effort to find a suitable repository, why not publish the article in a different journal and eliminate the costs and expenditure of time?” asks Crotty. According to Packer, the issue is not one of additional work, because the paradigm shift goes much deeper. “We’re talking about new practices, in which data will have already been organized during the research study, so as to be made available in the repositories and be intelligible and reusable by others,” he says.

Control room for the meteorological satellite operated by the European Space Agency and the European Organization for the Exploitation of Meteorological Satellites, in Darmstadt, Germany (image: Ysangkok / Wikimedia)

The storage of scientific data in repositories and data reuse are among the concerns of the recently launched FAPESP Research Program in eScience, an expression that sums up the challenge posed by the requirement to organize, classify and ensure access to the huge amount of data being generated continuously in all research fields, in order to extract new knowledge and do comprehensive and original analysis. “Don’t imagine that a researcher can simply download data from a repository and use it in a new study,” says Claudia Bauzer Medeiros, a professor at the Unicamp Institute of Computing, and Assistant Coordinator of special eScience programs for FAPESP. “Sharing data for reuse or reproduction of experiments requires knowledge of its origin and an understanding of how it was produced, then associating with that data the methods, algorithms or the techniques adopted, and even having access to the software necessary to process the data, all of which makes the process very complex. Without this, it may be impossible to reproduce the original experiment or reuse the data in another study,” says Medeiros, who reminded us that the first call for proposals for the eScience program would be open until April 28, 2014. One of the objectives of the program is research related to data repositories. “We hope that the projects submitted, which will probably involve joint research in computing and in other areas of knowledge, will help create methodologies and data models for setting up repositories, and lead to more efficient ways of describing content and structuring it such that it can be retrieved,” she says. “It is not enough, for example, to describe data by keyword. If a researcher wants to reuse that data for a different purpose it will be hard to find by keyword,” says Medeiros.

This type of research effort inspired Nature Publishing Group, which publishes the journal Nature, to launch a new journal beginning in May. Called Scientific Data, it will be an online, open-access publication aimed not at describing new scientific findings, but rather at describing research datasets considered scientifically valuable. The goal is to promote the documentation, exchange and reuse of the data supporting research, through open access, in order to accelerate the pace of scientific discovery. To achieve this goal, the journal’s editors introduced a new type of metadata (data about other data) known as the data descriptor. These descriptors will provide detailed accounts of datasets in the life sciences, biomedicine and the environment, focused solely on how the data were produced, by whom, and how they can be reused by independent researchers. “Metadata give scientific data an identity, provide standardized documentation and the ability to be accessible through searches, in addition to interoperability with different systems on the web; plus data are reusable and citable in other research,” says Packer.

The principle that research must be reproducible is the main driving force behind the creation of repositories for research data. Many scientific discoveries wind up not being confirmed after publication. This could be due to errors and fraud, but the problem also extends to false positive or negative results obtained in good faith. This difficulty haunts researchers and scientific journals, which are forced to retract papers whose results sounded promising, and it has become a nightmare for pharmaceutical and biotechnology companies. According to a recent report in The Economist, researchers at the biotechnology company Amgen found that they could reproduce only 6 out of 53 studies considered “milestones” in cancer research.

Archive of remote sensing images maintained by the U.S. Geological Survey


“In addition to checking the validity of results, data access and reuse also facilitate new research and comparative studies by combining data from different sources,” says Packer. “This is an important breakthrough for research-sponsoring agencies, because it allows them to generate more knowledge from the same investment.” Experience shows that it is difficult for researchers to keep primary data available over time. An article published in the December issue of Current Biology demonstrated that the supporting data of scientific articles tend to be lost as the years pass. The authors examined 516 articles on ecology published between 1991 and 2011 in order to see what had happened to the primary data. They found that the data for articles published in the two previous years were available, but the chances of the data still being available for articles published before then fell at a rate of 17% per year. “Sooner or later, the software that permits access to a file or database will become obsolete. There is an area of computer research called digital curation, which aims to preserve computational devices and ensure not just the quality, but also the preservation of data for future use, or at least that which is considered most valuable,” says Medeiros.
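As a rough illustration of what a 17% annual decline implies, the Python sketch below compounds that rate over several years. It assumes, for simplicity, that the decline is constant and multiplicative from one year to the next; the function name and that modeling choice are assumptions made here for illustration, not details taken from the Current Biology study.

```python
ANNUAL_DECLINE = 0.17  # yearly rate of decline quoted in the text


def relative_chance(years: int, annual_decline: float = ANNUAL_DECLINE) -> float:
    """Fraction of the starting chance that remains after `years` years.

    Assumes (hypothetically) that the chance shrinks by the same
    proportion every year, i.e. it is multiplied by (1 - annual_decline).
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    return (1.0 - annual_decline) ** years


# Print the remaining fraction of the starting chance for a few article ages.
for years in (1, 5, 10, 20):
    print(f"{years:>2} years: {relative_chance(years):.3f} of the starting chance")
```

Under this simplified assumption, the chance of finding the data falls to less than a sixth of its starting value after ten years, which gives a sense of how quickly unarchived data can slip out of reach.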

The challenge still remains to develop a model to finance the services associated with this new stage. “The fees that repositories may decide to charge are not high, but somebody has to pay them. Some institutions and research programs, in genetics and proteins for example, have created repositories of this type and are funding the storage and availability of data,” says Packer, referring to institutions such as the U.S. National Center for Biotechnology Information, which maintains GenBank, a database of DNA and amino acid sequences. In an effort to organize more than 600 repositories and develop methodologies, two catalogs were developed and are working cooperatively. One of them, based at Oxford University, lists data repositories for the biological sciences, such as those for DNA and proteins. The second is the Registry of Research Data Repositories, funded by the German Research Foundation, which compiles the repositories of the other sciences, including the social sciences.

By the end of the year, SciELO Brazil will have defined a policy, in line with international standards, for archiving in repositories the research data behind the articles published in its journals. “We are studying whether it is advisable to develop SciELO’s own repository, in addition to forging alliances with existing repositories,” says Packer. The idea of establishing repositories for scientific data is still new to Brazil. The Environmental Information System (SinBiota) is a pioneering example of a database that grew out of scientific projects. It brings together and integrates the data resulting from projects linked to the Biota-FAPESP Program. With SinBiota, the distribution of species cataloged in São Paulo State can be analyzed on a digital map. “The Ministry of Planning is organizing an open-data movement, but its focus is on public government data, not research data,” says Hélio Kuramoto, a senior technologist of the Brazilian Institute of Science and Technology Information (Ibict) and an expert on the movement for open access to scientific research. Several Brazilian universities, including three state universities in São Paulo, have set up repositories to bring together their scientific production, which was a major breakthrough. However, these repositories still do not focus on storing the data that supported such research.

Among Brazilian scientific journals, the Brazilian Political Science Review (BPSR), published by the Brazilian Political Science Association, is a rare example of a journal with a publication policy similar to that of PLoS. BPSR is an open-access journal, published only in English and only in electronic format. Since last year, the authors of articles whose content is based on quantitative methods have been asked to make available, on the journal’s own site, the databases on which the paper is based, and also the so-called codebooks, dictionaries that allow the variables used in the databases to be identified. The measure increased the journal’s maintenance costs, because it requires hiring a professional to manage the repository. “The guiding principle behind the adoption of this initiative is a basic principle of science, i.e., that other researchers must be able to replicate the procedures leading to the conclusions obtained in the research work. If the reader wants to redo the calculations in order to determine if the findings are correct, then the underlying data supporting them must be publicly available, that is, without the reader having to rely on the goodwill of the study’s authors to provide them,” says Marta Arretche, a professor at the University of São Paulo (USP) Faculty of Philosophy, Languages and Literature, and Human Sciences (FFLCH), and co-editor of the journal, along with Janina Onuki, a professor at the USP Institute of International Relations. Another motivation is the potential for expanding the impact of the articles published in the journal. Arretche cites a study of the Journal of Peace Research, which is also in the field of political science and international relations. The study concluded that articles that make primary data available are twice as likely to be cited as those that do not.

“A third motivation is related to the cost of producing databases, which is very high. Repositories allow these costs to be shared and increase the chances that a given research topic will be studied,” says Arretche, who coordinates the Center for Metropolitan Studies (CEM), one of the 17 Research, Innovation and Dissemination Centers (RIDCs) funded by FAPESP. Since 2000, CEM has become known for producing and disseminating geo-referenced data about the major Brazilian metropolitan areas, offering several databases on its site at no charge. Arretche says most of the journal’s authors deal well with the requirement to provide data. “They have some legitimate concerns, such as the potential for someone to use the data without giving due credit, although the journal makes clear that it is necessary to cite the source. We have thought of requiring users to identify themselves as a prerequisite to accessing the data, but this would dampen the spirit of open access to scientific publications. Other authors would like to use the information intensively before making it available. There is indeed a tension between the principle of replication and the principle of authorship, but replication has prevailed,” she says.