Online biodiversity databases offer access to millions of records on plant and animal species and the areas they occupy, or once occupied, in Brazil and around the world. The benefit of this abundance of raw scientific data, however, is offset by a series of questions: how can the data be extracted and filtered, and how can researchers know whether it is reliable? Will errors in species names and locations be automatically identified and corrected? These issues matter because incorrect or incomplete data can lead to inconsistent analyses.
Researchers at the Polytechnic School of the University of São Paulo (Poli-USP) are involved in the debate on information quality control in online databases, proposing new strategies to solve the problems first observed over a decade ago. In 2006, when electrical engineer Antônio Mauro Saraiva, a professor at Poli-USP, joined a biodiversity research network of biologists from 11 countries in the Americas, he found different scientific names for the same species, incorrect geographic coordinates, and a lack of details about the organisms collected. This data feeds online databases, which scientific articles use as sources for information on the distribution or abundance of animal and plant species. “Five or ten years later, the researchers no longer understood the codes and abbreviations they had used in the collections,” he noted.
In 2008, Saraiva began talking to computer scientist Allan Koch Veiga about how improving the organization and quality criteria of databases would make the information stored in them more accurate and complete. Veiga completed his doctorate in 2016 under Saraiva's supervision and is currently a postdoctoral fellow at Poli-USP. On October 3, in Ottawa, Canada, he presented a conceptual proposal developed by Saraiva's research group to unify the terminology and quality assessment criteria for information stored in online databases on animals, microorganisms, plants, and fungi. The most comprehensive of around 25 global databases is the Global Biodiversity Information Facility (GBIF). Created in 2001, it provides almost 850 million records on species occurrences, of which 6 million are from Brazil, one of about 60 countries in the network.
“Because there is no general consensus, each group defines its own concept of quality and assesses it differently, making it impossible to compare the results,” says Saraiva, coordinator of the USP Biodiversity and Computing Research Center (BioComp). What the Poli-USP group proposes, together with experts from Canada, the USA, Australia, and Denmark, is a common language to facilitate data quality management. Saraiva notes that a survey based on Poli-USP's proposal, conducted in several countries by researchers from Biodiversity Information Standards, an international scientific association that develops data quality standards, identified 100 different types of database quality-verification tests. The tests are programs or subprograms that detect inaccuracies, such as incorrect geographical coordinates. “Even if the programs have the same objective, we cannot compare the results because the criteria they adopt are different,” he says. “We want to standardize all databases within a single conceptual framework, making it clear how each one operates.”
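To make the idea concrete, the sketch below shows what one such verification test might look like in Python. It is a hypothetical illustration, not code from the Poli-USP framework: the field names follow the Darwin Core terms decimalLatitude and decimalLongitude, while the QualityAssertion structure is an assumption about how a standardized test might report its result.

```python
# Hypothetical sketch of a standardized quality-verification test; the
# QualityAssertion structure is an assumption, not the published framework.
from dataclasses import dataclass

@dataclass
class QualityAssertion:
    test_name: str   # which test produced this result
    record_id: str   # the record that was evaluated
    passed: bool     # whether the record met the criterion
    comment: str     # human-readable explanation

def coordinates_in_range(record: dict) -> QualityAssertion:
    """Flag records whose latitude or longitude is outside valid bounds."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        return QualityAssertion("COORDINATES_IN_RANGE", record["id"],
                                False, "coordinates missing")
    ok = -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
    return QualityAssertion("COORDINATES_IN_RANGE", record["id"], ok,
                            "ok" if ok else f"out of range: {lat}, {lon}")

# A latitude of 123.4 is impossible, so the test fails this record.
print(coordinates_in_range({"id": "r1", "decimalLatitude": 123.4,
                            "decimalLongitude": -47.1}))
```

Reporting every test's outcome in a shared structure like this is what would make results from different programs comparable, which is precisely the gap Saraiva describes.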
These concepts will guide the data quality platform that the Poli-USP group plans to develop in 2018 for the Brazilian Biodiversity Information System (SIBBR). Launched in 2014, the SIBBR database has about 10 million records on 155,000 Brazilian animal and plant species. “Despite recent advances, such as the growing range of open source software for publishing scientific information, there is still no national data management policy that establishes quality criteria and control,” says biologist Andrea Nunes, general coordinator of ecosystems at the Ministry of Science, Technology, Innovation, and Communications (MCTIC), and national director of SIBBR.
Once operational, the data quality platform should interact with the databases that feed the SIBBR and establish common operating standards. “One element not always considered by databases is that the starting point for defining quality depends on how the researcher intends to use the information,” says Saraiva, who offers an analogy: “How we define the quality of tomatoes differs depending on whether they are destined for a sauce or a salad. For a sauce, the tomatoes should be very ripe and a little soft, while for a salad they should be firmer and not too ripe.”
The Poli-USP group aims to help researchers define their data selection criteria before starting a search, so that they do not have to sift through thousands of records on a species or group of species, and to publish those criteria as a guide for other users. “If a researcher wants a list of species from one specific country, they do not need the exact geographical coordinates of each location, but this information is indispensable if somebody is studying the geographic distribution of animals or plants within a region,” says Veiga.
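A small sketch can illustrate Veiga's point about fitness for use. Everything here is hypothetical, assuming Darwin Core-style field names; the idea is simply that the same set of records passes or fails depending on which criteria the researcher declares up front.

```python
# Hypothetical illustration of fitness for use: the same records are or are
# not "good enough" depending on the criteria declared before the search.
def fit_for_use(record: dict, required_fields: list[str]) -> bool:
    """A record is usable only if every field the study needs is filled in."""
    return all(record.get(field) not in (None, "") for field in required_fields)

records = [
    {"species": "Araucaria angustifolia", "country": "Brazil",
     "decimalLatitude": -23.5, "decimalLongitude": -46.6},
    {"species": "Araucaria angustifolia", "country": "Brazil"},  # no coordinates
]

# For a national species checklist, coordinates are not needed: both pass.
checklist = [r for r in records if fit_for_use(r, ["species", "country"])]

# For a geographic distribution study, the record without coordinates fails.
distribution = [r for r in records if fit_for_use(
    r, ["species", "decimalLatitude", "decimalLongitude"])]

print(len(checklist), len(distribution))  # prints: 2 1
```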
Error checking
The speciesLink network, a Brazilian biodiversity database, allows records to be filtered by the occurrence and distribution of microorganisms, algae, fungi, plants, and animals. Developed and expanded since 2001 with FAPESP support, the database combines 12 biological collections in the state of São Paulo, including the Flora and Fungi Virtual Herbarium, one of the Brazilian National Institutes for Science and Technology (INCT), with records from 470 collections in Brazil and worldwide. Together, these collections share roughly 9 million records on 125,000 species, of which 2,756 are at risk of extinction.
Of the total records, 68% have exact geographical coordinates that match the given municipality, 23% have no information on the location in which they were collected, and 8% have inaccurate data. The coordinates of 1% of the records are blocked for verification by the curators responsible for each collection. “If any data is considered sensitive, such as the geographic coordinates of a threatened species with a high commercial value, the location or even the complete record may be blocked. The curator is responsible for deciding what should be shared on the network,” says food engineer Dora Canhos, director of the Environmental Information Reference Center (CRIA) in Campinas, which is responsible for developing and maintaining the speciesLink network. “Every mistake must be corrected at the source. No records are altered by CRIA.” Once incorporated into the network, the information is freely shared and available to all.
Millions of stars
“There are not enough people working at the herbarium to clean the data, check its quality, and update the scientific names,” observes Luís Alexandre Estevão da Silva, coordinator of the scientific computing and geoprocessing center at the Rio de Janeiro Botanical Garden Research Institute. For this reason, the institution created and implemented automatic detection programs with 81 quality-verification filters capable of issuing alerts, such as “the coordinates do not match the given municipality.” “We have a long way to go, because there are still many duplicates and inconsistencies in the herbarium classifications,” says Silva. In 2005, his team developed and deployed Jabot, a management system for herbarium collections, releasing it for use by other institutions in 2016; it has so far been adopted by herbaria at 28 universities and research centers around Brazil.
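The sketch below suggests how a filter of this kind might work, checking reported coordinates against reference limits for the stated municipality. It is an illustration only, not Jabot's actual code: the bounding box stands in for a real gazetteer of municipal boundaries, and all names and values are hypothetical.

```python
# Hypothetical sketch of a coordinate-vs-municipality alert filter. The
# bounding box stands in for a real gazetteer of municipal boundaries;
# none of this is Jabot's actual code.
MUNICIPALITY_BOUNDS = {
    # name: (min_lat, max_lat, min_lon, max_lon) -- illustrative values
    "Campinas": (-23.1, -22.7, -47.3, -46.8),
}

def check_municipality(record: dict) -> str | None:
    """Return an alert when coordinates contradict the stated municipality."""
    bounds = MUNICIPALITY_BOUNDS.get(record.get("municipality"))
    if bounds is None:
        return None  # no reference data for this municipality; cannot test
    min_lat, max_lat, min_lon, max_lon = bounds
    lat, lon = record["decimalLatitude"], record["decimalLongitude"]
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return "the coordinates do not match the given municipality"
    return None

# These coordinates fall far outside the Campinas box, so an alert is raised.
print(check_municipality({"municipality": "Campinas",
                          "decimalLatitude": -10.0,
                          "decimalLongitude": -47.0}))
```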
“We have to use methods that analyze data quality at the moment it is produced,” says electrical engineer Cláudia Bauzer Medeiros, a professor at the Institute of Computing of the University of Campinas (UNICAMP) and coordinator of the FAPESP eScience program. “When using data produced by others, researchers often do not check the reliability of the information, despite knowing that the results of their research depend on the quality of the data.” Often, she adds, “such verification is not possible due to a lack of information about data quality.”
Although data quality control strategies are not yet integrated and standardized, concerns about the consistency of science’s raw material—data—are growing, and not just in biology. Colombian physicist Alberto Molino Benito has spent two years with his team at the USP Institute of Astronomy, Geophysics, and Atmospheric Sciences (IAG-USP) developing programs to automatically and accurately extract numerical data from the images captured by the Southern Photometric Local Universe Survey (S-Plus) telescope in Cerro Tololo, Chile, which is managed by IAG-USP itself.
“The information will help us catalog stars, galaxies, quasars, and asteroids, including their positions, size, luminosity, distance from Earth, and mass,” says Benito. “We are calibrating and validating the programs so that researchers will not have to worry about data quality when automatic image collection starts in early 2018.” With an 80-centimeter mirror, the S-Plus should complete its observation of the sky in the southern hemisphere within two years, gathering information on the spatial distribution of millions of stars and galaxies.
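As a rough illustration of the general technique of extracting numerical data from telescope images, the sketch below detects sources in a synthetic image and prints their positions and fluxes. It uses SEP, an open-source Python port of the widely used Source Extractor tool; this is not the S-Plus pipeline, and the detection threshold and synthetic data are arbitrary choices.

```python
# A minimal sketch of automatic source extraction from a telescope image,
# using SEP, an open-source Python port of the Source Extractor tool.
# This illustrates the general technique only; it is not the S-Plus pipeline.
import numpy as np
import sep

# In practice the image would be read from a FITS file; here, synthetic
# Gaussian noise with one injected bright "star" stands in for real data.
rng = np.random.default_rng(0)
image = rng.normal(100.0, 5.0, size=(256, 256))
image[120:125, 80:85] += 500.0  # inject an artificial source

background = sep.Background(image)  # model the smooth sky background
data = image - background           # subtract it before detection

# Detect everything brighter than 5 sigma above the background noise.
sources = sep.extract(data, 5.0, err=background.globalrms)

# Each detection carries numerical data: position, flux, shape, and more.
for src in sources:
    print(f"x={src['x']:.1f}  y={src['y']:.1f}  flux={src['flux']:.0f}")
```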
Scientific article
VEIGA, A. K. et al. A conceptual framework for quality assessment and management of biodiversity data. PLoS ONE, Vol. 12, No. 6, e0178731, 2017.