An algorithm for evaluating research credibility : Revista Pesquisa Fapesp

The United States Department of Defense (DoD) is investing US$7.6 million in the development of an artificial intelligence system to assess whether findings from the social and behavioral sciences are actually true. The program, known as SCORE (Systematizing Confidence in Open Research and Evidence), will last three years. The Pentagon’s objective is to create an automated tool that can assign a score to research results in disciplines such as psychology, anthropology, and sociology, according to an estimated degree of confidence. The classification would give users of scientific information a better idea about the level of uncertainty of the conclusions.

According to anthropologist Adam Russell, the Pentagon often uses research by social scientists and psychologists as a basis for developing national security plans, modeling human social systems, and guiding investment. “However, a number of recent empirical studies and meta-analyses have revealed that results vary dramatically in terms of their ability to be independently reproduced or replicated,” wrote Russell, who is a program manager at DARPA, the Pentagon’s research agency. He is referring to the so-called “replication crisis,” which has seen a number of high-profile cases where scientific articles, especially in fields such as medicine, life sciences, and psychology, have been discredited because it was not possible to replicate their results in subsequent studies. One such scandal involved Diederik Stapel, a social psychology professor at the University of Tilburg in the Netherlands, who had 30 articles retracted for data manipulation. Three years ago, an international collaboration replicated 100 experimental psychology studies and was only able to reproduce the results of 36 of them.

Last month, DARPA announced a partnership between the SCORE program and the Center for Open Science (COS), a nongovernmental organization linked to the University of Virginia with vast experience in replicating scientific experiments. The COS has become known for its Reproducibility Initiative, a program it led from 2013 to 2018 to assess whether 50 potential cancer drugs described in scientific papers had any chance of ever reaching the market. “Research credibility assessments can help scientists choose research topics, agencies decide what to fund, and policymakers select the best evidence,” said biologist Tim Errington, a researcher at COS.

The SCORE program will be divided into four phases. First, a database of around 30,000 scientific articles will be created, populated with study results and information from other sources such as the number of citations received, whether primary data is publicly available, and whether the research was preregistered, a practice intended to ensure that the hypothesis was not changed during the course of the experiment. This stage will involve a collaboration with researchers from the universities of Syracuse and Pennsylvania. Next, 3,000 of these articles will be selected for expert analysis, and each will be assigned a score representing how likely it is that the results can be replicated.

Particular attention will be paid to elements related to the quality of the results, such as sample size, conflicts of interest, and the reputation of the author and institution. The scoring process used by the experts to classify the articles will then be analyzed by computer scientists who will create algorithms designed to automatically perform the same task. Finally, teams of researchers will attempt to redo the experiments described in the 3,000 articles, to test whether the algorithm is capable of accurately predicting reproducibility. “The plan is not to replace humans with machines, but rather to find the best ways to combine the two,” Russell told the journal Nature.
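The final phase, checking predicted scores against actual replication outcomes, can be illustrated with a minimal sketch. This is not the SCORE program's method: the function names, the 0.5 threshold, and the four data points below are hypothetical, and the Brier score is just one common way to grade probability forecasts.

```python
from typing import Sequence


def brier_score(scores: Sequence[float], replicated: Sequence[bool]) -> float:
    """Mean squared gap between each predicted probability and the 0/1 outcome.

    Lower is better; 0.0 would mean every prediction was certain and correct.
    """
    if len(scores) != len(replicated) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    return sum((p - float(r)) ** 2 for p, r in zip(scores, replicated)) / len(scores)


def accuracy(scores: Sequence[float], replicated: Sequence[bool],
             threshold: float = 0.5) -> float:
    """Share of studies where 'score >= threshold' matches the replication outcome."""
    if len(scores) != len(replicated) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    hits = sum((p >= threshold) == r for p, r in zip(scores, replicated))
    return hits / len(scores)


# Hypothetical example: confidence scores for four studies (assumed values),
# and whether each study replicated when the experiment was redone.
scores = [0.9, 0.2, 0.7, 0.4]
replicated = [True, False, False, False]

print(f"Brier score: {brier_score(scores, replicated):.3f}")
print(f"Accuracy at 0.5: {accuracy(scores, replicated):.2f}")
```

In this made-up case the third study received a high score but failed to replicate, which lowers accuracy and raises the Brier score. A real evaluation would need far more studies and a careful definition of what counts as a successful replication.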

There is, of course, a risk of failure. The Reproducibility Initiative led by the COS involved dozens of teams of scientists, and in the end the results were limited. Difficulties achieving the right conditions meant that the program was terminated after analyzing fewer than half of the planned 50 studies. In an initial survey of 10 evaluated studies, only five were considered reliable; the others reached inconclusive or negative results. SCORE program manager Adam Russell has experience with complicated projects like this. Prior to working at the DoD, he managed programs specializing in high-risk innovative projects at IARPA, an agency overseen by the US Office of the Director of National Intelligence that funds research in businesses and universities involving experts in mathematics, computer science, neuroscience, cognitive psychology, and other fields.

Psychologist Brian Nosek, a professor at the University of Virginia and director of the COS, believes there is a chance that the program will fail to provide a faithful view of research credibility if it cannot create a robust database or replicate studies to a high quality. But he thinks it is worth the risk. “Whatever the outcome, we will learn a ton about the state of science and how we can improve.”
