
Good Practices

Questioning statistics

Software that redoes calculations in articles and detects errors stirs the community of psychology researchers

In August 2016, researchers from Tilburg University in the Netherlands used software they had developed to detect statistical inconsistencies in a set of roughly 50,000 scientific articles on psychology. Dubbed Statcheck, the program redoes the calculations reported in a manuscript and notes whether the published results hold up and match what was printed. One or more types of problems were found in half the papers, ranging from typographical errors and simple rounding of figures to incorrect results that could compromise a study’s conclusions. The findings of this massive verification were automatically emailed to the authors of each article and published on PubPeer, an online platform where any user can comment on published papers and point out possible errors, in a kind of peer review performed after publication.

This verification stood out both because it was performed by a computer and because of the sheer volume of data checked: almost every psychology researcher who published a paper in the previous 20 years came under the Statcheck microscope. The publication of the results sent shockwaves through the field. On October 20, 2016, the German Psychological Society released a statement criticizing how the results were disclosed. According to the text, many researchers were displeased by the exposure, having had no opportunity to defend themselves. “Many colleagues are deeply concerned that, obviously, it is very difficult to remove a comment on PubPeer after Statcheck uncovers a false positive,” the statement reads.

Susan Fiske, a professor at Princeton University and former president of the US Association for Psychological Science, spoke out more vociferously, calling this kind of “police” work that proactively investigates research data “a new kind of harassment.” German psychologist Mathias Kauff told the British newspaper The Guardian: “I felt a bit frightened and exposed.” Kauff had received an email from Statcheck notifying him of inconsistencies in an article on multiculturalism and prejudice that he published in 2013 in the Personality and Social Psychology Bulletin. He says the errors resulted from rounding and did not compromise his conclusions.

Many articles in the field of psychology use standardized statistical tests whose results can be rechecked. Statcheck identifies and verifies tests that report p values, a measure of the probability of observing an effect at least as large as the one found if chance alone were at work and the effect were unrelated to the factors being studied. A p value of less than or equal to 0.05 is conventionally used as a threshold for statistical significance, since it suggests the result is unlikely to be a fluke.
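To make the idea concrete, here is a minimal sketch, written in Python for illustration (Statcheck itself is an R package, and the function name, rounding rule, and example values below are assumptions, not its actual code). It recomputes the two-tailed p value implied by a reported t statistic and its degrees of freedom, then compares it with the p value printed in the paper:

```python
# Illustrative sketch of a Statcheck-style consistency check, not the
# actual Statcheck implementation (which is an R package that also
# parses APA-style results out of full texts).
from scipy import stats

def check_t_result(t_value: float, df: int, reported_p: float,
                   decimals: int = 3) -> dict:
    """Compare a reported two-tailed p value with the one implied by
    t(df) = t_value."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)  # two-tailed p value
    # Treat the report as consistent if it matches after rounding.
    consistent = round(recomputed_p, decimals) == round(reported_p, decimals)
    # A gross error: the mismatch changes the significance verdict at 0.05.
    decision_error = not consistent and (
        (reported_p <= 0.05) != (recomputed_p <= 0.05))
    return {"recomputed_p": recomputed_p,
            "consistent": consistent,
            "decision_error": decision_error}

# A paper reporting t(28) = 2.20, p = .036 passes the check...
print(check_t_result(2.20, 28, 0.036))
# ...while t(28) = 2.20, p = .30 is flagged, and the mismatch changes
# the significance verdict at the 0.05 level.
print(check_t_result(2.20, 28, 0.30))
```

Separating mere reporting inconsistencies from mismatches that change the significance verdict mirrors the distinction the article draws between rounding slips and errors that could compromise a study’s conclusions.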

There is, in fact, evidence that the software still needs refinement and that it flags problems that are not actual statistical errors. In an article posted to the arXiv repository, Thomas Schmidt, professor of experimental psychology at the University of Kaiserslautern in Germany, criticized the quality of Statcheck’s analysis of two articles he wrote. According to Schmidt, the software flagged 35 potentially incorrect statistical results, but only five contained inconsistencies, and those, he says, did not compromise the final results.
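One plausible source of such false positives, assuming a naive checker like the sketch above, is that a paper may legitimately report a one-tailed p value: a checker that always recomputes two-tailed p values will flag the result even though nothing is wrong.

```python
# Continuing the sketch above: a legitimately one-tailed report of
# t(28) = 2.20, p = .018 is flagged by a checker that assumes all
# tests are two-tailed -- a false positive, not a real error.
print(check_t_result(2.20, 28, 0.018))
# -> {'recomputed_p': 0.0358..., 'consistent': False, 'decision_error': False}
```

Handling one-sided tests and corrected p values is exactly the kind of refinement that critiques like Schmidt’s call for.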

Chris Hartgerink, the PhD student who submitted psychology papers to Statcheck

The methodology the software uses dates to 2015, when an article describing it was published on the website of the journal Behavior Research Methods. It was written by PhD student Michèle Nuijten and colleagues at the Meta-Research Center of the Tilburg University School of Social and Behavioral Sciences. In the paper, the group showed that half of the 16,695 articles the software analyzed contained some type of inconsistency in their statistical analyses, and that in 12% the findings were compromised by errors. “Statcheck can be a tool to support peer review,” Nuijten tells Pesquisa FAPESP. The journal Psychological Science, for example, has already adopted the software to search for statistical inconsistencies in the manuscripts it receives.

The project to analyze the 50,000 articles and make the results public on PubPeer was led by Chris Hartgerink, a 25-year-old doctoral candidate. According to him, the idea was to produce immediate benefits for the field of psychology that would not arise if only aggregate results were disclosed. The fact that false positives and unimportant errors were found does not undermine this objective, Hartgerink says. He and Professor Marcel van Assen are now developing another kind of software, one designed to determine whether a scientific article contains fabricated data. To test the method’s effectiveness, the two asked colleagues to send in deliberately altered versions of their papers, and those papers are now being reviewed.

Among psychology researchers, some believe Statcheck is a useful tool for improving the quality of scientific publications. Simine Vazire, a researcher in the Department of Psychology at the University of California, Davis, predicts that authors in the field will be even more careful with their statistical analyses now that they know a program can identify carelessness, errors, and fraud.

Tilburg University, where the program was developed, was the scene of a scientific misconduct scandal. In September 2011, the university dismissed one of its most productive researchers, social psychology professor Diederik Stapel, accused of falsifying more than 30 scientific articles over eight years. An investigation proved that he fabricated data, misled coauthors and even intimidated anyone who questioned him (see Pesquisa FAPESP Issue nº 190).

Stapel was Chris Hartgerink’s professor during his undergraduate studies and served as a kind of mentor to him, even hiring him as a research assistant. When the scandal broke, Hartgerink felt disoriented. “He inspired me to become enthusiastic about research,” Hartgerink told The Guardian. The bitter experience led some of the researchers from the group that investigated Stapel’s fraud to establish the Meta-Research Center, which studies scientific misconduct. Hartgerink joined the group in 2013, where he is completing his PhD project on methods to detect fabricated research data.
