How Reliable Are Psychology Studies?

A new study shows that the field suffers from a reproducibility problem, but the extent of the issue is still hard to nail down.

No one is entirely clear on how Brian Nosek pulled it off, including Nosek himself. Over the last three years, the psychologist from the University of Virginia persuaded some 270 of his peers to channel their free time into repeating 100 published psychological experiments to see if they could get the same results a second time around. There would be no glory, no empirical eurekas, no breaking of fresh ground. Instead, this initiative—the Reproducibility Project—would be the first big systematic attempt to answer a question that has been vexing psychologists for years, if not decades: What proportion of results in their field are reliable?

A few signs hinted that the reliable proportion might be unnervingly small. Psychology has recently been rocked by several high-profile controversies, including the publication of studies that documented impossible effects like precognition, failures to replicate the results of classic textbook experiments, and some prominent cases of outright fraud.

The causes of such problems have been well-documented. Like many sciences, psychology suffers from publication bias, where journals tend to only publish positive results (that is, those that confirm the researchers’ hypothesis), and negative results are left to linger in file drawers. On top of that, several questionable practices have become common, even accepted. A researcher might, for example, check to see if they had a statistically significant result before deciding whether to collect more data. Or they might only report the results of “successful” experiments. These acts, known colloquially as p-hacking, are attempts to torture positive results out of ambiguous data. They may be done innocently, but they flood the literature with snazzy but ultimately false “discoveries.”
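To see why that kind of peeking matters, consider a toy simulation, a sketch with invented numbers rather than anything from the paper: two groups are drawn from identical distributions, so there is no real effect at all, yet repeatedly checking for significance and topping up the sample whenever the result falls short inflates the rate of false “discoveries.”

```python
# A toy demonstration of optional stopping (one form of p-hacking).
# There is no real effect here, yet peeking at the data and adding more
# participants whenever p > 0.05 pushes the false-positive rate above
# the nominal 5 percent. All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def peeking_study(start_n=20, step=10, max_n=100):
    a = list(rng.normal(0, 1, start_n))   # both groups drawn from the SAME distribution
    b = list(rng.normal(0, 1, start_n))
    while True:
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True                    # "significant" result: stop and report
        if len(a) >= max_n:
            return False                   # give up
        a.extend(rng.normal(0, 1, step))   # otherwise, collect more data and peek again
        b.extend(rng.normal(0, 1, step))

runs = 2000
false_positives = sum(peeking_study() for _ in range(runs))
print(f"false-positive rate with peeking: {false_positives / runs:.2%}")
```

In a setup like this, the share of spurious “significant” results typically comes out well above the advertised 5 percent, even though nothing real is being measured.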

In the last few years, psychologists have become increasingly aware of, and unsettled by, these problems. Some have created an informal movement to draw attention to the “reproducibility crisis” that threatens the credibility of their field. Others have argued that no such crisis exists, and accused critics of being second-stringers and bullies, and of favoring joyless grousing over important science. In the midst of this often acrimonious debate, Nosek has always been a level-headed figure, who gained the respect of both sides. As such, the results of the Reproducibility Project, published today in Science, have been hotly anticipated.

They make for grim reading. Although 97 percent of the 100 studies originally reported statistically significant results, just 36 percent of the replications did.

Does this mean that only a third of psychology results are “true”? Not quite. A result is typically said to be statistically significant if its p-value is less than 0.05—briefly, this means that if there were actually no real effect, your odds of getting results at least as extreme by chance alone would be less than 1 in 20. This creates a sharp cut-off at an arbitrary (some would say meaningless) threshold, in which an experiment that just clears the 0.05 benchmark is somehow magically more “successful” than one that narrowly fails to meet it.
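To make that threshold concrete, here is a minimal sketch of the kind of test behind such numbers, using invented data rather than anything from the project:

```python
# A minimal sketch of a two-group significance test, with invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=5.0, scale=2.0, size=30)    # hypothetical control-group scores
treatment = rng.normal(loc=6.2, scale=2.0, size=30)  # hypothetical treatment-group scores

result = stats.ttest_ind(treatment, control)
print(f"p = {result.pvalue:.3f}")
print("statistically significant" if result.pvalue < 0.05 else "not significant")
```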

So Nosek’s team looked beyond statistical significance. They also considered the effect sizes of the studies. These measure the strength of a phenomenon; if your experiment shows that red lights make people angry, the effect size tells you how much angrier they get. And again, the results were worrisome. On average, the effect sizes of the replications were half those of the originals.
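Effect size is easy to compute alongside the p-value. One common measure for two-group comparisons is Cohen’s d, the difference between the group means divided by their pooled standard deviation. A minimal sketch, again with invented numbers riffing on the red-light example:

```python
# Cohen's d: difference in means scaled by the pooled standard deviation.
# Data here are invented for illustration.
import numpy as np

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
anger_red   = rng.normal(6.0, 2.0, 40)   # hypothetical anger ratings under red light
anger_white = rng.normal(5.0, 2.0, 40)   # hypothetical anger ratings under white light
print(f"Cohen's d = {cohens_d(anger_red, anger_white):.2f}")
```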

“The success rate is lower than I would have thought,” says John Ioannidis from Stanford University, whose classic theoretical paper Why Most Published Research Findings Are False has been a lightning rod for the reproducibility movement. “I feel bad to see that some of my predictions have been validated. I wish they’d been proven wrong.”

Nosek, a self-described “congenital optimist,” is less upset. The results aren’t great, but he takes them as a sign that psychologists are leading the way in tackling these problems. “It has been a fantastic experience, all this common energy around a very specific goal,” he says. “The collaborators all contributed their time to the project knowing that they wouldn’t get any credit for being 253rd author.”

Jason Mitchell from Harvard University, who has written critically about the replication movement, agrees. “The work is heroic,” he says. “The sheer number of people involved and the care with which it was carried out is just astonishing. This is an example of science working as it should in being very self-critical and questioning everything, especially its own assumptions, methods, and findings.”

But even though the project is historic in scope, its results are still hard to interpret. Let’s say that only a third of studies are replicable. What does that mean? It seems low, but is it? “Science needs to involve taking risks and pushing frontiers, so even an optimal science will generate false positives,” says Sanjay Srivastava, an associate professor of psychology at the University of Oregon. “If 36 percent of replications are getting statistically significant results, it is not at all clear what that number should be.”

It is similarly hard to interpret failed replications. Consider the paper’s most controversial finding: that studies from cognitive psychology (which looks at attention, memory, learning, and the like) were twice as likely to replicate as those from social psychology (which looks at how people influence each other). “It was, for me, inconvenient,” says Nosek. “It encourages squabbling. Now you’ll get cognitive people saying ‘Social’s a problem’ and social psychologists saying, ‘You jerks!’”

Nosek explains that the effect sizes from both disciplines declined with replication; it’s just that cognitive experiments find larger effects than social ones to begin with, because social psychologists wrestle with problems that are more sensitive to context. “How the eye works is probably very consistent across people but how people react to self-esteem threat will vary a lot,” says Nosek. Cognitive experiments also tend to test the same people under different conditions (a within-subject design) while social experiments tend to compare different people under different conditions (a between-subject design). Again, people vary so much that social-psychology experiments can struggle to find signals amid the noise.
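A quick toy simulation (all parameters invented, not drawn from the paper) makes the design point visible: the same modest effect is tested once within subjects, where each person’s baseline cancels out, and once between subjects, where person-to-person variation stays in as noise.

```python
# Toy simulation contrasting within- and between-subject designs when
# people vary a lot from one another. All parameters are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect, person_sd, noise_sd = 30, 0.8, 2.0, 1.0

def one_experiment():
    # Within-subject: the same people are measured in both conditions,
    # so each person's stable baseline cancels out of the comparison.
    baseline = rng.normal(0, person_sd, n)
    cond_a = baseline + rng.normal(0, noise_sd, n)
    cond_b = baseline + effect + rng.normal(0, noise_sd, n)
    p_within = stats.ttest_rel(cond_a, cond_b).pvalue

    # Between-subject: two separate groups of people, so individual
    # baseline differences stay in as noise.
    group_a = rng.normal(0, person_sd, n) + rng.normal(0, noise_sd, n)
    group_b = rng.normal(0, person_sd, n) + effect + rng.normal(0, noise_sd, n)
    p_between = stats.ttest_ind(group_a, group_b).pvalue

    return p_within < 0.05, p_between < 0.05

results = np.array([one_experiment() for _ in range(2000)])
print("share significant, within-subject: ", results[:, 0].mean())
print("share significant, between-subject:", results[:, 1].mean())
```

With numbers like these, the within-subject version flags the effect in most simulated runs, while the between-subject version does so far less often; it illustrates, rather than proves, the point Nosek makes.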

More generally, failed replications don’t discredit the original studies, any more than successful ones enshrine them as truth. There are many reasons why two attempts to run the same experiment might produce different results. There’s random chance. The original might be flawed. So might the replication. There could be subtle differences in the people who volunteered for both experiments, or the way in which those experiments were done. And, to be blunt, the replicating team might simply lack the nous or technical skill to pull off the original experiments.

Indeed, Jason Mitchell wonders how good the Reproducibility Project’s consortium would be at replicating well-known phenomena, like the Stroop effect (people take longer to name the color of a word if it is printed in mismatching ink) or the endowment effect (people place more value on things they own). “Would it be better than 36 percent or worse? We don’t know and that’s the problem,” he says. “We can’t interpret whether 36 percent is good, bad, or right on the money.”

Mitchell also worries that the kind of researchers who are drawn to this kind of project may be biased towards “disproving” the original findings. How could you tell if they are “unconsciously sabotaging their own replication efforts to bring about the (negative) result they prefer?” he asks.

In several ways, according to Nosek. Most of the replicators worked with the scientists behind the original studies, who provided materials, advice, and support—only 3 out of 100 refused to help. The teams pre-registered their plans—that is, they decided on every detail of their methods and analyses beforehand to remove the possibility of p-hacking. Nosek also stopped the teams from following vendettas by offering them a limited buffet of studies to pick from: only those published in the first issue of three major psychology journals in 2008. Finally, he says that most of the teams that failed to replicate their assigned studies were surprised—even disappointed. “Anecdotally, I observed that as they were assigned to a task, they got invested in their particular effect,” says Nosek. “They got excited. Most of them expected theirs to work out.”

And yet, they largely didn’t. “This was surprising to most people,” says Nosek. “This doesn’t mean the originals are wrong or false positives. There may be other reasons why they didn’t replicate, but this does mean that we don’t understand those reasons as well as we think we do. We can’t ignore that. We have data that says: We can do better.”

What does doing better look like? To Dorothy Bishop, a professor of developmental neuropsychology at the University of Oxford, it begins with public pre-registration of research plans. “Simply put, if you are required to specify in advance what your hypothesis is and how you plan to test it, then there is no wiggle room for cherry-picking the most eye-catching results after you have done the study,” she says. Psychologists should also make more efforts to run larger studies, which are less likely to throw up spurious results by chance. Geneticists, Bishop says, learned this lesson after many early genetic variants that were linked to human diseases and traits turned out to be phantoms; their solution was to join forces to do large collaborative studies, involving many institutes and huge numbers of volunteers. These steps would reduce the number of false positives that marble the literature.
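The logic behind “bigger is better” can be sketched with a back-of-the-envelope calculation, using invented assumptions about how often tested hypotheses are true: the better powered a field’s studies, the larger the share of its “significant” findings that reflect real effects.

```python
# Positive predictive value: of the "significant" findings a field produces,
# what fraction reflect real effects? The assumptions below are invented
# for illustration (10% of tested hypotheses true, alpha = 0.05).
alpha, prior = 0.05, 0.10

def ppv(power):
    true_hits = power * prior          # real effects correctly detected
    false_hits = alpha * (1 - prior)   # flukes among the null effects
    return true_hits / (true_hits + false_hits)

print(f"small, underpowered studies (power 0.3): PPV = {ppv(0.3):.2f}")
print(f"large, well-powered studies (power 0.9): PPV = {ppv(0.9):.2f}")
```

Under these toy numbers, only about 40 percent of significant findings from underpowered studies would be real, versus roughly two-thirds from well-powered ones.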

To help detect the ones that slip through, researchers could describe their methods in more detail, and upload any materials or code to open databases, making it trivially easy for others to check their work. “We also need to be better at amassing the information we already have,” adds Bobbie Spellman from the University of Virginia. Scientists already check each other’s work as part of their daily practice, she says. But much of that effort is invisible to the wider world because journals have been loath to publish the results of replications.

Change is already in the air. “Journals, funders, and scientists are paying a lot more attention to replication, to statistical power, to p-hacking, all of it,” says Srivastava. He notes that the studies that were targeted in the Reproducibility Project all come from a time before these changes. “Has psychology learned and gotten better?” he wonders.

One would hope so. After all, several journals have started to publish the results of pre-registered studies. In a few cases, scientists from many labs have worked together to jointly replicate controversial earlier studies. Meanwhile, Nosek’s own brainchild, the Center for Open Science established in 2013, has been busy developing standards for transparency and openness. It is also channelling $1 million of funding into a pre-registration challenge, where the first 1,000 teams who pre-register and publish their studies will receive $1,000 awards. “It’s to stimulate people to try pre-registration for the first time,” he says.

The Center is also working with scientists from other fields, including ecology and computer science, to address their own concerns about reproducibility. Nosek’s colleague Tim Errington, for example, is leading an effort to replicate the results of 50 high-profile cancer biology studies. “I really hope that this isn’t a one-off but a maturing area of research in its own right,” Nosek says.

That’s all in the future, though. For now? “I will be having a drink,” he says.

Ed Yong is a former staff writer at The Atlantic. He won the Pulitzer Prize for Explanatory Reporting for his coverage of the COVID-19 pandemic.