Artificial Intelligence should help vet new research
Have you ever heard that reading literature for three minutes makes people more empathetic, or that holding a heavier clipboard makes a manager more likely to hire a job candidate? The popular press has had a love affair with social science findings like these. But they might not be true.
Attempts to replicate such results led to a shocking discovery in 2015: fewer than 40% of the findings tested from peer-reviewed psychology journals could be reproduced. Similarly dismal results emerged in economics and in some biomedical research, including cancer biology.
Since then, researchers have been trying to find better ways to sort the treasure from the trash. A few years ago, one group of social scientists showed that prediction markets — asking people to make bets on a paper's validity — worked far better than standard peer review. But that required increasing the number of people who vet a paper from three to 100. That's not a scalable solution.
Now, machine learning programs seem to be getting equally good results — and more are coming thanks to millions of dollars being invested in the effort by the Defense Advanced Research Projects Agency (DARPA). Just as ChatGPT is trained using lots of text, these paper-evaluating systems are trained using data gathered from painstaking attempts to replicate hundreds of studies. The systems are then tested using studies they haven't seen.
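As a rough sketch of that train-and-test setup (the column doesn't describe the actual features or models, so everything below is an illustrative assumption): a text classifier can be fit on papers whose replication outcome is already known, then scored only on papers it has never seen.

```python
# A minimal sketch of the train/test idea described above, NOT the actual
# DARPA-funded systems: the placeholder data, the features (TF-IDF over
# paper text) and the model (logistic regression) are all assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder corpus: the text of each paper and whether a painstaking
# replication attempt succeeded (1) or failed (0).
papers = [f"full text of hypothetical study {i}" for i in range(8)]
replicated = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out studies the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    papers, replicated, test_size=0.25, stratify=replicated, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("accuracy on unseen studies:", accuracy_score(y_test, model.predict(X_test)))
```

The held-out test mirrors the column's description: the systems are judged only on studies they were never trained on.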
Early results suggest that the robots are just as good at filtering out the noise as 100 human experts — and apparently much better than human editors at scientific journals and newspapers.
The more improbable studies are the ones that garner the most press attention, says Brian Uzzi of Northwestern University, who led a recent machine-learning study published in the Proceedings of the National Academy of Sciences. Many of those attention-grabbing studies backed a view, now discarded, that people are buffeted by seemingly irrelevant stimuli in measurable, predictable ways.
Uzzi traces the replication problem back to 2011, when a prominent journal published a paper claiming that ordinary experimental subjects could see the future — that is, they had ESP.
The research had followed methods that were standard for the field, which got at least a few people worried that something was wrong with those techniques. Some critics pointed to a misuse of statistics — selectively analyzing data until a significant result appears, a practice known as p-hacking. But p-hacking wasn't the problem in most irreproducible papers, Uzzi says.
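As a toy illustration of what p-hacking does (a made-up simulation, not drawn from Uzzi's study): when two groups are drawn from identical distributions, testing many arbitrary outcomes and keeping only the best p-value turns up "significant" results far more often than the nominal 5%.

```python
# Toy p-hacking simulation: there is no real effect, yet checking 20
# unrelated outcomes per experiment and keeping the smallest p-value
# yields a "discovery" in roughly 64% of experiments.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_outcomes, n_subjects = 1000, 20, 30
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution: the true effect is zero.
    best_p = min(
        ttest_ind(rng.normal(size=n_subjects), rng.normal(size=n_subjects)).pvalue
        for _ in range(n_outcomes)  # measure many outcomes, keep the best
    )
    false_positives += best_p < 0.05

print(f"'significant' findings: {false_positives / n_experiments:.0%}")
```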
The deeper problem was that in fields involving human behavior, it's not always obvious what makes a claim extraordinary. That makes it tougher to follow the mantra that extraordinary claims require extraordinary evidence — an idea attributed to David Hume and Carl Sagan.
For a counterexample, consider the physical sciences. Any new finding that violates quantum mechanics or general relativity is almost always subject to extra scrutiny — as when physicists quickly put to rest a claim that particles were moving faster than the speed of light. Human behavior doesn't fall into the same theoretical framework.
In 2018, experiments with prediction markets showed that when researchers asked 100 of their fellow social scientists to bet on whether an array of results would replicate, they got it right 75% of the time. That's the same success rate Uzzi found with machine learning, but the machine worked a lot faster.
Prediction markets and the machine learning system both went beyond what normal peer review provides by attaching confidence levels to their judgments. And both were right 100% of the time for papers that fell within the top 15% of the confidence range.
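A sketch of that confidence cut (the numbers below are invented for illustration): rank each prediction by how far its score sits from 50-50, then check accuracy within the most confident 15%.

```python
# Illustrative only: hypothetical predicted probabilities that each paper
# replicates, alongside the true outcome of an actual replication attempt.
import numpy as np

confidence = np.array([0.97, 0.91, 0.88, 0.74, 0.62, 0.55, 0.41, 0.30, 0.22, 0.08])
replicated = np.array([1,    1,    1,    1,    0,    1,    0,    0,    0,    0])
predicted = (confidence >= 0.5).astype(int)

# Distance from 0.5 is one simple way to express how confident each call is.
certainty = np.abs(confidence - 0.5)
top = np.argsort(certainty)[-max(1, int(0.15 * len(certainty))):]

print("overall accuracy:", (predicted == replicated).mean())
print("accuracy in the most confident 15%:", (predicted[top] == replicated[top]).mean())
```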
University of Virginia psychologist Brian Nosek, who spearheaded the initial efforts to understand what became known as the replication crisis, said a larger effort funded by DARPA will produce several different evaluation systems, some of which will focus on external factors such as track records of the authors and where a paper was cited.
Would such a system make it harder for new researchers to break in? Not necessarily — it might be a tool they can use to protect themselves from being wrong and having someone else point it out later.
Anna Dreber, a researcher at the Stockholm School of Economics, said she could have used that kind of help when she led a study that seemed to show a correlation between certain genes and financial risk taking. It couldn't be replicated, and she regrets the amount of time she spent on the project. Now she's leading efforts to improve the reliability of published work in economics.
If researchers themselves don't recognize a problem, machine learning might be useful further along the chain — helping journal editors, journalists and policymakers evaluate research.
And maybe, in the future, it can be used to counter scientific papers spit out by large language models such as ChatGPT. Machines might be taught both to detect inaccuracies and to generate glib misinformation. "It's sort of like the Matrix — in the later parts of the series, there were the good machines and the bad machines fighting against each other," Nosek said.
That wouldn't be a bad outcome. If human beings aren't yet quite capable of understanding how our minds work, at least we're capable of inventing machines that can help us figure it out.
This column does not necessarily reflect the opinion of the editorial board or Bloomberg LP and its owners. Faye Flam is a Bloomberg Opinion columnist covering science. She is host of the "Follow the Science" podcast.


