Science by the numbers: Researchers ask, ‘How true are our findings?’
Research shows that listening to The Beatles’ song “When I’m 64” makes people younger – by almost a year and a half.
“You could argue that we found the fountain of youth,” said Joe Simmons, one of the researchers at the University of Pennsylvania who published the finding. “All you have to do is listen to this song about old age, and it makes you younger.”
Skeptical? You should be.
“Obviously, that’s not true,” Simmons said.
The finding was statistically significant and published in a well-respected journal, but it was totally bogus.
Simmons and his team from Wharton and the University of California, Berkeley, published the study to show how easy it can be to land on bad findings in academic research. They argue some of the questionable research practices that allowed them to produce that finding may be skewing other, less obviously absurd-sounding research.
The paper, published online last fall, made a splash. Now, it is part of a larger dialogue within psychology — and medicine and other fields — about whether poor research methods have led to a preponderance of false-positive findings.
Creating a fountain of youth
The group of University of Pennsylvania undergraduates that listened to “When I’m 64” during the experiment happened to be younger than a group that listened to a second song. But Simmons manipulated the data to show that the song was effectively causing the age difference.
Simmons collected data from a third test condition, participants who listened to the children’s song “Hot Potato,” and threw it out when it turned out not to be significant. He also collected a slew of additional data, including the participants’ mothers’ ages, fathers’ ages, and whether they would go to an early bird special, all so he could compare many different variables to each other, hold different conditions constant, and run the test for statistical significance many times based on each of these tweaks.
“We can do that enough times that we’re likely, extremely likely, to find at least one of those analyses to be statistically significant,” Simmons said.
Why does running many different tests increase the odds of finding something “statistically significant”? It has to do with how scientists commonly determine statistical significance, with a measurement called the p-value.
The p-value is used to determine whether a difference that is seen in a test group is due to chance — one group happened to be younger than the other, for example — or due to the treatment or experimental variable. In this case, listening to “When I’m 64.”
How it works
“Suppose in truth there was no difference. In fact it’s all just random,” said Ed Gracely, an associate professor of biostatistics and epidemiology at Drexel University. “How likely would we have been to get a difference as good as we got or better purely by chance? What we do is calculate a probability for that.”
That probability is denoted by the p-value. The standard cutoff for statistical significance is a p-value of 0.05 or less, meaning a probability of 5 percent or less.
“So what we’re saying is 5 percent of the time, even when there really is no effect, we would declare statistical significance,” Gracely said.
That is basically a false-positive rate of 5 percent. When there is no real effect, five tests out of a hundred will still come out statistically significant. Each of those is a false positive.
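To make Gracely's question concrete, here is a minimal sketch of one way to estimate a p-value, a permutation test, which reshuffles the group labels many times and counts how often chance alone produces a difference at least as large as the one observed. This is not the method used in the Simmons study. The function name `permutation_p_value` and the ages below are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        # Pretend group membership is random: reshuffle and re-split.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Hypothetical ages, for illustration only (not data from the study).
song_group = [19.8, 20.1, 20.4, 19.5, 20.9, 20.2, 19.9, 20.6]
control_group = [20.3, 21.0, 20.7, 21.4, 20.5, 21.2, 20.8, 21.1]

p = permutation_p_value(song_group, control_group)
print(f"p = {p:.4f}", "(significant)" if p <= 0.05 else "(not significant)")
```

A small p-value here says only that random shuffling rarely produces a gap this large. It does not say what caused the gap.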
The problem Simmons is trying to point out with his paper?
“Five percent of the time happens 5 percent of the time,” Simmons said.
If a researcher runs 20 independent analyses on data with no real effect, there is roughly a 64 percent chance that at least one will come out “statistically significant,” purely by chance.
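The arithmetic behind the 20-analyses example can be sketched as follows. The sketch assumes the analyses are independent and that there is no real effect anywhere, in which case each analysis has a 5 percent chance of crossing the cutoff. Real analyses of the same data set are usually correlated, so the numbers are illustrative only. The function names are hypothetical.

```python
import random

def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_tests

def simulate(n_tests, alpha=0.05, n_studies=20_000, seed=0):
    """Check the formula by simulation.

    With no real effect, a p-value is equally likely to land anywhere
    between 0 and 1, so each test is modeled as one uniform random draw.
    """
    rng = random.Random(seed)
    studies_with_hit = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return studies_with_hit / n_studies

exact = chance_of_any_false_positive(20)
simulated = simulate(20)
print(f"formula: {exact:.3f}, simulation: {simulated:.3f}")
```

Under these assumptions, the more analyses a researcher tries, the closer the chance of at least one false positive gets to certainty.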
That is why it is bad practice to run extra experiments that boost possible false-positive rates, like having people listen to “Hot Potato,” and not report them. And why it is bad practice to collect a slew of data points to control for and compare against each other, if that is not what the original research design stipulated.
More attention on research methodology
In the end, comparing “When I’m 64” to the control song “Kalimba” and holding father’s age constant yielded the statistically significant result Simmons and his team were aiming for.
Simmons argues that in an experiment with a more reasonable conclusion, these techniques might have slipped past peer reviewers.
“When you are given papers to review, you’re busy,” Simmons said. “You don’t always question every finding, especially if everything else looks good.”
Simmons said his study is an extreme example, used to make a point. Not everyone in academia agrees on how much these techniques are used, or how big an impact they may actually have on published research findings.
“Different people in our field have strongly different opinions about how common these various things are,” Simmons said.
Still, even without consensus, this issue has been gaining attention across fields.
A survey published this year of more than 2,000 research psychologists found that some of these questionable research practices are relatively common. The results are controversial because of a low response rate to the survey and questions some call “leading,” but the findings as published are still startling.
About half of the researchers queried decided whether to collect more data after looking to see whether the results of their testing so far were significant. Almost two-thirds failed to report all of a study’s dependent measures, about half selectively reported studies that “worked,” and about one in seven had stopped collecting data earlier than planned, when they arrived at significant findings.
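Why does checking results along the way matter? A rough simulation can show it. The sketch below assumes a deliberately simple setup that is not taken from the survey or from any study mentioned here: each simulated study measures up to 100 participants whose scores are pure noise, and the researcher runs a basic z-test either once at the end or after every 10 participants, stopping at the first “significant” result. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

def false_positive_rate(peek_every, max_n=100, alpha=0.05,
                        n_studies=5_000, seed=0):
    """Share of no-effect studies that end up declared 'significant'."""
    rng = random.Random(seed)
    z_cut = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    hits = 0
    for _ in range(n_studies):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # no real effect: true mean is 0
            at_checkpoint = n % peek_every == 0 or n == max_n
            # z statistic for a mean of n draws with known variance 1
            if at_checkpoint and abs(total / n ** 0.5) > z_cut:
                hits += 1
                break  # stop collecting data once the result "works"
    return hits / n_studies

planned = false_positive_rate(peek_every=100)  # one test, at the planned end
peeking = false_positive_rate(peek_every=10)   # test after every 10 participants
print(f"test once: {planned:.3f}, peek every 10: {peeking:.3f}")
```

In this toy setup, testing once keeps the false-positive rate near the nominal 5 percent, while repeated peeking pushes it well above that, because every extra look is another chance for noise to cross the cutoff.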
Not just an issue in psychology
Reformers have emerged in other fields, perhaps most notably Stanford University’s Dr. John Ioannidis. His 2005 paper, “Why Most Published Research Findings Are False,” is still widely cited in academic literature and the national media.
Ioannidis has also examined once generally accepted medical evidence, such as using vitamin E and hormone replacement therapy to reduce the risk of heart disease, and the evidence that later reversed those findings.
The work of Ioannidis, Simmons and others like them has sparked conversations about how to address current problems in research methodology.
“I think that, historically, if somebody had some outstanding, exciting breakthrough, usually journals did not question much about that data,” said Dr. David Chen, a Fox Chase surgical oncologist who reviews articles for peer-reviewed urology journals.
Chen said that is changing.
“Now, they are much more demanding that not only are results published, but access to the data is available for review,” Chen said.
Starting next month, the respected British Medical Journal will no longer publish the results of clinical trials unless drug companies agree to provide detailed study data. The journal hopes to nudge other medical journals to follow suit. The journal Psychological Science is doing something similar, in a voluntary pilot program for now.
The journal’s editor Eric Eich, also a professor at the University of British Columbia, said other groups are systematically trying to reproduce past experiments to see if they can be replicated.
“Most research in psychology, or pretty well any other field, it’s all geared toward discovery,” said Eich. “People get kudos for discovering new things. It tends to be undervalued trying to replicate someone else’s findings.”
Eich is also developing educational materials for researchers, and advocating for using measures other than the p-value to test for significance when appropriate.
While academics sort out issues of transparency, disclosure and research methodology, Chen recommends a healthy skepticism for the rest of us.
“Everyone hopes there’s going to be some magic bullet wonder pill which fixes diabetes or heart disease or something,” Chen said. “And it’s probably not possible, just because the body’s so complex, that we will have such a simple answer.”