The Trouble with Psycho

Huh, and here I thought we had problems in (macro)economics because the bulk of the mainstream ignores reality and, in its macro models, simulates reality on the basis of imaginary behavioral patterns of representative agents. Then I came across the blog post below by the statistician Andrew Gelman on the state of psychology. Holy cow! That is a completely different world, a pure Wild West as far as respect for the rules of research goes. Suddenly I see that in experimental psychology they not only massively manipulate data to get statistically significant results but, even worse, they construct “open-ended theories”, meaning theories that any data whatsoever will confirm.

Doubts about the results of psychological studies published in prestigious academic journals with a serious peer-review process had been surfacing since the 1960s, but in 2011 a revolution took place, when critics (younger researchers who had not (yet) been given a place at the trough) began publicly posting critiques of published studies in alternative media (blogs) and on social networks. Now the whole place is a mess. Everything is up in the air. You can practically no longer trust any published psychological study that validates a concept with data and quotes t-statistics and p-values along the way.

And here I thought we had problems in economics…

Below is Gelman’s chronological overview of how poor research practices in psychology were exposed. Then follow the link to read Gelman’s commentary on these events and on how the new media (blogs) effectively saved science in the field of psychology. Suddenly, criticism of articles by established authorities in prestigious academic journals could even see the light of day.

Here’s what I see as the timeline of important events:

1960s-1970s: Paul Meehl argues that the standard paradigm of experimental psychology doesn’t work, that “a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of ‘an integrated research program,’ without ever once refuting or corroborating so much as a single strand of the network.”

Psychologists all knew who Paul Meehl was, but they pretty much ignored his warnings. For example, Robert Rosenthal wrote an influential paper on the “file drawer problem,” but if anything this distracts from the larger problems of the find-statistical-significance-any-way-you-can-and-declare-victory paradigm.

1960s: Jacob Cohen studies statistical power, spreading the idea that design and data collection are central to good research in psychology, and culminating in his book, Statistical Power Analysis for the Behavioral Sciences. The research community incorporates Cohen’s methods and terminology into its practice but sidesteps the most important issue by drastically overestimating real-world effect sizes.
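
To make the point about overestimated effect sizes concrete, here is a minimal power-analysis sketch (my illustration, not Cohen’s own example; it assumes a two-sample t-test and invented effect sizes, and uses statsmodels): a study sized for 80% power under an optimistic effect size is badly underpowered if the true effect is much smaller.

```python
# Rough sketch (invented effect sizes, not Cohen's own example): a study
# planned for 80% power at an assumed effect of d = 0.8 has very little
# power if the real-world effect is only d = 0.2.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_planned = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)       # about 26 per group
true_power = analysis.solve_power(effect_size=0.2, alpha=0.05, nobs1=n_planned)
print(f"planned n per group: {n_planned:.0f}, power if true d = 0.2: {true_power:.2f}")  # about 0.11
```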

1971: Tversky and Kahneman write “Belief in the law of small numbers,” one of their first studies of persistent biases in human cognition. This early work focuses on researchers’ misunderstanding of uncertainty and variation (particularly but not limited to p-values and statistical significance), but they and their colleagues soon move into more general lines of inquiry and don’t fully recognize the implications of their work for research practice.

1980s-1990s: Null hypothesis significance testing becomes increasingly controversial within the world of psychology. Unfortunately this was framed more as a methods question than a research question, and I think the idea was that research protocols are just fine, and that all that was needed was a tweaking of the analysis. I didn’t see general airing of Meehl-like conjectures that much published research was useless.

2006: I first hear about the work of Satoshi Kanazawa, a sociologist who published a series of papers with provocative claims (“Engineers have more sons, nurses have more daughters,” etc.), each of which turns out to be based on some statistical error. I was of course already aware that statistical errors exist, but I hadn’t fully come to terms with the idea that this particular research program, and others like it, were dead on arrival because of too low a signal-to-noise ratio. It still seemed a problem with statistical analysis, to be resolved one error at a time.

2008: Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler write a controversial article, “Voodoo correlations in social neuroscience,” arguing not just that some published papers have technical problems but also that these statistical problems are distorting the research field, and that many prominent published claims in the area are not to be trusted. This is moving into Meehl territory.

2008 also saw the start of the blog Neuroskeptic, which started with the usual soft targets (prayer studies, vaccine deniers), then started to criticize science hype (“I’d like to make it clear that I’m not out to criticize the paper itself or the authors . . . I think the data from this study are valuable and interesting – to a specialist. What concerns me is the way in which this study and others like it are reported, and indeed the fact that they are reported as news at all”), but soon moved to larger criticisms of the field. I don’t know that the Neuroskeptic blog per se was such a big deal but it’s symptomatic of a larger shift of science-opinion blogging away from traditional political topics toward internal criticism.

2011: Joseph Simmons, Leif Nelson, and Uri Simonsohn publish a paper, “False-positive psychology,” in Psychological Science introducing the useful term “researcher degrees of freedom.” Later they come up with the term p-hacking, and Eric Loken and I speak of the garden of forking paths to describe the processes by which researcher degrees of freedom are employed to attain statistical significance. The paper by Simmons et al. is also notable in its punning title, not just questioning the claims of the subfield of positive psychology but also mocking it.
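
To see how researcher degrees of freedom inflate false positives, here is a minimal simulation (my sketch, not from the Simmons et al. paper; the outcome counts and sample sizes are invented): with no true effect anywhere, a researcher who measures five outcomes and reports whichever one clears p < .05 finds “significance” far more often than 5% of the time.

```python
# Minimal simulation (not from the original paper): under a true null effect,
# analyzing several outcomes and reporting whichever reaches p < .05
# inflates the false-positive rate far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 5000, 20, 5
false_positives = 0

for _ in range(n_sims):
    # Two groups, several outcomes, no true difference anywhere.
    treatment = rng.normal(size=(n_per_group, n_outcomes))
    control = rng.normal(size=(n_per_group, n_outcomes))
    pvals = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
             for j in range(n_outcomes)]
    if min(pvals) < 0.05:   # "researcher degrees of freedom": report the best one
        false_positives += 1

print(f"nominal rate: 5%, simulated rate: {false_positives / n_sims:.1%}")
# Roughly 1 - 0.95**5, i.e. about 23%, when the five outcomes are independent.
```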

That same year, Simonsohn also publishes a paper shooting down the dentist-named-Dennis paper, not a major moment in the history of psychology but important to me because that was a paper whose conclusions I’d uncritically accepted when it had come out. I too had been unaware of the fundamental weakness of so much empirical research.

2011: Daryl Bem publishes his article, “Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect,” in a top journal in psychology. Not too many people thought Bem had discovered ESP but there was a general impression that his work was basically solid, and thus this was presented as a concern for psychology research. For example, the New York Times reported:

The editor of the journal, Charles Judd, a psychologist at the University of Colorado, said the paper went through the journal’s regular review process. “Four reviewers made comments on the manuscript,” he said, “and these are very trusted people.”

In retrospect, Bem’s paper had huge, obvious multiple comparisons problems—the editor and his four reviewers just didn’t know what to look for—but back in 2011 we weren’t so good at noticing this sort of thing.

At this point, certain earlier work was seen to fit into this larger pattern, that certain methodological flaws in standard statistical practice were not merely isolated mistakes or even patterns of mistakes, but that they could be doing serious damage to the scientific process. Some relevant documents here are John Ioannidis’s 2005 paper, “Why most published research findings are false,” and Nicholas Christakis and James Fowler’s paper from 2007 claiming that obesity is contagious. Ioannidis’s paper is now a classic, but when it came out I don’t think most of us thought through its larger implications; the paper by Christakis and Fowler is no longer being taken seriously but back in the day it was a big deal. My point is, these events from 2005 and 2007 fit into our storyline but were not fully recognized as such at the time. It was Bem, perhaps, who kicked us all into the realization that bad work could be the rule, not the exception.

So, as of early 2011, there’s a sense that something’s wrong, but it’s not so clear to people how wrong things are, and observers (myself included) remain unaware of the ubiquity, indeed the obviousness, of fatal multiple comparisons problems in so much published research. Or, I should say, the deadly combination of weak theory being supported almost entirely by statistically significant results which themselves are the product of uncontrolled researcher degrees of freedom.

2011: Various episodes of scientific misconduct hit the news. Diederik Stapel is kicked out of the psychology department at Tilburg University and Marc Hauser leaves the psychology department at Harvard. These and other episodes bring attention to the Retraction Watch blog. I see a connection between scientific fraud, sloppiness, and plain old incompetence: in all cases I see researchers who are true believers in their hypotheses, which in turn are vague enough to support any evidence thrown at them. Recall Clarke’s Law.

2012: Gregory Francis publishes “Too good to be true,” leading off a series of papers arguing that repeated statistically significant results (that is, standard practice in published psychology papers) can be a sign of selection bias. PubPeer starts up.
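
A back-of-the-envelope version of the “too good to be true” argument (my invented numbers, not Francis’s actual test): if each experiment in a paper has, say, 50% power, then a clean sweep of ten significant results is itself highly improbable, which points toward selective reporting.

```python
# Back-of-the-envelope illustration (invented numbers, not Francis's test):
# with 50% power per experiment, ten significant results in a row are
# themselves very unlikely, hence "too good to be true".
power, k = 0.5, 10
print(f"P(all {k} experiments significant) = {power ** k:.4f}")   # 0.0010
```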

2013: Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafo publish the article, “Power failure: Why small sample size undermines the reliability of neuroscience,” which closes the loop from Cohen’s power analysis to Meehl’s more general despair, with the connection being selection and overestimates of effect sizes.
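
The link between selection and overestimated effect sizes can be seen in a small simulation (my sketch, not taken from the Button et al. paper; the effect size and sample sizes are invented): in a low-powered design, the estimates that happen to reach p < .05 systematically exaggerate the true effect.

```python
# Minimal sketch (invented numbers, not from Button et al.): in a low-powered
# design, the estimates that happen to reach p < .05 systematically
# exaggerate the true effect, which is the selection/overestimation link.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_per_group, n_sims = 0.2, 25, 10000   # small effect, small samples

estimates, significant = [], []
for _ in range(n_sims):
    a = rng.normal(true_effect, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    estimates.append(a.mean() - b.mean())
    significant.append(stats.ttest_ind(a, b).pvalue < 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print(f"true effect:                  {true_effect}")
print(f"mean of all estimates:        {estimates.mean():.2f}")               # close to 0.20
print(f"mean of significant estimates: {estimates[significant].mean():.2f}")  # roughly 3x too large
```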

Around this time, people start sending me bad papers that make extreme claims based on weak data. The first might have been the one on ovulation and voting, but then we get ovulation and clothing, fat arms and political attitudes, and all the rest. The term “Psychological-Science-style research” enters the lexicon.

Also, the replication movement gains steam and a series of high-profile failed replications come out. First there’s the entirely unsurprising lack of replication of Bem’s ESP work—Bem himself wrote a paper claiming successful replication, but his meta-analysis included various studies that were not replications at all—and then came the unsuccessful replications of embodied cognition, ego depletion, and various other respected findings from social psychology.

2015: Many different concerns with research quality and the scientific publication process converge in the “power pose” research of Dana Carney, Amy Cuddy, and Andy Yap, which received adoring media coverage but which suffered from the now-familiar problems of massive uncontrolled researcher degrees of freedom (see this discussion by Uri Simonsohn), and which failed to reappear in a replication attempt by Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber.

Meanwhile, the prestigious Proceedings of the National Academy of Sciences (PPNAS) gets into the game, publishing really bad, fatally flawed papers on media-friendly topics such as himmicanes, air rage, and “People search for meaning when they approach a new decade in chronological age.” These particular articles were all edited by “Susan T. Fiske, Princeton University.” Just when the news was finally getting out about researcher degrees of freedom, statistical significance, and the perils of low-power studies, PPNAS jumps in. Talk about bad timing.

2016: Brian Nosek and others organize a large collaborative replication project. Lots of prominent studies don’t replicate. The replication project gets lots of attention among scientists and in the news, moving psychology, and maybe scientific research, down a notch when it comes to public trust. There are some rearguard attempts to pooh-pooh the failed replication but they are not convincing.

Late 2016: We have now reached the “emperor has no clothes” phase. When seemingly solid findings in social psychology turn out not to replicate, we’re no longer surprised.

Rained real hard and it rained for a real long time

OK, that was a pretty detailed timeline. But here’s the point. Almost nothing was happening for a long time, and even after the first revelations and theoretical articles you could still ignore the crisis if you were focused on your research and other responsibilities. Remember, as late as 2011, even Daniel Kahneman was saying of priming studies that “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Fiske’s own published work has some issues too. I make no statement about her research in general, as I haven’t read most of her papers. What I do know is what Nick Brown sent me:

For an assortment of reasons, I [Brown] found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled through with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

But that wasn’t the worst of it. It turns out that some of the numbers reported in that paper just couldn’t have been correct. It’s possible that the authors were doing some calculations wrong, for example by incorrectly rounding intermediate quantities. Rounding error doesn’t sound like such a big deal, but it can supply a useful set of “degrees of freedom” to allow researchers to get the results they want, out of data that aren’t readily cooperating.
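
As a toy illustration of how rounding can act as a degree of freedom (invented numbers, not the actual values from the Cuddy, Norton, and Fiske paper): rounding the pooled standard deviation to one decimal before computing t can be enough to push a borderline result across the p < .05 threshold.

```python
# Toy illustration with invented numbers (not the values from the paper):
# rounding an intermediate quantity, here the pooled SD, before computing t
# is enough to push a borderline result across p < .05.
import math

def two_sample_t(mean_diff, pooled_sd, n_per_group):
    """Equal-n two-sample t statistic computed from summary numbers."""
    se = pooled_sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / se

n, mean_diff, pooled_sd = 30, 0.52, 1.04
print(two_sample_t(mean_diff, pooled_sd, n))            # about 1.94, p > .05 at df = 58
print(two_sample_t(mean_diff, round(pooled_sd, 1), n))  # about 2.01, just past the threshold
```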

There’s more at the link. The short story is that Cuddy, Norton, and Fiske made a bunch of data errors—which is too bad, but such things happen—and then when the errors were pointed out to them, they refused to reconsider anything. Their substantive theory is so open-ended that it can explain just about any result, any interaction in any direction.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

Source: Andrew Gelman
