November 16, 2010

The Neutral Model of Inquiry (or, What Is the Scientific Literature, Chopped Liver?)

Attention conservation notice: 900 words of wondering what the scientific literature would look like if it were entirely a product of publication bias. Veils the hard-won discoveries of actual empirical scientists in vague, abstract, hyper-theoretical doubts, without alleging any concrete errors. A pile of skeptical nihilism, best refuted by going back to the lab.

I have been musing about the following scenario for several years now, without ever getting around to doing anything with it. Since it came up in conversation last month between talks in New York, now seems like as good a time as any to get it out of my system.

Imagine an epistemic community that seeks to discover which of a large set of postulated phenomena actually happen. (The example I originally had in mind was specific foods causing or preventing specific diseases, but it really has nothing to do with causality, or observational versus experimental studies.) Let's build a stochastic model of this. At each time step, an investigator will draw a random candidate phenomenon from the pool, and conduct an appropriately-designed study. The investigator will test the hypothesis that the phenomenon exists, and calculate a p-value. Let's suppose that this is all done properly (no dead fish here), so that when the phenomenon does not exist, the p-value is uniformly distributed between 0 and 1. The investigator writes up the report and submits it for publication.

What happens next depends on whether or not the phenomenon has already entered the published literature. If it has, the new study is published whatever its p-value. If it has not, the report is published if, and only if, the p-value is < 0.05. This is the "file-drawer problem": a failure to find evidence for a phenomenon is publication-worthy only if people already thought the phenomenon existed; otherwise the report goes into the file drawer.

The community combines the published p-values in some fashion — reasonably exact solutions to this problem were devised by R. A. Fisher and Karl Pearson in the 1930s, leading to Neyman's smooth test of goodness of fit, but I have been told by a psychologist that "of course" one should just use the median of the published p-values. Different rules of combination will lead to slightly different forms of this model.

The last assumption of the model is that, sadly, none of the phenomena the community is interested in exist. All of their null hypotheses are, strictly speaking, true. Just as neutral models of evolution are ones which have all sorts of evolutionary mechanisms except selection, this is a model of the scientific process without discovery. Since, by assumption, everyone does their calculations correctly and honestly, if we could look at all the published and unpublished p-values they'd be uniformly distributed between 0 and 1. But the first published p-value for any phenomenon is uniformly distributed between 0 and 0.05. A full 2% of initial announcements will therefore have an impressive-seeming (nominal) significance level of 10^-3 or better, since 0.001/0.05 = 0.02.
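As a sanity check, here is a minimal simulation sketch of that filter (my own illustration, not anything from the post; the function name and parameters are invented): draw p-values uniformly on [0, 1] for first-time studies of nonexistent phenomena, publish only those below 0.05, and confirm that roughly 2% of the published initial announcements fall below 10^-3.

```python
import random

def first_published_pvalues(n_studies=100_000, alpha=0.05, seed=1):
    """Simulate initial studies of phenomena that do not exist.

    Every null hypothesis is true, so each p-value is Uniform(0, 1);
    only studies with p < alpha clear the file drawer and get published.
    """
    rng = random.Random(seed)
    return [p for p in (rng.random() for _ in range(n_studies)) if p < alpha]

published = first_published_pvalues()
tiny = sum(p < 1e-3 for p in published) / len(published)
print(f"{len(published)} initial announcements published out of 100,000 studies")
print(f"fraction with nominal p < 1e-3: {tiny:.3f}")   # about 0.02
```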

Of course, when people try to replicate those initial findings, their p-values will be distributed between 0 and 1. The joint distribution of p-values from the initial study and m attempts at replication will be a product of independent uniforms, one on [0, 0.05] and m of them on [0,1]. What follows from this will depend on the exact rule used to aggregate individual studies, and on doing some calculations I have never pushed through, so I will structure it as a series of "exercises for the reader".
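In symbols (my notation, nothing from the post): writing p_0 for the initial study's p-value and p_1, ..., p_m for the replications, the joint density under the neutral model is

    f(p_0, p_1, \ldots, p_m) = \frac{1}{0.05}\,\mathbf{1}\{0 \le p_0 \le 0.05\} \prod_{i=1}^{m} \mathbf{1}\{0 \le p_i \le 1\},

that is, a constant 20 on the box [0, 0.05] x [0, 1]^m and zero elsewhere.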

  1. Pick your favorite meta-analytic rule for aggregating p-values. (If you do not have a favorite rule, one will be issued to you.) What is the distribution of the aggregate p-value after m replications?
  2. Say that a phenomenon is dropped from the literature when its aggregate p-value climbs above 0.05. Find the probability of being dropped as a function of m.
  3. Say that the lifespan of a phenomenon is the number of replications it receives before being dropped from the literature. (Under any sensible aggregation rule, the probability of being dropped will tend towards 1 as m increases, so lifespans will be finite.) Find the distribution of lifespans. (A rough simulation sketch for exercises 1-3 appears after this list.)
  4. Let us take any field of inquiry; say, to be diplomatic, haruspicy. Surveying all the published claims of phenomena in its literature, how many replications have they survived? Does this look at all different from the distribution of lifespans under the neutral model? How much nudging of marginal results below the 5% threshold would be needed to account for the discrepancy? (After all, "the difference between 'significant' and 'not significant' is not itself statistically significant".) Does the literature, in other words, provide any evidence that the discipline knows anything at all?
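For the first three exercises, here is a rough Monte Carlo sketch (my own illustration under the model as stated, not a worked solution; every name in it is invented). It takes Fisher's rule as the aggregator: refer -2 times the sum of the log p-values to a chi-squared distribution with twice as many degrees of freedom as there are studies, applied naively to whatever has been published.

```python
import math
import random

def fisher_p(pvals):
    """Fisher's rule: -2 * sum(log p) is chi-squared with 2k d.f. if all k p-values are uniform."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    k = len(pvals)
    # Closed-form chi-squared survival function for even degrees of freedom (2k).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (stat / 2.0) / i
        total += term
    return math.exp(-stat / 2.0) * total

def lifespan(alpha=0.05, max_m=1000, rng=random):
    """Number of replications survived before the aggregate p-value climbs above alpha."""
    pvals = [rng.uniform(0.0, alpha)]   # initial announcement, pre-filtered by the file drawer
    for m in range(1, max_m + 1):
        pvals.append(rng.random())      # honest replication of a nonexistent phenomenon
        if fisher_p(pvals) > alpha:     # the community drops the phenomenon
            return m
    return max_m                        # safety cap, essentially never reached

lifespans = [lifespan() for _ in range(10_000)]
print("mean lifespan:", sum(lifespans) / len(lifespans))
print("dropped after a single replication:", sum(x == 1 for x in lifespans) / len(lifespans))
```

Swapping in the psychologist's median rule, or any other aggregator, only means replacing fisher_p; the resulting histogram of lifespans is the neutral-model baseline against which exercise 4 asks us to compare a real literature.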

Let me draw the moral. Even if the community of inquiry is both too clueless to make any contact with reality and too honest to nudge borderline findings into significance, so long as they can keep coming up with new phenomena to look for, the mechanism of the file-drawer problem alone will guarantee a steady stream of new results. There is, so far as I know, no Journal of Evidence-Based Haruspicy filled, issue after issue, with methodologically-faultless papers reporting the ability of sheep's livers to predict the winners of sumo championships, the outcome of speed dates, or real estate trends in selected suburbs of Chicago. But the difficulty can only be that the evidence-based haruspices aren't trying hard enough, and some friendly rivalry with the plastromancers is called for. It's true that none of these findings will last forever, but this constant overturning of old ideas by new discoveries is just part of what makes this such a dynamic time in the field of haruspicy. Many scholars will even tell you that their favorite part of being a haruspex is the frequency with which a new sacrifice overturns everything they thought they knew about reading the future from a sheep's liver! We are very excited about the renewed interest on the part of policy-makers in the recommendations of the mantic arts...

Update, later that same day: I meant to mention this classic paper on the file-drawer problem, but forgot because I was writing at one in the morning.

Update, yet later: sense-negating typo fixed, thanks to Gustavo Lacerda.

Manual trackback: Wolfgang Beirl; Matt McIrvin's Steam-Operated World of Yesteryear; Idiolect; Cognition and Culture; Brad DeLong; Dynamic Ecology

Modest Proposals; Learned Folly; The Collective Use and Evolution of Concepts; Enigmas of Chance

Posted at November 16, 2010 01:30 | permanent link

Three-Toed Sloth