2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects, called 'the Replication Crisis'.
source; created: 27 Oct 2010; modified: 9 Dec 2019; status: finished; confidence: highly likely; importance: 8
Long-standing problems in standard scientific methodology have exploded as the ‘Replication Crisis’. (A compilation of some additional links is provided for post-2013 developments.)
The crisis is caused by methods & publishing procedures which interpret random noise as important results, far too small datasets, selective analysis by an analyst trying to reach expected/desired results, publication bias, poor implementation of existing best-practices, nontrivial levels of research fraud, software errors, philosophical beliefs among researchers that false positives are acceptable, neglect of known confounding like genetics, and skewed incentives (financial & professional) to publish ‘hot’ results.
Thus, any individual piece of research typically establishes little. Scientific validation comes not from small p-values, but from discovering a regular feature of the world which disinterested third parties can discover with straightforward research done independently on new data with new procedures—replication.
Mainstream science is flawed: seriously mistaken statistics combined with poor incentives have led to masses of misleading research. Not that this problem is exclusive to psychology—economics, certain genetics subfields (principally candidate-gene research), biomedical science, and biology in general are often on shaky ground.
Statistical background on p-value problems: Against null-hypothesis significance testing
The basic nature of ‘significance’ being usually defined as p<0.05 means we should expect something like >5% of studies or experiments to be bogus (optimistically), but that only considers “false positives”; reducing “false negatives” requires statistical power (weakened by small samples), and the two combine with the base rate of true underlying effects into a total error rate. Ioannidis 2005 points out that considering the usual p values, the underpowered nature of many studies, the rarity of underlying effects, and a little bias, even large randomized trials may wind up with only an 85% chance of having yielded the truth. One survey of reported p-values in medicine yielded a lower bound of false positives of 17%.
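How false positives, power, the base rate, and bias combine can be sketched in a few lines. This is a minimal illustration, assuming the positive-predictive-value formula given in Ioannidis 2005; the function name and the particular input values are my own illustrative choices, not figures taken from any specific study.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05, bias=0.0):
    """Probability that a 'significant' finding reflects a true effect.

    prior_odds: R, ratio of true to null relationships among those tested.
    power:      1 - beta, chance of detecting an effect that is real.
    alpha:      significance threshold (false-positive rate under the null).
    bias:       u, fraction of would-be negative analyses reported as positive.
    """
    beta = 1.0 - power
    true_positives = power * prior_odds + bias * beta * prior_odds
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs (assumptions): a well-powered trial with 1:1 prior odds
# and a little bias, versus a small exploratory study of a long-shot hypothesis.
rct = positive_predictive_value(prior_odds=1.0, power=0.80, bias=0.10)
exploratory = positive_predictive_value(prior_odds=0.1, power=0.20, bias=0.30)

print(f"well-powered trial:       {rct:.2f}")
print(f"underpowered exploration: {exploratory:.2f}")
```

Under these assumed inputs the well-powered trial lands near the 85% figure, while the exploratory study's 'significant' results are mostly false; the point is how quickly the number falls as power and prior odds drop, not the exact values.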
Yet, there are too many positive results (psychiatry, neurobiology, biomedicine, biology, ecology & evolution, psychology, economics’ top journals, sociology, gene-disease correlations) given effect sizes (and positive results correlate with per capita publishing rates in US states & vary by period & country—apparently chance is kind to scientists who must publish a lot and recently!); then there come the inadvertent errors which might cause retraction, which is rare, but the true retraction rate may be 0.1–1% (“How many scientific papers should be retracted?”), is increasing & seems to positively correlate with journal prestige metrics (modulo the confounding factor that famous papers/journals get more scrutiny), not that anyone pays any attention to such things; then there are basic statistical errors in >11% of papers (based on the high-quality papers in Nature and the British Medical Journal; “Incongruence between test statistics and P values in medical papers”, García-Berthou 2004) or 50% in neuroscience.
And only then can we get into replicating at all. See for example The Atlantic article “Lies, Damned Lies, and Medical Science” on John P. A. Ioannidis’s research showing 41% of the most cited medical research failed to be replicated—were wrong. For details, you can see Ioannidis’s “Why Most Published Research Findings Are False”, or Begley’s failed attempts to replicate 47 of 53 articles in top cancer journals (leading to Booth’s “Begley’s Six Rules”; see also the Nature Biotechnology editorial & note that full details have not been published because the researchers of the original studies demanded secrecy from Begley’s team), or Kumar & Nash 2011’s “Health Care Myth Busters: Is There a High Degree of Scientific Certainty in Modern Medicine?” who write ‘We could accurately say, “Half of what physicians do is wrong,” or “Less than 20% of what physicians do has solid research to support it.”’ Nutritional epidemiology is something of a fish in a barrel; after Ioannidis, is anyone surprised that when Young & Karr 2011 followed up on 52 correlations tested in 12 RCTs, 0/52 replicated and the RCTs found the opposite of 5?
Attempts to use animal models to infer anything about humans suffer from all the methodological problems previously mentioned, and add in interesting new forms of error such as mice simply being irrelevant to humans, leading to cases like <150 sepsis clinical trials all failing—because the drugs worked in mice but humans have a completely different set of genetic reactions to inflammation.
‘Hot’ fields tend to be new fields, which bring problems of their own, see “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” & discussion. (Failure to replicate in larger studies seems to be a hallmark of biological/medical research. Ioannidis performs the same trick with biomarkers, finding less than half of the most-cited biomarkers were even statistically-significant in the larger studies. 12 of the more prominent SNP-IQ correlations failed to replicate on a larger dataset.) As we know now, almost the entire candidate-gene literature, most things reported from 2000–2010 before large-scale GWASes started to be done (and completely failing to find the candidate-genes), is nothing but false positives! The replication rates of candidate-genes for things like intelligence, personality, gene-environment interactions, psychiatric disorders—the whole schmeer—are literally ~0%. On the plus side, the parlous state of affairs means that there are some cheap heuristics for detecting unreliable papers—simply asking for data & being refused/ignored correlates strongly with the original paper having errors in its statistics.
This epidemic of false positives is apparently deliberately and knowingly accepted by epidemiology; Young’s 2008 “Everything is Dangerous” remarks that 80–90% of epidemiology’s claims do not replicate (eg. the NIH ran 20 randomized-controlled-trials of claims, and only 1 replicated) and that lack of ‘multiple comparisons’ correction (either Bonferroni or Benjamini-Hochberg) is taught: “Rothman (1990) says no correction for multiple testing is necessary and Vandenbroucke, PLoS Med (2008) agrees” (see also Perneger 1998 who also explicitly understands that no correction increases type 1 errors and reduces type 2 errors). Multiple-comparison correction is necessary because its absence does, in fact, result in the overstatement of medical benefit (Godfrey 1985, Pocock et al 1987, Smith 1987). The average effect size for findings confirmed meta-analytically in psychology/education is d=0.53 (well below several effect sizes from n-back/IQ studies); when moving from laboratory to non-laboratory settings, meta-analyses find that results correlate at ~0.7, but for social psychology the replication correlation falls to ~0.5 with >14% of findings actually turning out to be the opposite (see Anderson et al 1999 and Mitchell 2012; for exaggeration due to non-blinding or poor randomization, Wood et al 2008). (Meta-analyses also give us a starting point for understanding how unusual medium or large effect sizes are.) Psychology does have many challenges, but practitioners also handicap themselves; an older overview is the entertaining
“What’s Wrong With Psychology, Anyway?”, which mentions the obvious point that statistics & experimental design are flexible enough to reach significance as desired. In an interesting example of how methodological reforms are no panacea in the presence of continued perverse incentives, an earlier methodological improvement in psychology (reporting multiple experiments in a single publication as a check against results not being generalizable) has merely demonstrated the widespread p-value hacking or manipulation or publication bias when one notes that given the low statistical power of each experiment, even if the underlying phenomena were real it would still be wildly improbable that all n experiments in a paper would turn up statistically-significant results, since power is usually extremely low in experiments (eg. in neuroscience,
“between 20–30%”). These problems are pervasive enough that I believe they entirely explain any
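Two back-of-the-envelope calculations behind the preceding paragraph can be sketched as follows: how likely it is that every experiment in a multi-experiment paper comes up significant when power is low, and how likely at least one false positive is when many comparisons go uncorrected. The function names and the specific numbers (5 experiments at 30% power, 20 comparisons) are illustrative assumptions, and both formulas assume the tests are independent.

```python
def chance_all_significant(power, n_experiments):
    """P(all n independent experiments reach significance | the effect is real)."""
    return power ** n_experiments


def chance_any_false_positive(alpha, n_tests):
    """P(at least one of n independent tests of true nulls comes up 'significant')."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Assumed example: a paper reporting 5 experiments, each with 30% power.
all_hits = chance_all_significant(power=0.30, n_experiments=5)

# Assumed example: 20 uncorrected comparisons at alpha = 0.05,
# versus the same 20 comparisons at the Bonferroni threshold alpha / 20.
uncorrected = chance_any_false_positive(alpha=0.05, n_tests=20)
bonferroni = chance_any_false_positive(alpha=0.05 / 20, n_tests=20)

print(f"all 5 experiments significant: {all_hits:.4f}")
print(f"any false positive, uncorrected: {uncorrected:.2f}")
print(f"any false positive, Bonferroni:  {bonferroni:.3f}")
```

Even with a real effect, a clean sweep of five low-powered experiments should be rare, so a literature full of such sweeps points to selection somewhere in the pipeline.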
The failures to replicate “statistically significant” results have led one blogger to caustically remark (see also “Parapsychology: the control group for science”, “Using degrees of freedom to change the past for fun and profit”, “The Control Group is Out Of Control”):
Parapsychology, the control group for science, would seem to be a thriving field with “statistically significant” results aplenty…Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter. With two-thirds of medical studies in prestigious journals failing to replicate, getting rid of the entire actual subject matter would shrink the field by only 33%.
…Let me draw the moral [about publication bias]. Even if the community of inquiry is both too clueless to make any contact with reality and too honest to nudge borderline findings into significance, so long as they can keep coming up with new phenomena to look for, the mechanism of the file-drawer problem alone will guarantee a steady stream of new results. There is, so far as I know, no Journal of Evidence-Based Haruspicy filled, issue after issue, with methodologically-faultless papers reporting the ability of sheep’s livers to predict the winners of sumo championships, the outcome of speed dates, or real estate trends in selected suburbs of Chicago. But the difficulty can only be that the evidence-based haruspices aren’t trying hard enough, and some friendly rivalry with the plastromancers is called for. It’s true that none of these findings will last forever, but this constant overturning of old ideas by new discoveries is just part of what makes this such a dynamic time in the field of haruspicy. Many scholars will even tell you that their favorite part of being a haruspex is the frequency with which a new sacrifice over-turns everything they thought they knew about reading the future from a sheep’s liver! We are very excited about the renewed interest on the part of policy-makers in the recommendations of the mantic arts…
And this is when there is enough information to replicate; open access to any data for a paper is rare (economics: <10%); the economics journal Journal of Money, Credit and Banking, which required researchers provide the data & software which could replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for repeating the paper (see “Lessons from the JMCB Archive”). In one cute economics example, replication failed because the dataset had been heavily edited to make participants look better (for more economics-specific critique, see Ioannidis & Doucouliagos 2013). Availability of data, often low, decreases with time, and many studies never get published regardless of whether publication is legally mandated.
Transcription errors in papers seem to be common (possibly due to constantly changing analyses & p-hacking?), and as software and large datasets become more integral to research, the need for replicability will grow and the difficulty of achieving it will get worse, because even mature commercial software libraries can disagree majorly on their computed results to the same mathematical specification (see also Anda et al 2009). And spreadsheets are especially bad, with error rates in the 88% range (“What we know about spreadsheet errors”, Panko 1998); spreadsheets are used in all areas of science, including biology and medicine (see “Error! What biomedical computing can learn from its mistakes”; famous examples of coding errors include Donohue-Levitt & Reinhart-Rogoff), not to mention regular business (eg the London Whale).
Psychology is far from being perfect either; look at the examples in The New Yorker’s
“The Truth Wears Off” article (or look at some excerpts from that article). Computer scientist Peter Norvig has written a must-read essay on interpreting statistics,
“Warning Signs in Experimental Design and Interpretation”; a number of warning signs apply to many psychological studies. There may be incentive problems: a transplant researcher discovered the only way to publish in Nature his inability to replicate his earlier Nature paper was to officially retract it; another interesting example is when, after Daryl Bem got a paper published in the top journal JPSP demonstrating precognition, the journal refused to publish any replications (failed or successful) because…
“‘We don’t want to be the Journal of Bem Replication’, he says, pointing out that other high-profile journals have similar policies of publishing only the best original research.” (Quoted in New Scientist) One doesn’t need to be a genius to understand why psychologist Andrew D. Wilson might snarkily remark
“…think about the message JPSP is sending to authors. That message is ‘we will publish your crazy story if it’s new, but not your sensible story if it’s merely a replication’.” (You get what you pay for.) In one large test of the most famous psychology results, 10 of 13 (77%) replicated. The replication rate is under 1/3 in one area of psychology touching on genetics. This despite the simple point that replications reduce the risk of publication bias, and increase statistical power, so that a replicated result is more likely to be true. And the small samples of n-back studies and nootropic chemicals are especially problematic. Quoting from Nick Bostrom & Anders Sandberg’s 2006
“Converging Cognitive Enhancements”:
The reliability of research is also an issue. Many of the cognition-enhancing interventions show small effect sizes, which may necessitate very large epidemiological studies possibly exposing large groups to unforeseen risks.
Particularly troubling is the slowdown in drug discovery & medical technology during the 2000s, even as genetics in particular was expected to produce earth-shaking new treatments. One biotech venture capitalist writes:
The company spent $5M or so trying to validate a platform that didn’t exist. When they tried to directly repeat the academic founder’s data, it never worked. Upon re-examination of the lab notebooks, it was clear the founder’s lab had at the very least massaged the data and shaped it to fit their hypothesis. Essentially, they systematically ignored every piece of negative data. Sadly this “failure to repeat” happens more often than we’d like to believe. It has happened to us at Atlas [Venture] several times in the past decade…The unspoken rule is that at least 50% of the studies published even in top tier academic journals—Science, Nature, Cell, PNAS, etc…—can’t be repeated with the same conclusions by an industrial lab. In particular, key animal models often don’t reproduce. This 50% failure rate isn’t a data free assertion: it’s backed up by dozens of experienced R&D professionals who’ve participated in the (re)testing of academic findings. This is a huge problem for translational research and one that won’t go away until we address it head on.
Half the respondents to a 2012 survey at one cancer research center reported 1 or more incidents where they could not reproduce published research; two-thirds of those were not “ever able to explain or resolve their discrepant findings”, half had trouble publishing results contradicting previous publications, and two-thirds failed to publish contradictory results. An internal Bayer survey of 67 projects (commentary) found that “only in ~20–25% of the projects were the relevant published data completely in line with our in-house findings”, and as far as assessing the projects went:
…despite the low numbers, there was no apparent difference between the different research fields. Surprisingly, even publications in prestigious journals or from several independent groups did not ensure reproducibility. Indeed, our analysis revealed that the reproducibility of published data did not significantly correlate with journal impact factors, the number of publications on the respective target or the number of independent groups that authored the publications. Our findings are mirrored by ‘gut feelings’ expressed in personal communications with scientists from academia or other companies, as well as published observations. [apropos of above] An unspoken rule among early-stage venture capital firms that “at least 50% of published studies, even those in top-tier academic journals, can’t be repeated with the same conclusions by an industrial lab” has been recently reported (see Further information) and discussed 4.
Looking at 306 estimates for particle properties, 7% were outside of a 98% confidence interval (where only 2% should be). In seven other cases, each with 14 to 40 estimates, the fraction outside the 98% confidence interval ranged from 7% to 57%, with a median of 14%.
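To get a feel for how far 7% is from the nominal 2%, here is a rough sketch of the binomial arithmetic. It assumes the 306 estimates can be treated as independent and reads “7%” as a count of about 21; both are my simplifying assumptions rather than details from the original analysis.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_estimates = 306
expected_outside = 0.02 * n_estimates         # about 6 if the intervals were honest
observed_outside = round(0.07 * n_estimates)  # about 21 (assumed reading of "7%")

p_value = binomial_tail(n_estimates, observed_outside, 0.02)

print(f"expected outside the interval: {expected_outside:.1f}")
print(f"observed outside the interval: {observed_outside}")
print(f"chance of that many misses if the intervals were calibrated: {p_value:.1e}")
```

Under these assumptions the stated intervals are far too narrow to be explained by bad luck alone.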
Nor is peer review reliable or robust against even low levels of collusion. Scientists who win the Nobel Prize find their other work suddenly being heavily cited, suggesting either that the community badly failed in recognizing the work’s true value or that they are now sucking up & attempting to look better by association. (A mathematician once told me that often, to boost a paper’s acceptance chance, they would add citations to papers by the journal’s editors—a practice that will surprise none familiar with Goodhart’s law and the use of citations in tenure & grants.)
The former British Medical Journal editor Richard Smith amusingly recounts his doubts about the merits of peer review as practiced, and physicist Michael Nielsen points out that peer review is historically rare (just one of Einstein’s 300 papers was peer reviewed; the famous Nature did not institute peer review until 1967), has been poorly studied & not shown to be effective, is nationally biased, erroneously rejects many historic discoveries (one study lists “34 Nobel Laureates whose awarded work was rejected by peer review”; Horrobin 1990 lists others), and catches only a small fraction of errors. And questionable choices or fraud? Forget about it:
A pooled weighted average of 1.97% (N = 7, 95%CI: 0.86–4.45) of scientists admitted to have fabricated, falsified or modified data or results at least once—a serious form of misconduct by any standard—and up to 33.7% admitted other questionable research practices. In surveys asking about the behaviour of colleagues, admission rates were 14.12% (N = 12, 95% CI: 9.91–19.72) for falsification, and up to 72% for other questionable research practices…When these factors were controlled for, misconduct was reported more frequently by medical/pharmacological researchers than others.
We surveyed over 2,000 psychologists about their involvement in questionable research practices, using an anonymous elicitation format supplemented by incentives for honest reporting. The impact of incentives on admission rates was positive, and greater for practices that respondents judge to be less defensible. Using three different estimation methods, we find that the proportion of respondents that have engaged in these practices is surprisingly high relative to respondents’ own estimates of these proportions. Some questionable practices may constitute the prevailing research norm.
In short, the secret sauce of science is not ‘peer review’. It is replication!
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
Why isn’t the solution as simple as eliminating datamining by methods like larger n or pre-registered analyses? Because once we have eliminated the random error in our analysis, we are still left with a (potentially arbitrarily large) systematic error, leaving us with a large total error.
None of these systematic problems should be considered minor or methodological quibbling or foolish idealism: they are systematic biases and as such, they force an upper bound on how accurate a corpus of studies can be even if there were thousands upon thousands of studies, because the total error in the results is made up of random error and systematic error, but while random error shrinks as more studies are done, systematic error remains the same.
A thousand biased studies merely result in an extremely precise estimate of the wrong number.
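A small simulation makes the point concrete. This is only a sketch under assumed numbers: every simulated study shares the same bias of 0.5 on top of a true effect of 0, with independent noise of standard deviation 1; the function name and parameters are hypothetical.

```python
import random
import statistics


def pooled_estimate(n_studies, true_effect=0.0, shared_bias=0.5, noise_sd=1.0, seed=0):
    """Average n noisy study results that all share the same systematic bias.

    Returns the pooled mean and its naive standard error.
    """
    rng = random.Random(seed)
    results = [
        true_effect + shared_bias + rng.gauss(0.0, noise_sd)
        for _ in range(n_studies)
    ]
    mean = statistics.fmean(results)
    standard_error = statistics.stdev(results) / n_studies**0.5
    return mean, standard_error


for n in (10, 1_000, 100_000):
    mean, se = pooled_estimate(n)
    print(f"n={n:>7}: pooled estimate {mean:+.3f} ± {se:.3f} (truth is 0)")
```

The standard error shrinks toward zero as studies accumulate, but the pooled estimate settles on the bias, not on the truth.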
This is a point appreciated by statisticians and experimental physicists, but it doesn’t seem to be frequently discussed. Andrew Gelman has a fun demonstration of selection bias involving candy, or from pg812–1020 of Chapter 8
“Sufficiency, Ancillarity, And All That” of Probability Theory: The Logic of Science by E.T. Jaynes:
The classical example showing the error of this kind of reasoning is the fable about the height of the Emperor of China. Supposing that each person in China surely knows the height of the Emperor to an accuracy of at least ±1 meter, if there are N=1,000,000,000 inhabitants, then it seems that we could determine his height to an accuracy at least as good as
$$\frac{1}{\sqrt{10^{9}}}\ \text{m} \approx 3 \times 10^{-5}\ \text{m} = 0.03\ \text{mm}$$
merely by asking each person’s opinion and averaging the results.
The absurdity of the conclusion tells us rather forcefully that the rule is not always valid, even when the separate data values are causally independent; it requires them to be logically independent. In this case, we know that the vast majority of the inhabitants of China have never seen the Emperor; yet they have been discussing the Emperor among themselves and some kind of mental image of him has evolved as folklore. Then knowledge of the answer given by one does tell us something about the answer likely to be given by another, so they are not logically independent. Indeed, folklore has almost surely generated a systematic error, which survives the averaging; thus the above estimate would tell us something about the folklore, but almost nothing about the Emperor.
We could put it roughly as follows:
$$\text{error in estimate} = S \pm \frac{R}{\sqrt{N}} \qquad \text{(8-50)}$$
where S is the common systematic error in each datum, R is the RMS ‘random’ error in the individual data values. Uninformed opinions, even though they may agree well among themselves, are nearly worthless as evidence. Therefore sound scientific inference demands that, when this is a possibility, we use a form of probability theory (i.e. a probabilistic model) which is sophisticated enough to detect this situation and make allowances for it.
As a start on this, equation (8-50) gives us a crude but useful rule of thumb; it shows that, unless we know that the systematic error is less than about 1/3 of the random error, we cannot be sure that the average of a million data values is any more accurate or reliable than the average of ten. As Henri Poincare put it: “The physicist is persuaded that one good measurement is worth many bad ones.” This has been well recognized by experimental physicists for generations; but warnings about it are conspicuously missing in the “soft” sciences whose practitioners are educated from those textbooks.
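The rule of thumb can be checked with a few lines of arithmetic. This sketch assumes the error takes the form S ± R/√N, with S the shared systematic error and R the RMS random error as defined in the passage; setting S to one-third of R is the borderline case the passage describes, and the function name is my own.

```python
import math


def worst_case_error(systematic, random_rms, n):
    """Upper end of S ± R/sqrt(N): systematic error plus the averaged random error."""
    return abs(systematic) + random_rms / math.sqrt(n)


R = 1.0    # RMS random error per datum (arbitrary units)
S = R / 3  # assumed systematic error: one-third of the random error

error_ten = worst_case_error(S, R, 10)
error_million = worst_case_error(S, R, 1_000_000)

print(f"N = 10:        {error_ten:.3f}")
print(f"N = 1,000,000: {error_million:.3f}")
print(f"improvement factor: {error_ten / error_million:.2f}")
```

With these assumed values, going from ten data points to a million improves the worst-case error by less than a factor of two, because the systematic term never shrinks.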
Or pg1019–1020 Chapter 10
“Physics of ‘Random Experiments’”:
…Nevertheless, the existence of such a strong connection is clearly only an ideal limiting case unlikely to be realized in any real application. For this reason, the law of large numbers and limit theorems of probability theory can be grossly misleading to a scientist or engineer who naively supposes them to be experimental facts, and tries to interpret them literally in his problems. Here are two simple examples:
- Suppose there is some random experiment in which you assign a probability p for some particular outcome A. It is important to estimate accurately the fraction f of times A will be true in the next million trials. If you try to use the laws of large numbers, it will tell you various things about f; for example, that it is quite likely to differ from p by less than a tenth of one percent, and enormously unlikely to differ from p by more than one percent. But now, imagine that in the first hundred trials, the observed frequency of A turned out to be entirely different from p. Would this lead you to suspect that something was wrong, and revise your probability assignment for the 101’st trial? If it would, then your state of knowledge is different from that required for the validity of the law of large numbers. You are not sure of the independence of different trials, and/or you are not sure of the correctness of the numerical value of p. Your prediction of f for a million trials is probably no more reliable than for a hundred.
- The common sense of a good experimental scientist tells him the same thing without any probability theory. Suppose someone is measuring the velocity of light. After making allowances for the known systematic errors, he could calculate a probability distribution for the various other errors, based on the noise level in his electronics, vibration amplitudes, etc. At this point, a naive application of the law of large numbers might lead him to think that he can add three significant figures to his measurement merely by repeating it a million times and averaging the results. But, of course, what he would actually do is to repeat some unknown systematic error a million times. It is idle to repeat a physical measurement an enormous number of times in the hope that “good statistics” will average out your errors, because we cannot know the full systematic error. This is the old “Emperor of China” fallacy…
Indeed, unless we know that all sources of systematic error—recognized or unrecognized—contribute less than about one-third the total error, we cannot be sure that the average of a million measurements is any more reliable than the average of ten. Our time is much better spent in designing a new experiment which will give a lower probable error per trial. As Poincare put it, “The physicist is persuaded that one good measurement is worth many bad ones.” In other words, the common sense of a scientist tells him that the probabilities he assigns to various errors do not have a strong connection with frequencies, and that methods of inference which presuppose such a connection could be disastrously misleading in his problems.
Schlaifer much earlier made the same point in Probability and Statistics for Business Decisions: an Introduction to Managerial Economics Under Uncertainty, Schlaifer 1959, pg488–489 (see also Meng 2018/Shirani-Mehr et al 2018):
31.4.3 Bias and Sample Size
In Section 31.2.6 we used a hypothetical example to illustrate the implications of the fact that the variance of the mean of a sample in which bias is suspected is
$$\sigma^2(\tilde{x}) = \sigma^2(\tilde{\beta}) + \sigma^2(\tilde{\epsilon})$$
so that only the second term decreases as the sample size increases and the total can never be less than the fixed value of the first term. To emphasize the importance of this point by a real example we recall the most famous sampling fiasco in history, the presidential poll conducted by the Literary Digest in 1936. Over 2 million registered voters filled in and returned the straw ballots sent out by the Digest, so that there was less than one chance in 1 billion of a sampling error as large as 2/10 of one percentage point, and yet the poll was actually off by nearly 18 percentage points: it predicted that 54.5 per cent of the popular vote would go to Landon, who in fact received only 36.7 per cent.
Since sampling error cannot account for any appreciable part of the 18-point discrepancy, it is virtually all actual bias. A part of this total bias may be measurement bias due to the fact that not all people voted as they said they would vote; the implications of this possibility were discussed in Section 31.3. The larger part of the total bias, however, was almost certainly selection bias. The straw ballots were mailed to people whose names were selected from lists of owners of telephones and automobiles and the subpopulation which was effectively sampled was even more restricted than this: it consisted only of those owners of telephones and automobiles who were willing to fill out and return a straw ballot. The true mean of this subpopulation proved to be entirely different from the true mean of the population of all United States citizens who voted in 1936.
It is true that there was no evidence at the time this poll was planned which would have suggested that the bias would be as great as the 18 percentage points actually realized, but experience with previous polls had shown biases which would have led any sensible person to assign to $\tilde{\beta}$ a distribution with $\sigma(\tilde{\beta})$ equal to at least 1 percentage point. A sample of only 23,760 returned ballots, one 100th the size actually used, would have given a value of $\sigma(\tilde{\epsilon})$ of only 1/3 percentage point, so that the standard deviation of $\tilde{x}$ would have been
$$\sigma(\tilde{x}) = \sqrt{\sigma^2(\tilde{\beta}) + \sigma^2(\tilde{\epsilon})} = \sqrt{1 + \tfrac{1}{9}} = 1.05$$
percentage points. Using a sample 100 times this large reduced $\sigma(\tilde{\epsilon})$ from 1/3 point to virtually zero, but it could not affect $\sigma(\tilde{\beta})$ and thus on the most favorable assumption could reduce $\sigma(\tilde{x})$ only from 1.05 points to 1 point. To collect and tabulate over 2 million additional ballots when this was the greatest gain that could be hoped for was obviously ridiculous before the fact and not just in the light of hindsight.
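The Literary Digest arithmetic is easy to reproduce. This is a sketch assuming a simple random sample with the vote split near 50/50 (the worst case for sampling error) and a bias standard deviation of 1 percentage point, as in the passage; the function names are hypothetical.

```python
import math


def sampling_sd_points(n, p=0.5):
    """Standard deviation of a sample proportion, in percentage points.

    p = 0.5 is an assumed worst-case vote share, not a figure from the poll.
    """
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def total_sd_points(bias_sd, n):
    """Combine bias and sampling error: sqrt(bias variance + sampling variance)."""
    return math.sqrt(bias_sd**2 + sampling_sd_points(n) ** 2)


small_sample, full_sample = 23_760, 2_376_000
bias_sd = 1.0  # percentage points, the minimum the passage says was reasonable

total_small = total_sd_points(bias_sd, small_sample)
total_full = total_sd_points(bias_sd, full_sample)

print(f"sampling sd, n={small_sample}: {sampling_sd_points(small_sample):.3f} points")
print(f"sampling sd, n={full_sample}: {sampling_sd_points(full_sample):.3f} points")
print(f"total sd: {total_small:.2f} -> {total_full:.2f} points")
```

A hundredfold larger sample cuts the sampling error by a factor of ten but moves the total error only from about 1.05 points to about 1 point.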
What’s particularly sad is when people read something like this and decide to rely on anecdotes, personal experiments, and alternative medicine where there are even more systematic errors and no way of reducing random error at all! Science may be the lens that sees its own flaws, but if other epistemologies do not boast such long detailed self-critiques, it’s not because they are flawless… It’s like that old Jamie Zawinski quote: Some people, when faced with the problem of mainstream medicine & epidemiology having serious methodological weaknesses, say “I know, I’ll turn to non-mainstream medicine & epidemiology. After all, if only some medicine is based on real scientific method and outperforms placebos, why bother?” (Now they have two problems.) Or perhaps Isaac Asimov:
“John, when people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together.”
Questionable research findings:
- Leprechaun hunting
- “Hydrocephalus and Intelligence: The Hollow Men”
- On the ‘Mouse Utopia’ experiment
- Dual N-Back Meta-Analysis
- Lunar circadian rhythms
- On the Jeanne Calment longevity anomaly
- Munksgaard et al 2016, “A replication and methodological critique of the study ‘Evaluating drug trafficking on the Tor Network’”
Additional links, largely curated from my newsletter:
“An opportunity cost model of subjective effort and task performance”, Kurzban et al 2013;
“A Meta-Analysis of Blood Glucose Effects on Human Decision Making”, Orquin & Kurzban 2016;
“Is Ego-Depletion a Replicable Effect? A Forensic Meta-Analysis of 165 Ego Depletion Articles”;
“Eyes wide shut or eyes wide open?”, Roberts commentary;
“Ego depletion may disappear by 2020”, Vadillo 2019
“Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects?”, Coalition for Evidence-Based Policy 2013
“Life Extension Supplements: A Reality Check; In a paper published late last year, a cautious and expert biochemist reports that none of the most popular ‘life extension supplement’ mixes actually extend life span in mice”
“If correlation doesn’t imply causation, then what does?”, Michael Nielsen
“Deep impact: unintended consequences of journal rank”, Brembs et al 2013 (meta-science)
“Publication bias in the social sciences: Unlocking the file drawer”, Franco et al 2014 (Nature; Andrew Gelman)
“Academic urban legends”, Rekdal 2014
“School Desegregation and Black Achievement: an integrative review”, Wortman & Bryant 1985
“An Examination of Stereotype Threat Effects on Girls’ Mathematics Performance”, Ganley et al 2013;
“The influence of gender stereotype threat on mathematics test scores of Dutch high school students: a registered report”, Flore et al 2019;
“Stereotype Threat Effects in Settings With Features Likely Versus Unlikely in Operational Test Settings: A Meta-Analysis”, Shewach et al 2019
“How to Make More Published Research True”, Ioannidis 2014
“Deliberate practice: Is that all it takes to become an expert?”, Hambrick et al 2013
“Bayesian data analysis”, Kruschke 2010
“The harm done by tests of significance”, Hauer 2004 (how p-values increase traffic fatalities)
“Suppressing Intelligence Research: Hurting Those We Intend to Help”, Gottfredson 2005
“Political Diversity Will Improve Social Psychological Science”, Duarte et al 2015 (liberal ideological homogeneity in social sciences causing problems)
“What Went Wrong? Reflections on Science by Observation and The Bell Curve”, Glymour 1998 (correlation & causation)
“Compliance with Results Reporting at ClinicalTrials.gov”, Anderson et al 2015 (incentives matter)
“p Values are not Error Probabilities”, Hubbard & Bayarri 2003
“Crowdsourcing data analysis: Do soccer referees give more red cards to dark skin toned players?” (61 analysts examine the same dataset for the same research question to see how much variation in approach determines results)
“Predictive modeling, data leakage, model evaluation” (a researcher is just a very big neural net model)
“The Nine Circles of Scientific Hell”, Neuroskeptic 2012
“The Economics of Reproducibility in Preclinical Research”, Freedman et al 2015
“Effect of monthly vitamin D3 supplementation in healthy adults on adverse effects of earthquakes: a randomised controlled trial”, Slow et al 2014 (if only more people could be this imaginative in thinking about how to run RCTs, and less defeatist)
“Reanalyses of Randomized Clinical Trial Data”, Ebrahim et al 2014
“The Bayesian Reproducibility Project” (interpreting the psychology reproducibility projects’ results not by uninterpretable p-values but by a direct Bayesian examination of whether the replications support or contradict the originals)
“The Most Dangerous Equation” (sample size and standard error; examples: coinage, disease rates, small schools, male test scores.)
“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016
“Underreporting in Psychology Experiments: Evidence From a Study Registry”, Franco et al 2015 (previously: Franco et al 2014)
On the ‘bilingual advantage’:
“Is Bilingualism Really an Advantage?”;
“The Bitter Fight Over the Benefits of Bilingualism”;
“The Bilingual Advantage: Three Years Later—Where do things stand in this area of research?”
“The Relationship of Bilingualism Compared to Monolingualism to the Risk of Cognitive Decline or Dementia: A Systematic Review and Meta-Analysis”, Mukadam et al 2017
“Is bilingualism associated with enhanced executive functioning in adults? A meta-analytic review”, Lehtonen et al 2018
“No evidence for a bilingual executive function advantage in the ABCD study”, Dick et al 2018
“The Advantages of Bilingualism Debate”, Antoniou 2019
“Seven Pervasive Statistical Flaws in Cognitive Training Interventions”, Moreau et al 2016
“Could a neuroscientist understand a microprocessor?”, Jonas & Kording 2016 (amusing followup to “Can a biologist fix a radio?”, Lazebnik 2002)
“Statistically Controlling for Confounding Constructs Is Harder than You Think”, Westfall & Yarkoni 2016 (See also Stouffer 1936/Thorndike 1942/Kahneman 1965.)
“Preventing future offending of delinquents and offenders: what have we learned from experiments and meta-analyses?”, Mackenzie & Farrington 2015
“Science Is Not Always ‘Self-Correcting’: Fact-Value Conflation and the Study of Intelligence”, Cofnas 2015 (scientific endorsement of the Noble Lie)
“How Multiple Imputation Makes a Difference”, Lall 2016 (many political science results biased & driven by treatment of missing data)
“Predicting Experimental Results: Who Knows What?”, DellaVigna & Pope 2016
“We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results” (random error vs systematic error)
“Close but no Nobel: the scientists who never won; Archives reveal the most-nominated researchers who missed out on a Nobel Prize” (Nobels have considerable measurement error)
“Stereotype (In)Accuracy in Perceptions of Groups and Individuals”, Jussim et al 2015
“The Statistical Crisis in Science: The Cult Of P”, Debrouwere 2016 (a readable overview of the problems with p-values/NHST)
“Meta-assessment of bias in science”, Fanelli et al 2017
“How readers understand causal and correlational expressions used in news headlines”, Adams et al 2017 (people do not understand the difference between correlation and causation)
“Daryl Bem Proved ESP Is Real—which means science is broken” (Apparently Bem was serious. But one man’s modus ponens is another man’s modus tollens.)
“Does High Self-esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?”, Baumeister et al 2003;
“The Man Who Destroyed America’s Ego: How a rebel psychologist challenged one of the 20th century’s biggest—and most dangerous—ideas”;
“‘It was quasi-religious’: the great self-esteem con”
“Computational Analysis of Lifespan Experiment Reproducibility”, Petrascheck & Miller 2017 (Power analysis of Gompertz curve mortality data: as expected, most animal studies of possible life-extension drugs are severely underpowered, which has the usual implications for replication, exaggerated effect sizes, & publication bias.)
“The prior can generally only be understood in the context of the likelihood”, Gelman et al 2017 (“They strain at the gnat of the prior who swallow the camel of the likelihood.”)
“The High Cost of Not Doing Experiments”, Nisbett 2015
“On the Reproducibility of Psychological Science”, Johnson et al 2016
“Disappointing findings on Conditional Cash Transfers as a tool to break the poverty cycle in the United States” (results inflated by self-reported data; as the US is not Scandinavia, many studies depend on self-report.)
“No Beneficial Effects of Resveratrol on the Metabolic Syndrome: A Randomized Placebo-Controlled Clinical Trial”, Kjær et al 2017 (A big resveratrol null: zero effect, even evidence of harm.)
“The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence”, Wood & Porter 2017 (Big failed replication: n=10100/k=52.)
“Randomizing Religion: The Impact of Protestant Evangelism on Economic Outcomes”, Bryan et al 2018 (How beneficial is religion to individuals?)
“Does teaching children how to play cognitively demanding games improve their educational attainment? Evidence from a Randomised Controlled Trial of chess instruction in England”, Jerrim et al 2017 (“no”)
“Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials”, Wallach et al 2017 (Lies, damn lies, and subgroups.)
“Inventing the randomized double-blind trial: the Nuremberg salt test of 1835”, Stolberg 2006 (The first double-blind randomized pre-registered clinical trial in history was a 1835 German test of homeopathic salt remedies.)
“Producing Wrong Data Without Doing Anything Obviously Wrong!”, Mytkowicz et al 2009
“Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes”, Watts et al 2018 (effect size shrinks considerably in replication and the effect appears moderated largely through the expected individual traits like IQ)
“Split brain: divided perception but undivided consciousness”, Pinto et al 2017
“p-Hacking and False Discovery in A/B Testing”, Berman et al 2018
“22 Case Studies Where Phase 2 and Phase 3 Trials had Divergent Results”, FDA 2017 (Small study biases)
“The prehistory of biology preprints: A forgotten experiment from the 1960s”, Cobb 2017 (Who killed preprints the first time? The commercial academic publishing cartel did. Copyright is why we can’t have nice things.)
“Communicating uncertainty in official economic statistics”, Manski 2015 (error from systematic bias > error from random sampling error in major economic statistics)
“Why do humans reason? Arguments for an argumentative theory”, Mercier & Sperber 2011
“Many Labs 2: Investigating Variation in Replicability Across Sample and Setting”, Klein et al 2018 (28 targets, k=60, n=7000 each, Registered Reports: 14/28 replication rate w/few moderators/country differences & remarkably low heterogeneity)
“A Star Surgeon Left a Trail of Dead Patients—and His Whistleblowers Were Punished” (organizational incentives & science)
“Five-Year Follow-up of Antibiotic Therapy for Uncomplicated Acute Appendicitis in the APPAC Randomized Clinical Trial”, Salminen et al 2018 (media; most appendicitis surgeries unnecessary—the question is never whether running an RCT is ethical, the question is how can not running it possibly be ethical?)
“What’s Wrong with Psychology Anyway?”, Lykken 1991
“Notes on a New Philosophy of Empirical Science”, Burfoot 2011 (the ‘compression as inference’ paradigm of intelligence for application to science; stupid gzip tricks, Cilibrasi 2006)
“The association between adolescent well-being and digital technology use”, Orben & Przybylski 2019 (SCA: brute-forcing the ‘garden of forking paths’ to examine robustness of results—the poor man’s Bayesian model comparison)
“Most Rigorous Large-Scale Educational RCTs are Uninformative: Should We Be Concerned?”, Lortie-Forgues & Inglis 2019
“How Replicable Are Links Between Personality Traits and Consequential Life Outcomes? The Life Outcomes of Personality Replication Project”, Soto 2019 (individual-differences research continues to replicate well)
“Behavior Genetic Research Methods: Testing Quasi-Causal Hypotheses Using Multivariate Twin Data”, Turkheimer & Harden 2014
“How genome-wide association studies (GWAS) made traditional candidate gene studies obsolete”, Duncan et al 2019 (apropos of SSC on Border et al 2019 revisiting the candidate-gene & gene-environment debacles)
“The Hype Cycle of Working Memory Training”, Redick 2019
“The Big Crunch”, David Goodstein 1994
“Super-centenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans”, Newman 2019 (“a primary role of fraud and error in generating remarkable human age records”)
“Registered reports: an early example and analysis”, Wiseman et al 2019 (parapsychology was the first field to use Registered Reports in 1976, which cut statistical-significance rates by 2/3rds: Johnson 1975a, Johnson 1975b, Johnson 1976)
“Anthropology’s Science Wars: Insights from a New Survey”, Horowitz et al 2019 (visualizations)
“Effect of Lower Versus Higher Red Meat Intake on Cardiometabolic and Cancer Outcomes: A Systematic Review of Randomized Trials”, Zeraatkar et al 2019 (“diets lower in red meat may have little or no effect on all-cause mortality (HR 0.99 [95% CI, 0.95 to 1.03])”)
“Reading Lies: Nonverbal Communication and Deception”, Vrij et al 2019
Did psychologist David Rosenhan fabricate his famous 1973 “Being Sane in Insane Places” mental hospital exposé? (Nature review; on the Rosenhan experiment; interesting to consider Spitzer 1975’s criticisms in light of this)
“Debunking the Stanford Prison Experiment”, Le Texier 2019
“A real-life Lord of the Flies: the troubling legacy of the Robbers Cave experiment; In the early 1950s, the psychologist Muzafer Sherif brought together a group of boys at a US summer camp—and tried to make them fight each other. Does his work teach us anything about our age of resurgent tribalism?”
“Crowdsourcing Hypothesis Tests”, Landy et al 2019
John D. Arnold:
“The Struggles of a $40 Million Nutrition Science Crusade: The Nutrition Science Initiative promised to study obesity and diabetes the right way. Now it’s nearly broke and all but gone.”;
“‘We are not recommending you give to Texas per se’: GiveDirectly’s bold disaster-relief experiment”;
“John Arnold Made a Fortune at Enron. Now He’s Declared War on Bad Science”;
“Cancer Research Is Broken: There’s a replication crisis in biomedicine—and no one even knows how deep it runs”;
“Reproducibility Initiative receives $1.3M grant to validate 50 landmark cancer studies”;
“Two Texas Billionaires Think They Can Fix Philanthropy: John and Laura Arnold are trying to give faster, bolder, and smarter”;
“The Four Most Dangerous Words? A New Study Shows”, Laura Arnold, 2017 TED talk
“A national experiment reveals where a growth mindset improves achievement”, Yeager et al 2019
“Reviews: Rosenthal, Robert, and Jacobson, Lenore; Pygmalion in the Classroom. 1968”, Thorndike 1968;
“But You Have To Know How To Tell Time”, Thorndike 1969
“Unfinished Pygmalion”, Snow 1969
“Pygmalion Reconsidered: A Case Study In Statistical Inference: Reconsideration Of The Rosenthal And Jacobson Data On Teacher Expectancy”, Elashoff & Snow 1971
“Five Decades Of Public Controversy Over Mental Testing”, Cronbach 1975
“The Self-Fulfillment of the Self-Fulfilling Prophecy: A Critical Appraisal”, Wineburg 1987a;
“Does Research Count in the Lives of Behavioral Scientists?”, Wineburg 1987b
“Pygmalion and Intelligence?”, Snow 1995
“Beleaguered Pygmalion: A History of the Controversy Over Claims that Teacher Expectancy Raises Intelligence”, Spitz 1999
“Teacher Expectations and Self-Fulfilling Prophecies: Knowns and Unknowns, Resolved and Unresolved Controversies”, Jussim & Harber 2005
“We’ve Been Here Before: The Replication Crisis over the ‘Pygmalion Effect’”
Some examples of how ‘datamining’ or ‘data dredging’ can manufacture correlations on demand from large datasets by comparing enough variables:
Rates of autism diagnoses in children correlate with age (or should we blame organic food sales?); height & vocabulary or foot size & math skills may correlate strongly (in children); national chocolate consumption correlates with Nobel prizes, as do borrowing from commercial banks & buying luxury cars & serial killers/mass-murderers/traffic fatalities; moderate alcohol consumption predicts increased lifespan and earnings; the role of storks in delivering babies may have been underestimated; children and people with high self-esteem have higher grades & lower crime rates etc, so “we all know in our gut that it’s true” that raising people’s self-esteem “empowers us to live responsibly and that inoculates us against the lures of crime, violence, substance abuse, teen pregnancy, child abuse, chronic welfare dependency and educational failure”, unless perhaps high self-esteem is caused by high grades & success, boosting self-esteem has no experimental benefits, and may backfire?
Tyler Vigen’s “spurious correlations” catalogues 35k+ correlations, many with r>0.9, based primarily on US Census & CDC data.
Google Correlate “finds Google search query patterns which correspond with real-world trends” based on geography or user-provided data, which offers endless fun (“Facebook”/“tapeworm in humans”, r=0.8721; “Superfreakonomic”/“Windows 7 advisor”, r=0.9751; Irish electricity prices/“Stanford webmail”, r=0.83; “heart attack”/“pink lace dress”, r=0.88; US states’ parasite loads/“booty models”, r=0.92; US states’ family ties/“how to swim”; metronidazole/“Is Lil’ Wayne gay?”, r=0.89; Clojure/“prnhub”, r=0.9784; “accident”/“itchy bumps”, r=0.87; “migraine headaches”/“sciences”, r=0.77; “Irritable Bowel Syndrome”/“font download”, r=0.94; interest-rate-index/“pill identification”, r=0.98; “advertising”/“medical research”, r=0.99; Barack Obama 2012 vote-share/“Top Chef”, r=0.88; “losing weight”/“houses for rent”, r=0.97; “Bieber”/tonsillitis, r=0.95; “paternity test”/“food for dogs”, r=0.83; “breast enlargement”/“reverse telephone search”, r=0.95; “theory of evolution” / “the Sumerians” or “Hector of Troy” or “Jim Crow laws”; “gwern”/“Danny Brown lyrics”, r=0.92; “weed”/“new Family Guy episodes”, r=0.8; a drawing of a bell curve matches “MySpace” while a penis matches “STD symptoms in men” r=0.95, not to mention Kurt Vonnegut stories).
Financial data-mining offers some fun examples; there’s the Super Bowl/stock-market one which worked well for several decades; and it’s not very elegant, but a 3-variable model (Bangladeshi butter, American cheese, joint sheep population) reaches R²=0.99 on 20 years of the S&P 500.
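The mechanism behind these examples can be sketched in a few lines: compare one short series against enough unrelated candidates and some of them will correlate strongly by chance alone. The sketch below is only an illustration under stated assumptions; the function names, the 200 candidates, the 20 data points, and the use of independent Gaussian noise are all hypothetical choices, not anything taken from the sources above.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_spurious_r(n_candidates=200, n_points=20, seed=0):
    """Dredge pure noise: return the strongest correlation found between one
    random 'target' series and many unrelated random 'candidate' series.
    All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson_r(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # Every series is independent noise, so any large |r| here is spurious.
    print(f"strongest correlation found in noise: r = {best_spurious_r():+.2f}")
```

Real datasets tend to be worse than this toy case, since trending time series (prices, search volumes, populations) correlate with each other far more readily than independent noise does.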
On the general topic of animal model external validity & translation to humans, a number of op-eds, reviews, and meta-analyses have been done; reading through some of the literature up to March 2013, I would summarize them as indicating that the animal research literature in general is of considerably lower quality than human research, and that for those and intrinsic biological reasons, the probability of meaningful transfer from animal to human can be astoundingly low, far below 50% and in some categories of results, 0%.
The primary reasons identified for this poor performance are generally: small samples (much smaller than the already underpowered norms in human research), lack of blinding in taking measurements, pseudo-replication due to animals being correlated by genetic relatedness/living in same cage/same room/same lab, extensive non-normality in data, large differences between labs due to local differences in reagents/procedures/personnel illustrating the importance of “tacit knowledge”, publication bias (small cheap samples + little perceived ethical need to publish + no preregistration norms), unnatural & unnaturally easy lab environments (more naturalistic environments both offer more realistic measurements & challenge animals), large genetic differences due to inbreeding/engineering/drift of lab strains mean the same treatment can produce dramatically different results in different strains (or sexes) of the same species, different species can have different responses, and none of them may be like humans in the relevant biological way in the first place.
So it is no wonder that “we can cure cancer in mice but not people” and almost all amazing breakthroughs in animals never make it to human practice; medicine & biology are difficult.
“The Importance of Being Cross-Bred”, Michie & McLaren 1955
“The evaluation of anticancer drugs in dogs and monkeys for the prediction of qualitative toxicities in man”, Schein et al 1970; systematic review
“Drug safety tests and subsequent clinical experience”, Fletcher 1978; systematic review
“A Case Study of Statistics in the Regulatory Process: The FD&C Red No. 40 Experiments”, Lagakos & Mosteller 1981; experiment confounded by litter & cage effects
“Response of Experimental Animals To human carcinogens: an analysis based upon the IARC Monographs programme”, Wilbourn et al 1986; systematic review
“Predictability of clinical adverse reactions of drugs by general pharmacology studies”, Igrashi et al 1992; systematic review
“Genetics of Mouse Behavior: Interactions with Laboratory Environment”, Crabbe et al 1999; experiment
“Evidence-Based Data From Animal and Human Experimental Studies on Pain Relief With Antidepressants: A Structured Review”, Fishbain et al 2000; review
“Concordance of the Toxicity of Pharmaceuticals in Humans and in Animals”, Olson et al 2000; survey
“Nimodipine in animal model experiments of focal cerebral ischemia: a systematic review”, Horn 2001; review
“Wound healing in cell studies and animal model experiments by Low Level Laser Therapy; were clinical studies justified? A systematic review”, Lucas et al 2002; meta-analysis
“Does animal experimentation inform human healthcare? Observations from a systematic review of international animal experiments on fluid resuscitation”, Roberts et al 2002; meta-analysis
“Systematic reviews of animal experiments”, Sandercock & Roberts 2002
“Why did NMDA receptor antagonists fail clinical trials for stroke and traumatic brain injury?”, Ikonomidou & Turski 2002; essay
“Jake Leg: How the blues diagnosed a medical mystery”, Baum 2003
“Meta-analysis of the effects of endothelin receptor blockade on survival in experimental heart failure”, Lee et al 2003; meta-analysis
“Emergency medicine animal research: does use of randomization and blinding affect the results?”, Bebarta et al 2003 (review/meta-analysis)
“Where is the evidence that animal research benefits humans?”, Pound et al 2004; review
“The use of animal models in the study of complex disease: all else is never equal or why do so many human studies fail to replicate animal findings?”, Williams et al 2004; essay
“First Dose of Potential New Medicines to Humans: How Animals Help”, Greaves et al 2004; essay
“The future of teratology research is in vitro”, Bailey et al 2005; review
“How good are rodent models of carcinogenesis in predicting efficacy in humans? A systematic review and meta-analysis of colon chemoprevention in rats, mice and men”, Corpet & Pierre 2005 (meta-analysis)
“Surveying the literature from animal experiments”, Lemon & Dunnett 2005; essay
“Systematic review and meta-analysis of the efficacy of FK506 in experimental stroke”, Macleod et al 2005; meta-analysis
“Systematic review and meta-analysis of the efficacy of melatonin in experimental stroke”, Macleod et al 2005; meta-analysis
“Methodological quality of animal studies on neuroprotection in focal cerebral ischaemia”, van der Worp et al 2005; review
“Nitric oxide synthase inhibitors in experimental ischemic stroke and their effects on infarct size and cerebral blood flow: a systematic review”, Willmot et al 2005; meta-analysis
“A systematic review of nitric oxide donors and L-arginine in experimental stroke; effects on infarct size and cerebral blood flow”, Willmot et al 2005; meta-analysis
“Translation of Research Evidence From Animals to Humans”, Hackam & Redelmeier 2006; review
“1,026 experimental treatments in acute stroke”, O’Collins et al 2006; review
“A Systematic Review of Systematic Reviews and Meta-Analyses of Animal Experiments with Guidelines for Reporting”, Peters 2006; review
“Translating animal research into clinical benefit”, Hackam 2007; essay
“Systematic Reviews of Animal Experiments Demonstrate Poor Human Clinical and Toxicological Utility”, Knight 2007; review
“Comparison of treatment effects between animal experiments and clinical trials: systematic review”, Perel et al 2007; review
“How can we improve the pre-clinical development of drugs for stroke?”, Sena et al 2007; essay
“Healthy animals and animal models of human disease(s) in safety assessment of human pharmaceuticals, including therapeutic antibodies”, Dixit & Boelsterli 2007; review
“Systematic Reviews of Animal Experiments Demonstrate Poor Contributions to Human Healthcare”, Knight 2008; essay
“Are animal models as good as we think?”, Wall & Shani 2008; essay
“Predictive validity of animal pain models? A comparison of the pharmacokinetic-pharmacodynamic relationship for pain drugs in rats and humans”, Whiteside et al 2008
“Design, power, and interpretation of studies in the standard murine model of ALS”, Scott et al 2008; review
“Evidence for the efficacy of NXY-059 in experimental focal cerebral ischaemia is confounded by study quality”, Macleod et al 2008 (meta-analysis)
“Empirical evidence of bias in the design of experimental stroke studies: a metaepidemiologic approach”, Crossley et al 2008
“Publication bias in reports of animal stroke studies leads to major overstatement of efficacy”, Sena et al 2010; meta-analysis
“Can Animal Models of Disease Reliably Inform Human Studies?”, van der Worp et al 2010; essay
“Improving the translational hit of experimental treatments in multiple sclerosis”, Vesterinen et al 2010 (meta-analysis)
“Human relevance of pre-clinical studies in stem cell therapy: systematic review and meta-analysis of large animal models of ischaemic heart disease”, van der Spoel et al 2011; meta-analysis
“Evaluation of Excess Significance Bias in Animal Studies of Neurological Diseases”, Tsilidis et al 2013 (meta-analysis)
“Food for Thought: Look Back in Anger – What Clinical Studies Tell Us About Preclinical Work”, Hartung 2013
“Two Years Later: Journals Are Not Yet Enforcing the ARRIVE Guidelines on Reporting Standards for Pre-Clinical Animal Studies”, Baker et al 2014 (review)
“Why genes extending lifespan in model organisms have not been consistently associated with human longevity and what it means to translation research”, Magalhães 2014
“Low-dose paroxetine exposure causes lifetime declines in male mouse body weight, reproduction and competitive ability as measured by the novel organismal performance assay”, Gaukler et al 2015 (experiment; on the implausibility of lab environments)
“What exactly is ‘n’ in cell culture and animal experiments?”, Lazic et al 2017
“A long journey to reproducible results: Replicating our work took four years and 100,000 worms but brought surprising discoveries” (background to Lucanic et al 2017)
“Sex matters in experiments on party drug—in mice: Ketamine lifts rodents’ mood only if administered by male researchers” (2017)
“Don’t believe the mice” (2018)
“Individual differences in behaviour explain variation in survival: a meta-analysis”, Moiron et al 2019
- Twitter discussion of challenges in mice behavioral testing, Grace Mosley 2018
Publication bias can come in many forms, and seems to be severe. For example, the 2008 version of a Cochrane review (“Full publication of results initially presented in abstracts (Review)”) finds
“Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were more frequently published than not ‘positive’ results.”↩︎
For a second, shorter take on the implications of low prior probabilities & low power:
“Is the Replicability Crisis Overblown? Three Arguments Examined”, Pashler & Harris 2012:
So what is the truth of the matter? To put it simply, adopting an alpha level of, say, 5% means that about 5% of the time when researchers test a null hypothesis that is true (i.e., when they look for a difference that does not exist), they will end up with a statistically significant difference (a Type 1 error or false positive). Whereas some have argued that 5% would be too many mistakes to tolerate, it certainly would not constitute a flood of error. So what is the problem?
Unfortunately, the problem is that the alpha level does not provide even a rough estimate, much less a true upper bound, on the likelihood that any given positive finding appearing in a scientific literature will be erroneous. To estimate what the literature-wide false positive likelihood is, several additional values, which can only be guessed at, need to be specified. We begin by considering some highly simplified scenarios. Although artificial, these have enough plausibility to provide some eye-opening conclusions.
For the following example, let us suppose that 10% of the effects that researchers look for actually exist, which will be referred to here as the prior probability of an effect (i.e., the null hypothesis is true 90% of the time). Given an alpha of 5%, Type 1 errors will occur in 4.5% of the studies performed (90% × 5%). If one assumes that studies all have a power of, say, 80% to detect those effects that do exist, correct rejections of the null hypothesis will occur in 8% of the time (80% × 10%). If one further imagines that all positive results are published then this would mean that the probability any given published positive result is erroneous would be equal to the proportion of false positives divided by the sum of the proportion of false positives plus the proportion of correct rejections. Given the proportions specified above, then, we see that more than one third of published positive findings would be false positives [4.5% / (4.5% + 8%) = 36%]. In this example, the errors occur at a rate approximately seven times the nominal alpha level (row 1 of Table 1).
Table 1 shows a few more hypothetical examples of how the frequency of false positives in the literature would depend upon the assumed probability of null hypothesis being false and the statistical power. An 80% power likely exceeds any realistic assumptions about psychology studies in general. For example, Bakker, van Dijk, and Wicherts (2012, this issue) estimate .35 as a typical power level in the psychological literature. If one modifies the previous example to assume a more plausible power level of 35%, the likelihood of positive results being false rises to 56% (second row of the table). John Ioannidis (2005b) did pioneering work to analyze (much more carefully and realistically than we do here) the proportion of results that are likely to be false, and he concluded that it could very easily be a majority of all reported effects.
Table 1. Proportion of Positive Results That Are False Given Assumptions About Prior Probability of an Effect and Power.

| Prior probability of effect | Power | Proportion of studies yielding true positives | Proportion of studies yielding false positives | Proportion of total positive results (false+positive) which are false |
|---|---|---|---|---|
| 10% | 80% | 10% × 80% = 8% | (100–10%) × 5% = 4.5% | 4.5% / (4.5% + 8%) = 36% |
| 10% | 35% | 10% × 35% = 3.5% | (100–10%) × 5% = 4.5% | 4.5% / (4.5% + 3.5%) = 56.25% |
| 50% | 35% | 50% × 35% = 17.5% | (100–50%) × 5% = 2.5% | 2.5% / (2.5% + 17.5%) = 12.5% |
| 75% | 35% | 75% × 35% = 26.25% | (100–75%) × 5% = 1.25% | 1.25% / (1.25% + 26.25%) = 4.5% |
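The arithmetic in the Pashler & Harris passage is easy to reproduce. The sketch below is a minimal illustration, not their code: the function name `false_positive_share` is hypothetical, and it keeps their simplifying assumptions (every positive result is published, and no bias beyond the alpha level).

```python
def false_positive_share(prior, power, alpha=0.05):
    """Share of positive results that are false positives, given the prior
    probability that a tested effect is real, the power, and alpha."""
    true_positives = prior * power          # real effects that get detected
    false_positives = (1 - prior) * alpha   # true nulls that reach significance
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Scenarios discussed in the quoted text: (prior, power)
    for prior, power in [(0.10, 0.80), (0.10, 0.35), (0.50, 0.35)]:
        share = false_positive_share(prior, power)
        print(f"prior={prior:.0%} power={power:.0%} -> {share:.1%} of positives false")
```

Varying `prior` shows how much the conclusion leans on a quantity that, as the authors say, can only be guessed at.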
So for example, if we imagined that a Jaeggi effect size of 0.8 were completely borne out by a meta-analysis of many studies and turned in a point estimate of d=0.8; this data would imply that the strength of the n-back effect was ~1 standard deviation above the average effect (of things which get studied enough to be meta-analyzable & have published meta-analyses etc) or to put it another way, that n-back was stronger than ~84% of all reliable well-substantiated effects that psychology/education had discovered as of 1992.↩︎
We can infer empirical priors from field-wide collections of effect sizes, in particular, highly reliable meta-analytic effect sizes. For example, Lipsey & Wilson 1993 which finds for various kinds of therapy a mean effect of d=0.5 based on >300 meta-analyses; or better yet,
“One Hundred Years of Social Psychology Quantitatively Described”, Bond et al 2003:
This article compiles results from a century of social psychological research, more than 25,000 studies of 8 million people. A large number of social psychological conclusions are listed alongside meta-analytic information about the magnitude and variability of the corresponding effects. References to 322 meta-analyses of social psychological phenomena are presented, as well as statistical effect-size summaries. Analyses reveal that social psychological effects typically yield a value of r equal to .21 and that, in the typical research literature, effects vary from study to study in ways that produce a standard deviation in r of .15. Uses, limitations, and implications of this large-scale compilation are noted.
Only 5% of the correlations were greater than .50; only 34% yielded an r of .30 or more; for example, Jaeggi 2008’s 15-day group racked up an IQ increase of d=1.53 which converts to an r of 0.61 and is 2.6 standard deviations above the overall mean, implying that the DNB effect is greater than ~99% of previous known effects in psychology! (Schönbrodt & Perugini 2013 observe that their sampling simulations imply that, given Bond’s mean effect of r = .21, a psychology study would require n=238 for reasonable accuracy in estimating effects; most studies are far smaller.)↩︎
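The conversion in this footnote can be checked with a short sketch. It assumes the common d-to-r formula for two equal-sized groups, r = d / sqrt(d² + 4), and treats Bond et al's distribution of effects as roughly normal with mean .21 and standard deviation .15; the footnote does not say which conversion or distribution was actually used, so both are assumptions, as are the function names.

```python
import math


def d_to_r(d):
    """Convert Cohen's d to a correlation r, assuming two equal-sized groups."""
    return d / math.sqrt(d * d + 4)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


if __name__ == "__main__":
    r = d_to_r(1.53)             # Jaeggi 2008's 15-day group
    z = (r - 0.21) / 0.15        # SDs above Bond et al's mean effect
    print(f"r = {r:.2f}, z = {z:.2f}, larger than {normal_cdf(z):.1%} of effects")
```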
One might be aware that the writer of that essay, Jonah Lehrer, was fired after making up materials for one of his books, and wonder if this work can be trusted; I believe it can as the New Yorker is famous for rigorous fact-checking (and no one has cast doubt on this article), Lehrer’s scandals involved his books, I have not found any questionable claims in the article besides Lehrer’s belief that known issues like publication bias are insufficient to explain the decline effect (which reasonable men may differ on), and Virginia Hughes ran the finished article against 7 people quoted in it like Ioannidis without any disputing facts/quotes & several somewhat praising it (see also Andrew Gelman).↩︎
If I am understanding this right, Jaynes’s point here is that the random error shrinks towards zero as N increases, but this error is added onto the “common systematic error” S, so the total error approaches S no matter how many observations you make, and this can force the total error up as well as down (variability, in this case, actually being helpful for once). So for example, with N=10, $\frac{1}{3} + \frac{1}{\sqrt{10}} = 0.65$; with N=100, it’s 0.43; with N=1,000,000 it’s 0.334; and with N=1,000,000,000 it equals 0.333365 etc, never going below the original systematic error of $\frac{1}{3}$. That is, after 10 observations, the portion of error due to sampling error is less than that due to the systematic error, so one has hit severely diminishing returns in the value of any additional (biased) data, and to meaningfully improve the estimate one must obtain unbiased data. This leads to the unfortunate consequence that the likely error of N=10 is 0.017<x<0.64956 while for N=1,000,000 it is the similar range 0.017<x<0.33433, so it is possible that the estimate could be exactly as good (or bad) for the tiny sample as compared with the enormous sample, since neither can do better than 0.017!↩︎
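A minimal sketch of this arithmetic, assuming a systematic error of 1/3 and a random error of 1/√N (values inferred from the figures quoted in the footnote, not stated by Jaynes here; `total_error` is a hypothetical helper):

```python
import math

S = 1 / 3  # assumed systematic error, inferred from the footnote's figures

def total_error(n: int, systematic: float = S) -> float:
    """Systematic error plus a random error that shrinks as 1/sqrt(n)."""
    return systematic + 1 / math.sqrt(n)

for n in (10, 100, 1_000_000, 1_000_000_000):
    random_part = 1 / math.sqrt(n)
    print(f"N={n:>13,}  random={random_part:.6f}  total={total_error(n):.6f}")
```

Going from N=10 to N=1,000,000,000 only moves the total from about 0.65 to about 0.3334, because the 1/3 floor does not depend on N.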
Possibly this is what Lord Rutherford meant when he said,
“If your experiment needs statistics you ought to have done a better experiment”.↩︎
Neglecting the finite-population correction, the standard deviation of the mean sampling error is $\sqrt{p(1-p)/n}$ and this quantity is largest when p=.5. The number of ballots returned was 2,376,523, and with a sample of this size the largest possible value of $\sqrt{p(1-p)/n}$ is $\sqrt{.25/2{,}376{,}523}$, or .0322 percentage point, so that an error of .2 percentage point is .2/.0322 = 6.17 times the standard deviation. The total area in the two tails of the Normal distribution below u = −6.17 and above u = +6.17 is .0000000007.↩︎
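The calculation in this footnote can be reproduced with the standard library alone. This is a sketch assuming the usual binomial formula sqrt(p(1−p)/n) for the standard deviation of a sample proportion; variable names are illustrative.

```python
import math

n = 2_376_523   # ballots returned
p = 0.5         # worst case: p(1-p) is largest at p = .5

sd = math.sqrt(p * (1 - p) / n)   # SD of the sample proportion
sd_pp = sd * 100                  # same quantity in percentage points
z = 0.2 / sd_pp                   # a .2 percentage-point error, in SD units

# Two-tailed area of the standard Normal beyond +/- z.
two_tail = math.erfc(z / math.sqrt(2))

print(f"SD = {sd_pp:.4f} percentage points")
print(f"z = {z:.2f}")
print(f"two-tailed area = {two_tail:.1e}")
```

Computed directly, the standard deviation comes out at roughly 0.0324 percentage points (slightly different in the last digit from the figure as quoted), which gives the quoted ratio of about 6.17 and a tail area near .0000000007.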
Over 10 million ballots were sent out. Of the 2,376,523 ballots which were filled in and returned, 1,293,669 were for Landon, 972,897 for Roosevelt, and the remainder for other candidates. The actual vote was 16,679,583 for Landon and 27,476,673 for Roosevelt out of a total of 45,647,117.↩︎
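To put these counts on the same scale as the sampling error, the sketch below compares the poll's vote shares with the actual result. It uses only the figures quoted above; the worst-case binomial standard deviation (p=.5) is an assumption carried over for comparison.

```python
import math

returned = 2_376_523
poll = {"Landon": 1_293_669, "Roosevelt": 972_897}

total_vote = 45_647_117
actual = {"Landon": 16_679_583, "Roosevelt": 27_476_673}

# Worst-case sampling SD of a proportion, in percentage points.
sd_pp = 100 * math.sqrt(0.25 / returned)

for name in poll:
    poll_pct = 100 * poll[name] / returned
    actual_pct = 100 * actual[name] / total_vote
    print(f"{name}: poll {poll_pct:.1f}%, actual {actual_pct:.1f}%, "
          f"error {poll_pct - actual_pct:+.1f} points")

landon_error_pp = 100 * (poll["Landon"] / returned - actual["Landon"] / total_vote)
print(f"Landon error = {landon_error_pp / sd_pp:.0f} sampling SDs")
```

The error in Landon's share is hundreds of sampling standard deviations, so essentially all of it has to be systematic rather than random.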
Readers curious about modern election forecasting’s systematic vs random error should see Shirani-Mehr et al 2018, “Disentangling Bias and Variance in Election Polls”: the systematic error turns out to be almost identical in size to the random error, ie. half the total error. Hence, anomalies like Donald Trump or Brexit are not particularly anomalous at all. –Editor.↩︎
I should mention this one is not quite as silly as it sounds, as there is experimental evidence for cocoa improving cognitive function.↩︎
The same authors offer up a number of country-level correlations such as “Linguistic Diversity/Traffic accidents”, alcohol consumption/morphological complexity, and acacia trees vs tonality, which feed into their paper “Constructing knowledge: nomothetic approaches to language evolution” on the dangers of naive approaches to cross-country comparisons due to the high intercorrelation of cultural traits. More sophisticated approaches might be better; they derive a fairly plausible-looking graph of the relationships between variables.↩︎
Lots of data is not exactly normal, but, particularly in human studies, this is not a big deal because the n are often large enough, eg n>20, that the asymptotics have started to work & model misspecification doesn’t produce too large a false positive rate inflation or mis-estimation. Unfortunately, in animal research, it’s perfectly typical to have sample sizes more like n=5, which in an idealized power analysis of a normally-distributed variable might be fine because one is (hopefully) exploiting the freedom of animal models to get a large effect size / precise measurements—except that with n=5 the data won’t be even close to approximately normal or fitting other model assumptions, and a single biased or selected or outlier datapoint can mess it up further.↩︎