Measurement error and the replication crisis
Measurement error adds noise to predictions, increases uncertainty in parameter estimates, and makes it more difficult to discover new phenomena or to distinguish among competing theories. A common view is that any study finding an effect under noisy conditions provides evidence that the underlying effect is particularly strong and robust. Yet, statistical significance conveys very little information when measurements are noisy. In noisy research settings, poor measurement can contribute to exaggerated estimates of effect size. This problem and related misunderstandings are key components in a feedback loop that perpetuates the replication crisis in science.
It seems intuitive that producing a result under challenging circumstances makes it all the more impressive. If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack. But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.
Measurement error can be defined as random variation, of some distributional form, that produces a difference between observed and true values (1). Measurement error and other sources of uncontrolled variation in scientific research therefore add noise. The latter is typically an attenuating factor, as acknowledged in various scientific disciplines. Spearman (2) famously derived a formula for the attenuation of observed correlations due to unreliable measurement. In epidemiology, it is textbook knowledge that nondifferential misclassification tends to bias relative risk estimates toward the null (3). According to Hausman's “iron law” of econometrics, effect sizes in simple regression models are underestimated when the predictors contain error variance (4).
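Spearman's attenuation result can be illustrated with a short simulation (ours, not from the article). The true correlation and the reliabilities below are arbitrary illustrative values; with a large sample, the observed correlation lands close to the value the attenuation formula predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # sample large enough that sampling error is negligible
rho = 0.5            # true correlation between the error-free variables

# Error-free variables with population correlation rho
x_true = rng.standard_normal(n)
y_true = rho * x_true + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Add independent measurement error; rel_x and rel_y are the reliabilities
# (proportion of observed variance due to the true score)
rel_x, rel_y = 0.7, 0.8
x_obs = np.sqrt(rel_x) * x_true + np.sqrt(1 - rel_x) * rng.standard_normal(n)
y_obs = np.sqrt(rel_y) * y_true + np.sqrt(1 - rel_y) * rng.standard_normal(n)

r_obs = np.corrcoef(x_obs, y_obs)[0, 1]
r_predicted = rho * np.sqrt(rel_x * rel_y)   # Spearman's attenuation formula
print(round(r_obs, 3), round(r_predicted, 3))
```

At this sample size the observed correlation is reliably smaller than the true correlation, matching the attenuation intuition.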
It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise. As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger (5–7).
The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects (8) and even without multiple hypothesis testing by researchers (9). In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.
The second, related issue is that in noisy research settings, statistical significance provides very weak evidence for either the sign or the magnitude of any underlying effect. Statistically significant estimates are, roughly speaking, at least two standard errors from zero. In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small. This is known as the statistical significance filter and can be a severe upward bias in the magnitude of effects; as one of us has shown, reported estimates can be an order-of-magnitude larger than any plausible underlying effects (10).
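The strength of this filter can be seen in a quick simulation (our sketch, not from the article). The true correlation of 0.1 and sample size of 50 are arbitrary illustrative choices, and p-values use the Fisher z approximation rather than the exact t test:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.1      # small true correlation (illustrative assumption)
n, n_sims = 50, 10_000

def pearson_p(x, y):
    """Pearson r with a two-sided p-value from the Fisher z approximation."""
    r = np.corrcoef(x, y)[0, 1]
    z = math.atanh(r) * math.sqrt(len(x) - 3)
    return r, math.erfc(abs(z) / math.sqrt(2))

sig = []
for _ in range(n_sims):
    x = rng.standard_normal(n)
    y = true_effect * x + math.sqrt(1 - true_effect**2) * rng.standard_normal(n)
    r, p = pearson_p(x, y)
    if p < 0.05:          # keep only the "publishable" results
        sig.append(abs(r))

# Estimates that pass the significance filter greatly overstate the truth
print(round(float(np.mean(sig)), 3), "vs true effect", true_effect)
```

In this setting, the average statistically significant estimate is several times the true effect, because only estimates at least about two standard errors from zero survive the filter.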
In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance.
Suppose we measure x and y in a setting where the underlying truth is that there is a small effect of x on y. Imagine four conditions based on changes in two factors. First, we might have either a high-powered study (sample size N = 3000) or a low-powered study (N = 50). Second, we might have measurements on x and y that are high quality, or have some degree of additional measurement error. In the large-N scenario, adding measurement error will almost always reduce the observed correlation between x and y (see the figure, left panel). But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).
Take these scenarios and now add selection on statistical significance. We can track the proportion of studies, as a function of sample size, where the observed effect is larger than the original error-free effect. For the largest samples, the observed effect is always smaller than the original. But for smaller N, a fraction of the observed effects exceeds the original. If we were to condition on whether or not the observed effect was statistically significant, then the fraction is even larger (see the figure, right panel).
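The exercise above can be sketched in a few lines of simulation code (our reconstruction of the described setup, not the authors' code; the effect size, amount of added error, and the Fisher z approximation for p-values are our illustrative choices):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
rho = 0.15        # small underlying effect (illustrative assumption)
error_sd = 0.5    # sd of extra measurement error added to x and y
n_sims = 5_000

def pearson_p(x, y):
    """Pearson r with a two-sided p-value from the Fisher z approximation."""
    r = np.corrcoef(x, y)[0, 1]
    z = math.atanh(r) * math.sqrt(len(x) - 3)
    return r, math.erfc(abs(z) / math.sqrt(2))

def one_study(n):
    """Return (r without added error, r with added error, p with error)."""
    x = rng.standard_normal(n)
    y = rho * x + math.sqrt(1 - rho**2) * rng.standard_normal(n)
    r_clean = np.corrcoef(x, y)[0, 1]
    r_noisy, p_noisy = pearson_p(x + error_sd * rng.standard_normal(n),
                                 y + error_sd * rng.standard_normal(n))
    return r_clean, r_noisy, p_noisy

frac_bigger, frac_bigger_sig = {}, {}
for n in (3000, 50):
    runs = [one_study(n) for _ in range(n_sims)]
    frac_bigger[n] = np.mean([rn > rc for rc, rn, _ in runs])
    frac_bigger_sig[n] = np.mean([rn > rc for rc, rn, p in runs if p < 0.05])

print(frac_bigger)      # large N: added error almost never inflates the estimate
print(frac_bigger_sig)  # small N plus significance filter: inflation is common
```

Under these assumptions, the high-powered studies behave as the "iron law" predicts, while in the small-N condition a substantial fraction of noisy estimates exceed the error-free ones, and conditioning on statistical significance makes that fraction larger still.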
Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance. But to acknowledge that there may have been a substantial amount of uncontrolled variation is to acknowledge that the study contained less information than was initially thought. If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.
The situation becomes more complicated in problems with multiple predictors, or with nonindependent errors. Wacholder et al. (11) discuss scenarios beyond simple two-group risk-exposure studies where misclassification can lead to exaggerated estimates. For the simpler setting, though, they conclude that while “the estimate may exceed the true value…it is more likely to fall below the true value.” We agree with Wacholder et al. for studies in which effects and sample sizes are large. But for noisier studies, especially combined with selective filtering on statistically significant observed effects, we think that there is a greater chance that the effects are exaggerated rather than attenuated. Jurek et al. have also provided evidence that individual research studies can be biased away from the null (12).
A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement. Furthermore, the signal-to-noise ratio cannot in general be estimated merely from internal evidence. It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high. It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.
The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.
Acknowledgments
We thank the Office of Naval Research for grant N00014-15-1-2541.
References and Notes
1. S. Messick, Validity, ETS Research Report Series RR-87-40 (Educational Testing Service, Princeton, NJ, 1987).
2. C. Spearman, Am. J. Psychol. 15, 72 (1904).
3. K. J. Rothman, S. Greenland, T. L. Lash, Modern Epidemiology (Kluwer, ed. 3, 2008).
4. J. Hausman, J. Econ. Perspect. 15, 57 (2001).
5. J. J. Heckman, Boston Rev. (1 September 2012); http://bostonreview.net/forum/promoting-social-mobility/final-response-aiding-life-cycle-james-heckman.
6. J. K. Maner, J. Exp. Soc. Psychol. 66, 100 (2016).
7. S. Goldin-Meadow, Assoc. for Psychol. Sci. Observer (October 2016); www.psychologicalscience.org/observer/preregistration-replication-and-nonexperimental-studies.
8. J. Simmons, L. Nelson, U. Simonsohn, Psychol. Sci. 22, 1359 (2011).
9. A. Gelman, E. Loken, Am. Sci. 102, 460 (2014).
10. A. Gelman, J. B. Carlin, Perspect. Psychol. Sci. 9, 641 (2014).
11. S. Wacholder, P. Hartge, J. H. Lubin, M. Dosemeci, Occup. Environ. Med. 52, 557 (1995).
12. A. M. Jurek, S. Greenland, G. Maldonado, T. R. Church, Int. J. Epidemiol. 34, 680 (2005).
Correcting Measurement Error to Build Scientific Knowledge
Loken and Gelman (1) describe problems of null-hypothesis significance testing, selective publishing, and imperfect measures distorting the scientific literature. They raise questions about the validity of widely accepted, well-understood methods for statistically correcting measurement error (2) and about the studies that have applied them. Relationships and effects under study are affected by "noise." Fortunately, this noise can be separated into two types, and there are well-known solutions for each. The first type is systematic error in measurement, which predictably biases observed relations downward. The second type is random (sampling) error, which unpredictably obscures true relations, making observed relations smaller or larger in an unsystematic way. On average, random error is zero, but it can be large in small samples (3).

Both types of error obscure true relations and must be corrected to draw accurate conclusions. Systematic error can be addressed in single samples using well-known statistical corrections (2). Random error cannot be corrected in single samples, because the direction and size of the error are unknown. It does, however, asymptote to zero as sample size increases, so its impact can be mitigated either by gathering much larger samples or, more practically, by combining results from smaller studies through meta-analysis (4). When studies are pooled meta-analytically, their random errors cancel out, and their systematic errors can be statistically corrected without sacrificing precision (4), permitting accurate estimates of true relations. In primary studies and replications, the focus should be on estimating parameters, quantifying measurement error, and reporting complete results (i.e., without censoring for significance or other questionable research practices), including corrected effect sizes and confidence intervals.
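The statistical correction the letter refers to (2) can be sketched in a short simulation (ours, for illustration). The true correlation and reliabilities are arbitrary values, and the reliabilities are treated as known, whereas in practice they must themselves be estimated (e.g., from test-retest data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 50_000, 0.4          # large sample; illustrative true correlation
rel_x, rel_y = 0.7, 0.8       # reliabilities, assumed known for this sketch

# Error-free variables, then observed versions with measurement error added
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
x_obs = np.sqrt(rel_x) * x + np.sqrt(1 - rel_x) * rng.standard_normal(n)
y_obs = np.sqrt(rel_y) * y + np.sqrt(1 - rel_y) * rng.standard_normal(n)

r_obs = np.corrcoef(x_obs, y_obs)[0, 1]
r_corrected = r_obs / np.sqrt(rel_x * rel_y)  # Spearman correction for attenuation
print(round(r_obs, 3), round(r_corrected, 3))
```

With a large sample and accurate reliability estimates, the corrected value recovers the true correlation; the open question raised by Loken and Gelman concerns what happens when samples are small and the significance filter operates.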
Primary studies and replications are well suited for collecting and reporting high-quality data, but less suited for reaching "definitive" conclusions about relationships. Meta-analyses can integrate those data most effectively, drawing conclusions about generalizability (5) and building scientific knowledge.
1. E. Loken, A. Gelman, Measurement error and the replication crisis. Science. 355, 584–585 (2017).
2. C. Spearman, The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
3. A. Tversky, D. Kahneman, Belief in the law of small numbers. Psychol. Bull. 76, 105–110 (1971).
4. F. L. Schmidt, J. E. Hunter, Methods of meta-analysis: Correcting error and bias in research findings (SAGE, 2014).
5. C. Viswesvaran, D. S. Ones, F. L. Schmidt, H. Le, I.-S. Oh, Measurement error obfuscates scientific knowledge: Path to cumulative knowledge requires corrections for unreliability and psychometric meta-analyses. Ind. Organ. Psychol. 7, 507–518 (2014).
RE: Measurement error and the replication crisis
As the lead author noted in a personal email, the article could have drawn a clearer distinction between sampling error and random measurement error. We show that random measurement error always attenuates population effect sizes and statistical power, which reduces the chance of obtaining a significant result.
When sampling error (due to luck or questionable research practices) inflates observed effect sizes enough to produce a significant result, the median amount of inflation is inversely related to the power of the study. Thus, conditional on selection for significance, random measurement error leads to more inflation, but the estimated effect sizes are never larger than those that would have been obtained with a more reliable measure. In sum, consistent with statistical theory, random measurement error always attenuates observed effect sizes, even when studies are selected for significance.
A more detailed commentary, which could not be published in Science, is available on my blog.
Ulrich Schimmack & Rickard Carlsson
replicationindex.wordpress.com/2017/02/23/random-measurement-error-and-the-replication-crisis-a-statistical-analysis/