Criticizing studies and statistics is hard in part because so many criticisms are possible; when everything can be criticized, criticism becomes meaningless. What makes a criticism good is the chance that it is a ‘difference which makes a difference’ to our ultimate actions.
created: 19 May 2019; modified: 21 May 2019; status: finished; confidence: highly likely; importance: 7
Scientific and statistical research must be read with a critical eye to understand how credible its claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false. But any given study can be criticized for so many possible things, all falling short of an unobtainable ideal, that it becomes unclear which criticisms are important, and criticism may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?
I think that what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”. This is why issues of causal inference or biases yielding overestimates are universally important: a ‘causal’ effect turning out to be zero will change almost all decisions based on such research, while other issues like measurement error or distributional assumptions, which are equally common, are often unimportant because they yield much smaller changes in conclusions, and hence in decisions.
If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.
Learning statistics is great. If you want to read and understand scientific papers in general, there’s little better to learn than statistics because everything these days touches on statistical issues and draws on increasingly powerful statistical methods and large datasets, whether flashy like machine learning or mundane like geneticists drawing on biobanks of millions of people, and if you don’t have at least some grasp of statistics, you will be increasingly left out of scientific and technological progress and unable to meaningfully discuss their application to society, so you must have a good grounding in statistics if you are at all interested in these topics—or so I want to say. The problem is… learning statistics can be dangerous.
Like learning some formal logic or about cognitive biases, statistics seems like the sort of thing of which one might say:
“A little learning is a dangerous thing; / Drink deep, or taste not the Pierian spring: / There shallow draughts intoxicate the brain, / And drinking largely sobers us again.” When you first learn some formal logic and about fallacies, it’s hard not to use the shiny new hammer to go around playing ‘fallacy bingo’ (to mix metaphors): “ah ha! that is an ad hominem, my good sir, and a logically invalid objection.” The problem, of course, is that many fallacies are perfectly good as a matter of inductive logic: ad hominems are often highly relevant (eg if the person is being bribed). A rigorous insistence on formal syllogisms will at best waste a lot of time, and at worst becomes a tool for self-delusion by selective application of rigor. Similarly, cognitive biases are hard to use effectively (because they are informative priors in some cases, and in common harmful cases, one will have already learned better), but are easy to abuse—it’s always easiest to see how someone else is sadly falling prey to confirmation bias.
With statistics, a little reading and self-education will quickly lead to learning about a whole universe of possible ways for a study to screw up statistically, and as skeptical as one quickly becomes, as Ioannidis and Gelman and the Replicability Crisis and far too many examples of scientific findings completely collapsing show, one probably isn’t skeptical enough because there are in fact an awful lot of screwed up studies out there. Here are a few potential issues, in deliberately no particular order:
- “spurious correlations” caused by data processing (such as ratio/percentage data, or normalizing to a common time-series)
- multiplicity: many different subgroups or hypotheses tested, with only statistically-significant ones reported and no control of the overall false detection rate
- missingness not modeled
- in vivo animal results applied to humans
- experiment run on regular children rather than identical twins
- publication bias detected in meta-analysis
- a failure to reject the null or a positive point-estimate being interpreted as evidence for the null hypothesis
- “the difference between statistically-significant and non-statistically-significant is not statistically-significant”
- choice of an inappropriate distribution, like modeling a log-normal variable by a normal variable (“they strain at the gnat of the prior who swallow the camel of the likelihood”)
- no use of single or double-blinding or placebos
- a genetic study testing correlation between 1 gene and a trait
- an IQ experiment finding an intervention increased before/after scores on some IQ subtests and thus increased IQ
- cross-sectional rather than longitudinal study
- ignoring multilevel structure (like data being collected from sub-units of schools, countries, families, companies, websites, individual fishing vessels, WP editors, etc)
- reporting performance of GWAS polygenic scores using only SNPs which reach genome-wide statistical-significance
- nonzero attrition but no use of intent-to-treat analysis
- use of a fixed alpha threshold like 0.05
- correlational data interpreted as causation
- use of an “unidentified” model, requiring additional constraints or priors
- non-preregistered analyses done after looking at data; p-hacking of every shade
- use of cause-specific mortality rather than all-cause mortality as a measurement
- use of measurements with high levels of measurement error (such as dietary questionnaires)
- ceiling/floor effects (particularly on IQ tests)
- claims about latent variables made on the basis of measurements of greatly differing quality
    - or after “controlling for” intermediate variables, comparing total effects of one variable to solely indirect effects of another
    - or that one variable mediates an effect without actually setting up a mediation SEM
- studies radically underpowered to detect a plausible effect
- the “statistical-significance filter” inflating effects
- the base rate fallacy
- self-selected survey respondents; convenience samples from Mechanical Turk or Google Surveys or similar services
- animal experiments with randomization not blocked by litter/cage/room
- using a large dataset and obtaining many statistically-significant results
- factor analysis without establishing measurement invariance
- experimenter demand effects
- using a SVM/NN/RF without cross-validation or a heldout sample
    - or using them, but with data preprocessing done or hyperparameters selected based on the whole dataset
- passive control groups
- not doing a factorial experiment but testing one intervention on each group
- flat priors overestimating effects
- a genetic study testing correlation between 500,000 genes and a trait
- conflicts of interest by the researchers/funders
- lack of a power analysis to design the experiment
- analyzing Likert scales as a simple continuous cardinal variable
- animal results in a single inbred or clonal strain, with the goal of reducing variance/increasing power (Michie 1955)
- temporal autocorrelation of measurements
- reliance on interaction terms
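Several of these pitfalls are easy to demonstrate by simulation. For instance, a minimal sketch of the multiplicity problem, using a hypothetical subgroup analysis where the true effect is exactly zero (only NumPy/SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Multiplicity: test 40 subgroups where the true effect is exactly zero.
# With alpha = 0.05, we expect ~2 'significant' results purely by chance,
# and reporting only those would manufacture a finding out of noise.
n_subgroups, n_per_group = 40, 50
false_positives = 0
for _ in range(n_subgroups):
    treatment = rng.normal(0, 1, n_per_group)  # no real effect
    control   = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        false_positives += 1

print(f"'significant' subgroups out of {n_subgroups}: {false_positives}")
```

Controlling the overall false detection rate (e.g. Bonferroni or Benjamini–Hochberg corrections) is precisely what guards against this.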
Some of these issues are big—even fatal, to the point where the study is not just meaningless but the world would be a better place if the researchers in question had never published. Others are serious but regrettable: a study afflicted by them is still useful, and perhaps the best that can reasonably be done. And some flaws are usually minor, almost certain not to matter, possibly to the point where it is misleading to bring them up at all as a ‘criticism’, since doing so implies the flaw is worth discussing. And many are completely context-dependent, and could be anything from instantly fatal to minor nuisance.
But which are which? You can probably guess at where a few of them fall, but I would be surprised if you knew what I meant by all of them, or had well-justified beliefs about how important each is, because I don’t, and I suspect few people do. Nor can anyone tell you how important each one is. One just has to learn by experience, it seems, watching things replicate or diminish in meta-analyses or get debunked over the years, to gradually get a feel for what is important. There are checklists and professional manuals1 which one can read and employ, and they at least have the virtue of checklists in being systematic reminders of things to check, reducing the temptation to cherry-pick criticism, and I recommend their use, but they are not a complete solution. (In some cases they recommend quite bad things, and none covers everything.)
No wonder that statistical criticism can feel like a blood-sport, or like learning statistical-significance statistics: a long list of special-case tests with little rhyme or reason, making up a “cookbook” of arbitrary formulas and rituals.
After a while, you have learned enough to throw a long list of criticisms at any study, which devalues criticism (surely studies can’t all be equally worthless) and risks the same problem as with formal logic or cognitive biases, of merely weaponizing it and laboring to make yourself more wrong. (I have over the years criticized many studies and while for many of them my criticisms were much less than they deserved and have since been borne out, I could not honestly say that I have always been right or that I did not occasionally ‘gild the lily’ a little.)
So, what do we mean by statistical criticism? What makes a good or bad statistical objection?
It can’t just be that a criticism is boring and provokes eye-rolling—someone who in every genetics discussion from ~2000–2010 harped on statistical power & polygenicity and stated that all these exciting new candidate-gene & gene-environment interaction results were so much hogwash and the entire literature garbage would have been deeply irritating to read, wear out their welcome fast, and have been absolutely right. (Or for nutrition research, or for social psychology, or for…) As provoking as it may be to read yet another person sloganize “correlation≠causation” or “yeah, in mice!”, unfortunately, for much research that is all that should ever be said about it, no matter how much we weary of it.
It can’t be that some assumption is violated (or unproven or unprovable), or that some aspect of the real world is left out, because all statistical models are massively abstract, gross simplifications: it is always possible to identify some inappropriate assumption of normality, or some autocorrelation which is not modeled, or some nonlinear term not included, or prior information left out, or data lacking in some respect. Checklists and preregistrations and other techniques can help improve the quality considerably, but will never solve this problem. Short of tautological analysis of a computer simulation, there is not and never has been a perfect statistical analysis, and if there were, it would be too complicated for anyone to understand (which is a criticism as well). All of our models are false, but some may be useful, and a good statistical analysis is merely ‘good enough’.
It can’t be that results “replicate” or not. Replicability doesn’t say much other than if further data were collected the same way, the results would stay the same. While a result which doesn’t replicate is of questionable value at best (it most likely wasn’t real to begin with2), a result being replicable is no guarantee of quality either. One may have a consistent GIGO process, but replicable garbage is still garbage. To collect more data may be to simply more precisely estimate the process’s systematic error and biases. (No matter how many published homeopathy papers you can find showing homeopathy works, it doesn’t.)
It certainly has little to do with p-values, either in a study or in its replications (because nothing of interest has to do with p-values); if we correct an error and change a specific p-value from p=0.05 to p=0.06, so what? (“Surely, God loves the 0.06 nearly as much as the 0.05…”) Posterior probabilities, while meaningful and important, also are no criterion: is it important if a study has a posterior probability of a parameter being greater than zero of 95% rather than 94%? Or >99%? Or >50%? If a criticism, when corrected, reduces a posterior probability from 99% to 90%, is that what we mean by an important criticism? Probably (ahem) not.
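Just how little data separates p=0.05 from p=0.06 can be checked directly: the corresponding two-sided test statistics differ by only ~4% (a quick check with SciPy):

```python
from scipy import stats

# The data underlying p=0.05 vs p=0.06 are nearly indistinguishable:
# the two-sided critical z-scores differ by only about 4%.
z_05 = stats.norm.isf(0.05 / 2)  # z for p = 0.05, two-sided (~1.960)
z_06 = stats.norm.isf(0.06 / 2)  # z for p = 0.06, two-sided (~1.881)
print(round(z_05, 3), round(z_06, 3), round(1 - z_06 / z_05, 3))
```

A correction that nudges an estimate by a few percent crosses the 0.05 line without telling us anything of substance about the effect.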
It also doesn’t have to do with any increase or decrease in effect sizes. If a study makes some errors which means that it produces an effect size twice as large as it should, this might be absolutely damning or it might be largely irrelevant. Perhaps the uncertainty was at least that large so no one took the point-estimate at face-value to begin with, or everyone understood the potential for errors and understood the point-estimate was an upper bound. Or perhaps the effect is so large that overestimation by a factor of 10 wouldn’t be a problem.
It usually doesn’t have to do with predictive power (whether quantified as R2 or AUC or ROC etc); sheer prediction is the goal of only a subset of research (although if one could show that a particular choice led to a lower predictive score, that would be a good critique), and in many contexts, the best model is not particularly predictive at all, while a model being too predictive is a red flag.
What would count as a good criticism?
Well, if a draft of a study was found in which the claims were based on a statistically-significant effect in one variable, but the final published version omitted that variable and talked only about a different one, one would wonder. Discovering that the authors of a study had been paid millions of dollars by a company benefiting from the study results would seriously shake one’s confidence in the results. If a correlation didn’t exist at all when we compared siblings within a family, or better yet, identical twins, or if the correlation didn’t exist in other datasets or other countries, then regardless of how strongly supported it is in that one dataset, it would be a concern. If a fancy new machine learning model outperformed the SOTA by 2%, but turned out not to be using a heldout sample properly and actually performed the same, doubtless ML researchers would be less impressed. If someone showed an RCT reached the opposite effect size to a correlational analysis, that would strike most people as important. If a major new cancer drug was being touted as being as effective as the usual chemotherapy with fewer side-effects in the latest trial, and one sees that both were being compared to a null hypothesis of zero effect and the point-estimate for the new drug was lower than for the usual chemotherapy, would patients want to use it? If a psychology experiment had different results with a passive control group than with an active control group, or a surgery’s results depended on whether the clinical trial used blinding, that is certainly an issue. And if the data were fabricated entirely, that would certainly be worth mentioning.
These are all inherently different going by some of the conventional views outlined above. So what do they have in common that makes them good criticisms?
> “Results are only valuable when the amount by which they probably differ from the truth is so small as to be insignificant for the purposes of the experiment. What the odds should be depends:
>
> - On the degree of accuracy which the nature of the experiment allows, and
> - On the importance of the issues at stake.”

> “Moreover, the economic approach seems (if not rejected owing to aristocratic or puritanic taboos) the only device apt to distinguish neatly what is or is not contradictory in the logic of uncertainty (or probability theory). That is the fundamental lesson supplied by Wald’s notion of ‘admissibility’…probability theory and decision theory are but two versions (theoretical and practical) of the study of the same subject: uncertainty.”
But what I think they share in common is this decision-theoretic justification:
The importance of a statistical criticism is the probability that it would change a hypothetical decision based on that research.
I would assert that p-values are not posterior probabilities are not effect sizes are not utilities are not profits are not decisions. All analyses are ultimately decision analyses: our beliefs and analyses may be continuous, but our actions are discrete.
When we critique a study, the standard we grope towards is one which ultimately terminates in real-world actions and decision-making, a standard which is inherently context-dependent, admits of no bright lines, and depends on the use and motivation for research, grounded in what is the right thing to do.4
It doesn’t have anything to do with attaining some arbitrary level of “significance”, or even any particular posterior probability, or effect size threshold; it doesn’t have anything to do with violating a particular assumption, unless, by violating that assumption, the model is not ‘good enough’ and would lead to bad choices; and it is loosely tied to replication (because if a result doesn’t replicate in the future situations in which actions will be taken, it’s not useful for planning) but not defined by it (as a result could replicate fine while still being useless).
The importance of many of these criticisms can be made much more intuitive by asking what the research is for and how it would affect a downstream decision. We don’t need to do a formal decision analysis going all the way from data through a Bayesian analysis to utilities and a causal model to compare (although this would be useful to do and might be necessary in edge cases), an informal consideration can be a good start, as one can intuitively guess at the downstream effects.
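To make this concrete, here is a toy formal version of such a decision analysis in Python. All the numbers (the benefit, the cost, the 25% prior that a correlation is causal) are hypothetical, chosen only to show how a single criticism can flip a decision:

```python
# Toy decision analysis: should we act on a correlational finding?
# All numbers below are illustrative assumptions, not from any real study.
benefit_if_causal = 100.0   # value gained if the effect really is causal
cost_of_acting    = 30.0    # cost of the intervention, paid either way

def expected_value(p_causal: float) -> float:
    """Expected net value of acting, given P(correlation is causal)."""
    return p_causal * benefit_if_causal - cost_of_acting

naive     = expected_value(1.00)  # treating correlation as causation
realistic = expected_value(0.25)  # a more defensible causal prior

print(naive, realistic)  # 70.0 vs -5.0: the decision flips sign
```

The criticism “correlation is not causation” matters here precisely because it moves the expected value across zero; a criticism that merely moved 70.0 to 65.0 would change nothing.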
Revisiting some of the example criticisms with more of a decision-theoretic view:
- A critique of assuming correlation=causation is a good one, because correlation is usually not causation, and going from an implicit ~100% certainty that it is to a more realistic 25% or less would change many decisions: that observation alone reduces the expected value by >75%, which is a big enough penalty to eliminate many appealing-sounding things. Because causal effects are such a central topic, any methodological errors which affect the inference from correlation to causation are important errors.
- A critique of a distributional assumption (such as observing that a variable isn’t so much normal as Student’s t-distributed) isn’t usually an important one, because the change in the posterior distribution of any key variable will be minimal, and could change only decisions which are on a knife’s-edge to begin with (and thus of little value).
- A sociological or psychological correlation without any controls for genetic confounding is vulnerable to critique, because controlling for genetics routinely shrinks the correlation by a large fraction, often to zero, and thus eliminates most of the causal expectations.
- Overfitting to a training set, and so actually being similar to or worse than the current SOTA, is one of the more serious criticisms in machine learning, because having any degree of better performance is typically why anyone would want to use a method. (But of course, if the new method has some other practical advantage, it would be entirely reasonable for someone to say that the overfitting is a minor critique as far as they are concerned, because they want it for that other reason and some loss of performance is minor.)
- In a medical context too, what matters is the cost-benefit of a new treatment compared to the best existing one, not whether it happens to work better than nothing at all; the important thing is being more cost-effective than the default action, so that people will choose the new treatment over the old, and if the net estimate is that it is probably slightly worse, why would they choose it?
- Interpreting a failure to reject the null as evidence for the null is often a problem: a reasonable person might conclude the opposite of what they would conclude if they looked at the actual data. I’m reminded of the Toxoplasma gondii studies which use small samples to estimate the correlation between infection and something like accidents, and, upon getting a point-estimate almost identical to other larger studies (ie. infection predicts bad things) which happens not to be statistically-significant due to their sample size, conclude that they have found evidence against there being a correlation. One should conclude the opposite! (One heuristic for interpreting results is to ask: “if I entered this result into a meta-analysis of all results, would it strengthen or weaken the meta-analytic result?”)
- Distribution-wise, using a normal instead of a log-normal is minor, since they are so similar in the bulk of their distributions… unless we are talking about their tails, as in an employment or athletics or media analysis, where the more extreme points out on the tail are the important thing; in that case, using a normal will lead to wild underestimates of how far out those outliers will be, which could be of great practical importance.
- Issues of experimental design, like using between-subject rather than within-subject designs, or not using identical twins for twin experiments, can be practically important: as Student pointed out in his discussion of the Lanarkshire Milk Experiment, among its other problems, the use of random children who weren’t matched or blocked in any way meant that statistical power was unnecessarily low; the experiment could have been done with a sample size 97% smaller had it been better designed, a major savings in expense (which could have paid for many more experiments).
- Treating a Likert scale as a cardinal variable is a statistical sin… but only a peccadillo that everyone commits, because Likert scales are so often equivalent to a (noisier) normally-distributed variable that a fully-correct treatment as an ordinal scale with a latent variable winds up being a great deal more work while not actually changing any conclusions, and thus actions.5
- Similarly, temporal autocorrelation is often not as big a deal as it’s made out to be.
- On the other hand, questions of ‘measurement invariance’ in IQ experiments may sound deeply recondite and like statistical quibbling, but they boil down to the question of whether a test gain is a gain on intelligence, or whether it can be accounted for by gains solely on a subtest tapping into some much more specialized skill like English vocabulary; a gain on intelligence is far more valuable than some test-specific improvement, and if it is the latter, the experiment has found a real causal effect, but that effect is fool’s gold.
- And in discussions of measurement of latent variables, the question may hinge critically on the use. Suppose one compares a noisy IQ test to a high-quality personality test (without correcting for the differing measurement error of each), and finds that the latter is more predictive of some life outcome; does this mean ‘personality is more important than intelligence’ for that trait? Well, it depends on the use. If one is making a theoretical argument about the latent variables, this is a serious fallacy, and correcting for measurement error may completely reverse the conclusion and show the opposite; but if one is doing screening (for employment or college or the like), then it is irrelevant which latent variable is the better predictor, because the tests are what they are—unless, on the gripping hand, one is considering introducing a better but more expensive IQ test, in which case the latent variables are important after all, because, depending on how important the latent variables are (rather than the crude measured variables), the potential improvement from better measurement may be enough to justify the better test…
- Or consider heritability estimates, like SNP heritability estimates from GCTA. A GCTA estimate of, say, 25% for a trait measurement can be interpreted as an upper bound on a GWAS of the same measurement; this is useful to know, but it’s not the same thing as an upper bound on a GWAS of the true, latent, measured-without-error variable. Most such GCTAs use measurements with a great deal of measurement error, and if you correct for measurement error, the true GCTA could be much higher: for example, IQ GCTAs are typically ~25%, but most datasets trade quality for quantity and use poor IQ tests, and correcting for that, the true GCTA is closer to 50%, which is quite different. Which is right? If you are merely trying to understand how good a GWAS based on that particular dataset of measurements can be, the former is the right interpretation, as it establishes your upper bound and you will need better methods or measurements to go beyond it; but if you are trying to make claims about the trait per se (as so many people do!), the latent variable is the relevant thing, and talking only about the measured variable is highly misleading and can result in totally mistaken conclusions (especially when comparing across datasets with different measurement errors).
- Blinding poses a similar problem: its absence means that the effect being estimated is not necessarily the one we want to estimate, but this is context-dependent. A psychology study typically uses measures where some degree of effort or control is possible, and the effects of research interest are typically so small (like dual n-back’s supposed IQ improvement of a few points) that they can be inflated by a small amount of trying harder; on the other hand, a medical experiment on a cancer drug measuring all-cause mortality, if the drug works, can produce a dramatic difference in survival rates, cancer doesn’t care whether a patient is optimistic, and it is difficult for the researchers to subtly skew a measurement like all-cause mortality (because a patient is either dead or not).
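The normal-vs-log-normal tail point can be made concrete by matching a normal distribution to a log-normal’s mean and standard deviation and comparing extreme quantiles (illustrative parameters, not from any real dataset):

```python
import numpy as np
from scipy import stats

# Log-normal with mu=0, sigma=1, vs a normal matched to its mean and SD.
mu, sigma = 0.0, 1.0
m = np.exp(mu + sigma**2 / 2)                                   # log-normal mean
s = np.sqrt((np.exp(sigma**2) - 1) * np.exp(2*mu + sigma**2))   # log-normal SD

lognorm = stats.lognorm(s=sigma, scale=np.exp(mu))
norm    = stats.norm(loc=m, scale=s)

for q in (0.50, 0.999):
    print(q, round(lognorm.ppf(q), 2), round(norm.ppf(q), 2))
# The bulk differs only modestly, but at the 99.9th percentile the
# log-normal reaches ~22 while the matched normal reaches only ~8:
# the normal badly underestimates how extreme the outliers get.
```

For decisions that hinge on the tail (top performers, worst-case losses), that factor-of-several underestimate is exactly the kind of difference which makes a difference.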
This definition is not a panacea since often it may not be clear what decisions are downstream, much less how much a criticism could quantitatively affect it. But it provides a clear starting point for understanding which ones are, or should be, important (meta-analyses being particularly useful for nailing down things like average effect size bias due to a particular flaw), and which ones are dubious or quibbling and are signs that you are stretching to come up with any criticisms; if you can’t explain at least somewhat plausibly how a criticism (or a combination of criticisms) could lead to diametrically opposite conclusions or actions, perhaps they are best left out.
There are two main categories I know of, reporting checklists and quality-evaluation checklists (in addition to the guidelines/recommendations published by professional groups, like the APA’s manual (apparently based on JARS) or AERA’s standards).
Some reporting checklists:
- STROBE (checklists; justification) for cohort, case-control, and cross-sectional studies
- TREND (checklist; justification) for non-randomized experiments
- CONSORT (checklist; justification) for RCTs
- QUOROM/PRISMA (checklist; justification) for reviews/meta-analyses
Some quality-evaluation scales:↩︎
Non-replication of a result puts the original result in an awkward trilemma: either the original result was spurious (the most a priori likely case), the non-replicator got it wrong or was unlucky (difficult since most are well-powered and following the original, so—one man’s modus ponens is another man’s modus tollens—it would be easier to argue the original result was unlucky), or the research claim is so fragile and context-specific that non-replication is just ‘heterogeneity’ (but then why should anyone believe the result in any substantive way, or act on it, if it’s a coin-flip whether it even exists anywhere else?).↩︎
It is interesting to note that the medieval origins of ‘probability’ were themselves inherently decision-based as focused on the question of what it was moral to believe & act upon, and the mathematical roots of probability theory were also pragmatic, based on gambling. Laplace, of course, took a similar perspective in his early Bayesianism (eg Laplace on witness testimony or estimating the mass of Saturn). It was later statistical thinkers like Boole or Fisher who tried to expunge pragmatic interpretations in favor of purer definitions like limiting frequencies.↩︎
As formally incorrect as it may be, whenever I have done the work to treat ordinal variables correctly, it has typically merely tweaked the coefficients & standard error, and not actually changed anything. Knowing this, it would be dishonest of me to criticize any study which does likewise unless I have some good reason (like having reanalyzed the data and—for once—gotten a major change in results).↩︎
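A quick simulation illustrates the point (assuming, hypothetically, that the Likert responses arise from cutting a latent normal variable into 5 levels):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Latent continuous variables with a known correlation of 0.5.
x = rng.normal(size=n)
latent_y = 0.5 * x + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

# Discretize the outcome into a 5-point Likert scale.
likert_y = np.digitize(latent_y, [-1.5, -0.5, 0.5, 1.5]) + 1  # values 1..5

r_latent = np.corrcoef(x, latent_y)[0, 1]
r_likert = np.corrcoef(x, likert_y)[0, 1]
print(round(r_latent, 3), round(r_likert, 3))
# The Likert version is only mildly attenuated (~0.48 vs ~0.50):
# treating it as a continuous variable costs little here.
```

With only a few percent of attenuation, any conclusion (or decision) robust enough to be worth acting on survives the cardinal treatment unchanged.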