How Should We Critique Research?

Criticizing studies and statistics is hard in part because so many criticisms are possible, which risks making criticism meaningless. What makes a criticism a good one is the chance that it is a ‘difference which makes a difference’ to our ultimate actions.
Bayes, decision-theory, criticism, statistics, philosophy, causality
2019-05-19–2019-07-07 · finished · certainty: highly likely · importance: 8


Sci­en­tific and sta­tis­ti­cal research must be read with a crit­i­cal eye to under­stand how cred­i­ble the claims are. The Repro­ducibil­ity Cri­sis and the growth of meta-science have demon­strated that much research is of low qual­ity and often false. But there are so many pos­si­ble things any given study could be crit­i­cized for, falling short of an unob­tain­able ide­al, that it becomes unclear which pos­si­ble crit­i­cism is impor­tant, and they may degen­er­ate into mere rhetoric. How do we sep­a­rate fatal flaws from unfor­tu­nate caveats from spe­cious quib­bling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”. This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important, because a ‘causal’ effect turning out to be zero effect or overestimated by a large factor will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important as they typically yield much smaller changes in conclusions, and hence decisions.

If we reg­u­larly ask whether a crit­i­cism would make this kind of differ­ence, it will be clearer which ones are impor­tant crit­i­cisms, and which ones risk being rhetor­i­cal dis­trac­tions and obstruct­ing mean­ing­ful eval­u­a­tion of research.

Learn­ing sta­tis­tics is great. If you want to read and under­stand sci­en­tific papers in gen­er­al, there’s lit­tle bet­ter to learn than sta­tis­tics because every­thing these days touches on sta­tis­ti­cal issues and draws on increas­ingly pow­er­ful sta­tis­ti­cal meth­ods and large datasets, whether flashy like machine learn­ing or mun­dane like geneti­cists draw­ing on biobanks of mil­lions of peo­ple, and if you don’t have at least some grasp of sta­tis­tics, you will be increas­ingly left out of sci­en­tific and tech­no­log­i­cal progress and unable to mean­ing­fully dis­cuss their appli­ca­tion to soci­ety, so you must have a good ground­ing in sta­tis­tics if you are at all inter­ested in these top­ic­s—or so I want to say. The prob­lem is… learn­ing sta­tis­tics can be dan­ger­ous.

Valley of Bad Statistics

Like learn­ing some for­mal logic or about cog­ni­tive bias­es, sta­tis­tics seems like the sort of thing one might say “A lit­tle learn­ing is a dan­ger­ous thing / Drink deep, or taste not the Pier­ian spring / There shal­low draughts intox­i­cate the brain, / And drink­ing largely sobers us again.”

When you first learn some for­mal logic and about fal­lac­i­es, it’s hard to not use the shiny new ham­mer to go around play­ing ‘fal­lacy bingo’ (to mix metaphors): “aha! that is an ad hominem, my good sir, and a log­i­cally invalid objec­tion.” The prob­lem, of course, is that many fal­lac­ies are per­fectly good as a mat­ter of induc­tive log­ic: ad hominems are often highly rel­e­vant (eg if the per­son is being bribed). A rig­or­ous insis­tence on for­mal syl­lo­gisms will at best waste a lot of time, and at worst becomes a tool for self­-delu­sion by selec­tive appli­ca­tion of rig­or.

Sim­i­lar­ly, cog­ni­tive biases are hard to use effec­tively (be­cause they are infor­ma­tive pri­ors in some cas­es, and in com­mon harm­ful cas­es, one will have already learned bet­ter), but are easy to abuse—it’s always eas­i­est to see how some­one else is sadly falling prey to con­fir­ma­tion bias.

All Things Large and Small

With sta­tis­tics, a lit­tle read­ing and self­-e­d­u­ca­tion will quickly lead to learn­ing about a uni­verse of ways for a study to screw up sta­tis­ti­cal­ly, and as skep­ti­cal as one quickly becomes, as Ioan­ni­dis and Gel­man and the Replic­a­bil­ity Cri­sis and far too many exam­ples of sci­en­tific find­ings com­pletely col­laps­ing show, one prob­a­bly isn’t skep­ti­cal enough because there are in fact an awful lot of screwed up stud­ies out there. Here are a few poten­tial issues, in delib­er­ately no par­tic­u­lar order:

  • “spu­ri­ous cor­re­la­tions” caused by data pro­cess­ing (such as ratio/percentage data, or nor­mal­iz­ing to a com­mon time-series)

  • mul­ti­plic­i­ty: many sub­groups or hypothe­ses test­ed, with only sta­tis­ti­cal­ly-sig­nifi­cant ones reported and no con­trol of the over­all false detec­tion rate

  • miss­ing­ness not mod­eled

  • in vivo ani­mal results applied to humans

  • exper­i­ment run on reg­u­lar chil­dren

  • pub­li­ca­tion bias detected in meta-analy­sis

  • a fail­ure to reject the null or a pos­i­tive point-es­ti­mate being inter­preted as evi­dence for the null hypoth­e­sis

  • “the differ­ence between sta­tis­ti­cal­ly-sig­nifi­cant and non-s­ta­tis­ti­cal­ly-sig­nifi­cant is not sta­tis­ti­cal­ly-sig­nifi­cant”

  • choice of an inap­pro­pri­ate dis­tri­b­u­tion, like mod­el­ing a log-nor­mal vari­able by a nor­mal vari­able (“they strain at the gnat of the prior who swal­low the camel of the like­li­hood”)

  • no use of sin­gle or dou­ble-blind­ing or place­bos

  • a genetic study test­ing cor­re­la­tion between 1 gene and a trait

  • an IQ exper­i­ment find­ing an inter­ven­tion increased before/after scores on some IQ sub­tests and thus increased IQ

  • cross-sec­tional rather than lon­gi­tu­di­nal study

  • ignor­ing mul­ti­level struc­ture (like data being col­lected from sub­-u­nits of schools, coun­tries, fam­i­lies, com­pa­nies, web­sites, indi­vid­ual fish­ing ves­sels, WP edi­tors etc)

  • report­ing per­for­mance of GWAS poly­genic scores using only SNPs which reach genome-wide sta­tis­ti­cal-sig­nifi­cance

  • nonzero attri­tion but no use of inten­t-to-treat analy­sis

  • use of a fixed alpha thresh­old like 0.05

  • cor­re­la­tional data inter­preted as cau­sa­tion

  • use of an “uniden­ti­fied” mod­el, requir­ing addi­tional con­straints or pri­ors

  • non-pre­reg­is­tered analy­ses done after look­ing at data; p-hack­ing of every shade

  • use of cause-spe­cific mor­tal­ity vs all-cause mor­tal­ity as a mea­sure­ment

  • use of mea­sure­ments with high lev­els of mea­sure­ment error (such as dietary ques­tion­naires)

    • ceiling/floor effects (par­tic­u­larly IQ tests)
  • claims about latent vari­ables made on the basis of mea­sure­ments of greatly differ­ing qual­ity

    • or after “con­trol­ling for” inter­me­di­ate vari­ables, com­par­ing total effects of one vari­able to solely indi­rect effects of another
    • or that one vari­able medi­ates an effect with­out actu­ally set­ting up a medi­a­tion SEM
  • stud­ies rad­i­cally under­pow­ered to detect a plau­si­ble effect

  • the “sta­tis­ti­cal-sig­nifi­cance fil­ter” inflat­ing effects

  • base rate fal­lacy

  • self­-s­e­lected sur­vey respon­dents; con­ve­nience sam­ples from Mechan­i­cal Turk or Google Sur­veys or sim­i­lar ser­vices

  • ani­mal exper­i­ments with ran­dom­iza­tion not blocked by litter/cage/room

  • using a and obtain­ing many sta­tis­ti­cal­ly-sig­nifi­cant results

  • fac­tor analy­sis with­out estab­lish­ing mea­sure­ment invari­ance

  • exper­i­menter demand effects

  • using a SVM/NN/RF with­out cross­val­i­da­tion or held­out sam­ple

    • using them, but with data pre­pro­cess­ing done or hyper­pa­ra­me­ters selected based on the whole dataset
  • pas­sive con­trol groups

  • not doing a fac­to­r­ial exper­i­ment but test­ing one inter­ven­tion on each group

  • flat pri­ors over­es­ti­mat­ing effects

  • report­ing of rel­a­tive risk increase with­out absolute risk increase

  • a genetic study test­ing cor­re­la­tion between 500,000 genes and a trait

  • con­flicts of inter­est by the researchers/funders

  • lack of power analy­sis to design exper­i­ment

  • ana­lyz­ing Lik­ert scales as a sim­ple con­tin­u­ous car­di­nal vari­able

  • animal results in a single inbred or clonal strain, with the goal of reducing variance/increasing power (Michie 1955)

  • right-cen­sored data

  • tem­po­ral auto­cor­re­la­tion of mea­sure­ments

  • genetic con­found­ing

  • reliance on inter­ac­tion terms

Some of these issues are big issues—even fatal, to the point where the study is not just meaningless but the world would be a better place if the researchers in question had never published. Others are serious, but, while regrettable, a study afflicted by them is still useful and perhaps the best that can reasonably be done. And some flaws are usually minor, almost certain not to matter, possibly to the point where it is misleading to bring them up at all as ‘criticisms’, since doing so implies they are worth discussing. And many are completely context-dependent, and could be anything from instantly fatal to minor nuisance.

But which are which? You can prob­a­bly guess at where a few of them fall, but I would be sur­prised if you knew what I meant by all of them, or had well-jus­ti­fied beliefs about how impor­tant each is, because I don’t, and I sus­pect few peo­ple do. Nor can any­one tell you how impor­tant each one is. One just has to learn by expe­ri­ence, it seems, watch­ing things repli­cate or dimin­ish in meta-analy­ses or get debunked over the years, to grad­u­ally get a feel of what is impor­tant. There are check­lists and pro­fes­sional man­u­als1 which one can read and employ, and they at least have the virtue of check­lists in being sys­tem­atic reminders of things to check, reduc­ing the temp­ta­tion to cher­ry-pick crit­i­cism, and I rec­om­mend their use, but they are not a com­plete solu­tion. (In some cas­es, they rec­om­mend quite bad things, and none can be con­sid­ered com­plete.)

No won­der that sta­tis­ti­cal crit­i­cism can feel like a blood­-s­port, or feel like learn­ing sta­tis­ti­cal-sig­nifi­cance sta­tis­tics: a long list of spe­cial-case tests with lit­tle rhyme or rea­son, mak­ing up a “cook­book” of arbi­trary for­mu­las and rit­u­als, use­ful largely for “mid­dle­brow dis­missals”.

After a while, you have learned enough to throw a long list of crit­i­cisms at any study regard­less of whether they are rel­e­vant or not, engag­ing in “pseudo-analy­sis”2, which deval­ues crit­i­cism (surely stud­ies can’t all be equally worth­less) and risks the same prob­lem as with for­mal logic or cog­ni­tive bias­es—of merely weaponiz­ing it and hav­ing laboured solely to make your­self more wrong, and defend your errors in more elab­o­rate ways. (I have over the years crit­i­cized many stud­ies and while for many of them my crit­i­cisms were much less than they deserved and have since been borne out, I could not hon­estly say that I have always been right or that I did not occa­sion­ally ‘gild the lily’ a lit­tle.)

Relevant But Not Definitive

So, what do we mean by sta­tis­ti­cal crit­i­cism? what makes a good or bad sta­tis­ti­cal objec­tion?

Bad Criticisms

“Here I should like to say: a wheel that can be turned though noth­ing else moves with it, is not part of the mech­a­nism.”

Ludwig Wittgenstein, Philosophical Investigations, §271

It can’t just be that a crit­i­cism is bor­ing and pro­vokes eye­-rolling—­some­one who in every genet­ics dis­cus­sion from ~2000–2010 harped on sta­tis­ti­cal power & poly­genic­ity and stated that all these excit­ing new can­di­date-gene & gene-en­vi­ron­ment inter­ac­tion results were so much hog­wash and the entire lit­er­a­ture garbage would have been deeply irri­tat­ing to read, wear out their wel­come fast, and have been absolutely right. (Or for nutri­tion research, or for social psy­chol­o­gy, or for…) As pro­vok­ing as it may be to read yet another per­son slo­ga­nize “cor­re­la­tion ≠ cau­sa­tion” or “yeah, in mice!”, unfor­tu­nate­ly, for much research that is all that should ever be said about it, no mat­ter how much we weary of it.

It can’t be that some assump­tion is vio­lated (or unproven or unprov­able), or that some aspect of the real world is left out, because all sta­tis­ti­cal mod­els are mas­sively abstract, gross sim­pli­fi­ca­tions. Because it is always pos­si­ble to iden­tify some issue of inap­pro­pri­ate assump­tion of nor­mal­i­ty, or some auto­cor­re­la­tion which is not mod­eled, or some non­lin­ear term not includ­ed, or prior infor­ma­tion left out, or data lack­ing in some respect. Check­lists and pre­reg­is­tra­tions and other tech­niques can help improve the qual­ity con­sid­er­ably, but will never solve this prob­lem. Short of tau­to­log­i­cal analy­sis of a com­puter sim­u­la­tion, there is not and never has been a per­fect sta­tis­ti­cal analy­sis, and if there was, it would be too com­pli­cated for any­one to under­stand (which is a crit­i­cism as well). All of our mod­els are false, but some may be use­ful, and a good sta­tis­ti­cal analy­sis is merely ‘good enough’.

It can’t be that results “repli­cate” or not. Replic­a­bil­ity does­n’t say much other than if fur­ther data were col­lected the same way, the results would stay the same. While a result which does­n’t repli­cate is of ques­tion­able value at best (it most likely was­n’t real to begin with3), a result being replic­a­ble is no guar­an­tee of qual­ity either. One may have a con­sis­tent GIGO process, but replic­a­ble garbage is still garbage. To col­lect more data may be to sim­ply more pre­cisely esti­mate the process’s sys­tem­atic error and bias­es. (No mat­ter how many pub­lished home­opa­thy papers you can find show­ing home­opa­thy works, it does­n’t.)

It cer­tainly has lit­tle to do with p-val­ues, either in a study or in its repli­ca­tions (be­cause noth­ing of inter­est has to do with p-val­ues); if we cor­rect an error and change a spe­cific p-value from p = 0.05 to p = 0.06, so what? (“Sure­ly, God loves the 0.06 nearly as much as the 0.05…”) Pos­te­rior prob­a­bil­i­ties, while mean­ing­ful and impor­tant, also are no cri­te­ri­on: is it impor­tant if a study has a pos­te­rior prob­a­bil­ity of a para­me­ter being greater than zero of 95% rather than 94%? Or >99%? Or >50%? If a crit­i­cism, when cor­rect­ed, reduces a pos­te­rior prob­a­bil­ity from 99% to 90%, is that what we mean by an impor­tant crit­i­cism? Prob­a­bly (ahem) not.

It also does­n’t have to do with any increase or decrease in effect sizes. If a study makes some errors which means that it pro­duces an effect size twice as large as it should, this might be absolutely damn­ing or it might be largely irrel­e­vant. Per­haps the uncer­tainty was at least that large so no one took the point-es­ti­mate at face-value to begin with, or every­one under­stood the poten­tial for errors and under­stood the point-es­ti­mate was an upper bound. Or per­haps the effect is so large that over­es­ti­ma­tion by a fac­tor of 10 would­n’t be a prob­lem.

It usu­ally does­n’t have to do with pre­dic­tive power (whether quan­ti­fied as R2 or AUC etc); sheer pre­dic­tion is the goal of a sub­set of research (although if one could show that a par­tic­u­lar choice led to a lower pre­dic­tive score, that would be a good cri­tique), and in many con­texts, the best model is not par­tic­u­larly pre­dic­tive at all, and a model being too pre­dic­tive is a red flag.

Good Criticisms

“The sta­tis­ti­cian is no longer an alchemist expected to pro­duce gold from any worth­less mate­r­ial offered him. He is more like a chemist capa­ble of assay­ing exactly how much of value it con­tains, and capa­ble also of extract­ing this amount, and no more. In these cir­cum­stances, it would be fool­ish to com­mend a sta­tis­ti­cian because his results are pre­cise, or to reprove because they are not. If he is com­pe­tent in his craft, the value of the result fol­lows solely from the value of the mate­r­ial given him. It con­tains so much infor­ma­tion and no more. His job is only to pro­duce what it con­tain­s…Im­mensely labo­ri­ous cal­cu­la­tions on infe­rior data may increase the yield from 95 to 100 per cent. A gain of 5 per cent, of per­haps a small total. A com­pe­tent over­haul­ing of the process of col­lec­tion, or of the exper­i­men­tal design, may often increase the yield ten or twelve fold, for the same cost in time and labour.
…To con­sult the sta­tis­ti­cian after an exper­i­ment is fin­ished is often merely to ask him to con­duct a post mortem exam­i­na­tion. He can per­haps say what the exper­i­ment died of.”

R. A. Fisher, “Presidential address to the first Indian statistical congress”, 1938

What would count as a good crit­i­cism?

Well, if a draft of a study was found and the claims were based on a sta­tis­ti­cal­ly-sig­nifi­cant effect in one vari­able, but in the final pub­lished ver­sion, it omits that vari­able and talks only about a differ­ent vari­able, one would won­der. Dis­cov­er­ing that authors of a study had been paid mil­lions of dol­lars by a com­pany ben­e­fit­ing from the study results would seri­ously shake one’s con­fi­dence in the results. If a cor­re­la­tion did­n’t exist at all when we com­pared sib­lings within a fam­i­ly, or bet­ter yet, iden­ti­cal twins, or if the cor­re­la­tion did­n’t exist in other datasets, or other coun­tries, then regard­less of how strongly sup­ported it is in that one dataset, it would be a con­cern. If a fancy new machine learn­ing model out­per­formed SOTA by 2%, but turned out to not be using a held­out sam­ple prop­erly and actu­ally per­formed the same, doubt­less ML researchers would be less impressed. If some­one showed an RCT reached the oppo­site effect size to a cor­re­la­tional analy­sis, that would strike most peo­ple as impor­tant. If a major new can­cer drug was being touted as being as effec­tive as the usual chemother­apy with fewer side-effects in the lat­est tri­al, and one sees that both were being com­pared to a null hypoth­e­sis of zero effect and the point-es­ti­mate for the new drug was lower than the usual chemother­a­py, would patients want to use it? If a psy­chol­ogy exper­i­ment had differ­ent results with a pas­sive con­trol group and an active con­trol group, or a surgery’s results depend on whether the clin­i­cal trial used blind­ing, cer­tainly an issue. And if data was fab­ri­cated entire­ly, that would cer­tainly be worth men­tion­ing.

These are all inher­ently differ­ent going by some of the con­ven­tional views out­lined above. So what do they have in com­mon that makes them good crit­i­cisms?

Beliefs Are For Actions

“Results are only valu­able when the amount by which they prob­a­bly differ from the truth is so small as to be insignifi­cant for the pur­poses of the exper­i­ment. What the odds should be depends:”

  1. “On the degree of accu­racy which the nature of the exper­i­ment allows, and”
  2. “On the impor­tance of the issues at stake.”

William Sealy Gosset (“Student”), “The Application of the ‘Law of Error’ to the work of the Brewery”, 19044

“More­over, the eco­nomic approach seems (if not rejected owing to aris­to­cratic or puri­tanic taboos) the only device apt to dis­tin­guish neatly what is or is not con­tra­dic­tory in the logic of uncer­tainty (or prob­a­bil­ity the­o­ry). That is the fun­da­men­tal les­son sup­plied by notion of …prob­a­bil­ity the­ory and deci­sion the­ory are but two ver­sions (the­o­ret­i­cal and prac­ti­cal) of the study of the same sub­ject: uncer­tain­ty.”

Bruno de Finetti, “Comment on Savage’s ‘On Rereading R. A. Fisher’”, 1976

But what I think they share in com­mon is this deci­sion-the­o­retic jus­ti­fi­ca­tion which uni­fies crit­i­cisms (and would unify sta­tis­ti­cal ped­a­gogy too):

The impor­tance of a sta­tis­ti­cal crit­i­cism is the prob­a­bil­ity that it would change a hypo­thet­i­cal deci­sion based on that research.

I would assert that p-val­ues are not pos­te­rior prob­a­bil­i­ties are not effect sizes are not util­i­ties are not profits are not deci­sions. Dichotomies come from deci­sions. All analy­ses are ulti­mately deci­sion analy­ses: our beliefs and analy­ses may be con­tin­u­ous, but our actions are dis­crete.

When we cri­tique a study, the stan­dard we grope towards is one which ulti­mately ter­mi­nates in real-world actions and deci­sion-mak­ing, a stan­dard which is inher­ently con­tex­t-de­pen­dent, admits of no bright lines, and depends on the use and moti­va­tion for research, grounded in what is the right thing to do.5

How should we eval­u­ate a sin­gle small study?

It does­n’t have any­thing to do with attain­ing some arbi­trary level of “sig­nifi­cance” or being “well-pow­ered” or hav­ing a cer­tain k in a meta-analy­sis for esti­mat­ing het­ero­gene­ity, or even any par­tic­u­lar pos­te­rior prob­a­bil­i­ty, or effect size thresh­old; it does­n’t have any­thing to do with vio­lat­ing a par­tic­u­lar assump­tion, unless, by vio­lat­ing that assump­tion, the model is not ‘good enough’ and would lead to bad choic­es; and it is loosely tied to repli­ca­tion (be­cause if a result does­n’t repli­cate in the future sit­u­a­tions in which actions will be tak­en, it’s not use­ful for plan­ning) but not defined by it (as a result could repli­cate fine while still being use­less).

The impor­tance of many of these crit­i­cisms can be made much more intu­itive by ask­ing what the research is for and how it would affect a down­stream deci­sion. We don’t need to do a for­mal deci­sion analy­sis going all the way from data through a Bayesian analy­sis to util­i­ties and a causal model to com­pare (although this would be use­ful to do and might be nec­es­sary in edge cas­es), an infor­mal con­sid­er­a­tion can be a good start, as one can intu­itively guess at the down­stream effects.
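
To make this informal consideration concrete, here is a minimal sketch in Python (all numbers are invented for illustration and come from no study discussed here): it asks whether correcting a criticism would flip the sign of the expected value of acting on a result, using the correlation-vs-causation discount as the example of an ‘important’ criticism and a small distributional tweak as an ‘unimportant’ one.

```python
# Hypothetical decision analysis: does correcting a criticism change the decision?
def expected_value(p_causal, effect, benefit_per_unit, cost):
    """Expected net benefit of acting as if the reported effect were causal."""
    return p_causal * effect * benefit_per_unit - cost

# A study reports a correlation of 0.3 and implicitly treats it as causal.
naive      = expected_value(p_causal=1.00, effect=0.30, benefit_per_unit=100, cost=10)
# Criticism A: correlation is usually not causation; discount to ~25%.
discounted = expected_value(p_causal=0.25, effect=0.30, benefit_per_unit=100, cost=10)
# Criticism B: a distributional quibble that nudges the estimate slightly.
quibbled   = expected_value(p_causal=1.00, effect=0.28, benefit_per_unit=100, cost=10)

for name, ev in [("naive", naive), ("causal discount", discounted), ("quibble", quibbled)]:
    print(f"{name:16s} EV = {ev:6.1f} -> act? {ev > 0}")
# The causal criticism flips the decision (EV 20.0 -> -2.5); the quibble does
# not (EV 20.0 -> 18.0), so only the former is an 'important' criticism here.
```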

I think we can mean­ing­fully apply this cri­te­rion even to ‘pure’ research ques­tions where it is unclear how the research would ever be applied, specifi­cal­ly. We know a great deal about epis­te­mol­ogy and sci­en­tific method­ol­ogy and what prac­tices tend to lead to reli­able knowl­edge. (When peo­ple argue in favor of pure research because of its his­tory of spin­offs like cryp­tog­ra­phy from num­ber the­o­ry, that very argu­ment implies that the spin­offs aren’t that unpre­dictable & is a suc­cess­ful prag­matic defense. The fact that our evolved curios­ity can be use­ful is surely no acci­den­t.)

For exam­ple, even with­out a spe­cific pur­pose in mind for some research, we can see why forg­ing fraud­u­lent data is the worst pos­si­ble crit­i­cism: because there is no deci­sion what­so­ever which is made bet­ter by using faked data. Many assump­tions or short­cuts will work in some cas­es, but there is no case where fake data, which is uncor­re­lated with real­i­ty, works; even in the case where the fake data is scrupu­lously forged to exactly repli­cate the best under­stand­ing of real­ity6, it dam­ages deci­sion-mak­ing by over­stat­ing the amount of evi­dence, lead­ing to over­con­fi­dence and under­ex­plo­ration.

Sim­i­lar­ly, care­less data col­lec­tion and mea­sure­ment error. Micro­bi­ol­o­gists could­n’t know about CRISPR in advance, before it was dis­cov­ered by com­par­ing odd entries in DNA data­bas­es, and it’s a good exam­ple of how pure research can lead to tremen­dous gains. But how could you dis­cover any­thing from DNA data­bases if they are incom­plete, full of mislabeled/contaminated sam­ples, or the sequenc­ing was done slop­pily & the sequences largely ran­dom garbage? If you’re study­ing ‘can­cer cells’ and they are a mis­la­beled cell line & actu­ally liver cells, how could that pos­si­bly add to knowl­edge about can­cer?

Or consider the placebo effect. If you learned that a particular study’s result was driven entirely by a placebo effect and that using blinding would yield a null, I can safely predict that—regardless of field or topic or anything else—you will almost always be badly disappointed. If a study measures just a placebo effect (specifically, demand or expectancy effects), this is damning, because the placebo effect is already known to be universally applicable (so showing that it happened again is not interesting), to operate through a narrow psychological causal mechanism which fades out over time & doesn’t affect hard endpoints (like mortality), and to leave untouched the countless causal mechanisms which placebo-biased studies appear to be manipulating (and whose manipulation would in fact be useful both immediately and for building theories). If, say, an intervention does nothing except through the placebo effect, why would we want to use it? There are some exceptions where we would be indifferent after learning a result was just a placebo effect (chronic pain treatment? mild influenza?), but not many.

How about non-replicability? The simplest explanation for the Replicability Crisis in psychology is that most of the results aren’t real and were random noise, p-hacked into publications. The most charitable interpretation made by apologists is that the effects were real, but they are simply either small or so highly context-dependent on the exact details (the precise location, color of the paper, experimenter, etc) that even collaborating with the original researchers is not guaranteed to replicate an effect successfully. Again, regardless of the specific result, this presents a trilemma which is particularly damaging from a decision-theory point of view:

  1. either the results aren’t real (and are use­less for deci­sion-mak­ing),
  2. they are much smaller than reported (and thus much less use­ful for any kind of appli­ca­tion or the­o­ry-build­ing),
  3. or they are so frag­ile and in any future con­text almost as likely to be some other effect, in even the oppo­site direc­tion, that their aver­age effect is effec­tively zero (and thus use­less).

Decisions precede beliefs. Our ontology and our epistemology flow from our decision theory, not vice-versa. This may appear to be logically backwards, but that is the situation we are in, as evolved embodied beings thinking & acting under uncertainty: like Otto Neurath’s raft, there is nowhere we can ‘step aside’ and construct all belief and knowledge up from scratch and logical metaphysics; instead, we examine and repair our raft as we stand on it, piece by piece. The naturalistic answer to the skeptic is that our beliefs are not unreliable because they are empirical or evolved or ultimately begin, temporally, in trial-and-error, but reliable because they have gradually evolved to be pragmatically correct for decision-making, and from that footing have developed reliable knowledge of the world and methods of science. (An example of reversing the flow would be the Deutsch-Wallace attempt to found the Born rule on decision theory; earlier, statisticians showed that much of statistics could be grounded in decision-making instead of vice-versa, as demonstrated by the subjective probability school and its associated devices.)

Decision-theoretic Criticisms

“The threat of deci­sion analy­sis is more pow­er­ful than its exe­cu­tion.”

, 2019

“A good rule of thumb might be, ‘If I added a zero to this num­ber, would the sen­tence con­tain­ing it mean some­thing differ­ent to me?’ If the answer is ‘no’, maybe the num­ber has no busi­ness being in the sen­tence in the first place.”

Ran­dall Munroe

Revis­it­ing some of the exam­ple crit­i­cisms with more of a deci­sion-the­o­retic view:

  • A critique of assuming correlation=causation is a good one, because correlation is usually not causation, and going from an implicit ~100% certainty that it is to a more realistic 25% or less would change many decisions, as that observation alone reduces the expected value by >75%, which is a big enough penalty to eliminate many appealing-sounding things.

    Because causal effects are such a cen­tral top­ic, any method­olog­i­cal errors which affect infer­ence of cor­re­la­tion rather than cau­sa­tion are impor­tant errors.

  • A cri­tique of dis­tri­b­u­tional assump­tions (such as observ­ing that a vari­able isn’t so much nor­mal as Stu­den­t’s t-dis­trib­ut­ed) isn’t usu­ally an impor­tant one, because the change in the pos­te­rior dis­tri­b­u­tion of any key vari­able will be min­i­mal, and could change only deci­sions which are on a knife’s-edge to begin with (and thus, of lit­tle val­ue).

    • There are exceptions here, and in some areas, this can be critical. Distribution-wise, using a normal instead of a log-normal is often minor since they are so similar in the bulk of their distribution… unless we are talking about their tails, as in an order-statistics or extreme-value context (common in any kind of selection or extremes analysis, such as employment or athletics or natural disasters), where the more extreme points out on the tail are the important thing; in which case, using a normal will lead to wild underestimates of how far out those outliers will be, which could be of great practical importance (see the first sketch after this list).

    • On the other hand, treat­ing a Lik­ert scale as a car­di­nal vari­able is a sta­tis­ti­cal sin… but only a pec­ca­dillo that every­one com­mits because Lik­ert scales are so often equiv­a­lent to a (more noisy) nor­mal­ly-dis­trib­uted vari­able that a ful­ly-cor­rect trans­for­ma­tion to an ordi­nal scale with a latent vari­able winds up being a great deal more work while not actu­ally chang­ing any con­clu­sions and thus actions.7

      Sim­i­lar­ly, tem­po­ral auto­cor­re­la­tion is often not as big a deal as it’s made out to be.

  • Sociological/psychological cor­re­la­tions with genetic con­founds are vul­ner­a­ble to cri­tique, because con­trol­ling for genet­ics rou­tinely shrinks the cor­re­la­tion by a large frac­tion, often to zero, and thus elim­i­nates most of the causal expec­ta­tions.

  • Over­fit­ting to a train­ing set and actu­ally being sim­i­lar or worse than the cur­rent SOTA is one of the more seri­ous crit­i­cisms in machine learn­ing, because hav­ing bet­ter per­for­mance is typ­i­cally why any­one would want to use a method. (But of course—if the new method is intrigu­ingly nov­el, or has some other prac­ti­cal advan­tage, it would be entirely rea­son­able for some­one to say that the over­fit­ting is a minor cri­tique as far as they are con­cerned, because they want it for that other rea­son and some loss of per­for­mance is minor.)

  • use of a straw­man null hypoth­e­sis: in a med­ical con­text too, what mat­ters is the cost-ben­e­fit of a new treat­ment com­pared to the best exist­ing one, and not whether it hap­pens to work bet­ter than noth­ing at all; the impor­tant thing is being more cost-effec­tive than the default action, so peo­ple will choose the new treat­ment over the old, and if the net esti­mate is that it is prob­a­bly slightly worse, why would they choose it?

  • Inter­pret­ing a fail­ure to reject the null as proof of null: often a prob­lem. The logic of sig­nifi­cance-test­ing, such as it is, man­dates agnos­ti­cism any time the null has not been reject­ed, but so intu­itive is Bayesian rea­son­ing—ab­sence of evi­dence is evi­dence of absence—that if a sig­nifi­cance-test does not vin­di­cate a hypoth­e­sis, we nat­u­rally inter­pret it as evi­dence against the hypoth­e­sis. Yet real­ly, it might well be evi­dence for the hypoth­e­sis, and sim­ply not enough evi­dence: so a rea­son­able per­son might con­clude the oppo­site of what they would if they looked at the actual data.

    I’m reminded of the studies which use small samples to estimate the correlation between some infection and something like accidents, and upon getting a point-estimate almost identical to other larger studies (ie. infection predicts bad things) which happens to not be statistically-significant due to their sample size, conclude that they have found evidence against there being a correlation. One should conclude the opposite! (One heuristic for interpreting results is to ask: “if I entered this result into a meta-analysis of all results, would it strengthen or weaken the meta-analytic result?” See the second sketch after this list.)

  • inefficient experiment design, like using between-subject rather than within-subject designs, or not using identical twins for twin experiments, can be practically important: as “Student” (Gosset) noted in his discussion of the Lanarkshire Milk Experiment, among other problems with it, the use of random children who weren’t matched or blocked in any way meant that statistical power was unnecessarily low, and the Lanarkshire Milk Experiment could have been done with a sample size 97% smaller had it been better designed, which would have yielded a major savings in expense (which could have paid for many more experiments).

  • lack of mea­sure­ment invari­ance: ques­tions of ‘mea­sure­ment invari­ance’ in IQ exper­i­ments may sound deeply recon­dite and like sta­tis­ti­cal quib­bling, but they boil down to the ques­tion of whether the test gain is a gain on intel­li­gence or if it can be accounted for by gains solely on a sub­test tap­ping into some much more spe­cial­ized skill like Eng­lish vocab­u­lary; a gain on intel­li­gence is far more valu­able than some test-spe­cific improve­ment, and if it is the lat­ter, the exper­i­ment has found a real causal effect but that effect is fool’s gold.

  • con­flat­ing mea­sured with latent vari­ables: And in dis­cus­sions of mea­sure­ment of latent vari­ables, the ques­tion may hinge crit­i­cally on the use. Sup­pose one com­pares a noisy IQ test to a high­-qual­ity per­son­al­ity test (with­out incor­po­rat­ing cor­rec­tion for the differ­ing mea­sure­ment error of each one), and finds that the lat­ter is more pre­dic­tive of some life out­come; does this mean ‘per­son­al­ity is more impor­tant than intel­li­gence’ to that trait? Well, it depends on use. If one is mak­ing a the­o­ret­i­cal argu­ment about the latent vari­ables, this is a seri­ous fal­lacy and cor­rect­ing for mea­sure­ment error may com­pletely reverse the con­clu­sion and show the oppo­site; but if one is doing screen­ing (for employ­ment or col­lege or some­thing like that), then it is irrel­e­vant which latent vari­ables is a bet­ter pre­dic­tor, because the tests are what they are—un­less, on the grip­ping hand, one is con­sid­er­ing intro­duc­ing a bet­ter more expen­sive IQ test, in which case the latent vari­ables are impor­tant after all because, depend­ing on how impor­tant the latent vari­ables are (rather than the crude mea­sured vari­ables), the poten­tial improve­ment from a bet­ter mea­sure­ment may be enough to jus­tify the bet­ter test…

    Or consider heritability estimates, like SNP heritability estimates from GCTA. A GCTA estimate of, say, 25% for a trait measurement can be interpreted as an upper bound on a GWAS for the same measurement; this is useful to know, but it’s not the same thing as an upper bound on a GWAS of, or ‘genetic influence’ in some sense on, the true, latent, measured-without-error variable. Most such GCTAs use measurements with a great deal of measurement error, and if you correct for measurement error, the true GCTA could be much higher—for example, IQ GCTAs are typically ~25%, but most datasets trade quality for quantity and use poor IQ tests, and correcting that, the true GCTA is closer to 50%, which is quite different. Which is right? Well, if you are merely trying to understand how good a GWAS based on that particular dataset of measurements can be, the former is the right interpretation, as it establishes your upper bound and you will need better methods or measurements to go beyond it; but if you are trying to make claims about a trait per se (as so many people do!), the latent variable is the relevant thing, and talking only about the measured variable is highly misleading and can result in totally mistaken conclusions (especially when comparing across datasets with different measurement errors).

  • Lack of blind­ing poses a sim­i­lar prob­lem: its absence means that the effect being esti­mated is not nec­es­sar­ily the one we want to esti­mate—but this is con­tex­t-de­pen­dent. A psy­chol­ogy study typ­i­cally uses mea­sures where some degree of effort or con­trol is pos­si­ble, and the effects of research inter­est are typ­i­cally so small (like dual n-back’s sup­posed IQ improve­ment of a few points) that they can be inflated by a small amount of try­ing hard­er; on the other hand, a med­ical exper­i­ment of a can­cer drug mea­sur­ing all-cause mor­tal­i­ty, if it works, can pro­duce a dra­matic differ­ence in sur­vival rates, can­cer does­n’t care whether a patient is opti­mistic, and it is diffi­cult for the researchers to sub­tly skew the col­lected data like all-cause mor­tal­ity (be­cause a patient is either dead or not).
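
As a supplement to the order-statistics point in the distributional-assumptions bullet above, here is a minimal simulation sketch with made-up parameters: fit a normal distribution by moments to log-normal data and compare the predicted quantiles; the bulk matches tolerably, the extreme tail does not.

```python
# Toy illustration: a normal fit to log-normal data looks fine in the bulk
# but underestimates the extreme tail that selection/extremes problems care about.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

mu, sd = data.mean(), data.std()          # moment-matched normal fit
for q in (0.50, 0.90, 0.999):
    print(f"q={q:5.3f}  empirical={np.quantile(data, q):6.2f}  "
          f"normal fit={stats.norm.ppf(q, loc=mu, scale=sd):6.2f}")
# The median and 90th percentile agree reasonably well; the 99.9th percentile
# is substantially underestimated by the normal fit (~3 vs ~4.7 here).
```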
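
And for the meta-analytic heuristic mentioned in the null-hypothesis bullet above, a minimal sketch with made-up study numbers: a small new study with the same point estimate as prior work but p > 0.05 strengthens, rather than weakens, a fixed-effect pooled estimate.

```python
# Fixed-effect inverse-variance pooling: adding a 'non-significant' small study
# with the same point estimate as prior work strengthens the pooled result.
import numpy as np
from scipy import stats

prior = [(0.30, 0.10), (0.25, 0.08), (0.32, 0.12)]   # (estimate, standard error)
new   = (0.28, 0.20)                                 # same effect size; p ~ 0.16 alone

def pool(studies):
    est = np.array([e for e, _ in studies])
    w = np.array([1 / se**2 for _, se in studies])   # inverse-variance weights
    pooled = (w * est).sum() / w.sum()
    se = np.sqrt(1 / w.sum())
    p = 2 * stats.norm.sf(abs(pooled / se))
    return pooled, se, p

print("without new study:", pool(prior))
print("with new study:   ", pool(prior + [new]))     # smaller SE, smaller p-value
```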

This defi­n­i­tion is not a panacea since often it may not be clear what deci­sions are down­stream, much less how much a crit­i­cism could quan­ti­ta­tively affect it. But it pro­vides a clear start­ing point for under­stand­ing which ones are, or should be, impor­tant (meta-analy­ses being par­tic­u­larly use­ful for nail­ing down things like aver­age effect size bias due to a par­tic­u­lar flaw), and which ones are dubi­ous or quib­bling and are signs that you are stretch­ing to come up with any crit­i­cisms; if you can’t explain at least some­what plau­si­bly how a crit­i­cism (or a com­bi­na­tion of crit­i­cisms) could lead to dia­met­ri­cally oppo­site con­clu­sions or actions, per­haps they are best left out.

Appendix

Teaching Statistics

If decision theory is the end-all be-all, why is it so easy to take Statistics 101 or read a statistics textbook and come away with the attitude that statistics is nothing but a bag of tricks applied at the whim of the analyst, following rules written down nowhere, inscrutable to the uninitiated, who can only listen in bafflement to this or that pied piper of probabilities? (One uses a t-test unless one uses a Wilcoxon test, but of course, sometimes the p-value must be multiple-corrected, except when it’s fine not to, because you were using it as part of the main analysis or a component of a procedure like an ANOVA—not to be confused with an ANCOVA, MANCOVA, or linear model, which might really be a generalized linear model, with clustered standard errors as relevant…)

One issue is that the field greatly dis­likes pre­sent­ing it in any of the uni­fi­ca­tions which are avail­able. Because those par­a­digms are not uni­ver­sally accept­ed, the atti­tude seems to be that no par­a­digm should be taught; how­ev­er, to refuse to make a choice is itself a choice, and what gets taught is the par­a­digm of sta­tis­tic­s-as-grab-bag. As often taught or dis­cussed, sta­tis­tics is treated as a bag of tricks and p-val­ues and prob­lem-spe­cific algo­rithms. But there are par­a­digms one could teach.

For example, around the 1940s, led by Abraham Wald, there was a huge paradigm shift towards the decision-theoretic interpretation of statistics, where all these Fisherian gizmos can be understood, justified, and criticized as being about minimizing loss given specific loss functions. So, the mean is a good way to estimate your parameter (rather than the mode or median or a bazillion other univariate statistics one could invent) not because that particular function was handed down at Mount Sinai but because it does a good job of minimizing your loss under such-and-such conditions like having a squared error loss (because bigger errors hurt you much more), and if those conditions do not hold, that is why the, say, median is better, and you can say precisely how much better and when you’d go back to the mean (as opposed to rules of thumb about standard deviations or arbitrary p-value thresholds testing normality). Many issues in meta-science are much more transparent if you simply ask how they would affect decision-making (see the rest of this essay).
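
A minimal simulation sketch of that loss-function framing (toy numbers, nothing estimated from real data): compare the average squared-error loss of the mean and the median as estimators of a true location of zero, with and without a few wild outliers.

```python
# Which summary statistic 'wins' depends on the loss and the data-generating
# process, not on tradition: mean vs median under squared-error loss.
import numpy as np

rng = np.random.default_rng(0)

def risk(estimator, contaminated, n=50, reps=20_000):
    """Average squared-error loss for estimating a true mean of 0."""
    losses = []
    for _ in range(reps):
        x = rng.normal(0, 1, n)
        if contaminated:
            x[:5] = rng.normal(0, 20, 5)   # a handful of wild outliers
        losses.append(estimator(x) ** 2)
    return float(np.mean(losses))

for contaminated in (False, True):
    print(f"contaminated={contaminated}: mean risk={risk(np.mean, contaminated):.4f}, "
          f"median risk={risk(np.median, contaminated):.4f}")
# Clean normal data: the mean has lower risk; contaminated data: the median does.
```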

Sim­i­lar­ly, Bayesian­ism means you can just ‘turn the crank’ on many prob­lems: define a mod­el, your pri­ors, and turn the MCMC crank, with­out all the fancy prob­lem-spe­cific deriva­tions and spe­cial-cas­es. Instead of all these mys­te­ri­ous dis­tri­b­u­tions and for­mu­las and tests and like­li­hoods drop­ping out of the sky, you under­stand that you are just set­ting up equa­tions (or even just writ­ing a pro­gram) which reflect how you think some­thing works in a suffi­ciently for­mal­ized way that you can run data through it and see how the prior updates into the pos­te­ri­or. The dis­tri­b­u­tions & like­li­hoods then do not drop out of the sky but are prag­matic choic­es: what par­tic­u­lar bits of math­e­mat­ics are imple­mented in your MCMC library, and which match up well with how you think the prob­lem works, with­out being too con­fus­ing or hard to work with or com­pu­ta­tion­al­ly-in­effi­cient?
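
As a minimal sketch of ‘turning the crank’ (a toy conversion-rate problem with invented data; a grid approximation stands in for the MCMC a library would run): define a prior and a likelihood, multiply, normalize, and read off the posterior.

```python
# Grid-approximation Bayesian inference for a single proportion: the same
# mechanical recipe (prior x likelihood, then normalize) regardless of problem.
import numpy as np

successes, trials = 7, 50                 # invented data
theta = np.linspace(0, 1, 1001)           # grid over the parameter

prior = np.ones_like(theta)               # flat prior; swap in your own beliefs
likelihood = theta**successes * (1 - theta)**(trials - successes)
posterior = prior * likelihood
posterior /= posterior.sum()              # normalize over the grid

mean = (theta * posterior).sum()
cdf = posterior.cumsum()
lo, hi = theta[np.searchsorted(cdf, 0.025)], theta[np.searchsorted(cdf, 0.975)]
print(f"posterior mean ~{mean:.3f}, 95% credible interval ~({lo:.3f}, {hi:.3f})")
# A different model is just a different likelihood line; MCMC does the same job
# when the parameter space is too big for a grid.
```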

And causal mod­el­ing is another good exam­ple: there is an end­less zoo of biases and prob­lems in fields like epi­demi­ol­ogy which look like a mess of spe­cial cases you just have to mem­o­rize, but they all reduce to straight­for­ward issues if you draw out a DAG of a causal graph of how things might work.
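
A minimal simulated example (hypothetical variables, not from any study): two independent causes of a selection variable become correlated once you condition on that collider, which is obvious from the DAG talent → hired ← looks but baffling as a memorized special case.

```python
# Collider bias: conditioning on (or selecting on) a common effect manufactures
# a correlation between two causes that are actually independent.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
talent = rng.normal(size=n)
looks  = rng.normal(size=n)                             # independent of talent
hired  = (talent + looks + rng.normal(size=n)) > 1.5    # the collider

print("corr(talent, looks), everyone:   %+.3f" % np.corrcoef(talent, looks)[0, 1])
print("corr(talent, looks), hired only: %+.3f" % np.corrcoef(talent[hired], looks[hired])[0, 1])
# Roughly zero overall, clearly negative among the hired: selecting on the
# collider (eg. enrollment in a group) creates the spurious association.
```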

What hap­pens in the absence of explicit use of these par­a­digms is an implicit use of them. Much of the ‘expe­ri­ence’ that sta­tis­ti­cians or ana­lysts rely on when they apply the bag of tricks is actu­ally a hid­den the­ory learned from expe­ri­ence & osmo­sis, used to reach the cor­rect results while osten­si­bly using the bag of tricks: the ana­lyst knows he ought to use a median here because he has a vaguely defined loss in mind for the down­stream exper­i­ment, and he knows the data some­times throws out­liers which screwed up exper­i­ments in the past so the mean is a bad choice and he ought to use ‘robust sta­tis­tics’; or he knows from expe­ri­ence that most of the vari­ables are irrel­e­vant so it’d be good to get shrink­age by sleight of hand by pick­ing a lasso regres­sion instead of a reg­u­lar OLS regres­sion and if any­one asks, talk vaguely about ‘reg­u­lar­iza­tion’; or he has a par­tic­u­lar causal model of how enroll­ment in a group is a col­lider so he knows to ask about “Simp­son’s para­dox”. Thus, in the hands of an expert, the bag of tricks works out, even as the neo­phyte is mys­ti­fied and won­ders how the expert knew to pull this or that trick out of, seem­ing­ly, their nether regions.

Teach­ers don’t like this because they don’t want to defend the philoso­phies of things like Bayesian­ism, often aren’t trained in them in the first place, and because teach­ing them is simul­ta­ne­ously too easy (the con­cepts are uni­ver­sal, straight­for­ward, and can be one-lin­ers) and too hard (re­duc­ing them to prac­tice and actu­ally com­put­ing any­thing—it’s easy to write down Bayes’s for­mu­la, not so easy to actu­ally com­pute a real pos­te­ri­or, much less max­i­mize over a deci­sion tree).

There’s a lot of criticisms that can be made of each paradigm, of course, and none of them are universally assented to, to say the least—but I think it would generally be better to teach people in those principled approaches, and then later critique them, than to teach people in an entirely unprincipled fashion.


  1. There are two main categories I know of, reporting checklists and quality-evaluation checklists (in addition to the guidelines/recommendations published by professional groups, like the APA’s publication manual, apparently based on JARS, or similar standards).

    Some report­ing check­lists:

    Some qual­i­ty-e­val­u­a­tion scales:

    ↩︎
  2. As pointed out by Jack­son in a review of a sim­i­lar book, the argu­ments used against her­i­tabil­ity or IQ exem­pli­fied bad research cri­tiques by mak­ing the per­fect the enemy of bet­ter & selec­tively apply­ing demands for rig­or:

    There is no question that here, as in many areas that depend on field studies, precise control of extraneous variables is less than perfect. For example, in studies of separated twins, the investigator must concede that the ideal of random assignment of twin pairs to separated foster homes is not likely to be fully achieved, that it will be difficult to find comparison or control groups perfectly matched on all variables, and so on. Short of abandoning field data in social science entirely, there is no alternative but to employ a variational approach, seeking to weigh admittedly fallible data to identify support for hypotheses by the preponderance of evidence. Most who have done this find support for the heritability of IQ. The author instead sees only flaws in the evidence…As the data stand, had the author been equally zealous in evaluating the null hypothesis that such treatments make no difference he would have been hard pressed to fail to reject it.

    ↩︎
  3. Non-replication of a result puts the original result in an awkward trilemma: either the original result was spurious (the most a priori likely case), the non-replicator got it wrong or was unlucky (difficult, since most replications are well-powered and follow the original closely, so it would be easier to argue the original result was unlucky), or the research claim is so fragile and context-specific that non-replication is just ‘heterogeneity’ (but then why should anyone believe the result in any substantive way, or act on it, if it’s a coin-flip whether it even exists anywhere else?).↩︎

  4. As quoted in .↩︎

  5. It is interesting to note that the medieval origins of ‘probability’ were themselves inherently decision-based, focused as they were on the question of what it was moral to believe & act upon, and the mathematical roots of probability theory were also pragmatic, based on gambling. Laplace, of course, took a similar perspective in his early Bayesianism (eg. estimating the mass of Saturn). It was later statistical thinkers like Boole or Fisher who tried to expunge pragmatic interpretations in favor of purer definitions like limiting frequencies.↩︎

  6. Which is usu­ally not the case, and why fak­ers like Stapel can be detected by look­ing for ‘too good to be true’ sets of results, over-round­ing or over­ly-s­mooth num­bers, or some­times just not even arith­meti­cally cor­rect!↩︎

  7. As for­mally incor­rect as it may be, when­ever I have done the work to treat ordi­nal vari­ables cor­rect­ly, it has typ­i­cally merely tweaked the coeffi­cients & stan­dard error, and not actu­ally changed any­thing. Know­ing this, it would be dis­hon­est of me to crit­i­cize any study which does like­wise unless I have some good rea­son (like hav­ing rean­a­lyzed the data and—­for once—­got­ten a major change in result­s).↩︎