How Should We Critique Research?

Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.
Bayes, decision-theory, criticism, statistics, philosophy, causality
2019-05-19–2019-07-07 finished certainty: highly likely importance: 8

Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false. But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”. This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero, or overestimated by a large factor, will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important, as they typically yield much smaller changes in conclusions, and hence decisions.

If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.

Learning statistics is great. If you want to read and understand scientific papers in general, there’s little better to learn than statistics, because everything these days touches on statistical issues and draws on increasingly powerful statistical methods and large datasets, whether flashy like machine learning or mundane like geneticists drawing on biobanks of millions of people; and if you don’t have at least some grasp of statistics, you will be increasingly left out of scientific and technological progress and unable to meaningfully discuss their application to society. So you must have a good grounding in statistics if you are at all interested in these topics—or so I want to say. The problem is… learning statistics can be dangerous.

Valley of Bad Statistics

Like learning some formal logic or about cognitive biases, statistics seems like the sort of thing of which one might say: “A little learning is a dangerous thing / Drink deep, or taste not the Pierian spring / There shallow draughts intoxicate the brain, / And drinking largely sobers us again.”

When you first learn some formal logic and about fallacies, it’s hard not to use the shiny new hammer to go around playing ‘fallacy bingo’ (to mix metaphors): “aha! that is an ad hominem, my good sir, and a logically invalid objection.” The problem, of course, is that many fallacies are perfectly good as a matter of inductive logic: ad hominems are often highly relevant (eg if the person is being bribed). A rigorous insistence on formal syllogisms will at best waste a lot of time, and at worst becomes a tool for self-delusion by selective application of rigor.

Similarly, cognitive biases are hard to use effectively (because they are informative priors in some cases, and in common harmful cases, one will have already learned better), but are easy to abuse—it’s always easiest to see how someone else is sadly falling prey to confirmation bias.

All Things Large and Small

With statistics, a little reading and self-education will quickly lead to learning about a universe of ways for a study to screw up statistically, and as skeptical as one quickly becomes, as Ioannidis and Gelman and the Replicability Crisis and far too many examples of scientific findings completely collapsing show, one probably isn’t skeptical enough, because there are in fact an awful lot of screwed-up studies out there. Here are a few potential issues, in deliberately no particular order:

  • “spurious correlations” caused by data processing (such as ratio/percentage data, or normalizing to a common time-series)

  • multiplicity: many subgroups or hypotheses tested, with only statistically-significant ones reported and no control of the overall false detection rate

  • missingness not modeled

  • in vivo animal results applied to humans

  • experiment run on regular children

  • publication bias detected in meta-analysis

  • a failure to reject the null or a positive point-estimate being interpreted as evidence for the null hypothesis

  • “the difference between statistically-significant and non-statistically-significant is not statistically-significant”

  • choice of an inappropriate distribution, like modeling a log-normal variable by a normal variable (“they strain at the gnat of the prior who swallow the camel of the likelihood”)

  • no use of single or double-blinding or placebos

  • a genetic study testing correlation between 1 gene and a trait

  • an IQ experiment finding an intervention increased before/after scores on some IQ subtests and thus increased IQ

  • cross-sectional rather than longitudinal study

  • ignoring multilevel structure (like data being collected from sub-units of schools, countries, families, companies, websites, individual fishing vessels, WP editors etc)

  • reporting performance of GWAS polygenic scores using only SNPs which reach genome-wide statistical-significance

  • nonzero attrition but no use of intent-to-treat analysis

  • use of a fixed alpha threshold like 0.05

  • correlational data interpreted as causation

  • use of an “unidentified” model, requiring additional constraints or priors

  • non-preregistered analyses done after looking at data; p-hacking of every shade

  • use of cause-specific mortality vs all-cause mortality as a measurement

  • use of measurements with high levels of measurement error (such as dietary questionnaires)

    • ceiling/floor effects (particularly IQ tests)
  • claims about latent variables made on the basis of measurements of greatly differing quality

    • or after “controlling for” intermediate variables, comparing total effects of one variable to solely indirect effects of another
    • or that one variable mediates an effect without actually setting up a mediation SEM
  • studies radically underpowered to detect a plausible effect

  • the “statistical-significance filter” inflating effects

  • base rate fallacy

  • self-selected survey respondents; convenience samples from Mechanical Turk or Google Surveys or similar services

  • animal experiments with randomization not blocked by litter/cage/room

  • using a very large sample and obtaining many statistically-significant results

  • factor analysis without establishing measurement invariance

  • experimenter demand effects

  • using a SVM/NN/RF without crossvalidation or heldout sample

    • using them, but with data preprocessing done or hyperparameters selected based on the whole dataset
  • passive control groups

  • not doing a factorial experiment but testing one intervention on each group

  • flat priors overestimating effects

  • reporting of relative risk increase without absolute risk increase

  • a genetic study testing correlation between 500,000 genes and a trait

  • conflicts of interest by the researchers/funders

  • lack of power analysis to design experiment

  • analyzing Likert scales as a simple continuous cardinal variable

  • animal results in a single inbred or clonal strain, with the goal of reducing variance/increasing power (Michie 1955)

  • right-censored data

  • temporal autocorrelation of measurements

  • genetic confounding

  • reliance on interaction terms
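To make one of these pitfalls concrete, the multiplicity item can be demonstrated in a few lines of simulation (an illustrative sketch with made-up parameters, not from the original essay): test enough subgroups of pure noise, and several will come back “statistically-significant”.

```python
import random
from statistics import NormalDist, mean

random.seed(0)

def null_experiment(n=50):
    """Two groups drawn from the SAME distribution: any 'effect' is pure noise."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # Known sigma=1, so a simple two-sample z-test on the difference of means:
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

# 200 subgroup analyses of a dataset with zero true effects:
p_values = [null_experiment() for _ in range(200)]
false_hits = sum(p < 0.05 for p in p_values)
print(f"{false_hits}/200 subgroup tests 'significant' despite zero true effect")
```

Reporting only the hits, with no correction for the ~5% expected false-detection rate, manufactures “findings” out of nothing.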

Some of these issues are big issues—even fatal, to the point where the study is not just meaningless but the world would be a better place if the researchers in question had never published. Others are serious, but while regrettable, a study afflicted by them is still useful and perhaps the best that can reasonably be done. And some flaws are usually minor, almost certain not to matter, possibly to the point where it is misleading to bring them up at all as ‘criticisms’, since doing so implies they are worth discussing. And many are completely context-dependent, and could be anything from instantly fatal to minor nuisance.

But which are which? You can probably guess where a few of them fall, but I would be surprised if you knew what I meant by all of them, or had well-justified beliefs about how important each is, because I don’t, and I suspect few people do. Nor can anyone tell you how important each one is. One just has to learn by experience, it seems, watching things replicate or diminish in meta-analyses or get debunked over the years, to gradually get a feel for what is important. There are checklists and professional manuals1 which one can read and employ, and they at least have the virtue of checklists in being systematic reminders of things to check, reducing the temptation to cherry-pick criticism, and I recommend their use, but they are not a complete solution. (In some cases, they recommend quite bad things, and none can be considered complete.)

No wonder that statistical criticism can feel like a blood-sport, or feel like learning statistical-significance statistics: a long list of special-case tests with little rhyme or reason, making up a “cookbook” of arbitrary formulas and rituals, useful largely for “middlebrow dismissals”.

After a while, you have learned enough to throw a long list of criticisms at any study regardless of whether they are relevant or not, engaging in “pseudo-analysis”2, which devalues criticism (surely studies can’t all be equally worthless) and risks the same problem as with formal logic or cognitive biases—of merely weaponizing it and having laboured solely to make yourself more wrong, and to defend your errors in more elaborate ways. (I have over the years criticized many studies, and while for many of them my criticisms were much less than they deserved and have since been borne out, I could not honestly say that I have always been right or that I did not occasionally ‘gild the lily’ a little.)

Relevant But Not Definitive

So, what do we mean by statistical criticism? What makes a good or bad statistical objection?

Bad Criticisms

“Here I should like to say: a wheel that can be turned though nothing else moves with it, is not part of the mechanism.”

Ludwig Wittgenstein, Philosophical Investigations, §271

It can’t just be that a criticism is boring and provokes eye-rolling—someone who in every genetics discussion from ~2000–2010 harped on statistical power & polygenicity and stated that all these exciting new candidate-gene & gene-environment interaction results were so much hogwash and the entire literature garbage would have been deeply irritating to read, would have worn out their welcome fast, and would have been absolutely right. (Or for nutrition research, or for social psychology, or for…) As provoking as it may be to read yet another person sloganize “correlation ≠ causation” or “yeah, in mice!”, unfortunately, for much research that is all that should ever be said about it, no matter how much we weary of it.

It can’t be that some assumption is violated (or unproven or unprovable), or that some aspect of the real world is left out, because all statistical models are massively abstract, gross simplifications. It is always possible to identify some issue of inappropriate assumption of normality, or some autocorrelation which is not modeled, or some nonlinear term not included, or prior information left out, or data lacking in some respect. Checklists and preregistrations and other techniques can help improve the quality considerably, but will never solve this problem. Short of tautological analysis of a computer simulation, there is not and never has been a perfect statistical analysis, and if there were, it would be too complicated for anyone to understand (which is a criticism as well). All of our models are false, but some may be useful, and a good statistical analysis is merely ‘good enough’.

It can’t be that results “replicate” or not. Replicability doesn’t say much other than that if further data were collected the same way, the results would stay the same. While a result which doesn’t replicate is of questionable value at best (it most likely wasn’t real to begin with3), a result being replicable is no guarantee of quality either. One may have a consistent GIGO process, but replicable garbage is still garbage. To collect more data may be simply to more precisely estimate the process’s systematic error and biases. (No matter how many published homeopathy papers you can find showing homeopathy works, it doesn’t.)
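The GIGO point can be sketched in a few lines (a hypothetical simulation, with an assumed systematic bias of 0.3 added to a true effect of zero): collecting ever more data from a biased process only pins down the wrong answer more precisely.

```python
import random
from statistics import mean

random.seed(1)
TRUE_EFFECT = 0.0
BIAS = 0.3  # systematic measurement/selection bias baked into the study design

def biased_study(n):
    """Every observation carries the same systematic bias, however large n gets."""
    return mean(random.gauss(TRUE_EFFECT + BIAS, 1.0) for _ in range(n))

# 'Replication' with larger samples converges -- on truth-plus-bias, not truth:
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: estimate = {biased_study(n):+.3f} (truth = {TRUE_EFFECT:+.3f})")
```

The estimates stabilize near +0.3 rather than 0: perfectly replicable, and perfectly wrong.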

It certainly has little to do with p-values, either in a study or in its replications (because nothing of interest has to do with p-values); if we correct an error and change a specific p-value from p = 0.05 to p = 0.06, so what? (“Surely, God loves the 0.06 nearly as much as the 0.05…”) Posterior probabilities, while meaningful and important, are also no criterion: is it important if a study has a posterior probability of a parameter being greater than zero of 95% rather than 94%? Or >99%? Or >50%? If a criticism, when corrected, reduces a posterior probability from 99% to 90%, is that what we mean by an important criticism? Probably (ahem) not.
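To see how little hangs on the 0.05-vs-0.06 distinction, one can compare the critical z-values each threshold implies (a quick check using Python’s stdlib `NormalDist`):

```python
from statistics import NormalDist

nd = NormalDist()
# Two-sided critical z-values corresponding to p = 0.05 and p = 0.06:
z_05 = nd.inv_cdf(1 - 0.05 / 2)  # ~1.960
z_06 = nd.inv_cdf(1 - 0.06 / 2)  # ~1.881

print(f"z for p=0.05: {z_05:.3f}")
print(f"z for p=0.06: {z_06:.3f}")
print(f"relative difference in implied effect size: {(z_05 - z_06) / z_05:.1%}")
```

The two thresholds differ by about 4% in the implied effect size: far smaller than the noise in almost any real estimate, so a correction that nudges a p-value across the line rarely tells us anything decision-relevant.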

It also doesn’t have to do with any increase or decrease in effect sizes. If a study makes some errors which mean that it produces an effect size twice as large as it should, this might be absolutely damning or it might be largely irrelevant. Perhaps the uncertainty was at least that large, so no one took the point-estimate at face-value to begin with, or everyone understood the potential for errors and understood the point-estimate to be an upper bound. Or perhaps the effect is so large that overestimation by a factor of 10 wouldn’t be a problem.
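A toy illustration of this point (all numbers hypothetical): whether a 2× overestimate matters depends entirely on how far the estimate sits from the decision threshold.

```python
def adopt(effect, value_per_unit=100.0, cost=30.0):
    """Adopt the intervention iff expected benefit exceeds its cost."""
    return effect * value_per_unit > cost

# Case 1: marginal effect -- halving the estimate flips the decision.
print(adopt(0.5), adopt(0.25))  # True, False -> the criticism matters

# Case 2: huge effect -- even a 2x overestimate changes nothing.
print(adopt(8.0), adopt(4.0))   # True, True -> the criticism is a caveat
```

The same numerical error (a factor of 2) is fatal in one context and a footnote in the other, which is exactly why effect-size inflation per se is not the criterion.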

It usually doesn’t have to do with predictive power (whether quantified as R2 or AUC etc); sheer prediction is the goal of a subset of research (although if one could show that a particular choice led to a lower predictive score, that would be a good critique), and in many contexts, the best model is not particularly predictive at all, and a model being too predictive is a red flag.

Good Criticisms

“The statistician is no longer an alchemist expected to produce gold from any worthless material offered him. He is more like a chemist capable of assaying exactly how much of value it contains, and capable also of extracting this amount, and no more. In these circumstances, it would be foolish to commend a statistician because his results are precise, or to reprove because they are not. If he is competent in his craft, the value of the result follows solely from the value of the material given him. It contains so much information and no more. His job is only to produce what it contains… Immensely laborious calculations on inferior data may increase the yield from 95 to 100 per cent. A gain of 5 per cent, of perhaps a small total. A competent overhauling of the process of collection, or of the experimental design, may often increase the yield ten or twelve fold, for the same cost in time and labour.
…To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

R. A. Fisher, “Presidential address to the first Indian statistical congress”, 1938

What would count as a good criticism?

Well, if a draft of a study was found and the claims were based on a statistically-significant effect in one variable, but in the final published version, it omits that variable and talks only about a different variable, one would wonder. Discovering that authors of a study had been paid millions of dollars by a company benefiting from the study results would seriously shake one’s confidence in the results. If a correlation didn’t exist at all when we compared siblings within a family, or better yet, identical twins, or if the correlation didn’t exist in other datasets, or other countries, then regardless of how strongly supported it is in that one dataset, it would be a concern. If a fancy new machine learning model outperformed SOTA by 2%, but turned out to not be using a heldout sample properly and actually performed the same, doubtless ML researchers would be less impressed. If someone showed an RCT reached the opposite effect size to a correlational analysis, that would strike most people as important. If a major new cancer drug was being touted as being as effective as the usual chemotherapy with fewer side-effects in the latest trial, and one sees that both were being compared to a null hypothesis of zero effect and the point-estimate for the new drug was lower than the usual chemotherapy, would patients want to use it? If a psychology experiment had different results with a passive control group and an active control group, or a surgery’s results depended on whether the clinical trial used blinding, that is certainly an issue. And if data was fabricated entirely, that would certainly be worth mentioning.

These are all inherently different going by some of the conventional views outlined above. So what do they have in common that makes them good criticisms?

Beliefs Are For Actions

“Results are only valuable when the amount by which they probably differ from the truth is so small as to be insignificant for the purposes of the experiment. What the odds should be depends:”

  1. “On the degree of accuracy which the nature of the experiment allows, and”
  2. “On the importance of the issues at stake.”

William Sealy Gosset (“Student”), “The Application of the ‘Law of Error’ to the work of the Brewery”, 1904

“Moreover, the economic approach seems (if not rejected owing to aristocratic or puritanic taboos) the only device apt to distinguish neatly what is or is not contradictory in the logic of uncertainty (or probability theory). That is the fundamental lesson supplied by the notion of…probability theory and decision theory are but two versions (theoretical and practical) of the study of the same subject: uncertainty.”

Bruno de Finetti, “Comment on Savage’s ‘On Rereading R. A. Fisher’”, 1976

But what I think they share in common is this decision-theoretic justification, which unifies criticisms (and would unify statistical pedagogy too):

The importance of a statistical criticism is the probability that it would change a hypothetical decision based on that research.

I would assert that p-values are not posterior probabilities are not effect sizes are not utilities are not profits are not decisions. Dichotomies come from decisions. All analyses are ultimately decision analyses: our beliefs and analyses may be continuous, but our actions are discrete.

When we critique a study, the standard we grope towards is one which ultimately terminates in real-world actions and decision-making, a standard which is inherently context-dependent, admits of no bright lines, and depends on the use and motivation for research, grounded in what is the right thing to do.5

How should we evaluate a single small study?

It doesn’t have anything to do with attaining some arbitrary level of “significance” or being “well-powered” or having a certain k in a meta-analysis for estimating heterogeneity, or even any particular posterior probability, or effect size threshold; it doesn’t have anything to do with violating a particular assumption, unless, by violating that assumption, the model is not ‘good enough’ and would lead to bad choices; and it is loosely tied to replication (because if a result doesn’t replicate in the future situations in which actions will be taken, it’s not useful for planning) but not defined by it (as a result could replicate fine while still being useless).

The importance of many of these criticisms can be made much more intuitive by asking what the research is for and how it would affect a downstream decision. We don’t need to do a formal decision analysis going all the way from data through a Bayesian analysis to utilities and a causal model to compare (although this would be useful to do and might be necessary in edge cases); an informal consideration can be a good start, as one can intuitively guess at the downstream effects.
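Such an informal decision analysis can be sketched in a few lines (all utilities, probabilities, and effect sizes below are hypothetical): the criterion is simply whether correcting the criticism changes the expected-utility-maximizing action.

```python
def best_action(p_effect_real, effect_size, treatment_cost=2.0, benefit=10.0):
    """Pick whichever action has the higher expected utility."""
    eu_treat = p_effect_real * effect_size * benefit - treatment_cost
    eu_skip = 0.0  # doing nothing costs and gains nothing
    return "treat" if eu_treat > eu_skip else "skip"

# An important criticism: correcting it changes the chosen action.
before = best_action(p_effect_real=0.9, effect_size=0.5)   # naive causal reading
after = best_action(p_effect_real=0.25, effect_size=0.5)   # correlation != causation
print(before, "->", after)  # treat -> skip

# A quibble: a small distributional fix leaves the action unchanged.
minor = best_action(p_effect_real=0.9, effect_size=0.45)
print(minor)  # still treat
```

The same framework rates the causal-inference critique as important (the action flips from “treat” to “skip”) and the distributional one as a caveat (the action is unchanged), matching the intuitions argued for above.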

I think we can meaningfully apply this criterion even to ‘pure’ research questions where it is unclear how the research would ever be applied, specifically. We know a great deal about epistemology and scientific methodology and what practices tend to lead to reliable knowledge. (When people argue in favor of pure research because of its history of spinoffs like cryptography from number theory, that very argument implies that the spinoffs aren’t that unpredictable & is a successful pragmatic defense. The fact that our evolved curiosity can be useful is surely no accident.)

For example, even without a specific purpose in mind for some research, we can see why forging fraudulent data is the worst possible criticism: because there is no decision whatsoever which is made better by using faked data. Many assumptions or shortcuts will work in some cases, but there is no case where fake data, which is uncorrelated with reality, works; even in the case where the fake data is scrupulously forged to exactly replicate the best understanding of reality6, it damages decision-making by overstating the amount of evidence, leading to overconfidence and underexploration.

Similarly, careless data collection and measurement error. Microbiologists couldn’t know about CRISPR in advance, before it was discovered by comparing odd entries in DNA databases, and it’s a good example of how pure research can lead to tremendous gains. But how could you discover anything from DNA databases if they are incomplete, full of mislabeled/contaminated samples, or the sequencing was done sloppily & the sequences largely random garbage? If you’re studying ‘cancer cells’ and they are a mislabeled cell line & actually liver cells, how could that possibly add to knowledge about cancer?

Or consider the placebo effect. If you learned that a particular study’s result was driven entirely by a placebo effect and that using blinding would yield a null, I can safely predict that—regardless of field or topic or anything else—you will almost always be badly disappointed. If a study measures just a placebo effect (specifically, demand or expectancy effects), this is damning, because the placebo effect is already known to be universally applicable (so showing that it happened again is not interesting) through a narrow psychological causal mechanism which fades out over time & doesn’t affect hard endpoints (like mortality), while it doesn’t affect the countless causal mechanisms which placebo-biased studies seem to be manipulating (and whose manipulation would in fact be useful both immediately and for building theories). If, say, a treatment does nothing except through the placebo effect, why would we want to use it? There are some exceptions where we would be indifferent after learning a result was just a placebo effect (chronic pain treatment? mild influenza?), but not many.

How about non-replicability? The simplest explanation for the Replicability Crisis in psychology is that most of the results aren’t real and were random noise, p-hacked into publications. The most charitable interpretation made by apologists is that the effects were real, but they are simply either small or so highly context-dependent on the exact details (the precise location, color of the paper, experimenter, etc) that even collaborating with the original researchers is not guaranteed to successfully replicate an effect. Again, regardless of the specific result, this presents a trilemma which is particularly damaging from a decision-theory point of view:

  1. either the results aren’t real (and are useless for decision-making),
  2. they are much smaller than reported (and thus much less useful for any kind of application or theory-building),
  3. or they are so fragile and in any future context almost as likely to be some other effect, in even the opposite direction, that their average effect is effectively zero (and thus useless).

Decisions precede beliefs. Our ontology and our epistemology flow from our decision theory, not vice-versa. This may appear to be logically backwards, but that is the situation we are in, as evolved embodied beings thinking & acting under uncertainty: like Otto Neurath’s boat—there is nowhere we can ‘step aside’ and construct all belief and knowledge up from scratch and logical metaphysics; instead, we examine and repair our raft as we stand on it, piece by piece. The naturalistic answer to the skeptic is that our beliefs are not unreliable because they are empirical or evolved or ultimately temporally begin in trial-and-error, but rather are reliable because they have been gradually evolved to be pragmatically correct for decision-making, and so developed reliable knowledge of the world and methods of science. (An example of reversing the flow would be the Deutsch-Wallace attempt to found the Born rule on decision theory; earlier, statisticians showed that much of statistics could be grounded in decision-making instead of vice-versa, demonstrated by the subjective probability school and devices like the Dutch book argument enforcing coherence.)

Decision-theoretic Criticisms

“The threat of decision analysis is more powerful than its execution.”

, 2019

“A good rule of thumb might be, ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no’, maybe the number has no business being in the sentence in the first place.”

Ran­dall Munroe

Revisiting some of the example criticisms with more of a decision-theoretic view:

  • A critique of assuming correlation=causation is a good one, because correlation is usually not causation, and going from an implicit ~100% certainty that it is to a more realistic 25% or less would change many decisions, as that observation alone reduces the expected value by >75%, which is a big enough penalty to eliminate many appealing-sounding things.

    Because causal effects are such a central topic, any methodological errors which affect inference of causation rather than correlation are important errors.

  • A critique of distributional assumptions (such as observing that a variable isn’t so much normal as Student’s t-distributed) isn’t usually an important one, because the change in the posterior distribution of any key variable will be minimal, and could change only decisions which are on a knife’s-edge to begin with (and thus, of little value).

    • There are exceptions here, and in some areas, this can be critical. Distribution-wise, using a normal instead of a log-normal is often minor since they are so similar in the bulk of their distribution… unless we are talking about their tails, as in an extreme-value context (common in any kind of selection or extremes analysis, such as employment or athletics or natural disasters), where the more extreme points out on the tail are the important thing; in which case, using a normal will lead to wild underestimates of how far out those outliers will be, which could be of great practical importance.

    • On the other hand, treating a Likert scale as a cardinal variable is a statistical sin… but only a peccadillo that everyone commits, because Likert scales are so often equivalent to a (more noisy) normally-distributed variable that a fully-correct transformation to an ordinal scale with a latent variable winds up being a great deal more work while not actually changing any conclusions and thus actions.7

      Similarly, temporal autocorrelation is often not as big a deal as it’s made out to be.

  • Sociological/psychological correlations with genetic confounds are vulnerable to critique, because controlling for genetics routinely shrinks the correlation by a large fraction, often to zero, and thus eliminates most of the causal expectations.

  • Overfitting to a training set and actually being similar or worse than the current SOTA is one of the more serious criticisms in machine learning, because having better performance is typically why anyone would want to use a method. (But of course—if the new method is intriguingly novel, or has some other practical advantage, it would be entirely reasonable for someone to say that the overfitting is a minor critique as far as they are concerned, because they want it for that other reason and some loss of performance is minor.)

  • use of a strawman null hypothesis: in a medical context too, what matters is the cost-benefit of a new treatment compared to the best existing one, and not whether it happens to work better than nothing at all; the important thing is being more cost-effective than the default action, so people will choose the new treatment over the old, and if the net estimate is that it is probably slightly worse, why would they choose it?

  • Interpreting a failure to reject the null as proof of the null: often a problem. The logic of significance-testing, such as it is, mandates agnosticism any time the null has not been rejected, but so intuitive is Bayesian reasoning—absence of evidence is evidence of absence—that if a significance-test does not vindicate a hypothesis, we naturally interpret it as evidence against the hypothesis. Yet really, it might well be evidence for the hypothesis, and simply not enough evidence: so a reasonable person might conclude the opposite of what they would if they looked at the actual data.

    I’m reminded of the studies which use small samples to estimate the correlation between an infection and outcomes like accidents, and which, upon getting a point-estimate almost identical to other, larger studies (ie. infection predicts bad things) that merely happens to not be statistically-significant at their sample size, conclude that they have found evidence against there being a correlation. One should conclude the opposite! (One heuristic for interpreting results is to ask: “if I entered this result into a meta-analysis of all results, would it strengthen or weaken the meta-analytic result?”)
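That heuristic can be made concrete with a fixed-effect inverse-variance pooling sketch (all numbers invented): a small new study with the same point estimate as prior work, but too noisy to be “significant” on its own, still narrows the pooled estimate:

```python
def pool(estimates, ses):
    """Fixed-effect inverse-variance meta-analysis: pooled estimate and SE."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, (1 / sum(weights)) ** 0.5

# Three earlier, larger studies all finding a correlation around 0.20:
before, before_se = pool([0.21, 0.19, 0.20], [0.05, 0.06, 0.05])
# Add a small study: same point estimate, but SE too large for p < 0.05 alone:
after, after_se = pool([0.21, 0.19, 0.20, 0.20], [0.05, 0.06, 0.05, 0.15])

print(f"before the small study: {before:.3f} (SE {before_se:.3f})")
print(f"after  the small study: {after:.3f} (SE {after_se:.3f})")
```

The pooled point estimate is essentially unchanged while the pooled standard error shrinks: the “non-significant” study strengthened, not weakened, the meta-analytic result.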

  • Inefficient experiment design, like using between-subject rather than within-subject comparisons, or not using identical twins for twin experiments, can be practically important. As in his discussion of the Lanarkshire Milk Experiment: among its other problems, the use of random children who weren’t matched or blocked in any way meant that statistical power was unnecessarily low, and the experiment could have been done with a sample size 97% smaller had it been better designed, a major savings in expense (which could have paid for many more experiments).
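A standard power-analysis sketch shows where that saving comes from (textbook normal-approximation arithmetic; the specific effect size and correlation are mine, not estimates from the Lanarkshire data): matched measurements correlated at rho cut the variance of the treatment contrast by (1 − rho), and the required sample shrinks proportionally:

```python
def n_required(delta, sigma, rho=0.0, z_alpha=1.96, z_power=0.84):
    """Approximate per-group n to detect a mean difference delta with outcome
    SD sigma, for 80% power at alpha=0.05 (two-sided); rho is the correlation
    between matched measurements (rho=0 recovers the unmatched design)."""
    variance = 2 * sigma**2 * (1 - rho)
    return variance * (z_alpha + z_power) ** 2 / delta**2

unmatched = n_required(delta=0.2, sigma=1.0)
matched = n_required(delta=0.2, sigma=1.0, rho=0.95)  # eg. very close matching
print(f"unmatched: ~{unmatched:.0f} per group")
print(f"matched  : ~{matched:.0f} pairs ({1 - matched / unmatched:.0%} fewer)")
```

With a hypothetical matching correlation of 0.95, the required sample drops by ~95%, the same order of saving claimed for a better-designed milk experiment.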

  • Lack of measurement invariance: questions of ‘measurement invariance’ in IQ experiments may sound deeply recondite, like statistical quibbling, but they boil down to the question of whether a test gain is a gain in intelligence, or whether it can be accounted for by gains solely on a subtest tapping into some much more specialized skill like English vocabulary. A gain in intelligence is far more valuable than some test-specific improvement, and if it is the latter, the experiment has found a real causal effect, but that effect is fool’s gold.

  • Conflating measured with latent variables: in discussions of measurement of latent variables, the question may hinge critically on the use. Suppose one compares a noisy IQ test to a high-quality personality test (without correcting for the differing measurement error of each), and finds that the latter is more predictive of some life outcome; does this mean ‘personality is more important than intelligence’ to that trait? Well, it depends on use. If one is making a theoretical argument about the latent variables, this is a serious fallacy, and correcting for measurement error may completely reverse the conclusion and show the opposite; but if one is doing screening (for employment or college or something like that), then it is irrelevant which latent variable is a better predictor, because the tests are what they are—unless, on the gripping hand, one is considering introducing a better, more expensive IQ test, in which case the latent variables are important after all: depending on how important they are (rather than the crude measured variables), the potential improvement from better measurement may be enough to justify the better test…

    Or consider heritability estimates, like SNP heritability estimates from GCTA. A GCTA estimate of, say, 25% for a trait measurement can be interpreted as an upper bound on a GWAS for that same measurement; this is useful to know, but it’s not the same thing as an upper bound on a GWAS of, or the ‘genetic influence’ on, the true, latent, measured-without-error variable. Most such GCTAs use measurements with a great deal of measurement error, and if you correct for measurement error, the true GCTA could be much higher—for example, IQ GCTAs are typically ~25%, but most datasets trade quality for quantity and use poor IQ tests, and correcting for that, the true GCTA is closer to 50%, which is quite different. Which is right? Well, if you are merely trying to understand how good a GWAS based on that particular dataset of measurements can be, the former is the right interpretation, as it establishes your upper bound and you will need better methods or measurements to go beyond it; but if you are trying to make claims about the trait per se (as so many people do!), the latent variable is the relevant thing, and talking only about the measured variable is highly misleading and can result in totally mistaken conclusions (especially when comparing across datasets with different measurement errors).
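Both the personality-vs-IQ comparison and the GCTA example run on the same arithmetic, Spearman’s classic disattenuation correction; a sketch with illustrative reliabilities (the 0.5 figure echoes the poor-IQ-test example above, but these numbers are not estimates from any particular dataset):

```python
def disattenuate(r_observed, reliability_x, reliability_y=1.0):
    """Spearman's correction: estimated latent correlation from an observed
    correlation between two error-laden measurements."""
    return r_observed / (reliability_x * reliability_y) ** 0.5

# A noisy IQ measure (reliability 0.5) vs a cleaner personality measure (0.9):
print(round(disattenuate(0.25, 0.5, 0.9), 3))  # latent correlation is much larger

# SNP heritability is a variance fraction, so it divides by reliability directly:
h2_observed, reliability = 0.25, 0.5
print(h2_observed / reliability)               # 0.25 measured -> 0.50 latent
```

The same observed number can thus imply very different latent claims depending on measurement quality, which is exactly why comparisons across tests or datasets with different reliabilities mislead.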

  • Lack of blinding poses a similar problem: its absence means that the effect being estimated is not necessarily the one we want to estimate—but this is context-dependent. A psychology study typically uses measures where some degree of effort or control is possible, and the effects of research interest are typically so small (like dual n-back’s supposed IQ improvement of a few points) that they can be inflated by a small amount of trying harder. On the other hand, a medical experiment of a cancer drug measuring all-cause mortality, if the drug works, can produce a dramatic difference in survival rates; cancer doesn’t care whether a patient is optimistic, and it is difficult for the researchers to subtly skew collected data like all-cause mortality (because a patient is either dead or not).

This definition is not a panacea, since often it may not be clear what decisions are downstream, much less how much a criticism could quantitatively affect them. But it provides a clear starting point for understanding which criticisms are, or should be, important (meta-analyses being particularly useful for nailing down things like the average effect-size bias due to a particular flaw), and which ones are dubious or quibbling and are signs that you are stretching to come up with any criticism at all. If you can’t explain at least somewhat plausibly how a criticism (or a combination of criticisms) could lead to diametrically opposite conclusions or actions, perhaps it is best left out.


Teaching Statistics

If decision theory is the end-all be-all, why is it so easy to take Statistics 101 or read a statistics textbook and come away with the attitude that statistics is nothing but a bag of tricks applied at the whim of the analyst, following rules written down nowhere, inscrutable to the uninitiated, who can only listen in bafflement to this or that pied piper of probabilities? (One uses a t-test unless one uses a Wilcoxon test, but of course, sometimes the p-value must be multiple-corrected, except when it’s fine not to, because you were using it as part of the main analysis or a component of a procedure like an ANOVA—not to be confused with an ANCOVA, MANCOVA, or linear model, which might really be a generalized linear model, with clustered standard errors as relevant…)

One issue is that the field greatly dislikes presenting statistics in any of the unifications which are available. Because those paradigms are not universally accepted, the attitude seems to be that no paradigm should be taught; however, to refuse to make a choice is itself a choice, and what gets taught is the paradigm of statistics-as-grab-bag. As often taught or discussed, statistics is treated as a bag of tricks and p-values and problem-specific algorithms. But there are paradigms one could teach.

For example, around the 1940s, led by , there was a huge paradigm shift towards the decision-theoretic interpretation of statistics, where all these Fisherian gizmos can be understood, justified, and criticized as being about minimizing loss given specific loss functions. The mean is a good way to estimate your parameter (rather than the mode or median or a bazillion other univariate statistics one could invent) not because that particular function was handed down at Mount Sinai, but because it does a good job of minimizing your loss under such-and-such conditions, like having a squared-error loss (because bigger errors hurt you much more); and if those conditions do not hold, that is why the, say, median is better, and you can say precisely how much better and when you’d go back to the mean (as opposed to rules of thumb about standard deviations or arbitrary p-value thresholds testing normality). Many issues in meta-science are much more transparent if you simply ask how they would affect decision-making (see the rest of this essay).
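A minimal simulation of that claim (my own illustration, on invented data): grid-search for the single number that best summarizes a skewed sample under each loss, and the squared-error minimizer lands on the mean while the absolute-error minimizer lands on the median:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=1.0, size=2_001)   # deliberately skewed sample

# Brute-force search: which constant c minimizes each average loss?
candidates = np.linspace(data.min(), data.max(), 2_001)
best_sq = candidates[np.argmin([np.mean((data - c) ** 2) for c in candidates])]
best_abs = candidates[np.argmin([np.mean(np.abs(data - c)) for c in candidates])]

print(best_sq, data.mean())        # squared loss is minimized near the mean
print(best_abs, np.median(data))   # absolute loss is minimized near the median
```

Nothing was handed down from Mount Sinai here: change the loss function and the “right” summary statistic changes with it.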

Similarly, Bayesianism means you can just ‘turn the crank’ on many problems: define a model, your priors, and turn the MCMC crank, without all the fancy problem-specific derivations and special-cases. Instead of all these mysterious distributions and formulas and tests and likelihoods dropping out of the sky, you understand that you are just setting up equations (or even just writing a program) which reflect how you think something works, in a sufficiently formalized way that you can run data through it and see how the prior updates into the posterior. The distributions & likelihoods then do not drop out of the sky but are pragmatic choices: what particular bits of mathematics are implemented in your MCMC library, and which match up well with how you think the problem works, without being too confusing or hard to work with or computationally-inefficient?
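The crank-turning can be shown in miniature without any MCMC library at all, via a toy grid approximation (my own stand-in example: 7 successes in 10 trials with a uniform prior, a case where the exact conjugate answer, Beta(8, 4), is known and serves as a check):

```python
# Grid approximation of a posterior: prior x likelihood, then normalize.
k, n = 7, 10                                  # observed: 7 successes in 10 trials
grid = [i / 1000 for i in range(1, 1000)]     # candidate success probabilities
prior = [1.0 for _ in grid]                   # uniform Beta(1, 1) prior
likelihood = [p**k * (1 - p) ** (n - k) for p in grid]
unnormalized = [pr * li for pr, li in zip(prior, likelihood)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(posterior_mean)   # ~0.667, matching the exact Beta(8, 4) posterior mean
```

MCMC is just a cleverer way of doing this same prior-times-likelihood bookkeeping when the grid would be impossibly large.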

And causal modeling is another good example: there is an endless zoo of biases and problems in fields like epidemiology which look like a mess of special cases you just have to memorize, but they all reduce to straightforward issues if you draw out a DAG of how things might causally work.
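Collider (selection) bias is one such special case that falls out of the graph immediately; a toy simulation with invented variables, where the DAG is Talent → Admission ← Luck:

```python
import numpy as np

rng = np.random.default_rng(1)
talent = rng.normal(size=100_000)
luck = rng.normal(size=100_000)               # independent of talent by construction
admitted = (talent + luck) > 1.5              # selecting on the collider

r_all = np.corrcoef(talent, luck)[0, 1]
r_admitted = np.corrcoef(talent[admitted], luck[admitted])[0, 1]
print(f"everyone      : r = {r_all:+.2f}")    # ~0, as constructed
print(f"admitted only : r = {r_admitted:+.2f}")  # clearly negative
```

Among the admitted, high talent predicts low luck purely as an artifact of selection: baffling as a memorized special case, obvious once the arrows are drawn.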

What happens in the absence of explicit use of these paradigms is an implicit use of them. Much of the ‘experience’ that statisticians or analysts rely on when they apply the bag of tricks is actually a hidden theory learned from experience & osmosis, used to reach the correct results while ostensibly using the bag of tricks: the analyst knows he ought to use a median here because he has a vaguely defined loss in mind for the downstream experiment, and he knows the data sometimes throws outliers which screwed up experiments in the past, so the mean is a bad choice and he ought to use ‘robust statistics’; or he knows from experience that most of the variables are irrelevant, so it’d be good to get shrinkage by sleight of hand, by picking a lasso regression instead of a regular OLS regression and, if anyone asks, talking vaguely about ‘regularization’; or he has a particular causal model of how enrollment in a group is a collider, so he knows to ask about “Simpson’s paradox”. Thus, in the hands of an expert, the bag of tricks works out, even as the neophyte is mystified and wonders how the expert knew to pull this or that trick out of, seemingly, their nether regions.

Teachers don’t like this because they don’t want to defend the philosophies of things like Bayesianism, often aren’t trained in them in the first place, and because teaching them is simultaneously too easy (the concepts are universal, straightforward, and can be one-liners) and too hard (reducing them to practice and actually computing anything—it’s easy to write down Bayes’s formula, not so easy to actually compute a real posterior, much less maximize over a decision tree).

There are a lot of criticisms that can be made of each paradigm, of course; none of them is universally assented to, to say the least—but I think it would generally be better to teach people in those principled approaches, and then later critique them, than to teach people in an entirely unprincipled fashion.

  1. There are two main categories I know of, reporting checklists and quality-evaluation checklists (in addition to the guidelines/recommendations published by professional groups, like the ’s manual based apparently on JARS, or ’s standards).

    Some reporting checklists:

    Some quality-evaluation scales:

  2. As pointed out by Jackson in a review of a similar book, the arguments used against heritability or IQ exemplified bad research critiques by making the perfect the enemy of better & selectively applying demands for rigor:

    There is no question that here, as in many areas that depend on field studies, precise control of extraneous variables is less than perfect. For example, in studies of separated twins, the investigator must concede that the ideal of random assignment of twin pairs to separated foster homes is not likely to be fully achieved, that it will be difficult to find comparison or control groups perfectly matched on all variables, and so on. Short of abandoning field data in social science entirely, there is no alternative but to employ a variational approach, seeking to weigh admittedly fallible data to identify support for hypotheses by the preponderance of evidence. Most who have done this find support for the heritability of IQ. instead sees only flaws in the evidence…As the data stand, had the author been equally zealous in evaluating the null hypothesis that such treatments make no difference he would have been hard pressed to fail to reject it.

  3. Non-replication of a result puts the original result in an awkward trilemma: either the original result was spurious (the most a priori likely case); the non-replicator got it wrong or was unlucky (difficult to argue, since most replications are well-powered and follow the original, so it would be easier to argue the original result was unlucky); or the research claim is so fragile and context-specific that non-replication is just ‘heterogeneity’ (but then why should anyone believe the result in any substantive way, or act on it, if it’s a coin-flip whether it even exists anywhere else?).↩︎

  4. As quoted in .↩︎

  5. It is interesting to note that the medieval origins of ‘probability’ were themselves inherently decision-based, as focused on the question of what it was moral to believe & act upon, and the mathematical roots of probability theory were also pragmatic, based on gambling. Laplace, of course, took a similar perspective in his early Bayesianism (eg or estimating the mass of Saturn). It was later statistical thinkers like Boole or Fisher who tried to expunge pragmatic interpretations in favor of purer definitions like limiting frequencies.↩︎

  6. Which is usually not the case, and why fakers like Stapel can be detected by looking for ‘too good to be true’ sets of results, over-rounding or overly-smooth numbers, or sometimes numbers that are not even arithmetically consistent!↩︎

  7. As formally incorrect as it may be, whenever I have done the work to treat ordinal variables correctly, it has typically merely tweaked the coefficients & standard errors, and not actually changed anything. Knowing this, it would be dishonest of me to criticize any study which does likewise unless I have some good reason (like having reanalyzed the data and—for once—gotten a major change in results).↩︎