The Replication Crisis: Flaws in Mainstream Science

2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects, called ‘the Replication Crisis’.
psychology, statistics, meta-analysis, sociology, causality
2010-10-27 to 2019-12-09; finished; certainty: highly likely; importance: 8

Long-standing problems in standard scientific methodology have exploded as the “Replication Crisis”: the discovery that many results in fields as diverse as psychology, economics, medicine, biology, and sociology are in fact false or quantitatively highly inaccurately measured. I cover here a handful of the issues and publications on this large, important, and rapidly developing topic up to about 2013, at which point the Replication Crisis became too large a topic to cover more than cursorily. (A compilation of some additional links is provided for post-2013 developments.)

The cri­sis is caused by meth­ods & pub­lish­ing pro­ce­dures which in­ter­pret ran­dom noise as im­por­tant re­sults, far too small datasets, se­lec­tive analy­sis by an an­a­lyst try­ing to reach expected/desired re­sults, pub­li­ca­tion bi­as, poor im­ple­men­ta­tion of ex­ist­ing best-prac­tices, non­triv­ial lev­els of re­search fraud, soft­ware er­rors, philo­soph­i­cal be­liefs among re­searchers that false pos­i­tives are ac­cept­able, ne­glect of known con­found­ing like ge­net­ics, and skewed in­cen­tives (fi­nan­cial & pro­fes­sion­al) to pub­lish ‘hot’ re­sults.

Thus, any in­di­vid­ual piece of re­search typ­i­cally es­tab­lishes lit­tle. Sci­en­tific val­i­da­tion comes not from small p-val­ues, but from dis­cov­er­ing a reg­u­lar fea­ture of the world which dis­in­ter­ested third par­ties can dis­cover with straight­for­ward re­search done in­de­pen­dently on new data with new pro­ce­dures—repli­ca­tion.

Mainstream science is flawed: seriously mistaken statistics combined with poor incentives have led to masses of misleading research. Nor is this problem exclusive to psychology: economics, certain genetics subfields (principally candidate-gene research), biomedical science, and biology in general are often on shaky ground.

NHST and Systematic Biases

Sta­tis­ti­cal back­ground on p-value prob­lems: Against nul­l-hy­poth­e­sis sig­nifi­cance test­ing

The basic nature of statistical significance being usually defined as p < 0.05 means we should expect something like >5% of studies or experiments to be bogus (optimistically), but that only considers “false positives”; reducing “false negatives” requires statistical power (weakened by small samples), and the two combine with the base rate of true underlying effects into a total error rate. Ioannidis 2005 points out that considering the usual p values, the underpowered nature of many studies, the rarity of underlying effects, and a little bias, even large randomized trials may wind up with only an 85% chance of having yielded the truth. One survey of reported p-values in medicine yielded a lower bound of false positives of 17%.
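How power, base rate, and bias combine into a total error rate can be sketched in a few lines. This is a minimal illustration, not Ioannidis's own code: it assumes the usual positive-predictive-value bookkeeping, where `prior_odds` is the ratio of true to null hypotheses being tested and `bias` is the fraction of would-be negative analyses that get reported as positive anyway. The function name and the scenarios are my own choices for illustration.

```python
def ppv(power: float, prior_odds: float, alpha: float = 0.05, bias: float = 0.0) -> float:
    """Probability that a claimed positive finding reflects a real effect.

    power      -- chance a real effect reaches significance (1 - beta)
    prior_odds -- ratio of true to null relationships among those tested
    alpha      -- significance threshold, i.e. false-positive rate under the null
    bias       -- fraction of otherwise-negative analyses reported as positive
    """
    beta = 1.0 - power
    # Real effects reported as positive: detected honestly, or rescued by bias.
    true_positives = prior_odds * (power + bias * beta)
    # Null effects reported as positive: by chance, or by bias.
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios; only the first is meant to mirror the 85% figure above.
scenarios = {
    "well-powered RCT, 1:1 odds, a little bias": ppv(0.80, 1.0, bias=0.10),
    "same trial with no bias at all": ppv(0.80, 1.0),
    "underpowered exploratory study, 1:10 odds": ppv(0.20, 0.1, bias=0.20),
}
for label, value in scenarios.items():
    print(f"{label}: PPV = {value:.2f}")
```

The first scenario lands at roughly 0.85, which is how a large randomized trial can still be wrong about one time in seven; lowering power or prior odds drags the figure down quickly.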

“Figure 1: Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.”

Yet, there are too many positive results1 (psychiatry, neurobiology, biomedicine, biology, ecology & evolution, psychology, economics, sociology, gene-disease correlations), and positive results correlate with per capita publishing rates in US states (apparently chance is kind to scientists who must publish a lot and recently!); then there come the inadvertent errors which might cause retraction, which is rare, but the true retraction rate may be 0.1–1% (“How many scientific papers should be retracted?”), is increasing & seems to positively correlate with journal prestige metrics (modulo the confounding factor that famous papers/journals get more scrutiny), not that anyone pays any attention to such things; then there are basic statistical errors in >11% of papers (based on the high-quality papers in Nature and the British Medical Journal; “Incongruence between test statistics and P values in medical papers”, García-Berthou 2004) or 50% in neuroscience.

And only then can we get into replicating at all. See for example The Atlantic article “Lies, Damned Lies, and Medical Science” on research showing that 41% of the most cited medical research failed to be replicated (that is, was wrong). For details, you can see Ioannidis’s “Why Most Published Research Findings Are False”2, or Begley’s failed attempts to replicate 47 of 53 articles in top cancer journals (leading to Booth’s “Begley’s Six Rules”; see also the Nature Biotechnology editorial & note that full details have not been published because the researchers of the original studies demanded secrecy from Begley’s team), or Kumar & Nash 2011’s “Health Care Myth Busters: Is There a High Degree of Scientific Certainty in Modern Medicine?”, who write ‘We could accurately say, “Half of what physicians do is wrong,” or “Less than 20% of what physicians do has solid research to support it.”’ Nutritional epidemiology is something of a fish in a barrel; after Ioannidis, is anyone surprised that when Young & Karr 2011 followed up on 52 correlations tested in 12 RCTs, 0⁄52 replicated and the RCTs found the opposite of 5?

At­tempts to use an­i­mal mod­els to in­fer any­thing about hu­mans suffer from all the method­olog­i­cal prob­lems pre­vi­ously men­tioned, and add in in­ter­est­ing new forms of er­ror such as mice sim­ply be­ing ir­rel­e­vant to hu­mans, lead­ing to cases like <150 clin­i­cal tri­als all fail­ing—be­cause the drugs worked in mice but hu­mans have a com­pletely differ­ent set of ge­netic re­ac­tions to in­flam­ma­tion.

‘Hot’ fields tend to be new fields, which brings problems of its own; see “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” & discussion. (Failure to replicate in larger studies seems to be a hallmark of biological/medical research. Ioannidis performs the same trick with biomarkers, finding less than half of the most-cited biomarkers were even statistically-significant in the larger studies. 12 of the more prominent gene-IQ correlations failed to replicate on larger data.) As we know now, almost the entire candidate-gene literature, most things reported from 2000–2010 before large-scale GWASes started to be done (and completely failing to find the candidate-genes), is nothing but false positives! The replication rates of candidate-genes for things like intelligence, personality, gene-environment interactions, psychiatric disorders (the whole schmeer) are literally ~0%. On the plus side, the parlous state of affairs means that there are some cheap heuristics for detecting unreliable papers: simply asking for data & being refused/ignored correlates strongly with the original paper having errors in its statistics.

This epidemic of false positives is apparently deliberately and knowingly accepted by epidemiology; Young’s 2008 “Everything is Dangerous” remarks that 80–90% of epidemiology’s claims do not replicate (eg. the NIH ran 20 randomized-controlled-trials of claims, and only 1 replicated) and that lack of ‘multiple comparisons’ correction (either Bonferroni or Benjamini-Hochberg) is taught: “Rothman (1990) says no correction for multiple testing is necessary and Vandenbroucke, PLoS Med (2008) agrees” (see also Perneger 1998, who also explicitly understands that forgoing correction increases type 1 errors and reduces type 2 errors). Multiple correction is necessary because its absence does, in fact, result in the overstatement of medical benefit (Godfrey 1985, Pocock et al 1987, Smith 1987).

The average meta-analytic effect size in psychology/education is d = 0.53 (well below several effect sizes from n-back/IQ studies); when moving from laboratory to non-laboratory settings, meta-analytic findings correlate at ~0.7, but in some areas the replication correlation falls to ~0.5 with >14% of findings actually turning out to be the opposite (see Anderson et al 1999 and Mitchell 2012; for exaggeration due to non-blinding or poor randomization, Wood et al 2008). (Meta-analyses also give us a starting point for understanding how unusual medium or large effect sizes are4.) Psychology does have many challenges, but practitioners also handicap themselves; an older overview is the entertaining “What’s Wrong With Psychology, Anyway?”, which mentions the obvious point that statistics & experimental design are flexible enough to reach significance as desired.

In an interesting example of how methodological reforms are no panacea in the presence of continued perverse incentives, an earlier methodological improvement in psychology (reporting multiple experiments in a single publication as a check against results not being generalizable) has merely demonstrated the widespread p-value hacking or manipulation or publication bias: given the low statistical power of each experiment (eg. in neuroscience, “between 20–30%”), even if the underlying phenomena were real it would still be wildly improbable that all n experiments in a paper would turn up statistically-significant results. These problems are pervasive enough that I believe they entirely explain any “decline effects”5.
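The arithmetic behind “wildly improbable” is short enough to spell out. The sketch below assumes the experiments in a paper are independent and share the same power, which is a simplification; the function name and the grid of values are mine, chosen only to bracket the 20–30% power range quoted above.

```python
def prob_all_significant(power: float, n_experiments: int) -> float:
    """Chance that every one of n independent experiments reaches significance,
    assuming the effect is real and each experiment has the same power."""
    return power ** n_experiments


for power in (0.2, 0.3, 0.5, 0.8):
    row = ", ".join(
        f"n={n}: {prob_all_significant(power, n):.4f}" for n in (1, 3, 5, 10)
    )
    print(f"power={power:.1f} -> {row}")
```

At 30% power, a five-experiment paper in which everything “worked” should turn up well under 1% of the time even when the effect is real, so a literature full of such papers points to selection rather than luck.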

The failure to replicate “statistically significant” results has led one blogger to caustically remark (see also “Parapsychology: the control group for science” and “Using degrees of freedom to change the past for fun and profit”):

Para­psy­chol­o­gy, the con­trol group for sci­ence, would seem to be a thriv­ing field with “sta­tis­ti­cally sig­nifi­cant” re­sults aplen­ty…­Para­psy­chol­o­gists are con­stantly protest­ing that they are play­ing by all the stan­dard sci­en­tific rules, and yet their re­sults are be­ing ig­nored—that they are un­fairly be­ing held to higher stan­dards than every­one else. I’m will­ing to be­lieve that. It just means that the stan­dard sta­tis­ti­cal meth­ods of sci­ence are so weak and flawed as to per­mit a field of study to sus­tain it­self in the com­plete ab­sence of any sub­ject mat­ter. With two-thirds of med­ical stud­ies in pres­ti­gious jour­nals fail­ing to repli­cate, get­ting rid of the en­tire ac­tual sub­ject mat­ter would shrink the field by only 33%.

Cosma Shal­izi:

…Let me draw the moral [about pub­li­ca­tion bi­as]. Even if the com­mu­nity of in­quiry is both too clue­less to make any con­tact with re­al­ity and too hon­est to nudge bor­der­line find­ings into sig­nifi­cance, so long as they can keep com­ing up with new phe­nom­ena to look for, the mech­a­nism of the file-drawer prob­lem alone will guar­an­tee a steady stream of new re­sults. There is, so far as I know, no Jour­nal of Ev­i­dence-Based Harus­picy filled, is­sue after is­sue, with method­olog­i­cal­ly-fault­less pa­pers re­port­ing the abil­ity of sheep’s liv­ers to pre­dict the win­ners of sumo cham­pi­onships, the out­come of speed dates, or real es­tate trends in se­lected sub­urbs of Chica­go. But the diffi­culty can only be that the ev­i­dence-based harus­pices aren’t try­ing hard enough, and some friendly ri­valry with the plas­tro­mancers is called for. It’s true that none of these find­ings will last forever, but this con­stant over­turn­ing of old ideas by new dis­cov­er­ies is just part of what makes this such a dy­namic time in the field of harus­picy. Many schol­ars will even tell you that their fa­vorite part of be­ing a harus­pex is the fre­quency with which a new sac­ri­fice over-turns every­thing they thought they knew about read­ing the fu­ture from a sheep’s liv­er! We are very ex­cited about the re­newed in­ter­est on the part of pol­i­cy-mak­ers in the rec­om­men­da­tions of the man­tic arts…

And this is when there is enough information to replicate; open access to any data for a paper is rare. The economics journal Journal of Money, Credit and Banking, which required researchers to provide the data & software which could replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for repeating the paper (see “Lessons from the JMCB Archive”). In one cute economics example, replication failed because the dataset had been altered to make participants look better. Availability of data is often low, and many studies never get published regardless of whether publication is legally mandated.

Transcription errors in papers seem to be common (possibly due to constantly changing analyses & p-hacking?), and as software and large datasets become more inherent to research, the need for replicability will grow and the problem of achieving it will get worse, because even mature commercial software libraries can disagree majorly on their computed results to the same mathematical specification (see also Anda et al 2009). And spreadsheets are especially bad, with error rates in the 88% range (“What we know about spreadsheet errors”, Panko 1998); spreadsheets are used in all areas of science, including biology and medicine (see “Error! What biomedical computing can learn from its mistakes”; famous examples of coding errors include Reinhart-Rogoff), not to mention regular business (eg the London Whale).

Psychology is far from being perfect either; look at the examples in The New Yorker’s “The Truth Wears Off” article (or look at some excerpts from that article). Computer scientist Peter Norvig has written a must-read essay on interpreting statistics, “Warning Signs in Experimental Design and Interpretation”; a number of warning signs apply to many psychological studies. There may be incentive problems: a transplant researcher discovered the only way to publish in Nature his inability to replicate his earlier Nature paper was to officially retract it; another interesting example is when, after Bem got a paper published in the top journal JPSP demonstrating precognition, the journal refused to publish any replications (failed or successful) because… “‘We don’t want to be the Journal of Bem Replication’, he says, pointing out that other high-profile journals have similar policies of publishing only the best original research.” (Quoted in New Scientist) One doesn’t need to be a genius to understand why psychologist Andrew D. Wilson might snarkily remark “…think about the message JPSP is sending to authors. That message is ‘we will publish your crazy story if it’s new, but not your sensible story if it’s merely a replication’.” (You get what you pay for.) In one large test of the most famous psychology results, 10 of 13 (77%) replicated. The replication rate is under 1⁄3 in research touching on genetics. This despite the simple point that replications reduce the risk of publication bias, and increase statistical power, so that a replicated result is more likely to be true. And the small samples of n-back studies and nootropic chemicals are especially problematic. Quoting from Sandberg & Bostrom 2006 “Converging Cognitive Enhancements”:

The re­li­a­bil­ity of re­search is also an is­sue. Many of the cog­ni­tion-en­hanc­ing in­ter­ven­tions show small effect sizes, which may ne­ces­si­tate very large epi­demi­o­log­i­cal stud­ies pos­si­bly ex­pos­ing large groups to un­fore­seen risks.

Particularly troubling is the slowdown in drug discovery & medical technology during the 2000s, even as genetics in particular was expected to produce earth-shaking new treatments. One biotech venture capitalist writes:

The company spent $5M [in 2011 dollars] or so trying to validate a platform that didn’t exist. When they tried to directly repeat the academic founder’s data, it never worked. Upon re-examination of the lab notebooks, it was clear the founder’s lab had at the very least massaged the data and shaped it to fit their hypothesis. Essentially, they systematically ignored every piece of negative data. Sadly this “failure to repeat” happens more often than we’d like to believe. It has happened to us at Atlas [Venture] several times in the past decade…The unspoken rule is that at least 50% of the studies published even in top tier academic journals (Science, Nature, Cell, PNAS, etc…) can’t be repeated with the same conclusions by an industrial lab. In particular, key animal models often don’t reproduce. This 50% failure rate isn’t a data free assertion: it’s backed up by dozens of experienced R&D professionals who’ve participated in the (re)testing of academic findings. This is a huge problem for translational research and one that won’t go away until we address it head on.

Half the respondents to a 2012 survey at one cancer research center reported 1 or more incidents where they could not reproduce published research; two-thirds of those were never “able to explain or resolve their discrepant findings”, half had trouble publishing results contradicting previous publications, and two-thirds failed to publish contradictory results. An internal survey of 67 projects (commentary) found that “only in ~20–25% of the projects were the relevant published data completely in line with our in-house findings”, and as far as assessing the projects went:

…de­spite the low num­bers, there was no ap­par­ent differ­ence be­tween the differ­ent re­search fields. Sur­pris­ing­ly, even pub­li­ca­tions in pres­ti­gious jour­nals or from sev­eral in­de­pen­dent groups did not en­sure re­pro­ducibil­i­ty. In­deed, our analy­sis re­vealed that the re­pro­ducibil­ity of pub­lished data did not sig­nifi­cantly cor­re­late with jour­nal im­pact fac­tors, the num­ber of pub­li­ca­tions on the re­spec­tive tar­get or the num­ber of in­de­pen­dent groups that au­thored the pub­li­ca­tions. Our find­ings are mir­rored by ‘gut feel­ings’ ex­pressed in per­sonal com­mu­ni­ca­tions with sci­en­tists from acad­e­mia or other com­pa­nies, as well as pub­lished ob­ser­va­tions. [apro­pos of above] An un­spo­ken rule among ear­ly-stage ven­ture cap­i­tal firms that “at least 50% of pub­lished stud­ies, even those in top-tier aca­d­e­mic jour­nals, can’t be re­peated with the same con­clu­sions by an in­dus­trial lab” has been re­cently re­ported (see Fur­ther in­for­ma­tion) and dis­cussed 4.

Physics has relatively small sins; see “Assessing uncertainty in physical constants” (Henrion & Fischhoff 1985); Hanson’s summary:

Look­ing at 306 es­ti­mates for par­ti­cle prop­er­ties, 7% were out­side of a 98% con­fi­dence in­ter­val (where only 2% should be). In seven other cas­es, each with 14 to 40 es­ti­mates, the frac­tion out­side the 98% con­fi­dence in­ter­val ranged from 7% to 57%, with a me­dian of 14%.

Nor is or robust against even . Scientists who win the Nobel Prize find their other work suddenly being heavily cited, suggesting either that the community badly failed in recognizing the work’s true value or that they are now sucking up & attempting to look better. (A mathematician once told me that often, to boost a paper’s acceptance chance, they would add citations to papers by the journal’s editors, a practice that will surprise none familiar with the use of citations in tenure & grants.)

The former British Medical Journal editor Richard Smith amusingly recounts his doubts about the merits of peer review as practiced, and one physicist points out that peer review is historically rare (just one of Einstein’s 300 papers was peer reviewed; the famous Nature did not institute peer review until 1967), has been poorly studied & not shown to be effective, is nationally biased, erroneously rejects many historic discoveries (one study lists “34 Nobel Laureates whose awarded work was rejected by peer review”; Horrobin 1990 lists others), and catches only a small fraction of errors. And questionable choices or fraud? Forget about it:

A pooled weighted av­er­age of 1.97% (N = 7, 95%­CI: 0.86–4.45) of sci­en­tists ad­mit­ted to have fab­ri­cat­ed, fal­si­fied or mod­i­fied data or re­sults at least on­ce—a se­ri­ous form of mis­con­duct by any stan­dard­—and up to 33.7% ad­mit­ted other ques­tion­able re­search prac­tices. In sur­veys ask­ing about the be­hav­iour of col­leagues, ad­mis­sion rates were 14.12% (N = 12, 95% CI: 9.91–19.72) for fal­si­fi­ca­tion, and up to 72% for other ques­tion­able re­search prac­tices…When these fac­tors were con­trolled for, mis­con­duct was re­ported more fre­quently by medical/pharmacological re­searchers than oth­ers.

And in psychology:

We sur­veyed over 2,000 psy­chol­o­gists about their in­volve­ment in ques­tion­able re­search prac­tices, us­ing an anony­mous elic­i­ta­tion for­mat sup­ple­mented by in­cen­tives for hon­est re­port­ing. The im­pact of in­cen­tives on ad­mis­sion rates was pos­i­tive, and greater for prac­tices that re­spon­dents judge to be less de­fen­si­ble. Us­ing three differ­ent es­ti­ma­tion meth­ods, we find that the pro­por­tion of re­spon­dents that have en­gaged in these prac­tices is sur­pris­ingly high rel­a­tive to re­spon­dents’ own es­ti­mates of these pro­por­tions. Some ques­tion­able prac­tices may con­sti­tute the pre­vail­ing re­search norm.

In short, the se­cret sauce of sci­ence is not ‘peer re­view’. It is repli­ca­tion!

Systemic Error Doesn’t Go Away

“Far bet­ter an ap­prox­i­mate an­swer to the right ques­tion, which is often vague, than an ex­act an­swer to the wrong ques­tion, which can al­ways be made pre­cise.”

John Tukey, “The future of data analysis” 1962

Why is­n’t the so­lu­tion as sim­ple as elim­i­nat­ing dat­a­min­ing by meth­ods like larger n or pre-reg­is­tered analy­ses? Be­cause once we have elim­i­nated the ran­dom er­ror in our analy­sis, we are still left with a (po­ten­tially ar­bi­trar­ily large) sys­tem­atic er­ror, leav­ing us with a large to­tal er­ror.

These are systematic biases and as such, they force an upper bound on how accurate a corpus of studies can be even if there were thousands upon thousands of studies, because the total error in the results is made up of random error and systematic error, but while random error shrinks as more studies are done, systematic error remains the same.

A thou­sand bi­ased stud­ies merely re­sult in an ex­tremely pre­cise es­ti­mate of the wrong num­ber.
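A small simulation makes the point concrete. It is only a sketch under made-up assumptions: every study measures a true effect of 0, shares the same bias of +0.2, and adds independent Gaussian noise; `pooled_estimate` and its parameters are hypothetical names for illustration.

```python
import random
import statistics


def pooled_estimate(n_studies, true_effect=0.0, bias=0.2, noise_sd=1.0, seed=0):
    """Average n_studies results that all share the same systematic bias.

    Returns the pooled mean and its naive standard error, which reflects
    only the random error and knows nothing about the bias.
    """
    rng = random.Random(seed)
    results = [true_effect + bias + rng.gauss(0.0, noise_sd) for _ in range(n_studies)]
    mean = statistics.fmean(results)
    std_err = statistics.stdev(results) / n_studies ** 0.5
    return mean, std_err


for n in (10, 1_000, 100_000):
    mean, std_err = pooled_estimate(n)
    print(f"{n:>7} studies: estimate = {mean:+.3f} +/- {std_err:.3f} (truth = 0)")
```

As the number of studies grows, the reported uncertainty collapses toward zero while the estimate settles on the bias rather than on the truth.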

This is a point appreciated by statisticians and experimental physicists, but it doesn’t seem to be frequently discussed. Andrew Gelman has a fun demonstration of selection bias involving candy, or from pg812–1020 of Chapter 8 “Sufficiency, Ancillarity, And All That” of Probability Theory: The Logic of Science by E. T. Jaynes:

The classical example showing the error of this kind of reasoning is the fable about the height of the Emperor of China. Supposing that each person in China surely knows the height of the Emperor to an accuracy of at least ±1 meter, if there are N = 1,000,000,000 inhabitants, then it seems that we could determine his height to an accuracy at least as good as

$$\frac{1}{\sqrt{1{,}000{,}000{,}000}}\ \text{m} = 3 \times 10^{-5}\ \text{m} = 0.03\ \text{mm}$$

merely by asking each person’s opinion and averaging the results.

The ab­sur­dity of the con­clu­sion tells us rather force­fully that the rule is not al­ways valid, even when the sep­a­rate data val­ues are causally in­de­pen­dent; it re­quires them to be log­i­cally in­de­pen­dent. In this case, we know that the vast ma­jor­ity of the in­hab­i­tants of China have never seen the Em­per­or; yet they have been dis­cussing the Em­peror among them­selves and some kind of men­tal im­age of him has evolved as folk­lore. Then knowl­edge of the an­swer given by one does tell us some­thing about the an­swer likely to be given by an­oth­er, so they are not log­i­cally in­de­pen­dent. In­deed, folk­lore has al­most surely gen­er­ated a sys­tem­atic er­ror, which sur­vives the av­er­ag­ing; thus the above es­ti­mate would tell us some­thing about the folk­lore, but al­most noth­ing about the Em­per­or.

We could put it roughly as fol­lows:

$$\text{error in estimate} = S \pm \frac{R}{\sqrt{N}} \qquad \text{(8-50)}$$

where S is the com­mon sys­tem­atic er­ror in each da­tum, R is the ‘ran­dom’ er­ror in the in­di­vid­ual data val­ues. Un­in­formed opin­ions, even though they may agree well among them­selves, are nearly worth­less as ev­i­dence. There­fore sound sci­en­tific in­fer­ence de­mands that, when this is a pos­si­bil­i­ty, we use a form of prob­a­bil­ity the­ory (i.e. a prob­a­bilis­tic mod­el) which is so­phis­ti­cated enough to de­tect this sit­u­a­tion and make al­lowances for it.

As a start on this, equation (8-50) gives us a crude but useful rule of thumb; it shows that, unless we know that the systematic error is less than about 1⁄3 of the random error, we cannot be sure that the average of a million data values is any more accurate or reliable than the average of ten6. As Poincaré put it: “The physicist is persuaded that one good measurement is worth many bad ones.” This has been well recognized by experimental physicists for generations; but warnings about it are conspicuously missing in the “soft” sciences whose practitioners are educated from those textbooks.
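Where the “about 1⁄3” comes from can be checked directly: with ten data values the random part of the error is already R/√10 ≈ R/3, so a systematic error of that size or larger dominates from then on. The snippet below is a sketch that reads the rule of thumb as a worst-case bound |S| + R/√N; that reading and the name `error_bound` are my assumptions, not Jaynes’s.

```python
import math


def error_bound(systematic: float, random_rms: float, n: int) -> float:
    """Worst-case size of the error in an average of n data values:
    the shared systematic error plus the shrinking random part."""
    return abs(systematic) + random_rms / math.sqrt(n)


R = 1.0  # random error per datum, in arbitrary units
print(f"random part at N=10: R/sqrt(10) = {R / math.sqrt(10):.3f}")
for S in (0.0, R / 3, R):
    ten = error_bound(S, R, 10)
    million = error_bound(S, R, 10**6)
    print(f"S={S:.2f}: N=10 -> {ten:.3f}, N=1,000,000 -> {million:.3f}, "
          f"improvement x{ten / million:.1f}")
```

With no systematic error the million-value average is hundreds of times better; with S = R/3, a hundred-thousand-fold increase in data buys less than a factor of two.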

Or pg1019–1020 Chap­ter 10 “Physics of ‘Ran­dom Ex­per­i­ments’”:

…Nevertheless, the existence of such a strong connection is clearly only an ideal limiting case unlikely to be realized in any real application. For this reason, the laws of large numbers and limit theorems of probability theory can be grossly misleading to a scientist or engineer who naively supposes them to be experimental facts, and tries to interpret them literally in his problems. Here are two simple examples:

  1. Sup­pose there is some ran­dom ex­per­i­ment in which you as­sign a prob­a­bil­ity p for some par­tic­u­lar out­come A. It is im­por­tant to es­ti­mate ac­cu­rately the frac­tion f of times A will be true in the next mil­lion tri­als. If you try to use the laws of large num­bers, it will tell you var­i­ous things about f; for ex­am­ple, that it is quite likely to differ from p by less than a tenth of one per­cent, and enor­mously un­likely to differ from p by more than one per­cent. But now, imag­ine that in the first hun­dred tri­als, the ob­served fre­quency of A turned out to be en­tirely differ­ent from p. Would this lead you to sus­pect that some­thing was wrong, and re­vise your prob­a­bil­ity as­sign­ment for the 101’st tri­al? If it would, then your state of knowl­edge is differ­ent from that re­quired for the va­lid­ity of the law of large num­bers. You are not sure of the in­de­pen­dence of differ­ent tri­als, and/or you are not sure of the cor­rect­ness of the nu­mer­i­cal value of p. Your pre­dic­tion of f for a mil­lion tri­als is prob­a­bly no more re­li­able than for a hun­dred.
  2. The com­mon sense of a good ex­per­i­men­tal sci­en­tist tells him the same thing with­out any prob­a­bil­ity the­o­ry. Sup­pose some­one is mea­sur­ing the ve­loc­ity of light. After mak­ing al­lowances for the known sys­tem­atic er­rors, he could cal­cu­late a prob­a­bil­ity dis­tri­b­u­tion for the var­i­ous other er­rors, based on the noise level in his elec­tron­ics, vi­bra­tion am­pli­tudes, etc. At this point, a naive ap­pli­ca­tion of the law of large num­bers might lead him to think that he can add three sig­nifi­cant fig­ures to his mea­sure­ment merely by re­peat­ing it a mil­lion times and av­er­ag­ing the re­sults. But, of course, what he would ac­tu­ally do is to re­peat some un­known sys­tem­atic er­ror a mil­lion times. It is idle to re­peat a phys­i­cal mea­sure­ment an enor­mous num­ber of times in the hope that “good sta­tis­tics” will av­er­age out your er­rors, be­cause we can­not know the full sys­tem­atic er­ror. This is the old “Em­peror of China” fal­la­cy…

In­deed, un­less we know that all sources of sys­tem­atic er­ror—rec­og­nized or un­rec­og­nized—­con­tribute less than about one-third the to­tal er­ror, we can­not be sure that the av­er­age of a mil­lion mea­sure­ments is any more re­li­able than the av­er­age of ten. Our time is much bet­ter spent in de­sign­ing a new ex­per­i­ment which will give a lower prob­a­ble er­ror per tri­al. As Poin­care put it, “The physi­cist is per­suaded that one good mea­sure­ment is worth many bad ones.”7 In other words, the com­mon sense of a sci­en­tist tells him that the prob­a­bil­i­ties he as­signs to var­i­ous er­rors do not have a strong con­nec­tion with fre­quen­cies, and that meth­ods of in­fer­ence which pre­sup­pose such a con­nec­tion could be dis­as­trously mis­lead­ing in his prob­lems.

Schlaifer much earlier made the same point in Probability and Statistics for Business Decisions: an Introduction to Managerial Economics Under Uncertainty, Schlaifer 1959, pg488–489:

31.4.3 Bias and Sam­ple Size

In Sec­tion 31.2.6 we used a hy­po­thet­i­cal ex­am­ple to il­lus­trate the im­pli­ca­tions of the fact that the vari­ance of the mean of a sam­ple in which bias is sus­pected is

so that only the sec­ond term de­creases as the sam­ple size in­creases and the to­tal can never be less than the fixed value of the first term. To em­pha­size the im­por­tance of this point by a real ex­am­ple we re­call the most fa­mous sam­pling fi­asco in his­to­ry, the . Over 2 mil­lion reg­is­tered vot­ers filled in and re­turned the straw bal­lots sent out by the Di­gest, so that there was less than one chance in 1 bil­lion of a sam­pling er­ror as large as 2⁄10 of one per­cent­age point8, and yet the poll was ac­tu­ally off by nearly 18 per­cent­age points: it pre­dicted that 54.5 per cent of the pop­u­lar vote would go to Lan­don, who in fact re­ceived only 36.7 per cent.9 10

Since sam­pling er­ror can­not ac­count for any ap­pre­cia­ble part of the 18-point dis­crep­an­cy, it is vir­tu­ally all ac­tual bias. A part of this to­tal bias may be mea­sure­ment bias due to the fact that not all peo­ple voted as they said they would vote; the im­pli­ca­tions of this pos­si­bil­ity were dis­cussed in Sec­tion 31.3. The larger part of the to­tal bi­as, how­ev­er, was al­most cer­tainly se­lec­tion bias. The straw bal­lots were mailed to peo­ple whose names were se­lected from lists of own­ers of tele­phones and au­to­mo­biles and the sub­pop­u­la­tion which was effec­tively sam­pled was even more re­stricted than this: it con­sisted only of those own­ers of tele­phones and au­to­mo­biles who were will­ing to fill out and re­turn a straw bal­lot. The true mean of this sub­pop­u­la­tion proved to be en­tirely differ­ent from the true mean of the pop­u­la­tion of all United States cit­i­zens who voted in 1936.

It is true that there was no evidence at the time this poll was planned which would have suggested that the bias would be as great as the 18 percentage points actually realized, but experience with previous polls had shown biases which would have led any sensible person to assign to $\tilde{\beta}$ a distribution with $\sigma(\tilde{\beta})$ equal to at least 1 percentage point. A sample of only 23,760 returned ballots, 1⁄100th the size actually used, would have given a $\sigma(\tilde{\epsilon})$ of only 1⁄3 percentage point, so that the standard deviation of $\tilde{x}$ would have been

$$\sigma(\tilde{x}) = \sqrt{\sigma^2(\tilde{\beta}) + \sigma^2(\tilde{\epsilon})} = \sqrt{1 + \tfrac{1}{9}} = 1.05$$

percentage points. Using a sample 100 times this large reduced $\sigma(\tilde{\epsilon})$ from 1⁄3 point to virtually zero, but it could not affect $\sigma(\tilde{\beta})$ and thus on the most favorable assumption could reduce $\sigma(\tilde{x})$ only from 1.05 points to 1 point. To collect and tabulate over 2 million additional ballots when this was the greatest gain that could be hoped for was obviously ridiculous before the fact and not just in the light of hindsight.
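
The arithmetic in this passage is easy to re-run. The following Python is a rough sketch rather than anything taken from the quoted text: it assumes the usual binomial formula √(p(1−p)/n) for the sampling error at the worst case p = 0.5, treats bias and sampling error as independent so that their variances add, and takes the quoted 1 percentage point as the standard deviation of the bias. The function names are hypothetical.

```python
import math


def sampling_sd_pp(n, p=0.5):
    """Standard deviation of a sample proportion, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)


def total_sd_pp(bias_sd_pp, n, p=0.5):
    """Root-sum-of-squares of an independent bias term and the sampling error."""
    return math.sqrt(bias_sd_pp ** 2 + sampling_sd_pp(n, p) ** 2)


def two_tail_normal(z):
    """P(|Z| > z) for a standard Normal variable."""
    return math.erfc(z / math.sqrt(2))


N_FULL = 2_376_523   # ballots returned, as stated in the text
N_SMALL = 23_760     # the hypothetical sample 1/100th the size
BIAS_SD = 1.0        # assumed SD of the bias, in percentage points

sd_full = sampling_sd_pp(N_FULL)
z = 0.2 / sd_full                      # a 0.2-point error in SD units
p_two_tail = two_tail_normal(z)
total_small = total_sd_pp(BIAS_SD, N_SMALL)
total_full = total_sd_pp(BIAS_SD, N_FULL)

print(f"sampling SD, full sample: {sd_full:.4f} points")
print(f"0.2 points is {z:.2f} SDs; two-tail probability {p_two_tail:.1e}")
print(f"total SD with n = {N_SMALL}: {total_small:.2f} points")
print(f"total SD with n = {N_FULL}: {total_full:.2f} points")
```

Under these assumptions the hundredfold larger sample buys a reduction from about 1.05 to about 1.00 percentage points, while the chance of a pure sampling error as large as 0.2 points comes out below one in a billion.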

What’s par­tic­u­larly sad is when peo­ple read some­thing like this and de­cide to rely on anec­dotes, per­sonal ex­per­i­ments, and al­ter­na­tive med­i­cine where there are even more sys­tem­atic er­rors and no way of re­duc­ing ran­dom er­ror at all! Sci­ence may be the lens that sees its own flaws, but if other epis­te­molo­gies do not boast such long de­tailed self­-cri­tiques, it’s not be­cause they are flaw­less… It’s like that old quote: Some peo­ple, when faced with the prob­lem of main­stream med­i­cine & epi­demi­ol­ogy hav­ing se­ri­ous method­olog­i­cal weak­ness­es, say “I know, I’ll turn to non-main­stream med­i­cine & epi­demi­ol­o­gy. After all, if only some med­i­cine is based on real sci­en­tific method and out­per­forms place­bos, why both­er?” (Now they have two prob­lem­s.) Or per­haps Isaac Asi­mov: “John, when peo­ple thought the earth was flat, they were wrong. When peo­ple thought the earth was spher­i­cal, they were wrong. But if you think that think­ing the earth is spher­i­cal is just as wrong as think­ing the earth is flat, then your view is wronger than both of them put to­geth­er.”

Further reading

Ad­di­tional links, largely cu­rated from my :

Some ex­am­ples of how ‘dat­a­min­ing’ or ‘data dredg­ing’ can man­u­fac­ture cor­re­la­tions on de­mand from large datasets by com­par­ing enough vari­ables:

Rates of autism diagnoses in children correlate with age (or should we blame organic food sales?); height & vocabulary or foot size & math skills may correlate strongly (in children); national chocolate consumption correlates with Nobel prizes¹², as do borrowing from commercial banks & buying luxury cars & serial-killers/mass-murderers/traffic-fatalities¹³; moderate alcohol consumption predicts increased lifespan and earnings; the role of storks in delivering babies may have been underestimated; children and people with high self-esteem have higher grades & lower crime rates etc, so “we all know in our gut that it’s true” that raising people’s self-esteem “empowers us to live responsibly and that inoculates us against the lures of crime, violence, substance abuse, teen pregnancy, child abuse, chronic welfare dependency and educational failure”, unless perhaps high self-esteem is caused by high grades & success, boosting self-esteem has no experimental benefits, and may backfire?

Those last can be gen­er­ated ad nau­se­am: Shaun Gal­lagher’s Cor­re­lated (also a book) sur­veys users & com­pares against all pre­vi­ous sur­veys with 1k+ cor­re­la­tions.

Tyler Vi­gen’s “spu­ri­ous cor­re­la­tions” cat­a­logues 35k+ cor­re­la­tions, many with r > 0.9, based pri­mar­ily on US Cen­sus & CDC da­ta.

Google Cor­re­late “finds Google search query pat­terns which cor­re­spond with re­al-world trends” based on ge­og­ra­phy or user-pro­vided data, which offers end­less fun (“Face­book”/“tape­worm in hu­mans”, r = 0.8721; “Su­per­f­reako­nomic”/“Win­dows 7 ad­vi­sor”, r = 0.9751; Irish elec­tric­ity prices/“Stan­ford web­mail”, r = 0.83; “heart at­tack”/“pink lace dress”, r = 0.88; US states’ /“booty mod­els”, r = 0.92; US states’ fam­ily ties/“how to swim”; /“Is Lil’ Wayne gay?”, r = 0.89; /“prn­hub”, r = 0.9784; “ac­ci­dent”/“itchy bumps”, r = 0.87; “mi­graine headaches”/“sci­ences”, r = 0.77; “Ir­ri­ta­ble Bowel Syn­drome”/“font down­load”, r = 0.94; interest-rate-index/“pill iden­ti­fi­ca­tion”, r = 0.98; “ad­ver­tis­ing”/“med­ical re­search”, r = 0.99; Barack Obama 2012 vote-share/“Top Chef”, r = 0.88; “los­ing weight”/“houses for rent”, r = 0.97; “Bieber”/tonsillitis, r = 0.95; “pa­ter­nity test”/“food for dogs”, r = 0.83; “breast en­large­ment”/“re­verse tele­phone search”, r = 0.95; “the­ory of evo­lu­tion” / “the Sume­ri­ans” or “Hec­tor of Troy” or “Jim Crow laws”; “gw­ern”/“Danny Brown lyrics”, r = 0.92; “weed”/“new Fam­ily Guy episodes”, r = 0.8; a draw­ing of a bell curve matches “My­Space” while a pe­nis matches “STD symp­toms in men” r = 0.95, not to men­tion Kurt Von­negut sto­ries).

And on less sec­u­lar themes, do churches cause obe­sity & do Welsh rugby vic­to­ries pre­dict pa­pal deaths?

Financial data-mining offers some fun examples; there’s the which worked well for several decades; and it’s not very elegant, but a 3-variable model (Bangladeshi butter, American cheese, joint sheep population) reaches R² = 0.99 on 20 years of the S&P 500.
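
How easily such correlations arise can be illustrated with a small simulation. This is a sketch under assumptions of my own: the “variables” are independent Gaussian random walks (a stand-in for trending time series such as annual statistics), 200 of them with 20 points each, and the seed, the sizes, and the 0.9 cutoff are arbitrary choices rather than anything taken from the sources above. All names are hypothetical.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, length):
    """Cumulative sum of independent standard-Normal steps."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def dredge(n_series=200, length=20, threshold=0.9, seed=0):
    """Correlate every pair of unrelated series and count the 'strong' ones."""
    rng = random.Random(seed)
    series = [random_walk(rng, length) for _ in range(n_series)]
    abs_rs = [abs(pearson_r(a, b))
              for a, b in itertools.combinations(series, 2)]
    n_hits = sum(r > threshold for r in abs_rs)
    return len(abs_rs), n_hits, max(abs_rs)


n_pairs, n_hits, strongest = dredge()
print(f"{n_pairs} pairs compared, {n_hits} with |r| > 0.9, "
      f"strongest |r| = {strongest:.3f}")
```

Every series is generated independently, so any strong correlation the search turns up is spurious by construction; raising `n_series` only makes more of them appear.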

Animal models

On the gen­eral topic of an­i­mal model ex­ter­nal va­lid­ity & trans­la­tion to hu­mans, a num­ber of op-eds, re­views, and meta-analy­ses have been done; read­ing through some of the lit­er­a­ture up to March 2013, I would sum­ma­rize them as in­di­cat­ing that the an­i­mal re­search lit­er­a­ture in gen­eral is of con­sid­er­ably lower qual­ity than hu­man re­search, and that for those and in­trin­sic bi­o­log­i­cal rea­sons, the prob­a­bil­ity of mean­ing­ful trans­fer from an­i­mal to hu­man can be as­tound­ingly low, far be­low 50% and in some cat­e­gories of re­sults, 0%.

The primary reasons identified for this poor performance are generally: small samples (much smaller than the already underpowered norms in human research), lack of blinding in taking measurements, pseudo-replication due to animals being correlated by genetic relatedness/living in same cage/same room/same lab, extensive non-normality in data¹⁴, large differences between labs due to local differences in reagents/procedures/personnel illustrating the importance of “tacit knowledge”, publication bias (small cheap samples + little perceived ethical need to publish + no preregistration norms), unnatural & unnaturally easy lab environments (more naturalistic environments both offer more realistic measurements & challenge animals), large genetic differences due to inbreeding/engineering/drift of lab strains mean the same treatment can produce dramatically different results in different strains (or sexes) of the same species, different species can have different responses, and none of them may be like humans in the relevant biological way in the first place.

So it is no won­der that “we can cure can­cer in mice but not peo­ple” and al­most all amaz­ing break­throughs in an­i­mals never make it to hu­man prac­tice; med­i­cine & bi­ol­ogy are diffi­cult.

The bib­li­og­ra­phy:

  1. Publication bias can come in many forms, and seems to be severe. For example, the 2008 version of a Cochrane review finds “Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were more frequently published than not ‘positive’ results.”↩︎

  2. For a sec­ond, shorter take on the im­pli­ca­tions of low prior prob­a­bil­i­ties & low pow­er: “Is the Replic­a­bil­ity Cri­sis Overblown? Three Ar­gu­ments Ex­am­ined”, Pash­ler & Har­ris 2012:

    So what is the truth of the matter? To put it simply, adopting an alpha level of, say, 5% means that about 5% of the time when researchers test a null hypothesis that is true (i.e., when they look for a difference that does not exist), they will end up with a statistically significant difference (a Type 1 error or false positive). Whereas some have argued that 5% would be too many mistakes to tolerate, it certainly would not constitute a flood of error. So what is the problem?

    Un­for­tu­nate­ly, the prob­lem is that the al­pha level does not pro­vide even a rough es­ti­mate, much less a true up­per bound, on the like­li­hood that any given pos­i­tive find­ing ap­pear­ing in a sci­en­tific lit­er­a­ture will be er­ro­neous. To es­ti­mate what the lit­er­a­ture-wide false pos­i­tive like­li­hood is, sev­eral ad­di­tional val­ues, which can only be guessed at, need to be spec­i­fied. We be­gin by con­sid­er­ing some highly sim­pli­fied sce­nar­ios. Al­though ar­ti­fi­cial, these have enough plau­si­bil­ity to pro­vide some eye­-open­ing con­clu­sions.

    For the following example, let us suppose that 10% of the effects that researchers look for actually exist, which will be referred to here as the prior probability of an effect (i.e., the null hypothesis is true 90% of the time). Given an alpha of 5%, Type 1 errors will occur in 4.5% of the studies performed (90% × 5%). If one assumes that studies all have a power of, say, 80% to detect those effects that do exist, correct rejections of the null hypothesis will occur 8% of the time (80% × 10%). If one further imagines that all positive results are published, then this would mean that the probability any given published positive result is erroneous would be equal to the proportion of false positives divided by the sum of the proportion of false positives plus the proportion of correct rejections. Given the proportions specified above, then, we see that more than one third of published positive findings would be false positives [4.5% / (4.5% + 8%) = 36%]. In this example, the errors occur at a rate approximately seven times the nominal alpha level (row 1 of Table 1).

    Table 1 shows a few more hypothetical examples of how the frequency of false positives in the literature would depend upon the assumed probability of null hypothesis being false and the statistical power. An 80% power likely exceeds any realistic assumptions about psychology studies in general. For example, Bakker, van Dijk, and Wicherts (2012, this issue) estimate .35 as a typical power level in the psychological literature. If one modifies the previous example to assume a more plausible power level of 35%, the likelihood of positive results being false rises to 56% (second row of the table). John Ioannidis (2005b) did pioneering work to analyze (much more carefully and realistically than we do here) the proportion of results that are likely to be false, and he concluded that it could very easily be a majority of all reported effects.

    Table 1. Proportion of Positive Results That Are False Given Assumptions About Prior Probability of an Effect and Power.

    | Prior probability of effect | Power | Proportion of studies yielding true positives | Proportion of studies yielding false positives | Proportion of total positive results (false+positive) which are false |
    |---|---|---|---|---|
    | 10% | 80% | 10% × 80% = 8% | (100% − 10%) × 5% = 4.5% | 4.5% / (4.5% + 8%) = 36% |
    | 10% | 35% | = 3.5% | = 4.5% | 4.5% / (4.5% + 3.5%) = 56.25% |
    | 50% | 35% | = 17.5% | (100% − 50%) × 5% = 2.5% | 2.5% / (2.5% + 17.5%) = 12.5% |
    | 75% | 35% | = 26.3% | (100% − 75%) × 5% = 1.6% | 1.6% / (1.6% + 26.3%) = 5.73% |

  3. So for ex­am­ple, if we imag­ined that a Jaeggi effect size of 0.8 were com­pletely borne out by a meta-analy­sis of many stud­ies and turned in a point es­ti­mate of d = 0.8; this data would im­ply that the strength of the n-back effect was ~1 stan­dard de­vi­a­tion above the av­er­age effect (of things which get stud­ied enough to be meta-an­a­lyz­able & have pub­lished meta-analy­ses etc) or to put it an­other way, that n-back was stronger than ~84% of all re­li­able well-sub­stan­ti­ated effects that psychology/education had dis­cov­ered as of 1992.↩︎

  4. We can in­fer em­pir­i­cal pri­ors from field­-wide col­lec­tions of effect sizes, in par­tic­u­lar, highly re­li­able meta-an­a­lytic effect sizes. For ex­am­ple, Lipsey & Wil­son 1993 which finds for var­i­ous kinds of ther­apy a mean effect of d = 0.5 based on >300 meta-analy­ses; or bet­ter yet, “One Hun­dred Years of So­cial Psy­chol­ogy Quan­ti­ta­tively De­scribed”, Bond et al 2003:

    This ar­ti­cle com­piles re­sults from a cen­tury of so­cial psy­cho­log­i­cal re­search, more than 25,000 stud­ies of 8 mil­lion peo­ple. A large num­ber of so­cial psy­cho­log­i­cal con­clu­sions are listed along­side meta-an­a­lytic in­for­ma­tion about the mag­ni­tude and vari­abil­ity of the cor­re­spond­ing effects. Ref­er­ences to 322 meta-analy­ses of so­cial psy­cho­log­i­cal phe­nom­ena are pre­sent­ed, as well as sta­tis­ti­cal effec­t-size sum­maries. Analy­ses re­veal that so­cial psy­cho­log­i­cal effects typ­i­cally yield a value of r equal to .21 and that, in the typ­i­cal re­search lit­er­a­ture, effects vary from study to study in ways that pro­duce a stan­dard de­vi­a­tion in r of .15. Us­es, lim­i­ta­tions, and im­pli­ca­tions of this large-s­cale com­pi­la­tion are not­ed.

    Only 5% of the effect sizes were greater than .50; only 34% yielded an r of .30 or more; for example, Jaeggi 2008’s 15-day group racked up an IQ increase of d = 1.53 which converts to an r of 0.61 and is 2.6 standard deviations above the overall mean, implying that the DNB effect is greater than ~99% of previous known effects in psychology! (Schönbrodt & Perugini 2013 observe that their sampling simulations imply that, given Bond’s mean effect of r = .21, a psychology study would require n = 238 for reasonable accuracy in estimating effects; most studies are far smaller.)↩︎

  5. One might be aware that the writer of that essay, Jonah Lehrer, was fired after making up materials for one of his books, and wonder if this work can be trusted; I believe it can as the New Yorker is famous for rigorous fact-checking (and no one has cast doubt on this article), Lehrer’s scandals involved his books, I have not found any questionable claims in the article besides Lehrer’s belief that known issues like publication bias are insufficient to explain the decline effect (which reasonable men may differ on), and Virginia Hughes ran the finished article against 7 people quoted in it like Ioannidis without any disputing facts/quotes & several somewhat praising it (see also Andrew Gelman).↩︎

  6. If I am understanding this right, Jaynes’s point here is that the random error shrinks towards zero as N increases, but this error is added onto the “common systematic error” S, so the total error approaches S no matter how many observations you make and this can force the total error up as well as down (variability, in this case, actually being helpful for once). So for example, with N = 10, $\frac{1}{3} + \frac{1}{\sqrt{10}} = 0.65$; with N = 100, it’s 0.43; with N = 1,000,000 it’s 0.334; and with N = 1,000,000,000 it equals 0.333365 etc, never going below the original systematic error of 1⁄3. That is, after 10 observations, the portion of error due to sampling error is less than that due to the systematic error, so one has hit severely diminishing returns in the value of any additional (biased) data, and to meaningfully improve the estimate one must obtain unbiased data. This leads to the unfortunate consequence that the likely error of N = 10 is 0.017 < x < 0.64956 while for N = 1,000,000 it is the similar range 0.017 < x < 0.33433, so it is possible that the estimate could be exactly as good (or bad) for the tiny sample as compared with the enormous sample, since neither can do better than 0.017!↩︎

  7. Pos­si­bly this is what Lord Ruther­ford meant when he said, “If your ex­per­i­ment needs sta­tis­tics you ought to have done a bet­ter ex­per­i­ment”.↩︎

  8. Neglecting the finite-population correction, the standard deviation of the mean sampling error is $\sigma(\tilde{\epsilon}) = \sqrt{p(1-p)/n}$, and this quantity is largest when p = 0.5. The number of ballots returned was 2,376,523, and with a sample of this size the largest possible value of $\sigma(\tilde{\epsilon})$ is $\sqrt{(0.5 \times 0.5)/2{,}376{,}523} = 0.000324$, or 0.0324 percentage point, so that an error of .2 percentage point is .2/.0324 = 6.17 times the standard deviation. The total area in the two tails of the Normal distribution below u = −6.17 and above u = +6.17 is .0000000007.↩︎

  9. Over 10 mil­lion bal­lots were sent out. Of the 2,376,523 bal­lots which were filled in and re­turned, 1,293,669 were for Lan­don, 972,897 for Roo­sevelt, and the re­main­der for other can­di­dates. The ac­tual vote was 16,679,583 for Lan­don and 27,476,673 for Roo­sevelt out of a to­tal of 45,647,117.↩︎

  10. Readers curious about modern election forecasting’s systematic vs random error should see Shirani-Mehr et al 2018, “Disentangling Bias and Variance in Election Polls”: the systematic error turns out to be almost identical in size to the random error, ie. half the total error. Hence, anomalies like Donald Trump or Brexit are not particularly anomalous at all. –Editor.↩︎

  11. John­son, in­ter­est­ing­ly, like Bouchard, was in­flu­enced by (and also ).↩︎

  12. I should mention this one is not quite as silly as it sounds, as there is experimental evidence for cocoa improving cognitive function.↩︎

  13. The same authors offer up a number of country-level correlations such as “Linguistic Diversity/Traffic accidents”, alcohol consumption/morphological complexity, and acacia trees vs tonality, which feed into their paper “Constructing knowledge: nomothetic approaches to language evolution” on the dangers of naive approaches to cross-country comparisons due to the high intercorrelation of cultural traits. More sophisticated approaches might be better; they derive a fairly plausible-looking graph of the relationships between variables.↩︎

  14. Lots of data is not ex­actly nor­mal, but, par­tic­u­larly in hu­man stud­ies, this is not a big deal be­cause the n are often large enough, eg n > 20, that the as­ymp­tot­ics have started to work & model mis­spec­i­fi­ca­tion does­n’t pro­duce too large a false pos­i­tive rate in­fla­tion or mis­-es­ti­ma­tion. Un­for­tu­nate­ly, in an­i­mal re­search, it’s per­fectly typ­i­cal to have sam­ple sizes more like n = 5, which in an ide­al­ized power analy­sis of a nor­mal­ly-dis­trib­uted vari­able might be fine be­cause one is (hope­ful­ly) ex­ploit­ing the free­dom of an­i­mal mod­els to get a large effect size / pre­cise mea­sure­ments—ex­cept that with n = 5 the data won’t be even close to ap­prox­i­mately nor­mal or fit­ting other model as­sump­tions, and a sin­gle bi­ased or se­lected or out­lier dat­a­point can mess it up fur­ther.↩︎
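
Several of the notes above rest on short calculations that can be re-derived in a few lines. The Python sketch below is only a check on the arithmetic, with its assumptions flagged: `false_positive_share` follows the formula spelled out in note 2 with alpha fixed at 5%; `d_to_r` uses the standard conversion r = d/√(d² + 4), which assumes two equal-sized groups (note 4 does not say which conversion was used); and `total_error` adds a 1/√N random term to a systematic error of 1⁄3, a form inferred from the figures quoted in note 6. All function names are hypothetical.

```python
import math


def false_positive_share(prior, power, alpha=0.05):
    """Share of statistically significant results that are false positives."""
    false_pos = (1 - prior) * alpha   # true nulls that reach significance
    true_pos = prior * power          # real effects that are detected
    return false_pos / (false_pos + true_pos)


def d_to_r(d):
    """Convert Cohen's d to a correlation r, assuming equal group sizes."""
    return d / math.sqrt(d ** 2 + 4)


def total_error(n, systematic=1 / 3):
    """Systematic error plus a random term that shrinks as 1/sqrt(n)."""
    return systematic + 1 / math.sqrt(n)


# Note 2: the (prior, power) combinations from Table 1.
for prior, power in [(0.10, 0.80), (0.10, 0.35), (0.50, 0.35), (0.75, 0.35)]:
    share = false_positive_share(prior, power)
    print(f"prior {prior:.0%}, power {power:.0%}: {share:.2%} false")

# Note 4: Jaeggi 2008's d = 1.53 against a mean r of .21 with SD .15.
r = d_to_r(1.53)
z = (r - 0.21) / 0.15
print(f"d = 1.53 -> r = {r:.2f}, {z:.2f} SDs above the mean")

# Note 6: the total error never drops below the systematic 1/3.
for n in (10, 100, 10 ** 6, 10 ** 9):
    print(f"N = {n}: total error = {total_error(n):.6f}")
```

For the table’s last row, 25% × 5% is 1.25% rather than the printed 1.6%, so this calculation gives roughly 4.5% instead of 5.73%; the other three rows match.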