The Replication Crisis: Flaws in Mainstream Science

A 2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects: ‘the Replication Crisis’.
psychology, statistics, meta-analysis, sociology, causality
2010-10-27–2019-12-09 · finished · certainty: highly likely · importance: 8

Long-standing problems in standard scientific methodology have exploded as the “Replication Crisis”: the discovery that many results in fields as diverse as psychology, economics, medicine, biology, and sociology are in fact false or quantitatively highly inaccurately measured. I cover here a handful of the issues and publications on this large, important, and rapidly developing topic up to about 2013, at which point the Replication Crisis became too large a topic to cover more than cursorily. (Links are provided for post-2013 developments.)

The crisis is caused by methods & publishing procedures which interpret random noise as important results, far too small datasets, selective analysis by an analyst trying to reach expected/desired results, publication bias, poor implementation of existing best-practices, nontrivial levels of research fraud, software errors, philosophical beliefs among researchers that false positives are acceptable, neglect of known confounding like genetics, and skewed incentives (financial & professional) to publish ‘hot’ results.

Thus, any individual piece of research typically establishes little. Scientific validation comes not from small p-values, but from discovering a regular feature of the world which disinterested third parties can discover with straightforward research done independently on new data with new procedures—replication.

Mainstream science is flawed: seriously mistaken statistics combined with poor incentives have led to masses of misleading research. This problem is not exclusive to psychology—economics, certain genetics subfields (principally candidate-gene research), biomedical science, and biology in general are often on equally shaky ground.

NHST and Systematic Biases

Statistical background on p-value problems: Against null-hypothesis significance testing

The basic nature of statistical-significance, usually defined as p < 0.05, means we should expect something like >5% of studies or experiments to be bogus (optimistically), but that only considers “false positives”; reducing “false negatives” requires statistical power (weakened by small samples), and the two combine with the base rate of true underlying effects into a total error rate. Ioannidis 2005 points out that considering the usual p-values, the underpowered nature of many studies, the rarity of underlying effects, and a little bias, even large randomized trials may wind up with only an 85% chance of having yielded the truth. An analysis of reported p-values in medicine yields a lower bound of 17% false positives.
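The arithmetic behind this point can be sketched in a few lines. The parameter values below are illustrative assumptions (not drawn from any particular study): the post-study probability that a “significant” finding is real depends not just on the p < 0.05 threshold but on power and on the base rate of true effects.

```python
def positive_predictive_value(prior, power, alpha):
    """P(effect is real | result is significant), ignoring bias.

    prior: base rate of true effects among hypotheses tested
    power: P(significant | effect is real)
    alpha: significance threshold, P(significant | no effect)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# A well-powered trial testing a plausible hypothesis:
print(round(positive_predictive_value(prior=0.5, power=0.8, alpha=0.05), 2))  # 0.94

# An underpowered study in a field where few tested effects are real:
print(round(positive_predictive_value(prior=0.1, power=0.2, alpha=0.05), 2))  # 0.31
```

Adding any analyst bias (nudging borderline results over the threshold) only drags the second number further down.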

Open Science Collaboration 2015: “Figure 1: Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.”

Yet, there are too many positive results1 (psychiatry, neurobiology, biomedicine, biology, ecology & evolution, psychology12 3 4, economics, sociology, gene-disease correlations) given the generally low statistical power (and positive results correlate with per capita publishing rates & vary by field—apparently chance is kind to scientists who must publish a lot and recently!); then there come the inadvertent errors which might cause retraction, which is rare, but the true retraction rate may be 0.1–1% (“How many scientific papers should be retracted?”), is increasing, & seems to positively correlate with journal prestige metrics (modulo the confounding factor that famous papers/journals get more scrutiny), not that anyone pays any attention to such things; then there are basic statistical errors in >11% of papers (based on the high-quality papers in Nature and the British Medical Journal; “Incongruence between test statistics and P values in medical papers”, García-Berthou 2004) or 50% in neuroscience.

And only then can we get into replicating at all. See for example the article “Lies, Damned Lies, and Medical Science” on research showing 41% of the most cited medical research failed to be replicated—were wrong. For details, you can see Ioannidis’s papers2, or Begley’s failed attempts to replicate 47 of 53 articles in top cancer journals (leading to Booth’s “Begley’s Six Rules”; see also the Nature Biotechnology editorial & note that full details have not been published because the researchers of the original studies demanded secrecy from Begley’s team), or Kumar & Nash 2011’s “Health Care Myth Busters: Is There a High Degree of Scientific Certainty in Modern Medicine?”, who write ‘We could accurately say, “Half of what physicians do is wrong,” or “Less than 20% of what physicians do has solid research to support it.”’ Nutritional epidemiology is something of a fish in a barrel; after Ioannidis, is anyone surprised that when Young & Karr 2011 followed up on 52 correlations tested in 12 RCTs, 0⁄52 replicated and the RCTs found the opposite of 5?

Attempts to use animal models to infer anything about humans suffer from all the methodological problems previously mentioned, and add in interesting new forms of error such as mice simply being irrelevant to humans, leading to cases like ~150 clinical trials all failing—because the drugs worked in mice but humans have a completely different set of genetic reactions to inflammation.

‘Hot’ fields tend to be new fields, which brings problems of its own; see further discussion. (Failure to replicate in larger studies seems to be a hallmark of biological/medical research. Ioannidis performs the same trick with highly-cited biomarkers, finding less than half of the most-cited biomarkers were even statistically-significant in the larger studies. A similar attempt failed to replicate 12 of the more prominent gene-IQ correlations on a larger dataset.) As we know now, almost the entire candidate-gene literature—most things reported from 2000–2010 before large-scale GWASes started to be done (and completely failed to find the candidate-genes)—is nothing but false positives! The replication rates of candidate-genes for things like intelligence, personality, gene-environment interactions, psychiatric disorders—the whole schmeer—are literally ~0%. On the plus side, the parlous state of affairs means that there are some cheap heuristics for detecting unreliable papers—an unwillingness to share data, for example, correlates strongly with the original paper having errors in its statistics.

This epidemic of false positives is apparently deliberately and knowingly accepted by epidemiologists; Young’s 2008 “Everything is Dangerous” remarks that 80–90% of epidemiology’s claims do not replicate (eg. the NIH ran 20 randomized-controlled-trials of such claims, and only 1 replicated) and that lack of ‘multiple-comparison correction’ (either Bonferroni or Benjamini-Hochberg) is taught: “Rothman (1990) says no correction for multiple testing is necessary and Vandenbroucke, PLoS Med (2008) agrees” (see also Perneger 1998, who explicitly acknowledges the tradeoff: correcting increases type 2 errors, while not correcting increases type 1 errors). Multiple correction is necessary because its absence does, in fact, result in the overstatement of medical benefit (Godfrey 1985, Pocock et al 1987, Smith 1987). The average effect size in psychology/education is d = 0.53 (well below several effect sizes from n-back/IQ studies); when moving from laboratory to non-laboratory settings, meta-analytic findings replicate with a correlation of ~0.7, but for social psychology the replication correlation falls to ~0.5, with >14% of findings actually turning out to be the opposite (see Anderson et al 1999 and Mitchell 2012; for exaggeration due to non-blinding or poor randomization, Wood et al 2008). (Meta-analyses also give us a starting point for understanding how unusual medium or large effect sizes are4.) Psychology does have many challenges, but practitioners also handicap themselves; an older overview is the entertaining “What’s Wrong With Psychology, Anyway?”, which mentions the obvious point that statistics & experimental design are flexible enough to reach significance as desired.
In an interesting example of how methodological reforms are no panacea in the presence of continued perverse incentives, an earlier methodological improvement in psychology (reporting multiple experiments in a single publication as a check against results not being generalizable) has merely demonstrated widespread p-hacking or publication bias: given the low statistical power of each experiment, even if the underlying phenomenon were real, it would still be wildly improbable that all n experiments in a paper would turn up statistically-significant results, since power is usually extremely low (eg. “between 20–30%” in neuroscience). These problems are pervasive enough that I believe they entirely explain any “decline effects”5.
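The improbability argument above is a one-liner. Assuming (for illustration) independent experiments each run at a given power, the chance that every one of them reaches significance is just power raised to the number of experiments:

```python
def p_all_significant(power, n_experiments):
    """P(all n independent experiments are significant | the effect is real)."""
    return power ** n_experiments

# At the ~30% power typical of neuroscience, a 5-experiment paper where
# every study "worked" is a ~1-in-400 event even when the effect exists:
print(round(p_all_significant(0.3, 5), 5))  # 0.00243

# Even at a healthy 80% power, 5-for-5 is only about a 1-in-3 event:
print(round(p_all_significant(0.8, 5), 5))  # 0.32768
```

So journals full of papers reporting 4-for-4 and 5-for-5 significant experiments are, collectively, reporting something that should almost never happen, which is the tell for selective reporting.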

The failures to replicate “statistically significant” results have led one blogger to caustically remark (see also “Parapsychology: the control group for science” and “Using degrees of freedom to change the past for fun and profit”):

Parapsychology, the control group for science, would seem to be a thriving field with “statistically significant” results aplenty…Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter. With two-thirds of medical studies in prestigious journals failing to replicate, getting rid of the entire actual subject matter would shrink the field by only 33%.

Cosma Shalizi:

…Let me draw the moral [about publication bias]. Even if the community of inquiry is both too clueless to make any contact with reality and too honest to nudge borderline findings into significance, so long as they can keep coming up with new phenomena to look for, the mechanism of the file-drawer problem alone will guarantee a steady stream of new results. There is, so far as I know, no Journal of Evidence-Based Haruspicy filled, issue after issue, with methodologically-faultless papers reporting the ability of sheep’s livers to predict the winners of sumo championships, the outcome of speed dates, or real estate trends in selected suburbs of Chicago. But the difficulty can only be that the evidence-based haruspices aren’t trying hard enough, and some friendly rivalry with the plastromancers is called for. It’s true that none of these findings will last forever, but this constant overturning of old ideas by new discoveries is just part of what makes this such a dynamic time in the field of haruspicy. Many scholars will even tell you that their favorite part of being a haruspex is the frequency with which a new sacrifice overturns everything they thought they knew about reading the future from a sheep’s liver! We are very excited about the renewed interest on the part of policy-makers in the recommendations of the mantic arts…

And this is when there is enough information to replicate at all; open access to the data behind a paper is rare. The economics journal Journal of Money, Credit and Banking, which required researchers to provide the data & software which could replicate their statistical analyses, discovered that <10% of the submitted materials were adequate for repeating the paper (see “Lessons from the JMCB Archive”). In one cute economics example, replication failed because the dataset had been edited to make participants look better. Availability of data is often low and decays with article age, and many studies never get published regardless of whether publication is legally mandated.

Transcription errors in papers seem to be common (possibly due to constantly changing analyses & p-hacking?), and as software and large datasets become more inherent to research, the problem of replicability will worsen, because even mature commercial software libraries can disagree substantially in their computed results for the same mathematical specification (see also Anda et al 2009). Spreadsheets are especially bad, with error rates in the 88% range (“What we know about spreadsheet errors”, Panko 1998); spreadsheets are used in all areas of science, including biology and medicine (see “Error! What biomedical computing can learn from its mistakes”; famous examples of coding errors include Reinhart-Rogoff), not to mention regular business (eg. the London Whale).

Psychology is far from perfect either; look at the examples in The New Yorker’s “The Truth Wears Off” article (or some excerpts from that article). Computer scientist Peter Norvig has written a must-read essay on interpreting statistics, “Warning Signs in Experimental Design and Interpretation”; a number of his warning signs apply to many psychological studies. There may be incentive problems: a transplant researcher discovered the only way to publish in Nature his inability to replicate his earlier Nature paper was to officially retract it; another interesting example is when, after Daryl Bem got a paper published in the top journal JPSP demonstrating precognition, the journal refused to publish any replications (failed or successful) because… “‘We don’t want to be the Journal of Bem Replication’, he says, pointing out that other high-profile journals have similar policies of publishing only the best original research.” (Quoted in New Scientist.) One doesn’t need to be a genius to understand why psychologist Andrew D. Wilson might snarkily remark “…think about the message JPSP is sending to authors. That message is ‘we will publish your crazy story if it’s new, but not your sensible story if it’s merely a replication’.” (You get what you pay for.) In one large test of the most famous psychology results, 10 of 13 (77%) replicated. The replication rate is under 1⁄3 in studies touching on genetics. This despite the simple point that replications reduce the risk of publication bias and increase statistical power, so that a replicated result is far more trustworthy than the original finding. And the small samples of n-back studies and nootropic chemicals are especially problematic. Quoting from Sandberg & Bostrom 2006, “Converging Cognitive Enhancements”:

The reliability of research is also an issue. Many of the cognition-enhancing interventions show small effect sizes, which may necessitate very large epidemiological studies possibly exposing large groups to unforeseen risks.

Particularly troubling is the slowdown in drug discovery & medical technology during the 2000s, even as genetics in particular was expected to produce earth-shaking new treatments. One biotech venture capitalist writes:

The company spent $6M or so trying to validate a platform that didn’t exist. When they tried to directly repeat the academic founder’s data, it never worked. Upon re-examination of the lab notebooks, it was clear the founder’s lab had at the very least massaged the data and shaped it to fit their hypothesis. Essentially, they systematically ignored every piece of negative data. Sadly this “failure to repeat” happens more often than we’d like to believe. It has happened to us at Atlas [Venture] several times in the past decade…The unspoken rule is that at least 50% of the studies published even in top tier academic journals—Science, Nature, Cell, PNAS, etc…—can’t be repeated with the same conclusions by an industrial lab. In particular, key animal models often don’t reproduce. This 50% failure rate isn’t a data free assertion: it’s backed up by dozens of experienced R&D professionals who’ve participated in the (re)testing of academic findings. This is a huge problem for translational research and one that won’t go away until we address it head on.

Half the respondents to a survey at one cancer research center reported 1 or more incidents where they could not reproduce published research; two-thirds of those were never “able to explain or resolve their discrepant findings”, half had trouble publishing results contradicting previous publications, and two-thirds failed to publish contradictory results. An internal Bayer survey of 67 projects (commentary) found that “only in ~20–25% of the projects were the relevant published data completely in line with our in-house findings”, and as far as assessing the projects went:

…despite the low numbers, there was no apparent difference between the different research fields. Surprisingly, even publications in prestigious journals or from several independent groups did not ensure reproducibility. Indeed, our analysis revealed that the reproducibility of published data did not significantly correlate with journal impact factors, the number of publications on the respective target or the number of independent groups that authored the publications. Our findings are mirrored by ‘gut feelings’ expressed in personal communications with scientists from academia or other companies, as well as published observations. [apropos of above] An unspoken rule among early-stage venture capital firms that “at least 50% of published studies, even those in top-tier academic journals, can’t be repeated with the same conclusions by an industrial lab” has been recently reported (see Further information) and discussed4.

Physics has relatively small sins; see “Assessing uncertainty in physical constants” (Henrion & Fischhoff 1985); Robin Hanson’s summary:

Looking at 306 estimates for particle properties, 7% were outside of a 98% confidence interval (where only 2% should be). In seven other cases, each with 14 to 40 estimates, the fraction outside the 98% confidence interval ranged from 7% to 57%, with a median of 14%.

Nor is peer review or the citation system robust against even mild gaming. Scientists who win the Nobel Prize find their other work suddenly being heavily cited, suggesting either that the community badly failed in recognizing the work’s true value or that citers are now sucking up & attempting to look better. (A mathematician once told me that often, to boost a paper’s acceptance chance, they would add citations to papers by the journal’s editors—a practice that will surprise no one familiar with the use of citation metrics in tenure & grants.)

The former BMJ editor Richard Smith amusingly recounts his doubts about the merits of peer review as practiced, and physicist Michael Nielsen points out that peer review is historically rare (just one of Einstein’s 300 papers was peer reviewed; the famous Nature did not institute peer review until 1967), has been poorly studied & not shown to be effective, is nationally biased, erroneously rejects many historic discoveries (one study lists “34 Nobel Laureates whose awarded work was rejected by peer review”; Horrobin 1990 lists others), and catches only a small fraction of errors. And questionable research practices or outright fraud? Fanelli 2009:

A pooled weighted average of 1.97% (N = 7, 95% CI: 0.86–4.45) of scientists admitted to have fabricated, falsified or modified data or results at least once—a serious form of misconduct by any standard—and up to 33.7% admitted other questionable research practices. In surveys asking about the behaviour of colleagues, admission rates were 14.12% (N = 12, 95% CI: 9.91–19.72) for falsification, and up to 72% for other questionable research practices…When these factors were controlled for, misconduct was reported more frequently by medical/pharmacological researchers than others.

And John et al 2012:

We surveyed over 2,000 psychologists about their involvement in questionable research practices, using an anonymous elicitation format supplemented by incentives for honest reporting. The impact of incentives on admission rates was positive, and greater for practices that respondents judge to be less defensible. Using three different estimation methods, we find that the proportion of respondents that have engaged in these practices is surprisingly high relative to respondents’ own estimates of these proportions. Some questionable practices may constitute the prevailing research norm.

In short, the secret sauce of science is not ‘peer review’. It is replication!

Systemic Error Doesn’t Go Away

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

John Tukey, “The future of data analysis”, 1962

Why isn’t the solution as simple as eliminating datamining through methods like larger n or pre-registered analyses? Because once we have eliminated the random error in our analysis, we are still left with a (potentially arbitrarily large) systematic error, leaving us with a large total error.

None of these systematic problems go away with more data: they are systematic biases, and as such they force an upper bound on how accurate a corpus of studies can be even if there were thousands upon thousands of studies, because the total error in the results is made up of random error and systematic error; while random error shrinks as more studies are done, systematic error remains the same.

A thousand biased studies merely result in an extremely precise estimate of the wrong number.
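This can be seen directly in a toy simulation (the true value, bias, and noise level below are arbitrary assumptions): every “study” measures a true value of 0 with random noise, but all share a fixed bias of +1. More studies shrink the scatter, not the bias.

```python
import random

random.seed(0)
TRUE_VALUE, BIAS, NOISE_SD = 0.0, 1.0, 5.0

def meta_estimate(n_studies):
    """Average of n_studies noisy measurements that all share the same bias."""
    results = [TRUE_VALUE + BIAS + random.gauss(0, NOISE_SD)
               for _ in range(n_studies)]
    return sum(results) / len(results)

for n in (10, 1_000, 100_000):
    print(n, round(meta_estimate(n), 2))
# The estimate converges ever more precisely—toward 1.0, the biased value,
# never toward the true value of 0.0.
```

Pooling a hundred thousand such studies yields a beautifully tight confidence interval around the wrong answer.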

This is a point appreciated by statisticians and experimental physicists, but it doesn’t seem to be frequently discussed. Andrew Gelman has a fun demonstration of selection bias involving candy; or, from Chapter 8, “Sufficiency, Ancillarity, And All That”, of Probability Theory: The Logic of Science by E. T. Jaynes:

The classical example showing the error of this kind of reasoning is the fable about the height of the Emperor of China. Supposing that each person in China surely knows the height of the Emperor to an accuracy of at least ±1 meter, if there are N = 1,000,000,000 inhabitants, then it seems that we could determine his height to an accuracy at least as good as

±1⁄√N meters ≈ ±0.03 mm

merely by asking each person’s opinion and averaging the results.

The absurdity of the conclusion tells us rather forcefully that the √N rule is not always valid, even when the separate data values are causally independent; it requires them to be logically independent. In this case, we know that the vast majority of the inhabitants of China have never seen the Emperor; yet they have been discussing the Emperor among themselves and some kind of mental image of him has evolved as folklore. Then knowledge of the answer given by one does tell us something about the answer likely to be given by another, so they are not logically independent. Indeed, folklore has almost surely generated a systematic error, which survives the averaging; thus the above estimate would tell us something about the folklore, but almost nothing about the Emperor.

We could put it roughly as follows:

error in estimate = √(S² + R²⁄N) → S, as N → ∞ (8-50)

where S is the common systematic error in each datum, and R is the ‘random’ error in the individual data values. Uninformed opinions, even though they may agree well among themselves, are nearly worthless as evidence. Therefore sound scientific inference demands that, when this is a possibility, we use a form of probability theory (i.e. a probabilistic model) which is sophisticated enough to detect this situation and make allowances for it.

As a start on this, equation (8-50) gives us a crude but useful rule of thumb; it shows that, unless we know that the systematic error is less than about 1⁄3 of the random error, we cannot be sure that the average of a million data values is any more accurate or reliable than the average of ten6. As Poincaré put it: “The physicist is persuaded that one good measurement is worth many bad ones.” This has been well recognized by experimental physicists for generations; but warnings about it are conspicuously missing in the “soft” sciences whose practitioners are educated from those textbooks.

Or pg1019–1020, Chapter 10, “Physics of ‘Random Experiments’”:

…Nevertheless, the existence of such a strong connection is clearly only an ideal limiting case unlikely to be realized in any real application. For this reason, the laws of large numbers and limit theorems of probability theory can be grossly misleading to a scientist or engineer who naively supposes them to be experimental facts, and tries to interpret them literally in his problems. Here are two simple examples:

  1. Suppose there is some random experiment in which you assign a probability p for some particular outcome A. It is important to estimate accurately the fraction f of times A will be true in the next million trials. If you try to use the laws of large numbers, it will tell you various things about f; for example, that it is quite likely to differ from p by less than a tenth of one percent, and enormously unlikely to differ from p by more than one percent. But now, imagine that in the first hundred trials, the observed frequency of A turned out to be entirely different from p. Would this lead you to suspect that something was wrong, and revise your probability assignment for the 101’st trial? If it would, then your state of knowledge is different from that required for the validity of the law of large numbers. You are not sure of the independence of different trials, and/or you are not sure of the correctness of the numerical value of p. Your prediction of f for a million trials is probably no more reliable than for a hundred.
  2. The common sense of a good experimental scientist tells him the same thing without any probability theory. Suppose someone is measuring the velocity of light. After making allowances for the known systematic errors, he could calculate a probability distribution for the various other errors, based on the noise level in his electronics, vibration amplitudes, etc. At this point, a naive application of the law of large numbers might lead him to think that he can add three significant figures to his measurement merely by repeating it a million times and averaging the results. But, of course, what he would actually do is to repeat some unknown systematic error a million times. It is idle to repeat a physical measurement an enormous number of times in the hope that “good statistics” will average out your errors, because we cannot know the full systematic error. This is the old “Emperor of China” fallacy…

Indeed, unless we know that all sources of systematic error—recognized or unrecognized—contribute less than about one-third the total error, we cannot be sure that the average of a million measurements is any more reliable than the average of ten. Our time is much better spent in designing a new experiment which will give a lower probable error per trial. As Poincaré put it, “The physicist is persuaded that one good measurement is worth many bad ones.”7 In other words, the common sense of a scientist tells him that the probabilities he assigns to various errors do not have a strong connection with frequencies, and that methods of inference which presuppose such a connection could be disastrously misleading in his problems.
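Jaynes’s rule of thumb can be checked numerically straight from equation (8-50), total error = √(S² + R²⁄N). With the systematic error S at exactly 1⁄3 of the random error R, averaging a million measurements buys surprisingly little over averaging ten, because the total error can never fall below the bias floor S:

```python
from math import sqrt

def total_error(S, R, N):
    """Jaynes eq. (8-50): total error of the mean of N measurements,
    each with common systematic error S and random error R."""
    return sqrt(S**2 + R**2 / N)

R = 1.0
S = R / 3  # systematic error at the 1/3 threshold

print(round(total_error(S, R, 10), 4))         # 0.4595
print(round(total_error(S, R, 1_000_000), 4))  # 0.3333, i.e. the floor S
```

A 100,000-fold increase in data shrinks the total error only from ~0.46 to the irreducible ~0.33; any larger S makes the million-measurement average no better than the ten-measurement one.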

Schlaifer made the same point much earlier in Probability and Statistics for Business Decisions: an Introduction to Managerial Economics Under Uncertainty (1959), pg488–489:

31.4.3 Bias and Sample Size

In Section 31.2.6 we used a hypothetical example to illustrate the implications of the fact that the variance of the mean of a sample in which bias is suspected is

σ²(bias) + σ²⁄n

so that only the second term decreases as the sample size increases and the total can never be less than the fixed value of the first term. To emphasize the importance of this point by a real example we recall the most famous sampling fiasco in history, the 1936 Literary Digest presidential poll. Over 2 million registered voters filled in and returned the straw ballots sent out by the Digest, so that there was less than one chance in 1 billion of a sampling error as large as 2⁄10 of one percentage point8, and yet the poll was actually off by nearly 18 percentage points: it predicted that 54.5 per cent of the popular vote would go to Landon, who in fact received only 36.7 per cent.9 10

Since sampling error cannot account for any appreciable part of the 18-point discrepancy, it is virtually all actual bias. A part of this total bias may be measurement bias due to the fact that not all people voted as they said they would vote; the implications of this possibility were discussed in Section 31.3. The larger part of the total bias, however, was almost certainly selection bias. The straw ballots were mailed to people whose names were selected from lists of owners of telephones and automobiles and the subpopulation which was effectively sampled was even more restricted than this: it consisted only of those owners of telephones and automobiles who were willing to fill out and return a straw ballot. The true mean of this subpopulation proved to be entirely different from the true mean of the population of all United States citizens who voted in 1936.

It is true that there was no evidence at the time this poll was planned which would have suggested that the bias would be as great as the 18 percentage points actually realized, but experience with previous polls had shown biases which would have led any sensible person to assign to the bias a distribution with standard deviation equal to at least 1 percentage point. A sample of only 23,760 returned ballots, 1⁄100th the size actually used, would have given a sampling standard deviation of only 1⁄3 percentage point, so that the standard deviation of x̄ would have been

√(1² + (1⁄3)²) ≈ 1.05

percentage points. Using a sample 100 times this large reduced the sampling term from 1⁄3 point to virtually zero, but it could not affect the bias term and thus on the most favorable assumption could reduce the total only from 1.05 points to 1 point. To collect and tabulate over 2 million additional ballots when this was the greatest gain that could be hoped for was obviously ridiculous before the fact and not just in the light of hindsight.
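Schlaifer’s arithmetic can be reproduced in a few lines. This is a sketch under two assumptions of mine: the full sample is taken as 100 × 23,760 = 2,376,000 ballots, and p = 0.5 is used as the worst case for the sampling standard error of a proportion.

```python
from math import sqrt

n_small = 23_760
n_big = n_small * 100          # assumed full Digest sample, 100x the subsample
p = 0.5                        # worst-case proportion for sampling error
bias_sd = 1.0                  # Schlaifer's prior SD of bias, percentage points

# Sampling standard errors, in percentage points:
se_small = sqrt(p * (1 - p) / n_small) * 100
se_big = sqrt(p * (1 - p) / n_big) * 100

print(round(se_small, 2))  # ~1/3 point, as Schlaifer says
print(round(se_big, 3))    # ~0.032 points: a 0.2-point error is ~6 sigma away

# Total SD combining bias + sampling error:
print(round(sqrt(bias_sd**2 + se_small**2), 2))  # ~1.05 points with n = 23,760
print(round(sqrt(bias_sd**2 + se_big**2), 2))    # ~1.0: 100x the data bought ~0.05
```

The huge sample crushes the second term to nothing, yet the total barely moves, which is exactly why collecting the extra 2 million ballots was “obviously ridiculous before the fact.”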

What’s particularly sad is when people read something like this and decide to rely on anecdotes, personal experiments, and alternative medicine, where there are even more systematic errors and no way of reducing random error at all! Science may be the lens that sees its own flaws, but if other epistemologies do not boast such long detailed self-critiques, it’s not because they are flawless… It’s like that old quote: some people, when faced with the problem of mainstream medicine & epidemiology having serious methodological weaknesses, say “I know, I’ll turn to non-mainstream medicine & epidemiology. After all, if only some medicine is based on real scientific method and outperforms placebos, why bother?” (Now they have two problems.) Or perhaps Isaac Asimov: “John, when people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together.”

See Also


Further reading

Additional links, largely curated from my :

Pygmalion Effect


Some examples of how ‘datamining’ or ‘data dredging’ can manufacture correlations on demand from large datasets by comparing enough variables:

Rates of autism diagnoses in children correlate with age—or should we blame organic food sales?; height & vocabulary or foot size & math skills may correlate strongly (in children); national chocolate consumption correlates with Nobel prizes12, as do borrowing from commercial banks & buying luxury cars & serial-killers/mass-murderers/traffic-fatalities13; moderate alcohol consumption predicts increased lifespan and earnings; the role of storks in delivering babies may have been underestimated; children and people with high self-esteem have higher grades & lower crime rates etc, so “we all know in our gut that it’s true” that raising people’s self-esteem “empowers us to live responsibly and that inoculates us against the lures of crime, violence, substance abuse, teen pregnancy, child abuse, chronic welfare dependency and educational failure”—but high self-esteem is caused by high grades & success, boosting self-esteem has no experimental benefits, and may backfire?

Those last can be generated ad nauseam: Shaun Gallagher’s Correlated (also a book) surveys users & compares against all previous surveys, with 1k+ correlations.

Tyler Vigen’s “spurious correlations” catalogues 35k+ correlations, many with r > 0.9, based primarily on US Census & CDC data.

Google Correlate “finds Google search query patterns which correspond with real-world trends” based on geography or user-provided data, which offers endless fun (“Facebook”/“tapeworm in humans”, r = 0.8721; “Superfreakonomic”/“Windows 7 advisor”, r = 0.9751; Irish electricity prices/“Stanford webmail”, r = 0.83; “heart attack”/“pink lace dress”, r = 0.88; US states’ /“booty models”, r = 0.92; US states’ family ties/“how to swim”; /“Is Lil’ Wayne gay?”, r = 0.89; /“prnhub”, r = 0.9784; “accident”/“itchy bumps”, r = 0.87; “migraine headaches”/“sciences”, r = 0.77; “Irritable Bowel Syndrome”/“font download”, r = 0.94; interest-rate-index/“pill identification”, r = 0.98; “advertising”/“medical research”, r = 0.99; Barack Obama 2012 vote-share/“Top Chef”, r = 0.88; “losing weight”/“houses for rent”, r = 0.97; “Bieber”/tonsillitis, r = 0.95; “paternity test”/“food for dogs”, r = 0.83; “breast enlargement”/“reverse telephone search”, r = 0.95; “theory of evolution”/“the Sumerians” or “Hector of Troy” or “Jim Crow laws”; “gwern”/“Danny Brown lyrics”, r = 0.92; “weed”/“new Family Guy episodes”, r = 0.8; a drawing of a bell curve matches “MySpace” while a penis matches “STD symptoms in men”, r = 0.95, not to mention Kurt Vonnegut stories).
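How easy is it to dredge up such correlations? A toy simulation (the “50 indicators over 10 years” setup is hypothetical, chosen only for illustration): generate completely unrelated random variables, compute every pairwise correlation, and keep the best one.

```python
import itertools, math, random, statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)
# 50 unrelated "indicators", each observed for 10 "years":
data = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(50)]
best = max(abs(pearson_r(a, b)) for a, b in itertools.combinations(data, 2))
print(f"best of {50 * 49 // 2} correlations: |r| = {best:.2f}")
```

With 1,225 pairs and only 10 observations per pair, correlations of |r| > 0.8 routinely appear by chance alone.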

And on less secular themes, do churches cause obesity & do Welsh rugby victories predict papal deaths?

Financial data-mining offers some fun examples; there’s the which worked well for several decades; and it’s not very elegant, but a 3-variable model (Bangladeshi butter, American cheese, joint sheep population) reaches R² = 0.99 on 20 years of the S&P 500.
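The butter/cheese/sheep model works partly because trending time series correlate spuriously: two independent random walks typically show a large |r| even though neither has anything to do with the other. A quick Monte Carlo sketch (series length, trial count, and seed are all arbitrary):

```python
import math, random, statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def random_walk(T, rng):
    """Cumulative sum of Gaussian steps: a nonstationary 'price' series."""
    x, path = 0.0, []
    for _ in range(T):
        x += rng.gauss(0, 1)
        path.append(x)
    return path

rng = random.Random(42)
T, trials = 100, 200
mean_walk = statistics.fmean(
    abs(pearson_r(random_walk(T, rng), random_walk(T, rng)))
    for _ in range(trials))
mean_noise = statistics.fmean(
    abs(pearson_r([rng.gauss(0, 1) for _ in range(T)],
                  [rng.gauss(0, 1) for _ in range(T)]))
    for _ in range(trials))
print(f"mean |r|: random walks {mean_walk:.2f} vs white noise {mean_noise:.2f}")
```

Independent white-noise series of the same length show mean |r| near 0.08; independent random walks show several times that, which is why regressing one trending series on a few others can reach a spectacular in-sample R².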

Animal models

On the general topic of animal model external validity & translation to humans, a number of op-eds, reviews, and meta-analyses have been done; reading through some of the literature up to March 2013, I would summarize them as indicating that the animal research literature in general is of considerably lower quality than human research, and that for those reasons and for intrinsic biological reasons, the probability of meaningful transfer from animal to human can be astoundingly low, far below 50% and, in some categories of results, 0%.

The primary reasons identified for this poor performance are generally: small samples (much smaller than the already underpowered norms in human research); lack of blinding in taking measurements; pseudo-replication due to animals being correlated by genetic relatedness/living in the same cage/same room/same lab; extensive non-normality in data14; large differences between labs due to local differences in reagents/procedures/personnel, illustrating the importance of “tacit knowledge”; publication bias (small cheap samples + little perceived ethical need to publish + no preregistration norms); unnatural & unnaturally easy lab environments (more naturalistic environments both offer more realistic measurements & challenge animals); large genetic differences due to inbreeding/engineering/drift of lab strains, meaning the same treatment can produce dramatically different results in different strains (or sexes) of the same species; different species can have different responses; and none of them may be like humans in the relevant biological way in the first place.
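The cost of that pseudo-replication can be quantified with the standard Kish design-effect formula; the mouse numbers below are hypothetical, purely for illustration:

```python
def effective_n(n_total, cluster_size, icc):
    """Kish design effect: cluster-mates correlated at `icc` carry less
    information than the same number of independent subjects."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

# Hypothetical: 40 mice housed 5 per cage, cage-mates correlated at ICC = 0.5.
# The nominal n = 40 carries the information of only ~13 independent mice:
print(round(effective_n(40, 5, 0.5), 1))  # → 13.3
```

An analysis treating the 40 mice as independent will therefore understate its standard errors and overstate its significance.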

So it is no wonder that “we can cure cancer in mice but not people” and almost all amazing breakthroughs in animals never make it to human practice; medicine & biology are difficult.

The bibliography:

  1. Publication bias can come in many forms, and seems to be severe. For example, the 2008 version of a Cochrane review finds “Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were more frequently published than not ‘positive’ results.”↩︎
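The inflation such selective publication produces is easy to simulate. A hedged sketch (the true effect, per-group sample size, and publish-only-if-significant rule are invented for illustration, using a z-test with known unit variance for simplicity):

```python
import math, random, statistics

rng = random.Random(1)
true_d, n = 0.2, 20                  # hypothetical true effect & per-group n
se = math.sqrt(2 / n)                # SE of a difference in group means
published = []
for _ in range(10_000):
    d_hat = rng.gauss(true_d, se)    # one study's estimated effect
    if d_hat / se > 1.96:            # only 'significant positive' results published
        published.append(d_hat)

mean_published = statistics.fmean(published)
print(f"true d = {true_d}; mean published d = {mean_published:.2f} "
      f"({len(published)} of 10,000 studies published)")
```

Under these assumptions, every published estimate must exceed ~0.62 just to clear the significance filter, so the published literature overstates the true effect several-fold.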

  2. For a second, shorter take on the implications of low prior probabilities & low power: “Is the Replicability Crisis Overblown? Three Arguments Examined”, Pashler & Harris 2012:

    So what is the truth of the matter? To put it simply, adopting an alpha level of, say, 5% means that about 5% of the time when researchers test a null hypothesis that is true (i.e., when they look for a difference that does not exist), they will end up with a statistically significant difference (a Type 1 error or false positive). Whereas some have argued that 5% would be too many mistakes to tolerate, it certainly would not constitute a flood of error. So what is the problem?

    Unfortunately, the problem is that the alpha level does not provide even a rough estimate, much less a true upper bound, on the likelihood that any given positive finding appearing in a scientific literature will be erroneous. To estimate what the literature-wide false positive likelihood is, several additional values, which can only be guessed at, need to be specified. We begin by considering some highly simplified scenarios. Although artificial, these have enough plausibility to provide some eye-opening conclusions.

    For the following example, let us suppose that 10% of the effects that researchers look for actually exist, which will be referred to here as the prior probability of an effect (i.e., the null hypothesis is true 90% of the time). Given an alpha of 5%, Type 1 errors will occur in 4.5% of the studies performed (90% × 5%). If one assumes that studies all have a power of, say, 80% to detect those effects that do exist, correct rejections of the null hypothesis will occur 8% of the time (80% × 10%). If one further imagines that all positive results are published, then this would mean that the probability any given published positive result is erroneous would be equal to the proportion of false positives divided by the sum of the proportion of false positives plus the proportion of correct rejections. Given the proportions specified above, then, we see that more than one third of published positive findings would be false positives [4.5% / (4.5% + 8%) = 36%]. In this example, the errors occur at a rate approximately seven times the nominal alpha level (row 1 of Table 1).

    Table 1 shows a few more hypothetical examples of how the frequency of false positives in the literature would depend upon the assumed probability of the null hypothesis being false and the statistical power. An 80% power likely exceeds any realistic assumptions about psychology studies in general. For example, Bakker, van Dijk, and Wicherts (2012, this issue) estimate .35 as a typical power level in the psychological literature. If one modifies the previous example to assume a more plausible power level of 35%, the likelihood of positive results being false rises to 56% (second row of the table). John Ioannidis (2005b) did pioneering work to analyze (much more carefully and realistically than we do here) the proportion of results that are likely to be false, and he concluded that it could very easily be a majority of all reported effects.

    Table 1. Proportion of Positive Results That Are False, Given Assumptions About the Prior Probability of an Effect and Power.

    Prior probability of effect | Power | Studies yielding true positives | Studies yielding false positives | Share of positive results which are false
    10%                         | 80%   | 10% × 80% = 8%                  | (100% − 10%) × 5% = 4.5%         | 4.5% / (4.5% + 8%) = 36%
    10%                         | 35%   | 10% × 35% = 3.5%                | (100% − 10%) × 5% = 4.5%         | 4.5% / (4.5% + 3.5%) = 56.25%
    50%                         | 35%   | 50% × 35% = 17.5%               | (100% − 50%) × 5% = 2.5%         | 2.5% / (2.5% + 17.5%) = 12.5%
    75%                         | 35%   | 75% × 35% = 26.25%              | (100% − 75%) × 5% = 1.25%        | 1.25% / (1.25% + 26.25%) = 4.5%
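The table’s arithmetic can be reproduced in a few lines, under the same simplifying assumptions as Pashler & Harris (fixed alpha of 5%, all positive results published):

```python
def false_positive_share(prior, power, alpha=0.05):
    """Share of all 'positive' results that are false positives, assuming
    every positive result is published."""
    true_pos = prior * power          # studies finding a real effect
    false_pos = (1 - prior) * alpha   # studies 'finding' a nonexistent one
    return false_pos / (false_pos + true_pos)

for prior, power in [(0.10, 0.80), (0.10, 0.35), (0.50, 0.35), (0.75, 0.35)]:
    share = false_positive_share(prior, power)
    print(f"prior {prior:.0%}, power {power:.0%}: {share:.1%} of positives are false")
```

This is simply Bayes’ theorem applied to the literature as a whole: the lower the prior and the power, the more the 5% of false positives dominates the published record.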
  3. So for example, if we imagined that a Jaeggi effect size of 0.8 were completely borne out by a meta-analysis of many studies and turned in a point estimate of d = 0.8, this data would imply that the strength of the n-back effect was ~1 standard deviation above the average effect (of things which get studied enough to be meta-analyzable & have published meta-analyses etc.), or to put it another way, that n-back was stronger than ~84% of all reliable well-substantiated effects that psychology/education had discovered as of 1992.↩︎

  4. We can infer empirical priors from field-wide collections of effect sizes, in particular, highly reliable meta-analytic effect sizes. For example, Lipsey & Wilson 1993, which finds for various kinds of therapy a mean effect of d = 0.5 based on >300 meta-analyses; or better yet, “One Hundred Years of Social Psychology Quantitatively Described”, Bond et al 2003:

    This article compiles results from a century of social psychological research, more than 25,000 studies of 8 million people. A large number of social psychological conclusions are listed alongside meta-analytic information about the magnitude and variability of the corresponding effects. References to 322 meta-analyses of social psychological phenomena are presented, as well as statistical effect-size summaries. Analyses reveal that social psychological effects typically yield a value of r equal to .21 and that, in the typical research literature, effects vary from study to study in ways that produce a standard deviation in r of .15. Uses, limitations, and implications of this large-scale compilation are noted.

    Only 5% of the effects (rs) were greater than .50; only 34% yielded an r of .30 or more; for example, Jaeggi 2008’s 15-day group racked up an IQ increase of d = 1.53, which converts to an r of 0.61 and is 2.6 standard deviations above the overall mean, implying that the DNB effect is greater than ~99% of previously known effects in psychology! (Schönbrodt & Perugini 2013 observe that their sampling simulations imply that, given Bond’s mean effect of r = .21, a psychology study would require n = 238 for reasonable accuracy in estimating effects; most studies are far smaller.)↩︎
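The conversions used here can be sketched as follows, assuming equal group sizes for the d-to-r conversion and using Bond et al’s mean r = .21, SD = .15 as the reference distribution (the z-score comes out to ~2.65, which rounds to the “2.6 standard deviations” quoted above):

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a point-biserial r, assuming equal group sizes."""
    return d / math.sqrt(d**2 + 4)

def normal_cdf(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

r = d_to_r(1.53)                 # Jaeggi 2008's d = 1.53
z = (r - 0.21) / 0.15            # position vs Bond et al 2003's distribution
print(f"r = {r:.2f}, {z:.2f} SDs above the mean, "
      f"beating ~{normal_cdf(z):.1%} of known effects")
```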

  5. One might be aware that the writer of that essay, Jonah Lehrer, was fired after making up materials for one of his books, and wonder if this work can be trusted; I believe it can, as the New Yorker is famous for rigorous fact-checking (and no one has cast doubt on this article), Lehrer’s scandals involved his books, I have not found any questionable claims in the article besides Lehrer’s belief that known issues like publication bias are insufficient to explain the decline effect (which reasonable men may differ on), and Virginia Hughes ran the finished article against 7 people quoted in it, like Ioannidis, without any disputing facts/quotes & several somewhat praising it (see also Andrew Gelman).↩︎

  6. If I am understanding this right, Jaynes’s point here is that the random error shrinks towards zero as N increases, but this error is added onto the “common systematic error” S, so the total error approaches S no matter how many observations you make, and the randomness can push the total error up as well as down (variability, in this case, actually being helpful for once). So for example, with a systematic error of 1⁄3 and a random error of the mean of 1⁄√N: with N = 10, the total error can be as much as 1⁄3 + 1⁄√10 ≈ 0.64956; with N = 100, it’s 0.43; with N = 1,000,000 it’s 0.334; and with N = 1,000,000,000 it equals 0.333365, etc., never going below the original systematic error of 1⁄3—that is, after 10 observations, the portion of error due to sampling error is already less than that due to the systematic error, so one has hit severely diminishing returns in the value of any additional (biased) data, and to meaningfully improve the estimate one must obtain unbiased data. This leads to the unfortunate consequence that the likely error of N = 10 spans 0.01711 < x < 0.64956 while for N = 1,000,000 it is the much narrower 0.33233 < x < 0.33433—so it is possible that the estimate could be exactly as good (or bad) for the tiny sample as for the enormous sample, and since the enormous sample can never do better than ~0.332 while the tiny one might luck into an error as small as 0.017, the tiny sample could even turn out better!↩︎
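The arithmetic, assuming as above a systematic error of 1⁄3 and a random error of the mean of 1⁄√N, so the realized total error lies between |1⁄3 − 1⁄√N| and 1⁄3 + 1⁄√N:

```python
import math

S = 1 / 3                        # systematic error: never shrinks with N
bounds = {}
for n in [10, 100, 1_000_000, 1_000_000_000]:
    rand = 1 / math.sqrt(n)      # random error of the mean
    bounds[n] = (abs(S - rand), S + rand)
    print(f"N = {n:>13,}: total error between "
          f"{bounds[n][0]:.5f} and {bounds[n][1]:.5f}")
```

Past the point where 1⁄√N falls below S (here, around N = 10), extra biased data barely moves the bounds.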

  7. Possibly this is what Lord Rutherford meant when he said, “If your experiment needs statistics you ought to have done a better experiment”.↩︎

  8. Neglecting the finite-population correction, the standard deviation of the mean sampling error is √(p(1 − p)⁄n), and this quantity is largest when p = 0.5. The number of ballots returned was 2,376,523, and with a sample of this size the largest possible value is √(0.5 × 0.5 ⁄ 2,376,523), or 0.0324 percentage point, so that an error of 0.2 percentage point is 0.2⁄0.0324 ≈ 6.17 times the standard deviation. The total area in the two tails of the Normal distribution below u = −6.17 and above u = +6.17 is 0.0000000007.↩︎
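Checking the footnote’s arithmetic:

```python
import math

n = 2_376_523                              # ballots returned
sd = 100 * math.sqrt(0.5 * 0.5 / n)        # worst case p = 0.5, in percentage points
z = 0.2 / sd                               # a 0.2-point error, in SD units
two_tail = math.erfc(z / math.sqrt(2))     # P(|Z| > z) for a standard Normal
print(f"sd = {sd:.4f} points; z = {z:.2f}; two-tailed p = {two_tail:.1e}")
```

The two-tailed tail area comes out around 7 × 10⁻¹⁰, matching the 0.0000000007 quoted.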

  9. Over 10 million ballots were sent out. Of the 2,376,523 ballots which were filled in and returned, 1,293,669 were for Landon, 972,897 for Roosevelt, and the remainder for other candidates. The actual vote was 16,679,583 for Landon and 27,476,673 for Roosevelt out of a total of 45,647,117.↩︎

  10. Readers curious about modern election forecasting’s systematic vs random error should see Shirani-Mehr et al 2018: the systematic error turns out to be almost identically sized, ie half the total error. Hence, anomalies like Donald Trump or Brexit are not particularly anomalous at all. –Editor.↩︎

  11. Johnson, interestingly, like Bouchard, was influenced by (and also ).↩︎

  12. I should mention this one is not quite as silly as it sounds, as there is experimental evidence for cocoa improving cognitive function.↩︎

  13. The same authors offer up a number of country-level correlations such as “Linguistic Diversity/Traffic accidents”, alcohol consumption/morphological complexity, and acacia trees vs tonality, which feed into their paper “Constructing knowledge: nomothetic approaches to language evolution” on the dangers of naive approaches to cross-country comparisons due to the high intercorrelation of cultural traits. More sophisticated approaches might be better; they derive a fairly-plausible-looking graph of the relationships between variables.↩︎

  14. Lots of data is not exactly normal, but, particularly in human studies, this is not a big deal because the n are often large enough, eg n > 20, that the asymptotics have started to work & model misspecification doesn’t produce too large a false positive rate inflation or mis-estimation. Unfortunately, in animal research, it’s perfectly typical to have sample sizes more like n = 5, which in an idealized power analysis of a normally-distributed variable might be fine because one is (hopefully) exploiting the freedom of animal models to get a large effect size / precise measurements—except that with n = 5 the data won’t be even close to approximately normal or fitting other model assumptions, and a single biased or selected or outlier datapoint can mess it up further.↩︎
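A deterministic illustration of how fragile n = 5 is (the measurements below are invented; the pooled two-sample t-test is standard): a single outlier animal, even one that increases the mean difference, can swing a wildly ‘significant’ result to a null one.

```python
import math, statistics

def pooled_t(a, b):
    """Two-sample pooled-variance t statistic."""
    na, nb = len(a), len(b)
    sp2 = (((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b))
           / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(sp2 * (1/na + 1/nb))

control   = [3.1, 2.9, 3.0, 3.2, 2.8]        # hypothetical measurements
treatment = [3.8, 4.0, 3.9, 4.1, 3.7]
print(round(pooled_t(treatment, control), 2))  # → 9.0: wildly 'significant'

# One aberrant animal (which *raises* the mean difference) wrecks the test:
# t falls below the 5% two-tailed critical value of 2.306 for df = 8.
treatment[-1] = 9.5
print(round(pooled_t(treatment, control), 2))  # → 1.85: not significant
```

The outlier inflates the pooled variance faster than it inflates the mean difference, so a larger apparent effect becomes statistically invisible.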