Everything Is Correlated

Anthology of sociological, statistical, or psychological papers discussing the observation that all real-world variables have non-zero correlations and the implications for statistical theory such as ‘null hypothesis testing’.
statistics, philosophy, survey, Bayes, psychology, genetics, sociology, bibliography, causality, insight-porn
2014-09-12–2020-11-14 finished certainty: log importance: 7

Sta­tis­ti­cal folk­lore as­serts that “every­thing is cor­re­lated”: in any re­al-world dataset, most or all mea­sured vari­ables will have non-zero cor­re­la­tions, even be­tween vari­ables which ap­pear to be com­pletely in­de­pen­dent of each oth­er, and that these cor­re­la­tions are not merely sam­pling er­ror flukes but will ap­pear in large-s­cale datasets to ar­bi­trar­ily des­ig­nated lev­els of sta­tis­ti­cal-sig­nifi­cance or pos­te­rior prob­a­bil­i­ty.

This raises se­ri­ous ques­tions for nul­l-hy­poth­e­sis sta­tis­ti­cal-sig­nifi­cance test­ing, as it im­plies the null hy­poth­e­sis of 0 will al­ways be re­jected with suffi­cient data, mean­ing that a fail­ure to re­ject only im­plies in­suffi­cient data, and pro­vides no ac­tual test or con­fir­ma­tion of a the­o­ry. Even a di­rec­tional pre­dic­tion is min­i­mally con­fir­ma­tory since there is a 50% chance of pick­ing the right di­rec­tion at ran­dom.
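As a rough illustration of why a point null of 0 is eventually rejected, the sketch below computes the two-sided p-value for a fixed, tiny correlation at increasing sample sizes. It uses the Fisher z approximation and, as a simplifying assumption, treats the sample correlation as exactly equal to the population value (no sampling noise). The value `TRUE_R = 0.01` and the function name are illustrative choices, not taken from any of the sources quoted here.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


TRUE_R = 0.01  # hypothetical tiny but nonzero correlation (an assumption)

for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  p = {correlation_p_value(TRUE_R, n):.3g}")
```

Under these assumptions the p-value is unremarkable at small n and shrinks without limit as n grows, although the correlation itself never changes.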

It also has im­pli­ca­tions for con­cep­tu­al­iza­tions of the­o­ries & causal mod­els, in­ter­pre­ta­tions of struc­tural mod­els, and other sta­tis­ti­cal prin­ci­ples such as the “spar­sity prin­ci­ple”.

Knowing one variable tells you (a little) about everything else. In statistics & psychology folklore, this idea circulates under many names: “everything is correlated”, “everything is related to everything else”, “crud factor”, “the null hypothesis is always false”, “coefficients are never zero”, “ambient correlational noise”, Thorndike’s dictum (“in human nature good traits go together”1), etc. Closely related are the “bet on sparsity principle”2, the “first law of ecology” (“Everything is connected to everything else”) & the first law of geography (“everything is related to everything else, but near things are more related than distant things”).3

The core idea here is that in any real-world dataset, it is exceptionally unlikely that any particular relationship will be exactly 0, for reasons of arithmetic (eg it may be impossible for a binary variable to be an equal percentage in 2 unbalanced groups); prior probability (0 is only one number out of the infinite reals); and because real-world properties & traits are linked by a myriad of causal networks, dynamics, & latent variables (eg the genetic correlations which affect all human traits, see heat maps in appendix for visualizations) which mutually affect each other and so produce genuine correlations between apparently-independent variables, and these correlations may be of surprisingly large & important size. These reasons are unaffected by sample size and are not simply due to ‘small n’. The claim is generally backed up by personal experience and reasoning, although in a few instances, such as Meehl’s, large datasets are mentioned in which almost all variables are correlated at high levels of statistical-significance.
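The arithmetic point can be made concrete with a small sketch: for two groups of given sizes, it lists the proportions of a binary variable that both groups could show exactly. The group sizes (7 and 10, 6 and 9) and the function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def achievable_proportions(group_size: int) -> set[Fraction]:
    """All exact proportions k / group_size a binary variable can take in a group."""
    return {Fraction(k, group_size) for k in range(group_size + 1)}


def shared_proportions(size_a: int, size_b: int) -> list[Fraction]:
    """Proportions that are exactly attainable in both groups."""
    return sorted(achievable_proportions(size_a) & achievable_proportions(size_b))


# Coprime group sizes: only the degenerate proportions 0 and 1 can match exactly.
print(shared_proportions(7, 10))
# Sizes sharing a factor of 3: thirds can match, nothing else.
print(shared_proportions(6, 9))
```

With unbalanced groups, an exactly zero difference in proportions is often not even attainable, before any question of sampling arises.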


This claim has sev­eral im­pli­ca­tions:

  1. Sharp null hy­pothe­ses are mean­ing­less: The most com­monly men­tioned, and the ap­par­ent mo­ti­va­tion for early dis­cus­sions, is that in the nul­l-hy­poth­e­sis par­a­digm dom­i­nant in psy­chol­ogy and many sci­ences, any sharp nul­l-hy­poth­e­sis such as a pa­ra­me­ter (like a Pear­son’s r cor­re­la­tion) be­ing ex­actly equal to 0 is known—in ad­vance—to al­ready be false and so it will in­evitably be re­jected as soon as suffi­cient data col­lec­tion per­mits sam­pling to the fore­gone con­clu­sion.

    The ex­is­tence of per­va­sive cor­re­la­tions, in ad­di­tion to the pres­ence of sys­tem­atic er­ror4, guar­an­tees nonzero ‘effects’. This ren­ders the mean­ing of sig­nifi­cance-test­ing un­clear; it is cal­cu­lat­ing pre­cisely the odds of the data un­der sce­nar­ios known a pri­ori to be false.

  2. Di­rec­tional hy­pothe­ses are lit­tle bet­ter: bet­ter nul­l-hy­pothe­ses, such as >0 or <0, are also prob­lem­atic since if the true value of a pa­ra­me­ter is never 0 then one’s the­o­ries have at least a 50-50 chance of guess­ing the right di­rec­tion and so cor­rect ‘pre­dic­tions’ of the sign count for lit­tle.

    This ren­ders any suc­cess­ful pre­dic­tions of lit­tle val­ue.

  3. Model interpretation is difficult: This extensive intercorrelation threatens many naive statistical models or theoretical interpretations thereof, quite aside from p-values.

    For example, given the large amounts of measurement error in most sociological or psychological traits such as SES, home environment, or IQ, fully ‘controlling for’ a latent variable based on measured variables is difficult or impossible and said variable will in fact remain correlated with the primary variable of interest, leading to “residual confounding” (Stouffer 1936/Thorndike 1942/Kahneman 1965).

  4. Intercorrelation implies causal networks: The empirical fact of extensive intercorrelations is consistent with the existence of causal networks (often latent factors) linking all measured traits, such as the extensive heritability & genetic correlations of human traits.

    The ex­is­tence of both “every­thing is cor­re­lated” and the suc­cess of the “bet on spar­sity” prin­ci­ple sug­gests that these causal net­works may be best thought of as hav­ing hubs or la­tent vari­ables: there are a rel­a­tively few vari­ables such as ‘arousal’ or ‘IQ’ which play cen­tral roles, ex­plain­ing much of vari­ance, fol­lowed by al­most all other vari­ables ac­count­ing for a lit­tle bit each with most of their in­flu­ence me­di­ated through the key vari­ables.

    The fact that these vari­ables can be suc­cess­fully mod­eled as sub­stan­tively lin­ear or ad­di­tive fur­ther im­plies that in­ter­ac­tions be­tween vari­ables will be typ­i­cally rare or small or both (im­ply­ing fur­ther that most such hits will be false pos­i­tives, as in­ter­ac­tions are al­ready harder to de­tect than main effects, and more so if they are a pri­ori un­likely or of small size).

    To the extent that these key variables are unmodifiable, the many peripheral variables may also be unmodifiable. Any intervention on those peripheral variables, being ‘downstream’, will tend to either be ‘hollow’ or fade out or have no effect at all on the true desired goals no matter how consistently they are correlated.

    On a more con­tem­po­rary note, these the­o­ret­i­cal & em­pir­i­cal con­sid­er­a­tions also throw doubt on con­cerns about ‘al­go­rith­mic bias’ or in­fer­ences draw­ing on ‘pro­tected classes’: not draw­ing on them may not be de­sir­able, pos­si­ble, or even mean­ing­ful.

  5. Uncorrelated variables may be meaningless: given this empirical reality, any variable which is uncorrelated with the major variables is suspicious (somewhat like how the pervasiveness of heritability in human traits renders traits with zero heritability suspicious, suggesting issues like measuring at the wrong time). The lack of correlation suggests that the analysis is underpowered, something has gone wrong in the construction of the variable/dataset, or that the variable is part of a system whose causal network renders conventional analyses dangerously misleading.

    For example, the dataset may be corrupted by a systematic bias such as range restriction or a selection effect, which erases from the data a correlation that actually exists. Or the data may be random noise, due to software error or fraud or extremely high levels of measurement error (such as respondents answering at random); or the variable may not be real in the first place. Another possibility is that the variable is causally connected, in feedback loops (especially common in economics or biology), to another variable, in which case the standard statistical machinery is misleading. The classic example is Milton Friedman’s thermostat: the activity of a thermostat would be almost entirely uncorrelated with room temperature.
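The thermostat example lends itself to a toy simulation. The sketch below is a deliberately simplified model, not Friedman's own formulation: room temperature is assumed to equal outside temperature plus heat input, and the thermostat sets heat input to cancel the outside temperature up to a small control error. All numbers (target of 20, outside mean 5 and SD 10, control error SD 0.1) and all names are assumptions made for illustration.

```python
import math
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is reproducible
TARGET = 20.0           # hypothetical thermostat setting

outside, heat_input, room = [], [], []
for _ in range(10_000):
    t_out = rng.gauss(5.0, 10.0)
    # The thermostat offsets the outside temperature, with a small control error.
    # Negative values stand for cooling.
    heat = (TARGET - t_out) + rng.gauss(0.0, 0.1)
    outside.append(t_out)
    heat_input.append(heat)
    room.append(t_out + heat)  # heat input causally raises room temperature 1:1

print("corr(heat input, room temperature):", round(pearson(heat_input, room), 3))
print("corr(outside, heat input):         ", round(pearson(outside, heat_input), 3))
```

In this toy model the heat input has a direct causal effect on room temperature, yet the two are nearly uncorrelated, because the feedback loop cancels the variation that would otherwise reveal the effect.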

This idea, as sug­gested by the many names, is not due to any sin­gle the­o­ret­i­cal or em­pir­i­cal re­sult or re­searcher, but has been made many times by many differ­ent re­searchers in many con­texts, cir­cu­lat­ing as in­for­mal ‘folk­lore’. To bring some or­der to this, I have com­piled ex­cerpts of some rel­e­vant ref­er­ences in chrono­log­i­cal or­der. (Ad­di­tional ci­ta­tions are wel­come.)

Gosset/Student 1904

A version of this is attributed to William Sealy Gosset (Student) in an unpublished 1904 internal report5:

In early No­vem­ber, 1904, Gos­set dis­cussed his first break­through in an in­ter­nal re­port en­ti­tled “The Ap­pli­ca­tion of the ‘Law of Er­ror’ to the Work of the Brew­ery” (Gos­set, 1904 Lab­o­ra­tory Re­port, Nov. 3, 1904, p. 3). Gos­set (p. 3–16) wrote:

Re­sults are only valu­able when the amount by which they prob­a­bly differ from the truth is so small as to be in­signifi­cant for the pur­poses of the ex­per­i­ment. What the odds should be de­pends

  1. On the de­gree of ac­cu­racy which the na­ture of the ex­per­i­ment al­lows, and
  2. On the im­por­tance of the is­sues at stake.

Two fea­tures of Gos­set’s re­port are es­pe­cially worth high­light­ing here. First, he sug­gested that judg­ments about “sig­nifi­cant” differ­ences were not a purely prob­a­bilis­tic ex­er­cise: they de­pend on the “im­por­tance of the is­sues at stake.” Sec­ond, Gos­set un­der­scored a pos­i­tive cor­re­la­tion in the nor­mal dis­tri­b­u­tion curve be­tween “the square root of the num­ber of ob­ser­va­tions” and the level of sta­tis­ti­cal sig­nifi­cance. Other things equal, he wrote, “the greater the num­ber of ob­ser­va­tions of which means are taken [the larger the sam­ple size], the smaller the [prob­a­ble or stan­dard] er­ror” (p. 5). “And the curve which rep­re­sents their fre­quency of er­ror,” he il­lus­trat­ed, “be­comes taller and nar­rower” (p. 7).

Since its dis­cov­ery in the early nine­teenth cen­tu­ry, ta­bles of the nor­mal prob­a­bil­ity curve had been cre­ated for large sam­ples…The re­la­tion be­tween sam­ple size and “sig­nifi­cance” was rarely ex­plored. For ex­am­ple, while look­ing at bio­met­ric sam­ples with up to thou­sands of ob­ser­va­tions, Karl Pear­son de­clared that a re­sult de­part­ing by more than three stan­dard de­vi­a­tions is “defi­nitely sig­nifi­cant.”12 Yet Gos­set, a self­-trained sta­tis­ti­cian, found that at such large sam­ples, nearly every­thing is sta­tis­ti­cally “sig­nifi­cant”—though not, in Gos­set’s terms, eco­nom­i­cally or sci­en­tifi­cally “im­por­tant.” Re­gard­less, Gos­set did­n’t have the lux­ury of large sam­ples. One of his ear­li­est ex­per­i­ments em­ployed a sam­ple size of 2 (Gos­set, 1904, p.7) and in fact in “The Prob­a­ble Er­ror of a Mean” he cal­cu­lated a t sta­tis­tic for n=2 (S­tu­dent, 1908b, p. 23).

…the “de­gree of cer­tainty to be aimed at,” Gos­set wrote, de­pends on the op­por­tu­nity cost of fol­low­ing a re­sult as if true, added to the op­por­tu­nity cost of con­duct­ing the ex­per­i­ment it­self. Gos­set never de­vi­ated from this cen­tral po­si­tion.15 [See, for ex­am­ple, Stu­dent (1923, p. 271, para­graph one: “The ob­ject of test­ing va­ri­eties of ce­re­als is to find out which will pay the farmer best.”) and Stu­dent (1931c, p. 1342, para­graph one) reprinted in Stu­dent (1942, p. 90 and p. 150).]

Thorndike 1920

“In­tel­li­gence and Its Uses”, Ed­ward L. Thorndike 1920 (Harper’s Monthly):

…the sig­nifi­cance of in­tel­li­gence for suc­cess in a given ac­tiv­ity of life is mea­sured by the co­effi­cient of cor­re­la­tion be­tween them. Sci­en­tific in­ves­ti­ga­tions of these mat­ters is just be­gin­ning; and it is a mat­ter of great diffi­culty and ex­pense to mea­sure the in­tel­li­gence of, say, a thou­sand cler­gy­men, and then se­cure suffi­cient ev­i­dence to rate them ac­cu­rately for their suc­cess as min­is­ters of the Gospel. Con­se­quent­ly, one can re­port no fi­nal, per­fectly au­thor­i­ta­tive re­sults in this field. One can only or­ga­nize rea­son­able es­ti­mates from the var­i­ous par­tial in­ves­ti­ga­tions that have been made. Do­ing this, I find the fol­low­ing:

  • In­tel­li­gence and suc­cess in the el­e­men­tary schools, r = +.80
  • In­tel­li­gence and suc­cess in high­-school and col­leges in the case of those who go, r = +0.60; but if all were forced to try to do this ad­vanced work, the cor­re­la­tion would be +.80 or more.
  • In­tel­li­gence and salary, r = +.35.
  • In­tel­li­gence and suc­cess in ath­letic sports, r = +.25
  • In­tel­li­gence and char­ac­ter, r = +.40
  • In­tel­li­gence and pop­u­lar­i­ty, r = +.20

What­ever be the even­tual ex­act find­ings, two sound prin­ci­ples are il­lus­trated by our pro­vi­sional list. First, there is al­ways some re­sem­blance; in­tel­lect al­ways counts. Sec­ond, the re­sem­blance varies great­ly; in­tel­lect counts much more in some lines than in oth­ers.

The first fact is in part a con­se­quence of a still broader fact or prin­ci­ple—­name­ly, that in hu­man na­ture good traits go to­geth­er. To him that hath a su­pe­rior in­tel­lect is given also on the av­er­age a su­pe­rior char­ac­ter; the quick boy is also in the long run more ac­cu­rate; the able boy is also more in­dus­tri­ous. There is no prin­ci­ple of com­pen­sa­tion whereby a weak in­tel­lect is off­set by a strong will, a poor mem­ory by good judg­ment, or a lack of am­bi­tion by an at­trac­tive per­son­al­i­ty. Every pair of such sup­posed com­pen­sat­ing qual­i­ties that have been in­ves­ti­gated has been found re­ally to show cor­re­spon­dence. Pop­u­lar opin­ion has been mis­led by at­tend­ing to strik­ing in­di­vid­ual cases which at­tracted at­ten­tion partly be­cause they were re­ally ex­cep­tions to the rule. The rule is that de­sir­able qual­i­ties are pos­i­tively cor­re­lat­ed. In­tel­lect is good in and of it­self, and also for what it im­plies about other traits.

Berkson 1938

“Some difficulties of interpretation encountered in the application of the chi-square test”, Berkson 1938:

I be­lieve that an ob­ser­vant sta­tis­ti­cian who has had any con­sid­er­able ex­pe­ri­ence with ap­ply­ing the chi-square test re­peat­edly will agree with my state­ment that, as a mat­ter of ob­ser­va­tion, when the num­bers in the data are quite large, the P’s tend to come out small. Hav­ing ob­served this, and on re­flec­tion, I make the fol­low­ing dog­matic state­ment, re­fer­ring for il­lus­tra­tion to the nor­mal curve: “If the nor­mal curve is fit­ted to a body of data rep­re­sent­ing any real ob­ser­va­tions what­ever of quan­ti­ties in the phys­i­cal world, then if the num­ber of ob­ser­va­tions is ex­tremely large—­for in­stance, on an or­der of 200,000—the chi-square P will be small be­yond any usual limit of sig­nifi­cance.”

This dog­matic state­ment is made on the ba­sis of an ex­trap­o­la­tion of the ob­ser­va­tion re­ferred to and can also be de­fended as a pre­dic­tion from a pri­ori con­sid­er­a­tions. For we may as­sume that it is prac­ti­cally cer­tain that any se­ries of real ob­ser­va­tions does not ac­tu­ally fol­low a nor­mal curve with ab­solute ex­ac­ti­tude in all re­spects, and no mat­ter how small the dis­crep­ancy be­tween the nor­mal curve and the true curve of ob­ser­va­tions, the chi-square P will be small if the sam­ple has a suffi­ciently large num­ber of ob­ser­va­tions in it.

If this be so, then we have some­thing here that is apt to trou­ble the con­science of a re­flec­tive sta­tis­ti­cian us­ing the chi-square test. For I sup­pose it would be agreed by sta­tis­ti­cians that a large sam­ple is al­ways bet­ter than a small sam­ple. If, then, we know in ad­vance the P that will re­sult from an ap­pli­ca­tion of a chi-square test to a large sam­ple, there would seem to be no use in do­ing it on a smaller one. But since the re­sult of the for­mer test is known, it is no test at all.
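Berkson's observation follows from how the chi-square statistic scales: for a fixed discrepancy between the hypothesized and true proportions, the statistic grows linearly with the number of observations. The sketch below is a simplified stand-in for his normal-curve example, using only two cells (so 1 degree of freedom) and plugging in the population proportions directly with no sampling noise. The true proportion of 0.505 against a hypothesized 0.5 is a made-up discrepancy.

```python
import math


def chi_square_two_cells(n: int, true_p: float, null_p: float = 0.5) -> float:
    """Chi-square statistic when observed counts equal their expectations under true_p."""
    observed = (n * true_p, n * (1 - true_p))
    expected = (n * null_p, n * (1 - null_p))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_df1(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # A chi-square(1) variable is Z**2, so P(Z**2 >= x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))


for n in (1_000, 20_000, 200_000):
    stat = chi_square_two_cells(n, true_p=0.505)
    print(f"n = {n:>7,}  chi-square = {stat:6.2f}  P = {chi_square_p_value_df1(stat):.3g}")
```

At Berkson's order of 200,000 observations, a discrepancy of half a percentage point already gives a statistic of about 20 and a P far below the usual thresholds, while the same discrepancy is invisible at n = 1,000.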

Thorndike 1939

Your City, Thorndike 1939 (and the fol­lowup 144 Smaller Cities pro­vid­ing ta­bles for 144 cities) com­piles var­i­ous sta­tis­tics about Amer­i­can cities such as in­fant mor­tal­i­ty, spend­ing on the arts, crime etc and finds ex­ten­sive in­ter­cor­re­la­tions & fac­tors.

The general factor of socioeconomic status or ‘S-factor’ also applies across countries: economic growth is by far the largest influence on all measures of well-being, and attempts at computing international rankings of things like maternal mortality founder on this fact, as they typically wind up simply reproducing GDP rank-orderings and being redundant. For example, one international wellbeing metric is computed using “life expectancy, the ratio of consumption to income, annual hours worked per capita, the standard deviation of log consumption, and the standard deviation of annual hours worked” to incorporate factors like inequality, but this still winds up just being equivalent to GDP (r = 0.98). Similarly, of 9 international indices considered in another comparison, all correlate positively with per capita GDP, & 6 have τ>0.5.

Good 1950

Probability and the Weighing of Evidence, Good 1950:

The gen­eral ques­tion of sig­nifi­cance tests was raised in 7.3 and a sim­ple ex­am­ple will now be con­sid­ered. Sup­pose that a die is thrown n times and that it shows an r-face on mr oc­ca­sions (r = 1, 2, …, 6). The ques­tion is whether the die is loaded. The an­swer de­pends on the mean­ing of “loaded”. From one point of view, it is un­nec­es­sary to look at the sta­tis­tics since it is ob­vi­ous that no die could be ab­solutely sym­met­ri­cal. [It would be no con­tra­dic­tion of 4.3 (ii) to say that the hy­poth­e­sis that the die is ab­solutely sym­met­ri­cal is al­most im­pos­si­ble. In fact, this hy­poth­e­sis is an ide­alised propo­si­tion rather than an em­pir­i­cal one.] It is pos­si­ble that a sim­i­lar re­mark ap­plies to all ex­per­i­ments—even to the ESP ex­per­i­ment, since there may be no way of de­sign­ing it so that the prob­a­bil­i­ties are ex­actly equal to 1⁄2.

Hodges & Lehmann 1954

“Test­ing the ap­prox­i­mate va­lid­ity of sta­tis­ti­cal hy­pothe­ses”, Hodges & Lehmann 1954:

When test­ing sta­tis­ti­cal hy­pothe­ses, we usu­ally do not wish to take the ac­tion of re­jec­tion un­less the hy­poth­e­sis be­ing tested is false to an ex­tent suffi­cient to mat­ter. For ex­am­ple, we may for­mu­late the hy­poth­e­sis that a pop­u­la­tion is nor­mally dis­trib­ut­ed, but we re­al­ize that no nat­ural pop­u­la­tion is ever ex­actly nor­mal. We would want to re­ject nor­mal­ity only if the de­par­ture of the ac­tual dis­tri­b­u­tion from the nor­mal form were great enough to be ma­te­r­ial for our in­ves­ti­ga­tion. Again, when we for­mu­late the hy­poth­e­sis that the sex ra­tio is the same in two pop­u­la­tions, we do not re­ally be­lieve that it could be ex­actly the same, and would only wish to re­ject equal­ity if they are suffi­ciently differ­ent. Fur­ther ex­am­ples of the phe­nom­e­non will oc­cur to the read­er.
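One way to express "false to an extent sufficient to matter" is to test an interval null, |effect| ≤ margin, instead of the point null of exactly 0. The sketch below is a simplified normal-theory version with a known standard error. It is not claimed to be Hodges & Lehmann's exact procedure, and the estimate, standard error, and margin are invented numbers.

```python
import math


def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def point_null_p(estimate: float, se: float) -> float:
    """Two-sided p-value for H0: effect = 0."""
    return 2 * normal_upper_tail(abs(estimate) / se)


def interval_null_p(estimate: float, se: float, margin: float) -> float:
    """p-value for H0: |effect| <= margin, taken at the least favourable boundary."""
    x = abs(estimate)
    # Worst case inside the null is effect = +/- margin, where the two tails are:
    return normal_upper_tail((x - margin) / se) + normal_upper_tail((x + margin) / se)


# Hypothetical large study: a tiny effect estimated very precisely.
estimate, se, margin = 0.02, 0.002, 0.1
print("point null    p =", point_null_p(estimate, se))
print("interval null p =", interval_null_p(estimate, se, margin))
```

With these numbers the point null is rejected overwhelmingly while the interval null is not rejected at all, which is the distinction the passage draws between detecting a departure and detecting a material one. Setting `margin` to 0 recovers the ordinary two-sided test.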

Savage 1954

The Foundations of Statistics 1st edition, Savage 1954[6], pg252–255:

The de­vel­op­ment of the the­ory of test­ing has been much in­flu­enced by the spe­cial prob­lem of sim­ple di­choto­my, that is, test­ing prob­lems in which H0 and H1 have ex­actly one el­e­ment each. Sim­ple di­chotomy is sus­cep­ti­ble of neat and full analy­sis (as in Ex­er­cise 7.5.2 and in §14.4), like­li­hood-ra­tio tests here be­ing the only ad­mis­si­ble tests; and sim­ple di­chotomy often gives in­sight into more com­pli­cated prob­lems, though the point is not ex­plic­itly il­lus­trated in this book.

Coin and ball ex­am­ples of sim­ple di­chotomy are easy to con­struct, but in­stances seem rare in real life. The as­tro­nom­i­cal ob­ser­va­tions made to dis­tin­guish be­tween the New­ton­ian and Ein­stein­ian hy­pothe­ses are a good, but not per­fect, ex­am­ple, and I sup­pose that re­search in Mendelian ge­net­ics some­times leads to oth­ers. There is, how­ev­er, a tra­di­tion of ap­ply­ing the con­cept of sim­ple di­chotomy to some sit­u­a­tions to which it is, to say the best, only crudely adapt­ed. Con­sid­er, for ex­am­ple, the de­ci­sion prob­lem of a per­son who must buy, f0, or refuse to buy, f1, a lot of man­u­fac­tured ar­ti­cles on the ba­sis of an ob­ser­va­tion x. Sup­pose that i is the differ­ence be­tween the value of the lot to the per­son and the price at which the lot is offered for sale, and that P(x | i) is known to the per­son. Clear­ly, H0, H1, and N are sets char­ac­ter­ized re­spec­tively by i > 0, i < 0, i = 0. This analy­sis of this, and sim­i­lar, prob­lems has re­cently been ex­plored in terms of the min­i­max rule, for ex­am­ple by Sprowls [S16] and a lit­tle more fully by Rudy [R4], and by Allen [A3]. It seems to me nat­ural and promis­ing for many fields of ap­pli­ca­tion, but it is not a tra­di­tional analy­sis. On the con­trary, much lit­er­a­ture rec­om­mends, in effect, that the per­son pre­tend that only two val­ues of i, i0 > 0 and i1 < 0, are pos­si­ble and that the per­son then choose a test for the re­sult­ing sim­ple di­choto­my. The se­lec­tion of the two val­ues i0 and i1 is left to the per­son, though they are some­times sup­posed to cor­re­spond to the per­son’s judg­ment of what con­sti­tutes good qual­ity and poor qual­i­ty—terms re­ally quite with­out de­fi­n­i­tion. 
The em­pha­sis on sim­ple di­chotomy is tem­pered in some ac­cep­tance-sam­pling lit­er­a­ture, where it is rec­om­mended that the per­son choose among avail­able tests by some largely un­spec­i­fied over­all con­sid­er­a­tion of op­er­at­ing char­ac­ter­is­tics and costs, and that he fa­cil­i­tate his sur­vey of the avail­able tests by fo­cus­ing on a pair of points that hap­pen to in­ter­est him and con­sid­er­ing the test whose op­er­at­ing char­ac­ter­is­tic passes (e­co­nom­i­cal­ly, in the case of se­quen­tial test­ing) through the pair of points. These tra­di­tional analy­ses are cer­tainly in­fe­rior in the the­o­ret­i­cal frame­work of the present dis­cus­sion, and I think they will be found in­fe­rior in prac­tice.

…I turn now to a differ­ent and, at least for me, del­i­cate topic in con­nec­tion with ap­pli­ca­tions of the the­ory of test­ing. Much at­ten­tion is given in the lit­er­a­ture of sta­tis­tics to what pur­port to be tests of hy­pothe­ses, in which the null hy­poth­e­sis is such that it would not re­ally be ac­cepted by any­one. The fol­low­ing three propo­si­tions, though play­ful in con­tent, are typ­i­cal in form of these ex­treme null hy­pothe­ses, as I shall call them for the mo­ment.

  • A. The mean noise out­put of the ce­real Krakl is a lin­ear func­tion of the at­mos­pheric pres­sure, in the range from 900 to 1,100 mil­libars.
  • B. The basal metabolic consumption of sperm whales is normally distributed [W11].
  • C. New York taxi dri­vers of Irish, Jew­ish, and Scan­di­na­vian ex­trac­tion are equally pro­fi­cient in abu­sive lan­guage.

Lit­er­ally to test such hy­pothe­ses as these is pre­pos­ter­ous. If, for ex­am­ple, the loss as­so­ci­ated with f1 is ze­ro, ex­cept in case Hy­poth­e­sis A is ex­actly sat­is­fied, what pos­si­ble ex­pe­ri­ence with Krakl could dis­suade you from adopt­ing f1?

The un­ac­cept­abil­ity of ex­treme null hy­pothe­ses is per­fectly well known; it is closely re­lated to the often heard maxim that sci­ence dis­proves, but never proves, hy­pothe­ses. The role of ex­treme hy­pothe­ses in sci­ence and other sta­tis­ti­cal ac­tiv­i­ties seems to be im­por­tant but ob­scure. In par­tic­u­lar, though I, like every­one who prac­tices sta­tis­tics, have often “tested” ex­treme hy­pothe­ses, I can­not give a very sat­is­fac­tory analy­sis of the process, nor say clearly how it is re­lated to test­ing as de­fined in this chap­ter and other the­o­ret­i­cal dis­cus­sions. None the less, it seems worth while to ex­plore the sub­ject ten­ta­tive­ly; I will do so largely in terms of two ex­am­ples.

Con­sider first the prob­lem of a ce­real dy­nam­i­cist who must es­ti­mate the noise out­put of Krakl at each of ten at­mos­pheric pres­sures be­tween 900 and 1,100 mil­libars. It may well be that he can prop­erly re­gard the prob­lem as that of es­ti­mat­ing the ten pa­ra­me­ters in ques­tion, in which case there is no ques­tion of test­ing. But sup­pose, for ex­am­ple, that one or both of the fol­low­ing con­sid­er­a­tions ap­ply. First, the en­gi­neer and his col­leagues may at­tach con­sid­er­able per­sonal prob­a­bil­ity to the pos­si­bil­ity that A is very nearly sat­is­fied—very near­ly, that is, in terms of the dis­per­sion of his mea­sure­ments. Sec­ond, the ad­min­is­tra­tive, com­pu­ta­tion­al, and other in­ci­den­tal costs of us­ing ten in­di­vid­ual es­ti­mates might be con­sid­er­ably greater than that of us­ing a lin­ear for­mu­la.

It might be im­prac­ti­cal to deal with ei­ther of these con­sid­er­a­tions very rig­or­ous­ly. One rough at­tack is for the en­gi­neer first to ex­am­ine the ob­served data x and then to pro­ceed ei­ther as though he ac­tu­ally be­lieved Hy­poth­e­sis A or else in some other way. The other way might be to make the es­ti­mate ac­cord­ing to the ob­jec­tivis­tic for­mu­lae that would have been used had there been no com­pli­cat­ing con­sid­er­a­tions, or it might take into ac­count differ­ent but re­lated com­pli­cat­ing con­sid­er­a­tions not ex­plic­itly men­tioned here, such as the ad­van­tage of us­ing a qua­dratic ap­prox­i­ma­tion. It is ar­ti­fi­cial and in­ad­e­quate to re­gard this de­ci­sion be­tween one class of ba­sic acts or an­other as a test, but that is what in cur­rent prac­tice we seem to do. The choice of which test to adopt in such a con­text is at least partly mo­ti­vated by the vague idea that the test should read­ily ac­cept, that is, re­sult in act­ing as though the ex­treme null hy­pothe­ses were true, in the far­fetched case that the null hy­poth­e­sis is in­deed true, and that the worse the ap­prox­i­ma­tion of the null hy­pothe­ses to the truth the less prob­a­ble should be the ac­cep­tance.

The method just out­lined is crude, to say the best. It is often mod­i­fied in ac­cor­dance with com­mon sense, es­pe­cially so far as the sec­ond con­sid­er­a­tion is con­cerned. Thus, if the mea­sure­ments are suffi­ciently pre­cise, no or­di­nary test might ac­cept the null hy­pothe­ses, for the ex­per­i­ment will lead to a clear and sure idea of just what the de­par­tures from the null hy­pothe­ses ac­tu­ally are. But, if the en­gi­neer con­sid­ers those de­par­tures unim­por­tant for the con­text at hand, he will jus­ti­fi­ably de­cide to ne­glect them.

Rejection of an extreme null hypothesis, in the sense of the foregoing discussion, typically gives rise to a complicated subsidiary decision problem. Some aspects of this situation have recently been explored, for example by Paulson [P3], [P4]; Duncan [D11], [D12]; Tukey [T4], [T5]; Scheffé [S7]; and W. D. Fisher [F7].

Fisher 1956

Statistical Methods and Scientific Inference, Fisher 1956 (pg42):

…How­ev­er, the cal­cu­la­tion [of er­ror rates of ‘re­ject­ing the null’] is ab­surdly aca­d­e­mic, for in fact no sci­en­tific worker has a fixed level of sig­nifi­cance at which from year to year and in all cir­cum­stances, he re­jects hy­pothe­ses; he rather gives his mind to each par­tic­u­lar case in the light of his ev­i­dence and his ideas. Fur­ther, the cal­cu­la­tion is based solely on a hy­poth­e­sis, which, in the light of the ev­i­dence, is often not be­lieved to be true at all, so that the ac­tual prob­a­bil­ity of er­ro­neous de­ci­sion, sup­pos­ing such a phrase to have any mean­ing, may be much less than the fre­quency spec­i­fy­ing the level of sig­nifi­cance.

Wallis & Roberts 1956

Sta­tis­tics: A New Ap­proach, Wal­lis & Roberts 1956 (pg 384–388):

A diffi­culty with this view­point is that it is often known that the hy­poth­e­sis tested could not be pre­cisely true. No coin, for ex­am­ple, has a prob­a­bil­ity of pre­cisely 1⁄2 of com­ing heads. The true prob­a­bil­ity will al­ways differ from 1⁄2 even if it differs by only 0.000,000,000,1. Nei­ther will any treat­ment cure pre­cisely one-third of the pa­tients in the pop­u­la­tion to which it might be ap­plied, nor will the pro­por­tion of vot­ers in a pres­i­den­tial elec­tion fa­vor­ing one can­di­date be pre­cisely 1⁄2. Recog­ni­tion of this leads to the no­tion of differ­ences that are or are not of prac­ti­cal im­por­tance. “Prac­ti­cal im­por­tance” de­pends on the ac­tions that are go­ing to be taken on the ba­sis of the data, and on the losses from tak­ing cer­tain ac­tions when oth­ers would be more ap­pro­pri­ate.

Thus, the fo­cus is shifted to de­ci­sions: Would the same de­ci­sion about prac­ti­cal ac­tion be ap­pro­pri­ate if the coin pro­duces heads 0.500,000,000,1 of the time as if it pro­duces heads 0.5 of the time pre­cise­ly? Does it mat­ter whether the coin pro­duces heads 0.5 of the time or 0.6 of the time, and if so does it mat­ter enough to be worth the cost of the data needed to de­cide be­tween the ac­tions ap­pro­pri­ate to these sit­u­a­tions? Ques­tions such as these carry us to­ward a com­pre­hen­sive the­ory of ra­tio­nal ac­tion, in which the con­se­quences of each pos­si­ble ac­tion are weighed in the light of each pos­si­ble state of re­al­i­ty. The value of a cor­rect de­ci­sion, or the costs of var­i­ous de­grees of er­ror, are then bal­anced against the costs of re­duc­ing the risks of er­ror by col­lect­ing fur­ther da­ta. It is this view­point that un­der­lies the de­fi­n­i­tion of sta­tis­tics given in the first sen­tence of this book. [“Sta­tis­tics is a body of meth­ods for mak­ing wise de­ci­sions in the face of un­cer­tain­ty.”]
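The cost side of Wallis & Roberts's question can be roughed out numerically. The sketch below asks how many flips are needed before the half-width of a 95% confidence interval for the heads probability shrinks to the size of the deviation from 1⁄2. This is a crude normal-approximation yardstick rather than a full power calculation, and the function name is a hypothetical one introduced here.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def flips_needed(deviation: float, z: float = Z_95) -> int:
    """Flips for the 95% CI half-width (near p = 0.5) to shrink to `deviation`."""
    standard_deviation = 0.5  # sqrt(p * (1 - p)) evaluated at p = 0.5
    return math.ceil((z * standard_deviation / deviation) ** 2)


print("0.5 vs 0.6:          ", flips_needed(0.1), "flips")
print("0.5 vs 0.5000000001: ", f"{flips_needed(1e-10):.3g}", "flips")
```

Telling 0.5 from 0.6 takes on the order of a hundred flips, whereas telling 0.5 from 0.500,000,000,1 takes on the order of 10^20, which makes it plain why the decision has to weigh the cost of data against what the difference is worth.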

Savage 1957

“Non­para­met­ric sta­tis­tics”, I. Richard Sav­age7 1957:

Siegel does not ex­plain why his in­ter­est is con­fined to tests of sig­nifi­cance; to make mea­sure­ments and then ig­nore their mag­ni­tudes would or­di­nar­ily be point­less. Ex­clu­sive re­liance on tests of sig­nifi­cance ob­scures the fact that sta­tis­ti­cal sig­nifi­cance does not im­ply sub­stan­tive sig­nifi­cance. The tests given by Siegel ap­ply only to null hy­pothe­ses of “no differ­ence.” In re­search, how­ev­er, null hy­pothe­ses of the form “Pop­u­la­tion A has a me­dian at least five units larger than the me­dian of Pop­u­la­tion B” arise. Null hy­pothe­ses of no differ­ence are usu­ally known to be false be­fore the data are col­lected [9, p. 42; 48, pp. 384–8]; when they are, their re­jec­tion or ac­cep­tance sim­ply re­flects the size of the sam­ple and the power of the test, and is not a con­tri­bu­tion to sci­ence.

Nunnally 1960

“The place of sta­tis­tics in psy­chol­ogy”, Nun­nally 1960:

The most misused and misconceived hypothesis-testing model employed in psychology is referred to as the “null-hypothesis” model. Stating it crudely, one null hypothesis would be that two treatments do not produce different mean effects in the long run. Using the obtained means and sample estimates of “population” variances, probability statements can be made about the acceptance or rejection of the null hypothesis. Similar null hypotheses are applied to correlations, complex experimental designs, factor-analytic results, and most all experimental results.

Al­though from a math­e­mat­i­cal point of view the nul­l-hy­poth­e­sis mod­els are in­ter­nally neat, they share a crip­pling flaw: in the real world the null hy­poth­e­sis is al­most never true, and it is usu­ally non­sen­si­cal to per­form an ex­per­i­ment with the sole aim of re­ject­ing the null hy­poth­e­sis. This is a per­sonal point of view, and it can­not be proved di­rect­ly. How­ev­er, it is sup­ported both by com­mon sense and by prac­ti­cal ex­pe­ri­ence. The com­mon-sense ar­gu­ment is that differ­ent psy­cho­log­i­cal treat­ments will al­most al­ways (in the long run) pro­duce differ­ences in mean effects, even though the differ­ences may be very small. Al­so, just as na­ture ab­hors a vac­u­um, it prob­a­bly ab­hors zero cor­re­la­tions be­tween vari­ables.

…Ex­pe­ri­ence shows that when large num­bers of sub­jects are used in stud­ies, nearly all com­par­isons of means are “sig­nifi­cantly” differ­ent and all cor­re­la­tions are “sig­nifi­cantly” differ­ent from ze­ro. The au­thor once had oc­ca­sion to use 700 sub­jects in a study of pub­lic opin­ion. After a fac­tor analy­sis of the re­sults, the fac­tors were cor­re­lated with in­di­vid­u­al-d­iffer­ence vari­ables such as amount of ed­u­ca­tion, age, in­come, sex, and oth­ers. In look­ing at the re­sults I was happy to find so many “sig­nifi­cant” cor­re­la­tions (un­der the nul­l-hy­poth­e­sis mod­el)-in­deed, nearly all cor­re­la­tions were sig­nifi­cant, in­clud­ing ones that made lit­tle sense. Of course, with an N of 700 cor­re­la­tions as large as 0.08 are “be­yond the 0.05 lev­el.” Many of the “sig­nifi­cant” cor­re­la­tions were of no the­o­ret­i­cal or prac­ti­cal im­por­tance.

The point of view taken here is that if the null hy­poth­e­sis is not re­ject­ed, it usu­ally is be­cause the N is too small. If enough data is gath­ered, the hy­poth­e­sis will gen­er­ally be re­ject­ed. If re­jec­tion of the null hy­poth­e­sis were the real in­ten­tion in psy­cho­log­i­cal ex­per­i­ments, there usu­ally would be no need to gather da­ta.

…S­ta­tis­ti­cians are not to blame for the mis­con­cep­tions in psy­chol­ogy about the use of sta­tis­ti­cal meth­ods. They have warned us about the use of the hy­poth­e­sis-test­ing mod­els and the re­lated con­cepts. In par­tic­u­lar they have crit­i­cized the nul­l-hy­poth­e­sis model and have rec­om­mended al­ter­na­tive pro­ce­dures sim­i­lar to those rec­om­mended here (See Sav­age, 1957; Tukey, 1954; and Yates, 1951).
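Nunnally's remark that with an N of 700 correlations as large as 0.08 are "beyond the 0.05 level" can be checked with a short calculation. The sketch below is not from Nunnally; it assumes a two-sided test and uses the Fisher z approximation (an assumption chosen so that only the Python standard library is needed), and the function names `critical_r` and `p_value` are illustrative.

```python
import math
from statistics import NormalDist

def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is two-sided significant at `alpha` for sample size n.

    Uses the Fisher z approximation: under H0, atanh(r) is roughly
    Normal(0, 1 / (n - 3)).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_crit / math.sqrt(n - 3))

def p_value(r: float, n: int) -> float:
    """Two-sided p-value for an observed correlation r (same approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

if __name__ == "__main__":
    print("critical |r| at N=700:", round(critical_r(700), 3))
    print("p-value for r=0.08, N=700:", round(p_value(0.08, 700), 3))
```

Under these assumptions the threshold at N = 700 falls between 0.07 and 0.08, which agrees with the figure quoted above: a correlation explaining well under 1% of the variance already counts as "significant".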

Smith 1960

“Review of N. T. J. Bailey, Statistical methods in biology”, Smith 1960:

How­ev­er, it is in­ter­est­ing to look at this book from an­other an­gle. Here we have set be­fore us with great clar­ity a panorama of mod­ern sta­tis­ti­cal meth­ods, as used in bi­ol­o­gy, med­i­cine, phys­i­cal sci­ence, so­cial and men­tal sci­ence, and in­dus­try. How far does this show that these meth­ods ful­fil their aims of analysing the data re­li­ably, and how many gaps are there still in our knowl­edge?…One fea­ture which can puz­zle an out­sider, and which re­quires much more jus­ti­fi­ca­tion than is usu­ally given, is the set­ting up of un­plau­si­ble null hy­pothe­ses. For ex­am­ple, a sta­tis­ti­cian may set out a test to see whether two drugs have ex­actly the same effect, or whether a re­gres­sion line is ex­actly straight. These hy­pothe­ses can scarcely be taken lit­er­al­ly, but a sta­tis­ti­cian may say, quite rea­son­ably, that he wishes to test whether there is an ap­pre­cia­ble differ­ence be­tween the effects of the two drugs, or an ap­pre­cia­ble cur­va­ture in the re­gres­sion line. But this raises at once the ques­tion: how large is ‘ap­pre­cia­ble’? Or in other words, are we not re­ally con­cerned with some kind of es­ti­ma­tion, rather than sig­nifi­cance?

Edwards 1963

“Bayesian sta­tis­ti­cal in­fer­ence for psy­cho­log­i­cal re­search”, Ed­wards et al 1963:

The most popular notion of a test is, roughly, a tentative decision between two hypotheses on the basis of data, and this is the notion that will dominate the present treatment of tests. Some qualification is needed if only because, in typical applications, one of the hypotheses—the null hypothesis—is known by all concerned to be false from the outset (Berkson, 1938; Hodges & Lehmann, 1954; Lehmann, 1959; I. R. Savage, 1957; L. J. Savage, 1954, p. 254); some ways of resolving the seeming absurdity will later be pointed out, and at least one of them will be important for us here…Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data; our list of situations that might stimulate hypothesis tests earlier in the section included several examples. Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. In such cases, the more the procedure is against the null hypothesis, the better. The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses.

Bakan 1966

“The test of sig­nifi­cance in psy­cho­log­i­cal re­search”, Bakan 1966:

Let us con­sider some of the diffi­cul­ties as­so­ci­ated with the null hy­poth­e­sis.

  1. The a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low. Nunnally (1960) has reported a similar experience involving correlation coefficients on 700 subjects. Joseph Berkson (1938) made the observation almost 30 years ago in connection with chi-square:

I be­lieve that an ob­ser­vant sta­tis­ti­cian who has had any con­sid­er­able ex­pe­ri­ence with ap­ply­ing the chi-square test re­peat­edly will agree with my state­ment that, as a mat­ter of ob­ser­va­tion, when the num­bers in the data are quite large, the P’s tend to come out small. Hav­ing ob­served this, and on re­flec­tion, I make the fol­low­ing dog­matic state­ment, re­fer­ring for il­lus­tra­tion to the nor­mal curve: “If the nor­mal curve is fit­ted to a body of data rep­re­sent­ing any real ob­ser­va­tions what­ever of quan­ti­ties in the phys­i­cal world, then if the num­ber of ob­ser­va­tions is ex­tremely large—­for in­stance, on an or­der of 200,000—the chi-square P will be small be­yond any usual limit of sig­nifi­cance.”

This dog­matic state­ment is made on the ba­sis of an ex­trap­o­la­tion of the ob­ser­va­tion re­ferred to and can also be de­fended as a pre­dic­tion from a pri­ori con­sid­er­a­tions. For we may as­sume that it is prac­ti­cally cer­tain that any se­ries of real ob­ser­va­tions does not ac­tu­ally fol­low a nor­mal curve with ab­solute ex­ac­ti­tude in all re­spects, and no mat­ter how small the dis­crep­ancy be­tween the nor­mal curve and the true curve of ob­ser­va­tions, the chi-square P will be small if the sam­ple has a suffi­ciently large num­ber of ob­ser­va­tions in it.

If this be so, then we have some­thing here that is apt to trou­ble the con­science of a re­flec­tive sta­tis­ti­cian us­ing the chi-square test. For I sup­pose it would be agreed by sta­tis­ti­cians that a large sam­ple is al­ways bet­ter than a small sam­ple. If, then, we know in ad­vance the P that will re­sult from an ap­pli­ca­tion of a chi-square test to a large sam­ple, there would seem to be no use in do­ing it on a smaller one. But since the re­sult of the for­mer test is known, it is no test at all [pp. 526–527].

As one group of au­thors has put it, “in typ­i­cal ap­pli­ca­tions . . . the null hy­poth­e­sis . . . is known by all con­cerned to be false from the out­set [Ed­wards et al, 1963, p. 214].” The fact of the mat­ter is that there is re­ally no good rea­son to ex­pect the null hy­poth­e­sis to be true in any pop­u­la­tion. Why should the mean, say, of all scores east of the Mis­sis­sippi be iden­ti­cal to all scores west of the Mis­sis­sip­pi? Why should any cor­re­la­tion co­effi­cient be ex­actly 0.00 in the pop­u­la­tion? Why should we ex­pect the ra­tio of males to fe­males be ex­actly 50:50 in any pop­u­la­tion? Or why should differ­ent drugs have ex­actly the same effect on any pop­u­la­tion pa­ra­me­ter (Smith, 1960)? A glance at any set of sta­tis­tics on to­tal pop­u­la­tions will quickly con­firm the rar­ity of the null hy­poth­e­sis in na­ture.

…Should there be any de­vi­a­tion from the null hy­poth­e­sis in the pop­u­la­tion, no mat­ter how small—and we have lit­tle doubt but that such a de­vi­a­tion usu­ally ex­ist­s—a suffi­ciently large num­ber of ob­ser­va­tions will lead to the re­jec­tion of the null hy­poth­e­sis. As Nun­nally (1960) put it,

if the null hy­poth­e­sis is not re­ject­ed, it is usu­ally be­cause the N is too small. If enough data are gath­ered, the hy­poth­e­sis will gen­er­ally be re­ject­ed. If re­jec­tion of the null hy­poth­e­sis were the real in­ten­tion in psy­cho­log­i­cal ex­per­i­ments, there usu­ally would be no need to gather data [p. 643].
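The claim that any non-zero deviation will eventually be rejected given enough observations can be made concrete by asking how large N must be for a given true correlation. The following is a rough sketch rather than anything in Bakan or Nunnally: it assumes a two-sided test at the 0.05 level with 80% power and the Fisher z approximation, and `n_to_reject` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_to_reject(rho: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate N at which a true correlation `rho` is detected.

    Standard Fisher z sample-size formula:
        N = ((z_{1-alpha/2} + z_{power}) / atanh(|rho|))^2 + 3
    """
    if rho == 0:
        raise ValueError("a true correlation of exactly 0 is never detectable")
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(abs(rho))) ** 2 + 3)

if __name__ == "__main__":
    # The required N grows roughly as 1 / rho^2, but it is always finite.
    for rho in (0.2, 0.05, 0.01, 0.001):
        print(f"rho = {rho}: N of about {n_to_reject(rho)}")
```

The required N is finite for every non-zero `rho`, however small, which is the sense in which failing to reject says more about the sample size than about the hypothesis.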

Meehl 1967

“Theory-testing in psychology and physics: A methodological paradox”, Meehl 1967:

One reason why the directional null hypothesis (H₀₂: μ_g ≤ μ_b) is the appropriate candidate for experimental refutation is the universal agreement that the old point-null hypothesis (H₀: μ_g = μ_b) is [quasi-] always false in biological and social science. Any dependent variable of interest, such as I.Q., or academic achievement, or perceptual speed, or emotional reactivity as measured by skin resistance, or whatever, depends mainly upon a finite number of “strong” variables characteristic of the organisms studied (embodying the accumulated results of their genetic makeup and their learning histories) plus the influences manipulated by the experimenter. Upon some complicated, unknown mathematical function of this finite list of “important” determiners is then superimposed an indefinitely large number of essentially “random” factors which contribute to the intragroup variation and therefore boost the error term of the statistical significance test. In order for two groups which differ in some identified properties (such as social class, intelligence, diagnosis, racial or religious background) to differ not at all in the “output” variable of interest, it would be necessary that all determiners of the output variable have precisely the same average values in both groups, or else that their values should differ by a pattern of amounts of difference which precisely counterbalance one another to yield a net difference of zero.
Now our gen­eral back­ground knowl­edge in the so­cial sci­ences, or, for that mat­ter, even “com­mon sense” con­sid­er­a­tions, makes such an ex­act equal­ity of all de­ter­min­ing vari­ables, or a pre­cise “ac­ci­den­tal” coun­ter­bal­anc­ing of them, so ex­tremely un­likely that no psy­chol­o­gist or sta­tis­ti­cian would as­sign more than a neg­li­gi­bly small prob­a­bil­ity to such a state of affairs.

Ex­am­ple: Sup­pose we are study­ing a sim­ple per­cep­tu­al-ver­bal task like rate of col­or-nam­ing in school chil­dren, and the in­de­pen­dent vari­able is fa­ther’s re­li­gious pref­er­ence. Su­per­fi­cial con­sid­er­a­tion might sug­gest that these two vari­ables would not be re­lat­ed, but a lit­tle thought leads one to con­clude that they will al­most cer­tainly be re­lated by some amount, how­ever small. Con­sid­er, for in­stance, that a child’s re­ac­tion to any sort of school-con­text task will be to some ex­tent de­pen­dent upon his so­cial class, since the de­sire to please aca­d­e­mic per­son­nel and the de­sire to achieve at a per­for­mance (just be­cause it is a task, re­gard­less of its in­trin­sic in­ter­est) are both re­lated to the kinds of sub­-cul­tural and per­son­al­ity traits in the par­ents that lead to up­ward mo­bil­i­ty, eco­nomic suc­cess, the gain­ing of fur­ther ed­u­ca­tion, and the like. Again, since there is known to be a sex differ­ence in color nam­ing, it is likely that fa­thers who have en­tered oc­cu­pa­tions more at­trac­tive to “fem­i­nine” males will (on the av­er­age) pro­vide a some­what more fem­i­nine fa­ther fig­ure for iden­ti­fi­ca­tion on the part of their male off­spring, and that a more re­fined color vo­cab­u­lary, mak­ing closer dis­crim­i­na­tions be­tween sim­i­lar hues, will be char­ac­ter­is­tic of the or­di­nary lan­guage of such a house­hold. Fur­ther, it is known that there is a cor­re­la­tion be­tween a child’s gen­eral in­tel­li­gence and its fa­ther’s oc­cu­pa­tion, and of course there will be some re­la­tion, even though it may be small, be­tween a child’s gen­eral in­tel­li­gence and his color vo­cab­u­lary, aris­ing from the fact that vo­cab­u­lary in gen­eral is heav­ily sat­u­rated with the gen­eral in­tel­li­gence fac­tor. 
Since re­li­gious pref­er­ence is a cor­re­late of so­cial class, all of these so­cial class fac­tors, as well as the in­tel­li­gence vari­able, would tend to in­flu­ence col­or-nam­ing per­for­mance. Or con­sider a more ex­treme and faint kind of re­la­tion­ship. It is quite con­ceiv­able that a child who be­longs to a more litur­gi­cal re­li­gious de­nom­i­na­tion would be some­what more col­or-ori­ented than a child for whom bright col­ors were not as­so­ci­ated with the re­li­gious life. Every­one fa­mil­iar with psy­cho­log­i­cal re­search knows that nu­mer­ous “puz­zling, un­ex­pected” cor­re­la­tions pop up all the time, and that it re­quires only a mod­er­ate amount of mo­ti­va­tion-plus-in­ge­nu­ity to con­struct very plau­si­ble al­ter­na­tive the­o­ret­i­cal ex­pla­na­tions for them.

…These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan 1966 and Nunnally 1960. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or nonmonotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10⁻⁶).

…Con­sid­er­ing the fact that “every­thing in the brain is con­nected with every­thing else,” and that there ex­ist sev­eral “gen­eral state-vari­ables” (such as arousal, at­ten­tion, anx­i­ety, and the like) which are known to be at least slightly in­flu­ence­able by prac­ti­cally any kind of stim­u­lus in­put, it is highly un­likely that any psy­cho­log­i­cally dis­crim­inable stim­u­la­tion which we ap­ply to an ex­per­i­men­tal sub­ject would ex­ert lit­er­ally zero effect upon any as­pect of his per­for­mance. The psy­cho­log­i­cal lit­er­a­ture abounds with ex­am­ples of small but de­tectable in­flu­ences of this kind. Thus it is known that if a sub­ject mem­o­rizes a list of non­sense syl­la­bles in the pres­ence of a faint odor of pep­per­mint, his re­call will be fa­cil­i­tated by the pres­ence of that odor. Or, again, we know that in­di­vid­u­als solv­ing in­tel­lec­tual prob­lems in a “messy” room do not per­form quite as well as in­di­vid­u­als work­ing in a neat, well-ordered sur­round. Again, cog­ni­tive processes un­dergo a de­tectable fa­cil­i­ta­tion when the think­ing sub­ject is con­cur­rently per­form­ing the ir­rel­e­vant, noncog­ni­tive task of squeez­ing a hand dy­namome­ter. It would re­quire con­sid­er­able in­ge­nu­ity to con­coct ex­per­i­men­tal ma­nip­u­la­tions, ex­cept the most min­i­mal and triv­ial (such as a very slight mod­i­fi­ca­tion in the word or­der of in­struc­tions given a sub­ject) where one could have con­fi­dence that the ma­nip­u­la­tion would be ut­terly with­out effect upon the sub­jec­t’s mo­ti­va­tional lev­el, at­ten­tion, arousal, fear of fail­ure, achieve­ment dri­ve, de­sire to please the ex­per­i­menter, dis­trac­tion, so­cial fear, etc., etc. 
So that, for ex­am­ple, while there is no very “in­ter­est­ing” psy­cho­log­i­cal the­ory that links hunger drive with col­or-nam­ing abil­i­ty, I my­self would con­fi­dently pre­dict a sig­nifi­cant differ­ence in col­or-nam­ing abil­ity be­tween per­sons tested after a full meal and per­sons who had not eaten for 10 hours, pro­vided the sam­ple size were suffi­ciently large and the col­or-nam­ing mea­sure­ments suffi­ciently re­li­able, since one of the effects of the in­creased hunger drive is height­ened “arousal,” and any­thing which height­ens arousal would be ex­pected to affect a per­cep­tu­al-cog­ni­tive per­for­mance like col­or-nam­ing. Suffice it to say that there are very good rea­sons for ex­pect­ing at least some slight in­flu­ence of al­most any ex­per­i­men­tal ma­nip­u­la­tion which would differ suffi­ciently in its form and con­tent from the ma­nip­u­la­tion im­posed upon a con­trol group to be in­cluded in an ex­per­i­ment in the first place. In what fol­lows I shall there­fore as­sume that the point-null hy­poth­e­sis H0 is, in psy­chol­o­gy, [qua­si-] al­ways false.

See also Waller 2004, and Meehl’s 2003 CSS talk, “Cri­tique of Null Hy­poth­e­sis Sig­nifi­cance Test­ing” (MP3 au­dio; slides).

Lykken 1968

“Sta­tis­ti­cal Sig­nifi­cance in Psy­cho­log­i­cal Re­search”, Lykken 1968:

Most the­o­ries in the ar­eas of per­son­al­i­ty, clin­i­cal, and so­cial psy­chol­ogy pre­dict no more than the di­rec­tion of a cor­re­la­tion, group differ­ence, or treat­ment effect. Since the null hy­poth­e­sis is never strictly true, such pre­dic­tions have about a 50-50 chance of be­ing con­firmed by ex­per­i­ment when the the­ory in ques­tion is false, since the sta­tis­ti­cal sig­nifi­cance of the re­sult is a func­tion of the sam­ple size.

…Most psychological experiments are of three kinds: (a) studies of the effect of some treatment on some output variables, which can be regarded as a special case of (b) studies of the difference between two or more groups of individuals with respect to some variable, which in turn are a special case of (c) the study of the relationship or correlation between two or more variables within some specified population. Using the bivariate correlation design as paradigmatic, then, one notes first that the strict null hypothesis must always be assumed to be false (this idea is not new and has recently been illuminated by Bakan, 1966). Unless one of the variables is wholly unreliable so that the values obtained are strictly random, it would be foolish to suppose that the correlation between any two variables is identically equal to 0.0000 . . . (or that the effect of some treatment or the difference between two groups is exactly zero). The molar dependent variables employed in psychological research are extremely complicated in the sense that the measured value of such a variable tends to be affected by the interaction of a vast number of factors, both in the present situation and in the history of the subject organism. It is exceedingly unlikely that any two such variables will not share at least some of these factors and equally unlikely that their effects will exactly cancel one another out.

It might be argued that the more complex the variables the smaller their average correlation ought to be since a larger pool of common factors allows more chance for mutual cancellation of effects in obedience to the Law of Large Numbers. However, one knows of a number of unusually potent and pervasive factors which operate to unbalance such convenient symmetries and to produce correlations large enough to rival the effects of whatever causal factors the experimenter may have had in mind. Thus, we know that (a) “good” psychological and physical variables tend to be positively correlated; (b) experimenters, without deliberate intention, can somehow subtly bias their findings in the expected direction (Rosenthal, 1963); (c) the effects of common method are often as strong as or stronger than those produced by the actual variables of interest (e.g., in a large and careful study of the factorial structure of adjustment to stress among officer candidates, Holtzman & Bitterman, 1956, found that their 101 original variables contained five main common factors representing, respectively, their rating scales, their perceptual-motor tests, the McKinney Reporting Test, their GSR variables, and the MMPI); (d) transitory state variables such as the subject’s anxiety level, fatigue, or his desire to please, may broadly affect all measures obtained in a single experimental session. This average shared variance of “unrelated” variables can be thought of as a kind of ambient noise level characteristic of the domain. It would be interesting to obtain empirical estimates of this quantity in our field to serve as a kind of Plimsoll mark against which to compare obtained relationships predicted by some theory under test.
If, as I think, it is not un­rea­son­able to sup­pose that “un­re­lated” mo­lar psy­cho­log­i­cal vari­ables share on the av­er­age about 4% to 5% of com­mon vari­ance, then the ex­pected cor­re­la­tion be­tween any such vari­ables would be about 0.20 in ab­solute value and the ex­pected differ­ence be­tween any two groups on some such vari­able would be nearly 0.5 stan­dard de­vi­a­tion units. (Note that these es­ti­mates as­sume zero mea­sure­ment er­ror. One can bet­ter ex­plain the near-zero cor­re­la­tions often ob­served in psy­cho­log­i­cal re­search in terms of un­re­li­a­bil­ity of mea­sures than in terms of the as­sump­tion that the true scores are in fact un­re­lat­ed.)
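Lykken's step from "4% to 5% of common variance" to a correlation of about 0.20 and a group difference of nearly 0.5 standard deviations can be retraced numerically. The sketch below is an illustration, not Lykken's own derivation: it takes the correlation as the square root of the shared variance, and it assumes the usual conversion from a point-biserial correlation to a standardized mean difference for two equal-sized groups, d = 2r / sqrt(1 - r²). The function names are hypothetical.

```python
import math

def r_from_shared_variance(shared: float) -> float:
    """Absolute correlation implied by a proportion of shared variance (r^2)."""
    return math.sqrt(shared)

def d_from_r(r: float) -> float:
    """Standardized mean difference implied by a point-biserial r.

    Assumes two groups of equal size; d = 2r / sqrt(1 - r^2).
    """
    return 2 * r / math.sqrt(1 - r ** 2)

if __name__ == "__main__":
    for shared in (0.04, 0.05):
        r = r_from_shared_variance(shared)
        print(f"shared variance {shared:.2f}: |r| = {r:.3f}, d = {d_from_r(r):.3f}")
```

Under these assumptions the implied difference comes out between roughly 0.4 and 0.5 SD, in the neighborhood of the figure Lykken gives.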

Nichols 1968

“Hered­i­ty, En­vi­ron­ment, and School Achieve­ment”, Nichols 1968:

There are three main fac­tors or types of vari­ables that seem likely to have an im­por­tant in­flu­ence on abil­ity and school achieve­ment. These are (a) the school fac­tor or or­ga­nized ed­u­ca­tional in­flu­ences; (b) the fam­ily fac­tor or all of the so­cial in­flu­ences of fam­ily life on a child; and (c) the ge­netic fac­tor…the sep­a­ra­tion of the effects of the ma­jor types of in­flu­ences has proved to be ex­tra­or­di­nar­ily diffi­cult, and all of the re­search so far has not re­sulted in a clear-cut con­clu­sion.

…This messy sit­u­a­tion is due pri­mar­ily to the fact that in hu­man so­ci­ety all good things tend to go to­geth­er. The most in­tel­li­gent par­ents—those with the best ge­netic po­ten­tial—also tend to pro­vide the most com­fort­able and in­tel­lec­tu­ally stim­u­lat­ing home en­vi­ron­ments for their chil­dren, and also tend to send their chil­dren to the most afflu­ent and well-e­quipped schools. Thus, the ubiq­ui­tous cor­re­la­tion be­tween fam­ily so­cio-e­co­nomic sta­tus and school achieve­ment is am­bigu­ous in mean­ing, and iso­lat­ing the in­de­pen­dent con­tri­bu­tion of the fac­tors in­volved is diffi­cult. How­ev­er, the strong emo­tion­ally mo­ti­vated at­ti­tudes and vested in­ter­ests in this area have also tended to in­hibit the sort of dis­pas­sion­ate, ob­jec­tive eval­u­a­tion of the avail­able ev­i­dence that is nec­es­sary for the ad­vance of sci­ence.

Hays 1973

Statistics for the social sciences (2nd edition), Hays 1973; chapter 10, pages 413–417:

10.19: Test­man­ship, or how big is a differ­ence?

…As we saw in Chap­ter 4, the com­plete ab­sence of a sta­tis­ti­cal re­la­tion, or no as­so­ci­a­tion, oc­curs only when the con­di­tional dis­tri­b­u­tion of the de­pen­dent vari­able is the same re­gard­less of which treat­ment is ad­min­is­tered. Thus if the in­de­pen­dent vari­able is not as­so­ci­ated at all with the de­pen­dent vari­able the pop­u­la­tion dis­tri­b­u­tions must be iden­ti­cal over the treat­ments. If, on the other hand, the means of the differ­ent treat­ment pop­u­la­tions are differ­ent, the con­di­tional dis­tri­b­u­tions them­selves must be differ­ent and the in­de­pen­dent and de­pen­dent vari­ables must be as­so­ci­at­ed. The re­jec­tion of the hy­poth­e­sis of no differ­ence be­tween pop­u­la­tion means is tan­ta­mount to the as­ser­tion that the treat­ment given does have some sta­tis­ti­cal as­so­ci­a­tion with the de­pen­dent vari­able score.

…How­ev­er, the oc­cur­rence of a sig­nifi­cant re­sult says noth­ing at all about the strength of the as­so­ci­a­tion be­tween treat­ment and score. A sig­nifi­cant re­sult leads to the in­fer­ence that some as­so­ci­a­tion ex­ists, but in no sense does this mean that an im­por­tant de­gree of as­so­ci­a­tion nec­es­sar­ily ex­ists. Con­verse­ly, ev­i­dence of a strong sta­tis­ti­cal as­so­ci­a­tion can oc­cur in data even when the re­sults are not sig­nifi­cant. The game of in­fer­ring the true de­gree of sta­tis­ti­cal as­so­ci­a­tion has a jok­er: this is the sam­ple size. The time has come to de­fine the no­tion of the strength of a sta­tis­ti­cal as­so­ci­a­tion more sharply, and to link this idea with that of the true differ­ence be­tween pop­u­la­tion means.

When does it seem appropriate to say that a strong association exists between the experimental factor X and the dependent variable Y? Over all of the different possibilities for X there is a probability distribution of Y values, which is the marginal distribution of Y over (x,y) events. The existence of this distribution implies that we do not know exactly what the Y value for any observation will be; we are always uncertain about Y to some extent. However, given any particular X, there is also a conditional distribution of Y, and it may be that in this conditional distribution the highly probable values of Y tend to “shrink” within a much narrower range than in the marginal distribution. If so, we can say that the information about X tends to reduce uncertainty about Y. In general we will say that the strength of a statistical relation is reflected by the extent to which knowing X reduces uncertainty about Y. One of the best indicators of our uncertainty about the value of a variable is σ², the variance of its distribution…This index reflects the predictive power afforded by a relationship: when ω² is zero, then X does not aid us at all in predicting the value of Y. On the other hand, when ω² is 1.00, this tells us that X lets us know Y exactly…About now you should be wondering what the index ω² has to do with the difference between population means.

…When the difference μ₁ − μ₂ is zero, then ω² must be zero. In the usual t-test for a difference, the hypothesis of no difference between means is equivalent to the hypothesis that ω² = 0. On the other hand, when there is any difference at all between population means, the value of ω² must be greater than 0. In short, a true difference is “big” in the sense of predictive power only if the square of that difference is large relative to . However, in significance tests such as t, we compare the difference we get with an estimate of σ_diff. The standard error of the difference can be made almost as small as we choose if we are given a free choice of sample size. Unless sample size is specified, there is no necessary connection between significance and the true strength of association.
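The contrast Hays draws here, between a strength-of-association index that does not depend on sample size and a test statistic that does, can be sketched in a few lines. The excerpt elides the definition of the index, so the code assumes the common reading of it as the proportional reduction in the variance of Y from knowing X, for two equally sized populations with a common within-group standard deviation. The function names and the example populations are hypothetical.

```python
import math

def omega_squared(mu1: float, mu2: float, sigma_within: float) -> float:
    """Proportion of Var(Y) attributable to group membership.

    Assumes two equally likely groups with a common within-group SD, so that
    Var(Y) = sigma_within^2 + ((mu1 - mu2) / 2)^2 by the law of total variance.
    """
    between = ((mu1 - mu2) / 2) ** 2  # variance of group means around the grand mean
    return between / (sigma_within ** 2 + between)

def z_statistic(mu1: float, mu2: float, sigma_within: float, n_per_group: int) -> float:
    """Expected z for the mean difference with n_per_group observations per group."""
    return abs(mu1 - mu2) / (sigma_within * math.sqrt(2 / n_per_group))

if __name__ == "__main__":
    # Hypothetical populations whose means differ by 0.1 SD.
    mu1, mu2, sigma = 0.0, 0.1, 1.0
    print(f"strength of association: {omega_squared(mu1, mu2, sigma):.4f}")
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n per group = {n}: expected z = {z_statistic(mu1, mu2, sigma, n):.2f}")
```

The index stays fixed at a fraction of a percent while the expected z grows with the square root of the sample size, so the same trivial association moves from "non-significant" to overwhelmingly "significant" purely as a function of N.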

This points up the fal­lacy of eval­u­at­ing the “good­ness” of a re­sult in terms of sta­tis­ti­cal sig­nifi­cance alone, with­out al­low­ing for the sam­ple size used. All sig­nifi­cant re­sults do not im­ply the same de­gree of true as­so­ci­a­tion be­tween in­de­pen­dent and de­pen­dent vari­ables.

It is sad but true that researchers have been known to capitalize on this fact. There is a certain amount of “testmanship” involved in using inferential statistics. Virtually any study can be made to show significant results if one uses enough subjects, regardless of how nonsensical the content may be. There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero. If one applies a large enough sample to the study of any relation, trivial or meaningless as it may be, sooner or later he is almost certain to achieve a significant result. Such a result may be a valid finding, but only in the sense that one can say with assurance that some association is not exactly zero. The degree to which such a finding enhances our knowledge is debatable. If the criterion of strength of association is applied to such a result, it becomes obvious that little or nothing is actually contributed to our ability to predict one thing from another.

For ex­am­ple, sup­pose that two meth­ods of teach­ing first grade chil­dren to read are be­ing com­pared. A ran­dom sam­ple of 1000 chil­dren are taught to read by method I, an­other sam­ple of 1000 chil­dren by method II. The re­sults of the in­struc­tion are eval­u­ated by a test that pro­vides a score, in whole units, for each child. Sup­pose that the re­sults turned out as fol­lows:

Method I        Method II
M1 = 147.21     M2 = 147.64
N1 = 1000       N2 = 1000

Then, the estimated standard error of the difference is about 0.145, and the z value is

z = (147.64 − 147.21) / 0.145 ≈ 2.97

This cer­tainly per­mits re­jec­tion of the null hy­poth­e­sis of no differ­ence be­tween the groups. How­ev­er, does it re­ally tell us very much about what to ex­pect of an in­di­vid­ual child’s score on the test, given the in­for­ma­tion that he was taught by method I or method II? If we look at the group of chil­dren taught by method II, and as­sume that the dis­tri­b­u­tion of their scores is ap­prox­i­mately nor­mal, we find that about 45% of these chil­dren fall be­low the mean score for chil­dren in group I. Sim­i­lar­ly, about 45% of chil­dren in group I fall above the mean score for group II. Al­though the differ­ence be­tween the two groups is sig­nifi­cant, the two groups ac­tu­ally over­lap a great deal in terms of their per­for­mances on the test. In this sense, the two groups are re­ally not very differ­ent at all, even though the differ­ence be­tween the means is quite sig­nifi­cant in a purely sta­tis­ti­cal sense.
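The figures in this example can be reproduced from the numbers given. The sketch below is a reconstruction, not Hays's own working: it assumes both groups share one standard deviation (backed out of the stated standard error of 0.145 with 1000 children per group) and that scores are normally distributed, as the passage itself supposes. Variable names are illustrative.

```python
import math
from statistics import NormalDist

m1, m2 = 147.21, 147.64   # group means from the example
n = 1000                  # children per group
se_diff = 0.145           # stated standard error of the difference

# Test statistic for the difference in means.
z = (m2 - m1) / se_diff
p_two_sided = 2 * (1 - NormalDist().cdf(z))

# Assumed common SD, from se_diff = sd * sqrt(2 / n).
sd = se_diff * math.sqrt(n / 2)

# Share of Method II children scoring below the Method I mean.
below = NormalDist(mu=m2, sigma=sd).cdf(m1)

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
print(f"implied SD = {sd:.2f}")
print(f"Method II children below Method I mean: {below:.1%}")
```

Under these assumptions the difference is significant at conventional levels while the means sit only about an eighth of a standard deviation apart, which is why roughly 45% of one group falls on the "wrong" side of the other group's mean.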

Putting the mat­ter in a slightly differ­ent way, we note that the grand mean of the two groups is 147.425. Thus, our best bet about the score of any child, not know­ing the method of his train­ing, is 147.425. If we guessed that any child drawn at ran­dom from the com­bined group should have a score above 147.425, we should be wrong about half the time. How­ev­er, among the orig­i­nal groups, ac­cord­ing to method I and method II, the pro­por­tions falling above and be­low this grand mean are ap­prox­i­mately as fol­lows:

             Below 147.425    Above 147.425
Method I     0.51             0.49
Method II    0.49             0.51

This implies that if we know a child is from group I, and we guess that his score is below the grand mean, then we will be wrong about 49% of the time. Similarly, if a child is from group II, and we guess his score to be above the grand mean, we will be wrong about 49% of the time. If we are not given the group to which the child belongs, and we guess either above or below the grand mean, we will be wrong about 50% of the time. Knowing the group does reduce the probability of error in such a guess, but it does not reduce it very much. The method by which the child was trained simply doesn’t tell us a great deal about what the child’s score will be, even though the difference in mean scores is significant in the statistical sense.

This kind of test­man­ship flour­ishes best when peo­ple pay too much at­ten­tion to the sig­nifi­cance test and too lit­tle to the de­gree of sta­tis­ti­cal as­so­ci­a­tion the find­ing rep­re­sents. This clut­ters up the lit­er­a­ture with find­ings that are often not worth pur­su­ing, and which serve only to ob­scure the re­ally im­por­tant pre­dic­tive re­la­tions that oc­ca­sion­ally ap­pear. The se­ri­ous sci­en­tist owes it to him­self and his read­ers to ask not on­ly, “Is there any as­so­ci­a­tion be­tween X and Y?” but al­so, “How much does my find­ing sug­gest about the power to pre­dict Y from X?” Much too much em­pha­sis is paid to the for­mer, at the ex­pense of the lat­ter, ques­tion.

Oakes 1975

“On the al­leged fal­sity of the null hy­poth­e­sis”, Oakes 1975:

Consideration is given to the contention by Bakan, Meehl, Nunnally, and others that the null hypothesis in behavioral research is generally false in nature and that if the N is large enough, it will always be rejected. A distinction is made between self-selected-groups research designs and true experiments, and it is suggested that the null hypothesis probably is generally false in the case of research involving the former design, but is not in the case of research involving the latter. Reasons for the falsity of the null hypothesis in the one case but not in the other are suggested.

The U.S. has recently reported the results of research on performance contracting. With 23,000 Ss—13,000 experimental and 10,000 control—the null hypothesis was not rejected. The experimental Ss, who received special instruction in reading and mathematics for 2 hours per day during the 1970–71 school year, did not differ significantly from the controls in achievement gains (American Institutes for Research, 1972, p. 5). Such an inability to reject the null hypothesis might not be surprising to the typical classroom teacher or to most educational psychologists, but in view of the huge N involved, it should give pause to Bakan (1966), who contends that the null hypothesis is generally false in behavioral research, as well as to those writers such as Nunnally (1960) and Meehl (1967), who agree with that contention. They hold that if the N is large enough, the null is sure to be rejected in behavioral research. This paper will suggest that the Falsity contention does not hold in the case of experimental research—that the null hypothesis is not generally false in such research.
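To gauge how sensitive a comparison of 13,000 against 10,000 subjects is, one can ask what standardized difference it would detect with good power. This is a rough sketch, assuming a simple two-sample z-test with unit variance in both groups, a two-sided 5% level, and 80% power; those settings and the function name are assumptions for illustration, not details of the original analysis.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(n1: int, n2: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest standardized mean difference detectable at the given power.

    Uses the normal approximation for a two-sided two-sample test with
    equal unit variances (an assumption of this sketch).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    standard_error = sqrt(1 / n1 + 1 / n2)
    return (z_alpha + z_power) * standard_error

mde = minimum_detectable_effect(13_000, 10_000)
print(f"minimum detectable effect: {mde:.3f} SD")
```

Under these assumptions the design is sensitive to differences of a few hundredths of a standard deviation, so a failure to reject would be consistent with any true effect being very small. Clustering of pupils within schools, which this sketch ignores, would make the real study less sensitive.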

Loehlin & Nichols 1976

Loehlin & Nichols 1976 (see also Heredity and Environment: Major Findings from Twin Studies of Ability, Personality, and Interests, Nichols 1976/1979):

This vol­ume re­ports on a study of 850 pairs of twins who were tested to de­ter­mine the in­flu­ence of hered­ity and en­vi­ron­ment on in­di­vid­ual differ­ences in per­son­al­i­ty, abil­i­ty, and in­ter­ests. It presents the back­ground, re­search de­sign, and pro­ce­dures of the study, a com­plete tab­u­la­tion of the test re­sults, and the au­thors’ ex­ten­sive analy­sis of their find­ings. Based on one of the largest stud­ies of twin be­hav­ior con­ducted in the twen­ti­eth cen­tu­ry, the book chal­lenges a num­ber of tra­di­tional be­liefs about ge­netic and en­vi­ron­men­tal con­tri­bu­tions to per­son­al­ity de­vel­op­ment.

The sub­jects were cho­sen from par­tic­i­pants in the Na­tional Merit Schol­ar­ship Qual­i­fy­ing Test of 1962 and were mailed a bat­tery of per­son­al­ity and in­ter­est ques­tion­naires. In ad­di­tion, par­ents of the twins were sent ques­tion­naires ask­ing about the twins’ early ex­pe­ri­ences. A sim­i­lar sam­ple of non­twin stu­dents who had taken the merit exam pro­vided a com­par­i­son group. The ques­tions in­ves­ti­gated in­cluded how twins are sim­i­lar to or differ­ent from non-twins, how iden­ti­cal twins are sim­i­lar to or differ­ent from fra­ter­nal twins, how the per­son­al­i­ties and in­ter­ests of twins re­flect ge­netic fac­tors, how the per­son­al­i­ties and in­ter­ests of twins re­flect early en­vi­ron­men­tal fac­tors, and what im­pli­ca­tions these ques­tions have for the gen­eral is­sue of how hered­ity and en­vi­ron­ment in­flu­ence the de­vel­op­ment of psy­cho­log­i­cal char­ac­ter­is­tics. In at­tempt­ing to an­swer these ques­tions, the au­thors shed light on the im­por­tance of both genes and en­vi­ron­ment and form the ba­sis for differ­ent ap­proaches in be­hav­ior ge­netic re­search.

The book is largely a dis­cus­sion of com­pre­hen­sive sum­mary sta­tis­tics of twin cor­re­la­tions from an early large-s­cale twin study (can­vassed via the Na­tional Merit Schol­ar­ship Qual­i­fy­ing Test, 1962). They at­tempted to com­pile a large-s­cale twin sam­ple with­out the bur­den of a ful­l-blown twin reg­istry by an ex­ten­sive mail sur­vey of the n = 1507 11th-grade ado­les­cent pairs of par­tic­i­pants in the high school Na­tional Merit Schol­ar­ship Qual­i­fy­ing Test of 1962 (to­tal n~600,000) who in­di­cated they were twins (as well as a con­trol sam­ple of non-twin­s), yield­ing 514 iden­ti­cal twin & 336 (same-sex) fra­ter­nal twin pairs; they were ques­tioned as fol­lows:

…to these [par­tic­i­pants] were mailed a bat­tery of per­son­al­ity and in­ter­est tests, in­clud­ing the Cal­i­for­nia Psy­cho­log­i­cal In­ven­tory (CPI), the Hol­land Vo­ca­tional Pref­er­ence In­ven­tory (VPI), an ex­per­i­men­tal Ob­jec­tive Be­hav­ior In­ven­tory (OBI), an Ad­jec­tive Check List (ACL), and a num­ber of oth­er, briefer self­-rat­ing scales, at­ti­tude mea­sures, and other items. In ad­di­tion, a par­ent was asked to fill out a ques­tion­naire de­scrib­ing the early ex­pe­ri­ences and home en­vi­ron­ment of the twins. Other brief ques­tion­naires were sent to teach­ers and friends, ask­ing them to rate the twins on a num­ber of per­son­al­ity traits; be­cause these rat­ings were avail­able for only part of our ba­sic sam­ple, they have not been an­a­lyzed in de­tail and will not be dis­cussed fur­ther in this book. (The par­ent and twin ques­tion­naires, ex­cept for the CPI, are re­pro­duced in Ap­pen­dix A.)

Un­usu­al­ly, the book in­cludes ap­pen­dices re­port­ing raw twin-pair cor­re­la­tions for all of the re­ported items, not a mere hand­ful of se­lected analy­ses on full test-s­cales or sub­fac­tors. (Be­cause of this, I was able to ex­tract vari­ables re­lated to leisure time pref­er­ences & ac­tiv­i­ties for .) One can see that even down to the item lev­el, her­i­tabil­i­ties tend to be non-zero and most vari­ables are cor­re­lated with­in-in­di­vid­u­als or with en­vi­ron­ments as well.

Meehl 1978

“The­o­ret­i­cal risks and tab­u­lar as­ter­isks: Sir Karl, Sir Ronald, and the slow progress of soft psy­chol­ogy”, Meehl 1978:

Since the null hy­poth­e­sis is qua­si­-al­ways false, ta­bles sum­ma­riz­ing re­search in terms of pat­terns of “sig­nifi­cant differ­ences” are lit­tle more than com­plex, causally un­in­ter­pretable out­comes of sta­tis­ti­cal power func­tions.

The kinds of the­o­ries and the kinds of the­o­ret­i­cal risks to which we put them in soft psy­chol­ogy when we use sig­nifi­cance test­ing as our method are not like test­ing Meehl’s the­ory of weather by see­ing how well it fore­casts the num­ber of inches it will rain on cer­tain days. In­stead, they are de­press­ingly close to test­ing the the­ory by see­ing whether it rains in April at all, or rains sev­eral days in April, or rains in April more than in May. It hap­pens mainly be­cause, as I be­lieve is gen­er­ally rec­og­nized by sta­tis­ti­cians to­day and by thought­ful so­cial sci­en­tists, the null hy­poth­e­sis, taken lit­er­al­ly, is al­ways false. I shall not at­tempt to doc­u­ment this here, be­cause among so­phis­ti­cated per­sons it is taken for grant­ed. (See Mor­ri­son & Henkel, 1970 [The Sig­nifi­cance Test Con­tro­ver­sy: A Reader], es­pe­cially the chap­ters by Bakan, Hog­ben, Lykken, Meehl, and .) A lit­tle re­flec­tion shows us why it has to be the case, since an out­put vari­able such as adult IQ, or aca­d­e­mic achieve­ment, or effec­tive­ness at com­mu­ni­ca­tion, or what­ev­er, will al­ways, in the so­cial sci­ences, be a func­tion of a siz­able but fi­nite num­ber of fac­tors. (The small­est con­tri­bu­tions may be con­sid­ered as es­sen­tially a ran­dom vari­ance ter­m.) In or­der for two groups (males and fe­males, or whites and blacks, or manic de­pres­sives and schiz­o­phren­ics, or Re­pub­li­cans and De­moc­rats) to be ex­actly equal on such an out­put vari­able, we have to imag­ine that they are ex­actly equal or del­i­cately coun­ter­bal­anced on all of the con­trib­u­tors in the causal equa­tion, which will never be the case.

Fol­low­ing the gen­eral line of rea­son­ing (p­re­sented by my­self and sev­eral oth­ers over the last decade), from the fact that the null hy­poth­e­sis is al­ways false in soft psy­chol­o­gy, it fol­lows that the prob­a­bil­ity of re­fut­ing it de­pends wholly on the sen­si­tiv­ity of the ex­per­i­men­t—its log­i­cal de­sign, the net (at­ten­u­at­ed) con­struct va­lid­ity of the mea­sures, and, most im­por­tant­ly, the sam­ple size, which de­ter­mines where we are on the sta­tis­ti­cal power func­tion. Putting it crude­ly, if you have enough cases and your mea­sures are not to­tally un­re­li­able, the null hy­poth­e­sis will al­ways be fal­si­fied, re­gard­less of the truth of the sub­stan­tive the­ory. Of course, it could be fal­si­fied in the wrong di­rec­tion, which means that as the power im­proves, the prob­a­bil­ity of a cor­rob­o­ra­tive re­sult ap­proaches one-half. How­ev­er, if the the­ory has no verisimil­i­tude—­such that we can imag­ine, so to speak, pick­ing our em­pir­i­cal re­sults ran­domly out of a di­rec­tional hat apart from any the­o­ry—the prob­a­bil­ity of re­fut­ing by get­ting a sig­nifi­cant differ­ence in the wrong di­rec­tion also ap­proaches one-half. Ob­vi­ous­ly, this is quite un­like the sit­u­a­tion de­sired from ei­ther a Bayesian, a Pop­pe­ri­an, or a com­mon­sense sci­en­tific stand­point. As I have pointed out else­where (Meehl, 1967/1970b; but see crit­i­cism by Oakes, 1975; Keuth, 1973; and re­but­tal by Swoyer & Mon­son, 1975), an im­prove­ment in in­stru­men­ta­tion or other sources of ex­per­i­men­tal ac­cu­racy tends, in physics or as­tron­omy or chem­istry or ge­net­ics, to sub­ject the the­ory to a greater risk of refu­ta­tion modus tol­lens, whereas im­proved pre­ci­sion in null hy­poth­e­sis test­ing usu­ally de­creases this risk. 
A suc­cess­ful sig­nifi­cance test of a sub­stan­tive the­ory in soft psy­chol­ogy pro­vides a fee­ble cor­rob­o­ra­tion of the the­ory be­cause the pro­ce­dure has sub­jected the the­ory to a fee­ble risk.
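The argument about power can be illustrated numerically. This is a minimal sketch, assuming a small true correlation and the Fisher z approximation for the test of zero correlation; the value rho = 0.02, the sample sizes, and the function names are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_two_sided(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power to reject rho = 0 (two-sided) via the Fisher z transform."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = atanh(rho) * sqrt(n - 3)  # expected test statistic under the true rho
    return z.cdf(shift - critical) + z.cdf(-shift - critical)

def chance_corroboration(rho: float, n: int) -> float:
    """Probability of a significant result in the predicted direction when the
    predicted direction was picked at random (a theory with no verisimilitude)."""
    return 0.5 * power_two_sided(rho, n)

for n in (100, 10_000, 1_000_000):
    print(n, round(power_two_sided(0.02, n), 3), round(chance_corroboration(0.02, n), 3))
```

As n grows the power climbs toward 1 and the chance of a "corroborating" result from a randomly chosen direction climbs toward one-half, which is the feeble risk the passage describes.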

…I am not mak­ing some nit-pick­ing sta­tis­ti­cian’s cor­rec­tion. I am say­ing that the whole busi­ness is so rad­i­cally de­fec­tive as to be sci­en­tifi­cally al­most point­less… I am mak­ing a philo­soph­i­cal com­plaint or, if you prefer, a com­plaint in the do­main of sci­en­tific method. I sug­gest that when a re­viewer tries to “make the­o­ret­i­cal sense” out of such a ta­ble of fa­vor­able and ad­verse sig­nifi­cance test re­sults, what the re­viewer is ac­tu­ally en­gaged in, willy-nilly or un­wit­ting­ly, is mean­ing­less sub­stan­tive con­struc­tions on the prop­er­ties of the sta­tis­ti­cal power func­tion, and al­most noth­ing else.

…You may say, “But, Meehl, R. A. Fisher was a ge­nius, and we all know how valu­able his stuff has been in agron­o­my. Why should­n’t it work for soft psy­chol­o­gy?” Well, I am not in­tim­i­dated by Fish­er’s ge­nius, be­cause my com­plaint is not in the field of math­e­mat­i­cal sta­tis­tics, and as re­gards in­duc­tive logic and phi­los­o­phy of sci­ence, it is well-known that Sir Ronald per­mit­ted him­self a great deal of dog­ma­tism. I re­mem­ber my amaze­ment when the late said to me, the first time I met him, “But, of course, on this sub­ject Fisher is just mis­tak­en: surely you must know that.” My sta­tis­ti­cian friends tell me that it is not clear just how use­ful the sig­nifi­cance test has been in bi­o­log­i­cal sci­ence ei­ther, but I set that aside as be­yond my com­pe­tence to dis­cuss.

Loftus & Loftus 1982

Essence of Statistics, Loftus & Loftus 1982/1988 (2nd ed), pg515–516 (pg498–499 in the 1982 printing):

Relative Importance Of These Three Measures. It is a matter of some debate as to which of these three measures [σ²/p/R²] we should pay the most attention to in an experiment. It’s our opinion that finding a “significant effect” really provides very little information because it’s almost certainly true that some relationship (however small) exists between any two variables. And in general finding a significant effect simply means that enough observations have been collected in the experiment to make the statistical test of the experiment powerful enough to detect whatever effect there is. The smaller the effect, the more powerful the experiment needs to be of course, but no matter how small the effect, it’s always possible in principle to design an experiment sufficiently powerful to detect it. We saw a striking example of this principle in the office hours experiment. In this experiment there was a relationship between the two variables—and since there were so many subjects in the experiment (that is, since the test was so powerful), this relationship was revealed in the statistical analysis. But was it anything to write home about? Certainly not. In any sort of practical context the size of the effect, although nonzero, is so small it can almost be ignored.

It is our judg­ment that ac­count­ing for vari­ance is re­ally much more mean­ing­ful than test­ing for sig­nifi­cance.
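The contrast between testing for significance and accounting for variance can be made concrete. This is a small sketch, assuming an observed correlation of 0.02 in a sample of 100,000 and the Fisher z approximation for the p-value; both numbers are hypothetical and are not the office hours data.

```python
from math import atanh, sqrt
from statistics import NormalDist

def significance_and_variance(r: float, n: int) -> tuple[float, float]:
    """Return (two-sided p-value, proportion of variance accounted for) for an
    observed correlation r in a sample of size n, using the Fisher z approximation."""
    z_stat = atanh(r) * sqrt(n - 3)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return p_value, r ** 2

p_value, r_squared = significance_and_variance(r=0.02, n=100_000)
print(f"p = {p_value:.2g}, variance accounted for = {r_squared:.4%}")
```

The p-value is far below any conventional threshold, while the variance accounted for is a small fraction of one percent.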

Meehl 1990 (1)

“Why sum­maries of re­search on psy­cho­log­i­cal the­o­ries are often un­in­ter­pretable”, Meehl 1990a (also dis­cussed in Co­hen’s 1994 pa­per “The Earth is Round (p<.05)”):

Prob­lem 6. Crud fac­tor: In the so­cial sci­ences and ar­guably in the bi­o­log­i­cal sci­ences, “every­thing cor­re­lates to some ex­tent with every­thing else.” This tru­ism, which I have found no com­pe­tent psy­chol­o­gist dis­putes given five min­utes re­flec­tion, does not ap­ply to pure ex­per­i­men­tal stud­ies in which at­trib­utes that the sub­jects bring with them are not the sub­ject of study (ex­cept in so far as they ap­pear as a source of er­ror and hence in the de­nom­i­na­tor of a sig­nifi­cance test).6 There is noth­ing mys­te­ri­ous about the fact that in psy­chol­ogy and so­ci­ol­ogy every­thing cor­re­lates with every­thing. Any mea­sured trait or at­tribute is some func­tion of a list of partly known and mostly un­known causal fac­tors in the genes and life his­tory of the in­di­vid­u­al, and both ge­netic and en­vi­ron­men­tal fac­tors are known from tons of em­pir­i­cal re­search to be them­selves cor­re­lat­ed. To take an ex­treme case, sup­pose we con­strue the null hy­poth­e­sis lit­er­ally (ob­ject­ing that we mean by it “al­most null” gets ahead of the sto­ry, and de­stroys the rigor of the Fish­er­ian math­e­mat­ic­s!) and ask whether we ex­pect males and fe­males in Min­nesota to be pre­cisely equal in some ar­bi­trary trait that has in­di­vid­ual differ­ences, say, color nam­ing. In the case of color nam­ing we could think of some ob­vi­ous differ­ences right off, but even if we did­n’t know about them, what is the causal sit­u­a­tion? If we write a causal equa­tion (which is not the same as a re­gres­sion equa­tion for pure pre­dic­tive pur­poses but which, if we had it, would serve bet­ter than the lat­ter) so that the score of an in­di­vid­ual male is some func­tion (pre­sum­ably non­lin­ear if we knew enough about it but here sup­posed lin­ear for sim­plic­i­ty) of a rather long set of causal vari­ables of ge­netic and en­vi­ron­men­tal type X1, X2, … Xm. 
These val­ues are op­er­ated upon by re­gres­sion co­effi­cients b1, b2, …bm.

…Now we write a similar equation for the class of females. Can anyone suppose that the beta coefficients for the two sexes will be exactly the same? Can anyone imagine that the mean values of all of the Xs will be exactly the same for males and females, even if the culture were not still considerably sexist in child-rearing practices and the like? If the betas are not exactly the same for the two sexes, and the mean values of the Xs are not exactly the same, what kind of Leibnitzian preestablished harmony would we have to imagine in order for the mean color-naming score to come out exactly equal between males and females? It boggles the mind; it simply would never happen. As Einstein said, “the Lord God is subtle, but He is not malicious.” We cannot imagine that nature is out to fool us by this kind of delicate balancing. Anybody familiar with large scale research data takes it as a matter of course that when the N gets big enough she will not be looking for the statistically significant correlations but rather looking at their patterns, since almost all of them will be significant. In saying this, I am not going counter to what is stated by mathematical statisticians or psychologists with statistical expertise. For example, the standard psychologist’s textbook, the excellent treatment by Hays (1973, page 415), explicitly states that, taken literally, the null hypothesis is always false.
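The "delicate balancing" point can be sketched with the linear causal equation itself. The example below assumes 50 causal variables with randomly drawn coefficients and means, and gives the second group values that differ from the first by tiny random amounts; every number here is hypothetical and chosen only for illustration.

```python
import random

def group_mean(betas: list[float], x_means: list[float]) -> float:
    """Mean outcome implied by a linear causal equation: sum of b_i * mean(X_i)."""
    return sum(b * x for b, x in zip(betas, x_means))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
m = 50                  # hypothetical number of causal variables

male_betas = [rng.gauss(0, 1) for _ in range(m)]
male_x = [rng.gauss(0, 1) for _ in range(m)]
# Females: the same coefficients and means, each perturbed by a tiny amount.
female_betas = [b + rng.gauss(0, 0.01) for b in male_betas]
female_x = [x + rng.gauss(0, 0.01) for x in male_x]

difference = group_mean(male_betas, male_x) - group_mean(female_betas, female_x)
print(f"difference in group means: {difference:.4f}")
```

For the group means to coincide exactly, the many small perturbations would have to cancel to the last decimal place, which is the preestablished harmony the passage dismisses.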

20 years ago David Lykken and I conducted an exploratory study of the crud factor which we never published but I shall summarize it briefly here. (I offer it not as “empirical proof”—that H0 taken literally is quasi-always false hardly needs proof and is generally admitted—but as a punchy and somewhat amusing example of an insufficiently appreciated truth about soft correlational psychology.) In 1966, the University of Minnesota Student Counseling Bureau’s Statewide Testing Program administered a questionnaire to 57,000 high school seniors, the items dealing with family facts, attitudes toward school, vocational and educational plans, leisure time activities, school organizations, etc. We cross-tabulated a total of 15 (and then 45) variables including the following (the number of categories for each variable given in parentheses): father’s occupation (7), father’s education (9), mother’s education (9), number of siblings (10), birth order (only, oldest, youngest, neither), educational plans after high school (3), family attitudes towards college (3), do you like school (3), sex (2), college choice (7), occupational plan in ten years (20), and religious preference (20). In addition, there were 22 “leisure time activities” such as “acting,” “model building,” “cooking,” etc., which could be treated either as a single 22-category variable or as 22 dichotomous variables. There were also 10 “high school organizations” such as “school subject clubs,” “farm youth groups,” “political clubs,” etc., which also could be treated either as a single ten-category variable or as ten dichotomous variables. Considering the latter two variables as multichotomies gives a total of 15 variables producing 105 different cross-tabulations.
All values of χ² for these 105 cross-tabulations were statistically significant, and 101 (96%) of them were significant with a probability of less than 10⁻⁶.

…If “leisure activity” and “high school organizations” are considered as separate dichotomies, this gives a total of 45 variables and 990 different crosstabulations. Of these, 92% were statistically significant and more than 78% were significant with a probability less than 10⁻⁶. Looked at in another way, the median number of significant relationships between a given variable and all the others was 41 out of a possible 44!
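The counts of cross-tabulations, and the ease with which a sample of 57,000 pushes a small association past extreme significance levels, can both be checked with a short sketch. The 2×2 table below is hypothetical (52% versus 50% answering "yes" in two groups of 28,500) and is not taken from the Minnesota data; the function name is also an illustrative choice.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """Pearson chi-square (1 df, no continuity correction) for the table
    [[a, b], [c, d]]; returns (chi2, p-value, phi coefficient)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    phi = sqrt(chi2 / n)            # effect size on a correlation-like scale
    return chi2, p_value, phi

# Number of distinct pairs among 15 and among 45 variables.
print(comb(15, 2), comb(45, 2))

# Hypothetical table: 52% vs 50% answering "yes" in two groups of 28,500.
chi2, p_value, phi = chi_square_2x2(14_820, 13_680, 14_250, 14_250)
print(f"chi2 = {chi2:.1f}, p = {p_value:.1e}, phi = {phi:.3f}")
```

A two-percentage-point difference, with a phi coefficient of about 0.02, already yields a p-value several orders of magnitude below 0.05 at this sample size.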

We also computed MCAT scores by category for the following variables: number of siblings, birth order, sex, occupational plan, and religious preference. Highly significant deviations from chance allocation over categories were found for each of these variables. For example, the females score higher than the males; MCAT score steadily and markedly decreases with increasing numbers of siblings; eldest or only children are significantly brighter than youngest children; there are marked differences in MCAT scores between those who hope to become nurses and those who hope to become nurses aides, or between those planning to be farmers, engineers, teachers, or physicians; and there are substantial MCAT differences among the various religious groups. We also tabulated the five principal Protestant religious denominations (Baptist, Episcopal, Lutheran, Methodist, and Presbyterian) against all the other variables, finding highly significant relationships in most instances. For example, only children are nearly twice as likely to be Presbyterian than Baptist in Minnesota, more than half of the Episcopalians “usually like school” but only 45% of Lutherans do, 55% of Presbyterians feel that their grades reflect their abilities as compared to only 47% of Episcopalians, and Episcopalians are more likely to be male whereas Baptists are more likely to be female. 83% of Baptist children said that they enjoyed dancing as compared to 68% of Lutheran children. More than twice the proportion of Episcopalians plan to attend an out of state college than is true for Baptists, Lutherans, or Methodists. The proportion of Methodists who plan to become conservationists is nearly twice that for Baptists, whereas the proportion of Baptists who plan to become receptionists is nearly twice that for Episcopalians.

In ad­di­tion, we tab­u­lated the four prin­ci­pal Lutheran Syn­ods (Mis­souri, ALC, LCA, and Wis­con­sin) against the other vari­ables, again find­ing highly sig­nifi­cant re­la­tion­ships in most cas­es. Thus, 5.9% of Wis­con­sin Synod chil­dren have no sib­lings as com­pared to only 3.4% of Mis­souri Synod chil­dren. 58% of ALC Luther­ans are in­volved in play­ing a mu­si­cal in­stru­ment or singing as com­pared to 67% of Mis­souri Synod Luther­ans. 80% of Mis­souri Synod Luther­ans be­long to school or po­lit­i­cal clubs as com­pared to only 71% of LCA Luther­ans. 49% of ALC Luther­ans be­long to de­bate, dra­mat­ics, or mu­si­cal or­ga­ni­za­tions in high school as com­pared to only 40% of Mis­souri Synod Luther­ans. 36% of LCA Luther­ans be­long to or­ga­nized non-school youth groups as com­pared to only 21% of Wis­con­sin Synod Luther­ans. [Pre­ced­ing text cour­tesy of D. T. Lykken.]

These re­la­tion­ships are not, I re­peat, Type I er­rors. They are facts about the world, and with N = 57,000 they are pretty sta­ble. Some are the­o­ret­i­cally easy to ex­plain, oth­ers more diffi­cult, oth­ers com­pletely baffling. The “easy” ones have mul­ti­ple ex­pla­na­tions, some­times com­pet­ing, usu­ally not. Draw­ing the­o­ries from a pot and as­so­ci­at­ing them whim­si­cally with vari­able pairs would yield an im­pres­sive batch of H0-re­fut­ing “con­fir­ma­tions.”

An­other amus­ing ex­am­ple is the be­hav­ior of the items in the 550 items of the MMPI pool with re­spect to sex. Only 60 items ap­pear on the Mf scale, about the same num­ber that were put into the pool with the hope that they would dis­crim­i­nate fem­i­nin­i­ty. It turned out that over half the items in the scale were not put in the pool for that pur­pose, and of those that were, a bare ma­jor­ity did the job. Scale de­riva­tion was based on item analy­sis of a small group of cri­te­rion cases of male ho­mo­sex­ual in­vert syn­drome, a sig­nifi­cant differ­ence on a rather small N of Dr. Starke Hath­away’s pri­vate pa­tients be­ing then con­joined with the re­quire­ment of dis­crim­i­nat­ing be­tween male nor­mals and fe­male nor­mals. When the N be­comes very large as in the data pub­lished by Swen­son, Pear­son, and Os­borne (1973; An MMPI Source Book: Ba­sic Item, Scale, And Pat­tern Data On 50,000 Med­ical Pa­tients. Min­neapolis, MN: Uni­ver­sity of Min­nesota Press.), ap­prox­i­mately 25,000 of each sex tested at the Mayo Clinic over a pe­riod of years, it turns out that 507 of the 550 items dis­crim­i­nate the sex­es. Thus in a het­ero­ge­neous item pool we find only 8% of items fail­ing to show a sig­nifi­cant differ­ence on the sex di­choto­my. The fol­low­ing are sex-dis­crim­i­na­tors, the male/fe­male differ­ences rang­ing from a few per­cent­age points to over 30%:7

  • Some­times when I am not feel­ing well I am cross.
  • I be­lieve there is a Devil and a Hell in after­life.
  • I think nearly any­one would tell a lie to keep out of trou­ble.
  • Most peo­ple make friends be­cause friends are likely to be use­ful to them.
  • I like po­et­ry.
  • I like to cook.
  • Po­lice­men are usu­ally hon­est.
  • I some­times tease an­i­mals.
  • My hands and feet are usu­ally warm enough.
  • I think Lin­coln was greater than Wash­ing­ton.
  • I am cer­tainly lack­ing in self­-con­fi­dence.
  • Any man who is able and will­ing to work hard has a good chance of suc­ceed­ing.

I in­vite the reader to guess which di­rec­tion scores “fem­i­nine.” Given this in­for­ma­tion, I find some items easy to “ex­plain” by one ob­vi­ous the­o­ry, oth­ers have com­pet­ing plau­si­ble ex­pla­na­tions, still oth­ers are baffling.

Note that we are not deal­ing here with some source of sta­tis­ti­cal er­ror (the oc­cur­rence of ran­dom sam­pling fluc­tu­a­tion­s). That source of er­ror is lim­ited by the sig­nifi­cance level we choose, just as the prob­a­bil­ity of Type II er­ror is set by ini­tial choice of the sta­tis­ti­cal pow­er, based upon a pi­lot study or other an­tecedent data con­cern­ing an ex­pected av­er­age differ­ence. Since in so­cial sci­ence every­thing cor­re­lates with every­thing to some ex­tent, due to com­plex and ob­scure causal in­flu­ences, in con­sid­er­ing the crud fac­tor we are talk­ing about real differ­ences, real cor­re­la­tions, real trends and pat­terns for which there is, of course, some true but com­pli­cated mul­ti­vari­ate causal the­o­ry. I am not sug­gest­ing that these cor­re­la­tions are fun­da­men­tally un­ex­plain­able. They would be com­pletely ex­plained if we had the knowl­edge of Om­ni­scient Jones, which we don’t. The point is that we are in the weak sit­u­a­tion of cor­rob­o­rat­ing our par­tic­u­lar sub­stan­tive the­ory by show­ing that X and Y are “re­lated in a non­chance man­ner,” when our the­ory is too weak to make a nu­mer­i­cal pre­dic­tion or even (usu­al­ly) to set up a range of ad­mis­si­ble val­ues that would be counted as cor­rob­o­ra­tive.

…Some psy­chol­o­gists play down the in­flu­ence of the ubiq­ui­tous crud fac­tor, what David Lykken (1968) calls the “am­bi­ent cor­re­la­tional noise” in so­cial sci­ence, by say­ing that we are not in dan­ger of be­ing mis­led by small differ­ences that show up as sig­nifi­cant in gi­gan­tic sam­ples. How much that soft­ens the blow of the crud fac­tor’s in­flu­ence de­pends upon the crud fac­tor’s av­er­age size in a given re­search do­main, about which nei­ther I nor any­body else has ac­cu­rate in­for­ma­tion. But the no­tion that the cor­re­la­tion be­tween ar­bi­trar­ily paired trait vari­ables will be, while not lit­er­ally ze­ro, of such mi­nus­cule size as to be of no im­por­tance, is surely wrong. Every­body knows that there is a set of de­mo­graphic fac­tors, some un­der­stood and oth­ers quite mys­te­ri­ous, that cor­re­late quite re­spectably with a va­ri­ety of traits. (So­cioe­co­nomic sta­tus, SES, is the one usu­ally con­sid­ered, and fre­quently as­sumed to be only in the “in­put” causal role.) The clin­i­cal scales of the MMPI were de­vel­oped by em­pir­i­cal key­ing against a set of dis­junct noso­log­i­cal cat­e­gories, some of which are phe­nom­e­no­log­i­cally and psy­cho­dy­nam­i­cally op­po­site to oth­ers. Yet the 45 pair­wise cor­re­la­tions of these scales are al­most al­ways pos­i­tive (s­cale Ma pro­vides most of the neg­a­tives) and a rep­re­sen­ta­tive size is in the neigh­bor­hood of 0.35 to 0.40. The same is true of the scores on the Strong Vo­ca­tional In­ter­est Blank, where I find an av­er­age ab­solute value cor­re­la­tion close to 0.40. The ma­lig­nant in­flu­ence of so-called “meth­ods co­vari­ance” in psy­cho­log­i­cal re­search that re­lies upon tasks or tests hav­ing cer­tain kinds of be­hav­ioral sim­i­lar­i­ties such as ques­tion­naires or ink blots is com­mon­place and a reg­u­lar source of con­cern to clin­i­cal and per­son­al­ity psy­chol­o­gists. 
For fur­ther dis­cus­sion and ex­am­ples of crud fac­tor size, see Meehl (1990).

Now suppose we imagine a society of psychologists doing research in this soft area, and each investigator sets his experiments up in a whimsical, irrational manner as follows: First he picks a theory at random out of the theory pot. Then he picks a pair of variables randomly out of the observable variable pot. He then arbitrarily assigns a direction (you understand there is no intrinsic connection of content between the substantive theory and the variables, except once in a while there would be such by coincidence) and says that he is going to test the randomly chosen substantive theory by pretending that it predicts—although in fact it does not, having no intrinsic contentual relation—a positive correlation between randomly chosen observational variables X and Y. Now suppose that the crud factor operative in the broad domain were 0.30, that is, the average correlation between all of the variables pairwise in this domain is 0.30. This is not sampling error but the true correlation produced by some complex unknown network of genetic and environmental factors. Suppose he divides a normal distribution of subjects at the median and uses all of his cases (which frequently is not what is done, although if properly treated statistically that is not methodologically sinful). Let us take variable X as the “input” variable (never mind its causal role). The mean score of the cases in the top half of the distribution will then be at one mean deviation, that is, in standard score terms they will have an average score of 0.80. Similarly, the subjects in the bottom half of the X distribution will have a mean standard score of -0.80. So the mean difference in standard score terms between the high and low Xs, the one “experimental” and the other “control” group, is 1.6.
If the re­gres­sion of out­put vari­able Y on X is ap­prox­i­mately lin­ear, this yields an ex­pected differ­ence in stan­dard score terms of 0.48, so the differ­ence on the ar­bi­trar­ily de­fined “out­put” vari­able Y is in the neigh­bor­hood of half a stan­dard de­vi­a­tion.

When the in­ves­ti­ga­tor runs a t-test on these data, what is the prob­a­bil­ity of achiev­ing a sta­tis­ti­cally sig­nifi­cant re­sult? This de­pends upon the sta­tis­ti­cal power func­tion and hence upon the sam­ple size, which varies wide­ly, more in soft psy­chol­ogy be­cause of the na­ture of the data col­lec­tion prob­lems than in ex­per­i­men­tal work. I do not have ex­act fig­ures, but an in­for­mal scan­ning of sev­eral is­sues of jour­nals in the soft ar­eas of clin­i­cal, ab­nor­mal, and so­cial gave me a rep­re­sen­ta­tive value of the num­ber of cases in each of two groups be­ing com­pared at around N1 = N2 = 37 (that’s a me­dian be­cause of the skew­ness, sam­ple sizes rang­ing from a low of 17 in one clin­i­cal study to a high of 1,000 in a so­cial sur­vey study). As­sum­ing equal vari­ances, this gives us a stan­dard er­ror of the mean differ­ence of 0.2357 in sig­ma-u­nits, so that our t is a lit­tle over 2.0. The sub­stan­tive the­ory in a real life case be­ing al­most in­vari­ably pre­dic­tive of a di­rec­tion (it is hard to know what sort of sig­nifi­cance test­ing we would be do­ing oth­er­wise), the 5% level of con­fi­dence can be le­git­i­mately taken as one-tailed and in fact could be crit­i­cized if it were not (as­sum­ing that the 5% level of con­fi­dence is given the usual spe­cial mag­i­cal sig­nifi­cance afforded it by so­cial sci­en­tist­s!). The di­rec­tional 5% level be­ing at 1.65, the ex­pected value of our t-test in this sit­u­a­tion is ap­prox­i­mately 0.35 t units from the re­quired sig­nifi­cance lev­el. Things be­ing es­sen­tially nor­mal for 72 df, this gives us a power of de­tect­ing a differ­ence of around 0.64.

How­ev­er, since in our imag­ined “ex­per­i­ment” the as­sign­ment of di­rec­tion was ran­dom, the prob­a­bil­ity of de­tect­ing a differ­ence in the pre­dicted di­rec­tion (even though in re­al­ity this pre­dic­tion was not me­di­ated by any ra­tio­nal re­la­tion of con­tent) is only half of that. Even this con­ser­v­a­tive power based upon the as­sump­tion of a com­pletely ran­dom as­so­ci­a­tion be­tween the the­o­ret­i­cal sub­stance and the pseudo­pre­dicted di­rec­tion should give one pause. We find that the prob­a­bil­ity of get­ting a pos­i­tive re­sult from a the­ory with no verisimil­i­tude what­so­ev­er, as­so­ci­ated in a to­tally whim­si­cal fash­ion with a pair of vari­ables picked ran­domly out of the ob­ser­va­tional pot, is one chance in three! This is quite differ­ent from the 0.05 level that peo­ple usu­ally think about. Of course, the rea­son for this is that the 0.05 level is based upon strictly hold­ing H0 if the the­ory were false. Where­as, be­cause in the so­cial sci­ences every­thing is cor­re­lated with every­thing, for epis­temic pur­poses (de­spite the rigor of the math­e­mati­cian’s ta­bles) the true base­line—if the the­ory has noth­ing to do with re­al­ity and has only a chance re­la­tion­ship to it (so to speak, “any con­nec­tion be­tween the the­ory and the facts is purely co­in­ci­den­tal”) - is 6 or 7 times as great as the re­as­sur­ing 0.05 level upon which the psy­chol­o­gist fo­cuses his mind. If the crud fac­tor in a do­main were run­ning around 0.40, the power func­tion is 0.86 and the “di­rec­tional power” for ran­dom the­o­ry/pre­dic­tion pair­ings would be 0.43.
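Meehl's arithmetic in the passage above can be retraced in a few lines. The following is only a sketch under the quoted assumptions (median split on X, 37 cases per group, one-tailed critical value of 1.65, normal approximation); the function name `meehl_power` is introduced here for illustration. Because it carries the unrounded t through, the crud-factor-0.30 case comes out near 0.65 rather than the quoted 0.64.

```python
from statistics import NormalDist

def meehl_power(crud_r, n_per_group=37, one_tailed_crit=1.65):
    """Approximate power of a one-tailed two-group test after a median split on X."""
    # Mean standard score of the upper half of a normal distribution,
    # quoted as 0.80 (the exact value is sqrt(2/pi), about 0.798).
    half_mean = 0.80
    x_gap = 2 * half_mean            # 1.6 sigma between high-X and low-X groups
    y_gap = crud_r * x_gap           # expected gap on Y if the regression is linear
    # Standard error of the mean difference in sigma units. The quoted 0.2357
    # corresponds to sqrt(2 / (n - 1)) with n = 37 per group.
    se = (2 / (n_per_group - 1)) ** 0.5
    expected_t = y_gap / se
    # Normal approximation: chance that the observed t clears the critical value.
    power = NormalDist().cdf(expected_t - one_tailed_crit)
    return y_gap, se, expected_t, power

for r in (0.30, 0.40):
    y_gap, se, t, power = meehl_power(r)
    print(f"crud r={r:.2f}  Y gap={y_gap:.2f}  SE={se:.4f}  t={t:.2f}  "
          f"power={power:.2f}  directional power={power / 2:.2f}")
```

Halving the power reflects the random assignment of direction: a theory with no content picks the correct sign only half the time.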

…A sim­i­lar sit­u­a­tion holds for psy­chopathol­o­gy, and for many vari­ables in per­son­al­ity mea­sure­ment that re­fer to as­pects of so­cial com­pe­tence on the one hand or im­pair­ment of in­ter­per­sonal func­tion (as in men­tal ill­ness) on the oth­er. Thorndike had a dic­tum “All good things tend to go to­geth­er.”

Meehl 1990 (2)

“Ap­prais­ing and amend­ing the­o­ries: the strat­egy of Lakatosian de­fense and two prin­ci­ples that war­rant us­ing it”, Meehl 1990b:

Re­search in the be­hav­ioral sci­ences can be ex­per­i­men­tal, cor­re­la­tion­al, or field study (in­clud­ing clin­i­cal); only the first two are ad­dressed here. For rea­sons to be ex­plained (Meehl, 1990c), I treat as cor­re­la­tional those ex­per­i­men­tal stud­ies in which the chief the­o­ret­i­cal test pro­vided in­volves an in­ter­ac­tion effect be­tween an ex­per­i­men­tal ma­nip­u­la­tion and an in­di­vid­u­al-d­iffer­ences vari­able (whether trait, sta­tus, or de­mo­graph­ic). In cor­re­la­tional re­search there arises a spe­cial prob­lem for the so­cial sci­en­tist from the em­pir­i­cal fact that “every­thing is cor­re­lated with every­thing, more or less.” My col­league David Lykken presses the point fur­ther to in­clude most, if not all, purely ex­per­i­men­tal re­search de­signs, say­ing that, speak­ing causal­ly, “Every­thing in­flu­ences every­thing”, a stronger the­sis that I nei­ther as­sert nor deny but that I do not rely on here. The ob­vi­ous fact that every­thing is more or less cor­re­lated with every­thing in the so­cial sci­ences is read­ily fore­seen from the arm­chair on com­mon-sense con­sid­er­a­tions. These are strength­ened by more ad­vanced the­o­ret­i­cal ar­gu­ments in­volv­ing such con­cepts as ge­netic link­age, au­to-cat­alytic effects be­tween cog­ni­tive and affec­tive process­es, traits re­flect­ing in­flu­ences such as child-rear­ing prac­tices cor­re­lated with in­tel­li­gence, eth­nic­i­ty, so­cial class, re­li­gion, and so forth. 
If one asks, to take a triv­ial and the­o­ret­i­cally un­in­ter­est­ing ex­am­ple, whether we might ex­pect to find so­cial class differ­ences in a col­or-nam­ing test, there im­me­di­ately spring to mind nu­mer­ous in­flu­ences, rang­ing from (a) ver­bal in­tel­li­gence lead­ing to bet­ter ver­bal dis­crim­i­na­tions and re­ten­tion of color names to (b) class differ­ences in ma­ter­nal teach­ing be­hav­ior (which one can read­ily ob­serve by watch­ing moth­ers ex­plain things to their chil­dren at a zoo) to (c) more sub­tle—but still nonze­ro—in­flu­ences, such as up­per-class chil­dren be­ing more likely An­gli­cans than Bap­tists, hence ex­posed to the changes in litur­gi­cal col­ors dur­ing the church year! Ex­am­ples of such mul­ti­ple pos­si­ble in­flu­ences are so easy to gen­er­ate, I shall re­sist the temp­ta­tion to go on. If some­body asks a psy­chol­o­gist or so­ci­ol­o­gist whether she might ex­pect a nonzero cor­re­la­tion be­tween den­tal caries and IQ, the best guess would be yes, small but sta­tis­ti­cally sig­nifi­cant. A small neg­a­tive cor­re­la­tion was in fact found dur­ing the 1920s, mis­lead­ing some hy­gien­ists to hold that IQ was low­ered by tox­ins from de­cayed teeth. (The re­ceived ex­pla­na­tion to­day is that den­tal caries and IQ are both cor­re­lates of so­cial class.) More than 75 years ago, Ed­ward Lee Thorndike enun­ci­ated the fa­mous dic­tum, “All good things tend to go to­geth­er, as do all bad ones.” Al­most all hu­man per­for­mance (work com­pe­tence) dis­po­si­tions, if care­fully stud­ied, are sat­u­rated to some ex­tent with the gen­eral in­tel­li­gence fac­tor g, which for psy­cho­dy­namic and ide­o­log­i­cal rea­sons has been some­what ne­glected in re­cent years but is due for a come­back (Betz, 1986).11

The ubiquity of nonzero correlations gives rise to what is methodologically disturbing to the theory tester and what I call, following Lykken, the crud factor. I have discussed this at length elsewhere (Meehl, 1990c), so I only summarize and provide a couple of examples here. The main point is that, when the sample size is sufficiently large to produce accurate estimates of the population values, almost any pair of variables in psychology will be correlated to some extent. Thus, for instance, less than 10% of the items in the MMPI item pool were put into the pool with masculinity-femininity in mind, and the empirically derived Mf scale contains only some of those plus others put into the item pool for other reasons, or without any theoretical considerations. When one samples thousands of individuals, it turns out that only 43 of the 550 items (8%) fail to show a significant difference between males and females. In an unpublished study (but see Meehl, 1990c) of the hobbies, interests, vocational plans, school course preferences, social life, and home factors of Minnesota college freshmen, when Lykken and I ran chi squares on all possible pairwise combinations of variables, 92% were significant, and 78% were significant at p < 10⁻⁶. Looked at another way, the median number of significant relationships between a given variable and all the others was 41 of a possible 44. One finds such oddities as a relationship between which kind of shop courses boys preferred in high school and which of several Lutheran synods they belonged to!

…The third ob­jec­tion is some­what harder to an­swer be­cause it would re­quire an en­cy­clo­pe­dic sur­vey of re­search lit­er­a­ture over many do­mains. It is ar­gued that, al­though the crud fac­tor is ad­mit­tedly ubiq­ui­tous—that is, al­most no cor­re­la­tions of the so­cial sci­ences are lit­er­ally zero (as re­quired by the usual sig­nifi­cance test)—the crud fac­tor is in most re­search do­mains not large enough to be worth wor­ry­ing about. With­out mak­ing a claim to know just how big it is, I think this ob­jec­tion is pretty clearly un­sound. Doubt­less the av­er­age cor­re­la­tion of any ran­domly picked pair of vari­ables in so­cial sci­ence de­pends on the do­main, and also on the in­stru­ments em­ployed (e.g., it is well known that per­son­al­ity in­ven­to­ries often have as much meth­od­s-co­vari­ance as they do cri­te­rion va­lidi­ties).

A representative pairwise correlation among MMPI scales, despite the marked differences (sometimes amounting to phenomenological “oppositeness”) of the nosological rubrics on which they were derived, is in the middle to high 0.30s, in both normal and abnormal populations. The same is true for the occupational keys of the . Deliberately aiming to diversify the qualitative features of cognitive tasks (and thus “purify” the measures) in his classic studies of primary mental abilities (“pure factors,” orthogonal), Thurstone (1938; Thurstone & Thurstone, 1941) still found an average intertest correlation of .28 (range = .01 to .56!) in the cross-validation sample. In the set of 20 scales built to cover broadly the domain of (normal range) “folk-concept” traits, Gough (1987) found an average pairwise correlation of .44 among both males and females. Guilford’s Social Introversion, Thinking Introversion, Depression, Cycloid Tendencies, and Rhathymia or Freedom From Care scales, constructed on the basis of (orthogonal) factors, showed pairwise correlations ranging from -.02 to .85, with 5 of the 10 rs ≥ .33 despite the purification effort (Evans & McConnell, 1941). Any treatise on factor analysis exemplifying procedures with empirical data suffices to make the point convincingly. For example, in Harman (1960), eight “emotional” variables correlate .10 to .87, median r = .44 (p. 176), and eight “political” variables correlate .03 to .88, median (absolute value) r = .62 (p. 178). For highly diverse acquiescence-corrected measures (personality traits, interests, hobbies, psychopathology, social attitudes, and religious, political, and moral opinions), estimating individuals’ (orthogonal!) factor scores, one can hold mean rs down to an average of .12, means from .04 to .20, still some individual rs > .30 (Lykken, personal communication, 1990; cf. McClosky & Meehl, in preparation). Public opinion polls and attitude surveys routinely disaggregate data with respect to several demographic variables (e.g., age, education, section of country, sex, ethnicity, religion, education, income, rural/urban, self-described political affiliation) because these factors are always correlated with attitudes or electoral choices, sometimes strongly so. One must also keep in mind that socioeconomic status, although intrinsically interesting (especially to sociologists) is probably often functioning as a proxy for other unmeasured personality or status characteristics that are not part of the definition of social class but are, for a variety of complicated reasons, correlated with it. The proxy role is important because it prevents adequate “controlling for” unknown (or unmeasured) crud-factor influences by statistical procedures (matching, partial correlation, analysis of covariance, ). [ie “residual confounding”]

  • Thur­stone, L. L. (1938). Pri­mary men­tal abil­i­ties. Chicago: Uni­ver­sity of Chicago Press.
  • Gough, H. G. (1987). CPI, Ad­min­is­tra­tor’s guide. Palo Al­to, CA: Con­sult­ing Psy­chol­o­gists Press.
  • Mc­Closky, Her­bert, & Meehl, P. E. (in prepa­ra­tion). Ide­olo­gies in con­flict.12

Tukey 1991

“The phi­los­o­phy of mul­ti­ple com­par­isons”, Tukey 1991:

Sta­tis­ti­cians clas­si­cally asked the wrong ques­tion—and were will­ing to an­swer with a lie, one that was often a down­right lie. They asked “Are the effects of A and B differ­ent?” and they were will­ing to an­swer “no”.

All we know about the world teaches us that the effects of A and B are al­ways differ­en­t—in some dec­i­mal place—­for any A and B. Thus ask­ing “Are the effects differ­ent?” is fool­ish.

What we should be answering first is “Can we tell the direction in which the effects of A differ from the effects of B?” In other words, can we be confident about the direction from A to B? Is it “up”, “down”, or “uncertain”?

Raftery 1995

“Bayesian Model Se­lec­tion in So­cial Re­search (with Dis­cus­sion by An­drew Gel­man & Don­ald B. Ru­bin, and Robert M. Hauser, and a Re­join­der)”, Raftery 1995:

In the past 15 years, how­ev­er, some quan­ti­ta­tive so­ci­ol­o­gists have been at­tach­ing less im­por­tance to p-val­ues be­cause of prac­ti­cal diffi­cul­ties and coun­ter-in­tu­itive re­sults. These diffi­cul­ties are most ap­par­ent with large sam­ples, where p-val­ues tend to in­di­cate re­jec­tion of the null hy­poth­e­sis even when the null model seems rea­son­able the­o­ret­i­cally and in­spec­tion of the data fails to re­veal any strik­ing dis­crep­an­cies with it. Be­cause much so­ci­o­log­i­cal re­search is based on sur­vey data, often with thou­sands of cas­es, so­ci­ol­o­gists fre­quently come up against this prob­lem. In the early 1980s, some so­ci­ol­o­gists dealt with this prob­lem by ig­nor­ing the re­sults of p-val­ue-based tests when they seemed coun­ter-in­tu­itive, and by bas­ing model se­lec­tion in­stead on the­o­ret­i­cal con­sid­er­a­tions and in­for­mal as­sess­ment of dis­crep­an­cies be­tween model and data (e.g. Fien­berg and Ma­son, 1979; Hout, 1983, 1984; Grusky and Hauser, 1984).

…It is clear that models 1 and 2 are unsatisfactory and should be rejected in favor of model 3. By the standard test, model 3 should also be rejected, in favor of model 4, given the deviance difference of 150 on 16 degrees of freedom, corresponding to a p-value of about 10⁻¹²⁰. Grusky and Hauser (1984) nevertheless adopted model 3 because it explains most (99.7%) of the deviance under the baseline model of independence, fits well in the sense that the differences between observed and expected counts are a small proportion of the total, and makes good theoretical sense. This seems sensible, and yet is in dramatic conflict with the p-value-based test. This type of conflict often arises in large samples, and hence is frequent in sociology with its survey data sets comprising thousands of cases. The main response to it has been to claim that there is a distinction between “statistical” and “substantive” significance, with differences that are statistically significant not necessarily being substantively important.
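The tail probability attached to a deviance difference can be checked directly, since for even degrees of freedom the chi-square survival function has a closed form. The helper below, `chi2_sf_even_df`, is a name introduced for this sketch, which assumes the usual chi-square reference distribution for the deviance difference. For 150 on 16 degrees of freedom it returns a value on the order of 10⁻²³, which is many orders of magnitude larger than the figure printed in the quotation but still far beyond any conventional threshold, so the conflict Raftery describes is unaffected.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)**i / i!,
    which is exact when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series

p_value = chi2_sf_even_df(150, 16)   # deviance difference and df from the quotation
print(f"p = {p_value:.2e}")
```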

Thompson 1995

“Ed­i­to­r­ial Poli­cies Re­gard­ing Sta­tis­ti­cal Sig­nifi­cance Test­ing: Three Sug­gested Re­forms”, Thomp­son 1995:

One serious problem with this statistical testing logic is that in reality H0 is never true in the population, as recognized by any number of prominent statisticians (Tukey, 1991), i.e., there will always be some differences in population parameters, although the differences may be incredibly trivial. Near 40 years ago Savage (1957, pp. 332–333) noted that, “Null hypotheses of no difference are usually known to be false before the data are collected.” Subsequently, Meehl (1978, p. 822) argued, “As I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false.” Similarly, noted statistician Hays (1981, p. 293 [Statistics], 3rd ed.) pointed out that “[t]here is surely nothing on earth that is completely independent of anything else. The strength of association may approach zero, but it should seldom or never be exactly zero.” And Loftus and Loftus (1982, pp. 498–499) argued that, “finding a ‘[statistically] significant effect’ really provides very little information, because it’s almost certain that some relationship (however small) exists between any two variables.” The very important implication of all this is that statistical significance testing primarily becomes only a test of researcher endurance, because “virtually any study can be made to show [statistically] significant results if one uses enough subjects” (Hays, 1981, p. 293). As Nunnally (1960, p. 643) noted some 35 years ago, “If the null hypothesis is not rejected, it is usually because the N is too small. If enough data are gathered, the hypothesis will generally be rejected.” The implication is that:

Sta­tis­ti­cal sig­nifi­cance test­ing can in­volve a tau­to­log­i­cal logic in which tired re­searchers, hav­ing col­lected data from hun­dreds of sub­jects, then con­duct a sta­tis­ti­cal test to eval­u­ate whether there were a lot of sub­jects, which the re­searchers al­ready know, be­cause they col­lected the data and know they’re tired. This tau­tol­ogy has cre­ated con­sid­er­able dam­age as re­gards the cu­mu­la­tion of knowl­edge… (Thomp­son, 1992, p. 436)
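To make the "researcher endurance" point concrete, one can ask how many subjects are needed before a given true correlation is likely to be declared significant. The sketch below uses the standard Fisher z approximation; the function name `n_to_detect`, the 80% power target, and the example correlations are illustrative assumptions, not values taken from Thompson.

```python
import math
from statistics import NormalDist

def n_to_detect(r, alpha=0.05, power=0.80):
    """Approximate N for a two-sided test of rho = 0 to reach the given power.

    Fisher z approximation: atanh(r_hat) is roughly normal with
    standard error 1 / sqrt(N - 3).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

for r in (0.30, 0.10, 0.01, 0.001):
    print(f"true r = {r:<6} needs N of about {n_to_detect(r):,}")
```

The required N grows roughly with 1/r², so any nonzero correlation becomes detectable once the sample is large enough.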

Mulaik et al 1997

“There Is a Time and a Place for Sig­nifi­cance Test­ing”, Mu­laik et al 1997 (in What If There Were No Sig­nifi­cance Tests ed Har­low et al 1997):

Most of these ar­ti­cles ex­pose mis­con­cep­tions about sig­nifi­cance test­ing com­mon among re­searchers and writ­ers of psy­cho­log­i­cal text­books on sta­tis­tics and mea­sure­ment. But the crit­i­cisms do not stop with mis­con­cep­tions about sig­nifi­cance test­ing. Oth­ers like Meehl (1967) ex­pose the lim­i­ta­tions of a sta­tis­ti­cal prac­tice that fo­cuses only on test­ing for zero differ­ences be­tween means and zero cor­re­la­tions in­stead of test­ing pre­dic­tions about spe­cific nonzero val­ues for pa­ra­me­ters de­rived from the­ory or prior ex­pe­ri­ence, as is done in the phys­i­cal sci­ences. Still oth­ers em­pha­size that sig­nifi­cance tests do not alone con­vey the in­for­ma­tion needed to prop­erly eval­u­ate re­search find­ings and per­form ac­cu­mu­la­tive re­search.

…Other than em­pha­siz­ing a need to prop­erly un­der­stand the in­ter­pre­ta­tion of con­fi­dence in­ter­vals, we have no dis­agree­ments with these crit­i­cisms and pro­pos­als. But a few of the crit­ics go even fur­ther. In this chap­ter we will look at ar­gu­ments made by Carver (1978), Co­hen (1994), , Schmidt (1992, 1996), and Schmidt and Hunter (chap­ter 3 of this vol­ume), in fa­vor of not merely rec­om­mend­ing the re­port­ing of point es­ti­mates of effect sizes and con­fi­dence in­ter­vals based on them, but of aban­don­ing al­to­gether the use of sig­nifi­cance tests in re­search. Our fo­cus will be prin­ci­pally on Schmidt’s (1992, 1996) pa­pers, be­cause they in­cor­po­rate ar­gu­ments from ear­lier pa­pers, es­pe­cially Carver’s (1978), and also carry the ar­gu­ment to its most ex­treme con­clu­sions. Where ap­pro­pri­ate, we will also com­ment on Schmidt and Hunter’s (chap­ter 3 of this vol­ume) re­but­tal of ar­gu­ments against their po­si­tion.

The Null Hy­poth­e­sis Is Al­ways False?

Co­hen (1994), in­flu­enced by Meehl (1978), ar­gued that “the nil hy­poth­e­sis is al­ways false” (p. 1000). Get a large enough sam­ple and you will al­ways re­ject the null hy­poth­e­sis. He cites a num­ber of em­i­nent sta­tis­ti­cians in sup­port of this view. He quotes Tukey (1991, p. 100) to the effect that there are al­ways differ­ences be­tween ex­per­i­men­tal treat­ments-for some dec­i­mal places. Co­hen cites an un­pub­lished study by Meehl and Lykken in which cross tab­u­la­tions for 15 Min­nesota Mul­ti­pha­sic Per­son­al­ity In­ven­tory (MMPI) items for a sam­ple of 57,000 sub­jects yielded 105 chi-square tests of as­so­ci­a­tion and every one of them was sig­nifi­cant, and 96% of them were sig­nifi­cant at p<.000001 (Co­hen, 1994, p. 1000). Co­hen cites Meehl (1990) as sug­gest­ing that this re­flects a “crud fac­tor” in na­ture. “Every­thing is re­lated to every­thing else” to some de­gree. So, the ques­tion is, why do a sig­nifi­cance test if you know it will al­ways be sig­nifi­cant if the sam­ple is large enough? But if this is an em­pir­i­cal hy­poth­e­sis, is it not one that is es­tab­lished us­ing sig­nifi­cance test­ing?
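The figure of 105 tests is simply the number of distinct pairs among 15 items. A quick check follows, with the number of rejections expected by chance added for comparison; the 5% level and the independence of the tests are assumptions of this sketch, not details reported for the study.

```python
import math

def pairwise_tests(n_items):
    """Number of distinct item pairs, i.e. one association test per pair."""
    return math.comb(n_items, 2)

n_tests = pairwise_tests(15)
# Rejections expected if every null hypothesis were true (assumed alpha = 0.05).
expected_by_chance = 0.05 * n_tests
print(n_tests, expected_by_chance)
```

Against roughly five chance rejections, the reported outcome was that all 105 tests were significant.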

But the ex­am­ple may not be an apt demon­stra­tion of the prin­ci­ple Co­hen sought to es­tab­lish: It is gen­er­ally ex­pected that re­sponses to differ­ent items re­sponded to by the same sub­jects are not in­de­pen­dently dis­trib­uted across sub­jects, so it would not be re­mark­able to find sig­nifi­cant cor­re­la­tions be­tween many such items.

Much more interesting would be to demonstrate systematic and replicable significant treatment effects when subjects are assigned at random to different treatment groups but the same treatments are administered to each group. But in this case, small but significant effects in studies with high power that deviate from expectations of no effect when no differences in treatments are administered are routinely treated as systematic experimenter errors, and knowledge of experimental technique is improved by their detection and removal or control. Systematic error and experimental artifact must always be considered a possibility when rejecting the null hypothesis. Nevertheless, do we know a priori that a test will always be significant if the sample is large enough? Is the proposition “Every statistical hypothesis is false” an axiom that needs no testing? Actually, we believe that to regard this as an axiom would introduce an internal contradiction into statistical reasoning, comparable to arguing that all propositions and descriptions are false. You could not think and reason about the world with such an axiom. So it seems preferable to regard this as some kind of empirical generalization. But no empirical generalization is ever incorrigible and beyond testing. Nevertheless, if indeed there is a phenomenon of nature known as “the crud factor,” then it is something we know to be objectively a fact only because of significance tests. Something in the background noise stands out as a signal against that noise, because we have sufficiently powerful tests using huge samples to detect it. At that point it may become a challenge to science to develop a better understanding of what produces it. However, it may turn out to reflect only experimenter artifact. But in any case the hypothesis of a crud factor is not beyond further testing.

The point is that it does­n’t mat­ter if the null hy­poth­e­sis is al­ways judged false at some sam­ple size, as long as we re­gard this as an em­pir­i­cal phe­nom­e­non. What mat­ters is whether at the sam­ple size we have we can dis­tin­guish ob­served de­vi­a­tions from our hy­poth­e­sized val­ues to be suffi­ciently large and im­prob­a­ble un­der a hy­poth­e­sis of chance that we can treat them rea­son­ably but pro­vi­sion­ally as not due to chance er­ror. There is no a pri­ori rea­son to be­lieve that one will al­ways re­ject the null hy­poth­e­sis at any given sam­ple size. On the other hand, ac­cept­ing the null hy­poth­e­sis does not mean the hy­poth­e­sized value is true, but rather that the ev­i­dence ob­served is not dis­tin­guish­able from what we would re­gard as due to chance if the null hy­poth­e­sis were true and thus is not suffi­cient to dis­prove it. The re­main­ing un­cer­tainty re­gard­ing the truth of our null hy­poth­e­sis is mea­sured by the width of the re­gion of ac­cep­tance or a func­tion of the stan­dard er­ror. And this will be closely re­lated to the power of the test, which also pro­vides us with in­for­ma­tion about our un­cer­tain­ty. The fact that the width of the re­gion of ac­cep­tance shrinks with in­creas­ing sam­ple size, means we are able to re­duce our un­cer­tainty re­gard­ing the pro­vi­sional va­lid­ity of an ac­cepted null hy­poth­e­sis with larger sam­ples. In huge sam­ples the is­sue of un­cer­tainty due to chance looms not as im­por­tant as it does in small- and mod­er­ate-size sam­ples.
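The shrinking "region of acceptance" can be illustrated for a test of zero correlation. This is a sketch using the Fisher z approximation; `critical_r` and the sample sizes shown are illustrative choices, not values from Mulaik et al.

```python
import math
from statistics import NormalDist

def critical_r(n, alpha=0.05):
    """Smallest |r| rejected by a two-sided test of rho = 0 (Fisher z approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The acceptance region on the Fisher z scale is +/- z_crit / sqrt(n - 3);
    # tanh maps its edge back to the correlation scale.
    return math.tanh(z_crit / math.sqrt(n - 3))

for n in (30, 100, 1_000, 100_000):
    print(f"N = {n:>7,}: the null is retained only if |r| < {critical_r(n):.4f}")
```

At small N a retained null is compatible with quite large correlations, while at very large N it pins the correlation close to zero, which is the sense in which uncertainty about an accepted null shrinks.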

Waller 2004

“The fal­lacy of the null hy­poth­e­sis in soft psy­chol­ogy”, Waller 2004:

In his classic article on the fallacy of the null hypothesis in soft psychology, Paul Meehl claimed that, in nonexperimental settings, the probability of rejecting the null hypothesis of nil group differences in favor of a directional alternative was 0.50—a value that is an order of magnitude higher than the customary Type I error rate. In a series of real data simulations, using Minnesota Multiphasic Personality Inventory-Revised (MMPI-2) data collected from more than 80,000 individuals, I found strong support for Meehl’s claim.

…Be­fore run­ning the ex­per­i­ments I re­al­ized that, to be fair to Meehl, I needed a large data set with a broad range of bioso­cial vari­ables. For­tu­nate­ly, I had ac­cess to data from 81,485 in­di­vid­u­als who ear­lier had com­pleted the 567 items of the Min­nesota Mul­ti­pha­sic Per­son­al­ity In­ven­to­ry-Re­vised (MMPI-2; Butcher, Dahlstrom, Gra­ham, Tel­le­gen, & Kaem­mer, 1989). The MMPI-2, in my opin­ion, is an ideal ve­hi­cle for test­ing Meehl’s claim be­cause it in­cludes items in such var­ied con­tent do­mains as gen­eral health con­cerns; per­sonal habits and in­ter­ests; at­ti­tudes to­wards sex, mar­riage, and fam­i­ly; affec­tive func­tion­ing; nor­mal range per­son­al­i­ty; and ex­treme man­i­fes­ta­tions of psy­chopathol­ogy (for a more com­plete de­scrip­tion of the la­tent con­tent of the MMPI, see Waller, 1999, “Search­ing for struc­ture in the MMPI”).

…Next, the com­puter se­lected (with­out re­place­ment) a ran­dom item from the pool of MMPI-2 items. Us­ing data from the 41,491 males and 39,994 fe­males, it then (a) per­formed a differ­ence of pro­por­tions test on the item group means; (b) recorded the signed z-val­ue; and (c) recorded the as­so­ci­ated sig­nifi­cance lev­el. Fi­nal­ly, the pro­gram tal­lied the num­ber of “sig­nifi­cant” test re­sults (i.e., those with |z|≥1.96). The re­sults of this mini sim­u­la­tion were en­light­en­ing and in ex­cel­lent ac­cord with the out­come of Meehl’s gedanken ex­per­i­ment. Specifi­cal­ly, 46% of the di­rec­tional hy­pothe­ses were sup­ported at sig­nifi­cance lev­els that far ex­ceeded tra­di­tional p-value cut­offs. A sum­mary of the re­sults is por­trayed in Fig. 1. No­tice in this fig­ure, which dis­plays the dis­tri­b­u­tion of z-val­ues for the 511 tests, that many of the item mean differ­ences were 50–100 times larger than their as­so­ci­ated stan­dard er­rors!
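With groups of this size, even very small differences in endorsement rates produce large z-values. The sketch below uses the group sizes quoted above and a pooled-variance difference-of-proportions test; the endorsement rates are hypothetical, and whether Waller pooled the variance is an assumption.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for a difference of two proportions, pooled under the null."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

N_MALES, N_FEMALES = 41_491, 39_994   # group sizes quoted from Waller 2004

# Hypothetical item endorsement rates: females fixed at 50%, males slightly higher.
for gap in (0.01, 0.05, 0.20):
    z = two_proportion_z(0.50 + gap, N_MALES, 0.50, N_FEMALES)
    print(f"difference of {gap:.0%}: z = {z:.1f}, significant = {abs(z) >= 1.96}")
```

Under these assumptions a one-percentage-point difference already clears |z| ≥ 1.96, and a twenty-point difference gives a z above 50.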

Figure 1 & Figure 2 of Waller 2004: “Fig. 1. Distribution of z-values for 511 hypothesis tests. Fig. 2. Distribution of the frequency of rejected null hypotheses, in favor of a randomly chosen directional alternative, in 320,922 hypothesis tests”

Waller also high­lights Bill Thomp­son’s 2001 bib­li­og­ra­phy “402 Ci­ta­tions Ques­tion­ing the In­dis­crim­i­nate Use of Null Hy­poth­e­sis Sig­nifi­cance Tests in Ob­ser­va­tional Stud­ies” as a source for crit­i­cisms of NHST but un­for­tu­nately it’s un­clear which of them might bear on the spe­cific crit­i­cism of ‘the null hy­poth­e­sis is al­ways false’.

Starbuck 2006

The Pro­duc­tion of Knowl­edge: The Chal­lenge of So­cial Sci­ence Re­search, Star­buck 2006, pg47–49:

In­duc­tion re­quires dis­tin­guish­ing mean­ing­ful re­la­tion­ships (sig­nals) in the midst of an ob­scur­ing back­ground of con­found­ing re­la­tion­ships (noise). The weak and mean­ing­less or sub­stan­tively sec­ondary cor­re­la­tions in the back­ground make in­duc­tion un­trust­wor­thy. In many tasks, peo­ple can dis­tin­guish weak sig­nals against rather strong back­ground noise. The rea­son is that both the sig­nals and the back­ground noise match fa­mil­iar pat­terns. For ex­am­ple, a dri­ver trav­el­ing to a fa­mil­iar des­ti­na­tion fo­cuses on land­marks that ex­pe­ri­ence has shown to be rel­e­vant. Peo­ple have trou­ble mak­ing such dis­tinc­tions where sig­nals and noise look much alike or where sig­nals and noise have un­fa­mil­iar char­ac­ter­is­tics. For ex­am­ple, a dri­ver trav­el­ing a new road to a new des­ti­na­tion is likely to have diffi­culty spot­ting land­marks and turns on a rec­om­mended route.

So­cial sci­ence re­search has the lat­ter char­ac­ter­is­tics. This ac­tiv­ity is called re­search be­cause its out­puts are un­known; and the sig­nals and noise look a lot alike in that both have sys­tem­atic com­po­nents and both con­tain com­po­nents that vary er­rat­i­cal­ly. There­fore, re­searchers rely upon sta­tis­ti­cal tech­niques to dis­tin­guish sig­nals from noise. How­ev­er, these tech­niques as­sume: (a) that the so-called ran­dom er­rors re­ally do can­cel each other out so that their av­er­age val­ues are close to ze­ro; and (b) that the so-called ran­dom er­rors in differ­ent vari­ables are un­cor­re­lat­ed. These are very strong as­sump­tions be­cause they pre­sume that the re­searchers’ hy­pothe­ses en­com­pass ab­solutely all of the sys­tem­atic effects in the data, in­clud­ing effects that the re­searchers have not fore­seen or mea­sured. When these as­sump­tions are not met, the sta­tis­ti­cal tech­niques tend to mis­take noise for sig­nal, and to at­tribute more im­por­tance to the re­searchers’ hy­pothe­ses than they de­serve.

I re­mem­bered what Ames and Re­iter (1961) had said about how easy it is for macro­econ­o­mists to dis­cover sta­tis­ti­cally sig­nifi­cant cor­re­la­tions that have no sub­stan­tive sig­nifi­cance, and I could see five rea­sons why a sim­i­lar phe­nom­e­non might oc­cur with cross-sec­tional da­ta. First­ly, a few broad char­ac­ter­is­tics of peo­ple and so­cial sys­tems per­vade so­cial sci­ence data—ex­am­ples be­ing sex, age, in­tel­li­gence, so­cial class, in­come, ed­u­ca­tion, or or­ga­ni­za­tion size. Such char­ac­ter­is­tics cor­re­late with many be­hav­iors and with each oth­er. Sec­ond­ly, re­searchers’ de­ci­sions about how to treat data can cre­ate cor­re­la­tions be­tween vari­ables. For ex­am­ple, when the As­ton re­searchers used fac­tor analy­sis to cre­ate ag­gre­gate vari­ables, they im­plic­itly de­ter­mined the cor­re­la­tions among these ag­gre­gate vari­ables. Third­ly, so-called ‘sam­ples’ are fre­quently not ran­dom, and many of them are com­plete sub­pop­u­la­tion­s—say, every em­ployee of a com­pa­ny—even though study after study has turned up ev­i­dence that peo­ple who live close to­geth­er, who work to­geth­er, or who so­cial­ize to­gether tend to have more at­ti­tudes, be­liefs, and be­hav­iors in com­mon than do peo­ple who are far apart phys­i­cally and so­cial­ly. Fourth­ly, some stud­ies ob­tain data from re­spon­dents at one time and through one method. By in­clud­ing items in a sin­gle ques­tion­naire or in­ter­view, re­searchers sug­gest to re­spon­dents that re­la­tion­ships ex­ist among these items. Last­ly, most re­searchers are in­tel­li­gent peo­ple who are liv­ing suc­cess­ful lives. They are likely to have some in­tu­itive abil­ity to pre­dict the be­hav­iors of peo­ple and of so­cial sys­tems. 
They are much more likely to for­mu­late hy­pothe­ses that ac­cord with their in­tu­ition than ones that vi­o­late it; they are quite likely to in­ves­ti­gate cor­re­la­tions and differ­ences that de­vi­ate from ze­ro; and they are less likely than chance would im­ply to ob­serve cor­re­la­tions and differ­ences near ze­ro.

Web­ster and I hy­poth­e­sized that sta­tis­ti­cal tests with a null hy­poth­e­sis of no cor­re­la­tion are bi­ased to­ward sta­tis­ti­cal sig­nifi­cance. Web­ster culled through Ad­min­is­tra­tive Sci­ence Quar­terly, the Acad­emy of Man­age­ment Jour­nal, and the Jour­nal of Ap­plied Psy­chol­ogy seek­ing ma­tri­ces of cor­re­la­tions. She tab­u­lated only com­plete ma­tri­ces of cor­re­la­tions in or­der to ob­serve the re­la­tions among all of the vari­ables that the re­searchers per­ceived when draw­ing in­duc­tive in­fer­ences, not only those vari­ables that re­searchers ac­tu­ally in­cluded in hy­pothe­ses. Of course, some re­searchers prob­a­bly gath­ered data on ad­di­tional vari­ables be­yond those pub­lished, and then omit­ted these ad­di­tional vari­ables be­cause they cor­re­lated very weakly with the de­pen­dent vari­ables. We es­ti­mated that 64% of the cor­re­la­tions in our data were as­so­ci­ated with re­searchers’ hy­pothe­ses.

Fig­ure 2.6 Cor­re­la­tions re­ported in three jour­nals

Fig­ure 2.6 shows the dis­tri­b­u­tions of 14,897 cor­re­la­tions. In all 3 jour­nals, both the mean cor­re­la­tion and the me­dian cor­re­la­tion were close to +0.09 and the dis­tri­b­u­tions of cor­re­la­tions were very sim­i­lar. Find­ing sig­nifi­cant cor­re­la­tions is ab­surdly easy in this pop­u­la­tion of vari­ables, es­pe­cially when re­searchers make two-tailed tests with a null hy­poth­e­sis of no cor­re­la­tion. Choos­ing two vari­ables ut­terly at ran­dom, a re­searcher has 2-to-1 odds of find­ing a sig­nifi­cant cor­re­la­tion on the first try, and 24-to-1 odds of find­ing a sig­nifi­cant cor­re­la­tion within three tries (also see Hub­bard and Arm­strong 1992). Fur­ther­more, the odds are bet­ter than 2-to-1 that an ob­served cor­re­la­tion will be pos­i­tive, and pos­i­tive cor­re­la­tions are more likely than neg­a­tive ones to be sta­tis­ti­cally sig­nifi­cant. Be­cause re­searchers gather more data when they are get­ting small cor­re­la­tions, stud­ies with large num­bers of ob­ser­va­tions ex­hibit slightly less pos­i­tive bias. The mean cor­re­la­tion in stud­ies with fewer than 70 ob­ser­va­tions is about twice the mean cor­re­la­tion in stud­ies with over 180 ob­ser­va­tions. The main in­fer­ence I drew from these sta­tis­tics was that the so­cial sci­ences are drown­ing in sta­tis­ti­cally sig­nifi­cant but mean­ing­less noise. Be­cause the differ­ences and cor­re­la­tions that so­cial sci­en­tists test have dis­tri­b­u­tions quite differ­ent from those as­sumed in hy­poth­e­sis tests, so­cial sci­en­tists are us­ing tests that as­sign sta­tis­ti­cal sig­nifi­cance to con­found­ing back­ground re­la­tion­ships. Be­cause so­cial sci­en­tists equate sta­tis­ti­cal sig­nifi­cance with mean­ing­ful re­la­tion­ships, they often mis­take con­found­ing back­ground re­la­tion­ships for the­o­ret­i­cally im­por­tant in­for­ma­tion. 
One re­sult is that so­cial sci­ence re­search cre­ates a cloud of sta­tis­ti­cally sig­nifi­cant differ­ences and cor­re­la­tions that not only have no real mean­ing but also im­pede sci­en­tific progress by ob­scur­ing the truly mean­ing­ful re­la­tion­ships.
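
The odds quoted above can be checked with a little arithmetic. The sketch below assumes that each "try" is an independent draw with the same chance of significance, which the excerpt does not state. Under that assumption, exactly 2-to-1 odds per try would give 26-to-1 within three tries. The quoted 24-to-1 corresponds to a per-try probability slightly below 2/3, so the "2-to-1" figure is presumably rounded. The helper names are illustrative only.

```python
from fractions import Fraction


def odds_to_probability(odds_for, odds_against=1):
    """Convert 'a-to-b' odds in favour into the probability a / (a + b)."""
    return Fraction(odds_for, odds_for + odds_against)


def probability_to_odds(p):
    """Convert a probability into odds in favour, p / (1 - p)."""
    return p / (1 - p)


def hit_within(p_single, tries):
    """Chance of at least one success in `tries` independent attempts."""
    return 1 - (1 - p_single) ** tries


p_first = odds_to_probability(2)            # exactly 2-to-1 odds
p_three = hit_within(p_first, 3)            # at least one hit in three tries
odds_three = probability_to_odds(p_three)   # odds in favour within three tries

# Per-try probability that would give exactly 24-to-1 odds within three tries:
# the chance of three misses must be 1/25, so one miss is (1/25) ** (1/3).
p_implied = 1 - (1 / 25) ** (1 / 3)

print(p_first, p_three, odds_three, round(p_implied, 3))
```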

Sup­pose that roughly 10% of all ob­serv­able re­la­tions could be the­o­ret­i­cally mean­ing­ful and that the re­main­ing 90% ei­ther have no mean­ings or can be de­duced as im­pli­ca­tions of the key 10%. How­ev­er, we do not know now which re­la­tions con­sti­tute the key 10%, and so our re­search re­sem­bles a search through a haystack in which we are try­ing to sep­a­rate nee­dles from more nu­mer­ous straws. Now sup­pose that we adopt a search method that makes al­most every straw look very much like a nee­dle and that turns up thou­sands of ap­par­ent nee­dles an­nu­al­ly; 90% of these ap­par­ent nee­dles are ac­tu­ally straws, but we have no way of know­ing which ones. Next, we fab­ri­cate a the­ory that ‘ex­plains’ these ap­par­ent nee­dles. Some of the propo­si­tions in our the­ory are likely to be cor­rect, merely by chance; but many, many more propo­si­tions are in­cor­rect or mis­lead­ing in that they de­scribe straws. Even if this the­ory were to ac­count ra­tio­nally for all of the nee­dles that we have sup­pos­edly dis­cov­ered in the past, which is ex­tremely un­like­ly, the the­ory has very lit­tle chance of mak­ing highly ac­cu­rate pre­dic­tions about the con­se­quences of our ac­tions un­less the the­ory it­self acts as a pow­er­ful self­-ful­fill­ing prophecy (E­den and Ravid 1982). Our the­ory would make some cor­rect pre­dic­tions, of course, be­cause with so many cor­re­lated vari­ables, even a com­pletely false the­ory would have a rea­son­able chance of gen­er­at­ing pre­dic­tions that come true. Thus, we dare not even take cor­rect pre­dic­tions as de­pend­able ev­i­dence of our the­o­ry’s cor­rect­ness (Deese 1972: 61–67 [Psy­chol­ogy as Sci­ence and Art]).

Smith et al 2007

Smith et al 2007:

…We ex­am­ined the ex­tent to which ge­netic vari­ants, on the one hand, and non­genetic en­vi­ron­men­tal ex­po­sures or phe­no­typic char­ac­ter­is­tics on the oth­er, tend to be as­so­ci­ated with each oth­er, to as­sess the de­gree of con­found­ing that would ex­ist in con­ven­tional epi­demi­o­log­i­cal stud­ies com­pared with Mendelian ran­dom­iza­tion stud­ies. Meth­ods and Find­ings: We es­ti­mated pair­wise cor­re­la­tions be­tween [96] non­genetic base­line vari­ables and ge­netic vari­ables in a cross-sec­tional study [Bri­tish Wom­en’s Heart and Health Study; n = 4,286] com­par­ing the num­ber of cor­re­la­tions that were sta­tis­ti­cally sig­nifi­cant at the 5%, 1%, and 0.01% level (α = 0.05, 0.01, and 0.0001, re­spec­tive­ly) with the num­ber ex­pected by chance if all vari­ables were in fact un­cor­re­lat­ed, us­ing a two-sided bi­no­mial ex­act test. We demon­strate that be­hav­ioural, so­cioe­co­nom­ic, and phys­i­o­log­i­cal fac­tors are strongly in­ter­re­lat­ed, with 45% of all pos­si­ble pair­wise as­so­ci­a­tions be­tween 96 non­genetic char­ac­ter­is­tics (n = 4,560 cor­re­la­tions) be­ing sig­nifi­cant at the p < 0.01 level (the ra­tio of ob­served to ex­pected sig­nifi­cant as­so­ci­a­tions was 45; p-value for differ­ence be­tween ob­served and ex­pected < 0.000001). Sim­i­lar find­ings were ob­served for other lev­els of sig­nifi­cance.

…The 96 non­genetic vari­ables gen­er­ated 4,560 pair­wise com­par­isons, of which, as­sum­ing no as­so­ci­a­tions ex­ist­ed, 5 in 100 (to­tal 228) would be ex­pected to be as­so­ci­ated by chance at the 5% sig­nifi­cance level (α = 0.05). How­ev­er, 2,447 (54%) of the cor­re­la­tions were sig­nifi­cant at the α = 0.05 lev­el, giv­ing an ob­served to ex­pected (O:E) ra­tio of 11, p for differ­ence O:E < 0.000001 (Table 1). At the 1% sig­nifi­cance lev­el, 45.6 of the cor­re­la­tions would be ex­pected to be as­so­ci­ated by chance, but we found that 2,036 (45%) of the pair­wise as­so­ci­a­tions were sta­tis­ti­cally sig­nifi­cant at α = 0.01, giv­ing an O:E ra­tio of 45, p for differ­ence O:E < 0.000001 (Table 2). At the 0.01% sig­nifi­cance lev­el, 0.456 of the cor­re­la­tions would be ex­pected to be as­so­ci­ated by chance, but we found that 1,378 (30%) were sig­nifi­cantly as­so­ci­ated at α = 0.0001, giv­ing an O:E ra­tio of 3,022, p for differ­ence O:E < 0.000001.
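
The expected counts and observed-to-expected ratios in this passage follow directly from the number of variable pairs. A minimal sketch that reproduces them from the figures quoted above (variable names are illustrative, not from the paper):

```python
from math import comb

n_variables = 96
n_pairs = comb(n_variables, 2)  # number of distinct pairwise correlations

# (alpha, observed number of significant correlations), as quoted above
reported = [(0.05, 2447), (0.01, 2036), (0.0001, 1378)]


def observed_to_expected(alpha, observed, pairs=n_pairs):
    """Return (expected count under independence, O:E ratio, observed share)."""
    expected = alpha * pairs
    return expected, observed / expected, observed / pairs


results = [observed_to_expected(alpha, observed) for alpha, observed in reported]

for (alpha, observed), (expected, ratio, share) in zip(reported, results):
    print(f"alpha={alpha}: expected {expected:.3f}, "
          f"observed {observed} ({share:.0%}), O:E ratio {ratio:.0f}")
```

The ratio grows as α shrinks because the expected count falls in proportion to α while the observed count falls much more slowly.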

…Over 50% of the pair­wise as­so­ci­a­tions be­tween base­line non­genetic char­ac­ter­is­tics in our study were sta­tis­ti­cally sig­nifi­cant at the 0.05 lev­el; an 11-fold in­crease from what would be ex­pect­ed, as­sum­ing these char­ac­ter­is­tics were in­de­pen­dent. Sim­i­lar find­ings were found for sta­tis­ti­cally sig­nifi­cant as­so­ci­a­tions at the 0.01 level (45-fold in­crease from ex­pect­ed) and the 0.0001 level (3,000-fold in­crease from ex­pect­ed). This il­lus­trates the con­sid­er­able diffi­culty of de­ter­min­ing which as­so­ci­a­tions are valid and po­ten­tially causal from a back­ground of highly cor­re­lated fac­tors, re­flect­ing that be­hav­ioural, so­cioe­co­nom­ic, and phys­i­o­log­i­cal char­ac­ter­is­tics tend to clus­ter. This ten­dency will mean that there will often be high lev­els of con­found­ing when study­ing any sin­gle fac­tor in re­la­tion to an out­come. Given the com­plex­ity of such con­found­ing, even after for­mal sta­tis­ti­cal ad­just­ment, a lack of data for some con­founders, and mea­sure­ment er­ror in as­sessed con­founders will leave con­sid­er­able scope for resid­ual con­found­ing [4]. When epi­demi­o­log­i­cal stud­ies present ad­justed as­so­ci­a­tions as a re­flec­tion of the mag­ni­tude of a causal as­so­ci­a­tion, they are as­sum­ing that all pos­si­ble con­found­ing fac­tors have been ac­cu­rately mea­sured and that their re­la­tion­ships with the out­come have been ap­pro­pri­ately mod­elled. We think this is un­likely to be the case in most ob­ser­va­tional epi­demi­o­log­i­cal stud­ies [26].

Pre­dictably, such con­founded re­la­tion­ships will be par­tic­u­larly marked for highly so­cially and cul­tur­ally pat­terned risk fac­tors, such as di­etary in­take. This high de­gree of con­found­ing might un­der­lie the poor con­cor­dance of ob­ser­va­tional epi­demi­o­log­i­cal stud­ies that iden­ti­fied di­etary fac­tors (such as beta carotene, vi­t­a­min E, and vi­t­a­min C in­take) as pro­tec­tive against car­dio­vas­cu­lar dis­ease and can­cer, with the find­ings of ran­dom­ized con­trolled tri­als of these di­etary fac­tors [1,27]. In­deed, with 45% of the pair­wise as­so­ci­a­tions of non­genetic char­ac­ter­is­tics be­ing “sta­tis­ti­cally sig­nifi­cant” at the p < 0.01 level in our study, and our study be­ing un­ex­cep­tional with re­gard to the lev­els of con­found­ing that will be found in ob­ser­va­tional in­ves­ti­ga­tions, it is clear that the large ma­jor­ity of as­so­ci­a­tions that ex­ist in ob­ser­va­tional data­bases will not reach pub­li­ca­tion. We sug­gest that those that do achieve pub­li­ca­tion will re­flect ap­par­ent bi­o­log­i­cal plau­si­bil­ity (a weak causal cri­te­rion [28]) and the in­ter­ests of in­ves­ti­ga­tors. Ex­am­ples ex­ist of in­ves­ti­ga­tors re­port­ing pro­vi­sional analy­ses in ab­stract­s—­such as an­tiox­i­dant vi­t­a­min in­take be­ing ap­par­ently pro­tec­tive against fu­ture car­dio­vas­cu­lar events in women with clin­i­cal ev­i­dence of car­dio­vas­cu­lar dis­ease [29]—but not go­ing on to full pub­li­ca­tion of these find­ings, per­haps be­cause ran­dom­ized con­trolled tri­als ap­peared soon after the pre­sen­ta­tion of the ab­stracts [30] that ren­dered their find­ings as be­ing un­likely to re­flect causal re­la­tion­ships. Con­verse­ly, it is likely that the large ma­jor­ity of null find­ings will not achieve pub­li­ca­tion, un­less they con­tra­dict high­-pro­file prior find­ings, as has been demon­strated in mol­e­c­u­lar ge­netic re­search [31].

Smith et al 2007: “Fig­ure 1. His­togram of Sta­tis­ti­cally Sig­nifi­cant (at α = 1%) Age-Ad­justed Pair­wise Cor­re­la­tion Co­effi­cients be­tween 96 Non­genetic Char­ac­ter­is­tics. British Women Aged 60–79 y”

The mag­ni­tudes of most of the sig­nifi­cant cor­re­la­tions be­tween non­genetic char­ac­ter­is­tics were small (see Fig­ure 1), with a me­dian value at p ≤ 0.01 and p ≤ 0.05 of 0.08, and it might be con­sid­ered that such weak as­so­ci­a­tions are un­likely to be im­por­tant sources of con­found­ing. How­ev­er, so many as­so­ci­ated non­genetic vari­ables, even with weak cor­re­la­tions, can present a very im­por­tant po­ten­tial for resid­ual con­found­ing. For ex­am­ple, we have pre­vi­ously demon­strated how 15 so­cioe­co­nomic and be­hav­ioural risk fac­tors, each with weak but sta­tis­ti­cally in­de­pen­dent (at p ≤ 0.05) as­so­ci­a­tions with both vi­t­a­min C lev­els and coro­nary heart dis­ease (CHD), could to­gether ac­count for an ap­par­ent strong pro­tec­tive effect (odds ra­tio = 0.60 com­par­ing top to bot­tom quar­ter of vi­t­a­min C dis­tri­b­u­tion) of vi­t­a­min C on CHD (32 [see also Lawlor et al 2004b]).

Hecht & Moxley 2009

“Ter­abytes of To­bler: eval­u­at­ing the first law in a mas­sive, do­main-neu­tral rep­re­sen­ta­tion of world knowl­edge”, Hecht & Mox­ley 2009:

The First Law of Ge­og­ra­phy states, “every­thing is re­lated to every­thing else, but near things are more re­lated than dis­tant things.” De­spite the fact that it is to a large de­gree what makes “spa­tial spe­cial,” the law has never been em­pir­i­cally eval­u­ated on a large, do­main-neu­tral rep­re­sen­ta­tion of world knowl­edge. We ad­dress the gap in the lit­er­a­ture about this crit­i­cal idea by sta­tis­ti­cally ex­am­in­ing the mul­ti­tude of en­ti­ties and re­la­tions be­tween en­ti­ties present across 22 differ­ent lan­guage edi­tions of Wikipedia. We find that, at least ac­cord­ing to the myr­iad au­thors of Wikipedia, the First Law is true to an over­whelm­ing ex­tent re­gard­less of lan­guage-de­fined cul­tural do­main.

Andrew Gelman

Gelman 2004

“Type 1, type 2, type S, and type M er­rors”

I’ve never in my pro­fes­sional life made a Type I er­ror or a Type II er­ror. But I’ve made lots of er­rors. How can this be?

A Type 1 er­ror oc­curs only if the null hy­poth­e­sis is true (typ­i­cally if a cer­tain pa­ra­me­ter, or differ­ence in pa­ra­me­ters, equals ze­ro). In the ap­pli­ca­tions I’ve worked on, in so­cial sci­ence and pub­lic health, I’ve never come across a null hy­poth­e­sis that could ac­tu­ally be true, or a pa­ra­me­ter that could ac­tu­ally be ze­ro.

Gelman 2007

“Sig­nifi­cance test­ing in eco­nom­ics: Mc­Closkey, Zil­i­ak, Hoover, and Siegler”:

I think that Mc­Closkey and Zil­i­ak, and also Hoover and Siegler, would agree with me that the null hy­poth­e­sis of zero co­effi­cient is es­sen­tially al­ways false. (The par­a­dig­matic ex­am­ple in eco­nom­ics is pro­gram eval­u­a­tion, and I think that just about every pro­gram be­ing se­ri­ously con­sid­ered will have effect­s—­pos­i­tive for some peo­ple, neg­a­tive for oth­er­s—but not av­er­ag­ing to ex­actly zero in the pop­u­la­tion.) From this per­spec­tive, the point of hy­poth­e­sis test­ing (or, for that mat­ter, of con­fi­dence in­ter­vals) is not to as­sess the null hy­poth­e­sis but to give a sense of the un­cer­tainty in the in­fer­ence. As Hoover and Siegler put it, “while the eco­nomic sig­nifi­cance of the co­effi­cient does not de­pend on the sta­tis­ti­cal sig­nifi­cance, our cer­tainty about the ac­cu­racy of the mea­sure­ment surely does. . . . Sig­nifi­cance tests, prop­erly used, are a tool for the as­sess­ment of sig­nal strength and not mea­sures of eco­nomic sig­nifi­cance.” Cer­tain­ly, I’d rather see an es­ti­mate with an as­sess­ment of sta­tis­ti­cal sig­nifi­cance than an es­ti­mate with­out such an as­sess­ment.

Gelman 2010a

“Bayesian Sta­tis­tics Then and Now”, Gel­man 2010a:

My third meta-prin­ci­ple is that differ­ent ap­pli­ca­tions de­mand differ­ent philoso­phies. This prin­ci­ple comes up for me in Efron’s dis­cus­sion of hy­poth­e­sis test­ing and the so-called false dis­cov­ery rate, which I la­bel as “so-called” for the fol­low­ing rea­son. In Efron’s for­mu­la­tion (which fol­lows the clas­si­cal mul­ti­ple com­par­isons lit­er­a­ture), a “false dis­cov­ery” is a zero effect that is iden­ti­fied as nonze­ro, where­as, in my own work, I never study zero effects. The effects I study are some­times small but it would be sil­ly, for ex­am­ple, to sup­pose that the differ­ence in vot­ing pat­terns of men and women (after con­trol­ling for some other vari­ables) could be ex­actly ze­ro. My prob­lems with the “false dis­cov­ery” for­mu­la­tion are partly a mat­ter of taste, I’m sure, but I be­lieve they also arise from the differ­ence be­tween prob­lems in ge­net­ics (in which some genes re­ally have es­sen­tially zero effects on some traits, so that the clas­si­cal hy­poth­e­sis-test­ing model is plau­si­ble) and in so­cial sci­ence and en­vi­ron­men­tal health (where es­sen­tially every­thing is con­nected to every­thing else, and effect sizes fol­low a con­tin­u­ous dis­tri­b­u­tion rather than a mix of large effects and near-ex­act ze­roes).

Gelman 2010b

“Causal­ity and Sta­tis­ti­cal Learn­ing”, Gel­man 2010b:

There are (al­most) no true ze­roes: diffi­cul­ties with the re­search pro­gram of learn­ing causal struc­ture

We can dis­tin­guish be­tween learn­ing within a causal model (that is, in­fer­ence about pa­ra­me­ters char­ac­ter­iz­ing a spec­i­fied di­rected graph) and learn­ing causal struc­ture it­self (that is, in­fer­ence about the graph it­self). In so­cial sci­ence re­search, I am ex­tremely skep­ti­cal of this sec­ond goal.

The diffi­culty is that, in so­cial sci­ence, there are no true ze­roes. For ex­am­ple, re­li­gious at­ten­dance is as­so­ci­ated with at­ti­tudes on eco­nomic as well as so­cial is­sues, and both these cor­re­la­tions vary by state. And it does not in­ter­est me, for ex­am­ple, to test a model in which so­cial class affects vote choice through party iden­ti­fi­ca­tion but not along a di­rect path.

More gen­er­al­ly, any­thing that plau­si­bly could have an effect will not have an effect that is ex­actly ze­ro. I can re­spect that some so­cial sci­en­tists find it use­ful to frame their re­search in terms of con­di­tional in­de­pen­dence and the test­ing of null effects, but I don’t gen­er­ally find this ap­proach help­ful—and I cer­tainly don’t be­lieve that it is nec­es­sary to think in terms of con­di­tional in­de­pen­dence in or­der to study causal­i­ty. With­out struc­tural ze­roes, it is im­pos­si­ble to iden­tify graph­i­cal struc­tural equa­tion mod­els.

The most com­mon ex­cep­tions to this rule, as I see it, are in­de­pen­dences from de­sign (as in a de­signed or nat­ural ex­per­i­ment) or effects that are zero based on a plau­si­ble sci­en­tific hy­poth­e­sis (as might arise, for ex­am­ple, in ge­net­ics where genes on differ­ent chro­mo­somes might have es­sen­tially in­de­pen­dent effect­s), or in a study of ESP. In such set­tings I can see the value of test­ing a null hy­poth­e­sis of zero effect, ei­ther for its own sake or to rule out the pos­si­bil­ity of a con­di­tional cor­re­la­tion that is sup­posed not to be there.

An­other sort of ex­cep­tion to the “no ze­roes” rule comes from in­for­ma­tion re­stric­tion: a per­son’s de­ci­sion should not be affected by knowl­edge that he or she does­n’t have. For ex­am­ple, a con­sumer in­ter­ested in buy­ing ap­ples cares about the to­tal price he pays, not about how much of that goes to the seller and how much goes to the gov­ern­ment in the form of tax­es. So the re­stric­tion is that the util­ity de­pends on prices, not on the share of that go­ing to tax­es. That is the type of re­stric­tion that can help iden­tify de­mand func­tions in eco­nom­ics.

I realize, however, that my perspective that there are no zeroes (information restrictions aside) is a minority view among social scientists and perhaps among people in general, on the evidence of psychologist Sloman’s book. For example, from chapter 2: “A good politician will know who is motivated by greed and who is motivated by larger principles in order to discern how to solicit each one’s vote when it is needed.” I can well believe that people think in this way but I don’t buy it! Just about everyone is motivated by greed and by larger principles! This sort of discrete thinking doesn’t seem to me to be at all realistic about how people behave, although it might very well be a good model about how people characterize others!

In the next chap­ter, Slo­man writes, “No mat­ter how many times A and B oc­cur to­geth­er, mere co-oc­cur­rence can­not re­veal whether A causes B, or B causes A, or some­thing else causes both.” [i­tal­ics added] Again, I am both­ered by this sort of dis­crete think­ing. I will re­turn in a mo­ment with an ex­am­ple, but just to speak gen­er­al­ly, if A could cause B, and B could cause A, then I would think that, yes, they could cause each oth­er. And if some­thing else could cause them both, I imag­ine that could be hap­pen­ing along with the cau­sa­tion of A on B and of B on A.

Here we’re get­ting into some of the differ­ences be­tween a nor­ma­tive view of sci­ence, a de­scrip­tive view of sci­ence, and a de­scrip­tive view of how peo­ple per­ceive the world. Just as there are lim­its to what “folk physics” can tell us about the mo­tion of par­ti­cles, sim­i­larly I think we have to be care­ful about too closely iden­ti­fy­ing “folk causal in­fer­ence” from the stuff done by the best so­cial sci­en­tists. To con­tinue the anal­o­gy: it is in­ter­est­ing to study how we de­velop phys­i­cal in­tu­itions us­ing com­mon­sense no­tions of force, en­er­gy, mo­men­tum, and so on—but it’s also im­por­tant to see where these in­tu­itions fail. Sim­i­lar­ly, ideas of causal­ity are fun­da­men­tal but that does­n’t stop or­di­nary peo­ple and even ex­perts from mak­ing ba­sic mis­takes.

Now I would like to re­turn to the graph­i­cal model ap­proach de­scribed by Slo­man. In chap­ter 5, he dis­cusses an ex­am­ple with three vari­ables:

If two of the vari­ables are de­pen­dent, say, in­tel­li­gence and so­cioe­co­nomic sta­tus, but con­di­tion­ally in­de­pen­dent given the third vari­able [beer con­sump­tion], then ei­ther they are re­lated by one of two chains:

(Intelligence → Amount of beer consumed → Socioeconomic status)
(Socioeconomic status → Amount of beer consumed → Intelligence)

or by a fork:

                           Socioeconomic status
                         ↗
 Amount of beer consumed
                         ↘
                           Intelligence

and then we must use some other means [other than ob­ser­va­tional data] to de­cide be­tween these three pos­si­bil­i­ties. In some cas­es, com­mon sense may be suffi­cient, but we can al­so, if nec­es­sary, run an ex­per­i­ment. If we in­ter­vene and vary the amount of beer con­sumed and see that we affect in­tel­li­gence, that im­plies that the sec­ond or third model is pos­si­ble; the first one is not. Of course, all this as­sumes that there aren’t other vari­ables me­di­at­ing be­tween the ones shown that pro­vide al­ter­na­tive ex­pla­na­tions of the de­pen­den­cies.

This makes no sense to me. I don’t see why only one of the three mod­els can be true. This is a math­e­mat­i­cal pos­si­bil­i­ty, but it seems highly im­plau­si­ble to me. And, in par­tic­u­lar, run­ning an ex­per­i­ment that re­veals one of these causal effects does not rule out the other pos­si­ble paths. For ex­am­ple, sup­pose that Slo­man were to per­form the above ex­per­i­ment (find­ing that beer con­sump­tion affects in­tel­li­gence) and then an­other ex­per­i­ment, this time vary­ing in­tel­li­gence (in some way; the method of do­ing this can very well de­ter­mine the causal effect) and find­ing that it affects the amount of beer con­sumed.

Be­yond this fun­da­men­tal prob­lem, I have a sta­tis­ti­cal cri­tique, which is that in so­cial sci­ence you won’t have these sorts of con­di­tional in­de­pen­den­cies, ex­cept from de­sign or as ar­ti­facts of small sam­ple sizes that do not al­low us to dis­tin­guish small de­pen­den­cies from ze­ro.
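
Both halves of this argument can be illustrated with the standard first-order partial-correlation formula. The sketch below assumes linear relationships among standardized variables, and the path strengths are hypothetical numbers, not values from Sloman or Gelman. Under those assumptions, any chain or fork through beer consumption implies that the intelligence-status correlation equals the product of the two correlations with beer, so the partial correlation given beer is exactly zero and the three graphs cannot be told apart observationally. Any extra association not routed through beer, however small, makes the partial correlation non-zero.

```python
from math import isclose, sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y after linearly controlling for Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical correlations with beer consumption (illustrative only)
r_intel_beer = 0.3
r_ses_beer = 0.4

# Chain or fork through beer: the marginal correlation is the product.
r_intel_ses = r_intel_beer * r_ses_beer
through_beer_only = partial_correlation(r_intel_ses, r_intel_beer, r_ses_beer)

# Same setup plus a small association that does not pass through beer.
with_extra_path = partial_correlation(r_intel_ses + 0.05, r_intel_beer, r_ses_beer)

print(through_beer_only, round(with_extra_path, 3))
```

With a small sample, a partial correlation of this size would usually be statistically indistinguishable from zero, which is the artifact the passage describes.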

I think I see where Slo­man is com­ing from, from a psy­cho­log­i­cal per­spec­tive: you see these vari­ables that are re­lated to each oth­er, and you want to know which is the cause and which is the effect. But I don’t think this is a use­ful way of un­der­stand­ing the world, just as I don’t think it’s use­ful to cat­e­go­rize po­lit­i­cal play­ers as be­ing mo­ti­vated ei­ther by greed or by larger prin­ci­ples, but not both. Ex­clu­sive-or might feel right to us in­ter­nal­ly, but I don’t think it works as sci­ence.

One important place where I agree with Sloman (and thus with Pearl and Spirtes et al.) is in the emphasis that causal structure cannot in general be learned from observational data alone; they hold the very reasonable position that we can use observational data to rule out possibilities and formulate hypotheses, and then use some sort of intervention or experiment (whether actual or hypothetical) to move further. In this way they connect the observational/experimental division to the hypothesis/deduction formulation that is familiar to us from the work of Popper, Kuhn, and other modern philosophers of science.

The place where I think Slo­man is mis­guided is in his for­mu­la­tion of sci­en­tific mod­els in an ei­ther/or way, as if, in truth, so­cial vari­ables are linked in sim­ple causal paths, with a sci­en­tific goal of fig­ur­ing out if A causes B or the re­verse. I don’t know much about in­tel­li­gence, beer con­sump­tion, and so­cioe­co­nomic sta­tus, but I cer­tainly don’t see any sim­ple re­la­tion­ships be­tween in­come, re­li­gious at­ten­dance, party iden­ti­fi­ca­tion, and vot­ing—and I don’t see how a search for such a pat­tern will ad­vance our un­der­stand­ing, at least given cur­rent tech­niques. I’d rather start with de­scrip­tion and then go to­ward causal­ity fol­low­ing the ap­proach of econ­o­mists and sta­tis­ti­cians by think­ing about po­ten­tial in­ter­ven­tions one at a time. I’d love to see Slo­man’s and Pearl’s ideas of the in­ter­play be­tween ob­ser­va­tional and ex­per­i­men­tal data de­vel­oped in a frame­work that is less strongly tied to the no­tion of choice among sim­ple causal struc­tures.

Gelman 2012

“The ‘hot hand’ and problems with hypothesis testing”, Gelman 2012:

The effects are cer­tainly not ze­ro. We are not ma­chi­nes, and any­thing that can affect our ex­pec­ta­tions (for ex­am­ple, our suc­cess in pre­vi­ous tries) should affect our per­for­mance…What­ever the lat­est re­sults on par­tic­u­lar sports, I can’t see any­one over­turn­ing the ba­sic find­ing of Gilovich, Val­lone, and Tver­sky that play­ers and spec­ta­tors alike will per­ceive the hot hand even when it does not ex­ist and dra­mat­i­cally over­es­ti­mate the mag­ni­tude and con­sis­tency of any hot-hand phe­nom­e­non that does ex­ist. In sum­ma­ry, this is yet an­other prob­lem where much is lost by go­ing down the stan­dard route of null hy­poth­e­sis test­ing.

Gelman et al 2013

“In­her­ent diffi­cul­ties of non-Bayesian like­li­hood-based in­fer­ence, as re­vealed by an ex­am­i­na­tion of a re­cent book by Aitkin” (ear­lier ver­sion):

  1. Solv­ing non-prob­lems

Several of the examples in Statistical Inference represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

“They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (…) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.” (Statistical Inference, page 147)

Based on our ex­pe­ri­ence in pub­lic opin­ion re­search, this is not a real ques­tion. Sup­port for any po­lit­i­cal po­si­tion is al­ways chang­ing. The real ques­tion is how much the sup­port has changed, or per­haps how this change is dis­trib­uted across the pop­u­la­tion.

A de­fender of Aitkin (and of clas­si­cal hy­poth­e­sis test­ing) might re­spond at this point that, yes, every­body knows that changes are never ex­actly zero and that we should take a more “grown-up” view of the null hy­poth­e­sis, not that the change is zero but that it is nearly ze­ro. Un­for­tu­nate­ly, the metaphor­i­cal in­ter­pre­ta­tion of hy­poth­e­sis tests has prob­lems sim­i­lar to the the­o­log­i­cal doc­trines of the Uni­tar­ian church. Once you have aban­doned lit­eral be­lief in the Bible, the ques­tion soon aris­es: why fol­low it at all? Sim­i­lar­ly, once one rec­og­nizes the in­ap­pro­pri­ate­ness of the point null hy­poth­e­sis, we think it makes more sense not to try to re­ha­bil­i­tate it or treat it as trea­sured metaphor but rather to at­tack our sta­tis­ti­cal prob­lems di­rect­ly, in this case by per­form­ing in­fer­ence on the change in opin­ion in the pop­u­la­tion.

To be clear: we are not deny­ing the value of hy­poth­e­sis test­ing. In this ex­am­ple, we find it com­pletely rea­son­able to ask whether ob­served changes are sta­tis­ti­cally sig­nifi­cant, i.e. whether the data are con­sis­tent with a null hy­poth­e­sis of zero change. What we do not find rea­son­able is the state­ment that “the ques­tion of in­ter­est is whether there has been a change in sup­port.”

All this is ap­pli­ca­tion-spe­cific. Sup­pose pub­lic opin­ion was ob­served to re­ally be flat, punc­tu­ated by oc­ca­sional changes, as in the left graph in Fig­ure 7.1. In that case, Aitk­in’s ques­tion of “whether there has been a change” would be well-de­fined and ap­pro­pri­ate, in that we could in­ter­pret the null hy­poth­e­sis of no change as some min­i­mal level of base­line vari­a­tion.

Real pub­lic opin­ion, how­ev­er, does not look like base­line noise plus jumps, but rather shows con­tin­u­ous move­ment on many time scales at on­ce, as can be seen from the right graph in Fig­ure 7.1, which shows ac­tual pres­i­den­tial ap­proval da­ta. In this ex­am­ple, we do not see Aitk­in’s ques­tion as at all rea­son­able. Any at­tempt to work with a null hy­poth­e­sis of opin­ion sta­bil­ity will be in­her­ently ar­bi­trary. It would make much more sense to model opin­ion as a con­tin­u­ous­ly-vary­ing process. The sta­tis­ti­cal prob­lem here is not merely that the null hy­poth­e­sis of zero change is non­sen­si­cal; it is that the null is in no sense a rea­son­able ap­prox­i­ma­tion to any in­ter­est­ing mod­el. The so­ci­o­log­i­cal prob­lem is that, from Sav­age (1954) on­ward, many Bayesians have felt the need to mimic the clas­si­cal nul­l-hy­poth­e­sis test­ing frame­work, even where it makes no sense.

Lin et al 2013

“Too Big to Fail: Large Sam­ples and the p-Value Prob­lem”, Lin et al 2013:

The In­ter­net has pro­vided IS re­searchers with the op­por­tu­nity to con­duct stud­ies with ex­tremely large sam­ples, fre­quently well over 10,000 ob­ser­va­tions. There are many ad­van­tages to large sam­ples, but re­searchers us­ing sta­tis­ti­cal in­fer­ence must be aware of the p-value prob­lem as­so­ci­ated with them. In very large sam­ples, p-val­ues go quickly to ze­ro, and solely re­ly­ing on p-val­ues can lead the re­searcher to claim sup­port for re­sults of no prac­ti­cal sig­nifi­cance. In a sur­vey of large sam­ple IS re­search, we found that a sig­nifi­cant num­ber of pa­pers rely on a low p-value and the sign of a re­gres­sion co­effi­cient alone to sup­port their hy­pothe­ses. This re­search com­men­tary rec­om­mends a se­ries of ac­tions the re­searcher can take to mit­i­gate the p-value prob­lem in large sam­ples and il­lus­trates them with an ex­am­ple of over 300,000 cam­era sales on eBay. We be­lieve that ad­dress­ing the p-value prob­lem will in­crease the cred­i­bil­ity of large sam­ple IS re­search as well as pro­vide more in­sights for read­ers.

…A key is­sue with ap­ply­ing smal­l­-sam­ple sta­tis­ti­cal in­fer­ence to large sam­ples is that even mi­nus­cule effects can be­come sta­tis­ti­cally sig­nifi­cant. The in­creased power leads to a dan­ger­ous pit­fall as well as to a huge op­por­tu­ni­ty. The is­sue is one that sta­tis­ti­cians have long been aware of: “the p-value prob­lem.” Chat­field (1995, p. 70 [Prob­lem Solv­ing: A Sta­tis­ti­cian’s Guide, 2nd ed]) com­ments, “The ques­tion is not whether differ­ences are ‘sig­nifi­cant’ (they nearly al­ways are in large sam­ples), but whether they are in­ter­est­ing. For­get sta­tis­ti­cal sig­nifi­cance, what is the prac­ti­cal sig­nifi­cance of the re­sults?” The in­creased power of large sam­ples means that re­searchers can de­tect small­er, sub­tler, and more com­plex effects, but re­ly­ing on p-val­ues alone can lead to claims of sup­port for hy­pothe­ses of lit­tle or no prac­ti­cal sig­nifi­cance.
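
The effect of sample size on the p-value can be seen by holding a tiny correlation fixed and letting n grow. This is a rough sketch, not the authors' analysis: the correlation of 0.01 is a hypothetical value, and the p-value uses the Fisher z-transformation with a normal approximation rather than an exact test.

```python
from math import atanh, erfc, sqrt


def two_sided_p(r, n):
    """Approximate two-sided p-value for a sample correlation r from n
    observations, using the Fisher z-transformation."""
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


r = 0.01  # hypothetical, practically negligible correlation
sample_sizes = [100, 10_000, 300_000, 1_000_000]
p_values = [two_sided_p(r, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>9,}: p = {p:.2g}")
```

The same negligible effect is far from significant at n = 100 and overwhelmingly "significant" at n = 300,000, which is why the p-value alone says little about practical importance in samples of that size.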

…In reviewing the literature, we found only a few mentions of the large-sample issue and its effect on p-values; we also saw little recognition that the authors’ low p-values might be an artifact of their large-sample sizes. Authors who recognized the “large-sample, small p-values” issue addressed it by one of the following approaches: reducing the significance level threshold (which does not really help), by recomputing the p-value for a small sample (Gefen and Carmel 2008), or by focusing on practical significance and commenting about the uselessness of statistical significance (Mithas and Lucas 2010).

Schwitzgebel 2013

“Pre­lim­i­nary Ev­i­dence That the World Is Sim­ple (An Ex­er­cise in Stu­pid Epis­te­mol­o­gy)” (hu­mor­ous blog post)

Here’s what I did. I thought up 30 pairs of vari­ables that would be easy to mea­sure and that might re­late in di­verse ways. Some vari­ables were phys­i­cal (the dis­tance vs. ap­par­ent bright­ness of nearby stars), some bi­o­log­i­cal (the length vs. weight of sticks found in my back yard), and some psy­cho­log­i­cal or so­cial (the S&P 500 in­dex clos­ing value vs. num­ber of days past). Some I would ex­pect to show no re­la­tion­ship (the num­ber of pages in a li­brary book vs. how high up it is shelved in the li­brary), some I would ex­pect to show a roughly lin­ear re­la­tion­ship (dis­tance of Mc­Don­ald’s fran­chises from my house vs. es­ti­mated dri­ving time), and some I ex­pected to show a curved or com­plex re­la­tion­ship (fore­casted tem­per­a­ture vs. time of day, size in KB of a JPG photo of my office vs. the an­gle at which the photo was tak­en). See here for the full list of vari­ables. I took 11 mea­sure­ments of each vari­able pair. Then I an­a­lyzed the re­sult­ing da­ta.

Now, if the world is massively complex, then it should be difficult to predict a third datapoint from any two other data points. Suppose that two measurements of some continuous variable yield values of 27 and 53. What should I expect the third measured value to be? Why not 1,457,002? Or 3.22 × 10⁻¹⁷? There are just as many functions (that is, infinitely many) containing 27, 53, and 1,457,002 as there are containing 27, 53, and some more pedestrian-seeming value like 44.

…To con­duct the test, I used each pair of de­pen­dent vari­ables to pre­dict the value of the next vari­able in the se­ries (the 1st and 2nd ob­ser­va­tions pre­dict­ing the value of the 3rd, the 2nd and 3rd pre­dict­ing the value of the 4th, etc.), yield­ing 270 pre­dic­tions for the 30 vari­ables. I counted an ob­ser­va­tion “wild” if its ab­solute value was 10 times the max­i­mum of the ab­solute value of the two pre­vi­ous ob­ser­va­tions or if its ab­solute value was be­low 1⁄10 of the min­i­mum of the ab­solute value of the two pre­vi­ous ob­ser­va­tions. Sep­a­rate­ly, I also looked for flipped signs (ei­ther two neg­a­tive val­ues fol­lowed by a pos­i­tive or two pos­i­tive val­ues fol­lowed by a neg­a­tive), though most of the vari­ables only ad­mit­ted pos­i­tive val­ues. This mea­sure of wild­ness yielded three wild ob­ser­va­tions out of 270 (1%) plus an­other three flipped-sign cases (to­tal 2%). (A few vari­ables were capped, ei­ther top or bot­tom, in a way that would make an above-10x or be­low-1/10th ob­ser­va­tion an­a­lyt­i­cally un­like­ly, but ex­clud­ing such vari­ables would­n’t affect the re­sult much.) So it looks like the Wild Com­plex­ity The­sis might be in trou­ble.
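The “wildness” count described above is simple enough to sketch in code. This is only my reading of the prose rule, not Schwitzgebel’s actual analysis: I assume “10 times the maximum” means “at least 10 times”, and the function name and return shape are hypothetical.

```python
from typing import Sequence, Tuple

def count_wild_and_flipped(series: Sequence[float], factor: float = 10.0) -> Tuple[int, int, int]:
    """Return (predictions, wild, flipped) for one series of measurements.

    Each consecutive pair of observations 'predicts' the next one.
    """
    predictions = wild = flipped = 0
    for a, b, c in zip(series, series[1:], series[2:]):
        predictions += 1
        largest = max(abs(a), abs(b))
        smallest = min(abs(a), abs(b))
        # Wild: at least `factor` times the larger, or below 1/`factor` of the smaller.
        if abs(c) >= factor * largest or abs(c) < smallest / factor:
            wild += 1
        # Flipped sign: two negatives then a positive, or two positives then a negative.
        if (a < 0 and b < 0 and c > 0) or (a > 0 and b > 0 and c < 0):
            flipped += 1
    return predictions, wild, flipped

if __name__ == "__main__":
    print(count_wild_and_flipped([27, 53, 1_457_002]))  # the wild third value from the text
    print(count_wild_and_flipped([27, 53, 44]))         # the pedestrian third value
```

With 11 measurements per series, each series yields 9 predictions, which matches the 270 predictions reported for 30 variables.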

Ellenberg 2014

Jor­dan El­len­berg, “The Myth Of The Myth Of The Hot Hand” (ex­cerpted from How Not to Be Wrong: The Power of Math­e­mat­i­cal Think­ing, 2014):

A significance test is a scientific instrument, and like any other instrument, it has a certain degree of precision. If you make the test more sensitive—by increasing the size of the studied population, for example—you enable yourself to see ever-smaller effects. That’s the power of the method, but also its danger. The truth is, the null hypothesis is probably always false! When you drop a powerful drug into a patient’s bloodstream, it’s hard to believe the intervention literally has zero effect on the probability that the patient will develop esophageal cancer, or thrombosis, or bad breath. Each part of the body speaks to every other, in a complex feedback loop of influence and control. Everything you do either gives you cancer or prevents it. And in principle, if you carry out a powerful enough study, you can find out which it is. But those effects are usually so minuscule that they can be safely ignored. Just because we can detect them doesn’t always mean they matter…The right question isn’t, “Do basketball players sometimes temporarily get better or worse at making shots?”—the kind of yes/no question a significance test addresses. The right question is “How much does their ability vary with time, and to what extent can observers detect in real time whether a player is hot?” Here, the answer is surely “not as much as people think, and hardly at all.”

Lakens 2014

“The Null Is Al­ways False (Ex­cept When It Is True)”, Daniel Lak­ens:

The more im­por­tant ques­tion is whether it is true that there are al­ways real differ­ences in the real world, and what the ‘real world’ is. Let’s con­sider the pop­u­la­tion of peo­ple in the real world. While you read this sen­tence, some in­di­vid­u­als in this pop­u­la­tion have died, and some were born. For most ques­tions in psy­chol­o­gy, the pop­u­la­tion is sur­pris­ingly sim­i­lar to an eter­nally run­ning Monte Carlo sim­u­la­tion. Even if you could mea­sure all peo­ple in the world in a mil­lisec­ond, and the test-retest cor­re­la­tion was per­fect, the an­swer you would get now would be differ­ent from the an­swer you would get in an hour. Fre­quen­tists (the peo­ple that use NHST) are not specifi­cally in­ter­ested in the ex­act value now, or in one hour, or next week Thurs­day, but in the av­er­age value in the ‘long’ run. The value in the real world to­day might never be ze­ro, but it’s never any­thing, be­cause it’s con­tin­u­ously chang­ing. If we want to make gen­er­al­iz­able state­ments about the world, I think the fact that the nul­l-hy­poth­e­sis is never pre­cisely true at any spe­cific mo­ment is not a prob­lem. I’ll ig­nore more com­plex ques­tions for now, such as how we can es­tab­lish whether effects vary over time.

…Meehl talks about how in psy­chol­ogy every in­di­vid­u­al-d­iffer­ence vari­able (e.g., trait, sta­tus, de­mo­graph­ic) cor­re­lates with every other vari­able, which means the null is prac­ti­cally never true. In these sit­u­a­tions, it’s not that test­ing against the nul­l-hy­poth­e­sis is mean­ing­less, but it’s not in­for­ma­tive. If every­thing cor­re­lates with every­thing else, you need to cre­ate good mod­els, and test those. A sim­ple nul­l-hy­poth­e­sis sig­nifi­cance test will not get you very far. I agree.

Ran­dom As­sign­ment vs. Crud

To illustrate when NHST can be used as a source of information in large samples, and when NHST is not informative in large samples, I’ll analyze data from a large dataset with 6344 participants from the Many Labs project. I’ve analyzed 10 dependent variables to see whether they were influenced by (A) Gender, and (B) Assignment to the high or low anchoring condition in the first study. Gender is a measured individual difference variable, and not a manipulated variable, and might thus be affected by what Meehl calls the crud factor. Here, I want to illustrate this is (A) probably often true for individual difference variables, but perhaps not always true, and (B) it is probably never true when analyzing differences between groups individuals were randomly assigned to.

…When we an­a­lyze the 10 de­pen­dent vari­ables as a func­tion of the an­chor­ing con­di­tion, none of the differ­ences are sta­tis­ti­cally sig­nifi­cant (even though there are more than 6000 par­tic­i­pants). You can play around with the script, re­peat­ing the analy­sis for the con­di­tions re­lated to the other three an­chor­ing ques­tions (re­mem­ber to cor­rect for mul­ti­ple com­par­isons if you per­form many test­s), and see how ran­dom­iza­tion does a pretty good job at re­turn­ing non-sig­nifi­cant re­sults even in very large sam­ple sizes. If the null is al­ways false, it is re­mark­ably diffi­cult to re­ject. Ob­vi­ous­ly, when we an­a­lyze the an­swer peo­ple gave on the first an­chor­ing ques­tion, we find a huge effect of the high vs. low an­chor­ing con­di­tion they were ran­domly as­signed to. Here, NHST works. There is prob­a­bly some­thing go­ing on. If the an­chor­ing effect was a com­pletely novel phe­nom­e­non, this would be an im­por­tant first find­ing, to be fol­lowed by repli­ca­tions and ex­ten­sions, and fi­nally model build­ing and test­ing.

The results change dramatically if we use Gender as a factor. There are Gender effects on dependent variables related to quote attribution, system justification, the gambler’s fallacy, imagined contact, the explicit evaluation of arts and math, and the norm of reciprocity. There are no significant differences in political identification (as conservative or liberal), on the response scale manipulation, or on gain vs. loss framing (even though p = .025, such a high p-value is stronger support for the null-hypothesis than for the alternative hypothesis with 5500 participants). It’s surprising that the null-hypothesis (gender does not influence the responses participants give) is rejected for 7 out of 10 effects. Personally (perhaps because I’ve got very little expertise in gender effects) I was actually extremely surprised, even though the effects are small (with Cohen’s d’s of around 0.09). This, ironically, shows that NHST works—I’ve learned gender effects are much more widespread than I’d have thought before I wrote this blog post.
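To put the contrast between 0 of 10 rejections (random assignment) and 7 of 10 rejections (gender) in perspective, here is a rough binomial sketch. It rests on assumptions that are mine, not Lakens’s: that the 10 tests are independent and that each true null is rejected with probability α = 0.05. The function name is hypothetical.

```python
from math import comb

def prob_at_least(k: int, n: int, alpha: float) -> float:
    """P(X >= k) for X ~ Binomial(n, alpha): chance of k or more false rejections."""
    return sum(comb(n, j) * alpha**j * (1 - alpha) ** (n - j) for j in range(k, n + 1))

if __name__ == "__main__":
    n_tests, alpha = 10, 0.05
    print("P(0 of 10 rejected | all nulls true):", 1 - prob_at_least(1, n_tests, alpha))
    print("P(7+ of 10 rejected | all nulls true):", prob_at_least(7, n_tests, alpha))
```

Under those assumptions, seeing no rejections when every null is true happens about 60% of the time, while 7 or more rejections would happen less than once in ten million tries, so the gender results are not plausibly a multiple-comparisons fluke.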

Kirkegaard 2014

“The in­ter­na­tional gen­eral so­cioe­co­nomic fac­tor: Fac­tor an­a­lyz­ing in­ter­na­tional rank­ings”:

Many stud­ies have ex­am­ined the cor­re­la­tions be­tween na­tional IQs and var­i­ous coun­try-level in­dexes of well-be­ing. The analy­ses have been un­sys­tem­atic and not gath­ered in one sin­gle analy­sis or dataset. In this pa­per I gather a large sam­ple of coun­try-level in­dexes and show that there is a strong gen­eral so­cioe­co­nomic fac­tor (S fac­tor) which is highly cor­re­lated (.86–.87) with na­tional cog­ni­tive abil­ity us­ing ei­ther Lynn and Van­hanen’s dataset or Al­ti­nok’s. Fur­ther­more, the method of cor­re­lated vec­tors shows that the cor­re­la­tions be­tween vari­able load­ings on the S fac­tor and cog­ni­tive mea­sure­ments are .99 in both datasets us­ing both cog­ni­tive mea­sure­ments, in­di­cat­ing that it is the S fac­tor that dri­ves the re­la­tion­ship with na­tional cog­ni­tive mea­sure­ments, not the re­main­ing vari­ance.

See also “Coun­tries Are Ranked On Every­thing From Health To Hap­pi­ness. What’s The Point?”:

It’s a brand new rank­ing. Called the Sus­tain­able De­vel­op­ment Goals Gen­der In­dex, it gives 129 coun­tries a score for progress on achiev­ing gen­der equal­ity by 2030. Here’s the quick sum­ma­ry: Things are “good” in much of Eu­rope and North Amer­i­ca. And “very poor” in much of sub­-Sa­ha­ran Africa. In fact, that’s the way it looks in many in­ter­na­tional rank­ings, which tackle every­thing from the worst places to be a child to the most cor­rupt coun­tries to world hap­pi­ness…As for the fact that many rank­ings look the same at the top and bot­tom, one rea­son has to do with mon­ey. Many in­dexes are cor­re­lated with GDP per cap­i­ta, a mea­sure of a coun­try’s pros­per­i­ty, says Ken­ny. That in­cludes the World Bank’s Hu­man Cap­i­tal In­dex, which mea­sures the eco­nomic pro­duc­tiv­ity of a coun­try’s young peo­ple; and Free­dom House’s Free­dom in the World in­dex, which ranks the world by its level of democ­ra­cy, in­clud­ing eco­nomic free­dom. And coun­tries that have more money can spend more money on health, ed­u­ca­tion and in­fra­struc­ture.

Shen et al 2014

Shen et al 2014:

Is Too Much Variance Explained? It is interesting that historically the I-O literature has bemoaned the presence of a “validity ceiling”, and the field seemed to be unable to make large gains in the prediction of job performance (Highhouse, 2008). In contrast, LeBreton et al. appear to have the opposite concern—that we may be able to predict too much, perhaps even all, of the variance in job performance once accounting for statistical artifacts. In addition to their four focal predictors (i.e., GMA, integrity, structured interview, work sample), LeBreton et al. list an additional 24 variables that have been shown to be related to job performance meta-analytically. However, we believe that many of the variables LeBreton et al. included in their list are variables that Sackett, Borneman, and Connelly (2009) would argue are likely unknowable at time of hire.

…Fur­ther­more, in con­trast to Le­Bre­ton et al.’s as­ser­tion that or­ga­ni­za­tional vari­ables, such as pro­ce­dural jus­tice, are likely un­re­lated to their fo­cal pre­dic­tors, our be­lief is that many of these vari­ables are likely to be at least mod­er­ately cor­re­lat­ed–lim­it­ing the in­cre­men­tal va­lid­ity we could ex­pect with the in­clu­sion of these ad­di­tional vari­ables. For ex­am­ple, re­search has shown that in­tegrity tests mostly tap into Con­sci­en­tious­ness, Agree­able­ness, and Emo­tional Sta­bil­ity (Ones & Viswes­varan, 2001), and a re­cent meta-analy­sis of or­ga­ni­za­tional jus­tice shows that all three per­son­al­ity traits are mod­er­ately re­lated to one’s ex­pe­ri­ence of pro­ce­dural jus­tice (ρ=0.19–0.23; Hutchin­son et al., 2014), sug­gest­ing that even ap­par­ently un­re­lated vari­ables can share a sur­pris­ing amount of con­struc­t-level vari­ance. In sup­port of this per­spec­tive, Pa­ter­son, Harms, and Crede (2012) [“The meta of all metas: 30 years of meta-analy­sis re­viewed”] con­ducted a meta-analy­sis of over 200 meta-analy­ses and found an av­er­age cor­re­la­tion of 0.27, sug­gest­ing that most vari­ables we study are at least some­what cor­re­lated and val­i­dat­ing the first au­thor’s long-held per­sonal as­sump­tion that the world is cor­re­lated 0.30 (on av­er­age; see also Meehl’s, 1990, crud fac­tor)!

Gordon et al 2019

Gordon et al 2019:

We ex­am­ine how com­mon tech­niques used to mea­sure the causal im­pact of ad ex­po­sures on users’ con­ver­sion out­comes com­pare to the “gold stan­dard” of a true ex­per­i­ment (ran­dom­ized con­trolled tri­al). Us­ing data from 12 US ad­ver­tis­ing lift stud­ies at Face­book com­pris­ing 435 mil­lion user-s­tudy ob­ser­va­tions and 1.4 bil­lion to­tal im­pres­sions we con­trast the ex­per­i­men­tal re­sults to those ob­tained from ob­ser­va­tional meth­ods, such as com­par­ing ex­posed to un­ex­posed users, match­ing meth­ods, mod­el-based ad­just­ments, syn­thetic matched-mar­kets tests, and be­fore-after tests. We show that ob­ser­va­tional meth­ods often fail to pro­duce the same re­sults as true ex­per­i­ments even after con­di­tion­ing on in­for­ma­tion from thou­sands of be­hav­ioral vari­ables and us­ing non-lin­ear mod­els. We ex­plain why this is the case. Our find­ings sug­gest that com­mon ap­proaches used to mea­sure ad­ver­tis­ing effec­tive­ness in in­dus­try fail to mea­sure ac­cu­rately the true effect of ads.

An important input to propensity score matching (PSM) is the set of variables used to predict the propensity score itself. We tested three different PSM specifications for study 4, each of which used a larger set of inputs.

  1. PSM 1: In addition to age and gender, the basis of our exact matching (EM) approach, this specification uses common Facebook variables, such as how long users have been on Facebook, how many Facebook friends they have, their reported relationship status, and their phone OS, in addition to other user characteristics.
  2. PSM 2: In addition to the variables in PSM 1, this specification uses Facebook’s estimate of the user’s zip code of residence to associate with each user nearly 40 variables drawn from the most recent Census and American Communities Surveys (ACS).
  3. PSM 3: In addition to the variables in PSM 2, this specification adds a composite metric of Facebook data that summarizes thousands of behavioral variables. This is a machine-learning based metric used by Facebook to construct target audiences that are similar to consumers that an advertiser has identified as desirable. Using this metric bases the estimation of our propensity score on a non-linear machine-learning model with thousands of features.

…When we go from exact matching (EM) to our most parsimonious propensity score matching model (PSM 1), the conversion rate for unexposed users increases from 0.032% to 0.042%, decreasing the implied advertising lift from 221% to 147%. PSM 2 performs similarly to PSM 1, with an implied lift of 154%. Finally, adding the composite measure of Facebook variables in PSM 3 improves the fit of the propensity model (as measured by a higher AUC/ROC) and further increases the conversion rate for matched unexposed users to 0.051%. The result is that our best performing PSM model estimates an advertising lift of 102%…We summarize the result of all our propensity score matching and regression methods for study 4 in Figure 7.

Gordon et al 2019: “Figure 7: Summary of lift estimates and confidence intervals”
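The lift figures quoted above can be roughly reproduced with a few lines of arithmetic. This sketch assumes lift is defined as (exposed rate ÷ unexposed rate) − 1 and that the exposed group’s conversion rate is the same across specifications, so that it can be backed out of the exact-matching row; neither assumption is stated in the excerpt, and the names below are hypothetical.

```python
def lift(exposed_rate: float, unexposed_rate: float) -> float:
    """Relative lift: how much higher the exposed conversion rate is than the baseline."""
    return exposed_rate / unexposed_rate - 1.0

# Quoted figures for study 4 (conversion rates in percent, lifts as fractions).
QUOTED = {
    "EM":    {"unexposed": 0.032, "lift": 2.21},
    "PSM 1": {"unexposed": 0.042, "lift": 1.47},
    "PSM 3": {"unexposed": 0.051, "lift": 1.02},
}

# Assumption: the exposed conversion rate is constant across rows, so it can be
# recovered from the exact-matching (EM) baseline and lift.
EXPOSED_RATE = QUOTED["EM"]["unexposed"] * (1.0 + QUOTED["EM"]["lift"])

def implied_lifts() -> dict:
    """Lift implied by each quoted unexposed rate, given the backed-out exposed rate."""
    return {name: lift(EXPOSED_RATE, row["unexposed"]) for name, row in QUOTED.items()}

if __name__ == "__main__":
    for name, value in implied_lifts().items():
        print(f"{name}: implied lift {value:.0%} vs quoted {QUOTED[name]['lift']:.0%}")
```

The implied lifts come out at about 145% and 101% for PSM 1 and PSM 3, close to the quoted 147% and 102%; the small gaps are presumably rounding in the quoted conversion rates. The point of the exercise is that each richer set of covariates raises the matched baseline and shrinks the apparent effect.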

While not di­rectly test­ing sta­tis­ti­cal-sig­nifi­cance in its propen­sity scor­ing, the in­creas­ing ac­cu­racy in es­ti­mat­ing the true causal effect of adding in ad­di­tional be­hav­ioral vari­ables im­plies that (e­spe­cially at Face­book-s­cale, us­ing bil­lions of dat­a­points) the cor­re­la­tions of the thou­sands of used vari­ables with the ad­ver­tis­ing be­hav­ior would be sta­tis­ti­cal­ly-sig­nifi­cant and demon­strate that every­thing is cor­re­lat­ed. (See also my & pages.)

Kirkegaard 2020

“En­hanc­ing archival datasets with ma­chine learned psy­cho­met­rics”, Kirkegaard 2020:

In our ISIR 2019 presentation (“Machine learning psychometrics: Improved cognitive ability validity from supervised training on item level data”), we showed that one can use machine learning on cognitive data to improve the predictive validity of it. The effect sizes can be quite large, e.g. one could predict educational attainment in the Vietnam Experience Study (VES) sample (n = 4.5k US army recruits) at R² = 32.3% with vs. 17.7% with . Prediction is more than g, after all. What if we had a dataset of 185 diverse items, and we train the model to predict IRT-based g from the full set, but using only a limited set using the LASSO? How many items do we need when optimally weighted? Turns out that with 42 items, one can get a test that correlates at 0.96 with the full g. That’s an abbreviation of nearly 80%!

Now comes the fancy part. What if we have archival datasets with only a few cognitive items (e.g. datasets with items) or maybe even no items. Can we improve things here? Maybe! If the dataset has a lot of other items, we may be able to train a machine learning (ML) model that predicts g quite well from them, even if they seem unrelated. Every item has some variance overlap with g however small (crud factor), it is only a question of having a good enough algorithm and enough data to exploit this covariance. For instance, I have found that if one uses the 556 items in the MMPI in the VES to predict the very well measured g based on all the cognitive data (18 tests), how well can one do? I was surprised to learn that one can do extremely well:

“Elas­tic net pre­dic­tion of g: r = 0.83 (0.82–0.84), n = 4320”

[There are 203 (e­las­tic)/217 () non-zero co­effi­cients out of 556]

Thus, one can mea­sure g as well as one could with a de­cent test like Won­der­lic, or Raven’s with­out hav­ing any cog­ni­tive data at all! The big ques­tion here is whether these mod­els gen­er­al­ize well. If one can train a model to pre­dict g from MMPI items in dataset 1, and then ap­ply it to dataset 2 with­out much loss of ac­cu­ra­cy, this means that one could im­pute g in po­ten­tially thou­sands of old archival datasets that in­clude the same MMPI items, or a sub­set of them.
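Why many individually weak items can add up to a strong predictor can be seen in an idealized sketch. It is not the elastic net analysis above: it assumes every item correlates with g at the same small value r, that the items are otherwise independent of each other, and that they are simply summed with unit weights. The value r = 0.06 is made up for illustration and is not estimated from the VES data.

```python
import math

def composite_correlation(k: int, r: float) -> float:
    """Correlation between g and the unit-weighted sum of k items.

    Idealized model (an assumption): item_i = r * g + sqrt(1 - r^2) * e_i,
    with g and all e_i independent standard normal variables.
    """
    return math.sqrt(k * r * r / (1.0 + (k - 1) * r * r))

if __name__ == "__main__":
    for k in (1, 10, 100, 556):
        print(k, round(composite_correlation(k, 0.06), 3))
```

In this toy model, 556 items that each share well under 1% of their variance with g yield a composite correlating with g at a little over 0.8. Real questionnaire items overlap with each other in ways unrelated to g, so this is only a sketch of the aggregation logic, not a prediction of what any dataset will show.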

A sim­i­lar analy­sis is done by Rev­elle et al 2020’s (e­spe­cially “Study 4: Pro­file cor­re­la­tions us­ing 696 items”); they do not di­rectly re­port an equiv­a­lent to pos­te­ri­ors/p-val­ues or non-zero cor­re­la­tions after pe­nal­ized re­gres­sion or any­thing like that, but the per­va­sive­ness of cor­re­la­tion is ap­par­ent from their re­sults & data vi­su­al­iza­tions.


Genetic correlations

Modern genomics has found large-scale biobanks & summary-statistic-only methods to be a fruitful area for identifying genetic correlations, as the power of publicly-released PGSes has steadily grown with increasing n (stabilizing estimates & making ever more genetic correlations pass statistical-significance thresholds), which also frequently mirror phenotypic correlations in all organisms (“Cheverud’s conjecture”13).

Ex­am­ple graphs drawn from the broader analy­ses (pri­mar­ily vi­su­al­ized as heatmap­s):

  • “Phe­nome-wide analy­sis of genome-wide poly­genic scores”, Krapohl et al 2015:

    Krapohl et al 2015: “Fig­ure 1. Cor­re­la­tions be­tween 13 genome-wide poly­genic scores and 50 traits from the be­hav­ioral phe­nome. These re­sults are based on GPS con­structed us­ing a GWAS P-value thresh­old (PT)=0.30; re­sults for PT = 0.10 and 0.05 (Sup­ple­men­tary Fig­ures 1a and b and Sup­ple­men­tary Ta­ble 3). P-val­ues that pass Ny­holt–Si­dak cor­rec­tion (see Sup­ple­men­tary Meth­ods 1) are in­di­cated with two as­ter­isks, whereas those reach­ing nom­i­nal sig­nifi­cance (thus sug­ges­tive ev­i­dence) are shown with a sin­gle as­ter­isk.”
  • Hagenaars et al 2016:

    Ha­ge­naars et al 2016: “Fig­ure 1. Heat map of ge­netic cor­re­la­tions cal­cu­lated us­ing LD re­gres­sion be­tween cog­ni­tive phe­no­types in UK Biobank and health-re­lated vari­ables from GWAS con­sor­tia. Hues and col­ors de­pict, re­spec­tive­ly, the strength and di­rec­tion of the ge­netic cor­re­la­tion be­tween the cog­ni­tive phe­no­types in UK Biobank and the health-re­lated vari­ables. Red and blue in­di­cate pos­i­tive and neg­a­tive cor­re­la­tions, re­spec­tive­ly. Cor­re­la­tions with the darker shade as­so­ci­ated with a stronger as­so­ci­a­tion. Based on re­sults in Ta­ble 2. ADHD, at­ten­tion deficit hy­per­ac­tiv­ity dis­or­der; FEV1, forced ex­pi­ra­tory vol­ume in 1 s; GWAS, genome-wide as­so­ci­a­tion study; LD, link­age dis­e­qui­lib­ri­um; NA, not avail­able.”
  • Hill et al 2016:

    Hill et al 2016 fig­ure: “Ge­netic cor­re­la­tions be­tween house­hold in­comes and health vari­ables”
  • Socrates et al 2017 (supplement w/full heatmaps)

    Socrates et al 2017: “Figure 3. Heat map showing genetic associations between polygenic risk scores from GWAS traits (X-axis) and NFBC1966 traits (y-axis) for self-reported disorders, medical and psychiatric conditions verified or treated by a doctor, controlled for sex, BMI, and SES”
    Socrates et al 2017: “Fig­ure 3. Heat map show­ing ge­netic as­so­ci­a­tions be­tween poly­genic risk scores from GWAS traits (X-ax­is) and NFBC1966 traits (y-ax­is) from ques­tion­naires lifestyle and so­cial fac­tors”
  • Docherty et al 2017:

    Docherty et al 2017: “Fig­ure 2: Phe­nome on GPS re­gres­sion q-val­ues in Eu­ro­pean Sam­ple (EUR). GPS dis­played with prior pro­por­tion of causal effects = 0.3. Here, as­ter­isks in the cells of the heatmap de­note re­sults of greater effect: *** = q-value < 0.01, ** = q-value < 0.05, * = q-value < 0.16. Blue val­ues re­flect a neg­a­tive as­so­ci­a­tion, and red re­flect pos­i­tive as­so­ci­a­tion. In­ten­sity of color in­di­cates −log10 p val­ue.”
    Docherty et al 2017: “Fig­ure 3: Ge­netic Over­lap and Co-Her­i­tabil­ity of GPS in Eu­ro­pean Sam­ple (EUR). Heatmap of par­tial cor­re­la­tion co­effi­cients be­tween GPS with prior pro­por­tion of causal effects = 0.3. Here, as­ter­isks in the cells of the heatmap de­note re­sults of greater effect: **** = q-value < 0.0001, *** = q-value < 0.001, ** = q value < 0.01, * = q value < 0.05, and ~ = sug­ges­tive sig­nifi­cance at q value < 0.16. Blue val­ues re­flect a neg­a­tive cor­re­la­tion, and red re­flect pos­i­tive cor­re­la­tion.”
  • Joshi et al 2017:

    “Fig­ure 5: Ge­netic cor­re­la­tions be­tween trait clus­ters that as­so­ciate with mor­tal­i­ty. The up­per panel shows whole ge­netic cor­re­la­tions, the lower pan­el, par­tial cor­re­la­tions. T2D, type 2 di­a­betes; BP, blood pres­sure; BC, breast can­cer; CAD, coro­nary artery dis­ease; Edu, ed­u­ca­tional at­tain­ment; RA, rheuma­toid arthri­tis; AM, age at menar­che; DL/WHR Dys­lipi­demi­a/Waist-Hip ra­tio; BP, blood pres­sure”
  • Hill et al 2018:

    “Fig. 4: Heat map show­ing the ge­netic cor­re­la­tions be­tween the meta-an­a­lytic in­tel­li­gence phe­no­type, in­tel­li­gence, ed­u­ca­tion with 29 cog­ni­tive, SES, men­tal health, meta­bol­ic, health and well­be­ing, an­thro­po­met­ric, and re­pro­duc­tive traits. Pos­i­tive ge­netic cor­re­la­tions are shown in green and neg­a­tive ge­netic cor­re­la­tions are shown in red. Sta­tis­ti­cal sig­nifi­cance fol­low­ing FDR (us­ing Ben­jamini-Hochberg pro­ce­dure [51]) cor­rec­tion is in­di­cated by an as­ter­isk.”
  • Watanabe et al 2018:

    Watanabe et al 2018: “Fig. 2. Within and between domains genetic correlations. (a.) Proportion of trait pairs with significant rg (top) and average |rg| for significant trait pairs (bottom) within domains. Dashed lines represent the proportion of trait pairs with significant rg (top) and average |rg| for significant trait pairs (bottom) across all 558 traits, respectively. Connective tissue, muscular and infection domains are excluded as these each contains less than 3 traits. (b.) Heatmap of proportion of trait pairs with significant rg (upper right triangle) and average |rg| for significant trait pairs (lower left triangle) between domains. Connective tissue, muscular and infection domains are excluded as each contains less than 3 traits. The diagonal represents the proportion of trait pairs with significant rg within domains. Stars denote the pairs of domains in which the majority (>50%) of significant rg are negative.”
  • Abdellaoui et al 2018:

    Ab­del­laoui et al 2018: “Fig­ure 6: Ge­netic cor­re­la­tions based on LD score re­gres­sion. Col­ored is sig­nifi­cant after FDR cor­rec­tion. The green num­bers in the left part of the Fig­ure be­low the di­ag­o­nal of 1’s are the phe­no­typic cor­re­la­tions be­tween the re­gional out­comes of coal min­ing, re­li­gious­ness, and re­gional po­lit­i­cal pref­er­ence. The blue stars next to the trait names in­di­cate that UK Biobank was part of the GWAS of the trait.”
  • “Iden­ti­fi­ca­tion of 12 ge­netic loci as­so­ci­ated with hu­man healthspan”, Zenin et al 2019:

    “Figure 4. 35 traits with significant and high genetic correlations with healthspan (|rg| ≥ 0.3; p ≤ 4.3 × 10⁻⁵). PMID references are placed in square brackets. Note the absence of genetic correlation between the healthspan and Alzheimer disease traits (rg = −0.03)”
  • “Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use”, Liu et al 2019:

    Liu et al 2019: “Fig. 1 | Ge­netic cor­re­la­tions be­tween sub­stance use phe­no­types and phe­no­types from other large GWAS. Ge­netic cor­re­la­tions be­tween each of the phe­no­types are shown in the first 5 rows, with her­i­tabil­ity es­ti­mates dis­played down the di­ag­o­nal. All ge­netic cor­re­la­tions and her­i­tabil­ity es­ti­mates were cal­cu­lated us­ing LD score re­gres­sion. Pur­ple shad­ing rep­re­sents neg­a­tive ge­netic cor­re­la­tions, and red shad­ing rep­re­sents pos­i­tive cor­re­la­tions, with in­creas­ing color in­ten­sity re­flect­ing in­creas­ing cor­re­la­tion strength. A sin­gle as­ter­isk re­flects a sig­nifi­cant ge­netic cor­re­la­tion at the p < 0.05 lev­el. Dou­ble as­ter­isks re­flect a sig­nifi­cant ge­netic cor­re­la­tion at the Bon­fer­roni-cor­rec­tion p < 0.000278 level (cor­rected for 180 in­de­pen­dent test­s). Note that SmkCes was ori­ented such that higher scores re­flected cur­rent smok­ing, and for AgeSmk, lower scores re­flect ear­lier ages of ini­ti­a­tion, both of which are typ­i­cally as­so­ci­ated with neg­a­tive out­comes.”

  1. Some­times para­phrased as “All good things tend to go to­geth­er, as do all bad ones”.↩︎

  2. Tib­shi­rani 2014:

    In describing some of this work, Hastie et al. (2001) coined the informal “Bet on Sparsity” principle [“Use a procedure that does well in sparse problems, since no procedure does well in dense problems.”]. The ℓ₁ methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using ℓ₁ penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter. This is typically not the case when p ≫ N, a commonly occurring scenario.

    This can be seen as a kind of de­ci­sion-the­o­retic jus­ti­fi­ca­tion for Oc­cam-style as­sump­tions: if the real world is not pre­dictable in the sense of be­ing pre­dictable by sim­ple/­fast al­go­rithms, or in­duc­tion does­n’t work at all, then no method works in ex­pec­ta­tion, and the “re­gret” (d­iffer­ence be­tween ex­pected value of ac­tual de­ci­sion and ex­pected value of op­ti­mal de­ci­sion) from mis­tak­enly as­sum­ing that the world is sim­ple/s­parse is ze­ro. So one should as­sume the world is sim­ple.↩︎

  3. A machine learning practitioner, as of 2019, will be struck by the thought that Tobler’s first law nicely encapsulates the principle behind the “unreasonable effectiveness” of to so many domains far beyond images; this connection has been made by John Hessler.↩︎

  4. The most interesting example of this is ESP/psi parapsychology research: the more rigorously conducted the ESP experiments are, the smaller the effects become—but, while discrediting all claims of human ESP, frequently they aren’t pushed to exactly zero and are “statistically-significant”. This suggests some residual crud factor in the experiments, even when conducted & analyzed as best as we know how.↩︎

  5. While Gos­set 1904 is dis­cussed in sev­eral sources, like , the au­thors have con­sulted the Guin­ness Archive in per­son; the re­port it­self does not ap­pear to have ever been made pub­lic or dig­i­tized. I have con­tacted the Archives about get­ting a copy.↩︎

  6. The ver­sion in the sec­ond edi­tion, The Foun­da­tions of Sta­tis­tics, 2nd edi­tion, Sav­age 1972, is iden­ti­cal to the first.↩︎

  7. N.B.: I. Richard Savage is not to be confused with his brother, Leonard Jimmie Savage, who also worked in Bayesian statistics & is cited previously.↩︎

  8. 2nd edi­tion, 1986; after skim­ming the 2nd edi­tion, I have not been able to find a rel­e­vant pas­sage, but Lehmann re­marks that he sub­stan­tially rewrote the text­book for a more ro­bust de­ci­sion-the­o­retic ap­proach, so it may have been re­moved.↩︎

  9. This analy­sis was never pub­lished, ac­cord­ing to Meehl 1990a.↩︎

  10. I would note there is a dan­ger­ous fal­lacy here even if one does be­lieve the Law of Large Num­bers should ap­ply here with an ex­pec­ta­tion of zero effect: even if the ex­pec­ta­tion of the pair­wise cor­re­la­tion of 2 ar­bi­trary vari­ables was in fact pre­cisely zero (as is not too im­plau­si­ble in some do­mains such as op­ti­miza­tion or feed­back loop­s—­such as the fa­mous ex­am­ple of the ther­mostat/­room-tem­per­a­ture), that does not mean any spe­cific pair will be ex­actly zero no mat­ter how many num­bers get added up to cre­ate their re­la­tion­ship, as the ab­solute size of the de­vi­a­tion in­creas­es.

    So for ex­am­ple, imag­ine 2 ge­netic traits which may be ge­net­i­cal­ly-cor­re­lat­ed, and their her­i­tabil­ity may be caused by a num­ber of genes rang­ing from 1 (mono­genic) to tens of thou­sands (highly poly­genic); the spe­cific over­lap is cre­ated by a chance draw of evo­lu­tion­ary processes through­out the or­gan­is­m’s evo­lu­tion; does the Law of Large Num­bers jus­tify say­ing that while 2 mono­genic traits may have a sub­stan­tial cor­re­la­tion, 2 highly poly­genic traits must have much closer to zero cor­re­la­tion sim­ply be­cause they are in­flu­enced by more genes? No, be­cause the dis­tri­b­u­tion around the ex­pec­ta­tion of 0 can be­come wider & wider the more rel­e­vant genes there are.

    To rea­son oth­er­wise is, as Samuel­son not­ed, to think like an in­surer who is wor­ried about los­ing $100 on an in­sur­ance con­tract so it goes out & makes 100 more $100 con­tracts.↩︎

  11. Betz 1986 spe­cial is­sue’s con­tents:

    1. “The g fac­tor in em­ploy­ment”, Got­tfred­son 1986
    2. “Origins of and Reactions to the PTC conference on The g Factor In Employment Testing”, Avery 1986
    3. “g: Artifact or reality?”, Jensen 1986
    4. “The role of general ability in prediction”, Thorndike 1986
    5. “Cognitive ability, cognitive aptitudes, job knowledge, and job performance”, Hunter 1986
    6. “Validity versus utility of mental tests: Example of the SAT”, Gottfredson & Crouse 1986
    7. “Societal consequences of the g factor in employment”, Gottfredson 1986
    8. “Real world implications of g”, Hawk 1986
    9. “Gen­eral abil­ity in em­ploy­ment: A dis­cus­sion”, Ar­vey 1986
    10. “Com­men­tary”, Humphreys 1986
    11. “Com­ments on the g fac­tor in Em­ploy­ment Test­ing”, Linn 1986
    12. “Back to Spear­man?”, Tyler 1986
  12. This work does not seem to have been published, as I can find no books published by them jointly, nor any McClosky books published between 1990 & his death in 2004.↩︎

  13. For de­fi­n­i­tions & ev­i­dence for, see: Cheverud 1988, Roff 1996, Kruuk et al 2008, Dochter­mann 2011, , & .↩︎
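The sum-versus-average distinction behind footnote 10 can be checked exactly in a toy model. The sketch assumes n independent contributions that are each +1 or −1 with equal probability (so the expectation is exactly zero); this is my simplification for illustration, not a model of any genetic architecture, and the function names are hypothetical.

```python
from math import comb, sqrt

def sd_of_sum(n: int) -> float:
    """Exact SD of the sum of n fair +1/-1 draws, from the binomial distribution."""
    # With k draws of +1 and n - k draws of -1, the sum is 2k - n and its mean is 0.
    variance = sum(comb(n, k) * (2 * k - n) ** 2 for k in range(n + 1)) / 2**n
    return sqrt(variance)

def sd_of_mean(n: int) -> float:
    """SD of the average of the same n draws."""
    return sd_of_sum(n) / n

if __name__ == "__main__":
    for n in (1, 100, 400):
        print(n, sd_of_sum(n), sd_of_mean(n))
```

The average concentrates around zero as n grows, but the spread of the total grows like √n, so adding more zero-expectation contributions makes the absolute deviation larger rather than smaller. That is the sense in which Samuelson’s insurer is mistaken.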