Everything Is Correlated

Anthology of sociology, statistics, and psychology papers discussing the observation that all real-world variables have non-zero correlations, and the implications for statistical theory such as ‘null hypothesis testing’.
statistics, philosophy, survey, Bayes, psychology, genetics, sociology, bibliography, causality, insight-porn
2014-09-12–2020-11-14 · finished · certainty: log · importance: 7

Statistical folklore asserts that “everything is correlated”: in any real-world dataset, most or all measured variables will have non-zero correlations, even between variables which appear to be completely independent of each other, and these correlations are not merely sampling-error flukes but will appear in large-scale datasets at arbitrarily designated levels of statistical-significance or posterior probability.

This raises serious questions for null-hypothesis statistical-significance testing, as it implies the null hypothesis of 0 will always be rejected with sufficient data, meaning that a failure to reject only implies insufficient data, and provides no actual test or confirmation of a theory. Even a directional prediction is minimally confirmatory, since there is a 50% chance of picking the right direction at random.

It also has implications for conceptualizations of theories & causal models, interpretations of structural models, and other statistical principles such as the “sparsity principle”.

In statistical folklore, there is an idea which circulates under a number of expressions such as: “everything is correlated”, “everything is related to everything else”, “crud factor”, “the null hypothesis is always false”, “coefficients are never zero”, “ambient correlational noise”, Thorndike’s dictum (“in human nature good traits go together”1), etc. Closely related are the “bet on sparsity” principle2, Barry Commoner’s “first law of ecology” (“Everything is connected to everything else”) & Tobler’s first law of geography (“everything is related to everything else, but near things are more related than distant things”).3

The core idea here is that in any real-world dataset, it is exceptionally unlikely that any particular relationship will be exactly 0, for reasons of arithmetic (eg it may be impossible for a binary variable to occur at an equal percentage in 2 unbalanced groups); prior probability (0 is only one number out of the infinite reals); and because real-world properties & traits are linked by a myriad of causal networks, dynamics, & latent variables (eg the genetic correlations which affect all human traits; see the heat maps in the appendix for visualizations) which mutually affect each other and will produce genuine correlations between apparently-independent variables, and these correlations may be of surprisingly large & important size. These reasons are unaffected by sample size and are not simply due to ‘small n’. The claim is generally backed up by personal experience and reasoning, although in a few instances like Meehl large datasets are mentioned in which almost all variables are correlated at high levels of statistical-significance.
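The latent-variable mechanism is easy to demonstrate in a small simulation (a sketch assuming NumPy & SciPy; the number of traits and the loadings are invented for illustration): six superficially unrelated traits each load only weakly on one shared latent cause, yet at large n every single pairwise correlation comes out genuinely non-zero and “statistically significant”.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 100_000, 6
g = rng.standard_normal(n)                # one shared latent cause
loadings = rng.uniform(0.2, 0.6, size=k)  # each trait loads only weakly on it
# traits = loading * g + independent noise
X = np.outer(g, loadings) + rng.standard_normal((n, k))

# test every pair of 'independent-looking' traits against the sharp null r = 0
pvals = [stats.pearsonr(X[:, i], X[:, j])[1]
         for i in range(k) for j in range(i + 1, k)]
all_significant = max(pvals) < 0.05  # all 15 pairs reject the null
```

Even the weakest implied pairwise correlation here (about 0.03) is detected with ease at n = 100,000, despite no trait causing any other.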


This claim has several implications:

  1. Sharp null hypotheses are meaningless: The most commonly mentioned implication, and the apparent motivation for early discussions, is that in the null-hypothesis paradigm dominant in psychology and many sciences, any sharp null-hypothesis, such as a parameter (like a Pearson’s r correlation) being exactly equal to 0, is known in advance to already be false, and so it will inevitably be rejected as soon as sufficient data collection permits sampling to the foregone conclusion.

    The existence of pervasive correlations, in addition to the presence of systematic error4, guarantees nonzero ‘effects’. This renders the meaning of significance-testing unclear; it is calculating precisely the odds of the data under scenarios known a priori to be false.
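The “sampling to the foregone conclusion” dynamic can be sketched directly (assuming NumPy & SciPy; the crud-factor correlation of 0.05 is an invented illustration): the same tiny but nonzero true correlation typically survives a test at small n and is decisively rejected at large n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_R = 0.05  # a small but nonzero 'crud factor' correlation (invented)

def null_test_p(n):
    # draw n pairs with true correlation TRUE_R, then test the sharp null r = 0
    x = rng.standard_normal(n)
    y = TRUE_R * x + np.sqrt(1 - TRUE_R**2) * rng.standard_normal(n)
    return stats.pearsonr(x, y)[1]

p_small = null_test_p(100)        # underpowered: typically fails to reject
p_large = null_test_p(1_000_000)  # overwhelming power: rejects the same null
```

The null hypothesis did not change between the two tests; only the sample size did, so “failure to reject” at n = 100 measures the budget, not the world.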

  2. Directional hypotheses are little better: Better null-hypotheses, such as >0 or <0, are also problematic, since if the true value of a parameter is never 0, then one’s theories have at least a 50–50 chance of guessing the right direction, and so correct ‘predictions’ of the sign count for little.

    This renders even successful directional predictions of little value.

  3. Model interpretation is difficult: This extensive intercorrelation threatens many naive statistical models or theoretical interpretations thereof, quite aside from p-values.

    For example, given the large amounts of measurement error in most sociological or psychological traits such as SES, home environment, or IQ, fully ‘controlling for’ a latent variable based on measured variables is difficult or impossible, and said variable will in fact remain correlated with the primary variable of interest, leading to “residual confounding” (Stouffer 1936/Thorndike 1942/Kahneman 1965).
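Residual confounding from a noisily-measured control is likewise simple to simulate (a sketch assuming NumPy; the SES-like confounder and reliability of 0.5 are invented for illustration): X has no causal effect on Y at all, both being driven only by a confounder C, yet regressing Y on X while ‘controlling for’ an error-laden measurement of C still leaves a large coefficient on X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
C = rng.standard_normal(n)          # true confounder (an SES-like latent trait)
C_obs = C + rng.standard_normal(n)  # its noisy measurement (reliability 0.5)
X = C + rng.standard_normal(n)      # X is caused only by C
Y = C + rng.standard_normal(n)      # Y is caused only by C: true X->Y effect is 0

# OLS of Y on [1, X, C_obs]: 'controlling for' the noisy proxy of C
A = np.column_stack([np.ones(n), X, C_obs])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
x_coef = beta[1]  # ~0.33, not 0: residual confounding survives the 'control'
```

Only controlling for the true C (unobservable here) would drive the X coefficient to zero; the proxy removes just part of the confounding, however large the sample.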

  4. Intercorrelation implies causal networks: The empirical fact of extensive intercorrelations is consistent with dense causal networks (often latent factors) linking all measured traits, such as the extensive heritability & genetic correlations of human traits.

    The existence of both “everything is correlated” and the success of the “bet on sparsity” principle suggests that these causal networks may be best thought of as having hubs or latent variables: there are a relatively few variables, such as ‘arousal’ or ‘IQ’, which play central roles, explaining much of the variance, followed by almost all other variables accounting for a little bit each, with most of their influence mediated through the key variables.

    The fact that these variables can be successfully modeled as substantively linear or additive further implies that interactions between variables will typically be rare or small or both (implying further that most such hits will be false positives, as interactions are already harder to detect than main effects, and more so if they are a priori unlikely or of small size).

    To the extent that these key variables are unmodifiable, the many peripheral variables may also be unmodifiable. Any intervention on those peripheral variables, being ‘downstream’, will tend to either be ‘hollow’, fade out, or have no effect at all on the true desired goals, no matter how consistently they are correlated.

    On a more contemporary note, these theoretical & empirical considerations also throw doubt on concerns about ‘algorithmic bias’ or inferences drawing on ‘protected classes’: not drawing on them may not be desirable, possible, or even meaningful.

  5. Uncorrelated variables may be meaningless: Given this empirical reality, any variable which is uncorrelated with the major variables is suspicious (somewhat like how the pervasive heritability of human traits renders traits with zero heritability suspicious, suggesting issues like measuring at the wrong time). The lack of correlation suggests that the analysis is underpowered, that something has gone wrong in the construction of the variable/dataset, or that the variable is part of a system whose causal network renders conventional analyses dangerously misleading.

    For example, the dataset may be corrupted by a systematic bias such as range restriction or a selection effect, which erases from the data a correlation that actually exists. Or the data may be random noise, due to software error or fraud or extremely high levels of measurement error (such as respondents answering at random); or the variable may not be real in the first place. Another possibility is that the variable is causally connected, in feedback loops (especially common in economics or biology), to another variable, in which case the standard statistical machinery is misleading: the classic example is Milton Friedman’s thermostat, noting that a well-functioning thermostat’s output would be almost entirely uncorrelated with room temperature.
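Friedman’s thermostat can be reproduced in a few lines (a sketch assuming NumPy; the setpoint and noise levels are invented): a near-perfect controller makes its own output almost uncorrelated with room temperature, precisely because it is what holds the temperature fixed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
target = 20.0                          # thermostat setpoint (degrees C)
outside = rng.standard_normal(n)       # outside-temperature shocks
slack = 0.01 * rng.standard_normal(n)  # tiny imperfection in the controller
heating = target - outside + slack     # thermostat counteracts each shock
room = outside + heating               # room temperature: ~constant at 20

r_heat_room = np.corrcoef(heating, room)[0, 1]    # ~0: looks 'ineffective'
r_heat_out = np.corrcoef(heating, outside)[0, 1]  # ~ -1: reacting perfectly
```

A naive correlational analysis would conclude the heater does nothing to the room and is instead driven (negatively) by the weather; the causal story is exactly the reverse.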

This idea, as suggested by the many names, is not due to any single theoretical or empirical result or researcher, but has been made many times by many different researchers in many contexts, circulating as informal ‘folklore’. To bring some order to this, I have compiled excerpts of some relevant references in chronological order. (Additional citations are welcome.)

Gosset/Student 1904

A version of this is attributed to William Sealy Gosset (Student) in an unpublished 1904 internal report5, as described by Stephen Ziliak:

In early November, 1904, Gosset discussed his first breakthrough in an internal report entitled “The Application of the ‘Law of Error’ to the Work of the Brewery” (Gosset, 1904 Laboratory Report, Nov. 3, 1904, p. 3). Gosset (p. 3–16) wrote:

Results are only valuable when the amount by which they probably differ from the truth is so small as to be insignificant for the purposes of the experiment. What the odds should be depends

  1. On the degree of accuracy which the nature of the experiment allows, and
  2. On the importance of the issues at stake.

Two features of Gosset’s report are especially worth highlighting here. First, he suggested that judgments about “significant” differences were not a purely probabilistic exercise: they depend on the “importance of the issues at stake.” Second, Gosset underscored a positive correlation in the normal distribution curve between “the square root of the number of observations” and the level of statistical significance. Other things equal, he wrote, “the greater the number of observations of which means are taken [the larger the sample size], the smaller the [probable or standard] error” (p. 5). “And the curve which represents their frequency of error,” he illustrated, “becomes taller and narrower” (p. 7).

Since its discovery in the early nineteenth century, tables of the normal probability curve had been created for large samples…The relation between sample size and “significance” was rarely explored. For example, while looking at biometric samples with up to thousands of observations, Karl Pearson declared that a result departing by more than three standard deviations is “definitely significant.”12 Yet Gosset, a self-trained statistician, found that at such large samples, nearly everything is statistically “significant”—though not, in Gosset’s terms, economically or scientifically “important.” Regardless, Gosset didn’t have the luxury of large samples. One of his earliest experiments employed a sample size of 2 (Gosset, 1904, p. 7) and in fact in “The Probable Error of a Mean” he calculated a t statistic for n = 2 (Student, 1908b, p. 23).

…the “degree of certainty to be aimed at,” Gosset wrote, depends on the opportunity cost of following a result as if true, added to the opportunity cost of conducting the experiment itself. Gosset never deviated from this central position.15 [See, for example, Student (1923, p. 271, paragraph one: “The object of testing varieties of cereals is to find out which will pay the farmer best.”) and Student (1931c, p. 1342, paragraph one) reprinted in Student (1942, p. 90 and p. 150).]

Thorndike 1920

“Intelligence and Its Uses”, Edward L. Thorndike 1920 (Harper’s Monthly):

…the significance of intelligence for success in a given activity of life is measured by the coefficient of correlation between them. Scientific investigation of these matters is just beginning; and it is a matter of great difficulty and expense to measure the intelligence of, say, a thousand clergymen, and then secure sufficient evidence to rate them accurately for their success as ministers of the Gospel. Consequently, one can report no final, perfectly authoritative results in this field. One can only organize reasonable estimates from the various partial investigations that have been made. Doing this, I find the following:

  • Intelligence and success in the elementary schools, r = +.80
  • Intelligence and success in high-school and colleges in the case of those who go, r = +.60; but if all were forced to try to do this advanced work, the correlation would be +.80 or more.
  • Intelligence and salary, r = +.35.
  • Intelligence and success in athletic sports, r = +.25
  • Intelligence and character, r = +.40
  • Intelligence and popularity, r = +.20

Whatever be the eventual exact findings, two sound principles are illustrated by our provisional list. First, there is always some resemblance; intellect always counts. Second, the resemblance varies greatly; intellect counts much more in some lines than in others.

The first fact is in part a consequence of a still broader fact or principle—namely, that in human nature good traits go together. To him that hath a superior intellect is given also on the average a superior character; the quick boy is also in the long run more accurate; the able boy is also more industrious. There is no principle of compensation whereby a weak intellect is offset by a strong will, a poor memory by good judgment, or a lack of ambition by an attractive personality. Every pair of such supposed compensating qualities that have been investigated has been found really to show correspondence. Popular opinion has been misled by attending to striking individual cases which attracted attention partly because they were really exceptions to the rule. The rule is that desirable qualities are positively correlated. Intellect is good in and of itself, and also for what it implies about other traits.

Berkson 1938

“Some difficulties of interpretation encountered in the application of the chi-square test”, Joseph Berkson 1938:

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.
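Berkson’s prediction is easy to reproduce (a sketch assuming NumPy & SciPy; the choice of a Student’s t distribution with 5 degrees of freedom as the “true curve” is an invented stand-in for any real-world near-normal distribution): a chi-square goodness-of-fit test of normality that is unremarkable at n = 500 rejects decisively at Berkson’s n = 200,000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def normality_p(x):
    # chi-square goodness-of-fit of a fitted normal, 20 equiprobable bins
    mu, sd = x.mean(), x.std()
    cuts = stats.norm.ppf(np.linspace(0, 1, 21)[1:-1], mu, sd)  # 19 interior edges
    observed = np.bincount(np.searchsorted(cuts, x), minlength=20)
    expected = np.full(20, len(x) / 20)
    # ddof=2 because two parameters (mean, sd) were estimated from the data
    return stats.chisquare(observed, expected, ddof=2).pvalue

# the 'true curve': Student's t with 5 df, near-normal but not exactly normal
p_small = normality_p(rng.standard_t(5, size=500))
p_large = normality_p(rng.standard_t(5, size=200_000))
```

The discrepancy between the fitted normal and the true curve is fixed; only the sample size changes, and at 200,000 observations the P is, as Berkson put it, “small beyond any usual limit of significance”.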

If this be so, then we have something here that is apt to trouble the conscience of a reflective statistician using the chi-square test. For I suppose it would be agreed by statisticians that a large sample is always better than a small sample. If, then, we know in advance the P that will result from an application of a chi-square test to a large sample, there would seem to be no use in doing it on a smaller one. But since the result of the former test is known, it is no test at all.

Thorndike 1939

Your City, Thorndike 1939 (and the followup 144 Smaller Cities, providing tables for 144 cities) compiles various statistics about American cities, such as infant mortality, spending on the arts, and crime, and finds extensive intercorrelations & factors.

The general factor of socioeconomic status or ‘S-factor’ also applies across countries as well: economic growth is by far the largest influence on all measures of well-being, and attempts at computing international rankings of things like maternal mortality founder on this fact, as they typically wind up simply reproducing GDP rank-orderings and being redundant. For example, Jones & Klenow 2016 compute an international wellbeing metric using “life expectancy, the ratio of consumption to income, annual hours worked per capita, the standard deviation of log consumption, and the standard deviation of annual hours worked” to incorporate factors like inequality, but this still winds up just being equivalent to GDP (r = 0.98). Others note that of 9 international indices they consider, all correlate positively with per capita GDP, & 6 have τ > 0.5.

Good 1950

Probability and the Weighing of Evidence, I. J. Good 1950:

The general question of significance tests was raised in 7.3 and a simple example will now be considered. Suppose that a die is thrown n times and that it shows an r-face on mr occasions (r = 1, 2, …, 6). The question is whether the die is loaded. The answer depends on the meaning of “loaded”. From one point of view, it is unnecessary to look at the statistics since it is obvious that no die could be absolutely symmetrical. [It would be no contradiction of 4.3 (ii) to say that the hypothesis that the die is absolutely symmetrical is almost impossible. In fact, this hypothesis is an idealised proposition rather than an empirical one.] It is possible that a similar remark applies to all experiments—even to the ESP experiment, since there may be no way of designing it so that the probabilities are exactly equal to 1⁄2.

Hodges & Lehmann 1954

“Testing the approximate validity of statistical hypotheses”, Hodges & Lehmann 1954:

When testing statistical hypotheses, we usually do not wish to take the action of rejection unless the hypothesis being tested is false to an extent sufficient to matter. For example, we may formulate the hypothesis that a population is normally distributed, but we realize that no natural population is ever exactly normal. We would want to reject normality only if the departure of the actual distribution from the normal form were great enough to be material for our investigation. Again, when we formulate the hypothesis that the sex ratio is the same in two populations, we do not really believe that it could be exactly the same, and would only wish to reject equality if they are sufficiently different. Further examples of the phenomenon will occur to the reader.

Savage 1954

The Foundations of Statistics, 1st edition, 19546, pg252–255:

The development of the theory of testing has been much influenced by the special problem of simple dichotomy, that is, testing problems in which H0 and H1 have exactly one element each. Simple dichotomy is susceptible of neat and full analysis (as in Exercise 7.5.2 and in §14.4), likelihood-ratio tests here being the only admissible tests; and simple dichotomy often gives insight into more complicated problems, though the point is not explicitly illustrated in this book.

Coin and ball examples of simple dichotomy are easy to construct, but instances seem rare in real life. The astronomical observations made to distinguish between the Newtonian and Einsteinian hypotheses are a good, but not perfect, example, and I suppose that research in Mendelian genetics sometimes leads to others. There is, however, a tradition of applying the concept of simple dichotomy to some situations to which it is, to say the best, only crudely adapted. Consider, for example, the decision problem of a person who must buy, f0, or refuse to buy, f1, a lot of manufactured articles on the basis of an observation x. Suppose that i is the difference between the value of the lot to the person and the price at which the lot is offered for sale, and that P(x | i) is known to the person. Clearly, H0, H1, and N are sets characterized respectively by i > 0, i < 0, i = 0. This analysis of this, and similar, problems has recently been explored in terms of the minimax rule, for example by Sprowls [S16] and a little more fully by Rudy [R4], and by Allen [A3]. It seems to me natural and promising for many fields of application, but it is not a traditional analysis. On the contrary, much literature recommends, in effect, that the person pretend that only two values of i, i0 > 0 and i1 < 0, are possible and that the person then choose a test for the resulting simple dichotomy. The selection of the two values i0 and i1 is left to the person, though they are sometimes supposed to correspond to the person’s judgment of what constitutes good quality and poor quality—terms really quite without definition.
The emphasis on simple dichotomy is tempered in some acceptance-sampling literature, where it is recommended that the person choose among available tests by some largely unspecified overall consideration of operating characteristics and costs, and that he facilitate his survey of the available tests by focusing on a pair of points that happen to interest him and considering the test whose operating characteristic passes (economically, in the case of sequential testing) through the pair of points. These traditional analyses are certainly inferior in the theoretical framework of the present discussion, and I think they will be found inferior in practice.

…I turn now to a different and, at least for me, delicate topic in connection with applications of the theory of testing. Much attention is given in the literature of statistics to what purport to be tests of hypotheses, in which the null hypothesis is such that it would not really be accepted by anyone. The following three propositions, though playful in content, are typical in form of these extreme null hypotheses, as I shall call them for the moment.

  • A. The mean noise output of the cereal Krakl is a linear function of the atmospheric pressure, in the range from 900 to 1,100 millibars.
  • B. The basal metabolic consumption of sperm whales is normally distributed [Wll].
  • C. New York taxi drivers of Irish, Jewish, and Scandinavian extraction are equally proficient in abusive language.

Literally to test such hypotheses as these is preposterous. If, for example, the loss associated with f1 is zero, except in case Hypothesis A is exactly satisfied, what possible experience with Krakl could dissuade you from adopting f1?

The unacceptability of extreme null hypotheses is perfectly well known; it is closely related to the often heard maxim that science disproves, but never proves, hypotheses. The role of extreme hypotheses in science and other statistical activities seems to be important but obscure. In particular, though I, like everyone who practices statistics, have often “tested” extreme hypotheses, I cannot give a very satisfactory analysis of the process, nor say clearly how it is related to testing as defined in this chapter and other theoretical discussions. None the less, it seems worth while to explore the subject tentatively; I will do so largely in terms of two examples.

Consider first the problem of a cereal dynamicist who must estimate the noise output of Krakl at each of ten atmospheric pressures between 900 and 1,100 millibars. It may well be that he can properly regard the problem as that of estimating the ten parameters in question, in which case there is no question of testing. But suppose, for example, that one or both of the following considerations apply. First, the engineer and his colleagues may attach considerable personal probability to the possibility that A is very nearly satisfied—very nearly, that is, in terms of the dispersion of his measurements. Second, the administrative, computational, and other incidental costs of using ten individual estimates might be considerably greater than that of using a linear formula.

It might be impractical to deal with either of these considerations very rigorously. One rough attack is for the engineer first to examine the observed data x and then to proceed either as though he actually believed Hypothesis A or else in some other way. The other way might be to make the estimate according to the objectivistic formulae that would have been used had there been no complicating considerations, or it might take into account different but related complicating considerations not explicitly mentioned here, such as the advantage of using a quadratic approximation. It is artificial and inadequate to regard this decision between one class of basic acts or another as a test, but that is what in current practice we seem to do. The choice of which test to adopt in such a context is at least partly motivated by the vague idea that the test should readily accept, that is, result in acting as though the extreme null hypotheses were true, in the farfetched case that the null hypothesis is indeed true, and that the worse the approximation of the null hypotheses to the truth the less probable should be the acceptance.

The method just outlined is crude, to say the best. It is often modified in accordance with common sense, especially so far as the second consideration is concerned. Thus, if the measurements are sufficiently precise, no ordinary test might accept the null hypotheses, for the experiment will lead to a clear and sure idea of just what the departures from the null hypotheses actually are. But, if the engineer considers those departures unimportant for the context at hand, he will justifiably decide to neglect them.

Rejection of an extreme null hypothesis, in the sense of the foregoing discussion, typically gives rise to a complicated subsidiary decision problem. Some aspects of this situation have recently been explored, for example by Paulson [P3], [P4]; Duncan [Dll], [D12]; Tukey [T4], [T5]; Scheffé [S7]; and W. D. Fisher [F7].

Fisher 1956

Statistical Methods and Scientific Inference, Ronald A. Fisher 1956 (pg42):

…However, the calculation [of error rates of ‘rejecting the null’] is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. Further, the calculation is based solely on a hypothesis, which, in the light of the evidence, is often not believed to be true at all, so that the actual probability of erroneous decision, supposing such a phrase to have any meaning, may be much less than the frequency specifying the level of significance.

Wallis & Roberts 1956

Statistics: A New Approach, Wallis & Roberts 1956 (pg 384–388):

A difficulty with this viewpoint is that it is often known that the hypothesis tested could not be precisely true. No coin, for example, has a probability of precisely 1⁄2 of coming heads. The true probability will always differ from 1⁄2 even if it differs by only 0.000,000,000,1. Neither will any treatment cure precisely one-third of the patients in the population to which it might be applied, nor will the proportion of voters in a presidential election favoring one candidate be precisely 1⁄2. Recognition of this leads to the notion of differences that are or are not of practical importance. “Practical importance” depends on the actions that are going to be taken on the basis of the data, and on the losses from taking certain actions when others would be more appropriate.

Thus, the focus is shifted to decisions: Would the same decision about practical action be appropriate if the coin produces heads 0.500,000,000,1 of the time as if it produces heads 0.5 of the time precisely? Does it matter whether the coin produces heads 0.5 of the time or 0.6 of the time, and if so does it matter enough to be worth the cost of the data needed to decide between the actions appropriate to these situations? Questions such as these carry us toward a comprehensive theory of rational action, in which the consequences of each possible action are weighed in the light of each possible state of reality. The value of a correct decision, or the costs of various degrees of error, are then balanced against the costs of reducing the risks of error by collecting further data. It is this viewpoint that underlies the definition of statistics given in the first sentence of this book. [“Statistics is a body of methods for making wise decisions in the face of uncertainty.”]

Savage 1957

“Nonparametric statistics”, I. Richard Savage7 1957:

Siegel does not explain why his interest is confined to tests of significance; to make measurements and then ignore their magnitudes would ordinarily be pointless. Exclusive reliance on tests of significance obscures the fact that statistical significance does not imply substantive significance. The tests given by Siegel apply only to null hypotheses of “no difference.” In research, however, null hypotheses of the form “Population A has a median at least five units larger than the median of Population B” arise. Null hypotheses of no difference are usually known to be false before the data are collected [9, p. 42; 48, pp. 384–8]; when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science.

Nunnally 1960

“The place of statistics in psychology”, Nunnally 1960:

The most misused and misconceived hypothesis-testing model employed in psychology is referred to as the “null-hypothesis” model. Stating it crudely, one null hypothesis would be that two treatments do not produce different mean effects in the long run. Using the obtained means and sample estimates of “population” variances, probability statements can be made about the acceptance or rejection of the null hypothesis. Similar null hypotheses are applied to correlations, complex experimental designs, factor-analytic results, and most all experimental results.

Although from a mathematical point of view the null-hypothesis models are internally neat, they share a crippling flaw: in the real world the null hypothesis is almost never true, and it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis. This is a personal point of view, and it cannot be proved directly. However, it is supported both by common sense and by practical experience. The common-sense argument is that different psychological treatments will almost always (in the long run) produce differences in mean effects, even though the differences may be very small. Also, just as nature abhors a vacuum, it probably abhors zero correlations between variables.

…Experience shows that when large numbers of subjects are used in studies, nearly all comparisons of means are “significantly” different and all correlations are “significantly” different from zero. The author once had occasion to use 700 subjects in a study of public opinion. After a factor analysis of the results, the factors were correlated with individual-difference variables such as amount of education, age, income, sex, and others. In looking at the results I was happy to find so many “significant” correlations (under the null-hypothesis model); indeed, nearly all correlations were significant, including ones that made little sense. Of course, with an N of 700, correlations as large as 0.08 are “beyond the 0.05 level.” Many of the “significant” correlations were of no theoretical or practical importance.
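Nunnally’s figure is easy to verify: under the usual t-test for a Pearson correlation, the smallest “significant” |r| shrinks roughly as 1/√N. A minimal sketch (using the normal approximation z ≈ 1.96 in place of exact t quantiles, an assumption made for simplicity):

```python
import math

def critical_r(n, z=1.96):
    """Smallest |r| significant at the two-sided 0.05 level for sample size n,
    using the normal approximation to the t distribution (an assumption here)."""
    # invert t = r * sqrt(n - 2) / sqrt(1 - r^2)  =>  r = t / sqrt(n - 2 + t^2)
    return z / math.sqrt(n - 2 + z ** 2)

for n in (50, 200, 700, 10_000):
    print(n, round(critical_r(n), 3))
```

At N = 700 the cutoff is about 0.074, matching Nunnally’s report that correlations of 0.08 were “beyond the 0.05 level”.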

The point of view taken here is that if the null hypoth­e­sis is not reject­ed, it usu­ally is because the N is too small. If enough data is gath­ered, the hypoth­e­sis will gen­er­ally be reject­ed. If rejec­tion of the null hypoth­e­sis were the real inten­tion in psy­cho­log­i­cal exper­i­ments, there usu­ally would be no need to gather data.

…Sta­tis­ti­cians are not to blame for the mis­con­cep­tions in psy­chol­ogy about the use of sta­tis­ti­cal meth­ods. They have warned us about the use of the hypoth­e­sis-test­ing mod­els and the related con­cepts. In par­tic­u­lar they have crit­i­cized the nul­l-hy­poth­e­sis model and have rec­om­mended alter­na­tive pro­ce­dures sim­i­lar to those rec­om­mended here (See Sav­age, 1957; Tukey, 1954; and Yates, 1951).

Smith 1960

“Review of N. T. J. Bailey, Statistical Methods in Biology”, Smith 1960:

How­ev­er, it is inter­est­ing to look at this book from another angle. Here we have set before us with great clar­ity a panorama of mod­ern sta­tis­ti­cal meth­ods, as used in biol­o­gy, med­i­cine, phys­i­cal sci­ence, social and men­tal sci­ence, and indus­try. How far does this show that these meth­ods ful­fil their aims of analysing the data reli­ably, and how many gaps are there still in our knowl­edge?…One fea­ture which can puz­zle an out­sider, and which requires much more jus­ti­fi­ca­tion than is usu­ally given, is the set­ting up of unplau­si­ble null hypothe­ses. For exam­ple, a sta­tis­ti­cian may set out a test to see whether two drugs have exactly the same effect, or whether a regres­sion line is exactly straight. These hypothe­ses can scarcely be taken lit­er­al­ly, but a sta­tis­ti­cian may say, quite rea­son­ably, that he wishes to test whether there is an appre­cia­ble differ­ence between the effects of the two drugs, or an appre­cia­ble cur­va­ture in the regres­sion line. But this raises at once the ques­tion: how large is ‘appre­cia­ble’? Or in other words, are we not really con­cerned with some kind of esti­ma­tion, rather than sig­nifi­cance?

Edwards 1963

“Bayesian sta­tis­ti­cal infer­ence for psy­cho­log­i­cal research”, Edwards et al 1963:

The most popular notion of a test is, roughly, a tentative decision between two hypotheses on the basis of data, and this is the notion that will dominate the present treatment of tests. Some qualification is needed if only because, in typical applications, one of the hypotheses—the null hypothesis—is known by all concerned to be false from the outset (Berkson, 1938; Hodges & Lehmann, 1954; Lehmann, 19598; I. R. Savage, 1957; L. J. Savage, 1954, p. 254); some ways of resolving the seeming absurdity will later be pointed out, and at least one of them will be important for us here…Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data; our list of situations that might stimulate hypothesis tests earlier in the section included several examples. Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. In such cases, the more the procedure is biased against the null hypothesis, the better. The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses.

Bakan 1966

“The test of sig­nifi­cance in psy­cho­log­i­cal research”, Bakan 1966:

Let us con­sider some of the diffi­cul­ties asso­ci­ated with the null hypoth­e­sis.

  1. The a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low. Nunnally (1960) has reported a similar experience involving correlation coefficients on 700 subjects. Joseph Berkson (1938) made the observation almost 30 years ago in connection with chi-square:

I believe that an obser­vant sta­tis­ti­cian who has had any con­sid­er­able expe­ri­ence with apply­ing the chi-square test repeat­edly will agree with my state­ment that, as a mat­ter of obser­va­tion, when the num­bers in the data are quite large, the P’s tend to come out small. Hav­ing observed this, and on reflec­tion, I make the fol­low­ing dog­matic state­ment, refer­ring for illus­tra­tion to the nor­mal curve: “If the nor­mal curve is fit­ted to a body of data rep­re­sent­ing any real obser­va­tions what­ever of quan­ti­ties in the phys­i­cal world, then if the num­ber of obser­va­tions is extremely large—­for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of sig­nifi­cance.”

This dog­matic state­ment is made on the basis of an extrap­o­la­tion of the obser­va­tion referred to and can also be defended as a pre­dic­tion from a pri­ori con­sid­er­a­tions. For we may assume that it is prac­ti­cally cer­tain that any series of real obser­va­tions does not actu­ally fol­low a nor­mal curve with absolute exac­ti­tude in all respects, and no mat­ter how small the dis­crep­ancy between the nor­mal curve and the true curve of obser­va­tions, the chi-square P will be small if the sam­ple has a suffi­ciently large num­ber of obser­va­tions in it.
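Berkson’s chi-square prediction takes more machinery to reproduce, but the same mechanism appears in the simplest possible case: a two-sided z-test of a proportion whose true value differs from the null by a trivial 0.001. A sketch of the approximate p-value expected at each sample size (the 0.501 true proportion is an illustrative assumption, not Berkson’s data):

```python
import math

def two_sided_p(n, true_p=0.501, null_p=0.5):
    """Approximate two-sided p-value, evaluated at the expected sample
    proportion, when the null is off by a tiny amount."""
    z = (true_p - null_p) / math.sqrt(null_p * (1 - null_p) / n)
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - Phi(z))

for n in (1_000, 100_000, 10_000_000, 1_000_000_000):
    print(f"n={n:>13,}  p≈{two_sided_p(n):.3g}")
```

At N = 1,000 the discrepancy is invisible (p ≈ 0.95); by N = 10,000,000 the p-value is far “beyond any usual limit of significance”, exactly as Berkson predicted.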

If this be so, then we have some­thing here that is apt to trou­ble the con­science of a reflec­tive sta­tis­ti­cian using the chi-square test. For I sup­pose it would be agreed by sta­tis­ti­cians that a large sam­ple is always bet­ter than a small sam­ple. If, then, we know in advance the P that will result from an appli­ca­tion of a chi-square test to a large sam­ple, there would seem to be no use in doing it on a smaller one. But since the result of the for­mer test is known, it is no test at all [pp. 526–527].

As one group of authors has put it, “in typical applications . . . the null hypothesis . . . is known by all concerned to be false from the outset [Edwards et al, 1963, p. 214].” The fact of the matter is that there is really no good reason to expect the null hypothesis to be true in any population. Why should the mean, say, of all scores east of the Mississippi be identical to all scores west of the Mississippi? Why should any correlation coefficient be exactly 0.00 in the population? Why should we expect the ratio of males to females to be exactly 50:50 in any population? Or why should different drugs have exactly the same effect on any population parameter (Smith, 1960)? A glance at any set of statistics on total populations will quickly confirm the rarity of the null hypothesis in nature.

…Should there be any devi­a­tion from the null hypoth­e­sis in the pop­u­la­tion, no mat­ter how small—and we have lit­tle doubt but that such a devi­a­tion usu­ally exist­s—a suffi­ciently large num­ber of obser­va­tions will lead to the rejec­tion of the null hypoth­e­sis. As Nun­nally (1960) put it,

if the null hypoth­e­sis is not reject­ed, it is usu­ally because the N is too small. If enough data are gath­ered, the hypoth­e­sis will gen­er­ally be reject­ed. If rejec­tion of the null hypoth­e­sis were the real inten­tion in psy­cho­log­i­cal exper­i­ments, there usu­ally would be no need to gather data [p. 643].

Meehl 1967

“Theory-testing in psychology and physics: A methodological paradox”, Meehl 1967:

One rea­son why the direc­tional null hypoth­e­sis (H02 : μg ≤ μb) is the appro­pri­ate can­di­date for exper­i­men­tal refu­ta­tion is the uni­ver­sal agree­ment that the old point-null hypoth­e­sis (H0 : μg = μb) is [qua­si-] always false in bio­log­i­cal and social sci­ence. Any depen­dent vari­able of inter­est, such as I.Q., or aca­d­e­mic achieve­ment, or per­cep­tual speed, or emo­tional reac­tiv­ity as mea­sured by skin resis­tance, or what­ev­er, depends mainly upon a finite num­ber of “strong” vari­ables char­ac­ter­is­tic of the organ­isms stud­ied (em­body­ing the accu­mu­lated results of their genetic makeup and their learn­ing his­to­ries) plus the influ­ences manip­u­lated by the exper­i­menter. Upon some com­pli­cat­ed, unknown math­e­mat­i­cal func­tion of this finite list of “impor­tant” deter­min­ers is then super­im­posed an indefi­nitely large num­ber of essen­tially “ran­dom” fac­tors which con­tribute to the intra­group vari­a­tion and there­fore boost the error term of the sta­tis­ti­cal sig­nifi­cance test. In order for two groups which differ in some iden­ti­fied prop­er­ties (such as social class, intel­li­gence, diag­no­sis, racial or reli­gious back­ground) to differ not at all in the “out­put” vari­able of inter­est, it would be nec­es­sary that all deter­min­ers of the out­put vari­able have pre­cisely the same aver­age val­ues in both groups, or else that their val­ues should differ by a pat­tern of amounts of differ­ence which pre­cisely coun­ter­bal­ance one another to yield a net differ­ence of zero. Now our gen­eral back­ground knowl­edge in the social sci­ences, or, for that mat­ter, even “com­mon sense” con­sid­er­a­tions, makes such an exact equal­ity of all deter­min­ing vari­ables, or a pre­cise “acci­den­tal” coun­ter­bal­anc­ing of them, so extremely unlikely that no psy­chol­o­gist or sta­tis­ti­cian would assign more than a neg­li­gi­bly small prob­a­bil­ity to such a state of affairs.

Exam­ple: Sup­pose we are study­ing a sim­ple per­cep­tu­al-ver­bal task like rate of col­or-nam­ing in school chil­dren, and the inde­pen­dent vari­able is father’s reli­gious pref­er­ence. Super­fi­cial con­sid­er­a­tion might sug­gest that these two vari­ables would not be relat­ed, but a lit­tle thought leads one to con­clude that they will almost cer­tainly be related by some amount, how­ever small. Con­sid­er, for instance, that a child’s reac­tion to any sort of school-con­text task will be to some extent depen­dent upon his social class, since the desire to please aca­d­e­mic per­son­nel and the desire to achieve at a per­for­mance (just because it is a task, regard­less of its intrin­sic inter­est) are both related to the kinds of sub­-cul­tural and per­son­al­ity traits in the par­ents that lead to upward mobil­i­ty, eco­nomic suc­cess, the gain­ing of fur­ther edu­ca­tion, and the like. Again, since there is known to be a sex differ­ence in color nam­ing, it is likely that fathers who have entered occu­pa­tions more attrac­tive to “fem­i­nine” males will (on the aver­age) pro­vide a some­what more fem­i­nine father fig­ure for iden­ti­fi­ca­tion on the part of their male off­spring, and that a more refined color vocab­u­lary, mak­ing closer dis­crim­i­na­tions between sim­i­lar hues, will be char­ac­ter­is­tic of the ordi­nary lan­guage of such a house­hold. Fur­ther, it is known that there is a cor­re­la­tion between a child’s gen­eral intel­li­gence and its father’s occu­pa­tion, and of course there will be some rela­tion, even though it may be small, between a child’s gen­eral intel­li­gence and his color vocab­u­lary, aris­ing from the fact that vocab­u­lary in gen­eral is heav­ily sat­u­rated with the gen­eral intel­li­gence fac­tor. Since reli­gious pref­er­ence is a cor­re­late of social class, all of these social class fac­tors, as well as the intel­li­gence vari­able, would tend to influ­ence col­or-nam­ing per­for­mance. 
Or con­sider a more extreme and faint kind of rela­tion­ship. It is quite con­ceiv­able that a child who belongs to a more litur­gi­cal reli­gious denom­i­na­tion would be some­what more col­or-ori­ented than a child for whom bright col­ors were not asso­ci­ated with the reli­gious life. Every­one famil­iar with psy­cho­log­i­cal research knows that numer­ous “puz­zling, unex­pected” cor­re­la­tions pop up all the time, and that it requires only a mod­er­ate amount of moti­va­tion-plus-in­ge­nu­ity to con­struct very plau­si­ble alter­na­tive the­o­ret­i­cal expla­na­tions for them.

…These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan 1966 and Nunnally 1960. Data currently being analyzed by Dr. David Lykken and myself9, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious construct validity, or involving arbitrary groupings of non-homogeneous or nonmonotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10⁻⁶).
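Meehl’s “crud factor” can be simulated directly: give every variable a weak loading on a single shared factor, draw a large sample, and count how many pairwise correlations clear the nominal 0.05 bar. The loading, sample size, and variable count below are illustrative choices, not Meehl & Lykken’s data:

```python
import math, random

random.seed(1)

n_subjects, n_vars, loading = 2000, 8, 0.3
# Every variable shares one weak common factor g -- Meehl's "crud factor".
g = [random.gauss(0, 1) for _ in range(n_subjects)]
data = [[loading * g[i] + random.gauss(0, 1) for i in range(n_subjects)]
        for _ in range(n_vars)]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

crit = 1.96 / math.sqrt(n_subjects)          # approximate 0.05 cutoff for |r|
pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
sig = sum(abs(pearson_r(data[i], data[j])) > crit for i, j in pairs)
print(f"{sig}/{len(pairs)} pairwise correlations 'significant' at 0.05")
```

With these settings nearly all 28 pairs come out “significant”, even though no pair shares even 1% of its variance.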

…Con­sid­er­ing the fact that “every­thing in the brain is con­nected with every­thing else,” and that there exist sev­eral “gen­eral state-vari­ables” (such as arousal, atten­tion, anx­i­ety, and the like) which are known to be at least slightly influ­ence­able by prac­ti­cally any kind of stim­u­lus input, it is highly unlikely that any psy­cho­log­i­cally dis­crim­inable stim­u­la­tion which we apply to an exper­i­men­tal sub­ject would exert lit­er­ally zero effect upon any aspect of his per­for­mance. The psy­cho­log­i­cal lit­er­a­ture abounds with exam­ples of small but detectable influ­ences of this kind. Thus it is known that if a sub­ject mem­o­rizes a list of non­sense syl­la­bles in the pres­ence of a faint odor of pep­per­mint, his recall will be facil­i­tated by the pres­ence of that odor. Or, again, we know that indi­vid­u­als solv­ing intel­lec­tual prob­lems in a “messy” room do not per­form quite as well as indi­vid­u­als work­ing in a neat, well-ordered sur­round. Again, cog­ni­tive processes undergo a detectable facil­i­ta­tion when the think­ing sub­ject is con­cur­rently per­form­ing the irrel­e­vant, noncog­ni­tive task of squeez­ing a hand dynamome­ter. It would require con­sid­er­able inge­nu­ity to con­coct exper­i­men­tal manip­u­la­tions, except the most min­i­mal and triv­ial (such as a very slight mod­i­fi­ca­tion in the word order of instruc­tions given a sub­ject) where one could have con­fi­dence that the manip­u­la­tion would be utterly with­out effect upon the sub­jec­t’s moti­va­tional lev­el, atten­tion, arousal, fear of fail­ure, achieve­ment dri­ve, desire to please the exper­i­menter, dis­trac­tion, social fear, etc., etc. 
So that, for exam­ple, while there is no very “inter­est­ing” psy­cho­log­i­cal the­ory that links hunger drive with col­or-nam­ing abil­i­ty, I myself would con­fi­dently pre­dict a sig­nifi­cant differ­ence in col­or-nam­ing abil­ity between per­sons tested after a full meal and per­sons who had not eaten for 10 hours, pro­vided the sam­ple size were suffi­ciently large and the col­or-nam­ing mea­sure­ments suffi­ciently reli­able, since one of the effects of the increased hunger drive is height­ened “arousal,” and any­thing which height­ens arousal would be expected to affect a per­cep­tu­al-cog­ni­tive per­for­mance like col­or-nam­ing. Suffice it to say that there are very good rea­sons for expect­ing at least some slight influ­ence of almost any exper­i­men­tal manip­u­la­tion which would differ suffi­ciently in its form and con­tent from the manip­u­la­tion imposed upon a con­trol group to be included in an exper­i­ment in the first place. In what fol­lows I shall there­fore assume that the point-null hypoth­e­sis H0 is, in psy­chol­o­gy, [qua­si-] always false.

See also Waller 2004, and Meehl’s 2003 CSS talk, “Cri­tique of Null Hypoth­e­sis Sig­nifi­cance Test­ing” (MP3 audio; slides).

Lykken 1968

“Sta­tis­ti­cal Sig­nifi­cance in Psy­cho­log­i­cal Research”, Lykken 1968:

Most the­o­ries in the areas of per­son­al­i­ty, clin­i­cal, and social psy­chol­ogy pre­dict no more than the direc­tion of a cor­re­la­tion, group differ­ence, or treat­ment effect. Since the null hypoth­e­sis is never strictly true, such pre­dic­tions have about a 50-50 chance of being con­firmed by exper­i­ment when the the­ory in ques­tion is false, since the sta­tis­ti­cal sig­nifi­cance of the result is a func­tion of the sam­ple size.

…Most psychological experiments are of three kinds: (a) studies of the effect of some treatment on some output variables, which can be regarded as a special case of (b) studies of the difference between two or more groups of individuals with respect to some variable, which in turn are a special case of (c) the study of the relationship or correlation between two or more variables within some specified population. Using the bivariate correlation design as paradigmatic, then, one notes first that the strict null hypothesis must always be assumed to be false (this idea is not new and has recently been illuminated by Bakan, 1966). Unless one of the variables is wholly unreliable so that the values obtained are strictly random, it would be foolish to suppose that the correlation between any two variables is identically equal to 0.0000 . . . (or that the effect of some treatment or the difference between two groups is exactly zero). The molar dependent variables employed in psychological research are extremely complicated in the sense that the measured value of such a variable tends to be affected by the interaction of a vast number of factors, both in the present situation and in the history of the subject organism. It is exceedingly unlikely that any two such variables will not share at least some of these factors and equally unlikely that their effects will exactly cancel one another out.

It might be argued that the more complex the variables the smaller their average correlation ought to be since a larger pool of common factors allows more chance for mutual cancellation of effects in obedience to the law of large numbers10. However, one knows of a number of unusually potent and pervasive factors which operate to unbalance such convenient symmetries and to produce correlations large enough to rival the effects of whatever causal factors the experimenter may have had in mind. Thus, we know that (a) “good” psychological and physical variables tend to be positively correlated; (b) experimenters, without deliberate intention, can somehow subtly bias their findings in the expected direction (Rosenthal, 1963); (c) the effects of common method are often as strong as or stronger than those produced by the actual variables of interest (e.g., in a large and careful study of the factorial structure of adjustment to stress among officer candidates, Holtzman & Bitterman, 1956, found that their 101 original variables contained five main common factors representing, respectively, their rating scales, their perceptual-motor tests, the McKinney Reporting Test, their GSR variables, and the Wechsler-Bellevue); (d) transitory state variables such as the subject’s anxiety level, fatigue, or his desire to please, may broadly affect all measures obtained in a single experimental session. This average shared variance of “unrelated” variables can be thought of as a kind of ambient noise level characteristic of the domain. It would be interesting to obtain empirical estimates of this quantity in our field to serve as a kind of baseline against which to compare obtained relationships predicted by some theory under test. 
If, as I think, it is not unrea­son­able to sup­pose that “unre­lated” molar psy­cho­log­i­cal vari­ables share on the aver­age about 4% to 5% of com­mon vari­ance, then the expected cor­re­la­tion between any such vari­ables would be about 0.20 in absolute value and the expected differ­ence between any two groups on some such vari­able would be nearly 0.5 stan­dard devi­a­tion units. (Note that these esti­mates assume zero mea­sure­ment error. One can bet­ter explain the near-zero cor­re­la­tions often observed in psy­cho­log­i­cal research in terms of unre­li­a­bil­ity of mea­sures than in terms of the assump­tion that the true scores are in fact unre­lat­ed.)
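Lykken’s arithmetic can be made explicit: 4%–5% shared variance implies |r| = √0.045 ≈ 0.21, and converting that to a two-group mean difference via the standard equal-groups point-biserial formula gives a bit over 0.4 SD, close to his “nearly 0.5 standard deviation units”:

```python
import math

shared_variance = 0.045          # midpoint of Lykken's 4%-5% "ambient noise"
r = math.sqrt(shared_variance)   # correlation implied by shared variance
# standard point-biserial conversion for two equal-sized groups
d = 2 * r / math.sqrt(1 - r ** 2)
print(f"r = {r:.2f}, d = {d:.2f} SD units")
```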

Nichols 1968

“Hered­i­ty, Envi­ron­ment, and School Achieve­ment”, Nichols 1968:

There are three main fac­tors or types of vari­ables that seem likely to have an impor­tant influ­ence on abil­ity and school achieve­ment. These are (a) the school fac­tor or orga­nized edu­ca­tional influ­ences; (b) the fam­ily fac­tor or all of the social influ­ences of fam­ily life on a child; and (c) the genetic fac­tor…the sep­a­ra­tion of the effects of the major types of influ­ences has proved to be extra­or­di­nar­ily diffi­cult, and all of the research so far has not resulted in a clear-cut con­clu­sion.

…This messy sit­u­a­tion is due pri­mar­ily to the fact that in human soci­ety all good things tend to go togeth­er. The most intel­li­gent par­ents—those with the best genetic poten­tial—also tend to pro­vide the most com­fort­able and intel­lec­tu­ally stim­u­lat­ing home envi­ron­ments for their chil­dren, and also tend to send their chil­dren to the most afflu­ent and well-e­quipped schools. Thus, the ubiq­ui­tous cor­re­la­tion between fam­ily socio-e­co­nomic sta­tus and school achieve­ment is ambigu­ous in mean­ing, and iso­lat­ing the inde­pen­dent con­tri­bu­tion of the fac­tors involved is diffi­cult. How­ev­er, the strong emo­tion­ally moti­vated atti­tudes and vested inter­ests in this area have also tended to inhibit the sort of dis­pas­sion­ate, objec­tive eval­u­a­tion of the avail­able evi­dence that is nec­es­sary for the advance of sci­ence.

Hays 1973

Statistics for the social sciences (2nd edition), Hays 1973; chapter 10, pages 413–417:

10.19: Test­man­ship, or how big is a differ­ence?

…As we saw in Chap­ter 4, the com­plete absence of a sta­tis­ti­cal rela­tion, or no asso­ci­a­tion, occurs only when the con­di­tional dis­tri­b­u­tion of the depen­dent vari­able is the same regard­less of which treat­ment is admin­is­tered. Thus if the inde­pen­dent vari­able is not asso­ci­ated at all with the depen­dent vari­able the pop­u­la­tion dis­tri­b­u­tions must be iden­ti­cal over the treat­ments. If, on the other hand, the means of the differ­ent treat­ment pop­u­la­tions are differ­ent, the con­di­tional dis­tri­b­u­tions them­selves must be differ­ent and the inde­pen­dent and depen­dent vari­ables must be asso­ci­at­ed. The rejec­tion of the hypoth­e­sis of no differ­ence between pop­u­la­tion means is tan­ta­mount to the asser­tion that the treat­ment given does have some sta­tis­ti­cal asso­ci­a­tion with the depen­dent vari­able score.

…How­ev­er, the occur­rence of a sig­nifi­cant result says noth­ing at all about the strength of the asso­ci­a­tion between treat­ment and score. A sig­nifi­cant result leads to the infer­ence that some asso­ci­a­tion exists, but in no sense does this mean that an impor­tant degree of asso­ci­a­tion nec­es­sar­ily exists. Con­verse­ly, evi­dence of a strong sta­tis­ti­cal asso­ci­a­tion can occur in data even when the results are not sig­nifi­cant. The game of infer­ring the true degree of sta­tis­ti­cal asso­ci­a­tion has a jok­er: this is the sam­ple size. The time has come to define the notion of the strength of a sta­tis­ti­cal asso­ci­a­tion more sharply, and to link this idea with that of the true differ­ence between pop­u­la­tion means.

When does it seem appropriate to say that a strong association exists between the experimental factor X and the dependent variable Y? Over all of the different possibilities for X there is a probability distribution of Y values, which is the marginal distribution of Y over (x,y) events. The existence of this distribution implies that we do not know exactly what the Y value for any observation will be; we are always uncertain about Y to some extent. However, given any particular X, there is also a conditional distribution of Y, and it may be that in this conditional distribution the highly probable values of Y tend to “shrink” within a much narrower range than in the marginal distribution. If so, we can say that the information about X tends to reduce uncertainty about Y. In general we will say that the strength of a statistical relation is reflected by the extent to which knowing X reduces uncertainty about Y. One of the best indicators of our uncertainty about the value of a variable is σ², the variance of its distribution…This index reflects the predictive power afforded by a relationship: when ω² is zero, then X does not aid us at all in predicting the value of Y. On the other hand, when ω² is 1.00, this tells us that X lets us know Y exactly…About now you should be wondering what the index ω² has to do with the difference between population means.

…When the difference μ1 − μ2 is zero, then ω² must be zero. In the usual t-test for a difference, the hypothesis of no difference between means is equivalent to the hypothesis that ω² = 0. On the other hand, when there is any difference at all between population means, the value of ω² must be greater than 0. In short, a true difference is “big” in the sense of predictive power only if the square of that difference is large relative to σ². However, in significance tests such as t, we compare the difference we get with an estimate of σdiff. The standard error of the difference can be made almost as small as we choose if we are given a free choice of sample size. Unless sample size is specified, there is no necessary connection between significance and the true strength of association.

This points up the fal­lacy of eval­u­at­ing the “good­ness” of a result in terms of sta­tis­ti­cal sig­nifi­cance alone, with­out allow­ing for the sam­ple size used. All sig­nifi­cant results do not imply the same degree of true asso­ci­a­tion between inde­pen­dent and depen­dent vari­ables.

It is sad but true that researchers have been known to capitalize on this fact. There is a certain amount of “testmanship” involved in using inferential statistics. Virtually any study can be made to show significant results if one uses enough subjects, regardless of how nonsensical the content may be. There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero. If one applies a large enough sample to the study of any relation, trivial or meaningless as it may be, sooner or later he is almost certain to achieve a significant result. Such a result may be a valid finding, but only in the sense that one can say with assurance that some association is not exactly zero. The degree to which such a finding enhances our knowledge is debatable. If the criterion of strength of association is applied to such a result, it becomes obvious that little or nothing is actually contributed to our ability to predict one thing from another.

For exam­ple, sup­pose that two meth­ods of teach­ing first grade chil­dren to read are being com­pared. A ran­dom sam­ple of 1000 chil­dren are taught to read by method I, another sam­ple of 1000 chil­dren by method II. The results of the instruc­tion are eval­u­ated by a test that pro­vides a score, in whole units, for each child. Sup­pose that the results turned out as fol­lows:

Method I        Method II
M1 = 147.21     M2 = 147.64
N1 = 1000       N2 = 1000

Then, the estimated standard error of the difference is about 0.145, and the z value is (147.64 − 147.21)⁄0.145 ≈ 2.97.

This cer­tainly per­mits rejec­tion of the null hypoth­e­sis of no differ­ence between the groups. How­ev­er, does it really tell us very much about what to expect of an indi­vid­ual child’s score on the test, given the infor­ma­tion that he was taught by method I or method II? If we look at the group of chil­dren taught by method II, and assume that the dis­tri­b­u­tion of their scores is approx­i­mately nor­mal, we find that about 45% of these chil­dren fall below the mean score for chil­dren in group I. Sim­i­lar­ly, about 45% of chil­dren in group I fall above the mean score for group II. Although the differ­ence between the two groups is sig­nifi­cant, the two groups actu­ally over­lap a great deal in terms of their per­for­mances on the test. In this sense, the two groups are really not very differ­ent at all, even though the differ­ence between the means is quite sig­nifi­cant in a purely sta­tis­ti­cal sense.
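Hays’s numbers can be reconstructed from the two means and the estimated standard error alone; assuming equal within-group SDs (implicit in his setup but not stated), the standard error of 0.145 with N = 1000 per group implies an SD near 3.24, and both the z value and the ~45% overlap follow:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

m1, m2, n = 147.21, 147.64, 1000
se_diff = 0.145                       # Hays's estimated standard error
z = (m2 - m1) / se_diff
# back out the (assumed equal) within-group SD from the standard error
sd = se_diff / math.sqrt(2 / n)
below = phi((m1 - m2) / sd)           # share of method-II children below M1
print(f"z = {z:.2f}; {below:.0%} of group II score below group I's mean")
```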

Putting the mat­ter in a slightly differ­ent way, we note that the grand mean of the two groups is 147.425. Thus, our best bet about the score of any child, not know­ing the method of his train­ing, is 147.425. If we guessed that any child drawn at ran­dom from the com­bined group should have a score above 147.425, we should be wrong about half the time. How­ev­er, among the orig­i­nal groups, accord­ing to method I and method II, the pro­por­tions falling above and below this grand mean are approx­i­mately as fol­lows:

            Below 147.425   Above 147.425
Method I        0.51            0.49
Method II       0.49            0.51

This implies that if we know a child is from group I, and we guess that his score is below the grand mean, then we will be wrong about 49% of the time. Similarly, if a child is from group II, and we guess his score to be above the grand mean, we will be wrong about 49% of the time. If we are not given the group to which the child belongs, and we guess either above or below the grand mean, we will be wrong about 50% of the time. Knowing the group does reduce the probability of error in such a guess, but it does not reduce it very much. The method by which the child was trained simply doesn’t tell us a great deal about what the child’s score will be, even though the difference in mean scores is significant in the statistical sense.

This kind of test­man­ship flour­ishes best when peo­ple pay too much atten­tion to the sig­nifi­cance test and too lit­tle to the degree of sta­tis­ti­cal asso­ci­a­tion the find­ing rep­re­sents. This clut­ters up the lit­er­a­ture with find­ings that are often not worth pur­su­ing, and which serve only to obscure the really impor­tant pre­dic­tive rela­tions that occa­sion­ally appear. The seri­ous sci­en­tist owes it to him­self and his read­ers to ask not only, “Is there any asso­ci­a­tion between X and Y?” but also, “How much does my find­ing sug­gest about the power to pre­dict Y from X?” Much too much empha­sis is paid to the for­mer, at the expense of the lat­ter, ques­tion.

Oakes 1975

“On the alleged fal­sity of the null hypoth­e­sis”, Oakes 1975:

Consideration is given to the contention by Bakan, Meehl, Nunnally, and others that the null hypothesis in behavioral research is generally false in nature and that if the N is large enough, it will always be rejected. A distinction is made between self-selected-groups research designs and true experiments, and it is suggested that the null hypothesis probably is generally false in the case of research involving the former design, but is not in the case of research involving the latter. Reasons for the falsity of the null hypothesis in the one case but not in the other are suggested.

The U.S. has recently reported the results of research on performance contracting. With 23,000 Ss—13,000 experimental and 10,000 control—the null hypothesis was not rejected. The experimental Ss, who received special instruction in reading and mathematics for 2 hours per day during the 1970–71 school year, did not differ significantly from the controls in achievement gains (American Institutes for Research, 1972, p. 5). Such an inability to reject the null hypothesis might not be surprising to the typical classroom teacher or to most educational psychologists, but in view of the huge N involved, it should give pause to Bakan (1966), who contends that the null hypothesis is generally false in behavioral research, as well as to those writers such as Nunnally (1960) and Meehl (1967), who agree with that contention. They hold that if the N is large enough, the null is sure to be rejected in behavioral research. This paper will suggest that the falsity contention does not hold in the case of experimental research—that the null hypothesis is not generally false in such research.

Loehlin & Nichols 1976

Loehlin & Nichols 1976 (see also Heredity and Environment: Major Findings from Twin Studies of Ability, Personality, and Interests, Nichols 1976/1979):

This vol­ume reports on a study of 850 pairs of twins who were tested to deter­mine the influ­ence of hered­ity and envi­ron­ment on indi­vid­ual differ­ences in per­son­al­i­ty, abil­i­ty, and inter­ests. It presents the back­ground, research design, and pro­ce­dures of the study, a com­plete tab­u­la­tion of the test results, and the authors’ exten­sive analy­sis of their find­ings. Based on one of the largest stud­ies of twin behav­ior con­ducted in the twen­ti­eth cen­tu­ry, the book chal­lenges a num­ber of tra­di­tional beliefs about genetic and envi­ron­men­tal con­tri­bu­tions to per­son­al­ity devel­op­ment.

The sub­jects were cho­sen from par­tic­i­pants in the National Merit Schol­ar­ship Qual­i­fy­ing Test of 1962 and were mailed a bat­tery of per­son­al­ity and inter­est ques­tion­naires. In addi­tion, par­ents of the twins were sent ques­tion­naires ask­ing about the twins’ early expe­ri­ences. A sim­i­lar sam­ple of non­twin stu­dents who had taken the merit exam pro­vided a com­par­i­son group. The ques­tions inves­ti­gated included how twins are sim­i­lar to or differ­ent from non-twins, how iden­ti­cal twins are sim­i­lar to or differ­ent from fra­ter­nal twins, how the per­son­al­i­ties and inter­ests of twins reflect genetic fac­tors, how the per­son­al­i­ties and inter­ests of twins reflect early envi­ron­men­tal fac­tors, and what impli­ca­tions these ques­tions have for the gen­eral issue of how hered­ity and envi­ron­ment influ­ence the devel­op­ment of psy­cho­log­i­cal char­ac­ter­is­tics. In attempt­ing to answer these ques­tions, the authors shed light on the impor­tance of both genes and envi­ron­ment and form the basis for differ­ent approaches in behav­ior genetic research.

The book is largely a dis­cus­sion of com­pre­hen­sive sum­mary sta­tis­tics of twin cor­re­la­tions from an early large-s­cale twin study (can­vassed via the National Merit Schol­ar­ship Qual­i­fy­ing Test, 1962). They attempted to com­pile a large-s­cale twin sam­ple with­out the bur­den of a ful­l-blown twin reg­istry by an exten­sive mail sur­vey of the n = 1507 11th-grade ado­les­cent pairs of par­tic­i­pants in the high school National Merit Schol­ar­ship Qual­i­fy­ing Test of 1962 (to­tal n~600,000) who indi­cated they were twins (as well as a con­trol sam­ple of non-twin­s), yield­ing 514 iden­ti­cal twin & 336 (same-sex) fra­ter­nal twin pairs; they were ques­tioned as fol­lows:

…to these [par­tic­i­pants] were mailed a bat­tery of per­son­al­ity and inter­est tests, includ­ing the Cal­i­for­nia Psy­cho­log­i­cal Inven­tory (CPI), the Hol­land Voca­tional Pref­er­ence Inven­tory (VPI), an exper­i­men­tal Objec­tive Behav­ior Inven­tory (OBI), an Adjec­tive Check List (ACL), and a num­ber of oth­er, briefer self­-rat­ing scales, atti­tude mea­sures, and other items. In addi­tion, a par­ent was asked to fill out a ques­tion­naire describ­ing the early expe­ri­ences and home envi­ron­ment of the twins. Other brief ques­tion­naires were sent to teach­ers and friends, ask­ing them to rate the twins on a num­ber of per­son­al­ity traits; because these rat­ings were avail­able for only part of our basic sam­ple, they have not been ana­lyzed in detail and will not be dis­cussed fur­ther in this book. (The par­ent and twin ques­tion­naires, except for the CPI, are repro­duced in Appen­dix A.)

Unusu­al­ly, the book includes appen­dices report­ing raw twin-pair cor­re­la­tions for all of the reported items, not a mere hand­ful of selected analy­ses on full test-s­cales or sub­fac­tors. (Be­cause of this, I was able to extract vari­ables related to leisure time pref­er­ences & activ­i­ties for .) One can see that even down to the item lev­el, her­i­tabil­i­ties tend to be non-zero and most vari­ables are cor­re­lated with­in-in­di­vid­u­als or with envi­ron­ments as well.

Meehl 1978

“The­o­ret­i­cal risks and tab­u­lar aster­isks: Sir Karl, Sir Ronald, and the slow progress of soft psy­chol­ogy”, Meehl 1978:

Since the null hypoth­e­sis is qua­si­-al­ways false, tables sum­ma­riz­ing research in terms of pat­terns of “sig­nifi­cant differ­ences” are lit­tle more than com­plex, causally unin­ter­pretable out­comes of sta­tis­ti­cal power func­tions.

The kinds of the­o­ries and the kinds of the­o­ret­i­cal risks to which we put them in soft psy­chol­ogy when we use sig­nifi­cance test­ing as our method are not like test­ing Meehl’s the­ory of weather by see­ing how well it fore­casts the num­ber of inches it will rain on cer­tain days. Instead, they are depress­ingly close to test­ing the the­ory by see­ing whether it rains in April at all, or rains sev­eral days in April, or rains in April more than in May. It hap­pens mainly because, as I believe is gen­er­ally rec­og­nized by sta­tis­ti­cians today and by thought­ful social sci­en­tists, the null hypoth­e­sis, taken lit­er­al­ly, is always false. I shall not attempt to doc­u­ment this here, because among sophis­ti­cated per­sons it is taken for grant­ed. (See Mor­ri­son & Henkel, 1970 [The Sig­nifi­cance Test Con­tro­ver­sy: A Reader], espe­cially the chap­ters by Bakan, Hog­ben, Lykken, Meehl, and .) A lit­tle reflec­tion shows us why it has to be the case, since an out­put vari­able such as adult IQ, or aca­d­e­mic achieve­ment, or effec­tive­ness at com­mu­ni­ca­tion, or what­ev­er, will always, in the social sci­ences, be a func­tion of a siz­able but finite num­ber of fac­tors. (The small­est con­tri­bu­tions may be con­sid­ered as essen­tially a ran­dom vari­ance ter­m.) In order for two groups (males and females, or whites and blacks, or manic depres­sives and schiz­o­phren­ics, or Repub­li­cans and Democ­rats) to be exactly equal on such an out­put vari­able, we have to imag­ine that they are exactly equal or del­i­cately coun­ter­bal­anced on all of the con­trib­u­tors in the causal equa­tion, which will never be the case.

Fol­low­ing the gen­eral line of rea­son­ing (pre­sented by myself and sev­eral oth­ers over the last decade), from the fact that the null hypoth­e­sis is always false in soft psy­chol­o­gy, it fol­lows that the prob­a­bil­ity of refut­ing it depends wholly on the sen­si­tiv­ity of the exper­i­men­t—its log­i­cal design, the net (at­ten­u­at­ed) con­struct valid­ity of the mea­sures, and, most impor­tant­ly, the sam­ple size, which deter­mines where we are on the sta­tis­ti­cal power func­tion. Putting it crude­ly, if you have enough cases and your mea­sures are not totally unre­li­able, the null hypoth­e­sis will always be fal­si­fied, regard­less of the truth of the sub­stan­tive the­ory. Of course, it could be fal­si­fied in the wrong direc­tion, which means that as the power improves, the prob­a­bil­ity of a cor­rob­o­ra­tive result approaches one-half. How­ev­er, if the the­ory has no verisimil­i­tude—­such that we can imag­ine, so to speak, pick­ing our empir­i­cal results ran­domly out of a direc­tional hat apart from any the­o­ry—the prob­a­bil­ity of refut­ing by get­ting a sig­nifi­cant differ­ence in the wrong direc­tion also approaches one-half. Obvi­ous­ly, this is quite unlike the sit­u­a­tion desired from either a Bayesian, a Pop­pe­ri­an, or a com­mon­sense sci­en­tific stand­point. As I have pointed out else­where (Meehl, 1967/1970b; but see crit­i­cism by Oakes, 1975; Keuth, 1973; and rebut­tal by Swoyer & Mon­son, 1975), an improve­ment in instru­men­ta­tion or other sources of exper­i­men­tal accu­racy tends, in physics or astron­omy or chem­istry or genet­ics, to sub­ject the the­ory to a greater risk of refu­ta­tion modus tol­lens, whereas improved pre­ci­sion in null hypoth­e­sis test­ing usu­ally decreases this risk. A suc­cess­ful sig­nifi­cance test of a sub­stan­tive the­ory in soft psy­chol­ogy pro­vides a fee­ble cor­rob­o­ra­tion of the the­ory because the pro­ce­dure has sub­jected the the­ory to a fee­ble risk.

…I am not mak­ing some nit-pick­ing sta­tis­ti­cian’s cor­rec­tion. I am say­ing that the whole busi­ness is so rad­i­cally defec­tive as to be sci­en­tifi­cally almost point­less… I am mak­ing a philo­soph­i­cal com­plaint or, if you prefer, a com­plaint in the domain of sci­en­tific method. I sug­gest that when a reviewer tries to “make the­o­ret­i­cal sense” out of such a table of favor­able and adverse sig­nifi­cance test results, what the reviewer is actu­ally engaged in, willy-nilly or unwit­ting­ly, is mean­ing­less sub­stan­tive con­struc­tions on the prop­er­ties of the sta­tis­ti­cal power func­tion, and almost noth­ing else.

…You may say, “But, Meehl, R. A. Fisher was a genius, and we all know how valu­able his stuff has been in agron­o­my. Why should­n’t it work for soft psy­chol­o­gy?” Well, I am not intim­i­dated by Fish­er’s genius, because my com­plaint is not in the field of math­e­mat­i­cal sta­tis­tics, and as regards induc­tive logic and phi­los­o­phy of sci­ence, it is well-known that Sir Ronald per­mit­ted him­self a great deal of dog­ma­tism. I remem­ber my amaze­ment when the late said to me, the first time I met him, “But, of course, on this sub­ject Fisher is just mis­tak­en: surely you must know that.” My sta­tis­ti­cian friends tell me that it is not clear just how use­ful the sig­nifi­cance test has been in bio­log­i­cal sci­ence either, but I set that aside as beyond my com­pe­tence to dis­cuss.

Loftus & Loftus 1982

Essence of Sta­tis­tics, Loftus & Loftus 1982/1988 (2nd ed), pg515–516 (pg498-499 in the 1982 print­ing):

Relative Importance Of These Three Measures. It is a matter of some debate as to which of these three measures [σ²/p/R²] we should pay the most attention to in an experiment. It’s our opinion that finding a “significant effect” really provides very little information because it’s almost certainly true that some relationship (however small) exists between any two variables. And in general finding a significant effect simply means that enough observations have been collected in the experiment to make the statistical test of the experiment powerful enough to detect whatever effect there is. The smaller the effect, the more powerful the experiment needs to be of course, but no matter how small the effect, it’s always possible in principle to design an experiment sufficiently powerful to detect it. We saw a striking example of this principle in the office hours experiment. In this experiment there was a relationship between the two variables—and since there were so many subjects in the experiment (that is, since the test was so powerful), this relationship was revealed in the statistical analysis. But was it anything to write home about? Certainly not. In any sort of practical context the size of the effect, although nonzero, is so small it can almost be ignored.

It is our judg­ment that account­ing for vari­ance is really much more mean­ing­ful than test­ing for sig­nifi­cance.
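The Loftuses’ point can be made quantitative: required sample size scales as 1/*d*², so any nonzero effect becomes “significant” at some *n*. A sketch using the standard normal-approximation formula for a two-sample, two-sided test (the effect sizes shown are illustrative, not from their office-hours example):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample two-sided z-test
    to detect a standardized mean difference d with the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Halving the effect quadruples the required n; no effect is too small in principle.
sizes = {d: n_per_group(d) for d in (0.5, 0.1, 0.02)}
```

A medium effect (*d* = 0.5) needs only a few dozen subjects per group, while a trivial *d* = 0.02 needs tens of thousands; but nothing except cost stops an experimenter from collecting them.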

Meehl 1990 (1)

“Why sum­maries of research on psy­cho­log­i­cal the­o­ries are often unin­ter­pretable”, Meehl 1990a (also dis­cussed in Cohen’s 1994 paper “The Earth is Round (p<.05)”):

Prob­lem 6. Crud fac­tor: In the social sci­ences and arguably in the bio­log­i­cal sci­ences, “every­thing cor­re­lates to some extent with every­thing else.” This tru­ism, which I have found no com­pe­tent psy­chol­o­gist dis­putes given five min­utes reflec­tion, does not apply to pure exper­i­men­tal stud­ies in which attrib­utes that the sub­jects bring with them are not the sub­ject of study (ex­cept in so far as they appear as a source of error and hence in the denom­i­na­tor of a sig­nifi­cance test).6 There is noth­ing mys­te­ri­ous about the fact that in psy­chol­ogy and soci­ol­ogy every­thing cor­re­lates with every­thing. Any mea­sured trait or attribute is some func­tion of a list of partly known and mostly unknown causal fac­tors in the genes and life his­tory of the indi­vid­u­al, and both genetic and envi­ron­men­tal fac­tors are known from tons of empir­i­cal research to be them­selves cor­re­lat­ed. To take an extreme case, sup­pose we con­strue the null hypoth­e­sis lit­er­ally (ob­ject­ing that we mean by it “almost null” gets ahead of the sto­ry, and destroys the rigor of the Fish­er­ian math­e­mat­ic­s!) and ask whether we expect males and females in Min­nesota to be pre­cisely equal in some arbi­trary trait that has indi­vid­ual differ­ences, say, color nam­ing. In the case of color nam­ing we could think of some obvi­ous differ­ences right off, but even if we did­n’t know about them, what is the causal sit­u­a­tion? If we write a causal equa­tion (which is not the same as a regres­sion equa­tion for pure pre­dic­tive pur­poses but which, if we had it, would serve bet­ter than the lat­ter) so that the score of an indi­vid­ual male is some func­tion (pre­sum­ably non­lin­ear if we knew enough about it but here sup­posed lin­ear for sim­plic­i­ty) of a rather long set of causal vari­ables of genetic and envi­ron­men­tal type X1, X2, … Xm. These val­ues are oper­ated upon by regres­sion coeffi­cients b1, b2, …bm.

…Now we write a sim­i­lar equa­tion for the class of females. Can any­one sup­pose that the beta coeffi­cients for the two sexes will be exactly the same? Can any­one imag­ine that the mean val­ues of all of the _X_s will be exactly the same for males and females, even if the cul­ture were not still con­sid­er­ably sex­ist in child-rear­ing prac­tices and the like? If the betas are not exactly the same for the two sex­es, and the mean val­ues of the _X_s are not exactly the same, what kind of Leib­nitz­ian preestab­lished har­mony would we have to imag­ine in order for the mean col­or-nam­ing score to come out exactly equal between males and females? It bog­gles the mind; it sim­ply would never hap­pen. As Ein­stein said, “the Lord God is sub­tle, but He is not mali­cious.” We can­not imag­ine that nature is out to fool us by this kind of del­i­cate bal­anc­ing. Any­body famil­iar with large scale research data takes it as a mat­ter of course that when the N gets big enough she will not be look­ing for the sta­tis­ti­cally sig­nifi­cant cor­re­la­tions but rather look­ing at their pat­terns, since almost all of them will be sig­nifi­cant. In say­ing this, I am not going counter to what is stated by math­e­mat­i­cal sta­tis­ti­cians or psy­chol­o­gists with sta­tis­ti­cal exper­tise. For exam­ple, the stan­dard psy­chol­o­gist’s text­book, the excel­lent treat­ment by Hays (1973, page 415), explic­itly states that, taken lit­er­al­ly, the null hypoth­e­sis is always false.

20 years ago David Lykken and I conducted an exploratory study of the crud factor which we never published but I shall summarize it briefly here. (I offer it not as “empirical proof”—that H0 taken literally is quasi-always false hardly needs proof and is generally admitted—but as a punchy and somewhat amusing example of an insufficiently appreciated truth about soft correlational psychology.) In 1966, the University of Minnesota Student Counseling Bureau’s Statewide Testing Program administered a questionnaire to 57,000 high school seniors, the items dealing with family facts, attitudes toward school, vocational and educational plans, leisure time activities, school organizations, etc. We cross-tabulated a total of 15 (and then 45) variables including the following (the number of categories for each variable given in parentheses): father’s occupation (7), father’s education (9), mother’s education (9), number of siblings (10), birth order (only, oldest, youngest, neither), educational plans after high school (3), family attitudes towards college (3), do you like school (3), sex (2), college choice (7), occupational plan in ten years (20), and religious preference (20). In addition, there were 22 “leisure time activities” such as “acting,” “model building,” “cooking,” etc., which could be treated either as a single 22-category variable or as 22 dichotomous variables. There were also 10 “high school organizations” such as “school subject clubs,” “farm youth groups,” “political clubs,” etc., which also could be treated either as a single ten-category variable or as ten dichotomous variables. Considering the latter two variables as multichotomies gives a total of 15 variables producing 105 different cross-tabulations.
All values of χ² for these 105 cross-tabulations were statistically significant, and 101 (96%) of them were significant with a probability of less than 10⁻⁶.

…If “leisure activity” and “high school organizations” are considered as separate dichotomies, this gives a total of 45 variables and 990 different cross-tabulations. Of these, 92% were statistically significant and more than 78% were significant with a probability less than 10⁻⁶. Looked at in another way, the median number of significant relationships between a given variable and all the others was 41 out of a possible 44!
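Those sub-10⁻⁶ significance levels are what the χ² test delivers mechanically at n = 57,000: the expected χ² statistic for a 2×2 table is approximately n·φ² (φ being the table’s correlation coefficient), so even a trivial association of φ ≈ 0.03 clears that threshold. A sketch with illustrative φ values (not Lykken & Meehl’s actual tables):

```python
from scipy.stats import chi2

N = 57_000  # Lykken & Meehl's sample size
# Expected chi-square for a 2x2 table with true association phi is N * phi^2.
for phi in (0.01, 0.02, 0.03, 0.05):
    stat = N * phi**2
    p = chi2.sf(stat, df=1)  # survival function: P(chi2_1 > stat)
    print(f"phi={phi:.2f}  chi2={stat:7.1f}  p={p:.1e}")
```

At this n, a correlation too small to be of any practical interest still produces p-values that look overwhelming; the “crud factor” guarantees the φ’s are nonzero to begin with.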

We also computed [MCAT] scores by category for the following variables: number of siblings, birth order, sex, occupational plan, and religious preference. Highly significant deviations from chance allocation over categories were found for each of these variables. For example, the females score higher than the males; MCAT score steadily and markedly decreases with increasing numbers of siblings; eldest or only children are significantly brighter than youngest children; there are marked differences in MCAT scores between those who hope to become nurses and those who hope to become nurses aides, or between those planning to be farmers, engineers, teachers, or physicians; and there are substantial MCAT differences among the various religious groups. We also tabulated the five principal Protestant religious denominations (Baptist, Episcopal, Lutheran, Methodist, and Presbyterian) against all the other variables, finding highly significant relationships in most instances. For example, only children are nearly twice as likely to be Presbyterian than Baptist in Minnesota, more than half of the Episcopalians “usually like school” but only 45% of Lutherans do, 55% of Presbyterians feel that their grades reflect their abilities as compared to only 47% of Episcopalians, and Episcopalians are more likely to be male whereas Baptists are more likely to be female. 83% of Baptist children said that they enjoyed dancing as compared to 68% of Lutheran children. More than twice the proportion of Episcopalians plan to attend an out of state college than is true for Baptists, Lutherans, or Methodists. The proportion of Methodists who plan to become conservationists is nearly twice that for Baptists, whereas the proportion of Baptists who plan to become receptionists is nearly twice that for Episcopalians.

In addi­tion, we tab­u­lated the four prin­ci­pal Lutheran Syn­ods (Mis­souri, ALC, LCA, and Wis­con­sin) against the other vari­ables, again find­ing highly sig­nifi­cant rela­tion­ships in most cas­es. Thus, 5.9% of Wis­con­sin Synod chil­dren have no sib­lings as com­pared to only 3.4% of Mis­souri Synod chil­dren. 58% of ALC Luther­ans are involved in play­ing a musi­cal instru­ment or singing as com­pared to 67% of Mis­souri Synod Luther­ans. 80% of Mis­souri Synod Luther­ans belong to school or polit­i­cal clubs as com­pared to only 71% of LCA Luther­ans. 49% of ALC Luther­ans belong to debate, dra­mat­ics, or musi­cal orga­ni­za­tions in high school as com­pared to only 40% of Mis­souri Synod Luther­ans. 36% of LCA Luther­ans belong to orga­nized non-school youth groups as com­pared to only 21% of Wis­con­sin Synod Luther­ans. [Pre­ced­ing text cour­tesy of D. T. Lykken.]

These rela­tion­ships are not, I repeat, Type I errors. They are facts about the world, and with N = 57,000 they are pretty sta­ble. Some are the­o­ret­i­cally easy to explain, oth­ers more diffi­cult, oth­ers com­pletely baffling. The “easy” ones have mul­ti­ple expla­na­tions, some­times com­pet­ing, usu­ally not. Draw­ing the­o­ries from a pot and asso­ci­at­ing them whim­si­cally with vari­able pairs would yield an impres­sive batch of H0-re­fut­ing “con­fir­ma­tions.”

Another amus­ing exam­ple is the behav­ior of the items in the 550 items of the MMPI pool with respect to sex. Only 60 items appear on the Mf scale, about the same num­ber that were put into the pool with the hope that they would dis­crim­i­nate fem­i­nin­i­ty. It turned out that over half the items in the scale were not put in the pool for that pur­pose, and of those that were, a bare major­ity did the job. Scale deriva­tion was based on item analy­sis of a small group of cri­te­rion cases of male homo­sex­ual invert syn­drome, a sig­nifi­cant differ­ence on a rather small N of Dr. Starke Hath­away’s pri­vate patients being then con­joined with the require­ment of dis­crim­i­nat­ing between male nor­mals and female nor­mals. When the N becomes very large as in the data pub­lished by Swen­son, Pear­son, and Osborne (1973; An MMPI Source Book: Basic Item, Scale, And Pat­tern Data On 50,000 Med­ical Patients. Min­neapolis, MN: Uni­ver­sity of Min­nesota Press.), approx­i­mately 25,000 of each sex tested at the Mayo Clinic over a period of years, it turns out that 507 of the 550 items dis­crim­i­nate the sex­es. Thus in a het­ero­ge­neous item pool we find only 8% of items fail­ing to show a sig­nifi­cant differ­ence on the sex dichoto­my. The fol­low­ing are sex-dis­crim­i­na­tors, the male/female differ­ences rang­ing from a few per­cent­age points to over 30%:7

  • Some­times when I am not feel­ing well I am cross.
  • I believe there is a Devil and a Hell in after­life.
  • I think nearly any­one would tell a lie to keep out of trou­ble.
  • Most peo­ple make friends because friends are likely to be use­ful to them.
  • I like poet­ry.
  • I like to cook.
  • Police­men are usu­ally hon­est.
  • I some­times tease ani­mals.
  • My hands and feet are usu­ally warm enough.
  • I think Lin­coln was greater than Wash­ing­ton.
  • I am cer­tainly lack­ing in self­-con­fi­dence.
  • Any man who is able and will­ing to work hard has a good chance of suc­ceed­ing.

I invite the reader to guess which direc­tion scores “fem­i­nine.” Given this infor­ma­tion, I find some items easy to “explain” by one obvi­ous the­o­ry, oth­ers have com­pet­ing plau­si­ble expla­na­tions, still oth­ers are baffling.

Note that we are not deal­ing here with some source of sta­tis­ti­cal error (the occur­rence of ran­dom sam­pling fluc­tu­a­tion­s). That source of error is lim­ited by the sig­nifi­cance level we choose, just as the prob­a­bil­ity of Type II error is set by ini­tial choice of the sta­tis­ti­cal pow­er, based upon a pilot study or other antecedent data con­cern­ing an expected aver­age differ­ence. Since in social sci­ence every­thing cor­re­lates with every­thing to some extent, due to com­plex and obscure causal influ­ences, in con­sid­er­ing the crud fac­tor we are talk­ing about real differ­ences, real cor­re­la­tions, real trends and pat­terns for which there is, of course, some true but com­pli­cated mul­ti­vari­ate causal the­o­ry. I am not sug­gest­ing that these cor­re­la­tions are fun­da­men­tally unex­plain­able. They would be com­pletely explained if we had the knowl­edge of Omni­scient Jones, which we don’t. The point is that we are in the weak sit­u­a­tion of cor­rob­o­rat­ing our par­tic­u­lar sub­stan­tive the­ory by show­ing that X and Y are “related in a non­chance man­ner,” when our the­ory is too weak to make a numer­i­cal pre­dic­tion or even (usu­al­ly) to set up a range of admis­si­ble val­ues that would be counted as cor­rob­o­ra­tive.

…Some psy­chol­o­gists play down the influ­ence of the ubiq­ui­tous crud fac­tor, what David Lykken (1968) calls the “ambi­ent cor­re­la­tional noise” in social sci­ence, by say­ing that we are not in dan­ger of being mis­led by small differ­ences that show up as sig­nifi­cant in gigan­tic sam­ples. How much that soft­ens the blow of the crud fac­tor’s influ­ence depends upon the crud fac­tor’s aver­age size in a given research domain, about which nei­ther I nor any­body else has accu­rate infor­ma­tion. But the notion that the cor­re­la­tion between arbi­trar­ily paired trait vari­ables will be, while not lit­er­ally zero, of such minus­cule size as to be of no impor­tance, is surely wrong. Every­body knows that there is a set of demo­graphic fac­tors, some under­stood and oth­ers quite mys­te­ri­ous, that cor­re­late quite respectably with a vari­ety of traits. (So­cioe­co­nomic sta­tus, SES, is the one usu­ally con­sid­ered, and fre­quently assumed to be only in the “input” causal role.) The clin­i­cal scales of the MMPI were devel­oped by empir­i­cal key­ing against a set of dis­junct noso­log­i­cal cat­e­gories, some of which are phe­nom­e­no­log­i­cally and psy­cho­dy­nam­i­cally oppo­site to oth­ers. Yet the 45 pair­wise cor­re­la­tions of these scales are almost always pos­i­tive (scale Ma pro­vides most of the neg­a­tives) and a rep­re­sen­ta­tive size is in the neigh­bor­hood of 0.35 to 0.40. The same is true of the scores on the Strong Voca­tional Inter­est Blank, where I find an aver­age absolute value cor­re­la­tion close to 0.40. The malig­nant influ­ence of so-called “meth­ods covari­ance” in psy­cho­log­i­cal research that relies upon tasks or tests hav­ing cer­tain kinds of behav­ioral sim­i­lar­i­ties such as ques­tion­naires or ink blots is com­mon­place and a reg­u­lar source of con­cern to clin­i­cal and per­son­al­ity psy­chol­o­gists. For fur­ther dis­cus­sion and exam­ples of crud fac­tor size, see Meehl (1990).

Now sup­pose we imag­ine a soci­ety of psy­chol­o­gists doing research in this soft area, and each inves­ti­ga­tor sets his exper­i­ments up in a whim­si­cal, irra­tional man­ner as fol­lows: First he picks a the­ory at ran­dom out of the the­ory pot. Then he picks a pair of vari­ables ran­domly out of the observ­able vari­able pot. He then arbi­trar­ily assigns a direc­tion (you under­stand there is no intrin­sic con­nec­tion of con­tent between the sub­stan­tive the­ory and the vari­ables, except once in a while there would be such by coin­ci­dence) and says that he is going to test the ran­domly cho­sen sub­stan­tive the­ory by pre­tend­ing that it pre­dict­s—although in fact it does not, hav­ing no intrin­sic con­tentual rela­tion—a pos­i­tive cor­re­la­tion between ran­domly cho­sen obser­va­tional vari­ables X and Y. Now sup­pose that the crud fac­tor oper­a­tive in the broad domain were 0.30, that is, the aver­age cor­re­la­tion between all of the vari­ables pair­wise in this domain is 0.30. This is not sam­pling error but the true cor­re­la­tion pro­duced by some com­plex unknown net­work of genetic and envi­ron­men­tal fac­tors. Sup­pose he divides a nor­mal dis­tri­b­u­tion of sub­jects at the median and uses all of his cases (which fre­quently is not what is done, although if prop­erly treated sta­tis­ti­cally that is not method­olog­i­cally sin­ful). Let us take vari­able X as the “input” vari­able (never mind its causal role). The mean score of the cases in the top half of the dis­tri­b­u­tion will then be at one mean devi­a­tion, that is, in stan­dard score terms they will have an aver­age score of 0.80. Sim­i­lar­ly, the sub­jects in the bot­tom half of the X dis­tri­b­u­tion will have a mean stan­dard score of -0.80. So the mean differ­ence in stan­dard score terms between the high and low _X_s, the one “exper­i­men­tal” and the other “con­trol” group, is 1.6. 
If the regres­sion of out­put vari­able Y on X is approx­i­mately lin­ear, this yields an expected differ­ence in stan­dard score terms of 0.48, so the differ­ence on the arbi­trar­ily defined “out­put” vari­able Y is in the neigh­bor­hood of half a stan­dard devi­a­tion.

When the inves­ti­ga­tor runs a t-test on these data, what is the prob­a­bil­ity of achiev­ing a sta­tis­ti­cally sig­nifi­cant result? This depends upon the sta­tis­ti­cal power func­tion and hence upon the sam­ple size, which varies wide­ly, more in soft psy­chol­ogy because of the nature of the data col­lec­tion prob­lems than in exper­i­men­tal work. I do not have exact fig­ures, but an infor­mal scan­ning of sev­eral issues of jour­nals in the soft areas of clin­i­cal, abnor­mal, and social gave me a rep­re­sen­ta­tive value of the num­ber of cases in each of two groups being com­pared at around N1 = N2 = 37 (that’s a median because of the skew­ness, sam­ple sizes rang­ing from a low of 17 in one clin­i­cal study to a high of 1,000 in a social sur­vey study). Assum­ing equal vari­ances, this gives us a stan­dard error of the mean differ­ence of 0.2357 in sig­ma-u­nits, so that our t is a lit­tle over 2.0. The sub­stan­tive the­ory in a real life case being almost invari­ably pre­dic­tive of a direc­tion (it is hard to know what sort of sig­nifi­cance test­ing we would be doing oth­er­wise), the 5% level of con­fi­dence can be legit­i­mately taken as one-tailed and in fact could be crit­i­cized if it were not (as­sum­ing that the 5% level of con­fi­dence is given the usual spe­cial mag­i­cal sig­nifi­cance afforded it by social sci­en­tist­s!). The direc­tional 5% level being at 1.65, the expected value of our t-test in this sit­u­a­tion is approx­i­mately 0.35 t units from the required sig­nifi­cance lev­el. Things being essen­tially nor­mal for 72 df, this gives us a power of detect­ing a differ­ence of around 0.64.

However, since in our imagined “experiment” the assignment of direction was random, the probability of detecting a difference in the predicted direction (even though in reality this prediction was not mediated by any rational relation of content) is only half of that. Even this conservative power based upon the assumption of a completely random association between the theoretical substance and the pseudopredicted direction should give one pause. We find that the probability of getting a positive result from a theory with no verisimilitude whatsoever, associated in a totally whimsical fashion with a pair of variables picked randomly out of the observational pot, is one chance in three! This is quite different from the 0.05 level that people usually think about. Of course, the reason for this is that the 0.05 level is based upon strictly holding H0 if the theory were false. Whereas, because in the social sciences everything is correlated with everything, for epistemic purposes (despite the rigor of the mathematician’s tables) the true baseline—if the theory has nothing to do with reality and has only a chance relationship to it (so to speak, “any connection between the theory and the facts is purely coincidental”)—is 6 or 7 times as great as the reassuring 0.05 level upon which the psychologist focuses his mind. If the crud factor in a domain were running around 0.40, the power function is 0.86 and the “directional power” for random theory/prediction pairings would be 0.43.
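Meehl’s arithmetic here is straightforward to verify. A minimal sketch in Python of the whole calculation, using only his quoted inputs (crud factor 0.30, 37 cases per group, and his standard error of 0.2357, ie. √(2⁄36)):

```python
from math import sqrt, pi
from statistics import NormalDist

Z = NormalDist()  # standard normal

crud_r = 0.30          # Meehl's assumed average pairwise correlation
n_per_group = 37       # his median group size from scanning journals

# Median split on X: the mean of the upper half of a standard normal
upper_mean = sqrt(2 / pi)            # ~0.80 SD
x_gap = 2 * upper_mean               # high vs low X groups differ by ~1.6 SD

# A linear regression of Y on X carries crud_r of that gap over to Y
y_gap = crud_r * x_gap               # ~0.48 SD

# Standard error of the mean difference; Meehl quotes 0.2357 = sqrt(2/36)
se = sqrt(2 / (n_per_group - 1))
t = y_gap / se                       # "a little over 2.0"

# One-tailed test at 5% (critical z ~1.645); power ~ P(Z > 1.645 - t)
power = 1 - Z.cdf(1.645 - t)         # ~0.64-0.65
directional_power = power / 2        # direction picked at random: ~1/3
print(round(t, 2), round(power, 2), round(directional_power, 2))
```

The numbers land where Meehl says they do: a t just over 2, power around 0.64, and roughly a one-in-three chance of “confirming” a theory chosen entirely at random.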

…A similar situation holds for psychopathology, and for many variables in personality measurement that refer to aspects of social competence on the one hand or impairment of interpersonal function (as in mental illness) on the other. Thorndike had a dictum: “All good things tend to go together.”

Meehl 1990 (2)

“Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant using it”, Meehl 1990b:

Research in the behavioral sciences can be experimental, correlational, or field study (including clinical); only the first two are addressed here. For reasons to be explained (Meehl, 1990c), I treat as correlational those experimental studies in which the chief theoretical test provided involves an interaction effect between an experimental manipulation and an individual-differences variable (whether trait, status, or demographic). In correlational research there arises a special problem for the social scientist from the empirical fact that “everything is correlated with everything, more or less.” My colleague David Lykken presses the point further to include most, if not all, purely experimental research designs, saying that, speaking causally, “Everything influences everything”, a stronger thesis that I neither assert nor deny but that I do not rely on here. The obvious fact that everything is more or less correlated with everything in the social sciences is readily foreseen from the armchair on common-sense considerations. These are strengthened by more advanced theoretical arguments involving such concepts as genetic linkage, auto-catalytic effects between cognitive and affective processes, traits reflecting influences such as child-rearing practices correlated with intelligence, ethnicity, social class, religion, and so forth. If one asks, to take a trivial and theoretically uninteresting example, whether we might expect to find social class differences in a color-naming test, there immediately spring to mind numerous influences, ranging from (a) verbal intelligence leading to better verbal discriminations and retention of color names to (b) class differences in maternal teaching behavior (which one can readily observe by watching mothers explain things to their children at a zoo) to (c) more subtle—but still nonzero—influences, such as upper-class children being more likely Anglicans than Baptists, hence exposed to the changes in liturgical colors during the church year! Examples of such multiple possible influences are so easy to generate, I shall resist the temptation to go on. If somebody asks a psychologist or sociologist whether she might expect a nonzero correlation between dental caries and IQ, the best guess would be yes, small but statistically significant. A small negative correlation was in fact found during the 1920s, misleading some hygienists to hold that IQ was lowered by toxins from decayed teeth. (The received explanation today is that dental caries and IQ are both correlates of social class.) More than 75 years ago, Edward Lee Thorndike enunciated the famous dictum, “All good things tend to go together, as do all bad ones.” Almost all human performance (work competence) dispositions, if carefully studied, are saturated to some extent with the general intelligence factor g, which for psychodynamic and ideological reasons has been somewhat neglected in recent years but is due for a comeback (Betz, 1986).11

The ubiquity of nonzero correlations gives rise to what is methodologically disturbing to the theory tester and what I call, following Lykken, the crud factor. I have discussed this at length elsewhere (Meehl, 1990c), so I only summarize and provide a couple of examples here. The main point is that, when the sample size is sufficiently large to produce accurate estimates of the population values, almost any pair of variables in psychology will be correlated to some extent. Thus, for instance, less than 10% of the items in the MMPI item pool were put into the pool with masculinity-femininity in mind, and the empirically derived Mf scale contains only some of those plus others put into the item pool for other reasons, or without any theoretical considerations. When one samples thousands of individuals, it turns out that only 43 of the 550 items (8%) fail to show a significant difference between males and females. In an unpublished study (but see Meehl, 1990c) of the hobbies, interests, vocational plans, school course preferences, social life, and home factors of Minnesota college freshmen, when Lykken and I ran chi squares on all possible pairwise combinations of variables, 92% were significant, and 78% were significant at p < 10^-6. Looked at another way, the median number of significant relationships between a given variable and all the others was 41 of a possible 44. One finds such oddities as a relationship between which kind of shop courses boys preferred in high school and which of several Lutheran synods they belonged to!
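The Meehl–Lykken pattern is easy to reproduce qualitatively: if survey items share even a modest latent “crud” correlation, essentially all pairwise chi-squared tests come out significant at survey-scale n. A minimal simulation (all parameters hypothetical: 10 dichotomized items, latent correlation 0.30, 5,000 subjects):

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
k, n, crud = 10, 5000, 0.30          # hypothetical survey: 10 items, 5,000 subjects

# Latent traits share a modest "crud factor" correlation of 0.30
cov = np.full((k, k), crud) + (1 - crud) * np.eye(k)
latent = rng.multivariate_normal(np.zeros(k), cov, size=n)
items = (latent > 0).astype(int)      # dichotomize, like yes/no survey items

sig = 0
pairs = list(combinations(range(k), 2))
for i, j in pairs:
    # 2x2 contingency table for the item pair
    table = np.array([[np.sum((items[:, i] == a) & (items[:, j] == b))
                       for b in (0, 1)] for a in (0, 1)])
    _, p, _, _ = chi2_contingency(table)
    sig += p < 0.05

print(f"{sig}/{len(pairs)} pairwise chi-squares significant at p < 0.05")
```

With these (made-up) settings, virtually every one of the 45 pairwise tests rejects the null, mirroring the 92% rate Meehl reports at comparable sample sizes.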

…The third objection is somewhat harder to answer because it would require an encyclopedic survey of research literature over many domains. It is argued that, although the crud factor is admittedly ubiquitous—that is, almost no correlations of the social sciences are literally zero (as required by the usual significance test)—the crud factor is in most research domains not large enough to be worth worrying about. Without making a claim to know just how big it is, I think this objection is pretty clearly unsound. Doubtless the average correlation of any randomly picked pair of variables in social science depends on the domain, and also on the instruments employed (e.g., it is well known that personality inventories often have as much methods-covariance as they do criterion validities).

A representative pairwise correlation among MMPI scales, despite the marked differences (sometimes amounting to phenomenological “oppositeness”) of the nosological rubrics on which they were derived, is in the middle to high 0.30s, in both normal and abnormal populations. The same is true for the occupational keys of the Strong Vocational Interest Blank. Deliberately aiming to diversify the qualitative features of cognitive tasks (and thus “purify” the measures) in his classic studies of primary mental abilities (“pure factors,” orthogonal), Thurstone (1938; Thurstone & Thurstone, 1941) still found an average intertest correlation of .28 (range = .01 to .56!) in the cross-validation sample. In the set of 20 scales built to cover broadly the domain of (normal range) “folk-concept” traits, Gough (1987) found an average pairwise correlation of .44 among both males and females. Guilford’s Social Introversion, Thinking Introversion, Depression, Cycloid Tendencies, and Rhathymia or Freedom From Care scales, constructed on the basis of (orthogonal) factors, showed pairwise correlations ranging from -.02 to .85, with 5 of the 10 rs ≥ .33 despite the purification effort (Evans & McConnell, 1941). Any treatise on factor analysis exemplifying procedures with empirical data suffices to make the point convincingly. For example, in Harman (1960), eight “emotional” variables correlate .10 to .87, median r = .44 (p. 176), and eight “political” variables correlate .03 to .88, median (absolute value) r = .62 (p. 178). For highly diverse acquiescence-corrected measures (personality traits, interests, hobbies, psychopathology, social attitudes, and religious, political, and moral opinions), estimating individuals’ (orthogonal!) factor scores, one can hold mean _r_s down to an average of .12, means from .04 to .20, still some individual _r_s > .30 (Lykken, personal communication, 1990; cf. McClosky & Meehl, in preparation). Public opinion polls and attitude surveys routinely disaggregate data with respect to several demographic variables (e.g., age, education, section of country, sex, ethnicity, religion, income, rural/urban, self-described political affiliation) because these factors are always correlated with attitudes or electoral choices, sometimes strongly so. One must also keep in mind that socioeconomic status, although intrinsically interesting (especially to sociologists) is probably often functioning as a proxy for other unmeasured personality or status characteristics that are not part of the definition of social class but are, for a variety of complicated reasons, correlated with it. The proxy role is important because it prevents adequate “controlling for” unknown (or unmeasured) crud-factor influences by statistical procedures (matching, partial correlation, analysis of covariance). [ie “residual confounding”]

  • Thurstone, L. L. (1938). Primary mental abilities. Chicago: University of Chicago Press.
  • Gough, H. G. (1987). CPI, Admin­is­tra­tor’s guide. Palo Alto, CA: Con­sult­ing Psy­chol­o­gists Press.
  • McClosky, Herbert, & Meehl, P. E. (in preparation). Ideologies in conflict.12

Tukey 1991

“The philosophy of multiple comparisons”, Tukey 1991:

Statisticians classically asked the wrong question—and were willing to answer with a lie, one that was often a downright lie. They asked “Are the effects of A and B different?” and they were willing to answer “no”.

All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking “Are the effects different?” is foolish.

What we should be answering first is “Can we tell the direction in which the effects of A differ from the effects of B?” In other words, can we be confident about the direction from A to B? Is it “up”, “down”, or “uncertain”?

Raftery 1995

“Bayesian Model Selection in Social Research (with Discussion by Andrew Gelman & Donald B. Rubin, and Robert M. Hauser, and a Rejoinder)”, Raftery 1995:

In the past 15 years, however, some quantitative sociologists have been attaching less importance to p-values because of practical difficulties and counter-intuitive results. These difficulties are most apparent with large samples, where p-values tend to indicate rejection of the null hypothesis even when the null model seems reasonable theoretically and inspection of the data fails to reveal any striking discrepancies with it. Because much sociological research is based on survey data, often with thousands of cases, sociologists frequently come up against this problem. In the early 1980s, some sociologists dealt with this problem by ignoring the results of p-value-based tests when they seemed counter-intuitive, and by basing model selection instead on theoretical considerations and informal assessment of discrepancies between model and data (e.g. Fienberg and Mason, 1979; Hout, 1983, 1984; Grusky and Hauser, 1984).

…It is clear that models 1 and 2 are unsatisfactory and should be rejected in favor of model 3. By the standard test, model 3 should also be rejected, in favor of model 4, given the deviance difference of 150 on 16 degrees of freedom, corresponding to a p-value of about 10^-120. Grusky and Hauser (1984) nevertheless adopted model 3 because it explains most (99.7%) of the deviance under the baseline model of independence, fits well in the sense that the differences between observed and expected counts are a small proportion of the total, and makes good theoretical sense. This seems sensible, and yet is in dramatic conflict with the p-value-based test. This type of conflict often arises in large samples, and hence is frequent in sociology with its survey data sets comprising thousands of cases. The main response to it has been to claim that there is a distinction between “statistical” and “substantive” significance, with differences that are statistically significant not necessarily being substantively important.

Thompson 1995

“Editorial Policies Regarding Statistical Significance Testing: Three Suggested Reforms”, Thompson 1995:

One serious problem with this statistical testing logic is that in reality H0 is never true in the population, as recognized by any number of prominent statisticians (Tukey, 1991), i.e., there will always be some differences in population parameters, although the differences may be incredibly trivial. Nearly 40 years ago Savage (1957, pp. 332–333) noted that, “Null hypotheses of no difference are usually known to be false before the data are collected.” Subsequently, Meehl (1978, p. 822) argued, “As I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false.” Similarly, noted statistician Hays (1981, p. 293 [Statistics], 3rd ed.) pointed out that “[t]here is surely nothing on earth that is completely independent of anything else. The strength of association may approach zero, but it should seldom or never be exactly zero.” And Loftus and Loftus (1982, pp. 498–499) argued that, “finding a ‘[statistically] significant effect’ really provides very little information, because it’s almost certain that some relationship (however small) exists between any two variables.” The very important implication of all this is that statistical significance testing primarily becomes only a test of researcher endurance, because “virtually any study can be made to show [statistically] significant results if one uses enough subjects” (Hays, 1981, p. 293). As Nunnally (1960, p. 643) noted some 35 years ago, “If the null hypothesis is not rejected, it is usually because the N is too small. If enough data are gathered, the hypothesis will generally be rejected.” The implication is that:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they’re tired. This tautology has created considerable damage as regards the cumulation of knowledge… (Thompson, 1992, p. 436)
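The “test of researcher endurance” point can be made quantitative with the usual normal-approximation power formula: for any fixed nonzero effect, however trivial (a hypothetical standardized difference of d = 0.05 here), the rejection probability climbs toward 1 as the sample grows:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized mean difference d (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)   # noncentrality of the test statistic
    return 1 - Z.cdf(z_crit - ncp)

# A "trivial" true effect of 0.05 SD: rejection becomes near-certain
# once enough subjects are gathered, exactly as Hays and Nunnally warn.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(power_two_sample(0.05, n), 3))
```

At 100 per group the test almost never rejects; at 100,000 per group it essentially always does, so the “result” mostly reports the sample size.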

Mulaik et al 1997

“There Is a Time and a Place for Significance Testing”, Mulaik et al 1997 (in What If There Were No Significance Tests?, ed Harlow et al 1997):

Most of these articles expose misconceptions about significance testing common among researchers and writers of psychological textbooks on statistics and measurement. But the criticisms do not stop with misconceptions about significance testing. Others like Meehl (1967) expose the limitations of a statistical practice that focuses only on testing for zero differences between means and zero correlations instead of testing predictions about specific nonzero values for parameters derived from theory or prior experience, as is done in the physical sciences. Still others emphasize that significance tests do not alone convey the information needed to properly evaluate research findings and perform accumulative research.

…Other than emphasizing a need to properly understand the interpretation of confidence intervals, we have no disagreements with these criticisms and proposals. But a few of the critics go even further. In this chapter we will look at arguments made by Carver (1978), Cohen (1994), Schmidt (1992, 1996), and Schmidt and Hunter (chapter 3 of this volume), in favor of not merely recommending the reporting of point estimates of effect sizes and confidence intervals based on them, but of abandoning altogether the use of significance tests in research. Our focus will be principally on Schmidt’s (1992, 1996) papers, because they incorporate arguments from earlier papers, especially Carver’s (1978), and also carry the argument to its most extreme conclusions. Where appropriate, we will also comment on Schmidt and Hunter’s (chapter 3 of this volume) rebuttal of arguments against their position.

The Null Hypothesis Is Always False?

Cohen (1994), influenced by Meehl (1978), argued that “the nil hypothesis is always false” (p. 1000). Get a large enough sample and you will always reject the null hypothesis. He cites a number of eminent statisticians in support of this view. He quotes Tukey (1991, p. 100) to the effect that there are always differences between experimental treatments—for some decimal places. Cohen cites an unpublished study by Meehl and Lykken in which cross tabulations for 15 Minnesota Multiphasic Personality Inventory (MMPI) items for a sample of 57,000 subjects yielded 105 chi-square tests of association and every one of them was significant, and 96% of them were significant at p < .000001 (Cohen, 1994, p. 1000). Cohen cites Meehl (1990) as suggesting that this reflects a “crud factor” in nature. “Everything is related to everything else” to some degree. So, the question is, why do a significance test if you know it will always be significant if the sample is large enough? But if this is an empirical hypothesis, is it not one that is established using significance testing?

But the example may not be an apt demonstration of the principle Cohen sought to establish: It is generally expected that responses to different items responded to by the same subjects are not independently distributed across subjects, so it would not be remarkable to find significant correlations between many such items.

Much more interesting would be to demonstrate systematic and replicable significant treatment effects when subjects are assigned at random to different treatment groups but the same treatments are administered to each group. But in this case, small but significant effects in studies with high power that deviate from expectations of no effect when no differences in treatments are administered are routinely treated as systematic experimenter errors, and knowledge of experimental technique is improved by their detection and removal or control. Systematic error and experimental artifact must always be considered a possibility when rejecting the null hypothesis. Nevertheless, do we know a priori that a test will always be significant if the sample is large enough? Is the proposition “Every statistical hypothesis is false” an axiom that needs no testing? Actually, we believe that to regard this as an axiom would introduce an internal contradiction into statistical reasoning, comparable to arguing that all propositions and descriptions are false. You could not think and reason about the world with such an axiom. So it seems preferable to regard this as some kind of empirical generalization. But no empirical generalization is ever incorrigible and beyond testing. Nevertheless, if indeed there is a phenomenon of nature known as “the crud factor,” then it is something we know to be objectively a fact only because of significance tests. Something in the background noise stands out as a signal against that noise, because we have sufficiently powerful tests using huge samples to detect it. At that point it may become a challenge to science to develop a better understanding of what produces it. However, it may turn out to reflect only experimenter artifact. But in any case the hypothesis of a crud factor is not beyond further testing.

The point is that it doesn’t matter if the null hypothesis is always judged false at some sample size, as long as we regard this as an empirical phenomenon. What matters is whether at the sample size we have we can distinguish observed deviations from our hypothesized values to be sufficiently large and improbable under a hypothesis of chance that we can treat them reasonably but provisionally as not due to chance error. There is no a priori reason to believe that one will always reject the null hypothesis at any given sample size. On the other hand, accepting the null hypothesis does not mean the hypothesized value is true, but rather that the evidence observed is not distinguishable from what we would regard as due to chance if the null hypothesis were true and thus is not sufficient to disprove it. The remaining uncertainty regarding the truth of our null hypothesis is measured by the width of the region of acceptance or a function of the standard error. And this will be closely related to the power of the test, which also provides us with information about our uncertainty. The fact that the width of the region of acceptance shrinks with increasing sample size means we are able to reduce our uncertainty regarding the provisional validity of an accepted null hypothesis with larger samples. In huge samples the issue of uncertainty due to chance looms not as important as it does in small- and moderate-size samples.

Waller 2004

“The fallacy of the null hypothesis in soft psychology”, Waller 2004:

In his classic article on the fallacy of the null hypothesis in soft psychology, Paul Meehl claimed that, in nonexperimental settings, the probability of rejecting the null hypothesis of nil group differences in favor of a directional alternative was 0.50—a value that is an order of magnitude higher than the customary Type I error rate. In a series of real data simulations, using Minnesota Multiphasic Personality Inventory-Revised (MMPI-2) data collected from more than 80,000 individuals, I found strong support for Meehl’s claim.

…Before running the experiments I realized that, to be fair to Meehl, I needed a large data set with a broad range of biosocial variables. Fortunately, I had access to data from 81,485 individuals who earlier had completed the 567 items of the Minnesota Multiphasic Personality Inventory-Revised (MMPI-2; Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989). The MMPI-2, in my opinion, is an ideal vehicle for testing Meehl’s claim because it includes items in such varied content domains as general health concerns; personal habits and interests; attitudes towards sex, marriage, and family; affective functioning; normal range personality; and extreme manifestations of psychopathology (for a more complete description of the latent content of the MMPI, see Waller, 1999, “Searching for structure in the MMPI”).

…Next, the computer selected (without replacement) a random item from the pool of MMPI-2 items. Using data from the 41,491 males and 39,994 females, it then (a) performed a difference of proportions test on the item group means; (b) recorded the signed z-value; and (c) recorded the associated significance level. Finally, the program tallied the number of “significant” test results (i.e., those with |z| ≥ 1.96). The results of this mini simulation were enlightening and in excellent accord with the outcome of Meehl’s gedanken experiment. Specifically, 46% of the directional hypotheses were supported at significance levels that far exceeded traditional p-value cutoffs. A summary of the results is portrayed in Fig. 1. Notice in this figure, which displays the distribution of z-values for the 511 tests, that many of the item mean differences were 50–100 times larger than their associated standard errors!

Figure 1 & Figure 2 of Waller 2004: “Fig. 1. Distribution of z-values for 511 hypothesis tests. Fig. 2. Distribution of the frequency of rejected null hypotheses, in favor of a randomly chosen directional alternative, in 320,922 hypothesis tests”
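Waller’s procedure is simple to sketch. In the simulation below, only the group sizes and item count are taken from Waller; the item endorsement rates and the size of the “crud” sex differences (SD 0.02 in proportion units) are my assumptions. It draws small but nonzero sex differences for 511 items, runs the same difference-of-proportions z-tests, and scores a randomly chosen directional hypothesis for each item:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_m, n_f = 511, 41_491, 39_994   # counts as in Waller's MMPI-2 analysis

# Hypothetical true endorsement rates, with small nonzero sex differences
# (the "crud factor"): differences ~ Normal(0, 0.02) in proportion units.
p_m = rng.uniform(0.2, 0.8, n_items)
p_f = np.clip(p_m + rng.normal(0, 0.02, n_items), 0.01, 0.99)

m = rng.binomial(n_m, p_m) / n_m          # observed male endorsement rates
f = rng.binomial(n_f, p_f) / n_f          # observed female endorsement rates

# Standard two-proportion z-test for each item
pooled = (m * n_m + f * n_f) / (n_m + n_f)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_m + 1 / n_f))
z = (m - f) / se

# Pick a direction at random per item, as in Meehl's thought experiment
direction = rng.choice([-1, 1], n_items)
confirmed = np.mean((np.abs(z) > 1.96) & (np.sign(z) == direction))
print(f"{confirmed:.0%} of random directional hypotheses 'confirmed'")
```

With ~80,000 subjects, even these tiny assumed differences yield huge z-values, and a random directional hypothesis is “confirmed” at rates in the same ballpark as Waller’s 46%, rather than the nominal 5%.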

Waller also highlights Bill Thompson’s 2001 bibliography “402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies” as a source for criticisms of NHST, but unfortunately it’s unclear which of them might bear on the specific criticism that ‘the null hypothesis is always false’.

Starbuck 2006

The Production of Knowledge: The Challenge of Social Science Research, Starbuck 2006, pg47–49:

Induction requires distinguishing meaningful relationships (signals) in the midst of an obscuring background of confounding relationships (noise). The weak and meaningless or substantively secondary correlations in the background make induction untrustworthy. In many tasks, people can distinguish weak signals against rather strong background noise. The reason is that both the signals and the background noise match familiar patterns. For example, a driver traveling to a familiar destination focuses on landmarks that experience has shown to be relevant. People have trouble making such distinctions where signals and noise look much alike or where signals and noise have unfamiliar characteristics. For example, a driver traveling a new road to a new destination is likely to have difficulty spotting landmarks and turns on a recommended route.

Social science research has the latter characteristics. This activity is called research because its outputs are unknown; and the signals and noise look a lot alike in that both have systematic components and both contain components that vary erratically. Therefore, researchers rely upon statistical techniques to distinguish signals from noise. However, these techniques assume: (a) that the so-called random errors really do cancel each other out so that their average values are close to zero; and (b) that the so-called random errors in different variables are uncorrelated. These are very strong assumptions because they presume that the researchers’ hypotheses encompass absolutely all of the systematic effects in the data, including effects that the researchers have not foreseen or measured. When these assumptions are not met, the statistical techniques tend to mistake noise for signal, and to attribute more importance to the researchers’ hypotheses than they deserve.

I remembered what Ames and Reiter (1961) had said about how easy it is for macroeconomists to discover statistically significant correlations that have no substantive significance, and I could see five reasons why a similar phenomenon might occur with cross-sectional data. Firstly, a few broad characteristics of people and social systems pervade social science data—examples being sex, age, intelligence, social class, income, education, or organization size. Such characteristics correlate with many behaviors and with each other. Secondly, researchers’ decisions about how to treat data can create correlations between variables. For example, when the Aston researchers used factor analysis to create aggregate variables, they implicitly determined the correlations among these aggregate variables. Thirdly, so-called ‘samples’ are frequently not random, and many of them are complete subpopulations—say, every employee of a company—even though study after study has turned up evidence that people who live close together, who work together, or who socialize together tend to have more attitudes, beliefs, and behaviors in common than do people who are far apart physically and socially. Fourthly, some studies obtain data from respondents at one time and through one method. By including items in a single questionnaire or interview, researchers suggest to respondents that relationships exist among these items. Lastly, most researchers are intelligent people who are living successful lives. They are likely to have some intuitive ability to predict the behaviors of people and of social systems. They are much more likely to formulate hypotheses that accord with their intuition than ones that violate it; they are quite likely to investigate correlations and differences that deviate from zero; and they are less likely than chance would imply to observe correlations and differences near zero.

Webster and I hypothesized that statistical tests with a null hypothesis of no correlation are biased toward statistical significance. Webster culled through Administrative Science Quarterly, the Academy of Management Journal, and the Journal of Applied Psychology seeking matrices of correlations. She tabulated only complete matrices of correlations in order to observe the relations among all of the variables that the researchers perceived when drawing inductive inferences, not only those variables that researchers actually included in hypotheses. Of course, some researchers probably gathered data on additional variables beyond those published, and then omitted these additional variables because they correlated very weakly with the dependent variables. We estimated that 64% of the correlations in our data were associated with researchers’ hypotheses.

Figure 2.6: Correlations reported in three journals

Fig­ure 2.6 shows the dis­tri­b­u­tions of 14,897 cor­re­la­tions. In all 3 jour­nals, both the mean cor­re­la­tion and the median cor­re­la­tion were close to +0.09 and the dis­tri­b­u­tions of cor­re­la­tions were very sim­i­lar. Find­ing sig­nifi­cant cor­re­la­tions is absurdly easy in this pop­u­la­tion of vari­ables, espe­cially when researchers make two-tailed tests with a null hypoth­e­sis of no cor­re­la­tion. Choos­ing two vari­ables utterly at ran­dom, a researcher has 2-to-1 odds of find­ing a sig­nifi­cant cor­re­la­tion on the first try, and 24-to-1 odds of find­ing a sig­nifi­cant cor­re­la­tion within three tries (also see Hub­bard and Arm­strong 1992). Fur­ther­more, the odds are bet­ter than 2-to-1 that an observed cor­re­la­tion will be pos­i­tive, and pos­i­tive cor­re­la­tions are more likely than neg­a­tive ones to be sta­tis­ti­cally sig­nifi­cant. Because researchers gather more data when they are get­ting small cor­re­la­tions, stud­ies with large num­bers of obser­va­tions exhibit slightly less pos­i­tive bias. The mean cor­re­la­tion in stud­ies with fewer than 70 obser­va­tions is about twice the mean cor­re­la­tion in stud­ies with over 180 obser­va­tions. The main infer­ence I drew from these sta­tis­tics was that the social sci­ences are drown­ing in sta­tis­ti­cally sig­nifi­cant but mean­ing­less noise. Because the differ­ences and cor­re­la­tions that social sci­en­tists test have dis­tri­b­u­tions quite differ­ent from those assumed in hypoth­e­sis tests, social sci­en­tists are using tests that assign sta­tis­ti­cal sig­nifi­cance to con­found­ing back­ground rela­tion­ships. Because social sci­en­tists equate sta­tis­ti­cal sig­nifi­cance with mean­ing­ful rela­tion­ships, they often mis­take con­found­ing back­ground rela­tion­ships for the­o­ret­i­cally impor­tant infor­ma­tion. 
One result is that social sci­ence research cre­ates a cloud of sta­tis­ti­cally sig­nifi­cant differ­ences and cor­re­la­tions that not only have no real mean­ing but also impede sci­en­tific progress by obscur­ing the truly mean­ing­ful rela­tion­ships.
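Starbuck's claim that failure to reject merely reflects sample size can be sketched numerically. A minimal illustration (my own, not Starbuck & Webster's analysis): take the typical correlation of about +0.09 reported above as the true effect, and compute the power of a two-sided test of zero correlation via the Fisher z approximation. This does not reproduce the 2-to-1 first-try odds, which reflect the whole spread of correlations in those journals; it only shows that even the median-sized effect is rejected almost surely once samples are large enough.

```python
# Power of a two-sided 5% test of rho = 0 when the true correlation is the
# ~0.09 typical of the surveyed journals, via the Fisher z approximation.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_for_correlation(rho, n):
    """P(reject rho = 0 at two-sided alpha = 0.05) given true rho and n."""
    z = math.atanh(rho)              # Fisher z-transform of the true rho
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z-hat
    crit = 1.959964                  # two-sided 5% critical value
    # z-hat/se ~ Normal(z/se, 1); probability it falls outside +/- crit:
    return (1 - normal_cdf(crit - z / se)) + normal_cdf(-crit - z / se)

for n in (70, 180, 500, 2000, 10000):
    print(n, round(power_for_correlation(0.09, n), 3))
```

With a few hundred observations the test rejects only a fraction of the time ("insufficient data"); by n in the thousands, rejection is all but certain.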

Sup­pose that roughly 10% of all observ­able rela­tions could be the­o­ret­i­cally mean­ing­ful and that the remain­ing 90% either have no mean­ings or can be deduced as impli­ca­tions of the key 10%. How­ev­er, we do not know now which rela­tions con­sti­tute the key 10%, and so our research resem­bles a search through a haystack in which we are try­ing to sep­a­rate nee­dles from more numer­ous straws. Now sup­pose that we adopt a search method that makes almost every straw look very much like a nee­dle and that turns up thou­sands of appar­ent nee­dles annu­al­ly; 90% of these appar­ent nee­dles are actu­ally straws, but we have no way of know­ing which ones. Next, we fab­ri­cate a the­ory that ‘explains’ these appar­ent nee­dles. Some of the propo­si­tions in our the­ory are likely to be cor­rect, merely by chance; but many, many more propo­si­tions are incor­rect or mis­lead­ing in that they describe straws. Even if this the­ory were to account ratio­nally for all of the nee­dles that we have sup­pos­edly dis­cov­ered in the past, which is extremely unlike­ly, the the­ory has very lit­tle chance of mak­ing highly accu­rate pre­dic­tions about the con­se­quences of our actions unless the the­ory itself acts as a pow­er­ful self­-ful­fill­ing prophecy (Eden and Ravid 1982). Our the­ory would make some cor­rect pre­dic­tions, of course, because with so many cor­re­lated vari­ables, even a com­pletely false the­ory would have a rea­son­able chance of gen­er­at­ing pre­dic­tions that come true. Thus, we dare not even take cor­rect pre­dic­tions as depend­able evi­dence of our the­o­ry’s cor­rect­ness (Deese 1972: 61–67 [Psy­chol­ogy as Sci­ence and Art]).

Smith et al 2007

“Clustered Environments and Randomized Genes: A Fundamental Distinction between Conventional and Genetic Epidemiology”, Smith et al 2007:

…We exam­ined the extent to which genetic vari­ants, on the one hand, and non­genetic envi­ron­men­tal expo­sures or phe­no­typic char­ac­ter­is­tics on the oth­er, tend to be asso­ci­ated with each oth­er, to assess the degree of con­found­ing that would exist in con­ven­tional epi­demi­o­log­i­cal stud­ies com­pared with Mendelian ran­dom­iza­tion stud­ies. Meth­ods and Find­ings: We esti­mated pair­wise cor­re­la­tions between [96] non­genetic base­line vari­ables and genetic vari­ables in a cross-sec­tional study [Bri­tish Wom­en’s Heart and Health Study; n = 4,286] com­par­ing the num­ber of cor­re­la­tions that were sta­tis­ti­cally sig­nifi­cant at the 5%, 1%, and 0.01% level (α = 0.05, 0.01, and 0.0001, respec­tive­ly) with the num­ber expected by chance if all vari­ables were in fact uncor­re­lat­ed, using a two-sided bino­mial exact test. We demon­strate that behav­ioural, socioe­co­nom­ic, and phys­i­o­log­i­cal fac­tors are strongly inter­re­lat­ed, with 45% of all pos­si­ble pair­wise asso­ci­a­tions between 96 non­genetic char­ac­ter­is­tics (n = 4,560 cor­re­la­tions) being sig­nifi­cant at the p < 0.01 level (the ratio of observed to expected sig­nifi­cant asso­ci­a­tions was 45; p-value for differ­ence between observed and expected < 0.000001). Sim­i­lar find­ings were observed for other lev­els of sig­nifi­cance.

…The 96 non­genetic vari­ables gen­er­ated 4,560 pair­wise com­par­isons, of which, assum­ing no asso­ci­a­tions exist­ed, 5 in 100 (to­tal 228) would be expected to be asso­ci­ated by chance at the 5% sig­nifi­cance level (α = 0.05). How­ev­er, 2,447 (54%) of the cor­re­la­tions were sig­nifi­cant at the α = 0.05 lev­el, giv­ing an observed to expected (O:E) ratio of 11, p for differ­ence O:E < 0.000001 (Table 1). At the 1% sig­nifi­cance lev­el, 45.6 of the cor­re­la­tions would be expected to be asso­ci­ated by chance, but we found that 2,036 (45%) of the pair­wise asso­ci­a­tions were sta­tis­ti­cally sig­nifi­cant at α = 0.01, giv­ing an O:E ratio of 45, p for differ­ence O:E < 0.000001 (Table 2). At the 0.01% sig­nifi­cance lev­el, 0.456 of the cor­re­la­tions would be expected to be asso­ci­ated by chance, but we found that 1,378 (30%) were sig­nifi­cantly asso­ci­ated at α = 0.0001, giv­ing an O:E ratio of 3,022, p for differ­ence O:E < 0.000001.
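The observed-to-expected ratios quoted above follow from simple arithmetic and can be checked directly (96 variables give 96 × 95⁄2 = 4,560 pairs; the expected counts are just α × 4,560):

```python
# Check of the O:E ratios reported for the 4,560 pairwise correlations
# among 96 nongenetic variables (counts taken from the text above).
n_pairs = 96 * 95 // 2            # 4,560 unordered pairs of variables
for alpha, observed in ((0.05, 2447), (0.01, 2036), (0.0001, 1378)):
    expected = alpha * n_pairs    # count expected under global independence
    print(alpha, expected, observed, round(observed / expected))
```

The three ratios round to 11, 45, and 3,022, matching the paper's figures.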

…Over 50% of the pair­wise asso­ci­a­tions between base­line non­genetic char­ac­ter­is­tics in our study were sta­tis­ti­cally sig­nifi­cant at the 0.05 lev­el; an 11-fold increase from what would be expect­ed, assum­ing these char­ac­ter­is­tics were inde­pen­dent. Sim­i­lar find­ings were found for sta­tis­ti­cally sig­nifi­cant asso­ci­a­tions at the 0.01 level (45-fold increase from expect­ed) and the 0.0001 level (3,000-fold increase from expect­ed). This illus­trates the con­sid­er­able diffi­culty of deter­min­ing which asso­ci­a­tions are valid and poten­tially causal from a back­ground of highly cor­re­lated fac­tors, reflect­ing that behav­ioural, socioe­co­nom­ic, and phys­i­o­log­i­cal char­ac­ter­is­tics tend to clus­ter. This ten­dency will mean that there will often be high lev­els of con­found­ing when study­ing any sin­gle fac­tor in rela­tion to an out­come. Given the com­plex­ity of such con­found­ing, even after for­mal sta­tis­ti­cal adjust­ment, a lack of data for some con­founders, and mea­sure­ment error in assessed con­founders will leave con­sid­er­able scope for resid­ual con­found­ing [4]. When epi­demi­o­log­i­cal stud­ies present adjusted asso­ci­a­tions as a reflec­tion of the mag­ni­tude of a causal asso­ci­a­tion, they are assum­ing that all pos­si­ble con­found­ing fac­tors have been accu­rately mea­sured and that their rela­tion­ships with the out­come have been appro­pri­ately mod­elled. We think this is unlikely to be the case in most obser­va­tional epi­demi­o­log­i­cal stud­ies [26].

Pre­dictably, such con­founded rela­tion­ships will be par­tic­u­larly marked for highly socially and cul­tur­ally pat­terned risk fac­tors, such as dietary intake. This high degree of con­found­ing might under­lie the poor con­cor­dance of obser­va­tional epi­demi­o­log­i­cal stud­ies that iden­ti­fied dietary fac­tors (such as beta carotene, vit­a­min E, and vit­a­min C intake) as pro­tec­tive against car­dio­vas­cu­lar dis­ease and can­cer, with the find­ings of ran­dom­ized con­trolled tri­als of these dietary fac­tors [1,27]. Indeed, with 45% of the pair­wise asso­ci­a­tions of non­genetic char­ac­ter­is­tics being “sta­tis­ti­cally sig­nifi­cant” at the p < 0.01 level in our study, and our study being unex­cep­tional with regard to the lev­els of con­found­ing that will be found in obser­va­tional inves­ti­ga­tions, it is clear that the large major­ity of asso­ci­a­tions that exist in obser­va­tional data­bases will not reach pub­li­ca­tion. We sug­gest that those that do achieve pub­li­ca­tion will reflect appar­ent bio­log­i­cal plau­si­bil­ity (a weak causal cri­te­rion [28]) and the inter­ests of inves­ti­ga­tors. Exam­ples exist of inves­ti­ga­tors report­ing pro­vi­sional analy­ses in abstract­s—­such as antiox­i­dant vit­a­min intake being appar­ently pro­tec­tive against future car­dio­vas­cu­lar events in women with clin­i­cal evi­dence of car­dio­vas­cu­lar dis­ease [29]—but not going on to full pub­li­ca­tion of these find­ings, per­haps because ran­dom­ized con­trolled tri­als appeared soon after the pre­sen­ta­tion of the abstracts [30] that ren­dered their find­ings as being unlikely to reflect causal rela­tion­ships. Con­verse­ly, it is likely that the large major­ity of null find­ings will not achieve pub­li­ca­tion, unless they con­tra­dict high­-pro­file prior find­ings, as has been demon­strated in mol­e­c­u­lar genetic research [31].

Smith et al 2007: “Fig­ure 1. His­togram of Sta­tis­ti­cally Sig­nifi­cant (at α = 1%) Age-Ad­justed Pair­wise Cor­re­la­tion Coeffi­cients between 96 Non­genetic Char­ac­ter­is­tics. British Women Aged 60–79 y”

The mag­ni­tudes of most of the sig­nifi­cant cor­re­la­tions between non­genetic char­ac­ter­is­tics were small (see Fig­ure 1), with a median value at p ≤ 0.01 and p ≤ 0.05 of 0.08, and it might be con­sid­ered that such weak asso­ci­a­tions are unlikely to be impor­tant sources of con­found­ing. How­ev­er, so many asso­ci­ated non­genetic vari­ables, even with weak cor­re­la­tions, can present a very impor­tant poten­tial for resid­ual con­found­ing. For exam­ple, we have pre­vi­ously demon­strated how 15 socioe­co­nomic and behav­ioural risk fac­tors, each with weak but sta­tis­ti­cally inde­pen­dent (at p ≤ 0.05) asso­ci­a­tions with both vit­a­min C lev­els and coro­nary heart dis­ease (CHD), could together account for an appar­ent strong pro­tec­tive effect (odds ratio = 0.60 com­par­ing top to bot­tom quar­ter of vit­a­min C dis­tri­b­u­tion) of vit­a­min C on CHD (32 [see also Lawlor et al 2004b]).
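How many individually weak correlates can add up to a strong apparent effect is easy to simulate. The sketch below uses entirely invented numbers, not the authors' data: a "vitamin C" marker has no causal effect on disease at all, but is driven by 15 weak background factors that also drive disease risk, so the crude top-versus-bottom-quartile odds ratio looks markedly protective.

```python
# Illustrative simulation: a causally inert marker appears protective
# because 15 weak shared background factors confound it with disease risk.
import math, random

random.seed(1)
n = 20000
rows = []
for _ in range(n):
    factors = [random.gauss(0, 1) for _ in range(15)]  # weak background factors
    s = sum(factors)
    vitc = s + random.gauss(0, 3)          # marker tracks the factors, noisily
    p = 1 / (1 + math.exp(2 + 0.15 * s))   # risk depends on the factors only
    rows.append((vitc, random.random() < p))

rows.sort()                                # sort by the vitamin C marker
quarter = n // 4
bottom, top = rows[:quarter], rows[-quarter:]

def odds(group):
    cases = sum(d for _, d in group)
    return cases / (len(group) - cases)

# crude odds ratio, top vs. bottom quartile of the (causally inert) marker:
print(round(odds(top) / odds(bottom), 2))
```

The exact odds ratio depends on the invented effect sizes; the point is only that it comes out well below 1 despite zero causal effect.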

Hecht & Moxley 2009

“Ter­abytes of Tobler: eval­u­at­ing the first law in a mas­sive, domain-neu­tral rep­re­sen­ta­tion of world knowl­edge”, Hecht & Mox­ley 2009:

The First Law of Geog­ra­phy states, “every­thing is related to every­thing else, but near things are more related than dis­tant things.” Despite the fact that it is to a large degree what makes “spa­tial spe­cial,” the law has never been empir­i­cally eval­u­ated on a large, domain-neu­tral rep­re­sen­ta­tion of world knowl­edge. We address the gap in the lit­er­a­ture about this crit­i­cal idea by sta­tis­ti­cally exam­in­ing the mul­ti­tude of enti­ties and rela­tions between enti­ties present across 22 differ­ent lan­guage edi­tions of Wikipedia. We find that, at least accord­ing to the myr­iad authors of Wikipedia, the First Law is true to an over­whelm­ing extent regard­less of lan­guage-de­fined cul­tural domain.

Andrew Gelman

Gelman 2004

“Type 1, type 2, type S, and type M errors”, Gelman 2004:

I’ve never in my pro­fes­sional life made a Type I error or a Type II error. But I’ve made lots of errors. How can this be?

A Type 1 error occurs only if the null hypoth­e­sis is true (typ­i­cally if a cer­tain para­me­ter, or differ­ence in para­me­ters, equals zero). In the appli­ca­tions I’ve worked on, in social sci­ence and pub­lic health, I’ve never come across a null hypoth­e­sis that could actu­ally be true, or a para­me­ter that could actu­ally be zero.

Gelman 2007

“Sig­nifi­cance test­ing in eco­nom­ics: McCloskey, Zil­i­ak, Hoover, and Siegler”:

I think that McCloskey and Zil­i­ak, and also Hoover and Siegler, would agree with me that the null hypoth­e­sis of zero coeffi­cient is essen­tially always false. (The par­a­dig­matic exam­ple in eco­nom­ics is pro­gram eval­u­a­tion, and I think that just about every pro­gram being seri­ously con­sid­ered will have effect­s—­pos­i­tive for some peo­ple, neg­a­tive for oth­er­s—but not aver­ag­ing to exactly zero in the pop­u­la­tion.) From this per­spec­tive, the point of hypoth­e­sis test­ing (or, for that mat­ter, of con­fi­dence inter­vals) is not to assess the null hypoth­e­sis but to give a sense of the uncer­tainty in the infer­ence. As Hoover and Siegler put it, “while the eco­nomic sig­nifi­cance of the coeffi­cient does not depend on the sta­tis­ti­cal sig­nifi­cance, our cer­tainty about the accu­racy of the mea­sure­ment surely does. . . . Sig­nifi­cance tests, prop­erly used, are a tool for the assess­ment of sig­nal strength and not mea­sures of eco­nomic sig­nifi­cance.” Cer­tain­ly, I’d rather see an esti­mate with an assess­ment of sta­tis­ti­cal sig­nifi­cance than an esti­mate with­out such an assess­ment.

Gelman 2010a

“Bayesian Sta­tis­tics Then and Now”, Gel­man 2010a:

My third meta-prin­ci­ple is that differ­ent appli­ca­tions demand differ­ent philoso­phies. This prin­ci­ple comes up for me in Efron’s dis­cus­sion of hypoth­e­sis test­ing and the so-called false dis­cov­ery rate, which I label as “so-called” for the fol­low­ing rea­son. In Efron’s for­mu­la­tion (which fol­lows the clas­si­cal mul­ti­ple com­par­isons lit­er­a­ture), a “false dis­cov­ery” is a zero effect that is iden­ti­fied as nonze­ro, where­as, in my own work, I never study zero effects. The effects I study are some­times small but it would be sil­ly, for exam­ple, to sup­pose that the differ­ence in vot­ing pat­terns of men and women (after con­trol­ling for some other vari­ables) could be exactly zero. My prob­lems with the “false dis­cov­ery” for­mu­la­tion are partly a mat­ter of taste, I’m sure, but I believe they also arise from the differ­ence between prob­lems in genet­ics (in which some genes really have essen­tially zero effects on some traits, so that the clas­si­cal hypoth­e­sis-test­ing model is plau­si­ble) and in social sci­ence and envi­ron­men­tal health (where essen­tially every­thing is con­nected to every­thing else, and effect sizes fol­low a con­tin­u­ous dis­tri­b­u­tion rather than a mix of large effects and near-ex­act zeroes).

Gelman 2010b

“Causal­ity and Sta­tis­ti­cal Learn­ing”, Gel­man 2010b:

There are (al­most) no true zeroes: diffi­cul­ties with the research pro­gram of learn­ing causal struc­ture

We can dis­tin­guish between learn­ing within a causal model (that is, infer­ence about para­me­ters char­ac­ter­iz­ing a spec­i­fied directed graph) and learn­ing causal struc­ture itself (that is, infer­ence about the graph itself). In social sci­ence research, I am extremely skep­ti­cal of this sec­ond goal.

The diffi­culty is that, in social sci­ence, there are no true zeroes. For exam­ple, reli­gious atten­dance is asso­ci­ated with atti­tudes on eco­nomic as well as social issues, and both these cor­re­la­tions vary by state. And it does not inter­est me, for exam­ple, to test a model in which social class affects vote choice through party iden­ti­fi­ca­tion but not along a direct path.

More gen­er­al­ly, any­thing that plau­si­bly could have an effect will not have an effect that is exactly zero. I can respect that some social sci­en­tists find it use­ful to frame their research in terms of con­di­tional inde­pen­dence and the test­ing of null effects, but I don’t gen­er­ally find this approach help­ful—and I cer­tainly don’t believe that it is nec­es­sary to think in terms of con­di­tional inde­pen­dence in order to study causal­i­ty. With­out struc­tural zeroes, it is impos­si­ble to iden­tify graph­i­cal struc­tural equa­tion mod­els.

The most com­mon excep­tions to this rule, as I see it, are inde­pen­dences from design (as in a designed or nat­ural exper­i­ment) or effects that are zero based on a plau­si­ble sci­en­tific hypoth­e­sis (as might arise, for exam­ple, in genet­ics where genes on differ­ent chro­mo­somes might have essen­tially inde­pen­dent effect­s), or in a study of ESP. In such set­tings I can see the value of test­ing a null hypoth­e­sis of zero effect, either for its own sake or to rule out the pos­si­bil­ity of a con­di­tional cor­re­la­tion that is sup­posed not to be there.

Another sort of excep­tion to the “no zeroes” rule comes from infor­ma­tion restric­tion: a per­son’s deci­sion should not be affected by knowl­edge that he or she does­n’t have. For exam­ple, a con­sumer inter­ested in buy­ing apples cares about the total price he pays, not about how much of that goes to the seller and how much goes to the gov­ern­ment in the form of tax­es. So the restric­tion is that the util­ity depends on prices, not on the share of that going to tax­es. That is the type of restric­tion that can help iden­tify demand func­tions in eco­nom­ics.

I real­ize, how­ev­er, that my per­spec­tive that there are no zeroes (in­for­ma­tion restric­tions aside) is a minor­ity view among social sci­en­tists and per­haps among peo­ple in gen­er­al, on the evi­dence of psy­chol­o­gist Slo­man’s book. For exam­ple, from chap­ter 2: “A good politi­cian will know who is moti­vated by greed and who is moti­vated by larger prin­ci­ples in order to dis­cern how to solicit each one’s vote when it is need­ed.” I can well believe that peo­ple think in this way but I don’t buy it! Just about every­one is moti­vated by greed and by larger prin­ci­ples! This sort of dis­crete think­ing does­n’t seem to me to be at all real­is­tic about how peo­ple behave-although it might very well be a good model about how peo­ple char­ac­ter­ize oth­ers!

In the next chap­ter, Slo­man writes, “No mat­ter how many times A and B occur togeth­er, mere co-oc­cur­rence can­not reveal whether A causes B, or B causes A, or some­thing else causes both.” [ital­ics added] Again, I am both­ered by this sort of dis­crete think­ing. I will return in a moment with an exam­ple, but just to speak gen­er­al­ly, if A could cause B, and B could cause A, then I would think that, yes, they could cause each oth­er. And if some­thing else could cause them both, I imag­ine that could be hap­pen­ing along with the cau­sa­tion of A on B and of B on A.

Here we’re get­ting into some of the differ­ences between a nor­ma­tive view of sci­ence, a descrip­tive view of sci­ence, and a descrip­tive view of how peo­ple per­ceive the world. Just as there are lim­its to what “folk physics” can tell us about the motion of par­ti­cles, sim­i­larly I think we have to be care­ful about too closely iden­ti­fy­ing “folk causal infer­ence” from the stuff done by the best social sci­en­tists. To con­tinue the anal­o­gy: it is inter­est­ing to study how we develop phys­i­cal intu­itions using com­mon­sense notions of force, ener­gy, momen­tum, and so on—but it’s also impor­tant to see where these intu­itions fail. Sim­i­lar­ly, ideas of causal­ity are fun­da­men­tal but that does­n’t stop ordi­nary peo­ple and even experts from mak­ing basic mis­takes.

Now I would like to return to the graph­i­cal model approach described by Slo­man. In chap­ter 5, he dis­cusses an exam­ple with three vari­ables:

If two of the vari­ables are depen­dent, say, intel­li­gence and socioe­co­nomic sta­tus, but con­di­tion­ally inde­pen­dent given the third vari­able [beer con­sump­tion], then either they are related by one of two chains:

(Intelligence → Amount of beer consumed → Socioeconomic status)
(Socio-economic status → Amount of beer consumed → Intelligence)

or by a fork:

(Intelligence ← Amount of beer consumed → Socioeconomic status)

and then we must use some other means [other than obser­va­tional data] to decide between these three pos­si­bil­i­ties. In some cas­es, com­mon sense may be suffi­cient, but we can also, if nec­es­sary, run an exper­i­ment. If we inter­vene and vary the amount of beer con­sumed and see that we affect intel­li­gence, that implies that the sec­ond or third model is pos­si­ble; the first one is not. Of course, all this assumes that there aren’t other vari­ables medi­at­ing between the ones shown that pro­vide alter­na­tive expla­na­tions of the depen­den­cies.

This makes no sense to me. I don’t see why only one of the three mod­els can be true. This is a math­e­mat­i­cal pos­si­bil­i­ty, but it seems highly implau­si­ble to me. And, in par­tic­u­lar, run­ning an exper­i­ment that reveals one of these causal effects does not rule out the other pos­si­ble paths. For exam­ple, sup­pose that Slo­man were to per­form the above exper­i­ment (find­ing that beer con­sump­tion affects intel­li­gence) and then another exper­i­ment, this time vary­ing intel­li­gence (in some way; the method of doing this can very well deter­mine the causal effect) and find­ing that it affects the amount of beer con­sumed.

Beyond this fun­da­men­tal prob­lem, I have a sta­tis­ti­cal cri­tique, which is that in social sci­ence you won’t have these sorts of con­di­tional inde­pen­den­cies, except from design or as arti­facts of small sam­ple sizes that do not allow us to dis­tin­guish small depen­den­cies from zero.
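Gelman's statistical critique can be illustrated with a small simulation (all numbers invented): generate a fork with a small direct path added, and estimate the partial correlation of the two downstream variables given the shared cause. The small dependence tends to be lost in sampling noise at small n, but is decisively nonzero at large n, so the apparent "conditional independence" was an artifact of sample size.

```python
# Small conditional dependence vs. sample size: partial correlation of
# "iq" and "ses" given "beer" when a weak direct iq -> ses path exists.
import math, random

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
results = {}
for n in (50, 100000):
    beer = [random.gauss(0, 1) for _ in range(n)]          # shared cause
    iq = [0.5 * b + random.gauss(0, 1) for b in beer]
    ses = [0.5 * b + 0.1 * i + random.gauss(0, 1)          # small direct path
           for b, i in zip(beer, iq)]
    r_is, r_ib, r_sb = corr(iq, ses), corr(iq, beer), corr(ses, beer)
    partial = (r_is - r_ib * r_sb) / math.sqrt((1 - r_ib**2) * (1 - r_sb**2))
    z = math.atanh(partial) * math.sqrt(n - 4)   # approximate z-statistic
    results[n] = (partial, z)
    print(n, round(partial, 3), round(z, 1))
```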

I think I see where Slo­man is com­ing from, from a psy­cho­log­i­cal per­spec­tive: you see these vari­ables that are related to each oth­er, and you want to know which is the cause and which is the effect. But I don’t think this is a use­ful way of under­stand­ing the world, just as I don’t think it’s use­ful to cat­e­go­rize polit­i­cal play­ers as being moti­vated either by greed or by larger prin­ci­ples, but not both. Exclu­sive-or might feel right to us inter­nal­ly, but I don’t think it works as sci­ence.

One important place where I agree with Sloman (and thus with Pearl and Spirtes et al.) is in the emphasis that causal structure cannot in general be learned from observational data alone; they hold the very reasonable position that we can use observational data to rule out possibilities and formulate hypotheses, and then use some sort of intervention or experiment (whether actual or hypothetical) to move further. In this way they connect the observational/experimental division to the hypothesis/deduction formulation that is familiar to us from the work of Popper, Kuhn, and other modern philosophers of science.

The place where I think Slo­man is mis­guided is in his for­mu­la­tion of sci­en­tific mod­els in an either/or way, as if, in truth, social vari­ables are linked in sim­ple causal paths, with a sci­en­tific goal of fig­ur­ing out if A causes B or the reverse. I don’t know much about intel­li­gence, beer con­sump­tion, and socioe­co­nomic sta­tus, but I cer­tainly don’t see any sim­ple rela­tion­ships between income, reli­gious atten­dance, party iden­ti­fi­ca­tion, and vot­ing—and I don’t see how a search for such a pat­tern will advance our under­stand­ing, at least given cur­rent tech­niques. I’d rather start with descrip­tion and then go toward causal­ity fol­low­ing the approach of econ­o­mists and sta­tis­ti­cians by think­ing about poten­tial inter­ven­tions one at a time. I’d love to see Slo­man’s and Pearl’s ideas of the inter­play between obser­va­tional and exper­i­men­tal data devel­oped in a frame­work that is less strongly tied to the notion of choice among sim­ple causal struc­tures.

Gelman 2012

“The ‘hot hand’ and problems with hypothesis testing”, Gelman 2012:

The effects are cer­tainly not zero. We are not machi­nes, and any­thing that can affect our expec­ta­tions (for exam­ple, our suc­cess in pre­vi­ous tries) should affect our per­for­mance…What­ever the lat­est results on par­tic­u­lar sports, I can’t see any­one over­turn­ing the basic find­ing of Gilovich, Val­lone, and Tver­sky that play­ers and spec­ta­tors alike will per­ceive the hot hand even when it does not exist and dra­mat­i­cally over­es­ti­mate the mag­ni­tude and con­sis­tency of any hot-hand phe­nom­e­non that does exist. In sum­ma­ry, this is yet another prob­lem where much is lost by going down the stan­dard route of null hypoth­e­sis test­ing.

Gelman et al 2013

“Inher­ent diffi­cul­ties of non-Bayesian like­li­hood-based infer­ence, as revealed by an exam­i­na­tion of a recent book by Aitkin” (ear­lier ver­sion):

  1. Solv­ing non-prob­lems

Sev­eral of the exam­ples in Sta­tis­ti­cal Infer­ence rep­re­sent solu­tions to prob­lems that seem to us to be arti­fi­cial or con­ven­tional tasks with no clear anal­ogy to applied work.

"They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (…) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H₁ against the alternative hypothesis H₂ of a change." —Statistical Inference, page 147

Based on our expe­ri­ence in pub­lic opin­ion research, this is not a real ques­tion. Sup­port for any polit­i­cal posi­tion is always chang­ing. The real ques­tion is how much the sup­port has changed, or per­haps how this change is dis­trib­uted across the pop­u­la­tion.

A defender of Aitkin (and of clas­si­cal hypoth­e­sis test­ing) might respond at this point that, yes, every­body knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypoth­e­sis, not that the change is zero but that it is nearly zero. Unfor­tu­nate­ly, the metaphor­i­cal inter­pre­ta­tion of hypoth­e­sis tests has prob­lems sim­i­lar to the the­o­log­i­cal doc­trines of the Uni­tar­ian church. Once you have aban­doned lit­eral belief in the Bible, the ques­tion soon aris­es: why fol­low it at all? Sim­i­lar­ly, once one rec­og­nizes the inap­pro­pri­ate­ness of the point null hypoth­e­sis, we think it makes more sense not to try to reha­bil­i­tate it or treat it as trea­sured metaphor but rather to attack our sta­tis­ti­cal prob­lems direct­ly, in this case by per­form­ing infer­ence on the change in opin­ion in the pop­u­la­tion.
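Attacking the problem directly, as the authors suggest, means estimating the change itself rather than testing whether it is zero. A minimal Bayesian sketch with invented survey counts (the counts below are hypothetical, not Aitkin's data): put uniform Beta(1,1) priors on the two proportions and simulate the posterior distribution of the change in support.

```python
# Posterior for the change in support between two surveys of 100,
# using conjugate Beta posteriors and Monte Carlo draws.
import random

random.seed(0)
yes_before, n_before = 55, 100   # hypothetical counts for the two surveys
yes_after, n_after = 62, 100

# Beta(1,1) priors give Beta(yes+1, no+1) posteriors; simulate the
# posterior distribution of the difference in proportions directly.
draws = sorted(
    random.betavariate(yes_after + 1, n_after - yes_after + 1)
    - random.betavariate(yes_before + 1, n_before - yes_before + 1)
    for _ in range(20000)
)
lo, mid, hi = draws[500], draws[10000], draws[19500]  # 95% interval and median
print(round(mid, 3), round(lo, 3), round(hi, 3))
```

The output is an estimate of how much support changed, with its uncertainty, rather than a verdict on the (surely false) point null of exactly zero change.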

To be clear: we are not deny­ing the value of hypoth­e­sis test­ing. In this exam­ple, we find it com­pletely rea­son­able to ask whether observed changes are sta­tis­ti­cally sig­nifi­cant, i.e. whether the data are con­sis­tent with a null hypoth­e­sis of zero change. What we do not find rea­son­able is the state­ment that “the ques­tion of inter­est is whether there has been a change in sup­port.”

All this is appli­ca­tion-spe­cific. Sup­pose pub­lic opin­ion was observed to really be flat, punc­tu­ated by occa­sional changes, as in the left graph in Fig­ure 7.1. In that case, Aitk­in’s ques­tion of “whether there has been a change” would be well-de­fined and appro­pri­ate, in that we could inter­pret the null hypoth­e­sis of no change as some min­i­mal level of base­line vari­a­tion.

Real pub­lic opin­ion, how­ev­er, does not look like base­line noise plus jumps, but rather shows con­tin­u­ous move­ment on many time scales at once, as can be seen from the right graph in Fig­ure 7.1, which shows actual pres­i­den­tial approval data. In this exam­ple, we do not see Aitk­in’s ques­tion as at all rea­son­able. Any attempt to work with a null hypoth­e­sis of opin­ion sta­bil­ity will be inher­ently arbi­trary. It would make much more sense to model opin­ion as a con­tin­u­ous­ly-vary­ing process. The sta­tis­ti­cal prob­lem here is not merely that the null hypoth­e­sis of zero change is non­sen­si­cal; it is that the null is in no sense a rea­son­able approx­i­ma­tion to any inter­est­ing mod­el. The soci­o­log­i­cal prob­lem is that, from Sav­age (1954) onward, many Bayesians have felt the need to mimic the clas­si­cal nul­l-hy­poth­e­sis test­ing frame­work, even where it makes no sense.

Lin et al 2013

“Too Big to Fail: Large Sam­ples and the p-Value Prob­lem”, Lin et al 2013:

The Inter­net has pro­vided IS researchers with the oppor­tu­nity to con­duct stud­ies with extremely large sam­ples, fre­quently well over 10,000 obser­va­tions. There are many advan­tages to large sam­ples, but researchers using sta­tis­ti­cal infer­ence must be aware of the p-value prob­lem asso­ci­ated with them. In very large sam­ples, p-val­ues go quickly to zero, and solely rely­ing on p-val­ues can lead the researcher to claim sup­port for results of no prac­ti­cal sig­nifi­cance. In a sur­vey of large sam­ple IS research, we found that a sig­nifi­cant num­ber of papers rely on a low p-value and the sign of a regres­sion coeffi­cient alone to sup­port their hypothe­ses. This research com­men­tary rec­om­mends a series of actions the researcher can take to mit­i­gate the p-value prob­lem in large sam­ples and illus­trates them with an exam­ple of over 300,000 cam­era sales on eBay. We believe that address­ing the p-value prob­lem will increase the cred­i­bil­ity of large sam­ple IS research as well as pro­vide more insights for read­ers.

…A key issue with apply­ing smal­l­-sam­ple sta­tis­ti­cal infer­ence to large sam­ples is that even minus­cule effects can become sta­tis­ti­cally sig­nifi­cant. The increased power leads to a dan­ger­ous pit­fall as well as to a huge oppor­tu­ni­ty. The issue is one that sta­tis­ti­cians have long been aware of: “the p-value prob­lem.” Chat­field (1995, p. 70 [Prob­lem Solv­ing: A Sta­tis­ti­cian’s Guide, 2nd ed]) com­ments, “The ques­tion is not whether differ­ences are ‘sig­nifi­cant’ (they nearly always are in large sam­ples), but whether they are inter­est­ing. For­get sta­tis­ti­cal sig­nifi­cance, what is the prac­ti­cal sig­nifi­cance of the results?” The increased power of large sam­ples means that researchers can detect small­er, sub­tler, and more com­plex effects, but rely­ing on p-val­ues alone can lead to claims of sup­port for hypothe­ses of lit­tle or no prac­ti­cal sig­nifi­cance.

…In reviewing the literature, we found only a few mentions of the large-sample issue and its effect on p-values; we also saw little recognition that the authors’ low p-values might be an artifact of their large sample sizes. Authors who recognized the “large-sample, small p-values” issue addressed it by one of the following approaches: reducing the significance-level threshold (which does not really help), recomputing the p-value for a small sample (Gefen and Carmel 2008), or focusing on practical significance and commenting on the uselessness of statistical significance (Mithas and Lucas 2010).
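The large-sample behavior Lin et al describe is easy to reproduce in a sketch (mine, not the paper's eBay analysis): hold a practically negligible correlation of r = 0.02 fixed and watch its two-sided p-value collapse as the sample grows toward the scales they survey.

```python
# p-value of one fixed, tiny correlation (r = 0.02) as n grows,
# via the Fisher z approximation to the test of rho = 0.
import math

def normal_sf(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def p_value(r, n):
    """Two-sided p for H0: rho = 0, Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * normal_sf(abs(z))

for n in (1000, 10000, 100000, 300000):
    print(n, p_value(0.02, n))
```

The same r = 0.02 is "nonsignificant" at n = 1,000 and vanishingly significant at n = 300,000, while its practical significance never changes.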

Schwitzgebel 2013

“Pre­lim­i­nary Evi­dence That the World Is Sim­ple (An Exer­cise in Stu­pid Epis­te­mol­o­gy)” (hu­mor­ous blog post)

Here’s what I did. I thought up 30 pairs of vari­ables that would be easy to mea­sure and that might relate in diverse ways. Some vari­ables were phys­i­cal (the dis­tance vs. appar­ent bright­ness of nearby stars), some bio­log­i­cal (the length vs. weight of sticks found in my back yard), and some psy­cho­log­i­cal or social (the S&P 500 index clos­ing value vs. num­ber of days past). Some I would expect to show no rela­tion­ship (the num­ber of pages in a library book vs. how high up it is shelved in the library), some I would expect to show a roughly lin­ear rela­tion­ship (dis­tance of McDon­ald’s fran­chises from my house vs. esti­mated dri­ving time), and some I expected to show a curved or com­plex rela­tion­ship (fore­casted tem­per­a­ture vs. time of day, size in KB of a JPG photo of my office vs. the angle at which the photo was tak­en). See here for the full list of vari­ables. I took 11 mea­sure­ments of each vari­able pair. Then I ana­lyzed the result­ing data.

Now, if the world is massively complex, then it should be difficult to predict a third datapoint from any two other data points. Suppose that two measurements of some continuous variable yield values of 27 and 53. What should I expect the third measured value to be? Why not 1,457,002? Or 3.22 × 10⁻¹⁷? There are just as many functions (that is, infinitely many) containing 27, 53, and 1,457,002 as there are containing 27, 53, and some more pedestrian-seeming value like 44.

…To con­duct the test, I used each pair of depen­dent vari­ables to pre­dict the value of the next vari­able in the series (the 1st and 2nd obser­va­tions pre­dict­ing the value of the 3rd, the 2nd and 3rd pre­dict­ing the value of the 4th, etc.), yield­ing 270 pre­dic­tions for the 30 vari­ables. I counted an obser­va­tion “wild” if its absolute value was 10 times the max­i­mum of the absolute value of the two pre­vi­ous obser­va­tions or if its absolute value was below 1⁄10 of the min­i­mum of the absolute value of the two pre­vi­ous obser­va­tions. Sep­a­rate­ly, I also looked for flipped signs (ei­ther two neg­a­tive val­ues fol­lowed by a pos­i­tive or two pos­i­tive val­ues fol­lowed by a neg­a­tive), though most of the vari­ables only admit­ted pos­i­tive val­ues. This mea­sure of wild­ness yielded three wild obser­va­tions out of 270 (1%) plus another three flipped-sign cases (to­tal 2%). (A few vari­ables were capped, either top or bot­tom, in a way that would make an above-10x or below-1/10th obser­va­tion ana­lyt­i­cally unlike­ly, but exclud­ing such vari­ables would­n’t affect the result much.) So it looks like the Wild Com­plex­ity The­sis might be in trou­ble.
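Schwitzgebel's 10×/one-tenth "wildness" rule is simple enough to state as code. This is my reconstruction of the criterion as described, not his script, and it covers only the magnitude rule, not the separate flipped-sign count:

```python
# "Wildness" rule: an observation is wild if its absolute value exceeds
# 10x the larger, or falls below 1/10th the smaller, of the two previous
# observations' absolute values.
def is_wild(prev1, prev2, current):
    a, b, c = abs(prev1), abs(prev2), abs(current)
    return c > 10 * max(a, b) or c < min(a, b) / 10

def count_wild(series):
    """Count wild observations across a series, sliding over triples."""
    return sum(
        is_wild(series[i], series[i + 1], series[i + 2])
        for i in range(len(series) - 2)
    )

# the post's own example values: 44 is pedestrian, 1,457,002 is wild
print(count_wild([27, 53, 44]), count_wild([27, 53, 1457002]))
```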

Ellenberg 2014

Jor­dan Ellen­berg, “The Myth Of The Myth Of The Hot Hand” (ex­cerpted from How Not to Be Wrong: The Power of Math­e­mat­i­cal Think­ing, 2014):

A significance test is a scientific instrument, and like any other instrument, it has a certain degree of precision. If you make the test more sensitive—by increasing the size of the studied population, for example—you enable yourself to see ever-smaller effects. That’s the power of the method, but also its danger. The truth is, the null hypothesis is probably always false! When you drop a powerful drug into a patient’s bloodstream, it’s hard to believe the intervention literally has zero effect on the probability that the patient will develop esophageal cancer, or thrombosis, or bad breath. Each part of the body speaks to every other, in a complex feedback loop of influence and control. Everything you do either gives you cancer or prevents it. And in principle, if you carry out a powerful enough study, you can find out which it is. But those effects are usually so minuscule that they can be safely ignored. Just because we can detect them doesn’t always mean they matter…The right question isn’t, “Do basketball players sometimes temporarily get better or worse at making shots?”—the kind of yes/no question a significance test addresses. The right question is “How much does their ability vary with time, and to what extent can observers detect in real time whether a player is hot?” Here, the answer is surely “not as much as people think, and hardly at all.”
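Ellenberg’s point about instrument sensitivity can be made quantitative: for a fixed nonzero true effect, the test statistic grows with √n, so any effect, however minuscule, eventually crosses any significance threshold. A toy calculation (the numbers are illustrative, not from the book):

```python
import math

def expected_z(true_d, n):
    """Expected one-sample z-statistic for a true standardized effect size
    `true_d` measured on n observations with unit sd: z = d * sqrt(n)."""
    return true_d * math.sqrt(n)

# A 'negligible' true effect of d = 0.01 against the usual z ~ 1.96 threshold:
# n = 10,000    -> z = 1.0  (looks like a null result)
# n = 4,000,000 -> z = 20   (overwhelmingly 'significant')
```

Nothing about the effect changed between the two rows; only the instrument’s sensitivity did.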

Lakens 2014

“The Null Is Always False (Ex­cept When It Is True)”, Daniel Lak­ens:

The more impor­tant ques­tion is whether it is true that there are always real differ­ences in the real world, and what the ‘real world’ is. Let’s con­sider the pop­u­la­tion of peo­ple in the real world. While you read this sen­tence, some indi­vid­u­als in this pop­u­la­tion have died, and some were born. For most ques­tions in psy­chol­o­gy, the pop­u­la­tion is sur­pris­ingly sim­i­lar to an eter­nally run­ning Monte Carlo sim­u­la­tion. Even if you could mea­sure all peo­ple in the world in a mil­lisec­ond, and the test-retest cor­re­la­tion was per­fect, the answer you would get now would be differ­ent from the answer you would get in an hour. Fre­quen­tists (the peo­ple that use NHST) are not specifi­cally inter­ested in the exact value now, or in one hour, or next week Thurs­day, but in the aver­age value in the ‘long’ run. The value in the real world today might never be zero, but it’s never any­thing, because it’s con­tin­u­ously chang­ing. If we want to make gen­er­al­iz­able state­ments about the world, I think the fact that the nul­l-hy­poth­e­sis is never pre­cisely true at any spe­cific moment is not a prob­lem. I’ll ignore more com­plex ques­tions for now, such as how we can estab­lish whether effects vary over time.

…Meehl talks about how in psy­chol­ogy every indi­vid­u­al-d­iffer­ence vari­able (e.g., trait, sta­tus, demo­graph­ic) cor­re­lates with every other vari­able, which means the null is prac­ti­cally never true. In these sit­u­a­tions, it’s not that test­ing against the nul­l-hy­poth­e­sis is mean­ing­less, but it’s not infor­ma­tive. If every­thing cor­re­lates with every­thing else, you need to cre­ate good mod­els, and test those. A sim­ple nul­l-hy­poth­e­sis sig­nifi­cance test will not get you very far. I agree.
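Meehl’s crud factor is easy to reproduce in simulation: give two otherwise-unrelated variables any shared causal ancestor, however weak, and a large enough sample reliably detects the resulting correlation. A sketch (the loadings and n are arbitrary choices of mine):

```python
import math, random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
n = 200_000
latent = [random.gauss(0, 1) for _ in range(n)]
# Two 'independent-looking' variables, each only ~2% determined by the latent:
x = [0.15 * l + random.gauss(0, 1) for l in latent]
y = [0.15 * l + random.gauss(0, 1) for l in latent]

r = pearson_r(x, y)       # true r ~ 0.022: tiny, but not zero
z = r * math.sqrt(n - 3)  # crude significance scale: many sigma at this n
```

The correlation is trivially small, yet with n in the hundreds of thousands it is detected essentially every time, which is exactly the regime Meehl describes.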

Ran­dom Assign­ment vs. Crud

To illustrate when NHST can be used as a source of information in large samples, and when NHST is not informative in large samples, I’ll analyze data from a large dataset with 6344 participants from the Many Labs project. I’ve analyzed 10 dependent variables to see whether they were influenced by (A) Gender, and (B) Assignment to the high or low anchoring condition in the first study. Gender is a measured individual difference variable, and not a manipulated variable, and might thus be affected by what Meehl calls the crud factor. Here, I want to illustrate this is (A) probably often true for individual difference variables, but perhaps not always true, and (B) probably never true when analyzing differences between groups individuals were randomly assigned to.

…When we ana­lyze the 10 depen­dent vari­ables as a func­tion of the anchor­ing con­di­tion, none of the differ­ences are sta­tis­ti­cally sig­nifi­cant (even though there are more than 6000 par­tic­i­pants). You can play around with the script, repeat­ing the analy­sis for the con­di­tions related to the other three anchor­ing ques­tions (re­mem­ber to cor­rect for mul­ti­ple com­par­isons if you per­form many test­s), and see how ran­dom­iza­tion does a pretty good job at return­ing non-sig­nifi­cant results even in very large sam­ple sizes. If the null is always false, it is remark­ably diffi­cult to reject. Obvi­ous­ly, when we ana­lyze the answer peo­ple gave on the first anchor­ing ques­tion, we find a huge effect of the high vs. low anchor­ing con­di­tion they were ran­domly assigned to. Here, NHST works. There is prob­a­bly some­thing going on. If the anchor­ing effect was a com­pletely novel phe­nom­e­non, this would be an impor­tant first find­ing, to be fol­lowed by repli­ca­tions and exten­sions, and finally model build­ing and test­ing.

The results change dramatically if we use Gender as a factor. There are Gender effects on dependent variables related to quote attribution, system justification, the gambler’s fallacy, imagined contact, the explicit evaluation of arts and math, and the norm of reciprocity. There are no significant differences in political identification (as conservative or liberal), on the response scale manipulation, or on gain vs. loss framing (even though p = .025, such a high p-value is stronger support for the null-hypothesis than for the alternative hypothesis with 5500 participants). It’s surprising that the null-hypothesis (gender does not influence the responses participants give) is rejected for 7 out of 10 effects. Personally (perhaps because I’ve got very little expertise in gender effects) I was actually extremely surprised, even though the effects are small (with Cohen’s ds of around 0.09). This, ironically, shows that NHST works—I’ve learned gender effects are much more widespread than I’d have thought before I wrote this blog post.
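Lakens’ contrast between randomly-assigned conditions (where the null really can hold) and measured individual differences (where crud lurks) can be sketched in a simulation; the numbers below are invented for illustration, not taken from the Many Labs data:

```python
import random

random.seed(1)
n = 100_000
trait = [random.gauss(0, 1) for _ in range(n)]
# The outcome is weakly related to the measured trait ('crud') but has no
# causal connection to the coin-flip condition assignment.
outcome = [0.05 * t + random.gauss(0, 1) for t in trait]
coin = [random.random() < 0.5 for _ in range(n)]

def mean_diff(split):
    a = [y for y, s in zip(outcome, split) if s]
    b = [y for y, s in zip(outcome, split) if not s]
    return sum(a) / len(a) - sum(b) / len(b)

diff_randomized = mean_diff(coin)                    # hovers near 0 at any n
diff_trait = mean_diff([t > 0 for t in trait])       # systematically nonzero
```

Randomization severs every causal path into the grouping variable, so its null is exactly true no matter how large n grows; the trait-based split inherits every ambient correlation.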

Kirkegaard 2014

“The inter­na­tional gen­eral socioe­co­nomic fac­tor: Fac­tor ana­lyz­ing inter­na­tional rank­ings”:

Many stud­ies have exam­ined the cor­re­la­tions between national IQs and var­i­ous coun­try-level indexes of well-be­ing. The analy­ses have been unsys­tem­atic and not gath­ered in one sin­gle analy­sis or dataset. In this paper I gather a large sam­ple of coun­try-level indexes and show that there is a strong gen­eral socioe­co­nomic fac­tor (S fac­tor) which is highly cor­re­lated (.86–.87) with national cog­ni­tive abil­ity using either Lynn and Van­hanen’s dataset or Alti­nok’s. Fur­ther­more, the method of cor­re­lated vec­tors shows that the cor­re­la­tions between vari­able load­ings on the S fac­tor and cog­ni­tive mea­sure­ments are .99 in both datasets using both cog­ni­tive mea­sure­ments, indi­cat­ing that it is the S fac­tor that dri­ves the rela­tion­ship with national cog­ni­tive mea­sure­ments, not the remain­ing vari­ance.

See also “Coun­tries Are Ranked On Every­thing From Health To Hap­pi­ness. What’s The Point?”:

It’s a brand new rank­ing. Called the Sus­tain­able Devel­op­ment Goals Gen­der Index, it gives 129 coun­tries a score for progress on achiev­ing gen­der equal­ity by 2030. Here’s the quick sum­ma­ry: Things are “good” in much of Europe and North Amer­i­ca. And “very poor” in much of sub­-Sa­ha­ran Africa. In fact, that’s the way it looks in many inter­na­tional rank­ings, which tackle every­thing from the worst places to be a child to the most cor­rupt coun­tries to world hap­pi­ness…As for the fact that many rank­ings look the same at the top and bot­tom, one rea­son has to do with mon­ey. Many indexes are cor­re­lated with GDP per cap­i­ta, a mea­sure of a coun­try’s pros­per­i­ty, says Ken­ny. That includes the World Bank’s Human Cap­i­tal Index, which mea­sures the eco­nomic pro­duc­tiv­ity of a coun­try’s young peo­ple; and Free­dom House’s Free­dom in the World index, which ranks the world by its level of democ­ra­cy, includ­ing eco­nomic free­dom. And coun­tries that have more money can spend more money on health, edu­ca­tion and infra­struc­ture.
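Kirkegaard’s S factor can be illustrated with a toy version of the same logic: many country-level indices that each load on one latent dimension, summarized by a composite score (a crude stand-in for the first factor of a factor analysis), which then correlates strongly with a separate “cognitive” indicator. Everything here (loadings, n) is invented for illustration:

```python
import math, random

random.seed(0)
n = 150  # 'countries'
S = [random.gauss(0, 1) for _ in range(n)]  # latent general socioeconomic factor

def indicator(loading):
    """Observed index = loading * S + independent noise (unit total variance)."""
    return [loading * s + math.sqrt(1 - loading ** 2) * random.gauss(0, 1)
            for s in S]

indices = [indicator(l) for l in (0.8, 0.7, 0.6, 0.75, 0.65)]
cognitive = indicator(0.85)

def zscore(xs):
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

# Crude general-factor score: the mean of the standardized indices.
s_score = [sum(vals) / len(vals) for vals in zip(*map(zscore, indices))]

def pearson_r(xs, ys):
    zx, zy = zscore(xs), zscore(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)

r = pearson_r(s_score, cognitive)  # high: the indices' shared variance is S
```

This is why so many rankings “look the same at the top and bottom”: each is a noisy indicator of the same general factor, so any reasonable composite of them recovers it.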

Shen et al 2014

, Shen et al 2014:

Is Too Much Variance Explained? It is interesting that historically the I–O literature has bemoaned the presence of a “validity ceiling”, and the field seemed to be unable to make large gains in the prediction of job performance (Highhouse, 2008). In contrast, LeBreton et al. appear to have the opposite concern—that we may be able to predict too much, perhaps even all, of the variance in job performance once accounting for statistical artifacts. In addition to their four focal predictors (i.e., GMA, integrity, structured interview, work sample), LeBreton et al. list an additional 24 variables that have been shown to be related to job performance meta-analytically. However, we believe that many of the variables LeBreton et al. included in their list are variables that Sackett, Borneman, and Connelly (2009) would argue are likely unknowable at time of hire.

…Furthermore, in contrast to LeBreton et al.’s assertion that organizational variables, such as procedural justice, are likely unrelated to their focal predictors, our belief is that many of these variables are likely to be at least moderately correlated, limiting the incremental validity we could expect with the inclusion of these additional variables. For example, research has shown that integrity tests mostly tap into Conscientiousness, Agreeableness, and Emotional Stability (Ones & Viswesvaran, 2001), and a recent meta-analysis of organizational justice shows that all three personality traits are moderately related to one’s experience of procedural justice (ρ = 0.19–0.23; Hutchinson et al., 2014), suggesting that even apparently unrelated variables can share a surprising amount of construct-level variance. In support of this perspective, Paterson, Harms, and Crede (2012) [“The meta of all metas: 30 years of meta-analysis reviewed”] conducted a meta-analysis of over 200 meta-analyses and found an average correlation of 0.27, suggesting that most variables we study are at least somewhat correlated and validating the first author’s long-held personal assumption that the world is correlated 0.30 (on average; see also Meehl’s, 1990, crud factor)!

Gordon et al 2019

, Gor­don et al 2019:

We exam­ine how com­mon tech­niques used to mea­sure the causal impact of ad expo­sures on users’ con­ver­sion out­comes com­pare to the “gold stan­dard” of a true exper­i­ment (ran­dom­ized con­trolled tri­al). Using data from 12 US adver­tis­ing lift stud­ies at Face­book com­pris­ing 435 mil­lion user-s­tudy obser­va­tions and 1.4 bil­lion total impres­sions we con­trast the exper­i­men­tal results to those obtained from obser­va­tional meth­ods, such as com­par­ing exposed to unex­posed users, match­ing meth­ods, mod­el-based adjust­ments, syn­thetic matched-mar­kets tests, and before-after tests. We show that obser­va­tional meth­ods often fail to pro­duce the same results as true exper­i­ments even after con­di­tion­ing on infor­ma­tion from thou­sands of behav­ioral vari­ables and using non-lin­ear mod­els. We explain why this is the case. Our find­ings sug­gest that com­mon approaches used to mea­sure adver­tis­ing effec­tive­ness in indus­try fail to mea­sure accu­rately the true effect of ads.

An important input to propensity score matching (PSM) is the set of variables used to predict the propensity score itself. We tested three different PSM specifications for study 4, each of which used a larger set of inputs.

  1. PSM 1: In addition to age and gender, the basis of our exact matching (EM) approach, this specification uses common Facebook variables, such as how long users have been on Facebook, how many Facebook friends they have, their reported relationship status, and their phone OS, in addition to other user characteristics.
  2. PSM 2: In addi­tion to the vari­ables in PSM 1, this spec­i­fi­ca­tion uses Face­book’s esti­mate of the user’s zip code of res­i­dence to asso­ciate with each user nearly 40 vari­ables drawn from the most recent Cen­sus and Amer­i­can Com­mu­ni­ties Sur­veys (ACS).
  3. PSM 3: In addition to the variables in PSM 2, this specification adds a composite metric of Facebook data that summarizes thousands of behavioral variables. This is a machine-learning based metric used by Facebook to construct target audiences that are similar to consumers that an advertiser has identified as desirable. Using this metric bases the estimation of our propensity score on a non-linear machine-learning model with thousands of features.

…When we go from exact matching (EM) to our most parsimonious propensity score matching model (PSM 1), the conversion rate for unexposed users increases from 0.032% to 0.042%, decreasing the implied advertising lift from 221% to 147%. PSM 2 performs similarly to PSM 1, with an implied lift of 154%. Finally, adding the composite measure of Facebook variables in PSM 3 improves the fit of the propensity model (as measured by a higher AUC/ROC) and further increases the conversion rate for matched unexposed users to 0.051%. The result is that our best performing PSM model estimates an advertising lift of 102%…We summarize the result of all our propensity score matching and regression methods for study 4 in Figure 7.

Gor­don et al 2016: “Fig­ure 7: Sum­mary of lift esti­mates and con­fi­dence inter­vals”

While not directly test­ing sta­tis­ti­cal-sig­nifi­cance in its propen­sity scor­ing, the increas­ing accu­racy in esti­mat­ing the true causal effect of adding in addi­tional behav­ioral vari­ables implies that (espe­cially at Face­book-s­cale, using bil­lions of dat­a­points) the cor­re­la­tions of the thou­sands of used vari­ables with the adver­tis­ing behav­ior would be sta­tis­ti­cal­ly-sig­nifi­cant and demon­strate that every­thing is cor­re­lat­ed.
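The mechanism Gordon et al are probing (richer conditioning sets moving observational estimates toward the experimental benchmark) can be mimicked in a toy simulation. Here exact stratification on a single known confounder stands in for propensity-score matching on thousands of features; all quantities are invented:

```python
import random

random.seed(2)
n = 50_000
# Each simulated user: an 'activity' confounder drives both ad exposure and
# baseline conversion; the true additive effect of exposure is +1 point.
users = []
for _ in range(n):
    activity = random.random()
    exposed = random.random() < 0.2 + 0.6 * activity
    base = 0.01 + 0.04 * activity
    converted = random.random() < base + (0.01 if exposed else 0.0)
    users.append((activity, exposed, converted))

def conv_rate(rows):
    rows = list(rows)
    return sum(c for _, _, c in rows) / len(rows)

# Naive exposed-vs-unexposed comparison: inflated by the confounder.
naive = (conv_rate(u for u in users if u[1])
         - conv_rate(u for u in users if not u[1]))

# Stratify on the confounder (a one-variable stand-in for PSM 1-3):
diffs = []
for k in range(10):
    lo, hi = k / 10, (k + 1) / 10
    ex = [u for u in users if u[1] and lo <= u[0] < hi]
    un = [u for u in users if not u[1] and lo <= u[0] < hi]
    if ex and un:
        diffs.append(conv_rate(ex) - conv_rate(un))
adjusted = sum(diffs) / len(diffs)  # much closer to the true +0.01
```

The naive contrast overstates the lift because exposure and conversion share an upstream cause; conditioning on that cause removes most (here, all measurable) of the bias, which is exactly what adding thousands of behavioral variables attempts at scale.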

Kirkegaard 2020

“Enhanc­ing archival datasets with machine learned psy­cho­met­rics”, Kirkegaard 2020:

In our ISIR 2019 presentation (“Machine learning psychometrics: Improved cognitive ability validity from supervised training on item level data”), we showed that one can use machine learning on cognitive data to improve the predictive validity of it. The effect sizes can be quite large, e.g. one could predict educational attainment in the Vietnam Experience Study (VES) sample (n = 4.5k US army recruits) at R² = 32.3% with the machine-learned score vs. 17.7% with g. Prediction is more than g, after all. What if we had a dataset of 185 diverse items, and we train the model to predict IRT-based g from the full set, but using only a limited set using the LASSO? How many items do we need when optimally weighted? Turns out that with 42 items, one can get a test that correlates at 0.96 with the full g. That’s an abbreviation of nearly 80%!

Now comes the fancy part. What if we have archival datasets with only a few cognitive items (e.g. datasets with items) or maybe even no items. Can we improve things here? Maybe! If the dataset has a lot of other items, we may be able to train a machine learning (ML) model that predicts g quite well from them, even if they seem unrelated. Every item has some variance overlap with g however small (crud factor), it is only a question of having a good enough algorithm and enough data to exploit this covariance. For instance, I have found that if one uses the 556 items in the MMPI in the VES to predict the very well measured g based on all the cognitive data (18 tests), how well can one do? I was surprised to learn that one can do extremely well:

“Elas­tic net pre­dic­tion of g: r = 0.83 (0.82–0.84), n = 4320”

[There are 203 (elastic net)/217 (lasso) non-zero coefficients out of 556]

Thus, one can mea­sure g as well as one could with a decent test like Won­der­lic, or Raven’s with­out hav­ing any cog­ni­tive data at all! The big ques­tion here is whether these mod­els gen­er­al­ize well. If one can train a model to pre­dict g from MMPI items in dataset 1, and then apply it to dataset 2 with­out much loss of accu­ra­cy, this means that one could impute g in poten­tially thou­sands of old archival datasets that include the same MMPI items, or a sub­set of them.
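The “exploit the crud” strategy (penalized regression picking out whichever of many weakly-g-loaded items carry signal) can be sketched end-to-end with a small hand-rolled coordinate-descent LASSO. This is a generic textbook implementation, not Kirkegaard’s actual pipeline; the loadings, n, and penalty are arbitrary:

```python
import math, random

def lasso(X, y, lam, sweeps=50):
    """Coordinate-descent LASSO for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    norms = [sum(v * v for v in c) for c in cols]
    beta = [0.0] * p
    resid = y[:]  # residual y - X*beta, maintained incrementally
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(c * e for c, e in zip(cols[j], resid)) + beta[j] * norms[j]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / norms[j]
            step = new - beta[j]
            if step:
                resid = [e - step * c for e, c in zip(resid, cols[j])]
                beta[j] = new
    return beta

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

random.seed(5)
n, p, k = 400, 30, 8  # only 8 of 30 'items' actually load on the latent g
g = [random.gauss(0, 1) for _ in range(n)]
X = [[(0.6 * g[i] if j < k else 0.0) + random.gauss(0, 0.8)
      for j in range(p)] for i in range(n)]

beta = lasso(X, g, lam=32.0)
nonzero = sum(b != 0.0 for b in beta)
pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
r = pearson_r(pred, g)  # a short weighted 'test' that tracks g closely
```

The penalty zeroes out most of the pure-noise items while the surviving loaded items, optimally weighted, recover g well, which is the abbreviation logic in miniature.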

A sim­i­lar analy­sis is done by Rev­elle et al 2020’s (espe­cially “Study 4: Pro­file cor­re­la­tions using 696 items”); they do not directly report an equiv­a­lent to posteriors/p-val­ues or non-zero cor­re­la­tions after penal­ized regres­sion or any­thing like that, but the per­va­sive­ness of cor­re­la­tion is appar­ent from their results & data visu­al­iza­tions.


Genetic correlations

Modern genomics has found large-scale biobanks & summary-statistic-only methods to be a fruitful area for identifying genetic correlations, as the power of publicly-released PGSes has steadily grown with increasing n (stabilizing estimates & making ever more genetic correlations pass statistical-significance thresholds), which also frequently mirror phenotypic correlations in all organisms (“Cheverud’s conjecture”13).

Exam­ple graphs drawn from the broader analy­ses (pri­mar­ily visu­al­ized as heatmap­s):

  • “Phe­nome-wide analy­sis of genome-wide poly­genic scores”, Krapohl et al 2015:

    Krapohl et al 2015: “Fig­ure 1. Cor­re­la­tions between 13 genome-wide poly­genic scores and 50 traits from the behav­ioral phe­nome. These results are based on GPS con­structed using a GWAS P-value thresh­old (PT)=0.30; results for PT = 0.10 and 0.05 (Sup­ple­men­tary Fig­ures 1a and b and Sup­ple­men­tary Table 3). P-val­ues that pass Nyholt–Si­dak cor­rec­tion (see Sup­ple­men­tary Meth­ods 1) are indi­cated with two aster­isks, whereas those reach­ing nom­i­nal sig­nifi­cance (thus sug­ges­tive evi­dence) are shown with a sin­gle aster­isk.”
  • , Hage­naars et al 2016:

    Hage­naars et al 2016: “Fig­ure 1. Heat map of genetic cor­re­la­tions cal­cu­lated using LD regres­sion between cog­ni­tive phe­no­types in UK Biobank and health-re­lated vari­ables from GWAS con­sor­tia. Hues and col­ors depict, respec­tive­ly, the strength and direc­tion of the genetic cor­re­la­tion between the cog­ni­tive phe­no­types in UK Biobank and the health-re­lated vari­ables. Red and blue indi­cate pos­i­tive and neg­a­tive cor­re­la­tions, respec­tive­ly. Cor­re­la­tions with the darker shade asso­ci­ated with a stronger asso­ci­a­tion. Based on results in Table 2. ADHD, atten­tion deficit hyper­ac­tiv­ity dis­or­der; FEV1, forced expi­ra­tory vol­ume in 1 s; GWAS, genome-wide asso­ci­a­tion study; LD, link­age dis­e­qui­lib­ri­um; NA, not avail­able.”
  • , Hill et al 2016:

    Hill et al 2016 fig­ure: “Genetic cor­re­la­tions between house­hold incomes and health vari­ables”
  • , Socrates et al 2017 (sup­ple­ment w/full heatmaps)

    Socrates et al 2017: “Fig­ure 3. Heat map show­ing genetic asso­ci­a­tions between poly­genic risk scores from GWAS traits (X-ax­is) and NFBC1966 traits (y-ax­is) for self­-re­ported dis­or­ders, med­ical and psy­chi­atric con­di­tions ver­i­fied or treated by a doc­tor, con­trolled for sex, BMI, and SES
    Socrates et al 2017: “Fig­ure 3. Heat map show­ing genetic asso­ci­a­tions between poly­genic risk scores from GWAS traits (X-ax­is) and NFBC1966 traits (y-ax­is) from ques­tion­naires lifestyle and social fac­tors”
  • , Docherty et al 2017:

    Docherty et al 2017: “Fig­ure 2: Phe­nome on GPS regres­sion q-val­ues in Euro­pean Sam­ple (EUR). GPS dis­played with prior pro­por­tion of causal effects = 0.3. Here, aster­isks in the cells of the heatmap denote results of greater effect: *** = q-value < 0.01, ** = q-value < 0.05, * = q-value < 0.16. Blue val­ues reflect a neg­a­tive asso­ci­a­tion, and red reflect pos­i­tive asso­ci­a­tion. Inten­sity of color indi­cates −log10 p val­ue.”
    Docherty et al 2017: “Fig­ure 3: Genetic Over­lap and Co-Her­i­tabil­ity of GPS in Euro­pean Sam­ple (EUR). Heatmap of par­tial cor­re­la­tion coeffi­cients between GPS with prior pro­por­tion of causal effects = 0.3. Here, aster­isks in the cells of the heatmap denote results of greater effect: **** = q-value < 0.0001, *** = q-value < 0.001, ** = q value < 0.01, * = q value < 0.05, and ~ = sug­ges­tive sig­nifi­cance at q value < 0.16. Blue val­ues reflect a neg­a­tive cor­re­la­tion, and red reflect pos­i­tive cor­re­la­tion.”
  • , Joshi et al 2017:

    “Fig­ure 5: Genetic cor­re­la­tions between trait clus­ters that asso­ciate with mor­tal­i­ty. The upper panel shows whole genetic cor­re­la­tions, the lower pan­el, par­tial cor­re­la­tions. T2D, type 2 dia­betes; BP, blood pres­sure; BC, breast can­cer; CAD, coro­nary artery dis­ease; Edu, edu­ca­tional attain­ment; RA, rheuma­toid arthri­tis; AM, age at menar­che; DL/WHR Dyslipidemia/Waist-Hip ratio; BP, blood pres­sure”
  • , Hill et al 2018:

    “Fig. 4: Heat map show­ing the genetic cor­re­la­tions between the meta-an­a­lytic intel­li­gence phe­no­type, intel­li­gence, edu­ca­tion with 29 cog­ni­tive, SES, men­tal health, meta­bol­ic, health and well­be­ing, anthro­po­met­ric, and repro­duc­tive traits. Pos­i­tive genetic cor­re­la­tions are shown in green and neg­a­tive genetic cor­re­la­tions are shown in red. Sta­tis­ti­cal sig­nifi­cance fol­low­ing FDR (us­ing Ben­jamini-Hochberg pro­ce­dure [51]) cor­rec­tion is indi­cated by an aster­isk.”
  • , Watan­abe et al 2018:

    Watan­abe et al 2018: "Fig. 2. Within and between domains genetic cor­re­la­tions. (a.) Pro­por­tion of trait pairs with sig­nifi­cant rg (top) and aver­age |_rg_| for sig­nifi­cant trait pairs (bot­tom) within domains. Dashed lines rep­re­sent the pro­por­tion of trait pairs with sig­nifi­cant rg (top) and aver­age |rg| for sig­nifi­cant trait pairs (bot­tom) across all 558 traits, respec­tive­ly. Con­nec­tive tis­sue, mus­cu­lar and infec­tion domains are excluded as these each con­tains less than 3 traits. (b.) Heatmap of pro­por­tion of trait pairs with sig­nifi­cant rg (up­per right tri­an­gle) and aver­age |rg| for sig­nifi­cant trait pairs (lower left tri­an­gle) between domains. Con­nec­tive tis­sue, mus­cu­lar and infec­tion domains are excluded as each con­tains less than 3 traits. The diag­o­nal rep­re­sents the pro­por­tion of trait pairs with sig­nifi­cant rg within domains. Stars denote the pairs of domains in which the major­ity (>50%) of sig­nifi­cant rg are neg­a­tive."
  • , Abdel­laoui et al 2018:

    Abdel­laoui et al 2018: “Fig­ure 6: Genetic cor­re­la­tions based on LD score regres­sion. Col­ored is sig­nifi­cant after FDR cor­rec­tion. The green num­bers in the left part of the Fig­ure below the diag­o­nal of 1’s are the phe­no­typic cor­re­la­tions between the regional out­comes of coal min­ing, reli­gious­ness, and regional polit­i­cal pref­er­ence. The blue stars next to the trait names indi­cate that UK Biobank was part of the GWAS of the trait.”
  • “Iden­ti­fi­ca­tion of 12 genetic loci asso­ci­ated with human healthspan”, Zenin et al 2019:

    “Fig­ure 4. 35 traits with sig­nifi­cant and high genetic cor­re­la­tions with healthspan (|rg| ≥ 0.3; p≤ 4.3 × 10−5). PMID ref­er­ences are placed in square brack­ets. Note the absence of genetic cor­re­la­tion between the healthspan and Alzheimer dis­ease traits (rg= −0.03)”
  • “Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use”, Liu et al 2019:

    Liu et al 2019: “Fig. 1 | Genetic cor­re­la­tions between sub­stance use phe­no­types and phe­no­types from other large GWAS. Genetic cor­re­la­tions between each of the phe­no­types are shown in the first 5 rows, with her­i­tabil­ity esti­mates dis­played down the diag­o­nal. All genetic cor­re­la­tions and her­i­tabil­ity esti­mates were cal­cu­lated using LD score regres­sion. Pur­ple shad­ing rep­re­sents neg­a­tive genetic cor­re­la­tions, and red shad­ing rep­re­sents pos­i­tive cor­re­la­tions, with increas­ing color inten­sity reflect­ing increas­ing cor­re­la­tion strength. A sin­gle aster­isk reflects a sig­nifi­cant genetic cor­re­la­tion at the p < 0.05 lev­el. Dou­ble aster­isks reflect a sig­nifi­cant genetic cor­re­la­tion at the Bon­fer­roni-cor­rec­tion p < 0.000278 level (cor­rected for 180 inde­pen­dent test­s). Note that SmkCes was ori­ented such that higher scores reflected cur­rent smok­ing, and for AgeSmk, lower scores reflect ear­lier ages of ini­ti­a­tion, both of which are typ­i­cally asso­ci­ated with neg­a­tive out­comes.”
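Several of the figures above flag significance only after false-discovery-rate correction (e.g. Benjamini-Hochberg in Hill et al 2018, FDR in Abdellaoui et al 2018): when hundreds of genetic correlations are tested at once in a world where everything is correlated, multiplicity control is what keeps the heatmaps from being solid stars. A minimal implementation of the step-up procedure:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: flag each p-value as a discovery or not,
    controlling the expected false-discovery rate at q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    last_pass = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:  # compare k-th smallest p to q*k/m
            last_pass = rank
    keep = set(order[:last_pass])     # all p-values up to the last passing rank
    return [i in keep for i in range(m)]
```

With p-values [0.001, 0.008, 0.039, 0.041, 0.09, 0.6], only the first two survive at q = 0.05, even though four of the six fall below the naive 0.05 line.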

  1. Some­times para­phrased as “All good things tend to go togeth­er, as do all bad ones”.↩︎

  2. Tib­shi­rani 2014:

    In describing some of this work, Hastie et al. (2001) coined the informal “Bet on Sparsity” principle [“Use a procedure that does well in sparse problems, since no procedure does well in dense problems.”]. The ℓ1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using ℓ1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter. This is typically not the case when p ≫ N, a commonly occurring scenario.

    This can be seen as a kind of deci­sion-the­o­retic jus­ti­fi­ca­tion for Occam-style assump­tions: if the real world is not pre­dictable in the sense of being pre­dictable by simple/fast algo­rithms, or induc­tion does­n’t work at all, then no method works in expec­ta­tion, and the “regret” (differ­ence between expected value of actual deci­sion and expected value of opti­mal deci­sion) from mis­tak­enly assum­ing that the world is simple/sparse is zero. So one should assume the world is sim­ple.↩︎

  3. A machine learning practitioner as of 2019 will be struck by the thought that Tobler’s first law nicely encapsulates the principle behind the “unreasonable effectiveness” of convolutions in so many domains far beyond images; this connection has been made by John Hessler.↩︎

  4. The most interesting example of this is ESP/psi parapsychology research: the more rigorously conducted the ESP experiments are, the smaller the effects become—but, while discrediting all claims of human ESP, frequently they aren’t pushed to exactly zero and remain “statistically-significant”, suggesting some residual crud factor in the experiments, even when conducted & analyzed as best as we know how.↩︎

  5. While Gos­set 1904 is dis­cussed in sev­eral sources, like , the authors have con­sulted the Guin­ness Archive in per­son; the report itself does not appear to have ever been made pub­lic or dig­i­tized. I have con­tacted the Archives about get­ting a copy.↩︎

  6. The ver­sion in the sec­ond edi­tion, The Foun­da­tions of Sta­tis­tics, 2nd edi­tion, Sav­age 1972, is iden­ti­cal to the first.↩︎

  7. N.B.: I. Richard is not to be con­fused with his broth­er, Leonard Jim­mie Sav­age, who also worked in Bayesian sta­tis­tics & is cited pre­vi­ous­ly.↩︎

  8. 2nd edi­tion, 1986; after skim­ming the 2nd edi­tion, I have not been able to find a rel­e­vant pas­sage, but Lehmann remarks that he sub­stan­tially rewrote the text­book for a more robust deci­sion-the­o­retic approach, so it may have been removed.↩︎

  9. This analy­sis was never pub­lished, accord­ing to Meehl 1990a.↩︎

  10. I would note there is a dangerous fallacy here even if one does believe the Law of Large Numbers should apply here with an expectation of zero effect: even if the expectation of the pairwise correlation of 2 arbitrary variables was in fact precisely zero (as is not too implausible in some domains such as optimization or feedback loops, the famous example being the thermostat & room temperature), that does not mean any specific pair will be exactly zero no matter how many numbers get added up to create their relationship, as the absolute size of the deviation increases.

    So for exam­ple, imag­ine 2 genetic traits which may be genet­i­cal­ly-cor­re­lat­ed, and their her­i­tabil­ity may be caused by a num­ber of genes rang­ing from 1 (mono­genic) to tens of thou­sands (highly poly­genic); the spe­cific over­lap is cre­ated by a chance draw of evo­lu­tion­ary processes through­out the organ­is­m’s evo­lu­tion; does the Law of Large Num­bers jus­tify say­ing that while 2 mono­genic traits may have a sub­stan­tial cor­re­la­tion, 2 highly poly­genic traits must have much closer to zero cor­re­la­tion sim­ply because they are influ­enced by more genes? No, because the dis­tri­b­u­tion around the expec­ta­tion of 0 can become wider & wider the more rel­e­vant genes there are.

    To rea­son oth­er­wise is, as Samuel­son not­ed, to think like an insurer who is wor­ried about los­ing $100 on an insur­ance con­tract so it goes out & makes 100 more $100 con­tracts.↩︎

  11. Betz 1986 spe­cial issue’s con­tents:

    1. “The g fac­tor in employ­ment”, Got­tfred­son 1986
    2. “Origins of and Reactions to the PTC conference on The g Factor In Employment Testing”, Avery 1986
    3. “g: Artifact or reality?”, Jensen 1986
    4. “The role of gen­eral abil­ity in pre­dic­tion”, Thorndike 1986
    5. “Cog­ni­tive abil­i­ty, cog­ni­tive apti­tudes, job knowl­edge, and job per­for­mance”, Hunter 1986
    6. “Validity versus utility of mental tests: Example of the SAT”, Gottfredson & Crouse 1986
    7. “Soci­etal con­se­quences of the g fac­tor in employ­ment”, Got­tfred­son 1986
    8. “Real world implications of g”, Hawk 1986
    9. “Gen­eral abil­ity in employ­ment: A dis­cus­sion”, Arvey 1986
    10. “Com­men­tary”, Humphreys 1986
    11. “Com­ments on the g fac­tor in Employ­ment Test­ing”, Linn 1986
    12. “Back to Spear­man?”, Tyler 1986
  12. This work does not seem to have been published, as I can find no books published by them jointly, nor any McClosky books published between 1990 & his death in 2004.↩︎

  13. For defi­n­i­tions & evi­dence for, see: Cheverud 1988, Roff 1996, Kruuk et al 2008, Dochter­mann 2011, , & .↩︎