Why Correlation Usually ≠ Causation: Causal Nets Cause Common Confounding

Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common? (statistics, philosophy)
created: 24 Jun 2014; modified: 06 Aug 2017; status: in progress; confidence: log; importance: 10

It is widely understood that statistical correlation between two variables ≠ causation. But despite this admonition, people are routinely overconfident in claiming correlations to support particular causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimating the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in DAGs, the imbalance also explains overconfidence.

Confound it! Correlation is (usually) not causation! But why not?

I’ve noticed I seem to be unusually willing to bite the correlation≠causation bullet, and I think it’s due to an idea I had some time ago about the nature of reality.

The Problem

Hubris is the greatest danger that accompanies formal data analysis…Let me lay down a few basics, none of which is easy for all to accept… 1. The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.1

Most scientifically-inclined people are reasonably aware that one of the major divides in research is that correlation≠causation: that having discovered some relationship between various data X and Y (not necessarily Pearson’s r, but any sort of mathematical or statistical relationship, whether it be a humble r or an opaque deep neural network’s predictions), we do not know how Y would change if we manipulated X. Y might increase, decrease, do something complicated, or remain implacably the same. This point can be made by listing examples of correlations where we intuitively know changing X should have no effect on Y, and it’s a spurious relationship: the number of churches in a town may correlate with the number of bars, but we know that’s because both are related to how many people are in it; the number of pirates may inversely correlate with global temperatures (but we know pirates don’t control global warming and it’s more likely something like economic development leads to suppression of piracy but also CO2 emissions); sales of ice cream may correlate with snake bites or violent crime or death from heat-strokes (but of course snakes don’t care about sabotaging ice cream sales); thin people may have better posture than fat people, but sitting upright does not seem like a plausible weight loss plan2; wearing XXXL clothing clearly doesn’t cause heart attacks, although one might wonder if diet soda causes obesity; the more firemen are around, the worse fires are; judging by grades of tutored vs non-tutored students, tutors would seem to be harmful rather than helpful; black skin does not cause sickle cell anemia nor, to borrow an example from Pearson3, would black skin cause smallpox or malaria; more recently, part of the psychology behind linking vaccines with autism is that many vaccines are administered to children at the same time autism would start becoming apparent (or should we blame organic food sales?); height & vocabulary or foot size & math skills may correlate strongly (in children); national chocolate consumption correlates with Nobel prizes4, as do borrowing from commercial banks & buying luxury cars & serial killers/mass-murderers/traffic fatalities5; moderate alcohol consumption predicts increased lifespan and earnings; the role of storks in delivering babies may have been underestimated; children and people with high self-esteem have higher grades & lower crime rates etc, so we all know in our gut that it’s true that raising people’s self-esteem empowers us to live responsibly and that inoculates us against the lures of crime, violence, substance abuse, teen pregnancy, child abuse, chronic welfare dependency and educational failure - unless perhaps high self-esteem is caused by high grades & success, boosting self-esteem has no experimental benefits, and may backfire?

Now, the correlation could be bogus in the sense that it would disappear if we gathered more data, and was an illusory correlation due to biases; or it could be an artifact of our mathematical procedures as in spurious correlations; or it is a Type I error, a correlation thrown up by the standard statistical problems we all know about, such as too-small n, false positives from sampling error (A & B just happened to sync together due to randomness), data-mining/multiple testing, p-hacking, data snooping, selection bias, publication bias, misconduct, inappropriate statistical tests, etc Those last can be generated ad nauseam: Shaun Gallagher’s Correlated (also a book) surveys users & compares against all previous surveys with 1k+ correlations. Tyler Vigen’s spurious correlations catalogues 35k+ correlations, many with r>0.9, based primarily on US Census & CDC data. Google Correlate finds Google search query patterns which correspond with real-world trends based on geography or user-provided data, which offers endless fun (Facebook/tapeworm in humans, r=0.8721; Superfreakonomic/Windows 7 advisor, r=0.9751; Irish electricity prices/Stanford webmail, r=0.83; heart attack/pink lace dress, r=0.88; US states’ parasite loads/booty models, r=0.92; US states’ family ties/how to swim; metronidazole/Is Lil’ Wayne gay?, r=0.89; Clojure/prnhub, r=0.9784; accident/itchy bumps, r=0.87; migraine headaches/sciences, r=0.77; Irritable Bowel Syndrome/font download, r=0.94; interest-rate-index/pill identification, r=0.98; advertising/medical research, r=0.99; Barack Obama 2012 vote-share/Top Chef, r=0.88; losing weight/houses for rent, r=0.97; Bieber/tonsillitis, r=0.95; paternity test/food for dogs, r=0.83; breast enlargement/reverse telephone search, r=0.95; theory of evolution / the Sumerians or Hector of Troy or Jim Crow laws; gwern/Danny Brown lyrics, r=0.92; weed/new Family Guy episodes, r=0.8; a drawing of a bell curve matches MySpace while a penis matches STD symptoms in men r=0.95, not to mention Kurt Vonnegut stories). (And on less secular themes, do churches cause obesity & do Welsh rugby victories predict papal deaths?) Financial data-mining offers some fun examples; there’s the Super Bowl/stock-market one which worked well for several decades; and it’s not very elegant, but a 3-variable model (Bangladeshi butter, American cheese, joint sheep population) reaches R2=0.99 on 20 years of the S&P 500

I’ve read about those problems at length, and despite knowing about all that, there still seems to be a problem: I don’t think those issues explain away all the correlations which turn out to be confounds - correlation too often ≠ causation.

One of the constant problems I face in my reading is that I constantly want to know about causal relationships but I only have correlational data, and as we all know, that is an unreliable guide at best.

The unreliability is bad enough, but I’m also worried that the knowledge correlation≠causation, one of the core ideas of the scientific method and fundamental to fields like modern medicine, is going underappreciated and is being abandoned by meta-contrarians as being nothing helpful or meaningless and justified skepticism is actually just a dumb-ass thing to say, a statistical cliché that closes threads and ends debates, the freshman platitude turned final shutdown often used by party poopers Internet blowhards to serve an agenda & is sometimes a dog whistle; in practice, such people seem to go well beyond the XKCD comic and proceed to take any correlations they like as strong evidence for causation, and any disagreement reveals one’s unsophisticated middlebrow thinking or denialism. So it’s unsurprising that one so often runs into researchers for whom indeed correlation=causation; it is common to use causal language and make recommendations (Prasad et al 2013), but even if they don’t, you can be sure to see them confidently talking causally to other researchers or journalists or officials. (I’ve noticed this sort of constant slide is particularly common in medicine, sociology, and education.)

Bandying phrases with meta-contrarians won’t help much here; I agree with them that correlation ought to be some evidence for causation. eg if I suspect that A→B, and I collect data and establish beyond doubt that A&B correlates r=0.7, surely this observations, which is consistent with my theory, should boost my confidence in my theory, just as an observation like r=0.0001 would trouble me greatly. But how much…?

To measure this directly you need a clear set of correlations which are proposed to be causal, randomized experiments to establish what the true causal relationship is in each case, and both categories need to be sharply delineated in advance to avoid issues of cherrypicking and retroactively confirming a correlation. Then you’d be able to say something like 11 out of the 100 proposed A→B causal relationships panned out, and start with a prior of 11% that in your case, A→B. This sort of dataset is pretty rare, although the few examples I’ve found tend to indicate that our prior should be under 10%. (For example, Fraker & Maynard 1987 analyze a government jobs program and got data on randomized participants & others, permitting comparison of randomized inference to standard regression approaches; they find roughly that 0/12 estimates - many statistically-significant - were reasonably similar to the causal effect for one job program & 4/12 for another job program, with the regression estimates for the former heavily biased.) Not great. Why are our best analyses & guesses at causal relationships are so bad?

We’d expect that the a priori odds are good, by the principle of indifference: 13\frac{1}{3}! After all, you can divvy up the possibilities as:

  1. A causes B
  2. B causes A
  3. both A and B are caused by a C

    (Possibly in a complex way like Berkson’s paradox or conditioning on unmentioned variables, like a phone-based survey inadvertently generating conclusions valid only for the phone-using part of the population, causing amusing pseudo-correlations.)

If it’s either #1 or #2, we’re good and we’ve found a causal relationship; it’s only outcome #3 which leaves us baffled & frustrated. If we were guessing at random, you’d expect us to still be right at least 33% of the time. And we can draw on all sorts of knowledge to do better6

I think a lot of people tend to put a lot of weight on any observed correlation because of this intuition that a causal relationship is normal & probable because, well, how else could this correlation happen if there’s no causal connection between A & B‽ And fair enough - there’s no grand cosmic conspiracy arranging matters to fool us by always putting in place a C factor to cause scenario #3, right? If you question people, of course they know correlation doesn’t necessarily mean causation - everyone knows that - since there’s always a chance of a lurking confound, and it would be great if you had a randomized experiment to draw on; but you think with the data you have, not the data you wish you had, and can’t let the perfect be the enemy of the better. So when someone finds a correlation between A and B, it’s no surprise that suddenly their language & attitude change and they seem to place great confidence in their favored causal relationship even if they piously acknowledge Yes, correlation is not causation, but… obviously hanging out with fat people will make you fat / parents influence their kids a lot eg. smoking encourages their kids to smoke / when we gave babies a new drug, fewer went blind / female-named hurricanes increase death tolls due to sexistly underestimating women / Kaposi’s sarcoma correlates so highly with AIDS that it must be another consequence of HIV (actually caused by HHV-8 which is transmitted simultaneously with HIV) / vitamin and anti-oxidant use (among many other lifestyle choices) will save lives / LSD & marijuana use associates with and thus surely causes schizophrenia and other forms of insanity (despite increases in marijuana use not followed by any schizophrenia increases / hormone replacement therapy correlates with mortality reduction in women so it definitely helps and doesn’t hurt etc.

Besides the intuitiveness of correlation=causation, we are also desperate and want to believe: correlative data is so rich and so plentiful, and experimental data so rare. If it is not usually the case that correlation=causation, then what exactly are we going to do for decisions and beliefs, and what exactly have we spent all our time to obtain? When I look at some dataset with a number of variables and I run a multiple regression and can report that variables A, B, and C are all statistically-significant and of large effect-size when regressed on D, all I have really done is learned something along the lines of in a hypothetical dataset generated in the exact same way, if I somehow was lacking data on D, I could make a better prediction in a narrow mathematical sense of no importance (squared error) based on A/B/C. I have not learned whether A/B/C cause D, or whether I could predict values of D in the future, or anything about how I could intervene and manipulate any of A-D, or anything like that - rather, I have learned a small point about prediction. To take a real example: when I learn that moderate alcohol consumption means the actuarial prediction of lifespan for drinkers should be increased slightly, why on earth would I care about this at all unless it was causal? When epidemiologists emerge from a huge survey reporting triumphantly that steaks but not egg consumption slightly predicts decreased lifespan, why would anyone care aside from perhaps life insurance companies? Have you ever been abducted by space aliens and ordered as part of an inscrutable alien blood-sport to take a set of data about Midwest Americans born 1960-1969 with dietary predictors you must combine linearly to create predictors of heart attacks under a squared error loss function to outpredict your fellow abductees from across the galaxy? Probably not. Why would anyone give them grant money for this, why would they spend their time on this, why would they read each others’ papers unless they had a quasi-religious faith7 that these correlations were more than just some coefficients in a predictive model - that they were causal? To quote Rutter 2007, most discussions of correlations fall into two equally problematic camps:

…all behavioral scientists are taught that statistically significant correlations do not necessarily mean any kind of causative effect. Nevertheless, the literature is full of studies with findings that are exclusively based on correlational evidence. Researchers tend to fall into one of two camps with respect to how they react to the problem. > First, there are those who are careful to use language that avoids any direct claim for causation, and yet, in the discussion section of their papers, they imply that the findings do indeed mean causation. > Second, there are those that completely accept the inability to make a causal inference on the basis of simple correlation or association and, instead, take refuge in the claim that they are studying only associations and not causation. This second, pure approach sounds safer, but it is disingenuous because it is difficult to see why anyone would be interested in statistical associations or correlations if the findings were not in some way relevant to an understanding of causative mechanisms.

So, correlations tend to not be causation because it’s almost always #3, a shared cause. This commonness is contrary to our expectations, based on a simple & unobjectionable observation that of the 3 possible relationships, 2 are causal; and so we often reason as though correlation were strong evidence for causation. This leaves us with a paradox: experimental results seem to contradict intuition. To resolve the paradox, I need to offer a clear account of why shared causes/confounds are so common, and hopefully motivate a different set of intuitions.

What a Tangled Net We Weave When First We Practice to Believe

Here’s where Bayes nets & causal networks (seen previously on LW & Michael Nielsen) come up. When networks are inferred on real-world data, they often start to look pretty gnarly: tons of nodes, tons of arrows pointing all over the place. Daphne Koller early on in her Probabilistic Graphical Models course shows an example from a medical setting where the network has like 600 nodes and you can’t understand it at all. When you look at a biological causal network like metabolism:

A Toolkit Supporting Formal Reasoning about Causality in Metabolic Networks
A Toolkit Supporting Formal Reasoning about Causality in Metabolic Networks

You start to appreciate how everything might be correlated with everything, but (usually) not cause each other.

This is not too surprising if you step back and think about it: life is complicated, we have limited resources, and everything has a lot of moving parts. (How many discrete parts does an airplane have? Or your car? Or a single cell? Or think about a chess player analyzing a position: if my bishop goes there, then the other pawn can go here, which opens up a move there or here, but of course, they could also do that or try an en passant in which case I’ll be down in material but up on initiative in the center, which causes an overall shift in tempo…) Fortunately, these networks are still simple compared to what they could be, since most nodes aren’t directly connected to each other, which tamps down on the combinatorial explosion of possible networks. (How many different causal networks are possible if you have 600 nodes to play with? The exact answer is complicated but it’s much larger than 2600 - so very large!)

One interesting thing I managed to learn from PGM (before concluding it was too hard for me and I should try it later) was that in a Bayes net even if two nodes were not in a simple direct correlation relationship A→B, you could still learn a lot about A from setting B to a value, even if the two nodes were way across the network from each other. You could trace the influence flowing up and down the pathways to some surprisingly distant places if there weren’t any blockers.

The bigger the network, the more possible combinations of nodes to look for a pairwise correlation between them (eg If there are 10 nodes/variables and you are looking at bivariate correlations, then you have 10 choose 2 = 45 possible comparisons, and with 20, 190, and 40, 780. 40 variables is not that much for many real-world problems.) A lot of these combos will yield some sort of correlation. But does the number of causal relationships go up as fast? I don’t think so (although I can’t prove it).

If not, then as causal networks get bigger, the number of genuine correlations will explode but the number of genuine causal relationships will increase slower, and so the fraction of correlations which are also causal will collapse.

(Or more concretely: suppose you generated a randomly connected causal network with x nodes and y arrows perhaps using the algorithm in Kuipers & Moffa 2012, where each arrow has some random noise in it; count how many pairs of nodes are in a causal relationship; now, n times initialize the root nodes to random values and generate a possible state of the network & storing the values for each node; count how many pairwise correlations there are between all the nodes using the n samples (using an appropriate significance test & alpha if one wants); divide # of causal relationships by # of correlations, store; return to the beginning and resume with x+1 nodes and y+1 arrows… As one graphs each value of x against its respective estimated fraction, does the fraction head toward 0 as x increases? My thesis is it does. Or, since there must be at least as many causal relationships in a graph as there are arrows, you could simply use that as an upper bound on the fraction.)

It turns out, we weren’t supposed to be reasoning there are 3 categories of possible relationships, so we start with 33%, but rather: there is only one explanation A causes B, only one explanation B causes A, but there are many explanations of the form C1 causes A and B, C2 causes A and B, C3 causes A and B, and the more nodes in a field’s true causal networks (psychology or biology vs physics, say), the bigger this last category will be.

The real world is the largest of causal networks, so it is unsurprising that most correlations are not causal, even after we clamp down our data collection to narrow domains. Hence, our prior for A causes B is not 50% (it’s either true or false) nor is it 33% (either A causes B, B causes A, or mutual cause C) but something much smaller: the number of causal relationships divided by the number of pairwise correlations for a graph, which ratio can be roughly estimated on a field-by-field basis by looking at existing work or directly for a particular problem (perhaps one could derive the fraction based on the properties of the smallest inferrable graph that fits large datasets in that field). And since the larger a correlation relative to the usual correlations for a field, the more likely the two nodes are to be close in the causal network and hence more likely to be joined causally, one could even give causality estimates based on the size of a correlation (eg. an r=0.9 leaves less room for confounding than an r of 0.1, but how much will depend on the causal network).

This is exactly what we see. How do you treat cancer? Thousands of treatments get tried before one works. How do you deal with poverty? Most programs are not even wrong. Or how do you fix societal woes in general? Most attempts fail miserably and the higher-quality your studies, the worse attempts look (leading to Rossi’s Metallic Rules). This even explains why everything correlates with everything and Andrew Gelman’s dictum about how coefficients are never zero: the reason large datasets find most of their variables to have non-zero correlations (often reaching statistical-significance) is because the data is being drawn from large complicated causal networks in which almost everything really is correlated with everything else.

And thus I was enlightened.

Comment

Since I know so little about causal modeling, I asked our local causal researcher Ilya Shpitser to maybe leave a comment about whether the above was trivially wrong / already-proven / well-known folklore / etc; for convenience, I’ll excerpt the core of his comment:

But does the number of causal relationships go up just as fast? I don’t think so (although at the moment I can’t prove it).

I am not sure exactly what you mean, but I can think of a formalization where this is not hard to show. We say A structurally causes B in a DAG G if and only if there is a directed path from A to B in G. We say A is structurally dependent with B in a DAG G if and only if there is a marginal d-connecting path from A to B in G.

A marginal d-connecting path between two nodes is a path with no consecutive edges of the form * -> * <- * (that is, no colliders on the path). In other words all directed paths are marginal d-connecting but the opposite isn’t true.

The justification for this definition is that if A structurally causes B in a DAG G, then if we were to intervene on A, we would observe B change (but not vice versa) in most distributions that arise from causal structures consistent with G. Similarly, if A and B are structurally dependent in a DAG G, then in most distributions consistent with G, A and B would be marginally dependent (e.g. what you probably mean when you say correlations are there).

I qualify with most because we cannot simultaneously represent dependences and independences by a graph, so we have to choose. People have chosen to represent independences. That is, if in a DAG G some arrow is missing, then in any distribution (causal structure) consistent with G, there is some sort of independence (missing effect). But if the arrow is not missing we cannot say anything. Maybe there is dependence, maybe there is independence. An arrow may be present in G, and there may still be independence in a distribution consistent with G. We call such distributions unfaithful to G. If we pick distributions consistent with G randomly, we are unlikely to hit on unfaithful ones (subset of all distributions consistent with G that is unfaithful to G has measure zero), but Nature does not pick randomly.. so unfaithful distributions are a worry. They may arise for systematic reasons (maybe equilibrium of a feedback process in bio?)

If you accept above definition, then clearly for a DAG with n vertices, the number of pairwise structural dependence relationships is an upper bound on the number of pairwise structural causal relationships. I am not aware of anyone having worked out the exact combinatorics here, but it’s clear there are many many more paths for structural dependence than paths for structural causality.


But what you actually want is not a DAG with n vertices, but another type of graph with n vertices. The Universe DAG has a lot of vertices, but what we actually observe is a very small subset of these vertices, and we marginalize over the rest. The trouble is, if you start with a distribution that is consistent with a DAG, and you marginalize over some things, you may end up with a distribution that isn’t well represented by a DAG. Or DAG models aren’t closed under marginalization.

That is, if our DAG is A -> B <- H -> C <- D, and we marginalize over H because we do not observe H, what we get is a distribution where no DAG can properly represent all conditional independences. We need another kind of graph.

In fact, people have come up with a mixed graph (containing -> arrows and <-> arrows) to represent margins of DAGs. Here -> means the same as in a causal DAG, but <-> means there is some sort of common cause/confounder that we don’t want to explicitly write down. Note: <-> is not a correlative arrow, it is still encoding something causal (the presence of a hidden common cause or causes). I am being loose here – in fact it is the absence of arrows that means things, not the presence.

I do a lot of work on these kinds of graphs, because these are graphs are the sensible representation of data we typically get – drawn from a marginal of a joint distribution consistent with a big unknown DAG.

But the combinatorics work out the same in these graphs – the number of marginal d-connected paths is much bigger than the number of directed paths. This is probably the source of your intuition. Of course what often happens is you do have a (weak) causal link between A and B, but a much stronger non-causal link between A and B through an unobserved common parent. So the causal link is hard to find without tricks.

Heuristics & Biases

Now assuming the foregoing to be right (which I’m not sure about; in particular, I’m dubious that correlations in causal nets really do increase much faster than causal relations do), what’s the psychology of this? I see a few major ways that people might be incorrectly reasoning when they overestimate the evidence given by a correlation:

  • they might be aware of the imbalance between correlations and causation, but underestimate how much more common correlation becomes compared to causation.

    This could be shown by giving causal diagrams and seeing how elicited probability changes with the size of the diagrams: if the probability is constant, then the subjects would seem to be considering the relationship in isolation and ignoring the context.

    It might be remediable by showing a network and jarring people out of a simplistic comparison approach.
  • they might not be reasoning in a causal-net framework at all, but starting from the naive 33% base-rate you get when you treat all 3 kinds of causal relationships equally.

    This could be shown by eliciting estimates and seeing whether the estimates tend to look like base rates of 33% and modifications thereof.

    Sterner measures might be needed: could we draw causal nets with not just arrows showing influence but also another kind of arrow showing correlations? For example, the arrows could be drawn in black, inverse correlations drawn in red, and regular correlations drawn in green. The picture would be rather messy, but simply by comparing how few black arrows there are to how many green and red ones, it might visually make the case that correlation is much more common than causation.
  • alternately, they may really be reasoning causally and suffer from a truly deep & persistent cognitive illusion that when people say correlation it’s really a kind of causation and don’t understand the technical meaning of correlation in the first place (which is not as unlikely as it may sound, given examples like David Hestenes’s demonstration of the persistence of Aristotelian folk-physics in physics students as all they had learned was guessing passwords; on the test used, see eg Halloun & Hestenes 1985 & Hestenes et al 1992); in which cause it’s not surprising that if they think they’ve been told a relationship is causation, then they’ll think the relationship is causation. Ilya remarks:

    Pearl has this hypothesis that a lot of probabilistic fallacies/paradoxes/biases are due to the fact that causal and not probabilistic relationships are what our brain natively thinks about. So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of interventional distribution (no Simpson’s type reversal under interventions: Understanding Simpson’s Paradox, Pearl 2014 [see also Pearl’s comments on Nielsen’s blog)).

    This hypothesis would claim that people who haven’t looked into the math just interpret statements about conditional probabilities as about interventional probabilities (or whatever their intuitive analogue of a causal thing is).

    This might be testable by trying to identify simple examples where the two approaches diverge, similar to Hestenes’s quiz for diagnosing belief in folk-physics.

Appendix

Everything correlates with everything

In statistical folklore, there is an idea which circulates under a number of expressions such as: everything is correlated, everything is related to everything else, crud factor, the null hypothesis is always false, the Anna Karenina principle, & coefficients are never zero.

The core idea here is that in any real-world dataset, it is exceptionally unlikely that any particular relationship will be exactly 0 for reasons of arithmetic (eg it may be impossible for a binary variable to be an equal percentage in 2 unbalanced groups); prior probability (0 is only one number out of the infinite reals); and because real-world properties & traits are linked by a myriad of causal networks, dynamics, & latent variables (eg the genetic correlations which affect a wide variety of important human traits) which mutually affect each other which will produce genuine correlations between apparently-independent variables, and these correlations may be of surprisingly large & important size. These reasons are unaffected by sample size and are not simply due to small n. The claim is generally backed up by personal experience and reasoning, although in a few instances like Meehl large datasets are mentioned in which almost all variables are correlated at very high levels of statistical-significance.

This claim has several implications. The most commonly mentioned, and the apparent motivation for early discussions, is that in the null-hypothesis significance-testing paradigm dominant in psychology and many sciences, any null-hypothesis of 0 is known - in advance - to already be false and so it will inevitably be rejected as soon as data collection permits. This renders the meaning of significance-testing unclear. (Better null-hypotheses, such as >0 or <0, are also problematic since if the true value of a parameter is never 0 then one’s theories have at least a 50-50 chance of guessing the right direction and so correct predictions of the sign count for little.)

Below I have compiled excerpts of some relevant references in chronological order. Additional citations are welcome.

Hodges & Lehmann 1954

Testing the approximate validity of statistical hypotheses, Hodges & Lehmann 1954:

When testing statistical hypotheses, we usually do not wish to take the action of rejection unless the hypothesis being tested is false to an extent sufficient to matter. For example, we may formulate the hypothesis that a population is normally distributed, but we realize that no natural population is ever exactly normal. We would want to reject normality only if the departure of the actual distribution from the normal form were great enough to be material for our investigation. Again, when we formulate the hypothesis that the sex ratio is the same in two populations, we do not really believe that it could be exactly the same, and would only wish to reject equality if they are sufficiently different. Further examples of the phenomenon will occur to the reader.

Savage 1954

The Foundations of Statistics, 2nd edition, Savage 1972 (the 1954 first edition is hard to obtain and appears to be much the same):

The development of the theory of testing has been much influenced by the special problem of simple dichotomy, that is, testing problems in which H0 and Hi have exactly one element each. Simple dichotomy is susceptible of neat and full analysis (as in Exercise 7.5.2 and in § 14.4), likelihood-ratio tests here being the only admissible tests; and simple dichotomy often gives insight into more complicated problems, though the point is not explicitly illustrated in this book. Coin and ball examples of simple dichotomy are easy to construct, but instances seem rare in real life. The astronomical observations made to distinguish between the Newtonian and Einsteinian hypotheses are a good, but not perfect, example, and I suppose that research in Mendelian genetics sometimes leads to others. There is, however, a tradition of applying the concept of simple dichotomy to some situations to which it is, to say the best, only crudely adapted. Consider, for example, the decision problem of a person who must buy, f0, or refuse to buy, fi, a lot of manufactured articles on the basis of an observation x. Suppose that i is the difference between the value of the lot to the person and the price at which the lot is offered for sale, and that P(x  i) is known to the person. Clearly, H0, Hi, and N are sets characterized respectively by i > 0, i < 0, i = 0. This analysis of this, and similar, problems has recently been explored in terms of the minimax rule, for example by Sprowls [S16] and a little more fully by Rudy [R4], and by Allen [A3]. It seems to me natural and promising for many fields of application, but it is not a traditional analysis. On the contrary, much literature recommends, in effect, that the person pretend that only two values of i, to > 0 and i < 0, are possible and that the person then choose a test for the resulting simple dichotomy. The selection of the two values i0 and %i is left to the person, though they are sometimes supposed to correspond to the person’s judgment of what constitutes good quality and poor quality-terms really quite without definition. The emphasis on simple dichotomy is tempered in some acceptance-sampling literature, where it is recommended that the person choose among available tests by some largely unspecified overall consideration of operating characteristics and costs, and that he facilitate his survey of the available tests by focusing on a pair of points that happen to interest him and considering the test whose operating characteristic passes (economically, in the case of sequential testing) through the pair of points. These traditional analyses are certainly inferior in the theoretical framework of the present discussion, and I think they will be found inferior in practice.

…I turn now to a different and, at least for me, delicate topic in connection with applications of the theory of testing. Much attention is given in the literature of statistics to what purport to be tests of hypotheses, in which the null hypothesis is such that it would not really be accepted by anyone. The following three propositions, though playful in content, are typical in form of these extreme null hypotheses, as I shall call them for the moment.

A. The mean noise output of the cereal Krakl is a linear function of the atmospheric pressure, in the range from 900 to 1,100 millibars. B. The basal metabolic consumption of sperm whales is normally distributed [Wll]. C New York taxi drivers of Irish, Jewish, and Scandinavian extraction are equally proficient in abusive language.

Literally to test such hypotheses as these is preposterous. If, for example, the loss associated with fx is zero, except in case Hypothesis A is exactly satisfied, what possible experience with Krakl could dissuade you from adopting fi?

The unacceptability of extreme null hypotheses is perfectly well known; it is closely related to the often heard maxim that science disproves, but never proves, hypotheses. The role of extreme hypotheses in science and other statistical activities seems to be important but obscure. In particular, though I, like everyone who practices statistics, have often tested extreme hypotheses, I cannot give a very satisfactory analysis of the process, nor say clearly how it is related to testing as defined in this chapter and other theoretical discussions. None the less, it seems worth while to explore the subject tentatively; I will do so largely in terms of two examples.

Consider first the problem of a cereal dynamicist who must estimate the noise output of Krakl at each of ten atmospheric pressures between 900 and 1,100 millibars. It may well be that he can properly regard the problem as that of estimating the ten parameters in question, in which case there is no question of testing. But suppose, for example, that one or both of the following considerations apply. First, the engineer and his colleagues may attach considerable personal probability to the possibility that A is very nearly satisfied - very nearly, that is, in terms of the dispersion of his measurements. Second, the administrative, computational, and other incidental costs of using ten individual estimates might be considerably greater than that of using a linear formula.

It might be impractical to deal with either of these considerations very rigorously. One rough attack is for the engineer first to examine the observed data x and then to proceed either as though he actually believed Hypothesis A or else in some other way. The other way might be to make the estimate according to the objectivistic formulae that would have been used had there been no complicating considerations, or it might take into account different but related complicating considerations not explicitly mentioned here, such as the advantage of using a quadratic approximation. It is artificial and inadequate to regard this decision between one class of basic acts or another as a test, but that is what in current practice we seem to do. The choice of which test to adopt in such a context is at least partly motivated by the vague idea that the test should readily accept, that is, result in acting as though the extreme null hypotheses were true, in the farfetched case that the null hypothesis is indeed true, and that the worse the approximation of the null hypotheses to the truth the less probable should be the acceptance.

The method just outlined is crude, to say the best. It is often modified in accordance with common sense, especially so far as the second consideration is concerned. Thus, if the measurements are sufficiently precise, no ordinary test might accept the null hypotheses, for the experiment will lead to a clear and sure idea of just what the departures from the null hypotheses actually are. But, if the engineer considers those departures unimportant for the context at hand, he will justifiably decide to neglect them.

Rejection of an extreme null hypothesis, in the sense of the foregoing discussion, typically gives rise to a complicated subsidiary decision problem. Some aspects of this situation have recently been explored, for example by Paulson [P3], [P4]; Duncan [Dll], [D12]; Tukey [T4], [T5]; Schefte [S7]; and W. D. Fisher [F7].

Savage 1957

Nonparametric statistics, Savage 1957:

Siegel does not explain why his interest is confined to tests of significance; to make measurements and then ignore their magnitudes would ordinarily be pointless. Exclusive reliance on tests of significance obscures the fact that statistical significance does not imply substantive significance. The tests given by Siegel apply only to null hypotheses of no difference. In research, however, null hypotheses of the form Population A has a median at least five units larger than the median of Population B arise. Null hypotheses of no difference are usually known to be false before the data are collected [9, p. 42; 48, pp. 384-8]; when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science.

  • TODO 9: R.A. Fisher Statistical Methods and Scientific Inference, New York: Hafner Publishing Co., 1956 [pg42]
  • TODO 48: Wallis and Roberts, Harry V. Statistics: A New Approach, Glencoe, Ill.: The Free Press, 1956 [pg 384-8]

Nunnally 1960

The place of statistics in psychology, Nunnally 1960:

The most misused and misconceived hypothesis-testing model employed in psychology is referred to as the null-hypothesis model. Stating it crudely, one null hypothesis would be that two treatments do not produce different mean effects in the long run. Using the obtained means and sample estimates ofpopulation variances, probability statements can be made about the acceptance or rejection of the null hypothesis. Similar null hypotheses are applied to correlations, complex experimental designs, factor-analytic results, and most all experimental results.

Although from a mathematical point of view the null-hypothesis models are internally neat, they share a crippling flaw: in the real world the null hypothesis is almost never true, and it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis. This is a personal point of view, and it cannot be proved directly. However, it is supported both by common sense and by practical experience. The common-sense argument is that different psychological treatments will almost always (in the long run) produce differences in mean effects, even though the differences may be very small. Also, just as nature abhors a vacuum, it probably abhors zero correlations between variables.

…Experience shows that when large numbers of subjects are used in studies, nearly all comparisons of means are significantly different and all correlations are significantly different from zero. The author once had occasion to use 700 subjects in a study of public opinion. After a factor analysis of the results, the factors were correlated with individual-difference variables such as amount of education, age, income, sex, and others. In looking at the results I was happy to find so many significant correlations (under the null-hypothesis model)-indeed, nearly all correlations were significant, including ones that made little sense. Of course, with an N of 700 correlations as large as .08 are beyond the .05 level. Many of the significant correlations were of no theoretical or practical importance.

The point of view taken here is that if the null hypothesis is not rejected, it usually is because the N is too small. If enough data is gathered, the hypothesis will generally be rejected. If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data.

…Statisticians are not to blame for the misconceptions in psychology about the use of statistical methods. They have warned us about the use of the hypothesis-testing models and the related concepts. In particular they have criticized the null-hypothesis model and have recommended alternative procedures similar to those recommended here (See Savage, 1957; Tukey, 1954; and Yates, 1951).

  • TODO Tukey, J. W. Unsolved Problems of Experimental Statistics. Journal of the American Statistical Association, XLIX (1954), 710.
  • TODO Yates, F. The Influence of Statistical Methods for Research Workers on the Development of the Science of Statistics. Journal of the American Statistical Association, XLVI (1951), 32-33.

Smith 1960

Review of N. T. J. Bailey, Statistical methods in biology, Smith 1960:

However, it is interesting to look at this book from another angle. here we have set before us with great clarity a panorama of modern statistical methods, as used in biology, medicine, physical science, social and mental science, and industry. How far does this show that these methods fulfil their aims of analysing the data reliably, and how many gaps are there still in our knowledge?…One feature which can puzzle an outsider, and which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally, but a statistician may say, quite reasonably, that he wishes to test whether there is an appreciable difference between the effects of the two drugs, or an appreciable curvature in the regression line. But this raises at once the question: how large is appreciable? Or in other words, are we not really concerned with some kind of estimation, rather than significance?

Edwards 1963

Bayesian statistical inference for psychological research, Edwards et al 1963:

The most popular notion of a test is, roughly, a tentative decision between two hypotheses on the basis of data, and this is the notion that will dominate the present treatment of tests. Some qualification is needed if only because, in typical applications, one of the hypotheses-the null hypothesis-is known by all concerned to be false from the outset (Berkson, 1938; Hodges & Lehmann, 1954; Lehmann, 19598; I. R. Savage, 1957; L. J. Savage, 1954, p. 254); some ways of resolving the seeming absurdity will later be pointed out, and at least one of them will be important for us here…Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data; our list of situations that might stimulate hypothesis tests earlier in the section included several examples. Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. In such cases, the more the procedure is against the null hypothesis, the better. The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses.

Bakan 1966

The test of significance in psychological research, Bakan 1966:

Let us consider some of the difficulties associated with the null hypothesis.

  1. The _a priori- reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low. Nunnally (1960) has reported a similar experience involving correlation coefficients on 700 subjects. Joseph Berkson (1938) made the observation almost 30 years in connection with chi-square:

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large-for instance, on an order of 200,000-the chi-square P will be small beyond any usual limit of significance.

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

If this be so, then we have something here that is apt to trouble the conscience of a reflective statistician using the chi-square test. For I suppose it would be agreed by statisticians that a large sample is always better than a small sample. If, then, we know in advance the P that will result from an application of a chi-square test to a large sample, there would seem to be no use in doing it on a smaller one. But since the result of the former test is known, it is no test at all [pp. 526-527].

As one group of authors has put it, in typical applications . . . the null hypothesis . . . is known by all concerned to be false from the outset [Edwards et al, 1963, p. 214]. The fact of the matter is that there is really no good reason to expect the null hypothesis to be true in any population. Why should the mean, say, of all scores east of the Mississippi be identical to all scores west of the Mississippi? Why should any correlation coefficient be exactly .00 in the population? Why should we expect the ratio of males to females be exactly 50:50 in any population? Or why should different drugs have exactly the same effect on any population parameter (Smith, 1960)? A glance at any set of statistics on total populations will quickly confirm the rarity of the null hypothesis in nature.

…Should there be any deviation from the null hypothesis in the population, no matter how small-and we have little doubt but that such a deviation usually exists-a sufficiently large number of observations will lead to the rejection of the null hypothesis. As Nunnally (1960) put it,

if the null hypothesis is not rejected, it is usually because the N is too small. If enough data are gathered, the hypothesis will generally be rejected. If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data [p. 643].

Meehl 1967

Theory-testing in psychology and physics: A methodological paradox, Meehl 1967

One reason why the directional null hypothesis (H 02 : μ g ≤ μ b ) is the appropriate candidate for experimental refutation is the universal agreement that the old point-null hypothesis (H 0 : μ g = μ b ) is [quasi-] always false in biological and social science. Any dependent variable of interest, such as I.Q., or academic achievement, or perceptual speed, or emotional reactivity as measured by skin resistance, or whatever, depends mainly upon a finite number of strong variables characteristic of the organisms studied (embodying the accumulated results of their genetic makeup and their learning histories) plus the influences manipulated by the experimenter. Upon some complicated, unknown mathematical function of this finite list of important determiners is then superimposed an indefinitely large number of essentially random factors which contribute to the intragroup variation and therefore boost the error term of the statistical significance test. In order for two groups which differ in some identified properties (such as social class, intelligence, diagnosis, racial or religious background) to differ not at all in the output variable of interest, it would be necessary that all determiners of the output variable have precisely the same average values in both groups, or else that their values should differ by a pattern of amounts of difference which precisely counterbalance one another to yield a net difference of zero. Now our general background knowledge in the social sciences, or, for that matter, even common sense considerations, makes such an exact equality of all determining variables, or a precise accidental counterbalanceing of them, so extremely unlikely that no psychologist or statistician would assign more than a negligibly small probability to such a state of affairs.

Example: Suppose we are studying a simple perceptual-verbal task like rate of color-naming in school children, and the independent variable is father’s religious preference. Superficial consideration might suggest that these two variables would not be related, but a little thought leads one to conclude that they will almost certainly be related by some amount, however small. Consider, for instance, that a child’s reaction to any sort of school-context task will be to some extent dependent upon his social class, since the desire to please academic personnel and the desire to achieve at a performance (just because it is a task, regardless of its intrinsic interest) are both related to the kinds of sub-cultural and personality traits in the parents that lead to upward mobility, economic success, the gaining of further education, and the like. Again, since there is known to be a sex difference in colornaming, it is likely that fathers who have entered occupations more attractive to feminine males will (on the average) provide a somewhat more feminine fatherfigure for identification on the part of their male offspring, and that a more refined color vocabulary, making closer discriminations between similar hues, will be characteristic of the ordinary language of such a household. Further, it is known that there is a correlation between a child’s general intelligence and its father’s occupation, and of course there will be some relation, even though it may be small, between a child’s general intelligence and his color vocabulary, arising from the fact that vocabulary in general is heavily saturated with the general intelligence factor. Since religious preference is a correlate of social class, all of these social class factors, as well as the intelligence variable, would tend to influence color-naming performance. Or consider a more extreme and faint kind of relationship. It is quite conceivable that a child who belongs to a more liturgical religious denomination would be somewhat more color-oriented than a child for whom bright colors were not associated with the religious life. Everyone familiar with psychological research knows that numerous puzzling, unexpected correlations pop up all the time, and that it requires only a moderate amount of motivation-plus-ingenuity to construct very plausible alternative theoretical explanations for them.

…These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan 1966 and Nunnally 1960. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or nonmonotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6 ).

…Considering the fact that everything in the brain is connected with everything else, and that there exist several general state-variables (such as arousal, attention, anxiety, and the like) which are known to be at least slightly influenceable by practically any kind of stimulus input, it is highly unlikely that any psychologically discriminable stimulation which we apply to an experimental subject would exert literally zero effect upon any aspect of his performance. The psychological literature abounds with examples of small but detectable influences of this kind. Thus it is known that if a subject memorizes a list of nonsense syllables in the presence of a faint odor of peppermint, his recall will be facilitated by the presence of that odor. Or, again, we know that individuals solving intellectual problems in a messy room do not perform quite as well as individuals working in a neat, well-ordered surround. Again, cognitive processes undergo a detectable facilitation when the thinking subject is concurrently performing the irrelevant, noncognitive task of squeezing a hand dynamometer. It would require considerable ingenuity to concoct experimental manipulations, except the most minimal and trivial (such as a very slight modification in the word order of instructions given a subject) where one could have confidence that the manipulation would be utterly without effect upon the subject’s motivational level, attention, arousal, fear of failure, achievement drive, desire to please the experimenter, distraction, social fear, etc., etc. So that, for example, while there is no very interesting psychological theory that links hunger drive with color-naming ability, I myself would confidently predict a significant difference in color-naming ability between persons tested after a full meal and persons who had not eaten for 10 hours, provided the sample size were sufficiently large and the color-naming measurements sufficiently reliable, since one of the effects of the increased hunger drive is heightened arousal, and anything which heightens arousal would be expected to affect a perceptual-cognitive performance like color-naming. Suffice it to say that there are very good reasons for expecting at least some slight influence of almost any experimental manipulation which would differ sufficiently in its form and content from the manipulation imposed upon a control group to be included in an experiment in the first place. In what follows I shall therefore assume that the point-null hypothesis H 0 is, in psychology, [quasi-] always false.

Lykken 1968

Statistical Significance in Psychological Research, Lykken 1968

Most theories in the areas of personality, clinical, and social psychology predict no more than the direction of a correlation, group difference, or treatment effect. Since the null hypothesis is never strictly true, such predictions have about a 50-50 chance of being confirmed by experiment when the theory in question is false, since the statistical significance of the result is a function of the sample size.

…Most psychological experiments are of three kinds: (a) studies of the effect of some treatment on some output variables, which can be regarded as a special case of (b) studies of the difference between two or more groups of individuals with respect to some variable, which in turn are a special case of (c) the study of the relationship or correlation between two or more variables within some specified population. Using the bivariate correlation design as paradigmatic, then, one notes first that the strict null hypothesis must always be assumed to be false (this idea is not new and has recently been illuminated by Baken, 1966). Unless one of the variables is wholly unreliable so that the values obtained are strictly random, it would be foolish to suppose that the correlation between any two variables is identically equal to 0.0000 . . . (or that the effect of some treatment or the difference between two groups is exactly zero). The molar dependent variables employed in psychological research are extremely complicated in the sense that the measured value of such a variable tends to be affected by the interaction of a vast number of factors, both in the present situation and in the history of the subject organism. It is exceedingly unlikely that any two such variables will not share at least some of these factors and equally unlikely that their effects will exactly cancel one another out.

It might be argued that the more complex the variables the smaller their average correlation ought to be since a larger pool of common factors allows more chance for mutual cancellation of effects in obedience to the Law of Large Numbers. However, one knows of a number of unusually potent and pervasive factors which operate to unbalance such convenient symmetries and to produce correlations large enough to rival the effects of whatever causal factors the experimenter may have had in mind. Thus, we know that (a) good psychological and physical variables tend to be positively correlated; (6) experimenters, without deliberate intention, can somehow subtly bias their findings in the expected direction (Rosenthal, 1963); (c) the effects of common method are often as strong as or stronger than those produced by the actual variables of interest (e.g., in a large and careful study of the factorial structure of adjustment to stress among officer candidates, Holtzman & Bitterman, 1956, found that their 101 original variables contained five main common factors representing, respectively, their rating scales, their perceptual-motor tests, the McKinney Reporting Test, their GSR variables, and the MMPI); (d) transitory state variables such as the subject’s anxiety level, fatigue, or his desire to please, may broadly affect all measures obtained in a single experimental session. This average shared variance of unrelated variables can be thought of as a kind of ambient noise level characteristic of the domain. It would be interesting to obtain empirical estimates of this quantity in our field to serve as a kind of Plimsoll mark against which to compare obtained relationships predicted by some theory under test. If, as I think, it is not unreasonable to suppose that unrelated molar psychological variables share on the average about 4% to 5% of common variance, then the expected correlation between any such variables would be about .20 in absolute value and the expected difference between any two groups on some such variable would be nearly 0.5 standard deviation units. (Note that these estimates assume zero measurement error. One can better explain the near-zero correlations often observed in psychological research in terms of unreliability of measures than in terms of the assumption that the true scores are in fact unrelated.)

Hays 1973

Statistics for the social sciences (2nd edition), Hays 1973; chapter 10, page 413-417:

10.19: Testmanship, or how big is a difference?

…As we saw in Chapter 4, the complete absence of a statistical relation, or no association, occurs only when the conditional distribution of the dependent variable is the same regardless of which treatment is administered. Thus if the independent variable is not associated at all with the dependent variable the population distributions must be identical over the treatments. If, on the other hand, the means of the different treatment populations are different, the conditional distributions themselves must be different and the independent and dependent variables must be associated. The rejection of the hypothesis of no difference between population means is tantamount to the assertion that the treatment given does have some statistical association with the dependent variable score.

…However, the occurrence of a significant result says nothing at all about the strength of the association between treatment and score. A significant result leads to the inference that some association exists, but in no sense does this mean that an important degree of association necessarily exists. Conversely, evidence of a strong statistical association can occur in data even when the results are not significant. The game of inferring the true degree of statistical association has a joker: this is the sample size. The time has come to define the notion of the strength of a statistical association more sharply, and to link this idea with that of the true difference between population means.

. When does it seem appropriate to say that a strong association exists between the experimental factor XX and the dependent variable YY? Over all of the different possibilities for XX there is a probability distribution of YY values, which is the marginal distribution of YY over (x,y)(x,y) events. The existence of this distribution implies that we do not know exactly what the YY value for any observation will be; we are always uncertain about YY to some extent. However, given any particular XX, there is also a conditional distribution of YY, and it may be that in this conditional distribution the highly probable values of YY tend to shrink within a much narrower range than in the marginal distribution. If so, we can say that the information about XX tends to reduce uncertainty about YY. In general we will say that the strength of a statistical relation is reflected by the extent to which knowing XX reduces uncertainty about YY. One of the best indicators of our uncertainty about the value of a variable is σ2\sigma^2, the variance of its distribution…This index reflects the predictive power afforded by a relationship: when w2w^2 is zero, then XX does not aid us at all in predicting the value of YY. On the other hand, when w2w^2 is 1.00, this tells us that XX lets us know YY exactly…About now you should be wondering what the index w2w^2 has to do with the difference between population means.

…When the difference u1u2u_1 - u_2 is zero, then w2w^2 must be zero. In the usual tt test for a difference, the hypothesis of no difference between means is equivalent to the hypothesis that w2=0w^2 = 0. On the other hand, when there is any difference at all between population means, the value of w2w^2 must be greater than 0. In short, a true difference is big in the sense of predictive power only if the square of that difference is large relative to σY2\sigma^2_Y. However, in significance tests such as tt, we compare the difference we get with an estimate of σdiff\sigma_{diff}. The standard error of the difference can be made almost as small as we choose if we are given a free choice of sample size. Unless sample size is specified, there is no necessary connection between significance and the true strength of association.

This points up the fallacy of evaluating the goodness of a result in terms of statistical significance alone, without allowing for the sample size used. All significant results do not imply the same degree of true association between independent and dependent variables.

It is sad but true that researchers have been known to capitalize on this fact. There is a certain amount of testmanship involved in using inferential statistics. Virtually any study can be made to show significant results if one uses enough subjects, regardless of how nonsensical the content may be. There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero. If one applies a large enough sample of the study of any relation, trivial or meaningless as it may be, sooner or later he is almost certain to achieve a significant result. Such a result may be a valid finding, but only in the sense that one can say with assurance that some association is not exactly zero. The degree to which such a finding enhances our knowledge is debatable. If the criterion of strength of association is applied to such a result, it becomes obvious that little or nothing is actually contributed to our ability to predict one thing from another.

For example, suppose that two methods of teaching first grade children to read are being compared. A random sample of 1000 children are taught to read by method I, another sample of 1000 children by method II. The results of the instruction are evaluated by a test that provides a score, in whole units, for each child. Suppose that the results turned out as follows:

Method I Method II
M1=147.21M_1 = 147.21 M2=147.64M_2 = 147.64
s12=10s^2_1 = 10 s22=11s^2_2 = 11
N1=1000N_1 = 1000 N2=1000N_2 = 1000

Then, the estimated standard error of the difference is about .145, and the zz value is

z=147.21147.64.145=2.96.z = \frac{147.21 - 147.64}{.145} = -2.96.

This certainly permits rejection of the null hypothesis of no difference between the groups. However, does it really tell us very much about what to expect of an individual child’s score on the test, given the information that he was taught by method I or method II? If we look at the group of children taught by method II, and assume that the distribution of their scores is approximately normal, we find that about 45 percent of these children fall below the mean score for children in group I. Similarly, about 45 percent of children in group I fall above the mean score for group II. Although the difference between the two groups is significant, the two groups actually overlap a great deal in terms of their performances on the test. In this sense, the two groups are really not very different at all, even though the difference between the means is quite significant in a purely statistical sense.

Putting the matter in a slightly different way, we note that the grand mean of the two groups is 147.425. Thus, our best bet about the score of any child, not knowing the method of his training, is 147.425. If we guessed that any child drawn at random from the combined group should have a score above 147.425, we should be wrong about half the time. However, among the original groups, according to method I and method II, the proportions falling above and below this grand mean are approximately as follows:

Below 147.425 Above 147.425
Method I .51 .49
Method II .49 .51

This implies that if we know a child is from group I, and we guess that this score is below the grand mean, then we will be wrong about 49 percent of the time. Similarly, if a child is from group II, and we guess his score to be above the grand mean, we will be wrong about 49 percent of the time. If we are not given the group to which the child belongs, ad we guess either above or below the grand mean, we will be wrong about 50 percent of the time. Knowing the group does reduce the probability of error in such a guess, but it does not reduce it very much. The method by which the child was trained simply doesn’t tell us a great deal about what the child’s score will be, even though the difference in mean scores is significant in the statistical sense.

This kind of testmanship flourishes best when people pay too much attention to the significance test and too little to the degree of statistical association the finding represents. This clutters up the literature with findings that are often not worth pursuing, and which serve only to obscure the really important predictive relations that occasionally appear. The serious scientist owes it to himself and his readers to ask not only, Is there any association between XX and YY? but also, How much does my finding suggest about the power to predict YY from XX? Much too much emphasis is paid to the former, at the expense of the latter, question.

Meehl 1990 (1)

Why summaries of research on psychological theories are often uninterpretable, Meehl 1990 (also discussed in Cohen’s 1994 paper The Earth is Round (p<.05)):

Problem 6. Crud factor: In the social sciences and arguably in the biological sciences, everything correlates to some extent with everything else. This truism, which I have found no competent psychologist disputes given five minutes reflection, does not apply to pure experimental studies in which attributes that the subjects bring with them are not the subject of study (except in so far as they appear as a source of error and hence in the denominator of a significance test). 6 There is nothing mysterious about the fact that in psychology and sociology everything correlates with everything. Any measured trait or attribute is some function of a list of partly known and mostly unknown causal factors in the genes and life history of the individual, and both genetic and environmental factors are known from tons of empirical research to be themselves correlated. To take an extreme case, suppose we construe the null hypothesis literally (objecting that we mean by it almost null gets ahead of the story, and destroys the rigor of the Fisherian mathematics!) and ask whether we expect males and females in Minnesota to be precisely equal in some arbitrary trait that has individual differences, say, color naming. In the case of color naming we could think of some obvious differences right off, but even if we didn’t know about them, what is the causal situation? If we write a causal equation (which is not the same as a regression equation for pure predictive purposes but which, if we had it, would serve better than the latter) so that the score of an individual male is some function (presumably nonlinear if we knew enough about it but here supposed linear for simplicity) of a rather long set of causal variables of genetic and environmental type X 1 , X 2 , … X m . These values are operated upon by regression coefficients b 1 , b 2 , …b m .

…Now we write a similar equation for the class of females. Can anyone suppose that the beta coefficients for the two sexes will be exactly the same? Can anyone imagine that the mean values of all of the Xs will be exactly the same for males and females, even if the culture were not still considerably sexist in child-rearing practices and the like? If the betas are not exactly the same for the two sexes, and the mean values of the Xs are not exactly the same, what kind of Leibnitzian preestablished harmony would we have to imagine in order for the mean color-naming score to come out exactly equal between males and females? It boggles the mind; it simply would never happen. As Einstein said, the Lord God is subtle, but He is not malicious. We cannot imagine that nature is out to fool us by this kind of delicate balancing. Anybody familiar with large scale research data takes it as a matter of course that when the N gets big enough she will not be looking for the statistically significant correlations but rather looking at their patterns, since almost all of them will be significant. In saying this, I am not going counter to what is stated by mathematical statisticians or psychologists with statistical expertise. For example, the standard psychologist’s textbook, the excellent treatment by Hays (1973, page 415 [Statistics for the social sciences. (2nd ed.) New York: Holt, Rinehart & Winston.]), explicitly states that, taken literally, the null hypothesis is always false.

Twenty years ago David Lykken and I conducted an exploratory study of the crud factor which we never published but I shall summarize it briefly here. (I offer it not as empirical proof - that H 0 taken literally is quasi-always false hardly needs proof and is generally admitted - but as a punchy and somewhat amusing example of an insufficiently appreciated truth about soft correlational psychology.) In 1966, the University of Minnesota Student Counseling Bureau’s Statewide Testing Program administered a questionnaire to 57,000 high school seniors, the items dealing with family facts, attitudes toward school, vocational and educational plans, leisure time activities, school organizations, etc. We cross-tabulated a total of 15 (and then 45) variables including the following (the number of categories for each variable given in parentheses): father’s occupation (7), father’s education (9), mother’s education (9), number of siblings (10), birth order (only, oldest, youngest, neither), educational plans after high school (3), family attitudes towards college (3), do you like school (3), sex (2), college choice (7), occupational plan in ten years (20), and religious preference (20). In addition, there were 22 leisure time activities such as acting, model building, cooking, etc., which could be treated either as a single 22-category variable or as 22 dichotomous variables. There were also 10 high school organizations such as school subject clubs, farm youth groups, political clubs, etc., which also could be treated either as a single ten-category variable or as ten dichotomous variables. Considering the latter two variables as multichotomies gives a total of 15 variables producing 105 different cross-tabulations. All values of χ 2 for these 105 cross-tabulations were statistically significant, and 101 (96%) of them were significant with a probability of less than 10^-6 .

…If leisure activity and high school organizations are considered as separate dichotomies, this gives a total of 45 variables and 990 different crosstabulations. Of these, 92% were statistically significant and more than 78% were significant with a probability less than 10 -6 . Looked at in another way, the median number of significant relationships between a given variable and all the others was 41 out of a possible 44!

We also computed MCAT scores by category for the following variables: number of siblings, birth order, sex, occupational plan, and religious preference. Highly significant deviations from chance allocation over categories were found for each of these variables. For example, the females score higher than the males; MCAT score steadily and markedly decreases with increasing numbers of siblings; eldest or only children are significantly brighter than youngest children; there are marked differences in MCAT scores between those who hope to become nurses and those who hope to become nurses aides, or between those planning to be farmers, engineers, teachers, or physicians; and there are substantial MCAT differences among the various religious groups. We also tabulated the five principal Protestant religious denominations (Baptist, Episcopal, Lutheran, Methodist, and Presbyterian) against all the other variables, finding highly significant relationships in most instances. For example, only children are nearly twice as likely to be Presbyterian than Baptist in Minnesota, more than half of the Episcopalians usually like school but only 45% of Lutherans do, 55% of Presbyterians feel that their grades reflect their abilities as compared to only 47% of Episcopalians, and Episcopalians are more likely to be male whereas Baptists are more likely to be female. Eighty-three percent of Baptist children said that they enjoyed dancing as compared to 68% of Lutheran children. More than twice the proportion of Episcopalians plan to attend an out of state college than is true for Baptists, Lutherans, or Methodists. The proportion of Methodists who plan to become conservationists is nearly twice that for Baptists, whereas the proportion of Baptists who plan to become receptionists is nearly twice that for Episcopalians.

In addition, we tabulated the four principal Lutheran Synods (Missouri, ALC, LCA, and Wisconsin) against the other variables, again finding highly significant relationships in most cases. Thus, 5.9% of Wisconsin Synod children have no siblings as compared to only 3.4% of Missouri Synod children. Fifty-eight percent of ALC Lutherans are involved in playing a musical instrument or singing as compared to 67% of Missouri Synod Lutherans. Eighty percent of Missouri Synod Lutherans belong to school or political clubs as compared to only 71% of LCA Lutherans. Forty-nine percent of ALC Lutherans belong to debate, dramatics, or musical organizations in high school as compared to only 40% of Missouri Synod Lutherans. Thirty-six percent of LCA Lutherans belong to organized non-school youth groups as compared to only 21% of Wisconsin Synod Lutherans. [Preceding text courtesy of D. T. Lykken.]

These relationships are not, I repeat, Type I errors. They are facts about the world, and with N = 57,000 they are pretty stable. Some are theoretically easy to explain, others more difficult, others completely baffling. The easy ones have multiple explanations, sometimes competing, usually not. Drawing theories from a pot and associating them whimsically with variable pairs would yield an impressive batch of H 0 -refuting confirmations.

Another amusing example is the behavior of the items in the 550 items of the MMPI pool with respect to sex. Only 60 items appear on the Mf scale, about the same number that were put into the pool with the hope that they would discriminate femininity. It turned out that over half the items in the scale were not put in the pool for that purpose, and of those that were, a bare majority did the job. Scale derivation was based on item analysis of a small group of criterion cases of male homosexual invert syndrome, a significant difference on a rather small N of Dr. Starke Hathaway’s private patients being then conjoined with the requirement of discriminating between male normals and female normals. When the N becomes very large as in the data published by Swenson, Pearson, and Osborne (1973; An MMPI Source Book: basic item, scale, and pattern data on 50,000 medical patients. Minneapolis, MN: University of Minnesota Press.), approximately 25,000 of each sex tested at the Mayo Clinic over a period of years, it turns out that 507 of the 550 items discriminate the sexes. Thus in a heterogeneous item pool we find only 8% of items failing to show a significant difference on the sex dichotomy. The following are sex-discriminators, the male/female differences ranging from a few percentage points to over 30%: 7

  • Sometimes when I am not feeling well I am cross.
  • I believe there is a Devil and a Hell in afterlife.
  • I think nearly anyone would tell a lie to keep out of trouble.
  • Most people make friends because friends are likely to be useful to them.
  • I like poetry.
  • I like to cook.
  • Policemen are usually honest.
  • I sometimes tease animals.
  • My hands and feet are usually warm enough.
  • I think Lincoln was greater than Washington.
  • I am certainly lacking in self-confidence.
  • Any man who is able and willing to work hard has a good chance of succeeding.

I invite the reader to guess which direction scores feminine. Given this information, I find some items easy to explain by one obvious theory, others have competing plausible explanations, still others are baffling.

Note that we are not dealing here with some source of statistical error (the occurrence of random sampling fluctuations). That source of error is limited by the significance level we choose, just as the probability of Type II error is set by initial choice of the statistical power, based upon a pilot study or other antecedent data concerning an expected average difference. Since in social science everything correlates with everything to some extent, due to complex and obscure causal influences, in considering the crud factor we are talking about real differences, real correlations, real trends and patterns for which there is, of course, some true but complicated multivariate causal theory. I am not suggesting that these correlations are fundamentally unexplainable. They would be completely explained if we had the knowledge of Omniscient Jones, which we don’t. The point is that we are in the weak situation of corroborating our particular substantive theory by showing that X and Y are related in a nonchance manner, when our theory is too weak to make a numerical prediction or even (usually) to set up a range of admissible values that would be counted as corroborative.

…Some psychologists play down the influence of the ubiquitous crud factor, what David Lykken (1968) calls the ambient correlational noise in social science, by saying that we are not in danger of being misled by small differences that show up as significant in gigantic samples. How much that softens the blow of the crud factor’s influence depends upon the crud factor’s average size in a given research domain, about which neither I nor anybody else has accurate information. But the notion that the correlation between arbitrarily paired trait variables will be, while not literally zero, of such minuscule size as to be of no importance, is surely wrong. Everybody knows that there is a set of demographic factors, some understood and others quite mysterious, that correlate quite respectably with a variety of traits. (Socioeconomic status, SES, is the one usually considered, and frequently assumed to be only in the input causal role.) The clinical scales of the MMPI were developed by empirical keying against a set of disjunct nosological categories, some of which are phenomenologically and psychodynamically opposite to others. Yet the 45 pairwise correlations of these scales are almost always positive (scale Ma provides most of the negatives) and a representative size is in the neighborhood of .35 to .40. The same is true of the scores on the Strong Vocational Interest Blank, where I find an average absolute value correlation close to .40. The malignant influence of so-called methods covariance in psychological research that relies upon tasks or tests having certain kinds of behavioral similarities such as questionnaires or ink blots is commonplace and a regular source of concern to clinical and personality psychologists. For further discussion and examples of crud factor size, see Meehl (1990).

Now suppose we imagine a society of psychologists doing research in this soft area, and each investigator sets his experiments up in a whimsical, irrational manner as follows: First he picks a theory at random out of the theory pot. Then he picks a pair of variables randomly out of the observable variable pot. He then arbitrarily assigns a direction (you understand there is no intrinsic connection of content between the substantive theory and the variables, except once in a while there would be such by coincidence) and says that he is going to test the randomly chosen substantive theory by pretending that it predicts - although in fact it does not, having no intrinsic contentual relation - a positive correlation between randomly chosen observational variables X and Y. Now suppose that the crud factor operative in the broad domain were .30, that is, the average correlation between all of the variables pairwise in this domain is .30. This is not sampling error but the true correlation produced by some complex unknown network of genetic and environmental factors. Suppose he divides a normal distribution of subjects at the median and uses all of his cases (which frequently is not what is done, although if properly treated statistically that is not methodologically sinful). Let us take variable X as the input variable (never mind its causal role). The mean score of the cases in the top half of the distribution will then be at one mean deviation, that is, in standard score terms they will have an average score of .80. Similarly, the subjects in the bottom half of the X distribution will have a mean standard score of -.80. So the mean difference in standard score terms between the high and low Xs, the one experimental and the other control group, is 1.6. If the regression of output variable Y on X is approximately linear, this yields an expected difference in standard score terms of .48, so the difference on the arbitrarily defined output variable Y is in the neighborhood of half a standard deviation.

When the investigator runs a t-test on these data, what is the probability of achieving a statistically significant result? This depends upon the statistical power function and hence upon the sample size, which varies widely, more in soft psychology because of the nature of the data collection problems than in experimental work. I do not have exact figures, but an informal scanning of several issues of journals in the soft areas of clinical, abnormal, and social gave me a representative value of the number of cases in each of two groups being compared at around N 1 = N 2 = 37 (that’s a median because of the skewness, sample sizes ranging from a low of 17 in one clinical study to a high of 1,000 in a social survey study). Assuming equal variances, this gives us a standard error of the mean difference of .2357 in sigma-units, so that our t is a little over 2.0. The substantive theory in a real life case being almost invariably predictive of a direction (it is hard to know what sort of significance testing we would be doing otherwise), the 5% level of confidence can be legitimately taken as one-tailed and in fact could be criticized if it were not (assuming that the 5% level of confidence is given the usual special magical significance afforded it by social scientists!). The directional 5% level being at 1.65, the expected value of our t test in this situation is approximately .35 t units from the required significance level. Things being essentially normal for 72 df, this gives us a power of detecting a difference of around .64.

However, since in our imagined experiment the assignment of direction was random, the probability of detecting a difference in the predicted direction (even though in reality this prediction was not mediated by any rational relation of content) is only half of that. Even this conservative power based upon the assumption of a completely random association between the theoretical substance and the pseudopredicted direction should give one pause. We find that the probability of getting a positive result from a theory with no verisimilitude whatsoever, associated in a totally whimsical fashion with a pair of variables picked randomly out of the observational pot, is one chance in three! This is quite different from the .05 level that people usually think about. Of course, the reason for this is that the .05 level is based upon strictly holding H 0 if the theory were false. Whereas, because in the social sciences everything is correlated with everything, for epistemic purposes (despite the rigor of the mathematician’s tables) the true baseline - if the theory has nothing to do with reality and has only a chance relationship to it (so to speak, any connection between the theory and the facts is purely coincidental) - is 6 or 7 times as great as the reassuring .05 level upon which the psychologist focuses his mind. If the crud factor in a domain were running around .40, the power function is .86 and the directional power for random theory/prediction pairings would be .43.

…A similar situation holds for psychopathology, and for many variables in personality measurement that refer to aspects of social competence on the one hand or impairment of interpersonal function (as in mental illness) on the other. Thorndike had a dictum All good things tend to go together.

Meehl 1990 (2)

Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant using it, Meehl 1990:

Research in the behavioral sciences can be experimental, correlational, or field study (including clinical); only the first two are addressed here. For reasons to be explained (Meehl, 1990c), I treat as correlational those experimental studies in which the chief theoretical test provided involves an interaction effect between an experimental manipulation and an individual-differences variable (whether trait, status, or demographic). In correlational research there arises a special problem for the social scientist from the empirical fact that everything is correlated with everything, more or less. My colleague David Lykken presses the point further to include most, if not all, purely experimental research designs, saying that, speaking causally, Everything influences everything, a stronger thesis that I neither assert nor deny but that I do not rely on here. The obvious fact that everything is more or less correlated with everything in the social sciences is readily foreseen from the armchair on common-sense considerations. These are strengthened by more advanced theoretical arguments involving such concepts as genetic linkage, auto-catalytic effects between cognitive and affective processes, traits reflecting influences such as child-rearing practices correlated with intelligence, ethnicity, social class, religion, and so forth. If one asks, to take a trivial and theoretically uninteresting example, whether we might expect to find social class differences in a color-naming test, there immediately spring to mind numerous influences, ranging from (a) verbal intelligence leading to better verbal discriminations and retention of color names to (b) class differences in maternal teaching behavior (which one can readily observe by watching mothers explain things to their children at a zoo) to (c) more subtle-but still nonzero-influences, such as upper-class children being more likely Anglicans than Baptists, hence exposed to the changes in liturgical colors during the church year! Examples of such multiple possible influences are so easy to generate, I shall resist the temptation to go on. If somebody asks a psychologist or sociologist whether she might expect a nonzero correlation between dental caries and IQ, the best guess would be yes, small but statistically significant. A small negative correlation was in fact found during the 1920s, misleading some hygienists to hold that IQ was lowered by toxins from decayed teeth. (The received explanation today is that dental caries and IQ are both correlates of social class.) More than 75 years ago, Edward Lee Thorndike enunciated the famous dictum, All good things tend to go together, as do all bad ones. Almost all human performance (work competence) dispositions, if carefully studied, are saturated to some extent with the general intelligence factor g, which for psychodynamic and ideological reasons has been somewhat neglected in recent years but is due for a comeback (Betz, 1986).

  • Meehl, P. E. (1990c). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244. In R. E. Snow & D. Wiley [Eds.], Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 13-59). Hillsdale. NJ: Lawrence Erlbaum Associates, Inc.
  • Betz, N. E. (Ed.). (1986). The g factor in employment [Special issue]. Journal of Vocational Behavior, 29(3).

The ubiquity of nonzero correlations gives rise to what is methodologically disturbing to the theory tester and what I call, following Lykken, the crud factor. I have discussed this at length elsewhere (Meehl, 1990c), so I only summarize and provide a couple of examples here. The main point is that, when the sample size is sufficiently large to produce accurate estimates of the population values, almost any pair of variables in psychology will be correlated to some extent. Thus, for instance, less than 10% of the items in the MMPI item pool were put into the pool with masculinity-femininity in mind, and the empirically derived Mf scale contains only some of those plus others put into the item pool for other reasons, or without any theoretical considerations. When one samples thousands of individuals, it turns out that only 43 of the 550 items (8%) fail to show a significant difference between males and females. In an unpublished study (but see Meehl, 1990c) of the hobbies, interests, vocational plans, school course preferences, social life, and home factors of Minnesota college freshmen, when Lykken and I ran chi squares on all possible pairwise combinations of variables, 92% were significant, and 78% were significant at p<10-6 . Looked at another way, the median number of significant relationships between a given variable and all the others was 41 of a possible 44. One finds such oddities as a relationship between which kind of shop courses boys preferred in high school and which of several Lutheran synods they belonged to!

…The third objection is somewhat harder to answer because it would require an encyclopedic survey of research literature over many domains. It is argued that, although the crud factor is admittedly ubiquitous-that is, almost no correlations of the social sciences are literally zero (as required by the usual significance test)-the crud factor is in most research domains not large enough to be worth worrying about. Without making a claim to know just how big it is, I think this objection is pretty clearly unsound. Doubtless the average correlation of any randomly picked pair of variables in social science depends on the domain, and also on the instruments employed (e.g., it is well known that personality inventories often have as much methods-covariance as they do criterion validities).

A representative pairwise correlation among MMPI scales, despite the marked differences (sometimes amounting to phenomenological oppositeness) of the nosological rubrics on which they were derived, is in the middle to high .30s, in both normal and abnormal populations. The same is true for the occupational keys of the Strong Vocational Interest Bank. Deliberately aiming to diversify the qualitative features of cognitive tasks (and thus purify the measures) in his classic studies of primary mental abilities (pure factors, orthogonal), Thurstone (1938; Thurstone & Thurstone, 1941) still found an average intertest correlation of .28 (range = .01 to .56!) in the cross-validation sample. In the set of 20 California Psychological Inventory scales built to cover broadly the domain of (normal range) folk-concept traits, Gough (1987) found an average pairwise correlation of .44 among both males and females. Guilford’s Social Introversion, Thinking Introversion, Depression, Cycloid Tendencies, and Rhathymia or Freedom From Care scales, constructed on the basis of (orthogonal) factors, showed pairwise correlations ranging from -.02 to .85, with 5 of the 10 rs ≥ .33 despite the purification effort (Evans & McConnell, 1941). Any treatise on factor analysis exemplifying procedures with empirical data suffices to make the point convincingly. For example, in Harman (1960), eight emotional variables correlate .10 to .87, median r= .44 (p. 176), and eight political variables correlate .03 to .88, median (absolute value) r = .62 (p. 178). For highly diverse acquiescence-corrected measures (personality traits, interests, hobbies, psychopathology, social attitudes, and religious, political, and moral opinions), estimating individuals’ (orthogonal!) factor scores, one can hold mean _r_s down to an average of . 12, means from .04 to .20, still some individual _r_s > .30 (Lykken, personal communication, 1990; cf. McClosky & Meehl, in preparation). Public opinion polls and attitude surveys routinely disaggregate data with respect to several demographic variables (e.g., age, education, section of country, sex, ethnicity, religion, education, income, rural/urban, self-described political affiliation) because these factors are always correlated with attitudes or electoral choices, sometimes strongly so. One must also keep in mind that socioeconomic status, although intrinsically interesting (especially to sociologists) is probably often functioning as a proxy for other unmeasured personality or status characteristics that are not part of the definition of social class but are, for a variety of complicated reasons, correlated with it. The proxy role is important because it prevents adequate controlling for unknown (or unmeasured) crud-factor influences by statistical procedures (matching, partial correlation, analysis of covariance, path analysis).

  • Thurstone, L. L. (1938). Primary mental abilities. Chicago: University of Chicago Press.
  • Thurstone, L. L., & Thurstone, T. G. (1941). Factorial studies of intelligence. Chicago: University of Chicago Press.
  • Gough, H. G. (1987). CPI, Administrator’s guide. Palo Alto, CA: Consulting Psychologists Press.
  • Evans, C., & McConnell, T. R. (1941). A new measure of introversion-extroversion. Journal of Psychology, 12, 111-124.
  • Harman, H. H. (1960). Modern factor analysis. Chicago: University of Chicago Press.
  • McClosky, Herbert, & Meehl, P. E. (in preparation). Ideologies in conflict.9

Raftery 1995

Bayesian Model Selection in Social Research (with Discussion by Andrew Gelman & Donald B. Rubin, and Robert M. Hauser, and a Rejoinder):

In the past 15 years, however, some quantitative sociologists have been attaching less importance to P -values because of practical difficulties and counter-intuitive results. These difficulties are most apparent with large samples, where p-values tend to indicate rejection of the null hypothesis even when the null model seems reasonable theoretically and inspection of the data fails to reveal any striking discrepancies with it. Because much sociological research is based on survey data, often with thousands of cases, sociologists frequently come up against this problem. In the early 1980s, some sociologists dealt with this problem by ignoring the results of p-value-based tests when they seemed counter-intuitive, and by basing model selection instead on theoretical considerations and informal assessment of discrepancies between model and data (e.g. Fienberg and Mason, 1979; Hout, 1983, 1984; Grusky and Hauser, 1984).

…It is clear that models 1 and 2 are unsatisfactory and should be rejected in favor of model 3. 3 By the standard test, model 3 should also be rejected, in favor of model 4, given the deviance difference of 150 on 16 degrees of freedom, corresponding to a p-value of about 10-120 . Grusky and Hauser (1984) nevertheless adopted model 3 because it explains most (99.7%) of the deviance under the baseline model of independence, fits well in the sense that the differences between observed and expected counts are a small proportion of the total, and makes good theoretical sense. This seems sensible, and yet is in dramatic conflict with the p-value-based test. This type of conflict often arises in large samples, and hence is frequent in sociology with its survey data sets comprising thousands of cases. The main response to it has been to claim that there is a distinction between statistical and substantive significance, with differences that are statistically significant not necessarily being substantively important.

Gelman

Type 1, type 2, type S, and type M errors

I’ve never in my professional life made a Type I error or a Type II error. But I’ve made lots of errors. How can this be?

A Type 1 error occurs only if the null hypothesis is true (typically if a certain parameter, or difference in parameters, equals zero). In the applications I’ve worked on, in social science and public health, I’ve never come across a null hypothesis that could actually be true, or a parameter that could actually be zero.

Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler:

I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects-positive for some people, negative for others-but not averaging to exactly zero in the population.) From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance. Certainly, I’d rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.

Bayesian Statistics Then and Now:

My third meta-principle is that different applications demand different philosophies. This principle comes up for me in Efron’s discussion of hypothesis testing and the so-called false discovery rate, which I label as so-called for the following reason. In Efron’s formulation (which follows the classical multiple comparisons literature), a false discovery is a zero effect that is identified as nonzero, whereas, in my own work, I never study zero effects. The effects I study are sometimes small but it would be silly, for example, to suppose that the difference in voting patterns of men and women (after controlling for some other variables) could be exactly zero. My problems with the false discovery formulation are partly a matter of taste, I’m sure, but I believe they also arise from the difference between problems in genetics (in which some genes really have essentially zero effects on some traits, so that the classical hypothesis-testing model is plausible) and in social science and environmental health (where essentially everything is connected to everything else, and effect sizes follow a continuous distribution rather than a mix of large effects and near-exact zeroes).

Gelman 2010

Causality and Statistical Learning, Gelman 2010:

There are (almost) no true zeroes: difficulties with the research program of learning causal structure

We can distinguish between learning within a causal model (that is, inference about parameters characterizing a specified directed graph) and learning causal structure itself (that is, inference about the graph itself). In social science research, I am extremely skeptical of this second goal.

The difficulty is that, in social science, there are no true zeroes. For example, religious attendance is associated with attitudes on economic as well as social issues, and both these correlations vary by state. And it does not interest me, for example, to test a model in which social class affects vote choice through party identification but not along a direct path.

More generally, anything that plausibly could have an effect will not have an effect that is exactly zero. I can respect that some social scientists find it useful to frame their research in terms of conditional independence and the testing of null effects, but I don’t generally find this approach helpful-and I certainly don’t believe that it is necessary to think in terms of conditional independence in order to study causality. Without structural zeroes, it is impossible to identify graphical structural equation models.

The most common exceptions to this rule, as I see it, are independences from design (as in a designed or natural experiment) or effects that are zero based on a plausible scientific hypothesis (as might arise, for example, in genetics (where genes on different chromosomes might have essentially independent effects, or in a study of ESP). In such settings I can see the value of testing a null hypothesis of zero effect, either for its own sake or to rule out the possibility of a conditional correlation that is supposed not to be there.

Another sort of exception to the no zeroes rule comes from information restriction: a person’s decision should not be affected by knowledge that he or she doesn’t have. For example, a consumer interested in buying apples cares about the total price he pays, not about how much of that goes to the seller and how much goes to the government in the form of taxes. So the restriction is that the utility depends on prices, not on the share of that going to taxes. That is the type of restriction that can help identify demand functions in economics.

I realize, however, that my perspective that there are no zeroes (information restrictions aside) is a minority view among social scientists and perhaps among people in general, on the evidence of psychologist Sloman’s book. For example, from chapter 2: A good politician will know who is motivated by greed and who is motivated by larger principles in order to discern how to solicit each one’s vote when it is needed. I can well believe that people think in this way but I don’t buy it! Just about everyone is motivated by greed and by larger principles! This sort of discrete thinking doesn’t seem to me to be at all realistic about how people behave-although it might very well be a good model about how people characterize others!

In the next chapter, Sloman writes, No matter how many times A and B occur together, mere co-occurrence cannot reveal whether A causes B, or B causes A, or something else causes both. [italics added] Again, I am bothered by this sort of discrete thinking. I will return in a moment with an example, but just to speak generally, if A could cause B, and B could cause A, then I would think that, yes, they could cause each other. And if something else could cause them both, I imagine that could be happening along with the causation of A on B and of B on A.

Here we’re getting into some of the differences between a normative view of science, a descriptive view of science, and a descriptive view of how people perceive the world. Just as there are limits to what folk physics can tell us about the motion of particles, similarly I think we have to be careful about too closely identifying folk causal inference from the stuff done by the best social scientists. To continue the analogy: it is interesting to study how we develop physical intuitions using commonsense notions of force, energy, momentum, and so on-but it’s also important to see where these intuitions fail. Similarly, ideas of causality are fundamental but that doesn’t stop ordinary people and even experts from making basic mistakes.

Now I would like to return to the graphical model approach described by Sloman. In chapter 5, he discusses an example with three variables:

If two of the variables are dependent, say, intelligence and socioeconomic status, but conditionally independent given the third variable [beer consumption], then either they are related by one of two chains:

(Intelligence -> Amount of beer consumed -> Socioeconomic status) (Socio-economic status -> Amount of beer consumed -> Intelligence)

or by a fork:

(Amount of beer consumed -> Socioeconomic status) -> Intelligence) and then we must use some other means [other than observational data] to decide between these three possibilities. In some cases, common sense may be sufficient, but we can also, if necessary, run an experiment. If we intervene and vary the amount of beer consumed and see that we affect intelligence, that implies that the second or third model is possible; the first one is not. Of course, all this assumes that there aren’t other variables mediating between the ones shown that provide alternative explanations of the dependencies.

This makes no sense to me. I don’t see why only one of the three models can be true. This is a mathematical possibility, but it seems highly implausible to me. And, in particular, running an experiment that reveals one of these causal effects does not rule out the other possible paths. For example, suppose that Sloman were to perform the above experiment (finding that beer consumption affects intelligence) and then another experiment, this time varying intelligence (in some way; the method of doing this can very well determine the causal effect) and finding that it affects the amount of beer consumed.

Beyond this fundamental problem, I have a statistical critique, which is that in social science you won’t have these sorts of conditional independencies, except from design or as artifacts of small sample sizes that do not allow us to distinguish small dependencies from zero.

I think I see where Sloman is coming from, from a psychological perspective: you see these variables that are related to each other, and you want to know which is the cause and which is the effect. But I don’t think this is a useful way of understanding the world, just as I don’t think it’s useful to categorize political players as being motivated either by greed or by larger principles, but not both. Exclusive-or might feel right to us internally, but I don’t think it works as science.

One important place where I agree with Sloman (and thus with Pearl and Sprites et al.) is in the emphasis that causal structure cannot in general be learned from observational data alone; they hold the very reasonable position that we can use observational data to rule out possibilities and formulate hypotheses, and then use some sort of intervention or experiment (whether actual or hypothetical) to move further. In this way they connect the observational/experimental division to the hypothesis/deduction formulation that is familiar to us from the work of Popper, Kuhn, and other modern philosophers of science.

The place where I think Sloman is misguided is in his formulation of scientific models in an either/or way, as if, in truth, social variables are linked in simple causal paths, with a scientific goal of figuring out if A causes B or the reverse. I don’t know much about intelligence, beer consumption, and socioeconomic status, but I certainly don’t see any simple relationships between income, religious attendance, party identification, and voting-and I don’t see how a search for such a pattern will advance our understanding, at least given current techniques. I’d rather start with description and then go toward causality following the approach of economists and statisticians by thinking about potential interventions one at a time. I’d love to see Sloman’s and Pearl’s ideas of the interplay between observational and experimental data developed in a framework that is less strongly tied to the notion of choice among simple causal structures.

Gelman et al 2011

Inherent Difficulties of Non-Bayesian Likelihood-based Inference, as Revealed by an Examination of a Recent book by Aitkin

Several of the examples in Statistical Inference represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (…) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change. Statistical Inference, page 147

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more grown-up view of the null hypothesis, not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, it makes more sense not to try to rehabilitate it or treat it as treasured metaphor but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population.

To be clear: we are not denying the value of hypothesis testing. In this example, we find it completely reasonable to ask whether observed changes are statistically significant, i.e. whether the data are consistent with a null hypothesis of zero change. What we do not find reasonable is the statement that the question of interest is whether there has been a change in support.…Suppose public opinion was observed to really be flat, punctuated by occasional changes, as in the left graph in Figure 3. In that case, Aitkin’s question of whether there has been a change would be well-defined and appropriate, in that we could interpret the null hypothesis of no change as some minimal level of baseline variation. Real public opinion, however, does not look like baseline noise plus jumps, but rather shows continuous movement on many time scales at once, as can be seen from the right graph in Figure 3, which shows actual presidential approval data. In this example, we do not see Aitkin’s question as at all reasonable. Any attempt to work with a null hypothesis of opinion stability will be inherently arbitrary. It would make much more sense to model opinion as a continuously- varying process.

The statistical problem here is not merely that the null hypothesis of change is nonsensical; it is that the null is in no sense a reasonable approximation to any interesting model.

Gelman et al 2013

Inherent difficulties of non-Bayesian likelihood-based inference, as revealed by an examination of a recent book by Aitkin:

  1. Solving non-problems

Several of the examples in Statistical Inference represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (. . . ) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.Statistical Inference ,page 147

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more grown-up view of the null hypothesis, not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, we think it makes more sense not to try to rehabilitate it or treat it as treasured metaphor but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population.

To be clear: we are not denying the value of hypothesis testing. In this example, we find it completely reasonable to ask whether observed changes are statistically significant, i.e. whether the data are consistent with a null hypothesis of zero change. What we do not find reasonable is the statement that the question of interest is whether there has been a change in support.

All this is application-specific. Suppose public opinion was observed to really be flat, punctuated by occasional changes, as in the left graph in Figure 7.1. In that case, Aitkin’s question of whether there has been a change would be well-defined and appropriate, in that we could interpret the null hypothesis of no change as some minimal level of baseline variation.

Real public opinion, however, does not look like baseline noise plus jumps, but rather shows continuous movement on many time scales at once, as can be seen from the right graph in Figure 7.1, which shows actual presidential approval data. In this example, we do not see Aitkin’s question as at all reasonable. Any attempt to work with a null hypothesis of opinion stability will be inherently arbitrary. It would make much more sense to model opinion as a continuously-varying process. The statistical problem here is not merely that the null hypothesis of zero change is nonsensical; it is that the null is in no sense a reasonable approximation to any interesting model. The sociological problem is that, from Savage (1954) onward, many Bayesians have felt the need to mimic the classical null-hypothesis testing framework, even where it makes no sense.

Schwitzgebel 2013

Preliminary Evidence That the World Is Simple (An Exercise in Stupid Epistemology) (humorous blog post)

Here’s what I did. I thought up 30 pairs of variables that would be easy to measure and that might relate in diverse ways. Some variables were physical (the distance vs. apparent brightness of nearby stars), some biological (the length vs. weight of sticks found in my back yard), and some psychological or social (the S&P 500 index closing value vs. number of days past). Some I would expect to show no relationship (the number of pages in a library book vs. how high up it is shelved in the library), some I would expect to show a roughly linear relationship (distance of McDonald’s franchises from my house vs. MapQuest estimated driving time), and some I expected to show a curved or complex relationship (forecasted temperature vs. time of day, size in KB of a JPG photo of my office vs. the angle at which the photo was taken). See here for the full list of variables. I took 11 measurements of each variable pair. Then I analyzed the resulting data.

Now, if the world is massively complex, then it should be difficult to predict a third datapoint from any two other data points. Suppose that two measurements of some continuous variable yield values of 27 and 53. What should I expect the third measured value to be? Why not 1,457,002? Or 3.22 x 10^-17? There are just as many functions (that is, infinitely many) containing 27, 53, and 1,457,002 as there are containing 27, 53, and some more pedestrian-seeming value like 44.

… To conduct the test, I used each pair of dependent variables to predict the value of the next variable in the series (the 1st and 2nd observations predicting the value of the 3rd, the 2nd and 3rd predicting the value of the 4th, etc.), yielding 270 predictions for the 30 variables. I counted an observation wild if its absolute value was 10 times the maximum of the absolute value of the two previous observations or if its absolute value was below 1/10 of the minimum of the absolute value of the two previous observations. Separately, I also looked for flipped signs (either two negative values followed by a positive or two positive values followed by a negative), though most of the variables only admitted positive values. This measure of wildness yielded three wild observations out of 270 (1%) plus another three flipped-sign cases (total 2%). (A few variables were capped, either top or bottom, in a way that would make an above-10x or below-1/10th observation analytically unlikely, but excluding such variables wouldn’t affect the result much.) So it looks like the Wild Complexity Thesis might be in trouble.

Ellenberg 2014

Jordan Ellenberg, The Myth Of The Myth Of The Hot Hand (excerpted from How Not to Be Wrong: The Power of Mathematical Thinking, 2014):

A significance test is a scientific instrument, and like any other instrument, it has a certain degree of precision. If you make the test more sensitive-by increasing the size of the studied population, for example-you enable yourself see ever-smaller effects. That’s the power of the method, but also its danger. The truth is, the null hypothesis is probably always false! When you drop a powerful drug into a patient’s bloodstream, it’s hard to believe the intervention literally has zero effect on the probability that the patient will develop esophageal cancer, or thrombosis, or bad breath. Each part of the body speaks to every other, in a complex feedback loop of influence and control. Everything you do either gives you cancer or prevents it. And in principle, if you carry out a powerful enough study, you can find out which it is. But those effects are usually so minuscule that they can be safely ignored. Just because we can detect them doesn’t always mean they matter…The right question isn’t, Do basketball players sometimes temporarily get better or worse at making shots?-the kind of yes/no question a significance test addresses. The right question is How much does their ability vary with time, and to what extent can observers detect in real time whether a player is hot? Here, the answer is surely not as much as people think, and hardly at all.

Thehot hand" and problems with hypothesis testing“, Gelman:

The effects are certainly not zero. We are not machines, and anything that can affect our expectations (for example, our success in previous tries) should affect our performance…Whatever the latest results on particular sports, I can’t see anyone overturning the basic finding of Gilovich, Vallone, and Tversky that players and spectators alike will perceive the hot hand even when it does not exist and dramatically overestimate the magnitude and consistency of any hot-hand phenomenon that does exist. In summary, this is yet another problem where much is lost by going down the standard route of null hypothesis testing.

google everything is correlated, everything is related to everything else, crud factor, null hypothesis is always false, coefficients are never zero Gelman: https://www.google.com/search?num=100&q=%22everything%20is%20correlated%22%20OR%20%22everything%20is%20related%20to%20everything%20else%22%20OR%20%22crud%20factor%22%20OR%20%20%22null%20hypothesis%20is%20always%20false%22%20OR%20%22coefficients%20are%20never%20zero%22%20OR%20zeroes%20OR%20null%20hypothesis%20site%3Aandrewgelman.com


  1. pg74-75 of Sunset Salvo, John Tukey 1986

  2. Although this may have been suggested:

    I used to read a magazine called Milo that covered a bunch of different strength sports. I ended my subscription after reading an article in which an entirely serious author wrote about how he noticed that shortly after he started hearing birds singing in the morning, plants started to grow. His conclusion was that birdsong made plants grow. If I remember correctly, he then concluded that it was the vibrations in the birdsong that made the plants grow, therefore vibrations were good for strength, therefore you could make your muscles grow through being exposed to certain types of vibrations, i.e. birdsong. It was my favorite article of all time, just for the way the guy started out so absurdly wrong and just kept digging.

    I used to read old weight training books. In one of them the author proudly recalled how his secretary had asked him for advice on how to lose weight. This guy went around studying all the secretaries and noticed that the thin ones sat more upright compared to the fat ones. He then recommended to his secretary that she sit more upright, and if she did this she would lose weight. What I loved most about that whole story was that the guy was so proud of his analysis and conclusion that he made it an entire chapter of his book, and that no one in the entire publishing chain from the writer to the editor to the proofreader to the librarian who put the book on the shelves noticed any problems with any of it.

  3. Slate provides a nice example from Pearson’s The Grammar of Science (pg407):

    All causation as we have defined it is correlation, but the converse is not necessarily true, i.e. where we find correlation we cannot always predict causation. In a mixed African population of Kaffirs and Europeans, the former may be more subject to smallpox, yet it would be useless to assert darkness of skin (and not absence of vaccination) as a cause.

  4. I should mention this one is not quite as silly as it sounds as there is experimental evidence for cocoa improving cognitive function

  5. The same authors offer up a number of country-level correlation such as Linguistic Diversity/Traffic accidents, alcohol consumption/morphological complexity, and acacia trees vs tonality, which feed into their paper Constructing knowledge: nomothetic approaches to language evolution on the dangers of naive approaches to cross-country comparisons due to the high intercorrelation of cultural traits. More sophisticated approaches might be better; they derive a fairly-plausible looking graph of the relationships between variables.

  6. like temporal order or biological plausibility - for example, in medicine you can generally rule out some of the relationships this way: if you find a correlation between taking supertetrohydracyline™ and pancreas cancer remission, it seems unlikely that #2 curing pancreas cancer causes a desire to take supertetrohydracyline™ so the causal relationship is probably either #1 supertetrohydracyline™ cures cancer or #3 a common cause like doctors prescribe supertetrohydracyline™ to patients who are getting better.

  7. I borrow this phrase from the paper Looking to the 21st century: have we learned from our mistakes, or are we doomed to compound them?, Shapiro 2004:

    In 1968, when I attended a course in epidemiology 101, Dick Monson was fond of pointing out that when it comes to relative risk estimates, epidemiologists are not intellectually superior to apes. Like them, we can count only three numbers: 1, 2 and BIG (I am indebted to Allen Mitchell for Figure 7). In adequately designed studies we can be reasonably confident about BIG relative risks, sometimes; we can be only guardedly confident about relative risk estimates of the order of 2.0, occasionally; we can hardly ever be confident about estimates of less than 2.0, and when estimates are much below 2.0, we are quite simply out of business. Epidemiologists have only primitive tools, which for small relative risks are too crude to enable us to distinguish between bias, confounding and causation.

    …To illustrate that point, I have to allude to a problem that is usually avoided because to mention it in public is considered impolite: I refer to bias (unconscious, to be sure, but bias all the same) on the part of the investigator. And in order not to obscure the issue by considering studies of questionable quality, I have chosen the example of putatively causal (or preventive) associations published by the Nurses Health Study (NHS). For that study, the investigators have repeatedly claimed that their methods are almost perfect. Over the years, the NHS investigators have published a torrent of papers and Figure 8 gives an entirely fictitious but nonetheless valid distribution of the relative risk estimates derived from them (for relative risk estimates of less than unity, assume the inverse values). The overwhelming majority of the estimates have been less than 2 and mostly less than 1.5, and the great majority have been interpreted as causal (or preventive). Well, perhaps they are and perhaps they are not: we cannot tell. But, perhaps as a matter of quasi-religious faith, the investigators have to believe that the small risk increments they have observed can be interpreted and that they can be interpreted as causal (or preventive). Otherwise they can hardly justify their own existence. They have no choice but to ignore Feinstein’s dictum [Several years ago, Alvan Feinstein made the point that if some scientific fallacy is demonstrated and if it cannot be rebutted, a convenient way around the problem is simply to pretend that it does not exist and to ignore it.]

  8. Lehmann, E. L. Testing statistical hypotheses. New York: Wiley, 1959 (2nd edition); after skimming the 2nd edition, I have not been able to find a relevant passage, but Lehmann remarks that he substantially rewrote the textbook for a more robust and decision-theoretic approach, so it may have been removed.

  9. This work does not seem to have been published.