Why Correlation Usually ≠ Causation

Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common, because the number of possible indirect paths greatly exceeds the direct paths necessary for useful manipulation?
statistics, philosophy, survey, Bayes, causality, insight-porn
2014-06-24 to 2019-12-09; in progress; certainty: log; importance: 10

It is widely under­stood that sta­tis­ti­cal cor­re­la­tion between two vari­ables ≠ cau­sa­tion. Despite this admo­ni­tion, peo­ple are over­con­fi­dent in claim­ing cor­re­la­tions to sup­port favored causal inter­pre­ta­tions and are sur­prised by the results of ran­dom­ized exper­i­ments, sug­gest­ing that they are biased & sys­tem­at­i­cally under­es­ti­mate the preva­lence of con­founds / com­mon-­cau­sa­tion. I spec­u­late that in real­is­tic causal net­works or DAGs, the num­ber of pos­si­ble cor­re­la­tions grows faster than the num­ber of pos­si­ble causal rela­tion­ships. So con­founds really are that com­mon, and since peo­ple do not think in real­is­tic DAGs but toy mod­els, the imbal­ance also explains over­con­fi­dence.

I have noticed I seem to be unusu­ally will­ing to bite the cor­re­la­tion ≠ cau­sa­tion bul­let, and I think it’s due to an idea I had some time ago about the nature of real­i­ty. It ties into the cumu­la­tive evi­dence from meta-­analy­sis and repli­ca­tion ini­tia­tives and the his­tory of ran­dom­ized exper­i­ments about the Repli­ca­tion cri­sis, cor­re­la­tions’ abil­ity to pre­dict ran­dom­ized exper­i­ments in prac­tice, behav­ioral genet­ics results from the genome sequenc­ing era, and count­less failed social engi­neer­ing attempts.

Overview: The Current Situation

Here is how I cur­rently under­stand the rela­tion­ship between cor­re­la­tion and causal­i­ty, and the col­lec­tive find­ings of meta-­sci­en­tific research:

  1. Replication Crisis: a shockingly large fraction of psychological research and other fields is simple random noise which cannot be replicated, due to p-hacking, low statistical power, publication bias, and other sources of systematic error. The RC can manufacture arbitrarily many spurious results by datamining, and this alone ensures a high error rate.

  2. Everything Is Correlated: when we systematically measure many variables at large scale, with n large enough to defeat the Replication Crisis problems and establish with high confidence that a given correlation is real and estimated reasonably accurately, we find that ‘everything is correlated’, even things which seem to have no causal relationship whatsoever.

    This is true whether we com­pare aggre­gate envi­ron­men­tal data over many peo­ple, phe­no­type data about indi­vid­u­als, or genetic data. They’re all cor­re­lat­ed. Every­thing is cor­re­lat­ed. If you fail to reject the null hypoth­e­sis with p < 0.05, you sim­ply haven’t col­lected enough data yet. So, as Meehl asked, what does a cor­re­la­tion between 2 vari­ables mean if we know per­fectly well in advance that it will either be pos­i­tive or neg­a­tive, but we can’t pre­dict which or how large it is?

  3. Empirically, most efforts to change human behavior and sociology and economics and education fail in randomized evaluation, and the mean effect size of experiments in meta-analyses typically approaches zero, despite promising correlations.

  4. Correlation ≠ Causation: so, we live in a world where research manufactures many spurious results and, even once we see through the fake findings, finding a correlation is meaningless because everything is correlated to begin with; accordingly, correlations are little better as guides than experimenting at random, which doesn’t work well either.

    Thus, unsurprisingly, in every field from medicine to economics, when we directly ask how well correlations predict subsequent randomized experiments, we find that the predictive power is poor. Despite the best efforts of all involved researchers, animal experiments, natural experiments, human expert judgment, our understanding of the relevant causal networks etc, the highest quality correlational research still struggles to outperform a coin flip in predicting the results of a randomized experiment. This reflects both the contaminating spurious correlations from the Replication Crisis and the brute fact that in practice, correlations don’t provide good guides to causal interventions.

    But why is cor­re­la­tion ≠ cau­sa­tion?

  5. Dense Causal Graphs: because, if we write down a causal graph con­sis­tent with ‘every­thing is cor­re­lated’ and the empir­i­cal facts of aver­age null effects + unpre­dic­tive cor­re­la­tions, this implies that all vari­ables are part of enor­mous dense causal graphs where each vari­able is con­nected to sev­eral oth­ers.

    And in such a graph, the num­ber of ways in which a vari­able’s value is con­nected to another vari­able and pro­vides infor­ma­tion about it (ie cor­re­lat­ed) grows extremely rapid­ly, while it only pro­vides a use­ful causal manip­u­la­tion if the other vari­able is solely con­nected by a nar­row spe­cific sequence of one-way causal arrows; there are vastly more indi­rect con­nec­tions than direct con­nec­tions, so any given indi­rect con­nec­tion is vastly unlikely to be a direct con­nec­tion, and thus manip­u­lat­ing one vari­able typ­i­cally will not affect the other vari­able.

  6. Incor­rect Intu­itions: This inequal­ity between observ­able cor­re­la­tions and actual use­ful causal manip­u­la­bil­ity merely grows with larger net­works, and causal net­works in fields like eco­nom­ics or biol­ogy are far more com­plex than those in more ordi­nary every­day fields like ‘catch­ing a ball’.

    Our intu­itions, formed in sim­ple domains designed to have sparse causal net­works (it would be bad if balls could make you do ran­dom things! your brain is care­fully designed to con­trol the influ­ence of any out­side forces & model the world as sim­ple for plan­ning pur­pos­es), turn out to be pro­foundly mis­lead­ing in these other domains.

    Things like vision are pro­foundly com­plex, but our brains present to us sim­ple processed illu­sions to assist deci­sion-­mak­ing; we have ‘folk psy­chol­ogy’ and ‘folk physics’, which are sim­ple, use­ful, and dead­-wrong as full descrip­tions of real­i­ty. Even physics stu­dents trained in New­ton­ian mechan­ics can­not eas­ily over­ride their Aris­totelian ‘folk physics’ intu­itions & cor­rectly pre­dict the move­ments of objects in orbit; it is unsur­pris­ing that ‘folk causal­ity’ often per­forms bad­ly, espe­cially in extremely com­plex fields with ambigu­ous long-term out­comes on novel inher­ent­ly-d­if­fi­cult prob­lems where things like pos­i­tively con­spire to fool human heuris­tics which are quite adap­tive in every­day life—­like med­ical research where a researcher is for­tu­nate if, dur­ing their entire career, they can for­mu­late and then def­i­nitely prove or refute even a sin­gle major hypoth­e­sis.

  7. No, Real­ly, Cor­re­la­tion ≠ Cau­sa­tion: This cog­ni­tive bias is why cor­re­la­tion ≠ cau­sa­tion is so dif­fi­cult to inter­nal­ize and accept, and hon­ored pri­mar­ily in the breach even by sophis­ti­cated researchers, and is why ran­dom­ized exper­i­ments are his­tor­i­cally late devel­oped, neglect­ed, coun­ter­in­tu­itive, and crit­i­cized when run despite rou­tinely debunk­ing con­ven­tional wis­dom of experts in almost every field.

    Because of this, we must treat correlational claims, and policy recommendations based on them, even more skeptically than we find intuitive, and it is unethical to make important decisions based on such weak evidence. The question should never be, “is it ethical to run a randomized experiment here?” but “could it possibly be ethical to not run a randomized experiment here?”

    What can be done about this, besides repeating ad nauseam that ‘correlation ≠ causation’? How can we replace wrong intuitions with right ones so people feel that? Do we need to emphasize the many examples, like medical reversals of stents or back surgeries or blood transfusions? Do we need to ? Do we need interactive visualizations (“SimCity for causal graphs”?) to make it intuitive?

Confound it! Correlation is (usually) not causation! But why not?

The Problem

“Hubris is the great­est dan­ger that accom­pa­nies for­mal data analy­sis…Let me lay down a few basics, none of which is easy for all to accept… 1. The data may not con­tain the answer. The com­bi­na­tion of some data and an aching desire for an answer does not ensure that a rea­son­able answer can be extracted from a given body of data.”

John Tukey (pg74–75, “Sunset Salvo”, 1986)

“Every time I write about the impossibility of effectively protecting digital files on a general-purpose computer, I get responses from people decrying the death of copyright. ‘How will authors and artists get paid for their work?’ they ask me. Truth be told, I don’t know. I feel rather like the physicist who just explained relativity to a group of would-be interstellar travelers, only to be asked: ‘How do you expect us to get to the stars, then?’ I’m sorry, but I don’t know that, either.”

Bruce Schneier, “Pro­tect­ing Copy­right in the Dig­i­tal World”, 2001

Most scientifically-inclined people are reasonably aware that one of the major divides in research is that correlation ≠ causation: that having discovered some relationship between various data X and Y (not necessarily Pearson’s r, but any sort of mathematical or statistical relationship, whether it be a humble r or an opaque deep neural network’s predictions), we do not know how Y would change if we manipulated X. Y might increase, decrease, do something complicated, or remain implacably the same.

Correlations Often Aren’t

This might be because the correlation is not a real one, and is spurious, in the sense that it would disappear if we gathered more data, and was due to biases; or it could be an artifact of our mathematical procedures; or it is a Type I error, a correlation thrown up by the standard statistical problems we all know about, such as too-small n, false positives from sampling error (A & B just happened to sync together due to randomness), data-mining/multiple testing, p-hacking, data snooping, selection bias, publication bias, misconduct, inappropriate statistical tests, etc. The Replication Crisis has seriously shaken faith in the published research literature in many fields, and it’s clear that many correlations are over-estimated in strength by severalfold, or in fact have the opposite sign.

Correlation Often Isn’t Causation

I’ve read about those prob­lems at length, and despite know­ing about all that, there still seems to be a prob­lem: I don’t think those issues explain away all the cor­re­la­tions which turn out to be con­found­s—­cor­re­la­tion too often ≠ cau­sa­tion.

But let’s say we get past that and we have estab­lished beyond a rea­son­able doubt that some X and Y really do cor­re­late. We still have not solved the prob­lem.

A priori

This point can be made by listing examples of correlations where we intuitively know changing X should have no effect on Y, and it’s a confound: the number of churches in a town may correlate with the number of bars, but we know that’s because both are related to how many people are in it; the number of pirates may inversely correlate with global warming (but we know pirates don’t control global warming and it’s more likely something like economic development leads to suppression of piracy but also CO2 emissions); sales of ice cream may correlate with snake bites or violent crime or death from heat-strokes (but of course snakes don’t care about sabotaging ice cream sales); thin people may have better posture than fat people, but sitting upright does not seem like a plausible weight loss plan1; wearing XXXL clothing clearly doesn’t cause heart attacks, although one might wonder if diet soda causes obesity; black skin does not cause sickle cell anemia nor, to borrow an example from Pearson2, would black skin cause smallpox or malaria; more recently, part of the psychology behind the belief that vaccines cause autism is that many vaccines are administered to children at the same time autism would start becoming apparent… In these cases, we can see that the correlation, which is surely true (in the sense that we can go out and observe it any time we like), doesn’t work like we might think: there is some third variable which causes both X and Y, or it turns out we’ve gotten it backwards.

In fact, if we go out and look at large datasets, we will find that two variables being correlated is nothing special, because “everything is correlated”. As Paul Meehl noted, the correlations can seem completely arbitrary, yet are firmly established by extremely large n (eg n = 57,000 & n = 50,000 in his 2 examples).

A posteriori

We can also estab­lish this by sim­ply look­ing at the results of ran­dom­ized exper­i­ments which take a cor­re­la­tion and nail down what a manip­u­la­tion does.

To mea­sure this directly you need a clear set of cor­re­la­tions which are pro­posed to be causal, ran­dom­ized exper­i­ments to estab­lish what the true causal rela­tion­ship is in each case, and both cat­e­gories need to be sharply delin­eated in advance to avoid issues of cher­ryp­ick­ing and retroac­tively con­firm­ing a cor­re­la­tion. Then you’d be able to say some­thing like ‘11 out of the 100 pro­posed A→B causal rela­tion­ships panned out’, and start with a prior of 11% that in your case, A→B.

This sort of dataset is pretty rare, although those which exist tend to indicate that our prior should be low. (For example, one analysis of a government jobs program obtained data on randomized participants & others, permitting comparison of randomized inference to standard regression approaches; it found roughly that 0⁄12 estimates, many of them statistically-significant, were reasonably similar to the causal effect for one job program & 4⁄12 for another job program, with the regression estimates for the former heavily biased.) Not great. Why are our best analyses & guesses at causal relationships so bad?

Shouldn’t It Be Easy?

We’d expect that the a priori odds are good: 1⁄3! After all, you can divvy up the possibilities as:

  1. A causes B (A→B)

  2. B causes A (A←B)

  3. both A and B are caused by C (A←C→B)

    (Possibly in a complex way, such as conditioning on unmentioned variables, like a phone-based survey inadvertently generating conclusions valid only for the people reachable by phone, causing amusing pseudo-correlations.)

If it’s either #1 or #2, we’re good and we’ve found a causal rela­tion­ship; it’s only out­come #3 which leaves us baf­fled & frus­trat­ed. If we were guess­ing at ran­dom, you’d expect us to still be right at least 33% of the time. (As in the joke about the lot­tery—y­ou’ll either win or lose, and you don’t know which, so it’s 50-50, and you like dem odd­s!) And we can draw on all sorts of knowl­edge to do bet­ter3

I think a lot of peo­ple tend to put a lot of weight on any observed cor­re­la­tion because of this intu­ition that a causal rela­tion­ship is nor­mal & prob­a­ble because, well, “how else could this cor­re­la­tion hap­pen if there’s no causal con­nec­tion between A & B‽” And fair enough—there’s no grand cos­mic con­spir­acy arrang­ing mat­ters to fool us by always putting in place a C fac­tor to cause sce­nario #3, right? If you ques­tion peo­ple, of course they know cor­re­la­tion does­n’t nec­es­sar­ily mean cau­sa­tion—ev­ery­one knows that—s­ince there’s always a chance of a lurk­ing con­found, and it would be great if you had a ran­dom­ized exper­i­ment to draw on; but you think with the data you have, not the data you wish you had, and can’t let the per­fect be the enemy of the bet­ter. So when some­one finds a cor­re­la­tion between A and B, it’s no sur­prise that sud­denly their lan­guage & atti­tude change and they seem to place great con­fi­dence in their favored causal rela­tion­ship even if they piously acknowl­edge “Yes, cor­re­la­tion is not cau­sa­tion, but… obvi­ously hang­ing out with fat peo­ple will make you fat / par­ents eg. smok­ing encour­ages their kids to smoke / when we gave babies a new drug, fewer went blind / female-­named hur­ri­canes increase death tolls due to sex­istly under­es­ti­mat­ing women / cor­re­lates so highly with AIDS that it must be another con­se­quence of HIV (ac­tu­ally caused by HHV-8 which is trans­mit­ted simul­ta­ne­ously with HIV) / vit­a­min and anti-ox­i­dant use (among many other ) will save lives / & asso­ciates with and thus surely causes schiz­o­phre­nia and other forms of insan­ity (de­spite increases in mar­i­juana use not fol­lowed by any schiz­o­phre­nia increases / cor­re­lates with mor­tal­ity reduc­tion in women so it def­i­nitely helps and does­n’t ” etc.

Besides the intu­itive­ness of cor­re­la­tion=­cau­sa­tion, we are also des­per­ate and want to believe: cor­rel­a­tive data is so rich and so plen­ti­ful, and exper­i­men­tal data so rare. If it is not usu­ally the case that cor­re­la­tion=­cau­sa­tion, then what exactly are we going to do for deci­sions and beliefs, and what exactly have we spent all our time to obtain? When I look at some dataset with a num­ber of vari­ables and I run a mul­ti­ple regres­sion and can report that vari­ables A, B, and C are all sta­tis­ti­cal­ly-sig­nif­i­cant and of large effec­t-­size when regressed on D, all I have really done is learned some­thing along the lines of “in a hypo­thet­i­cal dataset gen­er­ated in the exact same way, if I some­how was lack­ing data on D, I could make a bet­ter pre­dic­tion in a nar­row math­e­mat­i­cal sense of no impor­tance (squared error) based on A/B/C”. I have not learned whether A/B/C cause D, or whether I could pre­dict val­ues of D in the future, or any­thing about how I could inter­vene and manip­u­late any of A-D, or any­thing like that—rather, I have learned a small point about pre­dic­tion. To take a real exam­ple: when I learn that mod­er­ate alco­hol con­sump­tion means the actu­ar­ial pre­dic­tion of lifes­pan for drinkers should be increased slight­ly, why on earth would I care about this at all unless it was causal? When epi­demi­ol­o­gists emerge from a huge sur­vey report­ing tri­umphantly that steaks but not egg con­sump­tion slightly pre­dicts decreased lifes­pan, why would any­one care aside from per­haps life insur­ance com­pa­nies? Have you ever been abducted by space aliens and ordered as part of an inscrutable alien blood­-s­port to take a set of data about Mid­west Amer­i­cans born 1960–1969 with dietary pre­dic­tors you must com­bine lin­early to cre­ate pre­dic­tors of heart attacks under a squared error loss func­tion to out­pre­dict your fel­low abductees from across the galaxy? Prob­a­bly not. 
Why would any­one give them grant money for this, why would they spend their time on this, why would they read each oth­ers’ papers unless they had a “qua­si­-re­li­gious faith”4 that these cor­re­la­tions were more than just some coef­fi­cients in a pre­dic­tive mod­el—that they were causal? To quote Rut­ter 2007, most dis­cus­sions of cor­re­la­tions fall into two equally prob­lem­atic camps:

…all behav­ioral sci­en­tists are taught that sta­tis­ti­cally sig­nif­i­cant cor­re­la­tions do not nec­es­sar­ily mean any kind of causative effect. Nev­er­the­less, the lit­er­a­ture is full of stud­ies with find­ings that are exclu­sively based on cor­re­la­tional evi­dence. Researchers tend to fall into one of two camps with respect to how they react to the prob­lem.

  1. First, there are those who are care­ful to use lan­guage that avoids any direct claim for cau­sa­tion, and yet, in the dis­cus­sion sec­tion of their papers, they imply that the find­ings do indeed mean cau­sa­tion.
  2. Sec­ond, there are those that com­pletely accept the inabil­ity to make a causal infer­ence on the basis of sim­ple cor­re­la­tion or asso­ci­a­tion and, instead, take refuge in the claim that they are study­ing only asso­ci­a­tions and not cau­sa­tion. This sec­ond, “pure” approach sounds safer, but it is disin­gen­u­ous because it is dif­fi­cult to see why any­one would be inter­ested in sta­tis­ti­cal asso­ci­a­tions or cor­re­la­tions if the find­ings were not in some way rel­e­vant to an under­stand­ing of causative mech­a­nisms.

So, cor­re­la­tions tend to not be cau­sa­tion because it’s almost always #3, a shared cause. This com­mon­ness is con­trary to our expec­ta­tions, based on a sim­ple & unob­jec­tion­able obser­va­tion that of the 3 pos­si­ble rela­tion­ships, 2 are causal; and so we often rea­son as though cor­re­la­tion were strong evi­dence for cau­sa­tion. This leaves us with a para­dox: exper­i­men­tal results seem to con­tra­dict intu­ition. To resolve the para­dox, I need to offer a clear account of why shared causes/confounds are so com­mon, and hope­fully moti­vate a dif­fer­ent set of intu­itions.
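To make scenario #3 concrete, here is a minimal sketch under assumptions that are mine rather than the essay's: a linear toy model with Gaussian noise, in which a hidden C drives both A and B and there is no arrow between A and B at all. The names `pearson_r` and `simulate_confound` are illustrative. The observational correlation comes out sizable, while randomizing A (the "experiment") leaves essentially nothing.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_confound(n=20000, seed=0):
    """Return (observational r, experimental r) for the graph A <- C -> B."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]
    # Observational world: A and B each equal C plus independent noise.
    a_obs = [ci + rng.gauss(0, 1) for ci in c]
    b_obs = [ci + rng.gauss(0, 1) for ci in c]
    # Experimental world: A is assigned at random, cutting the C -> A arrow;
    # B is generated exactly as before, since A never influenced it.
    a_do = [rng.gauss(0, 1) for _ in range(n)]
    b_do = [ci + rng.gauss(0, 1) for ci in c]
    return pearson_r(a_obs, b_obs), pearson_r(a_do, b_do)


if __name__ == "__main__":
    r_obs, r_do = simulate_confound()
    print(f"observational r = {r_obs:.3f}, randomized r = {r_do:.3f}")
```

In this toy model the observational correlation is 0.5 in expectation (covariance 1, each variance 2), which a regression would happily report as a large "effect" of A on B.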

What a Tangled Net We Weave When First We Practice to Believe

“…we think so much rever­sal is based on ‘We think some­thing should work, and so we’re going to adopt it before we know that it actu­ally does work,’ and one of the rea­sons for this is because that’s how med­ical edu­ca­tion is struc­tured. We learn the bio­chem­istry, the phys­i­ol­o­gy, the patho­phys­i­ol­ogy as the very first things in med­ical school. And over the first two years we kind of get con­vinced that every­thing works mech­a­nis­ti­cally the way we think it does.”

Adam Cifu5

Here’s where Bayes nets & causal networks (seen previously on LW & Michael Nielsen) come up. Even simulating the simplest possible model of linear regression, adding covariates barely increases the probability of correctly inferring direction of causality, and the effect sizes remain badly imprecise (Walker 2014). And when networks are inferred on real-world data, they look gnarly: tons of nodes, tons of arrows pointing all over the place. Daphne Koller early on in her Probabilistic Graphical Models course shows an example from a medical setting where the network has like 600 nodes and you can’t understand it at all. When you look at a biological causal network like metabolism:

“A Toolkit Sup­port­ing For­mal Rea­son­ing about Causal­ity in Meta­bolic Net­works”

You start to appre­ci­ate how every­thing might be cor­re­lated with every­thing, but (usu­al­ly) not cause each oth­er.

This is not too surprising if you step back and think about it: life is complicated, we have limited resources, and everything has a lot of moving parts. (How many discrete parts does an airplane have? Or your car? Or a single cell? Or think about a chess player analyzing a position: ‘if my bishop goes there, then the other pawn can go here, which opens up a move there or here, but of course, they could also do that or try an en passant in which case I’ll be down in material but up on initiative in the center, which causes an overall shift in tempo…’) Fortunately, these networks are still simple compared to what they could be, since most nodes aren’t directly connected to each other, which tamps down on the combinatorial explosion of possible networks. (How many different causal networks are possible if you have 600 nodes to play with? The exact answer is complicated but it’s much larger than 2^600, so very large!)
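For a sense of scale, the number of labeled DAGs on n nodes can be computed exactly with Robinson's recurrence; the sketch below is one way to do it, with `count_dags` being an illustrative name of my own. A simpler sanity check needs no recurrence: fix an ordering of the nodes, and each of the n(n-1)/2 forward pairs may or may not carry an arrow, so there are at least 2^(n(n-1)/2) DAGs, which for n = 600 is 2^179700.

```python
from math import comb


def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:
    a(m) = sum_{k=1..m} (-1)^(k+1) * C(m, k) * 2^(k*(m-k)) * a(m-k), a(0) = 1.
    """
    a = [1]  # a[0]: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, count_dags(n))
```

Python's arbitrary-precision integers make `count_dags(600)` feasible as well, though the result runs to tens of thousands of digits.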

One inter­est­ing thing I man­aged to learn from PGM (be­fore con­clud­ing it was too hard for me and I should try it lat­er) was that in a Bayes net even if two nodes were not in a sim­ple direct cor­re­la­tion rela­tion­ship A→B, you could still learn a lot about A from set­ting B to a val­ue, even if the two nodes were ‘way across the net­work’ from each oth­er. You could trace the influ­ence flow­ing up and down the path­ways to some sur­pris­ingly dis­tant places if there weren’t any block­ers.

The big­ger the net­work, the more pos­si­ble com­bi­na­tions of nodes to look for a pair­wise cor­re­la­tion between them (eg If there are 10 nodes/variables and you are look­ing at bivari­ate cor­re­la­tions, then you have 10 choose 2 = 45 pos­si­ble com­par­isons, and with 20, 190, and 40, 780. 40 vari­ables is not that much for many real-­world prob­lem­s.) A lot of these com­bos will yield some sort of cor­re­la­tion. But does the num­ber of causal rela­tion­ships go up as fast? I don’t think so (although I can’t prove it).

If not, then as causal net­works get big­ger, the num­ber of gen­uine cor­re­la­tions will explode but the num­ber of gen­uine causal rela­tion­ships will increase slow­er, and so the frac­tion of cor­re­la­tions which are also causal will col­lapse.

(Or more concretely: suppose you generated a randomly connected causal network with x nodes and y arrows, where each arrow has some random noise in it; count how many pairs of nodes are in a causal relationship; now, n times initialize the root nodes to random values, generate a possible state of the network, & store the values for each node; count how many pairwise correlations there are between all the nodes using the n samples (using an appropriate significance test & alpha if one wants); divide # of causal relationships by # of correlations, store; return to the beginning and resume with x+1 nodes and y+1 arrows… As one graphs each value of x against its respective estimated fraction, does the fraction head toward 0 as x increases? My thesis is it does. Or, since there must be at least as many causal relationships in a graph as there are arrows, you could simply use the number of arrows as a lower bound on the numerator of the fraction.)
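The procedure just described can be sketched roughly as follows. Everything specific here is an assumption of mine, not something the essay fixes: arrows are drawn uniformly among forward pairs of an arbitrary node ordering, each node is a weighted sum of its parents plus unit Gaussian noise, and "correlated" means |r| above about 2/sqrt(n), a crude stand-in for a p < 0.05 test. All function names are hypothetical.

```python
import random
from itertools import combinations


def random_dag(n_nodes, n_arrows, rng):
    """Pick distinct arrows i -> j with i < j, which guarantees acyclicity."""
    pairs = list(combinations(range(n_nodes), 2))
    edges = rng.sample(pairs, min(n_arrows, len(pairs)))
    parents = {j: [] for j in range(n_nodes)}
    for i, j in edges:
        weight = rng.choice([-1, 1]) * rng.uniform(0.5, 1.5)
        parents[j].append((i, weight))
    return parents


def causal_pairs(parents, n_nodes):
    """Count pairs of nodes joined by a directed path (ancestor, descendant)."""
    ancestors = [set() for _ in range(n_nodes)]
    for j in range(n_nodes):  # index order is a topological order
        for i, _ in parents[j]:
            ancestors[j] |= ancestors[i] | {i}
    return sum(len(a) for a in ancestors)


def sample_network(parents, n_nodes, n_samples, rng):
    """Draw n_samples joint states: node = sum(weight * parent) + noise."""
    rows = []
    for _ in range(n_samples):
        v = [0.0] * n_nodes
        for j in range(n_nodes):
            v[j] = sum(w * v[i] for i, w in parents[j]) + rng.gauss(0, 1)
        rows.append(v)
    return rows


def correlated_pairs(rows, n_nodes):
    """Count pairs whose sample |r| exceeds a rough 2/sqrt(n) cutoff."""
    n = len(rows)
    cols = [[row[j] for row in rows] for j in range(n_nodes)]
    centered = [[x - sum(c) / n for x in c] for c in cols]
    norms = [sum(x * x for x in c) ** 0.5 for c in centered]
    cutoff = 2 / n ** 0.5
    count = 0
    for i, j in combinations(range(n_nodes), 2):
        dot = sum(a * b for a, b in zip(centered[i], centered[j]))
        if abs(dot / (norms[i] * norms[j])) > cutoff:
            count += 1
    return count


def causal_vs_correlated(n_nodes, n_arrows, n_samples=500, seed=0):
    """Return (# causal pairs, # correlated pairs) for one random network."""
    rng = random.Random(seed)
    parents = random_dag(n_nodes, n_arrows, rng)
    rows = sample_network(parents, n_nodes, n_samples, rng)
    return causal_pairs(parents, n_nodes), correlated_pairs(rows, n_nodes)


if __name__ == "__main__":
    for x in (10, 20, 30, 40):
        causal, correlated = causal_vs_correlated(x, 2 * x)
        print(x, causal, correlated, causal / max(correlated, 1))
```

How the ratio behaves depends heavily on how y scales with x (the `2 * x` in the usage loop is arbitrary) and on the cutoff, so a single run like this is a way to explore the thesis, not a test of it.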

It turns out, we weren’t sup­posed to be rea­son­ing ‘there are 3 cat­e­gories of pos­si­ble rela­tion­ships, so we start with 33%’, but rather: ‘there is only one expla­na­tion “A causes B”, only one expla­na­tion “B causes A”, but there are many expla­na­tions of the form “C1 causes A and B”, “C2 causes A and B”, “C3 causes A and B”…’, and the more nodes in a field’s true causal net­works (psy­chol­ogy or biol­ogy vs physics, say), the big­ger this last cat­e­gory will be.

The real world is the largest of causal net­works, so it is unsur­pris­ing that most cor­re­la­tions are not causal, even after we clamp down our data col­lec­tion to nar­row domains. Hence, our prior for “A causes B” is not 50% (it’s either true or false) nor is it 33% (ei­ther A causes B, B causes A, or mutual cause C) but some­thing much small­er: the num­ber of causal rela­tion­ships divided by the num­ber of pair­wise cor­re­la­tions for a graph, which ratio can be roughly esti­mated on a field­-by-­field basis by look­ing at exist­ing work or directly for a par­tic­u­lar prob­lem (per­haps one could derive the frac­tion based on the prop­er­ties of the small­est inferrable graph that fits large datasets in that field). And since the larger a cor­re­la­tion rel­a­tive to the usual cor­re­la­tions for a field, the more likely the two nodes are to be close in the causal net­work and hence more likely to be joined causal­ly, one could even give causal­ity esti­mates based on the size of a cor­re­la­tion (eg. an r = 0.9 leaves less room for con­found­ing than an r of 0.1, but how much will depend on the causal net­work).

This is exactly what we see. How do you treat cancer? Thousands of treatments get tried before one works. How do you deal with poverty? Most programs are not even wrong. Or how do you fix societal woes in general? Most attempts fail miserably and the higher-quality your studies, the worse attempts look. This even explains why “everything is correlated” and Andrew Gelman’s dictum about how coefficients are never zero: the reason large datasets find most of their variables to have non-zero correlations (often reaching statistical-significance) is because the data is being drawn from large complicated causal networks in which almost everything really is correlated with everything else.

And thus I was enlight­ened.


Since I know so lit­tle about causal mod­el­ing, I asked our local causal researcher Ilya Shpitser to maybe leave a com­ment about whether the above was triv­ially wrong / already-proven / well-­known folk­lore / etc; for con­ve­nience, I’ll excerpt the core of his com­ment:

“But does the number of causal relationships go up just as fast? I don’t think so (although at the moment I can’t prove it).”

I am not sure exactly what you mean, but I can think of a for­mal­iza­tion where this is not hard to show. We say A “struc­turally causes” B in a DAG G if and only if there is a directed path from A to B in G. We say A is “struc­turally depen­dent” with B in a DAG G if and only if there is a mar­ginal d-con­nect­ing path from A to B in G.

A mar­ginal d-con­nect­ing path between two nodes is a path with no con­sec­u­tive edges of the form * → * ← * (that is, no col­lid­ers on the path). In other words all directed paths are mar­ginal d-con­nect­ing but the oppo­site isn’t true.

The jus­ti­fi­ca­tion for this def­i­n­i­tion is that if A “struc­turally causes” B in a DAG G, then if we were to inter­vene on A, we would observe B change (but not vice ver­sa) in “most” dis­tri­b­u­tions that arise from causal struc­tures con­sis­tent with G. Sim­i­lar­ly, if A and B are “struc­turally depen­dent” in a DAG G, then in “most” dis­tri­b­u­tions con­sis­tent with G, A and B would be mar­gin­ally depen­dent (e.g. what you prob­a­bly mean when you say ‘cor­re­la­tions are there’).

I qual­ify with “most” because we can­not simul­ta­ne­ously rep­re­sent depen­dences and inde­pen­dences by a graph, so we have to choose. Peo­ple have cho­sen to rep­re­sent inde­pen­dences. That is, if in a DAG G some arrow is miss­ing, then in any dis­tri­b­u­tion (causal struc­ture) con­sis­tent with G, there is some sort of inde­pen­dence (miss­ing effec­t). But if the arrow is not miss­ing we can­not say any­thing. Maybe there is depen­dence, maybe there is inde­pen­dence. An arrow may be present in G, and there may still be inde­pen­dence in a dis­tri­b­u­tion con­sis­tent with G. We call such dis­tri­b­u­tions “unfaith­ful” to G. If we pick dis­tri­b­u­tions con­sis­tent with G ran­dom­ly, we are unlikely to hit on unfaith­ful ones (sub­set of all dis­tri­b­u­tions con­sis­tent with G that is unfaith­ful to G has mea­sure zero), but Nature does not pick ran­dom­ly.. so unfaith­ful dis­tri­b­u­tions are a wor­ry. They may arise for sys­tem­atic rea­sons (maybe equi­lib­rium of a feed­back process in bio?)

If you accept above def­i­n­i­tion, then clearly for a DAG with n ver­tices, the num­ber of pair­wise struc­tural depen­dence rela­tion­ships is an upper bound on the num­ber of pair­wise struc­tural causal rela­tion­ships. I am not aware of any­one hav­ing worked out the exact com­bi­na­torics here, but it’s clear there are many many more paths for struc­tural depen­dence than paths for struc­tural causal­i­ty.

But what you actu­ally want is not a DAG with n ver­tices, but another type of graph with n ver­tices. The “Uni­verse DAG” has a lot of ver­tices, but what we actu­ally observe is a very small sub­set of these ver­tices, and we mar­gin­al­ize over the rest. The trou­ble is, if you start with a dis­tri­b­u­tion that is con­sis­tent with a DAG, and you mar­gin­al­ize over some things, you may end up with a dis­tri­b­u­tion that isn’t well rep­re­sented by a DAG. Or “DAG mod­els aren’t closed under mar­gin­al­iza­tion.”

That is, if our DAG is A → B ← H → C ← D, and we mar­gin­al­ize over H because we do not observe H, what we get is a dis­tri­b­u­tion where no DAG can prop­erly rep­re­sent all con­di­tional inde­pen­dences. We need another kind of graph.

In fact, peo­ple have come up with a mixed graph (con­tain­ing → arrows and ⟺ arrows) to rep­re­sent mar­gins of DAGs. Here → means the same as in a causal DAG, but ⟺ means “there is some sort of com­mon cause/confounder that we don’t want to explic­itly write down.” Note: ⟺ is not a cor­rel­a­tive arrow, it is still encod­ing some­thing causal (the pres­ence of a hid­den com­mon cause or caus­es). I am being loose here—in fact it is the absence of arrows that means things, not the pres­ence.

I do a lot of work on these kinds of graphs, because these graphs are the sensible representation of the data we typically get: data drawn from a marginal of a joint distribution consistent with a big unknown DAG.

But the com­bi­na­torics work out the same in these graph­s—the num­ber of mar­ginal d-con­nected paths is much big­ger than the num­ber of directed paths. This is prob­a­bly the source of your intu­ition. Of course what often hap­pens is you do have a (weak) causal link between A and B, but a much stronger non-­causal link between A and B through an unob­served com­mon par­ent. So the causal link is hard to find with­out “tricks.”
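The combinatorial claim can be probed by brute force on random graphs. This is only a sketch under stated assumptions: the random-DAG generator (each possible edge i → j with i < j kept independently, averaging about 2 parents per node) is an arbitrary choice of mine, and the function names are hypothetical. It counts pairs joined by a directed path ("structurally causal") against pairs that are marginally d-connected, which for an empty conditioning set means the two nodes share an ancestor (counting each node as its own ancestor).

```python
import random
from itertools import combinations

def random_dag(n, p, rng):
    """parents[j] = set of i < j with an edge i -> j, each kept with prob. p."""
    return [{i for i in range(j) if rng.random() < p} for j in range(n)]

def ancestors_inclusive(parents):
    """anc[j] = {j} plus every node with a directed path into j."""
    anc = []
    for j, ps in enumerate(parents):
        s = {j}
        for i in ps:
            s |= anc[i]          # i < j, so anc[i] is already complete
        anc.append(s)
    return anc

def count_pairs(parents):
    """Return (#pairs with a directed path, #pairs marginally d-connected)."""
    anc = ancestors_inclusive(parents)
    causal = dependent = 0
    for i, j in combinations(range(len(parents)), 2):
        if i in anc[j]:          # nodes are topologically ordered, so only i -> ... -> j is possible
            causal += 1
        if anc[i] & anc[j]:      # shared ancestor => an open collider-free path
            dependent += 1
    return causal, dependent

rng = random.Random(0)
for n in (25, 50, 100, 200, 400):
    causal, dependent = count_pairs(random_dag(n, 2 / n, rng))
    total = n * (n - 1) // 2
    print(f"n={n:4d}  pairs={total:6d}  causal={causal:6d}  dependent={dependent:6d}")
```

By construction every causally related pair is also counted as dependent, so the second count is an upper bound on the first, as stated above. How large the gap is will depend heavily on the assumed graph density and shape, so the printed numbers illustrate the idea rather than estimate anything about real causal networks.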

What is to be done?

Shouting 2+2=4 From the Rooftops

“There is noth­ing that plays worse in our cul­ture than seem­ing to be the stodgy defender of old ideas, no mat­ter how true those ideas may be. Luck­i­ly, at this point the ortho­doxy of the aca­d­e­mic econ­o­mists is very much a minor­ity posi­tion among intel­lec­tu­als in gen­er­al; one can seem to be a coura­geous mav­er­ick, boldly chal­leng­ing the pow­ers that be, by recit­ing the con­tents of a stan­dard text­book. It has worked for me!”

Paul Krugman, “Ricardo’s Difficult Idea”

To go on a tan­gent: why is it so impor­tant that we ham­mer this in?

The unreliability is bad enough, but I’m also worried that the knowledge that correlation ≠ causation, one of the core ideas of the scientific method and fundamental to fields like modern medicine, is going underappreciated and is being abandoned by meta-contrarians due to its inconvenience. Pointing it out is “nothing helpful” or “meaningless”, and justified skepticism is actually just “a dumb-ass thing to say”, a “statistical cliché that closes threads and ends debates, the freshman platitude turned final shutdown” often used by “party poopers” & “Internet blowhards” to serve an “agenda” & is sometimes “a dog whistle”; in practice, such people seem to go well beyond the XKCD comic and proceed to take any correlations they like as strong evidence for causation, and to treat any disagreement as an unsophisticated middlebrow dismissal & denialism.

Insisting correlation ≅ causation (“Bilge”, 2013)

So it’s unsurprising that one so often runs into researchers for whom indeed correlation = causation (we certainly wouldn’t want to be freshmen or Internet blowhards, would we?). It is common for papers to use causal language and make recommendations (Prasad et al 2013), but even when the papers don’t, you can be sure to see the authors confidently talking causally to other researchers or journalists or officials. (I’ve noticed that this sort of constant motte-and-bailey slide, from vague mentions of how results are correlative tucked away at the end of the paper to freely dispensing advice for policymakers about how their research proves X should be implemented, is particularly common in medicine, sociology, and education.)

Bandying phrases with meta-contrarians won’t help much here; I agree with them that correlation ought to be some evidence for causation. Eg. if I suspect that A→B, and I collect data and establish beyond doubt that A & B correlate at r = 0.7, surely this observation, which is consistent with my theory, should boost my confidence in my theory, just as an observation like r = 0.0001 would trouble me greatly. But how much…?

As it is, it seems we fall read­ily into intel­lec­tual traps of our own mak­ing. When you believe every cor­re­la­tion adds sup­port to your causal the­o­ry, you just get more and more wrong as you col­lect more data.
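One way to make the “how much?” question concrete is Bayes’ rule: the update depends on how much more likely the observed correlation is if A→B than if it is not. The sketch below uses made-up illustrative probabilities (every number in it is a hypothetical assumption, as is the function name `posterior_causal`); it only shows how the answer moves as confounded correlations become more common.

```python
def posterior_causal(prior, p_corr_if_causal, p_corr_if_not):
    """P(A->B | correlation observed), by Bayes' rule.

    prior            -- P(A->B) before seeing the data
    p_corr_if_causal -- P(observe the correlation | A->B)
    p_corr_if_not    -- P(observe the correlation | no A->B), eg. via a confounder
    """
    numerator = prior * p_corr_if_causal
    return numerator / (numerator + (1 - prior) * p_corr_if_not)

# Hypothetical world where correlations rarely arise without causation:
sparse = posterior_causal(prior=0.10, p_corr_if_causal=0.95, p_corr_if_not=0.05)

# Hypothetical world where confounding makes correlations nearly ubiquitous:
dense = posterior_causal(prior=0.10, p_corr_if_causal=0.95, p_corr_if_not=0.80)

print(f"posterior if non-causal correlations are rare:   {sparse:.2f}")
print(f"posterior if non-causal correlations are common: {dense:.2f}")
```

With the same prior and the same data, the posterior barely moves off the prior once correlations are nearly as likely without causation as with it. The correlation is still some evidence, but the likelihood ratio is close to 1.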

Heuristics & Biases

Now assuming the foregoing to be right (which I’m not sure about; in particular, I’m dubious that correlations in causal nets really do increase much faster than causal relations do), what’s the psychology of this? I see a few major ways that people might be incorrectly reasoning when they overestimate the evidence given by a correlation:

  • they might be aware of the imbal­ance between cor­re­la­tions and cau­sa­tion, but under­es­ti­mate how much more com­mon cor­re­la­tion becomes com­pared to cau­sa­tion.

    This could be shown by giv­ing causal dia­grams and see­ing how elicited prob­a­bil­ity changes with the size of the dia­grams: if the prob­a­bil­ity is con­stant, then the sub­jects would seem to be con­sid­er­ing the rela­tion­ship in iso­la­tion and ignor­ing the con­text.

    It might be reme­di­a­ble by show­ing a net­work and jar­ring peo­ple out of a sim­plis­tic com­par­i­son approach.

  • they might not be rea­son­ing in a causal-net frame­work at all, but start­ing from the naive 33% base-rate you get when you treat all 3 kinds of causal rela­tion­ships equal­ly.

    This could be shown by elic­it­ing esti­mates and see­ing whether the esti­mates tend to look like base rates of 33% and mod­i­fi­ca­tions there­of.

    Sterner mea­sures might be need­ed: could we draw causal nets with not just arrows show­ing influ­ence but also another kind of arrow show­ing cor­re­la­tions? For exam­ple, the arrows could be drawn in black, inverse cor­re­la­tions drawn in red, and reg­u­lar cor­re­la­tions drawn in green. The pic­ture would be rather messy, but sim­ply by com­par­ing how few black arrows there are to how many green and red ones, it might visu­ally make the case that cor­re­la­tion is much more com­mon than cau­sa­tion.

  • alternately, they may really be reasoning causally and suffer from a truly deep & persistent cognitive illusion that when people say ‘correlation’ it’s really a kind of causation, and not understand the technical meaning of ‘correlation’ in the first place (which is not as unlikely as it may sound, given examples like the demonstration of the persistence of Aristotelian folk-physics in physics students, as all they had learned was guessing passwords; on the test used, see eg Halloun & Hestenes 1985 & Hestenes et al 1992); in which case it’s not surprising that if they think they’ve been told a relationship is ‘causation’, then they’ll think the relationship is causation. Ilya remarks:

    Pearl has this hypothesis that a lot of probabilistic fallacies/paradoxes/biases are due to the fact that causal and not probabilistic relationships are what our brain natively thinks about. So e.g. Simpson’s paradox is surprising because we intuitively think of a conditional distribution (where conditioning can change anything!) as a kind of “interventional distribution” (no Simpson’s type reversal under interventions: “Understanding Simpson’s Paradox”, Pearl 2014 [see also Pearl’s comments on Nielsen’s blog]).

    This hypoth­e­sis would claim that peo­ple who haven’t looked into the math just inter­pret state­ments about con­di­tional prob­a­bil­i­ties as about “inter­ven­tional prob­a­bil­i­ties” (or what­ever their intu­itive ana­logue of a causal thing is).

    This might be testable by try­ing to iden­tify sim­ple exam­ples where the two approaches diverge, sim­i­lar to Heste­nes’s quiz for diag­nos­ing belief in folk-­physics.
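As a sketch of what such a diverging example could look like, here is a three-variable model in which a confounder C drives both X and Y while X has no effect on Y at all. All the probabilities are hypothetical values chosen for illustration, and the names are mine. The conditional probability P(Y | X) and the interventional probability P(Y | do(X)), computed here by adjusting for C, give different answers.

```python
# Hypothetical model: C -> X and C -> Y, with no arrow X -> Y.
P_C = {0: 0.5, 1: 0.5}                      # P(C = c)
P_X1_GIVEN_C = {0: 0.1, 1: 0.9}             # P(X = 1 | C = c)
P_Y1_GIVEN_XC = {(0, 0): 0.2, (1, 0): 0.2,  # P(Y = 1 | X = x, C = c);
                 (0, 1): 0.8, (1, 1): 0.8}  # note it ignores x entirely

def p_x_given_c(x, c):
    p1 = P_X1_GIVEN_C[c]
    return p1 if x == 1 else 1 - p1

def conditional(x):
    """P(Y=1 | X=x): what we see among units that happen to have X=x."""
    joint = sum(P_C[c] * p_x_given_c(x, c) * P_Y1_GIVEN_XC[(x, c)] for c in P_C)
    marginal = sum(P_C[c] * p_x_given_c(x, c) for c in P_C)
    return joint / marginal

def interventional(x):
    """P(Y=1 | do(X=x)): set X=x for everyone, leaving C's distribution alone."""
    return sum(P_C[c] * P_Y1_GIVEN_XC[(x, c)] for c in P_C)

for x in (0, 1):
    print(f"x={x}: P(Y=1|X=x)={conditional(x):.2f}  "
          f"P(Y=1|do(X=x))={interventional(x):.2f}")
```

Someone who reads the conditional probabilities as if they were interventional would conclude that setting X = 1 raises the chance of Y considerably, when in this model forcing X does nothing. A quiz item built on a model like this would separate the two readings.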


Everything correlates with everything

Sta­tis­ti­cal folk­lore asserts that “every­thing is cor­re­lated”: in any real-­world dataset, most or all mea­sured vari­ables will have non-zero cor­re­la­tions, even between vari­ables which appear to be com­pletely inde­pen­dent of each oth­er, and that these cor­re­la­tions are not merely sam­pling error flukes but will appear in large-s­cale datasets to arbi­trar­ily des­ig­nated lev­els of sta­tis­ti­cal-sig­nif­i­cance or pos­te­rior prob­a­bil­i­ty.

This raises seri­ous ques­tions for nul­l-hy­poth­e­sis sta­tis­ti­cal-sig­nif­i­cance test­ing, as it implies the null hypoth­e­sis of 0 will always be rejected with suf­fi­cient data, mean­ing that a fail­ure to reject only implies insuf­fi­cient data, and pro­vides no actual test or con­fir­ma­tion of a the­o­ry. Even a direc­tional pre­dic­tion is min­i­mal con­fir­ma­tion since there is a 50% chance of pick­ing the right direc­tion at ran­dom.
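The “sufficient data” point can be made quantitative with the standard Fisher z-transformation approximation for the sample size needed to detect a correlation. This is a rough sketch: the 5% two-sided significance level and 80% power are conventional defaults I am assuming, and `n_to_detect` is a hypothetical helper name.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_to_detect(r, alpha=0.05, power=0.80):
    """Approximate n needed to reject rho = 0 when the true correlation is r.

    Uses the Fisher z approximation: atanh(sample r) is roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

for r in (0.3, 0.1, 0.01, 0.001):
    print(f"true r = {r:<6} needs roughly n = {n_to_detect(r):,}")
```

The required n grows roughly as 1/r², so a correlation of 0.01 needs tens of thousands of observations and one of 0.001 needs millions. Datasets of that size now exist, which is why a nonzero-but-tiny true correlation will eventually be declared statistically-significant.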

It also has impli­ca­tions for con­cep­tu­al­iza­tions of the­o­ries & causal mod­els, inter­pre­ta­tions of struc­tural mod­els, and other sta­tis­ti­cal prin­ci­ples such as the “spar­sity prin­ci­ple”.

Main article: “Everything Is Correlated”.

  1. Although this may have been sug­gested:

    I used to read a mag­a­zine called Milo that cov­ered a bunch of dif­fer­ent strength sports. I ended my sub­scrip­tion after read­ing an arti­cle in which an entirely seri­ous author wrote about how he noticed that shortly after he started hear­ing birds singing in the morn­ing, plants started to grow. His con­clu­sion was that bird­song made plants grow. If I remem­ber cor­rect­ly, he then con­cluded that it was the vibra­tions in the bird­song that made the plants grow, there­fore vibra­tions were good for strength, there­fore you could make your mus­cles grow through being exposed to cer­tain types of vibra­tions, i.e. bird­song. It was my favorite arti­cle of all time, just for the way the guy started out so absurdly wrong and just kept dig­ging.

    I used to read old weight train­ing books. In one of them the author proudly recalled how his sec­re­tary had asked him for advice on how to lose weight. This guy went around study­ing all the sec­re­taries and noticed that the thin ones sat more upright com­pared to the fat ones. He then rec­om­mended to his sec­re­tary that she sit more upright, and if she did this she would lose weight. What I loved most about that whole story was that the guy was so proud of his analy­sis and con­clu­sion that he made it an entire chap­ter of his book, and that no one in the entire pub­lish­ing chain from the writer to the edi­tor to the proof­reader to the librar­ian who put the book on the shelves noticed any prob­lems with any of it.

  2. Slate pro­vides a nice exam­ple from Pear­son’s The Gram­mar of Sci­ence (pg407):

    All cau­sa­tion as we have defined it is cor­re­la­tion, but the con­verse is not nec­es­sar­ily true, i.e. where we find cor­re­la­tion we can­not always pre­dict cau­sa­tion. In a mixed African pop­u­la­tion of Kaf­firs and Euro­peans, the for­mer may be more sub­ject to small­pox, yet it would be use­less to assert dark­ness of skin (and not absence of vac­ci­na­tion) as a cause.

  3. Like temporal order or biological plausibility. For example, in medicine you can generally rule out some of the relationships this way: if you find a correlation between taking supertetrohydracyline™ and then later one’s depression (or flu symptoms or…) getting better, what does this mean? We have 3 general patterns: A→B, A←B, and A←C→B. It seems unlikely that #2 (A←B), ‘curing depression causes taking supertetrohydracyline™ previously in time’, is true since that requires time travel; we can rule that one out. So, the causal relationship is probably either #1 (A→B), direct causation (supertetrohydracyline™ cures depression), or #3 (A←C→B), a common cause and confounding, in which some third variable is responsible for both outcomes (like ‘doctors prescribe supertetrohydracyline™ to patients who are getting better’, or some other process that leads to differential treatment, such as doctors prescribing supertetrohydracyline™ to patients they think have the best prognosis). We may not know which, but at least the temporal order did let us rule out one of the 3 possibilities, which is a start.

  4. I bor­row this phrase from the paper “Look­ing to the 21st cen­tu­ry: have we learned from our mis­takes, or are we doomed to com­pound them?”, Shapiro 2004:

    In 1968, when I attended a course in epi­demi­ol­ogy 101, Dick Mon­son was fond of point­ing out that when it comes to rel­a­tive risk esti­mates, epi­demi­ol­o­gists are not intel­lec­tu­ally supe­rior to apes. Like them, we can count only three num­bers: 1, 2 and BIG (I am indebted to Allen Mitchell for Fig­ure 7). In ade­quately designed stud­ies we can be rea­son­ably con­fi­dent about BIG rel­a­tive risks, some­times; we can be only guard­edly con­fi­dent about rel­a­tive risk esti­mates of the order of 2.0, occa­sion­al­ly; we can hardly ever be con­fi­dent about esti­mates of less than 2.0, and when esti­mates are much below 2.0, we are quite sim­ply out of busi­ness. Epi­demi­ol­o­gists have only prim­i­tive tools, which for small rel­a­tive risks are too crude to enable us to dis­tin­guish between bias, con­found­ing and cau­sa­tion.

    …To illus­trate that point, I have to allude to a prob­lem that is usu­ally avoided because to men­tion it in pub­lic is con­sid­ered impo­lite: I refer to bias (un­con­scious, to be sure, but bias all the same) on the part of the inves­ti­ga­tor. And in order not to obscure the issue by con­sid­er­ing stud­ies of ques­tion­able qual­i­ty, I have cho­sen the exam­ple of puta­tively causal (or pre­ven­tive) asso­ci­a­tions pub­lished by the Nurses Health Study (NHS). For that study, the inves­ti­ga­tors have repeat­edly claimed that their meth­ods are almost per­fect. Over the years, the NHS inves­ti­ga­tors have pub­lished a tor­rent of papers and Fig­ure 8 gives an entirely fic­ti­tious but nonethe­less valid dis­tri­b­u­tion of the rel­a­tive risk esti­mates derived from them (for rel­a­tive risk esti­mates of less than uni­ty, assume the inverse val­ues). The over­whelm­ing major­ity of the esti­mates have been less than 2 and mostly less than 1.5, and the great major­ity have been inter­preted as causal (or pre­ven­tive). Well, per­haps they are and per­haps they are not: we can­not tell. But, per­haps as a mat­ter of qua­si­-re­li­gious faith, the inves­ti­ga­tors have to believe that the small risk incre­ments they have observed can be inter­preted and that they can be inter­preted as causal (or pre­ven­tive). Oth­er­wise they can hardly jus­tify their own exis­tence. They have no choice but to ignore Fein­stein’s dic­tum [Sev­eral years ago, Alvan Fein­stein made the point that if some sci­en­tific fal­lacy is demon­strated and if it can­not be rebutted, a con­ve­nient way around the prob­lem is sim­ply to pre­tend that it does not exist and to ignore it.]

  5. Apropos of Ending Medical Reversal, estimating a ~40% error rate in medical interventions.