Everything Is Correlated (gwern.net)
162 points by 19h on May 2, 2019 | 53 comments



When the correlation is close to 0 it's often because of a feedback loop.

For example, in an economy with a central bank trying to hit an inflation target, interest rates and inflation will have near-zero correlation (interest rates change but inflation stays roughly constant). That's because the central bank adjusts interest rates to counter other variables so that inflation remains near the target.

Another example (my favorite; it was mind-blowing when my teacher showed it to us in econometrics as a warning :) ): the gas pedal and the speed of a car driving on a hilly road. The driver wants to stay near the speed limit, so he adjusts the gas pedal to keep the speed constant. The simplistic conclusion would be: the speed is constant despite the gas pedal position changing, therefore they are unrelated :)
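Here's a rough sketch of that in code (Python/NumPy; the dynamics and every number are invented, so treat it as a caricature rather than a model): the driver acts as a simple feedback controller, the pedal ends up swinging with the hills while the speed barely moves, and the pedal-speed correlation comes out near zero even though the pedal is what drives the speed.

  import numpy as np

  rng = np.random.default_rng(0)
  n = 5000
  slope = np.cumsum(rng.normal(0, 0.02, n))   # hilly road: the grade drifts like a random walk

  target, v, p = 30.0, 30.0, 1.0              # speed limit, current speed, pedal position
  speed, pedal = np.empty(n), np.empty(n)
  for t in range(n):
      p += 0.5 * (target - v)                 # driver nudges the pedal toward the target speed
      v += 0.1 * p - 0.1 * slope[t] - 0.02 * (v - 25.0)   # gas speeds you up; hills and drag slow you down
      speed[t] = v + rng.normal(0, 0.1)       # recorded speed, with a little speedometer jitter
      pedal[t] = p

  print(np.corrcoef(pedal, speed)[0, 1])      # near 0: the speed barely varies
  print(np.corrcoef(pedal, slope)[0, 1])      # strongly positive: the pedal mirrors the hills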


That's a good point. Another one I forgot to make: given the established empirical reality of 'everything is correlated', if you find a variable which does in fact seem to be independent of most or everything else, that alone makes that variable suspicious - it suggests that it may be a pseudo-variable, composed largely or entirely of measurement error/randomness, or perhaps afflicted by a severe selection bias or other problem (such as range restriction or Berkson's paradox eliminating the real correlation).

Somewhat similarly, because 'everything is heritable', if you run into a human trait which is not heritable at all and is precisely estimated at h^2~0, that casts considerable doubt on whether you have a real trait at all. (I've seen this happen to a few latent variables extracted by factor analysis: they have near-zero heritability in a twin study and, on further investigation, turn out to have been just sampling error or bad factor analysis in the first place, and don't replicate or predict anything or satisfy any of the criteria you might use to decide if a trait is 'real'.)


That's very interesting. In the car driving example we can define three variables: 1) Throttle 2) Speed 3) Elevation derivative

If "3" is constant (ex: flat terrain) then "1" and "2" will have strong correlation. However if "2" is constant (ex: cruise control) as in your example, "1" and "3" will have strong correlation.

In the economic example, however, this kind of analysis would be much more complex and would have to take plenty of variables into account.
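A tiny companion sketch to the one above (again Python/NumPy with invented numbers) for the first regime: hold the elevation constant, let the throttle wander, and the throttle-speed correlation comes back.

  import numpy as np

  rng = np.random.default_rng(1)
  n = 5000
  pedal = 1.0 + np.cumsum(rng.normal(0, 0.02, n))   # driver varies the throttle freely
  v, speed = 30.0, np.empty(n)
  for t in range(n):
      v += 0.1 * pedal[t] - 0.02 * (v - 25.0)       # flat road: no slope term, just drag
      speed[t] = v

  print(np.corrcoef(pedal, speed)[0, 1])            # strongly positive: "1" and "2" correlate when "3" is constant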


The key point is identifying those variables and ensuring they remain constant (in that example: tire pressure, elevation, fuel load, etc.).


> Another example (my favorite; it was mind-blowing when my teacher showed it to us in econometrics as a warning :) ): the gas pedal and the speed of a car driving on a hilly road. The driver wants to stay near the speed limit, so he adjusts the gas pedal to keep the speed constant. The simplistic conclusion would be: the speed is constant despite the gas pedal position changing, therefore they are unrelated :)

I think that's Milton Friedman's Thermostat in case you want to search for it.


Good discussion. On the flip side, in my data mining class the professor keeps saying ~"you may be able to find clusters in a data set, but often no true correlation exists." However, that's an absolute statement I just don't swallow. In my mind, if an unexplained correlation or non-correlation appears, it may be random (or true), or it could be the result of an unmeasured (hidden) variable. In your two examples, you're simply pointing out two respective hidden variables that weren't accounted for in the original analysis.

I think any data analysis should always be caveated with the understanding that there may be hidden variables shrouding or perhaps enhancing correlations - from economics to quantum mechanics. It's up to the reviewer of the results to determine, subjectively or by using a standard measure, whether the level of rigor involved in data collection & analysis sufficiently models reality.


Perhaps they are trying to explain the clustering illusion: the phenomenon that even random data will produce clusters. You can take that further and state that random data WILL produce clusters; if you don't have clusters, then your data is not random and some pattern is at play.

This really trips up our minds, because we try to find patterns everywhere. If you try to plot random dots by hand, you will usually place them without clusters; a truly random plot will have clusters.

https://en.wikipedia.org/wiki/Clustering_illusion

Edit: Note that your professor said "often", which means they did not make an absolute statement.


Ipso facto, all "natural" variables are either related to a bounded random walk, which produces clusters (a Markovian process), or otherwise have complex chaotic (e.g. fractal) dynamics, which also produce clusters. This follows from physics.

Maximum entropy as well as zero entropy is a very rare state to observe.


does this imply that the universe somehow rewards structures that engender 'compressibility' (coarse graining)? it does seem like our brains subjectively enjoy identifying it, to the point of over-optimization in the form of phenomena like pareidolia


The universe doesn’t “reward” it so much as it’s just a consequence of random events. For example, if you flip a coin many times, you’ll see long sequences of heads. From the central limit theorem it follows that sufficiently many random events will form a normal distribution, which itself exhibits clustering. Take a look at a Galton board in action.
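A quick illustration of the streakiness (Python standard library only; the 1,000-flip count is arbitrary): the longest run of identical outcomes in a fair sequence of that length is typically around ten, even though the coin has no memory.

  import random

  random.seed(0)
  flips = [random.choice("HT") for _ in range(1000)]

  longest = run = 1
  for a, b in zip(flips, flips[1:]):
      run = run + 1 if a == b else 1
      longest = max(longest, run)
  print(longest)   # typically around 10: long streaks show up by chance alone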


That ignores anything related to the actual life we observe, and Gaussian-distributed data does not have to exhibit clustering either (though it allows it).

About the only thing that is naturally uniform so far, within bounds, is the large-scale homogeneity and isotropy of the universe, which is an unsolved mystery potentially involving dark matter.


I would argue that if it didn't show clustering, then there was some sort of pattern/bias at play that caused that.


>"The phenomenon that even random data will produce clusters."

You don't really mean "random", you mean i.i.d. You can have a statistical model where the probability of something happening is random, but not independent of past values (e.g., the next step of a Markov chain).


The ability of adults to drink milk and fluency in English are well correlated. This is because those of northern European ancestry are more likely to be able to drink milk, and it happens that most people of northern European ancestry either emigrated to an English-speaking country (the US) 150 years ago or live in a country where English instruction is good.


It's probably in the same vein as the classic quote by Tukey "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." - a "need" for an answer can easily motivate people to mangle the data in order to find it even if it doesn't exist.


I like the gas pedal example. I read a similar one somewhere: we measure the temperature inside a house and the energy usage of the heater. The energy usage is correlated with the outside temperature, but the inside temperature stays constant, so we conclude that the inside temperature is unrelated to the heater and turn the heater off.


I really like that example, but I am wondering whether it would really hold. Real drivers would not maintain a perfect speed, but would instead work to maintain the average. If you looked closely at the speed, it would drift away from the average, and then the pedal would move to return it to the average. So the pedal position would look a bit like an integral (the I in PID control) of the difference from the mean speed, right?


Yup, that's how you know I only had this at university and never used it in real life :) I think in real life you might see the feedback loop in motion, or not, depending on the resolution and sampling.


Well, we're obviously talking about perfectly spherical drivers (good point though).


Yes, I figured :) Just pointing out that reality is always a bit more nuanced. To put it more simply, the pedal position could be seen as an error accumulator.


Pressing the brake is positively correlated with the car going faster. Downhill.

Good thing correlation is not an indicator of causation.


It is true that, as Fisher points out, with enough samples you are almost guaranteed to reject the null hypothesis. That's why we tell students to consider both p values (which you could think of as a form of quality control on the dataset) and variance explained. Loftus and Loftus make the point nicely: p tells you if you have enough samples and any effect to consider, variance explained tells you if it's worth pursuing. Both are useful guides to a thoughtful analysis. In addition, I'd make a case for thinking about the scientific significance and importance of the hypothesis and the Bayesian prior. And to put a positive spin on this, given how easy it is to get small p values, big ones are pretty much a red flag to stop the analysis and go and do something more productive instead.
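A toy version of that split (Python with NumPy/SciPy, made-up numbers): with a million samples, a true correlation of about 0.01 is "significant" at absurd levels while explaining essentially none of the variance.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n = 1_000_000
  x = rng.normal(size=n)
  y = 0.01 * x + rng.normal(size=n)   # a real but minuscule effect

  r, p = stats.pearsonr(x, y)
  print(p)        # astronomically small: passes any significance threshold
  print(r ** 2)   # ~0.0001: about a hundredth of a percent of the variance explained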


> "It is true that, as Fisher points out, with enough samples you are almost guaranteed to reject the null hypothesis. "

Where does Fisher point this out?

> "That's why we tell students to consider both p values (which you could think of as a form of quality control on the dataset)"

How is this "quality control"? It just tells you whether your sample size was large enough to pass an arbitrary threshold...


> Where does Fisher point this out?

Probably in the Fisher excerpt.


I looked but did not see it.


Agree that NHST using a simple null hypothesis of the form

   H0:  μ = 0
doesn't provide much value. H0 is never true, and the conclusion of "rejecting H0" based on a p-value is therefore not super profound. Also, the "rejecting H0" conclusion doesn't really tell you anything about the alternative hypothesis HA (which isn't even considered when computing the p-value, since the p-value is computed under H0). Dichotomies in general are bad, but NHST with a point H0 is useless!

However a composite hypothesis setup of the form

   H0:  μ ≤ 0
   HA:  μ > 0
is probabilistically sound (inasmuch as some journal requires you to report a p-value). It's much better to report an effect-size estimate and/or a CI.
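A small sketch of what that reporting looks like (Python/SciPy, invented data with a tiny positive mean, using a plain z-test): the point null is rejected with room to spare, but it's the effect size and the CI that actually tell you how big μ is.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  x = rng.normal(loc=0.03, scale=1.0, size=100_000)   # tiny true effect, big sample

  m = x.mean()
  se = x.std(ddof=1) / np.sqrt(len(x))
  p_point = 2 * stats.norm.sf(abs(m) / se)            # point null H0: mu = 0, easily rejected
  ci = (m - 1.96 * se, m + 1.96 * se)                 # 95% CI: shows the effect is real but tiny
  print(p_point, m, ci)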


That still gives 50-50 odds with sufficient sample size, not much of a test of the research hypothesis (since many alternatives will predict the same direction). It is better than 100% chance of rejection though.


Couldn't you make an argument that the point H0 has a use when you are testing whether two populations are identical? I.e., it's probably true that μ is very close to 0 if it is the difference in heights between men from Nebraska and men from Iowa.


You've kind of hit the point with the second half of your comment. Two populations are virtually never identical, so you don't need any statistics to answer the question. A more reasonable question is whether or not you have the statistical power (i.e. measurement precision) to see the difference, and whether the difference is big enough to matter.


This reminds me of the current omnigenic hypothesis about genes: that, unexpectedly, almost every gene seems to affect the expression of traits.

https://www.quantamagazine.org/omnigenic-model-suggests-that...

"Drawing on GWAS analyses of three diseases, they concluded that in the cell types that are relevant to a disease, it appears that not 15, not 100, but essentially all genes contribute to the condition. The authors suggested that for some traits, “multiple” loci could mean more than 100,000."


That is just a special case of the "everything is correlated" principle.


I think a major issue here is that, perhaps, there is a tendency to want to use statistics to decide what the 'truth' is, because it takes the onus of responsibility for making a mistake away from the interpreter. It's nice to be able to stand behind a p-value and not be accountable for whatever argument is being made. But the issue here is that almost any argument can be made in a large enough dataset, and a careful analyst will find significance.

This is of course the case only if one does not venture far from the principal assumptions of frequentism, most of which are routinely violated outside of almost every example except pure random number generation and fundamental quantum physics.

So a central issue that isn't addressed in STATS101-level hypothesis testing is the impact that the question has on the result. It's almost inevitable that people want to interpret a failure to reject as a positive result. But a p-value really doesn't tell you whether a result is useful; it just tells you that your sample size is big enough to detect a difference.

Statistical significance is something that can be calculated. Practical significance is something that needs to be interpreted.


I think this article is trying to tie two things together: the p-value problem and the fact that you can always throw in more data.

I disagree.

It's cheating, it goes against experimental design, and it does not differentiate between data you were simply given and data that was carefully collected. We have experimental design classes for a reason: they help us to be honest. Of course, there are tons of pitfalls a novice statistician can fall into.

It also implicitly leads people to think that statistics can magically handle given data and big data using the old-fashioned methods. If you do that, then of course you'll get a good p-value.


> It's cheating, it goes against experimental design, and it does not differentiate between data you were simply given and data that was carefully collected. We have experimental design classes for a reason: they help us to be honest. Of course, there are tons of pitfalls a novice statistician can fall into.

Explicit sequential testing runs into exactly the same problem. The problem is, the null hypothesis is not true. So no matter whether you use fixed (large) sample sizes or adaptive procedures which can terminate early while still preserving (the irrelevant) nominal false-positive error rates, you will at some sample size reject the null as your power approaches 100%.


This is mostly right, but you are still thinking of these rejections as "false positives" for some reason. They are real deviations from the null hypothesis ("true positives"). The problem is that the user didn't test the null model they wanted; it is 100% user error.


Can you explain that last sentence? What is a valid null model if everything is correlated?


A model of whatever process you think generated the data.

EDIT:

I guess I should say that the concept of testing a "null model" without interpreting the fit relative to other models is wrong to begin with. You need to use Bayes' rule and determine:

  p(H[0]|D) = p(H[0])p(D|H[0])/sum(p(H[0:n])*p(D|H[0:n]))
Lots of things are wrong with what has been standard stats for the last 70 years; it literally amounts to stringing together a bunch of fallacies and makes no sense at all.
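A minimal numeric sketch of that formula (Python/SciPy; the data and the candidate hypotheses are made up): each H[i] here is a candidate value of μ for normal data, and the point is that each model's fit is judged relative to the others rather than in isolation.

  import numpy as np
  from scipy import stats

  data = np.array([0.8, 1.3, 0.4, 1.1, 0.9])     # made-up observations
  mus = [0.0, 0.5, 1.0, 1.5]                     # candidate hypotheses H[i]: mu = ...
  prior = np.full(len(mus), 1.0 / len(mus))      # p(H[i]): equal priors

  like = np.array([stats.norm(mu, 1.0).pdf(data).prod() for mu in mus])   # p(D|H[i])
  post = prior * like / np.sum(prior * like)     # the formula above
  print(dict(zip(mus, post.round(3))))           # mu = 1.0 gets the most posterior mass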


Thanks for the response. Do you know of any good blog posts or articles that dive into this a bit more? It looks very interesting.


This is the best description of the main problem (testing your own vs some default hypothesis) I have seen:

Paul E. Meehl, "Theory-Testing in Psychology and Physics: A Methodological Paradox," Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135

Free download here: www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

Andrew Gelman (andrewgelman.com) has a great blog that often touches on this issue.


Thanks a bunch for sharing, I appreciate it. I'll add these resources to my reading list. I may also pass them along to my brother, who is getting a graduate degree in psych :)



>"The fact that these variables are all typically linear or additive further implies that interactions between variables will be typically rare or small or both (implying that most such hits will be false positives, as interactions are far harder to detect than main effects)."

Where does this "fact" come from? And if everything is correlated with everything else, all these effects are true positives...

Also, another ridiculous aspect of this is that when data becomes cheap the researchers just make the threshold stricter so it doesn't become too easy. They are (collectively) choosing what is "significant" or not and then acting like "significant" = real and "non-significant" = 0.

Finally, I didn't read through the whole thing. Does he claim to have found an exception to this rule at any point?


> Finally, I didn't read through the whole thing. Does he claim to have found an exception to this rule at any point?

Oakes 1975 points out that explicit randomized experiments which test a useless intervention, such as a school reform, can be exceptions. (Oakes might not be quite right here, since surely even useless interventions have some non-zero effect, if only by wasting people's time & effort, but you might say that the 'crud factor' is vastly smaller in randomized experiments than in correlational data, which is a point worth noting.)


Thanks,

How about this "fact": The fact that these variables are all typically linear or additive?


That is simply a corollary of the fact that Pearson's r and regressions are usually linear/additive, and things like Meehl's demonstration wouldn't work if they weren't. You'd just calculate all the pairwise correlations and get nothing if they were solely totally nonlinear/interactions. (In which case you'd have a hard time proving they were related at all.)


> You'd just calculate all the pairwise correlations and get nothing if they were solely totally nonlinear/interactions.

I don't believe this. Most nonlinear relationships also show up as non-zero (linear) correlation coefficients. There are really only a couple of pathological cases I can think of where that would not happen.
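A quick check (Python/NumPy, arbitrary simulated data): a monotone but nonlinear relationship still shows up clearly in Pearson's r; a symmetric one like y = x² is the kind of pathological case that hides.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=100_000)

  print(np.corrcoef(x, np.exp(x))[0, 1])   # nonlinear but monotone: r is clearly non-zero
  print(np.corrcoef(x, x ** 2)[0, 1])      # symmetric case: r really is ~0 despite perfect dependence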


Is this trying to be too clever? If the correlation is weaker than the random noise of the data, then it is equivalent to not being correlated.

Otherwise, we'd get conclusions like the color of your car influencing your risk of lung cancer, or some such nonsense. With enough data, you could see a weak correlation between red cars and cancer, but it would still be insignificant. That's what the null hypothesis is for: to put a threshold under which we can just ignore whatever weak correlation seems to be there.


Question: Are these correlations typically transitive? That is to say, does it typically happen that, in addition to everything having a nonzero correlation with everything else, the sign of the correlation between A and C equals the product of the signs of the correlations between A and B and between B and C?

Thorndike's dictum would suggest that this is so, at least in that particular domain. What about more generally?


Like background radiation, we have an "absolute background" correlation value... a value we might test against, e.g. |±.02321|.

Or we could drop the null


REJECT THE NULL HYPOTHESIS !!! :-)


It's well known that the number of Nicolas Cage movies is correlated with a wide variety of natural phenomena.


Sample means and true means are different things.


You're being downvoted because you missed the point repeatedly made in the intro and many of the excerpts that this is in fact a claim about the 'true means'.



