Many multi-step processes look like ‘leaky pipelines’, where a fractional loss/success happens at every step; such multiplicative processes can often be modeled as a log-normal distribution (or power law), with counterintuitive implications like skewed output distributions and large final differences from small differences in per-step success rates.
-
The log-normal distribution (quincunx visualization from Li et al 2010) is a skewed distribution which is the multiplicative counterpart to the normal distribution: where the normal distribution applies when many independent parts are added together, the log-normal applies when those parts instead multiply. A common example is latent variables multiplying to give a final output; a concrete example would be multiple successive liability threshold-like steps. Also like the normal, the log-normal enjoys many general properties, such as limit theorems and preservation under its characteristic operation (multiplication for the log-normal, as addition is for the normal).¹
- Power law fits are often suggested for such data, but are also heavily criticized: power law fits are not always carefully compared against log-normals, and the data sometimes turn out to be fit better by a log-normal, or the power law turns out to be mechanistically implausible.
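The 'multiplicative counterpart' claim can be made explicit by taking logs. This is a sketch under assumptions not stated in the text: the per-step factors $X_i$ are positive and independent, and their logarithms have finite means $\mu_i$ and variances $\sigma_i^2$ (all symbols are introduced here for illustration).

```latex
Y = \prod_{i=1}^{n} X_i
\quad\Longrightarrow\quad
\log Y = \sum_{i=1}^{n} \log X_i ,
\qquad
\log Y \;\overset{\text{approx.}}{\sim}\;
\mathcal{N}\!\Big(\textstyle\sum_i \mu_i,\ \sum_i \sigma_i^2\Big)
\quad\text{for large } n .
% Hence Y is approximately log-normal, with
\operatorname{median}(Y) \approx \exp\!\Big(\textstyle\sum_i \mu_i\Big),
\qquad
\mathbb{E}[Y] \approx \exp\!\Big(\textstyle\sum_i \mu_i + \tfrac{1}{2}\sum_i \sigma_i^2\Big) .
```

The ordinary central limit theorem applied to the sum of logs is what makes the product approximately log-normal. The gap between the median and the mean grows with the total log-variance, which is where the skew comes from.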
-
“On the Statistics of Individual Variations of Productivity in Research Laboratories”, Shockley 1957
Some researchers are orders of magnitude more prolific and successful than others. Under a normal distribution conceptualization of scientific talent, this would be odd & require them to be many standard deviations beyond the norm on some ‘output’ variable. Shockley suggests that this isn’t so surprising if we imagine scientific research as more of a ‘pipeline’: a scientist has ideas, which feed into background research, which feeds into a series of experiments, which feeds into writing up papers, then getting them published, then influencing other scientists, then back to getting ideas.
Each step is a different skill, which is plausibly normally-distributed, but each step relies on the output of a previous step: you can’t experiment on non-existent ideas, and you can only publish on that which you experimented on, etc. Few people have an impact by simply having a fabulous idea if they can’t be bothered to write it down. (Consider how much more impact Claude Shannon, Euler, Ramanujan, or Gauss would have had if they had published more than they did.) So if one researcher is merely somewhat better than average at each step, they may wind up having a far larger output of important work than a researcher who is exactly average at each step.
Shockley notes that with 8 variables and an advantage of 50%, the output under a log-normal model would be increased by as much as 25×, eg:
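simulateLogNormal <- function(advantage, n.variables, iters=100000) {
  # baseline output of an exactly-average researcher (kept for reference; unused below)
  regular <- 1
  # each simulated researcher's output is the product of n.variables per-step skills,
  # each drawn from a normal with mean (1+advantage) and SD 1
  advantaged <- replicate(iters, Reduce(`*`, rnorm(n.variables, mean=(1+advantage), sd=1), 1))
  # average output across all simulated researchers
  ma <- mean(advantaged)
  return(ma)
}
simulateLogNormal(0.5, 8)
# [1] 25.58716574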
With more variables, the output difference would be larger still; this multiplicative sensitivity is connected to the O-ring theory of productivity. This poses a challenge to those who expect small differences in ability to lead to small output differences, as the log-normal distribution is common in the real world; it also implies that if several stages can be optimized, the remainder will become a severe bottleneck.
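As a rough check on the arithmetic: if the per-step factors are independent (as they are in the simulation), the expected output is the per-step mean raised to the number of steps, so 1.5^8 ≈ 25.6. The sketch below tabulates this for a few step counts and illustrates the bottleneck point. The function names `expectedAdvantage` and `pipelineOutput`, the step counts, and the success rates are all illustrative assumptions, not taken from Shockley.

```r
# Expected output ratio of an 'advantaged' researcher relative to an average one,
# assuming independent per-step factors: E[prod X_i] = prod E[X_i] = (1 + advantage)^n
expectedAdvantage <- function(advantage, n.variables) (1 + advantage)^n.variables

steps  <- c(4, 8, 12, 16)                     # illustrative step counts
ratios <- sapply(steps, function(n) expectedAdvantage(0.5, n))
print(data.frame(steps=steps, ratio=round(ratios, 1)))

# Bottleneck illustration (hypothetical success rates): end-to-end yield is the
# product of per-step yields, so a single weak step caps the whole pipeline.
pipelineOutput <- function(rates) prod(rates)
optimized <- c(0.99, 0.99, 0.99, 0.10)        # three stages optimized, one left weak
print(pipelineOutput(optimized))
```

Once the other stages are near 100%, improving them further barely moves the total, while any gain in the weak stage passes through almost one-for-one.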
-
“The Best And The Rest: Revisiting The Norm Of Normality Of Individual Performance”, O’Boyle & Aguinis 2012
-
“The Geometric Mean, in Vital and Social Statistics”, Galton 1879; “Ability and Income: III. The Relation Between the Distribution of Ability and the Distribution of Income”, Burt 1943
-
Bias In Mental Testing, ch 4: §“Distribution of Achievement”, Jensen 1980; “Giftedness and Genius: Crucial Differences”, Jensen 1996
-
Drug Development:
- “When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016
- “Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet: Quality is more important than speed and cost in drug discovery”, Bender & Cortés-Ciriano 2021
- Psychiatric Drugs: “The Alzheimer Photo”; “Prescriptions, Paradoxes, and Perversities”; “Is Pharma Research Worse Than Chance?” (see also: ketamine, MDMA, LSD, amphetamines, lithium, off-label drugs in general)
-
“Dissolving the Fermi Paradox”, Sandberg et al 2018 (the mean estimate of the Drake equation may be high but the distribution is wide and the median is much smaller than the mean, somewhat akin to Jensen’s inequality/inequality of arithmetic and geometric means)
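The mean-versus-median gap can be seen in a toy product of uncertain factors. This is only a sketch and not Sandberg et al's actual model: the number of factors, the log-normal form of each factor, and the spread `sdlog` are hypothetical values chosen for illustration.

```r
set.seed(2018)
n.factors <- 7      # hypothetical number of multiplied factors
sdlog     <- 1      # hypothetical per-factor uncertainty (SD on the natural-log scale)
iters     <- 100000

# each draw is a product of independent log-normal factors, each with median 1
draws <- replicate(iters, prod(rlnorm(n.factors, meanlog=0, sdlog=sdlog)))

# the product is itself log-normal with meanlog = 0 and variance n.factors * sdlog^2 on the log scale
theory.median <- exp(0)
theory.mean   <- exp(n.factors * sdlog^2 / 2)

print(c(sim.median=median(draws), theory.median=theory.median))
print(c(sim.mean=mean(draws),     theory.mean=theory.mean))
print(mean(draws < theory.mean))   # fraction of draws falling below the mean
```

Under these assumptions the mean is dominated by a small fraction of very large draws, so most of the probability mass sits well below it.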
-
“Prospecting for Gold”, Cotton-Barratt 2016; “Counterproductive Altruism: The Other Heavy Tail”, Kokotajlo & Oprea 2020
-
“Why is there only one Elon Musk? Why is there so much low-hanging fruit?”, Alexey Guzey 2020
-
“The Fundamentals of Heavy Tails: Properties, Emergence, & Estimation: Chapter 6: Multiplicative processes”, Nair et al 2021
-
Lotka’s law/Price’s law, Preferential attachment/Matthew effect
-
“Construction of arbitrarily strong amplifiers of natural selection using evolutionary graph theory”, Pavlogiannis et al 2018
-
See Also: On Development Hell, Multi-Stage Selection
-
Someone asked if the product of correlated normal variables also yields a log-normal, the way the sum of correlated normals is still normal; checking Product distribution, I suspect not. (Experimenting with random correlation matrices generated by randcorr to simulate out possible log-normals as hist(apply(abs(mvrnorm(n=500, mu=rep(0,5), Sigma=randcorr(5))), 1, prod)), the histograms look far more skewed & peaky to me than a regular log-normal, which is in accord with my intuitions about correlations between variables typically increasing variance and creating more extremes.)
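A reproducible variant of this experiment, as a sketch only: it swaps the random correlation matrices for a single fixed equicorrelation matrix (the value `rho = 0.8` is a hypothetical choice) and builds the correlated normals with a Cholesky factor in base R. It then compares the spread of the log of the product against the independent case.

```r
set.seed(1)
n <- 100000; k <- 5
rho <- 0.8                                     # hypothetical common correlation
Sigma <- matrix(rho, k, k); diag(Sigma) <- 1   # equicorrelation matrix

Z <- matrix(rnorm(n * k), n, k)                # independent standard normals
X.indep <- Z
X.corr  <- Z %*% chol(Sigma)                   # rows now have covariance Sigma

logprod <- function(X) rowSums(log(abs(X)))    # log of the product of |X_i|

v.indep <- var(logprod(X.indep))
v.corr  <- var(logprod(X.corr))
print(c(indep=v.indep, corr=v.corr))

# median and upper tail on the original (product) scale
print(quantile(exp(logprod(X.indep)), c(0.5, 0.99)))
print(quantile(exp(logprod(X.corr)),  c(0.5, 0.99)))
```

A larger log-scale variance in the correlated case would be consistent with the intuition that correlation creates more extremes. The log of an absolute normal is itself not normal, so even the independent product is only approximately log-normal.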