# Leaky Pipelines

Many multi-step processes look like ‘leaky pipelines’, where a fractional loss or success happens at every step; such multiplicative processes can often be modeled as a log-normal distribution (or power law), with counterintuitive implications: output distributions are highly skewed, and small differences in per-step success rates compound into large final differences.
bibliography, statistics, order-statistics
2014-11-27–2021-10-21 · in progress · certainty: highly-likely

• The log-normal distribution (visualization from Li et al 2010) is a skewed distribution which is the multiplicative counterpart to the normal distribution: where the normal distribution is conceptually applicable where many independent parts are added, the log-normal applies where those parts instead multiply. A common example is several variables multiplying together to give a final output; a concrete example would be multiple successive filter-like steps. Also like the normal, the log-normal enjoys many general properties such as limit theorems or preservation under multiplication/addition.1
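The multiplicative limit theorem can be checked with a quick simulation (mine, not from the original; the uniform per-step ‘success rates’ are an arbitrary choice): the log of a product is a sum of logs, so the central limit theorem pushes the log of the output toward normality, while the raw output remains heavily right-skewed.

```r
set.seed(2014)
## 20 successive 'leaky' steps, each passing a random fraction of the input through
n.steps  <- 20
products <- replicate(50000, prod(runif(n.steps, min=0.1, max=1)))
## simple moment-based skewness helper
skewness <- function(x) { mean((x - mean(x))^3) / sd(x)^3 }
skewness(products)       ## raw products: strongly right-skewed
skewness(log(products))  ## logs: roughly symmetric, ie. approximately normal
```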

• Power law fits are often suggested for heavy-tailed empirical distributions, but are also heavily criticized: power law fits are not always carefully compared against log-normals, and datasets sometimes turn out to fit log-normals better, or the power law is mechanistically implausible.
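The critique is easy to reproduce in miniature (a sketch of mine, assuming a continuous Pareto as the power-law model with x_min fixed at the sample minimum): on data actually drawn from a log-normal, the log-normal’s maximum-likelihood fit beats the power law’s.

```r
set.seed(2010)
## data actually drawn from a log-normal with a heavy right tail
x <- rlnorm(10000, meanlog=0, sdlog=2)
## log-normal MLE is just the mean & sd of the logs
mu <- mean(log(x)); s <- sd(log(x))
ll.lognormal <- sum(dlnorm(x, meanlog=mu, sdlog=s, log=TRUE))
## Pareto (continuous power law) MLE: alpha-hat = n / sum(log(x/x_min))
xmin  <- min(x)
alpha <- length(x) / sum(log(x / xmin))
ll.powerlaw <- sum(log(alpha) + alpha*log(xmin) - (alpha+1)*log(x))
c(lognormal=ll.lognormal, powerlaw=ll.powerlaw)  ## log-likelihoods: higher is better
```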
• Some researchers are orders of magnitude more prolific and successful than others. Under a normal-distribution conceptualization of scientific talent, this would be odd & require them to be many standard deviations beyond the norm on some ‘output’ variable. William Shockley suggests that this isn’t so surprising if we imagine scientific research as more of a ‘pipeline’: a scientist has ideas, which feed into background research, which feeds into a series of experiments, which feeds into writing up papers, then getting them published, then influencing other scientists, then back to getting ideas.

Each step is a different skill, which is plausibly normally-distributed, but each step relies on the output of the previous step: you can’t experiment on non-existent ideas, and you can only publish on that which you experimented on, etc. Few people have an impact simply by having a fabulous idea if they can’t be bothered to write it down. (Consider how much more impact Euler, Ramanujan, or Gauss would have had if they had published more than they did.) So if one researcher is merely somewhat better than average at each step, they may wind up having a far larger output of important work than a researcher who is exactly average at each step.

Shockley notes that with 8 variables and an advantage of 50%, the output under a log-normal model would be increased by as much as 25×, eg:

```r
simulateLogNormal <- function(advantage, n.variables, iters=100000) {
    ## baseline researcher: average (mean 1) at every step, so expected output is 1
    regular <- 1
    ## advantaged researcher: 50% better on average at each of the n steps;
    ## total output is the product across steps (sd=0.1 is an arbitrary choice)
    advantaged <- replicate(iters, prod(rnorm(n.variables, mean=1+advantage, sd=0.1)))
    ma <- mean(advantaged) / regular
    return(ma)
}
simulateLogNormal(0.5, 8)
#  25.58716574
```

With more variables, the output difference would be larger still. This poses a challenge to those who expect small differences in ability to lead to only small differences in output: the log-normal distribution is common in the real world, and it also implies that if several stages of a pipeline can each be optimized, the total gains can be enormous.
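The optimization point is the same arithmetic run in reverse (a hypothetical illustration; the 10% figure is chosen arbitrarily): a modest gain at each stage compounds multiplicatively across the whole pipeline.

```r
## a hypothetical 10% improvement at each of 8 pipeline stages more than doubles output
per.stage.gain <- 0.10
n.stages       <- 8
(1 + per.stage.gain)^n.stages
# [1] 2.14358881
```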

• “The Best and the Rest: Revisiting the Norm of Normality of Individual Performance”, O’Boyle & Aguinis 2012

• Bias In Mental Testing, ch 4: §“Distribution of Achievement”, Jensen 1980; “Giftedness and Genius: Crucial Differences”, Jensen 1996

• Drug Development:

• “Dissolving the Fermi Paradox”, Sandberg et al 2018 (the mean estimate of the number of alien civilizations may be high, but the distribution is wide and the median is much smaller than the mean)

• “Prospecting for Gold”, Cotton-Barratt 2016; Kokotajlo & Oprea 2020

• Pavlogiannis et al 2018
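The mean/median gap invoked for the Fermi estimate above is a generic property of wide log-normals, and easy to check (my example; σ = 3 is an arbitrary choice): the mean exceeds the median by the factor exp(σ²/2).

```r
set.seed(2018)
sigma <- 3
x <- rlnorm(1000000, meanlog=0, sdlog=sigma)
median(x)          ## ~ exp(0) = 1
mean(x)            ## ~ exp(sigma^2/2) ≈ 90: dragged far up by the heavy right tail
exp(sigma^2 / 2)   ## analytic mean/median ratio
```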

1. Someone asked if the product of correlated normal variables also yields a log-normal, the way the product of independent variables does; checking, I suspect not. (Experimenting with random correlation matrices generated by `randcorr` to simulate out possible log-normals as `hist(apply(abs(mvrnorm(n=500, mu=rep(0,5), Sigma=randcorr(5))), 1, prod))`, the histograms look far more skewed & peaky to me than a regular log-normal—which is in accord with my intuitions about correlations between variables typically creating more extremes.)↩︎