> In generating a sample of _n_ datapoints drawn from a normal/Gaussian distribution with a particular mean/SD, how big on average the biggest datapoint is will depend on how large _n_ is.
> Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/good the worst/best datapoint will be (eg the score of the winner in a multi-player game).
>
> The [order statistic](!Wikipedia) of the mean/average/expectation of the maximum of a draw of _n_ samples from a normal distribution has no exact formula, unfortunately, and is generally not built into any programming language's libraries.
>
> I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic.
> The overall best approach is to calculate the exact order statistics for the _n_ range of interest using numerical integration via `lmomco` and cache them in a lookup table, rescaling the mean/SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large _n_ such as _n_ > 100.

![Visualizing maxima/minima in order statistics with increasing _n_ in each sample (1--100).](/images/orderstatistics/orderstatistics-maximums.png)
# Approximation
## Monte Carlo
Most simply and directly, we can estimate it using a [Monte Carlo](!Wikipedia "Monte Carlo method") simulation with hundreds of thousands of iterations:
~~~{.R}
scores <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain <- function(n, sd) { scores <- scores(n, sd)
return(max(scores)); }
simGain <- function(n, sd=1, iters=500000) {
mean(replicate(iters, gain(n, sd))); }
~~~
But in R this can take seconds for small _n_ and gets worse as _n_ increases into the hundreds as we need to calculate over increasingly large samples of random normals (so one could consider this 𝓞(_n_)); this makes use of the simulation difficult when nested in higher-level procedures such as anything involving resampling or simulation.
In R, calling functions many times is slower than being able to call a function once in a 'vectorized' way where all the values can be processed in a single batch.
We can try to vectorize this simulation by generating _n_ · _i_ random normals, group it into a large matrix with _n_ columns (each row being one _n_-sized batch of samples), then computing the maximum of the _i_ rows, and the mean of the maximums.
This is about twice as fast for small _n_; implementing using `rowMaxs` from the R package [`matrixStats`](https://cran.r-project.org/web/packages/matrixStats/index.html), it is anywhere from 25% to 500% faster (at the expense of likely much higher memory usage, as the R interpreter is unlikely to apply any optimizations such as Haskell's stream fusion):
~~~{.R}
simGain2 <- function(n, sd=1, iters=500000) {
mean(apply(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)), 1, max)) }
library(matrixStats)
simGain3 <- function(n, sd=1, iters=500000) {
mean(rowMaxs(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)))) }
~~~
Each simulate is too small to be worth parallelizing, but there are so many iterations that they can be split up usefully and run with a fraction in a different process; something like
~~~{.R}
library(parallel)
library(plyr)
simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
mean(unlist(mclapply(1:n.parallel, function(i) {
mean(replicate(iters/n.parallel, gain(n, sd))); })))
}
~~~
We can treat the simulation estimates as exact and use [memoization](!Wikipedia) such as provided by the R package [`memoise`](https://cran.r-project.org/web/packages/memoise/) to cache results & never recompute them, but it will still be slow on the first calculation.
So it would be good to have either an exact algorithm or an accurate approximation: for one of analyses, I want accuracy to ±0.0006 SDs, which requires large Monte Carlo samples.
## Upper bounds
To summarize the [Cross Validated discussion](https://stats.stackexchange.com/questions/9001/approximate-order-statistics-for-normal-random-variables): the simplest [upper bound](http://math.stackexchange.com/a/89147) is $E[Z] \leq \sigma \cdot \sqrt{2 \cdot \log(n)}$, which makes the diminishing returns clear (see also the [Fisher–Tippett–Gnedenko theorem](!Wikipedia) & [Gumbel distribution](!Wikipedia)).
Implementation:
~~~{.R}
upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
~~~
![The max central limit theorem visualized ([Gabriel Peyré](https://nitter.cc/gabrielpeyre/status/1331840107792322560/photo/1 "The max of n independent Gaussians concentrates around sqrt(2*log(n)). The max central limit theorem allows to quantify this"))](/images/orderstatistics/2020-11-26-gabrielpeyre-maxgaussianclt.jpg)
Most of the approximations are sufficiently fast as they are effectively 𝓞(1) with small constant factors (if we ignore that functions like Φ^−1^/`qnorm` themselves may technically be 𝓞(log(_n_)) or 𝓞(_n_) for large _n_).
However, accuracy becomes the problem: this upper bound is hopelessly inaccurate in small samples when we compare to the Monte Carlo simulation.
Others (also inaccurate) include $\frac{n-1}{\sqrt{2 \cdot n - 1}} \cdot \sigma$ and $-\Phi^{-1}(\frac{1}{n+1}) \cdot \sigma$:
~~~{.R}
upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }
~~~
## Formulas
1. [Blom 1958, _Statistical estimates and transformed beta-variables_](/docs/statistics/order/1958-blom-orderstatistics.pdf) provides a general approximation formula $\mathbb{E}(r,n)$, which specializing to the max ($\mathbb{E}(n,n)$) is $\Phi^{-1}(\frac{n-\alpha}{n - 2\cdot\alpha + 1 }) \cdot \sigma; \alpha=0.375$ and is better than the upper bounds:
~~~{.R}
blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
~~~
2. [Elfving 1947](/docs/statistics/order/1947-elfving.pdf "The asymptotical distribution of range in samples from a normal population"), apparently, by way of _Mathematical Statistics_, Wilks 1962, demonstrates that Blom 1958's approximation is imperfect because actually $\alpha=\frac{\pi}{8}$, so:
~~~{.R}
elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
~~~
(Blom 1958 appears to be more accurate for _n_ < 48 and then Elfving's correction dominates.)
3. [Harter 1961](/docs/statistics/order/1961-harter.pdf "Expected Values of Normal Order Statistics") elaborated this by giving different values for $\alpha$, and [Royston 1982](/docs/statistics/order/1982-royston.pdf "Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate)") provides computer algorithms; I have not attempted to provide an R implementation of these.
4. [probabilityislogic](https://stats.stackexchange.com/users/2392/probabilityislogic) offers a [2015 derivation via the beta-F compound distribution](http://stats.stackexchange.com/a/9010/16897)^[Exploiting the [beta transformation](!Wikipedia "Order statistic#Order statistics sampled from a uniform distribution"), where the order statistics of a simple 0--1 interval turns out to follow a [beta distribution](!Wikipedia) (specifically: $U_{(k)} \sim \operatorname{Beta}(k, n + 1 - k)$) , which can then be easily transformed into the equivalent order statistics of more useful distributions like the normal distribution. The beta transformation is not just computationally useful, but critical to order statistics in general.]:
$$E[x_{i}]\approx \mu+\sigma\Phi^{-1}\left(\frac{i}{N+1}\right)\left[1+\frac{\left(\frac{i}{N+1}\right)\left(1-\frac{i}{N+1}\right)}{2(N+2)\left[\phi\left[\Phi^{-1}\left(\frac{i}{N+1}\right)\right]\right]^{2}}\right]$$
and an approximate (but highly accurate) numerical integration as well:
~~~{.R}
pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
((n/(n+1)) * (1 - (n/(n+1)))) /
(2*(n+2) * (pnorm(qnorm(n/(n+1))))^2) }}
pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) + 1}
~~~
The integration can be done more directly as
~~~{.R}
pil2015Integrate2 <- function(n) { integrate(function(v) qnorm(qbeta(v, n, 1)), 0, 1) }
~~~
5. Another approximation comes from [Chen & Tyler 1999](/docs/statistics/order/1999-chen.pdf "Accurate Approximation to the Extreme Order Statistics of Gaussian Samples"): $\Phi^{-1}(0.5264^{\frac{1}{n}})$.
Unfortunately, while accurate enough for most purposes, it is still off by as much as 1 IQ point and has an average mean error of −0.32 IQ points compared to the simulation:
~~~{.R}
chen1999 <- function(n, sd=1){ qnorm(0.5264^(1/n), sd=sd) }
approximationError <- sapply(1:1000, function(n) { (chen1999(n) - simGain(n)) * 15 })
summary(approximationError)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.3801803 -0.3263603 -0.3126665 -0.2999775 -0.2923680 0.9445290
plot(1:1000, approximationError, xlab="Number of samples taking the max", ylab="Error in 15*SD (IQ points)")
~~~
![Error in using the Chen & Tyler 1999 approximation to estimate the expected value (gain) from taking the maximum of _n_ normal samples](/images/genetics/selection/iq-selection-chen1999approximation-error.png)
## Polynomial regression
From a less mathematical perspective, any regression or machine learning model could be used to try to develop a cheap but highly accurate approximation by simply predicting the extreme from the relevant range of _n_ = 2--300---the goal being less to make good predictions out of sample than to interpolate/memorize as much as possible in-sample.
Plotting the extremes, they form a smooth *almost* logarithmic curve:
~~~{.R}
df <- data.frame(N=2:300, Max=sapply(2:300, exactMax))
l <- lm(Max ~ log(N), data=df); summary(l)
# Residuals:
# Min 1Q Median 3Q Max
# -0.36893483 -0.02058671 0.00244294 0.02747659 0.04238113
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.658802439 0.011885532 55.42894 < 2.22e-16
# log(N) 0.395762956 0.002464912 160.55866 < 2.22e-16
#
# Residual standard error: 0.03947098 on 297 degrees of freedom
# Multiple R-squared: 0.9886103, Adjusted R-squared: 0.9885719
# F-statistic: 25779.08 on 1 and 297 DF, p-value: < 2.2204e-16
plot(df); lines(predict(l))
~~~
This has the merit of simplicity (`function(n) {0.658802439 + 0.395762956*log(n)}`), but while the R^2^ is quite high by most standards, the residuals are too large to make a good approximation---the log curve overshoots initially, then undershoots, then overshoots.
We can try to find a better log curve by using polynomial or spline regression, which broaden the family of possible curves.
A 4^th^-order polynomial turns out to fit as beautifully as we could wish, R^2^=0.9999998:
~~~{.R}
lp <- lm(Max ~ log(N) + I(log(N)^2) + I(log(N)^3) + I(log(N)^4), data=df); summary(lp)
# Residuals:
# Min 1Q Median 3Q Max
# -1.220430e-03 -1.074138e-04 -1.655586e-05 1.125596e-04 9.690842e-04
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.586366e-02 4.550132e-04 34.86418 < 2.22e-16
# log(N) 8.652822e-01 6.627358e-04 1305.62159 < 2.22e-16
# I(log(N)^2) -1.122682e-01 3.256415e-04 -344.76027 < 2.22e-16
# I(log(N)^3) 1.153201e-02 6.540518e-05 176.31640 < 2.22e-16
# I(log(N)^4) -5.302189e-04 4.622731e-06 -114.69820 < 2.22e-16
#
# Residual standard error: 0.0001756982 on 294 degrees of freedom
# Multiple R-squared: 0.9999998, Adjusted R-squared: 0.9999998
# F-statistic: 3.290056e+08 on 4 and 294 DF, p-value: < 2.2204e-16
## If we want to call the fitted objects:
linearApprox <- function (n) { predict(l, data.frame(N=n)); }
polynomialApprox <- function (n) { predict(lp, data.frame(N=n)); }
## Or simply code it by hand:
la <- function(n, sd=1) { 0.395762956*log(n) * sd; }
pa <- function(n, sd=1) { N <- log(n);
(1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }
~~~
This has the virtue of speed & simplicity (a few arithmetic operations) and high accuracy, but is not intended to perform well past the largest datapoint of _n_ = 300 (although if one needed to, one could simply generate the additional datapoints, and refit, adding more polynomials if necessary), but turns out to be a good approximation up to _n_ = 800 (after which it consistently overestimates by ~0.01):
~~~{.R}
heldout <- sapply(301:1000, exactMax)
test <- sapply(301:1000, pa)
mean((heldout - test)^2)
# [1] 3.820988144e-05
plot(301:1000, heldout); lines(test)
~~~
So this method, while lacking any kind of mathematical pedigree or derivation, provides the best approximation so far.
# Exact
The R package [`lmomco`](https://cran.r-project.org/web/packages/lmomco/index.html) ([Asquith 2011](https://ttu-ir.tdl.org/ttu-ir/handle/2346/ETD-TTU-2011-05-1319 "Univariate Distributional Analysis with L-moment Statistics using R")) calculates a wide variety of order statistics using numerical integration & other methods.
It is fast, unbiased, and generally correct (for small values of _n_[^lmomco-bug])---it is close to the Monte Carlo estimates even for the smallest _n_ where the approximations tend to do badly, so it does what it claims to and provides what we want (a fast exact estimate of the mean gain from selecting the maximum from _n_ samples from a normal distribution).
The results can be memoized for a further moderate speedup (eg calculated over _n_ = 1--1000, 0.45s vs 3.9s for a speedup of ~8.7×):
[^lmomco-bug]: `lmomco` is accurate for all values I checked with Monte Carlo _n_ < 1000, but appears to have some bugs _n_ > 2000: there are occasional deviations from the quasi-logarithmic curve, such as _n_ = 2225--2236 (which are off by 1.02SD compared to the Monte Carlo estimates and the surrounding lmomco estimates), another cluster of errors _n_~=40,000, and then after _n_ > 60,000, the estimates are totally incorrect. The maintainer has been notified & verified the bug.
~~~{.R}
library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
## Comparison to MC:
# ... Min. 1st Qu. Median Mean 3rd Qu. Max.
# −0.0523499300 −0.0128622900 −0.0003641315 −0.0007935236 0.0108748800 0.0645207000
library(memoise)
exactMax_memoised <- memoise(exactMax_unmemoized)
~~~
![Error in using Asquith 2011's L-moment Statistics numerical integration package to estimate the expected value (gain) from taking the maximum of _n_ normal samples](/images/genetics/selection/iq-selection-asquith2011approximation-error.png){.full-width}
## Comparison
With `lmomco` providing exact values, we can visually compare all the presented methods for accuracy; there are considerable differences but the best methods are in close agreement:
![Comparison of estimates of the maximum for _n_ = 2--300 for 12 methods, showing Chen 1999/polynomial approximation/Monte Carlo/`lmomco` are the most accurate and Blom 1958/upper bounds highly-inaccurate.](/images/orderstatistics/orderstatistics-12comparisons.png){.full-width}
~~~{.R .collapse}
library(parallel)
library(plyr)
scores <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain <- function(n, sd) { scores <- scores(n, sd)
return(max(scores)); }
simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
mean(unlist(mclapply(1:n.parallel, function(i) {
mean(replicate(iters/n.parallel, gain(n, sd))); })))
}
upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }
blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
((n/(n+1)) * (1 - (n/(n+1)))) /
(2*(n+2) * (pnorm(qnorm(n/(n+1))))^2) }}
pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) + 1}
chen1999 <- function(n,sd=1){ qnorm(0.5264^(1/n), sd=sd) }
library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
library(memoise)
exactMax_memoised <- memoise(exactMax_unmemoized)
la <- function(n, sd=1) { 0.395762956*log(n) * sd; }
pa <- function(n, sd=1) { N <- log(n);
(1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }
implementations <- c(simGainP,upperBoundMax,upperBoundMax2,upperBoundMax3,blom1958,elfving1947,
pil2015,pil2015Integrate,chen1999,exactMax_unmemoized,la,pa)
implementationNames <- c("Monte Carlo","Upper bound #1","UB #2","UB #3","Blom 1958","Elfving 1947",
"Pil 2015","Pil 2015 (numerical integration)","Chen 1999","Exact","Log approx.","Polynomial approx.")
maxN <- 300
results <- data.frame(N=integer(), SD=numeric(), Implementation=character())
for (i in 1:length(implementations)) {
SD <- unlist(Map(function(n) { implementations[i][[1]](n); }, 2:maxN))
name <- implementationNames[i]
results <- rbind(results, (data.frame(N=2:maxN, SD=SD, Implementation=name)))
}
results$Implementation <- ordered(results$Implementation,
levels=c("Upper bound #1","UB #2","UB #3","Log approx.","Blom 1958","Elfving 1947","Pil 2015",
"Pil 2015 (numerical integration)","Chen 1999","Polynomial approx.", "Monte Carlo", "Exact"))
library(ggplot2)
qplot(N, SD, color=Implementation, data=results) + coord_cartesian(ylim = c(0.25,3.9))
qplot(N, SD, color=Implementation, data=results) + geom_smooth()
~~~
And micro-benchmarking them quickly (excluding Monte Carlo) to get an idea of time consumption shows the expected results (aside from Pil 2015's numerical integration taking longer than expected, suggesting either memoising or changing the fineness):
~~~{.R}
library(microbenchmark)
f <- function() { sample(2:1000, 1); }
microbenchmark(times=10000, upperBoundMax(f()),upperBoundMax2(f()),upperBoundMax3(f()),
blom1958(f()),elfving1947(f()),pil2015(f()),pil2015Integrate(f()),chen1999(f()),
exactMax_memoised(f()),la(f()),pa(f()))
# Unit: microseconds
# expr min lq mean median uq max neval
# f() 2.437 2.9610 4.8928136 3.2530 3.8310 1324.276 10000
# upperBoundMax(f()) 3.029 4.0020 6.6270124 4.9920 6.3595 1218.010 10000
# upperBoundMax2(f()) 2.886 3.8970 6.5326593 4.7235 5.8420 1029.148 10000
# upperBoundMax3(f()) 3.678 4.8290 7.4714030 5.8660 7.2945 892.594 10000
# blom1958(f()) 3.734 4.7325 7.3521356 5.6200 7.0590 1050.915 10000
# elfving1947(f()) 3.757 4.8330 7.7927493 5.7850 7.2800 1045.616 10000
# pil2015(f()) 5.518 6.9330 10.8501286 9.2065 11.5280 867.332 10000
# pil2015Integrate(f()) 14088.659 20499.6835 21516.4141399 21032.5725 22151.4150 53977.498 10000
# chen1999(f()) 3.788 4.9260 7.7456654 6.0370 7.5600 1415.992 10000
# exactMax_memoised(f()) 106.222 126.1050 211.4051056 162.7605 221.2050 4009.048 10000
# la(f()) 2.882 3.8000 5.7257008 4.4980 5.6845 1287.379 10000
# pa(f()) 3.397 4.4860 7.0406035 5.4785 6.9090 1818.558 10000
~~~
## Rescaling for generality
The memoised function has three arguments, so memoising on the fly would seem to be the best one could do, since one cannot precompute all possible combinations of the _n_/mean/SD.
But actually, we only need to compute the results for each _n_.
We can default to assuming the standard normal distribution (𝒩(0,1)) without loss of generality because it's easy to rescale any normal to another normal: to scale to a different mean $\mu$, one simply adds $\mu$ to the expected extreme, so one can assume $\mu=0$; and to scale to a different standard deviation, we simply multiply appropriately.
So if we wanted the extreme for _n_ = 5 for $\mathcal{N}(10,2)$, we can calculate it simply by taking the estimate for _n_ = 5 for 𝒩(0,1) and multiplying by 2⁄1 = 2 and then adding 10 − 0 = 10:
~~~{.R}
(exactMax(5, mean=0, sd=1)*2 + 10) ; exactMax(5, mean=10, sd=2)
# [1] 12.32592895
# [1] 12.32592895
~~~
So in other words, it is unnecessary to memoize all possible combinations of _n_, mean, and SD---in reality, we only need to compute each _n_ and then rescale it as necessary for each caller.
And in practice, we only care about _n_ = 2--200, which is few enough that we can define a lookup table using the `lmomco` results and use that instead (with a fallback to `lmomco` for _n_ > 200, and a fallback to Chen et al 1999 for _n_ > 2000 to work around `lmomco`'s bug with large _n_):
~~~{.R}
exactMax <- function (n, mean=0, sd=1) {
if (n>2000) {
chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
chen1999(n,mean=mean,sd=sd) } else {
if(n>200) { library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
exactMax_unmemoized(n,mean=mean,sd=sd) } else {
lookup <- c(0,0,0.5641895835,0.8462843753,1.0293753730,1.1629644736,1.2672063606,1.3521783756,1.4236003060,
1.4850131622,1.5387527308,1.5864363519,1.6292276399,1.6679901770,1.7033815541,1.7359134449,1.7659913931,
1.7939419809,1.8200318790,1.8444815116,1.8674750598,1.8891679149,1.9096923217,1.9291617116,1.9476740742,
1.9653146098,1.9821578398,1.9982693020,2.0137069241,2.0285221460,2.0427608442,2.0564640976,2.0696688279,
2.0824083360,2.0947127558,2.1066094396,2.1181232867,2.1292770254,2.1400914552,2.1505856577,2.1607771781,
2.1706821847,2.1803156075,2.1896912604,2.1988219487,2.2077195639,2.2163951679,2.2248590675,2.2331208808,
2.2411895970,2.2490736293,2.2567808626,2.2643186963,2.2716940833,2.2789135645,2.2859833005,2.2929091006,
2.2996964480,2.3063505243,2.3128762306,2.3192782072,2.3255608518,2.3317283357,2.3377846191,2.3437334651,
2.3495784520,2.3553229856,2.3609703096,2.3665235160,2.3719855541,2.3773592389,2.3826472594,2.3878521858,
2.3929764763,2.3980224835,2.4029924601,2.4078885649,2.4127128675,2.4174673530,2.4221539270,2.4267744193,
2.4313305880,2.4358241231,2.4402566500,2.4446297329,2.4489448774,2.4532035335,2.4574070986,2.4615569196,
2.4656542955,2.4697004768,2.4736966781,2.4776440650,2.4815437655,2.4853968699,2.4892044318,2.4929674704,
2.4966869713,2.5003638885,2.5039991455,2.5075936364,2.5111482275,2.5146637581,2.5181410417,2.5215808672,
2.5249839996,2.5283511812,2.5316831323,2.5349805521,2.5382441196,2.5414744943,2.5446723168,2.5478382097,
2.5509727783,2.5540766110,2.5571502801,2.5601943423,2.5632093392,2.5661957981,2.5691542321,2.5720851410,
2.5749890115,2.5778663175,2.5807175211,2.5835430725,2.5863434103,2.5891189625,2.5918701463,2.5945973686,
2.5973010263,2.5999815069,2.6026391883,2.6052744395,2.6078876209,2.6104790841,2.6130491728,2.6155982225,
2.6181265612,2.6206345093,2.6231223799,2.6255904791,2.6280391062,2.6304685538,2.6328791081,2.6352710490,
2.6376446504,2.6400001801,2.6423379005,2.6446580681,2.6469609341,2.6492467445,2.6515157401,2.6537681566,
2.6560042252,2.6582241720,2.6604282187,2.6626165826,2.6647894763,2.6669471086,2.6690896839,2.6712174028,
2.6733304616,2.6754290533,2.6775133667,2.6795835873,2.6816398969,2.6836824739,2.6857114935,2.6877271274,
2.6897295441,2.6917189092,2.6936953850,2.6956591311,2.6976103040,2.6995490574,2.7014755424,2.7033899072,
2.7052922974,2.7071828562,2.7090617242,2.7109290393,2.7127849375,2.7146295520,2.7164630139,2.7182854522,
2.7200969934,2.7218977622,2.7236878809,2.7254674700,2.7272366478,2.7289955308,2.7307442335,2.7324828686,
2.7342115470,2.7359303775,2.7376394676,2.7393389228,2.7410288469,2.7427093423,2.7443805094,2.7460424475)
return(mean + sd*lookup[n+1]) }}}
~~~
This gives us exact computation at 𝓞(1) (with an amortized 𝓞(1) when _n_ > 200) with an extremely small constant factor (a conditional, vector index, multiplication, and addition, which is overall ~10× faster than a memoised lookup), giving us all our desiderata simultaneously & resolving the problem.
# General order statistics for the normal distribution
One might also be interested in computing the general order statistic.
Some available implementations in R:
- **numerical integration**:
- `lmomco`, with `j` of `n` (warning: remember lmomco's bug with _n_ > 2000):
~~~{.R}
j = 9; n = 10
expect.max.ostat(n, j=j, para=vec2par(c(0, 1), type="nor"), cdf=cdfnor, pdf=pdfnor)
# [1] 1.001357045
~~~
- [`evNormOrdStats`](https://www.rdocumentation.org/packages/EnvStats/versions/2.1.0/topics/evNormOrdStats) in [`EnvStats`](https://cran.r-project.org/web/packages/EnvStats/index.html) (version ≥2.3.0), using Royston 1982:
~~~{.R}
library(EnvStats)
evNormOrdStatsScalar(10^10,10^10)
# [1] 6.446676405
## Warning message: In evNormOrdStatsScalar(10^10, 10^10) :
## The 'royston' method has not been validated for sample sizes greater than 2000 using
## the default value of inc = 0.025. You may want to make the value of 'inc' less than 0.025.
evNormOrdStatsScalar(10^10,10^10, inc=0.000001)
# [1] 6.446676817
~~~
- **Monte Carlo**: the simple approach of averaging over _i_ iterations of generating _n_ random normal deviates, sorting, and selecting the `j`th order statistic does not scale well due to being 𝓞(_n_) in both time & space for generation & 𝓞(_n_ · log(_n_)) for a comparison sort or another 𝓞(_n_) if one is more careful to use a lazy sort or [selection algorithm](!Wikipedia), and coding up an online selection algorithm is not a one-liner. Better solutions typically use a beta transformation to efficiently generate a single sample from the expected order statistic:
- `order_rnorm` in [`orderstats`](https://cran.r-project.org/web/packages/orderstats/index.html), with `k` of `n`:
~~~{.R}
library(orderstats)
mean(replicate(100000, order_rnorm(k=10^10, n=10^10)))
# [1] 6.446370373
~~~
- `order` in [`evd`](https://cran.r-project.org/web/packages/evd/index.html), with `j` of `n`:
~~~{.R}
library(evd)
mean(rorder(100000, distn="norm", j=10^10, mlen=10^10, largest=FALSE))
# [1] 6.447222051
~~~
- **Blom & other approximations**:
- `evNormOrdStats` in `EnvStats`'s provides as an option the Blom approximation:[^EnvStats-earlier]
> When `method="blom"`, the following approximation to $\mathbb{E}(r,n)$,
> proposed by Blom (1958, pp. 68--75), is used:
>
> $$\mathbb{E}(r, n) \approx \Phi^{-1}(\frac{r - \alpha}{n - 2\alpha + 1}) \;\;\;\;\;\; (5)$$
>
> By default, alpha = 3⁄8 = 0.375. This approximation is quite
> accurate. For example, for _n_ > 2_, the approximation is accurate
> to the first decimal place, and for _n_ > 9 it is accurate to the
> second decimal place.
Blom's approximation is also [quoted as](https://stats.stackexchange.com/a/9007/16897):
$$\mathbb{E}(r, n) \approx \mu + \Phi^{-1} (\frac{r - \alpha}{n - 2\alpha +1})\sigma; \alpha = 0.375$$
- [Elfving's]{.smallcaps} correction to Blom is the same, yielding:
~~~{.R}
elfving1947E <- function(r,n) { alpha=pi/8; qnorm( (r - alpha) / (n - 2*alpha + 1)) }
elfving1947E(10^10, 10^10)
# [1] 6.437496713
~~~
[^EnvStats-earlier]: A previous version of `EnvStats` described the approximation thus:
> The function `evNormOrdStatsScalar` computes the value of $\mathbb{E}(r,n)$ for user-specified values of _r_ and _n_. The function `evNormOrdStats` computes the values of $\mathbb{E}(r,n)$ for all values of _r_ for a user-specified value of _n_. For large values of _n_, the function `evNormOrdStats` with `approximate=FALSE` may take a long time to execute. When `approximate=TRUE`, `evNormOrdStats` and `evNormOrdStatsScalar` use the following approximation to $\mathbb{E}(r,n)$, which was proposed by Blom (1958, pp. 68--75, ["6.9 An approximate mean value formula" & formula 6.10.3--6.10.5]):
>
> $$\mathbb{E}(r,n) \approx \Phi^{-1} (\frac{r - \frac{3}{8}}{n + \frac{1}{4}})$$
>
> ~~~{.R}
> ## General Blom 1958 approximation:
> blom1958E <- function(r,n) { qnorm((r - 3/8) / (n + 1/4)) }
> blom1958E(10^10, 10^10)
> # [1] 6.433133208
> ~~~
>
> This approximation is quite accurate. For example, for $n \ge 2$, the approximation is accurate to the first decimal place, and for $n \ge 9$ it is accurate to the second decimal place.
# See Also
- [Optimal search of a sample of Gaussians with costs/benefits](/Embryo-selection#optimal-stoppingsearch)
# Appendix
## Probability of Bivariate Maximum
> Given a sample of _n_ pairs of 2 normal variables A & B which are correlated _r_, what is the probability _P_~max~ that the maximum on the first variable A is also the maximum on the second variable B?
> This is analogous to many testing or screening situations, such as employee hiring ("what is the probability the top-scoring applicant on the first exam is the top-scorer on the second as well?") or athletic contests ("what is the probability the current world champ will win the next championship?").
>
> Order statistics has long proven that asymptotically, _P_~max~ approaches 1⁄_n_. Exact answers are hard to find, but confirm the asymptotics; the closest that exists is for an approximation & special-case of the Ali-Mikhail-Haq copula: which roughly indicates that _r_ functions as a constant factor boost in _P_~max~, and the boost from _r_ fades out as _n_ increases.
>
> As long as _r_≠1, "the tails will come apart". _n_ increases the difficult too fast for any fixed _r_ to overcome. This has implications for interpreting extremes and test metrics.

^[!Margin: Few are the best at everything.] [["The tails come apart"]{.smallcaps}](https://www.lesswrong.com/posts/dC7mP5nSwvpL65Qu5/why-the-tails-come-apart "'Why The Tails Come Apart', Thrasymachus 2014") when we look at extremes of multiple traits: the best on one dimension is rarely the best on the second dimension, even if there is a close causal relationship---the tallest basketball player is not the best basketball player, the smartest scientist is not the most influential scientist, the calmest investor is not the richest investor, and so on.
This observation can be misinterpreted, and is not necessarily that important: height is certainly useful for basketball players, and the smarter you are, the better your chance of being a great scientist.
If we look at the best, they *will* indeed be well above average on relevant traits, even if they are not the most extreme outlier on all (or any) traits.
^[!Margin: How much does being the best at _X_ help with _Y_?] But this does raise the question of what the relationship is between the two possible maxima.
In other order statistics, an initially small tail effect can become large; a familiar example of a tail effect is the probability of a random sample crossing a threshold, when drawing from populations with different means or standard deviations---the higher the threshold, the bigger the multiplicative increase, and with a not-too-high threshold and a quite small difference in means, one population could completely (and counterintuitively) dominant the sample.
Does that happen with the bivariate maximum?
More generally, does a given _r_ > 0 increase the probability, how much, and does this probability increase or decrease relative to the 1⁄_n_ baseline?
Let's consider some cases of _r_ and _n_:
1. If _r_ = 1, then _P_~max~ must also always be **1**, as the two variables are identical.
2. For any _r_, Selecting the maximum out of _n_ = 1 is trivial and we can guarantee a _P_~max~=**1** chance of selecting the maximum
3. For _n_ = 2, getting the maximum twice out of 2 is easy (we must start at **≥0.5** and go up from there, to $\frac{1}{2} + \frac{1}{\pi}\arcsin(r)$, so eg 2⁄3 for _r_ = 0.5).
4. For larger _n_, we can simulate out _P_~max~ and see that despite the benefits to increasing averages & reduced regret (eg for [embryo selection](/images/genetics/selection/iq-selection-orderstatistics-pgs-errorregret.png "Graphs of the expected rank of top selected, probability of making an ideal selection, making a pessimal selection, and expected regret, for a simple idealized order-statistics scenario where the set of samples is measured with a noisy variable correlated _r_ with the latent variables (such as a PGS predicting adult IQ).")), getting the max twice becomes **increasingly difficult** and out of eg _n_ = 1 million, becomes near-impossible---even with high _r_, like _r_ = 0.95 (comparable to test-retest reliability for the SAT).
^[!Margin: Eventually, 'the tails come apart'.] Asymptotically, as _n_ increases, the two distributions become independent ([Sibuya 1960](https://www.ism.ac.jp/editsec/aism/pdf/011_3_0195.pdf "Bivariate Extreme Statistics, I")) and so the probability of the max in one being max in the other approaches 1⁄_n_.
([Srivastava 1967](/docs/statistics/order/1967-srivastava.pdf "Asymptotic Independence of Certain Statistics Connected with the Extreme Order Statistics in a Bivariate Distribution ")/[Mardia 1964](/docs/statistics/order/1964-mardia.pdf "Asymptotic Independence of Bivariate Extremes") prove this is true of the other extremes as well.)
The benefits of an _r_, however large initially, are not large enough to keep up with the increasing _n_, and is overwhelmed.
Of course, the *mean* on the second distribution keeps increasing---it just doesn't increase as fast as necessary to maintain _P_~max~.
This has some implications.
First, it shows that we shouldn't take too seriously any argument of the form "the max on A is not the max on B, therefore A doesn't matter to B", since we can see that this is consistent with arbitrarily high _r_ (even a _r_ = 0.9999 will eventually give such a datapoint if we make _n_ big enough).
Second, it shows that _P_~max~ is not that interesting as a measure, since the asymptotic independence means that we may just be "sampling to a foregone conclusion"; anyone evaluating a selection or ranking procedure which puts stress on a candidate being the max on multiple traits (instead of having a high absolute value on traits) should think carefully about using such an unnatural zero-one [loss](!Wikipedia "Loss function") & its perverse consequences, like guaranteeing that the best candidate all-around will rarely or never be the maximum on any or all traits.
### Ali-Mikhail-Haq Copula
The relationship between _r_ and _n_ can be seen more explicitly in a special case formula for the Ali-Mikhail-Haq copula, which also shows how _r_ fades out with _n_.
[Matt F.](https://stats.stackexchange.com/a/453511/16897) has provided an analytic expression by using a different distribution, the [Ali-Mikhail-Haq](https://en.wikipedia.org/wiki/Copula_(probability_theory)#Archimedean_copulas) (AMH) [copula](!Wikipedia "Copula (probability theory)"), which approximates the normal distribution for the special-case 0<_r_ < 0.5, if _r_ is converted to the Ali-Mikhail-Haq's equivalent [Kendall's tau](!Wikipedia "Kendall rank correlation coefficient") correlation instead:
> First we choose the parameter θ to get Kendall's tau to agree for the [two](https://cran.r-project.org/web/packages/copula/vignettes/rhoAMH-dilog.pdf "'Spearman's Rho for the AMH Copula: a Beautiful Formula', M̈achler 2014") [distributions](https://math.stackexchange.com/questions/3058888/kendalls-tau-of-bivariate-normal "Kendall's Tau of bivariate normal"):
>
> $$1 - \frac{2((1-\theta)^2\log(1-\theta)+\theta)}{3\theta^2}=\frac{2}{\pi}\arcsin(r)$$
>
> ...θ = 3_r_ - 2_r_^2^ is a decent approximation to the first equation.
>
> Then we can calculate the probability of _X_ and _Y_ being maximized at the same observation:
>
> - In the limiting case where _r_ = 1⁄2, we have θ = 1, and the probability is $\frac{2}{1+n}$
> - In the limiting case of high _n_, the probability tends to $\frac{1+\theta}{n}$ and more precisely $\lim_{n\rightarrow \infty}np(n,\theta)=1+\theta$.
> - In general, the probability is
>
> $$p(n,\theta)=t\frac{\, _2F_1\left(1,n+1;n+2;t\right)}{n+1} - 2nt^2\frac{\, _2F_1\left(1,n+2;n+3;t\right)}{(n+1)(n+2)} -\frac{2nt}{(n+1)^2}+\frac{1}{n}$$
>
> where $t=\theta / (\theta-1)$ and $\,_2F_1$ is the [hypergeometric function](!Wikipedia).
His identity isn't a simple one in terms of _r_, but the necessary θ can be found to high accuracy by numerical optimization of the identity, using R's built-in numerical optimization routine `optim`.
Then the general formula can be estimated as written, with the main difficulty being where to get the hypergeometric function.
The resulting algorithm (an approximation of an exact answer to an approximation) can then be compared to a large (_i_ = 1 million) Monte Carlo simulation: it is good for small _n_ but has noticeable error:
~~~{.R .collapse}
r_to_theta <- function (r) {
identity_solver <- function(theta, r) {
target = 2/pi * asin(r)
theta_guess = 1 - ((2 * ((1 - theta)^2 * log(1 - theta) + theta))/(3 * theta^2))
return(abs(target - theta_guess)) }
theta_approx <- optim(0, identity_solver, r=r, method="Brent", lower=0, upper=1)
theta_approx$par
}
## The hypergeometric function is valid for all reals, but practically speaking,
## GSL's 'hyperg_2F1' code has issues with NaN once r>0.2 (theta ≅ −2.3); Stéphane Laurent
## recommends transforming (https://stats.stackexchange.com/a/33455/16897) for stability:
Gauss2F1 <- function(a,b,c,x){
## "gsl: Wrapper for the Gnu Scientific Library"
## https://cran.r-project.org/web/packages/gsl/index.html for 'hyperg_2F1'
library(gsl)
if(x>=0 & x<1){
hyperg_2F1(a,b,c,x)
}else{
hyperg_2F1(c-a,b,c,1-1/(1-x))/(1-x)^b
}
}
## Calculate the exact probability of a double max using the AMH approximation:
p_max_bivariate_amh <- function(n,r) {
## only valid for the special-case r=0-0.5:
if (r > 0.5 || r < 0.0) { stop("AMH formula only valid for 0 <= r <= 0.5") }
else {
## Handle special cases:
if (r == 0.5) { 2 / 1+n } else {
if (r == 0) { 1 / n } else {
theta <- r_to_theta(r)
t <- theta / (theta - 1)
t * (Gauss2F1(1, n+1, n+2, t) / (n+1)) -
2 * n * t^2 * (Gauss2F1(1, n+2, n+3, t) / ((n+1)*(n+2))) -
(2*n*t)/((n+1)^2) +
(1/n)
}
}
}
}
## Monte Carlo simulate the AMH approximation to check the exact formula:
p_max_bivariate_amh_montecarlo <- function(n,r, iters=1000000) {
## "copula: Multivariate Dependence with Copulas"
## https://cran.r-project.org/web/packages/copula/index.html for 'rCopula', 'amhCopula'
library(copula)
theta <- r_to_theta(r)
mean(replicate(iters, {
sample <- rCopula(n, amhCopula(theta))
which.max(sample[,1])==which.max(sample[,2]) }))
}
## Monte Carlo simulation of double max using the normal distribution:
p_max_bivariate_montecarlo <- function(n,r,iters=1000000) {
mean(replicate(iters, {
library(MASS) # for 'mvrnorm'
sample <- mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, r, r, 1), nrow=2))
which.max(sample[,1])==which.max(sample[,2])
})) }
## Compare all 3:
## 1. approximate Monte Carlo answer for the normal
## 2. exact answer to A-M-H approximation
## 3. approximate Monte Carlo answer to A-M-H approximation
r=0.25
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_montecarlo(n, r) }))
# [1] 0.580 0.427 0.345 0.295 0.258 0.230 0.211 0.193 0.179 0.168 0.158 0.150 0.142 0.136 0.130 0.124 0.120 0.115 0.111
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh(n, r) }))
# [1] 0.580 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.090 0.085 0.081 0.077
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh_montecarlo(n, r) }))
# [1] 0.581 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.089 0.085 0.081 0.077
~~~
### See also
- [_The Asymptotic Theory of Extreme Order Statistics_](/docs/statistics/order/1987-galambos-theasymptotictheoryofextremeorderstatistics2nded.pdf), Galambos 1987
- ["Copula associated to order statistics"](/docs/statistics/order/2005-dosanjos.pdf), dos Anjos et al 2005
## Sampling Gompertz Distribution Extremes {.collapse}
> I implement random sampling from the extremes/order statistics of the Gompertz survival distribution, used to model human life expectancies, with the beta transformation trick and `flexsurv`/root-finding inversion. I then discuss the unusually robust lifespan record of Jeanne Calment, and show that records like hers (which surpass the runner-up's lifespan by such a degree) are not usually produced by a Gompertz distribution, supporting the claim that her lifespan was indeed unusual even for the record holder.

The [Gompertz distribution](!Wikipedia) is a distribution often used to model survival curves where mortality increases over time, particularly human life expectancies.
The order statistics of a Gompertz are of interest in considering extreme cases such as centenarians.
The usual family of random/density/inverse CDF (quantile) functions (`rgompertz`/`dgompertz`/`pgompertz`/`qgompertz`) are provided by several R libraries, such as [`flexsurv`](https://cran.r-project.org/web/packages/flexsurv/index.html), but none of them appear to provide implementations of order statistics.
Using `rgompertz` for the Monte Carlo approximation works, but like the normal distribution case, breaks down as one examines larger cases (eg considering order statistics out of one billion takes >20s & runs out of RAM).
The beta transformation used for the normal distribution works for the Gompertz as well, as it merely requires an ability to invert the CDF, which is provided by `qgompertz`, and if that is not available (perhaps we are working with some other distribution besides the Gompertz where a `q*` function is not available), one can approximate it by a short root-finding optimization.
So, given the [beta trick](!Wikipedia "Order statistic#Order statistics sampled from a uniform distribution") where the order statistics of any distribution is equivalent to $U_{(k)} \sim \operatorname{Beta}(k, n + 1 - k)$, we can plug in the desired _k_-out-of-_n_ parameters and generate a random sample efficiently via `rbeta`, getting a value on the 0--1 range (eg 0.998879842 for max-of-1000) and then either use `qgompertz` or optimization to transform to the final Gompertz-distributed values (see also [inverse transform sampling](!Wikipedia)).
~~~{.R}
library(flexsurv)
round(digits=2, rgompertz(n = 10, shape = log(1.1124), rate = 0.000016443))
# [1] 79.85 89.88 82.82 80.87 81.24 73.14 93.54 57.52 78.02 85.96
## The beta/uniform order statistics transform, which samples from the Gompertz CDF:
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }
## Root-finding version: define the CDF by hand, then invert via optimization:
### Define Gompertz CDF; these specific parameters are taken from a Dutch population
### survival curve for illustrative purposes (see https://www.gwern.net/Longevity)
F <- function(g) { min(1, pgompertz(g, log(1.1124), rate = 0.000016443, log = FALSE)) }
### Invert the Gompertz CDF to yield the actual numerical value:
InvF <- Vectorize(function(a) { uniroot(function(x) F(x) - a, c(0,130))$root })
### Demo: 10 times, report the age of the oldest person out of a hypothetical 10b:
round(digits=2, InvF(uk(10, 1e+10, 1e+10)))
# [1] 111.89 111.69 112.31 112.25 111.74 111.36 111.54 111.91 112.46 111.79
## Easier: just use `qgompertz` to invert it directly:
round(digits=2, qgompertz(uk(10, 1e+10, 1e+10), log(1.1124), rate = 0.000016443, log = FALSE))
# [1] 111.64 111.59 112.41 111.99 111.91 112.00 111.84 112.33 112.20 111.30
## Package up as a function:
library(flexsurv)
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }
rKNGompertz <- function (iters, total_n, kth, scale, rate) {
qgompertz(uk(iters, kth, total_n), log(scale), rate = rate, log = FALSE)
}
## Demo:
round(digits=2, rKNGompertz(10, 1e+10, 1e+10, 1.1124, 0.000016443))
# [1] 112.20 113.23 112.12 111.62 111.65 111.94 111.60 112.26 112.15 111.99
## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
mean(replicate(10000, max(rgompertz(n = 10000, shape = log(1.1124), rate = 0.000016443))))
# 103.715134
mean(rKNGompertz(10000, 10000, 10000, 1.1124, 0.000016443))
# 103.717864
~~~
As mentioned, other distributions work just as well.
If we wanted a normal or [log-normal distribution](!Wikipedia) instead, then we simply use `qnorm`/`qlnorm`:
~~~{.R}
## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
### Normal:
mean(replicate(10000, max(rnorm(n = 10000))))
# [1] 3.85142815
mean(qnorm(uk(10000, 10000, 10000)))
# [1] 3.8497741
### Log-normal:
mean(replicate(10000, max(rlnorm(n = 10000))))
# [1] 49.7277024
mean(qlnorm(uk(10000, 10000, 10000)))
# [1] 49.7507764
~~~
### Jeanne Calment case study
An example where simulating from the tails of the Gompertz distribution is useful would be considering the case of super-centenarian [Jeanne Calment](!Wikipedia), who has held the world record for longevity for over 22 years now: Calment lived for 122 years & 164 days (122.45 years) and was the world's oldest person for almost 13× longer than usual, while the global runner-up, [Sarah Knauss](!Wikipedia) lived only 119 years & 97 days (119.27 years), a difference of 3.18 years.
This has [raised questions about what is the cause](/Questions#jeanne-calment) of what appears to be an unexpectedly large difference.
Empirically, using the Gerontology Research Group data, the gaps between [verified oldest people ever](!Wikipedia "List of the verified oldest people#100 verified oldest women") are much smaller than >3 years; for example, among the women, Knapp was 119, and #3--9 were all 117 year old (with #10 just 18 days shy of 117).
The oldest men follow a similar pattern: #1, [Jiroemon Kimura](!Wikipedia), is 116.15 vs #2's 115.69, a gap of only 1.8 years, and then #3--4 are all 115, and #5--7 are 114, and so on.
In order statistics, the expected gap between successive _k_-of-_n_ samples typically shrinks the larger _k_/_n_ becomes (diminishing returns), and the Gompertz curve is used to model an acceleration in mortality that makes annual mortality rates skyrocket and is why no one ever lives to 130 or 140 as they run into a 'mortality spike'.
The other women and the men make Calment's record look extraordinary, but order statistics and the Gompertz curve are counterintuitive, and one might wonder if Gompertz curves might naturally occasionally produce Calment-like gaps regardless of the expected gaps or mortality acceleration.
To get an idea of what the Gompertz distribution would produce, we can consider a scenario like sampling from the top 10 out of 10 billion (about the right order of magnitude for the total elderly population of the Earth which has credible documentation over the past ~50 years), and, using some survival curve parameters from a Dutch population paper [I have laying around](/Longevity), see what sort of sets of top-10 ages we would expect and in particular, how often we'd see large gaps between #1 and #2:
~~~{.R}
simulateSample <- function(total_n, top_k) { sort(sapply(top_k:0,
function(k) {
rKNGompertz(1, total_n, total_n-k, 1.1124, 0.000016443) }
)) }
round(digits=2, simulateSample(1e+10, 10))
# [1] 110.70 110.84 110.89 110.99 111.06 111.08 111.25 111.26 111.49 111.70 112.74
simulateSamples <- function(total_n, top_k, iters=10000) { t(replicate(iters,
simulateSample(total_n, top_k))) }
small <- as.data.frame(simulateSamples(10000000000, 10, iters=100))
large <- as.data.frame(simulateSamples(10000000000, 10, iters=100000))
summary(round(small$V11 - small$V10, digits=2))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0000 0.0975 0.2600 0.3656 0.5450 1.5700
summary(round(large$V11 - large$V10, digits=2))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.00000 0.15000 0.35000 0.46367 0.66000 3.99000
mean((large$V11 - large$V10) >= 3.18) * 100
# [1] 0.019
library(ggplot2)
library(reshape)
colnames(small) <- as.character(seq(-10, 0, by=1))
small$V0 <- row.names(small)
small_melted <- melt(small, id.vars= 'V0')
colnames(small_melted) <- c("V0", "K", "Years")
ggplot(small_melted, aes(x = K, y = Years)) +
geom_path(aes(color = V0, group = V0)) + geom_point() +
coord_cartesian(ylim = c(110, 115)) + theme(legend.position = "none")
~~~
![100 simulated samples of top-10-oldest-people out of 10 billion, showing what are reasonable gaps between individuals](/images/longevity/gompertzcurve-longevity-calmentcasestudy.png){.full-width}
With these specific set of parameters, we see median gaps somewhat similar to the empirical data, but we hardly ever (~0.02% of the time) see #1--#2 gaps quite as big as Calment's.
The graph also seems to suggest that there is in fact typically a 'jump' between #1 & #2 compared to #2 & #3 and so on, which I was not expecting; thinking about it, I suspect there is some sort of selection effect here: if a random sample from '#1' falls below the random sample of '#2', then they will simply swap places (because when sorted, #2 will be bigger than #1), so an 'unlucky' #1 disappears, creating a ratchet effect ensuring the final '#1' will be larger than expected.
Any _k_ could exceed expectations, but #1 is the last possible ranking, so it can become more extreme.
If I remove the `sort` call which ensures monotonicity with rank, the graph looks considerably more irregular but still has a jump, so this selection effect may not be the entire story:
![100 simulated samples of top-10-oldest-people out of 10 billion; simulated by rank, unsorted (independent samples)](/images/longevity/gompertzcurve-longevity-calmentcasestudy-unsorted.png)
So, overall, a standard Gompertz model does not easily produce a Calment.
This doesn't prove that the Calment age is wrong, though.
It could just be that Calment [proves centenarian aging is not well modeled by a Gompertz](/Modus), or my specific parameter values are wrong (the gaps are similar but the ages are overall off by ~5 years, and perhaps that makes a difference).
To begin with, it is unlikely that the Gompertz curve is *exactly* correct a model of aging; in particular, gerontologists note an apparent ceiling of the annual mortality rate at ~50%, which is inconsistent with the Gompertz (which would continue increasing arbitrarily, quickly hitting >99% annual mortality rates).
And maybe Calment really is just an outlier due to sampling error (0.02% ≠ 0%), or she is indeed out of the ordinary human life expectancy distribution but there is a more benign explanation for it like a unique mutation or genetic configuration.
But it does seem like Calment's record is weird in some way.