Calculating The Expected Maximum

In generating a sample of n datapoints drawn from a normal/Gaussian distribution, the expected size of the largest datapoint depends on how large n is. I implement a variety of exact & approximate calculations in R to compare their efficiency & accuracy.
statistics, computer-science, R, bibliography, order-statistics
2016-01-22–2020-03-12 · finished · certainty: highly likely · importance: 5


In generating a sample of n datapoints drawn from a normal/Gaussian distribution with a particular mean/SD, the average size of the biggest datapoint will depend on how large n is. Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/good the worst/best datapoint will be (eg the score of the winner in a multi-player game).

The mean/average/expectation of the maximum of a draw of n samples from a normal distribution has, unfortunately, no exact formula, and is generally not built into any programming language’s standard libraries.
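For reference, the quantity in question is the standard order-statistics integral for the expected maximum of n i.i.d. standard normals, with φ & Φ the standard normal pdf & CDF; the ‘exact’ numerical-integration approaches below evaluate (a rescaled version of) this integral numerically:

$$E[X_{(n)}] = n \int_{-\infty}^{\infty} x \, \varphi(x) \, \Phi(x)^{n-1} \, dx$$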

I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic. The overall best approach is to calculate the exact order statistics for the n range of interest using numerical integration via lmomco and cache them in a lookup table, rescaling the mean/SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large n such as n > 100.

Visualizing maxima/minima in order statistics with increasing n in each sample (1–100).

Approximation

Monte Carlo

Most simply and directly, we can estimate it using a simulation with hundreds of thousands of iterations:

scores  <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain    <- function(n, sd) { scores <- scores(n, sd)
                             return(max(scores)); }
simGain <- function(n, sd=1, iters=500000) {
                             mean(replicate(iters, gain(n, sd))); }

But in R this can take seconds for small n and gets worse as n increases into the hundreds, as we need to calculate over increasingly large samples of random normals (so one could consider this 𝓞(n)); this makes use of the simulation difficult when nested in higher-level procedures such as anything involving resampling or simulation. In R, calling functions many times is slower than being able to call a function once in a ‘vectorized’ way where all the values can be processed in a single batch. We can try to vectorize this simulation by generating n · i random normals, grouping them into a large matrix with n columns (each row being one n-sized batch of samples), then computing the maximum of the i rows, and the mean of the maximums. This is about twice as fast for small n; implemented using rowMaxs from the R package matrixStats, it is anywhere from 25% to 500% faster (at the expense of likely much higher memory usage, as the R interpreter is unlikely to apply any optimizations such as Haskell’s stream fusion):

simGain2 <- function(n, sd=1, iters=500000) {
    mean(apply(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)), 1, max)) }

library(matrixStats)
simGain3 <- function(n, sd=1, iters=500000) {
    mean(rowMaxs(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)))) }

Each simulation is too small to be worth parallelizing, but there are so many iterations that they can be split up usefully and run with a fraction in a different process; something like:

library(parallel)
library(plyr)
simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
   mean(unlist(mclapply(1:n.parallel, function(i) {
    mean(replicate(iters/n.parallel, gain(n, sd))); })))
}

We can treat the simulation estimates as exact and use memoisation, such as provided by the R package memoise, to cache results & never recompute them, but it will still be slow on the first calculation. So it would be good to have either an exact algorithm or an accurate approximation: for one of my analyses, I want accuracy to ±0.0006 SDs, which requires large Monte Carlo samples.
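A minimal sketch of such caching, assuming the simGain defined above (only the first call for a given argument combination pays the simulation cost):

library(memoise)
## Cache simGain results; repeated calls with identical (n, sd, iters) arguments return instantly:
simGainMemoised <- memoise(simGain)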

Upper bounds

To summarize the Cross Validated discussion: the simplest upper bound is σ · √(2 · ln(n)), which makes the diminishing returns clear. Implementation:

upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
The max central limit theorem visualized (Gabriel Peyré)

Most of the approximations are sufficiently fast as they are effectively 𝓞(1) with small constant factors (if we ignore that functions like Φ−1/qnorm themselves may technically be 𝓞(log(n)) or 𝓞(n) for large n). However, accuracy becomes the problem: this upper bound is hopelessly inaccurate in small samples when we compare to the Monte Carlo simulation. Others (also inaccurate) include σ · (n − 1)/√(2n − 1) and −Φ−1(1/(n + 1)):

upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }

Formulas

  1. Blom 1958, Statistical estimates and transformed beta-variables, provides a general approximation formula E(r, n) ≈ μ + Φ−1((r − α)/(n − 2α + 1)) · σ with α = 0.375, which, specializing to the max (r = n), is Φ−1((n − α)/(n − 2α + 1)) · σ and is better than the upper bounds:

    blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
  2. Elfving 1947, apparently by way of Mathematical Statistics, Wilks 1962, demonstrates that Blom 1958’s approximation is imperfect because actually α = π/8 ≈ 0.3927 rather than 3⁄8, so:

    elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }

    (Blom 1958 appears to be more accurate for n < 48 and then Elfving’s correction dominates.)

  3. Harter 1961 elaborated this by giving different values for α, and Royston 1982 provides computer algorithms; I have not attempted to provide an R implementation of these.

  4. probabilityislogic offers a 2015 derivation via the beta-F compound distribution1, yielding a closed-form approximation and an approximate (but highly accurate) numerical integration as well:

    pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
        ((n/(n+1)) * (1 - (n/(n+1)))) /
        (2*(n+2) * (dnorm(qnorm(n/(n+1))))^2) }}
    pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) }

    The integration can be done more directly as

    pil2015Integrate2 <- function(n) { integrate(function(v) qnorm(qbeta(v, n, 1)), 0, 1)$value }
  5. Another approximation comes from Chen & Tyler 1999: Φ−1(0.5264^(1/n)). Unfortunately, while accurate enough for most purposes, it is still off by as much as 1 IQ point and has an average mean error of −0.32 IQ points compared to the simulation:

    chen1999 <- function(n, sd=1){ qnorm(0.5264^(1/n), sd=sd) }
    
    approximationError <- sapply(1:1000, function(n) { (chen1999(n) - simGain(n)) * 15 })
    summary(approximationError)
    #       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
    # -0.3801803 -0.3263603 -0.3126665 -0.2999775 -0.2923680  0.9445290
    plot(1:1000, approximationError,  xlab="Number of samples taking the max", ylab="Error in 15*SD (IQ points)")
    Error in using the Chen & Tyler 1999 approximation to estimate the expected value (gain) from taking the maximum of n normal samples

Polynomial regression

From a less mathematical perspective, any regression or machine learning model could be used to try to develop a cheap but highly accurate approximation by simply predicting the extreme from the relevant range of n = 2–300—the goal being less to make good predictions out of sample than to interpolate/memorize as much as possible in-sample.

Plotting the extremes, they form a smooth, almost logarithmic curve:

## 'exactMax' is the exact order-statistic function defined in the 'Exact' section below:
df <- data.frame(N=2:300, Max=sapply(2:300, exactMax))
l <- lm(Max ~ log(N), data=df); summary(l)
# Residuals:
#         Min          1Q      Median          3Q         Max
# -0.36893483 -0.02058671  0.00244294  0.02747659  0.04238113
# Coefficients:
#                Estimate  Std. Error   t value   Pr(>|t|)
# (Intercept) 0.658802439 0.011885532  55.42894 < 2.22e-16
# log(N)      0.395762956 0.002464912 160.55866 < 2.22e-16
#
# Residual standard error: 0.03947098 on 297 degrees of freedom
# Multiple R-squared:  0.9886103,   Adjusted R-squared:  0.9885719
# F-statistic: 25779.08 on 1 and 297 DF,  p-value: < 2.2204e-16
plot(df); lines(predict(l))

This has the merit of simplicity (function(n) {0.658802439 + 0.395762956*log(n)}), but while the R² is quite high by most standards, the residuals are too large to make a good approximation—the log curve overshoots initially, then undershoots, then overshoots. We can try to find a better log curve by using polynomial or spline regression, which broadens the family of possible curves. A 4th-order polynomial turns out to fit as beautifully as we could wish, R² = 0.9999998:

lp <- lm(Max ~ log(N) + I(log(N)^2) + I(log(N)^3) + I(log(N)^4), data=df); summary(lp)
# Residuals:
#           Min            1Q        Median            3Q           Max
# -1.220430e-03 -1.074138e-04 -1.655586e-05  1.125596e-04  9.690842e-04
#
# Coefficients:
#                  Estimate    Std. Error    t value   Pr(>|t|)
# (Intercept)  1.586366e-02  4.550132e-04   34.86418 < 2.22e-16
# log(N)       8.652822e-01  6.627358e-04 1305.62159 < 2.22e-16
# I(log(N)^2) -1.122682e-01  3.256415e-04 -344.76027 < 2.22e-16
# I(log(N)^3)  1.153201e-02  6.540518e-05  176.31640 < 2.22e-16
# I(log(N)^4) -5.302189e-04  4.622731e-06 -114.69820 < 2.22e-16
#
# Residual standard error: 0.0001756982 on 294 degrees of freedom
# Multiple R-squared:  0.9999998,   Adjusted R-squared:  0.9999998
# F-statistic: 3.290056e+08 on 4 and 294 DF,  p-value: < 2.2204e-16

## If we want to call the fitted objects:
linearApprox <- function (n) { predict(l, data.frame(N=n)); }
polynomialApprox <- function (n) { predict(lp, data.frame(N=n)); }
## Or simply code it by hand:
la <- function(n, sd=1) { (0.658802439 + 0.395762956*log(n)) * sd; }
pa <- function(n, sd=1) { N <- log(n);
    (1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }

This has the virtue of speed & simplicity (a few arithmetic operations) and high accuracy; it is not intended to perform well past the largest datapoint of n = 300 (although if one needed to, one could simply generate the additional datapoints and refit, adding more polynomials if necessary), but it turns out to be a good approximation up to n = 800 (after which it consistently overestimates by ~0.01):

heldout <- sapply(301:1000, exactMax)
test <- sapply(301:1000, pa)
mean((heldout - test)^2)
# [1] 3.820988144e-05
plot(301:1000, heldout); lines(test)

So this method, while lacking any kind of mathematical pedigree or derivation, provides the best approximation so far.

Exact

The R package lmomco (Asquith 2011) calculates a wide variety of order statistics using numerical integration & other methods. It is fast, unbiased, and generally correct (for small values of n2)—it is close to the Monte Carlo estimates even for the smallest n where the approximations tend to do badly, so it does what it claims to and provides what we want (a fast exact estimate of the mean gain from selecting the maximum from n samples from a normal distribution). The results can be memoised for a further moderate speedup (eg calculated over n = 1–1000, 0.45s vs 3.9s for a speedup of ~8.7×):

library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
    expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
## Comparison to MC:
# ...         Min.       1st Qu.        Median          Mean       3rd Qu.          Max.
#    −0.0523499300 −0.0128622900 −0.0003641315 −0.0007935236  0.0108748800  0.0645207000

library(memoise)
exactMax_memoised <- memoise(exactMax_unmemoized)
Error in using Asquith 2011’s L-moment Statistics numerical integration package to estimate the expected value (gain) from taking the maximum of n normal samples

Comparison

With lmomco providing exact values, we can visually compare all the presented methods for accuracy; there are considerable differences but the best methods are in close agreement:

Comparison of estimates of the maximum for n = 2–300 for 12 methods, showing Chen 1999/polynomial approximation/Monte Carlo/lmomco are the most accurate and Blom 1958/upper bounds highly-inaccurate.
library(parallel)
library(plyr)
scores  <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain    <- function(n, sd) { scores <- scores(n, sd)
                             return(max(scores)); }
simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
   mean(unlist(mclapply(1:n.parallel, function(i) {
    mean(replicate(iters/n.parallel, gain(n, sd))); })))
}
upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }
blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
    ((n/(n+1)) * (1 - (n/(n+1)))) /
    (2*(n+2) * (dnorm(qnorm(n/(n+1))))^2) }}
pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) }
chen1999 <- function(n,sd=1){ qnorm(0.5264^(1/n), sd=sd) }
library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
    expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
library(memoise)
exactMax_memoised <- memoise(exactMax_unmemoized)
la <- function(n, sd=1) { (0.658802439 + 0.395762956*log(n)) * sd; }
pa <- function(n, sd=1) { N <- log(n);
    (1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }

implementations <- c(simGainP,upperBoundMax,upperBoundMax2,upperBoundMax3,blom1958,elfving1947,
    pil2015,pil2015Integrate,chen1999,exactMax_unmemoized,la,pa)
implementationNames <- c("Monte Carlo","Upper bound #1","UB #2","UB #3","Blom 1958","Elfving 1947",
    "Pil 2015","Pil 2015 (numerical integration)","Chen 1999","Exact","Log approx.","Polynomial approx.")
maxN <- 300
results <- data.frame(N=integer(), SD=numeric(), Implementation=character())
for (i in 1:length(implementations)) {
    SD <- unlist(Map(function(n) { implementations[i][[1]](n); }, 2:maxN))
    name <- implementationNames[i]
    results <- rbind(results, (data.frame(N=2:maxN, SD=SD, Implementation=name)))
    }
results$Implementation <- ordered(results$Implementation,
    levels=c("Upper bound #1","UB #2","UB #3","Log approx.","Blom 1958","Elfving 1947","Pil 2015",
             "Pil 2015 (numerical integration)","Chen 1999","Polynomial approx.", "Monte Carlo", "Exact"))
library(ggplot2)
qplot(N, SD, color=Implementation, data=results) + coord_cartesian(ylim = c(0.25,3.9))

qplot(N, SD, color=Implementation, data=results) + geom_smooth()

And micro-benchmarking them quickly (excluding Monte Carlo) to get an idea of time consumption shows the expected results (aside from Pil 2015’s numerical integration taking longer than expected, suggesting either memoising or changing the fineness):

library(microbenchmark)
f <- function() { sample(2:1000, 1); }
microbenchmark(times=10000, upperBoundMax(f()),upperBoundMax2(f()),upperBoundMax3(f()),
    blom1958(f()),elfving1947(f()),pil2015(f()),pil2015Integrate(f()),chen1999(f()),
    exactMax_memoised(f()),la(f()),pa(f()))
# Unit: microseconds
#                    expr       min         lq          mean     median         uq       max neval
#                     f()     2.437     2.9610     4.8928136     3.2530     3.8310  1324.276 10000
#      upperBoundMax(f())     3.029     4.0020     6.6270124     4.9920     6.3595  1218.010 10000
#     upperBoundMax2(f())     2.886     3.8970     6.5326593     4.7235     5.8420  1029.148 10000
#     upperBoundMax3(f())     3.678     4.8290     7.4714030     5.8660     7.2945   892.594 10000
#           blom1958(f())     3.734     4.7325     7.3521356     5.6200     7.0590  1050.915 10000
#        elfving1947(f())     3.757     4.8330     7.7927493     5.7850     7.2800  1045.616 10000
#            pil2015(f())     5.518     6.9330    10.8501286     9.2065    11.5280   867.332 10000
#   pil2015Integrate(f()) 14088.659 20499.6835 21516.4141399 21032.5725 22151.4150 53977.498 10000
#           chen1999(f())     3.788     4.9260     7.7456654     6.0370     7.5600  1415.992 10000
#  exactMax_memoised(f())   106.222   126.1050   211.4051056   162.7605   221.2050  4009.048 10000
#                 la(f())     2.882     3.8000     5.7257008     4.4980     5.6845  1287.379 10000
#                 pa(f())     3.397     4.4860     7.0406035     5.4785     6.9090  1818.558 10000

Rescaling for generality

The memoised function has three arguments, so memoising on the fly would seem to be the best one could do, since one cannot precompute all possible combinations of the n/mean/SD. But actually, we only need to compute the results for each n.

We can default to assuming the standard normal distribution 𝒩(0, 1) without loss of generality because it’s easy to rescale any normal to another normal: to scale to a different mean μ, one simply adds μ to the expected extreme, so one can assume μ = 0; and to scale to a different standard deviation σ, we simply multiply appropriately. So if we wanted the extreme for n = 5 for 𝒩(10, 2), we can calculate it simply by taking the estimate for n = 5 for 𝒩(0, 1), multiplying by 2⁄1 = 2, and then adding 10 − 0 = 10:

(exactMax(5, mean=0, sd=1)*2 + 10) ; exactMax(5, mean=10, sd=2)
# [1] 12.32592895
# [1] 12.32592895

So in other words, it is unnecessary to memoize all possible combinations of n, mean, and SD—in reality, we only need to compute each n and then rescale it as necessary for each caller. And in practice, we only care about n = 2–200, which is few enough that we can define a lookup table using the lmomco results and use that instead (with a fallback to lmomco for n > 200, and a fallback to Chen & Tyler 1999 for n > 2000 to work around lmomco’s bug with large n):

exactMax <- function (n, mean=0, sd=1) {
if (n>2000) {
    chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
    chen1999(n,mean=mean,sd=sd) } else {
    if(n>200) { library(lmomco)
        exactMax_unmemoized <- function(n, mean=0, sd=1) {
            expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
        exactMax_unmemoized(n,mean=mean,sd=sd) } else {

 lookup <- c(0,0,0.5641895835,0.8462843753,1.0293753730,1.1629644736,1.2672063606,1.3521783756,1.4236003060,
  1.4850131622,1.5387527308,1.5864363519,1.6292276399,1.6679901770,1.7033815541,1.7359134449,1.7659913931,
  1.7939419809,1.8200318790,1.8444815116,1.8674750598,1.8891679149,1.9096923217,1.9291617116,1.9476740742,
  1.9653146098,1.9821578398,1.9982693020,2.0137069241,2.0285221460,2.0427608442,2.0564640976,2.0696688279,
  2.0824083360,2.0947127558,2.1066094396,2.1181232867,2.1292770254,2.1400914552,2.1505856577,2.1607771781,
  2.1706821847,2.1803156075,2.1896912604,2.1988219487,2.2077195639,2.2163951679,2.2248590675,2.2331208808,
  2.2411895970,2.2490736293,2.2567808626,2.2643186963,2.2716940833,2.2789135645,2.2859833005,2.2929091006,
  2.2996964480,2.3063505243,2.3128762306,2.3192782072,2.3255608518,2.3317283357,2.3377846191,2.3437334651,
  2.3495784520,2.3553229856,2.3609703096,2.3665235160,2.3719855541,2.3773592389,2.3826472594,2.3878521858,
  2.3929764763,2.3980224835,2.4029924601,2.4078885649,2.4127128675,2.4174673530,2.4221539270,2.4267744193,
  2.4313305880,2.4358241231,2.4402566500,2.4446297329,2.4489448774,2.4532035335,2.4574070986,2.4615569196,
  2.4656542955,2.4697004768,2.4736966781,2.4776440650,2.4815437655,2.4853968699,2.4892044318,2.4929674704,
  2.4966869713,2.5003638885,2.5039991455,2.5075936364,2.5111482275,2.5146637581,2.5181410417,2.5215808672,
  2.5249839996,2.5283511812,2.5316831323,2.5349805521,2.5382441196,2.5414744943,2.5446723168,2.5478382097,
  2.5509727783,2.5540766110,2.5571502801,2.5601943423,2.5632093392,2.5661957981,2.5691542321,2.5720851410,
  2.5749890115,2.5778663175,2.5807175211,2.5835430725,2.5863434103,2.5891189625,2.5918701463,2.5945973686,
  2.5973010263,2.5999815069,2.6026391883,2.6052744395,2.6078876209,2.6104790841,2.6130491728,2.6155982225,
  2.6181265612,2.6206345093,2.6231223799,2.6255904791,2.6280391062,2.6304685538,2.6328791081,2.6352710490,
  2.6376446504,2.6400001801,2.6423379005,2.6446580681,2.6469609341,2.6492467445,2.6515157401,2.6537681566,
  2.6560042252,2.6582241720,2.6604282187,2.6626165826,2.6647894763,2.6669471086,2.6690896839,2.6712174028,
  2.6733304616,2.6754290533,2.6775133667,2.6795835873,2.6816398969,2.6836824739,2.6857114935,2.6877271274,
  2.6897295441,2.6917189092,2.6936953850,2.6956591311,2.6976103040,2.6995490574,2.7014755424,2.7033899072,
  2.7052922974,2.7071828562,2.7090617242,2.7109290393,2.7127849375,2.7146295520,2.7164630139,2.7182854522,
  2.7200969934,2.7218977622,2.7236878809,2.7254674700,2.7272366478,2.7289955308,2.7307442335,2.7324828686,
  2.7342115470,2.7359303775,2.7376394676,2.7393389228,2.7410288469,2.7427093423,2.7443805094,2.7460424475)

 return(mean + sd*lookup[n+1]) }}}

This gives us exact computation at 𝓞(1) (with an amortized 𝓞(1) when n > 200) with an extremely small constant factor (a conditional, vector index, multiplication, and addition, which is overall ~10× faster than a memoised lookup), giving us all our desiderata simultaneously & resolving the problem.
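A quick sketch of how one might check that speedup claim with microbenchmark, restricting the random n to the 2–200 lookup-table range (exact timings will vary by machine):

library(microbenchmark)
fSmall <- function() { sample(2:200, 1); }
## lookup-table version vs the memoised numerical-integration version (cache hits after warm-up):
microbenchmark(times=10000, exactMax(fSmall()), exactMax_memoised(fSmall()))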

General order statistics for the normal distribution

One might also be interested in computing the general order statistic.
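For the general jth-of-n case, the corresponding exact expectation (what the numerical-integration implementations below compute) is the standard generalization of the maximum integral given earlier:

$$E[X_{(j)}] = \frac{n!}{(j-1)!\,(n-j)!} \int_{-\infty}^{\infty} x \, \Phi(x)^{j-1} \, \big(1 - \Phi(x)\big)^{n-j} \, \varphi(x) \, dx$$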

Some available implementations in R:

  • numerical integration:

    • lmomco, with j of n (warning: remember lmomco’s bug with n > 2000):

      j = 9; n = 10
      expect.max.ostat(n, j=j, para=vec2par(c(0, 1), type="nor"), cdf=cdfnor, pdf=pdfnor)
      # [1] 1.001357045
    • evNormOrdStats in EnvStats (version ≥2.3.0), using Royston 1982:

      library(EnvStats)
      evNormOrdStatsScalar(10^10,10^10)
      # [1] 6.446676405
      ## Warning message: In evNormOrdStatsScalar(10^10, 10^10) :
      ## The 'royston' method has not been validated for sample sizes greater than 2000 using
      ## the default value of inc = 0.025. You may want to make the value of 'inc' less than 0.025.
      evNormOrdStatsScalar(10^10,10^10, inc=0.000001)
      # [1] 6.446676817
  • Monte Carlo: the simple approach of averaging over i iterations of generating n random normal deviates, sorting, and selecting the jth order statistic does not scale well, due to being 𝓞(n) in both time & space for generation & 𝓞(n · log(n)) for a comparison sort (or another 𝓞(n) if one is more careful to use a lazy sort or selection algorithm), and coding up an online selection algorithm is not a one-liner. Better solutions typically use a beta transformation to efficiently generate a single sample of the order statistic (see the package-free sketch after this list):

    • order_rnorm in orderstats, with k of n:

      library(orderstats)
      mean(replicate(100000, order_rnorm(k=10^10, n=10^10)))
      # [1] 6.446370373
    • order in evd, with j of n:

      library(evd)
      mean(rorder(100000, distn="norm", j=10^10, mlen=10^10, largest=FALSE))
      # [1] 6.447222051
  • Blom & other ap­prox­i­ma­tions:

    • evNormOrdStats in EnvStats provides the Blom approximation as an option:3

      When method="blom", the following approximation to E[Z(r)], proposed by Blom (1958, pp. 68–75), is used: E[Z(r)] ≈ Φ−1((r − α)/(n − 2α + 1))

      By default, alpha = 3⁄8 = 0.375. This approximation is quite accurate. For example, for n > 2, the approximation is accurate to the first decimal place, and for n > 9 it is accurate to the second decimal place.

      Blom’s approximation is also quoted as: E[Z(r)] ≈ Φ−1((r − 3⁄8)/(n + 1⁄4))

    • Elfving’s correction to Blom takes the same form but with α = π⁄8, yielding:

      elfving1947E <- function(r,n) { alpha=pi/8; qnorm( (r - alpha) / (n - 2*alpha + 1)) }
      elfving1947E(10^10, 10^10)
      # [1] 6.437496713
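As referenced in the Monte Carlo bullet above, the beta transformation needs no package at all; a minimal sketch for the normal distribution (the jth-of-n order statistic of a Uniform(0,1) sample follows Beta(j, n + 1 − j), which is then pushed through the normal quantile function):

## package-free beta-transform Monte Carlo for the expected jth-of-n normal order statistic:
orderStatMC <- function(iters, j, n) { mean(qnorm(rbeta(iters, j, n + 1 - j))) }
orderStatMC(1000000, 9, 10)
## should be close to the lmomco value for j=9, n=10 given above (~1.0014)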

See Also

Appendix

Probability of Bivariate Maximum

Given a sample of n pairs of 2 normal variables A & B which are correlated r, what is the probability Pmax that the maximum on the first variable A is also the maximum on the second variable B? This is analogous to many testing or screening situations, such as employee hiring (“what is the probability the top-scoring applicant on the first exam is the top-scorer on the second as well?”) or athletic contests (“what is the probability the current world champ will win the next championship?”).

Order statistics has long proven that, asymptotically, Pmax approaches 1⁄n. Exact answers are hard to find, but confirm the asymptotics; the closest that exists is an exact answer for an approximation & special case using the Ali-Mikhail-Haq copula (see below), which roughly indicates that r functions as a constant factor boost in Pmax, and that the boost from r fades out as n increases.

As long as r ≠ 1, “the tails will come apart”: n increases the difficulty too fast for any fixed r to overcome. This has implications for interpreting extremes and test metrics.

Few are the best at everything. This becomes clear when we look at extremes of multiple traits: the best on one dimension is rarely the best on the second dimension, even if there is a close causal relationship—the tallest basketball player is not the best basketball player, the smartest scientist is not the most influential scientist, the calmest investor is not the richest investor, and so on. This observation can be misinterpreted, and is not necessarily that important: height is certainly useful for basketball players, and the smarter you are, the better your chance of being a great scientist. If we look at the best, they will indeed be well above average on relevant traits, even if they are not the most extreme outlier on all (or any) traits.

How much does being the best at X help with Y? But this does raise the question of what the relationship is between the two possible maxima. In other order statistics, an initially small tail effect can become large; a familiar example of a tail effect is the probability of a random sample crossing a threshold when drawing from populations with different means or standard deviations—the higher the threshold, the bigger the multiplicative increase, and with a not-too-high threshold and a quite small difference in means, one population could completely (and counterintuitively) dominate the sample. Does that happen with the bivariate maximum? More generally, does a given r > 0 increase the probability, by how much, and does this probability increase or decrease relative to the 1⁄n baseline?

Let’s consider some cases of r and n:

  1. If r = 1, then Pmax must also always be 1, as the two variables are identical.

  2. For any r, selecting the maximum out of n = 1 is trivial, and we can guarantee a Pmax = 1 chance of selecting the maximum.

  3. For n = 2, getting the maximum twice out of 2 is easy (we must start at ≥0.5 and go up from there, to 1⁄2 + arcsin(r)⁄π, so eg 2⁄3 for r = 0.5).

  4. For larger n, we can simulate out Pmax and see that despite the benefits to increasing averages & reduced regret (eg for embryo selection), getting the max twice becomes increasingly difficult, and out of eg n = 1 million, becomes near-impossible—even with high r, like r = 0.95 (comparable to test-retest reliability for the SAT).

    Eventually, ‘the tails come apart’. Asymptotically, as n increases, the two maxima become independent, and so the probability of the max in one being the max in the other approaches 1⁄n (this holds for the other extremes as well). The benefits of an r, however large initially, are not large enough to keep up with the increasing n, and are overwhelmed.

Of course, the mean on the second distribution keeps increasing—it just doesn’t increase as fast as necessary to maintain Pmax. This has some implications. First, it shows that we shouldn’t take too seriously any argument of the form “the max on A is not the max on B, therefore A doesn’t matter to B”, since we can see that this is consistent with arbitrarily high r (even an r = 0.9999 will eventually give such a datapoint if we make n big enough). Second, it shows that Pmax is not that interesting as a measure, since the asymptotic independence means that we may just be “sampling to a foregone conclusion”; anyone evaluating a selection or ranking procedure which puts stress on a candidate being the max on multiple traits (instead of having a high absolute value on traits) should think carefully about using such an unnatural zero-one criterion & its perverse consequences, like guaranteeing that the best all-around candidate will rarely or never be the maximum on any or all traits.
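As a quick sanity check of the n = 2 closed form from case #3 above (1⁄2 + arcsin(r)⁄π), a small Monte Carlo sketch using MASS's mvrnorm; for r = 0.5 both numbers should come out near 2⁄3:

library(MASS) # for 'mvrnorm'
r <- 0.5
## closed-form probability that the same observation is the max on both variables when n = 2:
0.5 + asin(r)/pi
## Monte Carlo check (should agree to ~2 decimal places):
mean(replicate(100000, { s <- mvrnorm(2, mu=c(0,0), Sigma=matrix(c(1, r, r, 1), nrow=2))
                         which.max(s[,1]) == which.max(s[,2]) }))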

Ali-Mikhail-Haq Copula

The relationship between r and n can be seen more explicitly in a special-case formula for the Ali-Mikhail-Haq copula, which also shows how r fades out with n. Matt F. has provided an analytic expression by using a different distribution, the Ali-Mikhail-Haq (AMH) copula, which approximates the normal distribution for the special case 0 < r < 0.5, if r is converted to the Ali-Mikhail-Haq’s equivalent correlation parameter θ instead:

First we choose the parameter θ to get Kendall’s tau to agree for the two distributions: (2⁄π) · arcsin(r) = 1 − (2 · ((1 − θ)² · ln(1 − θ) + θ)) / (3 · θ²)

…θ = 3r − 2r² is a decent approximation to the first equation.

Then we can calculate the probability of X and Y being maximized at the same observation:

  • In the limiting case where r = 1⁄2, we have θ = 1, and the probability is 2⁄(n + 1).

  • In the limiting case of high n, the probability tends towards a constant multiple of 1⁄n.

  • In general, the probability is t · ₂F₁(1, n+1; n+2; t)/(n+1) − 2n·t² · ₂F₁(1, n+2; n+3; t)/((n+1)·(n+2)) − 2n·t/(n+1)² + 1⁄n,

    where t = θ/(θ − 1) and ₂F₁ is the Gauss hypergeometric function.

His identity isn’t a simple one in terms of r, but the necessary θ can be found to high accuracy by numerical optimization of the identity, using R’s built-in numerical optimization routine optim. Then the general formula can be estimated as written, with the main difficulty being where to get the hypergeometric function. The resulting algorithm (an approximation of an exact answer to an approximation) can then be compared to a large (i = 1 million) Monte Carlo simulation: it is good for small n but has noticeable error:

r_to_theta <- function (r) {
   identity_solver <- function(theta, r) {
       target = 2/pi * asin(r)
       theta_guess = 1 - ((2 * ((1 - theta)^2 * log(1 - theta) + theta))/(3 * theta^2))
       return(abs(target - theta_guess)) }

    theta_approx <- optim(0, identity_solver, r=r, method="Brent", lower=0, upper=1)
    theta_approx$par
    }

## The hypergeometric function is valid for all reals, but practically speaking,
## GSL's 'hyperg_2F1' code has issues with NaN once r>0.2 (theta ≅ −2.3); Stéphane Laurent
## recommends transforming (https://stats.stackexchange.com/a/33455/16897) for stability:
Gauss2F1 <- function(a,b,c,x){
    ## "gsl: Wrapper for the Gnu Scientific Library"
    ## https://cran.r-project.org/web/packages/gsl/index.html for 'hyperg_2F1'
    library(gsl)
    if(x>=0 & x<1){
        hyperg_2F1(a,b,c,x)
    }else{
            hyperg_2F1(c-a,b,c,1-1/(1-x))/(1-x)^b
        }
}

## Calculate the exact probability of a double max using the AMH approximation:
p_max_bivariate_amh <- function(n,r) {
  ## only valid for the special-case r=0-0.5:
  if (r > 0.5 || r < 0.0) { stop("AMH formula only valid for 0 <= r <= 0.5") }
    else {
        ## Handle special cases:
        if (r == 0.5) { 2 / (1+n) } else {
         if (r == 0) { 1 / n } else {

            theta <- r_to_theta(r)
            t <- theta / (theta - 1)

            t * (Gauss2F1(1, n+1, n+2, t) / (n+1)) -
            2 * n * t^2 * (Gauss2F1(1, n+2, n+3, t) / ((n+1)*(n+2))) -
            (2*n*t)/((n+1)^2) +
           (1/n)
           }
       }
   }
}

## Monte Carlo simulate the AMH approximation to check the exact formula:
p_max_bivariate_amh_montecarlo <- function(n,r, iters=1000000) {
    ## "copula: Multivariate Dependence with Copulas"
    ## https://cran.r-project.org/web/packages/copula/index.html for 'rCopula', 'amhCopula'
    library(copula)
    theta <- r_to_theta(r)
    mean(replicate(iters, {
        sample <- rCopula(n, amhCopula(theta))
        which.max(sample[,1])==which.max(sample[,2]) }))
    }

## Monte Carlo simulation of double max using the normal distribution:
p_max_bivariate_montecarlo <- function(n,r,iters=1000000) {
    mean(replicate(iters, {
                library(MASS) # for 'mvrnorm'
                sample <- mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, r, r, 1), nrow=2))
                which.max(sample[,1])==which.max(sample[,2])
                })) }

## Compare all 3:
## 1. approximate Monte Carlo answer for the normal
## 2. exact answer to A-M-H approximation
## 3. approximate Monte Carlo answer to A-M-H approximation
r=0.25
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_montecarlo(n, r) }))
# [1] 0.580 0.427 0.345 0.295 0.258 0.230 0.211 0.193 0.179 0.168 0.158 0.150 0.142 0.136 0.130 0.124 0.120 0.115 0.111
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh(n, r) }))
# [1] 0.580 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.090 0.085 0.081 0.077
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh_montecarlo(n, r) }))
# [1] 0.581 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.089 0.085 0.081 0.077

See also

Sampling Gompertz Distribution Extremes

I implement random sampling from the extremes/order statistics of the Gompertz survival distribution, used to model human life expectancies, with the beta transformation trick and flexsurv/root-finding inversion. I then discuss the unusually robust lifespan record of Jeanne Calment, and show that records like hers (which surpass the runner-up’s lifespan by such a degree) are not usually produced by a Gompertz distribution, supporting the claim that her lifespan was indeed unusual even for the record holder.

The Gompertz distribution is a distribution often used to model survival curves where mortality increases over time, particularly human life expectancies. The order statistics of a Gompertz are of interest in considering extreme cases such as centenarians.

The usual family of random/density/inverse CDF (quantile) functions (rgompertz/dgompertz/pgompertz/qgompertz) are provided by several R libraries, such as flexsurv, but none of them appear to provide implementations of order statistics.

Using rgompertz for the Monte Carlo approximation works, but like the normal distribution case, breaks down as one examines larger cases (eg considering order statistics out of one billion takes >20s & runs out of RAM). The beta transformation used for the normal distribution works for the Gompertz as well, as it merely requires an ability to invert the CDF, which is provided by qgompertz, and if that is not available (perhaps we are working with some other distribution besides the Gompertz where a q* function is not available), one can approximate it by a short root-finding optimization.

So, given the beta/uniform order-statistics transform, where the kth-of-n order statistic of any distribution is equivalent to applying its inverse CDF (quantile function) to a Beta(k, n + 1 − k) sample, we can plug in the desired k-out-of-n parameters and generate a random sample efficiently via rbeta, getting a value on the 0–1 range (eg 0.998879842 for max-of-1000), and then either use qgompertz or optimization to transform to the final Gompertz-distributed values.

library(flexsurv)
round(digits=2, rgompertz(n = 10, shape = log(1.1124), rate = 0.000016443))
# [1] 79.85 89.88 82.82 80.87 81.24 73.14 93.54 57.52 78.02 85.96

## The beta/uniform order statistics transform, which samples from the Gompertz CDF:
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }

## Root-finding version: define the CDF by hand, then invert via optimization:
### Define Gompertz CDF; these specific parameters are taken from a Dutch population
### survival curve for illustrative purposes (see https://www.gwern.net/Longevity)
F <- function(g) { min(1, pgompertz(g, log(1.1124), rate = 0.000016443, log = FALSE)) }
### Invert the Gompertz CDF to yield the actual numerical value:
InvF <- Vectorize(function(a) { uniroot(function(x) F(x) - a, c(0,130))$root })
### Demo: 10 times, report the age of the oldest person out of a hypothetical 10b:
round(digits=2, InvF(uk(10, 1e+10, 1e+10)))
# [1] 111.89 111.69 112.31 112.25 111.74 111.36 111.54 111.91 112.46 111.79

## Easier: just use `qgompertz` to invert it directly:
round(digits=2, qgompertz(uk(10, 1e+10, 1e+10), log(1.1124), rate = 0.000016443, log = FALSE))
# [1] 111.64 111.59 112.41 111.99 111.91 112.00 111.84 112.33 112.20 111.30

## Package up as a function:
library(flexsurv)
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }
rKNGompertz <- function (iters, total_n, kth, scale, rate) {
    qgompertz(uk(iters, kth, total_n), log(scale), rate = rate, log = FALSE)
    }
## Demo:
round(digits=2, rKNGompertz(10, 1e+10, 1e+10, 1.1124, 0.000016443))
# [1] 112.20 113.23 112.12 111.62 111.65 111.94 111.60 112.26 112.15 111.99

## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
mean(replicate(10000, max(rgompertz(n = 10000, shape = log(1.1124), rate = 0.000016443))))
# 103.715134
mean(rKNGompertz(10000, 10000, 10000, 1.1124, 0.000016443))
# 103.717864

As mentioned, other distributions work just as well. If we wanted a normal or log-normal instead, then we simply use qnorm/qlnorm:

## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
### Normal:
mean(replicate(10000, max(rnorm(n = 10000))))
# [1] 3.85142815
mean(qnorm(uk(10000, 10000, 10000)))
# [1] 3.8497741

### Log-normal:
mean(replicate(10000, max(rlnorm(n = 10000))))
# [1] 49.7277024
mean(qlnorm(uk(10000, 10000, 10000)))
# [1] 49.7507764

Jeanne Calment case study

An example where simulating from the tails of the Gompertz distribution is useful would be considering the case of super-centenarian Jeanne Calment, who has held the world record for longevity for over 22 years now: Calment lived for 122 years & 164 days (122.45 years) and was the world’s oldest person for almost 13× longer than usual, while the global runner-up, Sarah Knauss, lived only 119 years & 97 days (119.27 years), a difference of 3.18 years. This has prompted skepticism of what appears to be an unexpectedly large difference.

Empirically, using the Gerontology Research Group data, the gaps between verified oldest people ever are much smaller than 3 years; for example, among the women, the runner-up Knauss was 119, and #3–9 were all 117 years old (with #10 just 18 days shy of 117). The oldest men follow a similar pattern: #1, Jiroemon Kimura, is 116.15 vs #2’s 115.69, a gap of only 0.46 years, and then #3–4 are both 115, and #5–7 are 114, and so on.

In order statistics, the expected gap between successive k-of-n samples typically shrinks the larger k/n becomes (diminishing returns), and the Gompertz curve is used to model an acceleration in mortality that makes annual mortality rates skyrocket and is why no one ever lives to 130 or 140, as they run into a ‘mortality spike’. The other women and the men make Calment’s record look extraordinary, but order statistics and the Gompertz curve are counterintuitive, and one might wonder if Gompertz curves might naturally occasionally produce Calment-like gaps regardless of the expected gaps or mortality acceleration.

To get an idea of what the Gompertz distribution would produce, we can consider a scenario like sampling from the top 10 out of 10 billion (about the right order of magnitude for the total elderly population of the Earth which has credible documentation over the past ~50 years), and, using some survival curve parameters from a Dutch population paper, see what sort of sets of top-10 ages we would expect and in particular, how often we’d see large gaps between #1 and #2:

simulateSample <- function(total_n, top_k) { sort(sapply(top_k:0,
    function(k) {
        rKNGompertz(1, total_n, total_n-k, 1.1124, 0.000016443) }
   )) }
round(digits=2, simulateSample(1e+10, 10))
# [1] 110.70 110.84 110.89 110.99 111.06 111.08 111.25 111.26 111.49 111.70 112.74

simulateSamples <- function(total_n, top_k, iters=10000) { t(replicate(iters,
    simulateSample(total_n, top_k))) }
small <- as.data.frame(simulateSamples(10000000000, 10, iters=100))
large <- as.data.frame(simulateSamples(10000000000, 10, iters=100000))

summary(round(small$V11 - small$V10, digits=2))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.0000  0.0975  0.2600  0.3656  0.5450  1.5700
summary(round(large$V11 - large$V10, digits=2))
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 0.00000 0.15000 0.35000 0.46367 0.66000 3.99000
mean((large$V11 - large$V10) >= 3.18) * 100
# [1] 0.019

library(ggplot2)
library(reshape)
colnames(small) <- as.character(seq(-10, 0, by=1))
small$V0 <- row.names(small)
small_melted <- melt(small, id.vars= 'V0')
colnames(small_melted) <- c("V0", "K", "Years")
ggplot(small_melted, aes(x = K, y = Years)) +
    geom_path(aes(color = V0, group = V0)) + geom_point() +
    coord_cartesian(ylim = c(110, 115)) + theme(legend.position = "none")
100 simulated samples of top-10-oldest-people out of 10 billion, showing what are reasonable gaps between individuals

With this specific set of parameters, we see median gaps somewhat similar to the empirical data, but we hardly ever (~0.02% of the time) see #1–#2 gaps quite as big as Calment’s.

The graph also seems to suggest that there is in fact typically a ‘jump’ between #1 & #2 compared to #2 & #3 and so on, which I was not expecting; thinking about it, I suspect there is some sort of selection effect here: if a random sample from ‘#1’ falls below the random sample of ‘#2’, then they will simply swap places (because when sorted, #2 will be bigger than #1), so an ‘unlucky’ #1 disappears, creating a ratchet effect ensuring the final ‘#1’ will be larger than expected. Any k could exceed expectations, but #1 is the last possible ranking, so it can become more extreme. If I remove the sort call which ensures monotonicity with rank, the graph looks considerably more irregular but still has a jump, so this selection effect may not be the entire story:

100 simulated samples of top-10-oldest-people out of 10 billion; simulated by rank, unsorted (independent samples)
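The unsorted variant is a one-line change, dropping the sort() call so that each rank is simulated independently rather than forced to be monotonic:

simulateSampleUnsorted <- function(total_n, top_k) { sapply(top_k:0,
    function(k) { rKNGompertz(1, total_n, total_n-k, 1.1124, 0.000016443) }) }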

So, overall, a standard Gompertz model does not easily produce a Calment.

This doesn’t prove that the Calment age is wrong, though. It could just be that something else is going on in Calment’s case, or my specific parameter values are wrong (the gaps are similar but the ages are overall off by ~5 years, and perhaps that makes a difference). To begin with, it is unlikely that the Gompertz curve is exactly correct as a model of aging; in particular, gerontologists note an apparent ceiling of the annual mortality rate at ~50%, which is inconsistent with the Gompertz (which would continue increasing arbitrarily, quickly hitting >99% annual mortality rates). And maybe Calment really is just an outlier due to sampling error (0.02% ≠ 0%), or she is indeed out of the ordinary human life expectancy distribution but there is a more benign explanation for it like a unique mutation or genetic configuration. But it does seem like Calment’s record is weird in some way.


  1. Exploiting the probability integral transform, where the kth-of-n order statistic of a sample from the simple uniform 0–1 distribution turns out to follow a beta distribution (specifically: Beta(k, n + 1 − k)), which can then be easily transformed into the equivalent order statistics of more useful distributions like the normal distribution. The beta transformation is not just computationally useful, but critical to order statistics in general.↩︎

  2. lmomco is accurate for all values I checked with Monte Carlo for n < 1000, but appears to have some bugs for n > 2000: there are occasional deviations from the quasi-logarithmic curve, such as n = 2225–2236 (which are off by 1.02SD compared to the Monte Carlo estimates and the surrounding lmomco estimates), another cluster of errors around n ≈ 40,000, and then after n > 60,000, the estimates are totally incorrect. The maintainer has been notified & has verified the bug.↩︎

  3. A previous version of EnvStats described the approximation thus:

    The function evNormOrdStatsScalar computes the value of E[Z(r)] for user-specified values of r and n. The function evNormOrdStats computes the values of E[Z(r)] for all values of r for a user-specified value of n. For large values of n, the function evNormOrdStats with approximate=FALSE may take a long time to execute. When approximate=TRUE, evNormOrdStats and evNormOrdStatsScalar use the following approximation to E[Z(r)], which was proposed by Blom (1958, pp. 68–75, [“6.9 An approximate mean value formula” & formula 6.10.3–6.10.5]):

    ## General Blom 1958 approximation:
    blom1958E <- function(r,n) { qnorm((r - 3/8) / (n + 1/4)) }
    blom1958E(10^10, 10^10)
    # [1] 6.433133208

    This approximation is quite accurate. For example, for n > 2, the approximation is accurate to the first decimal place, and for n > 9 it is accurate to the second decimal place.

    ↩︎