Calculating The Expected Maximum

In generating a sample of n datapoints drawn from a normal/Gaussian distribution, how big on average the biggest datapoint is will depend on how large n is. I implement a variety of exact & approximate calculations in R to compare efficiency & accuracy.
statistics, computer-science, R, bibliography, order-statistics
2016-01-222020-03-12 finished certainty: highly likely importance: 5

In gen­er­at­ing a sam­ple of n dat­a­points drawn from a normal/Gaussian dis­tri­b­u­tion with a par­tic­u­lar mean/SD, how big on aver­age the biggest dat­a­point is will depend on how large n is. Know­ing this aver­age is use­ful in a num­ber of areas like sports or breed­ing or man­u­fac­tur­ing, as it defines how bad/good the worst/best dat­a­point will be (eg the score of the win­ner in a mul­ti­-player game).

The of the mean/average/expectation of the max­i­mum of a draw of n sam­ples from a nor­mal dis­tri­b­u­tion has no exact for­mu­la, unfor­tu­nate­ly, and is gen­er­ally not built into any pro­gram­ming lan­guage’s libraries.

I imple­ment & com­pare some of the approaches to esti­mat­ing this order sta­tis­tic in the R pro­gram­ming lan­guage, for both the max­i­mum and the gen­eral order sta­tis­tic. The over­all best approach is to cal­cu­late the exact order sta­tis­tics for the n range of inter­est using numer­i­cal inte­gra­tion via lmomco and cache them in a lookup table, rescal­ing the mean/SD as nec­es­sary for arbi­trary nor­mal dis­tri­b­u­tions; next best is a poly­no­mial regres­sion approx­i­ma­tion; final­ly, the Elfv­ing cor­rec­tion to the Blom 1958 approx­i­ma­tion is fast, eas­ily imple­ment­ed, and accu­rate for rea­son­ably large n such as n > 100.

Visu­al­iz­ing maxima/minima in order sta­tis­tics with increas­ing n in each sam­ple (1–100).


Monte Carlo

Most sim­ply and direct­ly, we can esti­mate it using a sim­u­la­tion with hun­dreds of thou­sands of iter­a­tions:

scores  <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain    <- function(n, sd) { scores <- scores(n, sd)
                             return(max(scores)); }
simGain <- function(n, sd=1, iters=500000) {
                             mean(replicate(iters, gain(n, sd))); }

But in R this can take sec­onds for small n and gets worse as n increases into the hun­dreds as we need to cal­cu­late over increas­ingly large sam­ples of ran­dom nor­mals (so one could con­sider this ); this makes use of the sim­u­la­tion diffi­cult when nested in high­er-level pro­ce­dures such as any­thing involv­ing resam­pling or sim­u­la­tion. In R, call­ing func­tions many times is slower than being able to call a func­tion once in a ‘vec­tor­ized’ way where all the val­ues can be processed in a sin­gle batch. We can try to vec­tor­ize this sim­u­la­tion by gen­er­at­ing ran­dom nor­mals, group it into a large matrix with n columns (each row being one n-sized batch of sam­ples), then com­put­ing the max­i­mum of the i rows, and the mean of the max­i­mums. This is about twice as fast for small n; imple­ment­ing using rowMaxs from the R pack­age matrixStats, it is any­where from 25% to 500% faster (at the expense of likely much higher mem­ory usage, as the R inter­preter is unlikely to apply any opti­miza­tions such as Haskel­l’s stream fusion):

simGain2 <- function(n, sd=1, iters=500000) {
    mean(apply(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)), 1, max)) }

simGain3 <- function(n, sd=1, iters=500000) {
    mean(rowMaxs(matrix(ncol=n, data=rnorm(n*iters, mean=0, sd=sd)))) }

Each sim­u­late is too small to be worth par­al­leliz­ing, but there are so many iter­a­tions that they can be split up use­fully and run with a frac­tion in a differ­ent process; some­thing like

simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
   mean(unlist(mclapply(1:n.parallel, function(i) {
    mean(replicate(iters/n.parallel, gain(n, sd))); })))

We can treat the sim­u­la­tion esti­mates as exact and use such as pro­vided by the R pack­age memoise to cache results & never recom­pute them, but it will still be slow on the first cal­cu­la­tion. So it would be good to have either an exact algo­rithm or an accu­rate approx­i­ma­tion: for one of analy­ses, I want accu­racy to ±0.0006 SDs, which requires large Monte Carlo sam­ples.

Upper bounds

To sum­ma­rize the Cross Val­i­dated dis­cus­sion: the sim­plest upper bound is , which makes the dimin­ish­ing returns clear (see also the & ). Imple­men­ta­tion:

upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
The max cen­tral limit the­o­rem visu­al­ized (Gabriel Peyré)

Most of the approx­i­ma­tions are suffi­ciently fast as they are effec­tively with small con­stant fac­tors (if we ignore that func­tions like /qnorm them­selves may tech­ni­cally be or for large n). How­ev­er, accu­racy becomes the prob­lem: this upper bound is hope­lessly inac­cu­rate in small sam­ples when we com­pare to the Monte Carlo sim­u­la­tion. Oth­ers (also inac­cu­rate) include and :

upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }


  1. Blom 1958, Sta­tis­ti­cal esti­mates and trans­formed beta-vari­ables pro­vides a gen­eral approx­i­ma­tion for­mula , which spe­cial­iz­ing to the max () is and is bet­ter than the upper bounds:

    blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
  2. , appar­ent­ly, by way of Math­e­mat­i­cal Sta­tis­tics, Wilks 1962, demon­strates that Blom 1958’s approx­i­ma­tion is imper­fect because actu­ally , so:

    elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }

    (Blom 1958 appears to be more accu­rate for n < 48 and then Elfv­ing’s cor­rec­tion dom­i­nates.)

  3. Har­ter 1961 elab­o­rated this by giv­ing differ­ent val­ues for , and Roys­ton 1982 pro­vides com­puter algo­rithms; I have not attempted to pro­vide an R imple­men­ta­tion of these.

  4. prob­a­bil­i­ty­is­logic offers a 2015 deriva­tion via the beta-F com­pound dis­tri­b­u­tion1:

    and an approx­i­mate (but highly accu­rate) numer­i­cal inte­gra­tion as well:

    pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
        ((n/(n+1)) * (1 - (n/(n+1)))) /
        (2*(n+2) * (pnorm(qnorm(n/(n+1))))^2) }}
    pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) + 1}

    The inte­gra­tion can be done more directly as

    pil2015Integrate2 <- function(n) { integrate(function(v) qnorm(qbeta(v, n, 1)), 0, 1) }
  5. Another approx­i­ma­tion comes from : . Unfor­tu­nate­ly, while accu­rate enough for most pur­pos­es, it is still off by as much as 1 IQ point and has an aver­age mean error of −0.32 IQ points com­pared to the sim­u­la­tion:

    chen1999 <- function(n, sd=1){ qnorm(0.5264^(1/n), sd=sd) }
    approximationError <- sapply(1:1000, function(n) { (chen1999(n) - simGain(n)) * 15 })
    #       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
    # -0.3801803 -0.3263603 -0.3126665 -0.2999775 -0.2923680  0.9445290
    plot(1:1000, approximationError,  xlab="Number of samples taking the max", ylab="Error in 15*SD (IQ points)")
    Error in using the Chen & Tyler 1999 approx­i­ma­tion to esti­mate the expected value (gain) from tak­ing the max­i­mum of n nor­mal sam­ples

Polynomial regression

From a less math­e­mat­i­cal per­spec­tive, any regres­sion or machine learn­ing model could be used to try to develop a cheap but highly accu­rate approx­i­ma­tion by sim­ply pre­dict­ing the extreme from the rel­e­vant range of n = 2–300—the goal being less to make good pre­dic­tions out of sam­ple than to interpolate/memorize as much as pos­si­ble in-sam­ple.

Plot­ting the extremes, they form a smooth almost log­a­rith­mic curve:

df <- data.frame(N=2:300, Max=sapply(2:300, exactMax))
l <- lm(Max ~ log(N), data=df); summary(l)
# Residuals:
#         Min          1Q      Median          3Q         Max
# -0.36893483 -0.02058671  0.00244294  0.02747659  0.04238113
# Coefficients:
#                Estimate  Std. Error   t value   Pr(>|t|)
# (Intercept) 0.658802439 0.011885532  55.42894 < 2.22e-16
# log(N)      0.395762956 0.002464912 160.55866 < 2.22e-16
# Residual standard error: 0.03947098 on 297 degrees of freedom
# Multiple R-squared:  0.9886103,   Adjusted R-squared:  0.9885719
# F-statistic: 25779.08 on 1 and 297 DF,  p-value: < 2.2204e-16
plot(df); lines(predict(l))

This has the merit of sim­plic­ity (function(n) {0.658802439 + 0.395762956*log(n)}), but while the R2 is quite high by most stan­dards, the resid­u­als are too large to make a good approx­i­ma­tion—the log curve over­shoots ini­tial­ly, then under­shoots, then over­shoots. We can try to find a bet­ter log curve by using poly­no­mial or spline regres­sion, which broaden the fam­ily of pos­si­ble curves. A 4th-order poly­no­mial turns out to fit as beau­ti­fully as we could wish, R2=0.9999998:

lp <- lm(Max ~ log(N) + I(log(N)^2) + I(log(N)^3) + I(log(N)^4), data=df); summary(lp)
# Residuals:
#           Min            1Q        Median            3Q           Max
# -1.220430e-03 -1.074138e-04 -1.655586e-05  1.125596e-04  9.690842e-04
# Coefficients:
#                  Estimate    Std. Error    t value   Pr(>|t|)
# (Intercept)  1.586366e-02  4.550132e-04   34.86418 < 2.22e-16
# log(N)       8.652822e-01  6.627358e-04 1305.62159 < 2.22e-16
# I(log(N)^2) -1.122682e-01  3.256415e-04 -344.76027 < 2.22e-16
# I(log(N)^3)  1.153201e-02  6.540518e-05  176.31640 < 2.22e-16
# I(log(N)^4) -5.302189e-04  4.622731e-06 -114.69820 < 2.22e-16
# Residual standard error: 0.0001756982 on 294 degrees of freedom
# Multiple R-squared:  0.9999998,   Adjusted R-squared:  0.9999998
# F-statistic: 3.290056e+08 on 4 and 294 DF,  p-value: < 2.2204e-16

## If we want to call the fitted objects:
linearApprox <- function (n) { predict(l, data.frame(N=n)); }
polynomialApprox <- function (n) { predict(lp, data.frame(N=n)); }
## Or simply code it by hand:
la <- function(n, sd=1) { 0.395762956*log(n) * sd; }
pa <- function(n, sd=1) { N <- log(n);
    (1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }

This has the virtue of speed & sim­plic­ity (a few arith­metic oper­a­tions) and high accu­ra­cy, but is not intended to per­form well past the largest dat­a­point of n = 300 (although if one needed to, one could sim­ply gen­er­ate the addi­tional dat­a­points, and refit, adding more poly­no­mi­als if nec­es­sary), but turns out to be a good approx­i­ma­tion up to n = 800 (after which it con­sis­tently over­es­ti­mates by ~0.01):

heldout <- sapply(301:1000, exactMax)
test <- sapply(301:1000, pa)
mean((heldout - test)^2)
# [1] 3.820988144e-05
plot(301:1000, heldout); lines(test)

So this method, while lack­ing any kind of math­e­mat­i­cal pedi­gree or deriva­tion, pro­vides the best approx­i­ma­tion so far.


The R pack­age lmomco (Asquith 2011) cal­cu­lates a wide vari­ety of order sta­tis­tics using numer­i­cal inte­gra­tion & other meth­ods. It is fast, unbi­ased, and gen­er­ally cor­rect (for small val­ues of n2)—it is close to the Monte Carlo esti­mates even for the small­est n where the approx­i­ma­tions tend to do bad­ly, so it does what it claims to and pro­vides what we want (a fast exact esti­mate of the mean gain from select­ing the max­i­mum from n sam­ples from a nor­mal dis­tri­b­u­tion). The results can be mem­o­ized for a fur­ther mod­er­ate speedup (eg cal­cu­lated over n = 1–1000, 0.45s vs 3.9s for a speedup of ~8.7×):

exactMax_unmemoized <- function(n, mean=0, sd=1) {
    expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
## Comparison to MC:
# ...         Min.       1st Qu.        Median          Mean       3rd Qu.          Max.
#    −0.0523499300 −0.0128622900 −0.0003641315 −0.0007935236  0.0108748800  0.0645207000

exactMax_memoised <- memoise(exactMax_unmemoized)
Error in using Asquith 2011’s L-mo­ment Sta­tis­tics numer­i­cal inte­gra­tion pack­age to esti­mate the expected value (gain) from tak­ing the max­i­mum of n nor­mal sam­ples


With lmomco pro­vid­ing exact val­ues, we can visu­ally com­pare all the pre­sented meth­ods for accu­ra­cy; there are con­sid­er­able differ­ences but the best meth­ods are in close agree­ment:

Com­par­i­son of esti­mates of the max­i­mum for n = 2–300 for 12 meth­ods, show­ing Chen 1999/polynomial approximation/Monte Carlo/lmomco are the most accu­rate and Blom 1958/upper bounds high­ly-i­nac­cu­rate.
scores  <- function(n, sd) { rnorm(n, mean=0, sd=sd); }
gain    <- function(n, sd) { scores <- scores(n, sd)
                             return(max(scores)); }
simGainP <- function(n, sd=1, iters=500000, n.parallel=4) {
   mean(unlist(mclapply(1:n.parallel, function(i) {
    mean(replicate(iters/n.parallel, gain(n, sd))); })))
upperBoundMax <- function(n, sd=1) { sd * sqrt(2 * log(n)) }
upperBoundMax2 <- function(n, sd=1) { ((n-1) / sqrt(2*n - 1)) * sd }
upperBoundMax3 <- function(n, sd=1) { -qnorm(1/(n+1), sd=sd) }
blom1958 <- function(n, sd=1) { alpha <- 0.375; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
elfving1947 <- function(n, sd=1) { alpha <- pi/8; qnorm((n-alpha)/(n-2*alpha+1)) * sd }
pil2015 <- function(n, sd=1) { sd * qnorm(n/(n+1)) * { 1 +
    ((n/(n+1)) * (1 - (n/(n+1)))) /
    (2*(n+2) * (pnorm(qnorm(n/(n+1))))^2) }}
pil2015Integrate <- function(n) { mean(qnorm(qbeta(((1:10000) - 0.5) / 10000, n, 1))) + 1}
chen1999 <- function(n,sd=1){ qnorm(0.5264^(1/n), sd=sd) }
exactMax_unmemoized <- function(n, mean=0, sd=1) {
    expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
exactMax_memoised <- memoise(exactMax_unmemoized)
la <- function(n, sd=1) { 0.395762956*log(n) * sd; }
pa <- function(n, sd=1) { N <- log(n);
    (1.586366e-02 + 8.652822e-01*N^1 + -1.122682e-01*N^2 + 1.153201e-02*N^3 + -5.302189e-04*N^4) * sd; }

implementations <- c(simGainP,upperBoundMax,upperBoundMax2,upperBoundMax3,blom1958,elfving1947,
implementationNames <- c("Monte Carlo","Upper bound #1","UB #2","UB #3","Blom 1958","Elfving 1947",
    "Pil 2015","Pil 2015 (numerical integration)","Chen 1999","Exact","Log approx.","Polynomial approx.")
maxN <- 300
results <- data.frame(N=integer(), SD=numeric(), Implementation=character())
for (i in 1:length(implementations)) {
    SD <- unlist(Map(function(n) { implementations[i][[1]](n); }, 2:maxN))
    name <- implementationNames[i]
    results <- rbind(results, (data.frame(N=2:maxN, SD=SD, Implementation=name)))
results$Implementation <- ordered(results$Implementation,
    levels=c("Upper bound #1","UB #2","UB #3","Log approx.","Blom 1958","Elfving 1947","Pil 2015",
             "Pil 2015 (numerical integration)","Chen 1999","Polynomial approx.", "Monte Carlo", "Exact"))
qplot(N, SD, color=Implementation, data=results) + coord_cartesian(ylim = c(0.25,3.9))

qplot(N, SD, color=Implementation, data=results) + geom_smooth()

And micro-bench­mark­ing them quickly (ex­clud­ing Monte Car­lo) to get an idea of time con­sump­tion shows the expected results (aside from Pil 2015’s numer­i­cal inte­gra­tion tak­ing longer than expect­ed, sug­gest­ing either mem­o­is­ing or chang­ing the fine­ness):

f <- function() { sample(2:1000, 1); }
microbenchmark(times=10000, upperBoundMax(f()),upperBoundMax2(f()),upperBoundMax3(f()),
# Unit: microseconds
#                    expr       min         lq          mean     median         uq       max neval
#                     f()     2.437     2.9610     4.8928136     3.2530     3.8310  1324.276 10000
#      upperBoundMax(f())     3.029     4.0020     6.6270124     4.9920     6.3595  1218.010 10000
#     upperBoundMax2(f())     2.886     3.8970     6.5326593     4.7235     5.8420  1029.148 10000
#     upperBoundMax3(f())     3.678     4.8290     7.4714030     5.8660     7.2945   892.594 10000
#           blom1958(f())     3.734     4.7325     7.3521356     5.6200     7.0590  1050.915 10000
#        elfving1947(f())     3.757     4.8330     7.7927493     5.7850     7.2800  1045.616 10000
#            pil2015(f())     5.518     6.9330    10.8501286     9.2065    11.5280   867.332 10000
#   pil2015Integrate(f()) 14088.659 20499.6835 21516.4141399 21032.5725 22151.4150 53977.498 10000
#           chen1999(f())     3.788     4.9260     7.7456654     6.0370     7.5600  1415.992 10000
#  exactMax_memoised(f())   106.222   126.1050   211.4051056   162.7605   221.2050  4009.048 10000
#                 la(f())     2.882     3.8000     5.7257008     4.4980     5.6845  1287.379 10000
#                 pa(f())     3.397     4.4860     7.0406035     5.4785     6.9090  1818.558 10000

Rescaling for generality

The mem­oi­sed func­tion has three argu­ments, so mem­o­is­ing on the fly would seem to be the best one could do, since one can­not pre­com­pute all pos­si­ble com­bi­na­tions of the n/mean/SD. But actu­al­ly, we only need to com­pute the results for each n.

We can default to assum­ing the stan­dard nor­mal dis­tri­b­u­tion () with­out loss of gen­er­al­ity because it’s easy to rescale any nor­mal to another nor­mal: to scale to a differ­ent mean , one sim­ply adds to the expected extreme, so one can assume ; and to scale to a differ­ent stan­dard devi­a­tion, we sim­ply mul­ti­ply appro­pri­ate­ly. So if we wanted the extreme for n = 5 for , we can cal­cu­late it sim­ply by tak­ing the esti­mate for n = 5 for and mul­ti­ply­ing by 2⁄1 = 2 and then adding :

(exactMax(5, mean=0, sd=1)*2 + 10) ; exactMax(5, mean=10, sd=2)
# [1] 12.32592895
# [1] 12.32592895

So in other words, it is unnec­es­sary to mem­o­ize all pos­si­ble com­bi­na­tions of n, mean, and SD—in real­i­ty, we only need to com­pute each n and then rescale it as nec­es­sary for each caller. And in prac­tice, we only care about n = 2–200, which is few enough that we can define a lookup table using the lmomco results and use that instead (with a fall­back to lmomco for , and a fall­back to Chen et al 1999 for to work around lmomco’s bug with large n):

exactMax <- function (n, mean=0, sd=1) {
if (n>2000) {
    chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
    chen1999(n,mean=mean,sd=sd) } else {
    if(n>200) { library(lmomco)
        exactMax_unmemoized <- function(n, mean=0, sd=1) {
            expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
        exactMax_unmemoized(n,mean=mean,sd=sd) } else {

 lookup <- c(0,0,0.5641895835,0.8462843753,1.0293753730,1.1629644736,1.2672063606,1.3521783756,1.4236003060,

 return(mean + sd*lookup[n+1]) }}}

This gives us exact com­pu­ta­tion at (with an amor­tized when ) with an extremely small con­stant fac­tor (a con­di­tion­al, vec­tor index, mul­ti­pli­ca­tion, and addi­tion, which is over­all ~10× faster than a mem­oi­sed lookup), giv­ing us all our desider­ata simul­ta­ne­ously & resolv­ing the prob­lem.

General order statistics for the normal distribution

One might also be inter­ested in com­put­ing the gen­eral order sta­tis­tic.

Some avail­able imple­men­ta­tions in R:

  • numer­i­cal inte­gra­tion:

    • lmomco, with j of n (warn­ing: remem­ber lmom­co’s bug with n > 2000):

      j = 9; n = 10
      expect.max.ostat(n, j=j, para=vec2par(c(0, 1), type="nor"), cdf=cdfnor, pdf=pdfnor)
      # [1] 1.001357045
    • evNormOrdStats in EnvStats (ver­sion ≥2.3.0), using Roys­ton 1982:

      # [1] 6.446676405
      ## Warning message: In evNormOrdStatsScalar(10^10, 10^10) :
      ## The 'royston' method has not been validated for sample sizes greater than 2000 using
      ## the default value of inc = 0.025. You may want to make the value of 'inc' less than 0.025.
      evNormOrdStatsScalar(10^10,10^10, inc=0.000001)
      # [1] 6.446676817
  • Monte Carlo: the sim­ple approach of aver­ag­ing over i iter­a­tions of gen­er­at­ing n ran­dom nor­mal devi­ates, sort­ing, and select­ing the jth order sta­tis­tic does not scale well due to being in both time & space for gen­er­a­tion & for a com­par­i­son sort or another if one is more care­ful to use a lazy sort or , and cod­ing up an online selec­tion algo­rithm is not a one-lin­er. Bet­ter solu­tions typ­i­cally use a beta trans­for­ma­tion to effi­ciently gen­er­ate a sin­gle sam­ple from the expected order sta­tis­tic:

    • order_rnorm in orderstats, with k of n:

      mean(replicate(100000, order_rnorm(k=10^10, n=10^10)))
      # [1] 6.446370373
    • order in evd, with j of n:

      mean(rorder(100000, distn="norm", j=10^10, mlen=10^10, largest=FALSE))
      # [1] 6.447222051
  • Blom & other approx­i­ma­tions:

    • evNormOrdStats in EnvStats’s pro­vides as an option the Blom approx­i­ma­tion:3

      When method="blom", the fol­low­ing approx­i­ma­tion to , pro­posed by Blom (1958, pp. 68–75), is used:

      By default, alpha = 3⁄8 = 0.375. This approx­i­ma­tion is quite accu­rate. For exam­ple, for n > 2_, the approx­i­ma­tion is accu­rate to the first dec­i­mal place, and for n > 9 it is accu­rate to the sec­ond dec­i­mal place.

      Blom’s approx­i­ma­tion is also quoted as:

    • Elfv­ing’s cor­rec­tion to Blom is the same, yield­ing:

      elfving1947E <- function(r,n) { alpha=pi/8; qnorm( (r - alpha) / (n - 2*alpha + 1)) }
      elfving1947E(10^10, 10^10)
      # [1] 6.437496713

See Also


Probability of Bivariate Maximum

Given a sam­ple of n pairs of 2 nor­mal vari­ables A & B which are cor­re­lated r, what is the prob­a­bil­ity Pmax that the max­i­mum on the first vari­able A is also the max­i­mum on the sec­ond vari­able B? This is anal­o­gous to many test­ing or screen­ing sit­u­a­tions, such as employee hir­ing (“what is the prob­a­bil­ity the top-s­cor­ing appli­cant on the first exam is the top-s­corer on the sec­ond as well?”) or ath­letic con­tests (“what is the prob­a­bil­ity the cur­rent world champ will win the next cham­pi­onship?”).

Order sta­tis­tics has long proven that asymp­tot­i­cal­ly, Pmax approaches . Exact answers are hard to find, but con­firm the asymp­tot­ics; the clos­est that exists is for an approx­i­ma­tion & spe­cial-case of the Ali-Mikhail-Haq cop­u­la: which roughly indi­cates that r func­tions as a con­stant fac­tor boost in Pmax, and the boost from r fades out as n increas­es.

As long as r≠1, “the tails will come apart”. n increases the diffi­cult too fast for any fixed r to over­come. This has impli­ca­tions for inter­pret­ing extremes and test met­rics.

Few are the best at every­thing. when we look at extremes of mul­ti­ple traits: the best on one dimen­sion is rarely the best on the sec­ond dimen­sion, even if there is a close causal rela­tion­ship—the tallest bas­ket­ball player is not the best bas­ket­ball play­er, the smartest sci­en­tist is not the most influ­en­tial sci­en­tist, the calmest investor is not the rich­est investor, and so on. This obser­va­tion can be mis­in­ter­pret­ed, and is not nec­es­sar­ily that impor­tant: height is cer­tainly use­ful for bas­ket­ball play­ers, and the smarter you are, the bet­ter your chance of being a great sci­en­tist. If we look at the best, they will indeed be well above aver­age on rel­e­vant traits, even if they are not the most extreme out­lier on all (or any) traits.

How much does being the best at X help with Y? But this does raise the ques­tion of what the rela­tion­ship is between the two pos­si­ble max­i­ma. In other order sta­tis­tics, an ini­tially small tail effect can become large; a famil­iar exam­ple of a tail effect is the prob­a­bil­ity of a ran­dom sam­ple cross­ing a thresh­old, when draw­ing from pop­u­la­tions with differ­ent means or stan­dard devi­a­tion­s—the higher the thresh­old, the big­ger the mul­ti­plica­tive increase, and with a not-too-high thresh­old and a quite small differ­ence in means, one pop­u­la­tion could com­pletely (and coun­ter­in­tu­itive­ly) dom­i­nant the sam­ple. Does that hap­pen with the bivari­ate max­i­mum? More gen­er­al­ly, does a given r > 0 increase the prob­a­bil­i­ty, how much, and does this prob­a­bil­ity increase or decrease rel­a­tive to the base­line?

Let’s con­sider some cases of r and n:

  1. If r = 1, then Pmax must also always be 1, as the two vari­ables are iden­ti­cal.

  2. For any r, Select­ing the max­i­mum out of n = 1 is triv­ial and we can guar­an­tee a Pmax=1 chance of select­ing the max­i­mum

  3. For n = 2, get­ting the max­i­mum twice out of 2 is easy (we must start at ≥0.5 and go up from there, to , so eg 2⁄3 for r = 0.5).

  4. For larger n, we can sim­u­late out Pmax and see that despite the ben­e­fits to increas­ing aver­ages & reduced regret (eg for embryo selec­tion), get­ting the max twice becomes increas­ingly diffi­cult and out of eg n = 1 mil­lion, becomes near-im­pos­si­ble—even with high r, like r = 0.95 (com­pa­ra­ble to test-retest reli­a­bil­ity for the SAT).

    Even­tu­al­ly, ‘the tails come apart’. Asymp­tot­i­cal­ly, as n increas­es, the two dis­tri­b­u­tions become inde­pen­dent () and so the prob­a­bil­ity of the max in one being max in the other approaches . (/ prove this is true of the other extremes as well.) The ben­e­fits of an r, how­ever large ini­tial­ly, are not large enough to keep up with the increas­ing n, and is over­whelmed.

Of course, the mean on the sec­ond dis­tri­b­u­tion keeps increas­ing—it just does­n’t increase as fast as nec­es­sary to main­tain Pmax. This has some impli­ca­tions. First, it shows that we should­n’t take too seri­ously any argu­ment of the form “the max on A is not the max on B, there­fore A does­n’t mat­ter to B”, since we can see that this is con­sis­tent with arbi­trar­ily high r (even a r = 0.9999 will even­tu­ally give such a dat­a­point if we make n big enough). Sec­ond, it shows that Pmax is not that inter­est­ing as a mea­sure, since the asymp­totic inde­pen­dence means that we may just be “sam­pling to a fore­gone con­clu­sion”; any­one eval­u­at­ing a selec­tion or rank­ing pro­ce­dure which puts stress on a can­di­date being the max on mul­ti­ple traits (in­stead of hav­ing a high absolute value on traits) should think care­fully about using such an unnat­ural zero-one & its per­verse con­se­quences, like guar­an­tee­ing that the best can­di­date all-around will rarely or never be the max­i­mum on any or all traits.

Ali-Mikhail-Haq Copula

The rela­tion­ship between r and n can be seen more explic­itly in a spe­cial case for­mula for the Ali-Mikhail-Haq cop­u­la, which also shows how r fades out with n. Matt F. has pro­vided an ana­lytic expres­sion by using a differ­ent dis­tri­b­u­tion, the (AMH) , which approx­i­mates the nor­mal dis­tri­b­u­tion for the spe­cial-case 0<r < 0.5, if r is con­verted to the Ali-Mikhail-Haq’s equiv­a­lent cor­re­la­tion instead:

First we choose the para­me­ter θ to get Kendal­l’s tau to agree for the dis­tri­b­u­tions:

…θ = 3_r_ - 2_r_2 is a decent approx­i­ma­tion to the first equa­tion.

Then we can cal­cu­late the prob­a­bil­ity of X and Y being max­i­mized at the same obser­va­tion:

  • In the lim­it­ing case where , we have θ = 1, and the prob­a­bil­ity is

  • In the lim­it­ing case of high n, the prob­a­bil­ity tends to and more pre­cisely .

  • In gen­er­al, the prob­a­bil­ity is

    where and is the .

His iden­tity isn’t a sim­ple one in terms of r, but the nec­es­sary θ can be found to high accu­racy by numer­i­cal opti­miza­tion of the iden­ti­ty, using R’s built-in numer­i­cal opti­miza­tion rou­tine optim. Then the gen­eral for­mula can be esti­mated as writ­ten, with the main diffi­culty being where to get the hyper­ge­o­met­ric func­tion. The result­ing algo­rithm (an approx­i­ma­tion of an exact answer to an approx­i­ma­tion) can then be com­pared to a large (i = 1 mil­lion) Monte Carlo sim­u­la­tion: it is good for small n but has notice­able error:

r_to_theta <- function (r) {
   identity_solver <- function(theta, r) {
       target = 2/pi * asin(r)
       theta_guess = 1 - ((2 * ((1 - theta)^2 * log(1 - theta) + theta))/(3 * theta^2))
       return(abs(target - theta_guess)) }

    theta_approx <- optim(0, identity_solver, r=r, method="Brent", lower=0, upper=1)

## The hypergeometric function is valid for all reals, but practically speaking,
## GSL's 'hyperg_2F1' code has issues with NaN once r>0.2 (theta ≅ −2.3); Stéphane Laurent
## recommends transforming ( for stability:
Gauss2F1 <- function(a,b,c,x){
    ## "gsl: Wrapper for the Gnu Scientific Library"
    ## for 'hyperg_2F1'
    if(x>=0 & x<1){

## Calculate the exact probability of a double max using the AMH approximation:
p_max_bivariate_amh <- function(n,r) {
  ## only valid for the special-case r=0-0.5:
  if (r > 0.5 || r < 0.0) { stop("AMH formula only valid for 0 <= r <= 0.5") }
    else {
        ## Handle special cases:
        if (r == 0.5) { 2 / 1+n } else {
         if (r == 0) { 1 / n } else {

            theta <- r_to_theta(r)
            t <- theta / (theta - 1)

            t * (Gauss2F1(1, n+1, n+2, t) / (n+1)) -
            2 * n * t^2 * (Gauss2F1(1, n+2, n+3, t) / ((n+1)*(n+2))) -
            (2*n*t)/((n+1)^2) +

## Monte Carlo simulate the AMH approximation to check the exact formula:
p_max_bivariate_amh_montecarlo <- function(n,r, iters=1000000) {
    ## "copula: Multivariate Dependence with Copulas"
    ## for 'rCopula', 'amhCopula'
    theta <- r_to_theta(r)
    mean(replicate(iters, {
        sample <- rCopula(n, amhCopula(theta))
        which.max(sample[,1])==which.max(sample[,2]) }))

## Monte Carlo simulation of double max using the normal distribution:
p_max_bivariate_montecarlo <- function(n,r,iters=1000000) {
    mean(replicate(iters, {
                library(MASS) # for 'mvrnorm'
                sample <- mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, r, r, 1), nrow=2))
                })) }

## Compare all 3:
## 1. approximate Monte Carlo answer for the normal
## 2. exact answer to A-M-H approximation
## 3. approximate Monte Carlo answer to A-M-H approximation
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_montecarlo(n, r) }))
# [1] 0.580 0.427 0.345 0.295 0.258 0.230 0.211 0.193 0.179 0.168 0.158 0.150 0.142 0.136 0.130 0.124 0.120 0.115 0.111
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh(n, r) }))
# [1] 0.580 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.090 0.085 0.081 0.077
round(digits=3, sapply(2:20, function(n) { p_max_bivariate_amh_montecarlo(n, r) }))
# [1] 0.581 0.420 0.331 0.274 0.233 0.204 0.181 0.162 0.147 0.135 0.124 0.115 0.108 0.101 0.095 0.089 0.085 0.081 0.077

See also

Sampling Gompertz Distribution Extremes

I imple­ment ran­dom sam­pling from the extremes/order sta­tis­tics of the Gom­pertz sur­vival dis­tri­b­u­tion, used to model human life expectan­cies, with the beta trans­for­ma­tion trick and flexsurv/root-finding inver­sion. I then dis­cuss the unusu­ally robust lifes­pan record of Jeanne Cal­ment, and show that records like hers (which sur­pass the run­ner-up’s lifes­pan by such a degree) are not usu­ally pro­duced by a Gom­pertz dis­tri­b­u­tion, sup­port­ing the claim that her lifes­pan was indeed unusual even for the record hold­er.

The is a dis­tri­b­u­tion often used to model sur­vival curves where mor­tal­ity increases over time, par­tic­u­larly human life expectan­cies. The order sta­tis­tics of a Gom­pertz are of inter­est in con­sid­er­ing extreme cases such as cen­te­nar­i­ans.

The usual fam­ily of random/density/inverse CDF (quan­tile) func­tions (rgompertz/dgompertz/pgompertz/qgompertz) are pro­vided by sev­eral R libraries, such as flexsurv, but none of them appear to pro­vide imple­men­ta­tions of order sta­tis­tics.

Using rgompertz for the Monte Carlo approx­i­ma­tion works, but like the nor­mal dis­tri­b­u­tion case, breaks down as one exam­ines larger cases (eg con­sid­er­ing order sta­tis­tics out of one bil­lion takes >20s & runs out of RAM). The beta trans­for­ma­tion used for the nor­mal dis­tri­b­u­tion works for the Gom­pertz as well, as it merely requires an abil­ity to invert the CDF, which is pro­vided by qgompertz, and if that is not avail­able (per­haps we are work­ing with some other dis­tri­b­u­tion besides the Gom­pertz where a q* func­tion is not avail­able), one can approx­i­mate it by a short root-find­ing opti­miza­tion.

So, given the where the order sta­tis­tics of any dis­tri­b­u­tion is equiv­a­lent to , we can plug in the desired k-out-of-n para­me­ters and gen­er­ate a ran­dom sam­ple effi­ciently via rbeta, get­ting a value on the 0–1 range (eg 0.998879842 for max-of-1000) and then either use qgompertz or opti­miza­tion to trans­form to the final Gom­pertz-dis­trib­uted val­ues (see also ).

round(digits=2, rgompertz(n = 10, shape = log(1.1124), rate = 0.000016443))
# [1] 79.85 89.88 82.82 80.87 81.24 73.14 93.54 57.52 78.02 85.96

## The beta/uniform order statistics transform, which samples from the Gompertz CDF:
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }

## Root-finding version: define the CDF by hand, then invert via optimization:
### Define Gompertz CDF; these specific parameters are taken from a Dutch population
### survival curve for illustrative purposes (see
F <- function(g) { min(1, pgompertz(g, log(1.1124), rate = 0.000016443, log = FALSE)) }
### Invert the Gompertz CDF to yield the actual numerical value:
InvF <- Vectorize(function(a) { uniroot(function(x) F(x) - a, c(0,130))$root })
### Demo: 10 times, report the age of the oldest person out of a hypothetical 10b:
round(digits=2, InvF(uk(10, 1e+10, 1e+10)))
# [1] 111.89 111.69 112.31 112.25 111.74 111.36 111.54 111.91 112.46 111.79

## Easier: just use `qgompertz` to invert it directly:
round(digits=2, qgompertz(uk(10, 1e+10, 1e+10), log(1.1124), rate = 0.000016443, log = FALSE))
# [1] 111.64 111.59 112.41 111.99 111.91 112.00 111.84 112.33 112.20 111.30

## Package up as a function:
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }
rKNGompertz <- function (iters, total_n, kth, scale, rate) {
    qgompertz(uk(iters, kth, total_n), log(scale), rate = rate, log = FALSE)
## Demo:
round(digits=2, rKNGompertz(10, 1e+10, 1e+10, 1.1124, 0.000016443))
# [1] 112.20 113.23 112.12 111.62 111.65 111.94 111.60 112.26 112.15 111.99

## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
mean(replicate(10000, max(rgompertz(n = 10000, shape = log(1.1124), rate = 0.000016443))))
# 103.715134
mean(rKNGompertz(10000, 10000, 10000, 1.1124, 0.000016443))
# 103.717864

As men­tioned, other dis­tri­b­u­tions work just as well. If we wanted a nor­mal or instead, then we sim­ply use qnorm/qlnorm:

## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
### Normal:
mean(replicate(10000, max(rnorm(n = 10000))))
# [1] 3.85142815
mean(qnorm(uk(10000, 10000, 10000)))
# [1] 3.8497741

### Log-normal:
mean(replicate(10000, max(rlnorm(n = 10000))))
# [1] 49.7277024
mean(qlnorm(uk(10000, 10000, 10000)))
# [1] 49.7507764

Jeanne Calment case study

An exam­ple where sim­u­lat­ing from the tails of the Gom­pertz dis­tri­b­u­tion is use­ful would be con­sid­er­ing the case of super-cen­te­nar­ian , who has held the world record for longevity for over 22 years now: Cal­ment lived for 122 years & 164 days (122.45 years) and was the world’s old­est per­son for almost 13× longer than usu­al, while the global run­ner-up, lived only 119 years & 97 days (119.27 years), a differ­ence of 3.18 years. This has of what appears to be an unex­pect­edly large differ­ence.

Empir­i­cal­ly, using the Geron­tol­ogy Research Group data, the gaps between are much smaller than >3 years; for exam­ple, among the wom­en, Knapp was 119, and #3–9 were all 117 year old (with #10 just 18 days shy of 117). The old­est men fol­low a sim­i­lar pat­tern: #1, , is 116.15 vs #2’s 115.69, a gap of only 1.8 years, and then #3–4 are all 115, and #5–7 are 114, and so on.

In order sta­tis­tics, the expected gap between suc­ces­sive k-of-n sam­ples typ­i­cally shrinks the larger k/n becomes (di­min­ish­ing return­s), and the Gom­pertz curve is used to model an accel­er­a­tion in mor­tal­ity that makes annual mor­tal­ity rates sky­rocket and is why no one ever lives to 130 or 140 as they run into a ‘mor­tal­ity spike’. The other women and the men make Cal­men­t’s record look extra­or­di­nary, but order sta­tis­tics and the Gom­pertz curve are coun­ter­in­tu­itive, and one might won­der if Gom­pertz curves might nat­u­rally occa­sion­ally pro­duce Cal­men­t-like gaps regard­less of the expected gaps or mor­tal­ity accel­er­a­tion.

To get an idea of what the Gom­pertz dis­tri­b­u­tion would pro­duce, we can con­sider a sce­nario like sam­pling from the top 10 out of 10 bil­lion (about the right order of mag­ni­tude for the total elderly pop­u­la­tion of the Earth which has cred­i­ble doc­u­men­ta­tion over the past ~50 years), and, using some sur­vival curve para­me­ters from a Dutch pop­u­la­tion paper , see what sort of sets of top-10 ages we would expect and in par­tic­u­lar, how often we’d see large gaps between #1 and #2:

simulateSample <- function(total_n, top_k) { sort(sapply(top_k:0,
    function(k) {
        rKNGompertz(1, total_n, total_n-k, 1.1124, 0.000016443) }
   )) }
round(digits=2, simulateSample(1e+10, 10))
# [1] 110.70 110.84 110.89 110.99 111.06 111.08 111.25 111.26 111.49 111.70 112.74

simulateSamples <- function(total_n, top_k, iters=10000) { t(replicate(iters,
    simulateSample(total_n, top_k))) }
small <-, 10, iters=100))
large <-, 10, iters=100000))

summary(round(small$V11 - small$V10, digits=2))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.0000  0.0975  0.2600  0.3656  0.5450  1.5700
summary(round(large$V11 - large$V10, digits=2))
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 0.00000 0.15000 0.35000 0.46367 0.66000 3.99000
mean((large$V11 - large$V10) >= 3.18) * 100
# [1] 0.019

colnames(small) <- as.character(seq(-10, 0, by=1))
small$V0 <- row.names(small)
small_melted <- melt(small, id.vars= 'V0')
colnames(small_melted) <- c("V0", "K", "Years")
ggplot(small_melted, aes(x = K, y = Years)) +
    geom_path(aes(color = V0, group = V0)) + geom_point() +
    coord_cartesian(ylim = c(110, 115)) + theme(legend.position = "none")
100 sim­u­lated sam­ples of top-10-old­est-peo­ple out of 10 bil­lion, show­ing what are rea­son­able gaps between indi­vid­u­als

With these spe­cific set of para­me­ters, we see median gaps some­what sim­i­lar to the empir­i­cal data, but we hardly ever (~0.02% of the time) see #1–#2 gaps quite as big as Cal­men­t’s.

The graph also seems to sug­gest that there is in fact typ­i­cally a ‘jump’ between #1 & #2 com­pared to #2 & #3 and so on, which I was not expect­ing; think­ing about it, I sus­pect there is some sort of selec­tion effect here: if a ran­dom sam­ple from ‘#1’ falls below the ran­dom sam­ple of ‘#2’, then they will sim­ply swap places (be­cause when sort­ed, #2 will be big­ger than #1), so an ‘unlucky’ #1 dis­ap­pears, cre­at­ing a ratchet effect ensur­ing the final ‘#1’ will be larger than expect­ed. Any k could exceed expec­ta­tions, but #1 is the last pos­si­ble rank­ing, so it can become more extreme. If I remove the sort call which ensures monot­o­nic­ity with rank, the graph looks con­sid­er­ably more irreg­u­lar but still has a jump, so this selec­tion effect may not be the entire sto­ry:

100 sim­u­lated sam­ples of top-10-old­est-peo­ple out of 10 bil­lion; sim­u­lated by rank, unsorted (in­de­pen­dent sam­ples)

So, over­all, a stan­dard Gom­pertz model does not eas­ily pro­duce a Cal­ment.

This does­n’t prove that the Cal­ment age is wrong, though. It could just be that Cal­ment , or my spe­cific para­me­ter val­ues are wrong (the gaps are sim­i­lar but the ages are over­all off by ~5 years, and per­haps that makes a differ­ence). To begin with, it is unlikely that the Gom­pertz curve is exactly cor­rect a model of aging; in par­tic­u­lar, geron­tol­o­gists note an appar­ent ceil­ing of the annual mor­tal­ity rate at ~50%, which is incon­sis­tent with the Gom­pertz (which would con­tinue increas­ing arbi­trar­i­ly, quickly hit­ting >99% annual mor­tal­ity rates). And maybe Cal­ment really is just an out­lier due to sam­pling error (0.02% ≠ 0%), or she is indeed out of the ordi­nary human life expectancy dis­tri­b­u­tion but there is a more benign expla­na­tion for it like a unique muta­tion or genetic con­fig­u­ra­tion. But it does seem like Cal­men­t’s record is weird in some way.

  1. Exploit­ing the , where the order sta­tis­tics of a sim­ple 0–1 inter­val turns out to fol­low a (specifi­cal­ly: ) , which can then be eas­ily trans­formed into the equiv­a­lent order sta­tis­tics of more use­ful dis­tri­b­u­tions like the nor­mal dis­tri­b­u­tion. The beta trans­for­ma­tion is not just com­pu­ta­tion­ally use­ful, but crit­i­cal to order sta­tis­tics in gen­er­al.↩︎

  2. lmomco is accu­rate for all val­ues I checked with Monte Carlo n < 1000, but appears to have some bugs n > 2000: there are occa­sional devi­a­tions from the qua­si­-log­a­rith­mic curve, such as n = 2225–2236 (which are off by 1.02SD com­pared to the Monte Carlo esti­mates and the sur­round­ing lmomco esti­mates), another clus­ter of errors n~=40,000, and then after n > 60,000, the esti­mates are totally incor­rect. The main­tainer has been noti­fied & ver­i­fied the bug.↩︎

  3. A pre­vi­ous ver­sion of EnvStats described the approx­i­ma­tion thus:

    The func­tion evNormOrdStatsScalar com­putes the value of for user-spec­i­fied val­ues of r and n. The func­tion evNormOrdStats com­putes the val­ues of for all val­ues of r for a user-spec­i­fied value of n. For large val­ues of n, the func­tion evNormOrdStats with approximate=FALSE may take a long time to exe­cute. When approximate=TRUE, evNormOrdStats and evNormOrdStatsScalar use the fol­low­ing approx­i­ma­tion to , which was pro­posed by Blom (1958, pp. 68–75, [“6.9 An approx­i­mate mean value for­mula” & for­mula 6.10.3–6.10.5]):

    ## General Blom 1958 approximation:
    blom1958E <- function(r,n) { qnorm((r - 3/8) / (n + 1/4)) }
    blom1958E(10^10, 10^10)
    # [1] 6.433133208

    This approx­i­ma­tion is quite accu­rate. For exam­ple, for , the approx­i­ma­tion is accu­rate to the first dec­i­mal place, and for it is accu­rate to the sec­ond dec­i­mal place.