Calculating in R The Expected Maximum of a Gaussian Sample using Order Statistics

In generating a sample of n datapoints drawn from a normal/Gaussian distribution, how big the biggest datapoint will be on average depends on how large n is. I implement & compare, in the R programming language, some of the approaches to estimating this expected maximum.
topics: statistics, computer science, R, bibliography, order statistics
created: 22 Jan 2016; modified: 15 Aug 2019; status: finished; confidence: highly likely; importance: 5


In generating a sample of n datapoints drawn from a normal/Gaussian distribution with a particular mean/SD, how big the biggest datapoint will be on average depends on how large n is. Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/good the worst/best datapoint will be (eg the score of the winner in a multi-player game).

The mean/average/expectation of the maximum of a draw of n samples from a normal distribution has no exact closed-form formula, unfortunately, and is generally not built into any programming language's libraries.

I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic. The overall best approach is to calculate the exact order statistics for the n range of interest using numerical integration via lmomco and cache them in a lookup table, rescaling the mean/SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large n such as n>100.

Visualizing maxima/minima in order statistics with increasing n in each sample (1–100).

Approximation

Monte Carlo

Most simply and directly, we can estimate it using a simulation with hundreds of thousands of iterations:
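For concreteness, a minimal sketch of such a simulation (the helper name simulateMax and the iteration count are arbitrary choices here):

simulateMax <- function(n, iters=100000) { mean(replicate(iters, max(rnorm(n)))) }
simulateMax(5)
# ~1.16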

But in R this can take seconds for small n and gets worse as n increases into the hundreds, as we need to calculate over increasingly large samples of random normals (so one could consider the cost as growing with n); this makes use of the simulation difficult when nested in higher-level procedures such as anything involving resampling or simulation. In R, calling functions many times is slower than being able to call a function once in a ‘vectorized’ way where all the values can be processed in a single batch. We can try to vectorize this simulation by generating all the random normals at once, grouping them into a large matrix with n columns (each row being one n-sized batch of samples), then computing the maximum of each row, and finally the mean of the maximums. This is about twice as fast for small n; implemented using rowMaxs from the R package matrixStats, it is anywhere from 25% to 500% faster (at the expense of likely much higher memory usage, as the R interpreter is unlikely to apply any optimizations such as Haskell’s stream fusion):
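A sketch of the vectorized version (assuming the matrixStats package for rowMaxs; the function name is mine):

library(matrixStats)
simulateMaxVectorized <- function(n, iters=100000) {
    m <- matrix(rnorm(n*iters), ncol=n) ## each row = one n-sized sample
    mean(rowMaxs(m)) }                  ## max of each row, then average
simulateMaxVectorized(5)
# ~1.16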

Each simulation is too small to be worth parallelizing, but there are so many iterations that they can be split up usefully, with a fraction run in each of several worker processes; something like this sketch:
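(The base parallel package is assumed; mclapply forks processes, so this works on Unix-alikes, and the division of labor below is arbitrary.)

library(parallel)
simulateMaxParallel <- function(n, iters=100000, cores=detectCores()) {
    per <- ceiling(iters / cores)
    estimates <- mclapply(1:cores, function(i) { mean(replicate(per, max(rnorm(n)))) },
        mc.cores=cores)
    mean(unlist(estimates)) }
simulateMaxParallel(100)
# ~2.51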

We can treat the simulation estimates as exact and use memoisation, such as provided by the R package memoise, to cache results & never recompute them, but it will still be slow on the first calculation. So it would be good to have either an exact algorithm or an accurate approximation: for one of my analyses, I want accuracy to ±0.0006 SDs, which requires large Monte Carlo samples.
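As a sketch, wrapping the vectorized simulation from above in memoise:

library(memoise)
simulateMaxCached <- memoise(simulateMaxVectorized)
simulateMaxCached(100) ## slow the first time
simulateMaxCached(100) ## cached & instant thereafter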

Upper bounds

To summarize the available upper bounds: the simplest is the classic sqrt(2 * log(n)) bound (in SD units), which makes the diminishing returns clear. Implementation:
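A sketch (upperBoundMax is my own name; it assumes the standard normal, so the result is in SD units):

upperBoundMax <- function(n) { sqrt(2 * log(n)) }
upperBoundMax(c(10, 100))
# ~2.15 ~3.03  (versus the exact ~1.54 & ~2.51; see the lookup table below)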

Most of the approximations are sufficiently fast, as they are effectively constant-time with small constant factors (if we ignore that functions like qnorm may themselves cost more than constant time for very large n). However, accuracy becomes the problem: this upper bound is hopelessly inaccurate in small samples when we compare it to the Monte Carlo simulation. Other proposed upper bounds (also inaccurate) exist as well:

Formulas

Blom 1958 provides a general approximation formula for the expected value of the r-th order statistic, qnorm((r - α) / (n - 2α + 1)) with α = 0.375, which, specializing to the max (r = n), is qnorm((n - 0.375) / (n + 0.25)) and is better than the upper bounds:
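A sketch of the approximation for the maximum (with the exact values from the lookup table below for comparison):

blom1958 <- function(n, alpha=0.375) { qnorm((n - alpha) / (n - 2*alpha + 1)) }
blom1958(c(10, 100))
# ~1.55 ~2.50  (exact: ~1.54 & ~2.51)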

Elfving 1947, apparently by way of Mathematical Statistics, Wilks 1962, demonstrates that Blom 1958’s approximation is imperfect because actually α = π/8 ≈ 0.3927 (rather than 0.375), so:
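The same formula with α = π/8, as a sketch:

elfving1947 <- function(n, alpha=pi/8) { qnorm((n - alpha) / (n - 2*alpha + 1)) }
elfving1947(c(10, 100))
# ~1.56 ~2.51  (exact: ~1.54 & ~2.51)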

(Blom 1958 appears to be more accurate for n<48 and then Elfving’s correction dominates.)

Later authors elaborated this by giving different values for α and by providing computer algorithms; I have not attempted to provide an R implementation of these.

Another approach offers an exact formula1 as well as an approximate (but highly accurate) numerical integration:
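One way to do such an integration (a sketch; the function name is mine) is over the unit interval, using the fact that the maximum of n uniforms has density n * u^(n-1) (see the first footnote) and mapping back through the normal quantile function:

exactMaxQuantile <- function(n) {
    integrate(function(u) { qnorm(u) * n * u^(n-1) }, 0, 1)$value }
exactMaxQuantile(5)
# ~1.16296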

The integration can be done more directly as follows:
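A sketch: the density of the maximum of n standard normals is n * dnorm(x) * pnorm(x)^(n-1), so we can integrate x against it (the function name is mine):

exactMaxIntegrate <- function(n) {
    integrate(function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) }, -Inf, Inf)$value }
exactMaxIntegrate(5)
# [1] 1.162964  (matches the n=5 value in the lookup table below)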

Another approximation comes from Chen & Tyler 1999: qnorm(0.5264^(1/n)). Unfortunately, while accurate enough for most purposes, it is still off by as much as 1 IQ point and has an average mean error of -0.32 IQ points compared to the simulation:
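This is the same formula used as the large-n fallback in the lookup-table code below:

chen1999 <- function(n, mean=0, sd=1) { mean + qnorm(0.5264^(1/n), sd=sd) }
chen1999(10)
# ~1.537  (exact: ~1.539)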

Error in using the Chen & Tyler 1999 approximation to estimate the expected value (gain) from taking the maximum of n normal samples

Polynomial regression

From a less mathematical perspective, any regression or machine learning model could be used to try to develop a cheap but highly accurate approximation by simply predicting the extreme from the relevant range of n=2–300—the goal being less to make good predictions out of sample than to overfit as much as possible in-sample.

Plotted, the extremes form a smooth, almost logarithmic curve:

This has the merit of utter simplicity (function(n) {0.658802439 + 0.395762956*log(n)}), but while the R² is quite high by most standards, the residuals are too large to make a good approximation—the log curve overshoots initially, then undershoots, then overshoots. We can try to find a better log curve by using polynomial or spline regression, which broaden the family of possible curves. A 4th-order polynomial turns out to fit as beautifully as we could wish, R²=0.9999998:
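A sketch of such a fit (the training data here is generated with the exactMaxIntegrate sketch from above rather than whatever estimates were originally used, the polynomial terms here are in log(n), and predictMaxPoly is my own name):

df <- data.frame(N=2:300)
df$Max <- sapply(df$N, exactMaxIntegrate)
fit <- lm(Max ~ log(N) + I(log(N)^2) + I(log(N)^3) + I(log(N)^4), data=df)
summary(fit)$r.squared
predictMaxPoly <- function(n) { as.numeric(predict(fit, newdata=data.frame(N=n))) }
predictMaxPoly(10)
# ~1.54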

This has the virtue of speed & simplicity (a few arithmetic operations) and high accuracy. It is not intended to perform well past the largest datapoint of n=300 (although if one needed to, one could simply generate the additional datapoints and refit, adding more polynomial terms if necessary), but it turns out to be a good approximation up to n=800 (after which it consistently overestimates by ~0.01):

So this method, while lacking any kind of mathematical pedigree or derivation, provides the best approximation so far.

Exact

The R package lmomco (Asquith 2011) calculates a wide variety of order statistics using numerical integration & other methods. It is fast, unbiased, and generally correct (for small values of n2); it is close to the Monte Carlo estimates even for the smallest n where the approximations tend to do badly, so it does what it claims to and provides what we want (a fast exact estimate of the mean gain from selecting the maximum from n samples from a normal distribution). The results can be memoised for a further moderate speedup (eg calculated over n=1–1000, 0.45s vs 3.9s for a speedup of ~8.7x):
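A sketch of the lmomco call (the same call reappears as the fallback in the lookup-table code below), plus memoisation via memoise:

library(lmomco)
exactMax_unmemoized <- function(n, mean=0, sd=1) {
    expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
exactMax_unmemoized(5)
# [1] 1.162964
library(memoise)
exactMax_memoized <- memoise(exactMax_unmemoized)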

Error in using Asquith 2011’s L-moment Statistics numerical integration package to estimate the expected value (gain) from taking the maximum of n normal samples

Comparison

With lmomco providing exact values, we can visually compare the presented methods for accuracy; there are considerable differences but the best methods are in close agreement:

Comparison of estimates of the maximum for n=2–300 for 12 methods, showing Chen 1999/polynomial approximation/Monte Carlo/lmomco are the most accurate and Blom 1958/upper bounds highly-inaccurate.

And micro-benchmarking them quickly (excluding Monte Carlo) to get an idea of time consumption shows the expected results (aside from Pil 2015’s numerical integration taking longer than expected, suggesting either memoising or changing the fineness):
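A sketch of such a micro-benchmark, using the microbenchmark package and the functions sketched above (the original benchmark code & its exact timings are not reproduced here):

library(microbenchmark)
microbenchmark(times=1000,
    upperBoundMax(100),
    blom1958(100),
    elfving1947(100),
    chen1999(100),
    predictMaxPoly(100),
    exactMaxIntegrate(100),
    exactMax_unmemoized(100))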

Rescaling for generality

The memoised function has three arguments, so memoising on the fly would seem to be the best one could do, since one cannot precompute all possible combinations of the n/mean/SD. But actually, we only need to compute the results for each n.

We can default to assuming the standard normal distribution, N(0,1), without loss of generality because it’s easy to rescale any normal to another normal: to scale to a different mean, one simply adds the new mean to the expected extreme (so one can assume a mean of 0); and to scale to a different standard deviation, one simply multiplies the expected extreme by the new SD. So if we wanted the extreme for n=5 for some other normal distribution, we can calculate it simply by taking the estimate for n=5 for N(0,1), multiplying by the new SD, and then adding the new mean:
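A sketch of the rescaling, using exactMax_unmemoized from above and arbitrary illustrative values (mean 10, SD 2):

mu <- 10; sigma <- 2
mu + sigma * exactMax_unmemoized(5)
# ~12.33
## equivalently, passing the parameters straight through to lmomco:
exactMax_unmemoized(5, mean=mu, sd=sigma)
# ~12.33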

So in other words, it is unnecessary to memoize all possible combinations of n, mean, and SD—in reality, we only need to compute each n and then rescale it as necessary for each caller. And in practice, we only care about n=2–200, which is few enough that we can define a lookup table using the lmomco results and use that instead (with a fallback to lmomco for 200 < n ≤ 2000, and a fallback to Chen et al 1999 for n > 2000 to work around lmomco’s bug with large n):

exactMax <- function (n, mean=0, sd=1) {
if (n>2000) {
    chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
    chen1999(n,mean=mean,sd=sd) } else {
    if(n>200) { library(lmomco)
        exactMax_unmemoized <- function(n, mean=0, sd=1) {
            expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
        exactMax_unmemoized(n,mean=mean,sd=sd) } else {

 lookup <- c(0,0,0.5641895835,0.8462843753,1.0293753730,1.1629644736,1.2672063606,1.3521783756,1.4236003060,
  1.4850131622,1.5387527308,1.5864363519,1.6292276399,1.6679901770,1.7033815541,1.7359134449,1.7659913931,
  1.7939419809,1.8200318790,1.8444815116,1.8674750598,1.8891679149,1.9096923217,1.9291617116,1.9476740742,
  1.9653146098,1.9821578398,1.9982693020,2.0137069241,2.0285221460,2.0427608442,2.0564640976,2.0696688279,
  2.0824083360,2.0947127558,2.1066094396,2.1181232867,2.1292770254,2.1400914552,2.1505856577,2.1607771781,
  2.1706821847,2.1803156075,2.1896912604,2.1988219487,2.2077195639,2.2163951679,2.2248590675,2.2331208808,
  2.2411895970,2.2490736293,2.2567808626,2.2643186963,2.2716940833,2.2789135645,2.2859833005,2.2929091006,
  2.2996964480,2.3063505243,2.3128762306,2.3192782072,2.3255608518,2.3317283357,2.3377846191,2.3437334651,
  2.3495784520,2.3553229856,2.3609703096,2.3665235160,2.3719855541,2.3773592389,2.3826472594,2.3878521858,
  2.3929764763,2.3980224835,2.4029924601,2.4078885649,2.4127128675,2.4174673530,2.4221539270,2.4267744193,
  2.4313305880,2.4358241231,2.4402566500,2.4446297329,2.4489448774,2.4532035335,2.4574070986,2.4615569196,
  2.4656542955,2.4697004768,2.4736966781,2.4776440650,2.4815437655,2.4853968699,2.4892044318,2.4929674704,
  2.4966869713,2.5003638885,2.5039991455,2.5075936364,2.5111482275,2.5146637581,2.5181410417,2.5215808672,
  2.5249839996,2.5283511812,2.5316831323,2.5349805521,2.5382441196,2.5414744943,2.5446723168,2.5478382097,
  2.5509727783,2.5540766110,2.5571502801,2.5601943423,2.5632093392,2.5661957981,2.5691542321,2.5720851410,
  2.5749890115,2.5778663175,2.5807175211,2.5835430725,2.5863434103,2.5891189625,2.5918701463,2.5945973686,
  2.5973010263,2.5999815069,2.6026391883,2.6052744395,2.6078876209,2.6104790841,2.6130491728,2.6155982225,
  2.6181265612,2.6206345093,2.6231223799,2.6255904791,2.6280391062,2.6304685538,2.6328791081,2.6352710490,
  2.6376446504,2.6400001801,2.6423379005,2.6446580681,2.6469609341,2.6492467445,2.6515157401,2.6537681566,
  2.6560042252,2.6582241720,2.6604282187,2.6626165826,2.6647894763,2.6669471086,2.6690896839,2.6712174028,
  2.6733304616,2.6754290533,2.6775133667,2.6795835873,2.6816398969,2.6836824739,2.6857114935,2.6877271274,
  2.6897295441,2.6917189092,2.6936953850,2.6956591311,2.6976103040,2.6995490574,2.7014755424,2.7033899072,
  2.7052922974,2.7071828562,2.7090617242,2.7109290393,2.7127849375,2.7146295520,2.7164630139,2.7182854522,
  2.7200969934,2.7218977622,2.7236878809,2.7254674700,2.7272366478,2.7289955308,2.7307442335,2.7324828686,
  2.7342115470,2.7359303775,2.7376394676,2.7393389228,2.7410288469,2.7427093423,2.7443805094,2.7460424475)

 return(mean + sd*lookup[n+1]) }}}

This gives us exact computation at O(1) (with an amortized cost only for the rare n>200 calls, which fall back to lmomco or the Chen et al 1999 approximation), with an extremely small constant factor (a conditional, vector index, multiplication, and addition, which is overall ~10x faster than a memoised lookup), giving us all our desiderata simultaneously & resolving the problem.

General order statistics for the normal distribution

One might also be interested in computing the general order statistic.
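As a sketch of the underlying computation (not a survey of packages), the expectation of the r-th of n standard normal order statistics can be computed by integrating against the order-statistic density, or, assuming the EnvStats package (see the footnote on Blom’s approximation), via evNormOrdStatsScalar:

## density of the r-th order statistic: choose(n,r) * r * dnorm(x) * pnorm(x)^(r-1) * (1-pnorm(x))^(n-r)
exactOrderStatistic <- function(r, n) {
    integrate(function(x) { x * choose(n,r) * r * dnorm(x) * pnorm(x)^(r-1) * (1-pnorm(x))^(n-r) },
        -Inf, Inf)$value }
exactOrderStatistic(5, 5) # the maximum of 5: ~1.163
exactOrderStatistic(3, 5) # the median of 5: ~0 (by symmetry)

library(EnvStats)
evNormOrdStatsScalar(r=3, n=5)
# ~0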

Some available implementations in R:

See also

Appendix

Sampling Gompertz Distribution Extremes

I implement random sampling from the extremes/order statistics of the Gompertz survival distribution, used to model human life expectancies, with the beta transformation trick and flexsurv/root-finding inversion. I then discuss the unusually robust lifespan record of Jeanne Calment, and show that records like hers (which surpass the runner-up’s lifespan by such a degree) are not usually produced by a Gompertz distribution, supporting the claim that her lifespan was indeed unusual even for the record holder.

The Gompertz distribution is often used to model survival curves where mortality increases over time, particularly human life expectancies. The order statistics of a Gompertz are of interest in considering extreme cases such as centenarians.

The usual family of random/density/CDF/inverse-CDF (quantile) functions (rgompertz/dgompertz/pgompertz/qgompertz) are provided by several R libraries, such as flexsurv, but none of them appear to provide implementations of order statistics.

Using rgompertz for the Monte Carlo approximation works, but like the normal distribution case, breaks down as one examines larger cases (eg considering order statistics out of one billion takes >20s & runs out of RAM). The beta transformation used for the normal distribution works for the Gompertz as well, as it merely requires an ability to invert the CDF, which is provided by qgompertz, and if that is not available (perhaps we are working with some other distribution besides the Gompertz where a q* function is not available), one can approximate it by a short root-finding optimization.

So, given the beta transformation, where the k-th order statistic of n samples from any distribution is equivalent to applying that distribution’s quantile function to a draw from Beta(k, n+1−k), we can plug in the desired k-out-of-n parameters and generate a random sample efficiently via rbeta, getting a value on the 0–1 range (eg 0.998879842 for max-of-1000), and then either use qgompertz or optimization to transform it to the final Gompertz-distributed values.

library(flexsurv)
round(digits=2, rgompertz(n = 10, shape = log(1.1124), rate = 0.000016443))
# [1] 79.85 89.88 82.82 80.87 81.24 73.14 93.54 57.52 78.02 85.96

## The beta/uniform order statistics transform, which samples from the Gompertz CDF:
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }

## Root-finding version: define the CDF by hand, then invert via optimization:
### Define Gompertz CDF; these specific parameters are taken from a Dutch population
### survival curve for illustrative purposes (see https://www.gwern.net/Longevity)
F <- function(g) { min(1, pgompertz(g, log(1.1124), rate = 0.000016443, log = FALSE)) }
### Invert the Gompertz CDF to yield the actual numerical value:
InvF <- Vectorize(function(a) { uniroot(function(x) F(x) - a, c(0,130))$root })
### Demo: 10 times, report the age of the oldest person out of a hypothetical 10b:
round(digits=2, InvF(uk(10, 1e+10, 1e+10)))
# [1] 111.89 111.69 112.31 112.25 111.74 111.36 111.54 111.91 112.46 111.79

## Easier: just use `qgompertz` to invert it directly:
round(digits=2, qgompertz(uk(10, 1e+10, 1e+10), log(1.1124), rate = 0.000016443, log = FALSE))
# [1] 111.64 111.59 112.41 111.99 111.91 112.00 111.84 112.33 112.20 111.30

## Package up as a function:
library(flexsurv)
uk <- function(n, kth, total_n) { rbeta(n, kth, total_n + 1 - kth) }
rKNGompertz <- function (iters, total_n, kth, scale, rate) {
    qgompertz(uk(iters, kth, total_n), log(scale), rate = rate, log = FALSE)
    }
## Demo:
round(digits=2, rKNGompertz(10, 1e+10, 1e+10, 1.1124, 0.000016443))
# [1] 112.20 113.23 112.12 111.62 111.65 111.94 111.60 112.26 112.15 111.99

## Comparison with Monte Carlo to show correctness: max-of-10000 (for tractability)
mean(replicate(10000, max(rgompertz(n = 10000, shape = log(1.1124), rate = 0.000016443))))
# 103.715134
mean(rKNGompertz(10000, 10000, 10000, 1.1124, 0.000016443))
# 103.717864

As mentioned, other distributions work just as well. If we wanted a normal or log-normal instead, then we simply use qnorm/qlnorm:
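A sketch with arbitrary illustrative parameters (an IQ-style N(100, 15) and a standard log-normal):

round(digits=2, qnorm(uk(10, 1e+10, 1e+10), mean=100, sd=15))
round(digits=2, qlnorm(uk(10, 1e+10, 1e+10), meanlog=0, sdlog=1))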

Jeanne Calment case study

An example where simulating from the tails of the Gompertz distribution is useful would be considering the case of super-centenarian Jeanne Calment, who has held the world record for longevity for over 22 years now: Calment lived for 122 years & 164 days (122.45 years) and was the world’s oldest person for almost 13x longer than usual, while the global runner-up, Sarah Knauss, lived only 119 years & 97 days (119.27 years), a difference of 3.18 years. This has prompted scrutiny of what appears to be an unexpectedly large difference.

Empirically, using the Gerontology Research Group data, the gaps between the top-ranked individuals are much smaller than 3 years; for example, among the women, Knauss was 119, and #3–9 were all 117 years old (with #10 just 18 days shy of 117). The oldest men follow a similar pattern: #1, Jiroemon Kimura, is 116.15 vs #2’s 115.69, a gap of only 0.46 years, and then #3–4 are all 115, and #5–7 are 114, and so on.

In order statistics, the expected gap between successive k-of-n samples typically shrinks the larger k/n becomes (diminishing returns), and the Gompertz curve is used to model an acceleration in mortality that makes annual mortality rates skyrocket and is why no one ever lives to 130 or 140 as they run into a ‘mortality spike’. The other women and the men make Calment’s record look extraordinary, but order statistics and the Gompertz curve are counterintuitive, and one might wonder if Gompertz curves might naturally occasionally produce Calment-like gaps regardless of the expected gaps or mortality acceleration.

To get an idea of what the Gompertz distribution would produce, we can consider a scenario like sampling from the top 10 out of 10 billion (about the right order of magnitude for the total elderly population of the Earth which has credible documentation over the past ~50 years), and, using some survival curve parameters from a Dutch population paper, see what sort of sets of top-10 ages we would expect and, in particular, how often we’d see large gaps between #1 and #2:
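A sketch of such a simulation, reusing the rKNGompertz sampler and the Dutch-population parameters from above (the figures below were presumably generated along these lines; the sort() enforces monotonicity by rank, which is discussed further below):

simulateTop10 <- function() {
    ranks <- (1e+10 - 9):1e+10 ## the top 10 ranks out of 10 billion
    sort(sapply(ranks, function(k) { rKNGompertz(1, 1e+10, k, 1.1124, 0.000016443) })) }
sims <- replicate(100, simulateTop10()) ## 10x100 matrix: one column per simulated 'world'
gaps <- sims[10,] - sims[9,]            ## gap between #1 & #2 in each world
summary(gaps)
mean(gaps >= 3.18) ## how often is the gap as large as Calment vs Knauss? (rare, so usually 0 in only 100 draws)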

100 simulated samples of top-10-oldest-people out of 10 billion, showing what are reasonable gaps between individuals

With this specific set of parameters, we see median gaps somewhat similar to the empirical data, but we hardly ever (~0.02% of the time) see #1–#2 gaps quite as big as Calment’s.

The graph also seems to suggest that there is in fact typically a ‘jump’ between #1 & #2 compared to #2 & #3 and so on, which I was not expecting; thinking about it, I suspect there is some sort of selection effect here: if a random sample from ‘#1’ falls below the random sample of ‘#2’, then they will simply swap places (because when sorted, #2 will be bigger than #1), so an ‘unlucky’ #1 disappears, creating a ratchet effect ensuring the final ‘#1’ will be larger than expected. Any k could exceed expectations, but #1 is the last possible ranking, so it can become more extreme. If I remove the sort call which ensures monotonicity with rank, the graph looks considerably more irregular but still has a jump, so this selection effect may not be the entire story:

100 simulated samples of top-10-oldest-people out of 10 billion; simulated by rank, unsorted (independent samples)

So, overall, a standard Gompertz model does not easily produce a Calment.

This doesn’t prove that the Calment age is wrong, though. It could just be that Calment is a special case, or that my specific parameter values are wrong (the gaps are similar but the ages are overall off by ~5 years, and perhaps that makes a difference). To begin with, it is unlikely that the Gompertz curve is exactly correct as a model of aging; in particular, gerontologists note an apparent ceiling of the annual mortality rate at ~50%, which is inconsistent with the Gompertz (which would continue increasing arbitrarily, quickly hitting >99% annual mortality rates). And maybe Calment really is just an outlier due to sampling error (0.02%≠0%), or she is indeed out of the ordinary human life expectancy distribution but there is a more benign explanation for it like a unique mutation or genetic configuration. But it does seem like Calment’s record is weird in some way.


  1. Exploiting the probability integral transform: the order statistics of a sample from the uniform 0–1 interval turn out to follow a beta distribution (specifically, the k-th of n uniform order statistics is distributed as Beta(k, n+1−k)), which can then be easily transformed into the equivalent order statistics of more useful distributions like the normal distribution. The beta transformation is not just computationally useful, but critical to order statistics in general.↩︎

  2. lmomco is accurate for all values I checked with Monte Carlo for n<1000, but appears to have some bugs for n>2000: there are occasional deviations from the quasi-logarithmic curve, such as at n=2225–2236 (which are off by 1.02SD compared to the Monte Carlo estimates and the surrounding lmomco estimates), another cluster of errors around n≈40,000, and then for n>60,000, the estimates are totally incorrect. The maintainer has been notified & verified the bug.↩︎

  3. A previous version of EnvStats described the approximation thus:

    The function evNormOrdStatsScalar computes the value of E(r,n) for user-specified values of r and n. The function evNormOrdStats computes the values of E(r,n) for all values of r for a user-specified value of n. For large values of n, the function evNormOrdStats with approximate=FALSE may take a long time to execute. When approximate=TRUE, evNormOrdStats and evNormOrdStatsScalar use the following approximation to E(r,n), which was proposed by Blom (1958, pp. 68–75, [“6.9 An approximate mean value formula” & formula 6.10.3–6.10.5]):

    This approximation is quite accurate: for small n it is accurate to the first decimal place, and it is accurate to the second decimal place for larger n.

    ↩︎