
[–]gwern 5 points (0 children)

How many numbers does one see per minute? It's a little hard to guess, since it differs so wildly from activity to activity - someone looking at a digital clock may see a number every minute, and an accountant will gaze at hundreds of numbers with aplomb. But someone going out to play soccer or hike may not see a number for hours at a time, and obviously we see no numbers while we sleep, not even in dreams (since supposedly you can't read in dreams; the symbols dissolve on closer inspection). I'll just call it 1 a minute on average. You learn to read numbers at age ~5, and life expectancy is ~75, so that's 70 years of reading; you sleep 8 hours a day, so that's (75-5) * 365.25 * (24-8) * 60 * 1 = 24,544,800, or ~24.5m numbers over a lifetime.

"1 and 1,000,000" makes a million distinct numbers, or 1/24th the total in a lifetime, so one might guess that one would basically sample all of the numbers from 1 to a million over a lifetime. Except, the distribution of numbers one sees in real life is clearly not a uniform distribution from which one draws at random. Small numbers are enormously more common (think about how often one sees the number 1 or 5 and not, say, 63008 (generated using random.org)

The most common distribution one hears of for real-world numbers is Benford's law, under which a leading digit d turns up with probability log10(1 + 1/d) - so ~30% of numbers lead with a 1 but only ~4.6% lead with a 9. I'm sure there's some analytic solution to answering the question "if randomly sampling from a Benford-weighted distribution 24m times, what fraction of 1..1m will be present in the sample?", but I'm just gonna half-ass a simple simulation in R: weight the probability of each leading digit by Benford's law, sample uniformly within that digit's 100k block, and then find the fraction of 1..1m which matches the list of 'hit' integers:

set.seed(98765)

# Benford's law distribution of leading digits 1-9:
# R> log(1+1/(1:9))/log(10)
# [1] 0.3010 0.1761 0.1249 0.0969 0.0792 0.0669 0.0580 0.0512 0.0458

leadingDigits <- sample.int(9, size = ((75-5) * 365.25 * (24-8) * 60 * 1), # ~24.5m draws
                               replace = TRUE, # digits can repeat
                               prob = (log(1+1/(1:9))/log(10))) # Benford weights

# boolean vector faking a set of which integers in 1..1m have been seen
set <- rep(FALSE, 1000000)
for (i in 1:1000000) { # NB: tallies 1m of the ~24.5m sampled digits
    # uniform draw inside the leading digit's 100k block; R truncates the
    # fractional index, & only 6-digit numbers can ever be hit this way
    y <- runif(1, leadingDigits[i]*100000,      # lower bound
                  (leadingDigits[i]+1)*100000)  # upper bound
    set[y] <- TRUE }

sum(set) / 1000000 # what fraction of 1..1m got hit?
[1] 0.5358         # A: ~54%
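As a sanity check on that figure (my addition - a quick Poisson approximation, not an exact closed form): with n draws, each of the 100,000 integers in leading-digit block d gets hit roughly Poisson(n * p_d / 100000) times, so the expected coverage can be written down directly:

# expected fraction of 1..1m covered after n draws, under the same
# Benford-weighted, uniform-within-block, 6-digit-only scheme as above
coverage <- function (n) {
    p <- log(1+1/(1:9))/log(10) # Benford leading-digit probabilities
    sum(100000 * (1 - exp(-n*p/100000))) / 1000000 }
coverage(1000000) # ≈0.5355, agreeing with the simulated 0.5358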

So this answer - 'you'll see half of the numbers' - is obviously going to be an upper bound, for two reasons: I truncated sampling at 1m for simplicity, so a nontrivial number of real-world numbers are going to be >1m (although we could probably not bother thinking about anything >1 trillion), and a uniform distribution makes no more sense within each 100k block than it does between 100k blocks, so it's going to overestimate how easy it is to reach a weirdo number like 63008 and underestimate how often a round number like 63000 will appear. But it's a start.
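If one wanted to drop the uniform-within-block simplification (my addition - a sketch of the standard refinement, not something from the original thread): the full Benford/log-uniform assumption applies to the whole significand, not just the first digit, and sampling from it is a one-liner - draw the base-10 logarithm uniformly and exponentiate:

# log-uniform sampling of 6-digit numbers: the leading digit comes out
# Benford-distributed, and the later digits follow the generalized Benford law
benford6 <- floor(10^runif(1000000, 5, 6))
table(benford6 %/% 100000) / 1000000 # leading-digit freqs: ~0.301, 0.176, ..., 0.046

Plugging draws like these into the tally above would nudge the coverage down a bit further, since within each 100k block the mass also piles up toward the low end.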

(Incidentally, isn't it interesting how little coverage even a million random draws over a million integers give us? This reminds me of the Rapgenius post "Heroku's Ugly Secret", complaining that Heroku - by switching to a random pick of which server a user was sent to, as opposed to sending users to the current least-busy server - had caused massive performance degradations for Rapgenius: "To bring queuing down to an acceptable level (<10ms), you’d need to crank your app to 4,000 dynos, or fifty times your allotment in the intelligent case." (with R simulation). I also learned an even more interesting fact from the Hacker News discussion: something called the "power of two choices" in queuing theory, where you take dumb random selection of which queue to wait in - which, much as we see above, is very wasteful - and, simply by making a second simultaneous selection and picking the better of the two, eliminate most of the problem! Going to three selections or more adds little. Now that's counter-intuitive to me.)
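The two-choices trick is easy to demo in miniature (my addition - a balls-into-bins sketch, the standard toy version of the queuing result, rather than a full queue simulation):

# throw n balls into n bins: one random choice vs best-of-two-random-choices
n <- 10000
oneChoice <- tabulate(sample.int(n, n, replace = TRUE), nbins = n)
twoChoice <- rep(0, n)
for (i in 1:n) { candidates <- sample.int(n, 2)                  # two random bins
                 best <- candidates[which.min(twoChoice[candidates])]
                 twoChoice[best] <- twoChoice[best] + 1 }
max(oneChoice); max(twoChoice) # typically ~5-7 vs ~3: the worst bin shrinks a lot

That matches the theory: the expected maximum load falls from ~log(n)/log(log(n)) with one choice to ~log(log(n)) with two, which is why the second choice buys so much and a third buys little more.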