When Should I Check The Mail?

Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis, model comparison, optimizing check times with a loss function, and optimal data collection.
statistics, decision-theory, R, survival-analysis, Bayes, tutorial
2015-06-21–2018-02-24 finished certainty: likely importance: 9

Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual "interval-censored data". I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using the survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely); and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).

Consider a question of burning importance: what time does the mailman come, bearing gifts, and when should I check my mail—considering that my mailbox is >110m/>7.5 minutes away, and that I really dislike walking out to it in up to 90F heat for a package or book only to discover the mail hasn't come yet? No one wants to sit around all morning to spot the exact time the mailman comes. At least, I don't. We could more easily measure by going out in the morning at a random time to see if the mail has come yet, and then (somehow) estimate. Given a set of data like "2015-06-20 11:00AM: mail has not come yet; 2015-06-21 11:59AM: mail had come", how can we estimate? This is not a normal setup where we estimate a mean, because our data is interestingly messed up: censored or truncated or an interval somehow.

Survival analysis seems like the appropriate paradigm. This is not a simple survival analysis with "right-censoring", where each individual is followed up to a censoring time and the exact time of 'failure' is observed. (This would be right-censoring if instead we had gone out to the mailbox early in the morning and sat there waiting for the mailman to record when she came, occasionally getting bored around 10AM or 11AM and wandering off without seeing when the mail comes.) This isn't "left-censoring" either (for left-censoring, we'd go out to the mailbox late in the morning when the mail might already be there, and if it isn't, then wait until it does come). I don't think this is left or right truncation either, since data is collected each day and there are no sampling biases at play. What this is, is interval censoring: when we go out to the mailbox at 11AM and discover the mail is there, we learn that the mail was delivered today sometime in the interval midnight-10:59AM, or if the mail isn't there, we learn it will be delivered later today sometime during the interval 11:01AM-midnight (hopefully closer to the first end than the second). Interval censoring comes up in biostatistics for situations like periodic checkups for cancer, which does resemble our mail situation.
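In code, this encoding is mechanical; a minimal sketch (the helper name censorObservation is mine, not part of the analysis below), assuming times are measured in minutes since midnight as used throughout:

```r
# One mailbox check becomes one interval: if the mail was present at the check,
# delivery happened somewhere in [0, checkTime]; if absent, in [checkTime, 1440].
censorObservation <- function(checkTime, mailPresent) {
    if (mailPresent) { c(0, checkTime) } else { c(checkTime, 1440) } }

censorObservation(660, TRUE)   # checked at 11AM, mail there:     interval 0-660
censorObservation(660, FALSE)  # checked at 11AM, mail not there: interval 660-1440
```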


Interval-censored data


The R survival library supports the usual right/left-censoring but also interval-censoring. It supports two encodings of intervals, interval and interval2; I use the latter, a format which works well with both the survival library and other tools like JAGS. Times are written as minutes since midnight, so they can be handled as positive numbers rather than date-times (ie midnight=0, 11AM=660, noon=720, midnight=1440, etc), and the upper and lower bounds on intervals are 0 and 1440 (so if I check the mail at 660 and it's there, then the interval is 0-660, and if it's not, 660-1440).

# Routines to make it easier to work in minutes-since-midnight:
fromClock <- function(ts){ (hour(ts)*60) + minute(ts) + (second(ts)/60)}
toClock <- function(t) {
   h <- floor(t/60)
   m <- floor(t - h*60)
   sprintf("%0.2d:%0.2d", h, m) }

# simulate a scenario in which the mailman tends to come around 11AM (660) and I tend to check around then,
# & generate interval data for each time, bounded by end-of-day/midnight below & above, collecting ~1 month:
simulateMailbox <- function(n, time) {
    deliveryTime <- round(rnorm(n, mean = time, sd = 30))
    checkTime <- round(rnorm(n, mean = time, sd = 20))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,])) }
mailSim <- simulateMailbox(30, 650); mailSim
##        Time1 Time2
## 1    620  1440
## 2      0   651
## 3      0   627
## 4    624  1440
## 5    629  1440
## 6    664  1440
## 7    665  1440
## 8    652  1440
## ...
png(file="~/wiki/images/maildelivery/simulated.png", width = 800, height = 500)
ggplot(mailSim) + geom_segment(aes(x=Time1, xend=Time2, y=1:nrow(mailSim), yend=1:nrow(mailSim))) +
    geom_vline(xintercept=650, color="blue") + ylab("Day") + xlab("Time")
invisible(dev.off())

Inferring the mean time of delivery might sound difficult with such extremely crude data of intervals 700 minutes wide or worse, but plotting the little simulated dataset and marking the true mean time of 650, we see it's not that bad—the mean time is probably whatever line passes through the most intervals:

The simulated overlapping-intervals data, with the true mean time drawn in blue
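That eyeball heuristic can be made literal by counting, for each minute of the day, how many intervals cover it; a rough sketch (not one of the estimators used below, and mostCoveredMinute is a name I made up):

```r
# Crude point estimate of delivery time: the middle of the region covered by
# the largest number of observed intervals (Time1/Time2 columns as above).
mostCoveredMinute <- function(intervals) {
    grid <- 0:1440
    coverage <- sapply(grid, function(t) sum(intervals$Time1 <= t & t <= intervals$Time2))
    # the argmax is usually a flat stretch of minutes, so take its median:
    median(grid[coverage == max(coverage)]) }

# three toy intervals whose common overlap is 640-650:
mostCoveredMinute(data.frame(Time1=c(0,600,640), Time2=c(650,1440,1440)))  # 645
```

Applied to mailSim it lands near the true 650, though unlike the models below it gives no uncertainty estimate.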

And also with our simulated dataset, we can see whether the standard R survival library and an interval-censored model written in JAGS can recover the 650:

surv <- Surv(mailSim$Time1, mailSim$Time2, type="interval2")
s <- survfit(surv ~ 1, data=mailSim); summary(s)
##  time   n.risk   n.event survival   std.err lower 95% CI upper 95% CI
##  625.5 30.0000 3.33599e+00 0.888800 0.0573976     0.783131     1.000000
##  641.0 26.6640 4.15541e+00 0.750287 0.0790267     0.610339     0.922323
##  647.0 22.5086 6.14506e+00 0.545451 0.0909091     0.393449     0.756178
##  651.5 16.3635 3.84506e-11 0.545451 0.0909091     0.393449     0.756178
##  664.5 16.3635 5.82315e-03 0.545257 0.0909124     0.393258     0.756005
##  675.0 16.3577 1.63577e+01 0.000000       NaN           NA           NA

png(file="~/wiki/images/maildelivery/simulated-survivalcurve.png", width = 800, height = 500)
plot(s)
invisible(dev.off())

sr <- survreg(surv ~ 1, dist="gaussian", data=mailSim); summary(sr)
##                 Value Std. Error        z           p
## (Intercept) 663.08028   9.742171 68.06289 0.00000e+00
## Log(scale)    3.41895   0.438366  7.79931 6.22465e-15
## Scale= 30.5374
## Gaussian distribution
## Loglik(model)= -15.8   Loglik(intercept only)= -15.8
## Number of Newton-Raphson Iterations: 11
## n= 30
Survival curve (death=delivery)


More Bayesianly, we can write an interval-censoring model in JAGS, which gives us the opportunity to use an informative prior about the mean time the mailman comes.

They work normal 9-5 hours as far as I know, so we can rule out anything outside 540-1020. From past experience, I expect the mail to show up not before 10AM (600) and not after 1PM (780), with those extremes being rare and sometime around 11AM (650) being much more common; so not a uniform distribution over 600-780 but a normal one centered on 650, and then somewhat arbitrarily saying that 600-700 represent 3 SDs out from the mean of delivery times to get SD=~30 minutes; so in all, dnorm(650, pow(30, -2)). The SD itself seems to me like it could range anywhere from a few minutes to an hour, but much beyond that is impossible (if the SD was over an hour, then every so often the mailman would have to come at 8AM! and if it was smaller than 10 minutes, then I would never have noticed much variation in the first place).
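As a quick sanity check that these priors encode those beliefs (just arithmetic on the assumed Normal(650, 30), nothing from the models below):

```r
# Under the prior mu ~ Normal(650, SD=30):
# essentially all mass falls inside 9AM-5PM working hours (540-1020)...
pnorm(1020, mean=650, sd=30) - pnorm(540, mean=650, sd=30)
# ...and ~95% inside the 10AM-1PM (600-780) range I consider plausible:
pnorm(780, mean=650, sd=30) - pnorm(600, mean=650, sd=30)
```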

model1 <- function () {
    for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }

           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)

           y.new ~ dnorm(mu, tau)
           }
# y=1 == Event=3 for `Surv`: event is hidden inside interval, not observed/left-/right-censored
data <- list("dt"=mailSim, "n"=nrow(mailSim), "y"=rep(1, nrow(mailSim)))
inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mailSim,1,mean))) }
params <- c("mu", "sd", "y.new")
j1 <- jags(data,inits, params, model1); j1
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## mu       665.055   9.841 647.724 658.092 664.334 670.983 686.264 1.008   280
## sd        40.153  11.592  18.626  30.643  40.813  49.707  59.048 1.047    51
## y.new    664.103  42.908 577.518 637.561 663.011 689.912 749.936 1.002  1300
## deviance   0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

Both approaches' point-value mean time of 663-665 (11:03-11:05AM) comes close to the true simulation value of 650 (11AM), and the prediction interval of 577-749 also sounds right, validating the models. The estimated standard deviation isn't as accurate, with a wide credible interval, reflecting that it's a harder parameter to estimate and the estimate is still vague with only n = 30.

With a simulation and JAGS set up, we could also see how the posterior estimates of the mean, 95% CI of the mean, and predictive interval change as an additional datapoint is added:

animatePosteriors <- function (df, filename, upperTrue, lowerTrue) {
  # uses saveGIF from the animation package:
  # http://cran.r-project.org/web/packages/animation/index.html
  saveGIF(
      for(n in 1:nrow(df)){
          data <- list("dt"=df[1:n,], "n"=nrow(df[1:n,]), "y"=rep(1, nrow(df[1:n,])))
          inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(df[1:n,],1,mean))) }
          params <- c("mu","sd", "y.new")
          j1 <- jags(data, inits, params, model1, progress.bar="none")

          lowerMean <- j1$BUGSoutput$summary[c(2),][3]
          medianMean  <- j1$BUGSoutput$mean$mu
          upperMean <- j1$BUGSoutput$summary[c(2),][7]

          lowerPredictive <- j1$BUGSoutput$summary[c(4),][3]
          upperPredictive <- j1$BUGSoutput$summary[c(4),][7]

          ## WARNING: need an environment call for `ggplot` inside a function for local variables like
          ## 'n' to be visible: http://stackoverflow.com/a/29595312/329866
          ## WARNING: & you can't just call qplot due to animation/ggplot2 bug; have to assign & print
          timeLimits <- seq(from=9*60, to=12.8*60, by=15)
          p <- ggplot(df[1:n,], environment = environment()) +
                coord_cartesian(xlim = timeLimits) +
                scale_x_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock)) +
                ylab("Day") + xlab("Time") +
                geom_segment(aes(x=df[1:n,]$Time1, xend=df[1:n,]$Time2, y=1:n, yend=1:n)) +
                geom_vline(xintercept=medianMean, color="blue") +
                geom_vline(xintercept=lowerMean, color="green") +
                geom_vline(xintercept=upperMean, color="green") +
                geom_vline(xintercept=lowerPredictive, color="red") +
                geom_vline(xintercept=upperPredictive, color="red") +
                geom_vline(xintercept=lowerTrue, color="red4") +
                geom_vline(xintercept=upperTrue, color="red4")
          print(p)
          },
      interval = 0.7, ani.width = 800, ani.height=800,
      movie.name = filename) }

simData <- simulateMailbox(200, 650)
confintTrue <- round(qnorm(c(0.025, 0.975), mean=650, sd=30))
lowerPredictiveTrue <- confintTrue[1]
upperPredictiveTrue <- confintTrue[2]
animatePosteriors(simData, "/home/gwern/wiki/images/maildelivery/simulated-inferencesamplebysample.gif", lowerPredictiveTrue, upperPredictiveTrue)
Simulated data: posterior estimates of mail delivery times, evolving sample by sample for 200 datapoints; estimated mean time in blue, 95% CI around the mean time in green, and 95% predictive intervals as to when the delivery is made

Probably the most striking aspect of watching these summaries updated datum by datum, to me, is how the estimate of the mean homes in almost immediately on close to the true value (this isn't due solely to the informative prior either, as a completely uninformative dunif(0,1440) will zoom in within 4 or 5 datapoints as well, after some violent thrashing around). What happens is that the intervals initially look uninformative if the first two or three all turn out to be delivered/non-delivered, and so the mean delivery time could still be anywhere from ~11AM-midnight or vice-versa; but then as soon as even one needle falls the other way, the mean suddenly snaps into tight focus and gets better from there. While I understand this abrupt transition in hindsight (only a tiny subset of values around the overlapping tips of the needles can yield the observed flip-flops of delivery/non-delivery, while a mean time far from the tips would yield a completely consistent dataset of all deliveries/non-deliveries), I didn't expect this, simply reasoning that "one interval-censored datum seems very uninformative, so it must take many data to yield any sort of decent result for the mean & standard deviation and hence the predictions". But even as the mean becomes precisely estimated, the predictive interval—which is what, in the end, we really care about—remains obdurate and broad, because we assume the delivery time is generated by a normal distribution, so the predicted delivery times are the product of not just the mean but the standard deviation as well, and the standard deviation is hard to estimate (a 95% credible interval of 12-51!).
Also in hindsight this is predictable, since the flip-flopping needles may be sensitive to the mean but not to the spread of delivery-times; the data would not look much different than it does if the mailman could deliver anywhere from 8AM to 3PM—on early days she delivers a few hours before I check the mailbox around 11AM, and on late days she delivers a few hours after.

These considerations also raise questions about statistical power/optimal experiment design: what are the best times to sample from interval-censored data in order to estimate as precisely as possible with a limited budget of samples? I searched for material on interval-censored data but didn't find anything directly addressing my question. The flip-flops suggest that to estimate the mean, one should sample only at the current estimated mean, which maximizes the probability that there will be a net 50-50 split of delivery/non-delivery; but where should one sample for the SD as well?
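Some intuition for the mean half of that question: a single check at time t is a Bernoulli observation with success probability pnorm(t, mu, sigma), and a binary outcome carries the most information (a full bit) when that probability is 1/2, which happens exactly at the mean. A sketch using the simulation's true parameters (my illustration; it doesn't settle where to sample for the SD):

```r
# Entropy (bits) of the delivered/not-delivered outcome of one check at time t,
# when delivery time ~ Normal(mu, sigma): maximized where P(delivered) = 1/2.
checkEntropy <- function(t, mu=650, sigma=30) {
    p <- pnorm(t, mean=mu, sd=sigma)
    ifelse(p <= 0 | p >= 1, 0, -(p*log2(p) + (1-p)*log2(1-p))) }

checkEntropy(650)  # at the mean: exactly 1 bit
checkEntropy(710)  # 2 SDs late: the outcome is nearly a foregone conclusion
```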

mailInterval <- data.frame(
 Date=as.POSIXct(c( "2015-06-20 11:00AM", "2015-06-21 11:06AM", "2015-06-23 11:03AM",
                    "2015-06-24 11:05AM", "2015-06-25 11:00AM", "2015-06-26 10:56AM",
                    "2015-06-27 10:45AM", "2015-06-29 10:31AM", "2015-06-30 10:39AM", "2015-07-01 10:27AM",
                    "2015-07-02 10:47AM", "2015-07-03 10:27AM", "2015-07-04 10:54AM", "2015-07-05 10:55AM",
                    "2015-07-06 11:21AM", "2015-07-07 10:01AM", "2015-07-08 10:20AM", "2015-07-09 10:50AM",
                    "2015-07-10 11:10AM", "2015-07-11 11:12AM", "2015-07-13 11:05AM", "2015-07-14 11:14AM",
                    "2015-07-15 11:40AM", "2015-07-16 11:24AM", "2015-07-17 11:03AM", "2015-07-18 10:46AM",
                    "2015-07-20 11:05AM", "2015-07-21 10:56AM", "2015-07-22 11:00AM", "2015-07-23 11:17AM",
                    "2015-07-24 11:15AM", "2015-07-27 11:11AM", "2015-07-28 10:44AM", "2015-07-29 11:18AM",
                    "2015-07-30 11:08AM", "2015-07-31 10:44AM", "2015-08-01 11:25AM", "2015-08-03 10:45AM",
                    "2015-08-04 10:45AM", "2015-08-05 10:44AM", "2015-08-06 10:33AM", "2015-08-07 10:55AM",
                    "2015-08-10 11:09AM", "2015-08-11 11:16AM", "2015-08-12 11:14AM", "2015-08-13 11:10AM",
                    "2015-08-14 11:02AM", "2015-08-15 11:04AM", "2015-08-18 11:15AM", "2015-08-20 11:20AM",
                    "2015-08-22 11:46AM", "2015-08-23 11:04AM", "2015-08-24 10:56AM", "2015-08-25 10:26AM",
                    "2015-08-26 11:57AM", "2015-08-27 10:15AM", "2015-08-31 12:35PM", "2015-09-01 11:44AM",
                    "2015-09-02 10:54AM", "2015-09-03 10:29AM", "2015-09-04 10:08AM", "2015-09-05 10:45AM",
                    "2015-09-07 10:58AM", "2015-09-08 10:25AM", "2015-09-09 10:29AM", "2015-09-10 10:48AM",
                    "2015-09-16 11:22AM", "2015-09-17 11:06AM", "2015-09-18 11:22AM", "2015-09-19 11:12AM",
                    "2015-09-21 11:52AM", "2015-09-23 12:02PM", "2015-09-24 10:00AM", "2015-09-25 12:08PM",
                    "2015-09-26 10:59AM", "2015-09-28 10:57AM", "2015-09-29 10:52AM" ),
                    format="%Y-%m-%d %I:%M%p"),
 # Delivered: one logical per check (TRUE = mail was already there); the original vector is elided here
 Delivered=NA)
mailInterval$Time <- fromClock(mailInterval$Date)
mail <- with(mailInterval, data.frame(Time1 = ifelse(Delivered, 0, Time),
                           Time2 = ifelse(Delivered,  Time, 1440)))

model1 <- function() {
    for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }

           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)

           y.new ~ dnorm(mu, tau)
           }
data <- list("dt"=mail, "n"=nrow(mail), "y"=rep(1, nrow(mail)))
inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mail,1,mean))) }
params <- c("mu","sd", "y.new")
j1 <- jags(data, inits, params, model1, n.iter=100000); j1
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## mu       658.173   4.468 649.098 655.469 658.156 660.943 666.797 1.001  3000
## sd        27.138   7.976  16.034  21.417  25.459  30.916  48.149 1.001  3000
## y.new    658.098  27.960 601.899 640.912 657.942 675.067 712.985 1.001  3000
## deviance   0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

animatePosteriors(mail, "/home/gwern/wiki/images/maildelivery/real-inferencesamplebysample.gif", 0, 0)
Real data: posterior estimates of mail delivery times, animated updating sample by sample


Because JAGS provides an interval-censored distribution in the form of dinterval() with a likelihood, we can use MCMC for inverse inference (reasoning from data to the underlying process). But if it didn't, I wouldn't know how to write a likelihood down for it, and then the MCMC wouldn't work; but I was able to write a little simulation of how the underlying process of delivery-and-checking works, which, given a set of parameters, spits out simulated results generated by the process—which is probability or forward inference (reasoning from a version of an underlying process to see what it creates). This is a common situation: you can write a good simulation simply by describing how you think something works, but you can't write a likelihood function.

Approximate Bayesian computation (ABC; exemplified in the fun example "Tiny Data, Approximate Bayesian Computation and the Socks of Karl Broman"; see also Wilkinson's intro & summary statistics posts) is a remarkably simple and powerful idea which lets us take a forward simulation and use it to run backwards inference. There's an R package, of course, which implements nice features & optimizations, but ABC is so simple we might as well write our own for transparency.

The simplest ABC goes like this: you sample possible parameters from your prior, feed the set of parameters into your simulation, and if the result is identical to your data, you save that set of parameters. At the end, you're left with a bunch of sets, and that's your posterior distribution, which you can look at the histograms of and calculate 95% densities etc.
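Before applying it to the mail data, the recipe can be seen end-to-end on a toy problem (my example, not part of the original analysis): estimating a coin's bias from 7 heads in 10 flips, where the exact posterior is known to be Beta(8, 4) with mean 2/3:

```r
set.seed(2015)
observed <- 7   # heads seen in 10 flips
posterior <- Filter(Negate(is.null),
    replicate(20000, {
        p <- runif(1)                                    # sample a candidate from the prior
        if (rbinom(1, size=10, prob=p) == observed) p }, # keep p iff simulated data == real data
        simplify=FALSE))
mean(unlist(posterior))  # close to the analytic posterior mean, 8/12 = 0.67
```

The acceptance rate here equals the marginal likelihood of the data (~1/11), which is exactly why exact matching collapses for larger datasets, as seen next.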

So for the mail data, ABC goes like this:

simulateMailbox <- function(n, dTime, dSD) {
    deliveryTime <- round(rnorm(n, mean = dTime, sd = dSD))
    checkTime <- round(rnorm(n, mean = dTime, sd = dSD))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,])) }

# if both data-frames are sorted, comparison is easier
mailSorted <- mail[order(mail$Time1),]

mail_sim <- replicate(100000, {
    # mu ~ dnorm(650, 20)
    mu <- rnorm(n=1, mean=650, sd=20)
    # sd ~ dunif(10, 60)
    sd <- runif(n=1, min=10, max=60)

    newData <- simulateMailbox(nrow(mailSorted), mu, sd)
    newDataSorted <- newData[order(newData$Time1),]

    if (all(newDataSorted == mailSorted)) { return(c(Mu=mu, SD=sd)) }
    }, simplify=FALSE)
results <- Filter(function(x) {!is.null(x)}, mail_sim)
results <- data.frame(t(sapply(results,c)))

The first thing to note is efficiency: I can get a reasonable number of samples in a reasonable amount of time for n = 1-3, but at 4 datapoints, it becomes slow. There are so many possible datasets when 4 checks are simulated that almost all get rejected because they are not identical to the real dataset, and it takes millions of samples and hours to run. And this problem only gets worse for n = 5 and bigger.

To run ABC more efficiently, you relax the requirement that the simulated data be identical to the real data and instead accept the pair of parameters if the simulated data is 'close enough' in some sense—close in terms of some (hopefully sufficient) summary statistic like the mean. I don't know what the sufficient statistics are for a set of interval-censored data, but I figure that if the means of the pairs of times are similar in both datasets, then they are probably close enough for ABC to work, so I can use that as a rejection tolerance; implementing that and playing around, it seems I can make the difference in means as tight as <1 while still running fast.

mail_abc <- function(samples) {
    results <- list()
    n <- 0
    while (n<samples) {

      # Priors:
      ## mu ~ dnorm(650, 20)
      mu <- rnorm(n=1, mean=650, sd=20)
      ## sd ~ dunif(10, 60)
      sd <- runif(n=1, min=10, max=60)

      # generate new data set based on a pair of possible parameters:
      newData <- simulateMailbox(nrow(mail), mu, sd)

      # see if some summaries of the new data were within tolerance e<1 of the real data:
      if (abs(mean(newData$Time1) - mean(mail$Time1)) < 1 &&
          abs(mean(newData$Time2) - mean(mail$Time2)) < 1)
        { results <- list(c(Mu=mu, SD=sd), results); n <- n+1; }
      }
    return(results) }
sims <- mail_abc(2000)
results <- matrix(unlist(sims), ncol=2, byrow=TRUE)
summary(results)
##        V1                 V2
##  Min.   :600.5387   Min.   :10.56221
##  1st Qu.:639.2910   1st Qu.:22.79056
##  Median :644.7196   Median :26.33946
##  Mean   :648.4296   Mean   :26.98299
##  3rd Qu.:661.2050   3rd Qu.:30.38428
##  Max.   :716.9134   Max.   :55.80602

The mean value here is off somewhat from JAGS but still acceptable.

Another way to summarize the dataset occurs to me while looking at the graphs: the most striking visual feature of the interval-censored data is how the 'needles' overlap slightly, and it is this slight overlap which determines where the mean is; the most informative set of data would be balanced exactly between needles that fall to the left and needles that fall to the right, leaving as little room as possible for the mean to 'escape' out into the wider intervals and be uncertain. (Imagine a set of data where all the needles fall to the left, because I only checked the mail at 2PM; I would then be extremely certain that the mail is not delivered after 2PM, but I would have little more idea than when I started about when the mail is actually delivered in the morning, and my posterior would repeat the prior.) So I could use the count of left or right intervals (it doesn't matter if I use sum(Time1 == 0) or sum(Time2 == 1440), since they are mutually exclusive) as the summary statistic.

mail_abc <- function(samples) {
    results <- list()
    n <- 0
    while (n<samples) {

      # Priors:
      ## mu ~ dnorm(650, 20)
      mu <- rnorm(n=1, mean=650, sd=20)
      ## sd ~ dunif(10, 60)
      sd <- runif(n=1, min=10, max=60)

      # generate new data set based on a pair of possible parameters:
      newData <- simulateMailbox(nrow(mail), mu, sd)

      # see if a summary of the new data matches the old:
      if (sum(mail$Time1 == 0) == sum(newData$Time1 == 0))
        { results <- list(c(Mu=mu, SD=sd), results); n <- n+1; }
      }
    return(results) }
sims <- mail_abc(20000)
results <- matrix(unlist(sims), ncol=2, byrow=TRUE)
summary(results)
##        V1                 V2
##  Min.   :567.5387   Min.   :10.00090
##  1st Qu.:636.6108   1st Qu.:22.70875
##  Median :650.0038   Median :34.96424
##  Mean   :650.1095   Mean   :35.08481
##  3rd Qu.:663.7611   3rd Qu.:47.60481
##  Max.   :735.4072   Max.   :59.99941

This one yields the same mean, but a higher SD (which is more divergent from the JAGS estimate as well). So perhaps not as good.

Exact delivery-time data

Partway through compiling my notes, I realized that I did in fact have several exact times for deliveries: the USPS tracking emails for packages, while useless on the day of delivery for knowing when to check (since the alerts are only sent around 3-4PM), do include the exact time of delivery that day. And then while recording interval data, I did sometimes spot the mailman on her rounds; to keep things simple, I still recorded it as an interval.

Exact data makes estimating a mean & SD trivial:

mailExact <- data.frame(Date=as.POSIXct(c("2010-04-29 11:33AM", "2010-05-12 11:31AM", "2014-08-20 12:14PM",
                                          "2014-09-29 11:15AM", "2014-12-15 12:02PM", "2015-03-09 11:19AM",
                                          "2015-06-10 10:34AM", "2015-06-20 11:02AM", "2015-06-23 10:58AM",
                                          "2015-06-24 10:53AM", "2015-06-25 10:55AM", "2015-06-30 10:36AM",
                                          "2015-07-02 10:45AM", "2015-07-06 11:19AM", "2015-07-10 10:54AM",
                                          "2015-07-11 11:09AM", "2015-07-15 10:29AM", "2015-07-16 11:02AM",
                                          "2015-07-17 10:46AM", "2015-07-27 11:12AM", "2015-08-15 10:56AM",
                                          "2015-08-17 11:40AM", "2015-08-18 11:19AM", "2015-08-27 10:43AM",
                                          "2015-09-04 10:56AM", "2015-09-18 11:15AM", "2015-09-26 10:42AM",
                                          "2015-09-28 11:48AM"), format="%Y-%m-%d %I:%M%p"))
mailExact$TimeDelivered <- fromClock(mailExact$Date)
mean(mailExact$TimeDelivered)
## [1] 668.1071429
sd(mailExact$TimeDelivered)
## [1] 26.24152673

png(file="~/wiki/images/maildelivery/exact.png", width = 800, height = 500)
timeLimits <- seq(from=10*60, to=12.25*60, by=10)
qplot(Date, TimeDelivered, data=mailExact) +
 scale_y_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock))
invisible(dev.off())
Known exact delivery times to me of USPS packages, 2010-2015

Plotted over time, there's a troubling amount of heterogeneity: despite the sparsity of data (apparently I did not bother to set up USPS tracking alerts 2011-2014, to my loss), it's hard not to see two separate clusters there.


Besides the visual evidence, a hypothesis test agrees with there being a difference between 2010-2014 and 2015:

mailExact$Group <- year(mailExact$Date) > 2014
wilcox.test(TimeDelivered ~ Group, data=mailExact)
wilcox.test(TimeDelivered ~ Group, conf.int=TRUE, data=mailExact)
#   Wilcoxon rank sum test with continuity correction
# data:  TimeDelivered by Group
# W = 105.5, p-value = 0.004344224
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  18.99997358 67.00004597
# sample estimates:
# difference in location
#            41.00004647

Why might there be two clusters? Well, now that I think about it, I recall that my mailman used to be an older gentleman with white hair (I remember him vividly because in mid-August 2013 a package of fish oil was damaged in transit, and he showed up to explain it to me and offer suggestions on returns; Amazon didn't insist on a leaky fish oil bottle being shipped back and simply sent me a new one). But now my mailman is a younger middle-aged woman. That seems like a good reason for a shift in delivery times—perhaps she drives faster. (As our mailpersons never interact with customers and there is minimal road traffic, the delivery times should reflect almost entirely their route length, delivery volume, and driving speed/efficiency.) Alternately, perhaps the personnel shift was driven by a changed route; the exact cause is unimportant, as the difference appears large and consistent.


Estimating the two distributions separately in a simple multilevel/hierarchical model:

model2 <- function() {
  # I expect all deliveries ~11AM/650:
  grand.mean ~ dnorm(650, pow(20, -2))

  # different mailman/groups will deliver at different offsets, but not by more than 2 hours or so:
  delta.between.group ~ dunif(0, 100)
  # similarly, both mailman times and delivery times are reasonably precise within 2 hours or so:
  tau.between.group <- pow(sigma.between.group, -2)
  sigma.between.group ~ dunif(0, 100)

  for(j in 1:K){
   # let's say the group-level differences are also normally-distributed:
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   # and each group also has its own standard-deviation, potentially different from the others':
   group.within.sigma[j] ~ dunif(0, 100)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)

   # save the net combo for convenience & interpretability:
   group.mean[j] <- grand.mean + group.delta[j]
   }

  for (i in 1:N) {
   # each individual observation is from the grand-mean + group-offset, then normally distributed:
   Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
   }

  # prediction interval for the second group, the 2015 data, which is the one I care about:
  y.new2 ~ dnorm(group.mean[2], group.within.tau[2])
  }
data <- list(N=nrow(mailExact), Y=mailExact$TimeDelivered, K=max(mailExact$Group+1),
             Group=mailExact$Group+1)
params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
            "group.within.sigma", "y.new2")
k1 <- jags(data=data, parameters.to.save=params, inits=NULL, model.file=model2, n.iter=50000); k1
##                       mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## delta.between.group    39.252  24.676   1.866  19.228  35.892  56.239  90.804 1.002  1900
## grand.mean            646.541  18.229 610.306 634.420 646.967 658.402 682.334 1.001  3000
## group.delta[1]         51.451  24.090   6.461  35.024  51.243  67.678  97.720 1.001  3000
## group.delta[2]         14.603  18.634 -22.068   2.057  14.234  26.874  51.781 1.001  3000
## group.mean[1]         697.992  16.997 660.532 688.781 699.057 707.816 731.813 1.001  3000
## group.mean[2]         661.144   4.493 652.375 658.188 661.171 664.057 670.275 1.002  1500
## group.within.sigma[1]  36.006  16.775  15.685  23.966  31.571  42.990  81.056 1.001  3000
## group.within.sigma[2]  21.297   3.461  15.764  18.798  20.929  23.266  28.987 1.002  1400
## sigma.between.group    44.831  24.254   7.910  25.429  41.149  62.391  94.998 1.004  1500
## y.new2                661.669  22.141 616.927 647.494 661.905 675.973 704.147 1.002  3000
## deviance              252.499   3.557 247.820 249.767 251.785 254.502 261.130 1.001  3000
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 6.3 and DIC = 258.8
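All the estimates above are in minutes-since-midnight, which is hard to read at a glance. A small helper makes them legible (`toClock` is a hypothetical convenience function for illustration, not part of the analysis):

```r
# convert minutes-since-midnight into a 12-hour clock string;
# `toClock` is an illustrative helper, not from the original analysis:
toClock <- function(m) {
    hour   <- as.integer(((m %/% 60 - 1) %% 12) + 1)  # 24-hour count -> 1-12
    minute <- as.integer(m %% 60)
    sprintf("%d:%02d%s", hour, minute, ifelse(m < 720, "AM", "PM")) }
toClock(661)  # group.mean[2], the 2015 mean: "11:01AM"
toClock(698)  # group.mean[1], the earlier mean: "11:38AM"
```

So the two postmen's mean delivery times sit roughly 35-40 minutes apart, consistent with the `group.delta` estimates.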

Combined data

Combining the interval and exact data is easy: exact data are interval data with very narrow intervals (accurate roughly to within half a minute), where the beginning and end match. If exact delivery time is available for a day, then the interval data for that day is omitted as redundant:

mail2 <- rbind(
 with(mailExact,
      data.frame(Date = Date,
                 Time1 = TimeDelivered,
                 Time2 = TimeDelivered + 0.5)),
 with(mailInterval[!(as.Date(mailInterval$Date) %in% as.Date(mailExact$Date)),],
      data.frame(Date = Date,
                 Time1 = ifelse(Delivered, 0, Time),
                 Time2 = ifelse(Delivered, Time, 1440))))

mail2$Group <- year(mail2$Date) > 2014
mailCombined <- subset(mail2,select=c("Time1","Time2"))

The multilevel model is the same as before, with the exception that the likelihood needs to be tweaked to put dinterval on top, and the input data needs to be munged appropriately to match its expectation of a two-column data-frame of the form (Time1, Time2):

model3 <- function() {
  grand.mean          ~ dnorm(650, pow(20, -2))
  delta.between.group ~ dunif(0, 100)
  tau.between.group   <- pow(sigma.between.group, -2)
  sigma.between.group ~ dunif(0, 100)
  y.new2015 ~ dnorm(group.mean[2], group.within.tau[2])

  for(j in 1:K){
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   group.within.sigma[j] ~ dunif(0, 100)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)
   group.mean[j] <- grand.mean + group.delta[j]
  }

  for (i in 1:N) {
   y[i] ~ dinterval(t[i], dt[i,])
   t[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
  }
}

data2 <- list(N=nrow(mail2), K=max(mail2$Group+1),
             Group=(mail2$Group+1), "dt"=subset(mail2, select=c("Time1", "Time2")),
             "y"=rep(1, nrow(mail2)))
params2 <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
            "group.within.sigma", "y.new2015")
inits2 <- function() { list(grand.mean=650, t=as.vector(apply(data2[["dt"]],1,mean))) }
k2 <- jags(data=data2, parameters.to.save=params2, inits=inits2, model.file=model3, n.iter=100000); k2
##                       mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## delta.between.group    39.364  24.229   2.728  19.947  36.568  56.187  92.219 1.002  3000
## grand.mean            646.921  18.303 609.659 634.911 647.471 659.948 680.647 1.002  1200
## group.delta[1]         51.517  23.431   7.242  35.480  50.876  67.432  97.720 1.002  1300
## group.delta[2]         15.617  18.463 -18.505   2.450  15.254  27.661  53.121 1.002  1100
## group.mean[1]         698.438  16.892 661.420 689.057 699.582 708.741 730.282 1.001  3000
## group.mean[2]         662.538   3.088 656.700 660.463 662.538 664.545 668.783 1.001  3000
## group.within.sigma[1]  36.074  16.210  16.181  24.249  31.834  43.555  80.031 1.001  2500
## group.within.sigma[2]  20.691   2.933  15.832  18.616  20.342  22.421  27.245 1.001  3000
## sigma.between.group    43.801  24.304   6.174  24.458  39.747  60.546  95.167 1.003  3000
## y.new2015             662.168  21.331 620.004 648.164 661.859 675.705 705.655 1.002  1200
## deviance                0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

It’s worth comparing the y.new2015 (the posterior predictive distribution for just 2015, ignoring the earlier cluster) and the original y.new (computed on 2015 interval-censored data): we started with a 95% credible interval of mail deliveries in the window 601-712 (10:01AM-11:52AM), and then got a window of 620-702 (10:20AM-11:45AM), for a gain of half an hour or almost 30%. And this despite having almost 3x as much interval-censored data! The lesson here is to try to minimize the length of intervals if one wants to make precise inferences.
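The precision gain quoted above is just arithmetic on the two 95% intervals:

```r
# widths of the two 95% credible intervals for the posterior predictive, in minutes:
width.censored <- 712 - 601  # interval-censored 2015 data alone: 10:01AM-11:52AM
width.combined <- 702 - 620  # interval-censored + exact data:    10:20AM-11:45AM
width.censored - width.combined                # 29 minutes narrower
round(1 - width.combined / width.censored, 2)  # 0.26: almost 30% smaller
```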

Model checking

On Model Uncertainty

“…It is no surprise that when this model fails, it is the likelihood rather than the prior that is causing the problem. In the binomial model under consideration here, the prior comes into the posterior distribution only once and the likelihood comes in n times. It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.”

Andrew Gelman, 2013

In setting up an analysis, one must make a variety of choices about which models to use, and these choices influence the result; neglect of this “model uncertainty” can lead to overconfident inferences or miss critical details. In Bayesian analyses, typically the prior comes in for the most scrutiny as to how it determines the result; but the prior part of the model often has much less influence than the structural or functional form of the analysis, where it is assumed that the outcome is drawn from a normal distribution or another such common choice, although it’s not. This choice is itself often arbitrary and can determine the result almost regardless of what the data says. For example, the normal distribution has thin tails, so the probability of something being far out on a tail will be extremely small; combined with other considerations like measurement error, this can result in a model refusing to ever accept that a value really is ‘extreme’ without equally extreme amounts of data—a refusal that would not happen if any of a number of other perfectly possible distributions had been chosen instead, like the log-normal or the t-distribution. And the choice of the normal distribution is itself usually due more to convenience than to any extremely strong evidence that the normal is the true distribution (“every model is false, but some are useful”). The best response is simply to note that models themselves have some probability of being wrong—a probability that sets an upper bound on how much we can get out of them, and a bound which often bars many of the inferences one might try to extract from them.
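How much the thin tails matter can be seen by comparing tail probabilities directly (generic standardized values, not the mail data): a 5-SD observation is essentially impossible under a normal, but unremarkable under a fat-tailed alternative like a t-distribution with 3 degrees of freedom:

```r
# probability of an observation >=5 standard units above the mean:
pnorm(5, lower.tail=FALSE)     # normal: ~2.9e-7
pt(5, df=3, lower.tail=FALSE)  # Student's t, 3 df: ~0.008
# the t-distribution considers such an 'extreme' value tens of thousands of
# times more probable, so the choice of distribution drives the conclusion:
pt(5, df=3, lower.tail=FALSE) / pnorm(5, lower.tail=FALSE)
```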

For ex­am­ple:

  1. the normal distribution is commonly used for its analytic tractability and asymptotic properties, but it has some aspects that are, in a sense, too powerful and restrictive: specifically, its tails are thin. Thin tails can cause serious underestimation of rare phenomena (and not just in cases involving power laws) and can lead to absurdly powerful conclusions. Two examples:

    • in one analysis of charities, the writer uses normal distributions and thus winds up arguing that, with error in our assessments of charities’ values, our estimates must be shrunken far towards the mean (which is entirely true); but this is circular, since by assuming normal distributions he has also assumed away the existence of large differences in value between charities. If one were to grant the apparently innocuous assumption of normally-distributed effectiveness of interventions, one would be forced to ignore that there are large differences—consider knitting sweaters for penguins vs the most effective global-health interventions; are they really within a few standard deviations of each other? Should we really ignore interventions like vaccines which claim to save lives at a cost of no more than tens of thousands of dollars, on the grounds that most charities require hundreds of thousands or millions of dollars to achieve goals like that? If we look at estimates of the global burden of disease and see extreme skews in what causes loss of QALYs, with some classifications being hundreds or thousands of times more damaging than others (like indoor fires), is it more likely that the data is wrong than right solely because the normal distribution says that the largest problems are many standard deviations out? It’s clear that whatever the right distribution for modeling charity or health outcomes is, it is probably not literally & exactly a normal distribution and it should allow for large differences; and having abandoned the normal, we then lose the apparently magical ability to make confident predictions about how many charities are how effective.

    • in another analysis, the author, a Michael Ferguson, argues that American society discriminates to a truly astonishing degree against people with high IQs, claiming that people with 150 IQ have 97% less chance of achieving success than people with the ideal IQ of ~133, and thus, that past that, the chance of success drops almost to 0%, and that this is due to vicious constant discrimination & persecution. His analysis takes the strategy of noting that the population IQ is normally distributed, by definition, at N(100, 15), taking a few old studies of elite groups like students at Harvard Medical School which report means like 122 and standard deviations like 6, inferring that the students are normally distributed at N(122, 6), and then noting that when these two normal distributions are superimposed, there will be almost no elites as high as IQ 140 or 150 despite there being many such people in the general population—and this underrepresentation gets more and more extreme as one goes further out on the tail. Why did all the high-IQ population members fail to show up in the elite distribution? It must be discrimination. Thus, he grimly warns his readers that “If your IQ is over 150, it is a clarion call; without direct intervention, your career prospects are very poor. If you are the parent of a child with a D15IQ over 150, immediate and dramatic action is required.”

      This would be a remarkable fact if true for many reasons (not least the waste & damage to science and humanity), but it is unlikely: because the underrepresentation is a mathematical necessity of the normal distribution, given that the elite distribution is defined as having a smaller SD. The assumption that the elite distribution is normal drives the entire remarkable, unlikely result. But why assume elites are normally distributed? Most of the processes by which elite schools such as Harvard would select from the general population would not produce a normal distribution: if selected at a SAT threshold, the result will not be normal but a truncated normal; if selected on multiple factors such as test scores and personality, it will more likely be log-normal; and so on. That the studies of the Harvard students reported a mean and SD does not show that the student distribution was normal, and in fact, it’s easy to simulate scenarios like the threshold model or a model in which increasing IQ produces increasing odds of admission, and show that the reported means & SDs are entirely possible under other highly non-normal distributions (intuitively, people as high as 140 or 150 are so rare that they don’t affect the summary statistics much, regardless of whether they are heavily discriminated against or heavily favored for admission). The shocking conclusion follows from the mathematical/statistical assumption, but where did this assumption come from and should we grant it?

  2. linear models have a similar problem as normal distributions: while convenient and simple, they have well-known problems at the extremes of the data or when projected beyond it. Textbooks give examples like a linear regression on ice cream sales predicting negative sales during the winter, or predicting that sales in summer will be orders of magnitude higher, and it’s easy to make up other examples, like a linear regression on sprinting times predicting that in another century, sprinters will finish races before they have started. Linear models have no problem predicting negative values backwards in time, or predicting absurdly high amounts in the far future; but they are convenient, and for the most part it’s easy to simply ignore the nonsensical predictions or, if they’re causing problems, use a technical fix like logging values, switching to a log-normal distribution (with its strictly positive support), using a sigmoid or logarithmic fit to account for the inevitable flattening out, handling ratio data with beta regression, fixing the end-values with spline regression, etc; if we make the mistake of modeling a bacterial population in an agar dish as an exponentially increasing population, we’ll soon learn otherwise: the sigmoid or logistic curves make a better fit. Mark Twain offered an amusing criticism of such extrapolation in Life on the Mississippi, and indeed, one can often detect model problems by watching for instances where one gets “wholesale returns of conjecture out of such a trifling investment of fact”, for that indicates that we may be engaged in postulating, which has all “the advantages of theft over honest toil”:

    In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oolitic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.

    An interesting use of OLS linear models is given in one paper on the origin of life, which follows up an earlier analysis. By taking the logged genome lengths of 8 biological groups (eg prokaryote genomes are ~5 logged and prokaryotes originated 3-4bya, so it is assumed that genome lengths were ~5 logged 3-4bya) and regressing against date of origin (on a log scale), their linear model, in the mother of all extrapolations, extrapolates backwards to the smallest possible genome (and hence, the origin of all life) around 9 billion years ago—well before Earth could have supported life or even existed, and consistent only with a panspermia hypothesis.

    This is problematic for a number of reasons: genome length is being used as a clock, but the animal and bacteria genomes involved are all contemporary (since no DNA older than a million years or so has ever been recovered & its full length counted), which raises some questions about why we think genome length has any connection with time and how exactly this linear increase process is supposed to work (is evolution supposed to have stopped for the prokaryotes 3-4bya and their genome lengths been static ever since?); how they defined those particular taxonomic groups of all the possibilities and those particular genome lengths & dates (and a questionable reason for excluding plants); measurement error and the very small data set imply considerable uncertainty in the estimates & a rejection of Earthly origins even at face value; genome lengths seem to have little to do with the passage of time but reflect peculiarities of biology & selection (viruses are under more selection pressure, and have faster generations, than any other kind of organism, and have innovated baffling numbers of tricks for infection, and yet instead of having ultra-long genomes reflecting these endless generations and innovation, they have ultra-small ones; and similarly, the Human Genome Project yielded humbling results in showing the human genome to be smaller than that of many organisms we humans look down upon, such as even some single-cell organisms—while other genomes can be bizarrely huge, like the ferns’) and vary tremendously within groups; there is no direct evidence showing that early life’s genome lengths increased at the claimed trendline, and no particular reason to expect recent trends to hold exactly to the origins of life and times when life may not have been DNA-based or even RNA-based at all; genomes are known to shrink in size as well as increase under the right conditions (and one would hope that contemporary prokaryotes are more genetically efficient than the earliest prokaryotes); mutation load & transcription errors place an upper bound on how large a genome can be (Ohno 1972); many genomic phenomena, like retroviral insertions or selfish jumping genes, change genome lengths for reasons having little to do with elapsed time; it’s unclear why one should endorse constant increase in genome size rather than a population-genetics perspective of equilibriums and occasional shifts; horizontal gene flow makes it unclear how much one can compare with multicellular organisms & vertical transmission; Gould argues that any increase in complexity can be due to a random walk with a lower bound, in which case there is no underlying trend at all, merely stochastic extremes occasionally reached; and the wide variations in “C-values” cannot be explained if genome length has much at all to do with a clock ticking since divergence from other species.

    More relevant here is the model misspecification: the model will happily project past 1bya and estimate how long genomes were—in negative units—10bya. (To say nothing of extrapolating a few billion years into the future.) This is nonsensical: whatever life may use for genomes, it must be positive and never negative. So linearity is known a priori to be false and to break down at some point: the linear model is guaranteed to be incorrect in its predictions early in history; now, if one must reject the extrapolations for 10bya because the model has broken down by that point, why should one believe the projections from before the breakdown…? Why not infer that any linearity appears only later, after such regime-changing events as the creation of life, the switch to DNA, the expansion over the entire world, the oxygen catastrophe, etc? (If your ice cream linear model predicts selling -1000 ice cream cones at the North Pole, why believe it when it predicts selling -10 ice cream cones in Canada?) Why not use a model like a sigmoid which at least has a chance of being correct? The problem, of course, is that if one tried to fit a sigmoid or another such distribution or curve which could fit the true data, one would discover that there is no early data which could give any indication of how fast genome lengths were increasing very early on, like in 4bya, since all the available data is from now. Where does the idea come from that genome lengths would increase slowly even in the first life forms which discovered a virgin planet of unexploited organic goop? Examined closely, since there is no evidence for early linearity, we see that the modeling produces no additional evidence, and is merely an exercise in showing the consequences of an assumption: “if one assumes that genome lengths increase linearly […] then with this particular set of lengths and assumed times, it logically follows that the first organisms were 9bya”. This is a thoroughly bizarre premise which cannot survive the slightest contact with known sizes of genomes and influences on them, so one gets no more out of this than what one put in: since there was no particular reason to accept that assumption out of all possible assumptions, there’s no additional reason to accept the conclusion as true either. On its own, it cannot give us any additional evidence for panspermia.

  3. independence is often a questionable model assumption: many datapoints are highly correlated with each other, and calculations assuming independence will become grotesquely overconfident. Is it more likely that the probability of a tied election is 10^-90 and one could run trillions of elections without ever once coming near a tie, or that the binomial assumption is simply wrong about voters being independent and victory margins being accordingly distributed? Is it more likely that stock market movements are thin-tailed and large crashes would happen only once in billions of years, or that crashes are common and the distribution is fat-tailed? Is it more likely that there are many strong independent predictors of human populations’ wealth and social factors, or that the populations are highly correlated with each other? Is it more likely that the models were correct and the fisheries just accidentally collapsed, or that the models did not take into account autocorrelation/multilevelness and overstated their accuracy through pseudoreplication? etc
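Returning to the elite-selection example in #1, the threshold model is easy to simulate: a noisily-selected (and hence truncated, highly non-normal) elite still reports unremarkable summary statistics. The cutoff & noise values below are invented for illustration:

```r
# population IQ ~ N(100, 15); 'admit' anyone whose IQ plus noisy other factors
# (test scores, personality, luck) clears a threshold; all parameters made up:
set.seed(2015)
population <- rnorm(1e6, mean=100, sd=15)
admitted   <- population[population + rnorm(1e6, mean=0, sd=5) > 115]
c(mean=mean(admitted), sd=sd(admitted))
# mean ~121, SD ~8: truncation yields elite-looking summary statistics from a
# distribution which is not normal at all, so a reported mean & SD alone cannot
# justify superimposing a second normal distribution on the population one.
```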

In the case of mail deliveries, there’s clearly model uncertainty. While I used the normal distribution, I could as easily have used t-distributions (fatter-tailed normals), log-normals, exponentials, uniforms, betas, or negative binomials, or overdispersed versions of some of the previous, or picked some even more exotic univariate distribution, and easily come up with a just-so story for some of them. (It’s a negative-binomial because we’re modeling minutes until “failure” ie delivery; no, it’s a log-normal because each stop slows down the mailman and delays multiply; no, it’s a normal distribution because each stop adds time and the sum of many small random deviates is itself normally distributed; no, it’s a t-distribution because sometimes the mail arrives much earlier or later and it’s a more robust distribution we should be using anyway; no, it’s even wider than that and a uniform distribution because the mailman starts and ends their shift at particular times but otherwise doesn’t care about speed; no, it’s a beta distribution over time because that’s more flexible and can include the others as special-cases…)

Model uncertainty can be handled several ways (Hooten & Hobbs 2015), including:

  • Penalty or regularization or sparsity priors (only handles selection of variables/predictors; gains largely eliminated in this case by informative priors)

  • Diagnostics like posterior predictive checks, crossvalidation, and information criteria (AIC/DIC/WAIC) can be used to check for lack of fit; but like tests in software development, they can only show the presence of bugs & not their absence

  • Maximally flexible models can be used, like nonparametric Bayesian models; if “all models are wrong”, we can use models which are arbitrarily flexible in order to minimize the influence of our model misspecification (Dunson 2013)

  • Model selection & model averaging (often targeted at the narrow use-case of selecting variables/predictors in a linear model, but some methods are applicable to choosing between different distributions as well), particularly Bayesian approaches (Hooten & Hobbs 2015):

    • Bayesian model choice: models can be pitted against each other (eg using Bayes factors), implemented using nested sampling, reversible-jump MCMC/product-space methods (Carlin & Chib 1995, Lodewyckx et al 2011 & source code, Tenan 2014 & source code), or ABC, and the one with highest posterior probability is selected for any further analysis while all others are ignored
    • Bayesian model averaging: posterior probabilities of models are calculated, but all models are retained and their posteriors are combined, weighted by the probability that each model is the true one (Hoeting et al 1999); but with enough data, the posterior probability of the model closest to the truth will converge on 1 and BMA becomes a model selection procedure
    • Bayesian model combination: posterior probabilities of combinations/ensembles of models are calculated, yielding a richer set of ‘models’ which often outperform all of the original models; such as by learning an optimal weighting in a linear combination (Monteith et al 2011)
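As a sketch of what model averaging produces (with invented posterior model probabilities & per-model posterior-predictive draws, not quantities estimated from the mail data): the BMA predictive distribution is simply a probability-weighted mixture of the per-model predictive distributions:

```r
set.seed(2015)
# hypothetical posterior model probabilities & posterior-predictive draws:
postProb <- c(normal=0.70, t=0.25, lognormal=0.05)
preds <- list(normal    = rnorm(1e5, mean=660, sd=21),
              t         = 660 + 21 * rt(1e5, df=5),
              lognormal = rlnorm(1e5, meanlog=log(660), sdlog=0.032))
# mix the predictives in proportion to each model's posterior probability:
bma <- unlist(mapply(function(draws, p) { sample(draws, size=round(1e5 * p)) },
                     preds, postProb))
round(quantile(bma, c(0.025, 0.975)))  # averaged 95% predictive interval
```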

Of the non-ensemble approaches: regularization is already done via the informative priors; some of the diagnostics might work but others won’t (JAGS won’t calculate DIC for interval-censored data); nonparametric approaches would be too challenging for me to implement in JAGS, and I’m not sure there is enough data to allow nonparametric models to work well.

This leaves PPC, crossvalidation, and the ensemble approaches:

  • Bayesian model combination is overkill
  • Bayesian model averaging, given Bayesian model choice, may also be overkill and not meaningfully improve predictions enough to change final decisions about mail check times
  • Bayesian model choice: nested sampling doesn’t seem applicable, leaving reversible-jump/product-space; while JAGS examples are available in tutorial papers, I found them lengthy & complicated by their use of realistic data & models, hardwired to their specific use-cases, and still too difficult to understand with my limited JAGS skills, so I wound up implementing model choice using ABC.


Posterior predictive checks (PPC) are a technique recommended by Gelman as a test of how appropriate the model is (as the model or likelihood often is far more important in forcing particular results than the priors one might’ve used), where one generates possible observations from the model and inferred posteriors, and sees how often the simulated data matches the real data; if that is rarely or never, then the model may be bad and one should fix it, or possibly throw in additional models for Bayesian model comparison (which will hopefully then pick out the right model). For continuous data or more than trivially small data, we have the same issue as in ABC (which is similar to PPCs): the simulations will never exactly match the data, so we must define some sort of summary or distance measure which lets us say that a particular simulated dataset is similar enough to the real data. Of the two summaries I tried for ABC, counting ‘flip-flops’ worked better, so I’ll reuse that here. For the simulation, we also need check or interval-censoring times, which I also treat as a normal distribution, estimating their means and SDs from the original data.

posteriorTimes <- k2$BUGSoutput$sims.list[["y.new2015"]]

simsPPC <- replicate(10000, {
    nChecks <- nrow(mail)
    checkTimes <- c(mail[mail$Time1>0,]$Time1, mail[mail$Time2<1440,]$Time2)
    meanCheck <- mean(checkTimes)
    sdCheck <- sd(checkTimes)

    deliveryTime <- sample(posteriorTimes, size=nChecks, replace=TRUE)
    checkTime <- round(rnorm(nChecks, mean = meanCheck, sd = sdCheck))
    newSamples <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    newData <- data.frame(Time1=newSamples[1,], Time2=newSamples[2,])

    return(sum(newData$Time1 == 0))
    })

png(file="~/wiki/images/maildelivery/ppc.png", width = 800, height = 500)
qplot(simsPPC) + geom_vline(xintercept=sum(mail$Time1 == 0), color="blue")
invisible(dev.off())
Similarity of posterior predictive distribution’s flip-flops with the original data’s flip-flops

The true data has a summary near the middle of the distribution of the summaries of the posterior predictions, suggesting that the normal distribution’s fit to the problem is OK (at least, when considered in isolation).


CV comes in several flavors, but since the data is not big, I can use leave-one-out crossvalidation. The usual error measures are unavailable due to interval-censoring, so after some pondering, I went with a zero-one loss based on whether a delivery is predicted or not at a particular check-time. Mean-squared-error is out due to the intervaling, as are variants like absolute error. Thinking about it some, it seems to me that each datapoint is really made of two things: the check-time, and the delivery status. The check-time has nothing to do with the model quality, and the model isn’t trying to predict it; what I want the model to predict is whether the mail is delivered or not if I were to check at particular times. The check-time is the predictor variable and the delivery-status is the response variable. So I can ask the model’s posterior predictive distribution of delivery-times (y.new) whether or not the mail would be delivered at a particular time, and compare it against the held-out datapoint’s actual delivery status: 1 if the model correctly predicts delivered/not-delivered and 0 if it predicts the opposite of what the data said. (So a zero-one loss.) The base-rate of delivery vs non-delivery will be 50-50, so a good model should achieve a better mean zero-one loss than 0.5.
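Written out, the zero-one score for a single held-out check is just the following (`zeroOneScore` is an illustrative name; the loocvs implementation later computes the same quantity inline):

```r
# fraction of posterior-predictive delivery times which correctly predict the
# observed delivery status at a given check time (zero-one scoring):
zeroOneScore <- function(posteriorTimes, checkTime, delivered) {
    correct <- ifelse(posteriorTimes < checkTime, delivered, !delivered)
    mean(correct) }
# eg checking at 11:30AM (minute 690) and finding the mail already there:
zeroOneScore(c(650, 670, 700, 720), checkTime=690, delivered=TRUE)  # 0.5
```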

For distributions, we’ll compare the normal used all along, the t-distribution (for its fatter tails), and, a little more exotically, the log-normal & exponential distributions as well. Of these, the first two are the most likely, and I think it’s highly unlikely that the log-normal or exponential could fit better (because they will tend to predict extremely late deliveries, much later than I believe is at all realistic, while the t’s spread is more reasonable in accommodating some non-normality but not too much). I include those two just to see what will happen and to test the robustness of any approach used. The 4 models in JAGS:

modelN <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dnorm(mu,tau) }
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dnorm(mu, tau) }"
modelT <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dt(mu,tau,nu) }
           nu ~ dexp(1/30)
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dt(mu, tau, nu) }"
modelL <- "model { for (i in 1:n) {
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dlnorm(muL, tauL) }
          sd ~ dunif(10, 60)
          tauL <- pow(1/log(sd), 2)
          mu ~ dnorm(650, pow(30, -2))
          muL <- log(mu)
          y.new ~ dlnorm(muL, tauL) }"
modelE <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dexp(theta) }
           # set the mean ~600, but remember the rate is the reciprocal:
           theta ~ dexp(1/600)
           y.new ~ dexp(theta) }"

Here's a try at implementing leave-one-out crossvalidation for those four JAGS models on the interval-censored mail data, with a zero-one loss defined that way based on delivery status:

runModel <- function(newData, model, iter=5000) {
     # set up and run the model and extract predicted delivery times
     data   <- list("dt"=newData, "n"=nrow(newData), "y"=rep(1, nrow(newData)))
     inits  <- function() { list(mu=rnorm(1,mean=600,sd=30),sd=30,t=as.vector(apply(newData,1,mean))) }
     params <- c("y.new")
     cv1 <- jags(data, inits, params, textConnection(model), n.iter=iter, progress.bar="none")
     posteriorTimes <- cv1$BUGSoutput$sims.list[["y.new"]]
     return(posteriorTimes)
}

loocvs <- function(dt, model) {
    results <- NULL

    for (i in 1:nrow(dt)) {
     # set up relevant data for this particular fold in the crossvalidation
     ith       <- dt[i,]
     newData   <- dt[-i,] # drop the _i_th datapoint
     checkTime <- if (ith$Time1==0) { ith$Time2 } else { ith$Time1 }
     there     <- if (ith$Time1==0) { TRUE } else { FALSE }

     posteriorTimes <- runModel(newData, model, iter=500)
     # score predictions against the held-out datapoint
     results[i] <- mean(sapply(posteriorTimes,
                               function(t) { if (t<checkTime && there) { 1 } else {
                                               if (t>checkTime && !there) { 1 } else { 0 } }}))
    }
    return(results)
}
loocvsN <- loocvs(mailCombined, modelN); mean(loocvsN)
## [1] 0.6316747967
loocvsT <- loocvs(mailCombined, modelT); mean(loocvsT)
## [1] 0.6339715447
loocvsL <- loocvs(mailCombined, modelL); mean(loocvsL)
## [1] 0.4994674797
loocvsE <- loocvs(mailCombined, modelE); mean(loocvsE)
## [1] 0.4504227642

By this measure, the normal & t models perform almost identically, while we can reject the log-normal & exponential models out of hand as performing as badly as (or worse than) chance.

Bayesian Model Selection

We are comparing 4 parametric models here: the normal, Student's t, log-normal, and exponential. (Time permitting, a Cauchy would have been good too.)

The BIC/DIC/WAIC deviance criteria are unavailable; nested sampling cannot handle multiple distributions; this leaves reversible-jump MCMC/the product-space method and ABC.

I looked at product-space first. The idea is conceptually simple: we take our JAGS models and treat them as a mixture model, with each model having a probability of having generated an observation; since we believe only one is true, after inferring these probabilities, they become our posterior probabilities of each model. The JAGS implementations look relatively straightforward, but most of them deal with the case of different probabilities or parameters for the same distribution, or with variable selection in linear regression, and the only examples with multiple distributions were totally baffling to me, as I became lost in the different distributions and categories and indices and what the 'pseudopriors' are doing etc.

After giving up, it occurred to me that since I had no problem writing the 4 standalone models themselves, their posterior predictive distributions were, in a sense, generative models of themselves, and so I could use them in ABC. If I generated samples from all 4 models in ABC and counted how often they (approximately) matched the original data, then would not the ratios of accepted samples constitute posterior probabilities, or at least Bayes factors? If the normal model matched the data 10x more often than the t model, then surely the normal model is a much better model of the data than the t. But there's nothing new under the sun, and so while the papers on Bayesian model comparison I had read up until that point had not mentioned ABC as an option, it turns out that using ABC for Bayesian model comparison is not just possible but downright common in genetics & ecology (see for background Fu & Li 1997, Marjoram et al 2003, Didelot et al 2004, Beaumont 2010 & Sunnaker et al 2013; examples include Pritchard et al 1999, Estoup et al 2004, Miller et al 2005, Pascual et al 2007, Sainudiin et al 2011). There is the issue that the summary statistics may lose information and lead to bad model-comparison results, but there's always tradeoffs.
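Before applying it to the mail data, the idea can be illustrated with a stripped-down toy version (a Python sketch with arbitrary distributions and tolerance, not the mail models): simulate datasets from each candidate model with equal prior probability, accept those whose summary statistic lands near the observed data's, and read posterior model probabilities off the acceptance counts.

```python
import random

random.seed(2)
# pretend these are the real observations (here: actually drawn from the normal)
observed = [random.gauss(660, 30) for _ in range(50)]
obs_mean = sum(observed) / len(observed)

def simulate(model, n):
    if model == "normal":
        return [random.gauss(660, 30) for _ in range(n)]
    else:  # exponential with the same mean
        return [random.expovariate(1 / 660) for _ in range(n)]

def abc_model_choice(samples, tolerance=5):
    """Accept simulated datasets whose mean is within `tolerance` of the data's;
    the accepted counts estimate the posterior model probabilities."""
    counts = {"normal": 0, "exponential": 0}
    accepted = 0
    while accepted < samples:
        model = random.choice(["normal", "exponential"])  # uniform prior over models
        sim = simulate(model, len(observed))
        if abs(sum(sim) / len(sim) - obs_mean) < tolerance:
            counts[model] += 1
            accepted += 1
    return counts

counts = abc_model_choice(1000)
print(counts)  # the ratio of the two counts estimates the Bayes factor
```

Note how the summary statistic matters: matching only on the mean occasionally accepts exponential datasets whose mean happens to land nearby, which is the information-loss caveat mentioned above.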

ABC for Bayesian model comparison:

simulateMailboxMulti <- function(n, m, sd, simulation) {
    deliveryTime <- sample(simulation, size=n, replace=TRUE)
    checkTime <- round(rnorm(n, mean=m, sd=sd))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,]))
}
mail_abc_bmc <- function(samples) {

    # generate the posterior samples for each model, which serve as our generative simulations:
    normal <- runModel(mailCombined, modelN)
    tdistribution <- runModel(mailCombined, modelT)
    lognormal <- runModel(mailCombined, modelL)
    exponential <- runModel(mailCombined, modelE)
    simulates <- nrow(mailCombined)

    # create a distribution for our check times
    checkTimes <- c(mailCombined[mailCombined$Time1>0,]$Time1, mailCombined[mailCombined$Time2<1440,]$Time2)
    meanCheck <- mean(checkTimes)
    sdCheck <- sd(checkTimes)

    results <- list()
    n <- 0

    while (n<samples) {
       newData <- NULL
       type <- sample(c("normal", "t", "lognormal", "exponential"), size=1)
       newData <- switch(type,
           "normal"      = { simulateMailboxMulti(simulates, meanCheck, sdCheck, normal); },
           "t"           = { simulateMailboxMulti(simulates, meanCheck, sdCheck, tdistribution); },
           "lognormal"   = { simulateMailboxMulti(simulates, meanCheck, sdCheck, lognormal); },
           "exponential" = { simulateMailboxMulti(simulates, meanCheck, sdCheck, exponential); } )

         # see if summary of the new data is within tolerance error<1 of the real data,
         # and if it is, save which distribution generated it:
         if (abs(mean(newData$Time1) - mean(mail$Time1)) < 1 &&
             abs(mean(newData$Time2) - mean(mail$Time2)) < 1)
           { results <- list(c(Type=type), results); n <- n+1; }
sims <- mail_abc_bmc(20000)
results <- matrix(unlist(sims), ncol=1, byrow=TRUE)
summary(as.data.frame(results))
##            V1
##  normal: 9904
##  t     :10096

As suggested by the crossvalidation, the exponential and lognormal models have low posterior probabilities and can be ignored, as they make up none of the samples. (If we really wanted to estimate their probabilities more precisely than ~1/20000, we would need to increase the ABC sampling or loosen the tolerances, and then a handful of exponentials & log-normals would get accepted.) And the t-distribution is the same fraction of the samples as the normal: 50.1% to 49.9% (which, given the limited sampling, means they're effectively identical). The Bayes factors for each pair are the ratios, so we could say that there's a BF of ~1 in favor of normal vs t, or no evidence; but tremendous evidence compared with the exponential or lognormal.

Bayesian Model Averaging

If we started with uniform priors, then the exponential & lognormal will go to ~0, and the posterior probabilities of normal/t get promoted to ~50% each. This means that Bayesian model averaging becomes very simple: we just lump together equally the posterior predictive samples from the two models and work with that.

normal <- runModel(mailCombined, modelN)
tdistribution <- runModel(mailCombined, modelT)
posteriorTimes <- c(normal, tdistribution)

Decision theory

One oft-mentioned advantage of Bayesian approaches, and subjective Bayesian probability in particular, is that it plugs immediately into economic or decision-theory approaches to making choices under uncertainty, so we don't stop at merely providing an estimate of some parameter but can continue onwards to reach a decision. (See for example Applied Statistical Decision Theory, Raiffa & Schlaifer 1961, and the more recent Introduction to Statistical Decision Theory, Pratt et al 1995; an applied example of going from Bayesian multilevel models to optimal decisions is "Bayesian prediction of mean indoor radon concentrations for Minnesota counties".)

Since the motivation for estimating mail-delivery time is implicitly a decision-theoretic problem ("at what time should I decide to check the mail?"), there's no reason to settle for just a mean or credible interval for the delivery time.

Optimal mail checking time

With a posterior distribution of delivery times (y.new) we can try to find an optimal time to check, assuming we can define a loss function giving the cost of checking at each possible time. The loss here can be considered in units of time; if I didn't care about the delay between the package being delivered and me getting my hands on it, I would simply check at 2PM and have done with it.

The usual stopping point in an analysis like this would involve the mean, least-squares, and squared-error loss. There's a lot of good things to say about that as a default. But as an algorithm for deciding when to check the mail, squaring errors and averaging them seems like a strange way to evaluate mistakes: is it really equally bad to check the mailbox 5 minutes before the mail and 5 minutes afterwards? Is checking 10 minutes after really more than twice as bad as checking 5 minutes after? I would say "no" in both cases: it's much better to check 5 minutes after rather than 5 minutes before, since then the package will be there and I won't have to make another hike out to the mailbox; and checking 10 minutes after seems only twice as bad to me as checking 5 minutes after. So the loss function itself is not any of the usual suspects: squared loss, absolute loss, and 0-1 loss don't correspond to my process of checking mail. I need to define my own loss function, expressing the real costs and gains of this situation.

Defining a loss function

Expecting a package wears on my mind and in some cases I'd like to start using the contents ASAP; I see a delay of 10 minutes as being twice as bad as 5 minutes and not 4 times as bad, so this part is an absolute loss. (If it arrives at 11AM and I check at 11:30AM, then the loss due to the delay is 30.) Besides the delay, there's also the cost of walking all the way down to the mailbox and back, which due to the long driveway is around 10 minutes, but it's mostly unshaded and hot out and an interruption, so the loss is much greater than that; introspecting, I would be willing to wait at least 60 minutes to save one roundtrip, so I will define the cost of a mail check at 60 'minutes'. So suppose I check at 10:30AM, the mail comes at 11AM, and I check again at 11:30AM and find it; then the cost is 60+60+(11:30AM-11:00AM) = 150, while if I had checked at 11:20AM then the cost is better and smaller at just 60+(11:20AM-11:00AM) = 80.

OK, so if the package is there, that's easy, but what if I walk out at 11:20AM and it's not there? In a proper application, if I check once, I would then update my posterior and run the loss function again to decide when to check next, but this is impracticable for daily usage; in reality, if the first check didn't turn up the package, I would give up in frustration & disgust and not come back until ~1PM (780), when the package would definitely have arrived. Then I'd incur a loss of two walks and possibly a long wait until 1PM.

All these considerations give me a weird-looking but nevertheless realistic loss function: if the package is delivered at time d and we check at a particular time t, then if t > d the package has arrived and we incur a total loss of 60 + (t - d); otherwise, we check back a second and final time at 1PM (780), incurring a total loss of (2 * 60) + (780 - t).
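The two branches can be checked numerically (a Python sketch of the same loss; note that under this formalized rule the fallback second check is always at 1PM, so the 10:30AM anecdote above, where I rechecked at 11:30AM instead, costs more here):

```python
WALK_COST = 60      # one mailbox round-trip, in equivalent minutes
LAST_RESORT = 780   # 1PM: the guaranteed-delivery fallback check

def check_loss(t, delivered):
    """Loss of a single check at time t given the true delivery time."""
    if delivered < t:                     # mail is there: one walk + the delay
        return WALK_COST + (t - delivered)
    else:                                 # not there: two walks + wait until 1PM
        return 2 * WALK_COST + (LAST_RESORT - t)

# times in minutes-since-midnight:
print(check_loss(680, 660))  # check 11:20AM, mail at 11AM: 60 + 20 -> 80
print(check_loss(630, 660))  # check 10:30AM, mail at 11AM: 120 + (780-630) -> 270
```

Averaging this loss over posterior delivery-time samples, as the next section does in R, gives the expected loss of checking at any candidate time.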

Finding the optimum

Having figured that out, we run the loss function on each sample from the posterior, averaging over all weighted possible delivery times, and find what time of day minimizes the loss:

lossFunction <- function(t, walk_cost, predictions, lastResortT) {
    mean(sapply(predictions,
                function(delivered) { if (delivered<t) { return(walk_cost   + (t - delivered)); } else
                                                       { return(2*walk_cost + (lastResortT - t) ); }}))
}
losses <- sapply(c(0:1440), function (tm) { lossFunction(tm, 60, posteriorTimes, 780);})

png(file="~/wiki/images/maildelivery/losses.png", width = 800, height = 700)
timeLimits <- seq(from=10*60, to=13*60, by=5)
qplot(0:1440, losses) + scale_x_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock)) +
  xlab("Action") + coord_cartesian(xlim = range(timeLimits), ylim=c(1,300))

which.min(losses)
# [1] 701
toClock(which.min(losses))
## [1] "11:41AM"
Expected losses in terms of waiting time for checking the mail at particular times

And that gives the final answer: I should check for packages around 11:40AM.

Total costs

We can compare the cost of check times. If I check 10 minutes later than the current estimated optimal time, then the loss each time in pseudo-minutes is:

minIndex <- which.min(losses)
losses[minIndex]
# [1] 108.5026282
losses[minIndex+10]
# [1] 110.1791093
losses[minIndex+10] - losses[minIndex]
# [1] 1.676481109

With average losses per check time, an estimate of the number of check times per year, and a discount rate, we can derive a net present value of the total loss; I usually use the perpetuity approximation (annual amount divided by ln(1.05)) and a discount rate of 5% annually, but in this case assuming an indefinite horizon is wrong: the delivery time will change and the problem will reset as soon as the next mailman begins this route, which has already happened once in the 4 years thus far, so I instead assume that any estimate is worthless after 3 years (the mailman will have changed, the route or USPS will have changed, I may not even be living in the same place in 3 years). So if I check for a package 30 times a year (which is roughly as much as I did in 2014 & am on track to do in 2015), and I do so suboptimally each time and incur a loss of -2, then the total loss is equivalent to:

netPresentValue <- function(p) { ((p / (1 + 0.05)^1) + (p / (1 + 0.05)^2) + (p / (1 + 0.05)^3)) * 30  }
netPresentValue(-2)
## [1] -163.3948818

Optimal data-sampling

Reinforcement learning

Given a loss function, we might be able to optimize our data-sampling by balancing exploration and exploitation using a Bayesian approach to the multi-armed bandit problem such as Thompson sampling/probability-sampling/probability-matching: different check-times are 'arms', the payoff is given by the loss, and we draw a check time from the current posterior, check at that time, and then with the new data update the posterior for the next day.
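The loop just described can be sketched as a toy simulation (Python, not the author's implementation; the known standard deviation, grid prior, and true mean are all assumptions for illustration): keep a grid posterior over the mean delivery time, each day draw one delivery time from the posterior and check then, observe only the censoring indicator, and update.

```python
import math, random

random.seed(4)
SD = 30                      # assume the daily spread is known
TRUE_MEAN = 660              # unknown-to-us true mean delivery time (11AM)

def phi(z):                  # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# grid posterior over the mean delivery time, starting from a flat prior
grid = list(range(540, 781))
weights = [1.0] * len(grid)

for day in range(200):
    # Thompson step: draw one mean from the posterior, then one plausible delivery time
    mu = random.choices(grid, weights)[0]
    check_time = round(random.gauss(mu, SD))
    # observe only whether the mail was already there, not the delivery time itself
    delivered = random.gauss(TRUE_MEAN, SD) < check_time
    # Bernoulli update: P(delivered by t | mean m) = Phi((t - m)/SD)
    for i, m in enumerate(grid):
        p = phi((check_time - m) / SD)
        weights[i] *= p if delivered else (1 - p)
    total = sum(weights)
    weights = [w / total for w in weights]

posterior_mean = sum(m * w for m, w in zip(grid, weights))
print(round(posterior_mean))   # should home in on ~660
```

Because the check times are themselves draws from the posterior, the loop automatically spends more checks near plausible delivery times as the posterior sharpens: exploration decays into exploitation.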

I have not implemented Thompson sampling for this problem: while probability sampling is conceptually simple, intuitive, has no hyperparameters to tweak, and is optimal in a number of ways, implementing it would have required more discipline than I cared for, so I didn't use it.

Expected information gains

Ignoring indefinite online learning, it is possible to optimize data sampling by estimating what parameter value is predicted to yield the most information, in the sense of entropy of distributions, giving Expected Information Gains (or "expected gain in Shannon information", or "expected Kullback-Leibler discrepancy"). The normal distribution has a short entropy definition (0.5 * ln(2 * pi * e * sd^2)) based on its standard deviation (since the mean is just a fixed offset, while it is the standard deviation that defines how variable/unpredictable samples can be and hence how many bits it takes to encode them), but we don't know the exact normal distribution, only posterior distributions including uncertainty. The distribution of the posterior predictive distribution for a normal distribution is, I am told, a t-distribution (source: A First Course in Bayesian Statistical Methods by Peter Hoff), which has the more complex entropy (using the digamma function psi & the beta function B) based on its degrees-of-freedom term v and its scale term s, which goes: ((v+1)/2) * (psi((1+v)/2) - psi(v/2)) + ln(sqrt(v) * B(v/2, 1/2)) + ln(s).
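That t-distribution entropy expression can be sanity-checked numerically: at v=1 it should reduce to the Cauchy entropy ln(4*pi*s), and for large v it should approach the normal entropy 0.5*ln(2*pi*e*s^2). A stdlib-only Python sketch (the digamma approximation here is my own quick recurrence-plus-asymptotic-series version, not a library routine):

```python
import math

def digamma(x):
    # push x above 6 with the recurrence psi(x) = psi(x+1) - 1/x,
    # then apply the asymptotic series
    result = 0.0
    while x < 6:
        result -= 1 / x
        x += 1
    return result + math.log(x) - 1/(2*x) - 1/(12*x**2) + 1/(120*x**4) - 1/(252*x**6)

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def t_entropy(v, s):
    """Differential entropy (in nats) of a location-scale Student's t."""
    return ((v + 1) / 2) * (digamma((v + 1) / 2) - digamma(v / 2)) \
           + math.log(math.sqrt(v)) + log_beta(v / 2, 1 / 2) \
           + math.log(s)

print(t_entropy(1, 1), math.log(4 * math.pi))                    # Cauchy limit
print(t_entropy(1e6, 1), 0.5 * math.log(2 * math.pi * math.e))   # normal limit
```

This mirrors the R tEntropy function used below.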

We can estimate the entropy of a posterior predictive distribution using the posterior samples with a few approaches:

  1. estimate the probability density function (giving the probability at each point), and then the entropy of each sample is given by the classic formula converting a probability into the bits necessary to encode it: -log(p)

     For example, our posterior predictive samples could be converted into a probability density function for a particular time t in 600-800 as P(t), the fraction of samples representing that time. (We'd probably round the samples, implicitly binning them into 200 or so bins.) Then each sample can be converted into the density at its point, the densities logged and negated, then averaged, to get the entropy of the whole posterior predictive sample and hence the distribution itself.

  2. treat the posterior predictive samples as a t-distribution, estimate their v & s, and convert those into an entropy.

     This is less general, but easier to implement.
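The first, plug-in, approach can be sketched directly (a Python illustration on synthetic normal samples, so the estimate can be compared against the closed-form normal entropy 0.5*ln(2*pi*e*sigma^2); bin width and sample count are arbitrary choices):

```python
import math, random
from collections import Counter

random.seed(5)
sigma = 30
samples = [random.gauss(660, sigma) for _ in range(100000)]

# bin to the nearest minute: with width-1 bins, the bin probability
# doubles as the density estimate at that point
n = len(samples)
counts = Counter(round(s) for s in samples)

# entropy = average over samples of -log(estimated density at that sample)
entropy_est = sum(-math.log(counts[round(s)] / n) for s in samples) / n

# closed-form differential entropy of a normal, for comparison:
entropy_true = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(round(entropy_est, 3), round(entropy_true, 3))
```

The two agree closely here; the second (parametric) approach is what the R code below implements.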

Once we can convert a set of posterior predictive samples into an entropy number, this follows the usual simulation/maximization approach: calculate the current entropy; then, looping over all possible actions/data-to-collect, draw one sample from the posterior, compare it to the action to get a datapoint, update the model, and calculate a new entropy; and return the difference between the old entropy and the new entropy (repeated a number of times so one can take the mean and get a good approximation for each action). Whichever action would lower the entropy the most (have the biggest gain) is then the optimal action to take, and we make an observation doing that. An implementation:

tEntropy <- function(v, s) { ((v+1)/2) * (digamma((v+1)/2) - digamma(v/2)) +
                             log(sqrt(v) * beta(v/2, 1/2)) +
                             log(s) }
entropyST <- function(pps) { fit <- fitdistr(pps, "t");
                             tEntropy(fit$estimate[['df']], fit$estimate[['s']]); }
entropyD <- function(dt) {
   # average together the _t_ & normal 50:50, per the Bayesian model averaging results:
   jN <- runModel(dt, modelN, iter=500)
   jT <- runModel(dt, modelT, iter=500)
   posteriorTimesN <- c(jN, jT)

   entropy <- entropyST(posteriorTimesN)
   return(entropy)
}

actions <- c(600:728)
oldEntropy <- entropyST(posteriorTimes)


entropyAction <- function(a, iter=10) {
 df <- data.frame()
 for (i in 1:iter) {
    deliveryTime <- sample(posteriorTimes, size=1)
    if (deliveryTime<a) { newData <- rbind(mail, data.frame(Time1=0, Time2=a)); } else
                        { newData <- rbind(mail, data.frame(Time1=a, Time2=1440)); }
    newEntropy <- entropyD(newData)
    gain <- oldEntropy - newEntropy
    print(data.frame(Action=a, EntropyDecrease=gain))
    df <- rbind(df, data.frame(Action=a, EntropyDecrease=gain))
 }
 return(df)
}
ei <- ldply(mclapply(actions, function(a) { entropyAction(a) }))

eimean <- aggregate(EntropyDecrease ~ Action, mean, data=ei)
eimean[which.max(eimean$EntropyDecrease),]
##      Action EntropyDecrease
## 53      652   0.01161258411
toClock(652)
## [1] "10:52"

Interested in how the maximal expected-information time changed, I began checking at the optimal times in late September, checking at:

  1. 12PM
  2. 10AM
  3. 12:08PM
  4. 10:59AM
  5. 10:57AM
  6. 10:52AM

Optimal sample-size: Value of Information metrics

If we are not optimizing the kind of data collected, we can optimize how much data we collect by periodically calculating how much additional data could improve our estimates and how much this improvement is worth on average, especially compared to how much that additional data would cost.

Expected values of various kinds of information show up frequently in Bayesian decision theory, and are valuable for linking questions of how much & what data to collect to real-world outcomes (not just money but lives and welfare; a review; an Alzheimer's example).


The first kind is the "expected value of perfect information" (EVPI): if I estimated the optimal time as t=688 and it was actually t=698, then I could be suffering a penalty of 163 minutes and I should be willing to pay somewhere up to that (based on how certain I am it's t=688) to learn the truth and instead start checking the mail at t=698. The value of perfect information is simply the difference between the current estimate and whatever you hypothesize the true time is.

The expected value or EVPI is the cost of the current estimate of the optimum versus a possible alternative time, where the alternatives are weighted by probabilities (if the mail really comes at 9AM then that implies a big loss if you foolishly check at 11:30AM, but the probability of such a big loss is almost zero); the probabilities come, as usual, from the posterior distribution of times, which is approximated as a big batch of samples:
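As a cross-check of the logic, EVPI can be computed as the expected regret of the fixed decision: for each posterior delivery time, compare the loss of checking at the chosen time against the loss of the best response had we known that delivery time. A small self-contained Python sketch (same two-branch loss as before; the posterior and grid are hypothetical, and this is a per-check figure in 'minutes', not the NPV-scaled figure above):

```python
import random

random.seed(7)
WALK, LAST = 60, 780   # walk cost and the 1PM fallback

def loss(t, d):
    """One walk + delay if the mail is there, else two walks + wait until 1PM."""
    return WALK + (t - d) if d < t else 2 * WALK + (LAST - t)

# hypothetical posterior over delivery times, ~11AM +/- 30min
posterior = [random.gauss(660, 30) for _ in range(5000)]
times = range(540, 781, 2)

# current best decision: the check time minimizing posterior expected loss
t_star = min(times, key=lambda t: sum(loss(t, d) for d in posterior))

# EVPI: expected regret of t_star versus the best response to each possible truth
evpi = sum(loss(t_star, d) - min(loss(t, d) for t in times)
           for d in posterior) / len(posterior)
print(t_star, round(evpi, 1))
```

Most of the regret comes from the second branch: knowing the delivery time exactly would let us always check just after it and pay only one walk.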

mean(sapply(round(posteriorTimes), function(time) { netPresentValue((losses[time] - min(losses))) } ))
# [1] 5087.290247

Perfect information is not available at any cost, however.3 All I can do is go out to the mailbox n times, and as we saw in the simulation, diminishing returns always happens and at some point the predictive intervals stop changing noticeably. So EVPI represents an upper bound on how much any lesser amount of information could be worth, but doesn't tell us how much we are willing to pay for imperfect information.


The next kind is the "expected value of sample information" (EVSI): what is the value of collecting one more additional datapoint, which can help pin down the optimal time a little more precisely and reduce the risk of loss from picking a bad time? EVSI can be defined as "expected value of best decision with some additional sample information" minus "expected value of best current decision". More specifically: many times over, create a new datapoint, create an updated posterior distribution incorporating it, run the loss function again & calculate a new optimal check time, and compute the improvement of the new check time's NPV over the old check time's NPV to estimate a possible EVSI; then take the mean of those EVSIs.

This tells us the benefit of that datapoint, and then we can subtract the cost of collecting one more datapoint; if it's still positive, we want to collect more data, but if it's negative (the gain from the estimate improved by one more datapoint is less than a datapoint would cost to collect), then we have hit diminishing returns and it may be time to stop collecting data.
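The EVSI recipe can be sketched under strong simplifications (a Python toy, not the analysis below: exact delivery-time observations instead of censored ones, a conjugate normal update of the mean with known spread, and made-up numbers throughout):

```python
import random, statistics

random.seed(8)
WALK, LAST, SD = 60, 780, 30   # walk cost, 1PM fallback, assumed-known daily spread

def loss(t, d):
    """Two-branch check loss from the text."""
    return WALK + (t - d) if d < t else 2 * WALK + (LAST - t)

def best_time(mu, n_draws=500):
    """Check time minimizing expected loss against a N(mu, SD) delivery time."""
    draws = [random.gauss(mu, SD) for _ in range(n_draws)]
    return min(range(540, 781, 5), key=lambda t: sum(loss(t, d) for d in draws))

# current posterior over the mean delivery time: N(m0, s0) (hypothetical)
m0, s0 = 665, 10
t_now = best_time(m0)

gains = []
for _ in range(20):
    # simulate one more *exact* delivery-time observation
    mu_true = random.gauss(m0, s0)
    x = random.gauss(mu_true, SD)
    # conjugate (precision-weighted) normal update of the posterior mean
    m1 = (m0 / s0**2 + x / SD**2) / (1 / s0**2 + 1 / SD**2)
    t_new = best_time(m1)
    # value of the datapoint: paired expected-loss improvement under the truth
    draws = [random.gauss(mu_true, SD) for _ in range(2000)]
    gains.append(sum(loss(t_now, d) - loss(t_new, d) for d in draws) / len(draws))

print(round(statistics.mean(gains), 2))   # per-check EVSI in 'minutes'
```

Multiplying a per-check EVSI by the discounted number of future checks, and subtracting the walk cost of the observation, gives the stop/continue criterion described above.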

We could do this for one datapoint to decide whether to stop. But we could also go back and look at how the EVSI & profit shrank during the original collection of the mail data.

data <- mailCombined
sampleValues <- data.frame(N=NULL, newOptimum=NULL, newOptimumLoss=NULL,
                           sampleValue=NULL, sampleValueProfit=NULL)
for (n in seq(from=0, to=(nrow(data)+10))) {
 evsis <- replicate(10, {
  # if n is more than we collected, bootstrap hypothetical new data; otherwise,
  # just take that prefix and pretend we are doing a sequential trial where we
  # have only collected the first n observations thus far
  if (n > nrow(data)) { newData <- rbind(data,
                                         data[sample(1:nrow(data), n - nrow(data) , replace=TRUE),]) } else {
                        newData <- data[1:n,] }

  kEVSIN <- runModel(newData, modelN)
  kEVSIT <- runModel(newData, modelT)
  posteriorTimesEVSI <- c(kEVSIN, kEVSIT) # model averaging

  lossesEVSI <- sapply(c(0:1440), function (tm) { lossFunction(tm, 60, posteriorTimesEVSI, 780);})

  # compare to the estimate based on priors if no data yet:
  if (n==0) { oldOptimum <- min(lossesEVSI)  } else {

              oldOptimum <- lossesEVSI[sampleValues[sampleValues$N==0,]$newOptimum] }
  newOptimum <- min(lossesEVSI)
  sampleValue <- netPresentValue(newOptimum - oldOptimum)
  sampleValueProfit <- sampleValue + (n*60)

  return(list(N=n, newOptimum=which.min(lossesEVSI), newOptimumLoss=newOptimum,
              sampleValue=sampleValue, sampleValueProfit=sampleValueProfit))
  })

 sampleValues <- rbind(sampleValues,
                       data.frame(N=n,
                                  newOptimum=mean(unlist(evsis["newOptimum",])),
                                  newOptimumLoss=mean(unlist(evsis["newOptimumLoss",])),
                                  sampleValue=mean(unlist(evsis["sampleValue",])),
                                  sampleValueProfit=mean(unlist(evsis["sampleValueProfit",]))))
}
sampleValues
##     N newOptimum newOptimumLoss    sampleValue sampleValueProfit
## 1   0    712.600    126.6888958    0.00000000      0.0000000000
## 2   1    712.575    126.7579219  -11.93854006     48.0614599416
## 3   2    719.980    122.3724436 -121.31501738     -1.3150173784
## 4   3    708.150    126.9706051  -32.03379518    147.9662048193
## 5   4    701.790    127.9095775 -143.35714066     96.6428593445
## 6   5    696.695    128.8701531 -299.02233338      0.9776666227
## 7   6    692.440    129.8191319 -466.05452337   -106.0545233747
## 8   7    688.695    132.0807898 -633.09206632   -213.0920663197
## 9   8    690.765    129.5727029 -542.78513810    -62.7851380951
## 10  9    687.660    132.0364916 -681.32564291   -141.3256429125
## 11 10    689.210    130.2163274 -597.48233962      2.5176603795
## 12 11    686.395    130.5338326 -758.15486682    -98.1548668161
## 13 12    687.455    128.6273025 -695.28046012     24.7195398753
## 14 13    691.525    129.2092477 -496.05481084    283.9451891562
## 15 14    695.410    129.2232888 -347.92312703    492.0768729651
## 16 15    693.405    128.1591433 -433.22087565    466.7791243520
## 17 16    693.650    127.0530983 -436.23463412    523.7653658752
## 18 17    693.735    125.5236645 -444.81276457    575.1872354295
## 19 18    695.725    125.3049938 -355.24354368    724.7564563236
## 20 19    693.155    123.7840402 -465.57031668    674.4296833227
## 21 20    691.340    122.2890257 -573.81903972    626.1809602784
## 22 21    695.460    124.4854690 -373.66455168    886.3354483178
## 23 22    693.835    123.3286284 -453.72808090    866.2719190994
## 24 23    692.890    122.1550798 -511.31873755    868.6812624468
## 25 24    691.585    120.6627142 -594.94760901    845.0523909901
## 26 25    689.930    119.8993341 -688.68789781    811.3121021938
## 27 26    690.845    118.8742591 -644.53601191    915.4639880893
## 28 27    693.915    120.5363205 -480.27587034   1139.7241296593
## 29 28    695.455    120.5722357 -410.57354380   1269.4264562033
## 30 29    693.720    119.8216438 -491.01196669   1248.9880333079
## 31 30    692.560    118.5479051 -566.54877290   1233.4512271013
## 32 31    691.265    117.4006005 -642.81873824   1217.1812617638
## 33 32    694.800    120.2531077 -443.02993834   1476.9700616613
## 34 33    694.500    122.6974736 -444.15880135   1535.8411986539
## 35 34    693.200    121.6848098 -506.60101893   1533.3989810750
## 36 35    695.640    123.1766613 -377.44505442   1722.5549455769
## 37 36    695.055    125.0534433 -389.10082564   1770.8991743593
## 38 37    694.210    124.1294044 -439.58662308   1780.4133769227
## 39 38    694.610    123.0556714 -424.08194842   1855.9180515813
## 40 39    694.310    124.7484991 -436.35444597   1903.6455540347
## 41 40    694.505    123.4282826 -429.85235414   1970.1476458589
## 42 41    694.280    122.2800503 -437.52664626   2022.4733537369
## 43 42    695.660    121.9807264 -395.84245846   2124.1575415363
## 44 43    697.795    123.4574969 -298.08161577   2281.9183842337
## 45 44    696.745    122.5201770 -345.76139793   2294.2386020661
## 46 45    695.845    121.6540080 -389.60085022   2310.3991497771
## 47 46    695.090    120.9569570 -436.42100092   2323.5789990833
## 48 47    693.990    120.5449101 -483.41266155   2336.5873384516
## 49 48    694.305    120.7051254 -487.09162861   2392.9083713904
## 50 49    693.840    120.1719436 -499.45106193   2440.5489380720
## 51 50    693.940    119.9749612 -504.81034896   2495.1896510352
## 52 51    693.960    119.9420246 -500.28329793   2559.7167020736
## 53 52    693.595    119.7075089 -512.73745021   2607.2625497905
## 54 53    693.875    119.4981456 -515.01771496   2664.9822850449
## 55 54    693.505    119.1947709 -524.05689719   2715.9431028079
## 56 55    693.900    119.5021316 -517.24622527   2782.7537747290
## 57 56    693.715    119.1529502 -536.63625754   2823.3637424644
## 58 57    693.530    118.7091686 -531.78528869   2888.2147113139
## 59 58      703.2    110.3450586 -237.308410317  3242.691589683
## 60 59      702.8    110.1431462 -260.940058947  3279.059941053
## 61 60      701.8    109.8245030 -292.138648235  3307.861351765
## 62 61      702.0    109.6607867 -313.445660084  3346.554339916
## 63 62      701.7    109.9572132 -324.525787515  3395.474212485
## 64 63      701.0    109.5939801 -353.833817911  3426.166182089
## 65 64      700.9    109.7948883 -353.210103148  3486.789896852
## 66 65      701.3    109.5571512 -343.780758960  3556.219241040
## 67 66      701.2    109.5838684 -310.806893360  3649.193106640
## 68 67      701.5    109.4677088 -316.910308982  3703.089691018
## 69 68      700.9    109.3539013 -326.340338587  3753.659661413
## 70 69      702.1    109.2574961 -317.820011152  3822.179988848
## 71 70      701.2    108.8563787 -351.667974044  3848.332025956
## 72 71      701.2    108.9809869 -339.661339561  3920.338660439
## 73 72      701.5    108.9430690 -313.603261282  4006.396738718
## 74 73      701.6    108.6864327 -324.919641986  4055.080358014
## 75 74      701.4    108.7176209 -308.707536711  4131.292463289
## 76 75      703.1    108.4620179 -304.882851775  4195.117148225
## 77 76      701.7    108.4129922 -303.134783733  4256.865216267
## 78 77      702.2    108.2194238 -304.295347052  4315.704652948
## 79 78      701.4    107.7256569 -334.733360207  4345.266639793
## 80 79      700.4    108.1555870 -340.454520356  4399.545479644
## 81 80      701.8    108.1089954 -315.865705672  4484.134294328
## 82 81      701.7    107.8976325 -332.239957494  4527.760042506
## 83 82      701.3    108.1263477 -318.440927654  4601.559072346
## 84 83      701.7    107.9807588 -341.667394085  4638.332605915
## 85 84      701.6    107.9223031 -332.374354544  4707.625645456
## 86 85      702.3    107.7032516 -329.594428046  4770.405571954
## 87 86      701.2    107.7025934 -323.337539932  4836.662460068
## 88 87      702.6    107.7341694 -315.319855307  4904.680144693
## 89 88      702.7    107.5978394 -318.888777193  4961.111222807
## 90 89      701.7    107.6826961 -329.601897440  5010.398102560
## 91 90      701.2    107.2913853 -352.893246270  5047.106753730
## 92 91      702.6    107.9162503 -298.859548680  5161.140451320
## 93 92      701.4    107.3265408 -345.735043576  5174.264956424
## 94 93      702.9    107.7535692 -307.027575009  5272.972424991
## 95 94      700.7    106.9934449 -380.611482059  5259.388517941
## 96 95      702.2    108.2833572 -318.195906366  5381.804093634

qplot(sampleValues$N, round(sampleValues$newOptimum))
Evolution of the optimal time to check the mail, based on the posterior updated after each observation.

This would suggest that diminishing returns were reached early on (possibly the informative priors had done more work than I appreciated).

The graph shows the estimated optimal mail-check time (700=11:40AM etc.) as each datapoint was collected. You can see that with the priors I set, they were biased towards too-late mail times, but as more data comes in, the new optimal check-time drifts steadily downwards until the bias of the prior has been neutralized; it then begins to follow a random walk around 695, with the final estimated mail-check time being ~693/11:33AM. And at around n=48, diminishing returns has set in so hard that the decision actually stops changing entirely; it's all small fluctuations around 693. When you include the cost of gathering data, the analysis says to collect up to n=13 and then stop (after that, the loss does not decrease but begins to increase, because each datapoint inflates costs by +60, and we want to minimize loss); at n=13, one decides to check at 687/11:27AM, which is close to the 693 which was estimated with 4x more data (!).

So this is interesting: by using informative priors and then taking a decision-theoretic approach, in this case I can make high-quality decisions on surprisingly little data; I could have halted data gathering much earlier. (Additional data does have benefits in letting me verify the diminishing returns by seeing how the optimal decision hardly changes, can be used for more accurate model averaging, and could be used to model time-varying trends like the large increases in late deliveries during the holidays. But ultimately I have to agree with the EVSI: I didn't need to collect all the data I did.)


Lessons learned:

  1. ABC is as simple to implement and as incredibly general as promised; with ABC, it's defining good summary statistics & getting computational tractability which you need to worry about (besides making it tedious to use, long runtimes also interfere with writing a correct implementation in the first place)

  2. while powerful, JAGS models can be confusing to write; it's easy to lose track of what the structure of your model is and write the wrong thing, and the use of precision rather than standard-deviation adds boilerplate and makes it even easier to get lost in a welter of variables & distributions, in addition to a fair amount of boilerplate in running the JAGS code at all—leading to errors where you are not certain whether you have a conceptual problem or your boilerplate has drifted out of date

  3. the combination of ggplot2 and animation bade fair to make animations as easy as devising a ggplot2 image, but due to some ugly interactions between them (ggplot2 interacts with the R top-level scope/environment in a way which breaks when it's called inside a function, and animation has some subtle bug relating to deciding how long to delay frames in an animation which I couldn't figure out), I lost hours to getting it to work at all.

  4. my original prior estimate of when the mailman comes was accurate, but I seem to have substantially underestimated the standard deviation, and there turned out to be considerable model uncertainty about thin tails (normal) vs fatter tails (t); despite that, providing informative priors meant that the predictions from the JAGS model were sensible from the start (mostly from the small values for the SD, as the mean time was easily estimated from the intervals) and it made good use of the data.

  5. the variability of mail delivery times is high enough that the prediction intervals are inherently wide; after relatively few datapoints, whether interval or exact, diminishing returns has set in.

  6. Subjective Bayesianism & decision theory go together like peanut butter & chocolate.

    It's all just samples and posterior distributions; want to calculate expected information gains to optimize your sampling, or compare to subjective loss to decide in a principled fashion when to stop? No problem. (Instead of ad hoc rules of thumb like going for 80% power or controlling alpha at 0.05 or trial sequential analysis stopping at an arbitrary effect size confidence interval…) Long computations are a small price to pay for approximate answers to the right questions rather than exact answers to the wrong questions.

    The major downside is that Bayesian decision theory is honored mostly in the breach: I found much more discussion than application of it online, and few worked-out examples in R that I could try to compare with my own implementations of the ideas.


Thompson sampling

Given a Bayesian model and a reinforcement-learning setting like checking the mailbox, it seems natural to use Thompson sampling to guide checks and do online learning ("online" here meaning learning 1 data point at a time). I didn't want to reorganize my mornings around such an algorithm, but in reading up on Thompson sampling, I found no source code for any uses outside the stereotypical 3-4 arm binomial bandit, and certainly nothing I could use.

Thompson sampling turns out to be as simple as advertised. Each time a Thompson sampling agent acts, it updates its model based on current data and then simply draws one possible set of parameters at random from its posterior distributions of those parameters; then it finds the optimal action under that particular set of parameters; and executes it.

An implementation of Thompson sampling for this problem, and a test harness to run it on a simulated version of mail-checking to see how it performs in terms of actions, losses, and regret:

model1 <- function() { for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dnorm(mu, tau)
           }

lossPosterior <- function(t, walk_cost, predictions, lastResortT) {
                losses <- sapply(predictions,
                 function(delivered) { if (delivered<t) { return(walk_cost   + (t - delivered)); } else
                                                        { return(2*walk_cost + (lastResortT - t) ); }})
                return(losses) }
posterior_losses <- function(times, actions) {
   df <- data.frame(Actions=actions,
           Loss=sapply(actions, function (tm) { mean(lossPosterior(tm, 60, times, 780));}))
   return(df) }

thompsonSample <- function(dt, model) {
   # model the available data:
   data <- list("dt"=dt, "n"=nrow(dt), "y"=rep(1, nrow(dt)))
   inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(dt,1,mean))) }
   params <- c("mu", "sd")
   j1 <- jags(data, inits, params, model, n.iter=1000, progress.bar="none")

   # 0:1440 is overkill, so let's only look at the plausible region of times to actively check:
   actions <- c(600:800)
   # sample one hypothetical world's set of mean/SD parameter values:
   posteriorMean <- j1$BUGSoutput$sims.list[["mu"]][1]
   posteriorSD <- j1$BUGSoutput$sims.list[["sd"]][1]
   # sample a bunch of data from that hypothetical world:
   posteriorTimes <- rnorm(1000, mean=posteriorMean, sd=posteriorSD)
   # so we can then estimate losses:
   posteriorLosses <- posterior_losses(posteriorTimes, actions)
   # and with expected losses for each action defined, find the optimal action in that world:
   optimalAction <-  posteriorLosses[which.min(posteriorLosses$Loss),]$Actions[1]
   return(optimalAction) }

simulate_mab <- function(n) {
 walk_cost = 60; lastResortT = 780

 # pick a concrete mean/SD for this run, based on the full MCMC estimates as of 2015-09-20:
 mail_mean <- rnorm(1, mean=658.9, sd=4.69)
 mail_sd <- rnorm(1, mean=27.3, sd=8.33)
 # to calculate regret, we need an optimal loss, which we can compare realized losses to:
 posteriorTimes <- rnorm(5000, mean=mail_mean, sd=mail_sd)
 actions <- c(600:800)
 posteriorLosses <- posterior_losses(posteriorTimes, actions)
 optimalLoss <-  min(posteriorLosses$Loss)[1]
 optimalAction <-  posteriorLosses[which.min(posteriorLosses$Loss),]$Actions[1]
 print(paste0("Parameters for this simulation; Mean: ", mail_mean,
              "; SD: ", mail_sd,
              "; minimum average loss: ", optimalLoss,
              "; optimal action: t=", optimalAction))

 # seed data:
 mail3 <- data.frame(Time1=660, Time2=1440, Delivered=NA, Action.optimal=optimalAction, Action=660,
                     Loss=0,  Regret=0, Total.regret=0)

 for (i in 1:n) {
   # today's delivery:
   delivered <- rnorm(1, mean=mail_mean, sd=mail_sd)
   # when should we check?
   t <- thompsonSample(subset(mail3,select=c(Time1,Time2)), model1)
   # discover what our loss is, update the database:

   loss <- if (delivered<t) { walk_cost  + (t - delivered); } else
              { 2*walk_cost + (lastResortT - t); }
   mail3 <- if(delivered<t) {
               rbind(mail3, data.frame(Time1=0, Time2=t, Delivered=delivered,
                                       Action.optimal=optimalAction, Action=t,
                                       Loss=loss, Regret=(loss-optimalLoss),
                                       Total.regret=sum(c(loss,mail3$Regret),na.rm=TRUE))) } else
             { rbind(mail3, data.frame(Time1=t, Time2=1440, Delivered=delivered,
                                       Action.optimal=optimalAction, Action=t,
                                       Loss=loss, Regret=(loss-optimalLoss),
                                        Total.regret=sum(c(loss,mail3$Regret),na.rm=TRUE))) }
  }
  return(mail3) }
run <- simulate_mab(500)

qplot(1:nrow(run), run$Action, xlab="Nth mail check", ylab="Check time",
      color=run$Loss, size=I(5), main="Thompson sampling") +
  stat_smooth() + geom_hline(colour="green", aes(yintercept=run$Action.optimal[1]))
1 simulation run's actions over 500 time-steps, optimal action overlaid in green

In this particular simulation, the Thompson sampling agent starts off choosing check-times too late in the day and gradually tightens its selection of actions until by ~t = 100, it is generally checking near-optimally, but with continued sampling from other check-times which starts to compensate for a little undershooting error.

  1. From the documentation:

    Surv(time, time2, event, type="interval")

    1. time: For interval data, the first argument is the starting time for the interval.
    2. time2: ending time of the interval for interval censored or counting process data only. Intervals are assumed to be open on the left and closed on the right, (start, end]. For counting process data, event indicates whether an event occurred at the end of the interval.
    3. event: The status indicator, normally 0=alive, 1=dead…. For interval censored data, the status indicator is 0=right censored, 1=event at time, 2=left censored, 3=interval censored.

    Interval censored data can be represented in two ways. For the first, use type = "interval" and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-∞, t) for left censored, (t, ∞) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = "interval2". Infinite values can be represented either by actual infinity (Inf) or NA. The second form has proven to be the more useful one.

    …a subject's data for the pair of columns in the dataset (time1, time2) is (te, te) if the event time te is known exactly; (tl, NA) if right censored (where tl is the censoring time); and (tl, tu) if interval censored (where tl is the lower and tu is the upper bound of the interval).
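
    To make this encoding concrete with this post's kind of data: checking at 11AM (t=660) and finding no mail means delivery lies in the interval (660, 1440); finding mail at 11:59AM means delivery in (0, 719); and an exactly-observed delivery at some time is a (t, t) pair. A minimal sketch (the exact-delivery value 655 is hypothetical):

```r
# interval2-style (time1, time2) pairs for three kinds of mail observations:
obs <- data.frame(
  kind  = c("interval (no mail at 660)", "interval (mail by 719)", "exact"),
  time1 = c(660,   0, 655),
  time2 = c(1440, 719, 655))
# with the survival library loaded, one would then construct, eg.:
# Surv(obs$time1, obs$time2, type="interval2")
```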

  2. Emphasis added; "'Not Only Defended But Also Applied': The Perceived Absurdity of Bayesian Inference", Gelman & Robert 2013; Gelman et al 2017.↩︎

  3. Instead of doing estimates, why not buy or make something like a magnetic switch or motion-activated camera on the mailbox? By regularly recording exactly when the mail arrives, this makes estimation much easier, and if wireless-enabled, may obviate the problem entirely. This is challenging, though: the mailbox is so far away that there are no power sockets near it, so it will need to be battery-powered and to last for months (otherwise the hassle of charging it eliminates the convenience), and wireless connections may not be possible that many meters away without excessive power consumption.

    I looked into this, and found a few options:

    • the established niche of "people counter"/"door counter"/"traffic counter" electronic devices, often implemented using an infrared beam which is tripped by objects moving in between an emitter and sensor. They are intended for business use, priced out of my range, and typically are not long-range wireless or battery-powered, and store counts rather than date-stamps.

    • mail checking systems, of which there are two main gadgets intended for usecases like mine (rural houses with extremely long driveways):

      1. spring-powered 'flags' which are latched into the door and pop up when it is opened (presumably by the mailman) and are visible from a long distance; the home-owner then resets the spring when they pick up the mail. One downside is that many involve drilling holes into the mailbox to keep the flag secure long-term against the weather. Example: "Mail Time! Yellow Mailbox Alert Flag for Long Driveways", $16.

        This doesn't work because it still requires me to manually log data (although I would not have to walk all the way to check, it is true). Nor does this resolve the mail problem in general, because one does not always have permission to tamper with the mailbox like that, and it requires cooperation from everyone who uses the mailbox (it's no good if you're the only one who resets the flag and it's usually up).

      2. wireless mailbox alert systems: a radio combined with a switch mounted inside the mailbox, which, on contact being broken, radios an alert. Example: "Dakota Alert 1000 [feet] Series", $50.

        Most of these systems don't claim a range past ~30m, while I need at least 200 meters (and probably much more, because manufacturers are optimistic and my walls are concrete). The example Dakota Alert claims to have the necessary range, but doesn't include any estimates about battery life, is expensive enough I'm not sure it's worth it, and would still require manual recording of times.

      3. build one's own data logger; logging door or window or garage door openings or bathroom occupancy is a common starter project with a Raspberry Pi or Arduino, so it should be possible. Both presented serious problems as I looked into the details and thought about whether I could do it.

        • While a Raspberry Pi Zero costs $5 for the motherboard itself and is relatively easy to program & set up, it draws ~160 milliamps; this is not a problem for most RP uses, where power outlets are available, but is disastrous for remote sensors—a standard cheap 1000-5000 milliamp-hour battery would be drained within a day. Solar panels can be hooked up to an RP to recharge the battery, but that adds much complexity and, between the battery, solar panel, panel->battery adapter, and other wires, would cost >$80 for the parts. A PiJuice Solar would cost ~$80 but doesn't ship until March 2016 (if ever). So a RP would work if I wanted to go as far as setting up a solar-powered RP, spending easily $100+.
        • Arduinos use much less power—around 20 milliamps—and can be reduced to <1 milliamp, in which case, combined with a compatible battery like the $15 Lithium Ion Polymer Battery - 3.7v 2500mAh, power is no longer an issue and the logger could run a week, or months at <1 milliamp. But Arduinos are much more challenging to use than Raspberry Pis—I have no experience in embedded electronics or wiring up such devices or programming in C, and so no confidence I could build it correctly. The $22 Adafruit Data logger (Adalogger) (documentation), based on the Arduino Uno, combines most of the hardware necessary. That still leaves connecting the reed magnetic switch (eg "Seco-Larm SM-226L Garage Door Contacts for Closed Circuits", $11, or a $3.95 magnetic switch), programming the Adalogger, and optimizing power consumption to make recharges not a hassle. This makes it much closer to feasible and I thought seriously about it, but ultimately, I don't think I care enough at the moment about logging mail data to struggle through the learning curve of embedded & electrical engineering knowledge for this particular project.
      4. Wilder ideas include running lasers & mirrors; or running a coaxial cable out to the mailbox, putting a photodiode at one end, and keeping the logger inside near a power outlet to resolve the wireless & power issues. (But coax cable does cost money, leaving it exposed is not a good idea considering the riding lawn mower, and digging up hundreds of meters of ground to bury the coax cable does not sound like fun.)
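
    The battery-life arithmetic behind these comparisons is simple to check: hours of runtime is battery capacity (mAh) divided by average current draw (mA). Assuming the 2500mAh battery mentioned above:

```r
# Rough battery-life estimates for the logger options discussed above,
# using the 2500mAh battery: runtime in days = (capacity / draw) / 24.
battery_mAh <- 2500
days_of_runtime <- function(draw_mA) { (battery_mAh / draw_mA) / 24 }
days_of_runtime(160)   # Raspberry Pi Zero at ~160mA: under a day
days_of_runtime(20)    # stock Arduino at ~20mA: ~5 days
days_of_runtime(1)     # power-optimized Arduino at <1mA: ~3.5 months
```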