When Should I Check The Mail?

Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis, model comparison, optimizing check times with a loss function, and optimal data collection.
statistics, decision-theory, R, survival-analysis, Bayes, tutorial
2015-06-21–2018-02-24 finished certainty: likely importance: 9


Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using the survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).

Consider a question of burning importance: what time does the mailman come, bearing gifts, and when should I check my mail—considering that my mailbox is >110m/>7.5 minutes away, and that I really dislike walking out to it in up to 90F heat for a package or book only to discover the mail hasn’t come yet? No one wants to sit around all morning to spot the exact time the mailman comes. At least, I don’t. We could more easily measure by going out in the morning at a random time to see if the mail has come yet, and then (somehow) estimate. Given a set of data like “2015-06-20 11:00AM: mail has not come yet; 2015-06-21 11:59AM: mail had come”, how can we estimate? This is not a normal setup where we estimate a mean but our data is interestingly messed up: censored or truncated or an interval somehow.

Survival analysis seems like the appropriate paradigm. This is not a simple survival analysis with “right-censoring” where each individual is followed up to a censoring time and the exact time of ‘failure’ is observed. (This would be right-censoring if instead we had gone out to the mailbox early in the morning and sat there waiting for the mailman to record when she came, occasionally getting bored around 10AM or 11AM and wandering off without seeing when the mail comes.) This isn’t “left-censoring” either (for left-censoring, we’d go out to the mailbox late in the morning when the mail might already be there, and if it isn’t, then wait until it does come). I don’t think this is left or right truncation either, since each day data is collected and there are no sampling biases at play. What this is, is interval censoring: when we go out to the mailbox at 11AM and discover the mail is there, we learn that the mail was delivered today sometime in the interval midnight-10:59AM, or if the mail isn’t there, we learn it will be delivered later today sometime during the interval 11:01AM-midnight (hopefully closer to the first end than the second). Interval censoring comes up in biostatistics for situations like periodic checkups for cancer, which does resemble our mail situation.

Inference

Interval-censored data

ML

The R survival library supports the usual right/left-censoring but also interval-censoring. It supports two encodings of intervals, interval and interval2; I use the latter (interval2), whose format works well with both the survival library and also other tools like JAGS. Times are written as minutes since midnight, so they can be handled as positive numbers rather than date-times (ie midnight=0, 11AM=660, noon=720, midnight=1440, etc), and the upper and lower bounds on intervals are 0 and 1440 (so if I check the mail at 660 and it’s there, then the interval is 0-660, and if it’s not, 660-1440).

# Routines to make it easier to work in minutes-since-midnight:
library(lubridate)
fromClock <- function(ts){ (hour(ts)*60) + minute(ts) + (second(ts)/60)}
toClock <- function(t) {
   h <- floor(t/60)
   m <- floor(t - h*60)
   sprintf("%0.2d:%0.2d", h, m) }

set.seed(2015-06-21)
# simulate a scenario in which the mailman tends to come around 11AM (650) and I tend to check around then,
# & generate interval data for each time, bounded by end-of-day/midnight below & above, collecting ~1 month:
simulateMailbox <- function(n, time) {
    deliveryTime <- round(rnorm(n, mean = time, sd = 30))
    checkTime <- round(rnorm(n, mean = time, sd = 20))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,])) }
mailSim <- simulateMailbox(30, 650); mailSim
##        Time1 Time2
## 1    620  1440
## 2      0   651
## 3      0   627
## 4    624  1440
## 5    629  1440
## 6    664  1440
## 7    665  1440
## 8    652  1440
## ...
library(ggplot2)
png(file="~/wiki/images/maildelivery/simulated.png", width = 800, height = 500)
ggplot(mailSim) + geom_segment(aes(x=Time1, xend=Time2, y=1:nrow(mailSim), yend=1:nrow(mailSim))) +
    geom_vline(xintercept=650, color="blue") + ylab("Day") + xlab("Time")
invisible(dev.off())

Inferring the mean time of delivery might sound difficult with such extremely crude data of intervals 700 minutes wide or worse, but plotting the little simulated dataset and marking the true mean time of 650, we see it’s not that bad—the mean time is probably whatever line passes through the most intervals:

The simulated overlapping-intervals data, with the true mean time drawn in blue
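
As a crude check of that “most intervals” intuition (a small sketch added here, not one of the estimators used below), we can count how many of the simulated intervals contain each candidate minute and look at where the coverage peaks; the maximal-coverage region should bracket the true mean of 650:

# Count, for each candidate minute, how many simulated intervals cover it:
candidateTimes <- seq(from=540, to=780, by=1)
coverage <- sapply(candidateTimes, function(t) { sum(mailSim$Time1 <= t & t <= mailSim$Time2) })
# The times with maximal coverage form a plateau rather than a point estimate:
range(candidateTimes[coverage == max(coverage)])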

And also with our simulated dataset, we can see if the standard R survival library and an interval-censored model written in JAGS can recover the 650:

library(survival)
surv <- Surv(mailSim$Time1, mailSim$Time2, type="interval2")
s <- survfit(surv ~ 1, data=mailSim); summary(s)
##  time   n.risk   n.event survival   std.err lower 95% CI upper 95% CI
##  625.5 30.0000 3.33599e+00 0.888800 0.0573976     0.783131     1.000000
##  641.0 26.6640 4.15541e+00 0.750287 0.0790267     0.610339     0.922323
##  647.0 22.5086 6.14506e+00 0.545451 0.0909091     0.393449     0.756178
##  651.5 16.3635 3.84506e-11 0.545451 0.0909091     0.393449     0.756178
##  664.5 16.3635 5.82315e-03 0.545257 0.0909124     0.393258     0.756005
##  675.0 16.3577 1.63577e+01 0.000000       NaN           NA           NA
plot(s)

png(file="~/wiki/images/maildelivery/simulated-survivalcurve.png", width = 800, height = 500)
plot(s)
invisible(dev.off())
sr <- survreg(surv ~ 1, dist="gaussian", data=mailSim); summary(sr)
##                 Value Std. Error        z           p
## (Intercept) 663.08028   9.742171 68.06289 0.00000e+00
## Log(scale)    3.41895   0.438366  7.79931 6.22465e-15
##
## Scale= 30.5374
##
## Gaussian distribution
## Loglik(model)= -15.8   Loglik(intercept only)= -15.8
## Number of Newton-Raphson Iterations: 11
## n= 30
Survival curve (death=delivery)

MCMC

More Bayesianly, we can write an interval-censoring model in JAGS, which gives us the opportunity to use an informative prior about the mean time the mailman comes.

They work normal 9-5 hours as far as I know, so we can rule out anything outside 540-1020. From past experience, I expect the mail to show up not before 10AM (600) and not after 1PM (780), with those extremes being rare and sometime around 11AM (650) being much more common; so not a uniform distribution over 600-780 but a normal one centered on 650, and then somewhat arbitrarily saying that 600-700 represent 3 SDs out from the mean of delivery times to get SD=~30 minutes, so in all, dnorm(650, pow(30, -2)). The SD itself seems to me like it could range anywhere from a few minutes to an hour, but much beyond that is impossible (if the SD was over an hour, then every so often the mailman would have to come at 8AM! and if it was smaller than 10 minutes, then I would never have noticed much variation in the first place).

library(R2jags)
model1 <- function () {
    for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }

           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)

           y.new ~ dnorm(mu, tau)
    }
# y=1 == Event=3 for `Surv`: event is hidden inside interval, not observed/left-/right-censored
data <- list("dt"=mailSim, "n"=nrow(mailSim), "y"=rep(1, nrow(mailSim)))
inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mailSim,1,mean))) }
params <- c("mu", "sd", "y.new")
j1 <- jags(data,inits, params, model1); j1
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## mu       665.055   9.841 647.724 658.092 664.334 670.983 686.264 1.008   280
## sd        40.153  11.592  18.626  30.643  40.813  49.707  59.048 1.047    51
## y.new    664.103  42.908 577.518 637.561 663.011 689.912 749.936 1.002  1300
## deviance   0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

Both approaches’ point-value mean times of 663-665 (11:03-11:05AM) come close to the true simulation value of 650 (11AM), and the prediction interval of 577-749 also sounds right, validating the models. The estimated standard deviation isn’t as accurate, with a wide credible interval, reflecting that it’s a harder parameter to estimate and the estimate is still vague with only n = 30.

With a simulation and JAGS set up, we could also see how the posterior estimates of the mean, 95% CI of the mean, and predictive interval change as each additional datapoint is added:

animatePosteriors <- function (df, filename, upperTrue, lowerTrue) {
  # http://cran.r-project.org/web/packages/animation/index.html
  library(animation)
  saveGIF(
      for(n in 1:nrow(df)){
          data <- list("dt"=df[1:n,], "n"=nrow(df[1:n,]), "y"=rep(1, nrow(df[1:n,])))
          inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(df[1:n,],1,mean))) }
          params <- c("mu","sd", "y.new")
          j1 <- jags(data, inits, params, model1, progress.bar="none")

          lowerMean <- j1$BUGSoutput$summary[c(2),][3]
          medianMean  <- j1$BUGSoutput$mean$mu
          upperMean <- j1$BUGSoutput$summary[c(2),][7]

          lowerPredictive <- j1$BUGSoutput$summary[c(4),][3]
          upperPredictive <- j1$BUGSoutput$summary[c(4),][7]

          ## WARNING: need an environment call for `ggplot` inside a function for local variables like
          ## 'n' to be visible: http://stackoverflow.com/a/29595312/329866
          ## WARNING: & you can't just call qplot due to animation/ggplot2 bug; have to assign & print
          timeLimits <- seq(from=9*60, to=12.8*60, by=15)
          p <- ggplot(df[1:n,], environment = environment()) +
                coord_cartesian(xlim = timeLimits) +
                scale_x_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock)) +
                ylab("Day") + xlab("Time") +
                geom_segment(aes(x=df[1:n,]$Time1, xend=df[1:n,]$Time2, y=1:n, yend=1:n)) +
                geom_vline(xintercept=medianMean, color="blue") +
                geom_vline(xintercept=lowerMean, color="green") +
                geom_vline(xintercept=upperMean, color="green") +
                geom_vline(xintercept=lowerPredictive, color="red") +
                geom_vline(xintercept=upperPredictive, color="red") +
                geom_vline(xintercept=lowerTrue, color="red4") +
                geom_vline(xintercept=upperTrue, color="red4")
          print(p)
          },
      interval = 0.7, ani.width = 800, ani.height=800,
      movie.name = filename) }

simData <- simulateMailbox(200, 650)
confintTrue <- round(qnorm(c(0.025, 0.975), mean=650, sd=30))
lowerPredictiveTrue <- confintTrue[1]
upperPredictiveTrue <- confintTrue[2]
animatePosteriors(simData, "/home/gwern/wiki/images/maildelivery/simulated-inferencesamplebysample.gif", lowerPredictiveTrue, upperPredictiveTrue)
Simulated data: posterior estimates of mail delivery times, evolving sample by sample for 200 datapoints; estimated mean time in blue, 95% CI around the mean time in green, and 95% predictive intervals (red) as to when the delivery is made

Probably the most striking aspect of watching these summaries updated datum by datum, to me, is how the estimate of the mean homes in almost immediately on close to the true value (this isn’t due solely to the informative prior either, as a completely uninformative dunif(0,1440) prior will zoom in within 4 or 5 datapoints as well, after some violent thrashing around). What happens is that the intervals initially look uninformative if the first two or three all turn out to be delivered/non-delivered, and so the mean delivery time could still be anywhere from ~11AM-midnight or vice-versa, but then as soon as even one needle falls the other way, the mean suddenly snaps into tight focus and gets better from there. While I understand this abrupt transition in hindsight (only a tiny subset of values around the overlapping tips of the needles can yield the observed flip-flops of delivery/non-delivery, while a mean time far from the tips would yield a completely consistent dataset of all deliveries/non-deliveries), I didn’t expect this, simply reasoning that “one interval-censored datum seems very uninformative, so it must take many data to yield any sort of decent result for the mean & standard deviation and hence the predictions”. But, even as the mean becomes precisely estimated, the predictive interval—which is what, in the end, we really care about—remains obdurate and broad, because we assume the delivery time is generated by a normal distribution and so the predicted delivery times are the product of not just the mean but the standard deviation as well, and the standard deviation is hard to estimate (a 95% credible interval of 12-51!). This too is predictable in hindsight, since the flip-flopping needles may be sensitive to the mean, but not to the spread of delivery-times; the data would not look much different than it does if the mailman could deliver anywhere from 8AM to 3PM—on early days she delivers a few hours before I check the mailbox around 11AM and on late days she delivers a few hours after.

These considerations also raise questions about statistical power/optimal experiment design: what are the best times to sample from interval-censored data in order to estimate as precisely as possible with a limited budget of samples? I searched for material on interval-censored data but didn’t find anything directly addressing my question. The flip-flops suggest that to estimate the mean, one should sample only at the current estimated mean, which maximizes the probability that there will be a net 50-50 split of delivery/non-delivery; but where should one sample for the SD as well?
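
I did not find an answer, but as a rough simulation sketch of how one might explore the question (my own addition, with hypothetical helper functions simulatePolicy/fitPolicy and arbitrarily chosen policy parameters, not anything from the literature), one can compare the maximum-likelihood standard errors from survreg when checks are concentrated near the true mean versus spread out over several hours:

# Simulate n days under a given check-time policy and return interval-censored data:
simulatePolicy <- function(n, deliveryMean=650, deliverySD=30, checkMean=650, checkSD=5) {
    deliveryTime <- round(rnorm(n, mean=deliveryMean, sd=deliverySD))
    checkTime    <- round(rnorm(n, mean=checkMean,    sd=checkSD))
    delivered    <- checkTime > deliveryTime
    data.frame(Time1 = ifelse(delivered, 0, checkTime),
               Time2 = ifelse(delivered, checkTime, 1440)) }
# Fit the Gaussian interval-censored model & extract the SE of the mean and the estimated scale:
fitPolicy <- function(d) {
    sr <- survreg(Surv(Time1, Time2, type="interval2") ~ 1, dist="gaussian", data=d)
    c(meanSE = sqrt(vcov(sr)[1,1]), scale = sr$scale) }
replicate(5, fitPolicy(simulatePolicy(30, checkSD=5)))   # checks clustered at the mean
replicate(5, fitPolicy(simulatePolicy(30, checkSD=90)))  # checks spread widely over the morning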

mailInterval <- data.frame(
 Date=as.POSIXct(c( "2015-06-20 11:00AM", "2015-06-21 11:06AM", "2015-06-23 11:03AM",
                    "2015-06-24 11:05AM", "2015-06-25 11:00AM", "2015-06-26 10:56AM",
                    "2015-06-27 10:45AM", "2015-06-29 10:31AM", "2015-06-30 10:39AM", "2015-07-01 10:27AM",
                    "2015-07-02 10:47AM", "2015-07-03 10:27AM", "2015-07-04 10:54AM", "2015-07-05 10:55AM",
                    "2015-07-06 11:21AM", "2015-07-07 10:01AM", "2015-07-08 10:20AM", "2015-07-09 10:50AM",
                    "2015-07-10 11:10AM", "2015-07-11 11:12AM", "2015-07-13 11:05AM", "2015-07-14 11:14AM",
                    "2015-07-15 11:40AM", "2015-07-16 11:24AM", "2015-07-17 11:03AM", "2015-07-18 10:46AM",
                    "2015-07-20 11:05AM", "2015-07-21 10:56AM", "2015-07-22 11:00AM", "2015-07-23 11:17AM",
                    "2015-07-24 11:15AM", "2015-07-27 11:11AM", "2015-07-28 10:44AM", "2015-07-29 11:18AM",
                    "2015-07-30 11:08AM", "2015-07-31 10:44AM", "2015-08-01 11:25AM", "2015-08-03 10:45AM",
                    "2015-08-04 10:45AM", "2015-08-05 10:44AM", "2015-08-06 10:33AM", "2015-08-07 10:55AM",
                    "2015-08-10 11:09AM", "2015-08-11 11:16AM", "2015-08-12 11:14AM", "2015-08-13 11:10AM",
                    "2015-08-14 11:02AM", "2015-08-15 11:04AM", "2015-08-18 11:15AM", "2015-08-20 11:20AM",
                    "2015-08-22 11:46AM", "2015-08-23 11:04AM", "2015-08-24 10:56AM", "2015-08-25 10:26AM",
                    "2015-08-26 11:57AM", "2015-08-27 10:15AM", "2015-08-31 12:35PM", "2015-09-01 11:44AM",
                    "2015-09-02 10:54AM", "2015-09-03 10:29AM", "2015-09-04 10:08AM", "2015-09-05 10:45AM",
                    "2015-09-07 10:58AM", "2015-09-08 10:25AM", "2015-09-09 10:29AM", "2015-09-10 10:48AM",
                    "2015-09-16 11:22AM", "2015-09-17 11:06AM", "2015-09-18 11:22AM", "2015-09-19 11:12AM",
                    "2015-09-21 11:52AM", "2015-09-23 12:02PM", "2015-09-24 10:00AM", "2015-09-25 12:08PM",
                    "2015-09-26 10:59AM", "2015-09-28 10:57AM", "2015-09-29 10:52AM" ),
                   "EDT"),
 Delivered=c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
             FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE,
             FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
             FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
             FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
mailInterval$Time <- fromClock(mailInterval$Date)
mail <- with(mailInterval, data.frame(Time1 = ifelse(Delivered, 0, Time),
                           Time2 = ifelse(Delivered,  Time, 1440)))

library(R2jags)
model1 <- function() { for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }

           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)

           y.new ~ dnorm(mu, tau)
           }
data <- list("dt"=mail, "n"=nrow(mail), "y"=rep(1, nrow(mail)))
inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mail,1,mean))) }
params <- c("mu","sd", "y.new")
j1 <- jags(data, inits, params, model1, n.iter=100000); j1
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## mu       658.173   4.468 649.098 655.469 658.156 660.943 666.797 1.001  3000
## sd        27.138   7.976  16.034  21.417  25.459  30.916  48.149 1.001  3000
## y.new    658.098  27.960 601.899 640.912 657.942 675.067 712.985 1.001  3000
## deviance   0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

animatePosteriors(mail, "/home/gwern/wiki/images/maildelivery/real-inferencesamplebysample.gif", 0, 0)
Real data: posterior estimates of mail delivery times, animated sample by sample

ABC

Because JAGS provides an interval-censored distribution in the form of dinterval() with a built-in likelihood, we can use MCMC for inverse inference (reasoning from data to the underlying process). But if it didn’t, I wouldn’t know how to write one down for it and then the MCMC wouldn’t work; but I was able to write a little simulation of how the underlying process of delivery-and-checking works, which, given a set of parameters, spits out simulated results generated by the process, which is probability or forward inference (reasoning from a version of an underlying process to see what it creates). This is a common situation: you can write a good simulation simply by describing how you think something works, but you can’t write a likelihood function.

Approximate Bayesian computation (ABC; exemplified in the fun example “Tiny Data, Approximate Bayesian Computation and the Socks of Karl Broman”; see also Wilkinson’s intro & summary statistics posts) is a remarkably simple and powerful idea which lets us take a forward simulation and use it to run backwards inference. (Reminds me a little of .) There’s an R package, of course, which implements nice features & optimizations, but ABC is so simple we might as well write our own for transparency.

The simplest ABC goes like this: You sample possible parameters from your prior, feed the set of parameters into your simulation, and if the result is identical to your data, you save that set of parameters. At the end, you’re left with a bunch of sets and that’s your posterior distribution, which you can look at the histograms of and calculate 95% densities etc.

So for the mail data, ABC goes like this:

simulateMailbox <- function(n, dTime, dSD) {
    deliveryTime <- round(rnorm(n, mean = dTime, sd = dSD))
    checkTime <- round(rnorm(n, mean = dTime, sd = dSD))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,])) }

# if both data-frames are sorted, comparison is easier
mailSorted <- mail[order(mail$Time1),]

mail_sim <- replicate(100000, {
    # mu ~ dnorm(650, 20)
    mu <- rnorm(n=1, mean=650, sd=20)
    # sd ~ dunif(10, 60)
    sd <- runif(n=1, min=10, max=60)

    newData <- simulateMailbox(nrow(mailSorted), mu, sd)
    newDataSorted <- newData[order(newData$Time1),]

    if (all(newDataSorted == mailSorted)) { return(c(Mu=mu, SD=sd)) }
   }
  )
results <- Filter(function(x) {!is.null(x)}, mail_sim)
results <- data.frame(t(sapply(results,c)))
summary(results)

The first thing to note is efficiency: I can get a reasonable number of samples in a reasonable amount of time for n = 1-3, but at 4 datapoints, it becomes slow. There are so many possible datasets when 4 checks are simulated that almost all get rejected because they are not identical to the real dataset, and it takes millions of samples and hours to run. And this problem only gets worse for n = 5 and bigger.

To run ABC more efficiently, you relax the requirement that the simulated data == real data and instead accept the pair of parameters if the simulated data is ‘close enough’ in some sense to the real data, close in terms of some summary statistic (hopefully sufficient) like the mean. I don’t know what the sufficient statistics are for a set of interval-censored data, but I figure that if the means of the pairs of times are similar in both datasets, then they are probably close enough for ABC to work, so I can use that as a rejection tolerance; implementing that and playing around, it seems I can make the difference in means as tight as <1 while still running fast.

mail_abc <- function(samples) {
    results <- list()
    n <- 0
    while (n<samples) {

      # Priors:
      ## mu ~ dnorm(650, 20)
      mu <- rnorm(n=1, mean=650, sd=20)
      ## sd ~ dunif(10, 60)
      sd <- runif(n=1, min=10, max=60)

      # generate new data set based on a pair of possible parameters:
      newData <- simulateMailbox(nrow(mail), mu, sd)

      # see if some summaries of the new data were within tolerance e<1 of the real data:
      if (abs(mean(newData$Time1) - mean(mail$Time1)) < 1 &&
          abs(mean(newData$Time2) - mean(mail$Time2)) < 1)
        { results <- list(c(Mu=mu, SD=sd), results); n <- n+1; }
    }
   return(results)
  }
sims <- mail_abc(2000)
results <- matrix(unlist(sims), ncol=2, byrow=TRUE)
summary(results)
##        V1                 V2
##  Min.   :600.5387   Min.   :10.56221
##  1st Qu.:639.2910   1st Qu.:22.79056
##  Median :644.7196   Median :26.33946
##  Mean   :648.4296   Mean   :26.98299
##  3rd Qu.:661.2050   3rd Qu.:30.38428
##  Max.   :716.9134   Max.   :55.80602

The mean value here is off somewhat from JAGS but still acceptable.

Another way to summarize the dataset occurs to me while looking at the graphs: the most striking visual feature of the interval-censored data is how the ‘needles’ overlap slightly and it is this slight overlap which determines where the mean is; the most informative set of data would be balanced exactly between needles that fall to the left and needles that fall to the right, leaving as little room as possible for the mean to ‘escape’ out into the wider intervals and be uncertain. (Imagine a set of data where all the needles fall to the left, because I only checked the mail at 2PM; I would then be extremely certain that the mail is not delivered after 2PM but I would have little more idea than when I started about when the mail is actually delivered in the morning and my posterior would repeat the prior.) So I could use the count of left or right intervals (it doesn’t matter if I use sum(Time1 == 0) or sum(Time2 == 1440) since they are mutually exclusive) as the summary statistic.

mail_abc <- function(samples) {
    results <- list()
    n <- 0
    while (n<samples) {

      # Priors:
      ## mu ~ dnorm(650, 20)
      mu <- rnorm(n=1, mean=650, sd=20)
      ## sd ~ dunif(10, 60)
      sd <- runif(n=1, min=10, max=60)

      # generate new data set based on a pair of possible parameters:
      newData <- simulateMailbox(nrow(mail), mu, sd)

      # see if a summary of the new data matches the old:
      if (sum(mail$Time1 == 0) == sum(newData$Time1 == 0))
        { results <- list(c(Mu=mu, SD=sd), results); n <- n+1; }
    }
   return(results)
  }
sims <- mail_abc(20000)
results <- matrix(unlist(sims), ncol=2, byrow=TRUE)
summary(results)
##        V1                 V2
##  Min.   :567.5387   Min.   :10.00090
##  1st Qu.:636.6108   1st Qu.:22.70875
##  Median :650.0038   Median :34.96424
##  Mean   :650.1095   Mean   :35.08481
##  3rd Qu.:663.7611   3rd Qu.:47.60481
##  Max.   :735.4072   Max.   :59.99941

This one yields the same mean, but a higher SD (which is more divergent from the JAGS estimate as well). So perhaps not as good.

Exact delivery-time data

Partway through compiling my notes, I realized that I did in fact have several exact times for deliveries: the USPS tracking emails for packages, while useless on the day of delivery for knowing when to check (since the alerts are only sent around 3-4PM), do include the exact time of delivery that day. And then while recording interval data, I did sometimes spot the mailman on her rounds; to keep things simple, I still recorded it as an interval.

Exact data makes estimating a mean & SD trivial:

mailExact <- data.frame(Date=as.POSIXct(c("2010-04-29 11:33AM", "2010-05-12 11:31AM", "2014-08-20 12:14PM",
                                          "2014-09-29 11:15AM", "2014-12-15 12:02PM", "2015-03-09 11:19AM",
                                          "2015-06-10 10:34AM", "2015-06-20 11:02AM", "2015-06-23 10:58AM",
                                          "2015-06-24 10:53AM", "2015-06-25 10:55AM", "2015-06-30 10:36AM",
                                          "2015-07-02 10:45AM", "2015-07-06 11:19AM", "2015-07-10 10:54AM",
                                          "2015-07-11 11:09AM", "2015-07-15 10:29AM", "2015-07-16 11:02AM",
                                          "2015-07-17 10:46AM", "2015-07-27 11:12AM", "2015-08-15 10:56AM",
                                          "2015-08-17 11:40AM", "2015-08-18 11:19AM", "2015-08-27 10:43AM",
                                          "2015-09-04 10:56AM", "2015-09-18 11:15AM", "2015-09-26 10:42AM",
                                          "2015-09-28 11:48AM"),
                        "EDT"))
mailExact$TimeDelivered <- fromClock(mailExact$Date)
mean(mailExact$TimeDelivered)
## [1] 668.1071429
sd(mailExact$TimeDelivered)
## [1] 26.24152673

png(file="~/wiki/images/maildelivery/exact.png", width = 800, height = 500)
timeLimits <- seq(from=10*60, to=12.25*60, by=10)
qplot(Date, TimeDelivered, data=mailExact) +
 scale_y_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock))
invisible(dev.off())
Known exact delivery times to me of USPS packages, 2010-2015

Plotted over time, there’s a troubling amount of heterogeneity: despite the sparsity of data (apparently I did not bother to set up USPS tracking alerts 2011-2014, to my loss), it’s hard not to see two separate clusters there.

ML

Besides the visual evidence, a hypothesis test agrees with there being a difference between 2010-2014 and 2015:

mailExact$Group <- year(mailExact$Date) > 2014
wilcox.test(TimeDelivered ~ Group, data=mailExact)
wilcox.test(TimeDelivered ~ Group, conf.int=TRUE, data=mailExact)
#   Wilcoxon rank sum test with continuity correction
#
# data:  TimeDelivered by Group
# W = 105.5, p-value = 0.004344224
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  18.99997358 67.00004597
# sample estimates:
# difference in location
#            41.00004647
sd(mailExact[mailExact$Group,]$TimeDelivered)
## [1] 20.03603473

Why might there be two clusters? Well, now that I think about it, I recall that my mailman used to be an older gentleman with white hair (I remember him vividly because in mid-August 2013 a package of fish oil was damaged in transit, and he showed up to explain it to me and offer suggestions on returns; Amazon didn’t insist on a leaky fish oil bottle being shipped back and simply sent me a new one). But now my mailman is a younger middle-aged woman. That seems like a good reason for a shift in delivery times—perhaps she drives faster. (As our mailpersons never interact with customers and there is minimal road traffic, the delivery times should reflect almost entirely their route length, delivery volume, and driving speed/efficiency.) Alternately, perhaps the personnel shift was driven by a changed route; the exact cause is unimportant as the difference appears large and consistent.

MCMC

Estimating the two distributions separately in a simple multilevel/hierarchical model:

model2 <- function() {
  # I expect all deliveries ~11AM/650:
  grand.mean ~ dnorm(650, pow(20, -2))

  # different mailman/groups will deliver at different offsets, but not by more than 2 hours or so:
  delta.between.group ~ dunif(0, 100)
  # similarly, both mailman times and delivery times are reasonably precise within 2 hours or so:
  tau.between.group <- pow(sigma.between.group, -2)
  sigma.between.group ~ dunif(0, 100)

  for(j in 1:K){
   # let's say the group-level differences are also normally-distributed:
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   # and each group also has its own standard-deviation, potentially different from the others':
   group.within.sigma[j] ~ dunif(0, 100)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)

   # save the net combo for convenience & interpretability:
   group.mean[j] <- grand.mean + group.delta[j]
  }

  for (i in 1:N) {
   # each individual observation is from the grand-mean + group-offset, then normally distributed:
   Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
  }

  # prediction interval for the second group, the 2015 data, which is the one I care about:
  y.new2 ~ dnorm(group.mean[2], group.within.tau[2])
  }
data <- list(N=nrow(mailExact), Y=mailExact$TimeDelivered, K=max(mailExact$Group+1),
             Group=(mailExact$Group+1))
params <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
            "group.within.sigma", "y.new2")
k1 <- jags(data=data, parameters.to.save=params, inits=NULL, model.file=model2, n.iter=50000); k1
##                       mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## delta.between.group    39.252  24.676   1.866  19.228  35.892  56.239  90.804 1.002  1900
## grand.mean            646.541  18.229 610.306 634.420 646.967 658.402 682.334 1.001  3000
## group.delta[1]         51.451  24.090   6.461  35.024  51.243  67.678  97.720 1.001  3000
## group.delta[2]         14.603  18.634 -22.068   2.057  14.234  26.874  51.781 1.001  3000
## group.mean[1]         697.992  16.997 660.532 688.781 699.057 707.816 731.813 1.001  3000
## group.mean[2]         661.144   4.493 652.375 658.188 661.171 664.057 670.275 1.002  1500
## group.within.sigma[1]  36.006  16.775  15.685  23.966  31.571  42.990  81.056 1.001  3000
## group.within.sigma[2]  21.297   3.461  15.764  18.798  20.929  23.266  28.987 1.002  1400
## sigma.between.group    44.831  24.254   7.910  25.429  41.149  62.391  94.998 1.004  1500
## y.new2                661.669  22.141 616.927 647.494 661.905 675.973 704.147 1.002  3000
## deviance              252.499   3.557 247.820 249.767 251.785 254.502 261.130 1.001  3000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 6.3 and DIC = 258.8

Combined data

Combining the interval and exact data is easy: exact data are interval data with very narrow intervals (accurate roughly to within half a minute), where the beginning and end (nearly) match. If exact delivery time is available for a day, then the interval data for that day is omitted as redundant:

mail2 <- rbind(
 with(mailExact,
      data.frame(Date=Date,
                 Time1 = TimeDelivered,
                 Time2= TimeDelivered + 0.5)),
 with(mailInterval[!(as.Date(mailInterval$Date) %in% as.Date(mailExact$Date)),],
     data.frame(Date=Date,
                Time1 = ifelse(Delivered, 0, Time),
                Time2 = ifelse(Delivered,  Time, 1440))))

mail2$Group <- year(mail2$Date) > 2014
mailCombined <- subset(mail2,select=c("Time1","Time2"))

The multilevel model is the same as before with the exception that the likelihood needs to be tweaked to put dinterval on top, and the input data needs to be munged appropriately to match its expectation of a two-column data-frame of the form (Time1, Time2):

model3 <- function() {
  grand.mean          ~ dnorm(650, pow(20, -2))
  delta.between.group ~ dunif(0, 100)
  tau.between.group   <- pow(sigma.between.group, -2)
  sigma.between.group ~ dunif(0, 100)
  y.new2015 ~ dnorm(group.mean[2], group.within.tau[2])

  for(j in 1:K){
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   group.within.sigma[j] ~ dunif(0, 100)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)
   group.mean[j] <- grand.mean + group.delta[j]
  }

  for (i in 1:N) {
   y[i] ~ dinterval(t[i], dt[i,])
   t[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
   }
 }

data2 <- list(N=nrow(mail2), K=max(mail2$Group+1),
             Group=(mail2$Group+1), "dt"=subset(mail2, select=c("Time1", "Time2")),
             "y"=rep(1, nrow(mail2)))
params2 <- c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
            "group.within.sigma", "y.new2015")
inits2 <- function() { list(grand.mean=650, t=as.vector(apply(data2[["dt"]],1,mean))) }
k2 <- jags(data=data2, parameters.to.save=params2, inits=inits2, model.file=model3, n.iter=100000); k2
##                       mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## delta.between.group    39.364  24.229   2.728  19.947  36.568  56.187  92.219 1.002  3000
## grand.mean            646.921  18.303 609.659 634.911 647.471 659.948 680.647 1.002  1200
## group.delta[1]         51.517  23.431   7.242  35.480  50.876  67.432  97.720 1.002  1300
## group.delta[2]         15.617  18.463 -18.505   2.450  15.254  27.661  53.121 1.002  1100
## group.mean[1]         698.438  16.892 661.420 689.057 699.582 708.741 730.282 1.001  3000
## group.mean[2]         662.538   3.088 656.700 660.463 662.538 664.545 668.783 1.001  3000
## group.within.sigma[1]  36.074  16.210  16.181  24.249  31.834  43.555  80.031 1.001  2500
## group.within.sigma[2]  20.691   2.933  15.832  18.616  20.342  22.421  27.245 1.001  3000
## sigma.between.group    43.801  24.304   6.174  24.458  39.747  60.546  95.167 1.003  3000
## y.new2015             662.168  21.331 620.004 648.164 661.859 675.705 705.655 1.002  1200
## deviance                0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1

It’s worth comparing the y.new2015 (the posterior predictive distribution for just 2015, ignoring the earlier cluster) and the original y.new (computed on the 2015 interval-censored data): we started with a 95% credible interval of mail deliveries in the window 601-712 (10:01AM-11:52AM), and then got a window of 620-705 (10:20AM-11:45AM), for a gain of about 26 minutes, a ~23% narrowing. And this despite having almost 3x as much interval-censored data! The lesson here is to try to minimize the length of intervals if one wants to make precise inferences.
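
(A trivial check of that arithmetic, using the rounded interval endpoints quoted above:)

width.interval <- 712 - 601  # y.new: interval-censored 2015 data only
width.combined <- 705 - 620  # y.new2015: combined interval + exact data
# widths of 111 vs 85 minutes: a 26-minute, ~23% narrowing
c(width.interval, width.combined, width.interval - width.combined,
  (width.interval - width.combined) / width.interval)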

Model checking

On Model Uncertainty

“…It is no surprise that when this model fails, it is the likelihood rather than the prior that is causing the problem. In the binomial model under consideration here, the prior comes into the posterior distribution only once and the likelihood comes in n times. It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.”

Gelman & Robert 2013

In setting up an analysis, one must make a variety of choices of models to use, which influences the result; this neglect of “model uncertainty” can lead to overconfident inferences or miss critical details. In Bayesian analyses, typically the prior comes in for the most scrutiny as to how it determines the result; but the prior part of the model often has much less influence than the structural or functional form of the analysis, where it is assumed that the outcome is drawn from a normal distribution or another such common choice, although it’s not. This choice itself is often arbitrary and can determine the result almost regardless of what the data says. For example, the normal distribution has thin tails and so the probability of something being far out on a tail will be extremely small and, combined with other considerations like measurement error, can result in a model refusing to ever accept that a value really is ‘extreme’ without equally extreme amounts of data—a refusal that would not happen if any of a number of other perfectly possible distributions had been chosen instead, like the log-normal distribution—despite the fact that the choice of normal distribution itself is usually due more to convenience than any extremely strong evidence that the normal distribution is the true distribution (“every model is false, but some are useful”), which means the best response is simply to note that models themselves have only some probability of being correct—the probability that sets an upper bound on how much we can get out of them, and a bound which often bars many of the inferences one might try to extract from them.

For example:

  1. the normal distribution is commonly used for its analytic tractability and asymptotic properties, but it has some aspects that are, in a sense, too powerful and restrictive: specifically, that its tails are thin. Thin tails can cause serious underestimations of rare phenomena (and not just in cases involving power laws) and can lead to absurdly powerful conclusions (see the short tail-probability sketch after this list). Two examples:

    • in one analysis of charities, the writer uses normal distributions and thus winds up arguing that with error in our assessments of charities’ values, our estimate must be shrunken far towards the mean (which is entirely true); but this is circular, since by assuming normal distributions he has also assumed away the existence of large differences in value between charities. If one were to grant the apparently innocuous assumption of normally distributed effectiveness of interventions, one would be forced to ignore that there are large differences—consider knitting sweaters for penguins vs the most effective global-health interventions; are they really within a few standard deviations of each other? Should we really ignore interventions like vaccines which claim to save lives at a cost of no more than tens of thousands of dollars, on the grounds that most charities require hundreds of thousands or millions of dollars to achieve goals like that? If we look at global burden-of-disease estimates and see extreme skews in what causes loss of QALYs, with some classifications being hundreds or thousands of times more damaging than others (like indoor fires), is it more likely that the data is wrong than right solely because the normal distribution says that the largest problems are many standard deviations out? It’s clear that whatever the right distribution for modeling charity or health outcomes is, it is probably not literally & exactly a normal distribution and it should allow for large differences; and having abandoned the normal, we then lose the apparently magical ability to make confident predictions about how many charities are how effective.

    • in another analysis, the author, a Michael Ferguson, argues that American society discriminates to a truly astonishing degree against people with high IQs, claiming that people with 150IQ have 97% less chance of achieving success than people with the ideal IQ of ~133, and thus, that past that, the chance of success drops almost to 0%, and that this is due to vicious constant discrimination & persecution. His analysis takes the strategy of noting that the population IQ is normally distributed, by definition, at N(100, 15), and taking a few old studies of elite groups like students at Harvard Medical School and noting the studies report means like 122 and standard deviations like 6, inferring the students are normally distributed at N(122, 6), and then noting that when these two normal distributions are superimposed, there will be almost no elites as high as IQ 140 or 150 despite there being many such people in the general population, and this underrepresentation gets more and more extreme as one goes further out on the tail; why did all the high IQ population members fail to show up in the elite distribution? It must be discrimination. Thus, he grimly warns his readers that “If your IQ is over 150, it is a clarion call; without direct intervention, your career prospects are very poor. If you are the parent of a child with a D15IQ over 150, immediate and dramatic action is required.”

      This would be a remarkable fact if true for many reasons (not least the waste & damage to science and humanity), but that is unlikely: because the underrepresentation is a mathematical necessity of the normal distribution due to the elite distribution being defined as having a smaller SD. The assumption that the elite distribution is normal drives the entire remarkable, unlikely result. But why assume elites are normally distributed? Most of the processes by which elite schools such as Harvard would select from the general population would not produce a normal distribution: if selected at a SAT threshold, the result will not be normal but a truncated normal; if selected on multiple factors such as test scores and personality, it will more likely be log-normal; and so on. That the studies of the Harvard students reported a mean and SD does not show that the student distribution was normal, and in fact, it’s easy to simulate scenarios like the threshold model or a model in which increasing IQ produces increasing odds of admission, and show that the reported means & SDs are entirely possible under other highly non-normal distributions (intuitively, people as high as 140 or 150 are so rare that they don’t affect the summary statistics much, regardless of whether they are heavily discriminated against or heavily favored for admission). The shocking conclusion follows from the mathematical/statistical assumption, but where did this assumption come from and should we grant it?

  2. linear models have a similar problem as normal distributions: while convenient and simple, they have well-known problems when at the extremes of the data or when projected beyond that. Textbooks give examples like a linear regression on ice cream sales predicting negative sales during the winter, or predicting that sales in summer will be orders of magnitude higher, and it’s easy to make up other examples like a linear regression on sprinting times predicting that in another century, sprinters will finish races before they have started. Linear models have no problems predicting negative values backwards in time, or predicting absurdly high amounts in the far future, but they are convenient and for the most part, it’s easy to simply ignore the nonsensical predictions or, if they’re causing problems, use a technical fix like logging values, switching to a log-normal distribution (with its strictly positive support), using a sigmoid or logarithmic fit to account for the inevitable flattening out, handling ratio data with beta regression, fixing the end-values with spline regression, etc; if we make the mistake of modeling a bacterial population in an agar dish as an exponentially increasing population, we’ll soon learn otherwise once a sigmoid or logistic curve makes a better fit. Mark Twain offered an amusing criticism in Life on the Mississippi, and indeed, one can often detect model problems by watching for instances where one gets “wholesale returns of conjecture out of such a trifling investment of fact”, for that indicates that we may be engaged in postulating, which has all “the advantages of theft over honest toil”:

    In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oolitic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.

    An interesting use of OLS linear models is given in Sharov & Gordon 2013’s “Life Before Earth”, which follows up an earlier analysis by Sharov. By taking the genome lengths of 8 biological groups (eg prokaryote genomes are ~5 logged, and prokaryotes originated 3-4bya, so it is assumed that genome lengths were ~5 at 3-4bya) and regressing against date of origin (on a log scale), their linear model, in the mother of all extrapolations, extrapolates backwards to the smallest possible genome (and hence, origin of all life) around 9 billion years ago—well before Earth could have supported life or even existed, and consistent only with a panspermia hypothesis.

    This is problematic for a number of reasons: genome length is being used as a clock, but the animal and bacteria genomes involved are all contemporary (since no DNA older than a million years or so has ever been recovered or the full length counted), which raises some questions about why we think genome length has any connection with time and how exactly this linear increase process is supposed to work (is evolution supposed to have stopped for the prokaryotes 3-4bya and their genome lengths been static ever since?); how they defined those particular taxonomic groups out of all the possibilities and those particular genome lengths & dates (and a questionable reason for excluding plants); measurement error and the very small data set imply considerable uncertainty in the estimates & in the rejection of Earthly origins even at face value; genome lengths seem to have little to do with the passage of time but instead reflect peculiarities of biology & selection (viruses are under more selection pressure, and have faster generations, than any other kind of organism, and have innovated baffling numbers of tricks for infection, and yet instead of having ultra-long genomes reflecting these endless generations and innovation, they have ultra-small ones; and similarly, the Human Genome Project yielded humbling results in showing the human genome to be smaller than those of many organisms we humans look down upon, such as even some single-cell organisms or the ferns, which can be bizarrely huge), and vary tremendously within groups; there is no direct evidence showing that early life’s genome lengths increased at the claimed trendline, and no particular reason to expect recent trends to hold exactly to the origins of life and times when life may not have been DNA-based or even RNA-based at all; genomes are known to shrink in size as well as increase under the right conditions (and one would hope that contemporary prokaryotes are more genetically efficient than the earliest prokaryotes); mutation load & transcription errors place an upper bound on how large a genome can be (Ohno 1972); many genomic phenomena like retroviral insertions or selfish jumping genes inflate genome length for reasons having nothing to do with a clock; it’s unclear why one should endorse constant increase in genome size rather than a population-genetics perspective of equilibriums and occasional shifts; horizontal gene flow makes it unclear how much one can compare with multicellular organisms & vertical transmission; Gould argues that any increase in complexity can be due to a random walk with a lower bound, in which case there is no underlying trend at all, merely stochastic extremes occasionally reached; and the wide variations in genome size (“C-values”) cannot be explained if genome length has much at all to do with a clock ticking since divergence from other species.

    More relevant here is the model misspecification: the model will happily project even further back and estimate how long genomes were—in negative units—10bya. (To say nothing of extrapolating a few billion years into the future.) This is nonsensical: whatever life may use for genomes, it must be positive and never negative. So linearity is known a priori to be false and to break down at some point: the linear model is guaranteed to be incorrect in its predictions early in history; now, if one must reject the extrapolations for 10bya because the model has broken down by that point, why should one believe the projections from before the breakdown…? Why not infer that any linearity appears only later, after such regime-changing events as the creation of life, the switch to DNA, the expansion over the entire world, the oxygen catastrophe, etc? (If your ice cream linear model predicts selling -1000 ice cream cones at the North Pole, why believe it when it predicts selling -10 ice cream cones in Canada?) Why not use a model like a sigmoid which at least has a chance of being correct? The problem, of course, is that if one tried to fit a sigmoid or another such distribution or curve which could fit the true data, one would discover that there is no early data which could give any indication of how fast genome lengths were increasing very early on, like in 4bya, since all the available data is from now. Where does the idea come from that genome lengths would increase slowly even in the first life forms which discovered a virgin planet of unexploited organic goop? Examined closely, since there is no evidence for early linearity, we see that the modeling produces no additional evidence, and is merely an exercise in showing the consequences of an assumption: “if one assumes that genome lengths increase linearly […] then with this particular set of lengths and assumed times, it logically follows that the first organisms were around 9bya”. This is a thoroughly bizarre premise which cannot survive the slightest contact with known sizes of genomes and influences, so one gets no more out of this than what one put in: since there was no particular reason to accept that assumption out of all possible assumptions, there’s no additional reason to accept the conclusion as true either. On its own, it cannot give us any additional evidence for panspermia.

  3. independence is often a questionable model assumption: many datapoints are highly correlated with each other and calculations assuming independence will become grotesquely overconfident. Is it more likely that the probability of a tied election is 10^-90 and one could run trillions of elections without ever once coming near a tie, or that the binomial assumption is simply wrong about voters being independent and victory margins being accordingly distributed? Is it more likely that stock market movements are thin-tailed and large crashes would happen only once in billions of years, or that crashes are common and the distribution is fat-tailed? Is it more likely that there are many strong independent predictors of human populations’ wealth and social factors, or that the populations are related & non-independent? Is it more likely that the models were correct and the fisheries just accidentally collapsed, or that the models did not take into account autocorrelation/multilevelness and overstated their accuracy? etc
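
To make the thin-tails point in #1 concrete, here is a standalone illustration (not part of the mail analysis): the same ‘5 SD’ observation is treated as essentially impossible by a normal distribution but merely unusual by a fatter-tailed t-distribution, so the distributional assumption, rather than the data, can drive conclusions about how rare extremes are.

# Tail probability of an observation >=5 standard units under a normal vs a t-distribution with 3 df:
pnorm(5, lower.tail=FALSE)
## [1] 2.866516e-07
pt(5, df=3, lower.tail=FALSE)
## [1] 0.007696219
# the same extreme value is ~27,000 times less 'surprising' under the fat-tailed distribution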

In the case of mail deliveries, there’s clearly model uncertainty. While I used normal distributions, I could as easily have used t-distributions (fatter-tailed normals), log-normals, negative binomials, uniforms, or betas, or variants of some of the previous, or picked some even more exotic univariate distribution and easily come up with a just-so story for some of them. (It’s a negative-binomial because we’re modeling minutes until “failure” ie delivery; no, it’s a log-normal because each stop slows down the mailman and delays multiply; no, it’s a normal distribution because each stop adds time and the sum of many small random deviates is itself normally distributed; no, it’s a t-distribution because sometimes the mail arrives much earlier or later and it’s a more robust distribution we should be using anyway; no, it’s even wider than that and a uniform distribution because the mailman starts and ends their shift at particular times but otherwise doesn’t care about speed; no, it’s a beta distribution over time because that’s more flexible and can include the others as special-cases…)
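
As a minimal concrete probe of that uncertainty (a sketch added here, separate from the formal model-comparison machinery discussed below), survreg can refit the same combined interval data under another location-scale family it supports, such as the logistic (somewhat fatter tails than the Gaussian), and the maximized log-likelihoods give at least a crude sense of whether the data strongly prefer one family over the other; families requiring strictly positive endpoints (log-normal etc) would need the 0/1440 interval bounds handled differently, so they are skipped in this sketch:

library(survival)
survCombined <- Surv(mailCombined$Time1, mailCombined$Time2, type="interval2")
fitGaussian <- survreg(survCombined ~ 1, dist="gaussian", data=mailCombined)
fitLogistic <- survreg(survCombined ~ 1, dist="logistic", data=mailCombined)
# compare maximized log-likelihoods (both models have the same number of parameters):
c(gaussian=fitGaussian$loglik[2], logistic=fitLogistic$loglik[2])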

Model uncertainty can be handled several ways (Hooten & Hobbs 2015), including:

  • Penalty or regularization or sparsity priors (only handles selection of variables/predictors; gains largely eliminated in this case by informative priors)

  • Diagnostics like posterior predictive checks, crossvalidation, and information criteria such as DIC (overview) can be used to check for lack of fit; but like tests in software development, they can only show the presence of bugs & not their absence

  • Maximally flexible models can be used, like nonparametric Bayesian models; if “all models are wrong”, we can use models which are arbitrarily flexible in order to minimize the influence of our model misspecification (Dunson 2013)

  • Model selection & model averaging (often targeted at the narrow use-case of selecting variables/predictors in a linear model, but some are applicable to choosing between different distributions as well), particularly Bayesian approaches (popularization; Hooten & Hobbs 2015):

    • Bayesian model choice: models can be pitted against each other (eg using Bayes factors) implemented using nested sampling, or reversible-jump MCMC/product-space method (Carlin & Chib 1995, Lodewyckx et al 2011 & source code, Tenan 2014 & source code), or ABC, and the one with highest posterior probability is selected for any further analysis while all others are ignored
    • Bayesian model aver­ag­ing: pos­te­rior prob­a­bil­i­ties of mod­els are cal­cu­lat­ed, but all mod­els are retained and their pos­te­ri­ors are com­bined weighted by the prob­a­bil­ity that model is the true one (Hoet­ing et al 1999); but with enough data, the pos­te­rior prob­a­bil­ity of the model clos­est to the truth will con­verge on 1 and BMA becomes a model selec­tion pro­ce­dure
    • Bayesian model com­bi­na­tion: pos­te­rior prob­a­bil­i­ties of combinations/ensembles of mod­els are cal­cu­lat­ed, yield­ing a richer set of ‘mod­els’ which often out­per­form all of the orig­i­nal mod­els; such as by learn­ing an opti­mal weight­ing in a lin­ear com­bi­na­tion (Mon­teith et al 2011)

Of the non-ensem­ble approach­es: reg­u­lar­iza­tion is already done via the infor­ma­tive pri­ors; some of the diag­nos­tics might work but oth­ers won’t (JAGS won’t cal­cu­late DIC for inter­val-­cen­sored data); non­para­met­ric approaches would be too chal­leng­ing for me to imple­ment in JAGS and I’m not sure there is enough data to allow the non­para­met­ric mod­els to work well.

This leaves PPC, cross­val­i­da­tion, and the ensem­ble approach­es:

  • Bayesian model com­bi­na­tion is overkill
  • Bayesian model aver­ag­ing, given Bayesian model choice, may also be overkill and not mean­ing­fully improve pre­dic­tions enough to change final deci­sions about mail check times
  • Bayesian model choice: nested sampling doesn't seem applicable, leaving reversible-jump/product-space; while JAGS examples are available in tutorial papers, I found them lengthy & complicated by using realistic data & models, hardwired to their specific use-case, and still too difficult to understand with my limited JAGS skills. So I wound up implementing model comparison using ABC.

PPC

Posterior predictive checks (PPC) are a technique recommended by Gelman as a test of how appropriate the model is (as the model or likelihood often is far more important in forcing particular results than the priors one might've used), where one generates possible observations from the model and inferred posteriors, and sees how often the simulated data matches the real data; if that happens rarely or never, then the model may be bad and one should fix it or possibly throw in additional models for Bayesian model comparison (which will hopefully then pick out the right model). For continuous data or more than trivially small data, we have the same issue as in ABC (which is similar to PPCs): the simulated datasets will never exactly match the data, so we must define some sort of summary or distance measure which lets us say that a particular simulated dataset is similar enough to the real data. Of the two summaries I tried for ABC, counting 'flip-flops' worked better, so I'll reuse that here. For the simulation, we also need check or interval-censoring times, which I also treat as a normal distribution & estimate their means and SDs from the original data.

posteriorTimes <- k2$BUGSoutput$sims.list[["y.new2015"]]

simsPPC <- replicate(10000, {
    nChecks <- nrow(mail)
    checkTimes <- c(mail[mail$Time1>0,]$Time1, mail[mail$Time2<1440,]$Time2)
    meanCheck <- mean(checkTimes)
    sdCheck <- sd(checkTimes)

    deliveryTime <- sample(posteriorTimes, size=nChecks, replace=TRUE)
    checkTime <- round(rnorm(nChecks, mean = meanCheck, sd = sdCheck))
    newSamples <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    newData <- data.frame(Time1=newSamples[1,], Time2=newSamples[2,])

    return(sum(newData$Time1 == 0))
    }
)
png(file="~/wiki/images/maildelivery/ppc.png", width = 800, height = 500)
qplot(simsPPC) + geom_vline(xintercept=sum(mail$Time1 == 0), color="blue")
invisible(dev.off())
Sim­i­lar­ity of pos­te­rior pre­dic­tive dis­tri­b­u­tion’s flip-flops with the orig­i­nal data’s flip-flops

The true data has a sum­mary near the mid­dle of the dis­tri­b­u­tion of the sum­maries of the pos­te­rior pre­dic­tions, sug­gest­ing that the nor­mal dis­tri­b­u­tion’s fit to the prob­lem is OK (at least, when con­sid­ered in iso­la­tion).
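To put a rough number on "near the middle", we can also report a posterior predictive p-value, ie the fraction of simulated summaries at least as extreme as the observed one (values near 0 or 1 would signal misfit). A minimal sketch using the simsPPC and mail objects from the code above (observedFlipflops is just an illustrative name):

observedFlipflops <- sum(mail$Time1 == 0)
# two-sided posterior predictive p-value: how extreme is the real data's summary
# relative to the summaries of the simulated datasets?
2 * min(mean(simsPPC >= observedFlipflops), mean(simsPPC <= observedFlipflops))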

Crossvalidation

CV comes in several flavors, but since the data is not big, I can use leave-one-out crossvalidation. The usual error measures are unavailable due to interval-censoring, so after some pondering, I went with a zero-one loss based on whether a delivery is predicted or not at a particular check-time. Mean-squared-error is out due to the intervaling, as are variants like absolute error. Thinking about it some, it seems to me that each datapoint is really made of two things: the check-time, and delivery status. The check-time has nothing to do with the model quality and the model isn't trying to predict it; what I want the model to predict is whether the mail is delivered or not if I were to check at particular times. The check-time is the predictor variable and the delivery-status is the response variable. So I can ask the model's posterior predictive distribution of delivery-times (y.new) whether the mail would or would not be delivered at a particular time, and compare it against the held-out datapoint's actual delivery status, scoring 1 if the model correctly predicts delivered/not-delivered and 0 if it predicts the opposite of what the data said. (So a zero-one loss.) The base-rate of delivery vs non-delivery will be 50-50, so a good model should achieve a mean zero-one score better than 0.5.

For dis­tri­b­u­tions, we’ll com­pare the nor­mal used all along, the t-dis­tri­b­u­tion (for its fat­ter tail­s), and a lit­tle more exot­i­cal­ly, the log-nor­mal & expo­nen­tial dis­tri­b­u­tions as well. Of the­se, the first two are the most like­ly, and I think it’s highly unlikely that the log-nor­mal or expo­nen­tial could fit bet­ter (be­cause they will tend to pre­dict extremely late deliv­er­ies, much later than I believe is at all real­is­tic, while the t’s spread is more rea­son­able in accom­mo­dat­ing some non-nor­mal­ity but not too much). I include those two just to see what will hap­pen and test out the robust­ness of any approach used. The 4 mod­els in JAGS:

modelN <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dnorm(mu,tau)
           }
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dnorm(mu, tau) }"
modelT <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dt(mu,tau,nu)
           }
           nu ~ dexp(1/30)
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dt(mu, tau, nu) }"
modelL <- "model { for (i in 1:n) {
          y[i] ~ dinterval(t[i], dt[i,])
          t[i] ~ dlnorm(muL, tauL)
        }
        sd ~ dunif(10, 60)
        tauL <- pow(1/log(sd), 2)
        mu ~ dnorm(650, pow(30, -2))
        muL <- log(mu)
        y.new ~ dlnorm(muL, tauL) }"
modelE <- "model { for (i in 1:n){
            y[i] ~ dinterval(t[i], dt[i,])
            t[i] ~ dexp(theta)
           }
           # set the mean ~600, but remember rate is reciprocal:
           theta ~ dexp(1/600)
           y.new ~ dexp(theta) }"

Here’s a try at imple­ment­ing leave-one-out cross­val­i­da­tion for those four JAGS mod­els on the inter­val-­cen­sored mail data with a zero-one loss defined that way based on deliv­ery sta­tus:

runModel <- function(newData, model, iter=5000) {
     # set up and run the model and extract predicted delivery times
     data   <- list("dt"=newData, "n"=nrow(newData), "y"=rep(1, nrow(newData)))
     inits  <- function() { list(mu=rnorm(1,mean=600,sd=30),sd=30,t=as.vector(apply(newData,1,mean))) }
     params <- c("y.new")
     cv1 <- jags(data, inits, params, textConnection(model), n.iter=iter, progress.bar="none")
     posteriorTimes <- cv1$BUGSoutput$sims.list[["y.new"]]
     return(posteriorTimes)
}

loocvs <- function(dt, model) {
    results <- NULL

    for (i in 1:nrow(dt)) {
     # set up relevant data for this particular fold in the crossvalidation
     ith       <- dt[i,]
     newData   <- dt[-i,] # drop the _i_th datapoint
     checkTime <- if (ith$Time1==0) { ith$Time2 } else { ith$Time1 }
     there     <- if (ith$Time1==0) { TRUE } else { FALSE }

     posteriorTimes <- runModel(newData, model, iter=500) # fit on the data minus the held-out ith point
     # score predictions against heldout data point
     results[i] <- mean(sapply(posteriorTimes,
                               function(t) { if (t<checkTime && there) { 1 } else {
                                               if (t>checkTime && !there) { 1 } else { 0 } }}))
    }
   return(results)
}
loocvsN <- loocvs(mailCombined, modelN)
loocvsT <- loocvs(mailCombined, modelT)
loocvsL <- loocvs(mailCombined, modelL)
loocvsE <- loocvs(mailCombined, modelE)

mean(loocvsN)
## [1] 0.6316747967
mean(loocvsT)
## [1] 0.6339715447
mean(loocvsL)
## [1] 0.4994674797
mean(loocvsE)
## [1] 0.4504227642

By this measure, the normal & t models perform almost identically, while we can reject the log-normal & exponential models out of hand as performing as badly as (or worse than) chance.

Bayesian Model Selection

We are com­par­ing 4 para­met­ric mod­els here: the nor­mal, Stu­den­t’s t, log-nor­mal, and expo­nen­tial mod­els. (Time per­mit­ting, a Cauchy would have been good too.)

The BIC/DIC/WAIC deviance cri­te­ria are unavail­able; nested sam­pling can­not han­dle mul­ti­ple dis­tri­b­u­tions; this leaves reversible jump MCMC/product-space method and ABC.

I looked at product-space first. The idea is conceptually simple: we take our JAGS models and treat them as a mixture model, with each model having a probability of having generated an observation, and since we believe only one is true, after inferring these probabilities, they become our posterior probabilities of each model. The JAGS implementations look relatively straightforward, but most of them deal with the case of different probabilities or parameters for the same distribution or with variable selection in linear regression, and the only examples with multiple distributions were totally baffling to me, as I became lost in the different distributions and categories and indices and what the 'pseudopriors' are doing etc.

After giving up, it occurred to me that since I had no problem writing the 4 standalone models themselves, their posterior predictive distributions were, in a sense, generative models of themselves and so I could use them in ABC. If I generated samples from all 4 models in ABC and counted how often they (approximately) matched the original data, would not the ratios of accepted samples then constitute posterior probabilities or at least Bayes factors? If the normal model matched the data 10x more often than the t model, then surely the normal model is a much better model of the data than the t. But there's nothing new under the sun, and while the papers on Bayesian model comparison I had read up until that point had not mentioned ABC as an option, it turns out that using ABC for Bayesian model comparison is not just possible but downright common in genetics & ecology (see for background Fu & Li 1997, Marjoram et al 2003, Didelot et al 2004, Beaumont 2010 & Sunnaker et al 2013; examples include Pritchard et al 1999, Estoup et al 2004, Miller et al 2005, Pascual et al 2007, Sainudiin et al 2011). There is the issue that the summary statistics may lose information and lead to bad model comparison results, but there are always tradeoffs.

ABC for Bayesian model com­par­ison:

simulateMailboxMulti <- function(n, m, sd, simulation) {
    deliveryTime <- sample(simulation, size=n, replace=TRUE)
    checkTime <- round(rnorm(n, mean=m, sd=sd))
    simulates <- mapply(function (ck, dy) { if(ck>dy) { return(c(0,ck)) } else { return(c(ck,1440)) }},
                         checkTime, deliveryTime)
    return(data.frame(Time1=simulates[1,], Time2=simulates[2,]))
}
mail_abc_bmc <- function(samples) {

    # generate the posterior samples for each model, which serve as our generative simulations:
    normal <- runModel(mailCombined, modelN)
    tdistribution <- runModel(mailCombined, modelT)
    lognormal <- runModel(mailCombined, modelL)
    exponential <- runModel(mailCombined, modelE)
    simulates <- nrow(mailCombined)

    # create a distribution for our check times
    checkTimes <- c(mailCombined[mailCombined$Time1>0,]$Time1, mailCombined[mailCombined$Time2<1440,]$Time2)
    meanCheck <- mean(checkTimes)
    sdCheck <- sd(checkTimes)

    results <- list()
    n <- 0

    while (n<samples) {
       newData <- NULL
       type <- sample(c("normal", "t", "lognormal", "exponential"), size=1)
       newData <- switch(type,
           "normal"      = { simulateMailboxMulti(simulates, meanCheck, sdCheck, normal); },
           "t"           = { simulateMailboxMulti(simulates, meanCheck, sdCheck, tdistribution); },
           "lognormal"   = { simulateMailboxMulti(simulates, meanCheck, sdCheck, lognormal); },
           "exponential" = { simulateMailboxMulti(simulates, meanCheck, sdCheck, exponential); } )

         # see if summary of the new data is within tolerance error<1 of the real data,
         # and if it is, save which distribution generated it:
         if (abs(mean(newData$Time1) - mean(mail$Time1)) < 1 &&
             abs(mean(newData$Time2) - mean(mail$Time2)) < 1)
           { results <- list(c(Type=type), results); n <- n+1; }
    }
  return(results)
  }
sims <- mail_abc_bmc(20000)
results <- matrix(unlist(sims), ncol=1, byrow=TRUE)
summary(results)
##            V1
##  normal: 9904
##  t     :10096

As suggested by the crossvalidation, the exponential and lognormal models have low posterior probabilities and can be ignored, as they make up none of the accepted samples. (If we really wanted to estimate their probabilities more precisely than ~1/20000, we would need to increase the ABC sampling or loosen the tolerances, and then a handful of exponentials & log-normals would get accepted.) And the t-distribution makes up essentially the same fraction of the samples as the normal: 50.5% vs 49.5% (which, given the limited sampling, means they're effectively identical). The Bayes factors for each pair are the ratios, so we could say that there's a BF of ~1 in favor of normal vs t, or no evidence; but tremendous evidence compared with the exponential or lognormal.
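To make the bookkeeping explicit, here is a minimal sketch (the names and the equal-prior assumption are mine) of converting the ABC acceptance counts in results into posterior model probabilities and the normal-vs-t Bayes factor:

# under equal model priors, posterior model probabilities = acceptance fractions:
counts <- table(factor(results, levels=c("normal", "t", "lognormal", "exponential")))
postProb <- counts / sum(counts)
postProb
# the Bayes factor for normal vs t is just the ratio of their acceptance counts (~1, ie no evidence):
postProb["normal"] / postProb["t"]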

Bayesian Model Averaging

If we started with uniform priors over the models, then the exponential & lognormal posterior probabilities go to ~0, and the posterior probabilities of normal/t get promoted to ~50% each. This means that Bayesian model averaging becomes very simple: we just lump together equally the posterior predictive samples from the two models and work with that.

normal <- runModel(mailCombined, modelN)
tdistribution <- runModel(mailCombined, modelT)
posteriorTimes <- c(normal, tdistribution)

Decision theory

One oft-mentioned advantage of Bayesian approaches, and subjective Bayesian probability in particular, is that it plugs immediately into economic or decision theory approaches to making choices under uncertainty, so we don't stop at merely providing an estimate of some parameter but can continue onwards to reach a decision. (See for example Applied Statistical Decision Theory, Raiffa & Schlaifer 1961, and the more recent Introduction to Statistical Decision Theory, Pratt et al 1995; an applied example of going from Bayesian multilevel models to optimal decisions is "Bayesian prediction of mean indoor radon concentrations for Minnesota counties".)

Since the moti­va­tion for esti­mat­ing mail-de­liv­ery time is implic­itly a deci­sion-the­o­retic prob­lem (“at what time should I decide to check the mail?”), there’s no rea­son to set­tle for just a mean or cred­i­ble inter­val for the deliv­ery time.

Optimal mail checking time

With a posterior distribution of delivery times (y.new) we can try to find an optimal time to check, assuming we can figure out a loss function giving a cost of checking at each possible time. The loss here can be considered in units of time; if I didn't care about the delay between the package being delivered and me getting my hands on it, I would simply check at 2PM and have done with it.

The usual stopping point in an analysis like this would involve the mean, least-squares, and squared-error loss. There's a lot of good things to say about that as a default. But as an algorithm for deciding when to check the mail, squaring errors and averaging them seems like a strange way to evaluate mistakes: is it really equally bad to check the mailbox 5 minutes before the mail and 5 minutes after? is checking 10 minutes after really more than twice as bad as checking 5 minutes after? I would say "no" in both cases: it's much better to check 5 minutes after rather than 5 minutes before, since then the package will be there and I won't have to make another hike out to the mailbox; and checking 10 minutes after seems only twice as bad to me as checking 5 minutes after. So the loss function itself is not any of the usual suspects: squared loss, absolute loss, and 0-1 loss don't correspond to my process of checking mail. I need to define my own loss function, expressing the real costs and gains of this situation.

Defining a loss function

Expecting a package wears on my mind and in some cases, I'd like to start using the contents ASAP; I see a delay of 10 minutes as being twice as bad as 5 minutes and not 4 times as bad, so this part is an absolute loss. (If it arrives at 11AM and I check at 11:30AM, then the loss due to the delay is 30.) Besides the delay, there's also the cost of walking all the way down to the mailbox and back, which due to the long driveway is around 10 minutes, but it's mostly unshaded and hot out and an interruption, so the loss is much greater than that; introspecting, I would be willing to wait at least 60 minutes to save one roundtrip, so I will define the cost of a mail check as 60 'minutes'. So suppose I check at 10:30AM, the mail comes at 11AM, and I check again at 11:30AM and find it; then the cost is 60+60+(11:30AM-11:00AM) = 150, while if I had checked at 11:20AM then the cost is better and smaller at just 60+(11:20AM-11:00AM) = 80.

OK, so if the pack­age is there, that’s easy, but what if I walk out at 11:20AM and it’s not there? In a proper appli­ca­tion, if I check once, I would then update my pos­te­rior and run a loss func­tion again to decide when to check next, but this is imprac­ti­ca­ble for daily usage and in real­i­ty, what I would do is if the first check did­n’t turn up the pack­age, I would then give up in frus­tra­tion & dis­gust and not come back until ~1PM (780) when the pack­age would def­i­nitely have arrived. Then I’d incur a loss of two walks and pos­si­bly a long wait until 1PM.

All these considerations give me a weird-looking but nevertheless realistic loss function: if the package is delivered at time d and we check at a particular time t, then if t > d (the package had arrived) we incur a total loss of 60 + (t - d); otherwise, we check back a second and final time at 1PM (t = 780), incurring a total loss of 60 + 60 + (780 - t).
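To make the formula concrete, here is a minimal standalone sketch of that single-check loss (mailLoss is just an illustrative name; the vectorized version actually used for optimization follows below). For a delivery at 11AM (d=660), checking at 11:20AM (t=680) costs 60 + 20 = 80, while checking too early at 10:30AM (t=630) triggers the give-up-until-1PM branch and costs 120 + 150 = 270:

# loss of a single check at time t for a delivery at time d, per the definition above:
mailLoss <- function(t, d, walk_cost=60, lastResortT=780) {
    if (d < t) { walk_cost + (t - d) } else { 2*walk_cost + (lastResortT - t) }
}
mailLoss(680, 660)
## [1] 80
mailLoss(630, 660)
## [1] 270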

Finding the optimum

Hav­ing fig­ured that out, we run the loss func­tion on each sam­ple from the pos­te­ri­or, aver­ag­ing over all weighted pos­si­ble deliv­ery times, and find what time of day min­i­mizes the loss:

lossFunction <- function(t, walk_cost, predictions, lastResortT) {
    mean(sapply(predictions,
                function(delivered) { if (delivered<t) { return(walk_cost   + (t - delivered)); } else
                                                       { return(2*walk_cost + (lastResortT - t) ); }}))
    }
losses <- sapply(c(0:1440), function (tm) { lossFunction(tm, 60, posteriorTimes, 780);})

png(file="~/wiki/images/maildelivery/losses.png", width = 800, height = 700)
timeLimits <- seq(from=10*60, to=13*60, by=5)
qplot(0:1440, losses) + scale_x_continuous(breaks=timeLimits, labels=sapply(timeLimits, toClock)) +
  xlab("Action") + coord_cartesian(xlim = timeLimits, ylim=1:300)
invisible(dev.off())

which.min(losses)
# [1] 701
toClock(which.min(losses))
## 11:41AM
Expected losses in terms of wait­ing time for check­ing the mail at par­tic­u­lar times

And that gives the final answer: I should check for pack­ages around 11:40AM.

Total costs

We can com­pare the cost of check times. If I check 10 min­utes later than the cur­rent esti­mated opti­mal time, then the loss each time in pseudo-min­utes is:

minIndex <- which.min(losses)
losses[minIndex]
# [1] 108.5026282
losses[minIndex+10]
# [1] 110.1791093
losses[minIndex+10] - losses[minIndex]
# [1] 1.676481109

With average losses per check time, an estimate of the number of check times per year, and a discount rate, we can derive a net present value; I usually assume an indefinite horizon (with a discount rate of 5% annually), but in this case assuming an indefinite horizon is wrong—the delivery time will change and the problem will reset as soon as the next mailman begins this route, which has already happened once in the 4 years thus far, so I instead assume that any estimate is worthless after 3 years (the mailman will have changed, the route or USPS will have changed, I may not even be living in the same place in 3 years). So if I check for a package 30 times a year (which is roughly as much as I did in 2014 & am on track to do in 2015), and I do so suboptimally each time and incur a loss of -2, then the total loss is equivalent to:

netPresentValue <- function(p) { ((p / (1 + 0.05)^1) + (p / (1 + 0.05)^2) + (p / (1 + 0.05)^3)) * 30  }
netPresentValue(-2)
## [1] -163.3948818

Optimal data-sampling

Reinforcement learning

Given a loss function, we might be able to optimize our data-sampling by balancing exploration and exploitation using a reinforcement learning approach to the multi-armed bandit problem such as Thompson sampling/probability-sampling/probability-matching—where different check-times are 'arms', the payoff is given by the loss, and we draw a check time from the current posterior, check at that time, and then with the new data update the posterior for the next day.

I have implemented Thompson sampling for this problem (see the appendix). While probability sampling is conceptually simple, intuitive, has no hyperparameters to tweak, and is optimal in a number of ways, actually using it every day would have required more discipline than I cared for; so I didn't use it.

Expected information gains

Ignoring indefinite online learning, it is possible to optimize data sampling by estimating which parameter value is predicted to yield the most information, in the sense of entropy of distributions, giving Expected Information Gains (or "expected gain in Shannon information", or "expected Kullback-Leibler discrepancy"). The normal distribution has a short entropy definition, 0.5 * log(2*pi*e*sigma^2), based on its standard deviation (since the mean is just a fixed offset, while it is the standard deviation that defines how variable/unpredictable samples are and hence how many bits it takes to encode them), but we don't know the exact normal distribution—only posterior distributions incorporating uncertainty. The posterior predictive distribution for a normal distribution is, I am told, a Student's t-distribution (see A First Course in Bayesian Statistical Methods by Peter Hoff), which has a more complex entropy (using the digamma & beta functions) based on its degrees-of-freedom term, v, and its scale term, s, which goes: ((v+1)/2) * (digamma((v+1)/2) - digamma(v/2)) + log(sqrt(v) * beta(v/2, 1/2)) + log(s).

We can estimate the entropy of a posterior predictive distribution using the posterior samples with a few approaches:

  1. estimate the probability density function (giving a probability at each point) and then the entropy contribution of each sample is given by the classic formula converting a probability into the bits necessary to encode it, -log2(p), with the entropy being the average over the samples

    For example, our posterior predictive samples could be converted into a probability density function over the relevant times (roughly 600-800) by taking, for each time, the fraction of samples representing that time. (We'd probably round the samples, implicitly binning them into 200 or so bins.) Then each sample can be converted into the density at its point, the densities logged and negated, then averaged, to get the entropy of the whole posterior predictive sample and hence the distribution itself. (A minimal sketch of this approach appears after this list.)

  2. treat the pos­te­rior pre­dic­tive sam­ples as a t-dis­tri­b­u­tion, esti­mate their v & s, and con­vert it to an entropy.

    This is less gen­er­al, but eas­ier to imple­ment.
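For completeness, here is a minimal sketch of approach #1 (empiricalEntropy is my own illustrative name; the analysis below uses approach #2 via fitdistr): bin the samples to the nearest minute, estimate each bin's probability as its fraction of the samples, and average the negated log-probabilities. As a sanity check, the same estimator applied to draws from a known normal with SD=30 should come out near the closed-form 0.5 * log(2*pi*e*30^2), about 4.82 nats:

# approach #1: plug-in entropy of a posterior predictive sample (natural log = nats;
# use log2 instead for bits), binning the samples to the nearest minute:
empiricalEntropy <- function(samples) {
    p <- table(round(samples)) / length(samples)   # estimated probability of each rounded time
    -sum(p * log(p))                               # equals the mean of -log(p) over the sample
}
# sanity check against the closed-form differential entropy of a Normal(650, 30):
empiricalEntropy(rnorm(100000, mean=650, sd=30))
0.5 * log(2 * pi * exp(1) * 30^2)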

Once we can convert a set of posterior predictive samples into an entropy number, this follows the usual simulation & maximization recipe: calculate the current entropy; then, looping over all possible actions/data-to-collect, draw one sample from the posterior, compare it to the action to get a hypothetical datapoint, update the model, calculate a new entropy, and return the difference between the old entropy and the new entropy (repeated a number of times so one can take the mean and get a good approximation for each action). Whichever action would lower the entropy the most (have the biggest gain) is then the optimal action to take, and we make an observation by doing that. An implementation:

library(MASS)
tEntropy <- function(v, s) { ((v+1)/2) * (digamma((v+1)/2) - digamma(v/2)) +
                             log(sqrt(v) * beta(v/2, 1/2)) +
                             log(s) }
entropyST <- function(pps) { fit <- fitdistr(pps, "t");
                             tEntropy(fit$estimate[['df']], fit$estimate[['s']]); }
entropyD <- function(dt) {
   # average together the _t_ & normal 50:50, per the Bayesian model averaging results:
   jN <- runModel(dt, modelN, iter=500)
   jT <- runModel(dt, modelT, iter=500)
   posteriorTimesN <- c(jN, jT)

   entropy <- entropyST(posteriorTimesN)
   return(entropy)
   }

actions <- c(600:728)
oldEntropy <- entropyST(posteriorTimes)

library(parallel)
library(plyr)

entropyAction <- function(a, iter=10) {
 df <- data.frame()
 for (i in 1:iter) {
    deliveryTime <- sample(posteriorTimes, size=1)
    if (deliveryTime<a) { newData <- rbind(mail, data.frame(Time1=0, Time2=a)); } else
                        { newData <- rbind(mail, data.frame(Time1=a, Time2=1440)); }
    newEntropy <- entropyD(newData)
    gain <- oldEntropy - newEntropy
    print(data.frame(Action=a, EntropyDecrease=gain))
    df <- rbind(df, data.frame(Action=a, EntropyDecrease=gain))
 }
 return(df)
 }
ei <- ldply(mclapply(actions, function(a) { entropyAction(a) }))

eimean <- aggregate(EntropyDecrease ~ Action, mean, data=ei)
eimean[which.max(eimean$EntropyDecrease),]
##      Action        Entropy
## 53    652 0.01161258411
toClock(eimean[which.max(eimean$EntropyDecrease),]$Action)
## [1] "10:52"
plot(eimean)

Inter­ested in how the max­i­mal expected infor­ma­tion time changed, I began check­ing at the opti­mal times in late Sep­tem­ber, check­ing at:

  1. 12PM
  2. 10AM
  3. 12:08PM
  4. 10:59AM
  5. 10:57AM
  6. 10:52AM

Optimal sample-size: Value of Information metrics

If we are not opti­miz­ing the kind of data col­lect­ed, we can opti­mize how much data we col­lect by peri­od­i­cally cal­cu­lat­ing how much addi­tional data could improve our esti­mates and how much this improve­ment is worth on aver­age, espe­cially com­pared to how much that addi­tional data would cost.

Expected val­ues of var­i­ous kinds of infor­ma­tion show up fre­quently in Bayesian deci­sion the­o­ry, and are valu­able for link­ing ques­tions of how much & what data to col­lect to real-­world out­comes (not just money but lives and wel­fare; a review; an Alzheimer’s exam­ple).

EVPI

The first kind is the "expected value of perfect information" (EVPI): if I estimated the optimal time as t=688 and it was actually t=698, then I could be suffering a penalty of 163 minutes and I should be willing to pay somewhere up to that (based on how certain I am it's t=688) to learn the truth and instead start checking the mail at t=698. The value of perfect information is simply the difference between the current estimate and whatever you hypothesize the true time is.

The expected value or EVPI is the cost of the cur­rent esti­mate of the opti­mum ver­sus a pos­si­ble alter­na­tive time, where the alter­na­tives are weighted by prob­a­bil­i­ties (if the mail really comes at 9AM then that implies a big loss if you fool­ishly check at 11:30AM but the prob­a­bil­ity of such a big loss is almost zero); the prob­a­bil­i­ties come, as usu­al, from the pos­te­rior dis­tri­b­u­tion of times which is approx­i­mated as a big batch of sam­ples:

mean(sapply(round(posteriorTimes), function(time) { netPresentValue((losses[time] - min(losses))) } ))
# [1] 5087.290247

Per­fect infor­ma­tion is not avail­able at any cost, how­ev­er.3 All I can do is go out to the mail box n times, and as we saw in the sim­u­la­tion, dimin­ish­ing returns always hap­pens and at some point the pre­dic­tive inter­vals stop chang­ing notice­ably. So EVPI rep­re­sents an upper bound on how much any lesser amount of infor­ma­tion could be worth, but does­n’t tell us how much we are will­ing to pay for imper­fect infor­ma­tion.

EVSI

The next kind is the "expected value of sample information" (EVSI): what is the value of collecting one more additional datapoint, which can help pin down the optimal time a little more precisely and reduce the risk of loss from picking a bad time? EVSI can be defined as "expected value of the best decision with some additional sample information" minus "expected value of the best current decision". More specifically: if, many times over, we create a new datapoint, create an updated posterior distribution incorporating that new datapoint, run the loss function again & calculate a new optimal check time, and compute the improvement of the new check time's NPV over the improved estimate of the old check time's NPV, each run gives a possible EVSI; what is the mean of those EVSIs?

This tells us the benefit of that datapoint, and then we can subtract the cost of collecting one more datapoint; if it's still positive, we want to collect more data, but if it's negative (the gain from an estimate improved by one more datapoint is less than that datapoint would cost to collect), then we have hit diminishing returns and it may be time to stop collecting data.

We could do this for one dat­a­point to decide whether to stop. But we could also go back and look at how the EVSI & profit shrank dur­ing the orig­i­nal col­lec­tion of the mail data.

data <- mailCombined
sampleValues <- data.frame(N=NULL, newOptimum=NULL, newOptimumLoss=NULL,
                           sampleValue=NULL, sampleValueProfit=NULL)
for (n in seq(from=0, to=(nrow(data)+10))) {
 evsis <- replicate(10, {
  # if n is more than we collected, bootstrap hypothetical new data; otherwise,
  # just take that prefix and pretend we are doing a sequential trial where we
  # have only collected the first n observations thus far
  if (n > nrow(data)) { newData <- rbind(data,
                                         data[sample(1:nrow(data), n - nrow(data) , replace=TRUE),]) } else {
                        newData <- data[1:n,] }

  kEVSIN <- runModel(newData, modelN)
  kEVSIT <- runModel(newData, modelT)
  posteriorTimesEVSI <- c(kEVSIN, kEVSIT) # model averaging

  lossesEVSI <- sapply(c(0:1440), function (tm) { lossFunction(tm, 60, posteriorTimesEVSI, 780);})

  # compare to the estimate based on priors if no data yet:
  if (n==0) { oldOptimum <- min(lossesEVSI)  } else {

              oldOptimum <- lossesEVSI[sampleValues[sampleValues$N==0,]$newOptimum] }
  newOptimum <- min(lossesEVSI)
  sampleValue <- netPresentValue(newOptimum - oldOptimum)
  sampleValueProfit <- sampleValue + (n*60)

  return(list(N=n, newOptimum=which.min(lossesEVSI), newOptimumLoss=newOptimum,
              sampleValue=sampleValue, sampleValueProfit=sampleValueProfit))
  }
 )

 sampleValues <- rbind(sampleValues,
                        data.frame(N=n,
                                   newOptimum=mean(unlist(evsis[2,])),
                                   newOptimumLoss=mean(unlist(evsis[3,])),
                                   sampleValue=mean(unlist(evsis[4,])),
                                   sampleValueProfit=mean(unlist(evsis[5,]))))
 }
sampleValues
##     N newOptimum newOptimumLoss    sampleValue sampleValueProfit
## 1   0    712.600    126.6888958    0.00000000      0.0000000000
## 2   1    712.575    126.7579219  -11.93854006     48.0614599416
## 3   2    719.980    122.3724436 -121.31501738     -1.3150173784
## 4   3    708.150    126.9706051  -32.03379518    147.9662048193
## 5   4    701.790    127.9095775 -143.35714066     96.6428593445
## 6   5    696.695    128.8701531 -299.02233338      0.9776666227
## 7   6    692.440    129.8191319 -466.05452337   -106.0545233747
## 8   7    688.695    132.0807898 -633.09206632   -213.0920663197
## 9   8    690.765    129.5727029 -542.78513810    -62.7851380951
## 10  9    687.660    132.0364916 -681.32564291   -141.3256429125
## 11 10    689.210    130.2163274 -597.48233962      2.5176603795
## 12 11    686.395    130.5338326 -758.15486682    -98.1548668161
## 13 12    687.455    128.6273025 -695.28046012     24.7195398753
## 14 13    691.525    129.2092477 -496.05481084    283.9451891562
## 15 14    695.410    129.2232888 -347.92312703    492.0768729651
## 16 15    693.405    128.1591433 -433.22087565    466.7791243520
## 17 16    693.650    127.0530983 -436.23463412    523.7653658752
## 18 17    693.735    125.5236645 -444.81276457    575.1872354295
## 19 18    695.725    125.3049938 -355.24354368    724.7564563236
## 20 19    693.155    123.7840402 -465.57031668    674.4296833227
## 21 20    691.340    122.2890257 -573.81903972    626.1809602784
## 22 21    695.460    124.4854690 -373.66455168    886.3354483178
## 23 22    693.835    123.3286284 -453.72808090    866.2719190994
## 24 23    692.890    122.1550798 -511.31873755    868.6812624468
## 25 24    691.585    120.6627142 -594.94760901    845.0523909901
## 26 25    689.930    119.8993341 -688.68789781    811.3121021938
## 27 26    690.845    118.8742591 -644.53601191    915.4639880893
## 28 27    693.915    120.5363205 -480.27587034   1139.7241296593
## 29 28    695.455    120.5722357 -410.57354380   1269.4264562033
## 30 29    693.720    119.8216438 -491.01196669   1248.9880333079
## 31 30    692.560    118.5479051 -566.54877290   1233.4512271013
## 32 31    691.265    117.4006005 -642.81873824   1217.1812617638
## 33 32    694.800    120.2531077 -443.02993834   1476.9700616613
## 34 33    694.500    122.6974736 -444.15880135   1535.8411986539
## 35 34    693.200    121.6848098 -506.60101893   1533.3989810750
## 36 35    695.640    123.1766613 -377.44505442   1722.5549455769
## 37 36    695.055    125.0534433 -389.10082564   1770.8991743593
## 38 37    694.210    124.1294044 -439.58662308   1780.4133769227
## 39 38    694.610    123.0556714 -424.08194842   1855.9180515813
## 40 39    694.310    124.7484991 -436.35444597   1903.6455540347
## 41 40    694.505    123.4282826 -429.85235414   1970.1476458589
## 42 41    694.280    122.2800503 -437.52664626   2022.4733537369
## 43 42    695.660    121.9807264 -395.84245846   2124.1575415363
## 44 43    697.795    123.4574969 -298.08161577   2281.9183842337
## 45 44    696.745    122.5201770 -345.76139793   2294.2386020661
## 46 45    695.845    121.6540080 -389.60085022   2310.3991497771
## 47 46    695.090    120.9569570 -436.42100092   2323.5789990833
## 48 47    693.990    120.5449101 -483.41266155   2336.5873384516
## 49 48    694.305    120.7051254 -487.09162861   2392.9083713904
## 50 49    693.840    120.1719436 -499.45106193   2440.5489380720
## 51 50    693.940    119.9749612 -504.81034896   2495.1896510352
## 52 51    693.960    119.9420246 -500.28329793   2559.7167020736
## 53 52    693.595    119.7075089 -512.73745021   2607.2625497905
## 54 53    693.875    119.4981456 -515.01771496   2664.9822850449
## 55 54    693.505    119.1947709 -524.05689719   2715.9431028079
## 56 55    693.900    119.5021316 -517.24622527   2782.7537747290
## 57 56    693.715    119.1529502 -536.63625754   2823.3637424644
## 58 57    693.530    118.7091686 -531.78528869   2888.2147113139
## 59 58      703.2    110.3450586 -237.308410317  3242.691589683
## 60 59      702.8    110.1431462 -260.940058947  3279.059941053
## 61 60      701.8    109.8245030 -292.138648235  3307.861351765
## 62 61      702.0    109.6607867 -313.445660084  3346.554339916
## 63 62      701.7    109.9572132 -324.525787515  3395.474212485
## 64 63      701.0    109.5939801 -353.833817911  3426.166182089
## 65 64      700.9    109.7948883 -353.210103148  3486.789896852
## 66 65      701.3    109.5571512 -343.780758960  3556.219241040
## 67 66      701.2    109.5838684 -310.806893360  3649.193106640
## 68 67      701.5    109.4677088 -316.910308982  3703.089691018
## 69 68      700.9    109.3539013 -326.340338587  3753.659661413
## 70 69      702.1    109.2574961 -317.820011152  3822.179988848
## 71 70      701.2    108.8563787 -351.667974044  3848.332025956
## 72 71      701.2    108.9809869 -339.661339561  3920.338660439
## 73 72      701.5    108.9430690 -313.603261282  4006.396738718
## 74 73      701.6    108.6864327 -324.919641986  4055.080358014
## 75 74      701.4    108.7176209 -308.707536711  4131.292463289
## 76 75      703.1    108.4620179 -304.882851775  4195.117148225
## 77 76      701.7    108.4129922 -303.134783733  4256.865216267
## 78 77      702.2    108.2194238 -304.295347052  4315.704652948
## 79 78      701.4    107.7256569 -334.733360207  4345.266639793
## 80 79      700.4    108.1555870 -340.454520356  4399.545479644
## 81 80      701.8    108.1089954 -315.865705672  4484.134294328
## 82 81      701.7    107.8976325 -332.239957494  4527.760042506
## 83 82      701.3    108.1263477 -318.440927654  4601.559072346
## 84 83      701.7    107.9807588 -341.667394085  4638.332605915
## 85 84      701.6    107.9223031 -332.374354544  4707.625645456
## 86 85      702.3    107.7032516 -329.594428046  4770.405571954
## 87 86      701.2    107.7025934 -323.337539932  4836.662460068
## 88 87      702.6    107.7341694 -315.319855307  4904.680144693
## 89 88      702.7    107.5978394 -318.888777193  4961.111222807
## 90 89      701.7    107.6826961 -329.601897440  5010.398102560
## 91 90      701.2    107.2913853 -352.893246270  5047.106753730
## 92 91      702.6    107.9162503 -298.859548680  5161.140451320
## 93 92      701.4    107.3265408 -345.735043576  5174.264956424
## 94 93      702.9    107.7535692 -307.027575009  5272.972424991
## 95 94      700.7    106.9934449 -380.611482059  5259.388517941
## 96 95      702.2    108.2833572 -318.195906366  5381.804093634

qplot(sampleValues$N, round(sampleValues$newOptimum))
Evo­lu­tion of opti­mal time to check the mail based on pos­te­rior updated after each obser­va­tion.

This would sug­gest that dimin­ish­ing returns were reached early on (pos­si­bly the infor­ma­tive pri­ors had done more work than I appre­ci­at­ed).

The graph shows the estimated optimal mail-check time (700=11:40AM etc) as each datapoint was collected. You can see that with the priors I set, they were biased towards too-late mail times, but as more data comes in, the new optimal check-time drifts steadily downwards until the bias of the prior has been neutralized and it now begins to follow a random walk around 695, with the final estimated mail-check time being ~693/11:33AM. And at around n=48, diminishing returns has set in so hard that the decision actually stops changing entirely: it's all small fluctuations around 693. When you include the cost of gathering data, the analysis says you should collect up to n = 13 and then stop (after that, the loss does not decrease but begins to increase because each datapoint inflates costs by +60, and we want to minimize loss); at n = 13, one decides to check at 687/11:27AM, which is close to the 693 which was estimated with 4x more data (!).

So this is inter­est­ing: by using infor­ma­tive pri­ors and then tak­ing a deci­sion-the­o­retic approach, in this case I can make high­-qual­ity deci­sions on sur­pris­ingly lit­tle data; I could have halted data gath­er­ing much ear­li­er. (Ad­di­tional data does have ben­e­fits in let­ting me ver­ify the dimin­ish­ing returns by see­ing how the opti­mal deci­sion hardly changes, can be used for more accu­rate model aver­ag­ing, and could be used to model time-­vary­ing trends like the large increases in late deliv­er­ies dur­ing the hol­i­days. But ulti­mately I have to agree with the EVSI: I did­n’t need to col­lect all the data I did.)

Conclusions

Lessons learned:

  1. ABC is as simple to implement, and as incredibly general, as promised; with ABC, it's defining good summary statistics & achieving computational tractability which you need to worry about (besides making it tedious to use, long runtimes also interfere with writing a correct implementation in the first place)

  2. while pow­er­ful, JAGS mod­els can be con­fus­ing to write; it’s easy to lose track of what the struc­ture of your model is and write the wrong thing, and the use of pre­ci­sion rather than stan­dard­-de­vi­a­tion adds boil­er­plate and makes it even eas­ier to get lost in a wel­ter of vari­ables & dis­tri­b­u­tions, in addi­tion to a fair amount of boil­er­plate in run­ning the JAGS code at all—lead­ing to errors where you are not cer­tain whether you have a con­cep­tual prob­lem or your boil­er­plate has drifted out of date

  3. the com­bi­na­tion of ggplot2 and animation bade fair to make ani­ma­tions as easy as devis­ing a ggplot2 image, but due to some ugly inter­ac­tions between them (ggplot2 inter­acts with the R top-level scope/environment in a way which breaks when it’s called inside a func­tion, and animation has some sub­tle bug relat­ing to decid­ing how long to delay frames in an ani­ma­tion which I could­n’t fig­ure out), I lost hours to get­ting it to work at all.

  4. my original prior estimate of when the mailman comes was accurate, but I seem to have substantially underestimated the standard deviation, and there turned out to be considerable model uncertainty about thin tails (normal) vs fatter tails (t); despite that, providing informative priors meant that the predictions from the JAGS model were sensible from the start (mostly from the small values for the SD, as the mean time was easily estimated from the intervals) and it made good use of the data.

  5. the vari­abil­ity of mail deliv­ery times is high enough that the pre­dic­tion inter­vals are inher­ently wide; after rel­a­tively few dat­a­points, whether inter­val or exact, dimin­ish­ing returns has set in.

  6. Sub­jec­tive Bayesian­ism & deci­sion the­ory go together like peanut but­ter & choco­late.

    It's all just samples and posterior distributions; want to calculate expected information gains to optimize your sampling, or compare to subjective loss to decide in a principled fashion when to stop? No problem. (Instead of ad hoc rules of thumb like going for 80% power or controlling alpha at 0.05 or trial sequential analysis stopping at an arbitrary effect size confidence interval…) Long computations are a small price to pay for approximate answers to the right questions rather than exact answers to the wrong questions.

    The major down­side is that Bayesian deci­sion the­ory is hon­ored mostly in the breach: I found much more dis­cus­sion than appli­ca­tion of it online, and few worked-out exam­ples in R that I could try to com­pare with my own imple­men­ta­tions of the ideas.

Appendix

Thompson sampling

Given a Bayesian model and a reinforcement-learning setting like checking the mailbox, it seems natural to use Thompson sampling to guide checks and do online learning ("online" here meaning learning 1 datapoint at a time). I didn't want to reorganize my mornings around such an algorithm, but in reading up on Thompson sampling, I found no source code for any uses outside the stereotypical 3-4 arm binomial bandit, and certainly nothing I could use.

Thompson sampling turns out to be as simple as advertised. Each time a Thompson sampling agent acts, it updates its model based on current data and then simply draws one possible set of parameters at random from its posterior distributions of those parameters; then it finds the optimal action under that particular set of parameters; and executes it.

An imple­men­ta­tion of Thomp­son sam­pling for this prob­lem, and a test har­ness to run it on a sim­u­lated ver­sion of mail-check­ing to see how it per­forms in terms of actions, loss­es, and regret:

library(R2jags)
model1 <- function() { for (i in 1:n){
           y[i] ~ dinterval(t[i], dt[i,])
           t[i] ~ dnorm(mu,tau)
           }
           mu ~ dnorm(650, pow(30, -2))
           sd ~ dunif(10, 60)
           tau <- pow(1/sd, 2)
           y.new ~ dnorm(mu, tau)
           }

lossPosterior <- function(t, walk_cost, predictions, lastResortT) {
    sapply(predictions,
                function(delivered) { if (delivered<t) { return(walk_cost   + (t - delivered)); } else
                                                       { return(2*walk_cost + (lastResortT - t) ); }})
    }
posterior_losses <- function(times, actions) {
   df <- data.frame(Actions=actions,
           Loss=sapply(actions, function (tm) { mean(lossPosterior(tm, 60, times, 780));}))
   return(df)
}

thompsonSample <- function(dt, model) {
   # model the available data:
   data <- list("dt"=dt, "n"=nrow(dt), "y"=rep(1, nrow(dt)))
   inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(dt,1,mean))) }
   params <- c("mu", "sd")
   j1 <- jags(data, inits, params, model, n.iter=1000, progress.bar="none")

   # 0:1440 is overkill, so let's only look at the plausible region of times to actively check:
   actions <- c(600:800)
   # sample one hypothetical world's set of mean/SD parameter values:
   posteriorMean <- j1$BUGSoutput$sims.list[["mu"]][1]
   posteriorSD <- j1$BUGSoutput$sims.list[["sd"]][1]
   # sample a bunch of data from that hypothetical world:
   posteriorTimes <- rnorm(1000, mean=posteriorMean, sd=posteriorSD)
   # so we can then estimate losses:
   posteriorLosses <- posterior_losses(posteriorTimes, actions)
   # and with expected losses for each action defined, find the optimal action in that world:
   optimalAction <-  posteriorLosses[which.min(posteriorLosses$Loss),]$Actions[1]
   return(optimalAction) }

simulate_mab <- function(n) {
 walk_cost = 60; lastResortT = 780

 # pick a concrete mean/SD for this run, based on the full MCMC estimates as of 2015-09-20:
 mail_mean <- rnorm(1, mean=658.9, sd=4.69)
 mail_sd <- rnorm(1, mean=27.3, sd=8.33)
 # to calculate regret, we need an optimal loss, which we can compare realized losses to:
 posteriorTimes <- rnorm(5000, mean=mail_mean, sd=mail_sd)
 actions <- c(600:800)
 posteriorLosses <- posterior_losses(posteriorTimes, actions)
 optimalLoss <-  min(posteriorLosses$Loss)[1]
 optimalAction <-  posteriorLosses[which.min(posteriorLosses$Loss),]$Actions[1]
 print(paste0("Parameters for this simulation; Mean: ", mail_mean,
              "; SD: ", mail_sd,
              "; minimum average loss: ", optimalLoss,
              "; optimal action: t=", optimalAction))

 # seed data:
 mail3 <- data.frame(Time1=660, Time2=1440, Delivered=NA, Action.optimal=optimalAction, Action=660,
                     Loss=0,  Regret=0, Total.regret=0)

 for (i in 1:n) {
   # today's delivery:
   delivered <- rnorm(1, mean=mail_mean, sd=mail_sd)
   # when should we check?
   t <- thompsonSample(subset(mail3,select=c(Time1,Time2)), model1)
   # discover what our loss is, update the database:

   loss <- if (delivered<t) { walk_cost  + (t - delivered); } else
              { 2*walk_cost + (lastResortT - t); }
   mail3 <- if(delivered<t) {
               rbind(mail3, data.frame(Time1=0, Time2=t, Delivered=delivered,
                                       Action.optimal=optimalAction, Action=t,
                                       Loss=loss, Regret=(loss-optimalLoss),
                                       Total.regret=sum(c(loss,mail3$Regret),na.rm=TRUE))) } else
             { rbind(mail3, data.frame(Time1=t, Time2=1440, Delivered=delivered,
                                       Action.optimal=optimalAction, Action=t,
                                       Loss=loss, Regret=(loss-optimalLoss),
                                       Total.regret=sum(c(loss,mail3$Regret),na.rm=TRUE))) }
 }
 return(mail3)
}
run <- simulate_mab(500)

library(ggplot2)
qplot(1:nrow(run), run$Action, xlab="Nth mail check", ylab="Check time",
      color=run$Loss, size=I(5), main="Thompson sampling") +
  stat_smooth() + geom_hline(colour="green", aes(yintercept=run$Action.optimal[1]))
1 sim­u­la­tion run’s actions over 500 time-steps, opti­mal action over­laid in green

In this par­tic­u­lar sim­u­la­tion, the Thomp­son sam­pling agent starts off choos­ing check­-­times too late in the day and grad­u­ally tight­ens its selec­tion of actions until by ~t = 100, it is gen­er­ally check­ing near-op­ti­mally but with con­tin­ued sam­pling from other check­-­times which starts to com­pen­sate for a lit­tle under­shoot­ing error.


  1. From the doc­u­men­ta­tion:

    Surv(time, time2, event, type="interval")

    1. time: For inter­val data, the first argu­ment is the start­ing time for the inter­val.
    2. time2: end­ing time of the inter­val for inter­val cen­sored or count­ing process data only. Inter­vals are assumed to be open on the left and closed on the right, (start, end]. For count­ing process data, event indi­cates whether an event occurred at the end of the inter­val.
    3. event: The sta­tus indi­ca­tor, nor­mally 0=alive, 1=dead­….­For inter­val cen­sored data, the sta­tus indi­ca­tor is 0=right cen­sored, 1=event at time, 2=left cen­sored, 3=in­ter­val cen­sored.

    Inter­val cen­sored data can be rep­re­sented in two ways. For the first use type = "interval" and the codes shown above. In that usage the value of the time2 argu­ment is ignored unless event=3. The sec­ond approach is to think of each obser­va­tion as a time inter­val with (-∞, t) for left cen­sored, (t, ∞) for right cen­sored, (t,t) for exact and (t1, t2) for an inter­val. This is the approach used for type = "interval2". Infi­nite val­ues can be rep­re­sented either by actual infin­ity (Inf) or NA. The sec­ond form has proven to be the more use­ful one.

    …a sub­jec­t’s data for the pair of columns in the dataset (time1, time2) is (te, te) if the event time te is known exact­ly; (tl, NA) if right cen­sored (where tl is the cen­sor­ing time); and (tl, tu) if inter­val cen­sored (where tl is the lower and tu is the upper bound of the inter­val).

    ↩︎
  2. Emphasis added; "'Not Only Defended But Also Applied': The Perceived Absurdity of Bayesian Inference", Gelman & Robert 2013; Gelman et al 2017.↩︎

  3. Instead of doing esti­mates, why not buy or make some­thing like a mag­netic switch or motion-ac­ti­vated cam­era on the mail­box? By reg­u­larly record­ing exactly when the mail arrives, this makes esti­ma­tion much eas­ier, and if wire­less-en­abled, may obvi­ate the prob­lem entire­ly. This is chal­leng­ing, though: the mail­box is so far away that there are no power sock­ets near it so it will need to be bat­tery-pow­ered and to last for months (other­wise the has­sle of charg­ing it elim­i­nates the con­ve­nience), and wire­less con­nec­tions may not be pos­si­ble that many meters away with­out exces­sive power con­sump­tion.

    I looked into this, and found a few options:

    • the estab­lished niche of “peo­ple counter”/“door counter”/“traf­fic counter” elec­tronic devices, often imple­mented using a infrared beam which is tripped by objects mov­ing in between an emit­ter and sen­sor. They are intended for busi­ness use, priced out of my range, and typ­i­cally not long-range wire­less or bat­tery-pow­ered or store date-­stamps (rather than counts).

    • mail check­ing sys­tems, of which there are two main gad­gets intended for use­cases like mine (rural houses with extremely long dri­ve­ways):

      1. spring-pow­ered ‘flags’ which are latched into the door and pop up when it is opened (pre­sum­ably by the mail­man) and are vis­i­ble from a long dis­tance; the home­-owner then resets the spring when they pick up the mail. One down­side is that many involve drilling holes into the mail­box to keep the flag secure long-term against the weath­er. Exam­ple: “Mail Time! Yel­low Mail­box Alert Flag for Long Dri­ve­ways”, $16.

        This does­n’t work because it still requires me to man­u­ally log data, although I would not have to walk all the way to check, it is true. Nor does this resolve the mail prob­lem in gen­eral because one does not always have per­mis­sion to tam­per with the mail­box like that and it requires coop­er­a­tion from every­one who uses the mail­box (it’s no good if you’re the only one who resets the flag and it’s usu­ally up).

      2. wire­less mail­box alert sys­tems: a radio com­bined with a switch mounted inside the mail­box, which on con­tact being bro­ken, radios an alert. Exam­ple: “Dakota Alert 1000 [feet] Series”, $50.

        Most of these sys­tems don’t claim a range past ~30m, while I need at least 200 meters (and prob­a­bly much more, because man­u­fac­tur­ers are opti­mistic and my walls are con­crete). The exam­ple Dakota Alert claims to have the nec­es­sary range, but does­n’t include any esti­mates about bat­tery life, is expen­sive enough I’m not sure it’s worth it, and would still require man­ual record­ing of times.

      3. build one's own data logger; logging door or window or garage door openings or bathroom occupancy is a common starter project with Raspberry Pis or Arduinos, so it should be possible. Both presented serious problems as I looked into the details and thought about whether I could do it.

        • While a Rasp­berry Pi Zero costs $5 for the moth­er­board itself and is rel­a­tively easy to pro­gram & set up, it draws ~160 mil­liamps; this is not a prob­lem for most RP uses, where power out­lets are avail­able, but is dis­as­trous for remote sen­sors—a stan­dard cheap 1000-5000mil­liamp-hour bat­tery would be drained within a day. Solar pan­els can be hooked up to an RP to recharge the bat­tery, but that adds much com­plex­ity and between the bat­tery, solar pan­el, pan­el->­bat­tery adapter, and other wires, would cost >$80 for the parts. A PiJuice Solar would cost ~$80 but does­n’t ship until March 2016 (if ever). So a RP would work if I wanted to go as far as set­ting up a solar-pow­ered RP, spend­ing eas­ily $100+.
        • Arduinos use much less power—around 20 milliamps—and can be reduced to <1 milliamp, in which case, combined with a compatible battery like the $15 Lithium Ion Polymer Battery - 3.7v 2500mAh, power is no longer an issue and the logger could run the better part of a week even at 20 milliamps (2500mAh / 20mA ≈ 125 hours) or months at <1 milliamp. But Arduinos are much more challenging to use than Raspberry Pis—I have no experience in embedded electronics or wiring up such devices or programming in C, and so no confidence I could build it correctly. The $22 Adafruit Data logger (Adalogger) (documentation), based on the Arduino Uno, combines most of the hardware necessary. That still leaves connecting the reed magnetic switch (eg the "Seco-Larm SM-226L Garage Door Contacts for Closed Circuits", $11, or a $3.95 magnetic switch), programming the Adalogger, and optimizing power consumption to make recharges not a hassle. This makes it much closer to feasible and I thought seriously about it, but ultimately, I don't think I care enough at the moment about logging mail data to struggle through the learning curve of embedded & electrical engineering knowledge for this particular project.
      4. Wilder ideas include run­ning lasers & mir­rors; or run­ning a coax­ial cable out to the mail­box, putting a pho­to­di­ode at one end, and keep­ing the log­ger inside near a power out­let to resolve the wire­less & power issues. (But coax cable does cost mon­ey, leav­ing it exposed is not a good idea con­sid­er­ing the rid­ing lawn mow­er, and dig­ging up hun­dreds of meters of ground to bury the coax cable does not sound like fun.)

    ↩︎