Candy Japan's new box A/B test

Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.
statistics, decision-theory, Haskell, R, survival-analysis, power-analysis, Bayes, tutorial, design
2016-05-06–2016-08-26 finished certainty: likely importance: 8


I analyze an A/B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.

Candy Japan is a small mail-order company which, since 2011, sends semi-monthly small packages of Japanese candies & novelty foods from Japan to subscribers typically in the West. While mail-order subscription services such as tea or chocolate of the month clubs are not rare, Candy Japan publishes unusually detailed blog posts discussing their business, such as how they were nearly killed by credit card fraud, pricing, their overhead, (non) sales from YouTube stardom, life as an expat in Japan, and annual summaries (eg 2012, 2013, 2014, 2015), which are often discussed on Hacker News.

New Box A/B test

Starting on 2015-11-28 & publishing results 2016-05-05, CJ ran an A/B test (HN discussion) comparing subscription cancellation rates of customers sent candy in the standard undecorated box, and customers sent candy in prettier but more expensive boxes:

Plain unbranded boxes go for just $0.34 a piece, while a box with a full-color illustration printed on the cover costs almost twice as much: $0.67. This may not seem like such a big difference, but in absolute terms using the new box means around $500 less profit per month or roughly 10% of profit margin.

In group A 38.27%, or 168 of the 439 customers receiving the new package design canceled during the test. In group B 39.59%, or 175 of the 442 customers receiving the old package design canceled during the test. This is not a statistically significant difference. In a world where it makes no difference which package is sent, you would get a result as significant as this 80% of the time.

These cancellation rates strike me as bizarrely high1, but CJ has here a straightforward decision problem: should it switch to the decorated boxes or not, based on the information provided by this randomized experiment? After all, the decorated boxes did perform slightly better than the undecorated boxes.

Analysis

NHST

The most conventional approach here would be what CJ did: treat this as a 2-sample test of proportions, or a binomial regression, and compute a p-value for the treatment; if the p-value of a decreased cancellation rate is smaller than the arbitrary threshold 0.05, switch to the new boxes.

We can easily replicate CJ’s analysis given the provided percentages and _n_s (the group sizes in this case):

prop.test(c(175, 168), c(442, 439))
#   2-sample test for equality of proportions with continuity correction
#
# data:  c(175, 168) out of c(442, 439)
# X-squared = 0.11147052, df = 1, p-value = 0.7384761
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.05341862656  0.07989797596
# sample estimates:
#       prop 1       prop 2
# 0.3959276018 0.3826879271

So CJ’s p-value (0.80) approximately matches mine (0.74).2 Arguably, a one-tailed test here is appropriate since you are only interested in whether the new box has a better cancellation rate, which would be expected to halve the p-value, as it does:

prop.test(c(175, 168), c(442, 439), alternative="greater")
#   2-sample test for equality of proportions with continuity correction
#
# data:  c(175, 168) out of c(442, 439)
# X-squared = 0.11147052, df = 1, p-value = 0.3692381
# alternative hypothesis: greater
# 95% confidence interval:
#  -0.04306671907  1.00000000000
# sample estimates:
#       prop 1       prop 2
# 0.3959276018 0.3826879271

But of course, what does a p-value mean, anyway? CJ interprets what they do mean (“In a world where it makes no difference which package is sent, you would get a result as statistically-significant as this 80% of the time.”), but that doesn’t give us much help in understanding whether we are in the world where the choice of box does make an important difference, how large that difference is, or whether we should switch. (p-values are not posterior probabilities are not effect sizes are not utilities are not profits are not decisions…)

These are all questions that conventional approaches will struggle with, and p-values in particular are hard to use as a criterion for decisions: the result might be consistent with informative priors about the possible effect of box improvements; one might have compelling reason from earlier experiments to decide to switch to the new box; one might prefer the new boxes for other reasons; the new boxes might be profitable despite weak evidence (for example, if they cost the same); one might have such a large customer base that the result might be promising enough to justify further experimentation…

Bayesian

It would be better to do a more informed Bayesian decision-theoretic analysis (Raiffa & Schlaifer 1961, Pratt et al 1995), including our knowledge that improvements should be small, giving a concrete probability that there is an improvement and, if there is, how large, letting us calculate the probability that the switch would be profitable using money as our utility, the profitability of further experimentation, and how we would run an A/B test more efficiently in terms of maximizing profit rather than some other criterion.

Uninformative priors

Switching over to a Bayesian approach, using BayesianFirstAid gives us a bayes.prop.test function with the same interface as prop.test, using a highly uninformative Beta(1,1) prior (effectively a flat prior implying that every probability from 0-1 is equally likely):

library(BayesianFirstAid)
fit <- bayes.prop.test(c(175, 168), c(442, 439)); fit
#
#   Bayesian First Aid proportion test
#
# data: c(175, 168) out of c(442, 439)
# number of successes:  175, 168
# number of trials:     442, 439
# Estimated relative frequency of success [95% credible interval]:
#   Group 1: 0.40 [0.35, 0.44]
#   Group 2: 0.38 [0.34, 0.43]
# Estimated group difference (Group 1 - Group 2):
#   0.01 [-0.051, 0.078]
# The relative frequency of success is larger for Group 1 by a probability
# of 0.651 and larger for Group 2 by a probability of 0.349 .

BayesianFirstAid is overkill in this case as it calls out to JAGS for MCMC sampling, but the binomial/proportion test uses a tractable beta distribution where we can do the Bayesian update as simply as Beta(1,1) → Beta(1+175, 1+442-175) vs Beta(1+168, 1+439-168) for the two groups; the overlap can be found analytically or using dbeta() or simulated using rbeta(). Given the overhead, that would be much faster (~97x), although it wouldn’t be able to handle more complex real-world problems (eg any A/B test will probably want to include covariates, to improve power). This also makes it easy to implement informative priors as options. We could implement a replacement for bayes.prop.test like this:

betaPosterior <- function(xs, ns, n.samples=100000,
                          prior1.a=1, prior1.b=1, prior2.a=prior1.a, prior2.b=prior1.b) {
    sample1 <- rbeta(n.samples, prior1.a+xs[1], prior1.b+ns[1]-xs[1])
    sample2 <- rbeta(n.samples, prior2.a+xs[2], prior2.b+ns[2]-xs[2])
    return(list(Theta_1=sample1, Theta_2=sample2, Theta_diff=sample1 - sample2)) }

## Timing:
mean(replicate(1000, system.time(betaPosterior(  c(175, 168), c(442, 439), n.samples=10000))[3]))
# [1] 0.004309
mean(replicate(1000, system.time(bayes.prop.test(c(175, 168), c(442, 439)))[3]))
# [1] 0.420237
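
As a check on the claim that the overlap can be found analytically, the posterior probability that group 1’s rate exceeds group 2’s can be computed by numerically integrating the exact Beta posteriors rather than by simulation (a sketch; probGreater is a name introduced here for illustration):

probGreater <- function(xs, ns) {
    a1 <- 1+xs[1]; b1 <- 1+ns[1]-xs[1]
    a2 <- 1+xs[2]; b2 <- 1+ns[2]-xs[2]
    ## P(Theta_1 > Theta_2) = integral over [0,1] of dbeta(t, a1, b1) * pbeta(t, a2, b2) dt
    integrate(function(t) { dbeta(t, a1, b1) * pbeta(t, a2, b2) }, 0, 1)$value }
probGreater(c(175, 168), c(442, 439))
# ~0.65, agreeing with the simulation & BayesianFirstAid estimates above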

Informative priors

It’s often said that a null-hypothesis significance test is similar to Bayesian inference with flat priors, so our p-value winds up having a suspicious similarity to our posterior probability that the new box helped (which is P = 0.651). Of course, we have prior knowledge here: A/B tests, or switching boxes, cannot possibly lead to cancellation rates as high as 100% or as low as 0%, and in fact, it would be shocking if the usual 39% cancellation rate could be changed by more than a few percentage points; even ±5% would be surprising (but practically important).

A more realistic prior than Beta(1,1) would have a mean of 0.39 and a distribution narrow enough that, say, 95% of it falls within ±5% (34-44%). We can come up with a beta prior which encodes this belief. The mean of our beta distribution will be a/(a+b) = 0.39, which can be rearranged to b = a × (1-0.39)/0.39 ≈ 1.5642 × a; to solve for an a which gives the desired ±5% 95% CI, I looked at quantiles of random samples, increasing a (with b = 1.5642 × a) until the distribution is adequately narrow (there’s probably an analytic equation here but I didn’t bother looking it up):

a <- 900; quantile(rbeta(100000, a, 1.5642*a))
#           0%          25%          50%          75%         100%
# 0.3496646950 0.3831142015 0.3899169957 0.3967543674 0.4365377333
a; 1.5642*a
# [1] 900
# [1] 1407.78
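
(If one did want a semi-analytic route: holding the mean at 0.39 so that b = 1.5642 × a, the a whose central 95% interval spans exactly ±5 percentage points can be found with qbeta() & uniroot(). Note that this is a weaker criterion than the near-total range of 100,000 samples used above, so it returns a smaller, less informative a; priorWidth is a name introduced here for illustration.)

priorWidth <- function(a) { qbeta(0.975, a, 1.5642*a) - qbeta(0.025, a, 1.5642*a) }
uniroot(function(a) { priorWidth(a) - 0.10 }, interval=c(10, 10000))$root
# a smaller (less informative) a than the 900 found by the stricter sample-range criterion above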

So an informative prior here would be Beta(900,1407). Our betaPosterior function already supports user-specified priors, but BayesianFirstAid does not; at least it does make it possible to incorporate any change we wish by printing out all the code we need to run it, in the form of the R boilerplate and the JAGS model. We locate the dbeta(1,1) line and replace it with dbeta(900, 1407), and add in a convenience line theta_diff <- theta[1] - theta[2] which reports directly on what we care about (how much the new box decreases the cancellation rate):

model.code(fit)
# ...
## copy, edit, paste:
require(rjags)
x <- c(175, 168)
n <- c(442, 439)
model_string <- "model {
  for(i in 1:length(x)) {
    x[i] ~ dbinom(theta[i], n[i])
    theta[i] ~ dbeta(900, 1407)
    x_pred[i] ~ dbinom(theta[i], n[i])
  }
  theta_diff <- theta[1] - theta[2]
}"
model <- jags.model(textConnection(model_string), data = list(x = x, n = n),
                    n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("theta", "x_pred", "theta_diff"), n.iter=100000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
#    plus standard error of the mean:
#
#                    Mean           SD     Naive SE Time-series SE
# theta[1]   3.910082e-01  0.009262463 3.274775e-05   3.279952e-05
# theta[2]   3.889790e-01  0.009300708 3.288297e-05   3.270599e-05
# theta_diff 2.029166e-03  0.013132415 4.643010e-05   4.614718e-05
# x_pred[1]  1.727948e+02 11.062655213 3.911239e-02   3.908579e-02
# x_pred[2]  1.707583e+02 10.938572590 3.867369e-02   3.840689e-02
#
# 2. Quantiles for each variable:
#
#                    2.5%           25%          50%          75%        97.5%
# theta[1]     0.37298246   0.384715477 3.909551e-01   0.39720057   0.40937636
# theta[2]     0.37083603   0.382659446 3.889795e-01   0.39532425   0.40718608
# theta_diff  -0.02381196  -0.006805355 1.998807e-03   0.01086074   0.02785733
# x_pred[1]  151.00000000 165.000000000 1.730000e+02 180.00000000 195.00000000
# x_pred[2]  149.00000000 163.000000000 1.710000e+02 178.00000000 192.00000000
posteriors <-  (samples[1][[1]])[,3]
mean(posteriors > 0)
# [1] 0.5623

## Or using `betaPosterior()`
posteriors <- betaPosterior(x, n, prior1.a=900, prior1.b=1407)
mean(posteriors$Theta_diff > 0)
# [1] 0.5671
mean(posteriors$Theta_diff)
# [1] 0.002094560485

So a much more informative prior reduces the posterior probability that the new box reduced the cancellation rate to P = 0.56, with the mean estimate of the reduction being ~0.21%.

Decision

Benefit

Since this probability, while small, is still >50%, it is possible that switching to the new box is a good idea, but we will need to consider costs and benefits to turn our posterior estimate of the reduction into a posterior distribution of gains/losses.

What is the value of a reduction in cancellations such as 0.21%?

CJ says that it has a profit margin of ~$5000 on the 439+442=881 customers (881 is consistent with the totals reported in an earlier blog post) or ~$5.7 profit per customer per month. So a reduction from 0.3909 to 0.3847 is 4.96 customers staying, each of whom is worth $5.7, so a gain of $28.7 per month.
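
As a quick check of the per-customer profit figure, using only the numbers already quoted:

5000 / (439+442)   ## monthly profit margin spread over the 881 customers
# ~5.68, ie the ~$5.7 per customer per month used below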

What is the total benefit? Well, presumably it increases the lifetime value of a customer: the total they spend, and hence your amount of profit on them. If you profit $1 per month on a customer and they stay 1 year before canceling, you make $12; but if they stay 2 years on average, you’d make $24, which is better.

If a 39% cancellation rate per month is typical, then a customer will stick around for an average of 1/0.39 = 2.56 months (this is a geometric distribution with p = 0.39, whose mean is 1/p), for a lifetime value of $5.7 × 2.56 = $14.6. Then a gain is the difference between two lifetime values; so if we could somehow reduce cancellation rates by 10 percentage points for free, the gain per customer would be (new lifetime value - old lifetime value) - cost:

((5.7 * (1/(0.39-0.10)))  - (5.7 * (1/0.39))) - 0
# [1] 5.039787798

Which over CJ’s entire set of 881 customers is a cool additional $13553. If 881 is CJ’s steady-state and each customer lasts ~2.56 months, then by Little’s law (steady-state population = arrival rate × mean lifetime), CJ must be getting 881 / 2.56 ≈ 344 new customers per month. So the annual total gain from a hypothetical improvement r = 0.10 is

(344 * 12) * (((5.7 * (1/(0.39-0.10)))  - (5.7 * (1/0.39))) - 0)
# [1] 20804.24403

Finally, if the improvement is permanent & repeatable for all future customers, we should consider the net present value (NPV) of $20.8k per year. What should CJ calculate its NPV at? An online mail-order business is unstable and might go out of business at any time (and most businesses fail within the first few years), so a common discount rate like 5% is probably much too low to reflect the risk that CJ will go out of business before it has had a chance to profit from any improvements; I would suggest a discount rate of at least 10%, in which case we can estimate the NPV for that hypothetical r = 0.10 at:

20804.24403 / log(1.10)
# [1] 218279.3493

Putting all the profit formulas together to generalize it, I get:

improvementTotalGain <- function(customerProfit, cancellationRate, cancellationImprovement,
                                 customerN, discountRate) {
 oldLifetimeValue <- customerProfit * (1/cancellationRate)
 newLifetimeValue <- customerProfit * (1/(cancellationRate-cancellationImprovement))
 customerGain     <- newLifetimeValue - oldLifetimeValue
 annualGain       <- customerN * customerGain
 NPV              <- annualGain / log(1+discountRate)
 return(NPV) }

## hypothetical example
improvementTotalGain(5.7, 0.3909, 0.10, 344*12, 0.10)
# [1] 217103.0195
## actual example:
improvementTotalGain(5.7, 0.3909, 2.029166e-03, 344*12, 0.10)
# [1] 3295.503599

Cost-benefit

We don’t have a 10% point reduction in hand, but we do have a posterior distribution of reduction estimates.

So to transform the posterior distribution of cancellation decreases, each sample of our posterior distribution is run through the formula:

gain <- sapply(posteriors$Theta_diff, function(r) { improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10) } )
summary(gain)
#       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
# -78880.530 -10646.060   3352.669   4044.440  18125.030  88739.940

In this case, we estimate that rolling out the new box would increase profits by an NPV of ~$4044.

The new boxes cost $0.33 more per month per customer than the old boxes, so at 344 new customers per month with lifetimes of 2.57 months, that’s 344×12×2.57×-0.33 = -$3500 a year, which translates to an NPV of -$36,732.
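
For reference, the cost side uses the same NPV convention as the gains; a quick check of the -$36,732 figure, using only the numbers above:

344*12 * 2.57 * 0.33                  ## extra box cost per year
# [1] 3500.957
(344*12 * 2.57 * 0.33) / log(1.10)    ## as an NPV
# ~36732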

$4k - $36k yields a loss of ~$32k, so we conclude that the new boxes are an order of magnitude too expensive to be worthwhile. To justify spending on the new box, we would need the new-box price premium to drop to ~$0.04, or a reduction in the cancellation rate of at least ~2.16%.
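
The ~2.16% breakeven can be recovered by solving for the reduction at which the NPV of the gain equals the ~$36,732 cost NPV (a sketch reusing improvementTotalGain from above):

uniroot(function(r) { improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10) - 36732 },
        interval=c(0, 0.10))$root
# ~0.0216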

If we were absolutely certain that the reduction was as large as ~2.16%, then the new boxes would hit breakeven. How probable is it that the decrease in the cancellation rate is that large or larger? Improbable; only ~7%:

mean(posteriors$Theta_diff >= 0.0216)
# [1] 0.0731

So our decision is to stick with the old boxes.

Value of Information

Expected Value of Perfect Information (EVPI)

Still, 7% is not negligible - there is still a chance that we are making a mistake in not using the new boxes, as we did get evidence suggesting the new boxes are better. Are the results favorable enough to justify additional A/B testing?

This gets into the “expected value of perfect information” (EVPI): how valuable would a definitive answer be to the question of whether the decrease is better than 2.16%? Would we be willing to pay $10 or $100 or $1000 for an oracle’s answer?

In 93% of the cases, we believe the answer would be ‘no’: the oracle is worth nothing since it could only tell us what we already believe (that the decrease is less than 2.16%), in which case we remain with the status quo profit of $0 and are no better off, so in those cases the oracle was worth $0; in the other 7% of the cases, the answer would be ‘yes’ and we would change our decision and make some expected profit. So the value of the oracle is $0 + expected-value-of-those-other-7%s.

But in that case, our gain depends on how large the cancellation reduction is - if it’s exactly 2.17%, the gain is ~$0 so we are indifferent, but if the reduction is 3% or 4% or even higher like 5%, then we would have been leaving real money on the table ($52k, $72k, & $93k respectively). Of course, each of those large gains is increasingly unlikely, so we need to go back to the posterior distribution and weight our gains by averaging over the possible reductions:

EVPI <- mean(sapply(posteriors$Theta_diff,
         function(r) { max(0, ## old
                            improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10)) } )) ## new
EVPI
# [1] 10773
## convert value back to how many boxes that would pay for:
round(EVPI / 0.33)
# [1] 332645

So because of the small possibility of a profitable box (which might be very profitable applied to all customers indefinitely), we’d be willing to pay up to a grand total of $10773 for perfect information.

That would be enough to pay for 33.2k new boxes; used as part of another A/B test, it would provide extreme power by standard p-value testing3, so we might wonder if further experimentation is profitable.

Expected Value of Sample Information (EVSI)

How much would n more observations be worth, the “expected value of sample information” (EVSI)? The EVSI must be smaller than the EVPI (and the n = 332645 boxes it would pay for); we can estimate exactly how much smaller by repeatedly simulating drawing n more observations, doing a Bayesian update, recalculating our expected profits for both choices (status quo vs new box), deciding whether to switch, and recording our profit if we do switch. This can be used to plan a fixed-sample experiment by finding the value of n which maximizes the EVSI: if n is too small (eg n = 1), it doesn’t affect our decision, but if n is too big (n = 100,000) it is overkill.

First, we can automate the posterior & profit analysis like so:

posterior <- function(x, n) { betaPosterior(x,n, prior1.a=900, prior1.b=1407)$Theta_diff }
posteriorProfit <- function(x, n) {
    posteriorReduction <- posterior(x, n)
    gains <- sapply(posteriorReduction,
              function(r) { improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10) - 36732 })
    return(list(Profit=gains, Reduction=posteriorReduction)) }

gains <- posteriorProfit(x=c(175, 168), n=c(442, 439))
mean(gains$Profit)
# [1] -33019.54686

So with the current data, we would suffer an expected loss of $33k by switching to the new box.

It is easy to simulate collecting another datapoint since it’s binary data without any covariates or anything: draw a possible cancellation probability from the posterior distribution for that group, and then flip a coin with the new probability.

simulateData <- function(posterior) { rbinom(1, 1, prob=sample(posterior, 1)) }

Now we can repeatedly simulate fake data, add it to the real data, rerun the analysis, see what the new estimated profit is from the best action (usually we will conclude what we already conclude, that the box is not worthwhile, and the value of the new information is then $0 - but in some possible universes the new data will change our decision), compare the new estimated profit against the old profit, and thus see whether the increase in profit resulting from the new datapoints justifies their cost.

library(parallel)
library(plyr)

evsiEstimate <- function(x, n, n_additional, iters=1000) {

   originalPosterior <- betaPosterior(x, n, prior1.a=900, prior1.b=1407)
   gains <- posteriorProfit(x=x, n=n)
   oldProfit <- mean(gains$Profit)

   evsis <- unlist(mclapply(1:iters, ## parallelize
      function (i) {
                ## draw a set of hypothetical parameters from the posterior
                controlP      <- sample(originalPosterior$Theta_1, 1)
                experimentalP <- sample(originalPosterior$Theta_2, 1)
                ##  simulate the collection of additional data
                control      <- replicate(n_additional, simulateData(controlP))
                experimental <- replicate(n_additional, simulateData(experimentalP))

                ## the old box profit is 0; what's the estimated profit of new boxes given additional data?
                simGains <- posteriorProfit(x=c(x[1]+sum(control), x[2]+sum(experimental)),
                                            n=c(n[1]+n_additional, n[2]+n_additional))
                newBoxProfit <- mean(simGains$Profit)
                oldBoxProfit <- 0

                ## choose the maximum of the two actions:
                evsi <- max(c(newBoxProfit, oldBoxProfit))
                return(evsi) }
   ) )
   return(mean(evsis))
   }

## Example EVSI estimates for various possible experiment sizes:
evsiEstimate(c(175, 168), c(442, 439), n_additional=1)
# [1] 0
evsiEstimate(c(175, 168), c(442, 439), n_additional=100)
# [1] 0
evsiEstimate(c(175, 168), c(442, 439), n_additional=500)
# [1] 0
evsiEstimate(c(175, 168), c(442, 439), n_additional=1000)
# [1] 4.179743603
evsiEstimate(c(175, 168), c(442, 439), n_additional=2000)
# [1] 33.58719093
evsiEstimate(c(175, 168), c(442, 439), n_additional=3000)
# [1] 152.0107205
evsiEstimate(c(175, 168), c(442, 439), n_additional=4000)
# [1] 259.4423937
evsiEstimate(c(175, 168), c(442, 439), n_additional=5000)
# [1] 305.9021146
evsiEstimate(c(175, 168), c(442, 439), n_additional=6000)
# [1] 270.1474476
evsiEstimate(c(175, 168), c(442, 439), n_additional=7000)
# [1] 396.8461236
evsiEstimate(c(175, 168), c(442, 439), n_additional=8000)
# [1] 442.0281358

## Search for _n_ maximizing the EVSI minus the cost of the samples ("Expected Net Benefit"/ENBS)
optimize(function(n) { evsiEstimate(c(175, 168), c(442, 439), n_additional=n, iters=5000) - 0.33*n; },
 interval=c(1, 20000), maximum=TRUE, tol=1)
# $maximum
# [1] 1.483752763
# $objective
# [1] -0.4896384119

EVSI exhibits an interesting behavior in that decisions are discrete, so unlike what one might intuitively expect, the EVSI of eg n = 1-100 can be zero while the EVSI of n = 1000 can suddenly be large. Typically an EVSI curve will be zero (and hence expected net profit increasingly negative) for small sample sizes where the data cannot possibly change one’s decision no matter how positive it looks; then, once the sample does become ample enough to affect the decision, it becomes increasingly valuable until a peak is reached, diminishing returns set in, and it eventually stops improving noticeably (while the cost continues to increase linearly).
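
One way to see this threshold shape is to plot the expected net benefit of sampling (EVSI minus sampling cost) over a grid of candidate sample sizes, reusing evsiEstimate from above (a sketch; slow, since every point reruns the whole simulation):

ns   <- c(100, 500, 1000, 2000, 4000, 8000, 16000)
enbs <- sapply(ns, function(n) { evsiEstimate(c(175, 168), c(442, 439), n_additional=n) - 0.33*n })
plot(ns, enbs, type="b", xlab="Additional samples per group", ylab="Expected net benefit ($)")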

In this case, the end result of the CJ experiment is that no fixed-sample extension is worthwhile, as the EVSI remains less than the cost of the sample for all natural numbers.

Sampling to a foregone conclusion

In considering whether to pay for an experiment, the parameters need to be in a fairly narrow range: if the prior probability of success is high (or the potential profit high, or the cost low, or some combination thereof), then one is best off simply implementing the intervention without any further experimentation; while if the prior probability is low (or the potential profit low, or the cost high), the intervention is not worth testing at all (since the data is unlikely to be encouraging enough to justify adopting it); only in between is the probability of profit sufficiently uncertain that the EVSI is high enough to justify running an experiment. In the other two cases, collecting data is sampling to a foregone conclusion: regardless of what the first datapoint turns out to be, the decision will still be the same; and if the first datapoint doesn’t change one’s decision, why bother collecting it?

Earlier I noted that from the cost of the new boxes and the value of customers, the new boxes would have to reduce cancellation by at least ~2.16% just to break even. This is already a priori fairly unlikely because it seems that cancellations ought to be primarily due to issues like pricing, selection of wares, social media marketing, customer support & problems in shipping or charging, and that sort of thing - packaging is the sort of thing a business should experiment on when it’s run out of better ideas. And then the profit from a >2.16% reduction must pay for the experimentation costs which could establish such a reduction. This raises a question: was it ever a good idea to decide to run such an experiment?

We can ask what the EVPI & EVSI were before the experiment was begun, when no data had been collected, based on our informative prior and the known gains/costs:

posteriorsBefore <- betaPosterior(c(0,0), c(0,0), prior1.a=900, prior1.b=1407)
EVPIBefore <- mean(sapply(posteriorsBefore$Theta_diff,
              function(r) { max(0, improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10)) } ))
EVPIBefore
# [1] 9656.250988

(Note that the EVPI a posteriori was larger than this EVPI a priori, since we received evidence which increased the probability of beneficial effects.)

evsiEstimate(c(0, 0), c(0, 0), n_additional=1)
# [1] 0
evsiEstimate(c(0, 0), c(0, 0), n_additional=100)
# [1] 0
evsiEstimate(c(0, 0), c(0, 0), n_additional=442)
# [1] 0
evsiEstimate(c(0, 0), c(0, 0), n_additional=500)
# [1] 0
evsiEstimate(c(0, 0), c(0, 0), n_additional=1000)
# [1] 24.59309008
evsiEstimate(c(0, 0), c(0, 0), n_additional=2000)
# [1] 87.27781568
evsiEstimate(c(0, 0), c(0, 0), n_additional=3000)
# [1] 163.0583877
evsiEstimate(c(0, 0), c(0, 0), n_additional=4000)
# [1] 296.6085928
evsiEstimate(c(0, 0), c(0, 0), n_additional=5000)
# [1] 284.8454662
evsiEstimate(c(0, 0), c(0, 0), n_additional=6000)
# [1] 302.5872152
evsiEstimate(c(0, 0), c(0, 0), n_additional=7000)
# [1] 313.9936967
evsiEstimate(c(0, 0), c(0, 0), n_additional=8000)
# [1] 536.5548796
evsiEstimate(c(0, 0), c(0, 0), n_additional=20000)
# [1] 633.8689055
evsiEstimate(c(0, 0), c(0, 0), n_additional=1000000)
# [1] 1001.659894

optimize(function(n) { evsiEstimate(c(0, 0), c(0, 0), n_additional=n, iters=5000) - 0.33*n; },
         interval=c(1, 20000), maximum=TRUE, tol=1)
# $maximum
# [1] 1.483752763
# $objective
# [1] -0.4896384119

So while there are worthwhile gains to be found, a priori, experimentation costs too much for any fixed-sample trial to be cost-effective. To change this, we need some combination of cheaper experiments (eg boxes costing $0.07 each), more powerful experiments (adding in covariates to explain variance & allow inference at smaller sample sizes?), some reason for a more optimistic prior (positive results from other experiments indicating large effects), a reduction in discount rate (a new owner of CJ in for the long run?) or increase in customer base or customer lifetime value, or use of a different experimental approach entirely, like an adaptive trial.
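
For instance, reading the hypothetical $0.07 boxes as a $0.07 premium over the plain box (an assumption - the exact pricing is not specified), the cost NPV and the breakeven reduction shrink considerably, reusing the functions above:

costNPV <- 344*12 * 2.57 * 0.07 / log(1.10)   ## ~$7.8k instead of ~$36.7k
uniroot(function(r) { improvementTotalGain(5.7, 0.3909, r, 344*12, 0.10) - costNPV },
        interval=c(0, 0.10))$root
# breakeven reduction of roughly 0.5%, on which the posterior puts much more probability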

Adaptive trials

Fixed-sample experiments, whether planned using EVSI or not, suffer from the problem that they receive feedback only in fixed steps: even if the first half of the data is extremely unpromising, one still has to pay for the second half of the data. One has no option to stop early and change one’s decision. In a sequential trial, in which the data comes in smaller chunks (ideally, 1 by 1), one pays for information more gradually; as it is always better to have information than to not have it, sequential trials can be much better than fixed-sample trials, coming closer to an optimal experiment.

Decision tree

We could see the CJ experiment as a two-player game against Nature, and calculate a game tree which gives the optimal decision by backwards induction (example for an A/B test; see also chapter 38, “Sequential Decision Procedures” (pg590) of Schlaifer 1959). This will maximize our profit during the sequence of decisions (but not necessarily afterwards).

In this conception, we

  1. create a tree in which a level expresses our two choices (to sample 1 old box or 1 new box), and then those 2 nodes link to a deeper layer of 2 nodes expressing Nature’s choices (to have the customer cancel or stay).

  2. Now each node has the cancellations & _n_s available, so we can calculate what the posterior probability of cancellation would be conditional on us and Nature making a particular sequence of choices.

    (If we choose 9 new boxes and Nature chooses 9 cancellations, the posterior probability of cancellation is going to increase a lot, and vice versa if Nature chose 0 cancellations.)

  3. Then for each outcome node, we can define a loss: if the customer cancels and we used a new box, we lose their subsequent lifetime value of $14.59 and the cost of the new box of $0.33; if we used an old box and they canceled, just $14.59; if they don’t cancel and we used an old box, we gain the monthly profit of $5.7, and if they don’t cancel and we used a new box, the monthly profit of $5.7 minus the new box cost of $0.33.

  4. With the posterior probability and losses, we can then calculate the expected value of each node.

  5. Now that every node has an expected value, we can then propagate the values backwards to each choice node; the value of a choice is not the average of the two outcomes - because we are choosing intelligently rather than at random - but the maximum of the two choices. We start at the root/first choice and recursively define the profit as the loss plus the sum of its best descendant; when we reach a terminal/leaf node, just like in the original decision analysis or EVSI, we consider the expected profit of a new box (often something like -$33,000) vs an old box ($0) and we pick the maximum. With that, the optimal decision can be read off the tree.

  6. It is helpful for reading if we finish by pruning the decision tree and deleting any subtrees which are inferior to their siblings, since we will never follow them.

Here is a (slow and inefficient) implementation in R using data.tree; I’ll assume that cancellation is known immediately and no prior information is available on new vs old boxes, and we are using an uninformative prior (to demonstrate switching/exploration):

library(data.tree)

createTree <- function(depth, tree=NULL) {
    cCancellationLoss <- -14.59
    cNoncancellationGain <- 5.7

    eCancellationLoss <- -0.33 + -14.59
    eNoncancellationGain <- -0.33 + 5.7

    if (is.null(tree)) { tree <- Node$new("", cancel_e=0, n_e=0, cancel_c=0, n_c=0, loss=0, profit=0) }

    if (depth != 0) {
    name <- tree$name
       if (tree$name == "Experiment") {
         # 2 possibilities: experimental & canceled, experimental & not-canceled
         ec <- tree$AddChild("Canceled", cancel_e = tree$cancel_e+1, n_e=tree$n_e+1, cancel_c=tree$cancel_c,
                                          n_c=tree$n_c, loss=eCancellationLoss, profit=NA)
         createTree(depth-1, tree=ec)
         enc <- tree$AddChild("Not.c", cancel_e = tree$cancel_e, n_e=tree$n_e+1, cancel_c=tree$cancel_c,
                                       n_c=tree$n_c, loss=eNoncancellationGain, profit=NA)
         createTree(depth-1, tree=enc)
       } else {
        if (tree$name == "Control") {
            cc <- tree$AddChild("Canceled", cancel_e = tree$cancel_e, n_e=tree$n_e, cancel_c=tree$cancel_c+1,
                                            n_c=tree$n_c+1, loss=cCancellationLoss, profit=NA)
            createTree(depth-1, tree=cc)
            cn <- tree$AddChild("Not.c", cancel_e = tree$cancel_e, n_e=tree$n_e, cancel_c=tree$cancel_c,
                                         n_c=tree$n_c+1, loss=cNoncancellationGain, profit=NA)
            createTree(depth-1, tree=cn) } else {
                 e <- tree$AddChild("Experiment", cancel_e = tree$cancel_e, n_e=tree$n_e, cancel_c=tree$cancel_c,
                                                  n_c=tree$n_c, loss=NA, profit=NA)
                 createTree(depth, tree=e)
                 c <- tree$AddChild("Control", cancel_e = tree$cancel_e, n_e=tree$n_e, cancel_c=tree$cancel_c,
                                               n_c=tree$n_c, loss=NA, profit=NA)
                 createTree(depth, tree=c)
                 }
            }
    }
   return(tree)
}

propagateProbabilities <- function(tree) { tree$Do(function(node) {
    posterior <- betaPosterior(c(node$cancel_c, node$cancel_e), c(node$n_c, node$n_e))
    p <- if (node$name == "Control") { mean(posterior$Theta_1) } else { mean(posterior$Theta_2) }
    node$children[[1]]$P <- p
    node$children[[2]]$P <- 1-p
    }, filterFun = function(n) { n$name!="Canceled" && n$name!="Not.c" }) }

estimateProfit <- function(x,n) { mean(posteriorProfit(x=x, n=n)$Profit) }

propagateProfit <- function(node) {
    if (isLeaf(node)) {
    node$profit <- node$loss + max(0, estimateProfit(c(node$cancel_c, node$cancel_e), c(node$n_c, node$n_e)))
        } else {
            propagateProfit(node$children[[1]])
            propagateProfit(node$children[[2]])

            if (node$name=="Experiment"||node$name=="Control") {
               node$profit <- (node$children[[1]]$P * node$children[[1]]$profit) +
                              (node$children[[2]]$P * node$children[[2]]$profit)
                              }
                              else {
                                      node$profit <- node$loss + max(node$children[[1]]$profit,
                                                                 node$children[[2]]$profit)
                                      }
}}
prune <- function(node) {
    if (!isLeaf(node)) {
    if (node$name=="Canceled" || node$name=="Not.c" || node$name=="") {
     if (node$children[[1]]$profit < node$children[[2]]$profit) { node$children[[1]] <- NULL } else {
                                                                  node$children[[2]] <- NULL }

     prune(node$children[[1]]$children[[1]])
     prune(node$children[[1]]$children[[2]])
    }
  }
}

tmp <- createTree(4)
propagateProbabilities(tmp)
propagateProfit(tmp)
prune(tmp)

The optimal decision tree for n = 4:

print(tmp, "cancel_e", "n_e", "cancel_c", "n_c", "loss", "P", "profit")
#                                   levelName cancel_e n_e cancel_c n_c   loss            P         profit
# 1                                                  0   0        0   0   0.00           NA -12.6316427851
# 2   °--Control                                     0   0        0   0     NA 0.5005166416 -12.6316427851
# 3       ¦--Not.c                                   0   0        0   1   5.70 0.4991407526   2.9014261861
# 4       ¦   °--Control                             0   0        0   1     NA           NA  -2.7985738139
# 5       ¦       ¦--Not.c                           0   0        0   2   5.70 0.6662979782   6.9510036217
# 6       ¦       ¦   °--Control                     0   0        0   2     NA           NA   1.2510036217
# 7       ¦       ¦       ¦--Not.c                   0   0        0   3   5.70 0.7496239474   7.3477044374
# 8       ¦       ¦       ¦   °--Control             0   0        0   3     NA           NA   1.6477044374
# 9       ¦       ¦       ¦       ¦--Not.c           0   0        0   4   5.70 0.8002811453   5.7000000000
# 10      ¦       ¦       ¦       °--Canceled        0   0        1   4 -14.59 0.1997188547 -14.5900000000
# 11      ¦       ¦       °--Canceled                0   0        1   3 -14.59 0.2503760526 -17.0024710375
# 12      ¦       ¦           °--Control             0   0        1   3     NA           NA  -2.4124710375
# 13      ¦       ¦               ¦--Not.c           0   0        1   4   5.70 0.6001739262   5.7000000000
# 14      ¦       ¦               °--Canceled        0   0        2   4 -14.59 0.3998260738 -14.5900000000
# 15      ¦       °--Canceled                        0   0        1   2 -14.59 0.3337020218 -22.2654134133
# 16      ¦           °--Experiment                  0   0        1   2     NA           NA  -7.6754134133
# 17      ¦               ¦--Not.c                   0   1        1   2   5.37 0.5009864481   3.9780861684
# 18      ¦               ¦   °--Experiment          0   1        1   2     NA           NA  -1.3919138316
# 19      ¦               ¦       ¦--Not.c           0   2        1   2   5.37 0.6667366273   5.3700000000
# 20      ¦               ¦       °--Canceled        1   2        1   2 -14.92 0.3332633727 -14.9200000000
# 21      ¦               °--Canceled                1   1        1   2 -14.92 0.4990135519 -19.3749861841
# 22      ¦                   °--Control             1   1        1   2     NA           NA  -4.4549861841
# 23      ¦                       ¦--Not.c           1   1        1   3   5.70 0.4995078273   5.7000000000
# 24      ¦                       °--Canceled        1   1        2   3 -14.59 0.5004921727 -14.5900000000
# 25      °--Canceled                                0   0        1   1 -14.59 0.5008592474 -28.1114163485
# 26          °--Experiment                          0   0        1   1     NA           NA -13.5214163485
# 27              ¦--Not.c                           0   1        1   1   5.37 0.5004016883   2.5851685160
# 28              ¦   °--Experiment                  0   1        1   1     NA           NA  -2.7848314840
# 29              ¦       ¦--Not.c                   0   2        1   1   5.37 0.6671687861   5.6595106846
# 30              ¦       ¦   °--Experiment          0   2        1   1     NA           NA   0.2895106846
# 31              ¦       ¦       ¦--Not.c           0   3        1   1   5.37 0.7496062437   5.3700000000
# 32              ¦       ¦       °--Canceled        1   3        1   1 -14.92 0.2503937563 -14.9200000000
# 33              ¦       °--Canceled                1   2        1   1 -14.92 0.3328312139 -19.7117340045
# 34              ¦           °--Experiment          1   2        1   1     NA           NA  -4.7917340045
# 35              ¦               ¦--Not.c           1   3        1   1   5.37 0.4991752585   5.3700000000
# 36              ¦               °--Canceled        2   3        1   1 -14.92 0.5008247415 -14.9200000000
# 37              °--Canceled                        1   1        1   1 -14.92 0.4995983117 -29.6539013285
# 38                  °--Control                     1   1        1   1     NA           NA -14.7339013285
# 39                      ¦--Not.c                   1   1        1   2   5.70 0.3336765541   1.2854091899
# 40                      ¦   °--Control             1   1        1   2     NA           NA  -4.4145908101
# 41                      ¦       ¦--Not.c           1   1        1   3   5.70 0.5014987279   5.7000000000
# 42                      ¦       °--Canceled        1   1        2   3 -14.59 0.4985012721 -14.5900000000
# 43                      °--Canceled                1   1        2   2 -14.59 0.6663234459 -22.7559338216
# 44                          °--Experiment          1   1        2   2     NA           NA  -8.1659338216
# 45                              ¦--Not.c           1   2        2   2   5.37 0.3328765982   5.3700000000
# 46                              °--Canceled        2   2        2   2 -14.92 0.6671234018 -14.9200000000

So in this scenario, we start with a control box (unsurprisingly, as it’s cheaper) but if we get bad feedback, we’ll try switching to the new boxes, and if that doesn’t work, switch back to controls immediately.

n = 9 is similar but the tree is unwieldy to show inline; one can view the tree as a CSV file.

What about a more realistic example like n = 800, since the necessary samples for an A/B test might get into the thousands? Unfortunately, such brute force planning doesn’t scale well. Naively, with two possible actions and two possible outcomes we’d get a full tree of 4^n nodes (alternating choice and outcome layers), so we’d need to store 4^10 = 1,048,576 nodes to express an n = 10 trial. As it is, the n = 9 tree is the largest I can fit in RAM.

In this case we can do better: with i.i.d. binomial samples and the sufficient statistics of m cancels out of n trials, then for a total sample/horizon of 100, there are 100 possible splits between the two options - 100 new boxes and 0 old boxes, 99 new boxes and 1 old box … 0 new boxes and 100 old boxes etc - and each has 100 possible ‘cancels of trials’ - 1 canceled out of 100 … 100 canceled out of 100 - so the deepest layer has 100^2 = 10,000 terminal nodes and our number of nodes grows slower than exponentially (quartic?); this is feasible.

Neil Shepperd wrote a version in Haskell & Data.Vector:

import           Data.Function
import           Data.Vector (Vector)
import qualified Data.Vector.Generic as V

npv annual = annual / log 1.10

-- unrealistic assumptions
npv_customer = 14.620912124582869
monthly = 5.7
new_box_cost = 0.33

data Beta = B Double Double

posterior :: Beta -> Int -> Int -> Double
posterior (B alpha beta) n_cancel n_trial = (alpha + fromIntegral n_cancel) /
                                            (alpha + beta + fromIntegral n_trial)

-- Expected posterior lifetime. This is not just 1 / posterior, because we need
-- E[1 / p], which is the harmonic mean of the beta distribution, not 1 / E[p].
lifetime :: Beta -> Int -> Int -> Double
lifetime (B alpha beta) n_cancel n_trial = (alpha + beta + fromIntegral n_trial - 1) /
                                           (alpha + fromIntegral n_cancel - 1)

uninformative = B 1.1 1.1 -- B 1 1 is unstable & produces Infinities
informative = B 900 1407
informative_old = B (900 + 175) (1407 + 442 - 175)
informative_new = B (900 + 168) (1407 + 439 - 168)

posterior_old, posterior_new :: Int -> Int -> Double
posterior_old = posterior uninformative
posterior_new = posterior uninformative

lifetime_old, lifetime_new :: Int -> Int -> Double
lifetime_old = lifetime uninformative
lifetime_new = lifetime uninformative

type Depth = Int

-- Comparison strategy - "experiment" where we just send the old boxes d times
idlevalue :: Depth -> Double
idlevalue d = (-npv_customer) * n_cancel + future
  where
    p_cancel = posterior_old 0 0
    n_cancel = p_cancel * fromIntegral d
    future = npv (lifetime_old 0 0 * monthly * 344 * 12)

-- Adaptive experiment strategy
value :: (Depth -> Int -> Int -> Int -> Int -> Double) -> Depth -> Int -> Int -> Int -> Int -> Double
value loop 0 old_trials old_cancel new_trials new_cancel = e_box_cost + cancelled_loss + future
  where
    -- NPV at end of experiment consists of:
    -- Cost of the boxes used in experiment
    e_box_cost = (-new_box_cost) * fromIntegral new_trials
    -- Income lost from customers who cancelled
    cancelled_loss = -npv_customer * fromIntegral (old_cancel + new_cancel)
    -- Expected future profit from following the suggested policy (old or new boxes)
    future = npv (future_customer_totals * 344 * 12)
    future_customer_totals = (max
                              (lifetime_old old_cancel old_trials * monthly)
                              (lifetime_new new_cancel new_trials * (monthly - new_box_cost)))
    -- (not used: Profit from customers who did not cancel)
    -- noncancel_profit = monthly * fromIntegral (new_trials + old_trials - old_cancel - new_cancel)
value loop n old_trials old_cancel new_trials new_cancel = max value_new value_old
  where
    p_cancel_new = posterior_new new_cancel new_trials
    p_cancel_old = posterior_old old_cancel old_trials
    value_new = p_cancel_new * loop (n-1) old_trials old_cancel (new_trials + 1) (new_cancel + 1) +
                (1 - p_cancel_new) * loop (n-1) old_trials old_cancel (new_trials + 1) new_cancel
    value_old = p_cancel_old * loop (n-1) (old_trials + 1) (old_cancel + 1) new_trials new_cancel +
                (1 - p_cancel_old) * loop (n-1) (old_trials + 1) old_cancel new_trials new_cancel

-- memoized version
run :: Depth -> Double
run d = getvalue d 0 0 0 0
  where
    memo :: Vector (Vector (Vector (Vector Double)))
    memo = V.generate (d+1) $ \n ->
      let trials = d - n in
      V.generate (trials + 1) $ \new_trials ->
      let old_trials = trials - new_trials in
      V.generate (new_trials + 1) $ \new_cancel ->
      V.generate (old_trials + 1) $ \old_cancel ->
      value getvalue n old_trials old_cancel new_trials new_cancel

    getvalue n ot oc nt nc = memo V.! n V.! nt V.! nc V.! oc

-- direct recursion version
run' :: Depth -> Double
run' d = fix value d 0 0 0 0

main = print (run 242) >> print (idlevalue 242)

This version when compiled is performant enough to reach n = 242 before running out of RAM (peaks at ~9.6GB use). A tree this big is hard to visualize, so we can compare the value of the optimal trial to that of a constant decision to use only the old box. With uninformative priors, at n = 242, it estimates that the constant strategy is worth $2,960,718 but the adaptive trial is worth $5,275,332. With the informative prior, at n = 242, the constant strategy is worth $631869 and the adaptive trial is worth $631869 (that is, the new boxes are never worth sampling with such a short horizon), and likewise at shorter horizons like n = 100 ($632679).

To be sampling over a long enough run to make the new boxes worth testing despite the informative priors’ negativity, we might need to run as long as n = 10000, in which case there would now be 10000^2 = 1e+08 terminal nodes to evaluate and propagate back through the decision tree, in addition to their storage space, which is problematic. At this point, one would have to switch to specialized MDP/POMDP solvers or approximation solvers which use clever abstractions to prune the tree, generalize over many states, or expand only parts of the tree for deeper evaluation.

Multi-armed bandits

A more easily implemented approach would be to treat it not as an experiment which must terminate soon, but as a perpetual choice by CJ: a multi-armed bandit problem.

Then the solution is simple: Thompson sampling. We calculate the probability that the new box is profitable (which we already know how to do) and allocate a customer to new boxes with that probability. It hardly takes more than a line of code, yet it will automatically balance exploration & exploitation while minimizing wasted samples.
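
The allocation rule itself is just a posterior draw compared against zero, the same sample(gains$Profit, 1) > 0 line that appears inside the simulation below:

## draw one sample from the posterior profit distribution; send the new box iff it is positive
useNewBox <- sample(posteriorProfit(x=c(175, 168), n=c(442, 439))$Profit, 1) > 0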

We can try simulating out something similar to the CJ experiment by imagining that CJ mails off 1 package per day and 31 days later, learns if the customer has canceled or not.4 Run this for, say, 20,000 days, and watch how the total cost and estimated probability of profitability evolve under various scenarios.

oldP <- 3.910082e-01

## 4 possible scenarios: -5%, 5%, exactly equal, or exactly equal to sample estimates:
# newP <- oldP - 0.05
# newP <- oldP + 0.05
# newP <- 3.889790e-01
newP <- oldP

maximumN <- 20000
thompsonSample <- TRUE ## alternative is simple 50-50 randomization

## If we want  a data.frame expressing the results as of the end of the CJ test:
# a <- c(rep(TRUE, 168), rep(FALSE, 439-168))
# b <- c(rep(TRUE, 175), rep(FALSE, 442-175))
# interventionState <- c(rep(TRUE, 439), rep(FALSE, 442))
# customersTotal <- data.frame(Date=-880:0, N=881, Intervention=interventionState, Canceled=c(a,b),
#                              Profit=NA, ProfitP=NA, Reduction=NA, Cost=c(rep(0.33, 439), rep(0.00, 442)))
customersTotal <- data.frame(Date=numeric(), N=numeric(), Control.p=numeric(), Experiment.p=numeric(),
                             Intervention=logical(), Canceled=integer(), Profit=numeric(), ProfitP=numeric(),
                             Reduction=numeric(), Cost=numeric(), Cost.total=numeric())

for (date in 1:maximumN) {

  customers <- subset(customersTotal, Date <= (date-31))

  cancels <- c(sum(subset(customers, !Intervention)$Canceled),
               sum(subset(customers, Intervention)$Canceled))
  ns <- c(nrow(subset(customers, !Intervention)),
          nrow(subset(customers, Intervention)))
  results <- cancels / ns

  gains <- posteriorProfit(x=cancels, n=ns)
  profitP <- mean(gains$Profit>0)

  intervention <- if (thompsonSample) { sample(gains$Profit, 1) > 0 } else {
                   as.logical(rbinom(1,1, prob=0.5)) }
  canceled <- rbinom(1,1, prob=if(intervention) { newP; } else { oldP; })

  customersTotal <- rbind(customersTotal,
                          data.frame(Date=date, N=nrow(customers), Control.p=round(results[1],digits=3),
                                     Experiment.p=round(results[2], digits=3), Intervention=intervention,
                                     Canceled=canceled, Profit=round(mean(gains$Profit)),
                                     ProfitP=round(profitP, digits=4),
                                     Reduction=round(mean(gains$Reduction), digits=4),
                                     Cost=if(intervention) { 0.33; } else { 0; }, Cost.total=NA))
  customersTotal$Cost.total <- cumsum(customersTotal$Cost)
  print(tail(customersTotal, n=1))
}

A multi-armed bandit approach never turns in a final decision but instead allocates ever fewer samples to inferior arms; since a MAB never sets an arm to 0, it never makes a final decision (which might be wrong). Still, after n = 10,000 or 20,000, the results are unlikely to change and it becomes extremely unlikely for a bad box to be chosen.

Here are the results of 1 run of each of the 4 possible scenarios:

  • new boxes are 5% worse (unprofitable): $56.76 total spent, profit P = 0.0052, n is 172 of 20000
  • new boxes have equal probability (unprofitable): $93.39 total spent, profit P = 0.0172, n is 283 of 20000
  • new boxes at the sample estimates (unprofitable): $90.42 total spent, profit P = 0.0062, n is 274 of 20000
  • new boxes are 5% better (profitable): $3552.12 total spent, profit P > 0.9999 (P > 0.50 by n = 10000), n is 10764 of 20000

The Thompson sampler has reached the correct conclusion in all cases, at modest costs in the unprofitable cases. The simplicity and efficiency of Thompson sampling, and the ease of aligning it with business objectives, is one reason it’s popular in the tech world.

Conclusion

So to round up, by employing Bayesian decision theory instead of standard NHST p-values, we can conclude in a principled and flexible fashion that:

  1. there is a 56-65% (depending on priors) chance that the new boxes reduced cancellation
  2. the plausible reductions are small enough that the expected value of switching is negative, as there is only a ~6% chance that switching to new boxes would be profitable
  3. a definitive resolution of the remaining uncertainty would be worth ~$10773
  4. further experimentation would not be profitable
  5. potential packaging improvements were probably not large enough to be worth running an A/B test on in the first place
  6. a sequential trial using Thompson sampling could do experimentation in a cost-effective manner
  7. in general, future experimentation should focus on the more important business issues CJ seems to have

See Also


  1. Almost half of CJ’s customers are so unhappy they cancel in their first month? Is there an issue with selecting bad candy, or are there unrealistic expectations of how much candy can be bought for the subscription price of $25/month, or do people not like semi-monthly delivery, or not like the limit on candy imposed by two deliveries rather than one, or do people only want to try it once, or what? Looking through early CJ posts, I see these issues, as well as other potential problems like not documenting past shipments adequately, have been raised repeatedly by commenters but never addressed or experimented on by CJ. Have any surveys been done to find out why the cancellation rate is so high? But moving on.↩︎

  2. CJ ran their test using a Fisher’s exact test: fisher.test(rbind(c(439,168),c(442,175))). Fisher’s exact test is most useful when sample sizes are very small, like n < 20, but it doesn’t make a difference here, and I find the proportion test easier to interpret.↩︎

  3. A well-powered experiment to detect a difference between 39% and 36.74% would require n = 14460 (n = 7230 in each arm). With 33.2k new boxes (contrasted against another 33.2k old boxes for free), we would have >99.9% power, leaving us with hardly any probability of not detecting the current effect:

    power.prop.test(power=0.8, p1=0.39, p2=0.3674)
    #      Two-sample comparison of proportions power calculation
    #
    #               n = 7230.138793
    #              p1 = 0.39
    #              p2 = 0.3674
    #       sig.level = 0.05
    #           power = 0.8
    #     alternative = two.sided
    #
    # NOTE: n is number in *each* group
    power.prop.test(n= 3265*2, p1=0.39, p2=0.3674)
    #  Two-sample comparison of proportions power calculation
    #
    #           n = 7110
    #          p1 = 0.39
    #          p2 = 0.3674
    #   sig.level = 0.05
    #       power = 0.9999999999
    # alternative = two.sided
    ↩︎
  4. One difficulty with MAB techniques in this context is that the feedback is delayed: until the next billing cycle, you cannot know that a particular customer won’t cancel; it is possible that they will but simply haven’t yet. And if you wait until the next billing cycle, then CJ’s experiment would already be over. This can be solved by treating it as a survival-analysis problem in which customer cancellation is right-censored, a class of problems that JAGS can also handle without difficulty, but CJ’s writeup/code doesn’t provide exact time-until-cancellation data. Fortunately, Thompson sampling is one of the MAB techniques which can handle such delayed feedback because it will allocate the intervention stochastically rather than solely one intervention or the other.↩︎