# Banner Ads Considered Harmful (Here)

9 months of daily A/B-testing of Google AdSense banner ads on gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing. (experiments, statistics, decision theory, R, JS, power analysis, Bayes)
created: 8 Jan 2017; modified: 21 Jun 2018; status: in progress; confidence: possible;

One source of complexity & JavaScript use on gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of a -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13% to -16%) and an expected traffic loss of -9.7%, exceeding the decision threshold for disabling ads and rendering further experimentation profitless.

Thus, banner ads on gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

One thing about gwern.net I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why in my previous A/B tests of site design changes, I have generally focused on CSS changes which do not affect load times. Benchmarking loads, the total time is dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments. While I want comments, so the Disqus is not optional1, AdSense I keep only because, well, it makes me some money (~$30 a month or ~$360 a year; it would be more but ~60% of visitors have adblock, which is apparently unusually high for the US). So ads are a good thing to do an experiment on: it offers a chance to remove one of the heaviest components of the page, an excuse to use a decision approach, an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.

# Modeling effects of advertising: global rather than local

This isn’t a huge amount (it is much less than my monthly Patreon) and might be offset by the effects on load/render time and people not liking advertisement. If I am reducing my traffic & influence by 10% because people don’t want to browse or link pages with ads, then it’s definitely not worthwhile.

One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when done individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.2 In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators). It is somewhat appreciated that A/B testing in social media or network settings is not as simple as randomizing individual users & running a t-test - as the users are not independent of each other (violating SUTVA among other things). Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger more-independent treatment units.

So my usual ABalytics setup isn’t appropriate here: I don’t want to randomize individual visitors & measure time on page, I want to randomize individual days or weeks and measure total traffic, giving a time-series regression.

This could be randomized by uploading a different version of the site every day, but this is tedious, inefficient, and has technical issues: aggressive caching of my webpages means that many visitors may be seeing old versions of the site! With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like the day-of-year (1-366) modulo 2, hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense, match them up with the randomization, and do a regression. By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn’t be one of the people seeing ads anyway. Since there might be spillover effects due to lags in propagating through social media & emails etc, daily randomization might be too fast, and 2-day blocks more appropriate, allowing occasional runs of up to a week or so to expose longer-term effects while still allocating equal total days to advertising/no-advertising.3
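For illustration, the hash-the-day variant mentioned above (not what was ultimately deployed - the live code used a pregenerated lookup table - and `hashDay`/`adsEnabled` are hypothetical helpers) might look like:

```javascript
// Sketch of deterministic day-based randomization: hash the 2-day block index
// to a pseudorandom bit, so every browser computes the same condition without
// any server-side coordination. (Hypothetical alternative to the lookup table.)
function dayOfYear(date) {
    var start = new Date(date.getFullYear(), 0, 0);
    var oneDay = 1000 * 60 * 60 * 24;
    return Math.floor((date - start) / oneDay);
}
function hashDay(n) { // simple Knuth multiplicative hash; any fixed hash works
    return (n * 2654435761) % 4294967296;
}
function adsEnabled(date) {
    var block = Math.floor(dayOfYear(date) / 2); // 2-day blocks
    return hashDay(block) % 2 === 1;
}
```

The only requirement is that every client computes the same bit for the same 2-day block, so no state needs to be uploaded or cached.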

# Implementation: In-browser Randomization of Banner Ads

Setting this up in JS turned out to be a little tricky since there is no built-in function for getting day-of-year or for hashing numbers/strings; so rather than spend another 10 lines copy-pasting some hash functions, I copied some day-of-year code and then simply generated in R 366 binary variables for randomizing double-days and put them in a JS array for doing the randomization:

<script src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" async></script>
<!-- A/B test of ad effects on site traffic: randomize 2-days based on day-of-year &
     pre-generated randomness; offset by 8 because started on 8 January 2017 -->
<script>
  var now = new Date(); var start = new Date(now.getFullYear(), 0, 0); var diff = now - start;
  var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
  var randomness = [1,0,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,1,1,
    1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,
    0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,
    0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1,
    1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,0,
    0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,
    1,1,1,0,0,1,1,0,0,0,0,1,1,0,0];
  if (randomness[day - 8]) {
      // load & display the ad: the <ins class="adsbygoogle"
      // style="display:inline-block;width:468px;height:60px"> unit and the usual
      // (adsbygoogle = window.adsbygoogle || []).push({}); invocation go here
  }
</script>

While simple, static, and cache-compatible, a few months in I discovered that I had perhaps been a little too clever: checking my AdSense reports on a whim, I noticed that the reported daily impressions were rising and falling in roughly the 2-day chunks expected, but they never fell all the way to 0 impressions - instead, only to perhaps a tenth of the usual number. This was odd: how could any browser be displaying ads on the wrong days, given that the JS runs before any ads code, and any browser not running JS would, ipso facto, never be running AdSense anyway? Then it hit me: whose date is the randomization based on? The browser’s, of course - which is not mine if it’s running in a different timezone. Presumably browsers across a dateline were being randomized into “on” on the same day as others were being randomized into “off”. What I should have done was use some sort of timezone-independent date conditional. Unfortunately, it was a little late to modify the code.
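For reference, a timezone-independent version might have computed the day from the UTC calendar, so that every visitor's browser flips conditions at the same moment (a sketch, not the deployed code):

```javascript
// Day-of-year computed in UTC rather than the browser's local timezone,
// avoiding the dateline bug: all clients agree on the current day number.
function utcDayOfYear(date) {
    var startOfYear = Date.UTC(date.getUTCFullYear(), 0, 0);
    var oneDay = 1000 * 60 * 60 * 24;
    return Math.floor((date.getTime() - startOfYear) / oneDay);
}
```

Substituting `utcDayOfYear(new Date())` for the local `day` computation would have sufficed.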

This implies that the simple binary randomization test is not good as it will be substantially biased towards zero/attenuated by the measurement error inasmuch as many of the pagehits on supposedly ad-free days are in fact being contaminated by exposure to ads. Fortunately, the AdSense impressions data can be used instead to regress on, say, percentage of ad-affected pageviews.

From a decision theory perspective, this is a good place to apply sequential testing ideas, since we face a similar problem as with the Candy Japan A/B test and the experiment has an easily quantified cost: each day randomized off costs ~$1, so a long experiment over 200 days would cost ~$100 in ad revenue etc. There is also the risk of making the wrong decision and choosing to disable ads when they are harmless, in which case the cost as NPV (at my usual 5% discount rate, and assuming ad revenue never changes and I never experiment further, which are reasonable assumptions given how fortunately stable my traffic is and the unlikeliness of me revisiting a conclusive result from a well-designed experiment) would be $\frac{\$360}{\ln(1.05)} = \$7378$, which is substantial.

On the other side of the equation, the ads could be doing substantial damage to site traffic; with ~40% of traffic seeing ads and total page-views of 635123 in 2016 (1740/day), a discouraging effect of 5% off that would mean a loss of $635123 \cdot 0.40 \cdot 0.05 = 12702$, the equivalent of 1 week of traffic. My website is important to me because it is what I have accomplished & is my livelihood, and if people are not reading it, that is bad, both because I lose possible income and because it means no one is reading my work.

How bad? In lieu of advertising it’s hard to directly quantify the value of a page-view, so I can instead ask myself hypothetically, would I trade ~1 week of traffic for $360 (~$0.02/view, or to put it another way which may be more intuitive, would I delete gwern.net in exchange for >$18720/year)? Probably; that’s about the right number - with my current parlous income, I cannot casually throw away hundreds or thousands of dollars for some additional traffic, but I would still pay for readers at the right price, and weighing the feelings, I feel comfortable valuing page-views at ~$0.02. (If the estimate of the loss turns out to be near the threshold, then I can revisit it again and attempt more preference elicitation. Given the actual results, this proved to be unnecessary.)

Then the loss function of the traffic reduction parameter t is $\frac{\$360 - 635123 \cdot 0.40 \cdot t \cdot \$0.02}{\ln(1.05)}$; so the long-run consequence of permanently turning advertising on would be, for a t decrease of 1%, +$6337; 5% = +$2171; 10% = -$3035; 20% = -$13449; etc.

Thus, the decision question is whether the decrease for the ad-affected 40% of traffic is >7%; or for traffic as a whole, if the decrease is >2.8%. If it is, then I am better off removing AdSense and increasing traffic; otherwise, the money is better.
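The arithmetic behind these thresholds can be checked mechanically (a sketch using the constants above; `adNPV` is just the loss function restated in code):

```javascript
// NPV (5% discount rate) of leaving ads on permanently, as a function of the
// traffic reduction t among the ~40% of ad-exposed visitors: $360/year of ad
// revenue minus lost page-views valued at $0.02 each.
function adNPV(t) {
    var annualRevenue = 360;
    var annualViews   = 635123;
    var adExposed     = 0.40;
    var valuePerView  = 0.02;
    return (annualRevenue - annualViews * adExposed * t * valuePerView) / Math.log(1.05);
}
// Break-even reduction among ad-exposed visitors (~7.1%, ie ~2.8% of total traffic):
var breakEven = 360 / (635123 * 0.40 * 0.02);
```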

How much should we expect traffic to fall?

My best guess is that the effect of any advertising avoidance ought to be a small percentage of traffic, since:

• A/B tests in general find small effects in the percentage range (my corpus of A/B tests - or see What works in e-commerce - a meta-analysis of 6700 online experiments, Brown & Jones 2017)
• ads are pervasive online, and only a fraction of users are ever exposed
• AdSense loads ads asynchronously in the background, so it never blocks the page loading or rendering (which would definitely be frustrating, as web design holds that small delays in pageloads are harmful5)
• Google supposedly spends billions of dollars a year on a surveillance Internet & the most cutting-edge AI technology to better model users & target ads to them without irritating them too much (eg Hohnhold et al 2015)
• ads should have little effect on SEO or search engine ranking (since why would search engines penalize their own ads?)
• I’ve seen a decent amount of research on optimizing ad deliveries to maximize revenue & avoid annoying ads, but none on reducing total harm
• the banner ads have never offended or bothered me much when I browse my pages with adblocker disabled, as they are normal medium-sized banners centered above the <title> element where one expects an ad6
• you would think that if there were any worrisome level of harm, someone would’ve noticed by now & it’d be common knowledge to avoid ads unless you were desperate for revenue

So my prior estimate is of a small effect, implying the experiment must run a long time, at a moderate opportunity cost, before a decision can be made.

# Design

How do we analyze this? In the ABalytics per-reader approach, it was simple: we defined a threshold and did a binomial regression. But by switching to trying to increase overall total traffic, I have opened up a can of worms. Let’s look at the traffic data:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
summary(traffic)
#    Date               Pageviews
#  Min.   :2010-10-04   Min.   :    1
#  1st Qu.:2012-04-28   1st Qu.: 1348
#  Median :2013-11-21   Median : 1794
#  Mean   :2013-11-21   Mean   : 2352
#  3rd Qu.:2015-06-16   3rd Qu.: 2639
#  Max.   :2017-01-08   Max.   :53517
nrow(traffic)
# [1] 2289
library(ggplot2)
qplot(Date, Pageviews, data=traffic)
qplot(Date, log(Pageviews), data=traffic)

Two things jump out. The distribution of traffic is weird, with enormous spikes; doing a log-transform to tame the spikes, it is also clearly a non-stationary time-series with heavy autocorrelation, as traffic consistently grows and declines. Neither is surprising: social-media traffic from sites like Hacker News or Reddit is notorious for creating spikes in site traffic (and sometimes bringing sites down under the load), and I would hope that as I keep writing things, traffic would gradually increase! Nevertheless, both features will make the traffic data difficult to analyze despite my having over 6 years of it.

## Power analysis

We can demonstrate with a quick power analysis: if we pick a random subset of days and force a decrease of 2.8% (the value on the decision boundary), can we detect that?

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(Pageviews ~ Ads, data=ads)
# t = 0.27315631, df = 2285.9151, p-value = 0.7847577
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -203.7123550  269.6488393
# sample estimates:
# mean in group 0 mean in group 1
#     2335.331004     2302.362762
wilcox.test(log(Pageviews) ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(log(Pageviews) ~ Ads, data=ads)
# t = 0.36685265, df = 2286.8348, p-value = 0.7137629
sd(ads$Pageviews)
# [1] 2880.044636

The answer is no. We are nowhere near being able to detect it with either a t-test or the nonparametric u-test (which one might expect to handle the strange distribution better), and the log transform doesn’t help. We can hardly even see a hint of the decrease in the t-test - the decrease in the mean is ~30 pageviews but the standard deviations are ~2900 and actually bigger than the mean. So the spikes in the traffic are crippling the tests and this cannot be fixed by waiting a few more months since it’s inherent to the data.

If our trusty friend the log-transform can’t help, what can we do? In this case, we know that the reality here is literally a mixture model, as the spikes are being driven by qualitatively distinct phenomena, like a gwern.net link appearing on the HN front page, as compared to normal daily traffic from existing links & search traffic7; but mixture models tend to be hard to use. One ad hoc approach to taming the spikes would be to effectively throw them out by winsorizing/clipping everything at a certain point (since the daily traffic average is ~1700, perhaps twice that, 3000):

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
ads[ads$Pageviews>3000,]$Pageviews <- 3000
sd(ads$Pageviews)
# [1] 896.8798131
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 679859, p-value = 0.1131403
t.test(Pageviews ~ Ads, data=ads)
# t = 1.3954503, df = 2285.3958, p-value = 0.1630157
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -21.2013943 125.8265361
# sample estimates:
# mean in group 0 mean in group 1
#     1830.496049     1778.183478

Better but still inadequate. Even with the spikes tamed, we continue to have problems; the logged graph suggests that we can’t afford to ignore the time-series aspect. A check of autocorrelation indicates substantial autocorrelation out to lags as high as 8 days:

pacf(traffic$Pageviews, main="gwern.net traffic time-series autocorrelation")

The usual regression framework for time-series is the ARIMA model, in which the current daily value is regressed on each of the previous days’ values (with an estimated coefficient for each lag, as day 8 ought to be less predictive than day 7 and so on) and possibly a difference and a moving average (also with varying distances in time). The models are usually denoted as ARIMA(days back to use as lags, days back to difference, days back for moving average). So the pacf suggests that an ARIMA(8,0,0) might work - lags back 8 days, but agnostic on differencing and moving averages. R’s forecast library helpfully includes both an arima regression function and an auto.arima to do model comparison. auto.arima generally finds that a much simpler model than ARIMA(8,0,0) works best, preferring models like ARIMA(4,1,1) (presumably the differencing and moving-average steal enough of the distant lags’ predictive power that they no longer look better to AIC). Such an ARIMA model works well, and now we actually can convincingly detect our simulated effect:

library(forecast)
library(lmtest)
l <- lm(Pageviews ~ Ads, data=ads); summary(l)
# Residuals:
#       Min        1Q    Median        3Q       Max
# -2352.275  -995.275  -557.275   294.725 51239.783
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 2277.21747   86.86295 26.21621  < 2e-16
# Ads           76.05732  120.47141  0.63133  0.52789
#
# Residual standard error: 2879.608 on 2287 degrees of freedom
# Multiple R-squared: 0.0001742498, Adjusted R-squared: -0.0002629281
# F-statistic: 0.3985787 on 1 and 2287 DF, p-value: 0.5278873

a <- arima(ads$Pageviews, xreg=ads$Ads, order=c(4,1,1))
summary(a); coeftest(a)
# Coefficients:
#              ar1         ar2         ar3         ar4         ma1      ads$Ads
#       0.5424117  -0.0803198  -0.0310823  -0.0094242  -0.8906085  -52.4148244
# s.e.  0.0281538   0.0245621   0.0245500   0.0240701   0.0189952   10.5735098
#
# sigma^2 estimated as 89067.31:  log likelihood = -16285.31,  aic = 32584.63
#
# Training set error measures:
#                       ME        RMSE         MAE          MPE        MAPE         MASE             ACF1
# Training set 3.088924008 298.3762646 188.5442545 -6.839735685 31.17041388 0.9755280945 -0.0002804416646
#
# z test of coefficients:
#
#               Estimate     Std. Error   z value   Pr(>|z|)
# ar1       0.5424116948   0.0281538043  19.26602 < 2.22e-16
# ar2      -0.0803197830   0.0245621012  -3.27007  0.0010752
# ar3      -0.0310822966   0.0245499783  -1.26608  0.2054836
# ar4      -0.0094242194   0.0240700967  -0.39153  0.6954038
# ma1      -0.8906085375   0.0189952434 -46.88587 < 2.22e-16
# Ads     -52.4148243747  10.5735097735  -4.95718 7.1523e-07

One might reasonably ask, what is doing the real work, the truncation/trimming or the ARIMA(4,1,1)? The answer is both; if we go back and regenerate the ads dataset without the truncation/trimming and we look again at the estimated effect of Ads, we find it changes to

#               Estimate     Std. Error   z value   Pr(>|z|)
# ...
# Ads      26.3244086579 81.2278521231    0.32408 0.74587666

For the simple linear model with no time-series or truncation, the standard error on the ads effect is 121; for the time-series with no truncation, the standard error is 81; and for the time series plus truncation, the standard error is 11. My conclusion is that we can’t leave either one out if we are to reach correct conclusions in any feasible sample size - we must deal with the spikes, and we must deal with the time-series aspect.
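The autocorrelation driving all of this is easy to quantify directly; as a minimal illustration (not part of the analysis - R's acf/pacf do this properly), the lag-k sample autocorrelation is just the series' covariance with a shifted copy of itself:

```javascript
// Lag-k sample autocorrelation: covariance of the series with itself shifted
// by k steps, normalized by the variance. A trending series (like growing
// traffic) has a strongly positive value at small lags.
function autocorrelation(xs, k) {
    var n = xs.length;
    var mean = xs.reduce(function(a, b) { return a + b; }, 0) / n;
    var variance = 0, cov = 0;
    for (var i = 0; i < n; i++) {
        variance += (xs[i] - mean) * (xs[i] - mean);
    }
    for (var j = 0; j < n - k; j++) {
        cov += (xs[j] - mean) * (xs[j + k] - mean);
    }
    return cov / variance;
}
```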

So having settled on a specific ARIMA model with truncation, I can do a power analysis. For a time-series, the simple bootstrap is inappropriate because it ignores the autocorrelation; the right tool is the block bootstrap: for each hypothetical sample size n, split the traffic history into as many non-overlapping n-sized chunks as possible (m of them), sample m chunks with replacement, and run the analysis on the resampled series. This is implemented in the R boot library.
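The block-resampling step itself can be sketched in a few lines (an illustration of the idea, not the boot-library internals; `blockBootstrap` is a hypothetical helper):

```javascript
// Fixed-block bootstrap: cut the series into non-overlapping blocks of length
// blockLength, then glue together blocks sampled with replacement, preserving
// short-range autocorrelation inside each block.
function blockBootstrap(series, blockLength, rng) {
    rng = rng || Math.random;
    var blocks = [];
    for (var i = 0; i + blockLength <= series.length; i += blockLength) {
        blocks.push(series.slice(i, i + blockLength));
    }
    var sample = [];
    for (var j = 0; j < blocks.length; j++) {
        var pick = blocks[Math.floor(rng() * blocks.length)];
        sample = sample.concat(pick);
    }
    return sample;
}
```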

library(boot)
library(lmtest)

## fit models & report p-value/test statistic
ut <- function(df) { wilcox.test(Pageviews ~ Ads, data=df)$p.value }
at <- function(df) { coeftest(arima(df$Pageviews, xreg=df$Ads, order=c(4,1,1)))[,4][["df$Ads"]] }

## create the hypothetical effect, truncate, and test
simulate <- function (df, testFunction, effect=0.03, truncate=TRUE, threshold=3000) {
    df$Ads <- rbinom(nrow(df), size=1, p=0.5)
    df[df$Ads==1,]$Pageviews <- round(df[df$Ads==1,]$Pageviews * (1-effect))
    if (truncate) { df[df$Pageviews>threshold,]$Pageviews <- threshold }
    return(testFunction(df)) }

power <- function(ns, df, test, effect, alpha=0.05, iters=2000) {
    powerEstimates <- vector(mode="numeric", length=length(ns))
    i <- 1
    for (n in ns) {
        tsb <- tsboot(df, function(d){simulate(d, test, effect=effect)}, iters, l=n,
                      sim="fixed", parallel="multicore", ncpus=getOption("mc.cores"))
        powerEstimates[i] <- mean(tsb$t < alpha)
        i <- i+1 }
    return(powerEstimates) }

ns <- seq(10, 2000, by=5)
## test the critical value but also 0 effect to check whether alpha is respected
powerUtestNull <- power(ns, traffic, ut, 0)
powerUtest     <- power(ns, traffic, ut, 0.028)
powerArimaNull <- power(ns, traffic, at, 0)
powerArima     <- power(ns, traffic, at, 0.028)
p1 <- qplot(ns, powerUtestNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p2 <- qplot(ns, powerUtest) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p3 <- qplot(ns, powerArimaNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p4 <- qplot(ns, powerArima) + stat_smooth() + coord_cartesian(ylim = c(0, 1))

library(grid)
library(gridExtra)
grid.arrange(p1, p3, p2, p4, ncol = 2, name = "Power analysis of detecting null effect/2.8% reduction using u-test and ARIMA regression")

So the false-positive rate is preserved for both, the ARIMA requires a reasonable-looking n<70 to be well-powered, but the u-test power is bizarre - the power is never great, never going >31.6%, and actually decreasing after a certain point, which is not something you usually see in a power graph. (The ARIMA power curve is also odd but at least it doesn’t get worse with more data!) My speculation about that is that it is because as the time-series window increases, more of the spikes come into view of the u-test, making the distribution dramatically wider & this more than overwhelms the gain in detectability; hypothetically, with even more years of data, the spikes would stop coming as a surprise and the gradual hypothetical damage of the ads will then become more visible with increasing sample size as expected.


A first implementation, in JAGS:

library(runjags)
arima311 <- "model {
# initialize the first 3 days, which we need to fit the 3 lags/moving-averages for day 4:
# y[1] <- 50
# y[2] <- 50
# y[3] <- 50
eps[1] <- 0
eps[2] <- 0
eps[3] <- 0

for (i in 4:length(y)) {
y[i] ~ dt(mu[i], tauOfClust[clust[i]], nuOfClust[clust[i]])
mu[i] <- muOfClust[clust[i]] + w1*y[i-1] + w2*y[i-2] + w3*y[i-3] + m1*eps[i-1]
eps[i] <- y[i] - mu[i]

clust[i] ~ dcat(pClust[1:Nclust])
}

for (clustIdx in 1:Nclust) {
muOfClust[clustIdx] ~ dnorm(100, 1.0E-06)
sigmaOfClust[clustIdx] ~ dnorm(500, 1e-06)
tauOfClust[clustIdx] <- pow(sigmaOfClust[clustIdx], -2)
nuMinusOneOfClust[clustIdx] ~ dexp(5)
nuOfClust[clustIdx] <- nuMinusOneOfClust[clustIdx] + 1
}
pClust[1:Nclust] ~ ddirch(onesRepNclust)

m1 ~ dnorm(0, 4)
w1 ~ dnorm(0, 5)
w2 ~ dnorm(0, 4)
w3 ~ dnorm(0, 3)
}"
y <- traffic$Pageviews
Nclust = 2
clust = rep(NA,length(y))
clust[which(y<1800)] <- 1 # seed labels for cluster 1, normal traffic
clust[which(y>4000)] <- 2 # seed labels for cluster 2, spikes
model <- run.jags(arima311,
    data = list(y=y, Nclust=Nclust, clust=clust, onesRepNclust=c(1,1)),
    monitor=c("w1", "w2", "w3", "m1", "pClust", "muOfClust", "sigmaOfClust", "nuOfClust"),
    inits=list(w1=0.55, w2=0.37, w3=-0.01, m1=0.45, pClust=c(0.805, 0.195),
        muOfClust=c(86.5, 781), sigmaOfClust=c(156, 763),
        nuMinusOneOfClust=c(2.4-1, 1.04-1)),
    n.chains = getOption("mc.cores"), method="rjparallel", sample=500)
summary(model)

JAGS is painfully slow: 5h+ for 500 samples. Sharper priors, removing the 4th-order ARIMA lag, and better initialization didn’t help. Too much correlation? Try switching to Stan:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv",
    colClasses=c("Date", "integer", "logical"))
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

m <- "data {
    int<lower=1> K; // number of mixture components
    int<lower=1> T; // number of data points
    int<lower=0> y[T]; // observations
}
parameters {
    simplex[K] theta; // mixing proportions
    real<lower=0, upper=100> muM[K]; // locations of mixture components
    real<lower=0.01, upper=1000> sigmaM[K]; // scales of mixture components
    real<lower=0.01, upper=5> nuM[K];
    real phi1; // autoregression coeffs
    real phi2;
    real phi3;
    real phi4;
    real ma; // moving avg coeff
}
model {
    real mu[T, K]; // prediction for time t
    vector[T] err; // error for time t
    real ps[K]; // temp for log component densities

    // initialize the first 4 days for the lags
    mu[1][1] = 0; // assume err[0] == 0
    mu[2][1] = 0;
    mu[3][1] = 0;
    mu[4][1] = 0;
    err[1] = y[1] - mu[1][1];
    err[2] = y[2] - mu[2][1];
    err[3] = y[3] - mu[3][1];
    err[4] = y[4] - mu[4][1];

    muM ~ normal(0, 5);
    sigmaM ~ cauchy(0, 2);
    nuM ~ exponential(1);
    ma ~ normal(0, 0.5);
    phi1 ~ normal(0,1);
    phi2 ~ normal(0,1);
    phi3 ~ normal(0,1);
    phi4 ~ normal(0,1);

    for (t in 5:T) {
        for (k in 1:K) {
            mu[t][k] = muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
            err[t] = y[t] - mu[t][k];
            ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
        }
        target += log_sum_exp(ps);
    }
}"
# 17m for 200 samples
nchains <- getOption("mc.cores") - 1
# original, based on MCMC:
# inits <- list(theta=c(0.92, 0.08), muM=c(56.2, 0.1), sigmaM=c(189.7, 6), nuM=c(1.09, 0.61),
#               phi1=1.72, phi2=-0.8, phi3=0.08, phi4=0, ma=-0.91)
# optimized based on gradient descent:
inits <- list(theta=c(0.06, 0.94), muM=c(0.66, 0.13), sigmaM=c(5.97, 190.05), nuM=c(1.40, 1.10),
    phi1=1.74, phi2=-0.83, phi3=0.10, phi4=-0.01, ma=-0.93)
model2 <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, K=2),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=50)
print(model2)
traceplot(model2)

This was perhaps the first time I’ve attempted to write a complex model in Stan, in this case, adapting a simple ARIMA time-series model from the Stan manual. Stan has some interesting features like the variational inference optimizer which can find sensible parameter values for complex models in seconds, an active community & involved developers & an exciting roadmap, and when Stan works it is substantially faster than the equivalent JAGS model; but I encountered a number of drawbacks (in decreasing order of importance):

• Stan’s treatment of mixture models and discrete variables is… not good. I like mixture models & tend to think in terms of them and latent variables a lot, which makes the neglect an issue for me. This was particularly vexing in my initial modeling where I tried to allow for traffic spikes from HN etc by having a mixture model, with one component for regular traffic and one component for traffic surges. This is relatively straightforward in JAGS as one defines a categorical variable and indexes into it, but it is a nightmare in Stan, requiring a bizarre hack.

I defy any Stan user to look at the example mixture model in the manual and tell me that they naturally and easily understand the target/temperature stuff as a way of implementing a mixture model. I sure didn’t. And once I did get it implemented, I couldn’t modify it at all. And it was slow, too, eroding the original performance advantage over JAGS. I was saved only by the fact that the A/B test period happened to not include many spikes and so I could simply drop the mixture model entirely.
• mysterious segfaults and errors under a variety of conditions; once when my cat walked over my keyboard, and frequently when running multi-core Stan in a loop. The latter was a serious issue for me when running a permutation test with 5000 iterates: when I ran Stan on 8 chains in parallel as normal (hence 1/8th the samples per chain) inside a for-loop - the simplest way to implement the permutation test - it would occasionally segfault and take down R. I was forced to reduce the chains to 1 before it stopped crashing, making it 8 times slower unless I wished to add manual parallel processing, running 8 separate Stans.
• Stan’s support for posterior predictives is poor. The manual tells one to use a different module/scope, generated quantities, lest the code be slow, which apparently requires one to copy-paste the entire likelihood section! This is especially unfortunate when doing a time-series and requiring access to arrays/vectors declared in a different scope… I never did figure out how to generate posterior predictions correctly for that reason, and resorted to the usual Bugs/JAGS-like method (which thankfully does work).
• Stan’s treatment of missing data is also uninspiring and makes me worried about more complex analyses where I am not so fortunate as to have perfectly clean complete datasets
• Stan’s syntax is terrible, particularly the entirely unnecessary semicolons. It is 2017, I should not be spending my time adding useless end-of-line markers. If they are necessary for C++, they can be added by Stan itself. This was particularly infuriating when painfully editing a model trying to implement various parameterizations and rerunning only to find that I had forgotten a semicolon (as no language I use regularly, R, Haskell, shell, Python, or Bugs/JAGS, insists on them!).

# Analysis

## Descriptive

traffic <- read.csv("https://www.gwern.net/docs/traffic/2017-10-20-abtesting-adsense.csv",
colClasses=c("Date", "integer", "integer", "numeric", "integer", "integer",
"integer", "numeric", "integer", "numeric"))
summary(traffic)
#       Date              Pageviews           Sessions         Avg.Session.seconds   Total.time            Ads.r
#  Min.   :2017-01-01   Min.   : 794.000   Min.   : 561.0000   Min.   : 45.48000   Min.   : 39074.00   Min.   :0.0000
#  1st Qu.:2017-03-13   1st Qu.:1108.750   1st Qu.: 743.0000   1st Qu.: 87.30000   1st Qu.: 70499.50   1st Qu.:0.0000
#  Median :2017-05-24   Median :1232.000   Median : 800.0000   Median : 98.98500   Median : 81173.50   Median :0.0000
#  Mean   :2017-05-24   Mean   :1319.931   Mean   : 872.0972   Mean   : 99.08451   Mean   : 84517.41   Mean   :0.4375
#  3rd Qu.:2017-08-04   3rd Qu.:1394.000   3rd Qu.: 898.2500   3rd Qu.:109.21500   3rd Qu.: 94002.00   3rd Qu.:1.0000
#  Max.   :2017-10-15   Max.   :8310.000   Max.   :6924.0000   Max.   :145.46000   Max.   :314904.00   Max.   :1.0000
#  Min.   :   0.0000   Min.   :-8.127100   Min.   :   0.0000   Min.   :0.00000000
#  1st Qu.:  76.5000   1st Qu.:-2.791700   1st Qu.:  33.0000   1st Qu.:0.02364601
#  Median : 180.5000   Median :-1.801750   Median : 127.5000   Median :0.10311474
#  Mean   : 399.0243   Mean   :-1.457549   Mean   : 358.9549   Mean   :0.28578668
#  3rd Qu.: 734.5000   3rd Qu.: 0.529475   3rd Qu.: 708.0000   3rd Qu.:0.59389847
#  Max.   :1925.0000   Max.   : 1.436700   Max.   :1848.0000   Max.   :0.76507937
nrow(traffic)
# [1] 288

library(ggplot2)
qplot(Date, Pageviews, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(750,3089)) +
    labs(color="Ads", title="AdSense advertising effect on Gwern.net daily pageviews, January-October 2017")
qplot(Date, Total.time, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(38000,190000)) +
    labs(color="Ads", title="AdSense advertising effect on total time spent reading Gwern.net, January-October 2017")

Traffic looks similar whether counting by total page views or total time reading (average-time-reading-per-session x number-of-sessions); the data is definitely autocorrelated and somewhat noisy, and I get the impression that there is a noticeable decrease on the advertising days despite the measurement error.
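The degree of autocorrelation can be quantified with R’s `acf`; a minimal sketch on a simulated AR(1) stand-in for the traffic series (on the real data, the call is simply `acf(traffic$Pageviews)`):

```r
set.seed(2017)
## simulated stand-in for traffic$Pageviews: an AR(1) series around the
## ~1300/day mean, mimicking the persistence visible in the scatterplot
pv <- as.numeric(arima.sim(model = list(ar = 0.5), n = 288, sd = 150)) + 1300
a <- acf(pv, lag.max = 7, plot = FALSE)
round(as.numeric(a$acf)[2:8], 2) # lag 1-7 autocorrelations, decaying geometrically
```

The substantial lag-1 correlation is exactly what makes per-day observations non-independent and degrades the power of the usual iid tests.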

## Simple tests & regressions

As expected from the power analysis, the usual tests are unable to reliably detect anything, but it’s worth noting that the point-estimates of both the mean & median indicate the ads are worse:

t.test(Pageviews ~ Ads.r, data=traffic)
#   Welch Two Sample t-test
#
# t = 0.28265274, df = 178.00378, p-value = 0.7777715
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -111.4252500  148.6809819
# sample estimates:
# mean in group 0 mean in group 1
#     1328.080247     1309.452381
wilcox.test(Pageviews ~ Ads.r, conf.int=TRUE, data=traffic)
#   Wilcoxon rank sum test with continuity correction
#
# W = 11294, p-value = 0.1208844
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  -10.00001128  87.99998464
# sample estimates:
# difference in location
#             37.9999786

The tests can only handle a binary variable, so next is a quick simple linear model, and then a Bayesian regression with an autocorrelation term; both turn up a weak effect for the binary randomization, and then much stronger (and negative) for the more accurate percentage measurement:

summary(lm(Pageviews ~ Ads.r, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -534.0802 -207.6093  -90.0802   65.9198 7000.5476
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1328.08025   40.58912 32.72010  < 2e-16
# Ads.r        -18.62787   61.36498 -0.30356  0.76168
#
# Residual standard error: 516.6152 on 286 degrees of freedom
# Multiple R-squared:  0.0003220914,    Adjusted R-squared:  -0.003173286
# F-statistic: 0.09214781 on 1 and 286 DF,  p-value: 0.7616849
summary(lm(Pageviews ~ Ads.percent, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -579.4145 -202.7547  -89.1473   60.7785 6928.3160
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1384.17269   42.77952 32.35596  < 2e-16
# Ads.percent -224.79052  105.98550 -2.12096 0.034786
#
# Residual standard error: 512.6821 on 286 degrees of freedom
# Multiple R-squared:  0.01548529,  Adjusted R-squared:  0.01204293
# F-statistic: 4.498451 on 1 and 286 DF,  p-value: 0.03478589

library(brms)
b <- brm(Pageviews ~ Ads.r, autocor = cor_bsts(), iter=20000, chains=8, data = traffic); b
#  Family: gaussian(identity)
#    Data: traffic (Number of observations: 288)
# Samples: 8 chains, each with iter = 20000; warmup = 10000; thin = 1;
#          total post-warmup samples = 80000
#     ICs: LOO = Not computed; WAIC = Not computed
#
# Correlation Structure: bsts(~1)
#         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigmaLL    50.86     16.47    26.53    90.49        741 1.01
#
# Population-Level Effects:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.r    50.36     63.19   -74.73   174.21      34212    1
#
# Family Specific Parameters:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigma   499.77     22.38   457.76    545.4      13931    1
#
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
# is a crude measure of effective sample size, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
b2 <- brm(Pageviews ~ Ads.percent, autocor = cor_bsts(), chains=8, data = traffic); b2
# ...
# Population-Level Effects:
#             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.percent    -91.8    113.05  -317.85   131.41       2177    1

Treating the ad-exposure percentage as additive is also odd, as one would expect its effect to be multiplicative in some sense. As well, while brms makes it convenient to throw in a simple Bayesian structural autocorrelation (corresponding to AR(1), if I am understanding it correctly), the function involved does not support the higher-order lags or the moving average involved in traffic.
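One crude way to respect the multiplicative interpretation without leaving `lm` is to regress log page views, where exp(coefficient) − 1 estimates the proportional traffic change at 100% ad exposure. A sketch on simulated data (on the real data, the call would be `lm(log(Pageviews) ~ Ads.percent, data=traffic)`; the effect size and noise levels here are hypothetical):

```r
set.seed(2017)
n <- 288
adsPercent <- runif(n) # fraction of traffic exposed to ads each day
## simulate a true -10% multiplicative effect at 100% exposure:
pageviews <- rlnorm(n, meanlog = log(1300) + log(0.90) * adsPercent, sdlog = 0.2)
fit <- lm(log(pageviews) ~ adsPercent)
exp(coef(fit)[["adsPercent"]]) - 1 # recovers an estimate near -0.10
```

This still ignores the autocorrelation, which is why the full Stan model below handles the multiplicative effect inside an ARIMA likelihood instead.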

## Stan ARIMA time-series model

The fully Bayesian analysis in Stan, using an ARIMA(4,0,1) time-series model, a multiplicative effect of ads as a percentage of traffic, skeptical informative priors of small negative effects, and extracting posterior predictions (of each day, if hypothetically it were not advertising-affected) for further analysis:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int<lower=1> T; // number of data points
int<lower=0> y[T]; // traffic
real<lower=0, upper=1> Ads[T]; // fraction of each day's traffic exposed to ads
}
parameters {
real<lower=0> muM;
real<lower=0> sigma;
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
real<upper=0> ads; // multiplicative advertising penalty; can only be negative

real<lower=0, upper=10000> y_pred[T]; // traffic predictions
}
model {
real mu[T]; // prediction for time t
vector[T] err; // error for time t

// initialize the first 4 days for the lags
mu[1] = 0;
mu[2] = 0;
mu[3] = 0;
mu[4] = 0;
err[1] = y[1] - mu[1];
err[2] = y[2] - mu[2];
err[3] = y[3] - mu[3];
err[4] = y[4] - mu[4];

muM ~ normal(1300, 500);
sigma ~ exponential(250);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
ma ~ normal(0, 0.5);
ads ~ normal(0, 0.5); // skeptical informative prior: a small (non-positive) effect

for (t in 5:T) {
mu[t] = muM + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
err[t] = y[t] - mu[t];
y[t] ~ normal(mu[t] * (1 + ads * Ads[t]), sigma); // multiplicative effect of ad exposure
y_pred[t] ~ normal(mu[t] * 1, sigma); // for comparison, what would the ARIMA predict for a today w/no ads?
}
}"

# extra flourish: find posterior mode via Stan's new L-BFGS gradient descent optimization feature;
# also offers a good initialization point for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent), hessian=TRUE)
round(optimized$par, digits=3)
#      muM    sigma     phi1     phi2     phi3     phi4       ma      ads
# 1352.864   65.221   -0.062    0.033   -0.028    0.083    0.249   -0.144

## Initialize from previous MCMC run:
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.15)
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=200000); print(model)
# ...Elapsed Time: 413.816 seconds (Warm-up)
#                  654.858 seconds (Sampling)
#                  1068.67 seconds (Total)
#
# Inference for Stan model: bacd35459b712679e6fc2c2b6bc0c443.
# 1 chains, each with iter=2e+05; warmup=1e+05; thin=1;
# post-warmup draws per chain=1e+05, total post-warmup draws=1e+05.
#
#          mean se_mean    sd    2.5%     25%     50%     75%   97.5%  n_eff Rhat
# muM   1355.27    0.20 47.54 1261.21 1323.50 1355.53 1387.20 1447.91  57801    1
# sigma   65.61    0.00  0.29   65.03   65.41   65.60   65.80   66.18 100000    1
# phi1    -0.06    0.00  0.04   -0.13   -0.09   -0.06   -0.04    0.01  52368    1
# phi2     0.03    0.00  0.01    0.01    0.03    0.03    0.04    0.05 100000    1
# phi3    -0.03    0.00  0.01   -0.04   -0.03   -0.03   -0.02   -0.01 100000    1
# phi4     0.08    0.00  0.01    0.07    0.08    0.08    0.09    0.10 100000    1
# ma       0.25    0.00  0.04    0.18    0.23    0.25    0.27    0.32  52481    1
# ads     -0.14    0.00  0.01   -0.16   -0.15   -0.14   -0.14   -0.13 100000    1
# ...
mean(extract(model)$ads)
# [1] -0.1449574151

## permutation test to check for model misspecification: shuffle ad exposure and rerun the model,
## see what the empirical null distribution of the ad coefficient is and how often it yields a
## reduction of >= -14.5%:
empiricalNull <- numeric()
iters <- 5000
for (i in 1:iters) {
df <- traffic
df$Ads.percent <- sample(df$Ads.percent)
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.01)
# nchains <- 1; options(mc.cores = 1) # disable multi-core to work around occasional Stan segfaults
model <- stan(model_code=m, data=list(T=nrow(df), y=df$Pageviews, Ads=df$Ads.percent),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains); print(model)
adEstimate <- mean(extract(model)$ads)
empiricalNull[i] <- adEstimate
}
summary(empiricalNull); sum(empiricalNull < -0.1449574151) / length(empiricalNull)
#         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
# -0.206359600 -0.064702600 -0.012325460 -0.035497930 -0.001696464 -0.000439064
# [1] 0.0136425648

We see a consistent & large estimate of harm: the mean of traffic falls by -14.5% (95% CI: -0.16 to -0.13; permutation test: p=0.01) on 100% ad-affected traffic!

To more directly calculate the harm, I turn to the posterior predictions, which were computed for each day under the hypothetical of no advertising; one would expect the prediction for all days to be somewhat higher than the actual traffic was (because almost every day has some non-zero % of ad-affected traffic), and, summed or averaged over all days, that gives the predicted loss of traffic from ads:

mean(traffic$Pageviews)
# [1] 1319.930556
## fill in defaults when extracting mean posterior predictives:
traffic$Prediction <- c(1319,1319,1319,1319, colMeans(extract(model)$y_pred)[5:288])
# qplot(traffic$Date, traffic$Prediction, color=as.logical(traffic$Ads.r)) + stat_smooth()
mean(with(traffic, Prediction - Pageviews))
# [1] 53.67329617
mean(with(traffic, (Prediction - Pageviews) / Pageviews))
# [1] 0.09668207805
sum(with(traffic, Prediction - Pageviews))
# [1] 15457.9093

So during the A/B test, the expected estimated loss of traffic is ~9.7%.

## Decision

As this is so far past the decision threshold, and the 95% credible interval around -0.14 is extremely tight (-0.16 to -0.13), the EVSI of any additional sampling is surely negative & not worth calculating. Thus, I permanently removed the AdSense banner ad in the middle of 11 September 2017.

# Discussion

The result is surprising: I had been expecting some degree of harm, but the estimated reduction is much larger than I expected. Could banner ads really be that harmful?

The effect is estimated with considerable precision, so it’s almost certainly not a fluke of the data (if anything, I collected far more data than I should’ve); there weren’t many traffic spikes to screw with the analysis, so omitting mixture-model or t-scale responses in the model doesn’t seem like it should be an issue either; the modeling itself might be driving it, but the crudest tests suggest a similar level of harm (just not at high statistical-significance or posterior probability); it does seem to be visible in the scatterplot; the more realistic models - which include the time-series aspects I know exist from the long historical time-series of gwern.net traffic & skeptical priors encouraging small effects - estimate it much better, as I expected from my previous power analyses; and considerable tinkering with my original ARIMA(4,0,1) Stan model to check my understanding of my code (I haven’t used Stan much before) didn’t turn up any issues or make the effect go away. So as far as I can tell, this effect is real. I still doubt my results, but it’s convincing enough for me to disable ads, at least.

Does it generalize?
I admit gwern.net is unusual: highly technical longform static content in a minimalist layout optimized for fast loading & rendering, catering to Anglophone STEM-types in the USA. It is entirely possible that for most websites the effect of ads is much smaller: they already load so slowly, have much busier cluttered designs, their users have less visceral distaste for advertising or are more easily targeted for useful advertising, etc., and thus gwern.net may merely be an outlier for whom removing ads makes sense (particularly given my option of being Patreon-supported rather than depending entirely on ads like many media websites must). I have no way of knowing whether or not this is true, and as always with optimizations, one should benchmark one’s own specific usecase; perhaps in a few years more results will be reported, and it will be seen whether my results are merely a coding error, an outlier, or something else.

If a max loss of 14% and an average loss of ~9% (both of which could be higher for sites whose users don’t use adblock as much) are accurate and generalizable to other blogs/websites, there are many implications: in particular, it implies a huge deadweight loss to Internet users from advertising, and suggests advertising may be a net loss for many smaller sites. Ironically, in the latter case, those sites may not yet have realized, and may never realize, how much the pennies they earn from advertising are costing them, because the harm won’t show up in standard single-user A/B testing, due either to measurement error hiding much of the effect or because it exists as an emergent global effect, requiring long-term experimentation & relatively sophisticated time-series modeling.
There may be a connection here to earlier observations on the business of advertising questioning whether advertising works, works more than it hurts or cannibalizes other avenues, works sufficiently well to be profitable, or sufficiently well to know if it is working at all: on purely statistical grounds, it should be hard to cost-effectively show that advertising works at all (Lewis & Reiley 2008/Lewis & Rao 2014); Steve Sailer’s observation that the BehaviorScan field-experiment linking individual TV advertisements & grocery store sales likely showed little effect; eBay’s own experiments to similar effect (Lewis et al 2011, Blake et al 2014); P&G & JPMorgan digital ad cuts (and the continued success of many São Paulo-style low-advertising retailers like Aldi or MUJI); the extreme inaccuracy of correlational attempts to predict advertising effects (Lewis et al 2011, Gordon et al 2016; see also Eckles & Bakshy 2017); political science’s difficulty showing any causal impact of campaign advertisement spending on victories (Kalla & Broockman 2017; &, most recently, Donald Trump); and many anecdotal reports seriously questioning the value of Facebook or Google advertising for their businesses in yielding mistaken/curious or fraudulent or just useless traffic.
# Appendices

## Stan: mixture time-series

An attempt at an ARIMA(4,0,1) time-series mixture model, where the mixture has two components: one component for normal traffic where daily traffic is ~1000, making up >90% of daily data, and one component for the occasional traffic spike around 10x larger but happening rarely:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int<lower=1> K; // number of mixture components
int<lower=1> T; // number of data points
int<lower=0> y[T]; // traffic
int<lower=0, upper=1> Ads[T]; // Ad randomization
}
parameters {
simplex[K] theta; // mixing proportions
positive_ordered[K] muM; // locations of mixture components; since no points are labeled,
    // like in JAGS, we add a constraint to force an ordering, make it identifiable, and
    // avoid label switching, which will totally screw with the posterior samples
real<lower=0.01, upper=500> sigmaM[K]; // scales of mixture components
real<lower=0.01, upper=5> nuM[K];
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
real<upper=0> ads; // advertising coeff; can only be negative
}
model {
real mu[T, K]; // prediction for time t
vector[T] err; // error for time t
real ps[K]; // temp for log component densities

// initialize the first 4 days for the lags
mu[1][1] = 0; // assume err[0] == 0
mu[2][1] = 0;
mu[3][1] = 0;
mu[4][1] = 0;
err[1] = y[1] - mu[1][1];
err[2] = y[2] - mu[2][1];
err[3] = y[3] - mu[3][1];
err[4] = y[4] - mu[4][1];

muM ~ normal(0, 5);
sigmaM ~ cauchy(0, 2);
nuM ~ exponential(1);
ma ~ normal(0, 0.5);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
ads ~ normal(0,1);

for (t in 5:T) {
  for (k in 1:K) {
    mu[t][k] = ads*Ads[t] + muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
    err[t] = y[t] - mu[t][k];
    ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
  }
  target += log_sum_exp(ps);
}
}"

# find posterior mode via L-BFGS gradient descent optimization; this can be a good set of initializations for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2), hessian=TRUE)
round(optimized$par, digits=3)
#  theta[1]  theta[2]    muM[1]    muM[2] sigmaM[1] sigmaM[2]    nuM[1]    nuM[2]      phi1      phi2      phi3      phi4        ma
#     0.001     0.999     0.371     2.000     0.648   152.764     0.029     2.031     1.212    -0.345    -0.002     0.119    -0.604
#       ads
#    -0.009

## optimized:
inits <- list(theta=c(0.001, 0.999), muM=c(0.37, 2), sigmaM=c(0.648, 152), nuM=c(0.029, 2), phi1=1.21, phi2=-0.345, phi3=-0.002, phi4=0.119, ma=-0.6, ads=-0.009)
## MCMC means:
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains, control = list(max_treedepth = 15, adapt_delta = 0.95),
iter=20000); print(model)
traceplot(model, pars=names(inits))

This code winds up continuing to fail due to the label-switching issue (ie the MCMC bouncing between estimates of what each mixture component is, because of symmetry or lack of data) despite using some of the suggested fixes in the Stan model, like the ordering trick. Since there were so few spikes in 2017, the mixture model can’t converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing the 2017 data and I can simply model the outcome as a normal.

## EVSI

Demo code of a simple Expected Value of Sample Information (EVSI) analysis in a JAGS log-Poisson model of traffic (which turns out to be inferior to a normal distribution for 2017 traffic data, but is kept here for historical purposes). We consider an experiment resembling the historical data with a 5% traffic decrease due to ads; the reduction is modeled and implies a certain utility loss given my relative preferences for traffic vs the advertising revenue, and then the remaining uncertainty in the reduction estimate can be queried for how likely it is that the decision is wrong and that collecting further data would change a wrong decision to a right one:

## simulate a plausible effect superimposed on the actual data:
ads[ads$Ads==1,]$Hits <- round(ads[ads$Ads==1,]$Hits * 0.95)

require(rjags)
y <- ads$Hits
x <- ads$Ads
model_string <- "model {
for (i in 1:length(y)) {
y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha0 - alpha1 * x[i]

}
alpha0 ~ dunif(0,10)
alpha1 ~ dgamma(50, 6)
}"
model <- jags.model(textConnection(model_string), data = list(x = x, y = y),
n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("alpha0", "alpha1"), n.iter=10000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
#    plus standard error of the mean:
#
#              Mean          SD     Naive SE Time-series SE
# alpha0 6.98054476 0.003205046 1.133155e-05   2.123554e-05
# alpha1 0.06470139 0.005319866 1.880857e-05   3.490445e-05
#
# 2. Quantiles for each variable:
#
#              2.5%        25%        50%        75%      97.5%
# alpha0 6.97426621 6.97836982 6.98055144 6.98273011 6.98677827
# alpha1 0.05430508 0.06110893 0.06469162 0.06828215 0.07518853
alpha0 <- samples[[1]][,1]; alpha1 <- samples[[1]][,2]
posteriorTrafficReduction <- exp(alpha0) - exp(alpha0-alpha1)

generalLoss <- function(annualAdRevenue, trafficLoss,  hitValue, discountRate) {
(annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
loss <- function(tr) { generalLoss(360, tr, 0.02, 0.05) }
posteriorLoss <- sapply(posteriorTrafficReduction, loss)
summary(posteriorLoss)
#       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
# -5743.5690 -3267.4390 -2719.6300 -2715.3870 -2165.6350   317.7016

Expected loss of turning on ads: -$2715. Current decision: keep ads off to avoid that loss. The expected average gain in the case where the correct decision is turning ads on:

mean(ifelse(posteriorLoss>0, posteriorLoss, 0))
# [1] 0.06868814833

so EVPI is $0.07. This doesn’t pay for any additional days of sampling, so there’s no need to calculate an exact EVSI.

1. I am not too happy about how much uncached JS the Disqus plugin loads or how long it takes to set itself up while spewing warnings in the browser console, but at the moment, I don’t know of any other static site commenting system which has good anti-spam capabilities or an equivalent user base, and Disqus has worked reasonably well for 5+ years.

2. This is especially an issue with A/B testing as usually practiced with NHST & arbitrary thresholds; one could steadily degrade one’s website by repeatedly making bad changes which don’t appear harmful in small-scale experiments (no harm, p<0.05; no harm, p<0.05; no harm, p<0.05; etc.). One might call this the Schlitz beer problem, after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. This can be countered by a few things, such as testing regularly against a minimalistic baseline or carefully tuning $\alpha$/$\beta$ thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the p-threshold made less stringent for detecting harm).
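A quick simulation illustrates the problem (hypothetical numbers: each change causes a true harm of d = -0.05 standard deviations, tested one-sided at n = 200/group with α = 0.05):

```r
set.seed(2017)
## probability that a truly harmful change survives a standard under-powered A/B test:
shipProb <- mean(replicate(2000, {
    control   <- rnorm(200, mean = 0)
    treatment <- rnorm(200, mean = -0.05)
    t.test(treatment, control, alternative = "less")$p.value > 0.05 # "no harm, p>0.05" -> ship it
}))
shipProb             # the large majority of small harms pass undetected
0.95^(20 * shipProb) # cumulative traffic multiplier if each shipped change costs ~5%
```

Under these (made-up) numbers, most harmful changes get shipped, and 20 of them compound into losing roughly half of one’s traffic while every individual test looked fine.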

3. Why block instead of, say, just randomizing 5 days at a time (simple randomization)? If we did that, we would occasionally spend an entire month in one condition without switching, simply by rolling the same condition 5 or 6 times in a row; since traffic can be expected to drift, change, and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance, thus shrinking the effect size, thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had n=15 for advertising and only n=5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition). Not blocking doesn’t bias our analysis - we still get the right answer eventually - but it could be costly. Whereas if we block pairs of 2-days ([00,11] vs [11,00]), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, it’ll probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age- and sex-matched participants, to eliminate unnecessary noise. The gains can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have allowed a sample size less than $\frac{1}{20}$ as large: The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example. Thus, when possible, I block my experiments at least temporally.
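Generating such a blocked schedule is trivial; a sketch of how the randomized pairs of 2-day blocks can be drawn (illustrative, not the exact script used for the experiment):

```r
set.seed(2017)
## 25 randomized pairs of 2-day blocks: either ads-first ([1,1,0,0]) or
## ads-second ([0,0,1,1]), guaranteeing regular but unpredictable switching
adsFirst <- sample(c(TRUE, FALSE), 25, replace = TRUE)
schedule <- unlist(lapply(adsFirst, function(first) if (first) c(1,1,0,0) else c(0,0,1,1)))
length(schedule); mean(schedule) # 100 days, exactly half of them with ads
```

Unlike simple randomization, the longest possible run of a single condition is 4 days, and the two conditions are balanced by construction.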

4. After publishing the initial results, Chris Stucchio commented on Twitter: “Most of the work on this stuff is proprietary. I ran such an experiment for a large content site which generated directionally similar results. I helped another major content site set up a similar test, but they didn’t tell me the results.” David Kitchen, a hobbyist operator of several hundred forums, claims that removing banner ads boosted all his metrics (admittedly, at the cost of total revenue), but it is unclear whether this used randomization or just a before-after comparison.

5. Several of the Internet giants like Google have reported measurable harms from delays as small as 100ms.

6. Examples of ads I saw would be Lumosity ads or online university ads (typically master’s degrees, for some reason) on my DNB FAQ. They looked about what one would expect: generically glossy and clean. It is difficult to imagine anyone being offended by them.

7. This mixture model would have two distributions/components in it; there apparently is no point in trying to distinguish between levels of virality, as 3 or higher does not fit the data well:

library(flexmix)
stepFlexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=1:10, nrep=20) ## errors out k>2
fit <- flexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=2)
summary(fit)
#        prior size post>0 ratio
# Comp.1 0.196  448    449 0.998
# Comp.2 0.804 1841   1842 0.999
#
# 'log Lik.' -1073191.871 (df=3)
# AIC: 2146389.743   BIC: 2146406.95
summary(refit(fit))
# $Comp.1
#                  Estimate    Std. Error   z value   Pr(>|z|)
# (Intercept) 8.64943378382 0.00079660747 10857.837 < 2.22e-16
#
# $Comp.2
#                  Estimate    Std. Error   z value   Pr(>|z|)
# (Intercept) 7.33703564817 0.00067256175 10909.088 < 2.22e-16