*9 months of daily A/B-testing of Google AdSense banner ads on gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.*

topics:

created: *8 Jan 2017*; modified: *10 Feb 2019*; status: *in progress*; confidence: *possible*; importance: *5*

One source of complexity & JavaScript use on

`gwern.net`

is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affectingtotalsite traffic instead through long-term effects on or spillover mechanisms between readers (eg social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.Design: A decision analysis of revenue vs readers yields an maximum acceptable total traffic loss of ~3%. Power analysis of historical

`gwern.net`

traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 1 January 2017 to 15 October 2017, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13-16%) and an expected traffic loss of -9.7% from the subset of users without adblock, exceeding my decision threshold for disabling ads and rendering further experimentation profitless.

Thus, banner ads on

`gwern.net`

appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

One thing about `gwern.net`

I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why in my previous A/B tests of site design changes, I have generally focused on CSS changes which do not affect load times. Benchmarking loads, the total time is dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments. While I want comments, so the Disqus is not optional^{1}, AdSense I keep only because, well, it makes me some money (~$30 a month or ~$360 a year; it would be more but ~60% of visitors have adblock, which is apparently unusually high for the US). So ads are a good thing to do an experiment on: it offers a chance to remove one of the heaviest components of the page, an excuse to use a decision approach, an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.

# Modeling effects of advertising: global rather than local

This isn’t a *huge* amount (it is much less than my monthly Patreon) and might be offset by the effects on load/render time and people not liking advertisement. If I am reducing my traffic & influence by 10% because people don’t want to browse or link pages with ads, then it’s definitely not worthwhile.

One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when done individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.^{2} In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators). It is somewhat appreciated that A/B testing in social media or network settings is not as simple as randomizing individual users & running a *t*-test - as the users are not independent of each other (violating SUTVA among other things). Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger more-independent treatment units.

So my usual ABalytics setup isn’t appropriate here: I don’t want to randomize individual visitors & measure time on page, I want to randomize individual days or weeks and measure total traffic, giving a time-series regression.

This could be randomized by uploading a different version of the site every day, but this is tedious, inefficient, and has technical issues: aggressive caching of my webpages means that many visitors may be seeing old versions of the site! With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like the day-of-year (1-366) modulo 2

, hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense and match up with randomization and do a regression. By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn’t be one of the people seeing ads anyway. Since there might be spillover effects due to lags in propagating through social media & emails etc, daily randomization might be too fast, and 2-day blocks more appropriate, ensuring occasional runs up to a week or so to expose longer effects while still ensuring allocation equal total days to advertising/no-advertising.^{3}

# Ads as Decision Problem

From a decision theory perspective, this is a good place to apply sequential testing ideas as we face a similar problem as with the Candy Japan A/B test and the experiment has an easily quantified cost: each day randomized off

costs ~$1, so a long experiment over 200 days would cost ~$100 in ad revenue etc. There is also the risk of making the wrong decision and choosing to disable ads when they are harmless, in which case the cost as NPV (at my usual 5% discount rate, and assuming ad revenue never changes and I never experiment further, which are reasonable assumptions given how fortunately stable my traffic is and the unlikeliness of me revisiting a conclusive result from a well-designed experiment) would be $\frac{360}{log(1.05)} = \$7378$, which is substantial.

On the other side of the equation, the ads could be doing substantial damage to site traffic; with ~40% of traffic seeing ads and total page-views of 635123 in 2016 (1740/day), a discouraging effect of 5% off that would mean a loss of $635123 \cdot 0.40 \cdot 0.05 = 12702$, the equivalent of 1 week of traffic. My website is important to me because it is what I have accomplished & is my livelihood, and if people are not reading it, that is bad, both because I lose possible income and because it means no one is reading my work.

How bad? In lieu of advertising it’s hard to directly quantify the value of a page-view, so I can instead ask myself hypothetically, would I trade ~1 week of traffic for $360 (~$0.02/view, or to put it another way which may be more intuitive, would I delete `gwern.net`

in exchange for >$18720/year)? Probably; that’s about the right number - with my current parlous income, I cannot casually throw away hundreds or thousands of dollars for some additional traffic, but I would still pay for readers at the right price, and weighing the feelings, I feel comfortable valuing page-views at ~$0.02. (If the estimate of the loss turns out to be near the threshold, then I can revisit it again and attempt more preference elicitation. Given the actual results, this proved to be unnecessary.)

Then the loss function of the traffic reduction parameter *t* is $\frac{360 - 635123 \cdot 0.40 \cdot t \cdot 0.02}{log(1.05)}$, So the long-run consequence of permanently turning advertising on would be, for a *t* decrease of 1%, 1% = +$4775; 5% = +$2171; 10% = -$3035; 20% = -$13449; etc.

Thus, the decision question is whether the decrease for the ad-affected 40% of traffic is >7%; or for traffic as a whole, if the decrease is >2.8%. If it is, then I am better off removing AdSense and increasing traffic; otherwise, the money is better.

How much should we expect traffic to fall?

Unfortunately, I’m not aware of any previous research similar to my proposal for examining the effect on total traffic rather than more common metrics such as revenue or per-page engagement. I assume such research exists, since there’s a literature on everything, but I haven’t found it yet and no one I’ve asked knows where it is either; and of course presumably the big Internet advertising giants have detailed knowledge of such spillover or emergent effects, although no incentive to publicize the harms.^{4} There is a sparse open literature on advertising avoidance

, which focuses on surveys of consumer attitudes and economic modeling; skimming, the main results appear to be that people claim to dislike advertising on TV or the Internet a great deal, claim to dislike personalization but find personalized ads less annoying, a nontrivial fraction of viewers will take action during TV commercial breaks to avoid watching ads (5-23% for various methods of estimating/definitions of avoidance, and sources like TV channels), and are particularly annoyed by ads getting in the way when researching or engaged in goal-oriented

activity, and in a work context (Amazon Mechanical Turk) will tolerate non-annoying ads without demanding large payment increases (Goldstein et al 2013/Goldstein et al 2014). But while those surveys & measurements show users will do some work to avoid ads (which is supported by the high percentage of browsers with adblockers installed) and in some contexts like jobs appear to be insensitive to ads, there is little information about to what extent ads unconsciously drive users away from a publisher towards other publishers or mediums, with pervasive amounts of advertising taken for granted (see cites in Abernethy, Edwards et al 2002, & Wilbur 2016). After running this experiment, a paper was published in 2018 on a large-scale (*n*=34m) long-term (~2 years) individual-level advertising experiment by the Pandora streaming music service (Huang et al 2018) which found a strikingly large effect of number of ads on reduced listener frequency & worsened retention, which accumulated over time and would have been hard to observe in a short-term experiment; this roughly paralleled my results (although it doesn’t examine any potential spillover or global effects). Almost simultaneously, Mozilla (Miroglio et al 2018) conducted a longitudinal (but non-randomized, using propensity scoring) study which found that after installing adblock, adblock users experienced increases in both active time spent in the browser (+28% over [matched] controls) and the number of pages viewed (+15% over control)

; the latter correlate is almost identical to the per-user effect my first A/B test had yielded.

My best guess is that the effect of any advertising avoidance

ought to be a small percentage of traffic, since: A/B tests in general find small effects in the percentage range (my corpus of A/B tests - or see What works in e-commerce - a meta-analysis of 6700 online experiments

, Brown & Jones 2017;

, Berman et al 2018), ads are pervasive online, only a fraction of users are ever exposed, AdSense loads ads asynchronously in the background so it never blocks the page loading or rendering (which would definitely be frustrating & web design holds that small delays in pageloads are harmful*p*-Hacking and False Discovery in A/B Testing^{5}), Google supposedly spends billions of dollars a year on a surveillance Internet & the most cutting-edge AI technology to better model users & target ads to them without irritating them too much (eg Hohnhold et al 2015), ads should have little effect on SEO or search engine ranking (since why would search engines penalize their *own* ads?), I’ve seen a decent amount of research on optimizing ad deliveries to maximize revenue & avoiding annoying ads but not reducing total harm, the banner ads have never offended or bothered me much when I browse my pages with adblocker disabled as they are normal medium-sized banners centered above the `<title>`

element where one expects an ad^{6}, and you would think that if there were any worrisome level of harm *someone* would’ve noticed by now & it’d be common knowledge to avoid ads unless you were desperate for revenue. So my prior estimate is of a small effect and needing to run for a long time to make a decision at a moderate opportunity cost.

# Design

How do we analyze this? In the ABalytics per-reader approach, it was simple: we defined a threshold and did a binomial regression. But by switching to trying to increase overall total traffic, I have opened up a can of worms. Let’s look at the traffic data:

```
traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
summary(traffic)
summary(traffic)
# Date Pageviews
# Min. :2010-10-04 Min. : 1
# 1st Qu.:2012-04-28 1st Qu.: 1348
# Median :2013-11-21 Median : 1794
# Mean :2013-11-21 Mean : 2352
# 3rd Qu.:2015-06-16 3rd Qu.: 2639
# Max. :2017-01-08 Max. :53517
nrow(traffic)
# [1] 2289
library(ggplot2)
qplot(Date, Pageviews, data=traffic)
qplot(Date, log(Pageviews), data=traffic)
```

Two things jump out. The distribution of traffic is weird, with enormous spikes; doing a log-transform to tame the spikes, it is also clearly a non-stationary time-series with heavy autocorrelation as traffic consistently grows and declines. These are not surprising, as social media traffic from sites like Hacker News or Reddit are notorious for creating spikes in site traffic (and sometimes bringing them down under the load), and I would hope that as I keep writing things, traffic would gradually increase! Nevertheless, both of these will make the traffic data difficult to analyze despite having over 6 years of it.

## Power analysis

We can demonstrate with a quick power analysis: if we pick a random subset of days and force a decrease of 2.8% (the value on the decision boundary), can we detect that?

```
ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(Pageviews ~ Ads, data=ads)
# t = 0.27315631, df = 2285.9151, p-value = 0.7847577
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
# -203.7123550 269.6488393
# sample estimates:
# mean in group 0 mean in group 1
# 2335.331004 2302.362762
wilcox.test(log(Pageviews) ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(log(Pageviews) ~ Ads, data=ads)
# t = 0.36685265, df = 2286.8348, p-value = 0.7137629
sd(ads$Pageviews)
# [1] 2880.044636
```

The answer is no. We are *nowhere* near being able to detect it with either a *t*-test or the nonparametric *u*-test (which one might expect to handle the strange distribution better), and the log transform doesn’t help. We can hardly even see a hint of the decrease in the *t*-test - the decrease in the mean is ~30 pageviews but the standard deviations are ~2900 and actually bigger than the mean. So the spikes in the traffic are crippling the tests and this cannot be fixed by waiting a few more months since it’s inherent to the data.

If our trusty friend the log-transform can’t help, what can we do? In this case, we know that the reality here is literally a mixture model as the spikes are being driven by qualitatively distinct phenomenon like a `gwern.net`

link appearing on the HN front page, as compared to normal daily traffic from existing links & search traffic^{7}; but mixture models tend to be hard to use. One ad hoc approach to taming the spikes would be to effectively throw them out by winsorizing/clipping everything at a certain point (since the daily traffic average is ~1700, perhaps twice that, 3000):

```
ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
ads[ads$Pageviews>3000,]$Pageviews <- 3000
sd(ads$Pageviews)
# [1] 896.8798131
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 679859, p-value = 0.1131403
t.test(Pageviews ~ Ads, data=ads)
# t = 1.3954503, df = 2285.3958, p-value = 0.1630157
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
# -21.2013943 125.8265361
# sample estimates:
# mean in group 0 mean in group 1
# 1830.496049 1778.183478
```

Better but still inadequate. Even with the spikes tamed, we continue to have problems; the logged graph suggests that we can’t afford to ignore the time-series aspect. A check of autocorrelation indicates substantial autocorrelation out to lags as high as 8 days:

The usual regression framework for time-series is the ARIMA time-series model, in which the current daily value would be regressed on by each of the previous day’s values (with an estimated coefficient for each lag, as day 8 ought to be less predictive than day 7 and so on) and possibly a difference and a moving average (also with varying distances in time). The models are usually denoted as ARIMA(days back to use as lags, days back to difference, days back for moving average)

. So the `pacf`

suggests that an `ARIMA(8,0,0)`

might work - lags back 8 days but agnostic on differencing and moving averages, respectively. R’s `forecast`

library helpfully includes both an `arima`

regression function and also an `auto.arima`

to do model comparison. `auto.arima`

generally finds that a much simpler model than `ARIMA(8,0,0)`

works best, preferring models like `ARIMA(4,1,1)`

(presumably the differencing and moving-average steal enough of the distant lags’ predictive power that they no longer look better to AIC). Such an ARIMA model works well and now we actually can detect convincingly our simulated effect:

```
library(forecast)
library(lmtest)
l <- lm(Pageviews ~ Ads, data=ads); summary(l)
# Residuals:
# Min 1Q Median 3Q Max
# -2352.275 -995.275 -557.275 294.725 51239.783
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2277.21747 86.86295 26.21621 < 2e-16
# Ads 76.05732 120.47141 0.63133 0.52789
#
# Residual standard error: 2879.608 on 2287 degrees of freedom
# Multiple R-squared: 0.0001742498, Adjusted R-squared: -0.0002629281
# F-statistic: 0.3985787 on 1 and 2287 DF, p-value: 0.5278873
a <- arima(ads$Pageviews, xreg=ads$Ads, order=c(4,1,1))
summary(a); coeftest(a)
# Coefficients:
# ar1 ar2 ar3 ar4 ma1 ads$Ads
# 0.5424117 -0.0803198 -0.0310823 -0.0094242 -0.8906085 -52.4148244
# s.e. 0.0281538 0.0245621 0.0245500 0.0240701 0.0189952 10.5735098
#
# sigma^2 estimated as 89067.31: log likelihood = -16285.31, aic = 32584.63
#
# Training set error measures:
# ME RMSE MAE MPE MAPE MASE ACF1
# Training set 3.088924008 298.3762646 188.5442545 -6.839735685 31.17041388 0.9755280945 -0.0002804416646
#
# z test of coefficients:
#
# Estimate Std. Error z value Pr(>|z|)
# ar1 0.5424116948 0.0281538043 19.26602 < 2.22e-16
# ar2 -0.0803197830 0.0245621012 -3.27007 0.0010752
# ar3 -0.0310822966 0.0245499783 -1.26608 0.2054836
# ar4 -0.0094242194 0.0240700967 -0.39153 0.6954038
# ma1 -0.8906085375 0.0189952434 -46.88587 < 2.22e-16
# Ads -52.4148243747 10.5735097735 -4.95718 7.1523e-07
```

One might reasonably ask, what is doing the real work, the truncation/trimming or the `ARIMA(4,1,1)`

? The answer is both; if we go back and regenerate the `ads`

dataset *without* the truncation/trimming and we look again at the estimated effect of `Ads`

, we find it changes to

For the simple linear model with no time-series or truncation, the standard error on the ads effect is 121; for the time-series with no truncation, the standard error is 81; and for the time series plus truncation, the standard error is 11. My conclusion is that we can’t leave either one out if we are to reach correct conclusions in any feasible sample size - we *must* deal with the spikes, and we *must* deal with the time-series aspect.

So having settled on a specific ARIMA model with truncation, I can do a power analysis. For a time-series, the simple bootstrap is inappropriate as it ignores the autocorrelation; the right bootstrap is the block bootstrap: for each hypothetical sample size *n*, split the traffic history into as many non-overlapping *n*-sized chunks *m* as possible, select-with-replacement from them *m*, and run the analysis. This is implemented in the R `boot`

library.

```
library(boot)
library(lmtest)
## fit models & report p-value/test statistic
ut <- function(df) { wilcox.test(Pageviews ~ Ads, data=df)$p.value }
at <- function(df) { coeftest(arima(df$Pageviews, xreg=df$Ads, order=c(4,1,1)))[,4][["df$Ads"]] }
## create the hypothetical effect, truncate, and test
simulate <- function (df, testFunction, effect=0.03, truncate=TRUE, threshold=3000) {
df$Ads <- rbinom(nrow(df), size=1, p=0.5)
df[df$Ads==1,]$Pageviews <- round(df[df$Ads==1,]$Pageviews * (1-effect))
if(truncate) { df[df$Pageviews>threshold,]$Pageviews <- threshold }
return(testFunction(df))
}
power <- function(ns, df, test, effect, alpha=0.05, iters=2000) {
powerEstimates <- vector(mode="numeric", length=length(ns))
i <- 1
for (n in ns) {
tsb <- tsboot(df, function(d){simulate(d, test, effect=effect)}, iters, l=n,
sim="fixed", parallel="multicore", ncpus=getOption("mc.cores"))
powerEstimates[i] <- mean(tsb$t < alpha)
i <- i+1 }
return(powerEstimates) }
ns <- seq(10, 2000, by=5)
## test the critical value but also 0 effect to check whether alpha is respected
powerUtestNull <- power(ns, traffic, ut, 0)
powerUtest <- power(ns, traffic, ut, 0.028)
powerArimaNull <- power(ns, traffic, at, 0)
powerArima <- power(ns, traffic, at, 0.028)
p1 <- qplot(ns, powerUtestNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p2 <- qplot(ns, powerUtest) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p3 <- qplot(ns, powerArimaNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p4 <- qplot(ns, powerArima) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
library(grid)
library(gridExtra)
grid.arrange(p1, p3, p2, p4, ncol = 2, name = "Power analysis of detecting null effect/2.8% reduction using u-test and ARIMA regression")
```

So the false-positive rate is preserved for both, the ARIMA requires a reasonable-looking *n*<70 to be well-powered, but the *u*-test power is bizarre - the power is never great, never going >31.6%, and actually *decreasing* after a certain point, which is not something you usually see in a power graph. (The ARIMA power curve is also odd but at least it doesn’t get worse with more data!) My speculation about that is that it is because as the time-series window increases, more of the spikes come into view of the *u*-test, making the distribution dramatically wider & this more than overwhelms the gain in detectability; hypothetically, with even more years of data, the spikes would stop coming as a surprise and the gradual hypothetical damage of the ads will then become more visible with increasing sample size as expected.

An model is an autoregressive model of order 2, or AR(2) model. This model can be written as: X_t - mu = (Beta1 * (X_t-1 - mu)) + (Beta2 * (Xt-2 - mu)) + Z_t, where X_t is the stationary time series we are studying (the time series of volcanic dust veil index), mu is the mean of time series X_t, Beta1 and Beta2 are parameters to be estimated, and Z_t is white noise with mean zero and constant variance.

AR(1): https://doingbayesiandataanalysis.blogspot.com/2012/10/bayesian-estimation-of-trend-with-auto.html ARMA(2,0): http://radhakrishna.typepad.com/ar2-parameter-estimation-in-winbugs-and-jags.pdf ARIMA(1,1,1) http://www.zinkov.com/posts/2012-06-27-why-prob-programming-matters/

JAGS:

```
library(runjags)
arima311 <- "model {
# initialize the first 3 days, which we need to fit the 3 lags/moving-averages for day 4:
# y[1] <- 50
# y[2] <- 50
# y[3] <- 50
eps[1] <- 0
eps[2] <- 0
eps[3] <- 0
for (i in 4:length(y)) {
y[i] ~ dt(mu[i], tauOfClust[clust[i]], nuOfClust[clust[i]])
mu[i] <- muOfClust[clust[i]] + w1*y[i-1] + w2*y[i-2] + w3*y[i-3] + m1*eps[i-1]
eps[i] <- y[i] - mu[i]
clust[i] ~ dcat(pClust[1:Nclust])
}
for (clustIdx in 1:Nclust) {
muOfClust[clustIdx] ~ dnorm(100, 1.0E-06)
sigmaOfClust[clustIdx] ~ dnorm(500, 1e-06)
tauOfClust[clustIdx] <- pow(sigmaOfClust[clustIdx], -2)
nuMinusOneOfClust[clustIdx] ~ dexp(5)
nuOfClust[clustIdx] <- nuMinusOneOfClust[clustIdx] + 1
}
pClust[1:Nclust] ~ ddirch(onesRepNclust)
m1 ~ dnorm(0, 4)
w1 ~ dnorm(0, 5)
w2 ~ dnorm(0, 4)
w3 ~ dnorm(0, 3)
}"
y <- traffic$Pageviews
Nclust = 2
clust = rep(NA,length(y))
clust[which(y<1800)] <- 1 # seed labels for cluster 1, normal traffic
clust[which(y>4000)] <- 2 # seed labels for cluster 2, spikes
model <- run.jags(arima311, data = list(y=y, Nclust = Nclust, clust=clust, onesRepNclust = c(1,1) ),
monitor=c("w1", "w2", "w3", "m1", "pClust", "muOfClust", "sigmaOfClust", "nuOfClust"),
inits=list(w1=0.55, w2=0.37, w3=-0.01, m1=0.45, pClust=c(0.805, 0.195), muOfClust=c(86.5, 781), sigmaOfClust=c(156, 763), nuMinusOneOfClust=c(2.4-1, 1.04-1)),
n.chains = getOption("mc.cores"), method="rjparallel", sample=500)
summary(model)
```

JAGS is painfully slow: 5h+ for 500 samples. Sharper priors, removing the 4th-order ARIMA lag, and better initialization didn’t help. Too much correlation? Try switching to STAN.

```
traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int<lower=1> K; // number of mixture components
int<lower=1> T; // number of data points
int<lower=0> y[T]; // observations
}
parameters {
simplex[K] theta; // mixing proportions
real<lower=0, upper=100> muM[K]; // locations of mixture components
real<lower=0.01, upper=1000> sigmaM[K]; // scales of mixture components
real<lower=0.01, upper=5> nuM[K];
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
}
model {
real mu[T, K]; // prediction for time t
vector[T] err; // error for time t
real ps[K]; // temp for log component densities
// initialize the first 4 days for the lags
mu[1][1] = 0; // assume err[0] == 0
mu[2][1] = 0;
mu[3][1] = 0;
mu[4][1] = 0;
err[1] = y[1] - mu[1][1];
err[2] = y[2] - mu[2][1];
err[3] = y[3] - mu[3][1];
err[4] = y[4] - mu[4][1];
muM ~ normal(0, 5);
sigmaM ~ cauchy(0, 2);
nuM ~ exponential(1);
ma ~ normal(0, 0.5);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
for (t in 5:T) {
for (k in 1:K) {
mu[t][k] = muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
err[t] = y[t] - mu[t][k];
ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
}
target += log_sum_exp(ps);
}
}"
# 17m for 200 samples
nchains <- getOption("mc.cores") - 1
# original, based on MCMC:
# inits <- list(theta=c(0.92, 0.08), muM=c(56.2, 0.1), sigmaM=c(189.7, 6), nuM=c(1.09, 0.61), phi1=1.72, phi2=-0.8, phi3=0.08, phi4=0, ma=-0.91)
# optimized based on gradient descent
inits <- list(theta=c(0.06, 0.94), muM=c(0.66, 0.13), sigmaM=c(5.97, 190.05), nuM=c(1.40, 1.10), phi1=1.74, phi2=-0.83, phi3=0.10, phi4=-0.01, ma=-0.93)
model2 <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, K=2), init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=50); print(model2)
traceplot(model2)
```

This was perhaps the first time I’ve attempted to write a complex model in Stan, in this case, adapting a simple ARIMA time-series model from the Stan manual. Stan has some interesting features like the variational inference optimizer which can find sensible parameter values for complex models in seconds, an active community & involved developers & an exciting roadmap, and when Stan works it is substantially faster than the equivalent JAGS model; but I encountered a number of drawbacks (in decreasing order of importance):

Stan’s treatment of mixture models and discrete variables is… not good. I like mixture models & tend to think in terms of them and latent variables a lot, which makes the neglect an issue for me. This was particularly vexing in my initial modeling where I tried to allow for traffic spikes from HN etc by having a mixture model, with one component for

regular

traffic and one component for traffic surges. This is relatively straightforward in JAGS as one defines a categorical variable and indexes into it, but it is a nightmare in Stan, requiring a bizarre hack.I defy any Stan user to look at the example mixture model in the manual and tell me that they naturally and easily understand the

`target`

/`temperature`

stuff as a way of implementing a mixture model. I sure didn’t. And once I did get it implemented, I couldn’t modify it at all. And it was slow, too, eroding the original performance advantage over JAGS. I was saved only by the fact that the A/B test period happened to not include many spikes and so I could simply drop the mixture model entirely.mysterious segfaults and errors under a variety of condition; once when my cat walked over my keyboard, and frequently when running multi-core Stan in a loop. The latter was a serious issue for me when running a permutation test with 5000 iterates: when I run Stan on 8 chains in parallel normally (hence 1/8th the samples per chain) in a for-loop - the simplest way to implement the permutation test - it would occasionally segfault and take down R. I was forced to reduce the chains to 1 before it stopped crashing, making it 8 times slower unless I wished to add in manual parallel processing, running 8 separate Stans.

Stan’s support for posterior predictives is poor. The manual tells one to use a different module/scope,

`generated_quantities`

, lest the code be slow, which apparently requires one to copy-paste the entire likelihood section! Which is especially unfortunate when doing a time-series and requiring access to arrays/vectors declared in a different scope… I never did figure out how to generate posterior predictionscorrectly

for that reason, and resorted to the usual Bugs/JAGS-like method (which thankfully does work)Stan’s treatment of missing data is also uninspiring and makes me worried about more complex analyses where I am not so fortunate as to have perfectly clean complete datasets

Stan’s syntax is terrible, particularly the

*entirely*unnecessary semicolons. It is 2017, I should not be spending my time adding useless end-of-line markers. If they are necessary for C++, they can be added by Stan itself. This was particularly infuriating when painfully editing a model trying to implement various parameterizations and rerunning only to find that I had forgotten a semicolon (as*no*language I use regularly, R, Haskell, shell, Python, or Bugs/JAGS, insists on them!).

# Analysis

## Descriptive

AdSense data:

```
traffic <- read.csv("https://www.gwern.net/docs/traffic/2017-10-20-abtesting-adsense.csv",
colClasses=c("Date", "integer", "integer", "numeric", "integer", "integer",
"integer", "numeric", "integer", "numeric"))
library(skimr)
skim(traffic)
# n obs: 288
# n variables: 10
#
# Variable type: Date
# variable missing complete n min max median n_unique
# Date 0 288 288 2017-01-01 2017-10-15 2017-05-24 288
#
# Variable type: integer
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Ad.impressions 0 288 288 358.95 374.67 0 33 127.5 708 1848 ▇▁▂▃▁▁▁▁
# Ad.pageviews 0 288 288 399.02 380.62 0 76.5 180.5 734.5 1925 ▇▁▂▃▁▁▁▁
# Ads.r 0 288 288 0.44 0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▆
# Pageviews 0 288 288 1319.93 515.8 794 1108.75 1232 1394 8310 ▇▁▁▁▁▁▁▁
# Sessions 0 288 288 872.1 409.91 561 743 800 898.25 6924 ▇▁▁▁▁▁▁▁
# Total.time 0 288 288 84517.41 24515.13 39074 70499.5 81173.5 94002 314904 ▅▇▁▁▁▁▁▁
#
# Variable type: numeric
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Ad.pageviews.logit 0 288 288 -1.46 2.07 -8.13 -2.79 -1.8 0.53 1.44 ▁▁▁▂▆▃▂▇
# Ads.percent 0 288 288 0.29 0.29 0 0.024 0.1 0.59 0.77 ▇▁▁▁▁▂▃▂
# Avg.Session.seconds 0 288 288 99.08 17.06 45.48 87.3 98.98 109.22 145.46 ▁▁▃▅▇▅▂▁
sum(traffic$Sessions); sum(traffic$Pageviews)
# [1] 251164
# [1] 380140
library(ggplot2)
qplot(Date, Pageviews, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
coord_cartesian(ylim = c(750,3089)) +
labs(color="Ads", title="AdSense advertising effect on Gwern.net daily traffic, January-October 2017")
qplot(Date, Total.time, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
coord_cartesian(ylim = c(38000,190000)) +
labs(color="Ads", title="AdSense advertising effect on total time spent reading Gwern.net , January-October 2017")
```

Traffic looks similar whether counting by total page views or total time reading (average-time-reading-per-session x number-of-sessions); the data is definitely autocorrelated, somewhat noisy, and I get an impression that there is a noticeable decrease in the advertising days despite the measurement error:

## Simple tests & regressions

As expected from the power analysis, the usual tests are unable to reliably detect anything but it’s worth noting that the point-estimates of both the mean & median indicate the ads are worse:

```
t.test(Pageviews ~ Ads.r, data=traffic)
# Welch Two Sample t-test
#
# data: Pageviews by Ads.r
# t = 0.28265274, df = 178.00378, p-value = 0.7777715
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
# -111.4252500 148.6809819
# sample estimates:
# mean in group 0 mean in group 1
# 1328.080247 1309.452381
wilcox.test(Pageviews ~ Ads.r, conf.int=TRUE, data=traffic)
# Wilcoxon rank sum test with continuity correction
#
# data: Pageviews by Ads.r
# W = 11294, p-value = 0.1208844
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
# -10.00001128 87.99998464
# sample estimates:
# difference in location
# 37.9999786
```

The tests can only handle a binary variable, so next is a quick simple linear model, and then a Bayesian regression with an autocorrelation term; both turn up a weak effect for the binary randomization, and then much stronger (and negative) for the more accurate percentage measurement:

```
summary(lm(Pageviews ~ Ads.r, data = traffic))
# ...Residuals:
# Min 1Q Median 3Q Max
# -534.0802 -207.6093 -90.0802 65.9198 7000.5476
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1328.08025 40.58912 32.72010 < 2e-16
# Ads.r -18.62787 61.36498 -0.30356 0.76168
#
# Residual standard error: 516.6152 on 286 degrees of freedom
# Multiple R-squared: 0.0003220914, Adjusted R-squared: -0.003173286
# F-statistic: 0.09214781 on 1 and 286 DF, p-value: 0.7616849
summary(lm(Pageviews ~ Ads.percent, data = traffic))
# ...Residuals:
# Min 1Q Median 3Q Max
# -579.4145 -202.7547 -89.1473 60.7785 6928.3160
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1384.17269 42.77952 32.35596 < 2e-16
# Ads.percent -224.79052 105.98550 -2.12096 0.034786
#
# Residual standard error: 512.6821 on 286 degrees of freedom
# Multiple R-squared: 0.01548529, Adjusted R-squared: 0.01204293
# F-statistic: 4.498451 on 1 and 286 DF, p-value: 0.03478589
library(brms)
b <- brm(Pageviews ~ Ads.r, autocor = cor_bsts(), iter=20000, chains=8, data = traffic); b
# Family: gaussian(identity)
# Formula: Pageviews ~ Ads.r
# Data: traffic (Number of observations: 288)
# Samples: 8 chains, each with iter = 20000; warmup = 10000; thin = 1;
# total post-warmup samples = 80000
# ICs: LOO = Not computed; WAIC = Not computed
#
# Correlation Structure: bsts(~1)
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigmaLL 50.86 16.47 26.53 90.49 741 1.01
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.r 50.36 63.19 -74.73 174.21 34212 1
#
# Family Specific Parameters:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigma 499.77 22.38 457.76 545.4 13931 1
#
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
# is a crude measure of effective sample size, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
b2 <- brm(Pageviews ~ Ads.percent, autocor = cor_bsts(), chains=8, data = traffic); b2
...
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.percent -91.8 113.05 -317.85 131.41 2177 1
```

Treating percentage as additive is also odd, as one would expect it to be multiplicative in some sense. As well, `brms`

makes it convenient to throw in a simple Bayesian structural autocorrelation (corresponding to `AR(1)`

if I am understanding it correctly) but the function involved does not support the higher-order lags or moving average involved in traffic.

## Stan ARIMA time-series model

The fully Bayesian analysis in Stan, using `ARIMA(4,0,1)`

time-series, a multiplicative effect of ads as percentage of traffic, skeptical informative priors of small negative effects, and extracting posterior predictions (of each day if hypothetically it were not advertising-affected) for further analysis:

```
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int<lower=1> T; // number of data points
int<lower=0> y[T]; // traffic
real Ads[T]; // Ad logit
}
parameters {
real<lower=0> muM;
real<lower=0> sigma;
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
real<upper=0> ads; // advertising coeff; can only be negative
real<lower=0, upper=10000> y_pred[T]; // traffic predictions
}
model {
real mu[T]; // prediction for time t
vector[T] err; // error for time t
// initialize the first 4 days for the lags
mu[1] = 0;
mu[2] = 0;
mu[3] = 0;
mu[4] = 0;
err[1] = y[1] - mu[1];
err[2] = y[2] - mu[2];
err[3] = y[3] - mu[3];
err[4] = y[4] - mu[4];
muM ~ normal(1300, 500);
sigma ~ exponential(250);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
ma ~ normal(0, 0.5);
ads ~ normal(0,1);
for (t in 5:T) {
mu[t] = muM + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
err[t] = y[t] - mu[t];
y[t] ~ normal(mu[t] * (1 + ads*Ads[t]), sigma);
y_pred[t] ~ normal(mu[t] * 1, sigma); // for comparison, what would the ARIMA predict for a today w/no ads?
}
}"
# extra flourish: find posterior mode via Stan's new L-BFGS gradient descent optimization feature;
# also offers a good initialization point for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent), hessian=TRUE)
round(optimized$par, digits=3)
# muM sigma phi1 phi2 phi3 phi4 ma ads
# 1352.864 65.221 -0.062 0.033 -0.028 0.083 0.249 -0.144
## Initialize from previous MCMC run:
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.15)
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=200000); print(model)
# ...Elapsed Time: 413.816 seconds (Warm-up)
# 654.858 seconds (Sampling)
# 1068.67 seconds (Total)
#
# Inference for Stan model: bacd35459b712679e6fc2c2b6bc0c443.
# 1 chains, each with iter=2e+05; warmup=1e+05; thin=1;
# post-warmup draws per chain=1e+05, total post-warmup draws=1e+05.
#
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# muM 1355.27 0.20 47.54 1261.21 1323.50 1355.53 1387.20 1447.91 57801 1
# sigma 65.61 0.00 0.29 65.03 65.41 65.60 65.80 66.18 100000 1
# phi1 -0.06 0.00 0.04 -0.13 -0.09 -0.06 -0.04 0.01 52368 1
# phi2 0.03 0.00 0.01 0.01 0.03 0.03 0.04 0.05 100000 1
# phi3 -0.03 0.00 0.01 -0.04 -0.03 -0.03 -0.02 -0.01 100000 1
# phi4 0.08 0.00 0.01 0.07 0.08 0.08 0.09 0.10 100000 1
# ma 0.25 0.00 0.04 0.18 0.23 0.25 0.27 0.32 52481 1
# ads -0.14 0.00 0.01 -0.16 -0.15 -0.14 -0.14 -0.13 100000 1
# ...
mean(extract(model)$ads)
# [1] -0.1449574151
## permutation test to check for model misspecification: shuffle ad exposure and rerun the model,
## see what the empirical null distribution of the ad coefficient is and how often it yields a
## reduction of >= -14.5%:
empiricalNull <- numeric()
iters <- 5000
for (i in 1:iters) {
df <- traffic
df$Ads.percent <- sample(df$Ads.percent)
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.01)
# nchains <- 1; options(mc.cores = 1) # disable multi-core to work around occasional Stan segfaults
model <- stan(model_code=m, data=list(T=nrow(df), y=df$Pageviews, Ads=df$Ads.percent),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains); print(model)
adEstimate <- mean(extract(model)$ads)
empiricalNull[i] <- adEstimate
}
summary(empiricalNull); sum(empiricalNull < -0.1449574151) / length(empiricalNull)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.206359600 -0.064702600 -0.012325460 -0.035497930 -0.001696464 -0.000439064
# [1] 0.0136425648
```

We see a consistent & large estimate of harm: the mean of traffic falls by -14.5% (95% CI: -0.16 to -0.13; permutation test: *p*=0.01) on 100% ad-affected traffic!

To more directly calculate the harm, I turn to the posterior predictions, which were computed for each day under the hypothetical of no advertising; one would expect the prediction for all days to be somewhat higher than the actual traffics were (because almost every day has some non-zero % of ad-affected traffic), and, summed or averaged over all days, that gives the predicted loss of traffic from ads:

```
mean(traffic$Pageviews)
# [1] 1319.930556
## fill in defaults when extracting mean posterior predictives:
traffic$Prediction <- c(1319,1319,1319,1319, colMeans(extract(model)$y_pred)[5:288])
# qplot(traffic$Date, traffic$Prediction, color=as.logical(traffic$Ads.r)) + stat_smooth()
mean(with(traffic, Prediction - Pageviews) )
# [1] 53.67329617
mean(with(traffic, (Prediction - Pageviews) / Pageviews) )
# [1] 0.09668207805
sum(with(traffic, Prediction - Pageviews) )
# [1] 15457.9093
```

So during the A/B test, the expected estimated loss of traffic is ~9.7%.

## Decision

As this is so far past the decision threshold and the 95% credible interval around -0.14 is extremely tight (-0.16 - -0.13), the EVSI of any additional sampling is surely negative & not worth calculating.

Thus, I permanently removed the AdSense banner ad in the middle of 11 September 2017.

# Discussion

The result is surprising. I had been expecting some degree of harm but the estimated reduction is much larger than I expected. Could banner ads really be *that* harmful?

The effect is estimated with considerable precision, so it’s almost certainly not a fluke of the data (if anything I collected far more data than I should’ve); there weren’t many traffic spikes to screw with the analysis, so omitting mixture model or *t*-scale responses in the model doesn’t seem like it should be an issue either; the modeling itself might be driving it, but the crudest tests suggest a similar level of harm (just not at high statistical-significance or posterior probability); it does seem to be visible in the scatterplot; and the more realistic models - which include time-series aspects I know exist from the long historical time-series of `gwern.net`

traffic & skeptical priors encouraging small effects - estimate it much better as I expected from my previous power analyses, and considerable tinkering with my original `ARIMA(4,0,1)`

Stan model to check my understanding of my code (I haven’t used Stan much before) didn’t turn up any issues or make the effect go away. So as far as I can tell, this effect is real. I still doubt my results, but it’s convincing enough for me to disable ads, at least.

Does it generalize? I admit `gwern.net`

is unusual: highly technical longform static content in a minimalist layout optimized for fast loading & rendering catering to Anglophone STEM-types in the USA. It is entirely possible that for most websites, the effect of ads is much smaller because they already load so slow, have much busier cluttered designs, their users have less visceral distaste for advertising or are more easily targeted for useful advertising etc, and thus `gwern.net`

is merely an outlier for whom removing ads makes sense (particularly given my option of being Patreon-supported rather than depending entirely on ads like many media websites must). I have no way of knowing whether or not this is true, and as always with optimizations, one should benchmark one’s own specific usecase; perhaps in a few years more results will be reported and it will be seen if my results are merely a coding error or an outlier or something else.

If a max loss of 14% and average loss of ~9% (both of which could be higher for sites whose users don’t use adblock as much) is accurate and generalizable to other blogs/websites , there are many implications: in particular, it implies a huge deadweight loss to Internet users from advertising; and suggests advertising may be a net loss for many smaller sites. Ironically, in the latter case, those sites may not yet have realize, and may never realize, how much the pennies they earn from advertising are costing them, because the harm won’t show up in standard single-user A/B testing due to either measurement error hiding much of the effect or because it exists as an emergent global effect, requiring long-term experimentation & relatively sophisticated time-series modeling.

There may be a connection here to earlier observations on the business of advertising questioning whether advertising works, works more than it hurts or cannibalizes other avenues, works sufficiently well to be profitable, or sufficiently well to know if it is working at all: on purely statistical grounds, it should be hard to cost-effectively show that advertising works at all (Lewis & Reiley 2008/Lewis & Rao 2014), Steve Sailer’s observation that IRI’s BehaviorScan field-experiment linking individual TV advertisements & grocery store sales likely showed little effect, eBay’s own experiments to similar effect (Lewis et al 2011, Blake et al 2014), P&G & JPMorgan digital ad cuts (and the continued success of many São Paulo-style low-advertising retailers like Aldi or MUJI), the extreme inaccuracy of correlational attempts to predict advertising effects (Lewis et al 2011, Gordon et al 2016; see also Eckles & Bakshy 2017), political science’s difficulty showing any causal impact of campaign advertisement spending on victories (Kalla & Broockman 2017; & mostly recently, Donald Trump), and many anecdotal reports seriously questioning the value of Facebook or Google advertising for their businesses in yielding mistaken/curious or fraudulent or just useless traffic.

# Followup test

Experiment design: blocked by weeks, no repetition (Latin square-style): 6 months, 27 September 2018 - 27 March 2019

```
library(gtools)
m <- permutations(2, 7, 0:1, repeats=TRUE)
m <- m[sample(1:nrow(m), size=51),]; c(m)
# [1] 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 1 1 1 1
# [81] 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1
# [161] 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0
# [241] 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0
# [321] 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1
```

`default.css`

:

```
/* hide the ad in the ad A/B test by default */
div#ads { display: block; text-align: center; display: none; }
```

`default.html`

:

```
<div id="ads"><a href="https://triplebyte.com/a/Lpa4wbK/d"><img alt="Banner ad for the tech recruiting company Triplebyte: 'Triplebyte is building a background-blind screening process for hiring software engineers'" width="500" height="128" src="/static/img/ads-triplebyte-banner.png"></a></div>
<header>
...
<!-- A/B test of ad effects on site traffic, static ad version: randomize blocks of 7-days based on day-of-year pre-generated randomness -->
<script id="adsABTestJS">
// create Date object for current location
var d = new Date();
// convert to msec
// add local time zone offset
// get UTC time in msec
var utc = d.getTime() + (d.getTimezoneOffset() * 60000);
// create new Date object for different timezone, EST/UTC-4
var now = new Date(utc + (3600000*-4));
start = new Date("2018-09-28"); var diff = now - start;
var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
randomness = [1,0,1,1,1,0,0,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,1,1,0,0,0,0,0,
1,1,0,0,1,1,1,0,1,0,1,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1,1,0,1,0,1,1,0,0,0,
0,0,1,1,0,1,0,0,1,1,0,1,1,1,1,1,0,1,0,0,1,0,1,1,1,1,0,1,1,1,1,1,1,0,1,
1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,0,1,0,0,1,1,1,0,1,0,0,1,
0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,0,1,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,1,1,1,0,0,0,1,0,0,1,0,1,1,1,1,1,1,0,0,
0,0,0,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,0,0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,1,
0,1,1,0,1,0,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,0,1,1,1,
0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,
0,0,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,1,0,0,1,
0,0,1,1,0,0,1,0,0,0,0,1,1];
if (randomness[day]) {
document.getElementById("ads").style.display = "block";
document.getElementById("ads").style.visibility = "visible";
}
</script>
```

Appearance: /images/traffic/2018-09-28-abtest-advertising-2nd-appearance.png

# External links

A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments

, Dmitriev et al 2017- Discussion: HN

# Appendices

## Stan: mixture time-series

An attempt at a `ARIMA(4,0,1)`

time-series mixture model, where the mixture has two components: one component for normal traffic where daily traffic is ~1000 making up >90% of daily data, and one component for the occasional traffic spike around 10x larger but happening rarely:

```
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int<lower=1> K; // number of mixture components
int<lower=1> T; // number of data points
int<lower=0> y[T]; // traffic
int<lower=0, upper=1> Ads[T]; // Ad randomization
}
parameters {
simplex[K] theta; // mixing proportions
positive_ordered[K] muM; // locations of mixture components; since no points are labeled,
// like in JAGS, we add a constraint to force an ordering, make it identifiable, and
// avoid label switching, which will totally screw with the posterior samples
real<lower=0.01, upper=500> sigmaM[K]; // scales of mixture components
real<lower=0.01, upper=5> nuM[K];
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
real<upper=0> ads; // advertising coeff; can only be negative
}
model {
real mu[T, K]; // prediction for time t
vector[T] err; // error for time t
real ps[K]; // temp for log component densities
// initialize the first 4 days for the lags
mu[1][1] = 0; // assume err[0] == 0
mu[2][1] = 0;
mu[3][1] = 0;
mu[4][1] = 0;
err[1] = y[1] - mu[1][1];
err[2] = y[2] - mu[2][1];
err[3] = y[3] - mu[3][1];
err[4] = y[4] - mu[4][1];
muM ~ normal(0, 5);
sigmaM ~ cauchy(0, 2);
nuM ~ exponential(1);
ma ~ normal(0, 0.5);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
ads ~ normal(0,1);
for (t in 5:T) {
for (k in 1:K) {
mu[t][k] = ads*Ads[t] + muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
err[t] = y[t] - mu[t][k];
ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
}
target += log_sum_exp(ps);
}
}"
# find posterior mode via L-BFGS gradient descent optimization; this can be a good set of initializations for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2), hessian=TRUE)
round(optimized$par, digits=3)
# theta[1] theta[2] muM[1] muM[2] sigmaM[1] sigmaM[2] nuM[1] nuM[2] phi1 phi2 phi3 phi4 ma
# 0.001 0.999 0.371 2.000 0.648 152.764 0.029 2.031 1.212 -0.345 -0.002 0.119 -0.604
# ads
# -0.009
## optimized:
inits <- list(theta=c(0.001, 0.999), muM=c(0.37, 2), sigmaM=c(0.648, 152), nuM=c(0.029, 2), phi1=1.21, phi2=-0.345, phi3=-0.002, phi4=0.119, ma=-0.6, ads=-0.009)
## MCMC means:
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains, control = list(max_treedepth = 15, adapt_delta = 0.95),
iter=20000); print(model)
traceplot(model, pars=names(inits))
```

This code winds up continuing to fail due to label-switching issue (ie the MCMC bouncing between estimates of what each mixture component is because of symmetry or lack of data) despite using some of the suggested fixes in the Stan model like the ordering trick. Since there were so few spikes in 2017 only, the mixture model can’t converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing 2017 data and I can simply model the outcome as a normal.

## EVSI

Demo code of simple Expected Value of Sample Information (EVSI) in a JAGS log-Poisson model of traffic (which turns out to be inferior to a normal distribution for 2017 traffic data but I keep here for historical purposes). We consider an experiment resembling historical data with a 5% traffic decrease due to ads; the reduction is modeled and implies a certain utility loss given my relative preferences for traffic vs the advertising revenue, and then the remaining uncertainty in the reduction estimate can be queried for how likely it is that the decision is wrong and that collecting further data would then change a wrong decision to a right one:

```
## simulate a plausible effect superimposed on the actual data:
ads[ads$Ads==1,]$Hits <- round(ads[ads$Ads==1,]$Hits * 0.95)
require(rjags)
y <- ads$Hits
x <- ads$Ads
model_string <- "model {
for (i in 1:length(y)) {
y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha0 - alpha1 * x[i]
}
alpha0 ~ dunif(0,10)
alpha1 ~ dgamma(50, 6)
}"
model <- jags.model(textConnection(model_string), data = list(x = x, y = y),
n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("alpha0", "alpha1"), n.iter=10000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
#
# Mean SD Naive SE Time-series SE
# alpha0 6.98054476 0.003205046 1.133155e-05 2.123554e-05
# alpha1 0.06470139 0.005319866 1.880857e-05 3.490445e-05
#
# 2. Quantiles for each variable:
#
# 2.5% 25% 50% 75% 97.5%
# alpha0 6.97426621 6.97836982 6.98055144 6.98273011 6.98677827
# alpha1 0.05430508 0.06110893 0.06469162 0.06828215 0.07518853
alpha0 <- samples[[1]][,1]; alpha1 <- samples[[1]][,2]
posteriorTrafficReduction <- exp(alpha0) - exp(alpha0-alpha1)
generalLoss <- function(annualAdRevenue, trafficLoss, hitValue, discountRate) {
(annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
loss <- function(tr) { generalLoss(360, tr, 0.02, 0.05) }
posteriorLoss <- sapply(posteriorTrafficReduction, loss)
summary(posteriorLoss)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -5743.5690 -3267.4390 -2719.6300 -2715.3870 -2165.6350 317.7016
```

Expected loss of turning on ads: -$2715. Current decision: keep ads off to avoid that loss. The expected average gain in the case where the correct decision is turning ads on:

so EVPI is $0.07. This doesn’t pay for any additional days of sampling, so there’s no need to calculate an exact EVSI.

I am not too happy about how much uncached JS the Disqus plugin loads or how long it takes to set itself up while spewing warnings in the browser console, but at the moment, I don’t know of any other static site commenting system which has good anti-spam capabilities or an equivalent user base, and Disqus has worked reasonably well for 5+ years.↩

This is especially an issue with A/B testing as usually practiced with NHST & arbitrary alpha threshold, which poses a

sorites of suck

problem; one could steadily degrade one’s website by repeatedly making bad changes which don’t*appear*harmful in small-scale experiments (no user harm,

,*p*>0.05, and increased revenue*p*<0.05no harm, increased revenue

,no harm, increased revenue

etc). One might call this theSchlitz beer problem

(How Milwaukee’s Famous Beer Became Infamous: The Fall of Schlitz

), after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. Another example is the now-infamousRed Delicious

apple: widely considered one of the worst-tasting apples commonly sold, it was reportedly an excellent-tasting apple when first discovered in 1880, winning contests for its flavor; but its flavor worsened rapidly over the 20th century, a decline blamed on apple growers gradually switching to ever-redder sports which looked better in grocery stores, a decline which ultimately culminated in the near-collapse of the Red-Delicious-centric Washington State apple industry when consumer backlash finally began in the 1980s with the availability of tastier apples like Gala. Thisdeath by degrees

can be countered by a few things, such as either testing regularly against a historical baseline to establish total cumulative degradation or carefully tuning $\alpha$/$\beta$ thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the*p*-threshold should be made less stringent for detecting harm).↩Why

block

instead of, say, just randomizing 5-days at a time (simple randomization

)? If we did that, we would occasionally do something like spend an entire month in one condition without switching, simply by rolling a 0 5 or 6 times in a row; since traffic can be expected to drift and change and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance thus shrinking the effect size thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had*n*=15 for advertising and only*n*=5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition). Not blocking doesn’t*bias*our analysis - we still get the right answers eventually - but it could be costly. Whereas if we block pairs of 2-days (`[00,11]`

vs`[11,00]`

), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, it’ll probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age and sex-matched participants, to eliminate unnecessary noise. The gains can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have let $n \lt \frac{1}{20}$:The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example

. Thus, when possible, I block my experiments at least temporally.↩After publishing initial results, Chris Stucchio commented on Twitter:

Most of the work on this stuff is proprietary. I ran such an experiment for a large content site which generated directionally similar results. I helped another major content site set up a similar test, but they didn’t tell me the results…as well as smaller effects from ads further down the page (e.g. Taboola). Huge sample size, very clear effects.

David Kitchen, a hobbyist operator of several hundred forums, claims that removing banner ads boosted all his metrics (admittedly, at the cost of total revenue), but unclear if this used randomizations or just a before-after comparison.↩Several of the Internet giants like Google have reported measurable harms from delays as small as 100ms.↩

Examples of ads I saw would be Lumosity ads or online university ads (typically master’s degrees, for some reason) on my DNB FAQ. They looked about what one would expect: generically glossy and clean. It is difficult to imagine anyone being offended by them.↩

This mixture model would have two distributions/components in it; there apparently is no point in trying to distinguish between levels of virality, as 3 or higher does not fit the data well:

↩`library(flexmix) stepFlexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=1:10, nrep=20) ## errors out k>2 fit <- flexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=2) summary(fit) # prior size post>0 ratio # Comp.1 0.196 448 449 0.998 # Comp.2 0.804 1841 1842 0.999 # # 'log Lik.' -1073191.871 (df=3) # AIC: 2146389.743 BIC: 2146406.95 summary(refit(fit)) # $Comp.1 # Estimate Std. Error z value Pr(>|z|) # (Intercept) 8.64943378382 0.00079660747 10857.837 < 2.22e-16 # # $Comp.2 # Estimate Std. Error z value Pr(>|z|) # (Intercept) 7.33703564817 0.00067256175 10909.088 < 2.22e-16`