---
title: Banner Ads Considered Harmful (Here)
description: 9 months of daily A/B-testing of Google AdSense banner ads on gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.
created: 8 Jan 2017
tags: experiments, statistics, decision theory, R, JS, power analysis, Bayes
status: in progress
confidence: possible
importance: 5
...
> One source of complexity & JavaScript use on `gwern.net` is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it's unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting *total* site traffic instead through long-term effects on or spillover mechanisms between readers (eg social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.
>
> Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical `gwern.net` traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 1 January 2017 to 15 October 2017, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.
>
> Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13% to -16%), and an expected traffic loss of -9.7% from the subset of users without adblock, exceeding my decision threshold for disabling ads and rendering further experimentation profitless.
>
> Thus, banner ads on `gwern.net` appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.
One thing about `gwern.net` I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why in my previous [A/B tests](/AB-testing) of site design changes, I have generally focused on CSS changes which do not affect load times.
Benchmarking loads, the total time is dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments.
While I want comments (so Disqus is not optional^[I am not too happy about how much uncached JS the Disqus plugin loads or how long it takes to set itself up while spewing warnings in the browser console, but at the moment, I don't know of any other static site commenting system which has good anti-spam capabilities or an equivalent user base, and Disqus has worked reasonably well for 5+ years.]), AdSense I keep only because, well, it makes me some money (~\$30 a month or ~\$360 a year; it would be more, but ~60% of visitors have adblock, which is apparently [unusually high for the US](https://pagefair.com/blog/2017/adblockreport/ "Pagefair 2017 Adblock Report")).
So ads are a good thing to do an experiment on: it offers a chance to remove one of the heaviest components of the page, an excuse to use a decision approach, an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.
# Modeling effects of advertising: global rather than local
This ad revenue isn't a *huge* amount (it is much less than my [monthly Patreon](https://www.patreon.com/gwern)) and might be offset by the effects of ads on load/render time and by readers' dislike of advertising.
If I am reducing my traffic & influence by 10% because people don't want to browse or link pages with ads, then it's definitely not worthwhile.
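To make "worthwhile" concrete: treating the site as a perpetuity, the net value of keeping ads is the annual ad revenue minus the annualized value of the lost traffic, discounted at some annual rate. A minimal sketch, using the same subjective parameters as the EVSI appendix code (~\$360/year revenue, \$0.02 per pageview, 5% annual discounting - assumptions, not measurements):

```javascript
// Net present value of keeping ads on, treating the site as a perpetuity.
// hitValue ($0.02/pageview) & discountRate (5%) are subjective assumptions,
// the same ones used in the EVSI appendix code.
function netValueOfAds(annualAdRevenue, dailyTrafficLoss, hitValue, discountRate) {
    return (annualAdRevenue - dailyTrafficLoss * hitValue * 365.25) / Math.log(1 + discountRate);
}

// Break-even: the daily pageview loss at which the ad revenue exactly cancels out
const breakEvenHits = 360 / (0.02 * 365.25); // ~49 pageviews/day
```

At roughly 1,000 pageviews/day, ~49 lost pageviews is ~5%, so under these particular assumptions any traffic loss much beyond a few percent makes the ads a net negative.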
One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when done individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.^[This is especially an issue with A/B testing as usually practiced with NHST & arbitrary alpha threshold, which poses a "[sorites](!Wikipedia "Sorites paradox") of suck" problem; one could steadily degrade one's website by repeatedly making bad changes which don't *appear* harmful in small-scale experiments ("no user harm, _p_>0.05, and increased revenue _p_<0.05", "no harm, increased revenue", "no harm, increased revenue" etc). One might call this the ["Schlitz beer problem"](!Wikipedia "Joseph Schlitz Brewing Company#Decline in status and sale to Stroh") (["How Milwaukee's Famous Beer Became Infamous: The Fall of Schlitz"](https://beerconnoisseur.com/articles/how-milwaukees-famous-beer-became-infamous)), after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. 
Another example is the now-infamous "[Red Delicious](!Wikipedia)" apple: widely considered one of the worst-tasting apples commonly sold, it was reportedly an excellent-tasting apple when first discovered in 1880, winning contests for its flavor; but its flavor worsened rapidly over the 20th century, a decline blamed on apple growers gradually switching to ever-redder [sports](!Wikipedia "Sport (botany)") which looked better in grocery stores, a decline which ultimately culminated in the near-collapse of the Red-Delicious-centric [Washington State apple industry](http://www.washingtonpost.com/wp-dyn/content/article/2005/08/04/AR2005080402194_pf.html "Why the Red Delicious No Longer Is") when consumer backlash finally began in the 1980s with the availability of tastier apples like [Gala](!Wikipedia "Gala (apple)"). This "death by degrees" can be countered by a few things, such as either testing regularly against a historical baseline to establish total cumulative degradation or carefully tuning $\alpha$/$\beta$ thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the _p_-threshold should be made less stringent for detecting harm).]
In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators).
It is [somewhat](https://tech.okcupid.com/the-pitfalls-of-a-b-testing-in-social-networks/ "The pitfalls of A/B testing in social networks") [appreciated](https://news.ycombinator.com/item?id=15484861) that A/B testing in social media or network settings is not as simple as randomizing individual users & running a _t_-test - as the users are not independent of each other (violating [SUTVA](!Wikipedia) among other things).
Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger more-independent treatment units.
So my usual ABalytics setup isn't appropriate here: I don't want to randomize individual visitors & measure time on page, I want to randomize individual days or weeks and measure total traffic, giving a time-series regression.
This could be randomized by uploading a different version of the site every day, but that would be tedious and inefficient, and it has a technical issue: aggressive caching of my webpages means that many visitors may be seeing old versions of the site!
With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like the 'day-of-year (1-366) modulo 2', hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense and match up with randomization and do a regression.
By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn't be one of the people seeing ads anyway.
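Once the randomization is running, each day yields one observation: that day's condition plus the total pageviews recorded by Google Analytics. As a sanity check before any time-series modeling, a naive difference-of-means (a hypothetical helper, not the actual analysis code - and inefficient here because it ignores the autocorrelation dealt with later) looks like:

```javascript
// Naive per-day effect estimate: compare mean daily pageviews between
// ads-on & ads-off days; ignores autocorrelation, trends, & spikes,
// which is why the real analysis uses ARIMA time-series models instead.
function naiveAdEffect(days) { // days: [{pageviews: Number, ads: 0|1}, ...]
    const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
    const on  = mean(days.filter(d => d.ads === 1).map(d => d.pageviews));
    const off = mean(days.filter(d => d.ads === 0).map(d => d.pageviews));
    return (on - off) / off; // proportional traffic change under ads
}
```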
Since there might be spillover effects due to lags in propagation through social media & email etc, daily randomization might be too fast, and 2-day [blocks](!Wikipedia "Blocking (statistics)") more appropriate, allowing occasional runs of up to a week or so to expose longer-term effects while still allocating equal total days to advertising/no-advertising.[^blocking]
[^blocking]: Why 'block' instead of, say, just randomizing 5 days at a time ("simple randomization")? If we did that, we would occasionally spend an entire month in one condition without switching, simply by drawing the same condition 5 or 6 times in a row; since traffic can be expected to drift, change, and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance, thus shrinking the effect size, thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had _n_=15 for advertising but only _n_=5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition). Not blocking doesn't *bias* our analysis - we still get the right answer eventually - but it could be costly. Whereas if we block pairs of 2-day units (`[00,11]` vs `[11,00]`), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, the spike will probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age- and sex-matched participants, to eliminate unnecessary noise. The gains can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have allowed a sample size less than $\frac{1}{20}$ as large: ["The Power of Twins: Revisiting Student's Scottish Milk Experiment Example"](Milk). Thus, when possible, I block my experiments at least temporally.
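One way to generate such a paired-block assignment (a hypothetical generator for concreteness, not the code actually used):

```javascript
// Paired 2-day blocks: each 4-day unit is randomly assigned [0,0,1,1] or
// [1,1,0,0], so both conditions get exactly half the days, the condition
// switches regularly, and (under this exact scheme) no run of one
// condition exceeds 4 days.
function blockedAssignment(nDays, rng = Math.random) {
    const days = [];
    while (days.length < nDays) {
        days.push(...(rng() < 0.5 ? [0, 0, 1, 1] : [1, 1, 0, 0]));
    }
    return days.slice(0, nDays);
}
```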
# Implementation: In-browser Randomization of Banner Ads
Setting this up in JS turned out to be a little tricky, since there is no built-in function for getting the day-of-year or for hashing numbers/strings; so rather than spend another 10 lines copy-pasting some hash functions, I copied some day-of-year code and then simply generated 366 binary variables in R for randomizing double-days, putting them in a JS array to do the randomization:
~~~{.HTML}
<!-- Illustrative sketch, not the exact production snippet: a pregenerated
     array of 366 binary constants (made in R) is indexed by day-of-year,
     and the AdSense invocation is only run on 'ads' days. -->
<script>
  var randomization = [1,1,0,0, /* ...362 more pregenerated binary constants... */ 0,0,1,1];
  var now   = new Date();
  var start = new Date(now.getFullYear(), 0, 0);
  var dayOfYear = Math.floor((now - start) / (1000 * 60 * 60 * 24)); // 1-366
  if (randomization[(dayOfYear - 1) % 366] === 1) {
      // load & display the AdSense banner ad as usual
  }
</script>
~~~
![Appearance](/images/traffic/2018-09-28-abtest-advertising-2nd-appearance.png)
# External links
- ["A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments"](http://exp-platform.com/Documents/2017-08%20KDDMetricInterpretationPitfalls.pdf), Dmitriev et al 2017
- Discussion: [HN](https://news.ycombinator.com/item?id=15681358)
# Appendices
## Stan: mixture time-series
An attempt at an `ARIMA(4,0,1)` time-series mixture model, where the mixture has two components: one for normal traffic, where daily traffic is ~1,000 and which makes up >90% of daily data, and one for the occasional traffic spike, around 10x larger but happening rarely:
~~~{.R}
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
int K; // number of mixture components
int T; // number of data points
int y[T]; // traffic
int Ads[T]; // Ad randomization
}
parameters {
simplex[K] theta; // mixing proportions
positive_ordered[K] muM; // locations of mixture components; since no points are labeled,
// like in JAGS, we add a constraint to force an ordering, make it identifiable, and
// avoid label switching, which will totally screw with the posterior samples
real<lower=0> sigmaM[K]; // scales of mixture components (constrained positive)
real<lower=0> nuM[K]; // Student-t degrees of freedom (constrained positive)
real phi1; // autoregression coeffs
real phi2;
real phi3;
real phi4;
real ma; // moving avg coeff
real<upper=0> ads; // advertising coeff; can only be negative
}
model {
real mu[T, K]; // prediction for time t
vector[T] err; // error for time t
real ps[K]; // temp for log component densities
// initialize the first 4 days for the lags
mu[1][1] = 0; // assume err[0] == 0
mu[2][1] = 0;
mu[3][1] = 0;
mu[4][1] = 0;
err[1] = y[1] - mu[1][1];
err[2] = y[2] - mu[2][1];
err[3] = y[3] - mu[3][1];
err[4] = y[4] - mu[4][1];
muM ~ normal(0, 5);
sigmaM ~ cauchy(0, 2);
nuM ~ exponential(1);
ma ~ normal(0, 0.5);
phi1 ~ normal(0,1);
phi2 ~ normal(0,1);
phi3 ~ normal(0,1);
phi4 ~ normal(0,1);
ads ~ normal(0,1);
for (t in 5:T) {
for (k in 1:K) {
mu[t][k] = ads*Ads[t] + muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
err[t] = y[t] - mu[t][k];
ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
}
target += log_sum_exp(ps);
}
}"
# find posterior mode via L-BFGS gradient descent optimization; this can be a good set of initializations for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2), hessian=TRUE)
round(optimized$par, digits=3)
# theta[1] theta[2] muM[1] muM[2] sigmaM[1] sigmaM[2] nuM[1] nuM[2] phi1 phi2 phi3 phi4 ma
# 0.001 0.999 0.371 2.000 0.648 152.764 0.029 2.031 1.212 -0.345 -0.002 0.119 -0.604
# ads
# -0.009
## optimized:
inits <- list(theta=c(0.001, 0.999), muM=c(0.37, 2), sigmaM=c(0.648, 152), nuM=c(0.029, 2), phi1=1.21, phi2=-0.345, phi3=-0.002, phi4=0.119, ma=-0.6, ads=-0.009)
## MCMC means:
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2),
init=replicate(nchains, inits, simplify=FALSE), chains=nchains, control = list(max_treedepth = 15, adapt_delta = 0.95),
iter=20000); print(model)
traceplot(model, pars=names(inits))
~~~
This code continues to fail due to the label-switching issue (ie the MCMC bouncing between estimates of which mixture component is which, because of symmetry or lack of data), despite using some of the suggested fixes in the Stan model, like the ordering trick.
Since there were so few traffic spikes in 2017, the mixture model can't converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing the 2017 data and I can simply model the outcome as a normal.
## EVSI
Demo code of a simple Expected Value of Sample Information (EVSI) analysis in a JAGS log-Poisson model of traffic (which turns out to be inferior to a normal distribution for the 2017 traffic data, but which I keep here for historical purposes).
We consider an experiment resembling the historical data with a 5% traffic decrease due to ads; the reduction is modeled and implies a certain utility loss given my relative preferences for traffic vs advertising revenue, and then the remaining uncertainty in the reduction estimate can be queried for how likely it is that the decision is wrong, and whether collecting further data would change a wrong decision to a right one:
~~~{.R}
## 'ads' is a data frame of daily traffic ('Hits') with a binary ad-randomization indicator ('Ads'), loaded earlier
## simulate a plausible effect superimposed on the actual data:
ads[ads$Ads==1,]$Hits <- round(ads[ads$Ads==1,]$Hits * 0.95)
require(rjags)
y <- ads$Hits
x <- ads$Ads
model_string <- "model {
for (i in 1:length(y)) {
y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha0 - alpha1 * x[i]
}
alpha0 ~ dunif(0,10)
alpha1 ~ dgamma(50, 6)
}"
model <- jags.model(textConnection(model_string), data = list(x = x, y = y),
n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("alpha0", "alpha1"), n.iter=10000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
#
# Mean SD Naive SE Time-series SE
# alpha0 6.98054476 0.003205046 1.133155e-05 2.123554e-05
# alpha1 0.06470139 0.005319866 1.880857e-05 3.490445e-05
#
# 2. Quantiles for each variable:
#
# 2.5% 25% 50% 75% 97.5%
# alpha0 6.97426621 6.97836982 6.98055144 6.98273011 6.98677827
# alpha1 0.05430508 0.06110893 0.06469162 0.06828215 0.07518853
alpha0 <- samples[[1]][,1]; alpha1 <- samples[[1]][,2]
posteriorTrafficReduction <- exp(alpha0) - exp(alpha0-alpha1)
generalLoss <- function(annualAdRevenue, trafficLoss, hitValue, discountRate) {
(annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
loss <- function(tr) { generalLoss(360, tr, 0.02, 0.05) }
posteriorLoss <- sapply(posteriorTrafficReduction, loss)
summary(posteriorLoss)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -5743.5690 -3267.4390 -2719.6300 -2715.3870 -2165.6350 317.7016
~~~
Expected loss of turning on ads: -\$2715. Current decision: keep ads off to avoid that loss.
The expected average gain in the case where the correct decision is turning ads on:
~~~{.R}
mean(ifelse(posteriorLoss>0, posteriorLoss, 0))
# [1] 0.06868814833
~~~
so EVPI is \$0.07. This doesn't pay for any additional days of sampling, so there's no need to calculate an exact EVSI.