Banner Ads Considered Harmful

9 months of daily A/B-testing of Google AdSense banner ads on gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.
experiments, statistics, decision-theory, R, JS, power-analysis, Bayes, Google, survey, insight-porn
2017-01-08–2019-12-17 · in progress · certainty: possible · importance: 5


One source of com­plex­ity & JavaScript use on Gwern.net is the use of Google AdSense adver­tis­ing to insert ban­ner ads. In con­sid­er­ing design & usabil­ity improve­ments, remov­ing the ban­ner ads comes up every time as a pos­si­bil­i­ty, as read­ers do not like ads, but such removal comes at a rev­enue loss and it’s unclear whether the ben­e­fit out­weighs the cost, sug­gest­ing I run an A/B exper­i­ment. How­ev­er, ads might be expected to have broader effects on traffic than indi­vid­ual page read­ing times/bounce rates, affect­ing total site traffic instead through long-term effects on or spillover mech­a­nisms between read­ers (eg social media behav­ior), ren­der­ing the usual A/B test­ing method of per-page-load/session ran­dom­iza­tion incor­rect; instead it would be bet­ter to ana­lyze total traffic as a time-series exper­i­ment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical Gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13% to −16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.

Thus, ban­ner ads on Gwern.net appear to be harm­ful and AdSense has been removed. If these results gen­er­al­ize to other blogs and per­sonal web­sites, an impor­tant impli­ca­tion is that many web­sites may be harmed by their use of ban­ner ad adver­tis­ing with­out real­iz­ing it.

One thing about Gwern.net I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why in my previous A/B tests of site design changes, I have generally focused on CSS changes which do not affect load times. When I benchmarked website performance in 2017, the total load time had become dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments.

Since I want comments, Disqus is not optional1; AdSense I keep only because, well, it makes me some money (~$30 a month or ~$360 a year; it would be more, but ~60% of visitors have adblock, which is apparently unusually high for the US). So ads are a good target for an experiment: removing them offers a chance to drop one of the heaviest components of the page, an excuse to apply a decision-theoretic approach (calculating a decision threshold & expected losses), an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.

Modeling effects of advertising: global rather than local

This isn't a huge amount (it is much less than my monthly Patreon) and might be offset by the effects of ads on load/render time and by people simply not liking advertisements. If I am reducing my traffic & influence by 10% because people don't want to browse or link pages with ads, then it's definitely not worthwhile.

One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when tested individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.2 In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators). It is somewhat appreciated that A/B testing in social media or network settings is not as simple as randomizing individual users & running a t-test, as the users are not independent of each other (violating, among other things, the standard independence assumptions). Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger, more-independent treatment units.

So my usual ABa­lyt­ics setup isn’t appro­pri­ate here: I don’t want to ran­dom­ize indi­vid­ual vis­i­tors & mea­sure time on page, I want to ran­dom­ize indi­vid­ual days or weeks and mea­sure total traffic, giv­ing a time-series regres­sion.

This could be randomized by uploading a different version of the site every day, but this is tedious, inefficient, and has technical issues: aggressive caching of my webpages means that many visitors may be seeing old versions of the site! With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like 'day-of-year (1–366) modulo 2', hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense, match them up with the randomization, and run a regression. By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn't be one of the people seeing ads anyway. Since there might be spillover effects due to lags in propagating through social media & emails etc, daily randomization might be too fast, and 2-day blocks more appropriate, which also yield occasional runs of up to a week or so to expose longer effects while still allocating equal total days to advertising/no-advertising.3
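
For concreteness, a minimal R sketch of generating such a paired 2-day randomization vector and formatting it for pasting into a JS array (the seed & output format here are illustrative, not the exact script used):

## one coin-flip per 2-day block, expanded to 366 daily 1/0 assignments:
set.seed(2017) # illustrative seed
blocks     <- rbinom(183, size=1, prob=0.5)
randomness <- rep(blocks, each=2)
cat(paste(randomness, collapse=","))  # paste the output into the JS snippet below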

Implementation: In-browser Randomization of Banner Ads

Set­ting this up in JS turned out to be a lit­tle tricky since there is no built-in func­tion for get­ting day-of-year or for hash­ing numbers/strings; so rather than spend another 10 lines copy­-past­ing some hash func­tions, I copied some day-of-year code and then sim­ply gen­er­ated in R 366 binary vari­ables for ran­dom­iz­ing dou­ble-days and put them in a JS array for doing the ran­dom­iza­tion:

         <script src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" async></script>
         <!-- Medium Header -->
         <ins class="adsbygoogle"
              style="display:inline-block;width:468px;height:60px"
              data-ad-client="ca-pub-3962790353015211"
              data-ad-slot="2936413286"></ins>
+        <!-- A/B test of ad effects on site traffic: randomize 2-days based on day-of-year &
+             pre-generated randomness; offset by 8 because started on 2017-01-08 -->
         <script>
-          (adsbygoogle = window.adsbygoogle || []).push({});
+          var now = new Date(); var start = new Date(now.getFullYear(), 0, 0); var diff = now - start;
+          var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
+          randomness = [1,0,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,1,1,
           1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,
           0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,
           0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,
           0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1,
           1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,0,
           0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,
           1,1,1,0,0,1,1,0,0,0,0,1,1,0,0];
+
+          if (randomness[day - 8]) {
+              (adsbygoogle = window.adsbygoogle || []).push({});
+          }

While sim­ple, sta­t­ic, and cache-com­pat­i­ble, a few months in I dis­cov­ered that I had per­haps been a lit­tle too clev­er: check­ing my AdSense reports on a whim, I noticed that the reported daily “impres­sions” was ris­ing and falling in roughly the 2-day chunks expect­ed, but it was never falling all the way to 0 impres­sions, instead, per­haps to a tenth of the usual num­ber of impres­sions. This was odd because how would any browsers be dis­play­ing ads on the wrong days given that the JS runs before any ads code, and any browser not run­ning JS would, ipso facto, never be run­ning AdSense any­way? Then it hit me: whose date is the ran­dom­iza­tion based on? The browser’s, of course, which is not mine if it’s run­ning in a differ­ent time­zone. Pre­sum­ably browsers across a date­line would be ran­dom­ized into ‘on’ on the ‘same day’ as oth­ers are being ran­dom­ized into ‘off’. What I should have done was some sort of time­zone inde­pen­dent date con­di­tion­al. Unfor­tu­nate­ly, it was a lit­tle late to mod­ify the code.

This implies that the simple binary randomization variable is a poor regressor: the estimate will be substantially biased towards zero (attenuated) by the measurement error, inasmuch as many of the pagehits on supposedly ad-free days were in fact contaminated by exposure to ads. Fortunately, the AdSense impressions data can be used instead, to regress on, say, the percentage of ad-affected pageviews each day.
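
A sketch of that correction, assuming hypothetical daily CSV exports from Google Analytics (Date, Pageviews) & AdSense (Date, Impressions); the filenames & column names here are illustrative:

ga      <- read.csv("analytics-daily.csv", colClasses=c("Date", "integer")) # Date, Pageviews
adsense <- read.csv("adsense-daily.csv",   colClasses=c("Date", "integer")) # Date, Impressions
d <- merge(ga, adsense, by="Date")
# realized fraction of ad-exposed pageviews: ~0.1 even on 'off' days due to the timezone leakage
d$AdFraction <- d$Impressions / d$Pageviews
# naive regression on exposure fraction (ignoring the autocorrelation dealt with below):
summary(lm(log(Pageviews) ~ AdFraction, data=d))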

Ads as Decision Problem

From a decision-theory perspective, this is a good place to apply sequential testing ideas, as we face a similar problem as in past experiments: the experiment has an easily quantified cost. Each day randomized 'off' costs ~$1, so a long experiment over 200 days would cost ~$100 in ad revenue etc. There is also the risk of making the wrong decision and choosing to disable ads when they are harmless, in which case the cost as NPV (at my usual 5% discount rate, and assuming ad revenue never changes and I never experiment further, which are reasonable assumptions given how fortunately stable my traffic is and the unlikeliness of me revisiting a conclusive result from a well-designed experiment) would be $360/ln(1.05) ≈ $7,380, which is substantial.

On the other side of the equation, the ads could be doing substantial damage to site traffic; with ~40% of traffic seeing ads and total page-views of 635,123 in 2016 (1,740/day), a discouraging effect of 5% off that would mean a loss of ~12,700 page-views per year (0.05 × 0.40 × 635,123), the equivalent of ~1 week of traffic. My website is important to me because it is what I have accomplished & is my livelihood, and if people are not reading it, that is bad, both because I lose possible income and because it means no one is reading my work.

How bad? Absent an advertising-style market price, it's hard to directly quantify the value of a page-view, so I can instead ask myself hypothetically: would I trade ~1 week of traffic for $360 (~$0.02/view, or to put it another way which may be more intuitive, would I delete Gwern.net in exchange for >$18,720/year)? Probably; that's about the right number: with my current parlous income, I cannot casually throw away hundreds or thousands of dollars for some additional traffic, but I would still pay for readers at the right price, and weighing my feelings, I feel comfortable valuing page-views at ~$0.02. (If the estimate of the loss turns out to be near the threshold, then I can revisit it again and attempt more preference elicitation. Given the actual results, this proved to be unnecessary.)

Then the loss function for the traffic-reduction parameter t (the fractional decrease among the ~40% of ad-exposed traffic, valuing page-views at $0.02) is ℒ(t) = (360 − (t × 0.40 × 635123 × 0.02)) / ln(1.05). So the long-run consequence of permanently turning advertising on would be, for a t decrease of 1%, +$4775; 5%, +$2171; 10%, −$3035; 20%, −$13449; etc.

Thus, the decision question is whether the decrease for the ad-affected ~40% of traffic is >7% (equivalently, whether the decrease in traffic as a whole is >2.8%). If it is, then I am better off removing AdSense and regaining the traffic; otherwise, keeping the ad revenue is the better choice.
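
A worked version of this loss function & decision threshold in R (using the numbers above; the $0.02/page-view valuation is the subjective figure just elicited):

adRevenue    <- 360         # $/year from AdSense
discount     <- log(1.05)   # continuous 5% discount rate
views2016    <- 635123; adExposed <- 0.40; valuePerView <- 0.02
loss <- function(t) { (adRevenue - (t * adExposed * views2016 * valuePerView)) / discount }
loss(c(0.05, 0.10, 0.20))   # NPVs of keeping ads at various traffic decreases
uniroot(loss, c(0, 1))$root # break-even decrease: ~0.071 of ad-exposed traffic (~2.8% of total)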

Ad Harms

How much should we expect traffic to fall?

Unfor­tu­nate­ly, before run­ning the first exper­i­ment, I was unable to find pre­vi­ous research sim­i­lar to my pro­posal for exam­in­ing the effect on total traffic rather than more com­mon met­rics such as rev­enue or per-page engage­ment. I assume such research exists, since there’s a lit­er­a­ture on every­thing, but I haven’t found it yet and no one I’ve asked knows where it is either; and of course pre­sum­ably the big Inter­net adver­tis­ing giants have detailed knowl­edge of such spillover or emer­gent effects, although no incen­tive to pub­li­cize the harms.4

There is a sparse open literature on "advertising avoidance", which focuses on surveys of consumer attitudes and economic modeling; skimming, the main results appear to be that people claim to dislike advertising on TV or the Internet a great deal, claim to dislike personalization but find personalized ads less annoying, that a nontrivial fraction of viewers will take action during TV commercial breaks to avoid watching ads (5–23% across various methods of estimating/definitions of avoidance, and sources like TV channels), that they are particularly annoyed by ads getting in the way when researching or engaged in 'goal-oriented' activity, and that in a work context (Amazon Mechanical Turk) they will tolerate non-annoying ads without demanding large payment increases (Goldstein et al 2013).

Some par­tic­u­larly rel­e­vant results:

  • McCoy et al 2007 did one of the few relevant experiments5, with students in labs, and noted "subjects who were not exposed to ads reported they were 11% more likely to return or recommend the site to others than those who were exposed to ads (p < 0.01)"; but could not measure any real-world or long-term effects.

  • Kerkhof 2019 exploits a sort of natural experiment on YouTube, where video creators learned that YouTube had a hardwired rule that videos <10 minutes in length could have only 1 ad, while multiple ads may be inserted in longer videos; tracking a subset of German YT channels using advertising, she finds that some channels began increasing video lengths, inserting ads, turning away from 'popular' content to obscurer content (d = 0.4), and had more video views (>20%) but lower ratings (4%/d = −0.25)6.

    While that might sound good on net (more vari­ety & more traffic even if some of the addi­tional view­ers may be less sat­is­fied), Kerk­hof 2019 is only track­ing video cre­ators and not a fixed set of view­ers, and can­not exam­ine to what extent view­ers watch less due to the increase in ads or what global site-wide effects there may have been (after all, why weren’t the cre­ators or view­ers doing all that before?), and cau­tions that we should expect YouTube to algo­rith­mi­cally drive traffic to more mon­e­ti­z­able chan­nels, regard­less of whether site-wide traffic or social util­ity decreased7.

  • Benzell & Collis 2019 run a large-scale (total n = 40,000) Google Surveys survey asking Americans about willingness-to-pay for, among other things, an ad-free Facebook (n = 1,001), which was a mean ~$2.5/month (substantially less than current FB ad revenue per capita per month); their results imply Facebook could increase revenue by increasing ads.

  • Sinha et al 2017 investigate ad harms indirectly, by looking at an online publisher's logs of an anti-adblocker mechanism (which typically detects the use of an adblocker, hides the content, and shows a splashscreen telling the user to disable adblock); they do not have randomized data, but attempt a correlational analysis, where Figure 3 implies (comparing the anti-adblocker 'treatment' with their preferred control group control_1) that compared to the adblock-possible baseline, anti-adblock decreases pages per user and time per user: pages per user drop from ~1.4 to ~1.1, and time per user drops from ~2min to ~1.5min. (Despite the use of the term 'aggregate', Sinha et al 2017 does not appear to analyze total site pageview/time traffic statistics, but only per-user.)

    These are large decreas­es, sub­stan­tially larger than 10%, but it’s worth not­ing that, aside from DiD not being a great way of infer­ring causal­i­ty, these esti­mates are not directly com­pa­ra­ble to the oth­ers because adding anti-ad­block ≠ adding ads: anti-ad­block is much more intru­sive & frus­trat­ing (an ugly pay­wall hid­ing all con­tent & requir­ing man­ual action a user may not know how to per­form) than sim­ply adding some ads, and plau­si­bly is much more harm­ful.

But while those surveys & measurements show some users will do some work to avoid ads (which is supported by the high but nevertheless <100% percentage of browsers with adblockers installed) and in some contexts like jobs appear to be insensitive to ads, there is little information about the extent to which ads unconsciously drive users away from a publisher towards other publishers or mediums, with pervasive amounts of advertising taken for granted & researchers focusing on just about anything else (see the cites in Brajnik & Gabrielli 2008 & Wilbur 2016, among others). For example, Google's Hohnhold et al 2015 notes precisely the problem: "Optimizing which ads show based on short-term revenue is the obvious and easy thing to do, but may be detrimental in the long-term if user experience is negatively impacted. Since we did not have methods to measure the long-term user impact, we used short-term user satisfaction metrics as a proxy for the long-term impact"; and after experimenting with predictive models & randomizing ad loads, they decided to make a "50% reduction of the ad load on Google's mobile search interface", but Hohnhold et al 2015 doesn't tell us what the effect on user attrition/activity was! What they do say is (ambiguously, given the "positive user response" is driven by a combination of less attrition, more user activity, and less ad blindness, with the individual contributions unspecified):

This and sim­i­lar ads blind­ness stud­ies led to a sequence of launches that decreased the search ad load on Google’s mobile traffic by 50%, result­ing in dra­matic gains in user expe­ri­ence met­rics. We esti­mated that the pos­i­tive user response would be so great that the long-term rev­enue change would be a net pos­i­tive. One of these launches was rolled out over ten weeks to 10% cohorts of traffic per week. Fig­ure 6 shows the rel­a­tive change in CTR [click­through rate] for differ­ent cohorts rel­a­tive to a hold­back. Each curve starts at one point, rep­re­sent­ing the instan­ta­neous qual­ity gains, and climbs higher post-launch due to user sight­ed­ness. Differ­ences between the cohorts rep­re­sent pos­i­tive user learn­ing, i.e., ads sight­ed­ness.

My best guess is that the effect of any “adver­tis­ing avoid­ance” ought to be a small per­cent­age of traffic, for the fol­low­ing rea­sons:

  • many peo­ple never bother to take a minute to learn about & install adblock browser plu­g­ins, despite the exis­tence of adblock­ers being uni­ver­sally known, which would elim­i­nate almost all ads on all web­sites they would vis­it; if ads as a whole are not worth a minute of work to avoid for years to come for so many peo­ple, how bad could ads be? (And to the extent that peo­ple do use adblock­ers, any total neg­a­tive effect of ads ought to be that much small­er.)

    • in particular, my AdSense banner ads have never offended or bothered me much when I browse my pages with adblocker disabled to check appearance, as they are normal medium-sized banners centered above the title where one expects an ad8
  • website design ranges wildly in quality & ad density, with even enormously successful websites like Amazon looking like garbage9; whether users care about good design at all is difficult to tell

  • great efforts are invested in min­i­miz­ing the impact of ads: AdSense loads ads asyn­chro­nously in the back­ground so it never blocks the page load­ing or ren­der­ing (which would defi­nitely be frus­trat­ing & web design holds that small delays in page­loads are harm­ful10), Google sup­pos­edly spends bil­lions of dol­lars a year on a sur­veil­lance Inter­net & the most cut­ting-edge AI tech­nol­ogy to bet­ter model users & tar­get ads to them with­out irri­tat­ing them too much (eg Hohn­hold et al 2015), ads should have lit­tle effect on SEO or search engine rank­ing (since why would search engines penal­ize their own ads?), and I’ve seen a decent amount of research on opti­miz­ing ad deliv­er­ies to max­i­mize rev­enue & avoid­ing annoy­ing ads (but, as described before, never research on mea­sur­ing or reduc­ing total harm)

  • final­ly, if they were all that harm­ful, how could there be no past research on it and how could no one know this?

    You would think that if there were any wor­ri­some level of harm some­one would’ve noticed by now & it’d be com­mon knowl­edge to avoid ads unless you were des­per­ate for the rev­enue.

So my prior esti­mate is of a small effect and need­ing to run for a long time to make a deci­sion at a mod­er­ate oppor­tu­nity cost.

Replication

After run­ning my first exper­i­ment (n = 179,550 users on mobile+desk­top browser­s), addi­tional results have come out and a research lit­er­a­ture on quan­ti­fy­ing “adver­tis­ing avoid­ance” is finally emerg­ing; I have also been able to find ear­lier results which were either too obscure for me to find the first time around or on closer read turn out to imply esti­mates of total ad harm.

To sum­ma­rize all cur­rent results:

Review of experiments or correlational analyses which measure the harm of ads on total activity (broadly defined).

| Entity | Cite | Date | Method | Users | Ads | n (millions) | Total effect | Activity |
|---|---|---|---|---|---|---|---|---|
| Pandora | Huang et al 2019 | June 2014–April 2016 | randomized | mobile app | commercial break-style audio ads | 34 | 7.5% | total music listening hours (whole cohort) |
| Mozilla | Miroglio et al 2018 | February–April 2017 | correlational | desktop | all | 0.358 | 28% | total time WWW browsing (per user) |
| LinkedIn | Yan et al 2019 | March–June 2018 | randomized | mobile app | newsfeed insert items | 102 | 12% | total newsfeed interaction/use (whole cohort) |
| McCoy | McCoy et al 2007 | 2004? | randomized | desktop | banner/pop-ups | 0.000536 | 11% | self-rated willingness to revisit website |
| Google | Hohnhold et al 2015 | 2013–2014 (primary) | randomized | mobile | search engine text ads | 500? | 50–70%? | total search engine queries (whole cohort, inclusive of attrition etc)11 |
| Google (AdSense) | Hohnhold et al 2015 | 2007?–? | randomized | all? | AdSense banners | (>>1)? | >5%? | total site usage |
| PageFair | Shiller et al 2017 | July 2013–June 2016 | correlational | all | all | <535? (k = 2574) | <19% | total website usage (Alexa traffic rank) |
| Gwern.net | this page | January 2017–October 2017 | randomized | all | banner ad | 0.251 | 14% | total website traffic |
| Anon. news website | Yan et al 2020 | June 2015–September 2015 | correlational | registered users | banner/skyscraper/square ads | 0.08 | 20% | total website traffic |

While these results come from completely different domains (general web use, entertainment, and business/productivity), different platforms (mobile app vs desktop browser), different ad delivery mechanisms (inline news feed items, audio interruptions, inline+popup ads, and web ads as a whole), and primarily examine within-user effects, the numerical estimates of total decreases are remarkably consistent, falling in the same 10–15% range as my own estimate.

The con­sis­tency of these results under­mines many of the inter­pre­ta­tions of how & why ads cause harm.

For exam­ple, how can it be dri­ven by “per­for­mance” prob­lems when the LinkedIn app loads ads for their news­feed (un­less they are too incom­pe­tent to down­load ads in advance), or for the Pan­dora audio ads (as the audio ads must inter­rupt the music while they play but oth­er­wise do not affect the music—the music surely isn’t “sta­t­ic-y” because audio ads played at some point long before or after! unless again we assume total incom­pe­tence on the part of Pan­do­ra), or for McCoy et al 2007 which served sim­ple sta­tic image ads off servers set up by them for the exper­i­ment? And why would a Google AdSense ban­ner ad, which loads asyn­chro­nously and does­n’t block page ren­der­ing, have a ‘per­for­mance’ prob­lem in the first place? (Nev­er­the­less, to exam­ine this pos­si­bil­ity fur­ther in my fol­lowup A/B test, I switched from AdSense to a sin­gle small cacheable sta­tic PNG ban­ner ad which is loaded in both con­di­tions in order to elim­i­nate any per­for­mance impact.)

Similarly, if users have domain-specific tolerance of ads and will tolerate them for work but not play (or vice-versa), why do McCoy/LinkedIn vs Pandora find about the same thing? And if Gwern.net readers were simply unusually intolerant of ads, why do such different populations show similar losses?

The sim­plest expla­na­tion is that users are averse to ads qua ads, regard­less of domain, deliv­ery mech­a­nism, or ‘per­for­mance’.

Pandora

Stream­ing ser­vice activ­ity & users (n = 34 mil­lion), ran­dom­ized.

In 2018, Pandora published a large-scale, long-term (~2 years) individual-level advertising experiment in their streaming music service (Huang et al 2019) which found a strikingly large effect of number of ads on reduced listener frequency & worsened retention, which accumulated over time and would have been hard to observe in a short-term experiment.

Huang et al 2019, adver­tis­ing harms for Pan­dora lis­ten­ers: “Fig­ure 4: Mean Total Hours Lis­tened by Treat­ment Group”; “Fig­ure 5: Mean Weekly Unique Lis­ten­ers by Treat­ment Group”

In the low-ad condition (2.732 ads/hour), the final activity level was +1.74% listening time relative to the baseline/control (3.622 ads/hour, defined as 0%); in the high-ad condition (5.009 ads/hour), the final activity level was −2.83% listening time. The ads-per-hour coefficient is −2.0751% for Total Hours & −1.8965% for Active Days. The net total effect can be backed out:

The coeffi­cients show us that one addi­tional ad per hour results in mean lis­ten­ing time decreas­ing by 2.075%±0.226%, and the num­ber of active lis­ten­ing days decreas­ing by 1.897%±0.129%….­Does this decrease in total lis­ten­ing come from shorter ses­sions of lis­ten­ing, or from a lower prob­a­bil­ity of lis­ten­ing at all? To answer this ques­tion, Table 6 breaks the decrease in total hours down into three com­po­nents: the num­ber of hours lis­tened per active day, the num­ber of active days lis­tened per active lis­ten­er, and the prob­a­bil­ity of being an active lis­tener at all in the final month of the exper­i­ment. We have nor­mal­ized each of these three vari­ables so that the con­trol group mean equals 100, so each of these treat­ment effects can be inter­preted as a per­cent­age differ­ence from con­trol. We find the per­cent­age decrease in hours per active day to be approx­i­mately 0.41%, the per­cent­age decrease in days per active lis­tener to be 0.94%, and the per­cent­age decrease in the prob­a­bil­ity of being an active lis­tener in the final month to be 0.92%. These three num­bers sum to 2.27%, which is approx­i­mately equal to the 2.08% per­cent­age decline we already cal­cu­lated for total hours lis­tened.5 This tells us that approx­i­mately 18% of the decline in the hours in the final month is due to a decline in the hours per active day, 41% is due to a decline in the days per active lis­ten­er, and 41% is due to a decline in the num­ber of lis­ten­ers active at all on Pan­dora in the final month. We find it inter­est­ing that all three of these mar­gins see sta­tis­ti­cally sig­nifi­cant reduc­tions, though the vast major­ity of the effect involves fewer lis­ten­ing ses­sions rather than a reduc­tion in the num­ber of hours per ses­sion.

The coefficient of 2.075% less total activity (listening) per 1 ad/hour implies that with a baseline of 3.622 ads per hour, the total harm is 3.622 × 2.075% ≈ 7.5% at the end of 21 months (corresponding to the end of the experiment, at which point the harm from increased attrition appears to have stabilized, perhaps because everyone at the margin who might attrit away or reduce listening has done so by that point, and so that may reflect the total indefinite harm).
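
The arithmetic behind that figure:

adsPerHour     <- 3.622    # Pandora's control-condition ad load
coefTotalHours <- 0.020751 # fractional decrease in listening per additional ad/hour
adsPerHour * coefTotalHours
# [1] 0.07516012           # ~7.5% of total listening attributable to the baseline ad load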

Mozilla

Desk­top browser usage lev­els (n = 358,000), lon­gi­tu­di­nal:

Almost simultaneously with Pandora, Mozilla (Miroglio et al 2018) conducted a longitudinal (but non-randomized, using matching + regression to reduce the inflation of the correlational effect12) study of browser users which found that after installing adblock, the subset of adblock users experienced "increases in both active time spent in the browser (+28% over [matched] controls) and the number of pages viewed (+15% over control)".

(This, inci­den­tal­ly, is a tes­ta­ment to the value of browser exten­sions to users: in a mature piece of soft­ware like Fire­fox, usu­al­ly, noth­ing improves a met­ric like 28%. One won­ders if Mozilla fully appre­ci­ates this find­ing?)

Miroglio et al 2018, benefits to Firefox users from adblockers: "Figure 3: Estimates & 95% CI for the change in log-transformed engagement due to installing add-ons [adblockers]"; "Table 5: Estimated relative changes in engagement due to installing add-ons compared to control group"

LinkedIn

Social news feed activ­ity & users, mobile app (n = 102 mil­lion), ran­dom­ized:

LinkedIn ran a large-scale ad experiment on their mobile app's users (excluding desktop etc, presumably iOS+Android), tracking the effect of additional ads in user 'news feeds' on short-term & long-term metrics like retention over 3 months (Yan et al 2019); it compares the LinkedIn baseline of 1 ad every 6 feed items to alternatives of 1 ad every 3 feed items and 1 ad every 9 feed items. Unlike Pandora, the short-term effect is the bulk of the advertising effect within their 3-month window (perhaps because LinkedIn is a professional tool and quitting is harder than for an entertainment service, or because visual web ads are less intrusive than audio, or because 3 months is still not long enough), but while ad increases show minimal net revenue impact (if I am understanding their metrics right), the ad density clearly discourages usage of the news feed, the authors speculating this is due to discouraging less-active or "dormant" marginal users; considering the implied annualized effect on user retention & activity, I estimate a total activity decrease of >12% due to the baseline ad burden compared to no ads.13

Yan et al 2019, on the harms of adver­tis­ing on Linked­In: “Fig­ure 3. Effect of ads den­sity on feed inter­ac­tion”

McCoy et al 2007

Academic business school lab study, McCoy et al 2007, self-rated willingness to revisit/recommend a website on a scale after exposure to ads (n = 536), randomized:

While markedly differ­ent in both method & mea­sure, McCoy et al 2007 nev­er­the­less finds a ~11% reduc­tion from no-ads to ads (3 types test­ed, but the least annoy­ing kind, “in-line”, still incurred a ~9% reduc­tion). They point­edly note that while this may sound small, it is still of con­sid­er­able prac­ti­cal impor­tance.14

McCoy et al 2007, harms of ads on stu­dent rat­ings: “Fig­ure 2: Inten­tions to revisit the site con­tain­ing the ads (4-item scale; cau­tion: the ori­gin of the Y axis is not 0).”

Google

Hohnhold et al 2015, "Focusing on the Long-term: It's Good for Users and Business", search activity, mobile Android Google users (n > 100m?)15, randomized:

Hohn­hold et al 2015, ben­e­fit from 50% ad reduc­tion on mobile over 2 month roll­out of 10% users each: “Fig­ure 6:∆CTR [CTR = Clicks/Ad, term 5] time series for differ­ent user cohorts in the launch. (The launch was stag­gered by weekly cohort.)”

This and sim­i­lar ads blind­ness stud­ies led to a sequence of launches that decreased the search ad load on Google’s mobile traffic by 50%, result­ing in dra­matic gains in user expe­ri­ence met­rics. We esti­mated that the pos­i­tive user response would be so great that the long-term rev­enue change would be a net pos­i­tive. One of these launches was rolled out over ten weeks to 10% cohorts of traffic per week. Fig­ure 6 shows the rel­a­tive change in CTR [click­through rate] for differ­ent cohorts rel­a­tive to a hold­back. Each curve starts at one point, rep­re­sent­ing the instan­ta­neous qual­ity gains, and climbs higher post-launch due to user sight­ed­ness. Differ­ences between the cohorts rep­re­sent pos­i­tive user learn­ing, i.e., ads sight­ed­ness.

Hohn­hold et al 2015, as the result of search engine ad load exper­i­ments on user activ­ity (search­es) & ad inter­ac­tions, decided to make a “50% reduc­tion of the ad load on Google’s mobile search inter­face” which, because of the ben­e­fits to ad click rates & “user expe­ri­ence met­rics”, would pre­serve or increase Google’s absolute rev­enue.

To exactly off­set a 50% reduc­tion in ad expo­sure solely by being more likely to click on ads, user CTRs must dou­ble, of course. But Fig­ure 6 shows an increase of at most 20% in the CTR rather than 100%. So if the change was still rev­enue-neu­tral or pos­i­tive, user activ­ity must have gone up in some way—but Hohn­hold et al 2015 does­n’t tell us what the effect on user attrition/activity was! The “pos­i­tive user response” is dri­ven by some com­bi­na­tion of less attri­tion, more user activ­i­ty, and less ad blind­ness, with the indi­vid­ual con­tri­bu­tions left unspec­i­fied.

Can the effect on user activity be inferred from what Hohnhold et al 2015 does report? Possibly. As they set it up in equation 2, revenue decomposes into a product of terms: Revenue = Users × Tasks/User × Queries/Task × Ads/Query × Clicks/Ad × Cost/Click.

Apro­pos of this setup, they remark

For Google search ads experiments, we have not measured a statistically-significant learned effect on terms 1 ["Users"] and 2 ["Tasks/User"].2 [2: We suspect the lack of effect is due to our focus on quality and user experience. Experiments on other sites indicate that there can indeed be user learning affecting overall site usage.]

This would, inci­den­tal­ly, appear to imply that Google ad exper­i­ments have demon­strated an ad harm effect on other web­sites, pre­sum­ably via AdSense ads rather than search query ads, and given the sta­tis­ti­cal power con­sid­er­a­tions, the effect would need to be sub­stan­tial (guessti­mate >5%?). I emailed Hohn­hold et al sev­eral times for addi­tional details but received no replies.

Given the reported results, this is under-specified, but we can make some additional assumptions: we'll ignore user attrition & number of 'tasks' (as they say there is no "statistically-significant learned effect", which is not the same thing as zero effect but implies they are small), assume constant absolute revenue & revenue per click, and assume the CTR gain is 18% (the CTR increase is cumulative over time and has reached >18% for the longest-exposed cohort in Figure 6, so this represents a lower bound as it may well have kept increasing). This gives an upper bound of a <70% increase in user search queries per task thanks to the halving of ad load (assuming the CTR didn't increase further and there was zero effect on user retention or acquisition): 1 / (0.5 × 1.18) ≈ 1.69. Assuming a retention effect similar to LinkedIn's ~0.5% user attrition per 2 months, it'd be more like <65%, and adding in a −1–2% effect on number of tasks shrinks it down to <60%; if the increased revenue refers to annualized projections based on the 2-month data and we imagine annualizing/compounding hypothetical −1% effects on user attrition & activity, a <50% increase in search queries per task becomes plausible (which would be the difference between running 1 query per task and running 1.5 queries per task, which doesn't sound unrealistic to me).
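
The core revenue-neutrality arithmetic, as a quick R check (the ~3%/year retention adjustment is my own illustrative annualization of the LinkedIn-like figure, not a number from the paper):

adHalving <- 0.5   # ads per query halved
ctrGain   <- 1.18  # ~18% relative CTR ("sightedness") gain from Figure 6
1 / (adHalving * ctrGain)          # required queries/task multiplier if nothing else changes
# ~1.69, i.e. a ~70% increase
1 / (adHalving * ctrGain * 1.03)   # allowing a ~3%/year retention gain shrinks it to ~65%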

Regard­less of how we guessti­mate at the break­down of user response across their equa­tion 2’s first 3 terms, the fact remains that being able to cut ads by half with­out any net rev­enue effec­t—on a ser­vice “focus[ed] on qual­ity and user expe­ri­ence” whose authors have data show­ing its ads to already be far less harm­ful than “other sites”—sug­gests a major impact of search engine ads on mobile users.

Strikingly, this 50–70% range of effects on search engine use would be far larger than the effects estimated for other measures of use in the other studies. Some possible explanations are that the others have substantial measurement error biasing them towards zero, or that there is moderation by purpose: perhaps even LinkedIn is a kind of "entertainment" where ads are not as irritating a distraction, while search engine queries are more serious time-sensitive business where ads are a much more frustrating friction.

PageFair

Shiller et al 2017, "Will Ad Blocking Break the Internet?", Alexa traffic rank (a proxy for traffic), all website users (n = ? million16, k = 2,574 websites), longitudinal correlational analysis:

Page­Fair is an anti-ad­block ad tech com­pa­ny; their soft­ware detects adblock use, and in this analy­sis, the Alex­a-es­ti­mated traffic ranks of 2,574 cus­tomer web­sites (me­dian rank: #210,000) are cor­re­lated with Page­Fair-es­ti­mated frac­tion of adblock traffic. The 2013–2016 time-series are inter­mit­tent & short (me­dian 16.7 weeks per web­site, ana­lyzed in monthly traffic blocks with monthly n~12,718) as cus­tomer web­sites add/remove Page­Fair soft­ware. 14.6% of users have adblock in their sam­ple.

Shiller et al 2017's primary finding is that increases in the adblock usage share of PageFair-using websites predict an improvement in Alexa traffic rank over the next multi-month time-period analyzed, but then a gradual worsening of Alexa traffic rank up to 2 years later. Shiller et al 2017 attempt to make a causal story more plausible by looking at baseline covariates and attempting to use adblock rates as (none too convincing) instrumental variables. The interpretation offered is that adblock increases are exogenous and cause an initial benefit from freeriding users but then a gradual deterioration of site content/quality from reduced revenue.

While their inter­pre­ta­tion is not unrea­son­able, and if true is a reminder that for ad-driven web­sites there is an opti­mal trade­off between ads & traffic where the opti­mal point is not nec­es­sar­ily known and ‘pro­gram­matic adver­tis­ing’ may not be a good rev­enue source (in­deed, Shiller et al 2017 note that “ad block­ing had a sta­tis­ti­cal­ly-sig­nifi­cantly smaller impact at high­-traffic web­sites…indis­tin­guish­able from 0”), the more inter­est­ing impli­ca­tion is that if causal, the imme­di­ate short­-run effect is an esti­mate of the harm of adver­tis­ing.

Specifically, the PageFair summary emphasizes, in a graph of a sample starting from July 2013, that a 0%→25% change in adblock usage would be predicted to see a +5% rank improvement in the first half-year and +2% over the first year-and-a-half, decreasing to −16% by June 2016, ~3 years later. The graph and the exact estimates do not appear in Shiller et al 2017, but seem to be based on Table 5; the first coefficient in columns 1–4 corresponds to the first multi-month block, and the coefficient is expressed in terms of log ranks (lower = better), so given the PageFair hypothetical of 0%→25%, the predicted effect in the first time period for the various models (−0.2250, −0.2250, −0.2032, & −0.2034; mean, −0.21415) is 1 − exp(−0.21415 × 0.25) ≈ 0.05, or ~5%. Or to put it another way, the effect of removing advertising exposure for 100% of the userbase would be predicted to be 1 − exp(−0.21415 × 1.0) ≈ 0.19, or 19% (of Alexa traffic rank). Given the nonlinearity of Alexa ranks vs true traffic, I suspect this implies an actual traffic gain of <19%.
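
The conversion from log-rank coefficients to percentage rank changes, in R:

coefs <- c(-0.2250, -0.2250, -0.2032, -0.2034)  # Table 5: first coefficient of columns 1-4
mean(coefs)
# [1] -0.21415
1 - exp(mean(coefs) * 0.25)  # PageFair's 0% -> 25% adblock-share hypothetical: ~5% better rank
1 - exp(mean(coefs) * 1.00)  # 100% -> 0% ad exposure: ~19% better rank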

Yan et al 2020

Yan et al 202017 reports a difference-in-differences correlational analysis of adblock vs non-adblock users on an anonymous medium-sized European news website (similar to Mozilla & PageFair):

Many online news pub­lish­ers finance their web­sites by dis­play­ing ads along­side con­tent. Yet, remark­ably lit­tle is known about how expo­sure to such ads impacts users’ news con­sump­tion. We exam­ine this ques­tion using 3.1 mil­lion anonymized brows­ing ses­sions from 79,856 users on a news web­site and the qua­si­-ran­dom vari­a­tion cre­ated by ad blocker adop­tion. We find that see­ing ads has a robust neg­a­tive effect on the quan­tity and vari­ety of news con­sump­tion: Users who adopt ad block­ers sub­se­quently con­sume 20% more news arti­cles cor­re­spond­ing to 10% more cat­e­gories. The effect per­sists over time and is largely dri­ven by con­sump­tion of “hard” news. The effect is pri­mar­ily attrib­ut­able to a learn­ing mech­a­nism, wherein users gain pos­i­tive expe­ri­ence with the ad-free site; a cog­ni­tive mech­a­nism, wherein ads impede pro­cess­ing of con­tent, also plays a role. Our find­ings open an impor­tant dis­cus­sion on the suit­abil­ity of adver­tis­ing as a mon­e­ti­za­tion model for valu­able dig­i­tal con­tent…Our dataset was com­posed of click­stream data for all reg­is­tered users who vis­ited the news web­site from the sec­ond week of June, 2015 (week 1) [2015-06-07] to the last week of Sep­tem­ber, 2015 (week 16) [2015-09-30]. We focus on reg­is­tered users for both econo­met­ric and socio-e­co­nomic rea­sons. We can only track reg­is­tered users on the indi­vid­u­al-level over time, which pro­vides us with a unique panel set­ting that we use for our empir­i­cal analy­sis…These per­cent­ages trans­late into 2 fewer news arti­cles per week and 1 less news cat­e­gory in total.

…Of the 79,856 users whom we observed, 19,088 users used an ad blocker dur­ing this period (as indi­cated by a non-zero num­ber of page impres­sions blocked), and 60,768 users did not use an ad blocker dur­ing this peri­od. Thus, 24% of users in our dataset used an ad block­er; this per­cent­age is com­pa­ra­ble to the ad block­ing adop­tion rates across Euro­pean coun­tries at the same time, rang­ing from 20% in Italy to 38% in Poland (New­man et al. 2016).

…Our results also high­light sub­stan­tial het­ero­gene­ity in the effect of ad expo­sure across differ­ent users: First, users with a stronger ten­dency to read news on their mobile phones (as opposed to on desk­top devices) exhibit a stronger treat­ment effect.

They Just Don’t Know?

This raises again one of my orig­i­nal ques­tions: why do peo­ple not take the sim­ple & easy step of installing adblock­er, despite appar­ently hat­ing ads & ben­e­fit­ing from it so much? Some pos­si­bil­i­ties:

  • peo­ple don’t care that much, and the loud com­plaints are dri­ven by a small minor­i­ty, or other fac­tors (such as a polit­i­cal moral panic post-Trump elec­tion); Ben­zell & Col­lis 2019’s will­ing­ness-to-pay is con­sis­tent with not both­er­ing to learn about or use adblock because peo­ple just don’t care

  • adblock is typ­i­cally dis­abled or hard to get on mobile; could the effect be dri­ven by mobile users who know about it & want to but can’t?

    This should be testable by re-an­a­lyz­ing the A/B tests to split total traffic into desk­top & mobile (which Google Ana­lyt­ics does track and, inci­den­tal­ly, is how I know that mobile traffic has steadily increased over the years & became a major­ity of Gwern.net traffic in Jan­u­ary–Feb­ru­ary 2019)2

  • is it pos­si­ble that peo­ple don’t use adblock because they don’t know it exists?

The last possibility sounds crazy to me. Ad blocking is well-known, ad blockers are among the most popular browser extensions there are, and they are often the first thing installed on a new OS.

The 2015 Adobe/PageFair report estimated 45m American adblock users in 2015 (out of a total US population of ~321m people, or ~14% users18), and a 2016 UK estimate is ~22%; a PageFair paper, Shiller et al 2017, cites an unspecified analysis estimating "24% for Germany, 14% for Spain, 10% for the UK, and 9% for the US" installation rates, accounting for "28% in Germany, 16% for Spain, 13% for the UK, and 12% for the U.S." of web traffic. A 2016 Midia Research report reportedly claims that 41% of users knew about ad blockers, of which 80% used one on desktop & 46% on smartphones, implying a 33%/19% use rate. One might expect higher numbers now, 3–4 years later, since adblock usage has been growing. One survey of Polish Internet users (methodology unspecified) found 77% used adblock in total and <2% claimed to not know what adblock is. (The Polish users mostly accepted "Static graphic or text banners" but particularly disliked video, native, and audio ads.)

So plenty of ordi­nary peo­ple, not just nerds, have not merely heard of it but are active users of it (and why would pub­lish­ers & the ad indus­try be so hys­ter­i­cal about ad block­ing if it were no more widely used than, say, desk­top Lin­ux?). But, I am well-aware I live in a bub­ble and my intu­itions are not to be trusted on this (as Jakob Nielsen puts it: “The Dis­tri­b­u­tion of Users’ Com­puter Skills: Worse Than You Think”). The only way to rule this out is to ask ordi­nary peo­ple.

As usual, I use Google Surveys to run a weighted population survey. On 2019-03-16, I launched an n = 1000 one-question survey of all Americans with randomly reversed answer order, with the following results (CSV):

Do you know about ‘adblock­ers’: web browser exten­sions like AdBlock Plus or ublock?

  • Yes, and I have one installed [13.9% weighted / n= 156 raw]
  • Yes, but I do not have one installed [14.4% weighted / n = 146 raw]
  • No [71.8% weighted / n = 702 raw]
First Google Sur­vey about adblock usage & aware­ness: bar graph of results.

The instal­la­tion per­cent­age closely par­al­lels the 2015 Adobe/PageFair esti­mate, which is rea­son­able. (Adobe/PageFair 2015 makes much hay of the growth rates, but those are desk­top growth rates, and desk­top usage in gen­eral seems to’ve cratered as peo­ple shift ever more time to tablets/smartphones; they note that “Mobile is yet to be a fac­tor in ad block­ing growth”.) I am how­ever shocked by the per­cent­age claim­ing to not know what an adblocker is: 72%! I had expected to get some­thing more like 10–30%. As one learns read­ing sur­veys, a decent frac­tion of every pop­u­la­tion strug­gles with basic ques­tions like whether the Earth goes around the Sun or vice-ver­sa, so I would be shocked if they knew of ad block­ers but I expected the remain­ing 50%, who are dri­ving this puz­zle of “why adver­tis­ing avoid­ance but not adblock instal­la­tion?”, to be a lit­tle more on the ball, and be aware of ad block­ers but have some other rea­son to not install them (if only myopic lazi­ness).

But that appears to not be the case. There are relatively few people who claim to be aware of ad blockers but not be using them, and those might just be mobile users whose browsers (specifically, Chrome; Apple's Safari/iOS has permitted adblock extensions since 2015) forbid ad blockers.

To look some more into the moti­va­tion of the recu­sants, I launched an expanded ver­sion of the first GS sur­vey with n = 500 on 2019-03-18, oth­er­wise same options, ask­ing (CSV):

If you don’t have an adblock exten­sion like AdBlock Plus/ublock installed in your web browser, why not?

  1. I do have one installed [weighted 34.9% raw n = 183]
  2. I don’t know what ad block­ers are [36.7%; n = 173]
  3. Ad block­ers are too hard to install [6.2%; n = 28]
  4. My browser or device does­n’t sup­port them [7.8%; n = 49]
  5. Ad block­ing hurts web­sites or is uneth­i­cal [10.4%; n = 51]
  6. [free response text field to allow list­ing of rea­sons I did­n’t think of] [0.6%/0.5%/3.0%; n = 1/1/15]
Sec­ond Google Sur­vey about rea­sons for not using adblock: bar graph of results.

The responses here aren't entirely consistent with the previous group. Previously, 14% claimed to have adblock installed, while here 35% do, which is more than double, and the CIs do not overlap. The wording of the answer is almost the same ("Yes, and I have one installed" vs "I do have one installed"), so I wonder if there is a demand effect from the wording of the question: the first one treats adblock use as an exception, while the second frames it as the norm (from which deviation must be justified). So it's possible that the true adblock rate is somewhere in between 14–35%. The two other estimates fall in that range as well.

In any case, the reasons are what this survey was for, and they are more interesting. Of the non-users, ignorance makes up the majority of responses (56%), with only 12% claiming that device restrictions like Android's stop them from using adblockers (which is evidence that informed-but-frustrated mobile users aren't driving the ad harms), 16% abstaining out of principle, and 9% blaming the hassle of installing/using them.

Around 6% of non-users took the option of using the free response text field to pro­vide an alter­na­tive rea­son. I group the free responses as fol­lows:

  1. Ads aren’t sub­jec­tively painful enough to install adblock:

    “Ads aren’t as annoy­ing as sur­veys”/“I don’t visit sites with pop up ads and have not been both­ered”/“Haven’t needed”/“Too lazy”/“i’m not sure, seems like a has­sle”

    • what is prob­a­bly a sub­cat­e­go­ry, unspec­i­fied dis­like or lack of need :

      “Don’t want it”/“Don’t want to block them”/“don’t want to”/“doo not want them”/“No rea­son”/“No”/“Not sure why”

  2. vari­ant of “browser or device does­n’t sup­port them”:

    “work com­puter”/“Mac”

  3. Tech­ni­cal prob­lems with adblock­ers:

    “Many web­sites won’t allow you to use it with an adblocker acti­vated”/“far more effec­tive to just dis­able javascript to kill ads”

  4. Igno­rance (more speci­fic):

    “Did­n’t know they had one for ipads”

So the major miss­ing option here is an option for believ­ing that ads don’t annoy them (although given the size of the ad effect, one won­ders if that is really true).

For a third survey, I added a response for ads not being subjectively annoying, and, because of that 14% vs 35% difference indicating potential demand effects, I tried to reverse the perceived 'demand' by explicitly framing non-adblock use as the norm. Launched with n = 500 over 2019-03-21–2019-03-23, otherwise the same options (CSV):

Most peo­ple do not use adblock exten­sions for web browsers like AdBlock Plus/ublock; if you do not, why not?

  1. I do have one installed [weighted 36.5%; raw n = 168]
  2. I don’t know what ad block­ers are [22.8%; n = 124]
  3. I don’t want or need to remove ads [14.6%; n = 70]
  4. Ad block­ers are too hard to install [12%; n = 65]
  5. My browser or device does­n’t sup­port them [7.8%; n = 41]
  6. Ad block­ing hurts web­sites or is uneth­i­cal [2.6%; n = 17]
  7. [free response text field to allow list­ing of rea­sons I did­n’t think of] [3.6%; n = 15]
Third Google Sur­vey, 2nd ask­ing about rea­sons for not using adblock: bar graph of results.

Free responses show­ing noth­ing new:

  • “dont think add block­ers are eth­i­cal”/“No inter­est in them”/“go away”/“idk”/“I only use them when I’m blinded by ads !”/“Incon­ve­nient to install for a prob­lem I hardly encounter for the web­sites that I use”/“The”/“n/a”/“I dont know”/“worms”/“lazy”/“Don’t need it”/“Fu”/“boo”

With the wording reversal and the additional option, these results are consistent with the second survey on installation percentage (35% vs 37%), but not so much on the others (37% vs 23%, 6% vs 12%, 8% vs 8%, & 10.4% vs 3%). The free responses are also much worse this time around.

Inves­ti­gat­ing word­ing choice again, I sim­pli­fied the first sur­vey down to a binary yes/no, on 2019-04-05–2019-04-07, n = 500 (CSV):

Do you know about ‘adblock­ers’: web browser exten­sions like AdBlock Plus or ublock?

  1. Yes [weighted 26.5%; raw n = 125]
  2. No [weighted 73.5%; raw n = 375]

The results were almost iden­ti­cal: “no” was 73% vs 71%.

For a final sur­vey, I tried directly query­ing the ‘don’t want/need’ pos­si­bil­i­ty, ask­ing a 1–5 Lik­ert ques­tion (no shuffle); n = 500, 2019-06-08–2019-06-10 (CSV):

How much do Inter­net ads (like ban­ner ads) annoy you? [On a scale of 1–5]:

  • 1: Not at all [weighted 11.7%; raw n = 59]
  • 2: [9.5%; n = 46]
  • 3: [14.2%; n = 62]
  • 4: [18.0%; n = 93]
  • 5: Great­ly: I avoid web­sites with ads [46.6%; n = 244]

Almost half of respon­dents gave the max­i­mal respon­se; only 12% claim to not care about ads at all.

The changes are puzzling. The decrease in "Ad blocking hurts websites or is unethical" and "I don't know what ad blockers are" could be explained as users shifting buckets: they don't want to use adblockers because ad blockers are unethical, or they haven't bothered to learn what ad blockers are because they don't want/need to remove ads. But how can adding an option like "I don't want or need to remove ads" possibly affect a response like "Ad blockers are too hard to install" so as to make it double (6% → 12%)? At first blush, this seems like a violation of logical consistency along the lines of the independence of irrelevant alternatives: adding an alternative, which ought to draw respondents only from the responses it overlaps with, nevertheless changes unrelated responses. This suggests that perhaps the responses are in general low-quality and not to be trusted, as the surveyees are being lazy or otherwise screwing things up; they may be semi-randomly clicking, or those ignorant of adblock may be confabulating excuses for why they are right to be ignorant.

Per­plexed by the troll­ish free responses & stark incon­sis­ten­cies, I decided to run the third sur­vey 2019-03-25–2019-03-27 for an addi­tional n = 500, to see if the results held up. They did, with more sen­si­ble free responses as well, so it was­n’t a fluke (CSV):

Most peo­ple do not use adblock exten­sions for web browsers like AdBlock Plus/ublock; if you do not, why not?

  1. I do have one installed [weighted 33.3%; raw n = 165]

  2. I don’t know what ad block­ers are [30.4%; n = 143]

  3. I don’t want or need to remove ads [13.3%; n = 71]

  4. Ad block­ers are too hard to install [10.6%; n = 64]

  5. My browser or device does­n’t sup­port them [5.9%; n = 31]

  6. Ad block­ing hurts web­sites or is uneth­i­cal [4.4%; n = 18]

  7. [free response text field to allow list­ing of rea­sons I did­n’t think of] [2.2%; n = 10]

    • “Na”/“dont care”/“I have one”/“I can’t do sweep­stakes”/“i dont know what adblock is”/“job com­puter do not know what they have”/“Not edu­cated on them”/“Didnt know they were avail­able or how to use them. Have never heard of them.”

Is the igno­rance rate 23%, 31%, 37%, or 72%? It’s hard to say given the incon­sis­ten­cies. But taken as a whole, the sur­veys sug­gest that:

  1. only a minor­ity of users use adblock
  2. adblock non-usage is to a small extent due to (per­ceived) tech­ni­cal bar­ri­ers
  3. a minor­ity & pos­si­bly a plu­ral­ity of poten­tial adblock users do not know what adblock is

This offers a resolution of the apparent adblock paradox: use of ads can drive away a nontrivial proportion of users (such as ~10%) who, despite their aversion, do not use adblock: to a small extent because of technical barriers, but to a much larger extent because of simple ignorance.

Design

How do we ana­lyze this? In the ABa­lyt­ics per-reader approach, it was sim­ple: we defined a thresh­old and did a bino­mial regres­sion. But by switch­ing to try­ing to increase over­all total traffic, I have opened up a can of worms.

Descriptive

Let’s look at the traffic data:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
summary(traffic)
#    Date               Pageviews
#  Min.   :2010-10-04   Min.   :    1
#  1st Qu.:2012-04-28   1st Qu.: 1348
#  Median :2013-11-21   Median : 1794
#  Mean   :2013-11-21   Mean   : 2352
#  3rd Qu.:2015-06-16   3rd Qu.: 2639
#  Max.   :2017-01-08   Max.   :53517
nrow(traffic)
# [1] 2289
library(ggplot2)
qplot(Date, Pageviews, data=traffic)
qplot(Date, log(Pageviews), data=traffic)
Daily pageviews (traffic) to Gwern.net, 2010–2017
Daily pageviews (traffic) to Gwern.net, 2010–2017; log-transformed

Two things jump out. The distribution of traffic is weird, with spikes; and even after a log-transform to tame the spikes, it is clearly a non-stationary time-series with autocorrelation, as traffic consistently grows & declines. Neither is surprising: social media sites like Hacker News or Reddit are notorious for creating spikes in site traffic (and sometimes bringing sites down under the load), and I would hope that as I keep writing things, traffic would gradually increase! Nevertheless, both features will make the traffic data difficult to analyze despite having over 6 years of it.
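These eyeball impressions can be quantified with a quick check (a minimal sketch, not part of the analysis pipeline): a Ljung-Box test for serial correlation, and forecast’s ndiffs() estimate of how much differencing the series needs to become stationary:

## quantify the visual impressions: Ljung-Box test for autocorrelation out to 8 lags,
## plus an estimate of the differencing needed for stationarity:
Box.test(traffic$Pageviews, lag=8, type="Ljung-Box")
library(forecast)
ndiffs(log(traffic$Pageviews))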

Power analysis

Using the his­tor­i­cal traffic data, how easy would it be to detect a total traffic reduc­tion of ~3%, the crit­i­cal bound­ary for the ads/no-ads deci­sion? Stan­dard non-time-series meth­ods are unable to detect it at any rea­son­able sam­ple size, but using more com­plex time-series-ori­ented meth­ods like ARIMA mod­els (ei­ther NHST or Bayesian), it can be detected given sev­eral months of data.

NHST

We can demon­strate with a quick power analy­sis: if we pick a ran­dom sub­set of days and force a decrease of 2.8% (the value on the deci­sion bound­ary), can we detect that?

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(Pageviews ~ Ads, data=ads)
# t = 0.27315631, df = 2285.9151, p-value = 0.7847577
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -203.7123550  269.6488393
# sample estimates:
# mean in group 0 mean in group 1
#     2335.331004     2302.362762
wilcox.test(log(Pageviews) ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(log(Pageviews) ~ Ads, data=ads)
# t = 0.36685265, df = 2286.8348, p-value = 0.7137629
sd(ads$Pageviews)
# [1] 2880.044636

The answer is no. We are nowhere near being able to detect it with either a t-test or the non­para­met­ric u-test (which one might expect to han­dle the strange dis­tri­b­u­tion bet­ter), and the log trans­form does­n’t help. We can hardly even see a hint of the decrease in the t-test—the decrease in the mean is ~30 pageviews but the stan­dard devi­a­tions are ~2900 and actu­ally big­ger than the mean. So the spikes in the traffic are crip­pling the tests and this can­not be fixed by wait­ing a few more months since it’s inher­ent to the data.
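A back-of-the-envelope power calculation (a sketch using the observed mean & SD as plug-in values, not something from the original analysis) tells the same story: splitting the ~2,289 days into two arms and looking for a 2.8% shift against a SD of ~2,880 gives power barely above the α = 0.05 floor:

## textbook two-sample power check using the observed mean & SD:
power.t.test(n = nrow(traffic)/2, delta = mean(traffic$Pageviews) * 0.028,
             sd = sd(traffic$Pageviews), sig.level = 0.05)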

If our trusty friend the log-transform can’t help, what can we do? In this case, we know that the reality here is literally a mixture of distinct processes, as the spikes are driven by qualitatively different phenomena, like a Gwern.net link appearing on the HN front page, as compared to normal daily traffic from existing links & search traffic19; but mixture models tend to be hard to use. One ad hoc approach to taming the spikes would be to effectively throw them out by truncating/clipping everything at a certain point (since the daily traffic average is ~1700, perhaps twice that, 3000):

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
ads[ads$Pageviews>3000,]$Pageviews <- 3000
sd(ads$Pageviews)
# [1] 896.8798131
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 679859, p-value = 0.1131403
t.test(Pageviews ~ Ads, data=ads)
# t = 1.3954503, df = 2285.3958, p-value = 0.1630157
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -21.2013943 125.8265361
# sample estimates:
# mean in group 0 mean in group 1
#     1830.496049     1778.183478

Bet­ter but still inad­e­quate. Even with the spikes tamed, we con­tinue to have prob­lems; the logged graph sug­gests that we can’t afford to ignore the time-series aspect. A check of auto­cor­re­la­tion indi­cates sub­stan­tial auto­cor­re­la­tion out to lags as high as 8 days:

pacf(traffic$Pageviews, main="gwern.net traffic time-series autocorrelation")
Auto­cor­re­la­tion in Gwern.net daily traffic: pre­vi­ous daily traffic is pre­dic­tive of cur­rent traffic up to t = 8 days ago

The usual regression framework for time-series is the ARIMA model, in which the current daily value is regressed on the previous days’ values (with an estimated coefficient for each lag, as day 8 ought to be less predictive than day 7 and so on), possibly a differencing step, and possibly a moving average (also with varying distances in time). The models are usually denoted as “ARIMA([days back to use as lags], [days back to difference], [days back for moving average])”. So the pacf suggests that an ARIMA(8,0,0) might work: lags back 8 days, but agnostic about differencing and moving averages. R’s forecast library helpfully includes an auto.arima function to do model comparison on top of the standard arima regression function. auto.arima generally finds that a much simpler model than ARIMA(8,0,0) works best, preferring models like ARIMA(4,1,1) (presumably the differencing and moving-average steal enough of the distant lags’ predictive power that they no longer look better to AIC).
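The automatic model-comparison step looks like this (a sketch of the call involved rather than a preserved transcript; the chosen ARIMA(4,1,1) fit is shown below):

library(forecast)
## automatic ARIMA order selection by AICc, with the ad indicator as an external regressor:
auto.arima(ads$Pageviews, xreg=ads$Ads)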

Such an ARIMA model works well and now we can detect our sim­u­lated effect:

library(forecast)
library(lmtest)
l <- lm(Pageviews ~ Ads, data=ads); summary(l)
# Residuals:
#       Min        1Q    Median        3Q       Max
# -2352.275  -995.275  -557.275   294.725 51239.783
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 2277.21747   86.86295 26.21621  < 2e-16
# Ads           76.05732  120.47141  0.63133  0.52789
#
# Residual standard error: 2879.608 on 2287 degrees of freedom
# Multiple R-squared:  0.0001742498,    Adjusted R-squared:  -0.0002629281
# F-statistic: 0.3985787 on 1 and 2287 DF,  p-value: 0.5278873
a <- arima(ads$Pageviews, xreg=ads$Ads, order=c(4,1,1))
summary(a); coeftest(a)
# Coefficients:
#             ar1         ar2         ar3         ar4         ma1      ads$Ads
#       0.5424117  -0.0803198  -0.0310823  -0.0094242  -0.8906085  -52.4148244
# s.e.  0.0281538   0.0245621   0.0245500   0.0240701   0.0189952   10.5735098
#
# sigma^2 estimated as 89067.31:  log likelihood = -16285.31,  aic = 32584.63
#
# Training set error measures:
#                       ME        RMSE         MAE          MPE        MAPE         MASE             ACF1
# Training set 3.088924008 298.3762646 188.5442545 -6.839735685 31.17041388 0.9755280945 -0.0002804416646
#
# z test of coefficients:
#
#               Estimate     Std. Error   z value   Pr(>|z|)
# ar1       0.5424116948   0.0281538043  19.26602 < 2.22e-16
# ar2      -0.0803197830   0.0245621012  -3.27007  0.0010752
# ar3      -0.0310822966   0.0245499783  -1.26608  0.2054836
# ar4      -0.0094242194   0.0240700967  -0.39153  0.6954038
# ma1      -0.8906085375   0.0189952434 -46.88587 < 2.22e-16
# Ads     -52.4148243747  10.5735097735  -4.95718 7.1523e-07

One might rea­son­ably ask, what is doing the real work, the truncation/trimming or the ARIMA(4,1,1)? The answer is both; if we go back and regen­er­ate the ads dataset with­out the truncation/trimming and we look again at the esti­mated effect of Ads, we find it changes to

#               Estimate     Std. Error   z value   Pr(>|z|)
# ...
# Ads      26.3244086579 81.2278521231    0.32408 0.74587666

For the sim­ple lin­ear model with no time-series or trun­ca­tion, the stan­dard error on the ads effect is 121; for the time-series with no trun­ca­tion, the stan­dard error is 81; and for the time series plus trun­ca­tion, the stan­dard error is 11. My con­clu­sion is that we can’t leave either one out if we are to reach cor­rect con­clu­sions in any fea­si­ble sam­ple size—we must deal with the spikes, and we must deal with the time-series aspect.

So having settled on a specific ARIMA model with truncation, I can do a power analysis. For a time-series, the simple bootstrap is inappropriate as it ignores the autocorrelation; the right tool is the block bootstrap: for each hypothetical sample size n, split the traffic history into as many non-overlapping n-sized blocks as possible, select from them with replacement, and run the analysis on each resampled series. This is implemented in the R boot library.

library(boot)
library(lmtest)

## fit models & report p-value/test statistic
ut <- function(df) { wilcox.test(Pageviews ~ Ads, data=df)$p.value }
at <- function(df) { coeftest(arima(df$Pageviews, xreg=df$Ads, order=c(4,1,1)))[,4][["df$Ads"]] }

## create the hypothetical effect, truncate, and test
simulate <- function (df, testFunction, effect=0.03, truncate=TRUE, threshold=3000) {
    df$Ads <- rbinom(nrow(df), size=1, p=0.5)
    df[df$Ads==1,]$Pageviews <- round(df[df$Ads==1,]$Pageviews * (1-effect))
    if(truncate) { df[df$Pageviews>threshold,]$Pageviews <- threshold }
    return(testFunction(df))
    }
power <- function(ns, df, test, effect, alpha=0.05, iters=2000) {
    powerEstimates <- vector(mode="numeric", length=length(ns))
    i <- 1
    for (n in ns) {
        tsb <- tsboot(df, function(d){simulate(d, test, effect=effect)}, iters, l=n,
                          sim="fixed", parallel="multicore", ncpus=getOption("mc.cores"))
        powerEstimates[i] <- mean(tsb$t < alpha)
        i <- i+1 }
    return(powerEstimates) }

ns <- seq(10, 2000, by=5)
## test the critical value but also 0 effect to check whether alpha is respected
powerUtestNull <- power(ns, traffic, ut, 0)
powerUtest     <- power(ns, traffic, ut, 0.028)
powerArimaNull <- power(ns, traffic, at, 0)
powerArima     <- power(ns, traffic, at, 0.028)
p1 <- qplot(ns, powerUtestNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p2 <- qplot(ns, powerUtest) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p3 <- qplot(ns, powerArimaNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p4 <- qplot(ns, powerArima) + stat_smooth() + coord_cartesian(ylim = c(0, 1))

library(grid)
library(gridExtra)
grid.arrange(p1, p3, p2, p4, ncol = 2, name = "Power analysis of detecting null effect/2.8% reduction using u-test and ARIMA regression")
Block­-boot­strap power analy­sis of abil­ity to detect 2.8% traffic reduc­tion using u-test & ARIMA time-series model (bot­tom row), while pre­serv­ing nom­i­nal false-pos­i­tive error con­trol (top row)

So the false-pos­i­tive rate is pre­served for both, the ARIMA requires a rea­son­able-look­ing n < 70 to be well-pow­ered, but the u-test power is bizarre—the power is never great, never going >31.6%, and actu­ally decreas­ing after a cer­tain point, which is not some­thing you usu­ally see in a power graph. (The ARIMA power curve is also odd but at least it does­n’t get worse with more data!) My spec­u­la­tion about that is that it is because as the time-series win­dow increas­es, more of the spikes come into view of the u-test, mak­ing the dis­tri­b­u­tion dra­mat­i­cally wider & this more than over­whelms the gain in detectabil­i­ty; hypo­thet­i­cal­ly, with even more years of data, the spikes would stop com­ing as a sur­prise and the grad­ual hypo­thet­i­cal dam­age of the ads will then become more vis­i­ble with increas­ing sam­ple size as expect­ed.

Bayesian

ARIMA Bayesian mod­els are prefer­able as they deliver the pos­te­rior dis­tri­b­u­tion nec­es­sary for deci­sion-mak­ing, which allows weighted aver­ages of all the pos­si­ble effects, and can ben­e­fit from includ­ing my prior infor­ma­tion that the effect of ads is defi­nitely neg­a­tive but prob­a­bly close to zero. Some exam­ples of Bayesian ARIMA time-series analy­sis:

An ARIMA(3,0,1) model (3 autoregressive lags, 1 moving-average term) in JAGS:

library(runjags)
arima311 <- "model {
  # initialize the first 3 days, which we need to fit the 3 lags/moving-averages for day 4:
  # y[1] <- 50
  # y[2] <- 50
  # y[3] <- 50
  eps[1] <- 0
  eps[2] <- 0
  eps[3] <- 0

  for (i in 4:length(y)) {
     y[i] ~ dt(mu[i], tauOfClust[clust[i]], nuOfClust[clust[i]])
     mu[i] <- muOfClust[clust[i]] + w1*y[i-1] + w2*y[i-2] + w3*y[i-3] + m1*eps[i-1]
     eps[i] <- y[i] - mu[i]

     clust[i] ~ dcat(pClust[1:Nclust])
  }

  for (clustIdx in 1:Nclust) {
      muOfClust[clustIdx] ~ dnorm(100, 1.0E-06)
      sigmaOfClust[clustIdx] ~ dnorm(500, 1e-06)
      tauOfClust[clustIdx] <- pow(sigmaOfClust[clustIdx], -2)
      nuMinusOneOfClust[clustIdx] ~ dexp(5)
      nuOfClust[clustIdx] <- nuMinusOneOfClust[clustIdx] + 1
  }
  pClust[1:Nclust] ~ ddirch(onesRepNclust)

  m1 ~ dnorm(0, 4)
  w1 ~ dnorm(0, 5)
  w2 ~ dnorm(0, 4)
  w3 ~ dnorm(0, 3)
  }"
y <- traffic$Pageviews
Nclust = 2
clust = rep(NA,length(y))
clust[which(y<1800)] <- 1 # seed labels for cluster 1, normal traffic
clust[which(y>4000)] <- 2 # seed labels for cluster 2, spikes
model <- run.jags(arima311, data = list(y=y, Nclust = Nclust, clust=clust, onesRepNclust = c(1,1) ),
    monitor=c("w1", "w2", "w3", "m1", "pClust", "muOfClust", "sigmaOfClust", "nuOfClust"),
    inits=list(w1=0.55, w2=0.37, w3=-0.01, m1=0.45, pClust=c(0.805, 0.195), muOfClust=c(86.5, 781), sigmaOfClust=c(156, 763), nuMinusOneOfClust=c(2.4-1, 1.04-1)),
    n.chains = getOption("mc.cores"), method="rjparallel", sample=500)
summary(model)

JAGS is painfully slow: 5h+ for 500 sam­ples. Sharper pri­ors, remov­ing a 4th-order ARIMA lag, & bet­ter ini­tial­iza­tion did­n’t help. The level of auto­cor­re­la­tion might make fit­ting with JAGS’s Gibbs MCMC diffi­cult, so I tried switch­ing to Stan, which is gen­er­ally faster & its HMC MCMC typ­i­cally deals with hard mod­els bet­ter:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> K; // number of mixture components
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // observations
    }
    parameters {
        simplex[K] theta; // mixing proportions
        real<lower=0, upper=100>    muM[K]; // locations of mixture components
        real<lower=0.01, upper=1000> sigmaM[K]; // scales of mixture components
        real<lower=0.01, upper=5>    nuM[K];

        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff
    }
    model {

        real mu[T, K]; // prediction for time t
        vector[T] err; // error for time t
        real ps[K]; // temp for log component densities
        // initialize the first 4 days for the lags
        mu[1][1] = 0; // assume err[0] == 0
        mu[2][1] = 0;
        mu[3][1] = 0;
        mu[4][1] = 0;
        err[1] = y[1] - mu[1][1];
        err[2] = y[2] - mu[2][1];
        err[3] = y[3] - mu[3][1];
        err[4] = y[4] - mu[4][1];


        muM ~ normal(0, 5);
        sigmaM ~ cauchy(0, 2);
        nuM ~ exponential(1);
        ma ~ normal(0, 0.5);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);

        for (t in 5:T) {
            for (k in 1:K) {
                mu[t][k] = muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
                err[t] = y[t] - mu[t][k];

                ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
            }
        target += log_sum_exp(ps);
        }
    }"
# 17m for 200 samples
nchains <- getOption("mc.cores") - 1
# original, based on MCMC:
# inits <- list(theta=c(0.92, 0.08), muM=c(56.2, 0.1), sigmaM=c(189.7, 6), nuM=c(1.09, 0.61), phi1=1.72, phi2=-0.8, phi3=0.08, phi4=0, ma=-0.91)
# optimized based on gradient descent
inits <- list(theta=c(0.06, 0.94), muM=c(0.66, 0.13), sigmaM=c(5.97, 190.05), nuM=c(1.40, 1.10), phi1=1.74, phi2=-0.83, phi3=0.10, phi4=-0.01, ma=-0.93)
model2 <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, K=2), init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=50); print(model2)
traceplot(model2)

This was per­haps the first time I’ve attempted to write a com­plex model in Stan, in this case, adapt­ing a sim­ple ARIMA time-series model from the Stan man­u­al. Stan has some inter­est­ing fea­tures like the vari­a­tional infer­ence opti­mizer which can find sen­si­ble para­me­ter val­ues for com­plex mod­els in sec­onds, an active com­mu­nity & involved devel­op­ers & an excit­ing roadmap, and when Stan works it is sub­stan­tially faster than the equiv­a­lent JAGS mod­el; but I encoun­tered a num­ber of draw­backs.

Given diffi­cul­ties in run­ning JAGS/Stan and slow­ness of the final mod­els, I ulti­mately did not get a suc­cess­ful power analy­sis of the Bayesian mod­els, and I opted to essen­tially wing it and hope that ~10 months would be ade­quate for mak­ing a deci­sion whether to dis­able ads per­ma­nent­ly, enable ads per­ma­nent­ly, or con­tinue the exper­i­ment.

Analysis

Descriptive

Google Analytics reports that overall traffic from 2017-01-01–2017-10-15 was 179,550 unique users with 380,140 page-views & a mean session duration of 1m37s; this is a typical set of traffic statistics for my site.

Merged traffic & AdSense data:

traffic <- read.csv("https://www.gwern.net/docs/traffic/2017-10-20-abtesting-adsense.csv",
            colClasses=c("Date", "integer", "integer", "numeric", "integer", "integer",
                         "integer", "numeric", "integer", "numeric"))
library(skimr)
skim(traffic)
#  n obs: 288
#  n variables: 10
#
# Variable type: Date
#  variable missing complete   n        min        max     median n_unique
#      Date       0      288 288 2017-01-01 2017-10-15 2017-05-24      288
#
# Variable type: integer
#        variable missing complete   n     mean       sd    p0      p25     p50      p75   p100     hist
#  Ad.impressions       0      288 288   358.95   374.67     0    33      127.5   708      1848 ▇▁▂▃▁▁▁▁
#    Ad.pageviews       0      288 288   399.02   380.62     0    76.5    180.5   734.5    1925 ▇▁▂▃▁▁▁▁
#           Ads.r       0      288 288     0.44     0.5      0     0        0       1         1 ▇▁▁▁▁▁▁▆
#       Pageviews       0      288 288  1319.93   515.8    794  1108.75  1232    1394      8310 ▇▁▁▁▁▁▁▁
#        Sessions       0      288 288   872.1    409.91   561   743      800     898.25   6924 ▇▁▁▁▁▁▁▁
#      Total.time       0      288 288 84517.41 24515.13 39074 70499.5  81173.5 94002    314904 ▅▇▁▁▁▁▁▁
#
# Variable type: numeric
#             variable missing complete   n  mean    sd    p0    p25   p50    p75   p100     hist
#   Ad.pageviews.logit       0      288 288 -1.46  2.07 -8.13 -2.79  -1.8    0.53   1.44 ▁▁▁▂▆▃▂▇
#          Ads.percent       0      288 288  0.29  0.29  0     0.024  0.1    0.59   0.77 ▇▁▁▁▁▂▃▂
#  Avg.Session.seconds       0      288 288 99.08 17.06 45.48 87.3   98.98 109.22 145.46 ▁▁▃▅▇▅▂▁
sum(traffic$Sessions); sum(traffic$Pageviews)
# [1] 251164
# [1] 380140

library(ggplot2)
qplot(Date, Pageviews, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(750,3089)) +
    labs(color="Ads", title="AdSense advertising effect on Gwern.net daily traffic, January-October 2017")
qplot(Date, Total.time, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(38000,190000)) +
    labs(color="Ads", title="AdSense advertising effect on total time spent reading Gwern.net , January-October 2017")

Traffic looks sim­i­lar whether count­ing by total page views or total time read­ing (av­er­age-time-read­ing-per-ses­sion x num­ber-of-ses­sion­s); the data is defi­nitely auto­cor­re­lat­ed, some­what noisy, and I get a sub­jec­tive impres­sion that there is a small decrease in pageviews/total-time on the adver­tis­ing days (de­spite the mea­sure­ment error):

AdSense ban­ner ad A/B test of effect on Gwern.net traffic: daily pageviews, Jan­u­ary–Oc­to­ber 2017 split by adver­tis­ing con­di­tion
Daily total-time-spen­t-read­ing Gwern.net, Jan­u­ary–Oc­to­ber 2017 (split by A/B)

Simple tests & regressions

As expected from the power analysis, the usual tests are unable to reliably detect anything, but it’s worth noting that the point-estimates of both the mean & the median indicate the ads are worse:

t.test(Pageviews ~ Ads.r, data=traffic)
#   Welch Two Sample t-test
#
# data:  Pageviews by Ads.r
# t = 0.28265274, df = 178.00378, p-value = 0.7777715
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -111.4252500  148.6809819
# sample estimates:
# mean in group 0 mean in group 1
#     1328.080247     1309.452381
wilcox.test(Pageviews ~ Ads.r, conf.int=TRUE, data=traffic)
#   Wilcoxon rank sum test with continuity correction
#
# data:  Pageviews by Ads.r
# W = 11294, p-value = 0.1208844
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  -10.00001128  87.99998464
# sample estimates:
# difference in location
#             37.9999786

The tests can only handle a binary variable, so next is a quick simple linear model, and then a quick & easy Bayesian regression in brms with an autocorrelation term to improve on the linear model; both turn up a weak effect for the binary randomization, and a much stronger (and negative) effect for the more accurate percentage measurement:

summary(lm(Pageviews ~ Ads.r, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -534.0802 -207.6093  -90.0802   65.9198 7000.5476
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1328.08025   40.58912 32.72010  < 2e-16
# Ads.r        -18.62787   61.36498 -0.30356  0.76168
#
# Residual standard error: 516.6152 on 286 degrees of freedom
# Multiple R-squared:  0.0003220914,    Adjusted R-squared:  -0.003173286
# F-statistic: 0.09214781 on 1 and 286 DF,  p-value: 0.7616849
summary(lm(Pageviews ~ Ads.percent, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -579.4145 -202.7547  -89.1473   60.7785 6928.3160
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1384.17269   42.77952 32.35596  < 2e-16
# Ads.percent -224.79052  105.98550 -2.12096 0.034786
#
# Residual standard error: 512.6821 on 286 degrees of freedom
# Multiple R-squared:  0.01548529,  Adjusted R-squared:  0.01204293
# F-statistic: 4.498451 on 1 and 286 DF,  p-value: 0.03478589

library(brms)
b <- brm(Pageviews ~ Ads.r, autocor = cor_bsts(), iter=20000, chains=8, data = traffic); b
#  Family: gaussian(identity)
# Formula: Pageviews ~ Ads.r
#    Data: traffic (Number of observations: 288)
# Samples: 8 chains, each with iter = 20000; warmup = 10000; thin = 1;
#          total post-warmup samples = 80000
#     ICs: LOO = Not computed; WAIC = Not computed
#
# Correlation Structure: bsts(~1)
#         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigmaLL    50.86     16.47    26.53    90.49        741 1.01
#
# Population-Level Effects:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.r    50.36     63.19   -74.73   174.21      34212    1
#
# Family Specific Parameters:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigma   499.77     22.38   457.76    545.4      13931    1
#
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
# is a crude measure of effective sample size, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
 b2 <- brm(Pageviews ~ Ads.percent, autocor = cor_bsts(), chains=8, data = traffic); b2
...
# Population-Level Effects:
#             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.percent    -91.8    113.05  -317.85   131.41       2177    1

This is imperfect, since treating the percentage as additive is odd; one would expect it to act multiplicatively in some sense. As well, brms makes it convenient to throw in a simple Bayesian structural autocorrelation term (corresponding to AR(1), if I am understanding it correctly), but the function involved does not support the higher-order lags or moving average involved in this traffic data, so it is weaker than it could be.
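A crude way to respect the multiplicative intuition (a sketch, not used in the headline analysis; logPageviews & b3 are just new names continuing the pattern above) is to regress log pageviews on the ad-exposure percentage, so the coefficient reads approximately as a proportional change in traffic for going from 0% to 100% ad-exposed:

## log-linear versions of the same regressions: coefficients are ~proportional effects
summary(lm(log(Pageviews) ~ Ads.percent, data=traffic))
traffic$logPageviews <- log(traffic$Pageviews)
b3 <- brm(logPageviews ~ Ads.percent, autocor = cor_bsts(), chains=8, data=traffic); b3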

Stan ARIMA time-series model

For the real analy­sis, I do a fully Bayesian analy­sis in Stan, using ARIMA(4,0,1) time-series, a mul­ti­plica­tive effect of ads as per­cent­age of traffic, skep­ti­cal infor­ma­tive pri­ors of small neg­a­tive effects, and extract­ing pos­te­rior pre­dic­tions (of each day if hypo­thet­i­cally it were not adver­tis­ing-affect­ed) for fur­ther analy­sis.

Model defi­n­i­tion & setup:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // traffic
        real Ads[T]; // Ad logit
    }
    parameters {
        real<lower=0> muM;
        real<lower=0> sigma;
        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff

        real<upper=0> ads; // advertising coeff; can only be negative

        real<lower=0, upper=10000> y_pred[T]; // traffic predictions
    }
    model {
        real mu[T]; // prediction for time t
        vector[T] err; // error for time t

        // initialize the first 4 days for the lags
        mu[1] = 0;
        mu[2] = 0;
        mu[3] = 0;
        mu[4] = 0;
        err[1] = y[1] - mu[1];
        err[2] = y[2] - mu[2];
        err[3] = y[3] - mu[3];
        err[4] = y[4] - mu[4];

        muM ~ normal(1300, 500);
        sigma ~ exponential(250);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);
        ma ~ normal(0, 0.5);
        ads  ~ normal(0,1);

        for (t in 5:T) {
          mu[t] = muM + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
          err[t] = y[t] - mu[t];
          y[t]      ~ normal(mu[t] * (1 + ads*Ads[t]),       sigma);
          y_pred[t] ~ normal(mu[t] * 1, sigma); // for comparison: what would the ARIMA predict for today with no ads?
        }
    }"

# extra flourish: find posterior mode via Stan's new L-BFGS gradient descent optimization feature;
# also offers a good initialization point for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent), hessian=TRUE)
round(optimized$par, digits=3)
#      muM       sigma        phi1        phi2        phi3        phi4          ma         ads
# 1352.864      65.221      -0.062       0.033      -0.028       0.083       0.249      -0.144
## Initialize from previous MCMC run:
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.15)
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=200000); print(model)

Results from the Bayesian mod­el, plus a sim­ple per­mu­ta­tion test as a san­i­ty-check on the data+­mod­el:

# ...Elapsed Time: 413.816 seconds (Warm-up)
#                654.858 seconds (Sampling)
#                1068.67 seconds (Total)
#
# Inference for Stan model: bacd35459b712679e6fc2c2b6bc0c443.
# 1 chains, each with iter=2e+05; warmup=1e+05; thin=1;
# post-warmup draws per chain=1e+05, total post-warmup draws=1e+05.
#
#                  mean se_mean      sd      2.5%       25%       50%       75%     97.5%  n_eff Rhat
# muM           1355.27    0.20   47.54   1261.21   1323.50   1355.53   1387.20   1447.91  57801    1
# sigma           65.61    0.00    0.29     65.03     65.41     65.60     65.80     66.18 100000    1
# phi1            -0.06    0.00    0.04     -0.13     -0.09     -0.06     -0.04      0.01  52368    1
# phi2             0.03    0.00    0.01      0.01      0.03      0.03      0.04      0.05 100000    1
# phi3            -0.03    0.00    0.01     -0.04     -0.03     -0.03     -0.02     -0.01 100000    1
# phi4             0.08    0.00    0.01      0.07      0.08      0.08      0.09      0.10 100000    1
# ma               0.25    0.00    0.04      0.18      0.23      0.25      0.27      0.32  52481    1
# ads             -0.14    0.00    0.01     -0.16     -0.15     -0.14     -0.14     -0.13 100000    1
# ...
mean(extract(model)$ads)
# [1] -0.1449574151

## permutation test to check for model misspecification: shuffle ad exposure and rerun the model,
## see what the empirical null distribution of the ad coefficient is and how often it yields a
## reduction of >= -14.5%:
empiricalNull <- numeric()
iters <- 5000
for (i in 1:iters) {
    df <- traffic
    df$Ads.percent <- sample(df$Ads.percent)
    inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.01)
    # nchains <- 1; options(mc.cores = 1) # disable multi-core to work around occasional Stan segfaults
    model <- stan(model_code=m, data=list(T=nrow(df), y=df$Pageviews, Ads=df$Ads.percent),
                   init=replicate(nchains, inits, simplify=FALSE), chains=nchains); print(model)
    adEstimate <- mean(extract(model)$ads)
    empiricalNull[i] <- adEstimate
}
summary(empiricalNull); sum(empiricalNull < -0.1449574151) / length(empiricalNull)
#        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
# -0.206359600 -0.064702600 -0.012325460 -0.035497930 -0.001696464 -0.000439064
# [1] 0.0136425648

We see a consistent & large estimate of harm: the mean of traffic falls by −14.5% (95% CI: −0.16 to −0.13; permutation test: p = 0.01) on 100%-ad-affected traffic! Given that these traffic statistics are sourced from Google Analytics, which could be blocked along with the ad by an adblocker, and that such ‘invisible’ traffic appears to average ~10% of total traffic, the true estimate is presumably somewhat larger, because there is more actual traffic than measured. Ad exposure, however, was not 100%, simply because of the adblock/randomization issues.

To more directly calculate the harm, I turn to the posterior predictions, which were computed for each day under the hypothetical of no advertising; one would expect the prediction for all days to be somewhat higher than the actual traffic was (because almost every day has some non-zero % of ad-affected traffic), and, summed or averaged over all days, that gives the predicted loss of traffic from ads:

mean(traffic$Pageviews)
# [1] 1319.930556
## fill in defaults when extracting mean posterior predictives:
traffic$Prediction <- c(1319,1319,1319,1319, colMeans(extract(model)$y_pred)[5:288])
mean(with(traffic, Prediction - Pageviews) )
# [1] 53.67329617
mean(with(traffic, (Prediction - Pageviews) / Pageviews) )
# [1] 0.09668207805
sum(with(traffic, Prediction - Pageviews) )
# [1] 15457.9093

So dur­ing the A/B test, the expected esti­mated loss of traffic is ~9.7%.

Decision

As this is so far past the decision threshold, and the 95% credible interval around −0.14 is extremely tight (−0.16 to −0.13) and rules out acceptable losses in the 0–2% range, the EVSI of any additional sampling is negative & not worth calculating.
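To put the magnitude in dollar terms, a rough plug-in sketch using the annualized loss function from the EVSI appendix below (with the same assumptions used there: ~$360/year ad revenue, a pageview valued at $0.02, 5% discount rate) turns a ~9.7% loss on ~1,320 daily pageviews into a net present value on the order of −$12,000, dwarfing the ad revenue:

## rough NPV of the estimated harm, reusing the loss function from the EVSI appendix:
generalLoss <- function(annualAdRevenue, trafficLoss, hitValue, discountRate) {
  (annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
generalLoss(360, 0.097 * mean(traffic$Pageviews), 0.02, 0.05)
# ≈ -11800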

Thus, I removed the AdSense ban­ner ad in the mid­dle of 2017-09-11.

Discussion

The result is sur­pris­ing. I had been expect­ing some degree of harm but the esti­mated reduc­tion is much larger than I expect­ed. Could ban­ner ads really be that harm­ful?

The effect is esti­mated with con­sid­er­able pre­ci­sion, so it’s almost cer­tainly not a fluke of the data (if any­thing I col­lected far more data than I should’ve); there weren’t many traffic spikes to screw with the analy­sis, so omit­ting mix­ture model or t-scale responses in the model does­n’t seem like it should be an issue either; the mod­el­ing itself might be dri­ving it, but the crud­est tests sug­gest a sim­i­lar level of harm (just not at high sta­tis­ti­cal-sig­nifi­cance or pos­te­rior prob­a­bil­i­ty); it does seem to be vis­i­ble in the scat­ter­plot; and the more real­is­tic mod­el­s—which include time-series aspects I know exist from the long his­tor­i­cal time-series of Gwern.net traffic & skep­ti­cal pri­ors encour­ag­ing small effect­s—es­ti­mate it much bet­ter as I expected from my pre­vi­ous power analy­ses, and con­sid­er­able tin­ker­ing with my orig­i­nal ARIMA(4,0,1) Stan model to check my under­stand­ing of my code (I haven’t used Stan much before) did­n’t turn up any issues or make the effect go away. So as far as I can tell, this effect is real. I still doubt my results, but it’s con­vinc­ing enough for me to dis­able ads, at least.

Does it gen­er­al­ize? I admit Gwern.net is unusu­al: highly tech­ni­cal long­form sta­tic con­tent in a min­i­mal­ist lay­out opti­mized for fast load­ing & ren­der­ing cater­ing to Anglo­phone STEM-types in the USA. It is entirely pos­si­ble that for most web­sites, the effect of ads is much smaller because they already load so slow, have much busier clut­tered designs, their users have less vis­ceral dis­taste for adver­tis­ing or are more eas­ily tar­geted for use­ful adver­tis­ing etc, and thus Gwern.net is merely an out­lier for whom remov­ing ads makes sense (par­tic­u­larly given my option of being Patre­on-sup­ported rather than depend­ing entirely on ads like many media web­sites must). I have no way of know­ing whether or not this is true, and as always with opti­miza­tions, one should bench­mark one’s own spe­cific use case; per­haps in a few years more results will be reported and it will be seen if my results are merely a cod­ing error or an out­lier or some­thing else.

If a max loss of 14% and an average loss of ~9% (both of which could be higher for sites whose users don’t use adblock as much) are accurate and generalizable to other blogs/websites (as the replications since my first A/B test imply), there are many implications: in particular, it implies a huge deadweight loss to Internet users from advertising, and suggests advertising may be a net loss for many smaller sites. (It seems unlikely, to say the least, that every single website or business in existence would deliver precisely the amount of ads they do now while ignorant of the true costs, having by sheer luck made the optimal tradeoff; it is likely that many would prefer to reduce their ad intensity or remove ads entirely.) Ironically, in the latter case, those sites may not yet have realized, and may never realize, how much the pennies they earn from advertising are costing them, because the harm won’t show up in standard single-user A/B testing, due either to measurement error hiding much of the effect or to its existing as an emergent global effect, requiring long-term experimentation & relatively sophisticated time-series modeling: a decrease of 10% is important, and yet site traffic exogenously changes on a daily (much less weekly or monthly) basis by more than 10%, rendering even a drastic on/off change invisible to the naked eye.

There may be a connection here to earlier observations on the business of advertising questioning whether advertising works, works more than it hurts or cannibalizes other avenues, works sufficiently well to be profitable, or works sufficiently well to even know if it is working at all. Rigorous evaluations of advertising usually fail to find much effect, and the more rigorous the evaluation, the smaller the effects are: on purely statistical grounds, it should be hard to cost-effectively show that advertising works at all; there is publication bias in published estimates of advertising efficacy; Steve Sailer has observed that the BehaviorScan field-experiments linking P&G’s individual TV advertisements & grocery store sales likely showed little effect; eBay’s own experiments came to a similar conclusion (Blake et al 2014); P&G & JPMorgan cut digital ad spending (and many São Paulo-style low-advertising retailers continue to succeed); correlational attempts to predict advertising effects are extremely inaccurate (Lewis et al 2011); political science has difficulty showing any causal impact of campaign advertising spending on victories (Kalla & Broockman 2017; and most recently, Donald Trump, which may be related to why there is so little money in politics: “Why is There so Little Money in U.S. Politics?”, Ansolabehere et al 2003, /docs/economics/2003-ansolabehere.pdf); behavioral profiling adds only minimal value to ad efficacy (Marotta et al 2019); and many anecdotal reports seriously question the value of Facebook or Google advertising for businesses, yielding mistaken/curious or fraudulent or just useless traffic. One counterpoint is Johnson et al 2017, which takes up the Lewis/Gordon gauntlet and, using n = 2.2b/k = 432 (!), is able to definitively establish small advertising effects (but driven by lower-quality traffic, heterogeneous effects, and with modest long-term effects).

Followup test

An active & skep­ti­cal dis­cus­sion ensued on Hacker News & else­where after I posted my first analy­sis. Many peo­ple believed the results, but many were skep­ti­cal. Nor could I blame them—while all the analy­ses turn in neg­a­tive esti­mates, it was (then) only the one result on an unusual web­site with the head­line result esti­mated by an unusu­ally com­plex sta­tis­ti­cal model and so on. But this is too impor­tant to leave unset­tled like this.

So after see­ing lit­tle appar­ent fol­lowup by other web­sites who could pro­vide larger sam­ple sizes & truly inde­pen­dent repli­ca­tion, I resolved to run a sec­ond one, which would at least demon­strate that the first one was not a fluke of Gwern.net’s 2017 traffic.

As the first result was so strong, I decided not to run it quite so long the second time: 6 months, 2018-09-27–2019-03-27. (The second experiment can be pooled with the first.) Due to major distractions, I did not have time to analyze it promptly.

Design

For the fol­lowup, I wanted to fix a few of the issues and explore some mod­er­a­tors:

  1. the ran­dom­iza­tion would not repeat the embar­rass­ing time­zone mis­take from the first time

  2. to force more tran­si­tions, there would be 2-day peri­ods ran­dom­ized Lat­in-square-style in weeks

  3. to examine the possibility that the problem is not ads per se but JS-heavy animated ads like Google AdSense’s (despite the effort Google invests in performance optimization), with their consequent browser performance impact, or that personalized ads are the problem, I would not use Google AdSense but rather a single, static, fixed ad which was nothing but a small lossily-optimized PNG and which is at least plausibly relevant to Gwern.net readers

    After some cast­ing about, I set­tled on a tech recruit­ing ban­ner ad from Triple­byte which I copied from Google Images & edited light­ly, bor­row­ing an affil­i­ate link from SSC (who has Triple­byte ads of his own & who I would be pleased to see get some rev­enue). The final PNG weighs 12kb. If there is any neg­a­tive effect of ads from this, it is not from ‘per­for­mance impact’! (Espe­cially in light of the repli­ca­tion exper­i­ments, I am skep­ti­cal that the per­for­mance bur­den of ads is all that impor­tan­t.)

As before, it is imple­mented as pre-gen­er­ated ran­dom­ness in a JS array with ads hid­den by default and then enabled by an inline JS script based on the ran­dom­ness.

Implementation

Gen­er­ated ran­dom­ness:

library(gtools)
m <- permutations(2, 7, 0:1, repeats=TRUE)
m <- m[sample(1:nrow(m), size=51),]; c(m)
#   [1] 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 1 1 1 1
#  [81] 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1
# [161] 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0
# [241] 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0
# [321] 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1

default.css:

/* hide the ad in the ad A/B test by default */
div#ads { display: block; text-align: center; display: none; }

Full HTML+JS in default.html:

<div id="ads"><a href="https://triplebyte.com/a/Lpa4wbK/d"><img alt="Banner ad for the tech recruiting company Triplebyte: 'Triplebyte is building a background-blind screening process for hiring software engineers'" width="500" height="128" src="/static/img/ads-triplebyte-banner.png"></a></div>

<header>
...
<!-- A/B test of ad effects on site traffic, static ad version: randomize blocks of 7-days based on day-of-year pre-generated randomness -->
<script id="adsABTestJS">
// create Date object for current location
var d = new Date();
// convert to msec
// add local time zone offset
// get UTC time in msec
var utc = d.getTime() + (d.getTimezoneOffset() * 60000);
// create new Date object for different timezone, EST/UTC-4
var now = new Date(utc + (3600000*-4));

  start = new Date("2018-09-28"); var diff = now - start;
  var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
  randomness = [1,0,1,1,1,0,0,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,1,1,0,0,0,0,0,
    1,1,0,0,1,1,1,0,1,0,1,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1,1,0,1,0,1,1,0,0,0,
    0,0,1,1,0,1,0,0,1,1,0,1,1,1,1,1,0,1,0,0,1,0,1,1,1,1,0,1,1,1,1,1,1,0,1,
    1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,0,1,0,0,1,1,1,0,1,0,0,1,
    0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,0,1,0,0,0,
    1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,1,1,1,0,0,0,1,0,0,1,0,1,1,1,1,1,1,0,0,
    0,0,0,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,0,0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,1,
    0,1,1,0,1,0,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,0,1,1,1,
    0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,
    0,0,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,1,0,0,1,
    0,0,1,1,0,0,1,0,0,0,0,1,1];
  if (randomness[day]) {
    document.getElementById("ads").style.display = "block";
    document.getElementById("ads").style.visibility = "visible";
  }
</script>

For the sec­ond run, fresh ran­dom­ness was gen­er­ated the same way:

      start = new Date("2019-04-13"); var diff = now - start;
      var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
      randomness = [0,1,0,0,0,1,1,1,0,0,0,0,1,1,1,0,1,1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,1,0,
                    0,0,0,1,1,0,0,0,1,0,1,0,0,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,1,
                    1,0,0,0,1,1,1,1,1,1,0,1,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0,1,1,0,1,1,1,1,1,
                    1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,0,
                    1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1,
                    1,1,1,1,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,1,1,
                    1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,1,1,0,1,
                    0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,0,
                    1,1,0,1,0,1,0,0,1,1,0,1,1,0,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,1,0,0,
                    0,1,1,1,0,1,0,0,0,1,0,1,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,0,0,1,1,1,1,0,
                    1,0,1,1,1,0,1];
      if (randomness[day]) {
        document.getElementById("ads").style.display = "block";
        document.getElementById("ads").style.visibility = "visible";
      }

The result:

Screenshot of the banner ad’s appearance in September 2018 when the 2nd ad A/B test began

Analysis

An interim analysis of the first 6 months was defeated by, among other things, surges in site traffic which caused the SD of daily pageviews to increase by >5.3× (!); this destroyed the statistical power, rendering results uninformative, and made it hard to compare to the first A/B test (some of the brms analyses produced CIs of effects on site traffic ranging from +18% to −14%!). The Stan model (using a merged dataset of both ad types, treating them as independent variables, ie. fixed effects) estimated an effect of −0.04%.

Inasmuch as I still want a solid result and the best model suggests that the harm from continued A/B testing of the static banner ad is tiny, I decided to extend the second A/B test for another roughly 6 months, starting on 2019-04-13, implemented the same way (but with new randomness). I eventually halted the experiment with the last full day of 2019-12-21, figuring that an additional 253 days of randomization ought to be enough, that I wanted to clear the decks for an A/B test of different colors for the new ‘dark mode’ (to test the hypothesis that green or blue should be the optimal color to contrast with a black background), and, after looking into Stan some more, that dealing with the heterogeneity should be possible with a custom Stan model if brms is unable to handle it.
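One possible brms-level approach to that heterogeneity (a sketch only, with a hypothetical merged dataset traffic2 and an AdType indicator column; I have not validated it) is distributional regression with a Student-t likelihood, letting the residual SD itself vary over time so that the traffic surges do not swamp the ad effect:

library(brms)
## sketch: heavier-tailed likelihood for spikes + a time-varying residual SD;
## 'traffic2' & 'AdType' are hypothetical dataset/columns from merging both experiments
b.het <- brm(bf(Pageviews ~ Ads.percent * AdType,
                sigma ~ s(as.numeric(Date))),
             family = student(),
             chains = 8, data = traffic2)
summary(b.het)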

TODO

Appendices

Stan issues

Obser­va­tions I made while try­ing to develop the Gwern.net traffic ARIMA model in Stan, in decreas­ing order of impor­tance:

  • Stan’s treat­ment of mix­ture mod­els and dis­crete vari­ables is… not good. I like mix­ture mod­els & tend to think in terms of them and latent vari­ables a lot, which makes the neglect an issue for me. This was par­tic­u­larly vex­ing in my ini­tial mod­el­ing where I tried to allow for traffic spikes from HN etc by hav­ing a mix­ture mod­el, with one com­po­nent for ‘reg­u­lar’ traffic and one com­po­nent for traffic surges. This is rel­a­tively straight­for­ward in JAGS as one defines a cat­e­gor­i­cal vari­able and indexes into it, but it is a night­mare in Stan, requir­ing a bizarre hack.

    I defy any Stan user to look at the exam­ple mix­ture model in the man­ual and tell me that they nat­u­rally and eas­ily under­stand the target/temperature stuff as a way of imple­ment­ing a mix­ture mod­el. I sure did­n’t. And once I did get it imple­ment­ed, I could­n’t mod­ify it at all. And it was slow, too, erod­ing the orig­i­nal per­for­mance advan­tage over JAGS. I was saved only by the fact that the A/B test period hap­pened to not include many spikes and so I could sim­ply drop the mix­ture aspect from the model entire­ly.

  • mysterious segfaults and errors under a variety of conditions: once when my cat walked over my keyboard, and frequently when running multi-core Stan in a loop. The latter was a serious issue for me when running a permutation test with 5000 iterates: when I ran Stan on 8 chains in parallel as usual (hence 1/8th the samples per chain) in a for-loop (the simplest way to implement the permutation test), it would occasionally segfault and take down R. I was forced to reduce the chains to 1 before it stopped crashing, making it 8 times slower (unless I wished to add manual parallel processing, running 8 separate Stans).

  • Stan’s support for posterior predictives is poor. The manual tells one to use a different module/scope, generated_quantities, lest the code be slow, which apparently requires one to copy-paste the entire likelihood section! This is especially unfortunate when doing a time-series and requiring access to arrays/vectors declared in a different scope; I never did figure out how to generate posterior predictions ‘correctly’ for that reason, and resorted to the usual Bugs/JAGS-like method (which thankfully does work). A sketch of what the generated-quantities route would apparently require is given after this list.

  • Stan’s treat­ment of miss­ing data is also unin­spir­ing and makes me wor­ried about more com­plex analy­ses where I am not so for­tu­nate as to have per­fectly clean com­plete datasets

  • Stan’s syn­tax is ter­ri­ble, par­tic­u­larly the entirely unnec­es­sary semi­colons. It is 2017, I should not be spend­ing my time adding use­less end-of-line mark­ers. If they are nec­es­sary for C++, they can be added by Stan itself. This was par­tic­u­larly infu­ri­at­ing when painfully edit­ing a model try­ing to imple­ment var­i­ous para­me­ter­i­za­tions and rerun­ning only to find that I had for­got­ten a semi­colon (as no lan­guage I use reg­u­lar­ly—R, Haskell, shell, Python, or Bugs/JAGS—insists on them!).
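For reference, here is a minimal sketch (not code I actually ran) of what the generated-quantities route would apparently require for the ARIMA(4,0,1) model above: the mu/err recurrence from the model block has to be duplicated, since model-block local variables are not visible here:

generated quantities {
    real y_pred[T];
    {
        real mu[T];
        vector[T] err;
        // re-initialize the first 4 days exactly as in the model block:
        mu[1] = 0; mu[2] = 0; mu[3] = 0; mu[4] = 0;
        err[1] = y[1] - mu[1]; err[2] = y[2] - mu[2];
        err[3] = y[3] - mu[3]; err[4] = y[4] - mu[4];
        y_pred[1] = y[1]; y_pred[2] = y[2]; y_pred[3] = y[3]; y_pred[4] = y[4];
        for (t in 5:T) {
            // duplicated likelihood recurrence:
            mu[t]  = muM + phi1*y[t-1] + phi2*y[t-2] + phi3*y[t-3] + phi4*y[t-4] + ma*err[t-1];
            err[t] = y[t] - mu[t];
            y_pred[t] = normal_rng(mu[t], sigma); // counterfactual day with 0% ad exposure
        }
    }
}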

Stan: mixture time-series

An attempt at an ARIMA(4,0,1) time-series mixture model implemented in Stan, where the mixture has two components: one component for normal traffic, where daily traffic is ~1000, making up >90% of daily data, and one component for the occasional traffic spike, around 10× larger but happening rarely:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> K; // number of mixture components
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // traffic
        int<lower=0, upper=1> Ads[T]; // Ad randomization
    }
    parameters {
        simplex[K] theta; // mixing proportions
        positive_ordered[K] muM; // locations of mixture components; since no points are labeled,
        // like in JAGS, we add a constraint to force an ordering, make it identifiable, and
        // avoid label switching, which will totally screw with the posterior samples
        real<lower=0.01, upper=500> sigmaM[K]; // scales of mixture components
        real<lower=0.01, upper=5>    nuM[K];

        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff

        real<upper=0> ads; // advertising coeff; can only be negative
    }
    model {

        real mu[T, K]; // prediction for time t
        vector[T] err; // error for time t
        real ps[K]; // temp for log component densities
        // initialize the first 4 days for the lags
        mu[1][1] = 0; // assume err[0] == 0
        mu[2][1] = 0;
        mu[3][1] = 0;
        mu[4][1] = 0;
        err[1] = y[1] - mu[1][1];
        err[2] = y[2] - mu[2][1];
        err[3] = y[3] - mu[3][1];
        err[4] = y[4] - mu[4][1];


        muM ~ normal(0, 5);
        sigmaM ~ cauchy(0, 2);
        nuM ~ exponential(1);
        ma ~ normal(0, 0.5);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);
        ads  ~ normal(0,1);

        for (t in 5:T) {
            for (k in 1:K) {
                mu[t][k] = ads*Ads[t] + muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
                err[t] = y[t] - mu[t][k];

                ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
            }
        target += log_sum_exp(ps);
        }
    }"

# find posterior mode via L-BFGS gradient descent optimization; this can be a good set of initializations for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2), hessian=TRUE)
round(optimized$par, digits=3)
#  theta[1]  theta[2]    muM[1]    muM[2] sigmaM[1] sigmaM[2]    nuM[1]    nuM[2]      phi1      phi2      phi3      phi4        ma
#     0.001     0.999     0.371     2.000     0.648   152.764     0.029     2.031     1.212    -0.345    -0.002     0.119    -0.604
#       ads
#    -0.009

## optimized:
inits <- list(theta=c(0.001, 0.999), muM=c(0.37, 2), sigmaM=c(0.648, 152), nuM=c(0.029, 2), phi1=1.21, phi2=-0.345, phi3=-0.002, phi4=0.119, ma=-0.6, ads=-0.009)
## MCMC means:
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, control = list(max_treedepth = 15, adapt_delta = 0.95),
    iter=20000); print(model)
traceplot(model, pars=names(inits))

This code winds up failing due to the label-switching issue (ie. the MCMC bouncing between estimates of what each mixture component is, because of symmetry or lack of data), despite using some of the suggested fixes in the Stan model, like the ordering trick. Since there were so few spikes in 2017, the mixture model can’t converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing the 2017 data and I can simply model the outcome as a normal.

EVSI

Demo code of sim­ple Expected Value of Sam­ple Infor­ma­tion (EVSI) in a JAGS log-Pois­son model of traffic (which turns out to be infe­rior to a nor­mal dis­tri­b­u­tion for 2017 traffic data but I keep here for his­tor­i­cal pur­pos­es).

We con­sider an exper­i­ment resem­bling his­tor­i­cal data with a 5% traffic decrease due to ads; the reduc­tion is mod­eled and implies a cer­tain util­ity loss given my rel­a­tive pref­er­ences for traffic vs the adver­tis­ing rev­enue, and then the remain­ing uncer­tainty in the reduc­tion esti­mate can be queried for how likely it is that the deci­sion is wrong and that col­lect­ing fur­ther data would then change a wrong deci­sion to a right one:

## simulate a plausible effect superimposed on the actual data:
ads[ads$Ads==1,]$Hits <- round(ads[ads$Ads==1,]$Hits * 0.95)

require(rjags)
y <- ads$Hits
x <- ads$Ads
model_string <- "model {
  for (i in 1:length(y)) {
   y[i] ~ dpois(lambda[i])
   log(lambda[i]) <- alpha0 - alpha1 * x[i]

  }
  alpha0 ~ dunif(0,10)
  alpha1 ~ dgamma(50, 6)
}"
model <- jags.model(textConnection(model_string), data = list(x = x, y = y),
                    n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("alpha0", "alpha1"), n.iter=10000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
#    plus standard error of the mean:
#
#              Mean          SD     Naive SE Time-series SE
# alpha0 6.98054476 0.003205046 1.133155e-05   2.123554e-05
# alpha1 0.06470139 0.005319866 1.880857e-05   3.490445e-05
#
# 2. Quantiles for each variable:
#
#              2.5%        25%        50%        75%      97.5%
# alpha0 6.97426621 6.97836982 6.98055144 6.98273011 6.98677827
# alpha1 0.05430508 0.06110893 0.06469162 0.06828215 0.07518853
alpha0 <- samples[[1]][,1]; alpha1 <- samples[[1]][,2]
posteriorTrafficReduction <- exp(alpha0) - exp(alpha0-alpha1)

generalLoss <- function(annualAdRevenue, trafficLoss,  hitValue, discountRate) {
  (annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
loss <- function(tr) { generalLoss(360, tr, 0.02, 0.05) }
posteriorLoss <- sapply(posteriorTrafficReduction, loss)
summary(posteriorLoss)
#       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
# -5743.5690 -3267.4390 -2719.6300 -2715.3870 -2165.6350   317.7016

Expected loss of turn­ing on ads: -$2715. Cur­rent deci­sion: keep ads off to avoid that loss. The expected aver­age gain in the case where the cor­rect deci­sion is turn­ing ads on:

mean(ifelse(posteriorLoss>0, posteriorLoss, 0))
# [1] 0.06868814833

so EVPI is $0.07. This does­n’t pay for any addi­tional days of sam­pling, so there’s no need to cal­cu­late an exact EVSI.
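For completeness, here is a sketch (a hypothetical helper reusing model_string, loss, x, & y from above, not code that was run) of how an exact EVSI could be estimated by simulation if the EVPI had been non-negligible: draw a plausible ‘true’ effect from the current posterior, simulate extra days of data under it, refit, and average the gain from any decision reversals:

## sketch of EVSI by simulation: how much would 'extraDays' more days of randomized data
## be expected to improve the ads-on/ads-off decision?
evsiSim <- function(extraDays=30, iters=100) {
    gains <- numeric(iters)
    for (i in 1:iters) {
        j  <- sample(length(alpha0), 1)
        a0 <- alpha0[j]; a1 <- alpha1[j]              # one plausible 'truth' from the posterior
        xNew <- rbinom(extraDays, 1, 0.5)             # randomize the hypothetical extra days
        yNew <- rpois(extraDays, exp(a0 - a1*xNew))   # simulate their traffic
        modelNew <- jags.model(textConnection(model_string),
                               data=list(x=c(x, xNew), y=c(y, yNew)), n.chains=1, quiet=TRUE)
        s <- coda.samples(modelNew, c("alpha0", "alpha1"), n.iter=2000)
        postLoss <- sapply(exp(s[[1]][,1]) - exp(s[[1]][,1] - s[[1]][,2]), loss)
        ## the current best decision (ads off) has expected payoff 0, so the gain from the new
        ## data is whatever positive expected payoff the updated posterior would now support:
        gains[i] <- max(mean(postLoss), 0)
    }
    mean(gains)   # EVSI for extraDays more days, before subtracting the cost of those days
}
evsiSim(extraDays=30)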


  1. I am unhappy about the uncached JS that Dis­qus loads & how long it takes to set itself up while spew­ing warn­ings in the browser con­sole, but at the moment, I don’t know of any other sta­tic site com­ment­ing sys­tem which has good anti-s­pam capa­bil­i­ties or an equiv­a­lent user base, and Dis­qus has worked for 5+ years.↩︎

  2. This is especially an issue with A/B testing as usually practiced with NHST & arbitrary alpha thresholds, which poses a cumulative “death by a thousand cuts” problem: one could steadily degrade one’s website by repeatedly making bad changes which don’t appear harmful in small-scale experiments (“no user harm, p > 0.05, and increased revenue, p < 0.05”, “no harm, increased revenue”, “no harm, increased revenue”, etc), yet the probability of a net harm goes up.

    One might call this the Schlitz effect (“How Milwaukee’s Famous Beer Became Infamous: The Fall of Schlitz”), after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. (It is also called the “fast-food fallacy” after Gerald M. Weinberg’s discussion of a hypothetical example in The Secrets of Consulting: A Guide to Giving and Getting Advice Successfully, pg254, “Controlling Small Changes”, where he notes: “No difference plus no difference plus no difference plus … eventually equals a clear difference.”) Another example is the now-infamous “Red Delicious” apple: widely considered one of the worst-tasting apples commonly sold, it was reportedly an excellent-tasting apple when first discovered in 1880, winning contests for its flavor; but its flavor worsened rapidly over the 20th century, a decline blamed on apple growers gradually switching to ever-redder strains which looked better in grocery stores, a decline which ultimately culminated in the near-collapse of the Red-Delicious-centric Washington State apple industry when consumer backlash finally began in the 1980s with the availability of tastier apples. The more complex a system, the worse the “death by a thousand cuts” can be: in a 2003 email from Bill Gates, he lists (at least) 25 distinct problems he encountered trying (and failing) to install & use the program.

    This death-by-degrees can be countered by a few things, such as either testing regularly against a historical baseline to establish the total cumulative degradation, or carefully tuning false-positive/false-negative thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the p-threshold made less stringent for detecting harm).

    In addition, one must avoid a bias towards testing only changes which cheapen or degrade a product, which becomes “sampling to a foregone conclusion” (imagine a product is at a point where the profit gradient of quality is positive—increasing quality would increase profits—but experiments are conducted only on various ways of reducing quality; even if thresholds are set correctly, false-positives must nevertheless occur once in a while, and thus over the long run quality & profits inevitably decrease). A rational profit-maximizer should remember that increases in quality can be profitable too.↩︎

  3. Why ‘block’ instead of, say, just randomizing 5-days at a time (“simple randomization”)? If we did that, we would occasionally do something like spend an entire month in one condition without switching, simply by flipping a 0 five or six times in a row; since traffic can be expected to drift and change and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance, thus shrinking the effect size, thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had n = 15 for advertising and only n = 5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition). Not blocking doesn’t bias our analysis—we still get the right answers eventually—but it could be costly. Whereas if we block pairs of 2-days ([00,11] vs [11,00]), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, it’ll probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age- and sex-matched participants, to eliminate unnecessary noise.
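
    As a small illustrative sketch (numbers arbitrary) of the contrast between paired 2-day blocks and simple randomization of 2-day units over the experiment’s 288 days:

    set.seed(2017)
    ## paired blocks: each 4-day unit is randomly [ads,ads,none,none] or [none,none,ads,ads]
    blockPair <- function() { if (rbinom(1, 1, 0.5) == 1) c(1,1,0,0) else c(0,0,1,1) }
    blocked   <- as.vector(replicate(72, blockPair()))   # 72 blocks x 4 days = 288 days
    table(blocked)                # always exactly 144 days per condition
    max(rle(blocked)$lengths)     # longest run in one condition is at most 4 days
    ## simple randomization of 2-day units: can drift out of balance & produce long runs
    simple <- rep(rbinom(144, 1, 0.5), each=2)
    table(simple)                 # often noticeably unbalanced between conditions
    max(rle(simple)$lengths)      # runs of 10+ consecutive days in one condition are common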

    The gains from proper choice of experimental unit & blocking can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have let n be a small fraction of the size. Thus, when possible, I block my experiments at least temporally.↩︎

  4. After pub­lish­ing ini­tial results, Chris Stuc­chio com­mented on Twit­ter: “Most of the work on this stuff is pro­pri­etary. I ran such an exper­i­ment for a large con­tent site which gen­er­ated direc­tion­ally sim­i­lar results. I helped another major con­tent site set up a sim­i­lar test, but they did­n’t tell me the result­s…as well as smaller effects from ads fur­ther down the page (e.g.). Huge sam­ple size, very clear effects.” David Kitchen, a hob­by­ist oper­a­tor of sev­eral hun­dred forums, claims that remov­ing ban­ner ads boosted all his met­rics (ad­mit­ted­ly, at the cost of total rev­enue), but unclear if this used ran­dom­iza­tions or just a before-after com­par­i­son. I have been told in pri­vate by ad indus­try peo­ple that they have seen sim­i­lar results but either assumed every­one already knew all this, or were unsure how gen­er­al­iz­able the results were. And I know of one well-known tech web­site which tested this ques­tion after see­ing my analy­sis, and found a remark­ably sim­i­lar result.

    This raises serious questions about publication bias and the “file drawer” of ad experiments: if in fact these sorts of experiments are being run all the time by many companies, the published papers could easily be systematically different from the unpublished ones—given the commercial incentives, should we assume that the harms of advertising are even greater than implied by published results?↩︎

  5. An ear­lier ver­sion of the exper­i­ment reported in McCoy et al 2007 is McCoy et al 2004; McCoy et al 2008 appears to be a smaller fol­lowup doing more sophis­ti­cated struc­tural equa­tion mod­el­ing of the var­i­ous scales used to quan­tify ad effects/perception.↩︎

  6. YouTube no longer exposes a Likert scale but only binary up/down-votes, so Kerkhof 2019 uses the fraction of likes out of total votes as the quality measure. More specifically:

    9.1 Likes and dislikes: First, I use a video’s number of likes and dislikes to measure its quality. To this end, I normalize the number of likes of video v by YouTuber i in month t by its sum of likes and dislikes: likes_vit / (likes_vit + dislikes_vit). Though straightforward to interpret, this measure reflects the viewers’ general satisfaction with a video, which is determined by its quality and the viewers’ ad aversion. Thus, even if an increase in the feasible number of ad breaks led to an increase in video quality, a video’s fraction of likes could decrease if the viewers’ additional ad nuisance costs prevail.

    I replace the dependent variable Popularvit in equation (2) with this like fraction and estimate equations (2) and (3) by 2SLS. Table 16 shows the results. Again, the potentially biased OLS estimates of equation (2) in columns 1 to 3 are close to zero and not statistically-significant[, while the 2SLS estimates are statistically-significant] at the 1% level: an increase in the feasible number of ad breaks leads to a 4 percentage point reduction in the fraction of likes. The effect size corresponds to around 25% of a standard deviation in the dependent variable and to 4.4% of its baseline value 0.91. The reduced form estimates in columns 7 to 9 are in line with these results. Note that I lose 77,066 videos that have not received any likes or dislikes. The results in Table 16 illustrate that viewer satisfaction has gone down. It is, however, unclear if the effect is driven by a decrease in video quality or by the viewers’ irritation from additional ad breaks. See Appendix A.8 for validity checks.

    ↩︎
  7. Kerk­hof 2019:

    There are two poten­tial expla­na­tions for the differ­ences to Sec­tion 9.1. First, video qual­ity may enhance, whereby more (re­peat­ed) view­ers are attract­ed. At the same time, how­ev­er, view­ers express their dis­sat­is­fac­tion with the addi­tional breaks by dis­lik­ing the video. Sec­ond, there could be algo­rith­mic con­found­ing of the data (Sal­ganik, 2017, Ch.3). YouTube, too, earns a frac­tion of the YouTu­bers’ ad rev­enue. Thus, the plat­form has an incen­tive to treat videos with many ad breaks favor­ably, for instance, through its rank­ing algo­rithm. In this case, the num­ber of views was not infor­ma­tive about a video’s qual­i­ty, but only about an algo­rith­mic advan­tage. See Appen­dix A.8 for valid­ity check­s…Table 5 shows the results. The size of the esti­mates for δ′′(­columns 1 to 3), though sta­tis­ti­cally sig­nifi­cant at the 1%-level, is neg­li­gi­ble: a one sec­ond increase in video dura­tion cor­re­sponds to a 0.0001 per­cent­age point increase in the frac­tion of likes. The esti­mates for δ′′′ in columns 4 to 6, though, are rel­a­tively large and sta­tis­ti­cally sig­nifi­cant at the 1%-level, too. Accord­ing to these esti­mates, one fur­ther sec­ond in video dura­tion leads on aver­age to about 1.5 per­cent more views. These esti­mates may reflect the algo­rith­mic drift dis­cussed in Sec­tion 9.2. YouTube wants to keep its view­ers as long as pos­si­ble on the plat­form to show as many ads as pos­si­ble to them. As a result, longer videos get higher rank­ings and are watched more often.

    …Second, I cannot evaluate the effect of advertising on welfare, because I lack measures for consumer and producer surplus. Although I demonstrate that advertising leads to more content differentiation—which is likely to raise consumer surplus (Brynjolfsson et al., 2003)—the viewers must also pay an increased ad “price”, which works into the opposite direction. Since I obtain no estimates for the viewers’ ad aversion, my setup does not answer which effect overweights. On the producer side, I remain agnostic about the effect of advertising on the surplus of YouTube itself, the YouTubers, and the advertisers. YouTube as a platform is likely to benefit from advertising, though. Advertising leads to more content differentiation, which attracts more viewers; more viewers, in turn, generate more ad revenue. Likewise, the YouTubers’ surplus benefits from an increase in ad revenue; it is, however, unclear how their utility from covering different topics than before is affected. Finally, the advertisers’ surplus may go up or down. On the one hand, a higher ad quantity makes it more likely that potential customers click on their ads and buy their products. On the other hand, the advertisers cannot influence where exactly their ads appear, whereby it is unclear how well the audience is targeted. Hence, it is possible that the additional costs of advertising surmount the additional revenues.

    ↩︎
  8. Examples of ads I saw included online-university ads (typically for master’s degrees, for some reason) on my pages. They looked about what one would expect: generically glossy and clean. It is difficult to imagine anyone being offended by them.↩︎

  9. One might rea­son­ably assume that Ama­zon’s ultra­-cramped-yet-un­in­for­ma­tive site design was the result of exten­sive A/B test­ing and is, as much as one would like to believe oth­er­wise, opti­mal for rev­enue. How­ev­er, accord­ing to ex-A­ma­zon engi­neer Steve Yegge, Ama­zon is well aware their web­site looks awful—but sim­ply refuses to change it:

    Jeff Bezos is an infamous micro-manager. He micro-manages every single pixel of Amazon’s retail site. He hired Larry Tesler, Apple’s Chief Scientist and probably the very most famous and respected human-computer interaction expert in the entire world, and then ignored every goddamn thing Larry said for three years until Larry finally—wisely—left the company [2001–2005]. Larry would do these big usability studies and demonstrate beyond any shred of doubt that nobody can understand that frigging website, but Bezos just couldn’t let go of those pixels, all those millions of semantics-packed pixels on the landing page. They were like millions of his own precious children. So they’re all still there, and Larry is not…The guy is a regular… well, Steve Jobs, I guess. Except without the fashion or design sense. Bezos is super smart; don’t get me wrong. He just makes ordinary control freaks look like stoned hippies.

    ↩︎
  10. Several of the Internet giants like Google have reported measurable harms from delays as small as 100ms; effects of delays/latency have often been measured elsewhere as well (eg at The Telegraph).↩︎

  11. Bing offers a cautionary example for search-engine optimizers about the need to examine global effects inclusive of attrition (like Hohnhold et al make sure to do); since a search engine is a tool for finding things, and users may click on ads only when unsatisfied, worse search engine results may increase queries/ad clicks:

    When Bing had a bug in an exper­i­ment, which resulted in very poor results being shown to users, two key orga­ni­za­tional met­rics improved sig­nifi­cant­ly: dis­tinct queries per user went up over 10%, and rev­enue per user went up over 30%! How should Bing eval­u­ate exper­i­ments? What is the Over­all Eval­u­a­tion Cri­te­ri­on? Clearly these long-term goals do not align with short­-term mea­sure­ments in exper­i­ments. If they did, we would inten­tion­ally degrade qual­ity to raise query share and rev­enue!

    Expla­na­tion: From a search engine per­spec­tive, degraded algo­rith­mic results (the main search engine results shown to users, some­times referred to as the 10 blue links) force peo­ple to issue more queries (in­creas­ing queries per user) and click more on ads (in­creas­ing rev­enues). How­ev­er, these are clearly short­-term improve­ments, sim­i­lar to rais­ing prices at a retail store: you can increase short­-term rev­enues, but cus­tomers will pre­fer the com­pe­ti­tion over time, so the aver­age cus­tomer life­time value will decline

    That is, increases in ‘revenue per user’ are not necessarily either increases in total revenue per user or increases in total revenue, period (because the user will be more likely to attrit to a better search engine).↩︎

  12. However, the estimate, despite using scores of variables for the matching to attempt to construct accurate controls, is almost certainly inflated by residual confounding; note that in Facebook’s own study, which tested propensity scoring’s ability to predict the results of randomized Facebook ad experiments, it required thousands of variables before propensity scoring could recover the true causal effect.↩︎

  13. I found Yan et al 2019 confusing, so to explain it a little further. The key graph is Figure 3, where “UU” means “unique users”, which apparently in context means a LinkedIn user dichotomized by whether they ever click on something in their feed or ignore it entirely during the 3-month window; “feed interaction counts” are then the number of feed clicks for those with non-zero clicks during the 3-month window.

    The 1-ad-every-9-items condition’s “unique users” gradually climb through the 3 months, approaching ~0.75% more users interacting with the feed than under the baseline 1-every-6, while the 1-every-3 condition decreases by −0.5%.

    So if LinkedIn had 1-every-3 (33%) as the baseline and it moved to 1-every-6 (17%) then to 1-every-9 (11%), then the number of users would increase ~1.25%. The elasticity here is unclear, since 1-every-3 vs 6 represents a larger absolute increase of ads than 6 vs 9, but the latter seems to have a larger response; in any case, if moving from 33% ads to 11% ads increases usage by 1.25% over 3 months, then that suggests that eliminating ads entirely (going from 11% to 0%) would yield another percentage point or two per 3 months; or if we considered 17% vs 11%, which is 8%, and adjust, that suggests 1.03%. A quarterly increase of 1.03%–1.25% users is an annualized 4–5%, which is not trivial.

    Within the users using the feed at all, the num­ber of daily inter­ac­tions for 1-ev­ery-9 vs 1-ev­ery-6 increases by 1.5%, and decreases −1% for 1-ev­ery-3 users, which is a stronger effect than for usage. By sim­i­lar hand­wav­ing, that sug­gests a pos­si­ble annu­al­ized increase in daily feed inter­ac­tions of ~8.5% (on an unknown base but I would guess some­where around 1 inter­ac­tion a day).

    These two effects should be mul­ti­plica­tive: if there are more feed users and each feed user is using the feed more, the total num­ber of feed uses (which can be thought of as the total LinkedIn “audi­ence” or “read­er­ship”) will be larger still; 4% by 8% is >12%.
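
    A sketch of that back-of-the-envelope compounding (the quarterly percentages are the rough extrapolations above, not LinkedIn’s reported figures):

    annualize <- function(quarterly) { (1 + quarterly)^4 - 1 }
    annualize(c(0.0103, 0.0125))   # ~4.2%-5.1% more feed users per year
    annualize(0.02)                # ~8.2% more interactions per feed user per year
                                   # (roughly the ~8.5% figure above, for a ~2% quarterly gain)
    1.04 * 1.08 - 1                # combined multiplicative effect: ~12.3%, ie. ">12%"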

    That esti­mate is remark­ably con­sis­tent with the esti­mate I made based on my first A/B exper­i­ment of the hypo­thet­i­cal effect of Gwern.net going from 100% to 0% ads. (It was hypo­thet­i­cal because many read­ers have adblock on and can’t be exposed to ads; how­ev­er, LinkedIn mobile app users pre­sum­ably have no such recourse and have ~100% expo­sure to how­ever many ads LinkedIn chooses to show them.)↩︎

  14. McCoy et al 2007, pg3:

    Although that difference is statistically-significant, its magnitude might not appear on the surface to be very important. On the contrary, recent reports show that a site such as Amazon enjoys about 48 million unique visitors per month. A drop of 11% would represent more than 5 million fewer visitors per month. Such a drop would be interpreted as a serious problem requiring decisive action, and the ROI of the advertisement could actually become negative. Assuming a uniform distribution of buyers among the browsers before and after the drop, the revenue produced by an ad might not make up for the 11% shortfall in sales.

    ↩︎
  15. Sample size is not discussed other than to note that it was set according to standard Google power analysis tools described in Tang et al 2010, which likewise omits concrete numbers; because of the use of shared control groups and cookie-level tracking and A/A tests to do distribution-free estimates of true standard errors/sampling distributions, it is impossible to estimate what n their tool would’ve estimated necessary for standard power like 80% for changes as small as 0.1% (“we generally care about small changes: even a 0.1% change can be substantive.”) on the described experiments. However, I believe it would be safe to say that the _n_s for the k > 100 experiments described are at least in the millions each, as that is what is necessary elsewhere to detect similar effects, and Tang et al 2010 emphasizes that the “cookie-level” experiments—which is what Hohnhold et al 2015 uses—have many times larger standard errors than query-level experiments which assume independence. Hohnhold et al 2015 does imply that the total sample size for the Figure 6 CTR rate changes may be all Google Android users: “One of these launches was rolled out over ten weeks to 10% cohorts of traffic per week.” This would put the sample size at n≅500m given ~1b users.↩︎
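
    For a rough sense of scale, a conventional power calculation under independence, with an assumed baseline click-through rate of 2% (my assumption—neither paper gives a baseline), already implies enormous per-arm samples for a 0.1% relative change:

    ## approximate n per arm to detect a 0.1% relative change in a 2% CTR
    ## (normal approximation, two-sided alpha = 0.05, 80% power):
    p1 <- 0.02; p2 <- 0.02 * 1.001
    (qnorm(0.975) + qnorm(0.80))^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2
    ## ~7.7e8 per arm under independence; cookie-level standard errors would be
    ## larger still, so "at least in the millions" is a loose lower bound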

  16. How many hits does this cor­re­spond to? A quick­-and-dirty esti­mate sug­gests n < 535m.

    The median Alexa rank of the 2,574 sites in the PageFair sample is #210,000, with a median exposure of 16.7 weeks (means are not given); as of 2019-04-05, Gwern.net is at Alexa rank #112,300, and has 106,866 page-views in the previous 30 days; since the PageFair sample’s median Alexa rank is about twice as bad, a guesstimate adjustment is to halve that traffic: ~53k page-views per 30 days, or ~1,780 page-views/day (there are Alexa-rank-to-traffic formulas, but they are so wrong for Gwern.net that I don’t trust them to estimate other websites; given the long tail of traffic, I think halving would be an upper bound); then, treating the medians as means, the total measured page-views over the multi-year sample is 2,574 sites × 1,780 page-views/day × (16.7 weeks × 7 days) ≈ 535m. Considering the long time-periods and the use of the total traffic of thousands of websites, this seems sensible.↩︎
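
    The multiplication, as a quick check (treating the medians as means, per the above):

    2574 * 1780 * (16.7 * 7)
    ## ≈ 5.36e8 page-views, ie. the ~535m upper bound quoted above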

  17. No rela­tion­ship to Yan et al 2019.↩︎

  18. The PageFair/Adobe report actu­ally says “16% of the US online pop­u­la­tion blocked ads dur­ing Q2 2015.” which is slightly differ­ent; I assume that is because the ‘online pop­u­la­tion’ < total US pop­u­la­tion.↩︎

  19. This mixture model would have two distributions/components in it; there is apparently no point in trying to distinguish between more levels of virality, as k = 3 or more components does not fit the data well:

    library(flexmix)
    stepFlexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=1:10, nrep=20)
    ## errors out k>2
    fit <- flexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=2)
    summary(fit)
    #        prior size post>0 ratio
    # Comp.1 0.196  448    449 0.998
    # Comp.2 0.804 1841   1842 0.999
    #
    # 'log Lik.' -1073191.871 (df=3)
    # AIC: 2146389.743   BIC: 2146406.95
    summary(refit(fit))
    # $Comp.1
    #                  Estimate    Std. Error   z value   Pr(>|z|)
    # (Intercept) 8.64943378382 0.00079660747 10857.837 < 2.22e-16
    #
    # $Comp.2
    #                  Estimate    Std. Error   z value   Pr(>|z|)
    # (Intercept) 7.33703564817 0.00067256175 10909.088 < 2.22e-16
    ↩︎