Banner Ads Considered Harmful

9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.
experiments, statistics, decision-theory, R, JS, power-analysis, Bayes, Google, survey, insight-porn
2017-01-08–2020-12-12 · in progress · certainty: possible · importance: 5


One source of com­plex­ity & JavaScript use on Gw­ern.net is the use of Google Ad­Sense ad­ver­tis­ing to in­sert ban­ner ads. In con­sid­er­ing de­sign & us­abil­ity im­prove­ments, re­mov­ing the ban­ner ads comes up every time as a pos­si­bil­i­ty, as read­ers do not like ads, but such re­moval comes at a rev­enue loss and it’s un­clear whether the ben­e­fit out­weighs the cost, sug­gest­ing I run an A/B ex­per­i­ment. How­ev­er, ads might be ex­pected to have broader effects on traffic than in­di­vid­ual page read­ing times/bounce rates, affect­ing to­tal site traffic in­stead through long-term effects on or spillover mech­a­nisms be­tween read­ers (eg so­cial me­dia be­hav­ior), ren­der­ing the usual A/B test­ing method of per-page-load­/ses­sion ran­dom­iza­tion in­cor­rect; in­stead it would be bet­ter to an­a­lyze to­tal traffic as a time-series ex­per­i­ment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical Gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Cor­rect­ing for a flaw in the ran­dom­iza­tion, the fi­nal re­sults yield a sur­pris­ingly large es­ti­mate of an ex­pected traffic loss of −9.7% (driven by the sub­set of users with­out ad­block), with an im­plied −14% traffic loss if all traffic were ex­posed to ads (95% cred­i­ble in­ter­val: −13–16%), ex­ceed­ing my de­ci­sion thresh­old for dis­abling ads & strongly rul­ing out the pos­si­bil­ity of ac­cept­ably small losses which might jus­tify fur­ther ex­per­i­men­ta­tion.

Thus, ban­ner ads on Gw­ern.net ap­pear to be harm­ful and Ad­Sense has been re­moved. If these re­sults gen­er­al­ize to other blogs and per­sonal web­sites, an im­por­tant im­pli­ca­tion is that many web­sites may be harmed by their use of ban­ner ad ad­ver­tis­ing with­out re­al­iz­ing it.

One thing about Gwern.net I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why, in my previous testing of site design changes, I have generally focused on CSS changes which do not affect load times. Benchmarking website performance in 2017, I found the total time had become dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments.

While I want comments, so Disqus is not optional1, AdSense I keep only because, well, it makes me some money (~$30 a month or ~$360 a year; it would be more, but ~60% of visitors have adblock, which is apparently unusually high for the US). So ads are a good thing to do an experiment on: it offers a chance to remove one of the heaviest components of the page, an excuse to apply a decision-theoretic approach (calculating a decision threshold & expected losses), an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.

Modeling effects of advertising: global rather than local

This isn’t a huge amount (it is much less than my monthly Patreon) and might be offset by the effects on load/render time and by people not liking advertisements. If I am reducing my traffic & influence by 10% because people don’t want to browse or link pages with ads, then it’s definitely not worthwhile.

One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when tested individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.2 In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators). It is somewhat appreciated that A/B testing in social media or network settings is not as simple as randomizing individual users & running a t-test, as the users are not independent of each other (violating, among other things, the stable unit treatment value assumption). Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger, more-independent treatment units.

So my usual AB­a­lyt­ics setup is­n’t ap­pro­pri­ate here: I don’t want to ran­dom­ize in­di­vid­ual vis­i­tors & mea­sure time on page, I want to ran­dom­ize in­di­vid­ual days or weeks and mea­sure to­tal traffic, giv­ing a time-series re­gres­sion.

This could be randomized by uploading a different version of the site every day, but this is tedious, inefficient, and has technical issues: aggressive caching of my webpages means that many visitors may be seeing old versions of the site! With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like ‘day-of-year (1–366) modulo 2’, hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense, match them up with the randomization, and do a regression. By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn’t be one of the people seeing ads anyway. Since there might be spillover effects due to lags in propagating through social media & emails etc, daily randomization might be too fast, and 2-day blocks more appropriate, ensuring occasional runs up to a week or so to expose longer effects while still allocating equal total days to advertising/no-advertising.3

Implementation: In-browser Randomization of Banner Ads

Set­ting this up in JS turned out to be a lit­tle tricky since there is no built-in func­tion for get­ting day-of-year or for hash­ing num­ber­s/strings; so rather than spend an­other 10 lines copy­-past­ing some hash func­tions, I copied some day-of-year code and then sim­ply gen­er­ated in R 366 bi­nary vari­ables for ran­dom­iz­ing dou­ble-days and put them in a JS ar­ray for do­ing the ran­dom­iza­tion:

         <script src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" async></script>
         <!-- Medium Header -->
         <ins class="adsbygoogle"
              style="display:inline-block;width:468px;height:60px"
              data-ad-client="ca-pub-3962790353015211"
              data-ad-slot="2936413286"></ins>
+        <!-- A/B test of ad effects on site traffic: randomize 2-day blocks based on day-of-year &
+             pre-generated randomness; offset by 8 because the test started on 2017-01-08 -->
         <script>
-          (adsbygoogle = window.adsbygoogle || []).push({});
+          var now = new Date(); var start = new Date(now.getFullYear(), 0, 0); var diff = now - start;
+          var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
+          randomness = [1,0,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,1,1,
           1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,
           0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,
           0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,
           0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,1,
           1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,1,0,
           0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,
           1,1,1,0,0,1,1,0,0,0,0,1,1,0,0];
+
+          if (randomness[day - 8]) {
+              (adsbygoogle = window.adsbygoogle || []).push({});
+          }
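The pre-generated randomness array is just 183 coin-flips expanded into 2-day blocks; a minimal R sketch of how such a vector can be generated (the seed here is arbitrary and hypothetical, as the call actually used was not recorded):

    set.seed(2017)                          # hypothetical seed; the original was not recorded
    blocks     <- rbinom(183, 1, 0.5)       # one fair coin-flip per 2-day block
    randomness <- rep(blocks, each = 2)     # expand to 366 daily on/off indicators
    cat(paste(randomness, collapse = ","))  # paste into the JS array literal above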

While sim­ple, sta­t­ic, and cache-com­pat­i­ble, a few months in I dis­cov­ered that I had per­haps been a lit­tle too clev­er: check­ing my Ad­Sense re­ports on a whim, I no­ticed that the re­ported daily “im­pres­sions” was ris­ing and falling in roughly the 2-day chunks ex­pect­ed, but it was never falling all the way to 0 im­pres­sions, in­stead, per­haps to a tenth of the usual num­ber of im­pres­sions. This was odd be­cause how would any browsers be dis­play­ing ads on the wrong days given that the JS runs be­fore any ads code, and any browser not run­ning JS would, ipso facto, never be run­ning Ad­Sense any­way? Then it hit me: whose date is the ran­dom­iza­tion based on? The browser’s, of course, which is not mine if it’s run­ning in a differ­ent time­zone. Pre­sum­ably browsers across a date­line would be ran­dom­ized into ‘on’ on the ‘same day’ as oth­ers are be­ing ran­dom­ized into ‘off’. What I should have done was some sort of time­zone in­de­pen­dent date con­di­tion­al. Un­for­tu­nate­ly, it was a lit­tle late to mod­ify the code.

This im­plies that the sim­ple bi­nary ran­dom­iza­tion test is not good as it will be sub­stan­tially bi­ased to­wards ze­ro/at­ten­u­ated by the mea­sure­ment er­ror inas­much as many of the page­hits on sup­pos­edly ad-free days are in fact be­ing con­t­a­m­i­nated by ex­po­sure to ads. For­tu­nate­ly, the Ad­Sense im­pres­sions data can be used in­stead to regress on, say, per­cent­age of ad-affected pageviews.
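A minimal sketch of that corrected regression in R (hypothetical column names for a merged daily Google Analytics/AdSense dataset, and ignoring for the moment the time-series complications discussed below):

    # measured exposure: the fraction of each day's pageviews which actually loaded an ad,
    # rather than the intended-but-leaky binary on/off assignment
    traffic$ads.fraction <- traffic$adsense.impressions / traffic$pageviews
    summary(lm(log(pageviews) ~ ads.fraction, data = traffic))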

Ads as Decision Problem

From a decision theory perspective, this is a good place to apply sequential testing ideas, as the experiment has an easily quantified cost: each day randomized ‘off’ costs ~$1, so a long experiment over 200 days would cost ~$100 in ad revenue etc. There is also the risk of making the wrong decision and choosing to disable ads when they are harmless, in which case the cost as NPV (at my usual 5% discount rate, and assuming ad revenue never changes and I never experiment further, which are reasonable assumptions given how fortunately stable my traffic is and the unlikeliness of me revisiting a conclusive result from a well-designed experiment) would be $360 / log(1.05) ≈ $7,380, which is substantial.

On the other side of the equation, the ads could be doing substantial damage to site traffic; with ~40% of traffic seeing ads and total page-views of 635,123 in 2016 (1,740/day), a discouraging effect of 5% off that would mean a loss of ~12,700 page-views a year, the equivalent of 1 week of traffic. My website is important to me because it is what I have accomplished & is my livelihood, and if people are not reading it, that is bad, both because I lose possible income and because it means no one is reading my work.

How bad? In lieu of ad­ver­tis­ing it’s hard to di­rectly quan­tify the value of a page-view, so I can in­stead ask my­self hy­po­thet­i­cal­ly, would I trade ~1 week of traffic for $360 (~$0.02/view, or to put it an­other way which may be more in­tu­itive, would I delete Gw­ern.net in ex­change for >$18720/year)? Prob­a­bly; that’s about the right num­ber—with my cur­rent par­lous in­come, I can­not ca­su­ally throw away hun­dreds or thou­sands of dol­lars for some ad­di­tional traffic, but I would still pay for read­ers at the right price, and weigh­ing the feel­ings, I feel com­fort­able valu­ing page-views at ~$0.02. (If the es­ti­mate of the loss turns out to be near the thresh­old, then I can re­visit it again and at­tempt more pref­er­ence elic­i­ta­tion. Given the ac­tual re­sults, this proved to be un­nec­es­sary.)

Then the loss function for the traffic reduction parameter t (the proportional decrease among the ~40% of ad-exposed traffic) is NPV(t) = (360 − (635,123 × 0.40 × t × $0.02)) / log(1.05). So the long-run consequence of permanently turning advertising on would be, for example: t = 2.5% = +$4,775; 5% = +$2,171; 10% = −$3,035; 20% = −$13,449; etc.

Thus, the decision question is whether the decrease for the ad-affected 40% of traffic is >7%; or, for traffic as a whole, whether the decrease is >2.8%. If it is, then I am better off removing AdSense and increasing traffic; otherwise, I am better off keeping the ads & the money.
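The whole decision calculation is simple enough to check in a few lines of R (a sketch using the numbers above, not the exact code used):

    annual.revenue   <- 360        # AdSense, $/year
    annual.pageviews <- 635123     # 2016 total page-views
    ad.exposed       <- 0.40       # fraction of traffic not using adblock
    value.per.view   <- 0.02       # elicited value of a page-view, $
    discount         <- log(1.05)  # 5% annual discount rate

    npv <- function(t) { (annual.revenue - annual.pageviews*ad.exposed*t*value.per.view) / discount }
    round(sapply(c(0.025, 0.05, 0.10, 0.20), npv))
    # ~  4775   2172  -3035 -13449  (cf. the figures above)

    ## break-even traffic decrease: among ad-exposed readers, and for total traffic:
    threshold <- annual.revenue / (annual.pageviews * ad.exposed * value.per.view)
    round(c(threshold, threshold * ad.exposed), 3)
    # ~ 0.071 0.028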

Ad Harms

How much should we ex­pect traffic to fall?

Un­for­tu­nate­ly, be­fore run­ning the first ex­per­i­ment, I was un­able to find pre­vi­ous re­search sim­i­lar to my pro­posal for ex­am­in­ing the effect on to­tal traffic rather than more com­mon met­rics such as rev­enue or per-page en­gage­ment. I as­sume such re­search ex­ists, since there’s a lit­er­a­ture on every­thing, but I haven’t found it yet and no one I’ve asked knows where it is ei­ther; and of course pre­sum­ably the big In­ter­net ad­ver­tis­ing gi­ants have de­tailed knowl­edge of such spillover or emer­gent effects, al­though no in­cen­tive to pub­li­cize the harms.4

There is a sparse open lit­er­a­ture on “ad­ver­tis­ing avoid­ance”, which fo­cuses on sur­veys of con­sumer at­ti­tudes and eco­nomic mod­el­ing; skim­ming, the main re­sults ap­pear to be that peo­ple claim to dis­like ad­ver­tis­ing on TV or the In­ter­net a great deal, claim to dis­like per­son­al­iza­tion but find per­son­al­ized ads less an­noy­ing, a non­triv­ial frac­tion of view­ers will take ac­tion dur­ing TV com­mer­cial breaks to avoid watch­ing ads (5–23% for var­i­ous meth­ods of es­ti­mat­ing/de­fi­n­i­tions of avoid­ance, and sources like TV chan­nel­s), and are par­tic­u­larly an­noyed by ads get­ting in the way when re­search­ing or en­gaged in ‘goal-ori­ented’ ac­tiv­i­ty, and in a work con­text (A­ma­zon Me­chan­i­cal Turk) will tol­er­ate non-an­noy­ing ads with­out de­mand­ing large pay­ment in­creases (Gold­stein et al 2013/).

Some par­tic­u­larly rel­e­vant re­sults:

  • McCoy et al 2007 did one of the few relevant experiments5, with students in labs, and noted “subjects who were not exposed to ads reported they were 11% more likely to return or recommend the site to others than those who were exposed to ads (p < 0.01)”; but could not measure any real-world or long-term effects.

  • Kerkhof 2019 exploits a sort of natural experiment on YouTube, where video creators learned that YouTube had a hardwired rule that videos <10 minutes in length could have only 1 ad, while they are allowed to insert multiple ads in longer videos; tracking a subset of German YT channels using advertising, she finds that some channels began increasing video lengths, inserting ads, turning away from ‘popular’ content to obscurer content (d = 0.4), and had more video views (>20%) but lower ratings (4%/d = −0.25)6.

    While that might sound good on net (more variety & more traffic even if some of the additional viewers may be less satisfied), Kerkhof 2019 is only tracking video creators and not a fixed set of viewers, and cannot examine to what extent viewers watch less due to the increase in ads or what global site-wide effects there may have been (after all, why weren’t the creators or viewers doing all that before?), and cautions that we should expect YouTube to algorithmically drive traffic to more monetizable channels, regardless of whether site-wide traffic or social utility decreased7.

  • Benzell & Collis 2019 run a large-scale (total n = 40,000) Google Surveys survey asking Americans about willingness-to-pay for, among other things, an ad-free Facebook (n = 1,001), which was a mean ~$2.5/month (substantially less than current FB ad revenue per capita per month); their results imply Facebook could increase revenue by increasing ads.

  • Sinha et al 2017 investigate ad harms indirectly, by looking at an online publisher’s logs of an anti-adblocker mechanism (which typically detects the use of an adblocker, hides the content, and shows a splashscreen telling the user to disable adblock); they do not have randomized data, but attempt a correlational analysis where Figure 3 implies (comparing the anti-adblocker ‘treatment’ with their preferred control group control_1) that, compared to the adblock-possible baseline, anti-adblock decreases pages per user and time per user: pages per user drops from ~1.4 to ~1.1, and time per user drops from ~2min to ~1.5min. (Despite the use of the term ‘aggregate’, Sinha et al 2017 does not appear to analyze total site pageview/time traffic statistics, but only per-user.)

    These are large de­creas­es, sub­stan­tially larger than 10%, but it’s worth not­ing that, aside from DiD not be­ing a great way of in­fer­ring causal­i­ty, these es­ti­mates are not di­rectly com­pa­ra­ble to the oth­ers be­cause adding an­ti-ad­block ≠ adding ads: an­ti-ad­block is much more in­tru­sive & frus­trat­ing (an ugly pay­wall hid­ing all con­tent & re­quir­ing man­ual ac­tion a user may not know how to per­form) than sim­ply adding some ads, and plau­si­bly is much more harm­ful.

But while those surveys & measurements show some users will do some work to avoid ads (which is supported by the high but nevertheless <100% percentage of browsers with adblockers installed) and in some contexts like jobs appear to be insensitive to ads, there is little information about to what extent ads unconsciously drive users away from a publisher towards other publishers or mediums, with pervasive amounts of advertising taken for granted & researchers focusing on just about anything else (see the citations in Brajnik & Gabrielli 2008 & Wilbur 2016, among others). For example, Google’s Hohnhold et al 2015 notes precisely the problem: “Optimizing which ads show based on short-term revenue is the obvious and easy thing to do, but may be detrimental in the long-term if user experience is negatively impacted. Since we did not have methods to measure the long-term user impact, we used short-term user satisfaction metrics as a proxy for the long-term impact”; and after experimenting with predictive models & randomizing ad loads, they decided to make a “50% reduction of the ad load on Google’s mobile search interface”, but Hohnhold et al 2015 doesn’t tell us what the effect on user attrition/activity was! What they do say is (ambiguously, given the “positive user response” is driven by a combination of less attrition, more user activity, and less ad blindness, with the individual contributions unspecified):

This and sim­i­lar ads blind­ness stud­ies led to a se­quence of launches that de­creased the search ad load on Google’s mo­bile traffic by 50%, re­sult­ing in dra­matic gains in user ex­pe­ri­ence met­rics. We es­ti­mated that the pos­i­tive user re­sponse would be so great that the long-term rev­enue change would be a net pos­i­tive. One of these launches was rolled out over ten weeks to 10% co­horts of traffic per week. Fig­ure 6 shows the rel­a­tive change in CTR [click­through rate] for differ­ent co­horts rel­a­tive to a hold­back. Each curve starts at one point, rep­re­sent­ing the in­stan­ta­neous qual­ity gains, and climbs higher post-launch due to user sight­ed­ness. Differ­ences be­tween the co­horts rep­re­sent pos­i­tive user learn­ing, i.e., ads sight­ed­ness.

My best guess is that the effect of any “ad­ver­tis­ing avoid­ance” ought to be a small per­cent­age of traffic, for the fol­low­ing rea­sons:

  • many peo­ple never bother to take a minute to learn about & in­stall ad­block browser plu­g­ins, de­spite the ex­is­tence of ad­block­ers be­ing uni­ver­sally known, which would elim­i­nate al­most all ads on all web­sites they would vis­it; if ads as a whole are not worth a minute of work to avoid for years to come for so many peo­ple, how bad could ads be? (And to the ex­tent that peo­ple do use ad­block­ers, any to­tal neg­a­tive effect of ads ought to be that much small­er.)

    • in par­tic­u­lar, my Ad­Sense ban­ner ads have never offended or both­ered me much when I browse my pages with ad­blocker dis­abled to check ap­pear­ance, as they are nor­mal medi­um-sized ban­ners cen­tered above the <title> el­e­ment where one ex­pects an ad8, and
  • website design ranges wildly in quality & ad density, with even enormously successful websites like Amazon looking like garbage9; it is difficult to tell whether users care about good design at all

  • great efforts are in­vested in min­i­miz­ing the im­pact of ads: Ad­Sense loads ads asyn­chro­nously in the back­ground so it never blocks the page load­ing or ren­der­ing (which would defi­nitely be frus­trat­ing & web de­sign holds that small de­lays in page­loads are harm­ful10), Google sup­pos­edly spends bil­lions of dol­lars a year on a sur­veil­lance In­ter­net & the most cut­ting-edge AI tech­nol­ogy to bet­ter model users & tar­get ads to them with­out ir­ri­tat­ing them too much (eg Hohn­hold et al 2015), ads should have lit­tle effect on SEO or search en­gine rank­ing (s­ince why would search en­gines pe­nal­ize their own ad­s?), and I’ve seen a de­cent amount of re­search on op­ti­miz­ing ad de­liv­er­ies to max­i­mize rev­enue & avoid­ing an­noy­ing ads (but, as de­scribed be­fore, never re­search on mea­sur­ing or re­duc­ing to­tal harm)

  • fi­nal­ly, if they were all that harm­ful, how could there be no past re­search on it and how could no one know this?

    You would think that if there were any wor­ri­some level of harm some­one would’ve no­ticed by now & it’d be com­mon knowl­edge to avoid ads un­less you were des­per­ate for the rev­enue.

So my prior es­ti­mate is of a small effect and need­ing to run for a long time to make a de­ci­sion at a mod­er­ate op­por­tu­nity cost.

Replication

After run­ning my first ex­per­i­ment (n = 179,550 users on mo­bile+desk­top browser­s), ad­di­tional re­sults have come out and a re­search lit­er­a­ture on quan­ti­fy­ing “ad­ver­tis­ing avoid­ance” is fi­nally emerg­ing; I have also been able to find ear­lier re­sults which were ei­ther too ob­scure for me to find the first time around or on closer read turn out to im­ply es­ti­mates of to­tal ad harm.

To sum­ma­rize all cur­rent re­sults:

Review of experiments or correlational analyses which measure the harm of ads on total activity (broadly defined):

| Entity | Cite | Date | Method | Users | Ads | n (millions) | Total effect | Activity |
|--------|------|------|--------|-------|-----|--------------|--------------|----------|
| Pandora | Huang et al 2019 | June 2014–April 2016 | randomized | mobile app | commercial break-style audio ads | 34 | 7.5% | total music listening hours (whole cohort) |
| Mozilla | Miroglio et al 2018 | February–April 2017 | correlational | desktop | all | 0.358 | 28% | total time WWW browsing (per user) |
| LinkedIn | Yan et al 2019 | March–June 2018 | randomized | mobile app | newsfeed insert items | 102 | 12% | total newsfeed interaction/use (whole cohort) |
| McCoy | McCoy et al 2007 | 2004? | randomized (lab) | desktop | banner/pop-ups | 0.000536 | 11% | self-rated willingness to revisit website |
| Google | Hohnhold et al 2015 | 2013–2014 (primary) | randomized | mobile | search engine text ads | 500? | 50–70%? | total search engine queries (whole cohort, inclusive of attrition etc)11 |
| Google (*) | Hohnhold et al 2015 | 2007?–? | randomized | all? | AdSense banners | (>>1)? | >5%? | total site usage |
| PageFair | Shiller et al 2017 | July 2013–June 2016 | correlational | all | all | <535? (k = 2574) | <19% | total website usage (Alexa traffic rank) |
| Gwern.net | (here) | January 2017–October 2017 | randomized | all | banner ad | 0.251 | 14% | total website traffic |
| Anonymous newspaper | Yan et al 2020 | June 2015–September 2015 | correlational | registered users | banner/skyscraper/square ads | 0.08 | 20% | total website traffic |
| New York Times | Aral & Dhillon 2020 | 2013-06-27–2014-01-25 | correlational | all | soft paywall | 29.7 | 9.9% | reading articles |

While these results come from completely different domains (general web use, entertainment, and business/productivity), different platforms (mobile app vs desktop browser), different ad delivery mechanisms (inline news feed items, audio interruptions, inline+popup ads, and web ads as a whole), and primarily examine within-user effects, the numerical estimates of total decreases fall remarkably consistently in the same 10–15% range as my own estimate.

The con­sis­tency of these re­sults un­der­mines many of the in­ter­pre­ta­tions of how & why ads cause harm.

For ex­am­ple, how can it be dri­ven by “per­for­mance” prob­lems when the LinkedIn app loads ads for their news­feed (un­less they are too in­com­pe­tent to down­load ads in ad­vance), or for the Pan­dora au­dio ads (as the au­dio ads must in­ter­rupt the mu­sic while they play but oth­er­wise do not affect the mu­sic—the mu­sic surely is­n’t “sta­t­ic-y” be­cause au­dio ads played at some point long be­fore or after! un­less again we as­sume to­tal in­com­pe­tence on the part of Pan­do­ra), or for Mc­Coy et al 2007 which served sim­ple sta­tic im­age ads off servers set up by them for the ex­per­i­ment? And why would a Google Ad­Sense ban­ner ad, which loads asyn­chro­nously and does­n’t block page ren­der­ing, have a ‘per­for­mance’ prob­lem in the first place? (N­ev­er­the­less, to ex­am­ine this pos­si­bil­ity fur­ther in my fol­lowup A/B test, I switched from Ad­Sense to a sin­gle small cacheable sta­tic PNG ban­ner ad which is loaded in both con­di­tions in or­der to elim­i­nate any per­for­mance im­pact.)

Similarly, if users have area-specific tolerance of ads and will tolerate them for work but not play or vice-versa, why do McCoy/LinkedIn vs Pandora find about the same thing? And if Gwern.net readers were simply unusually intolerant of ads, why do such different populations show similar losses?

The sim­plest ex­pla­na­tion is that users are averse to ads qua ads, re­gard­less of do­main, de­liv­ery mech­a­nism, or ‘per­for­mance’.

Pandora

Stream­ing ser­vice ac­tiv­ity & users (n = 34 mil­lion), ran­dom­ized.

In 2018, Pandora published a large-scale long-term (~2 years) individual-level advertising experiment in their streaming music service (Huang et al 2019) which found a strikingly large effect of number of ads on reduced listener frequency & worsened retention, which accumulated over time and would have been hard to observe in a short-term experiment.

Huang et al 2019, ad­ver­tis­ing harms for Pan­dora lis­ten­ers: “Fig­ure 4: Mean To­tal Hours Lis­tened by Treat­ment Group”; “Fig­ure 5: Mean Weekly Unique Lis­ten­ers by Treat­ment Group”

In the low ad condition, 2.732 ads/hour, the final activity level was +1.74% listening time; in the baseline/control condition, 3.622 ads/hour, 0%; and in the high ad condition, 5.009 ads/hour, the final activity level was −2.83% listening time. The ads-per-hour coefficient is −2.0751% for Total Hours & −1.8965% for Active Days. The net total effect can be backed out:

The co­effi­cients show us that one ad­di­tional ad per hour re­sults in mean lis­ten­ing time de­creas­ing by 2.075%±0.226%, and the num­ber of ac­tive lis­ten­ing days de­creas­ing by 1.897%±0.129%….­Does this de­crease in to­tal lis­ten­ing come from shorter ses­sions of lis­ten­ing, or from a lower prob­a­bil­ity of lis­ten­ing at all? To an­swer this ques­tion, Ta­ble 6 breaks the de­crease in to­tal hours down into three com­po­nents: the num­ber of hours lis­tened per ac­tive day, the num­ber of ac­tive days lis­tened per ac­tive lis­ten­er, and the prob­a­bil­ity of be­ing an ac­tive lis­tener at all in the fi­nal month of the ex­per­i­ment. We have nor­mal­ized each of these three vari­ables so that the con­trol group mean equals 100, so each of these treat­ment effects can be in­ter­preted as a per­cent­age differ­ence from con­trol. We find the per­cent­age de­crease in hours per ac­tive day to be ap­prox­i­mately 0.41%, the per­cent­age de­crease in days per ac­tive lis­tener to be 0.94%, and the per­cent­age de­crease in the prob­a­bil­ity of be­ing an ac­tive lis­tener in the fi­nal month to be 0.92%. These three num­bers sum to 2.27%, which is ap­prox­i­mately equal to the 2.08% per­cent­age de­cline we al­ready cal­cu­lated for to­tal hours lis­tened.5 This tells us that ap­prox­i­mately 18% of the de­cline in the hours in the fi­nal month is due to a de­cline in the hours per ac­tive day, 41% is due to a de­cline in the days per ac­tive lis­ten­er, and 41% is due to a de­cline in the num­ber of lis­ten­ers ac­tive at all on Pan­dora in the fi­nal month. We find it in­ter­est­ing that all three of these mar­gins see sta­tis­ti­cally sig­nifi­cant re­duc­tions, though the vast ma­jor­ity of the effect in­volves fewer lis­ten­ing ses­sions rather than a re­duc­tion in the num­ber of hours per ses­sion.

The coefficient of 2.075% less total activity (listening) per 1 ad/hour implies that with a baseline of 3.622 ads per hour, the total harm is 3.622 × 2.075% ≈ 7.5% at the end of 21 months (corresponding to the end of the experiment, at which point the harm from increased attrition appears to have stabilized, perhaps because everyone at the margin who might attrit away or reduce listening has done so by then, and that may reflect the total indefinite harm).
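The back-of-the-envelope in R (using the quoted coefficient & control-group ad load; my reconstruction, not Huang et al’s own calculation):

    ads.per.hour <- 3.622     # control-group ad load
    coef         <- -0.02075  # change in total listening hours per additional ad/hour
    ads.per.hour * coef
    # ~ -0.075, i.e. ~7.5% less total listening than a hypothetical ad-free Pandora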

Mozilla

Desk­top browser us­age lev­els (n = 358,000), lon­gi­tu­di­nal:

Almost simultaneously with Pandora, Mozilla (Miroglio et al 2018) conducted a longitudinal (but non-randomized, using matching + regression to reduce the inflation of the correlational effect12) study of browser users which found that after installing adblock, the subset of adblock users experienced “increases in both active time spent in the browser (+28% over [matched] controls) and the number of pages viewed (+15% over control)”.

(This, in­ci­den­tal­ly, is a tes­ta­ment to the value of browser ex­ten­sions to users: in a ma­ture piece of soft­ware like Fire­fox, usu­al­ly, noth­ing im­proves a met­ric like 28%. One won­ders if Mozilla fully ap­pre­ci­ates this find­ing?)

Miroglio et al 2018, benefits to Firefox users from adblockers: “Figure 3: Estimates & 95% CIs for the change in log-transformed engagement due to installing add-ons [adblockers]”; “Table 5: Estimated relative changes in engagement due to installing add-ons compared to the control group”

LinkedIn

So­cial news feed ac­tiv­ity & users, mo­bile app (n = 102 mil­lion), ran­dom­ized:

LinkedIn ran a large-scale ad experiment on their mobile app’s users (excluding desktop etc, presumably iOS+Android) tracking the effect of additional ads in user ‘news feeds’ on short-term & long-term metrics like retention over 3 months (Yan et al 2019); it compares the LinkedIn baseline of 1 ad every 6 feed items to alternatives of 1 ad every 3 feed items and 1 ad every 9 feed items. Unlike Pandora, the short-term effect is the bulk of the advertising effect within their 3-month window (perhaps because LinkedIn is a professional tool and quitting is harder than for an entertainment service, or visual web ads are less intrusive than audio, or because 3 months is still not long enough), but while ad increases show minimal net revenue impact (if I am understanding their metrics right), the ad density clearly discourages usage of the news feed, the authors speculating this is due to discouraging less-active or “dormant” marginal users; considering the implied annualized effect on user retention & activity, I estimate a total activity decrease of >12% due to the baseline ad burden compared to no ads.13

Yan et al 2019, on the harms of ad­ver­tis­ing on Linked­In: “Fig­ure 3. Effect of ads den­sity on feed in­ter­ac­tion”

McCoy et al 2007

Academic business-school lab study: self-rated willingness to revisit/recommend a website on a scale after exposure to ads (n = 536), randomized:

While markedly differ­ent in both method & mea­sure, Mc­Coy et al 2007 nev­er­the­less finds a ~11% re­duc­tion from no-ads to ads (3 types test­ed, but the least an­noy­ing kind, “in­-line”, still in­curred a ~9% re­duc­tion). They point­edly note that while this may sound small, it is still of con­sid­er­able prac­ti­cal im­por­tance.14

Mc­Coy et al 2007, harms of ads on stu­dent rat­ings: “Fig­ure 2: In­ten­tions to re­visit the site con­tain­ing the ads (4-item scale; cau­tion: the ori­gin of the Y axis is not 0).”

Google

Hohnhold et al 2015, “Focusing on the Long-term: It’s Good for Users and Business”, search activity, mobile Android Google users (n > 100m?)15, randomized:

Hohn­hold et al 2015, ben­e­fit from 50% ad re­duc­tion on mo­bile over 2 month roll­out of 10% users each: “Fig­ure 6:∆CTR [CTR = Click­s/Ad, term 5] time se­ries for differ­ent user co­horts in the launch. (The launch was stag­gered by weekly co­hort.)”

This and sim­i­lar ads blind­ness stud­ies led to a se­quence of launches that de­creased the search ad load on Google’s mo­bile traffic by 50%, re­sult­ing in dra­matic gains in user ex­pe­ri­ence met­rics. We es­ti­mated that the pos­i­tive user re­sponse would be so great that the long-term rev­enue change would be a net pos­i­tive. One of these launches was rolled out over ten weeks to 10% co­horts of traffic per week. Fig­ure 6 shows the rel­a­tive change in CTR [click­through rate] for differ­ent co­horts rel­a­tive to a hold­back. Each curve starts at one point, rep­re­sent­ing the in­stan­ta­neous qual­ity gains, and climbs higher post-launch due to user sight­ed­ness. Differ­ences be­tween the co­horts rep­re­sent pos­i­tive user learn­ing, i.e., ads sight­ed­ness.

Hohn­hold et al 2015, as the re­sult of search en­gine ad load ex­per­i­ments on user ac­tiv­ity (search­es) & ad in­ter­ac­tions, de­cided to make a “50% re­duc­tion of the ad load on Google’s mo­bile search in­ter­face” which, be­cause of the ben­e­fits to ad click rates & “user ex­pe­ri­ence met­rics”, would pre­serve or in­crease Google’s ab­solute rev­enue.

To ex­actly off­set a 50% re­duc­tion in ad ex­po­sure solely by be­ing more likely to click on ads, user CTRs must dou­ble, of course. But Fig­ure 6 shows an in­crease of at most 20% in the CTR rather than 100%. So if the change was still rev­enue-neu­tral or pos­i­tive, user ac­tiv­ity must have gone up in some way—but Hohn­hold et al 2015 does­n’t tell us what the effect on user at­tri­tion/ac­tiv­ity was! The “pos­i­tive user re­sponse” is dri­ven by some com­bi­na­tion of less at­tri­tion, more user ac­tiv­i­ty, and less ad blind­ness, with the in­di­vid­ual con­tri­bu­tions left un­spec­i­fied.

Can the effect on user activity be inferred from what Hohnhold et al 2015 does report? Possibly. As they set it up in equation 2, revenue decomposes into a product of terms: Revenue = Users × (Tasks/User) × (Queries/Task) × (Ads/Query) × (Clicks/Ad) × (Cost/Click).

Apro­pos of this se­tup, they re­mark

For Google search ads experiments, we have not measured a statistically-significant learned effect on terms 1 [“Users”] and 2 [Tasks/User].2 [2: We suspect the lack of effect is due to our focus on quality and user experience. Experiments on other sites indicate that there can indeed be user learning affecting overall site usage.]

This would, in­ci­den­tal­ly, ap­pear to im­ply that Google ad ex­per­i­ments have demon­strated an ad harm effect on other web­sites, pre­sum­ably via Ad­Sense ads rather than search query ads, and given the sta­tis­ti­cal power con­sid­er­a­tions, the effect would need to be sub­stan­tial (guessti­mate >5%?). I emailed Hohn­hold et al sev­eral times for ad­di­tional de­tails but re­ceived no replies.

Given the reported results, this is under-specified, but we can make some additional assumptions: we’ll ignore user attrition & number of ‘tasks’ (as they say there is no “statistically-significant learned effect”, which is not the same thing as zero effect but implies they are small), assume constant absolute revenue & revenue per click, and assume the CTR increase is 18% (the CTR increase is cumulative over time and has reached >18% for the longest-exposed cohort in Figure 6, so this represents a lower bound as it may well have kept increasing). This gives an upper bound of a <70% increase in user search queries per task thanks to the halving of ad load (assuming the CTR didn’t increase further and there was zero effect on user retention or acquisition): 1 / (0.5 × 1.18) ≈ 1.69. Assuming a retention effect similar to LinkedIn’s ~−0.5% user attrition per 2 months, then it’d be more like <65%, and adding in a −1–2% effect on number of tasks shrinks it down to <60%; if the increased revenue refers to annualized projections based on the 2-month data and we imagine annualizing/compounding hypothetical −1% effects on user attrition & activity, a <50% increase in search queries per task becomes plausible (which would be the difference between running 1 query per task and running 1.5 queries per task, which doesn’t sound unrealistic to me).
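The arithmetic behind these bounds can be sketched in R (my reconstruction of the reasoning, not Hohnhold et al’s own calculation):

    # revenue ~ (queries/task) * (ads/query) * (clicks/ad), with revenue & the other terms held constant
    ad.load <- 0.5     # 50% reduction in ads per query
    ctr     <- 1.18    # >=18% observed increase in clicks per ad (lower bound, Figure 6)
    1 / (ad.load * ctr) - 1
    # ~ 0.69: an upper bound of roughly a 70% increase in queries per task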

Re­gard­less of how we guessti­mate at the break­down of user re­sponse across their equa­tion 2’s first 3 terms, the fact re­mains that be­ing able to cut ads by half with­out any net rev­enue effec­t—on a ser­vice “fo­cus[ed] on qual­ity and user ex­pe­ri­ence” whose au­thors have data show­ing its ads to al­ready be far less harm­ful than “other sites”—sug­gests a ma­jor im­pact of search en­gine ads on mo­bile users.

Strik­ing­ly, this 50–70% range of effects on search en­gine use would be far larger than es­ti­mated for other mea­sures of use in the other stud­ies. Some pos­si­ble ex­pla­na­tions are that the oth­ers have sub­stan­tial mea­sure­ment er­ror bi­as­ing them to­wards zero or that there is mod­er­a­tion by pur­pose: per­haps even LinkedIn is a kind of “en­ter­tain­ment” where ads are not as ir­ri­tat­ing a dis­trac­tion, while search en­gine queries are more se­ri­ous time-sen­si­tive busi­ness and ads are much more frus­trat­ingly fric­tion.

PageFair

Shiller et al 2017, “Will Ad Blocking Break the Internet?”, Alexa traffic rank (proxy for traffic), all website users (n=?m16, k = 2,574), longitudinal correlational analysis:

Page­Fair is an an­ti-ad­block ad tech com­pa­ny; their soft­ware de­tects ad­block use, and in this analy­sis, the Alex­a-es­ti­mated traffic ranks of 2,574 cus­tomer web­sites (me­dian rank: #210,000) are cor­re­lated with Page­Fair-es­ti­mated frac­tion of ad­block traffic. The 2013–2016 time-series are in­ter­mit­tent & short (me­dian 16.7 weeks per web­site, an­a­lyzed in monthly traffic blocks with monthly n~12,718) as cus­tomer web­sites ad­d/re­move Page­Fair soft­ware. 14.6% of users have ad­block in their sam­ple.

Shiller et al 2017’s primary finding is that increases in the adblock usage share of PageFair-using websites predict an improvement in Alexa traffic rank over the next multi-month time-period analyzed, but then a gradual worsening of Alexa traffic ranks up to 2 years later. Shiller et al 2017 attempts to make a causal story more plausible by looking at baseline covariates and attempting to use adblock rates as (none too convincing) instrumental variables. The interpretation offered is that adblock increases are exogenous and cause an initial benefit from freeriding users but then a gradual deterioration of site content/quality from reduced revenue.

While their in­ter­pre­ta­tion is not un­rea­son­able, and if true is a re­minder that for ad-driven web­sites there is an op­ti­mal trade­off be­tween ads & traffic where the op­ti­mal point is not nec­es­sar­ily known and ‘pro­gram­matic ad­ver­tis­ing’ may not be a good rev­enue source (in­deed, Shiller et al 2017 note that “ad block­ing had a sta­tis­ti­cal­ly-sig­nifi­cantly smaller im­pact at high­-traffic web­sites…indis­tin­guish­able from 0”), the more in­ter­est­ing im­pli­ca­tion is that if causal, the im­me­di­ate short­-run effect is an es­ti­mate of the harm of ad­ver­tis­ing.

Specifically, the PageFair summary emphasizes, in a graph of a sample starting from July 2013, that a 0%→25% change in adblock usage would be predicted to see a +5% rank improvement in the first half-year, +2% in the first year-and-a-half, decreasing to −16% by June 2016, ~3 years later. The graph and the exact estimates do not appear in Shiller et al 2017, but seem to be based on Table 5; the first coefficient in columns 1–4 corresponds to the first multi-month block, and the coefficient is expressed in terms of log ranks (lower = better), so given the PageFair hypothetical of 0%→25%, the predicted effect in the first time period for the various models (−0.2250, −0.2250, −0.2032, & −0.2034; mean, −0.21415) is 1 − exp(−0.21415 × 0.25) ≈ 5%. Or to put it another way, the effect of advertising exposure for 100%→0% of the userbase would be predicted to be 1 − exp(−0.21415) ≈ 19% (of Alexa traffic rank). Given the nonlinearity of Alexa ranks/true traffic, I suspect this implies an actual traffic gain of <19%.
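The conversion from Table 5’s log-rank coefficients can be replicated in a few lines of R (a sketch):

    coefs <- c(-0.2250, -0.2250, -0.2032, -0.2034)  # first-period coefficients, models 1-4
    mean(coefs)
    # -0.21415
    1 - exp(mean(coefs) * 0.25)  # 0% -> 25% adblock share: ~0.052, i.e. ~5% rank improvement
    1 - exp(mean(coefs) * 1.00)  # 100% -> 0% ad exposure:  ~0.193, i.e. ~19%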

Yan et al 2020

Yan et al 2020 reports a difference-in-differences correlational analysis of adblock vs non-adblock users on an anonymous medium-size European news website17 (similar to Mozilla & PageFair):

Many on­line news pub­lish­ers fi­nance their web­sites by dis­play­ing ads along­side con­tent. Yet, re­mark­ably lit­tle is known about how ex­po­sure to such ads im­pacts users’ news con­sump­tion. We ex­am­ine this ques­tion us­ing 3.1 mil­lion anonymized brows­ing ses­sions from 79,856 users on a news web­site and the qua­si­-ran­dom vari­a­tion cre­ated by ad blocker adop­tion. We find that see­ing ads has a ro­bust neg­a­tive effect on the quan­tity and va­ri­ety of news con­sump­tion: Users who adopt ad block­ers sub­se­quently con­sume 20% more news ar­ti­cles cor­re­spond­ing to 10% more cat­e­gories. The effect per­sists over time and is largely dri­ven by con­sump­tion of “hard” news. The effect is pri­mar­ily at­trib­ut­able to a learn­ing mech­a­nism, wherein users gain pos­i­tive ex­pe­ri­ence with the ad-free site; a cog­ni­tive mech­a­nism, wherein ads im­pede pro­cess­ing of con­tent, also plays a role. Our find­ings open an im­por­tant dis­cus­sion on the suit­abil­ity of ad­ver­tis­ing as a mon­e­ti­za­tion model for valu­able dig­i­tal con­tent…Our dataset was com­posed of click­stream data for all reg­is­tered users who vis­ited the news web­site from the sec­ond week of June, 2015 (week 1) [2015-06-07] to the last week of Sep­tem­ber, 2015 (week 16) [2015-09-30]. We fo­cus on reg­is­tered users for both econo­met­ric and so­cio-e­co­nomic rea­sons. We can only track reg­is­tered users on the in­di­vid­u­al-level over time, which pro­vides us with a unique panel set­ting that we use for our em­pir­i­cal analy­sis…These per­cent­ages trans­late into 2 fewer news ar­ti­cles per week and 1 less news cat­e­gory in to­tal.

…Of the 79,856 users whom we ob­served, 19,088 users used an ad blocker dur­ing this pe­riod (as in­di­cated by a non-zero num­ber of page im­pres­sions blocked), and 60,768 users did not use an ad blocker dur­ing this pe­ri­od. Thus, 24% of users in our dataset used an ad block­er; this per­cent­age is com­pa­ra­ble to the ad block­ing adop­tion rates across Eu­ro­pean coun­tries at the same time, rang­ing from 20% in Italy to 38% in Poland (New­man et al. 2016).

…Our re­sults also high­light sub­stan­tial het­ero­gene­ity in the effect of ad ex­po­sure across differ­ent users: First, users with a stronger ten­dency to read news on their mo­bile phones (as op­posed to on desk­top de­vices) ex­hibit a stronger treat­ment effect.

Aral & Dhillon 2020

Aral & Dhillon 2020:

Most on­line con­tent pub­lish­ers have moved to sub­scrip­tion-based busi­ness mod­els reg­u­lated by dig­i­tal pay­walls. But the man­age­r­ial im­pli­ca­tions of such freemium con­tent offer­ings are not well un­der­stood. We, there­fore, uti­lized mi­crolevel user ac­tiv­ity data from the New York Times to con­duct a large-s­cale study of the im­pli­ca­tions of dig­i­tal pay­wall de­sign for pub­lish­ers. Specifi­cal­ly, we use a qua­si­-ex­per­i­ment that var­ied the (1) quan­tity (the num­ber of free ar­ti­cles) and (2) ex­clu­siv­ity (the num­ber of avail­able sec­tions) of free con­tent avail­able through the pay­wall to in­ves­ti­gate the effects of pay­wall de­sign on con­tent de­mand, sub­scrip­tions, and to­tal rev­enue.

The paywall policy changes we studied suppressed total content demand by about 9.9%, reducing total advertising revenue. However, this decrease was more than offset by increased subscription revenue, as the policy change led to a 31% increase in total subscriptions during our seven-month study, yielding net positive revenues of over $287,104 ($230,000 in 2013 dollars). The results confirm an economically significant impact of the newspaper’s paywall design on content demand, subscriptions, and net revenue. Our findings can help structure the scientific discussion about digital paywall design and help managers optimize digital paywalls to maximize readership, revenue, and profit.

Fig­ure 5, Aral & Dhillon 2020 (d­iffer­ence-in-d­iffer­ences re­gres­sion)

…During our observation period, the NYT paywall allowed 10 free articles per month via channels (1) and (2). Visitors could, however, read an unlimited number of articles through the mobile app but only from the top news and video sections. However, on June 27, 2013, the NYT started metering their mobile apps such that unsubscribed users could only read 3 articles per day. At the same time, those articles could now be accessed from any section and not just from the top news and video sections. If, after hitting their quota, a user tries to access more articles, they see a pop-up in the mobile app urging them to become a subscriber.9 To kick off the update, users had a one-week trial period from the time they updated the app during which they could freely read any number of articles from any sections…Estimation results are shown in Table 10, and it’s easy to see that the impacts of the various variables are qualitatively and directionally similar as in Table 2. The policy change decreased total readership by 9.9% using the Poisson regression specification and approximately 7% using the log-linearized specification.

The NYT pay­wall was ex­tremely porous and eas­ily by­passed by many meth­ods (only a few of which are men­tioned) by any­one who cared, and often man­i­fested it­self as a ban­ner ad (of the “You Have Read 2 of 3 Free Ar­ti­cles This Mon­th, Please Con­sider Sub­scrib­ing” nag­ware sort), and the re­sults are strik­ingly sim­i­lar to the trends in the other ad qua­si­ex­per­i­men­t/­ex­per­i­men­tal pa­pers.

They Just Don’t Know?

This raises again one of my orig­i­nal ques­tions: why do peo­ple not take the sim­ple & easy step of in­stalling ad­block­er, de­spite ap­par­ently hat­ing ads & ben­e­fit­ing from it so much? Some pos­si­bil­i­ties:

  • peo­ple don’t care that much, and the loud com­plaints are dri­ven by a small mi­nor­i­ty, or other fac­tors (such as a po­lit­i­cal moral panic post-Trump elec­tion); Ben­zell & Col­lis 2019’s will­ing­ness-to-pay is con­sis­tent with not both­er­ing to learn about or use ad­block be­cause peo­ple just don’t care

  • ad­block is typ­i­cally dis­abled or hard to get on mo­bile; could the effect be dri­ven by mo­bile users who know about it & want to but can’t?

    This should be testable by re-an­a­lyz­ing the A/B tests to split to­tal traffic into desk­top & mo­bile (which Google An­a­lyt­ics does track and, in­ci­den­tal­ly, is how I know that mo­bile traffic has steadily in­creased over the years & be­came a ma­jor­ity of Gw­ern.net traffic in Jan­u­ary–Feb­ru­ary 2019)2

  • is it pos­si­ble that peo­ple don’t use ad­block be­cause they don’t know it ex­ists?

The last possibility sounds crazy to me. Ad blocking is well-known, ad blockers are among the most popular browser extensions there are, and they are often the first thing installed on a new OS.

An Adobe/PageFair 2015 report estimated 45m American adblock users in 2015 (out of a total US population of ~321m people, or ~14% of users18), and a 2016 UK estimate put it at ~22%; a PageFair paper, Shiller et al 2017, cites an unspecified analysis estimating “24% for Germany, 14% for Spain, 10% for the UK, and 9% for the US” installation rates, accounting for “28% in Germany, 16% for Spain, 13% for the UK, and 12% for the U.S.” of web traffic. A 2016 Midia Research report claims that 41% of users knew about ad blockers, of which 80% used one on desktop & 46% on smartphones, implying a 33%/19% use rate. One might expect higher numbers now, 3–4 years later, since adblock usage has been growing. One survey of Polish Internet users (methodology unspecified) found 77% used adblock and <2% claimed to not know what adblock is. (The Polish users mostly accepted “Static graphic or text banners” but particularly disliked video, native, and audio ads.)

So plenty of or­di­nary peo­ple, not just nerds, have not merely heard of it but are ac­tive users of it (and why would pub­lish­ers & the ad in­dus­try be so hys­ter­i­cal about ad block­ing if it were no more widely used than, say, desk­top Lin­ux?). But, I am well-aware I live in a bub­ble and my in­tu­itions are not to be trusted on this (as Jakob Nielsen puts it: “The Dis­tri­b­u­tion of Users’ Com­puter Skills: Worse Than You Think”). The only way to rule this out is to ask or­di­nary peo­ple.

As usual, I use Google Surveys to run a weighted population survey. On 2019-03-16, I launched an n = 1000 one-question survey of all Americans with randomly reversed answer order, with the following results (CSV):

Do you know about ‘ad­block­ers’: web browser ex­ten­sions like Ad­Block Plus or ublock?

  • Yes, and I have one in­stalled [13.9% weighted / n= 156 raw]
  • Yes, but I do not have one in­stalled [14.4% weighted / n = 146 raw]
  • No [71.8% weighted / n = 702 raw]
First Google Sur­vey about ad­block us­age & aware­ness: bar graph of re­sults.

The in­stal­la­tion per­cent­age closely par­al­lels the 2015 Adobe/­Page­Fair es­ti­mate, which is rea­son­able. (Adobe/­Page­Fair 2015 makes much hay of the growth rates, but those are desk­top growth rates, and desk­top us­age in gen­eral seems to’ve cratered as peo­ple shift ever more time to tablet­s/s­mart­phones; they note that “Mo­bile is yet to be a fac­tor in ad block­ing growth”.) I am how­ever shocked by the per­cent­age claim­ing to not know what an ad­blocker is: 72%! I had ex­pected to get some­thing more like 10–30%. As one learns read­ing sur­veys, a de­cent frac­tion of every pop­u­la­tion strug­gles with ba­sic ques­tions like whether the Earth goes around the Sun or vice-ver­sa, so I would be shocked if they knew of ad block­ers but I ex­pected the re­main­ing 50%, who are dri­ving this puz­zle of “why ad­ver­tis­ing avoid­ance but not ad­block in­stal­la­tion?”, to be a lit­tle more on the ball, and be aware of ad block­ers but have some other rea­son to not in­stall them (if only my­opic lazi­ness).

But that appears to not be the case. There are relatively few people who claim to be aware of ad blockers but not be using them, and those might just be mobile users whose browsers (specifically, Chrome; Apple’s Safari/iOS has permitted adblock extensions since 2015) forbid ad blockers.

To look some more into the mo­ti­va­tion of the re­cu­sants, I launched an ex­panded ver­sion of the first GS sur­vey with n = 500 on 2019-03-18, oth­er­wise same op­tions, ask­ing (CSV):

If you don’t have an ad­block ex­ten­sion like Ad­Block Plus/ublock in­stalled in your web browser, why not?

  1. I do have one in­stalled [weighted 34.9% raw n = 183]
  2. I don’t know what ad block­ers are [36.7%; n = 173]
  3. Ad block­ers are too hard to in­stall [6.2%; n = 28]
  4. My browser or de­vice does­n’t sup­port them [7.8%; n = 49]
  5. Ad block­ing hurts web­sites or is un­eth­i­cal [10.4%; n = 51]
  6. [free re­sponse text field to al­low list­ing of rea­sons I did­n’t think of] [0.6%/0.5%/3.0%; n = 1/1/15]
Sec­ond Google Sur­vey about rea­sons for not us­ing ad­block: bar graph of re­sults.

The responses here aren’t entirely consistent with the previous group. Previously, 14% claimed to have adblock, and here 35% do, which is more than double, and the CIs do not overlap. The wording of the answer is almost the same (“Yes, and I have one installed” vs “I do have one installed”), so I wonder if there is a demand effect from the wording of the question: the first one treats adblock use as an exception, while the second frames it as the norm (from which deviation must be justified). So it’s possible that the true adblock rate is somewhere in between 14–35%. The two other estimates fall in that range as well.
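(A quick check on the raw, unweighted counts, summing the response counts above and ignoring the GS weighting, confirms the gap is far larger than sampling error:)

    # survey 1: 156 of 1,004 raw respondents reported adblock installed; survey 2: 183 of 501
    prop.test(x = c(156, 183), n = c(1004, 501))
    # ~15.5% vs ~36.5%; the 95% CI of the difference excludes zero by a wide margin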

In any case, the reasons are what this survey was for, and they are more interesting. Of the non-users, ignorance makes up the majority of responses (56%), with only 12% claiming that device restrictions (like Android’s Chrome) stop them from using adblockers (which is evidence that informed-but-frustrated mobile users aren’t driving the ad harms), 16% abstaining out of principle, and 9% blaming the hassle of installing/using.
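(Those non-user percentages are just the weighted shares renormalized over the ~65% who did not answer “installed”; in R:)

    shares    <- c(ignorance = 36.7, hassle = 6.2, unsupported = 7.8, unethical = 10.4)  # % of all respondents
    non.users <- 100 - 34.9
    round(100 * shares / non.users, 1)
    # ignorance ~56.4, hassle ~9.5, unsupported ~12.0, unethical ~16.0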

Around 6% of non-users took the op­tion of us­ing the free re­sponse text field to pro­vide an al­ter­na­tive rea­son. I group the free re­sponses as fol­lows:

  1. Ads aren’t sub­jec­tively painful enough to in­stall ad­block:

    “Ads aren’t as an­noy­ing as sur­veys”/“I don’t visit sites with pop up ads and have not been both­ered”/“Haven’t needed”/“Too lazy”/“i’m not sure, seems like a has­sle”

    • what is prob­a­bly a sub­cat­e­go­ry, un­spec­i­fied dis­like or lack of need :

      “Don’t want it”/“Don’t want to block them”/“don’t want to”/“doo not want them”/“No rea­son”/“No”/“Not sure why”

  2. vari­ant of “browser or de­vice does­n’t sup­port them”:

    “work com­puter”/“Mac”

  3. Tech­ni­cal prob­lems with ad­block­ers:

    “Many web­sites won’t al­low you to use it with an ad­blocker ac­ti­vated”/“far more effec­tive to just dis­able javascript to kill ads”

  4. Ig­no­rance (more speci­fic):

    “Did­n’t know they had one for ipads”

So the ma­jor miss­ing op­tion here is an op­tion for be­liev­ing that ads don’t an­noy them (although given the size of the ad effect, one won­ders if that is re­ally true).

For a third sur­vey, I added a re­sponse for ads not be­ing sub­jec­tively an­noy­ing, and, be­cause of that 14% vs 35% differ­ence in­di­cat­ing po­ten­tial de­mand effects, I tried to re­verse the per­ceived ‘de­mand’ by ex­plic­itly fram­ing non-ad­block use as the norm. Launched with n = 500 2019-03-21–2019-03-23, same op­tions (CSV):

Most peo­ple do not use ad­block ex­ten­sions for web browsers like Ad­Block Plus/ublock; if you do not, why not?

  1. I do have one in­stalled [weighted 36.5%; raw n = 168]
  2. I don’t know what ad block­ers are [22.8%; n = 124]
  3. I don’t want or need to re­move ads [14.6%; n = 70]
  4. Ad block­ers are too hard to in­stall [12%; n = 65]
  5. My browser or de­vice does­n’t sup­port them [7.8%; n = 41]
  6. Ad block­ing hurts web­sites or is un­eth­i­cal [2.6%; n = 17]
  7. [free re­sponse text field to al­low list­ing of rea­sons I did­n’t think of] [3.6%; n = 15]
Third Google Sur­vey, 2nd ask­ing about rea­sons for not us­ing ad­block: bar graph of re­sults.

Free re­sponses show­ing noth­ing new:

  • “dont think add block­ers are eth­i­cal”/“No in­ter­est in them”/“go away”/“idk”/“I only use them when I’m blinded by ads !”/“In­con­ve­nient to in­stall for a prob­lem I hardly en­counter for the web­sites that I use”/“The”/“n/a”/“I dont know”/“worms”/“lazy”/“Don’t need it”/“Fu”/“boo”

With the wording reversal and additional option, these results are consistent with the second survey on installation percentage (35% vs 37%), but not so much on the others (37% vs 23%, 6% vs 12%, 8% vs 8%, & 10.4% vs 3%). The free responses are also much worse the second time around.

In­ves­ti­gat­ing word­ing choice again, I sim­pli­fied the first sur­vey down to a bi­nary yes/no, on 2019-04-05–2019-04-07, n = 500 (CSV):

Do you know about ‘ad­block­ers’: web browser ex­ten­sions like Ad­Block Plus or ublock?

  1. Yes [weighted 26.5%; raw n = 125]
  2. No [weighted 73.5%; raw n = 375]

The re­sults were al­most iden­ti­cal: “no” was 73% vs 71%.

For a fi­nal sur­vey, I tried di­rectly query­ing the ‘don’t wan­t/need’ pos­si­bil­i­ty, ask­ing a 1–5 Lik­ert ques­tion (no shuffle); n = 500, 2019-06-08–2019-06-10 (CSV):

How much do In­ter­net ads (like ban­ner ads) an­noy you? [On a scale of 1–5]:

  • 1: Not at all [weighted 11.7%; raw n = 59]
  • 2: [9.5%; n = 46]
  • 3: [14.2%; n = 62]
  • 4: [18.0%; n = 93]
  • 5: Great­ly: I avoid web­sites with ads [46.6%; n = 244]

Al­most half of re­spon­dents gave the max­i­mal re­spon­se; only 12% claim to not care about ads at all.

The changes are puzzling. The decrease in “Ad blocking hurts websites or is unethical” and “I don’t know what ad blockers are” could be explained as users shifting buckets: they don’t want to use adblockers because ad blockers are unethical, or they haven’t bothered to learn what ad blockers are because they don’t want/need to remove ads. But how can adding an option like “I don’t want or need to remove ads” possibly affect a response like “Ad blockers are too hard to install” so as to make it double (6% → 12%)? At first blush, this seems like a violation of logical consistency: adding more alternatives, which ought to be strict subsets of some responses, nevertheless decreases other responses. This suggests that the responses are in general low-quality and not to be trusted, as the surveyees are being lazy or otherwise screwing things up; they may be semi-randomly clicking, or those ignorant of adblock may be confabulating excuses for why they are right to be ignorant.

Per­plexed by the troll­ish free re­sponses & stark in­con­sis­ten­cies, I de­cided to run the third sur­vey 2019-03-25–2019-03-27 for an ad­di­tional n = 500, to see if the re­sults held up. They did, with more sen­si­ble free re­sponses as well, so it was­n’t a fluke (CSV):

Most peo­ple do not use ad­block ex­ten­sions for web browsers like Ad­Block Plus/ublock; if you do not, why not?

  1. I do have one in­stalled [weighted 33.3%; raw n = 165]

  2. I don’t know what ad block­ers are [30.4%; n = 143]

  3. I don’t want or need to re­move ads [13.3%; n = 71]

  4. Ad block­ers are too hard to in­stall [10.6%; n = 64]

  5. My browser or de­vice does­n’t sup­port them [5.9%; n = 31]

  6. Ad block­ing hurts web­sites or is un­eth­i­cal [4.4%; n = 18]

  7. [free re­sponse text field to al­low list­ing of rea­sons I did­n’t think of] [2.2%; n = 10]

    • “Na”/“dont care”/“I have one”/“I can’t do sweep­stakes”/“i dont know what ad­block is”/“job com­puter do not know what they have”/“Not ed­u­cated on them”/“Didnt know they were avail­able or how to use them. Have never heard of them.”

Is the ig­no­rance rate 23%, 31%, 37%, or 72%? It’s hard to say given the in­con­sis­ten­cies. But taken as a whole, the sur­veys sug­gest that:

  1. only a mi­nor­ity of users use ad­block
  2. ad­block non-usage is to a small ex­tent due to (per­ceived) tech­ni­cal bar­ri­ers
  3. a mi­nor­ity & pos­si­bly a plu­ral­ity of po­ten­tial ad­block users do not know what ad­block is

This offers a resolution of the apparent adblock paradox: use of ads can drive away a nontrivial proportion of users (such as ~10%) who, despite their aversion, go without adblock, to a small extent because of technical barriers but to a much larger extent because of simple ignorance.
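A crude back-of-the-envelope check that the survey numbers are consistent with this story, reusing the point-estimates above and assuming (purely for illustration) that ad-annoyance is independent of adblock use:

adblockRate <- 0.35          # high-end survey estimate of installed adblockers
annoyed     <- 0.466 + 0.18  # fraction answering 4 or 5 on the annoyance Likert question
(1 - adblockRate) * annoyed
# [1] 0.4199  ## ie. ~42% of readers are plausibly both ad-exposed & ad-averse,
#             ## leaving plenty of room for a ~10% total traffic loss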

Design

How do we an­a­lyze this? In the AB­a­lyt­ics per-reader ap­proach, it was sim­ple: we de­fined a thresh­old and did a bi­no­mial re­gres­sion. But by switch­ing to try­ing to in­crease over­all to­tal traffic, I have opened up a can of worms.
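For contrast, a sketch of what that per-reader analysis looks like (using a hypothetical data frame sessions with one row per page-load and a binary LongRead indicator for whether reading time exceeded the chosen threshold; neither is part of this dataset):

summary(glm(LongRead ~ Ads, family=binomial, data=sessions))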

Descriptive

Let’s look at the traffic data:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
summary(traffic)
#    Date               Pageviews
#  Min.   :2010-10-04   Min.   :    1
#  1st Qu.:2012-04-28   1st Qu.: 1348
#  Median :2013-11-21   Median : 1794
#  Mean   :2013-11-21   Mean   : 2352
#  3rd Qu.:2015-06-16   3rd Qu.: 2639
#  Max.   :2017-01-08   Max.   :53517
nrow(traffic)
# [1] 2289
library(ggplot2)
qplot(Date, Pageviews, data=traffic)
qplot(Date, log(Pageviews), data=traffic)
Daily pageviews/traffic to Gwern.net, 2010–2017
Daily pageviews/traffic to Gwern.net, 2010–2017; log-transformed

Two things jump out. The distribution of traffic is weird, with spikes; doing a log-transform to tame the spikes, it is also clearly a non-stationary time-series with autocorrelation, as traffic consistently grows & declines. Neither is surprising: social media sites like Hacker News or Reddit are notorious for creating spikes in site traffic (and sometimes bringing sites down under the load), and I would hope that as I keep writing things, traffic would gradually increase! Nevertheless, both features will make the traffic data difficult to analyze despite having over 6 years of it.
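Both problems can be quantified quickly; a minimal sketch (not part of the original analysis) of the spike frequency and the short-term autocorrelation:

## 1. spikes: fraction of days with more than twice the median traffic
mean(traffic$Pageviews > 2 * median(traffic$Pageviews))
## 2. autocorrelation: lag-1 correlation of the log-transformed series
cor(head(log(traffic$Pageviews), -1), tail(log(traffic$Pageviews), -1))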

Power analysis

Us­ing the his­tor­i­cal traffic data, how easy would it be to de­tect a to­tal traffic re­duc­tion of ~3%, the crit­i­cal bound­ary for the ad­s/no-ads de­ci­sion? Stan­dard non-time-series meth­ods are un­able to de­tect it at any rea­son­able sam­ple size, but us­ing more com­plex time-series-ori­ented meth­ods like ARIMA mod­els (ei­ther NHST or Bayesian), it can be de­tected given sev­eral months of da­ta.

NHST

We can demon­strate with a quick power analy­sis: if we pick a ran­dom sub­set of days and force a de­crease of 2.8% (the value on the de­ci­sion bound­ary), can we de­tect that?

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(Pageviews ~ Ads, data=ads)
# t = 0.27315631, df = 2285.9151, p-value = 0.7847577
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -203.7123550  269.6488393
# sample estimates:
# mean in group 0 mean in group 1
#     2335.331004     2302.362762
wilcox.test(log(Pageviews) ~ Ads, data=ads)
# W = 665105.5, p-value = 0.5202686
t.test(log(Pageviews) ~ Ads, data=ads)
# t = 0.36685265, df = 2286.8348, p-value = 0.7137629
sd(ads$Pageviews)
# [1] 2880.044636

The an­swer is no. We are nowhere near be­ing able to de­tect it with ei­ther a t-test or the non­para­met­ric u-test (which one might ex­pect to han­dle the strange dis­tri­b­u­tion bet­ter), and the log trans­form does­n’t help. We can hardly even see a hint of the de­crease in the t-test—the de­crease in the mean is ~30 pageviews but the stan­dard de­vi­a­tions are ~2900 and ac­tu­ally big­ger than the mean. So the spikes in the traffic are crip­pling the tests and this can­not be fixed by wait­ing a few more months since it’s in­her­ent to the da­ta.

If our trusty friend the log-transform can’t help, what can we do? In this case, we know that the reality here is literally a mixture model, as the spikes are being driven by qualitatively distinct phenomena like a Gwern.net link appearing on the HN front page, as compared to normal daily traffic from existing links & search traffic19; but mixture models tend to be hard to use. One ad hoc approach to taming the spikes would be to effectively throw them out by truncating/clipping everything at a certain point (since the daily traffic average is ~1700, perhaps twice that, 3000):

ads <- traffic
ads$Ads <- rbinom(nrow(ads), size=1, p=0.5)
ads[ads$Ads==1,]$Pageviews <- round(ads[ads$Ads==1,]$Pageviews * (1-0.028))
ads[ads$Pageviews>3000,]$Pageviews <- 3000
sd(ads$Pageviews)
# [1] 896.8798131
wilcox.test(Pageviews ~ Ads, data=ads)
# W = 679859, p-value = 0.1131403
t.test(Pageviews ~ Ads, data=ads)
# t = 1.3954503, df = 2285.3958, p-value = 0.1630157
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -21.2013943 125.8265361
# sample estimates:
# mean in group 0 mean in group 1
#     1830.496049     1778.183478

Bet­ter but still in­ad­e­quate. Even with the spikes tamed, we con­tinue to have prob­lems; the logged graph sug­gests that we can’t afford to ig­nore the time-series as­pect. A check of au­to­cor­re­la­tion in­di­cates sub­stan­tial au­to­cor­re­la­tion out to lags as high as 8 days:

pacf(traffic$Pageviews, main="gwern.net traffic time-series autocorrelation")
Au­to­cor­re­la­tion in Gw­ern.net daily traffic: pre­vi­ous daily traffic is pre­dic­tive of cur­rent traffic up to t = 8 days ago

The usual regression framework for time-series is the ARIMA model, in which the current daily value is regressed on each of the previous days’ values (with an estimated coefficient for each lag, as day 8 ought to be less predictive than day 7 and so on) and possibly a difference and a moving average (also with varying distances in time). The models are usually denoted as “ARIMA([days back to use as lags], [days back to difference], [days back for moving average])”. So the pacf suggests that an ARIMA(8,0,0) might work—lags back 8 days, but agnostic about differencing and moving averages. R’s forecast library helpfully includes both an arima regression function and an auto.arima to do model comparison (sketched below). auto.arima generally finds that a much simpler model than ARIMA(8,0,0) works best, preferring models like ARIMA(4,1,1) (presumably the differencing and moving-average steal enough of the distant lags’ predictive power that they no longer look better to AIC).
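A sketch of that model-comparison step (the exact orders auto.arima selects can vary with the data and settings):

library(forecast)
auto.arima(ads$Pageviews, xreg=ads$Ads)
## typically prefers a low-order model along the lines of ARIMA(4,1,1) with the Ads regressor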

Such an ARIMA model works well and now we can de­tect our sim­u­lated effect:

library(forecast)
library(lmtest)
l <- lm(Pageviews ~ Ads, data=ads); summary(l)
# Residuals:
#       Min        1Q    Median        3Q       Max
# -2352.275  -995.275  -557.275   294.725 51239.783
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 2277.21747   86.86295 26.21621  < 2e-16
# Ads           76.05732  120.47141  0.63133  0.52789
#
# Residual standard error: 2879.608 on 2287 degrees of freedom
# Multiple R-squared:  0.0001742498,    Adjusted R-squared:  -0.0002629281
# F-statistic: 0.3985787 on 1 and 2287 DF,  p-value: 0.5278873
a <- arima(ads$Pageviews, xreg=ads$Ads, order=c(4,1,1))
summary(a); coeftest(a)
# Coefficients:
#             ar1         ar2         ar3         ar4         ma1      ads$Ads
#       0.5424117  -0.0803198  -0.0310823  -0.0094242  -0.8906085  -52.4148244
# s.e.  0.0281538   0.0245621   0.0245500   0.0240701   0.0189952   10.5735098
#
# sigma^2 estimated as 89067.31:  log likelihood = -16285.31,  aic = 32584.63
#
# Training set error measures:
#                       ME        RMSE         MAE          MPE        MAPE         MASE             ACF1
# Training set 3.088924008 298.3762646 188.5442545 -6.839735685 31.17041388 0.9755280945 -0.0002804416646
#
# z test of coefficients:
#
#               Estimate     Std. Error   z value   Pr(>|z|)
# ar1       0.5424116948   0.0281538043  19.26602 < 2.22e-16
# ar2      -0.0803197830   0.0245621012  -3.27007  0.0010752
# ar3      -0.0310822966   0.0245499783  -1.26608  0.2054836
# ar4      -0.0094242194   0.0240700967  -0.39153  0.6954038
# ma1      -0.8906085375   0.0189952434 -46.88587 < 2.22e-16
# Ads     -52.4148243747  10.5735097735  -4.95718 7.1523e-07

One might rea­son­ably ask, what is do­ing the real work, the trun­ca­tion/trim­ming or the ARIMA(4,1,1)? The an­swer is both; if we go back and re­gen­er­ate the ads dataset with­out the trun­ca­tion/trim­ming and we look again at the es­ti­mated effect of Ads, we find it changes to

#               Estimate     Std. Error   z value   Pr(>|z|)
# ...
# Ads      26.3244086579 81.2278521231    0.32408 0.74587666

For the sim­ple lin­ear model with no time-series or trun­ca­tion, the stan­dard er­ror on the ads effect is 121; for the time-series with no trun­ca­tion, the stan­dard er­ror is 81; and for the time se­ries plus trun­ca­tion, the stan­dard er­ror is 11. My con­clu­sion is that we can’t leave ei­ther one out if we are to reach cor­rect con­clu­sions in any fea­si­ble sam­ple size—we must deal with the spikes, and we must deal with the time-series as­pect.

So having settled on a specific ARIMA model with truncation, I can do a power analysis. For a time-series, the simple bootstrap is inappropriate as it ignores the autocorrelation; the right bootstrap is the block bootstrap: for each hypothetical sample size n, split the traffic history into as many non-overlapping n-sized chunks m as possible, select m of them with replacement, and run the analysis. This is implemented in the R boot library.

library(boot)
library(lmtest)

## fit models & report p-value/test statistic
ut <- function(df) { wilcox.test(Pageviews ~ Ads, data=df)$p.value }
at <- function(df) { coeftest(arima(df$Pageviews, xreg=df$Ads, order=c(4,1,1)))[,4][["df$Ads"]] }

## create the hypothetical effect, truncate, and test
simulate <- function (df, testFunction, effect=0.03, truncate=TRUE, threshold=3000) {
    df$Ads <- rbinom(nrow(df), size=1, p=0.5)
    df[df$Ads==1,]$Pageviews <- round(df[df$Ads==1,]$Pageviews * (1-effect))
    if(truncate) { df[df$Pageviews>threshold,]$Pageviews <- threshold }
    return(testFunction(df))
    }
power <- function(ns, df, test, effect, alpha=0.05, iters=2000) {
    powerEstimates <- vector(mode="numeric", length=length(ns))
    i <- 1
    for (n in ns) {
        tsb <- tsboot(df, function(d){simulate(d, test, effect=effect)}, iters, l=n,
                          sim="fixed", parallel="multicore", ncpus=getOption("mc.cores"))
        powerEstimates[i] <- mean(tsb$t < alpha)
        i <- i+1 }
    return(powerEstimates) }

ns <- seq(10, 2000, by=5)
## test the critical value but also 0 effect to check whether alpha is respected
powerUtestNull <- power(ns, traffic, ut, 0)
powerUtest     <- power(ns, traffic, ut, 0.028)
powerArimaNull <- power(ns, traffic, at, 0)
powerArima     <- power(ns, traffic, at, 0.028)
p1 <- qplot(ns, powerUtestNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p2 <- qplot(ns, powerUtest) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p3 <- qplot(ns, powerArimaNull) + stat_smooth() + coord_cartesian(ylim = c(0, 1))
p4 <- qplot(ns, powerArima) + stat_smooth() + coord_cartesian(ylim = c(0, 1))

library(grid)
library(gridExtra)
grid.arrange(p1, p3, p2, p4, ncol = 2, name = "Power analysis of detecting null effect/2.8% reduction using u-test and ARIMA regression")
Block­-boot­strap power analy­sis of abil­ity to de­tect 2.8% traffic re­duc­tion us­ing u-test & ARIMA time-series model (bot­tom row), while pre­serv­ing nom­i­nal false-pos­i­tive er­ror con­trol (top row)

So the false-positive rate is preserved for both, and the ARIMA requires a reasonable-looking n < 70 to be well-powered, but the u-test power is bizarre—the power is never great, never exceeding 31.6%, and it actually decreases after a certain point, which is not something you usually see in a power graph. (The ARIMA power curve is also odd, but at least it doesn’t get worse with more data!) My speculation is that as the time-series window increases, more of the spikes come into view of the u-test, making the distribution dramatically wider, and this more than overwhelms the gain in detectability; hypothetically, with even more years of data, the spikes would stop coming as a surprise and the gradual hypothetical damage of the ads would then become more visible with increasing sample size, as expected.

Bayesian

Bayesian ARIMA models are preferable as they deliver the posterior distribution necessary for decision-making, which allows weighted averages over all the possible effects, and they can benefit from including my prior information that the effect of ads is definitely negative but probably close to zero. Some examples of Bayesian ARIMA time-series analysis:

An ARMA(3,1) mixture model in JAGS (3 autoregressive lags, 1 moving-average term, with a 2-component mixture for traffic spikes):

library(runjags)
arima311 <- "model {
  # initialize the first 3 days, which we need to fit the 3 lags/moving-averages for day 4:
  # y[1] <- 50
  # y[2] <- 50
  # y[3] <- 50
  eps[1] <- 0
  eps[2] <- 0
  eps[3] <- 0

  for (i in 4:length(y)) {
     y[i] ~ dt(mu[i], tauOfClust[clust[i]], nuOfClust[clust[i]])
     mu[i] <- muOfClust[clust[i]] + w1*y[i-1] + w2*y[i-2] + w3*y[i-3] + m1*eps[i-1]
     eps[i] <- y[i] - mu[i]

     clust[i] ~ dcat(pClust[1:Nclust])
  }

  for (clustIdx in 1:Nclust) {
      muOfClust[clustIdx] ~ dnorm(100, 1.0E-06)
      sigmaOfClust[clustIdx] ~ dnorm(500, 1e-06)
      tauOfClust[clustIdx] <- pow(sigmaOfClust[clustIdx], -2)
      nuMinusOneOfClust[clustIdx] ~ dexp(5)
      nuOfClust[clustIdx] <- nuMinusOneOfClust[clustIdx] + 1
  }
  pClust[1:Nclust] ~ ddirch(onesRepNclust)

  m1 ~ dnorm(0, 4)
  w1 ~ dnorm(0, 5)
  w2 ~ dnorm(0, 4)
  w3 ~ dnorm(0, 3)
  }"
y <- traffic$Pageviews
Nclust = 2
clust = rep(NA,length(y))
clust[which(y<1800)] <- 1 # seed labels for cluster 1, normal traffic
clust[which(y>4000)] <- 2 # seed labels for cluster 2, spikes
model <- run.jags(arima311, data = list(y=y, Nclust = Nclust, clust=clust, onesRepNclust = c(1,1) ),
    monitor=c("w1", "w2", "w3", "m1", "pClust", "muOfClust", "sigmaOfClust", "nuOfClust"),
    inits=list(w1=0.55, w2=0.37, w3=-0.01, m1=0.45, pClust=c(0.805, 0.195), muOfClust=c(86.5, 781), sigmaOfClust=c(156, 763), nuMinusOneOfClust=c(2.4-1, 1.04-1)),
    n.chains = getOption("mc.cores"), method="rjparallel", sample=500)
summary(model)

JAGS is painfully slow: 5h+ for 500 sam­ples. Sharper pri­ors, re­mov­ing a 4th-order ARIMA lag, & bet­ter ini­tial­iza­tion did­n’t help. The level of au­to­cor­re­la­tion might make fit­ting with JAGS’s Gibbs MCMC diffi­cult, so I tried switch­ing to Stan, which is gen­er­ally faster & its HMC MCMC typ­i­cally deals with hard mod­els bet­ter:

traffic <- read.csv("https://www.gwern.net/docs/traffic/20170108-traffic.csv", colClasses=c("Date", "integer", "logical"))
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> K; // number of mixture components
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // observations
    }
    parameters {
        simplex[K] theta; // mixing proportions
        real<lower=0, upper=100>    muM[K]; // locations of mixture components
        real<lower=0.01, upper=1000> sigmaM[K]; // scales of mixture components
        real<lower=0.01, upper=5>    nuM[K];

        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff
    }
    model {

        real mu[T, K]; // prediction for time t
        vector[T] err; // error for time t
        real ps[K]; // temp for log component densities
        // initialize the first 4 days for the lags
        mu[1][1] = 0; // assume err[0] == 0
        mu[2][1] = 0;
        mu[3][1] = 0;
        mu[4][1] = 0;
        err[1] = y[1] - mu[1][1];
        err[2] = y[2] - mu[2][1];
        err[3] = y[3] - mu[3][1];
        err[4] = y[4] - mu[4][1];


        muM ~ normal(0, 5);
        sigmaM ~ cauchy(0, 2);
        nuM ~ exponential(1);
        ma ~ normal(0, 0.5);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);

        for (t in 5:T) {
            for (k in 1:K) {
                mu[t][k] = muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
                err[t] = y[t] - mu[t][k];

                ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
            }
        target += log_sum_exp(ps);
        }
    }"
# 17m for 200 samples
nchains <- getOption("mc.cores") - 1
# original, based on MCMC:
# inits <- list(theta=c(0.92, 0.08), muM=c(56.2, 0.1), sigmaM=c(189.7, 6), nuM=c(1.09, 0.61), phi1=1.72, phi2=-0.8, phi3=0.08, phi4=0, ma=-0.91)
# optimized based on gradient descent
inits <- list(theta=c(0.06, 0.94), muM=c(0.66, 0.13), sigmaM=c(5.97, 190.05), nuM=c(1.40, 1.10), phi1=1.74, phi2=-0.83, phi3=0.10, phi4=-0.01, ma=-0.93)
model2 <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, K=2), init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=50); print(model2)
traceplot(model2)

This was per­haps the first time I’ve at­tempted to write a com­plex model in Stan, in this case, adapt­ing a sim­ple ARIMA time-series model from the Stan man­u­al. Stan has some in­ter­est­ing fea­tures like the vari­a­tional in­fer­ence op­ti­mizer which can find sen­si­ble pa­ra­me­ter val­ues for com­plex mod­els in sec­onds, an ac­tive com­mu­nity & in­volved de­vel­op­ers & an ex­cit­ing roadmap, and when Stan works it is sub­stan­tially faster than the equiv­a­lent JAGS mod­el; but I en­coun­tered a num­ber of draw­backs.

Given diffi­cul­ties in run­ning JAGS/Stan and slow­ness of the fi­nal mod­els, I ul­ti­mately did not get a suc­cess­ful power analy­sis of the Bayesian mod­els, and I opted to es­sen­tially wing it and hope that ~10 months would be ad­e­quate for mak­ing a de­ci­sion whether to dis­able ads per­ma­nent­ly, en­able ads per­ma­nent­ly, or con­tinue the ex­per­i­ment.

Analysis

Descriptive

Google Analytics reports that overall traffic from 2017-01-01–2017-10-15 was 179,550 unique users with 380,140 page-views & a session duration of 1m37s; this is a typical set of traffic statistics for my site.

Merged traffic & Ad­Sense data:

traffic <- read.csv("https://www.gwern.net/docs/traffic/2017-10-20-abtesting-adsense.csv",
            colClasses=c("Date", "integer", "integer", "numeric", "integer", "integer",
                         "integer", "numeric", "integer", "numeric"))
library(skimr)
skim(traffic)
#  n obs: 288
#  n variables: 10
#
# Variable type: Date
#  variable missing complete   n        min        max     median n_unique
#      Date       0      288 288 2017-01-01 2017-10-15 2017-05-24      288
#
# Variable type: integer
#        variable missing complete   n     mean       sd    p0      p25     p50      p75   p100     hist
#  Ad.impressions       0      288 288   358.95   374.67     0    33      127.5   708      1848 ▇▁▂▃▁▁▁▁
#    Ad.pageviews       0      288 288   399.02   380.62     0    76.5    180.5   734.5    1925 ▇▁▂▃▁▁▁▁
#           Ads.r       0      288 288     0.44     0.5      0     0        0       1         1 ▇▁▁▁▁▁▁▆
#       Pageviews       0      288 288  1319.93   515.8    794  1108.75  1232    1394      8310 ▇▁▁▁▁▁▁▁
#        Sessions       0      288 288   872.1    409.91   561   743      800     898.25   6924 ▇▁▁▁▁▁▁▁
#      Total.time       0      288 288 84517.41 24515.13 39074 70499.5  81173.5 94002    314904 ▅▇▁▁▁▁▁▁
#
# Variable type: numeric
#             variable missing complete   n  mean    sd    p0    p25   p50    p75   p100     hist
#   Ad.pageviews.logit       0      288 288 -1.46  2.07 -8.13 -2.79  -1.8    0.53   1.44 ▁▁▁▂▆▃▂▇
#          Ads.percent       0      288 288  0.29  0.29  0     0.024  0.1    0.59   0.77 ▇▁▁▁▁▂▃▂
#  Avg.Session.seconds       0      288 288 99.08 17.06 45.48 87.3   98.98 109.22 145.46 ▁▁▃▅▇▅▂▁
sum(traffic$Sessions); sum(traffic$Pageviews)
# [1] 251164
# [1] 380140

library(ggplot2)
qplot(Date, Pageviews, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(750,3089)) +
    labs(color="Ads", title="AdSense advertising effect on Gwern.net daily traffic, January-October 2017")
qplot(Date, Total.time, color=as.logical(Ads.r), data=traffic) + stat_smooth() +
    coord_cartesian(ylim = c(38000,190000)) +
    labs(color="Ads", title="AdSense advertising effect on total time spent reading Gwern.net , January-October 2017")

Traffic looks sim­i­lar whether count­ing by to­tal page views or to­tal time read­ing (av­er­age-time-read­ing-per-ses­sion x num­ber-of-ses­sion­s); the data is defi­nitely au­to­cor­re­lat­ed, some­what noisy, and I get a sub­jec­tive im­pres­sion that there is a small de­crease in pageviews/­to­tal-time on the ad­ver­tis­ing days (de­spite the mea­sure­ment er­ror):

Ad­Sense ban­ner ad A/B test of effect on Gw­ern.net traffic: daily pageviews, Jan­u­ary–Oc­to­ber 2017 split by ad­ver­tis­ing con­di­tion
Daily to­tal-time-spen­t-read­ing Gw­ern.net, Jan­u­ary–Oc­to­ber 2017 (s­plit by A/B)

Simple tests & regressions

As ex­pected from the power analy­sis, the usual tests are un­able to re­li­ably de­tect any­thing but it’s worth not­ing that the point-es­ti­mates of both the mean & me­dian in­di­cate the ads are worse:

t.test(Pageviews ~ Ads.r, data=traffic)
#   Welch Two Sample t-test
#
# data:  Pageviews by Ads.r
# t = 0.28265274, df = 178.00378, p-value = 0.7777715
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -111.4252500  148.6809819
# sample estimates:
# mean in group 0 mean in group 1
#     1328.080247     1309.452381
wilcox.test(Pageviews ~ Ads.r, conf.int=TRUE, data=traffic)
#   Wilcoxon rank sum test with continuity correction
#
# data:  Pageviews by Ads.r
# W = 11294, p-value = 0.1208844
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  -10.00001128  87.99998464
# sample estimates:
# difference in location
#             37.9999786

The tests can only han­dle a bi­nary vari­able, so next is a quick sim­ple lin­ear mod­el, and then a quick & easy Bayesian re­gres­sion in brms with an au­to­cor­re­la­tion term to im­prove on the lin­ear mod­el; both turn up a weak effect for the bi­nary ran­dom­iza­tion, and then much stronger (and neg­a­tive) for the more ac­cu­rate per­cent­age mea­sure­ment:

summary(lm(Pageviews ~ Ads.r, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -534.0802 -207.6093  -90.0802   65.9198 7000.5476
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1328.08025   40.58912 32.72010  < 2e-16
# Ads.r        -18.62787   61.36498 -0.30356  0.76168
#
# Residual standard error: 516.6152 on 286 degrees of freedom
# Multiple R-squared:  0.0003220914,    Adjusted R-squared:  -0.003173286
# F-statistic: 0.09214781 on 1 and 286 DF,  p-value: 0.7616849
summary(lm(Pageviews ~ Ads.percent, data = traffic))
# ...Residuals:
#       Min        1Q    Median        3Q       Max
# -579.4145 -202.7547  -89.1473   60.7785 6928.3160
#
# Coefficients:
#               Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 1384.17269   42.77952 32.35596  < 2e-16
# Ads.percent -224.79052  105.98550 -2.12096 0.034786
#
# Residual standard error: 512.6821 on 286 degrees of freedom
# Multiple R-squared:  0.01548529,  Adjusted R-squared:  0.01204293
# F-statistic: 4.498451 on 1 and 286 DF,  p-value: 0.03478589

library(brms)
b <- brm(Pageviews ~ Ads.r, autocor = cor_bsts(), iter=20000, chains=8, data = traffic); b
#  Family: gaussian(identity)
# Formula: Pageviews ~ Ads.r
#    Data: traffic (Number of observations: 288)
# Samples: 8 chains, each with iter = 20000; warmup = 10000; thin = 1;
#          total post-warmup samples = 80000
#     ICs: LOO = Not computed; WAIC = Not computed
#
# Correlation Structure: bsts(~1)
#         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigmaLL    50.86     16.47    26.53    90.49        741 1.01
#
# Population-Level Effects:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.r    50.36     63.19   -74.73   174.21      34212    1
#
# Family Specific Parameters:
#       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sigma   499.77     22.38   457.76    545.4      13931    1
#
# Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
# is a crude measure of effective sample size, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
 b2 <- brm(Pageviews ~ Ads.percent, autocor = cor_bsts(), chains=8, data = traffic); b2
...
# Population-Level Effects:
#             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Ads.percent    -91.8    113.05  -317.85   131.41       2177    1

This is im­per­fect since treat­ing per­cent­age as ad­di­tive is odd, as one would ex­pect it to be mul­ti­plica­tive in some sense. As well, brms makes it con­ve­nient to throw in a sim­ple Bayesian struc­tural au­to­cor­re­la­tion (cor­re­spond­ing to AR(1) if I am un­der­stand­ing it cor­rect­ly) but the func­tion in­volved does not sup­port the high­er-order lags or mov­ing av­er­age in­volved in traffic, so is weaker than it could be.
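A sketch of a more multiplicative variant (not part of the original analysis): modeling log pageviews, so the Ads.percent coefficient approximates a proportional change, with a plain AR(1) autocorrelation term via brms’s cor_ar():

b3 <- brm(log(Pageviews) ~ Ads.percent, autocor = cor_ar(~1, p = 1),
          chains=8, data = traffic); b3
## exp(estimate) - 1 then approximates the fractional traffic change at 100% ad exposure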

Stan ARIMA time-series model

For the real analy­sis, I do a fully Bayesian analy­sis in Stan, us­ing ARIMA(4,0,1) time-series, a mul­ti­plica­tive effect of ads as per­cent­age of traffic, skep­ti­cal in­for­ma­tive pri­ors of small neg­a­tive effects, and ex­tract­ing pos­te­rior pre­dic­tions (of each day if hy­po­thet­i­cally it were not ad­ver­tis­ing-affect­ed) for fur­ther analy­sis.

Model de­fi­n­i­tion & se­tup:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // traffic
        real Ads[T]; // Ad logit
    }
    parameters {
        real<lower=0> muM;
        real<lower=0> sigma;
        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff

        real<upper=0> ads; // advertising coeff; can only be negative

        real<lower=0, upper=10000> y_pred[T]; // traffic predictions
    }
    model {
        real mu[T]; // prediction for time t
        vector[T] err; // error for time t

        // initialize the first 4 days for the lags
        mu[1] = 0;
        mu[2] = 0;
        mu[3] = 0;
        mu[4] = 0;
        err[1] = y[1] - mu[1];
        err[2] = y[2] - mu[2];
        err[3] = y[3] - mu[3];
        err[4] = y[4] - mu[4];

        muM ~ normal(1300, 500);
        sigma ~ exponential(250);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);
        ma ~ normal(0, 0.5);
        ads  ~ normal(0,1);

        for (t in 5:T) {
          mu[t] = muM + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
          err[t] = y[t] - mu[t];
          y[t]      ~ normal(mu[t] * (1 + ads*Ads[t]),       sigma);
          y_pred[t] ~ normal(mu[t] * 1, sigma); // for comparison, what would the ARIMA predict for a today w/no ads?
        }
    }"

# extra flourish: find posterior mode via Stan's new L-BFGS gradient descent optimization feature;
# also offers a good initialization point for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent), hessian=TRUE)
round(optimized$par, digits=3)
#      muM       sigma        phi1        phi2        phi3        phi4          ma         ads
# 1352.864      65.221      -0.062       0.033      -0.028       0.083       0.249      -0.144
## Initialize from previous MCMC run:
inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.15)
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.percent),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, iter=200000); print(model)

Re­sults from the Bayesian mod­el, plus a sim­ple per­mu­ta­tion test as a san­i­ty-check on the data+­mod­el:

# ...Elapsed Time: 413.816 seconds (Warm-up)
#                654.858 seconds (Sampling)
#                1068.67 seconds (Total)
#
# Inference for Stan model: bacd35459b712679e6fc2c2b6bc0c443.
# 1 chains, each with iter=2e+05; warmup=1e+05; thin=1;
# post-warmup draws per chain=1e+05, total post-warmup draws=1e+05.
#
#                  mean se_mean      sd      2.5%       25%       50%       75%     97.5%  n_eff Rhat
# muM           1355.27    0.20   47.54   1261.21   1323.50   1355.53   1387.20   1447.91  57801    1
# sigma           65.61    0.00    0.29     65.03     65.41     65.60     65.80     66.18 100000    1
# phi1            -0.06    0.00    0.04     -0.13     -0.09     -0.06     -0.04      0.01  52368    1
# phi2             0.03    0.00    0.01      0.01      0.03      0.03      0.04      0.05 100000    1
# phi3            -0.03    0.00    0.01     -0.04     -0.03     -0.03     -0.02     -0.01 100000    1
# phi4             0.08    0.00    0.01      0.07      0.08      0.08      0.09      0.10 100000    1
# ma               0.25    0.00    0.04      0.18      0.23      0.25      0.27      0.32  52481    1
# ads             -0.14    0.00    0.01     -0.16     -0.15     -0.14     -0.14     -0.13 100000    1
# ...
mean(extract(model)$ads)
# [1] -0.1449574151

## permutation test to check for model misspecification: shuffle ad exposure and rerun the model,
## see what the empirical null distribution of the ad coefficient is and how often it yields a
## reduction of >= -14.5%:
empiricalNull <- numeric()
iters <- 5000
for (i in 1:iters) {
    df <- traffic
    df$Ads.percent <- sample(df$Ads.percent)
    inits <- list(muM=1356, sigma=65.6, phi1=-0.06, phi2=0.03, phi3=-0.03, phi4=0.08, ma=0.25, ads=-0.01)
    # nchains <- 1; options(mc.cores = 1) # disable multi-core to work around occasional Stan segfaults
    model <- stan(model_code=m, data=list(T=nrow(df), y=df$Pageviews, Ads=df$Ads.percent),
                   init=replicate(nchains, inits, simplify=FALSE), chains=nchains); print(model)
    adEstimate <- mean(extract(model)$ads)
    empiricalNull[i] <- adEstimate
}
summary(empiricalNull); sum(empiricalNull < -0.1449574151) / length(empiricalNull)
#        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
# -0.206359600 -0.064702600 -0.012325460 -0.035497930 -0.001696464 -0.000439064
# [1] 0.0136425648

We see a consistent & large estimate of harm: the mean of traffic falls by −14.5% (95% CI: −0.16 to −0.13; permutation test: p = 0.01) on 100% ad-affected traffic! Given that these traffic statistics are sourced from Google Analytics, which could be blocked along with the ad by an adblocker, and that such ‘invisible’ traffic appears to average ~10% of total traffic, the true estimate is presumably somewhat larger because there is more actual traffic than measured. Ad exposure, however, was not 100%, simply because of the adblock/randomization issues.

To more directly calculate the harm, I turn to the posterior predictions, which were computed for each day under the hypothetical of no advertising; one would expect the predictions for all days to be somewhat higher than the actual traffic was (because almost every day has some non-zero % of ad-affected traffic), and, summed or averaged over all days, that gives the predicted loss of traffic from ads:

mean(traffic$Pageviews)
# [1] 1319.930556
## fill in defaults when extracting mean posterior predictives:
traffic$Prediction <- c(1319,1319,1319,1319, colMeans(extract(model)$y_pred)[5:288])
mean(with(traffic, Prediction - Pageviews) )
# [1] 53.67329617
mean(with(traffic, (Prediction - Pageviews) / Pageviews) )
# [1] 0.09668207805
sum(with(traffic, Prediction - Pageviews) )
# [1] 15457.9093

So dur­ing the A/B test, the ex­pected es­ti­mated loss of traffic is ~9.7%.
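The ~9.7% figure uses only the posterior-mean predictions; a minimal sketch of propagating the full posterior uncertainty into a credible interval for the same quantity:

yPred  <- extract(model)$y_pred[,5:288]  # posterior draws x days (first 4 days used defaults above)
actual <- traffic$Pageviews[5:288]
lossPerDraw <- apply(yPred, 1, function(p) { mean((p - actual) / actual) })
quantile(lossPerDraw, c(0.025, 0.5, 0.975))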

Decision

As this is so far past the decision threshold, and the 95% credible interval around −0.14 is extremely tight (−0.16 to −0.13) and rules out acceptable losses in the 0–2% range, the EVSI of any additional sampling is negative & not worth calculating.

Thus, I re­moved the Ad­Sense ban­ner ad in the mid­dle of 2017-09-11.

Discussion

The re­sult is sur­pris­ing. I had been ex­pect­ing some de­gree of harm but the es­ti­mated re­duc­tion is much larger than I ex­pect­ed. Could ban­ner ads re­ally be that harm­ful?

The effect is es­ti­mated with con­sid­er­able pre­ci­sion, so it’s al­most cer­tainly not a fluke of the data (if any­thing I col­lected far more data than I should’ve); there weren’t many traffic spikes to screw with the analy­sis, so omit­ting mix­ture model or t-s­cale re­sponses in the model does­n’t seem like it should be an is­sue ei­ther; the mod­el­ing it­self might be dri­ving it, but the crud­est tests sug­gest a sim­i­lar level of harm (just not at high sta­tis­ti­cal-sig­nifi­cance or pos­te­rior prob­a­bil­i­ty); it does seem to be vis­i­ble in the scat­ter­plot; and the more re­al­is­tic mod­el­s—which in­clude time-series as­pects I know ex­ist from the long his­tor­i­cal time-series of Gw­ern.net traffic & skep­ti­cal pri­ors en­cour­ag­ing small effect­s—es­ti­mate it much bet­ter as I ex­pected from my pre­vi­ous power analy­ses, and con­sid­er­able tin­ker­ing with my orig­i­nal ARIMA(4,0,1) Stan model to check my un­der­stand­ing of my code (I haven’t used Stan much be­fore) did­n’t turn up any is­sues or make the effect go away. So as far as I can tell, this effect is re­al. I still doubt my re­sults, but it’s con­vinc­ing enough for me to dis­able ads, at least.

Does it gen­er­al­ize? I ad­mit Gw­ern.net is un­usu­al: highly tech­ni­cal long­form sta­tic con­tent in a min­i­mal­ist lay­out op­ti­mized for fast load­ing & ren­der­ing cater­ing to An­glo­phone STEM-types in the USA. It is en­tirely pos­si­ble that for most web­sites, the effect of ads is much smaller be­cause they al­ready load so slow, have much busier clut­tered de­signs, their users have less vis­ceral dis­taste for ad­ver­tis­ing or are more eas­ily tar­geted for use­ful ad­ver­tis­ing etc, and thus Gw­ern.net is merely an out­lier for whom re­mov­ing ads makes sense (par­tic­u­larly given my op­tion of be­ing Pa­tre­on-sup­ported rather than de­pend­ing en­tirely on ads like many me­dia web­sites must). I have no way of know­ing whether or not this is true, and as al­ways with op­ti­miza­tions, one should bench­mark one’s own spe­cific use case; per­haps in a few years more re­sults will be re­ported and it will be seen if my re­sults are merely a cod­ing er­ror or an out­lier or some­thing else.

If a max loss of 14% and average loss of ~9% (both of which could be higher for sites whose users don’t use adblock as much) is accurate and generalizable to other blogs/websites (as the replications since my first A/B test imply), there are many implications: in particular, it implies a huge deadweight loss to Internet users from advertising, and suggests advertising may be a net loss for many smaller sites. (It seems unlikely, to say the least, that every single website or business in existence would deliver precisely the amount of ads they do now while ignorant of the true costs, by sheer luck having made the optimal tradeoff, and likely that many would prefer to reduce their ad intensity or remove ads entirely.) Ironically, in the latter case, those sites may not yet have realized, and may never realize, how much the pennies they earn from advertising are costing them, because the harm won’t show up in standard single-user A/B testing due to either measurement error hiding much of the effect or because it exists as an emergent global effect, requiring long-term experimentation & relatively sophisticated time-series modeling—a decrease of 10% is important and yet site traffic exogenously changes on a daily, much less weekly or monthly, basis by more than 10%, rendering even a drastic on/off change invisible to the naked eye.

There may be a connection here to earlier observations on the business of advertising questioning whether advertising works, works more than it hurts or cannibalizes other avenues, works sufficiently well to be profitable, or sufficiently well to know if it is working at all. Attempts to demonstrate that advertising works usually fail, and the more rigorous the evaluation, the smaller the effects are: on purely statistical grounds, it should be hard to cost-effectively show that advertising works at all; there is publication bias in published estimates of advertising efficacy; Steve Sailer has observed that the BehaviorScan field-experiments linking P&G’s individual TV advertisements & grocery store sales likely showed little effect; eBay’s own experiments came to a similar conclusion (Blake et al 2014)20; P&G & JPMorgan cut their digital ad spending (while many São Paulo-style low-advertising retailers continue to succeed); correlational attempts to predict advertising effects are extremely inaccurate (Lewis et al 2011); political science has difficulty showing any causal impact of campaign advertisement spending on victories (Kalla & Broockman 2017; and most recently, Donald Trump), which may be related to why [there is so little money in politics](/docs/economics/2003-ansolabehere.pdf "'Why is There so Little Money in U.S. Politics?', Ansolabehere et al 2003"); behavioral profiling adds minimal value to ad efficacy (Marotta et al 2019); and many anecdotal reports seriously question the value of Facebook or Google advertising for businesses, yielding mistaken/curious or fraudulent or just useless traffic. One counterpoint is Johnson et al 2017, which takes up the Lewis/Gordon gauntlet and, using n = 2.2b/k = 432 (!), is able to definitively establish small advertising effects (but driven by lower-quality traffic, heterogeneous effects, and with modest long-term effects).

Followup test

An ac­tive & skep­ti­cal dis­cus­sion en­sued on Hacker News & else­where after I posted my first analy­sis. Many peo­ple be­lieved the re­sults, but many were skep­ti­cal. Nor could I blame them—while all the analy­ses turn in neg­a­tive es­ti­mates, it was (then) only the one re­sult on an un­usual web­site with the head­line re­sult es­ti­mated by an un­usu­ally com­plex sta­tis­ti­cal model and so on. But this is too im­por­tant to leave un­set­tled like this.

So after see­ing lit­tle ap­par­ent fol­lowup by other web­sites who could pro­vide larger sam­ple sizes & truly in­de­pen­dent repli­ca­tion, I re­solved to run a sec­ond one, which would at least demon­strate that the first one was not a fluke of Gw­ern.net’s 2017 traffic.

As the first result was so strong, I decided to run it not quite so long the second time, for 6 months, 2018-09-27–2019-03-27. (The second experiment can be pooled with the first.) Due to major distractions, I didn’t have time to analyze it when the randomization ended, so I disabled ads and left analysis for later.

Design

For the fol­lowup, I wanted to fix a few of the is­sues and ex­plore some mod­er­a­tors:

  1. the ran­dom­iza­tion would not re­peat the em­bar­rass­ing time­zone mis­take from the first time

  2. to force more tran­si­tions, there would be 2-day pe­ri­ods ran­dom­ized Lat­in-square-style in weeks

  3. to examine the possibility that ads per se are not the problem, but rather that the problem is JS-heavy animated ads like Google AdSense’s (despite the effort Google invests in performance optimization), with their consequent browser performance impact, or personalized ads specifically, I would not use Google AdSense but rather a single, static, fixed ad which was nothing but a small lossily-optimized PNG and which is at least plausibly relevant to Gwern.net readers

    After some cast­ing about, I set­tled on a tech re­cruit­ing ban­ner ad from Triple­byte which I copied from Google Im­ages & edited light­ly, bor­row­ing an affil­i­ate link from SSC (who has Triple­byte ads of his own & who I would be pleased to see get some rev­enue). The fi­nal PNG weighs 12kb. If there is any neg­a­tive effect of ads from this, it is not from ‘per­for­mance im­pact’! (E­spe­cially in light of the repli­ca­tion ex­per­i­ments, I am skep­ti­cal that the per­for­mance bur­den of ads is all that im­por­tan­t.)

As be­fore, it is im­ple­mented as pre-gen­er­ated ran­dom­ness in a JS ar­ray with ads hid­den by de­fault and then en­abled by an in­line JS script based on the ran­dom­ness.

Implementation

Gen­er­ated ran­dom­ness:

library(gtools)
m <- permutations(2, 7, 0:1, repeats=TRUE)
m <- m[sample(1:nrow(m), size=51),]; c(m)
#   [1] 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 1 1 1 1
#  [81] 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1
# [161] 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0
# [241] 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0
# [321] 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1
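A hypothetical helper (not in the original workflow) to emit that vector as a comma-separated literal for pasting into the inline-JS randomness array below:

cat(paste(c(m), collapse=","))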

default.css:

/* hide the ad in the ad A/B test by default */
div#ads { display: block; text-align: center; display: none; }

Full HTML+JS in default.html:

<div id="ads"><a href="https://triplebyte.com/a/Lpa4wbK/d"><img alt="Banner ad for the tech recruiting company Triplebyte: 'Triplebyte is building a background-blind screening process for hiring software engineers'" width="500" height="128" src="/static/img/ads-triplebyte-banner.png"></a></div>

<header>
...
<!-- A/B test of ad effects on site traffic, static ad version: randomize blocks of 7-days based on day-of-year pre-generated randomness -->
<script id="adsABTestJS">
// create Date object for current location
var d = new Date();
// convert to msec
// add local time zone offset
// get UTC time in msec
var utc = d.getTime() + (d.getTimezoneOffset() * 60000);
// create new Date object for different timezone, EST/UTC-4
var now = new Date(utc + (3600000*-4));

  start = new Date("2018-09-28"); var diff = now - start;
  var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
  randomness = [1,0,1,1,1,0,0,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,1,1,0,0,0,0,0,
    1,1,0,0,1,1,1,0,1,0,1,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1,1,0,1,0,1,1,0,0,0,
    0,0,1,1,0,1,0,0,1,1,0,1,1,1,1,1,0,1,0,0,1,0,1,1,1,1,0,1,1,1,1,1,1,0,1,
    1,0,0,0,0,1,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,0,1,0,0,1,1,1,0,1,0,0,1,
    0,1,1,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,1,0,0,0,0,0,0,1,1,1,1,1,0,1,0,0,0,
    1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,1,1,1,0,0,0,1,0,0,1,0,1,1,1,1,1,1,0,0,
    0,0,0,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,0,0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,1,
    0,1,1,0,1,0,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,0,1,1,1,
    0,1,1,0,0,1,1,1,1,0,0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,
    0,0,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,1,0,0,1,
    0,0,1,1,0,0,1,0,0,0,0,1,1];
  if (randomness[day]) {
    document.getElementById("ads").style.display = "block";
    document.getElementById("ads").style.visibility = "visible";
  }
</script>

For the sec­ond run, fresh ran­dom­ness was gen­er­ated the same way:

      start = new Date("2019-04-13"); var diff = now - start;
      var oneDay = 1000 * 60 * 60 * 24; var day = Math.floor(diff / oneDay);
      randomness = [0,1,0,0,0,1,1,1,0,0,0,0,1,1,1,0,1,1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,1,0,
                    0,0,0,1,1,0,0,0,1,0,1,0,0,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,1,
                    1,0,0,0,1,1,1,1,1,1,0,1,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0,1,1,0,1,1,1,1,1,
                    1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0,0,1,1,0,
                    1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,1,1,1,1,
                    1,1,1,1,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,1,1,
                    1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,1,1,0,1,
                    0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,0,
                    1,1,0,1,0,1,0,0,1,1,0,1,1,0,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,1,0,0,
                    0,1,1,1,0,1,0,0,0,1,0,1,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,0,0,1,1,1,1,0,
                    1,0,1,1,1,0,1];
      if (randomness[day]) {
        document.getElementById("ads").style.display = "block";
        document.getElementById("ads").style.visibility = "visible";
      }

The re­sult:

Screenshot of the banner ad’s appearance in September 2018 when the 2nd ad A/B test began

Analysis

An interim analysis of the first 6 months was defeated by surges in site traffic, which, among other things, caused the SD of daily pageviews to increase by >5.3x (!); this destroyed the statistical power, rendering results uninformative, and made it hard to compare to the first A/B test—some of the brms analyses produced CIs of effects on site traffic from +18% to −14%! The Stan model (using a merged dataset of both ad types, treating them as independent variables, ie. fixed effects) estimated an effect of −0.04%.

Inasmuch as I still want a solid result and the best model suggests that the harm from continued A/B testing of the static banner ad is tiny, I decided to extend the second A/B test for another roughly 6 months, starting on 2019-04-13, implemented the same way (but with new randomness). I eventually halted the experiment with the last full day of 2019-12-21, figuring that an additional 253 days of randomization ought to be enough, that I wanted to clear the decks for an A/B test of different colors for the new ‘dark mode’ (to test a hypothesis that green or blue should be the optimal color to contrast with a black background), and, after looking into Stan some more, that dealing with the heterogeneity should be possible with a more flexible model in Stan if brms is unable to handle it.

TODO

Appendices

Stan issues

Ob­ser­va­tions I made while try­ing to de­velop the Gw­ern.net traffic ARIMA model in Stan, in de­creas­ing or­der of im­por­tance:

  • Stan’s treat­ment of mix­ture mod­els and dis­crete vari­ables is… not good. I like mix­ture mod­els & tend to think in terms of them and la­tent vari­ables a lot, which makes the ne­glect an is­sue for me. This was par­tic­u­larly vex­ing in my ini­tial mod­el­ing where I tried to al­low for traffic spikes from HN etc by hav­ing a mix­ture mod­el, with one com­po­nent for ‘reg­u­lar’ traffic and one com­po­nent for traffic surges. This is rel­a­tively straight­for­ward in JAGS as one de­fines a cat­e­gor­i­cal vari­able and in­dexes into it, but it is a night­mare in Stan, re­quir­ing a bizarre hack.

    I defy any Stan user to look at the ex­am­ple mix­ture model in the man­ual and tell me that they nat­u­rally and eas­ily un­der­stand the target/temperature stuff as a way of im­ple­ment­ing a mix­ture mod­el. I sure did­n’t. And once I did get it im­ple­ment­ed, I could­n’t mod­ify it at all. And it was slow, too, erod­ing the orig­i­nal per­for­mance ad­van­tage over JAGS. I was saved only by the fact that the A/B test pe­riod hap­pened to not in­clude many spikes and so I could sim­ply drop the mix­ture as­pect from the model en­tire­ly.

  • mysterious segfaults and errors under a variety of conditions; once when my cat walked over my keyboard, and frequently when running multi-core Stan in a loop. The latter was a serious issue for me when running a permutation test with 5000 iterates: when I ran Stan on 8 chains in parallel normally (hence 1/8th the samples per chain) in a for-loop—the simplest way to implement the permutation test—it would occasionally segfault and take down R. I was forced to reduce the chains to 1 before it stopped crashing, making it 8 times slower (unless I wished to add in manual parallel processing, running 8 separate Stans).

  • Stan’s support for posterior predictives is poor. The manual tells one to use a different module/scope, generated_quantities, lest the code be slow, which apparently requires one to copy-paste the entire likelihood section! Which is especially unfortunate when doing a time-series and requiring access to arrays/vectors declared in a different scope… I never did figure out how to generate posterior predictions ‘correctly’ for that reason, and resorted to the usual Bugs/JAGS-like method (which thankfully does work).

  • Stan’s treatment of missing data is also uninspiring and makes me worried about more complex analyses where I am not so fortunate as to have perfectly clean complete datasets.

  • Stan’s syn­tax is ter­ri­ble, par­tic­u­larly the en­tirely un­nec­es­sary semi­colons. It is 2017, I should not be spend­ing my time adding use­less end-of-line mark­ers. If they are nec­es­sary for C++, they can be added by Stan it­self. This was par­tic­u­larly in­fu­ri­at­ing when painfully edit­ing a model try­ing to im­ple­ment var­i­ous pa­ra­me­ter­i­za­tions and re­run­ning only to find that I had for­got­ten a semi­colon (as no lan­guage I use reg­u­lar­ly—R, Haskell, shell, Python, or Bugs/JAGS—insists on them!).

Stan: mixture time-series

An attempt at an ARIMA(4,0,1) time-series mixture model implemented in Stan, where the mixture has two components: one component for normal traffic, where daily traffic is ~1000, making up >90% of daily data, and one component for the occasional traffic spike, around 10× larger but happening rarely:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m <- "data {
        int<lower=1> K; // number of mixture components
        int<lower=1> T; // number of data points
        int<lower=0> y[T]; // traffic
        int<lower=0, upper=1> Ads[T]; // Ad randomization
    }
    parameters {
        simplex[K] theta; // mixing proportions
        positive_ordered[K] muM; // locations of mixture components; since no points are labeled,
        // like in JAGS, we add a constraint to force an ordering, make it identifiable, and
        // avoid label switching, which will totally screw with the posterior samples
        real<lower=0.01, upper=500> sigmaM[K]; // scales of mixture components
        real<lower=0.01, upper=5>    nuM[K];

        real phi1; // autoregression coeffs
        real phi2;
        real phi3;
        real phi4;
        real ma; // moving avg coeff

        real<upper=0> ads; // advertising coeff; can only be negative
    }
    model {

        real mu[T, K]; // prediction for time t
        vector[T] err; // error for time t
        real ps[K]; // temp for log component densities
        // initialize the first 4 days for the lags
        mu[1][1] = 0; // assume err[0] == 0
        mu[2][1] = 0;
        mu[3][1] = 0;
        mu[4][1] = 0;
        err[1] = y[1] - mu[1][1];
        err[2] = y[2] - mu[2][1];
        err[3] = y[3] - mu[3][1];
        err[4] = y[4] - mu[4][1];


        muM ~ normal(0, 5);
        sigmaM ~ cauchy(0, 2);
        nuM ~ exponential(1);
        ma ~ normal(0, 0.5);
        phi1 ~ normal(0,1);
        phi2 ~ normal(0,1);
        phi3 ~ normal(0,1);
        phi4 ~ normal(0,1);
        ads  ~ normal(0,1);

        for (t in 5:T) {
            for (k in 1:K) {
                mu[t][k] = ads*Ads[t] + muM[k] + phi1 * y[t-1] + phi2 * y[t-2] + phi3 * y[t-3] + phi4 * y[t-4] + ma * err[t-1];
                err[t] = y[t] - mu[t][k];

                ps[k] = log(theta[k]) + student_t_lpdf(y[t] | nuM[k], mu[t][k], sigmaM[k]);
            }
        target += log_sum_exp(ps);
        }
    }"

# find posterior mode via L-BFGS gradient descent optimization; this can be a good set of initializations for MCMC
sm <- stan_model(model_code = m)
optimized <- optimizing(sm, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2), hessian=TRUE)
round(optimized$par, digits=3)
#  theta[1]  theta[2]    muM[1]    muM[2] sigmaM[1] sigmaM[2]    nuM[1]    nuM[2]      phi1      phi2      phi3      phi4        ma
#     0.001     0.999     0.371     2.000     0.648   152.764     0.029     2.031     1.212    -0.345    -0.002     0.119    -0.604
#       ads
#    -0.009

## optimized:
inits <- list(theta=c(0.001, 0.999), muM=c(0.37, 2), sigmaM=c(0.648, 152), nuM=c(0.029, 2), phi1=1.21, phi2=-0.345, phi3=-0.002, phi4=0.119, ma=-0.6, ads=-0.009)
## MCMC means:
nchains <- getOption("mc.cores") - 1
model <- stan(model_code=m, data=list(T=nrow(traffic), y=traffic$Pageviews, Ads=traffic$Ads.r, K=2),
    init=replicate(nchains, inits, simplify=FALSE), chains=nchains, control = list(max_treedepth = 15, adapt_delta = 0.95),
    iter=20000); print(model)
traceplot(model, pars=names(inits))

This code winds up continuing to fail due to label-switching issues (ie. the MCMC bouncing between estimates of what each mixture component is, because of symmetry or lack of data) despite using some of the suggested fixes in the Stan model, like the ordering trick. Since there were so few spikes in the 2017 data alone, the mixture model can’t converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing the 2017 data and I can simply model the outcome as a normal.

EVSI

Demo code of a simple Expected Value of Sample Information (EVSI) analysis in a JAGS log-Poisson model of traffic (a model which turns out to fit the 2017 traffic data worse than a normal distribution, but which I keep here for historical purposes).

We consider an experiment resembling the historical data but with a 5% traffic decrease superimposed due to ads; the reduction is modeled and implies a certain utility loss given my relative preferences for traffic vs advertising revenue, and the remaining uncertainty in the reduction estimate can then be queried for how likely it is that the decision is wrong, and hence for whether collecting further data could turn a wrong decision into a right one:

## simulate a plausible effect superimposed on the actual data:
ads[ads$Ads==1,]$Hits <- round(ads[ads$Ads==1,]$Hits * 0.95)

require(rjags)
y <- ads$Hits
x <- ads$Ads
model_string <- "model {
  for (i in 1:length(y)) {
   y[i] ~ dpois(lambda[i])
   # log-linear model: alpha1 is constrained positive by its gamma prior and enters
   # with a negative sign, so the ads indicator x can only reduce expected traffic
   log(lambda[i]) <- alpha0 - alpha1 * x[i]
  }
  alpha0 ~ dunif(0,10)
  alpha1 ~ dgamma(50, 6)
}"
model <- jags.model(textConnection(model_string), data = list(x = x, y = y),
                    n.chains = getOption("mc.cores"))
samples <- coda.samples(model, c("alpha0", "alpha1"), n.iter=10000)
summary(samples)
# 1. Empirical mean and standard deviation for each variable,
#    plus standard error of the mean:
#
#              Mean          SD     Naive SE Time-series SE
# alpha0 6.98054476 0.003205046 1.133155e-05   2.123554e-05
# alpha1 0.06470139 0.005319866 1.880857e-05   3.490445e-05
#
# 2. Quantiles for each variable:
#
#              2.5%        25%        50%        75%      97.5%
# alpha0 6.97426621 6.97836982 6.98055144 6.98273011 6.98677827
# alpha1 0.05430508 0.06110893 0.06469162 0.06828215 0.07518853
alpha0 <- samples[[1]][,1]; alpha1 <- samples[[1]][,2]
posteriorTrafficReduction <- exp(alpha0) - exp(alpha0-alpha1)

generalLoss <- function(annualAdRevenue, trafficLoss,  hitValue, discountRate) {
  (annualAdRevenue - (trafficLoss * hitValue * 365.25)) / log(1 + discountRate) }
loss <- function(tr) { generalLoss(360, tr, 0.02, 0.05) }
posteriorLoss <- sapply(posteriorTrafficReduction, loss)
summary(posteriorLoss)
#       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
# -5743.5690 -3267.4390 -2719.6300 -2715.3870 -2165.6350   317.7016

Expected loss of turning on ads: −$2,715. Current decision: keep ads off to avoid that loss. The expected average gain in the cases where the correct decision would instead have been to turn ads on:

mean(ifelse(posteriorLoss>0, posteriorLoss, 0))
# [1] 0.06868814833

so the EVPI is ~$0.07. Since the EVSI can never exceed the EVPI, 7 cents cannot pay for even a single additional day of sampling, and there is no need to calculate an exact EVSI; a brute-force sketch of how one could estimate it anyway follows.
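
For completeness, a minimal sketch (not part of the original analysis) of how an EVSI for k further days could be estimated by preposterior simulation, reusing the model_string, x, y, loss, samples, & posteriorLoss objects defined above; the choices of k, the 50:50 randomization, nsims, and n.iter are illustrative assumptions:

## estimate EVSI for k additional days of data by simulating new data from the
## current posterior, refitting, and seeing how often (and by how much) the
## extra data would flip the decision to 'turn ads on':
evsiSimulate <- function(k, nsims=100) {
    sapply(1:nsims, function(i) {
        ## 1. draw a hypothetical 'true' parameter pair from the current posterior:
        j  <- sample(nrow(samples[[1]]), 1)
        a0 <- samples[[1]][j,1]; a1 <- samples[[1]][j,2]
        ## 2. simulate k further days, randomized 50:50 between ads/no-ads:
        xNew <- rbinom(k, 1, 0.5)
        yNew <- rpois(k, exp(a0 - a1*xNew))
        ## 3. refit the same JAGS model on the combined old+new data:
        modelNew   <- jags.model(textConnection(model_string),
                                 data=list(x=c(x, xNew), y=c(y, yNew)), quiet=TRUE)
        samplesNew <- coda.samples(modelNew, c("alpha0", "alpha1"), n.iter=2000)
        a0n <- samplesNew[[1]][,1]; a1n <- samplesNew[[1]][,2]
        lossNew <- sapply(exp(a0n) - exp(a0n - a1n), loss)
        ## 4. value of the decision made with the extra data: turn ads on only if
        ##    the new posterior mean gain is positive (keeping ads off is worth $0)
        max(mean(lossNew), 0)
    })
}
## EVSI(k) = expected value of deciding after k more days, minus value of deciding now:
# mean(evsiSimulate(k=30)) - max(mean(posteriorLoss), 0)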


  1. I am un­happy about the un­cached JS that Dis­qus loads & how long it takes to set it­self up while spew­ing warn­ings in the browser con­sole, but at the mo­ment, I don’t know of any other sta­tic site com­ment­ing sys­tem which has good an­ti-s­pam ca­pa­bil­i­ties or an equiv­a­lent user base, and Dis­qus has worked for 5+ years.↩︎

  2. This is especially an issue with A/B testing as usually practiced with NHST & an arbitrary alpha threshold, which poses a cumulative “death by a thousand cuts” problem: one could steadily degrade one’s website by repeatedly making bad changes which don’t appear harmful in small-scale experiments (“no user harm, p > 0.05, and increased revenue, p < 0.05”; “no harm, increased revenue”; “no harm, increased revenue”; etc), yet the probability of a net harm keeps going up. (A small simulation of this ratchet is sketched at the end of this footnote.)

    One might call this the “Schlitz effect” (“How Milwaukee’s Famous Beer Became Infamous: The Fall of Schlitz”), after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. (It is also called the “fast-food fallacy” after Gerald M. Weinberg’s discussion of a hypothetical example in The Secrets of Consulting: A Guide to Giving and Getting Advice Successfully, pg254, “Controlling Small Changes”, where he notes: “No difference plus no difference plus no difference plus … eventually equals a clear difference.”) Another example is the now-infamous “Red Delicious” apple: widely considered one of the worst-tasting apples commonly sold, it was reportedly an excellent-tasting apple when first discovered in 1880, winning contests for its flavor; but its flavor worsened rapidly over the 20th century, a decline blamed on apple growers gradually switching to ever-redder strains which looked better in grocery stores, a decline which ultimately culminated in the near-collapse of the Red-Delicious-centric Washington State apple industry when consumer backlash finally began in the 1980s with the availability of tastier apples. The more complex a system, the worse the “death by a thousand cuts” can be—in a 2003 email from Bill Gates, he lists (at least) 25 distinct problems he encountered trying (and failing) to install & use the Movie Maker program.

    This death-by-degrees can be countered by a few things, such as either testing regularly against a historical baseline to establish the total cumulative degradation, or carefully tuning significance/power thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the p-threshold made less stringent for detecting harm).

    In addition, one must avoid a bias towards testing only changes which degrade a product, which becomes “sampling to a foregone conclusion” (imagine a product at a point where increasing quality would in fact be profitable, but experiments are conducted only on various ways of reducing quality—even if thresholds are set correctly, false positives must nevertheless occur once in a while, and thus over the long run quality & profits inevitably decrease). A rational profit-maximizer should remember that increases in quality can be profitable too.↩︎
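
    A minimal simulation sketch of this ratchet (all numbers are illustrative assumptions, not taken from this experiment): each of 20 candidate changes silently reduces some per-visitor metric (eg. return-visit rate) by 2%, each is A/B-tested on 1,000 visitors per arm, and each ships unless the harm reaches p < 0.05—almost every harmful change passes the filter, and the losses compound:

    set.seed(2017)
    simulateRatchet <- function(changes=20, harm=0.02, n=1000, baseRate=0.5) {
        quality <- 1
        for (i in 1:changes) {
            control   <- rbinom(n, 1, baseRate)
            treatment <- rbinom(n, 1, baseRate * (1 - harm))
            p <- prop.test(c(sum(control), sum(treatment)), c(n, n))$p.value
            if (p > 0.05) { quality <- quality * (1 - harm) } # "no significant harm": ship it
        }
        quality
    }
    mean(replicate(1000, simulateRatchet()))
    # ~0.69: ie. ~30% of the metric lost to changes which each looked individually 'harmless'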

  3. Why ‘block’ instead of, say, just randomizing 5-day units at a time (“simple randomization”)? If we did that, we would occasionally do something like spend an entire month in one condition without switching, simply by rolling a 0 five or six times in a row; since traffic can be expected to drift and change and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance, thus shrinking the effect size, thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had n = 15 for advertising and only n = 5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition; a quick simulation of this imbalance is sketched at the end of this footnote). Not blocking doesn’t bias our analysis—we still get the right answers eventually—but it could be costly. Whereas if we randomize in paired 2-day blocks ([00,11] vs [11,00]), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, it’ll probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age- and sex-matched participants, to eliminate unnecessary noise.

    The gains from proper choice of experimental unit & blocking can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have let n be a small fraction of the size. Thus, when possible, I block my experiments at least temporally.↩︎
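
    A quick simulation sketch of the imbalance risk under simple randomization (using the hypothetical 100-day/5-day-unit setup from this footnote; the numbers are illustrative):

    set.seed(2017)
    ## 20 units of 5 days each; simple randomization flips a fair coin per unit:
    simpleUnits <- replicate(10000, sum(rbinom(20, 1, 0.5)))
    quantile(simpleUnits, c(0.025, 0.975))
    #  2.5% 97.5%
    #     6    14
    mean(simpleUnits <= 5 | simpleUnits >= 15)
    # ~0.04: splits as lopsided as 15:5 happen a few percent of the time, and long
    # unbroken runs of one condition can line up with traffic spikes; paired blocks
    # guarantee an exact 10:10 split and frequent switching.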

  4. After publishing initial results, Chris Stucchio commented on Twitter: “Most of the work on this stuff is proprietary. I ran such an experiment for a large content site which generated directionally similar results. I helped another major content site set up a similar test, but they didn’t tell me the results…as well as smaller effects from ads further down the page (e.g.). Huge sample size, very clear effects.” David Kitchen, a hobbyist operator of several hundred forums, claims that removing banner ads boosted all his metrics (admittedly, at the cost of total revenue), but it is unclear whether this used randomization or just a before-after comparison. I have been told in private by ad industry people that they have seen similar results but either assumed everyone already knew all this, or were unsure how generalizable the results were. And I know of one well-known tech website which tested this question after seeing my analysis, and found a remarkably similar result.

    This raises serious questions about publication bias and the “file drawer” of ad experiments: if in fact these sorts of experiments are being run all the time by many companies, the published papers could easily be systematically different from the unpublished ones—given the commercial incentives, should we assume that the harms of advertising are even greater than implied by published results?↩︎

  5. An ear­lier ver­sion of the ex­per­i­ment re­ported in Mc­Coy et al 2007 is Mc­Coy et al 2004; Mc­Coy et al 2008 ap­pears to be a smaller fol­lowup do­ing more so­phis­ti­cated struc­tural equa­tion mod­el­ing of the var­i­ous scales used to quan­tify ad effect­s/per­cep­tion.↩︎

  6. YouTube no longer exposes a Likert scale but only binary up/downvotes, so Kerkhof 2019 uses the fraction of likes out of total likes + dislikes. More specifically:

    9.1 Likes and dislikes: First, I use a video’s number of likes and dislikes to measure its quality. To this end, I normalize the number of likes of video v by YouTuber i in month t by its sum of likes and dislikes: likes_vit / (likes_vit + dislikes_vit). Though straightforward to interpret, this measure reflects the viewers’ general satisfaction with a video, which is determined by its quality and the viewers’ ad aversion. Thus, even if an increase in the feasible number of ad breaks led to an increase in video quality, a video’s fraction of likes could decrease if the viewers’ additional ad nuisance costs prevail.

    I replace the dependent variable Popularvit in equation (2) with [the fraction of likes] and estimate equations (2) and (3) by 2SLS. Table 16 shows the results. Again, the potentially biased OLS estimates of equation (2) in columns 1 to 3 are close to zero and not statistically-significant; [the 2SLS estimates, in contrast, are] statistically-significant at the 1% level: an increase in the feasible number of ad breaks leads to a 4 percentage point reduction in the fraction of likes. The effect size corresponds to around 25% of a standard deviation in the dependent variable and to 4.4% of its baseline value 0.1. The reduced form estimates in columns 7 to 9 are in line with these results. Note that I lose 77,066 videos that have not received any likes or dislikes. The results in Table 16 illustrate that viewer satisfaction has gone down. It is, however, unclear if the effect is driven by a decrease in video quality or by the viewers’ irritation from additional ad breaks. See Appendix A.8 for validity checks.

    ↩︎
  7. Kerk­hof 2019:

    There are two po­ten­tial ex­pla­na­tions for the differ­ences to Sec­tion 9.1. First, video qual­ity may en­hance, whereby more (re­peat­ed) view­ers are at­tract­ed. At the same time, how­ev­er, view­ers ex­press their dis­sat­is­fac­tion with the ad­di­tional breaks by dis­lik­ing the video. Sec­ond, there could be al­go­rith­mic con­found­ing of the data (Sal­ganik, 2017, Ch.3). YouTube, too, earns a frac­tion of the YouTu­bers’ ad rev­enue. Thus, the plat­form has an in­cen­tive to treat videos with many ad breaks fa­vor­ably, for in­stance, through its rank­ing al­go­rithm. In this case, the num­ber of views was not in­for­ma­tive about a video’s qual­i­ty, but only about an al­go­rith­mic ad­van­tage. See Ap­pen­dix A.8 for va­lid­ity check­s…Table 5 shows the re­sults. The size of the es­ti­mates for δ′′(­columns 1 to 3), though sta­tis­ti­cally sig­nifi­cant at the 1%-level, is neg­li­gi­ble: a one sec­ond in­crease in video du­ra­tion cor­re­sponds to a 0.0001 per­cent­age point in­crease in the frac­tion of likes. The es­ti­mates for δ′′′ in columns 4 to 6, though, are rel­a­tively large and sta­tis­ti­cally sig­nifi­cant at the 1%-level, too. Ac­cord­ing to these es­ti­mates, one fur­ther sec­ond in video du­ra­tion leads on av­er­age to about 1.5 per­cent more views. These es­ti­mates may re­flect the al­go­rith­mic drift dis­cussed in Sec­tion 9.2. YouTube wants to keep its view­ers as long as pos­si­ble on the plat­form to show as many ads as pos­si­ble to them. As a re­sult, longer videos get higher rank­ings and are watched more often.

    …Second, I cannot evaluate the effect of advertising on welfare, because I lack measures for consumer and producer surplus. Although I demonstrate that advertising leads to more content differentiation—which is likely to raise consumer surplus (Brynjolfsson et al., 2003)—the viewers must also pay an increased ad “price”, which works into the opposite direction. Since I obtain no estimates for the viewers’ ad aversion, my setup does not answer which effect overweights. On the producer side, I remain agnostic about the effect of advertising on the surplus of YouTube itself, the YouTubers, and the advertisers. YouTube as a platform is likely to benefit from advertising, though. Advertising leads to more content differentiation, which attracts more viewers; more viewers, in turn, generate more ad revenue. Likewise, the YouTubers’ surplus benefits from an increase in ad revenue; it is, however, unclear how their utility from covering different topics than before is affected. Finally, the advertisers’ surplus may go up or down. On the one hand, a higher ad quantity makes it more likely that potential customers click on their ads and buy their products. On the other hand, the advertisers cannot influence where exactly their ads appear, whereby it is unclear how well the audience is targeted. Hence, it is possible that the additional costs of advertising surmount the additional revenues.

    ↩︎
  8. Examples of ads I saw would be Lumosity ads or online university ads (typically master’s degrees, for some reason) on my pages. They looked about what one would expect: generically glossy and clean. It is difficult to imagine anyone being offended by them.↩︎

  9. One might rea­son­ably as­sume that Ama­zon’s ul­tra­-cramped-yet-un­in­for­ma­tive site de­sign was the re­sult of ex­ten­sive A/B test­ing and is, as much as one would like to be­lieve oth­er­wise, op­ti­mal for rev­enue. How­ev­er, ac­cord­ing to ex-A­ma­zon en­gi­neer Steve Yegge, Ama­zon is well aware their web­site looks aw­ful—but sim­ply re­fuses to change it:

    Jeff Bezos is an infamous micro-manager. He micro-manages every single pixel of Amazon’s retail site. He hired Larry Tesler, Apple’s Chief Scientist and probably the very most famous and respected human-computer interaction expert in the entire world, and then ignored every goddamn thing Larry said for three years until Larry finally—wisely—left the company [2001–2005]. Larry would do these big usability studies and demonstrate beyond any shred of doubt that nobody can understand that frigging website, but Bezos just couldn’t let go of those pixels, all those millions of semantics-packed pixels on the landing page. They were like millions of his own precious children. So they’re all still there, and Larry is not…The guy is a regular… well, Steve Jobs, I guess. Except without the fashion or design sense. Bezos is super smart; don’t get me wrong. He just makes ordinary control freaks look like stoned hippies.

    ↩︎
  10. Several of the Internet giants like Google have reported measurable harms from delays as small as 100ms. Effects of delays/latency have often been measured elsewhere as well, eg. at The Telegraph.↩︎

  11. Bing offers a cautionary example for search engine optimizers about the need to examine global effects inclusive of attrition (as Hohnhold et al make sure to do); since a search engine is a tool for finding things, and users may click on ads only when unsatisfied, worse search engine results may increase queries/ad clicks:

    When Bing had a bug in an ex­per­i­ment, which re­sulted in very poor re­sults be­ing shown to users, two key or­ga­ni­za­tional met­rics im­proved sig­nifi­cant­ly: dis­tinct queries per user went up over 10%, and rev­enue per user went up over 30%! How should Bing eval­u­ate ex­per­i­ments? What is the Over­all Eval­u­a­tion Cri­te­ri­on? Clearly these long-term goals do not align with short­-term mea­sure­ments in ex­per­i­ments. If they did, we would in­ten­tion­ally de­grade qual­ity to raise query share and rev­enue!

    Ex­pla­na­tion: From a search en­gine per­spec­tive, de­graded al­go­rith­mic re­sults (the main search en­gine re­sults shown to users, some­times re­ferred to as the 10 blue links) force peo­ple to is­sue more queries (in­creas­ing queries per user) and click more on ads (in­creas­ing rev­enues). How­ev­er, these are clearly short­-term im­prove­ments, sim­i­lar to rais­ing prices at a re­tail store: you can in­crease short­-term rev­enues, but cus­tomers will pre­fer the com­pe­ti­tion over time, so the av­er­age cus­tomer life­time value will de­cline

    That is, increases in ‘revenue per user’ are not necessarily either increases in total revenue per user or in total revenue, period (because the user will be more likely to attrit to a better search engine).↩︎

  12. However, the estimate, despite using scores of variables for the matching to attempt to construct accurate controls, is almost certainly inflated, because matching on observed variables generally cannot eliminate confounding; note that in Facebook’s own study, which tested propensity scoring’s ability to predict the results of randomized Facebook ad experiments, it required thousands of variables before propensity scoring could recover the true causal effect.↩︎

  13. I found Yan et al 2019 confusing, so to explain it a little further: the key graph is Figure 3, where “UU” means “unique users”, which apparently in context means a LinkedIn user dichotomized by whether they ever click on something in their feed or ignore it entirely during the 3-month window; “feed interaction counts” are then the number of feed clicks for those with non-zero clicks during the 3-month window.

    The 1-ad-every-9-items condition’s “unique users” gradually climb through the 3 months, approaching ~0.75% more users interacting with the feed than the baseline 1-every-6, while the 1-every-3 condition decreases by −0.5%.

    So if LinkedIn had 1-every-3 (33%) as the baseline and it moved to 1-every-6 (17%) and then to 1-every-9 (11%), the number of users would increase ~1.25%. The elasticity here is unclear, since 1-every-3 vs 6 represents a larger absolute increase in ads than 6 vs 9, but the latter seems to have a larger response; in any case, if moving from 33% ads to 11% ads increases usage by 1.25% over 3 months, that suggests that eliminating ads entirely (going from 11% to 0%) would yield another percentage point or two per 3 months, or, if we consider 17% vs 11% (which is 8%) and adjust, that suggests 1.03%. A quarterly increase of 1.03%–1.25% in users is an annualized 4–5%, which is not trivial.

    Within the users us­ing the feed at all, the num­ber of daily in­ter­ac­tions for 1-ev­ery-9 vs 1-ev­ery-6 in­creases by 1.5%, and de­creases −1% for 1-ev­ery-3 users, which is a stronger effect than for us­age. By sim­i­lar hand­wav­ing, that sug­gests a pos­si­ble an­nu­al­ized in­crease in daily feed in­ter­ac­tions of ~8.5% (on an un­known base but I would guess some­where around 1 in­ter­ac­tion a day).

    These two effects should be multiplicative: if there are more feed users and each feed user is using the feed more, the total number of feed uses (which can be thought of as the total LinkedIn “audience” or “readership”) will be larger still; 4% by 8% is >12%. (The arithmetic is spelled out at the end of this footnote.)

    That es­ti­mate is re­mark­ably con­sis­tent with the es­ti­mate I made based on my first A/B ex­per­i­ment of the hy­po­thet­i­cal effect of Gw­ern.net go­ing from 100% to 0% ads. (It was hy­po­thet­i­cal be­cause many read­ers have ad­block on and can’t be ex­posed to ads; how­ev­er, LinkedIn mo­bile app users pre­sum­ably have no such re­course and have ~100% ex­po­sure to how­ever many ads LinkedIn chooses to show them.)↩︎
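
    The back-of-the-envelope annualization & combination arithmetic above, spelled out (the compounding over 4 quarters is my assumption about how to annualize, not Yan et al’s):

    ## quarterly gains in feed users of 1.03%-1.25%, compounded over 4 quarters:
    (1 + c(0.0103, 0.0125))^4 - 1
    # ~0.042 0.051       ie. ~4-5% more feed users per year
    ## combining ~4% more users with ~8% more interactions per user (multiplicative):
    1.04 * 1.08 - 1
    # 0.1232             ie. >12% more total feed interactions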

  14. Mc­Coy et al 2007, pg3:

    Although that difference is statistically-significant, its magnitude might not appear on the surface to be very important. On the contrary, recent reports show that a site such as Amazon enjoys about 48 million unique visitors per month. A drop of 11% would represent more than 5 million fewer visitors per month. Such a drop would be interpreted as a serious problem requiring decisive action, and the ROI of the advertisement could actually become negative. Assuming a uniform distribution of buyers among the browsers before and after the drop, the revenue produced by an ad might not make up for the 11% shortfall in sales.

    ↩︎
  15. Sample size is not discussed other than to note that it was set according to standard Google power analysis tools described in Tang et al 2010, which likewise omits concrete numbers; because of the use of shared control groups, cookie-level tracking, and A/A tests to do distribution-free estimates of true standard errors/sampling distributions, it is impossible to estimate what n their tool would’ve estimated necessary for standard power like 80% for changes as small as 0.1% (“we generally care about small changes: even a 0.1% change can be substantive.”) on the described experiments. However, I believe it would be safe to say that the _n_s for the k > 100 experiments described are at least in the millions each, as that is what is necessary elsewhere to detect similar effects, and Tang et al 2010 emphasizes that the “cookie-level” experiments—which is what Hohnhold et al 2015 uses—have many times larger standard errors than query-level experiments which assume independence. Hohnhold et al 2015 does imply that the total sample size for the Figure 6 CTR rate changes may be all Google Android users: “One of these launches was rolled out over ten weeks to 10% cohorts of traffic per week.” This would put the sample size at n≅500m given ~1b users.↩︎

  16. How many hits does this cor­re­spond to? A quick­-and-dirty es­ti­mate sug­gests n < 535m.

    The median Alexa rank of the 2,574 sites in the PageFair sample is #210,000, with a median exposure of 16.7 weeks (means are not given); as of 2019-04-05, Gwern.net is at Alexa rank #112,300, and has 106,866 page-views in the previous 30 days; since the median Alexa rank is about half as bad, a guesstimate adjustment is 50k per 30 days, or 1,780 page-views/day (there are Alexa-rank-to-traffic formulas, but they are so wrong for Gwern.net that I don’t trust them to estimate other websites; given the long tail of traffic, I think halving would be an upper bound); then, treating the medians as means, the total measured page-views over the multi-year sample is 1,780 × (16.7 × 7) × 2,574 ≈ 535m. Considering the long time-periods and that it employs the total traffic of thousands of websites, this seems sensible.↩︎

  17. No re­la­tion­ship to Yan et al 2019.↩︎

  18. The Page­Fair/Adobe re­port ac­tu­ally says “16% of the US on­line pop­u­la­tion blocked ads dur­ing Q2 2015.” which is slightly differ­ent; I as­sume that is be­cause the ‘on­line pop­u­la­tion’ < to­tal US pop­u­la­tion.↩︎

  19. This mixture model would have two distributions/components in it; there apparently is no point in trying to distinguish between levels of virality, as k = 3 or higher does not fit the data well:

    library(flexmix)
    stepFlexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=1:10, nrep=20)
    ## errors out k>2
    fit <- flexmix(traffic$Pageviews ~ 1, model = FLXMRglmfix(family = "poisson"), k=2)
    summary(fit)
    #        prior size post>0 ratio
    # Comp.1 0.196  448    449 0.998
    # Comp.2 0.804 1841   1842 0.999
    #
    # 'log Lik.' -1073191.871 (df=3)
    # AIC: 2146389.743   BIC: 2146406.95
    summary(refit(fit))
    # $Comp.1
    #                  Estimate    Std. Error   z value   Pr(>|z|)
    # (Intercept) 8.64943378382 0.00079660747 10857.837 < 2.22e-16
    #
    # $Comp.2
    #                  Estimate    Std. Error   z value   Pr(>|z|)
    # (Intercept) 7.33703564817 0.00067256175 10909.088 < 2.22e-16
    ↩︎
  20. For further background on the eBay experiment, as well as anecdotes about other companies discovering that their advertising’s causal effects were far smaller than they thought (as well as the internal dynamics which maintain advertising), see the Freakonomics podcast, “Does Advertising Actually Work? (Part 1: TV) (Ep. 440)”/part 2.↩︎