A/B testing long-form readability on gwern.net

A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.
experiments, statistics, computer-science, meta, decision-theory, shell, R, JS, CSS, power-analysis, Bayes, Google, tutorial, design
2012-06-16–2019-02-16 in progress certainty: possible importance: 4


To gain some statistical & web development experience and to improve my readers' experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.

Background

  • https://www.google.com/analytics/siteopt/exptlist?account=18912926
  • http://www.pqinternet.com/196.htm
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=61203 “Experiment with site-wide changes”
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=117911 “Working with global headers”
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-GB&answer=61427
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=188090 “Varying page and element styles” - testing with inline CSS overriding the defaults
  • http://stackoverflow.com/questions/2993199/with-google-website-optimizers-multivariate-testing-can-i-vary-multiple-css-cl
  • http://www.xemion.com/blog/the-secret-to-painless-google-website-optimizer-70.html
  • http://stackoverflow.com/tags/google-website-optimizer/hot

Problems with “conversion” metric

https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-AU&answer=74345 “Time on page as a conversion goal” - every page converts, by using a timeout (mine is 40 seconds). Problem: dichotomizing a continuous variable into a single binary variable destroys a massive amount of information. This is well-known in the statistical and psychological literature (eg. MacCallum et al 2002), but I'll illustrate further with some information-theoretic observations.

According to my Analytics, the mean reading time (time on page) is 1:47 and the maximum bracket, hit by 1% of viewers, is 1801 seconds; the range 1-1801 takes <10.8 bits to encode (log2(1801) → 10.81), hence each page view could be represented by <10.8 bits (less, since reading time is so highly skewed). But if we dichotomize, then we learn simply that ~14% of readers will read for 40 seconds, hence each page view carries not 10.8 bits, nor even 1 bit (as it would if 50% read that long), but closer to half a bit:

p=0.14;  q=1-p; (-p*log2(p) - q*log2(q))
# [1] 0.5842

This isn't even an efficient dichotomization: we could improve the fractional bit to a full 1 bit if we could somehow dichotomize at 50% of readers:

p=0.50;  q=1-p; (-p*log2(p) - q*log2(q))
# [1] 1

But unfortunately, simply lowering the timeout will have minimal returns, as Analytics also reports that 82% of readers spend 0-10 seconds on pages. So we are stuck with a severe loss.
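To see why, a quick sketch: assuming the 82% figure, a more aggressive cutoff at ~10 seconds would convert ~18% of readers, which barely improves on the 40-second dichotomization:

p=0.18;  q=1-p; (-p*log2(p) - q*log2(q))
# [1] 0.6801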

Ideas for testing

    JS:
            Disqus
    CSS:
            differences from Readability
            every declaration in default.css?
    Donation:
            placement - left, right, bottom
            donation text:
                     help pay for hosting
                     help sponsor X experiment
                     Xah's text - did you find this article useful?
  • test the suggestions in https://code.google.com/p/better-web-readability-project/ and http://www.vcarrer.com/2009/05/how-we-read-on-web-and-how-can-we.html

Testing

max-width

A CSS property: set how wide the page will be in pixels if unlimited screen real estate is available. I noticed some people complained that pages were 'too wide' and this made them hard to read, which apparently is a real thing, since lines are supposed to fit in eye saccades. So I tossed 800px, 900px, 1300px, and 1400px into the first A/B test.

<script>
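// Google Website Optimizer control script: loads siteopt.js, which picks & applies a variation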
function utmx_section(){}function utmx(){}
(function(){var k='0520977997',d=document,l=d.location,c=d.cookie;function f(n){
if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.indexOf(';',i);return escape(c.substring(i+n.
length+1,j<0?c.length:j))}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;
d.write('<sc'+'ript src="'+
'http'+(l.protocol=='https:'?'s://ssl':'://www')+'.google-analytics.com'
+'/siteopt.js?v=1&utmxkey='+k+'&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='
+new Date().valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
'" type="text/javascript" charset="utf-8"></sc'+'ript>')})();
</script>

<script type="text/javascript">
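  // GWO tracking: log a pageview for the experiment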
  var _gaq = _gaq || [];
  _gaq.push(['gwo._setAccount', 'UA-18912926-2']);
  _gaq.push(['gwo._trackPageview', '/0520977997/test']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
              + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>

<script type="text/javascript">
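  // GWO conversion goal: fire after 40 seconds on the page (counts as 'reading' the page)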
  var _gaq = _gaq || [];
  _gaq.push(['gwo._setAccount', 'UA-18912926-2']);
      setTimeout(function() {
  _gaq.push(['gwo._trackPageview', '/0520977997/goal']);
      }, 40000);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') +
              '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>

    <script>utmx_section("max width")</script>
    <style type="text/css">
      body { max-width: 800px; }
    </style>
    </noscript>

It ran from mid-June to 2012-08-01. Unfortunately, I cannot be more specific: on 1 August, Google deleted Website Optimizer and told everyone to use 'Experiments' in Google Analytics - and deleted all my information. The graph over time, the exact numbers - all gone. So this is from memory.

The results were initially very promising: 'conversion' was defined as staying on a page for 40 seconds (I reasoned that this meant someone was actually reading the page), and had a base rate of around 70% of readers converting. With a few hundred hits, 900px converted at 10-20% more than the default! I was ecstatic. So when it began falling, I was only a little bothered (one had to expect some regression to the mean, since the results were too good to be true). But as the hits increased into the low thousands, the effect kept shrinking, all the way down to a 0.4% improvement in conversion. At some points, 1300px actually exceeded 900px.

The second distressing thing was that Google's estimated chance of a particular intervention beating the default (which I believe is a Bonferroni-corrected p-value) did not increase! Even as each version received 20,000 hits, the chance stubbornly bounced around the 70-90% range for 900px and 1300px. This remained true all the way to the bitter end: each version had racked up 93,000 hits and the estimate still hovered around 80%. Wow.

Ironically, I was warned at the beginning about both of these possible behaviors by papers I read on large-scale corporate A/B testing: http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf, http://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf, and http://www.exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf. They covered at length how many apparent trends simply evaporated, but also covered a peculiar phenomenon where A/B tests did not converge even after being run on ungodly amounts of data, because the standard deviations kept changing (the user composition kept shifting and rendering previous data more uncertain). And it's a general phenomenon that even for large correlations, the trend will bounce around a lot before it stabilizes (Schönbrodt & Perugini 2013).

Oy vey! When I discovered Google had deleted my results, I decided to simply switch to 900px. Running a new test would not provide any better answers.
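To illustrate that bouncing, here is a minimal simulation (not a reconstruction of the lost data; the ~70% base rate and the final ~0.4% edge are taken from the account above), plotting the running estimate of the difference between two arms as hits accumulate:

set.seed(2012)
n <- 93000
a <- rbinom(n, 1, 0.700)   # default width: ~70% conversion assumed
b <- rbinom(n, 1, 0.704)   # 900px: assumed 0.4% better
running <- cumsum(b)/seq_len(n) - cumsum(a)/seq_len(n)
plot(running, type="l", log="x", xlab="Hits per version",
     ylab="Running estimate of the conversion difference")
abline(h=0.004, col="red") # the true difference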

TODO

how about a blue background? See http://www.overcomingbias.com/2010/06/near-far-summary.html for more design ideas

  1. table striping
tbody tr:hover td { background-color: #f5f5f5;}
tbody tr:nth-child(odd) td { background-color: #f9f9f9;}
  1. link decoration
a { color: black; text-decoration: underline;}
a { color:#005AF2; text-decoration:none; }

Resumption: ABalytics

In March 2013, I decided to give A/B testing another whack. Google Analytics Experiments did not seem to have improved, and the commercial services continued to charge unacceptable prices, so I gave the Google Analytics custom-variable integration approach another try, using ABalytics. The usual puzzling, debugging, and frustration of combining so many disparate technologies (HTML and CSS and JS and Google Analytics) aside, it seemed to work on my test page. The current downside seems to be that the ABalytics approach may be fragile, and the UI in GA is awful (you have to do the statistics yourself).

max-width redux

The test case is to re-run the max-width test and finish it.

Implementation

The exact changes:

Sun Mar 17 11:25:39 EDT 2013  gwern@gwern.net
  * default.html: setup ABalytics a/b testing https://github.com/danmaz74/ABalytics
                  (hope this doesn't break anything...)
    addfile ./static/js/abalytics.js
    hunk ./static/js/abalytics.js 1
...
    hunk ./static/templates/default.html 28
    +    <div class="maxwidth_class1"></div>
    +
...
    -    <noscript><p>Enable JavaScript for Disqus comments</p></noscript>
    +      window.onload = function() {
    +      ABalytics.applyHtml();
    +      };
    +    </script>
    hunk ./static/templates/default.html 119
    +
    +      ABalytics.init({
    +      maxwidth: [
    +      {
    +      name: '800',
    +      "maxwidth_class1": "<style>body { max-width: 800px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '900',
    +      "maxwidth_class1": "<style>body { max-width: 900px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1100',
    +      "maxwidth_class1": "<style>body { max-width: 1100px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1200',
    +      "maxwidth_class1": "<style>body { max-width: 1200px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1300',
    +      "maxwidth_class1": "<style>body { max-width: 1300px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1400',
    +      "maxwidth_class1": "<style>body { max-width: 1400px; }</style>",
    +      "maxwidth_class2": ""
    +      }
    +      ],
    +      }, _gaq);
    +

Results

I wound up the test on 2013-04-17 with the following results:

Width (px)  Visits   Conversion
1100        18,164   14.49%
1300        18,071   14.28%
1200        18,150   13.99%
800         18,599   13.94%
900         18,419   13.78%
1400        18,378   13.68%
Total       109,772  14.03%

Analysis

1100px is close to my original A/B test, which indicated 900px was the leading candidate, so that gives me additional confidence, as does the observation that 1300px and 1200px are the other leading candidates. (Curiously, the site conversion average before was 13.88%; perhaps my underlying traffic changed slightly around the time of the test? This would demonstrate why alternatives need to be tested simultaneously.) A quick and dirty R test of 1100px vs 1300px (prop.test(c(2632,2581),c(18164,18071))) indicates the difference isn't statistically-significant (p = 0.58), and we might want more data; worse, there is no clear linear relation between conversion and width (the plot is erratic, and a linear fit gives a dismal p = 0.89):

rates <- read.csv(stdin(),header=TRUE)
Width,N,Rate
1100,18164,0.1449
1300,18071,0.1428
1200,18150,0.1399
800,18599,0.1394
900,18419,0.1378
1400,18378,0.1368


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.82e+00   4.65e-02  -39.12   <2e-16
# Width        5.54e-06   4.10e-05    0.14     0.89
## not much better:
rates$Width <- as.factor(rates$Width)
rates$Width <- relevel(rates$Width, ref="900")
g2 <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial"); summary(g2)

But I want to move on to the next test, and by the same logic it is highly unlikely that the difference between them is large or much in 1300px's favor (the kind of mistake I care about: switching between 2 equivalent choices doesn't matter, but missing out on an improvement does matter - minimizing β, not minimizing α).
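To put a number on "highly unlikely to matter": detecting the observed 1100px-vs-1300px gap (14.49% vs 14.28%) at conventional thresholds would require a sample far larger than another test run would collect (a quick check using the rates above):

power.prop.test(p1=0.1449, p2=0.1428, power=0.80, sig.level=0.05)
# gives n ≈ 440,000 in each group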

Fonts

The New York Times ran an informal online experiment with a large number of readers (n = 60,750) and found that Baskerville led to more readers agreeing with a short text passage - this seems plausible enough given their very large sample size and Wikipedia's note that “The refined feeling of the typeface makes it an excellent choice to convey dignity and tradition.”

Power analysis

Would this font work its magic on Gwern.net too? Let's see. The sample size is quite manageable, as over a month I will easily have 60k visits, and they tested 6 fonts, expanding their necessary sample. What sample size do I actually need? Their professor estimates the effect size of Baskerville at 1.5%; I would like my A/B test to have very high statistical power (0.9) and reach more stringent statistical-significance (p < 0.01) so I can go around and in good conscience tell people to use Baskerville. I already know the average “conversion rate” is ~13%, so I get this power calculation:

power.prop.test(p1=0.13+0.015, p2=0.13, power=0.90, sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 15683
             p1 = 0.145
             p2 = 0.13
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

 NOTE: n is number in *each* group

15,000 visitors in each group seems reasonable; at ~16k visitors a week, that suggests a few weeks of testing. Of course I'm testing 4 fonts (see below), but that still fits in the ~2 months I've allotted for this test.
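With 4 fonts there are 3 comparisons against the control font, so if I wanted to hold the overall α at 0.01 with a Bonferroni correction (an extra check, not part of the original calculation), the per-comparison threshold tightens and the required sample grows somewhat:

power.prop.test(p1=0.13+0.015, p2=0.13, power=0.90, sig.level=0.01/3)
# n ≈ 18,700 in each group (up from 15,683)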

Implementation

I had previously drawn on the NYT experiment for my site design:

html {
...
    font-family: Georgia, "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica,
                 Arial, "Lucida Grande", garamond, palatino, verdana, sans-serif;
}

I had not used Baskerville itself, since Georgia seemed similar and was convenient, but we'll fix that now. Besides Baskerville & Georgia, we can try 2 more, for a total of 4 fonts (falling back to Georgia):

hunk ./static/templates/default.html 28
+    <div class="fontfamily_class1"></div>
...
hunk ./static/templates/default.html 121
+      fontfamily: [
+      {
+      name: 'Baskerville',
+      "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Georgia',
+      "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Trebuchet',
+      "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Helvetica',
+      "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
+      "fontfamily_class2": ""
+      }
+      ],

Results

Running from 2013-04-14 to 2013-06-16:

Font         Type   Visits   Conversion
Trebuchet    sans   35,473   13.81%
Baskerville  serif  36,021   13.73%
Helvetica    sans   35,656   13.43%
Georgia      serif  35,833   13.31%
all sans            71,129   13.62%
all serif           71,854   13.52%
Total               142,983  13.57%

The sample size for each font is 20k higher than I projected, due to the enormous popularity of the Google analysis I finished during the test. Regardless, it's clear that the results - with double the total sample size of the NYT experiment, focused on fewer fonts - are disappointing: there seems to be very little difference between fonts.

Analysis

Picking the most extreme difference, between Trebuchet and Georgia, the difference is close to the usual definition of statistical-significance:

prop.test(c(0.1381*35473,0.1331*35833),c(35473,35833))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1381 * 35473, 0.1331 * 35833) out of c(35473, 35833)
# X-squared = 3.76, df = 1, p-value = 0.0525
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -5.394e-05  1.005e-02
# sample estimates:
# prop 1 prop 2
# 0.1381 0.1331

Which naturally implies that the much smaller difference between Trebuchet and Baskerville is not statistically-significant:

prop.test(c(0.1381*35473,0.1373*36021), c(35473,36021))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1381 * 35473, 0.1373 * 36021) out of c(35473, 36021)
# X-squared = 0.0897, df = 1, p-value = 0.7645
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.00428  0.00588

Since there are only small differences between individual fonts, I wondered if there might be a difference between the two sans-serifs and the two serifs. If we lump the 4 fonts into those 2 categories and look at the small difference in mean conversion rate:

prop.test(c(0.1362*71129,0.1352*71854), c(71129,71854))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1362 * 71129, 0.1352 * 71854) out of c(71129, 71854)
# X-squared = 0.2963, df = 1, p-value = 0.5862
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.002564  0.004564

Nothing doing there either. More generally:

rates <- read.csv(stdin(),header=TRUE)
Font,Serif,N,Rate
Trebuchet,FALSE,35473,0.1381
Baskerville,TRUE,36021,0.1373
Helvetica,FALSE,35656,0.1343
Georgia,TRUE,35833,0.1331


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Font, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -1.83745    0.03744  -49.08   <2e-16
# FontGeorgia   -0.03692    0.05374   -0.69     0.49
# FontHelvetica -0.02591    0.04053   -0.64     0.52
# FontTrebuchet  0.00634    0.04048    0.16     0.88

With essentially no meaningful differences between conversion rates, this suggests that however fonts matter, they don't matter for reading duration. So I feel free to pick the font that appeals to me visually, which is Baskerville.

Line height

I have seen complaints that lines on Gwern.net are “too closely spaced” or “run together” or “cramped”, referring to the line height (the CSS property line-height). I set the CSS to line-height: 150%; to deal with this objection, but this was a simple hack based on rough eyeballing, and it was done before I changed the max-width and font-family settings after the previous testing. So it's worth testing some variants.

Most web design guides seem to suggest a safe default of 120%, rather than my current 150%. If we tried every 10% step plus one on the outside, that'd give us 110, 120, 130, 140, 150, 160, or 6 options, which, combined with the expected small effect, would require an unreasonable sample size (and I have nothing in the pipeline I expect might catch fire like the Google analysis and deliver an excess >50k visits). So I'll try just 120/130/140/150, and schedule a similar block of time as the fonts test (ending the experiment on 2013-08-16, with presumably >70k datapoints).
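For a sense of what ~17-18k visitors per arm can detect, a rough minimum-detectable-effect check (assuming a ~15% base rate, close to what the test below observed):

power.prop.test(n=17500, p1=0.15, power=0.80, sig.level=0.05)
# solves for p2 ≈ 0.161: only a ~1.1% absolute difference is reliably detectable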

Implementation

hunk ./static/templates/default.html 30
-    <div class="fontfamily_class1"></div>
+    <div class="linewidth_class1"></div>
hunk ./static/templates/default.html 156
-      fontfamily:
+      linewidth:
hunk ./static/templates/default.html 158
-      name: 'Baskerville',
-      "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line120',
+      "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 163
-      name: 'Georgia',
-      "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line130',
+      "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 168
-      name: 'Trebuchet',
-      "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line140',
+      "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 173
-      name: 'Helvetica',
-      "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line150',
+      "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+      "linewidth_class2": ""

Analysis

From 2013-06-15–2013-08-15:

line-height  n       Conversion
130%         18,124  15.26%
150%         17,459  15.22%
120%         17,773  14.92%
140%         17,927  14.92%
Total        71,283  15.08%

Just from looking at the miserably small difference between the most extreme percentages (15.26 - 14.92 = 0.34%), we can predict that nothing here was statistically-significant:

x1 <- 18124; x2 <- 17927; prop.test(c(x1*0.1524, x2*0.1476), c(x1,x2))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(x1 * 0.1524, x2 * 0.1476) out of c(x1, x2)
# X-squared = 1.591, df = 1, p-value = 0.2072

I changed the 150% to 130% for the heck of it, even though the difference between 130 and 150 was trivially small:

rates <- read.csv(stdin(),header=TRUE)
Width,N,Rate
130,18124,0.1526
150,17459,0.1522
120,17773,0.1492
140,17927,0.1492


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

rates$Width <- as.factor(rates$Width)
g <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.74e+00   2.11e-02  -82.69   <2e-16
# Width130     2.65e-02   2.95e-02    0.90     0.37
# Width140     9.17e-06   2.97e-02    0.00     1.00
# Width150     2.32e-02   2.98e-02    0.78     0.44

Null test

One of the suggestions in the A/B testing papers was to run a “null” A/B test (or “A/A test”), where the payload is empty but the A/B testing framework is still measuring conversions etc. By definition, the null hypothesis of “no difference” should be true, and at an alpha of 0.05, only 5% of the time would the null tests yield a p < 0.05 (in contrast to the usual situation, where we don't know whether the null is true). The interest here is that it's possible that something is going wrong in one's A/B setup or in general, and so if one gets a “statistically-significant” result, it may be worthwhile investigating the anomaly.
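A quick sanity-check simulation of what an honest A/A setup should produce (arm sizes & base rate roughly matching the test below):

set.seed(2013)
pvals <- replicate(2000, {
    hits <- rbinom(2, 7400, 0.16) # two arms with identical true rates
    prop.test(hits, c(7400, 7400))$p.value })
mean(pvals < 0.05)
# ≈ 0.05, as it should be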

It's easy to switch from the line-height test to the null test: just rename the variables for Google Analytics, and empty the payloads:

hunk ./static/templates/default.html 30
-    <div class="linewidth_class1"></div>
+    <div class="null_class1"></div>
hunk ./static/templates/default.html 158
-      linewidth: [
+      null: [
+      ...]]
hunk ./static/templates/default.html 160
-      name: 'Line120',
-      "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+      name: 'null1',
+      "null_class1": "",
hunk ./static/templates/default.html 165
-      { ...
-      name: 'Line130',
-      "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
-      "linewidth_class2": ""
-      },
-      {
-      name: 'Line140',
-      "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
-      "linewidth_class2": ""
-      },
-      {
-      name: 'Line150',
-      "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+      name: 'null2',
+      "null_class1": "",
+       ... }

Since any difference due to the testing framework should be noticeable, this will be a shorter experiment, running from 15 August to 29 August.

Results

While, amusingly, the first pair of 1k hits resulted in a dramatic 18% vs 14% result, this quickly disappeared into a much more normal-looking set of data:

Option  n       Conversion
null2   7,359   16.23%
null1   7,488   15.89%
Total   14,847  16.06%

Analysis

Ah, but can we reject the null hypothesis that “null1” == “null2”? In a rare victory for null-hypothesis-significance-testing, we do not commit a Type I error:

x1 <- 7359; x2 <- 7488; prop.test(c(x1*0.1623, x2*0.1589), c(x1,x2))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(x1 * 0.1623, x2 * 0.1589) out of c(x1, x2)
# X-squared = 0.2936, df = 1, p-value = 0.5879
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.008547  0.015347

But seriously, it is nice to see that ABalytics does not seem to be broken & favoring either option, nor are results driven by placement in the array of options.

Text & background color

As part of the generally monochromatic color scheme, the background was off-white (grey) and the text was black:

html { ...
    background-color: #FCFCFC; /* off-white */
    color: black;
... }

The hyperlinks, on the other hand, make use of an off-black color, #303C3C, partially motivated by Ian Storm Taylor's advice to “Never Use Black”. I wonder - should all the text be off-black too? And which combination is best? White/black? Off-white/black? Off-white/off-black? White/off-black? Let's try all 4 combinations here.

Implementation

The usual:

hunk ./static/templates/default.html 30
-    <div class="underline_class1"></div>
+    <div class="ground_class1"></div>
hunk ./static/templates/default.html 155
-      underline: [
+      ground: [
hunk ./static/templates/default.html 157
-      name: 'underlined',
-      "underline_class1": "<style>a { color: #303C3C; text-decoration: underline; }</style>",
-      "underline_class2": ""
+      name: 'bw',
+      "ground_class1": "<style>html { background-color: white; color: black; }</style>",
+      "ground_class2": ""
hunk ./static/templates/default.html 162
-      name: 'notUnderlined',
-      "underline_class1": "<style>a { color: #303C3C; text-decoration: none; }</style>",
-      "underline_class2": ""
+      name: 'obw',
+      "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
+      "ground_class2": ""
+      },
+      {
+      name: 'bow',
+      "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
+      "ground_class2": ""
+      },
+      {
+      name: 'obow',
+      "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
+      "ground_class2": ""
... ]]

Data

I am a little curious about this one, so I scheduled a full month and a half: 10 September - 20 October. Due to far more traffic than anticipated from submissions to Hacker News, I cut it short by 10 days to avoid wasting traffic on a test which was done (a total n of 231,599 was more than enough). The results:

Version  n       Conversion
bw       58,237  12.90%
obow     58,132  12.62%
bow      57,576  12.48%
obw      57,654  12.44%

Analysis

rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,TRUE,58237,0.1290
FALSE,FALSE,58132,0.1262
TRUE,FALSE,57576,0.1248
FALSE,TRUE,57654,0.1244


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Black * White, data=rates, family="binomial")
summary(g)
# ...Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)          -1.9350     0.0125 -154.93   <2e-16
# BlackTRUE            -0.0128     0.0177   -0.72     0.47
# WhiteTRUE            -0.0164     0.0178   -0.92     0.36
# BlackTRUE:WhiteTRUE   0.0545     0.0250    2.17     0.03
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance:  6.8625e+00  on 3  degrees of freedom
# Residual deviance: -1.1758e-11  on 0  degrees of freedom
# AIC: 50.4
summary(step(g))
# same thing

So we can estimate the net effect of the 4 possibilities (summing the relevant log-odds coefficients):

  1. Black, White: -0.0128 + -0.0164 + 0.0545 = 0.0253
  2. Off-black, Off-white: 0 + 0 + 0 = 0
  3. Black, Off-white: -0.0128 + 0 + 0 = -0.0128
  4. Off-black, White: 0 + -0.0164 + 0 = -0.0164

These estimates exactly match the data's rankings.
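Back-transforming the log-odds through the logistic function recovers the observed conversion rates exactly (as it must, since a 2x2 model with interaction is saturated):

b <- coef(g)
eta <- c(bw   = b[["BlackTRUE"]] + b[["WhiteTRUE"]] + b[["BlackTRUE:WhiteTRUE"]],
         obow = 0,
         bow  = b[["BlackTRUE"]],
         obw  = b[["WhiteTRUE"]])
round(plogis(b[["(Intercept)"]] + eta), 4)
#     bw   obow    bow    obw
# 0.1290 0.1262 0.1248 0.1244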

So this suggests a change to the CSS: we switch the default background color from #FCFCFC to white, while leaving the default text color its current black.

Reader Lucas asks in the comments whether, since we would expect new visitors to the website to be less likely to read a page in full than a returning visitor (who knows what they're in for & probably wants more), including such a variable (which Google Analytics does track) might improve the analysis. It's easy to ask GA for “New vs Returning Visitor”, so I did:

rates <- read.csv(stdin(),header=TRUE)
Black,White,Type,N,Rate
FALSE,TRUE,new,36695,0.1058
FALSE,TRUE,old,21343,0.1565
FALSE,FALSE,new,36997,0.1043
FALSE,FALSE,old,21537,0.1588
TRUE,TRUE,new,36600,0.1073
TRUE,TRUE,old,22274,0.1613
TRUE,FALSE,new,36409,0.1075
TRUE,FALSE,old,21743,0.1507

rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Black * White + Type, data=rates, family="binomial")
summary(g)
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)
# (Intercept)         -2.134459   0.013770 -155.01   <2e-16
# BlackTRUE           -0.009219   0.017813   -0.52     0.60
# WhiteTRUE            0.000837   0.017798    0.05     0.96
# BlackTRUE:WhiteTRUE  0.034362   0.025092    1.37     0.17
# Typeold              0.448004   0.012603   35.55   <2e-16

  1. B/W: (-0.009219) + 0.000837 + 0.034362 = 0.02598
  2. off-black/off-white: 0 + 0 + 0 = 0
  3. B: (-0.009219) + 0 + 0 = -0.009219
  4. W: 0 + 0.000837 + 0 = 0.000837

And again, 0.02598 > 0.000837. So, as one hopes, thanks to randomization, adding a missing covariate doesn't change our conclusion.

List symbol and font-size

I make heavy use of unordered lists in articles; for no particular reason, the symbol denoting the start of each entry in a list is the little black square, rather than the more common little circle. I've come to find the little squares a bit chunky and ugly, so I want to test that. And I just realized that I never tested font size (just type of font), even though increasing font size is one of the most common CSS tweaks around. I don't have any reason to expect an interaction between these two bits of design, unlike the previous A/B test, but I like the idea of getting more out of my data, so I am doing another factorial design, this time not 2x2 but 3x5. The options:

ul { list-style-type: square; }
ul { list-style-type: circle; }
ul { list-style-type: disc; }

html { font-size: 100%; }
html { font-size: 105%; }
html { font-size: 110%; }
html { font-size: 115%; }
html { font-size: 120%; }

Implementation

A 3x5 design, or 15 possibilities, does get a little bulkier than I'd like:

hunk ./static/templates/default.html 30
-    <div class="ground_class1"></div>
+    <div class="ulFontSize_class1"></div>
hunk ./static/templates/default.html 146
-      ground: [
+      ulFontSize: [
hunk ./static/templates/default.html 148
-      name: 'bw',
-      "ground_class1": "<style>html { background-color: white; color: black; }</style>",
-      "ground_class2": ""
+      name: 's100',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 153
-      name: 'obw',
-      "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
-      "ground_class2": ""
+      name: 's105',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 158
-      name: 'bow',
-      "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
-      "ground_class2": ""
+      name: 's110',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 163
-      name: 'obow',
-      "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
-      "ground_class2": ""
+      name: 's115',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 's120',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c100',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c105',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c110',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c115',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c120',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd100',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd105',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd110',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd115',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd120',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
... ]]

Data

I halted the A/B test on 27 October because I was noticing clear damage as compared to my default CSS. The results were:

List icon  Font zoom  n      Conversion
square     100%       4,763  16.38%
disc       100%       4,759  16.18%
disc       110%       4,716  16.09%
circle     115%       4,933  15.95%
circle     100%       4,872  15.85%
circle     110%       4,920  15.53%
circle     120%       5,114  15.51%
square     115%       4,815  15.51%
square     110%       4,927  15.47%
circle     105%       5,101  15.33%
square     105%       4,775  14.85%
disc       115%       4,797  14.78%
disc       105%       5,006  14.72%
disc       120%       4,912  14.56%
square     120%       4,786  13.96%
Total                 73,196 15.38%

Analysis

Incorporating visitor type:

rates <- read.csv(stdin(),header=TRUE)
Ul,Size,Type,N,Rate
c,120,old,2673,0.1650
c,115,old,2643,0.1854
c,105,new,2636,0.1392
d,105,old,2635,0.1613
s,110,old,2596,0.1749
s,120,old,2593,0.1678
s,105,new,2582,0.1243
d,120,old,2559,0.1649
c,110,new,2558,0.1298
d,110,new,2555,0.1307
c,100,old,2553,0.2002
c,105,old,2539,0.1713
d,115,old,2524,0.1565
s,115,new,2516,0.1391
c,110,old,2505,0.1741
d,100,new,2502,0.1431
c,120,new,2500,0.1284
s,110,new,2491,0.1265
c,115,new,2483,0.1228
d,120,new,2452,0.1277
d,105,new,2448,0.1364
c,100,new,2436,0.1199
d,115,new,2435,0.1437
s,100,new,2411,0.1497
s,120,new,2411,0.1161
s,105,old,2387,0.1571
s,115,old,2365,0.1674
d,100,old,2358,0.1735
s,100,old,2329,0.1803
d,110,old,2235,0.1888


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Ul * Size + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.389310   0.270903   -5.13  2.9e-07
# Uld         -0.103201   0.386550   -0.27    0.789
# Uls          0.055036   0.389109    0.14    0.888
# Size        -0.004397   0.002458   -1.79    0.074
# Uld:Size     0.000842   0.003509    0.24    0.810
# Uls:Size    -0.000741   0.003533   -0.21    0.834
# Typeold      0.317126   0.020507   15.46  < 2e-16
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.40555    0.15921   -8.83   <2e-16
# Size        -0.00436    0.00144   -3.02   0.0025
# Typeold      0.31725    0.02051   15.47   <2e-16

## examine just the list type alone, since the Size result is clear.
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates, family="binomial"))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.8725     0.0208  -89.91   <2e-16
# Uld          -0.0106     0.0248   -0.43     0.67
# Uls          -0.0265     0.0249   -1.07     0.29
# Typeold       0.3163     0.0205   15.43   <2e-16
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates[rates$Size==100,], family="binomial"))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.8425     0.0465  -39.61  < 2e-16
# Uld          -0.0141     0.0552   -0.26     0.80
# Uls           0.0353     0.0551    0.64     0.52
# Typeold       0.3534     0.0454    7.78  7.3e-15

The results are a little confusing in factorial form: it seems pretty clear that Size is bad (larger font sizes convert worse) and that 100% performs best, but what's going on with the list icon type? Do we have too little data, or is it interacting with the font size somehow? I find it a lot clearer when plotted:

library(ggplot2)
qplot(Size,Rate,color=Ul,data=rates)

Reading rate, split by font size, then by list icon type

Immediately the negative effect of increasing the font size jumps out, and it becomes easier to understand the list icon estimates: square performs the best in the 100% (the original default) font size condition, but performs poorly at the other font sizes, which is why it seems to do only medium-well compared to the others. Given how much better 100% performs than the others, I'm inclined to ignore their results and keep the squares.

100% and squares, however, were the original CSS settings, so I will make no changes to the existing CSS based on these results.

Blockquote formatting

Another bit of formatting I've been meaning to test for a while is seeing how well Readability's pull-quotes next to blockquotes perform, and checking whether my zebra-striping of nested blockquotes is helpful or harmful.

The Readability pull-quote goes like this:

blockquote:before {
    content: "\201C";
    filter: alpha(opacity=20);
    font-family: "Constantia", Georgia, 'Hoefler Text', 'Times New Roman', serif;
    font-size: 4em;
    left: -0.5em;
    opacity: .2;
    position: absolute;
    top: .25em }

The current blockquote striping goes thusly:

blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote {
    z-index: -2;
    background-color: rgb(245, 245, 245); }
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote {
    background-color: rgb(235, 235, 235); }

Implementation

This is another 2x2 design, since we can use the Readability quotes or not, and the zebra-striping or not.

hunk ./static/css/default.css 271
-blockquote, blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote {
-    z-index: -2;
-    background-color: rgb(245, 245, 245); }
-blockquote blockquote, blockquote blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote blockquote {
-    background-color: rgb(235, 235, 235); }
+/* blockquote, blockquote blockquote blockquote, */
+/* blockquote blockquote blockquote blockquote blockquote { */
+/*     z-index: -2; */
+/*     background-color: rgb(245, 245, 245); } */
+/* blockquote blockquote, blockquote blockquote blockquote blockquote, */
+/*blockquote blockquote blockquote blockquote blockquote blockquote { */
+/*     background-color: rgb(235, 235, 235); } */
hunk ./static/templates/default.html 30
-    <div class="ulFontSize_class1"></div>
+    <div class="blockquoteFormatting_class1"></div>
hunk ./static/templates/default.html 148
-      ulFontSize: [
+      blockquoteFormatting: [
hunk ./static/templates/default.html 150
-      name: 's100',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'rz',
+      "blockquoteFormatting_class1": "<style>blockquote: : before { content: '\201C';
filter: alpha(opacity=20);
font-family: 'Constantia', Georgia, 'Hoefler Text', 'Times New Roman', serif; font-size: 4em;left: -0.5em;
opacity: .2; position: absolute; top: .25em }; blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); };
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 155
-      name: 's105',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'orz',
+      "blockquoteFormatting_class1": "<style>blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); };
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 160
-      name: 's110',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'roz',
+      "blockquoteFormatting_class1": "<style>blockquote: : before { content: '\201C';
filter: alpha(opacity=20);
font-family: 'Constantia', Georgia, 'Hoefler Text', 'Times New Roman', serif; font-size: 4em;left: -0.5em;
opacity: .2; position: absolute; top: .25em }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 165
-      name: 's115',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 's120',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c100',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c105',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c110',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c115',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c120',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd100',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd105',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd110',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd115',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd120',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'oroz',
+      "blockquoteFormatting_class1": "<style></style>",
+      "blockquoteFormatting_class2": ""
... ]]

Data

Readability quote  Blockquote highlighting  n       Conversion
no                 yes                      11,663  20.04%
yes                yes                      11,514  19.86%
no                 no                       11,464  19.21%
yes                no                       10,669  18.51%
Total                                       45,310  19.42%

I discovered during this experiment that I could graph the conversion rate of each condition separately:

Google Analytics view of the blockquote factorial test conversions, by day

What I like about this graph is how it demonstrates some basic statistical points:

  1. the more traffic, the smaller the sampling error, and the closer the 4 conditions are to their true values as they cluster together. This illustrates how even what seems like a large difference based on a large amount of data may still be (unintuitively) dominated by sampling error
  2. day to day, any condition can be on top; no matter which one proves superior and which version is the worst, we can spot days where the worst version looks better than the best version. This illustrates how insidious selection biases or choices of datapoints can be: we could easily lie and show black is white if we just managed to cherrypick a little bit.
  3. the underlying traffic does not itself appear to be completely stable or consistent. There are a lot of movements which look like the underlying visitors may be changing in composition slightly and responding slightly differently. This harks back to the papers' warning that for some tests, no answer was possible, as the responses of visitors kept changing which version was performing best.

Analysis

rates <- read.csv(stdin(),header=TRUE)
Readability,Zebra,Type,N,Rate
FALSE,FALSE,new,7191,0.1837
TRUE,TRUE,new,7182,0.1910
FALSE,TRUE,new,7112,0.1800
TRUE,FALSE,new,6508,0.1804
FALSE,TRUE,old,4652,0.2236
TRUE,FALSE,old,4452,0.1995
TRUE,TRUE,old,4412,0.2201
FALSE,FALSE,old,4374,0.2046


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Readability * Zebra + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                           Estimate Std. Error z value Pr(>|z|)
# (Intercept)                -1.5095     0.0255  -59.09   <2e-16
# ReadabilityTRUE            -0.0277     0.0340   -0.81     0.42
# ZebraTRUE                   0.0327     0.0331    0.99     0.32
# ReadabilityTRUE:ZebraTRUE   0.0609     0.0472    1.29     0.20
# Typeold                     0.1788     0.0239    7.47    8e-14
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.5227     0.0197  -77.20  < 2e-16
# ZebraTRUE     0.0627     0.0236    2.66   0.0079
# Typeold       0.1782     0.0239    7.45  9.7e-14

The top-performing variant is the status quo (no Readability-style quote, zebra-striped blocks). So we keep it.

Font size & ToC background

It was pointed out to me that in my previous font-size test, the clear linear trend may have implied that larger fonts than 100% were bad, but that I was making an unjustified leap in implicitly assuming that 100% was best: if bigger is worse, then mightn't the optimal font size be something smaller than 100%, like 95%?

And while the blockquote background coloring is a good idea, per the previous test, what about the other place on Gwern.net where I use a light background shading: the Table of Contents? Perhaps it would be better with the same background shading as the blockquotes, or with no shading?

Finally, because I am tired of just 2 factors, I throw in a third factor to make it really multifactorial. I picked the number-sizing from the existing list of suggestions.

Each factor has 3 variants, giving 27 conditions:

.num { font-size: 85%; }
.num { font-size: 95%; }
.num { font-size: 100%; }

html { font-size: 85%; }
html { font-size: 95%; }
html { font-size: 100%; }

div#TOC { background: #fff; }
div#TOC { background: #eee; }
div#TOC { background-color: rgb(245, 245, 245); }

Implementation

hunk ./static/templates/default.html 30
-    <div class="blockquoteFormatting_class1"></div>
+    <div class="tocFormatting_class1"></div>
hunk ./static/templates/default.html 150
-      blockquoteFormatting: [
+      tocFormatting: [
hunk ./static/templates/default.html 152
-      name: 'rz',
-      "blockquoteFormatting_class1": "<style>blockquote:before { display: block; font-size: 200%; color: #ccc; content: open-quote; height: 0px; margin-left: -0.55em; position:relative; }; blockquote blockquote, blockquote blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
-      "blockquoteFormatting_class2": ""
+      name: '88f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
hunk ./static/templates/default.html 157
-      name: 'orz',
-      "blockquoteFormatting_class1": "<style>blockquote, blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); }; blockquote blockquote, blockquote blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
-      "blockquoteFormatting_class2": ""
+      name: '88e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
hunk ./static/templates/default.html 162
-      name: 'oroz',
-      "blockquoteFormatting_class1": "<style></style>",
-      "blockquoteFormatting_class2": ""
+      name: '88r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
... ]]

Analysis

rates <- read.csv(stdin(),header=TRUE)
NumSize,FontSize,TocBg,Type,N,Rate
1,9,e,new,3060,0.1513
8,9,e,new,2978,0.1605
9,1,r,new,2965,0.1548
8,8,f,new,2941,0.1629
1,9,f,new,2933,0.1558
9,9,r,new,2932,0.1576
8,9,f,new,2906,0.1473
1,9,r,new,2901,0.1482
9,9,f,new,2901,0.1420
8,8,r,new,2885,0.1567
1,8,e,new,2876,0.1412
8,1,r,new,2869,0.1593
9,8,f,new,2846,0.1472
1,1,e,new,2844,0.1551
1,8,f,new,2841,0.1457
9,8,e,new,2834,0.1478
8,1,f,new,2833,0.1521
1,8,r,new,2818,0.1544
8,8,e,new,2818,0.1678
8,1,e,new,2810,0.1605
1,1,r,new,2806,0.1775
9,8,r,new,2801,0.1682
9,1,e,new,2799,0.1422
8,9,r,new,2764,0.1548
9,9,e,new,2753,0.1478
1,1,f,new,2750,0.1611
9,1,f,new,2700,0.1537
8,8,r,old,1551,0.2521
9,8,e,old,1519,0.2146
9,8,f,old,1505,0.2153
1,8,e,old,1489,0.2317
1,1,e,old,1475,0.2339
8,1,f,old,1416,0.2112
1,9,r,old,1390,0.2245
8,9,e,old,1388,0.2464
9,9,r,old,1379,0.2466
8,9,r,old,1374,0.1907
1,9,f,old,1361,0.2337
8,8,f,old,1348,0.2322
1,9,e,old,1347,0.2279
1,8,f,old,1340,0.2470
9,1,r,old,1336,0.2605
8,1,r,old,1326,0.2119
8,8,e,old,1321,0.2286
9,1,f,old,1318,0.2398
1,1,r,old,1293,0.2111
1,8,r,old,1293,0.2073
9,9,f,old,1261,0.2411
8,9,f,old,1254,0.2113
9,9,e,old,1240,0.2435
1,1,f,old,1232,0.2240
8,1,e,old,1229,0.2587
9,1,e,old,1182,0.2335
9,8,r,old,1032,0.2403


rates[rates$NumSize==1,]$NumSize <- 100
rates[rates$NumSize==9,]$NumSize <- 95
rates[rates$NumSize==8,]$NumSize <- 85
rates[rates$FontSize==1,]$FontSize <- 100
rates[rates$FontSize==9,]$FontSize <- 95
rates[rates$FontSize==8,]$FontSize <- 85
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ NumSize * FontSize * TocBg + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                          Estimate Std. Error z value Pr(>|z|)
# (Intercept)              0.124770   3.020334    0.04     0.97
# NumSize                 -0.022262   0.032293   -0.69     0.49
# FontSize                -0.012775   0.032283   -0.40     0.69
# TocBgf                   4.042812   4.287006    0.94     0.35
# TocBgr                   5.356794   4.250778    1.26     0.21
# NumSize:FontSize         0.000166   0.000345    0.48     0.63
# NumSize:TocBgf          -0.040645   0.045855   -0.89     0.38
# NumSize:TocBgr          -0.054164   0.045501   -1.19     0.23
# FontSize:TocBgf         -0.052406   0.045854   -1.14     0.25
# FontSize:TocBgr         -0.065503   0.045482   -1.44     0.15
# NumSize:FontSize:TocBgf  0.000531   0.000490    1.08     0.28
# NumSize:FontSize:TocBgr  0.000669   0.000487    1.37     0.17
# Typeold                  0.492688   0.015978   30.84   <2e-16
summary(step(g))
# ...Coefficients:
#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)       3.808438   1.750144    2.18   0.0295
# NumSize          -0.059730   0.018731   -3.19   0.0014
# FontSize         -0.052262   0.018640   -2.80   0.0051
# TocBgf           -0.844664   0.285387   -2.96   0.0031
# TocBgr           -0.747451   0.283304   -2.64   0.0083
# NumSize:FontSize  0.000568   0.000199    2.85   0.0044
# NumSize:TocBgf    0.008853   0.003052    2.90   0.0037
# NumSize:TocBgr    0.008139   0.003030    2.69   0.0072
# Typeold           0.492598   0.015975   30.83   <2e-16

The two size tweaks turn out to be unambiguously negative compared to the status quo (with an almost negligible interaction term, probably reflecting reader preference for consistency between letter & number sizes: as one gets smaller, the other does better if it’s smaller too). The Table of Contents backgrounds also survive (thanks to the new-vs-old visitor-type covariate adding power): there were 3 background types, e (#eee) / f (#fff) / r (rgb(245, 245, 245)), and f & r turn out to have negative coefficients, implying that e is best; but e is also the status quo, so no change is recommended.

Multifactorial roundup

At this point it seems worth asking whether running multifactorials has been worthwhile. The analysis is a bit more difficult, and the more factors there are, the harder the interpretation. I’m also not too keen on encoding the combinatorial explosion into a big JS array for ABalytics (though that, at least, could be generated mechanically; see the sketch at the end of this section). In my tests so far, have there been many interactions? A quick tally of the glm()/step() results:

  1. Text & back­ground col­or:

    • orig­i­nal: 2 main, 1 two-way in­ter­ac­tion
    • sur­vived: 2 main, 1 two-way in­ter­ac­tion
  2. List sym­bol and font-size:

    • orig­i­nal: 3 main, 2 two-way in­ter­ac­tions
    • sur­vived: 1 main
  3. Block­quote for­mat­ting:

    • orig­i­nal: 2 main, 1 two-way
    • sur­vived: 1 main
  4. Font size & ToC back­ground:

    • orig­i­nal: 4 mains, 5 two-ways, 2 three­-ways
    • sur­vived: 3 mains, 2 two-way

So of the 11 main effects, 9 two-way interactions, & 2 three-way interactions originally fitted, the reduced models confirmed: 7 mains (64%), 3 two-ways (33%), & 0 three-ways (0%). And of the surviving interactions, only the black/white interaction affected a decision (and even there, if I had instead regressed cbind(Successes, Failures) ~ Black + White, black & white would still have had positive coefficients, just not statistically-significant ones, and so I would likely have made the same choice as I did with the interaction data available).

This is not a re­sound­ing en­dorse­ment so far.
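
As an aside on the combinatorial-explosion complaint above: the big JS arrays need not be hand-written. A sketch in R of generating the 27 tocFormatting variants, assuming the ABalytics format is exactly as shown earlier (the variable names here are mine):

bgs  <- c(f="#fff", e="#eee", r="rgb(245, 245, 245)")
grid <- expand.grid(bg=names(bgs), html=c(85,95,100), num=c(85,95,100),
                    stringsAsFactors=FALSE)
# variant code: first digit of each size ('85'->'8', '100'->'1') plus the background letter
grid$name <- paste0(substr(grid$num,1,1), substr(grid$html,1,1), grid$bg)
grid$css  <- sprintf("<style>.num { font-size: %s%%; } html { font-size: %s%%; } div#TOC { background: %s; }</style>",
                     grid$num, grid$html, bgs[grid$bg])
# print the JS object literals, one per variant:
cat(sprintf("{\nname: '%s',\n\"tocFormatting_class1\": \"%s\",\n\"tocFormatting_class2\": \"\"\n},",
            grid$name, grid$css), sep="\n")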

Section header capitalization

3x3:

  • h1, h2, h3, h4, h5 { text-transform: uppercase; }
  • h1, h2, h3, h4, h5 { text-transform: none; }
  • h1, h2, h3, h4, h5 { text-transform: capitalize; }
  • div#header h1 { text-transform: uppercase; }
  • div#header h1 { text-transform: none; }
  • div#header h1 { text-transform: capitalize; }
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -27,7 +27,7 @@
   <body>

-    <div class="tocFormatting_class1"></div>
+    <div class="headerCaps_class1"></div>

     <div id="main">
       <div id="sidebar">
@@ -152,141 +152,51 @@
       _gaq.push(['_setAccount', 'UA-18912926-1']);

       ABalytics.init({
-      tocFormatting: [
+      headerCaps: [
       {
- name: '88f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
+ name: 'uu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: uppercase; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '88e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'un',
+ "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '88r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'uc',
+ "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
+ name: 'nu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'nn',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'nc',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
+ name: 'cu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'cn',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '18f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
- {
- name: '18e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '18r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'cc',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  }
       ],
       }, _gaq);
       ...)}
rates <- read.csv(stdin(),header=TRUE)
Sections,Title,Old,N,Rate
c,u,FALSE,2362,0.1808
c,n,FALSE,2356,0.1855
c,c,FALSE,2342,0.2003
u,u,FALSE,2341,0.1965
u,c,FALSE,2333,0.1989
n,u,FALSE,2329,0.1928
n,c,FALSE,2323,0.1941
n,n,FALSE,2321,0.1978
u,n,FALSE,2315,0.1965
c,c,TRUE,1370,0.2190
n,u,TRUE,1302,0.2558
u,u,TRUE,1271,0.2919
c,n,TRUE,1258,0.2377
u,c,TRUE,1228,0.2272
n,c,TRUE,1211,0.2337
n,n,TRUE,1200,0.2400
c,u,TRUE,1135,0.2396
u,n,TRUE,1028,0.2442


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Sections * Title + Old, data=rates, family="binomial"); summary(g)
# ...Coefficients:
# (Intercept)       -1.4552     0.0422  -34.50   <2e-16
# Sectionsn          0.0111     0.0581    0.19    0.848
# Sectionsu          0.0163     0.0579    0.28    0.779
# Titlen            -0.0153     0.0579   -0.26    0.791
# Titleu            -0.0318     0.0587   -0.54    0.588
# OldTRUE            0.2909     0.0283   10.29   <2e-16
# Sectionsn:Titlen   0.0429     0.0824    0.52    0.603
# Sectionsu:Titlen   0.0419     0.0829    0.51    0.613
# Sectionsn:Titleu   0.0732     0.0825    0.89    0.375
# Sectionsu:Titleu   0.1553     0.0820    1.89    0.058
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.4710     0.0263  -55.95   <2e-16
# Sectionsn     0.0497     0.0337    1.47    0.140
# Sectionsu     0.0833     0.0337    2.47    0.013
# OldTRUE       0.2920     0.0283   10.33   <2e-16

Uppercase and ‘none’ beat ‘capitalize’ in both page titles & section headers (the interaction does not survive, and only the section-header effect remains statistically-significant in the reduced model). So I toss in a CSS declaration to uppercase the section headers, while keeping the status quo for the title.
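
To put the surviving log-odds coefficients on the more interpretable probability scale, one can predict new-visitor conversion rates under each section-header style; a sketch using the reduced model from above (‘gr’ is my name for it, and the printed rates follow from the coefficients already shown):

gr <- step(g)  # the reduced model: ~ Sections + Old
predict(gr, newdata=data.frame(Sections=factor(c("c","n","u")), Old=FALSE),
        type="response")
# capitalize ≈ 0.187; none ≈ 0.195; uppercase ≈ 0.200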

ToC formatting

After the page title, the next thing a reader will generally see on my pages is the table of contents. It’s been tweaked over the years (particularly by suggestions from Hacker News) but still has some untested aspects, particularly the first two declarations of div#TOC:

    float: left;
    width: 25%;

I’d like to test left vs right, and 15,20,25,30,35%, so that’s a 2x5 de­sign. Usual im­ple­men­ta­tion:

diff --git a/static/templates/default.html b/static/templates/default.html
index 83c6f9c..11c4ada 100644
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -27,7 +27,7 @@
   <body>

-    <div class="headerCaps_class1"></div>
+    <div class="tocAlign_class1"></div>

     <div id="main">
       <div id="sidebar">
@@ -152,51 +152,56 @@
       _gaq.push(['_setAccount', 'UA-18912926-1']);

       ABalytics.init({
-      headerCaps: [
+      tocAlign: [
       {
-      name: 'uu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: uppercase; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'l15',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 15%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'un',
-      "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'l20',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 20%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'uc',
-      "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'l25',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 25%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'l30',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 30%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nn',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'l35',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 35%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nc',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'r15',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 15%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'r20',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 20%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cn',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'r25',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 25%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cc',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'r30',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 30%; }</style>",
+      "tocAlign_class2": ""
+      },
+      {
+      name: 'r35',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 35%; }</style>",
+      "tocAlign_class2": ""
       }
       ],
}, _gaq);

I de­cided to end this test early on 2014-03-10 be­cause I wanted to move onto the Bee­Line Reader test, so it’s un­der­pow­ered & the re­sults aren’t as clear as usu­al:

rates <- read.csv(stdin(),header=TRUE)
Alignment,Width,Old,N,Rate
r,25,FALSE,1040,0.1673
r,30,FALSE,1026,0.1891
l,20,FALSE,1023,0.1896
l,25,FALSE,1022,0.1800
l,35,FALSE,1022,0.1820
l,30,FALSE,1016,0.1781
l,15,FALSE,1010,0.1851
r,15,FALSE,991,0.1554
r,20,FALSE,989,0.1881
r,35,FALSE,969,0.1672
l,30,TRUE,584,0.2414
l,25,TRUE,553,0.2224
l,20,TRUE,520,0.3096
r,15,TRUE,512,0.2539
l,35,TRUE,496,0.2520
r,25,TRUE,494,0.2105
l,15,TRUE,482,0.2282
r,35,TRUE,480,0.2417
r,20,TRUE,460,0.2326
r,30,TRUE,455,0.2549


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Alignment * Width + Old, data=rates, family="binomial"); summary(g)
# Coefficients:
#                  Estimate Std. Error z value Pr(>|z|)
# (Intercept)      -1.43309    0.10583  -13.54   <2e-16
# Alignmentr       -0.17726    0.15065   -1.18     0.24
# Width            -0.00253    0.00403   -0.63     0.53
# OldTRUE           0.40092    0.04184    9.58   <2e-16
# Alignmentr:Width  0.00450    0.00580    0.78     0.44

So, as I ex­pect­ed, putting the ToC on the right per­formed worse; the larger ToC widths don’t seem to be bet­ter but it’s un­clear what’s go­ing on there. A vi­sual in­spec­tion of the Width data (library(ggplot2); qplot(Width,Rate,color=Alignment,data=rates)) sug­gests that 20% width was the best vari­ant, so might as well go with that.
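
Since an interior optimum like 20% cannot be captured by the single linear Width term above, a follow-up check would let width act nonlinearly, or as an unordered factor; a sketch (‘g2’/‘g3’ are my names, reusing the rates data above):

g2 <- glm(cbind(Successes,Failures) ~ Alignment + poly(Width,2) + Old,
          data=rates, family="binomial"); summary(g2)
# or treat each width as its own level:
g3 <- glm(cbind(Successes,Failures) ~ Alignment + as.factor(Width) + Old,
          data=rates, family="binomial"); summary(g3)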

BeeLine Reader text highlighting

BLR is a JS li­brary for high­light­ing tex­tual para­graphs with pairs of half-lines to make read­ing eas­i­er. I run a ran­dom­ized ex­per­i­ment on sev­eral differ­ent­ly-col­ored ver­sions to see if de­fault site-wide us­age of BLR will im­prove time-on-page for Gw­ern.net read­ers, in­di­cat­ing eas­ier read­ing of the long-form tex­tual con­tent. Most ver­sions per­form worse than the con­trol of no-high­light­ing; the best ver­sion per­forms slightly bet­ter but the im­prove­ment is not sta­tis­ti­cal­ly-sig­nifi­cant.

Bee­Line Reader (BLR) is an in­ter­est­ing new browser plu­gin which launched around Oc­to­ber 2013; I learned of it from the Hacker News dis­cus­sion. The idea is that part of the diffi­culty in read­ing text is that when one fin­ishes a line and sac­cades left to the con­tin­u­a­tion of the next line, the un­cer­tainty of where it is adds a bit of stress, so one can make read­ing eas­ier by adding some sort of guide to the next line; in this case, each match­ing pair of half-lines is col­ored differ­ent­ly, so if you are on a red half-line, when you sac­cade left, you look for a line also col­ored red, then you switch to blue in the mid­dle of that line, and so on. A col­or­ful vari­ant on writ­ing. I found the de­fault BLR col­or­ing gar­ish & dis­tract­ing, but I could­n’t see any rea­son that a sub­tle gray vari­ant would not help: the idea seems plau­si­ble. And very long text pages (like mine) are where BLR should shine most.

I asked if there were a JavaScript ver­sion I could use in an A/B test; the ini­tial JS im­ple­men­ta­tion was not fast enough, but by 2014-03-10 it was good enough. BLR has sev­eral themes, in­clud­ing “gray”; I de­cided to test the vari­ants no BLR, “dark”, “blues”, & ex­panded the gray se­lec­tion to in­clude grays #222222/#333333/#444444/#555555/#666666/#777777 (gray-6; they vary in how bla­tant the high­light­ing is) for a to­tal of 9 equal­ly-ran­dom­ized vari­ants.

Since I’m particularly interested in these results, and I think many other people will find the results interesting, I will run this test extra-long: a minimum of 2 months. I’m only interested in the best variant, not estimating each variant exactly (what do I care if the ugly dark is 15% rather than 14%? I just want to know it’s worse than the control), so conceptually I want something like a sequential analysis or adaptive trial or multi-armed bandit where bad variants get dropped over time; unfortunately, I haven’t studied them yet (and MABs would be hard to implement on a static site), so I’ll just ad hoc drop the worst variant every week or two. (Maybe next experiment I’ll do a formal adaptive trial.)
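
For concreteness, a minimal sketch of what a Bernoulli multi-armed bandit (Thompson sampling) would look like here, on toy conversion rates of my own invention: each visitor is assigned the arm whose sampled Beta-posterior conversion rate is highest, so bad arms get starved of traffic automatically rather than by ad hoc weekly deletions:

# Thompson sampling: sample each arm's Beta(successes+1, failures+1) posterior,
# play the arm with the highest sample
thompson <- function(s, f) { which.max(rbeta(length(s), s+1, f+1)) }
set.seed(2014)
true.p <- c(0.16, 0.15, 0.15, 0.16, 0.15, 0.14, 0.13, 0.16, 0.15) # toy rates
s <- f <- rep(0, length(true.p))
for (i in 1:10000) {
   arm <- thompson(s, f)
   if (rbinom(1, 1, true.p[arm]) == 1) { s[arm] <- s[arm]+1 } else { f[arm] <- f[arm]+1 }
}
round((s+f) / 10000, 3) # share of traffic each arm received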

Setup

The usual implementation using ABalytics doesn’t work because it uses an innerHTML call to substitute the various fragments, and while HTML & CSS get interpreted fine, JavaScript does not; the offered solutions were sufficiently baroque that I wound up implementing a custom subset of ABalytics hardwired for BLR inside the Analytics script:

     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-18912926-1']);
+     // A/B test: heavily based on ABalytics
+      function readCookie (name) {
+        var nameEQ = name + "=";
+        var ca = document.cookie.split(';');
+        for(var i=0;i < ca.length;i++) {
+            var c = ca[i];
+            while (c.charAt(0)==' ') c = c.substring(1,c.length);
+            if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length);
+        }
+        return null;
+      }
+
+      if (typeof(start_slot) == 'undefined') start_slot = 1;
+      var experiment = "blr3";
+      var variant_names = ["none", "dark", "blues", "gray1", "gray2", "gray3", "gray4", "gray5", "gray6"];
+
+      var variant_id = this.readCookie("ABalytics_"+experiment);
+      if (!variant_id || !variant_names[variant_id]) {
+      var variant_id = Math.floor(Math.random()*variant_names.length);
+      document.cookie = "ABalytics_"+experiment+"="+variant_id+"; path=/";
+                        }
+      function beelinefy (COLOR) {
+       if (COLOR != "none") {
+          var elements=document.querySelectorAll("#content");
+          for(var i=0;i < elements.length;i++) {
+                          var beeline=new BeeLineReader(elements[i], { theme: COLOR, skipBackgroundColor: true, skipTags: ['math', 'svg', 'h1', 'h2', 'h3', 'h4'] });
+                          beeline.color();
+                          }
+       }
+      }
+      beelinefy(variant_names[variant_id]);
+      _gaq.push(['_setCustomVar',
+                  start_slot,
+                  experiment,                 // The name of the custom variable = name of the experiment
+                  variant_names[variant_id],  // The value of the custom variable = variant shown
+                  2                           // Sets the scope to session-level
+                 ]);
      _gaq.push(['_trackPageview']);

The themes are de­fined in beeline.min.js as:

r.THEMES={
 dark: ["#000000","#970000","#000000","#00057F","#FBFBFB"],
 blues:["#000000","#0000FF","#000000","#840DD2","#FBFBFB"],
 gray1:["#000000","#222222","#000000","#222222","#FBFBFB"],
 gray2:["#000000","#333333","#000000","#333333","#FBFBFB"],
 gray3:["#000000","#444444","#000000","#444444","#FBFBFB"],
 gray4:["#000000","#555555","#000000","#555555","#FBFBFB"],
 gray5:["#000000","#666666","#000000","#666666","#FBFBFB"],
 gray6:["#000000","#777777","#000000","#777777","#FBFBFB"]
}

(Why “blr3”? I don’t know JS, so it took some time; things I learned along the way included always leaving whitespace around a < operator, and that the “none” argument passed into beeline.setOptions causes a problem which some browsers will ignore (continuing to record A/B data afterwards) but most browsers will not; this broke the original test. Then I discovered that BLR by default broke all the MathML/MathJax, causing nasty-looking errors on pages with math expressions; this broke the second test, and I had to get a fixed version.)

Data

On 31 March, with to­tal n hav­ing reached 15652 vis­its, I deleted the worst-per­form­ing vari­ant: gray4, which at 19.21% was sub­stan­tially un­der­per­form­ing the best-per­form­ing vari­ant’s 22.38%, and wast­ing traffic. On 6 April, two Hacker News sub­mis­sions hav­ing dou­bled vis­its to 36533, I deleted the nex­t-worst vari­ant, gray5 (14.66% vs con­trol of 16.25%; p = 0.038). On 9 April, the al­most as in­fe­rior gray6 (15.67% vs 16.26%) was delet­ed. On 17 April, dark (16.00% vs 16.94%) was delet­ed. On 30 April, I deleted gray2 (17.56% vs 18.07%). 11 May, blues was gone (18.11% vs 18.53%), and on 31 May, I deleted gray3 (18.04% vs 18.24%).

Due to caching, the dele­tions did­n’t nec­es­sar­ily drop data col­lec­tion in­stantly to ze­ro. Traffic was also het­ero­ge­neous: Hacker News traffic is much less likely to spend much time on page than the usual traffic.

The con­ver­sion data, with new vs re­turn­ing vis­i­tor, seg­mented by pe­ri­od, and or­dered by when a vari­ant was delet­ed:

Variant Old Total: n (%) 10–31 March 1–6 April 7–9 April 10–17 April 18–30 April 1–11 May 12–31 May 1–8 June
none FALSE 17648 (16.01%) 1189 (19.26%) 3607 (13.97%) 460 (17.39%) 1182 (16.58%) 3444 (17.04%) 2397 (14.39%) 3997 (17.39%) 2563 (16.35%)
none TRUE 8009 (23.65%) 578 (24.91%) 1236 (22.09%) 226 (20.35%) 570 (23.86%) 1364 (27.05%) 1108 (23.83%) 2142 (22.46%) 1363 (23.84%)
gray1 FALSE 17579 (16.28%) 1177 (19.71%) 3471 (14.06%) 475 (13.47%) 1200 (17.33%) 3567 (17.49%) 2365 (13.57%) 3896 (18.17%) 2605 (17.24%)
gray1 TRUE 7694 (23.85%) 515 (28.35%) 1183 (23.58%) 262 (21.37%) 518 (21.43%) 1412 (26.56%) 1090 (24.86%) 2032 (22.69%) 1197 (23.56%)
gray3 FALSE 14871 (15.81%) 1192 (18.29%) 3527 (14.15%) 446 (15.47%) 1160 (15.43%) 3481 (17.98%) 2478 (14.65%) 3776 (16.26%) 3 (33.33%)
gray3 TRUE 6631 (23.06%) 600 (24.83%) 1264 (21.52%) 266 (18.05%) 638 (21.79%) 1447 (25.22%) 1053 (24.60%) 1912 (23.17%) 51 (5.88%)
blues FALSE 10844 (15.34%) 1157 (18.93%) 3470 (14.35%) 449 (16.04%) 1214 (15.57%) 3346 (17.54%) 2362 (13.46%) 3 (0.00%)
blues TRUE 4544 (23.04%) 618 (27.18%) 1256 (23.81%) 296 (20.27%) 584 (22.09%) 1308 (24.46%) 1052 (22.15%) 48 (12.50%)
gray2 FALSE 8646 (15.51%) 1220 (20.33%) 3649 (13.81%) 416 (15.14%) 1144 (15.03%) 3433 (17.54%) 4 (0.00%)
gray2 TRUE 3366 (22.82%) 585 (22.74%) 1271 (21.79%) 230 (16.52%) 514 (21.60%) 1298 (25.42%) 44 (27.27%) 6 (0.00%) 3 (0.00%)
dark FALSE 5240 (14.05%) 1224 (20.59%) 3644 (13.83%) 420 (13.81%) 1175 (14.81%) 1 (0.00%)
dark TRUE 2161 (20.59%) 618 (21.52%) 1242 (20.85%) 276 (21.74%) 574 (20.56%) 64 (10.94%) 1 (0.00%) 2 (0.00%) 2 (50.00%)
gray6 FALSE 4022 (13.30%) 1153 (19.51%) 3610 (12.88%) 409 (17.11%) 1 (0.00%) 2 (0.00%) 3 (0.00%)
gray6 TRUE 1727 (20.61%) 654 (23.70%) 1358 (22.02%) 259 (18.92%) 95 (7.37%) 11 (9.09%) 1 (0.00%)
gray5 FALSE 3245 (12.20%) 1175 (16.68%) 3242 (12.21%) 3 (0.00%)
gray5 TRUE 1180 (21.53%) 559 (25.94%) 1130 (21.77%) 34 (17.65%) 16 (12.50%)
gray4 FALSE 1176 (18.54%) 1174 (18.57%) 1174 (18.57%) 2 (0.00%)
gray4 TRUE 673 (19.91%) 650 (20.31%) 669 (20.03%) 1 (0.00%) 1 (0.00%) 2 (0.00%)
Total: 137438 (18.27%)

Graphed:

Weekly con­ver­sion rates for each of the Bee­Line Reader set­tings

I also re­ceived a num­ber of com­plaints while run­ning the BLR test (prin­ci­pally due to the dark and blues vari­ants, but also ap­par­ently trig­gered by some of the less pop­u­lar gray vari­ants; the num­ber of com­plaints dropped off con­sid­er­ably by halfway through):

  • 2 in emails
  • 2 on IRC unsolicited; when I later asked, there were 2 further complaints of slowness in loading & reflowing pages
  • 2 on Red­dit
  • 3 men­tions in Gw­ern.net com­ments
  • 4 through my anony­mous feed­back form
  • 6 com­plaints on Hacker News
  • to­tal: 19

Analysis

The BLR peo­ple say that there may be cross-browser differ­ences, so I thought about throw­ing in browser as a co­vari­ate too (an un­ordered fac­tor of Chrome & Fire­fox, and maybe I’ll bin every­thing else as an ‘other’ browser); it seems I may have to use the GA API to ex­tract con­ver­sion rates split by vari­ant, vis­i­tor sta­tus, and brows­er. This turned out to be enough work that I de­cided to not both­er.

As usu­al, a lo­gis­tic re­gres­sion on the var­i­ous BLR themes with new vs re­turn­ing vis­i­tors (Old) as a co­vari­ate. Be­cause of the het­ero­gene­ity in traffic (and be­cause I both­ered break­ing out the data by time pe­riod this time for the table), I also in­clude each block as a fac­tor. Fi­nal­ly, be­cause I ex­pected the 6 gray vari­ants to per­form sim­i­lar­ly, I try out a mul­ti­level model nest­ing the grays to­geth­er.

The results are not impressive: only 2 of the 8 variants have a positive estimate (both grays), and neither is statistically-significant; the best variant was gray1 (“#222222” & “#FBFBFB”), at an estimated increase from 19.52% to 20.04% conversion rate. More surprising, the nesting turns out not to matter at all; indeed, the worst variant (gray5) was also a gray. (The best-fitting multilevel model ignores the variants entirely, although it did not fit better than the regular logistic model incorporating all of the time periods, Old, and variants.)

# Pivot table view on custom variable:
# ("Secondary dimension: User Type"; "Pivot by: Custom Variable (Value 01); Pivot metrics: Sessions | Time reading (Goal 1 Conversion Rate)")
# then hand-edited to add Color and Date variables
rates <- read.csv("https://www.gwern.net/docs/traffic/2014-06-08-abtesting-blr.csv")

rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

# specify the control group is 'none'
rates$Variant <- relevel(rates$Variant, ref="none")
rates$Color <- relevel(rates$Color, ref="none")

# normal:
g0 <- glm(cbind(Successes,Failures) ~ Old + Variant + Date, data=rates, family=binomial); summary(g0)
# ...Coefficients:
#                  Estimate Std. Error z value Pr(>|z|)
# (Intercept)     -1.633959   0.027712  -58.96  < 2e-16
# OldTRUE          0.465491   0.014559   31.97  < 2e-16
# Date10-17 April -0.021047   0.037563   -0.56   0.5753
# Date10-31 March  0.150498   0.035017    4.30  1.7e-05
# Date1-11 May    -0.107965   0.035133   -3.07   0.0021
# Date12-31 May    0.009534   0.032448    0.29   0.7689
# Date1-6 April   -0.138053   0.031809   -4.34  1.4e-05
# Date18-30 April  0.095898   0.031817    3.01   0.0026
# Date7-9 April   -0.129704   0.047314   -2.74   0.0061
#
# Variantgray5    -0.114487   0.040429   -2.83   0.0046
# Variantdark     -0.060299   0.033912   -1.78   0.0754
# Variantgray2    -0.027338   0.028518   -0.96   0.3378
# Variantblues    -0.012120   0.026330   -0.46   0.6453
# Variantgray3    -0.005484   0.023441   -0.23   0.8150
# Variantgray4    -0.003556   0.047273   -0.08   0.9400
# Variantgray6     0.000536   0.036308    0.01   0.9882
# Variantgray1     0.026765   0.021757    1.23   0.2186

library(lme4)
g1 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color/Variant) + (1|Date), data=rates, family=binomial)
g2 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color)         + (1|Date), data=rates, family=binomial)
g3 <- glmer(cbind(Successes,Failures) ~ Old +                     (1|Date), data=rates, family=binomial)
g4 <- glmer(cbind(Successes,Failures) ~ Old + (1|Variant),                  data=rates, family=binomial)
g5 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color),                    data=rates, family=binomial)
AIC(g0, g1, g2, g3, g4, g5)
#    df  AIC
# g0 17 1035
# g1  5 1059
# g2  4 1058
# g3 13 1041
# g4  3 1252
# g5  3 1264
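
Converting the AICs into Akaike weights makes the model comparison concrete; a sketch (the weights are approximate, since the AICs printed above are rounded):

aics  <- AIC(g0, g1, g2, g3, g4, g5)$AIC
delta <- aics - min(aics)
round(exp(-delta/2) / sum(exp(-delta/2)), 3)
# g0 ≈ 0.95, g3 ≈ 0.05, everything else ≈ 0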

Conclusion

An unlikely +0.5% to reading rates isn’t enough for me to want to add a dependency on another JS library, so I will be removing BLR. I’m not surprised by this result: most tests don’t show an improvement, BLR coloring is pretty unusual for a website, and users wouldn’t have any understanding of what it is or ability to opt out of it; using BLR by default doesn’t work, but the browser extension might be useful, since there the user expects the coloring & can choose their preferred color scheme.

I was surprised that the gray variants could perform so wildly differently, from slightly better than the control to horribly worse, considering that they didn’t strike me as looking that different when I was previewing them locally. I also didn’t expect blues to last as long as it did, and thought I would be deleting it as soon as dark. This makes me wonder: are there color themes only subtly different from the ones I tried which might work unpredictably well? Since BLR by default offers only a few themes, I think BLR should try out as many color themes as possible to locate good ones they’ve missed.

Some lim­i­ta­tions to this ex­per­i­ment:

  • no way for users to dis­able BLR or change color themes
  • did not in­clude web browser type as a co­vari­ate, which might have shown that par­tic­u­lar com­bi­na­tions of browser & theme sub­stan­tially out­per­formed the con­trol (then BLR could have im­proved their code for the bad browsers or a browser check done be­fore high­light­ing any text)
  • did not use for­mal adap­tive trial method­ol­o­gy, so the p-val­ues have no par­tic­u­lar in­ter­pre­ta­tion

Floating footnotes

One of the site features I like the most is how the endnotes pop out/float when the mouse hovers over the link, so the reader doesn’t have to jump to the endnotes and back, jarring their concentration and breaking their train of thought. I got the JS from Lukas Mathis back in 2010. But sometimes the mouse hovers by accident, and with big footnotes, the popped-up footnote can cover the screen and be unreadable. I’ve wondered if it’s as cool as I think it is, or whether it might be damaging. So now that I’ve hacked up an ABalytics clone which can handle JS in order to run the BLR experiment, I might as well run an A/B test to verify that the floating footnotes are not badly damaging conversions. (I’m not demanding the floating footnotes increase conversions by 1% or anything, just that the floating isn’t coming at too steep a price.)

Implementation

diff --git a/static/js/footnotes.js b/static/js/footnotes.js
index 69088fa..e08d63c 100644
--- a/static/js/footnotes.js
+++ b/static/js/footnotes.js
@@ -1,7 +1,3 @@
-$(document).ready(function() {
-    Footnotes.setup();
-});
-

diff --git a/static/templates/default.html b/static/templates/default.html
index 4395130..8c97954 100644
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -133,6 +133,9 @@
     <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>

+    <script type="text/javascript" src="/static/js/footnotes.js"></script>
+
     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
@@ -151,14 +154,23 @@

       if (typeof(start_slot) == 'undefined') start_slot = 1;
-      var experiment = "blr3";
-      var variant_names = ["none", "gray1"];
+      var experiment = "floating_footnotes";
+      var variant_names = ["none", "float"];

       var variant_id = this.readCookie("ABalytics_"+experiment);
       if (!variant_id || !variant_names[variant_id]) {
       var variant_id = Math.floor(Math.random()*variant_names.length);
       document.cookie = "ABalytics_"+experiment+"="+variant_id+"; path=/";
                         }
+      // enable the floating footnotes
+      function footnotefy (VARIANT) {
+       if (VARIANT != "none") {
+         $(document).ready(function() {
+                        Footnotes.setup();
+                        });
+       }
+      }
+      footnotefy(variant_names[variant_id]);
       _gaq.push(['_setCustomVar',
                   start_slot,
                   experiment,                 // The name of the custom variable = name of the experiment
                   ...)]
@@ -196,9 +208,6 @@
     <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

-    <script type="text/javascript" src="/static/js/footnotes.js"></script>
-
     <script type="text/javascript" src="/static/js/tablesorter.js"></script>
     <script type="text/javascript" id="tablesorter">

Data

2014-06-08–2014-07-12:

Vari­ant Old n Con­ver­sion
none FALSE 10342 17.00%
float FALSE 10039 17.42%
none TRUE 4767 22.24%
float TRUE 4876 22.40%
none 15109 18.65%
float 14915 19.05%
30024 18.85%

Analysis

rates <- read.csv(stdin(),header=TRUE)
Footnote,Old,N,Rate
none,FALSE,10342,0.1700
float,FALSE,10039,0.1742
none,TRUE,4767,0.2224
float,TRUE,4876,0.2240


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

rates$Footnote <- relevel(rates$Footnote, ref="none")

g <- glm(cbind(Successes,Failures) ~ Footnote + Old, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -1.5820     0.0237  -66.87   <2e-16
# Footnotefloat   0.0222     0.0296    0.75     0.45
# OldTRUE         0.3234     0.0307   10.53   <2e-16
confint(g)
#                  2.5 %   97.5 %
# (Intercept)   -1.62856 -1.53582
# Footnotefloat -0.03574  0.08018
# OldTRUE        0.26316  0.38352

As I had hoped, floating footnotes seem to do no harm, and the point-estimate is positive. The 95% CI, while not excluding zero, does exclude values worse than -0.035 (log-odds), which satisfies me: if floating footnotes are doing any harm, it’s a small harm.
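
To put that lower bound on the probability scale, a quick sketch using the intercept & CI bound above: for new visitors, the worst case is a drop from roughly 17.0% to 16.5%, ie. about half a percentage point:

plogis(-1.5820)           # baseline conversion rate: ≈ 0.170
plogis(-1.5820 - 0.03574) # at the lower 95% CI bound: ≈ 0.166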

Indented paragraphs

An anony­mous feed­back sug­gested a site de­sign tweak:

Could you for­mat your pages so that the texts are all aligned at the left? It looks un­pro­fes­sional when the lines of text break at differ­ent ar­eas. Could you make the site like a LaTeX ar­ti­cle? The for­mat­ting is the only thing pre­vent­ing you from look­ing re­ally pro­fes­sion­al.

I was­n’t sure what he meant, since the text is left­-aligned, and I can’t ask for clar­i­fi­ca­tion (anony­mous means anony­mous).

Looking at a random page, my best guess is that he’s bothered by the indentation at the start of successive paragraphs: in a sequence of paragraphs, the first paragraph is not indented (because it can’t be visually confused with a continuation) but the successive paragraphs are indented by 1.5em in order to make reading easier. The CSS is:

p { margin-top: -0.2em;
    margin-bottom: 0 }
p + p {
  text-indent: 1.5em;
  margin-top: 0 }

I liked this, but I sup­pose for lots of small para­graphs, it lends a ragged ap­pear­ance to the page. So might as well test a few vari­ants of text-indent to see what works best: 0em, 0.1, 0.5, 1.0, 1.5, and 2.0.

In retrospect years later, after learning more about typography and revamping Gwern.net CSS a number of times, I think Anonymous was actually talking about text justification: HTML/Gwern.net is by default “flush left, ragged right”, with large whitespace gaps left where words of different lengths get moved to the next line but not broken/hyphenated or stretched to fill the line. Some people do not like text justification, describing ragged right as easier to read, but most typographers endorse it: it was historically the norm for professionally-set print, still carries connotations of class, and I think the appearance fits in with my overall site esthetic. I eventually enabled text justification on Gwern.net in February 2019 (although I was irritated by the discovery that the standard CSS method of doing so does not work in the Chrome browser due to a long-standing failure to implement hyphenation support).

Implementation

Since we’re back to test­ing CSS, we can use the old AB­a­lyt­ics ap­proach with­out hav­ing to do JS cod­ing:

--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -19,6 +19,9 @@
   </head>
   <body>

+   <div class="indent_class1"></div>
+
     <div id="main">
       <div id="sidebar">
         <div id="logo"><img alt="Logo: a Gothic/Fraktur blackletter capital G/𝕲" height="36" src="/images/logo/logo.png" width="32" /></div>
@@ -136,10 +139,48 @@
     <script type="text/javascript" src="/static/js/footnotes.js"></script>

+    <script type="text/javascript" src="/static/js/abalytics.js"></script>
+    <script type="text/javascript">
+      window.onload = function() {
+      ABalytics.applyHtml();
+      };
+    </script>
+
     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-18912926-1']);
+
+      ABalytics.init({
+      indent: [
+      {
+      name: "none",
+      "indent_class1": "<style>p + p { text-indent: 0.0em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent0.1",
+      "indent_class1": "<style>p + p { text-indent: 0.1em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent0.5",
+      "indent_class1": "<style>p + p { text-indent: 0.5em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent1.0",
+      "indent_class1": "<style>p + p { text-indent: 1.0em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent1.5",
+      "indent_class1": "<style>p + p { text-indent: 1.5em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent2.0",
+      "indent_class1": "<style>p + p { text-indent: 2.0em; margin-top: 0 }</style>"
+      }
+      ],
+      }, _gaq);
+
       _gaq.push(['_trackPageview']);
       (function() { // })

Data

On 2014-07-27, since the 95% CIs for the best and worst in­dent vari­ants no longer over­lapped, I deleted the worst vari­ant (0.1). On 2014-08-23, the 2.0em and 0.0em vari­ants no longer over­lapped, and I deleted the lat­ter.

Daily traffic and con­ver­sion rates for each of the in­den­ta­tion set­tings

The con­ver­sion data, with new vs re­turn­ing vis­i­tor, seg­mented by pe­ri­od, and or­dered by when a vari­ant was delet­ed:

Vari­ant Old To­tal: n (%) 12-27 July 28 Ju­ly-23 Au­gust 24 Au­gust-19 No­vem­ber
0.1 FALSE 1552 (18.11%) 1551 (18.12%) 1552 (18.11%)
0.1 TRUE 707 (21.64%) 673 (21.69%) 706 (21.67%) 6 (0.00%)
none FALSE 5419 (16.70%) 1621 (17.27%) 5419 (16.70%) 3179 (16.55%)
none TRUE 2742 (23.23%) 749 (27.77%) 2684 (23.62%) 1637 (21.01%)
0.5 FALSE 26357 (15.09%) 1562 (18.89%) 5560 (17.86%) 24147 (14.74%)
0.5 TRUE 10965 (21.35%) 728 (23.63%) 2430 (23.13%) 9939 (21.06%)
1.0 FALSE 25987 (14.86%) 1663 (19.42%) 5615 (17.68%) 23689 (14.39%)
1.0 TRUE 11288 (21.14%) 817 (25.46%) 2498 (24.38%) 10159 (20.74%)
1.5 FALSE 26045 (14.54%) 1619 (16.80%) 5496 (16.67%) 23830 (14.26%)
1.5 TRUE 11255 (21.60%) 694 (26.95%) 2647 (24.25%) 10250 (21.00%)
2.0 FALSE 26198 (14.96%) 1659 (18.75%) 5624 (18.31%) 23900 (14.59%)
2.0 TRUE 11125 (21.17%) 781 (25.99%) 2596 (24.27%) 10010 (20.74%)
159634 (16.93%) 14117 (20.44%) 42827 (19.49%) 140746 (16.45%)

Analysis

A sim­ple analy­sis of the to­tals would in­di­cate that 0.1em is the best set­ting - which is odd since it was the worst-per­form­ing and first vari­ant to be delet­ed, so how could it be the best? The graph of traffic sug­gests that, like be­fore, the fi­nal to­tals are con­founded by time-vary­ing changes in con­ver­sion rates plus drop­ping vari­ants; that is, 0.1em prob­a­bly only looks good be­cause after it was dropped, a bunch of Hacker News traffic hit and hap­pened to con­vert at lower rates, mak­ing the sur­viv­ing vari­ants look bad. One might hope that all of that effect would be cap­tured by the Old co­vari­ate as HN traffic gets recorded as new vis­i­tors, but that would be too much to hope for. So in­stead, I add a dummy vari­able for each of the 3 sep­a­rate time-pe­ri­ods which will ab­sorb some of this het­ero­gene­ity and make clearer the effect of the in­den­ta­tion choic­es.

rates <- read.csv(stdin(),header=TRUE)
Indent,Old,Month,N,Rate
0.1,FALSE,July,1551,0.1812
0.1,TRUE,July,673,0.2169
0,FALSE,July,1621,0.1727
0,TRUE,July,749,0.2777
0.5,FALSE,July,1562,0.1889
0.5,TRUE,July,728,0.2363
1.0,FALSE,July,1663,0.1942
1.0,TRUE,July,817,0.2546
1.5,FALSE,July,1619,0.1680
1.5,TRUE,July,694,0.2695
2.0,FALSE,July,1659,0.1875
2.0,TRUE,July,781,0.2599
0.1,FALSE,August,1552,0.1811
0.1,TRUE,August,706,0.2167
0,FALSE,August,5419,0.1670
0,TRUE,August,2684,0.2362
0.5,FALSE,August,5560,0.1786
0.5,TRUE,August,2430,0.2313
1.0,FALSE,August,5615,0.1768
1.0,TRUE,August,2498,0.2438
1.5,FALSE,August,5496,0.1667
1.5,TRUE,August,2647,0.2425
2.0,FALSE,August,5624,0.1831
2.0,TRUE,August,2596,0.2427
0.1,FALSE,November,0,0.000
0.1,TRUE,November,6,0.000
0,FALSE,November,3179,0.1655
0,TRUE,November,1637,0.2101
0.5,FALSE,November,24147,0.1474
0.5,TRUE,November,9939,0.2106
1.0,FALSE,November,23689,0.1439
1.0,TRUE,November,10159,0.2074
1.5,FALSE,November,23830,0.1426
1.5,TRUE,November,10250,0.2100
2.0,FALSE,November,23900,0.1459
2.0,TRUE,November,10010,0.2074


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ as.factor(Indent) + Old + Month, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                         Estimate  Std. Error   z value   Pr(>|z|)
# (Intercept)          -1.55640959  0.02238959 -69.51487 < 2.22e-16
# as.factor(Indent)0.1 -0.05726851  0.04400363  -1.30145  0.1931046
# as.factor(Indent)0.5  0.00249949  0.02503877   0.09982  0.9204833
# as.factor(Indent)1   -0.00877850  0.02502047  -0.35085  0.7256988
# as.factor(Indent)1.5 -0.02435198  0.02505726  -0.97185  0.3311235
# as.factor(Indent)2    0.00271475  0.02498665   0.10865  0.9134817
# OldTRUE               0.42448061  0.01238799  34.26549 < 2.22e-16
# MonthJuly             0.06606325  0.02459961   2.68554  0.0072413
# MonthNovember        -0.20156678  0.01483356 -13.58857 < 2.22e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 1496.6865  on 34  degrees of freedom
# Residual deviance:   41.1407  on 26  degrees of freedom
# AIC: 331.8303

There’s definitely temporal heterogeneity, given the statistical-significance of the time-period dummies, so that is good to know. But the estimated effects for each indentation variant are derisorily small (despite having spent n = 159634), suggesting readers don’t care at all. Since I have no opinion on the matter, I suppose I’ll go with the highest point-estimate, 2em.
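
A rough retrospective power check supports that reading: with roughly 26,000 new-visitor sessions per surviving arm, only differences on the order of a percentage point were detectable, so near-zero estimates are exactly what tiny real effects would look like. A sketch:

power.prop.test(n=26000, p1=0.15, power=0.80)
# solves for p2 ≈ 0.159: a detectable difference of ~0.9 percentage points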

Moving sidebar’s metadata into page

Look­ing at the side­bar some more, it oc­curred to me that the side­bar was serv­ing 3 differ­ent pur­poses all mixed to­geth­er:

  1. site-wide: nav­i­ga­tion to the main index/homepage, as well as meta-site pages like about me, the site, re­cent up­dates, and ways of get­ting RSS/email up­dates
  2. site-wide: do­na­tion re­quests
  3. page-speci­fic: a page’s meta­data about when that page’s con­tent was first cre­at­ed, last mod­i­fied, con­tent tags, etc

The page meta­data is the odd man out, and I’ve no­ticed that a lot of peo­ple seem to not no­tice the page meta­data hid­ing in the side­bar (eg there will be com­ments won­der­ing when a page was cre­at­ed, when that’s listed clearly right there in the page’s side­bar). What if I moved the page meta­data to un­der­neath the big ti­tle? I’d have to change the for­mat­ting, since I can’t afford to spend 10+ ver­ti­cal lines of space the way it must be for­mat­ted in the side­bar, but the meta­data could fit in 2-5 lines if I com­bine the log­i­cal pairs (so in­stead of 4 lines for “cre­at­ed: / 2013-05-07 / mod­i­fied: / 2015-01-09”, just one line “cre­at­ed: 2013-05-07; mod­i­fied: 2015-01-09”).

There are several different possible layouts and levels of density, so I created 6 variants of increasing density.

Implementation

As an HTML rather than CSS change, the im­ple­men­ta­tion as an A/B test is more com­plex.

I define inline in the HTML template each of the 6 variants, as divs with IDs ‘metadata1’..‘metadata6’. In the default.css, I set them to display: none, so the user does not see 6 different metadata blocks taking up 2 screens of space. Then, each A/B variant passed to ABalytics toggles one version back on using display: block. I also include a 7th variant, where none of the 6 should be visible, which is effectively the control condition and roughly matches the status quo of showing the metadata in the sidebar. (“Roughly”, since in the none condition, there won’t be metadata anywhere in the displayed page; but since the previous experiment indicated that removing elements from the sidebar didn’t make any noticeable difference, I decided to simplify the HTML source code by removing the original metadata div entirely, to avoid any collisions or issues with the CSS/HTML I’ve defined.)

So the flow should be:

  1. page HTML loads, all 6 ver­sions may get ren­dered

  2. site-wide de­fault CSS loads, and when in­ter­pret­ed, hides all 6 ver­sions

    (This also means that peo­ple brows­ing with­out Javascript en­abled should still con­tinue to see a read­able ver­sion of the site.)

  3. page JS runs, picks 1 of the 7 variants, and (unless the ‘none’ control was picked) a CSS command is interpreted to expose the corresponding version

  4. JS con­tin­ues to run, and fires (con­verts) if user re­mains on page long enough

The HTML changes:

--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -20,7 +20,7 @@
   <body>

-   <div class="sidebar_test_class1"></div>
+   <div class="metadata_test_class1"></div>

@@ -61,29 +59,6 @@
         </div>
         <hr/>
         </div>
-        <div id="metadata">
-          <div class="abstract"><em>$description$</em></div>
-          <br />
-          <div id="tags"><i>$tags$</i></div>
-          <br />
-          <div id="page-created">created:
-            <br />
-            <i>$created$</i></div>
-          <div id="last-modified">modified:
-            <br />
-            <i>$modified$</i></div>
-          <br />
-          <div id="version">status:
-            <br />
-            <i>$status$</i></div>
-          <br />
-          <div id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
-            <br />
-            <i>$belief$</i>
-          </div>
-          <hr/>
-        </div>
-
         <div id="donations">
           <div id="bitcoin-donation-address">
             <a href="https://en.wikipedia.org/wiki/Bitcoin">₿</a>: 1GWERNkwxeMsBheWgVWEc6NUXD8HkHTUXg
@@ -115,6 +90,102 @@
       </div>

       <div id="content">
+
+<div id="metadata1">
+  <span id="abstract"><em></em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created:
+    <br>
+    <i>$created$</i></span>
+  <br>
+  <span id="last-modified">modified:
+    <br>
+    <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata2">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>
+  <br>
+  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata3">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata4">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status: <i>$status$</i></span>; <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
+<div id="metadata5">
+  <span id="abstract"><em>$description$</em></span> (<span id="tags"><i>$tags$</i></span>)
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status: <i>$status$</i></span>; <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
+<div id="metadata6">
+  <span id="abstract"><em>$description$</em></span> (<span id="tags"><i>$tags$</i></span>)
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>; <span id="version">status: <i>$status$</i></span>; <
span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
         $body$
       </div>
     </div>
@@ -155,28 +226,32 @@
       ABalytics.init({
+       metadata_test: [
       {
-      name: "s1c1d1",
-      "sidebar_test_class1": "<style></style>"
+      name: "none",
+      "metadata_test_class1": "<style></style>"
+      },
+      {
+      name: "meta1",
+      "metadata_test_class1": "<style>div#metadata1 { display: block; }</style>"
       },
       {
-      name: "s1c1d0",
-      "sidebar_test_class1": "<style>div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta2",
+      "metadata_test_class1": "<style>div#metadata2 { display: block; }</style>"
       },
       {
-      name: "s1c0d1",
-      "sidebar_test_class1": "<style>div#cse-sitesearch {visibility:hidden; display:none;}</style>"
+      name: "meta3",
+      "metadata_test_class1": "<style>div#metadata3 { display: block; }</style>"
       },
       {
-      name: "s0c1d1",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}</style>"
+      name: "meta4",
+      "metadata_test_class1": "<style>div#metadata4 { display: block; }</style>"
       },
       {
-      name: "s0c1d0",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}; div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta5",
+      "metadata_test_class1": "<style>div#metadata5 { display: block; }</style>"
       },
       {
-      name: "s0c0d0",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}; div#cse-sitesearch {visibility:hidden; display:none;}; div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta6",
+      "metadata_test_class1": "<style>div#metadata6 { display: block; }</style>"
       }
       ], /* }) */

The CSS changes:

--- a/static/css/default.css
+++ b/static/css/default.css
@@ -90,8 +90,12 @@ div#sidebar-news a {
    text-transform: uppercase;
 }

+/* metadata customization: */
 div#description { font-size: 95%; }
 div#tags, div#page-created, div#last-modified, div#license { font-size: 80%; }
+/* support A/B test by hiding by default all the HTML variants: */
+div#metadata1, div#metadata2, div#metadata3, div#metadata4, div#metadata5, div#metadata6 { display: none; }

Data

On 2015-02-05, the top variant (meta5) outperformed the bottom one (meta1, corresponding to my expectation that the taller variants would be worse than the most compact ones), so the worst was deleted. On 2015-02-08, the new top variant (meta6) outperformed the new bottom one (meta4), so I deleted it too. On 2015-03-22, it outperformed ‘none’. On 2015-05-25, the difference was not statistically-significant, but I decided to delete meta3 anyway. On 2015-07-02, I deleted meta2 similarly; given the ever-smaller differences between variants, it may be time to kill the experiment.

Totals, 2015-01-29–2015-07-27:

Metadata  Returning      N  Conversion rate
meta1     FALSE        835  0.1545
meta1     TRUE         364  0.2060
meta2     FALSE      37140  0.1532
meta2     TRUE       14063  0.2213
meta3     FALSE      26600  0.1538
meta3     TRUE       10045  0.2301
meta4     FALSE       1234  0.1669
meta4     TRUE         462  0.2186
meta5     FALSE      61646  0.1397
meta5     TRUE       20130  0.2109
meta6     FALSE      61608  0.1382
meta6     TRUE       19219  0.2243
none      FALSE       9227  0.1568
none      TRUE        3358  0.2225
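
Collapsing each variant’s new/returning split into a single weighted rate shows the overall standing at a glance; a quick R check on the two eventual finalists (the same computation applies to the full table):

totals <- read.csv(stdin(),header=TRUE)
Metadata,Returning,N,Rate
meta2,FALSE,37140,0.1532
meta2,TRUE,14063,0.2213
meta6,FALSE,61608,0.1382
meta6,TRUE,19219,0.2243

## weighted-average conversion rate per variant:
with(totals, tapply(N*Rate, Metadata, sum) / tapply(N, Metadata, sum))
## meta2 ≈ 0.172, meta6 ≈ 0.159: meta2 leads overall because most traffic is new visitors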

Analysis

rates <- read.csv(stdin(),header=TRUE)
Metadata,Date,Old,N,Rate
meta1,"2015-02-06",FALSE, 832, 0.1538
meta1,"2015-02-06",TRUE, 356, 0.2051
meta2,"2015-02-06",FALSE, 1037, 0.1716
meta2,"2015-02-06",TRUE, 423, 0.2411
meta3,"2015-02-06",FALSE, 1010, 0.1604
meta3,"2015-02-06",TRUE, 431, 0.2204
meta4,"2015-02-06",FALSE, 1061, 0.1697
meta4,"2015-02-06",TRUE, 349, 0.2092
meta5,"2015-02-06",FALSE, 1018, 0.1798
meta5,"2015-02-06",TRUE, 382, 0.2749
meta6,"2015-02-06",FALSE, 1011, 0.1731
meta6,"2015-02-06",TRUE, 423, 0.2837
none ,"2015-02-06",FALSE, 1000, 0.1710
none ,"2015-02-06",TRUE, 434, 0.2074
meta1,"2015-02-09",TRUE, 8, 0.1250
meta2,"2015-02-09",FALSE, 921, 0.1238
meta2,"2015-02-09",TRUE, 248, 0.1895
meta3,"2015-02-09",FALSE, 861, 0.1440
meta3,"2015-02-09",TRUE, 262, 0.2137
meta4,"2015-02-09",FALSE, 189, 0.1429
meta4,"2015-02-09",TRUE, 92, 0.2500
meta5,"2015-02-09",FALSE, 889, 0.1327
meta5,"2015-02-09",TRUE, 304, 0.2401
meta6,"2015-02-09",FALSE, 845, 0.1219
meta6,"2015-02-09",TRUE, 274, 0.2336
none ,"2015-02-09",FALSE, 866, 0.1236
none ,"2015-02-09",TRUE, 236, 0.2288
meta1,"2015-03-23",FALSE, 635, 0.1496
meta1,"2015-03-23",TRUE, 277, 0.1841
meta2,"2015-03-23",FALSE, 9346, 0.1562
meta2,"2015-03-23",TRUE, 3545, 0.2305
meta3,"2015-03-23",FALSE, 9392, 0.1533
meta3,"2015-03-23",TRUE, 3627, 0.2412
meta4,"2015-03-23",FALSE, 1020, 0.1588
meta4,"2015-03-23",TRUE, 381, 0.2231
meta5,"2015-03-23",FALSE, 9359, 0.1631
meta5,"2015-03-23",TRUE, 3744, 0.2228
meta6,"2015-03-23",FALSE, 9532, 0.1600
meta6,"2015-03-23",TRUE, 3479, 0.2483
none ,"2015-03-23",FALSE, 8979, 0.1537
none ,"2015-03-23",TRUE, 3196, 0.2287
meta1,"2015-05-25",TRUE, 1, 0.000
meta2,"2015-05-25",FALSE, 21879, 0.1584
meta2,"2015-05-25",TRUE, 8131, 0.2285
meta3,"2015-05-25",FALSE, 22066, 0.1539
meta3,"2015-05-25",TRUE, 8288, 0.2300
meta5,"2015-05-25",FALSE, 21994, 0.1611
meta5,"2015-05-25",TRUE, 8629, 0.2187
meta6,"2015-05-25",FALSE, 22197, 0.1575
meta6,"2015-05-25",TRUE, 8114, 0.2328
none ,"2015-05-25",FALSE, 4987, 0.1562
none ,"2015-05-25",TRUE, 1721, 0.2342
meta2,"2015-07-02",FALSE, 11016, 0.1452
meta2,"2015-07-02",TRUE, 4291, 0.2123
meta3,"2015-07-02",FALSE, 208, 0.865
meta3,"2015-07-02",TRUE, 137, 0.1387
meta5,"2015-07-02",FALSE, 11336, 0.1451
meta5,"2015-07-02",TRUE, 4165, 0.2091
meta6,"2015-07-02",FALSE, 11051, 0.1397
meta6,"2015-07-02",TRUE, 3879, 0.2274
meta2,"2015-07-28",FALSE, 10299, 0.1448
meta2,"2015-07-28",TRUE, 4086, 0.2102
meta3,"2015-07-28",TRUE, 28, 0.1429
meta5,"2015-07-28",FALSE, 34976, 0.1250
meta5,"2015-07-28",TRUE, 9984, 0.1988
meta6,"2015-07-28",FALSE, 34830, 0.1242
meta6,"2015-07-28",TRUE, 9550, 0.2093


## reconstruct success/failure counts from the reported rates:
rates$Successes <- round(rates$N * rates$Rate, 0)
rates$Failures  <- rates$N - rates$Successes
## binomial regression: variant & returning-visitor status (with interaction), controlling for date:
g <- glm(cbind(Successes,Failures) ~ Metadata * Old + Date, data=rates, family="binomial"); summary(g)
##                          Estimate  Std. Error   z value   Pr(>|z|)
## (Intercept)           -1.68585483  0.07376022 -22.85588 < 2.22e-16
## Metadatameta2          0.11289144  0.07557654   1.49374  0.1352445
## Metadatameta3          0.10270100  0.07602219   1.35093  0.1767164
## Metadatameta4          0.10061048  0.09241740   1.08865  0.2763069
## Metadatameta5          0.08577369  0.07542883   1.13715  0.2554767
## Metadatameta6          0.06413629  0.07543722   0.85019  0.3952171
## Metadatanone           0.06769859  0.07738865   0.87479  0.3816898
## OldTRUE                0.30223404  0.12339673   2.44929  0.0143139
## Date2015-02-09        -0.25042825  0.04531921  -5.52587 3.2785e-08
## Date2015-03-23        -0.07756390  0.02932304  -2.64515  0.0081654
## Date2015-05-25        -0.09191468  0.02904941  -3.16408  0.0015557
## Date2015-07-02        -0.16628108  0.03108431  -5.34936 8.8267e-08
## Date2015-07-28        -0.30091724  0.02988108 -10.07050 < 2.22e-16
## Metadatameta2:OldTRUE  0.15884370  0.12509633   1.26977  0.2041662
## Metadatameta3:OldTRUE  0.16917541  0.12606099   1.34201  0.1795920
## Metadatameta4:OldTRUE  0.08085814  0.15986591   0.50579  0.6130060
## Metadatameta5:OldTRUE  0.15772161  0.12470219   1.26479  0.2059480
## Metadatameta6:OldTRUE  0.26593031  0.12471587   2.13229  0.0329831
## Metadatanone :OldTRUE  0.18329569  0.12933518   1.41721  0.1564202
confint(g)
##                                2.5 %         97.5 %
## (Intercept)           -1.83279352769 -1.54352045422
## Metadatameta2         -0.03311333865  0.26327668480
## Metadatameta3         -0.04420468214  0.25393168209
## Metadatameta4         -0.07967057622  0.28275671162
## Metadatameta5         -0.05993245076  0.23587876421
## Metadatameta6         -0.08158679693  0.21425729726
## Metadatanone          -0.08197368374  0.22151789608
## OldTRUE                0.05847893596  0.54254577177
## Date2015-02-09        -0.33953084106 -0.16186556722
## Date2015-03-23        -0.13481890103 -0.01986901416
## Date2015-05-25        -0.14861372005 -0.03473644767
## Date2015-07-02        -0.22700745604 -0.10515380198
## Date2015-07-28        -0.35925991220 -0.24212277265
## Metadatameta2:OldTRUE -0.08484020193  0.40587579037
## Metadatameta3:OldTRUE -0.07642276867  0.41806754337
## Metadatameta4:OldTRUE -0.23209702844  0.39481052801
## Metadatameta5:OldTRUE -0.08518032081  0.40399365950
## Metadatameta6:OldTRUE  0.02300128120  0.51222878593
## Metadatanone :OldTRUE -0.06880343759  0.43849923508

A strange set of results: meta2 performs the best on new visitors, and worst on old visitors, while meta6 is the exact opposite. Because there are more new visitors than old visitors, meta2 is the best on average. Except I hate how meta2 looks and much prefer meta6. The confidence intervals are wide, though - it’s not clear that meta6 is definitely worse than meta2.

Given my own preference, I will go with meta6.

CSE

A CSE (Google Custom Search Engine) is a Google search but one specialized in various ways - somewhat like offering a user a form field which redirects to a Google search query like QUERY site:gwern.net/docs/, but more powerful, since you can specify thousands of URLs to blacklist and whitelist and have limited patterns. I have two: one is specialized for searching anime/manga news sites and makes writing Wikipedia articles much easier (since you can search for a particular anime title and the results will be mostly news and reviews which you can use in a WP article, rather than images, songs, memes, Amazon and commercial sites, blogs, etc); and the second is specialized to search Gwern.net, my Reddit, LessWrong, PredictionBook, Good Reads and some other sites, to make it easier to find something I may’ve written. The second I created to put in the sidebar and serve as a website search function. (I threw in the other sites because why not?)

Google provides HTML & JS for integrating a CSE somewhere, so creating & installing it was straightforward, and it went live 2013-05-24.

The problem is that the CSE search input takes up space in the sidebar, and it adds more JS to run on each page load, loading at least one other JS file as well. So on 2015-07-17, I took a look to evaluate whether it was worth keeping.

There had been 8974 searches since I installed it 785 days previously, or ~11.4 searches per day; at least 119 were searches for “e”, which I assume were user mistakes where they didn’t intend to search and which probably annoyed them. (The next most popular searches are “Graeber”/26, “chunking”/22, and “nootropics”/10, with CSE refusing to provide any further queries due to low volume. This suggests a long tail of search queries - but also that they’re not very important, since it’s easy to find the DNB FAQ & my nootropics page, and it can hardly be useful if the top search is an error.)

To put these 8855 legitimate searches in perspective: in that same exact time period, there were 891,790 unique users with 2,010,829 page views. So only 0.44% of page-views involve a use of the CSE, a ratio of 1:227. Is it net-beneficial to make 227 page-views incur the JS run & loading for the sake of 1 CSE search?

This might seem like a time to A/B test the presence/absence of the CSE div. (I can’t simply hide it using CSS like usual, because it will still affect page loads.) Except consider the power issues: if that 1 CSE search converts, then to be profitable, it needs to damage the other 227 page-views’ conversion rate by <1/227. Or to put it the other way: the current conversion rate is ~17% of page-views and CSE search represents 0.44% of page-views, so if the CSE makes that one page-view 100% guaranteed to convert and otherwise converts normally, then over 1000 page-views, we have 0.17 × 995 + 1.0 × 5 ≈ 174 vs 0.17 × 995 + 0.17 × 5 = 170, or 17.4% vs 17.0%.

## best case: the CSE lifts overall conversion from 17.0% to 17.4%; per-group n needed:
power.prop.test(p1=0.174, p2=0.170, power=0.80, sig.level=0.05)
#     Two-sample comparison of proportions power calculation
#              n = 139724.5781

Even with the most optimistic possible assumptions (perfect conversion, no negative effect), it takes 279,449 page-views (2 × 139,725) to get decent power. This is ridiculous from a cost-benefit perspective, and worse given that my priors are against it due to the extra JS & CSS it entails.

So I simply removed it. It was a bit of an experiment, and <8.9k searches does not seem worth it.

Deep reinforcement learning

A/B testing variants one at a time is fine as far as it goes, but it has several drawbacks that have become apparent:

  1. fixed trials, compared to sequential or adaptive trial approaches, waste data/page-views. Looking back, it’s clear that many of these trials didn’t need to run so long.
  2. they are costly to set up, both because of the details of a static site doing A/B tests but also because it requires me to define each change, code it up, collect, and analyze the results all by hand.
  3. they are not amenable to testing complicated models or relationships, since factorial designs suffer combinatorial explosion.
  4. they will test only the interventions the experimenter thinks of, which may be a tiny handful of possibilities out of a wide space of possible interventions (this is related to the cost: I won’t test anything that isn’t interesting, controversial, or potentially valuable, because it’s far too much of a hassle to implement/collect/analyze)

The topic of sequential trials leads naturally to multi-armed bandits (MABs), which can be seen as a generalization of regular experimenting which naturally reallocates samples across branches as the posterior probabilities change, in a way which minimizes how many page-views go to bad variants. It’s hard to see how to implement MABs as a static site, so this would probably motivate a shift to a dynamic site, at least to the extent that the server will tweak the served static content based on the current MAB.

MABs work for the current use case of specifying a small number of variants (eg. <20) and finding the best one. Depending on implementation details, they could also make it easy to run factorial trials checking for interactions among those variants, resolving another objection.

They’re still expensive to set up, since one still has to come up with concrete variants to pit against each other, but if it’s now a dynamic server, it can at least handle the analysis automatically.
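
To make the MAB idea concrete, here is a minimal Thompson-sampling sketch in R, assuming binary ‘conversions’ and an independent Beta posterior per variant (the variant names are just illustrative):

variants  <- c("meta2", "meta5", "meta6")
successes <- rep(0, length(variants)); failures <- rep(0, length(variants))
serve <- function() {
    ## draw a plausible conversion rate from each posterior & serve the apparent best:
    which.max(rbeta(length(variants), 1+successes, 1+failures)) }
record <- function(i, converted) {
    if (converted) { successes[i] <<- successes[i] + 1 }
    else           { failures[i]  <<- failures[i]  + 1 } }
## per page-view: i <- serve(); ...render variant i...; later, record(i, converted)

As the posteriors sharpen, bad variants get drawn as the maximum less and less often, so page-views are automatically reallocated toward the front-runners.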

MABs themselves are a special case of reinforcement learning (RL), a family of approaches to exploring complicated systems to maximize a reward at (hopefully) minimum data cost. Optimizing a website fits naturally into an RL mold: all the possible CSS and HTML variants form a very complicated system, which we are trying to explore as cheaply as possible while maximizing the reward of visitors spending more time reading each page.

To solve the expressivity problem, one could try to equip the RLer with a lot of power over the CSS: parse it into an abstract syntax tree (AST), so instead of specifying by hand ‘100%’ vs ‘105%’ in a CSS declaration like div#sidebar-news a { font-size: 105%; }, the RLer sees a node in the AST like (font-size [Real ~ dnorm(100,20)]) and tries out numbers around 100% to see what yields higher conversion rates. Of course, this yields an enormous number of possibilities, and my website traffic is not equally enormous. Informative priors on each node would help if one were using a Bayesian MAB to do the optimization, but a Bayesian model might be too weak to detect many effects. (You can’t easily put in interactions between every node of the AST, after all.)
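
As a sketch of what one such annotated node might look like in practice (a hypothetical representation; only the font-size node is modeled):

## the AST node (font-size [Real ~ dnorm(100,20)]) as a sampler plus renderer:
sample.font.size <- function() { rnorm(1, mean=100, sd=20) }
render.css <- function(font.size) {
    sprintf("div#sidebar-news a { font-size: %.0f%%; }", font.size) }
cat(render.css(sample.font.size()))
## eg. "div#sidebar-news a { font-size: 93%; }"

Each page-view would get a draw from the node’s current posterior rather than the prior, with the posterior updated by the observed conversions.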

In a challenging problem like this, deep neural networks come to mind, yielding a deep reinforcement learner - such a system made a splash in 2013-2015 in learning to play dozens of Atari games (DQN). The deep network handles interpretation of the input, and the RLer handles policy and optimization.

So the loop would go something like this:

  1. a web browser requests a page
  2. the server asks the RL for CSS to include
  3. the RL generates a best guess at optimal CSS, taking the CSS AST skeleton and returning the defaults, with some fields/parameters randomized for exploration purposes (possibly selected to maximize information gain)
  4. the CSS is transcluded into the HTML page, and sent to the web browser
  5. JS analytics in the HTML page report back how long the user spent on that page and details like their country, web browser, etc, which predict time on page (explaining variance, making it easier to see effects)
  6. this time-on-page constitutes the reward, which is fed into the RL and updates it
  7. return to waiting for a request

Learning can be sped up by data augmentation or local training: the developer can browse pages locally and, based on whether they look horrible or not, insert pseudo-data. (If one variant looks bad, it can be immediately heavily penalized by adding, say, 100 page-views of that variant with low rewards.) Once previews have stabilized on not-too-terrible-looking, it can be run on live users; the developer’s preferences may introduce some bias compared to the general Internet population, but the developer won’t be too different, and this will kill off many of the worst variants. As well, historical information can be inserted as pseudo-data: if the current CSS file has 17% conversion over 1 million page views, one can simulate 1m page views to that CSS variant’s considerable credit.
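
In a Bayesian/MAB framing, both kinds of pseudo-data are just a matter of initializing the counts before any real traffic arrives; a sketch (numbers illustrative):

## historical data: the current CSS converted ~17% over 1m page-views,
## so its Beta posterior starts as:
incumbent.successes <- 0.17 * 1000000; incumbent.failures <- 0.83 * 1000000
## developer preview: a variant that looked horrible locally gets 100 fake
## unconverted page-views, so it must earn its way back with real evidence:
ugly.successes <- 0; ugly.failures <- 100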

Parsing CSS into an AST seems difficult, and it is still limited in that it will only ever tweak existing CSS fields.

How to offer more power and expressivity to the RLer without giving it so much freedom that it will hang itself with gibberish CSS before ever finding working CSS, never mind improvements?

A powerful AI tool which could generate CSS on its own are recurrent neural networks (RNNs): NNs which generate some output which gets fed back in until a long sequence has been emitted. (They usually also have some special support for storing ‘memories’ over multiple recursive applications, using units such as LSTMs.) RNNs are famous for mimicking text and other sequential material; in one demo, Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks”, he trained a RNN on a Wikipedia dump in XML format and a LaTeX math book (both replicating the syntax quite well) and, more relevantly, 474MB of C source code & headers, where the RNN does a credible job of emitting pseudo-C code which looks convincing and is even mostly syntactically-correct in balancing parentheses & brackets, which more familiar Markov-chain approaches would have trouble managing. (Of course, the pseudo-C doesn’t do anything - but that RNN was never asked to make it do something, either.) In “Learning to Execute”, the authors trained an RNN on Python source code and it was able to ‘execute’ very simple Python programs and predict the output; this is perhaps not too surprising given earlier results on neural networks solving the Traveling Salesman Problem. So RNNs are powerful and have already shown promise in learning how to write simple programs.

This suggests the use of an RNN inside an RLer for generating CSS files. Train the RNN on a few hundred megabytes of CSS files (there are millions online, no shortage there), which teaches the RNN the full range of possible CSS expressions; then plug it into step 3 of the above website-optimization algorithm and begin training it to emit useful CSS. For additional learning, the output can be judged using an oracle (a CSS validator like the W3C CSS Validation Service/w3c-markup-validator package, or possibly CSSTidy), with the error or reward based on how many validation errors there are. The pretraining provides extremely strong priors about what CSS should look like, so mostly syntactically-valid CSS will be emitted without the constraint of operating on a rigid AST; the RL begins optimizing particular steps, and providing the original CSS with a high reward prevents it from straying too far from a known good design.

Can we go further? Perhaps. In the Atari RL paper, the NN was specifically a convolutional neural network (CNN), used almost universally in image-classification tasks; the CNN was in charge of understanding the pixel output so it could be manipulated by the RL. The RNN would have considerable understanding of CSS on a textual level, but it wouldn’t easily be able to understand how one CSS declaration changes the appearance of the webpage. A CNN, on the other hand, can look at a page+CSS as rendered by a web browser, and ‘see’ what it looks like; possibly it could learn that ‘messy’ layouts are bad, that fonts shouldn’t be made ‘too big’, that blocks shouldn’t overlap, etc. The RNN generates CSS, the CSS is rendered in a web browser, the rendering is looked at by a CNN… and then what? I’m not sure how to make use of a generative approach here. Something to think about.

Recurrent Q-learning:

  • Lin & Mitchell 1992, “Memory approaches to reinforcement learning in non-Markovian domains”
  • Meeden, McGraw & Blank 1993, “Emergent control and planning in an autonomous vehicle”
  • Schmidhuber 1991b, “Reinforcement learning in Markovian and non-Markovian environments”
  • http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/

Training a neural net to generate CSS

It would be nifty if I could set up a NN to generate and optimize the CSS on Gwern.net so I don’t have to learn CSS & devise tests myself; as a first step towards this, I wanted to see how well a recurrent neural network (RNN) could generate CSS after being trained on CSS. (If it can’t do a good job mimicking the ‘average’ syntax/appearance of CSS based on a large CSS corpus, then it’s unlikely it can learn more useful things like generating usable CSS given a particular HTML file, or the ultimate goal - learning to generate optimal CSS given HTML files and user reactions.)

char-rnn

Fortunately, Karpathy has already written an easy-to-use tool, char-rnn, which has already been shown to work well on a variety of text corpora. (I was particularly amused by the LaTeX/math textbook, which yielded a compiling and even good-looking document after Karpathy fixed some errors in it; if the RNN had been trained against compile errors/warnings as well, perhaps it would not have needed any fixing at all…?)

char-rnn relies on the Torch NN framework & NVIDIA’s CUDA GPU framework (Ubuntu installation guide/download).

Torch is fairly easy to install (cheat sheet):

cd ~/src/
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
git clone https://github.com/torch/distro.git ./torch --recursive
cd ./torch; ./install.sh
export PATH=$HOME/src/torch/install/bin:$PATH
## fire up the REPL to check:
th

Then char-rnn is like­wise easy to get run­ning and try out a sim­ple ex­am­ple:

luarocks install nngraph
luarocks install optim
# luarocks install cutorch && luarocks install cunn ## 'cutorch' & 'cunn' need working CUDA
git clone 'https://github.com/karpathy/char-rnn.git'
cd ./char-rnn/
th train.lua -data_dir data/tinyshakespeare/ -gpuid 0 -rnn_size 512 -num_layers 2 -dropout 0.5
# package cunn not found!
# package cutorch not found!
# If cutorch and cunn are installed, your CUDA toolkit may be improperly configured.
# Check your CUDA toolkit installation, rebuild cutorch and cunn, and try again.
# Falling back on CPU mode
# loading data files...
# cutting off end of data so that the batches/sequences divide evenly
# reshaping tensor...
# data load done. Number of data batches in train: 423, val: 23, test: 0
# vocab size: 65
# creating an lstm with 2 layers
# number of parameters in the model: 3320385
# cloning rnn
# cloning criterion
# 1/21150 (epoch 0.002), train_loss = 4.19087871, grad/param norm = 2.1744e-01, time/batch = 4.98s
# 2/21150 (epoch 0.005), train_loss = 4.99026574, grad/param norm = 1.8453e+00, time/batch = 3.13s
# 3/21150 (epoch 0.007), train_loss = 4.29807770, grad/param norm = 5.6664e-01, time/batch = 4.30s
# 4/21150 (epoch 0.009), train_loss = 3.78911860, grad/param norm = 3.1319e-01, time/batch = 3.87s
# ...

Unfortunately, even on my i7 CPU, training is quite slow: ~3s a batch on the Tiny Shakespeare example. The important parameter here is train_loss; after some experimenting, I found that >3 = output is total garbage, 1-2 = lousy, <1 = good, and <0.8 = very good.
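
(If, as its starting value suggests, train_loss is the average cross-entropy per character in nats, these thresholds have a natural reading as perplexities:

## random guessing over Tiny Shakespeare's 65-character vocabulary:
log(65)
# [1] 4.174385       ## ~ the 4.19 starting loss in the log above
## exp(loss) = effective number of equally-likely next characters:
exp(c(3, 1, 0.8))
# [1] 20.085537  2.718282  2.225541

so ‘very good’ means the model is typically choosing between ~2 plausible next characters.)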

With Tiny Shakespeare, the loss drops quickly at first, getting <4 within seconds and into the 2s within 20 minutes, but then the 1s take a long time to surpass, and <1 even longer (hours of waiting).

GPU vs CPU

This is a toy dataset and suggests that for a real dataset I’d be waiting weeks or months. GPU acceleration is critical. I spent several days trying to get Nvidia’s CUDA to work, even signing up as a developer & using the unreleased version 7.5 preview of CUDA, but it seems that when they say Ubuntu 14.04 and not 15.04 (the latter is what I have installed), they are quite serious: everything I tried yielded bloodcurdling ATA hard-drive errors (!) upon boot, followed by a hard freeze the instant X began to run. This made me unhappy, since my old laptop began dying in late July 2015 and I had purchased my Acer Aspire V17 Nitro Black Edition VN7-791G-792A laptop with the express goal of using its NVIDIA GeForce GTX 960M for deep learning. But at the moment I am out of ideas for how to get CUDA working aside from either reinstalling to downgrade to Ubuntu 14.04 or simply waiting for version 8 of CUDA, which will hopefully support the latest Ubuntu. (Debian is not an option because on Debian Stretch, I could not even get the GPU driver to work, much less CUDA.)

Frustrated, I finally gave up and went the easy way: Torch provides an Amazon OS image preconfigured with Torch, CUDA, and other relevant libraries for deep learning.

EC2

The Torch AMI can be immediately launched if you have an AWS account. (I assume you have signed up, have a valid credit card, IP permission accesses set to allow you to connect to your VM at all, and a SSH public key set up so you can log in.) The two GPU instances seem to have the same number and kind of GPUs (1 Nvidia GPU each) and differ mostly in RAM & CPUs, neither of which is the bottleneck here, so I picked the smaller/cheaper “g2.2xlarge” type. (“Cheaper” here is relative; “g2.2xlarge” still costs $0.65/hr, and when I looked at spot prices that day, ~$0.21.)

Once started, you can SSH using your registered public key like any other EC2 instance. The default username for this image is “ubuntu”, so:

ssh -i /home/gwern/.ssh/EST.pem ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com

Once in, we set up the $PATH to find the Torch installation like before (I’m not sure why Torch’s image doesn’t already have this done) and grab a copy of char-rnn to run Tiny Shakespeare:

export PATH=$HOME/torch/install/bin:$PATH
git clone 'https://github.com/karpathy/char-rnn'
# etc

Per-batch, this yields a 20x speedup on Tiny Shakespeare compared to my laptop’s CPU, running each batch in ~0.2s.

Now we can begin working on what we care about.

CSS

First, to generate a decent-sized CSS corpus: between all the HTML documentation installed by Ubuntu and my local web archives, I have something like 1GB of CSS hanging around my drive. Let’s grab 20MB of it (enough to not take forever to train on, but not so little as to be trivial):

cd ~/src/char-rnn/
mkdir ./data/css/
find / -type f -name "*.css" -exec cat {} \; | head --bytes=20MB >> ./data/css/input.txt
## https://www.dropbox.com/s/mvqo8vg5gr9wp21/rnn-css-20mb.txt.xz
wc --chars ./data/css/input.txt
# 19,999,924 ./data/css/input.txt
scp -i ~/.ssh/EST.pem -C data/css/input.txt ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/

With 19.999M characters, our RNN can afford only <20M parameters; how big can I go with -rnn_size and -num_layers? (Which, as they sound, specify the size of each layer and the number of layers.) The full set of char-rnn training options:

  -data_dir                  data directory. Should contain the file input.txt with input data [data/tinyshakespeare]
  -rnn_size                  size of LSTM internal state [128]
  -num_layers                number of layers in the LSTM [2]
  -model                     LSTM, GRU or RNN [LSTM]
  -learning_rate             learning rate [0.002]
  -learning_rate_decay       learning rate decay [0.97]
  -learning_rate_decay_after in number of epochs, when to start decaying the learning rate [10]
  -decay_rate                decay rate for RMSprop [0.95]
  -dropout                   dropout for regularization, used after each RNN hidden layer. 0 = no dropout [0]
  -seq_length                number of timesteps to unroll for [50]
  -batch_size                number of sequences to train on in parallel [50]
  -max_epochs                number of full passes through the training data [50]
  -grad_clip                 clip gradients at this value [5]
  -train_frac                fraction of data that goes into train set [0.95]
  -val_frac                  fraction of data that goes into validation set [0.05]
  -init_from                 initialize network parameters from checkpoint at this path []
  -seed                      torch manual random number generator seed [123]
  -print_every               how many steps/minibatches between printing out the loss [1]
  -eval_val_every            every how many iterations should we evaluate on validation data? [1000]
  -checkpoint_dir            output directory where checkpoints get written [cv]
  -savefile                  filename to autosave the checkpoint to. Will be inside checkpoint_dir/ [lstm]
  -gpuid                     which GPU to use. -1 = use CPU [0]
  -opencl                    use OpenCL (instead of CUDA) [0]

Large RNN

Some playing around suggests that the upper limit is 950 neurons and 3 layers, yielding a total of 18,652,422 parameters. (I originally went with 4 layers, but with that many layers, RNNs seem to train very slowly.) Some other settings, to give an idea of how parameter count increases:

  • 512/4: 8,012,032
  • 950/3: 18,652,422
  • 1000/3: 20,634,122
  • 1024/3: 21,620,858
  • 1024/4: 30,703,872
  • 1024/5: 39,100,672
  • 1024/6: 47,497,472
  • 1800/4: 93,081,856
  • 2048/4: 120,127,744
  • 2048/5: 153,698,560
  • 2048/6: 187,269,376
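
These figures are consistent with the usual LSTM parameter-count formula; a quick R check, assuming the ~122-character vocabulary of my ASCII-converted corpus (some of the entries above appear to correspond to a 256-character vocabulary instead):

## per char-rnn's LSTM: i2h + h2h weights & biases per layer, plus a Linear(H,V) decoder:
params <- function(H, L, V=122) {
    first   <- 4*H*(V+H) + 8*H            ## layer 1 reads the one-hot character input
    deeper  <- (L-1) * (4*H*(2*H) + 8*H)  ## layers 2..L read the previous layer's H outputs
    decoder <- (H+1)*V
    first + deeper + decoder }
params(950, 3)
# [1] 18652422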

If we really wanted to stress the EC2 image’s hardware, we could go as large as this:

th train.lua -data_dir data/css/ -rnn_size 1306 -num_layers 4 -dropout 0.5 -eval_val_every 1

This turns out to not be a good idea since it will take forever to train - eg. after ~70m of training, still at a train-loss of 3.7! I suspect some of the hyperparameters may be important - the level of dropout doesn’t seem to matter much, but more than 3 layers seems to be unnecessary and slow if there are a lot of neurons to store state (perhaps because RNNs are said to ‘unroll’ computations over each character/time-step instead of being forced to do all their computation in a single deep network with >4 layers?) - but with the EC2 clock ticking and my own impatience, there’s no time to try a few dozen random sets of hyperparameters to see which achieves the best validation scores.

Undeterred, I decided to upload all the CSS (using the sort-key trick to reduce the archive size):

find / -type f -name "*.css" | rev | sort | rev | tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout > ~/src/char-rnn/data/css/all.tar.xz
cd ~/src/char-rnn/ && scp -C data/css/all.tar.xz ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/
unxz all.tar.xz
## non-ASCII input seems to cause problems, so delete anything not ASCII:
## https://disqus.com/home/discussion/karpathyblog/the_unreasonable_effectiveness_of_recurrent_neural_networks_66/#comment-2042588381
## https://github.com/karpathy/char-rnn/issues/51
tar xfJ  data/css/all.tar.xz --to-stdout | iconv -c -tascii  > data/css/input.txt
wc --char all.css
# 1,126,949,128 all.css

Unsurprisingly, this did not solve the problem, and with 1GB of data, even 1 pass over the data (1 epoch) would likely take weeks. Additional problems included -val_frac’s default of 0.05 and -eval_val_every’s default of 1000: 0.05 of 1GB is 50MB, which means every time char-rnn checked on the validation set, it took ages; and since it only wrote a checkpoint out every 1000 iterations, hours would pass in between checkpoints. 1MB or 0.001 is a more feasible validation-data size; and checking every 100 iterations strikes a reasonable balance between being able to run the latest & greatest and spending as much GPU time on training as possible.

Small RNN

So I backed off to the 20MB sample and a smaller 3-layer RNN, training it overnight, and was startled to see what happened:

th train.lua -print_every 5 -data_dir data/css/ -savefile css -eval_val_every 10000 -val_frac 0.001 -rnn_size 1700 -num_layers 3 -dropout 0.8
# ...
# 20000/21408250 (epoch 0.047), train_loss = 0.84076253, grad/param norm = 3.4693e-03, time/batch = 3.59s
# loss is exploding, aborting.

Specifically, the loss on the validation set had exploded to 333.2351 (!). When I looked at samples from the checkpointed copy, it performed both well and poorly. th sample.lua cv/lm_css_epoch0.05_333.2351.t7 yielded:

...
.nest-comments .mb_comments p,
.pomenefister .views-content .add-post,
.pleuse-product-nav .dropdown-pane .date .intro,
.product-nav .shortlist.tooltip_pane .bread-slide,
.module-category .vote-content td {
  font-size: 9px;
  text-align: center;
}
.teotspaces-left h2,
.footer .social-icons .mepubititab-left ul {
  margin: 0;
}
.mlissbox-box .post-content {
  margin-top: 10px;
}
.secondary .tooltip-arrow .scegee-category h6,
.no-list .comment-footer .module-title,
.featured-left-column .lightbox-description h0,
.popup-tomimate .social-rover .comment-arrow {
  list-style: none;
}
.tooltip-post .secondary .module-menu-houder .thumbnail {
  float: left;
}
.module-content .lightbox-list {
  display: inline-block;
  float: none;
  width: 50%;
  position: relative;
  background-position: 0 1px;
  width: 30px;
  margin-top: 5px;
  text-align: center;
  margin-right: 10px;
}
.logo-events .show-luset .box-content li {
  width: 30px;
  -webkit-box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.3);
}
.search-results .section-featured .content-top-section-left {
  margin-top: -6px;
}
.secondary .module-mones-grey .down-mod-page {
  margin-bottom: 20px;
}
.comment-top-content .meviat-alignright {
  width: 350px;
}
.module-contributor h2.comment-hold-homicate.sptbed_postnames{
  display: inline-block;
  padding-top: -24px;
}
.secondary .module-content .logo div {
  float: left;
}
.medium-block-grid-1 > li:hover {
  border-bottom: none #5c466d;
}
.row-image-container .medium-topic-browser {
  clear: right;
}
.rooc-box-sprite .modal-links .list-group li {
  display: inline-block;
  float: left;
  padding-top: 8px;
}

.vead-video-list {
  display: block;
  margin-right: 13px;
}
#no-touch .main-tabs-new-content .widget-top-content {
  color: #58128c;
  display: block;
  padding: 8px 0;
  border-color: transparent;
  border-bottom: 1px solid #fff;
  padding: 5px 12px;
  text-align: center;
}

Aside from the Unicode junk at the beginning, the output actually looks tremendously like CSS! The brackets are matched, the selectors look like selectors, and the fields are properly typed (pixels go into pixel fields, colors go into color fields, etc). If I validate the non-junk CSS part, the validator remarkably yields only 1 error, at line 52 (.module-contributor h2.comment-hold-homicate.sptbed_postnames), where it notes that “Value Error: padding-top -24px negative values are not allowed: -24px”. Considering it didn’t even finish 1 epoch, the mimicking is almost uncanny: it nails the various aspects like RGB color notation (both hex & rgba()), matching brackets, plausible-sounding identifiers (eg. .scegee-category), etc. If I were shown this without any corresponding HTML, I would not easily be able to tell it’s all gibberish.

Chastened by the exploding-error problem and the mostly-wasted ~26 hours of processing (7:30PM–9:30PM, $15.6), I tried a yet-smaller RNN (500/2), run from 5PM–11AM (so the total bill for all instances, including various playing around, restarting, generating samples, downloading to laptop etc: $25.58).

Data URI problem

One flaw in the RNN I stumbled across but was unable to reproduce was that it seemed to have a problem with data URIs. A data URI is a special kind of URL which is its own content, letting one write small files inline and avoiding the need for a separate file; for example, the following CSS fragment would yield a PNG image without the user’s browser making additional network requests or the developer needing to create a tiny file just for an icon or something:

.class {
    content: url('data:image/png;base64,iVBORw0KGgoAA \
        AANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD///+l2Z/dAAAAM0l \
        EQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDDeNGe4Ug9C9zwz3gVLMDA/A6 \
        P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC')
            }

So it’s a standard prefix like data:image/png;base64, followed by an indefinitely long string of ASCII gibberish, which is a textual encoding of the underlying binary data. The RNN sometimes starts a data URI and generates the prefix but then gets stuck, continually producing hundreds or thousands of characters of ASCII gibberish without ever closing the data URI with a quote & parenthesis and getting back to writing regular CSS.

What’s going on there? Since PNG/JPG are compressed image formats, the binary encoding will be near-random and the base-64 encoding likewise near-random. The RNN can easily generate another character once it has started the base-64, but how does it know when to stop? (“I know how to spell banana, I just don’t know when to stop! BA NA NA NA…”) Possibly it has run into the limits of its ‘memory’: once it has started emitting base-64 and has reached a plausible length of at least a few score characters (few images can be encoded in less), it’s now too far away from the original CSS, and all it can see is base-64; so of course the maximal probability is an additional base-64 character…
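
(A toy way to see the trap: if, once inside base-64, the model assigns probability p to ‘emit one more base-64 character’, the run ends each step only with probability 1−p, for an expected run length of 1/(1−p), which explodes as the model grows confident:

p <- c(0.9, 0.99, 0.999)
1/(1-p)
# [1]   10  100 1000

so a model that is 99.9% sure the next character is more base-64 will emit ~1000 characters, on average, before escaping.)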

This might be fixable by giving the RNN more neurons, in the hope that with more memory it can break out of the base-64 trap; by training more (perhaps data URIs are too rare for it to have adequately learned them with the few epochs thus far); or by backpropagating error further in time/the sequence by increasing the size of the RNN in terms of unrolling (such as increasing -seq_length from 50). I thought improving the sampling strategy with beam search, rather than greedy character-by-character generation, would help, but it turns out beam search doesn’t fix it and can perform worse, getting trapped in an even deeper local minimum of repeating the character “A” endlessly. Or of course one could delete data URIs and other undesirable features from the corpus, in which case those problems will never come up; still, I would prefer the RNN to handle issues on its own and have as little domain knowledge engineered in as possible. I wonder if the data URI issue might be what killed the large RNN at the end? (My other hypothesis is that the sort-key trick accidentally led to a multi-megabyte set of repetitions of the same common CSS file, which caused the large RNN to overfit; then once the training reached a new section of normal CSS, the large RNN began making extremely confident predictions of more repetition, which were wrong and would lead to very large losses, possibly triggering the exploding-error killer.)

Progress

This RNN progressed steadily over time, although by the end, performance on the held-out validation dataset seems to have stagnated, as plotting the validation tests shows:

performance <- dget(textConnection("structure(list(Epoch = c(0.13, 0.26, 0.4, 0.53, 0.66, 0.79, 0.92,
1.06, 1.19, 1.32, 1.45, 1.58, 1.71, 1.85, 1.98, 2.11, 2.24, 2.37,
2.51, 2.64, 2.77, 2.9, 3.03, 3.17, 3.3, 3.43, 3.56, 3.69, 3.82,
3.96, 4.09, 4.22, 4.35, 4.48, 4.62, 4.75, 4.88, 5.01, 5.14, 5.28,
5.41, 5.54, 5.67, 5.8, 5.94, 6.07, 6.2, 6.33, 6.46, 6.59, 6.73,
6.86, 6.99, 7.12, 7.25, 7.39, 7.52, 7.65, 7.78, 7.91, 8.05, 8.18,
8.31, 8.44, 8.57, 8.7, 8.84, 8.97, 9.1, 9.23, 9.36, 9.5, 9.63,
9.76, 9.89, 10.02, 10.16, 10.29, 10.42, 10.55, 10.68, 10.82,
10.95, 11.08, 11.21, 11.34, 11.47, 11.61, 11.74, 11.87, 12, 12.13,
12.27, 12.4, 12.53, 12.66, 12.79, 12.93, 13.06, 13.19, 13.32,
13.45, 13.58, 13.72, 13.85, 13.98, 14.11, 14.24, 14.38, 14.51,
14.64, 14.77, 14.9, 15.04, 15.17, 15.3, 15.43, 15.56, 15.7, 15.83,
15.96, 16.09, 16.22, 16.35, 16.49, 16.62, 16.75, 16.88, 17.01,
17.15, 17.28, 17.41, 17.54, 17.67, 17.81, 17.94, 18.07, 18.2,
18.33, 18.46, 18.6, 18.73, 18.86, 18.99, 19.12, 19.26, 19.39,
19.52, 19.65, 19.78, 19.92, 20.05, 20.18, 20.31, 20.44, 20.58,
20.71, 20.84, 20.97, 21.1, 21.23, 21.37, 21.5, 21.63, 21.76,
21.89, 22.03, 22.16, 22.29, 22.42, 22.55, 22.69, 22.82, 22.95,
23.08, 23.21, 23.34, 23.48, 23.61, 23.74, 23.87, 24, 24.14, 24.27,
24.4, 24.53, 24.66, 24.8, 24.93, 25.06, 25.19, 25.32, 25.46,
25.59, 25.72), Validation.loss = c(1.4991, 1.339, 1.3006, 1.2896,
1.2843, 1.1884, 1.1825, 1.0279, 1.1091, 1.1157, 1.181, 1.1525,
1.1382, 1.0993, 0.9931, 1.0369, 1.0429, 1.071, 1.08, 1.1059,
1.0121, 1.0614, 0.9521, 1.0002, 1.0275, 1.0542, 1.0593, 1.0494,
0.9714, 0.9274, 0.9498, 0.9679, 0.9974, 1.0536, 1.0292, 1.028,
0.9872, 0.8833, 0.9679, 0.962, 0.9937, 1.0054, 1.0173, 0.9486,
0.9015, 0.8815, 0.932, 0.9781, 0.992, 1.0052, 0.981, 0.9269,
0.8523, 0.9251, 0.9228, 0.9838, 0.9807, 1.0066, 0.8873, 0.9604,
0.9155, 0.9242, 0.9259, 0.9656, 0.9892, 0.9715, 0.9742, 0.8606,
0.8482, 0.8879, 0.929, 0.9663, 0.9866, 0.9035, 0.9491, 0.8154,
0.8611, 0.9068, 0.9575, 0.9601, 0.9805, 0.9005, 0.8452, 0.8314,
0.8582, 0.892, 0.9186, 0.9551, 0.9508, 0.9074, 0.7957, 0.8634,
0.8884, 0.8953, 0.9163, 0.9307, 0.8527, 0.8522, 0.812, 0.858,
0.897, 0.9328, 0.9398, 0.9504, 0.8664, 0.821, 0.8441, 0.8832,
0.8891, 0.9422, 0.953, 0.8326, 0.871, 0.8024, 0.8369, 0.8541,
0.895, 0.8892, 0.9275, 0.8378, 0.8172, 0.8078, 0.8353, 0.8602,
0.8863, 0.9176, 0.9335, 0.8561, 0.7952, 0.8423, 0.8833, 0.9052,
0.9202, 0.9354, 0.8477, 0.8271, 0.8187, 0.8714, 0.8714, 0.9089,
0.903, 0.9225, 0.8583, 0.7903, 0.8016, 0.8432, 0.877, 0.8825,
0.9323, 0.8243, 0.8233, 0.7981, 0.8249, 0.826, 0.9109, 0.8875,
0.9265, 0.8239, 0.8026, 0.7934, 0.851, 0.8856, 0.9033, 0.9317,
0.8576, 0.8335, 0.7829, 0.8172, 0.8658, 0.8976, 0.8756, 0.9262,
0.8184, 0.792, 0.7826, 0.8244, 0.861, 0.9144, 0.9244, 0.9106,
0.8327, 0.766, 0.7988, 0.8378, 0.8606, 0.8831, 0.9032, 0.8113,
0.8138, 0.7747, 0.8027, 0.8197, 0.8684, 0.874, 0.912)), .Names = c('Epoch',
'Validation.loss'), class = 'data.frame', row.names = c(NA, -195L
))"))

library(ggplot2)
qplot(Epoch, Validation.loss, data=performance) + stat_smooth()
Loss of the CSS char-RNN during training

As the loss diminished to ~0.8-0.9, the sampled CSS output became even more realistic. At one point I was impressed to see that the RNN had learned to switch between “minified” and unminified CSS formatting. For example, the output above is unminified, but the RNN at 0.88 sometimes writes minified (the following has been line-broken from a single line):

$ th sample.lua  cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 0px; }' -length 2000
div#sidebar { margin: 0px; }
#flashTopgip ul li h3 { clear: both; padding: 0; height:25px;
 background:url(/images/exporibox.png) no-repeat 0
 0;}.col_description{text-align:left!important;display:block;height:44px;top:-3px;left:68%;width:150px;}.front
 .content
 h3{display:inline-block;width:100%;position:fixed;position:absolute;left:0;}.date-repeat
 #right{list-style:none;}.rtl
 #block-agned-header{padding:10px;line-height:14px;width:auto;}#block-column-right{background:#63c;}.block-document_body
 #content,.activism-content-box .content,.rtl .archive-wide
 .button.input-rawignad{float:left;}.rtl .panel-pane .social-view table .lim
 h1,.page-news h3.pane-title{*zoom:2 !important;}.rtl
 .arow-right,#right-fields-img{display:none;}div.error{background-color:#ededed;}div.page-term
 span.recimsregoor_contemt #aconds,.input-admin-widget-fill div.inner
 .form-submit{display:block;margin-right:.1em;}#edit-activism-field-actionpoint{color:#8c0000;background:url(/sites/all/themes/zrw/sprites/hadd.png)
 no-repeat 3px 0px;calse:0}.login-form p{margin:4px 25px;}.rtl
 .note-ssTitle{margin:0 0 3px 0}ul.secondary
 .page,#node-region{background:url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png)
 no-repeat 12px 0 #016 !important;}#network-footer:active{}#rtl
 #newsletter-recitients-work_latest .center a{background-position:5px
 -154px;}#product-item{margin-bottom:10px;}.node-type-article .home
 .field-popup-widget-form{padding:20px 10px 10px 4px;text-align:right;}.rtl
 .view-filters,.rtl #comments-albumang_sprite{float:left;}.node-type-nodes
 .field-actionpoints-view-filters{padding:19px 28px 8px 0;}.rtl
 #multimedia-latest .field-body,.view-content
 div.field-view-layout{ulline-color:white;}.view-experts
 .views-field-title{padding:4px;text-align:center;}.node-description
 .views-exposed-form{overflow:visible;}#content .views-view-grid
 tr.format{padding-bottom:10px;background:#030000;}.view-forword-source
 .views-exposed-form #edit-submit{margin-right:0;}

This initially does not look impressive, but if we run it through an unminifier:

div#sidebar {
    margin: 0px;
}
#flashTopgip ul li h3 {
    clear: both;
    padding: 0;
    height: 25px;
    background: url(/images/exporibox.png) no-repeat 0 0;
}
.col_description {
    text-align: left!important;
    display: block;
    height: 44px;
    top: -3px;
    left: 68%;
    width: 150px;
}
.front .content h3 {
    display: inline-block;
    width: 100%;
    position: fixed;
    position: absolute;
    left: 0;
}
.date-repeat #right {
    list-style: none;
}
.rtl #block-agned-header {
    padding: 10px;
    line-height: 14px;
    width: auto;
}
#block-column-right {
    background: #63c;
}
.block-document_body #content,
.activism-content-box .content,
.rtl .archive-wide .button.input-rawignad {
    float: left;
}
.rtl .panel-pane .social-view table .lim h1,
.page-news h3.pane-title {
    *zoom: 2 !important;
}
.rtl .arow-right,
#right-fields-img {
    display: none;
}
div.error {
    background-color: #ededed;
}
div.page-term span.recimsregoor_contemt #aconds,
.input-admin-widget-fill div.inner .form-submit {
    display: block;
    margin-right: .1em;
}
#edit-activism-field-actionpoint {
    color: #8c0000;
    background: url(/sites/all/themes/zrw/sprites/hadd.png) no-repeat 3px 0px;
    calse: 0
}
.login-form p {
    margin: 4px 25px;
}
.rtl .note-ssTitle {
    margin: 0 0 3px 0
}
ul.secondary .page,
#node-region {
    background: url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) no-repeat 12px 0 #016 !important;
}
#network-footer:active {}#rtl #newsletter-recitients-work_latest .center a {
    background-position: 5px -154px;
}
#product-item {
    margin-bottom: 10px;
}
.node-type-article .home .field-popup-widget-form {
    padding: 20px 10px 10px 4px;
    text-align: right;
}
.rtl .view-filters,
.rtl #comments-albumang_sprite {
    float: left;
}
.node-type-nodes .field-actionpoints-view-filters {
    padding: 19px 28px 8px 0;
}
.rtl #multimedia-latest .field-body,
.view-content div.field-view-layout {
    ulline-color: white;
}
.view-experts .views-field-title {
    padding: 4px;
    text-align: center;
}
.node-description .views-exposed-form {
    overflow: visible;
}
#content .views-view-grid tr.format {
    padding-bottom: 10px;
    background: #030000;
}
.view-forword-source .views-exposed-form #edit-submit {
    margin-right: 0;
}

Now it’s readable, and we can see the RNN has done an excellent job of still writing CSS while in minified-mode; around this level of loss, I noticed the RNN had learned to write valid-looking URLs - fragments like background: url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) look exactly like what a human CSS programmer would write. (Unfortunately, this sample has 4 validation errors: 1 from an imbalanced bracket; 1 parse error on *zoom: 2 !important due to the asterisk, which is an old IE hack & arguably the RNN isn’t wrong; and 2 properties which don’t exist. Also in the RNN’s favor, I should note that lots of CSS in the wild will not have 0 validation errors.)

At 0.88, I also noticed the RNN was now making a valiant attempt to write comments. Bad comments, but still:

/* ubuntu@ip-172-31-30-222:~/char-rnn$ th sample.lua  cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 100px; }' -length 2000 -seed 1
using CUDA on GPU 0...
creating an lstm...
seeding with div#sidebar { margin: 100px; }
-------------------------- */
div#sidebar { margin: 100px; }
viv  .yeah-company:first-child, .news-row0 .colsetIcob img,
.content .content-number { background-position: 0 -340px; text-decoration: repeat-x; }
#content .rcper { display:none; display: block;
}

#coftelNotif .topUy { background: url('/assets/css/epwide-datetherator.png'); }
#leftCol span.scord img { background: url(/img/text/about_links.png) no-repeat 0 -1050px; }

div.subkit_snav_created, ul.up_tains li.active { width: 64% !important; }
.hdr_outer {text-align:center; }
  active, img {
        top: auto;
     margin-right: 20px;
        margin: 0 !important;
                    text-align: center;
            -webkit-box-shadow: #205575 1px 0 0 rgba(0,0,0,0.6) 1px 0px  px;
        box-shadow: 0 0 5px rgba(0,0,0,.5);
}

#ywip_section p.tab_promo,
#search_container #slideshow .page_inner #triabel_left {
    background: url(drop, sanc-email' }
simple{
    box-sizing: border-box;
}

span.naveptivionNav}
a.nav, pre,
html { */
    background-color: #8ccedc;
    background: #22a82c;
    float: left;
    color: #451515;
    border: 1px solid #701020;
    color: #0000ab;
    font-family: Arial, sans-serif;
    text-align: center;
    margin-bottom: 50px;
    line-height: 16px;
    height: 49px;
    padding: 15px 0 0 0;
    font-size: 15px;
    font-weight: bold;
    background-color: #cbd2eb;
}
a.widespacer2,
#jomList, #frq {
    margin: 0 0 0 0;
    padding: 10px -4px;
    background-color: #FFCFCF;
    border: 1px solid #CBD7DD;
    padding: 0 0 4px 12px;
    min-height: 178px;
}

.eventmenu-item, .navtonbar .article ul, .creditOd_Dectls {
    border-top: 1px #CCC gradsed 1px solid;
    font-size: 0.75em;
}

h2,
div.horingnav img {
    font-size: 5px;
}

body {
    margin: 0 0 5px 20px;
}
.n-cmenuamopicated,
.teasicOd-view td {
    border-top: 4px solid #606c98;
}

/* Rpp-fills*/

.ads{padding: 0 10px;}.statearch-header div.title img{display:table-call(}
fieldset legend span,
blockquote.inner ul {padding:0;}}

...

/* Ableft Title */

/* ========================================================  helper column parting if nofis calendar image Andy "Heading Georgia" */
.right_content {
  position: relative;
  width: 560px;
  height: 94px;
}

Ultimately, the best RNN achieved a loss of 0.7660 before I decided to shut it down because it wasn’t making much further progress.

Samples

It stalwartly continued to try to write comments, slightly approximating English (even though there is not that much English text in those 20MB - only 8.5k lines with /* in them; it’s CSS, not text). Examples of comments extracted from a large sample of 0.766’s output (fgrep '/*' best.txt):

*//* COpToMNINW BDFER
/*
.snc .footer li a.diprActy a:hover, #sciam table {/*height: 164px;*//*/* }
body.node-type-xplay-info #newsletter,body.node-type-update
#header{min-width:128px;height:153px;float:left;}#main-content
.newsletternav,#ntype-audio
.block-title{background:url(/sites/www.amnesty.org/modules/civicrm/print-widget.clu))
/*gray details */
/* Grid >> 1px 0 : k0004_0 */
/* corner */
/* ST LETTOTE/ CORCRE TICEm langs 7 us1 Q+S. Sap q i blask */
/*/*/
/* Side /**/
/* Loading Text version Links white to 10ths */
/*-modaty pse */
/**/div#sb-adrom{display:none !important;}
/*
/* `Grid >> Global
/* `Grid >> 16 Columns
/* `Grid >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Prefix Extra Space >> 12 Columns
/* `Prefix Extra Space >> 12 Columns
/* `Clear Floated Elements
/* `Prefix Extra Space >> 12 Columns
/* `Push Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Prefix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
  /* IE7 inline-block hack */
/* T* */

Not too great, but still more than I expected. Still, the (unminified) CSS looks good:

div#sidebar { margin: 100px; }
.ep_summary_box_body { float: left; width: 550px; }
.dark_search span { margin-right: 5px; }
h1.highlight_column { text-align: right; display: block; font-size: 18px; }
h3 {
        font-weight: bold;
        font-size: 12px;
}
col.teas h2 {
        clear: both;
        width: 100%;
        z-index: 190;
        action: !important;
}
#full_content .fancybox.no-float {
        background-image: url('/static/onion/img/description.png');
        max-width: 33px;
        height: 40px;
        margin-top: 20px;
        color: #3D5042;
        font-size: 0.75em;
        padding-left: 25px !important;
        }


.filter-container iframe{
        width: 990px;
}

#funcy-oneTom {
        margin: 0;
        padding: 10px 1%;
        line-height: 30px;
}
#utb_documentAlert {
        color: #222;
}

#utb_column02 a.button:focus {
        display: block;
        font-family: Arial, Helvetica, sans-serif;
}

#utb_column02 ul.blogs-listing aundoc1 ul:before,
#utb_column01 a:active,
h1 { font-weight: bold; font-family: line-heetprind, AnimarzPromo, Atial;   line-height: 1.4; font-size:                1 9px; }
#utb_column03 ul.fourder { width: 500px; padding: 4px 10px; }

The RNN also seems to have a thing for Amnesty International, regularly spitting out Amnesty URLs like url(/sites/www.amnesty.org/modules/civicrm/i/mast2adCbang.png) (not actually valid URLs).

Once that was done, I generated samples from all the checkpoints:

for NN in cv/*.t7; do th sample.lua $NN -primetext 'div#sidebar { margin: 0px; }' -length 2000 > $NN.txt; done
## https://www.dropbox.com/s/xgstn9na3efxb43/smallrnn-samples.tar.xz
## if we want to watch the CSS evolve as the loss decreased:
for SAMPLE in `ls cv/lm_css*.txt | sort --field-separator="_" --key=4 --numeric-sort --reverse`;
    do echo $SAMPLE: && tail -5 $SAMPLE | head -5; done

Evaluation

In under a day of GPU training on 20MB of CSS, a medium-sized RNN (~30M parameters) learned to produce high-quality CSS, which passes visual inspection and on some batches yields few CSS syntactic errors. This strikes me as fairly impressive: I did not train a very large RNN, did not train it for very long, did not train it on very much, did no optimization of the many hyper-parameters, and it is doing unsupervised learning in the sense that it doesn’t know how well emitted CSS validates or renders in web browsers - yet the results still look good. I would say this is a positive first step.

Lessons learned:

  • GPUs > CPUs

  • char-rnn, while rough-edged, is excellent for quick prototyping

  • NNs are slow:

    • major computation is required for the best results
    • meaningful exploration of NN sizes or other hyperparameters will be challenging when a single run can cost days
  • computing large datasets or NNs on Amazon EC2 will entail substantial financial costs; it’s adequate for short runs, but bills around $25 for two days of playing around are not a long-term solution

  • pretraining an RNN on CSS may be useful for a CSS reinforcement learner

RNN: CSS -> HTML

After showing that good-looking CSS can be generated from learning on a CSS corpus, with mastery of the syntactic rules, the next question is how to incorporate meaning. The generated CSS doesn’t mean anything and will only ‘do’ anything if it happens to have generated CSS modifying a sufficiently universal ID or CSS element (you might call the generated CSS ‘what the average CSS looks like’, although like the ‘average man’, average CSS does not exist in real life). We trained it to generate CSS from CSS. What if we trained it to generate CSS from HTML? Then we could feed in a particular HTML page and, if it has learned to generate meaningfully-connected CSS, it should write CSS targeted on that HTML page. If a HTML page has a div named lightbox, then instead of the previous nonsense like .logo-events .show-luset .box-content li { width: 30px; }, perhaps it will learn to write something meaningful like #lightbox li { width: 30px; }. (Setting that to 30px is not a good idea, but once it has learned to generate CSS for a particular page, then it can learn to generate good CSS for a particular page.)

Creating a Corpus

Be­fore, cre­at­ing a big CSS cor­pus was easy: sim­ply find all the CSS files on disk, and cat them to­gether into a sin­gle file which char-rnn could be fed. From a su­per­vised learn­ing per­spec­tive, the la­bels were also the in­puts. But to learn to gen­er­ate CSS from HTML, we need pairs of HTML and CSS: all the CSS for a par­tic­u­lar HTML page.

I could try to take the CSS files and work back­wards to where the orig­i­nal HTML page may be, but most of them are not eas­ily found and a sin­gle HTML page may call sev­eral CSS files or vice ver­sa. It seems sim­pler in­stead to gen­er­ate a fresh set of files by tak­ing some large list of URLs, down­load­ing each URL, sav­ing its HTML and then pars­ing it for CSS links which then get down­loaded and com­bined into a paired CSS file, with that sin­gle CSS file hope­fully for­mat­ted and cleaned up in other ways.

I don’t know of any existing clean corpus of HTML/CSS pairs: existing web-crawl databases would provide more data than I need but in the wrong format (split over multiple files as the live website serves it), and I can’t reuse my current archive downloads (how would one map all the downloaded CSS back onto their original HTML files and then combine them appropriately?). So I will generate my own.

I would like to crawl a wide va­ri­ety of sites, par­tic­u­larly do­mains which are more likely to pro­vide clean and high­-qual­ity CSS ex­er­cis­ing lots of func­tion­al­i­ty, so I grab URLs from:

  • ex­ports of my per­sonal Fire­fox brows­ing his­to­ry, URLs linked on Gw­ern.net, and URLs gen­er­ated from my past archives
  • ex­ports of the Hacker News sub­mis­sion his­tory
  • the CSS Zen Gar­den (hun­dreds of pages with the same HTML but wildly differ­ent & care­fully hand-writ­ten CSS)

Personal

To fil­ter out use­less URLs, files with bad ex­ten­sions, & de-du­pli­cate:

cd ~/css/
rm urls.txt
xzcat  ~/doc/backups/urls/*-urls.txt.xz | cut --delimiter=',' --fields=2 | tr --delete "'" >> urls.txt
find ~/www/ -type f | cut -d '/' -f 4- | awk '{print "http://" $0}' >> urls.txt
find ~/wiki/ -name "*.page" -type f -print0 | parallel --null runhaskell ~/wiki/haskell/link-extractor.hs | fgrep http >> urls.txt
cat ~/.urls.txt >> urls.txt
firefox-urls >> urls.txt
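# pull the past year of browsing history directly out of Firefox's places.sqlite: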
sqlite3 -separator ',' -batch "$(find ~/.mozilla/firefox/ -name 'places.sqlite' | sort | head -1)" "SELECT datetime(visit_date/1000000,'unixepoch') AS visit_date, quote(url), quote(title), visit_count, frecency FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id AND visit_date > strftime('%s','now','-1 year')*1000000 ORDER BY visit_date;" >> urls.txt

cat urls.txt | filter-urls | egrep --invert-match -e '\.onion/' -e '\.css$' -e '\.gif$' -e '\.svg$' -e '\.jpg$' -e '\.png$' -e '\.pdf$' -e '\.woff' -e '\.ttf' -e '\.eot' -e 'ycombinator.com' -e 'reddit.com' -e 'nytimes.com' | sort | uniq --check-chars=18 | shuf >> tmp; mv tmp urls.txt
wc --lines urls.txt
## 136328 urls.txt

(uniq --check-chars=18 is there as a hack for dedu­pli­ca­tion: we don’t need to waste time on 1000 URLs all from the same do­main, since their CSS will usu­ally all be near-i­den­ti­cal; this de­fines all URLs with the same first 18 char­ac­ters as be­ing du­pli­cates and so to be re­moved.)
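A toy illustration in R of the same prefix trick (duplicated() standing in for uniq, which needs sorted input):

urls <- c("http://example.com/foo", "http://example.com/bar", "http://other.org/baz")
## keep only the first URL seen for each 18-character prefix:
urls[!duplicated(substr(urls, 1, 18))]
# [1] "http://example.com/foo" "http://other.org/baz"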

HN

HN:

wget 'https://archive.org/download/HackerNewsStoriesAndCommentsDump/HNStoriesAll.7z'
7z x HNStoriesAll.7z # extract HNStoriesAll.json before processing
cat HNStoriesAll.json | tr ' ' '\n' | tr '"' '\n' | egrep '^http://' | sort --unique >> hn.txt
cat hn.txt >> urls.txt

CSS Zen Garden

CSS Zen Gar­den:

nice linkchecker --complete -odot -v --ignore-url=^mailto --no-warnings --timeout=100 --threads=1 'http://www.csszengarden.com' | fgrep http | fgrep -v "label=" | fgrep -v -- "->" | fgrep -v '" [' | fgrep -v "/ " | sed -e "s/href=\"//" -e "s/\",//" | tr -d ' ' | filter-urls | tee css/csszengarden.txt # ]
cat csszengarden.txt | sort -u | filter-urls | egrep --invert-match -e '\.onion/' -e '\.css$' -e '\.gif$' -e '\.svg$' -e '\.jpg$' -e '\.png$' -e '\.pdf$' -e '\.woff' -e '\.ttf' -e '\.eot' -e 'ycombinator.com' -e 'reddit.com' -e 'nytimes.com' > tmp
mv tmp csszengarden.txt
cat csszengarden.txt >> urls.txt

Downloading

To de­scribe the down­load al­go­rithm in pseudocode:

For each URL index i in 1:n:

    download the HTML
    parse
    extract `<link rel='stylesheet'>` & `<style>` elements
    forall stylesheets,
        download & concatenate into a single css
    concatenate style into the single css
    write html -> ./i.html
    write css -> ./i.css

Downloading the HTML part of the URL can be done with wget as usual, but if instructed to --page-requisites, it will scatter CSS files across the disk, and the CSS would need to be stitched together into one file. It would also be good if unused parts of the CSS could be ignored, the formatting be cleaned up & made consistent across all pages, and, while we’re wishing, JS evaluated just in case that makes a difference (since so many sites are unnecessarily dynamic these days). uncss does all this in a convenient command-line format; the downsides I noticed are that it is inherently much slower, that an unnecessary two-line header is prefixed to the emitted CSS (specifying the URL evaluated), which is easily removed, and that uncss sometimes hangs, so something must be arranged to kill laggard instances so progress can be made. (Originally, I was looking for a tool which would download all the CSS on a page and emit it in a single stream/file rather than write my own tagsoup parser, but when I saw uncss, I realized that the minimizing/optimizing was better than what I had intended and would be useful - why make the RNN learn CSS which isn’t used by the paired HTML?) Installing:

# Debian/Ubuntu workaround:
sudo ln -s /usr/bin/nodejs /usr/bin/node
# possibly helpful to pull in dependencies:
sudo apt-get install phantomjs

npm install -g path-is-absolute
npm install -g uncss --prefix ~/bin/
npm install -g brace-expansion --prefix ~/bin/

Then hav­ing gen­er­ated the URL list pre­vi­ous­ly, it is sim­ple to down­load each HTML/CSS pair:

downloadCSS () {
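      # name each pair by the MD5 hash of its URL, giving stable filesystem-safe IDs: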
      ID=`echo "$@" | md5sum | cut -f1 -d' '`
      echo "$@":"$ID"
      if [[ ! -s $ID.css ]]; then
       timeout 120s wget --quiet "$@" -O $ID.html &
       # `tail +3` gets rid of some uncss boilerplate
       timeout 120s nice uncss --timeout 2000 "$@" | tail --lines=+3 >> $ID.css
      fi
}
export -f downloadCSS
cat urls.txt | parallel downloadCSS

Screen­shots: save as screenshot.js

var system = require('system');
var url = system.args[1];
var filename = system.args[2];

var page = require('webpage').create();
// register the handler before opening, so a fast page load cannot be missed:
page.onLoadFinished = function() {
   page.render(filename);
   phantom.exit(); };
page.open(url);
Then a shell function screenshots each URL in parallel:

downloadScreenshot() {
    echo "$@"
    ID=`echo "$@" | md5sum | cut -f1 -d' '`
    if [[ ! -a $ID.png ]]; then
       timeout 120s nice phantomjs screenshot.js "$@" $ID.png && nice optipng -o9 -fix $ID.png
    fi
}
export -f downloadScreenshot
cat urls.txt | nice parallel downloadScreenshot

After finishing, find and delete duplicates with fdupes, and delete any stray HTML/CSS:

    # delete any empty file indicating CSS or HTML download failed:
    find . -type f -size 0 -delete
    # delete bit-identical duplicates:
    fdupes . --delete --noprompt
    # look for extremely similar screenshots, and delete all but the first such image:
    nice /usr/bin/findimagedupes --fingerprints=./.fp --threshold=99% *.png | cut --delimiter=' ' --field=2- | xargs rm
    # delete any file without a pair (empty or duplicate CSS having been previously deleted, now we clean up orphans):
    orphanedFileRemover () {
        # delete any HTML file missing its CSS half, & vice versa:
        if [[ ! -e "$1.html" || ! -e "$1.css" ]];
        then ls "$1"*; rm "$1"*;
        fi; }
    export -f orphanedFileRemover
    find . -name "*.css" -or -name "*.html" | sed -e 's/.html//' -e 's/.css//' | sort --unique | parallel orphanedFileRemover

TODO: once the screen­shot­ter has fin­ished one full pass, then you can add im­age har­vest­ing to en­force clean triplets of HTML/CSS/PNG

This yields a good-sized cor­pus of clean HTML/CSS pairs:

ls *.css | wc --lines; cat *.css | wc --char
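Looking ahead, if these pairs are to be fed to a sequence-to-sequence trainer as a single text stream, one simple approach would be to concatenate each pair with sentinel tokens. A minimal R sketch (the <HTML>/<CSS>/<END> sentinels are hypothetical choices of mine, not anything the trainers require):

htmls <- Sys.glob("*.html")
out   <- file("pairs.txt", "w")
for (h in htmls) {
    css <- sub("\\.html$", ".css", h)
    if (file.exists(css)) {   ## only emit complete pairs
        writeLines(c("<HTML>", readLines(h, warn=FALSE),
                     "<CSS>",  readLines(css, warn=FALSE),
                     "<END>"), out)
    }
}
close(out)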

TODO: yield seems low - 1 in 3? Will this be enough even with 136k+ URLs? A lot of the errors seem to be sporadic, and page downloads work when retrying them. NYT seems to lock up uncss! Had to filter it out; too bad, their CSS was nice and complex.

Data augmentation

Data augmentation is a way to increase corpus size by transforming each data point into multiple variants which are different on a low level but semantically the same. For example, the best move in a particular Go board position is the same whether you rotate the board by 90° or 180°; an upside-down or slightly brighter or slightly darker photograph of a fox is still a photograph of a fox, etc. By transforming data points, we can make our dataset much larger and also force the NN to learn more about the semantics rather than spend all its learning on mimicking surface appearances or making unwarranted assumptions. It seems to help image classification a lot (where the full set of data augmentation techniques used can be quite elaborate), and is a way to address concerns about an NN not being robust to a particular kind of noise or transformation: include that noise/transformation as part of your data augmentation.

HTML and CSS can be transformed in various ways which textually look different but still mean the same thing to a browser: they can be minified, they can be reformatted per a style guide, some optimizations can be done to combine CSS declarations or write them in better ways, CSS rules can be permuted (sometimes shuffling the order of declarations will change things, by changing which of two overlapping declarations gets used, but apparently that is rare in practice and CSS developers often write in random order), unused declarations can by definition be deleted without affecting the displayed page, and so on.

TODO: use tidy-html5 to clean up the downloaded HTML too? http://www.htacg.org/tidy-html5/documentation/#part_building Keep both the original and cleaned version: this will be good data augmentation

Data aug­men­ta­tion:

  • raw HTML + uncss
  • tidy-html5 + uncss
  • tidy-html5 + csstidy(unc­ss)
  • tidy-html5 + mini­fied CSS
  • tidy-html5 + shuffle CSS or­der as well? CSS is not fully but mostly de­clar­a­tive: http://www.w3.org/TR/2011/REC-CSS2-20110607/cascade.html#cascade
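To illustrate the last augmentation, a minimal R sketch which emits one variant of a stylesheet by permuting top-level rule order (naively assuming no nested braces, eg. no @media blocks):

css   <- "a { color: red; } p { margin: 0; } h1 { font-weight: bold; }"
## split into top-level rules:
rules <- regmatches(css, gregexpr("[^{}]+\\{[^}]*\\}", css))[[1]]
## one shuffled variant:
paste(sample(trimws(rules)), collapse = "\n")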

RNN

en­coder-de­coder? at­ten­tion mod­els? neural Tur­ing ma­chi­nes?

  • http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ im­ple­men­ta­tion of en­coder-de­coder with at­ten­tion in Theano: https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2 “Cur­rent­ly, this code in­cludes three sub­di­rec­to­ries; ses­sion0, ses­sion1 and ses­sion2. ses­sion0 con­tains the im­ple­men­ta­tion of the re­cur­rent neural net­work lan­guage model us­ing gated re­cur­rent units, and ses­sion1 the im­ple­men­ta­tion of the sim­ple neural ma­chine trans­la­tion mod­el. In ses­sion2, you can find the im­ple­men­ta­tion of the at­ten­tion-based neural ma­chine trans­la­tion model we dis­cussed to­day. I am plan­ning to make a cou­ple more ses­sions, so stay tuned!”
  • pos­si­ble ex­am­ple to steal from: https://github.com/nicholas-leonard/dp/blob/master/examples/recurrentlanguagemodel.lua / https://dp.readthedocs.org/en/latest/languagemodeltutorial/index.html#neural-network-language-model sim­ple tu­to­ri­al: https://dp.readthedocs.org/en/latest/neuralnetworktutorial/index.html (more lowlevel: https://github.com/Element-Research/rnn )
  • char-rnn seems too hard­wired in char­ac­ter at a time, bidi­rec­tion­al?
  • pure Python im­ple­men­ta­tion: https://github.com/karpathy/neuraltalk rip out the im­age stuff…?
  • https://github.com/wojzaremba/lstm
  • https://github.com/Element-Research/rnn#rnn.BiSequencerLM ?
  • http://www6.in.tum.de/pub/Main/Publications/Graves2008c.pdf
  • “Re­in­force­ment Learn­ing Neural Tur­ing Ma­chines” https://arxiv.org/abs/1505.00521
  • “Learn­ing to Ex­e­cute” https://arxiv.org/abs/1410.4615 https://github.com/wojciechz/learning_to_execute
  • “Se­quence to Se­quence Learn­ing with Neural Net­works” https://arxiv.org/abs/1409.3215
  • “Gen­er­at­ing Se­quences With Re­cur­rent Neural Net­works” https://arxiv.org/abs/1308.0850
  • “Neural Tur­ing Ma­chines” https://arxiv.org/abs/1410.5401
  • “DRAW: A Recurrent Neural Network For Image Generation” https://arxiv.org/abs/1502.04623
  • “Neural Ma­chine Trans­la­tion by Jointly Learn­ing to Align and Trans­late” https://arxiv.org/abs/1409.0473 “In this pa­per, we con­jec­ture that the use of a fixed-length vec­tor is a bot­tle­neck in im­prov­ing the per­for­mance of this ba­sic en­coder-de­coder ar­chi­tec­ture, and pro­pose to ex­tend this by al­low­ing a model to au­to­mat­i­cally (soft­-)search for parts of a source sen­tence that are rel­e­vant to pre­dict­ing a tar­get word, with­out hav­ing to form these parts as a hard seg­ment ex­plic­it­ly. With this new ap­proach, we achieve a trans­la­tion per­for­mance com­pa­ra­ble to the ex­ist­ing state-of-the-art phrase-based sys­tem on the task of Eng­lish-to-French trans­la­tion. Fur­ther­more, qual­i­ta­tive analy­sis re­veals that the (soft­-)align­ments found by the model agree well with our in­tu­ition.”
  • https://github.com/arctic-nmt/nmt
  • https://github.com/joschu/cgt/blob/master/examples/demo_neural_turing_machine.py
  • http://smerity.com/articles/2015/keras_qa.html
  • “Neural Trans­for­ma­tion Ma­chine: A New Ar­chi­tec­ture for Se­quence-to-Se­quence Learn­ing”, Meng et al 2015 https://arxiv.org/abs/1506.06442
  • “On End-to-End Pro­gram Gen­er­a­tion from User In­ten­tion by Deep Neural Net­works”, Mou et al 2015 (al­most too lim­ited and sim­ple, though)

Appendix

Covariate impact on power

Is it important in randomized testing of A/B versions of websites to control for covariates, even powerful ones? A simulation using a website’s real data suggests that sample sizes are sufficiently large that it is not critical the way it is in many applications.

In December 2013, I was discussing website testing with another site owner, who monetizes traffic by selling a product while I just optimize for reading time. He argued (deleting identifying details since I will be using their real traffic & conversion numbers throughout):

I think a big part that gets lost out is the qual­ity of traffic. For our [next web­site ver­sion] (still spec­c­ing it all out), one of my biggest re­quire­ments for A/B test­ing is that all re­fer­ring traffic must be buck­eted and split-test against them. Buck­ets them­selves are amor­phous - they can be vis­i­tors of the same res­o­lu­tion, vis­i­tors who have bought our guide, etc. But just com­par­ing how we did (and our affil­i­ates did) on sales of our guide (an easy to mea­sure met­ric - our RPU), traffic mat­ters so much. X sent 5x the traffic that Y did, yet still gen­er­ated 25% less sales. That would de­stroy any mean­ing­ful A/B test­ing with­out split­ting up the qual­i­ty.

I was a little skeptical that this was a major concern, much less one worth expensively engineering into a site, and replied:

Eh. You would lose some power by not cor­rect­ing for the co­vari­ates of source, but the ran­dom­iza­tion would still work and de­liver you mean­ing­ful re­sults. As long as vis­i­tors were be­ing ran­dom­ized into the A and B vari­ants, and there was no gross im­bal­ance in cells be­tween Y and X, and Y and X vis­i­tors did­n’t re­act differ­ent­ly, you’d still get the right re­sults - just you would need more traffic to get the same sta­tis­ti­cal pow­er. I don’t think 25% differ­ence be­tween X and Y vis­i­tors would even cost you that much pow­er…

Note that:

…we conditioned on the user level covariates listed in the column labeled by the vector W in Table 1 using several methods to strengthen power; such panel techniques predict and absorb residual variation. Lagged sales are the best predictor and are used wherever possible, reducing variance in the dependent variable by as much as 40%…However, seemingly large improvements in R^2 lead to only modest reductions in standard errors. A little math shows that going from R^2 = 0 in the univariate regression to R^2_{|W} = 50% yields a sublinear reduction in standard errors of 29%. Hence, the modeling is as valuable as doubling the sample - a significant improvement, but one that does not materially change the measurement difficulty. An order-of-magnitude reduction in standard errors would require R^2_{|W} = 99%, perhaps a “nearly impossible” goal.
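(To spell out the quote’s arithmetic: standard errors scale with the residual standard deviation, sqrt(1 - R^2), so:)

1 - sqrt(1 - 0.50)
# [1] 0.2928932
1 - sqrt(1 - 0.99)
# [1] 0.9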

In particular, if you lost a lot of power by ignoring covariates, wouldn’t that imply randomized trials were inefficient or impossible? The point of randomization is that it balances, in expectation, the indefinitely many observed and unobserved variables, letting you do causal inference without measuring them.

Power simulation

Since this seems like a relatively simple problem, I suspect there is an analytic answer, but I don’t know it. So instead, we can set this up as a simulated power analysis: we generate random data where we force the hypothesis to be true by construction, we run our planned analysis, and we see how often we get a p-value underneath 0.05 (the correct answer, by construction).

Let’s say Y’s vis­i­tors con­vert at 10%, then X’s must con­vert at 10% * 0.75, as he said, and let’s imag­ine our A/B test of a blue site-de­sign in­creases sales by 1%. (So in the bet­ter ver­sion, Y vis­i­tors con­vert at 11% and X con­vert at 8.5%.) We gen­er­ate n⁄4 dat­a­points from each con­di­tion (X/blue, X/not-blue, Y/blue, Y/not-blue), and then we do the usual lo­gis­tic re­gres­sion look­ing for a differ­ence in con­ver­sion rate, with and with­out the info about the source. So we regress Con­ver­sion ~ Col­or, to look at what would hap­pen if we had no idea where vis­i­tors came from, and then we regress Conversion ~ Color + Source. These will spit out p-val­ues on the Color co­effi­cient which are al­most the same, but not quite the same: the re­gres­sion with the Source vari­able is slightly bet­ter so it should yield slightly lower p-val­ues for Color. Then we count up all the times the p-value was be­low the mag­i­cal amount for each re­gres­sion, and we see how many sta­tis­ti­cal­ly-sig­nifi­cant p-val­ues we lost when we threw out Source. Phew!

So we might like to do this for each sample size to get an idea of how the penalty changes: n = 100 may not be the same as n = 10,000. And ideally, for each n, we do the random data generation step many times, because it’s a simulation and so any particular run may not be representative. Below, I’ll look at n = 1000, 1100, 1200, 1300, and so on up until n = 10,000. And for each n, I’ll generate 1000 replicates, which should be pretty accurate.

Large n

The whole schmeer in R:

set.seed(666)
yP <- 0.10
xP <- yP * 0.75
blueP <- 0.01

## examine various possible sizes of N
controlledResults <- uncontrolledResults <- NULL # start with empty accumulators
for (n in seq(1000,10000,by=100)) {

 controlled <- uncontrolled <- NULL # reset the per-n records

 ## generate 1000 hypothetical datasets
 for (i in 1:1000) {

 nn <- n/4
 ## generate 2x2=4 possible conditions, with different probabilities in each:
 d1 <- data.frame(Converted=rbinom(nn, 1, xP   + blueP), X=TRUE,  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(nn, 1, yP + blueP), X=FALSE, Color=TRUE)
 d3 <- data.frame(Converted=rbinom(nn, 1, xP   + 0),     X=TRUE,  Color=FALSE)
 d4 <- data.frame(Converted=rbinom(nn, 1, yP + 0),     X=FALSE, Color=FALSE)
 d <- rbind(d1, d2, d3, d4)

 ## analysis while controlling for X/Y
 g1 <- summary(glm(Converted ~ Color + X, data=d, family="binomial"))
 ## pull out p-value for Color, which we care about; did we reach statistical-significance?
 controlled[i] <- 0.05 > g1$coef[11]

 ## again, but not controlling
 g2 <- summary(glm(Converted ~ Color        , data=d, family="binomial"))
 uncontrolled[i] <- 0.05 > g2$coef[8]
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/1000))
 uncontrolledResults   <- c(uncontrolledResults, (sum(uncontrolled)/1000))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

Re­sults:

controlledResults
#  [1] 0.081 0.086 0.093 0.113 0.094 0.084 0.112 0.112 0.100 0.111 0.104 0.124 0.146 0.140 0.146 0.110
# [17] 0.125 0.141 0.162 0.138 0.142 0.161 0.170 0.161 0.184 0.182 0.199 0.154 0.202 0.180 0.189 0.202
# [33] 0.186 0.218 0.208 0.193 0.221 0.221 0.233 0.223 0.247 0.226 0.245 0.248 0.212 0.264 0.249 0.241
# [49] 0.255 0.228 0.285 0.271 0.255 0.278 0.279 0.288 0.333 0.307 0.306 0.306 0.306 0.311 0.329 0.294
# [65] 0.318 0.330 0.328 0.356 0.319 0.310 0.334 0.339 0.327 0.366 0.339 0.333 0.374 0.375 0.349 0.369
# [81] 0.366 0.400 0.363 0.384 0.380 0.404 0.365 0.408 0.387 0.422 0.411
uncontrolledResults
#  [1] 0.079 0.086 0.093 0.113 0.092 0.084 0.111 0.112 0.099 0.111 0.103 0.124 0.146 0.139 0.146 0.110
# [17] 0.125 0.140 0.161 0.137 0.141 0.160 0.170 0.161 0.184 0.180 0.199 0.154 0.201 0.179 0.188 0.199
# [33] 0.186 0.218 0.206 0.193 0.219 0.221 0.233 0.223 0.245 0.226 0.245 0.248 0.211 0.264 0.248 0.241
# [49] 0.255 0.228 0.284 0.271 0.255 0.278 0.279 0.287 0.333 0.306 0.305 0.303 0.304 0.310 0.328 0.294
# [65] 0.316 0.330 0.328 0.356 0.319 0.310 0.334 0.339 0.326 0.366 0.338 0.331 0.374 0.372 0.348 0.369
# [81] 0.363 0.400 0.363 0.383 0.380 0.404 0.364 0.406 0.387 0.420 0.410
uncontrolledResults / controlledResults
#  [1] 0.9753 1.0000 1.0000 1.0000 0.9787 1.0000 0.9911 1.0000 0.9900 1.0000 0.9904 1.0000 1.0000
# [14] 0.9929 1.0000 1.0000 1.0000 0.9929 0.9938 0.9928 0.9930 0.9938 1.0000 1.0000 1.0000 0.9890
# [27] 1.0000 1.0000 0.9950 0.9944 0.9947 0.9851 1.0000 1.0000 0.9904 1.0000 0.9910 1.0000 1.0000
# [40] 1.0000 0.9919 1.0000 1.0000 1.0000 0.9953 1.0000 0.9960 1.0000 1.0000 1.0000 0.9965 1.0000
# [53] 1.0000 1.0000 1.0000 0.9965 1.0000 0.9967 0.9967 0.9902 0.9935 0.9968 0.9970 1.0000 0.9937
# [66] 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9969 1.0000 0.9971 0.9940 1.0000 0.9920
# [79] 0.9971 1.0000 0.9918 1.0000 1.0000 0.9974 1.0000 1.0000 0.9973 0.9951 1.0000 0.9953 0.9976

So at n = 1000 we don’t have de­cent sta­tis­ti­cal power to de­tect our true effect of 1% in­crease in con­ver­sion rate thanks to blue - only 8% of the time will we get our mag­i­cal p < 0.05 and re­joice in the knowl­edge that blue is boss. That’s not great, but that’s not what we were ask­ing about.

Small n

Moving on to our original question, we see that the regressions controlling for source had very similar power to the regressions which didn’t bother. It looks like you may pay a small price of ~2% less statistical power, but probably even less than that, because so many of the other entries yielded an estimate of 0% penalty. And the penalty gets smaller as sample size increases and a mere 25% difference in conversion rate washes out as noise.

What if we look at smaller samples? Say, n = 12-1012?

...
for (n in seq(12,1012,by=10)) {
... }

controlledResults
#  [1] 0.000 0.000 0.000 0.001 0.003 0.009 0.010 0.009 0.024 0.032 0.023 0.027 0.033 0.032 0.045
# [16] 0.043 0.035 0.049 0.048 0.060 0.047 0.043 0.035 0.055 0.051 0.069 0.055 0.057 0.045 0.046
# [31] 0.037 0.049 0.057 0.057 0.050 0.061 0.055 0.054 0.053 0.062 0.076 0.064 0.055 0.057 0.064
# [46] 0.077 0.059 0.062 0.073 0.059 0.053 0.059 0.058 0.062 0.073 0.070 0.060 0.045 0.075 0.067
# [61] 0.077 0.072 0.068 0.069 0.082 0.062 0.072 0.067 0.076 0.069 0.074 0.074 0.062 0.076 0.087
# [76] 0.079 0.073 0.065 0.076 0.087 0.059 0.070 0.079 0.084 0.068 0.077 0.089 0.077 0.081 0.086
# [91] 0.094 0.080 0.080 0.087 0.085 0.087 0.082 0.084 0.073 0.083 0.077
uncontrolledResults
#  [1] 0.000 0.000 0.000 0.001 0.002 0.009 0.005 0.007 0.024 0.031 0.023 0.024 0.033 0.032 0.044
# [16] 0.043 0.035 0.048 0.047 0.060 0.047 0.043 0.035 0.055 0.051 0.068 0.054 0.057 0.045 0.045
# [31] 0.037 0.048 0.057 0.057 0.050 0.060 0.055 0.054 0.053 0.062 0.074 0.063 0.055 0.057 0.059
# [46] 0.077 0.058 0.062 0.073 0.059 0.053 0.059 0.057 0.061 0.071 0.068 0.060 0.045 0.074 0.067
# [61] 0.076 0.072 0.068 0.069 0.082 0.062 0.072 0.066 0.076 0.069 0.073 0.073 0.061 0.074 0.085
# [76] 0.079 0.073 0.065 0.076 0.087 0.058 0.066 0.076 0.084 0.067 0.077 0.089 0.077 0.081 0.086
# [91] 0.094 0.080 0.080 0.087 0.085 0.087 0.080 0.081 0.071 0.083 0.076
uncontrolledResults / controlledResults
#  [1]    NaN    NaN    NaN 1.0000 0.6667 1.0000 0.5000 0.7778 1.0000 0.9688 1.0000 0.8889 1.0000
# [14] 1.0000 0.9778 1.0000 1.0000 0.9796 0.9792 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9855
# [27] 0.9818 1.0000 1.0000 0.9783 1.0000 0.9796 1.0000 1.0000 1.0000 0.9836 1.0000 1.0000 1.0000
# [40] 1.0000 0.9737 0.9844 1.0000 1.0000 0.9219 1.0000 0.9831 1.0000 1.0000 1.0000 1.0000 1.0000
# [53] 0.9828 0.9839 0.9726 0.9714 1.0000 1.0000 0.9867 1.0000 0.9870 1.0000 1.0000 1.0000 1.0000
# [66] 1.0000 1.0000 0.9851 1.0000 1.0000 0.9865 0.9865 0.9839 0.9737 0.9770 1.0000 1.0000 1.0000
# [79] 1.0000 1.0000 0.9831 0.9429 0.9620 1.0000 0.9853 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# [92] 1.0000 1.0000 1.0000 1.0000 1.0000 0.9756 0.9643 0.9726 1.0000 0.9870

As expected, with tiny samples like 12, 22, or 32, the A/B test has essentially 0% power to detect any difference, and so it doesn’t matter whether one controls for source or not. In the n = 42+ range, we start seeing some small penalty, but the fluctuations from a 33% penalty to 0% to 50% to 22% to 0% show that once we start nearing n = 100, the difference barely exists, and the long succession of 1.0000s says that past that, we must be talking a very small power penalty, on the order of 1%.

Larger differences

So let me pull up some real #s. I will give you source, # of unique vis­i­tors to sales page, # of unique vis­i­tors to buy page, # of ac­tual buy­ers. Also note that I am do­ing it on a per-affil­i­ate ba­sis, and thus dis­re­gard­ing the ori­gin of traffic (more on that lat­er):

  • Web­site.­com - 3963 - 722 - 293
  • X - 1232 - 198 - 8
  • Y - 1284 - 193 - 77
  • Z - 489 - 175 - 75

So even the ori­gin of traffic was every­where. X was all web­site, but pushed via FB. EC was email. Y was Face­book. Ours was 3 - email, Face­book, Twit­ter. Email con­verted at 13.72%, Face­book at 8.35%, and Twit­ter at 1.39%. All had >500 clicks.

So with that in mind, es­pe­cially see­ing how X and Y had the same # of peo­ple visit the buy page, but X con­verted at 10% the rate (and rel­a­tively to X, Y con­verted at 200%), I would wa­ger that re-run­ning your num­bers would find that the ori­gin mat­ters.

Those are much big­ger con­ver­sion differ­en­tials than the orig­i­nal 25% es­ti­mate, but the loss of power was so minute in the first case that I sus­pect that the penalty will still be rel­a­tively small.

I can fix the power analysis by looking at each traffic source separately and tweaking the random generation appropriately, with liberal use of copy-paste. For the website, he said 3×500 but there’s 3963 hits, so I’ll assume the remainder is their general organic website traffic. That gives me a total table:

  • Email: 500 * 13.72% = 67
  • Face­book: 500 * 8.35% = 42
  • Twit­ter: 500 * 1.39% = 7
  • or­gan­ic: 293-(67+42+7) = 177; 3963 - (3*500) = 2463; 177 / 2463 = 7.186%

Switch­ing to R for con­ve­nience:

website <- read.csv(stdin(),header=TRUE)
Source,N,Rate
"X",1232,0.006494
"Y",1284,0.05997
"Z",489,0.1534
"Website email",500,0.1372
"Website Facebook",500,0.0835
"Website Twitter",500,0.0139
"Website organic",2463,0.07186


website$N / sum(website$N)
# [1] 0.17681 0.18427 0.07018 0.07176 0.07176 0.07176 0.35347

Change the power sim­u­la­tion ap­pro­pri­ate­ly:

set.seed(666)
blueP <- 0.01
controlledResults <- uncontrolledResults <- NULL # start with empty accumulators
for (n in seq(1000,10000,by=1000)) {
 controlled <- uncontrolled <- NULL # reset the per-n records
 for (i in 1:1000) {

 d1 <- data.frame(Converted=rbinom(n*0.17681, 1, 0.006494   + blueP), Source="X",  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(n*0.17681, 1, 0.006494   + 0),     Source="X",  Color=FALSE)

 d3 <- data.frame(Converted=rbinom(n*0.18427, 1, 0.05997 + blueP), Source="Y", Color=TRUE)
 d4 <- data.frame(Converted=rbinom(n*0.18427, 1, 0.05997 + 0),     Source="Y", Color=FALSE)

 d5 <- data.frame(Converted=rbinom(n*0.07018, 1, 0.1534 + blueP), Source="Z", Color=TRUE)
 d6 <- data.frame(Converted=rbinom(n*0.07018, 1, 0.1534 + 0),     Source="Z", Color=FALSE)

 d7 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.1372 + blueP), Source="Website email", Color=TRUE)
 d8 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.1372 + 0),     Source="Website email", Color=FALSE)

 d9  <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0835 + blueP), Source="Website Facebook", Color=TRUE)
 d10 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0835 + 0),     Source="Website Facebook", Color=FALSE)

 d11 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0139 + blueP), Source="Website Twitter", Color=TRUE)
 d12 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0139 + 0),     Source="Website Twitter", Color=FALSE)

 d13 <- data.frame(Converted=rbinom(n*0.35347, 1, 0.07186 + blueP), Source="Website organic", Color=TRUE)
 d14 <- data.frame(Converted=rbinom(n*0.35347, 1, 0.07186 + 0),     Source="Website organic", Color=FALSE)

 ## NB: d13/d14 (organic traffic) are generated above but not included here; the
 ## hard-coded index 23 below extracts Color's p-value from the 7x4 coefficient matrix
 ## of the resulting 6-source model (it would be 26 if the organic rows were added):
 d <- rbind(d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12)

 g1 <- summary(glm(Converted ~ Color + Source, data=d, family="binomial"))
 controlled[i] <- 0.05 > g1$coef[23]

 g2 <- summary(glm(Converted ~ Color        , data=d, family="binomial"))
 uncontrolled[i] <- 0.05 > g2$coef[8]
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/1000))
 uncontrolledResults   <- c(uncontrolledResults, (sum(uncontrolled)/1000))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

An hour or so lat­er:

controlledResults
# [1] 0.105 0.175 0.268 0.299 0.392 0.432 0.536 0.566 0.589 0.631
uncontrolledResults
# [1] 0.093 0.167 0.250 0.285 0.379 0.416 0.520 0.542 0.576 0.618
uncontrolledResults / controlledResults
# [1] 0.8857 0.9543 0.9328 0.9532 0.9668 0.9630 0.9701 0.9576 0.9779 0.9794

In the most extreme case (total n = 1000), where our controlled test’s power is 0.105 or 10.5% (well, what do you expect from that small an A/B test?), our test where we throw away the Source info has a power of 0.093 or 9.3%. So we lost 11% of the power.
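The 11% being the relative drop:

(0.105 - 0.093) / 0.105
# [1] 0.1142857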

Sample size implication

That’s not as bad as I feared when I saw the huge con­ver­sion rate differ­ences, but maybe it has a big­ger con­se­quence than I guess?

What does this 11% loss trans­late to in terms of ex­tra sam­ple size?

Well, our orig­i­nal to­tal con­ver­sion rate was 6.52%:

sum((website$N * website$Rate)) / sum(website$N)
# [1] 0.0652

We were examining a hypothetical increase by 1%, to 7.52%. A regular 2-proportion power calculation (the closest thing to a binomial power calculation in the R standard library) gives:

power.prop.test(n = 1000, p1 = 0.0652, p2 = 0.0752)
#      Two-sample comparison of proportions power calculation
#
#               n = 1000
#              p1 = 0.0652
#              p2 = 0.0752
#       sig.level = 0.05
#           power = 0.139

Its 14% estimate is reasonably close to the simulation’s 10.5% given all the simplifications I’m doing here. So, imagine our 0.139 power here was the victim of the 11% loss; then the true power x satisfies x = 0.11x + 0.139.
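Solving for x:

0.139 / (1 - 0.11)
# [1] 0.1561798

Given the p1 and p2 for our A/B test, how big would n then have to be to reach this true power of 0.15618?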

power.prop.test(p1 = 0.0652, p2 = 0.0752, power=0.15618)
#      Two-sample comparison of proportions power calculation
#
#               n = 1178

So in this worst-case sce­nario with small sam­ple size and very differ­ent true con­ver­sion rates, we would need an­other 178 page-views/visits to make up for com­pletely throw­ing out the source co­vari­ate. This is usu­ally a doable num­ber of ex­tra page-views.

Gwern.net

What are the implications for my own A/B tests, with less extreme “conversion” differences? It might be interesting to imagine a hypothetical in which my traffic is split between my highest-converting traffic source and my lowest, and see how much extra n I must pay in my testing because I decline to figure out how to record source for tested traffic.

Looking at my traffic for the year 2012-12-26–2013-12-26, I see that of the top 10 referral sources, the highest-converting source is bulletproofexec.com traffic, at 29.95% of its 9461 visits, and the lowest is t.co (Twitter) at 8.35% of 15168. We’ll split traffic 50/50 between these two sources.

set.seed(666)
## model specification:
bulletP <- 0.2995
tcoP    <- 0.0835
blueP   <- 0.0100

sampleSizes <- seq(100,5000,by=100)
replicates  <- 1000

controlledResults <- uncontrolledResults <- NULL # start with empty accumulators

for (n in sampleSizes) {

 controlled <- uncontrolled <- NULL # reset the per-n records

 # generate _m_ hypothetical datasets
 for (i in 1:replicates) {

 nn <- n/2
 # generate 2x2=4 possible conditions, with different probabilities in each:
 d1 <- data.frame(Converted=rbinom(nn, 1, bulletP + blueP), X=TRUE,  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(nn, 1, tcoP    + blueP), X=FALSE, Color=TRUE)
 d3 <- data.frame(Converted=rbinom(nn, 1, bulletP + 0),     X=TRUE,  Color=FALSE)
 d4 <- data.frame(Converted=rbinom(nn, 1, tcoP    + 0),     X=FALSE, Color=FALSE)
 d0 <- rbind(d1, d2, d3, d4)

 # analysis while controlling for Twitter/Bullet-Proof-Exec
 g1 <- summary(glm(Converted ~ Color + X, data=d0, family="binomial"))
 controlled[i]   <- g1$coef[11] < 0.05
 g2 <- summary(glm(Converted ~ Color    , data=d0, family="binomial"))
 uncontrolled[i] <- g2$coef[8]  < 0.05
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/length(controlled)))
 uncontrolledResults <- c(uncontrolledResults, (sum(uncontrolled)/length(uncontrolled)))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

Re­sults:

controlledResults
#  [1] 0.057 0.066 0.059 0.065 0.068 0.073 0.073 0.071 0.108 0.089 0.094 0.106 0.091 0.110 0.126 0.112
# [17] 0.123 0.125 0.139 0.117 0.144 0.140 0.145 0.137 0.161 0.165 0.170 0.148 0.146 0.171 0.197 0.171
# [33] 0.189 0.180 0.184 0.188 0.180 0.177 0.210 0.207 0.193 0.229 0.209 0.218 0.226 0.242 0.259 0.229
# [49] 0.254 0.271
uncontrolledResults
#  [1] 0.046 0.058 0.046 0.056 0.057 0.066 0.053 0.062 0.095 0.080 0.078 0.090 0.077 0.100 0.099 0.103
# [17] 0.109 0.113 0.118 0.105 0.134 0.130 0.123 0.124 0.142 0.152 0.153 0.133 0.126 0.151 0.168 0.151
# [33] 0.163 0.163 0.168 0.170 0.160 0.162 0.189 0.183 0.170 0.209 0.192 0.198 0.209 0.215 0.233 0.208
# [49] 0.221 0.251
uncontrolledResults / controlledResults
#  [1] 0.8070 0.8788 0.7797 0.8615 0.8382 0.9041 0.7260 0.8732 0.8796 0.8989 0.8298 0.8491 0.8462
# [14] 0.9091 0.7857 0.9196 0.8862 0.9040 0.8489 0.8974 0.9306 0.9286 0.8483 0.9051 0.8820 0.9212
# [27] 0.9000 0.8986 0.8630 0.8830 0.8528 0.8830 0.8624 0.9056 0.9130 0.9043 0.8889 0.9153 0.9000
# [40] 0.8841 0.8808 0.9127 0.9187 0.9083 0.9248 0.8884 0.8996 0.9083 0.8701 0.9262
1 - mean(uncontrolledResults / controlledResults)
# [1] 0.1194

So our power loss is not too severe even in this worst-case scenario: we lose a mean of ~12% of our power.

We were examining a hypothetical conversion increase by 1%, from 19.15% (mean(c(bulletP, tcoP))) to 20.15%. A regular 2-proportion power calculation (the closest thing to a binomial power calculation in the R standard library) gives:

power.prop.test(n = 1000, p1 = 0.1915, p2 = 0.2015)
#      Two-sample comparison of proportions power calculation
#
#               n = 1000
#              p1 = 0.1915
#              p2 = 0.2015
#       sig.level = 0.05
#           power = 0.08116

Its 8.1% estimate is reasonably close to the simulation’s 8.9% at n = 1000, given all the simplifications I’m doing here. So, imagine our 0.08116 power here was the victim of the 12% loss; then the true power x satisfies x = 0.12x + 0.08116, i.e. x = 0.08116 / (1 - 0.12) = 0.0922273. Given the p1 and p2 for our A/B test, how big would n then have to be to reach our true power?

power.prop.test(p1 = 0.1915, p2 = 0.2015, power=0.0922273)
#      Two-sample comparison of proportions power calculation
#
#               n = 1265

So this worst-case sce­nario means I must spend an ex­tra n of 265 or roughly a fifth of a day’s traffic. Since it would prob­a­bly cost me, on net, far more than a fifth of a day to find an im­ple­men­ta­tion strat­e­gy, de­bug it, and in­cor­po­rate it into all fu­ture analy­ses, I am happy to con­tinue throw­ing out the source in­for­ma­tion & other co­vari­ates.


  1. The loss here seems to be the average Negative Log Likelihood of each character; so a training loss of 3.78911860 means exp(-3.78911860) → 0.02, or a 2% chance of predicting the next character. This is not much better than the base-rate of uniformly guessing each of the 128 ASCII characters, which would yield 1/128 → 0.0078125, or a 0.78% chance. However, after a few hours of training and getting down to ~0.8, it’s starting to become quite impressive: 0.8 here translates to a 45% chance - not shabby! At that point, the RNN is starting to become a good natural-language compressor, as it’s approaching estimates of the entropy of natural human English, where RNNs have set records like 1.278 bits per character. (Which, converted from bits back to nats, implies that for English text similarly complicated as Wikipedia, we shouldn’t expect our RNN to do any better than a training loss of ~0.89, and more realistically 0.9-1.1.)↩︎

  2. Sev­eral days after I gave up, Nvidia re­leased a 7.5 RC which did claim to sup­port Ubuntu 15.04, but in­stalling it yielded the same lock­up. I then in­stalled Ubuntu 14.04 and tried the 14.04 ver­sion of that 7.5 RC, and that worked flaw­lessly for GPU ac­cel­er­a­tion of both graph­ics & NNs.↩︎

  3. Even­tu­ally the Nvidia re­lease caught up with 15.04 and I was able to use the Acer lap­top for deep learn­ing. This may not have been a good thing in the long run be­cause the lap­top wound up be­ing bricked on 2016-11-26, with what I think was the moth­er­board dy­ing, when it was just out of war­ran­ty, and cor­rupt­ing the filesys­tem on the SSD to boot. This is an odd way for a lap­top to die, and per­haps the warn­ings against us­ing lap­top GPUs for deep learn­ing were right - the lap­top was in­deed run­ning torch-rnn the night/morning it died.↩︎

  4. The EC2 price chart describes it as “High-performance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video memory”. These apparently are NVIDIA Quadro K5000 cards, which cost somewhere around $1500. (Price & performance-wise, there are a lot of better options these days; for example, my GeForce GTX 960M seems to train at a similar speed as the EC2 instances do.) At $0.65/hr, that’s ~2300 hours or 96 days; at spot, 297 days. Even adding in local electricity cost and the cost of building a desktop PC around the GPUs, it’s clear that breakeven is under a year and that for more than the occasional dabbling, one’s own hardware is key. If nothing else, you won’t feel anxious about the clock ticking on your Amazon bill!↩︎