A/B testing long-form readability on gwern.net

A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.
experiments, statistics, computer-science, meta, decision-theory, shell, R, JS, CSS, power-analysis, Bayes, Google, tutorial, design
2012-06-16–2019-02-16 in progress certainty: possible importance: 4


To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.

Background

  • https://www.google.com/analytics/siteopt/exptlist?account=18912926
  • http://www.pqinternet.com/196.htm
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=61203 “Experiment with site-wide changes”
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=117911 “Working with global headers”
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-GB&answer=61427
  • https://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=188090 “Varying page and element styles” - testing with inline CSS overriding the defaults
  • http://stackoverflow.com/questions/2993199/with-google-website-optimizers-multivariate-testing-can-i-vary-multiple-css-cl
  • http://www.xemion.com/blog/the-secret-to-painless-google-website-optimizer-70.html
  • http://stackoverflow.com/tags/google-website-optimizer/hot

Problems with “conversion” metric

https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-AU&answer=74345 “Time on page as a conversion goal” - every page converts, by using a timeout (mine is 40 seconds). Problem: dichotomizing a continuous variable into a single binary variable destroys a massive amount of information. This is well-known in the statistical and psychological literature (eg. MacCallum et al 2002) but I’ll illustrate further with some information-theoretical observations.

According to my Analytics, the mean reading time (time on page) is 1:47 and the maximum bracket, hit by 1% of viewers, is 1801 seconds, and the range 1-1801 takes <10.8 bits to encode (log2(1801) → 10.81), hence each page view could be represented by <10.8 bits (less since reading time is so highly skewed). But if we dichotomize, then we learn simply that ~14% of readers will read for 40 seconds, hence each reader carries not 6 bits, nor 1 bit (if 50% read that long) but closer to 2/3 of a bit:

p=0.14;  q=1-p; (-p*log2(p) - q*log2(q))
# [1] 0.5842

This isn’t even an efficient dichotomization: we could improve the fractional bit to 1 bit if we could somehow dichotomize at 50% of readers:

p=0.50;  q=1-p; (-p*log2(p) - q*log2(q))
# [1] 1

But unfortunately, simply lowering the timeout will have minimal returns, as Analytics also reports that 82% of readers spend 0-10 seconds on pages. So we are stuck with a severe loss.
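
To quantify the ceiling on any particular cutoff, we can compute the entropy of the resulting binary variable at several dichotomization points (a sketch: the 14% rate is from above, the ~18% rate is what a 10-second timeout would give per the Analytics figure, and the rest are purely illustrative):

H <- function(p) { q <- 1-p; -p*log2(p) - q*log2(q) }
sapply(c(0.14, 0.18, 0.30, 0.50), H)
# [1] 0.5842 0.6801 0.8813 1.0000

Even the 10-second timeout would recover only another tenth of a bit per visitor.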

Ideas for testing

  • JS:
        • Disqus
  • CSS:
        • differences from Readability
        • every declaration in default.css?
  • Donation:
        • placement - left, right, bottom
        • donation text:
              • help pay for hosting
              • help sponsor X experiment
              • Xah's text - did you find this article useful?
  • test the suggestions in https://code.google.com/p/better-web-readability-project/ & http://www.vcarrer.com/2009/05/how-we-read-on-web-and-how-can-we.html

Testing

max-width

CSS-3 property: set how wide the page will be, in pixels, if unlimited screen real estate is available. I noticed some people complained that pages were ‘too wide’ and this made it hard to read, which apparently is a real thing, since lines are supposed to fit in eye saccades. So I tossed 800px, 900px, 1300px, and 1400px into the first A/B test.

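<!-- GWO control script: reads the __utmx cookies and loads siteopt.js for experiment key 0520977997, which applies the chosen variation -->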
<script>
function utmx_section(){}function utmx(){}
(function(){var k='0520977997',d=document,l=d.location,c=d.cookie;function f(n){
if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.indexOf(';',i);return escape(c.substring(i+n.
length+1,j<0?c.length:j))}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;
d.write('<sc'+'ript src="'+
'http'+(l.protocol=='https:'?'s://ssl':'://www')+'.google-analytics.com'
+'/siteopt.js?v=1&utmxkey='+k+'&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='
+new Date().valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
'" type="text/javascript" charset="utf-8"></sc'+'ript>')})();
</script>

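<!-- GA tracking script: logs a '/0520977997/test' pageview to register this visit in the experiment -->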
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['gwo._setAccount', 'UA-18912926-2']);
  _gaq.push(['gwo._trackPageview', '/0520977997/test']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
              + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>

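<!-- GA conversion script: after a 40-second setTimeout, logs the '/0520977997/goal' pageview - the time-on-page 'conversion' -->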
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['gwo._setAccount', 'UA-18912926-2']);
      setTimeout(function() {
  _gaq.push(['gwo._trackPageview', '/0520977997/goal']);
      }, 40000);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') +
              '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>

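<!-- GWO section marker: the inline <style> (800px) is the default payload, which siteopt.js replaces with the variation under test -->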
    <script>utmx_section("max width")</script>
    <style type="text/css">
      body { max-width: 800px; }
    </style>
    </noscript>

It ran from mid-June to 2012-08-01. Unfortunately, I cannot be more specific: on 1 August, Google deleted Website Optimizer and told everyone to use ‘Experiments’ in Google Analytics - and deleted all my information. The graph over time, the exact numbers - all gone. So this is from memory.

The results were initially very promising: ‘conversion’ was defined as staying on a page for 40 seconds (I reasoned that this meant someone was actually reading the page), with a base rate of around 70% of readers converting. With a few hundred hits, 900px converted at 10-20% more than the default! I was ecstatic. So when it began falling, I was only a little bothered (one had to expect some regression to the mean, since the results were too good to be true). But as the hits increased into the low thousands, the effect kept shrinking, all the way down to 0.4% improved conversion. At some points, 1300px actually exceeded 900px.

The second distressing thing was that Google’s estimated chance of a particular intervention beating the default (which I believe is a Bonferroni-corrected p-value) did not increase! Even as each version received 20,000 hits, the chance stubbornly bounced around the 70-90% range for 900px and 1300px. This remained true all the way to the bitter end: each version had racked up 93,000 hits and still was in the 80% decile. Wow.

Ironically, I was warned at the beginning about both of these possible behaviors by papers I read on large-scale corporate A/B testing: http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf, http://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf, and http://www.exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf. They cover at length how many apparent trends simply evaporated, but also a peculiar phenomenon where A/B tests did not converge even after being run on ungodly amounts of data, because the standard deviations kept changing (the user composition kept shifting and rendering previous data more uncertain). And it’s a general phenomenon that even for large correlations, the trend will bounce around a lot before it stabilizes (Schönbrodt & Perugini 2013).
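
The early-lift-that-evaporates pattern is easy to reproduce; a toy simulation in R (not the original data), assuming two identical variants with a true 14% conversion rate, tracking B’s apparent lift over A as hits accumulate:

set.seed(2012)
n <- 90000
a <- rbinom(n, 1, 0.14); b <- rbinom(n, 1, 0.14)
lift <- (cumsum(b) - cumsum(a)) / seq_len(n)
round(lift[c(100, 1000, 10000, 90000)], 4)
# swings of a few percentage points are typical at n=100, shrinking toward ~0 by n=90,000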

Oy vey! When I discovered Google had deleted my results, I decided to simply switch to 900px. Running a new test would not provide any better answers.

TODO

how about a blue background? see http://www.overcomingbias.com/2010/06/near-far-summary.html for more design ideas

  1. table striping
tbody tr:hover td { background-color: #f5f5f5;}
tbody tr:nth-child(odd) td { background-color: #f9f9f9;}
  2. link decoration
a { color: black; text-decoration: underline;}
a { color:#005AF2; text-decoration:none; }

Resumption: ABalytics

In March 2013, I decided to give A/B testing another whack. Google Analytics Experiments did not seem to have improved, and the commercial services continued to charge unacceptable prices, so I gave the Google Analytics custom-variable integration approach another try, using ABalytics. The usual puzzling, debugging, and frustration of combining so many disparate technologies (HTML and CSS and JS and Google Analytics) aside, it seemed to work on my test page. The current downside seems to be that the ABalytics approach may be fragile, and the UI in GA is awful (you have to do the statistics yourself).

max-width redux

The test case is to rerun the max-width test and finish it.

Implementation

The exact changes:

Sun Mar 17 11:25:39 EDT 2013  gwern@gwern.net
  * default.html: setup ABalytics a/b testing https://github.com/danmaz74/ABalytics
                  (hope this doesn't break anything...)
    addfile ./static/js/abalytics.js
    hunk ./static/js/abalytics.js 1
...
    hunk ./static/templates/default.html 28
    +    <div class="maxwidth_class1"></div>
    +
...
    -    <noscript><p>Enable JavaScript for Disqus comments</p></noscript>
    +      window.onload = function() {
    +      ABalytics.applyHtml();
    +      };
    +    </script>
    hunk ./static/templates/default.html 119
    +
    +      ABalytics.init({
    +      maxwidth: [
    +      {
    +      name: '800',
    +      "maxwidth_class1": "<style>body { max-width: 800px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '900',
    +      "maxwidth_class1": "<style>body { max-width: 900px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1100',
    +      "maxwidth_class1": "<style>body { max-width: 1100px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1200',
    +      "maxwidth_class1": "<style>body { max-width: 1200px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1300',
    +      "maxwidth_class1": "<style>body { max-width: 1300px; }</style>",
    +      "maxwidth_class2": ""
    +      },
    +      {
    +      name: '1400',
    +      "maxwidth_class1": "<style>body { max-width: 1400px; }</style>",
    +      "maxwidth_class2": ""
    +      }
    +      ],
    +      }, _gaq);
    +

Results

I wound up the test on 2013-04-17 with the following results:

Width (px)   Visits    Conversion
1100         18,164    14.49%
1300         18,071    14.28%
1200         18,150    13.99%
800          18,599    13.94%
900          18,419    13.78%
1400         18,378    13.68%
(total)      109,772   14.03%

Analysis

1100px is close to my original A/B test’s indication that 900px was the leading candidate, so that gives me additional confidence, as does the observation that 1300px and 1200px are the other leading candidates. (Curiously, the site conversion average before was 13.88%; perhaps my underlying traffic changed slightly around the time of the test? This would demonstrate why alternatives need to be tested simultaneously.) A quick and dirty R test of 1100px vs 1300px (prop.test(c(2632,2581), c(18164,18071))) indicates the difference isn’t statistically-significant (at p = 0.58), and we might want more data; worse, there is no clear linear relation between conversion and width (the plot is erratic, and a linear fit a dismal p = 0.89):

rates <- read.csv(stdin(),header=TRUE)
Width,N,Rate
1100,18164,0.1449
1300,18071,0.1428
1200,18150,0.1399
800,18599,0.1394
900,18419,0.1378
1400,18378,0.1368


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial")
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.82e+00   4.65e-02  -39.12   <2e-16
# Width        5.54e-06   4.10e-05    0.14     0.89
## not much better:
rates$Width <- as.factor(rates$Width)
rates$Width <- relevel(rates$Width, ref="900")
g2 <- glm(cbind(Successes,Failures) ~ Width, data=rates, family="binomial"); summary(g2)

But I want to move on to the next test, and by the same logic it is highly unlikely that the difference between them is large or much in 1300px’s favor. (The kind of mistake I care about: switching between 2 equivalent choices doesn’t matter; missing out on an improvement does matter - that is, minimizing β, the rate of missed improvements, rather than minimizing α.)
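
To put a number on “highly unlikely”: a quick power sketch (assuming the observed rates were the true ones) of what it would take to confirm the 1100px-vs-1300px gap of 14.49% vs 14.28% at 80% power:

power.prop.test(p1=0.1449, p2=0.1428, power=0.80)
# n ≈ 438,000 in each group - months of traffic to confirm a 0.21 percentage-point gap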

Fonts

The New York Times ran an informal online experiment with a large number of readers (n = 60,750) and found that the font Baskerville led to more readers agreeing with a short text passage - this seems plausible enough given their very large sample size and Wikipedia’s note that “The refined feeling of the typeface makes it an excellent choice to convey dignity and tradition.”

Power analysis

Would this font work its magic on Gwern.net too? Let’s see. The sample size is quite manageable, as over a month I will easily have 60k visits, and they tested 6 fonts, expanding their necessary sample. What sample size do I actually need? Their professor estimates the effect size of Baskerville at 1.5%; I would like my A/B test to have very high statistical power (0.9) and reach a more stringent statistical-significance level (p < 0.01), so I can go around and in good conscience tell people to use Baskerville. I already know the average “conversion rate” is ~13%, so I get this power calculation:

power.prop.test(p1=0.13+0.015, p2=0.13, power=0.90, sig.level=0.01)

     Two-sample comparison of proportions power calculation

              n = 15683
             p1 = 0.145
             p2 = 0.13
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

 NOTE: n is number in *each* group

15,000 visitors in each group seems reasonable; at ~16k visitors a week, that suggests a few weeks of testing. Of course, I’m testing 4 fonts (see below), but that still fits in the ~2 months I’ve allotted for this test.
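
Spelled out (using the n from the power calculation and the ~16k-visits/week figure):

(4 * 15683) / 16000
# [1] 3.92075   - i.e. roughly a month of traffic for all 4 arms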

Implementation

I had previously drawn on the NYT experiment for my site design:

html {
...
    font-family: Georgia, "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica,
                 Arial, "Lucida Grande", garamond, palatino, verdana, sans-serif;
}

I had not used Baskerville itself, since Georgia seemed similar and was convenient, but we’ll fix that now. Besides Baskerville & Georgia, we can try Trebuchet MS and Helvetica, for a total of 4 fonts (each falling back to Georgia):

hunk ./static/templates/default.html 28
+    <div class="fontfamily_class1"></div>
...
hunk ./static/templates/default.html 121
+      fontfamily: [
+      {
+      name: 'Baskerville',
+      "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Georgia',
+      "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Trebuchet',
+      "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
+      "fontfamily_class2": ""
+      },
+      {
+      name: 'Helvetica',
+      "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
+      "fontfamily_class2": ""
+      }
+      ],

Results

Running from 2013-04-14 to 2013-06-16:

Font          Type    Visits     Conversion
Trebuchet     sans    35,473     13.81%
Baskerville   serif   36,021     13.73%
Helvetica     sans    35,656     13.43%
Georgia       serif   35,833     13.31%
(all sans)            71,129     13.62%
(all serif)           71,854     13.52%
(total)               142,983    13.57%

The sample size for each font is 20k higher than I projected, due to the enormous popularity of the Google analysis I finished during the test. Regardless, it’s clear that the results - with double the total sample size of the NYT experiment, focused on fewer fonts - are disappointing, and there seems to be very little difference between fonts.

Analysis

Picking the most extreme difference, between Trebuchet and Georgia, the difference is close to the usual definition of statistical-significance:

prop.test(c(0.1381*35473,0.1331*35833),c(35473,35833))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1381 * 35473, 0.1331 * 35833) out of c(35473, 35833)
# X-squared = 3.76, df = 1, p-value = 0.0525
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -5.394e-05  1.005e-02
# sample estimates:
# prop 1 prop 2
# 0.1381 0.1331

Which naturally implies that the much smaller difference between Trebuchet and Baskerville is not statistically-significant:

prop.test(c(0.1381*35473,0.1373*36021), c(35473,36021))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1381 * 35473, 0.1373 * 36021) out of c(35473, 36021)
# X-squared = 0.0897, df = 1, p-value = 0.7645
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.00428  0.00588

Since there are only small differences between individual fonts, I wondered if there might be a difference between the two sans-serifs and the two serifs. If we lump the 4 fonts into those 2 categories and look at the small difference in mean conversion rate:

prop.test(c(0.1362*71129,0.1352*71854), c(71129,71854))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(0.1362 * 71129, 0.1352 * 71854) out of c(71129, 71854)
# X-squared = 0.2963, df = 1, p-value = 0.5862
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.002564  0.004564

Nothing doing there either. More generally:

rates <- read.csv(stdin(),header=TRUE)
Font,Serif,N,Rate
Trebuchet,FALSE,35473,0.1381
Baskerville,TRUE,36021,0.1373
Helvetica,FALSE,35656,0.1343
Georgia,TRUE,35833,0.1331


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <- rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Font, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)   -1.83782    0.01531 -120.05   <2e-16
# FontGeorgia   -0.03609    0.02182   -1.65     0.10
# FontHelvetica -0.02554    0.02181   -1.17     0.24
# FontTrebuchet  0.00670    0.02171    0.31     0.76

With essentially no meaningful differences between conversion rates, this suggests that however fonts matter, they don’t matter for reading duration. So I feel free to pick the font that appeals to me visually, which is Baskerville.

Line height

I have seen complaints that lines on Gwern.net are “too closely spaced” or “run together” or “cramped”, referring to the leading (the CSS property line-height). I set the CSS to line-height: 150%; to deal with this objection, but this was a simple hack based on rough eyeballing, and it was done before I changed the max-width and font-family settings after the previous testing. So it’s worth testing some variants.

Most web design guides seem to suggest a safe default of 120%, rather than my current 150%. If we tried to test each decile plus one on the outside, that’d give us 110, 120, 130, 140, 150, 160, or 6 options, which, combined with the expected small effect, would require an unreasonable sample size (and I have nothing in the pipeline I expect might catch fire like the Google analysis and deliver an excess >50k visits). So I’ll try just 120/130/140/150, and schedule a similar block of time as fonts (ending the experiment on 2013-08-16, with presumably >70k datapoints).
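
For a sense of the sample sizes involved, a sketch assuming a half-percentage-point true effect (my guess for something as subtle as line-height) and the same power/significance targets as the font test:

power.prop.test(p1=0.145, p2=0.14, power=0.90, sig.level=0.01)
# n ≈ 145,000 in each group; times 6 variants, that is ~870,000 visits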

Implementation

hunk ./static/templates/default.html 30
-    <div class="fontfamily_class1"></div>
+    <div class="linewidth_class1"></div>
hunk ./static/templates/default.html 156
-      fontfamily:
+      linewidth:
hunk ./static/templates/default.html 158
-      name: 'Baskerville',
-      "fontfamily_class1": "<style>html { font-family: Baskerville, Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line120',
+      "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 163
-      name: 'Georgia',
-      "fontfamily_class1": "<style>html { font-family: Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line130',
+      "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 168
-      name: 'Trebuchet',
-      "fontfamily_class1": "<style>html { font-family: 'Trebuchet MS', Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line140',
+      "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
+      "linewidth_class2": ""
hunk ./static/templates/default.html 173
-      name: 'Helvetica',
-      "fontfamily_class1": "<style>html { font-family: Helvetica, Georgia; }</style>",
-      "fontfamily_class2": ""
+      name: 'Line150',
+      "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+      "linewidth_class2": ""

Analysis

From 2013-06-15 to 2013-08-15:

line-height   n        Conversion (%)
130%          18,124   15.26
150%          17,459   15.22
120%          17,773   14.92
140%          17,927   14.92
(total)       71,283   15.08

Just from looking at the miserably small difference between the most extreme percentages, we can predict that nothing here was statistically-significant:

x1 <- 18124; x2 <- 17927; prop.test(c(x1*0.1526, x2*0.1492), c(x1,x2))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(x1 * 0.1526, x2 * 0.1492) out of c(x1, x2)
# X-squared = 0.787, df = 1, p-value = 0.375

I changed the 150% to 130% for the heck of it, even though the difference between 130 and 150 was trivially small:

rates <- read.csv(stdin(),header=TRUE)
Height,N,Rate
130,18124,0.1526
150,17459,0.1522
120,17773,0.1492
140,17927,0.1492


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <- rates$N - rates$Successes

rates$Height <- as.factor(rates$Height)
g <- glm(cbind(Successes,Failures) ~ Height, data=rates, family="binomial")
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.74e+00   2.11e-02  -82.69   <2e-16
# Height130    2.65e-02   2.95e-02    0.90     0.37
# Height140    9.17e-06   2.97e-02    0.00     1.00
# Height150    2.32e-02   2.98e-02    0.78     0.44

Null test

One of the suggestions in the A/B testing papers was to run a “null” A/B test (or “A/A test”), where the payload is empty but the A/B testing framework is still measuring conversions etc. By definition, the null hypothesis of “no difference” is true, so at an alpha of 0.05, only 5% of the time should the null test yield p < 0.05 (which is very different from the usual situation, where the null hypothesis of exactly zero difference is rarely true). The interest here is that it’s possible that something is going wrong in one’s A/B setup or in general, and so if one gets a “statistically-significant” result, it may be worthwhile investigating the anomaly.
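
The expected 5% false-positive rate is easy to check by simulation (a sketch, with group sizes comparable to those below):

set.seed(1)
pvals <- replicate(2000, {
    a <- rbinom(1, 7400, 0.16); b <- rbinom(1, 7400, 0.16)
    prop.test(c(a, b), c(7400, 7400))$p.value })
mean(pvals < 0.05)
# comes out near 0.05; a broken or biased testing harness would inflate it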

It’s easy to switch from the line-height test to the null test; just rename the variables for Google Analytics, and empty the payloads:

hunk ./static/templates/default.html 30
-    <div class="linewidth_class1"></div>
+    <div class="null_class1"></div>
hunk ./static/templates/default.html 158
-      linewidth: [
+      null: [
+      ...]]
hunk ./static/templates/default.html 160
-      name: 'Line120',
-      "linewidth_class1": "<style>div#content { line-height: 120%;}</style>",
+      name: 'null1',
+      "null_class1": "",
hunk ./static/templates/default.html 165
-      { ...
-      name: 'Line130',
-      "linewidth_class1": "<style>div#content { line-height: 130%;}</style>",
-      "linewidth_class2": ""
-      },
-      {
-      name: 'Line140',
-      "linewidth_class1": "<style>div#content { line-height: 140%;}</style>",
-      "linewidth_class2": ""
-      },
-      {
-      name: 'Line150',
-      "linewidth_class1": "<style>div#content { line-height: 150%;}</style>",
+      name: 'null2',
+      "null_class1": "",
+       ... }

Since any difference due to the testing framework should be noticeable, this will be a shorter experiment, from 15 August to 29 August.

Results

While, amusingly, the first pair of 1k hits resulted in a dramatic 18% vs 14% result, this quickly disappeared into a much more normal-looking set of data:

Option    n        Conversion
null2     7,359    16.23%
null1     7,488    15.89%
(total)   14,847   16.06%

Analysis

Ah, but can we reject the null hypothesis that “null1 == null2”? In a rare victory for null-hypothesis-significance-testing, we do not commit a Type I error:

x1 <- 7359; x2 <- 7488; prop.test(c(x1*0.1623, x2*0.1589), c(x1,x2))
#     2-sample test for equality of proportions with continuity correction
#
# data:  c(x1 * 0.1623, x2 * 0.1589) out of c(x1, x2)
# X-squared = 0.2936, df = 1, p-value = 0.5879
# alternative hypothesis: two.sided
# 95% confidence interval:
#  -0.008547  0.015347

But seriously, it is nice to see that ABalytics does not seem to be broken: it favors neither option, nor are results driven by placement in the array of options.

Text & background color

As part of the generally monochromatic color scheme, the background was off-white (grey) and the text was black:

html { ...
    background-color: #FCFCFC; /* off-white */
    color: black;
... }

The hyperlinks, on the other hand, make use of an off-black color, #303C3C, partially motivated by Ian Storm Taylor’s advice to “Never Use Black”. I wonder - should all the text be off-black too? And which combination is best? White/black? Off-white/black? Off-white/off-black? White/off-black? Let’s try all 4 combinations here.

Implementation

The usual:

hunk ./static/templates/default.html 30
-    <div class="underline_class1"></div>
+    <div class="ground_class1"></div>
hunk ./static/templates/default.html 155
-      underline: [
+      ground: [
hunk ./static/templates/default.html 157
-      name: 'underlined',
-      "underline_class1": "<style>a { color: #303C3C; text-decoration: underline; }</style>",
-      "underline_class2": ""
+      name: 'bw',
+      "ground_class1": "<style>html { background-color: white; color: black; }</style>",
+      "ground_class2": ""
hunk ./static/templates/default.html 162
-      name: 'notUnderlined',
-      "underline_class1": "<style>a { color: #303C3C; text-decoration: none; }</style>",
-      "underline_class2": ""
+      name: 'obw',
+      "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
+      "ground_class2": ""
+      },
+      {
+      name: 'bow',
+      "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
+      "ground_class2": ""
+      },
+      {
+      name: 'obow',
+      "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
+      "ground_class2": ""
... ]]

Data

I am a little curious about this one, so I scheduled a full month and a half: 10 September - 20 October. Due to far more traffic than anticipated from submissions to Hacker News, I cut it short by 10 days to avoid wasting traffic on a test which was done (a total n of 231,599 was more than enough). The results:

Version                         n        Conversion
bw (black on white)             58,237   12.90%
obow (off-black on off-white)   58,132   12.62%
bow (black on off-white)        57,576   12.48%
obw (off-black on white)        57,654   12.44%

Analysis

rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,TRUE,58237,0.1290
FALSE,FALSE,58132,0.1262
TRUE,FALSE,57576,0.1248
FALSE,TRUE,57654,0.1244


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Black * White, data=rates, family="binomial")
summary(g)
# ...Coefficients:
#                     Estimate Std. Error z value Pr(>|z|)
# (Intercept)          -1.9350     0.0125 -154.93   <2e-16
# BlackTRUE            -0.0128     0.0177   -0.72     0.47
# WhiteTRUE            -0.0164     0.0178   -0.92     0.36
# BlackTRUE:WhiteTRUE   0.0545     0.0250    2.17     0.03
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance:  6.8625e+00  on 3  degrees of freedom
# Residual deviance: -1.1758e-11  on 0  degrees of freedom
# AIC: 50.4
summary(step(g))
# same thing

So we can estimate the net effect of the 4 possibilities:

  1. Black, White: -0.0128 + -0.0164 + 0.0545 = 0.0253
  2. Off-black, Off-white: 0 + 0 + 0 = 0
  3. Black, Off-white: -0.0128 + 0 + 0 = -0.0128
  4. Off-black, White: 0 + -0.0164 + 0 = -0.0164

The results exactly match the data’s rank­ings.
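
We can convert those logit-scale sums back into conversion rates with the inverse-logit (plogis), using the intercept of the model above; since the model is saturated, they reproduce the observed rates exactly:

plogis(-1.9350 + c(bw=0.0253, obow=0, bow=-0.0128, obw=-0.0164))
#     bw   obow    bow    obw
# 0.1290 0.1262 0.1248 0.1244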

So, this suggests a change to the CSS: we switch the default background color from #FCFCFC to white, while leaving the default text color its current black.

Reader Lucas asks in the comments section whether including a new-vs-returning-visitor variable (which Google Analytics does track) might improve the analysis, since we would expect new visitors to the website to be less likely to read a page in full than returning visitors (who know what they’re in for & probably want more). It’s easy to ask GA for “New vs Returning Visitor”, so I did:

rates <- read.csv(stdin(),header=TRUE)
Black,White,Type,N,Rate
FALSE,TRUE,new,36695,0.1058
FALSE,TRUE,old,21343,0.1565
FALSE,FALSE,new,36997,0.1043
FALSE,FALSE,old,21537,0.1588
TRUE,TRUE,new,36600,0.1073
TRUE,TRUE,old,22274,0.1613
TRUE,FALSE,new,36409,0.1075
TRUE,FALSE,old,21743,0.1507

rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Black * White + Type, data=rates, family="binomial")
summary(g)
# Coefficients:
#                      Estimate Std. Error z value Pr(>|z|)
# (Intercept)         -2.134459   0.013770 -155.01   <2e-16
# BlackTRUE           -0.009219   0.017813   -0.52     0.60
# WhiteTRUE            0.000837   0.017798    0.05     0.96
# BlackTRUE:WhiteTRUE  0.034362   0.025092    1.37     0.17
# Typeold              0.448004   0.012603   35.55   <2e-16
  1. B/W: -0.009219 + 0.000837 + 0.034362 = 0.02598
  2. B: -0.009219
  3. W: 0.000837

And again, 0.02598 > 0.000837. So, as one hopes, thanks to randomization, adding a missing covariate doesn’t change our conclusion.

List symbol and font-size

I make heavy use of unordered lists in articles; for no particular reason, the symbol denoting the start of each entry in a list is the little black square, rather than the more common little circle. I’ve come to find the little squares a little chunky and ugly, so I want to test that. And I just realized that I never tested font size (just type of font), even though increasing the font size is one of the most common CSS tweaks around. I don’t have any reason to expect an interaction between these two bits of design, unlike the previous A/B test, but I like the idea of getting more out of my data, so I am doing another factorial design, this time not 2x2 but 3x5 (the full grid is sketched after the options below). The options:

ul { list-style-type: square; }
ul { list-style-type: circle; }
ul { list-style-type: disc; }

html { font-size: 100%; }
html { font-size: 105%; }
html { font-size: 110%; }
html { font-size: 115%; }
html { font-size: 120%; }
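
Crossing the two lists gives the full factorial grid (a quick sketch):

conditions <- expand.grid(icon=c("square", "circle", "disc"),
                          size=c(100, 105, 110, 115, 120))
nrow(conditions)
# [1] 15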

Implementation

A 3x5 design, or 15 possibilities, does get a little bulkier than I’d like:

hunk ./static/templates/default.html 30
-    <div class="ground_class1"></div>
+    <div class="ulFontSize_class1"></div>
hunk ./static/templates/default.html 146
-      ground: [
+      ulFontSize: [
hunk ./static/templates/default.html 148
-      name: 'bw',
-      "ground_class1": "<style>html { background-color: white; color: black; }</style>",
-      "ground_class2": ""
+      name: 's100',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 153
-      name: 'obw',
-      "ground_class1": "<style>html { background-color: white; color: #303C3C; }</style>",
-      "ground_class2": ""
+      name: 's105',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 158
-      name: 'bow',
-      "ground_class1": "<style>html { background-color: #FCFCFC; color: black; }</style>",
-      "ground_class2": ""
+      name: 's110',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
hunk ./static/templates/default.html 163
-      name: 'obow',
-      "ground_class1": "<style>html { background-color: #FCFCFC; color: #303C3C; }</style>",
-      "ground_class2": ""
+      name: 's115',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 's120',
+      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c100',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c105',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c110',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c115',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'c120',
+      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd100',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 100%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd105',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 105%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd110',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 110%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd115',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 115%; }</style>",
+      "ulFontSize_class2": ""
+      },
+      {
+      name: 'd120',
+      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 120%; }</style>",
+      "ulFontSize_class2": ""
... ]]

Data

I halted the A/B test on 27 October because I was noticing clear damage as compared to my default CSS. The results were:

List icon   Font zoom   n        Reading conversion rate
square      100%        4,763    16.38%
disc        100%        4,759    16.18%
disc        110%        4,716    16.09%
circle      115%        4,933    15.95%
circle      100%        4,872    15.85%
circle      110%        4,920    15.53%
circle      120%        5,114    15.51%
square      115%        4,815    15.51%
square      110%        4,927    15.47%
circle      105%        5,101    15.33%
square      105%        4,775    14.85%
disc        115%        4,797    14.78%
disc        105%        5,006    14.72%
disc        120%        4,912    14.56%
square      120%        4,786    13.96%
(total)                 73,196   15.38%

Analysis

Incorporating visitor type:

rates <- read.csv(stdin(),header=TRUE)
Ul,Size,Type,N,Rate
c,120,old,2673,0.1650
c,115,old,2643,0.1854
c,105,new,2636,0.1392
d,105,old,2635,0.1613
s,110,old,2596,0.1749
s,120,old,2593,0.1678
s,105,new,2582,0.1243
d,120,old,2559,0.1649
c,110,new,2558,0.1298
d,110,new,2555,0.1307
c,100,old,2553,0.2002
c,105,old,2539,0.1713
d,115,old,2524,0.1565
s,115,new,2516,0.1391
c,110,old,2505,0.1741
d,100,new,2502,0.1431
c,120,new,2500,0.1284
s,110,new,2491,0.1265
c,115,new,2483,0.1228
d,120,new,2452,0.1277
d,105,new,2448,0.1364
c,100,new,2436,0.1199
d,115,new,2435,0.1437
s,100,new,2411,0.1497
s,120,new,2411,0.1161
s,105,old,2387,0.1571
s,115,old,2365,0.1674
d,100,old,2358,0.1735
s,100,old,2329,0.1803
d,110,old,2235,0.1888


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Ul * Size + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.389310   0.270903   -5.13  2.9e-07
# Uld         -0.103201   0.386550   -0.27    0.789
# Uls          0.055036   0.389109    0.14    0.888
# Size        -0.004397   0.002458   -1.79    0.074
# Uld:Size     0.000842   0.003509    0.24    0.810
# Uls:Size    -0.000741   0.003533   -0.21    0.834
# Typeold      0.317126   0.020507   15.46  < 2e-16
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.40555    0.15921   -8.83   <2e-16
# Size        -0.00436    0.00144   -3.02   0.0025
# Typeold      0.31725    0.02051   15.47   <2e-16

## examine just the list type alone, since the Size result is clear.
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates, family="binomial"))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.8725     0.0208  -89.91   <2e-16
# Uld          -0.0106     0.0248   -0.43     0.67
# Uls          -0.0265     0.0249   -1.07     0.29
# Typeold       0.3163     0.0205   15.43   <2e-16
summary(glm(cbind(Successes,Failures) ~ Ul + Type, data=rates[rates$Size==100,], family="binomial"))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.8425     0.0465  -39.61  < 2e-16
# Uld          -0.0141     0.0552   -0.26     0.80
# Uls           0.0353     0.0551    0.64     0.52
# Typeold       0.3534     0.0454    7.78  7.3e-15

The results are a little confusing in factorial form: it seems pretty clear that increasing Size is bad and that 100% performs best, but what’s going on with the list icon type? Do we have too little data, or is it interacting with the font size somehow? I find it a lot clearer when plotted:

library(ggplot2)
qplot(Size,Rate,color=Ul,data=rates)
Read­ing rate, split by font size, then by list icon type

Immediately the negative effect of increasing the font size jumps out, and now it’s easier to understand the list icon estimates: square performs the best in the 100% (the original default) font size condition but performs poorly in the other font sizes, which is why it seems to do only medium-well compared to the others. Given how much better 100% performs than the others, I’m inclined to ignore their results and keep the squares.

100% and squares, however, were the original CSS settings, so this means I will make no changes to the existing CSS based on these results.

Blockquote formatting

Another bit of formatting I’ve been meaning to test for a while is seeing how well Readability’s pull-quotes next to blockquotes perform, and checking whether my zebra-striping of nested blockquotes is helpful or harmful.

The Readability thing goes like this:

blockquote:before {
    content: "\201C";
    filter: alpha(opacity=20);
    font-family: "Constantia", Georgia, 'Hoefler Text', 'Times New Roman', serif;
    font-size: 4em;
    left: -0.5em;
    opacity: .2;
    position: absolute;
    top: .25em }

The current blockquote striping goes thusly:

blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote {
    z-index: -2;
    background-color: rgb(245, 245, 245); }
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote {
    background-color: rgb(235, 235, 235); }

Implementation

This is another 2x2 design, since we can use the Readability quotes or not, and the zebra-striping or not.

hunk ./static/css/default.css 271
-blockquote, blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote {
-    z-index: -2;
-    background-color: rgb(245, 245, 245); }
-blockquote blockquote, blockquote blockquote blockquote blockquote,
- blockquote blockquote blockquote blockquote blockquote blockquote {
-    background-color: rgb(235, 235, 235); }
+/* blockquote, blockquote blockquote blockquote, */
+/* blockquote blockquote blockquote blockquote blockquote { */
+/*     z-index: -2; */
+/*     background-color: rgb(245, 245, 245); } */
+/* blockquote blockquote, blockquote blockquote blockquote blockquote, */
+/*blockquote blockquote blockquote blockquote blockquote blockquote { */
+/*     background-color: rgb(235, 235, 235); } */
hunk ./static/templates/default.html 30
-    <div class="ulFontSize_class1"></div>
+    <div class="blockquoteFormatting_class1"></div>
hunk ./static/templates/default.html 148
-      ulFontSize: [
+      blockquoteFormatting: [
hunk ./static/templates/default.html 150
-      name: 's100',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'rz',
+      "blockquoteFormatting_class1": "<style>blockquote: : before { content: '\201C';
filter: alpha(opacity=20);
font-family: 'Constantia', Georgia, 'Hoefler Text', 'Times New Roman', serif; font-size: 4em;left: -0.5em;
opacity: .2; position: absolute; top: .25em }; blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); };
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 155
-      name: 's105',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'orz',
+      "blockquoteFormatting_class1": "<style>blockquote, blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); };
blockquote blockquote, blockquote blockquote blockquote blockquote,
blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 160
-      name: 's110',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'roz',
+      "blockquoteFormatting_class1": "<style>blockquote: : before { content: '\201C';
filter: alpha(opacity=20);
font-family: 'Constantia', Georgia, 'Hoefler Text', 'Times New Roman', serif; font-size: 4em;left: -0.5em;
opacity: .2; position: absolute; top: .25em }</style>",
+      "blockquoteFormatting_class2": ""
hunk ./static/templates/default.html 165
-      name: 's115',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 's120',
-      "ulFontSize_class1": "<style>ul { list-style-type: square; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c100',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c105',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c110',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c115',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'c120',
-      "ulFontSize_class1": "<style>ul { list-style-type: circle; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd100',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 100%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd105',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 105%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd110',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 110%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd115',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 115%; }</style>",
-      "ulFontSize_class2": ""
-      },
-      {
-      name: 'd120',
-      "ulFontSize_class1": "<style>ul { list-style-type: disc; }; html { font-size: 120%; }</style>",
-      "ulFontSize_class2": ""
+      name: 'oroz',
+      "blockquoteFormatting_class1": "<style></style>",
+      "blockquoteFormatting_class2": ""
... ]]

Data

Readability quote   Blockquote highlighting   n        Conversion rate
no                  yes                       11,663   20.04%
yes                 yes                       11,514   19.86%
no                  no                        11,464   19.21%
yes                 no                        10,669   18.51%
(total)                                       45,310   19.42%

I discovered during this experiment that I could graph the conversion rate of each condition separately:

Google Ana­lyt­ics view on block­quote fac­to­r­ial test con­ver­sions, by day

What I like about this graph is how it demonstrates some basic statistical points:

  1. the more traffic, the smaller the sampling error and the closer the 4 conditions come to their true values, as they cluster together. This illustrates how even what seems like a large difference, based on a large amount of data, may still be - unintuitively - dominated by sampling error
  2. day to day, any condition can be on top; no matter which one proves superior and which version is the worst, we can spot days where the worst version looks better than the best version. This illustrates how insidious selection biases or choice of datapoints can be: we can easily lie and show black is white, if we can just manage to cherrypick a little bit.
  3. the underlying traffic does not itself appear to be completely stable or consistent. There are a lot of movements which look like the underlying visitors may be changing in composition slightly and responding slightly differently. This harks back to the papers’ warning that for some tests, no answer was possible, as the responses of visitors kept changing which version was performing best.

Analysis

rates <- read.csv(stdin(),header=TRUE)
Readability,Zebra,Type,N,Rate
FALSE,FALSE,new,7191,0.1837
TRUE,TRUE,new,7182,0.1910
FALSE,TRUE,new,7112,0.1800
TRUE,FALSE,new,6508,0.1804
FALSE,TRUE,old,4652,0.2236
TRUE,FALSE,old,4452,0.1995
TRUE,TRUE,old,4412,0.2201
FALSE,FALSE,old,4374,0.2046


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Readability * Zebra + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                           Estimate Std. Error z value Pr(>|z|)
# (Intercept)                -1.5095     0.0255  -59.09   <2e-16
# ReadabilityTRUE            -0.0277     0.0340   -0.81     0.42
# ZebraTRUE                   0.0327     0.0331    0.99     0.32
# ReadabilityTRUE:ZebraTRUE   0.0609     0.0472    1.29     0.20
# Typeold                     0.1788     0.0239    7.47    8e-14
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.5227     0.0197  -77.20  < 2e-16
# ZebraTRUE     0.0627     0.0236    2.66   0.0079
# Typeold       0.1782     0.0239    7.45  9.7e-14

The top-performing variant is the status quo (no Readability-style quote, zebra-striped blocks). So we keep it.

Font size & ToC background

It was pointed out to me that in my previous font-size test, the clear linear trend may have implied that larger fonts than 100% were bad, but that I was making an unjustified leap in implicitly assuming that 100% was best: if bigger is worse, then mightn’t the optimal font size be something smaller than 100%, like 95%?

And while the blockquote background coloring is a good idea, per the previous test, what about the other place on Gwern.net where I use a light background shading: the Table of Contents? Perhaps it would be better with the same background shading as the blockquotes, or no shading?

Finally, because I am tired of just 2 factors, I throw in a third factor to make it really multifactorial; I picked the number-sizing from the existing list of suggestions.

Each factor has 3 variants, giving 27 conditions:

.num { font-size: 85%; }
.num { font-size: 95%; }
.num { font-size: 100%; }

html { font-size: 85%; }
html { font-size: 95%; }
html { font-size: 100%; }

div#TOC { background: #fff; }
div#TOC { background: #eee; }
div#TOC { background-color: rgb(245, 245, 245); }

Implementation

hunk ./static/templates/default.html 30
-    <div class="blockquoteFormatting_class1"></div>
+    <div class="tocFormatting_class1"></div>
hunk ./static/templates/default.html 150
-      blockquoteFormatting: [
+      tocFormatting: [
hunk ./static/templates/default.html 152
-      name: 'rz',
-      "blockquoteFormatting_class1": "<style>blockquote:before { display: block; font-size: 200%; color: #ccc; content: open-quote; height: 0px; margin-left: -0.55em; position:relative; }; blockquote blockquote, blockquote blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
-      "blockquoteFormatting_class2": ""
+      name: '88f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
hunk ./static/templates/default.html 157
-      name: 'orz',
-      "blockquoteFormatting_class1": "<style>blockquote, blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote { z-index: -2; background-color: rgb(245, 245, 245); }; blockquote blockquote, blockquote blockquote blockquote blockquote, blockquote blockquote blockquote blockquote blockquote blockquote { background-color: rgb(235, 235, 235); }</style>",
-      "blockquoteFormatting_class2": ""
+      name: '88e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
hunk ./static/templates/default.html 162
-      name: 'oroz',
-      "blockquoteFormatting_class1": "<style></style>",
-      "blockquoteFormatting_class2": ""
+      name: '88r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '89r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81f',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81e',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '81r',
+      "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '98r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '99r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91f',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91e',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '91r',
+      "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '18r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '19r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11f',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11e',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
+      "tocFormatting_class2": ""
+      },
+      {
+      name: '11r',
+      "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
+      "tocFormatting_class2": ""
... ]]
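
Hand-writing all 27 (3x3x3) entries invites duplicate-name typos like the ones above; a sketch of how the array could have been generated mechanically in R instead (mirroring the naming scheme & inline-<style> template used here):

sizes <- c("8"="85%", "9"="95%", "1"="100%")
bgs   <- c("f"="background: #fff;", "e"="background: #eee;",
           "r"="background-color: rgb(245, 245, 245);")
grid  <- expand.grid(num=names(sizes), font=names(sizes), bg=names(bgs),
                     stringsAsFactors=FALSE)
entries <- apply(grid, 1, function(v)
    sprintf("{ name: '%s%s%s', \"tocFormatting_class1\": \"<style>.num { font-size: %s; }; html { font-size: %s; }; div#TOC { %s }</style>\", \"tocFormatting_class2\": \"\" }",
            v["num"], v["font"], v["bg"], sizes[v["num"]], sizes[v["font"]], bgs[v["bg"]]))
cat(paste(entries, collapse=",\n"))   # paste the output into the ABalytics init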

Analysis

rates <- read.csv(stdin(),header=TRUE)
NumSize,FontSize,TocBg,Type,N,Rate
1,9,e,new,3060,0.1513
8,9,e,new,2978,0.1605
9,1,r,new,2965,0.1548
8,8,f,new,2941,0.1629
1,9,f,new,2933,0.1558
9,9,r,new,2932,0.1576
8,9,f,new,2906,0.1473
1,9,r,new,2901,0.1482
9,9,f,new,2901,0.1420
8,8,r,new,2885,0.1567
1,8,e,new,2876,0.1412
8,1,r,new,2869,0.1593
9,8,f,new,2846,0.1472
1,1,e,new,2844,0.1551
1,8,f,new,2841,0.1457
9,8,e,new,2834,0.1478
8,1,f,new,2833,0.1521
1,8,r,new,2818,0.1544
8,8,e,new,2818,0.1678
8,1,e,new,2810,0.1605
1,1,r,new,2806,0.1775
9,8,r,new,2801,0.1682
9,1,e,new,2799,0.1422
8,9,r,new,2764,0.1548
9,9,e,new,2753,0.1478
1,1,f,new,2750,0.1611
9,1,f,new,2700,0.1537
8,8,r,old,1551,0.2521
9,8,e,old,1519,0.2146
9,8,f,old,1505,0.2153
1,8,e,old,1489,0.2317
1,1,e,old,1475,0.2339
8,1,f,old,1416,0.2112
1,9,r,old,1390,0.2245
8,9,e,old,1388,0.2464
9,9,r,old,1379,0.2466
8,9,r,old,1374,0.1907
1,9,f,old,1361,0.2337
8,8,f,old,1348,0.2322
1,9,e,old,1347,0.2279
1,8,f,old,1340,0.2470
9,1,r,old,1336,0.2605
8,1,r,old,1326,0.2119
8,8,e,old,1321,0.2286
9,1,f,old,1318,0.2398
1,1,r,old,1293,0.2111
1,8,r,old,1293,0.2073
9,9,f,old,1261,0.2411
8,9,f,old,1254,0.2113
9,9,e,old,1240,0.2435
1,1,f,old,1232,0.2240
8,1,e,old,1229,0.2587
9,1,e,old,1182,0.2335
9,8,r,old,1032,0.2403


rates[rates$NumSize==1,]$NumSize <- 100
rates[rates$NumSize==9,]$NumSize <- 95
rates[rates$NumSize==8,]$NumSize <- 85
rates[rates$FontSize==1,]$FontSize <- 100
rates[rates$FontSize==9,]$FontSize <- 95
rates[rates$FontSize==8,]$FontSize <- 85
rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ NumSize * FontSize * TocBg + Type, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                          Estimate Std. Error z value Pr(>|z|)
# (Intercept)              0.124770   3.020334    0.04     0.97
# NumSize                 -0.022262   0.032293   -0.69     0.49
# FontSize                -0.012775   0.032283   -0.40     0.69
# TocBgf                   4.042812   4.287006    0.94     0.35
# TocBgr                   5.356794   4.250778    1.26     0.21
# NumSize:FontSize         0.000166   0.000345    0.48     0.63
# NumSize:TocBgf          -0.040645   0.045855   -0.89     0.38
# NumSize:TocBgr          -0.054164   0.045501   -1.19     0.23
# FontSize:TocBgf         -0.052406   0.045854   -1.14     0.25
# FontSize:TocBgr         -0.065503   0.045482   -1.44     0.15
# NumSize:FontSize:TocBgf  0.000531   0.000490    1.08     0.28
# NumSize:FontSize:TocBgr  0.000669   0.000487    1.37     0.17
# Typeold                  0.492688   0.015978   30.84   <2e-16
summary(step(g))
# ...Coefficients:
#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)       3.808438   1.750144    2.18   0.0295
# NumSize          -0.059730   0.018731   -3.19   0.0014
# FontSize         -0.052262   0.018640   -2.80   0.0051
# TocBgf           -0.844664   0.285387   -2.96   0.0031
# TocBgr           -0.747451   0.283304   -2.64   0.0083
# NumSize:FontSize  0.000568   0.000199    2.85   0.0044
# NumSize:TocBgf    0.008853   0.003052    2.90   0.0037
# NumSize:TocBgr    0.008139   0.003030    2.69   0.0072
# Typeold           0.492598   0.015975   30.83   <2e-16

The two size tweaks turn out to be unambiguously negative compared to the status quo (with an almost negligible interaction term, probably reflecting reader preference for consistency in the sizes of letters and numbers: as one gets smaller, the other does better if it is smaller too). The Table of Contents backgrounds also survive (thanks to the new-vs-old visitor covariate adding power): there were 3 background types, e/f/r[gb], and f & r turn out to have negative main-effect coefficients, implying that e is best - but e is also the status quo, so no change is recommended.
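
As a sanity check, the reduced model's log-odds can be converted back into predicted conversion rates for the status-quo cell (NumSize = 100, FontSize = 100, ToC background e); a quick sketch:

b <- c(intercept=3.808438, num=-0.059730, font=-0.052262, numfont=0.000568, old=0.492598)
logit.new <- b[["intercept"]] + b[["num"]]*100 + b[["font"]]*100 + b[["numfont"]]*100*100
plogis(logit.new)                # ~0.153: the observed ~15% rate for new visitors
plogis(logit.new + b[["old"]])   # ~0.228: the observed ~23% rate for old visitors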

Multifactorial roundup

At this point it seems worth asking whether running multifactorials has been worthwhile. The analysis is a bit more difficult, and the more factors there are, the harder the results are to interpret. I'm also not too keen on encoding the combinatorial explosion into a big JS array for ABalytics. In my tests so far, have there been many interactions? A quick tally of the glm()/step() results:

  1. Text & back­ground col­or:

    • orig­i­nal: 2 main, 1 two-way inter­ac­tion
    • sur­vived: 2 main, 1 two-way inter­ac­tion
  2. List sym­bol and font-size:

    • orig­i­nal: 3 main, 2 two-way inter­ac­tions
    • sur­vived: 1 main
  3. Block­quote for­mat­ting:

    • orig­i­nal: 2 main, 1 two-way
    • sur­vived: 1 main
  4. Font size & ToC back­ground:

    • orig­i­nal: 4 mains, 5 two-ways, 2 three­-ways
    • sur­vived: 3 mains, 2 two-way

So of the 11 main effects, 9 two-way interactions, & 2 three-way interactions originally fitted, the reduced models confirmed: 7 mains (64%), 3 two-ways (33%), & 0 three-ways (0%). And of the surviving interactions, only the black/white interaction was important (and even there, if I had instead regressed cbind(Successes, Failures) ~ Black + White, black & white would still have positive coefficients; they just would not be statistically-significant, and so I would likely have made the same choice as I did with the interaction data available).
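
Or as a quick computation:

surviving <- c(mains=7, twoways=3, threeways=0)
fitted    <- c(mains=11, twoways=9, threeways=2)
round(100 * surviving / fitted)   # 64% of mains, 33% of two-ways, 0% of three-ways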

This is not a resound­ing endorse­ment so far.

Section header capitalization

3x3:

  • h1, h2, h3, h4, h5 { text-transform: uppercase; }
  • h1, h2, h3, h4, h5 { text-transform: none; }
  • h1, h2, h3, h4, h5 { text-transform: capitalize; }
  • div#header h1 { text-transform: uppercase; }
  • div#header h1 { text-transform: none; }
  • div#header h1 { text-transform: capitalize; }
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -27,7 +27,7 @@
   <body>

-    <div class="tocFormatting_class1"></div>
+    <div class="headerCaps_class1"></div>

     <div id="main">
       <div id="sidebar">
@@ -152,141 +152,51 @@
       _gaq.push(['_setAccount', 'UA-18912926-1']);

       ABalytics.init({
-      tocFormatting: [
+      headerCaps: [
       {
- name: '88f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
+ name: 'uu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: uppercase; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '88e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'un',
+ "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '88r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'uc',
+ "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
+ name: 'nu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'nn',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '89r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'nc',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81f',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
+ name: 'cu',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: uppercase; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81e',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
+ name: 'cn',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: none; }</style>",
+ "headerCaps_class2": ""
  },
  {
- name: '81r',
- "tocFormatting_class1": "<style>.num { font-size: 85%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '98r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '99r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91f',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91e',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '91r',
- "tocFormatting_class1": "<style>.num { font-size: 95%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '18f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #fff; };</style>",
- "tocFormatting_class2": ""
- {
- name: '18e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '18r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 85%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '19r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 95%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11f',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #fff; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11e',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background: #eee; }</style>",
- "tocFormatting_class2": ""
- },
- {
- name: '11r',
- "tocFormatting_class1": "<style>.num { font-size: 100%; }; html { font-size: 100%; }; div#TOC { background-color: rgb(245, 245, 245); }</style>",
- "tocFormatting_class2": ""
+ name: 'cc',
+ "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: capitalize; }</style>",
+ "headerCaps_class2": ""
  }
       ],
       }, _gaq);
       ...)}
rates <- read.csv(stdin(),header=TRUE)
Sections,Title,Old,N,Rate
c,u,FALSE,2362,0.1808
c,n,FALSE,2356,0.1855
c,c,FALSE,2342,0.2003
u,u,FALSE,2341,0.1965
u,c,FALSE,2333,0.1989
n,u,FALSE,2329,0.1928
n,c,FALSE,2323,0.1941
n,n,FALSE,2321,0.1978
u,n,FALSE,2315,0.1965
c,c,TRUE,1370,0.2190
n,u,TRUE,1302,0.2558
u,u,TRUE,1271,0.2919
c,n,TRUE,1258,0.2377
u,c,TRUE,1228,0.2272
n,c,TRUE,1211,0.2337
n,n,TRUE,1200,0.2400
c,u,TRUE,1135,0.2396
u,n,TRUE,1028,0.2442


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Sections * Title + Old, data=rates, family="binomial"); summary(g)
# ...Coefficients:
# (Intercept)       -1.4552     0.0422  -34.50   <2e-16
# Sectionsn          0.0111     0.0581    0.19    0.848
# Sectionsu          0.0163     0.0579    0.28    0.779
# Titlen            -0.0153     0.0579   -0.26    0.791
# Titleu            -0.0318     0.0587   -0.54    0.588
# OldTRUE            0.2909     0.0283   10.29   <2e-16
# Sectionsn:Titlen   0.0429     0.0824    0.52    0.603
# Sectionsu:Titlen   0.0419     0.0829    0.51    0.613
# Sectionsn:Titleu   0.0732     0.0825    0.89    0.375
# Sectionsu:Titleu   0.1553     0.0820    1.89    0.058
summary(step(g))
# ...Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -1.4710     0.0263  -55.95   <2e-16
# Sectionsn     0.0497     0.0337    1.47    0.140
# Sectionsu     0.0833     0.0337    2.47    0.013
# OldTRUE       0.2920     0.0283   10.33   <2e-16

Uppercase and 'none' beat 'capitalize' in both page titles & section headers (the interaction does not survive model reduction). So I toss in a CSS declaration to uppercase section headers as well, keeping the status quo for the title.
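
Converting the reduced model out of log-odds (a sketch; 'capitalize' is the reference level, new visitors):

plogis(-1.4710)            # 'capitalize' section headers: ~0.187 conversion rate
plogis(-1.4710 + 0.0833)   # 'uppercase' section headers:  ~0.200 conversion rate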

ToC formatting

After the page title, the next thing a reader will generally see on my pages is the table of contents. It's been tweaked over the years (particularly following suggestions from Hacker News) but still has some untested aspects, particularly the first two declarations of div#TOC:

    float: left;
    width: 25%;

I’d like to test left vs right, and 15,20,25,30,35%, so that’s a 2x5 design. Usual imple­men­ta­tion:

diff --git a/static/templates/default.html b/static/templates/default.html
index 83c6f9c..11c4ada 100644
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -27,7 +27,7 @@
   <body>

-    <div class="headerCaps_class1"></div>
+    <div class="tocAlign_class1"></div>

     <div id="main">
       <div id="sidebar">
@@ -152,51 +152,56 @@
       _gaq.push(['_setAccount', 'UA-18912926-1']);

       ABalytics.init({
-      headerCaps: [
+      tocAlign: [
       {
-      name: 'uu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: uppercase; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'l15',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 15%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'un',
-      "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'l20',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 20%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'uc',
-      "headerCaps_class1": "<style>div#header h1 { text-transform: uppercase; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'l25',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 25%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'l30',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 30%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nn',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'l35',
+      "tocAlign_class1": "<style>div#TOC { float: left; width: 35%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'nc',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: none; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'r15',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 15%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cu',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: uppercase; }</style>",
-      "headerCaps_class2": ""
+      name: 'r20',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 20%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cn',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: none; }</style>",
-      "headerCaps_class2": ""
+      name: 'r25',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 25%; }</style>",
+      "tocAlign_class2": ""
       },
       {
-      name: 'cc',
-      "headerCaps_class1": "<style>h1, h2, h3, h4, h5 { text-transform: capitalize; }; div#header h1 { text-transform: capitalize; }</style>",
-      "headerCaps_class2": ""
+      name: 'r30',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 30%; }</style>",
+      "tocAlign_class2": ""
+      },
+      {
+      name: 'r35',
+      "tocAlign_class1": "<style>div#TOC { float: right; width: 35%; }</style>",
+      "tocAlign_class2": ""
       }
       ],
       }, _gaq);

I decided to end this test early on 2014-03-10 because I wanted to move on to the BeeLine Reader test, so it's underpowered & the results aren't as clear as usual:

rates <- read.csv(stdin(),header=TRUE)
Alignment,Width,Old,N,Rate
r,25,FALSE,1040,0.1673
r,30,FALSE,1026,0.1891
l,20,FALSE,1023,0.1896
l,25,FALSE,1022,0.1800
l,35,FALSE,1022,0.1820
l,30,FALSE,1016,0.1781
l,15,FALSE,1010,0.1851
r,15,FALSE,991,0.1554
r,20,FALSE,989,0.1881
r,35,FALSE,969,0.1672
l,30,TRUE,584,0.2414
l,25,TRUE,553,0.2224
l,20,TRUE,520,0.3096
r,15,TRUE,512,0.2539
l,35,TRUE,496,0.2520
r,25,TRUE,494,0.2105
l,15,TRUE,482,0.2282
r,35,TRUE,480,0.2417
r,20,TRUE,460,0.2326
r,30,TRUE,455,0.2549


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

g <- glm(cbind(Successes,Failures) ~ Alignment * Width + Old, data=rates, family="binomial"); summary(g)
# Coefficients:
#                  Estimate Std. Error z value Pr(>|z|)
# (Intercept)      -1.43309    0.10583  -13.54   <2e-16
# Alignmentr       -0.17726    0.15065   -1.18     0.24
# Width            -0.00253    0.00403   -0.63     0.53
# OldTRUE           0.40092    0.04184    9.58   <2e-16
# Alignmentr:Width  0.00450    0.00580    0.78     0.44

So, as I expected, putting the ToC on the right performed worse; the larger ToC widths don't seem to be better, but it's unclear what's going on there. A visual inspection of the Width data (library(ggplot2); qplot(Width,Rate,color=Alignment,data=rates)) suggests that the 20% width was the best variant, so I might as well go with that.
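
Pooling over alignment & visitor type tells the same story (a quick check):

byWidth <- aggregate(cbind(Successes, Failures) ~ Width, data=rates, sum)
byWidth$Rate <- with(byWidth, Successes / (Successes + Failures))
byWidth   # the 20% width has the highest pooled conversion rate (~0.217)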

BeeLine Reader text highlighting

BLR is a JS library for high­light­ing tex­tual para­graphs with pairs of half-lines to make read­ing eas­i­er. I run a ran­dom­ized exper­i­ment on sev­eral differ­ent­ly-col­ored ver­sions to see if default site-wide usage of BLR will improve time-on-page for Gwern.net read­ers, indi­cat­ing eas­ier read­ing of the long-form tex­tual con­tent. Most ver­sions per­form worse than the con­trol of no-high­light­ing; the best ver­sion per­forms slightly bet­ter but the improve­ment is not sta­tis­ti­cal­ly-sig­nifi­cant.

BeeLine Reader (BLR) is an interesting new browser plugin which launched around October 2013; I learned of it from the Hacker News discussion. The idea is that part of the difficulty in reading text is that when one finishes a line and saccades left to the continuation of the next line, the uncertainty of where it is adds a bit of stress, so one can make reading easier by adding some sort of guide to the next line; in this case, each matching pair of half-lines is colored differently, so if you are on a red half-line, when you saccade left, you look for a line also colored red, then you switch to blue in the middle of that line, and so on. A colorful variant on boustrophedon writing. I found the default BLR coloring garish & distracting, but I couldn't see any reason that a subtle gray variant would not help: the idea seems plausible. And very long text pages (like mine) are where BLR should shine most.

I asked if there were a JavaScript ver­sion I could use in an A/B test; the ini­tial JS imple­men­ta­tion was not fast enough, but by 2014-03-10 it was good enough. BLR has sev­eral themes, includ­ing “gray”; I decided to test the vari­ants no BLR, “dark”, “blues”, & expanded the gray selec­tion to include grays #222222/#333333/#444444/#555555/#666666/#777777 (gray-6; they vary in how bla­tant the high­light­ing is) for a total of 9 equal­ly-ran­dom­ized vari­ants.

Since I'm particularly interested in these results, and I think many other people will find the results interesting, I will run this test extra-long: a minimum of 2 months. I'm only interested in the best variant, not estimating each variant exactly (what do I care if the ugly dark is 15% rather than 14%? I just want to know it's worse than the control), so conceptually I want something like a sequential analysis or adaptive clinical trial or multi-armed bandit (MAB) where bad variants get dropped over time; unfortunately, I haven't studied them yet (and MABs would be hard to implement on a static site), so I'll just ad hoc drop the worst variant every week or two. (Maybe next experiment I'll do a formal adaptive trial.)
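
For concreteness, a minimal sketch in R of the MAB idea (Thompson sampling over Beta posteriors); the 'true' conversion rates below are made up for illustration, not taken from the BLR data:

set.seed(2014)
true.rates <- c(none=0.16, dark=0.14, gray1=0.17)   # hypothetical rates
succ <- fail <- setNames(numeric(3), names(true.rates))
for (i in 1:10000) {
    k <- which.max(rbeta(3, 1 + succ, 1 + fail))   # draw from each arm's posterior
    converted <- runif(1) < true.rates[k]          # simulate the assigned visitor
    succ[k] <- succ[k] + converted
    fail[k] <- fail[k] + !converted
}
round((succ + fail) / 10000, 2)   # traffic share per arm: bad arms starve over time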

Setup

The usual implementation using ABalytics doesn't work because it uses an innerHTML call to substitute the various fragments, and while HTML & CSS get interpreted fine, JavaScript does not; the offered solutions were sufficiently baroque that I wound up implementing a custom subset of ABalytics hardwired for BLR inside the Analytics script:

     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-18912926-1']);
+     // A/B test: heavily based on ABalytics
+      function readCookie (name) {
+        var nameEQ = name + "=";
+        var ca = document.cookie.split(';');
+        for(var i=0;i < ca.length;i++) {
+            var c = ca[i];
+            while (c.charAt(0)==' ') c = c.substring(1,c.length);
+            if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length);
+        }
+        return null;
+      }
+
+      if (typeof(start_slot) == 'undefined') start_slot = 1;
+      var experiment = "blr3";
+      var variant_names = ["none", "dark", "blues", "gray1", "gray2", "gray3", "gray4", "gray5", "gray6"];
+
+      var variant_id = this.readCookie("ABalytics_"+experiment);
+      if (!variant_id || !variant_names[variant_id]) {
+      var variant_id = Math.floor(Math.random()*variant_names.length);
+      document.cookie = "ABalytics_"+experiment+"="+variant_id+"; path=/";
+                        }
+      function beelinefy (COLOR) {
+       if (COLOR != "none") {
+          var elements=document.querySelectorAll("#content");
+          for(var i=0;i < elements.length;i++) {
+                          var beeline=new BeeLineReader(elements[i], { theme: COLOR, skipBackgroundColor: true, skipTags: ['math', 'svg', 'h1', 'h2', 'h3', 'h4'] });
+                          beeline.color();
+                          }
+       }
+      }
+      beelinefy(variant_names[variant_id]);
+      _gaq.push(['_setCustomVar',
+                  start_slot,
+                  experiment,                 // The name of the custom variable = name of the experiment
+                  variant_names[variant_id],  // The value of the custom variable = variant shown
+                  2                           // Sets the scope to session-level
+                 ]);
      _gaq.push(['_trackPageview']);

The themes are defined in beeline.min.js as:

r.THEMES={
 dark: ["#000000","#970000","#000000","#00057F","#FBFBFB"],
 blues:["#000000","#0000FF","#000000","#840DD2","#FBFBFB"],
 gray1:["#000000","#222222","#000000","#222222","#FBFBFB"],
 gray2:["#000000","#333333","#000000","#333333","#FBFBFB"],
 gray3:["#000000","#444444","#000000","#444444","#FBFBFB"],
 gray4:["#000000","#555555","#000000","#555555","#FBFBFB"],
 gray5:["#000000","#666666","#000000","#666666","#FBFBFB"],
 gray6:["#000000","#777777","#000000","#777777","#FBFBFB"]
}

(Why “blr3”? I don’t know JS, so it took some time; things I learned along the way included always leaving whitespace around a < operator, and that the “none” argument passed into beeline.setOptions causes a problem which some browsers will ignore (continuing to record A/B data afterwards) but most browsers will not; this broke the original test. Then I discovered that BLR by default broke all the MathML/MathJax, causing nasty-looking errors on pages with math expressions; this broke the second test, and I had to get a fixed version.)

Data

On 31 March, with total n hav­ing reached 15652 vis­its, I deleted the worst-per­form­ing vari­ant: gray4, which at 19.21% was sub­stan­tially under­per­form­ing the best-per­form­ing vari­ant’s 22.38%, and wast­ing traffic. On 6 April, two Hacker News sub­mis­sions hav­ing dou­bled vis­its to 36533, I deleted the nex­t-worst vari­ant, gray5 (14.66% vs con­trol of 16.25%; p = 0.038). On 9 April, the almost as infe­rior gray6 (15.67% vs 16.26%) was delet­ed. On 17 April, dark (16.00% vs 16.94%) was delet­ed. On 30 April, I deleted gray2 (17.56% vs 18.07%). 11 May, blues was gone (18.11% vs 18.53%), and on 31 May, I deleted gray3 (18.04% vs 18.24%).

Due to caching, the dele­tions did­n’t nec­es­sar­ily drop data col­lec­tion instantly to zero. Traffic was also het­ero­ge­neous: Hacker News traffic is much less likely to spend much time on page than the usual traffic.

The con­ver­sion data, with new vs return­ing vis­i­tor, seg­mented by peri­od, and ordered by when a vari­ant was delet­ed:

Variant Old Total: n (%) 10–31 March 1–6 April 7–9 April 10–17 April 18–30 April 1–11 May 12–31 May 1–8 June
none FALSE 17648 (16.01%) 1189 (19.26%) 3607 (13.97%) 460 (17.39%) 1182 (16.58%) 3444 (17.04%) 2397 (14.39%) 3997 (17.39%) 2563 (16.35%)
none TRUE 8009 (23.65%) 578 (24.91%) 1236 (22.09%) 226 (20.35%) 570 (23.86%) 1364 (27.05%) 1108 (23.83%) 2142 (22.46%) 1363 (23.84%)
gray1 FALSE 17579 (16.28%) 1177 (19.71%) 3471 (14.06%) 475 (13.47%) 1200 (17.33%) 3567 (17.49%) 2365 (13.57%) 3896 (18.17%) 2605 (17.24%)
gray1 TRUE 7694 (23.85%) 515 (28.35%) 1183 (23.58%) 262 (21.37%) 518 (21.43%) 1412 (26.56%) 1090 (24.86%) 2032 (22.69%) 1197 (23.56%)
gray3 FALSE 14871 (15.81%) 1192 (18.29%) 3527 (14.15%) 446 (15.47%) 1160 (15.43%) 3481 (17.98%) 2478 (14.65%) 3776 (16.26%) 3 (33.33%)
gray3 TRUE 6631 (23.06%) 600 (24.83%) 1264 (21.52%) 266 (18.05%) 638 (21.79%) 1447 (25.22%) 1053 (24.60%) 1912 (23.17%) 51 (5.88%)
blues FALSE 10844 (15.34%) 1157 (18.93%) 3470 (14.35%) 449 (16.04%) 1214 (15.57%) 3346 (17.54%) 2362 (13.46%) 3 (0.00%)
blues TRUE 4544 (23.04%) 618 (27.18%) 1256 (23.81%) 296 (20.27%) 584 (22.09%) 1308 (24.46%) 1052 (22.15%) 48 (12.50%)
gray2 FALSE 8646 (15.51%) 1220 (20.33%) 3649 (13.81%) 416 (15.14%) 1144 (15.03%) 3433 (17.54%) 4 (0.00%)
gray2 TRUE 3366 (22.82%) 585 (22.74%) 1271 (21.79%) 230 (16.52%) 514 (21.60%) 1298 (25.42%) 44 (27.27%) 6 (0.00%) 3 (0.00%)
dark FALSE 5240 (14.05%) 1224 (20.59%) 3644 (13.83%) 420 (13.81%) 1175 (14.81%) 1 (0.00%)
dark TRUE 2161 (20.59%) 618 (21.52%) 1242 (20.85%) 276 (21.74%) 574 (20.56%) 64 (10.94%) 1 (0.00%) 2 (0.00%) 2 (50.00%)
gray6 FALSE 4022 (13.30%) 1153 (19.51%) 3610 (12.88%) 409 (17.11%) 1 (0.00%) 2 (0.00%) 3 (0.00%)
gray6 TRUE 1727 (20.61%) 654 (23.70%) 1358 (22.02%) 259 (18.92%) 95 (7.37%) 11 (9.09%) 1 (0.00%)
gray5 FALSE 3245 (12.20%) 1175 (16.68%) 3242 (12.21%) 3 (0.00%)
gray5 TRUE 1180 (21.53%) 559 (25.94%) 1130 (21.77%) 34 (17.65%) 16 (12.50%)
gray4 FALSE 1176 (18.54%) 1174 (18.57%) 1174 (18.57%) 2 (0.00%)
gray4 TRUE 673 (19.91%) 650 (20.31%) 669 (20.03%) 1 (0.00%) 1 (0.00%) 2 (0.00%)
137438 (18.27%)

Graphed:

Weekly con­ver­sion rates for each of the Bee­Line Reader set­tings

I also received a num­ber of com­plaints while run­ning the BLR test (prin­ci­pally due to the dark and blues vari­ants, but also appar­ently trig­gered by some of the less pop­u­lar gray vari­ants; the num­ber of com­plaints dropped off con­sid­er­ably by halfway through):

  • 2 in emails
  • 2 on IRC unso­licit­ed; when I later asked, there were 2 com­plaints of slow­ness load­ing pages & after reflow­ing
  • 2 on Red­dit
  • 3 men­tions in Gwern.net com­ments
  • 4 through my anony­mous feed­back form
  • 6 com­plaints on Hacker News
  • total: 19

Analysis

The BLR people say that there may be cross-browser differences, so I thought about throwing in browser as a covariate too (an unordered factor of Chrome & Firefox, with everything else binned as an 'other' browser); it seems I would have had to use the GA API to extract conversion rates split by variant, visitor status, and browser. This turned out to be enough work that I decided not to bother.

As usu­al, a logis­tic regres­sion on the var­i­ous BLR themes with new vs return­ing vis­i­tors (Old) as a covari­ate. Because of the het­ero­gene­ity in traffic (and because I both­ered break­ing out the data by time period this time for the table), I also include each block as a fac­tor. Final­ly, because I expected the 6 gray vari­ants to per­form sim­i­lar­ly, I try out a mul­ti­level model nest­ing the grays togeth­er.

The results are not impressive: only 2 of the 8 variants (both grays) have a positive estimate, and neither is statistically-significant; the best variant was gray1 (“#222222” & “#FBFBFB”), at an estimated increase from 19.52% to 20.04% conversion rate. More surprising, the nesting turns out to not matter at all, and in fact the worst variant was also a gray (gray5). (The best-fitting multilevel model ignores the variants entirely, although it did not fit better than the regular logistic model incorporating all of the time periods, Old, and the variants.)

# Pivot table view on custom variable:
# ("Secondary dimension: User Type"; "Pivot by: Custom Variable (Value 01); Pivot metrics: Sessions | Time reading (Goal 1 Conversion Rate)")
# then hand-edited to add Color and Date variables
rates <- read.csv("https://www.gwern.net/docs/traffic/2014-06-08-abtesting-blr.csv")

rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

# specify the control group is 'none'
rates$Variant <- relevel(rates$Variant, ref="none")
rates$Color <- relevel(rates$Color, ref="none")

# normal:
g0 <- glm(cbind(Successes,Failures) ~ Old + Variant + Date, data=rates, family=binomial); summary(g0)
# ...Coefficients:
#                  Estimate Std. Error z value Pr(>|z|)
# (Intercept)     -1.633959   0.027712  -58.96  < 2e-16
# OldTRUE          0.465491   0.014559   31.97  < 2e-16
# Date10-17 April -0.021047   0.037563   -0.56   0.5753
# Date10-31 March  0.150498   0.035017    4.30  1.7e-05
# Date1-11 May    -0.107965   0.035133   -3.07   0.0021
# Date12-31 May    0.009534   0.032448    0.29   0.7689
# Date1-6 April   -0.138053   0.031809   -4.34  1.4e-05
# Date18-30 April  0.095898   0.031817    3.01   0.0026
# Date7-9 April   -0.129704   0.047314   -2.74   0.0061
#
# Variantgray5    -0.114487   0.040429   -2.83   0.0046
# Variantdark     -0.060299   0.033912   -1.78   0.0754
# Variantgray2    -0.027338   0.028518   -0.96   0.3378
# Variantblues    -0.012120   0.026330   -0.46   0.6453
# Variantgray3    -0.005484   0.023441   -0.23   0.8150
# Variantgray4    -0.003556   0.047273   -0.08   0.9400
# Variantgray6     0.000536   0.036308    0.01   0.9882
# Variantgray1     0.026765   0.021757    1.23   0.2186

library(lme4)
g1 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color/Variant) + (1|Date), data=rates, family=binomial)
g2 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color)         + (1|Date), data=rates, family=binomial)
g3 <- glmer(cbind(Successes,Failures) ~ Old +                     (1|Date), data=rates, family=binomial)
g4 <- glmer(cbind(Successes,Failures) ~ Old + (1|Variant),                  data=rates, family=binomial)
g5 <- glmer(cbind(Successes,Failures) ~ Old + (1|Color),                    data=rates, family=binomial)
AIC(g0, g1, g2, g3, g4, g5)
#    df  AIC
# g0 17 1035
# g1  5 1059
# g2  4 1058
# g3 13 1041
# g4  3 1252
# g5  3 1264
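
To put the gray1 coefficient on a more interpretable scale (a sketch; the 20% baseline is illustrative):

exp(0.026765)                     # odds ratio ~1.027: ~+2.7% on the conversion odds
plogis(qlogis(0.20) + 0.026765)   # eg. a 20.0% baseline rate would rise to ~20.4%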

Conclusion

An unlikely +0.5% to reading rates isn't enough for me to want to add a dependency on another JS library, so I will be removing BLR. I'm not surprised by this result: most tests don't show an improvement, the BLR coloring is pretty unusual for a website, and users wouldn't have any understanding of what it is or any ability to opt out of it. Using BLR by default doesn't work, but the browser extension might be useful, since there the user expects the coloring & can choose their preferred color scheme.

I was surprised that the gray variants could perform so wildly differently, from slightly better than the control to horribly worse, considering that they didn't strike me as looking that different when I was previewing them locally. I also didn't expect blues to last as long as it did, and thought I would be deleting it almost as soon as dark. This makes me wonder: are there color themes only subtly different from the ones I tried which might work unpredictably well? Since BLR by default offers only a few themes, I think BLR should try out as many color themes as possible to locate good ones they've missed.

Some lim­i­ta­tions to this exper­i­ment:

  • no way for users to dis­able BLR or change color themes
  • did not include web browser type as a covari­ate, which might have shown that par­tic­u­lar com­bi­na­tions of browser & theme sub­stan­tially out­per­formed the con­trol (then BLR could have improved their code for the bad browsers or a browser check done before high­light­ing any text)
  • did not use for­mal adap­tive trial method­ol­o­gy, so the p-val­ues have no par­tic­u­lar inter­pre­ta­tion

Floating footnotes

One of the site features I like the most is how the endnotes pop out/float when the mouse hovers over the link, so the reader doesn't have to jump to the endnotes and back, jarring their concentration and breaking their train of thought. I got the JS from Lukas Mathis back in 2010. But sometimes the mouse hovers by accident, and with big footnotes, the popped-up footnote can cover the screen and be unreadable. I've wondered if it's as cool as I think it is, or whether it might be damaging. So now that I've hacked up an ABalytics clone which can handle JS in order to run the BLR experiment, I might as well run an A/B test to verify that the floating footnotes are not badly damaging conversions. (I'm not demanding that the floating footnotes increase conversions by 1% or anything, just that the floating isn't coming at too steep a price.)

Implementation

diff --git a/static/js/footnotes.js b/static/js/footnotes.js
index 69088fa..e08d63c 100644
--- a/static/js/footnotes.js
+++ b/static/js/footnotes.js
@@ -1,7 +1,3 @@
-$(document).ready(function() {
-    Footnotes.setup();
-});
-

diff --git a/static/templates/default.html b/static/templates/default.html
index 4395130..8c97954 100644
--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -133,6 +133,9 @@
     <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>

+    <script type="text/javascript" src="/static/js/footnotes.js"></script>
+
     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
@@ -151,14 +154,23 @@

       if (typeof(start_slot) == 'undefined') start_slot = 1;
-      var experiment = "blr3";
-      var variant_names = ["none", "gray1"];
+      var experiment = "floating_footnotes";
+      var variant_names = ["none", "float"];

       var variant_id = this.readCookie("ABalytics_"+experiment);
       if (!variant_id || !variant_names[variant_id]) {
       var variant_id = Math.floor(Math.random()*variant_names.length);
       document.cookie = "ABalytics_"+experiment+"="+variant_id+"; path=/";
                         }
+      // enable the floating footnotes
+      function footnotefy (VARIANT) {
+       if (VARIANT != "none") {
+         $(document).ready(function() {
+                        Footnotes.setup();
+                        });
+       }
+      }
+      footnotefy(variant_names[variant_id]);
       _gaq.push(['_setCustomVar',
                   start_slot,
                   experiment,                 // The name of the custom variable = name of the experiment
                   ...)]
@@ -196,9 +208,6 @@
     <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

-    <script type="text/javascript" src="/static/js/footnotes.js"></script>
-
     <script type="text/javascript" src="/static/js/tablesorter.js"></script>
     <script type="text/javascript" id="tablesorter">

Data

2014-06-08–2014-07-12:

Vari­ant Old n Con­ver­sion
none FALSE 10342 17.00%
float FALSE 10039 17.42%
none TRUE 4767 22.24%
float TRUE 4876 22.40%
none 15109 18.65%
float 14915 19.05%
30024 18.85%

Analysis

rates <- read.csv(stdin(),header=TRUE)
Footnote,Old,N,Rate
none,FALSE,10342,0.1700
float,FALSE,10039,0.1742
none,TRUE,4767,0.2224
float,TRUE,4876,0.2240


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes

rates$Footnote <- relevel(rates$Footnote, ref="none")

g <- glm(cbind(Successes,Failures) ~ Footnote + Old, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -1.5820     0.0237  -66.87   <2e-16
# Footnotefloat   0.0222     0.0296    0.75     0.45
# OldTRUE         0.3234     0.0307   10.53   <2e-16
confint(g)
#                  2.5 %   97.5 %
# (Intercept)   -1.62856 -1.53582
# Footnotefloat -0.03574  0.08018
# OldTRUE        0.26316  0.38352

As I had hoped, floating footnotes seem to do no harm, and the point-estimate is positive. The 95% CI, while not excluding zero, does exclude values worse than -0.035 (on the log-odds scale), which satisfies me: if floating footnotes are doing any harm, it's a small harm.
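
On the probability scale, that lower bound is indeed a small harm (sketch, using the observed new-visitor baseline of 17.00%):

plogis(qlogis(0.1700) - 0.03574)   # ~0.165: at most ~0.5 percentage points worse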

Indented paragraphs

An anony­mous feed­back sug­gested a site design tweak:

Could you for­mat your pages so that the texts are all aligned at the left? It looks unpro­fes­sional when the lines of text break at differ­ent areas. Could you make the site like a LaTeX arti­cle? The for­mat­ting is the only thing pre­vent­ing you from look­ing really pro­fes­sion­al.

I was­n’t sure what he meant, since the text is left­-aligned, and I can’t ask for clar­i­fi­ca­tion (anony­mous means anony­mous).

Looking at a random page, my best guess is that he's bothered by the indentation at the start of successive paragraphs: in a sequence of paragraphs, the first paragraph is not indented (because it cannot be visually confused with a continuation) but the successive paragraphs are indented by 1.5em in order to make reading easier. The CSS is:

p { margin-top: -0.2em;
    margin-bottom: 0 }
p + p {
  text-indent: 1.5em;
  margin-top: 0 }

I liked this, but I suppose that with lots of small paragraphs, it lends a ragged appearance to the page. So I might as well test a few variants of text-indent to see what works best: 0em, 0.1em, 0.5em, 1.0em, 1.5em, and 2.0em.

In retrospect years later, after learning more about typography and revamping Gwern.net CSS a number of times, I think Anonymous was actually talking about text justification: HTML/Gwern.net is by default “flush left, ragged right”, with large whitespace gaps left where words of different lengths get moved to the next line but not broken/hyphenated or stretched to fill the line. Some people do not like text justification, describing ragged right as easier to read, but most typographers endorse it; it was historically the norm for professionally-set print, still carries connotations of class, and I think the appearance fits in with my overall site esthetic. I eventually enabled text justification on Gwern.net in February 2019 (although I was irritated by the discovery that the standard CSS method of doing so does not work in the Chrome browser, due to a long-standing failure to implement hyphenation support).

Implementation

Since we’re back to test­ing CSS, we can use the old ABa­lyt­ics approach with­out hav­ing to do JS cod­ing:

--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -19,6 +19,9 @@
   </head>
   <body>

+   <div class="indent_class1"></div>
+
     <div id="main">
       <div id="sidebar">
         <div id="logo"><img alt="Logo: a Gothic/Fraktur blackletter capital G/𝕲" height="36" src="/images/logo/logo.png" width="32" /></div>
@@ -136,10 +139,48 @@
     <script type="text/javascript" src="/static/js/footnotes.js"></script>

+    <script type="text/javascript" src="/static/js/abalytics.js"></script>
+    <script type="text/javascript">
+      window.onload = function() {
+      ABalytics.applyHtml();
+      };
+    </script>
+
     <script id="googleAnalytics" type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-18912926-1']);
+
+      ABalytics.init({
+      indent: [
+      {
+      name: "none",
+      "indent_class1": "<style>p + p { text-indent: 0.0em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent0.1",
+      "indent_class1": "<style>p + p { text-indent: 0.1em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent0.5",
+      "indent_class1": "<style>p + p { text-indent: 0.5em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent1.0",
+      "indent_class1": "<style>p + p { text-indent: 1.0em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent1.5",
+      "indent_class1": "<style>p + p { text-indent: 1.5em; margin-top: 0 }</style>"
+      },
+      {
+      name: "indent2.0",
+      "indent_class1": "<style>p + p { text-indent: 2.0em; margin-top: 0 }</style>"
+      }
+      ],
+      }, _gaq);
+
       _gaq.push(['_trackPageview']);
       (function() { // })

Data

On 2014-07-27, since the 95% CIs for the best and worst indent vari­ants no longer over­lapped, I deleted the worst vari­ant (0.1). On 2014-08-23, the 2.0em and 0.0em vari­ants no longer over­lapped, and I deleted the lat­ter.
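
The drop rule, concretely (a sketch with hypothetical counts, not the real data):

# a variant is deleted once its 95% binomial CI no longer overlaps the best arm's:
ci <- function(successes, n) binom.test(successes, n)$conf.int
ci(140, 1000)   # worst arm: ~0.119-0.163
ci(190, 1000)   # best arm:  ~0.166-0.216; the CIs are disjoint, so drop the worst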

Daily traffic and con­ver­sion rates for each of the inden­ta­tion set­tings

The con­ver­sion data, with new vs return­ing vis­i­tor, seg­mented by peri­od, and ordered by when a vari­ant was delet­ed:

Vari­ant Old Total: n (%) 12-27 July 28 July-23 August 24 August-19 Novem­ber
0.1 FALSE 1552 (18.11%) 1551 (18.12%) 1552 (18.11%)
0.1 TRUE 707 (21.64%) 673 (21.69%) 706 (21.67%) 6 (0.00%)
none FALSE 5419 (16.70%) 1621 (17.27%) 5419 (16.70%) 3179 (16.55%)
none TRUE 2742 (23.23%) 749 (27.77%) 2684 (23.62%) 1637 (21.01%)
0.5 FALSE 26357 (15.09%) 1562 (18.89%) 5560 (17.86%) 24147 (14.74%)
0.5 TRUE 10965 (21.35%) 728 (23.63%) 2430 (23.13%) 9939 (21.06%)
1.0 FALSE 25987 (14.86%) 1663 (19.42%) 5615 (17.68%) 23689 (14.39%)
1.0 TRUE 11288 (21.14%) 817 (25.46%) 2498 (24.38%) 10159 (20.74%)
1.5 FALSE 26045 (14.54%) 1619 (16.80%) 5496 (16.67%) 23830 (14.26%)
1.5 TRUE 11255 (21.60%) 694 (26.95%) 2647 (24.25%) 10250 (21.00%)
2.0 FALSE 26198 (14.96%) 1659 (18.75%) 5624 (18.31%) 23900 (14.59%)
2.0 TRUE 11125 (21.17%) 781 (25.99%) 2596 (24.27%) 10010 (20.74%)
159634 (16.93%) 14117 (20.44%) 42827 (19.49%) 140746 (16.45%)

Analysis

A simple analysis of the totals would indicate that 0.1em is the best setting - which is odd, since it was the worst-performing and the first variant to be deleted, so how could it be the best? The graph of traffic suggests that, as before, the final totals are confounded by time-varying changes in conversion rates plus dropping variants; that is, 0.1em probably only looks good because after it was dropped, a bunch of Hacker News traffic hit and happened to convert at lower rates, making the surviving variants look bad. One might hope that all of that effect would be captured by the Old covariate, as HN traffic gets recorded as new visitors, but that would be too much to hope for. So instead, I add a dummy variable for each of the 3 separate time-periods, which will absorb some of this heterogeneity and make the effect of the indentation choices clearer.

rates <- read.csv(stdin(),header=TRUE)
Indent,Old,Month,N,Rate
0.1,FALSE,July,1551,0.1812
0.1,TRUE,July,673,0.2169
0,FALSE,July,1621,0.1727
0,TRUE,July,749,0.2777
0.5,FALSE,July,1562,0.1889
0.5,TRUE,July,728,0.2363
1.0,FALSE,July,1663,0.1942
1.0,TRUE,July,817,0.2546
1.5,FALSE,July,1619,0.1680
1.5,TRUE,July,694,0.2695
2.0,FALSE,July,1659,0.1875
2.0,TRUE,July,781,0.2599
0.1,FALSE,August,1552,0.1811
0.1,TRUE,August,706,0.2167
0,FALSE,August,5419,0.1670
0,TRUE,August,2684,0.2362
0.5,FALSE,August,5560,0.1786
0.5,TRUE,August,2430,0.2313
1.0,FALSE,August,5615,0.1768
1.0,TRUE,August,2498,0.2438
1.5,FALSE,August,5496,0.1667
1.5,TRUE,August,2647,0.2425
2.0,FALSE,August,5624,0.1831
2.0,TRUE,August,2596,0.2427
0.1,FALSE,November,0,0.000
0.1,TRUE,November,6,0.000
0,FALSE,November,3179,0.1655
0,TRUE,November,1637,0.2101
0.5,FALSE,November,24147,0.1474
0.5,TRUE,November,9939,0.2106
1.0,FALSE,November,23689,0.1439
1.0,TRUE,November,10159,0.2074
1.5,FALSE,November,23830,0.1426
1.5,TRUE,November,10250,0.2100
2.0,FALSE,November,23900,0.1459
2.0,TRUE,November,10010,0.2074


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <-rates$N - rates$Successes
g <- glm(cbind(Successes,Failures) ~ as.factor(Indent) + Old + Month, data=rates, family="binomial"); summary(g)
# ...Coefficients:
#                         Estimate  Std. Error   z value   Pr(>|z|)
# (Intercept)          -1.55640959  0.02238959 -69.51487 < 2.22e-16
# as.factor(Indent)0.1 -0.05726851  0.04400363  -1.30145  0.1931046
# as.factor(Indent)0.5  0.00249949  0.02503877   0.09982  0.9204833
# as.factor(Indent)1   -0.00877850  0.02502047  -0.35085  0.7256988
# as.factor(Indent)1.5 -0.02435198  0.02505726  -0.97185  0.3311235
# as.factor(Indent)2    0.00271475  0.02498665   0.10865  0.9134817
# OldTRUE               0.42448061  0.01238799  34.26549 < 2.22e-16
# MonthJuly             0.06606325  0.02459961   2.68554  0.0072413
# MonthNovember        -0.20156678  0.01483356 -13.58857 < 2.22e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 1496.6865  on 34  degrees of freedom
# Residual deviance:   41.1407  on 26  degrees of freedom
# AIC: 331.8303

There's definitely temporal heterogeneity, given the statistical-significance of the time-period dummies, so that is good to know. But the estimated effects for each indentation variant are derisorily small (despite having spent n = 159634), suggesting readers don't care at all. Since I have no opinion on the matter, I suppose I'll go with the highest point-estimate, 2em.
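
How small? Converting the extremes into new-visitor rates for the August baseline cell (sketch):

plogis(-1.55641)             # 0em indent: ~0.174
plogis(-1.55641 + 0.00271)   # 2em indent: ~0.175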

Moving sidebar’s metadata into page

Look­ing at the side­bar some more, it occurred to me that the side­bar was serv­ing 3 differ­ent pur­poses all mixed togeth­er:

  1. site-wide: nav­i­ga­tion to the main index/homepage, as well as meta-site pages like about me, the site, recent updates, and ways of get­ting RSS/email updates
  2. site-wide: dona­tion requests
  3. page-speci­fic: a page’s meta­data about when that page’s con­tent was first cre­at­ed, last mod­i­fied, con­tent tags, etc

The page meta­data is the odd man out, and I’ve noticed that a lot of peo­ple seem to not notice the page meta­data hid­ing in the side­bar (eg there will be com­ments won­der­ing when a page was cre­at­ed, when that’s listed clearly right there in the page’s side­bar). What if I moved the page meta­data to under­neath the big title? I’d have to change the for­mat­ting, since I can’t afford to spend 10+ ver­ti­cal lines of space the way it must be for­mat­ted in the side­bar, but the meta­data could fit in 2-5 lines if I com­bine the log­i­cal pairs (so instead of 4 lines for “cre­at­ed: / 2013-05-07 / mod­i­fied: / 2015-01-09”, just one line “cre­at­ed: 2013-05-07; mod­i­fied: 2015-01-09”).

There are several different possible layouts and levels of density, so I created 6 variants of increasing density.

Implementation

As an HTML rather than CSS change, the imple­men­ta­tion as an A/B test is more com­plex.

I define inline in the HTML template each of the 6 variants, as divs with IDs 'metadata1'-'metadata6'. In default.css, I set them to display: none so the user does not see 6 different metadata blocks taking up 2 screens of space. Then, each A/B variant passed to ABalytics toggles one version back on using display: block. I also include a 7th variant, in which none of the 6 is visible, which is effectively the control condition, roughly matching the status quo of showing the metadata in the sidebar. ("Roughly", since in the none condition there won't be metadata anywhere in the displayed page; but since the previous experiment indicated that removing elements from the sidebar didn't make any noticeable difference, I decided to simplify the HTML source code by removing the original metadata div entirely, to avoid any collisions or issues with the CSS/HTML I've defined.)

So the flow should be:

  1. page HTML loads, all 6 ver­sions may get ren­dered

  2. site-wide default CSS loads, and when inter­pret­ed, hides all 6 ver­sions

    (This also means that peo­ple brows­ing with­out Javascript enabled should still con­tinue to see a read­able ver­sion of the site.)

  3. page JS runs, picks 1 of the 7 variants to execute, and a CSS rule is interpreted to expose at most 1 version

  4. JS con­tin­ues to run, and fires (con­verts) if user remains on page long enough

The HTML changes:

--- a/static/templates/default.html
+++ b/static/templates/default.html
@@ -20,7 +20,7 @@
   <body>

-   <div class="sidebar_test_class1"></div>
+   <div class="metadata_test_class1"></div>

@@ -61,29 +59,6 @@
         </div>
         <hr/>
         </div>
-        <div id="metadata">
-          <div class="abstract"><em>$description$</em></div>
-          <br />
-          <div id="tags"><i>$tags$</i></div>
-          <br />
-          <div id="page-created">created:
-            <br />
-            <i>$created$</i></div>
-          <div id="last-modified">modified:
-            <br />
-            <i>$modified$</i></div>
-          <br />
-          <div id="version">status:
-            <br />
-            <i>$status$</i></div>
-          <br />
-          <div id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
-            <br />
-            <i>$belief$</i>
-          </div>
-          <hr/>
-        </div>
-
         <div id="donations">
           <div id="bitcoin-donation-address">
             <a href="https://en.wikipedia.org/wiki/Bitcoin">₿</a>: 1GWERNkwxeMsBheWgVWEc6NUXD8HkHTUXg
@@ -115,6 +90,102 @@
       </div>

       <div id="content">
+
+<div id="metadata1">
+  <span id="abstract"><em></em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created:
+    <br>
+    <i>$created$</i></span>
+  <br>
+  <span id="last-modified">modified:
+    <br>
+    <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata2">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>
+  <br>
+  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata3">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status:
+    <br>
+    <i>$status$</i></span>
+  <br>
+  <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a>
+    <br>
+    <i>$belief$</i>
+  </span>
+  <hr>
+</div>
+
+<div id="metadata4">
+  <span id="abstract"><em>$description$</em></span>
+  <br>
+  <span id="tags"><i>$tags$</i></span>
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status: <i>$status$</i></span>; <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
+<div id="metadata5">
+  <span id="abstract"><em>$description$</em></span> (<span id="tags"><i>$tags$</i></span>)
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>
+  <br>
+  <span id="version">status: <i>$status$</i></span>; <span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
+<div id="metadata6">
+  <span id="abstract"><em>$description$</em></span> (<span id="tags"><i>$tags$</i></span>)
+  <br>
+  <span id="page-created">created: <i>$created$</i></span>;  <span id="last-modified">modified: <i>$modified$</i></span>; <span id="version">status: <i>$status$</i></span>; <
span id="epistemological-status"><a href="/About#belief-tags" title="Explanation of 'belief' metadata">belief:</a> <i>$belief$</i></span>
+  <hr>
+</div>
+
         $body$
       </div>
     </div>
@@ -155,28 +226,32 @@
       ABalytics.init({
+       metadata_test: [
       {
-      name: "s1c1d1",
-      "sidebar_test_class1": "<style></style>"
+      name: "none",
+      "metadata_test_class1": "<style></style>"
+      },
+      {
+      name: "meta1",
+      "metadata_test_class1": "<style>div#metadata1 { display: block; }</style>"
       },
       {
-      name: "s1c1d0",
-      "sidebar_test_class1": "<style>div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta2",
+      "metadata_test_class1": "<style>div#metadata2 { display: block; }</style>"
       },
       {
-      name: "s1c0d1",
-      "sidebar_test_class1": "<style>div#cse-sitesearch {visibility:hidden; display:none;}</style>"
+      name: "meta3",
+      "metadata_test_class1": "<style>div#metadata3 { display: block; }</style>"
       },
       {
-      name: "s0c1d1",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}</style>"
+      name: "meta4",
+      "metadata_test_class1": "<style>div#metadata4 { display: block; }</style>"
       },
       {
-      name: "s0c1d0",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}; div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta5",
+      "metadata_test_class1": "<style>div#metadata5 { display: block; }</style>"
       },
       {
-      name: "s0c0d0",
-      "sidebar_test_class1": "<style>div#sidebar hr {visibility:hidden; display:none;}; div#cse-sitesearch {visibility:hidden; display:none;}; div#donations {visibility:hidden; display:none;}</style>"
+      name: "meta6",
+      "metadata_test_class1": "<style>div#metadata6 { display: block; }</style>"
       }
       ], /* }) */

The CSS changes:

--- a/static/css/default.css
+++ b/static/css/default.css
@@ -90,8 +90,12 @@ div#sidebar-news a {
    text-transform: uppercase;
 }

+/* metadata customization: */
 div#description { font-size: 95%; }
 div#tags, div#page-created, div#last-modified, div#license { font-size: 80%; }
+/* support A/B test by hiding by default all the HTML variants: */
+div#metadata1, div#metadata2, div#metadata3, div#metadata4, div#metadata5, div#metadata6 { display: none; }

Data

On 2015-02-05, the top variant (meta5) outperformed the bottom one (meta1, corresponding to my expectation that the taller variants would be worse than the most compact ones), so the worst was deleted. On 2015-02-08, the new top variant (meta6) outperformed the new bottom one (meta4), so I deleted meta4. On 2015-03-22, meta6 outperformed the ‘none’ variant. On 2015-05-25, the difference was not statistically-significant, but I decided to delete meta3 anyway. On 2015-07-02, I deleted meta2 similarly; given the ever-smaller differences between variants, it may be time to kill the experiment.

Totals, 2015-01-29–2015-07-27:

Metadata  Returning  N      Conversion rate
meta1     FALSE      835    0.1545
meta1     TRUE       364    0.2060
meta2     FALSE      37140  0.1532
meta2     TRUE       14063  0.2213
meta3     FALSE      26600  0.1538
meta3     TRUE       10045  0.2301
meta4     FALSE      1234   0.1669
meta4     TRUE       462    0.2186
meta5     FALSE      61646  0.1397
meta5     TRUE       20130  0.2109
meta6     FALSE      61608  0.1382
meta6     TRUE       19219  0.2243
none      FALSE      9227   0.1568
none      TRUE       3358   0.2225

Analysis

rates <- read.csv(stdin(),header=TRUE)
Metadata,Date,Old,N,Rate
meta1,"2015-02-06",FALSE, 832, 0.1538
meta1,"2015-02-06",TRUE, 356, 0.2051
meta2,"2015-02-06",FALSE, 1037, 0.1716
meta2,"2015-02-06",TRUE, 423, 0.2411
meta3,"2015-02-06",FALSE, 1010, 0.1604
meta3,"2015-02-06",TRUE, 431, 0.2204
meta4,"2015-02-06",FALSE, 1061, 0.1697
meta4,"2015-02-06",TRUE, 349, 0.2092
meta5,"2015-02-06",FALSE, 1018, 0.1798
meta5,"2015-02-06",TRUE, 382, 0.2749
meta6,"2015-02-06",FALSE, 1011, 0.1731
meta6,"2015-02-06",TRUE, 423, 0.2837
none ,"2015-02-06",FALSE, 1000, 0.1710
none ,"2015-02-06",TRUE, 434, 0.2074
meta1,"2015-02-09",TRUE, 8, 0.1250
meta2,"2015-02-09",FALSE, 921, 0.1238
meta2,"2015-02-09",TRUE, 248, 0.1895
meta3,"2015-02-09",FALSE, 861, 0.1440
meta3,"2015-02-09",TRUE, 262, 0.2137
meta4,"2015-02-09",FALSE, 189, 0.1429
meta4,"2015-02-09",TRUE, 92, 0.2500
meta5,"2015-02-09",FALSE, 889, 0.1327
meta5,"2015-02-09",TRUE, 304, 0.2401
meta6,"2015-02-09",FALSE, 845, 0.1219
meta6,"2015-02-09",TRUE, 274, 0.2336
none ,"2015-02-09",FALSE, 866, 0.1236
none ,"2015-02-09",TRUE, 236, 0.2288
meta1,"2015-03-23",FALSE, 635, 0.1496
meta1,"2015-03-23",TRUE, 277, 0.1841
meta2,"2015-03-23",FALSE, 9346, 0.1562
meta2,"2015-03-23",TRUE, 3545, 0.2305
meta3,"2015-03-23",FALSE, 9392, 0.1533
meta3,"2015-03-23",TRUE, 3627, 0.2412
meta4,"2015-03-23",FALSE, 1020, 0.1588
meta4,"2015-03-23",TRUE, 381, 0.2231
meta5,"2015-03-23",FALSE, 9359, 0.1631
meta5,"2015-03-23",TRUE, 3744, 0.2228
meta6,"2015-03-23",FALSE, 9532, 0.1600
meta6,"2015-03-23",TRUE, 3479, 0.2483
none ,"2015-03-23",FALSE, 8979, 0.1537
none ,"2015-03-23",TRUE, 3196, 0.2287
meta1,"2015-05-25",TRUE, 1, 0.000
meta2,"2015-05-25",FALSE, 21879, 0.1584
meta2,"2015-05-25",TRUE, 8131, 0.2285
meta3,"2015-05-25",FALSE, 22066, 0.1539
meta3,"2015-05-25",TRUE, 8288, 0.2300
meta5,"2015-05-25",FALSE, 21994, 0.1611
meta5,"2015-05-25",TRUE, 8629, 0.2187
meta6,"2015-05-25",FALSE, 22197, 0.1575
meta6,"2015-05-25",TRUE, 8114, 0.2328
none ,"2015-05-25",FALSE, 4987, 0.1562
none ,"2015-05-25",TRUE, 1721, 0.2342
meta2,"2015-07-02",FALSE, 11016, 0.1452
meta2,"2015-07-02",TRUE, 4291, 0.2123
meta3,"2015-07-02",FALSE, 208, 0.865
meta3,"2015-07-02",TRUE, 137, 0.1387
meta5,"2015-07-02",FALSE, 11336, 0.1451
meta5,"2015-07-02",TRUE, 4165, 0.2091
meta6,"2015-07-02",FALSE, 11051, 0.1397
meta6,"2015-07-02",TRUE, 3879, 0.2274
meta2,"2015-07-28",FALSE, 10299, 0.1448
meta2,"2015-07-28",TRUE, 4086, 0.2102
meta3,"2015-07-28",TRUE, 28, 0.1429
meta5,"2015-07-28",FALSE, 34976, 0.1250
meta5,"2015-07-28",TRUE, 9984, 0.1988
meta6,"2015-07-28",FALSE, 34830, 0.1242
meta6,"2015-07-28",TRUE, 9550, 0.2093


rates$Successes <- rates$N * rates$Rate
rates$Successes <- round(rates$Successes,0)
rates$Failures <- rates$N - rates$Successes
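## binomial GLM: conversion as a function of variant, visitor type (Old), their interaction, & date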
g <- glm(cbind(Successes,Failures) ~ Metadata * Old + Date, data=rates, family="binomial"); summary(g)
##                          Estimate  Std. Error   z value   Pr(>|z|)
## (Intercept)           -1.68585483  0.07376022 -22.85588 < 2.22e-16
## Metadatameta2          0.11289144  0.07557654   1.49374  0.1352445
## Metadatameta3          0.10270100  0.07602219   1.35093  0.1767164
## Metadatameta4          0.10061048  0.09241740   1.08865  0.2763069
## Metadatameta5          0.08577369  0.07542883   1.13715  0.2554767
## Metadatameta6          0.06413629  0.07543722   0.85019  0.3952171
## Metadatanone           0.06769859  0.07738865   0.87479  0.3816898
## OldTRUE                0.30223404  0.12339673   2.44929  0.0143139
## Date2015-02-09        -0.25042825  0.04531921  -5.52587 3.2785e-08
## Date2015-03-23        -0.07756390  0.02932304  -2.64515  0.0081654
## Date2015-05-25        -0.09191468  0.02904941  -3.16408  0.0015557
## Date2015-07-02        -0.16628108  0.03108431  -5.34936 8.8267e-08
## Date2015-07-28        -0.30091724  0.02988108 -10.07050 < 2.22e-16
## Metadatameta2:OldTRUE  0.15884370  0.12509633   1.26977  0.2041662
## Metadatameta3:OldTRUE  0.16917541  0.12606099   1.34201  0.1795920
## Metadatameta4:OldTRUE  0.08085814  0.15986591   0.50579  0.6130060
## Metadatameta5:OldTRUE  0.15772161  0.12470219   1.26479  0.2059480
## Metadatameta6:OldTRUE  0.26593031  0.12471587   2.13229  0.0329831
## Metadatanone :OldTRUE  0.18329569  0.12933518   1.41721  0.1564202
confint(g)
##                                2.5 %         97.5 %
## (Intercept)           -1.83279352769 -1.54352045422
## Metadatameta2         -0.03311333865  0.26327668480
## Metadatameta3         -0.04420468214  0.25393168209
## Metadatameta4         -0.07967057622  0.28275671162
## Metadatameta5         -0.05993245076  0.23587876421
## Metadatameta6         -0.08158679693  0.21425729726
## Metadatanone          -0.08197368374  0.22151789608
## OldTRUE                0.05847893596  0.54254577177
## Date2015-02-09        -0.33953084106 -0.16186556722
## Date2015-03-23        -0.13481890103 -0.01986901416
## Date2015-05-25        -0.14861372005 -0.03473644767
## Date2015-07-02        -0.22700745604 -0.10515380198
## Date2015-07-28        -0.35925991220 -0.24212277265
## Metadatameta2:OldTRUE -0.08484020193  0.40587579037
## Metadatameta3:OldTRUE -0.07642276867  0.41806754337
## Metadatameta4:OldTRUE -0.23209702844  0.39481052801
## Metadatameta5:OldTRUE -0.08518032081  0.40399365950
## Metadatameta6:OldTRUE  0.02300128120  0.51222878593
## Metadatanone :OldTRUE -0.06880343759  0.43849923508

A strange set of results. meta2 performs the best on new visitors, and worst on old visitors; while meta6 is the exact opposite. Because there are more new visitors than old visitors, meta2 is the best on average. Except I hate how meta2 looks and much prefer meta6. The confidence intervals are wide, though - it’s not clear that meta6 is definitely worse than meta2.
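
To make the “best on average” claim concrete, one can weight each variant’s new/returning conversion rates from the totals table above by the overall new-visitor fraction (~75% new); a quick check in R:

new <- 835+37140+26600+1234+61646+61608+9227   # new-visitor page-views
old <- 364+14063+10045+462+20130+19219+3358    # returning-visitor page-views
w <- new / (new+old)                           # ~0.746 new
c(meta2 = w*0.1532 + (1-w)*0.2213,
  meta6 = w*0.1382 + (1-w)*0.2243)
#  meta2  meta6
# 0.1705 0.1601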

Given my own preference, I will go with meta6.

CSE

A CSE (Custom Search Engine) is a Google search query but one specialized in various ways - somewhat like offering a user a form field which redirects to a Google search query like QUERY site:gwern.net/docs/, but more powerful, since you can specify thousands of URLs to blacklist and whitelist and use limited patterns. I have two: one is specialized for searching anime/manga news sites and makes writing Wikipedia articles much easier (since you can search for a particular anime title and the results will be mostly news and reviews which you can use in a WP article, rather than images, songs, memes, Amazon and commercial sites, blogs, etc); and the second is specialized to search Gwern.net, my Reddit, LessWrong, PredictionBook, Good Reads and some other sites, to make it easier to find something I may’ve written. The second I created to put in the sidebar and serve as a website search function. (I threw in the other sites because why not?)

Google provides HTML & JS for integrating a CSE somewhere, so creating & installing it was straightforward, and it went live 2013-05-24.

The problem is that the CSE search input takes up space in the sidebar, is more JS to run on each page load, and loads at least one other JS file as well. So on 2015-07-17, I took a look to evaluate whether it was worth keeping.

There had been 8974 searches since I installed it 785 days previously, or ~11.4 searches per day; at least 119 were searches for “e”, which I assume were user mistakes where they didn’t intend to search and probably annoyed them. (The next most popular searches are “Graeber”/26, “chunking”/22, and “nootropics”/10, with CSE refusing to provide any further queries due to low volume. This suggests a long tail of search queries - but also that they’re not very important, since it’s easy to find the DNB FAQ & my nootropics page, and it can hardly be useful if the top search is an error.)

To put these 8855 non-error searches in perspective: in that same exact time period, there were 891,790 unique users with 2,010,829 page views. So only 0.44% of page-views involve a use of the CSE, a ratio of 1:227. Is it net-beneficial to make 227 page-views incur the JS run & loading for the sake of 1 CSE search?

This might seem like a time to A/B test the presence/absence of the CSE div. (I can’t simply hide it using CSS like usual because it will still affect page loads.) Except consider the power issues: if that 1 CSE search converts, then to be profitable, it needs to damage the other 227 page-views’ conversion rate by <1/227. Or to put it the other way, the current conversion rate is ~17% of page-views and CSE search represents 0.44% of page-views, so if the CSE makes that one page-view 100% guaranteed to convert and otherwise converts normally, then over 1000 page-views, we have (0.0044 × 1000 × 1.0) + (0.9956 × 1000 × 0.17) ≈ 174 conversions vs 1000 × 0.17 = 170, or 17.4% vs 17.0%.
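
As a quick sanity check, the same best-case arithmetic in R:

p.cse  <- 0.0044   # fraction of page-views which are CSE searches
p.base <- 0.17     # baseline conversion rate
p.cse*1 + (1-p.cse)*p.base
# [1] 0.173652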

power.prop.test(p1=0.174, p2=0.170, power=0.80, sig.level=0.05)
#     Two-sample comparison of proportions power calculation
#              n = 139724.5781
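#              (n is per group, so ~279,449 page-views total across both arms)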

Even with the most optimistic possible assumptions (perfect conversion, no negative effect), it takes 279,449 page-views to get decent power. This is ridiculous from a cost-benefit perspective, and worse given that my priors are against it due to the extra JS & CSS it entails.

So I simply removed it. It was a bit of an experiment, and <8.9k searches does not seem worth it.

Deep reinforcement learning

A/B testing variants one at a time is fine as far as it goes, but it has several drawbacks that have become apparent:

  1. fixed trials, compared to sequential or adaptive trial approaches, waste data/page-views. Looking back, it’s clear that many of these trials didn’t need to run so long.
  2. they are costly to set up, both because of the details of a static site doing A/B tests but also because it requires me to define each change, code it up, collect, and analyze the results all by hand.
  3. they are not amenable to testing complicated models or relationships, since factorial designs suffer combinatorial explosion.
  4. they will test only the interventions the experimenter thinks of, which may be a tiny handful of possibilities out of a wide space of possible interventions (this is related to the cost: I won’t test anything that isn’t interesting, controversial, or potentially valuable, because it’s far too much of a hassle to implement/collect/analyze)

The topic of sequential trials leads naturally to multi-armed bandits (MAB), which can be seen as a generalization of regular experimenting which naturally reallocates samples across branches as the posterior probabilities change, in a way which minimizes how many page-views go to bad variants. It’s hard to see how to implement MABs on a static site, so this would probably motivate a shift to a dynamic site, at least to the extent that the server will tweak the served static content based on the current MAB.

MABs work for the current use case of specifying a small number of variants (eg. <20) and finding the best one. Depending on implementation details, they could also make it easy to run factorial trials checking for interactions among those variants, resolving another objection.

They’re still expensive to set up, since one still has to come up with concrete variants to pit against each other, but if it’s now a dynamic server, it can at least handle the analysis automatically.
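
To illustrate the reallocation mechanism, here is a minimal Thompson-sampling sketch in R for a Bernoulli ‘conversion’ bandit; the 3 variants and their true rates are invented for the example, and this is just the mechanism, not a production implementation:

thompson <- function(true.rates, n=10000) {
    k <- length(true.rates)
    alpha <- rep(1, k); beta <- rep(1, k)  # Beta(1,1) prior on each variant
    pulls <- rep(0, k)
    for (i in 1:n) {
        theta <- rbeta(k, alpha, beta)     # draw a plausible rate for each variant
        arm <- which.max(theta)            # serve the variant with the highest draw
        reward <- rbinom(1, 1, true.rates[arm])
        alpha[arm] <- alpha[arm] + reward
        beta[arm]  <- beta[arm]  + (1 - reward)
        pulls[arm] <- pulls[arm] + 1 }
    pulls }
thompson(c(0.15, 0.16, 0.17))
# most page-views wind up allocated to the best (third) variant, eg.:
# [1]  601 1387 8012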

MABs themselves are a special case of reinforcement learning (RL), which is a family of approaches to exploring complicated systems to maximize a reward at (hopefully) minimum data cost. Optimizing a website fits naturally into an RL mold: all the possible CSS and HTML variants form a very complicated system, which we are trying to explore as cheaply as possible while maximizing the reward of visitors spending more time reading each page.

To solve the expressivity problem, one could try to equip the RLer with a lot of power over the CSS: parse it into an abstract syntax tree (AST), so instead of specifying by hand ‘100%’ vs ‘105%’ in a CSS declaration like div#sidebar-news a { font-size: 105%; }, the RLer sees a node in the AST like (font-size [Real ~ dnorm(100,20)]) and tries out numbers around 100% to see what yields higher conversion rates. Of course, this yields an enormous number of possibilities and my website traffic is not equally enormous. Informative priors on each node would help if one was using a Bayesian MAB to do the optimization, but a Bayesian model might be too weak to detect many effects. (You can’t easily put in interactions between every node of the AST, after all.)
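
A toy version of that node in R, sampling the dnorm(100,20) prior to emit a candidate declaration (ignoring the need to bound or discretize the samples):

sample.fontsize <- function() round(rnorm(1, mean=100, sd=20))
sprintf("div#sidebar-news a { font-size: %d%%; }", sample.fontsize())
# [1] "div#sidebar-news a { font-size: 93%; }" (for example)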

In a challenging problem like this, deep neural networks come to mind, yielding a deep reinforcement learner - such a system made a splash in 2013-2015 in learning to play dozens of Atari games (DQN). The deep network handles interpretation of the input, and the RLer handles policy and optimization.

So the loop would go something like this:

  1. a web browser requests a page
  2. the server asks the RL for CSS to include
  3. the RL generates a best guess at optimal CSS, taking the CSS AST skeleton and returning the defaults, with some fields/parameters randomized for exploration purposes (possibly selected to maximize information gain)
  4. the CSS is transcluded into the HTML page, and sent to the web browser
  5. JS analytics in the HTML page report back how long the user spent on that page and details like their country, web browser, etc, which predict time on page (explaining variance, making it easier to see effects)
  6. this time-on-page constitutes the reward which is fed into the RL and updates it
  7. return to waiting for a request

Learning can be sped up by data augmentation or local training: the developer can browse pages locally and, based on whether they look horrible or not, insert pseudo-data. (If one variant looks bad, it can be immediately heavily penalized by adding, say, 100 page-views of that variant with low rewards.) Once previews have stabilized on not-too-terrible-looking, it can be run on live users; the developer’s preferences may introduce some bias compared to the general Internet population, but the developer won’t be too different and this will kill off many of the worst variants. As well, historical information can be inserted as pseudo-data: if the current CSS file has 17% conversion over 1 million page views, one can simulate 1m page views to that CSS variant’s considerable credit.
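
With a Bayesian MAB, both kinds of pseudo-data reduce to pseudo-counts; a minimal sketch in R, assuming Beta priors on conversion rates:

## historical data: the current CSS converted 17% of 1m page-views, so it
## starts at Beta(170000, 830000); an ugly new variant is penalized up front
## with 100 pseudo-page-views of failure:
incumbent <- c(a=0.17*1e6, b=0.83*1e6)
ugly      <- c(a=1,        b=1+100)
incumbent["a"] / sum(incumbent)   # posterior mean: 0.17
ugly["a"]      / sum(ugly)        # posterior mean: ~0.0098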

Parsing CSS into an AST seems difficult, and it is still limited in that it will only ever tweak existing CSS fields.

How to offer more power and expressivity to the RLer without giving it so much freedom that it will hang itself with gibberish CSS before ever finding working CSS, never mind improvements?

A powerful AI tool which could generate CSS on its own is the recurrent neural network (RNN): a NN which generates some output which gets fed back in until a long sequence has been emitted. (They usually also have some special support for storing ‘memories’ over multiple recursive applications, using LSTM units.) RNNs are famous for mimicking text and other sequential material; in one demo, Karpathy’s ‘The Unreasonable Effectiveness of Recurrent Neural Networks’, he trained a RNN on a Wikipedia dump in XML format and a LaTeX math book (both replicating the syntax quite well) and, more relevantly, 474MB of C source code & headers, where the RNN does a credible job of emitting pseudo-C code which looks convincing and is even mostly syntactically-correct in balancing parentheses & brackets, which more familiar Markov-chain approaches would have trouble managing. (Of course, the pseudo-C doesn’t do anything, but that RNN was never asked to make it do something, either.) In ‘Learning to Execute’, the authors trained an RNN on Python source code and it was able to ‘execute’ very simple Python programs and predict the output; this is perhaps not too surprising given the earlier sequence-to-sequence work and neural approaches to solving the Traveling Salesman Problem (TSP). So RNNs are powerful and have already shown promise in learning how to write simple programs.

This suggests the use of an RNN inside an RLer for generating CSS files. Train the RNN on a few hundred megabytes of CSS files (there are millions online, no shortage there), which teaches the RNN about the full range of possible CSS expressions, then plug it into step 3 of the above website optimization algorithm and begin training it to emit useful CSS. For additional learning, the output can be judged using an oracle (a CSS validator like the W3C CSS Validation Service/w3c-markup-validator package, or possibly CSSTidy), with the error or reward based on how many validation errors there are. The pretraining provides extremely strong priors about what CSS should look like, so mostly syntactically-valid CSS will be used without the constraint of operating on a rigid AST; the RL then optimizes particular steps, and providing the original CSS with a high reward prevents it from straying too far from a known good design.

Can we go further? Perhaps. In the Atari RL paper, the NN was specifically a convolutional neural network (CNN), used almost universally in image classification tasks; the CNN was in charge of understanding the pixel output so it could be manipulated by the RL. The RNN would have considerable understanding of CSS on a textual level, but it wouldn’t easily be able to understand how one CSS declaration changes the appearance of the webpage. A CNN, on the other hand, can look at a page+CSS as rendered by a web browser, and ‘see’ what it looks like; possibly it could learn that ‘messy’ layouts are bad, that fonts shouldn’t be made ‘too big’, that blocks shouldn’t overlap, etc. The RNN generates CSS, the CSS is rendered in a web browser, the rendering is looked at by a CNN… and then what? I’m not sure how to make use of a generative approach here. Something to think about.

Recurrent Q-learning:

  • Lin & Mitchell 1992, “Memory approaches to reinforcement learning in non-Markovian domains”
  • Meeden, McGraw & Blank 1993, “Emergent control and planning in an autonomous vehicle”
  • Schmidhuber 1991b, “Reinforcement learning in Markovian and non-Markovian environments”
  • http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/

Training a neural net to generate CSS

It would be nifty if I could set up a NN to generate and optimize the CSS on Gwern.net so I don’t have to learn CSS & devise tests myself; as a first step towards this, I wanted to see how well a recurrent neural network (RNN) could generate CSS after being trained on CSS. (If it can’t do a good job mimicking the ‘average’ syntax/appearance of CSS based on a large CSS corpus, then it’s unlikely it can learn more useful things like generating usable CSS given a particular HTML file, or the ultimate goal - learning to generate optimal CSS given HTML files and user reactions.)

char-rnn

Fortunately, Karpathy has already written an easy-to-use tool, char-rnn, which has already been shown to work well on a variety of text corpora (such as the Wikipedia, LaTeX, & C samples mentioned earlier). (I was particularly amused by the LaTeX/math textbook, which yielded a compiling and even good-looking document after Karpathy fixed some errors in it; if the RNN had been trained against compile errors/warnings as well, perhaps it would not have needed any fixing at all…?)

char-rnn relies on the Torch NN framework & NVIDIA’s CUDA GPU framework (Ubuntu installation guide/download).

Torch is fairly easy to install (cheat sheet):

cd ~/src/
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
git clone https://github.com/torch/distro.git ./torch --recursive
cd ./torch; ./install.sh
export PATH=$HOME/src/torch/install/bin:$PATH
## fire up the REPL to check:
th

Then char-rnn is likewise easy to get running and to try out on a simple example:

luarocks install nngraph
luarocks install optim
# luarocks install cutorch && luarocks install cunn ## 'cutorch' & 'cunn' need working CUDA
git clone 'https://github.com/karpathy/char-rnn.git'
cd ./char-rnn/
th train.lua -data_dir data/tinyshakespeare/ -gpuid 0 -rnn_size 512 -num_layers 2 -dropout 0.5
# package cunn not found!
# package cutorch not found!
# If cutorch and cunn are installed, your CUDA toolkit may be improperly configured.
# Check your CUDA toolkit installation, rebuild cutorch and cunn, and try again.
# Falling back on CPU mode
# loading data files...
# cutting off end of data so that the batches/sequences divide evenly
# reshaping tensor...
# data load done. Number of data batches in train: 423, val: 23, test: 0
# vocab size: 65
# creating an lstm with 2 layers
# number of parameters in the model: 3320385
# cloning rnn
# cloning criterion
# 1/21150 (epoch 0.002), train_loss = 4.19087871, grad/param norm = 2.1744e-01, time/batch = 4.98s
# 2/21150 (epoch 0.005), train_loss = 4.99026574, grad/param norm = 1.8453e+00, time/batch = 3.13s
# 3/21150 (epoch 0.007), train_loss = 4.29807770, grad/param norm = 5.6664e-01, time/batch = 4.30s
# 4/21150 (epoch 0.009), train_loss = 3.78911860, grad/param norm = 3.1319e-01, time/batch = 3.87s
# ...

Unfortunately, even on my i7 CPU, training is quite slow: ~3s a batch on the Tiny Shakespeare example. The important parameter here is train_loss; after some experimenting, I found that >3 = the output is total garbage, 1-2 = lousy, <1 = good, and <0.8 = very good.

With Tiny Shakespeare, the loss drops quickly at first, getting <4 within seconds and into the 2s within 20 minutes, but then the 1s take a long time to surpass, and <1 even longer (hours of waiting).

GPU vs CPU

This is a toy dataset and suggests that for a real dataset I’d be waiting weeks or months. GPU acceleration is critical. I spent several days trying to get Nvidia’s CUDA to work, even signing up as a developer & using the unreleased version 7.5 preview of CUDA, but it seems that when they say Ubuntu 14.04 and not 15.04 (the latter is what I have installed), they are quite serious: everything I tried yielded bloodcurdling ATA hard drive errors (!) upon boot, followed by a hard freeze the instant X began to run. This made me unhappy, since my old laptop began dying in late July 2015 and I had purchased my Acer Aspire V17 Nitro Black Edition VN7-791G-792A laptop with the express goal of using its NVIDIA GeForce GTX 960M for deep learning. But at the moment I am out of ideas for how to get CUDA working aside from either reinstalling to downgrade to Ubuntu 14.04 or simply waiting for version 8 of CUDA, which will hopefully support the latest Ubuntu. (Debian is not an option because on Debian Stretch, I could not even get the GPU driver to work, much less CUDA.)

Frustrated, I finally gave up and went the easy way: Torch provides an Amazon OS image preconfigured with Torch, CUDA, and other relevant libraries for deep learning.

EC2

The Torch AMI can be immediately launched if you have an AWS account. (I assume you have signed up, have a valid credit card, IP permission accesses set to allow you to connect to your VM at all, and a SSH public key set up so you can log in.) The two GPU instances seem to have the same number and kind of GPUs (1 NVIDIA GPU each) and differ mostly in RAM & CPUs, neither of which are the bottleneck here, so I picked the smaller/cheaper “g2.2xlarge” type. (“Cheaper” here is relative; “g2.2xlarge” still costs $0.65/hr, and when I looked at spot prices that day, ~$0.21.)

Once started, you can SSH in using your registered public key like any other EC2 instance. The default username for this image is “ubuntu”, so:

ssh -i /home/gwern/.ssh/EST.pem ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com

Once in, we set up the $PATH to find the Torch installation like before (I’m not sure why Torch’s image doesn’t already have this done) and grab a copy of char-rnn to run Tiny Shakespeare:

export PATH=$HOME/torch/install/bin:$PATH
git clone 'https://github.com/karpathy/char-rnn'
# etc

Per-batch, this yields a 20x speedup on Tiny Shakespeare compared to my laptop’s CPU, running each batch in ~0.2s.

Now we can begin working on what we care about.

CSS

First, to generate a decent-sized CSS corpus: between all the HTML documentation installed by Ubuntu and my local archives, I have something like 1GB of CSS hanging around my drive. Let’s grab 20MB of it (enough to not take forever to train on, but not so little as to be trivial):

cd ~/src/char-rnn/
mkdir ./data/css/
find / -type f -name "*.css" -exec cat {} \; | head --bytes=20MB >> ./data/css/input.txt
## https://www.dropbox.com/s/mvqo8vg5gr9wp21/rnn-css-20mb.txt.xz
wc --chars ./data/css/input.txt
# 19,999,924 ./data/css/input.txt
scp -i ~/.ssh/EST.pem -C data/css/input.txt ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/

With 19.999M characters, our RNN can afford only <20M parameters; how big can I go with -rnn_size and -num_layers? (Which, as they sound like, specify the size of each layer and the number of layers.) The full set of char-rnn training options:

  -data_dir                  data directory. Should contain the file input.txt with input data [data/tinyshakespeare]
  -rnn_size                  size of LSTM internal state [128]
  -num_layers                number of layers in the LSTM [2]
  -model                     LSTM, GRU or RNN [LSTM]
  -learning_rate             learning rate [0.002]
  -learning_rate_decay       learning rate decay [0.97]
  -learning_rate_decay_after in number of epochs, when to start decaying the learning rate [10]
  -decay_rate                decay rate for RMSprop [0.95]
  -dropout                   dropout for regularization, used after each RNN hidden layer. 0 = no dropout [0]
  -seq_length                number of timesteps to unroll for [50]
  -batch_size                number of sequences to train on in parallel [50]
  -max_epochs                number of full passes through the training data [50]
  -grad_clip                 clip gradients at this value [5]
  -train_frac                fraction of data that goes into train set [0.95]
  -val_frac                  fraction of data that goes into validation set [0.05]
  -init_from                 initialize network parameters from checkpoint at this path []
  -seed                      torch manual random number generator seed [123]
  -print_every               how many steps/minibatches between printing out the loss [1]
  -eval_val_every            every how many iterations should we evaluate on validation data? [1000]
  -checkpoint_dir            output directory where checkpoints get written [cv]
  -savefile                  filename to autosave the checkpoint to. Will be inside checkpoint_dir/ [lstm]
  -gpuid                     which GPU to use. -1 = use CPU [0]
  -opencl                    use OpenCL (instead of CUDA) [0]

Large RNN

Some playing around suggests that the upper limit is 950 neurons and 3 layers, yielding a total of 18,652,422 parameters. (I originally went with 4 layers, but with that many layers, RNNs seem to train very slowly.) Some other settings, to give an idea of how the parameter count increases:

  • 512/4: 8,012,032
  • 950/3: 18,652,422
  • 1000/3: 20,634,122
  • 1024/3: 21,620,858
  • 1024/4: 30,703,872
  • 1024/5: 39,100,672
  • 1024/6: 47,497,472
  • 1800/4: 93,081,856
  • 2048/4: 120,127,744
  • 2048/5: 153,698,560
  • 2048/6: 187,269,376

If we really wanted to stress the EC2 image’s hardware, we could go as large as this:

th train.lua -data_dir data/css/ -rnn_size 1306 -num_layers 4 -dropout 0.5 -eval_val_every 1

This turns out to not be a good idea, since it will take forever to train - eg. after ~70m of training, still at a train-loss of 3.7! I suspect some of the hyperparameters may be important - the level of dropout doesn’t seem to matter much, but more than 3 layers seems to be unnecessary and slow if there are a lot of neurons to store state (perhaps because RNNs are said to ‘unroll’ computations over each character/time-step instead of being forced to do all their computation in a single deep network with >4 layers?) - but with the EC2 clock ticking and my own impatience, there’s no time to try a few dozen random sets of hyperparameters to see which achieves the best validation scores.

Undeterred, I decided to upload all the CSS (using the sort-key trick to reduce the archive size):

find / -type f -name "*.css" | rev | sort | rev | tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout > ~/src/char-rnn/data/css/all.tar.xz
cd ~/src/char-rnn/ && scp -C data/css/all.tar.xz ubuntu@ec2-54-164-237-156.compute-1.amazonaws.com:/home/ubuntu/char-rnn/data/css/
unxz data/css/all.tar.xz
## non-ASCII input seems to cause problems, so delete anything not ASCII:
## https://disqus.com/home/discussion/karpathyblog/the_unreasonable_effectiveness_of_recurrent_neural_networks_66/#comment-2042588381
## https://github.com/karpathy/char-rnn/issues/51
tar xf data/css/all.tar --to-stdout | iconv -c -tascii > data/css/input.txt
wc --chars data/css/input.txt
# 1,126,949,128 data/css/input.txt

Unsurprisingly, this did not solve the problem, and with 1GB of data, even 1 pass over the data (1 epoch) would likely take weeks. Additional problems included -val_frac’s default of 0.05 and -eval_val_every’s default of 1000: 0.05 of 1GB is 50MB, which means every time char-rnn checked on the validation set, it took ages; and since it only wrote a checkpoint out every 1000 iterations, hours would pass in between checkpoints. 1MB, or 0.001, is a more feasible validation data size; and checking every 100 iterations strikes a reasonable balance between being able to run the latest & greatest and spending as much GPU time on training as possible.

Small RNN

So I backed off to the 20MB sample and a smaller 3-layer RNN, training it overnight, and was startled to see what happened:

th train.lua -print_every 5 -data_dir data/css/ -savefile css -eval_val_every 10000 -val_frac 0.001 -rnn_size 1700 -num_layers 3 -dropout 0.8
# ...
# 20000/21408250 (epoch 0.047), train_loss = 0.84076253, grad/param norm = 3.4693e-03, time/batch = 3.59s
# loss is exploding, aborting.

Specifically, the loss on the validation set had exploded to 333.2351 (!). When I looked at samples from the checkpointed copy, it performed both well and poorly. th sample.lua cv/lm_css_epoch0.05_333.2351.t7 yielded:

...
.nest-comments .mb_comments p,
.pomenefister .views-content .add-post,
.pleuse-product-nav .dropdown-pane .date .intro,
.product-nav .shortlist.tooltip_pane .bread-slide,
.module-category .vote-content td {
  font-size: 9px;
  text-align: center;
}
.teotspaces-left h2,
.footer .social-icons .mepubititab-left ul {
  margin: 0;
}
.mlissbox-box .post-content {
  margin-top: 10px;
}
.secondary .tooltip-arrow .scegee-category h6,
.no-list .comment-footer .module-title,
.featured-left-column .lightbox-description h0,
.popup-tomimate .social-rover .comment-arrow {
  list-style: none;
}
.tooltip-post .secondary .module-menu-houder .thumbnail {
  float: left;
}
.module-content .lightbox-list {
  display: inline-block;
  float: none;
  width: 50%;
  position: relative;
  background-position: 0 1px;
  width: 30px;
  margin-top: 5px;
  text-align: center;
  margin-right: 10px;
}
.logo-events .show-luset .box-content li {
  width: 30px;
  -webkit-box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.3);
}
.search-results .section-featured .content-top-section-left {
  margin-top: -6px;
}
.secondary .module-mones-grey .down-mod-page {
  margin-bottom: 20px;
}
.comment-top-content .meviat-alignright {
  width: 350px;
}
.module-contributor h2.comment-hold-homicate.sptbed_postnames{
  display: inline-block;
  padding-top: -24px;
}
.secondary .module-content .logo div {
  float: left;
}
.medium-block-grid-1 > li:hover {
  border-bottom: none #5c466d;
}
.row-image-container .medium-topic-browser {
  clear: right;
}
.rooc-box-sprite .modal-links .list-group li {
  display: inline-block;
  float: left;
  padding-top: 8px;
}

.vead-video-list {
  display: block;
  margin-right: 13px;
}
#no-touch .main-tabs-new-content .widget-top-content {
  color: #58128c;
  display: block;
  padding: 8px 0;
  border-color: transparent;
  border-bottom: 1px solid #fff;
  padding: 5px 12px;
  text-align: center;
}

Aside from the Unicode junk at the beginning, the output actually looks tremendously like CSS! The brackets are matched, the selectors look like selectors, and the fields are properly typed (pixels go into pixel fields, colors go into color fields, etc). If I validate the non-junk CSS part, the validator remarkably yields only 1 error, at line 52/.module-contributor h2.comment-hold-homicate.sptbed_postnames, where it notes that “Value Error: padding-top -24px negative values are not allowed”. Considering it didn’t even finish 1 epoch, the mimicking is almost uncanny: it nails the various aspects like RGB color notation (both hex & rgba()), matching brackets, plausible-sounding identifiers (eg. .scegee-category), etc. If I were shown this without any corresponding HTML, I would not easily be able to tell it’s all gibberish.

Chastened by the exploding-error problem and the mostly-wasted ~26 hours of processing (7:30PM - 9:30PM / $15.6), I tried a smaller yet RNN (500/2), run from 5PM-11AM (so the total bill for all instances, including various playing around, restarting, generating samples, downloading to laptop etc: $25.58).

Data URI problem

One flaw in the RNN I stumbled across but was unable to reproduce was that it seemed to have a problem with data URIs. A data URI is a special kind of URL which is its own content, letting one write small files inline and avoiding the need for a separate file; for example, the following CSS fragment would yield a PNG image without the user’s browser making additional network requests or the developer needing to create a tiny file just for an icon or something:

.class {
    content: url('data:image/png;base64,iVBORw0KGgoAA \
        AANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD///+l2Z/dAAAAM0l \
        EQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDDeNGe4Ug9C9zwz3gVLMDA/A6 \
        P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC')
            }

So it’s a standard prefix like data:image/png;base64, followed by an indefinitely long string of ASCII gibberish which is a textual encoding of the underlying binary data. The RNN sometimes starts a data URI and generates the prefix, but then gets stuck continually producing hundreds or thousands of characters of ASCII gibberish without ever closing the data URI with a quote & parenthesis and getting back to writing regular CSS.

What’s going on there? Since PNG/JPG are compressed image formats, the binary encoding will be near-random and the base-64 encoding likewise near-random. The RNN can easily generate another character once it has started the base-64, but how does it know when to stop? (“I know how to spell banana, I just don’t know when to stop! BA NA NA NA…”) Possibly it has run into the limits of its ‘memory’: once it has started emitting base-64 and has reached a plausible length of at least a few score characters (few images can be encoded in less), it’s now too far away from the original CSS, and all it can see is base-64; so of course the maximal probability is an additional base-64 character…

This might be fixable by giving the RNN more neurons in the hope that with more memory it can break out of the base-64 trap, by training more (perhaps data URIs are too rare for it to have adequately learned them with the few epochs thus far), or by backpropagating error further in time/the sequence by increasing the size of the RNN in terms of unrolling (such as increasing -seq_length from 50); I thought improving the sampling strategy with beam search rather than greedy character-by-character generation would help, but it turns out beam search doesn’t fix it and can perform worse, getting trapped in an even deeper local minimum of repeating the character “A” endlessly. Or of course one could delete data URIs and other undesirable features from the corpus, in which case those problems will never come up; still, I would prefer the RNN to handle issues on its own and have as little domain knowledge engineered in as possible. I wonder if the data URI issue might be what killed the large RNN at the end? (My other hypothesis is that the sort-key trick accidentally led to a multi-megabyte set of repetitions of the same common CSS file, which caused the large RNN to overfit, and then once the training reached a new section of normal CSS, the large RNN began making extremely confident predictions of more repetition, which were wrong and would lead to very large losses, possibly triggering the exploding-error killer.)

Progress

This RNN progressed steadily over time, although by the end, performance on the held-out validation dataset seemed to be stagnating, as plotting the validation losses shows:

performance <- dget(textConnection("structure(list(Epoch = c(0.13, 0.26, 0.4, 0.53, 0.66, 0.79, 0.92,
1.06, 1.19, 1.32, 1.45, 1.58, 1.71, 1.85, 1.98, 2.11, 2.24, 2.37,
2.51, 2.64, 2.77, 2.9, 3.03, 3.17, 3.3, 3.43, 3.56, 3.69, 3.82,
3.96, 4.09, 4.22, 4.35, 4.48, 4.62, 4.75, 4.88, 5.01, 5.14, 5.28,
5.41, 5.54, 5.67, 5.8, 5.94, 6.07, 6.2, 6.33, 6.46, 6.59, 6.73,
6.86, 6.99, 7.12, 7.25, 7.39, 7.52, 7.65, 7.78, 7.91, 8.05, 8.18,
8.31, 8.44, 8.57, 8.7, 8.84, 8.97, 9.1, 9.23, 9.36, 9.5, 9.63,
9.76, 9.89, 10.02, 10.16, 10.29, 10.42, 10.55, 10.68, 10.82,
10.95, 11.08, 11.21, 11.34, 11.47, 11.61, 11.74, 11.87, 12, 12.13,
12.27, 12.4, 12.53, 12.66, 12.79, 12.93, 13.06, 13.19, 13.32,
13.45, 13.58, 13.72, 13.85, 13.98, 14.11, 14.24, 14.38, 14.51,
14.64, 14.77, 14.9, 15.04, 15.17, 15.3, 15.43, 15.56, 15.7, 15.83,
15.96, 16.09, 16.22, 16.35, 16.49, 16.62, 16.75, 16.88, 17.01,
17.15, 17.28, 17.41, 17.54, 17.67, 17.81, 17.94, 18.07, 18.2,
18.33, 18.46, 18.6, 18.73, 18.86, 18.99, 19.12, 19.26, 19.39,
19.52, 19.65, 19.78, 19.92, 20.05, 20.18, 20.31, 20.44, 20.58,
20.71, 20.84, 20.97, 21.1, 21.23, 21.37, 21.5, 21.63, 21.76,
21.89, 22.03, 22.16, 22.29, 22.42, 22.55, 22.69, 22.82, 22.95,
23.08, 23.21, 23.34, 23.48, 23.61, 23.74, 23.87, 24, 24.14, 24.27,
24.4, 24.53, 24.66, 24.8, 24.93, 25.06, 25.19, 25.32, 25.46,
25.59, 25.72), Validation.loss = c(1.4991, 1.339, 1.3006, 1.2896,
1.2843, 1.1884, 1.1825, 1.0279, 1.1091, 1.1157, 1.181, 1.1525,
1.1382, 1.0993, 0.9931, 1.0369, 1.0429, 1.071, 1.08, 1.1059,
1.0121, 1.0614, 0.9521, 1.0002, 1.0275, 1.0542, 1.0593, 1.0494,
0.9714, 0.9274, 0.9498, 0.9679, 0.9974, 1.0536, 1.0292, 1.028,
0.9872, 0.8833, 0.9679, 0.962, 0.9937, 1.0054, 1.0173, 0.9486,
0.9015, 0.8815, 0.932, 0.9781, 0.992, 1.0052, 0.981, 0.9269,
0.8523, 0.9251, 0.9228, 0.9838, 0.9807, 1.0066, 0.8873, 0.9604,
0.9155, 0.9242, 0.9259, 0.9656, 0.9892, 0.9715, 0.9742, 0.8606,
0.8482, 0.8879, 0.929, 0.9663, 0.9866, 0.9035, 0.9491, 0.8154,
0.8611, 0.9068, 0.9575, 0.9601, 0.9805, 0.9005, 0.8452, 0.8314,
0.8582, 0.892, 0.9186, 0.9551, 0.9508, 0.9074, 0.7957, 0.8634,
0.8884, 0.8953, 0.9163, 0.9307, 0.8527, 0.8522, 0.812, 0.858,
0.897, 0.9328, 0.9398, 0.9504, 0.8664, 0.821, 0.8441, 0.8832,
0.8891, 0.9422, 0.953, 0.8326, 0.871, 0.8024, 0.8369, 0.8541,
0.895, 0.8892, 0.9275, 0.8378, 0.8172, 0.8078, 0.8353, 0.8602,
0.8863, 0.9176, 0.9335, 0.8561, 0.7952, 0.8423, 0.8833, 0.9052,
0.9202, 0.9354, 0.8477, 0.8271, 0.8187, 0.8714, 0.8714, 0.9089,
0.903, 0.9225, 0.8583, 0.7903, 0.8016, 0.8432, 0.877, 0.8825,
0.9323, 0.8243, 0.8233, 0.7981, 0.8249, 0.826, 0.9109, 0.8875,
0.9265, 0.8239, 0.8026, 0.7934, 0.851, 0.8856, 0.9033, 0.9317,
0.8576, 0.8335, 0.7829, 0.8172, 0.8658, 0.8976, 0.8756, 0.9262,
0.8184, 0.792, 0.7826, 0.8244, 0.861, 0.9144, 0.9244, 0.9106,
0.8327, 0.766, 0.7988, 0.8378, 0.8606, 0.8831, 0.9032, 0.8113,
0.8138, 0.7747, 0.8027, 0.8197, 0.8684, 0.874, 0.912)), .Names = c('Epoch',
'Validation.loss'), class = 'data.frame', row.names = c(NA, -195L
))"))

library(ggplot2)
qplot(Epoch, Validation.loss, data=performance) + stat_smooth()
Loss of the CSS char-RNN during training

As the loss diminished to ~0.8-0.9, the sampled CSS output became even more realistic. At one point I was impressed to see that the RNN had learned to switch between “minified” and unminified CSS formatting. For example, above, the output is unminified, but the RNN at 0.88 sometimes writes minified (the following has been line-broken from a single line):

$ th sample.lua  cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 0px; }' -length 2000
div#sidebar { margin: 0px; }
#flashTopgip ul li h3 { clear: both; padding: 0; height:25px;
 background:url(/images/exporibox.png) no-repeat 0
 0;}.col_description{text-align:left!important;display:block;height:44px;top:-3px;left:68%;width:150px;}.front
 .content
 h3{display:inline-block;width:100%;position:fixed;position:absolute;left:0;}.date-repeat
 #right{list-style:none;}.rtl
 #block-agned-header{padding:10px;line-height:14px;width:auto;}#block-column-right{background:#63c;}.block-document_body
 #content,.activism-content-box .content,.rtl .archive-wide
 .button.input-rawignad{float:left;}.rtl .panel-pane .social-view table .lim
 h1,.page-news h3.pane-title{*zoom:2 !important;}.rtl
 .arow-right,#right-fields-img{display:none;}div.error{background-color:#ededed;}div.page-term
 span.recimsregoor_contemt #aconds,.input-admin-widget-fill div.inner
 .form-submit{display:block;margin-right:.1em;}#edit-activism-field-actionpoint{color:#8c0000;background:url(/sites/all/themes/zrw/sprites/hadd.png)
 no-repeat 3px 0px;calse:0}.login-form p{margin:4px 25px;}.rtl
 .note-ssTitle{margin:0 0 3px 0}ul.secondary
 .page,#node-region{background:url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png)
 no-repeat 12px 0 #016 !important;}#network-footer:active{}#rtl
 #newsletter-recitients-work_latest .center a{background-position:5px
 -154px;}#product-item{margin-bottom:10px;}.node-type-article .home
 .field-popup-widget-form{padding:20px 10px 10px 4px;text-align:right;}.rtl
 .view-filters,.rtl #comments-albumang_sprite{float:left;}.node-type-nodes
 .field-actionpoints-view-filters{padding:19px 28px 8px 0;}.rtl
 #multimedia-latest .field-body,.view-content
 div.field-view-layout{ulline-color:white;}.view-experts
 .views-field-title{padding:4px;text-align:center;}.node-description
 .views-exposed-form{overflow:visible;}#content .views-view-grid
 tr.format{padding-bottom:10px;background:#030000;}.view-forword-source
 .views-exposed-form #edit-submit{margin-right:0;}

This initially does not look impressive, but if we run it through an unminifier:

div#sidebar {
    margin: 0px;
}
#flashTopgip ul li h3 {
    clear: both;
    padding: 0;
    height: 25px;
    background: url(/images/exporibox.png) no-repeat 0 0;
}
.col_description {
    text-align: left!important;
    display: block;
    height: 44px;
    top: -3px;
    left: 68%;
    width: 150px;
}
.front .content h3 {
    display: inline-block;
    width: 100%;
    position: fixed;
    position: absolute;
    left: 0;
}
.date-repeat #right {
    list-style: none;
}
.rtl #block-agned-header {
    padding: 10px;
    line-height: 14px;
    width: auto;
}
#block-column-right {
    background: #63c;
}
.block-document_body #content,
.activism-content-box .content,
.rtl .archive-wide .button.input-rawignad {
    float: left;
}
.rtl .panel-pane .social-view table .lim h1,
.page-news h3.pane-title {
    *zoom: 2 !important;
}
.rtl .arow-right,
#right-fields-img {
    display: none;
}
div.error {
    background-color: #ededed;
}
div.page-term span.recimsregoor_contemt #aconds,
.input-admin-widget-fill div.inner .form-submit {
    display: block;
    margin-right: .1em;
}
#edit-activism-field-actionpoint {
    color: #8c0000;
    background: url(/sites/all/themes/zrw/sprites/hadd.png) no-repeat 3px 0px;
    calse: 0
}
.login-form p {
    margin: 4px 25px;
}
.rtl .note-ssTitle {
    margin: 0 0 3px 0
}
ul.secondary .page,
#node-region {
    background: url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) no-repeat 12px 0 #016 !important;
}
#network-footer:active {}#rtl #newsletter-recitients-work_latest .center a {
    background-position: 5px -154px;
}
#product-item {
    margin-bottom: 10px;
}
.node-type-article .home .field-popup-widget-form {
    padding: 20px 10px 10px 4px;
    text-align: right;
}
.rtl .view-filters,
.rtl #comments-albumang_sprite {
    float: left;
}
.node-type-nodes .field-actionpoints-view-filters {
    padding: 19px 28px 8px 0;
}
.rtl #multimedia-latest .field-body,
.view-content div.field-view-layout {
    ulline-color: white;
}
.view-experts .views-field-title {
    padding: 4px;
    text-align: center;
}
.node-description .views-exposed-form {
    overflow: visible;
}
#content .views-view-grid tr.format {
    padding-bottom: 10px;
    background: #030000;
}
.view-forword-source .views-exposed-form #edit-submit {
    margin-right: 0;
}

Now it’s readable, and we can see the RNN has done an excellent job of still writing CSS while in minified-mode, and around this level of loss, I noticed the RNN had learned to write valid-looking URLs - fragments like background: url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) look exactly like what a human CSS programmer would write. (Unfortunately, this sample has 4 validation errors: 1 from an imbalanced bracket; 1 parse error on *zoom: 2 !important due to the asterisk, which is an old IE hack & arguably the RNN isn’t wrong; and 2 properties which don’t exist. Also in the RNN’s favor, I should note that lots of CSS in the wild will not have 0 validation errors.)

At 0.88, I also noticed the RNN was now making a valiant attempt to write comments. Bad comments, but still:

/* ubuntu@ip-172-31-30-222:~/char-rnn$ th sample.lua  cv/lm_css_epoch6.07_0.8815.t7 -primetext 'div#sidebar { margin: 100px; }' -length 2000 -seed 1
using CUDA on GPU 0...
creating an lstm...
seeding with div#sidebar { margin: 100px; }
-------------------------- */
div#sidebar { margin: 100px; }
viv  .yeah-company:first-child, .news-row0 .colsetIcob img,
.content .content-number { background-position: 0 -340px; text-decoration: repeat-x; }
#content .rcper { display:none; display: block;
}

#coftelNotif .topUy { background: url('/assets/css/epwide-datetherator.png'); }
#leftCol span.scord img { background: url(/img/text/about_links.png) no-repeat 0 -1050px; }

div.subkit_snav_created, ul.up_tains li.active { width: 64% !important; }
.hdr_outer {text-align:center; }
  active, img {
        top: auto;
     margin-right: 20px;
        margin: 0 !important;
                    text-align: center;
            -webkit-box-shadow: #205575 1px 0 0 rgba(0,0,0,0.6) 1px 0px  px;
        box-shadow: 0 0 5px rgba(0,0,0,.5);
}

#ywip_section p.tab_promo,
#search_container #slideshow .page_inner #triabel_left {
    background: url(drop, sanc-email' }
simple{
    box-sizing: border-box;
}

span.naveptivionNav}
a.nav, pre,
html { */
    background-color: #8ccedc;
    background: #22a82c;
    float: left;
    color: #451515;
    border: 1px solid #701020;
    color: #0000ab;
    font-family: Arial, sans-serif;
    text-align: center;
    margin-bottom: 50px;
    line-height: 16px;
    height: 49px;
    padding: 15px 0 0 0;
    font-size: 15px;
    font-weight: bold;
    background-color: #cbd2eb;
}
a.widespacer2,
#jomList, #frq {
    margin: 0 0 0 0;
    padding: 10px -4px;
    background-color: #FFCFCF;
    border: 1px solid #CBD7DD;
    padding: 0 0 4px 12px;
    min-height: 178px;
}

.eventmenu-item, .navtonbar .article ul, .creditOd_Dectls {
    border-top: 1px #CCC gradsed 1px solid;
    font-size: 0.75em;
}

h2,
div.horingnav img {
    font-size: 5px;
}

body {
    margin: 0 0 5px 20px;
}
.n-cmenuamopicated,
.teasicOd-view td {
    border-top: 4px solid #606c98;
}

/* Rpp-fills*/

.ads{padding: 0 10px;}.statearch-header div.title img{display:table-call(}
fieldset legend span,
blockquote.inner ul {padding:0;}}

...

/* Ableft Title */

/* ========================================================  helper column parting if nofis calendar image Andy "Heading Georgia" */
.right_content {
  position: relative;
  width: 560px;
  height: 94px;
}

Ultimately, the best RNN achieved a loss of 0.7660 before I decided to shut it down because it wasn’t making much further progress.

Samples

It stalwartly continued to try to write comments, slightly approximating English (even though there is not that much English text in those 20MB - only 8.5k lines with /* in them; it’s CSS, not text). Examples of comments extracted from a large sample of 0.766’s output (fgrep '/*' best.txt):

*//* COpToMNINW BDFER
/*
.snc .footer li a.diprActy a:hover, #sciam table {/*height: 164px;*//*/* }
body.node-type-xplay-info #newsletter,body.node-type-update
#header{min-width:128px;height:153px;float:left;}#main-content
.newsletternav,#ntype-audio
.block-title{background:url(/sites/www.amnesty.org/modules/civicrm/print-widget.clu))
/*gray details */
/* Grid >> 1px 0 : k0004_0 */
/* corner */
/* ST LETTOTE/ CORCRE TICEm langs 7 us1 Q+S. Sap q i blask */
/*/*/
/* Side /**/
/* Loading Text version Links white to 10ths */
/*-modaty pse */
/**/div#sb-adrom{display:none !important;}
/*
/* `Grid >> Global
/* `Grid >> 16 Columns
/* `Grid >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Prefix Extra Space >> 12 Columns
/* `Prefix Extra Space >> 12 Columns
/* `Clear Floated Elements
/* `Prefix Extra Space >> 12 Columns
/* `Push Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
/* `Prefix Extra Space >> 16 Columns
/* `Suffix Extra Space >> 16 Columns
  /* IE7 inline-block hack */
/* T* */

Not too great, but still more than I expected. Still, the (unminified) CSS looks good:

div#sidebar { margin: 100px; }
.ep_summary_box_body { float: left; width: 550px; }
.dark_search span { margin-right: 5px; }
h1.highlight_column { text-align: right; display: block; font-size: 18px; }
h3 {
        font-weight: bold;
        font-size: 12px;
}
col.teas h2 {
        clear: both;
        width: 100%;
        z-index: 190;
        action: !important;
}
#full_content .fancybox.no-float {
        background-image: url('/static/onion/img/description.png');
        max-width: 33px;
        height: 40px;
        margin-top: 20px;
        color: #3D5042;
        font-size: 0.75em;
        padding-left: 25px !important;
        }


.filter-container iframe{
        width: 990px;
}

#funcy-oneTom {
        margin: 0;
        padding: 10px 1%;
        line-height: 30px;
}
#utb_documentAlert {
        color: #222;
}

#utb_column02 a.button:focus {
        display: block;
        font-family: Arial, Helvetica, sans-serif;
}

#utb_column02 ul.blogs-listing aundoc1 ul:before,
#utb_column01 a:active,
h1 { font-weight: bold; font-family: line-heetprind, AnimarzPromo, Atial;   line-height: 1.4; font-size:                1 9px; }
#utb_column03 ul.fourder { width: 500px; padding: 4px 10px; }

The RNN also seems to have a thing for Amnesty International, regularly spitting out Amnesty URLs like url(/sites/www.amnesty.org/modules/civicrm/i/mast2adCbang.png) (not actually valid URLs).

Once that was done, I generated samples from all the checkpoints:

for NN in cv/*.t7; do th sample.lua $NN -primetext 'div#sidebar { margin: 0px; }' -length 2000 > $NN.txt; done
## https://www.dropbox.com/s/xgstn9na3efxb43/smallrnn-samples.tar.xz
## if we want to watch the CSS evolve as the loss decreased:
for SAMPLE in `ls cv/lm_css*.txt | sort --field-separator="_" --key=4 --numeric-sort --reverse`;
    do echo $SAMPLE: && tail -5 $SAMPLE | head -5; done

Evaluation

In under a day of GPU training on 20MB of CSS, a medium-sized RNN (~30M parameters) learned to produce high-quality CSS, which passes visual inspection and on some batches yields few CSS syntactic errors. This strikes me as fairly impressive: I did not train a very large RNN, did not train it for very long, did not train it on very much, did no optimization of the many hyper-parameters, and it is doing unsupervised learning in the sense that it doesn’t know how well the emitted CSS validates or renders in web browsers - yet the results still look good. I would say this is a positive first step.

Lessons learned:

  • GPUs > CPUs

  • char-rnn, while rough-edged, is excellent for quick prototyping

  • NNs are slow:

    • major computation is required for the best results
    • meaningful exploration of NN sizes or other hyperparameters will be challenging when a single run can cost days
  • computing large datasets or NNs on Amazon EC2 will entail substantial financial costs; it’s adequate for short runs, but bills around $25 for two days of playing around are not a long-term solution

  • pretraining an RNN on CSS may be useful for a CSS reinforcement learner

RNN: CSS -> HTML

After showing that good-looking CSS can be generated from learning on a CSS corpus, with mastery of the syntactic rules, the next question is how to incorporate meaning. The generated CSS doesn’t mean anything and will only ‘do’ anything if it happens to have generated CSS modifying a sufficiently universal ID or CSS element (you might call the generated CSS ‘what the average CSS looks like’, although like the ‘average man’, average CSS does not exist in real life). We trained it to generate CSS from CSS. What if we trained it to generate CSS from HTML? Then we could feed in a particular HTML page and, if it has learned to generate meaningfully-connected CSS, it should write CSS targeted on that HTML page. If a HTML page has a div named lightbox, then instead of the previous nonsense like .logo-events .show-luset .box-content li { width: 30px; }, perhaps it will learn to write instead something meaningful like lightbox li { width: 30px; }. (Setting that to 30px is not a good idea, but once it has learned to generate CSS for a particular page, then it can learn to generate good CSS for a particular page.)

Creating a Corpus

Before, creating a big CSS corpus was easy: simply find all the CSS files on disk and cat them together into a single file which char-rnn could be fed. From a supervised learning perspective, the labels were also the inputs. But to learn to generate CSS from HTML, we need pairs of HTML and CSS: all the CSS for a particular HTML page.

I could try to take the CSS files and work back­wards to where the orig­i­nal HTML page may be, but most of them are not eas­ily found and a sin­gle HTML page may call sev­eral CSS files or vice ver­sa. It seems sim­pler instead to gen­er­ate a fresh set of files by tak­ing some large list of URLs, down­load­ing each URL, sav­ing its HTML and then pars­ing it for CSS links which then get down­loaded and com­bined into a paired CSS file, with that sin­gle CSS file hope­fully for­mat­ted and cleaned up in other ways.

I don’t know of any existing clean corpus of HTML/CSS pairs: existing databases would provide more data than I need but in the wrong format (split over multiple files as the live website serves it), and I can’t reuse my current archive downloads (how would one map all the downloaded CSS back onto its original HTML file and then combine them appropriately?). So I will generate my own.

I would like to crawl a wide vari­ety of sites, par­tic­u­larly domains which are more likely to pro­vide clean and high­-qual­ity CSS exer­cis­ing lots of func­tion­al­i­ty, so I grab URLs from:

  • exports of my per­sonal Fire­fox brows­ing his­to­ry, URLs linked on Gwern.net, and URLs gen­er­ated from my past archives
  • exports of the Hacker News sub­mis­sion his­tory
  • the CSS Zen Gar­den (hun­dreds of pages with the same HTML but wildly differ­ent & care­fully hand-writ­ten CSS)

Personal

To fil­ter out use­less URLs, files with bad exten­sions, & de-du­pli­cate:

cd ~/css/
rm urls.txt
xzcat  ~/doc/backups/urls/*-urls.txt.xz | cut --delimiter=',' --fields=2 | tr --delete "'" >> urls.txt
find ~/www/ -type f | cut -d '/' -f 4- | awk '{print "http://" $0}' >> urls.txt
find ~/wiki/ -name "*.page" -type f -print0 | parallel --null runhaskell ~/wiki/haskell/link-extractor.hs | fgrep http >> urls.txt
cat ~/.urls.txt >> urls.txt
firefox-urls >> urls.txt
sqlite3 -separator ',' -batch "$(find ~/.mozilla/firefox/ -name 'places.sqlite' | sort | head -1)" "SELECT datetime(visit_date/1000000,'unixepoch') AS visit_date, quote(url), quote(title), visit_count, frecency FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id AND visit_date > strftime('%s','now','-1 year')*1000000 ORDER BY visit_date;" >> urls.txt

cat urls.txt | filter-urls | egrep --invert-match -e '\.onion/' -e '\.gif$' -e '\.svg$' -e '\.jpg$' -e '\.png$' -e '\.pdf$' -e '\.css$' -e '\.woff' -e '\.ttf' -e '\.eot' -e 'ycombinator.com' -e 'reddit.com' -e 'nytimes.com' | sort | uniq --check-chars=18 | shuf > tmp; mv tmp urls.txt
wc --lines urls.txt
## 136328 urls.txt

(uniq --check-chars=18 is there as a hack for dedu­pli­ca­tion: we don’t need to waste time on 1000 URLs all from the same domain, since their CSS will usu­ally all be near-i­den­ti­cal; this defines all URLs with the same first 18 char­ac­ters as being dupli­cates and so to be removed.)
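For illustration, the same prefix-deduplication written in R (with urls standing for the URL list as a character vector):

## keep only the first URL seen for each distinct 18-character prefix:
urls <- urls[!duplicated(substr(urls, 1, 18))]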

HN

HN:

wget 'https://archive.org/download/HackerNewsStoriesAndCommentsDump/HNStoriesAll.7z'
7z x HNStoriesAll.7z # extract HNStoriesAll.json
cat HNStoriesAll.json | tr ' ' '\n' | tr '"' '\n' | egrep '^http://' | sort --unique >> hn.txt
cat hn.txt >> urls.txt

CSS Zen Garden

CSS Zen Gar­den:

nice linkchecker --complete -odot -v --ignore-url=^mailto --no-warnings --timeout=100 --threads=1 'http://www.csszengarden.com' | fgrep http | fgrep -v "label=" | fgrep -v -- "->" | fgrep -v '" [' | fgrep -v "/ " | sed -e "s/href=\"//" -e "s/\",//" | tr -d ' ' | filter-urls | tee css/csszengarden.txt # ]
cat csszengarden.txt  | sort -u | filter-urls | egrep --invert-match -e '.onion/' -e '.css$' -e '.gif$' -e '.svg$' -e '.jpg$' -e '.png$' -e '.pdf$' -e 'ycombinator.com' -e 'reddit.com' -e 'nytimes.com' -e .woff -e .ttf -e .eot -e '\.css$' > tmp
mv tmp csszengarden.txt
cat csszengarden.txt >> urls.txt

Downloading

To describe the down­load algo­rithm in pseudocode:

For each URL index i in 1:n:

    download the HTML
    parse
    extract `<link rel='stylesheet'>`, `<style>`
    forall stylesheets,
        download & concatenate into a single css
    concatenate style into the single css
    write html -> ./i.html
    write css -> ./i.css

Downloading the HTML part of the URL can be done with wget as usual, but if instructed to --page-requisites, it will scatter CSS files across the disk, and the CSS would need to be stitched together into one file. It would also be good if unused parts of the CSS could be ignored, the formatting be cleaned up & made consistent across all pages, and, while we’re wishing, JS evaluated just in case that makes a difference (since so many sites are unnecessarily dynamic these days). uncss does all this in a convenient command-line format; the only downsides I noticed are that it is inherently much slower, that there is an unnecessary two-line header prefixed to the emitted CSS (specifying the URL evaluated) which is easily removed, and that uncss sometimes hangs, so something must be arranged to kill laggard instances so progress can be made. (Originally, I was looking for a tool which would download all the CSS on a page and emit it in a single stream/file rather than write my own tagsoup parser, but when I saw uncss, I realized that the minimizing/optimizing was better than what I had intended and would be useful - why make the RNN learn CSS which isn’t used by the paired HTML?) Installing:

# Debian/Ubuntu workaround:
sudo ln -s /usr/bin/nodejs /usr/bin/node
# possibly helpful to pull in dependencies:
sudo apt-get install phantomjs

npm install -g path-is-absolute
npm install -g brace-expansion --prefix ~/bin/
npm install -g uncss --prefix ~/bin/

Then hav­ing gen­er­ated the URL list pre­vi­ous­ly, it is sim­ple to down­load each HTML/CSS pair:

downloadCSS () {
      ID=`echo "$@" | md5sum | cut -f1 -d' '`
      echo "$@":"$ID"
      if [[ ! -s $ID.css ]]; then
       timeout 120s wget --quiet "$@" -O $ID.html &
       # `tail +3` gets rid of some uncss boilerplate
       timeout 120s nice uncss --timeout 2000 "$@" | tail --lines=+3 >> $ID.css
      fi
}
export -f downloadCSS
cat urls.txt | parallel downloadCSS

Screen­shots: save as screenshot.js

var system = require('system');
var url = system.args[1];
var filename = system.args[2];

var WebPage = require('webpage');
var page = WebPage.create();
// render the page to the target filename once loading finishes:
page.open(url, function () {
    page.render(filename);
    phantom.exit();
});
Then screenshot each URL, reusing the MD5-naming scheme to pair screenshots with their HTML/CSS:

downloadScreenshot() {
    echo "$@"
    ID=`echo "$@" | md5sum | cut -f1 -d' '`
    if [[ ! -a $ID.png ]]; then
       timeout 120s nice phantomjs screenshot.js "$@" $ID.png && nice optipng -o9 -fix $ID.png
    fi
}
export -f downloadScreenshot
cat urls.txt | nice parallel downloadScreenshot

After finishing, find and delete duplicates with fdupes, and delete any stray HTML/CSS:

    # delete any empty file indicating CSS or HTML download failed:
    find . -type f -size 0 -delete
    # delete bit-identical duplicates:
    fdupes . --delete --noprompt
    # look for extremely similar screenshots, and delete all but the first such image:
    nice /usr/bin/findimagedupes --fingerprints=./.fp --threshold=99% *.png | cut --delimiter=' ' --field=2- | xargs rm
    # delete any file without a pair (empty or duplicate CSS having been previously deleted, now we clean up orphans):
    orphanedFileRemover () {
        if [[ ! -a "$1".html || ! -a "$1".css ]];
        then ls "$1"*; rm "$1"*;
        fi; }
    export -f orphanedFileRemover
    find . -name "*.css" -or -name "*.html" | sed -e 's/.html//' -e 's/.css//' | sort --unique | parallel orphanedFileRemover

TODO: once the screen­shot­ter has fin­ished one full pass, then you can add image har­vest­ing to enforce clean triplets of HTML/CSS/PNG

This yields a good-sized cor­pus of clean HTML/CSS pairs:

ls *.css | wc --lines; cat *.css | wc --char

TODO: yield seems low: 1 in 3? will this be enough even with 136k+ URLs? a lot of the errors seem to be sporadic, and page downloads work when retrying them. NYT seems to lock up uncss! had to filter it out; too bad, their CSS was nice and complex

Data augmentation

Data aug­men­ta­tion is a way to increase cor­pus size by trans­form­ing each data point into mul­ti­ple vari­ants which are differ­ent on a low level but seman­ti­cally are the same. For exam­ple, the best move in a par­tic­u­lar Go board posi­tion is the same whether you rotate it by 45° or 180°; an upside-down or slightly brighter or slightly darker pho­to­graph of a fox is still a pho­to­graph of a fox, etc. By trans­form­ing them, we can make our dataset much larger and also force the NN to learn more about the seman­tics and not focus all its learn­ing on mim­ic­k­ing sur­face appear­ances or mak­ing unwar­ranted assump­tions. It seems to help image clas­si­fi­ca­tion a lot (where the full set of data aug­men­ta­tion tech­niques used can be quite elab­o­rate), and is a way you can address con­cerns about an NN not being robust to a par­tic­u­lar kind of noise or trans­for­ma­tion: you can include that noise/transformation as part of your data aug­men­ta­tion.

HTML and CSS can be transformed in various ways which textually look different but still mean the same thing to a browser: they can be minified, they can be reformatted per a style guide, some optimizations can be done to combine CSS declarations or write them in better ways, CSS files can be permuted (sometimes shuffling the order of declarations will change things by changing which of two overlapping declarations gets used, but apparently that’s rare in practice, and CSS developers often write in effectively random order), comments by definition can be deleted without affecting the displayed page, and so on.
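To make this concrete, here is a minimal sketch in R of two such transformations (whitespace-collapsing and declaration-shuffling); it naively assumes flat selector { ... } rules with no nesting or at-rules, so it is illustrative only:

augmentCSS <- function(css) {
    shuffleBody <- function(body) {
        decls <- strsplit(body, ";")[[1]]
        paste(sample(decls), collapse=";")
    }
    ## variant 1, 'minified': collapse all runs of whitespace
    minified <- gsub("\\s+", " ", css)
    ## variant 2: permute the declarations inside each {...} block
    bodies   <- gregexpr("(?<=\\{)[^}]*(?=\\})", css, perl=TRUE)
    shuffled <- css
    regmatches(shuffled, bodies) <- lapply(regmatches(css, bodies),
                                           function(v) vapply(v, shuffleBody, ""))
    c(css, minified, shuffled)
}
augmentCSS("h1 { font-weight: bold;\n     line-height: 1.4 }")

Each original file then contributes several textually-distinct but (approximately) semantically-identical training examples.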

TODO: use tidy-html5 to clean up the downloaded HTML too? http://www.htacg.org/tidy-html5/documentation/#part_building Keep both the original and the cleaned version: this will be good data augmentation.

Data aug­men­ta­tion:

  • raw HTML + uncss
  • tidy-html5 + uncss
  • tidy-html5 + csstidy(unc­ss)
  • tidy-html5 + mini­fied CSS
  • tidy-html5 + shuffle CSS order as well? CSS is not fully but mostly declar­a­tive: http://www.w3.org/TR/2011/REC-CSS2-20110607/cascade.html#cascade

RNN

encoder-de­coder? atten­tion mod­els? neural Tur­ing machi­nes?

  • http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ imple­men­ta­tion of encoder-de­coder with atten­tion in Theano: https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2 “Cur­rent­ly, this code includes three sub­di­rec­to­ries; ses­sion0, ses­sion1 and ses­sion2. ses­sion0 con­tains the imple­men­ta­tion of the recur­rent neural net­work lan­guage model using gated recur­rent units, and ses­sion1 the imple­men­ta­tion of the sim­ple neural machine trans­la­tion mod­el. In ses­sion2, you can find the imple­men­ta­tion of the atten­tion-based neural machine trans­la­tion model we dis­cussed today. I am plan­ning to make a cou­ple more ses­sions, so stay tuned!”
  • pos­si­ble exam­ple to steal from: https://github.com/nicholas-leonard/dp/blob/master/examples/recurrentlanguagemodel.lua / https://dp.readthedocs.org/en/latest/languagemodeltutorial/index.html#neural-network-language-model sim­ple tuto­ri­al: https://dp.readthedocs.org/en/latest/neuralnetworktutorial/index.html (more lowlevel: https://github.com/Element-Research/rnn )
  • char-rnn seems too hard­wired in char­ac­ter at a time, bidi­rec­tion­al?
  • pure Python imple­men­ta­tion: https://github.com/karpathy/neuraltalk rip out the image stuff…?
  • https://github.com/wojzaremba/lstm
  • https://github.com/Element-Research/rnn#rnn.BiSequencerLM ?
  • http://www6.in.tum.de/pub/Main/Publications/Graves2008c.pdf
  • “Rein­force­ment Learn­ing Neural Tur­ing Machines” https://arxiv.org/abs/1505.00521
  • “Learn­ing to Exe­cute” https://arxiv.org/abs/1410.4615 https://github.com/wojciechz/learning_to_execute
  • “Sequence to Sequence Learn­ing with Neural Net­works” https://arxiv.org/abs/1409.3215
  • “Gen­er­at­ing Sequences With Recur­rent Neural Net­works” https://arxiv.org/abs/1308.0850
  • “Neural Tur­ing Machines” https://arxiv.org/abs/1410.5401
  • “DRAW: A Recurrent Neural Network For Image Generation” https://arxiv.org/abs/1502.04623
  • “Neural Machine Trans­la­tion by Jointly Learn­ing to Align and Trans­late” https://arxiv.org/abs/1409.0473 “In this paper, we con­jec­ture that the use of a fixed-length vec­tor is a bot­tle­neck in improv­ing the per­for­mance of this basic encoder-de­coder archi­tec­ture, and pro­pose to extend this by allow­ing a model to auto­mat­i­cally (soft­-)search for parts of a source sen­tence that are rel­e­vant to pre­dict­ing a tar­get word, with­out hav­ing to form these parts as a hard seg­ment explic­it­ly. With this new approach, we achieve a trans­la­tion per­for­mance com­pa­ra­ble to the exist­ing state-of-the-art phrase-based sys­tem on the task of Eng­lish-to-French trans­la­tion. Fur­ther­more, qual­i­ta­tive analy­sis reveals that the (soft­-)align­ments found by the model agree well with our intu­ition.”
  • https://github.com/arctic-nmt/nmt
  • https://github.com/joschu/cgt/blob/master/examples/demo_neural_turing_machine.py
  • http://smerity.com/articles/2015/keras_qa.html
  • “Neural Trans­for­ma­tion Machine: A New Archi­tec­ture for Sequence-to-Se­quence Learn­ing”, Meng et al 2015 https://arxiv.org/abs/1506.06442
  • “On End-to-End Pro­gram Gen­er­a­tion from User Inten­tion by Deep Neural Net­works”, Mou et al 2015 (al­most too lim­ited and sim­ple, though)

Appendix

Covariate impact on power

Is it important in randomized testing of A/B versions of websites to control for covariates, even powerful ones? A simulation using a website’s real traffic & conversion data suggests that at these sample sizes, controlling for covariates is not critical the way it is in many other applications.

In December 2013, I was discussing website testing with another site owner, whose site monetizes traffic by selling a product, while I just optimize for reading time. He argued (deleting identifying details since I will be using their real traffic & conversion numbers throughout):

I think a big part that gets lost out is the qual­ity of traffic. For our [next web­site ver­sion] (still spec­c­ing it all out), one of my biggest require­ments for A/B test­ing is that all refer­ring traffic must be buck­eted and split-test against them. Buck­ets them­selves are amor­phous - they can be vis­i­tors of the same res­o­lu­tion, vis­i­tors who have bought our guide, etc. But just com­par­ing how we did (and our affil­i­ates did) on sales of our guide (an easy to mea­sure met­ric - our RPU), traffic mat­ters so much. X sent 5x the traffic that Y did, yet still gen­er­ated 25% less sales. That would destroy any mean­ing­ful A/B test­ing with­out split­ting up the qual­i­ty.

I was a little skeptical that this was a major concern, much less one worth expensively engineering into a site, and replied:

Eh. You would lose some power by not cor­rect­ing for the covari­ates of source, but the ran­dom­iza­tion would still work and deliver you mean­ing­ful results. As long as vis­i­tors were being ran­dom­ized into the A and B vari­ants, and there was no gross imbal­ance in cells between Y and X, and Y and X vis­i­tors did­n’t react differ­ent­ly, you’d still get the right results - just you would need more traffic to get the same sta­tis­ti­cal pow­er. I don’t think 25% differ­ence between X and Y vis­i­tors would even cost you that much pow­er…

note that:

…we conditioned on the user level covariates listed in the column labeled by the vector W in Table 1 using several methods to strengthen power; such panel techniques predict and absorb residual variation. Lagged sales are the best predictor and are used wherever possible, reducing variance in the dependent variable by as much as 40%…However, seemingly large improvements in R² lead to only modest reductions in standard errors. A little math shows that going from R² = 0% in the univariate regression to R² = 50% yields a sublinear reduction in standard errors of 29%. Hence, the modeling is as valuable as doubling the sample - a significant improvement, but one that does not materially change the measurement difficulty. An order-of-magnitude reduction in standard errors would require R² = 99%, perhaps a “nearly impossible” goal.
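To spell out the quoted math: the standard error of the treatment estimate scales with the residual standard deviation, i.e. with √(1 − R²), so:

1 - sqrt(1 - 0.50)
# [1] 0.2929
1 - sqrt(1 - 0.99)
# [1] 0.9

An R² of 50% shrinks standard errors by ~29% - exactly the improvement from doubling the sample, since 1 − 1/√2 ≈ 0.29 - while a 10x reduction in standard errors does indeed require R² = 99%.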

In par­tic­u­lar, if you lost a lot of pow­er, would­n’t that imply ran­dom­ized tri­als were ineffi­cient or impos­si­ble? The point of ran­dom­iza­tion is that it elim­i­nates the impact of the indefi­nitely many observed and unob­served vari­ables to let you do causal infer­ence.

Power simulation

Since this seems like a relatively simple problem, I suspect there is an analytic answer, but I don’t know it. So instead, we can set this up as a simulated power analysis: we generate random data in which the hypothesis is true by construction, we run our planned analysis, and we see how often we get a p-value underneath 0.05 (correctly detecting the effect which, by construction, really exists).

Let’s say Y’s visitors convert at 10%; then X’s must convert at 10% * 0.75, as he said, and let’s imagine our A/B test of a blue site-design increases sales by 1%. (So in the better version, Y visitors convert at 11% and X visitors convert at 8.5%.) We generate datapoints from each condition (X/blue, X/not-blue, Y/blue, Y/not-blue), and then we do the usual logistic regression looking for a difference in conversion rate, with and without the info about the source. So we regress Conversion ~ Color, to look at what would happen if we had no idea where visitors came from, and then we regress Conversion ~ Color + Source. These will spit out p-values on the Color coefficient which are almost, but not quite, the same: the regression with the Source variable is slightly better, so it should yield slightly lower p-values for Color. Then we count up all the times the p-value was below the magical amount for each regression, and we see how many statistically-significant p-values we lost when we threw out Source. Phew!

So we might like to do this for each of several sample sizes to get an idea of how the results change: n = 100 may not behave the same as n = 10,000. And ideally, for each n, we do the random data generation step many times, because it’s a simulation, and any particular run may not be representative. Below, I’ll look at n = 1000, 1100, 1200, 1300, and so on up until n = 10,000. And for each n, I’ll generate 1000 replicates, which should be pretty accurate.

Large n

The whole schmeer in R:

set.seed(666)
yP <- 0.10
xP <- yP * 0.75
blueP <- 0.01

## examine various possible sizes of N
controlledResults <- NULL; uncontrolledResults <- NULL ## accumulators must exist before c() below
for (n in seq(1000,10000,by=100)) {

 controlled <- NULL; uncontrolled <- NULL

 ## generate 1000 hypothetical datasets
 for (i in 1:1000) {

 nn <- n/4
 ## generate 2x2=4 possible conditions, with different probabilities in each:
 d1 <- data.frame(Converted=rbinom(nn, 1, xP   + blueP), X=TRUE,  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(nn, 1, yP + blueP), X=FALSE, Color=TRUE)
 d3 <- data.frame(Converted=rbinom(nn, 1, xP   + 0),     X=TRUE,  Color=FALSE)
 d4 <- data.frame(Converted=rbinom(nn, 1, yP + 0),     X=FALSE, Color=FALSE)
 d <- rbind(d1, d2, d3, d4)

 ## analysis while controlling for X/Y
 g1 <- summary(glm(Converted ~ Color + X, data=d, family="binomial"))
 ## pull out p-value for Color, which we care about; did we reach statistical-significance?
 controlled[i] <- 0.05 > g1$coef[11]

 ## again, but not controlling
 g2 <- summary(glm(Converted ~ Color        , data=d, family="binomial"))
 uncontrolled[i] <- 0.05 > g2$coef[8]
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/1000))
 uncontrolledResults   <- c(uncontrolledResults, (sum(uncontrolled)/1000))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

Results:

controlledResults
#  [1] 0.081 0.086 0.093 0.113 0.094 0.084 0.112 0.112 0.100 0.111 0.104 0.124 0.146 0.140 0.146 0.110
# [17] 0.125 0.141 0.162 0.138 0.142 0.161 0.170 0.161 0.184 0.182 0.199 0.154 0.202 0.180 0.189 0.202
# [33] 0.186 0.218 0.208 0.193 0.221 0.221 0.233 0.223 0.247 0.226 0.245 0.248 0.212 0.264 0.249 0.241
# [49] 0.255 0.228 0.285 0.271 0.255 0.278 0.279 0.288 0.333 0.307 0.306 0.306 0.306 0.311 0.329 0.294
# [65] 0.318 0.330 0.328 0.356 0.319 0.310 0.334 0.339 0.327 0.366 0.339 0.333 0.374 0.375 0.349 0.369
# [81] 0.366 0.400 0.363 0.384 0.380 0.404 0.365 0.408 0.387 0.422 0.411
uncontrolledResults
#  [1] 0.079 0.086 0.093 0.113 0.092 0.084 0.111 0.112 0.099 0.111 0.103 0.124 0.146 0.139 0.146 0.110
# [17] 0.125 0.140 0.161 0.137 0.141 0.160 0.170 0.161 0.184 0.180 0.199 0.154 0.201 0.179 0.188 0.199
# [33] 0.186 0.218 0.206 0.193 0.219 0.221 0.233 0.223 0.245 0.226 0.245 0.248 0.211 0.264 0.248 0.241
# [49] 0.255 0.228 0.284 0.271 0.255 0.278 0.279 0.287 0.333 0.306 0.305 0.303 0.304 0.310 0.328 0.294
# [65] 0.316 0.330 0.328 0.356 0.319 0.310 0.334 0.339 0.326 0.366 0.338 0.331 0.374 0.372 0.348 0.369
# [81] 0.363 0.400 0.363 0.383 0.380 0.404 0.364 0.406 0.387 0.420 0.410
uncontrolledResults / controlledResults
#  [1] 0.9753 1.0000 1.0000 1.0000 0.9787 1.0000 0.9911 1.0000 0.9900 1.0000 0.9904 1.0000 1.0000
# [14] 0.9929 1.0000 1.0000 1.0000 0.9929 0.9938 0.9928 0.9930 0.9938 1.0000 1.0000 1.0000 0.9890
# [27] 1.0000 1.0000 0.9950 0.9944 0.9947 0.9851 1.0000 1.0000 0.9904 1.0000 0.9910 1.0000 1.0000
# [40] 1.0000 0.9919 1.0000 1.0000 1.0000 0.9953 1.0000 0.9960 1.0000 1.0000 1.0000 0.9965 1.0000
# [53] 1.0000 1.0000 1.0000 0.9965 1.0000 0.9967 0.9967 0.9902 0.9935 0.9968 0.9970 1.0000 0.9937
# [66] 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9969 1.0000 0.9971 0.9940 1.0000 0.9920
# [79] 0.9971 1.0000 0.9918 1.0000 1.0000 0.9974 1.0000 1.0000 0.9973 0.9951 1.0000 0.9953 0.9976

So at n = 1000 we don’t have decent sta­tis­ti­cal power to detect our true effect of 1% increase in con­ver­sion rate thanks to blue - only 8% of the time will we get our mag­i­cal p < 0.05 and rejoice in the knowl­edge that blue is boss. That’s not great, but that’s not what we were ask­ing about.

Small n

Moving on to our original question, we see that the regressions controlling for source had very similar power to the regressions which didn’t bother. It looks like you may pay a small price of ~2% less statistical power, but probably even less than that, because so many of the other entries yielded an estimated penalty of 0%. And the penalty gets smaller as sample size increases and a mere 25% difference in conversion rate washes out as noise.
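We can put a single number on the penalty by averaging over all the runs (the same summary statistic I use for the Gwern.net simulation below):

1 - mean(uncontrolledResults / controlledResults)

Judging from the printed ratios (almost all of them between 0.99 and 1.00), this mean penalty comes out to a fraction of a percent.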

What if we look at smaller sam­ples? say, n = 12-1012?

...
for (n in seq(12,1012,by=10)) {
... }

controlledResults
#  [1] 0.000 0.000 0.000 0.001 0.003 0.009 0.010 0.009 0.024 0.032 0.023 0.027 0.033 0.032 0.045
# [16] 0.043 0.035 0.049 0.048 0.060 0.047 0.043 0.035 0.055 0.051 0.069 0.055 0.057 0.045 0.046
# [31] 0.037 0.049 0.057 0.057 0.050 0.061 0.055 0.054 0.053 0.062 0.076 0.064 0.055 0.057 0.064
# [46] 0.077 0.059 0.062 0.073 0.059 0.053 0.059 0.058 0.062 0.073 0.070 0.060 0.045 0.075 0.067
# [61] 0.077 0.072 0.068 0.069 0.082 0.062 0.072 0.067 0.076 0.069 0.074 0.074 0.062 0.076 0.087
# [76] 0.079 0.073 0.065 0.076 0.087 0.059 0.070 0.079 0.084 0.068 0.077 0.089 0.077 0.081 0.086
# [91] 0.094 0.080 0.080 0.087 0.085 0.087 0.082 0.084 0.073 0.083 0.077
uncontrolledResults
#  [1] 0.000 0.000 0.000 0.001 0.002 0.009 0.005 0.007 0.024 0.031 0.023 0.024 0.033 0.032 0.044
# [16] 0.043 0.035 0.048 0.047 0.060 0.047 0.043 0.035 0.055 0.051 0.068 0.054 0.057 0.045 0.045
# [31] 0.037 0.048 0.057 0.057 0.050 0.060 0.055 0.054 0.053 0.062 0.074 0.063 0.055 0.057 0.059
# [46] 0.077 0.058 0.062 0.073 0.059 0.053 0.059 0.057 0.061 0.071 0.068 0.060 0.045 0.074 0.067
# [61] 0.076 0.072 0.068 0.069 0.082 0.062 0.072 0.066 0.076 0.069 0.073 0.073 0.061 0.074 0.085
# [76] 0.079 0.073 0.065 0.076 0.087 0.058 0.066 0.076 0.084 0.067 0.077 0.089 0.077 0.081 0.086
# [91] 0.094 0.080 0.080 0.087 0.085 0.087 0.080 0.081 0.071 0.083 0.076
uncontrolledResults / controlledResults
#  [1]    NaN    NaN    NaN 1.0000 0.6667 1.0000 0.5000 0.7778 1.0000 0.9688 1.0000 0.8889 1.0000
# [14] 1.0000 0.9778 1.0000 1.0000 0.9796 0.9792 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9855
# [27] 0.9818 1.0000 1.0000 0.9783 1.0000 0.9796 1.0000 1.0000 1.0000 0.9836 1.0000 1.0000 1.0000
# [40] 1.0000 0.9737 0.9844 1.0000 1.0000 0.9219 1.0000 0.9831 1.0000 1.0000 1.0000 1.0000 1.0000
# [53] 0.9828 0.9839 0.9726 0.9714 1.0000 1.0000 0.9867 1.0000 0.9870 1.0000 1.0000 1.0000 1.0000
# [66] 1.0000 1.0000 0.9851 1.0000 1.0000 0.9865 0.9865 0.9839 0.9737 0.9770 1.0000 1.0000 1.0000
# [79] 1.0000 1.0000 0.9831 0.9429 0.9620 1.0000 0.9853 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# [92] 1.0000 1.0000 1.0000 1.0000 1.0000 0.9756 0.9643 0.9726 1.0000 0.9870

As expected, with tiny samples like 12, 22, or 32, the A/B test has essentially 0% power to detect any difference, and so it doesn’t matter whether one controls for source or not. In the n = 42+ range, we start seeing some small penalty, but the fluctuations from a 33% penalty to 0% to 50% to 22% to 0% show that once we start nearing n = 100, the difference barely exists, and the long succession of 1.0000s says that past that, we must be talking about a very small power penalty, on the order of 1%.

Larger differences

So let me pull up some real #s. I will give you source, # of unique vis­i­tors to sales page, # of unique vis­i­tors to buy page, # of actual buy­ers. Also note that I am doing it on a per-affil­i­ate basis, and thus dis­re­gard­ing the ori­gin of traffic (more on that lat­er):

  • Web­site.­com - 3963 - 722 - 293
  • X - 1232 - 198 - 8
  • Y - 1284 - 193 - 77
  • Z - 489 - 175 - 75

So even the ori­gin of traffic was every­where. X was all web­site, but pushed via FB. EC was email. Y was Face­book. Ours was 3 - email, Face­book, Twit­ter. Email con­verted at 13.72%, Face­book at 8.35%, and Twit­ter at 1.39%. All had >500 clicks.

So with that in mind, espe­cially see­ing how X and Y had the same # of peo­ple visit the buy page, but X con­verted at 10% the rate (and rel­a­tively to X, Y con­verted at 200%), I would wager that re-run­ning your num­bers would find that the ori­gin mat­ters.

Those are much big­ger con­ver­sion differ­en­tials than the orig­i­nal 25% esti­mate, but the loss of power was so minute in the first case that I sus­pect that the penalty will still be rel­a­tively small.

I can fix the power analysis by looking at each traffic source separately and tweaking the random generation appropriately, with liberal use of copy-paste. For the website, he said 3x500, but there were 3963 hits, so I’ll assume the remainder is general organic website traffic. That gives me a total table:

  • Email: 500 * 13.72% = 67
  • Face­book: 500 * 8.35% = 42
  • Twit­ter: 500 * 1.39% = 7
  • organ­ic: 293-(67+42+7) = 177; 3963 - (3*500) = 2463; 177 / 2463 = 7.186%

Switch­ing to R for con­ve­nience:

website <- read.csv(stdin(),header=TRUE)
Source,N,Rate
"X",1232,0.006494
"Y",1284,0.05997
"Z",489,0.1534
"Website email",500,0.1372
"Website Facebook",500,0.0835
"Website Twitter",500,0.0139
"Website organic",2463,0.07186


website$N / sum(website$N)
# [1] 0.17681 0.18427 0.07018 0.07176 0.07176 0.07176 0.35347

Change the power sim­u­la­tion appro­pri­ate­ly:

set.seed(666)
blueP <- 0.01
controlledResults <- NULL; uncontrolledResults <- NULL
for (n in seq(1000,10000,by=1000)) {
 controlled <- NULL; uncontrolled <- NULL
 for (i in 1:1000) {

 d1 <- data.frame(Converted=rbinom(n*0.17681, 1, 0.006494   + blueP), Source="X",  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(n*0.17681, 1, 0.006494   + 0),     Source="X",  Color=FALSE)

 d3 <- data.frame(Converted=rbinom(n*0.18427, 1, 0.05997 + blueP), Source="Y", Color=TRUE)
 d4 <- data.frame(Converted=rbinom(n*0.18427, 1, 0.05997 + 0),     Source="Y", Color=FALSE)

 d5 <- data.frame(Converted=rbinom(n*0.07018, 1, 0.1534 + blueP), Source="Z", Color=TRUE)
 d6 <- data.frame(Converted=rbinom(n*0.07018, 1, 0.1534 + 0),     Source="Z", Color=FALSE)

 d7 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.1372 + blueP), Source="Website email", Color=TRUE)
 d8 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.1372 + 0),     Source="Website email", Color=FALSE)

 d9  <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0835 + blueP), Source="Website Facebook", Color=TRUE)
 d10 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0835 + 0),     Source="Website Facebook", Color=FALSE)

 d11 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0139 + blueP), Source="Website Twitter", Color=TRUE)
 d12 <- data.frame(Converted=rbinom(n*0.07176, 1, 0.0139 + 0),     Source="Website Twitter", Color=FALSE)

 d13 <- data.frame(Converted=rbinom(n*0.35347, 1, 0.07186 + blueP), Source="Website organic", Color=TRUE)
 d14 <- data.frame(Converted=rbinom(n*0.35347, 1, 0.07186 + 0),     Source="Website organic", Color=FALSE)

 d <- rbind(d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14)

 g1 <- summary(glm(Converted ~ Color + Source, data=d, family="binomial"))
 controlled[i] <- 0.05 > g1$coef["ColorTRUE", "Pr(>|z|)"] ## Color p-value, indexed by name since Source adds 6 dummies

 g2 <- summary(glm(Converted ~ Color        , data=d, family="binomial"))
 uncontrolled[i] <- 0.05 > g2$coef[8]
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/1000))
 uncontrolledResults   <- c(uncontrolledResults, (sum(uncontrolled)/1000))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

An hour or so lat­er:

controlledResults
# [1] 0.105 0.175 0.268 0.299 0.392 0.432 0.536 0.566 0.589 0.631
uncontrolledResults
# [1] 0.093 0.167 0.250 0.285 0.379 0.416 0.520 0.542 0.576 0.618
uncontrolledResults / controlledResults
# [1] 0.8857 0.9543 0.9328 0.9532 0.9668 0.9630 0.9701 0.9576 0.9779 0.9794

In the most extreme case (to­tal n = 1000), where our con­trolled test’s power is 0.105 or 10.5% (well, what do you expect from that small an A/B test?), our test where we throw away the Source info has a power of 0.093 or 9.3%. So we lost 0.1143 or 11% of the pow­er.
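That is:

(0.105 - 0.093) / 0.105
# [1] 0.1143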

Sample size implication

That’s not as bad as I feared when I saw the huge con­ver­sion rate differ­ences, but maybe it has a big­ger con­se­quence than I guess?

What does this 11% loss trans­late to in terms of extra sam­ple size?

Well, our orig­i­nal total con­ver­sion rate was 6.52%:

sum((website$N * website$Rate)) / sum(website$N)
# [1] 0.0652

We were examining a hypothetical increase by 1% to 7.52%. A regular 2-proportion power calculation (the closest thing to a binomial test in the R standard library) gives:

power.prop.test(n = 1000, p1 = 0.0652, p2 = 0.0752)
#      Two-sample comparison of proportions power calculation
#
#               n = 1000
#              p1 = 0.0652
#              p2 = 0.0752
#       sig.level = 0.05
#           power = 0.139

Its 14% estimate is reasonably close to 10.5% given all the simplifications I’m doing here. So, imagine our 0.139 power here was the victim of the 11% loss; the true power is the x such that x × (1 − 0.11) = 0.139, i.e. x = 0.15618. Given the p1 and p2 for our A/B test, how big would n then have to be to reach our true power?
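Or in R:

0.139 / (1 - 0.11)
# [1] 0.1562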

power.prop.test(p1 = 0.0652, p2 = 0.0752, power=0.15618)
#      Two-sample comparison of proportions power calculation
#
#               n = 1178

So in this worst-case sce­nario with small sam­ple size and very differ­ent true con­ver­sion rates, we would need another 178 page-views/visits to make up for com­pletely throw­ing out the source covari­ate. This is usu­ally a doable num­ber of extra page-views.

Gwern.net

What are the implications for my own A/B tests, with less extreme “conversion” differences? It might be interesting to imagine a hypothetical where my traffic splits between my highest-converting traffic source and my lowest, and to see how much extra n I must pay in my testing because I decline to figure out how to record source for tested traffic.

Looking at my traffic for the year 2012-12-26 to 2013-12-26, I see that of the top 10 referral sources, the highest-converting source is bulletproofexec.com traffic at 29.95% of its 9461 visits, and the lowest is t.co (Twitter) at 8.35% of 15168. We’ll split traffic 50/50 between these two sources.

set.seed(666)
## model specification:
bulletP <- 0.2995
tcoP    <- 0.0835
blueP   <- 0.0100

sampleSizes <- seq(100,5000,by=100)
replicates  <- 1000

controlledResults <- NULL; uncontrolledResults <- NULL

for (n in sampleSizes) {

 controlled <- NULL; uncontrolled <- NULL

 # generate `replicates` hypothetical datasets
 for (i in 1:replicates) {

 nn <- n/2
 # generate 2x2=4 possible conditions, with different probabilities in each:
 d1 <- data.frame(Converted=rbinom(nn, 1, bulletP + blueP), X=TRUE,  Color=TRUE)
 d2 <- data.frame(Converted=rbinom(nn, 1, tcoP    + blueP), X=FALSE, Color=TRUE)
 d3 <- data.frame(Converted=rbinom(nn, 1, bulletP + 0),     X=TRUE,  Color=FALSE)
 d4 <- data.frame(Converted=rbinom(nn, 1, tcoP    + 0),     X=FALSE, Color=FALSE)
 d0 <- rbind(d1, d2, d3, d4)

 # analysis while controlling for Twitter/Bullet-Proof-Exec
 g1 <- summary(glm(Converted ~ Color + X, data=d0, family="binomial"))
 controlled[i]   <- g1$coef[11] < 0.05
 g2 <- summary(glm(Converted ~ Color    , data=d0, family="binomial"))
 uncontrolled[i] <- g2$coef[8]  < 0.05
 }
 controlledResults   <- c(controlledResults, (sum(controlled)/length(controlled)))
 uncontrolledResults <- c(uncontrolledResults, (sum(uncontrolled)/length(uncontrolled)))
}
controlledResults
uncontrolledResults
uncontrolledResults / controlledResults

Results:

controlledResults
#  [1] 0.057 0.066 0.059 0.065 0.068 0.073 0.073 0.071 0.108 0.089 0.094 0.106 0.091 0.110 0.126 0.112
# [17] 0.123 0.125 0.139 0.117 0.144 0.140 0.145 0.137 0.161 0.165 0.170 0.148 0.146 0.171 0.197 0.171
# [33] 0.189 0.180 0.184 0.188 0.180 0.177 0.210 0.207 0.193 0.229 0.209 0.218 0.226 0.242 0.259 0.229
# [49] 0.254 0.271
uncontrolledResults
#  [1] 0.046 0.058 0.046 0.056 0.057 0.066 0.053 0.062 0.095 0.080 0.078 0.090 0.077 0.100 0.099 0.103
# [17] 0.109 0.113 0.118 0.105 0.134 0.130 0.123 0.124 0.142 0.152 0.153 0.133 0.126 0.151 0.168 0.151
# [33] 0.163 0.163 0.168 0.170 0.160 0.162 0.189 0.183 0.170 0.209 0.192 0.198 0.209 0.215 0.233 0.208
# [49] 0.221 0.251
uncontrolledResults / controlledResults
#  [1] 0.8070 0.8788 0.7797 0.8615 0.8382 0.9041 0.7260 0.8732 0.8796 0.8989 0.8298 0.8491 0.8462
# [14] 0.9091 0.7857 0.9196 0.8862 0.9040 0.8489 0.8974 0.9306 0.9286 0.8483 0.9051 0.8820 0.9212
# [27] 0.9000 0.8986 0.8630 0.8830 0.8528 0.8830 0.8624 0.9056 0.9130 0.9043 0.8889 0.9153 0.9000
# [40] 0.8841 0.8808 0.9127 0.9187 0.9083 0.9248 0.8884 0.8996 0.9083 0.8701 0.9262
1 - mean(uncontrolledResults / controlledResults)
# [1] 0.1194

So our power loss is not too severe in this worst-case scenario: we lose a mean of 12% of our power.

We were examining a hypothetical conversion increase by 1%, from 19.15% (mean(c(bulletP, tcoP))) to 20.15%. A regular 2-proportion power calculation (the closest thing to a binomial test in the R standard library) gives:

power.prop.test(n = 1000, p1 = 0.1915, p2 = 0.2015)
#      Two-sample comparison of proportions power calculation
#
#               n = 1000
#              p1 = 0.1915
#              p2 = 0.2015
#       sig.level = 0.05
#           power = 0.08116

Its 8.1% estimate is reasonably close to the simulated 8.9% (the n = 1000 entry of controlledResults) given all the simplifications I’m doing here. So, imagine our 0.08116 power here was the victim of the 12% loss; the true power is the x such that x × (1 − 0.12) = 0.08116, i.e. x = 0.0922273. Given the p1 and p2 for our A/B test, how big would n then have to be to reach our true power?

power.prop.test(p1 = 0.1915, p2 = 0.2015, power=0.0922273)
#      Two-sample comparison of proportions power calculation
#
#               n = 1265

So this worst-case sce­nario means I must spend an extra n of 265 or roughly a fifth of a day’s traffic. Since it would prob­a­bly cost me, on net, far more than a fifth of a day to find an imple­men­ta­tion strat­e­gy, debug it, and incor­po­rate it into all future analy­ses, I am happy to con­tinue throw­ing out the source infor­ma­tion & other covari­ates.


  1. The loss here seems to be the average Negative Log Likelihood of each character; so a training loss of 3.78911860 means exp(-3.78911860) → 0.02, or a 2% chance of predicting the next character. This is better than the base-rate of uniformly guessing each of the 128 ASCII characters, which would yield 1/128 → 0.0078125 or a 0.8% chance, but not by much. However, after a few hours of training and getting down to ~0.8, it starts to become quite impressive: 0.8 here translates to a 45% chance - not shabby! At that point, the RNN is starting to become a good natural-language compressor, as it’s approaching estimates of the entropy of natural human English, where RNNs have set records like 1.278 bits per character. (Which, converting bits per character back into an average negative log likelihood, implies that for English text similarly complicated as Wikipedia, we shouldn’t expect our RNN to do any better than a training loss of ~0.87, and more realistically 0.9-1.1.)↩︎

  2. Sev­eral days after I gave up, Nvidia released a 7.5 RC which did claim to sup­port Ubuntu 15.04, but installing it yielded the same lock­up. I then installed Ubuntu 14.04 and tried the 14.04 ver­sion of that 7.5 RC, and that worked flaw­lessly for GPU accel­er­a­tion of both graph­ics & NNs.↩︎

  3. Even­tu­ally the Nvidia release caught up with 15.04 and I was able to use the Acer lap­top for deep learn­ing. This may not have been a good thing in the long run because the lap­top wound up being bricked on 2016-11-26, with what I think was the moth­er­board dying, when it was just out of war­ran­ty, and cor­rupt­ing the filesys­tem on the SSD to boot. This is an odd way for a lap­top to die, and per­haps the warn­ings against using lap­top GPUs for deep learn­ing were right - the lap­top was indeed run­ning torch-rnn the night/morning it died.↩︎

  4. The EC2 price chart describes it as “High­-per­for­mance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video mem­ory”. These appar­ently are NVIDIA Quadro K5000 cards, which cost some­where around $1500. (Price & per­for­mance-wise, it seems there are these days a lot of bet­ter options now; for exam­ple, my GeForce GTX 960M seems to train at sim­i­lar speed at the EC2 instances do.) At $0.65/hr, that’s ~2300 hours or 96 days; at spot, 297 days. Even adding in local elec­tric­ity cost and the cost of build­ing a desk­top PC around the GPUs, it’s clear that breakeven is under a year and that for more than the occa­sional dab­bling, one’s own hard­ware is key. If noth­ing else, you won’t feel anx­ious about the clock tick­ing on your Ama­zon bill!↩︎