About This Website

Meta page describing Gwern.net: the site ideals of stable long-term essays which improve over time; technical decisions (Markdown and static hosting); idea sources and writing methodology; metadata definitions; site statistics; and the copyright license.
personal, psychology, archiving, statistics, predictions, meta, Bayes, Google, design
2010-10-01–2020-10-19 finished certainty: highly likely importance: 3


This page is about Gwern.net; for information about me, see .

The Content

“Of all the books I have delivered to the presses, none, I think, is as personal as the straggling collection mustered for this hodgepodge, precisely because it abounds in reflections and interpolations. Few things have happened to me, and I have read a great many. Or rather, few things have happened to me more worth remembering than Schopenhauer’s thought or the music of England’s words.”

“A man sets himself the task of portraying the world. Through the years he peoples a space with images of provinces, kingdoms, mountains, bays, ships, islands, fishes, rooms, instruments, stars, horses, and people. Shortly before his death, he discovers that that patient labyrinth of lines traces the image of his face.”

, Epilogue

The content here varies from to to / to to to to to investigations of or (or two topics at once: or or heck !).

I believe that someone who has been well-educated will think of something worth writing at least once a week; to a surprising extent, this has been true. (I added ~130 documents to this repository over the first 3 years.)

Target Audience

“Special knowledge can be a terrible disadvantage if it leads you too far along a path you cannot explain anymore.”

()

I don’t write simply to find things out, although curiosity is my primary motivator, as I find I want to read something which hasn’t been written—“…I realised that I wanted to read about them what I myself knew. More than this—what only I knew. Deprived of this possibility, I decided to write about them. Hence this book.”1 There are many benefits to keeping notes as they allow one to accumulate confirming and especially contradictory evidence2, and even drafts can be useful so you or simply decently respect the opinions of mankind.

The goal of these pages is not to be a model of concision, maximizing entertainment value per word, or to preach to a choir by elegantly repeating a conclusion. Rather, I am attempting to explain things to my future self, who is intelligent and interested, but has forgotten. What I am doing is explaining why I decided what I did to myself and noting down everything I found interesting about it for future reference. I hope my other readers, whoever they may be, might find the topic as interesting as I found it, and the essay useful or at least entertaining—but the intended audience is my future self.

Development

“I hate the water that thinks that it boiled itself on its own. I hate the seasons that think they cycle naturally. I hate the sun that thinks it rose on its own.”

Sodachi Oikura, (Sodachi Riddle, Part One)

It is everything I felt worth writing that didn’t fit somewhere like Wikipedia or was already written. I never expected to write so much, but I discovered that once I had a hammer, nails were everywhere, and that 3.

Long Site

“The Internet is self destructing paper. A place where anything written is soon destroyed by rapacious competition and the only preservation is to forever copy writing from sheet to sheet faster than they can burn. If it’s worth writing, it’s worth keeping. If it can be kept, it might be worth writing…If you store your writing on a third party site like , or even on your own site, but in the complex format used by blog/wiki software du jour you will lose it forever as soon as hypersonic wings of Internet labor flows direct people’s energies elsewhere. For most information published on the Internet, perhaps that is not a moment too soon, but how can the muse of originality soar when immolating transience brushes every feather?”

(“Self destructing paper”, 2006-12-05)

One of my per­sonal inter­ests is apply­ing the idea of the . What and how do you write a per­sonal site with the long-term in mind? We live most of our lives in the future, and the actu­ar­ial tables give me until the 2070–2080s, exclud­ing any ben­e­fits from / or projects like . It is a com­mon-place in sci­ence fic­tion4 that longevity would cause wide­spread risk aver­sion. But on the other hand, it could do the oppo­site: the longer you live, the more long-shots you can afford to invest in. Some­one with a times­pan of 70 years has rea­son to pro­tect against black swan­s—but also time to look for them.5 It’s worth not­ing that old peo­ple make many short­-term choic­es, as reflected in increased sui­cide rates and reduced invest­ment in edu­ca­tion or new hob­bies, and this is not due solely to the rav­ages of age but the prox­im­ity of death—the HIV-infected (but oth­er­wise in per­fect health) act sim­i­larly short­-term.6

What sort of writing could you create if you worked on it (be it ever so rarely) for the next 60 years? What could you do if you started now?7

Keeping the site running that long is a challenge, and leads to the recommendations for : 100% software8, for data, textual human-readability, avoiding external dependencies9,10, and staticness11.

Preserving the content is another challenge. Keeping the content in a like protects against file corruption and makes it easier to mirror the content; regular backups12 help. I have taken additional measures: has archived most pages and almost all external links; the is also archiving pages & external links13. (For details, read .)

One could continue in this vein, devising ever more powerful & robust storage methods (perhaps combine the DVCS with through , a la bup), but what is one to fill the storage with?
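
To make the layered approach concrete, here is a minimal sketch of “plain text in a DVCS, plus deduplicated backups”; the paths and backup name are illustrative assumptions, not a description of my exact setup:

# commit the day’s edits to the version-controlled page sources
cd ~/wiki && git add -A && git commit -m "snapshot $(date +%F)"
# take a deduplicated, incremental backup of the repository with bup,
# which can then be mirrored to other machines or media
bup init
bup index ~/wiki
bup save -n wiki ~/wiki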

Long Content

“What has been done, thought, written, or spoken is not culture; culture is only that fraction which is remembered.”

Gary Taylor (The Clock of the Long Now; emphasis added)14

‘Blog posts’ might be the answer. But I have read blogs for many years and most blog posts are the triumph of the hare over the tortoise. They are meant to be read by a few people on a weekday in 2004 and never again, and are quickly abandoned—and perhaps as Assange says, not a moment too soon. (But isn’t that sad? Isn’t it a terrible for one’s time?) On the other hand, the best blogs always seem to be building something: they are rough drafts—works in progress15. So I did not wish to write a blog. Then what? More than just “evergreen content”, what would constitute Long Content as opposed to the existing culture of Short Content? How does one live in a Long Now sort of way?16

It’s shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult. Muad’Dib knew that every experience carries its lesson.17

My answer is that one uses such a frame­work to work on projects that are too big to work on nor­mally or too tedious. (Con­sci­en­tious­ness is often lack­ing online or in vol­un­teer com­mu­ni­ties18 and many use­ful things go undone.) Know­ing your site will sur­vive for decades to come gives you the men­tal where­withal to tackle long-term tasks like gath­er­ing infor­ma­tion for years, and such per­sis­tence can be use­ful19—if one holds onto every glim­mer of genius for years, then even the dullest per­son may look a bit like a genius him­self20. (Even expe­ri­enced pro­fes­sion­als can only write at their peak for a few hours a day—usu­ally , it seem­s.) Half the chal­lenge of fight­ing pro­cras­ti­na­tion is the pain of start­ing—I find when I actu­ally get into the swing of work­ing on even dull tasks, it’s not so bad. So this sug­gests a solu­tion: never start. Merely have per­pet­ual drafts, which one tweaks from time to time. And the rest takes care of itself. I have a few exam­ples of this:

  1. :

    When I read in Wired in 2008 that the obscure work­ing mem­ory exer­cise called dual n-back (DNB) had been found to increase IQ sub­stan­tial­ly, I was shocked. IQ is one of the most stub­born prop­er­ties of one’s mind, one of the most frag­ile21, the hard­est to affect pos­i­tively22, but also one of the most valu­able traits one could have23; if the tech­nique panned out, it would be huge. Unfor­tu­nate­ly, DNB requires a major time invest­ment (as in, half an hour dai­ly); which would be a bar­gain—if it deliv­ers. So, to do DNB or not?

    Ques­tions of great import like this are worth study­ing care­ful­ly. The wheels of acad­e­mia grind exceed­ing slow, and only a fool expects unan­i­mous answers from fields like psy­chol­o­gy. Any attempt to answer the ques­tion ‘is DNB worth­while?’ will require years and cover a breadth of mate­r­i­al. This FAQ on DNB is my attempt to cover that breadth over those years.

  2. :

    I have been dis­cussing since 2004. The task of inter­pret­ing Eva is very diffi­cult; the source works them­selves are a major time-sink24, and there are thou­sands of pri­ma­ry, sec­ondary, and ter­tiary works to con­sid­er—per­sonal essays, inter­views, reviews, etc. The net effect is that many Eva fans ‘know’ cer­tain things about Eva, such as not being a grand ‘screw you’ state­ment by Hideaki Anno or that the TV series was cen­sored, but they no longer have proof. Because each fan remem­bers a differ­ent sub­set, they have irrec­on­cil­able inter­pre­ta­tions. (Half the value of the page for me is hav­ing a place to store things I’ve said in count­less fora which I can even­tu­ally turn into some­thing more sys­tem­at­ic.)

    To com­pile claims from all those works, to dig up for­got­ten ref­er­ences, to scroll through micro­films, buy issues of defunct mag­a­zi­nes—all this is enough work to shat­ter of the stoutest salary­man. Which is why I began years ago and expect not to fin­ish for years to come. (Fin­ish­ing by 2020 seems like a good pre­dic­tion.)

  3. : Years ago I was reading the papers of the economist Robin Hanson. I recommend his papers highly; even if they are wrong, they are imaginative and some of the finest speculative fiction I have read. (Except they were non-fiction.) One night I had a dream in which I saw in a flash a medieval city run in part on Hansonian grounds; a version of his . A city must have another city as a rival, and soon I had remembered the strange ’90s idea of s, which was easily tweaked to work in a medieval setting. Finally, between them, was one of my favorite proposals, Buckminster Fuller’s megastructure.

    I wrote several drafts but always lost them. Sad25 and discouraged, I abandoned it for years. This fear leads straight into the next example.

  4. A Book read­ing list:

    Once, I did­n’t have to keep read­ing lists. I sim­ply went to the school library shelf where I left off and grabbed the next book. But then I began read­ing harder books, and they would cite other books, and some­times would even have hor­ri­fy­ing lists of hun­dreds of other books I ought to read (‘bib­li­ogra­phies’). I tried remem­ber­ing the most impor­tant ones but quickly for­got. So I began keep­ing a book list on paper. I thought I would throw it away in a few months when I read them all, but some­how it kept grow­ing and grow­ing. I did­n’t trust com­put­ers to store it before26, but now I do, and it lives on in dig­i­tal form (cur­rently on Goodreads—be­cause they have export func­tion­al­i­ty). With it, I can track how my inter­ests evolved over time27, and what I was read­ing at the time. I some­times won­der if I will read them all even by 2070.

What is next? So far the pages will persist through time, and they will gradually improve over time. But a truly Long Now approach would be to make them be improved by time—make them more valuable the more time passes. ( remarks in that a group of monks carved thousands of scriptures into stone, hoping to preserve them for posterity—but posterity would value far more a carefully preserved collection of monk feces, which would tell us countless valuable things about important phenomena like global warming.)

One idea I am exploring is adding long-term predictions like the ones I make on PredictionBook.com. Many28 pages explicitly or implicitly make predictions about the future. As time passes, predictions would be validated or falsified, providing feedback on the ideas.29

For exam­ple, the Evan­ge­lion essay’s par­a­digm implies many things about the future movies in 30; is an extended pre­dic­tion31 of future plot devel­op­ments in series; has sug­ges­tions about what makes good pro­jects, which could be turned into pre­dic­tions by apply­ing them to pre­dict suc­cess or fail­ure when the next Sum­mer of Code choices are announced. And so on.

I don’t think “Long Content” is simply for working on things which are equivalent to a “monograph” (a work which attempts to be an exhaustive exposition of all that is known—and what has been recently discovered—on a single topic), although monographs clearly would benefit from such an approach. If I wrote a short essay cynically remarking on, say, Al Gore, predicting he’d sell out, registered some predictions, and came back 20 years later to see how it worked out, I would consider this “Long Content” (it gets more interesting with time, as the predictions reach maturation); but one couldn’t consider this a “monograph” in any ordinary sense of the word.

One of the ironies of this approach is that as a , I assign non-trivial probability to the world undergoing massive change during the 21st century due to any of a number of technologies such as artificial intelligence (such as 32) or ; yet here I am, planning as if I and the world were immortal.

I per­son­ally believe that one should “think Less Wrong and act Long Now”, if you fol­low me. I dili­gently do my daily and n-back­ing; I care­fully design my web­site and writ­ings to last decades, actively think about how to write mate­r­ial that improves with time, and work on writ­ings that will not be fin­ished for years (if ever). It’s a bit schiz­o­phrenic since both are total­ized world­views with dras­ti­cally con­flict­ing rec­om­men­da­tions about where to invest my time. It’s a case of high ver­sus low dis­count rates; and one could fairly accuse me of com­mit­ting the , but then, I’m not sure that (cer­tain­ly, I have more to show for my wasted time than most peo­ple).

The Long Now views its proposals like the Clock and the Long Library and as insurance—in case the future turns out to be surprisingly unsurprising. I view these writings similarly. If most ambitious predictions turn out right and the Singularity happens by 2050 or so, then much of my writings will be moot, but I will have all the benefits of said Singularity; if the Singularity never happens or ultimately pays off in a very disappointing way, then my writings will be valuable to me. By working on them, I hedge my bets.

Finding my ideas

To the extent I personally have any method for ‘getting started’ on writing something, it’s to pay attention to anytime you find yourself thinking, “how irritating that there’s no good webpage/Wikipedia article on X” or “I wonder if Y” or “has anyone done Z” or “huh, I just realized that A!” or “this is the third time I’ve had to explain this, jeez.”

The DNB FAQ started because I was irritated people were repeating themselves on the dual n-back mailing list; the article started because it was a pain to figure out where one could order modafinil; the trio of Death Note articles (, , ) all started because I had an amusing thought about information theory; the page was commissioned after I groused about how deeply sensationalist & shallow & ill-informed all the mainstream media articles on the Silk Road drug marketplace were (similarly for ); my was based on thinking it was a pity that Arthur’s Guardian analysis was trivially & fatally flawed; and so on and so forth.

None of these seems spe­cial to me. Any­one could’ve com­piled the DNB FAQ; any­one could’ve kept a list of online phar­ma­cies where one could buy modafinil; some­one tried some­thing sim­i­lar to my Google shut­down analy­sis before me (and the fancier sta­tis­tics were all stan­dard tool­s). If I have done any­thing mer­i­to­ri­ous with them, it was per­haps sim­ply putting more work into them than some­one else would have; to quote Teller:

“I think you’ll see what I mean if I teach you a few prin­ci­ples magi­cians employ when they want to alter your per­cep­tion­s…­Make the secret a lot more trou­ble than the trick seems worth. You will be fooled by a trick if it involves more time, money and prac­tice than you (or any other sane onlook­er) would be will­ing to invest.”

“My part­ner, Penn, and I once pro­duced 500 live cock­roaches from a top hat on the desk of talk-show host David Let­ter­man. To pre­pare this took weeks. We hired an ento­mol­o­gist who pro­vided slow-mov­ing, cam­er­a-friendly cock­roaches (the kind from under your stove don’t hang around for close-ups) and taught us to pick the bugs up with­out scream­ing like pread­o­les­cent girls. Then we built a secret com­part­ment out of foam-core (one of the few mate­ri­als cock­roaches can’t cling to) and worked out a devi­ous rou­tine for sneak­ing the com­part­ment into the hat. More trou­ble than the trick was worth? To you, prob­a­bly. But not to magi­cians.”

Besides that, I think after a while writing/research can be a virtuous circle, or autocatalytic. If one looks at my repo statistics, one sees that I haven’t always been writing as much. What seems to happen is that as I write more:

  • I learn more tools

    eg. I learned basic in R to answer what all the pos­i­tive & neg­a­tive , but then I was able to use it for iodine; I learned lin­ear mod­els for ana­lyz­ing MoR reviews but now I can use them any­where I want to, like in my .

    The “Feyn­man method” has been face­tiously described as “find a prob­lem; think very hard; write down the answer”, but Gian-Carlo Rota gives the real one:

    Richard Feyn­man was fond of giv­ing the fol­low­ing advice on how to be a genius. You have to keep a dozen of your favorite prob­lems con­stantly present in your mind, although by and large they will lay in a dor­mant state. Every time you hear or read a new trick or a new result, test it against each of your twelve prob­lems to see whether it helps. Every once in a while there will be a hit, and peo­ple will say: “How did he do it? He must be a genius!”

  • I inter­nal­ize a habit of notic­ing inter­est­ing ques­tions that flit across my brain

    eg. in March 2013 while med­i­tat­ing: “I won­der if more dou­jin music gets released when unem­ploy­ment goes up and peo­ple may have more spare time or fail to find jobs? Hey! That giant Touhou music tor­rent I down­load­ed, with its 45000 songs all tagged with release year, could prob­a­bly answer that!” (One could argue that these ques­tions prob­a­bly should be ignored and not inves­ti­gated in depth—Teller again—n­ev­er­the­less, this is how things work for me.)

  • if you aren’t writ­ing, you’ll ignore use­ful links or quotes; but if you stick them in small asides or foot­notes as you notice them, even­tu­ally you’ll have some­thing big­ger.

    I grab things I see on Google Alerts & Schol­ar, Pub­med, Red­dit, Hacker News, my RSS feeds, books I read, and note them some­where until they amount to some­thing. (An exam­ple would be my slowly accret­ing cita­tions on IQ and eco­nom­ics.)

  • peo­ple leave com­ments, ping me on IRC, send me emails, or leave anony­mous mes­sages, all of which help

    Some exam­ples of this come from my most pop­u­lar page, on Silk Road 1:

    1. an anony­mous mes­sage led me to inves­ti­gate a ven­dor in depth and pon­der the accu­sa­tion lev­eled against them; I wrote it up and gave my opin­ions and thus I got another short essay to add to my SR page which I would not have had oth­er­wise (and I think there’s a <20% chance that in a few years this will pay off and become a very inter­est­ing essay).
    2. CMU’s Nicholas Christin, who by scrap­ing SR for many months and giv­ing all sorts of over­all sta­tis­tics, emailed me to point out I was cit­ing inac­cu­rate fig­ures from the first ver­sion of his paper. I thanked him for the cor­rec­tion and while I was reply­ing, men­tioned I had a hard time believ­ing his paper’s claims about the extreme rar­ity of scams on SR as esti­mated through buyer feed­back. After some back and forth and sug­gest­ing spe­cific mech­a­nisms how the esti­mates could be pos­i­tively biased, he was able to check his data­base and con­firmed that there was at least one very large omis­sion of scams in the scraped data and there was prob­a­bly a gen­eral under­sam­pling; so now I have a more accu­rate feed­back esti­mate for my SR page (im­por­tant for esti­mat­ing risk of order­ing) and he said he’ll acknowl­edge me in the/a paper, which is nice.

Information organizing

Occasionally people ask how I manage information and read things.

  1. For quotes or facts which are very important, I employ by adding them to my Mnemosyne

  2. I keep web clip­pings in Ever­notes; I also excerpt from research papers & books, and mis­cel­la­neous sources. This is use­ful for tar­geted searches when I remem­ber a fact but not where I learned it, and for stor­ing things which I don’t want to mem­o­rize but which have no log­i­cal home in my web­site or LW or else­where. It is also help­ful for writ­ing my and the , as I can read through my book excerpts to remind myself of the high­lights and at the end of the month review clip­pings from papers/webpages to find good things to reshare which I was too busy at the time to do so or was unsure of its impor­tance. I don’t make any use of more com­plex Ever­note fea­tures.

    I periodically back up my Evernote using the Linux client Nixnote’s export feature. (I made sure there was a working export method before I began using Evernote, and use it only as long as Nixnote continues to work.)

    My workflow for dealing with PDFs, as of late 2014, is:

    1. if necessary, jailbreak the paper using Libgen or a university proxy, then upload a copy to Dropbox, named year-author.pdf
    2. read the paper, making excerpts as I go
    3. store the metadata & excerpts in Evernote
    4. if useful, integrate into Gwern.net with its title/year/author metadata, adding a local fulltext copy if the paper had to be jailbroken, otherwise rely on my custom archiving setup to preserve the remote URL
    5. hence, any future searches for the filename / title / key contents should result in hits either in my Evernote or Gwern.net
  3. Web pages are archived & backed up by . This is intended mostly for fix­ing dead links (eg to recover the full­text of the orig­i­nal URL of an Ever­note clip­ping).

  4. I don’t have any spe­cial book read­ing tech­niques. For really good books I excerpt from each chap­ter and stick the quotes into Ever­note.

  5. I store insights and thoughts in var­i­ous pages as par­en­thet­i­cal com­ments, foot­notes, and appen­dices. If they don’t fit any­where, I dump them in .

  6. Larger masses of cita­tions and quotes typ­i­cally get turned into pages.

  7. I make heavy use of RSS sub­scrip­tions for news. For that, I am cur­rently using . (Not that I’m hugely thrilled about it. Google Reader was much bet­ter.)

  8. For projects and fol­lowups, I use reminders in Google Cal­en­dar.

  9. For record­ing per­sonal data, I auto­mate as much as pos­si­ble (eg Zeo and arbtt) and I make a habit of the rest—get­ting up in the morn­ing is a great time to build a habit of record­ing data because it’s a time of habits like eat­ing break­fast and get­ting dressed.

Hence, to refind information, I use a combination of Google, Evernote, grep (on the Gwern.net files), occasionally Mnemosyne, and a good visual memory.
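
The grep step, for example, is nothing fancier than a recursive search over the page sources and note exports; a minimal illustration (the paths and file names here are assumptions for the example):

# search the Markdown page sources for a half-remembered phrase
grep -rin --include='*.page' 'dual n-back' ~/wiki/
# the same trick works on an exported Evernote/Nixnote archive
grep -i 'dual n-back' ~/backups/evernote-export.nnex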

As far as writ­ing goes, I do not use note-tak­ing soft­ware or things like or —not that I think they are use­less but I am wor­ried about whether they would ever repay the large upfront invest­ments of learning/tweaking or inter­fere with other things. Instead, I occa­sion­ally com­pile out­lines of arti­cles from com­ments on LW/Reddit/IRC, keep edit­ing them with stuff as I remem­ber them, search for rel­e­vant parts, allow lit­tle thoughts to bub­ble up while med­i­tat­ing, and pay atten­tion to when I am irri­tated at peo­ple being wrong or annoyed that a par­tic­u­lar topic has­n’t been writ­ten down yet.

Confidence tags

Most of the meta­data in each page is self­-ex­plana­to­ry: the date is the last time the page was mean­ing­fully mod­i­fied33, the tags are cat­e­go­riza­tion, etc. The “sta­tus” tag describes the state of com­ple­tion: whether it’s a pile of links & snip­pets & “notes”, or whether it is a “draft” which at least has some struc­ture and con­veys a coher­ent the­sis, or it’s a well-de­vel­oped draft which could be described as “in progress”, and finally when a page is done—in lieu of addi­tional mate­r­ial turn­ing up—it is sim­ply “fin­ished”.

The “confidence” tag is a little more unusual. I stole the idea from Muflax’s “epistemic state” tags; I use the same meaning for “log” for collections of data or links (“log entries that simply describe what happened without any judgment or reflection”); personal or reflective writing can be tagged “emotional” (“some cluster of ideas that got itself entangled with a complex emotional state, and I needed to externalize it to even look at it; in no way endorsed, but occasionally necessary (similar to fiction)”), and “fiction” needs no explanation (every author has some reason for writing the story or poem they do, but not even they always know whether it is an expression of their deepest fears, desires, history, or simply random thoughts). I drop his other tags in favor of giving my subjective probability using the :

  1. “cer­tain”
  2. “highly likely”
  3. “likely”
  4. “possible” (my preference over Kesselman’s “Chances a Little Better [or Less]”)
  5. “unlikely”
  6. “highly unlikely”
  7. “remote”
  8. “impossible”

These are used to express my feeling about how well-supported the essay is, or how likely it is the overall ideas are right. (Of course, an interesting idea may be worth writing about even if very wrong, and even a long shot may be profitable to examine if the potential payoff is large enough.)

Importance tags

An additional useful bit of metadata would be a distinction between things which are trivial and those which are about more important topics which might change your life. Using , I’ve ranked pages in deciles from 0–10 on how important the topic is to myself, the intended reader, or the world. For example, topics like or are vastly more important, and would be ranked 10, than some poems or a dream or someone’s small nootropics self-experiment, which would be ranked 0–1.

Writing checklist

It turns out that writing essays (technical or philosophical) is a lot like writing code—there are so many ways to err that you need a process with as much automation as possible. My current checklist for finishing an essay:

Markdown checker

I’ve found that many errors in my writing can be caught by some simple scripts, which I’ve compiled into a shell script, markdown-lint.sh.

My linter does:

  1. checks for corrupted non-text binary files

  2. checks a blacklist of domains which are either dead (eg Google+) or have a history of being unreliable (eg ResearchGate, NBER, PNAS); such links need34 to either be fixed, pre-emptively mirrored, or removed entirely.

    • a special case is PDFs hosted on IA; the IA is reliable, but I try to rehost such PDFs so they’ll show up in Google/Google Scholar for everyone else.
  3. Broken syntax: I’ve noticed that when I make Markdown syntax errors, they tend to be predictable and show up either in the original Markdown source, or in the rendered HTML. Two common source errors:

     "(www"
     ")www"

    And the following should rarely show up in the final rendered HTML:

     "\frac"
     "\times"
     "(http"
     ")http"
     "[http"
     "]http"
     " _ "
     "[^"
     "^]"
     "<!--"
     "-->"
     "<-- "
     "<-"
     "->"
     "$title$"
     "$description$"
     "$author$"
     "$tags$"
     "$category$"

    Similarly, I sometimes slip up in writing image/document links, so any link starting with https://www.gwern.net or ~/wiki/ or /home/gwern/ is probably wrong. There are a few Pandoc-specific issues that should be checked for too, like duplicate footnote names, images without separating newlines, or unescaped dollar signs (which can accidentally lead to sentences being rendered as TeX).

    A final pass with htmltidy finds many errors which slip through, like incorrectly-escaped URLs.

  4. Flag dangerous language: Imperial units are deprecated, but so too is the misleading language of NHST statistics (if one must talk of “significance” I try to flag it as “statistically-significant” to warn the reader). I also avoid some other dangerous words like “obvious” (if it really is, why do I need to say it?).

  5. Bad habits:

    • proselint (with some checks disabled because they play badly with Markdown documents)
    • Another static warning is checking for too-long lines (most common in code blocks, although sometimes broken indentation will cause this) which will cause browsers to use scrollbars, for which I’ve written a Pandoc script,
    • one for a bad habit of mine—too-long footnotes
  6. duplicate and hidden-PDF URLs: a URL being linked multiple times is sometimes an error (too much copy-paste or insufficiently edited sections); PDF URLs should receive a visual annotation warning the reader it’s a PDF, but the CSS rules, which catch cases like .pdf$, don’t cover cases where the host quietly serves a PDF anyway, so all URLs are checked. (A URL which is a PDF can be made to trigger the PDF rule by appending #pdf.)

  7. broken links are detected with linkchecker. The best time to fix broken links is when you’re already editing a page.

While this throws many false positives, those are easy to ignore, and the script fights bad habits of mine while giving me much greater confidence that a page doesn’t have any merely technical issues that screw it up (without requiring me to constantly reread pages every time I modify them, lest an accidental typo while making an edit breaks everything).
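
To give the flavor of these checks, a stripped-down sketch (this is not the actual markdown-lint.sh; the file paths and abbreviated pattern lists are illustrative only):

PAGE="$1"                        # a Markdown source file
HTML="${PAGE%.page}.html"        # its compiled HTML
# unreliable or dead domains which should be fixed, mirrored, or removed
grep -E 'plus\.google\.com|researchgate\.net' "$PAGE" && echo "WARNING: unreliable domain"
# predictable Markdown syntax slips in the source
grep -F -e '(www' -e ')www' "$PAGE" && echo "WARNING: malformed link"
# markup which should rarely survive into the rendered HTML
grep -F -e '\frac' -e '\times' -e '(http' -e '[http' -e '$title$' "$HTML" && echo "WARNING: unrendered markup"
# broken links (slow, so run less often)
linkchecker "$HTML"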

Anonymous feedback

Back in Novem­ber 2011, luke­prog posted “Tell me what you think of me” where he described his use of a Google Docs form for anony­mous receipt of tex­tual feed­back or com­ments. Typ­i­cal­ly, most forms of com­mu­ni­ca­tion are non-anony­mous, or if they are anony­mous, they’re pub­lic. One can set up pseu­do­nyms and use those for pri­vate con­tact, but it’s not always that easy, and is defi­nitely a series of (if anony­mous feed­back is not solicit­ed, one has to feel it’s impor­tant enough to do and vio­late implicit norms against anony­mous mes­sages; one has to set up an iden­ti­ty; one has to com­pose and send off the mes­sage, etc).

I thought it was a good idea to try out, and on 2011-11-08, I set up my own anony­mous feed­back form and stuck it in the footer of all pages on Gwern.net where it remains to this day. I did won­der if any­one would use the form, espe­cially since I am easy to con­tact via email, use mul­ti­ple sites like Red­dit or Less­wrong, and even my Dis­qus com­ments allow anony­mous com­ments—so who, if any­one, would be using this form? I sched­uled a fol­lowup in 2 years on 2013-11-30 to review how the form fared.

754 days, 2.884m page views, and 1.350m unique visitors later, I have received 116 pieces of feedback (a mean of 24.8k visits per feedback). I categorize them as follows, in descending order of frequency:

  • Corrections, problems (technical or otherwise), suggested edits: 34
  • Praise: 31
  • Question/request (personal, tech support, etc): 22
  • Misc (eg gibberish, socializing, Japanese): 13
  • Criticism: 9
  • News/suggestions: 5
  • Feature request: 4
  • Request for cybering: 1
  • Extortion: 1 (see my blackmail page dealing with the September 2013 incident)

Some submissions cover multiple angles (they can be quite long), sometimes people double-submitted or left it blank, etc, so the numbers won’t sum to 116.

In general, a lot of the corrections were usable and fixed issues of varying importance, from typos to the entire site’s CSS being broken due to being uploaded with the wrong MIME type. One of the news/suggestion feedbacks was very valuable, as it led to writing the Silk Road mini-essay “A Mole?” A lot of the questions were a waste of my time; I’d say half related to Tor/Bitcoin/Silk-Road. (I also got an irritating number of emails from people asking me to, say, buy LSD or heroin off SR for them.) The feature requests were usually for a better RSS feed, which I tried to oblige by starting the page. The cybering and extortion were amusing, if nothing else. The praise was good for me mentally, as I don’t interact much with people.

I consider the anonymous feedback form to have been a success; I’m glad lukeprog brought it up on LW, and I plan to keep the feedback form indefinitely.

Feedback causes

One thing I won­dered is whether feed­back was purely a func­tion of traffic (the more vis­its, the more peo­ple who could see the link in the footer and decide to leave a com­men­t), or more related to time (per­haps peo­ple return­ing reg­u­larly and even­tu­ally being embold­ened or notic­ing some­thing to com­ment on). So I com­piled daily hits, com­bined with the feed­back dates, and looked at a graph of hits:

Hits over time for Gwern.net

The hits are heavily skewed by Hacker News & Reddit traffic spikes, and probably should be log-transformed. Then I did a logistic regression on hits, log hits, and a simple time index:

feedback <- read.csv("https://www.gwern.net/docs/traffic/2013-gwernnet-anonymousfeedback.csv",
                     colClasses=c("Date","logical","integer"))
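# one row per day: the date (Day), whether any feedback was submitted (Feedback), and page views (Visits)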
plot(Visits ~ Day, data=feedback)
feedback$Time <- 1:nrow(feedback)
summary(step(glm(Feedback ~ log(Visits) + Visits + Time, family=binomial, data=feedback)))
# ...
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -7.363507   1.311703   -5.61  2.0e-08
# log(Visits)  0.749730   0.173846    4.31  1.6e-05
# Time        -0.000881   0.000569   -1.55     0.12
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 578.78  on 753  degrees of freedom
# Residual deviance: 559.94  on 751  degrees of freedom
# AIC: 565.9

Logged hits work out better than regular hits, and survive into the simplified model. And the traffic influence seems much larger than the time variable (which is, curiously, negative).

Technical aspects

Popularity

On a semi­-an­nual basis, since 2011, I review Gwern.net web­site traffic using Google Ana­lyt­ics; although what most read­ers value is not what I val­ue, I find it moti­vat­ing to see total traffic sta­tis­tics remind­ing me of read­ers (writ­ing can be a lonely and abstract endeav­our), and use­ful to see what are major refer­rers.

Gwern.net typ­i­cally enjoys steady traffic in the 50–100k range per mon­th, with occa­sional spikes from social media, par­tic­u­larly Hacker News; over the first decade (2010–2020), there were 7.98m pageviews by 3.8m unique users.

See

Colophon

Hosting

Gwern.net is served by through the . (Amazon charges less for bandwidth and disk space than NFSN, although one loses all the capabilities offered by Apache’s , and compression is difficult so must be handled by CloudFlare; total costs may turn out to be a wash and I will consider the switch to Amazon S3 a success if it can bring my monthly bill to <$10 or <$120 a year.)
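
Deployment to static S3 hosting amounts to syncing the compiled site into the bucket; a minimal sketch, assuming the AWS CLI and an illustrative output directory and bucket name (not my exact setup):

# upload the compiled static site, deleting files removed locally
aws s3 sync ./_site/ s3://www.gwern.net/ --delete
# example of setting an explicit content type & cache policy on one file
aws s3 cp ./_site/index.html s3://www.gwern.net/index.html \
    --content-type 'text/html; charset=utf-8' --cache-control 'max-age=3600'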

From Octo­ber 2010 to June 2012, the site was hosted on Near­lyFreeSpeech.net, an old host­ing com­pa­ny; its spe­cific niche is con­tro­ver­sial mate­r­ial and activist-friendly pric­ing. Its lib­er­tar­ian own­ers cast a jaun­diced eye on s, and pric­ing is pay-as-y­ou-go. I like the for­mer aspect, but the lat­ter sold me on NFSN. Before I stum­bled on NFSN (some­one men­tioned it offhand­edly while chat­ting), I was get­ting ready to pay $10–15 a month ($120 year­ly) to . Lin­ode’s offer­ings are overkill since I do not run dynamic web­sites or some­thing like Haskel­l.org (with wikis and mail­ing lists and repos­i­to­ries), but I did­n’t know a good alter­na­tive. NFSN’s pric­ing meant that I paid for usage rather than large flat fees. I put in $32 to cover reg­is­ter­ing Gwern.net until 2014, and then another $10 to cover band­width & stor­age price. DNS aside, I was billed $8.27 for Octo­ber-De­cem­ber 2010; DNS includ­ed, Jan­u­ary-April 2011 cost $10.09. $10 cov­ered months of Gwern.net for what I would have paid Lin­ode in 1 mon­th! In total, my 2010 costs were $39.44 (bill archive); my 2011 costs were $118.32 ($9.86 a mon­th; archive); and my 2012 costs through June were $112.54 ($21 a mon­th; archive); sum total: $270.3.

The switch to Ama­zon S3 host­ing is com­pli­cated by my simul­ta­ne­ous addi­tion of Cloud­Flare as a CDN; my total June 2012 Ama­zon bill is $1.62, with $0.19 for stor­age. Cloud­Flare claims it cov­ered 17.5GB of 24.9GB total band­width, so the $1.41 rep­re­sents 30% of my total band­width; mul­ti­ply 1.41 by 3 is 4.30, and my hypo­thet­i­cal non-Cloud­Flare S3 bill is ~$4.5. Even at $10, this was well below the $21 monthly cost at NFSN. (The traffic graph indi­cates that June 2012 was a rel­a­tively quiet peri­od, but I don’t think this elim­i­nates the fac­tor of 5.) From July 2012 to June 2013, my Ama­zon bills totaled $60, which is rea­son­able except for the steady increase ($1.62/$3.27/$2.43/$2.45/$2.88/$3.43/$4.12/$5.36/$5.65/$5.49/$4.88/$8.48/$9.26), being pri­mar­ily dri­ven by out­-bound band­width (in June 2013, the $9.26 was largely due to the 75GB trans­ferred—and that was after Cloud­Flare dealt with 82G­B); $9.26 is much higher than I would pre­fer since that would be >$110 annu­al­ly. This was prob­a­bly due to all the graph­ics I included in the “Google shut­downs” analy­sis, since it returned to a more rea­son­able $5.14 on 42GB of traffic in August. Sep­tem­ber, Octo­ber, Novem­ber and Decem­ber 2013 saw high lev­els main­tained at $7.63/$12.11/$5.49/$8.75, so it’s prob­a­bly a new nor­mal. 2014 entailed new costs related to EC2 instances & S3 band­width spikes due to host­ing a mul­ti­-gi­ga­byte sci­en­tific dataset, so bills ran $8.51/$7.40/$7.32/$9.15/$26.63/$14.75/$7.79/$7.98/$8.98/$7.71/$7/$5.94. 2015 & 2016 were sim­i­lar: $5.94/$7.30/$8.21/$9.00/$8.00/$8.30/$10.00/$9.68/$14.74/$7.10/$7.39/$8.03/$8.20/$8.31/$8.25/$9.04/$7.60/$7.93/$7.96/$9.98/$9.22/$11.80/$9.01/$8.87. 2017 saw costs increase due to one of my side-pro­jects, aggres­sively increas­ing full­tex­ting of Gwern.net by pro­vid­ing more papers & scan­ning cited books, only par­tially off­set by changes like lossy opti­miza­tion of images & con­vert­ing GIFs to WebMs: $12.49/$10.68/$11.02/$12.53/$11.05/$10.63/$9.04/$11.03/$14.67/$15.52/$13.12/$12.23 (to­tal: $144.01). In 2018, I con­tin­ued full­tex­ting: $13.08/$14.85/$14.14/$18.73/$18.88/$15.92/$15.64/$15.27/$16.66/$22.56/$23.59/$25.91/(total: $213).

For 2019, I made a deter­mined effort to host more things, includ­ing whole web­sites like the OKCupid archives or rotten.com, and to include more images/videos (the StyleGAN anime faces tuto­r­ial alone must be eas­ily 20MB+ just for images) and it shows in how my band­width costs explod­ed: $26.49/$37.56/$37.56/$37.56/$25.00/$25.00/$25.00/$25.00/$77.91/$124.45/$74.32/$79.19. I’ve begun con­sid­er­ing a move of Gwern.net to my Het­zner ded­i­cated server which has cheap band­width, com­bined with upgrad­ing my Cloud­flare CDN to keep site latency in check (even at $20/month, it’s still far cheaper than AWS S3 band­width).

Source

The revision history is kept in git; individual page sources can be read by appending .page to their URL.

Size

As of 2020-01-07, the source of Gwern.net is composed of >366 text files with >3.76m words or >27MB; this includes my writings & documents I have transcribed into Markdown, but excludes images, PDFs, HTML mirrors, source code, archives, infrastructure, popups, and the revision history. With those included and everything compiled to the static35 HTML, the site is >18.3GB. The source repository contains >13,323 patches (this is an under-count, as the creation of the repository on 2008-09-26 included already-written material).

Design

“People who are really serious about software should make their own hardware.”

, “Creative Think” 1982

The great sorrow of web design & typography is that how you present your pages can only matter a little. A page can be terribly designed and render as typewriter text in 80-column ASCII monospace, and readers will still read it, even if they complain about it. And the most tastefully-designed page, with true smallcaps and correct use of em-dashes vs en-dashes vs hyphens vs minuses and all, which loads in a fraction of a second and is SEO-optimized, is of little avail if the page has nothing worth reading; no amount of typography can rescue a page of dreck. Perhaps 1% of readers could even name any of these details, much less recognize them. If we added up all the small touches, they surely make a difference to the reader’s happiness, but it would have to be a small one—say, 5%.36 It’s hardly worth it for writing just a few things.

But the great joy of web design & typography is that presentation can matter a little to all your pages at once. Writing is hard work, and any new piece of writing will generally add to the pile of existing ones, rather than multiplying it all; it’s an enormous amount of work to go through all one’s existing writings and improve them somehow, so it usually doesn’t happen. Design improvements, on the other hand, benefit one’s entire website & all future readers, and so at a certain scale, can be quite useful. I feel I’ve reached the point where it’s worth sweating the small stuff, typographically.

Principles

There are 4 design principles:

  1. Aes­thet­i­cal­ly-pleas­ing Min­i­mal­ism

    The design esthetic is min­i­mal­ist. I believe that helps one focus on the con­tent. Any­thing besides the con­tent is dis­trac­tion and not design. ‘Atten­tion!’, as would say37.

    The palette is delib­er­ately kept to grayscale as an exper­i­ment in con­sis­tency and whether this con­straint per­mits a read­able aes­thet­i­cal­ly-pleas­ing web­site. Var­i­ous clas­sic typo­graph­i­cal tools, like and are used for empha­sis.

  2. Acces­si­bil­ity &

    Seman­tic markup is used where Mark­down per­mits. JavaScript is not required for the core read­ing expe­ri­ence, only for optional fea­tures: com­ments, table-sort­ing, , and so on. Pages can even be read with­out much prob­lem in a smart­phone or a text browser like elinks.

  3. Speed & Effi­ciency

    On an increas­ing­ly-bloated Inter­net, a web­site which is any­where remotely as fast as it could be is a breath of fresh air. Read­ers deserve bet­ter. Gwern.net uses many tricks to offer nice fea­tures like side­notes or LaTeX math at min­i­mal cost.

  4. Struc­tural Read­ing

    How should we present texts online? A web page, unlike many medi­ums such as print mag­a­zi­nes, lets us pro­vide an unlim­ited amount of text. We need not limit our­selves to overly con­cise con­struc­tions, which coun­te­nance con­tem­pla­tion but not con­vic­tion.

    The problem then becomes taming complexity and length, lest we hang ourselves with our own rope. Some readers want to read every last word about a particular topic, while most readers want the summary or are skimming through on their way to something else. A tree structure is helpful in organizing the concepts, but doesn’t solve the presentation problem: a book or article may be hierarchically organized, but it still must present every last leaf node at 100% size. Tricks like footnotes or appendices only go so far—having thousands of endnotes or 20 appendices to tame the size of the ‘main text’ is unsatisfactory, as while any specific reader is unlikely to want to read any specific appendix, they will certainly want to read an appendix & possibly many. The classic hypertext paradigm of simply having a rat’s-nest of links to hundreds of tiny pages to avoid any page being too big also breaks down, because how granular does one want to go? Should every section be a separate page? (Anyone who attempted to read a manual knows how tedious that can be, where each page may be a single paragraph or sentence, and it’s not clear that it’s much better than the other extreme, the monolithic which includes every detail under the sun and is impossible to navigate without one’s eyes glazing over even using .) What about every reference in the bibliography, should there be 100 different pages for 100 different references?

    A web page, how­ev­er, can be dynam­ic. The solu­tion to the length prob­lem is to pro­gres­sively expose more beyond the default as the user requests it, and make request­ing as easy as pos­si­ble. For lack of a well-known term and by anal­ogy to in /, I call this struc­tural read­ing: the hier­ar­chy is made vis­i­ble & mal­leable to allow read­ing at mul­ti­ple lev­els of the struc­ture.

    A Gwern.net page can be read at mul­ti­ple struc­tural lev­els: title, meta­data block, abstracts, mar­gin notes, empha­sized key­words in list items, footnotes/sidenotes, col­lapsi­ble sec­tions, popup link anno­ta­tions, and full­text links or inter­nal links to other pages. So the reader can read (in increas­ing depth) the title/metadata, or the page abstract, or skim the headers/Table of Con­tents, then skim mar­gin notes+item sum­maries, then read the body text, then click to uncol­lapse regions to read in-depth sec­tions too, and then if they still want more, they can mouse over ref­er­ences to pull up the abstracts or excerpts, and then they can go even deeper by click­ing the full­text link to read the full orig­i­nal. Thus, a page may look short, and the reader can under­stand & nav­i­gate it eas­i­ly, but like an ice­berg, those read­ers who want to know more about any spe­cific point will find much more under the sur­face.

Features

Notable features (compared to a standard Markdown static site):

  • using both margins, fallback to floating footnotes

  • code folding (collapsible sections/code blocks/tables)

  • JS-free LaTeX math rendering

  • Link popup annotations:

    Annotations are hand-written, and automatically extracted from Wikipedia/Arxiv/BioRxiv/MedRxiv/gwern.net/Crossref.

  • dark mode (with a )

  • click-to-zoom images & slideshows; full-width tables/images

  • Disqus comments

  • sortable tables; tables of various sizes

  • automatically inflation-adjust dollar amounts, exchange-rate Bitcoin amounts

  • link icons for filetype/domain/topic

  • infoboxes (Wikipedia-like by way of Markdeep)

  • lightweight drop caps

  • epigraphs

  • TeX-like hyphenation for justified text (especially on Chrome); automatic smallcaps typesetting

  • 2-column lists

  • interwiki link syntax

Much of Gwern.net design and JS/CSS was developed by Said Achmiz, 2017–2020. Some inspiration has come from Tufte CSS & Matthew Butterick’s Practical Typography.

Abandoned

Worth noting are things I tried but abandoned (in roughly chronological order):

  • Gitit wiki: I pre­ferred to edit files in Emacs/Bash rather than a GUI/browser-based wiki.

    A Pan­doc-based wiki using Darcs as a his­tory mech­a­nism, serv­ing mostly as a demo; the require­ment that ‘one page edit = one Darcs revi­sion’ quickly became sti­fling, and I began edit­ing my Mark­down files directly and record­ing patches at the end of the day, and sync­ing the HTML cache with my host (at the time, a per­sonal direc­tory on code.haskell.org). Even­tu­ally I got tired of that and fig­ured that since I was­n’t using the wiki, but only the sta­tic com­piled pages, I might as well switch to Hakyll and a nor­mal sta­tic web­site approach.

  • jQuery sausages: unhelp­ful UI visu­al­iza­tion of sec­tion lengths.

    A UI exper­i­ment, ‘sausages’ add a sec­ond scroll bar where ver­ti­cal lozenges cor­re­spond to each top-level sec­tion of the page; it indi­cates to the reader how long each sec­tion is and where they are. (They look like a long link of pale white sausages.) I thought it might assist the reader in posi­tion­ing them­selves, like the pop­u­lar ‘float­ing high­lighted Table of Con­tents’ UI ele­ment, but with­out text labels, the sausages were mean­ing­less. After a jQuery upgrade broke it, I did­n’t bother fix­ing it.

  • Bee­line Reader: a ‘read­ing aid’ which just annoyed read­ers.

    BLR tries to aid read­ing by col­or­ing the begin­nings & end­ings of lines to indi­cate the con­tin­u­a­tion and make it eas­ier for the read­er’s eyes to sac­cade to the cor­rect next line with­out dis­trac­tion (ap­par­ently dyslexic read­ers in par­tic­u­lar have issue cor­rectly fix­at­ing on the con­tin­u­a­tion of a line). The A/B test indi­cated no improve­ments in the time-on-page met­ric, and I received many com­plaints about it; I was not too happy with the browser per­for­mance or the appear­ance of it, either.

    I’m sym­pa­thetic to the goal and think syn­tax high­light­ing aids are under­used, but BLR was a bit half-baked and not worth the cost com­pared more straight­for­ward inter­ven­tions like reduc­ing para­graph lengths or more rig­or­ous use of ‘struc­tural read­ing’ for­mat­ting. (We may be able to do typog­ra­phy very differ­ently in the future with new tech­nol­o­gy, like VR/AR head­sets which come with tech­nol­ogy intended for —for­get sim­ple tricks like empha­siz­ing the begin­ning of the next line as the reader reaches the end of the cur­rent line, do we need ‘lines’ at all if we can do things like just-in-time dis­play the next piece of text in-place to cre­ate an ‘infi­nite line’?)

  • : site search fea­ture which too few peo­ple used.

    A ‘custom search engine’ (CSE) is a souped-up site:gwern.net/ Google search query; I wrote one covering gwern.net and some of my accounts on other websites, and added it to the sidebar. Checking the analytics, perhaps 1 in 227 page-views used the CSE, and a decent number of them used it only by accident (eg searching “e”); an A/B test of a feature used so little would be powerless, and so I removed it rather than try to formally test it.

  • Tufte-CSS side­notes: fun­da­men­tally bro­ken, and super­seded.

    An early admirer of Tufte-CSS for its side­notes, I gave a Pan­doc plu­gin a try only to dis­cover a ter­ri­ble draw­back: the CSS did­n’t sup­port block ele­ments & so the plu­gin sim­ply deleted them. This bug appar­ently can be fixed, but the den­sity of foot­notes led to using sidenotes.js instead.

  • document format use: DjVu is a space-efficient document format with the fatal drawback that Google ignores it, and “if it’s not in Google, it doesn’t exist.”

    DjVu is a doc­u­ment for­mat supe­rior to PDFs, espe­cially stan­dard PDFs: I dis­cov­ered that space sav­ings of 5× or more were entirely pos­si­ble, so I used it for most of my book scans. It worked fine in my doc­u­ment view­ers, Inter­net Archive & Lib­gen pre­ferred them, and so why not? Until one day I won­dered if any­one was link­ing them and tried search­ing in Google Scholar for some. Not a sin­gle hit! (As it hap­pens, GS seems to specifi­cally fil­ter out book­s.) Per­plexed, I tried Google—also noth­ing. Huh‽ My scans have been vis­i­ble for years, DjVu dates back to the 1990s and was widely used (if not remotely as pop­u­lar as PDF), and G/GS picks up all my PDFs which are hosted iden­ti­cal­ly. What about filetype:djvu? I dis­cov­ered to my hor­ror that on the entire Inter­net, Google indexed about 50 DjVu files. Total. While appar­ently at one time Google did index DjVu files, that time must be long past.

    Loath to take the space hit, which would noticeably increase my Amazon AWS S3 hosting costs, I looked into PDFs more carefully. I discovered PDF technology had advanced considerably over the default PDFs that gscan2pdf generates, and with compression, they were closer to DjVu in size; I could conveniently generate such PDFs using ocrmypdf.38 This let me convert over at moderate cost, and now my documents do show up in Google.
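
    One plausible sketch of such a conversion (the flags and filenames here are illustrative, not my exact pipeline):

     ddjvu -format=pdf book-scan.djvu book-scan-raw.pdf   # render the DjVu scan to an intermediate PDF
     ocrmypdf --skip-text --optimize 3 --jbig2-lossy book-scan-raw.pdf book-scan.pdf   # OCR & recompress into a small, indexable PDF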

  • Darcs/Github repo: no use­ful con­tri­bu­tions or patches sub­mit­ted, added con­sid­er­able process over­head, and I acci­den­tally broke the repo by check­ing in too-large PDFs from a failed post-D­jVu opti­miza­tion pass (I mis­read the result as being small­er, when it was much larg­er).

  • spaces in URLs: an OK idea but users are why we can’t have nice things.

    Gitit assumed ‘titles = file­names = URLs’, which sim­pli­fied things and I liked spaced-sep­a­rated file­names; I car­ried this over to Hakyll, but grad­u­al­ly, by mon­i­tor­ing ana­lyt­ics real­ized this was a ter­ri­ble mis­take—as straight­for­ward as URL-encoding spaces as %20 may seem to be, no one can do it prop­er­ly. I did­n’t want to fix it because by the time I real­ized how bad the prob­lem was, it would have required break­ing, or later on, redi­rect­ing, hun­dreds of URLs and updat­ing all my pages. The final straw was when The Browser linked a page incor­rect­ly, send­ing ~1500 peo­ple to the 404 page. I finally gave in and replaced spaces with hyphens. (Un­der­scores are the other main option but because of Mark­down, I worry that trades one error for anoth­er.) I sus­pect I should have also low­er-cased all links while I was at it, but thus far it has not proven too hard to fix case errors & low­er-case URLs are ugly.

  • banner ads (and ads in general): reader-hostile and probably a net financial loss.

    I hated running banner ads, but before my Patreon began working, it seemed the lesser of two evils. As my finances became less parlous, I became curious as to how much lesser—but I could find no Internet research whatsoever measuring something as basic as the traffic loss due to advertising! So I decided to measure it myself, with a proper sample size and cost-benefit analysis; the harm turned out to be so large that the analysis was unnecessary, and I removed AdSense permanently the first time I saw the results. Given the measured traffic reduction, I was probably losing several times more in potential donations than I ever earned from the ads. (Amazon affiliate links appear to not trigger this reaction, and so I’ve left them alone.)

  • Bitcoin/PayPal/Gittip/Flattr dona­tion links: never worked well com­pared to Patre­on.

    These meth­ods were either sin­gle-shot or never hit a crit­i­cal mass. One-off dona­tions failed because peo­ple would­n’t make a habit if it was man­u­al, and it was too incon­ve­nient. Gittip/Flattr were sim­i­lar to Patreon in bundling dona­tors, and mak­ing it a reg­u­lar thing, but never hit an ade­quate scale.

  • web fonts: slow and bug­gy.

    Google Fonts turned out to intro­duce notice­able latency in page ren­der­ing; fur­ther, its selec­tion of fonts is lim­it­ed, and the fonts out­dated or incom­plete. We got both faster and nicer-look­ing pages by tak­ing the mas­ter Github ver­sions of Adobe Source Serif/Sans Pro (the Google Fonts ver­sion was both out­dated & incom­plete then) and sub­set­ting them for gwern.net specifi­cal­ly.

  • MathJax JS: switched to static rendering during compilation for speed.

    For math rendering, MathJax and are reasonable options (inasmuch as browser adoption is dead in the water). MathJax rendering is extremely slow on some pages: up to 6 seconds to load and render all the math. Not a great reading experience. When I learned that it was possible to preprocess MathJax-using pages, I dropped MathJax JS use the same day.

  • <q> quote tags for English quotations: divisive and a maintenance burden.

    I like the idea of treating English as a little (not a lot!) more like a formal language, such as a programming language, since that comes with benefits like syntax highlighting. In a program, the reader gets guidance from syntax highlighting indicating the logical nesting and structure of the ‘argument’; in a natural-language document, it’s one damn letter after another, spiced up with the occasional punctuation mark or indentation. (If Lisp looks like “oatmeal with fingernail clippings mixed in” because it has so little syntax, then English must be plain oatmeal!) One of the most basic kinds of syntax highlighting is simply highlighting strings and other literals versus code: I learned early on that syntax highlighting was worth it just to make sure you hadn’t forgotten a quote or parenthesis somewhere. The same is true of regular writing: if you are extensively quoting or naming things, the reader can get a bit lost in the thickets of curly quotes and be unsure who said what.

    I discovered an obscure HTML tag enabled by an obscurer Pandoc setting: the quote tag <q>, which replaces quote characters and is rendered by the browser as quotes (usually). Quote tags are parsed explicitly, rather than being opaque natural-language text blobs, and so they, at least, can be manipulated easily by JS/CSS and syntax-highlighted. Anything inside a pair of quotes would be tinted gray to visually set it off, similarly to the blockquotes. I was proud of this tweak, which I’ve seen nowhere else.

    The problems with it were that not everyone was a fan (to say the least); it was not always correct (there are many double-quotes which are not literal quotes of anything, like rhetorical questions); and it interacted badly with everything else. The HTML/CSS/JS all had to be constantly rejiggered to deal with interactions with quotes, browser updates would silently break what was working, and Said Achmiz hated the look. I tried manually annotating quotes to ensure they were all correct and not used in dangerous ways, but even with interactive regexp search-and-replace to assist, the manual toil of constantly marking up quotes was a major obstacle to writing. So I gave in.

  • rubrication: a solution in search of a problem.

    Red emphasis (rubrication) is a visual strategy that works wonderfully well for many styles, but not, as far as I could find, for gwern.net. Using it on the regular website resulted in too much emphasis, and the lack of color anywhere else made the design inconsistent; we tried using it in dark mode to add some color & preserve night vision by making headers/links/drop-caps red, but it looked like “a vampire fansite”, as one reader put it. It is a good idea; we just haven’t found a use for it. (Perhaps if I ever make another website, it will be designed around rubrication.)

  • wikipedia-popups.js: a JS library writ­ten to imi­tate Wikipedia pop­ups, which used the WP API to fetch arti­cle sum­maries; obso­leted by the faster & more gen­eral local sta­tic link anno­ta­tions.

    I disliked the delay, and as I thought about it, it occurred to me that it would be nice to have popups for other websites, like Arxiv/BioRxiv links—but they didn’t have APIs which could be queried. If I fixed the first problem by fetching WP article summaries while compiling articles and inlining them into the page, then there was no reason to include summaries only for Wikipedia links: I could get summaries from any tool or service or API, and I could of course write my own. But that required an almost complete rewrite to turn it into popups.js.

  • link screen­shot pre­views: auto­matic screen­shots too low-qual­i­ty, and unpop­u­lar.

    To compensate for the lack of summaries for almost all links (even after I wrote the code to scrape various sites), I tried a feature I had seen elsewhere: ‘link previews’, small thumbnail-sized screenshots of a web page or PDF, loaded using JS when the mouse hovered over a link. (They were much too large, ~50kb each, to inline statically like the link annotations.) They gave some indication of what the target content was, and could be generated automatically using a headless browser: I used Chromium’s built-in screenshot mode, and simply took the first page of PDFs to convert to PNGs.

    The PDFs worked fine, but the web­pages often broke: thanks to ads, newslet­ters, and the GDPR, count­less web­pages will pop up some sort of giant modal block­ing any view of the page con­tent, defeat­ing the point. (I have exten­sions installed like AlwaysKill­Sticky to block that sort of spam, but Chrome screen­shot can­not use any exten­sions or cus­tomized set­tings, and the Chrome devs refuse to improve it.) Even when it did work and pro­duced a rea­son­able screen­shot, many read­ers dis­liked it any­way and com­plained. I was­n’t too happy either about hav­ing 10,000 tiny PNGs hang­ing around. So as I expanded link anno­ta­tions steadi­ly, I finally pulled the plug on the link pre­views. Too much for too lit­tle.

    • Link Archiv­ing: my link archiv­ing improved on the link screen­shots in sev­eral ways. First, Sin­gle­File saves pages inside a nor­mal Chromium brows­ing instance, which does sup­port exten­sions and user set­tings. Killing stick­ies alone elim­i­nates half the bad archives, ad block exten­sions elim­i­nate a chunk more, and NoScript black­lists spe­cific domains. (I ini­tially used NoScript on a whitelist basis, but dis­abling JS breaks too many web­sites these days.) Final­ly, I decided to man­u­ally review every snap­shot before it went live to catch bad exam­ples and either fix them by hand or add them to the black­list.

  • auto dark mode: a good idea but users are why we can’t have nice things.

    OSes/browsers have defined a ‘global dark mode’ toggle the user can set if they want dark mode everywhere, and this preference is available to a web page; if you are implementing a dark mode for your website, it then seems natural to just make it a feature and turn it on iff the toggle is on. There is no need for complicated UI-cluttering widgets. And yet—if you do that, users will regularly complain about the website acting bizarre or being dark in the daytime, having apparently forgotten that they enabled it (or never understood what that setting meant).

    A wid­get is nec­es­sary to give read­ers con­trol, although even there it can be screwed up: many web­sites set­tle for a sim­ple nega­tion switch of the global tog­gle, but if you do that, some­one who sets dark mode at day will be exposed to blind­ing white at night… Our wid­get works bet­ter than that. Most­ly.

  • mul­ti­-col­umn foot­notes: mys­te­ri­ously buggy and yield­ing over­laps.

    Since most foot­notes are short, and no one reads the end­note sec­tion, I thought ren­der­ing them as two columns, as many papers do, would be more space-effi­cient and tidy. It was a good idea, but it did­n’t work.

  • Hyphe­nop­oly: more effi­cient to hyphen­ate the HTML dur­ing com­pi­la­tion than run JS.

    To work around Google Chrome’s 2-decade-long refusal to ship hyphenation dictionaries on desktop and enable hyphenation at all (and, incidentally, to use the better TeX hyphenation algorithm), the JS library Hyphenopoly will download the TeX English dictionary and typeset a webpage itself. While the performance cost was surprisingly minimal, it was there, and it caused problems with obscurer browsers like Internet Explorer.

    So we scrapped Hyphenopoly, and I later implemented a Hakyll function using an implementation of the TeX hyphenation algorithm & dictionary to insert, at compile-time, a soft hyphen everywhere a browser could usefully break a word. This enables Chrome to hyphenate correctly, at the moderate cost of somewhat larger HTML and a few edge cases.39
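
    A minimal sketch of that compile-time approach, here using the off-the-shelf Knuth-Liang patterns from the Haskell hyphenation package (my actual Hakyll pass differs, and also has to skip code blocks, URLs, and other exceptions):

        import Data.List (intercalate)
        import Text.Hyphenation (english_US, hyphenate)

        -- Insert a soft hyphen (U+00AD) at every legal break point of every word,
        -- so browsers without hyphenation dictionaries can still break lines well.
        softHyphenate :: String -> String
        softHyphenate = unwords . map breakWord . words
          where breakWord w = intercalate "\173" (hyphenate english_US w)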

Tools

Soft­ware tools & libraries used in the site as a whole:

  • The source files are written in Pandoc Markdown (Pandoc: John MacFarlane et al; GPL) (source files: Gwern Branwen, CC-0). The Pandoc Markdown uses a number of extensions; pipe tables are preferred for anything but the simplest tables; and I use semantic linefeeds (also called “semantic line breaks” or “ventilated prose”) formatting.

  • math is written in LaTeX notation, rendered by MathJax (Apache)

  • the site is compiled with the Hakyll v4+ static site generator, used to generate Gwern.net, written in Haskell (Jasper Van der Jeugt et al; BSD); for the gory details, see hakyll.hs, which implements the compilation, RSS feed generation, & parsing of interwiki links as well. This just generates the basic website; I do many additional optimizations/tests before & after uploading, which is handled by sync-gwern.net.sh (Gwern Branwen, CC-0)

    My preferred method of use is to browse & edit locally using Emacs, and then distribute using Hakyll. The simplest way to use Hakyll is to cd into your repository and run runhaskell hakyll.hs build (with hakyll.hs having whatever options you like). Hakyll will build a static HTML/CSS hierarchy inside _site/; you can then do something like firefox _site/index.html. (Because HTML extensions are not specified in the interest of cool URIs, you cannot use the Hakyll watch webserver as of January 2014.) Hakyll’s main advantage for me is relatively straightforward integration with the Pandoc Markdown libraries; Hakyll is not that easy to use, and so I do not recommend Hakyll as a general static site generator unless one is already adept with Haskell.
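
    For readers who have never seen Hakyll, a stripped-down hakyll.hs looks roughly like this (a toy sketch with an illustrative file layout; the real one has accumulated many site-specific passes):

        {-# LANGUAGE OverloadedStrings #-}
        import Hakyll

        main :: IO ()
        main = hakyll $ do
            match "static/**" $ do            -- copy CSS/JS/fonts/images through unchanged
                route   idRoute
                compile copyFileCompiler
            match "templates/*" $ compile templateCompiler
            match "*.page" $ do               -- Markdown essays -> HTML via Pandoc
                route   (setExtension "html")
                compile $ pandocCompiler
                    >>= loadAndApplyTemplate "templates/default.html" defaultContext
                    >>= relativizeUrls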

  • the CSS is bor­rowed from a mot­ley of sources and has been heav­ily mod­i­fied, but its ori­gin was the Hakyll home­page & Gitit; for specifics, see default.css

  • Mark­down exten­sions:

    • I implemented a Pandoc Markdown plugin for a custom syntax for interwiki links in Gitit, and then ported it to Hakyll (defined in hakyll.hs); it allows linking to the English Wikipedia (among others) with syntax like [malefits](!Wiktionary) or [antonym of 'benefits'](!Wiktionary "Malefits") (a simplified sketch of such a filter appears after this list). CC-0.
    • inflation adjustment: a Pandoc Markdown plugin which allows automatic inflation-adjusting of dollar amounts, presenting the nominal amount & a current real amount, with a syntax like [$5]($1980).
    • Book affiliate links are handled through an Amazon affiliate tag appended in hakyll.hs
    • image dimen­sions are looked up at com­pi­la­tion time & inserted into <img> tags as browser hints
  • JavaScript:

    • Comments are outsourced to Disqus (since I am not interested in writing a dynamic system to do it, and their anti-spam techniques are much better than mine).
    • the float­ing foot­notes are via footnotes.js (Lukas Math­is, PD); when the browser win­dow is wide enough, the float­ing foot­notes are instead replaced with mar­ginal notes/side­notes40 using a cus­tom library, sidenotes.js (Said Achmiz, MIT)
    Demon­stra­tion of side­notes on .
    • the HTML tables are sortable via table­sorter (Chris­t­ian Bach; MIT/GPL)
    • the math is rendered using MathJax (prerendered at compile time; see Implementation Details below)
    • analytics are handled by Google Analytics
    • A/B testing is done using ABalytics (Daniele Mazzini; MIT), which hooks into Google Analytics for individual-level testing; when doing site-level long-term testing, I simply write the JS manually.
    • popups.js: for loading introductions/summaries of all links when one mouses over a link; reads statically-generated annotations automatically populated from many sources (Wikipedia, Pubmed, BioRxiv, Arxiv, hand-written…), with special handling of YouTube videos (Said Achmiz, Shawn Presser; MIT)
    • image size: ful­l-s­cale images (fig­ures) can be clicked on to zoom into them with slideshow mod­e—use­ful for fig­ures or graphs which do not com­fort­ably fit into the nar­row body—us­ing another cus­tom library, image-focus.js (Said Achmiz; GPL)
  • error check­ing: prob­lems such as bro­ken links are checked in 3 phas­es:

    • markdown-lint.sh: writ­ing time
    • sync-gwern.net.sh: dur­ing com­pi­la­tion, san­i­ty-checks file size & count; greps for bro­ken inter­wik­is; runs HTML tidy over pages to warn about invalid HTML; tests live­ness & MIME types of var­i­ous pages post-u­pload; checks for dupli­cates, read­-on­ly, banned file­types, too large or uncom­pressed images, etc.
    • ongoing link-rot detection: linkchecker, ArchiveBox, and archiver-bot
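
As promised above, here is a simplified sketch of the interwiki-link rewrite expressed as a Pandoc filter (illustrative only: the real plugin in hakyll.hs handles more wikis, URL-escaping, and edge cases, and this assumes a recent pandoc-types where URLs are Text):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import qualified Data.Text as T
    import Text.Pandoc.Definition (Inline (..), Pandoc)
    import Text.Pandoc.Walk (walk)

    -- Map '!Site' pseudo-URLs to real base URLs.
    interwikiMap :: [(Text, Text)]
    interwikiMap = [ ("Wikipedia",  "https://en.wikipedia.org/wiki/")
                   , ("Wiktionary", "https://en.wiktionary.org/wiki/") ]

    -- Rewrite links like [malefits](!Wiktionary) or [text](!Wiktionary "Malefits").
    interwikiTransform :: Pandoc -> Pandoc
    interwikiTransform = walk expand
      where
        expand l@(Link attr inlines (url, title))
          | Just site <- T.stripPrefix "!" url
          , Just base <- lookup site interwikiMap
          = let article = if T.null title then inlinesToText inlines else title
            in Link attr inlines (base <> T.replace " " "_" article, "")
          | otherwise = l
        expand x = x

        -- Crude fallback: use the link text itself as the article name.
        inlinesToText = T.concat . map go
          where go (Str t) = t
                go Space   = " "
                go _       = ""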

Implementation Details

There are a num­ber of lit­tle tricks or details that web design­ers might find inter­est­ing.

Effi­cien­cy:

  • fonts:

    • Adobe Source Serif/Sans/Code Pro: originally Gwern.net used Baskerville, but system Baskerville fonts don’t have adequate small caps. Adobe’s open-source “Source” font family of screen serifs, however, is high quality and comes with good small caps, multiple sets of numerals (‘old-style’ numbers for the body text and different numbers for tables), and looks particularly nice on Macs. (It is also subsetted to cut down the load time.) Small-cap CSS is automatically added to abbreviations/acronyms/initials by a Hakyll/Pandoc plugin, to avoid manual annotation.

    • effi­cient drop caps by sub­set­ting: 1 drop cap is used on every page, but a typ­i­cal drop cap font will slowly down­load as much as a megabyte in order to ren­der 1 sin­gle let­ter.

      CSS font loads avoid down­load­ing font files which are entirely unused, but they must down­load the entire font file if any­thing in it is used, so it does­n’t mat­ter that only one let­ter gets used. To avoid this, we split each drop cap font up into a sin­gle font file per let­ter and use CSS to load all the font files; since only 1 font file is used at all, only 1 gets down­load­ed, and it will be ~4kb rather than 168kb. This has been done for all the drop cap fonts used (yinit, Cheshire Ini­tials, Deutsche Zier­schrift, Goudy Ini­tialen, Kan­zlei Ini­tialen), and the nec­es­sary CSS can be seen in fonts.css. To spec­ify the drop cap for each page, a Hakyll meta­data field is used to pick the class and sub­sti­tuted into the HTML tem­plate.
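
      The mechanism is the CSS unicode-range descriptor: each per-letter file gets its own @font-face rule restricted to that single letter, so the browser fetches only the file whose letter actually appears on the page. A small Haskell sketch of generating such rules (font names & paths illustrative):

          {-# LANGUAGE OverloadedStrings #-}
          import Data.Char (ord)
          import Data.Text (Text)
          import qualified Data.Text as T
          import Text.Printf (printf)

          -- One @font-face rule per letter, limited by unicode-range to that letter.
          dropCapFace :: Text -> Char -> Text
          dropCapFace family c = T.unlines
            [ "@font-face {"
            , "    font-family: '" <> family <> "';"
            , "    src: url('/fonts/" <> family <> "/" <> T.singleton c <> ".woff2') format('woff2');"
            , "    unicode-range: " <> T.pack (printf "U+%04X" (ord c)) <> ";"
            , "}" ]

          yinitCSS :: Text
          yinitCSS = T.concat [ dropCapFace "yinit" c | c <- ['A'..'Z'] ]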

  • lazy JavaScript load­ing by Inter­sec­tionOb­server: sev­eral JS fea­tures are used rarely or not at all on many pages, but are respon­si­ble for much net­work activ­i­ty. For exam­ple, most pages have no tables but table­sorter must be loaded any­way, and many read­ers will never get all the way to the Dis­qus com­ments at the bot­tom of each page, but Dis­qus will load any­way, caus­ing much net­work activ­ity and dis­turb­ing the reader because the page is not ‘fin­ished load­ing’ yet.

    To avoid this, Inter­sec­tionOb­server can be used to write a small JS func­tion which fires only when par­tic­u­lar page ele­ments are vis­i­ble to the read­er. The JS then loads the library which does its thing. So an Inter­sec­tionOb­server can be defined to fire only when an actual <table> ele­ment becomes vis­i­ble, and on pages with no tables, this never hap­pens. Sim­i­larly for Dis­qus and image-focus.js. This trick is a lit­tle dan­ger­ous if a library depends on another library because the load­ing might cause race con­di­tions; for­tu­nate­ly, only 1 library, table­sorter, has a pre­req­ui­site, jQuery, so I sim­ply prepend jQuery to table­sorter and load table­sorter. (Other libraries, like side­notes or WP pop­ups, aren’t lazy-loaded because side­notes need to be ren­dered as fast as pos­si­ble or the page will jump around & be lag­gy, and WP links are so uni­ver­sal it’s a waste of time mak­ing them lazy since they will be in the first screen on every page & be loaded imme­di­ately any­way, so they are sim­ply loaded asyn­chro­nously with the defer JS key­word.)

  • image optimization: PNGs are optimized by pngnq/advpng, JPEGs with mozjpeg, SVGs are minified, PDFs are compressed with ocrmypdf’s JBIG2 support. (GIFs are not used at all in favor of WebM/MP4 <video>s.)

  • JS/CSS mini­fi­ca­tion: because Cloud­flare does Brotli com­pres­sion, mini­fi­ca­tion of JS/CSS has lit­tle advan­tage and makes devel­op­ment hard­er, so no mini­fi­ca­tion is done; the font files don’t need any spe­cial com­pres­sion either.

  • MathJax: getting well-rendered mathematical equations requires MathJax or a similar heavyweight JS library; worse, even after disabling features, the load & render time is extremely high—a page which is both large & has a lot of equations can visibly take >5s (as a progress bar that helpfully pops up informs the reader).

    The solu­tion here is to pre­ren­der Math­Jax locally after Hakyll com­pi­la­tion, using the local tool mathjax-node-page to load the final HTML files, parse the page to find all the math, com­pile the expres­sions, define the nec­es­sary CSS, and write the HTML back out. Pages still need to down­load the fonts but the over­all speed goes from >5s to <0.5s, and JS is not nec­es­sary at all.
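
    Concretely, the post-compilation pass just pipes each generated HTML file through the mjpage command-line tool shipped with mathjax-node-page; a rough sketch (assuming mjpage is on the PATH, and ignoring error handling):

        import System.Process (readProcess)

        -- Prerender MathJax by piping a compiled HTML file through `mjpage`.
        prerenderMath :: FilePath -> IO ()
        prerenderMath file = do
          html <- readFile file
          length html `seq` return ()  -- force the lazy read before overwriting the file
          rendered <- readProcess "mjpage" [] html
          writeFile file rendered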

  • col­lapsi­ble sec­tions: man­ag­ing com­plex­ity of pages is a bal­anc­ing act. It is good to pro­vide all nec­es­sary code to repro­duce results, but does the reader really want to look at a big block of code? Some­times they always would, some­times only a few read­ers inter­ested in the gory details will want to read the code. Sim­i­lar­ly, a sec­tion might go into detail on a tan­gen­tial topic or pro­vide addi­tional jus­ti­fi­ca­tion, which most read­ers don’t want to plow through to con­tinue with the main theme. Should the code or sec­tion be delet­ed? No. But rel­e­gat­ing it to an appen­dix, or another page entirely is not sat­is­fac­tory either—­for code blocks par­tic­u­lar­ly, one loses the lit­er­ate pro­gram­ming aspect if code blocks are being shuffled around out of order.

    A nice solution is to simply use a little JS to implement a code-folding/disclosure approach where sections or code blocks can be visually shrunk or collapsed, and expanded on demand by a mouse click. Collapsed sections are specified by an HTML class (eg <div class="collapse"></div>), and summaries of a collapsed section can be displayed, defined by another class (<div class="collapseSummary">). This allows code blocks to be collapsed by default where they are lengthy or distracting, and for entire regions to be collapsed & summarized, without resorting to many appendices or forcing the reader to an entirely separate page.

  • sidenotes: one might wonder why sidenotes.js is necessary when most sidenote uses are like Tufte CSS and use a static HTML/CSS approach, which would avoid a JS library entirely and visibly repainting the page after load?

    The problem is that Tufte-CSS-style sidenotes do not reflow and sit solely in the right margin (wasting the considerable whitespace on the left), and depending on the implementation, may overlap, be pushed far down the page away from their referents, break when the browser window is too narrow, or not work on smartphones/tablets at all. The JS library is able to handle all these cases, including the most difficult ones like my annotated edition of Radiance. (Tufte-CSS-style epigraphs, however, pose no such problems, and we take the same approach of defining an HTML class & styling with CSS.)

  • Link icons: icons are defined for all filetypes used in Gwern.net and many commonly-linked websites such as Wikipedia or YouTube; Gwern.net’s own within-page section links and between-page links get ‘§’ & logo icons respectively. All are inlined into default.css as data URIs; the SVGs are so small it would be absurd to have them be separate files.

  • Redirects: static sites have trouble with redirects, as they are just static files. AWS S3 does not support a .htaccess-like mechanism for rewriting URLs. To allow moving pages & fixing broken links, I wrote Hakyll.Web.Redirect for generating simple HTML pages with redirect metadata+JS, which simply redirect from URL 1 to URL 2. After moving to Nginx hosting, I converted all the redirects to regular rewrite rules.

    In addi­tion to page renames, I mon­i­tor 404 hits in Google Ana­lyt­ics to fix errors where pos­si­ble, and Nginx logs. There are an aston­ish­ing num­ber of ways to mis­spell Gwern.net URLs, it turns out, and I have defined >10k redi­rects so far (in addi­tion to generic reg­exp rewrites to fix pat­terns of errors).
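
    Before the Nginx move, the Hakyll side of this looked roughly like the following, assuming the module’s createRedirects entry point (the mappings shown are made up):

        {-# LANGUAGE OverloadedStrings #-}
        import Hakyll
        import Hakyll.Web.Redirect (createRedirects)

        -- (old URL as an Identifier, new destination); example entries only.
        movedPages :: [(Identifier, String)]
        movedPages = [ ("old-essay.html", "/new-essay")
                     , ("Some-Page.html", "/some-page") ]

        main :: IO ()
        main = hakyll $ do
            createRedirects movedPages
            -- ... the rest of the site rules go here ...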

Benford’s law

Does Gwern.net fol­low the famous Ben­ford’s law? A quick analy­sis sug­gests that it sort of does, except for the digit 2, prob­a­bly due to the many cita­tions to research from the past 2 decades (>2000 AD).

In March 2013 I wondered, upon seeing a mention of Benford’s law: “if I extracted all the numbers from everything I’ve written on Gwern.net, would it satisfy Benford’s law?” It seems the answer is… almost. I generate the list of numbers by running a Haskell program to parse digits, commas, and periods, and then I process it with shell utilities.41 This can then be read into R to run a chi-squared test confirming lack of fit (p≈0) and generate this comparison of the data & Benford’s law42:

Histogram/barplot of parsed num­bers vs pre­dicted

There’s a clear resem­blance for every­thing but the digit ‘2’, which then blows the fit to heck. I have no idea why 2 is over­rep­re­sent­ed—it may be due to all the cita­tions to recent aca­d­e­mic papers which would involve num­bers start­ing with ‘2’ (2002, 2010, 2013…) and cause a dou­ble-count in both the cita­tion and file­name, since if I look in the docs/ full­text fold­er, I see 160 files start­ing with ‘1’ but 326 start­ing with ‘2’. But this can’t be the entire expla­na­tion since ‘2’ has 20.3k entries while to fit Ben­ford, it needs to be just 11.5k—leav­ing a gap of ~10k num­bers unex­plained. A mys­tery.

License

This site is licensed under the pub­lic domain (CC-0) license.

I believe the public domain license reduces friction and transaction costs43, encourages copying, gives back (however little) to the free-culture commons, and costs me nothing44.


  1. , pg 19 of Russ­ian Sil­hou­ettes, on why he wrote his book of bio­graph­i­cal sketches of great Soviet chess play­ers. (As Richard­son asks (Vec­tors 1.0, 2001): “25. Why would we write if we’d already heard what we wanted to hear?”)↩︎

  2. One danger of such an approach is that you will simply engage in confirmation bias, and build up an impressive-looking wall of citations that is completely wrong but effective in brainwashing yourself. The only solution is to be diligent about including criticism—so even if you do not escape brainwashing, at least your readers have a chance. Charles Darwin, 1902:

    I had, also, dur­ing many years fol­lowed a golden rule, name­ly, that when­ever a pub­lished fact, a new obser­va­tion or thought came across me, which was opposed to my gen­eral results, to make a mem­o­ran­dum of it with­out fail and at once; for I had found by expe­ri­ence that such facts and thoughts were far more apt to escape from the mem­ory than favourable ones. Owing to this habit, very few objec­tions were raised against my views which I had not at least noticed and attempted to answer.

    ↩︎
  3. “It is only the attempt to write down your ideas that enables them to devel­op.” –Wittgen­stein (pg 109, Rec­ol­lec­tions of Wittgen­stein); “I thought a lit­tle [while in the iso­la­tion tank], and then I stopped think­ing alto­geth­er…in­cred­i­ble how idle­ness of body leads to idle­ness of mind. After 2 days, I’d turned into an idiot. That’s the rea­son why, dur­ing a flight, astro­nauts are always kept busy.” –Ori­ana Fal­laci, quoted in Rocket Men: The Epic Story of the First Men on the Moon by Craig Nel­son.↩︎

  4. Such as Larry Niven’s Known Space universe; consider the introduction to the chronologically last story in that setting, “Safe at Any Speed” (Tales of Known Space).↩︎

  5. :

    “If the indi­vid­ual lived five hun­dred or one thou­sand years, this clash (be­tween his inter­ests and those of soci­ety) might not exist or at least might be con­sid­er­ably reduced. He then might live and har­vest with joy what he sowed in sor­row; the suffer­ing of one his­tor­i­cal period which will bear fruit in the next one could bear fruit for him too.”

    ↩︎
  6. From Aging and Old Age:

    One way to dis­tin­guish empir­i­cally between aging effects and prox­im­i­ty-to-death effects would be to com­pare, with respect to choice of occu­pa­tion, invest­ment, edu­ca­tion, leisure activ­i­ties, and other activ­i­ties, elderly peo­ple on the one hand with young or mid­dle-aged peo­ple who have trun­cated life expectan­cies but are in appar­ent good health, on the oth­er. For exam­ple, a per­son newly infected with the AIDS virus (HIV) has roughly the same life expectancy as a 65-year-old and is unlikely to have, as yet, [ma­jor] symp­toms. The con­ven­tional human-cap­i­tal model implies that, after cor­rec­tion for differ­ences in income and for other differ­ences between such per­sons and elderly per­sons who have the same life expectancy (a big differ­ence is that the for­mer will not have pen­sion enti­tle­ments to fall back upon), the behav­ior of the two groups will be sim­i­lar. It does appear to be sim­i­lar, so far as invest­ing in human cap­i­tal is con­cerned; the trun­ca­tion of the pay­back period causes dis­in­vest­ment. And there is a high sui­cide rate among HIV-infected per­sons (even before they have reached the point in the pro­gres­sion of the dis­ease at which they are clas­si­fied as per­sons with AIDS), just as there is, as we shall see in chap­ter 6, among elderly per­sons.

    ↩︎
  7. John F. Kennedy, 1962:

    I am reminded of the story of the great French Mar­shal Lyautey, who once asked his gar­dener to plant a tree. The gar­dener objected that the tree was slow-grow­ing and would not reach matu­rity for a hun­dred years. The Mar­shal replied, “In that case, there is no time to lose, plant it this after­noon.”

    ↩︎
  8. , :

    In the long run, the util­ity of all non-Free soft­ware approaches zero. All non-Free soft­ware is a dead end.

    ↩︎
  9. These dependencies can be subtle. Computer archivist Jason Scott writes of URL shortening services that:

    URL short­en­ers may be one of the worst ideas, one of the most back­ward ideas, to come out of the last five years. In very recent times, per-site short­en­ers, where a web­site reg­is­ters a smaller ver­sion of its host­name and pro­vides a sin­gle small link for a more com­pli­cated piece of con­tent within it… those are fine. But these gen­er­al-pur­pose URL short­en­ers, with their shady or frag­ile setups and utter depen­dence upon them, well. If we lose or , mil­lions of weblogs, essays, and non-archived tweets lose their mean­ing. Instant­ly. To some­one in the future, it’ll be like every­one from a cer­tain era of his­to­ry, say ten years of the 18th cen­tu­ry, started speak­ing in a one-time pad of cryp­to­graphic pass phras­es. We’re doing our best to stop it. Some of the short­en­ers have been help­ful, oth­ers have been hos­tile. A num­ber have died. We’re going to release tor­rents on a reg­u­lar basis of these spread­sheets, these code break­ing spread­sheets, and we hope oth­ers do too.

    ↩︎
  10. remarks (and the com­ments pro­vide even more exam­ples) fur­ther on URL short­en­ers:

    But the biggest bur­den falls on the click­er, the per­son who fol­lows the links. The extra layer of indi­rec­tion slows down brows­ing with addi­tional DNS lookups and server hits. A new and poten­tially unre­li­able mid­dle­man now sits between the link and its des­ti­na­tion. And the long-term archiv­abil­ity of the hyper­link now depends on the health of a third par­ty. The short­ener may decide a link is a Terms Of Ser­vice vio­la­tion and delete it. If the short­ener acci­den­tally erases a data­base, for­gets to renew its domain, or just dis­ap­pears, the link will break. If a top-level domain changes its pol­icy on com­mer­cial use, the link will break. If the short­ener gets hacked, every link becomes a poten­tial phish­ing attack.

    ↩︎
  11. A static text-source site has so many advantages for Long Content that I consider its use almost a no-brainer.

    • By nature, they com­pile most con­tent down to flat stand­alone tex­tual files, which allow recov­ery of con­tent even if the orig­i­nal site soft­ware has bit-rot­ted or the source files have been lost or the com­piled ver­sions can­not be directly used in new site soft­ware: one can parse them with XML tools or with quick hacks or by eye.
    • Site com­pil­ers gen­er­ally require depen­den­cies to be declared up front, and the approach makes explic­it­ness and con­tent easy, but dynamic inter­de­pen­dent com­po­nents diffi­cult, all of which dis­cour­ages creep­ing com­plex­ity and hid­den state.
    • A sta­tic site can be archived into a tar­ball of files which will be read­able as long as web browsers exist (or after­wards if the HTML is rea­son­ably clean), but it could be diffi­cult to archive a CMS like Word­Press or Blogspot (the lat­ter does­n’t even pro­vide the con­tent in HTML—it only pro­vides a rat’s-nest of inscrutable JavaScript files which then down­load the con­tent from some­where and dis­play it some­how; indeed, I’m not sure how I would auto­mate archiv­ing of such a site if I had to; I would need some sort of head­less browser to run the JS and seri­al­ize the final result­ing DOM, pos­si­bly with some script­ing of mouse/keyboard action­s).
    • With dynamic CMSs, by contrast, the content is often not available locally, or is stored in opaque binary formats rather than text (if one is lucky, it will at least be a database), both of which make it difficult to port content to other website software; you won’t have the necessary pieces, or they will be in wildly incompatible formats.
    • Sta­tic sites are usu­ally writ­ten in a rea­son­ably stan­dard­ized markup lan­guage such as Mark­down or LaTeX, in dis­tinc­tion to blogs which force one through WYSIWYG edi­tors or invent their own markup con­ven­tions, which is yet another bar­ri­er: pars­ing a pos­si­bly ill-de­fined lan­guage.
    • The low­ered sysad­min efforts (who wants to be con­stantly clean­ing up spam or hacks on their Word­Press blog?) are a final advan­tage: lower run­ning costs make it more likely that a site will stay up rather than cease to be worth the has­sle.

    Sta­tic sites are not appro­pri­ate for many kinds of web­sites, but they are appro­pri­ate for web­sites which are con­tent-ori­ent­ed, do not need inter­ac­tiv­i­ty, expect to migrate web­site soft­ware sev­eral times over com­ing decades, want to enable archiv­ing by one­self or third par­ties (“lots of copies keeps stuff safe”), and to grace­fully degrade after loss or bitrot.↩︎

  12. Such as burn­ing the occa­sional copy onto read­-only media like DVDs.↩︎

  13. One can’t be sure; the IA is fed by Alexa crawls, and Alexa doesn’t guarantee pages will be crawled & preserved if one goes through their request form.↩︎

  14. I am dili­gent in back­ing up my files, in peri­od­i­cally copy­ing my con­tent from the , and in pre­serv­ing viewed Inter­net con­tent; why do I do all this? Because I want to believe that my mem­o­ries are pre­cious, that the things I saw and said are valu­able; “I want to meet them again, because I believe my feel­ings at that time were real.” My past is not trash to me, used up & dis­card­ed.↩︎

  15. Exam­ples of such blogs:

    1. Eliezer Yudkowsky’s contributions to LessWrong were the rough draft of a philosophy book (or two)
    2. John Robb’s Global Guerrillas led to his Brave New War: The Next Stage of Terrorism and the End of Globalization
    3. Kevin Kelly’s Technium was turned into What Technology Wants.

    An example of how not to do it would be Robin Hanson’s Overcoming Bias blog; it is stuffed with fascinating citations & sketches of ideas, but they never go anywhere, with the exception of his mind-emulation-economy posts, which were eventually published in 2016 as The Age of Em. Just his posts on medicine would make a fascinating essay or just list—but he has never made one. ( would be a natural home for many of his posts’ contents, but will never be updated.)↩︎

  16. “Kevin Kelly Answers Your Ques­tions”, 2011-09-06:

    [Ques­tion:] “One pur­pose of the is to encour­age long-term think­ing. Aside from the Clock, though, what do you think peo­ple can do in their every­day lives to adopt or pro­mote long-term think­ing?”

    Kevin Kelly: “The 10,000-year Clock we are building in the hills of west Texas is meant to remind us to think long-term, but learning how to do that as an individual is difficult. Part of the difficulty is that as individuals we are constrained to short lives, and are inherently not long-term. So part of the skill in thinking long-term is to place our values and energies in ways that transcend the individual—either in generational projects, or in social enterprises.”

    “As a start I rec­om­mend engag­ing in a project that will not be com­plete in your life­time. Another way is to require that your cur­rent projects exhibit some pay­off that is not imme­di­ate; per­haps some small por­tion of it pays off in the future. A third way is to cre­ate things that get bet­ter, or run up in time, rather than one that decays and runs down in time. For instance a seedling grows into a tree, which has seedlings of its own. A pro­gram like which gives breed­ing pairs of ani­mals to poor farm­ers, who in turn must give one breed­ing pair away them­selves, is an exotropic scheme, grow­ing up over time.”

    ↩︎
  17. ‘Princess Iru­lan’, , ↩︎

  18. reports in “A good vol­un­teer is hard to find” that of vol­un­teers moti­vated enough to email them ask­ing to help, some­thing like <20% will com­plete the GiveWell test assign­ment and ren­der mean­ing­ful help. Such per­sons would have been well-ad­vised to have sim­ply donated some mon­ey. I have long noted that many of the most pop­u­lar pages on Gwern.net could have been writ­ten by any­one and drew on no unique tal­ents of mine; I have on sev­eral occa­sions received offers to help with the DNB FAQ—none of which have resulted in actual help.↩︎

  19. An old sen­ti­ment; con­sider “A drop hol­lows out the stone” (Ovid, Epis­tles) or Thomas Car­lyle’s “The weak­est liv­ing crea­ture, by con­cen­trat­ing his pow­ers on a sin­gle object, can accom­plish some­thing. The strongest, by dis­pens­ing his over many, may fail to accom­plish any­thing. The drop, by con­tin­u­ally falling, bores its pas­sage through the hard­est rock. The hasty tor­rent rushes over it with hideous uproar, and leaves no trace behind.” (The life of Friedrich Schiller, 1825)↩︎

  20. “Ten Lessons I Wish I Had Been Taught”, Gian-Carlo Rota:

    Richard Feyn­man was fond of giv­ing the fol­low­ing advice on how to be a genius. You have to keep a dozen of your favorite prob­lems con­stantly present in your mind, although by and large they will lay in a dor­mant state. Every time you hear or read a new trick or a new result, test it against each of your twelve prob­lems to see whether it helps. Every once in a while there will be a hit, and peo­ple will say: ‘How did he do it? He must be a genius!’

    ↩︎
  21. IQ is sometimes used as a proxy for health, like height, because it sometimes seems like any health problem will damage IQ. Didn’t get much protein as a kid? Congratulations, your nerves will lack myelin and you will literally think slower. Missing some iodine? Say goodbye to <10 points! If you’re anemic or iron-deficient, that might increase to <15 points. Have tapeworms? There go some more points, and maybe centimeters off your adult height, thanks to the worms stealing nutrients from you. Have a rough birth and suffer a spot of hypoxia before you began breathing on your own? Tough luck, old bean. It is very easy to lower IQ; you can do it with a baseball bat. It’s the other way around that’s nearly impossible.↩︎

  22. And America has tried pretty hard over the past 60 years to affect IQ. The whole nature/nurture debate would be moot if there were some nutrient or educational system which could add even 10 points on average, because then we would use it on all the blacks. But it seems that I’m constantly reading about programs like Head Start which boost IQ for a little while… and do nothing in the long run.↩︎

  23. For details on the many valu­able cor­re­lates of the Con­sci­en­tious­ness per­son­al­ity fac­tor, see Con­sci­en­tious­ness and online edu­ca­tion.↩︎

  24. 25 episodes, 6 movies, >11 manga vol­umes—just to stick to the core works.↩︎

  25. , KKS XII: 609:

    More than my life
    What I most regret
    Is
    A dream unfinished
    And awakening.
    ↩︎
  26. As with Cloud Nine; I acci­den­tally erased every­thing on a rou­tine basis while mess­ing around with Win­dows.↩︎

  27. For exam­ple, I notice I am no longer deeply inter­ested in the occult. Hope­fully this is because I have grown men­tally and rec­og­nize it as rub­bish; I would be embar­rassed if when I died it turned out my youth­ful self had a bet­ter grasp on the real world.↩︎

  28. Some pages don’t have any con­nec­tion to pre­dic­tions. It’s pos­si­ble to make pre­dic­tions for some bor­der cases like the ter­ror­ism essays (death tolls, achieve­ments of par­tic­u­lar groups’ pol­icy goal­s), but what about the short sto­ries or poems? My imag­i­na­tion fails there.↩︎

  29. Thinking of predictions is good mental discipline; we should always be able to cash out our beliefs in terms of the real world, or know why we cannot. Unfortunately, humans being humans, we need to actually track our predictions—lest our predicting degenerate into entertainment like political punditry.↩︎

  30. Dozens of the­o­ries have been put forth. I have been col­lect­ing & mak­ing pre­dic­tions; and am up to 219. It will be inter­est­ing to see how the movies turn out.↩︎

  31. I have 2 pre­dic­tions reg­is­tered about the the­sis on PB.­com: 1 reviewer will accept my the­ory by 2016 and the light nov­els will fin­ish by 2015.↩︎

  32. See Robin Han­son, “If Uploads Come First”↩︎

  33. I orig­i­nally used last file mod­i­fi­ca­tion time but this turned out to be con­fus­ing to read­ers, because I so reg­u­larly add or update links or add new for­mat­ting fea­tures that the file mod­i­fi­ca­tion time was usu­ally quite recent, and so it was mean­ing­less.↩︎

  34. Reac­tive archiv­ing is inad­e­quate because such links may die before my crawler gets to them, may not be archiv­able, or will just expose read­ers to dead links for an unac­cept­ably long time before I’d nor­mally get around to them.↩︎

  35. I like the sta­tic site approach to things; it tends to be harder to use and more restric­tive, but in exchange it yields bet­ter per­for­mance & leads to fewer has­sles or run­time issues. The sta­tic model of com­pil­ing a sin­gle mono­lithic site direc­tory also lends itself to test­ing: any shell script or CLI tool can be eas­ily run over the com­piled site to find poten­tial bugs (which has become increas­ingly impor­tant as site com­plex­ity & size increases so much that eye­balling the occa­sional page is inad­e­quate).↩︎

  36. Rutter argues for this point in Web Typography, which is consistent with my own A/B testing, where even lousy changes are difficult to distinguish from zero effect despite large n, and with the general shambolic state of the Internet (eg as reviewed in the 2019 Web Almanac). If even loading times of multiple seconds cause only relatively modest traffic reductions, things like aligning columns properly or using section signs or sidenotes must have effects on behavior so close to zero as to be unobservable.↩︎

  37. Para­phrased from Dia­logues of the Zen Mas­ters as quoted in pg 11 of the Edi­tor’s Intro­duc­tion to Three Pil­lars of Zen:

    One day a man of the peo­ple said to Mas­ter Ikkyu: “Mas­ter, will you please write for me max­ims of the high­est wis­dom?” Ikkyu imme­di­ately brushed out the word ‘Atten­tion’. “Is that all? Will you not write some more?” Ikkyu then brushed out twice: ‘Atten­tion. Atten­tion.’ The man remarked irri­ta­bly that there was­n’t much depth or sub­tlety to that. Then Ikkyu wrote the same word 3 times run­ning: ‘Atten­tion. Atten­tion. Atten­tion.’ Half-an­gered, the man demand­ed: “What does ‘Atten­tion’ mean any­way?” And Ikkyu answered gen­tly: “Atten­tion means atten­tion.”

    ↩︎
  38. Why don’t all PDF generators use that? Software patents, which make it hard to install the actual JBIG2 encoder (supposedly all JBIG2 encoding patents had expired by 2017, but no one, such as Linux distros, wants to take the risk of unknown patents surfacing), which therefore has to ship separately from ocrmypdf; plus worries over edge-cases in JBIG2 where numbers might be visually changed to different numbers to save bits.↩︎

  39. Specifically: some OS/browser combinations preserve soft hyphens in copy-paste, which might confuse readers, so we use JS to delete soft hyphens; this breaks for users with JS disabled, and on Linux, the X GUI’s middle-click paste bypasses the JS entirely, though no other way of copy-pasting does.↩︎

  40. Sidenotes have long been used as a typographic solution to densely-annotated texts, but have not shown up much online yet.

    An early & inspiring use of margin notes: Pierre Bayle’s Historical and Critical Dictionary, demonstrating recursive notes (1737, volume 4, pg 901; source: Google Books)↩︎

  41. We write a short Haskell pro­gram as part of a pipeline:

    echo '{-# LANGUAGE OverloadedStrings #-};
          import Data.Text as T;
          main = interact (T.unpack . T.unlines . Prelude.filter (/="") .
                           T.split (not . (`elem` "0123456789,.")) . T.pack)' > ~/number.hs &&
    find ~/wiki/ -type f -name "*.page" -exec cat "{}" \; | runhaskell ~/number.hs |
     sort | tr -d ',' | tr -d '.' | cut -c 1 | sed -e 's/0$//' -e '/^$/d' > ~/number.txt
    ↩︎
  42. Graph then test:

    numbers <- read.table("number.txt")
    ta <- table(numbers$V1); ta
    
    #     1     2     3     4     5     6     7     8     9
    # 20550 20356  7087  5655  3900  2508  2075  2349  2068
    ## cribbing exact R code from http://www.math.utah.edu/~treiberg/M3074BenfordEg.pdf
    sta <- sum(ta)
    pb <- sapply(1:9, function(x) log10(1+1/x)); pb
    m <- cbind(ta/sta,pb)
    colnames(m)<- c("Observed Prop.", "Theoretical Prop.")
    barplot( rbind(ta/sta,pb/sum(pb)), beside = T, col = rainbow(7)[c(2,5)],
                  xlab = "First Digit")
    title("Benford's Law Compared to Writing Data")
    legend(16,.28, legend = c("From Page Data", "Theoretical"),
           fill = rainbow(7)[c(2,5)],bg="white")
    chisq.test(ta,p=pb)
    #
    #     Chi-squared test for given probabilities
    #
    # data:  ta
    # X-squared = 9331, df = 8, p-value < 2.2e-16
    ↩︎
  43. PD increases economic efficiency through—if nothing else—making works easier to find. Tim O’Reilly says that “Obscurity is a far greater threat to authors and creative artists than piracy.” If that is so, then that means that difficulty of finding works reduces the welfare of artists and consumers, because both forgo a beneficial trade (the artist loses any revenue and the consumer loses any enjoyment). Even small increases in inconvenience make big differences.↩︎

  44. Not that I could sell any­thing on this wiki; and if I could, I would pol­ish it as much as pos­si­ble, giv­ing me fresh copy­right.↩︎