About This Website

Meta page describing Gwern.net site ideals of stable long-term essays which improve over time; idea sources and writing methodology; metadata definitions; site statistics; copyright license.

This page is about Gwern.net content; for the details of its implementation & design like the popup paradigm, see Design; and for information about me, see Links.

The Content

Of all the books I have delivered to the presses, none, I think, is as personal as the straggling collection mustered for this hodgepodge, precisely because it abounds in reflections and interpolations. Few things have happened to me, and I have read a great many. Or rather, few things have happened to me more worth remembering than Schopenhauer’s thought or the music of England’s words.

A man sets himself the task of portraying the world. Through the years he peoples a space with images of provinces, kingdoms, mountains, bays, ships, islands, fishes, rooms, instruments, stars, horses, and people. Shortly before his death, he discovers that that patient labyrinth of lines traces the image of his face.

Jorge Luis Borges, Dreamtigers Epilogue

The content here varies from statistics to psychology to self-experiments/Quantified Self to philosophy to poetry to programming to anime to investigations of online drug markets or leaked movie scripts (or two topics at once: anime & statistics or anime & criticism or heck anime & statistics & criticism!).

I believe that someone who has been well-educated will think of something worth writing at least once a week; to a surprising extent, this has been true. (I added ~130 documents to this repository over the first 3 years.)

Target Audience

Special knowledge can be a terrible disadvantage if it leads you too far along a path you cannot explain anymore.

Brian Herbert (Dune: House Harkonnen)

I don’t write simply to find things out, although curiosity is my primary motivator, as I find I want to read something which hasn’t been written—“…I realised that I wanted to read about them what I myself knew. More than this—what only I knew. Deprived of this possibility, I decided to write about them. Hence this book.”1 There are many benefits to keeping notes as they allow one to accumulate confirming and especially contradictory evidence2, and even drafts can be useful so you Don’t Repeat Yourself or simply decently respect the opinions of mankind.

The goal of these pages is not to be a model of concision, maximizing entertainment value per word, or to preach to a choir by elegantly repeating a conclusion. Rather, I am attempting to explain things to my future self, who is intelligent and interested, but has forgotten. What I am doing is explaining to myself why I decided what I did, and noting down everything I found interesting about it for future reference. I hope my other readers, whoever they may be, might find the topic as interesting as I found it, and the essay useful or at least entertaining; but the intended audience is my future self.

Development

I hate the water that thinks that it boiled itself on its own. I hate the seasons that think they cycle naturally. I hate the sun that thinks it rose on its own.

Sodachi Oikura, Owarimonogatari (Sodachi Riddle, Part One)

It is everything I felt worth writing that neither fit somewhere like Wikipedia nor had already been written. I never expected to write so much; but I discovered that once I had a hammer, nails were everywhere, and that supply creates its own demand3.

Long Site

The Internet is self destructing paper. A place where anything written is soon destroyed by rapacious competition and the only preservation is to forever copy writing from sheet to sheet faster than they can burn. If it’s worth writing, it’s worth keeping. If it can be kept, it might be worth writing…If you store your writing on a third party site like Blogger, Livejournal or even on your own site, but in the complex format used by blog/wiki software du jour you will lose it forever as soon as hypersonic wings of Internet labor flows direct people’s energies elsewhere. For most information published on the Internet, perhaps that is not a moment too soon, but how can the muse of originality soar when immolating transience brushes every feather?

Julian Assange (“Self destructing paper”, 2006-12-05)

One of my personal interests is applying the idea of the Long Now. What and how do you write a personal site with the long-term in mind? We live most of our lives in the future, and the actuarial tables give me until the 2070–2080s, excluding any benefits from caloric restriction/intermittent fasting or projects like SENS. It is a commonplace in science fiction4 that longevity would cause widespread risk aversion. But on the other hand, it could do the opposite: the longer you live, the more long-shots you can afford to invest in. Someone with a timespan of 70 years has reason to protect against black swans, but also time to look for them.5 It’s worth noting that old people make many short-term choices, as reflected in increased suicide rates and reduced investment in education or new hobbies, and this is not due solely to the ravages of age but the proximity of death: the HIV-infected (but otherwise in perfect health) act similarly short-term.6

What sort of writing could you create if you worked on it (be it ever so rarely) for the next 60 years? What could you do if you started now?7

Keeping the site running that long is a challenge, and leads to the recommendations for Resilient Haskell Software: 100% FLOSS software8, open standards for data, textual human-readability, avoiding external dependencies910, and staticness11.

Preserving the content is another challenge. Keeping the content in a DVCS like git protects against file corruption and makes it easier to mirror the content; regular backups12 help. I have taken additional measures: WebCitation has archived most pages and almost all external links; the Internet Archive is also archiving pages & external links13. (For details, read Archiving URLs.)

One could continue in this vein, devising ever more powerful & robust storage methods (perhaps combine the DVCS with forward error correction through PAR2, a la bup), but what is one to fill the storage with?

Long Content

What has been done, thought, written, or spoken is not culture; culture is only that fraction which is remembered.

Gary Taylor (The Clock of the Long Now; emphasis added)14

‘Blog posts’ might be the answer. But I have read blogs for many years and most blog posts are the triumph of the hare over the tortoise. They are meant to be read by a few people on a weekday in 2004 and never again, and are quickly abandoned, and perhaps, as Assange says, not a moment too soon. (But isn’t that sad? Isn’t it a terrible ROI for one’s time?) On the other hand, the best blogs always seem to be building something: they are rough drafts, works in progress15. So I did not wish to write a blog. Then what? More than just “evergreen content”, what would constitute Long Content as opposed to the existing culture of Short Content? How does one live in a Long Now sort of way?16

It’s shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult. Muad’Dib knew that every experience carries its lesson.17

My answer is that one uses such a framework to work on projects that are too big to work on normally or too tedious. (Conscientiousness is often lacking online or in volunteer communities18 and many useful things go undone.) Knowing your site will survive for decades to come gives you the mental wherewithal to tackle long-term tasks like gathering information for years, and such persistence can be useful19—if one holds onto every glimmer of genius for years, then even the dullest person may look a bit like a genius himself20. (Even experienced professionals can only write at their peak for a few hours a day—usually first thing in the morning, it seems.) Half the challenge of fighting procrastination is the pain of starting—I find when I actually get into the swing of working on even dull tasks, it’s not so bad. So this suggests a solution: never start. Merely have perpetual drafts, which one tweaks from time to time. And the rest takes care of itself. I have a few examples of this:

  1. DNB FAQ:

    When I read in Wired in 2008 that the obscure working memory exercise called dual n-back (DNB) had been found to increase IQ substantially, I was shocked. IQ is one of the most stubborn properties of one’s mind, one of the most fragile21, the hardest to affect positively, but also one of the most valuable traits one could have22; if the technique panned out, it would be huge. Unfortunately, DNB requires a major time investment (as in, half an hour daily), which would be a bargain if it delivers. So, to do DNB or not?

    Questions of great import like this are worth studying carefully. The wheels of academia grind exceeding slow, and only a fool expects unanimous answers from fields like psychology. Any attempt to answer the question ‘is DNB worthwhile?’ will require years and cover a breadth of material. This FAQ on DNB is my attempt to cover that breadth over those years.

  2. Neon Genesis Evangelion notes:

    I have been discussing NGE since 2004. The task of interpreting Eva is very difficult; the source works themselves are a major time-sink23, and there are thousands of primary, secondary, and tertiary works to consider: personal essays, interviews, reviews, etc. The net effect is that many Eva fans ‘know’ certain things about Eva, such as End of Evangelion not being a grand ‘screw you’ statement by Hideaki Anno or that the TV series was censored, but they no longer have proof. Because each fan remembers a different subset, they have irreconcilable interpretations. (Half the value of the page for me is having a place to store things I’ve said in countless fora which I can eventually turn into something more systematic.)

    To compile claims from all those works, to dig up forgotten references, to scroll through microfilms, buy issues of defunct magazines—all this is enough work to shatter the heart of the stoutest salaryman. Which is why I began years ago and expect not to finish for years to come. (Finishing by 2020 seems like a good prediction.)

  3. Cloud Nine:

    Years ago I was reading the papers of the economist Robin Hanson. I recommend his work highly; even where his papers are wrong, they are imaginative and some of the finest speculative fiction I have read. (Except they were non-fiction.) One night I had a dream in which I saw in a flash a medieval city run in part on Hansonian grounds; a steampunk version of his futarchy. A city must have another city as a rival, and soon I had remembered the strange ’90s idea of assassination markets, which was easily tweaked to work in a medieval setting. Finally, between them, was one of my favorite proposals, Buckminster Fuller’s cloud nine megastructure.

    I wrote several drafts but always lost them. Sad24 and discouraged, I abandoned it for years. This fear leads straight into the next example.

  4. A Book reading list:

    Once, I didn’t have to keep reading lists. I simply went to the school library shelf where I left off and grabbed the next book. But then I began reading harder books, and they would cite other books, and sometimes would even have horrifying lists of hundreds of other books I ought to read (‘bibliographies’). I tried remembering the most important ones but quickly forgot. So I began keeping a book list on paper. I thought I would throw it away in a few months when I read them all, but somehow it kept growing and growing. I didn’t trust computers to store it before25, but now I do, and it lives on in digital form (currently on Goodreads—because they have export functionality). With it, I can track how my interests evolved over time26, and what I was reading at the time. I sometimes wonder if I will read them all even by 2070.

What is next? So far the pages will persist through time, and they will gradually improve over time. But a truly Long Now approach would be to make them be improved by time: to make them more valuable the more time passes. (Stewart Brand remarks in The Clock of the Long Now that a group of monks carved thousands of scriptures into stone, hoping to preserve them for posterity; but posterity would value far more a carefully preserved collection of monk feces, which would tell us countless valuable things about important phenomena like global warming.)

One idea I am exploring is adding long-term predictions like the ones I make on PredictionBook.com. Many27 pages explicitly or implicitly make predictions about the future. As time passes, predictions would be validated or falsified, providing feedback on the ideas.28

For example, the Evangelion essay’s paradigm implies many things about the future movies in Rebuild of Evangelion29; The Melancholy of Kyon is an extended prediction30 of future plot developments in The Melancholy of Haruhi Suzumiya series; Haskell Summer of Code has suggestions about what makes good projects, which could be turned into predictions by applying them to predict success or failure when the next Summer of Code choices are announced. And so on.

I don’t think “Long Content” is simply for working on things which are equivalent to a “monograph” (a work which attempts to be an exhaustive exposition of all that is known, and what has been recently discovered, on a single topic), although monographs clearly would benefit from such an approach. If I wrote a short essay cynically remarking on, say, Al Gore and predicting he’d sell out, registered some predictions, and came back 20 years later to see how it worked out, I would consider this “Long Content” (it gets more interesting with time, as the predictions reach maturation); but one couldn’t consider this a “monograph” in any ordinary sense of the word.

One of the ironies of this approach is that as a transhumanist, I assign non-trivial probability to the world undergoing massive change during the 21st century due to any of a number of technologies such as artificial intelligence (such as mind uploading31) or nanotechnology; yet here I am, planning as if I and the world were immortal.

I personally believe that one should “think Less Wrong and act Long Now”, if you follow me. I diligently do my daily spaced-repetition review and n-backing; I carefully design my website and writings to last decades, actively think about how to write material that improves with time, and work on writings that will not be finished for years (if ever). It’s a bit schizophrenic since both are totalized worldviews with drastically conflicting recommendations about where to invest my time. It’s a case of high discount rates versus low discount rates; and one could fairly accuse me of committing the sunk cost fallacy, but then, I’m not sure that sunk cost fallacy is a fallacy (certainly, I have more to show for my wasted time than most people).

The Long Now views its proposals like the Clock and the Long Library and seedbanks as insurance—in case the future turns out to be surprisingly unsurprising. I view these writings similarly. If Ray Kurzweil’s most ambitious predictions turn out right and the Singularity happens by 2050 or so, then much of my writings will be moot, but I will have all the benefits of said Singularity; if the Singularity never happens or ultimately pays off in a very disappointing way, then my writings will be valuable to me. By working on them, I hedge my bets.

Finding My Ideas

To the extent I personally have any method for ‘getting started’ on writing something, it’s to pay attention to any time you find yourself thinking, “how irritating that there’s no good webpage/Wikipedia article on X” or “I wonder if Y” or “has anyone done Z” or “huh, I just realized that A!” or “this is the third time I’ve had to explain this, jeez.”

The DNB FAQ started because I was irritated people were repeating themselves on the dual n-back mailing list; the modafinil article started because it was a pain to figure out where one could order modafinil; the trio of Death Note articles (Anonymity, Ending, Script) all started because I had an amusing thought about information theory; the Silk Road 1 page was commissioned after I groused about how deeply sensationalist & shallow & ill-informed all the mainstream media articles on the Silk Road drug marketplace were (similarly for Bitcoin is Worse is Better); my Google survival analysis was based on thinking it was a pity that Arthur’s Guardian analysis was trivially & fatally flawed; and so on and so forth.

None of these seems special to me. Anyone could’ve compiled the DNB FAQ; anyone could’ve kept a list of online pharmacies where one could buy modafinil; someone tried something similar to my Google shutdown analysis before me (and the fancier statistics were all standard tools). If I have done anything meritorious with them, it was perhaps simply putting more work into them than someone else would have; to quote Teller:

“I think you’ll see what I mean if I teach you a few principles magicians employ when they want to alter your perceptions…Make the secret a lot more trouble than the trick seems worth. You will be fooled by a trick if it involves more time, money and practice than you (or any other sane onlooker) would be willing to invest.”

“My partner, Penn, and I once produced 500 live cockroaches from a top hat on the desk of talk-show host David Letterman. To prepare this took weeks. We hired an entomologist who provided slow-moving, camera-friendly cockroaches (the kind from under your stove don’t hang around for close-ups) and taught us to pick the bugs up without screaming like preadolescent girls. Then we built a secret compartment out of foam-core (one of the few materials cockroaches can’t cling to) and worked out a devious routine for sneaking the compartment into the hat. More trouble than the trick was worth? To you, probably. But not to magicians.”

Besides that, I think after a while writing/research can be a virtuous circle or autocatalytic. If you were to look at my repo statistics, you would see that I haven’t always been writing as much. What seems to happen is that as I write more:

  • I learn more tools

    eg. I learned basic meta-analysis in R to answer what all the positive & negative n-back studies summed to, but then I was able to use it for iodine; I learned linear models for analyzing MoR reviews but now I can use them anywhere I want to, like in my Touhou draft material.

    The “Feynman method” has been facetiously described as “find a problem; think very hard; write down the answer”, but Gian-Carlo Rota gives the real one:

    Richard Feynman was fond of giving the following advice on how to be a genius. You have to keep a dozen of your favorite problems constantly present in your mind, although by and large they will lay in a dormant state. Every time you hear or read a new trick or a new result, test it against each of your twelve problems to see whether it helps. Every once in a while there will be a hit, and people will say: “How did he do it? He must be a genius!”

  • I internalize a habit of noticing interesting questions that flit across my brain

    eg. in March 2013 while meditating: “I wonder if more doujin music gets released when unemployment goes up and people may have more spare time or fail to find jobs? Hey! That giant Touhou music torrent I downloaded, with its 45000 songs all tagged with release year, could probably answer that!” (One could argue that these questions probably should be ignored and not investigated in depth, per Teller again; nevertheless, this is how things work for me.)

  • if you aren’t writing, you’ll ignore useful links or quotes; but if you stick them in small asides or footnotes as you notice them, eventually you’ll have something bigger.

    I grab things I see on Google Alerts & Scholar, Pubmed, Reddit, Hacker News, my RSS feeds, books I read, and note them somewhere until they amount to something. (An example would be my slowly accreting citations on IQ and economics.)

  • people leave comments, ping me on IRC, send me emails, or leave anonymous messages, all of which help

    Some examples of this come from my most popular page, on Silk Road 1:

    1. an anonymous message led me to investigate a vendor in depth and ponder the accusation leveled against them; I wrote it up and gave my opinions and thus I got another short essay to add to my SR page which I would not have had otherwise (and I think there’s a <20% chance that in a few years this will pay off and become a very interesting essay).

    2. CMU’s Nicholas Christin, who wrote a paper by scraping SR for many months and giving all sorts of overall statistics, emailed me to point out I was citing inaccurate figures from the first version of his paper. I thanked him for the correction and while I was replying, mentioned I had a hard time believing his paper’s claims about the extreme rarity of scams on SR as estimated through buyer feedback. After some back and forth and suggesting specific mechanisms how the estimates could be positively biased, he was able to check his database and confirmed that there was at least one very large omission of scams in the scraped data and there was probably a general undersampling; so now I have a more accurate feedback estimate for my SR page (important for estimating risk of ordering) and he said he’ll acknowledge me in the/a paper, which is nice.

Information Organizing

Occasionally people ask how I manage information and read things.

  1. For quotes or facts which are very important, I employ spaced repetition by adding them to my Mnemosyne

  2. I keep web clippings in Evernote; I also excerpt from research papers & books, and miscellaneous sources. This is useful for targeted searches when I remember a fact but not where I learned it, and for storing things which I don’t want to memorize but which have no logical home in my website or LW or elsewhere. It is also helpful for writing my book reviews and the monthly newsletter, as I can read through my book excerpts to remind myself of the highlights, and at the end of the month review clippings from papers/webpages to find good things to reshare which I was too busy to share at the time or whose importance I was unsure of. I don’t make any use of more complex Evernote features.

    I periodically back up my Evernote using the Linux client Nixnote’s export feature. (I made sure there was a working export method before I began using Evernote, and use it only as long as Nixnote continues to work.)

    My workflow for dealing with PDFs, as of late 2014, is:

    1. if necessary, jailbreak the paper using Libgen or a university proxy, then upload a copy to Dropbox, named year-author.pdf

    2. read the paper, making excerpts as I go

    3. store the metadata & excerpts in Evernote

    4. if useful, integrate into Gwern.net with its title/year/author metadata, adding a local fulltext copy if the paper had to be jailbroken, otherwise rely on my custom archiving setup to preserve the remote URL

    5. hence, any future searches for the filename / title / key contents should result in hits either in my Evernote or Gwern.net

  3. Web pages are archived & backed up by my custom archiving setup. This is intended mostly for fixing dead links (eg. to recover the fulltext of the original URL of an Evernote clipping).

  4. I don’t have any special book reading techniques. For really good books I excerpt from each chapter and stick the quotes into Evernote.

  5. I store insights and thoughts in various pages as parenthetical comments, footnotes, and appendices. If they don’t fit anywhere, I dump them in Notes.

  6. Larger masses of citations and quotes typically get turned into pages.

  7. I make heavy use of RSS subscriptions for news. For that, I am currently using Liferea. (Not that I’m hugely thrilled about it. Google Reader was much better.)

  8. For projects and followups, I use reminders in Google Calendar.

  9. For recording personal data, I automate as much as possible (eg. Zeo and arbtt) and I make a habit of the rest—getting up in the morning is a great time to build a habit of recording data because it’s a time of habits like eating breakfast and getting dressed.

Hence, to refind information, I use a combination of Google, Evernote, grep (on the Gwern.net files), occasionally Mnemosyne, and a good visual memory.
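The refinding step can be made concrete with a minimal Python sketch of two pieces mentioned above: the `year-author.pdf` naming convention and a grep-style search over a directory of text files. This is an illustration under stated assumptions, not the setup actually used for the site. The function names `paper_filename` and `find_mentions`, the lowercasing of the author name, and the choice of `.md`/`.txt` suffixes are all hypothetical.

```python
import re
from pathlib import Path


def paper_filename(year: int, author: str) -> str:
    """Build a `year-author.pdf` name, e.g. 2014-smith.pdf."""
    # Assumption: lowercase the surname and drop anything that is not a
    # letter or digit, so the name is easy to type into a search.
    surname = re.sub(r"[^a-z0-9]+", "", author.lower())
    return f"{year}-{surname}.pdf"


def find_mentions(root: Path, needle: str, suffixes=(".md", ".txt")) -> list:
    """Grep-like search: (file, line number, line) for each case-insensitive hit."""
    hits = []
    for path in sorted(root.rglob("*")):
        # Only look inside regular files with a text-like suffix.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle.lower() in line.lower():
                hits.append((path, number, line.strip()))
    return hits
```

Calling `find_mentions(Path("."), paper_filename(2014, "Smith"))` would list every line in the current directory tree that mentions `2014-smith.pdf`, which is roughly what a recursive case-insensitive `grep` for the filename does.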

As far as writing goes, I do not use note-taking software or things like FreeMind or org-mode—not that I think they are useless but I am worried about whether they would ever repay the large upfront investments of learning/tweaking or interfere with other things. Instead, I occasionally compile outlines of articles from comments on LW/Reddit/IRC, keep editing them with stuff as I remember them, search for relevant parts, allow little thoughts to bubble up while meditating, and pay attention to when I am irritated at people being wrong or annoyed that a particular topic hasn’t been written down yet.

My Experience of Writing

That’s not writing; that’s just typewriting.

Truman Capote on the Beat Generation (1959)

What is it like to write, for me? Maybe a bit different than for you.

Why don’t you write? Since I find it easy to write, I’ve been puzzled by the many people I know who have worthwhile things they could write, and are fully capable of ‘writing’ them in the sense of explaining them to me (often in text-based chat!) in sufficient detail that it could be turned into a serviceable blog post—but who won’t.

Typical mind. Asking people about their experience of writing, and what bars them from taking that critical step even when they agree the topic is worthwhile & they would like to have the writeup, I’ve come to realize my experience of writing is different from theirs.

Blank-page tyranny. For them, the problem with longform writing is not a lack of material (my default assumption), but the writing being a school-like exercise in pain & tedium, as they struggle to fill up the blank page and assemble their atomic details into a coherent output: they struggle to take their pile of individual playing cards, and build a house of cards (which could topple at the first mistake).

Text earworms. For me, this is almost never a problem because my experience of writing is radically different.

So, I would divide my writing into two types: ‘incremental’/‘routine’/occasional, and ‘big bang’: Incremental writing is the ordinary kind of writing where I might add a quote or reference, or copyedit something, or write a small forgettable response to someone. Most people do not find incremental writing to be hard, and may do quite a lot of it, whether in email or social media or work. ‘Big bang’ writing, on the other hand, is the more valuable sort where I sit down and bang out a long comment or even an entire essay like “Why Not To Write A Book” in a single sitting; this is the sort of writing that people are impressed by and find themselves unable to do, but which I do fairly regularly. How?

I rarely struggle with assembling my fragments into a whole, because the whole instead inflicts itself on me. Much of my writing is like a musical earworm or intrusive thought: I experience the rumination as a mental voice reciting a paragraph, looping indefinitely, until I suppress it or get distracted (but then it may return, of its own volition). The paragraph might be a dialogue32, a comment in reply to someone specific, or on a general topic, a tweet, an email, or it might be the key paragraph for an essay I’ve been musing for a while—anything, really. The paragraph usually starts as a tangled rat’s-nest of fragments, allusions, citations, and parenthetical digressions, and gradually cleans itself up into something more readable. (One can always tell when I wrote something in a rush, without the benefit of revision to flatten it out, because of the nested parentheticals and tangents.) No one else seems to operate the same way.

Transcription. The voice is clearly myself, and does not feel like any kind of muse or external force, any more than a musical earworm or phrase feels “alien” to you; but it is effortless and involuntary33, and hard to make it go away, which can be annoying. I am apparently so disagreeable that my brain can’t stop arguing with itself. So I put down my writings into my website in order to forget the writings in my head. Thus, ‘big bang’ writing is easy—I am simply an amanuensis for the voice in my head. The key text has repeated itself so often that I can write it in a sitting. Indeed, these paragraphs have repeated themselves so many times that I am barely even thinking about this as I write it. (Instead I am thinking about a different topic: the value of writing a book vs blog posts.) I don’t experience the ‘tyranny of the blank page’, so much as the ‘tyranny of transcription’.

90% done. The real pain comes in the editing process afterwards, where I must laboriously stitch together the fragments and copyedit and add references and markup. (The voice is no help there, having gone silent once the loop has been written down.) The ‘incremental’ phase of writing is frustrating enough that I generally avoid writing as long as I can, in the hopes that the voice will give up and go away.34 If I cannot outwait the recitation, if it keeps returning over enough periods, or some specific reason comes up (like an interested reader), then I may bother to write it down (rather than do something more fun, like read new research papers).

No free lunch. But that is only my experience of writing. Even if it does not feel like effortful thinking, and often like simply writing down something ‘obvious’, the voice comes from somewhere, and is not divine inspiration. The underlying reality must be the usual one: writing is like gardening. One patiently tends one’s garden, seeding and watering and pruning, and green shoots come up, and one day, one may behold a sudden blossoming, which one may cut and put in a vase to be seen by all. Or not, and let it wither and fall.

Confidence Tags

Most of the metadata in each page is self-explanatory: the date is the last time the page was meaningfully modified35, the tags are categorization, etc. The “status” tag describes the state of completion: whether it’s a pile of links & snippets & “notes”, or whether it is a “draft” which at least has some structure and conveys a coherent thesis, or it’s a well-developed draft which could be described as “in progress”, and finally when a page is done—in lieu of additional material turning up—it is simply “finished”.

The “confidence” tag is a little more unusual. I stole the idea from Muflax’s “epistemic state” tags; I use the same meaning for “log” for collections of data or links (“log entries that simply describe what happened without any judgment or reflection”); personal or reflective writing can be tagged “emotional” (“some cluster of ideas that got itself entangled with a complex emotional state, and I needed to externalize it to even look at it; in no way endorsed, but occasionally necessary (similar to fiction)”), and “fiction” needs no explanation (every author has some reason for writing the story or poem they do, but not even they always know whether it is an expression of their deepest fears, desires, history, or simply random thoughts). I drop his other tags in favor of giving my subjective probability using the “Kesselman List of Estimative Words”:

  1. “certain”

  2. “highly likely”

  3. “likely”

  4. “possible” (my preference over Kesselman’s “Chances a Little Better [or Less]”)

  5. “unlikely”

  6. “highly unlikely”

  7. “remote”

  8. “impossible”

These are used to express my feeling about how well-supported the essay is, or how likely it is the overall ideas are right. (Of course, an interesting idea may be worth writing about even if very wrong, and even a long shot may be profitable to examine if the potential payoff is large enough.)
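Because the status and confidence vocabularies are small and fixed, they are easy to check mechanically. The following is a minimal Python sketch, assuming (hypothetically) that a page's metadata is available as a dictionary with `status` and `confidence` keys; the names `check_metadata` and `more_confident` are illustrative only, and the tag values are taken from the descriptions above.

```python
# Completion states, from least to most finished.
STATUS_TAGS = ("notes", "draft", "in progress", "finished")

# Descriptive tags kept from Muflax's scheme.
DESCRIPTIVE_CONFIDENCE = ("log", "emotional", "fiction")

# Estimative words, ordered from most to least probable.
PROBABILITY_CONFIDENCE = (
    "certain", "highly likely", "likely", "possible",
    "unlikely", "highly unlikely", "remote", "impossible",
)


def check_metadata(meta: dict) -> list:
    """Return human-readable problems; an empty list means both tags are recognized."""
    problems = []
    status = meta.get("status")
    if status not in STATUS_TAGS:
        problems.append(f"unknown status: {status!r}")
    confidence = meta.get("confidence")
    if confidence not in DESCRIPTIVE_CONFIDENCE + PROBABILITY_CONFIDENCE:
        problems.append(f"unknown confidence: {confidence!r}")
    return problems


def more_confident(a: str, b: str) -> bool:
    """True if estimative word `a` expresses a higher probability than `b`."""
    return PROBABILITY_CONFIDENCE.index(a) < PROBABILITY_CONFIDENCE.index(b)
```

For instance, `check_metadata({"status": "draft", "confidence": "likely"})` returns an empty list, while a misspelled tag is reported by name. Keeping the probability words in an ordered tuple also makes it possible to compare or sort pages by confidence.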

Importance Tags

An additional useful bit of metadata would be a distinction between things which are trivial and those which are about more important topics which might change your life. Using my interactive sorting tool Resorter, I’ve ranked pages in deciles from 0–10 on how important the topic is to myself, the intended reader, or the world. For example, topics like embryo selection for traits such as intelligence or evolutionary pressures towards autonomous AI are vastly more important, and would be ranked 10, than some poems or a dream or someone’s small nootropics self-experiment, which would be ranked 0–1.
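As a rough illustration of turning a sorted ranking into 0–10 ratings, here is a minimal Python sketch. It is not Resorter's actual algorithm: it assumes the pages have already been sorted from least to most important, and the function name and page names are hypothetical.

```python
def importance_ratings(pages_least_to_most, levels=11):
    """Map an already-sorted list of pages onto integer ratings 0..levels-1.

    Assumes the list runs from least to most important; each rating
    bucket receives a roughly equal share of the pages.
    """
    n = len(pages_least_to_most)
    ratings = {}
    for rank, page in enumerate(pages_least_to_most):
        # rank / n lies in [0, 1); scaling by `levels` and truncating
        # yields an integer bucket between 0 and levels - 1.
        ratings[page] = (rank * levels) // n
    return ratings


# Hypothetical page names, ordered from least to most important.
pages = ["a-dream", "some-poems", "nootropics-note", "benford",
         "feedback-form", "hosting-costs", "writing-checklist",
         "statistics-notes", "prediction-markets", "ai-evolution",
         "embryo-selection"]
ratings = importance_ratings(pages)
print(ratings["a-dream"], ratings["embryo-selection"])
```

With fewer pages than rating levels, the top page lands below 10, which is a limitation of this equal-bucket scheme rather than a statement about how the site assigns its ratings.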

Writing Checklist

It turns out that writing essays (technical or philosophical) is a lot like writing code—there are so many ways to err that you need a process with as much automation as possible. My current checklist for finishing an essay:

Markdown Checker

I’ve found that many errors in my writing can be caught by some simple scripts, which I’ve compiled into a shell script, markdown-lint.sh.

My linter does:

  1. checks for corrupted non-text binary files

  2. checks a blacklist of domains which are either dead (eg. Google+) or have a history of being unreliable (eg. ResearchGate, NBER, PNAS); such links need36 to either be fixed, pre-emptively mirrored, or removed entirely.

    • a special case is PDFs hosted on IA; the IA is reliable, but I try to rehost such PDFs so they’ll show up in Google/Google Scholar for everyone else.

  3. Broken syntax: I’ve noticed that when I make Markdown syntax errors, they tend to be predictable and show up either in the original Markdown source, or in the rendered HTML. Two common source errors:

     "(www"
     ")www"

    And the following should rarely show up in the final rendered HTML:

     "\frac"
     "\times"
     "(http"
     ")http"
     "[http"
     "]http"
     " _ "
     "[^"
     "^]"
     "<!--"
     "-->"
     "<-- "
     "<-"
     "->"
     "$title$"
     "$description$"
     "$author$"
     "$tags$"
     "$category$"

    Similarly, I sometimes slip up in writing image/document links so any link starting https://gwern.net or ~/wiki/ or /home/gwern/ is probably wrong. There are a few Pandoc-specific issues that should be checked for too, like duplicate footnote names and images without separating newlines or unescaped dollar signs (which can accidentally lead to sentences being rendered as TeX).

    A final pass with htmltidy finds many errors which slip through, like incorrectly-escaped URLs.

  4. Flag dangerous language: Imperial units are deprecated, but so too is the misleading language of NHST statistics (if one must talk of “significance” I try to flag it as “statistically-significant” to warn the reader). I also avoid some other dangerous words like “obvious” (if it really is, why do I need to say it?).

  5. Bad habits:

    • proselint (with some checks disabled because they play badly with Markdown documents)

    • Another static warning is checking for too-long lines (most common in code blocks, although sometimes broken indentation will cause this) which will cause browsers to use scrollbars, for which I’ve written a Pandoc script

    • one for a bad habit of mine—too-long footnotes

  6. duplicate and hidden-PDF URLs: a URL being linked multiple times is sometimes an error (too much copy-paste or insufficiently edited sections); PDF URLs should receive a visual annotation warning the reader it’s a PDF, but the CSS rules, which catch cases like .pdf$, don’t cover cases where the host quietly serves a PDF anyway, so all URLs are checked. (A URL which is a PDF can be made to trigger the PDF rule by appending #pdf.)

  7. broken links are detected with linkchecker. The best time to fix broken links is when you’re already editing a page.

While this throws many false positives, those are easy to ignore, and the script fights bad habits of mine while giving me much greater confidence that a page doesn’t have any merely technical issues that screw it up (without requiring me to constantly reread pages every time I modify them, lest an accidental typo while making an edit breaks everything).
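The substring, line-length, and duplicate-URL checks described above can be sketched in a few lines of Python. This is an illustration only, not the contents of markdown-lint.sh: the function names, the 80-character limit, and the URL regex are assumptions, and only a subset of the listed substrings is included.

```python
import re
from collections import Counter

# Substrings the text lists as common Markdown source errors.
SOURCE_SUSPECTS = ["(www", ")www"]
# A subset of the substrings that should rarely survive into rendered HTML.
HTML_SUSPECTS = ["\\frac", "\\times", "(http", ")http", "[http", "]http",
                 "[^", "^]", "<!--", "-->", "$title$", "$description$"]


def find_suspects(text, suspects):
    """Return (line_number, substring) pairs for each suspect substring found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for s in suspects:
            if s in line:
                hits.append((lineno, s))
    return hits


def find_long_lines(text, limit=80):
    """Return line numbers longer than `limit` characters (limit is assumed)."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if len(line) > limit]


def find_duplicate_urls(text):
    """Return URLs that occur more than once, sorted alphabetically."""
    urls = re.findall(r"https?://[^\s)\]>]+", text)
    return sorted(u for u, count in Counter(urls).items() if count > 1)


sample = "See [the paper](www.example.com/paper.pdf)\nA fine line.\n"
print(find_suspects(sample, SOURCE_SUSPECTS))  # [(1, '(www')]
```

Plain substring matching is deliberately crude: it produces false positives, which matches the trade-off described above of accepting noise in exchange for catching predictable slips.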

Anonymous Feedback

Back in November 2011, lukeprog posted “Tell me what you think of me” where he described his use of a Google Docs form for anonymous receipt of textual feedback or comments. Typically, most forms of communication are non-anonymous, or if they are anonymous, they’re public. One can set up pseudonyms and use those for private contact, but it’s not always that easy, and is definitely a series of trivial inconveniences (if anonymous feedback is not solicited, one has to feel it’s important enough to do and violate implicit norms against anonymous messages; one has to set up an identity; one has to compose and send off the message, etc.).

I thought it was a good idea to try out, and on 2011-11-08, I set up my own anonymous feedback form and stuck it in the footer of all pages on Gwern.net where it remains to this day. I did wonder if anyone would use the form, especially since I am easy to contact via email, use multiple sites like Reddit or Lesswrong, and even my Disqus comments allowed anonymous comments—so who, if anyone, would be using this form? I scheduled a followup in 2 years on 2013-11-30 to review how the form fared.

754 days, 2.884m page views, and 1.350m unique visitors later, I have received 116 pieces of feedback (mean of 24.8k visits per feedback). I categorize them as follows in descending order of frequency:

  • Corrections, problems (technical or otherwise), suggested edits: 34

  • Praise: 31

  • Question/request (personal, tech support, etc.): 22

  • Misc (eg. gibberish, socializing, Japanese): 13

  • Criticism: 9

  • News/suggestions: 5

  • Feature request: 4

  • Request for cybering: 1

  • Extortion: 1 (see my blackmail page dealing with the September 2013 incident)

Some submissions cover multiple angles (they can be quite long), sometimes people double-submitted or left it blank, etc., so the numbers won’t sum to 116.
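The summary figures above can be rechecked with a few lines of Python. The inputs are the rounded numbers quoted in the text (2.884m page views, 116 submissions, 754 days), so the results are approximate; the variable names are illustrative.

```python
page_views = 2_884_000      # "2.884m page views", as rounded in the text
feedback_items = 116
days = 754

# Category counts as listed above.
category_counts = {
    "corrections": 34, "praise": 31, "questions": 22, "misc": 13,
    "criticism": 9, "news": 5, "feature requests": 4,
    "cybering": 1, "extortion": 1,
}

views_per_feedback = page_views / feedback_items
days_per_feedback = days / feedback_items
category_total = sum(category_counts.values())

print(round(views_per_feedback))   # 24862, i.e. ~24.8k page views per item
print(days_per_feedback)           # 6.5 days between submissions on average
print(category_total)              # 120, more than 116 due to multi-topic items
```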

In general, a lot of the corrections were usable and fixed issues of varying importance, from typos to the entire site’s CSS being broken due to being uploaded with the wrong MIME type. One of the news/suggestion feedbacks was very valuable, as it led to writing the Silk Road mini-essay “A Mole?” A lot of the questions were a waste of my time; I’d say half related to Tor/Bitcoin/Silk-Road. (I also got an irritating number of emails from people asking me to, say, buy LSD or heroin off SR for them.) The feature requests were usually for a better RSS feed, which I tried to oblige by starting the Changelog page. The cybering and extortion were amusing, if nothing else. The praise was good for me mentally, as I don’t interact much with people.

I consider the anonymous feedback form to have been a success, I’m glad lukeprog brought it up on LW, and I plan to keep the feedback form indefinitely.

Feedback Causes

One thing I wondered is whether feedback was purely a function of traffic (the more visits, the more people who could see the link in the footer and decide to leave a comment), or more related to time (perhaps people returning regularly and eventually being emboldened or noticing something to comment on). So I compiled daily hits, combined with the feedback dates, and looked at a graph of hits:

Hits over time for Gwern.net

The hits are heavily skewed by Hacker News & Reddit traffic spikes, and probably should be log transformed. Then I did a logistic regression on hits, log hits, and a simple time index:

feedback <- read.csv("https://gwern.net/doc/traffic/2013-gwern-gwernnet-anonymousfeedback.csv",
                     colClasses=c("Date","logical","integer"))
plot(Visits ~ Day, data=feedback)
feedback$Time <- 1:nrow(feedback)
summary(step(glm(Feedback ~ log(Visits) + Visits + Time, family=binomial, data=feedback)))
# ...
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -7.363507   1.311703   -5.61  2.0e-08
# log(Visits)  0.749730   0.173846    4.31  1.6e-05
# Time        -0.000881   0.000569   -1.55     0.12
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 578.78  on 753  degrees of freedom
# Residual deviance: 559.94  on 751  degrees of freedom
# AIC: 565.9

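The logged hits work out better than regular hits, and survive into the simplified model chosen by step(), while raw hits are dropped. And the traffic influence seems much larger than the time variable (which is, curiously, negative, though its p-value of 0.12 means the sign is not well established).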
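To make the fitted coefficients concrete, here is a small Python sketch that plugs them into the logistic formula. It only reuses the estimates printed in the R output above; the function names are hypothetical, and the model is taken at face value with no uncertainty attached.

```python
import math

# Point estimates copied from the R summary above.
INTERCEPT = -7.363507
B_LOG_VISITS = 0.749730
B_TIME = -0.000881


def feedback_probability(visits, day_index):
    """Predicted probability of at least one feedback on a given day."""
    log_odds = INTERCEPT + B_LOG_VISITS * math.log(visits) + B_TIME * day_index
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_multiplier(traffic_ratio):
    """Factor by which the odds of feedback change when visits scale by traffic_ratio."""
    # exp(b * log(r)) simplifies to r ** b for a log-transformed predictor.
    return traffic_ratio ** B_LOG_VISITS


print(round(odds_multiplier(2), 2))  # 1.68
```

Under this model, doubling a day's visits multiplies the odds of receiving feedback by about 1.68, whereas a single additional day changes the odds by a factor of exp(-0.000881), which is very close to 1.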

Technical Aspects

Popularity

On a semi-annual basis, since 2011, I review Gwern.net website traffic using Google Analytics; although what most readers value is not what I value, I find it motivating to see total traffic statistics reminding me of readers (writing can be a lonely and abstract endeavour), and useful to see what are major referrers.

Gwern.net typically enjoys steady traffic in the 50–100k range per month, with occasional spikes from social media, particularly Hacker News; over the first decade (2010–2020), there were 7.98m pageviews by 3.8m unique users.

See Gwern.net Website Traffic

Colophon

Hosting

Gwern.net is served by Amazon S3 through the CloudFlare CDN. (Amazon charges less for bandwidth and disk space than NearlyFreeSpeech.net, an old hosting company I originally used, although one loses all the capabilities offered by Apache’s .htaccess, and Brotli compression is difficult so must be handled by CloudFlare; total costs may turn out to be a wash and I will consider the switch to Amazon S3 a success if it can bring my monthly bill to <$10 or <$120 a year.)

From October 2010 to June 2012, the site was hosted on NFSN; its specific niche is controversial material and activist-friendly pricing. Its libertarian owners cast a jaundiced eye on takedown requests, and pricing is pay-as-you-go. I like the former aspect, but the latter sold me on NFSN. Before I stumbled on NFSN (someone mentioned it offhandedly while chatting), I was getting ready to pay $10–15 a month ($120 yearly) to Linode. Linode’s offerings are overkill since I do not run dynamic websites or something like Haskell.org (with wikis and mailing lists and darcs repositories), but I didn’t know a good alternative. NFSN’s pricing meant that I paid for usage rather than large flat fees. I put in $32 to cover registering Gwern.net until 2014, and then another $10 to cover bandwidth & storage price. DNS aside, I was billed $8.27 for October-December 2010; DNS included, January-April 2011 cost $10.09. $10 covered months of Gwern.net for what I would have paid Linode in 1 month! In total, my 2010 costs were $39.44 (bill archive); my 2011 costs were $118.32 ($9.86 a month; archive); and my 2012 costs through June were $112.54 ($21 a month; archive); sum total: $270.3.

The switch to Amazon S3 hosting is complicated by my simultaneous addition of CloudFlare as a CDN; my total June 2012 Amazon bill is $1.62, with $0.19 for storage. CloudFlare claims it covered 17.5GB of 24.9GB total bandwidth, so the $1.41 of bandwidth charges covers only the ~30% of traffic that S3 itself served; multiplying $1.41 by ~3 gives ~$4.2, so my hypothetical non-CloudFlare S3 bill is ~$4.5. Even at $10, this was well below the $21 monthly cost at NFSN. (The traffic graph indicates that June 2012 was a relatively quiet period, but I don’t think this eliminates the factor of 5.) From July 2012 to June 2013, my Amazon bills totaled $60, which is reasonable except for the steady increase ($1.62/$3.27/$2.43/$2.45/$2.88/$3.43/$4.12/$5.36/$5.65/$5.49/$4.88/$8.48/$9.26), being primarily driven by out-bound bandwidth (in June 2013, the $9.26 was largely due to the 75GB transferred—and that was after CloudFlare dealt with 82GB); $9.26 is much higher than I would prefer since that would be >$110 annually. This was probably due to all the graphics I included in the “Google shutdowns” analysis, since it returned to a more reasonable $5.14 on 42GB of traffic in August. September, October, November and December 2013 saw high levels maintained at $7.63/$12.11/$5.49/$8.75, so it’s probably a new normal. 2014 entailed new costs related to EC2 instances & S3 bandwidth spikes due to hosting a multi-gigabyte scientific dataset, so bills ran $8.51/$7.40/$7.32/$9.15/$26.63/$14.75/$7.79/$7.98/$8.98/$7.71/$7/$5.94. 2015 & 2016 were similar: $5.94/$7.30/$8.21/$9.00/$8.00/$8.30/$10.00/$9.68/$14.74/$7.10/$7.39/$8.03/$8.20/$8.31/$8.25/$9.04/$7.60/$7.93/$7.96/$9.98/$9.22/$11.80/$9.01/$8.87.
2017 saw costs increase due to one of my side-projects, aggressively increasing fulltexting of Gwern.net by providing more papers & scanning cited books, only partially offset by changes like lossy optimization of images & converting GIFs to WebMs: $12.49/$10.68/$11.02/$12.53/$11.05/$10.63/$9.04/$11.03/$14.67/$15.52/$13.12/$12.23 (total: $144.01). In 2018, I continued fulltexting: $13.08/$14.85/$14.14/$18.73/$18.88/$15.92/$15.64/$15.27/$16.66/$22.56/$23.59/$25.91 (total: $213).
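The June 2012 estimate of what S3 would have cost without the CDN can be written out as a short Python sketch. It assumes, as a simplification, that bandwidth cost scales linearly with gigabytes served and that the $1.41 is purely a bandwidth charge; the function name is hypothetical.

```python
def uncached_bandwidth_cost(billed_bandwidth_usd, total_gb, cdn_served_gb):
    """Estimate the bandwidth bill had the origin served all traffic itself.

    Assumes cost is proportional to gigabytes transferred.
    """
    origin_gb = total_gb - cdn_served_gb   # traffic S3 actually served
    origin_share = origin_gb / total_gb    # about 0.30 for June 2012
    return billed_bandwidth_usd / origin_share


# June 2012 figures quoted above: $1.41 bandwidth, 24.9GB total, 17.5GB via CDN.
estimate = uncached_bandwidth_cost(1.41, total_gb=24.9, cdn_served_gb=17.5)
print(round(estimate, 2))  # 4.74
```

Using the unrounded 29.7% share gives a slightly higher figure than multiplying by 3, but the conclusion is the same: the hypothetical bill is a few dollars, well under the NFSN cost.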

For 2019, I made a determined effort to host more things, including whole websites like the OKCupid archives or rotten.com, and to include more images/videos (the StyleGAN anime faces tutorial alone must be easily 20MB+ just for images) and it shows in how my bandwidth costs exploded: $26.49/$37.56/$37.56/$37.56/$25/$25/$25/$25/$77.91/$124.45/$74.32/$79.19 (2019 dollars; inflation-adjusted: $33.8/$47.92/$47.92/$47.92/$31.89/$31.89/$31.89/$31.89/$99.4/$158.77/$94.82/$101.03). I began considering a move of Gwern.net to my Hetzner dedicated server which has cheap bandwidth + ~6tb space, combined with upgrading my Cloudflare CDN to keep site latency in check (even at $20/month in 2020 dollars, or $25.16 inflation-adjusted, it’s still far cheaper than AWS S3 bandwidth).

In 2020, I did so, merging the hosting of Gwern.net, ThisWaifuDoesNotExist, Danbooru20xx, miscellaneous ML datasets & models, all onto a single Hetzner dedicated server, for ~$50/month (2020 dollars; ~$62.89 inflation-adjusted). With uncapped bandwidth, I could be much more aggressive about hosting files and automatically archiving webpage snapshots. This was highly satisfactory for the next 2 years, but the growth of Danbooru20xx eventually exceeded the drive space and I relocated to another server with >20tb space, costing ~$60. (I didn’t need the full 20tb immediately but that left a large safety margin and I was thinking of creating some additional datasets like Danbooru20xx, using Derpibooru & e621—the goal being to eventually create a single model handling all kinds of illustration-based fandoms with much higher quality than the default of everyone creating their own small underpowered model on just their personal interest.)

Source

The revision history is kept in git; individual Markdown page sources can be read by appending .md to their URL (eg. for this page). The site infrastructure is available on Github.

Size

As of 2022-11-08, the source of Gwern.net is composed of >443 text files with >4.38m words or >31MB; this includes my writings & documents I have transcribed into Markdown, but excludes images, PDFs, HTML mirrors, source code, archives, infrastructure (such as tag-directories), popups, and the revision history. With those included and everything compiled to the static37 HTML, the site is >72GB. The source repository contains >16,629 patches (this is an under-count as the creation of the repository on 2008-09-26 included already-written material); the infrastructure repository, >5,807.

Design

Moved to “Design Of This Website”.

License

This site is licensed under the Creative Commons public domain (CC-0) license.

I believe the public domain license reduces FUD and dead-weight loss38, encourages copying (LOCKSS), gives back (however little) to Free Software/Free Content, and costs me nothing39.

Appendix

Benford’s Law

Does Gwern.net follow the famous Benford’s law?

A quick analysis suggests that it sort of does, except for the digit 2, probably due to the many citations to research from the past 2 decades (>2000 AD).

In March 2013 I wondered, upon seeing a mention of Benford’s law: “if I extracted all the numbers from everything I’ve written on Gwern.net, would it satisfy Benford’s law?” It seems the answer is… almost. I generate the list of numbers by running a Haskell program to parse digits, commas, and periods; and then I process it with shell utilities.40 This can then be read in R to run a chi-squared test confirming lack of fit (p ≈ 0) and generate this comparison of the data & Benford’s law41:

Histogram/barplot of parsed numbers vs predicted

There’s a clear resemblance for everything but the digit ‘2’, which then blows the fit to heck. I have no idea why 2 is overrepresented—it may be due to all the citations to recent academic papers which would involve numbers starting with ‘2’ (2002, 2010, 2013…) and cause a double-count in both the citation and filename, since if I look in the docs/ fulltext folder, I see 160 files starting with ‘1’ but 326 starting with ‘2’. But this can’t be the entire explanation since ‘2’ has 20.3k entries while to fit Benford, it needs to be just 11.5k—leaving a gap of ~10k numbers unexplained. A mystery.
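For readers who want to try the same comparison on their own text, here is a minimal Python sketch of the idea. It is not the Haskell, shell, and R pipeline used for the analysis above: the regex, the function names, and the sample string are assumptions for illustration.

```python
import math
import re
from collections import Counter


def benford_expected(digit):
    """Benford's law: expected proportion of numbers with this leading digit."""
    return math.log10(1 + 1 / digit)


def leading_digits(text):
    """Extract the first non-zero digit of every number-like token in text."""
    digits = []
    for token in re.findall(r"\d[\d,.]*", text):
        for ch in token:
            if ch in "123456789":
                digits.append(int(ch))
                break
    return digits


def chi_squared(observed_counts):
    """Chi-squared statistic of observed leading-digit counts versus Benford."""
    total = sum(observed_counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * benford_expected(d)
        stat += (observed_counts.get(d, 0) - expected) ** 2 / expected
    return stat


sample = "In 2013 I paid $0.19 and saw 1,350 visits; 007"
observed = Counter(leading_digits(sample))
print(sorted(observed.items()))           # [(1, 2), (2, 1), (7, 1)]
print(round(benford_expected(2), 4))      # 0.1761
```

Since Benford's law predicts about 17.6% of numbers begin with ‘2’, an expected count of 11.5k would correspond to a total of roughly 65k parsed numbers; that total is an inference from the figures above, not a number reported in the text.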
