Internet Search Tips

A description of advanced tips and tricks for effective Internet research of papers/books.
archiving, technology, shell, Google, tutorial
2018-12-11–2020-01-21 finished certainty: certain importance: 4

Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. I try to write down my search workflow, and give general advice about finding and hosting documents.

Google-fu search skill is something I’ve prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I’d win. And I can still remember the exact moment it dawned on me in high school that much of the rest of my life would be spent dealing with searches, paywalls, and broken links. The Internet is the greatest almanac of all, and to the curious, a never-ending cornucopia, so I am sad to see many fail to find things after a cursory search—or not look at all. For most people, if it’s not the first hit in Google/Google Scholar, it doesn’t exist. Below, I reveal my best Internet search tricks and try to provide a rough flowchart of how to go about an online search, explaining the subtle tricks and intuition of search-fu.



Human flesh search engine. Last resort: if none of this works, there are a few places online you can request a copy (however, they will usually fail if you have exhausted all previous avenues):

Finally, you can always try to contact the author. This only occasionally works for the papers I have the hardest time with, since they tend to be old ones where the author is dead or unreachable—any author publishing a paper since 1990 will usually have been digitized somewhere—but it’s easy to try.


After finding a fulltext copy, you should find a reliable long-term link/place to store it and make it more findable (remember—if it’s not in Google/Google Scholar, it doesn’t exist!):

  • Never Link Unreliable Hosts:

    • LG/SH: Always operate under the assumption they could be gone tomorrow. (As my uncle found out shortly after paying for a lifetime membership!) There are no guarantees either one will be around for long under their legal assaults or the behind-the-scenes dramas, and no guarantee that they are being properly mirrored or will be restored elsewhere. Download anything you need and keep a copy of it yourself and, ideally, host it publicly.
    • NBER: never rely on an NBER URL, as they are temporary. (SSRN is also undesirable due to making it increasingly difficult to download, but it is at least reliable.)
    • Scribd: never link Scribd—they are a scummy website which impedes downloads, and anything on Scribd usually first appeared elsewhere anyway. (In fact, if you run into anything vaguely useful-looking which exists only on Scribd, you’ll do humanity a service if you copy it elsewhere just in case.)
    • RG: avoid linking to ResearchGate (compromised by new ownership & PDFs get deleted routinely, apparently often by authors) or to hosts whose URLs are one-time and break.
    • high-impact journals: be careful linking to Nature.com or Cell (if a paper is not explicitly marked as Open Access, even if it’s available, it may disappear in a few months!); similarly, watch out for other high-impact journals which pull similar shenanigans.
    • ~/: be careful linking to academic personal directories on university websites (often noticeable by the Unix convention .edu/~user/ or by directories suggestive of ephemeral hosting, like .edu/cs/course112/readings/foo.pdf); they have short half-lives.
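
    Such fragile hosts can be flagged mechanically when auditing one’s existing links; a minimal sketch (the refs.txt file and its URLs are hypothetical example data, and the patterns just encode the rules above):

```shell
# list of reference URLs to audit (example data)
cat <<'EOF' > refs.txt
https://www.jstor.org/stable/123456
http://www.cs.example.edu/~jdoe/papers/foo.pdf
https://www.scribd.com/document/99999/bar
https://arxiv.org/abs/1801.00001
EOF
# flag personal ~/ directories, Scribd, & ResearchGate links for rehosting
grep -E '(/~|scribd\.com|researchgate\.net)' refs.txt
```

    Any URL this prints is a candidate for downloading and rehosting somewhere more durable.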
  • PDF Editing: if a scan, it may be worth editing the PDF to crop the edges, threshold to binarize it (which, for a bad grayscale or color scan, can drastically reduce filesize while increasing readability), and OCR it.

    I use gscan2pdf, but there are alternatives worth checking out.

  • Check & Improve Metadata.

    Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it’s not online, does it really exist?) and helps you if you decide to use bibliographic software in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year.

    PDFs can be easily annotated with metadata using exiftool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.

    For papers hidden inside volumes or other files, you should extract the relevant page range to create a single relevant file. (For extraction of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf. Many publishers insert a spam page as the first page. You can drop that easily with pdftk INPUT.pdf cat 2-end output OUTPUT.pdf, but note that PDFtk may drop all metadata, so do that before adding any metadata.)

    I try to set at least title/author/DOI/year/subject, and stuff any addi­tional top­ics & bib­li­o­graphic infor­ma­tion into the “Key­words” field. Exam­ple of set­ting meta­data:

    exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
        -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
        first-order logic,  Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
        Society_, Volume s2-30, Issue 1, 1930-01-01, pg264-286" 1930-ramsey.pdf
  • Public Hosting: if possible, host a public copy; especially if it was very difficult to find, even if it was useless, it should be hosted. The life you save may be your own.

  • Link On WP/Social Media: for bonus points, link it in appropriate places on Wikipedia or Reddit or Twitter; this makes people aware of the copy being available, and also supercharges visibility in search engines.

  • Link Specific Pages: as noted before, you can link a specific page by adding #page=N to the URL. Linking the relevant page is helpful to readers.


Aside from the (highly-recommended) use of hotkeys and Booleans for searches, there are a few useful tools for the researcher, which while expensive initially, can pay off in the long-term:

  • archiver-bot: automatically archive your web browsing and/or links from arbitrary websites to forestall linkrot; particularly useful for detecting & recovering from dead PDF links

  • Subscriptions like PubMed & GS search alerts: set up alerts for a specific search query, or for new citations of a specific paper. (Google Alerts is not as useful as it seems.)

    1. PubMed has straightforward conversion of search queries into alerts: “Create alert” below the search bar. (Given the volume of PubMed indexing, I recommend carefully tailoring your search to be as narrow as possible, or else your alerts may overwhelm you.)
    2. To create a generic GS search query alert, simply use the “Create alert” on the sidebar for any search. To follow citations of a key paper, you must: 1. bring up the paper in GS; 2. click on “Cited by X”; 3. then use “Create alert” on the sidebar.
  • GCSE: a Google Custom Search Engine is a specialized search query limited to whitelisted pages/domains etc (eg my Wikipedia-focused anime/manga CSE).

    A GCSE can be thought of as a saved search query on steroids. If you find yourself regularly including scores of the same domains in multiple searches, or constantly blacklisting domains with -site: or using many negations to filter out common false positives, it may be time to set up a GCSE which does all that by default.

  • Clippings: regularly making and keeping excerpts (eg with Evernote) creates a personalized search engine, in effect.

    This can be vital for refinding old things you read where the search terms are hopelessly generic or you can’t remember an exact quote or reference; it is one thing to search a keyword like “autism” in a few score thousand clippings, and another thing to search that in the entire Internet! (One can also reorganize or edit the notes to add in the keywords one is thinking of, to help with refinding.) I make heavy use of Evernote clipping and it is key to refinding my references.
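
    If clippings are kept as plain text files, refinding is a one-liner; a minimal sketch (the clippings/ directory and its notes are hypothetical example data):

```shell
# build a toy clippings directory (example data)
mkdir -p clippings
printf 'A note about autism heritability estimates.\n' > clippings/note1.txt
printf 'An unrelated note about book scanning.\n'      > clippings/note2.txt
# case-insensitive recursive search, listing only matching files
grep -rli 'autism' clippings
```

    Searching a few thousand personal excerpts this way is instant, unlike searching the whole Internet for a hopelessly generic keyword.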

  • Crawling Websites: sometimes having copies of whole websites might be useful, either for more flexible searching or for ensuring you have anything you might need in the future.

    Useful tools to know about: wget, curl, HTTrack; Firefox plugins: NoScript, uBlock Origin, Live HTTP Headers, Bypass Paywalls, cookie exporting.

    Short of downloading a website, it might also be useful to pre-emptively archive it by using linkchecker to crawl it, compile a list of all external & internal links, and store them for processing by another archival program. In certain rare circumstances, security tools like nmap can be useful to examine a mysterious server in more detail: what web server and services does it run, what else might be on it (sometimes interesting things like old anonymous FTP servers turn up), has a website moved between IPs or servers, etc.
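
A minimal sketch of the linkchecker step (the network call is shown commented out; the CSV column layout below is an assumption for illustration, and the links.csv contents are made-up sample data):

```shell
# Crawl a site and dump every link found as CSV (network; commented out here):
# linkchecker --verbose --output=csv 'https://example.com/' > links.csv

# Given such a CSV (sample data: semicolon-separated with a header row),
# extract just the discovered URLs for feeding into an archival tool:
cat <<'EOF' > links.csv
urlname;parentname;result
https://example.com/a.pdf;https://example.com/;200 OK
https://example.com/dead;https://example.com/;404 Not Found
EOF
awk -F';' 'NR > 1 { print $1 }' links.csv
```

The resulting URL list can then be fed into whatever archival program one prefers.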

Web pages

With proper use of pre-emptive archiving tools like archiver-bot, fixing linkrot in one’s own pages is much easier, but that leaves other references. Searching for lost web pages is similar to searching for papers:

  • Just Search The Title: if the page title is given, search for the title.

    It is a good idea to include page titles in one’s own pages, as well as the URL, to help with future searches, since the URL may be meaningless gibberish on its own, and pre-emptive archiving can fail. HTML supports both alt and title parameters in link tags, and, in cases where displaying a title is not desirable (because the link is being used inline as part of normal hypertextual writing), titles can be included cleanly in Markdown documents like this: [inline text description](URL "Title").

  • Clean URLs: check the URL for weirdness or trailing garbage like ?rss=1 or ?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FgJZg+%28Google+AI+Blog%29? Or a variant domain, like a mobile URL? Those are all less likely to be findable or archived than the canonical version.
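
    Stripping such tracking junk can be automated; a minimal sketch (the URL is a made-up example, and the exact parameters worth stripping vary by site):

```shell
url='https://example.com/post.html?utm_source=feedburner&utm_medium=feed&rss=1'
# drop utm_* tracking parameters and feed suffixes to recover the canonical URL
echo "$url" | sed -e 's/[?&]utm_[^&]*//g' -e 's/[?&]rss=1//g'
# → https://example.com/post.html
```

    The cleaned, canonical URL is far more likely to turn up in search engines and archives.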

  • Domain Site Search: restrict G search to the original domain with site:, or to related domains

  • Time-Limited Search: restrict G search to the original date-range/years

  • Switch Engines: try a different search engine: corpuses can vary, and in some cases G tries to be too smart for its own good when you need a literal search; DuckDuckGo & other engines are usable alternatives (especially if one of DuckDuckGo’s ‘bang’ special searches is what one needs)

  • Check Archives: if nowhere on the clearnet, try the Internet Archive (IA) or the Memento meta-archive search engine:

    IA is the default backup for a dead URL. If IA doesn’t Just Work, there may be other versions in it:

    • misleading redirects: did the IA ‘helpfully’ redirect you to a much-later-in-time error page? Kill the redirect and check the earliest stored version for the exact URL rather than the redirect. Did the page initially load but then error out/redirect? Disable JS with NoScript and reload.

    • Within-Domain Archives: IA lets you list all URLs with any archived versions, by searching for URL/*; the list of available URLs may reveal an alternate newer/older URL. It can also be useful to filter by filetype or substring.

      For example, one might list all URLs in a domain, and if the list is too long and filled with garbage URLs, then use the “Filter results” incremental-search widget to search for “uploads/” on a WordPress blog.

      Screenshot of an oft-overlooked feature of the Internet Archive: displaying all available/archived URLs for a specific domain, filtered down to a subset matching a string like *uploads/*.
      • wayback_machine_downloader (not to be confused with the internetarchive Python package which provides a CLI interface to uploading files) is a Ruby tool which lets you download whole domains from IA, which can be useful for running a local fulltext search using regexps (a good grep query is often enough), in cases where just looking at the URLs via URL/* is not helpful. (Alternatives exist which might work as well.)


      gem install --user-install wayback_machine_downloader
      ~/.gem/ruby/2.5.0/bin/wayback_machine_downloader --all-timestamps ''
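
      The same within-domain listing can also be scripted via IA’s Wayback CDX API (endpoint web.archive.org/cdx/search/cdx; the network call is shown commented out, and the cdx.txt contents are made-up sample output):

```shell
# List all archived URLs under a domain, one per line (network; commented out):
# curl -s 'https://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey' > cdx.txt
# Sample of what such output looks like:
cat <<'EOF' > cdx.txt
http://example.com/index.html
http://example.com/wp-content/uploads/2010/05/paper.pdf
http://example.com/feed/
EOF
# filter for likely fulltext uploads, as with the "Filter results" widget:
grep 'uploads/' cdx.txt
```
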
    • did the domain change, eg from one TLD or subdomain to another? Entirely different as far as IA is concerned.

    • does the internal evidence of the URL provide any hints? You can learn a lot from URLs just by paying attention and thinking about what each directory and argument means.

    • is this a Blogspot blog? Blogspot is uniquely horrible in that it has versions of each blog for every country domain: a blog could be under any of dozens of country TLDs, like foo.blogspot.com or foo.blogspot.jp.

    • did the website provide RSS feeds?

      A little known fact is that Google Reader (GR; October 2005–July 2013) stored all RSS items it crawled, so if a website’s RSS feed was configured to include full items, the RSS feed history was an alternate mirror of the whole website, and since GR never removed RSS items, it was possible to retrieve pages or whole websites from it. GR has since closed down, sadly, but before it closed, Archive Team downloaded a large fraction of GR’s historical RSS feeds, and those archives are now hosted on IA. The catch is that they are stored in mega-WARCs, which, for all their archival virtues, are not the most user-friendly format. The raw GR mega-WARCs are difficult enough to work with that I defer an example to the appendix.

    • an IA-like mirror

    • any local archives, such as those made with my archiver-bot

    • Google Cache (GC): GC works, sometimes, but the copies are usually the worst around, ephemeral & cannot be relied upon. Google also appears to have been steadily deprecating GC over the years, as GC shows up less & less in search results. A last resort.
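
The Memento meta-archive mentioned above can also be queried programmatically via its Time Travel API (timetravel.mementoweb.org; the network call is shown commented out, and the mementos.json contents are a simplified made-up sample of such a response):

```shell
# Ask Memento for archived copies of a URL near a given date (network; commented out):
# curl -s 'http://timetravel.mementoweb.org/api/json/2010/http://example.com/' > mementos.json
# Simplified sample of such a response:
cat <<'EOF' > mementos.json
{"mementos":{"closest":{"datetime":"2010-01-01T00:00:00Z","uri":["http://web.archive.org/web/20100101000000/http://example.com/"]}}}
EOF
# pull out the snapshot URLs:
grep -o 'http://web.archive.org/web/[^"]*' mementos.json
```

This queries many archives at once (IA, archive.today, national libraries, etc), so it is a good single check before giving up on a dead URL.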



E-books are rarer and harder to get than papers, although the situation has improved vastly since the early 2000s. To search for books online:

  • More Straightforward: book searches tend to be faster and simpler than paper searches, and to require less cleverness in search query formulation, perhaps because they are rarer online, much larger, and have simpler titles, making it easier for search engines.

    Search G, not GS, for books:

    No Books in Google Scholar
    Book fulltexts usually don’t show up in GS (for unknown reasons). You need to check G when searching for books.

    To double-check, you can try a filetype:pdf search; then check LG. Typically, if the main title + author doesn’t turn it up, it’s not online. (In some cases, the author order is reversed, or the title:subtitle are reversed, and you can find a copy by tweaking your search, but these are rare.)

  • IA: the Internet Archive has many books scanned which do not appear easily in search results (poor SEO?).

    • If an IA hit pops up in a search, always check it; the OCR may offer hints as to where to find it. If you don’t find anything, try doing an IA site search in G (not the IA built-in search engine), eg book title site:archive.org

    • DRM workarounds: if it is on IA but the IA version is DRMed and is only available for “checkout”, you can jailbreak it.

      Check the book out for the full period, 14 days. Download the PDF (not EPUB) version to Adobe Digital Editions version ≤4.0 (which can be run in Wine on Linux), and then import it to Calibre with the De-DRM plugin, which will produce a DRM-free PDF inside Calibre’s library. (Getting De-DRM running can be tricky, especially under Linux. I wound up having to edit some of the paths in the Python files to make them work with Wine.) You can then add metadata to the PDF & upload it to LG. (LG’s versions of books are usually better than the IA scans, but if they don’t exist, IA’s is better than nothing.)

  • Open Library: uses the same PDF DRM as IA, and can be broken the same way

  • HathiTrust also hosts many book scans, which can be searched for clues or hints or jailbroken.

    HathiTrust blocks whole-book downloads but it’s easy to download each page in a loop and stitch them together, for example:

    for i in {0..151}
    do
        if [[ ! -s "$i.pdf" ]]; then
            wget ";orient=0;size=100;seq=$i;attachment=0" \
                  -O "$i.pdf"
            sleep 10s
        fi
    done
    pdftk *.pdf cat output 1957-super-scientificcareersandvocationaldevelopmenttheory.pdf
    exiftool -Title="Scientific Careers and Vocational Development Theory: A review, a critique and some recommendations" \
        -Date=1957 -Author="Donald E. Super, Paul B. Bachrach" -Subject="psychology" \
        -Keywords="Bureau Of Publications (Teachers College Columbia University), LCCCN: 57-12336, National Science Foundation, public domain, \;view=1up;seq=1" \
        1957-super-scientificcareersandvocationaldevelopmenttheory.pdf

    Another example of this would be the Wellcome Library; while looking for An Investigation Into The Relation Between Intelligence And Inheritance, Lawrence 1931, I came up dry until I checked one of the last search results, a “Wellcome Digital Library” hit, on the slim off-chance that, like the occasional Chinese/Indian library website, it just might have fulltext. As it happens, it did—good news? Yes, but with a caveat: it provides no way to download the book! It provides OCR, metadata, and individual page-image downloads all under CC-BY-NC-SA (so no legal problems), but… not the book. (The OCR is also unnecessarily zipped, so that is why Google ranked the page so low and did not show any revealing excerpts from the OCR transcript: because it’s hidden in an opaque archive to save a few kilobytes while destroying SEO.) Examining the download URLs for the highest-resolution images, they follow an unfortunate schema: instead of being sequentially numbered 1–90 or whatever, they all live under a unique hash or ID.

    Fortunately, one of the metadata files, the ‘manifest’ file, provides all of the hashes/IDs (but not the high-quality download URLs). Extracting the IDs from the manifest can be done with some quick sed & tr string processing, and fed into another short wget loop for download:

    HASHES=$(fgrep '@id' manifest\?manifest\=https\ | \
       sed -e 's/.*imageanno\/\(.*\)/\1/' | egrep -v '^ .*' | tr -d ',' | tr -d '"')
    # bf23642e-e89b-43a0-8736-f5c6c77c03c3
    # 334faf27-3ee1-4a63-92d9-b40d55ab72ad
    # 5c27d7de-6d55-473c-b3b2-6c74ac7a04c6
    # d514271c-b290-4ae8-bed7-fd30fb14d59e
    # f85ef645-ec96-4d5a-be4e-0a781f87b5e2
    # a2e1af25-5576-4101-abee-96bd7c237a4d
    # 6580e767-0d03-40a1-ab8b-e6a37abe849c
    # ca178578-81c9-4829-b912-97c957b668a3
    # 2bd8959d-5540-4f36-82d9-49658f67cff6
    # ...etc
    I=1
    for HASH in $HASHES; do
        wget "$HASH/full/2212,/0/default.jpg" -O "$I.jpg"
        I=$((I+1))
    done

    And then the 59MB of JPGs can be cleaned up as usual with gscan2pdf (empty pages deleted, tables rotated, cover page cropped, all other pages binarized), compressed/OCRed with ocrmypdf, and metadata set with exiftool, producing a readable, downloadable, highly-search-engine-friendly 1.8MB PDF.

  • remember the ‘analogue hole’ works for papers/books too:

    if you can find a copy to read, but cannot figure out how to download it directly because the site uses JS or complicated cookie authentication or other tricks, you can always exploit the ‘analogue hole’—fullscreen the book in high resolution & take screenshots of every page; then crop, OCR etc. This is tedious but it works. And if you take screenshots at sufficiently high resolution, there will be relatively little quality loss. (This works better for books that are scans than ones born-digital.)


Expensive but feasible. Books are something of a double-edged sword compared to papers/theses. On the one hand, books are much more often unavailable online, and must be bought offline, but at least you almost always can buy used books offline without much trouble (and often for <$10 total); on the other hand, while papers/theses are often available online, when one is not, it’s usually very unavailable, and you’re stuck (unless you have a university ILL department backing you up or are willing to travel to the few or only universities with paper or microfilm copies).

Purchasing from used book sellers:

  • Sellers:

    • used book search engines: Google Books/find-more-books.com: a good starting point for seller links; if buying from a marketplace like AbeBooks/Amazon/Barnes & Noble, it’s worth searching the seller to see if they have their own website, which is potentially much cheaper. They may also have multiple editions in stock.

    • bad: eBay & Amazon are often bad, due to high-minimum-order+S&H, and sellers on Amazon seem to assume Amazon buyers are easily rooked; but they can be useful in providing metadata like page count or ISBN or variations on the title.

    • good: AbeBooks, Thrift Books, Better World Books, B&N, Discover Books.

      Note: on AbeBooks, international orders can be useful (especially for behavioral genetics or psychology books) but be careful of international orders with your credit card—many debit/credit cards will fail on international orders and trigger a fraud alert, and PayPal is not accepted.

  • Price Alerts: if a book is not available or too expensive, set price watches: AbeBooks supports email alerts on stored searches, and Amazon can be monitored via CamelCamelCamel (remember, the CCC price alert you want is on the used third-party category, as new books are more expensive, less available, and unnecessary).


  • Destructive vs Non-Destructive: the fundamental dilemma of book scanning—destructively debinding books with a razor or guillotine cutter works much better & is much less time-consuming than spreading them on a flatbed scanner to scan one-by-one, because it allows use of a sheet-fed scanner instead, which is easily 5x faster and will give higher-quality scans (because the sheets will be flat, scanned edge-to-edge, and much more closely aligned), but does, of course, require effectively destroying the book.

  • Tools:

    • cutting: For simple debinding of a few books a year, an X-acto knife/razor is good (avoid the ‘triangle’ blades, get curved blades intended for large cuts instead of detail work).

      Once you start doing more than one a month, it’s time to upgrade to a guillotine blade paper cutter (a fancier swinging-arm paper cutter, which uses a two-joint system to clamp down and cut uniformly).

      A guillotine blade can cut chunks of 200 pages easily without much slippage, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into several 200-page chunks for the guillotine cutter.

    • scanning: at some point, it may make sense to switch to a scanning service like 1DollarScan (1DS has acceptable quality for the black-white scans I have used them for thus far, but watch out for their nickel-and-diming fees for OCR or “setting the PDF title”; these can be done in no time yourself using gscan2pdf/exiftool/ocrmypdf and will save a lot of money as they, amazingly, bill by 100-page units). Books can be sent directly to 1DS, reducing logistical hassles.

  • Clean Up: after scanning, crop/threshold/OCR/add metadata

    • Adding metadata: same principles as papers. While more elaborate metadata can be added, like bookmarks, I have not experimented with those yet.
  • File format: PDF.

    In the past, I used DjVu for documents I produce myself, as it produces much smaller scans than gscan2pdf’s default PDF settings due to a buggy Perl library (at least half the size, sometimes one-tenth the size), making them more easily hosted & a superior browsing experience.

    The downsides of DjVu are that not all PDF viewers can handle DjVu files, and it appears that G/GS ignore all DjVu files (despite the format being 20 years old), rendering them completely unfindable online. In addition, DjVu is an increasingly obscure format and has, for example, been dropped by the IA as of 2016. The former is a relatively small issue, but the latter is fatal—being consigned to oblivion by search engines largely defeats the point of scanning! (“If it’s not in Google, it doesn’t exist.”) Hence, despite being a worse format, I now recommend PDF and have stopped using DjVu for new scans and have converted my old DjVu files to PDF.

  • Uploading: to LibGen, usually. For backups, filelockers like Dropbox, Mega, MediaFire, or Google Drive are good. I usually upload 3 copies including LG. I rotate accounts once a year, to avoid putting too many files into a single account.

    Do Not Use Google Docs/Scribd/Dropbox/etc

    ‘Document’ websites like Google Docs (GD) should be strictly avoided as primary hosting. GD does not appear in G/GS, dooming a document to obscurity, and Scribd is ludicrously user-hostile. Such sites cannot be searched, scraped, downloaded, clipped, used on many devices, or counted on for the long haul.

    Such sites may be useful for collaboration or surveys, but content should be moved to clean static HTML/PDF hosted elsewhere as soon as possible.
  • Hosting: hosting papers is easy but books come with risk:

    Books can be dangerous; in deciding whether to host a book, my rule of thumb is host only books pre-2000 which do not have Kindle editions or other signs of active exploitation and are effectively ‘orphan works’.

    As of 2019-10-23, hosting 4090 files over 9 years (very roughly, assuming linear growth, <6.7 million document-days of hosting), I’ve received 4 takedown orders: a behavioral genetics textbook (2013), The Handbook of Psychopathy (2005), a recent meta-analysis paper (Roberts et al 2016), and a CUP DMCA takedown order for 27 files. I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.
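
The document-days figure above is simple arithmetic: assuming linear growth from 0 to 4090 files over 9 years, the average number of files hosted is half the final count, so:

```shell
# average files hosted × days elapsed ≈ total document-days
# (4090/2 files on average, over 9 years of ~365 days each)
echo $(( (4090 / 2) * (9 * 365) ))  # → 6717825, ie. roughly 6.7 million
```

Four takedowns per ~6.7 million document-days is a vanishingly small rate.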

Case Studies

Below are >13 case studies of difficult-to-find resources or citations, and how I went about locating them, demonstrating the various Internet search techniques described above and how to think about searches.

  • Missing Appendix: Anders Sandberg asked:

    Does anybody know where the online appendix to Nordhaus’ “Two Centuries of Productivity Growth in Computing” is hiding?

    I look up the title in Google Scholar; seeing a friendly PDF link (CiteSeerx), I click. The paper says “The data used in this study are provided in a background spreadsheet available at” a given URL. Sadly, this is a lie. (Sandberg would of course have tried that.)

    I immediately check the URL in the IA—nothing. The IA didn’t catch it at all. Maybe the official published paper website has it? Nope, it references the same URL, and doesn’t provide a copy as an appendix or supplement. (What do we pay these publishers such enormous sums of money for, exactly?) So I back off to checking Nordhaus’s personal website for a newer link. The Yale personal website is empty and appears to’ve been replaced by a Google Sites personal page. It links nothing useful, so I check a more thorough index, Google, with a site: search. Nothing there either (and it appears almost empty, so Nordhaus has allowed most of his stuff to be deleted and bitrot). I try a broader Google: nordhaus appendix.xls. This turns up some spreadsheets, but still nothing.

    Easier approaches having been exhausted, I return to the IA and pull up all URLs archived for his original personal website (URL/*). This pulls up way too many URLs to manually review, so I filter results for xls, which reduces to a more manageable 60 hits; reading through the hits, I spot a promising spreadsheet from 2014-10-10; this sounds right, albeit substantially later in time than expected (either 2010 or 2012, judging from the filename).

    Downloading it, opening it up and cross-referencing with the paper, it has the same spreadsheet ‘sheets’ as mentioned, like “Manual” or “Capital_Deep”, and seems to be either the original file in question or an updated version thereof (which may be even better). The spreadsheet metadata indicates it was created “04/09/2001, 23:20:43, ITS Academic Media & Technology”, and modified “12/22/2010, 02:40:20”, so it seems to be the latter—it’s the original spreadsheet Nordhaus created when he began work several years prior to the formal 2007 publication (6 years seems reasonable given all the delays in such a process), and then was updated 3 years afterwards. Close enough.

  • Misremembered Book: A Redditor asked:

    I was in a consignment type store once and picked up a book called “Eat fat, get thin”. Giving it a quick scan through, it was basically the same stuff as Atkins but this book was from the 50s or 60s. I wish I’d have bought it. I think I found a reference to it once online but it’s been drowned out since someone else released a book with the same name (and it wasn’t Barry Groves either).

    The easiest way to find a book given a corrupted title, a date range, and the information that there are many similar titles drowning out a naive search engine query, is to skip to a specialized search engine with clean metadata (ie. a library database).

    Searching in WorldCat for 1950s–1970s, “Eat fat, get thin” turns up nothing relevant. This is unsurprising, as he was unlikely to’ve remembered the title exactly, and this title doesn’t quite sound right for the era anyway (a little too punchy and ungrammatical, and ‘thin’ wasn’t a desirable word back then compared to words like ‘slim’ or ‘sleek’ or ‘svelte’). People often oversimplify titles, so I dropped back to just “Eat fat”.

    This immediately turned up the book: Mackarness’s 1958 Eat Fat and Grow Slim—note that it is almost the same title, with a comma serving as conjunction and ‘slim’ rather than the more contemporary ‘thin’, but just different enough to screw up an overly-literal search.

    With the same trick in mind, we could also have found it in a regular Google search query by adding additional terms to hint to Google that we want old books, not recent ones: both "Eat Fat" 1950s or "Eat Fat" 1960s would have turned it up in the first 5 search results. If we didn’t use quotes, the searches get harder because broader hits get pulled in. For example, Eat fat, get thin 1950s -Hyman excludes the recent book mentioned, but you still have to go down 15 hits before finding Mackarness, and Eat fat, get thin -Hyman requires going down 18 hits.

  • Miss­ing Web­site: , on the phe­nom­e­non of quotes strik­ing tran­scripts from a major exam­ple of a dis­ap­pear­ing crys­tal, when ~1998 Abbott sud­denly became unable to man­u­fac­ture the anti-retro­vi­ral drug (Norvir™) due to a rival (and less effec­tive) crys­tal form spon­ta­neously infect­ing all its plants, threat­en­ing many AIDS patients, but notes:

    The tran­scripts were orig­i­nally pub­lished on the web­site42 of the Inter­na­tional Asso­ci­a­tion of Physi­cians in AIDS Care [IAPAC], but no longer appear there.

    A search using the quotes confirms that the originals have long since vanished from the open Internet, turning up only quotes of the quotations. Unfortunately, no URL is given. The Internet Archive has comprehensive mirrors of the IAPAC, but too many to easily search through. Using the filter feature, I keyword-searched for “ritonavir”, but while this turned up a number of pages from roughly the right time period, they do not mention the transcripts and none of the quotes appear. The key turned out to be to use the trademark name ‘Norvir’ instead, which pulls up many more pages; after checking a few, the IAPAC turned out to have organized all the Norvir material into a single subdirectory with a convenient index.html, and the articles/transcripts, in turn, were indexed under the linked .

    I then pulled the Norvir subdirectory with a ~/.gem/ruby/2.5.0/bin/wayback_machine_downloader '' command and hosted a mirror to make it visible in Google.

  • Speech → Book: Nancy Lebovitz asked about a cita­tion in a Roy Baumeis­ter speech about sex differ­ences:

    There’s an idea I’ve seen a num­ber of times that 80% of women have had descen­dants, but only 40% of men. A lit­tle research tracked it back to this, but the speech does­n’t have a cite and I haven’t found a source.

    This could be solved by guess­ing that the for­mal cita­tion is given in the book, and doing key­word search to find a sim­i­lar pas­sage. The sec­ond line of the speech says:

    For more infor­ma­tion on this top­ic, read Dr. Baumeis­ter’s book Is There Any­thing Good About Men? avail­able in book­stores every­where, includ­ing here.

    A search of Is There Anything Good About Men in Libgen turns up a copy. Download. What are we looking for? As a reminder, the key lines in the speech are:

    …It’s not a trick ques­tion, and it’s not 50%. True, about half the peo­ple who ever lived were wom­en, but that’s not the ques­tion. We’re ask­ing about all the peo­ple who ever lived who have a descen­dant liv­ing today. Or, put another way, yes, every baby has both a mother and a father, but some of those par­ents had mul­ti­ple chil­dren. Recent research using DNA analy­sis answered this ques­tion about two years ago. Today’s human pop­u­la­tion is descended from twice as many women as men. I think this differ­ence is the sin­gle most under­-ap­pre­ci­ated fact about gen­der. To get that kind of differ­ence, you had to have some­thing like, through­out the entire his­tory of the human race, maybe 80% of women but only 40% of men repro­duced.

    We could search for various words or phrases from this passage which seem relatively unique; as it happens, I chose the rhetorical “50%” (but “80%”, “40%”, “underappreciated”, etc. all would’ve worked with varying levels of efficiency, since the speech is heavily based on the book), and thus jumped straight to chapter 4, “The Most Underappreciated Fact About Men”. (If these had not worked, we could have started searching for years, based on the quote “about two years ago”.) A glance tells us that Baumeister is discussing exactly this topic of reproductive differentials, so we read on, and a few pages later, on page 63, we hit the jackpot:

    The cor­rect answer has recently begun to emerge from DNA stud­ies, notably those by Jason Wilder and his col­leagues. They con­cluded that among the ances­tors of today’s human pop­u­la­tion, women out­num­bered men about two to one. Two to one! In per­cent­age terms, then, human­i­ty’s ances­tors were about 67% female and 33% male.

    Who’s Wilder? A C-f for “Wilder” takes us to pg286, where we imme­di­ately read:

    …The DNA studies on how today’s human population is descended from twice as many women as men have been the most requested sources from my earlier talks on this. The work is by Jason Wilder and his colleagues. I list here some sources in the mass media, which may be more accessible to laypersons than the highly technical journal articles, but for the specialists I list those also. For a highly readable introduction, you can Google the article “Ancient Man Spread the Love Around,” which was published September 20, 2004 and is still available (last I checked) online. There were plenty of other stories in the media at about this time, when the research findings first came out. In “Medical News Today”, on the same date in 2004, a story under “Genes expose secrets of sex on the side” covered much the same material.

    If you want the orig­i­nal sources, read Wilder, J. A., Mobash­er, Z., & Ham­mer, M. F. (2004). “Genetic evi­dence for unequal effec­tive pop­u­la­tion sizes of human females and males”. Mol­e­c­u­lar Biol­ogy and Evo­lu­tion, 21, 2047–2057. If that went down well, you might try Wilder, J. A., Kingan, S. B., Mobash­er, Z., Pilk­ing­ton, M. M., & Ham­mer, M. F. (2004). “Global pat­terns of human mito­chon­dr­ial DNA and Y-chro­mo­some struc­ture are not influ­enced by higher migra­tion rates of females ver­sus males”. Nature Genet­ics, 36, 1122–1125. That one was over my head, I admit. A more read­able source on these is Shriver, M. D. (2005), “Female migra­tion rate might not be greater than male rate”. Euro­pean Jour­nal of Human Genet­ics, 13, 131–132. Shriver raises another intrigu­ing hypoth­e­sis that could have con­tributed to the greater pre­pon­der­ance of females in our ances­tors: Because cou­ples mate such that the man is old­er, the gen­er­a­tional inter­vals are smaller for females (i.e., baby’s age is closer to moth­er’s than to father’s). As for the 90% to 20% differ­en­tial in other species, that I believe is stan­dard infor­ma­tion in biol­o­gy, which I first heard in one of the lec­tures on testos­terone by the late James Dabbs, whose book Heroes, Rogues, and Lovers remains an author­i­ta­tive source on the top­ic.

    Wilder et al 2004, incidentally, fits well with Baumeister remarking in 2007 that the research was done 2 or so years ago. And of course you could’ve done the same thing using Google Books: search “Baumeister anything good about men” to get to the book, then search-within-the-book for “50%”, jump to page 53, read to page 63, do a second search-within-the-book for “Wilder”, and the second hit, on page 287, even luckily gives you the snippet:

    Sources and Ref­er­ences 287

    …If you want the orig­i­nal sources, read Wilder, J. A., Mobash­er, Z., & Ham­mer, M. F. (2004). “Genetic evi­dence for unequal effec­tive pop­u­la­tion sizes of human females and males”. Mol­e­c­u­lar Biol­ogy and Evo­lu­tion

  • Connotations: a commenter who shall remain nameless wrote:

    I chal­lenge you to find an exam­ple of some­one say­ing “this den of X” where X does not have a neg­a­tive con­no­ta­tion.

    I found a pos­i­tive con­no­ta­tion within 5s using my Google hotkey for "this den of ", and, curi­ous about fur­ther ones, found addi­tional uses of the phrase in regard to deal­ing with rat­tlesnakes in Google Books.

  • Rowling Quote On Death: Did J.K. Rowling say the Harry Potter books were about ‘death’? There are a lot of Rowling statements, but checking WP and opening up each interview link (under the theory that the key interviews are linked there) and searching for ‘death’ soon turns up a relevant quote from 2001:

    Death is an extremely impor­tant theme through­out all seven books. I would say pos­si­bly the most impor­tant theme. If you are writ­ing about Evil, which I am, and if you are writ­ing about some­one who is essen­tially a psy­chopath, you have a duty to show the real evil of tak­ing human life.

  • Crowley Quote: Scott Alexander posted a piece linking to an excerpt titled “Aleister Crowley on Religious Experience”.

    The link was broken, but Alexander brought it up in the context of an earlier discussion where he also quoted Crowley; searching those quotes reveals that it must have been excerpts from Magick: Book 4.

  • Finding The Right ‘SAGE’: Phil Goetz noted that an anti-aging conference named “SAGE” had become impossible to find in Google due to a LGBT aging conference also named SAGE.

    Regular searches would fail, but a combination of tricks worked: SAGE anti-aging conference combined with restricting Google search to the 2003–2005 time-range turned up a citation to its website as the fourth hit (which has ironically since died).

  • UK Char­ity Finan­cials: The Future of Human­ity Insti­tute (FHI) does­n’t clearly pro­vide char­ity finan­cial forms akin to the US Form 990s, mak­ing it hard to find out infor­ma­tion about its bud­get or results.

    FHI doesn’t show up in the CC, NPC, or GuideStar, which are the first places to check for charity finances, so I went a little broader afield and tried a site-restricted Google search on the FHI website for budget. This immediately turned up FHI’s own documentation of its activities and budgets, such as the 2007 annual report; I used part of its title as a new Google search: future of humanity institute achievements report

  • Nobel Lin­eage Research: John Maxwell referred to a for­got­ten study on high cor­re­la­tion between Nobelist pro­fes­sors & Nobelist grad stu­dents (al­most entirely a selec­tion effect, I would bet). I was able to refind it in 7 min­utes.

    I wasted a few searches like factor predicting Nobel prize or Nobel prize graduate student in Google Scholar, until I searched for Nobel laureate "graduate student"; the second hit was a citation, which is a little unusual for Google Scholar and meant it was important, and it had the critical word ‘mutual’ in it. Simultaneous partners in Nobel work are somewhat rare, but temporally-separated teams don’t win prizes together, so I suspected that it was exactly what I was looking for. Googling the title, I soon found a PDF, “Eminent Scientists’ Demotivation in School: A symptom of an incurable disease?”, Viau 2004, which confirmed it (and Viau 2004 is interesting in its own right as a contribution to the Conscientiousness vs IQ question). I then followed it to a useful paragraph:

    In a study conducted with 92 American winners of the Nobel Prize, Zuckerman (1977) discovered that 48 of them had worked as graduate students or assistants with professors who were themselves Nobel Prize award-winners. As pointed out by Zuckerman (1977), the fact that 11 Nobel prizewinners have had the great physicist Rutherford as a mentor is an example of just how significant a good mentor can be during one’s studies and training. It then appears that most eminent scientists did have people to stimulate them during their childhood and mentor(s) during their studies. But, what exactly is the nature of these people’s contribution?

    • Zuck­er­man, H. (1977). Sci­en­tific Elite: Nobel Lau­re­ates in the United States. New York: Free Press.

    GS lists >900 citations of this book, so there may well be additional or followup studies covering the 40 years since. Also relevant: “Zuckerman, H. (1983). The scientific elite: Nobel laureates’ mutual influences. In R. S. Albert (Ed.), Genius and eminence (pp. 241–252). New York: Pergamon Press”, and “Zuckerman, H. ‘The Sociology of the Nobel Prizes’, Scientific American 217 (5): 25–33, 1967.”

  • Too Narrow: A failure case study: The_Duck looked for, but failed to find, other uses of a famous Wittgenstein anecdote. His mistake was being too specific:

    Yes, clearly my Google-fu is lack­ing. I think I searched for phrases like “sun went around the Earth,” which fails because your quote has “sun went round the Earth.”

    As dis­cussed in the search tips, when you’re for­mu­lat­ing a search, you want to bal­ance how many hits you get, aim­ing for a sweet spot of a few hun­dred high­-qual­ity hits to review—the broader your for­mu­la­tion, the more likely the hits will include your tar­get (if it exists) but the more hits you’ll return. In The_­Duck’s case, he used an over­ly-spe­cific search, which would turn up only 2 hits at most; this should have been a hint to loosen the search, such as by drop­ping quotes or drop­ping key­words.

    In this case, my rea­son­ing would go some­thing like this, laid out explic­it­ly: ‘“Wittgen­stein” is almost guar­an­teed to be on the same page as any instance of this quote, since the quote is about Wittgen­stein; LW, how­ev­er, does­n’t dis­cuss Wittgen­stein much, so there won’t be many hits in the first place; to find this quote, I only need to nar­row down those hits a lit­tle, and after “Wittgen­stein”, the most fun­da­men­tal core word to this quote is “Earth” or “sun”, so I’ll toss one of them in and… ah, there’s the quote!’

    If I were search­ing the gen­eral Inter­net, my rea­son­ing would go more like “‘Wittgen­stein’ will be on, like, a mil­lion web­sites; I need to nar­row that down a lot to hope to find it; so maybe ‘Wittgen­stein’ and ‘Earth’ and ‘Sun’… nope, noth­ing on the first page, so toss in 'goes around' OR 'go around'—ah there it is!”

    (Ac­tu­al­ly, for the gen­eral Inter­net, just Wittgenstein earth sun turns up a first page mostly about this anec­dote, sev­eral of which include all the details one could need.)

  • Dead URL: A link to a research article in a post by Morendil broke; he had not provided any formal citation data, and the original domain blocks all crawlers in its robots.txt, so the IA would not work. What to do?

    The sim­plest solu­tion was to search a direct quote, turn­ing up a Scribd mir­ror; Scribd is a par­a­site web­site, where peo­ple upload copies from else­where, which ought to make one won­der where the orig­i­nal came from. (It often shows up before the orig­i­nal in any search engine, because it auto­mat­i­cally runs OCR on sub­mis­sions, mak­ing them more vis­i­ble to search engines.) With a copy of the jour­nal issue to work with, you can eas­ily find the offi­cial HP archives and down­load the orig­i­nal PDF.

    If that had­n’t worked, search­ing for the URL with­out /pg_2/ in it yields the full cita­tion, and then that can be looked up nor­mal­ly. Final­ly, some­what more dan­ger­ous would be try­ing to find the arti­cle just by author sur­name & year.

  • Description But No Citation: A 2013 Medical Daily article on the effects of reading fiction omitted any link or citation to the research in question. But it is easy to find.

    The arti­cle says the authors are one Kauf­man & Lib­by, and implies it was pub­lished in the last year. So: go to Google Schol­ar, punch in Kaufman Libby, limit to ‘Since 2012’; and the cor­rect paper (“Chang­ing beliefs and behav­ior through expe­ri­ence-tak­ing”) is the first hit with full­text avail­able on the right-hand side as the text link “[PDF] from” & many other domains.

  • Find­ing Fol­lowups: Is soy milk bad for you as one study sug­gests? Has any­one repli­cated it? This is easy to look into a lit­tle if you use the power of reverse cita­tion search!

    Plug Brain aging and midlife tofu consumption into Google Scholar; one of the little links under the first hit points to “Cited by 176”. If you click on that, you can hit a checkbox for “Search within citing articles”; then you can search a query like experiment OR randomized OR blind, which yields 121 results. The first result shows no negative effect and a trend to a benefit, the second is inaccessible, the third & fourth are reviews whose abstracts suggest they would argue for benefits, and the fifth discusses sleep & mood benefits of soy diets. At least from a quick skim, this claim is not replicating, and I am dubious about it.

  • How Many Homeless?: Does NYC really have 114,000+ homeless school children? This case study demonstrates the critical skill of noticing the need to search at all; the search itself is almost trivial.

    Won’t someone think of the children? In March 2020, as the coronavirus epidemic centered on Manhattan (with a similar trend to Wuhan/Iran/Italy), NYC Mayor Bill de Blasio refused to take social distancing/quarantine measures like ordering the NYC public school system closed, and this delay until 16 March contributed to the epidemic’s unchecked spread in NYC; one justification was that there were “114,085 homeless children” who received social services like free laundry through the schools. This number has been widely cited in the media by the NYT, WSJ, etc., and was vaguely sourced to “state data” reported by “Advocates for Children of New York”. This is a terrible reason not to deal with a pandemic that could kill tens of thousands of New Yorkers, as there are many ways to deliver services which do not require every child in NYC to attend school & spread infections. But first, is this number even true?

    Basic numeracy: implausibly large! Activists of any stripe are untrustworthy sources, and a number like 114k should make any numerate person uneasy even without any fact-checking; “114,085” is suspiciously precise for such a difficult-to-measure or -define thing as homelessness, and it’s well-known that the population of NYC is ~8m or 8,000k. Is it really the case that around 1 in every 70 people living in NYC is a homeless child age ~5–18 attending a public school? They presumably have at least 1 parent, and probably younger siblings, so that would bring it up to >228k, or 1 in every <35 inhabitants of NYC being homeless in general. Depending on additional factors like transiency & turnover, the fraction could go much higher still. Does that make sense? No, not really. This quoted number is either surprising, or there is something missing.
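    This back-of-the-envelope arithmetic takes two lines of shell, using the rounded ~8m population figure from above (a Fermi sketch, not an exact demographic calculation):

```shell
# Sanity-check the "114,085 homeless schoolchildren" claim against NYC's
# ~8,000,000 population; integer division is fine for a Fermi estimate.
nyc_pop=8000000
claimed=114085
echo "1 in $(( nyc_pop / claimed )) residents"                    # 1 in 70
echo "1 in $(( nyc_pop / (claimed * 2) )) counting 1 parent each" # 1 in 35
```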

    Redefin­ing “home­less”. For­tu­nate­ly, the sus­pi­cious­ly-pre­cise num­ber and attri­bu­tion make this a good place to start for a search. Search­ing for the num­ber and the name of the activist group instantly turns up the source press release, and the rea­sons for the bizarrely high num­ber are revealed: the sta­tis­tic actu­ally rede­fines ‘home­less­ness’ to include liv­ing with rel­a­tives or friends, and counts any expe­ri­ence of any length in the pre­vi­ous year as ren­der­ing that stu­dent ‘home­less’ at the moment.

    The data, which come from the New York State Edu­ca­tion Depart­ment, show that in the 2018-2019 school year, New York City dis­trict and char­ter schools iden­ti­fied 114,085, or one in ten, stu­dents as home­less. More than 34,000 stu­dents were liv­ing in New York City’s shel­ters, and more than twice that num­ber (73,750) were liv­ing ‘dou­bled-up’ in tem­po­rary hous­ing sit­u­a­tions with rel­a­tives, friends, or oth­ers…“This prob­lem is immense. The num­ber of New York City stu­dents who expe­ri­enced home­less­ness last year—85% of whom are Black or His­pan­ic—­could fill the Bar­clays Cen­ter six times,” said Kim Sweet, AFC’s Exec­u­tive Direc­tor. “The City won’t be able to break the cycle of home­less­ness until we address the dis­mal edu­ca­tional out­comes for stu­dents who are home­less.”

    The WSJ’s arti­cle (but not head­line) con­firms that ‘expe­ri­enced’ does indeed mean ‘at any time in the year for any length of time’, rather than ‘at the moment’:

    City dis­trict and char­ter schools had 114,085 stu­dents with­out their own homes at some point last year, top­ping 100,000 for the fourth year in a row, accord­ing to state data released in a report Mon­day from Advo­cates for Chil­dren of New York, a non­profit seek­ing bet­ter ser­vices for the dis­ad­van­taged. Most chil­dren were black or His­pan­ic, and liv­ing “dou­bled up” with friends, rel­a­tives or oth­ers. But more than 34,000 slept in city shel­ters at some point, a num­ber larger than the entire enroll­ment of many dis­tricts, such as Buffalo, Rochester or Yonkers.

    Less than meets the eye. So the actual number of ‘homeless’ students (in the sense that everyone reading those media articles understands it) is less than a third of the quoted figure, 34k, and that 34k number is likely itself a loose estimate of how many students would be homeless at the time of a coronavirus closure. This number is far more plausible and intuitive, and while one might wonder what the underlying NYS Education Department numbers would reveal if fact-checked further, that’s probably unnecessary for showing how ill-founded the anti-closure argument is, since even by the activists’ own description, the relevant number is far smaller than 114k.

  • Citation URL With Typo: a paper by Hofman discusses the limits to the intelligence of increasingly large primate brains due to considerations like increasing latency and overheating. One citation attempting to extrapolate upper bounds is “Biological limits to information processing in the human brain”, Cochrane et al 1995.

    The source information is merely a broken URL, which stands out for looking doubly-wrong: “.phd” is almost certainly a typo for “.php” (probably muscle memory on the part of Hofman from “PhD”), but it also gives a hint that the entire URL is wrong: why would an article or essay be named anything like archive/articles.php? That sounds like an index page listing all the available articles.

    After trying and failing to find Cochrane’s paper in the usual places, I returned to the hint. The Internet Archive doesn’t have that page under either possible URL, but the directory strongly hints that all of the papers would exist at URLs like archive/brain.php or archive/information-processing.php, and we can look up all of the URLs the IA has under that directory (how many could there be?). A lot, but only one has the keyword “brain” in it, providing us the paper.
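    The directory-enumeration step can be sketched in shell. In practice, the list of archived URLs under a directory comes from the IA’s Wayback CDX API (a prefix query along the lines of curl 'https://web.archive.org/cdx/search/cdx?url=DOMAIN/archive/*&fl=original'); to keep this example offline and self-contained, a stand-in urls.txt with a hypothetical example.com domain substitutes for the API output:

```shell
# Stand-in for a Wayback CDX API prefix listing; in real use, urls.txt
# would be the API's output for the site's archive/ directory.
cat > urls.txt <<'EOF'
http://example.com/archive/articles.php
http://example.com/archive/brain.php
http://example.com/archive/networks.php
EOF
# Keep only the URLs mentioning the keyword of interest:
grep -i 'brain' urls.txt   # → http://example.com/archive/brain.php
```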

    If that had­n’t worked, there was at least one other ver­sion hid­ing in the IA. When I googled the quoted title “Bio­log­i­cal lim­its to infor­ma­tion pro­cess­ing in the human brain”, the hits all appeared to be use­less cita­tions repeat­ing the orig­i­nal Hof­man cita­tion—but for a cru­cial differ­ence, as they cite a differ­ent URL (note the shift to an ‘archive.­’ sub­do­main rather than the sub­di­rec­tory, and change of exten­sion from .html to .php):

    • hit 5:

      Bio­log­i­cal Lim­its to Infor­ma­tion Pro­cess­ing in the Human Brain. Retrieved from:

    • hit 7:

      Bio­log­i­cal Lim­its to Infor­ma­tion Pro­cess­ing in the Human Brain. Avail­able online at:; Da Costa …

    Aside from con­firm­ing that it was indeed a ‘.php’ exten­sion, that URL gives you a sec­ond copy of the paper in the IA. Unfor­tu­nate­ly, the image links are bro­ken in both ver­sions, and the image sub­di­rec­to­ries also seem to be empty in both IA ver­sions, though there’s no weird JS image load­ing bad­ness, so I’d guess that the image links were always bro­ken, at least by 2004. There’s no indi­ca­tion it was ever pub­lished or mir­rored any­where else, so there’s not much you can do about it other than to con­tact Peter Cochrane (who is still alive and actively pub­lish­ing although he leaves this par­tic­u­lar arti­cle off his pub­li­ca­tion list).

See Also


Searching the Google Reader archives

A tutorial on how to do manual searches of the 2013 Google Reader archives hosted on the Internet Archive. Google Reader provides fulltext mirrors of many websites which are long gone and not otherwise available even in the IA; however, the Archive Team archives are extremely user-unfriendly and challenging to use even for programmers. I explain how to find & extract specific websites.

A little-known way to ‘undelete’ a blog or website is to use Google Reader (GR). Unusual archive: Google Reader. GR regularly crawled almost all blogs’ RSS feeds, and RSS feeds often contain the fulltext of articles. If a blog author writes an article, the fulltext is included in the RSS feed, GR downloads it, and then the author changes their mind and edits or deletes it; GR would redownload the new version, but it would continue to show the old version as well (you would see both versions, chronologically). If the author blogged regularly and so GR had learned to check regularly, it could hypothetically grab many different edited versions, not just ones separated by weeks or months. That assumes GR did not, as it sometimes did for inscrutable reasons, stop displaying the historical archives and only show the last 90 days or so to readers; I was never able to figure out why this happened, or whether it really did happen and was not some sort of UI problem. Regardless, if all went well, this lets you undelete an article, albeit perhaps with messed-up formatting. Sadly, GR was closed back in 2013, and you cannot simply log in and look for blogs.

Archive Team mir­rored Google Read­er. How­ev­er, before it was closed, Archive Team launched a major effort to down­load as much of GR as pos­si­ble. So in that dump, there may be archives of all of a ran­dom blog’s posts. Specifi­cal­ly: if a GR user sub­scribed to it; if Archive Team knew about it; if they requested it in time before clo­sure; and if GR did keep full archives stretch­ing back to the first post­ing.

AT mirror is raw binary data. Downside: the Archive Team dump is not in an easily browsed format, and merely figuring out what it might contain is difficult. In fact, it’s so difficult that before researching Craig Wright in November–December 2015, I never had an urgent enough reason to figure out how to get anything out of it, and I’m not sure I’ve ever seen anyone actually use it; Archive Team takes the attitude that it’s better to preserve the data somehow and let posterity worry about using it. (There is a site which claimed to be a frontend to the dump, but when I tried to use it, it was broken & still is as of December 2018.)


Find the right archive. The 9TB of data is stored in ~69 opaque com­pressed WARC archives. 9TB is a bit much to down­load and uncom­press to look for one or two files, so to find out which WARC you need, you have to down­load the ~69 CDX indexes which record the con­tents of their respec­tive WARC, and search them for the URLs you are inter­ested in. (They are plain text so you can just grep them.)


In this example, we will look at the main blog of Craig Wright, ‘gse-compliance’. (Another blog, ‘security-doctor’, appears to have been too obscure to be crawled by GR.) To locate the WARC with the Wright RSS feeds, download the master index. To search:

for file in *.gz; do echo $file; zcat $file | fgrep -e 'gse-compliance' -e 'security-doctor'; done
# com,google/reader/api/0/stream/contents/feed/http:/\
# archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001238\
# api/0/stream/contents/feed/\
# likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM - - 1316181\
# 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/\
# alt=rss?client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001249\
# com/reader/api/0/stream/contents/feed/\
# %3Falt%3Drss?r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM\
# - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/\
# client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001244\
# reader/api/0/stream/contents/feed/\
# r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - XXISZYMRUZWD3L6WEEEQQ7KY7KA5BD2X - - \
# 1404934 19809546472 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/\
# &comments=true&hl=en&likes=true&n=1000&r=n 20130602001253\
# /feed/\
# &client=ArchiveTeam text/html 404 AJSJWHNSRBYIASRYY544HJMKLDBBKRMO - - 9467 19812279226 \
# archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Understanding the output: the format is defined by the first line, which can then be looked up:

  • the for­mat string is: CDX N b a m s k r M S V g; which means here:

    • N: mas­saged url
    • b: date
    • a: orig­i­nal url
    • m: mime type of orig­i­nal doc­u­ment
    • s: response code
    • k: new style check­sum
    • r: redi­rect
    • M: meta tags (AIF)
    • S: ?
    • V: com­pressed arc file off­set
    • g: file name


Take the first record:

?client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001238\
&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM\
- - 1316181 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Con­verts to:

  • mas­saged URL: (com,google)/reader/api/0/stream/contents/feed/ http:/ client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n
  • date: 20130602001238
  • orig­i­nal URL: r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam
  • MIME type: unk [un­known?]
  • response code: ‘-’ [none?]
  • new-style checksum: 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM
  • redirect: ‘-’ [none?]
  • meta tags: ‘-’ [none?]
  • S [? maybe length­?]: 1316181
  • com­pressed arc file off­set: 19808229791 (19,808,229,791; so some­where around 19.8GB into the mega-WARC)
  • file­name: archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Know­ing the off­set the­o­ret­i­cally makes it pos­si­ble to extract directly from the IA copy with­out hav­ing to down­load and decom­press the entire thing… The S & off­sets for gse-compliance are:

  1. 1316181/19808229791
  2. 1326824/19810951910
  3. 1404934/19809546472
  4. 9467/19812279226
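Given such CDX lines, the three fields needed for extraction (S = length, V = offset, g = WARC filename) are simply the 9th–11th whitespace-separated fields of the “CDX N b a m s k r M S V g” format, so a small awk script can pull them out. (The sample line below uses dummy values in the same 11-field layout, not a real record:)

```shell
# Extract S (record length), V (offset), and g (filename) from a CDX line:
# in the "CDX N b a m s k r M S V g" format they are fields 9, 10, and 11.
line='com,example)/feed 20130602001238 http://example.com/feed unk - CHECKSUMCHECKSUM - - 1316181 19808229791 greader.megawarc.warc.gz'
echo "$line" | awk '{printf "length=%s offset=%s file=%s\n", $9, $10, $11}'
# → length=1316181 offset=19808229791 file=greader.megawarc.warc.gz
```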

So we found hits point­ing towards archiveteam_greader_20130604001315 & archiveteam_greader_20130614211457 which we then need to down­load (25GB each):

wget ''
wget ''

Once down­load­ed, how do we get the feeds? There are a num­ber of hard-to-use and incom­plete tools for work­ing with giant WARCs; I con­tacted the orig­i­nal GR archiver, ivan, but that was­n’t too help­ful.


I tried using warcat to unpack the entire WARC archive into indi­vid­ual files, and then delete every­thing which was not rel­e­vant:

python3 -m warcat extract /home/gwern/googlereader/...
find ./ -type f -not \( -name "*gse-compliance*" -or -name "*security-doctor*" \) -delete
find ./

But this was too slow, and crashed part­way through before fin­ish­ing.

Bug reports:

A more recent alter­na­tive library, which I haven’t tried, is warcio, which may be able to find the byte ranges & extract them.


If we are feeling brave, we can use the offset and presumed length to have dd directly extract the byte ranges:

dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=3.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.2119 s, 93.4 kB/s
dd skip=19809546472 count=1404934 if=greader_20130604001315.megawarc.warc.gz of=4.gz bs=1
# 1404934+0 records in
# 1404934+0 records out
# 1404934 bytes (1.4 MB) copied, 15.4225 s, 91.1 kB/s
dd skip=19812279226 count=9467 if=greader_20130604001315.megawarc.warc.gz of=5.gz bs=1
# 9467+0 records in
# 9467+0 records out
# 9467 bytes (9.5 kB) copied, 0.125689 s, 75.3 kB/s
dd skip=19808229791 count=1316181 if=greader_20130604001315.megawarc.warc.gz of=1.gz bs=1
# 1316181+0 records in
# 1316181+0 records out
# 1316181 bytes (1.3 MB) copied, 14.6209 s, 90.0 kB/s
gunzip *.gz
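Because dd with bs=1 copies one byte at a time (hence the ~90 kB/s rates above), a usually-faster equivalent is tail/head; note tail -c +N is 1-indexed, so the 0-based CDX offset needs +1. A miniature demonstration on a synthetic file (against the real mega-WARC, the same pattern would be, e.g., tail -c +19810951911 greader_20130604001315.megawarc.warc.gz | head -c 1326824 > 2.gz):

```shell
# Extract `count` bytes at 0-based offset `skip`:
#   tail -c +$((skip + 1)) file | head -c count
# Demonstration: pull 5 bytes at offset 10 from a 26-byte file.
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' > demo.bin
tail -c "+$((10 + 1))" demo.bin | head -c 5 > slice.bin
cat slice.bin   # → KLMNO
```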


Success: raw HTML. My dd extraction was successful, and the resulting HTML/RSS could then be browsed with a command like cat *.warc | fold --spaces --width=200 | less. They can probably also be converted to a local form and browsed, although they won’t include any of the site assets like images or CSS/JS, since the original RSS feed assumes you can load any references from the original website and didn’t do any kind of inlining or mirroring (not, after all, having been intended for archival purposes in the first place…)

  1. For example, the info: operator is entirely useless. The link: operator, in almost a decade of me trying it once in a great while, has never returned remotely as many links to my website as Google Webmaster Tools returns for inbound links, and seems to have been disabled entirely at some point.↩︎

  2. WP is increasingly out of date & unrepresentative due to increasingly narrow policies about sourcing & preprints, so it's not a good place to look for references. It is a good place to look for key terminology, though.↩︎

  3. Most search engines will treat any space or separation as an implicit AND, but I find it helpful to be explicit about it to make sure I'm searching what I think I'm searching.↩︎

  4. This probably explains part of why no one cites that paper, and those who cite it clearly have not actually read it, even though it invented racial admixture analysis, which, since reinvented by others, has become a major method in medical genetics.↩︎

  5. University ILL privileges are one of the most underrated fringe benefits of being a student, if you do any kind of research or hobbyist reading—you can request almost anything you can find in a library catalog, whether it's an ultra-obscure book or a master's thesis from 1950! Why wouldn't you make regular use of it‽ Of things I miss from being a student, ILL is near the top.↩︎

  6. The complaint and indictment are not necessarily the same thing. An indictment frequently will leave out many details and confine itself to listing what the defendant is accused of. Complaints tend to be much richer in detail. However, sometimes there will be only one and not the other, perhaps because the more detailed complaint has been sealed (possibly precisely because it is more detailed).↩︎

  7. Trial testimony can run to hundreds of pages and blow through your remaining PACER budget, so one must be careful. In particular, testimony operates under an interesting pricing system related to how court reporters—who are not necessarily paid employees but may be contractors or freelancers—are compensated, intended to ensure covering transcription costs: the transcript initially may cost hundreds of dollars, intended to extract full value from those who need the trial transcript immediately, such as lawyers or journalists, but then a while later, PACER drops the price to something more reasonable. That is, the first "original" fee costs a fortune, but then "copy" fees are cheaper. So for the US federal court system, the "original", when ordered within hours of the testimony, will cost <$7.25/page but then the second person ordering the same transcript pays only <$1.20/page & everyone subsequently <$0.90/page, and as further time passes, that drops to <$0.60 (and I believe after a few months, PACER will then charge only the standard $0.10). So, when it comes to trial transcripts on PACER, patience pays off.↩︎

  8. I’ve heard that Lex­is­Nexis ter­mi­nals are some­times avail­able for pub­lic use in places like fed­eral libraries or cour­t­hous­es, but I have never tried this myself.↩︎

  9. Curiously, in historical textual criticism of copied manuscripts, it's the opposite: the shorter reading is usually considered closer to the original. But with memories or paraphrases, longer = truer, because those tend to elide details and mutate into catchier versions when the transmitter is not ostensibly exactly copying a text.↩︎

  10. I advise prepending the Sci-Hub domain to a paper's full URL, rather than appending it to the publisher's hostname, because the former is slightly easier to type but, more importantly, Sci-Hub does not have SSL certificates set up properly (I assume they're missing a wildcard) and so appending the Sci-Hub domain will fail to work in many web browsers due to HTTPS errors! However, if prepended, it'll always work correctly.↩︎

  11. To further illustrate this IA feature: if one was looking for Alex St. John's "Judgment Day Continued…", a 2013 account of organizing the wild 1996 Doom tournament thrown by Microsoft, but one didn't have the URL handy, one could search the entire domain with a Wayback Machine wildcard URL (of the form https://web.archive.org/web/*/example.com/*) and use the filter field with "judgment", or if one at least remembered it was in 2013, one could narrow the wildcard down further to that year and then filter or search by hand.↩︎

  12. If any Blogspot employee is reading this, for god's sake stop this insanity!↩︎

  13. Uploading is not as hard as it may seem. There is a web interface (user/password: "genesis"/"upload"). Uploading large files can fail, so I usually use the FTP server: curl -T "$FILE" ↩︎

  14. Although flatbed scanning is sometimes destructive too—I've cracked the spines of books while pressing them flat into a flatbed scanner.↩︎

  15. My workaround is to export from gscan2pdf as DjVu, which avoids the bug, then convert the DjVu files with ddjvu -format=pdf; this strips any OCR, so I add OCR with ocrmypdf and metadata with exiftool.↩︎