Internet Search Tips

A description of advanced tips and tricks for effective Internet research of papers/books.
archiving, technology, shell, Google, tutorial
2018-12-11–2020-01-21 finished certainty: certain importance: 4


Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. I try to write down my search workflow, and give general advice about finding and hosting documents, with demonstration case studies.

Google-fu search skill is something I’ve prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I’d win. And I can still remember the exact moment it dawned on me in high school that much of the rest of my life would be spent dealing with searches, paywalls, and broken links. The Internet is the greatest almanac of all, and to the curious, a never-ending cornucopia, so I am sad to see many fail to find things after a cursory search—or not look at all. For most people, if it’s not the first hit in Google/Google Scholar, it doesn’t exist. Below, I reveal my best Internet search tricks and try to provide a rough flowchart of how to go about an online search, explaining the subtle tricks and intuition of search-fu.

Papers

Request

Hu­man flesh search en­gine. Last re­sort: if none of this works, there are a few places on­line you can re­quest a copy (how­ev­er, they will usu­ally fail if you have ex­hausted all pre­vi­ous av­enues):

Fi­nal­ly, you can al­ways try to con­tact the au­thor. This only oc­ca­sion­ally works for the pa­pers I have the hard­est time with, since they tend to be old ones where the au­thor is dead or un­reach­able—any au­thor pub­lish­ing a pa­per since 1990 will usu­ally have been dig­i­tized some­where—but it’s easy to try.

Post-finding

After find­ing a full­text copy, you should find a re­li­able long-term link/place to store it and make it more find­able (re­mem­ber—if it’s not in Google/Google Schol­ar, it does­n’t ex­ist!):

  • Never Link Un­re­li­able Hosts:

    • LG/SH: Al­ways op­er­ate un­der the as­sump­tion they could be gone to­mor­row. (As my un­cle found out with Li­brary.nu shortly after pay­ing for a life­time mem­ber­ship!) There are no guar­an­tees ei­ther one will be around for long un­der their le­gal as­saults or the be­hind-the-scenes dra­mas, and no guar­an­tee that they are be­ing prop­erly mir­rored or will be re­stored else­where. Down­load any­thing you need and keep a copy of it your­self and, ide­al­ly, host it pub­licly.
    • NBER: never rely on a papers.nber.org/tmp/ or psycnet.apa.org URL, as they are tem­po­rary. (SSRN is also un­de­sir­able due to mak­ing it in­creas­ingly diffi­cult to down­load, but it is at least re­li­able.)
    • Scribd: never link Scrib­d—they are a scummy web­site which im­pede down­loads, and any­thing on Scribd usu­ally first ap­peared else­where any­way. (In fact, if you run into any­thing vaguely use­ful-look­ing which ex­ists only on Scribd, you’ll do hu­man­ity a ser­vice if you copy it else­where just in case.)
    • RG: avoid linking to ResearchGate (compromised by new ownership, & PDFs get deleted routinely, apparently often by authors) or Academia.edu (the URLs are one-time and break).
    • high­-im­pact jour­nals: be care­ful link­ing to Na­ture.­com or Cell (if a pa­per is not ex­plic­itly marked as Open Ac­cess, even if it’s avail­able, it may dis­ap­pear in a few month­s!); sim­i­lar­ly, watch out for wiley.com, tandfonline.com, jstor.org, springer.com, springerlink.com, & mendeley.com, who pull sim­i­lar shenani­gans.
    • ~/: be care­ful link­ing to aca­d­e­mic per­sonal di­rec­to­ries on uni­ver­sity web­sites (often no­tice­able by the Unix con­ven­tion .edu/~user/ or by di­rec­to­ries sug­ges­tive of ephemeral host­ing, like .edu/cs/course112/readings/foo.pdf); they have short half-lives.
  • PDF Edit­ing: if a scan, it may be worth edit­ing the PDF to crop the edges, thresh­old to bi­na­rize it (which, for a bad grayscale or color scan, can dras­ti­cally re­duce file­size while in­creas­ing read­abil­i­ty), and OCR it.

    I use gscan2pdf but there are alternatives worth checking out.
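    For command-line cleanup, ocrmypdf can handle the deskew/clean/OCR steps in one pass; a minimal sketch (filenames are hypothetical, --clean requires unpaper to be installed, and aggressive cropping or binarization may still be easier in a GUI like gscan2pdf):

    # deskew & clean a raw scan, OCR it, and write an optimized, searchable PDF:
    ocrmypdf --rotate-pages --deskew --clean --optimize 3 scan-raw.pdf scan-clean.pdf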

  • Check & Im­prove Meta­data.

    Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it’s not online, does it really exist?) and helps you if you decide to use bibliographic software in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year.

    PDFs can be easily annotated with metadata using exiftool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.

    For papers hidden inside volumes or other files, you should extract the relevant page range to create a single relevant file. (For extraction of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf. Many publishers insert a spam page as the first page. You can drop that easily with pdftk INPUT.pdf cat 2-end output OUTPUT.pdf, but note that PDFtk may drop all metadata, so do that before adding any metadata.)

    I try to set at least title/author/DOI/year/subject, and stuff any ad­di­tional top­ics & bib­li­o­graphic in­for­ma­tion into the “Key­words” field. Ex­am­ple of set­ting meta­data:

    exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
        -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
        first-order logic,  Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
        Society_, Volume s2-30, Issue 1, 1930-01-01, pg264-286" 1930-ramsey.pdf
  • Pub­lic Host­ing: if pos­si­ble, host a pub­lic copy; es­pe­cially if it was very diffi­cult to find, even if it was use­less, it should be host­ed. The life you save may be your own.

  • Link On WP/Social Me­dia: for bonus points, link it in ap­pro­pri­ate places on Wikipedia or Red­dit or Twit­ter; this makes peo­ple aware of the copy be­ing avail­able, and also su­per­charges vis­i­bil­ity in search en­gines.

  • Link Spe­cific Pages: as noted be­fore, you can link a spe­cific page by adding #page=N to the URL. Link­ing the rel­e­vant page is help­ful to read­ers.
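    For instance, to send a reader straight to page 7 of a (hypothetical) PDF:

    https://www.example.org/docs/1973-smith-example.pdf#page=7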

Advanced

Aside from the (high­ly-rec­om­mend­ed) use of hotkeys and Booleans for search­es, there are a few use­ful tools for the re­searcher, which while ex­pen­sive ini­tial­ly, can pay off in the long-term:

  • archiver-bot: automatically archive your web browsing and/or links from arbitrary websites to forestall linkrot; particularly useful for detecting & recovering from dead PDF links

  • Subscriptions like PubMed & GS search alerts: set up alerts for a specific search query, or for new citations of a specific paper. (Google Alerts is not as useful as it seems.)

    1. PubMed has straight­for­ward con­ver­sion of search queries into alerts: “Cre­ate alert” be­low the search bar. (Given the vol­ume of PubMed in­dex­ing, I rec­om­mend care­fully tai­lor­ing your search to be as nar­row as pos­si­ble, or else your alerts may over­whelm you.)
    2. To create a generic GS search query alert, simply use the “Create alert” option on the sidebar for any search. To follow citations of a key paper, you must: 1. bring up the paper in GS; 2. click on “Cited by X”; 3. then use “Create alert” on the sidebar.
  • GCSE: a Google Custom Search Engine is a specialized search query limited to whitelisted pages/domains etc (eg my Wikipedia-focused anime/manga CSE).

    A GCSE can be thought of as a saved search query on steroids. If you find yourself regularly including scores of the same domains in multiple searches, or constantly blacklisting domains with -site: or using many negations to filter out common false positives, it may be time to set up a GCSE which does all that by default.

  • Clippings: regularly making and keeping excerpts creates a personalized search engine, in effect.

    This can be vi­tal for re­find­ing old things you read where the search terms are hope­lessly generic or you can’t re­mem­ber an ex­act quote or ref­er­ence; it is one thing to search a key­word like “autism” in a few score thou­sand clip­pings, and an­other thing to search that in the en­tire In­ter­net! (One can also re­or­ga­nize or edit the notes to add in the key­words one is think­ing of, to help with re­find­ing.) I make heavy use of Ever­note clip­ping and it is key to re­find­ing my ref­er­ences.

  • Crawling Websites: sometimes having copies of whole websites might be useful, either for more flexible searching or for ensuring you have anything you might need in the future.

    Useful tools to know about: wget, cURL, HTTrack; Firefox plugins: NoScript, uBlock origin, Live HTTP Headers, Bypass Paywalls, cookie exporting.

    Short of downloading a website, it might also be useful to pre-emptively archive it by using linkchecker to crawl it, compile a list of all external & internal links, and store them for processing by another archival program (see Archiving URLs for examples). In certain rare circumstances, security tools like nmap can be useful to examine a mysterious server in more detail: what web server and services does it run, what else might be on it (sometimes interesting things like old anonymous FTP servers turn up), has a website moved between IPs or servers, etc.
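    A minimal sketch of that linkchecker workflow (the domain and filenames are hypothetical; linkchecker’s CSV output is semicolon-separated and includes ‘#’-prefixed header lines):

    # crawl the site, logging every URL checked (not just broken ones) as CSV:
    linkchecker --verbose --output=csv 'https://www.example.com/' > example-links.csv
    # keep just the URL column, dropping comment/header lines, for feeding to an archiver:
    cut -d ';' -f 1 example-links.csv | grep '^http' | sort -u > example-urls.txt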

Web pages

With proper use of pre-emp­tive archiv­ing tools like archiver-bot, fix­ing linkrot in one’s own pages is much eas­ier, but that leaves other ref­er­ences. Search­ing for lost web pages is sim­i­lar to search­ing for pa­pers:

  • Just Search The Ti­tle: if the page ti­tle is given, search for the ti­tle.

    It is a good idea to in­clude page ti­tles in one’s own pages, as well as the URL, to help with fu­ture search­es, since the URL may be mean­ing­less gib­ber­ish on its own, and pre-emp­tive archiv­ing can fail. HTML sup­ports both alt and title pa­ra­me­ters in link tags, and, in cases where dis­play­ing a ti­tle is not de­sir­able (be­cause the link is be­ing used in­line as part of nor­mal hy­per­tex­tual writ­ing), ti­tles can be in­cluded cleanly in Mark­down doc­u­ments like this: [inline text description](URL "Title").

  • Clean URLs: check the URL for weird­ness or trail­ing garbage like ?rss=1 or ?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FgJZg+%28Google+AI+Blog%29? Or a vari­ant do­main, like a mobile.foo.com/m.foo.com/foo.com/amp/ URL? Those are all less likely to be find­able or archived than the canon­i­cal ver­sion.
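    A quick way to strip the query-string junk before searching or archiving (bash; the URL is hypothetical):

    url='https://www.example.com/2011/03/some-post.html?utm_source=feedburner&utm_medium=feed'
    echo "${url%%\?*}"    # → https://www.example.com/2011/03/some-post.html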

  • Do­main Site Search: re­strict G search to the orig­i­nal do­main with site:, or to re­lated do­mains

  • Time-Lim­ited Search: re­strict G search to the orig­i­nal date-range/years
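    For example, Google’s “Tools > Any time” menu does this, and (as of 2019) so do the before:/after: operators, which can be combined with the previous bullet’s site: restriction; hypothetical queries:

    some obscure phrase site:example.org
    some obscure phrase after:2003-01-01 before:2006-01-01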

  • Switch En­gines: try a differ­ent search en­gine: cor­puses can vary, and in some cases G tries to be too smart for its own good when you need a lit­eral search; and are us­able al­ter­na­tives (e­spe­cially if one of Duck­Duck­Go’s ‘bang’ spe­cial searches is what one needs)

  • Check Archives: if nowhere on the clear­net, try the In­ter­net Archive (IA) or the Me­mento meta-archive search en­gine:

    IA is the de­fault backup for a dead URL. If IA does­n’t Just Work, there may be other ver­sions in it:

    • mis­lead­ing redi­rects: did the IA ‘help­fully’ redi­rect you to a much-later-in-time er­ror page? Kill the redi­rect and check the ear­li­est stored ver­sion for the ex­act URL rather than the redi­rect. Did the page ini­tially load but then er­ror out/redirect? Dis­able JS with No­Script and re­load.

    • With­in-Do­main Archives: IA lets you list all URLs with any archived ver­sions, by search­ing for URL/*; the list of avail­able URLs may re­veal an al­ter­nate newer/older URL. It can also be use­ful to fil­ter by file­type or sub­string.

      For example, one might list all URLs in a domain, and if the list is too long and filled with garbage URLs, then use the “Filter results” incremental-search widget to search for “uploads/” on a WordPress blog.

      Screen­shot of an oft-over­looked fea­ture of the In­ter­net Archive: dis­play­ing all available/archived URLs for a spe­cific do­main, fil­tered down to a sub­set match­ing a string like *uploads/*.
      • wayback_machine_downloader (not to be con­fused with the internetarchive Python pack­age which pro­vides a CLI in­ter­face to up­load­ing files) is a Ruby tool which lets you down­load whole do­mains from IA, which can be use­ful for run­ning a lo­cal full­text search us­ing reg­exps (a good grep query is often enough), in cases where just look­ing at the URLs via URL/* is not help­ful. (An al­ter­na­tive which might work is websitedownloader.io.)

      Ex­am­ple:

      gem install --user-install wayback_machine_downloader
      ~/.gem/ruby/2.5.0/bin/wayback_machine_downloader --all-timestamps 'https://blog.okcupid.com'
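      If you prefer the command line to the web “Filter results” widget, IA’s CDX API can produce the same within-domain URL listing; a minimal sketch (the domain and the “uploads/” substring are hypothetical):

      # list one line per unique archived URL under the domain, then filter by substring:
      curl -s 'https://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey' \
          | grep 'uploads/' | sort -u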
    • did the do­main change, eg from www.foo.com to foo.com or www.foo.org? En­tirely differ­ent as far as IA is con­cerned.

    • does the in­ter­nal ev­i­dence of the URL pro­vide any hints? You can learn a lot from URLs just by pay­ing at­ten­tion and think­ing about what each di­rec­tory and ar­gu­ment means.

    • is this a Blogspot blog? Blogspot is uniquely horrible in that it has versions of each blog for every country domain: a foo.blogspot.com blog could be under any of foo.blogspot.de, foo.blogspot.au, foo.blogspot.hk, foo.blogspot.jp, etc.
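      One quick way to check which country-domain variants have any captures at all is IA’s availability API; a minimal sketch (the blog name is hypothetical, and the TLD list can be extended):

      for tld in de au hk jp; do
          # the availability API returns JSON with "available": true if any snapshot exists
          if curl -s "https://archive.org/wayback/available?url=foo.blogspot.$tld" | grep -q '"available": *true'; then
              echo "foo.blogspot.$tld has Wayback captures"
          fi
          sleep 1
      done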

    • did the web­site pro­vide RSS feeds?

      A little-known fact is that Google Reader (GR; October 2005–July 2013) stored all RSS items it crawled, so if a website’s RSS feed was configured to include full items, the RSS feed history was an alternate mirror of the whole website, and since GR never removed RSS items, it was possible to retrieve pages or whole websites from it. GR has since closed down, sadly, but before it closed, Archive Team downloaded a large fraction of GR’s historical RSS feeds, and those archives are now hosted on IA. The catch is that they are stored in mega-WARCs, which, for all their archival virtues, are not the most user-friendly format. The raw GR mega-WARCs are difficult enough to work with that I defer an example to the appendix.

    • archive.today: an IA-like mir­ror

    • any local archives, such as those made with my archiver-bot

    • Google Cache (GC): GC works, some­times, but the copies are usu­ally the worst around, ephemeral & can­not be re­lied up­on. Google also ap­pears to have been steadily dep­re­cat­ing GC over the years, as GC shows up less & less in search re­sults. A last re­sort.

Books

Digital

E-books are rarer and harder to get than pa­pers, al­though the sit­u­a­tion has im­proved vastly since the early 2000s. To search for books on­line:

  • More Straight­for­ward: book searches tend to be faster and sim­pler than pa­per search­es, and to re­quire less clev­er­ness in search query for­mu­la­tion, per­haps be­cause they are rarer on­line, much larg­er, and have sim­pler ti­tles, mak­ing it eas­ier for search en­gines.

    Search G, not GS, for books:

    No Books in Google Scholar
    Book full­texts usu­ally don’t show up in GS (for un­known rea­son­s). You need to check G when search­ing for books.

    To double-check, you can try a filetype:pdf search; then check LG. Typically, if the main title + author doesn’t turn it up, it’s not online. (In some cases, the author order is reversed, or the title & subtitle are swapped, and you can find a copy by tweaking your search, but these are rare.)

  • IA: the In­ter­net Archive has many books scanned which do not ap­pear eas­ily in search re­sults (poor SEO?).

    • If an IA hit pops up in a search, always check it; the OCR may offer hints as to where to find it. If you don’t find anything with IA’s built-in search engine, try doing an IA site search in G instead, eg book title site:archive.org.

    • DRM workarounds: if it is on IA but the IA ver­sion is DRMed and is only avail­able for “check­out”, you can jail­break it.

      Check the book out for the full period, 14 days. Download the PDF (not EPUB) version to Adobe Digital Editions version ≤4.0 (which can be run in Wine on Linux), and then import it to Calibre with the De-DRM plugin, which will produce a DRM-free PDF inside Calibre’s library. (Getting De-DRM running can be tricky, especially under Linux. I wound up having to edit some of the paths in the Python files to make them work with Wine.) You can then add metadata to the PDF & upload it to LG. (LG’s versions of books are usually better than the IA scans, but if they don’t exist, IA’s is better than nothing.)

  • : use the same PDF DRM as IA, can be bro­ken same way

  • HathiTrust also hosts many book scans, which can be searched for clues or hints or jailbroken.

    HathiTrust blocks whole-book down­loads but it’s easy to down­load each page in a loop and stitch them to­geth­er, for ex­am­ple:

    # download each of the 151 pages one at a time (skipping any already downloaded),
    # politely sleeping between requests:
    for i in {1..151}
    do if [[ ! -s "$i.pdf" ]]; then
        wget "https://babel.hathitrust.org/cgi/imgsrv/download/pdf?id=mdp.39015050609067;orient=0;size=100;seq=$i;attachment=0" \
              -O "$i.pdf"
        sleep 10s
     fi
    done
    
    # stitch the pages back together in numeric order (a bare *.pdf glob would sort
    # 1, 10, 100, 101… lexically and scramble the page order):
    pdftk $(ls -v *.pdf) cat output 1957-super-scientificcareersandvocationaldevelopmenttheory.pdf
    
    exiftool -Title="Scientific Careers and Vocational Development Theory: A review, a critique and some recommendations" \
        -Date=1957 -Author="Donald E. Super, Paul B. Bachrach" -Subject="psychology" \
        -Keywords="Bureau Of Publications (Teachers College Columbia University), LCCCN: 57-12336, National Science Foundation, public domain, \
        https://babel.hathitrust.org/cgi/pt?id=mdp.39015050609067;view=1up;seq=1 http://psycnet.apa.org/record/1959-04098-000" \
        1957-super-scientificcareersandvocationaldevelopmenttheory.pdf

    An­other ex­am­ple of this would be the Well­come Li­brary; while look­ing for An In­ves­ti­ga­tion Into The Re­la­tion Be­tween In­tel­li­gence And In­her­i­tance, Lawrence 1931, I came up dry un­til I checked one of the last search re­sults, a “Well­come Dig­i­tal Li­brary” hit, on the slim off-chance that, like the oc­ca­sional Chinese/Indian li­brary web­site, it just might have full­text. As it hap­pens, it did—­good news? Yes, but with a caveat: it pro­vides no way to down­load the book! It pro­vides OCR, meta­data, and in­di­vid­ual page-im­age down­loads all un­der CC-BY-NC-SA (so no le­gal prob­lem­s), but… not the book. (The OCR is also un­nec­es­sar­ily zipped, so that is why Google ranked the page so low and did not show any re­veal­ing ex­cerpts from the OCR tran­script: be­cause it’s hid­den in an opaque archive to save a few kilo­bytes while de­stroy­ing SEO.) Ex­am­in­ing the down­load URLs for the high­est-res­o­lu­tion im­ages, they fol­low an un­for­tu­nate schema:

    1. https://dlcs.io/iiif-img/wellcome/1/5c27d7de-6d55-473c-b3b2-6c74ac7a04c6/full/2212,/0/default.jpg
    2. https://dlcs.io/iiif-img/wellcome/1/d514271c-b290-4ae8-bed7-fd30fb14d59e/full/2212,/0/default.jpg
    3. etc

    In­stead of be­ing se­quen­tially num­bered 1–90 or what­ev­er, they all live un­der a unique hash or ID. For­tu­nate­ly, one of the meta­data files, the ‘man­i­fest’ file, pro­vides all of the hashes/IDs (but not the high­-qual­ity down­load URLs). Ex­tract­ing the IDs from the man­i­fest can be done with some quick sed & tr string pro­cess­ing, and fed into an­other short wget loop for down­load

    HASHES=$(fgrep '@id' manifest\?manifest\=https\:%2F%2Fwellcomelibrary.org%2Fiiif%2Fb18032217%2Fmanifest | \
       sed -e 's/.*imageanno\/\(.*\)/\1/' | egrep -v '^ .*' | tr -d ',' | tr -d '"')
    # bf23642e-e89b-43a0-8736-f5c6c77c03c3
    # 334faf27-3ee1-4a63-92d9-b40d55ab72ad
    # 5c27d7de-6d55-473c-b3b2-6c74ac7a04c6
    # d514271c-b290-4ae8-bed7-fd30fb14d59e
    # f85ef645-ec96-4d5a-be4e-0a781f87b5e2
    # a2e1af25-5576-4101-abee-96bd7c237a4d
    # 6580e767-0d03-40a1-ab8b-e6a37abe849c
    # ca178578-81c9-4829-b912-97c957b668a3
    # 2bd8959d-5540-4f36-82d9-49658f67cff6
    # ...etc
    I=1
    for HASH in $HASHES; do
        wget "https://dlcs.io/iiif-img/wellcome/1/$HASH/full/2212,/0/default.jpg" -O $I.jpg
        I=$((I+1))
    done

    And then the 59MB of JPGs can be cleaned up as usual with gscan2pdf (empty pages delet­ed, ta­bles ro­tat­ed, cover page cropped, all other pages bi­na­rized), compressed/OCRed with ocrmypdf, and meta­data set with exiftool, pro­duc­ing a read­able, down­load­able, high­ly-search-engine-friendly 1.8MB PDF.

  • remember the analogue hole works for papers/books too:

    if you can find a copy to read, but can­not fig­ure out how to down­load it di­rectly be­cause the site uses JS or com­pli­cated cookie au­then­ti­ca­tion or other tricks, you can al­ways ex­ploit the ‘ana­logue hole’—fullscreen the book in high res­o­lu­tion & take screen­shots of every page; then crop, OCR etc. This is te­dious but it works. And if you take screen­shots at suffi­ciently high res­o­lu­tion, there will be rel­a­tively lit­tle qual­ity loss. (This works bet­ter for books that are scans than ones born-dig­i­tal.)
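    A rough sketch of automating this on X11/Linux with scrot + xdotool (the page count, delay, and window-focus handling are all things you would tune by hand):

    sleep 5    # switch focus to the fullscreened book-reader window during this pause
    for page in $(seq 1 300); do
        scrot -u "page-$(printf '%03d' "$page").png"   # screenshot the focused window
        xdotool key --clearmodifiers Next              # send Page Down to turn the page
        sleep 2
    done
    # then crop, binarize, & OCR the images as usual (eg gscan2pdf or ocrmypdf).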

Physical

Expensive but feasible. Books are something of a double-edged sword compared to papers/theses. On the one hand, books are much more often unavailable online, and must be bought offline, but at least you can almost always buy used books offline without much trouble (and often for <$10 total); on the other hand, while papers/theses are often available online, when one is not, it’s usually very unavailable, and you’re stuck (unless you have a university ILL department backing you up or are willing to travel to the few or only universities with paper or microfilm copies).

Pur­chas­ing from used book sell­ers:

  • Sell­ers:

    • used book search en­gines: Google Books/find­-more-book­s.­com: a good start­ing point for seller links; if buy­ing from a mar­ket­place like AbeBooks/Amazon/Barnes & No­ble, it’s worth search­ing the seller to see if they have their own web­site, which is po­ten­tially much cheap­er. They may also have mul­ti­ple edi­tions in stock.

    • bad: eBay & Ama­zon are often bad, due to high­-min­i­mum-order+S&H and sell­ers on Ama­zon seem to as­sume Ama­zon buy­ers are eas­ily rooked; but can be use­ful in pro­vid­ing meta­data like page count or ISBN or vari­a­tions on the ti­tle

    • good: Abe­Books, Thrift Books, Bet­ter World Books, B&N, Dis­cover Books.

      Note: on Abe­Books, in­ter­na­tional or­ders can be use­ful (e­spe­cially for be­hav­ioral ge­net­ics or psy­chol­ogy books) but be care­ful of in­ter­na­tional or­ders with your credit card—­many debit/credit cards will fail on in­ter­na­tional or­ders and trig­ger a fraud alert, and Pay­Pal is not ac­cept­ed.

  • Price Alerts: if a book is not avail­able or too ex­pen­sive, set price watch­es: Abe­Books sup­ports email alerts on stored search­es, and Ama­zon can be mon­i­tored via Camel­Camel­Camel (re­mem­ber the CCC price alert you want is on the used third-party cat­e­go­ry, as new books are more ex­pen­sive, less avail­able, and un­nec­es­sary).

Scan­ning:

  • De­struc­tive Vs Non-De­struc­tive: the fun­da­men­tal dilemma of book scan­ning—de­struc­tively de­bind­ing books with a ra­zor or guil­lo­tine cut­ter works much bet­ter & is much less time-con­sum­ing than spread­ing them on a flatbed scan­ner to scan one-by-one14, be­cause it al­lows use of a sheet-fed scan­ner in­stead, which is eas­ily 5x faster and will give high­er-qual­ity scans (be­cause the sheets will be flat, scanned edge-to-edge, and much more closely aligned), but does, of course, re­quire effec­tively de­stroy­ing the book.

  • Tools:

    • cut­ting: For sim­ple de­bind­ing of a few books a year, an X-acto knife/razor is good (avoid the ‘tri­an­gle’ blades, get curved blades in­tended for large cuts in­stead of de­tail work).

      Once you start do­ing more than one a mon­th, it’s time to up­grade to a guil­lo­tine blade pa­per cut­ter (a fancier swing­ing-arm pa­per cut­ter, which uses a two-joint sys­tem to clamp down and cut uni­form­ly).

      A guil­lo­tine blade can cut chunks of 200 pages eas­ily with­out much slip­page, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into sev­eral 200-page chunks for the guil­lo­tine cut­ter.

    • scan­ning: at some point, it may make sense to switch to a scan­ning ser­vice like 1Dol­larScan (1DS has ac­cept­able qual­ity for the black­-white scans I have used them for thus far, but watch out for their nick­el-and-dim­ing fees for OCR or “set­ting the PDF ti­tle”; these can be done in no time your­self us­ing gscan2pdf/exiftool/ocrmypdf and will save a lot of money as they, amaz­ing­ly, bill by 100-page unit­s). Books can be sent di­rectly to 1DS, re­duc­ing lo­gis­ti­cal has­sles.

  • Clean Up: after scan­ning, crop/threshold/OCR/add meta­data

    • Adding meta­data: same prin­ci­ples as pa­pers. While more elab­o­rate meta­data can be added, like book­marks, I have not ex­per­i­mented with those yet.
  • File for­mat: PDF.

    In the past, I used DjVu for documents I produce myself, as it produces much smaller scans than gscan2pdf’s default PDF settings due to a buggy Perl library (at least half the size, sometimes one-tenth the size), making them more easily hosted & a superior browsing experience.

    The downsides of DjVu are that not all PDF viewers can handle DjVu files, and it appears that G/GS ignore all DjVu files (despite the format being 20 years old), rendering them completely unfindable online. In addition, DjVu is an increasingly obscure format and has, for example, been dropped by the IA as of 2016. The former is a relatively small issue, but the latter is fatal—being consigned to oblivion by search engines largely defeats the point of scanning! (“If it’s not in Google, it doesn’t exist.”) Hence, despite being a worse format, I now recommend PDF, have stopped using DjVu for new scans, and have converted my old DjVu files to PDF.
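    For the conversion of old DjVu scans, a minimal sketch using ddjvu (from DjVuLibre) plus an OCR pass so the resulting PDFs are searchable (filenames are whatever you have lying around):

    for f in *.djvu; do
        ddjvu -format=pdf "$f" "${f%.djvu}.pdf"          # image-only PDF
        ocrmypdf "${f%.djvu}.pdf" "${f%.djvu}-ocr.pdf"   # add a text layer
    done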

  • Up­load­ing: to Lib­Gen, usu­al­ly. For back­ups, file­lock­ers like Drop­box, Mega, Me­di­aFire, or Google Drive are good. I usu­ally up­load 3 copies in­clud­ing LG. I ro­tate ac­counts once a year, to avoid putting too many files into a sin­gle ac­count.

    Do Not Use Google Docs/Scribd/Dropbox/etc

    ‘Doc­u­ment’ web­sites like Google Docs (GD) should be strictly avoided as pri­mary host­ing. GD does not ap­pear in G/GS, doom­ing a doc­u­ment to ob­scu­ri­ty, and Scribd is lu­di­crously user-hos­tile. Such sites can­not be searched, scraped, down­load­ed, clipped, used on many de­vices, or counted on for the long haul.

    Such sites may be use­ful for col­lab­o­ra­tion or sur­veys, but should be moved to clean sta­tic HTML/PDF hosted else­where as soon as pos­si­ble.
  • Host­ing: host­ing pa­pers is easy but books come with risk:

    Books can be dangerous; in deciding whether to host a book, my rule of thumb is to host only books published pre-2000 which do not have Kindle editions or other signs of active exploitation, and which are effectively ‘orphan works’.

    As of 2019-10-23, hosting 4,090 files over 9 years (very roughly, assuming linear growth, <6.7 million document-days of hosting: 4,090 × 9 × 365.25 / 2 ≈ 6.7m), I’ve received 4 takedown orders: a behavioral genetics textbook (2013), The Handbook of Psychopathy (2005), a recent meta-analysis paper (Roberts et al 2016), and a CUP DMCA takedown order for 27 files. I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.

Case Studies

Be­low are >13 case stud­ies of diffi­cult-to-find re­sources or ci­ta­tions, and how I went about lo­cat­ing them, demon­strat­ing the var­i­ous In­ter­net search tech­niques de­scribed above and how to think about search­es.

  • Missing Appendix: Anders Sandberg asked:

    Does any­body know where the on­line ap­pen­dix to Nord­haus’ “Two Cen­turies of Pro­duc­tiv­ity Growth in Com­put­ing” is hid­ing?

    I look up the ti­tle in Google Schol­ar; see­ing a friendly psu.edu PDF link (Cite­Seerx), I click. The pa­per says “The data used in this study are pro­vided in a back­ground spread­sheet avail­able at http://www.econ.yale.edu/~nordhaus/Computers/Appendix.xls”. Sad­ly, this is a lie. (Sand­berg would of course have tried that.)

    I im­me­di­ately check the URL in the IA—noth­ing. The IA did­n’t catch it at all. Maybe the offi­cial pub­lished pa­per web­site has it? Nope, it ref­er­ences the same URL, and does­n’t pro­vide a copy as an ap­pen­dix or sup­ple­ment. (What do we pay these pub­lish­ers such enor­mous sums of money for, ex­act­ly?) So I back off to check­ing http://www.econ.yale.edu/~nordhaus/, to check Nord­haus’s per­sonal web­site for a newer link. The Yale per­sonal web­site is empty and ap­pears to’ve been re­placed by a Google Sites per­sonal page. It links noth­ing use­ful, so I check a more thor­ough in­dex, Google, by search­ing site:sites.google.com/site/williamdnordhaus/. Noth­ing there ei­ther (and it ap­pears al­most emp­ty, so Nord­haus has al­lowed most of his stuff to be deleted and bi­trot). I try a broader Google: nordhaus appendix.xls. This turns up some spread­sheets, but still noth­ing.

    Eas­ier ap­proaches hav­ing been ex­haust­ed, I re­turn to the IA and I pull up all URLs archived for his orig­i­nal per­sonal web­site: https://web.archive.org/web/*/http://www.econ.yale.edu/~nordhaus/* This pulls up way too many URLs to man­u­ally re­view, so I fil­ter re­sults for xls, which re­duces to a more man­age­able 60 hits; read­ing through the hits, I spot http://www.econ.yale.edu:80/~nordhaus/homepage/documents/Appendix_Nordhaus_computation_update_121410.xlsx from 2014-10-10; this sounds right, al­beit sub­stan­tially later in time than ex­pected (ei­ther 2010 or 2012, judg­ing from the file­name).

    Down­load­ing it, open­ing it up and cross-ref­er­enc­ing with the pa­per, it has the same spread­sheet ‘sheets’ as men­tioned, like “Man­ual” or “Cap­i­tal_Deep”, and seems to be ei­ther the orig­i­nal file in ques­tion or an up­dated ver­sion thereof (which may be even bet­ter). The spread­sheet meta­data in­di­cates it was cre­ated “04/09/2001, 23:20:43, ITS Aca­d­e­mic Me­dia & Tech­nol­ogy”, and mod­i­fied “12/22/2010, 02:40:20”, so it seems to be the lat­ter—it’s the orig­i­nal spread­sheet Nord­haus cre­ated when he be­gan work sev­eral years prior to the for­mal 2007 pub­li­ca­tion (6 years seems rea­son­able given all the de­lays in such a process), and then was up­dated 3 years after­wards. Close enough.

  • Mis­re­mem­bered Book: A Red­di­tor asked:

    I was in a con­sign­ment type store once and picked up a book called “Eat fat, get thin”. Giv­ing it a quick scan through, it was ba­si­cally the same stuff as Atkins but this book was from the 50s or 60s. I wish I’d have bought it. I think I found a ref­er­ence to it once on­line but it’s been drowned out since some­one else re­leased a book with the same name (and it was­n’t Barry Groves ei­ther).

    The eas­i­est way to find a book given a cor­rupted ti­tle, a date range, and the in­for­ma­tion there are many sim­i­lar ti­tles drown­ing out a naive search en­gine query, is to skip to a spe­cial­ized search en­gine with clean meta­data (ie. a li­brary data­base).

    Search­ing in World­Cat for 1950s–1970s, “Eat fat, get thin” turns up noth­ing rel­e­vant. This is un­sur­pris­ing, as he was un­likely to’ve re­mem­bered the ti­tle ex­actly, and this ti­tle does­n’t quite sound right for the era any­way (a lit­tle too punchy and un­gram­mat­i­cal, and ‘thin’ was­n’t a de­sir­able word back then com­pared to words like ‘slim’ or ‘sleek’ or ‘svelte’). Peo­ple often over­sim­plify ti­tles, so I dropped back to just “Eat fat”.

    This im­me­di­ately turned up the book: 1958 Eat Fat and Grow Slim—note that it is al­most the same ti­tle, with a comma serv­ing as con­junc­tion and ‘slim’ rather than the more con­tem­po­rary ‘thin’, but just differ­ent enough to screw up an over­ly-lit­eral search.

    With the same trick in mind, we could also have found it in a reg­u­lar Google search query by adding ad­di­tional terms to hint to Google that we want old books, not re­cent ones: both "Eat Fat" 1950s or "Eat Fat" 1960s would have turned it up in the first 5 search re­sults. If we did­n’t use quotes, the searches get harder be­cause broader hits get pulled in. For ex­am­ple, Eat fat, get thin 1950s -Hyman ex­cludes the re­cent book men­tioned, but you still have to go down 15 hits be­fore find­ing Mackar­ness, and Eat fat, get thin -Hyman re­quires go­ing down 18 hits.

  • Missing Website: an article on the phenomenon of disappearing polymorphs quotes striking transcripts from a major example of a disappearing crystal, when ~1998 Abbott suddenly became unable to manufacture the anti-retroviral drug ritonavir (Norvir™) due to a rival (and less effective) crystal form spontaneously infecting all its plants, threatening many AIDS patients, but notes:

    The transcripts were originally published on the website of the International Association of Physicians in AIDS Care [IAPAC], but no longer appear there.

    A search us­ing the quotes con­firms that the orig­i­nals have long since van­ished from the open In­ter­net, turn­ing up only quotes of the quo­ta­tions. Un­for­tu­nate­ly, no URL is giv­en. The In­ter­net Archive has com­pre­hen­sive mir­rors of the IAPAC, but too many to eas­ily search through. Us­ing the fil­ter fea­ture, I key­word-searched for “ri­ton­avir”, but while this turned up a num­ber of pages from roughly the right time pe­ri­od, they do not men­tion it and none of the quotes ap­pear. The key turned out to be to use the trade­mark name in­stead which pulls up many more pages, and after check­ing a few, the IAPAC turned out to have or­ga­nized all the Norvir ma­te­r­ial into a sin­gle sub­di­rec­tory with a con­ve­nient index.html; the articles/transcripts, in turn, were in­dexed un­der the linked .

    I then pulled the Norvir subdirectory with a ~/.gem/ruby/2.5.0/bin/wayback_machine_downloader 'http://www.iapac.org/norvir/' command and hosted a mirror to make it visible in Google.

  • Speech → Book: Nancy Lebovitz asked about a ci­ta­tion in a Roy Baumeis­ter speech about sex differ­ences:

    There’s an idea I’ve seen a num­ber of times that 80% of women have had de­scen­dants, but only 40% of men. A lit­tle re­search tracked it back to this, but the speech does­n’t have a cite and I haven’t found a source.

    This could be solved by guess­ing that the for­mal ci­ta­tion is given in the book, and do­ing key­word search to find a sim­i­lar pas­sage. The sec­ond line of the speech says:

    For more in­for­ma­tion on this top­ic, read Dr. Baumeis­ter’s book Is There Any­thing Good About Men? avail­able in book­stores every­where, in­clud­ing here.

    A search of Is There Any­thing Good About Men in Lib­gen turns up a copy. Down­load. What are we look­ing for? A re­minder, the key lines in the speech are:

    …It’s not a trick ques­tion, and it’s not 50%. True, about half the peo­ple who ever lived were wom­en, but that’s not the ques­tion. We’re ask­ing about all the peo­ple who ever lived who have a de­scen­dant liv­ing to­day. Or, put an­other way, yes, every baby has both a mother and a fa­ther, but some of those par­ents had mul­ti­ple chil­dren. Re­cent re­search us­ing DNA analy­sis an­swered this ques­tion about two years ago. To­day’s hu­man pop­u­la­tion is de­scended from twice as many women as men. I think this differ­ence is the sin­gle most un­der­-ap­pre­ci­ated fact about gen­der. To get that kind of differ­ence, you had to have some­thing like, through­out the en­tire his­tory of the hu­man race, maybe 80% of women but only 40% of men re­pro­duced.

    We could search for var­i­ous words or phrase from this pas­sage which seem to be rel­a­tively unique; as it hap­pens, I chose the rhetor­i­cal “50%” (but “80%”, “40%”, “un­der­ap­pre­ci­ated”, etc all would’ve worked with vary­ing lev­els of effi­ciency since the speech is heav­ily based on the book), and thus jumped straight to chap­ter 4, “The Most Un­der­ap­pre­ci­ated Fact About Men”. (If these had not worked, we could have started search­ing for years, based on the quote “about two years ago”.) A glance tells us that Baumeis­ter is dis­cussing ex­actly this topic of re­pro­duc­tive differ­en­tials, so we read on and a few pages lat­er, on page 63, we hit the jack­pot:

    The cor­rect an­swer has re­cently be­gun to emerge from DNA stud­ies, no­tably those by Ja­son Wilder and his col­leagues. They con­cluded that among the an­ces­tors of to­day’s hu­man pop­u­la­tion, women out­num­bered men about two to one. Two to one! In per­cent­age terms, then, hu­man­i­ty’s an­ces­tors were about 67% fe­male and 33% male.

    Who’s Wilder? A C-f for “Wilder” takes us to pg286, where we im­me­di­ately read:

    …The DNA stud­ies on how to­day’s hu­man pop­u­la­tion is de­scended from twice as many women as men have been the most re­quested sources from my ear­lier talks on this. The work is by Ja­son Wilder and his col­leagues. I list here some sources in the mass me­dia, which may be more ac­ces­si­ble to layper­sons than the highly tech­ni­cal jour­nal ar­ti­cles, but for the spe­cial­ists I list those al­so. For a highly read­able in­tro­duc­tion, you can Google the ar­ti­cle “An­cient Man Spread the Love Around,” which was pub­lished Sep­tem­ber, 20, 2004 and is still avail­able (last I checked) on­line. There were plenty of other sto­ries in the me­dia at about this time, when the re­search find­ings first came out. In “Med­ical News To­day,”, on the same date in 2004, a story un­der “Genes ex­pose se­crets of sex on the side” cov­ered much the same ma­te­r­i­al.

    If you want the orig­i­nal sources, read Wilder, J. A., Mobash­er, Z., & Ham­mer, M. F. (2004). “Ge­netic ev­i­dence for un­equal effec­tive pop­u­la­tion sizes of hu­man fe­males and males”. Mol­e­c­u­lar Bi­ol­ogy and Evo­lu­tion, 21, 2047–2057. If that went down well, you might try Wilder, J. A., Kingan, S. B., Mobash­er, Z., Pilk­ing­ton, M. M., & Ham­mer, M. F. (2004). “Global pat­terns of hu­man mi­to­chon­dr­ial DNA and Y-chro­mo­some struc­ture are not in­flu­enced by higher mi­gra­tion rates of fe­males ver­sus males”. Na­ture Ge­net­ics, 36, 1122–1125. That one was over my head, I ad­mit. A more read­able source on these is Shriver, M. D. (2005), “Fe­male mi­gra­tion rate might not be greater than male rate”. Eu­ro­pean Jour­nal of Hu­man Ge­net­ics, 13, 131–132. Shriver raises an­other in­trigu­ing hy­poth­e­sis that could have con­tributed to the greater pre­pon­der­ance of fe­males in our an­ces­tors: Be­cause cou­ples mate such that the man is old­er, the gen­er­a­tional in­ter­vals are smaller for fe­males (i.e., baby’s age is closer to moth­er’s than to fa­ther’s). As for the 90% to 20% differ­en­tial in other species, that I be­lieve is stan­dard in­for­ma­tion in bi­ol­o­gy, which I first heard in one of the lec­tures on testos­terone by the late James Dabbs, whose book He­roes, Rogues, and Lovers re­mains an au­thor­i­ta­tive source on the top­ic.

    Wilder et al 2004, in­ci­den­tal­ly, fits well with Baumeis­ter re­mark­ing in 2007 that the re­search was done 2 or so years ago. And of course you could’ve done the same thing us­ing Google Books: search “Baumeis­ter any­thing good about men” to get to the book, then search-with­in-the-book for “50%”, jump to page 53, read to page 63, do a sec­ond search-with­in-the-book for “Wilder” and the sec­ond hit of page 287 even luck­ily gives you the snip­pet:

    Sources and Ref­er­ences 287

    …If you want the orig­i­nal sources, read Wilder, J. A., Mobash­er, Z., & Ham­mer, M. F. (2004). “Ge­netic ev­i­dence for un­equal effec­tive pop­u­la­tion sizes of hu­man fe­males and males”. Mol­e­c­u­lar Bi­ol­ogy and Evo­lu­tion

  • Connotations: a commenter who shall remain nameless wrote:

    I chal­lenge you to find an ex­am­ple of some­one say­ing “this den of X” where X does not have a neg­a­tive con­no­ta­tion.

    I found a pos­i­tive con­no­ta­tion within 5s us­ing my Google hotkey for "this den of ", and, cu­ri­ous about fur­ther ones, found ad­di­tional uses of the phrase in re­gard to deal­ing with rat­tlesnakes in Google Books.

  • Rowling Quote On Death: Did J. K. Rowling say the Harry Potter books were about ‘death’? There are a lot of Rowling statements, but checking WP and opening up each interview link (under the theory that the key interviews are linked there) and searching for ‘death’ soon turns up a relevant quote from 2001:

    Death is an ex­tremely im­por­tant theme through­out all seven books. I would say pos­si­bly the most im­por­tant theme. If you are writ­ing about Evil, which I am, and if you are writ­ing about some­one who is es­sen­tially a psy­chopath, you have a duty to show the real evil of tak­ing hu­man life.

  • Crowley Quote: Scott Alexander posted a piece linking to an excerpt titled “Aleister Crowley on Religious Experience”.

    The link was bro­ken, but Alexan­der brought it up in the con­text of an ear­lier dis­cus­sion where he also quoted Crow­ley; search­ing those quotes re­veals that it must have been ex­cerpts from Mag­ick: Book 4

  • Finding The Right ‘SAGE’: Phil Goetz noted that an anti-aging conference named “SAGE” had become impossible to find in Google due to a LGBT aging conference also named SAGE.

    Reg­u­lar searches would fail, but a com­bi­na­tion of tricks worked: SAGE anti-aging conference com­bined with re­strict­ing Google search to 2003–2005 time-range turned up a ci­ta­tion to its web­site as the fourth hit, http://www.sagecrossroads.net (which has iron­i­cally since died).

  • UK Char­ity Fi­nan­cials: The Fu­ture of Hu­man­ity In­sti­tute (FHI) does­n’t clearly pro­vide char­ity fi­nan­cial forms akin to the US Form 990s, mak­ing it hard to find out in­for­ma­tion about its bud­get or re­sults.

    FHI does­n’t show up in the CC, NPC, or GuideStar, which are the first places to check for char­ity fi­nances, so I went a lit­tle broader afield and tried a site search on the FHI web­site: budget site:fhi.ox.ac.uk. This im­me­di­ately turned up FHI’s own doc­u­men­ta­tion of its ac­tiv­i­ties and bud­gets, such as the 2007 an­nual re­port; I used part of its ti­tle as a new Google search: future of humanity institute achievements report site:fhi.ox.ac.uk.

  • No­bel Lin­eage Re­search: John Maxwell re­ferred to a for­got­ten study on high cor­re­la­tion be­tween No­belist pro­fes­sors & No­belist grad stu­dents (al­most en­tirely a se­lec­tion effect, I would bet). I was able to re­find it in 7 min­utes.

    I wasted a few searches like factor predicting Nobel prize or Nobel prize graduate student in Google Scholar, until I searched for Nobel laureate "graduate student"; the second hit was a citation, which is a little unusual for Google Scholar and meant it was important, and it had the critical word mutual in it—simultaneous partnerships in Nobel work are somewhat rare, but temporally-separated teams don’t work for prizes, and I suspected that it was exactly what I was looking for. Googling the title, I soon found a PDF, “Eminent Scientists’ Demotivation in School: A symptom of an incurable disease?”, Viau 2004, which confirmed it (and Viau 2004 is interesting in its own right as a contribution to the Conscientiousness vs IQ question). I then followed it to a useful paragraph:

    In a study con­ducted with 92 Amer­i­can win­ners of the No­bel Prize, Zuck­er­man (1977) dis­cov­ered that 48 of them had worked as grad­u­ate stu­dents or as­sis­tants with pro­fes­sors who were them­selves No­bel Prize award-win­ners. As pointed out by Zuck­er­man (1977), the fact that 11 No­bel prizewin­ners have had the great physi­cist Ruther­ford as a men­tor is an ex­am­ple of just how sig­nifi­cant a good men­tor can be dur­ing one’s stud­ies and train­ing. It then ap­pears that most em­i­nent sci­en­tists did have peo­ple to stim­u­late them dur­ing their child­hood and men­tor(s) dur­ing their stud­ies. But, what ex­actly is the na­ture of these peo­ple’s con­tri­bu­tion.

    • Zuck­er­man, H. (1977). Sci­en­tific Elite: No­bel Lau­re­ates in the United States. New York: Free Press.

    GS lists >900 ci­ta­tions of this book, so there may well be ad­di­tional or fol­lowup stud­ies cov­er­ing the 40 years since. Or, also rel­e­vant is “Zuck­er­man, H. (1983). The sci­en­tific elite: No­bel lau­re­ates’ mu­tual in­flu­ences. In R. S. Al­bert (Ed.), Ge­nius and em­i­nence (pp. 241–252). New York: Perg­a­mon Press”, and “Zuck­er­man H. ‘So­ci­ol­ogy of No­bel Prizes’, Sci­en­tific Amer­i­can 217 (5): 25& 1967.”

  • Too Nar­row: A fail­ure case study: The_­Duck looked for but failed to find other uses of a fa­mous Wittgen­stein anec­dote. His mis­take was be­ing too spe­cific:

    Yes, clearly my Google-fu is lack­ing. I think I searched for phrases like “sun went around the Earth,” which fails be­cause your quote has “sun went round the Earth.”

    As dis­cussed in the search tips, when you’re for­mu­lat­ing a search, you want to bal­ance how many hits you get, aim­ing for a sweet spot of a few hun­dred high­-qual­ity hits to re­view—the broader your for­mu­la­tion, the more likely the hits will in­clude your tar­get (if it ex­ists) but the more hits you’ll re­turn. In The_­Duck’s case, he used an over­ly-spe­cific search, which would turn up only 2 hits at most; this should have been a hint to loosen the search, such as by drop­ping quotes or drop­ping key­words.

    In this case, my rea­son­ing would go some­thing like this, laid out ex­plic­it­ly: ‘“Wittgen­stein” is al­most guar­an­teed to be on the same page as any in­stance of this quote, since the quote is about Wittgen­stein; LW, how­ev­er, does­n’t dis­cuss Wittgen­stein much, so there won’t be many hits in the first place; to find this quote, I only need to nar­row down those hits a lit­tle, and after “Wittgen­stein”, the most fun­da­men­tal core word to this quote is “Earth” or “sun”, so I’ll toss one of them in and… ah, there’s the quote!’

    If I were search­ing the gen­eral In­ter­net, my rea­son­ing would go more like “‘Wittgen­stein’ will be on, like, a mil­lion web­sites; I need to nar­row that down a lot to hope to find it; so maybe ‘Wittgen­stein’ and ‘Earth’ and ‘Sun’… nope, noth­ing on the first page, so toss in 'goes around' OR 'go around'—ah there it is!”

    (Ac­tu­al­ly, for the gen­eral In­ter­net, just Wittgenstein earth sun turns up a first page mostly about this anec­dote, sev­eral of which in­clude all the de­tails one could need.)

  • Dead URL: A link to a re­search ar­ti­cle in a post by Morendil broke, he had not pro­vided any for­mal ci­ta­tion data, and the orig­i­nal do­main blocks all crawlers in its robots.txt so IA would not work. What to do?

    The sim­plest so­lu­tion was to search a di­rect quote, turn­ing up a Scribd mir­ror; Scribd is a par­a­site web­site, where peo­ple up­load copies from else­where, which ought to make one won­der where the orig­i­nal came from. (It often shows up be­fore the orig­i­nal in any search en­gine, be­cause it au­to­mat­i­cally runs OCR on sub­mis­sions, mak­ing them more vis­i­ble to search en­gines.) With a copy of the jour­nal is­sue to work with, you can eas­ily find the offi­cial HP archives and down­load the orig­i­nal PDF.

    If that had­n’t worked, search­ing for the URL with­out /pg_2/ in it yields the full ci­ta­tion, and then that can be looked up nor­mal­ly. Fi­nal­ly, some­what more dan­ger­ous would be try­ing to find the ar­ti­cle just by au­thor sur­name & year.

  • De­scrip­tion But No Ci­ta­tion: A 2013 Med­ical Daily on the effects of read­ing fic­tion omit­ted any link or ci­ta­tion to the re­search in ques­tion. But it is easy to find.

    The ar­ti­cle says the au­thors are one Kauf­man & Lib­by, and im­plies it was pub­lished in the last year. So: go to Google Schol­ar, punch in Kaufman Libby, limit to ‘Since 2012’; and the cor­rect pa­per (“Chang­ing be­liefs and be­hav­ior through ex­pe­ri­ence-tak­ing”) is the first hit with full­text avail­able on the right-hand side as the text link “[PDF] from tiltfactor.org” & many other do­mains.

  • Find­ing Fol­lowups: Is soy milk bad for you as one study sug­gests? Has any­one repli­cated it? This is easy to look into a lit­tle if you use the power of re­verse ci­ta­tion search!

    Plug Brain aging and midlife tofu consumption into Google Schol­ar, one of the lit­tle links un­der the first hit points to “Cited by 176”; if you click on that, you can hit a check­box for “Search within cit­ing ar­ti­cles”; then you can search a query like experiment OR randomized OR blind which yields 121 re­sults. The first re­sult shows no neg­a­tive effect and a trend to a ben­e­fit, the sec­ond is in­ac­ces­si­ble, the sec­ond & third are re­views whose ab­stract sug­gests it would ar­gue for ben­e­fits, and the fourth dis­cusses sleep & mood ben­e­fits to soy di­ets. At least from a quick skim, this claim is not repli­cat­ing, and I am du­bi­ous about it.

  • How Many Home­less?: does NYC re­ally have 114,000+ home­less school chil­dren? This case study demon­strates the crit­i­cal skill of notic­ing the need to search at all, and the search it­self is al­most triv­ial.

    Won’t someone think of the children? In March 2020, as the coronavirus epidemic took off in NYC, centered in Manhattan (with a similar trend to Wuhan/Iran/Italy), NYC Mayor Bill de Blasio refused to take social distancing/quarantine measures like ordering the NYC public school system closed, and this delay until 16 March contributed to the epidemic’s unchecked spread in NYC; one justification was that there were “114,085 homeless children” who received social services like free laundry through the schools. This number has been widely cited in the media by the NYT, WSJ, etc, and was vaguely sourced to “state data” reported by “Advocates for Children of New York”. This is a terrible reason to not deal with a pandemic that could kill tens of thousands of New Yorkers, as there are many ways to deliver services which do not require every child in NYC to attend school & spread infections—but first, is this number even true?

    Basic numeracy: implausibly-large! Activists of any stripe are untrustworthy sources, and a number like 114k should make any numerate person uneasy even without any fact-checking; “114,085” is suspiciously precise for such a difficult-to-measure or define thing like homelessness, and it’s well-known that the population of NYC is ~8m or 8,000k—is it really the case that around 1 in every 70 people living in NYC is a homeless child age ~5–18 attending a public school? They presumably have at least 1 parent, and probably younger siblings, so that would bring it up to >228k or 1 in every <35 inhabitants of NYC being homeless in general. Depending on additional factors like transiency & turnover, the fraction could go much higher still. Does that make sense? No, not really. This quoted number is either surprising, or there is something missing.

    Re­defin­ing “home­less”. For­tu­nate­ly, the sus­pi­cious­ly-pre­cise num­ber and at­tri­bu­tion make this a good place to start for a search. Search­ing for the num­ber and the name of the ac­tivist group in­stantly turns up the source press re­lease, and the rea­sons for the bizarrely high num­ber are re­vealed: the sta­tis­tic ac­tu­ally re­de­fines ‘home­less­ness’ to in­clude liv­ing with rel­a­tives or friends, and counts any ex­pe­ri­ence of any length in the pre­vi­ous year as ren­der­ing that stu­dent ‘home­less’ at the mo­ment.

    The data, which come from the New York State Ed­u­ca­tion De­part­ment, show that in the 2018-2019 school year, New York City dis­trict and char­ter schools iden­ti­fied 114,085, or one in ten, stu­dents as home­less. More than 34,000 stu­dents were liv­ing in New York City’s shel­ters, and more than twice that num­ber (73,750) were liv­ing ‘dou­bled-up’ in tem­po­rary hous­ing sit­u­a­tions with rel­a­tives, friends, or oth­ers…“This prob­lem is im­mense. The num­ber of New York City stu­dents who ex­pe­ri­enced home­less­ness last year—85% of whom are Black or His­pan­ic—­could fill the Bar­clays Cen­ter six times,” said Kim Sweet, AFC’s Ex­ec­u­tive Di­rec­tor. “The City won’t be able to break the cy­cle of home­less­ness un­til we ad­dress the dis­mal ed­u­ca­tional out­comes for stu­dents who are home­less.”

    The WSJ’s ar­ti­cle (but not head­line) con­firms that ‘ex­pe­ri­enced’ does in­deed mean ‘at any time in the year for any length of time’, rather than ‘at the mo­ment’:

    City dis­trict and char­ter schools had 114,085 stu­dents with­out their own homes at some point last year, top­ping 100,000 for the fourth year in a row, ac­cord­ing to state data re­leased in a re­port Mon­day from Ad­vo­cates for Chil­dren of New York, a non­profit seek­ing bet­ter ser­vices for the dis­ad­van­taged. Most chil­dren were black or His­pan­ic, and liv­ing “dou­bled up” with friends, rel­a­tives or oth­ers. But more than 34,000 slept in city shel­ters at some point, a num­ber larger than the en­tire en­roll­ment of many dis­tricts, such as Buffalo, Rochester or Yonkers.

    Less than meet the eye. So the ac­tual num­ber of ‘home­less­ness’ (in the sense that every­one read­ing those me­dia ar­ti­cles un­der­stands it) is less than a third the quote, 34k, and that 34k num­ber is likely it­self a loose es­ti­mate of how many stu­dents would be home­less at the time of a coro­n­avirus clo­sure. This num­ber is far more plau­si­ble and in­tu­itive, and while one might won­der about what the un­der­ly­ing NYS Ed­u­ca­tion De­part­ment num­bers would re­veal if fac­t-checked fur­ther, that’s prob­a­bly un­nec­es­sary for show­ing how il­l-founded the an­ti-clo­sure ar­gu­ment is, since even by the ac­tivists’ own de­scrip­tion, the rel­e­vant num­ber is far smaller than 114k.

  • Ci­ta­tion URL With Typo: , dis­cusses the lim­its to the in­tel­li­gence of in­creas­ingly large pri­mate brains due to con­sid­er­a­tions like in­creas­ing la­tency and over­heat­ing. One ci­ta­tion at­tempt­ing to ex­trap­o­late up­per bounds is “Bi­o­log­i­cal lim­its to in­for­ma­tion pro­cess­ing in the hu­man brain”, Cochrane et al 1995.

    The source in­for­ma­tion is merely a bro­ken URL: http://www.cochrane.org.uk/opinion/archive/articles.phd which stands out for look­ing dou­bly-wrong: “.phd” is al­most cer­tainly a typo for “.php” (prob­a­bly mus­cle mem­ory on the part of Hof­man from “PhD”), but it also gives a hint that the en­tire URL is wrong: why would an ar­ti­cle or es­say be named any­thing like archive/articles.php? That sounds like an in­dex page list­ing all the avail­able ar­ti­cles.

    After trying and failing to find Cochrane’s paper in the usual places, I returned to the hint. The Internet Archive doesn’t have that page under either possible URL, but the directory strongly hints that all of the papers would exist at URLs like archive/brain.php or archive/information-processing.php, and we can look up all of the URLs the IA has under that directory—how many could there be? A lot, but only one has the keyword “brain” in it, providing us with a copy of the paper.

    If that had­n’t worked, there was at least one other ver­sion hid­ing in the IA. When I googled the quoted ti­tle “Bi­o­log­i­cal lim­its to in­for­ma­tion pro­cess­ing in the hu­man brain”, the hits all ap­peared to be use­less ci­ta­tions re­peat­ing the orig­i­nal Hof­man ci­ta­tion—but for a cru­cial differ­ence, as they cite a differ­ent URL (note the shift to an ‘archive.­cochrane.org’ sub­do­main rather than the sub­di­rec­tory cochrane.org.uk/opinion/archive/, and change of ex­ten­sion from .html to .php):

    • hit 5:

      Bi­o­log­i­cal Lim­its to In­for­ma­tion Pro­cess­ing in the Hu­man Brain. Re­trieved from: http://archive.cochrane.org.uk/opinion/archive/articles/brain9a.php

    • hit 7:

      Bi­o­log­i­cal Lim­its to In­for­ma­tion Pro­cess­ing in the Hu­man Brain. Avail­able on­line at: http://archive.cochrane.org.uk/opinion/archive/articles/brain9a.php; Da Costa …

    Aside from con­firm­ing that it was in­deed a ‘.php’ ex­ten­sion, that URL gives you a sec­ond copy of the pa­per in the IA. Un­for­tu­nate­ly, the im­age links are bro­ken in both ver­sions, and the im­age sub­di­rec­to­ries also seem to be empty in both IA ver­sions, though there’s no weird JS im­age load­ing bad­ness, so I’d guess that the im­age links were al­ways bro­ken, at least by 2004. There’s no in­di­ca­tion it was ever pub­lished or mir­rored any­where else, so there’s not much you can do about it other than to con­tact Pe­ter Cochrane (who is still alive and ac­tively pub­lish­ing al­though he leaves this par­tic­u­lar ar­ti­cle off his pub­li­ca­tion list).
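
    One way to do that kind of IA directory listing is the Wayback Machine’s CDX API; a hedged sketch using its documented parameters (what it returns will depend on what the IA has captured):

    curl -s 'https://web.archive.org/cdx/search/cdx?url=cochrane.org.uk/opinion/archive/&matchType=prefix&fl=original,timestamp&collapse=urlkey' \
        | grep -i 'brain'
    # one line per unique captured URL under that directory; the grep narrows
    # it down to candidates containing the keyword "brain"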

See Also

Appendix

Searching the Google Reader archives

A tutorial on how to do manual searches of the 2013 Google Reader archives hosted on the Internet Archive. Google Reader provided fulltext mirrors of many websites which are long gone and not otherwise available even in the IA; however, the Archive Team archives are extremely user-unfriendly and challenging to use even for programmers. I explain how to find & extract specific websites.

Unusual archive: Google Reader. A little-known way to ‘undelete’ a blog or website is to use Google Reader (GR). GR crawled regularly almost all blogs’ RSS feeds; RSS feeds often contain the fulltext of articles. If a blog author wrote an article, the fulltext was included in the RSS feed and GR downloaded it; if the author then changed their mind and edited or deleted it, GR would download the new version, but it would continue to show the old version as well (you would see both versions, chronologically). If the author blogged regularly and so GR had learned to check regularly, it could hypothetically grab several different edited versions, not just ones separated by weeks or months. That assumes GR did not, as it sometimes did for inscrutable reasons, stop displaying the historical archives and show only the last 90 days or so to readers; I was never able to figure out why this happened, or whether it really happened at all rather than being some sort of UI problem. Regardless, if all went well, this let you undelete an article, albeit perhaps with messed-up formatting or something. Sadly, GR was closed back in 2013 and you cannot simply log in and look for blogs.

Archive Team mir­rored Google Read­er. How­ev­er, be­fore it was closed, Archive Team launched a ma­jor effort to down­load as much of GR as pos­si­ble. So in that dump, there may be archives of all of a ran­dom blog’s posts. Specifi­cal­ly: if a GR user sub­scribed to it; if Archive Team knew about it; if they re­quested it in time be­fore clo­sure; and if GR did keep full archives stretch­ing back to the first post­ing.

AT mirror is raw binary data. Downside: the Archive Team dump is not in an easily browsed format, and merely figuring out what it might contain is difficult. In fact, it’s so difficult that before researching Craig Wright in November–December 2015, I had never had an urgent enough reason to figure out how to get anything out of it, and I’m not sure I’ve ever seen anyone else actually use it; Archive Team takes the attitude that it’s better to preserve the data somehow and let posterity worry about using it. (There is a site which claimed to be a frontend to the dump, but when I tried to use it, it was broken, and it still is as of December 2018.)

Extracting

Find the right archive. The 9TB of data is stored in ~69 opaque compressed WARC archives. 9TB is a bit much to download and uncompress just to look for one or two files, so to find out which WARC you need, you have to download the ~69 CDX indexes which record the contents of their respective WARCs, and search them for the URLs you are interested in. (Decompressed, they are plain text, so you can simply grep them.)
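
A hedged sketch of fetching just those CDX indexes with the ia command-line client (from the internetarchive Python package); the collection name archiveteam_greader and the *.cdx* filename pattern are assumptions to verify against the item pages first:

# list the Google Reader grab items, then download only their (small) CDX
# indexes rather than the multi-gigabyte WARCs themselves:
ia search 'collection:archiveteam_greader' --itemlist > items.txt
while read -r item; do
    ia download "$item" --glob='*.cdx*'
done < items.txt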

Locations

In this example, we will look at the main blog of Craig Wright, gse-compliance.blogspot.com. (Another blog, security-doctor.blogspot.com, appears to have been too obscure to be crawled by GR.) To locate the WARC with the Wright RSS feeds, download the master index. To search:

for file in *.gz; do echo "$file"; zcat "$file" | fgrep -e 'gse-compliance' -e 'security-doctor'; done
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml?client=\
# archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001238 https://www.google.com/reader/\
# api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n&n=1000&hl=en&\
# likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM - - 1316181\
# 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# alt=rss?client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001249 https://www.google.\
# com/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault\
# %3Falt%3Drss?r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM\
# - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001244 https://www.google.com/\
# reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault?\
# r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - XXISZYMRUZWD3L6WEEEQQ7KY7KA5BD2X - - \
# 1404934 19809546472 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/rss.xml?client=archiveteam\
# &comments=true&hl=en&likes=true&n=1000&r=n 20130602001253 https://www.google.com/reader/api/0/stream/contents\
# /feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Frss.xml?r=n&n=1000&hl=en&likes=true&comments=true\
# &client=ArchiveTeam text/html 404 AJSJWHNSRBYIASRYY544HJMKLDBBKRMO - - 9467 19812279226 \
# archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Understanding the output: the format is defined by the first line of each CDX file, whose field codes can then be looked up:

  • the for­mat string is: CDX N b a m s k r M S V g; which means here:

    • N: mas­saged url
    • b: date
    • a: orig­i­nal url
    • m: mime type of orig­i­nal doc­u­ment
    • s: re­sponse code
    • k: new style check­sum
    • r: redi­rect
    • M: meta tags (AIF)
    • S: compressed record size (length in bytes)
    • V: com­pressed arc file off­set
    • g: file name

Ex­am­ple:

(com,google)/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml\
?client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n 20130602001238 https://www.google.com\
/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n\
&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM\
- - 1316181 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Con­verts to:

  • massaged URL: (com,google)/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml?client=archiveteam&comments=true&hl=en&likes=true&n=1000&r=n
  • date: 20130602001238
  • original URL: https://www.google.com/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n&n=1000&hl=en&likes=true&comments=true&client=ArchiveTeam
  • MIME type: unk [un­known?]
  • response code: - [none]
  • new-style checksum: 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM
  • redirect: - [none]
  • meta tags: - [none]
  • S (compressed record size): 1316181 bytes
  • com­pressed arc file off­set: 19808229791 (19,808,229,791; so some­where around 19.8GB into the mega-WARC)
  • file­name: archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
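
Since the compressed size (S), offset (V), and filename (g) are all that extraction requires, a short awk filter over the same CDX indexes pulls just those three columns for every matching record (relying on the field order given above):

# print "size offset warc-filename" for each record mentioning either blog:
for file in *.gz; do zcat "$file" | awk '/gse-compliance|security-doctor/ { print $9, $10, $11 }'; done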

Knowing the offset theoretically makes it possible to extract a record directly from the IA copy without having to download and decompress the entire thing (a sketch of such a range request follows the list). The S values & offsets for gse-compliance are:

  1. 1316181/19808229791
  2. 1326824/19810951910
  3. 1404934/19809546472
  4. 9467/19812279226
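
For instance, an untested sketch of pulling just the first record with an HTTP Range request, assuming archive.org honors byte ranges on direct downloads (each WARC record is its own gzip member, so the slice should decompress on its own):

# first record: offset 19808229791, compressed size 1316181 bytes
offset=19808229791; size=1316181
curl -s --range "${offset}-$((offset + size - 1))" \
    'https://archive.org/download/archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz' \
    | zcat > atom-feed.warc   # output filename is arbitrary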

So we found hits point­ing to­wards archiveteam_greader_20130604001315 & archiveteam_greader_20130614211457 which we then need to down­load (25GB each):

wget 'https://archive.org/download/archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz'
wget 'https://archive.org/download/archiveteam_greader_20130614211457/greader_20130614211457.megawarc.warc.gz'

Once down­load­ed, how do we get the feeds? There are a num­ber of hard-to-use and in­com­plete tools for work­ing with gi­ant WARCs; I con­tacted the orig­i­nal GR archiver, ivan, but that was­n’t too help­ful.

warcat

I tried us­ing warcat to un­pack the en­tire WARC archive into in­di­vid­ual files, and then delete every­thing which was not rel­e­vant:

python3 -m warcat extract /home/gwern/googlereader/...
find ./www.google.com/ -type f -not \( -name "*gse-compliance*" -or -name "*security-doctor*" \) -delete
find ./www.google.com/

But this was too slow, and crashed part­way through be­fore fin­ish­ing.


A more re­cent al­ter­na­tive li­brary, which I haven’t tried, is warcio, which may be able to find the byte ranges & ex­tract them.

dd

If we are feeling brave, we can use the offset and the presumed length (the S field) to extract the byte ranges directly with dd (bs=1 makes skip & count byte-granular):

dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
dd skip=19809546472 count=1404934 if=greader_20130604001315.megawarc.warc.gz of=3.gz bs=1
# 1404934+0 records in
# 1404934+0 records out
# 1404934 bytes (1.4 MB) copied, 15.4225 s, 91.1 kB/s
dd skip=19812279226 count=9467 if=greader_20130604001315.megawarc.warc.gz of=4.gz bs=1
# 9467+0 records in
# 9467+0 records out
# 9467 bytes (9.5 kB) copied, 0.125689 s, 75.3 kB/s
dd skip=19808229791 count=1316181 if=greader_20130604001315.megawarc.warc.gz of=1.gz bs=1
# 1316181+0 records in
# 1316181+0 records out
# 1316181 bytes (1.3 MB) copied, 14.6209 s, 90.0 kB/s
gunzip *.gz

Results

Success: raw HTML. My dd extraction was successful, and the resulting HTML/RSS could then be browsed with a command like cat *.warc | fold --spaces --width=200 | less. They can probably also be converted to a local form and browsed, although they won’t include any of the site assets like images or CSS/JS, since the original RSS feed assumes you can load any references from the original website and didn’t do any kind of inlining or mirroring (not, after all, having been intended for archival purposes in the first place).
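
One way to get such a local form, assuming each extracted range is a single WARC response record (WARC headers, a blank line, HTTP headers, another blank line, then the payload), is to strip both header blocks and keep the payload, which is the raw feed itself (reusing the *.warc naming from the cat command above):

# drop everything up to and including the second blank line of each record:
for f in *.warc; do
    awk 'body { print; next } /^\r?$/ && ++blanks == 2 { body = 1 }' "$f" > "${f%.warc}.feed"
done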


  1. For ex­am­ple, the info: op­er­a­tor is en­tirely use­less. The link: op­er­a­tor, in al­most a decade of me try­ing it once in a great while, has never re­turned re­motely as many links to my web­site as Google Web­mas­ter Tools re­turns for in­bound links, and seems to have been dis­abled en­tirely at some point.↩︎

  2. WP is increasingly out of date & unrepresentative due to its increasingly narrow policies about sourcing & preprints, so it’s not a good place to look for references. It is a good place to look for key terminology, though.↩︎

  3. Most search en­gines will treat any space or sep­a­ra­tion as an im­plicit AND, but I find it help­ful to be ex­plicit about it to make sure I’m search­ing what I think I’m search­ing.↩︎

  4. This prob­a­bly ex­plains part of why no one cites that pa­per, and those who cite it clearly have not ac­tu­ally read it, even though it in­vented racial ad­mix­ture analy­sis, which, since rein­vented by oth­ers, has be­come a ma­jor method in med­ical ge­net­ics.↩︎

  5. University ILL privileges are one of the most underrated fringe benefits of being a student, if you do any kind of research or hobbyist reading—you can request almost anything you can find in a library catalog, whether it’s an ultra-obscure book or a master’s thesis from 1950! Why wouldn’t you make regular use of it‽ Of things I miss from being a student, ILL is near the top.↩︎

  6. The com­plaint and in­dict­ment are not nec­es­sar­ily the same thing. An in­dict­ment fre­quently will leave out many de­tails and con­fine it­self to list­ing what the de­fen­dant is ac­cused of. Com­plaints tend to be much richer in de­tail. How­ev­er, some­times there will be only one and not the oth­er, per­haps be­cause the more de­tailed com­plaint has been sealed (pos­si­bly pre­cisely be­cause it is more de­tailed).↩︎

  7. Trial testimony can run to hundreds of pages and blow through your remaining PACER budget, so one must be careful. In particular, testimony operates under an interesting pricing system related to how court reporters—who are not necessarily paid employees but may be contractors or freelancers—are compensated, intended to ensure that transcription costs are covered: the transcript initially may cost hundreds of dollars, intended to extract full value from those who need the trial transcript immediately, such as lawyers or journalists, but then a while later, PACER drops the price to something more reasonable. That is, the first “original” fee costs a fortune, but then “copy” fees are cheaper. So for the US federal court system, the “original”, when ordered within hours of the testimony, will cost <$7.25/page, but then the second person ordering the same transcript pays only <$1.20/page & everyone subsequently <$0.90/page, and as further time passes, that drops to <$0.60 (and I believe after a few months, PACER will then charge only the standard $0.10). So, when it comes to trial transcripts on PACER, patience pays off.↩︎

  8. I’ve heard that Lex­is­Nexis ter­mi­nals are some­times avail­able for pub­lic use in places like fed­eral li­braries or cour­t­hous­es, but I have never tried this my­self.↩︎

  9. Curiously, in historical textual criticism of copied manuscripts, it’s the opposite: the shorter reading is generally preferred as the more original one. But with memories or paraphrases, longer = truer, because those tend to elide details and mutate into catchier versions when the transmitter is not ostensibly exactly copying a text.↩︎

  10. I ad­vise prepend­ing, like https://sci-hub.st/https://journal.com in­stead of ap­pend­ing, like https://journal.com.sci-hub.st/ be­cause the for­mer is slightly eas­ier to type but more im­por­tant­ly, Sci-Hub does not have SSL cer­tifi­cates set up prop­erly (I as­sume they’re miss­ing a wild­card) and so ap­pend­ing the Sci-Hub do­main will fail to work in many web browsers due to HTTPS er­rors! How­ev­er, if prepend­ed, it’ll al­ways work cor­rect­ly.↩︎

  11. To fur­ther il­lus­trate this IA fea­ture: if one was look­ing for Alex St. John’s “Judg­ment Day Con­tin­ued…”, a 2013 ac­count of or­ga­niz­ing the wild 1996 Doom tour­na­ment thrown by Mi­crosoft, but one did­n’t have the URL handy, one could search the en­tire do­main by go­ing to https://web.archive.org/web/*/http://www.alexstjohn.com/* and us­ing the fil­ter with “judg­ment”, or if one at least re­mem­bered it was in 2013, one could nar­row it down fur­ther to https://web.archive.org/web/*/http://www.alexstjohn.com/WP/2013/* and then fil­ter or search by hand.↩︎

  12. If any Blogspot em­ployee is read­ing this, for god’s sake stop this in­san­ity!↩︎

  13. Up­load­ing is not as hard as it may seem. There is a web in­ter­face (user/password: “gen­e­sis”/“up­load”). Up­load­ing large files can fail, so I usu­ally use the FTP server: curl -T "$FILE" ftp://anonymous@ftp.libgen.is/upload/. ↩︎

  14. Al­though flatbed scan­ning is some­times de­struc­tive too—I’ve cracked the spine of books while press­ing them flat into a flatbed scan­ner.↩︎

  15. My workaround is to ex­port from gscan2pdf as DjVu, which avoids the bug, then con­vert the DjVu files with ddjvu -format=pdf; this strips any OCR, so I add OCR with ocrmypdf and meta­data with exiftool.↩︎
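
      Concretely, the conversion step looks something like this (filenames & metadata values are placeholders):

      ddjvu -format=pdf scan.djvu scan.pdf                               # DjVu → PDF; the OCR layer is lost here
      ocrmypdf scan.pdf scan-ocr.pdf                                     # add a fresh OCR text layer
      exiftool -Title='Book Title' -Author='Author Name' scan-ocr.pdf    # restore the metadata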