Archiving URLs

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.
Haskell, archiving, Bitcoin, meta, shell, R, tutorial
2011-03-10–2019-01-05 finished certainty: certain importance: 8


Links on the Internet last forever or a year, whichever comes first. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, and the losses compound.

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.

As an additional flourish, my local archives are cryptographically timestamped in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing the size of large web archives such as crawls (particularly useful for repeated crawls such as my darknet-market mirrors).

Given my interest in long-term content and extensive linking, link rot is an issue of deep concern to me. I need backups not just for my files1, but for the web pages I read and use—they’re all part of my work. It’s not much good to have an extensive essay on some topic where half the links are dead and the reader can neither verify my claims nor get context for my claims.

Detection

“With every new spring
the blossoms speak not a word
yet expound the Law—
knowing what is at its heart
by the scattering storm winds.”

Shōtetsu7

The first remedy is to learn about broken links as soon as they happen, which allows one to react quickly and scrape archives or search engine caches (‘lazy preservation’). I currently use linkchecker to spider gwern.net looking for broken links. linkchecker is run in a cron job like so:

@monthly linkchecker --check-extern --timeout=35 --no-warnings --file-output=html \
                      --ignore-url=^mailto --ignore-url=^irc --ignore-url=http://.*\.onion \
                      --ignore-url=paypal.com --ignore-url=web.archive.org \
                     https://www.gwern.net

Just this command would turn up many false positives. For example, there would be several hundred warnings on Wikipedia links because I link to redirects; and linkchecker respects robots.txt files which forbid it to check liveness, but emits a warning about this. These can be suppressed by editing ~/.linkchecker/linkcheckerrc to say ignorewarnings=http-moved-permanent,http-robots-denied (the available warning classes are listed in linkchecker -h).
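
For concreteness, a minimal sketch of what that ~/.linkchecker/linkcheckerrc might contain (the [filtering] section name is my assumption from the versions I have used; consult your version’s example config if it differs):

# ~/.linkchecker/linkcheckerrc: suppress warning classes which are known false positives
[filtering]
ignorewarnings=http-moved-permanent,http-robots-denied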

The quicker you know about a dead link, the sooner you can look for replacements or its new home.

Prevention

Remote caching

“Anything you post on the internet will be there as long as it’s embarrassing and gone as soon as it would be useful.”

taejo

We can ask a third party to keep a cache for us. There are several possibilities:

  1. the Internet Archive
  2. Perma.cc (highly limited)
  3. Linterweb’s WikiWix8.
  4. Peeep.us (defunct as of 2018)
  5. Archive.is
  6. Pinboard (with the $25/yr archiving option9)
  7. Hiyo.jp & Megalodon.jp (may be difficult to use)

There are other options, but they are not generally available, like Google10 or various commercial/government archives11.

(An example would be bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-former-chief/ being archived at webcitation.org/5ur7ifr12.)

These archives are also good for archiving your own website:

  1. you may be keeping backups of it, but your own website/server backups can be lost (I can speak from personal experience here), so it’s good to have external copies
  2. Another benefit is the reduction in ‘bus-factor’: if you were hit by a bus tomorrow, who would get your archives and be able to maintain the websites and understand the backups etc? While if archived in the IA, people already know how to get copies and there are tools to download entire domains.
  3. A focus on backing up only one’s website can blind one to the need for archiving the external links as well. Many pages are meaningless or less valuable with broken links. A linkchecker script/daemon can also archive all the external links.

So there are several benefits to doing web archiving beyond simple server backups.

My first program in this vein of thought was a bot which fired off WebCite, Internet Archive/Alexa, & Archive.is requests (a Wikipedia archiving bot), quickly followed up by an RSS version. (Or you could install the Alexa Toolbar to get automatic submission to the Internet Archive, if you have ceased to care about privacy.)
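
As a point of comparison, a one-off submission needs no bot at all; a minimal sketch using the Internet Archive’s public ‘save’ endpoint, which triggers an immediate crawl of the given URL:

#!/bin/sh
# Usage: ia-save URL -- ask the Wayback Machine to snapshot a single URL right now.
URL="$1"
curl --silent --location --output /dev/null "https://web.archive.org/save/$URL"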

The bot’s core code was quickly adapted into a gitit wiki plugin which hooked into the save-page functionality and tried to archive every link in the newly-modified page: Interwiki.hs.

Finally, I wrote archiver, a daemon which watches12/reads a text file. Source is available via git clone https://github.com/gwern/archiver-bot.git. (A similar tool is Archiveror.)

The library half of archiver is a simple wrapper around the appropriate HTTP requests; the executable half reads a specified text file and loops as it (slowly) fires off requests and deletes the appropriate URL.

That is, archiver is a daemon which will process a specified text file, each line of which is a URL, and will one by one request that the URLs be archived or spidered.

Usage of archiver might look like archiver ~/.urls.txt gwern@gwern.net. In the past, archiver would sometimes crash for unknown reasons, so I usually wrap it in a while loop like so: while true; do archiver ~/.urls.txt gwern@gwern.net; done. If I wanted to put it in a detached session: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern@gwern.net; done'. Finally, rather than start it manually, I use a cron job to start it at boot, for a final invocation of

@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern2@gwern.net \
        "cd ~/www && nice -n 20 ionice -c3 wget --unlink --limit-rate=20k --page-requisites --timestamping \
        -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm \
        --user-agent='Firefox/4.9'" 500; done'

Local caching

Remote archiving, while convenient, has a major flaw: the archive services cannot keep up with the growth of the Internet and are woefully incomplete. I experience this regularly, where a link on Gwern.net goes dead and I cannot find it in the Internet Archive or WebCite, and it is a general phenomenon: one survey found <35% of common Web pages are ever copied into an archive service, and typically only one copy exists.

Caching Proxy

The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy, which will save every page you visit through it as WARC files. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)

One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.

Batch job downloads

For a while, I used a shell script named, imaginatively enough, local-archiver:

#!/bin/bash
set -euo pipefail

cp `find ~/.mozilla/ -name "places.sqlite"` ~/
sqlite3 places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                       WHERE moz_places.id = moz_historyvisits.place_id \
                             and visit_date > strftime('%s','now','-1.5 month')*1000000 ORDER by \
                       visit_date;" | filter-urls >> ~/.tmp
rm ~/places.sqlite
split -l500 ~/.tmp ~/.tmp-urls
rm ~/.tmp

cd ~/www/
for file in ~/.tmp-urls*;
 do (wget --unlink --continue --page-requisites --timestamping --input-file $file && rm $file &);
done

find ~/www -size +4M -delete

The code is not the prettiest, but it’s fairly straightforward:

  1. the script grabs my Firefox browsing history by extracting it from the history SQL database file13, and feeds the URLs into wget.

    wget is not the best tool for archiving as it will not run JavaScript or Flash or download videos etc. It will download included JS files, but the JS will be obsolete when run in the future and any dynamic content will be long gone. To do better would require a headless browser which saves to MHT/MHTML, but PhantomJS refuses to support that format and I’m not aware of an existing package to do this. In practice, static content is what is most important to archive, most JS is of highly questionable value in the first place, and any important YouTube videos can be archived manually with youtube-dl, so wget’s limitations haven’t been so bad.

  2. The script splits the long list of URLs into a bunch of files and runs that many wgets in parallel because wget apparently has no way of simultaneously downloading from multiple domains. There’s also the chance of wget hanging indefinitely, so parallel downloads keep making progress. (See the sketch just after this list for an xargs-based alternative.)

  3. The filter-urls command is another shell script, which removes URLs I don’t want archived. This script is a hack which looks like this:

    #!/bin/bash
    set -euo pipefail
    cat /dev/stdin | sed -e "s/#.*//" | sed -e "s/&sid=.*$//" | sed -e "s/\/$//" | grep -v -e 4chan -e reddit ...
  4. delete any particularly large (>4MB) files which might be media files like videos or audio (podcasts are particular offenders)
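
The xargs-based alternative mentioned in step 2 might look like this sketch (file names illustrative; it trades wget’s per-domain politeness for simplicity by running one wget per URL, up to 8 at a time):

filter-urls < ~/.tmp | sort --unique | \
    xargs --max-procs=8 --max-args=1 \
          wget --unlink --continue --page-requisites --timestamping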

A local copy is not the best resource—what if a link goes dead in a way your tool cannot detect so you don’t know to put up your copy somewhere? But it solves the problem decisively.

The downside of this script’s batch approach soon became apparent to me:

  1. not automatic: you have to remember to invoke it and it only provides a single local archive, or if you invoke it regularly as a cron job, you may create lots of duplicates.
  2. unreliable: wget may hang, URLs may be archived too late, it may not be invoked frequently enough, >4MB non-video/audio files are increasingly common…
  3. I wanted copies in the Internet Archive & elsewhere as well, to let other people benefit and provide redundancy to my local archive.

It was to fix these problems that I began working on archiver—which would run constantly archiving URLs in the background, archive them into the IA as well, and be smarter about media file downloads. It has been much more satisfactory.

Daemon

archiver has an extra feature where any third argument is treated as an arbitrary sh command to run after each URL is archived, to which is appended said URL. You might use this feature if you wanted to load each URL into Firefox, or append them to a log file, or simply download or archive the URL in some other way.

For example, instead of a big local-archiver run, I have archiver run wget on each individual URL: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern@gwern.net "cd ~/www && wget --unlink --continue --page-requisites --timestamping -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm --user-agent='Firefox/3.6' 120"; done'. (For private URLs which require logins, wget can still grab them with some help: installing the Firefox extension Export Cookies, logging into the site in Firefox like usual, exporting one’s cookies.txt, and adding the option --load-cookies cookies.txt to give it access to the cookies.)

Alternately, you might use curl or a specialized archive downloader like the Internet Archive’s crawler Heritrix.

Cryptographic timestamping local archives

We may want cryptographic timestamping to prove that we created a file or archive at a particular date and have not since altered it. This is handled by a pair of shell scripts which implement downloading (wget-archive) and timestamping strings or files (timestamp). With these scripts, extending the archive bot is as simple as changing the shell command:

@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern2@gwern.net \
        "wget-archive" 200; done'

Now every URL we download is automatically cryptographically timestamped with ~1-day resolution for free.
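
For illustration, a wget-archive-style wrapper could be as simple as the following sketch (my actual scripts timestamp via Bitcoin, as covered in the appendix; the OpenTimestamps ots client below is merely a stand-in backend, and hashing the whole archive each run is deliberately crude):

#!/bin/sh
# Sketch: download one URL with its page requisites, then timestamp a SHA-256 manifest
# of the archive directory so any later tampering with the files is detectable.
set -eu
URL="$1"
cd ~/www/
wget --page-requisites --timestamping -e robots=off "$URL" || true
find . -type f -print0 | xargs -0 sha256sum > ~/manifest.sha256
ots stamp ~/manifest.sha256   # writes manifest.sha256.ots as the timestamp proof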

Resource consumption

The space consumed by such a backup is not that bad; only 30–50 gigabytes for a year of browsing, and less depending on how hard you prune the downloads. (More, of course, if you use linkchecker to archive entire sites and not just the pages you visit.) Storing this is quite viable in the long term; while page sizes have increased 7x between 2003 and 2011 and pages average around 400kb14, Kryder’s law has also been operating and has increased disk capacity by ~128x—in 2011, $80 will buy you at least 2 terabytes, which works out to 4 cents a gigabyte, or $1–2 for a year of downloads; that is much better than the $25 annual fee that somewhere like Pinboard charges. Of course, you need to back this up yourself. We’re relatively fortunate here—most Internet documents are ‘born digital’ and easy to migrate to new formats or inspect in the future. We can download them and worry about how to view them only when we need a particular document, and Web browser backwards-compatibility already stretches back to files written in the early 1990s. (Of course, we’re probably screwed if we discover the content we wanted was dynamically presented only in Adobe Flash or as an inaccessible ‘cloud’ service.) In contrast, if we were trying to preserve programs or software libraries instead, we would face a much more formidable task in keeping a working ladder of binary-compatible virtual machines or interpreters15. The situation with digital movie preservation hardly bears thinking on.

There are ways to cut down on the size; if you tar it all up and run xz with maximum compression options, you could probably compact it to 1/5th the size. I found that the uncompressed files could be reduced by around 10% by using fdupes to look for duplicate files and turning the duplicates into a space-saving hardlink to the original, with a command like fdupes --recurse --hardlink ~/www/. (Apparently there are a lot of bit-identical JavaScript files and images out there.)

Good filtering of URL sources can help reduce the URL archiving count by a large amount. Examining my manual backups of Firefox browsing history, over the 1153 days from 2014-02-25 to 2017-04-22, I visited 2,370,111 URLs or 2055 URLs per day; after passing through my filtering script, that leaves 171,446 URLs, which after de-duplication yields 39,523 URLs or ~34 unique URLs per day or 12,520 unique URLs per year to archive.
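
(Such counts can be reproduced with a pipeline along these lines; the history-dump file name is hypothetical:)

wc --lines < firefox-history-urls.txt                                ## total URLs visited
filter-urls < firefox-history-urls.txt | wc --lines                  ## after blacklist filtering
filter-urls < firefox-history-urls.txt | sort --unique | wc --lines  ## unique URLs to archive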

This filtering shrunk my archive by 9GB, from 65GB to 56GB, although at the cost of some archiving fidelity by removing many filetypes like CSS or JavaScript or GIF images. As of 2017-04-22, after ~6 years of archiving, between xz compression (at the cost of easy searchability), aggressive filtering, occasional manual deletion of overly bulky domains I feel are probably adequately covered in the IA, etc, my full WWW archives weigh 55GB.

URL sources

Browser history

There are a number of ways to populate the source text file. For example, I have a script firefox-urls:

#!/bin/bash
set -euo pipefail

cp --force `find ~/.mozilla/firefox/ -name "places.sqlite"|sort|head -1` ~/
sqlite3 -batch places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                       WHERE moz_places.id = moz_historyvisits.place_id and \
                       visit_date > strftime('%s','now','-1 day')*1000000 ORDER by \
                       visit_date;" | filter-urls
rm ~/places.sqlite

(filter-urls is the same script as in local-archiver. If I don’t want a domain locally, I’m not going to bother with remote backups either. In fact, because of WebCite’s rate-limiting, archiver is almost perpetually backlogged, and I especially don’t want it wasting time on worthless links.)

This is called every hour by cron:

@hourly firefox-urls >> ~/.urls.txt

This gets all visited URLs in the last time period and prints them out to the file for archiver to process. Hence, everything I browse is backed up through archiver.

Non-Firefox browsers can be supported with similar strategies; for example, Zachary Vance’s Chromium scripts likewise extract URLs from Chromium’s SQL history & bookmarks.
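
A Chromium analogue of firefox-urls might look like the following sketch (paths & schema are my assumptions: Chromium keeps history in a locked SQLite database whose urls table timestamps visits in microseconds since 1601-01-01, hence the epoch correction):

#!/bin/bash
set -euo pipefail
# Copy the database first, because a running Chromium holds a lock on it.
cp --force ~/.config/chromium/Default/History /tmp/chromium-history
sqlite3 -batch /tmp/chromium-history \
   "SELECT url FROM urls \
    WHERE (last_visit_time/1000000 - 11644473600) > strftime('%s','now','-1 day') \
    ORDER BY last_visit_time;" | filter-urls
rm /tmp/chromium-history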

Website spidering

Sometimes a particular website is of long-term interest to one even if one has not visited every page on it; one could manually visit them and rely on the previous Firefox script to dump the URLs into archiver, but this isn’t always practical or time-efficient. linkchecker inherently spiders the websites it is turned upon, so it’s not a surprise that it can build a site map or simply spit out all URLs on a domain; unfortunately, while linkchecker has the ability to output in a remarkable variety of formats, it cannot simply output a newline-delimited list of URLs, so we need to post-process the output considerably. The following is the shell one-liner I use when I want to archive an entire site (note that this is a bad command to run on a large or heavily hyperlinked site like the English Wikipedia or LessWrong!); edit the target domain as necessary:

linkchecker --check-extern -odot --complete -v --ignore-url=^mailto --no-warnings http://www.longbets.org |
    fgrep http |
    fgrep -v -e "label=" -e "->" -e '" [' -e '" ]' -e "/ " |
    sed -e "s/href=\"//" -e "s/\",//" -e "s/ //" |
    filter-urls |
    sort --unique >> ~/.urls.txt

When linkchecker does not work, one alternative is to do a wget --mirror and extract the URLs from the filenames—list all the files and prefix with a “http://” etc.
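
A sketch of that post-processing (assuming the mirror landed in a www.longbets.org/ directory; directory indexes are saved as index.html, so that suffix is stripped back off):

find www.longbets.org/ -type f \
    | sed -e 's/^/http:\/\//' -e 's/\/index\.html$/\//' \
    | filter-urls | sort --unique >> ~/.urls.txt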

Appendices

Cryptographic timestamping

Due to length, this section has been moved to a separate page.

filter-urls

A raw dump of URLs, while certainly archivable, will typically result in a very large mirror of questionable value (is it really necessary to archive Google search queries or Wikipedia articles? usually, no) and worse, given the rate-limiting necessary to store URLs in the Internet Archive or other services, may wind up delaying the archiving of the important links & risking their total loss. Disabling the remote archiving is unacceptable, so the best solution is to simply take a little time to manually blacklist various domains or URL patterns.

This blacklisting can be as simple as a command like filter-urls | grep -v en.wikipedia.org, but can be much more elaborate. The following shell script is the skeleton of my own custom blacklist, derived from manually filtering through several years of daily browsing as well as spiders of dozens of websites for various people & purposes, demonstrating a variety of possible techniques: regexps for domains & file-types & query-strings, sed-based rewrites, fixed-string matches (both blacklists and whitelists), etc:

#!/bin/bash

# USAGE: `filter-urls` accepts on standard input a list of newline-delimited URLs or filenames,
# and emits on standard output a list of newline-delimited URLs or filenames.
#
# This list may be shorter and entries altered. It tries to remove all unwanted entries, where 'unwanted'
# is a highly idiosyncratic list of regexps and fixed-string matches developed over hundreds of thousands
# of URLs/filenames output by my daily browsing, spidering of interesting sites, and requests
# from other people to spider sites for them.
#
# You are advised to test output to make sure it does not remove
# URLs or filenames you want to keep. (An easy way to test what is removed is to use the `comm` utility.)
#
# For performance, it does not sort or remove duplicates from output; both can be done by
# piping `filter-urls` to `sort --unique`.

set -euo pipefail

cat /dev/stdin \
    | sed -e "s/#.*//" -e 's/>$//' -e "s/&sid=.*$//" -e "s/\/$//" -e 's/$/\n/' -e 's/\?sort=.*$//' \
      -e 's/^[ \t]*//' -e 's/utm_source.*//' -e 's/https:\/\//http:\/\//' -e 's/\?showComment=.*//' \
    | grep "\." \
    | fgrep -v "*" \
    | egrep -v -e '\/\.rss$' -e "\.tw$" -e "//%20www\." -e "/file-not-found" -e "258..\.com/$" \
       -e "3qavdvd" -e "://avdvd" -e "\.avi" -e "\.com\.tw" -e "\.onion" -e "\?fnid\=" -e "\?replytocom=" \
       -e "^lesswrong.com/r/discussion/comments$" -e "^lesswrong.com/user/gwern$" \
       -e "^webcitation.org/query$" \
       -e "ftp.*" -e "6..6\.com" -e "6..9\.com" -e "6??6\.com" -e "7..7\.com" -e "7..8\.com" -e "7..\.com" \
       -e "78..\.com" -e "7??7\.com" -e "8..8\.com" -e "8??8\.com" -e "9..9\.com" -e "9??9\.com" \
       -e gold.*sell -e vip.*club \
    | fgrep -v -e "#!" -e ".bin" -e ".mp4" -e ".swf" -e "/mediawiki/index.php?title=" -e "/search?q=cache:" \
      -e "/wiki/Special:Block/" -e "/wiki/Special:WikiActivity" -e "Special%3ASearch" \
      -e "Special:Search" -e "__setdomsess?dest="
      # ...

# prevent URLs from piling up at the end of the file
echo ""

filter-urls can be used on one’s local archive to save space by deleting files which may be downloaded by wget as dependencies. For example:

find ~/www | sort --unique >> full.txt && \
    find ~/www | filter-urls | sort --unique >> trimmed.txt
comm -23 full.txt trimmed.txt | xargs -d "\n" rm
rm full.txt trimmed.txt

sort --key compression trick

Programming folklore notes that one way to get better lossless compression efficiency is by the precompression trick of rearranging files inside the archive to group ‘similar’ files together and expose redundancy to the compressor, in accordance with information-theoretical principles. A particularly easy and broadly-applicable way of doing this, which does not require using any unusual formats or tools and is fully compatible with the default archive methods, is to sort the files by filename and especially file extension. I show how to do this with the standard Unix command-line sort tool, using the so-called “sort --key trick”, and give examples of the large space-savings possible from my archiving work for personal website mirrors and for making darknet-market mirrors, where the redundancy at the file level is particularly extreme and the sort --key trick shines compared to the naive approach.

One way to look at data compression is as a form of intelligence: a compression tool like xz is being asked to predict what the next bit of the original file is, and the better its predictions, the less data needs to be stored to correct its mistaken predictions (“I know how to spell ‘banana’, I just don’t know when to stop”). The smarter the program is, the better its compression will be; but on the flip side, you can also improve the compression ratio by giving it a little help and organizing the data into an equivalent form the program can understand better—for example, by preserving spatial locality: keeping similar data together, and not dispersing it all over. (This also explains why multimedia files barely compress: the lossy encodings are the product of decades of work specialized to the particular domain of audio or images or video, and a general-purpose lossless compressor would have to be very intelligent indeed to beat them at their own game.) Files on one topic should be compressed together.

Locality

Spatial locality can be subtle. For example, natural language text, though not structured line-by-line as visibly as a dictionary, is still far from random and has many local correlations a compression tool might be able to pick up. If this is true, we would expect that with a sufficiently smart compressor, a text file would compress better in its natural form than if it were randomly shuffled. Is this the case, or are compression utilities too stupid to see any difference between random lines and English prose? Taking 14M of text from Gwern.net, we can see for ourselves:

## uncompressed
cat *.page */*.page */*/*.page |                                                 wc --bytes
# 13588814

## compressed, files in lexicographic order
cat *.page */*.page */*/*.page |                      xz -9 --extreme --stdout | wc --bytes
# 3626732
## compressed, all lines sorted alphabetically
cat *.page */*.page */*/*.page | sort               | xz -9 --extreme --stdout | wc --bytes
# 3743756
## compressed, all lines randomly shuffled except for non-unique lines sorted together
cat *.page */*.page */*/*.page | sort --random-sort | xz -9 --extreme --stdout | wc --bytes
# 3831740
## compressed, all lines randomly shuffled
cat *.page */*.page */*/*.page | shuf               | xz -9 --extreme --stdout | wc --bytes
# 3862632

The unmolested text compresses to 3.626M, but the same text randomly shuffled is 3.862M! I also included an intermediate form of randomization: despite the name, the --random-sort option to sort is not actually a random shuffle but a random hash (this is not documented in the man page, though it is in the info page), and so any repeated non-unique lines will be sorted together (allowing for some easy duplicate deletion), so we would expect the only-partial randomization of --random-sort to perform a bit better than the true random shuffle of shuf. And it does.
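
(The grouping behavior is easy to check directly; --random-sort keeps the duplicate lines adjacent while shuf scatters them:)

printf 'foo\nbar\nfoo\nbar\nbaz\n' | sort --random-sort
printf 'foo\nbar\nfoo\nbar\nbaz\n' | shuf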

Web archives

Spatial locality also applies to our web archiving. If you are mirroring websites, or otherwise compiling a lot of directories with redundant data on a file-by-file level, there’s a cute trick to massively improve your compression ratios: don’t sort the usual lexicographic way, but sort by a subdirectory. (I learned about this trick a few years ago while messing around with archiving my home directory using find and tar.) This is one of the issues with archiving gigabytes of crawls from thousands of domains: URLs have a historical oddity where they are not consistently hierarchical. URLs were originally modeled after hierarchical Unix filesystems; this page, for example, lives at the name /home/gwern/wiki/Archiving-URLs.page, which follows a logical left-to-right pattern of increasing narrowness. If one lists my entire filesystem in lexicographic order, all the files in /home/gwern/ will be consecutive, and the files in wiki/ will be consecutive, and so on. Unfortunately, the top level of URLs breaks this scheme—one does not visit https://com/google/mail/?shva=1#search/l%3aunread, one visits https://mail.google.com/mail/?shva=1#search/l%3aunread; one does not visit http://net/www/gwern/Archiving-URLs but https://www.gwern.net/Archiving-URLs. So if I download a.google.com and then later z.google.com, a lexicographic list of downloaded files will separate the files as much as possible (even though they are semantically probably similar). A quick example from my current WWW archive:

ls
# ...
# typemoon.wikia.com/
# tytempletonart.wordpress.com/
# ubc-emotionlab.ca/
# ubook.info/
# ucblibrary3.berkeley.edu/
# uhaweb.hartford.edu/
# ujsportal.pacourts.us/
# ukpmc.ac.uk/
# uk.reuters.com/
# ultan.org.uk/
# ...

The directories are indeed sorted, but aside from the initial 2 letters or so, look nothing like each other: a Wikia subdomain rubs shoulders with a WordPress blog, a .ca domain is between a .com, a .info, and a .edu (with a .us and .uk thrown in for variety), and so on. Is there any way to sort these directories with a bit of parsing thrown in? For example, maybe we could reverse each line? Some web browsers store URLs reversed right-to-left to enable more efficient database operations, as do Google’s systems16 (to assist their relatively weak compression utility Snappy). Turns out GNU sort already supports something similar, the --key & --field-separator options; the man page is not very helpful but the info page tells us:

'-t SEPARATOR'
'--field-separator=SEPARATOR'
     Use character SEPARATOR as the field separator when finding the
     sort keys in each line.  By default, fields are separated by the
     empty string between a non-blank character and a blank character.
     By default a blank is a space or a tab, but the 'LC_CTYPE' locale
     can change this.

     That is, given the input line ' foo bar', 'sort' breaks it into
     fields ' foo' and ' bar'.  The field separator is not considered
     to be part of either the field preceding or the field following,
     so with 'sort -t " "' the same input line has three fields: an
     empty field, 'foo', and 'bar'.  However, fields that extend to the
     end of the line, as '-k 2', or fields consisting of a range, as
     '-k 2,3', retain the field separators present between the
     endpoints of the range.


'-k POS1[,POS2]'
'--key=POS1[,POS2]'
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),
     _inclusive_.

     Each POS has the form 'F[.C][OPTS]', where F is the number of the
     field to use, and C is the number of the first character from the
     beginning of the field.  Fields and character positions are
     numbered starting with 1; a character position of zero in POS2
     indicates the field's last character.  If '.C' is omitted from
     POS1, it defaults to 1 (the beginning of the field); if omitted
     from POS2, it defaults to 0 (the end of the field).  OPTS are
     ordering options, allowing individual keys to be sorted according
     to different rules; see below for details.  Keys can span multiple
     fields.

     Example:  To sort on the second field, use '--key=2,2' ('-k 2,2').
     See below for more notes on keys and more examples.  See also the
     '--debug' option to help determine the part of the line being used
     in the sort.

Hence, we can do better by ordering sort to break on the dots and focus on the second part of a URL, like so:

ls | sort --key=2 --field-separator="."
# ...
# uhaweb.hartford.edu/
# adsabs.harvard.edu/
# chandra.harvard.edu/
# cmt.harvard.edu/
# dash.harvard.edu/
# gking.harvard.edu/
# isites.harvard.edu/
# ...

There are many possible ways to sort, though. So I took my WWW archive as of 2014-06-15, optimized all PNGs & JPEGs with optipng & jpegoptim, ran all the files through filter-urls & deleted the ones which failed (this took out all of the JS files, which is fine since I don’t think those are useful for archival purposes), and was left with ~86.5GB of files. Then I tested out several ways of sorting the filenames to see what gave the best compression on my corpus.

First, I establish the baseline:

  1. Size of uncompressed unsorted tarball, which eliminates filesystem overhead and tells us how much compression is really saving:

    cd ~/www/ && find ./ -type f -print0 | tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
    # 86469734400
    ## 1x
  2. Size of sorted tarball:

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
                 tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
    # 86469734400
    ## 1x

    So sorting a tarball doesn’t give any benefits. This is mostly as I expected, since tar is only supposed to produce a linear archive packing together all the specified files and otherwise preserve them exactly. I thought there might have been some sort of consolidation of full path-names which might yield a small space savings, but apparently not.

Now we can begin sorting before compression. I thought of 6 approaches; in decreasing order of final archive size (smaller=better):

  1. Sort by file names, simply by reversing, sorting, and un-reversing (foo.png ... bar.png reverses to gnp.oof ... gnp.rab, which then sort together, and then losslessly reverse back to bar.png / foo.png / ...):

    cd ~/www/ && find . -type f | rev | sort | rev | \
                 tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
    # 24970605748
    ## 0.2887x

    (Note that find+rev doesn’t correctly handle filenames with the wrong/non-UTF-8 encoding; I ultimately used brute force in the form of detox to find all the non-UTF-8 files and rename them.)

  2. Compress tarball, but without any sorting (files are consumed in the order find produces them in the filesystem):

    cd ~/www/ && find . -type f -print0 | \
                 tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 24268747400
    ## 0.2806x
  3. Sort by file suffixes, trying to parse the filenames first:

    cd ~/www/ && find . -type f -printf '%f/%p\n' | sort --field-separator="." --key=2  | cut -f2- -d/ | \
                 tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
    # 24097155132
    ## 0.2786x
  4. Sort normally, in lexicographic order (subdomain, domain, TLD, subdirectories & files etc):

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
        tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23967317664
    ## 0.2771x
  5. Sort by middle of domain:

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="." | \
                  tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23946061272
    ## 0.2769x
  6. Sort by first subdirectory (if there’s a bunch of foo.com/wp-content/* & bar.com/wp-content/* files, then the /wp-content/ files will all sort together regardless of “f” and “b” being far from each other; similarly for domain.com/images/, domain.com/css/ etc):

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/"  | \
                 tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23897682908
    ## 0.2763x

Surprisingly, #1, reversing filenames in order to sort on the suffixes, turns out to be even worse than not sorting at all. The improved attempt to sort on filetypes doesn’t do much better, although it at least beats the baseline of no sorting; it may be that to get a compression win, real semantic knowledge of filetypes will be needed (perhaps calling file on every file and sorting by the detected filetype?). The regular sort also performs surprisingly well, but my intuitions win for once and it’s beaten by my previous strategy of sorting by the middle of domains. Finally, the winner is a bit of a surprise too, a sort I only threw in out of curiosity because I noticed blogs tend to have similar site layouts.

In this case, the best version saved ~1.5% or 371MB over the unsorted version. Not as impressive as in the next use case, but enough to show this seems like a real gain.
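
For the curious, the ‘sort by detected filetype’ idea could be tested with a sketch like this (untested on the corpus above, and it breaks on filenames containing ‘:’, since that is what file uses as its output delimiter):

cd ~/www/ && find . -type f -exec file {} + \
    | sort --field-separator=":" --key=2 \
    | cut --delimiter=":" --fields=1 \
    | tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes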

Separate mirrors

Top-level domains are not the only thing we might want to sort differently on. Take my mirrors of darknet market drug sites such as Silk Road 2: I download a site each time as a separate wget run in a timestamped folder. So in my Silk Road 2 folder, I have both 2013-12-25/ & 2014-01-15/. These share many similar & identical files, so they compress together with xz down from 1.8GB to 0.3GB.

But they could compress even better: the similar files may be thousands of files and hundreds of megabytes away by alphabetical or file-inode order, so even with a very large window and a top-notch compressor, it will fail to spot many long-range redundancies. In between 2013-12-25/default.css and 2014-01-15/default.css is going to be all sorts of files which have nothing to do with CSS, like 2014-01-16/items/2-grams-of-pure-vanilla-ketamine-powder-dutchdope?vendor_feedback_page=5 and 2014-01-16/items/10-generic-percocet-10-325-plus-1-30mg-morphine. You see the problem.

Because we sort the files by ‘all files starting with “2013”’ and then ‘all files starting “2014”’, we lose all proximity. If instead, we could sort by subfolder and only then by the top-level folder, then we’d have everything line up nicely. Fortunately, we already know how to do this! Reuse the sort-key trick, specify “/” as the delimiter to parse on, and the nth field to sort on. We feed it a file list, tell it to break filenames by “/”, and then to sort on a lower level, and if we did it right, we will indeed get output like 2013-12-25/default.css just before 2014-01-15/default.css, which will do wonders for our compression, and which will pay ever more dividends as we accumulate more partially-redundant mirrors.

Here is an example of output for my Pandora mirrors, where, due to frequent rumors of its demise triggering mirroring on my part, I have 5 full mirrors at the time of testing; and naturally, if we employ the sort-key trick (find . -type f | sort --key=3 --field-separator="/"), we find a lot of similar-sounding files:

./2014-01-15/profile/5a66e5238421f0422706b267b735d2df/6
./2014-01-16/profile/5a9df4f5482d55fb5a8997c270a1e22d
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/1
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.1
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/2
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.2
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/3
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/4
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/5
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/6
./2013-12-25/profile/5abb81db167294478a23ca110284c587
./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/1
./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.1
./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/2
./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.2

Note the interleaving of the different mirrors, impossible in a normal left-to-right alphabetical sort. You can bet that these 4 files (in 15 versions) are going to compress much better than if they were separated by a few thousand other profile pages.

So here’s an example invocation (doing everything in pipelines to avoid disk IO):

find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/" |
    tar --no-recursion --null --files-from - -c |
    xz -9 --extreme --stdout > ../mirror.tar.xz

Used on my two Silk Road 2 mirrors, which together weigh 1800M, a normal run without the --key/--field-separator options, as mentioned before, yields a 308M archive. That’s not too bad. Certainly much better than hauling around almost 2GB. However, if I switch to the sort-key trick, the archive is now 271M, or 37M less. Same compression algorithm, same files, same unpacked results, same speed, just 2 little obscure sort options… and I get an archive 87% the size of the original.

Not impressed? Well, I did say that the advantage increases with the number of mirrors to extract redundancy from. With only 2 mirrors, the SR2 results can’t be too impressive. How about the Pandora mirrors? 5 of them gives the technique more scope to shine. And as expected, it’s even more impressive when I compare the Pandora archives: 71M vs 162M. The sort-keyed archive saves 56% of the regular archive’s size. The final Pandora archive boasts 105 mirrors from 2013-12-25 to 2014-11-05 of 1,745,778 files (many duplicates due to being present over multiple crawls) taking up 32GB uncompressed. This archive is 0.58GB (606,203,320 bytes) when sort-keyed; when sorted alphabetically, it is instead 7.5GB (7,466,374,836 bytes) or 12.31x larger, so the sort-keyed archive saves 92%. Forum mirrors are even more redundant than market mirrors because all content added to a forum will be present in all subsequent crawls, so each crawl will yield the previous crawl plus some additional posts; hence, the Pandora forum’s raw size of 8,894,521,228 bytes shrinks to 283,747,148 bytes when compressed alphabetically but shrinks dramatically to the 31x smaller sort-keyed archive of 9,084,508 bytes, saving 97%!

Clearly in these DNM use-cases, sort-keying has gone from a cute trick to indispensable.

Alternatives

The sort-key trick is most useful when we can infer a lot of spatial locality just from parts of the file names; it’s not a panacea. There are other approaches:

  1. if we have multiple temporally-ordered datasets and we don’t mind making it more difficult to access older copies, it may be simpler to store the data in a DVCS like git, where each dataset is a large patch (see the sketch after this list)

  2. sort files by minimizing binary differences between them using Timm S. Müller’s binsort utility

    The default optimization setting of -o15 underperforms the sort-key trick on both the SR2 and Pandora datasets by 5–10MB, and a higher setting like -o1000 is best. Note that binsort is 𝒪(n²) in the number of files, so it’s limited to sorting <10,000 files. An example pipeline for compressing posts from the Silk Road forums regarding a minor Ponzi scheme there, since the filenames offer little guidance for a semantically-meaningful sort:

    find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l vladimir  {} \; >> ponzi.txt
    find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l Limetless {} \; >> ponzi.txt
    
    mkdir ponzi/ && cp `cat ponzi.txt` ponzi/
    ~/bin/binsort/binsort -t 6 -o 1000 ponzi/ > ponzi-list.txt
    cat ponzi-list.txt | tar --no-recursion --files-from - -c | xz -9 --extreme --stdout > ~/srf1-ponzi.tar.xz

    (Embarrassingly, binsort seems to underperform on this dataset: the files are 380M uncompressed, 3.536M compressed after binsort, and 3.450M compressed after a plain sort.)

  3. if we know there are many duplicates but they are far apart by a lexicographic sort, and we don’t have any good way to sort filenames but do have a lot of RAM, we can try out a compression tool which specifically looks for long-range redundancies throughout the entire bitstream, such as lrzip (homepage/Arch wiki). lrzip is packaged in Debian and should be at least as good as xz since it too uses LZMA; but it is an obscure tool which is not a default install on Linux distributions like xz is, and this makes distributing archives to other people difficult.
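
A sketch of option #1 (directory name and URL are illustrative): keep one working tree per site, overwrite it with each new crawl, and let git’s packfiles delta-compress the successive crawls against each other:

cd ~/mirrors/pandora/
git init --quiet                      # no-op if the repository already exists
wget --mirror --no-host-directories --quiet http://www.example.com/    # refresh the crawl in place
git add --all
git commit --quiet --message "crawl $(date --iso-8601)"
git gc --aggressive --quiet           # repack so near-identical crawls share deltas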


  1. I use duplicity & rdiff-backup to back up my entire home directory to a cheap 1.5TB hard drive (bought from Newegg using forre.st’s “Storage Analysis—GB/$ for different sizes and media” price-chart); a limited selection of folders are backed up to B2 using duplicity.

    I used to semiannually tar up my important folders, add redundancy, and burn them to DVD, but that’s no longer really feasible; if I ever get a Blu-ray burner, I’ll resume WORM backups. (Magnetic media doesn’t strike me as reliable over many decades, and it would ease my mind to have optical backups.)↩︎

  2. “When the Internet Is My Hard Drive, Should I Trust Third Parties?”, Wired:

    Bits and pieces of the web disappear all the time. It’s called ‘link rot’, and we’re all used to it. A friend saved 65 links in 1999 when he planned a trip to Tuscany; only half of them still work today. In my own blog, essays and news articles and websites that I link to regularly disappear – sometimes within a few days of my linking to them.

    ↩︎
  3. “Going, Going, Gone: Lost Internet References”; abstract:

    The extent of Internet referencing and Internet reference activity in medical or scientific publications was systematically examined in more than 1000 articles published between 2000 and 2003 in the New England Journal of Medicine, The Journal of the American Medical Association, and Science. Internet references accounted for 2.6% of all references (672/25548) and in articles 27 months old, 13% of Internet references were inactive.

    ↩︎
  4. The Million Dollar Homepage still gets a surprising amount of traffic, so one fun thing one could do is buy up expired domains which paid for particularly large links.↩︎

  5. By 2013-01-06, the number has increased to ~12000 external links, ~7200 to non-Wikipedia domains.↩︎

  6. If each link has a fixed chance of dying in each time period, such as 3%, then the total risk of death follows an exponential distribution; over the time period 2011–2070 the cumulative chance it will beat each of the 3% risks is 0.1658. So in 2070, how many of the 2200 links will have beaten the odds? Each link is independent, so they are like flips of a biased coin and form a binomial distribution. The binomial distribution, being discrete, has no easy equation, so we just ask R how many links survive at the 5th percentile/quantile (a lower bound) and how many survive at the 95th percentile (an upper bound):

    qbinom(c(0.05, 0.95), 2200, 0.97^(2070-2011))
    # [1] 336 394
    
    ## the 50% annual link rot hypothetical:
    qbinom(c(0.05, 0.50), 2200, 0.50^(2070-2011))
    # [1] 0 0
    ↩︎
  7. 101, ‘Buddhism related to Blossoms’; trans. Steven D. Carter, ISBN 0-231-10576-2↩︎

  8. Which I suspect is only accidentally ‘general’ and would shut down access if there were some other way to ensure that Wikipedia external links still got archived.↩︎

  9. Since Pinboard is a bookmarking service more than an archive site, I asked whether treating it as such would be acceptable and Maciej said “Your current archive size, growing at 20 GB a year, should not be a problem. I’ll put you on the heavy-duty server where my own stuff lives.”↩︎

  10. Google Cache is generally recommended only as a last-ditch resort because pages expire quickly from it. Personally, I’m convinced that Google would never just delete colossal amounts of Internet data—this is Google, after all, the epitome of storing unthinkable amounts of data—and that Google Cache merely ceases to make public its copies. And to request a Google spider visit, one has to solve a CAPTCHA—so that’s not a scalable solution.↩︎

  11. Which would not be publicly accessible or submittable; I know they exist, but because they hide themselves, I know only from random comments online eg. “years ago a friend of mine who I’d lost contact with caught up with me and told me he found a cached copy of a website I’d taken down in his employer’s equivalent to the Wayback Machine. His employer was a branch of the federal government.”.↩︎

  12. Version 0.1 of my archiver daemon didn’t simply read the file until it was empty and exit, but actually watched it for modifications with inotify. I removed this functionality when I realized that the WebCite throttling required (just one URL every ~25 seconds) meant that archiver would never finish any reasonable workload.↩︎

  13. Much easier than it was in the past; Jamie Zawinski records his travails with the previous Mozilla history format in the aptly-named “when the database worms eat into your brain”.↩︎

  14. An older 2010 Google article put the average at 320kb, but that was an average over the entire Web, including all the old content.↩︎

  15. Already one runs old DOS games in emulators of the DOS operating system like DOSBox; but those emulators will not always be maintained. Who will emulate the emulators? Presumably in 2050, one will instead emulate some ancient but compatible OS—Windows 7 or Debian 6.0, perhaps—and inside that run DOSBox (to run the DOS which can run the game).↩︎

  16. “BigTable: A Distributed Storage System for Structured Data”, Chang et al 2006:

    BigTable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.

    ↩︎