Archiving URLs

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.
Haskell, archiving, Bitcoin, meta, shell, R, tutorial
2011-03-102019-01-05 finished certainty: certain importance: 8

Links on the Inter­net last for­ever or a year, whichever comes first. This is a major prob­lem for any­one seri­ous about writ­ing with good ref­er­ences, as link rot will crip­ple sev­eral per­cent of all links each year, and com­pound­ing.

To deal with link rot, I present my mul­ti­-pronged archival strat­egy using a com­bi­na­tion of scripts, dae­mons, and Inter­net archival ser­vices: URLs are reg­u­larly dumped from both my web browser’s daily brows­ing and my web­site pages into an archival dae­mon I wrote, which pre-emp­tively down­loads copies locally and attempts to archive them in the Inter­net Archive. This ensures a copy will be avail­able indefi­nitely from one of sev­eral sources. Link rot is then detected by reg­u­lar runs of linkchecker, and any newly dead links can be imme­di­ately checked for alter­na­tive loca­tions, or restored from one of the archive sources.

As an addi­tional flour­ish, my local archives are in case forgery is a con­cern, and I demon­strate a sim­ple com­pres­sion trick for such as crawls (par­tic­u­larly use­ful for repeated crawls such as my ).

Given my inter­est in long term con­tent and exten­sive link­ing, is an issue of deep con­cern to me. I need back­ups not just for my files1, but for the web pages I read and use—they’re all part of my . It’s not much good to have an exten­sive essay on some topic where half the links are dead and the reader can nei­ther ver­ify my claims nor get con­text for my claims.


“With every new spring
the blos­soms speak not a word
yet expound the Law—
know­ing what is at its heart
by the scat­ter­ing storm winds.”


The first rem­edy is to learn about bro­ken links as soon as they hap­pen, which allows one to react quickly and scrape archives or search engine caches (‘lazy preser­va­tion’). I cur­rently use linkchecker to spi­der look­ing for bro­ken links. linkchecker is run in a job like so:

@monthly linkchecker --check-extern --timeout=35 --no-warnings --file-output=html \
                      --ignore-url=^mailto --ignore-url=^irc --ignore-url=http://.*\.onion \

Just this com­mand would turn up many false pos­i­tives. For exam­ple, there would be sev­eral hun­dred warn­ings on Wikipedia links because I link to redi­rects; and linkchecker respects s which for­bid it to check live­ness, but emits a warn­ing about this. These can be sup­pressed by edit­ing ~/.linkchecker/linkcheckerrc to say ignorewarnings=http-moved-permanent,http-robots-denied (the avail­able warn­ing classes are listed in linkchecker -h).

The quicker you know about a dead link, the sooner you can look for replace­ments or its new home.


Remote caching

“Any­thing you post on the inter­net will be there as long as it’s embar­rass­ing and gone as soon as it would be use­ful.”


We can ask a third party to keep a cache for us. There are sev­eral pos­si­bil­i­ties:

  1. the
  2. Per­ (highly lim­ited)
  3. Lin­ter­we­b’s Wiki­Wix8.
  4. Peeep­.us (de­funct as of 2018)
  6. Pin­board (with the $25/yr archiv­ing option9)
  7. Hiy­ & Mega­ (may be diffi­cult to use)

There are other options but they are not avail­able like Google10 or var­i­ous commercial/government archives11

(An exam­ple would be being archived at

These archives are also good for archiv­ing your own web­site:

  1. you may be keep­ing back­ups of it, but your own website/server back­ups can be lost (I can speak from per­sonal expe­ri­ence here), so it’s good to have exter­nal copies
  2. Another ben­e­fit is the reduc­tion in ‘bus-fac­tor’: if you were hit by a bus tomor­row, who would get your archives and be able to main­tain the web­sites and under­stand the back­ups etc? While if archived in IA, peo­ple already know how to get copies and there are tools to down­load entire domains.
  3. A focus on back­ing up only one’s web­site can blind one to the need for archiv­ing the exter­nal links as well. Many pages are mean­ing­less or less valu­able with bro­ken links. A linkchecker script/daemon can also archive all the exter­nal links.

So there are sev­eral ben­e­fits to doing web archiv­ing beyond sim­ple server back­ups.

My first pro­gram in this vein of thought was a bot which fired off WebCite, Inter­net Archive/Alexa, & requests: , quickly fol­lowed up by a . (Or you could install the to get auto­matic sub­mis­sion to the Inter­net Archive, if you have ceased to care about pri­va­cy.)

The core code was quickly adapted into a gitit wiki plu­gin which hooked into the save-page func­tion­al­ity and tried to archive every link in the new­ly-mod­i­fied page, Inter­wi­k­i.hs

Final­ly, I wrote archiver, a dae­mon which watches12/reads a text file. Source is avail­able via git clone (A sim­i­lar tool is Archiveror.)

The library half of archiver is a sim­ple wrap­per around the appro­pri­ate HTTP requests; the exe­cutable half reads a spec­i­fied text file and loops as it (slow­ly) fires off requests and deletes the appro­pri­ate URL.

That is, archiver is a dae­mon which will process a spec­i­fied text file, each line of which is a URL, and will one by one request that the URLs be archived or spi­dered

Usage of archiver might look like archiver ~/.urls.txt In the past, archiver would some­times crash for unknown rea­sons, so I usu­ally wrap it in a while loop like so: while true; do archiver ~/.urls.txt; done. If I wanted to put it in a detached ses­sion: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt; done'. Final­ly, rather than start it man­u­al­ly, I use a cron job to start it at boot, for a final invo­ca­tion of

@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt \
        "cd ~/www && nice -n 20 ionice -c3 wget --unlink --limit-rate=20k --page-requisites --timestamping \
        -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm \
        --user-agent='Firefox/4.9'" 500; done'

Local caching

Remote archiv­ing, while con­ve­nient, has a major flaw: the archive ser­vices can­not keep up with the growth of the Inter­net and are woe­fully incom­plete. I expe­ri­ence this reg­u­lar­ly, where a link on goes dead and I can­not find it in the Inter­net Archive or WebCite, and it is a gen­eral phe­nom­e­non: find <35% of com­mon Web pages ever copied into an archive ser­vice, and typ­i­cally only one copy exists.

Caching Proxy

The most ambi­tious & total approach to local caching is to set up a proxy to do your brows­ing through, and record lit­er­ally all your web traffic; for exam­ple, using Live Archiv­ing Proxy (LAP) or War­cProxy which will save as every page you visit through it. (Zachary Vance explains how to set up a local HTTPS cer­tifi­cate to MITM your HTTPS brows­ing as well.)

One may be reluc­tant to go this far, and pre­fer some­thing lighter-weight, such as peri­od­i­cally extract­ing a list of vis­ited URLs from one’s web browser and then attempt­ing to archive them.

Batch job downloads

For a while, I used a shell script named, imag­i­na­tively enough, local-archiver:

set -euo pipefail

cp `find ~/.mozilla/ -name "places.sqlite"` ~/
sqlite3 places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                       WHERE = moz_historyvisits.place_id \
                             and visit_date > strftime('%s','now','-1.5 month')*1000000 ORDER by \
                       visit_date;" | filter-urls >> ~/.tmp
rm ~/places.sqlite
split -l500 ~/.tmp ~/.tmp-urls
rm ~/.tmp

cd ~/www/
for file in ~/.tmp-urls*;
 do (wget --unlink --continue --page-requisites --timestamping --input-file $file && rm $file &);

find ~/www -size +4M -delete

The code is not the pret­ti­est, but it’s fairly straight­for­ward:

  1. the script grabs my Fire­fox brows­ing his­tory by extract­ing it from the his­tory SQL data­base file13, and feeds the URLs into .

    wget is not the best tool for archiv­ing as it will not run JavaScript or Flash or down­load videos etc. It will down­load included JS files but the JS will be obso­lete when run in the future and any dynamic con­tent will be long gone. To do bet­ter would require a head­less browser like which saves to MHT/MHTML, but Phan­tomJS refuses to sup­port it and I’m not aware of an exist­ing pack­age to do this. In prac­tice, sta­tic con­tent is what is most impor­tant to archive, most JS is of highly ques­tion­able value in the first place, and any impor­tant YouTube videos can be archived man­u­ally with youtube-dl, so wget’s lim­i­ta­tions haven’t been so bad.

  2. The script splits the long list of URLs into a bunch of files and runs that many wgets in par­al­lel because wget appar­ently has no way of simul­ta­ne­ously down­load­ing from mul­ti­ple domains. There’s also the chance of wget hang­ing indefi­nite­ly, so par­al­lel down­loads con­tin­ues to make progress.

  3. The filter-urls com­mand is another shell script, which removes URLs I don’t want archived. This script is a hack which looks like this:

    set -euo pipefail
    cat /dev/stdin | sed -e "s/#.*//" | sed -e "s/&sid=.*$//" | sed -e "s/\/$//" | grep -v -e 4chan -e reddit ...
  4. delete any par­tic­u­larly large (>4MB) files which might be media files like videos or audios (pod­casts are par­tic­u­lar offend­ers)

A local copy is not the best resource—what if a link goes dead in a way your tool can­not detect so you don’t know to put up your copy some­where? But it solves the prob­lem deci­sive­ly.

The down­side of this scrip­t’s batch approach soon became appar­ent to me:

  1. not auto­mat­ic: you have to remem­ber to invoke it and it only pro­vides a sin­gle local archive, or if you invoke it reg­u­larly as a cron job, you may cre­ate lots of dupli­cates.
  2. unre­li­able: wget may hang, URLs may be archived too late, it may not be invoked fre­quently enough, >4MB non-video/audio files are increas­ingly com­mon…
  3. I wanted copies in the Inter­net Archive & else­where as well to let other peo­ple ben­e­fit and pro­vide redun­dancy to my local archive

It was to fix these prob­lems that I began work­ing on archiver—which would run con­stantly archiv­ing URLs in the back­ground, archive them into the IA as well, and be smarter about media file down­loads. It has been much more sat­is­fac­to­ry.


archiver has an extra fea­ture where any third argu­ment is treated as an arbi­trary sh com­mand to run after each URL is archived, to which is appended said URL. You might use this fea­ture if you wanted to load each URL into Fire­fox, or append them to a log file, or sim­ply down­load or archive the URL in some other way.

For exam­ple, instead of a big local-archiver run, I have archiver run wget on each indi­vid­ual URL: screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt "cd ~/www && wget --unlink --continue --page-requisites --timestamping -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm --user-agent='Firefox/3.6' 120"; done'. (For pri­vate URLs which require logins, such as , wget can still grab them with some help: installing the Fire­fox exten­sion Export Cook­ies, log­ging into the site in Fire­fox like usu­al, export­ing one’s cookies.txt, and adding the option --load-cookies cookies.txt to give it access to the cook­ies.)

Alter­nate­ly, you might use curl or a spe­cial­ized archive down­loader like the Inter­net Archive’s crawler Her­itrix.

Cryptographic timestamping local archives

We may want cryp­to­graphic time­stamp­ing to prove that we cre­ated a file or archive at a par­tic­u­lar date and have not since altered it. which imple­ment down­load­ing (wget-archive) and time­stamp­ing strings or files (timestamp). With these scripts, extend­ing the archive bot is as sim­ple as chang­ing the shell com­mand:

@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt \
        "wget-archive" 200; done'

Now every URL we down­load is auto­mat­i­cally cryp­to­graph­i­cally time­stamped with ~1-day res­o­lu­tion for free.

Resource consumption

The space con­sumed by such a backup is not that bad; only 30–50 giga­bytes for a year of brows­ing, and less depend­ing on how hard you prune the down­loads. (More, of course, if you use linkchecker to archive entire sites and not just the pages you vis­it.) Stor­ing this is quite viable in the long term; while page sizes have increased 7x between 2003 and 2011 and pages aver­age around 400kb14, has also been oper­at­ing and has increased disk capac­ity by ~128x—in 2011, $80 will buy you at least 2 ter­abytes, that works out to 4 cents a giga­byte or 80 cents for the low esti­mate for down­loads; that is much bet­ter than the $25 annual fee that some­where like Pin­board charges. Of course, you need to back this up your­self. We’re rel­a­tively for­tu­nate here—­most Inter­net doc­u­ments are ‘born dig­i­tal’ and easy to migrate to new for­mats or inspect in the future. We can down­load them and worry about how to view them only when we need a par­tic­u­lar doc­u­ment, and Web browser back­ward­s-com­pat­i­bil­ity already stretches back to files writ­ten in the early 1990s. (Of course, we’re prob­a­bly screwed if we dis­cover the con­tent we wanted was dynam­i­cally pre­sented only in Adobe Flash or as an inac­ces­si­ble ‘cloud’ ser­vice.) In con­trast, if we were try­ing to pre­serve pro­grams or soft­ware libraries instead, we would face a much more for­mi­da­ble task in keep­ing a work­ing lad­der of bina­ry-com­pat­i­ble vir­tual machines or inter­preters15. The sit­u­a­tion with dig­i­tal movie preser­va­tion hardly bears think­ing on.

There are ways to cut down on the size; if you tar it all up and run with max­i­mum com­pres­sion options, you could prob­a­bly com­pact it to 1/5th the size. I found that the uncom­pressed files could be reduced by around 10% by using to look for dupli­cate files and turn­ing the dupli­cates into a space-sav­ing to the orig­i­nal with a com­mand like fdupes --recurse --hardlink ~/www/. (Ap­par­ently there are a lot of bit-i­den­ti­cal JavaScript (eg. ) and images out there.)

Good fil­ter­ing of URL sources can help reduce URL archiv­ing count by a large amount. Exam­in­ing my man­ual back­ups of Fire­fox brows­ing his­to­ry, over the 1153 days from 2014-02-25 to 2017-04-22, I vis­ited 2,370,111 URLs or 2055 URLs per day; after pass­ing through my fil­ter­ing script, that leaves 171,446 URLs, which after de-du­pli­ca­tion yields 39,523 URLs or ~34 unique URLs per day or 12,520 unique URLs per year to archive.

This shrunk my archive by 9GB from 65GB to 56GB, although at the cost of some archiv­ing fidelity by remov­ing many file­types like CSS or JavaScript or GIF images. As of 2017-04-22, after ~6 years of archiv­ing, between xz com­pres­sion (at the cost of easy search­a­bil­i­ty), aggres­sive fil­ter­ing, occa­sional man­ual dele­tion of overly bulky domains I feel are prob­a­bly ade­quately cov­ered in the IA etc, my full WWW archives weigh 55GB.

URL sources

Browser history

There are a num­ber of ways to pop­u­late the source text file. For exam­ple, I have a script firefox-urls:

set -euo pipefail

cp --force `find ~/.mozilla/firefox/ -name "places.sqlite"|sort|head -1` ~/
sqlite3 -batch places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                       WHERE = moz_historyvisits.place_id and \
                       visit_date > strftime('%s','now','-1 day')*1000000 ORDER by \
                       visit_date;" | filter-urls
rm ~/places.sqlite

(filter-urls is the same script as in local-archiver. If I don’t want a domain local­ly, I’m not going to bother with remote back­ups either. In fact, because of WebCite’s rate-lim­it­ing, archiver is almost per­pet­u­ally back­-logged, and I espe­cially don’t want it wast­ing time on worth­less links like .)

This is called every hour by cron:

@hourly firefox-urls >> ~/.urls.txt

This gets all vis­ited URLs in the last time period and prints them out to the file for archiver to process. Hence, every­thing I browse is backed-up through archiver.

Non-Fire­fox browsers can be sup­ported with sim­i­lar strate­gies; for exam­ple, Zachary Vance’s Chromium scripts like­wise extracts URLs from Chromi­um’s SQL his­tory & book­marks.

Website spidering

Some­times a par­tic­u­lar web­site is of long-term inter­est to one even if one has not vis­ited every page on it; one could man­u­ally visit them and rely on the pre­vi­ous Fire­fox script to dump the URLs into archiver but this isn’t always prac­ti­cal or time-effi­cient. linkchecker inher­ently spi­ders the web­sites it is turned upon, so it’s not a sur­prise that it can build a or sim­ply spit out all URLs on a domain; unfor­tu­nate­ly, while linkchecker has the abil­ity to out­put in a remark­able vari­ety of for­mats, it can­not sim­ply out­put a new­line-de­lim­ited list of URLs, so we need to post-process the out­put con­sid­er­ably. The fol­low­ing is the shell one-liner I use when I want to archive an entire site (note that this is a bad com­mand to run on a large or heav­ily hyper­-linked site like the Eng­lish Wikipedia or Less­Wrong!); edit the tar­get domain as nec­es­sary:

linkchecker --check-extern -odot --complete -v --ignore-url=^mailto --no-warnings
    | fgrep http # [
    | fgrep -v -e "label=" -e "->" -e '" [' -e '" ]' -e "/ "
    | sed -e "s/href=\"//" -e "s/\",//" -e "s/ //"
    | filter-urls
    | sort --unique >> ~/.urls.txt

When linkchecker does not work, one alter­na­tive is to do a wget --mirror and extract the URLs from the file­names—list all the files and pre­fix with a “http://” etc.


Cryptographic timestamping

Due to length, this sec­tion has .


A raw dump of URLs, while cer­tainly archiv­able, will typ­i­cally result in a very large mir­ror of ques­tion­able value (is it really nec­es­sary to archive Google search queries or Wikipedia arti­cles? usu­al­ly, no) and worse, given the rate-lim­it­ing nec­es­sary to store URLs in the Inter­net Archive or other ser­vices, may wind up delay­ing the archiv­ing of the impor­tant links & risk­ing their total loss. Dis­abling the remote archiv­ing is unac­cept­able, so the best solu­tion is to sim­ply take a lit­tle time to man­u­ally black­list var­i­ous domains or URL pat­terns.

This black­list­ing can be as sim­ple as a com­mand like filter-urls | grep -v, but can be much more elab­o­rate. The fol­low­ing shell script is the skele­ton of my own cus­tom black­list, derived from man­u­ally fil­ter­ing through sev­eral years of daily brows­ing as well as spi­ders of dozens of web­sites for var­i­ous peo­ple & pur­pos­es, demon­strat­ing a vari­ety of pos­si­ble tech­niques: reg­exps for domains & file-types & query-strings, sed-based rewrites, fixed-string matches (both black­lists and whitelist­s), etc:


# USAGE: `filter-urls` accepts on standard input a list of newline-delimited URLs or filenames,
# and emits on standard output a list of newline-delimited URLs or filenames.
# This list may be shorter and entries altered. It tries to remove all unwanted entries, where 'unwanted'
# is a highly idiosyncratic list of regexps and fixed-string matches developed over hundreds of thousands
# of URLs/filenames output by my daily browsing, spidering of interesting sites, and requests
# from other people to spider sites for them.
# You are advised to test output to make sure it does not remove
# URLs or filenames you want to keep. (An easy way to test what is removed is to use the `comm` utility.)
# For performance, it does not sort or remove duplicates from output; both can be done by
# piping `filter-urls` to `sort --unique`.

set -euo pipefail

cat /dev/stdin \
    | sed -e "s/#.*//" -e 's/>$//' -e "s/&sid=.*$//" -e "s/\/$//" -e 's/$/\n/' -e 's/\?sort=.*$//' \
      -e 's/^[ \t]*//' -e 's/utm_source.*//' -e 's/https:\/\//http:\/\//' -e 's/\?showComment=.*//' \
    | grep "\." \
    | fgrep -v "*" \
    | egrep -v -e '\/\.rss$' -e "\.tw$" -e "//%20www\." -e "/file-not-found" -e "258..\.com/$" \
       -e "3qavdvd" -e "://avdvd" -e "\.avi" -e "\.com\.tw" -e "\.onion" -e "\?fnid\=" -e "\?replytocom=" \
       -e "^$" -e "^$" \
       -e "^$" \
       -e "ftp.*" -e "6..6\.com" -e "6..9\.com" -e "6??6\.com" -e "7..7\.com" -e "7..8\.com" -e "7..\.com" \
       -e "78..\.com" -e "7??7\.com" -e "8..8\.com" -e "8??8\.com" -e "9..9\.com" -e "9??9\.com" \
       -e gold.*sell -e vip.*club \
    | fgrep -v -e "#!" -e ".bin" -e ".mp4" -e ".swf" -e "/mediawiki/index.php?title=" -e "/search?q=cache:" \
      -e "/wiki/Special:Block/" -e "/wiki/Special:WikiActivity" -e "Special%3ASearch" \
      -e "Special:Search" -e "__setdomsess?dest="
      # ...

# prevent URLs from piling up at the end of the file
echo ""

filter-urls can be used on one’s local archive to save space by delet­ing files which may be down­loaded by wget as depen­den­cies. For exam­ple:

find ~/www | sort --unique >> full.txt && \
    find ~/www | filter-urls | sort --unique >> trimmed.txt
comm -23 full.txt trimmed.txt | xargs -d "\n" rm
rm full.txt trimmed.txt

sort --key compression trick

Pro­gram­ming folk­lore notes that one way to get bet­ter loss­less com­pres­sion effi­ciency is by the pre­com­pres­sion trick of rear­rang­ing files inside the archive to group ‘sim­i­lar’ files together and expose redun­dancy to the com­pres­sor, in accor­dance with infor­ma­tion-the­o­ret­i­cal prin­ci­ples. A par­tic­u­larly easy and broad­ly-ap­plic­a­ble way of doing this, which does not require using any unusual for­mats or tools and is fully com­pat­i­ble with the default archive meth­ods, is to sort the files by file­name and espe­cially file exten­sion. I show how to do this with the stan­dard Unix com­mand-line sort tool, using the so-called “sort --key trick”, and give exam­ples of the large space-sav­ings pos­si­ble from my archiv­ing work for per­sonal web­site mir­rors and for mak­ing where the redun­dancy at the file level is par­tic­u­larly extreme and the sort --key trick shines com­pared to the naive approach.

One way to look at data com­pres­sion is as a form of intel­li­gence (see the & ): a com­pres­sion tool like is being asked to pre­dict what the next bit of the orig­i­nal file is, and the bet­ter its pre­dic­tions, the less data needs to be stored to cor­rect its mis­taken pre­dic­tions (“I know how to spell ‘banana’, I just don’t know when to stop”). The smarter the pro­gram is, the bet­ter its com­pres­sion will be; but on the flip side, you can also improve the com­pres­sion ratio by giv­ing it a lit­tle help and orga­niz­ing the data into an equiv­a­lent form the pro­gram can under­stand bet­ter—­for exam­ple, by using the . Or by pre­serv­ing spa­tial local­i­ty: keep­ing sim­i­lar data togeth­er, and not dis­pers­ing it all over. (This also explains why mul­ti­me­dia files barely com­press: because the lossy encod­ings are the prod­uct of decades of work spe­cial­ized to the par­tic­u­lar domain of audio or images or video, and a gen­er­al-pur­pose loss­less com­pres­sion would have to be very intel­li­gent, on par with , to beat them at their own game.) Files on one topic should be com­pressed togeth­er.


Spa­tial local­ity can be sub­tle. For exam­ple, nat­ural lan­guage text, though not struc­tured line-by-line as vis­i­bly as a dic­tio­nary, is still far from ran­dom and has many local cor­re­la­tions a com­pres­sion tool might be able to pick up. If this is true, we would expect that with a suffi­ciently smart com­pres­sor, a text file would com­press bet­ter in its nat­ural form than if it were ran­domly shuffled. Is this the case, or are com­pres­sion util­i­ties are too stu­pid to see any differ­ent between ran­dom lines and Eng­lish prose? Tak­ing 14M of text from, we can see for our­selves:

## uncompressed
cat *.page */*.page */*/*.page |                                                 wc --bytes
# 13588814

## compressed, files in lexicographic order
cat *.page */*.page */*/*.page |                      xz -9 --extreme --stdout | wc --bytes
# 3626732
## compressed, all lines sorted alphabetically
cat *.page */*.page */*/*.page | sort               | xz -9 --extreme --stdout | wc --bytes
# 3743756
## compressed, all lines randomly shuffled except for non-unique lines sorted together
cat *.page */*.page */*/*.page | sort --random-sort | xz -9 --extreme --stdout | wc --bytes
# 3831740
## compressed, all lines randomly shuffled
cat *.page */*.page */*/*.page | shuf               | xz -9 --extreme --stdout | wc --bytes
# 3862632

The unmo­lested text com­presses to 3.626M, but the same text ran­domly shuffled is 3.862M! I also included an inter­me­di­ate form of ran­dom­iza­tion: despite the name, the --random-sort option to sort is not actu­ally a ran­dom shuffle but a ran­dom hash (this is not doc­u­mented in the man page, though it is in the info page) and so any repeated non-u­nique lines will be sorted together (al­low­ing for some easy dupli­ca­tion dele­tion), and so we would expect the only-par­tial ran­dom­iza­tion of --random-sort to maybe per­form a bit bet­ter than the true ran­dom shuffle of shuf. And it does.

Web archives

Spa­tial local­ity also applies to our web archiv­ing. If you are mir­ror­ing web­sites, or oth­er­wise com­pil­ing a lot of direc­to­ries with redun­dant data on a file-by-file lev­el, there’s a cute trick to mas­sively improve your com­pres­sion ratios: don’t sort the usual lex­i­co­graphic way, but sort by a sub­di­rec­to­ry. (I learned about this trick a few years ago while mess­ing around with archiv­ing my home direc­tory using find and tar.) This is one of the issues with archiv­ing giga­bytes of crawls from thou­sands of domains: URLs have a his­tor­i­cal odd­ity where they are not con­sis­tently hier­ar­chi­cal. URLs were orig­i­nally mod­eled after hier­ar­chi­cal Unix filesys­tems; this page, for exam­ple, lives at the name /home/gwern/wiki/, which fol­lows a log­i­cal left­-to-right pat­tern of increas­ing nar­row­ness. If one lists my entire filesys­tem in lex­i­co­graphic order, all the files in /home/gwern/ will be con­sec­u­tive, and the files in wiki/ will be con­sec­u­tive, and so on. unfor­tu­nate­ly, the top level of URLs breaks this scheme—one does not visit https://com/google/mail/?shva=1#search/l%3aunread, one vis­its; one does not visit http://net/www/gwern/Archiving-URLs but So if I down­load and then later, a lex­i­co­graphic list of down­loaded files will sep­a­rate the files as much as pos­si­ble (even though they are seman­ti­cally prob­a­bly sim­i­lar). A quick exam­ple from my cur­rent WWW archive:

# ...
# ...

The direc­to­ries are indeed sort­ed, but aside from the ini­tial 2 let­ters or so, look noth­ing like each oth­er: a Wikia sub­do­main rubs shoul­ders with a Word­Press blog, a .ca domain is between a .com, a .info, and a .edu (with a .us and .uk thrown in for vari­ety), and so on. Is there any way to sort these direc­to­ries with a bit of pars­ing thrown in? For exam­ple, maybe we could reverse each line? Some web browsers store URLs reversed right-to-left to enable more effi­cient data­base oper­a­tions, as do Google’s sys­tems16 (to assist their rel­a­tively weak com­pres­sion util­ity ). Turns out GNU already sup­ports some­thing sim­i­lar, the --key & --field-separator options; the man page is not very help­ful but the page tells us:

     Use character SEPARATOR as the field separator when finding the
     sort keys in each line.  By default, fields are separated by the
     empty string between a non-blank character and a blank character.
     By default a blank is a space or a tab, but the 'LC_CTYPE' locale
     can change this.

     That is, given the input line ' foo bar', 'sort' breaks it into
     fields ' foo' and ' bar'.  The field separator is not considered
     to be part of either the field preceding or the field following,
     so with 'sort -t " "' the same input line has three fields: an
     empty field, 'foo', and 'bar'.  However, fields that extend to the
     end of the line, as '-k 2', or fields consisting of a range, as
     '-k 2,3', retain the field separators present between the
     endpoints of the range.

'-k POS1[,POS2]'
     Specify a sort field that consists of the part of the line between
     POS1 and POS2 (or the end of the line, if POS2 is omitted),

     Each POS has the form 'F[.C][OPTS]', where F is the number of the
     field to use, and C is the number of the first character from the
     beginning of the field.  Fields and character positions are
     numbered starting with 1; a character position of zero in POS2
     indicates the field's last character.  If '.C' is omitted from
     POS1, it defaults to 1 (the beginning of the field); if omitted
     from POS2, it defaults to 0 (the end of the field).  OPTS are
     ordering options, allowing individual keys to be sorted according
     to different rules; see below for details.  Keys can span multiple

     Example:  To sort on the second field, use '--key=2,2' ('-k 2,2').
     See below for more notes on keys and more examples.  See also the
     '--debug' option to help determine the part of the line being used
     in the sort.

Hence, we can do bet­ter by order­ing sort to break on the dots and focus on the sec­ond part of a URL, like so:

ls | sort --key=2 --field-separator="."
# ...
# ...

There’s many pos­si­ble ways to sort, though. So I took my WWW archive as of 2014-06-15, opti­mized all PNGs & JPEGs with optipng & jpegoptim, ran all the files through filter-urls & deleted the ones which failed (this took out all of the JS files, which is fine since I don’t think those are use­ful for archival pur­pos­es), and was left with ~86.5GB of files. Then I tested out sev­eral ways of sort­ing the file­names to see what gave the best com­pres­sion on my cor­pus.

First, I estab­lish the base­line:

  1. Size of uncom­pressed unsorted tar­ball, which elim­i­nates filesys­tem over­head and tells us how much com­pres­sion is really sav­ing:

    cd ~/www/ && find ./ -type f -print0 | tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
    # 86469734400
    ## 1x
  2. Size of sorted tar­ball:

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
                 tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
    # 86469734400
    ## 1x

    So sort­ing a tar­ball does­n’t give any ben­e­fits. This is mostly as I expect­ed, since tar is only sup­posed to pro­duce a lin­ear archive pack­ing together all the spec­i­fied files and oth­er­wise pre­serve them exact­ly. I thought there might have been some sort of con­sol­i­da­tion of full path-names which might yield a small space sav­ings, but appar­ently not.

Now we can begin sort­ing before com­pres­sion. I thought of 6 approach­es; in decreas­ing order of final archive (small­er=­bet­ter):

  1. Sort by file names, sim­ply by revers­ing, sort­ing, unre­verse (foo.png ... bar.png reverses to gnp.oof ... gnp.rab, which then sort togeth­er, and then loss­lessly reverse back to bar.png / foo.png / ...):

    cd ~/www/ && find . -type f | rev | sort | rev | \
                 tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
    # 24970605748
    ## 0.2887x

    (Note that find+rev does­n’t cor­rectly han­dle file­names with the wrong/non-UTF-8 encod­ing; I ulti­mately used brute force in the form of detox to find all the non-UTF-8 files and rename them.)

  2. Com­press tar­ball, but with­out any sort­ing (files are con­sumed in the order find pro­duces them in the filesys­tem):

    cd ~/www/ && find . -type f -print0 | \
                 tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 24268747400
    ## 0.2806x
  3. Sort by file suffix­es, try­ing to pars­ing the file­names first:

    cd ~/www/ && find . -type f -printf '%f/%p\n' | sort --field-separator="." --key=2  | cut -f2- -d/ | \
                 tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
    # 24097155132
    ## 0.2786x
  4. Sort nor­mal­ly, in lex­i­co­graphic order (sub­do­main, domain, TLD, sub­di­rec­to­ries & files etc):

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
        tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23967317664
    ## 0.2771x
  5. Sort by mid­dle of domain:

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="." | \
                  tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23946061272
    ## 0.2769x
  6. Sort by first sub­di­rec­tory (if there’s a bunch of* &* files, then the /wp-content/ files will all sort together regard­less of “f” and “b” being far from each oth­er; sim­i­larly for, etc):

    cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/"  | \
                 tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
    # 23897682908
    ## 0.2763x

Sur­pris­ing­ly, #1, revers­ing file­names in order to sort on the suffix­es, turns out to be even worse than not sort­ing at all. The improved attempt to sort on file­types does­n’t do much bet­ter, although it at least beats the base­line of no-sort­ing; it may be that to get a com­pres­sion win real seman­tic knowl­edge of file­types will be needed (per­haps call­ing file on every file and sort­ing by the detected file­type­?). The reg­u­lar sort also per­forms sur­pris­ingly well, but my intu­itions win for once and it’s beaten by my pre­vi­ous strat­egy of sort­ing by the mid­dle of domains. Final­ly, the win­ner is a bit of a sur­prise too, a sort I only threw in out of curios­ity because I noticed blogs tend to have sim­i­lar site lay­outs.

In this case, the best ver­sion saved or 371MB over the unsorted ver­sion. Not as impres­sive as in the next use case, but enough to show this seems like a real gain

Separate mirrors

Top-level domains are not the only thing we might want to sort differ­ently on. To take my mir­rors of dark­net mar­ket drug sites such as : I down­load a site each time as a sep­a­rate wget run in a time­stamped fold­er. So in my Silk Road 2 fold­er, I have both 2013-12-25/ & 2014-01-15/. These share many sim­i­lar & iden­ti­cal files so they com­press together with xz down from 1.8GB to 0.3GB.

But they could com­press even bet­ter: the sim­i­lar files may be thou­sands of files and hun­dreds of megabytes away by alpha­bet­i­cal or file-in­ode order, so even with a very large win­dow and a top-notch com­pres­sor, it will fail to spot many long-range redun­dan­cies. In between 2013-12-25/default.css and 2014-01-15/default.css is going to be all sorts of files which have noth­ing to do with CSS, like 2014-01-16/items/2-grams-of-pure-vanilla-ketamine-powder-dutchdope?vendor_feedback_page=5 and 2014-01-16/items/10-generic-percocet-10-325-plus-1-30mg-morphine. You see the prob­lem.

Because we sort the files by ‘all files start­ing with “2013”’ and then ‘all files start­ing “2014”’, we lose all prox­im­i­ty. If instead, we could sort by sub­folder and only then by the top-level fold­er, then we’d have every­thing line up nice­ly. For­tu­nate­ly, we already know how to do this! Reuse the sort-key trick, spec­ify “/” as the delim­iter to parse on, and the nth field to sort on. We feed it a file list, tell it to break file­names by “/”, and then to sort on a lower lev­el, and if we did it right, we will indeed get out­put like 2013-12-25/default.css just before 2014-01-15/default.css, which will do won­ders for our com­pres­sion, and which will pay ever more div­i­dends as we accu­mu­late more par­tial­ly-re­dun­dant mir­rors.

Here is an exam­ple of out­put for my Pan­dora mir­rors, where, due to fre­quent rumors of its demise trig­ger­ing mir­ror­ing on my part, I have 5 full mir­rors at the time of test­ing; and nat­u­ral­ly, if we employ the sort-key trick (find . -type f | sort --key=3 --field-separator="/"), we find a lot of sim­i­lar-sound­ing files:


Note the inter­leav­ing of 5 differ­ent mir­rors, impos­si­ble in a nor­mal left­-to-right alpha­bet­i­cal sort. You can bet that these 4 files (in 15 ver­sions) are going to com­press much bet­ter than if they were sep­a­rated by a few thou­sand other pro­file pages.

So here’s an exam­ple invo­ca­tion (do­ing every­thing in pipelines to avoid disk IO):

find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/"
 | tar --no-recursion --null --files-from - -c
 | xz -9 --extreme --stdout
 > ../mirror.tar.xz

Used on my two Silk Road 2 mir­rors which together weigh 1800M, a nor­mal run with­out the --key/--field-separator options, as men­tioned before, yields a 308M archive. That’s not too bad. Cer­tainly much bet­ter than haul­ing around almost 2GB. How­ev­er—if I switch to the sort-key trick, how­ev­er, the archive is now 271M, or 37M less. Same com­pres­sion algo­rithm, same files, same unpacked results, same speed, just 2 lit­tle obscure sort options… and I get an archive 87% the size of the orig­i­nal.

Not impressed? Well, I did say that the advan­tage increases with the num­ber of mir­rors to extract redun­dancy from. With only 2 mir­rors, the SR2 results can’t be too impres­sive. How about the Pan­dora mir­rors? 5 of them gives the tech­nique more scope to shine. And as expect­ed, it’s even more impres­sive when I com­pare the Pan­dora archives: 71M vs 162M. The sort-keyed archive saves 56% of the reg­u­lar archive’s size. The final Pan­dora archive, , boasts 105 mir­rors from 2013-12-25 to 2014-11-05 of 1,745,778 files (many dupli­cates due to being present over mul­ti­ple crawls) tak­ing up 32GB uncom­pressed. This archive is 0.58GB (606,203,320 bytes) when sort-keyed; when sorted alpha­bet­i­cal­ly, it is instead 7.5GB (7,466,374,836 bytes) or 12.31x larg­er, so the sort-keyed archive saves 92%. Forum mir­rors are even more redun­dant than mar­ket mir­rors because all con­tent added to a forum will be present in all sub­se­quent crawls, so each crawl will yield the pre­vi­ous crawl plus some addi­tional posts; hence, the Pan­dora forum’s raw size of 8,894,521,228 bytes shrinks to 283,747,148 bytes when com­pressed alpha­bet­i­cally but shrinks dra­mat­i­cally to the 31x smaller sort-keyed archive of 9,084,508 bytes, sav­ing 97%!

Clearly in these DNM use-cas­es, sort-key­ing has gone from a cute trick to indis­pens­able.


The sort-key trick is most use­ful when we can infer a lot of spa­tial local­ity just from parts of the file names, it’s not a panacea. There are other approach­es:

  1. if we have mul­ti­ple tem­po­ral­ly-ordered datasets and we don’t mind mak­ing it more diffi­cult to access older copies, it may be sim­pler to store the data as a DVCS like where each dataset is a large patch

  2. sort files by min­i­miz­ing binary differ­ences between them using Timm S. Müller’s binsort util­ity

    The default opti­miza­tion set­ting of -o15 under­per­forms the sort-key trick on both the SR2 and Pan­dora datasets by 5–10MB, and a higher set­ting like -o1000 is best. Note that binsort is in num­ber of files, so it’s lim­ited to sort­ing <10,000 files. An exam­ple pipeline for com­press­ing posts from the Silk Road forums regard­ing a minor Ponzi scheme there, since the file­names offer lit­tle guid­ance for a seman­ti­cal­ly-mean­ing­ful sort:

    find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l vladimir  {} \; >> ponzi.txt
    find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l Limetless {} \; >> ponzi.txt
    mkdir ponzi/ && cp `cat ponzi.txt` ponzi/
    ~/bin/binsort/binsort -t 6 -o 1000 ponzi/ > ponzi-list.txt
    cat ponzi-list.txt | tar --no-recursion --files-from - -c | xz -9 --extreme --stdout > ~/srf1-ponzi.tar.xz

    (Em­bar­rass­ing­ly, binsort seems to under­per­form on this dataset: the files are 380M uncom­pressed, 3.536M sort-keyed, and 3.450M sort­ed.)

  3. if we know there are many dupli­cates but they are far apart by a lex­i­co­graphic sort and we don’t have any good way to sort file­names and a lot of RAM, we can try out a com­pres­sion tool which specifi­cally looks for long-range redun­dan­cies through­out the entire bit­stream, such as (home­page/Arch wiki). lrzip is pack­aged in Debian and should be at least as good as xz since it too uses LZMA; but it is an obscure tool which is not a default install on Linux dis­tri­b­u­tions like xz is, and this makes dis­trib­ut­ing archives to other peo­ple diffi­cult.

  1. I use duplicity & rdiff-backup to backup my entire home direc­tory to a cheap 1.5TB hard drive (bought from Newegg using’s “Stor­age Analysis—GB/$ for differ­ent sizes and media” price-chart); a lim­ited selec­tion of fold­ers are backed up to B2 using duplicity.

    I used to semi­an­nu­ally tar up my impor­tant fold­ers, add redun­dan­cy, and burn them to DVD, but that’s no longer really fea­si­ble; if I ever get a Blu-ray burn­er, I’ll resume WORM back­ups. (Mag­netic media does­n’t strike me as reli­able over many decades, and it would ease my mind to have opti­cal back­up­s.)↩︎

  2. “When the Inter­net Is My Hard Dri­ve, Should I Trust Third Par­ties?”, Wired:

    Bits and pieces of the web dis­ap­pear all the time. It’s called ‘link rot’, and we’re all used to it. A friend saved 65 links in 1999 when he planned a trip to Tus­cany; only half of them still work today. In my own blog, essays and news arti­cles and web­sites that I link to reg­u­larly dis­ap­pear – some­times within a few days of my link­ing to them.

  3. “Going, Going, Gone: Lost Inter­net Ref­er­ences”; abstract:

    The extent of Inter­net ref­er­enc­ing and Inter­net ref­er­ence activ­ity in med­ical or sci­en­tific pub­li­ca­tions was sys­tem­at­i­cally exam­ined in more than 1000 arti­cles pub­lished between 2000 and 2003 in the New Eng­land Jour­nal of Med­i­cine, The Jour­nal of the Amer­i­can Med­ical Asso­ci­a­tion, and Sci­ence. Inter­net ref­er­ences accounted for 2.6% of all ref­er­ences (672/25548) and in arti­cles 27 months old, 13% of Inter­net ref­er­ences were inac­tive.

  4. The Mil­lion Dol­lar Home­page still gets a sur­pris­ing amount of traffic, so one fun thing one could do is buy up expired domains which paid for par­tic­u­larly large links.↩︎

  5. By 2013-01-06, the num­ber has increased to ~12000 exter­nal links, ~7200 to non-Wikipedia domains.↩︎

  6. If each link has a fixed chance of dying in each time peri­od, such as 3%, then the total risk of death is an ; over the time period 2011–2070 the cumu­la­tive chance it will beat each of the 3% risks is 0.1658. So in 2070, how many of the 2200 links will have beat the odds? Each link is inde­pen­dent, so they are like flip­ping a biased coin and form a . The bino­mial dis­tri­b­u­tion, being dis­crete, has no easy equa­tion, so we just ask R how many links sur­vive at the 5th percentile/quantile (a lower bound) and how many sur­vive at the 95th per­centile (an upper bound):

    qbinom(c(0.05, 0.95), 2200, 0.97^(2070-2011))
    # [1] 336 394
    ## the 50% annual link rot hypothetical:
    qbinom(c(0.05, 0.50), 2200, 0.50^(2070-2011))
    # [1] 0 0
  7. 101, ‘Bud­dhism related to Blos­soms’; ; trans. Steven D. Carter, ISBN 0-231-10576-2↩︎

  8. Which I sus­pect is only acci­den­tally ‘gen­eral’ and would shut down access if there were some other way to ensure that Wikipedia exter­nal links still got archived.↩︎

  9. Since Pin­board is a book­mark­ing ser­vice more than an archive site, I asked whether treat­ing it as such would be accept­able and Maciej said “Your cur­rent archive size, grow­ing at 20 GB a year, should not be a prob­lem. I’ll put you on the heavy-duty server where my own stuff lives.”↩︎

  10. Google Cache is gen­er­ally rec­om­mended only as a last ditch resort because pages expire quickly from it. Per­son­al­ly, I’m con­vinced that Google would never just delete colos­sal amounts of Inter­net data—this is Google, after all, the epit­ome of stor­ing unthink­able amounts of data—and that Google Cache merely ceases to make pub­lic its copies. And to request a Google spi­der vis­it, one has to solve a CAPTCHA—so that’s not a scal­able solu­tion.↩︎

  11. Which would not be pub­licly acces­si­ble or sub­mit­table; I know they exist, but because they hide them­selves, I know only from ran­dom com­ments online eg. “years ago a friend of mine who I’d lost con­tact with caught up with me and told me he found a cached copy of a web­site I’d taken down in his employ­er’s equiv­a­lent to the Way­back Machine. His employer was a branch of the fed­eral gov­ern­ment.”.↩︎

  12. Ver­sion 0.1 of my archiver dae­mon did­n’t sim­ply read the file until it was empty and exit, but actu­ally watched it for mod­i­fi­ca­tions with . I removed this func­tion­al­ity when I real­ized that the required WebCite chok­ing (just one URL every ~25 sec­onds) meant that archiver would never fin­ish any rea­son­able work­load.↩︎

  13. Much eas­ier than it was in the past; records his tra­vails with the pre­vi­ous Mozilla his­tory for­mat in the apt­ly-named “when the data­base worms eat into your brain”.↩︎

  14. An older 2010 Google arti­cle put the aver­age at 320kb, but that was an aver­age over the entire Web, includ­ing all the old con­tent.↩︎

  15. Already one runs old games like the clas­sic in emu­la­tors of the DOS oper­at­ing sys­tem like ; but those emu­la­tors will not always be main­tained. Who will emu­late the emu­la­tors? Pre­sum­ably in 2050, one will instead emu­late some ancient but com­pat­i­ble OS—Win­dows 7 or Debian 6.0, per­hap­s—and inside that run DOSBox (to run the DOS which can run the game).↩︎

  16. “BigTable: A Dis­trib­uted Stor­age Sys­tem for Struc­tured Data”, Chang et al 2006:

    BigTable main­tains data in lex­i­co­graphic order by row key. The row range for a table is dynam­i­cally par­ti­tioned. Each row range is called a tablet, which is the unit of dis­tri­b­u­tion and load bal­anc­ing. As a result, reads of short row ranges are effi­cient and typ­i­cally require com­mu­ni­ca­tion with only a small num­ber of machines. Clients can exploit this prop­erty by select­ing their row keys so that they get good local­ity for their data access­es. For exam­ple, in Webtable, pages in the same domain are grouped together into con­tigu­ous rows by revers­ing the host­name com­po­nents of the URLs. For exam­ple, we store data for under the key Stor­ing pages from the same domain near each other makes some host and domain analy­ses more effi­cient.