Easy Cryptographic Timestamping of Files

Scripts for convenient free secure Bitcoin-based dating of large numbers of files/strings
cryptography, archiving, Bitcoin, shell, tutorial
2015-12-04–2017-12-16 finished certainty: highly likely importance: 8

Local archives are useful for personal purposes, but sometimes, in investigations that may be controversial, you want to be able to prove that the copy you downloaded was not modified and you need to timestamp it and prove the exact file existed on or before a certain date. This can be done by creating a cryptographic hash of the file and then publishing that hash to global chains like centralized digital timestampers or the decentralized Bitcoin blockchain. Current timestamping mechanisms tend to be centralized, manual, cumbersome, or cost too much to use routinely. Centralization can be overcome by timestamping to Bitcoin; costing too much can be overcome by batching up an arbitrary number of hashes and creating just 1 hash/timestamp covering them all; manual & cumbersome can be overcome by writing programs to handle all of this and incorporating them into one's workflow. So using an efficient cryptographic timestamping service (the OriginStamp Internet service), we can write programs to automatically & easily timestamp arbitrary files & strings, timestamp every commit to a Git repository, and timestamp webpages downloaded for archival purposes. We can implement the same idea offline, without reliance on OriginStamp, but at the cost of additional software dependencies like a Bitcoin client.

The most robust way of timestamping is cryptographic timestamping, where a document (such as a downloaded webpage) is hashed using a cryptographic hash function like SHA-256, and then the hash is published; the hash proves that that exact version of the document existed on/before the date the hash was published on. If published to somewhere like Twitter or one's blog, though, now one has two problems of timestamping, so it is better to use the Bitcoin blockchain, where one can easily timestamp by methods like sending 1 satoshi to the address corresponding to the document's hash. (Appropriately, Bitcoin itself is an intellectual descendant of earlier Usenet timestamping services.)
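As a minimal sketch of why a published hash is a meaningful commitment (filenames hypothetical): re-hashing an unchanged file yields the identical digest, while any edit at all yields a completely unrelated one:

```shell
#!/bin/sh
# Demonstration that a SHA-256 hash pins down one exact file version.
tmp=$(mktemp -d)
printf 'draft, version 1\n' > "$tmp/doc.txt"
h1=$(sha256sum "$tmp/doc.txt" | cut -d' ' -f1)
h2=$(sha256sum "$tmp/doc.txt" | cut -d' ' -f1)   # unchanged file: same digest
printf 'draft, version 2\n' > "$tmp/doc.txt"     # any modification at all...
h3=$(sha256sum "$tmp/doc.txt" | cut -d' ' -f1)   # ...gives an unrelated digest
[ "$h1" = "$h2" ] && [ "$h1" != "$h3" ] && echo "modification detected by hash"
rm -rf "$tmp"
```

So anyone holding the original file can recompute the digest and compare it against the published one; a forger would have to find a different file with the same SHA-256, which is believed infeasible.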

Making a full Bitcoin transaction for every version of every file one wants to timestamp works, but requires a Bitcoin client installed, can become expensive due to transaction fees, be a hassle to do manually, and bloats the Bitcoin blockchain (inasmuch as clients verifying the blockchain must keep track of all addresses with unspent funds, and every timestamping transaction represents an additional such address).

Remote timestamping service

Using services like Proof of Existence solves the install problem but not the hassle or fees (eg Proof of Existence charges $2 (₿0.005) as of 2015-12-02).

We can do better by using a service like OriginStamp or TzStamp: OriginStamp is a web service which receives hashes from users, and then each day, it batches together all hashes submitted that day, hashes them, and makes a Bitcoin transaction to that master hash.1 This gives one day-level granularity of timestamping (which might sound bad, but usually day-level precision is fine, and in any case, the precision of Bitcoin timestamping is limited by the time delay between each block and mining). To verify any particular hash, one looks up that hash in the OriginStamp archives, finds the day/batch it is part of, hashes the whole batch, and checks that there was a Bitcoin transaction that day. Because OriginStamp only needs to make a single transaction each day, no matter how many hashes are submitted, it has near-zero effect on the Bitcoin blockchain and costs little to run: if one Bitcoin transaction costs 5 cents, then a year of daily transaction fees is <$20 (though OriginStamp accepts donations and I have given $39 (₿0.1) to help support it).
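The batching idea can be sketched in a few lines of shell (the one-hash-per-line file layout here is illustrative, not OriginStamp's actual internal format): however many hashes are submitted during the day, a single SHA-256 over the accumulated list yields the one "master hash" that gets timestamped on-chain:

```shell
#!/bin/sh
# Batch an arbitrary number of submissions into one master hash.
tmp=$(mktemp -d)
for s in "submission 1" "submission 2" "submission 3"; do   # stand-in user inputs
    printf '%s' "$s" | sha256sum | cut -d' ' -f1 >> "$tmp/batch.txt"
done
MASTER=$(sha256sum "$tmp/batch.txt" | cut -d' ' -f1)        # 1 hash covers them all
echo "a single transaction commits to: $MASTER"
rm -rf "$tmp"
```

Verification runs the same computation in reverse: given the published batch file, anyone can check their own hash appears in it and that the batch re-hashes to the master hash that was transacted.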

Timestamping files or strings

We can get a free API key and then, thanks to OriginStamp's API, write a simple Bash shell script using curl & sha256sum to timestamp files or strings, which we will name timestamp, make executable with chmod +x timestamp, and put somewhere in our path:

#!/bin/bash
set -euo pipefail

API_KEY="0123456789abcdef" # placeholder: substitute your free OriginStamp API key

# loop over input targets, hash them whether file or string, and submit:
for TARGET in "$@"; do

 if [ -f "$TARGET" ]; then
  # since it's a valid file, tell `sha256sum` to read it and hash it:
  HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
 else
  # if it's a string we're precommitting to instead, pipe it into `sha256sum`:
  HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
 fi

 echo -n "$TARGET: "
 curl --request POST --header "Content-Type: application/json" --header "Authorization: Token token=$API_KEY" \
      --data "{\"hash_sha256\":\"$HASH\"}" 'http://www.originstamp.org/api/stamps'
 # print newline to keep output tidier since curl doesn't add a final newline to the JSON output
 echo ""
done

Now we can timestamp arbitrary files or strings as we please:

$ timestamp ~/wiki/catnip.page ~/wiki/images/logo/logo.png
# /home/gwern/wiki/catnip.page: {"hash_sha256":"4b357388100f3cdf330bfa30572e7b3779564295a8f5e6e695fa8b2304fa450e",
#  "created_at":"2015-12-02T23:57:56.985Z","updated_at":"2015-12-02T23:57:56.985Z","submitted_at":null,"title":null}
# /home/gwern/wiki/images/logo/logo.png: {"hash_sha256":"243d5b9b4f97931a07d02497b8fddb181f9ba72dc37bd914077e3714d0163a2f",
# "created_at":"2015-12-02T23:57:20.996Z","updated_at":"2015-12-02T23:57:20.996Z","submitted_at":null,"title":null}

$ timestamp "Lyndon Johnson was really behind the Kennedy assassination." "Sorry: I ate the last cookie in the jar."
# Lyndon Johnson was really behind the Kennedy assassination.: {"hash_sha256":"4aef69aeaf777251d08b809ae1458c1b73653ee5f78699670d37849f6f92d116",
# "created_at":"2015-12-02T23:58:57.615Z","updated_at":"2015-12-02T23:58:57.615Z","submitted_at":null,"title":null}
# Sorry: I ate the last cookie in the jar.: {"hash_sha256":"508190d52a6dfff315c83d7014266737eeb70ab9b95e0cab253639de383a0b44",
# "created_at":"2015-12-02T23:59:03.475Z","updated_at":"2015-12-02T23:59:03.475Z","submitted_at":null,"title":null}

Timestamping version control systems

Given this script, we can integrate timestamping elsewhere - for example, into a Git version control repository of documents using its post-commit hook feature. We could write out the full curl call as part of a self-contained script, but we already factored the timestamping out as a separate shell script. So setting it up and enabling it is now as simple as:

echo 'timestamp $(git rev-parse HEAD)' >> .git/hooks/post-commit
chmod +x .git/hooks/post-commit

Now on each commit we make, the hash of the last commit will be timestamped, and we can take the repo later and prove that all of the content existed before a certain day; this might be source code, but also anything one might want to track changes to - interviews, web page archives, copies of emails, financial documents, etc.

This approach generalizes to most version control systems built on cryptographic hashes as IDs, where timestamping the ID-hashes is enough to assure the entire tree of content. (I'm not sure about other VCSes; perhaps the post-commit hooks could timestamp entire revisions/patches?)
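For Git specifically, we can see why timestamping a single commit ID suffices: the commit object itself records the hash of the entire tree plus its parent commits, so changing anything anywhere in history would change the ID. A throwaway demonstration (repo and filenames hypothetical):

```shell
#!/bin/sh
# Show that a Git commit ID commits to the whole tree and history.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
echo "interview notes" > notes.txt
git add notes.txt
git -c user.name=t -c user.email=t@example.com commit -q -m "add notes"
# The commit object explicitly names the tree hash (and any parent commits),
# so a timestamp of HEAD's ID transitively covers every file in the tree:
git cat-file -p HEAD
```

The `tree <sha>` line in the output is itself a hash over all file contents and names, which is why replaying the repository later lets a third party confirm the timestamped commit contains exactly the claimed content.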

There have long been concerns that SHA-1 is increasingly weak; as of 2017, collisions can be generated at feasible costs, so timestamps of SHA-1 hashes no longer prove as much as they used to.

It might be possible to use a tool like git-evtag for hashing the entire repository history including the changes themselves (rather than just the IDs), and timestamp this master hash instead of the latest-revision hash. Alternately, since there are no worries about SHA-256 being broken anytime soon, one could write a post-commit script to directly parse out a list of modified files & timestamp each file; in which case, every version of every file has its own separate SHA-256-based timestamp. (The disadvantage here is also an advantage as it enables selective disclosure: if you are timestamping the entire Git repository, then to subsequently prove the timestamp to a third party, you must provide the entire repository so they can replay it, see what the final state of the relevant file is, and check that it contains what you claim it contains and that the relevant revision's SHA-1 is correctly timestamped; but if you have timestamped each file separately, you can provide just the relevant version of the relevant file from your repository, rather than every version of every file prior. The tradeoff here is similar to that of timestamping a hash of a batch vs timestamping individual hashes.

Probably the best approach is to timestamp each file at the beginning, use VCS timestamps subsequently for regular activity, and every once in a long while timestamp all the files again; then for slow-changing files, one will probably be able to reveal a useful timestamp without needing to reveal the whole VCS history as well, while still having backup timestamps of the whole VCS in case very fine-grained timestamps turn out to be necessary.)

Timestamping downloaded web pages

Automatically tracking Git commits is easy because of the hook functionality, but what if we want to download web pages and then timestamp them? Downloading them normally with wget and then manually calling timestamp on whatever the file winds up being named is a pain, so we want to do it automatically. This gets a little trickier because if we write a script which takes a URL as an argument, we don't necessarily know what the resulting filepath will be - the URL could redirect us to another version of that page with different arguments, another page on that domain, or to another domain entirely, and then there's the URL-decoding to deal with.

The simple (and stupid) way is to parse out a filename from the wget output because it conveniently places the destination filename in a pair of Unicode quote marks, which give us a perfect way to parse out the first2 downloaded filename; this turns out to work well enough in my preliminary testing of it. A script wget-archive which does this and works well with my archiver daemon:

#!/bin/bash
set -euo pipefail

cd ~/www/

USER_AGENT="Firefox 12.4"
FILE=$(nice -n 20 ionice -c3 wget --continue --unlink --page-requisites --timestamping -e robots=off \
                                  --reject .exe,.rar,.7z,.bin,.jar,.flv,.mp4,.avi,.webm,.ico \
                                  --user-agent="$USER_AGENT" "$@" 2>&1 \
       | egrep 'Saving to: ‘.*’' | sed -e 's/Saving to: ‘//' | tr -d '’' | head -1 )

timestamp "$FILE"

Aturban et al 2017 discuss some of the limitations of timestamping systems and point out a problem with this shell script: only the downloaded file (usually the HTML file) is timestamped, and the files necessary for properly displaying it (like JS or images) are downloaded but not timestamped; these additional files could change (either for normal reasons or maliciously) and change the appearance or content of the main HTML file. They provide a modified shell script (sha256_include_all.sh) to include the auxiliary files as well:

#!/bin/bash
set -euo pipefail

rm -rf ~/tmp_www/ # [use of `mktemp` would be better]
mkdir ~/tmp_www/
cd  ~/tmp_www/

USER_AGENT="Firefox 6.4"

FILE=$(nice -n 20 wget --continue --unlink --page-requisites \
                       --timestamping -e robots=off -k \
                       --user-agent="$USER_AGENT" "$1" 2>&1 \
    | egrep 'Saving to: ‘.*’' \
    | sed -e 's/Saving to: ‘//' | tr -d '’')

let "c=0"
for TARGET in $FILE; do
    if [ -f "$TARGET" ]; then
        let "c++"
        HASH=$(shasum -a 256 "$TARGET" | awk '{print $1}')
        echo "$HASH" >> "allhashes.txt"
    fi
done

if [ "$c" -ge 1 ]; then
    FINAL_HASH=$(shasum -a 256 "allhashes.txt" | awk '{print $1}')
    echo "Final hash: $FINAL_HASH"
fi

Example usage:

$ sha256_include_all.sh \
# Final hash: 2fa7ece06402cc9d89b9cfe7a53e4ec31a4417a34d79fee584c01d706036e8cb

Local timestamping

As convenient as OriginStamp is, and as nice as it is to have only one Bitcoin transaction made per day covering all OriginStamp users, one may not want to rely on it for any number of reasons: sporadic Internet connectivity, uncertainty that OriginStamp's data will remain accessible in the far future, uncertainty OriginStamp correctly implements the timestamping algorithm, needing to timestamp so much that it would seriously burden OriginStamp's bandwidth/storage resources & interfere with other users, not wanting to leak volume & timing of timestamps, etc.

This can be done with yet more scripting, a local Bitcoin client with sufficient funds (~$20 should cover a year of usage), and something to convert hashes to Bitcoin addresses (bitcoind & bitcoin-tool respectively for the latter two).

A simple architecture here would be to change timestamp into a timestamper script which creates hashes of inputs as before, but instead of sending them off to OriginStamp, stores them in a local file. This file accumulates hashes from every use of timestamper that day. At the end of the time period, another script runs: it

  1. archives the master file to a date-stamped file (replacing it with an empty file to receive future hashes)
  2. hashes the archived file to yield the master hash of that batch of hashes
  3. converts the master hash to a Bitcoin address
  4. finally, calls a local Bitcoin client like Electrum or Bitcoin Core's bitcoind to make 1 transaction to the address

So let's say that hashes are being stored in ~/docs/timestamps/; the simpler timestamper script reads just:

#!/bin/bash
set -euo pipefail

MASTER_DIR=~/docs/timestamps

for TARGET in "$@"; do
    if [ -f "$TARGET" ]; then
        HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
    else
        HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
    fi
    echo "$HASH" >> "$MASTER_DIR"/today.txt
done

The hardest part is converting a SHA-256 hash to a valid Bitcoin address, which involves a number of steps, so in this example I'll use the lightweight bitcoin-tool for that part. To give an example of bitcoin-tool use, we can verify an OriginStamp timestamp to make sure we're doing things the same way. Take this test timestamp:

$ echo "I have a secret." | sha256sum

OriginStamp's page includes the full batch of hashes (the "Transaction Seed"), which we can verify includes 7306a..f1; so far so good. We can then pipe the full list into sha256sum using xclip which gives us the master hash:

$ xclip -o | sha256sum
7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e  -

7ad6..1e is the hash that the OriginStamp page claims to use as the "Secret", which also checks out. This hash needs to be converted to a Bitcoin address, so we call bitcoin-tool with the relevant options:

$ bitcoin-tool --network bitcoin --input-type private-key --input-format hex --output-type address --output-format base58check \
    --public-key-compression compressed --input "7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e"

1DMQ..Fm is also the same Bitcoin address that OriginStamp claims to send to on that page, so all that remains is to check that some bitcoin was sent to 1DMQ..Fm today, and looking on Blockchain.info, we see that it was. So we have successfully independently verified that that list of hashes was timestamped on the day it was claimed to have been timestamped, that OriginStamp is working correctly, & that our use of bitcoin-tool to convert a SHA-256 hash to a Bitcoin address is likewise working. With that, we can proceed.

To stamp a batch, we can write a script we'll call timestamper-flush:

#!/bin/bash
set -euo pipefail

MASTER_DIR=~/docs/timestamps

DATE=$(date +'%s')
mv "$MASTER_DIR"/today.txt "$MASTER_DIR"/"$DATE".txt && touch "$MASTER_DIR"/today.txt
MASTER_HASH=$(sha256sum "$MASTER_DIR/$DATE.txt" | cut --delimiter=' ' --field=1)

BITCOIN_ADDRESS=$(bitcoin-tool --network bitcoin --input-type private-key --input-format hex \
                               --output-type address --output-format base58check --public-key-compression compressed \
                               --input "$MASTER_HASH")

# assuming no password to unlock
## bitcoind walletpassphrase $PASSWORD 1
# (note: amounts below the dust limit may be rejected by the network; send a slightly larger amount if so)
bitcoind sendtoaddress "$BITCOIN_ADDRESS" 0.00000001 "Timestamp for $DATE" || \
  bitcoind getbalance # no funds?

timestamper-flush can be put into a crontab as simply @daily timestamper-flush (or @hourly/@weekly/@monthly etc), and can be called at any time if necessary. (I have not tested these scripts, for lack of disk space to run a full node, but I believe them to be correct; and if not, the idea is clear and one can implement it as one prefers.)

Now one has an efficient, local, secure timestamping service.

  1. OriginStamp apparently does not use OP_RETURN like Proof of Existence does, which should be more efficient; but it is controversial and limited to 80 bytes, was reduced even further to 40 bytes, and then increased again to 80 bytes, which I'm not sure is enough storage space for a secure hash.↩︎

  2. Which is typically the web page we care about, and subsequent files are things like CSS or images which don't need to be timestamped, but if one is paranoid about this, it should be possible to timestamp all the downloaded files by removing the | head -1 call and maybe translating the newlines to spaces for the subsequent timestamp call.↩︎