Easy Cryptographic Timestamping of Files

Scripts for convenient free secure Bitcoin-based dating of large numbers of files/strings
cryptography, archiving, Bitcoin, shell, tutorial
2015-12-04–2017-12-16 finished certainty: highly likely importance: 8


Local archives are useful for personal purposes, but sometimes, in investigations that may be controversial, you want to be able to prove that the copy you downloaded was not modified and you need to timestamp it and prove the exact file existed on or before a certain date. This can be done by creating a cryptographic hash of the file and then publishing that hash to global chains like centralized digital timestampers or the decentralized Bitcoin blockchain. Current timestamping mechanisms tend to be centralized, manual, cumbersome, or cost too much to use routinely. Centralization can be overcome by timestamping to Bitcoin; costing too much can be overcome by batching up an arbitrary number of hashes and creating just 1 hash/timestamp covering them all; manual & cumbersome can be overcome by writing programs to handle all of this and incorporating them into one’s workflow. So using an efficient cryptographic timestamping service (the OriginStamp Internet service), we can write programs to automatically & easily timestamp arbitrary files & strings, timestamp every commit to a Git repository, and timestamp webpages downloaded for archival purposes. We can implement the same idea offline, without reliance on OriginStamp, but at the cost of additional software dependencies like a Bitcoin client.

The most robust way of timestamping is cryptographic timestamping, where a document (such as a downloaded webpage) is hashed using a cryptographic hash function like SHA-256, and then the hash is published; the hash proves that that exact version of the document existed on/before the date the hash was published on. If published to somewhere like Twitter or one’s blog, though, now one has two problems of timestamping, so it is better to use the Bitcoin blockchain, where one can easily timestamp by methods like sending 1 satoshi to the address corresponding to the document’s hash. (Appropriately, Bitcoin itself is an intellectual descendant of earlier Usenet timestamping services.)
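
As a toy illustration of why publishing a hash is a binding commitment (filenames here are arbitrary): the digest commits to the document’s exact bytes, and changing even one character yields a completely different digest:

```shell
# hash a document: the digest commits to its exact contents
printf 'original contents\n' > doc.txt
sha256sum doc.txt

# any modification, however small, produces an entirely different digest,
# so a published digest proves exactly which version existed at the time
printf 'Original contents\n' > doc.txt
sha256sum doc.txt
```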

Making a full Bitcoin transaction for every version of every file one wants to timestamp works, but requires a Bitcoin client to be installed, can become expensive due to transaction fees, is a hassle to do manually, and bloats the Bitcoin blockchain (inasmuch as clients verifying the blockchain must keep track of all addresses with unspent funds, and every timestamping transaction represents an additional such address).

Remote timestamping service

Using services like Proof of Existence solves the install problem but not the hassle or fees (eg Proof of Existence charges $2 as of 2015-12-02).

We can do better by using a service like OriginStamp: OriginStamp is a web service which receives hashes from users, and then each day, it batches together all hashes submitted that day, hashes them, and makes a Bitcoin transaction to that master hash.1 This gives one day-level granularity of timestamping (which might sound bad, but usually day-level precision is fine and in any case, the precision of Bitcoin timestamping is limited by the time delay between each block being mined). To verify any particular hash, one looks up that hash in the OriginStamp archives, finds the day/batch it is part of, hashes the whole batch, and checks that there was a Bitcoin transaction that day. Because OriginStamp only needs to make a single transaction each day, no matter how many hashes are submitted, it has near-zero effect on the Bitcoin blockchain and costs little to run—if one Bitcoin transaction costs 5 cents, then a year of daily transaction fees is <$20 (though OriginStamp accepts donations and I have given $39 to help support it).
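
The batching step itself is trivial to sketch in shell (an illustration of the principle, not OriginStamp’s exact seed format): concatenate the day’s submitted hashes and hash the concatenation, and a single Bitcoin transaction keyed to that one master hash covers every document in the batch:

```shell
# collect the day's submitted document hashes into one batch file
printf '%s\n' \
  '4b357388100f3cdf330bfa30572e7b3779564295a8f5e6e695fa8b2304fa450e' \
  '243d5b9b4f97931a07d02497b8fddb181f9ba72dc37bd914077e3714d0163a2f' > batch.txt

# one master hash covers the whole batch; a single transaction to this
# hash timestamps every submitted document at once
sha256sum batch.txt | cut --delimiter=' ' --field=1
```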

Timestamping files or strings

We can get a free API key and then, thanks to OriginStamp’s API, write a simple Bash shell script using curl & sha256sum to timestamp files or strings, which we will name timestamp, make executable with chmod +x timestamp, and put somewhere in our path:

#!/bin/bash
set -euo pipefail

API_KEY="73be2f5ae81ffa076480ac4d48fa9b2d"

# loop over input targets, hash them whether file or string, and submit:
for TARGET in "$@"; do

 if [ -f "$TARGET" ]; then
  # since it's a valid file, tell `sha256sum` to read it and hash it:
  HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
 else
  # if it's a string we're precommitting to instead, pipe it into `sha256sum`:
  HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
 fi

 echo -n "$TARGET: "
 curl --request POST --header "Content-Type: application/json" --header "Authorization: Token token=$API_KEY" \
      --data "{\"hash_sha256\":\"$HASH\"}" 'http://www.originstamp.org/api/stamps'
 # print newline to keep output tidier since curl doesn't add a final newline to the JSON output
 echo ""
done

Now we can timestamp arbitrary files or strings as we please:

$ timestamp ~/wiki/catnip.page ~/wiki/images/logo/logo.png
# /home/gwern/wiki/catnip.page: {"hash_sha256":"4b357388100f3cdf330bfa30572e7b3779564295a8f5e6e695fa8b2304fa450e",
#  "created_at":"2015-12-02T23:57:56.985Z","updated_at":"2015-12-02T23:57:56.985Z","submitted_at":null,"title":null}
#
# /home/gwern/wiki/images/logo/logo.png: {"hash_sha256":"243d5b9b4f97931a07d02497b8fddb181f9ba72dc37bd914077e3714d0163a2f",
# "created_at":"2015-12-02T23:57:20.996Z","updated_at":"2015-12-02T23:57:20.996Z","submitted_at":null,"title":null}

$ timestamp "Lyndon Johnson was really behind the Kennedy assassination." "Sorry: I ate the last cookie in the jar."
# Lyndon Johnson was really behind the Kennedy assassination.: {"hash_sha256":"4aef69aeaf777251d08b809ae1458c1b73653ee5f78699670d37849f6f92d116",
# "created_at":"2015-12-02T23:58:57.615Z","updated_at":"2015-12-02T23:58:57.615Z","submitted_at":null,"title":null}
#
# Sorry: I ate the last cookie in the jar.: {"hash_sha256":"508190d52a6dfff315c83d7014266737eeb70ab9b95e0cab253639de383a0b44",
# "created_at":"2015-12-02T23:59:03.475Z","updated_at":"2015-12-02T23:59:03.475Z","submitted_at":null,"title":null}

Timestamping version control systems

Given this script, we can integrate timestamping elsewhere - for example, into a Git repository of documents using its post-commit hook feature. We could write out the full curl call as part of a self-contained script, but we already factored the timestamping out as a separate shell script. So setting it up and enabling it is now as simple as:

echo 'timestamp $(git rev-parse HEAD)' >> .git/hooks/post-commit
chmod +x .git/hooks/post-commit

Now on each commit we make, the hash of the last commit will be timestamped, and we can take the repo later and prove that all of the content existed before a certain day; this might be source code, but also anything one might want to track changes to - interviews, web page archives, copies of emails, financial documents, etc.

This approach generalizes to most version control systems built on cryptographic hashes as IDs, where timestamping the ID-hashes is enough to assure the entire tree of content. (I’m not sure about other VCSes; perhaps the post-commit hooks could timestamp entire revisions/patches?)
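
To see why stamping just the commit ID suffices in Git’s case, we can inspect the objects directly in a throwaway repository: each commit object embeds the hash of its tree and of its parent commit, so the hashing is recursive and one commit ID transitively pins every file of every prior revision:

```shell
# build a throwaway repository to inspect Git's hash structure
cd "$(mktemp -d)" && git init --quiet .
git config user.name demo && git config user.email demo@example.com
echo 'some content' > file.txt
git add file.txt && git commit --quiet -m 'initial commit'

# the commit object embeds its tree hash (and, later, parent commit hashes):
git cat-file -p HEAD

# the tree object in turn lists the hash of every file and subdirectory:
git cat-file -p 'HEAD^{tree}'
```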

There have long been concerns that SHA-1 is increasingly weak; as of 2017, collisions can be generated at feasible costs, so timestamps of SHA-1 hashes no longer prove as much as they used to.

It might be possible to use a tool like git-evtag for hashing the entire repository history including the changes themselves (rather than just the IDs), and timestamp this master hash instead of the latest-revision hash. Alternately, since there are no worries about SHA-256 being broken anytime soon, one could write a post-commit script to directly parse out a list of modified files & timestamp each file; in which case, every version of every file has its own separate SHA-256-based timestamp. (The disadvantage here is also an advantage as it enables selective disclosure: if you are timestamping the entire Git repository, then to subsequently prove the timestamp to a third party, you must provide the entire repository so they can replay it, see what the final state of the relevant file is, and check that it contains what you claim it contains and that the relevant revision’s SHA-1 is correctly timestamped; but if you have timestamped each file separately, you can provide just the relevant version of the relevant file from your repository, rather than every version of every file prior. The tradeoff here is similar to that of timestamping a hash of a batch vs timestamping individual hashes.

Probably the best approach is to timestamp each file at the beginning, use VCS timestamps subsequently for regular activity, and every once in a long while timestamp all the files again; then for slow-changing files, one will probably be able to reveal a useful timestamp without needing to reveal the whole VCS history as well, while still having backup timestamps of the whole VCS in case very fine-grained timestamps turn out to be necessary.)
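
The “timestamp all the files again” step is a one-liner, assuming the timestamp script from earlier is on the PATH and the repository is Git:

```shell
# timestamp the current version of every tracked file individually,
# so any one file's date can later be proven without revealing the rest
git ls-files -z | xargs --null timestamp
```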

Timestamping downloaded web pages

Automatically tracking Git commits is easy because of the hook functionality, but what if we want to download web pages and then timestamp them? Downloading them normally with wget and then manually calling timestamp on whatever the file winds up being named is a pain, so we want to do it automatically. This gets a little trickier because if we write a script which takes a URL as an argument, we don’t necessarily know what the resulting filepath will be - the URL could redirect us to another version of that page with different arguments, another page on that domain, or to another domain entirely, and then there’s the URL-decoding to deal with.

The simple (and stupid) way is to parse out a filename from the wget output, because it conveniently places the destination filename in a pair of Unicode quote marks, which give us a perfect way to parse out the first2 downloaded filename; this turns out to work well enough in my preliminary testing. Here is a script, wget-archive, which does this and works well with my archiver daemon:

#!/bin/bash
set -euo pipefail

cd ~/www/

USER_AGENT="Firefox 12.4"
FILE=$(nice -n 20 ionice -c3 wget --continue --unlink --page-requisites --timestamping -e robots=off \
                                  --reject .exe,.rar,.7z,.bin,.jar,.flv,.mp4,.avi,.webm,.ico \
                                  --user-agent="$USER_AGENT" "$@" 2>&1 \
       | egrep 'Saving to: ‘.*’' | sed -e 's/Saving to: ‘//' | tr -d '’' | head -1 )

timestamp "$FILE"

Aturban et al 2017 discuss some of the limitations of timestamping systems and point out a problem with this shell script: only the downloaded file (usually the HTML file) is timestamped, and the files necessary for properly displaying it (like JS or images) are downloaded but not timestamped; these additional files could change (either for normal reasons or maliciously) and change the appearance or content of the main HTML file. They provide a modified shell script (sha256_include_all.sh) to include the auxiliary files as well:

#!/bin/bash
rm -rf ~/tmp_www/ && mkdir ~/tmp_www/ # [use of `mktemp` would be better]
cd ~/tmp_www/

USER_AGENT="Firefox 6.4"

FILE=$(nice -n 20 wget --continue --unlink --page-requisites \
                       --timestamping -e robots=off -k \
                       --user-agent="$USER_AGENT" "$1" 2>&1 \
    | egrep 'Saving to: ‘.*’' \
    | sed -e 's/Saving to: ‘//' | tr -d '’')

let "c=0"
for TARGET in $FILE; do
    if [ -f "$TARGET" ]; then
        let "c++"
        CONT=$(cat "$TARGET")
        HASH=$(echo "$CONT" | shasum -a 256 | awk '{print $1}')
        echo "$HASH" >> "allhashes.txt"
    fi
done

if [ $c = 1 ]; then
    FINAL_HASH="$HASH"
else
    FINAL_HASH=$(cat "allhashes.txt" | shasum -a 256 | awk '{print $1}')
fi
echo "Final hash: $FINAL_HASH"

Example usage:

$ sha256_include_all.sh \
http://wsdl-maturban.cs.odu.edu:11011/michael/wayback/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
# Final hash: 2fa7ece06402cc9d89b9cfe7a53e4ec31a4417a34d79fee584c01d706036e8cb

Local timestamping

As convenient as OriginStamp is, and as nice as it is to have only one Bitcoin transaction made per day covering all OriginStamp users, one may not want to rely on it for any number of reasons: sporadic Internet connectivity, uncertainty that OriginStamp’s data will remain accessible in the far future, uncertainty OriginStamp correctly implements the timestamping algorithm, needing to timestamp so much that it would seriously burden OriginStamp’s bandwidth/storage resources & interfere with other users, not wanting to leak volume & timing of timestamps, etc.

This can be done with yet more scripting, a local Bitcoin client with sufficient funds (~$20 should cover a year of usage), and something to convert hashes to Bitcoin addresses (bitcoind & bitcoin-tool respectively for the latter two).

A simple architecture here would be to change timestamper to create hashes of inputs as before, but instead of sending them off to OriginStamp, they are stored in a local file. This file accumulates hashes from every use of timestamper that day. At the end of the time period, another script runs: it

  1. archives the master file to a date-stamped file (replacing it with an empty file to receive future hashes)
  2. hashes the archived file to yield the master hash of that batch of hashes
  3. converts the master hash to a Bitcoin address
  4. finally, calls a local Bitcoin client like Electrum or Bitcoin Core’s bitcoind to make 1 transaction to the address

So let’s say that hashes are being stored in ~/docs/timestamps/; the simpler timestamper script reads just:

#!/bin/bash
set -euo pipefail
MASTER_DIR=~/docs/timestamps/
for TARGET in "$@"; do
    if [ -f "$TARGET" ]; then
        HASH=$(sha256sum "$TARGET" | cut --delimiter=' ' --field=1)
    else
        HASH=$(echo "$TARGET" | sha256sum | cut --delimiter=' ' --field=1)
    fi
    echo "$HASH" >> "$MASTER_DIR"/today.txt
done

The hardest part is converting a SHA-256 hash to a valid Bitcoin address, which involves a number of steps, so in this example I’ll use the lightweight bitcoin-tool for that part. To give an example of bitcoin-tool use, we can verify an OriginStamp timestamp to make sure we’re doing things the same way. Take this test timestamp:

$ echo "I have a secret." | sha256sum
7306a744a285474742f4f9ae8ddae8214fb7625348d578fb3077fb0bae92b8f1

OriginStamp’s page includes the full batch of hashes (the “Transaction Seed”), which we can verify includes 7306a..f1; so far so good. We can then pipe the full list into sha256sum using xclip, which gives us the master hash:

$ xclip -o | sha256sum
7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e  -

7ad6..1e is the hash that the OriginStamp page claims to use as the “Secret”, which also checks out. This hash needs to be converted to a Bitcoin address, so we call bitcoin-tool with the relevant options:

$ bitcoin-tool --network bitcoin --input-type private-key --input-format hex --output-type address --output-format base58check \
    --public-key-compression compressed --input "7ad6b91226939f075d79da12e5971ae6c886a48b8d7284915b74c7340ac6f61e"
1DMQELo9krQDvHHK5nPjbKLQFnnLtUdMFm

1DMQ..Fm is also the same Bitcoin address that OriginStamp claims to send to on that page, so all that remains is to check that some bitcoin was sent to 1DMQ..Fm today, and looking on Blockchain.info, we see that some bitcoins were sent. So we have successfully independently verified that that list of hashes was timestamped on the day it was claimed to have been timestamped, and that OriginStamp is both working correctly & our use of bitcoin-tool to convert a SHA-256 hash to a Bitcoin address is likewise working. With that, we can proceed.

To stamp a batch, we can write a script we’ll call timestamper-flush:

#!/bin/bash
set -euo pipefail

MASTER_DIR=~/docs/timestamps/
DATE=$(date +'%s')
mv "$MASTER_DIR"/today.txt "$MASTER_DIR/$DATE.txt" && touch "$MASTER_DIR"/today.txt
MASTER_HASH=$(sha256sum "$MASTER_DIR/$DATE.txt" | cut --delimiter=' ' --field=1)

BITCOIN_ADDRESS=$(bitcoin-tool --network bitcoin --input-type private-key --input-format hex \
                               --output-type address --output-format base58check --public-key-compression compressed \
                               --input "$MASTER_HASH")

# assuming no password to unlock
## bitcoind walletpassphrase $PASSWORD 1
bitcoind sendtoaddress "$BITCOIN_ADDRESS" 0.00000001 "Timestamp for $DATE" || \
  bitcoind getbalance # no funds?

timestamper-flush can be put into a crontab as simply @daily timestamper-flush (or @hourly/@weekly/@monthly etc), and can be called at any time if necessary. (I have not tested these scripts, for lack of disk space to run a full node, but I believe them to be correct; and if not, the idea is clear and one can implement it as one prefers.)

Now one has an efficient, local, secure timestamping service.


  1. OriginStamp apparently does not use OP_RETURN like Proof of Existence does, which should be more efficient; but OP_RETURN is controversial and limited to 80 bytes (it was reduced even further to 40 bytes, and then increased again to 80 bytes), which I’m not sure is enough storage space for a secure hash.↩︎

  2. Which is typically the web page we care about, and subsequent files are things like CSS or images which don’t need to be timestamped, but if one is paranoid about this, it should be possible to timestamp all the downloaded files by removing the | head -1 call and maybe translating the newlines to spaces for the subsequent timestamp call.↩︎