Bitcoin donations on The Pirate Bay

Downloading and parsing TPB metadata to estimate Bitcoin usage and revenue for uploaders
statistics, Bitcoin, shell, Haskell
2014-02-25–2014-12-10; notes; certainty: log; importance: 2


Background

SDr> gwern, here's a shiny new angle for your cryptocurrencies knowledge file: do crypto donations to warez distributors, like work?
gwern> SDr: how would that work? they put addresses in the READMEs of their torrents or something?
SDr> gwern, specifically eg. scraping Pirate Bay's NFO files for wallet addresses & cross-referencing it with blockchain, is there volume for it,
       such that distributors are incentivized to provide clean cracks / keygens, as opposed to bundling blackmail-ware with it?

TODO: compare against PayPal, Flattr, Gratipay?

Watashi, kininarimasu! ("I'm curious!")

Data

Download

https://archive.org/details/Backup_of_The_Pirate_Bay_32xxxxx-to-79xxxxx
https://github.com/andronikov/tpb2csv

It is more efficient not to download comments, so download.py is patched to skip them:

diff --git a/download.py b/download.py
index 82837e2..9fed0aa 100644
--- a/download.py
+++ b/download.py
@@ -6,12 +6,12 @@
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation, either version 3 of the License, or
 # (at your option) any later version.
-#
+#
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
-#
+#
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.

@@ -21,7 +21,7 @@ import HTMLParser

 import torrent_page
 import filelist
-import comments
+# import comments

 import requests
 import datetime
@@ -39,10 +39,10 @@ def main():
             tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
             if (tp_status_code == 200):
                 filelist.get_filelist(torrent_id, protocol)
-                comments.get_comments(torrent_id, protocol)
+                # comments.get_comments(torrent_id, protocol)
             elif (tp_status_code == 404):
                 print "Skipping filelist..."
-                print "Skipping comments..."
+                # print "Skipping comments..."
             else:
                 print "ERROR: HTTP " + str(tp_status_code)
                 error_file.write(datetime.datetime.utcnow().strftime("[%FT%H:%M:%SZ]") + ' ' + str(torrent_id) + ": ERROR: HTTP " + str(tp_status_code) + '\n')
@@ -58,10 +58,10 @@ def main():
             tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
             if (tp_status_code == 200):
                 filelist.get_filelist(torrent_id, protocol)
-                comments.get_comments(torrent_id, protocol)
+                # comments.get_comments(torrent_id, protocol)
             else:
                 print "Skipping filelist..."
-                print "Skipping comments..."
+                # print "Skipping comments..."
             time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
             time_log.flush()
             break # Success! Break out of the while loop
@@ -72,10 +72,10 @@ def main():
             tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
             if (tp_status_code == 200):
                 filelist.get_filelist(torrent_id, protocol)
-                comments.get_comments(torrent_id, protocol)
+                # comments.get_comments(torrent_id, protocol)
             else:
                 print "Skipping filelist..."
-                print "Skipping comments..."
+                # print "Skipping comments..."
             time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
             time_log.flush()
             break # Success! Break out of the while loop
@@ -102,7 +102,7 @@ if (len(sys.argv) == 2+offset):
     torrent_id = sys.argv[1+offset]
     print torrent_id
     main()
-
+
 elif (len(sys.argv) == 3+offset):
     if (int(sys.argv[1+offset]) > (int(sys.argv[2+offset])+1)):
         for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])-1, -1):
@@ -112,6 +112,6 @@ elif (len(sys.argv) == 3+offset):
         for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])+1):
             print torrent_id
             main()
-
+
 elif (len(sys.argv) > 3 and not https):
     print "ERROR: Too many arguments"

started 2014-02-25, 8:51PM EST

sudo apt-get install python-requests python-beautifulsoup
git clone git@github.com:andronikov/tpb2csv.git
cd tpb2csv
sed -i -e 's/thepiratebay\.sx/thepiratebay.se/' *.py

# End:
# http://thepiratebay.se/recent
# most recent: http://thepiratebay.se/torrent/9666564/Blonde_Avenger_008_%28BlitzWeasel_-_1995%29_%28Talon-Novus-HD%29_%5BNVS-D%5D
# ID: 9666564

python download.py 1 9666564

8,xxx,xxx

21:02:44 <@gwern> https://github.com/tpb-archive?tab=repositories oh, I think I understand: they post only each set of million torrents
21:03:15 <@gwern> before 3m apparently is unavailable, and the latest TPB torrent when I checked a little bit ago was #9,666,564 as in, the 9x series isn’t ready yet
21:03:32 <@gwern> so, it goes 3m, 4m, 5m, 6m, 7m, 8m
21:04:01 <@gwern> so I can download those, and use tpb2csv to pick up the 666k most recent ones
21:04:25 <@quanticle> gwern: If I might ask, why are you archiving TPB torrents?
21:04:45 <@gwern> quanticle: SDr asked how many torrent authors provide bitcoin donation addresses, and how much they’d received

09:06 PM 0Mb$ python download.py 9000000 9666564

git clone https://github.com/tpb-archive/3xxxxxx.git
git clone https://github.com/tpb-archive/4xxxxxx.git
git clone https://github.com/tpb-archive/5xxxxxx.git
git clone https://github.com/tpb-archive/6xxxxxx.git
git clone https://github.com/tpb-archive/7xxxxxx.git
git clone https://github.com/tpb-archive/8xxxxxx.git

http://thepiratebay.se/recent
http://thepiratebay.se/torrent/9669888/Aimi_Nagano_-_Girls_Do_Not_Sit_On_The_Bus

$ cd home/gwern/tpb2csv/ && python download.py 9000977 9666564

cd ~/tpb2csv/ && while true; do (LATEST=$(find ./data/9xxxxxx/ -type d | cut --delimiter='/' --fields=6 | sort --numeric-sort | tail --lines=1); python download.py $LATEST 9670000; sleep 5s); done

Problems with tpb2csv:

  • hardwired to .sx, needs .se
  • fragile, easily bombs out after a few connection failures
  • doesn’t check for already-downloaded?

Desired features:

  • default final target from http://thepiratebay.se/recent
  • optionally disable comments, filelist, metadata, description
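The fragility and re-download problems listed above could be worked around with a small wrapper rather than by restarting the script in a shell loop. The following is only a sketch under assumptions: `downloaded_ids` and `with_retries` are hypothetical helpers that are not part of tpb2csv, the idea that every all-digit directory name under the data directory is a torrent ID is a guess at the on-disk layout, and `IOError` stands in for whatever exception the real fetch raises on a connection failure.

```python
import os
import time


def downloaded_ids(data_dir):
    """Collect every all-digit directory name under data_dir as a torrent ID.

    Assumption: each downloaded torrent gets a directory named after its ID.
    """
    found = set()
    for _root, dirs, _files in os.walk(data_dir):
        found.update(int(d) for d in dirs if d.isdecimal())
    return found


def with_retries(fetch, torrent_id, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(torrent_id), retrying with exponential backoff on IOError.

    The last failure is re-raised so the caller can log it and move on.
    """
    for attempt in range(attempts):
        try:
            return fetch(torrent_id)
        except IOError:
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... base_delay seconds between attempts.
            sleep(base_delay * 2 ** attempt)
```

A driver would build the set of downloaded IDs once, skip any ID already in it, and wrap each remaining fetch in `with_retries`, so that a few dropped connections no longer abort the whole run.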

but could also just figure out all missing IDs

countmissing.hs:

import qualified Data.Set as Set
import System.Environment (getArgs)

-- Usage: runhaskell countmissing.hs LATEST_ID
-- Reads the already-downloaded torrent IDs from ids.txt (one per line) and
-- writes every ID in [9000000..LATEST_ID] that is not among them to missing.txt.
main :: IO ()
main = do args <- getArgs
          let latest = read (head args) :: Int
          ids <- readFile "ids.txt"
          let numbers = Set.fromList (map read (lines ids) :: [Int])
          let allIDs = Set.fromList [9000000..latest]
          let missing = Set.difference allIDs numbers
          writeFile "missing.txt" $ unlines $ map show $ Set.toList missing

and finally, GNU parallel to download each torrent ID separately without mucking around with ranges:

cd ~/tpb/tpb2csv/ && rm ids.txt missing.txt randomized.txt

find ./data/9xxxxxx/ -type d | cut --delimiter='/' --fields=6 | sort --unique | tail --lines=+2 > ids.txt
LATEST_TORRENT="$(elinks -dump 'http://thepiratebay.se/recent' | fgrep 'http://thepiratebay.se/torrent/1' | cut -d '/' -f 5 | head -1)"
runhaskell countmissing.hs $LATEST_TORRENT

# sort --random-sort missing.txt > randomized.txt
# cat randomized.txt | parallel --max-chars=40 --ungroup --jobs 7 -- python -OO download.py

cat missing.txt | tac | parallel --max-chars=40 --ungroup --jobs 1 -- nice python -OO download.py

gwern> there are an amazing number of 404s on tpb. I wonder why
gwern> why would they ever delete a page?
gwern> short of CP
X> They delete spam, viruses, fake uploads, uploads with wrong names. There is tons of that on TPB actually (and usually uploaded in bulk, that's why the 404s usually form "blocks"). Not that much CP.

Processing

$ find . -type f -name "description.txt" -exec grep --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
UniqueID/String                          : 198031419884096486800071953706812228345 (0x94FB76D26E78C1B7A2EA96FDDB10C2F9)
http://easyimghost.com/ImageHosting/8102_7346420e085511e2ba4022000a1e89327.jpg.html
http://easyimghost.com/ImageHosting/8104_83901b420fe611e29797123138133f0a7.jpg.html
http://sharepic.biz/show-image.php?id=b264f152debff549632be27bfe965f86
http://image.bayimg.com/08f6b9b52398b07c3c86e1dc1f3b3d36594e67b8.jpg
http://image.bayimg.com/0b397e88fdefe06fa99acddf86190f4cd2ef3922.jpg
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
http://image.bayimg.com/34eae2b9725eb15e7a58fd6bf6e2fedb2c5af0a7.jpg
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
d17a78495f673990bb6d3ea096e5830bcc7dc4dd
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
B0A544F55125A26C21622D4118739305B9088448
http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
http://image.bayimg.com/6cd5e44b139c1506dfa8f8d59003187454700627.jpg
Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
http://image.bayimg.com/0070020142c02ecbadd55e73f0d60a7169367885.jpg
http://photosex.biz/v.php?id=a7f013597a02d5ec0fffe529da47e7eb
http://image.bayimg.com/b6a1d70ca01e50d91f30ef683756442d4d52a357.jpg


$ find . -type f -name "description.txt" -exec grep --only-matching --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
1980314198840964868000719537068122
346420e085511e2ba4022000a1e89327
3901b420fe611e29797123138133f0a7
152debff549632be27bfe965f86
398b07c3c86e1dc1f3b3d36594e67b8
397e88fdefe06fa99acddf86190f4cd2ef
1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
34eae2b9725eb15e7a58fd6bf6e2fedb2c
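As the output above shows, the regexp also matches fragments of hex hashes and image IDs. One way to weed these out is to keep only candidates that pass the Base58Check checksum used by legacy Bitcoin addresses (those starting with 1 or 3). This is a sketch of that filter, not the processing actually used for these notes; `is_valid_address` and `extract_addresses` are hypothetical names, and the version byte is not checked.

```python
import hashlib
import re

BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
# Same loose pattern as the grep invocation above.
CANDIDATE = re.compile(r"[13][a-zA-Z0-9]{26,33}")


def is_valid_address(candidate):
    """Return True if candidate decodes to 25 bytes with a correct checksum."""
    number = 0
    for char in candidate:
        index = BASE58.find(char)
        if index < 0:  # 0, O, I and l are not in the Base58 alphabet
            return False
        number = number * 58 + index
    try:
        # 1 version byte + 20 hash bytes + 4 checksum bytes
        raw = number.to_bytes(25, "big")
    except OverflowError:
        return False
    payload, checksum = raw[:21], raw[21:]
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def extract_addresses(text):
    """Return the regexp matches in text that pass the checksum test."""
    return [m for m in CANDIDATE.findall(text) if is_valid_address(m)]
```

Most of the false positives above contain the digit 0, which is not in the Base58 alphabet, so they are rejected before any hashing; anything that survives still has to match a 4-byte double-SHA-256 checksum.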

maybe use R? http://beautifuldata.net/2015/01/querying-the-bitcoin-blockchain-with-r/

nice find tpb/ -type f -print0 | sort --zero-terminated --key=6,5 --field-separator="/" | tar --no-recursion --null --files-from - -c | nice xz -9e --check=sha256 --stdout > ~/tpb.tar.xz; alert

I still miss anything newer than ID 8599995 (06-2013). There is this torrent …9xxx, 10xxx and 11xxx are missing (the last torrents uploaded to TPB were 11xxx) …There were the github CSV repos (that I linked now on the page) by some other guy, but github took them down (at least the 4xxxxxx one)… I am not sure why, but they did, and they did it today, more exactly about 10 minutes ago while I was cloning it to my disk …Do you have them? I found your website when I googled it, it seems you did some experiments on it. …If you do, could you upload it somewhere? (Ideally some torrent) If you have newer than 8xxxxxx (as you seem you had), that would be even perfecter

I’m not sure I have as much as you think I have: the TPB downloader broke 2014-09-18, and I stopped scraping. (I was getting tired of having to babysit it, it was using up a lot of bandwidth and disk space, which made my backups much slower, and I wanted to focus on scraping the blackmarkets.) Also, note that I hacked the downloader to only download the metadata I wanted for my Bitcoin analysis; I did not intend or want to mirror all of TPB since I assumed the original archivers were doing that and that TPB itself had better procedures than my scraping. So while I don’t remember editing any of the archives I pulled off GitHub, the more recent files (which are probably what you really want) will not be complete.

Here’s a summary of what I have:

$ ls
03xxxxxx/  04xxxxxx/  05xxxxxx/  06xxxxxx/  07xxxxxx/  08xxxxxx/  09xxxxxx/  10xxxxxx/  11xxxxxx/  tpb2csv/
$ duh */.git/
481M    03xxxxxx/.git/
684M    04xxxxxx/.git/
718M    05xxxxxx/.git/
944M    06xxxxxx/.git/
806M    07xxxxxx/.git/
495M    08xxxxxx/.git/
156K    tpb2csv/.git/
4.1G    total
$ du -ch *
5.6G    03xxxxxx
7.7G    04xxxxxx
8.2G    05xxxxxx
11G     06xxxxxx
9.9G    07xxxxxx
6.4G    08xxxxxx
7.3G    09xxxxxx
3.8G    10xxxxxx
1.8G    11xxxxxx
208K    tpb2csv
62G     total
$ find ~/tpb/ -type f | sort | xz -9e --check=sha256 --stdout > ~/tpb.txt.xz
# https://www.dropbox.com/s/te1zimevmmzi1qg/tpb.txt.xz