Darknet Market Archives (2013-2015)

Mirrors of ~89 Tor-Bitcoin darknet markets & forums 2011-2015, and related material
Bitcoin, Silk-Road, shell, R, dataset
2013-12-01–2019-05-22 finished certainty: highly likely importance: 9


Dark Net Mar­kets (DNM) are online mar­kets typ­i­cally hosted as Tor hid­den ser­vices pro­vid­ing escrow ser­vices between buy­ers & sell­ers trans­act­ing in Bit­coin or other cryp­to­coins, usu­ally for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pio­neered the busi­ness model in 2011.

From 2013–2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into them; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could.

This uniquely com­pre­hen­sive col­lec­tion is now pub­licly released as a 50GB (~1.6TB uncom­pressed) col­lec­tion cov­er­ing 89 DNMs & 37+ related forums, rep­re­sent­ing <4,438 mir­rors, and is avail­able for any research.

This page doc­u­ments the down­load, con­tents, inter­pre­ta­tion, and tech­ni­cal meth­ods behind the scrapes.

Dark net mar­kets have thrived since June 2011 when Adrian Chen pub­lished his famous Gawker arti­cle prov­ing that Silk Road 1 was, con­trary to my assump­tion when it was announced in January/February 2011, not a scam and was a ful­ly-­func­tional drug mar­ket, a new kind dubbed “dark net mar­kets” (DNM). Fas­ci­nat­ed, I soon signed up, made my first order, and began doc­u­ment­ing how to use SR1 and then a few months lat­er, began doc­u­ment­ing the first known SR1-linked arrests. Mon­i­tor­ing DNMs was easy because SR1 was over­whelm­ingly dom­i­nant and Black­Mar­ket Reloaded was a dis­tant sec­ond-­place mar­ket, with a few irrel­e­van­cies like Deep­bay or Sheep and then the flashy Atlantis.

This idyl­lic period ended with the raid on SR1 in Octo­ber 2013, which ush­ered in a new age of chaos in which cen­tral­ized mar­kets bat­tled for dom­i­nance, the would-be suc­ces­sor Silk Road 2 was crip­pled by arrests and turned into a ghost-­ship car­ry­ing scam­mers, and the mul­ti­sig break­through went beg­ging. The tumult made it clear to me that no mar­ket or forum could be counted on to last as long as SR1, and research into the DNM com­mu­ni­ties and mar­kets, or even sim­ply the mem­ory of their his­to­ry, was threat­ened by bitrot: already in Novem­ber 2013 I was see­ing per­va­sive myths spread through­out the medi­a—that SR1 had $1 bil­lion in sales, that you could buy child pornog­ra­phy or hit­men ser­vices on it, that there were mul­ti­ple Dread Pirate Robert­s—and other dan­ger­ous beliefs in the com­mu­nity (that use of PGP was para­noia & unnec­es­sary, mar­kets could be trusted not to exit-s­cam, that FE was not a recipe for dis­as­ter, that SR2 was not infil­trated despite the staff arrests & even media cov­er­age of a SR1 mole, that guns & poi­son sell­ers were not extra­or­di­nar­ily risky to pur­chase from, that buy­ers were never arrest­ed).

And so, start­ing with the SR1 forums, which had not been taken down by the raid (to help the mole? I won­dered at the time), I began scrap­ing all the new mar­kets, doing so weekly and some­times daily start­ing in Decem­ber 2013. These are the results.

Download

The full archive is available for download from the Internet Archive as a torrent (item page).

A pub­lic rsync mir­ror is also avail­able:

rsync --verbose --recursive rsync://78.46.86.149:873/dnmarchives/ ./dnmarchives/

For individual files (eg the 2 Grams exports), one can download them like so:

rsync --verbose rsync://78.46.86.149:873/dnmarchives/grams.tar.xz rsync://78.46.86.149:873/dnmarchives/grams-20150714-20160417.tar.xz ./

(If the down­load does not start, it may be a Tor­rent client prob­lem related to Getright-web­seed­ing-­sup­port; if the tor­rent does not work, all files can be down­loaded nor­mally over HTTP from the IA item page, but if pos­si­ble, tor­rents are rec­om­mended for reduc­ing the band­width bur­den & error-check­ing.)

Research

Possible Uses

Here are some sug­gested uses:

  • pro­vid­ing infor­ma­tion on ven­dors across mar­kets like their PGP key and feed­back rat­ings
  • iden­ti­fy­ing arrested and flipped sell­ers (eg the Weapon­s­guy sting on Ago­ra)
  • indi­vid­ual drug and cat­e­gory pop­u­lar­ity
  • total sales per day, with con­se­quent turnover and com­mis­sion esti­mates; cor­re­lates with Bit­coin or DNM-related search traf­fic, sub­red­dit traf­fic, Bit­coin price or vol­ume, etc
  • seller life­times, rat­ings, over time and by prod­uct sold
  • losses to DNM exit scams, or seller exit scams
  • reac­tions to exoge­nous shocks like Oper­a­tion Ony­mous
  • sur­vival analy­sis, and pre­dic­tors of exit-s­cams (early final­iza­tion vol­ume; site down­time; new ven­dors; etc)
  • topic mod­el­ing of forums
  • com­pi­la­tions of forum posts on lab tests esti­mat­ing purity and safety
  • com­pi­la­tions of forum-­posted Bit­coin addresses to exam­ine the effec­tive­ness of mar­ket tum­blers
  • stylometric analysis of posters, particularly site staff (what is staff turnover like? do any markets ever change hands?)
  • deanonymiza­tion and infor­ma­tion leaks (eg GPS coor­di­nates in meta­data, user­names reused on the clear­net, valid emails in PGP pub­lic keys)
  • secu­rity prac­tices: use of PGP, life­time of indi­vid­ual keys, acci­den­tal posts of pri­vate rather than pub­lic keys, mal­formed or unus­able pub­lic keys, etc
  • antholo­gies of real-­world pho­tos of par­tic­u­lar drugs com­piled from all sell­ers of them
  • sim­ply brows­ing old list­ings, remem­ber­ing the good times and bad times, the fallen and the free
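
As a rough illustration of the Bitcoin-address idea in the list above, a shell sketch along these lines could pull candidate addresses out of unpacked forum pages. The function name `extract_btc_addresses` is hypothetical, and the pattern only matches the shape of legacy Base58 addresses (those starting with 1 or 3); it does not validate checksums, so it will yield some false positives and will miss any other address format.

```shell
# Hypothetical helper: print the unique strings on stdin that are shaped like
# legacy Bitcoin addresses. Pattern match only; no checksum validation is done.
extract_btc_addresses() {
    grep -Eo '\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b' | sort -u
}

# Example usage over an unpacked forum crawl (the directory name is a placeholder):
#   find ./some-forum-crawl -type f -exec cat {} + | extract_btc_addresses
```

Any addresses found this way would still need to be checked by hand or with a proper validator before being used in an analysis.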

Works using this dataset

Papers:

Media:

Posts or arti­cles:

Citing

Please cite this resource as:

  • Gwern Bran­wen, Nico­las Christin, David Décary-Hé­tu, Ras­mus Munks­gaard Ander­sen, StExo, El Pres­i­den­te, Anony­mous, Daryl Lau, Sohh­lz, Delyan Kratunov, Vince Cakic, Van Buskirk, Whom, Michael McKen­na, Sigi Goode. “Dark Net Mar­ket archives, 2011–2015”, 2015-07-12. Web. [ac­cess date] /DNM-archives

    @misc{dnmArchives,
        author = {Gwern Branwen and Nicolas Christin and David Décary-Hétu and
                  Rasmus Munksgaard Andersen and StExo and El Presidente and Anonymous
                  and Daryl Lau and Sohhlz and Delyan Kratunov and Vince Cakic and Van Buskirk
                  and Whom and Michael McKenna and Sigi Goode},
        title = {Dark Net Market archives, 2011-2015},
        howpublished = {\url{https://www.gwern.net/DNM-archives}},
        url = {https://www.gwern.net/DNM-archives},
        type = {dataset},
        year = {2015},
        month = {July},
        timestamp = {2015-07-12},
        note = {Accessed: DATE} }

Donations

A dataset like this owes its exis­tence to many par­ties:

  • the DNMs could not exist with­out vol­un­teers and non­prof­its spend­ing the money to pay for the band­width used by the Tor net­work; these scrapes col­lec­tively rep­re­sent ter­abytes of con­sumed band­width. If you would like to donate towards keep­ing Tor servers run­ning, you can donate to Torserver­s.net or the Tor Project itself
  • the Internet Archive hosts countless amazing resources, of which this is only one, and is a unique Internet resource; they accept Bitcoin
  • col­lat­ing and cre­at­ing these scrapes has absorbed an enor­mous amount of my time & energy due to the need to solve CAPTCHAs, launch crawls on a daily or weekly basis, debug sub­tle glitch­es, work around site defens­es, peri­od­i­cally archive scrapes to make disk space avail­able, pro­vide host­ing for some scrapes released pub­licly etc (my arbtt time-logs sug­gest >200 hours since 2013); I sub­sist pri­mar­ily on dona­tions & I thank my sup­port­ers for their patience dur­ing this long pro­ject.

Contents

There are ~89 mar­kets, >37 forums and ~5 other sites, rep­re­sent­ing <4,438 mir­rors of >43,596,420 files in ~49.4GB of 163 com­pressed files, unpack­ing to >1548GB; the largest sin­gle archive decom­presses to <250GB. (It can be burned to 3 25GB BDs or 2 50GB BDs; if the for­mer, it may be worth gen­er­at­ing addi­tional FEC.)

These archives are xz-compressed tarballs; typically each subfolder is a single date-stamped (YYYY-MM-DD) crawl using wget, with the default directory/file layout. The majority of the content is HTML, CSS, and images (typically photos of item listings); images are space-intensive & omitted from many crawls, but I feel that images are useful to allow browsing the markets as they were and may be highly valuable in their own right as research material, so I tried to collect images where applicable. (Child porn is not a concern as all DNMs & DNM forums ban that content.) Archives sourced from other people follow their own particular conventions. Mac & Windows users may be able to uncompress using their built-in OS archiver, 7zip, Stuffit, or WinRAR; the PAR2 error-checking can be done using par2, QuickPar, Par Buddy, MultiPar or others depending on one’s OS.

If you don’t want to uncom­press all of a par­tic­u­lar archive, as they can be large, you can try extract­ing spe­cific files using archiver-spe­cific options; for exam­ple, a SR2F com­mand tar­get­ing a par­tic­u­lar old forum thread:

tar --verbose --extract --xz --file='silkroad2-forums.tar.xz' --no-anchored --wildcards '*topic=49187*'
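
Before extracting, it can help to see which paths in an archive match a string at all. A minimal shell sketch, assuming a tar that auto-detects compression when reading (as GNU tar does); `list_archive_matches` is a hypothetical helper name, not part of the archive's tooling. A tarball has no central index of its members, so tar still has to read through the whole compressed stream, which is slow on the largest archives.

```shell
# Hypothetical helper: list archive members whose path contains a fixed string.
# $1 = archive path, $2 = substring to look for (eg 'topic=49187')
list_archive_matches() {
    tar -tf "$1" | grep -F -- "$2"
}

# Example usage:
#   list_archive_matches silkroad2-forums.tar.xz 'topic=49187'
```

The listed paths can then be passed to an extract command like the one above.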

Overall Coverage

Most of the material dates from 2013 to 2015; some archives sourced from other people (before I began crawling) may date from 2011–2012.

Specif­i­cal­ly:

  • Mar­kets:

    • 1776
    • Abraxas
    • Agape
    • Agora
    • Alpaca
    • AlphaBay
    • Ama­zon Dark
    • Anar­chia
    • Androm­eda
    • Area51
    • Armory3
    • Atlantis
    • Black­Bank Mar­ket
    • Black Gob­lin
    • Black­Mar­ket Reloaded
    • Black Ser­vices Mar­ket
    • Blooms­field
    • Blue Sky Mar­ket
    • Break­ing Bad
    • bungee54
    • Buy­It­Now
    • Cannabis Road 1
    • Cannabis Road 2
    • Cannabis Road 3
    • Can­tina
    • Cloud9
    • Crypto Mar­ket / Dia­bo­lus
    • Dark­Bay
    • Dark­list
    • Dark­net Heroes
    • DBay
    • Deep­zon
    • Doge Road
    • Dream Mar­ket
    • Drugslist
    • East India Com­pany
    • Evo­lu­tion
    • Free­Bay
    • Free­dom Mar­ket­place
    • Free Mar­ket
    • Grey­Road
    • Havana/Absolem
    • Haven
    • Hori­zon
    • Hydra
    • Iron­clad
    • Kiss
    • Mid­dle Earth
    • Mr Nice guy 2
    • Nucleus
    • Onion­shop
    • Out­law Mar­ket
    • Oxy­gen
    • Panacea
    • Pan­dora
    • Pigeon
    • Pirate Mar­ket
    • Posei­don
    • Project Black Flag
    • Sheep
    • Silk Road 1
    • Silk Road 2
    • Silk Road Reloaded (I2P)
    • Silk­street
    • Sim­ply Bear
    • The Black­Box Mar­ket
    • The Majes­tic Gar­den
    • The Mar­ket­place
    • The RealDeal
    • Tochka
    • TOM
    • Topix 2
    • Tor­Bay
    • Tor­Bazaar
    • TorE­scrow
    • Tor­Mar­ket
    • Tor­tuga 2
    • Under­ground Mar­ket
    • Utopia
    • Vault43
    • White Rab­bit
    • Zanz­ibar Spice
  • Forums:

    • Abraxas
    • Agora
    • Androm­eda
    • Black Mar­ket Reloaded
    • Black­Bank Mar­ket
    • bungee54
    • Cannabis Road 2
    • Cannabis Road 3
    • Dark­Bay
    • Dark­net heroes
    • Dia­bo­lus
    • Doge Road
    • Evo­lu­tion
    • Gob­o­tal
    • Grey­Road
    • Havana/Absolem
    • Hydra
    • King­dom
    • Kiss
    • Mr Nice Guy 1
    • Nucleus
    • Out­law Mar­ket
    • Panacea
    • Pan­dora
    • Pigeon
    • Project Black Flag
    • Revolver
    • Silk Road 1
    • Silk Road 2
    • TOM
    • The Cave
    • The Hub
    • The Majes­tic Gar­den
    • The RealDeal
    • TorE­scrow
    • Tor­Bazaar
    • Tor­tuga 1
    • Under­ground Mar­ket
    • Unitech
    • Utopia
  • Mis­cel­la­neous:

    • Assas­si­na­tion Mar­ket
    • Cryuserv
    • DNM-related documents
    • DNStats
    • Grams
    • Ped­o­fund­ing
    • SR2­doug’s leaks

Missing or incomplete:

  • BMR
  • SR1
  • Blue Sky
  • Tor­Mar­ket
  • Deep­bay
  • Red Sun Mar­ket­place
  • San­i­tar­ium Mar­ket
  • EXXTACY
  • Mr Nice Guy 2

Interpreting & analyzing

Scrapes can be dif­fi­cult to ana­lyze. They are large, com­pli­cat­ed, redun­dant, and highly error-prone. They can­not be taken at face-­val­ue.

No matter how much work one puts into it, one will never get an exact snapshot of a market at a particular instant: listings will go up or down as one crawls, vendors will be banned and their entire profile & listings & all feedback vanish instantly, Tor connection errors will cause a nontrivial % of page requests to fail, the site itself will go down (Agora especially), and Internet connections are imperfect. Scrapes can get bogged down in a backwater of irrelevant pages, spend all their time downloading a morass of on-demand generated pages, the user login can expire or be banned by site administrators, etc. If a page is present in a scrape, then it probably existed at some point; but if a page is not present, then it may not have existed, or it may have existed but not been downloaded for any of a myriad of reasons. At best, a scrape is a lower bound on how much was there.

So any analy­sis must take seri­ously the incom­plete­ness of each crawl and the fact that there is a lot and always will be a lot of miss­ing data, and do things like focus on what can be inferred from ‘ran­dom’ sam­pling or explic­itly model incom­plete­ness by using mar­kets’ cat­e­go­ry-­coun­t-list­ings. (For exam­ple, if your down­load of a mar­ket claims to have 1.3k items but the cat­e­gories’ claimed list­ings sum to 13k items, your down­load is prob­a­bly highly incom­plete & biased towards cer­tain cat­e­gories as well.) There are many sub­tle bias­es: for exam­ple, there will be upward biases in mar­kets’ aver­age review rat­ings because sell­ers who turn out to be scam­mers will dis­ap­pear from the mar­ket scrapes when they are banned, and few of their cus­tomers will go back and revise their rat­ings; sim­i­larly if scam­mers are con­cen­trated in par­tic­u­lar cat­e­gories, then using a sin­gle snap­shot will lead to biased results as the scam­mers have already been removed, while uncon­tro­ver­sial sell­ers last a lot longer (which might lead to, say, e-book sell­ers seem­ing to have many more sales than expect­ed).
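
The category-count check in the parenthetical above is simple arithmetic. A minimal shell sketch, where `coverage_pct` is a hypothetical helper and both counts are supplied by the analyst (one from counting listing pages in a crawl, the other from summing the counts the market itself displays per category):

```shell
# Hypothetical helper: percentage of claimed listings actually present in a crawl.
# $1 = listings found in the download, $2 = sum of the categories' claimed counts
coverage_pct() {
    awk -v got="$1" -v claimed="$2" 'BEGIN {
        if (claimed <= 0) { print "NA"; exit }   # no usable claimed total
        printf "%.1f\n", 100 * got / claimed
    }'
}

# The example from the text, 1.3k downloaded of 13k claimed:
#   coverage_pct 1300 13000    # prints 10.0
```

A low percentage only says that the crawl is incomplete; it does not say which categories are under-sampled, which needs the per-category comparison.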

The con­tents can­not be taken at face-­value either. Some ven­dors engage in review-stuff­ing using shills. Meta­data like cat­e­gories can be wrong, manip­u­lat­ed, or mis­lead­ing (a cat­e­gory labeled “Musi­cal instru­ments” may con­tain list­ings for pre­scrip­tion drugs—­beta block­er­s—or modafinil or Adder­all may be listed in both a “Pre­scrip­tion drugs” and “Stim­u­lants” cat­e­go­ry). Many things said on forums are lies or bluff­ing or scams. Mar­ket oper­a­tors may delib­er­ately deceive users (Ross Ulbricht claim­ing to have sold SR1, the SR2 team engag­ing in “psy­ops”) or con­ceal infor­ma­tion (the hacks of SR1; the sec­ond SR2 hack) or attack their users (Sheep Mar­ket­place and Pan­do­ra). Dif­fer­ent mar­kets have dif­fer­ent char­ac­ter­is­tics: the com­mis­sion rate on Pan­dora was uni­lat­er­ally raised after it was hacked (caus­ing sales vol­ume to fal­l); SR2 was a noto­ri­ous scam­mer haven due to inac­tive or over­whelmed staff and lack­ing a work­ing escrow mech­a­nism; etc. There is no sub­sti­tute here for domain knowl­edge.

Know­ing this, analy­ses should have some strat­egy to deal with miss­ing­ness. There are a cou­ple tacks:

  • attempt to exploit “ground truths” to explic­itly model and cope with vary­ing degrees of miss­ing­ness; there are a num­ber of ground-truths avail­able in the form of leaked seller data (screen­shots & data), data­bases (leaked, hacked), offi­cial state­ments (eg the FBI’s quoted num­bers about Silk Road 1’s total sales, num­ber of accounts, num­ber of trans­ac­tions, etc)

    For one validation of this set of archives, see Bradley 2019, which is able to compare the SR2 scrapes to data extracted from SR2 by UK law enforcement post-seizure, and finds that any scrape is incomplete (as expected) but that scrapes in general appear to be incomplete in similar ways and usable for analysis. For another attempt at validating, see Soska & Christin 2015, which compares crawl-derived estimates to SR1 sales records produced at Ross Ulbricht’s trial (CSV/discussion), sales figures in the Blake Benthall SR2 criminal complaint, and an Agora seller’s leaked vendor profile; in all cases, the estimates are reasonably close to the ground-truth.

  • assume miss­ing-at-ran­dom and use analy­ses insen­si­tive to that, focus­ing on things like ratios

  • work with the data as is, writ­ing results such that the biases and low­er-bounds are explicit & empha­sized

Individual archives

Some of the archives are unusual and need to be described in more detail.

Aldridge & Decary-Hetu SR1

The September SR1 crawl is processed data stored in SPSS .sav data files. There are various libraries available for reading this format (in R, using the foreign library: library(foreign); sellers <- read.spss("Sellers---2013-09-15.sav", to.data.frame=TRUE)).

AlphaBay 2017 (McKenna & Goode)

A crawl of AlphaBay 2017-01-26–2017-01-28 and data extrac­tion (us­ing a Python script) pro­vided by Michael McKenna & Sigi Goode. They also tried to crawl AB’s his­tor­i­cal inac­tive list­ings in addi­tion to the usual live/active list­ings, reach­ing many of them.

Due to IA upload problems, it is currently hosted separately.

DNStats

DNStats is a ser­vice which peri­od­i­cally pings hid­den ser­vices and records the response & laten­cy, gen­er­at­ing graphs of uptime and allow­ing users to see how long a mar­ket has been down and if an error is likely to be tran­sient. The owner has pro­vided me with three SQL exports of the ping data­base up to 2017-03-25; this data­base could be use­ful for com­par­ing down­time across mar­kets, exam­in­ing the effect of DoS attacks, or regress­ing down­time against things like the Bit­coin exchange rate (pre­sum­ably if the mar­kets still drive more than a triv­ial amount of the Bit­coin econ­o­my, down­time of the largest mar­kets or mar­ket deaths should pre­dict falls in the exchange rate).

For exam­ple, to graph an aver­age of site uptime per day and high­light as an exoge­nous event Oper­a­tion Ony­mous, the R code would go like this:

dnmUptime <- read.delim("dnstats-20150712.sql", na.strings="NULL",
                         nrows=6000000, colClasses=c("factor", "factor", "factor", "integer",
                                                     "factor", "numeric", "numeric", "POSIXct"))
markets <- dnmUptime[dnmUptime$type==1,] # type 1 = markets
dnmUptime <- NULL # save RAM due to dataset size
markets$Date <- as.Date(markets$timestamp)
markets$Up <- markets$httpcode == 200
daily <- aggregate(Up ~ Date + sitename, markets, mean)
library(ggplot2)
qplot(Date, sitename, color=Up, data=daily) + geom_vline(xintercept=as.Date("2014-11-05"), color="red")

The ser­vice is a use­ful one and accepts dona­tions: 1DNstATs59JANuXjbpS5ngWHqvApAhYHBS.

Grams

Grams (sub­red­dit) is a ser­vice pri­mar­ily spe­cial­iz­ing in search­ing mar­ket list­ings; they can pull list­ings from API exports pro­vided by mar­kets (Evo­lu­tion, Cloud9, Mid­dle Earth, Bungee54, Out­law), or they may use their own cus­tom crawls (the rest). They have gen­er­ously given me near-­daily CSV exports of the cur­rent state of list­ings in their search engine, rang­ing from 2014-06-09 to 2015-07-12 for the first archive and 2015-07-14 to 2016-04-17 for the sec­ond. Grams cov­er­age:

  1. first:

    • 1776
    • Abraxas
    • ADM
    • Agora
    • Alpaca
    • AlphaBay
    • Black­Bank
    • Bungee54
    • Cloud9
    • Evo­lu­tion
    • Haven
    • Mid­dle Earth
    • NK
    • Out­law
    • Oxy­gen
    • Pan­dora
    • Silkki­tie
    • Silk Road 2
    • TOM
    • TPM
  2. sec­ond archive:

    • Abraxas
    • Agora
    • AlphaBay
    • Dream Mar­ket
    • Hansa
    • Mid­dle Earth
    • Nucleus
    • Oasis
    • Oxy­gen
    • RealDeal
    • Silkki­tie
    • Tochka
    • Val­halla

The Grams archive has three virtues:

  1. while it does­n’t have any raw data, the CSVs are easy to work with. For exam­ple, to read in all the Grams SR2 crawls, then count & graph total list­ings by day in R:

    DIR <- "blackmarket-mirrors/archive/grams"
    # Grams's SR2 crawls are named like "grams/2014-06-13/SilkRoad.csv"
    gramsFiles <- list.files(path=DIR, pattern="SilkRoad.csv", all.files=TRUE, full.names=TRUE, recursive=TRUE)
    # schema of SR2 crawls eg:
    ## "hash","market_name","item_link","vendor_name","price","name","description","image_link","add_time", \
    ## "ship_from",
    ## "2-11922","Silk Road 2","http://silkroad6ownowfk.onion/items/220-fe-only-tw-x-mb","$220for28grams", \
    ## "0.34349900", "220 FE Only TW X MB","1oz of the same tw x mb as my other listing FE only. Not shipped \
    ##  until finalized. Price is higher for non FE listing.","","1404258628","United States",...
    # most fields are self-explanatory; 'add_time' is presumably a Unix timestamp
    # read in each CSV, note what day it is from, and combine into a single data-frame:
    grams <- data.frame()
    for (i in seq_along(gramsFiles)) {
        crawl <- read.csv(gramsFiles[i], header=TRUE) # named 'crawl' to avoid shadowing base::log
        crawl$Date <- as.Date(gsub("/SilkRoad.csv", "", gsub(paste0(DIR,"/"), "", gramsFiles[i])))
        grams <- rbind(grams, crawl)
    }
    totalCounts <- aggregate(hash ~ Date, length, data=grams)
    summary(totalCounts)
    #       Date                 hash
    #  Min.   :2014-06-09   Min.   : 2846.00
    #  1st Qu.:2014-07-05   1st Qu.: 9584.25
    #  Median :2014-08-26   Median :10527.50
    #  Mean   :2014-08-21   Mean   : 9651.44
    #  3rd Qu.:2014-09-29   3rd Qu.:11165.00
    #  Max.   :2014-11-07   Max.   :19686.00
    library(ggplot2)
    qplot(Date, hash, data=totalCounts)
    # https://i.imgur.com/ucPMvJQ.png

    Other included datasets which are in struc­tured for­mats that may be eas­ier to deal with for pro­to­typ­ing: the Aldridge & Décary-Hétu 2013 SR1 crawl; the SR1 sales spread­sheet (orig­i­nal is a PDF but I’ve cre­ated a usable CSV of it); the BMR feed­back dumps are in SQL, as is DNStats and Christin et al 2013’s pub­lic data (but note the last is so heav­ily redacted & anonymized as to sup­port few analy­ses); and Daryl Lau’s SR2 work may be in a struc­tured for­mat.

  2. the crawls were conducted independently of other crawls, so the two can be used to check each other

  3. the mar­ket data sourced from the APIs can be con­sid­ered close to 100% com­plete & accu­rate, which is rare

The main draw­backs are:

  • the largest mar­kets can be split across mul­ti­ple CSVs (eg EVO.csv & EVO2.csv), com­pli­cat­ing read­ing the data in some­what

  • the export each time is of the cur­rent list­ings, which means that dif­fer­ent days can repeat the same iden­ti­cal crawl data if there was not a suc­cess­ful crawl by Grams in between

  • exports are not avail­able for every day, and some gaps are large. The 2015-01-09 to 2015-02-21 gap is due to a bro­ken Grams export dur­ing this period before I noticed the prob­lem and requested it be fixed; other gaps may be due to tran­sient errors with the cron job:

    @daily ping -q -c 5 google.com && torify wget --quiet --continue
                "http://grams7enufi7jmdl.onion/gwernapi/$SECRETKEY"
                -O ~/blackmarket-mirrors/grams/`date '+\%Y-\%m-\%d'`.zip

    so if my Inter­net was down, or Grams was down, or the down­load was cor­rupted halfway through, then there would be noth­ing that day.
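
To see where the gaps fall, one can walk the calendar between the first and last export and print the days with no file. A minimal sketch, assuming GNU `date` (for the `-d` date arithmetic) and exports named by day as in the cron job above; `missing_dates` is a hypothetical helper.

```shell
# Hypothetical helper: read YYYY-MM-DD dates on stdin (one per line) and print
# every calendar day between the earliest and the latest that is absent.
# Assumes GNU date; stops with an error if a date cannot be parsed.
missing_dates() {
    have=$(sort -u)
    [ -n "$have" ] || return 0
    day=$(printf '%s\n' "$have" | head -n 1)
    last=$(printf '%s\n' "$have" | tail -n 1)
    while [ "$day" != "$last" ]; do
        day=$(date -u -d "$day + 1 day" +%F) || return 1
        # print the day only if it is not among the dates we have
        printf '%s\n' "$have" | grep -qxF -- "$day" || printf '%s\n' "$day"
    done
}

# Example usage against the download directory used by the cron job:
#   ls ~/blackmarket-mirrors/grams/ | sed 's/\.zip$//' | missing_dates
```

This finds days with no export at all; it does not detect the other drawback noted above, consecutive exports that repeat identical crawl data.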

Kilos

The owner of Kilos, a DNM search engine much like Grams, released a CSV on 2020-01-13 of 235,668 reviews scraped from 6 DNMs (Apollon, CannaHome, Cannazon, Cryptonia, Empire, & Samsara):

The data is in the for­mat

site,vendor,timestamp,score,value_btc,comment

Site, vendor, and comment are strings. Site and vendor are both alphanu­mer­ic, while comment may have punc­tu­a­tion and what­not. Line breaks are explicit “\n” in the comment field, and the comment field has quo­ta­tion marks around it to make it eas­ier to sort through. All the data uses Latin char­ac­ters only, no uni­code. timestamp is an inte­ger indi­cat­ing the num­ber of sec­onds since the Unix epoch. Score is 1 for pos­i­tive review, 0 for neu­tral review, and −1 for neg­a­tive review. value_btc is the bit­coin value of the prod­uct being reviewed, cal­cu­lated at the time of the review.

There are some slight prob­lems with the data set as a result of the pain that is scrap­ing these mar­ket­places. All reviews from Cryp­to­nia mar­ket have their time­stamp as 0 because I for­got to decode the dates listed and just used 0 as a place­hold­er. Cryp­to­nia reviews’ score vari­able is unre­li­able, as I acci­den­tally rewrote all scores to 0 on the pro­duc­tion data­base. To cor­rect for this, I rewrote the scores to match a sen­ti­ment analy­sis of the review text, but this is not a per­fect solu­tion, as some reviews are clas­si­fied incor­rect­ly. E.g. “this shit is the bomb!” might be clas­si­fied neg­a­tively despite con­text telling us that this is a pos­i­tive review.

There are a decent number of duplicates, some of which are proper (e.g. “Thanks” as a review appears many many times) and some of which are improper (detailed reviews being indexed multiple times by mistake).
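
Given the format described above, per-site tallies can be done with standard tools. A minimal sketch, assuming the file has no header row and that the first five fields never contain commas (the description says site and vendor are alphanumeric and only comment is quoted); `score_by_site` is a hypothetical helper and the file name in the usage line is a placeholder.

```shell
# Hypothetical helper: count positive/neutral/negative reviews per site from
# Kilos-format CSV on stdin: site,vendor,timestamp,score,value_btc,comment
# Only fields 1 and 4 are read, so commas inside the quoted comment are harmless.
# Output lines: site,positive,neutral,negative
score_by_site() {
    awk -F, '
        $4 ==  1 { pos[$1]++ }
        $4 ==  0 { neu[$1]++ }
        $4 == -1 { neg[$1]++ }
        { sites[$1] = 1 }
        END {
            for (s in sites)
                printf "%s,%d,%d,%d\n", s, pos[s], neu[s], neg[s]
        }' | sort
}

# Example usage (placeholder file name):
#   score_by_site < kilos-reviews.csv
```

Since the Cryptonia scores were reconstructed by sentiment analysis, its row in such a tally is best read separately from the other five markets.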

Information leaks

Diabolus/Crypto Market

Diabolus/Crypto Mar­ket are two mar­kets run by the same team off, appar­ent­ly, the same serv­er. Crypto Mar­ket had an infor­ma­tion leak where any attempt to log in as an exist­ing user revealed the sta­tus bar of that Dia­bo­lus account, list­ing their cur­rent num­ber of orders, num­ber of PMs, and Bit­coin bal­ance, and hence giv­ing access to ground-truth esti­mates of mar­ket turnover and rev­enue. Using my Dia­bo­lus crawls to source a list of ven­dors, I set up a script to auto­mat­i­cally down­load the leaks daily until the hole was finally closed.

Simply Bear

Upon launch, the mar­ket Sim­ply Bear made the ama­teur mis­take of fail­ing to dis­able the default Apache /server-status page, which shows infor­ma­tion about the server such as what HTML pages are being browsed and the con­nect­ing IPs. Being a Tor hid­den ser­vice, most IPs were local­host con­nec­tions from the dae­mon, but I noticed the admin­is­tra­tor was log­ging in from a local IP (the 192.168.1.x range) and curi­ous whether I could de-anonymize him, I set up a script to poll /server-status every minute or so, increas­ing the inter­val as time passed. After two or three days, no naked IPs had appeared yet and I killed the script.

TheRealDeal

TheRealDeal was reported on Reddit in late June 2015 to have an info leak where any logged-in user could browse around a sixth of the order-details pages (which were in a predictable incrementing whole-number format) of all users without any additional authentication, yielding the Bitcoin amount, listing, and all Bitcoin multisig addresses for that order. TRD denied that this was any kind of problem, so I collected order information for about a week.

Modafinil

As part of my interest in the stimulant modafinil, I have been collecting by hand, monthly, scrapes of all modafinil/armodafinil/adrafinil listings across the DNMs; the modafinil archive contains the saved files in MHT or MAFF format from 2013-05-28 to 2015-07-03. Sampled markets include:

  • Abraxas
  • Agora
  • Alpaca
  • AlphaBay
  • Androm­eda
  • Black Bank
  • Blue Sky
  • Cloud-­Nine
  • Crypto/Diabolus
  • Dia­bo­lus
  • Dream
  • East India Com­pany
  • Evo­lu­tion
  • Haven
  • Hydra
  • Mid­dle Earth
  • Nucleus
  • Out­law
  • Oxy­gen
  • Pan­dora
  • Sheep
  • SR2
  • TOM

Pedofunding

A site for child pornog­ra­phy, “Ped­o­fund­ing”, was launched in Novem­ber 2014. It seemed like pos­si­bly the birth of a new DNM busi­ness model so I set up a logged-out scrape to archive its begin­nings (sans any images), col­lect­ing 20 scrapes from 2014-11-13 to 2014-12-02, after which it shut down, appar­ently hav­ing found no trac­tion. (A fol­lowup in 2015 tried to use some sort of min­ing mod­el; it’s unclear why they don’t sim­ply use Dark­leaks, or how far it got before it too van­ished.)

Silk Road 1 (SR1)

Sources:

SR1F

This archive of the Silk Road 1 forums is com­posed of 3 parts, all cre­ated dur­ing Octo­ber 2013 after Silk Road 1 was shut down but before the Silk Road 1 forums went offline some months lat­er:

  1. StEx­o’s archive, released anony­mously

    This excludes the Ven­dor Round­table (VRT) sub­fo­rum, and is believed to have been cen­sored in var­i­ous respects such as remov­ing many of StEx­o’s own posts.

  2. Mous­tache’s archived pages

    Unknown source, may be based on StExo archives

  3. con­sol­i­dated wget spi­der

    After the SR1 bust and StEx­o’s archiv­ing, I began mir­ror­ing the SR1F with wget, logged in as a ven­dor with access to the Ven­dor Round­table; unfor­tu­nately due to my inex­pe­ri­ence with the forum soft­ware Sim­ple Machi­nes, I did not know it was pos­si­ble to revoke your own access to sub­fo­rums with wget and failed to black­list the revo­ca­tion URL. Hence the VRT was incom­pletely archived. I com­bined my var­i­ous archives into a sin­gle ver­sion.

    Simul­ta­ne­ous­ly, qwer­ty­oruiop was archiv­ing the SR1F with a reg­u­lar user account and a cus­tom Node.js script. I com­bined his spi­der with my ver­sion to pro­duce a final ver­sion with rea­son­able cov­er­age of the forums (per­haps 3/4s of what was left after every­one began delet­ing & cen­sor­ing their past post­s).

SR2

Sources:

SR2Doug

In 2015, a pseu­do­nym claim­ing to be a SR2 pro­gram­mer offered for sale, using the Dark­leaks pro­to­col, what he claimed was the username/password dump and SR2 source code. The Dark­leaks pro­to­col requires pro­vid­ing encrypted data and then the rev­e­la­tion of a ran­dom frac­tion of it. This archive is all the encrypted data, decryp­tion keys, and revealed user­names I was able to col­late. (The auc­tion did not seem to go well as the revealed data was not a com­pelling proof, and it’s unclear whether he was the gen­uine arti­cle.)

Previous releases

Some of these archives have been released pub­licly before and are now obso­leted by this tor­rent:

Verification

PAR2 archives are pro­vided for error-­cor­rec­tion, and PGP sig­na­tures for strong integrity check­ing, should that be an issue.

Integrity of the archive can be verified using par2: par2verify ecc.par2. Up to 10% of file damage/loss can be repaired using the supplied PAR2 files and par2repair; see the man page for details.
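
The SHA-256 list in this section can also be checked directly with `sha256sum`, which reads lines of the form `hash  filename`. A minimal sketch, assuming GNU coreutils and that the hash lines have been saved to a file (called `SHA256SUMS` here purely for illustration); `verify_archive` is a hypothetical helper. It checks only file integrity against the list; verifying the PGP signature on the list itself is a separate step.

```shell
# Hypothetical helper: check one downloaded archive against a saved hash list.
# $1 = archive file name exactly as it appears in the list (file must be in
#      the current directory), $2 = file containing "hash  name" lines
# Returns 0 if the hash matches, non-zero if it differs or the name is not listed.
verify_archive() {
    awk -v f="$1" '$2 == f' "$2" | sha256sum --check --status -
}

# Example usage:
#   verify_archive agora.tar.xz SHA256SUMS && echo "agora.tar.xz OK"
```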

Signed SHA-256 hashes of the archives:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

8b05d5fcba36db6889af4fe23d1117a48c39b0808332d32919f9d7c835380721  1776.tar.xz
cc6f54d5818e13fb585b14d6c414fcdbf4d20a4e1ab3aa398f5ce05287a1d1b0  2015-sr2doug-claimedsr2leaks.tar.xz
6e082846f83dc9e06950fc29095491d303f5b336d65bbe6760db2c03d969cf02  abraxas-forums.tar.xz
3dcb6ba24bc3e4f75e13827bb1e2f0632ed269b10e6158bdb554cc50983f1204  abraxas.tar.xz
4231b81aa12d529f4502129683f8d5f1e0ef1f813d252d6edcce9d3b75eecdd2  agape.tar.xz
4838969a87610fe80678ae72a3d631ab2aaa5a6b219cd67226f528d96c4fc958  agora-forums-20140421-whom-astorposts.tar.xz
f6afe2df9238ce5cecea6dac70fd7c4b67a444824eccf07667ca46b15a167734  agora-forums-2014093020141016-rasmusandersen.tar.xz
5730cc4e7e34138aeee934985b937ba8a2ae78f23580ba9a666348fb04fb3583  agora-forums.tar.xz
4e7d5d4f63be66956037d4c27f3b97c0b980addd3ed5029b24904ab69f705c9d  agora.tar.xz
ab9fc0d2324ddbd03fcf5a9e8b9213fc6c650fcb1f7e99f9d3b7a63cd67923af  alpaca.tar.xz
1bbb33eda2094f662d982cad045033541a5fb22e850359883fa3decb5a0d81d2  alphabay.tar.xz
7a61ae8945322455f9b6d0afdad2751847f9a294b951920ea6cccaa8f3b06d86  amazondark.tar.xz
19e634813d8038474460d72e0c5311a7d97a9a2e9e9089eab32a719cf4a0c377  anarchia.tar.xz
8da899bae2e51384afa8d4f839a45371a1b1c5b22a52685f698aced1dba5adbd  andromeda-forums.tar.xz
0c95881e291bde995dc33ae8ee516ca7c8b200cb8dd3b967f8dc62ec5a36b6b2  andromeda.tar.xz
3466f8f9637aab4f2d74ef9c242be7aeff08d5adfadcffe7ca69ce58392a62a9  area51.tar.xz
d9f4f00dba4a44cc7bb45b19d9967046be56b83328c0149697cdf44862438ef2  armory.tar.xz
d9e887e1370f690724e9a178287baf5c85e5e8a900e9e9dae019b795e2afdb6e  assassinationmarket.tar.xz
c959b430f7aef932d26fe389498c6f4d3d7d02421e9d05c204803b009317869b  atlantis-20130921-christin.tar.xz
e1539816b1318badf183152960783697f234ce6c972e90ed2830b119d620313a  blackbankmarket-forums.tar.xz
c9e4940b16078ad2982a55c4c1221054ad3b6a2cac99517d55fc24063a71efdd  blackbankmarket.tar.xz
cbb17ccd867d242ce571ea692a4672474c8330679d3a41e2fff7ebaa511ffd58  blackgoblin.tar.xz
f68f7bb73b47161d8d0499eb062ddd8b4f7b267cad9b2c9179b3a6d309ac9d2b  blackmarketreloaded-20131017-userlist.sql.xz
eed272069f2f057dc6894bbb078041c4bf64db3936a1218cf9f9db9c42518839  blackmarketreloaded-20131225-feedback-wousd.sql.xz
6b0a07ea3cbf67cd60c743a52cdf0427a3e4e587655e3950a75c48fad2f57085  blackmarketreloaded-elpresidente.tar.xz
63d95bc6baa947842247084f0332e8e5ccc465ad112df2fe4d88e1a024aeb5fc  blackmarketreloaded-forums.tar.xz
84598eccbc428ce0325327618f2d7566e55ab799f46e030a1c5b8295e0397fd0  blackservicesmarket.tar.xz
9d0f068823a37eb405b2bf6014ba3051a6cddfb78997111ae1a0c7507c60dd3e  bloomsfield.tar.xz
a4477cf586ff6b18df649e5bfb47d825f2c604c3913b934c235eafa514d0025b  bluesky.tar.xz
8e9b225be42d4f3cff9f835e7f24ba414a6d72e3131d77655f3fc7d05c3b6208  breakingbad.tar.xz
9bb37c2f8b68730b02d38ddf3be04154384f2c79a70505a3324fb8b973e4553c  bungee54-forums.tar.xz
78f5599807f5adc1a068cb86f8a8c7ad194d67d28ef5f451076a40a8587f1776  bungee54.tar.xz
9afedc1135e8a96a61974fb663eaaabef2476bafbc4193dc9f6744402573c98c  buyitnow.tar.xz
f9559a82359cc33f9e9b093d5aa7a6d8b4deebb39aa13841c2fb91ea6f6fdac5  cannabisroad2-forums.tar.xz
db133bef60e5c338757af23809175a8f64a9b4ca1dbebcbf3d8930af590a924a  cannabisroad2.tar.xz
9fca953f118c80f6e61264b513872404ab67b51e06e544bba35284b1fcf8defd  cannabisroad3-forums.tar.xz
173d4f60232941b18a5cdef0c04d45a678fd1f9c4ff0a4a1158266cd1f15c4fa  cannabisroad3.tar.xz
5feeb4f56b4b2c0ab058e45d82543588ec09386f50a3663af53109abb72d66c6  cannabisroad.tar.xz
e0b5355ac6fc07b53dd6ae6767783462173d0e5a62f77b3ca23b699d5f59ce25  cantina.tar.xz
a2db7e54af153958d9d0bac0bf4088ff371e28c7e5510e5fae6b850af88dda8f  cloudnine.tar.xz
9010bcfd779f01508075d341e278dbd412c2350d9fba41bb96a1345494956b40  cryptomarket.tar.xz
66d0236a256059df1ae4f0c6da5e7ded59f83f4534e2293c576575ad0191262e  cryuserv.tar.xz
1dd482381d3a4ff8b30c4750696f1de1fbceb19ce29061ad39f5ce33092239f3  darkbay-forums.tar.xz
366e30bdb6d84e6cbe5d54909d2f49a7f95e0f232ecd886ea53e729f479104e0  darkbay.tar.xz
d7f666e3fd244c299621c6fb7beb20111690e4e7c8786161f1534c23c7836d51  darklist.tar.xz
c6d2478c2a0f860c4b1e8507a5925f699ee39edf8dead1df2cec5d0d94b51af2  darknetheroes-forums.tar.xz
1197eae4c7cb83ed97aa5374365a26b67beea75bf053a9927b2e8948393fe58d  darknetheroes.tar.xz
623ff7d3509727be5936f27ab95cd2b40432f25b0f07e20df7062e5e2cd55217  darknetnation.tar.xz
23e4932551b2a56c12d151d2f14140d5c9a7c25407b766b34d48456c5dbab589  dbay.tar.xz
f8b3cd5c861e7c32147ad720538728f113bcda0f41760ef7475ffbaf26037490  deepzon.tar.xz
2199f5062ad587d355ed683b894ada4dd1529ec50c5f5761b523cdaff9c20b5c  diabolus-cminfoleak-20150220-20150311.tar.xz
f1f6df5855287def19443db64082aa1c7df507991a6968dca6f5f097b024e253  diabolus-forums.tar.xz
42d1a476d9eb6b9b4807789ba08c5791054d41f3d6b9e7506a78a309603bad78  diabolus.tar.xz
ddeed8ce25ef813814522bffe2224f390c84dcdca4dcd0c3023b49d0a63a8b5a  dnstats-20150712.sql.xz
649e311c427398006bf390f7827fe3534026c730a905766cb9f3e78bad82b520  documents.tar.xz
2f2523f4125e64acaa86ebacb8fe2f08fc640608aabc95d747e9319bf9446e12  dogeroad-forums.tar.xz
78079f03495ba405a04860fb546421780f9bc1cdcf06025e7abd29033f77c450  dogeroad.tar.xz
768482dd0aae12fab023497cda437fd290657ac1e9df29a6b65f1b142d1ce8af  dreammarket.tar.xz
229373106b35aa6d72a71f7dc48e90d1da47647cc58348ee0cb768a3926294c4  drugslist.tar.xz
f8a324d215858918d781436a09d51bfaa88c2b9bd59ef6af4a75f52c81891a6c  eastindiacompany.tar.xz
23449de611a42899bcb27db8186d194f7b805ee7e55034ec5ab17adee226aecd  evolution-forums-2014093020141016-rasmusandersen.tar.xz
109eb980c11ed37b29321f6403cb5e95614f3c44525a549164d95d0a52eb94cf  evolution-forums.tar.xz
a6a0ccd588635903f1e914390f36bb9a56f562d37b9e92d6e58dac6364b35b8a  evolution.tar.xz
0b2e5eac28bad63ca832aeeebb8a759dec21bbf2b52eb5f816dc010ab5a825f3  freebay.tar.xz
336c43eb0794174bb8c58cb8b018a8e019a4dd1719a298051b0c0e4ba04a7109  freedommarketplace.tar.xz
61f2037e6245d2e0a23f87df142ff53c0736da26844a3a3f7d869fdd1b835202  freemarket.tar.xz
af4dd8003b015519677c802cc3c19f0910cb79541876be0be719e0c176fe7f5e  galaxy.tar.xz
0d963a63009ef5b581ce705555a608997cfc7220971a26236d8f12b6268c224c  gobotal-20140818-20141102.tar.xz
0cecd5e78416328caf06614ee6a8fabee0d91b8aecddd9ca2d67f059ff7497d6  grams.tar.xz
2dccb3df553b89dfceb5ba4930269ffff4fcd39dc6c876ca6cfc9e85c98bda9a  grandtrunk.tar.xz
2fe55a93c6c7b69b40a5bfe1c1dcd7c0cc4601045696870f1b4dad460c93ea70  greyroad-forums.tar.xz
419e97c0c28784e6077f296746bf2ae5b4899cc0fef2756108c3b5c3d5ed9b13  greyroad.tar.xz
d7624f290f63642d3d875d0b94baf84af89cd63e2abab57c1889bf8d18883596  havanaabsolem-forums.tar.xz
94bafe76779807cdf7cc86d0534da64155b22e40db79f1bb801e865becd44fc6  havanaabsolem.tar.xz
32475d62c6ff9cce00063b6473576782a2941bf1dc2e05a0f9a6bc9880ed91c3  haven.tar.xz
b69715d148fa02e87af8143d36152f4deda57b39f85fe4da47e8090e5e93c348  horizon.tar.xz
b06b7f272934b661920eae5ba9cc3ac8480c8e94ca86d7ab039988cdbf348f2a  hydra-forums.tar.xz
0cf4eda89b71d17a9a539599053e06f4fed4322c0ea306edb6e30c950ab0d16b  hydra.tar.xz
cebec4d92f705475a61ab0fe66c905d509c737139276e96c4c8826539bdd2e07  ironclad.tar.xz
deb71f9e282bbc477c16c922ea8731ecc8817244808619fe881c22467df1d213  kingdom-forums.tar.xz
466772600b49a37d6f5078c1534d889f0b3d3d7ccb165228292e1121217395fd  kiss-forums.tar.xz
74436c0b38dab5007ad212e5c8bb7f1d67708fbdfbbaf6488a80ea637cdcd912  kiss.tar.xz
73ed19cbc40d0d313cf91ed68c7c8f931438238605076bea95c6db7e41a382bd  middleearth.tar.xz
69e783616806f90715b3a63b8f8623ca7ea83f81a48b71e0fadbfa85dfca214f  modafinil.tar.xz
fc29a84ba388a0bf7aa7c27437ea2e53462bfdb527f00c45958b2d15a43237ef  mrniceguy2.tar.xz
796fa38de4eae84797ce07c30a158123b61224dffdb6e94dfd5be39f8a96a187  mrniceguy-forums.tar.xz
146f2ae90fd4fa25932f43596e621065204a07ca5b8149d4e6af142abea32597  mtgox-2011-usernamepasswordleak.csv.xz
0d4136f8e59a4cedfbfac30da33a846d42ed1c9e6e1af8ed030be8ac42e42522  mtgox-20140309-leak.tar.xz
e22b5c83f04ac244e4e77bad4e91588642373a371b3b5606c311a5021bd2eba2  nucleus-forums.tar.xz
87fb7a67bfd55f25f882fbf10e10c82bf2872721109f47728192b5be0e830252  nucleus.tar.xz
ff975d6dc3c91c5b2fd42a86c54acecfed17616dcd80ba5a320ff4b4df2e89fd  onionshop.tar.xz
1b95c06289b081c1dc674dc5d4e055f61fd1609b8a75d5a65a51134407639c11  outlawmarket-forums.tar.xz
4d7d1c24197c89252d515e35ef1bc3c80543180e952ed3e6aae821eb48d17d4c  outlawmarket.tar.xz
11327c8c1915e802cd6083e590217e8e93b19767c9453fc62291e24b96a0a420  oxygen.tar.xz
5355211f6e1b8a338115ef10b2c8498af3b4ee494405b51147f1ffe27645d7b5  panacea-forums.tar.xz
58a76cba9c7ca06c4d92ce03bb39bddf24f15dabeee508f2004f0158bf1aca70  panacea.tar.xz
ed17677aa7269d725cdd81fc1832655a76b3ab701a0ca356b1182443622bedd7  pandora-elpresidente.tar.xz
9f9de82834b46973a5712a6b1dcabe3cb2af1b3c42348d3f2ab4534b59f64dc6  pandora-forums-20140421-whom-astorposts.tar.xz
29bb6c5add500b077b3545559871eda0515887f8847380f1024072ce6cc785aa  pandora-forums.tar.xz
d6e00fb115cecb5739e72c994243edf3199a7b2c9524ebe1e55983bcd2dbc894  pandora.tar.xz
0dfcfdac5d359b508efae9c50cb861f5403924e047de00831db758841a469bfa  pedofunding.tar.xz
427bc78c1e466a7bdc7f0b667d125aced3de76da7bfd8fed5fce564f44421372  pigeon-forums.tar.xz
6fe6fd24b0b604ec70b9e56610743f3bdf91683d24e6ade3a149ecd61b7b787f  pigeon.tar.xz
bd634bf2b2943fb1d01c548f1d731d86c8344d319b799a03a9197874e8e01772  piratemarket.tar.xz
f8dbee89392ebced3a529a972e19c5146aaa3cfe8ce9d25005f538d41b47c2ed  poseidon.tar.xz
71b44fc678bebb8122ddfdba02e2ef80335f72eaf49b4f11ef3204ee7f29ec35  projectblackflag-20131103-anonymous-logsdump.tar.xz
0000462319ea6467b0a25f070f659124966518da3adce1a0fa92d81a84a24e59  projectblackflag-forums.tar.xz
b2ec62fbe54b8148f7e6e7738b84d0d7d45c6b7a91b951494a9a8ab20769e24b  revolver-forums.tar.xz
4f8573bded758c065f86c1eae189d69c1ad622fb6558d10d4aef780e699e09c2  sheep-elpresidente.tar.xz
073829fc8ae4fe9e6920b2c3232bc253ebe6c877b29264a569651e5d76c3b191  sheep.tar.xz
4099f3d49d74d8828b12d8ff532979531c5ca31092985457e93f5f5e9fafbdc1  silkroad1-20111103-delyankratunov.tar.xz
57b641200c30bf6a801fe2faf462d507fcc99c678567943f25af9d0c51970879  silkroad1-20120722-vanbuskirk.docx
59e72f95201726cc46d9680f97a53f44c45f242b57a96567916c4cb76a863d5e  silkroad1-20120723-christin-censored.tar.xz
da8726427d1b13f850a9647a34757ee95be000c036a5ec370e8f43b01fde6609  silkroad1-20130703-anonymous.tar.xz
a3fe8ec72186e7ec02fe206f92616688fae07b756f06a555bd8f306a92b0451b  silkroad1-20130915-aldridgehetu.tar.xz
12876b0783fb928a9c982dff048155fae331b174e08847e66a3100a9f74c9369  silkroad1-forums-20130703-anonymous.tar.xz
5533a90285c0d072d62ebf681cfe717987dfe595f13b96e1e8dc9ae1ed7274ab  silkroad1-forums-20131103-gwernrasmusandersen.tar.xz
3a28097c243843cc69d365b1c6456075679bfa09cd3a50daa6105a0c7f4df837  silkroad1-forums-anonymous.tar.xz
37db1b2eab69923e22cb0d2ee65426152cb11ab09d92d1d6013a2fe7f20aa7d0  silkroad1-forums-stexo.tar.xz
eac0013182b996b4a77f446a28ffabd74f23ea0fa32eeaa6f3bc499081c372c8  silkroad1-forums.tar.xz
ab1ffac3b85b9cbb2d7ff80ed28a1899561f945758196ba3976dbb2e5b8b4c21  silkroad1-vendorprofiles-stexo.tar.xz
2df744013fedfdacfd349472e05981316dbf392ccb56e627ff6d6f09b4ad7a8a  silkroad1-wiki.tar.xz
1c8e643eade9750b39485c5e101f65d2c12ec977cb7b681cd8df064eccf4c0e7  silkroad2-20140129-sohhlz-vendors.tar.xz
3381cd4305c4cd909aa86cf218a1022e6be5ed227d6eb728603c41b9956c7a28  silkroad2-20140927-daryllau.tar.xz
7367dc56f15f61212d8567033a4d3a9468622e05f86d38607a70d5686164648a  silkroad2-forums-20140419-whom-astorposts.tar.xz
0900093d7100b4faf983707b4b1e0ec1fae3c4b18270eaa8eedfe4f8b69a6e23  silkroad2-forums-2014093020141016-rasmusandersen.tar.xz
a473132cb8eec64aea2066628a24628a0c1eb38c195c9945c700dd19f1f972f2  silkroad2-forums.tar.xz
2abc793c7fdfce31d375db11307b66aa69cb91f4c684408840d546bf4e61e41b  silkroad2.tar.xz
3384789112185d81544dcad5bc69967cd44b097b7a772da48f5a1226b43155de  silkroadreloaded.tar.xz
ed9d47ecc9afce0f541386471da9894c436833b89da06663ffbc5ab6de2beacf  silkstreet.tar.xz
7e254452405543c27ee47c0bf6a455fe34443a6fa335a904e086fef61cf6f330  simplybear.tar.xz
80c759f67a5eac57b6345417dff1181690a80ecb965a14ce812ab79d315f2f2d  tcf.tar.xz
6f0775201cb379bb0845c60fde22e66b8aa7d5319d6046987202cdc9065b0591  theblackboxmarket.tar.xz
c25c1f2b35d1cf1f38f1f009b40d559f5a0aaf484248d98aed7b9942fade20a8  thecave.tar.xz
078cc6e61cb37c56f671b6d87ca243e885c2a37a17645d73d26c01e56b28afe4  thehub-forums-20140420-whom-astorposts.tar.xz
5620dae0fac58b30bff4efbf116ce9674d071c3d43fe7cef2f5f84c2950b4182  thehub-forums.tar.xz
c542fed2541d059c466d0b9dc402465952a778b1ef584a3af73e7ad34d953f7e  themajesticgarden-forums.tar.xz
a8a57924768c5f7ad4062fe0b6931722a078caab91b65a515b554817b2e4c1dc  themajesticgarden.tar.xz
8deee8650c55fbd4cfb8366a4f8b5e8a5370b525f676769de34f81a8864e92d2  themarketplace.tar.xz
420889ca017ac87c92a0ff774d21dc79c3abc1958c8dee0dcc11e1af59fd680d  therealdeal-forums.tar.xz
b1ee23d727b30c486c3d197212ac91ac16f18b78b30ba5346854bedf81e6b821  therealdeal.tar.xz
70cf9c9a75815e9a514d4a5eb69aef77df862f3c8e36aff19feed8dae7c1e1cc  tochka.tar.xz
32acbc1289525785c12f179a7da9ce76a838e5a13a4dbaa6fb16c3f1870f9d98  tom-forums.tar.xz
3f62941a988c166ebcec9c788069de1d30a3c365f0b1da1921d342c8a4df3a35  tom.tar.xz
6c50bd480914e0c257b6e85a3e22a087e0e058614d465f7269e2ebd1f867a35a  topix2.tar.xz
fee6a7cd032648bebaae7752045bcd64c0a069c0abd311c53686323103fe7ede  torbay.tar.xz
76fdc6da85a4d697e2e5ed5b9c3d608c5d1ac33a0831fd0701cfd0c6c922e9db  torbazaar-forums.tar.xz
5b9b457c2e541fc618461b69c14511b03fff886daed25ba1e0cb49a89c5b749c  torbazaar.tar.xz
0f3c3a34496feeb44f258e07ee46704a38f856e975e394bcf689e03a18d263ca  torescrow-forums.tar.xz
7e4bf1ef60826367375ab419b068ce1b61daf231cda407594f595ec3bffc6d50  torescrow.tar.xz
1b911a07423900ee4ef9ff71e9d1f4752bfa89ad9c473b760263314f56c7a021  tormarket-20131213-dpr2-dbdump.mht
e229859ffa92bb7c142d2d54317d4b571e48dcc030d412fc93489a3f5aaa9faa  tormarket-elpresidente.tar.xz
55b50e6e9283df50e68d1843db0d07360cc0e6c7d2d032dc00de2c04a00cd489  tormarket.tar.xz
f81a11e6dd8779a4bf077f9bc833740536ed202d2dca106ab5122d758784bf74  tortuga1-forums.tar.xz
15c7d2ad0b525a9f3ae417dc63a670698204ac755a28bd98f104b0b240f3a4fd  tortuga2.tar.xz
0bb2324c424faa0481a3ca5b4004e57493eacfb7a521a7018edb40c3b467037b  undergroundmarket-forums.tar.xz
2153d48e75b60942cb7287a06b93c43b2968fb175af7b4f82fff59577674e9f6  undergroundmarket.tar.xz
13bb5eda0762a41aecc74caf3f3a527035b0015ea71019ba4d2d2363aeaf86d3  unitech.tar.xz
2811a120a4db56907498b2758b0b5d8b2d43c2167a40b2bf0c6e432ba383ff55  utopia-forums.tar.xz
c64666bf5ea4218f7b69d366243ce13a1c8fc21a68d4e24a6ac8c7c3d8bf6908  utopia.tar.xz
9278f2ed7191642cf736bc4dc88c2ccbe7c0b1af6cc6e6ffcb283263a4aef729  vault43.tar.xz
8087f7b4a7781ffc634d0baa2ac4a7cec7b7b1bd5a619f89cb43d49faae002b7  whiterabbit.tar.xz
dc64656700ad46505bd02412d7af5a04d60aba138c713720a00d80cc4bd20000  zanzibarspice.tar.xz
-----BEGIN PGP SIGNATURE-----

iQIcBAEBCgAGBQJVoq+QAAoJEH3Oo4eJxYjM52IP/3ZMzulM6TuwKfkcsGDrFe4Q
X3gQL4Ru2N80jWWcUj3hA/SxEyhs5gWA/xnLZr1HFPPEOXZQRMZb5G3tVQ7clhxL
dH2q7YPl+1L151iqtZHATYMcK8kSB7gbs8S33JU5SkS+y7R0tOXI9fpVuhnaD6HN
q3nGEKrSXI0CaC2o4bBxmUh/1WsimTySiNbcErdj0jMns10MKeYwTq98E+6yc+XQ
ItsMqS9gfSVlGN0yLRedc+kI+Y3M4ujLzY5aHC7PDv2RnpZhRMV68cSbsTc4FD7m
A7AOFKHukUhDPBqp1d3BEU/IiNqY4YhfIkmDMIQ8y2ioYG+rkk0SMojb3OYXgv0p
ioO0QuHNsJSomXYe9OkNoF9y2Tb99nJr7Wr6TFyJ4Geeow9B9p0j2LWFwfrpD3oq
eevXcIQruyi1AG4sK3/F6UG+GAZ3ZgsvcECoRc0+zytXNF0sn14WNcnyqGmtyfo1
/Y0KcDA0RCiWyvUTyAHWjjv0xOxVGDij8r9aqDM+8UgTsECIL6tlTo/Ifhm/k4a6
qF0adhyCpeFPAhmW2kz7BYsmtM0TzWDV/eD3h3mrpo8bn0ILgZr4MpEpLn3WPjY/
D+ZepCz12epZSURHV+6SWFteO6PM44fU895ezBq/iU5ZIRK8uvTShR6KEtPivJFp
fYrFFbOhBc6KRQbNJ8o2
=U0bP
-----END PGP SIGNATURE-----
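
As a minimal sketch of checking downloaded archives against the signed list above: it assumes the signed message has been saved verbatim to a file (the name SHA256SUMS.asc is hypothetical), that GNU coreutils sha256sum is available, and that the signing key is already in the local GnuPG keyring. The function names are illustrative only.

```shell
#!/bin/bash
# Print only the "<64 hex digits>  <filename>" lines of a clearsigned hash list.
extract_hashes() {
    grep -E '^[0-9a-f]{64}  [^ ]+$' "$1"
}

# Check every listed archive that is present in the current directory;
# --ignore-missing skips archives which were not downloaded.
verify_archives() {
    extract_hashes "$1" | sha256sum --check --ignore-missing --quiet -
}

# Example, run in the directory holding the archives (file name is hypothetical):
#   gpg --verify SHA256SUMS.asc && verify_archives SHA256SUMS.asc
```

Checking the signature first matters: the hashes only prove integrity if the list itself is authentic.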

How to crawl markets

The bulk of the crawls are my own work, and were gen­er­ally all cre­ated in a sim­i­lar way.

My setup was a Debian testing Linux system with Tor, Privoxy, and Polipo installed. For browsing, I used Iceweasel; useful FF extensions included LastPass, Flashblock & NoScript, Live HTTP Headers, Mozilla Archive Format, User Agent Switcher & switchproxytype, and Export Cookies. See the Tor guides.

  1. when a new mar­ket opens, I learn of it typ­i­cally from Red­dit or The Hub, and browse to it in Fire­fox con­fig­ured to proxy through 127.0.0.1:8123 (Polipo)

  2. cre­ate a new account

    The username/password are not particularly important, but using a password manager to create & store strong passwords for throwaway accounts has the advantage of making it easier to authenticate any hacks or database dumps later. (Given the poor security record of many markets, it should go without saying that you should not use your own username or any password which is used anywhere else.)

  3. I locate var­i­ous ‘action’ URLs: login, logout, ‘report ven­dor’, ‘set­tings’, ‘place order’, ‘send mes­sage’, and add the URL pre­fixes (some­times they need to be reg­ex­ps) into /etc/privoxy/user.action; Privoxy, a fil­ter­ing proxy run­ning on 127.0.0.1:8118, will then block any attempt to down­load URLs which match those prefixes/regexps

    A good blacklist is critical to avoid logging oneself out and immediately ending the crawl, but it's also important to avoid triggering any on-site actions which might cause your account to be banned or prompt the operators to put in anti-crawl measures you may have a hard time working around. A blacklist is also invaluable for avoiding downloading superfluous pages like the same category page sorted 15 different ways; Tor is high latency and you cannot afford to waste requests on redundant or meaningless pages, of which there can be many. Simple Machines Forum (SMF) installations are particularly dangerous in this regard, requiring at least 39 URLs blacklisted to get an efficient crawl, and implementing many actions as simple HTTP links that a crawler will follow (for example, if you have managed to get access to a private subforum on a SMF, you will delete your access to it if you simply turn a crawler like wget loose, which I learned the hard way).

  4. where pos­si­ble, con­fig­ure the site to sim­plify crawl­ing: request as many list­ings as pos­si­ble on each page, hide clut­ter, dis­able any options which might get in the way, etc.

    Forums often default to show­ing 20 posts on a page, but options might let you show 100; if you set it to dis­play as much as pos­si­ble (max­i­mum num­ber of posts per page, sub­fo­rums list­ed, etc), the crawls will be faster, save disk space, and be more reli­able because the crawl is less likely to suf­fer from down­time. So it is a good idea to go into the SMF forum set­tings and cus­tomize it for your account.

  5. in Firefox, I export a cookies.txt using the FF extension Export Cookies. (I also recommend NoScript to avoid JavaScript shenanigans, Live HTTP Headers to assist in debugging by showing the HTTP headers and requests FF is actually sending to the market, and User Agent Switcher to lock your FF into showing a consistent user-agent.)

  6. with a valid cookie in the cookies.txt and a proper blacklist set up, mirrors can now be made with wget, using commands like these:

    alias today="date '+%F'" # prints out current date like "2015-07-05"
    cat ~/blackmarket-mirrors/user-agent.txt
    ## Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/30.0

    cd ~/blackmarket-mirrors/cryptomarket/
    fgrep --no-filename '.onion' ~/cookies.txt ~/`today`/cookies.txt > ./cookies.txt
    http_proxy="localhost:8118" wget --mirror \
        --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 --tries=10 \
        --load-cookies=cookies.txt --keep-session-cookies \
        --max-redirect=1 \
        --referer="http://cryptomktgxdn2zd.onion" \
        --user-agent="$(cat ~/blackmarket-mirrors/user-agent.txt)" \
        --append-output=log.txt --server-response \
        'http://cryptomktgxdn2zd.onion/category.php?id=Weed'
    mv ./cryptomktgxdn2zd.onion/ `today`
    mv log.txt ./`today`/
    rm cookies.txt

    To unpack the com­mands:

    • the fgrep invo­ca­tion min­i­mizes the size of the local cook­ies.txt and helps pre­vent acci­den­tal release of a full cook­ies.txt while pack­ing up archives and shar­ing them with other peo­ple

    • wget:

      • we direct it to down­load only through Privoxy in order to ben­e­fit from the black­list. Warn­ing: wget has a black­list option but it does not work, because it is imple­mented in a bizarre fash­ion where it down­loads the black­listed URL (!) and then deletes it; this is a known >12-year-old bug in wget. For other crawlers, this behav­ior should be dou­ble-checked so you don’t wind up inad­ver­tently log­ging your­self out of a mar­ket and down­load­ing giga­bytes of worth­less front pages.
      • we throw in a num­ber of options to encour­age wget to ignore con­nec­tion fail­ures and retry; hid­den servers are slow and unre­li­able
      • we load the cook­ies file with the authen­ti­ca­tion for the mar­ket, and in par­tic­u­lar, we need --keep-session-cookies to keep around all cook­ies a mar­ket might give us, par­tic­u­larly the ones which change on each page load.
      • --max-redirect=1 helps deal with a nasty market behavior: when one's cookie has expired, some markets quietly redirect all subsequent page requests to a login page, without errors or warnings. Of course, the login page should be in the blacklist as well, but this is extra insurance and can save one round-trip's worth of time, which will add up. (This isn't always a cure, since a market may serve a requested page without any redirects or error codes but with the content replaced by a transcluded login page; this apparently happened with some of my crawls such as Black Bank Market. There's not much that can be done about this except some sort of post-download regexp check or a similar post-processing step.)
      • some mar­kets seem to snoop on the “ref­erer” part of a HTTP request spec­i­fy­ing where you come from; putting in the mar­ket page seems to help
      • the user-agent, as mentioned, should exactly match the one used to log in, as some markets record that and block accesses if the user-agent does not match exactly. Putting the current user-agent into a centralized text file helps avoid scripts getting out of date and specifying an old user-agent
    • log­ging of requests and par­tic­u­larly errors is impor­tant; --server-response prints out head­ers, and --append-output stores them to a log file. Most crawlers do not keep an error log around, but this is nec­es­sary to allow inves­ti­ga­tion of incom­plete­ness and observe where errors in a crawl started (per­haps you missed black­list­ing a page); for exam­ple, “Eval­u­at­ing drug traf­fick­ing on the Tor Net­work: Silk Road 2, the sequel”, Dol­liver 2015, failed to log errors in their few HTTrack crawls of SR2, and so wound up with a grossly incom­plete crawl which led to non­sense con­clu­sions like 1–2% of SR2’s sales were drugs. (I spec­u­late the HTTrack crawl was stuck in the ebooks sec­tion, which was always clogged with spam, and then SR2 went down for an hour or two, lead­ing to HTTrack’s default behav­ior of quickly error­ing out and fin­ish­ing the crawl; but the lack of log­ging means we may never know what went wrong.)

  7. once the wget crawl is done, then we name it what­ever day it ter­mi­nated on, we store the log inside the mir­ror, and clean up the prob­a­bly-now-­ex­pired cook­ies, and per­haps check for any unusual prob­lems.
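
The blacklist from step 3 can be sketched as a small generator for a Privoxy actions-file section. This is an illustration under assumptions, not my actual blacklist: the onion hostname and the path regexps are hypothetical, the function name is made up, and the exact matching rules (the host part is matched as a domain pattern, the part after the slash as a left-anchored regexp on the path) should be double-checked against the Privoxy manual.

```shell
#!/bin/bash
# privoxy_block_section HOST PATH_REGEX...
# Print one Privoxy actions-file section that blocks HOST/PATH_REGEX for each regexp.
privoxy_block_section() {
    local host="$1"
    shift
    printf '{ +block{crawl blacklist for %s} }\n' "$host"
    local path
    for path in "$@"; do
        printf '%s/%s\n' "$host" "$path"
    done
}

# Hypothetical market and paths; review the output, then append it
# to /etc/privoxy/user.action as root.
privoxy_block_section 'examplemarketxxxx.onion' \
    'logout' \
    'login' \
    'index\.php\?action=(report|sendmessage)' \
    'category\.php\?.*sort='
```

Generating the section per host keeps the www-prefixed and prefix-free variants of an onion from being conflated, since each hostname needs its own entries.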

This method will per­mit some­where around 18 simul­ta­ne­ous crawls of dif­fer­ent DNMs or forums before you begin to risk Privoxy throw­ing errors about “too many con­nec­tions”. A Privoxy bug may also lead to huge logs being stored on each request. Between these two issues, I’ve found it help­ful to have a daily cron job read­ing rm -rf /var/log/privoxy/*; /etc/init.d/privoxy restart so as to keep the log­file mess under con­trol and occa­sion­ally start a fresh Privoxy.

Crawls can be quickly checked by com­par­ing the down­loaded sizes to past down­loads; mar­kets typ­i­cally do not grow or shrink more than 10% in a week, and forums’ down­loaded size should monot­o­n­i­cally increase. (In­ci­den­tal­ly, that implies that it’s more impor­tant to archive mar­kets than forum­s.) If the crawls are no longer work­ing, one can check for prob­lems:
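
A rough sketch of that size comparison, assuming mirrors are stored in dated directories as in the commands above and that GNU du is available; the function names and the example directory names are hypothetical, and the 10% default simply mirrors the rule of thumb just given.

```shell
#!/bin/bash
# size_change_pct OLD_BYTES NEW_BYTES
# Print the signed percentage change (integer, truncated toward zero).
# OLD_BYTES must be non-zero.
size_change_pct() {
    echo $(( ($2 - $1) * 100 / $1 ))
}

# suspicious_change OLD_BYTES NEW_BYTES [THRESHOLD_PCT]
# Succeed if the absolute change exceeds the threshold (default 10).
suspicious_change() {
    local pct
    pct=$(size_change_pct "$1" "$2")
    [ "${pct#-}" -gt "${3:-10}" ]
}

# Example with two dated mirrors (directory names are hypothetical):
#   old=$(du -sb 2015-06-28 | cut -f1); new=$(du -sb 2015-07-05 | cut -f1)
#   suspicious_change "$old" "$new" && echo "check this crawl"
```

For forums, where the size should only grow, any negative result from size_change_pct is already a warning sign.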

  • is your user-a­gent no longer in sync?
  • does the crawl error out at a spe­cific page?
  • do the head­ers shown by wget match the head­ers you see in a reg­u­lar browser using Live HTTP Head­ers?
  • has the tar­get URL been renamed?
  • do the URLs in the black­list match the URLs of the site, or did you log in at the right URL? (for exam­ple, a black­list of “www.abrax­as­…o­nion” is dif­fer­ent from “abrax­as­…o­nion”; and if you logged in at a onion with www. pre­fix, the cookie may be invalid on the pre­fix-free onion)
  • did the server sim­ply go down for a few hours while crawl­ing? Then you can sim­ply restart and merge the crawls.
  • has your account been banned? If the signup process is par­tic­u­larly easy, it may be sim­plest to just reg­is­ter a fresh account each time.
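
Several of these checks start from the wget log. A minimal sketch of summarizing it, assuming the log was written with --server-response and --append-output as above, so that each request line starts with "--" plus a timestamp and each status line starts with "HTTP/"; the function names are hypothetical, and pages blocked by Privoxy will show up with their own error status.

```shell
#!/bin/bash
# status_summary LOG: count the HTTP status codes recorded in a wget log.
status_summary() {
    awk '$1 ~ /^HTTP\// { count[$2]++ }
         END { for (code in count) print code, count[code] }' "$1" | sort
}

# urls_with_status LOG CODE: print the requested URLs which got the given status code.
urls_with_status() {
    awk -v code="$2" '/^--[0-9]/ { url = $NF }
                      $1 ~ /^HTTP\// && $2 == code { print url }' "$1"
}

# Example:
#   status_summary log.txt
#   urls_with_status log.txt 404 | head
```

A sudden run of one error code near the end of the log usually shows where a crawl went wrong.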

Despite all this, not all markets can be crawled, and others present difficulties of their own:

  • Blue Sky Mar­ket did some­thing with HTTP head­ers which defeated all my attempts to crawl it; it rejected all my wget attempts at the first request, before any­thing even down­load­ed, but I was never able to fig­ure out exactly how the wget HTTP head­ers dif­fered in any respect from the (work­ing) Fire­fox requests
  • Mr Nice Guy 2 breaks the HTTP stan­dard by return­ing all pages gzip-en­cod­ed, whether or not the client says it can accept gzip-en­coded HTML; as it hap­pens, wget can­not read gzip-en­coded HTML and parse the page for addi­tional URLs to down­load, and so mir­ror­ing breaks
  • AlphaBay, dur­ing the DoS attacks of mid-2015, began doing some­thing odd with its HTTP respons­es, which makes Polipo error out; one must browse AlphaBay after switch­ing to Privoxy; Posei­don also did some­thing sim­i­lar for a time
  • Mid­dle Earth rate-lim­its crawls per ses­sion, lim­it­ing how much can be down­loaded with­out invest­ing a lot of time or in a CAPTCHA-breaking ser­vice
  • Abraxas leads to pecu­liarly high RAM usage by wget, which can lead to the OOM killer end­ing the crawl pre­ma­turely

See also the com­ments on crawl­ing in , and .

Crawler wishlist

In ret­ro­spect, had I known I was going to be scrap­ing so many sites for 3 years, I prob­a­bly would have worked on writ­ing a cus­tom crawler. A cus­tom crawler could have sim­pli­fied the black­list part and allowed some other desir­able fea­tures (in descend­ing order of impor­tance):

  • CAPTCHA library: if CAPTCHAs could be solved auto­mat­i­cal­ly, then each crawl could be sched­uled and run on its own.

    The downside is that one would need to occasionally manually check in to make sure that none of the possible problems mentioned previously have happened, since one wouldn't be getting the immediate feedback of noticing a manual crawl finishing suspiciously quickly (eg a big site like SR2 or Evolution or Agora should take a single-threaded normal crawl at least a day and easily several days if images are downloaded as well; if a crawl finishes in a few hours, something went wrong).

  • sup­port­ing par­al­lel crawls using mul­ti­ple accounts on a site

  • opti­mized tree tra­ver­sal: ide­ally one would down­load all cat­e­gory pages on a mar­ket first, to max­i­mize infor­ma­tion gain from ini­tial crawls & allow esti­mates of com­plete­ness, and then either ran­domly sam­ple items or pri­or­i­tize items which are new/changed com­pared to pre­vi­ous crawls; this would be bet­ter than generic crawlers’ defaults of depth or breadth-­first

  • remov­ing ini­tial hops in con­nect­ing to the hid­den ser­vice, speed­ing it up and reduc­ing latency (does not seem to be a con­fig option in Tor dae­mon but I’m told some­thing like this is done in )

  • post-­down­load checks: a mar­ket may not vis­i­bly error out but start return­ing login pages or warn­ings. If these could be detect­ed, the cus­tom crawler could log back in (par­tic­u­larly with CAPTCHA-solving) or at least alert the user to the prob­lem so they can decide whether to log back in, cre­ate a new account, slow down crawl­ing, split over mul­ti­ple accounts, etc
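
Even without a custom crawler, the post-download check can be approximated after the fact. A minimal sketch, assuming that a login page can be recognized by a password input field; the default pattern and the function names are hypothetical and would need adjusting per market.

```shell
#!/bin/bash
# Hypothetical marker of a login page: an <input> tag with type=password.
DEFAULT_LOGIN_PATTERN='<input[^>]*type="?password'

# login_pages DIR [PATTERN]: list mirrored files whose content matches the pattern.
login_pages() {
    local dir="$1"
    local pattern="${2:-$DEFAULT_LOGIN_PATTERN}"
    grep -rliE -e "$pattern" "$dir"
}

# login_page_report DIR: print how many files in a mirror look like login pages.
login_page_report() {
    local total hits
    total=$(( $(find "$1" -type f | wc -l) ))
    hits=$(( $(login_pages "$1" | wc -l) ))
    echo "$hits of $total files look like login pages"
}

# Example (directory name is hypothetical):
#   login_page_report ~/blackmarket-mirrors/cryptomarket/2015-07-05
```

A mirror in which most files match is probably a crawl that was logged out early and kept downloading login pages.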

Other datasets

One pub­licly avail­able full dataset is:

A num­ber of other datasets are known to exist but are unavail­able or avail­able only in restricted form, includ­ing:


  1. Some­thing that might be use­ful for those seek­ing to upload large datasets or deriv­a­tives to the IA: there is a most­ly-un­doc­u­mented ~25GB size limit on its tor­rents. Past that, the back­ground processes will no longer update the tor­rent to cover the addi­tional files, and one will be handed valid but incom­plete tor­rents. With­out IA sup­port staff inter­ven­tion to remove the lim­it, the full set of files will then only be down­load­able over HTTP, not through the tor­rent.↩︎

  2. Zhang et al 2019 describe the source of their writ­ing+photo dataset as “To fully eval­u­ate our pro­posed method, we have col­lected the data from four dif­fer­ent dark­net mar­kets Val­halla, Dream Mar­ket, Silk Road 2 and Evo­lu­tion. For the for­mer two dark­net mar­ket­s,we develop a set of crawl­ing tools to scrape weekly snap­shots from June 2017 to August 2017. For the rest of mar­kets, we col­lect their pub­lic data dumps.” The ‘pub­lic data dumps’ are unspec­i­fied but I am not aware of any other pub­lic SR2/Evolution datasets which include pho­tos.↩︎

  3. Not to be con­fused with the orig­i­nal Silk Road 1 weapons site which closed for lack of sales; this is a much lat­er, inde­pen­dent site which was prob­a­bly a scam.↩︎

  4. eg. the Ross Ulbricht trial evi­dence exhibits; for the trial tran­script, see Mous­tache.↩︎