Darknet Market Archives (2013-2015)

Mirrors of ~89 Tor-Bitcoin darknet markets & forums 2011-2015, and related material
Bitcoin, Silk-Road, shell, R, dataset
2013-12-01–2021-01-15 finished certainty: highly likely importance: 9


Dark Net Mar­kets (DNM) are on­line mar­kets typ­i­cally hosted as Tor hid­den ser­vices pro­vid­ing es­crow ser­vices be­tween buy­ers & sell­ers trans­act­ing in Bit­coin or other cryp­to­coins, usu­ally for drugs or other illegal/regulated goods; the most fa­mous DNM was Silk Road 1, which pi­o­neered the busi­ness model in 2011.

From 2013–2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into them; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could.

This uniquely com­pre­hen­sive col­lec­tion is now pub­licly re­leased as a 50GB (~1.6TB un­com­pressed) col­lec­tion cov­er­ing 89 DNMs & 37+ re­lated fo­rums, rep­re­sent­ing <4,438 mir­rors, and is avail­able for any re­search.

This page doc­u­ments the down­load, con­tents, in­ter­pre­ta­tion, and tech­ni­cal meth­ods be­hind the scrapes.

Dark net mar­kets have thrived since June 2011 when Adrian Chen pub­lished his fa­mous Gawker ar­ti­cle prov­ing that Silk Road 1 was, con­trary to my as­sump­tion when it was an­nounced in January/February 2011, not a scam and was a ful­ly-func­tional drug mar­ket, a new kind dubbed “dark net mar­kets” (DNM). Fas­ci­nat­ed, I soon signed up, made my first or­der, and be­gan doc­u­ment­ing how to use SR1 and then a few months lat­er, be­gan doc­u­ment­ing the first known SR1-linked ar­rests. Mon­i­tor­ing DNMs was easy be­cause SR1 was over­whelm­ingly dom­i­nant and Black­Mar­ket Re­loaded was a dis­tant sec­ond-place mar­ket, with a few ir­rel­e­van­cies like Deep­bay or Sheep and then the flashy At­lantis.

This idyl­lic pe­riod ended with the raid on SR1 in Oc­to­ber 2013, which ush­ered in a new age of chaos in which cen­tral­ized mar­kets bat­tled for dom­i­nance, the would-be suc­ces­sor Silk Road 2 was crip­pled by ar­rests and turned into a ghost-ship car­ry­ing scam­mers, and the mul­ti­sig break­through went beg­ging. The tu­mult made it clear to me that no mar­ket or fo­rum could be counted on to last as long as SR1, and re­search into the DNM com­mu­ni­ties and mar­kets, or even sim­ply the mem­ory of their his­to­ry, was threat­ened by bi­trot: al­ready in No­vem­ber 2013 I was see­ing per­va­sive myths spread through­out the me­di­a—that SR1 had $1 bil­lion in sales, that you could buy child pornog­ra­phy or hit­men ser­vices on it, that there were mul­ti­ple Dread Pi­rate Robert­s—and other dan­ger­ous be­liefs in the com­mu­nity (that use of PGP was para­noia & un­nec­es­sary, mar­kets could be trusted not to ex­it-s­cam, that FE was not a recipe for dis­as­ter, that SR2 was not in­fil­trated de­spite the staff ar­rests & even me­dia cov­er­age of a SR1 mole, that guns & poi­son sell­ers were not ex­tra­or­di­nar­ily risky to pur­chase from, that buy­ers were never ar­rest­ed).

And so, start­ing with the SR1 fo­rums, which had not been taken down by the raid (to help the mole? I won­dered at the time), I be­gan scrap­ing all the new mar­kets, do­ing so weekly and some­times daily start­ing in De­cem­ber 2013. These are the re­sults.

Download

The full archive is avail­able for down­load from the In­ter­net Archive as a tor­rent (item page)1.

A pub­lic rsync mir­ror is also avail­able:

rsync --verbose --recursive rsync://78.46.86.149:873/dnmarchives/ ./dnmarchives/

For individual files (eg the 2 Grams exports), one can download like so:

rsync --verbose rsync://78.46.86.149:873/dnmarchives/grams.tar.xz rsync://78.46.86.149:873/dnmarchives/grams-20150714-20160417.tar.xz ./

(If the down­load does not start, it may be a Tor­rent client prob­lem re­lated to Getright-web­seed­ing-sup­port; if the tor­rent does not work, all files can be down­loaded nor­mally over HTTP from the IA item page, but if pos­si­ble, tor­rents are rec­om­mended for re­duc­ing the band­width bur­den & er­ror-check­ing.)

Research

Possible Uses

Here are some sug­gested us­es:

  • pro­vid­ing in­for­ma­tion on ven­dors across mar­kets like their PGP key and feed­back rat­ings
  • iden­ti­fy­ing ar­rested and flipped sell­ers (eg the Weapon­s­guy sting on Ago­ra)
  • in­di­vid­ual drug and cat­e­gory pop­u­lar­ity
  • to­tal sales per day, with con­se­quent turnover and com­mis­sion es­ti­mates; cor­re­lates with Bit­coin or DNM-related search traffic, sub­red­dit traffic, Bit­coin price or vol­ume, etc
  • seller life­times, rat­ings, over time and by prod­uct sold
  • losses to DNM exit scams, or seller exit scams
  • re­ac­tions to ex­oge­nous shocks like Op­er­a­tion Ony­mous
  • sur­vival analy­sis, and pre­dic­tors of ex­it-s­cams (early fi­nal­iza­tion vol­ume; site down­time; new ven­dors; etc)
  • topic mod­el­ing of fo­rums
  • com­pi­la­tions of fo­rum posts on lab tests es­ti­mat­ing pu­rity and safety
  • com­pi­la­tions of fo­rum-posted Bit­coin ad­dresses to ex­am­ine the effec­tive­ness of mar­ket tum­blers
  • stylometric analysis of posters, particularly site staff (what is staff turnover like? do any markets ever change hands?)
  • deanonymiza­tion and in­for­ma­tion leaks (eg GPS co­or­di­nates in meta­data, user­names reused on the clear­net, valid emails in PGP pub­lic keys)
  • se­cu­rity prac­tices: use of PGP, life­time of in­di­vid­ual keys, ac­ci­den­tal posts of pri­vate rather than pub­lic keys, mal­formed or un­us­able pub­lic keys, etc
  • an­tholo­gies of re­al-world pho­tos of par­tic­u­lar drugs com­piled from all sell­ers of them
  • sim­ply brows­ing old list­ings, re­mem­ber­ing the good times and bad times, the fallen and the free

Works using this dataset

Pa­pers:

Me­dia:

Posts or ar­ti­cles:

Citing

Please cite this re­source as:

  • Gw­ern Bran­wen, Nico­las Christin, David Dé­cary-Hé­tu, Ras­mus Munks­gaard An­der­sen, StExo, El Pres­i­den­te, Anony­mous, Daryl Lau, So­hh­lz, Delyan Kratunov, Vince Ca­kic, Van Buskirk, Whom, Michael McKen­na, Sigi Goode. “Dark Net Mar­ket archives, 2011–2015”, 2015-07-12. Web. [ac­cess date] /DNM-archives

    @misc{dnmArchives,
        author = {Gwern Branwen and Nicolas Christin and David Décary-Hétu and
                  Rasmus Munksgaard Andersen and StExo and El Presidente and Anonymous
                  and Daryl Lau and Sohhlz and Delyan Kratunov and Vince Cakic and Van Buskirk
                  and Whom and Michael McKenna and Sigi Goode},
        title = {Dark Net Market archives, 2011-2015},
        howpublished = {\url{https://www.gwern.net/DNM-archives}},
        url = {https://www.gwern.net/DNM-archives},
        type = {dataset},
        year = {2015},
        month = {July},
        timestamp = {2015-07-12},
        note = {Accessed: DATE} }

Donations

A dataset like this owes its ex­is­tence to many par­ties:

  • the DNMs could not ex­ist with­out vol­un­teers and non­profits spend­ing the money to pay for the band­width used by the Tor net­work; these scrapes col­lec­tively rep­re­sent ter­abytes of con­sumed band­width. If you would like to do­nate to­wards keep­ing Tor servers run­ning, you can do­nate to Torserver­s.net or the Tor Project it­self
  • the Internet Archive hosts countless amazing resources, of which this is only one, and is a unique Internet resource; they accept Bitcoin
  • col­lat­ing and cre­at­ing these scrapes has ab­sorbed an enor­mous amount of my time & en­ergy due to the need to solve CAPTCHAs, launch crawls on a daily or weekly ba­sis, de­bug sub­tle glitch­es, work around site de­fens­es, pe­ri­od­i­cally archive scrapes to make disk space avail­able, pro­vide host­ing for some scrapes re­leased pub­licly etc (my arbtt time-logs sug­gest >200 hours since 2013); I sub­sist pri­mar­ily on do­na­tions & I thank my sup­port­ers for their pa­tience dur­ing this long pro­ject.

Contents

There are ~89 mar­kets, >37 fo­rums and ~5 other sites, rep­re­sent­ing <4,438 mir­rors of >43,596,420 files in ~49.4GB of 163 com­pressed files, un­pack­ing to >1548GB; the largest sin­gle archive de­com­presses to <250GB. (It can be burned to 3 25GB BDs or 2 50GB BDs; if the for­mer, it may be worth gen­er­at­ing ad­di­tional FEC.)

These archives are xz-compressed tarballs; typically each subfolder is a single date-stamped (YYYY-MM-DD) crawl using wget, with the default directory/file layout. The majority of the content is HTML, CSS, and images (typically photos of item listings); images are space-intensive & omitted from many crawls, but I feel that images are useful to allow browsing the markets as they were and may be highly valuable in their own right as research material, so I tried to collect images where applicable. (Child porn is not a concern as all DNMs & DNM forums ban that content.) Archives sourced from other people follow their own particular conventions. Mac & Windows users may be able to uncompress using their built-in OS archiver, 7zip, Stuffit, or WinRAR; the PAR2 error-checking can be done using par2, QuickPar, Par Buddy, MultiPar or others depending on one's OS.

If you don’t want to un­com­press all of a par­tic­u­lar archive, as they can be large, you can try ex­tract­ing spe­cific files us­ing archiver-spe­cific op­tions; for ex­am­ple, a SR2F com­mand tar­get­ing a par­tic­u­lar old fo­rum thread:

tar --verbose --extract --xz --file='silkroad2-forums.tar.xz' --no-anchored --wildcards '*topic=49187*'
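Before extracting anything, it can be useful to see which members of an archive match at all. The following is only a sketch: `list_matching` is a hypothetical helper name, and it assumes a tar that auto-detects xz compression when reading (GNU tar and bsdtar both do).

```shell
# List archive members whose path contains a fixed string, without extracting.
# Usage: list_matching ARCHIVE SUBSTRING
list_matching() {
    tar -tf "$1" | grep -F -- "$2"
}

# eg: list_matching silkroad2-forums.tar.xz 'topic=49187'
```

Listing still has to decompress the whole stream, so it is slow on the larger archives, but it needs no extra disk space.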

Overall Coverage

Most of the material dates from 2013 to 2015; some archives sourced from other people (before I began crawling) may date from 2011–2012.

Specifi­cal­ly:

  • Mar­kets:

    • 1776
    • Abraxas
    • Agape
    • Agora
    • Al­paca
    • Al­phaBay
    • Ama­zon Dark
    • An­ar­chia
    • An­drom­eda
    • Area51
    • Ar­mory3
    • At­lantis
    • Black­Bank Mar­ket
    • Black Gob­lin
    • Black­Mar­ket Re­loaded
    • Black Ser­vices Mar­ket
    • Blooms­field
    • Blue Sky Mar­ket
    • Break­ing Bad
    • bungee54
    • Buy­It­Now
    • Cannabis Road 1
    • Cannabis Road 2
    • Cannabis Road 3
    • Can­tina
    • Cloud9
    • Crypto Mar­ket / Di­a­bo­lus
    • Dark­Bay
    • Dark­list
    • Dark­net He­roes
    • DBay
    • Deep­zon
    • Doge Road
    • Dream Mar­ket
    • Drugslist
    • East In­dia Com­pany
    • Evo­lu­tion
    • Free­Bay
    • Free­dom Mar­ket­place
    • Free Mar­ket
    • Grey­Road
    • Havana/Absolem
    • Haven
    • Hori­zon
    • Hy­dra
    • Iron­clad
    • Kiss
    • Mid­dle Earth
    • Mr Nice Guy 2
    • Nu­cleus
    • Onion­shop
    • Out­law Mar­ket
    • Oxy­gen
    • Panacea
    • Pan­dora
    • Pi­geon
    • Pi­rate Mar­ket
    • Po­sei­don
    • Project Black Flag
    • Sheep
    • Silk Road 1
    • Silk Road 2
    • Silk Road Re­loaded (I2P)
    • Silk­street
    • Sim­ply Bear
    • The Black­Box Mar­ket
    • The Ma­jes­tic Gar­den
    • The Mar­ket­place
    • The Re­alDeal
    • Tochka
    • TOM
    • Topix 2
    • Tor­Bay
    • Tor­Bazaar
    • TorE­scrow
    • Tor­Mar­ket
    • Tor­tuga 2
    • Un­der­ground Mar­ket
    • Utopia
    • Vault43
    • White Rab­bit
    • Zanz­ibar Spice
  • Fo­rums:

    • Abraxas
    • Agora
    • An­drom­eda
    • Black Mar­ket Re­loaded
    • Black­Bank Mar­ket
    • bungee54
    • Cannabis Road 2
    • Cannabis Road 3
    • Dark­Bay
    • Darknet Heroes
    • Di­a­bo­lus
    • Doge Road
    • Evo­lu­tion
    • Gob­o­tal
    • Grey­Road
    • Havana/Absolem
    • Hy­dra
    • King­dom
    • Kiss
    • Mr Nice Guy 1
    • Nu­cleus
    • Out­law Mar­ket
    • Panacea
    • Pan­dora
    • Pi­geon
    • Project Black Flag
    • Re­volver
    • Silk Road 1
    • Silk Road 2
    • TOM
    • The Cave
    • The Hub
    • The Ma­jes­tic Gar­den
    • The Re­alDeal
    • TorE­scrow
    • Tor­Bazaar
    • Tor­tuga 1
    • Un­der­ground Mar­ket
    • Unitech
    • Utopia
  • Mis­cel­la­neous:

    • As­sas­si­na­tion Mar­ket
    • Cryuserv
    • DNM-related doc­u­ments4
    • DNStats
    • Grams
    • Ped­o­fund­ing
    • SR2­doug’s leaks

Missing or incomplete:

  • BMR
  • SR1
  • Blue Sky
  • Tor­Mar­ket
  • Deep­bay
  • Red Sun Mar­ket­place
  • San­i­tar­ium Mar­ket
  • EXXTACY
  • Mr Nice Guy 2

Interpreting & analyzing

Scrapes can be diffi­cult to an­a­lyze. They are large, com­pli­cat­ed, re­dun­dant, and highly er­ror-prone. They can­not be taken at face-val­ue.

No matter how much work one puts into it, one will never get an exact snapshot of a market at a particular instant: listings will go up or down as one crawls, vendors will be banned and their entire profile & listings & all feedback vanish instantly, Tor connection errors will cause a nontrivial % of page requests to fail, the site itself will go down (Agora especially), and Internet connections are imperfect. Scrapes can get bogged down in a backwater of irrelevant pages or spend all their time downloading a morass of on-demand generated pages, and the user login can expire or be banned by site administrators, etc. If a page is present in a scrape, then it probably existed at some point; but if a page is not present, then it may not have existed, or it existed but did not get downloaded for any of a myriad of reasons. At best, a scrape is a lower bound on how much was there.

So any analy­sis must take se­ri­ously the in­com­plete­ness of each crawl and the fact that there is a lot and al­ways will be a lot of miss­ing data, and do things like fo­cus on what can be in­ferred from ‘ran­dom’ sam­pling or ex­plic­itly model in­com­plete­ness by us­ing mar­kets’ cat­e­go­ry-coun­t-list­ings. (For ex­am­ple, if your down­load of a mar­ket claims to have 1.3k items but the cat­e­gories’ claimed list­ings sum to 13k items, your down­load is prob­a­bly highly in­com­plete & bi­ased to­wards cer­tain cat­e­gories as well.) There are many sub­tle bi­as­es: for ex­am­ple, there will be up­ward bi­ases in mar­kets’ av­er­age re­view rat­ings be­cause sell­ers who turn out to be scam­mers will dis­ap­pear from the mar­ket scrapes when they are banned, and few of their cus­tomers will go back and re­vise their rat­ings; sim­i­larly if scam­mers are con­cen­trated in par­tic­u­lar cat­e­gories, then us­ing a sin­gle snap­shot will lead to bi­ased re­sults as the scam­mers have al­ready been re­moved, while un­con­tro­ver­sial sell­ers last a lot longer (which might lead to, say, e-book sell­ers seem­ing to have many more sales than ex­pect­ed).
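The category-count check can be sketched in a few lines of shell. This is an illustration under assumptions, not part of the archive tooling: `coverage_ratio` and the tab-separated counts file (one `category<TAB>claimed_count` row per category, compiled by hand or by a parser from a market's category sidebar) are hypothetical.

```shell
# Estimate crawl coverage as (listings actually downloaded) divided by
# (sum of the listing counts that the market's category pages claim).
# COUNTS_FILE is a hypothetical tab-separated file: category<TAB>claimed_count
# Usage: coverage_ratio SCRAPED_COUNT COUNTS_FILE
coverage_ratio() {
    awk -F'\t' -v scraped="$1" '
        { claimed += $2 }
        END {
            if (claimed > 0) printf "%.2f\n", scraped / claimed
            else print "NA"   # no claimed counts to compare against
        }' "$2"
}
```

With the numbers from the example above (1.3k downloaded against 13k claimed), the ratio comes out at 0.10. Listings filed under more than one category inflate the claimed total, so the ratio is at best a rough indicator.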

The con­tents can­not be taken at face-value ei­ther. Some ven­dors en­gage in re­view-stuffing us­ing shills. Meta­data like cat­e­gories can be wrong, ma­nip­u­lat­ed, or mis­lead­ing (a cat­e­gory la­beled “Mu­si­cal in­stru­ments” may con­tain list­ings for pre­scrip­tion drugs—­beta block­er­s—or modafinil or Adder­all may be listed in both a “Pre­scrip­tion drugs” and “Stim­u­lants” cat­e­go­ry). Many things said on fo­rums are lies or bluffing or scams. Mar­ket op­er­a­tors may de­lib­er­ately de­ceive users (Ross Ul­bricht claim­ing to have sold SR1, the SR2 team en­gag­ing in “psy­ops”) or con­ceal in­for­ma­tion (the hacks of SR1; the sec­ond SR2 hack) or at­tack their users (Sheep Mar­ket­place and Pan­do­ra). Differ­ent mar­kets have differ­ent char­ac­ter­is­tics: the com­mis­sion rate on Pan­dora was uni­lat­er­ally raised after it was hacked (caus­ing sales vol­ume to fal­l); SR2 was a no­to­ri­ous scam­mer haven due to in­ac­tive or over­whelmed staff and lack­ing a work­ing es­crow mech­a­nism; etc. There is no sub­sti­tute here for do­main knowl­edge.

Know­ing this, analy­ses should have some strat­egy to deal with miss­ing­ness. There are a cou­ple tacks:

  • at­tempt to ex­ploit “ground truths” to ex­plic­itly model and cope with vary­ing de­grees of miss­ing­ness; there are a num­ber of ground-truths avail­able in the form of leaked seller data (screen­shots & data), data­bases (leaked, hacked), offi­cial state­ments (eg the FBI’s quoted num­bers about Silk Road 1’s to­tal sales, num­ber of ac­counts, num­ber of trans­ac­tions, etc)

    For one validation of this set of archives, see Bradley 2019's "On the Resilience of the Dark Net Market Ecosystem to Law Enforcement Intervention", which is able to compare the SR2 scrapes to data extracted from SR2 by UK law enforcement post-seizure, and finds that any scrape is incomplete (as expected) but that scrapes in general appear to be incomplete in similar ways and usable for analysis. For another attempt at validating, see Soska & Christin 2015's "Measuring the Longitudinal Evolution of the Online Anonymous Marketplace Ecosystem", which compares crawl-derived estimates to SR1 sales records produced at Ross Ulbricht's trial (CSV/discussion), sales figures in the Blake Benthall SR2 criminal complaint, and an Agora seller's leaked vendor profile; in all cases, the estimates are reasonably close to the ground-truth.

  • as­sume miss­ing-at-ran­dom and use analy­ses in­sen­si­tive to that, fo­cus­ing on things like ra­tios

  • work with the data as is, writ­ing re­sults such that the bi­ases and low­er-bounds are ex­plicit & em­pha­sized

Individual archives

Some of the archives are un­usual and need to be de­scribed in more de­tail.

Aldridge & Decary-Hetu SR1

The September 2013 SR1 crawl is processed data stored in SPSS .sav data files. There are various libraries available for reading this format (in R, using the foreign library: library(foreign); sellers <- read.spss("Sellers---2013-09-15.sav", to.data.frame=TRUE)).

AlphaBay 2017 (McKenna & Goode)

A crawl of Al­phaBay 2017-01-26–2017-01-28 and data ex­trac­tion (us­ing a Python script) pro­vided by Michael McKenna & Sigi Goode. They also tried to crawl AB’s his­tor­i­cal in­ac­tive list­ings in ad­di­tion to the usual live/active list­ings, reach­ing many of them.

Due to IA up­load prob­lems, cur­rently hosted sep­a­rately.

DNStats

DNStats is a ser­vice which pe­ri­od­i­cally pings hid­den ser­vices and records the re­sponse & la­ten­cy, gen­er­at­ing graphs of up­time and al­low­ing users to see how long a mar­ket has been down and if an er­ror is likely to be tran­sient. The owner has pro­vided me with three SQL ex­ports of the ping data­base up to 2017-03-25; this data­base could be use­ful for com­par­ing down­time across mar­kets, ex­am­in­ing the effect of DoS at­tacks, or re­gress­ing down­time against things like the Bit­coin ex­change rate (pre­sum­ably if the mar­kets still drive more than a triv­ial amount of the Bit­coin econ­o­my, down­time of the largest mar­kets or mar­ket deaths should pre­dict falls in the ex­change rate).

For ex­am­ple, to graph an av­er­age of site up­time per day and high­light as an ex­oge­nous event Op­er­a­tion Ony­mous, the R code would go like this:

dnmUptime <- read.delim("dnstats-20150712.sql", na.strings="NULL",
                         nrows=6000000, colClasses=c("factor", "factor", "factor", "integer",
                                                     "factor", "numeric", "numeric", "POSIXct"))
markets <- dnmUptime[dnmUptime$type==1,] # type 1 = markets
dnmUptime <- NULL # save RAM due to dataset size
markets$Date <- as.Date(markets$timestamp)
markets$Up <- markets$httpcode == 200
daily <- aggregate(Up ~ Date + sitename, markets, mean)
library(ggplot2)
qplot(Date, sitename, color=Up, data=daily) + geom_vline(xintercept=as.Date("2014-11-05"), color="red")

The ser­vice is a use­ful one and ac­cepts do­na­tions: 1DNstATs59JANuXjbpS5ngWHqvApAhYHBS.

Grams

Grams (sub­red­dit) is a ser­vice pri­mar­ily spe­cial­iz­ing in search­ing mar­ket list­ings; they can pull list­ings from API ex­ports pro­vided by mar­kets (Evo­lu­tion, Cloud9, Mid­dle Earth, Bungee54, Out­law), or they may use their own cus­tom crawls (the rest). They have gen­er­ously given me near-daily CSV ex­ports of the cur­rent state of list­ings in their search en­gine, rang­ing from 2014-06-09 to 2015-07-12 for the first archive and 2015-07-14 to 2016-04-17 for the sec­ond. Grams cov­er­age:

  1. first:

    • 1776
    • Abraxas
    • ADM
    • Agora
    • Al­paca
    • Al­phaBay
    • Black­Bank
    • Bungee54
    • Cloud9
    • Evo­lu­tion
    • Haven
    • Mid­dle Earth
    • NK
    • Out­law
    • Oxy­gen
    • Pan­dora
    • Silkki­tie
    • Silk Road 2
    • TOM
    • TPM
  2. sec­ond archive:

    • Abraxas
    • Agora
    • Al­phaBay
    • Dream Mar­ket
    • Hansa
    • Mid­dle Earth
    • Nu­cleus
    • Oa­sis
    • Oxy­gen
    • Re­alDeal
    • Silkki­tie
    • Tochka
    • Val­halla

The Grams archive has three virtues:

  1. while it does­n’t have any raw data, the CSVs are easy to work with. For ex­am­ple, to read in all the Grams SR2 crawls, then count & graph to­tal list­ings by day in R:

    DIR <- "blackmarket-mirrors/archive/grams"
    # Grams's SR2 crawls are named like "grams/2014-06-13/SilkRoad.csv"
    gramsFiles <- list.files(path=DIR, pattern="SilkRoad.csv", all.files=TRUE, full.names=TRUE, recursive=TRUE)
    # schema of SR2 crawls eg:
    ## "hash","market_name","item_link","vendor_name","price","name","description","image_link","add_time", \
    ## "ship_from",
    ## "2-11922","Silk Road 2","http://silkroad6ownowfk.onion/items/220-fe-only-tw-x-mb","$220for28grams", \
    ## "0.34349900", "220 FE Only TW X MB","1oz of the same tw x mb as my other listing FE only. Not shipped \
    ##  until finalized. Price is higher for non FE listing.","","1404258628","United States",...
    # most fields are self-explanatory; 'add_time' is presumably a Unix timestamp
    # read in each CSV, note what day it is from, and combine into a single data-frame:
    grams <- data.frame()
    for (i in seq_along(gramsFiles)) { # seq_along() is safe even if no files were found
        log <- read.csv(gramsFiles[i], header=TRUE)
        log$Date <- as.Date(gsub("/SilkRoad.csv", "", gsub(paste0(DIR,"/"), "", gramsFiles[i])))
        grams <- rbind(grams,log)
    }
    totalCounts <- aggregate(hash ~ Date, length, data=grams)
    summary(totalCounts)
    #       Date                 hash
    #  Min.   :2014-06-09   Min.   : 2846.00
    #  1st Qu.:2014-07-05   1st Qu.: 9584.25
    #  Median :2014-08-26   Median :10527.50
    #  Mean   :2014-08-21   Mean   : 9651.44
    #  3rd Qu.:2014-09-29   3rd Qu.:11165.00
    #  Max.   :2014-11-07   Max.   :19686.00
    library(ggplot2)
    qplot(Date, hash, data=totalCounts)
    # https://i.imgur.com/ucPMvJQ.png

    Other in­cluded datasets which are in struc­tured for­mats that may be eas­ier to deal with for pro­to­typ­ing: the Aldridge & Dé­cary-Hétu 2013 SR1 crawl; the SR1 sales spread­sheet (o­rig­i­nal is a PDF but I’ve cre­ated a us­able CSV of it); the BMR feed­back dumps are in SQL, as is DNStats and Christin et al 2013’s pub­lic data (but note the last is so heav­ily redacted & anonymized as to sup­port few analy­ses); and Daryl Lau’s SR2 work may be in a struc­tured for­mat.

  2. the crawls were conducted independently of other crawls, so the two can be used to check each other

  3. the mar­ket data sourced from the APIs can be con­sid­ered close to 100% com­plete & ac­cu­rate, which is rare

The main draw­backs are:

  • the largest mar­kets can be split across mul­ti­ple CSVs (eg EVO.csv & EVO2.csv), com­pli­cat­ing read­ing the data in some­what

  • the ex­port each time is of the cur­rent list­ings, which means that differ­ent days can re­peat the same iden­ti­cal crawl data if there was not a suc­cess­ful crawl by Grams in be­tween

  • ex­ports are not avail­able for every day, and some gaps are large. The 2015-01-09 to 2015-02-21 gap is due to a bro­ken Grams ex­port dur­ing this pe­riod be­fore I no­ticed the prob­lem and re­quested it be fixed; other gaps may be due to tran­sient er­rors with the cron job:

    @daily ping -q -c 5 google.com && torify wget --quiet --continue "http://grams7enufi7jmdl.onion/gwernapi/$SECRETKEY" -O ~/blackmarket-mirrors/grams/`date '+\%Y-\%m-\%d'`.zip

    so if my In­ter­net was down, or Grams was down, or the down­load was cor­rupted halfway through, then there would be noth­ing that day.
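The repeated-export drawback can be screened for mechanically. This is a rough sketch only: `repeated_exports` is a hypothetical helper, it assumes the unpacked layout of one YYYY-MM-DD directory per day (as in `grams/2014-06-13/SilkRoad.csv`), and it assumes that a repeated crawl yields a byte-identical CSV, which may not always hold.

```shell
# Print each date directory whose copy of CSV_NAME is byte-identical to the
# copy in the previous date directory (a sign Grams had no fresh crawl).
# The glob expands in lexical order, which is date order for YYYY-MM-DD names.
# Usage: repeated_exports GRAMS_DIR CSV_NAME
repeated_exports() {
    prev=""
    for f in "$1"/*/"$2"; do
        [ -f "$f" ] || continue          # skip an unmatched glob or a non-file
        if [ -n "$prev" ] && cmp -s "$prev" "$f"; then
            basename "$(dirname "$f")"
        fi
        prev=$f
    done
}

# eg: repeated_exports blackmarket-mirrors/archive/grams SilkRoad.csv
```

Days flagged this way can then be dropped or down-weighted before counting listings per day.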

Kilos

The owner of Kilos, a DNM search engine much like Grams, released a CSV on 2020-01-13 of 235,668 reviews scraped from 6 DNMs (Apollon, CannaHome, Cannazon, Cryptonia, Empire, & Samsara):

The data is in the for­mat

site,vendor,timestamp,score,value_btc,comment

Site, vendor, and comment are strings. Site and vendor are both al­phanu­mer­ic, while comment may have punc­tu­a­tion and what­not. Line breaks are ex­plicit “\n” in the comment field, and the comment field has quo­ta­tion marks around it to make it eas­ier to sort through. All the data uses Latin char­ac­ters on­ly, no uni­code. timestamp is an in­te­ger in­di­cat­ing the num­ber of sec­onds since the Unix epoch. Score is 1 for pos­i­tive re­view, 0 for neu­tral re­view, and −1 for neg­a­tive re­view. value_btc is the bit­coin value of the prod­uct be­ing re­viewed, cal­cu­lated at the time of the re­view.
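Because the comment is the last column and the columns before it contain no commas, the leading fields can be read with plain comma-splitting. A minimal sketch, assuming one review per line and that only the comment field is quoted; `tally_scores` is a hypothetical helper, and whether the file carries a header row is not stated, so a header is skipped only if present.

```shell
# Tally positive/neutral/negative reviews per site from a Kilos-format CSV.
# Only the first four comma-separated fields are read, so commas inside the
# quoted comment field (the last column) do no harm.
# Output columns: site positive neutral negative
# Usage: tally_scores FILE
tally_scores() {
    awk -F, '
        $1 == "site" { next }              # skip a header row, if there is one
        { n[$1 "," $4]++; sites[$1] = 1 }
        END {
            for (s in sites)
                printf "%s %d %d %d\n", s, n[s ",1"], n[s ",0"], n[s ",-1"]
        }' "$1" | sort
}
```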

There are some slight prob­lems with the data set as a re­sult of the pain that is scrap­ing these mar­ket­places. All re­views from Cryp­to­nia mar­ket have their time­stamp as 0 be­cause I for­got to de­code the dates listed and just used 0 as a place­hold­er. Cryp­to­nia re­views’ score vari­able is un­re­li­able, as I ac­ci­den­tally rewrote all scores to 0 on the pro­duc­tion data­base. To cor­rect for this, I rewrote the scores to match a sen­ti­ment analy­sis of the re­view text, but this is not a per­fect so­lu­tion, as some re­views are clas­si­fied in­cor­rect­ly. E.g. “this shit is the bomb!” might be clas­si­fied neg­a­tively de­spite con­text telling us that this is a pos­i­tive re­view.

There are a de­cent num­ber of du­pli­cates, some of which are proper (e.g.“Thanks” as a re­view ap­pears many many times) and some of which are im­proper (de­tailed re­views be­ing in­dexed mul­ti­ple times by mis­take).
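For analyses that depend on timestamps or scores, one conservative cleaning pass is to set the Cryptonia rows aside and drop rows that are exact repeats. This is a sketch under assumptions: `clean_reviews` is a hypothetical helper, the site is assumed to be spelled "Cryptonia" (matched case-insensitively), and treating a fully identical line as an improper duplicate is a judgment call, since two genuine reviews could coincide in every field.

```shell
# Drop Cryptonia rows (placeholder timestamps, reconstructed scores) and
# keep only the first occurrence of each fully identical line.
# Usage: clean_reviews FILE > cleaned.csv
clean_reviews() {
    awk -F, 'tolower($1) != "cryptonia" && !seen[$0]++' "$1"
}
```

Reviews with the same text but a different vendor, timestamp, or value are kept, so common comments like "Thanks" survive.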

Information leaks

Diabolus/Crypto Market

Diabolus/Crypto Mar­ket are two mar­kets run by the same team off, ap­par­ent­ly, the same serv­er. Crypto Mar­ket had an in­for­ma­tion leak where any at­tempt to log in as an ex­ist­ing user re­vealed the sta­tus bar of that Di­a­bo­lus ac­count, list­ing their cur­rent num­ber of or­ders, num­ber of PMs, and Bit­coin bal­ance, and hence giv­ing ac­cess to ground-truth es­ti­mates of mar­ket turnover and rev­enue. Us­ing my Di­a­bo­lus crawls to source a list of ven­dors, I set up a script to au­to­mat­i­cally down­load the leaks daily un­til the hole was fi­nally closed.

Simply Bear

Upon launch, the mar­ket Sim­ply Bear made the am­a­teur mis­take of fail­ing to dis­able the de­fault Apache /server-status page, which shows in­for­ma­tion about the server such as what HTML pages are be­ing browsed and the con­nect­ing IPs. Be­ing a Tor hid­den ser­vice, most IPs were lo­cal­host con­nec­tions from the dae­mon, but I no­ticed the ad­min­is­tra­tor was log­ging in from a lo­cal IP (the 192.168.1.x range) and cu­ri­ous whether I could de-anonymize him, I set up a script to poll /server-status every minute or so, in­creas­ing the in­ter­val as time passed. After two or three days, no naked IPs had ap­peared yet and I killed the script.

TheRealDeal

TheRealDeal was reported on Reddit in late June 2015 to have an info leak where any logged-in user could browse around a sixth of the order-details pages (which were in a predictable incrementing whole-number format) of all users without any additional authentication, yielding the Bitcoin amount, listing, and all Bitcoin multisig addresses for that order. TRD denied that this was any kind of problem, so I collected order information for about a week.

Modafinil

As part of my interest in the stimulant modafinil, I have been collecting by hand monthly scrapes of all modafinil/armodafinil/adrafinil listings across the DNMs; the modafinil archive contains the saved files in MHT or MAFF format from 2013-05-28 to 2015-07-03. Sampled markets include:

  • Abraxas
  • Agora
  • Al­paca
  • Al­phaBay
  • An­drom­eda
  • Black Bank
  • Blue Sky
  • Cloud-Nine
  • Crypto/Diabolus
  • Di­a­bo­lus
  • Dream
  • East In­dia Com­pany
  • Evo­lu­tion
  • Haven
  • Hy­dra
  • Mid­dle Earth
  • Nu­cleus
  • Out­law
  • Oxy­gen
  • Pan­dora
  • Sheep
  • SR2
  • TOM

Pedofunding

A site for child pornog­ra­phy, “Ped­o­fund­ing”, was launched in No­vem­ber 2014. It seemed like pos­si­bly the birth of a new DNM busi­ness model so I set up a logged-out scrape to archive its be­gin­nings (sans any im­ages), col­lect­ing 20 scrapes from 2014-11-13 to 2014-12-02, after which it shut down, ap­par­ently hav­ing found no trac­tion. (A fol­lowup in 2015 tried to use some sort of min­ing mod­el; it’s un­clear why they don’t sim­ply use Dark­leaks, or how far it got be­fore it too van­ished.)

Silk Road 1 (SR1)

Sources:

SR1F

This archive of the Silk Road 1 fo­rums is com­posed of 3 parts, all cre­ated dur­ing Oc­to­ber 2013 after Silk Road 1 was shut down but be­fore the Silk Road 1 fo­rums went offline some months lat­er:

  1. StEx­o’s archive, re­leased anony­mously

    This ex­cludes the Ven­dor Round­table (VRT) sub­fo­rum, and is be­lieved to have been cen­sored in var­i­ous re­spects such as re­mov­ing many of StEx­o’s own posts.

  2. Mous­tache’s archived pages

    Un­known source, may be based on StExo archives

  3. con­sol­i­dated wget spi­der

    After the SR1 bust and StEx­o’s archiv­ing, I be­gan mir­ror­ing the SR1F with wget, logged in as a ven­dor with ac­cess to the Ven­dor Round­table; un­for­tu­nately due to my in­ex­pe­ri­ence with the fo­rum soft­ware Sim­ple Ma­chi­nes, I did not know it was pos­si­ble to re­voke your own ac­cess to sub­fo­rums with wget and failed to black­list the re­vo­ca­tion URL. Hence the VRT was in­com­pletely archived. I com­bined my var­i­ous archives into a sin­gle ver­sion.

    Si­mul­ta­ne­ous­ly, qw­er­ty­oruiop was archiv­ing the SR1F with a reg­u­lar user ac­count and a cus­tom Node.js script. I com­bined his spi­der with my ver­sion to pro­duce a fi­nal ver­sion with rea­son­able cov­er­age of the fo­rums (per­haps 3/4s of what was left after every­one be­gan delet­ing & cen­sor­ing their past post­s).

SR2

Sources:

SR2Doug

In 2015, a pseu­do­nym claim­ing to be a SR2 pro­gram­mer offered for sale, us­ing the Dark­leaks pro­to­col, what he claimed was the username/password dump and SR2 source code. The Dark­leaks pro­to­col re­quires pro­vid­ing en­crypted data and then the rev­e­la­tion of a ran­dom frac­tion of it. This archive is all the en­crypted data, de­cryp­tion keys, and re­vealed user­names I was able to col­late. (The auc­tion did not seem to go well as the re­vealed data was not a com­pelling proof, and it’s un­clear whether he was the gen­uine ar­ti­cle.)

Previous releases

Some of these archives have been re­leased pub­licly be­fore and are now ob­so­leted by this tor­rent:

Verification

PAR2 archives are pro­vided for er­ror-cor­rec­tion, and PGP sig­na­tures for strong in­tegrity check­ing, should that be an is­sue.

Integrity of the archive can be verified using par2:

par2verify ecc.par2

Up to 10% of file damage/loss can be repaired using the supplied PAR2 files and par2repair; see the man page for details.
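A single archive can also be checked against its entry in the signed SHA-256 list below, after the signature on that list has itself been checked (eg with gpg --verify). A minimal sketch, assuming GNU coreutils' sha256sum is installed (on macOS, shasum -a 256 is the usual stand-in); `verify_archive` is a hypothetical helper name.

```shell
# Compare one file against its expected SHA-256 digest.
# Returns 0 (success) on a match, non-zero otherwise.
# Usage: verify_archive EXPECTED_HEX FILE
verify_archive() {
    actual=$(sha256sum "$2" | cut -d' ' -f1)
    [ "$actual" = "$1" ]
}

# eg, using the first entry of the signed list:
# verify_archive 8b05d5fcba36db6889af4fe23d1117a48c39b0808332d32919f9d7c835380721 1776.tar.xz && echo OK
```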

Signed SHA-256 hashes of the archives:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

8b05d5fcba36db6889af4fe23d1117a48c39b0808332d32919f9d7c835380721  1776.tar.xz
cc6f54d5818e13fb585b14d6c414fcdbf4d20a4e1ab3aa398f5ce05287a1d1b0  2015-sr2doug-claimedsr2leaks.tar.xz
6e082846f83dc9e06950fc29095491d303f5b336d65bbe6760db2c03d969cf02  abraxas-forums.tar.xz
3dcb6ba24bc3e4f75e13827bb1e2f0632ed269b10e6158bdb554cc50983f1204  abraxas.tar.xz
4231b81aa12d529f4502129683f8d5f1e0ef1f813d252d6edcce9d3b75eecdd2  agape.tar.xz
4838969a87610fe80678ae72a3d631ab2aaa5a6b219cd67226f528d96c4fc958  agora-forums-20140421-whom-astorposts.tar.xz
f6afe2df9238ce5cecea6dac70fd7c4b67a444824eccf07667ca46b15a167734  agora-forums-2014093020141016-rasmusandersen.tar.xz
5730cc4e7e34138aeee934985b937ba8a2ae78f23580ba9a666348fb04fb3583  agora-forums.tar.xz
4e7d5d4f63be66956037d4c27f3b97c0b980addd3ed5029b24904ab69f705c9d  agora.tar.xz
ab9fc0d2324ddbd03fcf5a9e8b9213fc6c650fcb1f7e99f9d3b7a63cd67923af  alpaca.tar.xz
1bbb33eda2094f662d982cad045033541a5fb22e850359883fa3decb5a0d81d2  alphabay.tar.xz
7a61ae8945322455f9b6d0afdad2751847f9a294b951920ea6cccaa8f3b06d86  amazondark.tar.xz
19e634813d8038474460d72e0c5311a7d97a9a2e9e9089eab32a719cf4a0c377  anarchia.tar.xz
8da899bae2e51384afa8d4f839a45371a1b1c5b22a52685f698aced1dba5adbd  andromeda-forums.tar.xz
0c95881e291bde995dc33ae8ee516ca7c8b200cb8dd3b967f8dc62ec5a36b6b2  andromeda.tar.xz
3466f8f9637aab4f2d74ef9c242be7aeff08d5adfadcffe7ca69ce58392a62a9  area51.tar.xz
d9f4f00dba4a44cc7bb45b19d9967046be56b83328c0149697cdf44862438ef2  armory.tar.xz
d9e887e1370f690724e9a178287baf5c85e5e8a900e9e9dae019b795e2afdb6e  assassinationmarket.tar.xz
c959b430f7aef932d26fe389498c6f4d3d7d02421e9d05c204803b009317869b  atlantis-20130921-christin.tar.xz
e1539816b1318badf183152960783697f234ce6c972e90ed2830b119d620313a  blackbankmarket-forums.tar.xz
c9e4940b16078ad2982a55c4c1221054ad3b6a2cac99517d55fc24063a71efdd  blackbankmarket.tar.xz
cbb17ccd867d242ce571ea692a4672474c8330679d3a41e2fff7ebaa511ffd58  blackgoblin.tar.xz
f68f7bb73b47161d8d0499eb062ddd8b4f7b267cad9b2c9179b3a6d309ac9d2b  blackmarketreloaded-20131017-userlist.sql.xz
eed272069f2f057dc6894bbb078041c4bf64db3936a1218cf9f9db9c42518839  blackmarketreloaded-20131225-feedback-wousd.sql.xz
6b0a07ea3cbf67cd60c743a52cdf0427a3e4e587655e3950a75c48fad2f57085  blackmarketreloaded-elpresidente.tar.xz
63d95bc6baa947842247084f0332e8e5ccc465ad112df2fe4d88e1a024aeb5fc  blackmarketreloaded-forums.tar.xz
84598eccbc428ce0325327618f2d7566e55ab799f46e030a1c5b8295e0397fd0  blackservicesmarket.tar.xz
9d0f068823a37eb405b2bf6014ba3051a6cddfb78997111ae1a0c7507c60dd3e  bloomsfield.tar.xz
a4477cf586ff6b18df649e5bfb47d825f2c604c3913b934c235eafa514d0025b  bluesky.tar.xz
8e9b225be42d4f3cff9f835e7f24ba414a6d72e3131d77655f3fc7d05c3b6208  breakingbad.tar.xz
9bb37c2f8b68730b02d38ddf3be04154384f2c79a70505a3324fb8b973e4553c  bungee54-forums.tar.xz
78f5599807f5adc1a068cb86f8a8c7ad194d67d28ef5f451076a40a8587f1776  bungee54.tar.xz
9afedc1135e8a96a61974fb663eaaabef2476bafbc4193dc9f6744402573c98c  buyitnow.tar.xz
f9559a82359cc33f9e9b093d5aa7a6d8b4deebb39aa13841c2fb91ea6f6fdac5  cannabisroad2-forums.tar.xz
db133bef60e5c338757af23809175a8f64a9b4ca1dbebcbf3d8930af590a924a  cannabisroad2.tar.xz
9fca953f118c80f6e61264b513872404ab67b51e06e544bba35284b1fcf8defd  cannabisroad3-forums.tar.xz
173d4f60232941b18a5cdef0c04d45a678fd1f9c4ff0a4a1158266cd1f15c4fa  cannabisroad3.tar.xz
5feeb4f56b4b2c0ab058e45d82543588ec09386f50a3663af53109abb72d66c6  cannabisroad.tar.xz
e0b5355ac6fc07b53dd6ae6767783462173d0e5a62f77b3ca23b699d5f59ce25  cantina.tar.xz
a2db7e54af153958d9d0bac0bf4088ff371e28c7e5510e5fae6b850af88dda8f  cloudnine.tar.xz
9010bcfd779f01508075d341e278dbd412c2350d9fba41bb96a1345494956b40  cryptomarket.tar.xz
66d0236a256059df1ae4f0c6da5e7ded59f83f4534e2293c576575ad0191262e  cryuserv.tar.xz
1dd482381d3a4ff8b30c4750696f1de1fbceb19ce29061ad39f5ce33092239f3  darkbay-forums.tar.xz
366e30bdb6d84e6cbe5d54909d2f49a7f95e0f232ecd886ea53e729f479104e0  darkbay.tar.xz
d7f666e3fd244c299621c6fb7beb20111690e4e7c8786161f1534c23c7836d51  darklist.tar.xz
c6d2478c2a0f860c4b1e8507a5925f699ee39edf8dead1df2cec5d0d94b51af2  darknetheroes-forums.tar.xz
1197eae4c7cb83ed97aa5374365a26b67beea75bf053a9927b2e8948393fe58d  darknetheroes.tar.xz
623ff7d3509727be5936f27ab95cd2b40432f25b0f07e20df7062e5e2cd55217  darknetnation.tar.xz
23e4932551b2a56c12d151d2f14140d5c9a7c25407b766b34d48456c5dbab589  dbay.tar.xz
f8b3cd5c861e7c32147ad720538728f113bcda0f41760ef7475ffbaf26037490  deepzon.tar.xz
2199f5062ad587d355ed683b894ada4dd1529ec50c5f5761b523cdaff9c20b5c  diabolus-cminfoleak-20150220-20150311.tar.xz
f1f6df5855287def19443db64082aa1c7df507991a6968dca6f5f097b024e253  diabolus-forums.tar.xz
42d1a476d9eb6b9b4807789ba08c5791054d41f3d6b9e7506a78a309603bad78  diabolus.tar.xz
ddeed8ce25ef813814522bffe2224f390c84dcdca4dcd0c3023b49d0a63a8b5a  dnstats-20150712.sql.xz
649e311c427398006bf390f7827fe3534026c730a905766cb9f3e78bad82b520  documents.tar.xz
2f2523f4125e64acaa86ebacb8fe2f08fc640608aabc95d747e9319bf9446e12  dogeroad-forums.tar.xz
78079f03495ba405a04860fb546421780f9bc1cdcf06025e7abd29033f77c450  dogeroad.tar.xz
768482dd0aae12fab023497cda437fd290657ac1e9df29a6b65f1b142d1ce8af  dreammarket.tar.xz
229373106b35aa6d72a71f7dc48e90d1da47647cc58348ee0cb768a3926294c4  drugslist.tar.xz
f8a324d215858918d781436a09d51bfaa88c2b9bd59ef6af4a75f52c81891a6c  eastindiacompany.tar.xz
23449de611a42899bcb27db8186d194f7b805ee7e55034ec5ab17adee226aecd  evolution-forums-2014093020141016-rasmusandersen.tar.xz
109eb980c11ed37b29321f6403cb5e95614f3c44525a549164d95d0a52eb94cf  evolution-forums.tar.xz
a6a0ccd588635903f1e914390f36bb9a56f562d37b9e92d6e58dac6364b35b8a  evolution.tar.xz
0b2e5eac28bad63ca832aeeebb8a759dec21bbf2b52eb5f816dc010ab5a825f3  freebay.tar.xz
336c43eb0794174bb8c58cb8b018a8e019a4dd1719a298051b0c0e4ba04a7109  freedommarketplace.tar.xz
61f2037e6245d2e0a23f87df142ff53c0736da26844a3a3f7d869fdd1b835202  freemarket.tar.xz
af4dd8003b015519677c802cc3c19f0910cb79541876be0be719e0c176fe7f5e  galaxy.tar.xz
0d963a63009ef5b581ce705555a608997cfc7220971a26236d8f12b6268c224c  gobotal-20140818-20141102.tar.xz
0cecd5e78416328caf06614ee6a8fabee0d91b8aecddd9ca2d67f059ff7497d6  grams.tar.xz
2dccb3df553b89dfceb5ba4930269ffff4fcd39dc6c876ca6cfc9e85c98bda9a  grandtrunk.tar.xz
2fe55a93c6c7b69b40a5bfe1c1dcd7c0cc4601045696870f1b4dad460c93ea70  greyroad-forums.tar.xz
419e97c0c28784e6077f296746bf2ae5b4899cc0fef2756108c3b5c3d5ed9b13  greyroad.tar.xz
d7624f290f63642d3d875d0b94baf84af89cd63e2abab57c1889bf8d18883596  havanaabsolem-forums.tar.xz
94bafe76779807cdf7cc86d0534da64155b22e40db79f1bb801e865becd44fc6  havanaabsolem.tar.xz
32475d62c6ff9cce00063b6473576782a2941bf1dc2e05a0f9a6bc9880ed91c3  haven.tar.xz
b69715d148fa02e87af8143d36152f4deda57b39f85fe4da47e8090e5e93c348  horizon.tar.xz
b06b7f272934b661920eae5ba9cc3ac8480c8e94ca86d7ab039988cdbf348f2a  hydra-forums.tar.xz
0cf4eda89b71d17a9a539599053e06f4fed4322c0ea306edb6e30c950ab0d16b  hydra.tar.xz
cebec4d92f705475a61ab0fe66c905d509c737139276e96c4c8826539bdd2e07  ironclad.tar.xz
deb71f9e282bbc477c16c922ea8731ecc8817244808619fe881c22467df1d213  kingdom-forums.tar.xz
466772600b49a37d6f5078c1534d889f0b3d3d7ccb165228292e1121217395fd  kiss-forums.tar.xz
74436c0b38dab5007ad212e5c8bb7f1d67708fbdfbbaf6488a80ea637cdcd912  kiss.tar.xz
73ed19cbc40d0d313cf91ed68c7c8f931438238605076bea95c6db7e41a382bd  middleearth.tar.xz
69e783616806f90715b3a63b8f8623ca7ea83f81a48b71e0fadbfa85dfca214f  modafinil.tar.xz
fc29a84ba388a0bf7aa7c27437ea2e53462bfdb527f00c45958b2d15a43237ef  mrniceguy2.tar.xz
796fa38de4eae84797ce07c30a158123b61224dffdb6e94dfd5be39f8a96a187  mrniceguy-forums.tar.xz
146f2ae90fd4fa25932f43596e621065204a07ca5b8149d4e6af142abea32597  mtgox-2011-usernamepasswordleak.csv.xz
0d4136f8e59a4cedfbfac30da33a846d42ed1c9e6e1af8ed030be8ac42e42522  mtgox-20140309-leak.tar.xz
e22b5c83f04ac244e4e77bad4e91588642373a371b3b5606c311a5021bd2eba2  nucleus-forums.tar.xz
87fb7a67bfd55f25f882fbf10e10c82bf2872721109f47728192b5be0e830252  nucleus.tar.xz
ff975d6dc3c91c5b2fd42a86c54acecfed17616dcd80ba5a320ff4b4df2e89fd  onionshop.tar.xz
1b95c06289b081c1dc674dc5d4e055f61fd1609b8a75d5a65a51134407639c11  outlawmarket-forums.tar.xz
4d7d1c24197c89252d515e35ef1bc3c80543180e952ed3e6aae821eb48d17d4c  outlawmarket.tar.xz
11327c8c1915e802cd6083e590217e8e93b19767c9453fc62291e24b96a0a420  oxygen.tar.xz
5355211f6e1b8a338115ef10b2c8498af3b4ee494405b51147f1ffe27645d7b5  panacea-forums.tar.xz
58a76cba9c7ca06c4d92ce03bb39bddf24f15dabeee508f2004f0158bf1aca70  panacea.tar.xz
ed17677aa7269d725cdd81fc1832655a76b3ab701a0ca356b1182443622bedd7  pandora-elpresidente.tar.xz
9f9de82834b46973a5712a6b1dcabe3cb2af1b3c42348d3f2ab4534b59f64dc6  pandora-forums-20140421-whom-astorposts.tar.xz
29bb6c5add500b077b3545559871eda0515887f8847380f1024072ce6cc785aa  pandora-forums.tar.xz
d6e00fb115cecb5739e72c994243edf3199a7b2c9524ebe1e55983bcd2dbc894  pandora.tar.xz
0dfcfdac5d359b508efae9c50cb861f5403924e047de00831db758841a469bfa  pedofunding.tar.xz
427bc78c1e466a7bdc7f0b667d125aced3de76da7bfd8fed5fce564f44421372  pigeon-forums.tar.xz
6fe6fd24b0b604ec70b9e56610743f3bdf91683d24e6ade3a149ecd61b7b787f  pigeon.tar.xz
bd634bf2b2943fb1d01c548f1d731d86c8344d319b799a03a9197874e8e01772  piratemarket.tar.xz
f8dbee89392ebced3a529a972e19c5146aaa3cfe8ce9d25005f538d41b47c2ed  poseidon.tar.xz
71b44fc678bebb8122ddfdba02e2ef80335f72eaf49b4f11ef3204ee7f29ec35  projectblackflag-20131103-anonymous-logsdump.tar.xz
0000462319ea6467b0a25f070f659124966518da3adce1a0fa92d81a84a24e59  projectblackflag-forums.tar.xz
b2ec62fbe54b8148f7e6e7738b84d0d7d45c6b7a91b951494a9a8ab20769e24b  revolver-forums.tar.xz
4f8573bded758c065f86c1eae189d69c1ad622fb6558d10d4aef780e699e09c2  sheep-elpresidente.tar.xz
073829fc8ae4fe9e6920b2c3232bc253ebe6c877b29264a569651e5d76c3b191  sheep.tar.xz
4099f3d49d74d8828b12d8ff532979531c5ca31092985457e93f5f5e9fafbdc1  silkroad1-20111103-delyankratunov.tar.xz
57b641200c30bf6a801fe2faf462d507fcc99c678567943f25af9d0c51970879  silkroad1-20120722-vanbuskirk.docx
59e72f95201726cc46d9680f97a53f44c45f242b57a96567916c4cb76a863d5e  silkroad1-20120723-christin-censored.tar.xz
da8726427d1b13f850a9647a34757ee95be000c036a5ec370e8f43b01fde6609  silkroad1-20130703-anonymous.tar.xz
a3fe8ec72186e7ec02fe206f92616688fae07b756f06a555bd8f306a92b0451b  silkroad1-20130915-aldridgehetu.tar.xz
12876b0783fb928a9c982dff048155fae331b174e08847e66a3100a9f74c9369  silkroad1-forums-20130703-anonymous.tar.xz
5533a90285c0d072d62ebf681cfe717987dfe595f13b96e1e8dc9ae1ed7274ab  silkroad1-forums-20131103-gwernrasmusandersen.tar.xz
3a28097c243843cc69d365b1c6456075679bfa09cd3a50daa6105a0c7f4df837  silkroad1-forums-anonymous.tar.xz
37db1b2eab69923e22cb0d2ee65426152cb11ab09d92d1d6013a2fe7f20aa7d0  silkroad1-forums-stexo.tar.xz
eac0013182b996b4a77f446a28ffabd74f23ea0fa32eeaa6f3bc499081c372c8  silkroad1-forums.tar.xz
ab1ffac3b85b9cbb2d7ff80ed28a1899561f945758196ba3976dbb2e5b8b4c21  silkroad1-vendorprofiles-stexo.tar.xz
2df744013fedfdacfd349472e05981316dbf392ccb56e627ff6d6f09b4ad7a8a  silkroad1-wiki.tar.xz
1c8e643eade9750b39485c5e101f65d2c12ec977cb7b681cd8df064eccf4c0e7  silkroad2-20140129-sohhlz-vendors.tar.xz
3381cd4305c4cd909aa86cf218a1022e6be5ed227d6eb728603c41b9956c7a28  silkroad2-20140927-daryllau.tar.xz
7367dc56f15f61212d8567033a4d3a9468622e05f86d38607a70d5686164648a  silkroad2-forums-20140419-whom-astorposts.tar.xz
0900093d7100b4faf983707b4b1e0ec1fae3c4b18270eaa8eedfe4f8b69a6e23  silkroad2-forums-2014093020141016-rasmusandersen.tar.xz
a473132cb8eec64aea2066628a24628a0c1eb38c195c9945c700dd19f1f972f2  silkroad2-forums.tar.xz
2abc793c7fdfce31d375db11307b66aa69cb91f4c684408840d546bf4e61e41b  silkroad2.tar.xz
3384789112185d81544dcad5bc69967cd44b097b7a772da48f5a1226b43155de  silkroadreloaded.tar.xz
ed9d47ecc9afce0f541386471da9894c436833b89da06663ffbc5ab6de2beacf  silkstreet.tar.xz
7e254452405543c27ee47c0bf6a455fe34443a6fa335a904e086fef61cf6f330  simplybear.tar.xz
80c759f67a5eac57b6345417dff1181690a80ecb965a14ce812ab79d315f2f2d  tcf.tar.xz
6f0775201cb379bb0845c60fde22e66b8aa7d5319d6046987202cdc9065b0591  theblackboxmarket.tar.xz
c25c1f2b35d1cf1f38f1f009b40d559f5a0aaf484248d98aed7b9942fade20a8  thecave.tar.xz
078cc6e61cb37c56f671b6d87ca243e885c2a37a17645d73d26c01e56b28afe4  thehub-forums-20140420-whom-astorposts.tar.xz
5620dae0fac58b30bff4efbf116ce9674d071c3d43fe7cef2f5f84c2950b4182  thehub-forums.tar.xz
c542fed2541d059c466d0b9dc402465952a778b1ef584a3af73e7ad34d953f7e  themajesticgarden-forums.tar.xz
a8a57924768c5f7ad4062fe0b6931722a078caab91b65a515b554817b2e4c1dc  themajesticgarden.tar.xz
8deee8650c55fbd4cfb8366a4f8b5e8a5370b525f676769de34f81a8864e92d2  themarketplace.tar.xz
420889ca017ac87c92a0ff774d21dc79c3abc1958c8dee0dcc11e1af59fd680d  therealdeal-forums.tar.xz
b1ee23d727b30c486c3d197212ac91ac16f18b78b30ba5346854bedf81e6b821  therealdeal.tar.xz
70cf9c9a75815e9a514d4a5eb69aef77df862f3c8e36aff19feed8dae7c1e1cc  tochka.tar.xz
32acbc1289525785c12f179a7da9ce76a838e5a13a4dbaa6fb16c3f1870f9d98  tom-forums.tar.xz
3f62941a988c166ebcec9c788069de1d30a3c365f0b1da1921d342c8a4df3a35  tom.tar.xz
6c50bd480914e0c257b6e85a3e22a087e0e058614d465f7269e2ebd1f867a35a  topix2.tar.xz
fee6a7cd032648bebaae7752045bcd64c0a069c0abd311c53686323103fe7ede  torbay.tar.xz
76fdc6da85a4d697e2e5ed5b9c3d608c5d1ac33a0831fd0701cfd0c6c922e9db  torbazaar-forums.tar.xz
5b9b457c2e541fc618461b69c14511b03fff886daed25ba1e0cb49a89c5b749c  torbazaar.tar.xz
0f3c3a34496feeb44f258e07ee46704a38f856e975e394bcf689e03a18d263ca  torescrow-forums.tar.xz
7e4bf1ef60826367375ab419b068ce1b61daf231cda407594f595ec3bffc6d50  torescrow.tar.xz
1b911a07423900ee4ef9ff71e9d1f4752bfa89ad9c473b760263314f56c7a021  tormarket-20131213-dpr2-dbdump.mht
e229859ffa92bb7c142d2d54317d4b571e48dcc030d412fc93489a3f5aaa9faa  tormarket-elpresidente.tar.xz
55b50e6e9283df50e68d1843db0d07360cc0e6c7d2d032dc00de2c04a00cd489  tormarket.tar.xz
f81a11e6dd8779a4bf077f9bc833740536ed202d2dca106ab5122d758784bf74  tortuga1-forums.tar.xz
15c7d2ad0b525a9f3ae417dc63a670698204ac755a28bd98f104b0b240f3a4fd  tortuga2.tar.xz
0bb2324c424faa0481a3ca5b4004e57493eacfb7a521a7018edb40c3b467037b  undergroundmarket-forums.tar.xz
2153d48e75b60942cb7287a06b93c43b2968fb175af7b4f82fff59577674e9f6  undergroundmarket.tar.xz
13bb5eda0762a41aecc74caf3f3a527035b0015ea71019ba4d2d2363aeaf86d3  unitech.tar.xz
2811a120a4db56907498b2758b0b5d8b2d43c2167a40b2bf0c6e432ba383ff55  utopia-forums.tar.xz
c64666bf5ea4218f7b69d366243ce13a1c8fc21a68d4e24a6ac8c7c3d8bf6908  utopia.tar.xz
9278f2ed7191642cf736bc4dc88c2ccbe7c0b1af6cc6e6ffcb283263a4aef729  vault43.tar.xz
8087f7b4a7781ffc634d0baa2ac4a7cec7b7b1bd5a619f89cb43d49faae002b7  whiterabbit.tar.xz
dc64656700ad46505bd02412d7af5a04d60aba138c713720a00d80cc4bd20000  zanzibarspice.tar.xz
-----BEGIN PGP SIGNATURE-----

iQIcBAEBCgAGBQJVoq+QAAoJEH3Oo4eJxYjM52IP/3ZMzulM6TuwKfkcsGDrFe4Q
X3gQL4Ru2N80jWWcUj3hA/SxEyhs5gWA/xnLZr1HFPPEOXZQRMZb5G3tVQ7clhxL
dH2q7YPl+1L151iqtZHATYMcK8kSB7gbs8S33JU5SkS+y7R0tOXI9fpVuhnaD6HN
q3nGEKrSXI0CaC2o4bBxmUh/1WsimTySiNbcErdj0jMns10MKeYwTq98E+6yc+XQ
ItsMqS9gfSVlGN0yLRedc+kI+Y3M4ujLzY5aHC7PDv2RnpZhRMV68cSbsTc4FD7m
A7AOFKHukUhDPBqp1d3BEU/IiNqY4YhfIkmDMIQ8y2ioYG+rkk0SMojb3OYXgv0p
ioO0QuHNsJSomXYe9OkNoF9y2Tb99nJr7Wr6TFyJ4Geeow9B9p0j2LWFwfrpD3oq
eevXcIQruyi1AG4sK3/F6UG+GAZ3ZgsvcECoRc0+zytXNF0sn14WNcnyqGmtyfo1
/Y0KcDA0RCiWyvUTyAHWjjv0xOxVGDij8r9aqDM+8UgTsECIL6tlTo/Ifhm/k4a6
qF0adhyCpeFPAhmW2kz7BYsmtM0TzWDV/eD3h3mrpo8bn0ILgZr4MpEpLn3WPjY/
D+ZepCz12epZSURHV+6SWFteO6PM44fU895ezBq/iU5ZIRK8uvTShR6KEtPivJFp
fYrFFbOhBc6KRQbNJ8o2
=U0bP
-----END PGP SIGNATURE-----
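
As a rough illustration of how the hash list above can be checked mechanically, the following sketch pulls the "hash  filename" lines out of a saved copy of the signed message and feeds them to sha256sum. The function name check_archives and the manifest filename in the example are hypothetical, and --ignore-missing assumes a reasonably recent GNU coreutils. It does not check the PGP signature itself, which would be a separate gpg --verify step on the same file.

```shell
# check_archives MANIFEST: verify whichever listed archives exist in the
# current directory against the SHA-256 lines of a saved signed message.
# (Hypothetical helper; the signature itself must be checked separately,
# e.g. with `gpg --verify MANIFEST`.)
check_archives() {
    manifest="$1"
    # Keep only lines of the form "<64 hex digits><two spaces><filename>".
    grep -E '^[0-9a-f]{64}  ' "$manifest" |
        sha256sum --check --ignore-missing --quiet
}
# Example (run inside the download directory): check_archives hashes.asc
```

A non-zero exit status indicates a mismatch, or that none of the listed files were present to check.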

How to crawl markets

The bulk of the crawls are my own work, and were gen­er­ally all cre­ated in a sim­i­lar way.

My setup was a Debian testing Linux system with Tor, Privoxy, and Polipo installed. For browsing, I used Iceweasel; useful FF extensions included LastPass, Flashblock & NoScript, Live HTTP Headers, Mozilla Archive Format, and User Agent Switcher & switchproxytype. See the Tor guides.

  1. when a new mar­ket opens, I learn of it typ­i­cally from Red­dit or The Hub, and browse to it in Fire­fox con­fig­ured to proxy through 127.0.0.1:8123 (Polipo)

  2. cre­ate a new ac­count

    The username/password are not par­tic­u­larly im­por­tant but us­ing a pass­word man­ager to cre­ate & store strong pass­words for throw­away ac­counts has the ad­van­tage of mak­ing it eas­ier to au­then­ti­cate any hacks or data­base dumps lat­er. (Given the poor se­cu­rity record of many mar­kets, it should go with­out say­ing that you should not use your own user­name or any pass­word which is used any­where else.)

  3. I lo­cate var­i­ous ‘ac­tion’ URLs: login, lo­gout, ‘re­port ven­dor’, ‘set­tings’, ‘place or­der’, ‘send mes­sage’, and add the URL pre­fixes (some­times they need to be reg­ex­ps) into /etc/privoxy/user.action; Privoxy, a fil­ter­ing proxy run­ning on 127.0.0.1:8118, will then block any at­tempt to down­load URLs which match those prefixes/regexps

    A good blacklist is critical to avoid logging oneself out and immediately ending the crawl, but it’s also important to avoid triggering any on-site actions which might cause your account to be banned or prompt the operators to put in anti-crawl measures you may have a hard time working around. A blacklist is also invaluable for avoiding downloading superfluous pages like the same category page sorted 15 different ways; Tor is high latency and you cannot afford to waste a request on redundant or meaningless pages, of which there can be many. Simple Machines Forums (SMF) are particularly dangerous in this regard, requiring at least 39 URLs blacklisted to get an efficient crawl, and implementing many actions as simple HTTP links that a crawler will follow (for example, if you have managed to get access to a private subforum on an SMF, you will delete your access to it if you simply turn a crawler like wget loose, which I learned the hard way).

  4. where pos­si­ble, con­fig­ure the site to sim­plify crawl­ing: re­quest as many list­ings as pos­si­ble on each page, hide clut­ter, dis­able any op­tions which might get in the way, etc.

    Fo­rums often de­fault to show­ing 20 posts on a page, but op­tions might let you show 100; if you set it to dis­play as much as pos­si­ble (max­i­mum num­ber of posts per page, sub­fo­rums list­ed, etc), the crawls will be faster, save disk space, and be more re­li­able be­cause the crawl is less likely to suffer from down­time. So it is a good idea to go into the SMF fo­rum set­tings and cus­tomize it for your ac­count.

  5. in Firefox, I export a cookies.txt using the FF extension Export Cookies. (I also recommend NoScript to avoid JavaScript shenanigans, Live HTTP Headers to assist in debugging by showing the HTTP headers and requests FF is actually sending to the market, and User Agent Switcher to lock your FF into showing a consistent user-agent.)

  6. with a valid cookie in the cookies.txt and a proper blacklist set up, mirrors can now be made with wget, using commands like these:

    alias today="date '+%F'" # prints out current date like "2015-07-05"
    cat ~/blackmarket-mirrors/user-agent.txt
    ## Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/30.0
    
    cd ~/blackmarket-mirrors/cryptomarket/
    fgrep --no-filename '.onion' ~/cookies.txt ~/`today`/cookies.txt > ./cookies.txt
    # route all requests through Privoxy (127.0.0.1:8118) so the blacklist applies
    http_proxy="localhost:8118" wget --mirror \
        --tries=10 --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 \
        --load-cookies=cookies.txt --keep-session-cookies \
        --max-redirect=1 \
        --referer="http://cryptomktgxdn2zd.onion" \
        --user-agent="$(cat ~/blackmarket-mirrors/user-agent.txt)" \
        --append-output=log.txt --server-response \
        'http://cryptomktgxdn2zd.onion/category.php?id=Weed'
    mv ./cryptomktgxdn2zd.onion/ `today`
    mv log.txt ./`today`/
    rm cookies.txt

    To un­pack the com­mands:

    • the fgrep in­vo­ca­tion min­i­mizes the size of the lo­cal cook­ies.txt and helps pre­vent ac­ci­den­tal re­lease of a full cook­ies.txt while pack­ing up archives and shar­ing them with other peo­ple

    • wget:

      • we di­rect it to down­load only through Privoxy in or­der to ben­e­fit from the black­list. Warn­ing: wget has a black­list op­tion but it does not work, be­cause it is im­ple­mented in a bizarre fash­ion where it down­loads the black­listed URL (!) and then deletes it; this is a known >12-year-old bug in wget. For other crawlers, this be­hav­ior should be dou­ble-checked so you don’t wind up in­ad­ver­tently log­ging your­self out of a mar­ket and down­load­ing gi­ga­bytes of worth­less front pages.
      • we throw in a num­ber of op­tions to en­cour­age wget to ig­nore con­nec­tion fail­ures and retry; hid­den servers are slow and un­re­li­able
      • we load the cook­ies file with the au­then­ti­ca­tion for the mar­ket, and in par­tic­u­lar, we need --keep-session-cookies to keep around all cook­ies a mar­ket might give us, par­tic­u­larly the ones which change on each page load.
      • --max-redirect=1 helps deal with a nasty mar­ket be­hav­ior where when one’s cookie has ex­pired, they then qui­etly redi­rect, with­out er­rors or warn­ings, all sub­se­quent page re­quests to a lo­gin page. Of course, the lo­gin page should also be in the black­list as well, but this is ex­tra in­sur­ance and can save one round-trip’s worth of time, which will add up. (This is­n’t al­ways a cure, since a mar­ket may serve a re­quested page with­out any redi­rects or er­ror codes but the con­tent will be a tran­scluded lo­gin page; this ap­par­ently hap­pened with some of my crawls such as Black Bank Mar­ket. There’s not much that can be done about this ex­cept some sort of post-down­load reg­exp check or a sim­i­lar post-pro­cess­ing step.)
      • some mar­kets seem to snoop on the “ref­erer” part of a HTTP re­quest spec­i­fy­ing where you come from; putting in the mar­ket page seems to help
      • the user-a­gent, as men­tioned, should ex­actly match how­ever one logged in, as some mar­kets record that and block ac­cesses if the user-a­gent does not match ex­act­ly. Putting the cur­rent user-a­gent into a cen­tral­ized text file helps avoid scripts get­ting out of date and spec­i­fy­ing an old user-a­gent
    • log­ging of re­quests and par­tic­u­larly er­rors is im­por­tant; --server-response prints out head­ers, and --append-output stores them to a log file. Most crawlers do not keep an er­ror log around, but this is nec­es­sary to al­low in­ves­ti­ga­tion of in­com­plete­ness and ob­serve where er­rors in a crawl started (per­haps you missed black­list­ing a page); for ex­am­ple, “Eval­u­at­ing drug traffick­ing on the Tor Net­work: Silk Road 2, the se­quel”, Dol­liver 2015, failed to log er­rors in their few HTTrack crawls of SR2, and so wound up with a grossly in­com­plete crawl which led to non­sense con­clu­sions like 1–2% of SR2’s sales were drugs. (I spec­u­late the HTTrack crawl was stuck in the ebooks sec­tion, which was al­ways clogged with spam, and then SR2 went down for an hour or two, lead­ing to HTTrack’s de­fault be­hav­ior of quickly er­ror­ing out and fin­ish­ing the crawl; but the lack of log­ging means we may never know what went wrong.)

  7. once the wget crawl is done, then we name it what­ever day it ter­mi­nated on, we store the log in­side the mir­ror, and clean up the prob­a­bly-now-ex­pired cook­ies, and per­haps check for any un­usual prob­lems.
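
The blacklist in step 3 tends to be repetitive, so one way to keep it manageable is to generate the Privoxy section from a list of path regexps. This is only a sketch: the function name privoxy_block_section, the host examplemarket.onion, and the paths are made up for illustration, and the output should be reviewed before it is appended to /etc/privoxy/user.action. The leading dot in the host pattern is Privoxy's way of matching a domain with or without subdomain prefixes such as www.

```shell
# privoxy_block_section HOST PATH_REGEX...
# Print a Privoxy actions-file section blocking each path regexp on HOST.
# (Hypothetical helper; the host and paths in the example are placeholders.)
privoxy_block_section() {
    host="$1"
    shift
    printf '{ +block{crawl blacklist for %s} }\n' "$host"
    for path in "$@"; do
        # ".host" matches the bare domain and any subdomain (eg. www.)
        printf '.%s%s\n' "$host" "$path"
    done
}
# Example:
#   privoxy_block_section examplemarket.onion '/logout' '/index\.php\?action=logout'
```

Keeping one such invocation per market alongside the crawl script makes it easier to see which action URLs were blacklisted when a crawl later turns out to be incomplete.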

This method will per­mit some­where around 18 si­mul­ta­ne­ous crawls of differ­ent DNMs or fo­rums be­fore you be­gin to risk Privoxy throw­ing er­rors about “too many con­nec­tions”. A Privoxy bug may also lead to huge logs be­ing stored on each re­quest. Be­tween these two is­sues, I’ve found it help­ful to have a daily cron job read­ing rm -rf /var/log/privoxy/*; /etc/init.d/privoxy restart so as to keep the log­file mess un­der con­trol and oc­ca­sion­ally start a fresh Privoxy.

Crawls can be quickly checked by com­par­ing the down­loaded sizes to past down­loads; mar­kets typ­i­cally do not grow or shrink more than 10% in a week, and fo­rums’ down­loaded size should mo­not­o­n­i­cally in­crease. (In­ci­den­tal­ly, that im­plies that it’s more im­por­tant to archive mar­kets than fo­rum­s.) If the crawls are no longer work­ing, one can check for prob­lems:

  • is your user-a­gent no longer in sync?
  • does the crawl er­ror out at a spe­cific page?
  • do the head­ers shown by wget match the head­ers you see in a reg­u­lar browser us­ing Live HTTP Head­ers?
  • has the tar­get URL been re­named?
  • do the URLs in the blacklist match the URLs of the site, and did you log in at the right URL? (for example, a blacklist of “www.abraxas…onion” is different from “abraxas…onion”; and if you logged in at an onion with www. prefix, the cookie may be invalid on the prefix-free onion)
  • did the server sim­ply go down for a few hours while crawl­ing? Then you can sim­ply restart and merge the crawls.
  • has your ac­count been banned? If the signup process is par­tic­u­larly easy, it may be sim­plest to just reg­is­ter a fresh ac­count each time.
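
The size comparison described above is easy to automate. A minimal sketch, assuming GNU du (for the -b apparent-size option) and using the 10% figure as the threshold; the function name crawl_size_ok and the directory names in the example are hypothetical:

```shell
# crawl_size_ok PREVIOUS_DIR CURRENT_DIR
# Succeed if the current mirror's total size is within 10% of the previous one.
crawl_size_ok() {
    prev=$(du -sb "$1" | cut -f1)
    cur=$(du -sb "$2" | cut -f1)
    if [ "$cur" -ge "$prev" ]; then
        diff=$((cur - prev))
    else
        diff=$((prev - cur))
    fi
    # diff/prev <= 10%, kept in integer arithmetic
    [ $((diff * 10)) -le "$prev" ]
}
# Example:
#   crawl_size_ok 2015-06-28 2015-07-05 || echo "size changed by more than 10%"
```

For forums, where the downloaded size should only grow, one would probably treat only shrinkage as suspicious.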

De­spite all this, not all mar­kets can be crawled or present other diffi­cul­ties:

  • Blue Sky Mar­ket did some­thing with HTTP head­ers which de­feated all my at­tempts to crawl it; it re­jected all my wget at­tempts at the first re­quest, be­fore any­thing even down­load­ed, but I was never able to fig­ure out ex­actly how the wget HTTP head­ers differed in any re­spect from the (work­ing) Fire­fox re­quests
  • Mr Nice Guy 2 breaks the HTTP stan­dard by re­turn­ing all pages gzip-en­cod­ed, whether or not the client says it can ac­cept gzip-en­coded HTML; as it hap­pens, wget can­not read gzip-en­coded HTML and parse the page for ad­di­tional URLs to down­load, and so mir­ror­ing breaks
  • Al­phaBay, dur­ing the DoS at­tacks of mid-2015, be­gan do­ing some­thing odd with its HTTP re­spons­es, which makes Polipo er­ror out; one must browse Al­phaBay after switch­ing to Privoxy; Po­sei­don also did some­thing sim­i­lar for a time
  • Middle Earth rate-limits crawls per session, limiting how much can be downloaded without investing a lot of time or paying for a CAPTCHA-breaking service
  • Abraxas leads to pe­cu­liarly high RAM us­age by wget, which can lead to the OOM killer end­ing the crawl pre­ma­turely

See also the com­ments on crawl­ing in , and .

Crawler wishlist

In ret­ro­spect, had I known I was go­ing to be scrap­ing so many sites for 3 years, I prob­a­bly would have worked on writ­ing a cus­tom crawler. A cus­tom crawler could have sim­pli­fied the black­list part and al­lowed some other de­sir­able fea­tures (in de­scend­ing or­der of im­por­tance):

  • CAPTCHA li­brary: if CAPTCHAs could be solved au­to­mat­i­cal­ly, then each crawl could be sched­uled and run on its own.

    The downside is that one would need to occasionally manually check in to make sure that none of the possible problems mentioned previously have happened, since one wouldn’t be getting the immediate feedback of noticing a manual crawl finishing suspiciously quickly (eg a big site like SR2 or Evolution or Agora should take a single-threaded normal crawl at least a day and easily several days if images are downloaded as well; if a crawl finishes in a few hours, something went wrong).

  • sup­port­ing par­al­lel crawls us­ing mul­ti­ple ac­counts on a site

  • op­ti­mized tree tra­ver­sal: ide­ally one would down­load all cat­e­gory pages on a mar­ket first, to max­i­mize in­for­ma­tion gain from ini­tial crawls & al­low es­ti­mates of com­plete­ness, and then ei­ther ran­domly sam­ple items or pri­or­i­tize items which are new/changed com­pared to pre­vi­ous crawls; this would be bet­ter than generic crawlers’ de­faults of depth or breadth-first

  • re­mov­ing ini­tial hops in con­nect­ing to the hid­den ser­vice, speed­ing it up and re­duc­ing la­tency (does not seem to be a con­fig op­tion in Tor dae­mon but I’m told some­thing like this is done in )

  • post-down­load checks: a mar­ket may not vis­i­bly er­ror out but start re­turn­ing lo­gin pages or warn­ings. If these could be de­tect­ed, the cus­tom crawler could log back in (par­tic­u­larly with CAPTCHA-solving) or at least alert the user to the prob­lem so they can de­cide whether to log back in, cre­ate a new ac­count, slow down crawl­ing, split over mul­ti­ple ac­counts, etc
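
The post-download check can be approximated even without a custom crawler by scanning a finished mirror for pages that look like the login form. The sketch below counts such files; the function name count_login_pages and the marker regexp are assumptions, and the marker would have to be chosen per market by inspecting its actual login page.

```shell
# count_login_pages MIRROR_DIR MARKER_REGEX
# Print how many files under MIRROR_DIR contain text matching MARKER_REGEX
# (an extended regexp that should only match the site's login page).
count_login_pages() {
    dir="$1"
    marker="$2"
    # grep exits non-zero when nothing matches; that is not an error here.
    { grep -rlE -- "$marker" "$dir" || true; } | wc -l
}
# Example: count_login_pages ./2015-07-05 'name="password"'
```

A count that is a large share of the mirror suggests the session expired partway through the crawl and the later pages are worthless.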

Other datasets

One pub­licly avail­able full dataset is:

A num­ber of other datasets are known to ex­ist but are un­avail­able or avail­able only in re­stricted form, in­clud­ing:


  1. Some­thing that might be use­ful for those seek­ing to up­load large datasets or de­riv­a­tives to the IA: there is a most­ly-un­doc­u­mented ~25GB size limit on its tor­rents. Past that, the back­ground processes will no longer up­date the tor­rent to cover the ad­di­tional files, and one will be handed valid but in­com­plete tor­rents. With­out IA sup­port staff in­ter­ven­tion to re­move the lim­it, the full set of files will then only be down­load­able over HTTP, not through the tor­rent.↩︎

  2. Zhang et al 2019 de­scribe the source of their writ­ing+photo dataset as “To fully eval­u­ate our pro­posed method, we have col­lected the data from four differ­ent dark­net mar­kets Val­halla, Dream Mar­ket, Silk Road 2 and Evo­lu­tion. For the for­mer two dark­net mar­ket­s,we de­velop a set of crawl­ing tools to scrape weekly snap­shots from June 2017 to Au­gust 2017. For the rest of mar­kets, we col­lect their pub­lic data dumps.” The ‘pub­lic data dumps’ are un­spec­i­fied but I am not aware of any other pub­lic SR2/Evolution datasets which in­clude pho­tos.↩︎

  3. Not to be con­fused with the orig­i­nal Silk Road 1 weapons site which closed for lack of sales; this is a much lat­er, in­de­pen­dent site which was prob­a­bly a scam.↩︎

  4. eg. the Ross Ul­bricht trial ev­i­dence ex­hibits; for the trial tran­script, see Mous­tache.↩︎