Hi! I am the owner of Kilos, a search engine that indexes listings, vendors, reviews, and forum posts from online black markets. I am publishing some of the data I have scraped to see if anyone can reach any interesting conclusions from playing with it. I have quite a bit more data, which I will be posting in the coming days. Right now I have...
Currently indexing 534,767 forum posts, 65,741 listings, 2,726 vendors, and 235,668 reviews from 6 markets and 6 forums.
You need Tor to download the dataset. Once you have the Tor Browser bundle installed, you can find the dataset here: http://lolwuc3342535625.onion/2020-01-13-reviews.csv . If someone could mirror this on a clearnet hosting site, I would appreciate it; I use Tor for everything, and most file hosting websites will not allow uploads over Tor.
Edit: /u/gwern has mirrored the data for me and you can now get it without Tor here. Thanks /u/gwern!
The data is in the format:

    site,vendor,timestamp,score,value_btc,comment
Site, vendor, and comment are strings. Site and vendor are both alphanumeric, while comment may contain punctuation and whatnot. Line breaks are written as a literal "\n" in the comment field, and the comment field is wrapped in quotation marks to make it easier to parse. All the data uses Latin characters only, no Unicode. Timestamp is an integer counting seconds since the Unix epoch. Score is 1 for a positive review, 0 for a neutral review, and -1 for a negative review. Value_btc is the Bitcoin value of the product being reviewed, calculated at the time of the review.
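For anyone who wants to get started quickly, here is a minimal Python sketch of loading rows in that format with the standard csv module. The sample rows and the assumption that the file has no header line are mine, not guaranteed properties of the actual dump; adjust the fieldnames handling if the real file turns out to include a header.

```python
import csv
import io

COLUMNS = ["site", "vendor", "timestamp", "score", "value_btc", "comment"]

# Hypothetical sample rows in the documented format (not real data).
sample = (
    'examplemarket,vendorA,1578950000,1,0.0042,"Fast shipping.\\nWould buy again."\n'
    'examplemarket,vendorB,1578951000,-1,0.0100,"Never arrived."\n'
)

reviews = []
for row in csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS):
    reviews.append({
        "site": row["site"],
        "vendor": row["vendor"],
        "timestamp": int(row["timestamp"]),    # seconds since the Unix epoch
        "score": int(row["score"]),            # 1 positive, 0 neutral, -1 negative
        "value_btc": float(row["value_btc"]),  # BTC value at review time
        "comment": row["comment"],             # quoted; "\n" stays a literal two-character escape
    })
```

For the real file, replace `io.StringIO(sample)` with an `open(...)` call on the downloaded CSV.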
There are some slight problems with the dataset, a result of how painful these marketplaces are to scrape. All reviews from Cryptonia Market have a timestamp of 0 because I forgot to decode the listed dates and just used 0 as a placeholder. The score variable for Cryptonia reviews is also unreliable, as I accidentally overwrote all scores with 0 on the production database. To correct for this, I rewrote the scores to match a sentiment analysis of the review text, but this is not a perfect solution, as some reviews are classified incorrectly. E.g. "this shit is the bomb!" might be classified as negative despite context telling us that this is a positive review.
There are a decent number of duplicates, some of which are legitimate (e.g. "Thanks" appears as a review many, many times) and some of which are not (detailed reviews indexed multiple times by mistake).
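One heuristic for that: only treat a repeated row as an indexing accident when the comment is long enough that an exact repeat is implausible. The length threshold below is arbitrary and mine, not something from the dataset:

```python
def dedupe(reviews, min_len=40):
    """Drop exact repeats of long, detailed comments (likely re-indexed rows)
    while keeping short boilerplate like "Thanks" that legitimately repeats."""
    seen = set()
    out = []
    for r in reviews:
        key = (r["site"], r["vendor"], r["timestamp"], r["comment"])
        if len(r["comment"]) >= min_len and key in seen:
            continue  # long comment already seen with identical metadata
        seen.add(key)
        out.append(r)
    return out
```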
Anyway, if you can make any interesting inferences from this data, let me know! I am always looking to improve Kilos' display of data. Right now, I am working on using polynomial regression to detect when vendors have padded their reviews with fake positives to improve their ranking in search results. I would appreciate help with this if anyone can offer it.
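In case it helps anyone pick this up: one way to frame that idea is to fit a low-degree polynomial to a vendor's daily review counts and flag days whose residual is a large outlier (a sudden burst of reviews that the smooth trend cannot explain). This is only a sketch of that framing, not the detector Kilos actually runs; the degree and z-score cutoff are arbitrary choices of mine, and it solves the least-squares normal equations by hand to stay stdlib-only.

```python
from collections import Counter
from statistics import mean, pstdev

def flag_padding(reviews, vendor, degree=2, z=3.0):
    """Flag day-start timestamps where a vendor's daily review count sits far
    above a degree-`degree` polynomial trend (possible fake-review burst)."""
    days = Counter(r["timestamp"] // 86400 for r in reviews
                   if r["vendor"] == vendor and r["timestamp"] > 0)
    n = degree + 1
    if len(days) <= n:
        return []
    xs0 = sorted(days)
    ys = [days[d] for d in xs0]
    base = xs0[0]
    xs = [x - base for x in xs0]  # shift x for numerical stability

    # Normal equations (A^T A) c = A^T y for a polynomial design matrix.
    ata = [[float(sum(x ** (i + j) for x in xs)) for j in range(n)] for i in range(n)]
    aty = [float(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r_: abs(ata[r_][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for row in range(col + 1, n):
            f = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= f * ata[col][k]
            aty[row] -= f * aty[col]
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        s = aty[row] - sum(ata[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = s / ata[row][row]

    fit = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    resid = [y - f for y, f in zip(ys, fit)]
    sd = pstdev(resid) or 1.0
    m = mean(resid)
    return [(x + base) * 86400
            for x, r_ in zip(xs, resid) if (r_ - m) / sd > z]
```

Note that it skips timestamp-0 rows, so the Cryptonia placeholder dates mentioned above do not poison the trend. A real detector would also want to weight by review text similarity and account score age, not just volume.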