Trawling Tor Hidden Service – Mapping the DHT

Update 2013-08-15: I have been really enthused by reactions I received to this blog post. It has been referenced from Forbes, Gawker and the Daily Mail and a number of people have been in contact about tracking the DHT for themselves. I would recommend the IEEE S&P paper, “Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization” which presents the same issues allowing the DHT to be trawled. They also present some very serious attacks allowing an adversary to locate hidden services with practical resources. It is well worth checking out if you have an interest in Tor.

Tor hidden services have got more media attention lately as a result of some notorious sites like the Silk Road marketplace, an online black market. On a basic level, Tor hidden services allow you to make TCP services available while keeping your server’s physical location hidden via the Tor anonymity network.

TL;DR

  • Tor hidden service directories (HSDir’s) receive a subset of hidden service look-ups from users, allowing them to map the relative popularity/usage of hidden services.
  • An adversary with minimal resources can carry out complete DoS attacks of Tor hidden services by running malicious Tor hidden service directories and positioning them in a particular part of the router list.
  • Many look-ups for Tor hidden services go to the incorrect hidden service directories which negatively affects the initial time to access the site.
  • Hidden services are popular: sites such as the Silk Road marketplace receive more than 60,000 unique user sessions a day.

Introduction

For users to access a hidden service they must first retrieve a hidden service descriptor. This is a short signed message created by the hidden service approximately every hour, containing a list of introduction nodes and some other identifying data such as the descriptor ID (desc ID). The desc ID is based on a hash of some hidden service information and it changes every 24 hours. This calculation is outlined later in the post. The hidden service then publishes its updated descriptor to a set of 6 responsible hidden service directories (HSDir’s) every hour. These responsible HSDir’s are regular nodes on the Tor network which have an up-time longer than 24 hours and which have received the HSDir flag from the directory authorities. The set of responsible HSDir’s is based on the position of the current descriptor ID in a list of all current HSDir’s ordered by their node fingerprint. This is an implementation of a simple DHT (Distributed Hash Table).

A client who would like to retrieve the full HS descriptor will calculate the time based descriptor id and will request it from the responsible HSDir’s directly.

Descriptor is published to a set of HSDir's

Once the client has a copy of the hidden service descriptor they can attempt to connect to one of the introduction points and create a complete 7 hop circuit to the hidden service.

This is a very brief explanation. The Tor Project outlines the protocol much more clearly on their hidden service protocol page, but this should provide enough background to understand the rest of this post.

The Problems

  • Anyone can set up a Tor node (HSDir) and begin logging all hidden service descriptors published to their node. They will also receive all client requests sent to that node, allowing them to observe the number of look-ups for particular hidden services.
  • The list of responsible HSDir’s for a hidden service is based on the calculated descriptor ID and an ordered list of HSDir fingerprints. As the descriptor ID’s are predictable and the node fingerprint is controlled by the node operator, an adversary can position their nodes to be the responsible directories for a targeted hidden service and subsequently perform a DoS attack by not returning the hidden service descriptor to clients. There is nothing the hidden service can do about this attack. It must rely on the Tor Project directory authorities to remove the malicious HSDir’s from the consensus.

Information Gathering from the DHT

Hidden services must publish their descriptors to allow clients to reach them. These descriptor id’s are deterministic but over time any HSDir should have an equal chance of receiving the descriptor for any particular hidden service as part of the DHT. As the descriptor is replicated across 6 HSDir’s, any single responsible HSDir should receive up to 1/6 of all client look-ups for that hidden service providing a good means of estimating hidden service popularity.

This data can be made more accurate by running more Tor HSDir’s which have identity digests in the responsible range, to the point where all the responsible HSDir’s are controlled by the observer.

There are currently about 1400 nodes with the ‘HSDir’ flag running. For the past 2 months I have run 4 Tor nodes and logged all hidden service requests that they have received. Each hidden service publishes to 6 HSDir’s every day, and will publish to 360 or approximately 25% of HSDir’s in a 2 month period. As I am running 4 nodes I statistically should have received a copy of every hidden service that was online for those 60 days. While the distribution will not be perfect, my data should contain a representative set of hidden service activity. Here are some stats from those 4 nodes:

  • Received 1.3 million requests for 3815 unique hidden services my nodes were responsible for.
  • Got 16.03 million requests for descriptors my nodes were not currently responsible for.
  • Received published descriptors for 40,500 unique hidden services
  • Received requests from clients for 25,600 unique descriptor id’s.

Even these basic stats point out a number of issues with the way hidden services currently function. 16.03 million or 92.5% of all descriptor requests my nodes received were for hidden services I did not currently have descriptors for. This could be a result of the clients having an out-of-date network consensus and not choosing the correct HSDir’s as responsible. It could also be a result of an out-of-sync clock which causes the clients to look for the wrong/old descriptor id’s. Either way, a significant amount of time is being wasted on descriptor look-ups, which slows down the first access to a hidden service.
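
The coverage and misdirection figures above can be checked with a few lines of arithmetic. This is only a rough sketch: the variable names are illustrative, the HSDir count is the approximate 1400 quoted above, and it assumes descriptor IDs land uniformly across the HSDir list and that no directory is chosen twice in the period.

```python
# Rough sanity check of the figures quoted in the post (names are illustrative).
HSDIRS_TOTAL = 1400   # approximate number of nodes with the HSDir flag
HSDIRS_PER_DAY = 6    # 2 replicas x 3 consecutive directories
DAYS = 60             # roughly 2 months of logging
MY_NODES = 4

# HSDir slots a service publishes to over the period (an upper bound on
# distinct directories, assuming none is picked twice).
slots = HSDIRS_PER_DAY * DAYS
coverage = slots / HSDIRS_TOTAL

# Expected number of times one of the 4 nodes is responsible for a given
# service, assuming descriptor IDs are spread uniformly over the list.
expected_hits = MY_NODES * coverage

# Share of requests that were for descriptors the nodes were not responsible for.
responsible_requests = 1.3e6
other_requests = 16.03e6
misdirected_share = other_requests / (responsible_requests + other_requests)

print(f"slots={slots} coverage={coverage:.1%} expected_hits={expected_hits:.2f}")
print(f"misdirected={misdirected_share:.1%}")
```

Under these assumptions the expected number of hits per service is just over 1, which matches the claim that 4 nodes should, on average, see each long-lived service about once in 60 days.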

Some More Stats..

The following table contains data on requests my nodes received for some well known Tor based marketplaces. There are also requests for related phishing sites. I have confirmed that some users were directed to these phishing pages from links on “The Hidden Wiki” (.onion). The number of requests is what my nodes observed. Descriptor look-ups from clients will be divided at random between the 6 responsible HSDir’s, and clients will keep trying the remaining HSDir’s until the descriptor is found or all HSDir’s have been checked. As a result the total number of requests received by the following sites per day may be up to 6 times more than the figures below, but these still offer a relative guide to popularity.

Onion Address    Descriptor ID                    Requests
----------------------------------------------------------------
silkroadvb5piz3r cjzls3i2mbj4hjnquqmuvznihues4xh4 16387
silkroadvb5piz3r m6yz6gqrmu35twduuiixzr2mqtxdo3er 10891
5onwnspjvuk7cwvk 6t44eim223ypmb2ueokcsfco5vzvryfm 1413
silkroadvb5piz3r hadco5o7rmh2vcamg7mdzqklprqffyyh 558
silkroadxmx45vk4 6tyqo2bf7xclfbmrtrxwm7mgb3z4s5ui 197
atlantisrky4es5q hdj7wkuaigt7iicqf77gyzbo7zyvq7wf 165
atlantisrky4es5q m6y4s2utv4kxgdczv7t3gbmoloezblzf 161
atlantisrky4es5q 6r3z4tlr2vvl5z34v5lcuaqckgjvtr7s 129
silkroadxmx45vk4 m6eczdbpjse3jdfw54cv4nxcc6s34eku 107
sheep5u64fi457aw ci7dpz6emlh2pxpshovnxjbyjqnp3luo 59
silkrovafuce2ur2 6r6w7ncln5mo4hdwq6yh7f2hf3hlag2u 14
sheep5u64fi457aw rzq5pd4ehayz4yker7dhmvpubm4we7om 9
silkroadopn752dl cja5ppzzmkrzvr2pgp2c2z6mtuc7m7yk 4
silkroadr5cd6wbz m4n2afulpiln4n42l7wlg36wy6okkqrh 3
silkroadfqmteec4 cjcyh7mm6lzzuqka3vlq67upyt5zwd7b 2
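
The "up to 6 times" caveat stated before the table turns each observed count into a simple range. A minimal sketch, using a hypothetical helper name and assuming the true total lies somewhere between the observed count and six times that count:

```python
from typing import Tuple

def request_bounds(observed: int, responsible: int = 6) -> Tuple[int, int]:
    """Return (lower, upper) bounds on total look-ups for one descriptor ID.

    Lower bound: every look-up happened to reach the observing HSDir.
    Upper bound: look-ups were spread evenly over all responsible HSDir's.
    """
    return observed, observed * responsible

# Largest single-descriptor count from the table above.
print(request_bounds(16387))
```

For the top row this gives a range of 16387 to 98322, which is wide but still indicates the order of magnitude.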

Phishing site with a directory listing via forced browsing

Text file containing phished credentials

It seems a lot of people, especially scammers running phishing sites on Tor hidden services, don’t know a lot about web security, which leads to sites like this.

Please, users: if you enter your credentials on a phishing page and it doesn’t log you into the site, don’t try again!

I also observed a large number of requests to the command and control servers of the “Skynet” Tor based botnet which got attention after an AMA with the botnet owner on Reddit. Contact him @skynetbnet on Twitter for more info.

Onion Address    Descriptor ID                    Requests
----------------------------------------------------------------
gpt2u5hhaqvmnwhr m5t2jamzi4fht3hicqadzd3rkl57lyjj 9792
x3wyzqg6cfbqrwht m5doen5pidde5wshaormqh6l2c4bljd5 7162
gpt2u5hhaqvmnwhr hdjeqyqaq344rbb6vxndliobueh3v2u5 6641
4bx2tfgsctov65ch 6twxygfbtb2haivmixjqx5ag35dg72tk 6485
owbm3sjqdnndmydf hd2j5xswo5ddvxpa2rahkg24vwhqo77y 6471
niazgxzlrbpevgvq m63z25bfydkc6nfko4b4kz44u3jsh67k 6334
6ceyqong6nxy7hwp m6czfa7ra6qvrfvbmg6zlsimi4braizy 2691
6tkpktox73usm5vq 6ruzifokbb6ez2qbnion7deylz4jjmq3 1827
uzvyltfdj37rhqfy 6tvqgoeu4piyu3x6dsyerhhhnko6iumy 1735
6m7m4bsdbzsflego m5zclz4icakymh2d6fbdkpq2t3n7efql 1681
6ceyqong6nxy7hwp hds46ckhghbg5gvfwnsgwvhvdildn2e5 1605
jr6t4gi4k2vpry5c m5zfpvht47zgiujyjkf5en2vu6g7ndzm 1579
xvauhzlpkirnzghg 6ugswkgt5pdmhlyzjgj5bhtnclophhwa 1556
jr6t4gi4k2vpry5c m5576obc535dffll4qwq6xh37u4dz7y2 1422
ceif2rmdoput3wjh 6s544wf5d6kuyhtiqq7wo37oisi6n5ce 1397
f2ylgv2jochpzm4c cjore7wxv2x46qrlcvmiyctm4szckpo3 1380
uy5t7cus7dptkchs 6ttlcxahq4obesp5gze4rkjjinivgoyl 1370
7wuwk3aybq5z73m7 6urq4w6to5fmjf3hyqitzlcscxszzbrd 1199
742yhnr32ntzhx3f t5z3anxrp3e5w5z2s5kqfftuxlw4v6ng 1178
7wuwk3aybq5z73m7 m4ejch7su2xm6zrzu5gn3cgrjsg43f6y 235
6m7m4bsdbzsflego 7lsph6grzr76j27fx4r7ir3ipbexidkt 134
6tkpktox73usm5vq hday5dik3mzlwkpvemrorfmirfdnepnp 102
6ceyqong6nxy7hwp lzgpbrgh6ims4fgru6ojdsea7hbruh6s 25
x3wyzqg6cfbqrwht ciq4shgjmutzgmz2t346rvkwtzuobfqe 24
owbm3sjqdnndmydf ciebkgj3gbl4egtwqvwssc6kvtt4x6t5 15
4bx2tfgsctov65ch 6rnq6z57hvjrf4a5yktfd7qavuvlvqjb 6

Many of the “Skynet” onion addresses above and other popular addresses are running Bitcoin mining proxies. They generally responded with a basic authentication request for “bitcoin-mining-proxy”. The other services may be Tor based Bitcoin mining pools or part of Skynet and/or other botnets. It should be straightforward to find these sites by scanning the service id’s from the raw data on Github.

DoS Attacks on Tor Hidden Services

Tor hidden service desc id’s are calculated deterministically and, if there is no ‘descriptor cookie’ set in the hidden service Tor config, anyone can determine the desc id’s for any hidden service at any point in time. This is a requirement for the current hidden service protocol as clients must calculate the current descriptor id to request hidden service descriptors from the HSDir’s. The descriptor ID’s are calculated as follows:

descriptor-id = H(permanent-id | H(time-period | descriptor-cookie | replica))

The replica is an integer, currently either 0 or 1 which will generate two separate descriptor ID’s, distributing the descriptor to two sets of 3 consecutive nodes in the DHT. The permanent-id is derived from the service public key. The hash function is SHA1.

time-period = (current-time + permanent-id-byte * 86400 / 256) / 86400

The time-period changes every 24 hours. The first byte of the permanent_id is added to make sure the hidden services do not all try to update their descriptors at the same time.
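
The two formulas above can be sketched in Python as follows. This is not the script from the Github repo; the function names are made up for illustration, and the byte encodings are assumptions based on my reading of the v2 rendezvous spec: the 16-character onion label is taken to be the base32 encoding of the 80-bit permanent-id, the time-period is packed as a 4-byte big-endian integer, and the replica as a single byte.

```python
import base64
import hashlib
import struct
import time

SECONDS_PER_DAY = 86400

def time_period(now: int, permanent_id: bytes) -> int:
    """time-period = (now + first_byte * 86400 / 256) / 86400, integer division."""
    offset = permanent_id[0] * SECONDS_PER_DAY // 256
    return (now + offset) // SECONDS_PER_DAY

def descriptor_id(onion: str, now: int, replica: int, cookie: bytes = b"") -> str:
    """Return the base32 descriptor ID for a 16-character onion label."""
    # Assumption: the onion label decodes to the 10-byte permanent-id.
    permanent_id = base64.b32decode(onion.upper())
    # Assumption: time-period is a 4-byte big-endian int, replica a single byte.
    secret_id_part = hashlib.sha1(
        struct.pack(">I", time_period(now, permanent_id)) + cookie + bytes([replica])
    ).digest()
    digest = hashlib.sha1(permanent_id + secret_id_part).digest()
    return base64.b32encode(digest).decode("ascii").lower()

# Print today's two descriptor IDs (replica 0 and 1) for one onion label.
now = int(time.time())
for replica in (0, 1):
    print(replica, descriptor_id("silkroadvb5piz3r", now, replica))
```

Because the first byte of the permanent-id shifts the day boundary, two services with different addresses will generally roll over to new descriptor IDs at different times of day.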

identity-digest = H(server-identity-key)

The identity-digest is the SHA1 hash of the public key generated from the secret_id_key file in Tor’s keys directory. Normally it should never change for a node as it is used to determine the router’s long-term fingerprint, but the key is completely user controlled.

An HSDir is responsible if it is one of the three HSDir’s that follow the calculated desc id in a circular list of all nodes in the Tor consensus with the HSDir flag, sorted by their identity digest. The HS descriptor is published to two replicas (two sets of 3 HSDir’s at different points of the router list) based on the two descriptor id’s generated as a result of the ‘0’ or ‘1’ replica value in the descriptor id hash calculation.
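
One way to picture the selection rule is as a look-up in a sorted ring. The sketch below is an illustration rather than Tor's actual code: the function name is hypothetical, the list is assumed to be sorted in ascending order and to wrap around at the end, and a digest exactly equal to the descriptor ID is treated as not "following" it, which is an assumption on my part.

```python
import bisect
from typing import List

def responsible_hsdirs(desc_id: bytes, hsdir_digests: List[bytes], count: int = 3) -> List[bytes]:
    """Return the `count` HSDir digests that follow desc_id in a circular sorted list."""
    ring = sorted(hsdir_digests)
    if not ring:
        return []
    # Index of the first digest strictly greater than the descriptor ID.
    start = bisect.bisect_right(ring, desc_id)
    picks = min(count, len(ring))
    # Wrap around the end of the list, since the ring is circular.
    return [ring[(start + i) % len(ring)] for i in range(picks)]

# Toy ring of five 20-byte "digests", identified by their first byte.
ring = [bytes([b]) * 20 for b in (0x10, 0x40, 0x80, 0xC0, 0xF0)]
print([d[:1].hex() for d in responsible_hsdirs(bytes([0x90]) * 20, ring)])
print([d[:1].hex() for d in responsible_hsdirs(bytes([0xE0]) * 20, ring)])
```

Running this for both replica descriptor IDs of a service would give its two sets of 3 responsible directories.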

I have implemented a script calculating the descriptor ID’s for a particular hidden service at an arbitrary time and it is available on my Github account. I have also created a modified version of ‘Shallot’ which can be used to generate keys with an identity key in a specified range. There is more usage information on its Github page.

The Attack

The code listed above could be used to generate identity keys and identity digests for an adversary’s HSDir nodes so that they’ll be selected as the 6 responsible HSDir’s for a targeted hidden service. These adversary-controlled hidden service directories could simply return no data (404 response) to a client requesting the targeted hidden service’s descriptor and in turn prevent them from finding introduction nodes. As there are no other sources for this hidden service descriptor, it would be impossible for a user to set up a complete circuit to the hidden service and there would be a complete denial of service until the descriptor id changes.

For an adversary to continue this attack over a longer time-frame, they would need to set up their nodes for the upcoming desc id’s of the targeted hidden service more than 24 hours in advance to make sure they will have received the HSDir flag.

An adversary would need to run 12-18 nodes to keep up a complete, persistent DoS on the targeted hidden service. Six nodes would be the “responsible HSDir’s” and the other nodes would be running with identity digests in the range of the upcoming desc id’s, to gain the HSDir flags after 24 hours of up-time. An adversary can cut the resources needed by running two Tor instances/nodes per IPv4 IP or by running the Tor nodes on compromised servers on high, unprivileged ports.
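
The 12-18 figure can be reproduced with a small counting model. This is a sketch under stated assumptions, not a measurement: it assumes each day's set of 6 nodes must be started a fixed lead time before its descriptor period begins and kept up until that period ends, and the 25-hour lead time is a hypothetical value standing in for "more than 24 hours".

```python
import math

HSDIRS_PER_SERVICE = 6   # responsible directories per descriptor period
PERIOD_HOURS = 24        # descriptor IDs rotate every 24 hours

def nodes_running(t_hours: float, lead_hours: float) -> int:
    """Nodes that must be up at time t if each day's set starts lead_hours early."""
    current = math.floor(t_hours / PERIOD_HOURS)
    furthest = math.floor((t_hours + lead_hours) / PERIOD_HOURS)
    return HSDIRS_PER_SERVICE * (furthest - current + 1)

# Sample one day in half-hour steps with a hypothetical 25-hour lead time.
counts = [nodes_running(step / 2, lead_hours=25) for step in range(48)]
print(min(counts), max(counts))
```

With a lead time only slightly above 24 hours, the model needs two sets (12 nodes) for most of the day and three sets (18 nodes) in the short window before each rotation.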

These attacks are quite a real, practical threat against the availability of Tor hidden services. For whatever reason (extortion, censorship etc.) an adversary can perform complete DoS attacks with minimal resources and there is nothing a hidden service owner can do to mitigate them, besides switching to descriptor cookie based authentication or multiple private addresses. The Tor Project can try to deal with these attacks by removing known malicious HSDir’s from the network consensus but I don’t see a straightforward way to identify these malicious nodes.

Unfortunately there are no easy solutions to this problem at the moment. I can foresee adversaries employing these attacks in the wild against popular hidden services.

Conclusions

Tor hidden services were originally implemented as a simple feature on top of the Tor network and unfortunately they haven’t received the attention and love they deserve for such a popular feature. There are discussions under way to re-implement hidden services to allow them to scale more efficiently. There are also people looking at reducing the ability of HSDir’s to sit and gather data on onion addresses and look-ups like I have done, by implementing a PIR protocol. A good summary of work that needs to be done is available in a blog post on the Tor Project’s website. I’d urge any developers with an interest to join the tor-dev mailing list and see if there is something you can contribute! A lot of work is needed.

Anyone interested in learning more about the issues with the current Tor hidden service implementation should check out the presentation “Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization” at the IEEE S&P conference this Monday by Alex Biryukov, Ivan Pustogarov and Ralf-Philipp Weinmann. It will probably provide a much more in-depth, formal investigation and I look forward to reading it. We have been researching similar areas so I’m very interested in their approach and results.

All raw data and my modified Tor clients are available from my Github repo. This also contains scripts for calculating hidden service descriptors and generating OR private keys with fingerprints in a particular range. This data includes all descriptor requests and hidden service ID’s my nodes observed. Please check it out if you are interested in analyzing a random subset of services on the Tor network.

I’d like to thank @mikesligo for having a heated debate in the pub with me about hidden services and getting me interested in how they work. I’d also like to thank @CiaranmaK for reading a draft of this post and pointing out some corrections.

Thank you for reading my first blog post. I’ll have to work on presenting things better as the information in this post is a bit all over the place. Please let me know if you have any questions or feedback in the comments below.

25 thoughts on “Trawling Tor Hidden Service – Mapping the DHT”

  1. gwern

    Interesting… so if I am interpreting your .onion table right, that implies that in April & May 2013, you found a lower bound of 27,836 visitors to SR & 327 to SR phishing sites (so 1.17% of would-be SR visitors were exposed to a phishing site?) and an upper bound of 167,016/1,962 (respectively).

    These directory lookups are one per visit to a hidden service, I take it and the results are subsequently cached?

    1. Donncha Post author

      Hi Gwern,

      It is a small sample set. My nodes were only responsible for SR on two days. So it would be an absolute lower bound of 16387 in one day and an upper bound of 98322. This is a very wide range but it gives an idea of the order of magnitude. I wrote a script today to check how many responsible HSDir’s are returning the descriptor. I realize now I should have run it as I found each service ID, to be able to utilize this data better. It seems pretty changeable.

      I think it is reasonable that ~1% of would-be SR visitors were directed to a phishing page on a particular day it was promoted. Someone was spamming the site on Reddit – Reddit – SR Registration Down. I observed users getting directed to other phishing pages from the hidden wiki.

      I’d just like to reiterate that the absolute values only give a rough estimate of users. But it’s probably the only way we have of measuring hidden service usage without being an administrator on those sites.

    2. Joseph

      Very interesting post. Thank you for taking the time to write it up. There was one thing I did not follow though. If you are a HSDir hosting a hidden service descriptor and therefore have the desc id, how do you go backwards from the HS desc ID to determine the 16-character onion address?

      1. Donncha Post author

        Thanks, there are a few ways to obtain the onion address. The hidden service descriptor containing the hidden service public key will be uploaded to the HSDir, the onion address can be calculated from that public key. You could also simply request the descriptor ID from another HSDir to get the hidden service public key and calculate the onion address.

  11. enquirer

    Hi Donncha, why are there multiple descriptor ids for these services?
    Does that mean that they were hosting their sites on several servers?

    1. Donncha Post author

      Hi, no, it doesn’t mean the sites were hosted on separate servers. If you look at the formula for the descriptor-id you will see that it has a replica variable which can be either 0 or 1 and there is also a time-period included. This results in two different descriptor ids being generated for each hidden service every day.

  14. Paws

    Hi Donncha, thanks for this very interesting post. Could you elaborate how you calculate the onion address based on the descriptor? Cheers

      1. Paws

        Thanks, you’re amazing! I also took a look at your modified TOR source code to dump the HSDir descriptors. However, I can’t find where you do the actual dump in the code.
        As far as I can see, the descriptors are kept in memory and not stored by default in cache. The memory management is done through the rendservice.c file, is this correct?
        Did you add a DB connector directly in the TOR source?
        Cheers

  15. guanxinyue

    Hi Donncha, thanks for your interesting and useful post. I modified the tor source code to collect hidden services descriptors, however several hours later after I got the hsdir flag, my tor instance broke down and lost its hsdir flag. I have tried many times and every time the tor broke down. Can you give me some help?

  16. David

    How easy will it be to adapt the changes to the code that you implemented back in 2013 to the current version of tor (0.2.7.6)?

  17. rabiu mukhtar

    Hello Donncha,

    As you said ” Each hidden service publishes to 6 HSDir’s every day, and will publish to 360 or approximately 25% of HSDir’s in a 2 month period.”

    Is the process still the same now that there are over 6,000 nodes?

    How many nodes would I need to run to obtain a somewhat complete hidden service requests for all the HS?

    Thank you
