Deanonymizing Tor using ML

October 20, 2020 by Dimitar Kostadinov

Standard Tor Traffic and Machine Learning (ML)

Conventional non-Tor encrypted traffic comes with openly visible attributes such as source and destination IP addresses and port numbers. The onion routing mechanism inherent in the Tor network effectively obfuscates these attributes, so other attributes have to be used to classify which traffic originates from Tor.

 

Analyzing Tor entry nodes with more traditional methods is almost always futile. Because entry guards are assigned randomly, anyone who tries to attack Tor this way would have to operate many Tor nodes to have a reasonable expectation of intercepting the targeted traffic. On top of that, classifiers would be confused to a great extent if random padding were added to the data in transit. This assessment is in line with the view of the Tor Project leader, Roger Dingledine.

The most prevalent traffic detection approaches that vendors tout rely on blocking known entry nodes of the Tor network. That is easier said than done, however, and those approaches are not that difficult to bypass. Moreover, because Tor relays traffic dynamically through numerous nodes, blocking IP addresses may not be a great idea in practice, given that those addresses carry non-Tor traffic, too.

According to some ML specialists, only a generic method enhanced by deep learning-driven techniques could detect Tor traffic with high precision.

Dr. Roshy John, Global Head, Robotics & Cognitive Systems at Tata Consultancy Services, holds that technologies like AI and ML can identify, with high probability, entries that originate from a Tor source. This is done in two phases:

1)    A human-based analysis of indicators that arrive through UDP or TCP packets

2)    The data from the first phase is then used to train an ML algorithm in a supervised learning procedure based on recurrent neural networks (RNNs) and Decision Trees.

A research group from Rochester Institute of Technology conducted an interesting experiment on how machine learning can be used to deanonymize Tor traffic. They hosted multiple Tor nodes and compared the traffic flowing through them with regular traffic. The main goal of the experiment was to determine whether ML can differentiate between the two kinds of traffic.

Decision Trees seem to work well for classifying different types of traffic, with a claimed accuracy of “more than ninety-five percent.” An algorithm that combines the generated rules identifies four parameters of a packet all at once. And voila, the end result is:

“Thus, we cannot only identify certain packets to be originating from the Tor network but also know that it is part of a TCP, which is an ‘NMAP’ scan while also knowing the location of the hosts they are originating from.

 

More importantly, we concluded that Machine Learning can be an efficient tool for security over the Tor network.”
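
The Decision Tree classification described in this section can be illustrated with a small, dependency-free sketch. It is not the Rochester group's implementation: the two features (mean packet size and mean inter-arrival time), the eight sample flows and all function names are hypothetical and chosen only to show how a tree learns threshold rules from labelled flows.

```python
from collections import Counter

# Synthetic, hand-made flows for illustration only (not real measurements).
# Each row: ((mean packet size in bytes, mean inter-arrival time in ms), label)
# where label 1 = Tor and 0 = non-Tor.
FLOWS = [
    ((586, 120.0), 1), ((590, 140.0), 1), ((580, 95.0), 1), ((600, 160.0), 1),
    ((1400, 12.0), 0), ((220, 30.0), 0), ((900, 8.0), 0), ((64, 45.0), 0),
]

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return the (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, depth=0, max_depth=3):
    """Recursively grow a tree; a leaf is simply the majority class label."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def predict(tree, x):
    """Follow the threshold rules until a leaf (an int label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

tree = build_tree(FLOWS)
```

On this toy data a single rule on the second feature is enough to separate the two classes, so `predict(tree, (585, 110.0))` follows one comparison before reaching a leaf. Real captures would need many more flows and features, and in practice a maintained library implementation would normally be preferred over a hand-written tree.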

Deep Neural Networks (DNNs)

DNNs can process data in complex ways thanks to their sophisticated mathematical modeling. Their strengths are:

  a)  Doing away with feature engineering, which is a time-consuming part of the ML process.
  b)  Solving new problems relatively easily, as the architecture can be adapted to address them.

Unfortunately, not everything is that great. The downside of DNNs is that they require huge quantities of data and can be expensive to train, sometimes requiring hundreds of machines equipped with costly GPUs.

How are neural network models trained? For example, with the help of Keras, a high-level API that wraps lower-level deep learning libraries. pandas is another important component of a typical DNN workflow: it is a Python package that provides structured analysis of real-world data.
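
To make the idea of training concrete, here is a minimal sketch of the gradient-descent loop that frameworks such as Keras automate. It deliberately uses no framework at all: it trains a single sigmoid neuron, the smallest possible "network", on a hand-made toy feature. The data, the learning rate and the function names are assumptions made for illustration.

```python
import math

# Toy, hand-made samples for illustration only: one normalised feature per
# flow, label 1 = Tor and 0 = non-Tor. Real work would load captured features.
X = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
Y = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=1.0, epochs=2000):
    """Fit weight w and bias b of one sigmoid neuron by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            # For cross-entropy loss with a sigmoid output, the gradient with
            # respect to the pre-activation is simply (prediction - label).
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

w, b = train(X, Y)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in X]
```

A deep network stacks many such units in layers and computes the gradients automatically, which is where the data and hardware costs mentioned above come from.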

Preparing ML Technologies to Recognize Tor Traffic

  • Data collection can be done in a variety of ways. For instance, live traffic can be captured with Wireshark.
  • Data pre-processing, a data mining technique, transforms raw data into a more understandable format. It deals with the shortcomings of real-world data, which is often inconsistent, incomplete, full of errors and/or lacking specific behavioral traits.
  • Feature Selection

This is a core concept in ML, as the role of data features is to train the ML models. There are three families of feature selection methods:

  • Filter Methods: a statistical measure is assigned to each feature
  • Wrapper Methods: several features are evaluated together using a predictive model, and each combination is assigned a score based on model accuracy
  • Embedded Methods: these methods learn which features contribute most to accuracy while the model is being created
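
As an illustration of the first family, the sketch below scores each feature independently with one statistical measure, the absolute Pearson correlation with the class label, and ranks the features by that score. The feature names and values are hypothetical toy data, and correlation is only one of several statistics a filter method could use.

```python
import math

# Hand-made toy table for illustration only; the feature names are hypothetical.
FEATURES = {
    "mean_packet_size": [586, 590, 580, 600, 1400, 220, 900, 64],
    "mean_interarrival_ms": [120.0, 140.0, 95.0, 160.0, 12.0, 30.0, 8.0, 45.0],
    "flow_id_parity": [0, 1, 0, 1, 0, 1, 0, 1],  # deliberately uninformative
}
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Tor, 0 = non-Tor

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(features, labels):
    """Score every feature on its own and sort from most to least relevant."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_features(FEATURES, LABELS)
best_feature = ranking[0][0]
```

Note that in this toy table the packet-size column receives a low score even though its Tor values are tightly clustered: a linear statistic can under-rate a feature whose relationship with the label is not monotonic, which is one reason wrapper and embedded methods exist.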

All things being equal, more informative features mean more precise classification of encrypted traffic, while sound feature selection also reduces redundant data and training time. Nevertheless, too many features can create a bottleneck during the algorithm training process, and some features can even decrease the accuracy of the encrypted traffic classification.

Training time might be an important consideration when the data set is enormous. For more efficient processing, the encrypted traffic classification algorithm should be equipped with adequate computational resources.

  • Model Selection

If you have several approaches in mind, this is the point at which you choose the final ML model. A deep learning model built from neural networks is one example of such a choice.

  • Model Evaluation

The loss and the accuracy achieved by the model are the main criteria for success here.
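
Reading "loss" in the sense used by training frameworks (an assumption about the criterion above), both measures can be computed from a model's predicted probabilities on held-out data. The sketch below uses binary cross-entropy as the loss; the labels and probabilities are hypothetical values, not results from any real model.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical held-out labels (1 = Tor) and model output probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1]

acc = accuracy(y_true, y_prob)
loss = binary_cross_entropy(y_true, y_prob)
```

Accuracy only counts right and wrong answers, whereas the loss also penalises confident mistakes, so the two are usually tracked together.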

 

Conclusion

Although Tor is a great project in terms of privacy, it is not impregnable. Not long ago, users of the Tor Browser were exposed to fingerprinting and location identification due to a JavaScript-related bug. Also, a 2015 study, “A Model for Detecting Tor Encrypted Traffic using Supervised Machine Learning,” revealed that Tor was not that good at obfuscating network packet features. As a result, Tor traffic was exposed to a local observer, who also managed to fingerprint the majority of the top websites on Alexa.

Besides, great technological discoveries have always been a double-edged sword. While they can be applied for the greater good, someone will eventually find out how to use them in a nefarious manner. The Tor Project is a great example of this line of reasoning. On the one hand, Tor gives regular people the rare opportunity to remain anonymous online, unlike when browsing the web by conventional means; on the other hand, this technology can be used by criminals to cover their tracks when they commit crimes via the Internet.

Yet, do you really think that intelligence services of the United States, for one, with their virtually limitless budgets don’t have the means to track people surfing the global network through Tor? Perhaps that is what the maker of the Silk Road marketplace thought, but he ended up being caught.

The point is that Tor is great privacy-wise, but there are probably ways to defeat its anonymization, and machine learning is probably one of them.

 

Sources

  1. A Model for Detecting Tor Encrypted Traffic using Supervised Machine Learning, I.J. Computer Network and Information Security
  2. A Survey on Tor Encrypted Traffic Monitoring, School of Computer Sciences University Sains Malaysia
  3. Internet and Tor Traffic Classification Using Machine Learning, Rochester Institute of Technology
  4. Machine Learning To Detect Anonymous Attacks over TOR, Dr. Roshy John
  5. New attack on Tor can deanonymize hidden services with surprising accuracy, Ars Technica
  6. Tor Browser 9.0.7 Patches Bug That Could Deanonymize Users, Bleeping Computer
  7. Tor Traffic Analysis and Detection via Machine Learning Techniques, University of Trieste and ICAR-CNR, and University of Genova
  8. Using Deep Learning for Information Security, Acalvio
  9. “User Identification In Tor Network” In Machine Learning | Machine Learning Assignment Help, CodersArts
Articles Author
Dimitar Kostadinov

Dimitar Kostadinov applied for a 6-year Master’s program in Bulgarian and European Law at the University of Ruse, and was enrolled in 2002 following high school. He obtained a Master’s degree in 2009. From 2008 to 2012, Dimitar worked in data entry and research for the American company Law Seminars International and its Bulgarian-Slovenian business partner DATA LAB. In 2011, he was admitted to Law and Politics of International Security at Vrije Universiteit Amsterdam, the Netherlands, graduating in August 2012. Dimitar also holds an LL.M. diploma in Intellectual Property Rights & ICT Law from KU Leuven (Brussels, Belgium). Besides legal studies, he is particularly interested in the Internet of Things, Big Data, privacy & data protection, electronic contracts, electronic business, electronic media, telecoms, and cybercrime. Dimitar attended the 6th Annual Internet of Things European summit organized by Forum Europe in Brussels.