A Model For Detecting Tor Encrypted Traffic Using Supervised Machine Learning
3 authors:
Jalal Atoum
Princess Sumaya University for Technology
All content following this page was uploaded by Ali Hadi on 26 February 2016.
Abstract: Tor is a low-latency anonymity tool and one of the most prevalent open source tools for anonymizing TCP traffic on the Internet, used by around 500,000 people every day. Tor protects users' privacy against surveillance and censorship by making it extremely difficult for an observer to correlate visited websites with a real-world identity. Tor accomplishes this by protecting its traffic against traffic analysis and feature extraction techniques. Further, Tor counters website fingerprinting by implementing defences such as TLS encryption, padding, and packet relaying. In this paper, however, Tor is analysed from the position of a local observer in order to bypass these protections; the method consists of feature extraction from a local network dataset. The analysis shows that it is still possible for a local observer to fingerprint top monitored sites on Alexa, and that Tor traffic can be classified amongst other HTTPS traffic in the network despite Tor's protections. Several supervised machine-learning algorithms were employed in the experiment. The attack assumes a local observer on a local network fingerprinting the top 100 sites on Alexa; the results improve on previous results, achieving an accuracy of 99.64% and 0.01% false positives.

Index Terms: Anonymity, Censorship, Interception, Machine Learning, Tor, Traffic Analysis, Traffic Classification

I. INTRODUCTION

Tor is a widely known low-latency network anonymity project, currently used by around 500,000 daily users and carrying 2500 MB of data per second [1]. Tor stands for "The onion router", or the onion routing network; it provides a bidirectional anonymized connection over the network. Tor provides a strong implementation that protects against both sniffing and analysis, securing the communication to protect both data confidentiality and users' privacy. The TLS protocol is used in Tor communication to provide the required encryption [2].

For example, if Bob and Alice communicate over a public Internet connection, then by means of Tor they can ensure that their communication cannot be intercepted or monitored by eavesdroppers and that the information passed back and forth is encrypted and anonymized.

Tor is free open source software that works on almost every platform; once Tor is installed, users can use a web browser to anonymize their traffic. Traffic passing between Tor nodes and users is secured via strong encryption [3]. Moreover, Tor works well with modern browsers such as Firefox and Chrome through Tor bundles. Bundles enable users to install Tor as a browser extension, which makes it easier for users to protect their communication and attain anonymity and privacy [4]. However, although Tor is used for online anonymity, it is also heavily used by hackers and cybercriminals in order to avoid traceability [5]. With the increasing usage of the Internet, concerns over censorship and privacy have become a major issue, and users rely heavily on anonymity tools in order to conceal their identity and gain privacy. For those users anonymity is highly important, and analysing Tor against various attacks is deemed necessary to ensure adequate protection of users' privacy.

Further, anonymity tools are evolving rapidly, and both blocking anonymous traffic and developing anti-blocking tools attract many researchers [6]. This gives Tor a strong reason to monitor and track anti-anonymity tools in order to ensure secure anonymity for users at all times. In fact, the detection of anonymity tools has become a hot topic, as there is an endless battle between developers who work to improve the anonymity tools and organizations and governments who work just as hard to break anonymity. Internet users strongly believe that anonymity is needed to protect their privacy; users in totalitarian regimes rely strongly on such networks to communicate freely. Breaking Tor anonymity reduces the protection that Tor claims to offer for concealing users' identities, and thus increases the chances for those totalitarian regimes to physically identify users, which could lead to severe consequences such as imprisonment or even threats to life [7].
Copyright © 2014 MECS I.J. Computer Network and Information Security, 2015, 7, 10-23
In this paper, several machine learning (ML) algorithms are considered in order to fingerprint Tor usage in the network. This study will help Tor developers to improve Tor security and to provide more advanced techniques and solutions that strengthen Tor anonymity and give users more complete protection in case the same analysis is conducted by an attacker or a totalitarian regime. The main objectives of this work can be summarized as follows:

1. Researching different techniques and tools in order to identify Tor usage in the network by tracing offline network traffic data.
2. Researching the possibility of fingerprinting Tor traffic of the top 5 sites on Alexa amongst the other top 100 sites on Alexa using ML algorithms, by extracting statistics from the SSL flows used by the Tor software.
3. Generating extensive HTTPS traffic along with Tor traffic using two virtual machines (VMs).
4. Performing feature selection on network pcap files generated from different network traces to build the ML data model.
5. Conducting an analysis of how many packets of SSL flows are required to classify Tor amongst HTTPS.
6. Performing detailed experimentation to measure the accuracy of ML classifiers.

Identifying the individual users who use Tor is out of the scope of this research; the focus is only on identifying Tor usage in offline network traces via website fingerprinting. The study of ML algorithms themselves is also limited in this research, since that is more a matter of general computer science; the focus is mainly on traffic classification for Tor using specific ML algorithms in order to perform website fingerprinting. Further, the analysis of Tor is conducted in a closed-world local network environment, because it is difficult to obtain traffic from an open-world environment such as Internet Service Providers (ISPs).

II. TOR BACKGROUND

Tor allows people to access and publish content on the Internet without being tracked, identified, or exposed to authorities. Because Tor is used by many different types of people, the risk varies from that of a child accessing forbidden sites to higher risks such as employees or political activists accessing Tor. However, while many people agree on the positive reasons to use Tor, some see Tor as a big threat that could allow criminals to commit crimes with impunity. There are several good reasons for using Tor: ordinary people use Tor to protect their information from external adversaries, the military uses Tor to protect government communication, and law enforcement offices and agencies use Tor for their investigations and operations. Low- and high-profile people also use Tor to voice opinions that may be unpopular or conflict with their public persona [8]. Tor relies completely on the TLS protocol for its network communication. TLS encrypts and authenticates the communication between Tor instances.

A. Transport Layer Security

Netscape Communications Corporation first introduced the Secure Sockets Layer (SSL) protocol in 1995 to enable e-commerce transaction security on the web. TLS is used heavily nowadays by most Internet communication to protect confidentiality through encryption, as well as integrity and authentication, to ensure a safe transaction. To achieve this, SSL was built beneath the application layer, directly on top of TCP, which enables the protocol to work with HTTP, SMTP, FTP, and many others. The primary purpose of SSL and TLS is to protect HTTP traffic in the network. In plain HTTP, when a new TCP connection is created, the client sends the request to the server and the server responds with the content. When SSL is used, the client first creates a TCP connection and establishes an SSL channel over it; the HTTP request is then sent over the SSL connection instead of the bare TCP connection. Plain HTTP does not understand the SSL/TLS handshake, so the HTTPS scheme is used instead to indicate a connection over SSL [9].

TLS is a layered protocol consisting mainly of two layers. At the lower level is the Record Protocol, which is responsible for transmitting messages, fragmenting the data into blocks, and several other steps. On the top layer are the Handshake Protocol, Alert Protocol, Change Cipher Spec Protocol, and Application Protocol; Fig. 1 shows the TLS protocol layers. The TLS Handshake Protocol allows client and server to authenticate each other and to exchange encryption keys and algorithms before the protocol starts to send data over the network [10].

[Figure: Application layer protocol on top; Handshake Protocol, Change Cipher Protocol, Alert Protocol, and Application Protocol side by side; TLS Record Layer at the bottom; together forming Transport Layer Security]
Fig. 1. TLS Protocol Layers

B. Onion Routing

The Onion Router (OR) was originally created for Sun Solaris 2.4 in 1997 and included proxies for remote logins, email, web browsing, and the file transfer protocol (FTP) [11]. The main purpose of onion routing is to provide real-time bidirectional anonymous interaction between two parties that is resistant to eavesdropping, sniffing, and traffic analysis. Onion routing consists of a series of ORs connected in such a way that each OR has a dedicated socket connection to a set of neighboring ORs. To build up the anonymous connection, the application initiates a series of connections to a set of Onion Router Proxies (ORPs) that ultimately build up the anonymous connection. The routing occurs at the application layer of the protocol stack, and not on the IP
layer. However, the IP network is what determines where the data moves between individual onion routers.

III. RELATED WORK

Tor achieves anonymity by making it very difficult for an adversary to identify client and server identities. In Tor's design, the entry node knows only the client, which communicates with the middle node, and the middle node knows that the entry node is communicating with another machine, the exit node. The middle relay cannot tell whether or not it is the middle node in the circuit. The exit relay knows the middle node and communicates with the server (target destination). Finally, the server believes the connection is coming from the exit node [12].

Historically, an extensive body of work has attacked Tor anonymity circuits in ways that can degrade anonymous communication over Tor; most of these attacks are based on traffic analysis. However, attacks based on traffic analysis may suffer a high rate of false positives (FP) for a number of reasons, such as Internet traffic dynamics and the difficulty of determining the number of packets required for the statistical analysis of traffic. That said, timing and latency are important metrics in traffic analysis for identifying Tor, as are packet counting and volume metrics [13].

Previous work on path selection focused on latency as a link property and primarily took delay into account. The attack in [14], however, takes a different approach: it identifies the importance of latency as an indicator of congestion and accordingly suggests an improved path selection algorithm. Further, Tao proposed a way for Tor clients to respond to short-term congestion by building a timeout mechanism.

Existing traffic analysis attacks against anonymous communication can be classified into two main categories: traffic confirmation attacks and traffic analysis attacks. Each category consists of both passive and active attacks. In a passive traffic analysis attack, the adversary records the traffic passively and identifies the resemblance between client inbound traffic and server outbound traffic. An active attack, meanwhile, aims to embed a specific secret signal (or mark) into the target traffic and detect it [15].

A traffic confirmation attack is one in which an adversary tries to confirm that two parties are communicating with each other over Tor by observing patterns in the traffic, such as its timing and volume. Traffic confirmation attacks are not the focus of Tor's threat model. Instead, Tor focuses on preventing traffic analysis attacks, in which the adversary tries to determine at which points in the network a traffic-pattern-based attack should be executed [15].

IV. TOR FINGERPRINT METHODOLOGY

The goal of this research is to fingerprint Tor traffic flows in a local network environment in order to break Tor's anonymity and identify top monitored sites on Alexa using ML classification techniques. There are several steps involved in a Tor fingerprinting attack within a local network environment. A local network environment means two things: first, all web pages are known in advance, and second, the attack is launched by a local attacker. The attacker observes the encrypted traffic and draws conclusions from certain features of the traffic such as packet sizes, volume of data transferred, timing, and many others. This type of attack is considered in this research to ensure the comparability of the results with related work. The method is not able to break the cryptography used in Tor; although it does not reveal message semantics, it provides a way to observe specific patterns in order to reveal known traffic instances such as web pages.

In a real-world scenario, if a user runs Tor OP in a shared local network, other users on the same network may use different applications and thus pass different types of network traffic such as HTTP, HTTPS, FTP, and others. Tor uses TLS encryption between the client, the ORs, and the destination server, so the hypothesis is that Tor traffic should have characteristics similar to any other TLS traffic such as HTTPS. If variation in the traffic characteristics is found, however, then Tor instances can be fingerprinted amongst other TLS traffic and anonymity can be broken. In the experiment, HTTPS encrypted traffic is considered because the majority of sites encrypt their communication using TLS over port 443. Similarly, Tor traffic is considered from a user (victim) who uses Tor OP on the same local network to browse the top 5 sites on Alexa. The goal of this study is therefore to identify the variations between the HTTPS traffic instances and the victim's Tor traffic instances in the same local network environment.

In order to find those variations in traffic characteristics between Tor and HTTPS, ML methods need to be employed using a statistical classification technique [16]. The purpose of using ML methods to detect patterns in the packet information is to extrapolate and predict the traffic type contained within a TLS flow; in this research, Tor encrypted traffic data and HTTPS data are used to train the system.

The steps of the Tor fingerprinting methodology in this work are summarized in Fig. 2 as follows:

Step-1, data collection: it is required to capture a ground truth, that is, HTTPS data for which the underlying application is known. At the same time, Tor traffic instances of the top 5 sites on Alexa are collected to represent Tor instances. The collected data is used to train the model using ML methods.
Step-2, feature extraction and feature selection: feature extraction is crucial in order to detect the variation between Tor instances and HTTPS instances, and feature selection is required to identify which features improve the accuracy of website fingerprinting.
Step-3, labeling: each row in the traffic data is marked with the corresponding label.
Step-4, classification: traffic flows are classified, based on those variations in characteristics, either as a Tor site or as regular HTTPS traffic.

The following subsections describe each step in detail.

Fig. 2. Methodology steps for fingerprinting Tor

A. Data Collection

To validate the method used in this research, there is a need for a ground truth, that is, SSL connections for which the underlying application is known. HTTPS traffic data is therefore required for building the training dataset. Since there are no public dataset sources that can be used in the experiment, a data generation technique similar to that of [14], which was previously shown to achieve higher accuracy, has been adopted, omitting the removal of SENDMEs because it did not affect the results much. Precautions need to be taken in order to collect the data in the same way a realistic attacker would. The Firefox browser and Selenium [17] WebDriver were used to automate the browsing process; the websites used are taken from the top 100 sites on Alexa in order to mimic real user behavior in the local network environment.

Ideally, the traffic traces are captured from more than one machine; the captures consist of raw data transmitted over the physical wire or wireless network at a given point, see Fig. 3. Each machine runs different encrypted services. A few machines generate HTTPS traffic and others run the Tor application to generate Tor encrypted traffic. In the experiment two virtual machines are used as clients; the software stack used is detailed below.

Fig. 3. Data collection

1) Environment Setup

In the data generation method, two virtual machines (VMs) are used in order to generate the traffic for the experiment. A VM is a software implementation of a computing environment in which an Operating System (OS) can be installed and run on an emulated physical computing environment. It utilizes the underlying physical hardware, including CPU, memory, hard disk, network, and other hardware resources, to create a virtual computing environment. Although the guest OSs and programs are running on a virtualized computer, they are not aware that they are running on a virtual platform [18]. In this study, different OS distributions were installed to emulate the actual traffic in the network; a breakdown of the OSs used as VMs to capture the network traffic is presented in Table I. Each of these guest operating systems runs with a specific VM configuration: 512 MB of RAM and 20 GB of disk space, with a shared Network Address Translation (NAT) networking setup. Further, Linux distributions on VMs with different processor architectures are used.

Table 1. Break Down Of Client Virtual Machines Operating Systems

Client #  Operating System  Operating System Version  Processor
CVM1      Linux Ubuntu      12.04                     64 bit
CVM2      Linux Backtrack   5.01.3                    32 bit

Linux BackTrack is a Linux-based penetration-testing arsenal that helps security professionals perform assessments in a purely native environment dedicated to hacking. Linux Ubuntu is a standard Linux distribution that powers millions of desktop PCs, laptops, and servers around the world. Moreover, the OSX machine in Table II is used for conducting the analysis.

Table 2. Analysis Machine

Client #  Operating System  Operating System Version  Processor
A1        OSX               10.9.2                    64-bit

2) Traffic Generation Tools

The first step in obtaining traffic traces for the dataset is to capture the data from the VMs and use it to build training datasets. The aforementioned VMs and sniffer software were used to sniff and capture the traffic from A1. The VMs are configured to run behind NAT on the A1 machine, which means traffic always routes through A1. This provides two major benefits: first, full control over capturing the dataset, and second, the ability to apply specific filters based on particular parameters without any traffic disruption that could degrade the dataset and invalidate the training data. There are many sniffers available today, the two best known being Wireshark and tcpdump. Any sniffer will suffice for the testing, but a simple, flexible, low-cost, and fast tool is best, and tcpdump works well as the sniffer for the experiment.

Tcpdump is a free open source sniffer that uses libpcap to capture traffic and provides information about IP layer packets, i.e. the length of the packet, the time the packet was sent or received, and the order in which the packets were sent and received. Tcpdump is flexible and fast; it runs on most Linux and Unix variants, is installed by default on many Linux distributions, and has been ported to Windows as WinDump. It supports a variety of filters, with a powerful language for specifying individual filter types [19]. Further, many other services are running on the VMs; Table III breaks down the
services installed on each machine, with the corresponding versions. Each of these machines runs completely independently in its VM environment.

Table 3. Services running on the machines

Machines  Services
CVM1      Firefox 14.0.1
          Tor 0.2.4.22
          Netmate 0.9.5
          Tcpdump 4.3.0
          Libpcap 1.3.0
CVM2      Firefox 14.0.1
          Tor 0.2.4.22
          Netmate 0.9.5
          Tcpdump 4.3.0
A1        Weka 3.7.3
          Wireshark 1.10.7

The details of each VM used are as follows:

a) CVM1

This VM is used to run Firefox and browse sites that run over HTTPS. The traffic generated is intended to represent regular HTTPS traffic for the top 100 sites on Alexa. In a real-world scenario, most of this traffic is generated during regular secure browsing activities such as email communication, social networking, and financial activities. Unfortunately, these activities are difficult to mimic. Thus, the approach taken in this work involves using Selenium [17], a tool for automating web applications for testing purposes, with a complete list of sites that run over HTTPS. Selenium operates by controlling a standard browser. This is important because the generated traffic needs to look as if it was captured from a user browsing the web during regular business activities. Similar to the method of Ian and Tao [14], a specific list of websites was obtained for the local network environment; those sites are taken from the top 100 sites on Alexa.com. Alexa is a leading provider of free, global web metrics and ranks the top sites based on the number of unique users, page views, and number of visits [20].

b) CVM2

This VM browses a specific list of what are expected to be the top monitored websites on Alexa, but this time with Tor OP. In the attack scenario, the expectation is that the victim uses this machine to browse top sites on Alexa. The same method as for CVM1 is used for CVM2. The 5 sites used in the experiment are Google, Yahoo, Facebook, Wikipedia, and Twitter.

3) Dumping Traffic

Dumping traffic is required in order to capture training datasets and record information to be used later for the classification part. The A1 machine is used to capture all the traffic from all the CVM(i) machines; a sniffer is installed on the interface to capture the traffic that passes through. Tcpdump is used to generate packet capture (pcap) files, a format that Wireshark can also read. In the process, packets are captured and stored on the A1 machine for further analysis using ML. Traffic generation was scheduled from each machine using a small bash script to record traffic on an hourly basis. The tcpdump sniffer is installed so that it can capture traffic from the two machines on a scheduled basis, see Fig. 4.

Fig. 4. Traffic capturing through A1

Traffic passes from CVM(i) through the NAT connection to the website servers. Since traffic was scheduled to pass in a specific timeframe, HTTPS traffic was generated first from CVM1 and stored as regular-HTTPS.pcap. Similarly, traffic from CVM2, which runs Tor OP, was captured into files named after the site the traffic corresponds to; for example, Google traces are stored in Google-Tor.pcap and Yahoo traces in Yahoo-Tor.pcap. The data generation took two weeks to finish, and the final output files in pcap format are listed in Table IV. The table contains the number of packets and flows along with the sizes for each. Fig. 5 shows a summary chart of each flow. After dumping the network traffic from CVM(i), the next step is to use those files to build the training model for the classification method. The traffic generated contains a number of flows; those flows will be used to create the model.

Table 4. Traffic and Their associated number of flows

Traffic Type   Size / MB  Number of Packets  Number of Flows  Avg Packet Size / Byte
HTTPS          808.7      1054835            38845            750.617
Google Tor     110.1      146151             5231             737.407
Yahoo Tor      155.4      206998             7959             734.596
Facebook Tor   132.7      160491             4085             810.785
Twitter Tor    132.2      171935             5577             752.653
Wikipedia Tor  87.7       122708             4465             698.716

[Chart series: Size / MB, Number of Packets, Number of Flows, Avg Packet Size / Byte]
Fig. 5. Summary chart for all pcap files
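The notion of a flow used above can be illustrated with a minimal Python sketch. It is not the tooling used in this work (the services table lists Netmate for that purpose); it only assumes that a flow is the set of packets sharing the same pair of endpoints and protocol, regardless of direction. The packet records, field names, and addresses are hypothetical and stand in for values parsed from a pcap file.

```python
from collections import defaultdict

def flow_key(pkt):
    """Direction-independent key: both directions of a connection share one flow."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    lo, hi = sorted([a, b])
    return (lo, hi, pkt["proto"])

def flow_stats(packets):
    """Group packets into flows and compute simple per-flow statistics."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    stats = {}
    for key, pkts in flows.items():
        sizes = [p["size"] for p in pkts]
        times = [p["time"] for p in pkts]
        stats[key] = {
            "packets": len(pkts),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
            "duration": max(times) - min(times),
        }
    return stats

# Hypothetical packet records (not taken from the paper's dataset).
packets = [
    {"time": 0.00, "src": "10.0.0.2", "sport": 50000,
     "dst": "93.184.216.34", "dport": 443, "proto": "tcp", "size": 517},
    {"time": 0.05, "src": "93.184.216.34", "sport": 443,
     "dst": "10.0.0.2", "dport": 50000, "proto": "tcp", "size": 1460},
    {"time": 0.20, "src": "10.0.0.2", "sport": 50001,
     "dst": "93.184.216.34", "dport": 443, "proto": "tcp", "size": 200},
]

stats = flow_stats(packets)
first = stats[flow_key(packets[0])]
```

Per-flow statistics of this kind (packet counts, byte volume, sizes, duration) are the sort of features a classifier can be trained on; the actual feature set used in the experiment is the one produced by the paper's own tools.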
specify which data is HTTPS and which data is Tor. To accomplish that, a labeling process is required in which the label attribute is specified on each data instance in the ARFF file. This labeled data is known as the ground truth. Building up the ground truth is a very important and critical phase of any traffic classification method, since the entire classification process relies completely on the accuracy of this labeling. Data instances must therefore be labeled accurately according to their types to minimize false positives and false negatives. Also, because traffic has been completely separated by CVM(i), validation has been conducted to ensure that only traffic generated by each CVM is attributed to that CVM, and that no other traffic noise is mixed in with the intended traffic.

D. Building ML Classification Model

Supervised ML is employed in order to build the model from the training dataset. In supervised learning, the algorithm takes known data, called the training dataset, in order to make predictions. The method attempts to discover the relationship between input attributes and target attributes; the relationship discovered is represented in a structure called a "model", see Fig. 6.

Weka is chosen to apply the ML algorithms and build the data model from the different traffic training sets. Either the Weka GUI or its command line interface can be used to accomplish this; Fig. 8 presents the use of Weka through a simple command line interface to generate a data model.

V. EVALUATION TECHNIQUES

In this experiment, a few ML methods were used in order to fingerprint websites over Tor. The experiment was repeated multiple times using Weka, each time using a different set of training and test cases (changing the number of packets used to create the case). To obtain a simulated test performance, the testing data used in the evaluation are the same as the training dataset, but with 10-fold cross-validation in Weka.

Cross-validation means that part of the data is reserved for testing while the rest is used for training. In other words, the data is partitioned into 10 parts (folds), with one part used for testing and the remaining 9 parts for training. Further, different sets of attributes (features) were used for classification, and a deep investigation was performed in order to find relevant attributes and build minimal rule sets for classifying Tor traffic (finding the minimal rule set has been proved to be an NP-hard problem [25]), together with different cross-validation test options, to achieve higher accuracy with lower FPR and FNR. Fig. 9 shows the steps of the classification method in general.

Fig. 9. Detection Diagram using ML

The sites used in the fingerprinting evaluation are the top 5 websites on Alexa; the sites are listed with a localization domain to avoid Tor redirection to the local IP location. Table VI lists the top websites that have been chosen in the evaluation process.
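The 10-fold cross-validation procedure described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the Weka procedure used in the experiment: the fold assignment is a plain shuffled split (Weka's own fold assignment may differ, for example by stratifying on the class), the classifier is a hypothetical one-feature nearest-centroid rule, and the data is synthetic and clearly separable.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(rows):
    """Toy model: the mean feature value of each class."""
    sums, counts = {}, {}
    for value, label in rows:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Assign the class whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def cross_validate(data, k=10, seed=0):
    """Each fold is held out once for testing; the other k-1 folds train the model."""
    folds = k_fold_indices(len(data), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        model = train_centroids(train)
        correct += sum(predict(model, data[i][0]) == data[i][1] for i in test_idx)
    return correct / len(data)

# Synthetic one-feature data (values are illustrative, not measurements).
data = [(700.0 + i, "HTTPS") for i in range(50)] + \
       [(540.0 + i, "Tor") for i in range(50)]

folds = k_fold_indices(len(data))
accuracy = cross_validate(data)
```

Every instance is tested exactly once and is never part of the training set that produced its prediction, which is why cross-validation gives a less optimistic estimate than testing directly on the training data.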
Table 8. Breakdown of C4.5 classification for monitored sites

Site       Class         TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Google     Tor           0.997    0        0.999      0.997   0.998      0.999
Google     HTTPS         1        0.003    1          1       1          0.999
Google     Weighted Avg  0.999    0.002    0.999      0.999   0.999      0.999
Wikipedia  Tor           0.994    0        0.998      0.994   0.996      0.996
Wikipedia  HTTPS         1        0.006    0.999      1       0.999      0.996
Wikipedia  Weighted Avg  0.999    0.005    0.999      0.999   0.999      0.996

Table 9. Breakdown of Random Forest classification for top monitored sites

Site       Class         TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Google     Tor           0.996    0        0.999      0.996   0.998      1
Google     HTTPS         1        0.004    0.999      1       1          1
Google     Weighted Avg  0.999    0.003    0.999      0.999   0.999      1
Wikipedia  Tor           0.995    0        0.997      0.995   0.996      0.999
Wikipedia  HTTPS         1        0.005    0.999      1       0.999      0.999
Wikipedia  Weighted Avg  0.999    0.004    0.999      0.999   0.999      0.999
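For readers unfamiliar with the column headings, the following Python sketch shows how the TP rate, FP rate, precision, recall, and F-measure of one class follow from the counts of a two-class confusion matrix, using the standard definitions. The counts are hypothetical and are not the counts behind the tables; the function assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard per-class metrics from confusion-matrix counts.

    tp: class instances correctly detected    fn: class instances missed
    fp: other instances wrongly flagged       tn: other instances correctly rejected
    """
    tp_rate = tp / (tp + fn)      # also called recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }

# Hypothetical counts for a "Tor" class against an "HTTPS" class.
metrics = binary_metrics(tp=995, fp=1, fn=5, tn=9999)
```

With a large HTTPS class, a single misclassified HTTPS flow barely moves the FP rate of the Tor class, which is consistent with FP rates that round to 0 when reported to three decimals.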
Fig. 17. Results comparison between John and this research accuracy using Random Forest ML algorithm

[Legend: True Positive Rate, False Positive Rate, ROC Area]
Fig. 18. Results comparison between John and this research accuracy using C4.5 ML algorithm

C. Discussion

In this paper, the researcher has demonstrated a website fingerprinting attack against the most widely known anonymity project Tor. Tor is very hard to detect by

D. Conclusions And Future Work

This research shows that Tor can be classified amongst HTTPS encrypted traffic. Tor is a low-latency anonymity tool and one of the most prevalent open source tools for anonymizing TCP traffic on the Internet. Tor has implemented different defense techniques in order to prevent automated identification of Tor traffic, such as TLS encryption, padding, and packet
relaying. However, as shown in this research, Tor does not appear to
succeed in blurring network packet features, which makes it possible for a
local observer to identify Tor traffic in the network and fingerprint most
of the top sites on Alexa. Several techniques were used to classify Tor: a
technique similar to those in previous research was used to generate the
traffic and the dataset model, Netmate was used to dump the features, and
Weka was used to build the dataset model. Several ML algorithms were
employed to identify Tor traffic, and the results improved on previous
results by achieving an accuracy of 99.64% with 0.01% FP. However, the
researcher believes that it is hard to compare the results of this research
with previous research, as the Tor literature covers a wide variety of
techniques with many different goals, and no two techniques can be directly
compared because the data used for analysis is not publicly disclosed.
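
The feature dump and dataset step summarized above can be pictured with a small sketch. It is not the authors' tooling. The attribute names (`duration`, `total_fpackets`, `total_fvolume`) and the sample values are hypothetical stand-ins for Netmate flow statistics. Only the ARFF layout (`@relation`, `@attribute`, `@data`) follows the file format that Weka reads.

```python
from typing import Any, Iterable, Mapping, Sequence

# Hypothetical numeric flow features; a real flow dump has many more columns.
FEATURES: Sequence[str] = ("duration", "total_fpackets", "total_fvolume")
CLASSES: Sequence[str] = ("Tor", "HTTPS")


def to_arff(relation: str, flows: Iterable[Mapping[str, Any]]) -> str:
    """Serialize labelled flow records into ARFF text."""
    lines = [f"@relation {relation}", ""]
    for name in FEATURES:
        lines.append(f"@attribute {name} numeric")
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(CLASSES) + "}")
    lines += ["", "@data"]
    for flow in flows:
        label = flow["class"]
        if label not in CLASSES:
            raise ValueError(f"unknown class label: {label!r}")
        values = [str(flow[name]) for name in FEATURES]
        lines.append(",".join(values + [str(label)]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample flows, for illustration only.
    sample = [
        {"duration": 1520, "total_fpackets": 12, "total_fvolume": 6144, "class": "Tor"},
        {"duration": 310, "total_fpackets": 7, "total_fvolume": 2890, "class": "HTTPS"},
    ]
    print(to_arff("tor_vs_https", sample))
```

A file produced this way could then be loaded into Weka and passed to classifiers such as C4.5 or Random Forest.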
[8] P. Loshin, Practical Anonymity: Hiding in Plain Sight
Online.: Newnes, 2013.
[9] Edward M. Schwalb, iTV handbook: technologies &
standards.: Prentice Hall, 2003.
[10] Manuel Mogollon, Cryptography and Security Services:
Mechanisms and applications.: CyberTech Publishing,
2007.
[11] M., Klonowski, M., & Kutyłowski, M. Gomułkiewicz,
"Onions based on universal re-encryption–anonymous
communication immune against repetitive attack," in In
Information Security Applications, Berlin , 2005, pp. 400-
410.
[12] E., Shin, J., & Yu, J. Chan-Tin, "Revisiting Circuit
Clogging Attacks on Tor," In Availability, Reliability and
Security (ARES), 2013 Eighth International Conference,
pp. 131-140, 2013.
Fig. 20. Plot matrix for a sample of Google's ARFF file

E. Recommendations for Future Work
The researcher acknowledges that this experiment was based on a small set
of simulated data and therefore does not necessarily cover all possible
real-world conditions, including open-world experiments. The noise and
variability present in the real Tor network may make this classification
technique inaccurate. As a future recommendation, it is important to
include different types of noise in the dataset to mimic the real
open-world experience. The researcher also believes it is important to
study the ability to classify Tor on a global scope, such as at the ISP
level or beyond, with some fine-tuning of the parameters used in the
experiments. In addition, because of the high computation cost of SVM, it
is important to use a parallel computing cluster to perform the experiment.
Further work should increase the scope of fingerprinting to include more
sites and study the variation in accuracy for each. Finally, as a
recommendation for the Tor protocol, the researcher advises that developers
build more defenses in order to make it harder for a local observer to
classify Tor amongst HTTPS traffic, and thus preserve anonymity and privacy
for Tor users.
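
One simple way to picture the noise recommendation is to perturb the numeric flow features before retraining. The sketch below is an assumption about how this could be done, not a procedure from this research. The function name, the `jitter` parameter, and the feature keys are all hypothetical. Multiplicative jitter is only one kind of noise, and a realistic open-world dataset would also need background flows from unmonitored sites.

```python
import random
from typing import Any, Dict, List, Sequence

Flow = Dict[str, Any]


def add_noise(flows: Sequence[Flow], numeric_keys: Sequence[str],
              jitter: float = 0.05, seed: int = 0) -> List[Flow]:
    """Return copies of the flows with each listed numeric feature scaled by a
    random factor drawn uniformly from [1 - jitter, 1 + jitter].
    Class labels and any unlisted fields are left unchanged."""
    if not 0 <= jitter < 1:
        raise ValueError("jitter must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the perturbed dataset is reproducible
    noisy: List[Flow] = []
    for flow in flows:
        perturbed = dict(flow)  # shallow copy; the input record is not modified
        for key in numeric_keys:
            factor = 1 + rng.uniform(-jitter, jitter)
            perturbed[key] = float(flow[key]) * factor
        noisy.append(perturbed)
    return noisy


if __name__ == "__main__":
    # Made-up flow record, for illustration only.
    clean = [{"duration": 1520, "total_fpackets": 12, "class": "Tor"}]
    print(add_noise(clean, ["duration", "total_fpackets"], jitter=0.1, seed=42))
```

Retraining on such perturbed copies, and comparing the accuracy against the clean dataset, would give a first indication of how sensitive the classifier is to variability in the flow features.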
Authors' profiles

[...] companies. Since 2011 he's been teaching different computer security,
digital forensics, and networking courses for both graduates and
undergraduates. He's also an author, speaker, and freelance instructor. His
research interests include digital forensics, operating systems internals,
malware forensic analysis, and network security.

Prof. Atoum is currently the Dean of the King Hussein School of Computing
Sciences at Princess Sumaya University for Technology (PSUT). He received
his B.S. degree in Computer Science from Yarmouk University, Jordan, in
1984, his Master's degree in Computer Science from the University of Texas
at Arlington, USA, in 1987, and his PhD in Computer Science from the
University of Houston, USA, in 1993. He worked as an assistant professor at
Yarmouk University from 1993 to 1995. He was appointed Chairman of the
Computer Science department at PSUT. He has supervised or co-supervised
several students on their Ph.D. dissertations and M.S. theses and has
supervised numerous undergraduate graduation projects. Finally, he has been
involved in several committees for degree plans, and he proposed and
developed the Master's program in Information System Security and Digital
Criminology at PSUT.