A Model For Detecting Tor Encrypted Traffic Using Supervised Machine Learning





I. J. Computer Network and Information Security, 2015, 7, 10-23
Published Online June 2015 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijcnis.2015.07.02

Alaeddin Almubayed
Yahoo Inc., California, US

Ali Hadi and Jalal Atoum


Princess Sumaya University for Technology (PSUT), Amman, Jordan
Email: {a.hadi, atoum}@psut.edu.jo

Abstract—Tor is a low-latency anonymity tool and one of the most prevalent open source tools for anonymizing TCP traffic on the Internet, used by around 500,000 people every day. Tor protects users' privacy against surveillance and censorship by making it extremely difficult for an observer to correlate the websites visited with a real-world identity. Tor accomplishes this by protecting its traffic against traffic analysis and feature extraction techniques. Further, Tor resists website fingerprinting by implementing defences such as TLS encryption, padding, and packet relaying. In this paper, however, an analysis is performed against Tor from the position of a local observer in order to bypass these protections; the method consists of feature extraction from a local network dataset. The analysis shows that it is still possible for a local observer to fingerprint top monitored sites on Alexa, and that Tor traffic can be classified amongst other HTTPS traffic in the network despite the use of Tor's protections. In the experiment, several supervised machine-learning algorithms were employed. The attack assumes a local observer sitting on a local network and fingerprinting the top 100 sites on Alexa; the results improve on previous results, achieving an accuracy of 99.64% and a false positive rate of 0.01%.

Index Terms—Anonymity, Censorship, Interception, Machine Learning, Tor, Traffic Analysis, Traffic Classification

I. INTRODUCTION
Tor is a widely known low-latency network anonymity project; it is currently used by around 500,000 daily users and carries 2500 MB of data per second [1]. Tor stands for "The Onion Router", or the onion routing network, and provides bidirectional anonymized connections over the network. Tor provides a strong implementation that protects against both sniffing and analysis, securing communication to protect both data confidentiality and users' privacy. The TLS protocol is used in Tor communication to provide the required encryption [2]. For example, if Bob and Alice communicate over a public Internet connection, then by means of Tor they can ensure that their communication cannot be intercepted or monitored by eavesdroppers and that the information passed back and forth is encrypted and anonymized.
Tor is free open source software that works on almost every platform; once Tor is installed, users can use a web browser to anonymize their traffic. Traffic passing between Tor nodes and users is secured by strong encryption [3]. Moreover, Tor works well with modern browsers such as Firefox and Chrome through Tor bundles. Bundles enable users to install Tor as a browser extension, which makes it easier for users to protect their communication and attain anonymity and privacy [4]. However, although Tor is used for online anonymity, it is also heavily used by hackers and cybercriminals in order to avoid traceability [5]. With the increasing usage of the Internet, concerns over censorship and privacy have become a major issue, and users rely heavily on anonymity tools in order to conceal their identity and gain privacy. For those users anonymity is very important, and analysis of Tor against various attacks is deemed necessary to ensure adequate protection of users' privacy.
Further, anonymity tools are evolving rapidly, and both blocking anonymous traffic and developing anti-blocking tools attract many researchers [6]; this gives Tor a strong reason to monitor and track anti-anonymity tools in order to ensure secure anonymity for users at all times. In fact, the detection of anonymity tools has become a hot topic, as there is an endless battle between developers who work to improve the anonymity tools and the organizations and governments who work just as hard to break anonymity. Internet users strongly believe that anonymity is very important for protecting users' privacy; users in totalitarian regimes rely strongly on such networks to communicate freely. Breaking Tor anonymity reduces the protection that Tor claims to provide for concealing users' identities and thus increases the chances for those totalitarian regimes to physically identify users, which could lead to severe consequences such as imprisonment or even threats to life [7].

Copyright © 2014 MECS I.J. Computer Network and Information Security, 2015, 7, 10-23

In this paper, the research considers several machine learning (ML) algorithms in order to fingerprint Tor usage in the network. This study will help Tor developers to improve Tor security and to provide more advanced techniques and solutions in order to boost Tor anonymity. It also aims to help attain complete protection for users in case the same analysis is conducted by an attacker or a totalitarian regime. The main objectives of this work can be summarized as follows:

1. Researching different techniques and tools in order to identify Tor usage in the network by tracing offline network traffic data.
2. Researching the possibility of fingerprinting Tor traffic of the top 5 sites on Alexa amongst the other top 100 sites on Alexa using ML algorithms, by extracting statistics from the SSL flows used by the Tor software.
3. Generating extensive HTTPS traffic along with Tor traffic using two virtual machines (VMs).
4. Performing a feature selection exercise on network pcap files generated from different network traces to build the ML data model.
5. Conducting an analysis of how many packets of SSL flows are required to classify Tor amongst HTTPS.
6. Performing detailed experimentation to measure the accuracy of the ML classifiers.

Identifying the individual users who use Tor is out of the scope of this research; the focus is only on identifying Tor usage in offline network traces via website fingerprinting. The study of ML algorithms themselves is also limited here, since that belongs more to general computer science; the focus is mainly on traffic classification for Tor using specific ML algorithms in order to perform website fingerprinting. Further, the analysis of Tor is conducted in a closed-world local network environment, considering that it is difficult to obtain traffic from an open-world environment such as that of an Internet Service Provider (ISP).

II. TOR BACKGROUND
Tor allows people to access and publish content on the Internet without being tracked, identified, or exposed to authorities. Since Tor is used by many different types of people, the risk varies from that of a child accessing forbidden sites to other types of risk, such as employees or political activists accessing Tor, where the risk is higher. However, while many people agree on the positive reasons to use Tor, some see Tor as a big threat that could allow criminals to commit their crimes with impunity. There are several good reasons for using Tor: for example, ordinary people use Tor to protect their information from external adversaries, the military uses Tor to protect government communication, and law enforcement offices and agencies use Tor for their investigations and operations. Low- and high-profile people also use Tor to voice opinions that may be unpopular or conflict with their public persona [8]. Tor relies completely on the TLS protocol for its network communication. TLS encrypts and authenticates the communication between Tor instances.

A. Transport Layer Security
Netscape Communications Corporation first introduced the Secure Sockets Layer (SSL) protocol in 1995 to enable e-commerce transaction security on the web. TLS is used heavily nowadays by most Internet communication to protect confidentiality through encryption, together with integrity and authentication, to ensure a safe transaction. To achieve this, SSL was built beneath the application layer, directly on top of TCP, which enables it to carry HTTP, SMTP, FTP, and many other protocols. The primary use of SSL and TLS is to protect HTTP traffic in the network. In HTTP, when a new TCP connection is created, the client sends the request to the server and the server responds with the content. When SSL is used, the client first creates a TCP connection and establishes an SSL channel on top of it; the HTTP request is then sent over the SSL channel rather than directly over the TCP connection. Ordinary HTTP cannot interpret the SSL/TLS handshake, so the HTTPS scheme is used instead to indicate a connection over SSL [9].
TLS is a layered protocol consisting mainly of two layers. At the lower level is the Record Protocol, which is responsible for transmitting messages, fragmenting the data into blocks, and several other steps. On top of it are the Handshake Protocol, Alert Protocol, Change Cipher Spec Protocol and Application Protocol; Fig. 1 shows the TLS protocol layers. The TLS Handshake Protocol allows the client and server to authenticate each other and to exchange encryption keys and algorithms before the protocol starts to send data over the network [10].

Fig. 1. TLS Protocol Layers (Handshake, Change Cipher Spec, Alert and Application protocols above the TLS Record Layer)

B. Onion Routing
The Onion Router (OR) was originally created for Sun Solaris 2.4 in 1997 and included proxies for remote logins, email, web browsing, and the File Transfer Protocol (FTP) [11]. The main purpose of onion routing is to provide real-time bidirectional anonymous interaction between two parties that is resistant to eavesdropping, sniffing and traffic analysis. Onion routing consists of a series of ORs connected in such a way that each OR has a dedicated socket connection to a set of neighboring ones. To build the anonymous connection, the application initiates a series of connections to a set of Onion Router Proxies (ORPs), which ultimately build up the anonymous connection. The routing occurs at the application layer of the protocol stack, not at the IP layer. However, the IP network determines where the data moves between individual onion routers.

III. RELATED WORK
Tor achieves anonymity by making it very difficult for an adversary to identify client and server identities. In the Tor design, the entry node only knows that the client is communicating with the middle node, and the middle node only knows that the entry node is communicating with another machine, the exit node. The middle relay cannot tell whether it is the middle node in the circuit or not. The exit relay, in turn, only knows the middle node, which communicates through it with the server (the target destination). Finally, the server believes the connection is coming from the exit node [12].
Historically, there has been an extensive body of work on attacking Tor anonymity circuits, which can degrade anonymous communication over Tor; most of these attacks are based on traffic analysis. However, attacks based on traffic analysis may suffer from a high rate of false positives (FP) for a number of reasons, such as Internet traffic dynamics and the difficulty of determining the number of packets required for the statistical analysis of traffic. That said, timing and latency are important metrics in traffic analysis for identifying Tor, as are packet counting and volume metrics [13].
Previous work on path selection focused on latency as a link property and primarily took delay into account. The work in [14], however, takes a different approach: it identifies the importance of latency as an indicator of congestion and accordingly suggests an improved path selection algorithm. Further, Tao proposed a way for Tor clients to respond to short-term congestion by building a timeout mechanism.
Existing traffic analysis attacks against anonymous communication can be classified into two main categories: traffic confirmation attacks and traffic analysis attacks. Each category consists of both passive and active attacks. In a passive traffic analysis attack, the adversary records the traffic passively and identifies the resemblance between client inbound traffic and server outbound traffic. An active attack, meanwhile, aims to embed a specific secret signal (or mark) into the target traffic and then detect it [15].
In a traffic confirmation attack, an adversary tries to confirm that two parties are communicating with each other over Tor by observing patterns in the traffic, such as its timing and volume. Traffic confirmation attacks are not the focus of Tor's threat model. Instead, Tor focuses on preventing traffic analysis attacks, in which an adversary tries to determine at which points in the network a traffic-pattern-based attack should be executed [15].

IV. TOR FINGERPRINT METHODOLOGY
The goal of this research is to fingerprint Tor traffic flows in a local network environment in order to break Tor's anonymity and identify top monitored sites on Alexa using ML classification techniques. Several steps are involved in a Tor fingerprinting attack within a local network environment. Here, a local network environment means two things: first, all web pages are known in advance, and second, the attack is launched by a local attacker. The attacker observes the encrypted traffic and draws conclusions from certain features of the traffic, such as packet sizes, volume of data transferred, timing and many others. This type of attack is considered in this research to ensure that the results are comparable with related work. The method is not in a position to break the cryptography used in Tor; although it does not reveal message semantics, it provides a way to observe specific patterns in order to reveal known traffic instances such as web pages.
In a real-world scenario, if a user runs the Tor onion proxy (OP) in a shared local network, other users on the same network may use different applications and thus pass different types of network traffic, such as HTTP, HTTPS, FTP and others. Tor uses TLS encryption between the client, the ORs and the destination server; thus, the hypothesis is that Tor traffic should have characteristics similar to any other TLS traffic, such as HTTPS. Yet if variation in the traffic characteristics is found, then Tor instances can be fingerprinted amongst other TLS traffic and anonymity can be broken. In the experiment, HTTPS traffic is considered because the majority of sites encrypt their communication using TLS over port 443. Similarly, Tor traffic is considered from a user (the victim) who uses the Tor OP on the same local network to browse the top 5 sites on Alexa. The goal of this study is therefore to identify the variations between the HTTPS traffic and the Tor victim's traffic in the same local network environment.
In order to find those variations in traffic characteristics between Tor and HTTPS, ML methods need to be employed using a statistical classification technique [16]. The focus of using ML methods to detect patterns in the packet information is to extrapolate and predict the traffic type contained within a TLS flow; in this research, Tor encrypted traffic data and HTTPS data are used to train the system.
The steps of the Tor fingerprinting methodology in this work are summarized in Fig. 2, as follows:

• Step-1, data collection: it is required to capture a ground truth, that is, HTTPS data for which the underlying application is known. At the same time, Tor traffic instances of the top 5 sites on Alexa are collected to represent Tor instances. The data collected is used to train the model using ML methods.
• Step-2, feature extraction and feature selection: feature extraction is crucial in order to detect the variation between Tor instances and HTTPS instances, and feature selection is required to identify which features improve the accuracy of website fingerprinting.
• Step-3, labeling: marking each row in the traffic with the corresponding label.

• Step-4, classification: classifying traffic flows, based on those characteristic variations, either as a Tor site or as regular HTTPS traffic.

The following subsections describe each step in detail.

Fig. 2. Methodology steps for fingerprinting Tor

A. Data Collection
To validate the method used in this research, there is a need for a ground truth, that is, SSL connections for which the underlying application is known. Therefore HTTPS traffic data is required for building the training dataset. Since no public dataset can be used in the experiment, a data generation technique similar to that of [14], previously known to achieve higher accuracy, has been adopted, omitting the removal of SENDMEs, as it did not affect the results much. Precautions need to be taken in order to collect the data in the same way a realistic attacker would. The Firefox browser and Selenium [17] WebDriver have been used to automate the browsing process; the websites used are taken from the top 100 sites on Alexa in order to mimic actual user behavior in the local network environment.
Ideally, those traffic traces can be captured from more than one machine; the captures consist of the raw data transmitted over the physical wire or wireless network at a given point, see Fig. 3. Each machine runs different encrypted services. Some machines generate HTTPS traffic and others run the Tor application to generate Tor encrypted traffic. In the experiment two virtual machines are used as clients; the software stack used is detailed below.

Fig. 3. Data collection

1) Environment Setup
In the data generation method, two virtual machines (VMs) are used in order to generate the traffic for the experiment. A VM is a software implementation of a computing environment in which an Operating System (OS) can be installed and run on an emulated physical computing environment. It utilizes the underlying physical hardware, including CPU, memory, hard disk, network and other hardware resources, to create a virtual computing environment. Although guest OSs and programs run on a virtualized computer, they are not aware that they are running on a virtual platform [18]. In the study, different OS distributions have been installed to emulate the actual traffic in the network. A breakdown of the OSs used as VMs to capture the network traffic is presented in Table I. Each of these guest operating systems runs with a specific VM configuration: 512 MB of RAM and 20 GB of disk space, with a shared Network Address Translation (NAT) networking setup. Further, Linux distributions with different processor architectures are used on the VMs.

Table 1. Break Down Of Client Virtual Machines Operating Systems
Client #   Operating System   Operating System Version   Processor
CVM1       Linux Ubuntu       12.04                      64 bit
CVM2       Linux Backtrack    5.01.3                     32 bit

Linux BackTrack is a Linux-based penetration-testing arsenal that helps security professionals perform assessments in a purely native environment dedicated to hacking. Linux Ubuntu is a standard Linux distribution that powers millions of desktop PCs, laptops and servers around the world. Moreover, the OSX machine in Table II is used for conducting the analysis.

Table 2. Analysis Machine
Client #   Operating System   Operating System Version   Processor
A1         OSX                10.9.2                     64-bit

2) Traffic Generation Tools
The first step in obtaining traffic traces for the dataset is to capture the data from the VMs and use it to build the training datasets. The aforementioned VMs and sniffer software were used to sniff and capture the traffic from A1. The VMs are configured to run behind NAT on the A1 machine, which means that traffic is always routed through A1. This provides two major benefits: first, full control over capturing the dataset, and second, the ability to apply specific filters based on particular parameters without any traffic disruption that could affect the quality of the dataset and invalidate the training data. Many sniffers are available today, the two best known being Wireshark and tcpdump. Any sniffer would suffice for the testing, but a simple, flexible, low-cost, and fast tool is best, and tcpdump works well as the sniffer for the experiment.
Tcpdump is a free open source sniffer that uses libpcap to capture traffic and provides information about IP layer packets, i.e. the length of the packet, the time the packet was sent or received, and the order in which the packets were sent and received. Tcpdump is quite flexible and fast and runs on most Linux and Unix variants; in fact, it is installed by default on many Linux distributions, and it has been ported to Windows as WinDump. It supports a variety of filters, with a powerful language for specifying individual filter types [19]. Many other services are also running on the VMs; Table III breaks down the services installed on each machine, with the corresponding versions. Each of these machines runs completely independently in its own VM environment.

Table 3. Services running on the machines
Machines   Services
CVM1       Firefox 14.0.1, Tor 0.2.4.22, Netmate 0.9.5, Tcpdump 4.3.0, Libpcap 1.3.0
CVM2       Firefox 14.0.1, Tor 0.2.4.22, Netmate 0.9.5, Tcpdump 4.3.0
A1         Weka 3.7.3, Wireshark 1.10.7

The details of each VM are as follows:

a) CVM1
This VM is used to run Firefox and browse sites that run over HTTPS. The traffic generated is intended to represent regular HTTPS traffic for the top 100 sites on Alexa. In a real-world scenario, most of this traffic is generated during regular secure browsing activities such as email communication, social networking, and financial activities. Unfortunately, these activities are somewhat difficult to mimic. Thus, the approach taken in this work involves using Selenium [17], a tool for automating web applications for testing purposes, with a complete list of sites that run over HTTPS. Selenium operates by controlling a standard browser. This is important because the traffic generated needs to look as if it was captured from a user browsing the web during regular activities. Similar to Ian and Tao's method [14], the method uses a specific list of websites in a local network environment; those sites are taken from the top 100 sites on Alexa.com. Alexa, a leading provider of free global web metrics, ranks the top sites based on the number of unique users, page views, and number of visits [20].

b) CVM2
This VM browses a specific list of what are expected to be the top monitored websites on Alexa, but this time with the Tor OP. In the attack scenario, the expectation is that the victim uses this machine to browse top sites on Alexa. The same method as for CVM1 is used on CVM2. The 5 sites used in the experiment are Google, Yahoo, Facebook, Wikipedia and Twitter.

3) Dumping Traffic
Dumping traffic is required in order to capture the training datasets and record information to be used later in the classification part. The A1 machine is used to capture all the traffic from all the CVM(i) machines; a sniffer is installed on the interface to capture the traffic that passes through. Tcpdump is used to generate packet capture (pcap) files in the libpcap format, which Wireshark can also read. In the process, packets are captured and stored on the A1 machine for further analysis using ML. Traffic generation has been scheduled on each machine using a small bash script, and traffic is recorded on an hourly basis. The tcpdump sniffer is set up so that it can capture traffic from the two machines on a scheduled basis, see Fig. 4.

Fig. 4. Traffic capturing through A1

Traffic passes from CVM(i) through the NAT connection to the website servers. Since traffic is scheduled to pass in a specific timeframe, HTTPS traffic was generated first, from CVM1, and stored as regular-HTTPS.pcap. Similarly, traffic from CVM2, which runs the Tor OP, is captured, with files named after the site the traffic corresponds to: for example, Google traces are stored as Google-Tor.pcap and Yahoo traces as Yahoo-Tor.pcap. The data generation took two weeks to finish, and the final output files in pcap format are listed in Table IV. The table contains the number of packets and flows along with the sizes for each. Fig. 5 shows a summary chart of each capture. After dumping the network traffic from CVM(i), the next step is to use those files to build the training model for the classification method. The traffic generated contains a number of flows; those flows will be used to create the model.

Table 4. Traffic and Their associated number of flows
Traffic Type    Size / MB   Number of Packets   Number of Flows   Avg Packet Size / Byte
HTTPS           808.7       1054835             38845             750.617
Google Tor      110.1       146151              5231              737.407
Yahoo Tor       155.4       206998              7959              734.596
Facebook Tor    132.7       160491              4085              810.785
Twitter Tor     132.2       171935              5577              752.653
Wikipedia Tor   87.7        122708              4465              698.716

Fig. 5. Summary chart for all pcap files (series: size in MB, number of packets, number of flows, average packet size in bytes)
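As an illustration of the automated browsing used on the client VMs, the following minimal Python sketch loads a list of sites through a browser driver. It is not the authors' script: the helper names `to_https_urls` and `browse` are hypothetical, and the driver is passed in as a factory so that any object offering Selenium WebDriver's `get()` and `quit()` methods can be used.

```python
from typing import Callable, Iterable, List


def to_https_urls(domains: Iterable[str]) -> List[str]:
    """Turn bare domain names into HTTPS URLs, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for domain in domains:
        name = domain.strip().lower()
        if name and name not in seen:
            seen.add(name)
            urls.append(f"https://{name}/")
    return urls


def browse(urls: Iterable[str], make_driver: Callable[[], object],
           visits_per_site: int = 1) -> int:
    """Load every URL with a browser driver and return the number of page loads.

    make_driver must return an object with get(url) and quit() methods,
    which is the interface Selenium's WebDriver classes provide.
    """
    driver = make_driver()
    loads = 0
    try:
        for url in urls:
            for _ in range(visits_per_site):
                driver.get(url)  # the browser fetches the page over the network
                loads += 1
    finally:
        driver.quit()  # always close the browser, even after an error
    return loads
```

With Selenium installed, a call such as `browse(to_https_urls(["google.com", "yahoo.com"]), webdriver.Firefox)` after `from selenium import webdriver` would drive a real Firefox instance; routing that browser through Tor on CVM2 would need additional proxy configuration that is not shown here.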

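The hourly capture described under Dumping Traffic was driven by a small bash script that is not reproduced in the paper. As a rough stand-in, the Python sketch below only builds a tcpdump command line that rotates its output file every hour using tcpdump's `-G` and `-w` options. The interface name, output directory and file prefix are assumptions chosen for illustration, and the command is printed rather than executed.

```python
import shlex
from typing import List


def tcpdump_command(interface: str, out_dir: str, prefix: str,
                    rotate_seconds: int = 3600,
                    bpf_filter: str = "tcp") -> List[str]:
    """Build (but do not run) a tcpdump command that rotates its output file."""
    if rotate_seconds <= 0:
        raise ValueError("rotate_seconds must be positive")
    # -G rotates the savefile every rotate_seconds; the -w name carries
    # strftime fields so that each rotation is written to its own file.
    out_pattern = f"{out_dir.rstrip('/')}/{prefix}-%Y%m%d-%H%M.pcap"
    return ["tcpdump", "-i", interface, "-G", str(rotate_seconds),
            "-w", out_pattern, bpf_filter]


if __name__ == "__main__":
    # "vmnet8" and "/tmp/captures" are placeholder values, not from the paper.
    print(shlex.join(tcpdump_command("vmnet8", "/tmp/captures", "cvm1")))
```

Building the argument list separately from running it keeps the sketch safe to execute anywhere; on a real capture host the list could be handed to `subprocess.run` with sufficient privileges.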
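Because the capture files are named after the traffic they contain (regular-HTTPS.pcap, Google-Tor.pcap, Yahoo-Tor.pcap), the class of every flow in a file can be derived from the file name alone. The short sketch below shows one way to do that; the function name `label_for_capture` and the exact label strings, which mirror the Traffic Type column of Table IV, are assumptions for illustration.

```python
from pathlib import Path


def label_for_capture(pcap_path: str) -> str:
    """Map a capture file name to the class label of the traffic it holds."""
    stem = Path(pcap_path).stem  # e.g. "Google-Tor"
    if stem.lower() == "regular-https":
        return "HTTPS"
    site, sep, kind = stem.rpartition("-")
    if sep and site and kind.lower() == "tor":
        return f"{site} Tor"  # e.g. "Google Tor", as in the Traffic Type column
    raise ValueError(f"unrecognised capture name: {pcap_path}")


if __name__ == "__main__":
    for name in ["regular-HTTPS.pcap", "Google-Tor.pcap", "Yahoo-Tor.pcap"]:
        print(name, "->", label_for_capture(name))
```

Rejecting unknown names with an error, instead of guessing a label, avoids silently mislabelled training rows.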

B. Feature Extraction
In the experiment, in order to perform the fingerprinting attack, the features that represent each traffic type (Tor or HTTPS) need to be extracted from the network dump files, so that the classification model can use them to find characteristic variations between those instances.
The features need to be extracted from the generated *.pcap traffic files, but first it is important to bring all the data together into a set of instances. To accomplish this, NetMate is used. NetMate is a traffic-monitoring tool that converts IP packets into bi-directional flows and generates several statistics regarding these flows. A flow is defined as a sequence of packets sharing the same source IP address, destination IP address, source port, destination port, and protocol type [21]. NetMate has been used to extract features as flow attributes from the traffic. It works by processing the datasets, generating flows, and computing feature values that can be used to build the model; each flow is described by a set of statistical features and associated feature values.

1) Feature Selection
In total, 40 features were obtained from NetMate, as listed in Table V. The other features, including the protocol feature (represented as TCP = 6 and UDP = 17), are ignored, since they do not affect the classification results either positively or negatively [22], as was also confirmed in this experiment. Further, it is important to mention that only TCP and UDP flows are considered, and specifically flows that have at least one packet in each direction and transport no less than one byte of payload. A number of features have also been excluded, namely IP addresses and source/destination port numbers, to ensure that the results do not depend on those biases.

Table 5. Features obtained from Netmate
No.  Abbreviation     Feature Description
1    Dscp             The protocol (i.e. TCP = 6, UDP = 17)
2    total_fpackets   Total packets in the forward direction
3    total_fvolume    Total bytes in the forward direction
4    total_bpackets   Total packets in the backward direction
5    total_bvolume    Total bytes in the backward direction
6    min_fpktl        The size of the smallest packet sent in the forward direction (in bytes)
7    mean_fpktl       The mean size of packets sent in the forward direction (in bytes)
8    max_fpktl        The size of the largest packet sent in the forward direction (in bytes)
9    std_fpktl        The standard deviation from the mean of the packets sent in the forward direction (in bytes)
10   min_bpktl        The size of the smallest packet sent in the backward direction (in bytes)
11   mean_bpktl       The mean size of packets sent in the backward direction (in bytes)
12   max_bpktl        The size of the largest packet sent in the backward direction (in bytes)
13   std_bpktl        The standard deviation from the mean of the packets sent in the backward direction (in bytes)
14   min_fiat         The minimum amount of time between two packets sent in the forward direction (in microseconds)
15   mean_fiat        The mean amount of time between two packets sent in the forward direction (in microseconds)
16   max_fiat         The maximum amount of time between two packets sent in the forward direction (in microseconds)
17   std_fiat         The standard deviation from the mean amount of time between two packets sent in the forward direction (in microseconds)
18   min_biat         The minimum amount of time between two packets sent in the backward direction (in microseconds)
19   mean_biat        The mean amount of time between two packets sent in the backward direction (in microseconds)
20   max_biat         The maximum amount of time between two packets sent in the backward direction (in microseconds)
21   std_biat         The standard deviation from the mean amount of time between two packets sent in the backward direction (in microseconds)
22   duration         The duration of the flow (in microseconds)
23   min_active       The minimum amount of time that the flow was active before going idle (in microseconds)
24   mean_active      The mean amount of time that the flow was active before going idle (in microseconds)
25   max_active       The maximum amount of time that the flow was active before going idle (in microseconds)
26   std_active       The standard deviation from the mean amount of time that the flow was active before going idle (in microseconds)
27   min_idle         The minimum time a flow was idle before becoming active (in microseconds)
28   mean_idle        The mean time a flow was idle before becoming active (in microseconds)
29   max_idle         The maximum time a flow was idle before becoming active (in microseconds)
30   std_idle         The standard deviation from the mean time a flow was idle before becoming active (in microseconds)
31   sflow_fpackets   The average number of packets in a sub flow in the forward direction
32   sflow_fbytes     The average number of bytes in a sub flow in the forward direction
33   sflow_bpackets   The average number of packets in a sub flow in the backward direction
34   sflow_bbytes     The average number of bytes in a sub flow in the backward direction
35   fpsh_cnt         The number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
36   bpsh_cnt         The number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
37   furg_cnt         The number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
38   burg_cnt         The number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
39   total_fhlen      The total bytes used for headers in the forward direction

2) Generating The Attribute-Relation File Format
The Attribute-Relation File Format (ARFF) is an ASCII text input file format that describes a list of instances sharing a set of attributes; it was developed by the ML Project at the Department of Computer Science of The University of Waikato for use with machine learning software [23]. An ARFF file has three main sections: RELATION, ATTRIBUTE and DATA. The header contains the relation declaration and the attribute declarations: RELATION is a string defined in the first line, and each ATTRIBUTE declares both a name and a data type. DATA marks the start of the data section and is followed by the actual instance lines.

C. Labeling
Obtaining flows from network traffic using NetMate generates rows of attributes separated by commas in
ARFF file format. Those values will be used to build the The total bytes used for headers in the backward
training dataset model using Weka, which is a collection 40 total_bhlen
direction.
of ML algorithms for data mining tasks [23]. In order to
train the system Weka to use supervised ML with Weka truth, or SSL connections for which the underlying
defaults to validate the method. There is a need for a application is known. In other words, there is a need to

Copyright © 2014 MECS I.J. Computer Network and Information Security, 2015, 7, 10-23
16 A Model for Detecting Tor Encrypted Traffic using Supervised Machine Learning

specify which data is HTTPS and which data is Tor. To accomplish that, a labeling process is required, which specifies the label attribute on each data instance in the ARFF file. This labeled data is known as ground truth. Building the ground truth is a very important and critical phase of any traffic classification method, since the entire classification process relies completely on the accuracy of this labeling. Thus, data instances must be labeled accurately according to their types to keep false positives and false negatives to a minimum. Also, because traffic has been completely separated based on CVM(i), validation has been conducted to ensure that only traffic generated by each CVM is attributed to that CVM, and that no other traffic noise is mixed with the traffic intended for use.

D. Building ML Classification Model

Supervised ML is employed in order to create the training dataset. In supervised learning, the algorithm takes known data, called the training dataset, to make predictions. The method attempts to discover the relationship between input attributes and target attributes; the relationship discovered is represented by a structure called a "model" (see Fig. 6).

Fig. 6. Generating Model in ML using supervised learning

Weka is an open source project that contains different tools for data pre-processing, regression, classification, clustering, association rules, and visualization. It can be used directly by providing a dataset, or called from Java code, as in Fig. 7.

Fig. 7. Weka GUI in OSX

In order to apply ML algorithms and build the data model, given the different traffic training datasets, Weka is chosen for this exercise. Either the Weka GUI or the command line interface can be used to accomplish this. Fig. 8 presents the use of Weka as a simple command line interface to generate a data model.

Fig. 8. Creating data model using Weka CLI

V. EVALUATION TECHNIQUES

In this experiment, in order to fingerprint websites over Tor, a few ML methods were used. The experiment was repeated multiple times using Weka, each time using a different set of training and test cases (changing the number of packets used to create the case). However, to obtain a simulated test performance, the testing data used in the evaluation are the same as the training dataset, but with 10-fold cross-validation using Weka.

Cross-validation means that part of the data is reserved for testing while the rest is used for training. In other words, the data is partitioned into 10 parts (folds), with one part used for testing and the remaining 9 parts for training. Further, different sets of attributes (features) were used for classification, and a deep investigation was performed in order to find relevant attributes and build minimal rule sets for classifying Tor traffic (finding the minimal rule set has been proved to be an NP-hard problem [25]), together with different cross-validation test options, to achieve higher accuracy with lower FPR and FNR. Fig. 9 shows the steps of the classification method in general.

Fig. 9. Detection Diagram using ML

The sites used in the fingerprinting evaluation are the top 5 websites on Alexa. The sites are listed with a localized domain to avoid Tor redirection to the local IP location. Table VI lists the top websites that have been chosen in the evaluation process.

Table 6. List of websites used in the fingerprinting process

#          Site
Google     https://www.google.de
Facebook   https://www.facebook.com
Yahoo      https://se.yahoo.com
Twitter    https://www.twitter.com
Wikipedia  https://www.wikipedia.org/
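
As an illustration of the 10-fold partitioning described in this section, the following minimal Python sketch splits instance indices into folds and uses each fold once for testing. It is not the Weka implementation: the function names, the fixed seed, and the interleaved fold assignment are assumptions made for this example, and Weka additionally stratifies the folds by class when the class attribute is nominal, which is omitted here.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Split instance indices 0..n-1 into k disjoint, nearly equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_instances, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical dataset of 25 flow instances.
    for train, test in train_test_splits(25, k=10):
        print(len(train), len(test))
```

Every instance is used for testing exactly once and for training in the remaining nine rounds, so the reported accuracy is an average over all ten test folds.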


A. Classification Methods Employed

The focus of this research is to employ ML methods in classifying Tor encrypted flows. Classification is considered in terms of the number of packets necessary to correctly classify those flows and the feature sets used. To the researcher's knowledge, no other research has worked exclusively on fingerprinting and classifying Tor traffic amongst other HTTPS traffic. Below is a description of each ML method used in the experiment. Each method is used to classify data collected from Tor amongst HTTPS data; each ML algorithm can learn the variation in the flow characteristics in order to classify the traffic accurately.

The goal is to achieve high accuracy with a low FP rate in order for the methodology to successfully fingerprint Tor sites on a local network environment.

1) Classification Using Statistical Model

Naïve Bayes is used in the evaluation methodology in order to classify Tor and HTTPS traffic. Naïve Bayes is a classification algorithm that relies on Bayes' rule of conditional probability [26]. It forms a statistical model of the data given in the training phase. The algorithm relates each feature to the probability that the feature will result in a particular outcome, based on the entire training set. To perform testing, the probability of each possible outcome is calculated based on the features of each test instance. Naïve Bayes gets its name because it makes the (naive) assumption that each feature is independent of the others.

2) Classification Using Decision Trees

C4.5 is a decision tree classifier, which is built by repeatedly splitting the training set on the feature (attribute) that "best" splits the data. It is therefore considered here in order to classify Tor and HTTPS traffic with high accuracy. There are multiple methods for deciding which feature is best, but C4.5 uses a measure of information entropy. The exact criterion for splitting the training set is the normalized information gain, which is the difference in entropy caused by choosing a specific attribute for splitting the data. The attribute that has the highest normalized information gain is selected as the one on which the training set is split. The resulting C4.5 model is, in effect, a series of IF/THEN statements, which do not necessarily employ all attributes. Given this structure, there may be multiple paths to the same outcome class.

3) Random Forest

Random forest (RF) is an ML algorithm that evolved from decision trees. It is used in this classification to ensure that the results are aligned with those achieved by both Naïve Bayes and C4.5, and because of the classification strength of the algorithm. RF consists of many decision trees and combines two ML techniques, bagging and random feature selection. Bagging, short for "bootstrap aggregating", is one of the first ensemble methods. Ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created [27]. Each tree is trained on a bootstrap sample of the training data, and the prediction is made based on the majority of the trees' votes. Random feature selection conducts a simple search over a random subset of features to find the best split at each node while growing a tree.

4) Support Vector Machine

The support vector machine (SVM) was pioneered by Vapnik and Chervonenkis [28] and is a state-of-the-art supervised ML algorithm for the binary classification problem. SVM is heavily used for data mining and is well known for its high classification accuracy. It has therefore been considered in this research to make sure that the results achieved are not biased toward a specific ML algorithm and that this type of ML classification is also capable of classifying Tor amongst HTTPS traffic. In SVM, given a set of objects that fall into two categories (training data), the problem is how to classify a new point (test data) into one of those categories. SVM solves this by computing the line that separates the data into the two categories [29].

The key idea of SVM is the interpretation of instances as vectors. In this research's classification problem, the instances are the data generated through site retrieval, represented as vectors. Based on the training data provided, the SVM classifier tries to fit a hyperplane into the multidimensional vector space that represents the instances, in order to separate the instances that belong to different classes. The distance between the fitted hyperplane and the support vectors (the instances closest to it) has to be as large as possible; that is, the SVM maximizes the gap between the two classes. However, sometimes the vectors are not linearly separable and require complex decision surfaces for optimal separation of the categories, as in Fig. 10. SVM can solve this by transforming the vector space into a higher dimensional space using the so-called kernel trick; in the higher dimension, the hyperplane can be fitted again.

Fig. 10. Nonlinear SVM separator

VI. RESULTS AND DISCUSSIONS

This section presents the evaluation results of the classification experiment for fingerprinting Tor encrypted traffic in the offline traffic traces discussed previously. HTTPS and Tor-SSL traffic have been used in order to create the training dataset. The main goal of this research is to evaluate the possibility of


providing a high detection rate for Tor traffic amongst HTTPS traffic in order to fingerprint the most monitored websites on Alexa from the position of a local observer sitting on the network.

In this research's traffic classification for Tor, two factors are considered in order to quantify the performance of the classifier: Detection Rate (DR) and False Positive Rate (FPR). Here, DR reflects the share of Tor-SSL flows correctly classified, whereas FPR reflects the share of HTTPS flows incorrectly classified as Tor-SSL. Naturally, a high DR and a low FPR are the desired outcomes [30]. DR and FPR are calculated based on the following equations:

$$DR = \frac{TP}{TP + FN} \tag{10}$$

$$FPR = \frac{FP}{FP + TN} \tag{11}$$

In equation (10), FN represents false negatives, which means Tor-SSL traffic classified as HTTPS traffic, and TP represents Tor-SSL traffic correctly classified as Tor-SSL. Likewise, in equation (11), FP represents false positives, which means HTTPS traffic classified as Tor-SSL traffic, and TN represents HTTPS traffic correctly classified as HTTPS. Since the main goal is to achieve a high DR and a low FPR, the experiment has been evaluated with four ML algorithms using Weka [31]. In order to estimate the accuracy and errors of the ML models, the experiment has been run with the 10-fold cross-validation option in Weka. Cross-validation is a necessary step in model construction: it assesses how the results of a statistical analysis will generalize to an independent dataset and provides an estimate of how the model will perform in practice.

A. Classifiers Results

The goal of this research is to achieve a high true positive rate (also known as DR) and a low FPR. In order to attain that, a feature selection exercise has been performed. Feature selection eliminates features determined to be of little use in classification and reduces the computation needed. Feature selection was applied by tuning the features calculated from the training packets of the flow, and high accuracy has been achieved using different feature sets. This also improved the runtime of the ML algorithms, which require intensive mathematical calculations. Data has been generated on a local network environment by following the best approach described in the method of Wang and Goldberg [14] for Tor dataset generation. In the experiment, the fingerprinting has been performed on the top 5 monitored sites on Alexa: Google, Facebook, Yahoo, Twitter, and Wikipedia. Those sites run various types of content and serve more than 100 million users every day. Breaking Tor anonymity means directly identifying those traffic instances within the network traffic. The researcher has performed the fingerprinting using ML classification techniques; four ML algorithms have been used to classify the traffic, and all have shown very close results. The accuracy, training time, and runtime, including some analysis, are described below.

1) Naïve Bayes

Naïve Bayes is a classification algorithm that relies on Bayes' rule of conditional probability [26]. In the experiment, Naïve Bayes in Weka is used to classify Tor instances with the 10-fold cross-validation test mode. Using 40 features, Naïve Bayes achieved a TP Rate of 99.60%, an FP Rate of 0.004%, and 99.69% accuracy. Table VII breaks down the detailed accuracy of Naïve Bayes for each monitored site, together with the weighted averages; the weighted average is computed by weighting each per-class measure (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area) by the proportion of instances in that class. Fig. 11 also provides the overall distribution as a chart.

Table 7. Breakdown of Naïve Bayes classification for top monitored sites

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Tor Google     0.991    0.002    0.99       0.991   0.99       0.999
HTTPS          0.998    0.009    0.998      0.998   0.998      0.994
Weighted Avg   0.997    0.008    0.997      0.997   0.997      0.994

Tor Facebook   0.994    0.002    0.988      0.994   0.991      0.997
HTTPS          0.998    0.006    0.999      0.998   0.999      0.997
Weighted Avg   0.998    0.005    0.998      0.998   0.998      0.997

Tor Yahoo      0.998    0.001    0.995      0.998   0.996      0.999
HTTPS          0.999    0.002    0.999      0.999   0.999      0.998
Weighted Avg   0.999    0.002    0.999      0.999   0.999      0.998

Tor Twitter    0.992    0        0.999      0.992   0.995      0.999
HTTPS          1        0.008    0.998      1       0.999      0.996
Weighted Avg   0.998    0.007    0.998      0.998   0.998      0.996

Tor Wikipedia  0.993    0.007    0.954      0.993   0.973      0.998
HTTPS          0.993    0.007    0.999      0.993   0.996      0.996
Weighted Avg   0.993    0.007    0.993      0.993   0.993      0.997
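
The DR, FPR, and weighted average measures discussed in this section can be sketched in a few lines of Python. The sketch assumes the standard confusion-matrix definitions, with Tor-SSL as the positive class. The function names and the counts are hypothetical: the paper does not report raw counts, and the values below were chosen only so that the resulting rates resemble one row group of the Naïve Bayes breakdown.

```python
def detection_rate(tp, fn):
    """DR (true positive rate): share of Tor-SSL flows labeled as Tor-SSL."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """FPR: share of HTTPS flows wrongly labeled as Tor-SSL."""
    return fp / (fp + tn)


def weighted_average(per_class_values, class_counts):
    """Weight each per-class measure by that class's share of all instances."""
    total = sum(class_counts)
    return sum(v * n for v, n in zip(per_class_values, class_counts)) / total


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not taken from the paper.
    tp, fn = 991, 9      # Tor-SSL flows: correctly / incorrectly classified
    fp, tn = 12, 5988    # HTTPS flows: incorrectly / correctly classified

    dr_tor = detection_rate(tp, fn)
    fpr_tor = false_positive_rate(fp, tn)
    tpr_https = tn / (tn + fp)  # TP rate when HTTPS is treated as the positive class

    print("DR:", round(dr_tor, 3), "FPR:", round(fpr_tor, 3))
    print("Weighted TP rate:",
          round(weighted_average([dr_tor, tpr_https], [tp + fn, fp + tn]), 3))
```

Because HTTPS instances outnumber Tor instances in this hypothetical split, the weighted average sits closer to the HTTPS value, which is the same pattern visible in the per-site breakdowns.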


Fig. 11. Distribution of Naïve Bayes classification for each monitored site (series: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)

2) C4.5

C4.5 is a decision tree classifier. One of the notable features of C4.5 is that it determines how deeply to grow a decision tree to avoid overfitting and chooses an appropriate attribute selection measure. Table VIII shows the result of using C4.5 with the 10-fold cross-validation test mode; C4.5 is known as J48 in Weka. Fig. 12 shows the overall distribution as a chart. C4.5 achieves higher accuracy than Naïve Bayes, with 99.92% accuracy, a 99.85% TP Rate, and a 0.002% FP Rate.

Table 8. Breakdown of C4.5 classification for monitored sites

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Tor Google     0.997    0        0.999      0.997   0.998      0.999
HTTPS          1        0.003    1          1       1          0.999
Weighted Avg   0.999    0.002    0.999      0.999   0.999      0.999

Tor Facebook   0.998    0        0.999      0.998   0.999      0.999
HTTPS          1        0.002    1          1       1          0.999
Weighted Avg   1        0.002    1          1       1          0.999

Tor Yahoo      0.999    0        1          0.999   1          0.999
HTTPS          1        0.001    1          1       1          0.999
Weighted Avg   1        0.001    1          1       1          0.999

Tor Twitter    0.994    0.001    0.997      0.994   0.996      0.997
HTTPS          0.999    0.006    0.999      0.999   0.999      0.997
Weighted Avg   0.999    0.005    0.999      0.999   0.999      0.997

Tor Wikipedia  0.994    0        0.998      0.994   0.996      0.996
HTTPS          1        0.006    0.999      1       0.999      0.996
Weighted Avg   0.999    0.005    0.999      0.999   0.999      0.996

Fig. 12. Distribution of C4.5 classification for each monitored site (series: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)

3) Random Forest

This algorithm evolved from decision trees and supports bagging and random feature selection; random forest performs much faster than boosting and bagging. The classification results show that random forest achieved a higher TP Rate than Naïve Bayes and C4.5, with 99.92% accuracy, a 99.86% TP Rate, and a 0.002% FP Rate, as described in Table IX and Fig. 13. The algorithm was run using the 10-fold cross-validation test mode.

Table 9. Breakdown of Random Forest classification for top monitored sites

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Tor Google     0.996    0        0.999      0.996   0.998      1
HTTPS          1        0.004    0.999      1       1          1
Weighted Avg   0.999    0.003    0.999      0.999   0.999      1

Tor Facebook   0.998    0        1          0.998   0.999      1
HTTPS          1        0.002    1          1       1          1
Weighted Avg   1        0.002    1          1       1          1

Tor Yahoo      0.999    0        1          0.999   1          1
HTTPS          1        0.001    1          1       1          1
Weighted Avg   1        0.001    1          1       1          1

Tor Twitter    0.995    0.001    0.997      0.995   0.996      0.999
HTTPS          0.999    0.005    0.999      0.999   0.999      0.999
Weighted Avg   0.999    0.004    0.999      0.999   0.999      0.999

Tor Wikipedia  0.995    0        0.997      0.995   0.996      0.999
HTTPS          1        0.005    0.999      1       0.999      0.999
Weighted Avg   0.999    0.004    0.999      0.999   0.999      0.999
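
The normalized information gain that C4.5 uses to choose a split can be illustrated with a short Python sketch. It assumes the usual gain ratio formulation (information gain divided by split information) for a binary threshold split on a single numeric feature. The function names and the sample feature values are hypothetical, and this is not the J48 code from Weka.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain of the split `value <= threshold`, divided by the
    split information (the entropy of the two partition sizes)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain / split_info


if __name__ == "__main__":
    # Hypothetical values of one flow feature for 4 Tor and 4 HTTPS flows.
    values = [10, 12, 11, 13, 50, 55, 52, 60]
    labels = ["Tor"] * 4 + ["HTTPS"] * 4
    for threshold in (11.5, 30):
        print(threshold, gain_ratio(values, labels, threshold))
```

A tree builder would evaluate candidate thresholds for every feature and split on the one with the highest ratio; in this toy data the threshold of 30 separates the two classes completely and therefore scores higher than 11.5.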


Fig. 13. Distribution of Random Forest classification for each monitored site (series: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)

4) SVM

SVM is a state-of-the-art supervised ML method, and most previous studies on Tor fingerprinting used SVM as the classifier [14]. Thus, the researcher considered SVM in order to confirm that the improvement of the method considered in this research holds regardless of the classifier used for Tor fingerprinting. SVM achieved an accuracy of 99.04%, a 97.72% TP Rate, and a 0.034% FP Rate with the 10-fold cross-validation test mode. Table X and Fig. 14 give the complete detailed results.

Table 10. Breakdown of SVM classification for top monitored sites

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Tor Google     0.975    0        0.999      0.975   0.987      0.987
HTTPS          1        0.025    0.996      1       0.998      0.987
Weighted Avg   0.996    0.022    0.996      0.996   0.996      0.987

Tor Facebook   0.974    0        1          0.974   0.987      0.987
HTTPS          1        0.026    0.996      1       0.998      0.987
Weighted Avg   0.997    0.023    0.997      0.997   0.997      0.987

Tor Yahoo      0.818    0        1          0.818   0.9        0.909
HTTPS          1        0.182    0.953      1       0.976      0.909
Weighted Avg   0.961    0.143    0.963      0.961   0.96       0.909

Tor Twitter    0.972    0        0.999      0.972   0.985      0.986
HTTPS          1        0.028    0.995      1       0.997      0.986
Weighted Avg   0.995    0.024    0.995      0.995   0.995      0.986

Tor Wikipedia  0.975    0        0.998      0.975   0.986      0.987
HTTPS          1        0.025    0.996      1       0.998      0.987
Weighted Avg   0.996    0.022    0.996      0.996   0.996      0.987

Fig. 14. Distribution of SVM classification for each monitored site (series: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)

Overall, the four algorithms (Naïve Bayes, C4.5, random forest, and SVM) achieved very similar results for all top monitored sites, as shown in Fig. 15, with lower accuracy for both Twitter and Wikipedia, which may be due to the dynamic content of both sites. A lower TP rate was also achieved for the Yahoo classification using SVM.

Fig. 15. Results comparison between ML algorithms (Accuracy / TP Rate / FP Rate: Naïve Bayes 0.996 / 0.996 / 0.004; C4.5 0.999 / 0.998 / 0.002; Random forests 0.999 / 0.998 / 0.002; SVM 0.99 / 0.977 / 0.034)

B. Comparison

In order to compare the results of this research with previously achieved results on Tor fingerprinting, the researcher would need to apply the improved methodology to the same data used in previous research. However, the Tor literature covers a wide variety of techniques with many different goals, and no two techniques can be directly compared, as the data used for analysis is not publicly disclosed [32]. The researcher therefore used some of the data generation parameters of [14], which were previously known to achieve higher accuracy, ignoring the removal of SENDMEs as it did not affect the results much, and then used this dataset with the improved methodology to present the new results.

A few studies are known to have achieved high fingerprinting results on some monitored sites. The recent research in [14] achieved an accuracy of 91% using SVM. The methodology used in that research differs in how the fingerprinting technique works; the accuracy achieved in this research using SVM is 99.04%, an improvement of 8.04 percentage points. Fig. 16 shows a graphical comparison between both results.


Fig. 16. Combined OSAD accuracy versus this research accuracy (Combined OSAD 0.91; this research 0.99).

Considering the technique used to identify Tor traffic, the research in [33] was able to identify Tor traffic using ML and showed that simulated Tor network traffic can be distinguished from regular encrypted traffic, suggesting that real-world Tor users may be vulnerable to the same analysis. Barker et al. [33] were able to detect Tor over HTTP and Tor over HTTPS, achieving a result of 90% using different ML algorithms. Their evaluation is based on the size of individual packets in a stream as the feature for traffic classification. This research takes a similar approach to distinguish Tor traffic from HTTPS in order to achieve website fingerprinting over Tor. However, by employing different improved techniques and a different feature set for the evaluation, this research improved on those results by 2.1 percentage points for random forest and 2.8 percentage points for C4.5. Fig. 17 and Fig. 18 compare both sets of results for the ML algorithms used in common, random forest and C4.5.

Fig. 17. Results comparison between Barker et al. [33] and this research using the Random Forest ML algorithm (True Positive Rate / False Positive Rate / ROC Area: this research 0.998 / 0.002 / 0.999; Barker et al. 0.977 / 0.003 / 0.999)

Fig. 18. Results comparison between Barker et al. [33] and this research using the C4.5 ML algorithm (True Positive Rate / False Positive Rate / ROC Area: this research 0.998 / 0.002 / 0.998; Barker et al. 0.97 / 0.007 / 0.992)

C. Discussion

In this paper, the researcher has demonstrated a website fingerprinting attack against the most widely known anonymity project, Tor. Tor is very hard to detect by measuring one parameter. In this methodology, the researcher presented an improved technique for fingerprinting websites over the Tor network; the method combines various improvements in order to achieve higher results than previous research on Tor fingerprinting. The results have shown that all ML algorithms employed achieved very similar results, almost 99% for all top sites on Alexa, meaning that the accuracy achieved is not biased toward a specific ML algorithm and that the variation lies in the characteristics of Tor traffic compared with HTTPS traffic.

According to the Tor project, the assumption for Tor is that Tor and HTTPS traffic should look alike, preventing a local observer from distinguishing the two kinds of traffic traces, and thus preserving privacy. However, the results of this research refute this assumption by showing that Tor and HTTPS traffic have different flow characteristics, as demonstrated by the high accuracy in distinguishing Tor and HTTPS traffic. This implies that Tor's protections are not enough to make both traffic traces look alike in order to preserve users' privacy, and indicates that the current protections in the Tor implementation do not deliver the anonymity that Tor promises. The variations in flow characteristics are shown in Fig. 19, which plots a sample of traffic taken from an ARFF file, with Google Tor traffic in blue and HTTPS traffic in red. As described in Table V, the features represent different network traffic flow characteristics. Fig. 20 also shows a plot matrix in Weka for the current dataset, with Tor Google traffic in blue and HTTPS in red; the plot matrix shows the distribution of each feature against the other features in matrix form. From the chart, the variation between the classes of traffic instances is obvious for the sample provided in Fig. 20. In the evaluation, however, 40 features were used, which gave a high classification accuracy for both Tor and HTTPS traffic.

Fig. 19. Variation in flow characteristics of sample Google's ARFF file

D. Conclusions And Future Work

This research shows that Tor can be classified amongst HTTPS encrypted traffic. Tor is a low-latency anonymity tool and one of the most prevalent open source anonymity tools for anonymizing TCP traffic on the Internet. Tor has implemented different defense techniques in order to prevent automated identification of Tor traffic, such as TLS encryption, padding, and packet


relaying. However, as shown in this research, Tor does not appear to succeed in blurring the features of its network packets, which makes it possible for a local observer to identify Tor traffic in the network and fingerprint most top sites on Alexa. Different techniques have been used in order to classify Tor. A technique similar to that of previous research is used to generate the traffic and dataset model, NetMate is used to dump the features, and Weka is used to build the dataset model. Several ML algorithms have been employed to identify Tor traffic, and the results improved on previous results by achieving an accuracy of 99.64% and 0.01% FP. However, the researcher believes that it is hard to compare the results of this research with previous research, as the Tor literature covers a wide variety of techniques with many different goals, and no two techniques can be directly compared as the data used for analysis is not publicly disclosed.

Fig. 20. Plot Matrix for a sample of Google's ARFF file

E. Recommendations for Future Work

The researcher acknowledges that this research experiment was based on a small set of simulated data, and thus it does not necessarily cover all possible real-world conditions, including open-world experiments. The noise and variability present in the real Tor network may make this classification technique inaccurate. As a future recommendation, it is important to include different types of noise in the dataset to mimic the real open-world experience. The researcher believes it is also important to study the ability to classify Tor on a global scope, such as at an ISP or beyond, with some fine-tuning of the parameters used in the experiments. Also, due to the high computation cost of SVM, it is important to use a parallel computing cluster to perform the experiment. Further work should increase the scope of fingerprinting to include more sites in the experiment and study the variation in accuracy for each. Finally, as a future recommendation for the Tor protocol, the researcher advises that developers should develop more defenses in order to make it harder for a local observer to classify Tor amongst HTTPS traffic, and thus preserve anonymity and privacy for Tor users.

REFERENCES

[1] Tor Project, Inc. (2012, July) torproject. [Online]. https://metrics.torproject.org
[2] J. R. Vacca, Computer and Information Security Handbook. Newnes, 2012.
[3] B. Schneier, Schneier on Security. John Wiley & Sons, 2009.
[4] M. Ligh, S. Adair, B. Hartstein, and M. Richard, Malware Analyst's Cookbook and DVD: Tools and Techniques for Fighting Malicious Code. Wiley Publishing, 2010.
[5] B. Li, E. Erdin, M. H. Güneş, G. Bebis, and T. Shipley, An Analysis of Anonymizer Technology Usage. Berlin: Springer, 2011.
[6] X. Bai, Y. Zhang, and X. Niu, "Traffic identification of tor and web-mix," in Intelligent Systems Design and Applications, 2008. ISDA'08. Eighth International Conference, 2008, pp. 548-551.
[7] A. Panchenko, L. Niessen, A. Zinnen, and T. Engel, "Website fingerprinting in onion routing based anonymization networks," in Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, 2011, pp. 103-114.
[8] P. Loshin, Practical Anonymity: Hiding in Plain Sight Online. Newnes, 2013.
[9] E. M. Schwalb, iTV Handbook: Technologies & Standards. Prentice Hall, 2003.
[10] M. Mogollon, Cryptography and Security Services: Mechanisms and Applications. CyberTech Publishing, 2007.
[11] M. Gomułkiewicz, M. Klonowski, and M. Kutyłowski, "Onions based on universal re-encryption–anonymous communication immune against repetitive attack," in Information Security Applications, Berlin, 2005, pp. 400-410.
[12] E. Chan-Tin, J. Shin, and J. Yu, "Revisiting Circuit Clogging Attacks on Tor," in Availability, Reliability and Security (ARES), 2013 Eighth International Conference, pp. 131-140, 2013.
[13] R. Dingledine and N. Mathewson. (2004) torproject. [Online]. https://gitweb.torproject.org/torspec.git?a=blob_plain;hb=HEAD;f=tor-spec.txt
[14] T. Wang and I. Goldberg, "Improved website fingerprinting on Tor," in Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society, New York, 2013, pp. 201-212.
[15] Z. Ling, J. Luo, W. Yu, X. Fu, D. Xuan, and W. Jia, "A new cell-counting-based attack against Tor," IEEE/ACM Transactions on Networking (TON), vol. 20(4), pp. 1245-1261, 2012.
[16] S. Zander, T. Nguyen, and G. Armitage, "Automated traffic classification and application identification using machine learning," in Local Computer Networks, 2005. 30th Anniversary, 2005, pp. 250-257.
[17] Selenium. (2004) Selenium. [Online]. http://docs.seleniumhq.org/
[18] Margaret Rouse. (2011) search server virtualization. [Online]. http://searchservervirtualization.techtarget.com/definition/virtual-machine
[19] (1987) Tcpdump. [Online]. http://www.tcpdump.org/
[20] Alexa. (1996) Alexa. [Online]. http://www.alexa.com/
[21] The Fraunhofer Institute for Open Communication Systems FOKUS. (2010) ip-measurement. [Online]. http://www.ip-measurement.org/tools/netmate.
[22] C. McCarthy and A. N. Zincir-Heywood, "An investigation on identifying SSL traffic," in Computational Intelligence for Security and Defense Applications (CISDA), pp. 115-122, 2011.
[23] University of Waikato. (2008) Waikato. [Online]. http://www.cs.waikato.ac.nz/ml/weka/arff.html


[24] Maimon, O., & Rokach, L., "Introduction to supervised methods," in Data Mining and Knowledge Discovery Handbook, pp. 149-164, 2005. [Online]. http://www.ise.bgu.ac.il/faculty/liorr/hbchap8.pdf
[25] Wroblewski, J., "Finding minimal reducts using genetic algorithms," in Proceedings of the Second Annual Joint Conference on Information Science, 1995, pp. 186-189.
[26] Witten, I. H., Gori, M., & Numerico, T., Web Dragons: Inside the Myths of Search Engine Technology. Elsevier, 2010.
[27] Lantz, B., Machine Learning with R. Packt Publishing Ltd, 2013.
[28] Vapnik, V. N., & Chervonenkis, A. J. (1974) Theory of pattern recognition.
[29] Agneeswaran, V., Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives. Pearson Education, 2014.
[30] Alshammari, R., & Zincir-Heywood, A. N., "Machine learning based encrypted traffic classification: identifying ssh and skype," in Computational Intelligence for Security and Defense Applications, 2009, pp. 1-8.
[31] University of Waikato. (2013) [Online]. http://www.cs.waikato.ac.nz/ml/weka/
[32] Williams, N., Zander, S., & Armitage, G., "A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification," ACM SIGCOMM Computer Communication Review, pp. 5–16, 2006.
[33] Barker, J., Hannay, P., & Szewczyk, P., "Using traffic analysis to identify The Second Generation Onion Router," in Embedded and Ubiquitous Computing (EUC), 2011 IFIP 9th International Conference, 2011, pp. 72-78.

Authors’ profiles

Mr. Almubayed is a security researcher. He was born in 1985 and received his B.S. degree from Al-Balqa Applied University (BAU) in 2008. He recently completed his M.S. degree in information security and digital crimes at Princess Sumaya University for Technology (PSUT), Amman, in 2014.
Mr. Almubayed worked as a software developer with various software companies in Jordan. In 2009, he joined Maktoob, which was later acquired by Yahoo Inc. He is currently based in Sunnyvale, California, and works with Yahoo Inc. as a security engineer. Mr. Almubayed has conducted research in various areas, including web defensive tools and machine learning for traffic classification, and he has more than 5 years of experience in the fields of information security, ethical hacking, reverse engineering, risk management, and computer programming.

Dr. Hadi received the B.S. degree in computer science from Philadelphia University, Jordan, in 2002 and the M.Sc. and Ph.D. degrees in computer information systems from the University of Banking and Financial Sciences, College of Information Technology, Jordan, in 2004 and 2010, respectively.
He is a Senior Level Information Security Officer with 14+ years of professional experience working for various highly reputed companies. Since 2011 he has been teaching computer security, digital forensics, and networking courses for both graduates and undergraduates. He is also an author, speaker, and freelance instructor. His research interests include digital forensics, operating systems internals, malware forensic analysis, and network security.

Prof. Atoum is currently the Dean of The King Hussein School of Computing Sciences at Princess Sumaya University for Technology (PSUT). He received his B.S. degree in Computer Science from Yarmouk University, Jordan, in 1984, his Master's degree in Computer Science from the University of Texas at Arlington, USA, in 1987, and his Ph.D. in Computer Science from the University of Houston, USA, in 1993. He worked as an assistant professor at Yarmouk University from 1993 to 1995. He was appointed Chairman of the Computer Science department at PSUT. He has supervised or co-supervised several Ph.D. dissertations and M.S. theses, as well as numerous undergraduate graduation projects. Finally, he has been involved in several committees for degree plans, and he proposed and developed the Master's program in Information System Security and Digital Criminology at PSUT.