P2P Network Classification
Student: P. J. Molijn, 851437445
Date: 9/9/2014
P2P NETWORK CLASSIFICATION
A BOTH PORT AND PAYLOAD AGNOSTIC APPROACH
by
P. J. Molijn
Master of Science
in Software Engineering
ACKNOWLEDGEMENTS
Writing this thesis would not have been possible without the support of several people,
to whom I am very grateful.
I wish to thank my supervisors prof. dr. Marko van Eekelen and dr. ir. Harald Vranken
for providing me with the opportunity to write this thesis under their guidance. Words
cannot describe my gratitude towards Marko and Harald. Despite their busy schedules, they
made time for me, provided me with excellent feedback and guided me into the world of
academia. It has been a real pleasure working with both of you and I am looking forward
to our next project together. To PhD or not to PhD? That is the question!
I would also like to thank dr. Anda Counotte-Portman for her valuable advice and for
ensuring that there were no delays in my study progress on the part of the Open
University. I am thankful to dr. Bastiaan Heeren for his availability and willingness to
answer many of my questions during my study.
I would like to thank my parents. They introduced me to the wonderful world of computing
by purchasing my first computer, the Commodore 128. When I was a child, my father,
Hedwig, explained complex math problems in such a way that it was a real joy to find the
solution myself. My late mother, Eleonore, loved reading books and passed this
characteristic on to me. I really enjoy reading and am thankful for that. Thanks, Mom and
Dad, for believing in me and providing me with the opportunities to pursue higher
education.
When my younger and only brother, Delano, achieved his BSc degree, he motivated me
to pursue my Master's. Thank you for just being my brother and for being one of the
first people to motivate me, implicitly or explicitly, on my journey.
Finally, without the love and support of my wife Gracia, my son Qylan and daughter
Kaylin, I would never have succeeded. Gracia, my utmost respect for putting up with all
the late nights and early mornings during my study, for always being there with the
necessary drinks and food to keep me going, and for gracefully dealing with my grumpy
moods when there were setbacks. Qylan, thanks for the time we spent together at Martial
Arts training, just one example of the good moments that took my mind off the research.
Kaylin, thanks for the funny faces you can put on, combined with the most hilarious
storytelling, which gave me such moments as well. I thank you both, love you very much,
am proud of you and am proud to be your father. Although all three of you often had to
cope with my absence, you seldom complained. Hopefully now that the thesis is completed,
we can spend more family time together.
P. J. Molijn
Lelystad, September 2014
CONTENTS
List of Figures
Summary
Samenvatting
1 Introduction
1.1 Background
1.2 Network Traffic Classification
1.2.1 Approaches
1.2.2 Methods
1.3 Research Question
1.4 Research Method and Objective
1.5 Research Contribution
1.6 Deliverables
1.7 Thesis outline
2 Peer-to-Peer Systems
2.1 Introduction
2.2 Architectures
2.2.1 Degree of centralization
2.2.2 Network structure
3 Traffic Classification
3.1 Introduction
3.2 Approaches
3.2.1 Port
3.2.2 Payload
3.2.3 Host behavior
3.2.4 Flow feature
3.3 Machine Learning
3.3.1 Algorithms
3.4 Performance Criteria
4 Proposed Framework
4.1 Background
4.2 System design
4.2.1 The dialogue
4.2.2 P2P traffic
4.2.3 Details
5 Data Analysis
5.1 Data Collection
5.2 Attribute Importance
5.3 Classifier metrics
6 Conclusion
6.1 Limitations
6.2 Related work
6.3 Future work
6.4 Reflection
6.5 MIT-License
Bibliography
Books
Academic Articles
Technical Documentation
Acronyms
Glossary
LIST OF FIGURES
LIST OF TABLES
SUMMARY
The popularity of Peer-to-Peer (P2P) applications, and consequently the P2P traffic on the
internet, has increased in recent years. This increase is due not only to benign P2P
applications but also to malicious P2P software such as P2P botnets. To cope with the
increasing threats posed by malicious P2P botnets, botnets should be combated actively.
A first step is to detect which internet traffic originates from P2P botnets. This
research makes a start by examining whether internet traffic can be classified as either
P2P traffic or non-P2P traffic, for now regardless of whether it concerns benign or
malicious traffic.
Classification of P2P traffic is challenging since traditional techniques, which mainly
analyze port numbers or payload data, are becoming ineffective against applications that
use random ports or encryption. Based on a literature study, this research proposes
Machine Learning (ML) as a method for P2P traffic classification, using the algorithms
J48, REPTree and AdaBoost for the analysis of statistical flow features, which are both
port and payload agnostic.
The classifier is trained with a data set consisting of network traffic derived from four
P2P applications, two P2P botnets and non-P2P traffic. Classifier metrics were obtained
from test data sets chosen such that each individual set is disjoint from all other sets
(including the training set). The results of this quantitative empirical research show
that the proposed method can achieve high accuracy, outperforming comparable existing
approaches for classification of P2P traffic.
The data sets and some of the source code used in the thesis will be made available to
the research community to enable validation and extension of the work.
Keywords: P2P traffic, port agnostic, payload agnostic, classification, Machine learning
SAMENVATTING (DUTCH SUMMARY, TRANSLATED)
The popularity of Peer-to-Peer (P2P) applications, and with it the P2P traffic on the
internet, has increased sharply in recent years. Besides the use of benign P2P
applications, this increase is also attributable to malicious P2P applications such as
P2P botnets. To counter the growing threats posed by P2P botnets, they must be combated
actively. A first step is to detect which internet traffic belongs to P2P botnets. This
research makes a start by examining whether internet traffic can be classified as P2P
traffic or non-P2P traffic, for now regardless of whether that traffic is benign or
malicious.
Classification of P2P traffic is challenging because traditional techniques, which mainly
analyze port numbers or payload information, are ineffective against applications that
use random ports or encryption. Based on a literature study, this research uses Machine
Learning (ML) as the method for classifying P2P traffic, with the algorithms J48, REPTree
and AdaBoost used to analyze statistical flow features that are both port and payload
agnostic.
The classification mechanism learns P2P behaviour from a data set consisting of benign
P2P traffic, malicious P2P botnet traffic and non-P2P traffic. The accuracy of the
classifier on the actual test data determines how effectively P2P and non-P2P traffic can
be distinguished. The performance metrics of the classifier are all based on test data
sets, where each individual set is disjoint from the other sets (including the training
set). The results of this quantitative empirical research show that a high accuracy can
be achieved, exceeding that of comparable existing approaches for classification of P2P
traffic.
The data sets and some of the source code used during the research will be made publicly
available, for example to enable validation or extension of this work.
Keywords: P2P traffic, port agnostic, payload agnostic, classification, Machine learning
1
INTRODUCTION
1.1. BACKGROUND
Nowadays we live in a world where the internet plays a central role. The use of the
internet and its associated applications has increased to the point at which they have
become a necessary part of our lives. Although the internet and internet-based
applications can be extremely helpful, their use presents various security challenges.
The primary security risk arises from vulnerabilities in software, which are exploited by
malicious software, also known as malware. McGraw and Morrisett [MM00] define malicious
code as “any code added, changed, or removed from a software system in order to
intentionally cause harm or subvert the intended function of the system.” As
internet-based applications matured, malware improved considerably as well: in its attack
scope, its way of spreading, its methods to hide its presence and its ability to
withstand dismantling attempts, to name a few.
The most common malware infrastructure nowadays is a botnet [GOH11; SB11]. A bot-
net is a network of compromised computers which are controlled by a (mostly malicious)
user, who is also known as the attacker, Botmaster or Botherder [GOH11; SB11]. The com-
promised computers, also known as Bots, run malicious software that successfully inte-
grates techniques used by other previously known malware types, such as rootkits, worms,
viruses, Trojans, etc. [PGL11].
A specially crafted communication path between the network of compromised com-
puters and the botmaster is what sets botnets apart from other malware. This specially de-
ployed path is called the Command and Control (C&C) communication channel [GOH11;
SB11; Ros+13; UAH10]. Once the client’s machine is compromised, the C&C channel is
used to send information from the bots to the server(s). The C&C channel provides a way
for botmasters to have full control over the bots.
Commonly known malicious actions executed by bots are malware distribution, sending
spam mails, launching Distributed Denial of Service (DDoS) attacks, illegal content
distribution, click fraud, collecting private information (e.g. banking) and attacks on
other critical infrastructure [GOH11; SB11; PGL11; SV13; Sil+13].
It has been observed that some botnets have a centralized architecture by connecting to
a central C&C server. In this architecture, the computer or device acting as the C&C server
is the weakest point of the botnet as this exposes a single point of failure for the entire bot-
net [GOH11; SB11; Ros+13; UAH10]. To avoid the single point of failure of the centralized
architecture, botmasters are also exploiting Peer-to-Peer (P2P) architectures. In a Peer-to-
Peer (P2P) architecture there is no dedicated server or client role, as a P2P node can act
as both a server and a client, thereby eliminating the centralized C&C channel, making
P2P botnets an attractive alternative architecture for botmasters [Zei+10; Liu+10; Saa+11;
Ros+13]. Between these two extremes, a single centralized C&C server and no dedicated
server at all, other variants are possible [Oll09], leading to the following topologies
(see Figure 1.1):
Centralized. The centralized topology relies upon a single central C&C server to commu-
nicate with all the bots. Each bot gets its instructions directly from the central C&C
server.
P2P. Botnets with a P2P architecture do not have a centralized C&C server. Instead,
commands are injected into the botnet via any bot.
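The practical difference between the two topologies can be illustrated with a small
reachability check. The Python sketch below is illustrative only: the node names and the
helper `reachable` are hypothetical and are not taken from any botnet implementation. It
merely shows that removing the hub of a centralized topology isolates every bot, whereas a
small P2P overlay stays connected when a single peer is removed.

```python
from collections import deque

def reachable(graph, start, removed=frozenset()):
    """Return the set of nodes reachable from start, ignoring removed nodes."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Centralized: every bot talks only to the C&C server (hypothetical names).
centralized = {
    "cc": {"bot1", "bot2", "bot3"},
    "bot1": {"cc"},
    "bot2": {"cc"},
    "bot3": {"cc"},
}

# P2P: the bots form a ring, so there is no dedicated server.
p2p = {
    "bot1": {"bot2", "bot4"},
    "bot2": {"bot1", "bot3"},
    "bot3": {"bot2", "bot4"},
    "bot4": {"bot3", "bot1"},
}

# Take down the C&C server, respectively one arbitrary peer.
central_after_takedown = reachable(centralized, "bot1", removed={"cc"})
p2p_after_takedown = reachable(p2p, "bot1", removed={"bot2"})
```

In this toy setting `central_after_takedown` contains only `bot1`, while
`p2p_after_takedown` still contains the three remaining peers.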
Table 1.1 provides a quick overview with the pros and cons of the common botnet com-
munication topologies [Oll09].
Studies of global internet traffic have shown that P2P applications were producing more
traffic than all the other applications together, being responsible for 49% to 83%, on
average, of all internet traffic and reaching peaks of over 95% [Gom+13]. It should be
noted that not all P2P traffic is malicious; there are numerous legitimate P2P
applications (e.g. Voice over IP (VoIP) and videoconferencing).
A Peer-to-Peer (P2P) computer network is a distributed architecture where tasks or
workloads are divided amongst different computers. Every computer in this distributed
network is referred to as a peer or node. In P2P networks, clients provide resources such
as bandwidth, storage space and computing power [Wan+14]. A P2P network is also
characterized by its resilience: when a peer is either intentionally or unintentionally
disconnected from the network, the P2P application continues to function by using other
peers [Wan+14].
Despite their advantages, P2P networks introduce some problems of their own. Resource
discovery introduces overhead costs [Wan+14]. When a peer P needs resource R (e.g. a file,
bandwidth, computing power), peer P sends out what is called a query message describing
R. A list L is returned from the resource discovery system containing the nodes which can
provide R. System performance is decreased by the queries and broadcasts sent, while not
necessarily resulting in improvement in resource quality. As a result of increased overhead,
purely decentralized P2P-based systems scale poorly. Additionally, P2P-based systems also
can be dominated by freeloaders that only consume resources, but do not contribute to the
system as a whole. These peers add to the system overhead, but fail to contribute to other
peers.
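The overhead of resource discovery can be made concrete with a flooding query. The
following Python sketch rests on assumptions that are not in the source: the overlay, the
peer names, the time-to-live (`ttl`) parameter and the function `flood_query` are all
hypothetical, and flooding is only one possible discovery mechanism. It counts every
forwarded query as one message, so the cost of obtaining the list of providers becomes
visible.

```python
from collections import deque

def flood_query(overlay, holders, origin, resource, ttl):
    """Flood a query for `resource` from `origin`.

    Returns (sorted list of providers, number of query messages sent).
    """
    providers = []
    messages = 0
    seen = {origin}
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if peer != origin and resource in holders.get(peer, ()):
            providers.append(peer)
        if hops_left == 0:
            continue  # time-to-live exhausted, do not forward further
        for neighbour in overlay[peer]:
            messages += 1  # every forwarded query costs one message
            if neighbour not in seen:  # duplicates are received but dropped
                seen.add(neighbour)
                frontier.append((neighbour, hops_left - 1))
    return sorted(providers), messages

# Hypothetical overlay: peer P looks for resource R, held by peers C and D.
overlay = {
    "P": ["A", "B"],
    "A": ["P", "B", "C"],
    "B": ["P", "A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
holders = {"C": {"R"}, "D": {"R"}}

found, cost = flood_query(overlay, holders, "P", "R", ttl=2)
```

In this five-peer example the query returns two providers at a cost of eight messages,
several of which are duplicates that the receiving peer simply drops.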
In addition, decentralized networks introduce new security issues because they are
designed so that each user is responsible for controlling their own data and resources
[Wan+14]. Since no central server monitors and corrects badly behaving peers, peers can
provide poor-quality or even unwanted data to other peers without consequence.
Table 1.1 (recovered fragment):
Geographical optimization: Multiple geographically distributed C&C servers can speed
up communications between botnet elements.
Botnet enumeration: Passive monitoring of communications from a single bot-compromised
host can enumerate other members of the botnet.
Accurate P2P traffic classification is needed not only to address the management and
security problems described above; the trend of botmasters turning to P2P architectures
to distribute their malware also underlines its importance [Zei+10; Liu+10; Ros+13;
Saa+11; Wan+14]. With P2P traffic classification, unusual flows can be detected early to
help find P2P malware [RH13].
Historically, network traffic was easily identified by matching the port numbers of that
traffic against the list of officially assigned port numbers maintained by the Internet
Assigned Numbers Authority (IANA)¹ [Aut11]. Nowadays almost all applications can be
reconfigured to use different port numbers, which makes this detection technique largely
ineffective.
A more sophisticated approach is based on payload inspection, also known as Deep Packet
Inspection (DPI). This approach examines each packet and searches for predefined
application-specific patterns, thereby achieving a higher accuracy than port based
matching. If an application communicates on non-standard port numbers, DPI might still
be able to detect it, provided the payload is not encrypted.
The next approach is flow analysis. A flow is summarized data identified by a 5-tuple
consisting of source IP Address, destination IP address, source port number, destination
port number and protocol of the network or transport layer. Flow analysis can be done in
two ways [Kor12; KPF05]:
• host behaviour
• flow feature
With host behaviour, it is assumed that many applications or groups of applications have
a specific behavioral pattern when running on a host. The classification consists of match-
ing previous patterns with the pattern from the behaviour of the host under investigation
[KPF05]. This approach focuses on finding the set of hosts H that are running application
a or group of applications A.
With flow feature, features are computed over multiple packets grouped in flows and
further used in the training process that associates sets of features with known traffic classes.
The classification consists of a statistical comparison of unknown traffic with previously
learned rules [Kor12].
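How packets are grouped into flows and summarized into features can be sketched as
follows. This is a minimal Python illustration under stated assumptions: the packet
records, the field order and the four features are hypothetical and are not the feature
set used in this research, and flows are treated as unidirectional. The 5-tuple serves
only as the grouping key; the computed features themselves use neither port numbers nor
payload.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packet records:
# (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, size_in_bytes)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 51000, 6881, "UDP", 120),
    (0.05, "10.0.0.1", "10.0.0.2", 51000, 6881, "UDP", 300),
    (0.20, "10.0.0.3", "10.0.0.4", 40000, 80, "TCP", 60),
    (0.45, "10.0.0.1", "10.0.0.2", 51000, 6881, "UDP", 180),
]

def build_flows(packets):
    """Group packets by their 5-tuple (unidirectional flows)."""
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto, size in packets:
        flows[(src, dst, sport, dport, proto)].append((ts, size))
    return flows

def flow_features(records):
    """Compute simple statistical features over the packets of one flow."""
    times = [ts for ts, _ in records]
    sizes = [size for _, size in records]
    return {
        "packets": len(records),
        "bytes": sum(sizes),
        "mean_size": mean(sizes),
        "duration": max(times) - min(times),
    }

features = {key: flow_features(recs) for key, recs in build_flows(packets).items()}
```

Each entry of `features` is one row that a Machine Learning algorithm could be trained
on once a class label has been attached to it.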
Both the host behaviour and the flow feature form of the flow analysis approach may
include data mining techniques and Machine Learning Algorithms (MLAs). An MLA can divide
the communication into clusters or groups where each group contains one dominant
protocol.
The main drawback of behavioral and flow feature analyses is that they mostly produce
estimates instead of exact results [PN11]. As a consequence, behavioral and flow feature
analyses rarely achieve 100% accuracy [PN11; DD11].
To cope with the increasing threats posed by malicious P2P botnets, botnets should be
combated actively. A first step is to detect which internet traffic originates from P2P
botnets. This research makes a start by examining whether internet traffic can be
classified as either P2P traffic or non-P2P traffic, for now regardless of whether it
concerns benign or malicious traffic. Classification of P2P traffic is challenging since
traditional techniques, which mainly analyze port numbers or payload data, are becoming
ineffective against applications that use random ports or encryption. Based on a
literature study, this research proposes Machine Learning (ML) as a method for P2P
traffic classification, using the MLAs J48, REPTree and AdaBoost for the analysis of
statistical flow features, which are both port and payload agnostic.
¹ http://www.iana.org
1.2. NETWORK TRAFFIC CLASSIFICATION
1.2.1. APPROACHES
This section describes the classical approaches used for protocol classification. The four
existing techniques in network traffic classification are divided into two content based ap-
proaches and two flow analysis based approaches. The two content based traffic classifica-
tion approaches are:
PORT BASED
Port number based approaches were the first techniques to detect P2P traffic [Wan+14].
This type of classification is the oldest one, mostly due to its ease of use when
collecting and analysing network data [PN11]. In the early days of the internet, most
applications were assigned a specific port number. The detection is done by capturing
TCP [Pos03] or UDP [Pos80] packet headers and comparing the port numbers with the
official list of port numbers maintained by IANA [Aut11]. However, more and more P2P
applications no longer use standard ports, in order to circumvent detection [Wan+14;
PN11]. As a result, the effectiveness of port based protocol detection is deteriorating
to roughly 30% of internet traffic or less [MP05; SSW04; MW06].
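Port based detection amounts to a table lookup on the packet header. The Python sketch
below is a minimal illustration: the table holds only a handful of well-known IANA port
assignments, and the function name `classify_by_port` is hypothetical.

```python
# A tiny illustrative subset of the IANA well-known port assignments.
WELL_KNOWN_PORTS = {
    ("TCP", 22): "ssh",
    ("TCP", 25): "smtp",
    ("UDP", 53): "dns",
    ("TCP", 80): "http",
    ("TCP", 443): "https",
}

def classify_by_port(protocol, src_port, dst_port):
    """Label a packet by the first of its ports found in the table."""
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((protocol, port))
        if label is not None:
            return label
    return "unknown"

web = classify_by_port("TCP", 51234, 80)
random_ports = classify_by_port("UDP", 51234, 40321)
```

The weakness described above is directly visible: traffic on two randomly chosen ports
is labelled `unknown`, and a P2P application that deliberately uses TCP port 80 would be
labelled `http`.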
PAYLOAD BASED
The port based approach examines the packet header only, while payload based detection
looks at the complete packet. Payload analysis is also known as Deep Packet Inspection
(DPI). The packet payload offers more data for a detection technique to utilize. DPI
methods have a high degree of exactness and are not dependent on port numbers. An
essential part of DPI detection is the existence of a pattern or signature database. By
way of comparison, this database plays the same role as IANA's list of well-known port
numbers [Aut11], since most protocols have identifying byte-string patterns that are
unique to them. The payload detection mechanism relies heavily on a properly maintained
pattern database, which must be updated when new protocols are introduced or when
versions of a protocol differ significantly. In 2004, Sen et al. [SSW04] described the
accuracy, feasibility and robustness of signature based P2P traffic detection. Their
experiments achieved an accuracy from 90% to 100% depending on the protocol. This showed
that P2P protocols could be detected by deep packet inspection in high-speed networks at
that time [PN11]. The drawbacks of DPI techniques are that they require a lot of
computational resources, cannot cope with encrypted data, do not detect new P2P
applications with unknown characteristics (not in the pattern database) and have high
maintenance costs (keeping the pattern database up to date).
[Figure: network traffic classification approaches: port-based, payload-based, host
behaviour based and flow features based]
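A signature database in its simplest form is a mapping from protocol names to
byte-string patterns. The Python sketch below is an illustration under assumptions, not
the method of [SSW04]: it holds two commonly documented handshake prefixes (the
BitTorrent handshake and the Gnutella connection request), and the function name
`classify_by_payload` is hypothetical.

```python
# Illustrative signature database: protocol name -> payload prefix.
SIGNATURES = {
    # Start of the BitTorrent handshake: length byte 19 plus protocol string.
    "bittorrent": b"\x13BitTorrent protocol",
    # Start of a Gnutella connection request.
    "gnutella": b"GNUTELLA CONNECT/",
}

def classify_by_payload(payload: bytes) -> str:
    """Return the first protocol whose signature prefixes the payload."""
    for label, pattern in SIGNATURES.items():
        if payload.startswith(pattern):
            return label
    return "unknown"

handshake = b"\x13BitTorrent protocol" + bytes(8)
plain = classify_by_payload(handshake)
# Stand-in for an encrypted payload: the signature is no longer visible.
opaque = classify_by_payload(bytes([0x8F, 0x21, 0x5C, 0x07]))
```

The second call shows the limitation discussed above: once the payload is encrypted, or
the protocol is missing from the database, the lookup falls through to `unknown`.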
• Bulk
• Database
• Interactive
• Services
• WWW
• P2P
• Attack
• Games
• Multimedia
Within the scope of this research, the class P2P as defined by [MZ05] is treated as P2P
and all other classes as non-P2P.
Since the information suitable for protocol detection is gathered from statistics about
connections, there is no need to find patterns in the packet payload. Behavioral
analysis will rarely reach 100% accuracy in its classification. As mentioned earlier,
there is no strict identification of protocols such as a byte string inside a packet.
Although a host-behaviour based approach sounds promising for P2P classification,
looking at a single host only is considered too limiting for this research. A host
behaviour approach would allow the detection of current P2P applications (regardless of
intent, benign or malicious) only if the behaviour of these applications is known a
priori or is established before the attempt to classify traffic for a host under
investigation. The next approach, flow-feature based, attempts to overcome this
limitation.
directly by counting packets or packet sizes. Port numbers, the size of TCP segments and
other attributes are derived from packet headers [PN11].
In a follow up study, Li and Moore [LM07] showed that from these 248 attributes, 11
would be sufficient for accurate protocol detection.
These attributes are:
1. server port
2. client port
3. count of all packets with push bit² set in TCP header (server → client)
4. count of all packets with push bit set in TCP header (client → server)
P2P applications are not bound to specific ports; therefore, the server and client port
are not used as attributes for detection in this research. P2P applications are shifting
from TCP-only traffic to hybrid TCP/UDP and even UDP-only traffic. Consequently, this
research omits attributes that relate to the TCP header only.
Narang et al. [NRH13] found that the attributes for P2P protocol detection should in-
corporate both TCP and UDP traffic and should not rely on ports.
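The two exclusion rules stated above, no port attributes and no TCP-only attributes, can
be expressed as a simple filter. The Python sketch below is purely illustrative: the
attribute catalogue is hypothetical, and the two surviving names (`mean_packet_size`,
`flow_duration`) are placeholders, not the feature set used in this research or by
Narang et al.

```python
# Hypothetical attribute catalogue: name -> (uses_port, tcp_only).
CANDIDATE_ATTRIBUTES = {
    "server_port": (True, False),
    "client_port": (True, False),
    "tcp_push_count_server_to_client": (False, True),
    "tcp_push_count_client_to_server": (False, True),
    "mean_packet_size": (False, False),  # placeholder attribute
    "flow_duration": (False, False),     # placeholder attribute
}

def select_agnostic(attributes):
    """Keep attributes that neither rely on ports nor exist for TCP only."""
    return sorted(
        name
        for name, (uses_port, tcp_only) in attributes.items()
        if not uses_port and not tcp_only
    )

selected = select_agnostic(CANDIDATE_ATTRIBUTES)
```

Only attributes that can be computed for both TCP and UDP flows without looking at port
numbers survive the filter.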
The features used in this research are based on the work of Narang et al. and are:
1.2.2. METHODS
Whereas the more recent flow analysis approaches for traffic classification rely on
MLAs, the content-based approaches mainly rely on simple pattern matching [Kor12]. This
section discusses the Pattern Matching and Machine Learning methods (see Figure 1.3
[Kor12]) used in the classification approaches described in the previous section.
[Figure 1.3: Network traffic classification methods: Pattern Matching and Machine
Learning (supervised learning, unsupervised learning)]
PATTERN MATCHING
Previously, simple pattern matching combined with content-based approaches was one of
the most accurate classification methods. However, in the case of encrypted traffic,
pattern matching based on identifying application-level signatures is less effective,
if it is possible at all. Sen et al. [SSW04] provided an efficient method for
identifying five popular P2P applications through application-level signatures. All of
the proposed signatures, however, become useless once traffic encryption or tunneling
methods are applied [Kor12].
MACHINE LEARNING
This section provides a general idea of machine learning.
Machine Learning (ML) is a collection of techniques for data mining and knowledge
discovery which searches for useful structural patterns in data [PN11]. ML techniques can
be divided into two groups according to types of learning [PN11]:
Supervised learning uses training data, from which classification rules are extracted to
classify unseen examples.
Unsupervised learning does not rely on training data and groups instances that have sim-
ilar characteristics into clusters.
With supervised learning, all training data is labelled with the target value.
Predicting a numerical value for a given set of input values is called regression;
regression models answer a question with a numerical answer. Classification is when each
entry in the dataset is assigned a specific class, with the goal of determining the
class value of new data.
Unsupervised learning methods do not have a labelled training set and are used for
grouping data that expose similar characteristics into clusters or estimating densities.
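Supervised classification can be illustrated with the simplest possible tree learner, a
one-level decision tree (decision stump). The Python sketch below is a stand-in chosen
for brevity, not one of the algorithms used in this research; J48, REPTree and AdaBoost
are considerably more capable. The training values, the single feature and the function
names are hypothetical.

```python
def train_stump(samples):
    """Learn a one-level decision tree from (feature_value, label) pairs.

    Returns (threshold, label_if_above, label_otherwise), chosen to
    minimize the number of misclassified training samples.
    """
    values = sorted({value for value, _ in samples})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    # Candidate thresholds lie midway between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        for above, below in (("P2P", "non-P2P"), ("non-P2P", "P2P")):
            errors = sum(
                1
                for value, label in samples
                if (above if value > threshold else below) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above, below)
    _, threshold, above, below = best
    return threshold, above, below

def predict(model, value):
    """Classify an unseen feature value with a trained stump."""
    threshold, above, below = model
    return above if value > threshold else below

# Hypothetical labelled training data: (mean packet size in bytes, class).
training = [
    (90, "non-P2P"), (110, "non-P2P"), (130, "non-P2P"),
    (400, "P2P"), (520, "P2P"), (610, "P2P"),
]
model = train_stump(training)
unseen = predict(model, 480)
```

The labels in `training` are what makes this supervised: the threshold is extracted from
labelled examples and then applied to data the learner has not seen before. An
unsupervised method would receive the same values without the class column and could
only group them by similarity.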
This research proposes, based on literature study, ML as a method for P2P traffic classifi-
cation, using MLAs for analysis of statistical flow features, which are both port and payload
agnostic.
1.3. RESEARCH QUESTION
To answer this question, we would need to answer the following sub questions:
• RQ2: What relevant features are needed for P2P traffic classification?
Attributes are needed for the algorithm found in RQ1. These attributes are mapped
onto the network dataset as characteristics of the network communication between
two hosts. Features relevant for P2P communication need to be extracted from the
dataset.
• RQ3: How do we apply a port agnostic, payload agnostic classification technique in
a P2P traffic classification approach?
Since P2P systems, whether benign or malicious, are using dynamic port numbers
and encrypting their payload more often, a P2P classification technique which is port
agnostic and payload agnostic can be more effective.
• P2P Botnets
• P2P Architectures
First the general botnet architecture is reviewed, zooming in on P2P botnets in
particular. This provides background information on botnets in general as well as on
the more recent P2P botnet architectures. As botnet systems evolve by using P2P
architectures, the literature study was extended with a study of P2P systems.
Information to model a classifier was gained by studying network classification
theories. To address the research subquestion about a suitable algorithm for P2P traffic
classification, Machine Learning theories were studied.
To get a better understanding of the problem domain, a conceptual framework for P2P
traffic classification is proposed. The results of the literature study and the
development of the conceptual framework led to the implementation of a method for P2P
traffic classification. This framework was tested and validated with at least two
datasets. Figure 1.4 provides the conceptual research model for this research.
The research utilizes publicly available datasets or datasets provided by others who
consented to their use in this research. The following network traffic datasets have
been obtained and used (either partially or fully):
The dataset by Rahbarinia et al. is used as it includes P2P traffic in both benign and
malicious forms. The dataset by Shiravi et al. is used for the non-P2P traffic counterpart.
The research objective is to determine whether P2P traffic can be distinguished in
offline network trace files, thereby aiming to improve on the classification accuracy
of existing P2P classification mechanisms.
³ http://www.iscx.ca/datasets
[Figure: research model diagram; recoverable labels: Tool writing, Machine Learning
theory]
1.5. RESEARCH CONTRIBUTION
1.6. DELIVERABLES
The deliverables of the research project are the following:
• A traffic classification approach that relies on neither port nor payload, combined
with flow analysis.
Chapter 4 elaborates the framework for P2P traffic classification which can separate P2P
traffic from Non-P2P traffic.
Chapter 5 provides insights into the used datasets and analysis results of the framework.
Chapter 6 concludes, presents related work and provides directions towards future work.
2
PEER-TO-PEER SYSTEMS
This chapter provides background information regarding P2P systems. P2P systems are
most commonly classified by their degree of centralization and by their network
structure. Each of these classifications is briefly described, along with its advantages
and disadvantages.
2.1. INTRODUCTION
Peer-to-Peer (P2P) is a distributed computer architecture that facilitates the direct exchange
of information and services between individual nodes (called peers) rather than relying on
a centralized server. P2P forms the basis of many distributed computer systems, permitting
each peer node to act as both a client and a server, consuming services from other available
peers while providing its own service to the rest of the network [Bas+13]. Peers within a P2P
network communicate directly with their known neighbors, in order to submit requests
and serve responses. The definition of what specifically constitutes a P2P system varies
within the literature. In theory, a P2P system is generally envisioned as having no
centralized authority, whereas in reality many existing P2P applications rely on one or
more. Some versions of the BitTorrent protocol (a P2P protocol), for example, required
some kind of index, also known as a “tracker”, which is able to link the peers together
and to manage the swarm (a swarm is a collection of peers that are interested in
distributing the same content). The following definition by [Bas+13] is found to be
well-suited for classifying P2P systems and is used within this research:
“Peer-to-peer systems are distributed systems consisting of interconnected nodes
able to self-organize into network topologies with the purpose of sharing re-
sources such as content, CPU cycles, storage, and bandwidth, capable of adapt-
ing to failures and accommodating transient populations of nodes while main-
taining acceptable connectivity and performance, without requiring the inter-
mediation or support of a global centralized server or authority.”
The characteristics named in this definition are elaborated below, following the properties
of a P2P system described by Rodrigues and Druschel [RD10].
self-organize into network topologies: Once a node is introduced into the system (typi-
cally by providing it with the IP address of a participating node and any necessary
key material), little or no manual configuration is needed to maintain the system.
sharing of resources: Popular P2P systems have an abundance of resources that few or-
ganizations would be able to afford individually. The resources tend to be diverse in
terms of their hardware and software architecture, network attachment, power sup-
ply, geographic location and jurisdiction.
capable of adapting to failures: P2P systems tend to be resilient to failures because there
are few if any nodes that are critical to the system’s operation. To attack or shut down
a P2P system, an attacker must target a large proportion of the nodes simultaneously.
accommodating transient populations: The participating nodes are not owned and con-
trolled by a single organization. In general, each node is owned and operated by an
independent individual who voluntarily joins the system.
without requiring the intermediation or support of a global centralized server: The peers
implement both client and server functionality and most of the system’s state and
tasks are dynamically allocated among the peers. There are few if any dedicated
nodes with centralized state. As a result, the bulk of the computation, bandwidth,
and storage needed to operate the system are contributed by participating nodes.
P2P offers many advantages. These include scalability, high resource availability, no
need for a centralized authority (eliminating a single point of failure) and robustness [Bas+13].
With a P2P architecture, however, the resources or services available are completely
dependent on the participating nodes of the P2P system. Quality and usefulness are determined
by the nodes themselves. The power of P2P is obvious when considering Metcalfe’s Law,
which states that the value of a network is proportional to the square of the number of
connected users [Bas+13]. The number of possible connections within a P2P network grows
quadratically with the number of network nodes, n. All nodes can potentially connect to all
other nodes, giving a theoretical maximum number of connections of n(n − 1)/2, the same
number as in a fully connected mesh network [Bas+13].
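As a quick check of the n(n − 1)/2 figure, the following minimal Python sketch counts the
links of a fully connected mesh both by explicit enumeration and with the closed form. The
function names are illustrative and are not taken from [Bas+13].

```python
from itertools import combinations


def max_connections(n: int) -> int:
    """Closed-form maximum number of undirected links among n peers."""
    return n * (n - 1) // 2


def count_by_enumeration(n: int) -> int:
    """Count every unordered pair of distinct peers explicitly."""
    return sum(1 for _ in combinations(range(n), 2))


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, max_connections(n), count_by_enumeration(n))
```

Doubling the number of peers roughly quadruples the number of possible links, which is the
growth pattern behind the Metcalfe's Law argument.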
P2P applications were primarily designed and used for large-scale file sharing, giving
participating users easy search facilities and the possibility to obtain or contribute content.
This differs from the well-known client-server model in that the files are provisioned
in a distributed way and are replicated within the network when necessary. Since
hosts participating in P2P networks also devote some computing resources, such systems
scale with the number of hosts in terms of hardware, bandwidth and disk space.
Besides sharing data files, another interesting area for P2P applications is the sharing
of computing resources. An example is grid computing: using the computing resources
of a distributed P2P system can become a common way to solve large problems such as
brute-forcing a strongly encrypted and encoded message. It should be mentioned that
grid computing existed prior to P2P systems, but introducing P2P to grid computing allows
additional flexibility and gives better scaling properties compared with older grid
computing techniques.
File-sharing systems are the more popular kind among both benign and malicious P2P systems.
The most common use of the P2P principle is the sharing of multimedia files such as photos,
movies and music, as well as applications, often including illegal content. Malicious use
of P2P file sharing includes the distribution of worms, rootkits, viruses, bot agents and
others. P2P technology is also used intensively for providing communication services, such
as instant messengers (“presence”), chat, and Internet and video conferencing. Popular
applications are, for instance, Skype, WhatsApp, Lync, Google Talk, and many others.
This research is primarily concerned with P2P systems that have a file-sharing component
and excludes grid computing traffic, because grid computers normally do not have to
hide. These systems use well-known port numbers and do not disguise traffic by using
masqueraded or dynamic port numbers. Currently, grid computing can be detected in the same
way as conventional traffic, by inspecting port numbers.
The main characteristic of a P2P system is that it is not built around the server and client
concept but on the cooperation of equal peers. This concept allows individual peers to join
or leave the network resulting in the adaptive nature of P2P systems.
2.2. ARCHITECTURES
P2P systems are categorized with respect to their degree of centralization and network
structure.
• Centralized P2P
• Decentralized P2P
CENTRALIZED
Within a centralized P2P architecture, a number of centralized index servers maintain a
database of the services on offer on the network. Clients can log on to these index
servers at any time. The list of services is updated the moment a node joins or leaves the
network, similar to a registration or deregistration process. The operation of this
architecture is illustrated in Figure 2.1.
To obtain a resource, a peer first submits a query for the resource (e.g. a file) to the
centralized server(s). After receiving this request, the server(s) consult their service
lookup list and respond with a message listing the peers that can serve the file. The peer
then contacts a serving peer directly to download the file. With this centralized structure,
which is generally regarded as simple, queries can be processed quickly, giving relatively
good performance. The drawback of a centralized approach is analogous to that of the earlier
mentioned central C&C server: once the main servers are identified, they can be shut down
quickly.
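The register, deregister and query steps of a centralized index can be sketched as follows.
This is only an illustration under simple assumptions: the class `IndexServer`, its method
names and the peer labels are hypothetical and do not correspond to any particular P2P
application.

```python
from collections import defaultdict


class IndexServer:
    """Hypothetical central index: maps a resource name to the peers offering it."""

    def __init__(self) -> None:
        self._index = defaultdict(set)

    def register(self, peer: str, resources: list) -> None:
        """Called when a peer joins: record everything it offers."""
        for name in resources:
            self._index[name].add(peer)

    def deregister(self, peer: str) -> None:
        """Called when a peer leaves: drop it from every entry."""
        for peers in self._index.values():
            peers.discard(peer)

    def query(self, name: str) -> list:
        """Return the peers that can serve the resource (the query/response step)."""
        return sorted(self._index.get(name, set()))


if __name__ == "__main__":
    server = IndexServer()
    server.register("peer-a", ["song.mp3"])
    server.register("peer-b", ["song.mp3", "doc.pdf"])
    print(server.query("song.mp3"))   # both peers are listed
    server.deregister("peer-a")
    print(server.query("song.mp3"))   # only peer-b remains
```

The download itself would then take place directly between the two peers; the index server
only answers the lookup, which is also why it forms a single point of failure.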
DECENTRALIZED
A fully decentralized and distributed architecture is illustrated in Figure 2.2. Within fully
decentralized architectures, all nodes are of equal importance, regardless of their
capacities, the resources or services they offer, or their geographic location. Without
requiring the intermediation or support of a global centralized server or authority, all
nodes perform the same tasks, acting as both server and client. Hosts participating in such
networks are called servents.
[Figure 2.1: centralized P2P architecture with peers (P) and an index server (S); Q = Query, R = Response, D = Download]
A fully decentralized architecture is not so popular nowadays because it is generally
quite inefficient. Querying for resources works like a “broadcast” system. A node sends a
query message to all its connected neighbors. As the request and response messages are
relayed from a node to its neighbors, this way of resource discovery generates large traffic
volumes. Messages may also have to cross a large number of hops before they reach a node
which can provide the requested resource, increasing the total response time along the way.
Within fully decentralized networks, it is also a challenge to provide Quality of Service (QoS)
of any kind.
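The traffic cost of this broadcast-style discovery can be illustrated with a small
simulation. The sketch below floods a query through an overlay graph and counts the
forwarded messages. The hop limit (`ttl`) and the rule that a node forwards a query only
the first time it sees it are assumptions added to keep the example finite; they are not
taken from the text.

```python
from collections import deque


def flood_query(neighbors: dict, start: str, ttl: int):
    """Flood a query from `start`; return (nodes reached, messages sent).

    Each node that sees the query for the first time forwards it to all of its
    neighbors except the one it came from, until the hop budget is used up.
    """
    reached = {start}
    messages = 0
    queue = deque([(start, None, ttl)])
    while queue:
        node, sender, hops_left = queue.popleft()
        if hops_left == 0:
            continue
        for peer in neighbors[node]:
            if peer == sender:
                continue
            messages += 1  # every forwarded copy costs one message
            if peer not in reached:
                reached.add(peer)
                queue.append((peer, node, hops_left - 1))
    return reached, messages


if __name__ == "__main__":
    ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
    print(flood_query(ring, "a", ttl=2))
```

Even in this four-node ring, some nodes receive the same query more than once, which is the
redundancy that makes flooding expensive in larger overlays.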
VARIANTS
Besides the two main P2P models regarding the degree of centralization, two other variants
are worth mentioning, namely:
[Figure 2.2: fully decentralized P2P architecture with peers (P); Q = Query, R = Response, D = Download]
Performance is better than the purely decentralized model and may be less than the
hybrid decentralized model, but this model has more flexibility regarding fault-resiliency
than the hybrid decentralized one.
An overview of the different degrees of centralization of P2P systems is given in
Figure 2.3.
• Unstructured
• Structured
• Loosely structured
UNSTRUCTURED
In unstructured networks, there is no relation whatsoever between the placement of data and
the overlay topology. Because the location of a given resource is unknown, searches are
conducted at random, by asking a number of servents whether they have any resource that
matches the request. These servents may ask their own neighbors about the resource, which
can lead to a request that reaches the entire P2P system (at some point every node is sent
the request query). Although there are different possibilities for the construction of the
overlay network and for the query mechanism, unstructured networks generally result in poor
lookup performance, scalability problems and inefficient network usage. However, this scheme
is the most widely used, since it easily accommodates a transient population and is well
adapted to file sharing. Users of such systems want some specific files and do not want to
store other files for the sake of system efficiency; they do not want to be concerned with
such issues as lookup performance (even if they prefer it fast) or redundancy (even if they
want good availability). To solve performance and scalability issues in unstructured
networks, the partially centralized or hybrid decentralized model can be used. Searches are
still conducted at random, but only at the supernode/server level. End users only send
queries to their local supernode/server. This two-level structure improves performance and
scalability, making unstructured networks viable.
[Figure 2.3: degrees of centralization of P2P systems; P = peer, S = server/supernode]
STRUCTURED
In structured systems, the overlay topology and the content held by hosts are closely
related. Resources (or indexes to resources) are stored at specific locations in the P2P
system and a mechanism is provided to map a file identifier to its location (or the location
of its pointer). Using a distributed routing table, generally a Distributed Hash Table (DHT),
queries can be forwarded to an appropriate host much more efficiently than in the
unstructured case. The disadvantages of structured networks are the difficulty of maintaining
the routing table under frequent arrivals and departures of peers, and of mapping a keyword
query to a unique file identifier.
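The mapping from a file identifier to a responsible host can be illustrated with a toy
lookup. The sketch below assumes a ring-style scheme in which nodes and keys are hashed into
the same identifier space and a key belongs to the first node at or after its position
(wrapping around). That scheme, the 16-bit identifier space and all names are assumptions
made for illustration; real DHTs also distribute the routing table itself, which is omitted
here.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # small identifier space, chosen only for readability


def ident(name: str) -> int:
    """Hash a node address or file identifier into the shared ID space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << ID_BITS)


class ToyDHT:
    """Hypothetical lookup table with global knowledge of all nodes."""

    def __init__(self, nodes: list) -> None:
        self._ring = sorted((ident(n), n) for n in nodes)
        self._ids = [node_id for node_id, _ in self._ring]

    def responsible_node(self, file_id: str) -> str:
        """Return the first node at or after the key's position on the ring."""
        pos = bisect_left(self._ids, ident(file_id)) % len(self._ring)
        return self._ring[pos][1]


if __name__ == "__main__":
    dht = ToyDHT(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print(dht.responsible_node("movie.avi"))
```

Because the lookup depends only on the hash of the exact identifier, a keyword search such
as "movie" cannot be resolved this way, which mirrors the keyword-mapping drawback noted
above.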
LOOSELY STRUCTURED
Loosely structured networks are hybrid solutions between structured and unstructured
networks. In such systems, a mapping exists between resource and topology, but it is not
completely specified and may result in search failure (the search is then conducted as if
the network were unstructured). Because they are rarely implemented outside the academic
world, loosely structured networks are not elaborated further.
3
TRAFFIC CLASSIFICATION
This chapter covers the basics of traffic classification, which are important for
understanding the methodologies proposed in this thesis.
A section is included that describes the most common metrics for evaluating the
performance of a classification mechanism.
3.1. INTRODUCTION
Network traffic consists of IP packets, which are examined for further analysis. In most
network traffic analysis, packet flows are uniquely identified by the 5-tuple: source IP
address, source port, destination IP address, destination port, and transport layer protocol.
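To make the 5-tuple concrete, the following sketch groups packets into flows by that key.
The `FlowKey` type and the `(key, size)` input format are assumptions chosen for the
example, not a format used by the thesis tooling.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def group_by_flow(packets):
    """Group packet sizes by 5-tuple; `packets` yields (FlowKey, size) pairs."""
    flows = defaultdict(list)
    for key, size in packets:
        flows[key].append(size)
    return dict(flows)


if __name__ == "__main__":
    packets = [
        (FlowKey("10.0.0.1", 50000, "10.0.0.2", 80, "TCP"), 120),
        (FlowKey("10.0.0.1", 50000, "10.0.0.2", 80, "TCP"), 300),
        # Same two hosts, different source port: counted as a separate flow.
        (FlowKey("10.0.0.1", 50001, "10.0.0.2", 80, "TCP"), 90),
    ]
    for key, sizes in group_by_flow(packets).items():
        print(key, sizes)
```

Note that a change in either port number yields a new flow, even when the same two hosts
are talking.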
The traffic classification problem can be formalized as follows: A pattern $p$ represents
the object under analysis. Each pattern is described by a set of $n$ features that have been
derived from the analyzed traffic [Kor12]. Thus, it can be interpreted by the $n$-dimensional
random variable $X$ that corresponds to an accurate set of features:
$p \to x = (x_1, x_2, x_3, \ldots, x_n)$ [Kor12].
In the application classification problem, where $p$ could be represented by flows, it is
attempted to assign each of these flows to one of the $c$ given application classes, defined
by a random variable $Y$ with values $y \in \{y_1, y_2, \ldots, y_c, y_{c+1}\}$.
$Y = y_{c+1}$ means that the analyzed flow is not recognized as any of the given classes,
i.e., it is unknown [Kor12].
In the malware detection problem, $p$ could be represented by the aggregated traffic
directed to the specific IP destination address [Kor12]. Thus, malware detection refers to a
binary classification problem. The attempt is to verify if the traffic to analyze corresponds
to malicious behavior. Random variable $Y$ takes values in the set $\{y_0, y_1\}$, where
$Y = y_0$ means that the traffic conforms to legitimate behavior, whereas $Y = y_1$
indicates malicious activity [Kor12].
In this thesis, the traffic classification problem corresponds to defining pattern $p_0$ for
P2P and $p_1$ for non-P2P.
3.2. APPROACHES
As communication protocols evolve, the appropriate approach for traffic classification
changes [Zha+09]. The variety of new Internet applications, including services such as
streaming, online gaming, P2P file sharing, and video/voice conferencing, has intensified
research efforts to discriminate between such applications. These efforts, in turn, have
inspired sophisticated obfuscation mechanisms. Figure 3.1 gives a first view of trends in
application development over time with respect to the four main classification approaches
[Kor12].
[Figure 3.1: application development (TCP/UDP, cleartext/encrypted or tunneled transmission, open/proprietary protocol) set against the classification approaches (port-based, payload-based, host behaviour based, flow features based)]
In the rest of this chapter, the four traffic classification approaches for protocol
detection are described. The two content-based approaches and the host behaviour approach
are described only briefly, as they are not used within this research and their basics
were touched upon in Chapter 1 (see Figure 1.2 for an overview).
3.2.1. PORT
More than a decade ago, network traffic was accurately classified using UDP and TCP port
numbers [Aut11; Zha+09].
The classification of network traffic based on the TCP [Pos03] or UDP [Pos80] port num-
bers is a simple approach built upon the assumption that each application protocol always
uses the same specific transport-layer port [Gom+13].
Nowadays, new Internet applications are moving towards the use of dynamic ports to
evade detection (Figure 3.1). An example of such an application is Skype, which puts
considerable effort into establishing connections among its peers: it tries to bypass
restrictive firewalls by randomly selecting ports, and even tries to use port 80 or 443
when connections on dynamic ports do not succeed.
As a result, simple inspection of port numbers is no longer a reliable classification
mechanism and is considered obsolete [Kar+04; MP05; MW06].
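The weakness of port-based classification is easy to demonstrate. The sketch below uses a
small lookup table of well-known ports (80/TCP for HTTP, 443/TCP for HTTPS, 53/UDP for DNS,
25/TCP for SMTP); the function name and the example port 51413 are illustrative
assumptions.

```python
WELL_KNOWN_PORTS = {
    (80, "TCP"): "HTTP",
    (443, "TCP"): "HTTPS",
    (53, "UDP"): "DNS",
    (25, "TCP"): "SMTP",
}


def classify_by_port(dst_port: int, protocol: str) -> str:
    """Label traffic purely by destination port and transport protocol."""
    return WELL_KNOWN_PORTS.get((dst_port, protocol), "unknown")


if __name__ == "__main__":
    # A P2P client on a dynamic port is simply not recognized ...
    print(classify_by_port(51413, "TCP"))
    # ... and the same client masquerading over port 443 is mislabelled.
    print(classify_by_port(443, "TCP"))
```

Both failure modes occur without the classifier having any way to notice them, since the
port number is the only evidence it considers.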
3.2.2. PAYLOAD
The second content-based approach involves inspecting the packet payload; for years,
it was considered the most accurate method. Deep Packet Inspection (DPI) extends
the examination beyond the packet header, to which port-based methods are limited:
DPI inspects the complete packet payload. As soon as a unique payload-based signature is
identified, this technique can produce reliable classification results [MP05; SSW04].
Payload-based classifiers were therefore often used to establish ground truth
for other methods. DPI methods rely on a database of previously known signatures that are
associated with application protocols, and search each packet for strings that match any of
the signatures [Gom+13]. This approach is used not only in the classification of network
traffic, but also in the identification of threats, malicious data, and other anomalies.
Because of their effectiveness, classification systems based on DPI are especially
significant for accounting solutions, charging mechanisms, and other purposes for which
accuracy is crucial.
However, deeply inspecting each packet can be a demanding task in terms of computation
power and may be infeasible in high-speed networks. Therefore, some mechanisms
search only a part of each packet or only a few packets of each flow, as a compromise
between efficiency and accuracy. Besides the performance issues, the inspection of the
contents of packets may also raise legal issues related to privacy protection [SOG07].
Nevertheless, the main drawback of DPI techniques is that they cannot be used when the
traffic is encrypted [Gom+13]. Since, in these cases, the contents of the packets are
inaccessible, DPI-based mechanisms are restricted to specific packets of the connection
(e.g., when the session is established) or to the cases when UDP and TCP connections are
used concurrently and only the TCP sessions are encrypted [Gom+13]. DPI methods are
also sensitive to modifications in the protocol or to evolution of the application version:
any change to the signatures known by the classifier will most likely prevent it from
identifying the application [Gom+13]. Moreover, DPI methods that rely on signatures for
specific applications can only identify traffic generated by those applications [Gom+13].
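A minimal signature matcher shows both the strength and the main limitation of DPI. The
two signatures below are well-known plain-text markers (the start of a BitTorrent handshake
and of an HTTP GET request); the tiny signature table, the function name and the XOR
scrambling, which merely stands in for real encryption, are assumptions for illustration.

```python
SIGNATURES = {
    "BitTorrent": b"\x13BitTorrent protocol",  # handshake prefix
    "HTTP": b"GET ",                            # start of a GET request
}


def classify_payload(payload: bytes) -> str:
    """Return the first application whose signature occurs in the payload."""
    for app, signature in SIGNATURES.items():
        if signature in payload:
            return app
    return "unknown"


if __name__ == "__main__":
    handshake = b"\x13BitTorrent protocol" + bytes(8)
    print(classify_payload(handshake))

    # XOR with a fixed byte is NOT encryption; it only imitates the effect of
    # an unreadable payload on signature matching.
    scrambled = bytes(b ^ 0x5A for b in handshake)
    print(classify_payload(scrambled))
```

Once the payload bytes no longer contain the expected string, the matcher has nothing to
work with, regardless of how large the signature database is.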
functional level: Checking whether a host acts as a client or server (or both) for serving
resources.
application level: Recording the transport layer ports to identify the origin of the
application.
3.3. M ACHINE L EARNING 25
Although promising, the host behavior classification in [KPF05] was still found to be limit-
ing for the research, because of:
• The reliance on port statistics (not port numbers though). Applications hiding be-
hind port masquerading schemes will slip through.
• The assumption at the functional level that hosts that use a single port for the ma-
jority of their interactions with other hosts are likely to be providers of the service
offered on that port. As with the previous item, port masquerading could mislead the
classifier.
3.3.1. ALGORITHMS
The choice of Machine Learning Algorithm (MLA) is a critical step in building a
statistics-based classifier. Narang et al. [NRH13] found that the three most relevant
algorithms for P2P detection are J48, Naïve Bayes and REPTree. Early experiments with Naïve
Bayes showed an overall performance of less than 60% correct predictions, while the other two
scored significantly higher. Gomes et al. [Gom+13] also state that, within the P2P traffic
classification domain, the most commonly used supervised learning approaches are tree
structures like J48 and REPTree. These findings led to the exclusion of Naïve Bayes from
further experiments. In this research, two supervised learning approaches are therefore
used: J48 and REPTree.
DPI based
  Properties: relies on payload data.
  Advantages: high classification performance.
  Disadvantages: may not work for encrypted traffic; requires high processing resources;
  can only be used for known applications.
Host behavior
  Properties: uses only the packet header and searches for previously found host behavior
  patterns.
  Advantages: usually lighter than DPI; applicable for encrypted traffic.
  Disadvantages: usually has lower classification performance when compared to DPI.
Flow feature
  Properties: uses only the packet header and flow-level information.
  Advantages: applicable for encrypted traffic; can identify unknown applications from
  target classes.
  Disadvantages: requires machine learning theory, which could increase complexity.
Both J48 and REPTree fall under Decision Tree (DT) classifiers. DT classifiers create a
tree whereby each node is composed of a decision that can split the data into smaller sets
using the labels from the supplied training set. Each node on the tree can be visualized as
an if-then-else decision. The construction of a decision tree can be expressed recursively.
First, select an attribute to place at the root node, and make one branch for each possi-
ble value [HWF11]. This splits up the example set into subsets, one for every value of the
attribute [HWF11]. Now the process can be repeated recursively for each branch, using
only those instances that actually reach the branch [HWF11]. If at any time all instances
at a node have the same classification, stop developing that part of the tree [HWF11]. The
only thing left is how to determine which attribute to split on, given a set of examples with
different classes.
For the purpose of this research, Weka¹ was used as the tool for handling machine learning
algorithms. This tool provides several different machine learning algorithms and runs on
multiple platforms, including Mac OS, Windows and Linux variants.
Weka supports the J48 and Reduced Error Pruning Tree (REPTree) DTs, which are used
for the classification tasks in the supervised learning setting.
C4.5
The J48 DT algorithm in Weka is an implementation of the C4.5 algorithm. This algorithm
creates trees in such a way that, at each node of the tree, C4.5 chooses the attribute of the
data that most effectively splits its set of samples into subsets enriched in one class or the
other. The splitting criterion is the normalized information gain (difference in entropy).
The attribute with the highest normalized information gain is chosen to make the decision.
The C4.5 algorithm then recurses on the smaller subsets.
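If the target attribute can take on $c$ different values, then the entropy of a sample set
$S$ relative to this $c$-wise classification is defined as [Mit97]:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} f_i \log_2 f_i$$
where $f_i$ is the proportion of samples in $S$ that belong to class $i$.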
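The entropy formula and its use as a splitting criterion can be tried out with a few lines
of Python. Note the simplification: the sketch computes the plain information gain (entropy
before the split minus the weighted entropy after it), not the normalized variant that C4.5
uses. The function names and toy labels are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, subsets):
    """Entropy of `labels` minus the size-weighted entropy of its `subsets`."""
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["P2P", "P2P", "non-P2P", "non-P2P"]
    print(entropy(labels))
    # A split that separates the classes perfectly removes all uncertainty.
    print(information_gain(labels, [["P2P", "P2P"], ["non-P2P", "non-P2P"]]))
    # A split that leaves each subset evenly mixed gains nothing.
    print(information_gain(labels, [["P2P", "non-P2P"], ["P2P", "non-P2P"]]))
```

An attribute whose split yields the larger gain is the better candidate for the node, which
is the selection rule the recursive tree construction relies on.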
J48
This algorithm uses a tree structure and is divided into several phases [PN11]. During the
training process, every leaf can estimate its error ratio, i.e. the number of wrongly
classified instances divided by the total number of instances assigned to that leaf from the
supervised training data sets [PN11]. The upper node can also calculate the weighted sum of
the error estimates of all its leaves [PN11]. If the weighted sum at the upper node is less
than the combined error ratio of its leaves, all leaves under the node are pruned [PN11].
REPTREE
The REPTree [PTK06] uses a fast pruning algorithm to increase the accurate detection rate
with respect to noisy training data [PN11]. Pruning is used to find the best sub-tree of the
initially grown tree with the minimum error for the test set [PN11]. However, the number of
sub-trees grows exponentially with the size of the initial tree. Thus it is computationally im-
practical to search all sub-trees. REPTree yields a suboptimal tree under the restriction that
a sub-tree can only be pruned if it does not contain a sub-tree with a lower classification
error than itself.
BOOSTING
Boosting, and AdaBoost in particular, is designed specifically for classification [HWF11].
It can be applied to any classification learning algorithm. To simplify matters, the
assumption is that the learning algorithm can handle weighted instances, where the weight of
an instance is a positive number [HWF11]. The presence of instance weights changes the way in
which a classifier’s error is calculated: it is the sum of the weights of the misclassified
instances divided by the total weight of all instances, instead of the fraction of instances
that are misclassified [HWF11]. By weighting instances, the learning algorithm can be forced
to concentrate on a particular set of instances, namely those with high weight [HWF11].
Such instances become particularly important because there is a greater incentive to
classify them correctly [HWF11]. The J48 and REPTree algorithms are examples of learning
methods that can accommodate weighted instances [HWF11].
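The difference between the plain and the weighted error can be seen in a short worked
example. The function below implements only the weighted error definition given above; it
is not an implementation of AdaBoost, and the labels and weights are made up for
illustration.

```python
def weighted_error(actual, predicted, weights):
    """Sum of the weights of misclassified instances over the total weight."""
    wrong = sum(w for a, p, w in zip(actual, predicted, weights) if a != p)
    return wrong / sum(weights)


if __name__ == "__main__":
    actual = ["P2P", "P2P", "non-P2P", "non-P2P"]
    predicted = ["P2P", "non-P2P", "non-P2P", "non-P2P"]

    # With equal weights this is the ordinary misclassification fraction.
    print(weighted_error(actual, predicted, [1, 1, 1, 1]))
    # Giving the misclassified instance more weight raises the error.
    print(weighted_error(actual, predicted, [1, 3, 1, 1]))
```

A learner that minimizes the second quantity has a stronger incentive to get the heavily
weighted instance right.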
Cross Validation: k-fold cross validation avoids the need for a separate validation set by
dividing the training set into k parts or folds [Luz14]. In each round, k − 1 folds are used
to train the classifier and the remaining one is used as the validation set [Luz14]. This
process is repeated for each of the k folds [Luz14]. Besides these k folds, Weka uses the
full training set as the last step, and the result is the average of the values from all
these calculations. A common value for k is 10, which is the value used in the experiments
for this research as well.
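The fold bookkeeping can be sketched as follows. The generator only produces index splits;
it does not shuffle or stratify the data and makes no claim about how Weka partitions
instances internally. The function name is illustrative.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


if __name__ == "__main__":
    for train, validation in k_fold_indices(25, 10):
        print(len(train), validation)
```

Every sample is used for validation exactly once and for training in the other k − 1
rounds.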
3.4. P ERFORMANCE C RITERIA 29
• 80 flows to be P2P, actually 67 of these are P2P, representing the True Positives, and 13
are non-P2P, which stand for the number of False Positives.
• 20 flows to be non-P2P. Of these, 12 are indeed non-P2P, but 8 flows are P2P.
The focus is on the P2P flows. True Positive Rate (TPR, or Recall) is the number of flows
correctly categorized as P2P divided by the total number of flows that are actually P2P.
Thus, in this scenario, $\mathrm{TPR} = \mathrm{Recall} = \frac{67}{67+8} = 0.893$. The False
Positive Rate (FPR) is the number of flows falsely classified as P2P divided by the total
number of non-P2P flows: $\mathrm{FPR} = \frac{13}{13+12} = 0.52$. A complementary measure to
Recall is Precision [HWF11; Luz14], which is defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3.3}$$
Precision is the ratio of correctly classified P2P flows to all flows classified as the P2P
type; thus, $\mathrm{Precision} = \frac{67}{67+13} = 0.838$.
The overall accuracy [HWF11] is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.4}$$
Accuracy is the ratio of the sum of the TPs and TNs to the sum of the TPs, TNs, FPs and FNs.
Complementary to the FPR is the True Negative Rate (TNR) or specificity [HWF11].
Specificity measures the proportion of negatives which are correctly identified as such; in
the above scenario, this is the ratio of non-P2P flows which are correctly identified as not
being P2P:
$$\mathrm{TNR} = \mathrm{Specificity} = \frac{TN}{FP + TN} \tag{3.5}$$
To assess the performance of the proposed classification methods, the True Positive and
False Positive Rates are used as classification metrics, as well as Precision and F-Measure
[HWF11; Luz14]. F-Measure combines Precision and Recall, and is defined as:
$$\text{F-Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3.6}$$
The F-measure is included because it considers both the precision and the recall to compute
a score. This score is the harmonic mean of the precision and recall; the F-measure reaches
its best value at 1 and its worst at 0.
The Matthews Correlation Coefficient (MCC) [HWF11; Luz14] provides a way to measure
the quality of the classifier using the predicted values and the real values of each sample.
It takes into consideration both false positives and false negatives, which makes it
suitable for tests even when the categories are not balanced with respect to the number of
samples. The MCC ranges between -1 and 1, where 1 represents a perfect prediction, -1 implies
that all predictions were wrong and 0 suggests that the classifier is as good as a random
prediction. The MCC is defined as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3.7}$$
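All of the metrics in this section follow from the four confusion-matrix counts. The sketch
below evaluates them for the scenario used in the text (TP = 67, FP = 13, TN = 12, FN = 8,
so there are 75 P2P flows and 25 non-P2P flows). The function name and the dictionary keys
are illustrative.

```python
from math import sqrt


def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "Precision": precision,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TNR": tn / (fp + tn),
        "F-Measure": 2 * precision * recall / (precision + recall),
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


if __name__ == "__main__":
    for name, value in metrics(tp=67, fp=13, tn=12, fn=8).items():
        print(f"{name}: {value:.3f}")
```

In this scenario the accuracy and F-Measure look respectable while the MCC stays well below
1, which illustrates why a measure that accounts for class imbalance is reported alongside
the others.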
4
PROPOSED FRAMEWORK
This chapter elaborates the proposed framework for P2P traffic classification, which can
separate P2P traffic from non-P2P traffic. The framework is composed of a filter module, a
dialogue generation module, an aggregation module and a classifier module. The framework
is inspired by the work of Narang et al. [NRH13] on feature selection for P2P botnet
traffic, Rahbarinia et al. [Rah+14] on P2P traffic categorization, and Narang et al.
[Nar+14] on P2P botnet detection.
The framework does not rely on Deep Packet Inspection (DPI) or signature-based mechanisms
(which are considered ineffective when encryption is applied). Instead, it focuses
on observing the different dialogues which take place amongst the nodes, essentially the
idea of who is talking to whom.
The host pairs are extracted from IP headers, together with a set of features which are
found relevant for distinguishing P2P traffic from non-P2P traffic, such as flow duration,
volume size, minimum packet size, maximum packet size, etc. (see Chapter 1).
The framework uses supervised machine learning algorithms and network traces of P2P
applications and botnets to build models which can correctly categorize different P2P
applications.
The framework is illustrated in Figure 4.1.
4.1. BACKGROUND
As explained in previous chapters, this work does not use the standard 5-tuple approach, in
which source address, source port, destination address, destination port and protocol are
used. It is instead based on the idea that a classifier is needed which does not rely on
port numbers (as more applications are using dynamic port numbers or masquerade over
well-known ports). With numerous applications also encrypting their payload, the classifier
should not, and does not, rely on payload data either.
This leads to adopting an approach which looks at the node endpoints only: the 2-tuple
approach, consisting of the two nodes participating in a dialogue.
In short, the proposed framework is:
• A classifier which is protocol agnostic, port agnostic and payload agnostic. The only
reliance is on information from the IP header.
[Figure 4.1: the proposed framework; recoverable labels: Traffic, Filter, Dialogues, Aggregate, Classify, Classified results (P2P, Non-P2P)]
4.2.1. THE DIALOGUE
Nodes connected to their neighbors in the P2P overlay network maintain regular commu-
nication amongst themselves to check for updates, to exchange commands and/or to check
if the peer is alive or not. Since certain benign P2P applications (and certainly botnets) are
known to use dynamic port numbers, the regular 5-tuple flow definition will not be able
to give a clear picture of the activity a host is engaged in [NRH13]. This traditional 5-tuple
definition will create multiple flows out of what is actually a single conversation happen-
ing between two nodes (although happening on different ports), and thus give a false view
of the communications happening in the network [NRH13]. A helicopter view of the dia-
logues between the P2P nodes is the approach used here to distinguish P2P from Non-P2P.
and thus exhibit long dialogue durations, a benign P2P node’s conversation with another
is not expected to be long [Rah+14; NRH13]. In summary, four features are extracted from
the datasets and these are used to differentiate P2P traffic from Non-P2P traffic. The four
features used in this work are [Nar+14; NRH13]:
2. The duration.
4.2.3. DETAILS
Filter: This module reads raw packet data from offline network trace files. It
reads each packet and discards packets without a valid IPv4 header. From the remaining set,
only those packets which contain payload and have a valid TCP or UDP header are kept. This
discards packets with zero payload and packets of protocols other than TCP and UDP (ICMP,
ARP, etc.). Source IP, destination IP, payload length and timestamp are extracted and then
stored for future use.
The algorithm of this module is given in Algorithm 1.
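The filtering rules can be sketched in a few lines of Python. This is not the thesis's
Algorithm 1: it assumes that packets have already been parsed into a hypothetical `Packet`
record, so reading trace files and validating headers are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, NamedTuple


@dataclass
class Packet:
    """Hypothetical pre-parsed packet record (not a capture-library type)."""
    ip_version: int
    protocol: str       # e.g. "TCP", "UDP", "ICMP"
    src_ip: str
    dst_ip: str
    payload_len: int
    timestamp: float


class Record(NamedTuple):
    src_ip: str
    dst_ip: str
    payload_len: int
    timestamp: float


def filter_packets(packets: Iterable[Packet]) -> Iterator[Record]:
    """Keep IPv4 TCP/UDP packets that carry payload; emit the four stored fields."""
    for pkt in packets:
        if pkt.ip_version != 4:
            continue                      # no valid IPv4 header
        if pkt.protocol not in ("TCP", "UDP"):
            continue                      # ICMP, ARP, etc.
        if pkt.payload_len <= 0:
            continue                      # zero-payload packets
        yield Record(pkt.src_ip, pkt.dst_ip, pkt.payload_len, pkt.timestamp)


if __name__ == "__main__":
    trace = [
        Packet(4, "TCP", "10.0.0.1", "10.0.0.2", 512, 0.0),
        Packet(4, "TCP", "10.0.0.2", "10.0.0.1", 0, 0.1),    # bare ACK
        Packet(4, "ICMP", "10.0.0.1", "10.0.0.3", 64, 0.2),
        Packet(6, "UDP", "fe80::1", "fe80::2", 100, 0.3),
    ]
    print(list(filter_packets(trace)))
```

Port numbers are deliberately absent from the output record, in line with the port-agnostic
goal of the framework.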
Dialogue generation Module: The output of the Filter module is sent to the this module
as input. This module creates a list of the dialogues by aggregating the retrieved packets
(send by the filter). Each dialogue is identified by the combination of <IP1,IP2> and an
initial INTERVAL value of 2 seconds. The dialogues are created with the idea that a uni-
directional flow can be converted to a bi-directional flow if source (A) and destination (B)
ip pairs match and they contacted each other within the INTERVAL time sinds the first
34 4. P ROPOSED F RAMEWORK
packet of either A→B or B→A. While iterating through the packets, if a packet is found
that belongs to the IP pair of a dialogue and whose timestamp lies within INTERVAL time
from the beginning or end of that dialogue, the packet is added to the dialogue and the
attributes of the dialogue are updated accordingly [Nar+14].
This is illustrated in Algorithm 2 [Nar+14].
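The dialogue construction can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Algorithm 2 itself: the input is assumed to be a list of (source IP, destination IP, payload length, timestamp) tuples, packets are processed in timestamp order, and the `Dialogue` class and `build_dialogues` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

INTERVAL = 2.0  # seconds; the initial value named in the text


@dataclass
class Dialogue:
    """Hypothetical dialogue record keyed by an unordered IP pair."""
    ip_pair: Tuple[str, ...]
    start: float
    end: float
    n_packets: int = 0
    volume: int = 0
    timestamps: List[float] = field(default_factory=list)

    def add(self, ts: float, payload_len: int) -> None:
        self.start = min(self.start, ts)
        self.end = max(self.end, ts)
        self.n_packets += 1
        self.volume += payload_len
        self.timestamps.append(ts)


def build_dialogues(records, interval: float = INTERVAL) -> List[Dialogue]:
    """Group packets into dialogues per IP pair, ignoring ports and direction."""
    latest: Dict[Tuple[str, ...], Dialogue] = {}
    result: List[Dialogue] = []
    for src, dst, payload_len, ts in sorted(records, key=lambda r: r[3]):
        pair = tuple(sorted((src, dst)))        # A->B and B->A share one key
        d = latest.get(pair)
        near = d is not None and (
            abs(ts - d.start) <= interval or abs(ts - d.end) <= interval
        )
        if not near:                            # open a new dialogue
            d = Dialogue(pair, ts, ts)
            latest[pair] = d
            result.append(d)
        d.add(ts, payload_len)
    return result
```

Because the key is the unordered IP pair, packets exchanged on changing port numbers end up in the same dialogue instead of being split into several 5-tuple flows.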
Aggregation Module: The dialogues created in the previous module are aggregated over
a 1 hour interval. This means that several dialogues between the same IP pair are merged
into a single dialogue. The 1 hour interval can be raised as needed, which provides the
flexibility to look at the data for any desired aggregation period (e.g. 3 hours, a day, etc.).
Such flexibility is especially valuable for bots that are extremely stealthy in their
communication patterns and exchange as few as a handful of packets every few hours.
From the datasets obtained, it is found that the Zeus botnet exhibits this behavior
[Rah+14; Nar+14]. Besides this behavior being dramatically different from that of the
other applications, the number of dialogues found for Zeus within the 1 hour frame was
significantly lower than for the rest. The number of dialogues per interval is illustrated
in Figure 4.2.
Zeus requires special attention and evaluation; the Zeus dataset was therefore excluded
from the experiments. The outcome of the aggregation module is then used to train
the classification model. The attributes of each dialogue that are analyzed are: number
of packets, volume, duration and the median inter-arrival time (IAT) [NRH13; Nar+14].
The algorithm of the aggregation module is given in Algorithm 3 [Nar+14].
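The aggregation step and the four resulting attributes can be illustrated with a short Python sketch. It is not Algorithm 3: the input format (IP pair, packet timestamps, volume in bytes), the fixed one-hour slots counted from time zero, and the definition of duration as last minus first timestamp in the slot are all assumptions made for the example, and `aggregate` is a hypothetical name.

```python
from collections import defaultdict
from statistics import median

WINDOW = 3600.0  # one hour, in seconds; adjustable as described in the text


def aggregate(dialogues, window=WINDOW):
    """Merge dialogues of the same IP pair that start in the same time slot.

    dialogues: iterable of (ip_pair, packet_timestamps, volume_in_bytes).
    Returns one feature row per (ip_pair, slot).
    """
    buckets = defaultdict(lambda: {"timestamps": [], "volume": 0})
    for ip_pair, timestamps, volume in dialogues:
        slot = int(min(timestamps) // window)
        buckets[(ip_pair, slot)]["timestamps"].extend(timestamps)
        buckets[(ip_pair, slot)]["volume"] += volume

    rows = []
    for (ip_pair, slot), b in sorted(buckets.items(), key=lambda kv: kv[0]):
        ts = sorted(b["timestamps"])
        # Inter-arrival times are the gaps between consecutive packets.
        gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
        rows.append({
            "ip_pair": ip_pair,
            "slot": slot,
            "n_packets": len(ts),
            "volume": b["volume"],
            "duration": ts[-1] - ts[0],
            "median_iat": median(gaps) if gaps else 0.0,
        })
    return rows
```

Raising `window` to, say, `3 * 3600.0` gives the 3 hour view mentioned above without changing anything else in the pipeline.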
Classification Module: The Classification module uses Weka's supervised machine learning
algorithms for training the model and classifying the test data. Classification models
were created using two Decision Tree (DT) algorithms, namely J48 and REPTree, both with
and without boosting.
[Figure 4.2: Number of dialogues per P2P application (EMULE1, EMULE2, UTORRENT1, UTORRENT2, WALEDAC, STORM, ZEUS, FROSTWIRE1, FROSTWIRE2, VUZE1, VUZE2, ISCX) for interval durations of 1h, 2h and 3h.]
There are three main datasets for the training of the P2P classifier:
(D1) Ordinary P2P Traffic: The P2P traffic was obtained from Rahbarinia et al. This set
was created between 26 March 2011 and 11 April 2011. The following describes how it was created:
To collect the P2P traffic dataset, an experimental lab network consisting of 11 distinct hosts
was built, which was used to run 5 different popular P2P applications for several weeks.
Specifically, 9 hosts were dedicated to running Skype, and the two remaining hosts to run,
at different times, eMule, µTorrent, Frostwire, and Vuze. This choice of P2P applications
provided diversity in both P2P protocols and networks (see Table 5.1). The 9 hosts ded-
icated to Skype were divided into two groups: two hosts were configured with high-end
hardware, public IP addresses, and no firewall filtering. This was done so that these hosts
had a chance to be elected as Skype super-nodes. The remaining 7 hosts were configured
using filtered IP addresses, and resided in distinct sub-networks. Using both filtered and
unfiltered hosts allowed sample collection of Skype traffic that may be witnessed in different
real-world scenarios. For each host, a separate Skype account was created and some
of these accounts were made “friends” with each other and with Skype instances running
on machines external to the lab. In addition, using AutoIt [Ben09], a number of scripts
were created to simulate user activities on the host, including periodic chat messages and
phone calls to friends located both inside and outside of the lab network. Overall, 83 days
of a variety of Skype traffic were collected, as shown in Table 5.1.
1 http://www.iscx.ca/datasets
The other two distinct unfiltered hosts were used to run each of the remaining legitimate
P2P applications. For instance, initially these two hosts were used to run two instances
of eMule for about 9 consecutive days. During this period, a variety of file searches
and downloads were initiated. Whenever possible, AutoIt [Ben09] was used to automate
user interactions with the client applications. This process was replicated to collect
approximately the same amount of traffic from FrostWire, µTorrent, and Vuze.
(D2) P2P Botnet Traffic: In addition to popular P2P applications, several days of traf-
fic from three different P2P-botnets: Storm, Waledac, and Zeus were obtained. It is worth
noting that the Waledac traces were collected before the botnet takedown enacted by Mi-
crosoft, while the Zeus traces are from a version which is likely still active and relies en-
tirely on P2P-based command-and-control (C&C) communications. Table 5.1 indicates the
number of hosts and days of traffic we obtained, along with information about the under-
lying transport protocol used to carry P2P management traffic.
(D3) Non-P2P Traffic: The Non-P2P Traffic was obtained from Shiravi et al. This dataset
contained only Non-P2P traffic.
Application  Protocol  Hosts  Capture days  Transport
Bots
Storm        -         13     7             UDP
Zeus         -         1      34            UDP
Waledac      -         3      3             TCP
[Figure: Attribute importance on the training dataset. Plotted values: 0,432, 0,422, 0,159 and 0,049.]
From these figures, it is observed that the attributes IAT and nrofBytes are more
important than nrOfPackets and Duration.
DTs are fast and simple to train. However, they tend to create complex tree
structures and over-fit the data. Although this gives high accuracy on the training data and
even on the test data, such detection models may not generalize to a real-world scenario.
Besides using J48 and REPTree on their own, boosting was used with the trees to increase
the accuracy obtained from a single classifier. The AdaBoost meta-classifier of Weka was
used, with 10 trees.
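The idea behind boosting can be illustrated with a small, self-contained Python sketch. It is not Weka's AdaBoost implementation and does not use J48 or REPTree: as a stated simplification, it implements binary AdaBoost with one-level threshold trees (stumps) as weak learners, labels in {+1, -1}, and the hypothetical function names `train_stump`, `adaboost` and `predict`.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump.

    A stump predicts `polarity` when row[feature] <= threshold, else -polarity.
    """
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] <= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] <= thr else -pol


def adaboost(X, y, rounds=10):
    """Train up to `rounds` weighted stumps; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:              # weak learner no better than chance
            break
        err = max(err, 1e-10)       # avoid division by zero for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified instances, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a toy set such as `X = [[1], [2], [3], [4], [5], [6]]` with `y = [1, 1, -1, -1, 1, 1]`, no single stump classifies all six points correctly, while three boosting rounds do; this is the sense in which boosting can raise accuracy above that of a single tree.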
Figure 5.2 gives the results for ten-fold cross validation performed with J48, REPTree,
boosted J48 and boosted REPTree.
[Figure 5.2: Classifier performance (accuracy). Plotted values: 0,974, 0,972, 0,971 and 0,970.]
As Tables 5.2, 5.3, 5.4 and 5.5 show, the classifiers consistently achieve a true positive
rate (TPR) of at least 99,9% in the detection of the two P2P botnets, and of at least 93,5%
in the detection of the four benign P2P applications, with false positive rates (FPR) between
7,8% and 11,1%. As an overall weighted average, the accuracy stands above 96%. The classifier
performance on the full test set is shown in Table 5.6.
Figure 5.3 shows the accuracy of each algorithm per individual P2P application.
Boosted REPTree
TPR FPR Precision F-Measure Accuracy MCC
eMule 0,980 0,098 0,966 0,975 0,962 0,901
uTorrent 0,941 0,094 0,966 0,953 0,932 0,828
FrostWire 0,987 0,085 0,971 0,979 0,968 0,917
Vuze 0,986 0,089 0,969 0,978 0,967 0,913
Waledac 0,999 0,101 0,966 0,982 0,973 0,929
Storm 1,000 0,081 0,972 0,986 0,979 0,945
Weighted avg. 0,982 0,091 0,968 0,976 0,964 0,906
Figure 5.4 shows this classification TPR compared with the work of Rahbarinia et al.
[Rah+14].
Boosted J48
TPR FPR Precision F-Measure Accuracy MCC
eMule 0,986 0,087 0,970 0,978 0,967 0,913
uTorrent 0,940 0,090 0,967 0,954 0,932 0,830
FrostWire 0,989 0,078 0,973 0,981 0,972 0,926
Vuze 0,987 0,082 0,972 0,979 0,969 0,919
Waledac 1,000 0,094 0,968 0,984 0,975 0,936
Storm 1,000 0,082 0,972 0,986 0,979 0,945
Weighted avg. 0,984 0,086 0,970 0,977 0,966 0,912
REPTree
TPR FPR Precision F-Measure Accuracy MCC
eMule 0,982 0,111 0,962 0,972 0,957 0,888
uTorrent 0,937 0,101 0,964 0,950 0,927 0,817
FrostWire 0,989 0,098 0,966 0,978 0,967 0,912
Vuze 0,985 0,096 0,967 0,976 0,964 0,906
Waledac 0,999 0,098 0,967 0,982 0,974 0,931
Storm 0,999 0,086 0,970 0,985 0,977 0,940
Weighted avg. 0,982 0,098 0,966 0,974 0,961 0,899
J48
TPR FPR Precision F-Measure Accuracy MCC
eMule 0,982 0,099 0,966 0,974 0,961 0,897
uTorrent 0,935 0,090 0,967 0,951 0,929 0,821
FrostWire 0,990 0,090 0,969 0,979 0,969 0,919
Vuze 0,985 0,102 0,965 0,975 0,963 0,902
Waledac 0,999 0,099 0,966 0,982 0,973 0,931
Storm 1,000 0,084 0,971 0,986 0,978 0,944
Weighted avg. 0,982 0,094 0,967 0,975 0,962 0,902
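The six metrics reported in the tables above follow from the four counts of a binary confusion matrix. The Python sketch below uses the standard definitions; the function name `metrics` is hypothetical and the counts in the usage note are illustrative only, not taken from the experiments.

```python
import math


def metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, precision, F-measure, accuracy and MCC.

    tp/fp/tn/fn are the true/false positive/negative counts of one class.
    """
    tpr = tp / (tp + fn)                       # true positive rate (recall)
    fpr = fp / (fp + tn)                       # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "TPR": tpr,
        "FPR": fpr,
        "Precision": precision,
        "F-Measure": f_measure,
        "Accuracy": accuracy,
        "MCC": mcc,
    }
```

For example, `metrics(tp=90, fp=5, tn=95, fn=10)` gives a TPR of 0.9, an FPR of 0.05 and an accuracy of 0.925. MCC takes all four counts into account, which is why it can sit well below accuracy when the false positive rate is not negligible.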
[Figure 5.3: Accuracy per P2P application (eMule, uTorrent, FrostWire, Vuze, Waledac, Storm) for Boosted REPTree, Boosted J48, REPTree and J48.]
Figure 5.4: Comparing this research with the work of Rahbarinia et al. [Rah+14] (classifier comparison per application: eMule, uTorrent, FrostWire, Vuze, Waledac, Storm)
6
CONCLUSION
P2P applications have been widely used in recent years. They bring many conveniences, but
increasing P2P traffic also brings P2P malware with it. One of the first steps against P2P
malware is the detection of P2P traffic, so accurate P2P traffic classification becomes more
and more significant.
The answers to the research sub-questions are formulated as follows:
• RQ2: What relevant features are needed for P2P traffic classification?
Four features were found to be relevant for P2P traffic classification when an algorithm
based on Decision Trees (DTs) is used. These are the number of packets, the volume, the
duration and the median IAT of a dialogue [NRH13; Nar+14].
• RQ3: How do we apply a port agnostic, payload agnostic classification technique into
a P2P traffic classification approach?
By restricting the feature set to not rely on port and payload data we can create a P2P
traffic classifier which is both port and payload agnostic.
Network traffic is classified into P2P and non-P2P traffic with Machine Learning (ML),
using the algorithms J48, REPTree and AdaBoost for the analysis of statistical flow features,
which are both port and payload agnostic.
Compared with other existing detection schemes, this approach also exhibits high ac-
curacy. The accuracies of other schemes [Che+09], [Li+09], [KNC10], [YC13] are 95.67%,
96.03%, 95% and 97.46% respectively.
6.1. LIMITATIONS
The framework is able to correctly detect and categorize P2P applications, whether malicious
or benign, with high accuracy. There are some limitations, though, which this section
describes. The accuracy obtained in the classification of benign P2P applications
is lower than the accuracy of the detection of P2P botnets. Adding more benign
P2P applications to the training dataset, or possibly increasing the sample size and
experimenting with several other features, can help in correctly categorizing P2P traffic such as
the control (or management) traffic of P2P applications.
By design, being port and protocol agnostic means that many lower-level details (e.g. at
the TCP/UDP layer) are neglected.
If node A and node B are engaged in P2P file sharing and both of them are also part of
a botnet, this is seen as a single dialogue. Because of the overlap between the botnet data
and the application data, the dialogue approach is unable to correctly classify the kind of
botnet or application running on peers A and B. It is therefore possible that smarter bots,
which mimic benign-like behavior and/or add noise (or randomness) to their communication
patterns, could evade the current classifier.
One noteworthy remark is the fact that all packets having no payload are discarded.
This was necessary to remove corrupted packets and sanitize the network traces obtained.
However, such an approach has an inherent drawback of dropping all legitimate packets
with zero payload, such as TCP connection establishment (SYN) packets. This could be ex-
ploited by using zero payload TCP packets (SYN or ACK packets) to exchange simple com-
mands for instance in a botnet.
6.4. REFLECTION
Working on the P2P classification framework was a pleasant and educational, but also stressful,
journey. The received datasets, for instance, are huge in file size. Wireshark had
difficulties reading the offline trace files. A Windows laptop with 4 GB and a Mac OS X desktop
with 8 GB of memory were not enough. Even a dedicated quad-core Windows desktop with 32 GB was
not able to satisfy the (memory) resource hunger of the Wireshark tools. In the end, self-written
tools were used to chop the big data files into smaller chunks. Early experiments with
training data of millions of instances sent Weka on calculation journeys
of several days, which ended in out-of-memory errors or system crashes.
All those setbacks aside, diving into Machine Learning and Machine Learning Algo-
rithms was really educational as those topics are not in the current curricula.
BOOKS
[DD11] Sumeet Dua and Xian Du. Data mining and machine learning in cybersecurity.
CRC Press, 2011. ISBN: 978-14-398-3942-3. URL: http://www.crcpress.com/product/isbn/9781439839423.
[GOH11] James Graham, Ryan Olson, and Rick Howard. Cyber security essentials. CRC
Press, 2011. ISBN: 978-14-398-5123-4. URL: http://www.crcpress.com/product/isbn/9781439851234.
[HWF11] Mark Hall, Ian Witten, and Eibe Frank. Data mining: Practical machine learning
tools and techniques. Morgan Kaufmann Publishers Inc., 2011. ISBN: 978-01-237-4856-0.
URL: https://www.elsevier.com/books/data-mining-practical-machine-learning-tools-and-techniques/witten/978-0-12-374856-0.
[Mit97] Tom M. Mitchell. Machine learning. New York, NY [et al.]: McGraw-Hill, 1997.
ISBN: 978-00-711-5467-3. URL: http://www.worldcat.org/search?qt=worldcat_org_all&q=9780071154673.
[MKT11] Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham. Data Mining Tools
for Malware Detection. CRC Press, 2011. ISBN: 978-14-398-5454-9.
URL: http://www.crcpress.com/product/isbn/9781439854549.
[SB11] Craig Schiller and James R Binkley. Botnets: The killer web applications.
Syngress, 2011. ISBN: 978-15-974-9135-8. URL: https://www.elsevier.com/books/botnets/bradley/978-1-59749-135-8.
ACADEMIC ARTICLES
[ABG13] Adarsh Agarwal, Nipun Bansal, and Sudeep Gupta. “Peer to Peer Networking
and Applications”. In: International Journal of Advanced Research in Computer
Science and Software Engineering 8 (2013).
[Bas+13] Anirban Basu, Simon Fleming, James Stanier, Stephen Naicken, Ian Wakeman,
and Vijay K. Gurbani. “The State of Peer-to-peer Network Simulators”. In: ACM
Computing Surveys (CSUR) 45.4 (Aug. 2013), pp. 1–25. ISSN: 0360-0300. DOI:
10.1145/2501654.2501660. URL: http://doi.acm.org/10.1145/2501654.2501660.
[Che+09] Zhenxiang Chen, Bo Yang, Yuehui Chen, Ajith Abraham, Crina Grosan, and
Lizhi Peng. “Online hybrid traffic classifier for Peer-to-Peer systems based on
network processors”. In: Applied Soft Computing 9.2 (2009), pp. 685–694.
[Gom+13] Joao V Gomes, Pedro RM Inácio, Manuela Pereira, Mário M Freire, and Paulo P
Monteiro. “Detection and classification of peer-to-peer traffic: A survey”. In:
ACM Computing Surveys (CSUR) 45.3 (2013).
[Gu+08] Guofei Gu, Roberto Perdisci, Junjie Zhang, Wenke Lee, et al. “BotMiner: Clus-
tering Analysis of Network Traffic for Protocol-and Structure-Independent Bot-
net Detection.” In: USENIX Security Symposium. 2008, pp. 139–154.
[Haq+10] Irfan Ul Haq, Sardar Ali, Hassan Khan, and Syed Ali Khayam. “What is the im-
pact of p2p traffic on anomaly detection?” In: Recent Advances in Intrusion De-
tection. Springer, 2010, pp. 1–17.
[HCL09] Yan Hu, Dah-Ming Chiu, and John Lui. “Profiling and identification of P2P traf-
fic”. In: Computer Networks 53.6 (2009), pp. 849–863.
[Kar+04] Thomas Karagiannis, Andre Broido, Michalis Faloutsos, and Kc claffy. “Transport
Layer Identification of P2P Traffic”. In: Proceedings of the 4th ACM SIGCOMM
Conference on Internet Measurement. IMC ’04. Taormina, Sicily, Italy:
ACM, 2004, pp. 121–134. ISBN: 1-58113-821-0. DOI: 10.1145/1028788.1028804.
URL: http://doi.acm.org/10.1145/1028788.1028804.
[KNC10] Ram Keralapura, Antonio Nucci, and Chen-Nee Chuah. “A novel self-learning
architecture for p2p traffic classification in high speed networks”. In: Computer
Networks 54.7 (2010), pp. 1055–1068.
[KPF05] Thomas Karagiannis, Konstantina Papagiannaki, and Michalis Faloutsos. “BLINC:
Multilevel Traffic Classification in the Dark”. In: Proceedings of the 2005 Conference
on Applications, Technologies, Architectures, and Protocols for Computer
Communications. SIGCOMM ’05. Philadelphia, Pennsylvania, USA: ACM, 2005,
pp. 229–240. ISBN: 1-59593-009-4. DOI: 10.1145/1080091.1080119.
URL: http://doi.acm.org/10.1145/1080091.1080119.
[Li+09] Jun Li, Shunyi Zhang, Yanqing Lu, and Junrong Yan. “Hybrid internet traffic
classification technique”. In: Journal of Electronics (China) 26.1 (2009), pp. 101–
112.
[Liu+10] Dan Liu, Yichao Li, Yue Hu, and Zongwen Liang. “A P2P-botnet detection model
and algorithms based on network streams analysis”. In: Future Information
Technology and Management Engineering (FITME), 2010 International Confer-
ence on. Vol. 1. IEEE, 2010, pp. 55–58.
[LM07] Wei Li and Andrew W Moore. “A machine learning approach for efficient traffic
classification”. In: Modeling, Analysis, and Simulation of Computer and Telecom-
munication Systems, 2007. MASCOTS’07. 15th International Symposium on. IEEE,
2007, pp. 310–317.
[MM00] Gary McGraw and Greg Morrisett. “Attacking malicious code”. In: IEEE soft-
ware 5 (2000), pp. 33–41.
[MP05] Andrew W Moore and Konstantina Papagiannaki. “Toward the accurate identi-
fication of network applications”. In: Passive and Active Network Measurement.
Springer, 2005, pp. 41–54.
[MW06] Alok Madhukar and Carey Williamson. “A longitudinal study of P2P traffic clas-
sification”. In: Modeling, Analysis, and Simulation of Computer and Telecom-
munication Systems, 2006. MASCOTS 2006. 14th IEEE International Sympo-
sium on. IEEE, 2006, pp. 179–188.
[MZ05] Andrew W Moore and Denis Zuev. “Internet traffic classification using bayesian
analysis techniques”. In: ACM SIGMETRICS Performance Evaluation Review.
Vol. 33. ACM, 2005, pp. 50–60.
[Nar+14] Pratik Narang, Subhajit Ray, Chittaranjan Hota, and Venkat Venkatakrishnan.
“PeerShark: Detecting Peer-to-Peer Botnets by Tracking Conversations”. In: (2014).
[NRH13] Pratik Narang, Jagan Mohan Reddy, and Chittaranjan Hota. “Feature selection
for detection of peer-to-peer botnet traffic”. In: Proceedings of the 6th ACM In-
dia Computing Convention. ACM, 2013, p. 16.
[PTK06] Junghun Park, Hsiao-Rong Tyan, and C-CJ Kuo. “GA-based internet traffic clas-
sification technique for QoS provisioning”. In: Intelligent Information Hiding
and Multimedia Signal Processing, 2006. IIH-MSP’06. International Conference
on. IEEE, 2006, pp. 251–254.
[Rah+14] Babak Rahbarinia, Roberto Perdisci, Andrea Lanzi, and Kang Li. “Peerrush:
Mining for unwanted p2p traffic”. In: Journal of Information Security and Ap-
plications (2014).
[RD10] Rodrigo Rodrigues and Peter Druschel. “Peer-to-peer systems”. In: Commun.
ACM 53.10 (2010), pp. 72–82.
[RH13] Jagan Mohan Reddy and Chittaranjan Hota. “P2P traffic classification using
ensemble learning”. In: Proceedings of the 5th IBM Collaborative Academia Re-
search Exchange Workshop. ACM, 2013, p. 14.
[Ros+13] Christian Rossow, Dennis Andriesse, Tillmann Werner, Brett Stone-Gross, Daniel
Plohmann, Christian J Dietrich, and Herbert Bos. “SoK: P2PWNED-Modeling
and Evaluating the Resilience of Peer-to-Peer Botnets”. In: Security and Pri-
vacy (SP), 2013 IEEE Symposium on. IEEE, 2013, pp. 97–111. URL: https://
www.ieee-security.org/TC/SP2013/papers/4977a097.pdf.
[Saa+11] Sherif Saad, Issa Traore, Ali Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John
Felix, and Payman Hakimian. “Detecting P2P botnets through network behav-
ior analysis and machine learning”. In: Privacy, Security and Trust (PST), 2011
Ninth Annual International Conference on. IEEE, 2011, pp. 174–180.
[Shi+12] Ali Shiravi, Hadi Shiravi, Mahbod Tavallaee, and Ali A Ghorbani. “Toward de-
veloping a systematic approach to generate benchmark datasets for intrusion
detection”. In: Computers & Security 31.3 (2012), pp. 357–374.
[Sil+13] Sérgio SC Silva, Rodrigo MP Silva, Raquel CG Pinto, and Ronaldo M Salles.
“Botnets: A survey”. In: Computer Networks 57.2 (2013), pp. 378–403.
[SOG07] Douglas C Sicker, Paul Ohm, and Dirk Grunwald. “Legal issues surrounding
monitoring during network research”. In: Proceedings of the 7th ACM SIGCOMM
conference on Internet measurement. ACM, 2007, pp. 141–148.
[SP14] Matija Stevanovic and Jens Myrup Pedersen. “An efficient flow-based botnet
detection using supervised machine learning”. In: Computing, Networking and
Communications (ICNC), 2014 International Conference on. IEEE. 2014, pp. 797–
801.
[SSW04] Subhabrata Sen, Oliver Spatscheck, and Dongmei Wang. “Accurate, scalable
in-network identification of p2p traffic using application signatures”. In: Pro-
ceedings of the 13th international conference on World Wide Web. ACM, 2004,
pp. 512–521.
[SV13] Timo Schless and Harald Vranken. “Counter botnet activities in the Nether-
lands a study on organisation and effectiveness”. In: Information Science and
Technology (ICIST), 2013 International Conference on. IEEE, 2013, pp. 437–442.
[UAH10] J Udhayan, R Anitha, and T Hamsapriya. “Lightweight C&C based botnet de-
tection using Aho-Corasick NFA”. In: International Journal of Network Security
& its Applications 2.4 (2010).
[Wan+14] Chunzhi Wang, Zeqi Wang, Zhiwei Ye, and Hongwei Chen. “A P2P Traffic Iden-
tification Approach Based on SVM and BFA”. In: TELKOMNIKA Indonesian
Journal of Electrical Engineering 12.4 (2014).
[XZB05] Kuai Xu, Zhi-Li Zhang, and Supratik Bhattacharyya. “Profiling internet back-
bone traffic: behavior models and applications”. In: ACM SIGCOMM Computer
Communication Review. Vol. 35. ACM, 2005, pp. 169–180.
[YC13] Wujian Ye and Kyungsan Cho. “Two-Step P2P Traffic Classification with Con-
nection Heuristics”. In: Innovative Mobile and Internet Services in Ubiquitous
Computing (IMIS), 2013 Seventh International Conference on. IEEE, 2013, pp. 135–
141.
[YR10] Ting-Fang Yen and Michael K Reiter. “Are your hosts trading or plotting? Telling
P2P file-sharing and bots apart”. In: Distributed Computing Systems (ICDCS),
2010 IEEE 30th International Conference on. IEEE, 2010, pp. 241–252.
[Zei+10] Hossein Rouhani Zeidanloo, Azizah Bt Abdul Manaf, Rabiah Bt Ahmad, Maz-
dak Zamani, and Saman Shojae Chaeikar. “A proposed framework for P2P Bot-
net detection”. In: IACSIT Int. J. Eng. Technol 2 (2010), pp. 161–168.
[Zha+09] Min Zhang, Wolfgang John, K Claffy, and Nevil Brownlee. “State of the art in
traffic classification: A research review”. In: PAM Student Workshop. 2009.
TECHNICAL DOCUMENTATION
[Aut11] Internet Assigned Numbers Authority. Service Name and Transport Protocol
Port Number Registry. 2011. URL: http://www.iana.org.
[Ben09] Jonathan Bennett. AutoIt Script Home Page. 2009. URL: http://www.autoitscript.com.
[Kor12] Maciej Korczynski. “Classifying Application Flows and Intrusion Detection in
Internet Traffic”. PhD thesis. Université de Grenoble, 2012.
[Luz14] Pedro Marques da Luz. “Botnet Detection Using Passive DNS”. MA thesis. De-
partment of Computing Science, Radboud University Nijmegen, 2014.
[MZC05] Andrew Moore, Denis Zuev, and Michael Crogan. Discriminators for use in
flow-based classification. Tech. rep. Queen Mary and Westfield College, De-
partment of Computer Science, 2005.
GLOSSARY
servent A servent is a host within a computer network acting as both a SERVer and a
cliENT.
Supervised learning Supervised learning algorithms are trained on labelled examples, i.e.,
input where the desired output is known. The supervised learning algorithm at-
tempts to generalise a function or mapping from inputs to outputs which can then
be used speculatively to generate an output for previously unseen inputs.
swarm A swarm is a collection of peers that are interested in distributing the same content.