
AN INTERNSHIP REPORT

On
Network Traffic Classification Using Machine Learning
Algorithms

Submitted by

Alok Kumar (21131410091)

in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
BRANCH OF STUDY

SCHOOL OF COMPUTING SCIENCE AND ENGINEERING


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GALGOTIAS UNIVERSITY, GREATER NOIDA
October, 2024
TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION 10
1.1. Machine Learning Algorithms.........................................................................................11

1.2. KNN................................................................................................................................ 12

1.3. ANN................................................................................................................................ 13

1.4. Random Forest................................................................................................................ 14

1.5. Key Concept.................................................................................................................... 15

CHAPTER 2. LITERATURE SURVEY 17


2.1. Survey 1-3....................................................................................................................... 17

2.2. Survey 4-5....................................................................................................................... 18

2.3. Survey 6-8....................................................................................................................... 19

2.4. Survey 9-10..................................................................................................................... 20

2.5. Survey 11-13................................................................................................................... 21

2.6. Survey 14-16................................................................................................................... 22

2.7. Survey 17-18................................................................... 23

2.8. Survey 19-20................................................................... 24

CHAPTER 3. DESIGN FLOW/PROCESS 25


3.1. Evaluation & Selection of Specifications/Features..........................................25

3.2. Design Constraints...........................................................................................................26

3.3. Analysis of Features and finalization subject to constraints............................................27

3.4. Design Flow.....................................................................................................................28

3.5. Design selection...............................................................................................................29

3.6. Methodology...............................................................................................................30-35

CHAPTER 4. RESULTS ANALYSIS 36


4.1. KNN Result Analysis.....................................................................................36
4.2. Random Forest Result Analysis...................................................................................... 37

4.3. Random Forest Result Analysis....................................................................................................38

4.4. Feature Importance Of Random Forest Classifier.......................................................................39

CHAPTER 5. FUTURE WORK 40


5.1. Conclusion.......................................................................................................................40

5.2. Future work..................................................................................................................... 40


ABSTRACT

Today's Internet does not provide a built-in exchange of information between applications and networks, which may result in poor application performance. Concepts such as application-aware networking or network-aware application programming try to overcome this limitation. Network traffic classification is used mainly by Internet Service Providers (ISPs) to analyze the characteristics required to design a network, and it therefore affects the overall performance of that network. Various techniques are adopted to classify network protocols, such as port-based, payload-based, and machine learning-based approaches. This project aims to develop a robust machine learning-based system for accurately classifying network traffic into different categories. By extracting relevant features from network traffic data, such as packet headers, payload information, and flow statistics, we train and evaluate various machine learning algorithms to identify traffic patterns associated with specific applications or services.


GRAPHICAL ABSTRACT

A graphical abstract for this project visually represents the flow and key components of the machine learning-based network traffic classification system. It highlights the steps involved in classifying network traffic using machine learning techniques.

Title: Machine Learning-Based Network Traffic Classification System.

Flow Diagram Components:

1. Data Preprocessing:

o Outline the steps required to prepare the dataset for analysis and modeling.

2. Import Libraries:

o Import libraries such as pandas, numpy, seaborn, and matplotlib to establish a foundation for data management and visualization.

3. Machine Learning Algorithms:

o Demonstrate how various machine learning algorithms, such as K-Nearest Neighbors (KNN) and Random Forest, leverage the extracted features to classify and predict network traffic patterns.

4. Traffic Classification:

o Use the Dataset-Unicauca dataset, after cleaning, to train the model.

5. Performance Monitoring (Speedometer/Graph):

o Indicate the impact on network performance and monitoring by visually representing improvements. A clean dataset is very important; our future work gives priority to the quality of the network traffic data.


ABBREVIATIONS

ISPs: Internet Service Providers

IANA: Internet Assigned Numbers Authority

ML: Machine Learning

RF: Random Forest

DT: Decision Tree

KNN: K-Nearest Neighbours

ANN: Artificial Neural Networks

HTTP: Hypertext Transfer Protocol

QUIC: Quick UDP Internet Connections


Chapter 1 INTRODUCTION

Network traffic classification is a topic of deep interest for Internet service providers (ISPs) and network operators, who need to identify the types of data flowing in the network and map them to the applications that generate them. This knowledge is important for many reasons, including monitoring network security and application behavior, performing traffic engineering or Quality of Service/Service Level Agreement calibration, improving knowledge of users' traffic demands to support policing and prioritization mechanisms, and collecting data for marketing, accounting/billing, or capacity planning purposes. Reliably classifying network traffic is also crucial for the automation of network operations, where the prediction of traffic demands and flow matrices, the development of realistic traffic models (Xu et al., 2005), and the detection of anomalous behaviors (D'Angelo et al., 2015; Palmieri, 2019) for triggering autonomous reactions become core components of modern network management frameworks. Possible actions and countermeasures associated with traffic class monitoring include filtering or blocking unwanted flows, starting lawful interception sessions, and performing rerouting or resource reallocation in the presence of specific overload conditions, threats, or activities contravening the network operator's terms of service.

Many techniques are used for traffic classification. The earliest is the port-based technique, in which network applications register their ports with the Internet Assigned Numbers Authority (IANA). However, this technique has become ineffective because many applications use dynamic port numbers instead of well-known port numbers. Another is the payload-based technique, which is also ineffective because applications encrypt their data to evade detection; since many network applications use encryption, it is very challenging to inspect encrypted packet data. Since 2016, numerous supervised and unsupervised machine learning (ML) methods have been applied effectively to traffic classification. Using Bayesian analysis techniques and a Bayesian neural network, Moore built data sets with 248 statistical features for traffic classification in 2005 and achieved high classification accuracies with them.

In our previous work, we used two different network traffic trace datasets, classified WeChat application traffic accurately, and obtained very promising accuracy results using ML classifiers. In this study, we utilize three essential machine learning (ML) classifiers for network traffic identification, using trained samples from broad categories of network applications to classify unknown application classes such as WWW, IM, MAIL, P2P, TELNET, FTP, and DNS. In this paper, we have used two combined datasets, the NIMS and HIT data sets; the NIMS data sets are publicly available on the Internet, and the HIT data sets were developed by us. Our main aim in this research study is to achieve high accuracy and to improve the accuracy results of all the selected classifiers. To achieve high accuracy, it is very important to have high-quality data sets. In this research, we have used Dataset-Unicauca-Version2-87Atts, which was collected from servers at different locations. We use twenty-two (22) extracted features and three (3) different types of machine learning classifiers to classify DNS, FTP, TELNET, P2P, WWW, IM, and MAIL applications.

Machine Learning Algorithms-

The K-Nearest Neighbors (KNN) algorithm is a simple and intuitive machine learning technique used for classification and regression tasks. KNN works on the principle of similarity: it classifies or predicts the output for a new data point based on how its nearest neighbors are classified or valued.

Working of KNN-

Choose K: Select the number of nearest neighbors (K). This is a hyperparameter that you choose

manually. It represents how many nearby data points will be considered when making a

prediction.

Calculate Distance: For each new data point, KNN calculates the distance between that point

and all other points in the dataset. Common distance metrics include Euclidean distance,

Manhattan distance, or others depending on the nature of the data.

Identify Neighbors: The algorithm identifies the K data points (neighbors) that are closest to the

new data point based on the calculated distances.

Vote or Average:

● For Classification: The algorithm takes a majority vote of the K nearest neighbors’

classes (labels). The new data point is assigned to the class that is most common among

the K neighbors.

● For Regression: The algorithm averages the values of the K nearest neighbors and

assigns this average as the predicted value for the new data point.
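
To make these steps concrete, the following is a minimal illustrative sketch in Python using numpy; the function, feature vectors, and labels are invented for the example and are not taken from this project's code.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one new point by a majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny usage example with made-up feature vectors and labels
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
y_train = np.array(["WWW", "WWW", "P2P", "P2P"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> "WWW"
```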

Pros:

● Simple to understand and implement.

● No assumptions about data distribution (non-parametric).

● Effective for small datasets.


Cons:

● Can be computationally expensive with large datasets (as it calculates distances for all

points).

● Performance can degrade if data isn't normalized or if irrelevant features are present.

● Choosing the right value of K is crucial and may require expert knowledge or experimentation.

The Artificial Neural Network (ANN) is a machine learning model inspired by the structure and function of the human brain. It is primarily used for complex tasks such as image recognition, natural language processing, and time-series forecasting.

Key Concept:

An ANN is made up of layers of connected nodes (neurons), which process input data and learn

patterns to make predictions or classifications. The learning happens by adjusting the weights of

these connections based on errors in the predictions.

Structure of ANN:

● Input Layer: This layer receives the raw data (features) that are used for prediction or

classification. Each node in this layer represents one feature.

● Hidden Layer(s): These are intermediate layers where the actual computation happens.

Each neuron in a hidden layer receives inputs from the previous layer, processes them

(typically using a weighted sum), and applies an activation function to introduce

non-linearity.

● Activation Function: Functions like ReLU (Rectified Linear Unit), Sigmoid, or Tanh are

commonly used to transform the output of each neuron. This allows the network to learn

complex patterns in data.


● Output Layer: This layer produces the final prediction or classification result. For

classification tasks, the number of output neurons corresponds to the number of classes.

For regression tasks, there's typically one output neuron.

Steps in ANN:

● Forward Propagation: Data passes through the network from the input layer, through

hidden layers, to the output layer. Each neuron computes a weighted sum of inputs and

applies the activation function, passing the result to the next layer.

● Loss Calculation: The predicted output is compared to the actual output (ground truth)

using a loss function (e.g., mean squared error for regression or cross-entropy for

classification). The loss measures how far the model's predictions are from the true

values.

● Backpropagation: To reduce the loss, the network adjusts the weights using

backpropagation. This process calculates the gradient of the loss with respect to each

weight and updates the weights using an optimization algorithm (like gradient descent).

● Training: This cycle of forward propagation, loss calculation, and backpropagation is

repeated over many iterations (epochs) until the model converges, i.e., the loss is

minimized.
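
As a rough illustration of this forward-propagation/backpropagation training cycle, a small multilayer perceptron can be fit with scikit-learn's MLPClassifier; the layer sizes and synthetic data below are placeholders chosen for the example, not the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the project the features would come from the traffic dataset
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers with ReLU activations; weights are updated by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```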

Pros:

● Highly flexible and can model complex, non-linear relationships.

● Works well with large datasets and high-dimensional data.

● Can be used for various tasks, including classification, regression, and even generative

modeling.

Cons:
● Requires large amounts of data to perform well.

● Computationally expensive and time-consuming to train.

● Tuning the model (e.g., choosing the number of layers, neurons, and hyperparameters)

can be complex.

ANNs form the basis for more advanced models like Convolutional Neural Networks (CNNs)

and Recurrent Neural Networks (RNNs), which are specialized for image processing and

sequence data, respectively.

Random Forest is an ensemble machine learning algorithm mainly used for classification and regression tasks. It builds multiple decision trees and combines their outputs to make more accurate predictions.

Key idea:

A random forest consists of a large number of individual decision trees that operate together as an ensemble. Each tree in the forest gives a prediction, and the final output is decided by averaging (for regression) or majority voting (for classification) across all trees.

Steps in Random Forest:

● Bootstrap Sampling: From the original dataset, multiple subsets are created by randomly sampling the data with replacement (a technique known as bagging). Each subset is used to train a different decision tree.

● Building Decision Trees:

● For every tree, instead of considering all features at each split, Random Forest randomly selects a subset of features. This introduces diversity among the trees, preventing them from becoming too similar.

● The trees are trained independently, with each tree learning different patterns because of its specific data samples and feature subsets.

● Aggregation:

Classification: each tree in the forest makes a classification (votes). The class with the majority of votes across all trees is selected as the final prediction.

Regression: each tree makes a numerical prediction, and the final prediction is the average of all the tree outputs.
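
A minimal scikit-learn sketch of this bagging-and-voting idea is shown below; the synthetic dataset is only a stand-in, and the project's actual data and hyperparameters are described in Chapter 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for flow features
X, y = make_classification(n_samples=5000, n_features=22, n_informative=12,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random feature subset per split
rf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
rf.fit(X_train, y_train)

# Final prediction = majority vote across the trees
print("Test accuracy:", rf.score(X_test, y_test))
```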


Chapter 2 LITERATURE

SURVEY

1. The research paper titled “Granular classifier: building traffic granules for encrypted traffic classification based on granular computing” focuses on enhancing the accuracy of encrypted network traffic classification. It addresses challenges such as traffic dispersion, multi-level feature classification, and limitations due to small training datasets. The proposed method introduces a “Cardinality-Constrained Fuzzy C-Means” clustering algorithm that leverages correlation information between network flows to improve traffic partitioning. (Xuyang Jing, Jingjing Zhao, Zheng Yan, Witold Pedrycz, Xian Li)

2. The Research paper, "High-Speed Encrypted Traffic Classification by Using Payload

Features," proposed a novel method for classifying encrypted network traffic in

high-speed environments. This paper addresses the challenge of distinguishing between

encrypted and unencrypted traffic, especially in scenarios where high-speed and short

flows make traditional methods ineffective. This paper uses a combination of

“Entropy-based and Chi-square” test features for encrypted traffic detection, followed

by a Random Forest model for fine-grained classification of different traffic types. (Xinge Yan, Liukun He, Yifan Xu, Jiuxin Cao, Liangmin Wang, Guyang Xie)

3. The Research paper "QUIC Network Traffic Classification Using Ensemble Machine

Learning Techniques" focuses on classification of Quick UDP Internet Connections

(QUIC) protocol traffic using ensemble machine learning methods.This research paper

employs five ensemble learning techniques: Random Forest, Extra Trees, Gradient

Boosting Tree, Extreme Gradient Boosting Tree, and Light Gradient Boosting
Model. The models achieved up to 99.40% accuracy, and the authors conclude that ensemble methods, especially XGBT and LGBM, are effective in classifying encrypted traffic using minimal data. (Sultan Almuhammadi, Abdullatif Alnajim, Mohammed Ayub)

4. The Research paper "Network Traffic Identification in Packet Sampling

Environment" explores challenges and solutions for identifying network traffic when

packet sampling is used, which is common in modern high-speed networks.Packet

sampling has a lot of impact on traffic identification. The paper proposes a Deep Belief

Networks Application Identification (DBNAI) method, which combines time and

space-based behavior features to improve classification accuracy.The paper achieves

better results than conventional approaches by leveraging machine learning to handle the

reduced data while maintaining classification reliability in various network environments.

(Shi Dong , Yuanjun Xia)

5. The research paper "Improved Feature Selection and Stream Traffic Classification Based on Machine Learning in Software-Defined Networks" addresses the enhancement of traffic classification in software-defined networks using advanced machine learning techniques. The paper introduces the Boruta feature selection method to identify optimal features and reduce computational overhead. The study evaluates three classifiers: Hoeffding Adaptive Trees (HAT), Adaptive Random Forest, and K-Nearest Neighbors with Adaptive Sliding Window. The Boruta feature selection technique showed up to 95% accuracy. (Arwa M. Eldhai, Mosab Hamdan, Ahmed Abdelaziz, Ibrahim Abaker Targio Hashem, Sharief F. Babiker, M. N. Marsono, Muzaffar Hamzah, and Noor Zaman Jhanjhi)


6. The research paper “Software defined networking based network traffic classification using machine learning techniques” focuses on using Software-Defined Networking combined with machine learning to classify network traffic more efficiently. The paper addresses the limitations of traditional methods such as port-based classification or payload inspection, which struggle with encrypted and dynamic network traffic. The study uses supervised and unsupervised ML algorithms such as Decision Trees, Random Forest, and K-means clustering. Among the models tested, the Decision Tree achieved the highest accuracy of 99.81%. The researchers demonstrate ML-based classification in Software-Defined Networking environments to improve real-time traffic analysis. (Ayodeji Olalekan Salau & Melesew Mossie Beyene)

7. The research paper "Network traffic classification based on federated semi-supervised learning" proposes a federated semi-supervised learning framework for network traffic classification, addressing challenges such as privacy protection and the high cost of labeled data. It introduces a novel Deep Packet Inspection and Domain Name System-based method for labeling network traffic on home edge devices. The study extracts high-dimensional features from both labeled and unlabeled data and reduces dependence on labeled samples using autoencoders and convolutional neural networks. The model achieves high accuracy. (ZiXuan Wang, ZeYi Li, MengYi Fu, YingChun Ye, Pan Wang)

8. The research paper "The rise of machine learning for detection and classification of malware: Research developments, trends and challenges" presents a comprehensive review of the application of machine learning to malware detection and classification. It highlights how ML has become an important tool in addressing the evolving complexity of malware, which increasingly uses obfuscation techniques to evade conventional detection. The study categorizes detection approaches into static, dynamic, and hybrid methods, with a particular focus on deep learning approaches. It also emphasizes emerging trends such as neural networks and multimodal learning, and concludes by outlining the challenges researchers face, including concept drift, adversarial learning, and imbalanced malware datasets. (Daniel Gibert, Carles Mateu, Jordi Planes)

9. The research paper titled "Analysis of Network Traffic using machine learning "

focuses on identifying unproductive applications consuming bandwidth. The study

addresses issues such as underutilization and network congestion caused by peer-to-peer

file-sharing activities. The methodology involved using Wireshark for traffic monitoring

and MATLAB for data analysis. Recommendations include implementing better

bandwidth management strategies and enforcing policies to enhance academic internet

use. (S.B.A. Mohammed, Dr. S.M. Sani, Dr. D.D. Dajab)

10. The research paper titled "Network Traffic Analysis using NLP and MATLAB"

examines network traffic to identify unproductive applications consuming bandwidth,

leading to underutilization and reduced academic access. Using Wireshark for monitoring

and MATLAB for data analysis, the study reveals that inadequate bandwidth

management and peer-to-peer applications are major contributors to congestion. It

advocates for better bandwidth management, stricter controls on non-academic usage,

and the development of clear internet policies to enhance network efficiency for

educational purposes.( Manish R. Joshi , Theyazn Hassn Hadi)


11. The research paper titled "Network Traffic Analysis: Identifying Bandwidth

Consumption Patterns Using Wireshark" focuses on analyzing the university's

network traffic over a 90-day period. The study identifies unproductive applications

consuming bandwidth, highlights underutilization of the network, and emphasizes the

need for improved bandwidth management to enhance academic activities. It also stresses

the importance of monitoring tools like Wireshark and suggests implementing an internet

usage policy to prioritize academic-related traffic.( Vanya Ivanova , Tasho Tashev , Ivo

Draganov)

12. The research paper titled “Analyzing Network Traffic to Optimize Bandwidth Usage”

focuses on improving network performance by identifying unproductive applications

consuming bandwidth. The study used Wireshark for packet sniffing and MATLAB for

data analysis over 90 days, revealing underutilization of the network and the dominance

of peer-to-peer (P2P) traffic. The paper emphasizes the need for better bandwidth

management policies to enhance academic-related traffic and proposes sustainable

strategies for improving network efficiency.( Argha Ghosh, Dr. A. Senthilrajan)

13. The research paper titled “A Deep Learning Approach for Classifying Encrypted

Network Traffic” focuses on improving the classification of encrypted traffic, a

challenge for traditional methods due to the rise of encryption protocols. The proposed framework, Distiller, leverages deep learning and multimodal data to enhance performance while maintaining low computational complexity. It introduces multitask learning, achieving up to an 8.45% accuracy improvement over existing models. The model's robustness is validated on a public dataset, and future research directions include incorporating semi-supervised learning to further refine its capabilities. (Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, Antonio Pescape)

14. The research paper titled "Real-Time Network Traffic Analysis Utilizing AI, ML, and

DL Techniques" focuses on leveraging advanced technologies for network traffic

analysis. It highlights the implementation of a random forest algorithm that achieved an

impressive accuracy of 99.31%. The paper addresses challenges such as handling large

data volumes, ensuring scalability, and minimizing false positives/negatives. Key insights

include the necessity for enhanced security through adaptive machine learning models,

the importance of real-time monitoring for immediate threat detection, and the benefits of

a diverse toolset for effective analysis. The authors emphasize ongoing research to keep

pace with evolving network threats. ( Minal Moharir , Mohana , Aschin Dhakad )

15. The research paper titled "Machine Learning Approaches for Traffic Analysis"

focuses on the critical role of machine learning (ML) in enhancing network security and

performance. It reviews various techniques for detecting intrusions and analyzing

network behavior, emphasizing ML's effectiveness in identifying malicious activities.

The paper differentiates between supervised and unsupervised learning methods,

examines techniques like flow analysis and anomaly detection, and discusses challenges

such as data management and evolving attack types. Future research directions include

comprehensive studies on advanced ML techniques to improve cybersecurity

measures. (Nour Alqudah, Qussai Yaseen)

16. The research paper titled "Machine Learning in Network Traffic Analysis and

Security" focuses on the pivotal role of machine learning (ML) in enhancing the analysis

of network traffic and improving security measures. It explores various ML techniques


for detecting anomalies, predicting network congestion, and optimizing resource allocation.

The authors review past research, highlight challenges such as data quality and model

interpretability, and emphasize the need for continuous improvement in methodologies to

address evolving cyber threats. The paper advocates for further studies on recent ML

techniques to advance traffic analysis.( Ons Aouedi , Kandaraj Piamrat , Salima

Hamma)

17. The research paper titled "Network Traffic Classification Using Machine Learning

Techniques" focuses on the critical role of machine learning in enhancing network

performance and security through effective traffic classification. The study specifically

compares the Naïve Bayes and K-nearest neighbor (KNN) algorithms, finding that KNN

outperforms other methods in accurately classifying network traffic from live video

feeds. The authors utilized Wireshark for data extraction and emphasized the significance

of feature extraction in achieving high accuracy. The paper concludes that KNN is a

robust choice for network traffic analysis, advocating for further exploration of advanced

machine learning techniques to adapt to evolving traffic patterns.(Lakshmi Santhosh

Tripura , Kavya Kurra)

18. The survey paper titled "Deep Learning Approaches for Network Traffic

Classification in IoT" explores the use of deep learning techniques to classify network

traffic generated by the rapidly growing number of IoT devices. It reviews various deep

learning models, highlighting their strengths, limitations, and methodologies for

addressing IoT-specific challenges. The paper identifies key issues such as complex

traffic patterns, resource constraints of IoT devices, and limited data availability. It also

discusses the security implications of accurate classification and proposes future research
directions, including lightweight models and techniques for encrypted traffic classification.

(Jawad Hussain Kalwar , Sania Bhatti)

19. The research paper titled "Advancements in Network Traffic Analysis Using Machine

Learning for Cybersecurity" focuses on improving cybersecurity through machine

learning techniques. It addresses the growing threat landscape of cyber attacks and

proposes an Intrusion Detection System (IDS) utilizing algorithms like Support Vector

Machines (SVM), Random Forest, Convolutional Neural Networks (CNN), and Artificial

Neural Networks (ANN). The study evaluates these methods using the CICIDS2017

dataset, with Random Forest achieving the highest accuracy. The paper advocates for

updated datasets and suggests future research in integrating machine learning with big

data for enhanced threat detection. (Anastasia Victoria , Yelizaveta Elizaveta)


Chapter 3 DESIGN

FLOW/PROCESS

3.1 Evaluation & Selection of Specifications/Features

Network traffic classification is a crucial aspect of network management and security,

enabling the identification and categorization of data packets traversing a network. The

literature on this topic identifies several key features that are instrumental in developing

effective machine learning algorithms for traffic classification. These features typically

include flow-level attributes such as packet count, byte count, flow duration, and protocol

type, as well as statistical measures like inter-arrival time and flow entropy. Other

important features may include source and destination IP addresses, port numbers, and

time-related features such as timestamps. Additionally, advanced features derived from

deep packet inspection (DPI) can enhance classification accuracy, allowing for the

analysis of application-level data.

For an ideal solution, it is essential to include a comprehensive set of features that captures

both the behavioral patterns of network traffic and the characteristics of the applications

in use. Key features should encompass not only basic flow statistics but also advanced

metrics that can capture anomalies and variations in traffic patterns. These features

should be selected based on their relevance to the classification tasks and their ability to

improve model performance, considering factors such as accuracy, precision, recall, and

F1-score. The solution should also prioritize features that are computationally efficient to

extract, enabling real-time classification without significant delays.
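
To make the flow-level attributes above concrete, the sketch below computes a few of them with pandas from a hypothetical packet-level table; the column names (flow_id, timestamp, packet_len) are illustrative and are not taken from this project's dataset.

```python
import pandas as pd

# Hypothetical packet-level records: one row per captured packet
packets = pd.DataFrame({
    "flow_id":    [1, 1, 1, 2, 2],
    "timestamp":  [0.00, 0.05, 0.20, 1.00, 1.30],   # seconds
    "packet_len": [60, 1500, 1500, 60, 120],        # bytes
})

# Aggregate per flow: packet count, byte count, duration, mean inter-arrival time
flows = packets.groupby("flow_id").agg(
    packet_count=("packet_len", "size"),
    byte_count=("packet_len", "sum"),
    duration=("timestamp", lambda t: t.max() - t.min()),
    mean_iat=("timestamp", lambda t: t.diff().mean()),
)
print(flows)
```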


3.2 Design Constraints

Designing a network traffic classification system using machine learning algorithms

involves navigating several constraints that can impact the overall effectiveness and

applicability of the solution. These constraints can be categorized into several domains:

Regulatory Standards: Compliance with industry regulations such as the General

Data Protection Regulation (GDPR) is critical, especially concerning data privacy and

user consent. The design must ensure that personal identifiable information (PII) is

not improperly handled during data collection and processing.

Economic Constraints: The cost of implementation, including hardware, software,

and operational expenses, must be considered. A cost-effective solution that does not

compromise on performance is essential for widespread adoption.

Environmental and Health Concerns: While less directly relevant, the ecological

impact of large-scale data processing systems should be considered. Energy-efficient

algorithms that minimize power consumption contribute to a more sustainable design.

Manufacturability and Safety: If the solution involves physical devices (e.g.,

routers or switches), their manufacturability and safety in terms of electrical standards

must be ensured.
Professional and Ethical Issues: Ethical considerations regarding surveillance and

monitoring must be addressed, ensuring that the classification does not lead to misuse

or invasion of privacy.

3.3 Analysis of Features and finalization subject to constraints

The features identified for network traffic classification should be critically evaluated

against the design constraints to finalize the most suitable set for implementation. For

example, while deep packet inspection features may enhance classification accuracy,

they can be resource-intensive and may raise privacy concerns. Consequently, these

features may need to be modified or selectively implemented based on the regulatory

landscape. Similarly, while including all available features may seem advantageous, the

computational overhead could hinder real-time classification, which is a key

requirement for many applications.

Therefore, a careful balance must be struck between feature richness and practical

constraints. Essential features that are computationally light, such as flow statistics and

protocol information, should be prioritized, while advanced features may be selectively

included based on their impact on classification performance and compliance with

ethical standards. Features such as encryption flags or DPI data should be included with

caution, ensuring that they comply with applicable regulations.

3.4 Design Flow


3.5 Design selection

The design of the network traffic classification system can be approached through

various methodologies, with at least two alternative designs outlined below:

Traditional Machine Learning Approach: This approach involves collecting

traffic data, extracting relevant features, and then applying traditional machine

learning algorithms like Random Forest, Support Vector Machines (SVM), or

k-Nearest Neighbors (k-NN). The process starts with data preprocessing, including

cleaning and normalization, followed by feature selection. The selected features are

then used to train the machine learning models, which are validated against a test

dataset. The model with the best performance metrics is deployed for real-time

traffic classification.

Deep Learning Approach: Alternatively, a deep learning approach can be utilized,

leveraging neural networks to automatically extract features from raw traffic data. In

this design, a convolutional neural network (CNN) can be employed to analyze

packet sequences or time-series data. The process begins with data collection,

followed by transformation into a suitable format for the neural network. The model

is trained using labeled datasets, and hyperparameter tuning is performed to

optimize performance. This approach may yield higher accuracy due to its ability to

capture complex patterns in the data.
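
The report does not implement this alternative; as a purely illustrative sketch, a small 1D convolutional network over fixed-length packet or byte sequences could look like the following. Keras is an assumed library choice, and seq_len and num_classes are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, num_classes = 128, 10   # placeholder sequence length and class count

model = keras.Sequential([
    keras.Input(shape=(seq_len, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data shaped like normalized byte sequences, only to show the expected input format
X = np.random.rand(256, seq_len, 1).astype("float32")
y = np.random.randint(0, num_classes, size=256)
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```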


3.6 Methodology

Figure 3.1 - Data Preprocessing Flow Chart

The data preprocessing stage outlines the steps required to prepare the dataset for analysis and modeling. First, import the appropriate libraries, such as pandas, numpy, seaborn, and matplotlib, to establish a foundation for data management and visualization. Load the dataset file and check its structure using functions such as head(), isnull(), and info(), which help identify data types and missing values. Data cleaning issues are then addressed by interpolating or deleting missing values and handling duplicates. Utilities are used to create or transform variables to increase predictive power, including encoding categorical variables and normalizing numerical features. Visualizations built with Seaborn and Matplotlib help to understand data distributions and relationships. This stage applies preprocessing techniques such as scaling and encoding to ensure that the data is ready for modeling. Finally, the shapes and summaries of the cleaned data are displayed, highlighting the importance of preprocessing for sound data analysis.
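
A minimal pandas sketch of these steps is shown below; the file name and the exact column treatments are placeholders, and the project's actual notebook may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset (the file name is illustrative)
df = pd.read_csv("Dataset-Unicauca-Version2-87Atts.csv")

# Inspect structure, data types, and missing values
print(df.head())
print(df.info())
print(df.isnull().sum())

# Basic cleaning: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# Encode the target label and standardize the numeric features
y = LabelEncoder().fit_transform(df["ProtocolName"])
X = df.select_dtypes(include="number")
X_scaled = StandardScaler().fit_transform(X)
print("Cleaned feature matrix shape:", X_scaled.shape)
```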

After data preprocessing, the data is clean and balanced, and we apply different machine learning (ML) algorithms to classify the network traffic.

Figure 3.2 - Count of Network Traffic Classes


K-Nearest Neighbors (KNN) machine learning algorithm-

Figure 3.3 - KNN Model Flow Chart

This model follows a step-by-step process for building a K-Nearest Neighbors (KNN) classifier. Start by importing the appropriate libraries, such as scikit-learn for machine learning and pandas for data processing. Use the pandas.read_csv() function to load the data and select the feature columns, excluding the "ProtocolName" target column. The features are then scaled using StandardScaler to standardize the feature matrix. KNeighborsClassifier() is used to initialize the KNN model, and a grid search is configured over different values of K and weighting schemes. The model is trained on the training data and its performance is then evaluated on the test set. Basic metrics such as accuracy, precision, recall, and F1-score are calculated. This model achieves 91% accuracy.
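
A hedged sketch of this pipeline in scikit-learn follows; the grid values shown are examples, as the notebook's exact search space is not given in the report, and X_scaled and y are assumed from the preprocessing sketch above.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# X_scaled and y as prepared in the preprocessing sketch above
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3,
                                                    random_state=42)

# Example search space over K and the weighting scheme
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```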


| Class | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|----------------|
| AMAZON | 0.96 | 0.95 | 0.96 | 2381 |
| APPLE | 0.91 | 0.90 | 0.90 | 2384 |
| APPLE_ICLOUD | 0.95 | 0.98 | 0.97 | 2380 |
| APPLE_ITUNES | 0.92 | 0.94 | 0.93 | 2430 |
| CITRIX_ONLINE | 1.00 | 1.00 | 1.00 | 2 |
| CLOUDFLARE | 0.99 | 0.99 | 0.99 | 2447 |
| CONTENT_FLASH | 1.00 | 1.00 | 1.00 | 2454 |
| DEEZER | 0.45 | 0.25 | 0.32 | 20 |
| DNS | 0.97 | 0.99 | 0.98 | 2356 |
| DROPBOX | 0.98 | 0.97 | 0.98 | 2394 |
| EASYTAXI | 0.97 | 0.98 | 0.97 | 2384 |
| EBAY | 0.90 | 0.95 | 0.92 | 2397 |
| EDONKEY | 0.93 | 0.62 | 0.74 | 21 |
| FACEBOOK | 0.98 | 0.96 | 0.97 | 2391 |
| FTP_CONTROL | 1.00 | 1.00 | 1.00 | 5 |
| FTP_DATA | 0.93 | 0.94 | 0.94 | 2352 |
| GMAIL | 0.97 | 0.96 | 0.97 | 2413 |
| GOOGLE | 0.72 | 0.72 | 0.72 | 2432 |
| GOOGLE_MAPS | 0.90 | 0.94 | 0.92 | 2424 |
| HTTP | 0.78 | 0.85 | 0.82 | 2415 |
| HTTP_CONNECT | 0.77 | 0.79 | 0.78 | 2335 |
| HTTP_DOWNLOAD | 0.95 | 0.96 | 0.95 | 2410 |
| HTTP_PROXY | 0.75 | 0.74 | 0.74 | 2351 |
| INSTAGRAM | 0.93 | 0.94 | 0.94 | 2433 |
| IP_ICMP | 1.00 | 1.00 | 1.00 | 2401 |
| MAIL_IMAPS | 0.00 | 0.00 | 0.00 | 2 |
| MICROSOFT | 0.82 | 0.81 | 0.82 | 2397 |
| MQTT | 0.95 | 0.96 | 0.96 | 2372 |
| MSN | 0.77 | 0.65 | 0.71 | 2345 |
| MSSQL | 0.00 | 0.00 | 0.00 | 3 |
| MS_ONE_DRIVE | 0.92 | 0.97 | 0.94 | 2387 |
| NETFLIX | 0.88 | 0.94 | 0.91 | 2366 |
| NTP | 1.00 | 1.00 | 1.00 | 2381 |
| OFFICE_365 | 0.89 | 0.91 | 0.90 | 2466 |
| SKYPE | 0.87 | 0.84 | 0.85 | 2409 |
| SNMP | 1.00 | 1.00 | 1.00 | 1 |
| SPOTIFY | 0.95 | 0.97 | 0.96 | 2403 |
| SSH | 0.99 | 1.00 | 0.99 | 2395 |
| SSL | 0.90 | 0.89 | 0.90 | 2368 |
| SSL_NO_CERT | 0.92 | 0.94 | 0.93 | 2437 |
| STARCRAFT | 0.00 | 0.00 | 0.00 | 2 |
| TEAMSPEAK | 0.00 | 0.00 | 0.00 | 2 |
| TEAMVIEWER | 0.99 | 1.00 | 0.99 | 2414 |
| TOR | 0.97 | 0.97 | 0.97 | 2406 |
| TWITCH | 1.00 | 0.62 | 0.76 | 13 |
| TWITTER | 0.86 | 0.74 | 0.80 | 2380 |
| UBUNTUONE | 0.99 | 1.00 | 0.99 | 2468 |
| WAZE | 0.73 | 0.73 | 0.73 | 15 |
| WHATSAPP | 0.92 | 0.93 | 0.92 | 2397 |
| WHOIS_DAS | 0.00 | 0.00 | 0.00 | 1 |
| WIKIPEDIA | 0.93 | 0.95 | 0.94 | 2365 |
| WINDOWS_UPDATE | 0.88 | 0.85 | 0.86 | 2451 |
| YAHOO | 0.95 | 0.92 | 0.93 | 2429 |
| YOUTUBE | 0.84 | 0.77 | 0.80 | 2395 |

| **Accuracy** | | | **0.92** | 100911 |


| **Macro avg** | **0.70** | **0.69** | **0.69**| 100911 |
| **Weighted avg** | **0.91** | **0.92** | **0.91**| 100911 |

Table 3.1 - Classification Report of KNN

Random Forest Machine Learning Algorithm-


Figure 3.6 - Random Forest Algorithm flow chart

The Random Forest model notebook demonstrates a workflow for building and evaluating decision tree and random forest models using the scikit-learn Python library. It starts by importing the necessary libraries, such as pandas and scikit-learn, along with visualization tools like seaborn and matplotlib. The dataset is loaded and the features are selected, while the target variable ('ProtocolName') is isolated. The data is divided into training and test sets (70% training, 30% test). A decision tree classifier is trained and evaluated, and its structure (number of nodes, depth) is analyzed. A random forest model is then created, its hyperparameters are tuned (maximum depth 60, 100 trees, and the entropy criterion), and it is evaluated. Model performance is measured using metrics such as accuracy, precision, recall, and F1-score. To assess the predictions, the confusion matrix is visualized and the feature importances are plotted.
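
A minimal sketch of this workflow with scikit-learn, using the hyperparameters mentioned above, is shown below; X and y are assumed from the preprocessing sketch, and the surrounding notebook code may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X (features) and y (encoded "ProtocolName") as in the preprocessing sketch
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline decision tree; inspect its structure after fitting
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Tree nodes:", dt.tree_.node_count, "depth:", dt.get_depth())

# Random forest with the tuned hyperparameters mentioned above
rf = RandomForestClassifier(n_estimators=100, max_depth=60, criterion="entropy",
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```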

| Class | Precision | Recall | F1-Score | Support |


|------------------|-----------|---------|----------|----------------|
| 99TAXI | 0.00 | 0.00 | 0.00 | 1 |
| AMAZON | 0.99 | 0.99 | 0.99 | 2381 |
| APPLE | 1.00 | 0.99 | 1.00 | 2384 |
| APPLE_ICLOUD | 1.00 | 1.00 | 1.00 | 2380 |
| APPLE_ITUNES | 1.00 | 1.00 | 1.00 | 2430 |
| CITRIX | 1.00 | 0.25 | 0.40 | 4 |
| CITRIX_ONLINE | 1.00 | 1.00 | 1.00 | 2 |
| CLOUDFLARE | 1.00 | 1.00 | 1.00 | 2447 |
| CONTENT_FLASH | 1.00 | 1.00 | 1.00 | 2454 |
| DEEZER | 1.00 | 0.70 | 0.82 | 20 |
| DNS | 0.99 | 1.00 | 1.00 | 2356 |
| DROPBOX | 1.00 | 0.99 | 0.99 | 2394 |
| EASYTAXI | 1.00 | 1.00 | 1.00 | 2384 |
| EBAY | 1.00 | 0.99 | 0.99 | 2397 |
| EDONKEY | 1.00 | 0.90 | 0.95 | 21 |
| FACEBOOK | 1.00 | 0.98 | 0.99 | 2391 |
| FTP_CONTROL | 1.00 | 1.00 | 1.00 | 5 |
| FTP_DATA | 0.99 | 1.00 | 1.00 | 2352 |
| GMAIL | 1.00 | 0.99 | 0.99 | 2413 |
| GOOGLE | 0.94 | 0.99 | 0.97 | 2432 |
| GOOGLE_MAPS | 1.00 | 1.00 | 1.00 | 2424 |
| HTTP | 0.99 | 1.00 | 0.99 | 2415 |
| HTTP_CONNECT | 0.98 | 0.99 | 0.99 | 2335 |
| HTTP_DOWNLOAD | 1.00 | 1.00 | 1.00 | 2410 |
| HTTP_PROXY | 0.98 | 1.00 | 0.99 | 2351 |
| INSTAGRAM | 1.00 | 0.99 | 0.99 | 2433 |
| IP_ICMP | 1.00 | 1.00 | 1.00 | 2401 |
| LOTUS_NOTES | 0.00 | 0.00 | 0.00 | 1 |
| MAIL_IMAPS | 0.00 | 0.00 | 0.00 | 2 |
| MICROSOFT | 0.98 | 0.99 | 0.99 | 2397 |
| MQTT | 1.00 | 1.00 | 1.00 | 2372 |
| MSN | 0.99 | 0.98 | 0.99 | 2345 |
| MSSQL | 0.00 | 0.00 | 0.00 | 3 |
| MS_ONE_DRIVE | 0.99 | 0.99 | 0.99 | 2387 |
| NETFLIX | 1.00 | 0.99 | 0.99 | 2366 |
| NTP | 1.00 | 1.00 | 1.00 | 2381 |
| OFFICE_365 | 1.00 | 0.99 | 0.99 | 2466 |
| SKYPE | 0.99 | 0.98 | 0.99 | 2409 |
| SPOTIFY | 1.00 | 1.00 | 1.00 | 2403 |
| SSH | 1.00 | 1.00 | 1.00 | 2395 |
| SSL | 0.97 | 0.99 | 0.98 | 2368 |
| SSL_NO_CERT | 1.00 | 1.00 | 1.00 | 2437 |
| STARCRAFT | 0.00 | 0.00 | 0.00 | 2 |
| TEAMSPEAK | 0.00 | 0.00 | 0.00 | 2 |
| TEAMVIEWER | 1.00 | 1.00 | 1.00 | 2414 |
| TOR | 1.00 | 1.00 | 1.00 | 2406 |
| TWITCH | 1.00 | 0.54 | 0.70 | 13 |
| TWITTER | 0.98 | 0.98 | 0.98 | 2380 |
| UBUNTUONE | 1.00 | 1.00 | 1.00 | 2468 |
| UNENCRYPED_JABBER| 1.00 | 0.89 | 0.94 | 9 |
| WAZE | 1.00 | 0.73 | 0.85 | 15 |
| WHATSAPP | 0.99 | 1.00 | 0.99 | 2397 |
| WIKIPEDIA | 0.99 | 1.00 | 0.99 | 2365 |
| WINDOWS_UPDATE | 0.99 | 0.99 | 0.99 | 2451 |
| YAHOO | 0.99 | 0.98 | 0.98 | 2429 |
| YOUTUBE | 0.99 | 0.96 | 0.98 | 2395 |

| **Accuracy** | | | **0.99** | 100911 |


| **Macro avg** | 0.81 | 0.78 | 0.79 | 100911 |
| **Weighted avg** | 0.99 | 0.99 | 0.99 | 100911 |

Table 3.2 Classification Report of Random Forest


Chapter 4 RESULTS

ANALYSIS

KNN Result Analysis-

Figure 4.1 Confusion Matrix of KNN

KNeighborsClassifier() is used to initialize the KNN model, and a grid search is configured over different values of K and weighting schemes. The model is trained on the training data and its performance is then evaluated on the test set using accuracy, precision, recall, and F1-score. The KNN model achieves 91% accuracy.


Figure 4.2 - Confusion Matrix of Some Classes

Random Forest Result Analysis-

Figure 4.3 Confusion Matrix Of Random Forest Classifier


Figure 4.4 - Top 5 Classes of Confusion Matrix

In this model, the dataset is loaded and the features are selected, while the target variable ('ProtocolName') is isolated. The data is divided into training and test sets (70% training, 30% test). A decision tree classifier is trained and evaluated, and its structure (number of nodes, depth) is analyzed. A random forest model is then created, its hyperparameters are tuned (maximum depth 60, 100 trees, and the entropy criterion), and it is evaluated. Model performance is measured using accuracy, precision, recall, and F1-score. To assess the predictions, the confusion matrix is visualized and the feature importances are plotted.
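
A short sketch of how such a confusion matrix can be computed and visualized is shown below; a seaborn heatmap is one option among several, and the fitted rf model and test split are assumed from the Chapter 3 sketches.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# rf, X_test, y_test come from the Random Forest training sketch in Chapter 3
cm = confusion_matrix(y_test, rf.predict(X_test))

plt.figure(figsize=(12, 10))
sns.heatmap(cm, cmap="Blues", square=True)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Confusion matrix of the Random Forest classifier")
plt.show()
```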
Figure 4.5 - Feature Importance Of Random Forest Classifier

The bar chart in the figure shows the feature importances of the Random Forest classifier. Each feature is represented on the x-axis, and its importance value on the y-axis. The "L7Protocol" feature is the most important and stands out from the others, which shows that "L7Protocol" plays a major role in the model's predictions. The remaining features have small importance values, forming a long tail of weak features. This suggests that efforts to improve or simplify the model can be focused on these key features.
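
A brief sketch of how such a plot can be produced from a fitted forest is given below, assuming the rf model and the feature DataFrame X from the earlier sketches.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Feature importances from the fitted RandomForestClassifier, sorted in descending order
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

importances.head(20).plot(kind="bar", figsize=(12, 4))
plt.ylabel("Importance")
plt.title("Top feature importances of the Random Forest classifier")
plt.tight_layout()
plt.show()
```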
Chapter 5

Conclusion

In this research, we use the Dataset-Unicauca-Version2-87Atts dataset, which is publicly available on the Internet. The dataset was collected in a network section of Universidad Del Cauca, Popayán, Colombia by performing packet captures at different hours, during morning and afternoon, over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017, for a total of 3,577,296 instances. We apply data preprocessing to this dataset, and on the cleaned (but imbalanced) data we apply two machine learning algorithms: K-Nearest Neighbors and Random Forest. First, we train a K-Nearest Neighbors (KNN) model that classifies the network traffic and predicts the traffic class from the features; the KNN model achieves 91% accuracy. Second, we train a DecisionTreeClassifier, which reaches 99% accuracy. We then train a RandomForestClassifier that predicts the class of new data and also achieves 99% accuracy. For effective network traffic classification, a clean dataset is very important, and our future work gives priority to the quality of the network traffic data.
