Internship Report
On
Network Traffic Classification Using Machine Learning
Algorithms
Submitted by
BACHELOR OF TECHNOLOGY
IN
BRANCH OF STUDY
CHAPTER 1. INTRODUCTION
1.1. Machine Learning Algorithms
1.2. KNN
1.3. ANN
3.6. Methodology
A graphical abstract for this project can visually represent the flow and
1. Data Preprocessing:
2. Import Libraries:
4. Traffic Classification:
o Using Dataset-Unicauca-versions for dataset to train the model and
ML: Machine Learning
Network traffic classification is of great interest to internet service providers
(ISPs) and network operators, who need to identify the types of data flowing in the network and
map them to the applications that generate them. This knowledge is important for many reasons,
including monitoring network security and network applications behavior, performing traffic
users’ traffic demands for supporting policing and prioritization mechanisms, as well as
Reliably classifying the network traffic is also crucial for automation of network operations,
where the prediction of traffic demands and flow matrices, the development of realistic traffic
models (Xu et al., 2005) and the detection of anomalous behaviors (D’Angelo et al., 2015;
network management frameworks. Possible actions and countermeasures associated with traffic
class monitoring may include filtering/blocking unwanted flows, starting lawful interception
Many techniques are used for traffic classification, but the very first one was the port-based
technique. In this technique, network applications first register their ports in the Internet
use of dynamic port numbers as an alternative to well-known port numbers. The other one is
challenging to inspect the encrypted packet data. In 2016, numerous supervised and unsupervised
machine learning (ML) methods were applied efficiently to traffic classification. Using
Bayesian analysis techniques and a Bayesian neural network, Moore built data sets with 248
statistical features for traffic classification in 2005 and achieved high classification
accuracy with them. In our previous study, we used two different network trace datasets,
classified WeChat application traffic, and obtained very promising accuracy with ML
classifiers. In this study, we use three machine learning (ML) classifiers for network traffic
identification, training on samples from broad categories of network applications in order to
classify unknown traffic into application classes such as WWW, IM, MAIL, P2P, TELNET, FTP,
and DNS. We combine two data sets, NIMS and HIT. The NIMS data sets are publicly available on
the internet, and the HIT data sets were developed by us. Our main aim in this research study is
to achieve high accuracy and to improve the accuracy of all the selected classifiers.
To achieve high accuracy results, it is very important to have high quality data sets. In this
location servers. We used twenty-two (22) extracted features and three (3) different types of
machine learning classifiers to classify DNS, FTP, TELNET, P2P, WWW, IM, and MAIL
applications.
The K-Nearest Neighbors (KNN) algorithm is a simple and intuitive machine learning technique
used for classification and regression tasks. KNN works on the principle of similarity: it classifies
or predicts the output for a new data point based on how its nearest neighbors are classified or
valued.
Working of KNN:
Choose K: Select the number of nearest neighbors (K). This is a hyperparameter that you choose
manually. It represents how many nearby data points will be considered when making a
prediction.
Calculate Distance: For each new data point, KNN calculates the distance between that point
and all other points in the dataset. Common distance metrics include Euclidean distance,
Identify Neighbors: The algorithm identifies the K data points (neighbors) that are closest to the
Vote or Average:
● For Classification: The algorithm takes a majority vote of the K nearest neighbors’
classes (labels). The new data point is assigned to the class that is most common among
the K neighbors.
● For Regression: The algorithm averages the values of the K nearest neighbors and
assigns this average as the predicted value for the new data point
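The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this project: the function names, the toy flow records (mean packet size, flow duration) and their labels are all made up for the example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query point."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average of the target values of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Hypothetical flows: (mean packet size in bytes, flow duration in seconds).
X = [(60, 0.1), (70, 0.2), (65, 0.15), (1400, 30.0), (1350, 25.0), (1450, 40.0)]
labels = ["DNS", "DNS", "DNS", "P2P", "P2P", "P2P"]
durations = [0.1, 0.2, 0.15, 30.0, 25.0, 40.0]

print(knn_classify(X, labels, (80, 0.3), k=3))    # label voted by the 3 nearest flows
print(knn_regress(X, durations, (80, 0.3), k=3))  # mean duration of the 3 nearest flows
```

Because every prediction sorts the whole training set by distance, the cost grows with the dataset size, and features on a larger numeric scale (here, packet size) dominate the distance unless the data is normalized first.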
Pros:
● Can be computationally expensive with large datasets (as it calculates distances for all
points).
● Performance can degrade if data isn't normalized or if irrelevant features are present.
The Artificial Neural Network (ANN) is a machine learning model inspired by the
structure and function of the human brain. It is primarily used for complex tasks like image
Key Concept:
An ANN is made up of layers of connected nodes (neurons), which process input data and learn
patterns to make predictions or classifications. The learning happens by adjusting the weights of
Structure of ANN:
● Input Layer: This layer receives the raw data (features) that are used for prediction or
● Hidden Layer(s): These are intermediate layers where the actual computation happens.
Each neuron in a hidden layer receives inputs from the previous layer, processes them
non-linearity.
● Activation Function: Functions like ReLU (Rectified Linear Unit), Sigmoid, or Tanh are
commonly used to transform the output of each neuron. This allows the network to learn
classification tasks, the number of output neurons corresponds to the number of classes.
Steps in ANN:
● Forward Propagation: Data passes through the network from the input layer, through
hidden layers, to the output layer. Each neuron computes a weighted sum of inputs and
applies the activation function, passing the result to the next layer.
● Loss Calculation: The predicted output is compared to the actual output (ground truth)
using a loss function (e.g., mean squared error for regression or cross-entropy for
classification). The loss measures how far the model's predictions are from the true
values.
● Backpropagation: To reduce the loss, the network adjusts the weights using
backpropagation. This process calculates the gradient of the loss with respect to each
weight and updates the weights using an optimization algorithm (like gradient descent).
repeated over many iterations (epochs) until the model converges, i.e., the loss is
minimized.
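The three steps (forward propagation, loss calculation, backpropagation) can be made concrete with a very small network written in plain Python. This is a sketch under stated assumptions, not the model used in this project: one hidden layer of three sigmoid neurons, a single sigmoid output, binary cross-entropy loss, and full-batch gradient descent. The function names, the learning rate, and the toy dataset (logical OR) are chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation through one hidden layer to a single output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def mean_loss(X, y, params):
    """Loss calculation: binary cross-entropy averaged over all samples."""
    total = 0.0
    for x, t in zip(X, y):
        _, p = forward(x, *params)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(X)

def train(X, y, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train with full-batch gradient descent; returns parameters and loss history."""
    rng = random.Random(seed)
    n_in, n = len(X[0]), len(X)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    losses = [mean_loss(X, y, (W1, b1, W2, b2))]
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        # Backpropagation: accumulate the gradient of the loss for every weight.
        for x, t in zip(X, y):
            hidden, p = forward(x, W1, b1, W2, b2)
            d_out = p - t  # gradient at the output for sigmoid + cross-entropy
            gb2 += d_out
            for j in range(n_hidden):
                gW2[j] += d_out * hidden[j]
                d_hid = d_out * W2[j] * hidden[j] * (1.0 - hidden[j])
                gb1[j] += d_hid
                for i in range(n_in):
                    gW1[j][i] += d_hid * x[i]
        # Gradient descent step on the averaged gradients.
        for j in range(n_hidden):
            W2[j] -= lr * gW2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i] / n
        b2 -= lr * gb2 / n
        losses.append(mean_loss(X, y, (W1, b1, W2, b2)))
    return (W1, b1, W2, b2), losses

# Toy dataset (logical OR), used only to show the loss going down over the epochs.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 1, 1, 1]
params, losses = train(X, y)
print("loss before training:", losses[0])
print("loss after training: ", losses[-1])
```

In practice a library such as Keras or PyTorch computes these gradients automatically; the explicit loops are shown here only to expose what each step does.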
Pros:
● Can be used for various tasks, including classification, regression, and even generative
modeling.
Cons:
● Requires large amounts of data to perform well.
● Tuning the model (e.g., choosing the number of layers, neurons, and hyperparameters)
can be complex.
ANNs form the basis for more advanced models like Convolutional Neural Networks (CNNs)
and Recurrent Neural Networks (RNNs), which are specialized for image processing and
Random Forest is an ensemble machine learning algorithm generally used for classification and
regression tasks. It builds multiple decision trees and combines their outputs to
Key idea:
A random forest consists of a large number of individual decision trees that work together as
an ensemble. Each tree in the forest produces a prediction, and the final output is determined
by averaging (for regression) or majority voting (for classification) across all trees.
● Bootstrap Sampling: From the original dataset, multiple subsets are created by
randomly sampling the data with replacement (this technique is known as bagging). Each
● For every tree, instead of considering all features at each split, Random Forest
randomly selects a subset of features. This introduces diversity among the trees,
● Aggregation:
Classification: each tree in the forest makes a classification (votes). The class with
the majority of votes across all trees is selected as the final prediction.
Regression: each tree makes a numerical prediction, and the final prediction is the
average of the predictions of all trees.
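The bootstrap sampling, random feature selection, and majority-vote aggregation described above can be illustrated with a deliberately small sketch in plain Python. It is not the RandomForestClassifier used later in this report: each "tree" here is only a depth-1 decision stump, and the function names and toy flow data are assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap_sample(X, y, rng):
    """Bagging: draw len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y, feature_ids):
    """Depth-1 tree: choose the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0  # midpoint between neighboring values
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    if best is None:  # no split possible: always predict the most common label
        lab = majority(y)
        return (0, float("inf"), lab, lab)
    return best[1:]

def predict_stump(stump, row):
    f, thr, l_lab, r_lab = stump
    return l_lab if row[f] <= thr else r_lab

def fit_forest(X, y, n_trees=25, n_sub_features=1, seed=0):
    """Each tree sees its own bootstrap sample and a random subset of the features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        Xb, yb = bootstrap_sample(X, y, rng)
        feats = rng.sample(range(len(X[0])), n_sub_features)
        forest.append(fit_stump(Xb, yb, feats))
    return forest

def predict_forest(forest, row):
    """Aggregation for classification: majority vote across all trees."""
    return majority([predict_stump(s, row) for s in forest])

# Hypothetical flows: (mean packet size in bytes, flow duration in seconds).
X = [(60, 0.1), (70, 0.2), (65, 0.15), (1400, 30.0), (1350, 25.0), (1450, 40.0)]
y = ["DNS", "DNS", "DNS", "P2P", "P2P", "P2P"]
forest = fit_forest(X, y)
print(predict_forest(forest, (80, 0.3)))
print(predict_forest(forest, (1300, 28.0)))
```

For regression, the same structure applies with the vote replaced by the mean of the trees' numerical predictions.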
SURVEY
1. The research paper titled “Granular classifier: building traffic granules for encrypted
of encrypted network traffic classification. This paper addresses the challenges of the
problem such as traffic desperation, multi-level feature classification, and limitations due
encrypted and unencrypted traffic, especially in scenarios where high-speed and short
“Entropy-based and Chi-square” test features for encrypted traffic detection, followed
Xinge Yan, Liukun He, Yifan Xu, Jiuxin Cao, Liangmin Wang, Guyang Xie)
3. The Research paper "QUIC Network Traffic Classification Using Ensemble Machine
(QUIC) protocol traffic using ensemble machine learning methods. This research paper
employs five ensemble learning techniques: Random Forest, Extra Trees, Gradient
Boosting Tree, Extreme Gradient Boosting Tree (XGBT), and Light Gradient Boosting
Machine (LGBM). The models achieved up to 99.40% accuracy. The paper concludes that ensemble
methods, especially XGBT and LGBM, are effective in classifying encrypted traffic
Environment" explores challenges and solutions for identifying network traffic when
sampling has a lot of impact on traffic identification. The paper proposes a Deep Belief
better results than conventional approaches by leveraging machine learning to handle the
5. The Research paper "Improved Feature Selection and Stream Traffic Classification
techniques. This paper introduced the Boruta feature selection method to identify
classifiers: Hoeffding Adaptive Trees (HAT), Adaptive Random Forest and K-Nearest
Neighbors with Adaptive Sliding Window. The Boruta Feature Selection technique
traffic more efficiently. The paper addresses the limitations of traditional methods like
port-based or payload inspection, which struggle with encrypted and dynamic network
protection and the high cost of labeled data, and introduces a novel Deep Packet
Inspection and Domain Name System-based method for labeling network traffic on home
edge devices. This study extracts high-dimensional features from both labeled and
unlabeled data and reduces dependence on labeled samples using Autoencoders and
8. The Research paper "The rise of machine learning for detection and classification of
review of the application of machine learning for malware detection and classification. It
approaches into static, dynamic, and hybrid methods but focuses on Deep Learning
approaches. The study also emphasizes emerging developments like neural networks and
multimodal learning, and concludes by outlining the challenges
researchers face, including concept drift, adversarial learning, and the imbalance in malware
9. The research paper titled "Analysis of Network Traffic using machine learning "
file-sharing activities. The methodology involved using Wireshark for traffic monitoring
10. The research paper titled "Network Traffic Analysis using NLP and MATLAB"
leading to underutilization and reduced academic access. Using Wireshark for monitoring
and MATLAB for data analysis, the study reveals that inadequate bandwidth
and the development of clear internet policies to enhance network efficiency for
network traffic over a 90-day period. The study identifies unproductive applications
need for improved bandwidth management to enhance academic activities. It also stresses
the importance of monitoring tools like Wireshark and suggests implementing an internet
usage policy to prioritize academic-related traffic. (Vanya Ivanova, Tasho Tashev, Ivo
Draganov)
12. The research paper titled “Analyzing Network Traffic to Optimize Bandwidth Usage”
consuming bandwidth. The study used Wireshark for packet sniffing and MATLAB for
data analysis over 90 days, revealing underutilization of the network and the dominance
of peer-to-peer (P2P) traffic. The paper emphasizes the need for better bandwidth
13. The research paper titled “A Deep Learning Approach for Classifying Encrypted
challenge for traditional methods due to the rise of encryption protocols. Distiller
leverages deep learning and multimodal data to enhance performance while maintaining
public dataset, with future research suggesting the inclusion of semi-supervised learning
to further refine its capabilities. (Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri,
Antonio Pescape)
14. The research paper titled "Real-Time Network Traffic Analysis Utilizing AI, ML, and
impressive accuracy of 99.31%. The paper addresses challenges such as handling large
data volumes, ensuring scalability, and minimizing false positives/negatives. Key insights
include the necessity for enhanced security through adaptive machine learning models,
the importance of real-time monitoring for immediate threat detection, and the benefits of
a diverse toolset for effective analysis. The authors emphasize ongoing research to keep
pace with evolving network threats. (Minal Moharir, Mohana, Aschin Dhakad)
15. The research paper titled "Machine Learning Approaches for Traffic Analysis"
focuses on the critical role of machine learning (ML) in enhancing network security and
examines techniques like flow analysis and anomaly detection, and discusses challenges
such as data management and evolving attack types. Future research directions include
16. The research paper titled "Machine Learning in Network Traffic Analysis and
Security" focuses on the pivotal role of machine learning (ML) in enhancing the analysis
The authors review past research, highlight challenges such as data quality and model
address evolving cyber threats. The paper advocates for further studies on recent ML
Hamma)
17. The research paper titled "Network Traffic Classification Using Machine Learning
performance and security through effective traffic classification. The study specifically
compares the Naïve Bayes and K-nearest neighbor (KNN) algorithms, finding that KNN
outperforms other methods in accurately classifying network traffic from live video
feeds. The authors utilized Wireshark for data extraction and emphasized the significance
of feature extraction in achieving high accuracy. The paper concludes that KNN is a
robust choice for network traffic analysis, advocating for further exploration of advanced
18. The survey paper titled "Deep Learning Approaches for Network Traffic
Classification in IoT" explores the use of deep learning techniques to classify network
traffic generated by the rapidly growing number of IoT devices. It reviews various deep
addressing IoT-specific challenges. The paper identifies key issues such as complex
traffic patterns, resource constraints of IoT devices, and limited data availability. It also
discusses the security implications of accurate classification and proposes future research
directions, including lightweight models and techniques for encrypted traffic classification.
19. The research paper titled "Advancements in Network Traffic Analysis Using Machine
learning techniques. It addresses the growing threat landscape of cyber attacks and
proposes an Intrusion Detection System (IDS) utilizing algorithms like Support Vector
Machines (SVM), Random Forest, Convolutional Neural Networks (CNN), and Artificial
Neural Networks (ANN). The study evaluates these methods using the CICIDS2017
dataset, with Random Forest achieving the highest accuracy. The paper advocates for
updated datasets and suggests future research in integrating machine learning with big
FLOW/PROCESS
enabling the identification and categorization of data packets traversing a network. The
literature on this topic identifies several key features that are instrumental in developing
effective machine learning algorithms for traffic classification. These features typically
include flow-level attributes such as packet count, byte count, flow duration, and protocol
type, as well as statistical measures like inter-arrival time and flow entropy. Other
important features may include source and destination IP addresses, port numbers, and
deep packet inspection (DPI) can enhance classification accuracy, allowing for the
For an ideal solution, it is essential to include a comprehensive set of features that captures
both the behavioral patterns of network traffic and the characteristics of the applications
in use. Key features should encompass not only basic flow statistics but also advanced
metrics that can capture anomalies and variations in traffic patterns. These features
should be selected based on their relevance to the classification tasks and their ability to
improve model performance, considering factors such as accuracy, precision, recall, and
F1-score. The solution should also prioritize features that are computationally efficient to
involves navigating several constraints that can impact the overall effectiveness and
applicability of the solution. These constraints can be categorized into several domains:
Data Protection Regulation (GDPR) is critical, especially concerning data privacy and
user consent. The design must ensure that personally identifiable information (PII) is
and operational expenses, must be considered. A cost-effective solution that does not
Environmental and Health Concerns: While less directly relevant, the ecological
must be ensured.
Professional and Ethical Issues: Ethical considerations regarding surveillance and
monitoring must be addressed, ensuring that the classification does not lead to misuse
or invasion of privacy.
The features identified for network traffic classification should be critically evaluated
against the design constraints to finalize the most suitable set for implementation. For
example, while deep packet inspection features may enhance classification accuracy,
they can be resource-intensive and may raise privacy concerns. Consequently, these
landscape. Similarly, while including all available features may seem advantageous, the
Therefore, a careful balance must be struck between feature richness and practical
constraints. Essential features that are computationally light, such as flow statistics and
ethical standards. Features such as encryption flags or DPI data should be included with
The design of the network traffic classification system can be approached through
traffic data, extracting relevant features, and then applying traditional machine
k-Nearest Neighbors (k-NN). The process starts with data preprocessing, including
cleaning and normalization, followed by feature selection. The selected features are
then used to train the machine learning models, which are validated against a test
dataset. The model with the best performance metrics is deployed for real-time
traffic classification.
leveraging neural networks to automatically extract features from raw traffic data. In
packet sequences or time-series data. The process begins with data collection,
followed by transformation into a suitable format for the neural network. The model
optimize performance. This approach may yield higher accuracy due to its ability to
The data preprocessing stage consists of simple steps to prepare the dataset for analysis and
modeling. First, import appropriate libraries such as pandas, numpy, seaborn, and
matplotlib to establish a foundation for data management and visualization. Load the
dataset file and check its structure using functions such as head(), isnull(), and info(),
which help identify data types and missing values. Data cleaning issues are then
tools like Seaborn and Matplotlib help to understand data distribution and relationships.
This process introduces preprocessing techniques like scaling and encoding to ensure that
the data is ready for modeling. The final point highlights the importance of preprocessing
for good data analysis by showing the shapes and differences of the cleaned data.
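A minimal sketch of these preprocessing steps with pandas is given below. The small in-memory table stands in for the real dataset file, and every column name except "ProtocolName" is a hypothetical placeholder; min-max scaling and integer label encoding are assumed here as examples of scaling and encoding.

```python
import pandas as pd

# Hypothetical miniature of the flow dataset; only "ProtocolName" is a column
# name taken from the report, the feature columns are illustrative.
df = pd.DataFrame({
    "Flow.Duration": [120.0, 4500.0, None, 120.0, 98000.0],
    "Total.Fwd.Packets": [3, 40, 12, 3, 900],
    "ProtocolName": ["DNS", "HTTP", "HTTP", "DNS", "YOUTUBE"],
})

# Inspect the structure: first rows, missing values per column, data types.
print(df.head())
print(df.isnull().sum())
df.info()

# Cleaning: drop rows with missing values and exact duplicate rows.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

# Encoding: map each protocol name to an integer class label.
clean["label"] = clean["ProtocolName"].astype("category").cat.codes

# Scaling: min-max scale each numeric feature into the range [0, 1].
feature_cols = ["Flow.Duration", "Total.Fwd.Packets"]
for col in feature_cols:
    col_min, col_max = clean[col].min(), clean[col].max()
    clean[col] = (clean[col] - col_min) / (col_max - col_min)

# Compare the shape before and after cleaning.
print(df.shape, "->", clean.shape)
```

With the real data, the DataFrame would instead come from pandas.read_csv() on the dataset file, and the same inspection and cleaning calls apply unchanged.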
After preprocessing, the data is clean and balanced, and we then apply a different
This model shows the step-by-step process of building a K-nearest neighbor (KNN)
classifier. Start by importing the appropriate libraries, such as scikit-learn for machine learning
and pandas for data processing. Use the pandas.read_csv() function to load the data and
select the feature columns, excluding the "ProtocolName" column. The data is then scaled
initialize the KNN model and configure the grid to search over different values of K and weighting schemes.
The model is trained on the training data and then its performance is evaluated on the test data.
Calculate basic metrics such as accuracy, precision, recall, and F1 score. This model
evaluating decision tree and random forest models using the scikit-learn Python
visualization tools like seaborn and matplotlib. The dataset is loaded and the
features are selected while the target variable ('ProtocolName') is isolated. The
data is divided into training and test sets (70% training, 30% test). The decision
tree classifier is trained, evaluated and its structure (number of nodes, depth)
depth 60, 100 trees and entropy criterion) and evaluated. Model performance is
measured using metrics such as accuracy, precision, recall, and F1 score. To assess
the predictions, the confusion matrix is visualized and the feature importance is
plotted.
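The workflow described above can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the project's actual script: a synthetic dataset from make_classification stands in for the flow features and the 'ProtocolName' target, weighted averaging is assumed for the multi-class metrics, and the plotting of the confusion matrix and feature importances with seaborn and matplotlib is left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features (X) and the 'ProtocolName' target (y).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 70% training, 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Decision tree: train, then inspect its structure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("nodes:", tree.tree_.node_count, "depth:", tree.get_depth())

# Random forest with the hyperparameters named in the text.
forest = RandomForestClassifier(n_estimators=100, max_depth=60,
                                criterion="entropy", random_state=0)
forest.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 score for both models.
results = {}
for name, model in [("decision tree", tree), ("random forest", forest)]:
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="weighted", zero_division=0)
    results[name] = (acc, prec, rec, f1)
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")

# Confusion matrix and feature importances of the random forest.
cm = confusion_matrix(y_test, forest.predict(X_test))
importances = forest.feature_importances_
ranking = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
print(cm)
print("features ranked by importance:", ranking)
```

The scores printed for the synthetic data say nothing about the real dataset; they only show where each metric comes from.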
ANALYSIS
KNeighborsClassifier() is used to initialize the KNN model, and the grid is configured to search
over different values of K and weighting schemes. The model is trained on the training data and then its
In this model the dataset is loaded and the features are selected while the target variable
('ProtocolName') is isolated. The data is divided into training and test sets (70% training,
30% test). The decision tree classifier is trained and evaluated, and its structure (number of
nodes, depth) is analyzed. A random forest model is then created, its hyperparameters are tuned
(max depth 60, 100 trees and entropy criterion), and it is evaluated. Model performance is
measured using metrics such as accuracy, precision, recall, and F1 score. To assess the
predictions, the confusion matrix is visualized and the feature importance is plotted.
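To make the evaluation metrics concrete, the sketch below derives accuracy, precision, recall, and F1 score directly from a confusion matrix in plain Python. The class names and the ten true/predicted labels are invented for the example and are not results from this project.

```python
def confusion_matrix(y_true, y_pred, classes):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

def accuracy(cm):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def per_class_metrics(cm, k):
    """Precision, recall, and F1 score for class index k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for ten flows.
classes = ["DNS", "HTTP", "P2P"]
y_true = ["DNS", "DNS", "DNS", "HTTP", "HTTP", "HTTP", "P2P", "P2P", "P2P", "P2P"]
y_pred = ["DNS", "DNS", "HTTP", "HTTP", "HTTP", "HTTP", "P2P", "P2P", "DNS", "P2P"]

cm = confusion_matrix(y_true, y_pred, classes)
print(cm)
print("accuracy:", accuracy(cm))
for k, name in enumerate(classes):
    print(name, per_class_metrics(cm, k))
```

Reading the matrix row by row shows which classes are confused with each other, which a single accuracy figure hides.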
Figure 4.5 - Feature Importance Of Random Forest Classifier
The bar chart in the figure shows the feature importances of the trained model, a
Random Forest classifier. Each feature is represented on the x-axis, and its importance value on the y-axis.
The "L7Protocol" feature is the most important one and stands out from the other
features. This shows that "L7Protocol" plays an important role in determining the model's
predictions. The remaining features have small values and form a long tail of weak features. This suggests that
efforts to improve or simplify the model can be focused on the few key features.
Chapter 5
Conclusion
over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017. A total of
3,577,296 instances. We apply data preprocessing to this dataset; after data
train a K-Nearest Neighbors (KNN) model that classifies the network traffic and
predicts the network traffic class based on its features. The KNN model achieves 91%
train a RandomForestClassifier that predicts the class of new data, and the model
data set is very important. Our future work gives priority to the network
traffic. For effective network traffic classification, a clean data set is very