Internship Report

Uploaded by

LIKHITH
Cyberattack Detection Using Machine Learning Models

1. Introduction:
Cybersecurity is a critical aspect of any organization's infrastructure, as it involves protecting
networks, systems, and data from cyberattacks. As cyber threats become more sophisticated, machine
learning (ML) has emerged as a powerful tool for detecting and responding to these attacks. In this
project, we applied various machine learning algorithms to a network traffic dataset to predict and
detect cyberattacks. The goal was to develop a model that could identify attack types and classify
traffic as either normal or suspicious.
2. Dataset Overview:
The dataset used in this project contains network traffic data and features related to the
communication between source and destination devices. The dataset consists of the following
features:
• Timestamp: The time when a network packet was captured.
• Source_IP: The source IP address from which the traffic originated.
• Destination_IP: The destination IP address to which the traffic was sent.
• Protocol: The communication protocol used (e.g., TCP, UDP).
• Packet_Length: The length of the packet transmitted.
• Duration: The duration of the session.
• Source_Port: The source port used for the communication.
• Destination_Port: The destination port.
• Bytes_Sent: The number of bytes sent in the packet.
• Bytes_Received: The number of bytes received.
• Flags: The flags associated with the network packet.
• Flow_Packets/s: The rate of packet flow per second.
• Flow_Bytes/s: The rate of byte flow per second.
• Avg_Packet_Size: The average size of the packets in the flow.
• Total_Fwd_Packets: The total number of forward packets in the flow.
• Total_Bwd_Packets: The total number of backward packets in the flow.
• Fwd_Header_Length: The length of the header for forward packets.
• Bwd_Header_Length: The length of the header for backward packets.
• Sub_Flow_Fwd_Bytes: The total number of forward bytes in the sub-flow.
• Sub_Flow_Bwd_Bytes: The total number of backward bytes in the sub-flow.
• Inbound: Whether the traffic is inbound (binary value).
• Attack_Type: The type of attack (e.g., DDoS, intrusion).
• Label: The target variable indicating whether the traffic is normal or suspicious.
These features were used to train various machine learning models to classify traffic and detect
potential cyberattacks.
3. Preprocessing and Feature Engineering:
Before training the models, the dataset was preprocessed to handle missing values, encode categorical
variables, and scale numeric features. The following steps were performed:
• Categorical Encoding: Categorical features like Source_IP, Destination_IP, Protocol, and
Flags were encoded using LabelEncoder. This step transformed the categorical values into
numerical representations, making them suitable for machine learning models.
• Date-time Conversion: The Timestamp feature was converted to a Unix timestamp (seconds
since the Unix epoch) so that the models could consume it as a numeric value.
• Data Scaling: Numeric features like Bytes_Sent, Bytes_Received, and others were scaled
using StandardScaler to ensure all features were on the same scale and improve model
performance.
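The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample rows, the choice of columns, and the assumption that timestamps are in UTC are all hypothetical, and only the use of LabelEncoder and StandardScaler comes from the report.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny hypothetical sample; the real dataset is not reproduced here.
df = pd.DataFrame({
    "Timestamp": ["2024-01-01 00:00:00", "2024-01-01 00:00:05", "2024-01-01 00:00:09"],
    "Source_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Protocol": ["TCP", "UDP", "TCP"],
    "Bytes_Sent": [1200.0, 300.0, 4500.0],
    "Bytes_Received": [800.0, 150.0, 5200.0],
})

# Categorical encoding: one LabelEncoder per column, kept so the same
# mapping can be reapplied to new data later.
encoders = {}
for col in ["Source_IP", "Protocol"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Date-time conversion: whole seconds since the Unix epoch
# (assumption: the timestamps are in UTC).
ts = pd.to_datetime(df["Timestamp"], utc=True)
epoch = pd.Timestamp("1970-01-01", tz="UTC")
df["Timestamp"] = (ts - epoch) // pd.Timedelta(seconds=1)

# Scaling: zero mean and unit variance for each numeric column.
numeric_cols = ["Bytes_Sent", "Bytes_Received"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df)
```

When the data is split into training and test sets, the encoders and the scaler would normally be fitted on the training portion only and then applied to the test portion, so that test statistics do not influence the transformation.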
4. Model Training and Evaluation:
Several machine learning models were trained on the preprocessed dataset, including:
i. K-Nearest Neighbors (KNN): A classification algorithm that predicts the label of a data point
based on the majority class of its nearest neighbors.
ii. Logistic Regression: A linear classifier used to model the probability of an attack.
iii. Decision Tree Classifier: A tree-based model used to classify traffic based on feature splits.
iv. Support Vector Machine (SVM): A classification model that separates the classes with a
maximum-margin hyperplane.
v. Random Forest Classifier: An ensemble method that uses multiple decision trees to improve
classification accuracy.
vi. Neural Networks: A deep learning model with multiple layers designed to capture complex
patterns in the data.
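A minimal sketch of how the classical models listed above might be trained and compared with scikit-learn. The synthetic data from make_classification stands in for the preprocessed traffic features, and every hyperparameter shown (n_neighbors, max_iter, kernel, n_estimators, the 80/20 split) is an assumption rather than the project's actual setting. The neural network is omitted to keep the example framework-independent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the binary Label (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hyperparameters are illustrative defaults, not the values used in the report.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the models in a dictionary means every classifier is trained and scored on exactly the same split, which makes the resulting numbers directly comparable.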
Each model was evaluated using metrics such as accuracy, precision, recall, and F1-score. These
metrics help assess the models' performance in detecting cyberattacks and distinguishing them from
normal traffic.
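The four metrics can all be derived from a confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, and that precision, recall, and F1-score are support-weighted averages over the classes; the function name and the zero_division parameter are hypothetical. The first example matrix is the KNN matrix reported in Section 5; the second is a matrix in which one class is never predicted, which shows how the handling of an undefined precision changes the average.

```python
def weighted_metrics(cm, zero_division=0.0):
    """Return (accuracy, precision, recall, f1) for a square confusion matrix.

    Rows are true classes, columns are predicted classes. Precision, recall
    and F1 are averaged over classes, weighted by each class's support.
    `zero_division` is the value used when a ratio has a zero denominator.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    precision = recall = f1 = 0.0
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                           # samples whose true class is k
        predicted = sum(cm[i][k] for i in range(n))    # samples predicted as class k
        p = tp / predicted if predicted else zero_division
        r = tp / support if support else zero_division
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        f = 2 * tp / (support + predicted) if (support + predicted) else zero_division
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


knn = [[65, 87], [56, 78]]
print([round(v, 3) for v in weighted_metrics(knn)])

# A model that never predicts class 1: class 1 precision is 0/0.
single_class = [[152, 0], [134, 0]]
print([round(v, 3) for v in weighted_metrics(single_class, zero_division=0.0)])
print([round(v, 3) for v in weighted_metrics(single_class, zero_division=1.0)])
```

Weighted recall always equals accuracy, because each class's recall is weighted by its share of the samples. The last two calls differ only in precision, which is why a precision figure for a model that never predicts one class should be read together with its confusion matrix.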
5. Model Evaluation:
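The test-set results for each model are listed below. Each confusion matrix covers 286 test samples
(152 of class 0 and 134 of class 1), with rows as true classes and columns as predicted classes.
Precision, recall, and F1-score are support-weighted averages over the two classes: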
(a) K-Nearest Neighbors (KNN)
• Confusion Matrix: [[65, 87], [56, 78]]
• Accuracy: 50.00%
• Precision: 0.507
• Recall: 0.50
• F1-Score: 0.498
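(b) Logistic Regression
• Confusion Matrix: [[152, 0], [134, 0]]
• Accuracy: 53.15%
• Precision: 0.751
• Recall: 0.531
• F1-Score: 0.369
• Note: The model predicted class 0 for every test sample, so its accuracy equals the share of
class 0 in the test set (152/286). Class 1 precision is undefined because no sample was
predicted as class 1; the reported 0.751 is consistent with counting that undefined value as 1,
and the weighted precision falls to 0.282 if it is counted as 0.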
(c) Decision Tree Classifier
• Confusion Matrix: [[79, 73], [69, 65]]
• Accuracy: 50.35%
• Precision: 0.504
• Recall: 0.50
• F1-Score: 0.504
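(d) Support Vector Machine (SVM)
• Confusion Matrix: [[152, 0], [134, 0]]
• Accuracy: 53.15%
• Precision: 0.751
• Recall: 0.531
• F1-Score: 0.369
• Note: The SVM also predicted class 0 for every test sample, so these figures reflect the class
balance of the test set rather than any learned separation. The 0.751 precision depends on
counting the undefined class 1 precision as 1.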
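(e) Random Forest Classifier
• Confusion Matrix: [[81, 71], [81, 53]]
• Accuracy: 46.85%
• Precision: 0.466 (weighted average, computed from the confusion matrix)
• Recall: 0.469 (weighted average, computed from the confusion matrix)
• F1-Score: 0.467 (weighted average, computed from the confusion matrix)
• Classification Report:
• Precision: 0.50 for Class 0, 0.43 for Class 1
• Recall: 0.53 for Class 0, 0.40 for Class 1
• F1-Score: 0.52 for Class 0, 0.41 for Class 1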
(f) Neural Network (Basic Architecture)
• Test Accuracy: 100.00%
• The neural network, with a simple dense layer architecture, classified every test sample
correctly.
(g) Neural Network with Dropout (Overfitting Prevention)
• Test Accuracy: 100.00%
• Dropout layers were added as a regularizer against overfitting. Test accuracy was already
100.00% without them, so this result cannot show any additional gain in generalization.
(h) Neural Network with Convolutional (CNN) Layers
• Test Accuracy: 100.00%
• Adding convolutional layers also yielded 100.00% test accuracy, matching the two dense
models.
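The dropout regularization used in model (g) can be illustrated without a deep learning framework. The sketch below shows inverted dropout, the variant used by common frameworks such as Keras: during training each activation is zeroed with probability `rate` and the survivors are rescaled, and at inference the layer passes its input through unchanged. The function name, the list-based representation, and the rate of 0.5 are assumptions for illustration and do not reproduce the project's network.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Apply inverted dropout to a list of activations.

    During training, each value is zeroed with probability `rate` and the
    remaining values are scaled by 1 / (1 - rate) so that the expected
    output equals the input. Outside training, the input is returned as is.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                  # fixed seed so the run is repeatable
layer_output = [1.0] * 10000            # hypothetical activations from a dense layer
dropped = dropout(layer_output, rate=0.5, rng=rng)

zeroed = sum(1 for v in dropped if v == 0.0)
print(f"zeroed fraction: {zeroed / len(dropped):.3f}")
print(f"mean after dropout: {sum(dropped) / len(dropped):.3f}")
```

Because a different random subset of units is silenced on every training step, no single unit can be relied on exclusively, which is the mechanism behind dropout's regularizing effect.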
6. Feature Analysis and Visualization:
To better understand the relationships between features and model predictions, several visualizations
were generated:
• Effect of Dropout on Model Generalization: Dropout is a regularization technique that helps
improve model generalization by introducing randomness during training. By randomly
deactivating neurons, dropout prevents individual neurons from becoming overly reliant on
specific features in the training data.

The graph illustrates the performance of a neural network during training. The left plot shows accuracy,
where the training accuracy generally increases while the validation accuracy plateaus after a
few epochs, indicating potential overfitting. The right plot shows loss, where both training and
validation loss decrease initially, but the validation loss starts to increase after some point, again
suggesting overfitting. This behaviour is common in neural networks and highlights the
importance of techniques like dropout, as implemented in the code with Dropout(0.5) layers, to
mitigate overfitting and improve generalization.

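• Model Performance Analysis (CNN Model): The left graph shows the model reaching perfect
accuracy on both the training and validation sets after one epoch. The right graph shows a rapid
decline in loss for both training and validation, stabilizing near zero. Matching training and
validation curves are not in themselves a sign of overfitting, which would appear as a gap
between the two. A perfect score this early more likely indicates that the labels are trivially
recoverable from the inputs, so the features should be checked for leakage (for example, whether
Attack_Type directly encodes the Label) before this result is taken as evidence of generalization.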
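• Correlation Matrix Heatmap: A heatmap was used to visualize the correlations between
numerical features. Features like Bytes_Sent, Bytes_Received, and Flow_Packets/s showed
strong correlations. Strong correlation between features indicates that they carry overlapping
information; it does not by itself establish their importance for predicting cyberattacks,
which depends on their relationship with the Label.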

• Stacked Bar Chart: A stacked bar chart was created to display the distribution of attack types
across different source IP addresses. This helped identify patterns related to specific attack
types originating from particular sources.
• Network Graph: A network graph was generated to visualize the interaction between source
and destination ports. This graph illustrated how network traffic flows between different ports
and provided insights into potential attack vectors.

• Web Traffic Analysis Over Time: The graph illustrates the variation in bytes sent and
received over time, displaying high fluctuations. The consistent peaks and troughs suggest
dynamic web traffic patterns. Bytes sent and received follow similar trends, indicating
synchronous data exchange. The detailed time-based visualization aids in identifying patterns,
anomalies, or potential bottlenecks in network performance.
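The matrix behind the correlation heatmap described earlier in this section can be sketched in plain Python. The feature values below are hypothetical and chosen only to show one strongly correlated pair; the report does not list the actual data, and the helper name `pearson` is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical values for three numeric features.
features = {
    "Bytes_Sent": [100.0, 200.0, 300.0, 400.0],
    "Bytes_Received": [110.0, 190.0, 320.0, 390.0],
    "Duration": [4.0, 1.0, 3.0, 2.0],
}

# Pairwise correlation matrix; a heatmap colours each cell by its value.
names = list(features)
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(16), "  ".join(f"{corr[a][b]:+.2f}" for b in names))
```

Adding the encoded Label as one more entry in `features` would show how strongly each feature is associated with the target, which is the comparison needed to judge predictive relevance.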
7. Conclusion:
The three neural network models all reached 100.00% test accuracy in detecting cyberattacks, whereas
the traditional machine learning models (KNN, Logistic Regression, SVM, Decision Tree, and Random
Forest) scored between 46.85% and 53.15%, close to chance for a two-class problem. The neural
network models therefore showed significant potential for learning complex patterns from the network
traffic data, although the training curves discussed in Section 6 leave open questions about
generalization.
• Top Performing Models: The basic, Dropout, and convolutional neural networks tied at perfect
test accuracy. The Dropout variant additionally includes regularization against overfitting.
