Module -IV

The document discusses the use of the NSL-KDD dataset for training machine learning models to detect network intrusions, categorizing attacks into four main types: DOS, R2L, U2R, and probing. It highlights the advantages of the NSL-KDD dataset over the original KDD dataset, such as the absence of redundant records and improved evaluation consistency. Additionally, it explains the concept of confusion matrices for evaluating classification models, detailing performance metrics and their importance in machine learning.

Uploaded by

teddy haile

Ethiopian Defence University, College of Engineering

CT-6713: Machine Learning in Cybersecurity

Capt. Mehari K (Ph.D) Ethiopian University, Engineering College 1


Intrusion Detection Using the NSL-KDD Dataset
• Intrusion detection software protects a computer network
from unauthorized users, possibly including insiders
• The intrusion detector learning task is to build a predictive
model (i.e. a classifier) capable of distinguishing between
bad connections, called intrusions or attacks, and good,
normal connections
• A connection is a sequence of TCP packets starting and
ending at some well-defined times



Contd…
• During a connection, data flows between a source IP address
and a target IP address under some well-defined protocol
• Each connection is labeled either as normal or as an
attack, with exactly one specific attack type
• Each connection record consists of about 100 bytes



Attacks fall into four main categories:
• DOS: denial-of-service, e.g. SYN flood;
• R2L: unauthorized access from a remote machine, e.g.
password guessing;
• U2R: unauthorized access to local superuser (root)
privileges, e.g. various "buffer overflow" attacks;
• Probing: surveillance and other probing, e.g. port
scanning.



Contd…
• It is important to note that the test data is not from the
same probability distribution as the training data
• It includes specific attack types not in the training data
• This makes the task more realistic
• Some intrusion experts believe that most novel attacks are
variants of known attacks, so the "signature" of known
attacks can be sufficient to catch novel variants
• The datasets contain a total of 24 training attack types,
with an additional 14 types in the test data only



Training attacks
• dos: back, land, neptune, pod, smurf, teardrop
• u2r: buffer_overflow, loadmodule, perl, rootkit
• r2l: ftp_write, guess_passwd, imap, multihop, phf, spy,
warezclient, warezmaster
• probe: ipsweep, nmap, portsweep, satan
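
The attack-to-category assignment listed above is typically applied as a lookup when preparing labels for a four-category classifier. The sketch below is one way to write it in Python. The names `ATTACK_CATEGORY` and `to_category` are illustrative assumptions, as is the choice to return "unknown" for attack types that appear only in the test data.

```python
# Lookup table transcribed from the "Training attacks" slide (22 entries).
ATTACK_CATEGORY = {
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "buffer_overflow": "u2r", "loadmodule": "u2r",
    "perl": "u2r", "rootkit": "u2r",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l",
    "multihop": "r2l", "phf": "r2l", "spy": "r2l",
    "warezclient": "r2l", "warezmaster": "r2l",
    "ipsweep": "probe", "nmap": "probe",
    "portsweep": "probe", "satan": "probe",
}


def to_category(label: str) -> str:
    """Map a raw connection label to normal, dos, u2r, r2l, probe or unknown."""
    # Normalise case and tolerate a trailing period on the label.
    label = label.strip().rstrip(".").lower()
    if label == "normal":
        return "normal"
    # Attack types seen only in the test data are not in the table.
    return ATTACK_CATEGORY.get(label, "unknown")


print(to_category("smurf"))         # dos
print(to_category("guess_passwd"))  # r2l
```

How test-only attack types should be handled (mapped by hand to a category, or kept as "unknown") is a design decision that this sketch leaves open.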
NSL-KDD dataset description
• NSL-KDD is a data set proposed to solve some of the inherent
problems of the KDD'99 data set
• The NSL-KDD data set has the following advantages over the
original KDD data set:
• It does not include redundant records in the train set, so
classifiers will not be biased towards more frequent records
• There are no duplicate records in the proposed test sets, so the
measured performance of the learners is not biased in favour of
methods that have better detection rates on the frequent records
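
To make the effect of redundant records concrete, the sketch below removes exact duplicates from a small list of connection records. The records are hypothetical toy data, not taken from KDD'99 or NSL-KDD, and `drop_duplicates` is an illustrative helper, not the procedure used to build NSL-KDD.

```python
def drop_duplicates(records):
    """Return the records with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical toy records: (protocol, service, flag, label)
records = [
    ("tcp", "http", "SF", "normal"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("tcp", "private", "S0", "neptune"),
]

unique = drop_duplicates(records)
print(len(records), "->", len(unique))  # 5 -> 3
```

In the raw list, three of the five records are the same smurf connection, so a learner that only gets that one pattern right already scores 60%. After deduplication each distinct record counts once.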



Contd…
• The number of selected records from each difficulty level group is
inversely proportional to the percentage of records in the original
KDD data set
• As a result, the classification rates of distinct machine learning
methods vary over a wider range, which makes it easier to evaluate
different learning techniques accurately
• The number of records in the train and test sets is reasonable,
which makes it affordable to run the experiments on the complete set
without the need to randomly select a small portion
• Consequently, evaluation results of different research works will be
consistent and comparable



Demonstration of Random forest Classifier results



The classifier Tree



Confusion Matrix
• A confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the total
number of target classes
• The matrix compares the actual target values with those
predicted by the machine learning model
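
As a minimal sketch of the definition above, the function below builds an N x N confusion matrix from two label lists in plain Python. The function name and the toy labels are assumptions for illustration. The layout follows the convention used later in these slides for the multi-class case: rows are predicted classes, columns are actual classes.

```python
def confusion_matrix(actual, predicted, classes):
    """Build an N x N matrix where matrix[p][a] counts samples of
    actual class a that the model predicted as class p."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1  # row = predicted, column = actual
    return matrix


# Hypothetical toy labels for six connections.
classes = ["normal", "dos", "probe"]
actual = ["normal", "dos", "dos", "probe", "normal", "dos"]
predicted = ["normal", "dos", "normal", "probe", "normal", "probe"]

cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(f"predicted {name:<7}", row)
```

Note that scikit-learn's `confusion_matrix` uses the transposed layout (rows are true labels, columns are predicted labels), so it is worth checking the orientation before reading values off a matrix.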



Understanding Confusion Matrix
• The following four terms are the basic terminology that will help us
determine the metrics we are looking for.
• True positives (TP): the actual value is positive and the prediction
is also positive.
• True negatives (TN): the actual value is negative and the
prediction is also negative.
• False positives (FP): the actual value is negative but the prediction
is positive. Also known as a Type 1 error
• False negatives (FN): the actual value is positive but the prediction
is negative. Also known as a Type 2 error
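
The four terms can be counted directly from paired actual and predicted labels. The sketch below assumes that "attack" is treated as the positive class and uses hypothetical toy labels. The name `binary_counts` is illustrative.

```python
def binary_counts(actual, predicted, positive="attack"):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # attack correctly flagged
        elif p == positive:
            fp += 1  # normal traffic flagged as attack (Type 1 error)
        elif a == positive:
            fn += 1  # attack missed (Type 2 error)
        else:
            tn += 1  # normal traffic correctly passed
    return tp, tn, fp, fn


# Hypothetical toy labels for six connections.
actual = ["attack", "attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "normal", "normal", "attack", "attack", "normal"]

tp, tn, fp, fn = binary_counts(actual, predicted)
print(tp, tn, fp, fn)  # 2 2 1 1
```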



Confusion Matrix
• A confusion matrix, as the name suggests, is a matrix of
numbers that tells us where a model gets confused
• It is a class-wise distribution of the predictive performance of a
classification model
• That is, the confusion matrix is an organized way of mapping
the predictions to the original classes to which the data belong
• This also implies that confusion matrices can only be used when
the true labels are known, i.e. in supervised
learning frameworks
• The confusion matrix not only allows the calculation of the
accuracy of a classifier, be it the global or the class-wise
accuracy, but also helps compute other important metrics that
developers often use to evaluate their models
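
The sketch below shows one way to derive global accuracy and class-wise precision, recall, and F1 from a multi-class confusion matrix. The matrix values are hypothetical, the function name `classwise_metrics` is an assumption, and the layout assumes rows are predicted classes and columns are actual classes.

```python
def classwise_metrics(matrix, classes):
    """Return (global_accuracy, per-class report) for a matrix with
    rows = predicted class and columns = actual class."""
    n = len(classes)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    report = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted_i = sum(matrix[i])                    # row sum: predicted as class i
        actual_i = sum(matrix[r][i] for r in range(n))  # column sum: truly class i
        precision = tp / predicted_i if predicted_i else 0.0
        recall = tp / actual_i if actual_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


# Hypothetical 3-class matrix (rows = predicted, columns = actual).
classes = ["normal", "dos", "probe"]
matrix = [
    [50, 4, 6],   # predicted normal
    [3, 30, 1],   # predicted dos
    [2, 1, 13],   # predicted probe
]

accuracy, report = classwise_metrics(matrix, classes)
print(f"accuracy = {accuracy:.3f}")
for name, m in report.items():
    print(f"{name:<7} precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

In this toy matrix the probe class has the lowest recall (13 of 20 actual probe samples are found), which the global accuracy alone would hide.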
Confusion Matrix
• Confusion matrices computed on the same test set of a
dataset, but using different classifiers, can also help compare
the classifiers' relative strengths and weaknesses
• From this comparison we can infer how the classifiers might be
combined (ensemble learning) to obtain better performance.



Confusion Matrix for binary classes



The Most Common performance metrics in classification
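
For reference, the metrics most often reported for a binary classifier can be written in terms of the four counts TP, TN, FP, and FN defined earlier. These are the standard textbook definitions and are not necessarily the exact list presented in the lecture.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
                         {\text{Precision} + \text{Recall}}
\end{align}
```

For example, with TP = 2, TN = 2, FP = 1, and FN = 1, accuracy is 4/6, and precision, recall, and F1 are all 2/3.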



Confusion Matrix for Multiple Classes
• The concept of the multi-class confusion matrix is similar to the
binary-class matrix
• The columns represent the actual or expected class distribution, and
the rows represent the predicted or output distribution by the classifier.
• Let us elaborate on the features of the multi-class confusion matrix
with an example
• Suppose we have the test set (consisting of 191 total samples) of a
dataset with the following distribution:



Confusion Matrix of multiple class
• Assignment: Explain how to use and interpret results using a
confusion matrix.
• Maximum marks: 10%
• Use a numerical example
• Using this concept, we can calculate the class-wise accuracy, precision,
recall, and F1-scores and put the results in a table
• Submit in PPTX format to the address: [email protected]
• Due date: 5/12/2024, by midnight
• Penalty: marks will be deducted for submissions after the due date
