The document discusses the use of the NSL-KDD dataset for training machine learning models to detect network intrusions, categorizing attacks into four main types: DOS, R2L, U2R, and probing. It highlights the advantages of the NSL-KDD dataset over the original KDD dataset, such as the absence of redundant records and improved evaluation consistency. Additionally, it explains the concept of confusion matrices for evaluating classification models, detailing performance metrics and their importance in machine learning.
Capt. Mehari K (Ph.D) Ethiopian University, Engineering College 1
Intrusion Detection Using the NSL-KDD Dataset
• Software that detects network intrusions protects a computer network from unauthorized users, possibly including insiders
• The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good, normal connections
• A connection is a sequence of TCP packets starting and ending at well-defined times
Contd…
• Between those times, data flows between a source IP address and a target IP address under a well-defined protocol
• Each connection is labeled either as normal or as an attack, with exactly one specific attack type
• Each connection record consists of about 100 bytes
Attacks fall into four main categories:
• DOS: denial of service, e.g. SYN flood
• R2L: unauthorized access from a remote machine, e.g. password guessing
• U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks
• Probing: surveillance and other probing, e.g. port scanning
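A minimal sketch of how the four categories could be encoded as a lookup table. The attack labels and their categories are the ones given on the "Training attacks" slide of this deck; the names `ATTACK_CATEGORY` and `categorize`, and the "unknown" fallback for labels outside the training list, are assumptions made for illustration.

```python
# Lookup from attack label to category, using the training labels
# listed in this deck. "normal" connections are handled separately.
ATTACK_CATEGORY = {
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l",
    "multihop": "r2l", "phf": "r2l", "spy": "r2l",
    "warezclient": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "loadmodule": "u2r",
    "perl": "u2r", "rootkit": "u2r",
    "ipsweep": "probe", "nmap": "probe",
    "portsweep": "probe", "satan": "probe",
}


def categorize(label):
    """Hypothetical helper: map a record label to its attack category."""
    label = label.strip().lower()
    if label == "normal":
        return "normal"
    # Labels that appear only in the test data are not in the table.
    return ATTACK_CATEGORY.get(label, "unknown")


for name in ["smurf", "nmap", "normal", "some_new_attack"]:
    print(name, "->", categorize(name))
```

Returning "unknown" rather than raising an error is one possible design choice; it keeps the helper usable on test records whose attack type never appears in training.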
Contd…
• It is important to note that the test data is not from the same probability distribution as the training data
• It includes specific attack types not in the training data
• This makes the task more realistic
• Some intrusion experts believe that most novel attacks are variants of known attacks
• The "signature" of known attacks can be sufficient to catch novel variants
• The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only
Training attacks
• back: dos
• buffer_overflow: u2r
• ftp_write: r2l
• guess_passwd: r2l
• imap: r2l
• ipsweep: probe
• land: dos
• loadmodule: u2r
• multihop: r2l
• neptune: dos
• nmap: probe
• perl: u2r
• phf: r2l
• pod: dos
• portsweep: probe
• rootkit: u2r
• satan: probe
• smurf: dos
• spy: r2l
• teardrop: dos
• warezclient: r2l
• warezmaster: r2l
NSL-KDD dataset description
• NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 data set
• The NSL-KDD data set has the following advantages over the original KDD data set:
• It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records
• There are no duplicate records in the proposed test sets
• Therefore, the measured performance of the learners is not biased by methods that have better detection rates on the frequent records
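To illustrate why redundant records matter, the sketch below removes exact duplicates from a handful of toy records and compares label frequencies before and after. The records are hypothetical and the helper `drop_duplicates` is an assumed name; this is not the procedure used to build NSL-KDD, only an illustration of the effect described above.

```python
from collections import Counter

# Toy records (hypothetical): (protocol, service, flag, label)
records = [
    ("tcp", "http", "SF", "normal"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("icmp", "ecr_i", "SF", "smurf"),
    ("tcp", "private", "S0", "neptune"),
    ("tcp", "private", "S0", "neptune"),
]


def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


unique_records = drop_duplicates(records)

# Label frequencies before and after removing redundant records.
before = Counter(row[-1] for row in records)
after = Counter(row[-1] for row in unique_records)
print("before:", dict(before))
print("after: ", dict(after))
```

In the toy data the repeated smurf record dominates before deduplication; afterwards each distinct record counts once, so a learner is no longer rewarded simply for fitting the most frequent record.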
Contd…
• The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set
• As a result, the classification rates of distinct machine learning methods vary over a wider range, which makes it easier to evaluate different learning techniques accurately
• The number of records in the train and test sets is reasonable, which makes it affordable to run experiments on the complete set without having to randomly select a small portion
• Consequently, evaluation results of different research works will be consistent and comparable
Demonstration of Random forest Classifier results
Demonstration of Random forest Classifier results
The classifier Tree
Confusion Matrix
• A confusion matrix is an N x N matrix used for evaluating the performance of a classification model
• N is the total number of target classes
• The matrix compares the actual target values with those predicted by the machine learning model
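A minimal sketch of how an N x N confusion matrix can be built from actual and predicted labels. The function name `confusion_matrix`, the class names, and the six toy samples are assumptions for illustration. Rows hold predicted classes and columns hold actual classes, matching the convention this deck uses for the multi-class case; some libraries use the opposite orientation.

```python
def confusion_matrix(actual, predicted, classes):
    """Return an N x N list of lists; rows = predicted, columns = actual."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


# Toy labels (hypothetical), N = 3 target classes.
classes = ["normal", "dos", "probe"]
actual = ["normal", "dos", "dos", "probe", "normal", "dos"]
predicted = ["normal", "dos", "normal", "probe", "normal", "probe"]

cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(f"predicted {name:>6}: {row}")
```

Entries on the diagonal count correct predictions; every off-diagonal entry counts one specific kind of confusion between two classes.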
Understanding Confusion Matrix
• The following four terms are the basic terminology that helps determine the metrics we are looking for
• True positives (TP): the actual value is positive and the prediction is also positive
• True negatives (TN): the actual value is negative and the prediction is also negative
• False positives (FP): the actual value is negative but the prediction is positive. Also known as a Type I error
• False negatives (FN): the actual value is positive but the prediction is negative. Also known as a Type II error
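The four terms can be counted directly from paired actual and predicted labels. The sketch below assumes "attack" is treated as the positive class; the function name `binary_counts` and the toy labels are hypothetical.

```python
def binary_counts(actual, predicted, positive="attack"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a != positive and p != positive:
            tn += 1  # actual negative, predicted negative
        elif a != positive and p == positive:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels (hypothetical).
actual = ["attack", "attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "normal", "normal", "attack", "attack", "normal"]

counts = binary_counts(actual, predicted)
print(counts)
```

In an intrusion detection setting, a false negative is a missed attack and a false positive is a false alarm on a normal connection.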
Confusion Matrix
• A confusion matrix, as the name suggests, is a matrix of numbers that tells us where a model gets confused
• It is a class-wise distribution of the predictive performance of a classification model
• That is, the confusion matrix is an organized way of mapping the predictions to the original classes to which the data belong
• This also implies that confusion matrices can only be used when the output distribution is known, i.e., in supervised learning frameworks
• The confusion matrix not only allows the calculation of a classifier's accuracy, whether global or class-wise, but also helps compute other important metrics that developers often use to evaluate their models
Confusion Matrix
• Confusion matrices computed on the same test set of a dataset, but with different classifiers, can also help compare the classifiers' relative strengths and weaknesses
• They also help draw inferences about how the classifiers can be combined (ensemble learning) to obtain the best performance
Confusion Matrix for binary classes
The Most Common performance metrics in classification
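A minimal sketch of four widely used classification metrics (accuracy, precision, recall, and F1-score) computed from TP, TN, FP, and FN. Which metrics to include here is an assumption; the formulas are the standard definitions. The function name `classification_metrics` and the counts passed to it are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from the four counts."""
    total = tp + tn + fp + fn
    # Guard each ratio against a zero denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy counts (hypothetical).
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can mislead when classes are imbalanced, which is why precision and recall are usually reported alongside it.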
Confusion Matrix for Multiple Classes
• The concept of the multi-class confusion matrix is similar to the binary-class matrix
• The columns represent the actual or expected class distribution, and the rows represent the predicted or output distribution by the classifier
• Let us elaborate on the features of the multi-class confusion matrix with an example
• Suppose we have the test set (consisting of 191 total samples) of a dataset with the following distribution:
Contd…
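A minimal sketch of class-wise metrics for a multi-class confusion matrix, using the convention stated earlier (columns = actual, rows = predicted). The 3 x 3 counts and class names are hypothetical toy values, not the 191-sample test set, and `per_class_metrics` is an assumed helper name. Each class is treated one-vs-rest to obtain its own TP, FP, FN, and TN.

```python
def per_class_metrics(cm, classes):
    """cm[i][j] = samples predicted as class i whose actual class is j."""
    n = len(classes)
    total = sum(sum(row) for row in cm)
    results = {}
    for k, name in enumerate(classes):
        tp = cm[k][k]
        fp = sum(cm[k]) - tp                        # predicted k, actually another class
        fn = sum(cm[i][k] for i in range(n)) - tp   # actually k, predicted another class
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results[name] = {
            "accuracy": (tp + tn) / total if total else 0.0,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return results


# Toy matrix (hypothetical): rows = predicted, columns = actual.
classes = ["A", "B", "C"]
cm = [
    [8, 1, 0],
    [2, 7, 1],
    [0, 2, 9],
]

results = per_class_metrics(cm, classes)
for name, m in results.items():
    print(name, {key: round(value, 3) for key, value in m.items()})
```

Reading a row gives everything the classifier assigned to that class; reading a column gives everything that truly belongs to it, which is why precision uses row sums and recall uses column sums under this convention.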
Confusion Matrix for Multiple Classes
• Assignment: Explain how to use a confusion matrix and interpret its results
• Maximum marks: 10%
• Use a numerical example
• Using this concept, we can calculate the class-wise accuracy, precision, recall, and F1-scores and put the results in a table
• Submit in PPTx format to the address: [email protected]
• Due date: 5/12/2024, until midnight
• Penalty: marks will be deducted for submissions after the due date