
Lab 7

CYBER 362 Section 001


Seth Schuler, Kelly Wallert, Isaac Robles Jr.,
Andrea Hatcher, Stephen Giacobe
The Pennsylvania State University

Table of Contents

Section A

Section B - Isaac

Section C

Section A
1-a) Table containing performance measures.

According to the results generated by Splunk with an 80:20 train/test split, the most precise
classifiers were the Decision Tree and the Random Forest. As the table below shows, both the
Decision Tree and the Random Forest had a precision score of 0.99 (99%), with the same scores
for recall, accuracy, and F1.

Looking at the confusion matrices, the False Negative (Type II error) rates for each algorithm
are as follows: 16.4% (657), 11.1% (444), 1.3% (51), and 1.1% (46). Meanwhile, the False
Positive (Type I error) rates are as follows: 24% (1455), 0.5% (32), 1.4% (83), and 0.8% (49).
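These per-class rates follow directly from the confusion-matrix counts. As a minimal sketch (with illustrative counts, not the lab's actual data), the tabled metrics can be reproduced from a binary confusion matrix like so:

```python
# Deriving the tabled metrics from a binary confusion matrix.
# These counts are illustrative, not the lab's actual data.
tp, fp, fn, tn = 950, 10, 12, 1028

precision = tp / (tp + fp)                  # of predicted positives, how many were right
recall = tp / (tp + fn)                     # of actual positives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall fraction correct
f1 = 2 * precision * recall / (precision + recall)

fn_rate = fn / (tp + fn)                    # Type II error rate
fp_rate = fp / (fp + tn)                    # Type I error rate

print(round(precision, 3), round(recall, 3), round(accuracy, 3), round(f1, 3))
```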

1-b) The most accurate algorithm is the Random Forest, but a Decision Tree doesn't use as much
memory and yields almost the same result. If speed were the priority, I'd pick the Decision
Tree; for accuracy, use the Random Forest.

2-a) K-Means Clustering

Below, you can see that we ran k-means clustering with K=2, K=3, and K=4. Comparing the results
shows which value of K yields the best clustering: the higher the K value, the tighter the
clusters.

i.

ii.

iii.
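A run like the one above can be sketched in Python with scikit-learn. The feature matrix here is synthetic stand-in data, not the lab's traffic capture, and inertia is used as a rough proxy for cluster tightness:

```python
# Sketch of the K sweep above using scikit-learn; X is synthetic stand-in
# data, not the lab's traffic capture.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares: lower = tighter
    print(k, round(km.inertia_, 1))
```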

2-b) DBSCAN

Below, you can see the DBSCAN clustering algorithm run with eps = 0.2, eps = 0.4, and eps = 1.0.
Comparing the results shows which value of eps yields the best clustering.

i.

ii.

iii.

iv.
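The eps sweep above can be sketched similarly, again on synthetic stand-in data. DBSCAN labels noise points as -1, so the cluster count excludes them:

```python
# Sketch of the eps sweep above; X is synthetic stand-in data. DBSCAN marks
# noise points with label -1, so they are excluded from the cluster count.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

for eps in (0.2, 0.4, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(eps, n_clusters, list(labels).count(-1))
```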

3-a) Predict VPN Usage - Linear Regression



i. R2 Statistic
1. 0.8759
ii. Root Mean Squared Error (RMSE)
1. 75.63

3-b) Screenshot for Decision Tree regression, R2, and RMSE.

i. R2 Statistic
1. 0.8321
ii. Root Mean Squared Error (RMSE)
1. 86.78

3-c) Screenshot for Random Forest regression, R2, and RMSE. Commented on the performance of the
three regressors.

i. R2 Statistic
1. 0.9400
ii. Root Mean Squared Error (RMSE)
1. 66.69
The three regression models fit the data in the following order, from worst to best: Decision Tree,
Linear, and Random Forest. Their R2 values range from roughly 0.83 to 0.94, which is a reasonably
good fit.
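The three-regressor comparison can be sketched as follows. The data is synthetic, so the R2/RMSE printed here will not match the Splunk results reported above:

```python
# Sketch of the three-regressor comparison; synthetic data, so these R2/RMSE
# numbers will not match the Splunk results reported above.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for name, model in [("Linear", LinearRegression()),
                    ("Decision Tree", DecisionTreeRegressor(random_state=0)),
                    ("Random Forest", RandomForestRegressor(random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # RMSE is the square root of the mean squared error
    results[name] = (r2_score(y_te, pred), mean_squared_error(y_te, pred) ** 0.5)
    print(name, round(results[name][0], 4), round(results[name][1], 2))
```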

Section B
a) Loading dataset.

b) Table of performance measures.

c) Discussed the performance of these classifiers and provided a reflection on completing this exercise.
According to the table, the random forest and decision tree algorithms performed exceptionally well,
while logistic regression and the support vector machine performed the weakest.

Section C
1. Execute the code in line 82 and capture the screenshot showing the distribution of the 22
attack types.

2. Execute the code in line 86 and capture the screenshot showing the distribution of the 5
categories (benign and the four attack categories). Comment on the distribution of these
attack categories in terms of how it might affect the classification.

The distribution is heavily skewed toward benign and DoS packets. The lack of user-escalation
packets is a problem for classification, as it can cause more error than necessary; a more even
distribution across the categories would likely improve accuracy.

3. Provide the confusion matrix of the decision tree classifier in line 163 and the error rate
in line 164. Comment on the success rate of this classifier. An error rate of 0.2 means the
success rate is 0.8.

With a success rate of around 75.7%, this classifier falls slightly behind the second approach,
being about 0.1 percentage points less effective, a difference that is statistically insignificant.
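The success rate is derived from the confusion matrix as one minus the error rate: the diagonal holds the correct predictions. A minimal sketch with an illustrative 3x3 matrix (not the lab's output):

```python
# Success rate from a confusion matrix: the diagonal holds the correct
# predictions. This 3x3 matrix is illustrative, not the lab's output.
import numpy as np

cm = np.array([[50, 3, 2],
               [4, 60, 6],
               [1, 5, 40]])

correct = np.trace(cm)               # sum of the diagonal
error_rate = 1 - correct / cm.sum()  # misclassified fraction
success_rate = 1 - error_rate
print(round(success_rate, 3))
```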

4. Provide the confusion matrix of the k-nearest neighbors classifier in line 177 and the
error rate in line 178. Comment on the success rate of this classifier. We didn't discuss
this one in class, but you can read about it on pages 52 and 53 of the course textbook.

The confusion matrix can be seen above, with a success rate of around 75.8%. This approach was
the most effective of the three, but its success rate is still fairly low and the classifier
should be tuned further.
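A k-nearest neighbors classifier of the kind described above can be sketched with scikit-learn. The dataset is a synthetic stand-in, so the matrix and success rate will differ from the lab's 75.8%:

```python
# Sketch of a k-nearest neighbors classifier on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Each test point is labeled by majority vote among its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, knn.predict(X_te))
success_rate = knn.score(X_te, y_te)
print(cm)
print(round(success_rate, 3))
```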

5. Provide the confusion matrix of the support vector machine classifier in line 191 and the
error rate in line 192. Comment on the success rate of this classifier.

The confusion matrix can be seen above, with a success rate of around 72.2%. After
seeing this, we concluded that this approach was less effective than the other two,
producing around 3% more error.

6. Execute the code in lines 232, 233 and 234. Comment on the values you get for these
three parameters.

The values for completeness, homogeneity, and v-measure are very low, showing that the
data is very mixed and incompletely labeled. This would explain some of the errors
produced by the three classification approaches.
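The three metrics referenced here are available in scikit-learn; a minimal sketch with illustrative label arrays (not the lab's data):

```python
# The three clustering-quality metrics via scikit-learn; these label arrays
# are illustrative, not the lab's data.
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [0, 0, 1, 1, 2, 2, 0, 1, 2]   # a poorly aligned clustering

h = homogeneity_score(true_labels, cluster_ids)   # does each cluster hold one class?
c = completeness_score(true_labels, cluster_ids)  # does each class land in one cluster?
v = v_measure_score(true_labels, cluster_ids)     # harmonic mean of h and c
print(round(h, 3), round(c, 3), round(v, 3))
```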
