Lab 7
Table of Contents
Section A
Section B - Isaac
Section C
Section A
1-a) Table containing performance measures.
According to the results generated by Splunk with an 80:20 train/test split, the most precise
models were the Decision Tree and the Random Forest. As the table below shows, both the
Decision Tree and the Random Forest had a precision score of 0.99 (99%), with the same scores
for recall, accuracy, and F1.
Looking at the confusion matrices, the false-negative (Type II error) rates for the four
algorithms are 16.4% (657), 11.1% (444), 1.3% (51), and 1.1% (46), while the false-positive
(Type I error) rates are 24% (1455), 0.5% (32), 1.4% (83), and 0.8% (49).
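The Type I and Type II percentages above come straight from the confusion matrices. As a minimal sketch of that arithmetic, using a hypothetical binary confusion matrix rather than the lab's actual Splunk counts:

```python
# Hypothetical counts for illustration only; not the lab's Splunk output.
def binary_metrics(tp, fp, fn, tn):
    """Derive the standard measures from a binary confusion matrix."""
    precision = tp / (tp + fp)                  # hurt by false positives (Type I)
    recall = tp / (tp + fn)                     # hurt by false negatives (Type II)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

metrics = binary_metrics(tp=950, fp=10, fn=50, tn=990)
print(metrics)
```

The same counts also give the error rates quoted above: the false-negative rate is fn / (tp + fn) and the false-positive rate is fp / (fp + tn).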
1-b) The most accurate algorithm is the Random Forest, but the Decision Tree uses less
memory and yields almost the same result. If speed were the priority, I would pick the
Decision Tree; for accuracy, the Random Forest.
Below are the results of k-means clustering with K=2, K=3, and K=4. Comparing the results
shows which value of K yields the best clustering; the higher the value of K, the tighter
the clusters.
i.
ii.
iii.
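The k-means sweep above can be sketched as follows, assuming a scikit-learn workflow on a synthetic stand-in dataset (the lab's Splunk data is not reproduced here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data with three natural groups; the lab used its own dataset.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

sil = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil[k] = silhouette_score(X, km.labels_)
    print(k, round(km.inertia_, 1), round(sil[k], 3))
```

Note that inertia (within-cluster distance) always shrinks as K grows, so "tighter" alone does not mean "better"; a relative measure such as the silhouette score is the usual way to judge which K clusters best.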
2-b) DBSCAN
Below are the results of the DBSCAN clustering algorithm with eps = 0.4, eps = 0.2, and eps = 1.0.
Comparing the results shows which value of eps yields the best clustering.
i.
ii.
iii.
iv.
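The eps sweep above can be sketched in the same assumed scikit-learn style, again on stand-in data rather than the lab's dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Stand-in two-cluster data; the lab used its own dataset.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

results = {}
for eps in (0.2, 0.4, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(eps, n_clusters, n_noise)
```

Unlike K in k-means, eps does not fix the number of clusters: too small an eps fragments the data and marks points as noise, while too large an eps merges everything into one cluster.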
i. R2 Statistic: 0.8759
ii. Root Mean Squared Error (RMSE): 75.63
i. R2 Statistic: 0.8321
ii. Root Mean Squared Error (RMSE): 86.78
3-c) Screenshot for Random Forest regression, R2, and RMSE. Commented on the performance of the
three regressors.
i. R2 Statistic: 0.9400
ii. Root Mean Squared Error (RMSE): 66.69
The three regression models fit the data in the following order from worst to best: Decision
Tree, Linear, and Random Forest. The R2 values range from 0.83 to 0.94 (83% to 94%), which
is reasonably good.
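The three-regressor comparison above can be sketched as follows, assuming scikit-learn models on synthetic stand-in data (the lab's dataset and hyperparameters are not shown here, so the exact scores will differ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression data; the lab used its own dataset.
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(random_state=1)),
                    ("forest", RandomForestRegressor(random_state=1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (r2_score(y_te, pred),
                    np.sqrt(mean_squared_error(y_te, pred)))  # RMSE
    print(name, round(scores[name][0], 4), round(scores[name][1], 2))
```

Evaluating R2 and RMSE on the same held-out split is what makes the three models directly comparable.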
Section B
a) Loading dataset.
b) Discussed the performance of these classifiers and provided a reflection on completing this exercise.
According to the table, the random forest and decision tree algorithms performed
exceptionally well, logistic regression performed moderately, and the support vector
machine algorithm performed the weakest.
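A minimal sketch of such a four-classifier comparison, assuming scikit-learn models on a stand-in dataset rather than the lab's own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification data; the lab used its own dataset.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

accs = {}
for name, model in [("random_forest", RandomForestClassifier(random_state=0)),
                    ("decision_tree", DecisionTreeClassifier(random_state=0)),
                    ("logistic_regression", LogisticRegression(max_iter=5000)),
                    ("svm", SVC())]:
    accs[name] = model.fit(X_tr, y_tr).score(X_te, y_te)  # test accuracy
    print(name, round(accs[name], 3))
```

The relative ranking of the four models depends heavily on the dataset and on preprocessing (SVMs in particular are sensitive to feature scaling), so the ordering observed in the lab's table need not reproduce here.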
Section C
1. Execute the code in line 82 and capture the screenshot showing the distribution of the 22
attack types.
2. Execute the code in line 86 and capture the screenshot showing the distribution of the 5
categories (benign and the four attack categories). Comment on the distribution of these
attack categories in terms of how it might affect the classification.
The distribution is heavily skewed toward benign and DoS packets. The lack of user
escalation packets is a problem for classification, as it can cause more error than
necessary. A more even spread across the categories would likely reduce that error.
3. Provide the confusion matrix of the decision tree classifier in line 163 and the error rate
in line 164. Comment on the success rate of this classifier. An error rate of 0.2 means the
success rate is 0.8.
With a success rate of around 75.7%, it falls slightly behind the k-nearest neighbors
approach, being about 0.1% less effective, which is statistically insignificant.
4. Provide the confusion matrix of the k-nearest neighbor’s classifier in line 177 and the
error rate in line 178. Comment on the success rate of this classifier. We didn't discuss
this one in class but you can read about it on pages 52 and 53 of the course textbook.
The confusion matrix can be seen above. With a success rate of around 75.8%, this approach
was the most effective of the three, but it should be tuned further, as that success rate
is still fairly low.
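The "line 177" and "line 178" references are to the lab's own script, which is not reproduced here; as a hedged sketch of the same steps, assuming scikit-learn on a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in multiclass data; the lab used a network-traffic dataset.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
pred = knn.predict(X_te)
cm = confusion_matrix(y_te, pred)     # rows: true class, columns: predicted
error_rate = 1 - knn.score(X_te, y_te)  # success rate = 1 - error rate
print(cm)
print(round(error_rate, 3))
```

The diagonal of the confusion matrix holds the correct predictions, so the success rate is the diagonal sum divided by the total, which is exactly 1 minus the error rate.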
5. Provide the confusion matrix of the support vector machine classifier in line 191 and the
error rate in line 192. Comment on the success rate of this classifier.
The confusion matrix can be seen above, with a success rate of around 72.2%. From this we
concluded that this approach was less effective than the other two, with roughly 3% more
error.
6. Execute the code in lines 232, 233 and 234. Comment on the values you get for these
three parameters.
The values for completeness, homogeneity, and v-measure are very low, showing that the
data is highly mixed and incompletely labeled. This would explain some of the errors
produced by the three approaches.
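As a small sketch of what these three parameters measure, assuming the scikit-learn metric functions and hypothetical label arrays (not the lab's output):

```python
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)

# Hypothetical labels for illustration only.
true_labels = [0, 0, 0, 1, 1, 1]
mixed_clusters = [0, 1, 0, 1, 0, 1]  # each cluster mixes both classes -> low scores
clean_clusters = [1, 1, 1, 0, 0, 0]  # perfect match up to renaming -> scores of 1.0

for pred in (mixed_clusters, clean_clusters):
    print(round(homogeneity_score(true_labels, pred), 3),
          round(completeness_score(true_labels, pred), 3),
          round(v_measure_score(true_labels, pred), 3))
```

Homogeneity asks whether each cluster contains only one class, completeness asks whether each class lands in one cluster, and v-measure is their harmonic mean; all three are invariant to renaming the cluster IDs, which is why the `clean_clusters` case still scores perfectly.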