DDoS Detection Using Machine Learning
March 2, 2024
[2]: ddos = pd.read_csv("APA-DDoS-Dataset.csv")  # pandas is assumed to be imported as pd in an earlier cell
[3]: ddos
(output truncated: preview of the raw DataFrame — 151200 rows × 23 columns)
[5]: ddos.info()
<class 'pandas.core.frame.DataFrame'>
(output truncated: RangeIndex of 151200 entries, 23 columns, all non-null)
[6]: ddos.isna().sum()
[6]: ip.src 0
ip.dst 0
tcp.srcport 0
tcp.dstport 0
ip.proto 0
frame.len 0
tcp.flags.syn 0
tcp.flags.reset 0
tcp.flags.push 0
tcp.flags.ack 0
ip.flags.mf 0
ip.flags.df 0
ip.flags.rb 0
tcp.seq 0
tcp.ack 0
frame.time 0
Packets 0
Bytes 0
Tx Packets 0
Tx Bytes 0
Rx Packets 0
Rx Bytes 0
Label 0
dtype: int64
[7]: ddos.duplicated().sum()
[7]: 0
There are no duplicates or nulls that need to be dropped, so we can proceed with our analysis.
[8]: ddos.groupby('Label').size()
[8]: Label
Benign 75600
DDoS-ACK 37800
DDoS-PSH-ACK 37800
dtype: int64
(figure omitted: seaborn pairplot of the numeric features, hued by Label; the run emitted a UserWarning that the pairplot `size` parameter has been renamed to `height`)
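The cell that produced the pairplot is not captured in this export. A minimal sketch of a call that would generate it, assuming seaborn's pairplot hued by the Label column (the original run presumably passed the deprecated size= argument, hence the warning above):

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of the numeric features, coloured by traffic class
sns.pairplot(ddos, hue='Label', height=2.5)
plt.show()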
By observing the generated pairplot, we can notice that several features take only a single value across the entire column. These carry no information and can be dropped; the quick check below identifies exactly which ones.
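A quick programmatic check (not part of the original notebook) that lists those constant columns with pandas' nunique:

# Columns with only one distinct value carry no information for classification
constant_columns = ddos.nunique()[ddos.nunique() == 1].index.tolist()
print(constant_columns)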
[16]: numeric_data = ddos.select_dtypes(include='number')  # keep only the numeric columns
correlation_matrix = numeric_data.corr()
fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(correlation_matrix, annot=True, ax=ax, cmap="RdPu")
plt.title('Correlation Between the Variables')
plt.show()
[9]: columns_to_drop = ['tcp.dstport', 'ip.proto', 'tcp.flags.syn', 'tcp.flags.reset',
                        'tcp.flags.ack', 'ip.flags.mf', 'ip.flags.rb', 'tcp.seq', 'tcp.ack']
ddos_new = ddos.drop(columns=columns_to_drop).copy()
ddos_new
(output truncated: preview of ddos_new — 151200 rows × 14 columns; the first rows are labelled DDoS-PSH-ACK and the last rows Benign)
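The cell that created the Label_new column is missing from this export; judging by the output below, it collapses the two attack classes into a single DDoS class. A minimal sketch, assuming a simple string replacement:

# Hypothetical reconstruction: map both attack labels to one 'DDoS' class
ddos_new['Label_new'] = ddos_new['Label'].replace(
    {'DDoS-ACK': 'DDoS', 'DDoS-PSH-ACK': 'DDoS'})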
ddos_new.drop(columns=['Label'], inplace=True)                 # drop the original three-class label
ddos_new.rename(columns={'Label_new': 'Label'}, inplace=True)  # keep the binary label under the name 'Label'
ddos_new
(output truncated: preview of ddos_new after relabelling — the Label column now contains only 'DDoS' and 'Benign')
[12]: y = ddos_new['Label']
y
[12]: 0 DDoS
1 DDoS
2 DDoS
3 DDoS
4 DDoS
…
151195 Benign
151196 Benign
151197 Benign
151198 Benign
151199 Benign
Name: Label, Length: 151200, dtype: object
[14]: y
There are many distinct IP addresses, so we encode them using one-hot encoding. The addresses have no ordinality, so label encoding would introduce a spurious ordering and bias the model.
[15]: from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = ddos_new.drop(columns=['Label', 'frame.time']).copy()  # assumption: the frame.time timestamp is not used as a feature
categorical_columns = ['ip.src', 'ip.dst']  # assumption: the IP address columns are the categorical features

# Create a ColumnTransformer: one-hot encode the IP columns, pass the numeric columns through
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_columns)
    ],
    remainder='passthrough',
    verbose_feature_names_out=False  # keep plain column names such as ip.src_192.168.1.1
)

X_encoded = preprocessor.fit_transform(X)            # the fit/transform step is not shown in the export
column_names = preprocessor.get_feature_names_out()  # assumed way of recovering the expanded column names
X = pd.DataFrame(X_encoded, columns=column_names)
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py:975:
FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will
be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its
default value.
  warnings.warn(
[16]: X
(output truncated: preview of the encoded feature matrix X — 151200 rows; one-hot columns such as ip.src_192.168.20.1 followed by the numeric features)
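The train/test split cell does not appear in this export, although X_train and y_train are used below. A minimal sketch, assuming a standard stratified hold-out split (the split ratio and random_state are guesses):

from sklearn.model_selection import train_test_split

# Hold out a test set; stratify on the label to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)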
[18]: X_train
[18]: (output truncated: preview of X_train — the encoded feature columns restricted to the training rows)
[19]: y_train
0.3 Building the Model
[21]: X_train
(output truncated: the same X_train preview as above)
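The cell that trains the decision tree is missing from this export, although its predictions (y_pred_decision_tree) are evaluated in the next cell. A minimal sketch, assuming scikit-learn's DecisionTreeClassifier with default settings:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit a decision tree on the training split and predict on the held-out data
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
y_pred_decision_tree = decision_tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_decision_tree) * 100:.2f}%")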
[44]: cm = confusion_matrix(y_test, y_pred_decision_tree)
class_labels = ["Benign", "DDoS"]
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
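The random-forest training cell is also missing, but rf_model is used for the precision-recall curve below. A minimal sketch, assuming RandomForestClassifier with default hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_curve

# Train a random forest and report hold-out accuracy
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf) * 100:.2f}%")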
precision, recall, _ = precision_recall_curve(
    y_test, rf_model.predict_proba(X_test)[:, 1])
# note: if y_test still holds the string labels 'Benign'/'DDoS', pass pos_label='DDoS' above

# Plot the F1 score achieved at each point along the precision-recall curve
fig, ax = plt.subplots(figsize=(8, 8))
f1 = 2 * (precision * recall) / (precision + recall)
plt.plot(recall, f1, label='F1 Score')
plt.xlabel('Recall')
plt.ylabel('F1 Score')
plt.title('F1 Score Curve')
plt.legend(loc='best')
plt.show()
Accuracy: 100.00%
0.6 XGBoost
[47]: xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
Accuracy: 100.00%
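The statement that prints this accuracy is not captured in the export; presumably it was something along these lines (using sklearn's accuracy_score):

from sklearn.metrics import accuracy_score

# Hold-out accuracy of the XGBoost model
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")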
[34]: plot_importance(xgb_model)  # XGBoost's built-in feature-importance plot
plt.show()
[51]: y_prob = xgb_model.predict_proba(X_test)[:, 1]
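The plotting code that consumes y_prob is not included in the export; a sketch of a typical ROC curve built from these probabilities (assuming 'DDoS' is the positive class; if the labels were numerically encoded, use pos_label=1 instead):

from sklearn.metrics import roc_curve, auc

# ROC curve for the DDoS class from the predicted probabilities
fpr, tpr, _ = roc_curve(y_test, y_prob, pos_label='DDoS')
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc(fpr, tpr):.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()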