Group4 AutoencodersforIoT
Group4 AutoencodersforIoT
Submitted by: Devansh Satija, Shubh Gupta, Ritesh Meena, Kushagra Sabharwal
Abstract
The advent of the Internet of Things (IoT) has revolutionized how devices interact, share
data, and automate tasks. However, with this expansion comes an increased vulnerability
to cyberattacks targeting these devices and their networks. Traditional security systems
often fail to detect sophisticated or novel threats, especially in massive IoT ecosystems.
In this project, we propose an Autoencoder-based anomaly detection framework, trained
solely on normal network traffic data, to identify potential intrusions or abnormalities.
Utilizing the CICIDS 2018 dataset—a gold standard for intrusion detection
benchmarking—we perform extensive data preprocessing, design a neural Autoencoder,
and evaluate its effectiveness against traditional unsupervised anomaly detection models
like Isolation Forest and One-Class SVM. The model demonstrates a high capacity for
anomaly detection, robust generalization, and low false-positive rates, showcasing its
suitability for real-time, unsupervised IoT threat detection.
1. Introduction
The Canadian Institute for Cybersecurity’s CICIDS 2018 dataset is chosen due to its
diversity, realism, and wide adoption in academic research. The dataset simulates realistic
network traffic with both benign and malicious behaviors, covering a wide range of
attack types such as DoS, Brute Force, Port Scans, Infiltration, and Botnets.
Key characteristics:
3. Methodology
Preprocessing is a crucial step in preparing the dataset for training machine learning
models. The following steps were conducted:
• Missing & Infinite Values: Rows with missing (NaN) and infinite (inf) values
were removed to ensure model stability.
• Categorical Encoding: Categorical features (e.g., protocol type, service) were
label encoded using LabelEncoder.
• Normalization: Feature scaling was applied using StandardScaler to ensure
that all numerical features lie within a standard range. This improves convergence
during neural network training.
• Data Filtering: Only rows labeled as BENIGN were retained for training the
Autoencoder to learn normal traffic patterns. Anomaly labels were preserved for
evaluation purposes.
• Encoder: Compresses input data through multiple fully connected Dense layers
(e.g., 64 → 32 → 16 neurons) with ReLU activations.
• Bottleneck Layer: Represents the compressed latent space (dim=16), encoding
the essence of normal patterns.
• Decoder: Symmetrically reconstructs the input back from the bottleneck
representation (16 → 32 → 64 → original size).
Model Details:
After training the model on benign traffic, the reconstruction error is calculated on the
full dataset. Instances with errors above the threshold are classified as anomalies.
Metrics Used:
• Accuracy: ~94%
• Precision: High (above 90%)
• Recall: Excellent, detecting most attacks
• F1-score: Balanced across different thresholds
• AUC: Outperformed both Isolation Forest and One-Class SVM
While neural network parameters were manually fine-tuned, traditional models were
subjected to hyperparameter tuning:
5. Code
• # Import libraries
• import pandas as pd
• import numpy as np
• import matplotlib.pyplot as plt
• import seaborn as sns
•
• from sklearn.preprocessing import StandardScaler
• from sklearn.model_selection import train_test_split
• from sklearn.metrics import classification_report, confusion_matrix,
roc_auc_score
•
• from sklearn.ensemble import IsolationForest, RandomForestClassifier
• from sklearn.svm import OneClassSVM
• from sklearn.model_selection import RandomizedSearchCV
•
• import tensorflow as tf
• from tensorflow.keras.models import Model
• from tensorflow.keras.layers import Input, Dense
•
• # ===============================
• # 1. Load and Preprocess Dataset
• # ===============================
• data = pd.read_csv("daaa.csv")
•
• # Drop missing values
• data = data.dropna()
•
• # Drop unnecessary columns
• drop_cols = ['Flow ID', 'Source IP', 'Destination IP', 'Timestamp']
• data = data.drop(columns=[col for col in drop_cols if col in
data.columns])
•
• # Separate normal and attack traffic
• label_col = 'Label'
• normal_data = data[data[label_col] == 'BENIGN']
• anomaly_data = data[data[label_col] != 'BENIGN']
•
• # Drop label for training data
• X_normal = normal_data.drop(columns=[label_col])
• X_anomaly = anomaly_data.drop(columns=[label_col])
•
• # Encode categorical variables
• X_all = pd.concat([X_normal, X_anomaly])
• X_all = pd.get_dummies(X_all)
•
• # Normalize all features
• scaler = StandardScaler()
• X_scaled = scaler.fit_transform(X_all)
•
• # Split back the features
• X_normal_scaled = X_scaled[:len(X_normal)]
• X_anomaly_scaled = X_scaled[len(X_normal):]
•
• # Split normal data into train and test
• X_train, X_test = train_test_split(X_normal_scaled, test_size=0.2,
random_state=42)
•
• # ======================================
• # 2. Autoencoder Model Design & Training
• # ======================================
• input_dim = X_train.shape[1]
•
• # Build autoencoder
• input_layer = Input(shape=(input_dim,))
• encoder = Dense(64, activation="relu")(input_layer)
• encoder = Dense(32, activation="relu")(encoder)
• decoder = Dense(64, activation="relu")(encoder)
• decoder = Dense(input_dim, activation="linear")(decoder)
•
• autoencoder = Model(inputs=input_layer, outputs=decoder)
• autoencoder.compile(optimizer='adam', loss='mse')
•
• # Train autoencoder on normal data only
• history = autoencoder.fit(X_train, X_train,
• epochs=50,
• batch_size=256,
• validation_split=0.1,
• verbose=1)
•
• # =================================
• # 3. Anomaly Detection using Autoencoder
• # =================================
• # Compute reconstruction error
• reconstructions = autoencoder.predict(X_test)
• mse = np.mean(np.power(X_test - reconstructions, 2), axis=1)
•
• # Threshold based on 95th percentile of normal data error
• threshold = np.percentile(mse, 95)
• print("Reconstruction threshold:", threshold)
•
• # Test on mixed normal and anomaly data
• X_combined = np.vstack((X_test, X_anomaly_scaled))
• y_combined = np.hstack((np.zeros(len(X_test)),
np.ones(len(X_anomaly_scaled))))
•
• # Get predictions
• recon_combined = autoencoder.predict(X_combined)
• mse_combined = np.mean(np.power(X_combined - recon_combined, 2), axis=1)
• y_pred = (mse_combined > threshold).astype(int)
•
• # Evaluation
• print("\n[Autoencoder Performance]")
• print(classification_report(y_combined, y_pred))
• print("AUC Score:", roc_auc_score(y_combined, mse_combined))
•
• # Confusion Matrix
• cm = confusion_matrix(y_combined, y_pred)
• sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
• plt.title("Confusion Matrix - Autoencoder")
• plt.xlabel("Predicted")
• plt.ylabel("Actual")
• plt.show()
•
• # ========================================
• # 4. Compare with Isolation Forest and OCSVM
• # ========================================
• # Isolation Forest
• iso_forest = IsolationForest(contamination=0.05, random_state=42)
• iso_forest.fit(X_train)
• iso_pred = iso_forest.predict(X_combined)
• iso_pred = np.where(iso_pred == -1, 1, 0)
•
• print("\n[Isolation Forest Performance]")
• print(classification_report(y_combined, iso_pred))
•
• # One-Class SVM
• ocsvm = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale')
• ocsvm.fit(X_train)
• svm_pred = ocsvm.predict(X_combined)
• svm_pred = np.where(svm_pred == -1, 1, 0)
•
• print("\n[One-Class SVM Performance]")
• print(classification_report(y_combined, svm_pred))
•
• # ========================================
• # 5. Compare with Supervised Model - Random Forest
• # ========================================
• # Create labels
• X_class = np.vstack((X_normal_scaled, X_anomaly_scaled))
• y_class = np.hstack((np.zeros(len(X_normal_scaled)),
np.ones(len(X_anomaly_scaled))))
•
• # Train-test split
• X_train_cls, X_test_cls, y_train_cls, y_test_cls =
train_test_split(X_class, y_class, test_size=0.2, random_state=42)
•
• # Random Forest with Random Search
• param_dist = {
• 'n_estimators': [100, 200, 300],
• 'max_depth': [5, 10, None],
• 'min_samples_split': [2, 5, 10]
• }
• rf = RandomForestClassifier()
• rs = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=5,
cv=3, verbose=1, n_jobs=-1)
• rs.fit(X_train_cls, y_train_cls)
•
• rf_best = rs.best_estimator_
• rf_pred = rf_best.predict(X_test_cls)
•
• print("\n[Random Forest (Supervised) Performance]")
• print(classification_report(y_test_cls, rf_pred))
6. Conclusion
7. Future Enhancements