GROUP 17
Kuayi Raphael (10970285)
Doe Kelvin (10970187)
INTRODUCTION
The proliferation of Android smartphones, which hold over 70% of the mobile OS market
share, has made them a prime target for malware attacks. Traditional malware detection
methods, such as signature-based detection, have proven ineffective against sophisticated and
rapidly evolving threats. This project aims to address the challenge of detecting malware in
Android applications by leveraging machine learning techniques to analyse app permission
usage patterns.
The problem of malware detection in Android apps is complex and pressing. As the popularity
of Android devices continues to grow, so does the number of malicious apps seeking to exploit
them. These apps often request excessive or suspicious permissions to gain access to sensitive
data or functionalities, making permission-based analysis a promising approach for detection.
OBJECTIVES
The primary goal of this project was to design and develop a machine learning model that
accurately predicts whether an Android app is benign or malicious based on its permission
requests. To achieve this, the following key objectives were pursued:
• Identify key permissions that are indicative of malicious behaviour, providing insights
into the specific permission patterns and correlations that are strongly associated with
malware.
The dataset used for this project was sourced from multiple repositories containing Android
apps released between 2010 and 2019. It includes permissions extracted from over 29,000 apps,
classified into benign and malware categories.
3.2 Data Description
The dataset consists of 86 features representing the various permissions an app may request. Each feature is encoded in a binary format (1 for granted, 0 for not granted).
Each feature in the dataset therefore has exactly two unique values. The dataset has 29,332 rows.
Fig(i)
The target variable, Result, represents the classification of an app as either benign or malware.
The dataset is well balanced, with nearly equal numbers of instances for each class.
The data was thoroughly checked for outliers and null values, ensuring high integrity and quality. All columns have the int64 datatype. No outliers were present since the data is binary, and all fields were complete, contributing to the robustness of the data.
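The integrity checks described above can be sketched as follows. Since the actual dataset file is not specified here, a synthetic binary DataFrame with the same shape conventions (86 permission columns plus a Result label) stands in for it; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the permission dataset: 86 binary permission
# columns plus a binary "Result" label (the real data has 29,332 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(1000, 86)),
                  columns=[f"perm_{i}" for i in range(86)])
df["Result"] = rng.integers(0, 2, size=1000)

# Integrity checks described above: no nulls, int64 dtypes,
# every column holds exactly two unique values (0/1).
assert df.isnull().sum().sum() == 0
assert (df.dtypes == "int64").all()
assert all(df[c].nunique() == 2 for c in df.columns)

# Class balance of the target variable.
print(df["Result"].value_counts())
```

Because every feature is binary, the outlier check reduces to confirming that each column contains only the values 0 and 1.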
The dataset was split into training and test sets to evaluate model performance on unseen data. An 80-20 split was used, with 80% of the data for training and 20% for testing. The training set was used to fit the models; hyperparameter tuning was performed with a validation split drawn from the training data, and the test set was reserved for final evaluation.
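An 80-20 split of this kind is typically done with scikit-learn's train_test_split; a minimal sketch on synthetic data shaped like the permission matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary feature matrix and labels standing in for the dataset.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 86))
y = rng.integers(0, 2, size=1000)

# stratify=y keeps the benign/malware ratio equal in both splits;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (800, 86) (200, 86)
```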
METHODOLOGY
Each model was chosen for its unique characteristics and ability to handle binary classification
tasks.
• Support Vector Machine (SVM): SVM is a supervised learning algorithm used for
classification tasks. It finds the optimal hyperplane that separates classes in a high-
dimensional space. For this project, two types of SVM kernels were evaluated:
1. Linear: Assumes a linear relationship between features and class labels.
2. Polynomial: Captures non-linear relationships between features and class labels.
The 86 features of the dataset were reduced to two principal components to visualise how the SVM algorithm performs on the data with a polynomial kernel. Two principal components were selected because they captured the majority of the variance in the data and so reflected the overall structure of the 86 features.
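The PCA-plus-polynomial-SVM step above can be sketched as follows; the data here is synthetic, so the explained-variance figures will not match the real dataset's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Synthetic stand-in for the 86 binary permission features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 86)).astype(float)
y = rng.integers(0, 2, size=500)

# Project the 86 features onto 2 principal components.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Fit a polynomial-kernel SVM on the reduced 2-D data.
clf = SVC(kernel="poly", degree=3, random_state=0)
clf.fit(X_2d, y)
print("train accuracy:", clf.score(X_2d, y))
```

Plotting X_2d coloured by class then shows how well a polynomial decision boundary could separate the two classes.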
• K-Nearest neighbours (KNN): KNN is an instance-based learning algorithm that
classifies instances based on the majority vote of their nearest neighbours.
The performance of KNN was evaluated with different numbers of neighbours:
1. K=3: Evaluates the model with 3 nearest neighbours.
2. K=5: Evaluates the model with 5 nearest neighbours.
3. K=7: Evaluates the model with 7 nearest neighbours.
4. K=9: Evaluates the model with 9 nearest neighbours.
The distance metric used was Euclidean distance.
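The KNN evaluation loop over k = 3, 5, 7, 9 with Euclidean distance can be sketched as follows (synthetic data again stands in for the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary features and labels.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(600, 86))
y = rng.integers(0, 2, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Evaluate KNN at each neighbour count with Euclidean distance.
for k in (3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_tr, y_tr)
    print(f"k={k}: test accuracy = {knn.score(X_te, y_te):.3f}")
```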
Hyperparameter tuning was performed to optimize the performance of each model. This
involved adjusting parameters such as the kernel type for SVM, the number of neighbours for
KNN, and the number of trees for RF.
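One common way to carry out this tuning is a cross-validated grid search; the sketch below tunes the number of trees for Random Forest, and the same pattern applies to the SVM kernel and the KNN neighbour count. The parameter grid here is illustrative, not the project's exact search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the permission dataset.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 86))
y = rng.integers(0, 2, size=300)

# Grid-search the ensemble size with 3-fold cross-validation,
# scoring by F1 as in the report.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200, 300]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print("best n_estimators:", grid.best_params_["n_estimators"])
```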
The following tools, libraries, and frameworks were used for this project:
The models were trained using the training set with the optimal hyperparameters identified
during the tuning process. The training process included configuring model parameters, fitting
the models to the training data, and ensuring reproducibility by setting random seeds.
Model validation was performed using the validation set to fine-tune the models and prevent
overfitting. Cross-validation techniques were used to ensure the models generalize well to
unseen data.
The final evaluation of the models was conducted on the test set. Performance metrics,
including accuracy, F1 score, and AUC-ROC score, were computed to assess the effectiveness
of each model.
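Computing the three test-set metrics named above (accuracy, F1, AUC-ROC) looks like this in scikit-learn; note that AUC-ROC is computed from predicted class probabilities rather than hard labels. Synthetic data stands in for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary features and labels.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(500, 86))
y = rng.integers(0, 2, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

clf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

print("accuracy:", accuracy_score(y_te, y_pred))
print("F1      :", f1_score(y_te, y_pred))
print("AUC-ROC :", roc_auc_score(y_te, y_prob))
```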
RESULTS
K-Nearest neighbours (KNN):
• K=3:
o Training F1 Score: 0.958
o Testing F1 Score: 0.961
o Training AUC-ROC Score: 0.985
o Testing AUC-ROC Score: 0.986
• K=5:
o Training F1 Score: 0.960
o Testing F1 Score: 0.962
o Training AUC-ROC Score: 0.986
o Testing AUC-ROC Score: 0.987
• K=7:
o Training F1 Score: 0.961
o Testing F1 Score: 0.963
o Training AUC-ROC Score: 0.987
o Testing AUC-ROC Score: 0.988
• K=9:
o Training F1 Score: 0.964
o Testing F1 Score: 0.964
o Training AUC-ROC Score: 0.987
o Testing AUC-ROC Score: 0.987
Random Forest (RF):
• 100 Trees:
o Training F1 Score: 0.968
o Testing F1 Score: 0.970
o Training AUC-ROC Score: 0.992
o Testing AUC-ROC Score: 0.993
• 200 Trees:
o Training F1 Score: 0.969
o Testing F1 Score: 0.971
o Training AUC-ROC Score: 0.993
o Testing AUC-ROC Score: 0.993
• 300 Trees:
o Training F1 Score: 0.970
o Testing F1 Score: 0.970
o Training AUC-ROC Score: 0.993
o Testing AUC-ROC Score: 0.993
6.2 Comparison of Models
The results presented in the table are achieved through careful hyperparameter tuning and
the strategic use of the Area Under the Receiver Operating Characteristic Curve
(AUROC) as the primary evaluation metric.
Model          Accuracy  Precision  Recall  F1 Score  AUROC  Train Time (s)  Predict Time (s)
Random Forest  0.970     0.980      0.980   0.970     0.993  7.10            0.70
KNN            0.960     0.970      0.965   0.964     0.987  0.02            7.48
SVM            0.960     0.960      0.960   0.960     0.989  47.78           2.52
• Random Forest shows the best overall performance in terms of accuracy, precision,
recall, F1 Score, and AUROC. It also has a reasonably quick training time and the
fastest prediction time, making it an excellent choice for both training and inference.
• KNN has a very fast training time but suffers from the longest prediction time, making
it less suitable for real-time predictions despite having good precision and recall.
• SVM performs well across most metrics but has a significantly longer training time,
which could be a drawback.
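Train and predict times of the kind reported in the table are typically measured by wrapping fit() and predict() in a monotonic timer; a sketch with Random Forest on synthetic data (the absolute times will differ from the table's):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the permission dataset.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(1000, 86))
y = rng.integers(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=100, random_state=4)

# Time the training step.
t0 = time.perf_counter()
clf.fit(X, y)
train_time = time.perf_counter() - t0

# Time the prediction step.
t0 = time.perf_counter()
clf.predict(X)
predict_time = time.perf_counter() - t0

print(f"train: {train_time:.2f}s  predict: {predict_time:.2f}s")
```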
DISCUSSION
The results indicate that the Random Forest model outperformed the other models in terms of
F1 Score and AUC-ROC Score. This can be attributed to its ensemble nature, which reduces
overfitting and enhances generalization.
7.2 Model Insights
• SVM Kernels: The Polynomial kernel provided a better fit for the data than the Linear
kernel, capturing non-linear relationships more effectively.
• KNN neighbours: Increasing the number of neighbours generally improved the
model's performance by reducing variance.
• RF Trees: More trees improved model robustness and performance, highlighting the
importance of ensemble size.
7.3 Statistical Analysis
ANOVA
F1 Scores:
ANOVA Results: F=21.075, p=0.0019. The results indicate significant differences in
F1 Scores among the classification algorithms, with Random Forest showing superior
performance compared to SVM and KNN.
AUC-ROC Scores:
ANOVA Results: F=33.813, p=0.0005. The results demonstrate significant
differences in AUC-ROC Scores, confirming Random Forest's higher discriminatory
power.
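A one-way ANOVA of this kind compares the per-fold scores of the three models with scipy. The score lists below are illustrative placeholders, not the project's actual fold-level results:

```python
from scipy.stats import f_oneway

# Hypothetical per-fold F1 scores for the three models
# (placeholders, not the report's measured values).
svm_f1 = [0.958, 0.960, 0.961]
knn_f1 = [0.961, 0.963, 0.962]
rf_f1  = [0.970, 0.971, 0.969]

# One-way ANOVA: tests whether the group means differ.
stat, p = f_oneway(svm_f1, knn_f1, rf_f1)
print(f"F={stat:.3f}, p={p:.4f}")
```

A p-value below 0.05 would indicate a significant difference among the three models' mean scores, as reported above.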
T-Tests
F1 Scores:
• SVM vs. KNN: t=0.696, p=0.525. No significant difference was observed between SVM and KNN, since p was greater than 0.05. This may be because both SVM and KNN rely on distance-based computations in their evaluation.
• SVM vs. RF: t=-4.346, p=0.012. A significant difference was found, with Random Forest performing significantly better than SVM (the negative t value indicates that RF outperforms SVM).
• KNN vs. RF: t=-13.840, p=0.0002. A highly significant difference was found, with Random Forest outperforming KNN (the negative t value indicates that RF outperforms KNN).
AUC-ROC Scores:
• SVM vs. KNN: t=3.701, p=0.021. Significant difference observed, with SVM
slightly outperforming KNN.
• SVM vs. RF: t=-4.210, p=0.014. Significant difference found, with Random
Forest showing superior performance.
• KNN vs. RF: t=-9.538, p=0.0007. A highly significant difference was noted, with
Random Forest outperforming KNN.
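The pairwise comparisons above follow the two-sample t-test pattern shown below, where a negative t statistic means the second group's mean score is higher. The score lists are illustrative placeholders, not the project's measured fold-level values:

```python
from scipy.stats import ttest_ind

# Hypothetical per-fold F1 scores (placeholders, not measured values).
svm_f1 = [0.958, 0.960, 0.961]
rf_f1  = [0.970, 0.971, 0.969]

# Two-sample t-test: negative t here means RF's mean exceeds SVM's.
t, p = ttest_ind(svm_f1, rf_f1)
print(f"t={t:.3f}, p={p:.4f}")
```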
CONCLUSION
This study demonstrated the effectiveness of using machine learning for permission-based
malware detection in Android apps. The project comprehensively evaluated and compared
three classification algorithms—Support Vector Machine (SVM), K-Nearest neighbours
(KNN), and Random Forest (RF)—using various performance metrics and statistical analysis.
Random Forest emerged as the top performer, demonstrating the highest F1 Score and AUC-
ROC Score.
REFERENCES
[1] Hareram Kumar (2022). Android Malware Prediction using Machine Learning Techniques: A Review.
[2] Neamat Al Sarah (2021). An Efficient Android Malware Prediction Using Ensemble Machine Learning Algorithms.
[3] Machine Learning for Android Malware Detection Using Permission and API Calls.
[4] Android Permission Dataset.