Group 17
ABSTRACT
This study investigates the performance of three machine learning models (Random Forest, SVM,
and KNN) in predicting whether an Android app is benign or malware from the permissions it requests.
The dataset is preprocessed to handle missing values and outliers. Hyperparameter tuning is employed
to optimize the models' configurations. The models are evaluated on accuracy, precision, recall,
and F1 score on a test set. Additionally, the Area Under the Receiver Operating Characteristic
curve (AUROC) is computed. Results indicate that all three models perform well on the test set in
terms of predictive accuracy and AUROC. The study contributes insights into the selection and tuning
of machine learning models for malware detection, offering a foundation for further research.
Introduction
The objective of this report is to compare the performance of three machine learning models: Random
Forest, SVM, and KNN, for the task of malware prediction. This comparison aims to identify the most
effective model for predicting whether an app is benign or malware based on its permissions.
Pre-processing Steps
1. Data Cleaning
The dataset was examined for missing values, outliers, and incorrect data types. None were
found, so no corrections were needed.
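The checks described above can be sketched roughly as follows. This is only an illustration: the report does not name its tooling, so pandas is an assumption, and the toy DataFrame, its column names, and the idea that every permission is a 0/1 flag are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: each column is a permission flag (1 = requested),
# and "label" marks the app as benign (0) or malware (1).
df = pd.DataFrame({
    "READ_SMS": [1, 0, 1, 0],
    "INTERNET": [1, 1, 1, 0],
    "label": [1, 0, 1, 0],
})

# Missing values: count NaNs across the whole frame.
n_missing = int(df.isna().sum().sum())

# Data types: collect any column that is not numeric.
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Out-of-range values: assuming binary flags, anything other than 0 or 1 is suspect.
n_out_of_range = int((~df.isin([0, 1])).sum().sum())

print(n_missing, non_numeric, n_out_of_range)
```

If all three results are empty or zero, the data needs no cleaning, which matches the outcome reported above.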
2. Train-Test Split
The dataset was split into training and testing sets to evaluate each model's performance on
unseen data. An 80-20 split was used, with 80% of the data used for training and 20% for
testing.
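A minimal sketch of an 80-20 split, assuming scikit-learn (the report does not name its library). The synthetic permission matrix, the random seeds, and the use of stratification are assumptions made for illustration, not details taken from the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the permission data: 100 apps, 10 binary permission flags.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 10))
y = rng.integers(0, 2, size=100)  # 0 = benign, 1 = malware

# test_size=0.2 gives the 80-20 split; stratify=y (an assumption) keeps the
# benign/malware ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so all three models can be compared on the same test rows.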
Model Selection
Possible models:
Based on research, the team selected three models to train on the data:
1. Random Forest
Ensemble Method: Random Forest combines multiple decision trees to form a robust and accurate
model. It is less prone to overfitting than an individual decision tree.
Handles Non-linearity: Random Forest can capture complex non-linear relationships between
features, which matters here because the association between permissions and app behavior
may be non-linear.
2. Support Vector Machine (SVM)
Interpretability: SVM provides a simple and interpretable model. The support vectors and the
decision boundary can be easily interpreted in terms of their impact on classifying apps as benign
or malware.
Efficiency: With a linear kernel, SVM is computationally efficient, even on large datasets. It is
often a good choice when interpretability and speed are essential.
3. k-Nearest Neighbors (KNN)
Interpretability: KNN is interpretable, and the decision-making process is easy to understand.
Each prediction is based on the similarity of the nearest neighbors, making it clear how decisions
are made.
Handling Non-linear Relationships: KNN can naturally handle non-linear relationships between
features and the target variable. It is flexible in capturing complex decision boundaries, which is
important for distinguishing between benign and malware apps based on permissions.
Model Training
Each model was trained on the training set with the specified hyperparameters.
Hyperparameters:
Random Forest
We used Grid Search with cross-validation to explore different values for the n_estimators
hyperparameter. The values tested were [50, 100, 200, 300, 400, 500].
Grid Search systematically works through the candidate parameter values, cross-validating
each one to determine the best value.
• Number of Trees (n_estimators): 300
The value n_estimators = 300 was used because it gave the best performance measure.
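The search over n_estimators can be sketched as below, assuming scikit-learn's GridSearchCV. The synthetic data, the 3-fold setting, the accuracy scoring, and the random seeds are assumptions; the report does not state which of these were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set (not the report's data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# The candidate values listed in the report.
param_grid = {"n_estimators": [50, 100, 200, 300, 400, 500]}

# cv=3 and scoring="accuracy" are assumptions chosen to keep the sketch quick.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

On synthetic data the selected value will not necessarily be 300. The sketch only shows the selection mechanism: the value with the highest mean cross-validated score is kept.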
SVM
Type of hyperplane: Linear (default)
A linear hyperplane was used because the data was treated as 2D data, for which a linear
decision boundary was expected to be sufficient.
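A linear SVM could be fitted along these lines, assuming scikit-learn (not confirmed by the report). Note that in scikit-learn, SVC uses an RBF kernel unless told otherwise, so the sketch sets kernel="linear" explicitly. The synthetic data and seeds are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the permission data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Linear kernel: the decision boundary is a single hyperplane w.x + b = 0.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

# With a linear kernel, coef_ holds one weight per feature.
print(svm.coef_.shape, svm.score(X_test, y_test))
```

The per-feature weights in coef_ are what make a linear SVM comparatively easy to inspect: the sign and size of each weight indicate how a feature pushes the prediction.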
KNN
Number of neighbors: 5
Of all the values tested, 5 neighbors gave the best performance measure. This value was also
chosen to keep the model from overfitting and to help it generalize better.
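One way to compare neighbor counts is sketched below, assuming scikit-learn. The report does not list which values were tested, so candidate_k is hypothetical, as are the synthetic data and the 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical candidates; the report only states that 5 performed best.
candidate_k = [1, 3, 5, 7, 9]

# Mean cross-validated accuracy for each candidate value of k.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_k
}
best_k = max(scores, key=scores.get)

print(scores, best_k)
```

Very small k tends to follow noise in the training data, while very large k smooths the boundary too much, which is why a mid-range value is often selected.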
Model Evaluation
The models were evaluated on the testing set using the following metrics: accuracy, precision,
recall, F1 score, and AUROC.
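The metrics named in this report (accuracy, precision, recall, F1 score, and AUROC) can be computed as in the sketch below, assuming scikit-learn. The labels, predictions, and scores are made-up toy values, not results from the study.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy values for illustration only (1 = malware, 0 = benign).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # predicted malware scores

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # AUROC is computed from continuous scores, not from hard predictions.
    "auroc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall is the share of malware apps that are caught, so it is the metric to watch when missing malware is costly.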
Results
Conclusion
• Random Forest: The ensemble nature of Random Forest, which combines multiple decision trees,
allows it to capture complex, non-linear relationships between permissions and app behavior,
leading to its superior performance in recall and AUROC. Its robustness and high recall make it the
best choice for scenarios where the cost of missing malware is high.
• SVM: The SVM model's performance was commendable, especially considering its efficiency and
interpretability. Its accuracy and precision were on par with Random Forest, making it a viable
option for large datasets where speed and simplicity are essential.
• KNN: While KNN achieved slightly lower accuracy than the other models, its
interpretability and ability to handle non-linear relationships make it a useful model in scenarios
where the decision process needs to be easily understood.
Overall, the Random Forest model is recommended for its superior performance in most metrics,
particularly in identifying malware apps with high recall and AUROC. However, both SVM and KNN
have their merits and could be preferred depending on specific use-case requirements such as
interpretability, computational efficiency, and the nature of the dataset. Future research could further
explore hyperparameter tuning and the integration of additional features to enhance model performance.