
GROUP 17

ID: 10970285 (Kuayi Rapheal) and 10970187 (Kelvin Doe)


Machine Learning Model Comparison Report

ABSTRACT

This study investigates the performance of three machine learning models (Random Forest, SVM, and KNN) in predicting whether an application is benign or malware based on its Android permissions. The dataset is preprocessed to handle missing values and outliers, and hyperparameter tuning is employed to optimize the models' configurations. The models are evaluated on a test set using accuracy, precision, recall, and F1 score. Additionally, the Area Under the Receiver Operating Characteristic (AUROC) curve is computed. Results indicate that all models perform well on the test features in terms of predictive accuracy and AUC. The study contributes insights into the selection and tuning of machine learning models for malware detection, offering a foundation for further research in this area.

Introduction

The objective of this report is to compare the performance of three machine learning models (Random Forest, SVM, and KNN) on the task of malware prediction. This comparison aims to identify the most effective model for predicting whether an app is benign or malware based on its permissions.
Pre-processing Steps

1. Data Cleaning
The dataset was examined for missing values and outliers. No missing values, outliers, or incorrect data types were found.

2. Train-Test Split
The dataset was split into training and testing sets to evaluate the models' performance on unseen data. An 80-20 split was used, with 80% of the data used for training and 20% for testing; a sketch of both pre-processing steps is shown after this list.
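The following is a minimal sketch of these pre-processing steps, assuming a scikit-learn workflow; the file name "android_permissions.csv" and the "label" column are hypothetical placeholders, not names taken from the original dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("android_permissions.csv")  # hypothetical file name

# 1. Data cleaning checks: missing values and column data types
print(df.isna().sum())   # count of missing values per column
print(df.dtypes)         # verify column data types

X = df.drop(columns=["label"])  # hypothetical label column
y = df["label"]                 # assumed encoding: 1 = malware, 0 = benign

# 2. 80-20 train-test split, stratified so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)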
Model Selection
Possible models:
Based on research, the team selected three models to train on the data:

• Random Forest Classifier
• K-Nearest Neighbors
• Support Vector Machine

REASONS FOR SELECTION

1. Random Forest
Ensemble Method:
i) Random Forest is an ensemble method that combines multiple decision trees to form a robust and accurate model.
ii) It is less prone to overfitting compared to individual decision trees.
Handles Non-linearity: Random Forest can capture complex non-linear relationships between
features, which is crucial in datasets where relationships might be non-linear, such as permissions
and their association with app behavior.
2. Support Vector Machine (SVM)
Interpretability: SVM provides a simple and interpretable model. The support vectors and the
decision boundary can be easily interpreted in terms of their impact on classifying apps as benign
or malware.
Efficiency: a linear SVM is computationally efficient, even on relatively large datasets. It is often a good choice when interpretability and speed are essential.
3. k-Nearest Neighbors (KNN)
Interpretability: KNN is interpretable, and the decision-making process is easy to understand.
Each prediction is based on the similarity of the nearest neighbors, making it clear how decisions
are made.
Handling Non-linear Relationships: KNN can naturally handle non-linear relationships between
features and the target variable. It is flexible in capturing complex decision boundaries, which is
important for distinguishing between benign and malware apps based on permissions.
Model Training
Each model was trained on the training set with the specified hyperparameters.

Hyperparameters:

Random Forest
We used Grid Search with cross-validation to explore different values for the n_estimators hyperparameter. The values tested were [50, 100, 200, 300, 400, 500]. Grid Search systematically works through multiple combinations of parameter values, cross-validating each combination to determine the best one.
• Number of Trees (n_estimators): 300
The value n_estimators = 300 was used because it gave the best cross-validated performance.
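A minimal sketch of this search, assuming X_train and y_train come from the split shown earlier; the fold count and scoring metric are assumptions, not values stated in the report.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200, 300, 400, 500]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # assumed 5-fold cross-validation
    scoring="accuracy",  # assumed scoring metric
)
grid.fit(X_train, y_train)

print(grid.best_params_)         # the report found n_estimators = 300 to perform best
rf_model = grid.best_estimator_  # best model, refit on the full training set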

SVM
Type of hyperplane: linear (default)
A linear kernel was chosen because the data was two-dimensional, so a linear hyperplane was sufficient to separate the classes.
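A minimal sketch of this configuration, assuming scikit-learn's SVC is used; probability=True is an added assumption so that class probabilities are available for the ROC curve.

from sklearn.svm import SVC

# Linear kernel, as described above; probability estimates enabled for AUROC
svm_model = SVC(kernel="linear", probability=True, random_state=42)
svm_model.fit(X_train, y_train)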

KNN

Number of neighbors: 5
This value was chosen to keep the model from overfitting and to help it generalize better. Among all the values tested, 5 neighbors gave the best performance.
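A minimal sketch of this configuration, assuming scikit-learn's KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

# 5 nearest neighbors, as described above
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)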
Model Evaluation
The models were evaluated on the testing set using the following metrics (a computation sketch follows the list):

• Accuracy: Proportion of correctly predicted instances.
• Precision: Proportion of true positive predictions among all positive predictions.
• Recall: Proportion of true positive predictions among all actual positive instances.
• F1 Score: Harmonic mean of precision and recall.
• AUROC: Area under the Receiver Operating Characteristic curve.
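The sketch below shows how these metrics could be computed for one fitted model (rf_model from the grid-search sketch above); the same calls apply to the SVM and KNN models. Variable names are carried over from the earlier sketches.

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)

y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:, 1]  # probability of the malware class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("AUROC    :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))  # basis for the confusion matrix diagrams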

Results

Model           Accuracy   Precision   Recall   F1 Score
Random Forest   0.97       0.97        0.98     0.97
KNN             0.96       0.96        0.97     0.97
SVM             0.97       0.96        0.96     0.96

Random Forest Evaluation Outcome

A DIAGRAM OF CONFUSION MATRIX FROM RANDOM FOREST CLASSIFIER


A DIAGRAM OF AUROC FOR RANDOM FOREST CLASSIFIER
K Nearest Neighbors Evaluation Outcome

A DIAGRAM OF THE CONFUSION MATRIX FROM KNN CLASSIFIER

A DIAGRAM OF AREA UNDER ROC FROM KNN CLASSIFIER


Support Vector Machine Evaluation Outcome

A GRAPH OF AREA UNDER ROC FOR SVM CLASSIFIER


CONFUSION MATRIX FROM SVM CLASSIFIER
Discussion
The performance comparison of the three models—Random Forest, SVM, and KNN—revealed several key
insights:
1. Accuracy: Both Random Forest and SVM achieved high accuracy scores of 0.97, slightly
outperforming KNN, which had an accuracy of 0.96. This indicates that Random Forest and SVM
are better at correctly classifying apps as benign or malware.
2. Precision and Recall: Random Forest showed the highest recall (0.98), making it particularly
effective at identifying actual malware apps. While SVM and KNN both had similar precision and
recall values, Random Forest's higher recall suggests it is less likely to miss malware instances.
3. F1 Score: All models demonstrated strong F1 scores, with Random Forest and KNN both achieving
0.97. SVM had a slightly lower F1 score of 0.96. The F1 score balances precision and recall, and
the high scores across all models indicate a well-rounded performance.

Conclusion
• Random Forest: The ensemble nature of Random Forest, which combines multiple decision trees,
allows it to capture complex, non-linear relationships between permissions and app behavior,
leading to its superior performance in recall and AUROC. Its robustness and high recall make it the
best choice for scenarios where the cost of missing malware is high.
• SVM: The SVM model's performance was commendable, especially considering its efficiency and
interpretability. Its accuracy and precision were on par with Random Forest, making it a viable
option for large datasets where speed and simplicity are essential.
• KNN: While KNN performed slightly lower in accuracy compared to the other models, its
interpretability and ability to handle non-linear relationships make it a useful model in scenarios
where the decision process needs to be easily understood.
Overall, the Random Forest model is recommended for its superior performance in most metrics,
particularly in identifying malware apps with high recall and AUROC. However, both SVM and KNN
have their merits and could be preferred depending on specific use-case requirements such as
interpretability, computational efficiency, and the nature of the dataset. Future research could further
explore hyperparameter tuning and the integration of additional features to enhance model performance.
