DEPARTMENT OF COMPUTER ENGINEERING

TITLE: Permission-Based Malware Detection in Android Using Machine Learning

GROUP 17
Kuayi Raphael (10970285)
Doe Kelvin (10970187)

SUPERVISOR: Mrs. Gifty Osei


TEACHING ASSISTANT: Mr. Desmond Xeflide
JULY, 2024.
ABSTRACT
This project tackles the pressing issue of malware detection in Android devices by examining
the permission usage patterns of apps. Traditional signature-based detection methods have
proven ineffective against sophisticated threats, necessitating innovative approaches. This study
explores the efficacy of machine learning for permission-based malware detection,
demonstrating its potential in identifying malicious apps based on their permission usage
patterns. A comprehensive dataset of permissions extracted from over 29,000 Android apps
(2010-2019) was utilized, comprising 86 features and a binary target variable. Following an
extensive exploratory data analysis (EDA), various machine learning models (SVM, KNN,
Random Forest) were evaluated and optimized through hyperparameter tuning. The results
show that the Random Forest model outperformed the others, achieving a testing F1 Score of
0.97 and a testing AUC-ROC Score of 0.99. While promising, this study also encountered
challenges and limitations, including class imbalance and feature correlation. The project
concludes by summarizing key findings, contributions, and recommendations for future work
to enhance the accuracy and robustness of permission-based malware detection in Android
devices.

INTRODUCTION
The proliferation of Android smartphones, which hold over 70% of the mobile OS market
share, has made them a prime target for malware attacks. Traditional malware detection
methods, such as signature-based detection, have proven ineffective against sophisticated and
rapidly evolving threats. This project aims to address the challenge of detecting malware in
Android applications by leveraging machine learning techniques to analyse app permission
usage patterns.

The problem of malware detection in Android apps is complex and pressing. As the popularity
of Android devices continues to grow, so does the number of malicious apps seeking to exploit
them. These apps often request excessive or suspicious permissions to gain access to sensitive
data or functionalities, making permission-based analysis a promising approach for detection.
OBJECTIVES
The primary goal of this project was to design and develop a machine learning model that
accurately predicts whether an Android app is benign or malicious based on its permission
requests. To achieve this, the following key objectives were pursued:

• Investigate the viability of machine learning for permission-based malware detection in
Android apps, examining the potential for accurate classification and identification of
malicious patterns.

• Evaluate and compare the performance of various machine learning models in
classifying Android apps as benign or malware based on their permission requests,
determining the most effective approach.

• Identify key permissions that are indicative of malicious behaviour, providing insights
into the specific permission patterns and correlations that are strongly associated with
malware.

• Assess the overall effectiveness of permission-based analysis for malware detection,
highlighting the strengths and limitations of this approach and its potential for
integration into existing security measures.

DATA COLLECTION AND PREPROCESSING

3.1 Data Sources

The dataset used for this project was sourced from multiple repositories containing Android
apps released between 2010 and 2019. It includes permissions extracted from over 29,000 apps,
classified into benign and malware categories.
3.2 Data Description

The dataset consists of 86 features representing various permissions that an app may request.
Each feature is encoded in a binary format (1 for granted, 0 for not granted), so each
feature has exactly two unique values. The dataset has 29,332 rows.

Fig(i)

The target variable, Result, represents the classification of an app as either benign or malware.
The dataset is well-balanced, with nearly equal numbers of instances for each class:

• Class 1 (Malware): 14,700 instances
• Class 0 (Benign): 14,632 instances
fig(ii)

3.3 Data Cleaning

Given the binary nature of the dataset, no scaling was required.

The data was thoroughly checked for outliers and null values to ensure its integrity and
quality. All columns have the int64 datatype. No outliers were present because the data is
binary, and the data had no missing values.
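The checks described in this section could be sketched roughly as follows. This is only an illustration on a small randomly generated table: the column names (`perm_0` ... `perm_4`), the row count, and the helper `integrity_report` are hypothetical stand-ins, not the project's actual data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the permission dataset: random 0/1 flags.
# The real dataset has 86 permission columns and 29,332 rows.
rng = np.random.default_rng(0)
permission_cols = [f"perm_{i}" for i in range(5)]
df = pd.DataFrame(rng.integers(0, 2, size=(200, 5)), columns=permission_cols)
df["Result"] = rng.integers(0, 2, size=200)  # 1 = malware, 0 = benign


def integrity_report(frame: pd.DataFrame) -> dict:
    """Summarise missing values, dtypes, binary encoding and class counts."""
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "all_integer": all(pd.api.types.is_integer_dtype(t) for t in frame.dtypes),
        "all_binary": bool(frame.isin([0, 1]).all().all()),
        "class_counts": frame["Result"].value_counts().to_dict(),
    }


report = integrity_report(df)
print(report)
```

If every value is 0 or 1, classical outlier checks have nothing to flag, which is why the binary check doubles as the outlier check here.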


3.4 Data Transformation and Feature Engineering

Feature engineering involved analysing the pairwise correlation between features using a
correlation heatmap to inform decisions on feature selection and dimensionality reduction.
The heatmap showed that most features were highly correlated, which suggested that
dimensionality reduction would be possible. Techniques such as PCA were considered to reduce
the dimensionality of the dataset while retaining essential information. The PCA result was
visualised as a scatter plot of the data points to help choose appropriate algorithms.
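A minimal sketch of the correlation and PCA steps is given below, assuming scikit-learn's `PCA` and a synthetic binary matrix in place of the real permission data. The way the synthetic data is generated (three latent factors plus noise) is purely an assumption made so that the columns are correlated; it says nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 500 apps x 86 binary flags driven by 3 latent factors,
# so that many columns are correlated with each other.
latent = rng.random((500, 3))
weights = rng.random((3, 86))
scores = latent @ weights + 0.3 * rng.standard_normal((500, 86))
X = (scores > scores.mean(axis=0)).astype(int)  # threshold each column at its mean

# Pairwise feature correlation; this matrix is what a heatmap would display.
corr = np.corrcoef(X, rowvar=False)

# Project onto two principal components for a 2-D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute shows how much of the total variance each component retains, which is the quantity to inspect before deciding how many components to keep.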

3.5 Data Splitting

The dataset was split so that model performance could be evaluated on unseen data. An 80-20
split was used, with 80% of the data used for training and 20% for testing. The training
portion was used to fit the models and to validate hyperparameter choices, and the test set
was reserved for final evaluation.
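An 80-20 split of a dataset of this size could look like the sketch below. The data is random and only matches the reported shape (29,332 rows, 86 features); the use of stratification and the specific `random_state` are assumptions, since the report does not say whether either was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_features = 29332, 86

# Random placeholder data with the same shape as the reported dataset.
X = rng.integers(0, 2, size=(n_rows, n_features), dtype=np.int8)
y = rng.integers(0, 2, size=n_rows)

# 80% training, 20% testing; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```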
METHODOLOGY

4.1 Model Selection

Three machine learning models were selected for this project:

• Support Vector Machine (SVM)
• K-Nearest Neighbours (KNN)
• Random Forest (RF)

Each model was chosen for its unique characteristics and ability to handle binary classification
tasks.

4.2 Algorithm Descriptions

• Support Vector Machine (SVM): SVM is a supervised learning algorithm used for
classification tasks. It finds the optimal hyperplane that separates classes in a
high-dimensional space. For this project, two types of SVM kernels were evaluated:
1. Linear: Assumes a linear relationship between features and class labels.
2. Polynomial: Captures non-linear relationships by introducing polynomial features. (A
3rd-order polynomial was used since it gave the best performance.)

Reason for selection:

1. SVMs are powerful classifiers that can handle high-dimensional data effectively,
which is the case with many app permissions.
2. They are particularly good at finding a clear separation between the two classes
(malware and benign) in the feature space, even if the separation is non-linear.

The 86 features from the dataset were reduced to two principal components to visualise
how the SVM algorithm would perform on the data using a polynomial kernel. Two
principal components were selected because they captured the majority of the variance
in the data and so reflected the overall structure of the 86 features.
• K-Nearest Neighbours (KNN): KNN is an instance-based learning algorithm that
classifies instances based on the majority vote of their nearest neighbours.
The performance of KNN was evaluated with different numbers of neighbours:
1. K=3: Evaluates the model with 3 nearest neighbours.
2. K=5: Evaluates the model with 5 nearest neighbours.
3. K=7: Evaluates the model with 7 nearest neighbours.
4. K=9: Evaluates the model with 9 nearest neighbours.
The distance metric used was Euclidean distance.

Reason for selection:

KNN does not require complex assumptions about the underlying data distribution.

• Random Forest (RF): RF is an ensemble learning method that constructs multiple
decision trees during training and outputs the class that is the majority vote of the
individual trees. The performance was evaluated with different numbers of trees:
1. 100 Trees: Evaluates the model with 100 trees.
2. 200 Trees: Evaluates the model with 200 trees.
3. 300 Trees: Evaluates the model with 300 trees.

Reason for selection:

RF often achieves high accuracy and handles high-dimensional data well,
making it a strong candidate for malware detection.
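To make the KNN voting rule described in this section concrete, here is a minimal pure-Python sketch using Euclidean distance and a majority vote. The toy permission vectors, labels, and function names are hypothetical; the project itself lists Scikit-learn among its libraries, so this is an illustration of the idea rather than the code that was run.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: 4 permission flags per app; label 1 = malware, 0 = benign.
train_X = [
    (1, 1, 1, 0),
    (1, 1, 0, 0),
    (1, 0, 1, 1),
    (0, 0, 0, 1),
    (0, 0, 1, 0),
    (0, 0, 0, 0),
]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, (1, 1, 1, 1), k=3))
print(knn_predict(train_X, train_y, (0, 0, 0, 0), k=3))
```

Note that all the work happens at prediction time, because every query is compared against every stored training point. This is the reason KNN trains almost instantly but predicts slowly on large datasets.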
4.3 Hyperparameter Tuning

Hyperparameter tuning was performed to optimize the performance of each model. This
involved adjusting parameters such as the kernel type for SVM, the number of neighbours for
KNN, and the number of trees for RF.
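One way such a tuning step could be set up is a grid search over the candidate values named in Section 4.2. The sketch below uses scikit-learn's `GridSearchCV` on a small synthetic dataset; the choice of `GridSearchCV`, the 3-fold cross-validation, and the `roc_auc` scoring are assumptions, as the report does not state which search procedure was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small synthetic problem, binarised to mimic 0/1 permission flags.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=0)
X = (X > 0).astype(int)

# Candidate values taken from Section 4.2 of the report.
search_spaces = {
    "SVM": (SVC(), [{"kernel": ["linear"]}, {"kernel": ["poly"], "degree": [3]}]),
    "KNN": (KNeighborsClassifier(metric="euclidean"), {"n_neighbors": [3, 5, 7, 9]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 200, 300]}),
}

best = {}
for name, (estimator, grid) in search_spaces.items():
    # Assumed settings: 3-fold cross-validation scored by AUC-ROC.
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    best[name] = (search.best_params_, round(search.best_score_, 3))
    print(name, best[name])
```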

4.4 Evaluation Metrics

The following metrics were used to evaluate model performance:

• Accuracy: Measures the proportion of correct predictions out of total predictions.
• Precision: Measures the proportion of true positives (correctly predicted instances) out
of all positive predictions made. It evaluates the model's ability to avoid false positives.
• Recall: Measures the proportion of true positives out of all actual positive instances. It
evaluates the model's ability to detect all relevant instances.
• F1 Score: The harmonic mean of precision and recall, particularly useful for
imbalanced datasets.
• AUC-ROC Score: Indicates the model's ability to discriminate between positive and
negative classes.
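The first four metrics follow directly from the counts in a confusion matrix, as the short worked example below shows. The counts (90 true positives, 10 false positives, 30 false negatives, 70 true negatives) are made-up numbers chosen for easy arithmetic, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted malware that really is malware
    recall = tp / (tp + fn)     # share of actual malware that was detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: malware is the positive class.
m = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print(m)
```

With these counts, accuracy is 160/200 = 0.80, precision is 90/100 = 0.90, recall is 90/120 = 0.75, and the F1 Score is 2(0.90)(0.75)/(0.90 + 0.75), roughly 0.818, which sits between precision and recall but closer to the lower of the two.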
IMPLEMENTATION

5.1 Tools and Libraries

The following tools, libraries, and frameworks were used for this project:

• Programming Language: Python
• Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, SciPy

5.2 Model Training

The models were trained using the training set with the optimal hyperparameters identified
during the tuning process. The training process included configuring model parameters, fitting
the models to the training data, and ensuring reproducibility by setting random seeds.

5.3 Model Validation

Model validation was performed using the validation set to fine-tune the models and prevent
overfitting. Cross-validation techniques were used to ensure the models generalize well to
unseen data.
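A cross-validation check of the kind mentioned here might be sketched as follows. The number of folds (5), the use of `StratifiedKFold`, the F1 scoring, and the synthetic data are all assumptions made for illustration; the report does not specify the cross-validation scheme.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data, binarised to 0/1 flags.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=1)
X = (X > 0).astype(int)

# Each fold keeps the malware/benign ratio of the full training set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

# One F1 score per fold; a small spread suggests stable generalisation.
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("mean F1:", scores.mean(), "std:", scores.std())
```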

5.4 Model Testing

The final evaluation of the models was conducted on the test set. Performance metrics,
including accuracy, F1 score, and AUC-ROC score, were computed to assess the effectiveness
of each model.
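The final test-set evaluation, including the training and prediction times reported later, could be measured along the lines of the sketch below. It uses synthetic data and a Random Forest with 100 trees; the timing approach with `time.perf_counter` and the use of `predict_proba` for the AUC-ROC score are assumptions about how such figures might be obtained.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, binarised to 0/1 flags.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=2)
X = (X > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

model = RandomForestClassifier(n_estimators=100, random_state=2)

start = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - start

start = time.perf_counter()
y_pred = model.predict(X_test)
predict_time = time.perf_counter() - start

# AUC-ROC needs a continuous score, so use the predicted malware probability.
y_score = model.predict_proba(X_test)[:, 1]

results = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc_roc": roc_auc_score(y_test, y_score),
    "train_time_s": train_time,
    "predict_time_s": predict_time,
}
print(results)
```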
RESULTS

6.1 Performance Metrics

The performance metrics for each model were as follows:

Support Vector Machine (SVM) - Linear Kernel:


• Training F1 Score: 0.958
• Testing F1 Score: 0.960
• Training AUC-ROC Score: 0.988
• Testing AUC-ROC Score: 0.989
Support Vector Machine (SVM) - Polynomial Kernel:
• Training F1 Score: 0.960
• Testing F1 Score: 0.964
• Training AUC-ROC Score: 0.989
• Testing AUC-ROC Score: 0.991
K-Nearest Neighbours (KNN):

• K=3:
o Training F1 Score: 0.958
o Testing F1 Score: 0.961
o Training AUC-ROC Score: 0.985
o Testing AUC-ROC Score: 0.986
• K=5:
o Training F1 Score: 0.960
o Testing F1 Score: 0.962
o Training AUC-ROC Score: 0.986
o Testing AUC-ROC Score: 0.987
• K=7:
o Training F1 Score: 0.961
o Testing F1 Score: 0.963
o Training AUC-ROC Score: 0.987
o Testing AUC-ROC Score: 0.988
• K=9:
o Training F1 Score: 0.964
o Testing F1 Score: 0.964
o Training AUC-ROC Score: 0.987
o Testing AUC-ROC Score: 0.987
Random Forest (RF):

• 100 Trees:
o Training F1 Score: 0.968
o Testing F1 Score: 0.970
o Training AUC-ROC Score: 0.992
o Testing AUC-ROC Score: 0.993
• 200 Trees:
o Training F1 Score: 0.969
o Testing F1 Score: 0.971
o Training AUC-ROC Score: 0.993
o Testing AUC-ROC Score: 0.993
• 300 Trees:
o Training F1 Score: 0.970
o Testing F1 Score: 0.970
o Training AUC-ROC Score: 0.993
o Testing AUC-ROC Score: 0.993
6.2 Comparison of Models

The results presented in the table below were achieved through careful hyperparameter tuning
and the use of the Area Under the Receiver Operating Characteristic Curve (AUC-ROC,
labelled AUROC in the table) as the primary evaluation metric.

Reason for Using AUROC

1. AUC-ROC is a valuable metric for evaluating classification models because it
considers performance across all thresholds, unlike accuracy, precision, recall and F1
Score, which are computed at a single threshold. In malware detection, this helps
balance the trade-offs between false positives (wrongly flagging benign apps) and false
negatives (failing to detect malware). High AUC-ROC values indicate a model's
consistent ability to distinguish between benign and malicious apps, reducing both
types of errors. This metric supports context-dependent threshold selection, ensuring a
robust detection system that enhances security while maintaining user trust.
2. In tasks like malware detection, the key is distinguishing malicious apps from safe
ones. AUC-ROC directly reflects this by measuring how well the model separates the
positive (malware) and negative (benign) classes.
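The threshold-free nature of this metric can be seen from its pairwise interpretation: the AUC equals the probability that a randomly chosen malware sample receives a higher score than a randomly chosen benign sample, with ties counted as half. The sketch below computes it that way in plain Python; the scores are invented for illustration and the function name is hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # malware scored above benign: correct ordering
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))


# Invented model scores (higher means "more likely malware").
malware_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2]

print(auc_from_scores(malware_scores, benign_scores))
```

Here 8 of the 9 malware-benign pairs are ordered correctly (only the pair 0.4 versus 0.7 is not), giving an AUC of 8/9. No classification threshold appears anywhere in the calculation.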

Model          Accuracy  Precision  Recall  F1 Score  AUROC  Train Time (s)  Predict Time (s)
Random Forest  0.970     0.980      0.980   0.970     0.993  7.10            0.70
KNN            0.960     0.970      0.965   0.964     0.987  0.02            7.48
SVM            0.960     0.960      0.960   0.960     0.989  47.78           2.52

• Random Forest shows the best overall performance in terms of accuracy, precision,
recall, F1 Score, and AUROC. It also has a reasonably quick training time and the
fastest prediction time, making it an excellent choice for both training and inference.
• KNN has a very fast training time but suffers from the longest prediction time, making
it less suitable for real-time predictions despite having good precision and recall.
• SVM performs well across most metrics but has a significantly longer training time,
which could be a drawback.

DISCUSSION

7.1 Interpretation of Results

The results indicate that the Random Forest model outperformed the other models in terms of
F1 Score and AUC-ROC Score. This can be attributed to its ensemble nature, which reduces
overfitting and enhances generalization.
7.2 Model Insights

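• SVM Kernels: The Polynomial kernel provided a slightly better fit for the data than the
Linear kernel (testing F1 Score of 0.964 against 0.960, and testing AUC-ROC Score of
0.991 against 0.989), capturing non-linear relationships more effectively.
• KNN neighbours: Increasing the number of neighbours generally improved the
model's performance by reducing variance, with the testing F1 Score rising from
0.961 at K=3 to 0.964 at K=9.
• RF Trees: Increasing the number of trees from 100 to 300 produced only marginal
changes (testing F1 Score between 0.970 and 0.971, testing AUC-ROC Score of 0.993
throughout), which suggests that the ensemble was already stable at 100 trees.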

STATISTICAL ANALYSIS OF RESULTS

ANOVA (Analysis of Variance)

F1 Scores:
ANOVA Results: F=21.075, p=0.0019. The results indicate significant differences in
F1 Scores among the classification algorithms, with Random Forest showing superior
performance compared to SVM and KNN.

AUC-ROC Scores:
ANOVA Results: F=33.813, p=0.0005. The results demonstrate significant
differences in AUC-ROC Scores, confirming Random Forest's higher discriminatory
power.
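For readers unfamiliar with the F statistic, the sketch below shows how a one-way ANOVA F value is computed from group data: the variance between group means divided by the variance within groups. The three groups are toy numbers chosen for easy arithmetic, not the study's scores, and the function name is hypothetical.

```python
def one_way_anova_f(*groups):
    """Return (F statistic, between-groups df, within-groups df)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy groups, e.g. three scores for each of three hypothetical models.
f_stat, df_b, df_w = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f_stat, df_b, df_w)
```

For these toy groups the between-groups sum of squares is 26 and the within-groups sum of squares is 6, so F = (26/2)/(6/6) = 13 with 2 and 6 degrees of freedom. A large F means the group means differ by more than the scatter inside each group would explain.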

T-Tests
F1 Scores:
• SVM vs. KNN: t=0.696, p=0.525. No significant difference was observed between
SVM and KNN, since p was greater than 0.05 (5%). A possible reason is that both SVM
and KNN rely on distances between data points when classifying.
• SVM vs. RF: t=-4.346, p=0.012. A significant difference was found, with Random
Forest performing significantly better than SVM. (The negative t value indicates that
RF outperforms SVM.)
• KNN vs. RF: t=-13.840, p=0.0002. A highly significant difference was found, with
Random Forest outperforming KNN. (The negative t value indicates that RF
outperforms KNN.)
AUC-ROC Scores:
• SVM vs. KNN: t=3.701, p=0.021. Significant difference observed, with SVM
slightly outperforming KNN.
• SVM vs. RF: t=-4.210, p=0.014. Significant difference found, with Random
Forest showing superior performance.
• KNN vs. RF: t=-9.538, p=0.0007. A highly significant difference was noted, with
Random Forest outperforming KNN.
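Pairwise comparisons of this kind can be run with SciPy's `ttest_ind`, as in the sketch below. The score lists are hypothetical values invented for illustration; they are not the samples behind the statistics reported above. The use of the default equal-variance (Student's) t-test is also an assumption.

```python
from scipy import stats

# Hypothetical F1 scores, three per model. NOT the study's actual samples.
f1_svm = [0.960, 0.962, 0.964]
f1_knn = [0.961, 0.963, 0.964]
f1_rf = [0.970, 0.971, 0.970]

# Independent two-sample t-tests (equal variances assumed by default).
t_svm_rf, p_svm_rf = stats.ttest_ind(f1_svm, f1_rf)
t_svm_knn, p_svm_knn = stats.ttest_ind(f1_svm, f1_knn)

print("SVM vs RF :", t_svm_rf, p_svm_rf)
print("SVM vs KNN:", t_svm_knn, p_svm_knn)
```

The sign of t follows the order of the arguments: a negative value means the first group has the lower mean, which is how the negative t values in this section should be read.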

CONCLUSION
This study demonstrated the effectiveness of using machine learning for permission-based
malware detection in Android apps. The project comprehensively evaluated and compared
three classification algorithms, namely Support Vector Machine (SVM), K-Nearest
Neighbours (KNN), and Random Forest (RF), using various performance metrics and
statistical analysis. Random Forest emerged as the top performer, demonstrating the highest
F1 Score and AUC-ROC Score.

