
6.android Malware Detection Using Genetic Algorithm Based Optimized Feature Selection and Machine Learning

This document proposes using genetic algorithms for feature selection to improve machine learning-based Android malware detection. It extracts static features from Android apps and uses a genetic algorithm to select an optimized subset of features. This reduced feature set is then used to train support vector machine and neural network classifiers. Experiment results show the genetic algorithm can reduce the feature dimension to less than half while maintaining over 94% accuracy for malware classification. The approach aims to improve detection performance by selecting discriminative features and reducing computational complexity compared to exhaustive feature selection methods.


Android Malware Detection Using Genetic Algorithm based Optimized Feature Selection and Machine Learning

Anam Fatima*, Ritesh Maurya*, Malay Kishore Dutta*, Radim Burget† and Jan Masek†
* Computer Science and Engineering, Centre for Advanced Studies, Dr. A.P.J. Abdul Kalam Technical University, Lucknow, India
† Department of Telecommunications, Brno University of Technology, Brno, Czech Republic
Email: [email protected], [email protected], [email protected], [email protected],
[email protected]

Abstract: Owing to its open-source nature and Google's backing, the Android platform has the largest global market share. Being the world's most popular operating system, it has drawn the attention of cyber criminals, who operate particularly through the wide distribution of malicious applications. This paper proposes an effective machine-learning based approach for Android malware detection that uses an evolutionary Genetic Algorithm for discriminatory feature selection. The features selected by the Genetic Algorithm are used to train machine learning classifiers, and their capability to identify malware before and after feature selection is compared. The experimental results show that the Genetic Algorithm yields an optimized feature subset that reduces the feature dimension to less than half of the original feature-set. A classification accuracy of more than 94% is maintained after feature selection for the machine learning based classifiers while working on a much reduced feature dimension, which has a positive impact on the computational complexity of the learning classifiers.

Keywords: Android malware analysis; feature selection; Genetic algorithm; machine learning; reverse-engineering

I. INTRODUCTION

Android Apps are freely available for users to download on Google Playstore, the official Android app store, as well as on third-party app stores. Due to the open-source nature and popularity of Android, malware writers are increasingly focusing on developing malicious applications for this operating system. In spite of various attempts by Google Playstore to protect against malicious apps, they still find their way to the mass market and harm users by misusing personal information related to their phone book, mail accounts, GPS location and other data for the benefit of third parties, or by taking control of the phones remotely. Therefore, there is a need to perform malware analysis or reverse-engineering of such malicious applications, which pose a serious threat to Android platforms.

Broadly speaking, Android malware analysis is of two types: static analysis and dynamic analysis. Static analysis involves analyzing the code structure without executing it, while dynamic analysis examines the runtime behavior of Android Apps in a constrained environment. Given the ever-increasing variants of Android malware posing zero-day threats, an efficient mechanism for detecting Android malware is required. In contrast to the signature-based approach, which requires regular updates of the signature database, a machine-learning based approach in combination with static and dynamic analysis can be used to detect new variants of Android malware posing zero-day threats. In [1], broad yet lightweight static analysis has been performed, achieving a detection accuracy of 94% using the Support Vector Machine algorithm. Nikola Milosevic et al. [2] presented static analysis based classification through two methodologies: one was permissions based, while the other represented the source code as a bag of words. Another approach, based on identifying the most significant permissions and applying machine learning to them for evaluation, has been proposed in [3]. MADAM [4] provides a multi-level analysis framework in which the behavior of Android Apps is captured at up to four levels, from package, user and application to kernel level, achieving detection accuracy as high as 96%, with the disadvantage that it can run only on rooted devices, which makes it unsuitable for commercial use. SAMADroid [5] proposed a novel three-way, host-server based methodology for improved performance, as far as resource usage is concerned, for malware detection on mobile devices. The current trend in malware detection has shifted towards deep learning, which requires the least human intervention, as proposed in [6]–[8].

An important step in all machine learning based approaches is feature selection. Obtaining an optimal feature set not only helps to improve experimental results but also helps to reduce the curse of dimensionality associated with most machine learning based algorithms. Fest [9] proposed a novel and efficient algorithm for feature selection to improve overall detection accuracy. In [10], a review of various feature selection algorithms for malware detection has been presented, providing guidelines for selection.

In the proposed work, the Genetic Algorithm has been used because of its capability to find a feature subset of the original feature vector that gives the best accuracy for the classifiers trained on it. It has also been used previously, in combination with machine learning and deep learning algorithms, to obtain the most optimal feature subset, as in [7], [11].

The main contribution of the work is the reduction of the feature dimension to less than half of the original feature-set using the Genetic Algorithm, such that it can be fed as input to machine learning classifiers for training with reduced complexity while maintaining their accuracy in malware classification. In contrast to the exhaustive method of feature selection, which requires testing $2^N$ different combinations, where $N$ is the number of features (for the 99 features used in this work, $2^{99} \approx 6.3 \times 10^{29}$ subsets), the Genetic Algorithm, a heuristic search approach based on a fitness function, has been used for feature selection. The optimized feature set obtained using the Genetic Algorithm is used to train two machine learning algorithms: Support Vector Machine and Neural Network. It is observed that a classification accuracy of more than 94% is maintained while working on a much lower feature dimension, thereby reducing the training time complexity of the classifiers.

The remainder of the paper is structured as follows: Section II discusses the proposed methodology. Section III presents the experimental results obtained by applying the Genetic Algorithm for feature selection to train machine learning algorithms. Section IV summarizes the general conclusions drawn from the experiments.

Acknowledgement: The authors would like to thank the Ministry of Industry of the Czech Republic (grant FV20044) and the National Sustainability Program (grant LO1401) for funding the work, and the C3i Center (Interdisciplinary Center for Cyber Security and Cyber Defense of Critical Infrastructures), IIT Kanpur, India for sharing their Android Applications dataset. For the research, infrastructure of the SIX Center was used.

978-1-7281-1864-2/19/$31.00 ©2019 IEEE, TSP 2019

II. PROPOSED METHODOLOGY

Two sets of Android Apps (APKs), Malware and Goodware, are reverse engineered to extract features such as permissions and counts of App Components such as Activity, Services, Content Providers, etc. These features are used as the feature vector, with the class labels Malware and Goodware represented by 0 and 1 respectively, in CSV format.

To reduce the dimensionality of the feature-set, the CSV is fed to the Genetic Algorithm to select the most optimized set of features. The optimized set of features obtained is used for training two machine learning classifiers: Support Vector Machine and Neural Network.

Figure 1 outlines the proposed methodology, which involves two units: feature extraction using the Androguard tool and feature selection using the Genetic Algorithm. Finally, the selected features are fed as input to machine learning algorithms for evaluation.

A. Reverse-Engineering of Android APKs
In the proposed methodology, static features are obtained from AndroidManifest.xml, which contains all the important information about an App that the Android platform needs. The Androguard tool has been used to disassemble the APKs and obtain the static features.

B. Feature Vector
Features are extracted and mapped to a feature vector as follows:
• App Components: The counts of App components such as Activity, Services, Content Providers and Broadcast Receivers are used as a feature vector.
• Permissions: The permission feature-set S is mapped to a |S|-dimensional vector space such that a dimension is set to 1 if the app x contains the feature and 0 otherwise. In this way, a vector ψ(x) is constructed for app x in which the dimension of each feature extracted from x is set to 1 and all other dimensions are set to 0 [5]. It can be summarized in equation (1):

$$\psi : X \rightarrow \{0, 1\}^{|S|} \qquad (1)$$
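The binary permission mapping of equation (1) can be illustrated with a minimal Python sketch. The four-entry `FEATURE_SPACE` list and the helper `to_feature_vector` are hypothetical names chosen for illustration; the paper does not list the permissions that make up S, and the actual extraction is done with Androguard rather than with hand-written lists.

```python
# Hypothetical feature space S: a small, fixed list of permission names.
# The real feature-set used in the paper is not listed, so this is an assumption.
FEATURE_SPACE = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
    "android.permission.ACCESS_FINE_LOCATION",
]


def to_feature_vector(app_permissions, feature_space=FEATURE_SPACE):
    """Return psi(x): 1 where the app requests the permission, 0 otherwise."""
    requested = set(app_permissions)
    return [1 if permission in requested else 0 for permission in feature_space]


# Permissions that would be read from one app's AndroidManifest.xml.
example_app = ["android.permission.INTERNET", "android.permission.SEND_SMS"]
vector = to_feature_vector(example_app)
print(vector)  # [1, 0, 1, 0]
```

Each row of the CSV would then hold such a vector, the component counts and the class label.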
C. Discriminatory Feature Selection
In malware detection, selecting the most significant features is an important step, as it has a significant impact on the quality of experimental results. Also, working on a low-dimensional feature vector consisting of only discriminatory features helps to reduce the computational complexity of the learning classifier.
The CSV consisting of all features is fed into the Genetic Algorithm, which gives the best subset of features for the machine learning based classifier.
The selected features are represented in binary form as chromosomes: a feature is represented by 1 in the chromosome if it is included and by 0 if it is excluded. The Genetic Algorithm maintains a set of chromosomes (candidate feature subsets), called the population, along with their fitness scores, such that chromosomes with better fitness scores are given more chance to reproduce. The fitness function is defined such that a chromosome that gives high accuracy on the machine learning based classifier is assigned a larger value than chromosomes that give lower accuracy. The chromosomes with the best fitness scores are selected as parents to produce the next generation of offspring through crossover and mutation [11].
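The encoding, fitness-based parent selection, crossover and mutation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not specify the selection scheme, crossover type, probabilities or stopping rule, so tournament selection, single-point crossover, the default parameter values and a fixed number of generations are all assumptions. The function `toy_fitness` and the index set `INFORMATIVE` are hypothetical stand-ins for the classifier accuracy that the paper uses as fitness.

```python
import random


def run_ga(n_features, fitness, pop_size=20, generations=30,
           crossover_prob=0.8, mutation_prob=0.02, seed=0):
    """Return the fittest binary chromosome found (1 = feature kept)."""
    rng = random.Random(seed)
    # Random initial population of binary-encoded feature subsets.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        scores = [fitness(chromosome) for chromosome in population]

        def pick_parent():
            # Tournament of two (assumed scheme): the fitter chromosome reproduces.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            if rng.random() < crossover_prob:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            # Mutation: flip each bit with a small probability.
            child = [1 - gene if rng.random() < mutation_prob else gene
                     for gene in child]
            offspring.append(child)

        population = offspring
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best


# Hypothetical stand-in for classifier accuracy: reward three "informative"
# features and penalise every selected feature slightly.
INFORMATIVE = (0, 3, 5)


def toy_fitness(chromosome):
    hits = sum(chromosome[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(chromosome)


best = run_ga(n_features=10, fitness=toy_fitness)
print(best, toy_fitness(best))
```

In the setting of this paper, `fitness` would instead train the chosen classifier on the columns marked 1 in the chromosome and return its accuracy, which is far more expensive per evaluation than the toy function used here.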
Fig. 1. Proposed Methodology

The steps involved in feature selection using the Genetic Algorithm can be summarized as below:
Step 1: Initialize the algorithm using feature subsets that are binary encoded, such that a feature is represented by 1 in the chromosome if it is included and by 0 if it is excluded.
Step 2: Start the algorithm by defining an initial population generated randomly.
Step 3: Assign each chromosome a fitness score calculated by the fitness function defined for the Genetic Algorithm.
Step 4: Selection of parents: chromosomes with good fitness scores are given preference over others to produce the next generation of offspring.
Step 5: Perform crossover and mutation operations on the selected parents, with the given probabilities of crossover and mutation, to generate offspring.
Steps 3 to 5 are repeated iteratively until convergence is reached, and the fittest chromosome in the population, that is, the optimal feature subset, is returned.

Fig. 2. Feature Selection using Genetic Algorithm

Figure 2 gives a diagrammatic overview of the feature selection performed using the Genetic Algorithm.

D. Machine Learning Based Classification
Given the ever-increasing variants of Android Malware posing zero-day threats, machine learning based techniques are being preferred over the traditional signature-based approach, which requires regular updates of the signature database. The features selected using the Genetic Algorithm are used to train and test classifiers with the following algorithms: Support Vector Machine (SVM) and Neural Network (NN).

III. EXPERIMENTAL RESULTS
The proposed work has been performed on a dataset of around 40,000 APKs consisting of two categories: 20,000 Malware or malicious applications and 20,000 Goodware or benign applications. The APKs are reverse-engineered to extract features. A CSV is generated consisting of 99 features with class labels Malware (represented by 0) and Goodware (represented by 1). The primary purpose of the work is the selection of an optimized feature subset, for which the Genetic Algorithm has been used. The discriminatory features selected by the Genetic Algorithm are fed as input to train Support Vector Machine and Neural Network classifiers. The parameters for the Support Vector Machine are set as follows: Radial Basis Function (RBF) as kernel function and 10 folds for cross-validation. The feed-forward Neural Network uses one hidden layer of size 40.
The algorithms are tested on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz 2.19GHz with 64GB RAM and a 64-bit operating system. The performance of the two classifiers in distinguishing between Malware and Goodware is compared before and after feature selection.

TABLE I. FEATURES SELECTED BY GENETIC ALGORITHM FOR DIFFERENT CLASSIFIERS AND ACCURACY OBTAINED WITH SELECTED FEATURES

Algorithm                 No. of features before    AUC       Features    AUC
                          feature selection                   selected
Support Vector Machine    99                        .9891     33          .9803
Neural Network            99                        .9876     40          .9828

Table I shows the number of features selected by the Genetic Algorithm for each classifier and the AUC obtained by the classifier with the selected subset of features. It can be seen from Table I that the AUC of both classifiers drops by less than 0.01 (.9891 to .9803 for Support Vector Machine, .9876 to .9828 for Neural Network) despite a significant reduction in the number of features.
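The classifier settings reported in this paper (an RBF-kernel SVM, a feed-forward network with one hidden layer of 40 units, 10-fold cross-validation, AUC as the reported score) can be sketched with scikit-learn as follows. This is an illustrative sketch only: the paper does not name the library it used, the synthetic data from `make_classification` stands in for the 99-feature CSV, and the boolean mask `selected` is a hypothetical placeholder for the chromosome returned by the Genetic Algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 99-feature CSV (assumption: the real data is not public here).
X, y = make_classification(n_samples=300, n_features=99, n_informative=20,
                           random_state=0)

# Hypothetical feature mask keeping every third column (33 of 99 features);
# in the paper this mask would be the fittest chromosome from the Genetic Algorithm.
selected = np.arange(X.shape[1]) % 3 == 0

classifiers = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "NN (1 hidden layer, 40 units)": MLPClassifier(hidden_layer_sizes=(40,),
                                                   max_iter=500, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    for label, data in (("all features", X), ("selected features", X[:, selected])):
        # Mean ROC AUC over 10 stratified folds.
        auc = cross_val_score(clf, data, y, cv=10, scoring="roc_auc").mean()
        results[(name, label)] = auc
        print(f"{name:32s} {label:18s} mean AUC = {auc:.4f}")
```

The scores printed for the synthetic data are not comparable to the values in this paper; the sketch only shows how a before/after comparison on a feature mask can be set up.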

Fig. 3. Comparison of ROC curves for (a) Support Vector Machine and (b) Neural Network classifier before and after feature selection

Fig. 3 shows the ROC curves for the classifiers before and after feature selection. ROC curves for the Support Vector Machine and Neural Network classifiers are shown in Fig. 3(a) and Fig. 3(b) respectively. It can be deduced from the ROC curves that the classifiers perform well with the selected features.

TABLE II. PERFORMANCE METRICS OF SUPPORT VECTOR MACHINE CLASSIFIER

Performance Metrics                With 99 features    With 33 features
                                                       (post feature selection)
Sensitivity (%)                    95.5                94.6
Specificity (%)                    97.6                95.4
Accuracy (%)                       96.6                95.0
Training Time Complexity (secs)    22.92               10.20

TABLE III. PERFORMANCE METRICS OF NEURAL NETWORK CLASSIFIER

Performance Metrics                With 99 features    With 40 features
                                                       (post feature selection)
Sensitivity (%)                    95.6                94.3
Specificity (%)                    94.9                93.9
Accuracy (%)                       95.2                94.1
Training Time Complexity (secs)    8.57                3.76

Tables II and III show performance metrics before and after feature selection for the Support Vector Machine and Neural Network classifiers respectively. As can be observed from the ROC curves and performance metrics, both classifiers perform well when used in conjunction with the Genetic Algorithm for feature selection: accuracy drops by less than two percentage points (96.6% to 95.0% for Support Vector Machine, 95.2% to 94.1% for Neural Network) while working in a much reduced feature vector space (less than half of the original feature-set), and the classifier training time is more than halved (22.92 s to 10.20 s and 8.57 s to 3.76 s respectively).

IV. CONCLUSION
As the number of threats posed to Android platforms increases day by day, spreading mainly through malicious applications, it is very important to design a framework that can detect such malware accurately. Where the signature-based approach fails to detect new variants of malware posing zero-day threats, machine learning based approaches are being used. The proposed methodology uses an evolutionary Genetic Algorithm to obtain an optimized feature subset that can be used to train machine learning algorithms efficiently. The experiments show that a classification accuracy of more than 94% is maintained using Support Vector Machine and Neural Network classifiers while working on a lower-dimensional feature-set, thereby reducing the training complexity of the classifiers. Future work can use larger datasets for improved results and analyze the effect on other machine learning algorithms when used in conjunction with the Genetic Algorithm.

REFERENCES
[1] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, “Drebin: Effective and Explainable Detection of Android Malware in Your Pocket,” in Proceedings 2014 Network and Distributed System Security Symposium, 2014.
[2] N. Milosevic, A. Dehghantanha, and K. K. R. Choo, “Machine learning aided Android malware classification,” Comput. Electr. Eng., vol. 61, pp. 266–274, 2017.
[3] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-An, and H. Ye, “Significant Permission Identification for Machine-Learning-Based Android Malware Detection,” IEEE Trans. Ind. Informatics, vol. 14, no. 7, pp. 3216–3225, 2018.
[4] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, “MADAM: Effective and Efficient Behavior-based Android Malware Detection and Prevention,” IEEE Trans. Dependable Secur. Comput., vol. 15, no. 1, pp. 83–97, 2018.
[5] S. Arshad, M. A. Shah, A. Wahid, A. Mehmood, H. Song, and H. Yu, “SAMADroid: A Novel 3-Level Hybrid Malware Detection Model for Android Operating System,” IEEE Access, vol. 6, pp. 4321–4339, 2018.
[6] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A Multimodal Deep Learning Method for Android Malware Detection using Various Features,” vol. 6013, no. c, 2018.
[7] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, “Evolving Deep Neural Networks architectures for Android malware classification,” 2017 IEEE Congr. Evol. Comput. CEC 2017 - Proc., pp. 1659–1666, 2017.
[8] X. Su, D. Zhang, W. Li, and K. Zhao, “A Deep Learning Approach to Android Malware Feature Learning and Detection,” 2016 IEEE Trust., pp. 244–251, 2016.
[9] K. Zhao, D. Zhang, X. Su, and W. Li, “Fest: A Feature Extraction and Selection Tool for Android Malware Detection,” 2015 IEEE Symp. Comput. Commun., pp. 714–720, 4893.
[10] A. Feizollah, N. B. Anuar, R. Salleh, and A. W. A. Wahab, “A review on feature selection in mobile malware detection,” Digit. Investig., vol. 13, pp. 22–37, 2015.
[11] A. Firdaus, N. B. Anuar, A. Karim, M. Faizal, and A. Razak, “Discovering optimal features using static analysis and a genetic search based method for Android malware detection,” vol. 19, no. 6, pp. 712–736, 2018.
[12] A. V. Phan, M. Le Nguyen, and L. T. Bui, “Feature weighting and SVM parameters optimization based on genetic algorithms for classification problems,” Appl. Intell., vol. 46, no. 2, pp. 455–469, 2017.
