Android Malware Detection Using Genetic Algorithm Based Optimized Feature Selection and Machine Learning
Abstract—Owing to its open-source nature and Google's backing, the Android platform has the largest global market share. Being the world's most popular operating system, it has drawn the attention of cyber criminals, who operate particularly through wide distribution of malicious applications. This paper proposes an effective machine-learning based approach for Android malware detection that uses an evolutionary Genetic Algorithm for discriminatory feature selection. The features selected by the Genetic Algorithm are used to train machine learning classifiers, and their capability to identify malware before and after feature selection is compared. The experimental results validate that the Genetic Algorithm gives an optimized feature subset, reducing the feature dimension to less than half of the original feature set. A classification accuracy of more than 94% is maintained after feature selection for the machine learning based classifiers while they work on a much reduced feature dimension, which has a positive impact on the computational complexity of the learning classifiers.

Keywords—Android malware analysis; feature selection; Genetic algorithm; machine learning; reverse-engineering

I. INTRODUCTION

Android apps are freely available for users to download on Google Playstore, the official Android app store, as well as on third-party app stores. Due to its open-source nature and popularity, malware writers are increasingly focusing on developing malicious applications for the Android operating system. In spite of various attempts by Google Playstore to protect against malicious apps, they still find their way to the mass market and harm users by misusing personal information related to their phone book, mail accounts, GPS location and other data for the benefit of third parties, or else by taking control of the phones remotely. Therefore, there is a need to perform malware analysis or reverse-engineering of such malicious applications, which pose a serious threat to Android platforms.

Broadly speaking, Android malware analysis is of two types: static analysis and dynamic analysis. Static analysis involves analyzing the code structure without executing it, while dynamic analysis is the examination of the runtime behavior of Android apps in a constrained environment. Given the ever-increasing variants of Android malware posing zero-day threats, an efficient mechanism for the detection of Android malware is required. In contrast to the signature-based approach, which requires regular updates of the signature database, a machine-learning based approach in combination with static and dynamic analysis can be used to detect new variants of Android malware posing zero-day threats. In [1], broad yet lightweight static analysis has been performed, achieving a detection accuracy of 94% using the Support Vector Machine algorithm. Milosevic et al. [2] presented static analysis based classification through two methodologies: one was permissions based while the other represented the source code as a bag of words. Another approach, based on identifying the most significant permissions and applying machine learning to them for evaluation, has been proposed in [3]. MADAM [4] provides a multi-level analysis framework where the behavior of Android apps is captured at up to four levels (package, user, application and kernel), achieving a detection accuracy as high as 96%, with the disadvantage that it can run only on rooted devices, making it unsuitable for commercial use. SAMADroid [5] proposed a novel three-way, host-server based methodology for improved performance in terms of resource usage for malware detection on mobile devices. The current trend in malware detection has shifted towards deep learning applications, which require the least human intervention, as proposed in [6]–[8].

An important step in all machine learning based approaches is feature selection. Obtaining an optimal feature set not only helps in improving experimental results but also helps in reducing the curse of dimensionality associated with most machine learning based algorithms. Fest [9] proposed a novel and efficient algorithm for feature selection to improve overall detection accuracy. In [10], a review of various feature selection algorithms for malware detection has been presented, providing guidelines for selection.

In the proposed work, the Genetic Algorithm has been used because of its capability to find a feature subset, selected from the original feature vector, such that it gives the best
Acknowledgement: The authors would like to thank the Ministry of Industry of the Czech Republic (grant FV20044) and the National Sustainability Program (grant LO1401) for funding the work, and the C3i Center (Interdisciplinary Center for Cyber Security and Cyber Defense of Critical Infrastructures), IIT Kanpur, India, for sharing their Android applications dataset. The infrastructure of the SIX Center was used for the research.
The steps involved in feature selection using the Genetic Algorithm can be summarized as follows:
Step 1: Initialize the algorithm using feature subsets that are binary encoded: in the chromosome, an included feature is represented by 1 and an excluded feature by 0.
Step 2: Start the algorithm by defining an initial population generated randomly.
Step 3: Assign each chromosome a fitness score calculated by the fitness function defined for the genetic algorithm.
Step 4: Selection of parents: chromosomes with good fitness scores are given preference over others to produce the next generation of offspring.
Step 5: Perform crossover and mutation operations on the selected parents, with the given probabilities of crossover and mutation, to generate offspring.
Repeat Steps 3 to 5 iteratively until convergence is reached; the fittest chromosome in the population, that is, the optimal feature subset, is the result.

III. EXPERIMENTAL RESULTS

The proposed work has been performed on a dataset of around 40,000 APKs in two categories: 20,000 malware (malicious) applications and 20,000 goodware (benign) applications. The APKs are reverse-engineered to extract features. A CSV file is generated consisting of 99 features with class labels Malware (represented by 0) and Goodware (represented by 1). The primary purpose of the work is the selection of an optimized feature subset, for which the Genetic Algorithm has been used. The discriminatory features selected by the Genetic Algorithm are fed as input to train Support Vector Machine and Neural Network classifiers. The parameters for the Support Vector Machine are set as follows: Radial Basis Function (RBF) as the kernel function and 10 folds for cross-validation. The feed-forward Neural Network uses one hidden layer of size 40.

The algorithms are tested on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz 2.19GHz with 64GB RAM and a 64-bit operating system. The performance of these two classifiers in distinguishing between Malware and Goodware is compared before and after feature selection.
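The genetic feature-selection loop (Steps 1 to 5) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the fitness function (accuracy of a simple majority vote over the selected features, minus a small penalty per selected feature), tournament selection, single-point crossover, the fixed number of generations used in place of a convergence test, the toy data generator and all parameter values are assumptions chosen only to keep the example self-contained.

```python
import random

def make_toy_data(n_samples=200, n_features=12, n_informative=4, seed=0):
    """Hypothetical stand-in for the 99-feature CSV: binary features, labels 0/1."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        # The first n_informative features agree with the label 90% of the time;
        # the remaining features are pure noise.
        row = [(label if rng.random() < 0.9 else 1 - label) if j < n_informative
               else rng.randint(0, 1) for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y

def fitness(chromosome, X, y, penalty=0.01):
    """Assumed fitness: majority-vote accuracy over the selected features,
    minus a small penalty per selected feature."""
    selected = [j for j, bit in enumerate(chromosome) if bit]
    if not selected:
        return 0.0
    correct = 0
    for row, label in zip(X, y):
        votes = sum(row[j] for j in selected)
        predicted = 1 if 2 * votes >= len(selected) else 0
        correct += predicted == label
    return correct / len(y) - penalty * len(selected)

def tournament(population, scores, rng, k=3):
    """Step 4 (assumed scheme): pick the fittest of k randomly drawn chromosomes."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]

def crossover(a, b, rng, p_cross):
    """Step 5: single-point crossover applied with probability p_cross."""
    if rng.random() < p_cross:
        point = rng.randint(1, len(a) - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(chromosome, rng, p_mut):
    """Step 5: flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in chromosome]

def select_features(X, y, pop_size=30, generations=40, p_cross=0.8, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    n_features = len(X[0])
    # Steps 1-2: random initial population of binary-encoded feature subsets.
    population = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):  # fixed budget stands in for a convergence test
        scores = [fitness(c, X, y) for c in population]  # Step 3
        for chromosome, score in zip(population, scores):
            if score > best_score:
                best, best_score = chromosome[:], score
        next_generation = []
        while len(next_generation) < pop_size:
            parent_a = tournament(population, scores, rng)
            parent_b = tournament(population, scores, rng)
            child_a, child_b = crossover(parent_a, parent_b, rng, p_cross)
            next_generation += [mutate(child_a, rng, p_mut), mutate(child_b, rng, p_mut)]
        population = next_generation[:pop_size]
    return best, best_score

if __name__ == "__main__":
    X, y = make_toy_data()
    mask, score = select_features(X, y)
    print("selected feature indices:", [j for j, bit in enumerate(mask) if bit])
    print("fitness of best subset:", round(score, 3))
```

In a setting closer to the one described here, the fitness function would more plausibly be the cross-validated accuracy of the classifier being trained on the selected columns; the majority-vote rule is used only so that the sketch runs without external libraries.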
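The comparison of the classifiers before and after feature selection is reported in terms of sensitivity, specificity and accuracy. The following minimal sketch shows how these percentages can be computed from true and predicted labels. The text does not state which class is treated as positive; the sketch assumes Malware (label 0) is the positive class, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count true positives, false negatives, true negatives and false positives.

    `positive` is the label treated as the positive class
    (assumption: 0 = Malware, following the CSV labelling)."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def _percent(numerator, denominator):
    """Percentage, or NaN when the denominator is zero (class absent)."""
    return 100.0 * numerator / denominator if denominator else float("nan")

def performance_metrics(y_true, y_pred, positive=0):
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return {
        "sensitivity": _percent(tp, tp + fn),   # share of positives detected
        "specificity": _percent(tn, tn + fp),   # share of negatives kept clean
        "accuracy": _percent(tp + tn, tp + fn + tn + fp),
    }

if __name__ == "__main__":
    # Tiny made-up example: 4 malware samples (0) and 6 goodware samples (1).
    y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
    print(performance_metrics(y_true, y_pred))
```

Swapping `positive` to 1 exchanges the roles of sensitivity and specificity, so the choice of positive class should be stated alongside any reported figures.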
Fig. 3. Comparison of ROC curves for (a) Support Vector Machine and (b) Neural Network classifiers before and after feature selection

Fig. 3 shows the ROC curves for the classifiers before and after feature selection. ROC curves for the Support Vector Machine and Neural Network classifiers are shown in Fig. 3(a) and Fig. 3(b) respectively. It can be deduced from the ROC curves that the classifiers perform well with the selected features.

TABLE II. PERFORMANCE METRICS OF SUPPORT VECTOR MACHINE CLASSIFIER
Performance Metrics                 With 99 features    With 33 features (post feature selection)
Sensitivity (%)                     95.5                94.6
Specificity (%)                     97.6                95.4
Accuracy (%)                        96.6                95.0
Training Time Complexity (secs)     22.92               10.20

TABLE III. PERFORMANCE METRICS OF NEURAL NETWORK CLASSIFIER
Performance Metrics                 With 99 features    With 40 features (post feature selection)
Sensitivity (%)                     95.6                94.3
Specificity (%)                     94.9                93.9
Accuracy (%)                        95.2                94.1
Training Time Complexity (secs)     8.57                3.76

Tables II and III show performance metrics before and after feature selection for the Support Vector Machine and Neural Network classifiers respectively. As can be observed from the ROC curves and performance metrics, both the Support Vector Machine and the Neural Network, when used in conjunction with the Genetic Algorithm for feature selection, perform well without compromising much in accuracy (96.6% to 95.0% for the Support Vector Machine, 95.2% to 94.1% for the Neural Network) while working in a much reduced feature vector space (less than half of the original feature set). Training time falls accordingly, from 22.92 s to 10.20 s and from 8.57 s to 3.76 s respectively.

IV. CONCLUSION

As the number of threats posed to Android platforms is increasing day by day, spreading mainly through malicious applications, it is very important to design a framework which can detect such malware with accurate results. Where the signature-based approach fails to detect new variants of malware posing zero-day threats, machine learning based approaches are being used. The proposed methodology makes use of an evolutionary Genetic Algorithm to obtain an optimized feature subset which can be used to train machine learning algorithms efficiently. From the experiments, it can be seen that a classification accuracy of more than 94% is maintained using Support Vector Machine and Neural Network classifiers while working on a lower-dimensional feature set, thereby reducing the training complexity of the classifiers. Future work can extend this using larger datasets for improved results and analyze the effect on other machine learning algorithms when used in conjunction with the Genetic Algorithm.

REFERENCES
[1] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, “Drebin: Effective and Explainable Detection of Android Malware in Your Pocket,” in Proceedings 2014 Network and Distributed System Security Symposium, 2014.
[2] N. Milosevic, A. Dehghantanha, and K. K. R. Choo, “Machine learning aided Android malware classification,” Comput. Electr. Eng., vol. 61, pp. 266–274, 2017.
[3] J. Li, L. Sun, Q. Yan, Z. Li, W. Srisa-An, and H. Ye, “Significant Permission Identification for Machine-Learning-Based Android Malware Detection,” IEEE Trans. Ind. Informatics, vol. 14, no. 7, pp. 3216–3225, 2018.
[4] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, “MADAM: Effective and Efficient Behavior-based Android Malware Detection and Prevention,” IEEE Trans. Dependable Secur. Comput., vol. 15, no. 1, pp. 83–97, 2018.
[5] S. Arshad, M. A. Shah, A. Wahid, A. Mehmood, H. Song, and H. Yu, “SAMADroid: A Novel 3-Level Hybrid Malware Detection Model for Android Operating System,” IEEE Access, vol. 6, pp. 4321–4339, 2018.
[6] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A Multimodal Deep Learning Method for Android Malware Detection using Various Features,” vol. 6013, no. c, 2018.
[7] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, “Evolving Deep Neural Networks architectures for Android malware classification,” 2017 IEEE Congr. Evol. Comput. CEC 2017 - Proc., pp. 1659–1666, 2017.
[8] X. Su, D. Zhang, W. Li, and K. Zhao, “A Deep Learning Approach to Android Malware Feature Learning and Detection,” 2016 IEEE Trust., pp. 244–251, 2016.
[9] K. Zhao, D. Zhang, X. Su, and W. Li, “Fest: A Feature Extraction and Selection Tool for Android Malware Detection,” 2015 IEEE Symp. Comput. Commun., pp. 714–720, 2015.
[10] A. Feizollah, N. B. Anuar, R. Salleh, and A. W. A. Wahab, “A review on feature selection in mobile malware detection,” Digit. Investig., vol. 13, pp. 22–37, 2015.
[11] A. Firdaus, N. B. Anuar, A. Karim, M. Faizal, and A. Razak, “Discovering optimal features using static analysis and a genetic search based method for Android malware detection,” vol. 19, no. 6, pp. 712–736, 2018.
[12] A. V. Phan, M. Le Nguyen, and L. T. Bui, “Feature weighting and SVM parameters optimization based on genetic algorithms for classification problems,” Appl. Intell., vol. 46, no. 2, pp. 455–469, 2017.